
Asymptotic properties of maximum likelihood estimators for determinantal point processes

Yaozhong Hu Supported by an NSERC Discovery grant and a centennial fund from University of Alberta. Email: [email protected] Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada. Haiyi Shi Corresponding author. Email: [email protected] Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada.
Abstract

We obtain almost sure strong consistency and a Berry-Esseen type bound for the maximum likelihood estimator L~n\tilde{L}_{n} of the ensemble LL^{\star} of determinantal point processes (DPPs), strengthening and completing previous work initiated in Brunel, Moitra, Rigollet, and Urschel [BMRU17]. Numerical algorithms for estimating DPPs are developed and simulation studies are performed. Lastly, we give an explicit formula and a detailed discussion of the maximum likelihood estimator for block-diagonal kernels with two-by-two blocks and compare it with the frequency method.

1 Introduction

Determinantal point processes (DPPs) arose in random matrix theory [Gin65] and were first introduced to describe the probability distribution of fermionic systems in thermal equilibrium in quantum physics [Mac75]. Since then, DPPs have appeared in various areas of mathematics, including, for example, loop-free Markov chains [BDF10], edges of uniform spanning trees [BP93], and, more recently, image processing [LDG21].

In the seminal work [KT12], Kulesza and Taskar show that DPPs have unique characteristics compared with various other probabilistic models: they capture the global repulsive behavior between items, admit polynomial-time algorithms for statistical inference, and have a clear geometric interpretation. Due to these advantages DPPs have played very important roles in machine learning, especially in subset selection problems such as document summarization, image search, and pose determination [KT12]. These real-world applications necessitate the estimation of the parameters of determinantal point process models. In this context, the maximum likelihood estimator L~n\tilde{L}_{n} of the true parameter LL^{\star} is a natural choice, which in our situation generally leads to a non-convex optimization problem. Along this direction, Kulesza and Taskar split the DPP model into a diversity part and a quality part and learn only the quality part while the diversity part is fixed. They conjecture that learning the likelihood of DPPs is NP-hard, which was proven by [GJWX22] a decade later. Brunel, Moitra, Rigollet, and Urschel [BMRU17] first study the local geometry of the expected maximum likelihood estimation of DPPs, that is, the curvature of the likelihood function around its maximum. They then prove that the maximum likelihood estimator converges to the true values in probability and establish the corresponding central limit theorem. Motivated by these works, the present paper contains three further contributions.

  1. (i)

    For the consistency we improve the convergence in probability of L~n\tilde{L}_{n} to LL^{\star} obtained in [BMRU17] to almost sure convergence. This almost sure convergence property is important in applications since in reality we only observe a single sample path. Theoretically, it is well known that almost sure convergence is a hard problem, and this is also the case here. We get around this difficulty by adapting Wald's consistency theorem to our situation.

  2. (ii)

    The asymptotic normality of the maximum likelihood estimator is also obtained in [BMRU17]. This means that n(L~nL)\sqrt{n}(\tilde{L}_{n}-L^{\star}) is approximately a normal random variable. We study the rate of convergence associated with this normal approximation. Namely, we obtain a Berry-Esseen type bound for the maximum likelihood estimator L~n\tilde{L}_{n}.

  3. (iii)

    Due to the non-identifiability of DPPs observed in [Kul12], the definition of L~n\tilde{L}_{n} involves a matrix D^n\hat{D}_{n}. In the general case this involvement of D^n\hat{D}_{n} seems unavoidable. To get rid of D^n\hat{D}_{n} we consider the special case of block-diagonal ensembles with two-by-two blocks, where we obtain an explicit form of the maximum likelihood estimator and, consequently, no involvement of D^n\hat{D}_{n}. This explicit form makes our computations much more efficient. When we apply numerical methods such as Newton-Raphson or stochastic gradient descent (SGD) to the general maximum likelihood estimator we can hardly obtain satisfactory results.

The paper is organized as follows. In Section 2 we introduce some basic definitions and recall some properties of DPPs. We present a proof of a well-known result about the characterization of determinantal processes (Theorem 2.2 in the next section), whose proof we cannot find explicitly in the literature. In Section 3 we present our main results on the almost sure consistency and the Berry-Esseen type theorem for the maximum likelihood estimator. Newton-Raphson and stochastic gradient descent (SGD) algorithms are suggested to find the maximum likelihood estimator in the general case in Section 4, and the results are reported although they are not satisfactory. In Section 5, we discuss the explicit maximum likelihood estimator for two-by-two blocked ensembles; since the explicit form is available we obtain very satisfactory numerical results. Some concluding remarks are given in Section 6.

2 Preliminary

In this paper, we focus on the finite state point process. Let the ground set 𝒴=[N]:={1,2,,N}\mathcal{Y}=[N]:=\{1,2,\cdots,N\} be a set of NN points. The set of all subsets of 𝒴\mathcal{Y} is denoted by 𝕐\mathbb{Y}. Let (Ω,,)(\Omega,\mathcal{F},\mathbb{P}) be a complete probability space on which the expectation is denoted by 𝔼\mathbb{E}. A (finite) point process on a ground set 𝒴\mathcal{Y} is a random variable 𝐘:Ω𝕐\mathbf{Y}:\Omega\to\mathbb{Y}. We also call the probability measure 𝒫\mathcal{P} for 𝐘\mathbf{Y} the point process over the subsets of 𝒴\mathcal{Y}. Random subsets drawn from the point process 𝒫\mathcal{P} can be any subset between null set and full set 𝒴\mathcal{Y}.

Definition 2.1.

A 𝕐\mathbb{Y}-valued point process 𝐘\mathbf{Y} is called a determinantal point process if there is a symmetric N×NN\times N matrix KK such that for every fixed set A𝒴A\subseteq\mathcal{Y},

(A𝐘)=det(KA),\mathbb{P}(A\subseteq\mathbf{Y})=\det(K_{A})\,, (2.1)

where KAK_{A} denotes the restriction of KK to the subset A, that is, KA:=[Ki,j]i,jAK_{A}\mathrel{\mathop{:}}=[K_{i,j}]_{i,j\in A}.

To the best of our knowledge, the following necessary and sufficient condition for a symmetric kernel KK to define a determinantal point process is often mentioned without a direct proof, so we provide one here.

Theorem 2.2.

A symmetric kernel KK defines a determinantal point process if and only if 0KI0\preceq K\preceq I.

Proof.

Since the event {𝐘}\{\emptyset\subseteq\mathbf{Y}\} is the whole probability space, (𝐘)=(Ω)=1\mathbb{P}(\emptyset\subseteq\mathbf{Y})=\mathbb{P}(\Omega)=1, and we set det(K)=1\det(K_{\emptyset})=1. Because \mathbb{P} is a probability measure, all principal minors of KK, i.e. det(KA)\det(K_{A}), must be nonnegative, and thus KK itself must be positive semidefinite, that is, K0K\succeq 0.

Furthermore, from (=𝐘)+(i=1N{i𝐘})=1\mathbb{P}(\emptyset=\mathbf{Y})+\mathbb{P}(\bigcup_{i=1}^{N}\{i\in\mathbf{Y}\})=1 and using inclusion–exclusion principle we get

(i=1N{i𝐘})\displaystyle\mathbb{P}(\bigcup_{i=1}^{N}\{i\in\mathbf{Y}\}) =\displaystyle= i[N](i𝐘){i,j}[N]({i,j}𝐘)+\displaystyle\sum_{i\in[N]}\mathbb{P}(i\in\mathbf{Y})-\sum_{\{i,j\}\subset[N]}\mathbb{P}(\{i,j\}\subseteq\mathbf{Y})+\cdots (2.2)
+(1)N1([N]𝐘)\displaystyle\qquad\cdots+(-1)^{N-1}\mathbb{P}([N]\subseteq\mathbf{Y})
=\displaystyle= |A|=1det(KA)|A|=2det(KA)+\displaystyle\sum_{|A|=1}\det(K_{A})-\sum_{|A|=2}\det(K_{A})+\cdots
+(1)N1det(K)\displaystyle\qquad\cdots+(-1)^{N-1}\det(K)
=\displaystyle= 1det(IK).\displaystyle 1-\det(I-K)\,.

The above last equality follows from the characteristic polynomial. Equation (2.2) also means

(=𝐘)=det(IK)0.\mathbb{P}(\emptyset=\mathbf{Y})=\det(I-K)\geq 0. (2.3)

Similarly, we are able to show that (=𝐘A)=det(IAKA)0\mathbb{P}(\emptyset=\mathbf{Y}\cap A)=\det(I_{A}-K_{A})\geq 0 for any subset A[N]A\subseteq[N] and hence KIK\preceq I. So the necessary condition for a symmetric matrix to give a determinantal process is 0KI0\preceq K\preceq I.

This condition turns out to be sufficient as well: any 0KI0\preceq K\preceq I defines a DPP. To prove this, it suffices to show that for every A[N]A\subseteq[N], the atomic probability is well defined, that is, 0(A=𝐘)10\leq\mathbb{P}(A=\mathbf{Y})\leq 1. The upper bound holds since KIK\preceq I. For the lower bound, we assume that KAK_{A} is invertible (if KAK_{A} is not invertible, we immediately get (A=𝐘)=0\mathbb{P}(A=\mathbf{Y})=0). Then using the Schur complement and the characteristic polynomial, we have

(A=𝐘)\displaystyle\mathbb{P}(A=\mathbf{Y}) =\displaystyle= (A𝐘)(iA¯{A{i}𝐘})\displaystyle\mathbb{P}(A\subseteq\mathbf{Y})-\mathbb{P}(\bigcup_{i\in\bar{A}}\{A\cup\{i\}\subseteq\mathbf{Y}\}) (2.4)
=\displaystyle= det(KA)iA¯det(KA{i})+{i,j}A¯det(KA{i,j})+\displaystyle\det(K_{A})-\sum_{i\in\bar{A}}\det(K_{A\cup\{i\}})+\sum_{\{i,j\}\subseteq\bar{A}}\det(K_{A\cup\{i,j\}})+
+(1)|A¯|det(K)\displaystyle\qquad\cdots+(-1)^{|\bar{A}|}\det(K)
=\displaystyle= det(KA)iA¯det(KA)det(KiiK{i},AKA1KA,{i})\displaystyle\det(K_{A})-\sum_{i\in\bar{A}}\det(K_{A})\det(K_{ii}-K_{\{i\},A}K^{-1}_{A}K_{A,\{i\}})
+{i,j}A¯det(KA)det(K{i,j}K{i,j},AKA1KA,{i,j})+\displaystyle\qquad+\sum_{\{i,j\}\subseteq\bar{A}}\det(K_{A})\det(K_{\{i,j\}}-K_{\{i,j\},A}K^{-1}_{A}K_{A,\{i,j\}})+
+(1)|A|¯det(KA)det(KA¯KA¯,AKA1KA,A¯)\displaystyle\qquad\cdots+(-1)^{|\bar{A|}}\det(K_{A})\det(K_{\bar{A}}-K_{\bar{A},{A}}K^{-1}_{A}K_{A,\bar{A}})
=\displaystyle= (1)|A¯|det(KA)det((KA¯KA¯,AKA1KA,A¯)IA¯)\displaystyle(-1)^{|\bar{A}|}\det(K_{A})\det((K_{\bar{A}}-K_{\bar{A},{A}}K^{-1}_{A}K_{A,\bar{A}})-I_{\bar{A}})
=\displaystyle= (1)|A¯|det(KIA¯),\displaystyle(-1)^{|\bar{A}|}\det(K-I_{\bar{A}}),

where KA,BK_{A,B} denotes the matrix obtained from KK by keeping only those entries whose rows belong to AA and whose columns belong to BB (if A=BA=B we simply write KAK_{A}), |A||A| denotes the cardinality of the subset AA, and A¯\bar{A} the complement of the set AA. Here we use a slight abuse of notation for IA¯I_{\bar{A}}: it refers to the N×NN\times N matrix whose restriction to A¯\bar{A} is the identity and which has zeros everywhere else. Since 0KI0\preceq K\preceq I, (A=𝐘)=|det(KIA¯)|0\mathbb{P}(A=\mathbf{Y})=|\det(K-I_{\bar{A}})|\geq 0. This completes the proof of the theorem. ∎
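The formula derived above is easy to check numerically. The following is a minimal sanity check, assuming NumPy; the helper atomic_probabilities is an illustrative name of ours, not code from the paper. It enumerates all subsets A, evaluates P(A = Y) = (-1)^{|A^c|} det(K - I_{A^c}), and verifies that the resulting atomic probabilities are nonnegative and sum to one for a kernel with 0 ⪯ K ⪯ I.

```python
import itertools
import numpy as np

def atomic_probabilities(K):
    """P(A = Y) = (-1)^{|A^c|} det(K - I_{A^c}) for every subset A, as in the proof above."""
    N = K.shape[0]
    probs = {}
    for r in range(N + 1):
        for A in itertools.combinations(range(N), r):
            Abar = [i for i in range(N) if i not in A]
            M = K.copy()
            M[Abar, Abar] -= 1.0                       # subtract the identity on the A^c diagonal
            probs[A] = (-1) ** len(Abar) * np.linalg.det(M)
    return probs

# A valid marginal kernel: K = S(S + I)^{-1} has eigenvalues in [0, 1) for S positive semidefinite.
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
S = B @ B.T
K = S @ np.linalg.inv(S + np.eye(3))
p = atomic_probabilities(K)
assert all(v >= -1e-12 for v in p.values())
assert abs(sum(p.values()) - 1.0) < 1e-10
```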

It is quite inconvenient to work with marginal kernels since their eigenvalues must lie between 0 and 1, and marginal probabilities are not the most natural way to describe real-world data. Here we introduce a slightly smaller class of DPPs called L-ensembles, where LL only needs to be positive semi-definite.

Definition 2.3.

A point process is called an L-ensemble if it is defined through a real, symmetric, positive semi-definite N×NN\times N matrix LL:

L(A=𝐘)det(LA),\mathbb{P}_{L}(A=\mathbf{Y})\propto\det(L_{A}), (2.5)

where A𝒴A\subseteq\mathcal{Y} is a fixed subset.

The proportional coefficient is given by [KT12]:

1A𝒴det(LA)=1det(L+I).\frac{1}{\sum_{A\subseteq\mathcal{Y}}\det(L_{A})}=\frac{1}{\det{(L+I)}}.

Moreover, [Mac75] shows that L-ensembles are indeed DPPs, whose marginal kernel KK is given by L(L+I)1=I(L+I)1.L(L+I)^{-1}=I-(L+I)^{-1}.
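In code the correspondence is immediate. Below is a short sketch, assuming NumPy (the helper names are ours), of the atomic probability det(L_A)/det(L+I) of an L-ensemble and of its marginal kernel K = L(L+I)^{-1} = I - (L+I)^{-1}; the example kernel is the 2 by 2 kernel used later in the paper.

```python
import numpy as np

def l_ensemble_prob(L, A):
    """P(Y = A) = det(L_A) / det(L + I); the determinant of the empty block is 1."""
    A = list(A)
    num = np.linalg.det(L[np.ix_(A, A)]) if A else 1.0
    return num / np.linalg.det(L + np.eye(L.shape[0]))

def marginal_kernel(L):
    """Marginal kernel of the L-ensemble: K = L(L + I)^{-1} = I - (L + I)^{-1}."""
    return np.eye(L.shape[0]) - np.linalg.inv(L + np.eye(L.shape[0]))

L = np.array([[1.0, 1.0], [1.0, 2.0]])
print(l_ensemble_prob(L, []), l_ensemble_prob(L, [0, 1]))   # both equal 1/5 here
print(marginal_kernel(L))
```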

3 Maximum likelihood Estimator of DPPs

Let 𝒮[N]++\mathcal{S}_{[N]}^{++} denote the set of all NN by NN positive definite matrices. Suppose that Z1,,ZnZ_{1},...,Z_{n} is a sample of size nn from a DPP(L){\text{DPP}(L^{\star})} for some given but unknown L𝒮[N]++L^{\star}\in\mathcal{S}_{[N]}^{++}. This means that Z1,,ZnZ_{1},\cdots,Z_{n} are independent 𝕐\mathbb{Y}-valued random subsets whose probability law is given by the ensemble LL^{\star}. The goal of this paper is to estimate LL^{\star} from the given sample Z1,,ZnZ_{1},\cdots,Z_{n}. The most popular choice is the maximum likelihood estimator L^n\hat{L}_{n}. We shall recall its construction in our specific setting and study its strong consistency, asymptotic normality, and the associated Berry-Esseen theorem.

Let PL(Zi)P_{L}(Z_{i}) denote the probability of the L-ensemble being ZiZ_{i}. The scaled log-likelihood function associated with the sample Z1,,ZnZ_{1},\cdots,Z_{n} is

Φ^n(L)=1ni=1nlogPL(Zi)=J[N]p^(J)logdet(LJ)logdet(L+I),\hat{\Phi}_{n}(L)=\frac{1}{n}\sum_{i=1}^{n}\log P_{L}(Z_{i})=\sum_{J\subseteq[N]}\hat{p}{(J)}\log\det(L_{J})-\log\det(L+I), (3.1)

where L𝒮[N]++L\in\mathcal{S}_{[N]}^{++} and

p^(J)=1ni=1n𝕀(Zi=J),\hat{p}{(J)}=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}(Z_{i}=J),

with 𝕀()\mathbb{I}(\cdot) standing for the indicator function. It is also useful to define the expected log-likelihood function given the true kernel LL^{\star}:

ΦL(L)=J[N]pL(J)logdet(LJ)logdet(L+I),\Phi_{L^{\star}}(L)=\sum_{J\subseteq[N]}{p_{L^{\star}}(J)}\log\det(L_{J})-\log\det(L+I)\,, (3.2)

where

pL(J)=𝔼(p^(J))=det(LJ)det(L+I).p_{L^{\star}}(J)=\mathbb{E}{(\hat{p}{(J)})}=\frac{\det{(L^{\star}_{J})}}{\det{(L^{\star}+I)}}.

In the sequel LL^{\star} is always fixed. To simplify writing we let p^J\hat{p}_{J} denote p^(J)\hat{p}(J), pJp^{*}_{J} denote pL(J)p_{L^{*}}(J) and Φ\Phi denote ΦL\Phi_{L^{\star}}.
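For concreteness, here is a minimal sketch (assuming NumPy; scaled_log_likelihood is an illustrative helper name of ours) of how Φ̂_n(L) in (3.1) can be evaluated from a sample of subsets: only the subsets J that actually occur in the sample contribute to the sum, with weights p̂(J).

```python
from collections import Counter
import numpy as np

def scaled_log_likelihood(L, samples):
    """Phi_hat_n(L) = sum_J p_hat(J) log det(L_J) - log det(L + I), cf. (3.1).
    `samples` is a list of observed subsets, each given as a tuple of indices."""
    n = len(samples)
    p_hat = Counter(tuple(sorted(J)) for J in samples)
    val = -np.log(np.linalg.det(L + np.eye(L.shape[0])))
    for J, count in p_hat.items():
        det_J = np.linalg.det(L[np.ix_(J, J)]) if J else 1.0   # det of the empty block is 1
        val += (count / n) * np.log(det_J)
    return val

# Hypothetical observations on the ground set {0, 1}:
samples = [(), (0,), (0, 1), (1,), (0,)]
print(scaled_log_likelihood(np.array([[1.0, 0.5], [0.5, 2.0]]), samples))
```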

Definition 3.1.

A maximum likelihood estimator L^n\hat{L}_{n} is a global maximizer of (3.1):

L^n=argmaxL𝒮[N]++Φ^n(L).\hat{L}_{n}=\arg\!\!\max_{L\in\mathcal{S}_{[N]}^{++}}\hat{\Phi}_{n}(L)\,. (3.3)

We notice that Φ(L)Φ(L)\Phi{(L^{\star})}-\Phi{(L)} is the Kullback-Leibler divergence KL(DPP(L),DPP(L))\mathrm{KL}\big{(}\mathrm{DPP}(L^{\star}),\mathrm{DPP}(L)\big{)}, which measures the difference between distributions of DPP(L)\mathrm{DPP}(L^{\star}) and of DPP(L)\mathrm{DPP}(L). KL divergence is always non-negative:

KL(DPP(L),DPP(L))=Φ(L)Φ(L)0,L𝒮[N]++.\mathrm{KL}\big{(}\mathrm{DPP}(L^{\star}),\mathrm{DPP}(L)\big{)}=\Phi{(L^{\star})}-\Phi(L)\geq 0,\,\forall L\in\mathcal{S}_{[N]}^{++}.

As a consequence LL^{\star} is a global maximizer of the expected log-likelihood function Φ(L)\Phi(L).

The next Lemma gives the differential of the likelihood function Φ^n(L)\hat{\Phi}_{n}(L).

Lemma 3.2.

The gradient of log-likelihood function Φ^n(L)\hat{\Phi}_{n}(L) defined in (3.1) exists and is given by

dΦ^n(L)=J[N]p^JLJ1(L+I)1.\mathop{}\!\mathrm{d}\hat{\Phi}_{n}(L)=\sum_{J\subseteq[N]}\hat{p}_{J}L_{J}^{-1}-(L+I)^{-1}. (3.4)
Proof.

We regard determinant as a multivariate function of N×NN\times N variables and then the directional derivative of det(L+I)\det(L+I) along direction HH is given by

ddet(L+I)(H)=\displaystyle\mathop{}\!\mathrm{d}\det(L+I)(H)= limt0det(L+I+tH)det(L+I)t\displaystyle\lim_{t\to 0}\frac{\det(L+I+tH)-\det(L+I)}{t}
=\displaystyle= limt0det(L+I)[det(I+t(L+I)1H)1t]\displaystyle\lim_{t\to 0}\det(L+I)\Big{[}\frac{\det(I+t(L+I)^{-1}H)-1}{t}\Big{]}
=\displaystyle= limt0det(L+I)[1+tTr((L+I)1H)+o(t2)1t]\displaystyle\lim_{t\to 0}\det(L+I)\Big{[}\frac{1+t\operatorname{Tr}((L+I)^{-1}H)+o(t^{2})-1}{t}\Big{]}
=\displaystyle= det(L+I)Tr((L+I)1H),\displaystyle\det(L+I)\operatorname{Tr}((L+I)^{-1}H), (3.5)

where the third equality follows from the power series representation of det(I+A)\det(I+A). Then the directional derivative of Φ^n(L)\hat{\Phi}_{n}(L) along direction HH is

dΦ^n(L)(H)=J[N]p^JTr(LJ1HJ)Tr((L+I)1H).\mathop{}\!\mathrm{d}\hat{\Phi}_{n}(L)(H)=\sum_{J\subseteq[N]}\hat{p}_{J}\operatorname{Tr}(L_{J}^{-1}H_{J})-\operatorname{Tr}((L+I)^{-1}H). (3.6)

In matrix form, the above equation becomes

dΦ^n(L)=J[N]p^JLJ1(L+I)1.\mathop{}\!\mathrm{d}\hat{\Phi}_{n}(L)=\sum_{J\subseteq[N]}\hat{p}_{J}L_{J}^{-1}-(L+I)^{-1}. (3.7)

This proves the lemma. ∎
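In (3.4), L_J^{-1} is understood as an N x N matrix that is zero outside the rows and columns indexed by J. A minimal sketch of the gradient computation (assuming NumPy; grad_log_likelihood is our illustrative name) is given below; since the expected score vanishes at the true kernel, this quantity should be small when evaluated at L^{\star} on a large sample drawn from DPP(L^{\star}).

```python
from collections import Counter
import numpy as np

def grad_log_likelihood(L, samples):
    """dPhi_hat_n(L) = sum_J p_hat(J) [L_J^{-1} zero-padded to N x N] - (L + I)^{-1}, cf. (3.4)."""
    N = L.shape[0]
    n = len(samples)
    grad = -np.linalg.inv(L + np.eye(N))
    for J, count in Counter(tuple(sorted(s)) for s in samples).items():
        if J:                                    # the empty set contributes nothing
            idx = np.ix_(J, J)
            padded = np.zeros((N, N))
            padded[idx] = np.linalg.inv(L[idx])  # zero-padded L_J^{-1}
            grad += (count / n) * padded
    return grad
```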

3.1 Strong consistency

One critical issue for maximum likelihood estimation is its consistency. Due to the non-identifiability of DPPs illustrated in Theorem 4.1 of [Kul12], any element of {DLD:D𝒟}\{DL^{\star}D:D\in\mathcal{D}\} defines the same DPP, where 𝒟\mathcal{D} is the collection of all diagonal matrices whose diagonal entries are either 1 or -1. Moreover, if L1,L2𝒮[N]++L_{1},L_{2}\in\mathcal{S}_{[N]}^{++} define the same DPP, then there is a D𝒟D\in\mathcal{D} such that L2=DL1DL_{2}=DL_{1}D. Thus, we cannot expect the maximum likelihood estimator L^n\hat{L}_{n} to converge to LL^{\star}. Instead, it should converge to DLDDL^{\star}D for some D𝒟D\in\mathcal{D}. Hence, for the consistency problem we study the distance

(L^n,L):=minD𝒟L^nDLDF.\ell(\hat{L}_{n},L^{\star}):=\min_{D\in\mathcal{D}}\|\hat{L}_{n}-DL^{\star}D\|_{F}. (3.8)

[BMRU17] proves that this distance converges to zero in probability. We shall prove the stronger statement that the convergence also holds almost surely.
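Since 𝒟 contains only 2^N sign matrices, the distance (3.8) can be computed by brute force for moderate N. A minimal sketch, assuming NumPy (dpp_distance is our illustrative helper name):

```python
import itertools
import numpy as np

def dpp_distance(L_hat, L_star):
    """ell(L_hat, L_star) = min_D ||L_hat - D L_star D||_F over the 2^N diagonal
    sign matrices D, cf. (3.8); also returns the best-aligned D L_star D."""
    N = L_star.shape[0]
    best, best_aligned = np.inf, None
    for signs in itertools.product([1.0, -1.0], repeat=N):
        d = np.array(signs)
        aligned = d[:, None] * L_star * d[None, :]   # D L_star D, elementwise d_i L_ij d_j
        dist = np.linalg.norm(L_hat - aligned)       # Frobenius norm
        if dist < best:
            best, best_aligned = dist, aligned
    return best, best_aligned
```

Because the Frobenius norm is invariant under conjugation by a sign matrix, the same minimizing D also attains the minimum in (3.16) below, so this search simultaneously yields the matrix D̂_n used there.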

Theorem 3.3.

Let Z1,,ZnZ_{1},...,Z_{n} be a sample of size nn from ZDPP(L)Z\sim{\text{DPP}(L^{\star})}. Let L^n\hat{L}_{n} be the maximum likelihood estimator of LL^{\star} defined by (3.3). Then (L^n,L)\ell(\hat{L}_{n},L^{\star}) converges to zero almost surely.

Proof.

The proof combines the convergence in probability result of [BMRU17, Theorem 14] with Wald's consistency theorem [Wal49] and is divided into the following five steps.

Step 1: For 0<α<β<10<\alpha<\beta<1, define a set Eα,βE_{\alpha,\beta}

Eα,β={L𝒮[N]++:K=L(I+L)1𝒮[N][α,β]},E_{\alpha,\beta}=\big{\{}L\in\mathcal{S}^{++}_{[N]}:K=L(I+L)^{-1}\in\mathcal{S}^{[\alpha,\beta]}_{[N]}\big{\}},

where 𝒮[N][α,β]\mathcal{S}^{[\alpha,\beta]}_{[N]} is the set of all NN by NN symmetric matrices whose eigenvalues lie between α\alpha and β\beta. Observe that Eα,βE_{\alpha,\beta} is compact since it is bounded and closed in N×N\mathbb{R}^{N\times N}. We first show that (L^n,L)\ell(\hat{L}_{n},L^{\star}) converges to zero almost surely when the parameter matrices are restricted to LEα,βL^{\star}\in E_{\alpha,\beta} for appropriate 0<α<β<10<\alpha<\beta<1. In order to complete this proof, we let

ΔΦ^(L)=Φ^(L)Φ^(L)=1ni=1nlogPL(Zi)PL(Zi)\Delta\hat{\Phi}(L)=\hat{\Phi}(L)-\hat{\Phi}(L^{\star})=\frac{1}{n}\sum_{i=1}^{n}\log\frac{P_{L}(Z_{i})}{P_{L^{\star}}(Z_{i})}

and

ΔΦ(L)=Φ(L)Φ(L)=𝔼L(logPL(Z)PL(Z)).\Delta{\Phi}(L)={\Phi}(L)-{\Phi}(L^{\star})=\mathbb{E}_{L^{\star}}\big{(}\log\frac{P_{L}(Z)}{P_{L^{\star}}(Z)}\big{)}.

ΔΦ(L)\Delta{\Phi}(L) is the negative of the Kullback-Leibler divergence between DPP(LL^{\star}) and DPP(LL). By Jensen's inequality, ΔΦ(L)0\Delta{\Phi}(L)\leq 0 for all LL, and Φ(L)=Φ(L){\Phi}(L)={\Phi}(L^{\star}) if and only if PL(Z)=PL(Z)P_{L}(Z)=P_{L^{\star}}(Z) for all Z[N]Z\subseteq[N], which means L=DLDL=DL^{\star}D for some D𝒟D\in\mathcal{D}. In the sequel let 𝔼\mathbb{E} denote 𝔼L\mathbb{E}_{L^{\star}}.

For each LEα,βL\in E_{\alpha,\beta}, the strong law of large numbers implies

ΔΦ^(L)a.s.ΔΦ(L).\Delta{\hat{\Phi}}(L)\xrightarrow[]{a.s.}\Delta{\Phi}(L). (3.9)

Step 2: However, the convergence (3.9) alone does not imply the convergence of the maximum likelihood estimator to the true values. Thus Wald's integrability condition is needed: for every LEα,βL\in E_{\alpha,\beta}, there exists ϵ>0\epsilon>0 such that

𝔼supNEα,β(L,N)<ϵlogPN(Z)PL(Z)<.\mathbb{E}\sup_{\begin{subarray}{c}N\in E_{\alpha,\beta}\\ \ell(L,N)<\epsilon\end{subarray}}\log\frac{P_{N}(Z)}{P_{L^{\star}}(Z)}<\infty. (3.10)

Since LlogPL(Z)PL(Z)L\mapsto\log\frac{P_{L}(Z)}{P_{L^{\star}}(Z)} is continuous (the determinant function is continuous), for any δ>0\delta>0 there exists ϵ>0\epsilon>0 such that whenever (L,N)<ϵ\ell(L,N)<\epsilon,

(1δ)PL(Z)PL(Z)<PN(Z)PL(Z)<(1+δ)PL(Z)PL(Z).(1-\delta)\frac{P_{L}(Z)}{P_{L^{\star}}(Z)}<\frac{P_{N}(Z)}{P_{L^{\star}}(Z)}<(1+\delta)\frac{P_{L}(Z)}{P_{L^{\star}}(Z)}.

Then the Wald’s integrability condition is satisfied. Now for every sequence {Ln}\{L_{n}\} converging to L, we show that ΔΦ(Ln)\Delta\Phi(L_{n}) is upper semicontinuous:

lim supnΔΦ(Ln)\displaystyle\limsup_{n\to\infty}\Delta\Phi(L_{n}) =\displaystyle= lim supn𝔼logPLn(Z)PL(Z)\displaystyle\limsup_{n\to\infty}\mathbb{E}\log\frac{P_{L_{n}}(Z)}{P_{L^{\star}}(Z)}
\displaystyle\leq 𝔼lim supnlogPLn(Z)PL(Z)\displaystyle\mathbb{E}\limsup_{n\to\infty}\log\frac{P_{L_{n}}(Z)}{P_{L^{\star}}(Z)}
=\displaystyle= 𝔼logPL(Z)PL(Z)\displaystyle\mathbb{E}\log\frac{P_{L}(Z)}{P_{L^{\star}}(Z)}
=\displaystyle= ΔΦ(L).\displaystyle\Delta\Phi(L).

The inequality follows from Fatou's lemma and the last equality is a consequence of the continuity of the function LlogPL(Z)PL(Z)L\mapsto\log\frac{P_{L}(Z)}{P_{L^{\star}}(Z)}. For every η>0\eta>0 we define the set KηK_{\eta}:

Kη\displaystyle K_{\eta} ={LEα,β:(L,L)η}\displaystyle=\big{\{}L\in E_{\alpha,\beta}:\ell(L,L^{\star})\geq\eta\big{\}}
=D𝒟{LEα,β:LDLDFη},\displaystyle=\bigcap_{D\in\mathcal{D}}\big{\{}L\in E_{\alpha,\beta}:\|L-DL^{\star}D\|_{F}\geq\eta\big{\}}, (3.11)

which is closed and hence compact.

Since ΔΦ(L)\Delta\Phi(L) is an upper semicontinuous function, it achieves its maximum over the compact set KηK_{\eta} (KηK_{\eta} is closed in the compact set Eα,βE_{\alpha,\beta}). We denote this maximum by m(η)m(\eta). We cannot have m(η)=0m(\eta)=0 because that would imply there is an LKηL\in K_{\eta} such that L=DLDL=DL^{\star}D for some D𝒟D\in\mathcal{D}. The strong law of large numbers implies

supNEα,β(L,N)<ϵΔΦ^(N)\displaystyle\sup_{\begin{subarray}{c}N\in E_{\alpha,\beta}\\ \ell(L,N)<\epsilon\end{subarray}}\Delta\hat{\Phi}(N) \displaystyle\leq 1ni=1nsupNEα,β(L,N)<ϵlogPN(Zi)PL(Zi)\displaystyle\frac{1}{n}\sum_{i=1}^{n}\sup_{\begin{subarray}{c}N\in E_{\alpha,\beta}\\ \ell(L,N)<\epsilon\end{subarray}}\log\frac{P_{N}(Z_{i})}{P_{L^{\star}}(Z_{i})} (3.12)
a.s.\displaystyle\xrightarrow[]{a.s.} 𝔼supNEα,β(L,N)<ϵlogPN(Z)PL(Z).\displaystyle\mathbb{E}\sup_{\begin{subarray}{c}N\in E_{\alpha,\beta}\\ \ell(L,N)<\epsilon\end{subarray}}\log\frac{P_{N}(Z)}{P_{L^{\star}}(Z)}.

By continuity,

limϵ0supNEα,β(L,N)<ϵlogPN(Z)PL(Z)=logPL(Z)PL(Z)\lim_{\epsilon\to 0}\sup_{\begin{subarray}{c}N\in E_{\alpha,\beta}\\ \ell(L,N)<\epsilon\end{subarray}}\log\frac{P_{N}(Z)}{P_{L^{\star}}(Z)}=\log\frac{P_{L}(Z)}{P_{L^{\star}}(Z)}

and supNEα,β,(L,N)<ϵlogPN(Z)PL(Z)\sup_{N\in E_{\alpha,\beta},\,\ell(L,N)<\epsilon}\log\frac{P_{N}(Z)}{P_{L^{\star}}(Z)} decreases as ϵ\epsilon decreases, because the supremum over a smaller subset is smaller than that over a bigger subset. By (3.10) it is integrable for all small enough ϵ\epsilon. Hence by the dominated convergence theorem,

limϵ0𝔼supNEα,β(L,N)<ϵlogPN(Z)PL(Z)=𝔼logPL(Z)PL(Z)=ΔΦ(L).\lim_{\epsilon\to 0}\mathbb{E}\sup_{\begin{subarray}{c}N\in E_{\alpha,\beta}\\ \ell(L,N)<\epsilon\end{subarray}}\log\frac{P_{N}(Z)}{P_{L^{\star}}(Z)}=\mathbb{E}\log\frac{P_{L}(Z)}{P_{L^{\star}}(Z)}=\Delta\Phi(L).

Thus for any LKηL\in K_{\eta} and any γ>0\gamma>0 there exists an  ϵL\epsilon_{L} such that

𝔼supNEα,β(L,N)<ϵLlogPN(Z)PL(Z)<m(η)+γ.\mathbb{E}\sup_{\begin{subarray}{c}N\in E_{\alpha,\beta}\\ \ell(L,N)<\epsilon_{L}\end{subarray}}\log\frac{P_{N}(Z)}{P_{L^{\star}}(Z)}<m(\eta)+\gamma. (3.13)

Step 3. For each LKηL\in K_{\eta}, we define the open set:

VL={NEα,β:(N,L)<ϵL}V_{L}=\{N\in E_{\alpha,\beta}:\ell(N,L)<\epsilon_{L}\}

and then the family {VL:LKη}\{V_{L}:L\in K_{\eta}\} is an open cover of KηK_{\eta} and hence has a finite subcover: VL1,VL2,.,VLdV_{L_{1}},V_{L_{2}},....,V_{L_{d}}. On every VLiV_{L_{i}} we use strong law of large numbers again to obtain

lim supnsupNVLiΔΦ^(N)\displaystyle\limsup_{n\to\infty}\sup_{N\in V_{L_{i}}}\Delta\hat{\Phi}(N) \displaystyle\leq lim supn1ni=1nsupNVLilogPN(Zi)PL(Zi)\displaystyle\limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\sup_{N\in V_{L_{i}}}\log\frac{P_{N}(Z_{i})}{P_{L^{\star}}(Z_{i})} (3.14)
=\displaystyle= 𝔼supNVLilogPN(Z)PL(Z).\displaystyle\mathbb{E}\sup_{N\in V_{L_{i}}}\log\frac{P_{N}(Z)}{P_{L^{\star}}(Z)}.

From (3.13) we get

lim supnsupNVLiΔΦ^(N)<m(η)+γi=1,2,,d.\limsup_{n\to\infty}\sup_{N\in V_{L_{i}}}\Delta\hat{\Phi}(N)<m(\eta)+\gamma\qquad i=1,2,...,d.

Since {VLi:i=1,2,d}\{V_{L_{i}}:i=1,2...,d\} cover KηK_{\eta} we have

lim supnsupNKηΔΦ^(N)<m(η)+γ\limsup_{n\to\infty}\sup_{N\in K_{\eta}}\Delta\hat{\Phi}(N)<m(\eta)+\gamma

which, since γ\gamma is arbitrary, implies

lim supnsupLKηΔΦ^(L)<supLKηΔΦ(L)=m(η).\limsup_{n\to\infty}\sup_{L\in K_{\eta}}\Delta\hat{\Phi}(L)<\sup_{L\in K_{\eta}}\Delta{\Phi}(L)=m(\eta). (3.15)

Notice that m(η)<0m(\eta)<0. From (3.15) there exists a constant N1N_{1} such that

supLKηΔΦ^(L)<m(η)2,n>N1.\sup_{L\in K_{\eta}}\Delta\hat{\Phi}(L)<\frac{m(\eta)}{2},\qquad n>N_{1}.

But

ΔΦ^(L^n)=supLEα,βΔΦ^(L)ΔΦ^(L)a.s.ΔΦ(L)=0,\Delta\hat{\Phi}(\hat{L}_{n})=\sup_{L\in E_{\alpha,\beta}}\Delta\hat{\Phi}({L})\geq\Delta\hat{\Phi}(L^{\star})\xrightarrow[]{a.s.}\Delta{\Phi}(L^{\star})=0,

so there exists a constant N2N_{2} such that

ΔΦ^(L^n)m(η)2,n>N2\Delta\hat{\Phi}(\hat{L}_{n})\geq\frac{m(\eta)}{2},\qquad n>N_{2}

which implies that L^nKη\hat{L}_{n}\notin K_{\eta} for n>max(N1,N2)n>\max(N_{1},N_{2}), that is, (L^n,L)<η\ell(\hat{L}_{n},L^{\star})<\eta.

Step 4. Now we can remove the compactness restriction. First we show that the event {L^nEα,β}\{\hat{L}_{n}\in E_{\alpha,\beta}\} holds eventually almost surely. We adapt the proof from [BMRU17]. Let δ=minJ[N]PL(J)\delta=\min_{J\subseteq[N]}P_{L^{\star}}{(J)}. For simplicity, we denote PL(J)P_{L^{\star}}{(J)} by pJp^{\star}_{J}. Since LL^{\star} is positive definite, δ>0\delta>0. Define the event 𝒜n\mathcal{A}_{n} by

𝒜n=J[N]{pJ2p^J3pJ}.\mathcal{A}_{n}=\bigcap_{J\subset[N]}\big{\{}p^{\star}_{J}\leq 2\hat{p}_{J}\leq 3p^{\star}_{J}\big{\}}.

Observe that Φ(L)<0\Phi(L^{\star})<0 and we can find α<exp(3Φ(L)/δ)\alpha<\exp(3\Phi(L^{\star})/\delta) and β>1exp(3Φ(L)/δ)\beta>1-\exp(3\Phi(L^{\star})/\delta) such that 0<α<β<10<\alpha<\beta<1. Then using [BMRU17, Theorem 14] we know that on the event 𝒜n\mathcal{A}_{n}, L^nEα,β\hat{L}_{n}\in E_{\alpha,\beta}, that is,

P(L^nEα,β)P(𝒜n).P(\hat{L}_{n}\in E_{\alpha,\beta})\geq P(\mathcal{A}_{n}).

Because

p^J=1ni=1n𝕀(Zi=J)a.s.PL(Z=J)=pJ,\hat{p}_{J}=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}(Z_{i}=J)\xrightarrow[]{a.s.}P_{L^{\star}}(Z=J)=p^{\star}_{J}\,,

the event 𝒜n\mathcal{A}_{n} holds eventually almost surely as nn goes to infinity, and hence so does {L^nEα,β}\{\hat{L}_{n}\in E_{\alpha,\beta}\}.

Step 5. Let 𝕀En\mathbb{I}_{E_{n}} denote the indicator function of the event {L^nEα,β}\{\hat{L}_{n}\in E_{\alpha,\beta}\}. Then

(limn(L^n,L)=0)\displaystyle\mathbb{P}\big{(}\lim_{n\to\infty}\ell(\hat{L}_{n},L^{\star})=0\big{)} =\displaystyle= (limn(L^n,L)=0,limn𝕀En=1)\displaystyle\mathbb{P}\big{(}\lim_{n\to\infty}\ell(\hat{L}_{n},L^{\star})=0,\lim_{n\to\infty}\mathbb{I}_{E_{n}}=1\big{)}
+(limn(L^n,L)=0,limn𝕀En1)\displaystyle+\mathbb{P}\big{(}\lim_{n\to\infty}\ell(\hat{L}_{n},L^{\star})=0,\lim_{n\to\infty}\mathbb{I}_{E_{n}}\neq 1\big{)}
=\displaystyle= (limn(L^n,L)=0,limn𝕀En=1)\displaystyle\mathbb{P}\big{(}\lim_{n\to\infty}\ell(\hat{L}_{n},L^{\star})=0,\lim_{n\to\infty}\mathbb{I}_{E_{n}}=1\big{)}
=\displaystyle= (limn(L^n,L)=0|limn𝕀En=1)(limn𝕀En=1)\displaystyle\mathbb{P}\big{(}\lim_{n\to\infty}\ell(\hat{L}_{n},L^{\star})=0\big{|}\lim_{n\to\infty}\mathbb{I}_{E_{n}}=1\big{)}\mathbb{P}\big{(}\lim_{n\to\infty}\mathbb{I}_{E_{n}}=1\big{)}
=\displaystyle= (limn(L^n,L)=0|limn𝕀En=1)\displaystyle\mathbb{P}\big{(}\lim_{n\to\infty}\ell(\hat{L}_{n},L^{\star})=0\big{|}\lim_{n\to\infty}\mathbb{I}_{E_{n}}=1\big{)}
=\displaystyle= 1.\displaystyle 1.

The second-to-last equality uses (limn𝕀En=1)=1\mathbb{P}(\lim_{n\to\infty}\mathbb{I}_{E_{n}}=1)=1, which was proved in Step 4, and the last equality follows from Steps 1–3 applied on the event {L^nEα,β}\{\hat{L}_{n}\in E_{\alpha,\beta}\}. ∎

3.2 Berry-Esseen theorem

An NN by NN matrix [Aij]N×N[A_{ij}]_{N\times N} can also be viewed as an N×NN\times N dimensional column vector: (A11,A12,,A1N,A21,,AN1,ANN)T(A_{11},A_{12},...,A_{1N},A_{21},...,A_{N1},...A_{NN})^{T}. With this identification the Frobenius norm of the matrix ||||F||\cdot||_{F} is just the 2\mathcal{L}^{2} norm for its corresponding column vector. In the following we shall regard the matrix as the corresponding column vector.

Because of the non-identifiability of DPPs, the maximum likelihood estimator is not unique. We choose the version L~n\tilde{L}_{n} that is closest to the true value LL^{\star}. In fact, let L^n{\hat{L}_{n}} be any maximum likelihood estimator and let D^n𝒟\hat{D}_{n}\in\mathcal{D} be such that

D^nL^nD^nLF=minD𝒟DL^nDLF\lVert\hat{D}_{n}\hat{L}_{n}\hat{D}_{n}-L^{\star}\rVert_{F}=\min_{{{D}\in\mathcal{D}}}\lVert{D}\hat{L}_{n}{D}-L^{\star}\rVert_{F} (3.16)

and set

L~n=D^nL^nD^n.\tilde{L}_{n}=\hat{D}_{n}\hat{L}_{n}\hat{D}_{n}\,. (3.17)

Then the strong consistency of L~n\tilde{L}_{n} immediately follows from Theorem 3.3.

We impose on LL^{\star} the irreducibility assumption introduced in [BMRU17]; this assumption is equivalent to LL^{\star} consisting of a single block, that is, it cannot be partitioned into two diagonal blocks. Then, according to [BMRU17, Theorem 8], d2Φ(L)\mathrm{d}^{2}{\Phi}(L^{\star}) is negative definite and hence invertible. Let V(L)V(L^{\star}) denote its inverse:

V(L):=[d2Φ(L)]1.V(L^{\star}):=\left[\mathrm{d}^{2}{\Phi}(L^{\star})\right]^{-1}\,. (3.18)

Here if we vectorize L{L} then d2Φ(L)\mathrm{d}^{2}{\Phi}(L^{\star}) is an (N×N)×(N×N)(N\times N)\times(N\times N) Hessian matrix. By [VdV00, Theorem 5.41],

n(L~nL)\displaystyle\sqrt{n}(\tilde{L}_{n}-L^{\star}) =\displaystyle= (𝔼(d2logPL(Z)))11ni=1nd(logPL(Zi))+oP(1)\displaystyle-({\mathbb{E}{(\mathrm{d}^{2}\log{P_{L^{\star}}(Z)})}})^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\mathrm{d}(\log P_{L^{\star}}(Z_{i}))+o_{P}(1)
=\displaystyle= V(L)1ni=1n((LZi)1(I+L)1)+oP(1).\displaystyle-V(L^{\star})\frac{1}{\sqrt{n}}\sum_{i=1}^{n}((L^{\star}_{Z_{i}})^{-1}-(I+L^{\star})^{-1})+o_{P}(1). (3.19)

In particular, [VdV00, Theorem 5.41] states that the sequence n(L~nL)\sqrt{n}(\tilde{L}_{n}-L^{\star}) is asymptotically normal with mean 𝟎\bm{0} and covariance matrix V(L)-V(L^{\star}). Hence we get the following theorem on the asymptotic normality of the maximum likelihood estimator L~n\tilde{L}_{n}.

Theorem 3.4.

Let LL^{\star} be irreducible. Then, L~n\tilde{L}_{n} is asymptotically normal:

n(L~nL)d𝒩(𝟎,V(L)), as n\sqrt{n}(\tilde{L}_{n}-L^{\star})\stackrel{{\scriptstyle d}}{{\rightarrow}}\mathcal{N}(\bm{0},-V(L^{\star})),\quad\hbox{ as \ $n\longrightarrow\infty$} (3.20)

where V(L)V(L^{\star}) is given by (3.18) and the above convergence d\stackrel{{\scriptstyle d}}{{\rightarrow}} holds in distribution.

Having established the asymptotic normality of the maximum likelihood estimator L~n\tilde{L}_{n}, we take one step further. Namely, we want to find an upper bound on the rate of convergence of the distribution of (V(L))12n(L~nL)(-V(L^{\star}))^{-\frac{1}{2}}\sqrt{n}(\tilde{L}_{n}-L^{\star}) to the standard multidimensional normal distribution Z𝒩(𝟎,I)Z\sim\mathcal{N}(\bm{0},{I}). In other words, we are going to obtain a Berry-Esseen type theorem. We show that when L~nEα,β\tilde{L}_{n}\in E_{\alpha,\beta}, the maximal error is of order n14n^{-\frac{1}{4}}. This restriction is not severe: α\alpha and β\beta can be chosen arbitrarily close to 0 and 11 respectively, so that Eα,βE_{\alpha,\beta} approaches 𝒮[N]++\mathcal{S}_{[N]}^{++}. Moreover, since L^nEα,β\hat{L}_{n}\in E_{\alpha,\beta} eventually almost surely by Theorem 3.3, we also have L~n=D^nL^nD^nEα,β\tilde{L}_{n}=\hat{D}_{n}\hat{L}_{n}\hat{D}_{n}\in E_{\alpha,\beta} eventually almost surely.

Theorem 3.5.

Let the maximum likelihood estimator L~n\tilde{L}_{n} be defined by (3.17) and also belong to Eα,βE_{\alpha,\beta} for some 0<α<β<10<\alpha<\beta<1 and let Z be an N×NN\times N standard Gaussian matrix. Then for every xN×Nx\in\mathbb{R}^{N\times N},

|((V(L))12n(L~nL)<x)(Z<x)|C1n4,|\mathbb{P}((-V(L^{\star}))^{-\frac{1}{2}}\sqrt{n}(\tilde{L}_{n}-L^{\star})<x)-\mathbb{P}(Z<x)|\leq C\frac{1}{\sqrt[4]{n}},

where CC is a sufficiently large constant which does not depend on xx, depends on α,β\alpha,\beta, and is proportional to N2N^{2}. Here for two NN by NN matrices AA and BB, A<BA<B means that each component of AA is less than the corresponding component of BB.

Proof.

We divide the proof into four steps.

Step 1.   According to (3.19), (V(L))12n(L~nL)(-V(L^{\star}))^{-\frac{1}{2}}\sqrt{n}(\tilde{L}_{n}-L^{\star}) can be decomposed into a sum

Xn=i=1nξi:=(V(L))121ni=1n((LZi)1(I+L)1)X_{n}=\sum_{i=1}^{n}\xi_{i}:=(-V(L^{\star}))^{\frac{1}{2}}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}((L^{\star}_{Z_{i}})^{-1}-(I+L^{\star})^{-1}) (3.21)

and a term ρn=(V(L))12oP(1)\rho_{n}=(-V(L^{\star}))^{-\frac{1}{2}}o_{P}(1) whose Frobenius norm converges to zero in probability.

|(Xn+ρn<x)(Z<x)|\displaystyle|\mathbb{P}(X_{n}+\rho_{n}<{x})-\mathbb{P}(Z<{x})|
=|(Xn+ρn<x,ρnFkn)+(Xn+ρn<x,ρnF<kn)(Z<x)|\displaystyle=|\mathbb{P}(X_{n}+\rho_{n}<{x},\|\rho_{n}\|_{F}\geq k_{n})+\mathbb{P}(X_{n}+\rho_{n}<{x},\|\rho_{n}\|_{F}<k_{n})-\mathbb{P}(Z<{x})|
(ρnFkn)+|(Xn+ρn<x,ρnF<kn)(Z<x)|\displaystyle\leq\mathbb{P}(\|\rho_{n}\|_{F}\geq k_{n})+|\mathbb{P}(X_{n}+\rho_{n}<{x},\|\rho_{n}\|_{F}<k_{n})-\mathbb{P}(Z<{x})|
(ρnFkn)\displaystyle\leq\mathbb{P}(\|\rho_{n}\|_{F}\geq k_{n})
+|(Xn+kn𝟙<x,ρnF<kn)(Z<x)|\displaystyle\qquad+|\mathbb{P}(X_{n}+{k_{n}}{\mathbbm{1}}<x,\|\rho_{n}\|_{F}<k_{n})-\mathbb{P}(Z<{x})|
+|(Xnkn𝟙<x,ρnF<kn)(Z<x)|\displaystyle\qquad+|\mathbb{P}(X_{n}-{k_{n}}{\mathbbm{1}}<x,\|\rho_{n}\|_{F}<k_{n})-\mathbb{P}(Z<{x})|
=:I1+I2+I3,\displaystyle=:I_{1}+I_{2}+I_{3}\,, (3.22)

where {kn}\{k_{n}\} is an arbitrary sequence of positive real number and 𝟙\mathbbm{1} is the N×NN\times N matrix whose entries are all 1.

Step 2. Estimation of I1I_{1}. We claim that (ρnFkn)C4n4\mathbb{P}(\|\rho_{n}\|_{F}\geq k_{n})\leq\frac{C_{4}}{\sqrt[4]{n}}, where kn=n14k_{n}=n^{-\frac{1}{4}} and C4C_{4} is a constant.

In fact, from the proof of [VdV00, Theorem 5.41], ρn\rho_{n} has the following expression

ρn=n(V(L))12(\displaystyle\rho_{n}=\sqrt{n}(-V(L^{\star}))^{\frac{1}{2}}\bigg{(} 𝐝2Φ^n(L)𝔼(𝐝2Φ^n(L))\displaystyle\operatorname{\mathbf{d}}^{2}\hat{\Phi}_{n}(L^{\star})-\mathbb{E}(\operatorname{\mathbf{d}}^{2}\hat{\Phi}_{n}(L^{\star}))
+12(L~nL)T𝐝3Φ^n(Ln))(L~nL),\displaystyle+\frac{1}{2}(\tilde{L}_{n}-L^{\star})^{T}\operatorname{\mathbf{d}}^{3}\hat{\Phi}_{n}(L_{n})\bigg{)}(\tilde{L}_{n}-L^{\star}), (3.23)

where LnL_{n} is a point on the line segment between L~n\tilde{L}_{n} and LL^{\star}. To simplify notation, let θ\theta denote

(𝐝2Φ^n(L)𝔼(𝐝2Φ^n(L))+12(L~nL)T𝐝3Φ^n(Ln))(L~nL).\bigg{(}\operatorname{\mathbf{d}}^{2}\hat{\Phi}_{n}(L^{\star})-\mathbb{E}(\operatorname{\mathbf{d}}^{2}\hat{\Phi}_{n}(L^{\star}))+\frac{1}{2}(\tilde{L}_{n}-L^{\star})^{T}\operatorname{\mathbf{d}}^{3}\hat{\Phi}_{n}(L_{n})\bigg{)}(\tilde{L}_{n}-L^{\star}).

Then

𝔼ρnF\displaystyle\mathbb{E}{\|\rho_{n}\|_{F}} =\displaystyle= 𝔼n(V(L))12θF\displaystyle\mathbb{E}\|\sqrt{n}(-V(L^{\star}))^{\frac{1}{2}}\theta\|_{F} (3.24)
=\displaystyle= n𝔼(V(L))12θF\displaystyle\sqrt{n}\mathbb{E}\|(-V(L^{\star}))^{\frac{1}{2}}\theta\|_{F}
\displaystyle\leq n𝔼(V(L))12opθ2\displaystyle\sqrt{n}\mathbb{E}\|(-V(L^{\star}))^{\frac{1}{2}}\|_{op}\|\theta\|_{2}
=\displaystyle= nΛmax(V)𝔼θ2.\displaystyle\sqrt{n\cdot\Lambda_{max}(-V)}\cdot\mathbb{E}\|\theta\|_{2}.

op\|\cdot\|_{op} denotes the operator norm induced by 2\mathcal{L}^{2} norm and Λmax\Lambda_{max} denotes the largest eigenvalue. For the first inequality, we regard θ\theta as an N×NN\times N column vector and (V(L))12(-V(L^{\star}))^{\frac{1}{2}} is an (N×N)×(N×N)(N\times N)\times(N\times N) matrix.

𝔼θ2=\displaystyle\mathbb{E}\|\theta\|_{2}= 𝔼(𝐝2Φ^n(L)𝔼(𝐝2Φ^n(L))+12(L~nL)T𝐝3Φ^n(Ln))(L~nL)2\displaystyle\mathbb{E}\big{\|}\big{(}\operatorname{\mathbf{d}}^{2}\hat{\Phi}_{n}(L^{\star})-\mathbb{E}(\operatorname{\mathbf{d}}^{2}\hat{\Phi}_{n}(L^{\star}))+\frac{1}{2}(\tilde{L}_{n}-L^{\star})^{T}\operatorname{\mathbf{d}}^{3}\hat{\Phi}_{n}(L_{n})\big{)}(\tilde{L}_{n}-L^{\star})\big{\|}_{2}
\displaystyle\leq 𝔼(𝐝2Φ^n(L)𝔼(𝐝2Φ^n(L)))(L~nL)2\displaystyle\mathbb{E}\big{\|}\big{(}\operatorname{\mathbf{d}}^{2}\hat{\Phi}_{n}(L^{\star})-\mathbb{E}(\operatorname{\mathbf{d}}^{2}\hat{\Phi}_{n}(L^{\star}))\big{)}(\tilde{L}_{n}-L^{\star})\big{\|}_{2} (3.25)
+\displaystyle+ 𝔼12(L~nL)T𝐝3Φ^n(Ln)(L~nL)2.\displaystyle\mathbb{E}\|\frac{1}{2}(\tilde{L}_{n}-L^{\star})^{T}\operatorname{\mathbf{d}}^{3}\hat{\Phi}_{n}(L_{n})(\tilde{L}_{n}-L^{\star})\big{\|}_{2}\,. (3.26)

Using the Cauchy-Schwarz inequality to estimate (3.25) we see

(3.25) \displaystyle\leq 𝔼12𝐝2Φ^n(L)𝔼(𝐝2Φ^n(L))op2𝔼12L~nL22\displaystyle\mathbb{E}^{\frac{1}{2}}\big{\|}\operatorname{\mathbf{d}}^{2}\hat{\Phi}_{n}(L^{\star})-\mathbb{E}(\operatorname{\mathbf{d}}^{2}\hat{\Phi}_{n}(L^{\star}))\big{\|}^{2}_{op}\mathbb{E}^{\frac{1}{2}}\|\tilde{L}_{n}-L^{\star}\|^{2}_{2} (3.27)
\displaystyle\leq N2nmaxi,j(L1)ij2𝔼12L~nL22.\displaystyle\frac{N^{2}}{\sqrt{n}}\max_{i,j}({L^{\star}}^{-1})_{ij}^{2}\mathbb{E}^{\frac{1}{2}}\|\tilde{L}_{n}-L^{\star}\|^{2}_{2}.

Let h(x)h(x) be a multivariate function:

h:N×N\displaystyle h:\quad\mathbb{R}^{N\times N}\longrightarrow{} \displaystyle\,\mathbb{R}
(x1,x2,,xNN)\displaystyle(x_{1},x_{2},...,x_{NN})\longmapsto x12+x22++xNN2\displaystyle\,x_{1}^{2}+x_{2}^{2}+\cdots+x_{NN}^{2}

Then hh is a continuous function. Moreover, L~nEα,β\tilde{L}_{n}\in E_{\alpha,\beta} almost surely, and Eα,βE_{\alpha,\beta} is a compact and convex set. Using Theorem 3.4 and the portmanteau lemma we have

𝔼(h(n(L~nL)))=n𝔼L~nLF2𝔼Z~F2,\mathbb{E}\big{(}h(\sqrt{n}(\tilde{L}_{n}-L^{\star}))\big{)}=n\mathbb{E}\|\tilde{L}_{n}-L^{\star}\|^{2}_{F}\longrightarrow\mathbb{E}\|\tilde{Z}\|^{2}_{F}, (3.28)

where Z~𝒩(𝟎,V(L))\tilde{Z}\sim\mathcal{N}(\bm{0},-V(L^{\star})), and 𝔼Z~F2\mathbb{E}\|\tilde{Z}\|^{2}_{F} is equal to 𝔼(Z~112++Z~1N2+Z~212++Z~NN2)=Tr(V(L))\mathbb{E}(\tilde{Z}_{11}^{2}+\cdots+\tilde{Z}_{1N}^{2}+\tilde{Z}_{21}^{2}+\cdots+\tilde{Z}_{NN}^{2})=\mathrm{Tr}(-V(L^{\star})). Then there exists a constant C1C_{1} depending on α,β\alpha,\beta such that

𝔼12L~nL22C11n.\mathbb{E}^{\frac{1}{2}}\|\tilde{L}_{n}-L^{\star}\|^{2}_{2}\leq C_{1}\frac{1}{\sqrt{n}}. (3.29)

As a result,

(3.25)C2N21n,\text{\eqref{rho1}}\leq C_{2}N^{2}\frac{1}{n}\,, (3.30)

where C2C_{2} is a suitable constant.

Next, we estimate the second part, that is, (3.26):

𝔼12(L~nL)T𝐝3Φ^n(Ln)(L~nL)2.\mathbb{E}\|\frac{1}{2}(\tilde{L}_{n}-L^{\star})^{T}\operatorname{\mathbf{d}}^{3}\hat{\Phi}_{n}(L_{n})(\tilde{L}_{n}-L^{\star})\big{\|}_{2}.

Here 𝐝3Φ^n(Ln)\operatorname{\mathbf{d}}^{3}\hat{\Phi}_{n}(L_{n}) is an N×NN\times N dimensional column vector whose entries are N×NN\times N matrices. Since Φ^n(L)\hat{\Phi}_{n}(L) is infinitely differentiable, LnL_{n} lies on the line segment between L~n\tilde{L}_{n} and LL^{\star}, and Eα,βE_{\alpha,\beta} is a convex and compact set, we conclude that every entry of 𝐝3Φ^n(Ln)\operatorname{\mathbf{d}}^{3}\hat{\Phi}_{n}(L_{n}) is bounded. Hence there exists a constant C30C_{3}\geq 0 depending on α\alpha and β\beta such that

𝔼12(L~nL)T𝐝3Φ^n(Ln)(L~nL)2\displaystyle\mathbb{E}\|\frac{1}{2}(\tilde{L}_{n}-L^{\star})^{T}\operatorname{\mathbf{d}}^{3}\hat{\Phi}_{n}(L_{n})(\tilde{L}_{n}-L^{\star})\big{\|}_{2}\leq C3𝔼L~nL22\displaystyle C_{3}\mathbb{E}\|\tilde{L}_{n}-L^{\star}\|^{2}_{2}
\displaystyle\leq C12C3n.\displaystyle\frac{C_{1}^{2}C_{3}}{n}. (3.31)

Now let kn=n14k_{n}=n^{-\frac{1}{4}}. Using Markov's inequality we get:

(ρnFkn)𝔼ρnFknC4n4\mathbb{P}(\|\rho_{n}\|_{F}\geq k_{n})\leq\frac{\mathbb{E}\|\rho_{n}\|_{F}}{k_{n}}\leq\frac{C_{4}}{\sqrt[4]{n}} (3.32)

for a suitable constant C4C_{4}.

Step 3. Our next goal is to estimate (I2I_{2}) as follows. Let knk_{n} be 1n4\frac{1}{\sqrt[4]{n}}. Then

I2C7n4I_{2}\leq\frac{C_{7}}{\sqrt[4]{n}}

for some constant C7C_{7}.

Because

(Xn+kn𝟙<x)(Z<x)\displaystyle\mathbb{P}(X_{n}+k_{n}\mathbbm{1}<x)-\mathbb{P}(Z<x)
(Xn+kn𝟙<x,ρnF<kn)(Z<x)\displaystyle\geq\mathbb{P}(X_{n}+{k_{n}}\mathbbm{1}<x,\|\rho_{n}\|_{F}<k_{n})-\mathbb{P}(Z<{x})
=((Xn+kn𝟙<x)(Xn+kn𝟙<x,ρnFkn))(Z<x)\displaystyle=\big{(}\mathbb{P}(X_{n}+k_{n}\mathbbm{1}<x)-\mathbb{P}(X_{n}+{k_{n}}\mathbbm{1}<x,\|\rho_{n}\|_{F}\geq k_{n})\big{)}-\mathbb{P}(Z<{x})
(Xn+kn𝟙<x)(ρnF>kn)(Z<x),\displaystyle\geq\mathbb{P}(X_{n}+k_{n}\mathbbm{1}<x)-\mathbb{P}(\|\rho_{n}\|_{F}>k_{n})-\mathbb{P}(Z<{x}),

we have

I2\displaystyle I_{2} |(Xn+kn𝟙<x)(Z<x)|\displaystyle\leq|\mathbb{P}(X_{n}+k_{n}\mathbbm{1}<x)-\mathbb{P}(Z<x)|
+|(Xn+kn𝟙<x)(ρnFkn)(Z<x)|\displaystyle\quad+|\mathbb{P}(X_{n}+k_{n}\mathbbm{1}<x)-\mathbb{P}(\|\rho_{n}\|_{F}\geq k_{n})-\mathbb{P}(Z<{x})|
2|(Xn+kn𝟙<x)(Z<x)|+(ρnFkn)\displaystyle\leq 2|\mathbb{P}(X_{n}+k_{n}\mathbbm{1}<x)-\mathbb{P}(Z<x)|+\mathbb{P}(\|\rho_{n}\|_{F}\geq k_{n})
=2|(Xn+kn𝟙<x)(Z+kn𝟙<x)\displaystyle=2|\mathbb{P}(X_{n}+k_{n}\mathbbm{1}<x)-\mathbb{P}(Z+k_{n}\mathbbm{1}<x)
+(Z+kn𝟙<x)(Z<x)|+(ρnFkn)\displaystyle\quad+\mathbb{P}(Z+k_{n}\mathbbm{1}<x)-\mathbb{P}(Z<x)|+\mathbb{P}(\|\rho_{n}\|_{F}\geq k_{n})
2|(Xn+kn𝟙<x)P(Z+kn𝟙<x)|\displaystyle\leq 2|\mathbb{P}(X_{n}+k_{n}\mathbbm{1}<x)-P(Z+k_{n}\mathbbm{1}<x)|
+2|(Z+kn𝟙<x)P(Z<x)|+(ρnFkn)\displaystyle\quad+2|\mathbb{P}(Z+k_{n}\mathbbm{1}<x)-P(Z<x)|+\mathbb{P}(\|\rho_{n}\|_{F}\geq k_{n})
=I21+I22+I23.\displaystyle=I_{21}+I_{22}+I_{23}. (3.33)

By multidimensional Berry-Esseen theorem in [Ben05],

I21C5Nn𝔼ξ123I_{21}\leq C_{5}\cdot\sqrt{N}\cdot n\cdot\mathbb{E}{\|\xi_{1}\|_{2}^{3}} (3.34)

where C5C_{5} is a constant and ξ1\xi_{1} is defined in (3.21):

𝔼ξ123\displaystyle\mathbb{E}{\|\xi_{1}\|_{2}}^{3} =\displaystyle= 𝔼1n(V(L))12((LZ1)1(I+L)1)23\displaystyle\mathbb{E}\|\frac{1}{\sqrt{n}}(-V(L^{\star}))^{\frac{1}{2}}\big{(}(L^{\star}_{Z_{1}})^{-1}-(I+L^{\star})^{-1}\big{)}\|_{2}^{3} (3.35)
\displaystyle\leq (1n)3𝔼(V(L))12((LZ1)1(I+L)1)23.\displaystyle(\frac{1}{\sqrt{n}})^{3}\mathbb{E}\|(-V(L^{\star}))^{\frac{1}{2}}\big{(}(L^{\star}_{Z_{1}})^{-1}-(I+L^{\star})^{-1}\big{)}\|^{3}_{2}.

Since 𝔼(V(L))12((LZ1)1(I+L)1)23\mathbb{E}\|(-V(L^{\star}))^{\frac{1}{2}}\big{(}(L^{\star}_{Z_{1}})^{-1}-(I+L^{\star})^{-1}\big{)}\|_{2}^{3} is a constant, we get

I21C6NnI_{21}\leq C_{6}\sqrt{\frac{N}{n}} (3.36)

For I22I_{22}, since Z can be viewed as a standard Gaussian random vector, we have

I22\displaystyle I_{22} =\displaystyle= 2((Z<x)(Z<xkn𝟙))\displaystyle 2\big{(}\mathbb{P}(Z<x)-\mathbb{P}(Z<x-k_{n}\mathbbm{1})\big{)} (3.37)
\displaystyle\leq 2i,j=1N(xijknZijxij)\displaystyle 2\sum_{i,j=1}^{N}\mathbb{P}(x_{ij}-k_{n}\leq Z_{ij}\leq x_{ij})
\displaystyle\leq 2N22πkn\displaystyle\frac{2N^{2}}{\sqrt{2\pi}}k_{n}

Combining (3.36) and (3.37) with the previous bound (3.32), which controls I23I_{23}, and taking kn=n14k_{n}=n^{-\frac{1}{4}}, we conclude that

I2C7n4,I_{2}\leq\frac{C_{7}}{\sqrt[4]{n}},

where C7C_{7} is a constant.

Step 4. As for I3I_{3} we can use the same argument as above and conclude that I3I_{3} is bounded by C8n14C_{8}\cdot n^{-\frac{1}{4}} for some constant C8C_{8}. This completes the proof of the theorem. ∎

4 Simulation study

In this section we test the MLE for DPPs numerically. Since in general there is no analytic expression for the MLE defined in (3.3), we need to use numerical algorithms to find it. We try the well-known Newton-Raphson algorithm and compare it with the stochastic gradient descent algorithm.

Algorithms. First, we explain the formula of the Newton-Raphson algorithm applied to our mle of DPPs. Recall that the gradient of the maximum likelihood function is given by (3.4):

dΦ^n(L)=J[N]p^JLJ1(L+I)1.\mathop{}\!\mathrm{d}\hat{\Phi}_{n}(L)=\sum_{J\subseteq[N]}\hat{p}_{J}L_{J}^{-1}-(L+I)^{-1}.

To use Newton-Raphson algorithm we think of dΦ^n(L)\mathop{}\!\mathrm{d}\hat{\Phi}_{n}(L) as an N×NN\times N vector-valued function (dΦ^n(L)11,dΦ^n(L)12,,dΦ^n(L)1N,dΦ^n(L)21,,dΦ^n(L)N1,dΦ^n(L)NN)(\mathop{}\!\mathrm{d}\hat{\Phi}_{n}(L)_{11},\mathop{}\!\mathrm{d}\hat{\Phi}_{n}(L)_{12},...,\mathop{}\!\mathrm{d}\hat{\Phi}_{n}(L)_{1N},\mathop{}\!\mathrm{d}\hat{\Phi}_{n}(L)_{21},...,\mathop{}\!\mathrm{d}\hat{\Phi}_{n}(L)_{N1},...\mathop{}\!\mathrm{d}\hat{\Phi}_{n}(L)_{NN}) of N×NN\times N variables (L11,L12,,L1N,L21,,LN1,LNN)(L_{11},L_{12},...,L_{1N},L_{21},...,L_{N1},...L_{NN}). Since

dL1dLij\displaystyle\frac{\mathop{}\!\mathrm{d}L^{-1}}{\mathop{}\!\mathrm{d}L_{ij}} =L1dLdLijL1\displaystyle=-L^{-1}\cdot\frac{\mathop{}\!\mathrm{d}L}{\mathop{}\!\mathrm{d}L_{ij}}\cdot L^{-1}
=[Lki1Ljl1]kl,\displaystyle=-[L^{-1}_{ki}L^{-1}_{jl}]_{kl}, (4.1)

the Jacobian matrix of dΦ^n(L)\mathop{}\!\mathrm{d}\hat{\Phi}_{n}(L), that is, the Hessian of Φ^n(L)\hat{\Phi}_{n}(L), is given by the N×NN\times N by N×NN\times N matrix

d2Φ^n(L)=\displaystyle\mathop{}\!\mathrm{d}^{2}\hat{\Phi}_{n}(L)= J[N]p^J[(Lki1Ljl1)𝕀{iJ,jJ,kJ,lJ}]ij,kl\displaystyle-\sum_{J\subseteq[N]}\hat{p}_{J}\big{[}(L^{-1}_{ki}L^{-1}_{jl})\mathbb{I}\{i\in J,j\in J,k\in J,l\in J\}\big{]}_{ij,kl}
+[(L+I)ki1(L+I)jl1]ij,kl,\displaystyle+\big{[}(L+I)^{-1}_{ki}(L+I)^{-1}_{jl}\big{]}_{ij,kl}, (4.2)

where the indices ijij and klkl are ordered as (1,1),,(1,N),(2,1),,(2,N),,(N,1),,(N,N).(1,1),...,(1,N),(2,1),...,(2,N),...,(N,1),...,(N,N).

Then the Newton-Raphson update at the (m+1)(m+1)-th iteration is given by

Lm+1=Lm(d2Φ^n(Lm))1dΦ^n(Lm).L_{m+1}=L_{m}-(\mathop{}\!\mathrm{d}^{2}\hat{\Phi}_{n}(L_{m}))^{-1}\cdot\mathop{}\!\mathrm{d}\hat{\Phi}_{n}(L_{m}). (4.3)

The matrix d2Φ^n(Lm)\mathop{}\!\mathrm{d}^{2}\hat{\Phi}_{n}(L_{m}) is N2×N2N^{2}\times N^{2} and the computation of its inverse may need some time when NN is large.
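A minimal sketch of one Newton-Raphson step follows, assuming NumPy and following (3.4) and (4.2)-(4.3) with L treated as an N^2-dimensional vector; the helper names (empirical_freqs, padded_inverse, newton_step) are ours. For a symmetric matrix M, the array with entries M_{ki} M_{jl} indexed by (ij, kl) is the Kronecker product kron(M, M), which the sketch uses to assemble the Hessian.

```python
from collections import Counter
import numpy as np

def empirical_freqs(samples):
    """Empirical probabilities p_hat(J) of the observed subsets."""
    n = len(samples)
    return {J: c / n for J, c in Counter(tuple(sorted(s)) for s in samples).items()}

def padded_inverse(L, J):
    """(L_J)^{-1} embedded into an N x N matrix, zero outside J."""
    N = L.shape[0]
    M = np.zeros((N, N))
    if J:
        idx = np.ix_(J, J)
        M[idx] = np.linalg.inv(L[idx])
    return M

def newton_step(L, freqs):
    """One update (4.3): vec(L) <- vec(L) - [d^2 Phi_hat(L)]^{-1} vec(d Phi_hat(L))."""
    N = L.shape[0]
    W = np.linalg.inv(L + np.eye(N))
    grad = -W + sum(p * padded_inverse(L, J) for J, p in freqs.items())
    # Hessian (4.2): kron(M, M) has entries M_ki M_jl at position (ij, kl) for symmetric M.
    hess = np.kron(W, W) - sum(p * np.kron(padded_inverse(L, J), padded_inverse(L, J))
                               for J, p in freqs.items())
    L_new = (L.reshape(-1) - np.linalg.solve(hess, grad.reshape(-1))).reshape(N, N)
    return (L_new + L_new.T) / 2.0     # re-symmetrize against rounding errors

# Usage sketch: freqs = empirical_freqs(samples); L = L0
# for _ in range(100): L = newton_step(L, freqs)
```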

The Newton-Raphson algorithm is thus time consuming, in particular when NN is large, since the inverse of the N2×N2N^{2}\times N^{2} matrix d2Φ^n(Lm)\mathop{}\!\mathrm{d}^{2}\hat{\Phi}_{n}(L_{m}) must be computed at each iteration step. For this reason we also try the stochastic gradient descent method. This method depends on a step size, which we choose to be η=0.1\eta=0.1, as well as on an initial kernel L0L_{0}. At the (m+1)(m+1)-th iteration we draw one sample ZρZ_{\rho} from our data set by simple random sampling and update the kernel by

Lm+1=Lm+η((LmZρ)1(Lm+I)1).L_{m+1}=L_{m}+\eta\big{(}({L_{{m}_{Z_{\rho}}}})^{-1}-(L_{m}+I)^{-1}\big{)}. (4.4)

In this way we reduce the high computational burden of the Newton-Raphson update (4.3) at each iteration, but the algorithm has a slower convergence rate.
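A minimal sketch of the SGD iteration (4.4), assuming NumPy; sgd_step is our illustrative name, and (L_{Z_ρ})^{-1} is again zero-padded to an N x N matrix.

```python
import numpy as np

def sgd_step(L, Z, eta=0.1):
    """One stochastic gradient step (4.4) driven by a single observed subset Z."""
    N = L.shape[0]
    G = -np.linalg.inv(L + np.eye(N))
    if Z:
        idx = np.ix_(list(Z), list(Z))
        G[idx] += np.linalg.inv(L[idx])   # zero-padded (L_Z)^{-1} minus (L + I)^{-1}
    return L + eta * G

# Usage sketch:
# rng = np.random.default_rng(0); L = L0
# for _ in range(60000):
#     Z = samples[rng.integers(len(samples))]   # simple random sampling from the data
#     L = sgd_step(L, Z, eta=0.1)
```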

Simulation results. In the simulation study we consider two 33 by 33 kernels H1,H2H_{1},H_{2} and one 22 by 22 kernel H3H_{3}. To see the trend of convergence, for each kernel we generate 300, 1000, 3000, 10000, and 30000 synthetic samples using the algorithm proposed in [HKPV06]. The simulation results show that the Newton-Raphson method outperforms SGD. For the first DPP, the Newton-Raphson algorithm shows a very slow trend of convergence to the true kernel H1H_{1}, whereas SGD only estimates the diagonal elements well. For the second, independent DPP, both the Newton-Raphson and SGD methods converge to the true kernel H2H_{2} quite well. For the last one, Newton-Raphson converges to a wrong critical point while SGD is numerically unstable.
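For completeness, the synthetic samples can be generated with the spectral algorithm of [HKPV06]. The following is a minimal sketch of such a sampler, assuming NumPy; sample_dpp is our illustrative name, and the implementation details (column choice, re-orthonormalization via QR) are one common variant rather than the paper's exact code.

```python
import numpy as np

def sample_dpp(L, rng):
    """Draw one subset from the L-ensemble DPP via the spectral (HKPV-type) scheme."""
    lam, V = np.linalg.eigh(L)
    # Phase 1: keep eigenvector n with probability lambda_n / (1 + lambda_n).
    V = V[:, rng.random(len(lam)) < lam / (1.0 + lam)]
    sample = []
    # Phase 2: pick one item per remaining eigenvector, projecting after each pick.
    while V.shape[1] > 0:
        probs = np.sum(V**2, axis=1) / V.shape[1]
        probs /= probs.sum()
        i = rng.choice(len(probs), p=probs)
        sample.append(i)
        col = np.argmax(np.abs(V[i, :]))          # a column with nonzero i-th entry
        v = V[:, col]
        V = V - np.outer(v, V[i, :] / v[i])       # make every column vanish at row i
        V = np.delete(V, col, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)                # re-orthonormalize
    return sorted(sample)
```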

Our simulation results for n=30000n=30000 are summarized in Table 1, which suggests that learning the dependence and the repulsive behavior between items is beyond the reach of these general-purpose numerical methods. This unsatisfactory result may arise because, on the one hand, both algorithms involve computing matrix inverses at each iteration, which can produce large values and hence become unstable. On the other hand, the likelihood function of a DPP is usually non-concave and has at least exponentially many saddle points [BMRU17]. This unsatisfactory performance of the two well-known algorithms motivates us to find an effective method suited to maximizing the likelihood function of a DPP. In the next section, this issue is resolved for the two by two kernel, since we can obtain the explicit maximizer in this case.

True kernel   Initial kernel   Newton-Raphson (k=100k=100 iterations)   SGD (k=60000k=60000 iterations)
(10.200.220.300.33)\begin{pmatrix}1&0.2&0\\ 0.2&2&0.3\\ 0&0.3&3\end{pmatrix} (10.100.110.100.11)\begin{pmatrix}1&0.1&0\\ 0.1&1&0.1\\ 0&0.1&1\end{pmatrix} (0.9940.1670.0480.1672.0130.2410.0480.2413.029)\begin{pmatrix}0.994&0.167&-0.048\\ 0.167&2.013&0.241\\ -0.048&0.241&3.029\end{pmatrix} (0.9260.0150.0010.0152.3230.0250.0010.0253.068)\begin{pmatrix}0.926&0.015&0.001\\ 0.015&2.323&0.025\\ 0.001&0.025&3.068\end{pmatrix}
(700050009)\begin{pmatrix}7&0&0\\ 0&5&0\\ 0&0&9\end{pmatrix} (100010001)\begin{pmatrix}1&0&0\\ 0&1&0\\ 0&0&1\end{pmatrix} (7.0090005.0700009.0348)\begin{pmatrix}7.009&0&0\\ 0&5.070&0\\ 0&0&9.0348\end{pmatrix} (7.0020005.1560009.035)\begin{pmatrix}7.002&0&0\\ 0&5.156&0\\ 0&0&9.035\end{pmatrix}
(1112)\begin{pmatrix}1&1\\ 1&2\\ \end{pmatrix} (0.50.10.10.5)\begin{pmatrix}0.5&0.1\\ 0.1&0.5\\ \end{pmatrix} (0.657001.499)\begin{pmatrix}0.657&0\\ 0&1.499\\ \end{pmatrix} Numerically unstable
Table 1: We test the performance of the algorithms for three different true kernels when the sample size is 30000. We report the result after a specified number kk of iterations. In the experiment, we observe that both algorithms become unstable as the number of iterations increases beyond a certain point, so we choose k=100k=100 for Newton-Raphson and k=60000k=60000 for SGD, since each SGD iteration needs less time to complete.

5 Two-by-two block kernel

Due to the non-identifiability of DPPs illustrated in [Kul12, Theorem 4.1], which we mentioned earlier in Section 3, we have to use L~n\tilde{L}_{n} defined by (3.17) as our maximum likelihood estimator. This formula involves a choice of D^n\hat{D}_{n}, which may also be random, and the way to choose D^n\hat{D}_{n} is not available to us in general. To avoid this choice of D^n\hat{D}_{n}, we show in this section that if the kernel of the determinantal point process is a two-by-two symmetric positive semi-definite matrix, we can obtain the desired results without involving D^n\hat{D}_{n}. Our result can also be immediately extended to block-diagonal kernels with two-by-two blocks. Since the commonly used iterative methods presented in the previous section are not so effective in finding the maximum likelihood estimator numerically, the explicit analytic form obtained here is also very useful for computing the estimator in practice. On the other hand, our method, effective for two-by-two kernels, is difficult to extend to more general higher dimensional kernels.

Let ZDPP(L)Z\sim\text{DPP}(L^{\star}), where L=(abbc)L^{\star}=\bigg{(}\begin{matrix}a^{*}&b^{*}\\ b^{*}&c^{*}\end{matrix}\bigg{)}, and the ground set be 𝒴=[2]\mathcal{Y}=[2]. To ensure L𝒮[2]++L^{\star}\in\mathcal{S}_{[2]}^{++}, we assume

a,c>0a^{*},c^{*}>0

and

acb20.a^{*}c^{*}-{b^{*}}^{2}\geq 0.

We can always assume bb^{*} is non-negative since, by the non-identifiability of DPPs, (abbc)\bigg{(}\begin{matrix}a&b\\ b&c\end{matrix}\bigg{)} and (abbc)\bigg{(}\begin{matrix}a&-b\\ -b&c\end{matrix}\bigg{)} give the same DPP. For ease of notation, let p^0,p^1,p^2,p^3\hat{p}_{0},\hat{p}_{1},\hat{p}_{2},\hat{p}_{3} denote the empirical probabilities of the subsets \emptyset, {1}\{1\}, {2}\{2\}, {1,2}\{1,2\} respectively, and let p0,p1,p2,p3p_{0},p_{1},p_{2},p_{3} denote the corresponding theoretical probabilities. The relationship between (a,b,c)(a,b,c) and (p0,p1,p2,p3)(p_{0},p_{1},p_{2},p_{3}) is given by

(a,b,c)=(p1p0,p1p2p0p3p0,p2p0),({a},{b},{c})=\Big{(}\frac{p_{1}}{p_{0}},\frac{\sqrt{p_{1}p_{2}-p_{0}p_{3}}}{p_{0}},\frac{p_{2}}{p_{0}}\Big{)},

and inversely

p0=1(a+1)(c+1)b2,p1=a(a+1)(c+1)b2,p2=c(a+1)(c+1)b2,p3=acb2(a+1)(c+1)b2.\begin{split}p_{0}=&\frac{1}{(a+1)(c+1)-b^{2}}\,,\qquad p_{1}=\frac{a}{(a+1)(c+1)-b^{2}}\,,\\ p_{2}=&\frac{c}{(a+1)(c+1)-b^{2}}\,,\qquad p_{3}=\frac{ac-b^{2}}{(a+1)(c+1)-b^{2}}\,.\end{split}

The likelihood function defined in (3.1) becomes now

Φ^n(L)\displaystyle\hat{\Phi}_{n}(L) =J[2]p^Jlogdet(LJ)logdet(L+I)\displaystyle=\sum_{J\subseteq[2]}\hat{p}_{J}\log\det(L_{J})-\log\det(L+I)
=p^1loga+p^2logc+p^3log(acb2)log[(a+1)(c+1)b2]\displaystyle=\hat{p}_{1}\log a+\hat{p}_{2}\log c+\hat{p}_{3}\log(ac-b^{2})-\log[(a+1)(c+1)-b^{2}] (5.1)

To find the critical points of (5.1) we first set the partial derivative of Φ^n(L)\hat{\Phi}_{n}(L) with respect to bb equal to zero and get

Φ^n(L)b=\displaystyle\frac{\partial\hat{\Phi}_{n}(L)}{\partial b}= 2p^3bacb2+2b(a+1)(c+1)b2=0.\displaystyle-\frac{2\hat{p}_{3}b}{ac-b^{2}}+\frac{2b}{(a+1)(c+1)-b^{2}}=0. (5.2)

Then bb is either equal to 0 or satisfies

b2=ac(a+1)(c+1)p^31p^3.b^{2}=\frac{ac-(a+1)(c+1)\hat{p}_{3}}{1-\hat{p}_{3}}. (5.3)

If b=0b=0, then setting the partial derivatives with respect to aa and cc to zero and noticing that p^0+p^1+p^2+p^3=1\hat{p}_{0}+\hat{p}_{1}+\hat{p}_{2}+\hat{p}_{3}=1, we get the first critical point

(a^,b^,c^)=(p^1+p^3p^0+p^2,0,p^2+p^3p^0+p^1).(\hat{a},\hat{b},\hat{c})=\Bigg{(}\frac{\hat{p}_{1}+\hat{p}_{3}}{\hat{p}_{0}+\hat{p}_{2}},0,\frac{\hat{p}_{2}+\hat{p}_{3}}{\hat{p}_{0}+\hat{p}_{1}}\Bigg{)}. (5.4)

We remark that this is the point to which the Newton-Raphson method converges in the last simulation study. This critical point exists only if p^0+p^2\hat{p}_{0}+\hat{p}_{2} and p^0+p^1\hat{p}_{0}+\hat{p}_{1} are nonzero. Since each empirical probability converges to its corresponding theoretical probability almost surely and p0>0p_{0}>0, the strong law of large numbers implies that the critical point exists almost surely when nn is sufficiently large.

If b0b\not=0, then we can use (5.3) to estimate b^\hat{b} once a^,c^\hat{a},\hat{c} are obtained:

b^=a^c^(a^+1)(c^+1)p^31p^3.\hat{b}=\sqrt{\frac{\hat{a}\hat{c}-(\hat{a}+1)(\hat{c}+1)\hat{p}_{3}}{1-\hat{p}_{3}}}. (5.5)

To find the maximum likelihood estimators a^\hat{a} and c^\hat{c} of aa and cc we plug (5.3) into Φ^n(L)\hat{\Phi}_{n}(L) to obtain

Φ^n(L)=p^1loga+p^2logc+(p^31)log(a+c+1)+p^3logp^31p^3+log(1p^3).\hat{\Phi}_{n}(L)=\hat{p}_{1}\log a+\hat{p}_{2}\log c+(\hat{p}_{3}-1)\log(a+c+1)+\hat{p}_{3}\log\frac{\hat{p}_{3}}{1-\hat{p}_{3}}+\log(1-\hat{p}_{3}). (5.6)

Letting Φ^n(L)a\frac{\partial\hat{\Phi}_{n}(L)}{\partial a} and Φ^n(L)c\frac{\partial\hat{\Phi}_{n}(L)}{\partial c} equal zero yields

{Φ^n(L)a=p^1a+p^31a+c+1=0Φ^n(L)c=p^2c+p^31a+c+1=0.\displaystyle\begin{cases}\frac{\partial\hat{\Phi}_{n}(L)}{\partial a}=\frac{\hat{p}_{1}}{a}+\frac{\hat{p}_{3}-1}{a+c+1}=0\\ \frac{\partial\hat{\Phi}_{n}(L)}{\partial c}=\frac{\hat{p}_{2}}{c}+\frac{\hat{p}_{3}-1}{a+c+1}=0.\\ \end{cases} (5.7)

The above system of equations can be solved explicitly, and combining the solution with (5.5) yields

(a^,b^,c^)=(p^1p^0,p^1p^2p^0p^3p^0,p^2p^0),(\hat{a},\hat{b},\hat{c})=\Big{(}\frac{\hat{p}_{1}}{\hat{p}_{0}},\frac{\sqrt{\hat{p}_{1}\hat{p}_{2}-\hat{p}_{0}\hat{p}_{3}}}{\hat{p}_{0}},\frac{\hat{p}_{2}}{\hat{p}_{0}}\Big{)}, (5.8)

from which we see that this critical point exists only if p^0>0\hat{p}_{0}>0 and p^1p^2p^0p^30\hat{p}_{1}\hat{p}_{2}-\hat{p}_{0}\hat{p}_{3}\geq 0. Again by the strong law of large numbers, the second critical point also exists and converges to the true value almost surely. In fact, we have, almost surely,

p^1p^0p1p0=a,p^1p^2p^0p^3p^0p1p2p0p3p0=b,p^2p^0c.\frac{\hat{p}_{1}}{\hat{p}_{0}}\to\frac{p_{1}}{p_{0}}=a^{*},\quad\frac{\sqrt{\hat{p}_{1}\hat{p}_{2}-\hat{p}_{0}\hat{p}_{3}}}{\hat{p}_{0}}\to\frac{\sqrt{p_{1}p_{2}-p_{0}p_{3}}}{p_{0}}=b^{*},\quad\frac{\hat{p}_{2}}{\hat{p}_{0}}\to c^{*}.
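A minimal sketch of the closed-form estimator (5.8), assuming NumPy and that p̂_0 > 0 (which, as noted above, holds almost surely for large n); mle_2x2 is our illustrative name and the ground set is indexed {0, 1} instead of {1, 2}.

```python
from collections import Counter
import numpy as np

def mle_2x2(samples):
    """Closed-form MLE (5.8) for a 2 x 2 kernel from observed subsets of {0, 1}."""
    n = len(samples)
    counts = Counter(frozenset(Z) for Z in samples)
    p0 = counts[frozenset()] / n
    p1 = counts[frozenset({0})] / n
    p2 = counts[frozenset({1})] / n
    p3 = counts[frozenset({0, 1})] / n
    a_hat = p1 / p0
    c_hat = p2 / p0
    b_hat = np.sqrt(max(p1 * p2 - p0 * p3, 0.0)) / p0   # clip tiny negative values
    return np.array([[a_hat, b_hat], [b_hat, c_hat]])
```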

We test this estimator on the 22 by 22 kernel from the earlier simulation study by taking n=300,3000,10000,30000n=300,3000,10000,30000, and find that the estimates are, respectively,

(0.85980.86610.86611.9159),(0.94540.91840.91841.9868),(1.02471.02531.02532.0241),(0.99421.00561.00562.0015),\begin{pmatrix}0.8598&0.8661\\ 0.8661&1.9159\\ \end{pmatrix},\begin{pmatrix}0.9454&0.9184\\ 0.9184&1.9868\\ \end{pmatrix},\begin{pmatrix}1.0247&1.0253\\ 1.0253&2.0241\\ \end{pmatrix},\begin{pmatrix}0.9942&1.0056\\ 1.0056&2.0015\\ \end{pmatrix},

which shows that the estimator (5.8) is numerically consistent with the true kernel.

Furthermore, we can establish the central limit theorem for the estimator (5.8), which corresponds to the result in Theorem 3.4.

Theorem 5.1.

Assume b>0b^{*}>0. Then the estimator (a^,b^,c^)(\hat{a},\hat{b},\hat{c}) in (5.8) is asymptotically normal:

n((a^,b^,c^)(a,b,c))n𝒩(𝟎,V(a,b,c)),\sqrt{n}((\hat{a},\hat{b},\hat{c})-(a^{*},b^{*},c^{*}))\xrightarrow[n\longrightarrow\infty]{}\mathcal{N}(\bm{0},-V({a^{*},b^{*},c^{*}})), (5.9)

where the convergence holds in distribution and V(a,b,c)V({a^{*},b^{*},c^{*}}) is the inverse of the Hessian matrix of the expected maximum likelihood function Φ(a,b,c)=p1loga+p2logc+p3log(acb2)log[(a+1)(c+1)b2]\Phi(a,b,c)=p_{1}\log a+p_{2}\log c+p_{3}\log(ac-b^{2})-\log[(a+1)(c+1)-b^{2}].

Proof.

Let Z1,,ZnZ_{1},...,Z_{n} be nn independent samples of ZDPP(L)Z\sim\text{DPP}(L^{*}), where L=(abbc)L^{*}=\bigg{(}\begin{matrix}a^{*}&b^{*}\\ b^{*}&c^{*}\end{matrix}\bigg{)}. Let XiX_{i} be the random vector (𝕀{Zi=},𝕀{Zi={1}},𝕀{Zi={2}},𝕀{Zi={1,2}})T(\mathbb{I}_{\{Z_{i}=\emptyset\}},\mathbb{I}_{\{Z_{i}=\{1\}\}},\mathbb{I}_{\{Z_{i}=\{2\}\}},\mathbb{I}_{\{Z_{i}=\{1,2\}\}})^{T}, where 𝕀{}\mathbb{I}_{\{\cdot\}} stands for the indicator random variable. Then XiX_{i} has mean 𝝁=(p0,p1,p2,p3)T\bm{\mu}=(p_{0},p_{1},p_{2},p_{3})^{T} and covariance matrix

𝚺=(p0p02p0p1p0p2p0p3p0p1p1p12p1p2p1p3p0p2p1p2p2p22p2p3p0p3p1p3p2p3p3p32).\mathbf{\Sigma}=\begin{pmatrix}p_{0}-p_{0}^{2}&-p_{0}p_{1}&-p_{0}p_{2}&-p_{0}p_{3}\\ -p_{0}p_{1}&p_{1}-p_{1}^{2}&-p_{1}p_{2}&-p_{1}p_{3}\\ -p_{0}p_{2}&-p_{1}p_{2}&p_{2}-p_{2}^{2}&-p_{2}p_{3}\\ -p_{0}p_{3}&-p_{1}p_{3}&-p_{2}p_{3}&p_{3}-p_{3}^{2}\end{pmatrix}\,.

By the central limit theorem, \sqrt{n}(\overline{X}_{n}-\bm{\mu}) converges in distribution to a multivariate normal with mean \bm{0} and covariance \bm{\Sigma}. Let the function g:\mathbb{R}^{4}\to\mathbb{R}^{3} be defined by

g(x_{1},x_{2},x_{3},x_{4})=\Big(\frac{x_{2}}{x_{1}},\frac{\sqrt{x_{2}x_{3}-x_{1}x_{4}}}{x_{1}},\frac{x_{3}}{x_{1}}\Big).

Its Jacobian matrix \dot{g}(\bm{x})=\big[\frac{\partial g_{i}}{\partial x_{j}}\big]_{3\times 4} is given by

\begin{pmatrix}-\frac{x_{2}}{x_{1}^{2}}&\frac{1}{x_{1}}&0&0\\ -\frac{x_{4}}{2x_{1}\sqrt{x_{2}x_{3}-x_{1}x_{4}}}-\frac{\sqrt{x_{2}x_{3}-x_{1}x_{4}}}{x_{1}^{2}}&\frac{x_{3}}{2x_{1}\sqrt{x_{2}x_{3}-x_{1}x_{4}}}&\frac{x_{2}}{2x_{1}\sqrt{x_{2}x_{3}-x_{1}x_{4}}}&-\frac{1}{2\sqrt{x_{2}x_{3}-x_{1}x_{4}}}\\ -\frac{x_{3}}{x_{1}^{2}}&0&\frac{1}{x_{1}}&0\end{pmatrix}.

We are now in a position to apply the delta method [VdV00] to obtain

\sqrt{n}\big((\hat{a},\hat{b},\hat{c})-(a^{*},b^{*},c^{*})\big)=\sqrt{n}\big(g(\overline{X}_{n})-g(\bm{\mu})\big)\xrightarrow{d}\mathcal{N}\big(\bm{0},\dot{g}(\bm{\mu})\bm{\Sigma}\dot{g}(\bm{\mu})^{\prime}\big).

After some tedious matrix computations, \dot{g}(\bm{\mu})\bm{\Sigma}\dot{g}(\bm{\mu})^{\prime} is found to be

D\begin{pmatrix}a^{*}+{a^{*}}^{2}&\sigma_{12}&\sigma_{13}\\ \sigma_{12}&\frac{\frac{a^{*}c^{*}}{{b^{*}}^{2}}-1}{4}D+\frac{a^{*}+c^{*}+4a^{*}c^{*}}{4}&\sigma_{23}\\ \sigma_{13}&\sigma_{23}&c^{*}+{c^{*}}^{2}\end{pmatrix},

where

\left\{\begin{aligned}D&=(a^{*}+1)(c^{*}+1)-{b^{*}}^{2};\\ \sigma_{12}&=\frac{a^{*}c^{*}}{2b^{*}}+a^{*}b^{*}+\frac{a^{*}}{2b^{*}}(a^{*}c^{*}-{b^{*}}^{2});\\ \sigma_{13}&=a^{*}c^{*};\\ \sigma_{23}&=\frac{a^{*}c^{*}}{2b^{*}}+b^{*}c^{*}+\frac{c^{*}}{2b^{*}}(a^{*}c^{*}-{b^{*}}^{2}).\end{aligned}\right.

It is straightforward to verify that the above matrix equals the negative of the inverse of the Hessian matrix of the expected maximum likelihood function \Phi(L) at the true value, that is, -V(a^{*},b^{*},c^{*}), which verifies Theorem 3.4 in this special case. Note, however, that in this two-by-two case the maximum likelihood estimator is unique without resorting to the identifiability maneuver (3.17). ∎
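As a sanity check, the delta-method covariance \dot{g}(\bm{\mu})\bm{\Sigma}\dot{g}(\bm{\mu})^{\prime} from the proof can be evaluated numerically and compared entrywise with the closed-form matrix above, or with a finite-difference Hessian of \Phi. The sketch below is our illustration; the true values a^{*}=1, b^{*}=1, c^{*}=2 passed at the end are an assumption for the sake of the example.

import numpy as np

def delta_method_covariance(a, b, c):
    """Asymptotic covariance g'(mu) Sigma g'(mu)^T of (a_hat, b_hat, c_hat)."""
    D = (a + 1) * (c + 1) - b * b
    mu = np.array([1.0, a, c, a * c - b * b]) / D     # (p0, p1, p2, p3)
    Sigma = np.diag(mu) - np.outer(mu, mu)            # multinomial covariance of X_i
    x1, x2, x3, x4 = mu
    r = np.sqrt(x2 * x3 - x1 * x4)                    # requires b > 0
    J = np.array([
        [-x2 / x1**2,                    1 / x1,            0,                 0],
        [-x4 / (2 * x1 * r) - r / x1**2, x3 / (2 * x1 * r), x2 / (2 * x1 * r), -1 / (2 * r)],
        [-x3 / x1**2,                    0,                 1 / x1,            0],
    ])
    return J @ Sigma @ J.T

print(delta_method_covariance(1.0, 1.0, 2.0))   # assumed true values, for illustration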

This idea extends to a block diagonal ensemble whose blocks are two-by-two submatrices. Suppose L^{\star} is a block diagonal matrix with k two-by-two blocks J_{1},\dots,J_{k},

\begin{pmatrix}J_{1}&&&\\ &J_{2}&&\\ &&\ddots&\\ &&&J_{k}\end{pmatrix}=\begin{pmatrix}a_{1}&b_{1}&&&&&\\ b_{1}&c_{1}&&&&&\\ &&a_{2}&b_{2}&&&\\ &&b_{2}&c_{2}&&&\\ &&&&\ddots&&\\ &&&&&a_{k}&b_{k}\\ &&&&&b_{k}&c_{k}\end{pmatrix}, (5.10)

where a_{i}, b_{i}, c_{i}>0 and a_{i}c_{i}-b_{i}^{2}\geq 0 for each 1\leq i\leq k. Let the ground set \mathcal{Y} of this DPP be \{J_{1}^{1},J_{1}^{2},J_{2}^{1},J_{2}^{2},\dots,J_{k}^{1},J_{k}^{2}\} and, for each 1\leq i\leq k, define

\hat{p}_{J_{i}}^{0}=\frac{1}{n}\sum_{m=1}^{n}\mathbb{I}\{J_{i}^{1}\notin Z_{m},\,J_{i}^{2}\notin Z_{m}\}, (5.11)
\hat{p}_{J_{i}}^{1}=\frac{1}{n}\sum_{m=1}^{n}\mathbb{I}\{J_{i}^{1}\in Z_{m},\,J_{i}^{2}\notin Z_{m}\}, (5.12)
\hat{p}_{J_{i}}^{2}=\frac{1}{n}\sum_{m=1}^{n}\mathbb{I}\{J_{i}^{1}\notin Z_{m},\,J_{i}^{2}\in Z_{m}\}, (5.13)
\hat{p}_{J_{i}}^{3}=\frac{1}{n}\sum_{m=1}^{n}\mathbb{I}\{J_{i}^{1}\in Z_{m},\,J_{i}^{2}\in Z_{m}\}, (5.14)

where Z_{1},\dots,Z_{n} are n independent samples drawn from \text{DPP}(L^{\star}). By the construction (5.10), the restrictions Z\cap\{J_{1}^{1},J_{1}^{2}\},\dots,Z\cap\{J_{k}^{1},J_{k}^{2}\} are mutually independent, and the restriction to the i-th block is itself distributed as \text{DPP}(J_{i}). The critical point result for the two-by-two matrix can therefore be applied blockwise:

(\hat{a}_{i},\hat{b}_{i},\hat{c}_{i})=\Bigg(\frac{\hat{p}_{J_{i}}^{1}}{\hat{p}_{J_{i}}^{0}},\frac{\sqrt{\hat{p}_{J_{i}}^{1}\hat{p}_{J_{i}}^{2}-\hat{p}_{J_{i}}^{0}\hat{p}_{J_{i}}^{3}}}{\hat{p}_{J_{i}}^{0}},\frac{\hat{p}_{J_{i}}^{2}}{\hat{p}_{J_{i}}^{0}}\Bigg), (5.15)

for every 1\leq i\leq k.
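Since the blocks decouple, (5.15) is simply the two-by-two closed form applied block by block. A minimal sketch, reusing the hypothetical mle_2x2 function from above and assuming the samples encode the ground set as {0, 1, ..., 2k-1} with items 2i and 2i+1 forming the (i+1)-th block, is:

def mle_block_2x2(samples, k):
    """Closed-form MLE (5.15) for a block diagonal kernel with k two-by-two blocks."""
    estimates = []
    for i in range(k):
        block_items = {2 * i, 2 * i + 1}
        # Restrict each observed subset to the two items of block i and
        # relabel them as 0 and 1 so that mle_2x2 can be reused directly.
        restricted = [{x - 2 * i for x in (s & block_items)} for s in samples]
        estimates.append(mle_2x2(restricted))   # (a_hat_i, b_hat_i, c_hat_i)
    return estimates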

However, the above method encounters difficulties when the kernel has dimension higher than 2 and is of general form. For example, if the kernel is a 3\times 3 matrix

\begin{pmatrix}a&d&e\\ d&b&f\\ e&f&c\end{pmatrix},

then setting the gradient of the likelihood function \hat{\Phi}_{n}(L) equal to zero yields

\mathrm{d}\hat{\Phi}_{n}(L)=\sum_{J\subseteq[3]}\hat{p}_{J}L_{J}^{-1}-(L+I)^{-1}=0\,.

Computing the inverses L_{J}^{-1} (each viewed as a matrix supported on the rows and columns indexed by J) and (L+I)^{-1} explicitly can be cumbersome. For example, L^{-1} is

\frac{1}{a(bc-f^{2})-d(cd-ef)+e(df-be)}\begin{pmatrix}bc-f^{2}&-cd+ef&-be+df\\ -cd+ef&ac-e^{2}&de-af\\ -be+df&de-af&ab-d^{2}\end{pmatrix},

which makes it difficult to obtain an explicit maximum likelihood estimator.
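In the absence of a closed form, one would typically resort to numerical optimization of \hat{\Phi}_{n}. As one possible starting point (our illustration, not an algorithm from the paper), the sketch below evaluates the gradient \sum_{J}\hat{p}_{J}L_{J}^{-1}-(L+I)^{-1}, with each L_{J}^{-1} embedded back into the rows and columns indexed by J; any gradient-based method over symmetric positive definite matrices could then be applied.

import numpy as np
from itertools import combinations

def log_likelihood_gradient(L, p_hat):
    """Gradient of the empirical log-likelihood at a symmetric kernel L.

    `p_hat` maps subsets J of {0, ..., N-1} (as sorted tuples) to their
    empirical frequencies; the empty set contributes nothing since
    log det of the empty submatrix is zero.
    """
    N = L.shape[0]
    grad = -np.linalg.inv(L + np.eye(N))
    for size in range(1, N + 1):
        for J in combinations(range(N), size):
            pJ = p_hat.get(J, 0.0)
            if pJ > 0.0:
                idx = np.ix_(J, J)
                grad[idx] += pJ * np.linalg.inv(L[idx])
    return grad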

6 Conclusion

In this paper, we study the maximum likelihood estimator for the ensemble matrix associated with a determinantal point process. Brunel et al. [BMRU17] show that the expected likelihood function \Phi(L) is locally strongly concave around the true value L^{\star} if and only if L^{\star} is irreducible, since the Hessian matrix of \Phi(L) at L^{\star} is then negative definite. They prove that the maximum likelihood estimator (MLE) is consistent with respect to convergence in probability, and when L^{\star} is irreducible they also obtain the corresponding central limit theorem. We improve their results by showing that the MLE is strongly consistent, that is, consistent with respect to almost sure convergence. Moreover, we obtain a Berry-Esseen type result for the central limit theorem, establishing an n^{-\frac{1}{4}} rate of convergence of the MLE to normality. Lastly, we obtain the explicit form of the MLE when L^{\star} is a two-by-two matrix, or a block diagonal matrix whose blocks are two-by-two matrices. This explicit form enables us to avoid the use of \hat{D}_{n} in (3.17), which is inconvenient in applications. The strong consistency and the central limit theorem follow directly from these explicit forms, illustrating the general strong consistency and central limit theorem proved earlier. It would be interesting to find explicit forms for some particular higher dimensional DPPs. However, since maximum likelihood learning of DPPs has been proven to be NP-hard, an explicit form for general ensembles, even if it were found, would likely be very difficult to compute.

In addition to the maximum likelihood estimator, there are other approaches that can be used in lieu of the MLE; let us mention only one of them. For all J such that |J|\leq 1, we set

\frac{\det(L_{J})}{\det(L+I)}=\hat{p}_{J}, (6.1)

where the left-hand side is the theoretical probability of J and the right-hand side is the empirical probability of J. Taking J=\{i\} suggests the following estimator for L_{ii}:

\hat{L}_{ii}=\frac{\hat{p}_{i}}{\hat{p}_{\emptyset}}. (6.2)

Using equation (6.1) again, now for |J|=2, we are able to determine the off-diagonal elements up to sign:

L^{2}_{ij}=\frac{\hat{p}_{i}\hat{p}_{j}-\hat{p}_{\emptyset}\hat{p}_{\{i,j\}}}{\hat{p}^{2}_{\emptyset}}, (6.3)

where i\neq j. Notice that this coincides with the maximum likelihood estimator when L is two dimensional. The question of how to choose the sign of L_{ij} in (6.3) has been resolved in [UBMR17] using graph theory; we shall not pursue this direction in the current paper.
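A sketch of this moment-type estimator (our illustration; the diagonal entries come from (6.2), the off-diagonal magnitudes from (6.3), and the sign-recovery step of [UBMR17] is deliberately omitted) could look as follows:

import numpy as np

def moment_estimator(samples, N):
    """Entrywise estimate of L with off-diagonal signs left undetermined,
    using the empirical frequencies of subsets of size at most two."""
    n = len(samples)
    p_empty = sum(len(s) == 0 for s in samples) / n
    p_single = [sum(s == {i} for s in samples) / n for i in range(N)]
    L_hat = np.zeros((N, N))
    for i in range(N):
        L_hat[i, i] = p_single[i] / p_empty                        # estimator (6.2)
        for j in range(i + 1, N):
            p_pair = sum(s == {i, j} for s in samples) / n
            sq = (p_single[i] * p_single[j] - p_empty * p_pair) / p_empty**2
            L_hat[i, j] = L_hat[j, i] = np.sqrt(max(sq, 0.0))      # |L_ij| from (6.3)
    return L_hat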

References

  • [BDF10] Alexei Borodin, Persi Diaconis, and Jason Fulman. On adding a list of numbers (and other one-dependent determinantal processes). Bulletin of the American Mathematical Society, 47(4):639–670, 2010.
  • [Ben05] Vidmantas Bentkus. A Lyapunov-type bound in R^{d}. Theory of Probability and Its Applications, 49(2):311–323, 2005.
  • [BMRU17] Victor-Emmanuel Brunel, Ankur Moitra, Philippe Rigollet, and John Urschel. Maximum likelihood estimation of determinantal point processes. arXiv preprint arXiv:1701.06501, 2017.
  • [BP93] Robert Burton and Robin Pemantle. Local characteristics, entropy and limit theorems for spanning trees and domino tilings via transfer-impedances. The Annals of Probability, pages 1329–1371, 1993.
  • [Gin65] Jean Ginibre. Statistical ensembles of complex, quaternion, and real matrices. Journal of Mathematical Physics, 6(3):440–449, 1965.
  • [GJWX22] Elena Grigorescu, Brendan Juba, Karl Wimmer, and Ning Xie. Hardness of maximum likelihood learning of DPPs. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178, pages 3800–3819. PMLR, 2022.
  • [HKPV06] J Ben Hough, Manjunath Krishnapur, Yuval Peres, and Bálint Virág. Determinantal processes and independence. Probability Surveys, 3:206–229, 2006.
  • [KT12] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2–3):123–286, 2012.
  • [Kul12] John A Kulesza. Learning with determinantal point processes. PhD thesis, University of Pennsylvania, 2012.
  • [LDG21] Claire Launay, Agnès Desolneux, and Bruno Galerne. Determinantal point processes for image processing. SIAM J. Imaging Sci., 14(1):304–348, 2021.
  • [Mac75] O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1):83–122, 1975.
  • [UBMR17] John Urschel, Victor-Emmanuel Brunel, Ankur Moitra, and Philippe Rigollet. Learning determinantal point processes with moments and cycles. In International Conference on Machine Learning, pages 3511–3520. PMLR, 2017.
  • [VdV00] Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge University Press, 2000.
  • [Wal49] Abraham Wald. Note on the consistency of the maximum likelihood estimate. The Annals of Mathematical Statistics, 20(4):595–601, 1949.