Asymptotic properties of maximum likelihood estimators for determinantal point processes
Yaozhong Hu (supported by an NSERC Discovery grant and a centennial fund from the University of Alberta).
Email: [email protected]. Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada.
Haiyi Shi (corresponding author). Email: [email protected]. Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada.
Abstract
We obtain the almost sure (strong) consistency and a Berry-Esseen type bound for the maximum likelihood estimator of the ensemble of determinantal point processes (DPPs), strengthening and completing previous work
initiated in Brunel, Moitra, Rigollet, and Urschel [BMRU17]. Numerical algorithms for estimating DPPs are developed and simulation studies are performed. Lastly, we give an explicit formula and a detailed discussion for the maximum likelihood estimator for
block-diagonal ensembles with two-by-two blocks and compare it with the frequency
method.
1 Introduction
Determinantal point processes (DPPs) arise from random matrix theory [Gin65] and were first introduced to give the probability distribution of fermionic systems in thermal equilibrium in quantum physics [Mac75].
Since then, DPPs have appeared in various areas of mathematics, including, for example, loop-free Markov chains [BDF10], edges of uniform spanning trees [BP93], and, more recently, image processing [LDG21].
In the seminal work [KT12], Kulesza and Taskar show that DPPs have
unique characteristics compared with various other probabilistic models: they capture the
global repulsive behavior between items, admit polynomial-time algorithms for statistical inference, and carry geometric intuition. Due to these advantages, DPPs have played important roles in machine learning, especially in subset selection problems such as document summarization, image search, and pose determination [KT12]. These real-world applications necessitate the estimation of the parameters of determinantal
point process models. In this context, the maximum likelihood estimator of the true parameter is a natural
choice, which in our situation leads in general to a non-convex optimization problem. Along
this direction, Kulesza and Taskar split the DPP model into a diversity part and a quality part and learn only the quality part while the diversity part is fixed. They conjectured that the problem of learning the likelihood of DPPs is NP-hard, which was proven by [GJWX22] a decade later. Brunel, Moitra, Rigollet, and Urschel [BMRU17] first studied the local geometry of the expected log-likelihood of DPPs, that is, the curvature of the likelihood function around its maximum. They then proved that the maximum likelihood estimator converges to the true value in probability and established the corresponding central limit theorem. Motivated by these works,
the present paper contains three further contributions.
(i)
For the consistency, we improve the convergence in probability obtained in [BMRU17] to almost sure convergence. This almost sure convergence property is important in applications since in reality we only observe a single sample path. Theoretically, it is well known that proving almost sure convergence is usually harder; this is also the case here. We get around this difficulty by adapting Wald's consistency theorem to our situation.
(ii)
The asymptotic normality of the maximum likelihood estimator is also obtained in [BMRU17]. This means that
the suitably scaled estimation error is approximately a normal random variable. We study the rate of convergence associated with this normal approximation. Namely, we obtain a Berry-Esseen type bound for the maximum likelihood estimator.
(iii)
Due to the non-identifiability of DPPs observed in [Kul12], a sign matrix is involved in the definition of the estimator. In the general case this involvement seems unavoidable. To get rid of it, we consider the special case of block kernels with two-by-two blocks, where we obtain an explicit form of the maximum likelihood estimator,
and consequently no involvement of a sign matrix in these cases. This explicit form makes our
computations much more efficient. When we apply standard numerical methods such as Newton-Raphson or
stochastic gradient descent (SGD) to the general maximum likelihood problem, we can hardly
obtain satisfactory results.
The paper is organized as follows. In Section 2 we introduce some basic definitions and recall
some properties of DPPs.
We present a proof of a well-known result about the characterization of determinantal point processes
(Theorem 2.2 in the next section), a proof of which
we could not find explicitly in the literature. In Section 3 we present our main results on the almost sure consistency and the Berry-Esseen type theorem for the maximum likelihood estimator. Newton-Raphson and stochastic gradient descent (SGD) algorithms are suggested to find the maximum likelihood estimator in the general case in Section 4, and the results are reported although they are not satisfactory. In Section 5, we
discuss the explicit maximum likelihood estimator for the two-by-two block ensembles; since the explicit
form is available, we get very satisfactory numerical results.
Some concluding remarks are given in Section 6.
2 Preliminaries
In this paper, we focus on point processes on a finite state space. Let the ground set $\mathcal{Y} = \{1, 2, \dots, N\}$ be a set of $N$ points.
The set of all subsets of $\mathcal{Y}$ is denoted by $2^{\mathcal{Y}}$.
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a complete probability space
on which the expectation is denoted by $\mathbb{E}$.
A (finite) point process on the ground set $\mathcal{Y}$ is
a random variable $Y \colon \Omega \to 2^{\mathcal{Y}}$.
We also call $\mathbb{P}$ the
probability measure for the point process over the subsets
of $\mathcal{Y}$. Random subsets drawn from the point process can be any subset between the empty set $\emptyset$ and the full set $\mathcal{Y}$.
Definition 2.1.
A $2^{\mathcal{Y}}$-valued point process $Y$ is called a determinantal point process if there is a symmetric positive semidefinite matrix $K$ such that for every fixed set $A \subseteq \mathcal{Y}$,
$\mathbb{P}(A \subseteq Y) = \det(K_A), \qquad (2.1)$
where $K_A = [K_{ij}]_{i,j \in A}$ denotes the restriction of $K$ to the subset $A$.
To the best of our knowledge, the following necessary and sufficient condition for a symmetric kernel to define a determinantal point process is often mentioned without a direct proof, so we provide one here.
Theorem 2.2.
A symmetric kernel $K$ defines a determinantal point process if and only if $0 \preceq K \preceq I$, that is, all eigenvalues of $K$ lie in $[0, 1]$.
Proof.
Since the event $\{\emptyset \subseteq Y\}$ is the whole probability space, $\det(K_\emptyset) = \mathbb{P}(\emptyset \subseteq Y) = 1$; we set $\det(K_\emptyset) = 1$. Because $\mathbb{P}$ is a probability measure, all principal minors of $K$, i.e. $\det(K_A)$ for $A \subseteq \mathcal{Y}$, must be nonnegative, and thus $K$ itself must be positive semidefinite, that is, $K \succeq 0$.
Furthermore, from $\mathbb{P}(A \subseteq Y) = \det(K_A)$ and using the inclusion-exclusion principle we get
$\mathbb{P}(Y = \emptyset) = \sum_{A \subseteq \mathcal{Y}} (-1)^{|A|}\, \mathbb{P}(A \subseteq Y) = \sum_{A \subseteq \mathcal{Y}} (-1)^{|A|} \det(K_A) = \det(I - K). \qquad (2.2)$
The last equality above follows from the expansion of the characteristic polynomial. Equation (2.2) also means
$\det(I - K) = \mathbb{P}(Y = \emptyset) \geq 0. \qquad (2.3)$
Similarly, we are able to show that $\det\big((I - K)_A\big) \geq 0$ for any subset $A \subseteq \mathcal{Y}$, and hence $I - K \succeq 0$. So the necessary condition for a symmetric matrix $K$ to give a determinantal point process is $0 \preceq K \preceq I$.
This condition turns out to be sufficient: any symmetric $K$ with $0 \preceq K \preceq I$ defines a DPP. To prove this, it suffices to show that for every $A \subseteq \mathcal{Y}$, the atomic probability is well defined, that is, $0 \leq \mathbb{P}(Y = A) \leq 1$. The upper bound holds since $\mathbb{P}(Y = A) \leq \mathbb{P}(A \subseteq Y) = \det(K_A) \leq 1$. For the lower bound, we may assume $K_A$ is invertible (if $K_A$ is not invertible, we immediately get $\mathbb{P}(Y = A) \leq \mathbb{P}(A \subseteq Y) = \det(K_A) = 0$). Then using the Schur complement and the characteristic polynomial, we have
$\mathbb{P}(Y = A) = (-1)^{|A^c|} \det(K - I_{A^c}) = \det(K_A)\, \det\!\Big(I - \big(K_{A^c} - K_{A^c A} K_A^{-1} K_{A A^c}\big)\Big), \qquad (2.4)$
where $K_{AB}$ denotes the matrix obtained from $K$ by keeping only those entries whose rows belong to $A$ and whose columns belong to $B$ (if $A = B$ we simply write $K_A$), $|A|$ denotes the cardinality of the subset $A$, and $A^c$ the complement of the set $A$. Here we use a slight abuse of notation for $I_{A^c}$: it refers to the $N \times N$ matrix whose restriction to $A^c$ is the identity and which has zeros everywhere else. Since $0 \preceq K \preceq I$, the Schur complement $K_{A^c} - K_{A^c A} K_A^{-1} K_{A A^c}$ also has eigenvalues in $[0, 1]$, so both determinants on the right-hand side of (2.4) are nonnegative and hence $\mathbb{P}(Y = A) \geq 0$.
This completes the proof of the theorem.
∎
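As a quick numerical illustration of Theorem 2.2 (a sanity check rather than part of the proof), the following Python sketch builds a symmetric kernel whose eigenvalues lie in $[0,1]$, evaluates the atomic probabilities $\mathbb{P}(Y = A) = (-1)^{|A^c|}\det(K - I_{A^c})$ from (2.4) for all subsets, and checks that they are nonnegative and sum to one; the kernel and the ground set size are arbitrary choices.

# Sanity check of Theorem 2.2: atomic probabilities of a kernel with
# eigenvalues in [0, 1] are nonnegative and sum to one.
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 4
G = rng.standard_normal((N, N))
S = G @ G.T                                   # symmetric positive semidefinite
K = S / (np.linalg.eigvalsh(S).max() + 1.0)   # rescale so eigenvalues lie in [0, 1)

total = 0.0
for r in range(N + 1):
    for A in itertools.combinations(range(N), r):
        I_Ac = np.diag([0.0 if i in A else 1.0 for i in range(N)])
        p = (-1) ** (N - len(A)) * np.linalg.det(K - I_Ac)   # P(Y = A) from (2.4)
        assert p >= -1e-12                                   # nonnegative up to round-off
        total += p
print(total)   # prints 1.0 up to floating-point error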
It is quite inconvenient to work with marginal kernels since their eigenvalues must be bounded between 0 and 1, and marginal probabilities are not always the most appropriate description of real-world data. Here we introduce a slightly smaller class of DPPs called L-ensembles, where $L$ only needs to be positive semidefinite.
Definition 2.3.
A point process $Y$ is called an L-ensemble if it is defined through a real, symmetric, positive semidefinite matrix $L$:
$\mathbb{P}(Y = A) = \dfrac{\det(L_A)}{\det(L + I)}, \qquad A \subseteq \mathcal{Y}.$
Moreover, [Mac75] shows that L-ensembles are indeed DPPs, where the marginal kernel $K$ is given by
$K = L (L + I)^{-1} = I - (L + I)^{-1}.$
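The following short sketch (an illustration with an arbitrarily chosen positive semidefinite $L$) computes the marginal kernel $K = L(L+I)^{-1}$ and verifies numerically that the marginal probability $\det(K_A)$ agrees with the sum of the atomic probabilities $\det(L_B)/\det(L+I)$ over $B \supseteq A$:

import itertools
import numpy as np

L = np.array([[2.0, 0.5],
              [0.5, 1.0]])                     # any symmetric PSD matrix works here
N = L.shape[0]
K = L @ np.linalg.inv(L + np.eye(N))           # marginal kernel of the L-ensemble

def atomic_prob(L, B):
    # P(Y = B) = det(L_B) / det(L + I); the empty minor has determinant 1
    num = np.linalg.det(L[np.ix_(B, B)]) if len(B) > 0 else 1.0
    return num / np.linalg.det(L + np.eye(L.shape[0]))

A = [0]                                        # check the marginal of item 0
marginal = sum(atomic_prob(L, list(B))
               for r in range(N + 1)
               for B in itertools.combinations(range(N), r)
               if set(A) <= set(B))
print(marginal, np.linalg.det(K[np.ix_(A, A)]))   # the two numbers agree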
3 Maximum likelihood estimator of DPPs
Let $\mathcal{S}$ denote the set of all $N \times N$ positive definite matrices. Suppose that $Y_1, \dots, Y_n$ is a sample of size $n$ from DPP($L^*$) for some given but unknown $L^* \in \mathcal{S}$. This means that $Y_1, \dots, Y_n$ are independent $2^{\mathcal{Y}}$-valued random subsets whose probability law is given by the L-ensemble $L^*$. The goal of this paper
is to estimate $L^*$ from this given
sample $Y_1, \dots, Y_n$. The most popular estimator is the maximum likelihood
estimator. We shall recall its construction
in our specific setting and study its strong consistency, asymptotic normality, and the associated Berry-Esseen theorem.
Let $\mathbb{P}_L(A)$ denote the probability that the L-ensemble $Y$ equals $A$. The scaled log-likelihood function
associated with the sample $Y_1, \dots, Y_n$ is
$\hat\Phi_n(L) = \frac{1}{n} \sum_{i=1}^n \log \mathbb{P}_L(Y_i) = \sum_{A \subseteq \mathcal{Y}} \hat p(A) \log \mathbb{P}_L(A), \qquad (3.1)$
where $\mathbb{P}_L(A) = \det(L_A)/\det(L + I)$ and
$\hat p(A) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{Y_i = A\}},$
with $\mathbf{1}$ standing for the indicator function.
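For concreteness, a minimal Python sketch of (3.1) is given below; samples are represented as lists of item indices, the normalizing constant $\det(L+I)$ is computed once per evaluation, and the function name is illustrative.

import numpy as np

def scaled_log_likelihood(L, samples):
    # samples: a list of observed subsets, each given as a list of item indices
    N = L.shape[0]
    log_norm = np.linalg.slogdet(L + np.eye(N))[1]          # log det(L + I)
    total = 0.0
    for A in samples:
        if len(A) > 0:
            total += np.linalg.slogdet(L[np.ix_(A, A)])[1] - log_norm
        else:
            total += -log_norm                              # det of the empty minor is 1
    return total / len(samples)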
It is also useful to define the expected log-likelihood function given the true kernel $L^*$:
$\Phi(L) = \mathbb{E}_{L^*}\big[\log \mathbb{P}_L(Y)\big] = \sum_{A \subseteq \mathcal{Y}} p^*(A) \log \mathbb{P}_L(A), \qquad (3.2)$
where $p^*(A) = \mathbb{P}_{L^*}(Y = A)$.
In the sequel $L^*$ is always fixed. To simplify writing, we suppress the dependence on $n$ and on $L^*$ from the notation whenever no confusion can arise.
Definition 3.1.
A maximum likelihood estimator $\hat L_n$
is a global maximizer of (3.1):
$\hat L_n \in \operatorname*{arg\,max}_{L \in \mathcal{S}} \hat\Phi_n(L). \qquad (3.3)$
We notice that $\Phi(L^*) - \Phi(L)$ is the Kullback-Leibler divergence $\mathrm{KL}\big(\mathrm{DPP}(L^*) \,\|\, \mathrm{DPP}(L)\big)$, which measures the difference between the distributions of $\mathrm{DPP}(L^*)$ and of $\mathrm{DPP}(L)$. The KL divergence is always non-negative:
$\Phi(L^*) - \Phi(L) = \mathrm{KL}\big(\mathrm{DPP}(L^*) \,\|\, \mathrm{DPP}(L)\big) \geq 0.$
As a consequence, $L^*$ is a global maximizer of the expected log-likelihood function $\Phi$.
The next lemma gives the gradient of the log-likelihood function $\hat\Phi_n$.
Lemma 3.2.
The gradient of the log-likelihood function defined in (3.1) exists and is given by
$\nabla \hat\Phi_n(L) = \frac{1}{n} \sum_{i=1}^n \widetilde{L_{Y_i}^{-1}} - (L + I)^{-1}, \qquad (3.4)$
where $\widetilde{L_A^{-1}}$ denotes the $N \times N$ matrix whose restriction to $A$ is $(L_A)^{-1}$ and whose entries are zero elsewhere.
Proof.
We regard the determinant as a multivariate function of the $N^2$ entries of its argument. For an invertible matrix $M$, the directional derivative of $\log\det(M)$ along a direction $H$ is given by
$\lim_{t \to 0} \frac{\log\det(M + tH) - \log\det(M)}{t} = \lim_{t \to 0} \frac{\log\det(I + t M^{-1} H)}{t} = \lim_{t \to 0} \frac{\log\big(1 + t\, \mathrm{tr}(M^{-1}H) + O(t^2)\big)}{t} = \mathrm{tr}\big(M^{-1} H\big), \qquad (3.5)$
where the second and third equalities follow from the power series representations of $\det(I + t M^{-1} H)$ and of the logarithm, respectively. Then the directional derivative of $\hat\Phi_n$ along the direction $H$ is
$\frac{1}{n} \sum_{i=1}^n \mathrm{tr}\big( (L_{Y_i})^{-1} H_{Y_i} \big) - \mathrm{tr}\big( (L + I)^{-1} H \big). \qquad (3.6)$
In matrix form, the above equation becomes
$\Big\langle \frac{1}{n} \sum_{i=1}^n \widetilde{L_{Y_i}^{-1}} - (L + I)^{-1},\; H \Big\rangle. \qquad (3.7)$
This proves the lemma.
∎
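A direct implementation of the gradient formula (3.4) can be sketched as follows, under the convention used above that the inverse of a principal minor is padded with zeros back to an $N \times N$ matrix; the function name and the representation of samples as lists of indices are assumptions of this sketch.

import numpy as np

def log_likelihood_gradient(L, samples):
    # grad = (1/n) * sum_i pad(inv(L_{Y_i})) - inv(L + I), where pad() places the
    # inverse of a principal minor back into an N x N matrix of zeros
    N = L.shape[0]
    n = len(samples)
    grad = -np.linalg.inv(L + np.eye(N))
    for A in samples:
        if len(A) > 0:
            padded = np.zeros((N, N))
            padded[np.ix_(A, A)] = np.linalg.inv(L[np.ix_(A, A)])
            grad += padded / n
    return grad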
3.1 Strong consistency
One critical issue for maximum likelihood estimation is its consistency. Due to the non-identifiability of DPPs illustrated in [Kul12, Theorem 4.1], any element of $\{DLD : D \in \mathcal{D}\}$ defines the same DPP as $L$, where $\mathcal{D}$ is the collection of all diagonal matrices whose diagonal entries are either $1$ or $-1$. Moreover, if $L'$ defines the same DPP as $L$, then there is a $D \in \mathcal{D}$ such that $L' = DLD$.
Thus, we cannot expect that the maximum likelihood estimator will converge to $L^*$ itself. Instead, it should converge to $DL^*D$ for some $D \in \mathcal{D}$.
Hence, for the consistency problem we study the limit, as $n \to \infty$, of
$\min_{D \in \mathcal{D}} \big\| D \hat L_n D - L^* \big\|_F. \qquad (3.8)$
[BMRU17] proves that this distance converges to zero in probability. We shall prove a stronger version: the convergence also holds almost surely.
Theorem 3.3.
Let $Y_1, \dots, Y_n$ be a
sample of size $n$ from DPP($L^*$).
Let $\hat L_n$ be the maximum likelihood estimator of $L^*$ defined by (3.3).
Then $\min_{D \in \mathcal{D}} \| D \hat L_n D - L^* \|_F$ converges to zero almost surely.
Proof.
The proof combines the convergence in probability result of [BMRU17, Theorem 14] with Wald's consistency theorem [Wal49] and is divided into the following
five steps.
Step 1: For , define a set
where is the set of all N by N symmetric matrices whose eigenvalues are between and . Observe that is compact since it’s bounded and closed in . We first show that converges to zero almost surely when parameters of matrices are restricted to
for appropriate .
In order to complete this proof, we let
and
is the Kullback-Leibler divergence between DPP() and DPP(). By Jensen’s inequality, for all and if and only if for all , which means for some . In the sequel let denote
For each , the strong law of large numbers implies
(3.9)
Step 2: However, the above convergence
(3.9) does not imply the convergence of the maximum likelihood estimator to the true value. Thus Wald's integrability condition is needed:
for every , there exists such that,
(3.10)
Since is continuous (the determinant function is continuous), for any there exists , when
Then the Wald’s integrability condition is satisfied.
Now for every sequence converging to L, we show that is upper semicontinuous:
The second inequality follows from Fatou's lemma and the third identity is a consequence of the continuity of the function.
For every we define the set
(3.11)
which is closed and hence compact.
Since is an upper semicontinuous function, it achieves maximum over the compact set
(since is compact). We denote the maximum by . And we cannot have because that would imply there is an such that for some .
The strong law of large numbers implies
(3.12)
By continuity,
and
is a decreasing function with respect to because supremum over a smaller subset is smaller than that over a bigger subset. And by (3.10) it is integrable for all small enough . Hence by the dominated convergence theorem,
Thus for any and any there exists an such that
(3.13)
Step 3. For each , we define the open set:
and then the family is an open cover of and hence has a finite subcover: .
On every such set we use the strong law of large numbers again to obtain
Notice that . From (3.15) there exists a constant such that
But
so there exists a constant such that
which implies that , that is,
.
Step 4. Now we can remove the compactness condition.
First we show that the event holds almost surely. We adopt the proof from [BMRU17].
Let . For simplicity, we denote by . Since is positive definite, . Define the event by
Observe that and we can find and such that .
Then using [BMRU17, Theorem 14] we know that on the event , , that is,
Because
the event holds almost surely when goes to infinity and hence holds almost surely.
Step 5. Let denote the indicator function of the event . Then
The last equality follows from the fact, proved in Step 4, that the event holds almost surely.
∎
3.2 Berry-Esseen theorem
An $N$ by $N$ matrix can also be viewed as an $N^2$-dimensional column vector. With this identification, the Frobenius norm of the matrix is just the Euclidean norm of its corresponding column vector. In the following we shall identify a matrix with its corresponding column vector whenever convenient.
Because of the non-identifiability of DPPs, the maximum likelihood estimators are not unique. We choose the estimator which is closest to the given true value $L^*$. In fact, let $\hat L_n$ be one maximum likelihood estimator. Let $D_n \in \mathcal{D}$ be such that
$D_n \in \operatorname*{arg\,min}_{D \in \mathcal{D}} \big\| D \hat L_n D - L^* \big\|_F, \qquad (3.16)$
and set
$\tilde L_n = D_n \hat L_n D_n. \qquad (3.17)$
Then the strong consistency of $\tilde L_n$ immediately follows from Theorem 3.3.
We impose the irreducibility assumption introduced in [BMRU17] on $L^*$; this assumption is equivalent to $L^*$ being a one-block matrix, that is, it cannot be partitioned into two block matrices. Then, according to [BMRU17, Theorem 8], the Hessian of the expected log-likelihood $\Phi$ at $L^*$ is negative definite and hence invertible. Let $V(L^*)$ denote the inverse of the negative Hessian:
(3.18)
Here, if we vectorize the matrices, then the Hessian is an $N^2 \times N^2$ matrix. By [VdV00, Theorem 5.41],
(3.19)
In particular, [VdV00, Theorem 5.41] states that the sequence $\sqrt{n}\,(\tilde L_n - L^*)$ is asymptotically normal with mean zero and covariance matrix $V(L^*)$. Hence we get the following theorem for the asymptotic normality of the maximum likelihood estimator $\tilde L_n$.
Theorem 3.4.
Let $L^*$ be irreducible. Then $\tilde L_n$ is asymptotically normal:
$\sqrt{n}\,\big(\tilde L_n - L^*\big) \longrightarrow \mathcal{N}\big(0, V(L^*)\big), \qquad (3.20)$
where $V(L^*)$ is given by (3.18) and the above convergence holds in distribution.
Having established the asymptotic normality of the maximum likelihood estimator, we want to take one step further.
Namely,
we want to find an upper bound on the rate of convergence of the distribution of $\sqrt{n}\,(\tilde L_n - L^*)$ to the multidimensional normal distribution. In other words, we are going to obtain a Berry-Esseen type theorem. We argue that, under the restriction stated in Theorem 3.5 below, the bound on the maximal error is of an explicit order in $n$. This restriction is not a serious one: the constants involved can be chosen arbitrarily close to the corresponding quantities for $L^*$, and, what is more, since $\tilde L_n$ converges to $L^*$ almost surely by Theorem 3.3, the restriction holds eventually almost surely.
Theorem 3.5.
Let the maximum likelihood estimator $\tilde L_n$ be defined by (3.17) and also belong to a fixed compact set of positive definite matrices, and let $Z$ be a standard Gaussian matrix. Then for every $n$,
where $C$ is a sufficiently large constant which does not depend on $n$ but depends on the compact set and on the true kernel. Here, for two
$N$ by $N$ matrices $A$ and $B$, $A \leq B$ means that each component of $A$ is less than or equal to the corresponding component of $B$.
Proof.
We divide the proof into four steps.
Step 1. According to (3.19), can be decomposed into a sum
(3.21)
and a term whose Frobenius norm converges to zero in probability.
(3.22)
where is an arbitrary sequence of positive real numbers and is the matrix whose entries are all 1.
Step 2. The Estimation of (). We claim
where and is a constant.
In fact,
from the proof of [VdV00, Theorem 5.41], has the following expression
(3.23)
where is a point on the line segment between and . To simplify notation, let denote
Then
(3.24)
denotes the operator norm induced by norm and denotes the largest eigenvalue. For the first inequality, we regard as an column vector and is an matrix.
(3.25)
(3.26)
Using the Cauchy-Schwarz inequality to estimate (3.25) we see
(3.27)
Let be a multivariate function:
Then is a continuous function. What’s more almost surely , which is a compact and convex set. Using Theorem 3.4 and portmanteau lemma we have
(3.28)
where . is equal to . Then there exists a constant subject to such that
(3.29)
As a result,
(3.30)
where is a suitable constant.
Next, we estimate the second part, that is, (3.26).
Here the quantity in question is a column vector whose entries are matrices. Since the map involved is infinitely differentiable, the intermediate point lies on the line segment between the estimator and the true value, and the underlying parameter set is convex and compact, we conclude that every entry is bounded. Hence there exists a constant such that
(3.31)
Now let . Using Chebyshev’s inequality we get:
(3.32)
for a suitable constant .
Step 3.
Our next goal is to estimate () as follows.
Let be . Then
for some constant .
Because
we have
(3.33)
By the multidimensional Berry-Esseen theorem in [Ben05],
For , since Z can be viewed as a standard Gaussian random vector, we have
(3.37)
Combining (3.36) and (3.37) with the previous bound (3.32), where we take , we conclude that
where is a constant.
Step 4.
As for we can use the same argument as above and conclude that is
bounded by for some constant .
This completes the proof of the theorem.
∎
4 Simulation study
In this section we numerically test the MLE for DPPs. Since in general
there is no analytic expression for the MLE (3.16),
we need to use numerical algorithms to find it. We try the well-known Newton-Raphson algorithm and compare it with the stochastic gradient descent algorithm.
Algorithms. First, we explain the formula of the Newton-Raphson algorithm applied to our MLE of DPPs. Recall that the gradient of the log-likelihood function is given by (3.4):
To use the Newton-Raphson algorithm we think of this gradient as a vector-valued function of the $N^2$ variables $L_{ij}$. Since
(4.1)
the Jacobian matrix of the gradient is given by an $N^2$ by $N^2$ matrix
(4.2)
where the row and column indices are ordered consistently with this vectorization.
Then the Newton-Raphson update at the $t$-th iteration is given by
(4.3)
The matrix in (4.3) is of size $N^2 \times N^2$,
and the computation of its inverse
may take some time when $N$ is large.
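A sketch of one Newton-Raphson step is given below; to keep it self-contained, the Jacobian of the vectorized gradient is approximated by central finite differences rather than by the closed form (4.2), and grad_fn stands for any implementation of the gradient (3.4), for instance the sketch after Lemma 3.2. All names are illustrative.

import numpy as np

def newton_raphson_step(L, samples, grad_fn, eps=1e-5):
    N = L.shape[0]
    g = grad_fn(L, samples).ravel()                    # gradient stacked into an N^2 vector
    J = np.zeros((N * N, N * N))                       # numerical Jacobian of the gradient
    for k in range(N * N):
        E = np.zeros(N * N)
        E[k] = eps
        g_plus = grad_fn(L + E.reshape(N, N), samples).ravel()
        g_minus = grad_fn(L - E.reshape(N, N), samples).ravel()
        J[:, k] = (g_plus - g_minus) / (2 * eps)
    step = np.linalg.solve(J, g)                       # solve instead of forming J^{-1} explicitly
    return L - step.reshape(N, N)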
The Newton-Raphson algorithm is time consuming, in particular when $N$ is large, since
we need to compute the inverse of the Jacobian (an $N^2 \times N^2$ matrix) in each iteration step. For this reason we try another method, the stochastic gradient descent (SGD) method. This method depends on a step size as well as on an initial kernel $L^{(0)}$. At the $t$-th iteration we do simple random sampling in our data set to pick one sample and update our kernel by
(4.4)
In this way, we reduce the high computational burden of the Newton-Raphson algorithm (4.3) at each iteration, but the algorithm has a lower convergence rate.
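A sketch of the SGD iteration (4.4): at each step one observed subset is drawn at random and the kernel is moved along the gradient of that single-sample log-likelihood. The step-size schedule and the symmetrization of the iterate are illustrative choices, and log_likelihood_gradient refers to the sketch after Lemma 3.2.

import numpy as np

def sgd_estimate(samples, L0, n_iter=20000, seed=0):
    rng = np.random.default_rng(seed)
    L = L0.copy()
    for t in range(1, n_iter + 1):
        A = samples[rng.integers(len(samples))]   # simple random sampling of one observed subset
        gamma = 1.0 / (100.0 + t)                 # an assumed decreasing step size
        L = L + gamma * log_likelihood_gradient(L, [A])
        L = (L + L.T) / 2                         # keep the iterate symmetric
    return L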
Simulation results.
In the simulation study we consider three true kernels: two of one size and a third of a different size.
To see the trend of convergence, for each kernel we generate 300, 1000, 3000, 10000, and 30000 synthetic samples using the algorithm proposed in [HKPV06].
The simulation results show that the Newton-Raphson method outperforms SGD. For the first DPP, the Newton-Raphson algorithm shows a very slow trend of convergence to the true kernel, whereas SGD only estimates the diagonal elements well. For the second kernel, which corresponds to an independent DPP, both the Newton-Raphson and SGD methods converge to the true kernel rather well. For the last one, Newton-Raphson converges to a wrong critical point while SGD is numerically unstable.
Our simulation results for n = 30000 are summarized in Table 1, which suggests that learning the dependence and the repulsive behavior between items is beyond the scope of general-purpose numerical methods. This unsatisfactory result may be because, on one hand, both algorithms involve computing matrix inverses at each iteration, which can produce large values and hence be unstable; on the other hand, the likelihood function
of a DPP is usually non-concave and has at least exponentially many saddle points [BMRU17]. This unsatisfactory performance of
the two well-known algorithms motivates us to find
an effective method for maximizing the likelihood function
of a DPP. In the next section, this issue
is resolved for the two-by-two matrix kernel since we can obtain the
explicit maximizer in this case.
[Table 1 body not recoverable: its rows listed, for each case, the true kernel, the initial kernel, and the estimates produced by Newton-Raphson and SGD, with the SGD estimate for the third kernel numerically unstable.]
Table 1: We test the performance of the algorithms for three different true kernels when the sample size is 30000. We report the result after a fixed number of iterations. In the experiment, we observe that both algorithms become unstable as the number of iterations increases beyond a certain point, so we choose a smaller number of iterations for Newton-Raphson and a larger one for SGD, since each SGD iteration needs less time to complete.
5 Two-by-two block kernel
Due to the non-identifiability of DPPs illustrated in [Kul12, Theorem 4.1], which we mentioned earlier in Section 3, we have to
use the estimator $\tilde L_n$ defined by (3.17) as our maximum likelihood estimator. This formula involves a choice of the sign matrix $D_n$, which may also be random. However, the way to choose $D_n$ is in general not available
to us. To get rid of this choice,
we show in this section that if the kernels of the determinantal point processes are two-by-two symmetric positive
semidefinite matrices, we can obtain the desired results without involving the choice of $D_n$.
Our result can also be immediately extended to block-diagonal kernels with two-by-two blocks. Since the commonly used iterative methods presented in the previous section are not so effective in finding the maximum likelihood estimator numerically, the explicit analytic form we obtain is also very useful for computing the estimator numerically
in practice. On the other hand,
our method, effective for two-by-two matrices, is difficult to apply to more general higher-dimensional kernels.
Let
$L = \begin{pmatrix} a & b \\ b & c \end{pmatrix},$
where $a, b, c \in \mathbb{R}$, and let the ground set be $\{1, 2\}$. To ensure $L \succeq 0$, we assume
$a \geq 0, \qquad c \geq 0,$
and
$ac - b^2 \geq 0.$
We can always assume $b$ is non-negative since, by the identifiability of DPPs, $b$ and $-b$ give the same DPP.
For ease of notation, let $\hat p_\emptyset, \hat p_1, \hat p_2, \hat p_{12}$ denote the empirical probabilities of the subsets $\emptyset$, $\{1\}$, $\{2\}$, $\{1, 2\}$, respectively, and let $p_\emptyset, p_1, p_2, p_{12}$ denote the corresponding theoretical probabilities.
The relationship between $(a, b, c)$ and $(p_\emptyset, p_1, p_2, p_{12})$ is given by
$p_\emptyset = \frac{1}{\det(L+I)}, \quad p_1 = \frac{a}{\det(L+I)}, \quad p_2 = \frac{c}{\det(L+I)}, \quad p_{12} = \frac{ac - b^2}{\det(L+I)}, \qquad \det(L+I) = (1+a)(1+c) - b^2,$
and inversely
$a = \frac{p_1}{p_\emptyset}, \qquad c = \frac{p_2}{p_\emptyset}, \qquad b^2 = \frac{p_1 p_2 - p_{12}\, p_\emptyset}{p_\emptyset^2}.$
The likelihood function defined in (3.1) now becomes
$\hat\Phi_n(a, b, c) = \hat p_\emptyset \log \frac{1}{\det(L+I)} + \hat p_1 \log \frac{a}{\det(L+I)} + \hat p_2 \log \frac{c}{\det(L+I)} + \hat p_{12} \log \frac{ac - b^2}{\det(L+I)}. \qquad (5.1)$
To find the critical points of (5.1) we first set the partial derivative of $\hat\Phi_n$ with respect to $b$ equal to zero and get
$\frac{\partial \hat\Phi_n}{\partial b} = -2b \left( \frac{\hat p_{12}}{ac - b^2} - \frac{1}{(1+a)(1+c) - b^2} \right) = 0. \qquad (5.2)$
Then we have that $b$ is either equal to 0 or
$b^2 = \frac{ac - \hat p_{12}\,(1+a)(1+c)}{1 - \hat p_{12}}. \qquad (5.3)$
If $b = 0$, then by setting the partial derivatives with respect to $a$ and $c$ to zero and noticing that $\hat p_\emptyset + \hat p_1 + \hat p_2 + \hat p_{12} = 1$, we get the first critical point
$\hat a = \frac{\hat p_1 + \hat p_{12}}{\hat p_\emptyset + \hat p_2}, \qquad \hat b = 0, \qquad \hat c = \frac{\hat p_2 + \hat p_{12}}{\hat p_\emptyset + \hat p_1}. \qquad (5.4)$
We remark that this is the point to which the Newton-Raphson method converges in the last simulation study.
This critical point exists only if $\hat p_\emptyset + \hat p_2$ and $\hat p_\emptyset + \hat p_1$ are nonzero. Since each empirical probability converges to its corresponding theoretical probability almost surely and $p_\emptyset > 0$, the strong law of large numbers implies that the critical point exists almost surely when $n$ is sufficiently large.
If $b \neq 0$, then we can use (5.3) to estimate $b$ once $a$ and $c$ are obtained:
$\hat b^2 = \frac{\hat a \hat c - \hat p_{12}\,(1+\hat a)(1+\hat c)}{1 - \hat p_{12}}. \qquad (5.5)$
To find the maximum likelihood estimators and of and
we plug (5.3) into to obtain
(5.6)
Letting and equal zero yields
(5.7)
The above system of equations can be solved explicitly, and combining it with (5.5) yields
$\hat a = \frac{\hat p_1}{\hat p_\emptyset}, \qquad \hat c = \frac{\hat p_2}{\hat p_\emptyset}, \qquad \hat b = \frac{\sqrt{\hat p_1 \hat p_2 - \hat p_{12}\,\hat p_\emptyset}}{\hat p_\emptyset}, \qquad (5.8)$
from which we see that this critical point exists only if $\hat p_\emptyset > 0$ and $\hat p_1 \hat p_2 - \hat p_{12}\,\hat p_\emptyset \geq 0$. Again by the strong law of large numbers, the second critical point also exists and converges to the true value almost surely: in fact, $(\hat a, \hat b, \hat c) \to (a, b, c)$ almost surely as $n \to \infty$.
We test the estimator (5.8) on the two-by-two kernel from the earlier simulation study; the resulting estimates show that the estimator (5.8) is numerically consistent with the true kernel.
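For completeness, a small Python sketch of the explicit estimator (5.8) is given below; items are indexed 0 and 1, the empirical frequencies are computed directly from the sample, the square root is clipped at zero to guard against sampling noise, and the function name is illustrative.

import numpy as np

def two_by_two_mle(samples):
    # samples: list of observed subsets of {0, 1}, each given as a list of indices
    n = len(samples)
    p0  = sum(1 for A in samples if set(A) == set()) / n       # empirical P(Y = {})
    p1  = sum(1 for A in samples if set(A) == {0}) / n          # empirical P(Y = {1})
    p2  = sum(1 for A in samples if set(A) == {1}) / n          # empirical P(Y = {2})
    p12 = sum(1 for A in samples if set(A) == {0, 1}) / n       # empirical P(Y = {1,2})
    a = p1 / p0                                                 # requires p0 > 0
    c = p2 / p0
    b = np.sqrt(max(a * c - p12 / p0, 0.0))                     # nonnegative root by identifiability
    return np.array([[a, b], [b, c]])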
Furthermore, we can establish the central limit theorem for the estimator (5.8), which corresponds to the result in Theorem 3.4.
Theorem 5.1.
Assume , then the estimator in (5.8) is asymptotically normal,
(5.9)
where the convergence holds in distribution and the asymptotic covariance matrix is the inverse of the Hessian matrix of the expected log-likelihood function $\Phi$.
Proof.
Let be n independent subsets of , where . Let be the random vector , where stands for the indicator random variable. Then has mean and covariance matrix
By the central limit theorem, converges to a multivariate normal distribution with mean and covariance . Let a function be defined by
Its Jacobian matrix is given by
Now we are in a position to apply the delta method [VdV00] to get
After tedious matrix computations, is found to be
where
It is straightforward to verify that the above matrix is the inverse of the Hessian matrix of the expected log-likelihood function, which in turn verifies Theorem 3.4 in this special case. However, in this two-by-two case, our maximum likelihood estimator is unique, without the identifiability adjustment (3.17).
∎
This idea can be extended to block-diagonal ensembles whose blocks are two-by-two
submatrices.
Suppose $L$ is a $2k \times 2k$ matrix with $k$ two-by-two blocks
(5.10)
where each block satisfies the constraints $a_j, c_j \geq 0$ and $a_j c_j - b_j^2 \geq 0$ described above. Let the ground set of this DPP be $\{1, 2, \dots, 2k\}$, and for each block $j$ define the empirical probabilities of the four possible intersections of a sample with the pair $\{2j - 1, 2j\}$ as in
(5.11)
(5.12)
(5.13)
(5.14)
where $Y_1, \dots, Y_n$ are $n$ independent subsets drawn from DPP($L$).
By the construction (5.10) we see that the restrictions $Y \cap \{2j - 1, 2j\}$, $j = 1, \dots, k$, are mutually independent.
Then the critical-point result for the two-by-two matrix can be applied block by block:
(5.15)
for every $j = 1, \dots, k$.
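A sketch of the resulting block-by-block estimation (5.15): each observed subset is restricted to the $j$-th pair of items and the two-by-two estimator (5.8) is applied to the restricted sample. Here two_by_two_mle refers to the sketch following (5.8), and the items of the $j$-th block are assumed to be $2j$ and $2j+1$ in zero-based indexing.

import numpy as np

def block_mle(samples, k):
    # estimate a block-diagonal kernel with k two-by-two blocks, block by block
    L_hat = np.zeros((2 * k, 2 * k))
    for j in range(k):
        pair = {2 * j, 2 * j + 1}
        # restrict each observed subset to the j-th pair and relabel its items to {0, 1}
        restricted = [[i - 2 * j for i in A if i in pair] for A in samples]
        L_hat[2 * j:2 * j + 2, 2 * j:2 * j + 2] = two_by_two_mle(restricted)
    return L_hat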
However, the above method encounters difficulties when the kernel has dimension higher than 2 and a general (non-block) form. For example, if the kernel is a general higher-dimensional matrix, then setting the gradient of the likelihood function equal to zero yields a system of equations whose terms are troublesome to compute and which is difficult to use to obtain an explicit
maximum likelihood estimator.
6 Conclusion
In this paper, we study the maximum likelihood estimator for the ensemble matrix associated with a determinantal point process. Brunel et al. show that the expected likelihood function is locally strongly concave around the true value if and only if the true kernel is irreducible, since the Hessian matrix at the true value is then negative definite. They then prove that the maximum likelihood estimator (MLE) is consistent with respect to convergence in probability and, when the true kernel is irreducible, they also obtain the central limit theorem for the MLE. We improve their results by showing that the MLE is also strongly consistent, that is, consistent with respect to almost sure convergence. Moreover, we obtain a Berry-Esseen type result for the central limit theorem and find the rate of convergence of the MLE to normality. Last, we obtain the explicit form of the MLE when the kernel is a two-by-two matrix
or a block-diagonal matrix whose blocks are two-by-two matrices. This explicit form enables us to avoid the use of the sign matrix $D_n$ in (3.17), which is inconvenient in applications. The strong consistency and central limit theorem follow from these explicit forms, which illustrates the general strong consistency and central limit theorem proved earlier. It would be interesting to find the explicit form for some particular higher-dimensional DPPs. However, as maximum likelihood learning of DPPs has been proven to be NP-hard, the explicit form for general ensembles, even if it were found, would likely be very difficult to compute.
In addition to the maximum likelihood estimator, there are also other approaches in lieu of the MLE. Let us mention only one alternative.
For all $J$ such that $|J| \leq 2$, we let
$\det(\hat K_J) = \hat{\mathbb{P}}(J \subseteq Y), \qquad (6.1)$
where the left-hand side is the theoretical probability of $J \subseteq Y$ under the kernel $\hat K$ and the right-hand side
is the empirical probability of $J \subseteq Y$. Taking $J = \{i\}$ suggests the following estimator for the diagonal entries:
$\hat K_{ii} = \hat{\mathbb{P}}(i \in Y). \qquad (6.2)$
Using equations (6.1) for $J = \{i, j\}$ again, we are able to determine the off-diagonal elements up to sign:
$\hat K_{ij} = \pm \sqrt{\hat K_{ii}\hat K_{jj} - \hat{\mathbb{P}}(\{i, j\} \subseteq Y)}, \qquad (6.3)$
where $i \neq j$. Notice that this is the maximum likelihood estimator when the kernel is two dimensional. There remains the question of how to choose the sign of $\hat K_{ij}$ in
(6.3), which has been resolved by [UBMR17] using graph theory. We shall not pursue this direction in the current paper.
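For illustration, a sketch of the frequency (moment) estimator (6.2)-(6.3) is given below; the signs of the off-diagonal entries are left undetermined (they are set to the nonnegative root), since recovering them requires the cycle-based method of [UBMR17]. The function name is illustrative.

import numpy as np

def moment_estimator(samples, N):
    # samples: list of observed subsets of {0, ..., N-1}, each given as a list of indices
    n = len(samples)
    K_hat = np.zeros((N, N))
    for i in range(N):
        K_hat[i, i] = sum(1 for A in samples if i in A) / n          # (6.2)
    for i in range(N):
        for j in range(i + 1, N):
            p_ij = sum(1 for A in samples if i in A and j in A) / n  # empirical P({i, j} in Y)
            # (6.3), with the sign left as the nonnegative root and clipping for noise
            K_hat[i, j] = K_hat[j, i] = np.sqrt(max(K_hat[i, i] * K_hat[j, j] - p_ij, 0.0))
    return K_hat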
References
[BDF10]
Alexei Borodin, Persi Diaconis, and Jason Fulman.
On adding a list of numbers (and other one-dependent determinantal processes).
Bulletin of the American Mathematical Society, 47(4):639–670, 2010.
[Ben05]
Vidmantas Bentkus.
A Lyapunov-type bound in $\mathbb{R}^d$.
Theory of Probability and Its Applications, 49(2):311–323, 2005.
[BMRU17]
Victor-Emmanuel Brunel, Ankur Moitra, Philippe Rigollet, and John Urschel.
Maximum likelihood estimation of determinantal point processes.
arXiv preprint arXiv:1701.06501, 2017.
[BP93]
Robert Burton and Robin Pemantle.
Local characteristics, entropy and limit theorems for spanning trees and domino tilings via transfer-impedances.
The Annals of Probability, pages 1329–1371, 1993.
[Gin65]
Jean Ginibre.
Statistical ensembles of complex, quaternion, and real matrices.
Journal of Mathematical Physics, 6(3):440–449, 1965.
[GJWX22]
Elena Grigorescu, Brendan Juba, Karl Wimmer, and Ning Xie.
Hardness of maximum likelihood learning of DPPs.
In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178, pages 3800–3819. PMLR, 2022.
[HKPV06]
J Ben Hough, Manjunath Krishnapur, Yuval Peres, and Bálint Virág.
Determinantal processes and independence.
Probability Surveys, 3:206–229, 2006.
[KT12]
Alex Kulesza and Ben Taskar.
Determinantal point processes for machine learning.
Foundations and Trends in Machine Learning, 5(2–3):123–286, 2012.
[Kul12]
John A Kulesza.
Learning with determinantal point processes.
PhD thesis, University of Pennsylvania, 2012.
[LDG21]
Claire Launay, Agnès Desolneux, and Bruno Galerne.
Determinantal point processes for image processing.
SIAM J. Imaging Sci., 14(1):304–348, 2021.
[Mac75]
O. Macchi.
The coincidence approach to stochastic point processes.
Advances in Applied Probability, 7(1):83–122, 1975.
[UBMR17]
John Urschel, Victor-Emmanuel Brunel, Ankur Moitra, and Philippe Rigollet.
Learning determinantal point processes with moments and cycles.
In International Conference on Machine Learning, pages 3511–3520. PMLR, 2017.
[VdV00]
Aad W Van der Vaart.
Asymptotic statistics, volume 3.
Cambridge university press, 2000.
[Wal49]
Abraham Wald.
Note on the consistency of the maximum likelihood estimate.
The Annals of Mathematical Statistics, 20(4):595–601, 1949.