Convergence of de Finetti’s mixing measure in
latent structure models for observed exchangeable sequences
Abstract
Mixtures of product distributions are a powerful device for learning about heterogeneity within data populations. In this class of latent structure models, de Finetti’s mixing measure plays the central role in describing the uncertainty about the latent parameters representing heterogeneity. In this paper, posterior contraction theorems for de Finetti’s mixing measure arising from finite mixtures of product distributions will be established, under the setting in which the number of exchangeable sequences of observed variables increases while the sequence lengths may be either fixed or varying. The role of both the number of sequences and the sequence lengths will be carefully examined. In order to obtain concrete rates of convergence, a first-order identifiability theory for finite mixture models and a family of sharp inverse bounds for mixtures of product distributions will be developed via a harmonic analysis of such latent structure models. This theory is applicable to broad classes of probability kernels composing the mixture model of product distributions for both continuous and discrete domains. Examples of interest include the case where the probability kernel is only weakly identifiable in the sense of [25], the case where the kernel is itself a mixture distribution as in hierarchical models, and the case where the kernel may not have a density with respect to a dominating measure on an abstract domain, such as Dirichlet processes.
Acknowledgement
The authors are grateful for the support provided by NSF Grants DMS-1351362, DMS-2015361 and the Toyota Research Institute. We would like to sincerely thank Xianghong Chen, Danqing He and Qingtang Su for valuable discussions, and Judith Rousseau and Nhat Ho for helpful comments. We also thank the anonymous referees and associate editor for valuable comments and suggestions.
1 Introduction
Latent structure models with many observed variables are among the most powerful and widely used tools in statistics for learning about heterogeneity within data population(s). An important canonical example of such models is the mixture of product distributions, which may be motivated by de Finetti’s celebrated theorem for exchangeable sequences of random variables [1, 29]. The theorem of de Finetti states roughly that if is an infinite exchangeable sequence of random variables defined in a measure space , then there exists a random variable in some space , where is distributed according to a probability measure , such that are conditionally i.i.d. given . Denoting by the conditional distribution of given , we may express the joint distribution of a -sequence , for any , as a mixture of product distributions in the following sense: for any ,
The probability measure is also known as de Finetti mixing measure for the exchangeable sequence. It captures the uncertainty about the latent variable , which describes the mechanism according to which the sequence is generated via . In other words, the de Finetti mixing measure can be seen as representing the heterogeneity within the data populations observed via sequences . A statistician typically makes some assumption about the family , and proceeds to draw inference about the nature of heterogeneity represented by based on data samples .
In order to obtain an estimate of the mixing measure , one needs multiple copies of the exchangeable sequences . As mentioned, some assumption will be required of the probability distributions , as well as the mixing measure . Throughout this paper it is assumed that the map is injective. Moreover, we will confine ourselves to the setting of exact-fitted finite mixtures, i.e., is assumed to be an element of , the space of discrete measures with distinct supporting atoms on , where is a subset of . Accordingly, we may express . We may write the distribution for in the following form, where we include the subscripts and to signify their roles:
(1)
Note that when , we are reduced to a mixture distribution . Due to the role they play in the composition of the distribution , we also refer to as a family of probability kernels on . We are given independent copies of exchangeable sequences, each of which is distributed according to the distribution given in (1), where denotes the possibly variable length of the -th sequence. The primary question of interest in this paper is the efficiency of the estimation of the true mixing measure , for some known , as the sample size increases in a suitable sense.
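To make this sampling scheme concrete, the following sketch simulates data of the form described above, assuming (purely for illustration) Gaussian kernels N(theta, 1) on the real line; the function names and parameter values are our own choices and not part of the paper's development.

```python
import numpy as np

def sample_product_mixture(atoms, weights, m, N, sigma=1.0, seed=None):
    """Draw m exchangeable sequences of length N from a finite mixture of
    N-product Gaussian distributions: each sequence first picks a latent
    component theta_j with probability p_j, then its N coordinates are
    i.i.d. N(theta_j, sigma^2) given that component."""
    rng = np.random.default_rng(seed)
    labels = rng.choice(len(atoms), size=m, p=weights)   # latent component per sequence
    X = rng.normal(np.asarray(atoms)[labels][:, None], sigma, size=(m, N))
    return X, labels

# k = 2 components; m = 500 sequences, each of length N = 10
X, z = sample_product_mixture(atoms=[-1.0, 2.0], weights=[0.3, 0.7], m=500, N=10, seed=0)
print(X.shape)   # (500, 10); rows are exchangeable sequences
```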
Models described by Eq. (1) are also known in the literature as mixtures of repeated measurements, or mixtures of grouped observations [23, 12, 10, 28, 44, 36], with applications to domains such as psychological analysis, educational assessment, and topic modeling in machine learning. The random effects model described in Section 1.3.3 of [30], in which the mixing measure is a discrete measure with a finite number of atoms, is also a special case of (1) with a normal distribution with mean . While [23, 10] consider the case that the number of components is unknown, [12, 28, 44, 36] focus on the case that is known, the same as our setup. In many of the aforementioned works the models are nonparametric, i.e., no parametric forms for the probability kernels are assumed, and the focus is on the problem of density estimation due to the nonparametric setup. By contrast, in this paper we study mixtures of product distributions (1) with the parametric form of the component distribution imposed, since in practice prior knowledge on the component distribution might be available. Moreover, we investigate the behavior of parameter estimates, namely the convergence of the parameters and , which is generally more challenging than density estimation in mixture models [32, 26, 25, 27].
Before the efficiency question can be addressed, one must consider the issue of identifiability: under what conditions does the data distribution uniquely identify the true mixing measure ? This question has occupied the interest of a number of authors [43, 11, 20], with decisive results obtained recently by [3] on finite mixture models for conditionally independent observations and by [44] on finite mixtures for conditionally i.i.d. observations (given by Eq. (1)). Here, the condition is in the form of , for some natural number possibly depending on . We shall refer to as the (minimal) zero-order identifiable length, or 0-identifiable length for short (a formal definition will be given later). For the conditionally i.i.d. case as in model (1), [44] proves that as long as , model (1) will be identifiable for any . Note that is only an upper bound on . For a given parametric form of and a given truth , the -identifiable length might be smaller than .
Drawing from existing identifiability results, it is quite apparent that the observed sequence length (or more precisely, , in the case of variable length sequences) must play a crucial role in the estimation of the mixing measure , in addition to the number of sequences. Moreover, it is also quite clear that in order to have a consistent estimate of , the number of sequences must tend to infinity, whereas may be allowed to stay fixed. It remains an open question what precise roles and play in estimating , in particular their effects on the different types of mixing parameters, namely the component parameters (atoms ) and the mixing proportions (probability mass ), and on the rates of convergence of a given estimation procedure.
Partial answers to this question were obtained in several settings of mixtures of product distributions. [23] proposed to discretize data so that the model in consideration becomes a finite mixture of products of identical binomial or multinomial distributions. Restricting to this class of models, a maximum likelihood estimator was applied, and a standard asymptotic analysis establishes the root- rate for mixing proportion estimates. [21, 20] investigated a number of nonparametric estimators for , and obtained the root- convergence rate for both mixing proportions and component parameters in the setting of mixture components under suitable identifiability conditions. It seems challenging to extend their method and theory to a more general setting, e.g., . Moreover, no result on the effect of on parameter estimation efficiency seems to be available. Recently, [34, 33] studied the posterior contraction behavior of several classes of Bayesian hierarchical models where the sample is also specified by sequences of observations. That approach requires that both and tend to infinity and thus cannot be applied to our present setting where may be fixed.
In this paper we shall present a parameter estimation theory for general classes of finite mixtures of product distributions. An application of this theory is a set of posterior contraction theorems established for a standard Bayesian estimation procedure, according to which de Finetti’s mixing measure tends toward the truth , as tends to infinity, under suitable conditions. In a standard Bayesian procedure, the statistician endows the space of parameters with a prior distribution , which is assumed to have compact support in these theorems, and applies Bayes’ rule to obtain the posterior distribution on , to be denoted by . To anticipate the distinct convergence behaviors for the atom and probability mass parameters, for any , define
where denotes all the permutations on the set . (The suitability of over other choices of metric will be discussed in Section 3).
We are given independent exchangeable sequences denoted by . We naturally require that , where is the zero-order identifiable length depending on . Moreover, to obtain concrete rates of convergence, we also need for some minimal natural number . We shall call the minimal first-order identifiable length depending on , or 1-identifiable length for short (a formal definition will be given later). Assuming that are uniformly bounded from above by an arbitrary unknown constant, Theorem 6.2 establishes that under suitable regularity conditions on , the posterior contraction rate for the mixing proportions is bounded above by , up to a logarithmic quantity. For the mixture components’ supporting atoms, the contraction rate is
Note that represents the full volume of the observed data set. More precisely, for suitable kernel families , as long as and , there holds
in -probability as for any sequence . The point here is that the constant is independent of , the sequence lengths and their supremum. In plain terms, we may say that with finite mixtures of product distributions, the posterior inference of the atoms of each individual mixture component receives the full benefit of "borrowing strength" across all sampled sequences, while the mixing probabilities gain efficiency only from the number of such sequences. This appears to be the first work in which such a posterior contraction theorem is established for the de Finetti mixing measure arising from finite mixtures of product distributions.
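As a rough numerical illustration of this "borrowing strength" point (our own sketch, not part of the paper's formal development), the code below fits a two-component mixture of N-product N(theta, 1) kernels with a basic EM algorithm and shows the atom estimates tightening as the sequence length grows while the number of sequences stays fixed; all names and modeling choices here are illustrative assumptions.

```python
import numpy as np

def em_product_gaussian(X, k, iters=200, seed=0):
    """Basic EM for a k-component mixture of N-product N(theta_j, 1) kernels.
    X has shape (m, N): each of the m exchangeable sequences shares one latent
    label.  A rough sketch: no restarts, no convergence checks."""
    rng = np.random.default_rng(seed)
    m, N = X.shape
    theta = rng.choice(X.ravel(), size=k, replace=False)   # crude initialization
    p = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: log-responsibility of component j for sequence i (up to constants)
        ll = -0.5 * ((X[:, None, :] - theta[None, :, None]) ** 2).sum(axis=2) + np.log(p)
        ll -= ll.max(axis=1, keepdims=True)
        r = np.exp(ll)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted averages of the per-sequence sample means
        p = r.mean(axis=0)
        theta = (r * X.mean(axis=1, keepdims=True)).sum(axis=0) / r.sum(axis=0)
    return np.sort(theta), p

rng = np.random.default_rng(1)
true_atoms, true_p, m = np.array([-0.5, 0.5]), [0.3, 0.7], 300
for N in (1, 25):                                 # same number of sequences, growing length
    z = rng.choice(2, size=m, p=true_p)
    X = rng.normal(true_atoms[z][:, None], 1.0, size=(m, N))
    print(N, em_product_gaussian(X, k=2)[0])      # atom estimates sharpen as N grows
```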
The Bayesian learning rates established appear intuitive, given that the parameter space is of finite dimension. On the role of , they are somewhat comparable with the previous partial results [23, 21, 20]. However, we wish to make several brief remarks at this juncture.
• First, even for exact-fitted parametric mixture models, "parametric-like" learning rates of the form root- or root- should not be taken for granted, because they do not always hold [25, 27]. This is due to the fact that the kernel family may easily violate the strong identifiability assumptions often required for the root- rate to take place. In other words, the kernel family may be only weakly identifiable, resulting in poor learning rates for a standard mixture, i.e., when .
• Second, the fact that by increasing the observed exchangeable sequence’s length so that , one may obtain parametric-like learning rates in terms of both and is a remarkable testament to how repeated measurements can help to completely overcome a latent variable model’s potential pathologies: parameter non-identifiability is overcome by making , while the inefficiency of parameter estimation inherent in weakly identifiable mixture models is overcome by . For a deeper appreciation of this issue, see Section 2 for background on the role of identifiability notions in parameter estimation.
Although the posterior contraction theorems for finite mixtures of product distributions presented in this paper are new, such results do not adequately capture the rather complex behavior of the convergence of parameters for a finite mixture of -product distributions. In fact, the heart of the matter lies in the establishment of a collection of general inverse bounds, i.e., inequalities of the form
(2)
where is the variational distance. Note that (2) provides an upper bound on the distance between mixing measures in terms of the variational distance between the corresponding mixtures of -product distributions. Inequalities of this type allow one to transfer the convergence (and learning rates) of a data population’s distribution into that of the corresponding distribution’s parameters (hence the term "inverse bounds"). Several points to highlight are:
• The local nature of (2), which may hold only for residing in a suitably small -neighborhood of whose radius may also depend on and , while the constant depends on but is independent of . In addition, the bound holds only when exceeds the threshold , unless further assumptions are imposed. For instance, under a first-order identifiability condition of , , so this bound holds for all while remaining local in nature. Moreover, inequality (2) is sharp: the quantity in cannot be improved by for any sequence such that (see Lemma 8.3).
• The inverse bounds of the form (2) are established without any overt assumption of identifiability. However, they carry striking consequences on both first-order and classical identifiability, which can be deduced from (2) under a compactness condition (see Proposition 5.1): using the notation and to denote explicitly the dependence of 0- and 1-identifiable lengths on in the first argument and its ambient space in the second argument, respectively, we have
Note that classical identifiability captured by describes a global property of the model family while first-order identifiability captured by is local in nature. The connection between these two concepts is possible because when the number of exchangeable variables gets large, the force of the central limit theorem for product distributions comes into effect to make the mixture model eventually become identifiable, either in the classical or the first-order sense, even if the model may be initially non-identifiable or weakly identifiable (when ).
• These inverse bounds hold for very broad classes of probability kernels . In particular, they are established under mild regularity assumptions on the family of probability kernels on , when either , or is a finite set, or is an abstract space. A standard but non-trivial example of our theory is the case where the kernels belong to the exponential families of distributions. A more unusual example is the case where is itself a mixture distribution on . Kernels of this type are rarely examined in theory, partly because when we set , a mixture model using such kernels typically would not be parameter-identifiable. However, such "mixture-distribution" kernels are frequently employed by practitioners of hierarchical models (i.e., mixtures of mixture distributions). As the inverse bounds entail, this practice makes sense since the parameters become more strongly identifiable and efficiently estimable with repeated exchangeable measurements.
• More generally, the inverse bounds hold when does not necessarily admit a density with respect to a dominating measure on . An example considered in the paper is the case where represents a probability distribution on the space of probability distributions, namely, represents (mixtures of) Dirichlet processes. As such, the general inverse bounds are expected to be useful for models with nonparametric mixture components represented by , the kind of models that have attracted much recent attention, e.g., [41, 37, 8, 7].
The above highlights should make clear the central role of the inverse bounds obtained in Section 4 and Section 5, which deepen our understanding of the questions of parameter identifiability and provide detailed information about the convergence behavior of parameter estimation. In addition to the asymptotic analysis of Bayesian estimation for mixtures of product distributions that will be carried out in this paper, such inverse bounds may also be useful for deriving rates of convergence for non-Bayesian parameter estimation procedures, including maximum likelihood estimation and distance-based estimation methods.
The rest of the paper will proceed as follows. Section 2 presents related work in the literature and a high-level overview of our approach and techniques. Section 3 prepares the reader with basic setups and several useful concepts of distances on the space of mixing measures that arise in mixtures of product distributions. Section 4 is a self-contained treatment of first-order identifiability theory for finite mixture models, leading to several new results that are useful for subsequent developments. Section 5 presents inverse bounds for broad classes of finite mixtures of product distributions, along with specific examples. An immediate application of these bounds is the posterior contraction theorems for de Finetti’s mixing measures, the main focus of Section 6. Particular examples of interest for the inverse bounds established in Section 5 include the case where the probability kernel is itself a mixture distribution on , and the case where is a mixture of Dirichlet processes. These examples require the development of new tools and are deferred to Section 7. Section 8 gives several technical results demonstrating the sharpness of the established inverse bounds, which are then used to derive minimax lower bounds for estimation procedures of de Finetti’s mixing parameters. Section 9 discusses extensions and several future directions. Finally, most proofs of theorems and lemmas are provided in the Appendix.
Notation For any probability measures and on a measure space with densities and respectively with respect to some base measure , the variational distance between them is . The Hellinger distance is given by . The Kullback-Leibler divergence of from is . Write for the product measure of and , and for the -fold product of . Any vector is a column vector with its -th coordinate denoted by . The inner product between two vectors and is denoted by or . Denote by or a positive finite constant depending only on its parameters and the probability kernel . In the presentation of inequality bounds and proofs, such constants may differ from line to line. Write if for some universal constant ; write if . Write if and ; write (or ) if and .
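For readers who prefer to see these quantities computed, here is a small numerical companion evaluating the variational (total variation) distance, the Hellinger distance, and the Kullback-Leibler divergence for two discrete distributions given as probability vectors; the normalizations used below are one common convention and are assumptions on our part, since the exact formulas are elided above.

```python
import numpy as np

def variational(p, q):
    """Total variation distance: half the L1 distance between the densities."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def hellinger(p, q):
    """Hellinger distance: L2 distance between square-root densities, scaled by 1/sqrt(2)."""
    return np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q), assuming q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

p, q = [0.2, 0.5, 0.3], [0.25, 0.45, 0.30]
print(variational(p, q), hellinger(p, q), kl(p, q))
# With these normalizations the familiar relations V^2/2 <= h^2 <= V and 2 h^2 <= KL
# can be checked numerically.
```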
2 Background and overview
2.1 First-order identifiability and inverse inequalities
In order to shed light on the convergence behavior of model parameters as the data sample size increases, stronger forms of identifiability conditions shall be required of the family of probability kernels . For finite mixture models, such conditions are often stated in terms of a suitable derivative of the density of with respect to parameter , and the linear independence of such derivatives as varies in . The impacts of such identifiability conditions, or the lack thereof, on the convergence of parameter estimation can be quite delicate. Specifically, let and fix , so we have . Assume that admits a density function with respect to Lebesgue measure on , and for all , is differentiable with respect to ; moreover the combined collection of functions and is linearly independent. This type of condition, which concerns linear independence of the first derivatives of the likelihood functions with respect to parameter , shall be generically referred to as a first-order identifiability condition of the probability kernel family . A version of such a condition was investigated by [26], who showed that their condition is sufficient for establishing an inverse bound of the form
(3)
where denotes the first-order Wasserstein distance metric on . The infimum limit quantifier should help to clarify somewhat the local nature of the inverse bound (2) mentioned earlier. The development of this local inverse bound and its variants plays the fundamental role in the analysis of parameter estimation with finite mixtures in a variety of settings in previous studies, where stronger forms of identifiability conditions based on higher order derivatives may be required [9, 32, 38, 26, 25, 22, 27]. In addition, [32, 34] studied inverse bounds of this type for infinite mixture and hierarchical models.
As noted by [26], for the exact-fitted setting of mixtures, i.e., when the number of mixture components is known, conditions based on only first-order derivatives of will suffice. Under a suitable first-order identifiability condition based on linear independence of , along with several additional regularity conditions, the mixing measure may be estimated via an -i.i.d. sample at the parametric rate of convergence , due to (3) and the fact that the data population density is typically estimated at the same parametric rate. However, first-order identifiability may not be satisfied, as is the case for the two-parameter gamma kernel or the three-parameter skewnormal kernel, following from the fact that these kernels are governed by certain partial differential equations. In such situations, not only does the resulting Fisher information matrix of the mixture model become singular, but the singularity structure of the matrix can also be extremely complex; an in-depth treatment of weakly identifiable mixture models can be found in [27]. Briefly speaking, in such situations (3) may not hold and the rate may not be achieved [25, 27]. In particular, in the case of skewnormal kernels, extremely slow rates of convergence for the component parameters (e.g., and so on) may be established depending on the actual parameter values of the true for a standard Bayesian estimation or maximum likelihood estimation procedure [27]. It remains unknown whether it is possible to devise an estimation procedure to achieve the parametric rate of convergence when the finite mixture model is only weakly identifiable, i.e., when the first-order identifiability condition fails.
In Section 4 we shall revisit the first-order identifiability notions described above, present considerable improvements upon the existing theory, and deliver several novel results. First, we identify a tightened set of conditions concerning linear independence of and according to which the inverse bound (2) holds. This set of conditions turns out to be substantially weaker than the identifiability condition of [26], most notably by requiring to be differentiable with respect to only for in a subset of with positive measure. This weaker notion of first-order identifiability allows us to broaden the scope of probability kernels for which the inverse bound (3) continues to apply (see Lemma 4.2). Second, in a precise sense we show that this notion is in fact necessary for (3) to hold (see Lemma 4.4), giving us an arguably complete characterization of first-order identifiability and its relation to the parametric learning rate for model parameters. Among other new results, it is worth mentioning that when the kernel family belongs to an exponential family of distributions on , there is a remarkable equivalence among our notion of first-order identifiability, the inverse bound of the form (3), and the inverse bound in which the variational distance is replaced by the Hellinger distance (see Lemma 4.15).
Turning our attention to finite mixtures of product distributions, a key question is on the effect of number of repeated measurements in overcoming weak identifiability (e.g., the violation of first-order identifiability). One way to formally define the first-order identifiable length (1-identifiable length) is to make it the minimal natural number such that the following inverse bound holds for any
(4)
The key question is whether a (finite) -identifiable length exists, and how we can characterize it. The significance of this concept is that one can achieve first-order identifiability by allowing at least repeated measurements and obtain the learning rate for the mixing measure. In fact, the component parameters can be learned at the rate , the square root of the full volume of exchangeable data (modulo a logarithmic term). The resolution of the question of existence and characterization of leads us to establish a collection of inverse bounds involving mixtures of product distributions that we will describe next. Moreover, such inverse bounds are essential in deriving learning rates for the mixing measure from a collection of exchangeable sequences of observations.
2.2 General approach and techniques
For finite mixtures of -product distributions, for , the precise expression for the inverse bound to be established takes the form: under certain conditions of the probability kernel , for a given ,
(5)
Compared to inverse bound (3) for a standard finite mixture, the double infimum limits reveal the challenge of analyzing mixtures of -product distributions; they express the delicate nature of the inverse bound informally described via (2). Moreover, (5) entails that the finite 1-identifiable length defined by (4) exists.
Inverse bound (5) will be established for broad classes of kernel and it can be shown that this bound is sharp. Among the settings of kernel to which the bound applies is the setting where belongs to any regular exponential family of distributions. More generally, this includes the setting where may be an abstract space; no parametric assumption on will be required. Instead, we appeal to a set of mild regularity conditions on the characteristic function of a push-forward measure produced by a measurable map acting on the measure space . In fact, a stronger bound relating to the positivity of a notion of curvature on the space of mixtures of product distributions is established (see (23)). We will see that this collection of inverse bounds, which are presented in Section 5, enables the study of a very broad range of mixtures of product distributions for exchangeable sequences.
The theorems establishing (5) and (23) represent the core of the paper. For simplicity, let us describe the gist of our proof techniques by considering the case where the kernel belongs to an exponential family of distributions on (see Theorem 5.8). Suppose the kernel admits a density function with respect to a dominating measure on . At a high level, this is a proof by contradiction: if (5) does not hold, then there exists a strictly increasing subsequence of natural numbers according to which there exists a sequence of mixing measures such that as and the integral form
(6)
tends to zero. One may be tempted to apply Fatou’s lemma to deduce that the integrand must vanish as , and from that one may hope to derive a contradiction with the specified hypothesis on the probability kernel (e.g. first-order identifiability). This is basically the proof technique of Lemma 4.2 for establishing inverse bound (3) for finite mixtures. But this would not work here, because the integration domain’s dimensionality increases with . Instead we can exploit the structure of the mixture of -product densities in , and rewrite the integral as an expectation with respect to a suitable random variable on a fixed domain. What comes to our rescue is the central limit theorem, which is applied to a -valued random variable , where denotes the expectation taken with respect to the probability distribution for some suitable chosen among the support of the true mixing measure . Here denotes the sufficient statistic for the exponential family distribution , for each .
Continuing with this plan, by a change of measure the integral in (6) may be expressed as the expectation of the form for some suitable function . By exploiting the structure of the exponential families dictating the form of , it is possible to obtain that for any sequence , there holds for a certain function . Since converges in distribution to a non-degenerate zero-mean Gaussian random vector in , it entails that converges to in distribution by a generalized continuous mapping theorem [46]. Coupled with a generalized Fatou’s lemma [5], we arrive at , which can be verified as a contradiction.
For the general setting where is a family of probability distributions on a measure space , the basic proof structure remains the same, but we can no longer exploit the (explicit) parametric assumption on the kernel family (see Theorem 5.16). Since the primary object of inference is parameter , the assumptions on the kernel will center on the existence of a measurable map for some , and regularity conditions on the push-forward measure on : . This measurable map plays the same role as that of the sufficient statistic when belongs to the exponential family. The main challenge lies in the analysis of the function described in the previous paragraph. It is here that the power of Fourier analysis is brought to bear on the analysis of and the expectation . By the Fourier inversion theorem, may be expressed entirely in terms of the characteristic function of the push-forward measure . Provided regularity conditions on such a characteristic function hold, one is able to establish the convergence of toward a certain function as before.
We shall provide a variety of examples demonstrating the broad applicability of Theorem 5.16, focusing on the cases where does not belong to an exponential family of distributions. In some cases, checking for the existence of the map is straightforward. When is a complex object, in particular when is itself a mixture distribution, this requires substantial work, as should be expected. In this example, the burden of checking the applicability of Theorem 5.16 lies primarily in evaluating certain oscillatory integrals composed of the map in question. Tools from harmonic analysis of oscillatory integrals will be developed for this purpose and presented in Section 7. We expect that the tools developed here present a useful stepping stone toward a more satisfactory theoretical treatment of complex hierarchical models (models that may be viewed as mixtures of mixtures of distributions, e.g. [41, 37, 34, 8]), which have received broad and increasingly deep attention in the literature.
3 Preliminaries
We start by setting up basic notions required for the analysis of mixtures of product distributions. We are given exchangeable data sequences denoted by for , where denotes the length of sequence . For ease of presentation, for now, we shall assume that for all . Later we will allow variable length sequences. These sequences are composed of elements in a measurable space . Examples include the case , the case where is a discrete space, and the case where is a space of measures. Regardless, parameters of interest are always encapsulated by discrete mixing measures in , the space of discrete measures with distinct support atoms residing in .
The linkage between parameters of interest, i.e., the mixing measure , and the observed data sequences is achieved via the mixture of product distributions that we now define. Consider a family of probability distributions on measurable space , where is the parameter of the family and is the parameter space. Throughout this paper it is assumed that the map is injective. For , the -product probability family is denoted by on , where is the product sigma-algebra. Given a mixing measure , the mixture of -product distributions induced by is given by
Each exchangeable sequence , for , is an independent sample distributed according to . Due to the role they play in the composition of distribution , we also refer to as a family of probability kernels on .
In order to quantify the convergence of mixing measures arising in mixture models, a useful device is a suitably defined optimal transport distance [32, 31]. Consider the Wasserstein- distance w.r.t. a distance on : for , define
(7)
where the infimum is taken over all joint probability distributions on such that, when expressing as a matrix, the marginal constraints hold: and . For the special case when is the Euclidean distance, write simply instead of . Write if converges to under the distance w.r.t. the Euclidean distance on .
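As a computational aside (not from the paper), the optimal transport program in (7) can be solved directly as a linear program for discrete mixing measures; the sketch below does this for scalar atoms using scipy's linprog, with the marginal constraints written out explicitly.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein1(atoms1, w1, atoms2, w2):
    """W_1 between two discrete mixing measures on the real line: minimize
    sum_ij q_ij |theta_i - theta'_j| over couplings q whose row sums equal w1
    and whose column sums equal w2 (the marginal constraints in (7))."""
    a1, a2 = np.asarray(atoms1, float), np.asarray(atoms2, float)
    cost = np.abs(np.subtract.outer(a1, a2))          # k x k' cost matrix
    k, kp = cost.shape
    A_eq = np.zeros((k + kp, k * kp))
    for i in range(k):                                # row marginals
        A_eq[i, i * kp:(i + 1) * kp] = 1.0
    for j in range(kp):                               # column marginals
        A_eq[k + j, j::kp] = 1.0
    b_eq = np.concatenate([w1, w2])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Two nearby mixing measures with k = 2 atoms each
print(wasserstein1([-1.0, 2.0], [0.3, 0.7], [-0.9, 2.1], [0.35, 0.65]))
```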
For mixing measures arising in mixtures of -product distributions, a more useful notion is the following. For any and , define
where denotes all the permutations on the set . It is simple to verify that is a valid metric on for each and to relate it to a suitable optimal transport distance metric. Indeed, , due to the permutation invariance of its atoms, can be identified with a set , which can further be identified with . Formally, we define a map by
(8)
Now, endow with a metric defined by and note the following fact.
Lemma 3.1.
For any and distance on ,
A proof of the preceding lemma is available as Proposition 2 in [31]. By applying Lemma 3.1 with , replaced respectively by and , we obtain that for any , , which validates that is indeed a metric on , and moreover that it does not depend on the specific representations of and .
The next lemma establishes the relationship between and on .
Lemma 3.2.
The following statements hold.
a) A sequence converges to under if and only if converges to under . That is, and generate the same topology.
b) Let be bounded. Then for any . More generally for any and ,
c) Fix . Then and . That is, in a neighborhood of in , .
d) Fix and suppose is bounded. Then for any , where constant depends on and .
We see that and generate the same topology on , and they are equivalent when fixing one argument. The benefit of is that it is defined on , while is only defined on for each , since its definition requires the two arguments to have the same number of atoms. Moreover, allows us to quantify the distinct convergence behavior of atoms and probability masses, by placing different factors on the atom and the probability mass parameters, whereas on would fail to do so, because couples the atoms and probability mass parameters (see Example A.1 for such an attempt).
The factor present in the definition of arises from the anticipation that when we have independent exchangeable sequences of length , the dependence of the standard estimation rate on for component parameters will be of order . Indeed, given one single exchangeable sequence from some component parameter , as the coordinates in this sequence are conditionally independent and identically distributed, the standard rate for estimating is . On the other hand, the mixing proportion parameters cannot be estimated from a single such sequence (i.e., if ). One expects that for such parameters the number of sequences coming from the corresponding component among all exchangeable sequences plays a more important role. In summary, the distance will be used to capture precisely the distinct convergence behavior due to the length of observed exchangeable sequences.
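To make the role of this length-dependent factor concrete, here is a small sketch of a permutation-matching distance of the same flavor for two mixing measures with the same number of atoms; the specific weighting below (a sqrt(N) factor on atom discrepancies and none on the probability masses) is our own illustrative choice and only mimics the definition elided above.

```python
import numpy as np
from itertools import permutations

def matching_distance(atoms1, w1, atoms2, w2, N=1):
    """Permutation-matching discrepancy between two k-atom mixing measures:
    min over relabelings tau of sqrt(N) * max_j |theta_j - theta'_tau(j)|
    plus max_j |p_j - p'_tau(j)|.  Illustrative weighting only."""
    a1, a2 = np.asarray(atoms1, float), np.asarray(atoms2, float)
    w1, w2 = np.asarray(w1, float), np.asarray(w2, float)
    best = np.inf
    for tau in permutations(range(len(a1))):
        tau = list(tau)
        val = np.sqrt(N) * np.max(np.abs(a1 - a2[tau])) + np.max(np.abs(w1 - w2[tau]))
        best = min(best, val)
    return best

# The same parameter error is penalized more heavily on the atoms as N grows
for N in (1, 10, 100):
    print(N, matching_distance([-1.0, 2.0], [0.3, 0.7], [-0.9, 2.1], [0.35, 0.65], N=N))
```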
4 First-order identifiability theory
When , a finite mixture of -product distributions reduces to a standard finite mixture of distributions. Mixture components are modeled by a family of probability kernels on , where is the parameter of the family and is the parameter space. As discussed in the introduction, throughout the paper we assume that the map is injective; it is the nature of the map that we are after. Within this section, we further assume that has a density w.r.t. a dominating measure on . Combining multiple mixture components using a mixing measure on results in the finite mixture distribution, which admits the following density with respect to : . The goal of this section is to provide a concise and self-contained treatment of identifiability of finite mixture models. We lay down basic foundations and present new results that will prove useful for the general theory of mixtures of product distributions to be developed in the subsequent sections.
4.1 Basic theory
The classical identifiability condition posits that uniquely identifies for all . This condition is satisfied if the collection of density functions are linearly independent. To obtain rates of convergence for the model parameters, it is natural to consider the following condition concerning the first-order derivative of with respect to .
Definition 4.1.
The family is first-order identifiable if
(i) for every in the -positive subset where , is first-order differentiable w.r.t. at ; and
(ii) is a set of distinct elements and the system of two equations with variable :
(9a) (9b)
has only the zero solution: .
This definition specifies a condition that is weaker than the definition of identifiability in the first order in [26], since it only requires to be differentiable at a finite number of points and it requires linear independence of the functions at those points. Moreover, it does not require as a function of to be differentiable for -a.e. . Our definition requires only linear independence between the density and its derivative w.r.t. the parameter under the constraints on the coefficients specified by (9b). (Having said that, we are not aware of any simple example that differentiates the system (9a), (9b) from (9a) alone. In fact, Lemma 4.13 b) establishes that, under some regularity conditions, the system (9a), (9b) has the same solution set as (9a).) We will see shortly that, in a precise sense, the conditions given in Definition 4.1 are also necessary.
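As a purely numerical heuristic (not part of the formal theory), one can probe the linear-independence requirement behind (9a) by forming the Gram matrix of the densities and their parameter derivatives on a quadrature grid: a nearly singular Gram matrix signals a nonzero solution. The sketch below does this for a Gaussian location kernel, for which the condition holds; the function names and grid are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def first_order_gram(pdf, dpdf, thetas, grid):
    """Gram matrix of {f(.|theta_j), (d/dtheta) f(.|theta_j)} approximated by a
    Riemann sum on a uniform grid; a (numerically) zero smallest eigenvalue
    indicates linear dependence, i.e. failure of the condition in (9a)."""
    rows = []
    for t in thetas:
        rows.append(pdf(grid, t))
        rows.append(dpdf(grid, t))
    F = np.vstack(rows)
    G = F @ F.T * (grid[1] - grid[0])      # approximate L2 inner products
    return np.linalg.eigvalsh(G)

grid = np.linspace(-10, 10, 4001)
pdf  = lambda x, t: norm.pdf(x, loc=t)
dpdf = lambda x, t: (x - t) * norm.pdf(x, loc=t)   # derivative of the N(t, 1) density in t
print(first_order_gram(pdf, dpdf, thetas=[-1.0, 2.0], grid=grid)[0])   # > 0: independent
```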
The significance of first-order identifiability conditions is that they entail a collection of inverse bounds that relate the behavior of some form of distances on mixture densities to a distance between corresponding parameters described by , as tends toward . Denote the interior of . For any , define
(10)
It is obvious that for small .
Lemma 4.2 (Consequence of first-order identifiability).
Let . Suppose that the family is first-order identifiable in the sense of Definition 4.1 for some .
a) Then
(11)
b) If in addition, for every in , is continuously differentiable w.r.t. in a neighborhood of for , then
(12)
To put the above claims in context, note that the following inequality holds generally for any probability kernel family (even those without a density w.r.t. a dominating measure, see Lemma 8.1):
(13)
Note also that
(14)
for any probability kernel and any . Thus (12) entails (11). However, (11) is sufficient for translating a learning rate for estimating a population distribution into that of the corresponding mixing measure . To be concrete, if we are given an -i.i.d. sample from a parametric model , a standard estimation method yields root- rate of convergence for density , which means that the corresponding estimate of admits root- rate as well.
Remark 4.3.
Lemma 4.2 a) is a generalization of Theorem 3.1 in [26] in several respects. Firstly, the first-order identifiability assumption in Lemma 4.2 is weaker, since identifiability in the first order in the sense of [26] implies first-order identifiability with . Example B.1 gives a specific instance which satisfies the notion of first-order identifiability specified by Definition 4.1 but not the condition specified by [26]. Secondly, it turns out that the uniform Lipschitz assumption in Theorem 3.1 in [26] is redundant and Lemma 4.2 a) does not require it. Lemma 4.2 b) is an extension of [22, equation (20)] in a similar sense as the above. Finally, given additional features of , the first-order identifiability notion can be further simplified (see Section 4.2).
Proof of Lemma 4.2.
Suppose that the lower bound in (11) does not hold. Then there exist , such that
We may write such that and as . With a subsequence argument if necessary, we may further require
(15)
where and the components of are in and . Moreover, for sufficiently large , which implies
It also follows that at least one of is not or one of is not . On the other hand,
where the second inequality follows from Fatou’s Lemma. Then for . Thus we find a nonzero solution to (9a), (9b) with replaced by . However, this contradicts the definition of first-order identifiability.
Proof of part b) continues in the Appendix. ∎
Lemma 4.2 states that under (i) in Definition 4.1, the constrained linear independence between the density and its derivative w.r.t. the parameter (item (ii) in the definition) is sufficient for (11) and (12). For a converse result, the next lemma shows (ii) is also necessary provided that (i) holds for some -negligible and satisfies some regularity condition.
Lemma 4.4 (Lack of first-order identifiability).
Fix . Suppose
a) there exists (that possibly depends on ) such that and for every , is differentiable with respect to at ;
b)
c) for each , there exists such that for any
where is integrable with respect to the measure .
Then
(16)
Lemma 4.4 presents the consequence of the violation of first-order identifiability. Indeed, the conclusion (16) suggests that may vanish at a much slower rate than , i.e., the convergence of parameters representing may be much slower than the convergence of data distribution .
Remark 4.5.
Condition c) in Lemma 4.4 serves to guarantee the exchange of the order of the limit and the integral, and one may replace it by any other condition serving the same purpose. A byproduct of this condition is that it renders the constraint (9b) redundant (see Lemma 4.13 b)). While condition c) is tailored for an application of the dominated convergence theorem in the proof, one may instead tailor the following condition for Pratt’s Lemma.
Condition c’): there exists such that , ,
where satisfies .
Condition c’) is weaker than condition c) since the former reduces to the latter if one lets .
Combining all the conditions in Lemma 4.2 and Lemma 4.4, one immediately obtains the following equivalence between (11), (12) and the first-order identifiability condition.
Corollary 4.6.
Fix . Suppose for -a.e. , as a function is continuously differentiable in a neighborhood of for each . Suppose that for any and for each there exists such that for any ,
(17)
where satisfies . Here possibly depends on and . Then (12) holds if and only if (11) holds if and only if (9a) with replaced respectively by has only the zero solution.
Next, we highlight the role of condition c) of Lemma 4.4 in establishing either inverse bound (11) or (16) based on our notion of first-order identifiability. As mentioned, condition c) posits the existence of an integrable envelope function to ensure the exchange of the limit and integral. Without this condition, the conclusion (16) of Lemma 4.4 might not hold. The following two examples demonstrate the role of c), and serve as examples which are not first-order identifiable but for which inverse bound (11) still holds.
Example 4.7 (Uniform probability kernel).
Consider the uniform distribution family with parameter space . This family is defined on with the dominating measure being the Lebesgue measure. It is easy to see that is differentiable w.r.t. at and
So is not first-order identifiable by our definition. Note that for any , this family does not satisfy assumption c) in Lemma 4.4 and hence Lemma 4.4 is not applicable. Indeed, by Lemma 4.8 this family satisfies (11) and (12) for any and .
Lemma 4.8.
Example 4.9 (Location-scale exponential distribution kernel).
Consider the location-scale exponential distribution on , with density with respect to Lebesgue measure given by with parameter and parameter space . It is easy to see is differentiable w.r.t. at and
So is not first-order identifiable. Note that for any , this family does not satisfy the third assumption in Lemma 4.4 and hence Lemma 4.4 is not applicable. Indeed, by Lemma 4.10 this family satisfies (11) for any and . This lemma also serves as a correction of an erroneous result (Prop. 5.3 of [25]). The mistake in their proof may be attributed to failing to account for the envelope condition c), which arises due to the shifted supports of mixture components with distinct values. Interestingly, Lemma 4.10 also establishes that the stronger version of the inverse bounds, namely inequality (12), does not hold for some .
Lemma 4.10.
In some contexts it is of interest to establish inverse bounds for the Hellinger distance rather than the variational distance on mixture densities, e.g., in the derivation of minimax lower bounds. Since , the inverse bound (11), which holds under first-order identifiability, immediately entails that
Similarly, (12) entails that
For a converse result, the following is the Hellinger counterpart of Lemma 4.4.
Lemma 4.11.
Fix . Suppose that
a) there exists (that possibly depends on ) such that and for every , is differentiable with respect to at ;
b) the density family has common support, i.e. does not depend on ;
c) (9a) with replaced respectively by has a nonzero solution ;
d) there exists such that , ,
where satisfies .
Then
(18)
4.2 Finer characterizations
In order to verify whether the first-order identifiability condition is satisfied for a given probability kernel family , according to Definition 4.1 one needs to check that the system of equations (9a) and (9b) has no nonzero solutions. For many common probability kernel families, the presence of a normalizing constant can make this verification challenging, because the normalizing constant is a function of , which may have a complicated form or no closed form, and its derivative can also be complicated. Fortunately, the following lemma shows that under a mild condition one only needs to perform the check for the kernel family defined up to a function of that is constant in . Moreover, under additional mild assumptions, the equation (9b) can also be dropped from the verification.
Lemma 4.13.
Suppose for every in the -positive subset for some , is differentiable with respect to at . Let be a positive differentiable function on and define .
a)
b) Suppose . For a fixed set and for each there exists such that for any ,
(19)
where is -integrable. Here and depend on and . Then is a solution of (9a) if and only if it is a solution of the system of equations (9a), (9b). Moreover, (19) holds for some -integrable if and only if the same inequality with on the left side replaced by holds for some -integrable .
c)
Remark 4.14.
Part b), or Part c), of Lemma 4.13 shows that under a differentiability condition (i.e. ) and a regularity condition on the density ensuring the interchange of the limit and the integral, the equation (9b) in the definition of first-order identifiability adds no additional constraint and is redundant. In this case, when verifying first-order identifiability, we can simply check whether (9a) has only the zero solution. In addition, when some is available and is simpler than , according to Part c) of Lemma 4.13 it is sufficient for first-order identifiability to check whether (9a) with replaced by has only the zero solution, provided that the for corresponding to and (19) with on the left side replaced by hold.
Probability kernels in exponential families of distributions are frequently employed in practice. For these kernels, there is a remarkable equivalence among first-order identifiability and inverse bounds for both the variational distance and the Hellinger distance.
Lemma 4.15.
Suppose that the probability kernel has a density function in the full rank exponential family, given in its canonical form with , the natural parameter space. Then (9a) has the same solutions as the system of equations (9a), (9b). Moreover for a fixed the following five statements are equivalent:
a)
b)
c)
d)
e) With replaced respectively by , equation (9a) has only the zero solution.
Parts c) and d) in Lemma 4.15 are not used in this paper beyond the current section, but they may be of independent interest. In the last result, the exponential family is in its canonical form. The same conclusions hold for the exponential family represented in general parametrizations. Recall that a homeomorphism is a continuous function that has a continuous inverse.
Lemma 4.16.
Suppose the probability kernel has a density function in the full rank exponential family, . Suppose the map is a homeomorphism. Fix . Suppose the Jacobian matrix of the function , denoted by , exists and has full rank at for . Then with replaced respectively by , , (9a) has the same solutions as the system of equations (9a), (9b). Moreover, statements b), d) and e) in Lemma 4.15 are equivalent. If in addition exists and is continuous in a neighborhood of for each , then the equivalence of all five statements in Lemma 4.15 holds.
Despite the simplicity of kernels in the exponential families, classical and/or first-order identifiability is not always guaranteed. For instance, it is well-known and can be checked easily that the mixture of Bernoulli distributions is not identifiable in the classical sense. We will study the Bernoulli kernel in the context of mixtures of product distributions in Example 5.11. The following example is somewhat less well-known.
Example 4.17 (Two-parameter gamma kernel).
Consider the gamma distribution
with and the dominating measure is the Lebesgue measure on . This is a full rank exponential family. For define as
For any , let be such that , i.e. and . Then observing
with , , and the rest set to zero is a nonzero solution of the system of equations (9a), (9b) with replaced respectively by . Write the gamma distribution in exponential family form as in Lemma 4.16 with and . Since satisfies all the conditions in Lemma 4.16, it follows that
This implies that even if vanishes at a fast rate, may not.
Finite mixtures of gamma distributions were investigated by [25], who called a pathological set of parameter values to highlight the effects of weak identifiability (more precisely, the violation of first-order identifiability conditions) on the convergence behavior of model parameters when the parameter values fall in . (On the other hand, for , it is shown in the proof of Proposition 5.1 (a) in [25] that (9a) with replaced respectively by has only the zero solution. Their original proof works under the stringent condition for the parameter space, but multiplying their (26) by reaches the same conclusion for the general case . A direct proof is also straightforward by using Lemma B.3 b) and is similar to Example 5.13.) Thus by Lemma 4.16,
Notice that a) and c) in Lemma 4.15 also hold but are omitted here. Thus, outside of the pathological set the convergence rate of the mixture density towards is carried over to the convergence of toward under . It is the uncertainty about whether the true mixing measure is pathological or not that makes parameter estimation highly inefficient. Given an -i.i.d. sample from a finite mixture of gamma distributions, where the number of components is given, [25] established a minimax bound for estimating that is slower than any polynomial rate for any under the metric.
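The derivative identity underlying the pathological set of Example 4.17 is easy to verify numerically; the check below assumes the shape-rate parametrization of the gamma density (so scale = 1/b in scipy) and confirms that the derivative in the rate parameter at (a, b) is a linear combination of the densities at (a, b) and (a+1, b).

```python
import numpy as np
from scipy.stats import gamma

# Identity behind the pathological set:
#   d/db f(x | a, b) = (a/b) * [ f(x | a, b) - f(x | a+1, b) ],
# so when two atoms satisfy a2 = a1 + 1 and b2 = b1, the densities and their
# derivatives are linearly dependent and the system (9a) admits a nonzero solution.
a, b = 2.5, 1.3
x = np.linspace(0.01, 30.0, 2000)
f = lambda aa, bb: gamma.pdf(x, aa, scale=1.0 / bb)         # shape-rate parametrization
eps = 1e-6
dfdb = (f(a, b + eps) - f(a, b - eps)) / (2 * eps)          # numerical derivative in b
print(np.max(np.abs(dfdb - (a / b) * (f(a, b) - f(a + 1, b)))))   # close to zero
```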
We end this section with several remarks to highlight the concern for parameter estimation for mixture models under weak identifiability and to set the stage for the next section.
Remark 4.18.
a) It may be of interest to devise an efficient parameter estimation method (by, perhaps, a clever regularization or reparametrization technique) that may help to overcome the lack of first-order identifiability. We are not aware of a general way to achieve this. Absent such methods, a promising direction for the statistician to take is to simply collect more data: not only by increasing the number of independent observations, but also by increasing the number of repeated measurements. Finite mixtures of product distributions usually arise in this practical context: when one deals with a highly heterogeneous data population which is made up of many latent subpopulations carrying distinct patterns, it is often possible to collect observations presumably coming from the same subpopulation, even if one is uncertain about the mixture component that a subpopulation may be assigned to. Thus, one may aim to collect independent sequences of exchangeable observations, and assume that they are sampled from a finite mixture of -product distributions denoted by . Such possibilities arise naturally in practice. As a concrete example, [12] applied a finite mixture model with repeated measurements to observations from an education assessment study. In this study, each child is presented with a sequence of two-dimensional line drawings of a rectangular vessel, each drawn in a different tilted position. Each child is then asked to draw on these figures how the water in the vessel would appear if the vessel were half filled with water. Thus the observations from each child can be represented as a vector of exchangeable data, and the experimenter can increase the length by presenting each child with more independent random rectangular vessels. Other examples and applications include psychological studies [23, 10] and topic modeling [36].
b) One natural question to ask is how increasing the number of repeated measurements (i.e., the length of the exchangeable sequences) helps to overcome the lack of strong identifiability, such as our notion of first-order identifiability. This question can be made precise in light of Lemma 4.2: does there exist a natural number such that the following inverse bound holds for any
(20)
Observe that since increases in while the denominator is fixed in , if (20) holds for some , then it also holds for all . Moreover, what can we say about the role of in parameter estimation in the presence of such inverse bounds? In the sequel these questions will be addressed by establishing inverse bounds for mixtures of product distributions. Such theory will also be used to derive tight learning rates for the mixing measure from a collection of exchangeable sequences of observations.
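A minimal sketch of the phenomenon raised in this remark (our own illustration, assuming Bernoulli kernels): two distinct two-atom mixing measures with the same mean induce exactly the same mixture when N = 1, but already at N = 2 the induced product mixtures differ for this pair, so repeated measurements restore identifiability.

```python
import numpy as np
from itertools import product

def product_bernoulli_mixture(atoms, weights, N):
    """Joint pmf of (X_1, ..., X_N) under the mixture sum_j p_j Bernoulli(theta_j)^{otimes N}."""
    atoms, weights = np.asarray(atoms, float), np.asarray(weights, float)
    pmf = {}
    for x in product([0, 1], repeat=N):
        per_component = np.prod([atoms ** xi * (1 - atoms) ** (1 - xi) for xi in x], axis=0)
        pmf[x] = float(weights @ per_component)
    return pmf

G1 = ([0.2, 0.6], [0.5, 0.5])    # two mixing measures sharing the same mean 0.4
G2 = ([0.3, 0.5], [0.5, 0.5])
for N in (1, 2, 3):
    p1, p2 = product_bernoulli_mixture(*G1, N), product_bernoulli_mixture(*G2, N)
    tv = 0.5 * sum(abs(p1[x] - p2[x]) for x in p1)
    print(N, tv)    # N = 1: distance 0 (not distinguishable); N >= 2: strictly positive
```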
5 Inverse bounds for mixtures of product distributions
Consider a family of probability distributions on some measurable space where is the parameter of the family and is the parameter space. This yields the -product probability kernel on , which is denoted by . For any as mixing measure, the resulting finite mixture for the -product families is a probability measure on , namely, .
The main results of this section are stated in Theorem 5.8 and Theorem 5.16. These theorems establish the following inverse bound under certain conditions on the probability kernel family and sometimes on : for a fixed there holds
(21)
By contrast, an easy upper bound on the left side of (21) holds generally (cf. Lemma 8.1):
(22)
In fact, a strong inverse bound can also be established:
(23)
These inverse bounds relate to the positivity of a suitable notion of curvature on the space of mixtures of product distributions, and will be shown to have powerful consequences. It is easy to see that (23) implies (21), which in turn entails (20).
Section 5.2 is devoted to establishing these bounds for belonging to exponential families of distributions. In Section 5.3 the inverse bounds are established for very general probability kernel families, where may be an abstract space and no parametric assumption on will be required. Instead, we appeal to a set of mild regularity conditions on the characteristic function of a push-forward measure produced by a measurable map acting on the measure space . We will see that this general theory enables the study of a very broad range of mixtures of product distributions for exchangeable sequences.
5.1 Implications on classical and first-order identifiability
Before presenting the section’s main theorems, let us explicate some immediate implications of their conclusions expressed by inequalities (21) and (23). These inequalities contain detailed information about the convergence behavior of de Finetti’s mixing measure toward , a useful application of which will be demonstrated in Section 6. For now, we simply highlight striking implications on the basic notions of identifiability of mixtures of distributions investigated in Section 4. Note that no overt assumption on classical or first-order identifiability is required in the statement of the theorems establishing (21) or (23). In plain terms these inequalities entail that by increasing the number of exchangeable measurements, the resulting mixture of -product distributions becomes identifiable in both the classical and the first-order sense, even if it is not initially so, i.e., when or is small.
Define, for any , and ,
(24)
is called the minimal zero-order identifiable length (with respect to and ), or -identifiable length for short. is called the minimal first-order identifiable length (with respect to and ), or -identifiable length for short. Since in a small neighborhood of (see Lemma 3.2 c)), the two metrics are interchangeable in the denominator of the above definition for and . Note that , depending on a set to be specified, describes a global property of classical identifiability, a notion determined mainly by the algebraic structure of the mixture model’s kernel family and its parametrization. (This is also known as "strict identifiability", cf., e.g., [3]). On the other hand, both and characterize a local behavior of mixture densities near a certain , a notion that relies primarily on regularity conditions of the kernel, as we shall see in what follows. When it is clear from the context, we may use or for for brevity. Similar rules apply to and .
The following proposition provides the link between classical identifiability and strong notions of local identifiability provided either (21) or (23) holds. In a nutshell, as gets large, the two types of identifiability can be connected by the force of the central limit theorem applied to product distributions, which is one of the key ingredients in the proof of the inverse bounds. Define two related and useful quantities: for any
(25)
(26) |
Note that (21) means , while (23) means . The following proposition explicates the connections among , , , and .
Proposition 5.1.
The following statements hold.
-
a)
Consider any , then . Moreover, there exists such that
- b)
-
c)
There holds for any subset
- d)
- e)
- f)
Remark 5.2.
Part a) and part b) of Proposition 5.1 highlight an immediate significance of inverse bounds (21) and (23): they establish pointwise finiteness of 1-identifiable length . Moreover, under the additional condition on first-order identifiability, one can have the following strong result as an immediate consequence: Consider any . If (11) and (21) hold, then . If (12) and (23) hold, then .
Remark 5.3.
Proposition 5.1 c) relates , the uniform -identifiable length, to the uniform -identifiable length . Combining this with parts e) and f) and inverse bounds (21) and (23), one arrives at a rather remarkable consequence: for a compact subset of , they yield the finiteness of both the 0-identifiable length and the 1-identifiable length uniformly over subsets of mixing measures with a finite number of support points. In particular, as long as (21) or (23) (along with some regularity conditions in the former) holds for every , then will be strictly identifiable and first-order identifiable on for sufficiently large . That is, taking products helps make the kernel identifiable in a strong sense. As we shall see in the next subsection, (23) holds for every when belongs to a full rank exponential family of distributions. This inverse bound also holds for a broad range of probability kernels beyond the exponential families.
Remark 5.4.
Concerning only the 0-identifiable length , a remarkable upper bound
was established in the recent paper [44]. This bound actually applies to nonparametric component distributions, and extends also to our parametric component distribution setting. However, in a parametric component distribution setting the upper bound is far from tight (cf. Example 5.13).
Proof of Proposition 5.1.
a) It is sufficient to assume that . Then there exists such that
Then fixing in the preceding display yields and the proof is complete since is arbitrary in .
b) By the definition of ,
(27) |
which entails that . On the other hand, for any we have
which entails . Thus . The proof of is similar.
c) It suffices to prove the case . Take any for , and suppose that . Then there exists a for some such that but . Collecting the supporting atoms of and , of which there are at most , denote them by . Supplement these with a set of atoms to obtain a set of distinct atoms denoted by . Now take to be any discrete probability measure in supported by . Since , the condition of Lemma C.1 for is satisfied and thus
But this contradicts the definition of .
d) By part a) it suffices to prove for the case . By Lemma C.2, the product family satisfies all the conditions in Corollary 4.6. Thus by Corollary 4.6,
It follows that , which implies that by part a).
e) By part b) and part d), for every . Associate each with a neighborhood as in part a) such that its conclusion holds. Here we want to emphasize that by definition of in (10), when . Due to the fact that is compact and part a), we deduce that is uniformly bounded for . Combining this with part c) we conclude the proof.
We can further unpack the double limit infima in the expression (21) to develop results useful for the subsequent convergence rate analysis in Section 6. First, it is simple to show that the limiting argument for can be completely removed when is suitably bounded. Denote by or a positive finite constant depending only on its parameters and the probability kernel . In the presentation of inequality bounds and proofs, such constants may differ from line to line.
Lemma 5.5.
Fix . Suppose (21) holds. Then for any , there exist and such that for any satisfying ,
A key feature of the above claim is that the radius of the local ball centered at over which the inverse bound holds depends on , but the multiplicative constant does not. Next, given additional conditions, most notably compactness of the space of mixing measures, we may remove completely the second limiting argument involving . In other words, we may extend the domain of on which the inverse bound of the form continues to hold, where the multiplicative constants are suppressed here.
Lemma 5.6.
Fix . Consider any compact subset of containing the supporting atoms of . Suppose the map from to is continuous. Then for any and any , provided ,
Finally, we record a simple and useful fact which allows one to transfer an inverse bound for one kernel family to another by means of a homeomorphic transformation of the parameter space. If for some homeomorphic function , for any , denote . Given a probability kernel family , under the new parameter define
Let also denote a generic element in , and be defined similarly as .
Lemma 5.7 (Invariance under homeomorphic parametrization with local invertible Jacobian).
Suppose is a homeomorphism. For , suppose the Jacobian matrix of the function , denoted by exists and is full rank at for . Then
(28) |
Moreover, if in addition exists and is continuous in a neighborhood of for each , then
Lemma 5.7 shows that if an inverse bound (21) or (23) under a particular parametrization is established, then the same inverse bound holds for all other parametrizations that are homeomorphic and that have local invertible Jacobian. This allows one to choose the most convenient parametrization; for instance, one may choose the canonical form for an exponential family or another more convenient parametrization, like the mean parametrization.
5.2 Probability kernels in regular exponential family
We now present inverse bounds for the mixture of products of exponential family distributions. Suppose that is a full rank exponential family of distributions on . Adopting the notational convention for canonical parameters of exponential families, we assume admits a density function with respect to a dominating measure , namely for .
Theorem 5.8.
In the last theorem the exponential family is in its canonical form. The following corollary extends the result to exponential families in general form under mild conditions.
Corollary 5.9.
As a consequence of Corollary 5.9, Proposition 5.1 and Lemma 4.16, we immediately obtain the following interesting algebraic result for which a direct proof may be challenging.
Corollary 5.10.
Let the probability kernel be in a full rank exponential family of distributions as in Corollary 5.9 and suppose that all conditions there hold. Then for any and for any , are finite. Moreover,
(29) |
has only the zero solution:
if and only if .
Corollary 5.10 establishes that for the full rank exponential families of distributions specified in Corollary 5.9 with full rank Jacobian of , there is a finite phase transition behavior specified by of the -product in (29): the system of equations (29) has a nonzero solution when , and as soon as it has only the zero solution. This also gives another characterization of defined in (24) for such exponential families, and thereby provides a way to compute . A byproduct is that does not depend on the of , since (29) only depends on .
We next demonstrate two non-trivial examples of mixture models that are either non-identifiable or weakly identifiable, i.e., when , but become first-order identifiable by taking products. We work out the details of calculating and ; these should serve as convincing illustrations of the discussion at the end of Section 4.
Example 5.11 (Bernoulli kernel).
Consider the Bernoulli distribution with parameter space . Here the family is defined on and the dominating measure is . It can be written in exponential form as in Lemma 4.16 with and . It is easy to check that and thus all conditions in Lemma 4.16, Corollary 5.9 and Corollary 5.10 are satisfied. Thus any of those three results can be applied. In particular we may use the characterization of in Corollary 5.10 to compute .
For the -fold product, the density . Then is
We now compute for any . Notice the support of is and hence the support of is . Thus (29) with , , and replaced respectively by , and becomes a system of linear equations:
(30) |
As a system of linear equations with unknown variables, it has nonzero solutions when . Thus for any .
Let us now verify that for any . Indeed, for any , the system of linear equations (30) with is with and
where with . By Lemma 5.12 d) , with the convention when . Thus is invertible and the system of linear equations (30) with has only the zero solution. Thus by Corollary 5.10 . By the conclusion of the last paragraph .
In Section C.3 we also prove that for any .
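As a numerical companion to the rank argument in this example (our own sketch, with an arbitrary choice of atoms, and using the unconstrained system, so that full column rank is a sufficient condition for the zero-solution property), one can build the coefficient matrix of the linear system in the atom and weight perturbations and locate the smallest product length at which it attains full column rank:

```python
import numpy as np

def bernoulli_first_order_matrix(thetas, N):
    """Coefficient matrix of the linear system in the perturbations (a_j, b_j):
    rows are indexed by s = 0..N (the number of successes); for each atom theta_j
    the two columns are d/dtheta f_N(s|theta_j) and f_N(s|theta_j), where
    f_N(s|theta) = theta^s (1 - theta)^(N - s)."""
    s = np.arange(N + 1)
    cols = []
    for th in thetas:
        f = th**s * (1 - th)**(N - s)
        df = (s / th - (N - s) / (1 - th)) * f   # derivative in theta
        cols.extend([df, f])
    return np.column_stack(cols)

# Smallest N at which the (unconstrained) system has only the zero solution,
# for one hypothetical choice of k0 = 3 distinct Bernoulli atoms.
thetas = [0.2, 0.5, 0.8]
for N in range(1, 12):
    A = bernoulli_first_order_matrix(thetas, N)
    if np.linalg.matrix_rank(A) == 2 * len(thetas):
        print("full column rank first attained at N =", N)
        break
```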
The next lemma concerns the determinant of a type of generalized Vandermonde matrix. Its part d) is used in the preceding example on the Bernoulli kernel.
Lemma 5.12.
Let .
-
a)
Let be a polynomial. Define , , , and . Then , , and are all multivariate polynomials.
-
b)
Let be a polynomial and its derivative for for a positive integer . For define by
Then for any , , where is some multivariate polynomial.
-
c)
For the special case , has determinant , with the convention when .
-
d)
For the special case with , has determinant , with the convention when .
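The prototype behind parts c) and d) is the classical confluent Vandermonde identity, in which each node contributes a value column and a derivative column and the determinant is a product of fourth powers of differences. The sketch below (ours; it checks the classical identity rather than the lemma's exact matrices, which carry additional polynomial factors) verifies this numerically:

```python
import numpy as np
from itertools import combinations

def confluent_vandermonde(nodes):
    """Columns v(x) = (1, x, ..., x^(2k-1)) and v'(x) for each node (multiplicity 2)."""
    powers = np.arange(2 * len(nodes))
    cols = []
    for x in nodes:
        cols.append(x**powers)                                                          # v(x)
        cols.append(np.where(powers > 0, powers * x**np.maximum(powers - 1, 0), 0.0))   # v'(x)
    return np.column_stack(cols)

nodes = [0.2, 0.5, 0.8]
lhs = np.linalg.det(confluent_vandermonde(nodes))
rhs = np.prod([(b - a)**4 for a, b in combinations(nodes, 2)])
print(lhs, rhs)   # the two agree up to floating-point error
```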
Example 5.13 (Continuation on two-parameter gamma kernel).
Consider the gamma distribution discussed in Example 4.17. Let and by Example 4.17 for any , and for any , where we recall that denotes the pathological subset of the gamma mixture’s parameter space,
This means for .
We now show that for and hence for . Let
be the density of the -fold product w.r.t. Lebesgue measure on . Let , which is a differentiable function on , and let be the density without the normalization constant. Note that and . Then (9a) with replaced by is
(31) |
Let with , where is the number of distinct elements. Define . Then (31) becomes for -a.e.
Fixing any such that in the -a.e. set on which the preceding equation holds, by Lemma B.3 b), for any , for any . Since are distinct for and , for any for any . That is for any . Analogously, fixing produces for any . Plugging these back into the preceding display, one obtains for -a.e.
Fixing any such that in the -a.e. set on which the preceding equation holds, and applying Lemma B.3 b) again, we obtain for . Thus (31) for any has only the zero solution. By Lemma 4.16, for
Thus , and hence for any .
Following an analogous analysis, one can show that are linearly independent for any distinct and any . The linear independence immediately implies that is identifiable on , i.e. for any and any , . Thus, for any .
The above examples demonstrate the remarkable benefits of having repeated (exchangeable) measurements: via the -fold product kernel for sufficiently large , one can completely erase the effect of parameter non-identifiability in Bernoulli mixtures, and the effect of weak identifiability in the pathological subset of the parameter space in two-parameter gamma mixtures. We have also seen that it is challenging to determine the 0- or 1-identifiable lengths even for these simple examples of kernels. It is even more so when we move to a broader class of probability kernels well beyond the exponential families.
5.3 General probability kernels
Unlike Section 5.2, which specializes to probability kernels in the exponential families, in this section no such assumption will be required. In fact, we shall not require that the family of probability distributions on admit a density function. Since the primary object of inference is the parameter , the assumptions on the kernel will center on the existence of a measurable map for some , and regularity conditions on the push-forward measure on : . The use of the push-forward measure to prove (23) stems from the observation that the variational distance between two distributions is bounded from below by the variational distance between their push-forward measures, which is equivalent to restricting to a subclass of the Borel sets in the definition of the variational distance.
Definition 5.14 (Admissible transform).
A Borel measurable map is admissible with respect to a set if for each there exists and such that satisfies the following three properties.
-
(A1)
(Moment condition) For , the open ball centered at with radius , suppose and exist where . Moreover, is positive definite on and is continuous at .
-
(A2)
(Exchangeability of partial derivatives of characteristic functions) Denote by the characteristic function of the pushforward probability measure on , i.e., , where . exists in and as a function of it is twice continuously differentiable on with derivatives satisfying:
where the right-hand sides of both equations exist.
-
(A3)
(Continuity and integrability conditions of characteristic function) as a function of is twice continuously differentiable in . There hold: for any ,
(32) and for any ,
(33)
Remark 5.15.
The above definition of the admissible transform contains relatively mild regularity conditions concerning continuity, differentiability and integrability. In particular, (A1) guarantees the existence of the first two moments of , which are required for the application of a central limit theorem as outlined in Section 2.2. (A2) and (A3) are used in the essential technical lemma (Lemma D.1) to guarantee the following statement in Section 2.2: for any sequence , there holds for certain functions and . The inequality (33) is also used to obtain the Fourier inversion formula (more specifically, to imply the existence of a density of with respect to Lebesgue measure for ). Since the characteristic function has modulus at most , the larger the , the smaller the left hand side of (33). Here we only require the existence of some in (33), which is a mild condition. For more discussion on the role of , see Theorem 2 in Section 5, Chapter XV of [14]. The conditions of the admissible transform are typically straightforward to verify if a closed-form formula for is available; examples will be provided in the sequel.
Theorem 5.16.
Note that the condition that is necessary for the Jacobian matrix of , which is of dimension , to be of full column rank. The following corollary is useful when the admissible maps are identical for all , which is the case for many (if not most) examples.
Corollary 5.17.
The proofs of Theorem 5.8 and Theorem 5.16 contain a number of potentially useful techniques and are deferred to Section D. We make additional remarks.
Remark 5.18 (Choices of admissible transform ).
If the probability kernel has a smooth closed-form expression for the characteristic function and is of dimension exceeding the dimension of , one may take to be the identity map (see Example 5.20 in the sequel). If is of dimension less than the dimension of , then one may take to be a moment map (see Example 5.21 and Example 5.22). On the other hand, if the probability kernel does not have a smooth closed-form expression for the characteristic function, then one may consider to be the composition of moment maps and indicator functions of suitable subsets of (see Example 5.23). Unlike the three previous examples, the chosen in Example 5.23 depends on the atoms of . All these examples were obtained by constructing a single admissible map following Corollary 5.17. There might exist cases for which it is difficult to come up with a single admissible map that satisfies the conditions of Corollary 5.17; for such cases Theorem 5.16 is potentially more useful.
Remark 5.19 (Comparisons between Theorem 5.16 and Theorem 5.8).
While Theorem 5.16 appears more powerful than Theorem 5.8, the latter is significant in its own right. Indeed, Theorem 5.16 provides an inverse bound for a very broad range of probability kernels, but it seems not straightforward to apply it to non-degenerate discrete distributions on lattice points, like the Poisson, Bernoulli and geometric distributions. The reason is that for non-degenerate discrete distributions on lattice points, the characteristic function is periodic (see Lemma 4 in Chapter XV, Section 1 of [14]), which implies that the characteristic function is not in for any . Thus, it does not satisfy (A3) for the identity map in the definition of the admissible transform. In order to apply Theorem 5.16 to such distributions one has to come up with suitable measurable transforms which induce distributions with countable support that is not contained in a lattice. By contrast, Theorem 5.8 is readily applicable to discrete distributions that are in the exponential family, including the Poisson, Bernoulli and geometric distributions.
5.4 Examples of non-standard probability kernels
The power of Theorem 5.16 lies in its applicability to classes of kernels that do not belong to the exponential families.
Example 5.20 (Continuation on uniform probability kernel).
In Example 4.7 this example has been shown to satisfy inverse bounds (11) and (12) for any . Note this family is not an exponential family and thus neither Theorem 5.8 nor Corollary 5.9 is applicable. Take the in Corollary 5.17 to be the identity map. Then , . So condition (A1) is satisfied. The characteristic function is
One can then calculate
and verify the condition (A2). To verify (A3) the following inequality (see [35, (9.5)])
comes in handy. It then follows that
Then (32) holds. Finally take and one obtains
Thus (33) holds. We have then verified that the identity map is admissible on .
Example 5.21 (Continuation on location-scale exponential kernel).
In Example 4.9 this example has been shown to satisfy (11) for any , but not to satisfy (12) for some for any . Note this family is not an exponential family and thus neither Theorem 5.8 nor Corollary 5.9 is applicable. Take in Corollary 5.17 to be as a map from . In Appendix C.5 we show that all conditions of Corollary 5.17 are satisfied and hence (21) and (23) hold for any for any . Moreover, by Remark 5.2, for any for any . Regarding , for every , there exists some such that .
Example 5.22 ( is itself a mixture distribution).
We consider the situation where is a rather complex object: it is itself a mixture distribution. With this we are moving from a standard mixture of product distributions to hierarchical models (i.e., mixtures of mixture distributions). Such models are central tools in Bayesian statistics. Theorem 5.8 or Corollary 5.9 is obviously not applicable in this example, which requires the full strength of Theorem 5.16 or Corollary 5.17. The application, however, is non-trivial, requiring the development of tools for evaluating oscillatory integrals of interest. Such tools also prove useful in other contexts (such as Example 5.23). A full treatment is deferred to Section 7.
Example 5.23 ( is a mixture of Dirichlet processes).
This example illustrates the applicability of our theory to models using probability kernels defined in abstract spaces. Such kernels are commonly found in nonparametric Bayesian literature [24, 17]. In particular, in our specification of mixture of product distributions we will employ Dirichlet processes as the basic building block [15, 4]. Full details are presented in Section 7.4.
6 Posterior contraction of de Finetti mixing measures
The data are independent sequences of exchangeable observations, for . Each sequence is assumed to be a sample drawn from a mixture of product distributions for some "true" de Finetti mixing measure . The problem is to estimate given the independent exchangeable sequences. A Bayesian statistician endows with a prior distribution and obtains the posterior distribution by Bayes’ rule, where is the Borel sigma algebra w.r.t. the distance. In this section we study the asymptotic behavior of this posterior distribution as the amount of data tends to infinity.
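For concreteness, a minimal simulation of this sampling scheme is sketched below; the Gaussian kernel, the two-atom truth, and the fixed common sequence length are illustrative choices of ours, not prescriptions of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# True de Finetti mixing measure G0: two atoms with mixing weights (illustrative values).
atoms = np.array([-1.0, 2.0])
weights = np.array([0.4, 0.6])

def sample_exchangeable_sequences(m, N):
    """Draw m independent exchangeable sequences of length N: for each sequence a
    latent component is drawn from G0, then the N observations are conditionally
    i.i.d. from the (illustrative) kernel N(theta, 1)."""
    labels = rng.choice(len(atoms), size=m, p=weights)
    return atoms[labels][:, None] + rng.standard_normal((m, N))

data = sample_exchangeable_sequences(m=500, N=4)
print(data.shape)   # (500, 4): m sequences, each of length N
```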
Suppose throughout this section that has density w.r.t. a -finite dominating measure on ; then for , has density w.r.t. :
(34) |
Then the density of conditioned on is . Since as a subset of is separable, is separable. Moreover, suppose the map from to is continuous, where is the Hellinger distance. Then the map from is also continuous by Lemma 8.2. Then by [2, Lemma 4.51], is measurable for each . Thus, the posterior distribution (a version of the regular conditional distribution) is the random measure given by
(35) |
for any Borel measurable subset of . For further details of why the last quantity is a valid posterior distribution, we refer to Section 1.3 in [17]. It is customary to express the above model equivalently in the hierarchical Bayesian fashion:
As above, the independent data sequences are denoted by for . The following assumptions are required for the main theorems of this section.
-
(B1)
(Prior assumption) There is a prior measure on with its Borel sigma algebra possessing a density w.r.t. Lebesgue measure that is bounded away from zero and infinity, where is a compact subset of . Define the -probability simplex . Suppose there is a prior measure on the -probability simplex possessing a density w.r.t. Lebesgue measure on that is bounded away from zero and infinity. Then is a measure on , which induces a probability measure on . Here, the prior is generated by via independent and , and the support of is such that .
-
(B2)
(Kernel assumption) Suppose that for every , and some positive constants ,
(36) (37)
Remark 6.1.
(B1) on the compactness of the support is a standard assumption for obtaining parameter convergence rates in finite mixture models (see [9, 32, 25, 26, 17, 22, 47]). See Section 9.1 for more discussion on the compactness assumption and a relaxation to a boundedness assumption. The unbounded setting seems challenging and is beyond the scope of this paper. In this paper, for simplicity we consider the prior on finite mixing measures being generated by independent priors on component parameters and an independent prior on mixing proportions [38, 32, 19] for a general probability kernel . It is not difficult to extend our theorem to more complex forms of prior specification when a specific kernel is considered.
The condition (B2) is not uncommon in parameter estimation (e.g. Theorem 8.25 in [17]). Note that the conditions in (B2) imply some implicit constraints on and . Specifically, if (11) holds for and (B2) holds, then and . Indeed, for any sequence converging to , by (11), Lemma C.3 with , and (B2), for large
(38) |
which implies (by dividing both sides by and letting ). In the preceding display . By (38) and Pinsker’s inequality,
for large , which implies . The same conclusion holds if one replaces (11) with (21) by an analogous argument.
A useful quantity is the average sequence length . The posterior contraction theorem will be characterized in terms of the distance , which extends the original notion of distance by allowing a real-valued weight .
Theorem 6.2.
Remark 6.3.
Remark 6.4.
(a) In the above statement, note that the constant also depends on , , , upper and lower bounds of the densities of , and the density family (including , , , etc). All such dependence is suppressed for the sake of a clean presentation; it is the dependence on , and the independence of and , that we want to emphasize. In addition, although and hence the vanishing radius of the ball characterized by do not depend on , the rate at which the posterior probability of this ball tends to zero may depend on it.
(b) Roughly speaking, the theorem produces the following posterior contraction rates. The rate toward the mixing probabilities is . Individual atoms have a much faster contraction rate, which utilizes the full volume of the data set:
(39) |
Note that the condition that implies that remains finite when . Since the constant is independent of and , the theorem establishes that the larger the average length of the observed sequences, the faster the posterior contraction as .
(c) The distinction between the two parts of the theorem highlights the role of first-order identifiability in mixtures of -product distributions. Under first-order identifiability, (11) is satisfied, so we can establish the aforementioned posterior contraction behavior for a full range of sequence lengths , as long as they are uniformly bounded by an arbitrary unknown constant. When first-order identifiability is not satisfied, so (11) may fail to hold, the same posterior behavior can be ascertained when the sequence lengths exceed a certain threshold depending on the true , namely, .
(d) The proof of Theorem 6.2 utilizes general techniques of Bayesian asymptotics (see [17, Chapter 8]) to deduce the posterior contraction rate for the density . The main novelty lies in the application of the inverse bounds for mixtures of product distributions of exchangeable sequences established in Section 5. These are lower bounds on distances between a pair of distributions () in terms of the distance between the corresponding . The distance brings out the role of the sample size of exchangeable sequences, resulting in the rate (or , modulo the logarithm).
The gist of the proof of Theorem 6.2 lies in the following lemma, where we consider data sequences of equal length to distill the essence. This lemma also illustrates the connection between the inverse bound (21) and the convergence rate for the mixing measure .
Lemma 6.5.
Let be fixed and suppose (21) holds. Let and let be a prior distribution on . Suppose the posterior contraction rate toward the true mixture density is : for any , in probability as . Suppose posterior consistency at the true mixing measure w.r.t. the distance holds: for any , in probability as . Then the posterior contraction rate to w.r.t. the distance is , i.e. in probability as .
Proof.
All probabilities presented in this proof are posterior probabilities conditional on the data , for which the conditioning notation is suppressed for brevity.
(40) |
where in the first inequality is the radius in Lemma 5.5 with , and the second inequality follows by Lemma 5.5. The proof is completed by noticing that the quantity in (40) converges to in probability as by the hypotheses, for any . ∎
Remark 6.6.
Roughly speaking, the hypothesis of posterior consistency guarantees that as , lies in a small ball around w.r.t. the distance, and then Lemma 5.5 transfers the convergence rate from mixture densities to mixing measures. No particular structure of the posterior distribution is utilized, and one can easily modify the above lemma for other estimators, for instance the maximum likelihood estimator.
Finally, the conditions (B2) and (21) can be verified for full rank exponential families and hence we have the following corollary from Theorem 6.2.
Corollary 6.7.
Example 6.8 (Posterior contraction for weakly identifiable kernels: Bernoulli and gamma).
Fix . For the Bernoulli kernel studied in Example 5.11, . Suppose that (B1) holds with compact . Then by Corollary 6.7, the conclusion a) of Theorem 6.2 holds provided . For the gamma kernel studied in Examples 4.17 and 5.13, when and when ; . Suppose that (B1) holds with compact . Then by Corollary 6.7, the conclusion a) of Theorem 6.2 holds provided . Moreover, no requirement on is needed if is given.
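A crude simulation can illustrate the contraction phenomenon for the Bernoulli kernel; the sketch below (our own construction, with a discretized prior in place of the continuous prior of (B1) and an ad hoc permutation-matched error in place of the distance) computes the posterior mean estimation error over a grid as the number of sequences grows:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative truth: k0 = 2 Bernoulli atoms, equal weights, sequences of length N = 5.
th_true, pi_true, N = np.array([0.3, 0.8]), np.array([0.5, 0.5]), 5

def simulate(m):
    """m exchangeable sequences, summarized by their success counts out of N."""
    labels = rng.choice(2, size=m, p=pi_true)
    return rng.binomial(N, th_true[labels])

def log_lik(s, th1, th2, pi):
    # mixture of N-product Bernoulli likelihoods, written via the success counts s
    l1 = th1**s * (1 - th1)**(N - s)
    l2 = th2**s * (1 - th2)**(N - s)
    return np.sum(np.log(pi * l1 + (1 - pi) * l2))

# Discretized (grid) prior over (theta1 < theta2, pi) with uniform weights.
grid = np.linspace(0.05, 0.95, 19)
params = [(a, b, p) for a in grid for b in grid if a < b for p in grid]

def posterior_mean_error(m):
    s = simulate(m)
    logpost = np.array([log_lik(s, a, b, p) for (a, b, p) in params])
    post = np.exp(logpost - logpost.max())
    post /= post.sum()
    def err(a, b, p):   # distance to truth up to label permutation (crude surrogate)
        e1 = abs(a - th_true[0]) + abs(b - th_true[1]) + abs(p - pi_true[0])
        e2 = abs(a - th_true[1]) + abs(b - th_true[0]) + abs((1 - p) - pi_true[0])
        return min(e1, e2)
    return float(np.dot(post, [err(*t) for t in params]))

for m in [50, 200, 800]:
    print(m, posterior_mean_error(m))   # the error typically shrinks as m grows
```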
Example 6.9 (Posterior contraction for weakly identifiable kernels: beyond exponential family).
Here we present the posterior contraction rates for the four examples studied in Section 5.4, while the verification details are in Appendix E.3. Assume that the prior distribution satisfies (B1) for each example below. For the uniform probability kernel studied in Example 5.20, the conclusion of Theorem 6.2 holds for any . For the location-scale exponential kernel studied in Example 5.21, the conclusion of Theorem 6.2 holds for any . For the case where the kernel is a location mixture of Gaussians in Example 5.22, the conclusion of Theorem 6.2 holds and the specific values of and are left as exercises. The kernel in Example 5.23 does not possess a density, which is needed in (35), and thus the results in this section on posterior contraction do not apply.
7 Hierarchical model: kernel is itself a mixture distribution
In this section we apply Theorem 5.16 to the case where itself is a rather complex object: a finite mixture of distributions. Combining this kernel with a discrete mixing measure , the resulting represents a mixture of finite mixtures of distributions, while becomes a -mixture of -products of finite mixtures of distributions. These recursively defined objects represent a popular and formidable device in the statistical modeling world: the world of hierarchical models. We shall illustrate Theorem 5.16 on only two examples of such models. However, the tools required for these applications are quite general, chief among them bounds on relevant oscillatory integrals for suitable statistical maps . We shall first describe such tools in Section 7.1 and then address the case where is a -component Gaussian location mixture (Example 5.22) and the case where is a mixture of Dirichlet processes (Example 5.23).
7.1 Bounds on oscillatory integrals
A key condition in Theorem 5.16, namely condition (A3), is reduced to the integrability of certain oscillatory integrals:
(41) |
for a broad class of functions and multi-dimensional maps . When , the oscillatory integral is also known as the Fourier transform of measures supported on curves or surfaces; bounds for such quantities are important topics in harmonic analysis and geometric analysis. We refer to [6] and the textbook [40, Chapter 8] for further details and broader contexts. While there are many existing results, they are typically established when is supported on a compact interval or is smooth, i.e. has derivatives of arbitrary order. We shall develop an upper bound on (41) for our purposes, to verify the integrability condition in (A3) for a broad class of , a condition usually satisfied by probability density functions.
We start with the following bounds for oscillatory integrals of the form , where the function is called the phase and the function the amplitude.
Lemma 7.1 (van der Corput’s Lemma).
Suppose , and that for all . Let be absolutely continuous on . Then
and
hold when either i) , or ii) and is monotonic. The constant is independent of , , and the interval .
Proof.
See [40, the Corollary on Page 334] for the proof of the first display; although in the original version in this reference is assumed to be , the proof only needs to be absolutely continuous on . The second display follows by applying the first display to . ∎
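A quick numerical illustration of the predicted decay (ours; the phase, amplitude, and interval below are arbitrary choices satisfying the lemma's hypotheses in the second-derivative case):

```python
import numpy as np

def oscillatory_integral(lam, n=2_000_000):
    """| integral_0^1 exp(i*lam*t^2) * exp(-t) dt | via the midpoint rule."""
    t = (np.arange(n) + 0.5) / n
    return abs(np.mean(np.exp(1j * lam * t**2) * np.exp(-t)))

# The phase t^2 has second derivative 2 >= 1 (the k = 2 case), so a van der
# Corput-type bound predicts decay of order lam^(-1/2); the rescaled values
# in the last printed column stay bounded.
for lam in [10.0, 100.0, 1000.0, 10000.0]:
    I = oscillatory_integral(lam)
    print(lam, I, I * np.sqrt(lam))
```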
It can be observed from Lemma 7.1 that the condition on derivatives of the phase function plays a crucial role. For our purpose the phase function will be supplied by the monomial map . Hence, the following technical lemma will be needed.
Lemma 7.2.
Let with entries for and for , where are given. Let be the smallest singular value of . Then , where is a constant that depends only on .
The following lemma provides a crucial uniform bound on oscillatory integrals whose phase is given by the monomial map .
Lemma 7.3.
Let be defined by with . Consider a bounded non-negative function that is differentiable on , where with a finite number. The derivative is continuous wherever it exists. Moreover, and are both increasing when and decreasing when for some , where . Then for ,
where is a positive constant that only depends on .
Applying Lemma 7.3 we obtain a bound for the oscillatory integral in question.
Lemma 7.4.
Let and satisfy the same conditions as in Lemma 7.3. Define for . Then for ,
where is a positive constant that depends on .
7.2 Kernel is a location mixture of Gaussian distributions
We are now ready for an application of Theorem 5.16 to the case where the kernel is a mixture of Gaussian distributions. As discussed in Example 5.22, with this example we are moving from a standard mixture of product distributions to hierarchical models (i.e., mixtures of mixture distributions). Such models are central tools in Bayesian statistics.
Let
(42) |
and w.r.t. Lebesgue measure on has probability density
(43) |
where and is the density of with a known constant. For the validity of this parametrization, see Section 9.2. It follows from the classical result [42, Proposition 1] that the map is injective on . The mixture of product distributions admits the density given in (34) (with ) w.r.t. Lebesgue measure on . Fix with .
Let us now verify that Corollary 5.17 with the map can be applied to this model. The mean of is , with its -th entry given by
(44) |
where has density (43) and has the standard Gaussian distribution . The covariance matrix of is with its entries given by
It follows immediately from these formulae that and are continuous on . That is, (A1) in Definition 5.14 is satisfied. The characteristic function of is
(45) |
where . Denote by the density of . The verification of (A2) in Definition 5.14 is omitted since it is a straightforward application of the dominated convergence theorem. In Appendix F.2 it is shown by some calculations that to verify condition (A3) it remains to establish that there exists some such that on is upper bounded by a finite continuous function of .
Note that is differentiable everywhere and . Moreover in Lemma 7.4 for is and , are increasing on and decreasing on . By Lemma 7.4, for , and for
where is a constant that depends only on . It can be verified easily by the dominated convergence theorem that is a continuous function of . Then
which is a finite continuous function of . Thus (A3) is verified. We have then verified that is admissible with respect to . That the mean map is injective is a classical result (e.g. [16, Corollary 3.3]). To apply Corollary 5.17 it remains to check that the Jacobian matrix of is of full column rank. Such details are established in Section 7.3.
7.3 Moment map for location mixture of Gaussian distributions has full-rank Jacobian
In this subsection we verify that the Jacobian for the moment map specified in Section 7.2 is of full rank. By (44), for any :
(46) |
Denote and . By (46), , which implies
Since and are respectively the -th row of and ,
(47) |
Also, observe
(48) |
where the second equality holds since we may subtract the -th column of the matrix from each of its first columns and then perform a Laplace expansion along its first row, and the last equality follows by observing that the -th column of the matrix is the derivative of the -th column and by applying Lemma 5.12 c) after some column permutation. By (47) and (48), on . That is, is of full column rank for any .
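The full-rank property can also be checked numerically. The sketch below is our own illustration: it forms the first 2k-1 raw moments of a location mixture of Gaussians with known scale (the role played by the mean map above, with illustrative parameter values) and confirms that the resulting square finite-difference Jacobian has full rank.

```python
import numpy as np
from math import comb

SIGMA = 1.0   # known scale of the location-mixture kernel (illustrative)

def gauss_central_moment(l):
    """E[(SIGMA*Z)^l] for Z ~ N(0,1): zero for odd l, SIGMA^l*(l-1)!! for even l."""
    if l % 2:
        return 0.0
    return SIGMA**l * float(np.prod(np.arange(l - 1, 0, -2))) if l else 1.0

def moment_map(theta, k):
    """First 2k-1 raw moments of sum_j p_j N(mu_j, SIGMA^2), with
    theta = (mu_1..mu_k, p_1..p_{k-1}) and p_k = 1 - sum of the rest."""
    mu, p = theta[:k], np.append(theta[k:], 1.0 - theta[k:].sum())
    return np.array([
        sum(p[j] * sum(comb(r, l) * mu[j]**(r - l) * gauss_central_moment(l)
                       for l in range(r + 1))
            for j in range(k))
        for r in range(1, 2 * k)
    ])

def jacobian_rank(theta, k, eps=1e-6):
    base = moment_map(theta, k)
    J = np.column_stack([(moment_map(theta + eps * e, k) - base) / eps
                         for e in np.eye(len(theta))])
    return np.linalg.matrix_rank(J)

# k = 3 components with distinct means: the (2k-1) x (2k-1) Jacobian has full rank.
theta = np.array([-1.0, 0.5, 2.0, 0.2, 0.3])   # (mu_1, mu_2, mu_3, p_1, p_2)
print(jacobian_rank(theta, k=3))               # expect 5
```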
7.4 Kernel is mixture of Dirichlet processes
Now we tackle Example 5.23, which is motivated by modeling techniques in nonparametric Bayesian statistics. In particular, the kernel is given as a distribution on a space of measures: is a mixture of Dirichlet processes (DPs), so that is a finite mixture of products of mixtures of DPs. This should not be confused with the use of a DP as a prior for mixing measures arising in mixture models. Rather, this is more akin to the use of DPs as probability kernels, as in the famous hierarchical Dirichlet processes [41] (in fact, that model uses DPs both as a prior and as kernels). The purpose of this example is to illustrate Theorem 5.16 when (mixtures of) Dirichlet processes are treated as kernels.
Let be the space of all probability measures on a Polish space . is equipped with the weak topology and the corresponding Borel sigma algebra . Let denote the Dirichlet distribution on , which is specified by two parameters, a concentration parameter and a base measure . The formal definition and key properties of Dirichlet distributions can be found in the original paper [15], or in a recent textbook [17]. In this example, we take the probability kernel to be a mixture of two Dirichlet distributions with different concentration parameters, while the base measure is fixed and known: . Thus, the parameter vector is three-dimensional and shall be restricted by the following constraint: . It can be easily verified that the map is injective. The kernel so defined is a simple instance of the so-called mixture of Dirichlet processes first studied by [4], but considerably more complex instances of models using the Dirichlet as a building block have become a main staple in the lively literature of Bayesian nonparametrics [24, 41, 37, 8]. For notational convenience, in the following we also denote for and , noting that is fixed, so we may write .
Having specified the kernel , let . The mixture of product distributions is defined in the same way as before (see Eq. (1)). Now we show that for , (21) and (23) hold by applying Corollary 5.17 via a suitable map .
Consider a map defined by for some to be specified later. The domain of is restricted in this way so that this particular choice of map can be shown to be admissible. Define by and by . Then . For , has distribution
where . By a standard property of the Dirichlet distribution, as , we have that corresponds to , a Beta distribution. Thus with , has density w.r.t. Lebesgue measure on
where is the beta function. Then has density w.r.t. Lebesgue measure .
Now, the push-forward measure has mean with
and has covariance matrix with its entry given by
It follows immediately from these formulae that and are continuous on , i.e., (A1) in Definition 5.14 is satisfied. Furthermore, observe that has characteristic function
where . The verification of (A2) in Definition 5.14 is omitted since it is a straightforward application of the dominated convergence theorem. In Appendix F.3 we provide detailed calculations to partially verify condition (A3), so that it remains to establish that there exists some such that on is upper bounded by a finite continuous function of . So far we have verified (A1), (A2) and some parts of (A3) for the chosen , for every .
To continue the verification of (A3) we now specify . For with , let be such that . Notice that since , is not empty. Hence to verify condition (A3) in Definition 5.14 w.r.t. for with the specified , it suffices to establish that there exists some such that in a small neighborhood of is upper bounded by a finite continuous function of for each .
Since is differentiable w.r.t. on and when , is
which is in when such that and , where depends on through . Moreover, and are both increasing on and decreasing on . Now, by appealing to Lemma 7.4, for , and for
where is a constant that depends only on . It can be verified easily by the dominated convergence theorem that is a continuous function of . Then for in a neighborhood of such that ,
which is a finite continuous function of . We have thus verified that with the specified is admissible w.r.t. .
Moreover, it can also be verified that for is injective on provided that . By calculation, the Jacobian matrix of satisfies
on provided that ; so is of full rank for each provided that . In summary, for with , with such that
satisfies all the conditions in Corollary 5.17 and thus (21) and (23) hold.
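To make the construction tangible, the sketch below (ours; the base measure, the set A, the truncation level, and the parameter values are all illustrative, and the scalar push-forward Q(A) is a simplification of the vector-valued map used above) draws random measures from a two-component mixture of Dirichlet processes sharing a fixed base measure via truncated stick-breaking, and records the push-forward Q(A), whose law is a two-component mixture of Beta distributions:

```python
import numpy as np

rng = np.random.default_rng(2)

def dp_stick_breaking(alpha, base_sampler, T=500):
    """Truncated stick-breaking draw from DP(alpha, P0): atom locations and weights."""
    v = rng.beta(1.0, alpha, size=T)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return base_sampler(T), w

def sample_kernel(p, a1, a2, base_sampler):
    """One draw from the kernel f(.|theta) = p*DP(a1, P0) + (1-p)*DP(a2, P0)."""
    alpha = a1 if rng.random() < p else a2
    return dp_stick_breaking(alpha, base_sampler)

base_sampler = lambda size: rng.standard_normal(size)   # base measure P0 = N(0, 1)
theta = (0.3, 1.0, 20.0)                                # (p, alpha_1, alpha_2), illustrative

# Push each random measure Q forward to the scalar Q(A) with A = (0, inf), so P0(A) = 1/2;
# the law of Q(A) is then the mixture p*Beta(a1/2, a1/2) + (1-p)*Beta(a2/2, a2/2).
draws = []
for _ in range(2000):
    atoms, w = sample_kernel(*theta, base_sampler)
    draws.append(w[atoms > 0.0].sum())

print(np.mean(draws), np.var(draws))   # sample mean close to 1/2
```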
8 Sharpness of bounds and minimax theorem
8.1 Sharpness of inverse bounds
In this subsection we consider reverse upper bounds for (21), which are also reverse upper bounds for (23) by (14). Inverse bounds of the form (21) hold only under some identifiability conditions, while the following upper bound holds generally and is much easier to show.
Lemma 8.1.
Let and fix . Then for any
Proof.
Consider with for and , . Then for sufficiently large , and hence and satisfies . Thus for sufficiently large ,
∎
The next lemma establishes an upper bound on the Hellinger distance between two mixtures of product measures in terms of the Hellinger distances between individual components. It is an improvement of [34, Lemma 3.2 (a)]. Such a result is useful in Lemma 8.3. A similar result for the variational distance is Lemma C.3.
Lemma 8.2.
For any and ,
where the minimum is taken over all in the permutation group .
The inverse bounds expressed by Eq. (21) are optimal as far as the role of in is concerned. This is made precise by the following result.
Lemma 8.3 (Optimality of for atoms).
Fix . Suppose there exists such that . Then for such that ,
Lemma 8.3 establishes that is optimal for the coefficients of the component parameters in . The next lemma establishes that the constant coefficients of the mixing proportions in are also optimal. For and , define
It states that the vanishing of may not induce a faster convergence rate for the mixing proportions in terms of as the exchangeable length increases.
Lemma 8.4 (Optimality of constant coefficient for mixing proportions).
Fix . Suppose that the map is injective. Then for such that ,
Proof.
Consider with for any and for , , . Then for large , . Note that and hence
which completes the proof. ∎
A slightly curious and pedantic way to gauge the meaning of the double infimum limiting arguments in the inverse bound (21) is to express its claim as follows:
where is defined in (10). It is possible to alter the order of the four operations and consider the resulting outcome. The following lemma shows that the last display is the only ordering that can possibly yield a positive outcome.
Lemma 8.5.
-
a)
-
b)
-
c)
Proof.
The claims follow from
and
∎
8.2 Minimax lower bounds
Given and , define the additional notions of distance
(49)
(50) |
Notice that we denoted as a distance on in Section 3. Here the with bold subscript is on . These two notions of distance are pseudometrics on the space of measures , i.e., they share the same properties as a metric except that they allow the distance between two different points to be zero. focuses on the distance between the atoms of two mixing measures, while focuses on the mixing probabilities of the two mixing measures. It is clear that
(51) |
We proceed to present minimax lower bounds for any sequence of estimators , which are measurable functions of , where the sequence lengths are assumed to be equal for simplicity. The minimax bounds are stated in terms of the aforementioned (pseudo-)metrics and , as well as the usual metric studied earlier.
Theorem 8.6 (Minimax Lower Bound).
In the following three bounds the infimum is taken over all measurable functions of .
-
a)
Suppose there exists and such that . Moreover, suppose there exists a set of distinct points satisfying . Then
where is a constant depending on , and the probability family .
-
b)
Let .
where is a constant depending on and the probability family .
-
c)
Let . Suppose the conditions of part (a) hold. Then,
Remark 8.7.
-
a)
The condition that there exists a set of distinct points satisfying immediately follows from the injectivity of the map (recall that this condition is assumed throughout the paper).
-
b)
The condition that there exists and such that holds for most probability kernels considered in practice. For example, it is satisfied with for all full rank exponential families of distribution in their canonical form as shown by Lemma E.3. It can then be shown that this condition with is also satisfied by full rank exponential families in general form specified in Corollary 5.9. Notice that the same remark applies to the condition in Lemma 8.3.
-
c)
If conditions of Theorem 8.6 a) hold with , then
That is, the convergence rate of the best possible estimator in the worst-case scenario is at least . Recall that Theorem 6.2 implies that the convergence rate of the atoms is , which is obtained by replacing with in (39). It is worth noting that while the minimax rate seems to match the posterior contraction rate of the atoms except for a logarithmic factor, such a comparison is not very meaningful, as pointwise posterior contraction bounds and minimax lower bounds are generally not directly comparable. In particular, in the posterior contraction Theorem 6.2, the truth is fixed and the hidden constant depends on , which is clearly not the case in the above results obtained under the minimax framework. In short, we do not claim that the Bayesian procedure described in Theorem 6.2 is optimal in the minimax sense; nor do we claim that the bounds given in Theorem 8.6 are sharp (i.e., achievable by some statistical procedure).
9 Extensions and discussions
9.1 On compactness assumption
In Theorem 6.2 we impose that the parameter subset is compact. This appears to be a strong assumption, although it is a rather standard one for most theoretical investigations of parameter estimation in finite mixture models (see [9, 32, 25, 26, 17, 47]). We surmise that in the context of mixture models, it might not be possible to achieve the global parameter estimation rate without a suitable constraint on the parameter space, such as compactness. In this subsection we clarify the roles of the compactness condition within our approach and discuss possible alternatives that relax compactness to boundedness.
The proof of Theorem 6.2 follows the basic structure of Lemma 6.5. To obtain the posterior contraction rate for mixture densities and the posterior consistency w.r.t. for a general probability kernel , a global inverse bound, Lemma 5.6, is applied (for example, the posterior contraction rate for mixture densities and Lemma 5.6 together imply posterior consistency w.r.t. ). The compactness of is only used to establish Lemma 5.6. It might be possible to obtain a posterior contraction result for the population density estimation and the posterior consistency result without Lemma 5.6 (e.g. by an existence-of-tests argument), but such an approach would additionally require stronger and perhaps explicit knowledge of the kernel , and is thus beyond the scope of this paper.
In this subsection we provide a substitute for the compactness assumption in Lemma 5.6, which removes the compactness assumption in Theorem 6.2. It is clear that is required to be a bounded set. The compactness assumption may be relaxed to the necessary boundedness assumption, provided that an identifiability condition additionally holds. This can be seen from the following claim.
Lemma 9.1.
Fix . Suppose is bounded. Let be given by (24). Suppose there exists such that for
(52) |
Then
provided , where is a constant that depends on and .
Proof.
In this proof we write for . By the definition of , for any
(53) |
By Lemma 3.2 b) one may replace the in the preceding display by . Fix . Then there exists depending on such that
(54) |
where is the open ball in metric space with center at and radius . Here we used the fact that any sufficiently small open ball in with center in is in . By assumption
Combining the last display with and (54) yields . Observing that increases with , the proof is then complete. ∎
Despite the above possibilities for relaxing the compactness assumption on , we want to point out that other assumptions may still implicitly require compactness. For example, suppose that the kernel takes the explicit form , the density of the univariate normal distribution with mean and standard deviation . Then . With , this cannot be upper bounded by , a quantity that converges to when converges to . That is, the assumption (B2) cannot hold if is not bounded away from , which excludes bounded intervals of the form .
9.2 Kernel is a location-scale mixture of Gaussian distributions
In Section 7.2 we demonstrated an application of Theorem 5.16 to obtain the inverse bound (23) when the kernel is a location mixture of Gaussian distributions. It is of interest to extend Theorem 5.16 to richer kernels often employed in practice. The location-scale mixture of Gaussian distributions represents a salient example. Here, we shall discuss several technical difficulties that arise in such a pursuit. The first difficulty is that in Theorem 5.16, the parameter space is assumed to be a subset of a Euclidean space obtained via a suitable (i.e., homeomorphic) parametrization. For the "location-scale Gaussian mixture" kernel, such a parametrization is elusive.
Recall that the parameter set of a -component location mixture of Gaussian distributions given by (43) is or . To apply the result in Theorem 5.16 we parametrize the kernel and index it by parameters in a subset of a suitable Euclidean space. In Section 7.2 we identified or with a subset of Euclidean space as in (42) by ranking in increasing order. This identification is a bijection and moreover a homeomorphism. The bijection and homeomorphism properties are necessary for the reparametrization, since we need convergence of the parameters in the reparametrized space to be equivalent to convergence in the original parameter space . So the parametrization is suitable for the application of Theorem 5.16. However, this scheme does not generalize straightforwardly to the case where the atom space (the space of in this particular example) has more than one dimension, as discussed below.
For the case of a -component location-scale mixture of Gaussian distributions, the parameter set is . Similar to the location mixture, one may attempt to reparametrize by ranking in lexicographically increasing order. While this reparametrization is a bijection, it is not a homeomorphism. To see this, consider and . The reparametrization of is , since in the lexicographic order. Consider and its reparametrization . It is clear that as , but the Euclidean distance between the corresponding reparametrized parameters does not vanish. This issue underscores one among many challenges that arise as we look at increasingly richer models that have already been widely applied in numerous application domains.
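The discontinuity can be seen in a couple of lines; the perturbation below is our own illustrative choice (the specific configurations used in the text are not reproduced here):

```python
import numpy as np

def lex_reparam(atoms):
    """Reparametrize an unordered set of 2-d atoms (mu, sigma) by sorting them
    lexicographically and concatenating, as in the attempted scheme above."""
    return np.concatenate(sorted(map(tuple, atoms)))

eps = 1e-6
A = [(0.0, 2.0), (eps, 1.0)]   # a slight perturbation of the configuration B below
B = [(0.0, 2.0), (0.0, 1.0)]

print(lex_reparam(A))   # [0, 2, eps, 1]: (0, 2) precedes (eps, 1) lexicographically
print(lex_reparam(B))   # [0, 1, 0, 2]: at eps = 0 the ordering flips
print(np.linalg.norm(lex_reparam(A) - lex_reparam(B)))   # stays of order 1 as eps -> 0
```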
9.3 Other extensions
A direction of interest is the study of overfitted mixture models, i.e., the true number of mixture components may be unknown and . As previous studies suggest, a stronger notion of identifiability such as second-order identifiability may play a fundamental role (see [27]). Observing that (23) can also be viewed as a uniform version of (21), since it holds for any fixed in a neighborhood of and any , it would also be interesting to generalize Theorem 6.2 to a uniform result beyond a fixed . In addition, if is taken to be a mixture distribution, what happens if this mixture is also overfitted? We can expect a much richer range of parameter estimation behavior and more complex roles for and in the rates of convergence.
References
- [1] David J. Aldous. Exchangeability and related topics. In P. L. Hennequin, editor, École d’Été de Probabilités de Saint-Flour XIII — 1983, pages 1–198, Berlin, Heidelberg, 1985. Springer Berlin Heidelberg.
- [2] Charalambos D. Aliprantis and Kim C. Border. Infinite dimensional analysis: A Hitchhiker’s Guide. Springer-Verlag Berlin Heidelberg, third edition, 2006.
- [3] Elizabeth S Allman, Catherine Matias, and John A Rhodes. Identifiability of parameters in latent structure models with many observed variables. Annals of Statistics, 37(6A):3099–3132, 2009.
- [4] Charles E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2(6):1152–1174, 11 1974.
- [5] Patrick Billingsley. Probability and measure. John Wiley & Sons, third edition, 1996.
- [6] Luca Brandolini, Giacomo Gigante, Allan Greenleaf, Alexander Iosevich, Andreas Seeger, and Giancarlo Travaglini. Average decay estimates for Fourier transforms of measures supported on curves. Journal of Geometric Analysis, 17(1):15–40, 2007.
- [7] Federico Camerlenghi, David B Dunson, Antonio Lijoi, Igor Prünster, and Abel Rodríguez. Latent nested nonparametric priors (with discussion). Bayesian Analysis, 14(4):1303–1356, 2019.
- [8] Federico Camerlenghi, Antonio Lijoi, Peter Orbanz, and Igor Prünster. Distribution theory for hierarchical processes. Annals of Statistics, 47(1):67–92, 2019.
- [9] Jiahua Chen. Optimal rate of convergence for finite mixture models. Annals of Statistics, 23(1):221–233, 02 1995.
- [10] IR Cruz-Medina, TP Hettmansperger, and H Thomas. Semiparametric mixture models and repeated measures: the multinomial cut point model. Journal of the Royal Statistical Society: Series C (Applied Statistics), 53(3):463–474, 2004.
- [11] Ryan Elmore, Peter Hall, and Amnon Neeman. An application of classical invariant theory to identifiability in nonparametric mixtures. In Annales de l’institut Fourier, volume 55, pages 1–28, 2005.
- [12] Ryan T Elmore, Thomas P Hettmansperger, and Hoben Thomas. Estimating component cumulative distribution functions in finite mixture models. Communications in Statistics-Theory and Methods, 33(9):2075–2086, 2004.
- [13] Ryan T Elmore and Shaoli Wang. Identifiability and estimation in finite mixture models with multinomial coefficients. Technical report, Technical Report 03-04, Penn State University, 2003.
- [14] William Feller. An introduction to probability theory and its applications, volume 2. John Wiley & Sons, third edition, 2008.
- [15] Thomas S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230, 03 1973.
- [16] Kavish Gandhi and Yonah Borns-Weil. Moment-based learning of mixture distributions. unpublished paper, 2016.
- [17] Subhashis Ghosal and Aad van der Vaart. Fundamentals of Nonparametric Bayesian Inference, volume 44 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2017.
- [18] Subhashis Ghosal and Aad W Van Der Vaart. Entropies and rates of convergence for maximum likelihood and bayes estimation for mixtures of normal densities. Annals of Statistics, 29(5):1233–1263, 2001.
- [19] Aritra Guha, Nhat Ho, and XuanLong Nguyen. On posterior contraction of parameters and interpretability in Bayesian mixture modeling. Bernoulli, 27(4):2159–2188, 2021.
- [20] Peter Hall, Amnon Neeman, Reza Pakyari, and Ryan Elmore. Nonparametric inference in multivariate mixtures. Biometrika, 92(3):667–678, 2005.
- [21] Peter Hall and Xiao-Hua Zhou. Nonparametric estimation of component distributions in a multivariate mixture. Annals of Statistics, 31(1):201–224, 2003.
- [22] Philippe Heinrich and Jonas Kahn. Strong identifiability and optimal minimax rates for finite mixture estimation. Annals of Statistics, 46(6A):2844–2870, 2018.
- [23] TP Hettmansperger and Hoben Thomas. Almost nonparametric inference for repeated measures in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4):811–825, 2000.
- [24] Nils Lid Hjort, Chris Holmes, Peter Müller, and Stephen G Walker. Bayesian Nonparametrics, volume 28 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2010.
- [25] Nhat Ho and XuanLong Nguyen. Convergence rates of parameter estimation for some weakly identifiable finite mixtures. Annals of Statistics, 44(6):2726–2755, 2016.
- [26] Nhat Ho and XuanLong Nguyen. On strong identifiability and convergence rates of parameter estimation in finite mixtures. Electronic Journal of Statistics, 10(1):271–307, 2016.
- [27] Nhat Ho and XuanLong Nguyen. Singularity structures and impacts on parameter estimation in finite mixtures of distributions. SIAM Journal on Mathematics of Data Science, 1(4):730–758, 2019.
- [28] K. Jochmans, S. Bonhomme, and J.-M. Robin. Nonparametric estimation of finite mixtures from repeated measurements. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.
- [29] Olav Kallenberg. Probabilistic symmetries and invariance principles. Probability and Its Applications. Springer Science & Business Media, 2006.
- [30] Bruce G Lindsay. Mixture models: theory, geometry and applications. In NSF-CBMS regional conference series in probability and statistics, pages i–163. JSTOR, 1995.
- [31] XuanLong Nguyen. Wasserstein distances for discrete measures and convergence in nonparametric mixture models. Technical report, Department of Statistics, University of Michigan, 2011.
- [32] XuanLong Nguyen. Convergence of latent mixing measures in finite and infinite mixture models. Annals of Statistics, 41(1):370–400, 2013.
- [33] XuanLong Nguyen. Posterior contraction of the population polytope in finite admixture models. Bernoulli, 21(1):618–646, 2015.
- [34] XuanLong Nguyen. Borrowing strength in hierarchical Bayes: Posterior concentration of the Dirichlet base measure. Bernoulli, 22(3):1535–1571, 2016.
- [35] Sidney I Resnick. A probability path. Modern Birkhäuser Classics. Springer, fourth edition, 2014.
- [36] Alexander Ritchie, Robert A Vandermeulen, and Clayton Scott. Consistent estimation of identifiable nonparametric mixture models from grouped observations. arXiv preprint arXiv:2006.07459, 2020.
- [37] Abel Rodriguez, David B Dunson, and Alan E Gelfand. The nested Dirichlet process. Journal of the American Statistical Association, 103(483):1131–1154, 2008.
- [38] Judith Rousseau and Kerrie Mengersen. Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):689–710, 2011.
- [39] V.A. Statulyavichus. Limit theorems for densities and asymptotic expansions for distributions of sums of independent random variables. Theory of Probability and Its Applications, 10(4):582–595, 1965.
- [40] Elias M Stein and Timothy S Murphy. Harmonic analysis: real-variable methods, orthogonality, and oscillatory integrals, volume 3 of Monographs in Harmonic Analysis. Princeton University Press, 1993.
- [41] Yee W Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
- [42] Henry Teicher. Identifiability of finite mixtures. Annals of Mathematical Statistics, 34(4):1265–1269, 1963.
- [43] Henry Teicher. Identifiability of mixtures of product measures. Annals of Mathematical Statistics, 38(4):1300–1302, 1967.
- [44] Robert A Vandermeulen and Clayton D Scott. An operator theoretic approach to nonparametric mixture models. Annals of Statistics, 47(5):2704–2733, 2019.
- [45] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
- [46] Jon Wellner and Aad van der Vaart. Weak convergence and empirical processes: with applications to statistics. Springer Series in Statistics. Springer Science & Business Media, 1996.
- [47] Yihong Wu and Pengkun Yang. Optimal estimation of Gaussian mixtures via denoised method of moments. Annals of Statistics, 48(4):1981–2007, 2020.
Appendix A Examples and Proofs for Section 3
Example A.1.
Consider and with . When is sufficiently large, , a constant independent of . But with being Euclidean distance multiplied by
where is a coupling as in (7). So
which increases to when . Even though and share the set of atoms, is still of order . Thus, couples atoms and probabilities; in other words it does not separate them in the way does.
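For reference, a standard form of the Wasserstein distance between two discrete mixing measures, written in notation chosen here purely for illustration (the measures $G=\sum_{i=1}^k p_i\delta_{\theta_i}$, $G'=\sum_{j=1}^{k'} p'_j\delta_{\theta'_j}$ and the order $r\ge 1$ are assumptions of this sketch, not the paper's specific notation), is
\[
W_r^r(G,G') \;=\; \inf_{q\in\mathcal{Q}(p,p')}\ \sum_{i,j} q_{ij}\,\|\theta_i-\theta'_j\|^r,
\]
where $\mathcal{Q}(p,p')$ is the set of couplings of the probability vectors $p$ and $p'$. This is the sense in which the metric couples atoms and mixing probabilities jointly rather than separating them.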
Proof of Lemma 3.2.
a) The proof is trivial and is therefore omitted.
b) Let and . Let be the optimal permutation that achieves . Let be a coupling of the mixing probabilities and such that and then the remaining mass to be assigned is . Thus,
The proof for the case proceeds in the same manner.
c) Consider any and , and one may write for such that and . Then when is sufficiently large, for , where is the open ball with center at of radius . Then by b) for large , , which entails .
Denote for . Let be a coupling between and such that . Since , when is large,
(55) |
Moreover, when is large, for any . Thus when is large,
(56) |
where the last inequality follows from
Appendix B Additional examples and proofs for Section 4
B.1 Additional examples and proofs for Section 4.1
Example B.1 (Location gamma kernel).
For gamma distribution with fixed and , consider its location family with density
w.r.t. Lebesgue measure on . The parameter space is . Observe that
and
since . Then for any , as a function of is not differentiable at . So it is not first-order identifiable in the sense defined in [26]. However, this family does satisfy the definition of first-order identifiability with where . Indeed, observing that
then (9a) becomes
Without loss of generality, assume . Then for , the above display becomes
which implies since . Repeating the above argument on the interval shows for any . So this family is first-order identifiable. Moreover, for every in , is continuously differentiable w.r.t. in a neighborhood of for . By Lemma 4.2 b), (12) holds for any .
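For concreteness, one standard way to write a location family generated by a gamma density, with shape $\alpha>0$ and rate $\beta>0$ fixed (the symbols $\alpha,\beta,\theta$ here are illustrative and need not match the paper's notation), is
\[
f(x\mid\theta) \;=\; \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,(x-\theta)^{\alpha-1}\,e^{-\beta (x-\theta)}\,\mathbf{1}\{x>\theta\},
\]
and the non-smooth behavior at the boundary $x=\theta$ is what causes the failure of the classical first-order identifiability condition discussed above.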
Proof of Lemma 4.2 b).
Suppose equation (12) does not hold. Then there exists such that
We may relabel the atoms of and such that , with and as for any . By a subsequence argument if necessary, we may further require
(57) |
where and the components of are in and . Moreover, for sufficiently large , which implies
It then follows that at least one of is not or one of is not . On the other hand,
where the second inequality follows from Fatou’s Lemma, and the last step follows from Lemma B.2 a). Then for . Thus we find a nonzero solution to (9a), (9b) with replaced by .
However, the last statement contradicts the definition of first-order identifiability. ∎
Proof of Lemma 4.4.
By Lemma 4.13 b) is also a nonzero solution of the system of equations (9a), (9b). Let and . Then and satisfy and , , is also a nonzero solution of the system of equations (9a), (9b) with replaced respectively by . Let with and for . When is large, and since and . Moreover, since . Then and since at least one of or is nonzero. When is large . Thus when is large
(58) |
Since by condition c) when is large
the integrand of (58) is bounded by , which is integrable w.r.t. on . Then by the dominated convergence theorem
Thus
and the proof is completed by (14). ∎
Proof of Lemma 4.8.
Without loss of generality, assume . Let , where . Notice that for , as a function of is continuously differentiable on for each .
Suppose (12) is not true. Proceed exactly as in the proof of Lemma 4.2 b), except for the last paragraph, to obtain a nonzero solution of (9a), (9b) with replaced by . For the uniform distribution family, one may argue that the nonzero solution has to satisfy
(59) |
Indeed, start from the rightmost interval that intersects with the support from only one mixture component, for
which implies . Repeat the above argument on interval , , , and (59) is established.
Combining (59) with the fact that some of the or is non-zero, it follows that for some . When is sufficiently large, . For sufficiently large
where the step follows from carefully examining the support of , the step follows from the fact that the integrand is a constant, and the last step follows from (57). The last display contradicts the choice of , which satisfies . ∎
Proof of Lemma 4.10.
(a): inverse bound (11) holds
Without loss of generality, assume . Let . Notice that for , as a function of is differentiable at for each .
Suppose (11) is not true. Proceed exactly as in the proof of Lemma 4.2 a), except for the last paragraph, to obtain a nonzero solution of (9a), (9b) with replaced by . Write the two-dimensional vector as . For the location-scale exponential distribution, one may argue that the nonzero solution has to satisfy
(60) |
Indeed, let with where is the number of distinct elements. Define . Then for
Start from the leftmost interval that intersects with the support from only one mixture component, for ,
Since for are all distinct, by Lemma B.3 a)
Repeat the above argument on interval and (60) is established.
Since at least one of or is not zero, from (60) it is clear that at least one of is not zero. Then by at least one of is positive. By (60) at least one of is negative. Let . That is, is a largest negative one among . Let be half of the smallest distance among the distinct . By a subsequence argument if necessary, we require for any , .
Let be the set of indices for those sharing the same as . We now consider subsequences such that for satisfies finer properties as follows. Divide the index set into three subsets, , and . Note is the index set for those sharing the same as and sharing the same as (so their are also largest negative ones among ), while corresponds to indices for which and , and corresponds to indices for which and . To be clear, the two subsets and may be empty, but is non-empty by our definition.
For any ,
Then for large , for any and . Similarly for large , for any and . Thus, by a subsequence argument if necessary, we may additionally require that satisfies the conditions specified in the last two sentences for all .
Consider ; there exists such that for infinitely many since has finite cardinality. By a subsequence argument if necessary, we require for all . Moreover, since we may further require for all . Finally, for each such that , we may further require for all by passing to subsequences. To sum up, for satisfies:
(61) |
Let with the convention that the minimum over an empty set is . Then and . Moreover, by property (61), . Thus on , 1) for any , since ; 2) for , due to (61); 3) for , since . Then
(62) |
Denote the integrand (including the absolute value) in the preceding display by . Then as a function on , converges uniformly to
Since is positive and continuous on compact interval , for large
which yields
Plugging the preceding display into (62), one obtains for large ,
(63) |
where the convergence in the last step is due to (15). (63) contradicts the choice of , which satisfies .
(b): inverse bound (12) does not hold
Recall . Consider with , and . Denote . Consider with , and . Consider with . It is clear that when is large, and . Moreover, when is large, since . Then
(64) |
since and . Denote the integrand in (64) (including the absolute value) by . By definition of for location-scale exponential distribution, on .
On , converges to uniformly. Then
(65) |
where the first equality follows by the second fundamental theorem of calculus, and the last step follows by .
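For reference, a standard location-scale exponential density, in illustrative notation with location $\theta$ and scale $\sigma>0$ (an assumption of this sketch rather than the paper's exact parametrization), is
\[
f(x\mid\theta,\sigma) \;=\; \frac{1}{\sigma}\,e^{-(x-\theta)/\sigma}\,\mathbf{1}\{x>\theta\},
\]
so the support $(\theta,\infty)$ moves with the location parameter; this moving boundary is what drives the support-comparison arguments used in the proof above.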
Proof of Lemma 4.11.
Take . Then is -integrable by the Cauchy–Schwarz inequality. Moreover for any and any
Then by Lemma 4.13 b) is a nonzero solution of the system of equations (9a), (9b).
Let and . Then , satisfy and , , is also a nonzero solution of (9a), (9b) with replaced respectively by . Let with and for . When is large, and since and . Moreover, since . Then and since at least one of or is nonzero. When is large . Thus when is large
The integrand of the last integral is bounded by
which is integrable w.r.t. on . Here the last inequality follows from
Then by the dominated convergence theorem
The proof is completed by (14). ∎
B.2 Proofs for Section 4.2
Proof of Lemma 4.13.
a) For , . Then is a solution of (9a) with replaced by if and only if with and is a solution of (9a). We can write and . Thus is zero if and only if is zero.
b) Under the conditions, by dominated convergence theorem
where the last step follows from and the fact that is a density with respect to . Thus for any solution of (9a),
Proof of Lemma 4.15.
Notice that is continuously differentiable at every when fixing any . By Lemma B.4 and Lemma 4.13 c), (9a) has the same solutions as the system (9a),(9b).
Proof of Lemma 4.16.
Consider to be the same kernel but under the new parameter . Note with is the canonical parametrization of the same exponential family. Write . Since exists at and at those points,
and thus
(67) |
Then (9a), (9b) with replaced respectively by has only the zero solution if and only if (9a), (9b) with replaced respectively by has only the zero solution.
Suppose that is a solution of (9a) with replaced respectively by . Then by (67) with , is a solution of (9a) with replaced respectively by . Then by Lemma 4.15, it necessarily has . That is, is a solution of the system of equations (9a), (9b) with replaced respectively by . As a result, with replaced respectively by , (9a) has the same solutions as the system (9a),(9b).
B.3 Auxiliary lemmas for Section B.1
Lemma B.2.
Consider a function on whose gradient exists in a neighborhood of and with continuous at .
-
a)
Then when and
-
b)
If in addition the Hessian is continuous in a neighborhood of , then for any in a closed ball of contained in that neighborhood,
Moreover
where .
Lemma B.3.
Let be a positive integer, be a sequence of real numbers and let be the Lebesgue measure on .
-
a)
Let be a sequence of polynomials. Consider any nonempty interval . Then
implies for any .
-
b)
Let be a sequence of functions, where each is of the form , i.e. a finite linear combination of power functions. Let be another sequence of such functions. Consider any nonempty interval . Then
implies when and for any .
Lemma B.4.
Let be the density of a full rank exponential family in canonical form specified as in Lemma 4.15. Then for any and there exists such that for any ,
with and
with . Here , and depend on and .
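As a point of reference, a full rank exponential family in canonical form is usually written, in generic notation (the symbols $\eta$, $T$, $h$, $A$ below are illustrative and may differ from the paper's), as
\[
f(x\mid\eta) \;=\; h(x)\,\exp\bigl(\langle \eta, T(x)\rangle - A(\eta)\bigr),
\]
where the log-partition function $A$ is strictly convex (in the full rank case) and infinitely differentiable on the interior of the natural parameter space; these are the properties exploited by the moment bounds in Lemma B.4.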
Appendix C Proofs, additional lemmas and calculation details for Section 5
This section contains all proofs for Section 5 except those of Theorem 5.8 and Theorem 5.16. The proofs of Theorem 5.8 and Theorem 5.16 occupy the bulk of the paper and will be presented in Section D. This section also contains additional lemmas on the invariance of different parametrizations and on determinants of a type of generalized Vandermonde matrix, as well as calculation details for Examples 5.11 and 5.13.
C.1 Proofs for Section 5.1 and Corollary 5.9
Proof of Lemma 5.5.
Proof of Lemma 5.6.
In this proof we write for respectively. By the definition of , for any
(69) |
By Lemma 3.2 b) one may replace the in the preceding display by . Fix . Then there exists depending on such that
(70) |
where is the open ball in metric space with center at and radius . Here we used the fact that any sufficiently small open ball in with center in is in .
Notice that is compact under the metric if is compact. By the assumption that the map is continuous, Lemma C.3 and the triangle inequality of total variation distance, with domain is a continuous function of for each . Then is a continuous map on for each . Moreover is positive on the compact set provided . As a result for each
Combining the last display with and (70) yields
(71) |
where is a constant depending on and . Observing that increases with , the proof is then complete. ∎
C.2 Auxiliary lemmas for Section C.1
Lemma C.1 (Lack of identifiability).
Fix . Suppose has a nonzero solution , where the is the zero measure on . Then
(72) |
Lemma C.2.
Suppose the same conditions in Corollary 4.6 hold. Then for any , for each , and for any ,
where satisfies
Lemma C.3.
For any and ,
where the minimum is taken over all in the permutation group .
Proof.
The proof is similar to that of Lemma 8.2. ∎
C.3 Identifiability of Bernoulli kernel in Example 5.11
In this section we prove for any for the Bernoulli kernels in Example 5.11. (Note: the authors found a proof of this result in the technical report [13, Lemma 3.1 and Theorem 3.1] after the preparation of this manuscript. Since the technical report [13] is difficult to find online, the different proof given below is presented for completeness.)
For any , there are parameters to determine it. has effective equations for different values since can take values and is a probability density. Thus for to be strictly identifiable, a necessary condition is that for almost all under the Lebesgue measure. In fact, in Lemma C.4 part e) it is established that for any for any there exist infinitely many such that , which implies for all .
Let us now verify that for any . In the following . For any and consider such that . Notice that means that it is possible some of is zero. implies
(73) |
Notice that the above system of equations does not include the constraint since it is redundant: by multiplying both sides of the equation by and summing up, we obtain . (In fact, in the above system of equations the equation with (or arbitrary ) can be replaced by )
We now show that the only solution is , beginning with the following simple observation. Notice that for a set of distinct elements in , the system of linear equations of with :
has only the zero solution since by setting the system of equations of :
has its coefficients of the first equations forming a non-singular Vandermonde matrix.
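The non-singularity invoked here is the classical Vandermonde fact: for distinct nodes $x_1,\dots,x_m$ (illustrative notation),
\[
\det\begin{pmatrix} 1 & x_1 & \cdots & x_1^{m-1}\\ \vdots & \vdots & & \vdots\\ 1 & x_m & \cdots & x_m^{m-1}\end{pmatrix} \;=\; \prod_{1\le i<j\le m} (x_j - x_i) \;\neq\; 0,
\]
so a homogeneous linear system whose coefficient matrix contains such a block admits only the zero solution.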
If some is not in , then by the observation in the last paragraph , which contradicts . As a result, . Suppose for . Then the system of equations (73) becomes
Applying the observation from the last paragraph again yields for . That is, the only solution of (73) is . Thus , which together with the fact that yields for any .
Part e) of the first lemma and d) of the second lemma below are used in the preceding analysis of example on Bernoulli kernel.
Lemma C.4.
-
a)
Let be distinct real numbers. Let . Then the system of linear equations of
(74) has all the solutions given by
(75) for any .
-
b)
For any and for any positive , there exist infinitely many satisfying
-
c)
For any and for any positive , the system of equations of
(76) (77) has infinitely many solutions.
-
d)
If for some positive integer , then for any integer .
-
e)
Consider the kernel specified in Example 5.11. For any and for any , there exist infinitely many such that . In particular, this shows for any .
Proof of Lemma C.4.
a) By the Lagrange interpolation formula over the nodes ,
In particular, for any ,
Plugging the above identity into (74), it is clear that the specified in (75) are solutions of (74). Notice that the coefficient matrix of has rank since the submatrix consisting of the first columns forms a non-singular Vandermonde matrix. Thus all the solutions of (74) form a subspace of of dimension , which implies that (75) gives all the solutions.
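The interpolation identity used here is standard: for distinct nodes $x_0,\dots,x_k$ and values $y_0,\dots,y_k$ (illustrative notation), the unique polynomial of degree at most $k$ passing through these points is
\[
p(x) \;=\; \sum_{j=0}^{k} y_j \prod_{\substack{0\le m\le k\\ m\neq j}} \frac{x - x_m}{x_j - x_m},
\]
and evaluating the Lagrange basis polynomials at a further point produces linear relations of the kind exploited in (74) and (75).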
b) Let . Consider a polynomial such that , , and for , . These points uniquely determine a polynomial of degree at most . By our construction, satisfies
(78) |
Moreover, notice that for odd integers between and , and for even integers between and . Then there must exist and for such that . Then where are constants that depend on . Plugging into (78) shows that is a solution of the system of equations in the statement. By changing the value of , we get infinitely many solutions.
c)
First, we apply part a) with : for any distinct real numbers , the system of linear equations of
has a solution
where we have specified .
Next, for the given in the lemma's statement, by part b) we can choose that satisfy the requirements there. Accordingly, for . Moreover, it follows from the ranking of that for any . Thus is a solution of the system of equations in the statement. The conclusion that there are infinitely many solutions follows since there are infinitely many by part b).
d) follows immediately from for any , the product sigma-algebra on ,
Repeating this procedure inductively, the conclusion follows.
e) By part d) it suffices to prove that . Write with . Consider any with such that . for is
(79) | ||||
(80) |
Note the system of equations (79) automatically implies . Let , for and let , . Then and for . Then is a solution of (79), (80) if and only if the corresponding is the solution of
By part c), the system of equations in last display has infinitely many solutions additionally satisfying (77). For each such solution, the corresponding is a solution of system of equations (79) (80) additionally satisfying and for . By the comments after (79),(80) we also have . Thus, such gives such that . The existence of infinitely many such follows from the existence of infinitely many solutions by part c). ∎
C.4 Proof of Lemma 5.12
Proof of Lemma 5.12.
a) It’s obvious that are multivariate polynomials and that
That means has factor and thus is a multivariate polynomial and
Then has factor and thus is a multivariate polynomial.
b) Write for in this proof. Denote the bottom matrix of . Let , , and be defined as in part a) with replaced by . Then by subtracting the third row from the first row, the fourth row from the second row, and then factoring the common factor out of the resulting first two rows,
where the second equality follows by subtracting the fourth row from the first row and then factoring the common factor out of the resulting row. The last step of the preceding display follows by subtracting 1/2 times the second row from the first row and then extracting the common factor out of the resulting row. Thus is a factor of , which is a multivariate polynomial in . By symmetry, is a factor of .
c) We prove the statement by induction. It’s easy to verify the statement when . Suppose the statement for holds. By b),
for some multivariate polynomial . By the Leibniz formula for the determinant, the term of highest degree in is or , which both have degree since has degree and has degree . Moreover, in the degree of is and the corresponding term is , which implies that in the degree of is no more than for any . As a result, is a constant. Thus
(81) |
On the other hand,
(82) |
where the second equality follows by Laplace expansion along the last row. Observing that and , plugging these two equations into (82) and simplifying the resulting determinant, one has
(83) |
Comparing (83) to (81), together with the induction assumption that statement for holds,
That is, we proved the statement for .
d) We prove the statement by induction. Write for in the following induction to emphasize its dependence on . It is easy to verify that the statement holds when . Suppose the statement for holds. By b), for some multivariate polynomial . Since has degree and has degree , by the Leibniz formula for the determinant, has degree no more than for any for . Moreover, in the degree of is , which implies that in the degree of is no more than . As a result, we may write where are multivariate polynomials of . Thus
(84) |
and
(85) |
On the other hand,
(86) |
where the second equality follows by Laplace expansion along the last row. Observing that and , plugging these two equations into (86) and simplifying the resulting determinant, one has
(87) |
Analogous argument produces
(88) |
Comparing (87) to (84), together with the induction assumption that statement for holds,
Comparing (88) to (85), together with the induction assumption that statement for holds and the preceding display,
That is, for any . ∎
C.5 Calculation details in Example 5.21
As in Example 5.21, take the . Then one may check . So condition (A1) is satisfied. The characteristic function is
The verifications of (A2) and (32) are consequences of the Leibniz rule for differentiation and the dominated convergence theorem, and are omitted. To verify (33), notice that is increasing on and decreasing on . That is, the conditions of Lemma 7.3 are satisfied with , and . Moreover, it is clear that . Then by Lemma 7.4, for any ,
It can be verified easily by the dominated convergence theorem that is a continuous function of on . Thus (33) in (A3) is verified. We have then verified that is admissible with respect to .
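As an illustrative anchor (assuming the location-scale exponential parametrization with location $\theta$ and scale $\sigma$, which is an assumption of this sketch), the characteristic function of such a kernel is
\[
\varphi(t) \;=\; \frac{e^{\mathrm{i} t\theta}}{1 - \mathrm{i}\,\sigma t}, \qquad |\varphi(t)| \;=\; \frac{1}{\sqrt{1+\sigma^2 t^2}},
\]
so its modulus is increasing on $(-\infty,0]$ and decreasing on $[0,\infty)$, matching the monotonicity used above to verify (33).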
Appendix D Proofs of inverse bounds for mixtures of product distributions
For an overview of our proof techniques, please refer to Section 2. The proofs of both Theorem 5.8 and Theorem 5.16 follow the same structure. The reader should read the former first before attempting the latter, which is considerably more technical and lengthy.
D.1 Proof of Theorem 5.8
Proof of Theorem 5.8.
Step 1 (Proof by contradiction with subsequences)
Suppose (23) is not true. Then there exists a subsequence of natural numbers tending to infinity such that
Then such that
(89) |
To see this, for each fixed , and thus fixed , if and only if . Thus, there exists such that , , and
thereby ensuring that (89) holds.
Write . We may relabel the atoms of and such that , with and for any . By a subsequence argument if necessary, we may require that , additionally satisfy:
(90) |
where the components of are in and . It also follows that at least one of is not or one of is not . Let .
Step 2 (Change of measure by index and application of CLT)
has density w.r.t. on :
where any is partitioned into blocks as with . Then
(91) |
where are i.i.d. random variables having densities , and
Let . Then since , the mean and covariance matrix of are respectively and , the gradient and Hessian of evaluated at . Then by the central limit theorem, converges in distribution to . Moreover,
(92) |
where .
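The limit invoked here is the standard multivariate central limit theorem: in generic notation, if $Y_1,\dots,Y_N$ are i.i.d. with mean $\mu$ and covariance matrix $\Sigma$, then
\[
\sqrt{N}\,\bigl(\bar Y_N - \mu\bigr) \;\xrightarrow{\ d\ }\; \mathcal{N}(0,\Sigma), \qquad \bar Y_N = \frac{1}{N}\sum_{i=1}^N Y_i .
\]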
Step 3 (Application of continuous mapping theorem)
Define . Supposing that
(93) |
a property to be verified in the sequel, then by the generalized continuous mapping theorem ([46, Theorem 1.11.1]), converges in distribution to . Applying Theorem 25.11 in [5],
(94) |
where the equality follows by (89), (91) and (92). Since is a non-zero affine transform and the covariance matrix of is positive definite due to the full rank property of the exponential family, is either a nondegenerate Gaussian random variable or a non-zero constant, which contradicts (94).
For any , by Taylor expansion of at and the fact that is infinitely differentiable at , for large ,
which implies
(96) |
where the equality follows from (89), and the inequality follows from the fact that
(97) | ||||
(98) |
for large . The same conclusion holds with replaced by in the last two displays.
For , by Lemma B.2 b) and the fact that is infinitely differentiable at , for large
which implies
(99) |
where the inequality follows from (97) and (98), and the equality follows from (89) and (90).
Thus
(103) |
where step follows from the mean value theorem with on the line segment between and , step follows from , due to (100), (101) and hence , and step follows from (102) and (90).
Case 2: Calculate for .
For ,
(104) |
where the last inequality follows from and
(105) |
implied by the strict convexity of over , due to the full rank property of the exponential family. Similarly, for sufficiently large ,
(106) |
It follows that for
(107) |
where the second inequality follows by applying the mean value theorem on the first term and applying (90) to the second term, while the last inequality follows from (104) and (106).
D.2 Proof of Theorem 5.16
Proof of Theorem 5.16.
Step 1 (Proof by contradiction with subsequences)
This step is similar to the proof of Theorem 5.8. Suppose that (23) is not true. Then there exists a subsequence of natural numbers tending to infinity such that
Then such that
(110) |
To see this, for each fixed , and thus fixed , if and only if . Thus, there exist such that , , and
thereby ensuring that (110) holds.
Write . We may relabel the atoms of and such that , with and for any . By a subsequence argument if necessary, we may require that , additionally satisfy:
(111) | ||||
(112) |
and
(113) |
where the components of are in and . It also follows that at least one of is not or one of is not . Let .
Step 2 (Transform the probability measure to support in )
Let be an arbitrary measurable map in this step. Extend to the product space by where any is partitioned into blocks as with . Then one can easily verify that , and hence for any
Further consider another measurable map defined by where is partitioned equally into blocks . Denote the induced probability measure on under of the by . Then the induced probability measure under of the mixture is
Note that the dependences of and on are both suppressed, as are the dependences on of and .
Step 3 (Application of the central limit theorem)
In the rest of the proof, specialize in Step 2 to be . Write to simplify the notation in the rest of the proof. Let and be the same as in Definition 5.14 of with respect to the finite set and define . By passing to subsequences if necessary, we may further require that satisfy for all and .
Consider . Then is distributed according to the probability measure , which has characteristic function . For , by (33) in (A3) and by the Fourier inversion theorem, and therefore have density with respect to Lebesgue measure given by
(115) |
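The inversion step rests on the standard fact that a probability measure on $\mathbb{R}^d$ whose characteristic function $\varphi$ is Lebesgue integrable admits a bounded continuous density; in generic notation,
\[
f(x) \;=\; \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} e^{-\mathrm{i}\langle t, x\rangle}\,\varphi(t)\,dt .
\]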
Then has density with respect to Lebesgue measure given by , and similarly has density with respect to Lebesgue measure . Thus
(116) |
For has density , define , where . Note that this transform from to depends on in the density of . Then by the change of variable formula, has density with respect to Lebesgue measure, given by
or equivalently
(117) |
Now, applying the local central limit theorem (Lemma D.3), converges uniformly in to for every . Here is the density of , the multivariate normal with mean and covariance matrix . Next specialize the previous statement to , and define
We use the convention that the supremum of is in the above display. Because of the uniform convergence of to , we have when . It follows from (117) that on . Then by (116)
(118) |
where
and
Observe if has density , then converges in distribution to .
Step 4 (Application of a continuous mapping theorem)
Define , where is the Jacobian matrix of evaluated at . Supposing that
(119) |
a property to be verified later, then by the generalized continuous mapping theorem ([46, Theorem 1.11.1]), converges in distribution to . Applying [5, Theorem 25.11],
where the equality follows from (118) and (114). Note that is positive definite (by (A1)) and is of full column rank. In addition, by our choice of , either or is non-zero. Hence, is a non-zero affine function of . For such an , cannot be zero, which results in a contradiction. As a result, it remains to establish (119).
We now state the following lemma and proceed to verify (119); the technical and lengthy proof of the lemma will be given in Section D.5.
Lemma D.1.
Suppose all the conditions in Theorem 5.16 hold and let be defined as in the first paragraph in Step 3. For any , for any pair of sequences and for any increasing satisfying and :
(120) |
where is the density with respect to Lebesgue measure of when is positive definite.
Step 5 (Verification of (119))
Write for abbreviation in the remainder of this proof. Observe that, by the local central limit theorem (Lemma D.3),
as , which implies
(121) |
Hereafter . Similar definition applies to . Then for each ,
(122) |
where the first equality follows from (117) and where in the first equality . Observe that for any ,
(123) | ||||
(124) |
by (111), (112) and (110). Then by applying (120) with respectively be , and by (113),
which together with (121) yield
(125) |
(126) |
provided the right hand side exists.
Note that for each , and any , by a standard calculation for Gaussian density,
so we have
Thus, when ,
(127) |
where the inequality holds for sufficiently large , is a constant that only depends on and , and the last step follows from by condition 1) in the statement of the theorem.
Next, we turn to the second summation in the definition of in a similar fashion. By (117),
(130) |
Due to (124), by applying (120) with respectively be , and by (110),
which together with (121) yield
(131) |
Moreover for any ,
(132) |
D.3 Bounds on characteristic functions for distributions with bounded density
To prove Lemma D.1, we need the next lemma, which is a generalization of the corollary to [39, Lemma 1]. It gives an upper bound on the magnitude of the characteristic function for distributions with bounded density with respect to Lebesgue measure.
Lemma D.2.
Consider a random vector with its characteristic function. Suppose has density with respect to Lebesgue measure upper bounded by , and has positive definite covariance matrix . Then for all
where is some constant that depends only on , and is the largest eigenvalue.
Proof.
It suffices to prove for .
Step 1 In this step we prove the special case for , where is the standard basis in . Define and it is easy to verify
(136) |
Denote by the density w.r.t. Lebesgue measure of the symmetrized random vector , where is an independent copy of . Then also has upper bound , and is the characteristic function of , and
(137) |
Write and let be the strip of length centered at across the -axis. Then by (137)
(138) |
where the first inequality follows from and is a subset in to be determined, and the last inequality follows from for .
Let with . Then and thus (138) becomes
(139) |
where in step , step with follows from Lemma D.4 b) and step with follows from Jensen’s inequality. The inequalities in step and are attained with where for positive satisfies
Observe and thus
where the last step follows from the Markov inequality. Moreover, by our choice of , . Then (139) becomes
where is a constant that depends only on . The last display, with replaced by , together with (136), yields the desired conclusion.
Step 2 For any , denote and . Consider an orthogonal matrix with its first row . Then where . Since has density with respect to Lebesgue measure, has the same upper bound and positive definite covariance matrix with the same largest eigenvalue as . The result then follows by applying Step 1 to . ∎
D.4 Auxiliary lemmas for Sections D.2 and D.3
Consider a family of probabilities on , where is the parameter of the family and is the parameter space. denotes the expectation under the probability measure . Consider a sequence of independent and identically distributed random vectors from . Suppose exists and define . The next result establishes that the density of converges uniformly to that of a multivariate normal distribution.
Lemma D.3 (Local Central Limit Theorem).
Consider a sequence of independent and identically distributed random vectors from . Suppose and exist and is positive definite. Let the characteristic function of be and suppose there exists such that is Lebesgue integrable on . Then when , has a density with respect to Lebesgue measure on , and this density converges uniformly in , as tends to infinity, to , the density of .
The special case for of the above lemma is Theorem 2 in Section 5, Chapter XV of [14]. That proof can be generalized to without much difficulty, and therefore the proof of Lemma D.3 is omitted.
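In generic notation, the conclusion of such a local limit theorem reads: writing $q_n$ for the density of $\sqrt{n}\,(\bar X_n - \mu)$ and $\phi_{0,\Sigma}$ for the $\mathcal{N}(0,\Sigma)$ density (illustrative symbols, assuming the integrability condition above),
\[
\sup_{x\in\mathbb{R}^d}\ \bigl|\,q_n(x) - \phi_{0,\Sigma}(x)\,\bigr| \;\longrightarrow\; 0 \qquad \text{as } n\to\infty,
\]
which is the uniform convergence of densities used repeatedly in Steps 3 through 5 of the proof of Theorem 5.16.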
Lemma D.4.
-
a)
Consider a Lebesgue measurable function on satisfying and . Then for any
and the equality holds if and only if .
-
b)
For define a set . Consider a Lebesgue measurable function on satisfying on and . Then for any
and the equality holds if and only if where .
D.5 Proof of the technical lemma in the proof of Theorem 5.16
Proof of Lemma D.1.
We will write respectively for in this proof. But in this proof are generic variables and might not necessarily be the same as in the proof of Theorem 5.16. Let be the same as in the first paragraph of the Step 3 in the proof of Theorem 5.16.
For any , by condition (A1) , and exist, and by condition (A2) and exist. Then, with condition (A1) it follows from Pratt’s Lemma that exists and is given by
(140) |
Plugging the Fourier inversion formula (115) and (140) into (120), and noting for all , for sufficiently large we obtain
where
and
We will show in the sequel that in Step 1 and in Step 2, thereby establishing (120).
Step 1 (Prove )
By Condition (A3) and Lemma B.2 b),
(141) |
where with
Then
(142) |
where the first inequality follows from the fact that , and the last inequality follows from Condition (A3), Tonelli Theorem and the joint Lebesgue measurability of , and , as functions of , and by [2, Lemma 4.51]. Then following (141) and (142),
(143) | ||||
where in the first inequality is some constant that depends on and , and where the second equality follows from (142) and changing variable with . Denote the integrand in the last display by .
In the rest of the proof, denote the left hand sides of (32) and (33) respectively by and for every .
Observe that exists and has upper bound by condition (A3). Then invoking Lemma D.2, for ,
(144) |
where the last step follows from by condition (A1) with being some constant that depends on .
Let . Since the density w.r.t. Lebesgue measure exists and has characteristic function , as by the Riemann–Lebesgue lemma. It follows that is actually a maximum. Moreover, the existence of the density w.r.t. Lebesgue measure, together with Lemma 4 in Section 1, Chapter XV of [14], yields when . It follows that . By the mean value theorem and (A3)
which further implies for sufficiently large . Then for sufficiently large ,
(147) |
where the first inequality follows from the definition of and condition (A3), and the last inequality follows from condition (A3). (146) and (147) immediately imply for any , :
(148) |
The above display together with (143) and the conditions yield .
Step 2 (Prove ). A large portion of the proof borrows ideas from Theorem 2 in Chapter XV, Section 5 of [14]. Observe
(149) |
where as before by a change of variable, ,
(150) |
Denote the integrand in the above display by . Since and exist, is twice continuously differentiable on , with gradient being and Hessian being at . Then by Taylor's theorem,
(151) |
for sufficiently small , and
(152) |
Let . By the same reasoning of in Step 1, . Then for any ,
(153) |
Then, as
(154) |
where the first inequality follows from condition (A3) and the definition of , and the last step follows from and condition (A3).
By condition (A2), as a function of has gradient at : . Then by Taylor Theorem:
(155) |
Moreover, specializing in (145), one has:
(156) |
By combining (151) and (156), we obtain as
(157) |
where in the second inequality we impose since it’s the limit that is of interest, and is a constant that depends on and .
Appendix E Proofs for Section 6
E.1 Proofs of Theorem 6.2 and Corollary 6.7
For a subset of a metric space with metric , the minimal number of balls with centers in and of radius needed to cover is known as the -covering number of and is denoted by . Define the root average squared Hellinger metric:
Proof of Theorem 6.2.
a) The proof structure is the same as that of Lemma 6.5, except that, to take the varying sequence lengths into account, the distance is used in place of the total variation distance for the mixture densities.
We verify conditions (i) and (ii) of [17, Theorem 8.23], respectively, in Step 1 and Step 2 below to obtain a posterior contraction bound on the mixture density. In Step 3 we prove a posterior consistency result and then apply Lemma 5.5 to transfer posterior contraction result on density estimation to parameter estimation.
Step 1 (Verification of condition (i) of [17, Theorem 8.23])
Write and respectively for and in the proof for clean presentation.
Note that (B2) implies that from to is continuous. Then due to Lemma 5.6 and Lemma 3.2 d), for any , and any ,
(159) |
In the remainder of the proof is implicitly imposed. By (159) it holds that, for all
(160) |
Then
(161) |
and thus for any ,
(162) |
where the last inequality follows from (B1).
By an argument similar to [34, Lemma 3.2 (a)], for any
where the second inequality follows from Lemma 3.2 b) and (B2). Then
and
(163) |
Step 2 (Verification of condition (ii) of [17, Theorem 8.23])
By (161),
where the last inequality follows from Lemma E.1. By Remark 6.1 . Then based on the last display one may verify with for some large enough constant ,
(164) |
Now we invoke [17, Theorem 8.23] (the Hellinger distance defined in [17] differs from our definition by a constant factor, but this factor only affects the coefficients of and not the conclusion of convergence); we then have, for every ,
(165) |
in -probability as .
Step 3 (From convergence of densities to that of parameters) Since , by Lemma 5.5 for satisfying
(166) |
By Lemma E.2 for satisfying where , there exists a such that
(167) |
Let . Combining (166) and (167), for any
By the union bound,
in -probability as by applying (165) to the first term. The reason the second term vanishes is as follows. Note that the statement that the second term converges to is essentially a posterior consistency result with respect to the (or ) metric. Here we prove it using (160) and (165). By (160),
for some constant . For some slow-increasing such that as ,
holds for large . Combining the last two displays and (165) yields
The proof is concluded.
E.2 Auxiliary lemmas for Section E.1
Lemma E.1.
Fix . Suppose for some and some where are any two distinct elements in .
Lemma E.2.
For with . If satisfying , then there exists a unique such that for all real number
Lemma E.3.
Consider a full rank exponential family’s density function with respect to a dominating measure on , which takes the form
where is the parameter space of .
-
a)
For any
where is the maximum eigenvalue of a symmetric matrix.
-
b)
For any compact subset , there exists such that
where is the convex hull of .
E.3 Calculation Details in Example 6.9
Details for the uniform probability kernel in Example 5.20. Consider the uniform kernel in Example 4.7 and Example 5.20. Write with . Let be a compact subset of such that the condition (B1) holds, and additionally satisfies . The reason for the additional condition will be discussed in the next paragraph. It is easy to establish that for any
and thus (37) holds with .
Additional care is needed for this example since the support of depends on and for . In particular, condition (36) does not hold for the uniform kernel. In view of the fact that condition (36) is only used to guarantee (163), we may directly verify (163) for the uniform kernel so that the conclusions of Theorem 6.2 hold. Note that the additional condition is necessary for (163), since implies . (In fact, this condition is necessary for a common condition called the Kullback–Leibler property [17, Definition 6.15].)
We now verify (163). Denote and . In what follows for this example we always write in its increasing representation w.r.t. , i.e. . Consider the following set
For any , let be a coupling between and specified as below:
where
Then for any ,
(168) |
where the first inequality follows by the joint convexity of the Kullback-Leibler divergence. For any ,
(169) |
By our choice of , for any . Then plugging (169) into (168),
(170) |
In fact, one can show but we do not have to use this fact here. Now by (170), for any ,
Thus
which is (163) for the example of being uniform kernels.
As a result, the conclusions of Theorem 6.2 hold. Moreover by Example 5.20 and one can directly verify that .
Details for the location-scale exponential kernel in Example 5.21. By similar calculations as above, one may show that the conclusion of Theorem 6.2 holds even though the KL-divergence is not Lipschitz as in assumption (B2). By Example 5.21 and one can directly verify that .
Details for the case that kernel is location-mixture Gaussian in Example 5.22. It suffices to verify assumption (B2) such that Theorem 6.2 can be applied.
Note that
where step follows from Lemma 8.2, and step follows from the formula of Hellinger distance between two Gaussian distributions. We also have
where in step is any coupling between and and the minimization is taken over all such couplings, step follows from the joint convexity of the KL divergence and Lemma 8.2, and step follows from the formula for the KL divergence between two Gaussian distributions.
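For the record, the closed forms referred to in the last two displays are the standard Gaussian formulas; in the equal-variance special case (illustrative notation, with means $\mu_1,\mu_2$ and common variance $\sigma^2$, and the convention $H^2(P,Q)=1-\int\sqrt{dP\,dQ}$),
\[
H^2\bigl(\mathcal{N}(\mu_1,\sigma^2),\mathcal{N}(\mu_2,\sigma^2)\bigr) = 1 - \exp\Bigl(-\tfrac{(\mu_1-\mu_2)^2}{8\sigma^2}\Bigr), \qquad
\mathrm{KL}\bigl(\mathcal{N}(\mu_1,\sigma^2)\,\|\,\mathcal{N}(\mu_2,\sigma^2)\bigr) = \tfrac{(\mu_1-\mu_2)^2}{2\sigma^2}.
\]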
Appendix F Proofs and calculation details for Section 7
F.1 Proofs for Section 7.1
Proof of Lemma 7.2.
Let be a positive constant that depends on its parameters in this proof.
Claim 1: There exists that depends only on such that for any . Suppose this is not true; then there is such that , and as ,
(171) |
Let with being some positive number to be specified. The characteristic polynomial of is
When , since for any , the entries of are bounded by . Thus for . Moreover,
with . Let be the smallest eigenvalue of . Then
When , since . Thus when ,
(172) |
Moreover, when , for any . Then by (171), , where , so . On the other hand, since and are bounded and ,
These contradict (172), and hence the claim at the beginning of this paragraph is established.
Since on and is continuous,
Then take and the proof is complete. ∎
Proof of Lemma 7.3.
Let . Then
where with entries for and for . Then for any ,
(173) |
where and the last inequality follows from Lemma 7.2.
Case 1: (). Partition the real line according to the increasing sequence where
For , by (173) we know for all . In order to appeal to Lemma 7.1, we need to specify the points with , where is defined as the set of roots in of any of the following equations,
Thus is a partition of such that for each , holds for some index and for all . Since is a polynomial of degree , it follows that . Let be the maximum of , where are the coefficients corresponding to in Lemma 7.1. Then by Lemma 7.1, for
(174) |
where the last step follows from being increasing on . Then for
(175) |
where the step follows from , the step follows from and , and the last step follows from , for all and .
For , since is continuous on and is Lebesgue integrable on , and exist. Define . Then is absolutely continuous on . Moreover, by (173) we know for all . Following the same argument as in the case , let with , where is the set of roots in of the following equations
Then is a partition of such that for each , for some and for all . Since are polynomials of degree , we have . Thus by Lemma 7.1, for any
(177) |
where the last step follows from . Then for any
(178) |
where is the same as in (175).
Hence,
(179) |
where the first equality follows from the dominated convergence theorem, the step follows from (175), (176), (178), and the step follows from the monotonicity of when , .
Case 2: (). Fix and partition into disjoint open intervals such that is monotone on each of those intervals. Notice that since is a polynomial of degree , and depend on . For , on , when we subdivide the interval, besides the partition points , should also be added to the set of partition points. The new set of partition points has at most points and hence subdivides into at most intervals such that on each subinterval and is monotone. Hence Lemma 7.1 (part ii)) can be applied on each subinterval. The rest of the steps proceed similarly to Case 1, and one obtains
(180) |
where , a constant that depends only on . Following the same reasoning one can obtain (176) for and (178) for , both with replaced by . As a result, one obtains (179) with replaced by . ∎
F.2 Calculation details for Section 7.2
In this subsection we verify parts of (A3) for the specified in Section 7.2. It is easy to verify by the dominated convergence theorem or Pratt’s Lemma:
and
Then
(183) | ||||
(184) | ||||
(185) | ||||
(186) |
where and are continuous functions of by the dominated convergence theorem, with their dependence on the constant suppressed.
It follows that the gradient of with respect to is
(187) |
and Hessian with respect to with entry for given by
(188) |
and the lower part is symmetric to the upper part.
Then by (183), (184), (185), (186), (187) and (188), for any :
where the right hand side of the last display is a continuous function of since and are continuous. Hence to verify the condition (A3) it remains to establish that there exists some such that on is upper bounded by a finite continuous function of .
F.3 Calculation details for Section 7.4
In this subsection we verify parts of (A3) for the specified in Section 7.4. This subsection is similar to the Appendix F.2.
It is easy to verify by the dominated convergence theorem or Pratt’s Lemma:
and
From the preceding four displays,
(189) | ||||
(190) | ||||
(191) | ||||
(192) |
where , , and are continuous functions of by the dominated convergence theorem, with their dependence on the constant suppressed.
Appendix G Proofs for Section 8
Proof of Lemma 8.2.
Step 1: Suppose for any . In this case,
where the first inequality follows from the joint convexity of any -divergence (of which the squared Hellinger distance is an instance), and the second inequality follows from
Step 2: Suppose for any . Let be the discrete probability distribution associated with the weights of and define similarly. Consider any to be a coupling of and . Then
(195) |
where the first inequality follows from the joint convexity of any -divergence, and the second inequality follows from the fact that the Hellinger distance is upper bounded by . Since (195) holds for any coupling of and ,
Step 3: General case. Let . Then by the triangle inequality, Step 1 and Step 2,
Finally, notice that the above procedure does not depend on the specific order of atoms of and , and thus the proof is complete. ∎
Proof of Lemma 8.3.
Since , there exists a sequence such that and
(196) |
for some . Supposing that
then there exist subsequences such that for any
Thus for each , there exists such that , and
By our choice of , for sufficiently large
On the other hand, by Lemma 8.2,
Combining the last two displays,
where the second inequality follows from (196). The last display contradicts . ∎
Proof of Theorem 8.6.
a) Choose a set of distinct points satisfying
Let . Since , there exist and such that
(197) |
Consider and with for and , satisfying . Here is a constant to be determined. Then . Moreover, .
By the two-point Le Cam bound (see [45, (15.14)]),
(198) |
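The two-point argument invoked here can be written, in generic notation and with a conservative constant that may differ from the exact statement in [45, (15.14)], as
\[
\inf_{\hat G}\ \max_{i=1,2}\ \mathbb{E}_{P_i}\,\rho\bigl(\hat G, G_i\bigr) \;\ge\; \frac{\rho(G_1,G_2)}{4}\,\bigl(1 - d_{TV}(P_1,P_2)\bigr),
\]
for any metric $\rho$ and any two parameter values $G_1,G_2$ generating data distributions $P_1,P_2$; combining this with a bound on the total variation (or Hellinger) distance between the two product mixtures yields a bound of the form (198).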
Notice
With our choice of and , by Lemma 8.2, the last display becomes
(199) |
where the equality follows from
due to . Plugging (199) into (198),
(200) |
Consider any satisfying and let . Then . Plugging the specified into (200), the right hand side of the above display becomes
where depends on . Notice that are constants that depend on the probability family and .
Appendix H Proofs of Auxiliary Lemmas
H.1 Proofs for Section B.3
Proof of Lemma B.2.
a)
where the first step follows from the mean value theorem with lying on the line segment connecting and , the second step follows from the Cauchy–Schwarz inequality, and the last step follows from the continuity of at and when .
b) For in specified in the statement,
(201) |
where the first two equalities follow respectively from the fundamental theorem of calculus for -valued functions and -valued functions. Observe that for any matrix ,
where is the Frobenius norm. Applying the preceding display to (201),
Proof of Lemma B.3.
a) Define . From the condition on a dense subset of . Then on the closure of that subset, which contains , since it is a continuous function on . Let and consider its Taylor expansion for any . It follows from on that for any . Thus on . Then
This happens only when . Proceed in the same manner to show for from to .
b) Define . From the condition on a dense subset of . Then on the closure of that subset excluding , which contains , since it is a continuous function on . Let and consider its Taylor expansion at : for , since the Taylor series of , at converges respectively to , on for any . It follows from on that for any . Thus on . Now take and repeat the above analysis with replaced by , resulting in on . Then take and keep repeating the process, and one obtains on since . Let be the smallest power of all power functions that appear in , , and define . Then on . Then
which happens only when and . That is, when , and . Proceed in the same manner to show when , and for from to . ∎
Proof of Lemma B.4.
Let be such that the line segment between and lies in and , due to the fact that the moment generating function exists in a neighborhood of the origin for any given . Then for and for any
(202) |
where step follows from . Then the first conclusion holds with
Take and by Cauchy–Schwarz inequality . Moreover by (202)
∎
H.2 Proofs for Section C.2
Proof of Lemma C.1.
Note that . Construct with for . For large , and . Then for large , and . Then the proof is completed by observing that for large
and ∎
Proof of Lemma C.2.
By decomposing the difference as a telescoping sum,
Then the right hand side of the preceding display is upper bounded by
For clean presentation we write for in the remainder of the proof. Notice that satisfies
Moreover, for
and thus
∎
H.3 Proofs for Section D.4
Proof of Lemma D.4.
a) It suffices to prove since one can do the translation to reduce the general case to the special case . Let , and . Then
and hence
The equality holds if and only if the last two inequalities are attained, which is the case if and only if .
b) It suffices to prove since one can always do the translation and for all to reduce the general case to the special case . By Tonelli’s Theorem, exists for and . Moreover . Then by Tonelli’s Theorem and a)
The equality holds if and only if , if and only if . ∎
H.4 Proofs for Section E.2
Proof of Lemma E.3.
a) It is easy to calculate
(203) |
Let . It is easy to verify that , and . Then by (203)
(204) | ||||
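For orientation, recall the standard identities for a canonical exponential family with log-partition function $A$ and sufficient statistic $T$ (generic notation, not necessarily the paper's symbols):
\[
\nabla A(\eta) \;=\; \mathbb{E}_\eta\bigl[T(X)\bigr], \qquad \nabla^2 A(\eta) \;=\; \operatorname{Cov}_\eta\bigl[T(X)\bigr],
\]
so the Hessian terms appearing in (203) and (204) are covariance matrices, hence positive semidefinite and positive definite in the full rank case.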
b) First assume that is compact and convex. For each , by (204),
where the second equality follows from Taylor's theorem, with on the line segment joining and due to the convexity of . The result then follows with , which is finite since , as a function of , together with its gradient and Hessian, is continuous on .
If is compact but not necessarily convex, consider . Note that , as the convex hull of a compact set, is convex and compact. Moreover, since , as the interior of a convex set, is convex. The proof is then complete by simply repeating the above proof for . ∎
Proof of Lemma E.1.
Consider an -net with minimum cardinality of and an -net with minimum cardinality of the -probability simplex under the distance. Construct a set . Then for any satisfying , there exists some such that, by Lemma 8.2,
Thus
Proof of Lemma E.2.
Let be any one in such that
For any , . Then for any that is not and for any real number
which with shows our choice of is unique and with shows is the optimal permutation for . ∎