
Convergence of de Finetti’s mixing measure in
latent structure models for observed exchangeable sequences

Yun Wei and XuanLong Nguyen
University of Michigan
Abstract

Mixtures of product distributions are a powerful device for learning about heterogeneity within data populations. In this class of latent structure models, de Finetti's mixing measure plays the central role in describing the uncertainty about the latent parameters representing heterogeneity. In this paper, posterior contraction theorems for de Finetti's mixing measure arising from finite mixtures of product distributions are established, under a setting in which the number of exchangeable sequences of observed variables increases while the sequence length(s) may be either fixed or varying. The roles of both the number of sequences and the sequence lengths are carefully examined. In order to obtain concrete rates of convergence, a first-order identifiability theory for finite mixture models and a family of sharp inverse bounds for mixtures of product distributions are developed via a harmonic analysis of such latent structure models. This theory is applicable to broad classes of probability kernels composing the mixture model of product distributions for both continuous and discrete domains $\mathfrak{X}$. Examples of interest include the case where the probability kernel is only weakly identifiable in the sense of [25], the case where the kernel is itself a mixture distribution as in hierarchical models, and the case where the kernel may not have a density with respect to a dominating measure on an abstract domain $\mathfrak{X}$, such as Dirichlet processes.

Acknowledgement

The authors are grateful for the support provided by NSF Grants DMS-1351362, DMS-2015361 and the Toyota Research Institute. We would like to sincerely thank Xianghong Chen, Danqing He and Qingtang Su for valuable discussions, and Judith Rousseau and Nhat Ho for helpful comments. We also thank the anonymous referees and associate editor for valuable comments and suggestions.

1 Introduction

Latent structure models with many observed variables are among the most powerful and widely used tools in statistics for learning about heterogeneity within data population(s). An important canonical example of such models is the mixture of product distributions, which may be motivated by de Finetti's celebrated theorem for exchangeable sequences of random variables [1, 29]. De Finetti's theorem states, roughly, that if $X_1, X_2, \ldots$ is an infinite exchangeable sequence of random variables defined on a measure space $(\mathfrak{X}, \mathcal{A})$, then there exists a random variable $\theta$ in some space $\Theta$, where $\theta$ is distributed according to a probability measure $G$, such that $X_1, X_2, \ldots$ are conditionally i.i.d. given $\theta$. Denoting by $P_\theta$ the conditional distribution of $X_i$ given $\theta$, we may express the joint distribution of an $N$-sequence $X_{[N]} := (X_1, \ldots, X_N)$, for any $N \geq 1$, as a mixture of product distributions in the following sense: for any $A_1, \ldots, A_N \in \mathcal{A}$,

\[
P(X_1\in A_1,\ldots,X_N\in A_N)=\int\prod_{n=1}^{N}P_\theta(X_n\in A_n)\,G(d\theta).
\]

The probability measure $G$ is also known as the de Finetti mixing measure for the exchangeable sequence. It captures the uncertainty about the latent variable $\theta$, which describes the mechanism according to which the sequence $(X_i)_i$ is generated via $P_\theta$. In other words, the de Finetti mixing measure $G$ can be seen as representing the heterogeneity within the data populations observed via sequences $X_{[N]}$. A statistician typically makes some assumption about the family $\{P_\theta\}_{\theta\in\Theta}$, and proceeds to draw inference about the nature of heterogeneity represented by $G$ based on data samples $X_{[N]}$.
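The two-stage generative process above can be sketched in code. This is a minimal simulation of our own (not from the paper), assuming an illustrative Gaussian kernel $P_\theta = N(\theta,1)$ and a two-atom discrete mixing measure $G$; the function name is ours:

```python
import numpy as np

def sample_definetti(atoms, probs, n_obs, rng):
    """Draw one exchangeable sequence: theta ~ G, then X_1..X_N i.i.d. P_theta.

    Here G = sum_j probs[j] * delta_{atoms[j]}, and P_theta = Normal(theta, 1)
    is an illustrative choice of probability kernel.
    """
    theta = rng.choice(atoms, p=probs)                 # latent parameter theta ~ G
    x = rng.normal(theta, 1.0, size=n_obs)             # conditionally i.i.d. given theta
    return theta, x

rng = np.random.default_rng(0)
atoms, probs = np.array([-2.0, 3.0]), np.array([0.3, 0.7])
theta, x = sample_definetti(atoms, probs, n_obs=5, rng=rng)
```

Marginally over $\theta$, the resulting sequence is exchangeable but not independent: all observations share the same latent draw.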

In order to obtain an estimate of the mixing measure $G$, one needs multiple copies of the exchangeable sequences $X_{[N]}$. As mentioned, some assumptions will be required of the probability distributions $P_\theta$, as well as of the mixing measure $G$. Throughout this paper it is assumed that the map $\theta\mapsto P_\theta$ is injective. Moreover, we will confine ourselves to the setting of exact-fitted finite mixtures, i.e., $G$ is assumed to be an element of $\mathcal{E}_k(\Theta)$, the space of discrete measures with $k$ distinct supporting atoms on $\Theta$, where $\Theta$ is a subset of $\mathbb{R}^q$. Accordingly, we may express $G=\sum_{j=1}^k p_j\delta_{\theta_j}$. We may write the distribution for $X_{[N]}$ in the following form, where we include the subscripts $G$ and $N$ to signify their roles:

\[
P_{G,N}(X_1\in A_1,\ldots,X_N\in A_N)=\sum_{j=1}^{k}p_j\biggl\{\prod_{n=1}^{N}P_{\theta_j}(X_n\in A_n)\biggr\}. \tag{1}
\]

Note that when $N=1$, we are reduced to a standard mixture distribution $P_G := P_{G,1}=\sum_{j=1}^k p_j P_{\theta_j}$. Due to the role they play in the composition of the distribution $P_{G,N}$, we also refer to $\{P_\theta\}_{\theta\in\Theta}$ as a family of probability kernels on $\mathfrak{X}$. We are given $m$ independent exchangeable sequences $\{X_{[N_i]}^i\}_{i=1}^m$, each distributed according to $P_{G,N_i}$ given in (1), where $N_i$ denotes the possibly variable length of the $i$-th sequence. The primary question of interest in this paper is the efficiency of the estimation of the true mixing measure $G=G_0\in\mathcal{E}_k(\Theta)$, for some known $k=k_0$, as the sample size $(m, N_1,\ldots,N_m)$ increases in a certain sense.
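When the kernel admits a density, the likelihood under model (1) is a weighted sum over components, with the product over observations taken inside each component. A hedged sketch of this evaluation (our own illustration, again assuming a Gaussian kernel; a log-sum-exp is used for numerical stability):

```python
import numpy as np

def log_mixture_product_density(x_seq, atoms, probs):
    """log p_{G,N}(x_1..x_N) = log sum_j p_j prod_n f(x_n | theta_j),
    with f(.|theta) an illustrative N(theta, 1) density."""
    x = np.asarray(x_seq, float)[:, None]                       # shape (N, 1)
    log_f = -0.5 * (x - atoms) ** 2 - 0.5 * np.log(2 * np.pi)   # shape (N, k)
    log_terms = np.log(probs) + log_f.sum(axis=0)               # shape (k,)
    m = log_terms.max()
    return m + np.log(np.exp(log_terms - m).sum())              # stable log-sum-exp

atoms, probs = np.array([-2.0, 3.0]), np.array([0.3, 0.7])
ll = log_mixture_product_density([2.8, 3.1, 2.9], atoms, probs)
```

Note the product over $n$ is taken before the sum over $j$: the observations within a sequence are tied to a single latent component, which is what distinguishes (1) from $N$ independent draws from $P_G$.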

Models described by Eq. (1) are also known in the literature as mixtures of repeated measurements, or mixtures of grouped observations [23, 12, 10, 28, 44, 36], with applications to domains such as psychological analysis, educational assessment, and topic modeling in machine learning. The random effects model described in Section 1.3.3 of [30], in which the mixing measure is a discrete measure with a finite number of atoms, is also a special case of (1) with $P_\theta$ a normal distribution with mean $\theta$. While [23, 10] consider the case where the number of components $k$ is unknown, [12, 28, 44, 36] focus on the case where $k$ is known, the same as our setup. In many of the aforementioned works the models are nonparametric, i.e., no parametric forms for the probability kernels are assumed, and the focus is on the problem of density estimation due to the nonparametric setup. By contrast, in this paper we study the mixture of product distributions (1) with a parametric form of the component distribution imposed, since in practice prior knowledge of the component distribution $P_\theta$ might be available. Moreover, we investigate the behavior of parameter estimates, namely the convergence of the parameters $p_j$ and $\theta_j$, which is generally more challenging than density estimation in mixture models [32, 26, 25, 27].

Before the efficiency question can be addressed, one must consider the issue of identifiability: under what conditions does the data distribution $P_{G,N}$ uniquely identify the true mixing measure $G_0$? This question has occupied the interest of a number of authors [43, 11, 20], with decisive results obtained recently by [3] on finite mixture models for conditionally independent observations and by [44] on finite mixtures for conditionally i.i.d. observations (given by Eq. (1)). Here, the condition takes the form $N\geq n_0$, for some natural number $n_0\geq 1$ possibly depending on $G_0$. We shall refer to $n_0$ as the (minimal) zero-order identifiable length, or 0-identifiable length for short (a formal definition will be given later). For the conditionally i.i.d. case as in model (1), [44] proves that as long as $N\geq 2k-1$, model (1) is identifiable for any $G_0$. Note that $2k-1$ is only an upper bound on $n_0$. For a given parametric form of $\{P_\theta\}_{\theta\in\Theta}$ and a given truth $G_0$, the 0-identifiable length might be smaller than $2k-1$.

Drawing from existing identifiability results, it is quite apparent that the observed sequence length $N$ (or more precisely, $N_1,\ldots,N_m$, in the case of variable-length sequences) must play a crucial role in the estimation of the mixing measure $G$, in addition to the number $m$ of sequences. Moreover, it is also quite clear that in order to have a consistent estimate of $G=G_0$, the number of sequences $m$ must tend to infinity, whereas $N$ may be allowed to stay fixed. It remains an open question as to the precise roles $m$ and $N$ play in estimating $G$, their effects on the different types of mixing parameters, namely the component parameters (atoms $\theta_j$) and the mixing proportions (probability masses $p_j$), and the rates of convergence of a given estimation procedure.

Partial answers to this question were obtained in several settings of mixtures of product distributions. [23] proposed to discretize data so that the model under consideration becomes a finite mixture of products of identical binomial or multinomial distributions. Restricting to this class of models, a maximum likelihood estimator was applied, and a standard asymptotic analysis establishes the root-$m$ rate for mixing proportion estimates. [21, 20] investigated a number of nonparametric estimators for $G$, and obtained the root-$m$ convergence rate for both mixing proportions and component parameters in the setting of $k=2$ mixture components under suitable identifiability conditions. It seems challenging to extend their method and theory to a more general setting, e.g., $k>2$. Moreover, no result on the effect of $N$ on parameter estimation efficiency seems to be available. Recently, [34, 33] studied the posterior contraction behavior of several classes of Bayesian hierarchical models where the sample is also specified by $m$ sequences of $N$ observations. That approach requires that both $m$ and $N$ tend to infinity and thus cannot be applied to our present setting where $N$ may be fixed.

In this paper we shall present a parameter estimation theory for general classes of finite mixtures of product distributions. An application of this theory will be posterior contraction theorems established for a standard Bayesian estimation procedure, according to which the de Finetti mixing measure $G$ tends toward the truth $G_0$, as $m$ tends to infinity, under suitable conditions. In a standard Bayesian procedure, the statistician endows the space of parameters $\mathcal{E}_{k_0}(\Theta)$ with a prior distribution $\Pi$, which is assumed to have compact support in these theorems, and applies Bayes' rule to obtain the posterior distribution on $\mathcal{E}_{k_0}(\Theta)$, to be denoted by $\Pi(G\,|\,\{X_{[N_i]}^i\}_{i=1}^m)$. To anticipate the distinct convergence behaviors for the atoms and probability mass parameters, for any $G=\sum_{i=1}^k p_i\delta_{\theta_i}$, $G'=\sum_{i=1}^k p'_i\delta_{\theta'_i}\in\mathcal{E}_k(\Theta)$, define

\[
D_N(G,G')=\min_{\tau\in S_k}\sum_{i=1}^{k}\bigl(\sqrt{N}\,\|\theta_{\tau(i)}-\theta'_i\|_2+|p_{\tau(i)}-p'_i|\bigr),
\]

where $S_k$ denotes the set of all permutations of $[k]:=\{1,2,\ldots,k\}$. (The suitability of $D_N$ over other choices of metric will be discussed in Section 3.)
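For small $k$, $D_N$ can be computed by brute force over the $k!$ permutations. A sketch (a helper of our own, with mixing measures represented as (atoms, probs) pairs and atoms of shape $(k, q)$):

```python
import numpy as np
from itertools import permutations

def D_N(G, G_prime, N):
    """D_N(G, G') = min over permutations tau of
    sum_i ( sqrt(N) * ||theta_{tau(i)} - theta'_i||_2 + |p_{tau(i)} - p'_i| )."""
    (th, p), (th2, p2) = G, G_prime
    k = len(p)
    best = np.inf
    for tau in permutations(range(k)):
        tau = list(tau)
        cost = (np.sqrt(N) * np.linalg.norm(th[tau] - th2, axis=1).sum()
                + np.abs(p[tau] - p2).sum())
        best = min(best, cost)
    return best

G0 = (np.array([[0.0], [1.0]]), np.array([0.5, 0.5]))
G1 = (np.array([[1.0], [0.1]]), np.array([0.4, 0.6]))
```

The minimum over permutations handles label switching, and the $\sqrt{N}$ factor weights atom discrepancies more heavily as sequences lengthen, matching the distinct contraction rates anticipated for atoms and mixing proportions.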

We are given $m$ independent exchangeable sequences denoted by $\{X^i_{[N_i]}\}_{i=1}^m$. We naturally require that $\min_i N_i\geq n_0$, where $n_0$ is the zero-order identifiable length depending on $G_0$. Moreover, to obtain concrete rates of convergence, we also need $\min_i N_i\geq n_1$ for some minimal natural number $n_1:=n_1(G_0)\geq 1$. We shall call $n_1$ the minimal first-order identifiable length depending on $G_0$, or 1-identifiable length for short (a formal definition will be given later). Assuming that $\{N_i\}_{i=1}^m$ are uniformly bounded from above by an arbitrary unknown constant, Theorem 6.2 establishes that under suitable regularity conditions on $P_\theta$, the posterior contraction rate for the mixing proportions is bounded above by $m^{-1/2}$, up to a logarithmic quantity. For the mixture components' supporting atoms, the contraction rate is

\[
O_P\Biggl(\sqrt{\frac{\ln(\sum_{i=1}^{m}N_i)}{\sum_{i=1}^{m}N_i}}\Biggr).
\]

Note that $\sum_{i=1}^m N_i$ represents the full volume of the observed data set. More precisely, for suitable kernel families $P_\theta$, as long as $\min_i N_i\geq\max\{n_0,n_1\}$ and $\sup_i N_i<\infty$, there holds

\[
\Pi\Biggl(G\in\mathcal{E}_{k_0}(\Theta): D_{\sum_{i=1}^{m}N_i/m}(G,G_0)\leq C(G_0)\bar{M}_m\sqrt{\frac{\ln(\sum_{i=1}^{m}N_i)}{m}}\;\Biggg|\;X_{[N_1]}^1,\ldots,X_{[N_m]}^m\Biggr)\to 1
\]

in $P_{G_0,N_1}\otimes\cdots\otimes P_{G_0,N_m}$-probability as $m\to\infty$, for any sequence $\bar{M}_m\to\infty$. The point here is that the constant $C(G_0)$ is independent of $m$, the sequence lengths $\{N_i\}_{i=1}^m$ and their supremum. In plain terms, we may say that with finite mixtures of product distributions, the posterior inference of the atoms of each individual mixture component receives the full benefit of "borrowing strength" across sampled sequences, while the mixing probabilities gain efficiency only from the number of such sequences. This appears to be the first work in which such a posterior contraction theorem is established for the de Finetti mixing measure arising from finite mixtures of product distributions.

The Bayesian learning rates established appear intuitive, given that the parameter space $\Theta\subset\mathbb{R}^q$ is of finite dimension. On the role of $m$, they are broadly comparable to the previous partial results [23, 21, 20]. However, we wish to make several brief remarks at this juncture.

  • First, even for exact-fitted parametric mixture models, "parametric-like" learning rates of the form root-$m$ or root-$(mN)$ should not be taken for granted, because they do not always hold [25, 27]. This is due to the fact that the kernel family $\{P_\theta\}_{\theta\in\Theta}$ may easily violate the assumptions of strong identifiability often required for the root-$m$ rate to take place. In other words, the kernel family $\{P_\theta\}$ may be only weakly identifiable, resulting in poor learning rates for a standard mixture, i.e., when $N=1$.

  • Second, the fact that by increasing the observed exchangeable sequence's length $N$ so that $N\geq n_1\vee n_0$, one may obtain parametric-like learning rates in terms of both $N$ and $m$ is a remarkable testament to how repeated measurements can help to completely overcome a latent variable model's potential pathologies: parameter non-identifiability is overcome by making $N\geq n_0$, while the inefficiency of parameter estimation inherent in weakly identifiable mixture models is overcome by $N\geq n_1$. For a deeper appreciation of this issue, see Section 2 for background on the role of identifiability notions in parameter estimation.

Although the posterior contraction theorems for finite mixtures of product distributions presented in this paper are new, such results do not adequately capture the rather complex behavior of the convergence of parameters for a finite mixture of $N$-product distributions. In fact, the heart of the matter lies in the establishment of a collection of general inverse bounds, i.e., inequalities of the form

\[
D_N(G,G_0)\leq C(G_0)\,V(P_{G,N},P_{G_0,N}), \tag{2}
\]

where $V(\cdot,\cdot)$ is the variational distance. Note that (2) provides an upper bound on the distance $D_N$ between mixing measures in terms of the variational distance between the corresponding mixtures of $N$-product distributions. Inequalities of this type allow one to transfer the convergence (and learning rates) of a data population's distribution into that of the corresponding distribution's parameters (hence the term "inverse bounds"). Several points to highlight are:

  • The local nature of (2): it may hold only for $G$ residing in a suitably small $D_N$-neighborhood of $G_0$, whose radius may also depend on $G_0$ and $N$, while the constant $C(G_0)>0$ depends on $G_0$ but is independent of $N$. In addition, the bound holds only when $N$ exceeds a threshold $n_1\geq 1$, unless further assumptions are imposed. For instance, under a first-order identifiability condition on $P_\theta$, $n_1=1$, so this bound holds for all $N\geq 1$ while remaining local in nature. Moreover, inequality (2) is sharp: the quantity $N$ in $D_N$ cannot be improved to $D_{\psi(N)}$ for any sequence $\psi(N)$ such that $\psi(N)/N\rightarrow\infty$ (see Lemma 8.3).

  • The inverse bounds of the form (2) are established without any overt assumption of identifiability. However, they carry striking consequences for both first-order and classical identifiability, which can be deduced from (2) under a compactness condition (see Proposition 5.1): using the notation $n_0(G,\cup_{k\leq k_0}\mathcal{E}_k(\Theta_1))$ and $n_1(G,\mathcal{E}_{2k_0}(\Theta_1))$ to denote explicitly the dependence of the 0- and 1-identifiable lengths on $G$ in the first argument and its ambient space in the second argument, respectively, we have

    \[
    \sup_{G\in\cup_{k\leq k_0}\mathcal{E}_k(\Theta_1)}n_0(G,\cup_{k\leq k_0}\mathcal{E}_k(\Theta_1))\leq\sup_{G\in\mathcal{E}_{2k_0}(\Theta_1)}n_1(G,\mathcal{E}_{2k_0}(\Theta_1))<\infty.
    \]

    Note that classical identifiability, captured by $n_0(G,\cup_{k\leq k_0}\mathcal{E}_k(\Theta_1))$, describes a global property of the model family, while first-order identifiability, captured by $n_1(G,\mathcal{E}_{2k_0}(\Theta_1))$, is local in nature. The connection between these two concepts is possible because when the number of exchangeable variables $N$ gets large, the force of the central limit theorem for product distributions comes into effect to make the mixture model eventually become identifiable, either in the classical or the first-order sense, even if the model may be initially non-identifiable or weakly identifiable (when $N=1$).

  • These inverse bounds hold for very broad classes of probability kernels $\{P_\theta\}_{\theta\in\Theta}$. In particular, they are established under mild regularity assumptions on the family of probability kernels $P_\theta$ on $\mathfrak{X}$, when either $\mathfrak{X}=\mathbb{R}^d$, or $\mathfrak{X}$ is a finite set, or $\mathfrak{X}$ is an abstract space. A standard but non-trivial example of our theory is the case where the kernels $P_\theta$ belong to the exponential families of distributions. A more unusual example is the case where $P_\theta$ is itself a mixture distribution on $\mathfrak{X}$. Kernels of this type are rarely examined in theory, partly because when we set $N=1$ a mixture model using such kernels typically would not be parameter-identifiable. However, such "mixture-distribution" kernels are frequently employed by practitioners of hierarchical models (i.e., mixtures of mixture distributions). As the inverse bounds entail, this makes sense, since the parameters become more strongly identifiable and efficiently estimable with repeated exchangeable measurements.

  • More generally, the inverse bounds hold when $P_\theta$ does not necessarily admit a density with respect to a dominating measure on $\mathfrak{X}$. An example considered in the paper is the case where $P_\theta$ represents a probability distribution on the space of probability distributions, namely, $P_\theta$ represents (mixtures of) Dirichlet processes. As such, the general inverse bounds are expected to be useful for models with nonparametric mixture components represented by $P_\theta$, the kind of models that have attracted much recent attention, e.g., [41, 37, 8, 7].

The above highlights should make clear the central role of the inverse bounds obtained in Section 4 and Section 5, which deepen our understanding of the questions of parameter identifiability and provide detailed information about the convergence behavior of parameter estimation. In addition to the asymptotic analysis of Bayesian estimation for mixtures of product distributions carried out in this paper, such inverse bounds may also be useful for deriving rates of convergence for non-Bayesian parameter estimation procedures, including maximum likelihood estimation and distance-based estimation methods.

The rest of the paper will proceed as follows. Section 2 presents related work in the literature and a high-level overview of our approach and techniques. Section 3 prepares the reader with basic setups and several useful concepts of distances on the space of mixing measures that arise in mixtures of product distributions. Section 4 is a self-contained treatment of first-order identifiability theory for finite mixture models, leading to several new results that are useful for subsequent developments. Section 5 presents inverse bounds for broad classes of finite mixtures of product distributions, along with specific examples. An immediate application of these bounds is the posterior contraction theorems for de Finetti's mixing measures, the main focus of Section 6. Particular examples of interest for the inverse bounds established in Section 5 include the case where the probability kernel $P_\theta$ is itself a mixture distribution on $\mathfrak{X}=\mathbb{R}$, and the case where $P_\theta$ is a mixture of Dirichlet processes. These examples require the development of new tools and are deferred to Section 7. Section 8 gives several technical results demonstrating the sharpness of the established inverse bounds, which are then used to derive minimax lower bounds for estimation procedures of de Finetti's mixing parameters. Section 9 discusses extensions and several future directions. Finally, (most) proofs of the theorems and lemmas are provided in the Appendix.

Notation. For any probability measures $P$ and $Q$ on a measure space $(\mathfrak{X},\mathcal{A})$ with densities $p$ and $q$, respectively, with respect to some base measure $\mu$, the variational distance between them is $V(P,Q)=\sup_{A\in\mathcal{A}}|P(A)-Q(A)|=\frac{1}{2}\int_{\mathfrak{X}}|p(x)-q(x)|\,d\mu$. The Hellinger distance is given by $h(P,Q)=\bigl(\frac{1}{2}\int_{\mathfrak{X}}|\sqrt{p(x)}-\sqrt{q(x)}|^2\,d\mu\bigr)^{1/2}$. The Kullback-Leibler divergence of $Q$ from $P$ is $K(p,q)=\int_{\mathfrak{X}}p(x)\ln\frac{p(x)}{q(x)}\,d\mu$. Write $P\otimes Q$ for the product measure of $P$ and $Q$, and $\otimes^N P$ for the $N$-fold product of $P$. Any vector $x\in\mathbb{R}^d$ is a column vector with its $i$-th coordinate denoted by $x^{(i)}$. The inner product between two vectors $a$ and $b$ is denoted by $a^\top b$ or $\langle a,b\rangle$. Denote by $C(\cdot)$ or $c(\cdot)$ a positive finite constant depending only on its parameters and the probability kernel $\{P_\theta\}_{\theta\in\Theta}$; in the presentation of inequality bounds and proofs, such constants may differ from line to line. Write $a\lesssim b$ if $a\leq cb$ for some universal constant $c$; write $a\lesssim_\xi b$ if $a\leq c(\xi)b$. Write $a\asymp b$ if $a\lesssim b$ and $b\lesssim a$; write $a\asymp_\xi b$ if $a\lesssim_\xi b$ and $b\lesssim_\xi a$.
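For discrete distributions on a finite $\mathfrak{X}$, the three divergences just defined reduce to finite sums. A minimal sketch (helper names are ours):

```python
import numpy as np

def variational(p, q):
    """V(P,Q) = (1/2) * sum |p - q| for discrete distributions."""
    return 0.5 * np.abs(np.asarray(p, float) - np.asarray(q, float)).sum()

def hellinger(p, q):
    """h(P,Q) = sqrt( (1/2) * sum (sqrt(p) - sqrt(q))^2 )."""
    return np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())

def kl(p, q):
    """K(p,q) = sum p * ln(p/q); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return (p[mask] * np.log(p[mask] / q[mask])).sum()

p, q = [0.5, 0.5], [0.75, 0.25]
```

These satisfy the standard comparison $h^2(P,Q)\leq V(P,Q)\leq\sqrt{2}\,h(P,Q)$, which is what allows contraction rates to be transferred between the two metrics.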

2 Background and overview

2.1 First-order identifiability and inverse inequalities

In order to shed light on the convergence behavior of model parameters as the data sample size increases, stronger forms of identifiability conditions shall be required of the family of probability kernels $P_\theta$. For finite mixture models, such conditions are often stated in terms of a suitable derivative of the density of $P_\theta$ with respect to the parameter $\theta$, and the linear independence of such derivatives as $\theta$ varies in $\Theta$. The impact of such identifiability conditions, or of the lack thereof, on the convergence of parameter estimation can be quite delicate. Specifically, let $\mathfrak{X}=\mathbb{R}^d$ and fix $N=1$, so we have $P_G=\sum_{j=1}^k p_j P_{\theta_j}$. Assume that $P_\theta$ admits a density function $f(\cdot|\theta)$ with respect to the Lebesgue measure on $\mathbb{R}^d$, that $f(x|\theta)$ is differentiable with respect to $\theta$ for all $x\in\mathbb{R}^d$, and moreover that the combined collection of functions $\{f(\cdot|\theta)\}_{\theta\in\Theta}$ and $\{\nabla f(\cdot|\theta)\}_{\theta\in\Theta}$ is linearly independent. This type of condition, which concerns the linear independence of the first derivatives of the likelihood functions with respect to the parameter $\theta$, shall be generically referred to as a first-order identifiability condition for the probability kernel family $\{P_\theta\}_{\theta\in\Theta}$. A version of such a condition was investigated by [26], who showed that their condition is sufficient for establishing an inverse bound of the form

\[
\liminf_{\substack{G\xrightarrow{W_1}G_0\\ G\in\mathcal{E}_{k_0}(\Theta)}}\frac{V(P_G,P_{G_0})}{W_1(G,G_0)}>0, \tag{3}
\]

where $W_1$ denotes the first-order Wasserstein distance on $\mathcal{E}_{k_0}(\Theta)$. The infimum limit quantifier should help to clarify somewhat the local nature of the inverse bound (2) mentioned earlier. The development of this local inverse bound and its variants plays a fundamental role in the analysis of parameter estimation with finite mixtures in a variety of settings in previous studies, where stronger forms of identifiability conditions based on higher-order derivatives may be required [9, 32, 38, 26, 25, 22, 27]. In addition, [32, 34] studied inverse bounds of this type for infinite mixtures and hierarchical models.

As noted by [26], for the exact-fitted setting of mixtures, i.e., when the number of mixture components $k=k_0$ is known, conditions based on only first-order derivatives of $P_\theta$ suffice. Under a suitable first-order identifiability condition based on the linear independence of $\{f(\cdot|\theta),\nabla_\theta f(\cdot|\theta)\}_{\theta\in\Theta}$, along with several additional regularity conditions, the mixing measure $G=G_0$ may be estimated via an $m$-i.i.d. sample $(X_{[1]}^1,\ldots,X_{[1]}^m)$ at the parametric rate of convergence $m^{-1/2}$, due to (3) and the fact that the data population density $p_{G_0}$ is typically estimated at the same parametric rate. However, first-order identifiability may not be satisfied, as is the case for the two-parameter gamma kernel or the three-parameter skew-normal kernel, owing to the fact that these kernels are governed by certain partial differential equations. In such situations, not only does the resulting Fisher information matrix of the mixture model become singular, but the singularity structure of the matrix can be extremely complex; an in-depth treatment of weakly identifiable mixture models can be found in [27]. Briefly speaking, in such situations (3) may not hold and the rate $m^{-1/2}$ may not be achieved [25, 27]. In particular, in the case of skew-normal kernels, extremely slow rates of convergence for the component parameters $\theta_j$ (e.g., $m^{-1/4}, m^{-1/6}, m^{-1/8}$ and so on) may be established, depending on the actual parameter values of the true $G_0$, for a standard Bayesian estimation or maximum likelihood estimation procedure [27]. It remains unknown whether it is possible to devise an estimation procedure achieving the parametric rate of convergence $m^{-1/2}$ when the finite mixture model is only weakly identifiable, i.e., when the first-order identifiability condition fails.
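Linear independence of $\{f(\cdot|\theta_j),\nabla_\theta f(\cdot|\theta_j)\}$ in $L^2$ can be probed numerically via the smallest eigenvalue of the Gram matrix of these functions: it is positive precisely when the functions are linearly independent (on the grid used). The following sketch is our own illustration, not the paper's method; it uses a Gaussian location kernel, which is first-order identifiable, and contrasts it with a deliberately dependent system:

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

def f(theta):
    """Density of N(theta, 1), an illustrative kernel (evaluated on the grid)."""
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2.0 * np.pi)

def df(theta):
    """Derivative of the density with respect to theta."""
    return (x - theta) * f(theta)

def min_gram_eig(funcs):
    """Smallest eigenvalue of the L2 Gram matrix of a list of grid functions."""
    F = np.stack(funcs)
    return np.linalg.eigvalsh(F @ F.T * dx).min()

thetas = [0.0, 1.5]
independent = [g(t) for t in thetas for g in (f, df)]
eig_indep = min_gram_eig(independent)                      # positive: independent
eig_dep = min_gram_eig(independent + [f(0.0) + df(1.5)])   # exact linear combination
```

A singular or near-singular Gram matrix signals the kind of weak identifiability discussed above; for kernels such as the gamma or skew-normal, exact relations among these functions drive the matrix singular.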

In Section 4 we shall revisit the described first-order identifiability notions, and then present considerable improvements upon the existing theory and deliver several novel results. First, we identify a tightened set of conditions concerning the linear independence of $f(x|\theta)$ and $\nabla_\theta f(x|\theta)$ under which the inverse bound (2) holds. This set of conditions turns out to be substantially weaker than the identifiability condition of [26], most notably by requiring $f(x|\theta)$ to be differentiable with respect to $\theta$ only for $x$ in a subset of $\mathfrak{X}$ with positive measure. This weaker notion of first-order identifiability allows us to broaden the scope of probability kernels for which the inverse bound (3) continues to apply (see Lemma 4.2). Second, in a precise sense we show that this notion is in fact necessary for (3) to hold (see Lemma 4.4), giving us an arguably complete characterization of first-order identifiability and its relations to the parametric learning rate for model parameters. Among other new results, it is worth mentioning that when the kernel family $\{P_\theta\}_{\theta\in\Theta}$ belongs to an exponential family of distributions on $\mathfrak{X}$, there is a remarkable equivalence between our notion of first-order identifiability, the inverse bound of the form (3), and the inverse bound in which the variational distance $V$ is replaced by the Hellinger distance $h$ (see Lemma 4.15).

Turning our attention to finite mixtures of product distributions, a key question concerns the effect of the number $N$ of repeated measurements in overcoming weak identifiability (e.g., the violation of first-order identifiability). One way to formally define the first-order identifiable length (1-identifiable length) $n_1=n_1(G_0)$ is as the minimal natural number such that the following inverse bound holds for any $N\geq n_1$:

\[
\liminf_{\substack{G\xrightarrow{W_1}G_0\\ G\in\mathcal{E}_{k_0}(\Theta)}}\frac{V(P_{G,N},P_{G_0,N})}{W_1(G,G_0)}>0. \tag{4}
\]

The key questions are whether a (finite) 1-identifiable length exists, and how we can characterize it. The significance of this concept is that one can achieve first-order identifiability by allowing at least $N\geq n_1$ repeated measurements and obtain the $m^{-1/2}$ learning rate for the mixing measure. In fact, the component parameters can be learned at the rate $(mN)^{-1/2}$, the square root of the full volume of exchangeable data (modulo a logarithmic term). The resolution of the question of the existence and characterization of $n_1$ leads us to establish a collection of inverse bounds involving mixtures of product distributions, which we describe next. Moreover, such inverse bounds are essential in deriving learning rates for the mixing measure $G$ from a collection of exchangeable sequences of observations.

2.2 General approach and techniques

For finite mixtures of $N$-product distributions, for $N\geq 1$, the precise expression of the inverse bound to be established takes the following form: under certain conditions on the probability kernel $\{P_\theta\}_{\theta\in\Theta}$, for a given $G_0\in\mathcal{E}_{k_0}(\Theta^\circ)$,

lim infNlim infGW1G0Gk0(Θ)V(PG,N,PG0,N)DN(G,G0)>0.\liminf_{{N}\to\infty}\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{{N}}(G,G_{0})}>0. (5)

Compared to inverse bound (3) for a standard finite mixture, the double limit infimum reveals the challenge in analyzing mixtures of N{N}-product distributions; it expresses the delicate nature of the inverse bound informally described via (2). Moreover, (5) entails that the finite 1-identifiable length n1n_{1} defined by (4) exists.

Inverse bound (5) will be established for broad classes of kernels PθP_{\theta}, and it can be shown that this bound is sharp. Among the settings in which the bound is applicable is the case where PθP_{\theta} belongs to a regular exponential family of distributions. More generally, the bound applies in settings where 𝔛\mathfrak{X} may be an abstract space; no parametric assumption on PθP_{\theta} will be required. Instead, we appeal to a set of mild regularity conditions on the characteristic function of a push-forward measure produced by a measurable map TT acting on the measure space (𝔛,𝒜)(\mathfrak{X},\mathcal{A}). In fact, a stronger bound is established relating to the positivity of a notion of curvature on the space of mixtures of product distributions (see (23)). We will see that this collection of inverse bounds, which are presented in Section 5, enables the study of a very broad range of mixtures of product distributions for exchangeable sequences.

The theorems establishing (5) and (23) represent the core of the paper. For simplicity, let us describe the gist of our proof techniques by considering the case where the kernel PθP_{\theta} belongs to an exponential family of distributions on 𝔛\mathfrak{X} (see Theorem 5.8). Suppose the kernel admits a density function f(x|θ)f(x|\theta) with respect to a dominating measure μ\mu on 𝔛\mathfrak{X}. At a high level, this is a proof by contradiction: if (5) does not hold, then there exists a strictly increasing subsequence {N}=1\{{N}_{\ell}\}_{\ell=1}^{\infty} of natural numbers according to which there exists a sequence of mixing measures {G}=1k0(Θ)\{G0}\{G_{\ell}\}_{\ell=1}^{\infty}\subset\mathcal{E}_{k_{0}}(\Theta)\backslash\{G_{0}\} such that DN(G,G0)0D_{{N}_{\ell}}(G_{\ell},G_{0})\rightarrow 0 as \ell\rightarrow\infty and the integral form

V(PG,N,PG0,N)DN(G,G0)=𝔛N|pG,N(x1,,xN)pG0,N(x1,,xN)DN(G,G0)|dNμ(x1,,xN)\frac{V(P_{G_{\ell},{N}_{\ell}},P_{G_{0},{N}_{\ell}})}{D_{{N}_{\ell}}(G_{\ell},G_{0})}=\int_{\mathfrak{X}^{{N}_{\ell}}}\biggr{|}\frac{p_{G_{\ell},{N}_{\ell}}(x_{1},\ldots,x_{{N}_{\ell}})-p_{G_{0},{N}_{\ell}}(x_{1},\ldots,x_{{N}_{\ell}})}{D_{{N}_{\ell}}(G_{\ell},G_{0})}\biggr{|}d\otimes^{{N}_{\ell}}\mu(x_{1},\ldots,x_{{N}_{\ell}}) (6)

tends to zero. One may be tempted to apply Fatou’s lemma to deduce that the integrand must vanish as \ell\rightarrow\infty, and from that one may hope to derive a contradiction with the specified hypothesis on the probability kernel f(x|θ)f(x|\theta) (e.g. first-order identifiability). This is basically the proof technique of Lemma 4.2 for establishing inverse bound (3) for finite mixtures. But this would not work here, because the integration domain’s dimensionality increases with \ell. Instead we can exploit the structure of the mixture of N{N}_{\ell}-product densities in pG,Np_{G_{\ell},{N}_{\ell}}, and rewrite the integral as an expectation with respect to a suitable random variable of fixed domain. What comes to our rescue is the central limit theorem, which is applied to an q\mathbb{R}^{q}-valued random variable Z=(n=1NT(Xn)N𝔼θα0T(X1))/NZ_{\ell}=\left(\sum_{n=1}^{{N}_{\ell}}T(X_{n})-{N}_{\ell}\mathbb{E}_{\theta_{\alpha}^{0}}T(X_{1})\right)/\sqrt{{N}_{\ell}}, where 𝔼θα0\mathbb{E}_{\theta_{\alpha}^{0}} denotes the expectation taken with respect to the probability distribution PθP_{\theta} for some suitable θ=θα0\theta=\theta_{\alpha}^{0} chosen from the support of the true mixing measure G0G_{0}. Here T:𝔛qT:\mathfrak{X}\rightarrow\mathbb{R}^{q} denotes the sufficient statistic for the exponential family distribution Pθ(dxn)P_{\theta}(dx_{n}), for each n=1,,Nn=1,\ldots,{N}_{\ell}.
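As a numerical sanity check of this step, the following sketch simulates the normalized statistic Z above for a Poisson kernel, a hypothetical choice made purely for illustration (the rate theta, the length N, and the statistic T(x) = x are our assumptions, not taken from the paper), and confirms the Gaussian limit predicted by the CLT.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration: Poisson kernel P_theta with rate theta,
# whose sufficient statistic is T(x) = x, so E_theta T(X_1) = theta
# and Var_theta T(X_1) = theta.
theta = 3.0
N = 2000        # sequence length (plays the role of N_ell)
reps = 5000     # independent realizations of Z

samples = rng.poisson(theta, size=(reps, N))
# Z = (sum_n T(X_n) - N * E T) / sqrt(N)
Z = (samples.sum(axis=1) - N * theta) / np.sqrt(N)

# By the CLT, Z is approximately N(0, Var_theta T) = N(0, theta)
print(Z.mean(), Z.var())
```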

Continuing with this plan, by a change of measure the integral in (6) may be expressed as an expectation of the form 𝔼|Ψ(Z)|\mathbb{E}|\Psi_{\ell}(Z_{\ell})| for some suitable function Ψ:q\Psi_{\ell}:\mathbb{R}^{q}\rightarrow\mathbb{R}. By exploiting the structure of the exponential family dictating the form of Ψ\Psi_{\ell}, one can show that for any sequence zzz_{\ell}\rightarrow z, there holds Ψ(z)Ψ(z)\Psi_{\ell}(z_{\ell})\rightarrow\Psi(z) for a certain function Ψ:q\Psi:\mathbb{R}^{q}\rightarrow\mathbb{R}. Since ZZ_{\ell} converges in distribution to ZZ, a non-degenerate zero-mean Gaussian random vector in q\mathbb{R}^{q}, it follows that Ψ(Z)\Psi_{\ell}(Z_{\ell}) converges to Ψ(Z)\Psi(Z) in distribution by a generalized continuous mapping theorem [46]. Coupled with a generalized Fatou’s lemma [5], we arrive at 𝔼θα0|Ψ(Z)|=0\mathbb{E}_{\theta_{\alpha}^{0}}|\Psi(Z)|=0, which yields a contradiction.

For the general setting where {Pθ}θΘ\{P_{\theta}\}_{\theta\in\Theta} is a family of probability measures on the measurable space (𝔛,𝒜)(\mathfrak{X},\mathcal{A}), the basic proof structure remains the same, but we can no longer exploit the (explicit) parametric assumption on the kernel family PθP_{\theta} (see Theorem 5.16). Since the primary object of inference is the parameter θΘq\theta\in\Theta\subset\mathbb{R}^{q}, the assumptions on the kernel PθP_{\theta} will center on the existence of a measurable map T:(𝔛,𝒜)(s,(s))T:(\mathfrak{X},\mathcal{A})\to(\mathbb{R}^{s},\mathcal{B}(\mathbb{R}^{s})) for some sqs\geq q, and regularity conditions on the push-forward measure on s\mathbb{R}^{s}: T#Pθ:=PθT1T_{\#}P_{\theta}:=P_{\theta}\circ T^{-1}. This measurable map plays the same role as that of the sufficient statistic TT when PθP_{\theta} belongs to the exponential family. The main challenge lies in the analysis of the function Ψ\Psi_{\ell} described in the previous paragraph. It is here that the power of Fourier analysis is brought to bear on the analysis of Ψ\Psi_{\ell} and the expectation 𝔼θα0Ψ(Z)\mathbb{E}_{\theta_{\alpha}^{0}}\Psi_{\ell}(Z_{\ell}). By the Fourier inversion theorem, Ψ\Psi_{\ell} may be expressed entirely in terms of the characteristic function of the push-forward measure T#PθT_{\#}P_{\theta}. Provided that regularity conditions on this characteristic function hold, one is able to establish the convergence of Ψ\Psi_{\ell} toward a certain function Ψ:s\Psi:\mathbb{R}^{s}\rightarrow\mathbb{R} as before.

We shall provide a variety of examples demonstrating the broad applicability of Theorem 5.16, focusing on the cases where PθP_{\theta} does not belong to an exponential family of distributions. In some cases, checking for the existence of the map TT is straightforward. When PθP_{\theta} is a complex object, in particular, when PθP_{\theta} is itself a mixture distribution, this requires substantial work, as should be expected. In that case, the burden of checking the applicability of Theorem 5.16 lies primarily in evaluating certain oscillatory integrals composed of the map TT in question. Tools from harmonic analysis of oscillatory integrals will be developed for such a purpose and presented in Section 7. We expect that the tools developed here present a useful stepping stone toward a more satisfactory theoretical treatment of complex hierarchical models (models that may be viewed as mixtures of mixtures of distributions, e.g. [41, 37, 34, 8]), which have received broad and ever-deepening attention in the literature.

3 Preliminaries

We start by setting up the basic notions required for the analysis of mixtures of product distributions. The exchangeable data sequences are denoted by X[Ni]i:=(X1i,,XNii)X^{i}_{[{N}_{i}]}:=(X_{1}^{i},\ldots,X_{{N}_{i}}^{i}) for i=1,,mi=1,\ldots,{m}, where Ni{N}_{i} denotes the length of sequence X[Ni]iX_{[{N}_{i}]}^{i}. For ease of presentation, for now, we shall assume that Ni=NN_{i}=N for all ii. Later we will allow variable length sequences. These sequences are composed of elements in a measurable space (𝔛,𝒜)(\mathfrak{X},\mathcal{A}). Examples include 𝔛=d\mathfrak{X}=\mathbb{R}^{d}, a discrete space, and a space of measures. Regardless, the parameters of interest are always encapsulated by discrete mixing measures Gk(Θ)G\in\mathcal{E}_{k}(\Theta), the space of discrete measures with kk distinct support atoms residing in Θq\Theta\subset\mathbb{R}^{q}.

The linkage between parameters of interest, i.e., the mixing measure GG, and the observed data sequences is achieved via the mixture of product distributions that we now define. Consider a family of probability distributions {Pθ}θΘ\{P_{\theta}\}_{\theta\in\Theta} on measurable space (𝔛,𝒜)(\mathfrak{X},\mathcal{A}), where θ\theta is the parameter of the family and Θq\Theta\subset\mathbb{R}^{q} is the parameter space. Throughout this paper it is assumed that the map θPθ\theta\mapsto P_{\theta} is injective. For NN\in\mathbb{N}, the N{N}-product probability family is denoted by {Pθ,N:=NPθ}θΘ\{P_{\theta,{N}}:=\bigotimes^{{N}}P_{\theta}\}_{\theta\in\Theta} on (𝔛N,𝒜N)(\mathfrak{X}^{{N}},\mathcal{A}^{{N}}), where 𝒜N\mathcal{A}^{{N}} is the product sigma-algebra. Given a mixing measure G=i=1kpiδθik(Θ)G=\sum_{i=1}^{k}p_{i}\delta_{\theta_{i}}\in\mathcal{E}_{k}(\Theta), the mixture of N{N}-product distributions induced by GG is given by

PG,N=i=1kpiPθi,N.P_{G,{N}}=\sum_{i=1}^{k}p_{i}P_{\theta_{i},{N}}.

Each exchangeable sequence X[N]i=(X1i,,XNi)X^{i}_{[N]}=(X_{1}^{i},\ldots,X_{{N}}^{i}), for i=1,,mi=1,\ldots,{m}, is an independent sample distributed according to PG,NP_{G,{N}}. Due to the role they play in the composition of distribution PG,NP_{G,{N}}, we also refer to {Pθ}θΘ\{P_{\theta}\}_{\theta\in\Theta} as a family of probability kernels on (𝔛,𝒜)(\mathfrak{X},\mathcal{A}).
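As an illustration, a minimal sketch of sampling m exchangeable sequences from P_{G,N} proceeds by drawing one latent atom per sequence and then N conditionally i.i.d. coordinates. The Gaussian kernel N(θ, 1) and the two-atom mixing measure below are our own choices, made purely for concreteness.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative choices (assumptions, not from the paper):
# kernel P_theta = N(theta, 1); G = 0.3 * delta_{-2} + 0.7 * delta_{+2}.
p = np.array([0.3, 0.7])
thetas = np.array([-2.0, 2.0])
m, N = 1000, 5   # m exchangeable sequences, each of length N

# For each sequence i, draw one latent component, then N conditionally
# i.i.d. coordinates from P_theta: this is exactly a draw from P_{G,N}.
components = rng.choice(len(p), size=m, p=p)
X = rng.normal(loc=thetas[components][:, None], scale=1.0, size=(m, N))

print(X.shape)   # (m, N): coordinates within a row are exchangeable
```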

In order to quantify the convergence of mixing measures arising in mixture models, a useful device is a suitably defined optimal transport distance [32, 31]. Consider the Wasserstein-pp distance w.r.t. the distance dΘd_{\Theta} on Θ\Theta: G=i=1kpiδθi,G=i=1kpiδθi\forall G=\sum_{i=1}^{k}p_{i}\delta_{\theta_{i}},G^{\prime}=\sum_{i=1}^{k^{\prime}}p^{\prime}_{i}\delta_{\theta^{\prime}_{i}}, define

Wp(G,G;dΘ)=(min𝒒i=1kj=1kqijdΘp(θi,θj))1/p,W_{p}(G,G^{\prime};d_{\Theta})=\left(\min_{\bm{q}}\sum_{i=1}^{k}\sum_{j=1}^{k^{\prime}}q_{ij}d_{\Theta}^{p}(\theta_{i},\theta^{\prime}_{j})\right)^{1/p}, (7)

where the minimum is taken over all joint probability distributions 𝒒\bm{q} on [k]×[k][k]\times[k^{\prime}] such that, when expressing 𝒒\bm{q} as a k×kk\times k^{\prime} matrix, the marginal constraints hold: j=1kqij=pi\sum_{j=1}^{k^{\prime}}q_{ij}=p_{i} and i=1kqij=pj\sum_{i=1}^{k}q_{ij}=p^{\prime}_{j}. For the special case when dΘd_{\Theta} is the Euclidean distance, write simply Wp(G,G)W_{p}(G,G^{\prime}) instead of Wp(G,G;dΘ)W_{p}(G,G^{\prime};d_{\Theta}). Write GWpGG_{\ell}\overset{W_{p}}{\to}G if GG_{\ell} converges to GG under the WpW_{p} distance w.r.t. the Euclidean distance on Θ\Theta.
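Since G and G′ are discrete, the optimal transport problem (7) is a finite linear program and can be solved directly. The following sketch (function and variable names are our own; it uses scipy.optimize.linprog and the Euclidean distance on a scalar parameter space) illustrates the computation.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_p(p, atoms, q, atoms2, ord_p=1):
    """W_p between the discrete measures sum_i p_i delta_{atoms_i}
    and sum_j q_j delta_{atoms2_j}, via the linear program (7)."""
    k, kp = len(p), len(q)
    # cost c_{ij} = ||theta_i - theta'_j||^p, flattened row-major
    cost = np.array([[np.linalg.norm(np.atleast_1d(a) - np.atleast_1d(b)) ** ord_p
                      for b in atoms2] for a in atoms]).ravel()
    # marginal constraints on the k x k' coupling matrix q_{ij}
    A_eq = np.zeros((k + kp, k * kp))
    for i in range(k):
        A_eq[i, i * kp:(i + 1) * kp] = 1.0      # row sums equal p_i
    for j in range(kp):
        A_eq[k + j, j::kp] = 1.0                # column sums equal q_j
    res = linprog(cost, A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None))
    return res.fun ** (1.0 / ord_p)

# Moving mass 0.5 across distance 1 costs 0.5 under W_1:
print(wasserstein_p([0.5, 0.5], [0.0, 1.0], [1.0], [1.0]))
```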

For mixing measures arising in mixtures of N{N}-product distributions, a more useful notion is the following. For any G=i=1kpiδθik(Θ)G=\sum_{i=1}^{k}p_{i}\delta_{\theta_{i}}\in\mathcal{E}_{k}(\Theta) and G=i=1kpiδθik(Θ)G^{\prime}=\sum_{i=1}^{k}p^{\prime}_{i}\delta_{\theta^{\prime}_{i}}\in\mathcal{E}_{k}(\Theta), define

DN(G,G)=minτSki=1k(Nθτ(i)θi2+|pτ(i)pi|)D_{{N}}(G,G^{\prime})=\min_{\tau\in S_{k}}\sum_{i=1}^{k}(\sqrt{{N}}\|\theta_{\tau(i)}-\theta^{\prime}_{i}\|_{2}+|p_{\tau(i)}-p^{\prime}_{i}|)

where SkS_{k} denotes the set of all permutations of [k][k]. It is simple to verify that DN(,)D_{{N}}(\cdot,\cdot) is a valid metric on k(Θ)\mathcal{E}_{k}(\Theta) for each N{N} and to relate it to a suitable optimal transport distance. Indeed, G=i=1kpiδθik(Θ)G=\sum_{i=1}^{k}p_{i}\delta_{\theta_{i}}\in\mathcal{E}_{k}(\Theta), due to the permutation invariance of its atoms, can be identified with the set {(θi,pi):1ik}\{(\theta_{i},p_{i}):1\leq i\leq k\}, which can further be identified with G~=i=1k1kδ(θi,pi)k(Θ×)\tilde{G}=\sum_{i=1}^{k}\frac{1}{k}\delta_{(\theta_{i},p_{i})}\in\mathcal{E}_{k}(\Theta\times\mathbb{R}). Formally, we define a map k(Θ)k(Θ×)\mathcal{E}_{k}(\Theta)\to\mathcal{E}_{k}(\Theta\times\mathbb{R}) by

G=i=1kpiδθiG~=i=1k1kδ(θi,pi)k(Θ×).G=\sum_{i=1}^{k}p_{i}\delta_{\theta_{i}}\mapsto\tilde{G}=\sum_{i=1}^{k}\frac{1}{k}\delta_{(\theta_{i},p_{i})}\in\mathcal{E}_{k}(\Theta\times\mathbb{R}). (8)

Now, endow Θ×\Theta\times\mathbb{R} with a metric MNM_{{N}} defined by MN((θ,p),(θ,p))=Nθθ2+|pp|M_{{N}}((\theta,p),(\theta^{\prime},p^{\prime}))=\sqrt{{N}}\|\theta-\theta^{\prime}\|_{2}+|p-p^{\prime}| and note the following fact.

Lemma 3.1.

For any G=i=1k1kδθ¯i,G=i=1k1kδθ¯ik(Θ¯)G=\sum_{i=1}^{k}\frac{1}{k}\delta_{\bar{\theta}_{i}},G^{\prime}=\sum_{i=1}^{k}\frac{1}{k}\delta_{\bar{\theta}^{\prime}_{i}}\in\mathcal{E}_{k}(\bar{\Theta}) and distance dΘ¯d_{\bar{\Theta}} on Θ¯\bar{\Theta},

Wpp(G,G;dΘ¯)=minτSk1ki=1kdΘ¯p(θi,θτ(i)).W_{p}^{p}(G,G^{\prime};d_{\bar{\Theta}})=\min_{\tau\in S_{k}}\frac{1}{k}\sum_{i=1}^{k}d_{\bar{\Theta}}^{p}(\theta_{i},\theta^{\prime}_{\tau(i)}).

A proof of the preceding lemma is available as Proposition 2 in [31]. Applying Lemma 3.1 with Θ¯\bar{\Theta}, dΘ¯d_{\bar{\Theta}} replaced respectively by Θ×\Theta\times\mathbb{R} and MNM_{N}, we obtain W1(G~,G~;MN)=1kDN(G,G)W_{1}(\tilde{G},\tilde{G^{\prime}};M_{{N}})=\frac{1}{k}D_{{N}}(G,G^{\prime}) for any G,Gk(Θ)G,G^{\prime}\in\mathcal{E}_{k}(\Theta), which validates that DND_{{N}} is indeed a metric on k(Θ)\mathcal{E}_{k}(\Theta) and, moreover, that it does not depend on the specific representations of GG and GG^{\prime}.
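For small k, the metric D_N can be computed by brute force over all k! permutations. The sketch below (with scalar atoms and names of our own choosing) illustrates the definition directly.

```python
from itertools import permutations
import numpy as np

def D_N(weights, atoms, weights2, atoms2, N):
    """D_N between two measures in E_k(Theta) with scalar atoms,
    by scanning all k! permutations tau in S_k."""
    k = len(weights)
    best = np.inf
    for tau in permutations(range(k)):
        total = sum(np.sqrt(N) * abs(atoms[tau[i]] - atoms2[i])
                    + abs(weights[tau[i]] - weights2[i]) for i in range(k))
        best = min(best, total)
    return best

G  = ([0.5, 0.5], [0.0, 1.0])
Gp = ([0.4, 0.6], [0.0, 1.1])
# identity permutation attains the minimum: sqrt(N)*0.1 + 0.2,
# which equals 0.4 when N = 4
print(D_N(*G, *Gp, N=4))
```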

The next lemma establishes the relationship between DND_{{N}} and W1W_{1} on k(Θ)\mathcal{E}_{k}(\Theta).

Lemma 3.2.

The following statements hold.

  1. a)

    A sequence Gnk(Θ)G_{n}\in\mathcal{E}_{k}(\Theta) converges to G0k(Θ)G_{0}\in\mathcal{E}_{k}(\Theta) under WpW_{p} if and only if GnG_{n} converges to G0G_{0} under DND_{N}. That is, WpW_{p} and DND_{N} generate the same topology.

  2. b)

    Let Θ\Theta be bounded. Then W1(G,G)max{1,diam(Θ)2}D1(G,G)W_{1}(G,G^{\prime})\leq\max\left\{1,\frac{\text{diam}(\Theta)}{2}\right\}D_{1}(G,G^{\prime}) for any G,Gk(Θ)G,G^{\prime}\in\mathcal{E}_{k}(\Theta).
    More generally for any G=i=1kpiδθiG=\sum_{i=1}^{k}p_{i}\delta_{\theta_{i}} and G=i=1kpiδθiG^{\prime}=\sum_{i=1}^{k}p^{\prime}_{i}\delta_{\theta^{\prime}_{i}},

    Wpp(G,G)max{1,diamp(Θ)2}minτSki=1k(θτ(i)θi2p+|pτ(i)pi|).W_{p}^{p}(G,G^{\prime})\leq\max\left\{1,\frac{\text{diam}^{p}(\Theta)}{2}\right\}\min_{\tau\in S_{k}}\sum_{i=1}^{k}\left(\|\theta_{\tau(i)}-\theta^{\prime}_{i}\|_{2}^{p}+|p_{\tau(i)}-p^{\prime}_{i}|\right).
  3. c)

    Fix G0k(Θ)G_{0}\in\mathcal{E}_{k}(\Theta). Then lim infGW1G0Gk(Θ)W1(G,G0)D1(G,G0)>0\liminf\limits_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k}(\Theta)\end{subarray}}\frac{W_{1}(G,G_{0})}{D_{1}(G,G_{0})}>0 and lim infGW1G0Gk(Θ)D1(G,G0)W1(G,G0)>0\liminf\limits_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k}(\Theta)\end{subarray}}\frac{D_{1}(G,G_{0})}{W_{1}(G,G_{0})}>0. That is, in a neighborhood of G0G_{0} in k(Θ)\mathcal{E}_{k}(\Theta), D1(G,G0)G0W1(G,G0)D_{1}(G,G_{0})\asymp_{G_{0}}W_{1}(G,G_{0}).

  4. d)

    Fix G0k(Θ)G_{0}\in\mathcal{E}_{k}(\Theta) and suppose Θ\Theta is bounded. Then W1(G,G0)C(G0,diam(Θ))D1(G,G0)W_{1}(G,G_{0})\geq C(G_{0},\text{diam}(\Theta))D_{1}(G,G_{0}) for any Gk(Θ)G\in\mathcal{E}_{k}(\Theta), where constant C(G0,diam(Θ))>0C(G_{0},\text{diam}(\Theta))>0 depends on G0G_{0} and diam(Θ)\text{diam}(\Theta).

We see that W1W_{1} and D1D_{1} generate the same topology on k(Θ)\mathcal{E}_{k}(\Theta), and they are equivalent when one argument is held fixed. The benefit of WpW_{p} is that it is defined on k=1k(Θ)\bigcup_{k=1}^{\infty}\mathcal{E}_{k}(\Theta), while DND_{N} is only defined on k(Θ)\mathcal{E}_{k}(\Theta) for each kk, since its definition requires the two arguments to have the same number of atoms. DND_{N} allows us to quantify the distinct convergence behaviors of atoms and probability masses by placing different factors on the atom and probability mass parameters, while WpW_{p} on k=1k(Θ)\bigcup_{k=1}^{\infty}\mathcal{E}_{k}(\Theta) would fail to do so, because WpW_{p} couples the atoms and probability mass parameters (see Example A.1 for such an attempt).

The factor N\sqrt{N} present in the definition of DND_{N} arises from the anticipation that when we have independent exchangeable sequences of length NN, the dependence of the standard estimation rate on NN for component parameters θ\theta will be of order 1/N1/\sqrt{N}. Indeed, given a single exchangeable sequence from some component parameter θi\theta_{i}, as the coordinates in this sequence are conditionally independent and identically distributed, the standard rate for estimating θi\theta_{i} is 1/N1/\sqrt{N}. On the other hand, the mixing proportion parameters pip_{i} cannot be estimated from a single such sequence (i.e., if m=1m=1). One expects that for such parameters the number of sequences generated from component θi\theta_{i} among all exchangeable sequences plays a more important role. In summary, the distance DND_{N} will be used to capture precisely the distinct convergence behavior due to the length NN of observed exchangeable sequences.
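The 1/√N heuristic above can be checked numerically. The sketch below, assuming an N(θ, 1) kernel purely for illustration, estimates θ from a single length-N sequence by the sample mean and shows the root-mean-squared error scaling like 1/√N.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch (assuming a N(theta, 1) kernel): estimate the component parameter
# theta from one exchangeable sequence of length N by the sample mean,
# averaging the squared error over many independent sequences.
theta, reps = 1.5, 4000

def rmse(N):
    est = rng.normal(theta, 1.0, size=(reps, N)).mean(axis=1)
    return np.sqrt(np.mean((est - theta) ** 2))

# Quadrupling N should roughly halve the error, consistent with the
# 1/sqrt(N) rate that motivates the sqrt(N) factor in D_N.
print(rmse(100), rmse(400))
```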

4 First-order identifiability theory

When N=1{N}=1, a finite mixture of N{N}-product distributions reduces to a standard finite mixture of distributions. Mixture components are modeled by a family of probability kernels {Pθ}θΘ\{P_{\theta}\}_{\theta\in\Theta} on 𝔛\mathfrak{X}, where θ\theta is the parameter of the family and Θq\Theta\subset\mathbb{R}^{q} is the parameter space. As discussed in the introduction, throughout the paper we assume that the map θPθ\theta\mapsto P_{\theta} is injective; it is the nature of the map GPGG\mapsto P_{G} that we are after. Within this section, we further assume that {Pθ}θΘ\{P_{\theta}\}_{\theta\in\Theta} has densities {f(x|θ)}θΘ\{f(x|\theta)\}_{\theta\in\Theta} w.r.t. a dominating measure μ\mu on (𝔛,𝒜)(\mathfrak{X},\mathcal{A}). Combining multiple mixture components using a mixing measure GG on Θ\Theta results in the finite mixture distribution, which admits the following density with respect to μ\mu: pG(x)=f(x|θ)G(dθ)p_{G}(x)=\int f(x|\theta)G(d\theta). The goal of this section is to provide a concise and self-contained treatment of identifiability for finite mixture models. We lay down basic foundations and present new results that will prove useful for the general theory of mixtures of product distributions developed in the subsequent sections.

4.1 Basic theory

The classical identifiability condition posits that PGP_{G} uniquely identifies GG for all Gk0(Θ)G\in\mathcal{E}_{k_{0}}(\Theta). This condition is satisfied if the collection of density functions {f(x|θ)}θΘ\{f(x|\theta)\}_{\theta\in\Theta} is linearly independent. To obtain rates of convergence for the model parameters, it is natural to consider the following condition concerning the first-order derivative of ff with respect to θ\theta.

Definition 4.1.

The family {f(x|θ)}θΘ\{f(x|\theta)\}_{\theta\in\Theta} is ({θi}i=1k,𝒩)(\{\theta_{i}\}_{i=1}^{k},\mathcal{N}) first-order identifiable if

  1. (i)

    for every xx in the μ\mu-positive subset 𝔛\𝒩\mathfrak{X}\backslash\mathcal{N} where 𝒩𝒜\mathcal{N}\in\mathcal{A}, f(x|θ)f(x|\theta) is first-order differentiable w.r.t. θ\theta at {θi}i=1k\{\theta_{i}\}_{i=1}^{k}; and

  2. (ii)

    {θi}i=1kΘ\{\theta_{i}\}_{i=1}^{k}\subset\Theta^{\circ} is a set of kk distinct elements and the system of two equations with variable (a1,b1,,ak,bk)(a_{1},b_{1},\ldots,a_{k},b_{k}):

    i=1k(aiθf(x|θi)+bif(x|θi))\displaystyle\sum_{i=1}^{k}\left(a_{i}^{\top}\nabla_{\theta}f(x|\theta_{i})+b_{i}f(x|\theta_{i})\right) =0,μa.e.x𝔛\𝒩,\displaystyle=0,\quad\mu-a.e.\ x\in\mathfrak{X}\backslash\mathcal{N}, (9a)
    i=1kbi\displaystyle\sum_{i=1}^{k}b_{i} =0\displaystyle=0 (9b)

    has only the zero solution: bi=0 and ai=𝟎q,1ikb_{i}=0\in\mathbb{R}\text{ and }a_{i}=\bm{0}\in\mathbb{R}^{q},\quad\forall 1\leq i\leq k.

This definition specifies a condition weaker than the definition of identifiability in the first order in [26], since it only requires f(x|θ)f(x|\theta) to be differentiable at the finite number of points {θi}i=1k\{\theta_{i}\}_{i=1}^{k} and only requires linear independence of the functions at those points. Moreover, it does not require f(x|θ)f(x|\theta), as a function of θ\theta, to be differentiable for μ\mu-a.e. xx. Our definition requires only linear independence between the density and its derivative w.r.t. the parameter under the constraint on the coefficients specified by (9b). (Having said that, we are not aware of any simple example that differentiates the system (9a), (9b) from (9a) alone. In fact, it is established in Lemma 4.13 b) that, under some regularity condition, the system (9a), (9b) has the same solution set as (9a).) We will see shortly that, in a precise sense, the conditions given in Definition 4.1 are also necessary.

The significance of first-order identifiability conditions is that they entail a collection of inverse bounds relating the behavior of some form of distance between the mixture distributions PG,PG0P_{G},P_{G_{0}} to the distance between the corresponding parameters described by D1(G,G0)D_{1}(G,G_{0}), as GG tends toward G0G_{0}. Denote by Θ\Theta^{\circ} the interior of Θ\Theta. For any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}), define

BW1(G0,r)={Gk=1k0k(Θ)|W1(G,G0)<r}.B_{W_{1}}(G_{0},r)=\biggr{\{}G\in\bigcup_{k=1}^{k_{0}}\mathcal{E}_{k}(\Theta)\biggr{|}W_{1}(G,G_{0})<r\biggr{\}}. (10)

It is obvious that BW1(G0,r)k0(Θ)B_{W_{1}}(G_{0},r)\subset\mathcal{E}_{k_{0}}(\Theta) for small rr.

Lemma 4.2 (Consequence of first-order identifiability).

Let G0=i=1k0pi0δθi0k0(Θ)G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). Suppose that the family {f(x|θ)}θΘ\{f(x|\theta)\}_{\theta\in\Theta} is ({θi0}i=1k0,𝒩)(\{\theta_{i}^{0}\}_{i=1}^{k_{0}},\mathcal{N}) first-order identifiable in the sense of Definition 4.1 for some 𝒩𝒜\mathcal{N}\in\mathcal{A}.

  1. a)

    Then

    lim infGW1G0Gk0(Θ)V(PG,PG0)D1(G,G0)>0.\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G},P_{G_{0}})}{D_{1}(G,G_{0})}>0. (11)
  2. b)

If, in addition, for every xx in 𝔛\𝒩\mathfrak{X}\backslash\mathcal{N}, f(x|θ)f(x|\theta) is continuously differentiable w.r.t. θ\theta in a neighborhood of θi0\theta_{i}^{0} for each i[k0]:={1,2,,k0}i\in[k_{0}]:=\{1,2,\ldots,k_{0}\}, then

    limr0infG,HBW1(G0,r)GHV(PG,PH)D1(G,H)>0.\lim_{r\to 0}\ \inf_{\begin{subarray}{c}G,H\in B_{W_{1}}(G_{0},r)\\ G\not=H\end{subarray}}\frac{V(P_{G},P_{H})}{D_{1}(G,H)}>0. (12)

To put the above claims in context, note that the following inequality holds generally for any probability kernel family {Pθ}θΘ\{P_{\theta}\}_{\theta\in\Theta} (even those without a density w.r.t. a dominating measure, see Lemma 8.1):

supG0k0(Θ)lim infGW1G0Gk0(Θ)V(PG,PG0)D1(G,G0)1/2.\sup_{G_{0}\in\mathcal{E}_{k_{0}}(\Theta)}\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G},P_{G_{0}})}{D_{1}(G,G_{0})}\leq 1/2. (13)

Note also that

limr0infG,HBW1(G0,r)GHV(PG,PH)D1(G,H)lim infGW1G0Gk0(Θ)V(PG,PG0)D1(G,G0)\lim_{r\to 0}\ \inf_{\begin{subarray}{c}G,H\in B_{W_{1}}(G_{0},r)\\ G\not=H\end{subarray}}\frac{V(P_{G},P_{H})}{D_{1}(G,H)}\leq\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G},P_{G_{0}})}{D_{1}(G,G_{0})} (14)

for any probability kernel PθP_{\theta} and any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). Thus (12) entails (11). However, (11) alone is already sufficient for translating a learning rate for estimating the population distribution PGP_{G} into one for the corresponding mixing measure GG. To be concrete, if we are given an m{m}-i.i.d. sample from a parametric model PG0P_{G_{0}}, a standard estimation method yields a root-m{m} rate of convergence for the density pGp_{G}, which means that the corresponding estimate of GG admits a root-m{m} rate as well.

Remark 4.3.

Lemma 4.2 a) generalizes Theorem 3.1 of [26] in several respects. Firstly, the ({θi0}i=1k0,𝒩)(\{\theta_{i}^{0}\}_{i=1}^{k_{0}},\mathcal{N}) first-order identifiability assumption in Lemma 4.2 is weaker, since identifiability in the first order in the sense of [26] implies ({θi0}i=1k0,𝒩)(\{\theta_{i}^{0}\}_{i=1}^{k_{0}},\mathcal{N}) first-order identifiability with 𝒩=\mathcal{N}=\emptyset. Example B.1 gives a specific instance which satisfies the notion of first-order identifiability specified by Definition 4.1 but not the condition specified by [26]. Secondly, it turns out that the uniform Lipschitz assumption in Theorem 3.1 of [26] is redundant, and Lemma 4.2 a) does not require it. Lemma 4.2 b) extends [22, equation (20)] in a similar sense. Finally, given additional features of ff, the first-order identifiability notion can be further simplified (see Section 4.2). \Diamond

Proof of Lemma 4.2.

Suppose that (11) does not hold. Then there exists a sequence Gk0(Θ)\{G0}G_{\ell}\in\mathcal{E}_{k_{0}}(\Theta)\backslash\{G_{0}\} with GW1G0G_{\ell}\overset{W_{1}}{\to}G_{0} such that

V(pG,pG0)D1(G,G0)0, as .\frac{V(p_{G_{\ell}},p_{G_{0}})}{D_{1}(G_{\ell},G_{0})}\to 0,\text{ as }\ell\to\infty.

We may write G=i=1k0piδθiG_{\ell}=\sum_{i=1}^{k_{0}}p^{\ell}_{i}\delta_{\theta_{i}^{\ell}} such that θiθi0\theta_{i}^{\ell}\to\theta_{i}^{0} and pipi0p_{i}^{\ell}\to p_{i}^{0} as \ell\to\infty. By passing to a subsequence if necessary, we may further require

θiθi0D1(G,G0)aiq,pipi0D1(G,G0)bi,1ik0,\frac{\theta_{i}^{\ell}-\theta_{i}^{0}}{D_{1}(G_{\ell},G_{0})}\to a_{i}\in\mathbb{R}^{q},\quad\frac{p_{i}^{\ell}-p_{i}^{0}}{D_{1}(G_{\ell},G_{0})}\to b_{i}\in\mathbb{R},\quad\forall 1\leq i\leq k_{0}, (15)

where bib_{i} and the components of aia_{i} are in [1,1][-1,1] and i=1k0bi=0\sum_{i=1}^{k_{0}}b_{i}=0. Moreover, D1(G,G0)=i=1k0(θiθi02+|pipi0|)D_{1}(G_{\ell},G_{0})=\sum_{i=1}^{k_{0}}\left(\|\theta^{\ell}_{i}-\theta_{i}^{0}\|_{2}+|p^{\ell}_{i}-p_{i}^{0}|\right) for sufficiently large \ell, which implies

i=1k0ai2+i=1k0|bi|=1.\sum_{i=1}^{k_{0}}\|a_{i}\|_{2}+\sum_{i=1}^{k_{0}}|b_{i}|=1.

It also follows that at least one of aia_{i} is not 𝟎q\bm{0}\in\mathbb{R}^{q} or one of bib_{i} is not 0. On the other hand,

0\displaystyle 0 =lim2V(PG,PG0)D1(G,G0)\displaystyle=\lim_{\ell\to\infty}\frac{2V(P_{G_{\ell}},P_{G_{0}})}{D_{1}(G_{\ell},G_{0})}
lim𝔛\𝒩|i=1k0pif(x|θi)f(x|θi0)D1(G,G0)+i=1k0f(x|θi0)pipi0D1(G,G0)|μ(dx)\displaystyle\geq\lim_{\ell\to\infty}\int_{\mathfrak{X}\backslash\mathcal{N}}\left|\sum_{i=1}^{k_{0}}p_{i}^{\ell}\frac{f(x|\theta_{i}^{\ell})-f(x|\theta_{i}^{0})}{D_{1}(G_{\ell},G_{0})}+\sum_{i=1}^{k_{0}}f(x|\theta_{i}^{0})\frac{p_{i}^{\ell}-p_{i}^{0}}{D_{1}(G_{\ell},G_{0})}\right|\mu(dx)
𝔛\𝒩lim inf|i=1k0pif(x|θi)f(x|θi0)D1(G,G0)+i=1k0f(x|θi0)pipi0D1(G,G0)|μ(dx)\displaystyle\geq\int_{\mathfrak{X}\backslash\mathcal{N}}\liminf_{\ell\to\infty}\left|\sum_{i=1}^{k_{0}}p_{i}^{\ell}\frac{f(x|\theta_{i}^{\ell})-f(x|\theta_{i}^{0})}{D_{1}(G_{\ell},G_{0})}+\sum_{i=1}^{k_{0}}f(x|\theta_{i}^{0})\frac{p_{i}^{\ell}-p_{i}^{0}}{D_{1}(G_{\ell},G_{0})}\right|\mu(dx)
=𝔛\𝒩|i=1k0pi0aiθf(x|θi0)+i=1k0f(x|θi0)bi|μ(dx).\displaystyle=\int_{\mathfrak{X}\backslash\mathcal{N}}\left|\sum_{i=1}^{k_{0}}p_{i}^{0}a_{i}^{\top}\nabla_{\theta}f(x|\theta_{i}^{0})+\sum_{i=1}^{k_{0}}f(x|\theta_{i}^{0})b_{i}\right|\mu(dx).

where the second inequality follows from Fatou’s lemma. Then i=1k0pi0aiθf(x|θi0)+i=1k0f(x|θi0)bi=0\sum_{i=1}^{k_{0}}p_{i}^{0}a_{i}^{\top}\nabla_{\theta}f(x|\theta_{i}^{0})+\sum_{i=1}^{k_{0}}f(x|\theta_{i}^{0})b_{i}=0 for μa.e.x𝔛\𝒩\mu-a.e.\ x\in\mathfrak{X}\backslash\mathcal{N}. Thus we have found a nonzero solution to (9a), (9b) with k,θik,\theta_{i} replaced by k0,θi0k_{0},\theta_{i}^{0}, which contradicts the definition of ({θi0}i=1k0,𝒩)(\{\theta_{i}^{0}\}_{i=1}^{k_{0}},\mathcal{N}) first-order identifiability.

Proof of part b) continues in the Appendix. ∎

Lemma 4.2 states that, under (i) in Definition 4.1, the constrained linear independence between the density and its derivative w.r.t. the parameter (item (ii) in the definition) is sufficient for (11) and (12). For a converse result, the next lemma shows that (ii) is also necessary, provided that (i) holds for some μ\mu-negligible 𝒩\mathcal{N} and f(x|θ)f(x|\theta) satisfies a certain regularity condition.

Lemma 4.4 (Lack of first-order identifiability).

Fix G0=i=1k0pi0δθi0k0(Θ)G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). Suppose

  1. a)

    there exists 𝒩\mathcal{N} (that possibly depends on G0G_{0}) such that μ(𝒩)=0\mu(\mathcal{N})=0 and for every x𝒩x\not\in\mathcal{N}, f(x|θ)f(x|\theta) is differentiable with respect to θ\theta at {θi0}i=1k0\{\theta_{i}^{0}\}_{i=1}^{k_{0}};

  2. b)

    equation (9a) (or equivalently, system of equations (9a) and (9b)) with k,θik,\theta_{i} replaced respectively by k0,θi0k_{0},\theta_{i}^{0} has a nonzero solution (a1,b1,,ak0,bk0)(a_{1},b_{1},\ldots,a_{k_{0}},b_{k_{0}});

  3. c)

    for each i[k0]i\in[k_{0}], there exists γ(θi0,ai)\gamma(\theta_{i}^{0},a_{i}) such that for any 0<Δγ(θi0,ai)0<\Delta\leq\gamma(\theta_{i}^{0},a_{i})

    |f(x|θi0+aiΔ)f(x|θi0)Δ|f¯(x|θi0,ai),μa.e.x𝔛,\left|\frac{f(x|\theta_{i}^{0}+a_{i}\Delta)-f(x|\theta_{i}^{0})}{\Delta}\right|\leq\bar{f}(x|\theta_{i}^{0},a_{i}),\quad\mu-a.e.\ x\in\mathfrak{X},

    where f¯(x|θi0,ai)\bar{f}(x|\theta_{i}^{0},a_{i}) is integrable with respect to the measure μ\mu.

Then

limr0infG,HBW1(G0,r)GHV(PG,PH)D1(G,H)=lim infGW1G0Gk0(Θ)V(PG,PG0)D1(G,G0)=0.\lim_{r\to 0}\ \inf_{\begin{subarray}{c}G,H\in B_{W_{1}}(G_{0},r)\\ G\not=H\end{subarray}}\frac{V(P_{G},P_{H})}{D_{1}(G,H)}=\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G},P_{G_{0}})}{D_{1}(G,G_{0})}=0. (16)

Lemma 4.4 presents the consequence of the violation of first-order identifiability. Indeed, the conclusion (16) suggests that D1(G,G0)D_{1}(G,G_{0}) may vanish at a much slower rate than V(PG,PG0)V(P_{G},P_{G_{0}}), i.e., the convergence of parameters representing GG may be much slower than the convergence of data distribution PGP_{G}.

Remark 4.5.

Condition c) in Lemma 4.4 guarantees the exchange of the order of the limit and the integral, and one may replace it by any other condition serving the same purpose. A byproduct of this condition is that it renders the constraint (9b) redundant (see Lemma 4.13 b)). While condition c) is tailored for an application of the dominated convergence theorem in the proof, one may instead tailor the following condition to Pratt’s lemma.

Condition c’): there exists γ0>0\gamma_{0}>0 such that 1ik0\forall\ 1\leq i\leq k_{0}, 0<Δ<γ0\forall\ 0<\Delta<\gamma_{0},

|f(x|θi0+aiΔ)f(x|θi0)Δ|f¯Δ(x),μa.e.x𝔛\𝒩\left|\frac{f(x|\theta_{i}^{0}+a_{i}\Delta)-f(x|\theta_{i}^{0})}{\Delta}\right|\leq\bar{f}_{\Delta}(x),\quad\mu-a.e.\ x\in\mathfrak{X}\backslash\mathcal{N}

where f¯Δ(x)\bar{f}_{\Delta}(x) satisfies limΔ0+𝔛\𝒩f¯Δ(x)𝑑μ=𝔛\𝒩limΔ0+f¯Δ(x)dμ\lim_{\Delta\to 0^{+}}\int_{\mathfrak{X}\backslash\mathcal{N}}\bar{f}_{\Delta}(x)d\mu=\int_{\mathfrak{X}\backslash\mathcal{N}}\lim_{\Delta\to 0^{+}}\bar{f}_{\Delta}(x)d\mu.

Condition c’) is weaker than condition c), since the latter is the special case of the former obtained by taking f¯Δ(x)=f¯(x)<\bar{f}_{\Delta}(x)=\bar{f}(x)<\infty independently of Δ\Delta. \Diamond

Combining all the conditions in Lemma 4.2 and Lemma 4.4, one immediately obtains the following equivalence between (11), (12) and the first-order identifiability condition.

Corollary 4.6.

Fix G0=i=1k0pi0δθi0k0(Θ)G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). Suppose that for μ\mu-a.e. x𝔛x\in\mathfrak{X}, f(x|θ)f(x|\theta), as a function of θ\theta, is continuously differentiable in a neighborhood of θi0\theta_{i}^{0} for each i[k0]i\in[k_{0}]. Suppose that for any aqa\in\mathbb{R}^{q} and each i[k0]i\in[k_{0}] there exists γ(θi0,a)>0\gamma(\theta_{i}^{0},a)>0 such that for any 0<Δγ(θi0,a)0<\Delta\leq\gamma(\theta_{i}^{0},a),

|f(x|θi0+aΔ)f(x|θi0)Δ|f¯Δ(x|θi0,a)μa.e.𝔛\left|\frac{f(x|\theta_{i}^{0}+a\Delta)-f(x|\theta_{i}^{0})}{\Delta}\right|\leq\bar{f}_{\Delta}(x|\theta_{i}^{0},a)\quad\mu-a.e.\ \mathfrak{X} (17)

where f¯Δ(x|θi0,a)\bar{f}_{\Delta}(x|\theta_{i}^{0},a) satisfies limΔ0+𝔛f¯Δ(x|θi0,a)𝑑μ=𝔛limΔ0+f¯Δ(x|θi0,a)dμ\lim_{\Delta\to 0^{+}}\int_{\mathfrak{X}}\bar{f}_{\Delta}(x|\theta_{i}^{0},a)d\mu=\int_{\mathfrak{X}}\lim_{\Delta\to 0^{+}}\bar{f}_{\Delta}(x|\theta_{i}^{0},a)d\mu. Here f¯Δ(x|θi0,a)\bar{f}_{\Delta}(x|\theta_{i}^{0},a) possibly depends on θi0\theta_{i}^{0} and aa. Then the following statements are equivalent: (12) holds; (11) holds; (9a), with k,θik,\theta_{i} replaced respectively by k0,θi0k_{0},\theta_{i}^{0}, has only the zero solution.

Next, we highlight the role of condition c) of Lemma 4.4 in establishing either inverse bound (11) or (16) based on our notion of first-order identifiability. As mentioned, condition c) posits the existence of an integrable envelope function to ensure the exchange of the limit and integral. Without this condition, the conclusion (16) of Lemma 4.4 might not hold. The following two examples demonstrate the role of c), and provide kernel families that are not first-order identifiable but for which inverse bound (11) still holds.

Example 4.7 (Uniform probability kernel).

Consider the uniform distribution family f(x|θ)=1θ𝟏(0,θ)(x)f(x|\theta)=\frac{1}{\theta}\bm{1}_{(0,\theta)}(x) with parameter space Θ=(0,)\Theta=(0,\infty). This family is defined on 𝔛=\mathfrak{X}=\mathbb{R} with the dominating measure μ\mu taken to be the Lebesgue measure. It is easy to see that f(x|θ)f(x|\theta) is differentiable w.r.t. θ\theta for θx\theta\not=x and

θf(x|θ)=1θf(x|θ)when θx.\frac{\partial}{\partial\theta}f(x|\theta)=-\frac{1}{\theta}f(x|\theta)\quad\text{when }\theta\not=x.

So f(x|θ)f(x|\theta) is not first-order identifiable by our definition. Note that for any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta) this family does not satisfy assumption c) in Lemma 4.4, and hence Lemma 4.4 is not applicable. Indeed, by Lemma 4.8 this family satisfies (11) and (12) for any k0k_{0} and G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta). \Diamond
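The failure of the envelope condition for the uniform kernel can be seen numerically: the integral of the difference quotient converges to 2/θ, while the integral of the pointwise derivative is only 1/θ, so no integrable envelope independent of Δ can exist. The sketch below (helper names are ours; the Riemann-sum integration is an illustration, not part of the paper's argument) checks this for θ = 1:

```python
import numpy as np

def f(x, theta):
    # density of Uniform(0, theta): (1/theta) * 1_(0,theta)(x)
    return np.where((x > 0) & (x < theta), 1.0 / theta, 0.0)

def dq_integral(theta0, delta, n=2_000_000):
    # midpoint-rule approximation of  int |f(x|theta0+delta) - f(x|theta0)| / delta dx
    grid = np.linspace(0.0, theta0 + delta, n + 1)
    x = 0.5 * (grid[1:] + grid[:-1])
    dx = (theta0 + delta) / n
    quot = np.abs(f(x, theta0 + delta) - f(x, theta0)) / delta
    return np.sum(quot) * dx

theta0 = 1.0
# integral of the pointwise limit |d/dtheta f(x|theta0)| = 1/theta0^2 on (0, theta0):
print(1.0 / theta0)
for delta in [1e-1, 1e-2, 1e-3]:
    # closed form is 2/(theta0 + delta), approaching 2/theta0 = 2, not 1
    print(delta, dq_integral(theta0, delta))
```

Any integrable envelope f̄ independent of Δ would force these two limits to agree by dominated convergence, which is exactly what fails here.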

Lemma 4.8.

Let f(x|θ)f(x|\theta) be the uniform distribution family defined in Example 4.7. Then for any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta), inverse bounds (11) and (12) hold.

Example 4.9 (Location-scale exponential distribution kernel).

Consider the location-scale exponential distribution on 𝔛=\mathfrak{X}=\mathbb{R}, with density with respect to Lebesgue measure μ\mu given by f(x|ξ,σ)=1σexp(xξσ)𝟏(ξ,)(x)f(x|\xi,\sigma)=\frac{1}{\sigma}\exp\left(-\frac{x-\xi}{\sigma}\right)\bm{1}_{(\xi,\infty)}(x) with parameter θ=(ξ,σ)\theta=(\xi,\sigma) and parameter space Θ=×(0,)\Theta=\mathbb{R}\times(0,\infty). It is easy to see that f(x|ξ,σ)f(x|\xi,\sigma) is differentiable w.r.t. ξ\xi for ξx\xi\not=x and

ξf(x|ξ,σ)=1σf(x|ξ,σ)when ξx.\frac{\partial}{\partial\xi}f(x|\xi,\sigma)=\frac{1}{\sigma}f(x|\xi,\sigma)\quad\text{when }\xi\not=x.

So f(x|ξ,σ)f(x|\xi,\sigma) is not first-order identifiable. Note that for any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta) this family does not satisfy condition c) in Lemma 4.4, and hence Lemma 4.4 is not applicable. Indeed, by Lemma 4.10 this family satisfies (11) for any k0k_{0} and G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta). This lemma also serves as a correction of an erroneous result (Prop. 5.3 of [25]). The mistake in their proof may be attributed to failing to account for the envelope condition c), which arises due to the shifted supports of mixture components with distinct ξ\xi values. Interestingly, Lemma 4.10 also establishes that the stronger version of the inverse bounds, namely inequality (12), does not hold for some G0G_{0}. \Diamond
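To make the failure of the envelope condition concrete, consider perturbing the location parameter, ξ → ξ + Δ with σ fixed. A short computation (ours, not part of the original argument) splits the integral of the difference quotient at the shifted support, using f(x|ξ+Δ,σ) = 0 on (ξ, ξ+Δ) and f(x|ξ+Δ,σ) = e^{Δ/σ} f(x|ξ,σ) on (ξ+Δ, ∞):

```latex
\begin{align*}
\int_{\mathbb{R}} \left|\frac{f(x|\xi+\Delta,\sigma)-f(x|\xi,\sigma)}{\Delta}\right| dx
&= \frac{1}{\Delta}\int_{\xi}^{\xi+\Delta} f(x|\xi,\sigma)\,dx
 + \frac{e^{\Delta/\sigma}-1}{\Delta}\int_{\xi+\Delta}^{\infty} f(x|\xi,\sigma)\,dx \\
&= \frac{1-e^{-\Delta/\sigma}}{\Delta} + \frac{(e^{\Delta/\sigma}-1)e^{-\Delta/\sigma}}{\Delta}
 = \frac{2\,(1-e^{-\Delta/\sigma})}{\Delta}
 \;\longrightarrow\; \frac{2}{\sigma} \quad (\Delta\to 0^{+}).
\end{align*}
```

Meanwhile the integral of the pointwise derivative is $\int_{\mathbb{R}}|\frac{\partial}{\partial\xi}f(x|\xi,\sigma)|\,dx=\frac{1}{\sigma}\int_{\mathbb{R}}f(x|\xi,\sigma)\,dx=\frac{1}{\sigma}$. If an integrable envelope independent of $\Delta$ existed, dominated convergence would force these two quantities to agree; hence condition c) cannot hold.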

Lemma 4.10.

Let f(x|ξ,σ)f(x|\xi,\sigma) be the location-scale exponential distribution defined in Example 4.9. Then for any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta), inverse bound (11) holds. Moreover, for any k01k_{0}\geq 1, there exists a G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta), such that inverse bound (12) does not hold.

In some contexts it is of interest to establish inverse bounds for the Hellinger distance rather than the variational distance on mixture densities, e.g., in the derivation of minimax lower bounds. Since 2hV\sqrt{2}h\geq V, the inverse bound (11), which holds under first-order identifiability, immediately entails that

lim infGW1G0Gk0(Θ)h(PG,PG0)D1(G,G0)>0.\liminf\limits_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{h(P_{G},P_{G_{0}})}{D_{1}(G,G_{0})}>0.

Similarly, (12) entails that

limr0infG,HBW1(G0,r)GHh(PG,PH)D1(G,H)>0.\lim\limits_{r\to 0}\ \inf\limits_{\begin{subarray}{c}G,H\in B_{W_{1}}(G_{0},r)\\ G\not=H\end{subarray}}\frac{h(P_{G},P_{H})}{D_{1}(G,H)}>0.
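As a quick numerical sanity check of the inequality √2 h ≥ V used in passing between the two metrics, the following sketch computes both distances by Riemann sums for two illustrative two-component Gaussian mixture densities (parameter values are arbitrary, not taken from the paper):

```python
import numpy as np

def norm_pdf(x, mu, sigma):
    # standard Gaussian density with mean mu and standard deviation sigma
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# two mixture densities p_G and p_{G0} on a fine grid (tails beyond [-15, 15] are negligible)
x = np.linspace(-15.0, 15.0, 400_001)
dx = x[1] - x[0]
p = 0.5 * norm_pdf(x, -1.0, 1.0) + 0.5 * norm_pdf(x, 1.0, 1.0)
q = 0.4 * norm_pdf(x, -1.2, 1.1) + 0.6 * norm_pdf(x, 0.8, 0.9)

V = 0.5 * np.sum(np.abs(p - q)) * dx                             # variational distance
h = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx)   # Hellinger distance
print(V, h, np.sqrt(2) * h >= V)
```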

For a converse result, the following is the Hellinger counterpart of Lemma 4.4.

Lemma 4.11.

Fix G0=i=1k0pi0δθi0k0(Θ)G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). Suppose that

  1. a)

    there exists 𝒩\mathcal{N} (that possibly depends on G0G_{0}) such that μ(𝒩)=0\mu(\mathcal{N})=0 and for every x𝒩x\not\in\mathcal{N}, f(x|θ)f(x|\theta) is differentiable with respect to θ\theta at {θi0}i=1k0\{\theta_{i}^{0}\}_{i=1}^{k_{0}};

  2. b)

    the density family has common support, i.e. S={x𝔛|f(x|θ)>0}S=\{x\in\mathfrak{X}|f(x|\theta)>0\} does not depend on θΘ\theta\in\Theta;

  3. c)

    (9a) with k,θik,\theta_{i} replaced respectively by k0,θi0k_{0},\theta_{i}^{0} has a nonzero solution (a1,b1,,ak0,bk0)(a_{1},b_{1},\ldots,a_{k_{0}},b_{k_{0}});

  4. d)

    there exists γ0>0\gamma_{0}>0 such that 1ik0\forall\ 1\leq i\leq k_{0}, 0<Δγ0\forall\ 0<\Delta\leq\gamma_{0},

    |f(x|θi0+aiΔ)f(x|θi0)Δf(x|θi0)|f¯(x),μa.e.xS\𝒩,\left|\frac{f(x|\theta_{i}^{0}+a_{i}\Delta)-f(x|\theta_{i}^{0})}{\Delta\sqrt{f(x|\theta_{i}^{0})}}\right|\leq\bar{f}(x),\quad\mu-a.e.\ x\in S\backslash\mathcal{N},

    where f¯(x)\bar{f}(x) satisfies S\𝒩f¯2(x)𝑑μ<\int_{S\backslash\mathcal{N}}\bar{f}^{2}(x)d\mu<\infty.

Then

limr0infG,HBW1(G0,r)GHh(PG,PH)D1(G,H)=lim infGW1G0Gk0(Θ)h(PG,PG0)D1(G,G0)=0\lim_{r\to 0}\ \inf_{\begin{subarray}{c}G,H\in B_{W_{1}}(G_{0},r)\\ G\not=H\end{subarray}}\frac{h(P_{G},P_{H})}{D_{1}(G,H)}=\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{h(P_{G},P_{G_{0}})}{D_{1}(G,G_{0})}=0 (18)
Remark 4.12.

Similar to Remark 4.5, one may replace the condition d) in the preceding lemma by the following weaker condition:

Condition d’): there exist γ0>0\gamma_{0}>0 such that 1ik0\forall\ 1\leq i\leq k_{0}, 0<Δγ0\forall\ 0<\Delta\leq\gamma_{0},

|f(x|θi0+aiΔ)f(x|θi0)Δf(x|θi0)|f¯Δ(x),μa.e.xS\𝒩,\left|\frac{f(x|\theta_{i}^{0}+a_{i}\Delta)-f(x|\theta_{i}^{0})}{\Delta\sqrt{f(x|\theta_{i}^{0})}}\right|\leq\bar{f}_{\Delta}(x),\quad\mu-a.e.\ x\in S\backslash\mathcal{N},

where f¯Δ(x)\bar{f}_{\Delta}(x) satisfies limΔ0+S\𝒩f¯Δ2(x)𝑑μ=S\𝒩limΔ0+f¯Δ2(x)dμ<\lim_{\Delta\to 0^{+}}\int_{S\backslash\mathcal{N}}\bar{f}_{\Delta}^{2}(x)d\mu=\int_{S\backslash\mathcal{N}}\lim_{\Delta\to 0^{+}}\bar{f}_{\Delta}^{2}(x)d\mu<\infty. \Diamond

4.2 Finer characterizations

In order to verify whether the first-order identifiability condition is satisfied for a given probability kernel family {f(x|θ)|θΘ}\{f(x|\theta)|\theta\in\Theta\}, according to Definition 4.1 one needs to check that the system of equations (9a) and (9b) has no non-zero solutions. For many common probability kernel families, the presence of a normalizing constant can make this verification challenging, because the normalizing constant is a function of θ\theta that may have a complicated form or no closed form, and its derivative can be complicated as well. Fortunately, the following lemma shows that under a mild condition one only needs to perform the check for the kernel family {f(x|θ)}\{f(x|\theta)\} defined up to a multiplicative function of θ\theta that is constant in xx. Moreover, under additional mild assumptions, the equation (9b) can also be dropped from the verification.

Lemma 4.13.

Suppose for every xx in the μ\mu-positive subset 𝔛\𝒩\mathfrak{X}\backslash\mathcal{N} for some 𝒩𝒜\mathcal{N}\in\mathcal{A}, f(x|θ)f(x|\theta) is differentiable with respect to θ\theta at {θi}i=1k\{\theta_{i}\}_{i=1}^{k}. Let g(θ)g(\theta) be a positive differentiable function on Θ\Theta^{\circ} and define f~(x|θ)=g(θ)f(x|θ)\tilde{f}(x|\theta)=g(\theta)f(x|\theta).

  1. a)

    (9a) has only the zero solution if and only if (9a) with ff replaced by f~\tilde{f} has only the zero solution.

  2. b)

Suppose μ(𝒩)=0\mu(\mathcal{N})=0, and that for a fixed set {ai}i=1kq\{a_{i}\}_{i=1}^{k}\subset\mathbb{R}^{q} and for each i[k]i\in[k] there exists γ(θi,ai)>0\gamma(\theta_{i},a_{i})>0 such that for any 0<Δγ(θi,ai)0<\Delta\leq\gamma(\theta_{i},a_{i}),

    |f(x|θi+aiΔ)f(x|θi)Δ|f¯(x|θi,ai),μa.e.𝔛,\left|\frac{f(x|\theta_{i}+a_{i}\Delta)-f(x|\theta_{i})}{\Delta}\right|\leq\bar{f}(x|\theta_{i},a_{i}),\quad\mu-a.e.\ \mathfrak{X}, (19)

    where f¯(x|θi,ai)\bar{f}(x|\theta_{i},a_{i}) is μ\mu-integrable. Here γ(θi,ai)\gamma(\theta_{i},a_{i}) and f¯(x|θi,ai)\bar{f}(x|\theta_{i},a_{i}) depend on θi\theta_{i} and aia_{i}. Then (a1,b1,,ak,bk)(a_{1},b_{1},\ldots,a_{k},b_{k}) is a solution of (9a) if and only if it is a solution of the system of equations (9a), (9b). Moreover, (19) holds for some μ\mu-integrable f¯\bar{f} if and only if the same inequality with ff on the left side replaced by f~\tilde{f} holds for some μ\mu-integrable f¯1\bar{f}_{1}.

  3. c)

    Suppose the conditions in b) (for ff or f~\tilde{f}) hold for any set {ai}i=1k\{a_{i}\}_{i=1}^{k}. Then (9a) has the same solutions as the system of equations (9a), (9b). Hence, the family {f(x|θ)}θΘ\{f(x|\theta)\}_{\theta\in\Theta} is ({θi}i=1k,𝒩)(\{\theta_{i}\}_{i=1}^{k},\mathcal{N}) first-order identifiable if and only if (9a) with ff replaced by f~\tilde{f} has only the zero solution.

Note that an extension similar to that in Remark 4.5 can be made in Lemma 4.13 b) and c).

Remark 4.14.

Part b), or Part c), of Lemma 4.13 shows that under a differentiability condition (i.e. μ(𝒩)=0\mu(\mathcal{N})=0) and a regularity condition on the density f(x|θ)f(x|\theta) ensuring the exchange of the limit and the integral, the constraint (9b) in the definition of ({θi}i=1k,𝒩)(\{\theta_{i}\}_{i=1}^{k},\mathcal{N}) first-order identifiability adds no additional restriction and is redundant. In this case, when verifying first-order identifiability, we can simply check whether (9a) has only the zero solution. In addition, when some f~\tilde{f} is available and is simpler than ff, according to Part c) of Lemma 4.13, for first-order identifiability it is sufficient to check whether (9a) with ff replaced by f~\tilde{f} has only the zero solution, provided that μ(𝒩)=0\mu(\mathcal{N})=0 for the 𝒩\mathcal{N} corresponding to f~\tilde{f} and that (19) with ff on the left side replaced by f~\tilde{f} holds. \Diamond

Probability kernels from exponential families of distributions are frequently employed in practice. For these kernels, there is a remarkable equivalence between first-order identifiability and the inverse bounds for both the variational distance and the Hellinger distance.

Lemma 4.15.

Suppose that the probability kernel PθP_{\theta} has a density function ff in the full rank exponential family, given in its canonical form f(x|θ)=exp(θ,T(x)A(θ))h(x)f(x|\theta)=\exp(\langle\theta,T(x)\rangle-A(\theta))h(x) with θΘ\theta\in\Theta, the natural parameter space. Then (9a) has the same solutions as the system of equations (9a), (9b). Moreover for a fixed G0=i=1k0pi0δθi0k0(Θ)G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}) the following five statements are equivalent:

  1. a)

    limr0infG,HBW1(G0,r)GHV(PG,PH)D1(G,H)>0;\lim_{r\to 0}\ \inf_{\begin{subarray}{c}G,H\in B_{W_{1}}(G_{0},r)\\ G\not=H\end{subarray}}\frac{V(P_{G},P_{H})}{D_{1}(G,H)}>0;

  2. b)

    lim infGW1G0Gk0(Θ)V(PG,PG0)D1(G,G0)>0;\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G},P_{G_{0}})}{D_{1}(G,G_{0})}>0;

  3. c)

    limr0infG,HBW1(G0,r)GHh(PG,PH)D1(G,H)>0;\lim_{r\to 0}\ \inf_{\begin{subarray}{c}G,H\in B_{W_{1}}(G_{0},r)\\ G\not=H\end{subarray}}\frac{h(P_{G},P_{H})}{D_{1}(G,H)}>0;

  4. d)

    lim infGW1G0Gk0(Θ)h(PG,PG0)D1(G,G0)>0;\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{h(P_{G},P_{G_{0}})}{D_{1}(G,G_{0})}>0;

  5. e)

    With k,θik,\theta_{i} replaced respectively by k0,θi0k_{0},\theta_{i}^{0}, equation (9a) has only the zero solution.

Parts c) and d) in Lemma 4.15 are not used in this paper beyond the current section, but they may be of independent interest. In the last result, the exponential family is in its canonical form; the same conclusions hold for exponential families in general parametrizations. Recall that a homeomorphism is a continuous function that has a continuous inverse.

Lemma 4.16.

Suppose the probability kernel PθP_{\theta} has a density function ff in the full rank exponential family, f(x|θ)=exp(η(θ),T(x)B(θ))h(x)f(x|\theta)=\exp\left(\langle\eta(\theta),T(x)\rangle-B(\theta)\right)h(x). Suppose the map η:Θη(Θ)q\eta:\Theta\to\eta(\Theta)\subset\mathbb{R}^{q} is a homeomorphism. Fix G0=i=1k0pi0δθi0k0(Θ)G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). Suppose the Jacobian matrix of the function η(θ)\eta(\theta), denoted by Jη(θ):=(η(i)θ(j)(θ))ijJ_{\eta}(\theta):=(\frac{\partial\eta^{(i)}}{\partial\theta^{(j)}}(\theta))_{ij}, exists and is of full rank at θi0\theta_{i}^{0} for i[k0]i\in[k_{0}]. Then, with k,θik,\theta_{i} replaced respectively by k0k_{0}, θi0\theta_{i}^{0}, (9a) has the same solutions as the system of equations (9a), (9b). Moreover, statements b), d) and e) of Lemma 4.15 are equivalent. If in addition Jη(θ)J_{\eta}(\theta) exists and is continuous in a neighborhood of θi0\theta_{i}^{0} for each i[k0]i\in[k_{0}], then all five statements in Lemma 4.15 are equivalent.

Despite the simplicity of kernels in the exponential families, classical and/or first-order identifiability is not always guaranteed. For instance, it is well-known and can be checked easily that the mixture of Bernoulli distributions is not identifiable in the classical sense. We will study the Bernoulli kernel in the context of mixtures of product distributions in Example 5.11. The following example is somewhat less well-known.

Example 4.17 (Two-parameter gamma kernel).

Consider the gamma distribution

f(x|α,β)=βαΓ(α)xα1eβx𝟏(0,)(x)f(x|\alpha,\beta)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}\bm{1}_{(0,\infty)}(x)

with θ=(α,β)Θ:={(α,β)|α>0,β>0}\theta=(\alpha,\beta)\in\Theta:=\{(\alpha,\beta)|\alpha>0,\beta>0\} and the dominating measure μ\mu is the Lebesgue measure on 𝔛=\mathfrak{X}=\mathbb{R}. This is a full rank exponential family. For k02k_{0}\geq 2 define 𝒢k0(Θ)=k0(Θ)\mathcal{G}\subset\mathcal{E}_{k_{0}}(\Theta^{\circ})=\mathcal{E}_{k_{0}}(\Theta) as

𝒢:={Gk0(Θ)|G=i=1k0piδθi and there exist ij such that θjθi=(1,0)}.\mathcal{G}:=\{G\in\mathcal{E}_{k_{0}}(\Theta)|G=\sum_{i=1}^{k_{0}}p_{i}\delta_{\theta_{i}}\text{ and there exist }i\neq j\text{ such that }\theta_{j}-\theta_{i}=(1,0)\}.

For any G0=i=1k0piδθi0𝒢G_{0}=\sum_{i=1}^{k_{0}}p_{i}\delta_{\theta_{i}^{0}}\in\mathcal{G}, let i0j0i_{0}\not=j_{0} be such that θj00θi00=(1,0)\theta_{j_{0}}^{0}-\theta_{i_{0}}^{0}=(1,0), i.e. αj00=αi00+1\alpha_{j_{0}}^{0}=\alpha_{i_{0}}^{0}+1 and βj00=βi00\beta_{j_{0}}^{0}=\beta_{i_{0}}^{0}. Then observing

βf(x|α,β)=αβf(x|α,β)αβf(x|α+1,β),\frac{\partial}{\partial\beta}f(x|\alpha,\beta)=\frac{\alpha}{\beta}f(x|\alpha,\beta)-\frac{\alpha}{\beta}f(x|\alpha+1,\beta),

(a1,b1,,ak0,bk0)(a_{1},b_{1},\ldots,a_{k_{0}},b_{k_{0}}) with ai0=(0,βi00/αi00)a_{i_{0}}=(0,\beta_{i_{0}}^{0}/\alpha_{i_{0}}^{0}), bi0=1b_{i_{0}}=-1, bj0=1b_{j_{0}}=1 and the rest zero is a nonzero solution of the system of equations (9a), (9b) with k,θik,\theta_{i} replaced respectively by k0,θi0k_{0},\theta_{i}^{0}. Write the gamma distribution in exponential family form as in Lemma 4.16, with η(θ)=(α1,β)\eta(\theta)=(\alpha-1,\beta) and T(x)=(lnx,x)T(x)=(\ln x,-x). Since η(θ)\eta(\theta) satisfies all the conditions in Lemma 4.16,

lim infGW1G0Gk0(Θ)h(PG,PG0)D1(G,G0)=lim infGW1G0Gk0(Θ)V(PG,PG0)D1(G,G0)=0.\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{h(P_{G},P_{G_{0}})}{D_{1}(G,G_{0})}=\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G},P_{G_{0}})}{D_{1}(G,G_{0})}=0.

This implies that even if V(pG,pG0)V(p_{G},p_{G_{0}}) vanishes at a fast rate, D1(G,G0)D_{1}(G,G_{0}) may not.

Finite mixtures of gamma distributions were investigated by [25], who called 𝒢\mathcal{G} a pathological set of parameter values to highlight the effects of weak identifiability (more precisely, the violation of first-order identifiability conditions) on the convergence behavior of model parameters when the parameter values fall in 𝒢\mathcal{G}. (On the other hand, for G0k0(Θ)\𝒢G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ})\backslash\mathcal{G}, it is shown in the proof of Proposition 5.1 (a) in [25] that (9a) with k,θik,\theta_{i} replaced respectively by k0,θi0k_{0},\theta_{i}^{0} has only the zero solution. Their original proof works under the stringent condition α1\alpha\geq 1 on the parameter space, but multiplying their (26) by xx yields the same conclusion for the general case α>0\alpha>0. A direct proof is also straightforward by using Lemma B.3 b) and is similar to Example 5.13.) Thus by Lemma 4.16,

2lim infGW1G0Gk0(Θ)h(PG,PG0)D1(G,G0)lim infGW1G0Gk0(Θ)V(PG,PG0)D1(G,G0)>0.\sqrt{2}\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{h(P_{G},P_{G_{0}})}{D_{1}(G,G_{0})}\geq\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G},P_{G_{0}})}{D_{1}(G,G_{0})}>0.

Notice that a) and c) in Lemma 4.15 also hold but are omitted here. Thus, outside of the pathological set 𝒢\mathcal{G} the convergence rate of the mixture density pGp_{G} towards pG0p_{G_{0}} is carried over to the convergence of GG toward G0G_{0} under D1D_{1}. It is the uncertainty about whether the true mixing measure G0G_{0} is pathological or not that makes parameter estimation highly inefficient. Given m{m} i.i.d. observations from a finite mixture of gamma distributions, where the number of components k0k_{0} is given, [25] established a minimax bound for estimating GG that is slower than any polynomial rate mr{m}^{-r} for any r1r\geq 1 under the WrW_{r} metric. \Diamond
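The β-derivative identity driving Example 4.17 can be checked numerically. The sketch below (parameter values arbitrary; helper names are ours) compares a central finite difference in β against the right-hand side (α/β)f(x|α,β) − (α/β)f(x|α+1,β):

```python
import numpy as np
from math import lgamma

def gamma_pdf(x, alpha, beta):
    # f(x | alpha, beta) = beta^alpha / Gamma(alpha) * x^(alpha-1) * exp(-beta x), x > 0
    return np.exp(alpha * np.log(beta) - lgamma(alpha)
                  + (alpha - 1) * np.log(x) - beta * x)

alpha, beta = 2.3, 1.7
x = np.linspace(0.1, 10.0, 50)

eps = 1e-6
# central finite difference of f in beta
lhs = (gamma_pdf(x, alpha, beta + eps) - gamma_pdf(x, alpha, beta - eps)) / (2 * eps)
# the claimed identity: (alpha/beta) f(x|alpha,beta) - (alpha/beta) f(x|alpha+1,beta)
rhs = (alpha / beta) * gamma_pdf(x, alpha, beta) \
      - (alpha / beta) * gamma_pdf(x, alpha + 1, beta)
print(np.max(np.abs(lhs - rhs)))   # should be numerically negligible
```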

We end this section with several remarks to highlight the concern for parameter estimation for mixture models under weak identifiability and to set the stage for the next section.

Remark 4.18.

a) It may be of interest to devise an efficient parameter estimation method (by, perhaps, a clever regularization or reparametrization technique) that may help to overcome the lack of first-order identifiability. We are not aware of a general way to achieve this. Absent such methods, a promising direction for the statistician to take is to simply collect more data: not only by increasing the number m{m} of independent observations, but also by increasing the number of repeated measurements. Finite mixtures of product distributions usually arise in this practical context: when one deals with a highly heterogeneous data population which is made up of many latent subpopulations carrying distinct patterns, it is often possible to collect observations presumably coming from the same subpopulation, even if one is uncertain about the mixture component that a subpopulation may be assigned to. Thus, one may aim to collect m{m} independent sequences of N{N} exchangeable observations, and assume that they are sampled from a finite mixture of N{N}-product distributions denoted by PG,NP_{G,{N}}. Such possibilities arise naturally in practice. As a concrete example, [12] applied a finite mixture model with repeated measurements to observations from an education assessment study. In this study, each child is presented with a sequence of two-dimensional line drawings of a rectangular vessel, each drawn in a different tilted position. Each child is then asked to draw on these figures how the water in the vessel would appear if the vessel were half filled with water. Thus the observations from each child can be represented as a vector of exchangeable data, and the experimenter can increase the length NN by presenting each child with more independent random rectangular vessels. Other examples and applications include psychological studies [23, 10] and topic modeling [36].

b) One natural question to ask is: how does increasing the number N{N} of repeated measurements (i.e., the length of the exchangeable sequences) help to overcome the lack of strong identifiability, such as our notion of first-order identifiability? This question can be made precise in light of Lemma 4.2: whether there exists a natural number n11n_{1}\geq 1 such that the following inverse bound holds for any Nn1{N}\geq n_{1}

lim infGW1G0Gk0(Θ)V(PG,N,PG0,N)D1(G,G0)>0.\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{1}(G,G_{0})}>0. (20)

Observe that since V(PG,N,PG0,N)V(P_{G,{N}},P_{G_{0},{N}}) increases in N{N} while the denominator D1(G,G0)D_{1}(G,G_{0}) is fixed in N{N}, if (20) holds for some N=n1{N}=n_{1}, then it also holds for all Nn1{N}\geq n_{1}. Moreover, what can we say about the role of N{N} in parameter estimation in the presence of such inverse bounds? In the sequel these questions will be addressed by establishing inverse bounds for mixtures of product distributions. Such theory will also be used to derive tight learning rates for the mixing measure GG from a collection of exchangeable sequences of observations. \Diamond
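The benefit of increasing N can be illustrated on the Bernoulli kernel (revisited in Example 5.11), which is not identifiable in mixtures at N = 1. In the toy computation below (our construction; parameter values are illustrative), the two mixing measures have identical one-observation marginals, yet the variational distance between the mixtures of N-products becomes positive at N = 2 and is nondecreasing in N. Since the N-fold Bernoulli product density depends on the observed binary sequence only through the number of successes, the variational distance can be computed from the count statistic:

```python
import numpy as np
from math import comb

def mixture_count_pmf(weights, thetas, N):
    # pmf of the number of successes under a mixture of N-fold Bernoulli products
    return np.array([
        sum(w * comb(N, s) * th**s * (1 - th)**(N - s)
            for w, th in zip(weights, thetas))
        for s in range(N + 1)
    ])

# G and G0 both mix to Bernoulli(0.5) at N = 1, so they are indistinguishable there
G  = ([0.5, 0.5], [0.3, 0.7])
G0 = ([0.5, 0.5], [0.4, 0.6])

V = []
for N in range(1, 9):
    pG, pG0 = mixture_count_pmf(*G, N), mixture_count_pmf(*G0, N)
    V.append(0.5 * np.abs(pG - pG0).sum())
print(V)   # zero at N = 1, then positive and nondecreasing in N
```

The monotonicity in N reflects the fact that marginalizing out a coordinate maps the mixture of (N+1)-products onto the mixture of N-products, and the variational distance can only contract under such a map.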

5 Inverse bounds for mixtures of product distributions

Consider a family of probability distributions {Pθ}θΘ\{P_{\theta}\}_{\theta\in\Theta} on some measurable space (𝔛,𝒜)(\mathfrak{X},\mathcal{A}) where θ\theta is the parameter of the family and Θq\Theta\subset\mathbb{R}^{q} is the parameter space. This yields the N{N}-product probability kernel on (𝔛N,𝒜N)(\mathfrak{X}^{{N}},\mathcal{A}^{N}), which is denoted by {Pθ,N:=NPθ}θΘ\{P_{\theta,{N}}:=\bigotimes^{{N}}P_{\theta}\}_{\theta\in\Theta}. For any G=i=1kpiδθik(Θ)G=\sum_{i=1}^{k}p_{i}\delta_{\theta_{i}}\in\mathcal{E}_{k}(\Theta) as mixing measure, the resulting finite mixture for the N{N}-product families is a probability measure on (𝔛N,𝒜N)(\mathfrak{X}^{N},\mathcal{A}^{N}), namely, PG,N=i=1kpiPθi,NP_{G,{N}}=\sum_{i=1}^{k}p_{i}P_{\theta_{i},{N}}.

The main results of this section are stated in Theorem 5.8 and Theorem 5.16. These theorems establish the following inverse bound under certain conditions on the probability kernel family {Pθ}θΘ\{P_{\theta}\}_{\theta\in\Theta} and, in some cases, on G0G_{0}: for a fixed G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}) there holds

lim infNlim infGW1G0Gk0(Θ)V(PG,N,PG0,N)DN(G,G0)>0.\liminf_{{N}\to\infty}\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{{N}}(G,G_{0})}>0. (21)

By contrast, an easy upper bound on the left side of (21) holds generally (cf. Lemma 8.1):

supN1lim infGW1G0Gk0(Θ)V(PG,N,PG0,N)DN(G,G0)1/2.\sup_{{N}\geq 1}\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{{N}}(G,G_{0})}\leq 1/2. (22)

In fact, a stronger inverse bound can also be established:

lim infNlimr0infG,HBW1(G0,r)GHV(PG,N,PH,N)DN(G,H)>0.\liminf_{{N}\to\infty}\lim_{r\to 0}\ \inf_{\begin{subarray}{c}G,H\in B_{W_{1}}(G_{0},r)\\ G\not=H\end{subarray}}\frac{V(P_{G,{N}},P_{H,{N}})}{D_{{N}}(G,H)}>0. (23)

These inverse bounds relate to the positivity of a suitable notion of curvature on the space of mixtures of product distributions, and will be shown to have powerful consequences. It is easy to see that (23) implies (21), which in turn entails (20).

Section 5.2 is devoted to establishing these bounds for PθP_{\theta} belonging to exponential families of distributions. In Section 5.3 the inverse bounds are established for very general probability kernel families, where 𝔛\mathfrak{X} may be an abstract space and no parametric assumption on PθP_{\theta} is required. Instead, we appeal to a set of mild regularity conditions on the characteristic function of a push-forward measure produced by a measurable map TT acting on the measure space (𝔛,𝒜,Pθ)(\mathfrak{X},\mathcal{A},P_{\theta}). We will see that this general theory enables the study of a very broad range of mixtures of product distributions for exchangeable sequences.

5.1 Implications on classical and first-order identifiability

Before presenting the section’s main theorems, let us explicate some immediate implications of their conclusions expressed by inequalities (21) and (23). These inequalities contain detailed information about the convergence behavior of de Finetti’s mixing measure GG toward G0G_{0}, a useful application of which will be demonstrated in Section 6. For now, we simply highlight striking implications for the basic notions of identifiability of mixtures of distributions investigated in Section 4. Note that no overt assumption on classical or first-order identifiability is required in the statements of the theorems establishing (21) or (23). In plain terms, these inequalities entail that by increasing the number N{N} of exchangeable measurements, the resulting mixture of N{N}-product distributions becomes identifiable in both the classical and the first-order sense, even if it is not initially so, i.e., when N=1{N}=1 or small.

Define, for any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}), Hk=1k(Θ)H\in\cup_{k=1}^{\infty}\mathcal{E}_{k}(\Theta) and k=1k(Θ)\mathcal{H}\subset\cup_{k=1}^{\infty}\mathcal{E}_{k}(\Theta),

n0\displaystyle n_{0} :=n0(H,)\displaystyle:=n_{0}(H,\mathcal{H}) :=min{n1|G{H},PG,nPH,n},\displaystyle:=\min\biggr{\{}n\geq 1\biggr{|}\forall G\in\mathcal{H}\setminus\{H\},P_{G,n}\neq P_{H,n}\biggr{\}}, (24)
n1:=n1(G0)\displaystyle n_{1}:=n_{1}(G_{0}) :=n1(G0,k0(Θ))\displaystyle:=n_{1}(G_{0},\mathcal{E}_{k_{0}}(\Theta)) :=min{n1|lim infGW1G0Gk0(Θ)V(PG,n,PG0,n)D1(G,G0)>0},\displaystyle:=\min\biggr{\{}n\geq 1\biggr{|}\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,n},P_{G_{0},n})}{D_{1}(G,G_{0})}>0\biggr{\}},
n2:=n2(G0)\displaystyle n_{2}:=n_{2}(G_{0}) :=n2(G0,k0(Θ))\displaystyle:=n_{2}(G_{0},\mathcal{E}_{k_{0}}(\Theta)) :=min{n1|limr0infG,HBW1(G0,r)GHV(PG,n,PH,n)D1(G,H)>0}.\displaystyle:=\min\biggr{\{}n\geq 1\biggr{|}\lim_{r\to 0}\ \inf_{\begin{subarray}{c}G,H\in B_{W_{1}}(G_{0},r)\\ G\not=H\end{subarray}}\frac{V(P_{G,n},P_{H,n})}{D_{1}(G,H)}>0\biggr{\}}.

n0n_{0} is called the minimal zero-order identifiable length (with respect to HH and \mathcal{H}), or 0-identifiable length for short. n1n_{1} is called the minimal first-order identifiable length (with respect to G0G_{0} and k0(Θ)\mathcal{E}_{k_{0}}(\Theta)), or 11-identifiable length for short. Since W1(G,G0)G0D1(G,G0)W_{1}(G,G_{0})\asymp_{G_{0}}D_{1}(G,G_{0}) in a small neighborhood of G0G_{0} (see Lemma 3.2 c)), the two metrics are interchangeable in the denominators of the above definitions of n1n_{1} and n2n_{2}. Note that n0n_{0}, depending on a set \mathcal{H} to be specified, describes a global property of classical identifiability, a notion determined mainly by the algebraic structure of the mixture model’s kernel family and its parametrization. (This is also known as "strict identifiability", cf., e.g.,  [3]). On the other hand, both n1n_{1} and n2n_{2} characterize a local behavior of mixture densities pG,Np_{G,{N}} near a certain pG0,Np_{G_{0},{N}}, a notion that relies primarily on regularity conditions of the kernel, as we shall see in what follows. When it is clear from the context, we may use n1n_{1} or n1(G0)n_{1}(G_{0}) for n1(G0,k0(Θ))n_{1}(G_{0},\mathcal{E}_{k_{0}}(\Theta)) for brevity. Similar rules apply to n0n_{0} and n2n_{2}.

The following proposition provides the link between classical identifiability and strong notions of local identifiability provided either (21) or (23) holds. In a nutshell, as N{N} gets large, the two types of identifiability can be connected by the force of the central limit theorem applied to product distributions, which is one of the key ingredients in the proof of the inverse bounds. Define two related and useful quantities: for any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ})

N¯1:=N¯1(G0):=min{n1|infNnlim infGW1G0Gk0(Θ)V(PG,N,PG0,N)DN(G,G0)>0}\displaystyle\underline{{N}}_{1}:=\underline{{N}}_{1}(G_{0}):=\min\biggr{\{}n\geq 1\biggr{|}\inf_{N\geq n}\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{{N}}(G,G_{0})}>0\biggr{\}} (25)
N¯2:=N¯2(G0):=min{n1|infNnlimr0infG,HBW1(G0,r)GHV(PG,N,PH,N)DN(G,H)>0}.\displaystyle\underline{{N}}_{2}:=\underline{{N}}_{2}(G_{0}):=\min\biggr{\{}n\geq 1\biggr{|}\inf_{{N}\geq n}\lim_{r\to 0}\ \inf_{\begin{subarray}{c}G,H\in B_{W_{1}}(G_{0},r)\\ G\not=H\end{subarray}}\frac{V(P_{G,{N}},P_{H,{N}})}{D_{{N}}(G,H)}>0\biggr{\}}. (26)

Note that (21) means N¯1(G0)<\underline{{N}}_{1}(G_{0})<\infty, while (23) means N¯2(G0)<\underline{{N}}_{2}(G_{0})<\infty. The following proposition explicates the connections among n0n_{0}, n1n_{1}, n2n_{2}, N¯1\underline{N}_{1} and N¯2\underline{N}_{2}.

Proposition 5.1.

The following statements hold.

  1. a)

Consider any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}); then n1(G0)n2(G0)n_{1}(G_{0})\leq n_{2}(G_{0}). Moreover, there exists r:=r(G0)>0r:=r(G_{0})>0 such that

    supGBW1(G0,r)n1(G)n2(G0).\sup_{G\in B_{W_{1}}(G_{0},r)}n_{1}(G)\leq n_{2}(G_{0}).
  2. b)

    Consider any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). If N¯1(G0)<\underline{{N}}_{1}(G_{0})<\infty, then n1(G0)=N¯1(G0)<n_{1}(G_{0})=\underline{{N}}_{1}(G_{0})<\infty.
    If N¯2(G0)<\underline{{N}}_{2}(G_{0})<\infty then n2(G0)=N¯2(G0)<n_{2}(G_{0})=\underline{{N}}_{2}(G_{0})<\infty. In particular, the first or the second conclusion holds if (21) or (23) holds respectively.

  3. c)

    There holds for any subset Θ1Θ\Theta_{1}\subset\Theta^{\circ}

    supGkk0k(Θ1)n0(G,kk0k(Θ1))supG2k0(Θ1)n1(G).\sup_{G\in\bigcup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1})}n_{0}(G,\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1}))\leq\sup_{G\in\mathcal{E}_{2k_{0}}(\Theta_{1})}n_{1}(G).
  4. d)

Suppose the kernel family PθP_{\theta} admits density f(|θ)f(\cdot|\theta) with respect to a dominating measure μ\mu on 𝔛\mathfrak{X}. Fix G0=i=1k0pi0δθi0k0(Θ)G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). Suppose for μ\mu-a.e. x𝔛x\in\mathfrak{X}, f(x|θ)f(x|\theta) as a function of θ\theta is continuously differentiable in a neighborhood of θi0\theta_{i}^{0} for each i[k0]i\in[k_{0}]. Moreover, assume that condition c) of Lemma 4.4 holds for any {ai}i=1k0q\{a_{i}\}_{i=1}^{k_{0}}\subset\mathbb{R}^{q}. Then, n2(G0)=n1(G0)n_{2}(G_{0})=n_{1}(G_{0}).

  5. e)

    Suppose that (21) holds for every G0k2k0k(Θ)G_{0}\in\cup_{k\leq 2k_{0}}\mathcal{E}_{k}(\Theta^{\circ}). Moreover, suppose all conditions in part d) are satisfied for every G0k2k0k(Θ)G_{0}\in\cup_{k\leq 2k_{0}}\mathcal{E}_{k}(\Theta^{\circ}). Then for any compact Θ1Θ\Theta_{1}\subset\Theta^{\circ},

    supGkk0k(Θ1)n0(G,kk0k(Θ1))supG2k0(Θ1)n1(G)<.\sup_{G\in\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1})}n_{0}(G,\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1}))\leq\sup_{G\in\mathcal{E}_{2k_{0}}(\Theta_{1})}n_{1}(G)<\infty.
  6. f)

    Suppose that (23) holds for every G0k2k0k(Θ)G_{0}\in\cup_{k\leq 2k_{0}}\mathcal{E}_{k}(\Theta^{\circ}). Then for any compact Θ1Θ\Theta_{1}\subset\Theta^{\circ}, the conclusion of part e) holds.

Remark 5.2.

Part a) and part b) of Proposition 5.1 highlight an immediate significance of inverse bounds (21) and (23): they establish pointwise finiteness of the 1-identifiable length n1(G0)n_{1}(G_{0}). Moreover, under the additional condition on first-order identifiability, one obtains the following strong result as an immediate consequence: Consider any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). If (11) and (21) hold, then n1(G0)=N¯1(G0)=1n_{1}(G_{0})=\underline{{N}}_{1}(G_{0})=1. If (12) and (23) hold, then n1(G0)=N¯1(G0)=n2(G0)=N¯2(G0)=1n_{1}(G_{0})=\underline{{N}}_{1}(G_{0})=n_{2}(G_{0})=\underline{{N}}_{2}(G_{0})=1. \Diamond

Remark 5.3.

Proposition 5.1 c) relates supGkk0k(Θ1)n0(G,kk0k(Θ1))\sup_{G\in\bigcup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1})}n_{0}(G,\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1})), the uniform 0-identifiable length, to the uniform 11-identifiable length supG2k0(Θ1)n1(G,2k0(Θ1))\sup_{G\in\mathcal{E}_{2k_{0}}(\Theta_{1})}n_{1}(G,\mathcal{E}_{2k_{0}}(\Theta_{1})). Combining this with parts e) and f) and inverse bounds (21) and (23), one arrives at a rather remarkable consequence: for Θ1\Theta_{1} a compact subset of Θ\Theta^{\circ}, they yield the finiteness of both the 0-identifiable length n0(G,kk0k(Θ1))n_{0}(G,\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1})) and the 1-identifiable length n1(G)n_{1}(G) uniformly over subsets of mixing measures with a finite number of support points. In particular, as long as (21) or (23) (along with some regularity conditions in the former) holds for every G0k2k0k(Θ)G_{0}\in\cup_{k\leq 2k_{0}}\mathcal{E}_{k}(\Theta^{\circ}), then PG,NP_{G,N} will be strictly identifiable and first-identifiable on kk0k(Θ1)\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1}) for sufficiently large NN. That is, taking products helps in making the kernel identifiable in a strong sense. As we shall see in the next subsection, (23) holds for every G0k=1k(Θ)G_{0}\in\bigcup_{k=1}^{\infty}\mathcal{E}_{k}(\Theta^{\circ}) when {Pθ}\{P_{\theta}\} belongs to full rank exponential families of distributions. This inverse bound also holds for a broad range of probability kernels beyond the exponential families. \Diamond

Remark 5.4.

Concerning only the 0-identifiable length n0n_{0}, a remarkable upper bound

supGkk0k(Θ)n0(G,kk0k(Θ))2k01\sup_{G\in\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta)}n_{0}(G,\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta))\leq 2k_{0}-1

was established in a recent paper [44]. This bound actually applies to the nonparametric component distributions, and extends also to our parametric component distribution setting. However, in a parametric component distribution setting, the upper bound 2k012k_{0}-1 is far from being tight (cf. Example 5.13). \Diamond

Proof of Proposition 5.1.

a) It suffices to consider the case n2=n2(G0)<n_{2}=n_{2}(G_{0})<\infty. Then there exists r0>0r_{0}>0 such that

infG,HBW1(G0,r0)GHV(PG,n2,PH,n2)D1(G,H)>0.\inf_{\begin{subarray}{c}G,H\in B_{W_{1}}(G_{0},r_{0})\\ G\not=H\end{subarray}}\frac{V(P_{G,n_{2}},P_{H,n_{2}})}{D_{1}(G,H)}>0.

Then fixing GG in the preceding display yields n1(G)n2(G0)n_{1}(G)\leq n_{2}(G_{0}) and the proof is complete since GG is arbitrary in BW1(G0,r0)B_{W_{1}}(G_{0},r_{0}).

b) By the definition of N¯1\underline{{N}}_{1},

lim infGW1G0Gk0(Θ)V(PG,N¯1,PG0,N¯1)D1(G,G0)lim infGW1G0Gk0(Θ)V(PG,N¯1,PG0,N¯1)DN¯1(G,G0)>0,\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,\underline{{N}}_{1}},P_{G_{0},\underline{{N}}_{1}})}{D_{1}(G,G_{0})}\geq\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,\underline{{N}}_{1}},P_{G_{0},\underline{{N}}_{1}})}{D_{\underline{{N}}_{1}}(G,G_{0})}>0, (27)

which entails that n1N¯1n_{1}\leq\underline{{N}}_{1}. On the other hand, for any N[n1,N¯1]{N}\in[n_{1},\underline{{N}}_{1}] we have

lim infGW1G0Gk0(Θ)V(PG,N,PG0,N)DN(G,G0)lim infGW1G0Gk0(Θ)1NV(PG,n1,PG0,n1)D1(G,G0)1N¯1lim infGW1G0Gk0(Θ)V(PG,n1,PG0,n1)D1(G,G0)>0,\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{N}(G,G_{0})}\geq\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{1}{\sqrt{{N}}}\frac{V(P_{G,n_{1}},P_{G_{0},n_{1}})}{D_{1}(G,G_{0})}\\ \geq\frac{1}{\sqrt{\underline{{N}}_{1}}}\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,n_{1}},P_{G_{0},n_{1}})}{D_{1}(G,G_{0})}>0,

which entails N¯1n1\underline{{N}}_{1}\leq n_{1}. Thus N¯1=n1\underline{{N}}_{1}=n_{1}. The proof of n2=N¯2<n_{2}=\underline{{N}}_{2}<\infty is similar.

c) It suffices to prove the case n¯:=supG2k0(Θ1)n1(G)<\bar{n}:=\sup_{G\in\mathcal{E}_{2k_{0}}(\Theta_{1})}n_{1}(G)<\infty. Take any Gk(Θ1)G\in\mathcal{E}_{k}(\Theta_{1}) for 1kk01\leq k\leq k_{0}, and suppose that n0(G,kk0k(Θ1))>n¯n_{0}(G,\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1}))>\bar{n}. Then there exists a G1k¯(Θ1)G_{1}\in\mathcal{E}_{\bar{k}}(\Theta_{1}) for some 1k¯k01\leq\bar{k}\leq k_{0} such that PG,n¯=PG1,n¯P_{G,\bar{n}}=P_{G_{1},\bar{n}} but G1GG_{1}\neq G. Collect the supporting atoms of GG and G1G_{1}; there are at most 2k02k_{0} of them, denoted by θ10,,θk0Θ1\theta^{0}_{1},\ldots,\theta^{0}_{k^{\prime}}\in\Theta_{1}. Supplement these with a set of atoms {θi0}i=k+12k0Θ1\{\theta^{0}_{i}\}_{i=k^{\prime}+1}^{2k_{0}}\subset\Theta_{1} to obtain a set of 2k02k_{0} distinct atoms denoted by {θi0}i=12k0\{\theta^{0}_{i}\}_{i=1}^{2k_{0}}. Now take G0G_{0} to be any discrete probability measure in 2k0(Θ1)\mathcal{E}_{2k_{0}}(\Theta_{1}) supported by θ10,,θ2k00\theta^{0}_{1},\ldots,\theta^{0}_{2k_{0}}. Since PG,n¯=PG1,n¯P_{G,\bar{n}}=P_{G_{1},\bar{n}}, the condition of Lemma C.1 for G0G_{0} is satisfied and thus

lim infHW1G0H2k0(Θ)V(PH,n¯,PG0,n¯)D1(H,G0)=0.\liminf_{\begin{subarray}{c}H\overset{W_{1}}{\to}G_{0}\\ H\in\mathcal{E}_{2k_{0}}(\Theta)\end{subarray}}\frac{V(P_{H,\bar{n}},P_{G_{0},\bar{n}})}{D_{1}(H,G_{0})}=0.

But this contradicts the definition of n¯\bar{n}.

d) By part a), it suffices to consider the case n1=n1(G0)<n_{1}=n_{1}(G_{0})<\infty. By Lemma C.2, the product family j=1n1f(xj|θ)\prod_{j=1}^{n_{1}}f(x_{j}|\theta) satisfies all the conditions in Corollary 4.6. Thus by Corollary 4.6,

limr0infG,HBW1(G0,r)GHV(PG,n1,PH,n1)D1(G,H)>0.\lim_{r\to 0}\ \inf_{\begin{subarray}{c}G,H\in B_{W_{1}}(G_{0},r)\\ G\not=H\end{subarray}}\frac{V(P_{G,n_{1}},P_{H,n_{1}})}{D_{1}(G,H)}>0.

It follows that n2(G0)n1n_{2}(G_{0})\leq n_{1}, which implies that n2(G0)=n1(G0)n_{2}(G_{0})=n_{1}(G_{0}) by part a).

e) By part b) and part d), n2(G0)<n_{2}(G_{0})<\infty for every G0k2k0k(Θ)G_{0}\in\cup_{k\leq 2k_{0}}\mathcal{E}_{k}(\Theta^{\circ}). Associate each G0k2k0k(Θ)G_{0}\in\cup_{k\leq 2k_{0}}\mathcal{E}_{k}(\Theta^{\circ}) with a neighborhood BW1(G0,r(G0))B_{W_{1}}(G_{0},r(G_{0})) as in part a) such that its conclusion holds. Here we emphasize that by the definition of BW1B_{W_{1}} in (10), BW1(G0,r(G0))kk(Θ)B_{W_{1}}(G_{0},r(G_{0}))\subset\cup_{k\leq\ell}\mathcal{E}_{k}(\Theta) when G0(Θ)G_{0}\in\mathcal{E}_{\ell}(\Theta^{\circ}). Since k2k0k(Θ1)\cup_{k\leq 2k_{0}}\mathcal{E}_{k}(\Theta_{1}) is compact, it is covered by finitely many such neighborhoods, and together with part a) we deduce that n1(G)n_{1}(G) is uniformly bounded for Gk2k0k(Θ1)G\in\cup_{k\leq 2k_{0}}\mathcal{E}_{k}(\Theta_{1}). Combining this with part c) we conclude the proof.

f) By part b) n2(G0)<n_{2}(G_{0})<\infty for every G0k2k0k(Θ)G_{0}\in\cup_{k\leq 2k_{0}}\mathcal{E}_{k}(\Theta^{\circ}). The remainder of the argument is the same as part e). ∎

We can further unpack the double infimum limits in the expression of (21) to develop results useful for the subsequent convergence rate analysis in Section 6. First, it is simple to show that the limiting argument for N{N} can be completely removed when N{N} is suitably bounded. Denote by C()C(\cdot) or c()c(\cdot) a positive finite constant depending only on its parameters and the probability kernel {Pθ}θΘ\{P_{\theta}\}_{\theta\in\Theta}; such constants may differ from line to line in the statements and proofs.

Lemma 5.5.

Fix G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). Suppose (21) holds. Then for any N0n1(G0){N}_{0}\geq n_{1}(G_{0}), there exist c(G0,N0)c(G_{0},{N}_{0}) and C(G0)C(G_{0}) such that for any Gk0(Θ)G\in\mathcal{E}_{k_{0}}(\Theta) satisfying W1(G,G0)<c(G0,N0)W_{1}(G,G_{0})<c(G_{0},{N}_{0}),

V(PG,N,PG0,N)C(G0)DN(G,G0),N[n1(G0),N0].V(P_{G,{N}},P_{G_{0},{N}})\geq C(G_{0})D_{{N}}(G,G_{0}),\quad\forall{N}\in[n_{1}(G_{0}),{N}_{0}].

A key feature of the above claim is that the radius c(G0,N0)c(G_{0},{N}_{0}) of the local W1W_{1} ball centered at G0G_{0} over which the inverse bound holds depends on N0{N}_{0}, but the multiplicative constant C(G0)C(G_{0}) does not. Next, given additional conditions, most notably compactness of the space of mixing measures, we may remove completely the second limiting argument involving GG. In other words, we may extend the domain of GG on which the inverse bound of the form VW1D1V\gtrsim W_{1}\gtrsim D_{1} continues to hold, where the multiplicative constants are suppressed.

Lemma 5.6.

Fix G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). Let Θ1\Theta_{1} be any compact subset of Θ\Theta containing the supporting atoms of G0G_{0}. Suppose the map θPθ\theta\mapsto P_{\theta} from (Θ1,2)(\Theta_{1},\|\cdot\|_{2}) to ({Pθ}θΘ1,V(,))(\{P_{\theta}\}_{\theta\in\Theta_{1}},V(\cdot,\cdot)) is continuous. Then for any Gk=1k0k(Θ1)G\in\bigcup_{k=1}^{k_{0}}\mathcal{E}_{k}(\Theta_{1}) and any Nn1(G0)n0(G0,kk0k(Θ1)){N}\geq n_{1}(G_{0})\vee n_{0}(G_{0},\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1})), provided n1(G0)n0(G0,kk0k(Θ1))<n_{1}(G_{0})\vee n_{0}(G_{0},\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1}))<\infty,

V(PG,N,PG0,N)C(G0,Θ1)W1(G,G0).V(P_{G,{N}},P_{G_{0},{N}})\geq C(G_{0},\Theta_{1})W_{1}(G,G_{0}).

Finally, we record a simple and useful fact that allows one to transfer an inverse bound for one kernel family PθP_{\theta} to another kernel family by means of a homeomorphic transformation of the parameter space. If g(θ)=ηg(\theta)=\eta for some homeomorphism g:ΘΞqg:\Theta\to\Xi\subset\mathbb{R}^{q}, then for any G=i=1kpiδθik(Θ)G=\sum_{i=1}^{k}p_{i}\delta_{\theta_{i}}\in\mathcal{E}_{k}(\Theta), denote Gη=i=1kpiδg(θi)k(Ξ)G^{\eta}=\sum_{i=1}^{k}p_{i}\delta_{g(\theta_{i})}\in\mathcal{E}_{k}(\Xi). Given a probability kernel family {Pθ}θΘ\{P_{\theta}\}_{\theta\in\Theta}, under the new parameter η\eta define

P~η=Pg1(η),ηΞ.\tilde{P}_{\eta}=P_{g^{-1}(\eta)},\quad\forall\eta\in\Xi.

Let GηG^{\eta} also denote a generic element in k0(Ξ)\mathcal{E}_{k_{0}}(\Xi), and P~Gη,N\tilde{P}_{G^{\eta},{N}} be defined similarly as PG,NP_{G,{N}}.

Lemma 5.7 (Invariance under homeomorphic parametrization with local invertible Jacobian).

Suppose gg is a homeomorphism. For G0=i=1k0pi0δθi0k0(Θ)G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}), suppose the Jacobian matrix of the function g(θ)g(\theta), denoted by Jg(θ):=(g(i)θ(j)(θ))ijJ_{g}(\theta):=(\frac{\partial g^{(i)}}{\partial\theta^{(j)}}(\theta))_{ij} exists and is full rank at θi0\theta_{i}^{0} for i[k0]i\in[k_{0}]. Then N\forall{N}

lim infGηW1G0ηGηk0(Ξ)V(P~Gη,N,P~G0η,N)DN(Gη,G0η)G0lim infGW1G0Gk0(Θ)V(PG,N,PG0,N)DN(G,G0).\liminf_{\begin{subarray}{c}G^{\eta}\overset{W_{1}}{\to}G_{0}^{\eta}\\ G^{\eta}\in\mathcal{E}_{k_{0}}(\Xi)\end{subarray}}\frac{V(\tilde{P}_{G^{\eta},{N}},\tilde{P}_{G_{0}^{\eta},{N}})}{D_{{N}}(G^{\eta},G_{0}^{\eta})}\overset{G_{0}}{\asymp}\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{{N}}(G,G_{0})}. (28)

Moreover, if in addition Jg(θ)J_{g}(\theta) exists and is continuous in a neighborhood of θi0\theta_{i}^{0} for each i[k0]i\in[k_{0}], then N\forall N

limr0infGη,HηBW1(G0η,r)GηHηV(P~Gη,N,P~Hη,N)D1(Gη,Hη)G0limr0infG,HBW1(G0,r)GHV(PG,N,PH,N)D1(G,H).\lim_{r\to 0}\ \inf_{\begin{subarray}{c}G^{\eta},H^{\eta}\in B_{W_{1}}(G_{0}^{\eta},r)\\ G^{\eta}\not=H^{\eta}\end{subarray}}\frac{V(\tilde{P}_{G^{\eta},N},\tilde{P}_{H^{\eta},N})}{D_{1}(G^{\eta},H^{\eta})}\overset{G_{0}}{\asymp}\lim_{r\to 0}\ \inf_{\begin{subarray}{c}G,H\in B_{W_{1}}(G_{0},r)\\ G\not=H\end{subarray}}\frac{V(P_{G,N},P_{H,N})}{D_{1}(G,H)}.

Lemma 5.7 shows that if an inverse bound (21) or (23) under a particular parametrization is established, then the same inverse bound holds for all other parametrizations that are homeomorphic and that have local invertible Jacobian. This allows one to choose the most convenient parametrization; for instance, one may choose the canonical form for an exponential family or another more convenient parametrization, like the mean parametrization.

5.2 Probability kernels in regular exponential family

We now present inverse bounds for the mixture of products of exponential family distributions. Suppose that {Pθ}θΘ\{P_{\theta}\}_{\theta\in\Theta} is a full rank exponential family of distributions on 𝔛\mathfrak{X}. Adopting the notational convention for canonical parameters of exponential families, we assume PθP_{\theta} admits a density function with respect to a dominating measure μ\mu, namely f(x|θ)f(x|\theta) for θΘ\theta\in\Theta.

Theorem 5.8.

Suppose that the probability kernel {f(x|θ)}θΘ\{f(x|\theta)\}_{\theta\in\Theta} is in a full rank exponential family of distributions in canonical form as in Lemma 4.15. For any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}), (21) and (23) hold.

In the last theorem the exponential family is in its canonical form. The following corollary extends the result to exponential families in general form under mild conditions.

Corollary 5.9.

Suppose the probability kernel PθP_{\theta} has a density function ff in the full rank exponential family, f(x|θ)=exp(η(θ),T(x)B(θ))h(x)f(x|\theta)=\exp\left(\langle\eta(\theta),T(x)\rangle-B(\theta)\right)h(x), where the map η:Θη(Θ)q\eta:\Theta\to\eta(\Theta)\subset\mathbb{R}^{q} is a homeomorphism. Suppose that η\eta is continuously differentiable on Θ\Theta^{\circ} and its Jacobian is of full rank on Θ\Theta^{\circ}. Then, for any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}),  (21) and (23) hold.

As a consequence of Corollary 5.9, Proposition 5.1 and Lemma 4.16, we immediately obtain the following interesting algebraic result for which a direct proof may be challenging.

Corollary 5.10.

Let the probability kernel {f(x|θ)}θΘ\{f(x|\theta)\}_{\theta\in\Theta} be in a full rank exponential family of distributions as in Corollary 5.9 and suppose that all conditions there hold. Then for any k01k_{0}\geq 1 and for any G0=i=1k0pi0δθi0k0(Θ)G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}), the quantities n1(G0)n_{1}(G_{0}), n2(G0)n_{2}(G_{0}), N¯1(G0)\underline{N}_{1}(G_{0}) and N¯2(G0)\underline{N}_{2}(G_{0}) coincide and are finite. Moreover,

i=1k0(aiθn=1Nf(xn|θi0)+bin=1Nf(xn|θi0))=0,Nμa.e.(x1,,xN)𝔛N\sum_{i=1}^{k_{0}}\left(a_{i}^{\top}\nabla_{\theta}\prod_{n=1}^{{N}}f(x_{n}|\theta_{i}^{0})+b_{i}\prod_{n=1}^{{N}}f(x_{n}|\theta_{i}^{0})\right)=0,\quad\bigotimes^{N}\mu-a.e.\ (x_{1},\ldots,x_{N})\in\mathfrak{X}^{{N}} (29)

has only the zero solution:

bi=0 and ai=𝟎q,1ik0b_{i}=0\in\mathbb{R}\text{ and }a_{i}=\bm{0}\in\mathbb{R}^{q},\quad\forall 1\leq i\leq k_{0}

if and only if Nn1(G0)N\geq n_{1}(G_{0}).

Corollary 5.10 establishes that for full rank exponential families of distributions specified in Corollary 5.9 with full rank Jacobian of η(θ)\eta(\theta), the N{N}-product system (29) exhibits a finite phase transition at n1(G0)n_{1}(G_{0}): the system of equations (29) has a nonzero solution when N<n1(G0){N}<n_{1}(G_{0}), and as soon as Nn1(G0){N}\geq n_{1}(G_{0}) it has only the zero solution. This gives another characterization of n1(G0)n_{1}(G_{0}) defined in (24) for such exponential families, which also provides a way to compute n1(G0)=N¯1(G0)=n2(G0)=N¯2(G0)n_{1}(G_{0})=\underline{{N}}_{1}(G_{0})=n_{2}(G_{0})=\underline{{N}}_{2}(G_{0}). A byproduct is that n1(G0)n_{1}(G_{0}) does not depend on the weights pi0p_{i}^{0} of G0G_{0}, since (29) depends only on the atoms θi0\theta_{i}^{0}.

We next demonstrate two non-trivial examples of mixture models that are either non-identifiable or only weakly identifiable when N=1{N}=1, but become first-identifiable by taking products. We work out the details of calculating n0(G0)n_{0}(G_{0}) and n1(G0)n_{1}(G_{0}); these should serve as convincing illustrations of the discussion at the end of Section 4.

Example 5.11 (Bernoulli kernel).

Consider the Bernoulli distribution f(x|θ)=θx(1θ)1xf(x|\theta)=\theta^{x}(1-\theta)^{1-x} with parameter space Θ=(0,1)\Theta=(0,1). Here the family is defined on 𝔛=\mathfrak{X}=\mathbb{R} and the dominating measure is μ=δ0+δ1\mu=\delta_{0}+\delta_{1}. It can be written in exponential form as in Lemma 4.16 with η(θ)=lnθln(1θ)\eta(\theta)=\ln\theta-\ln(1-\theta) and T(x)=xT(x)=x. It is easy to check that η(θ)=1θ(1θ)>0\eta^{\prime}(\theta)=\frac{1}{\theta(1-\theta)}>0 and thus all conditions in Lemma 4.16, Corollary 5.9 and Corollary 5.10 are satisfied. Thus any of those three results can be applied. In particular we may use the characterization of n1(G0)n_{1}(G_{0}) in Corollary 5.10 to compute n1(G0)n_{1}(G_{0}).

For the nn-fold product, the density fn(x1,x2,,xn|θ):==1nf(x|θ)=θ=1nx(1θ)n=1nxf_{n}(x_{1},x_{2},\ldots,x_{n}|\theta):=\prod_{\ell=1}^{n}f(x_{\ell}|\theta)=\theta^{\sum_{\ell=1}^{n}x_{\ell}}(1-\theta)^{n-\sum_{\ell=1}^{n}x_{\ell}}. Then θfn(x1,,xn|θ)\frac{\partial}{\partial\theta}f_{n}(x_{1},\dots,x_{n}|\theta) is

(=1nx)θ=1nx1(1θ)n=1nx(n=1nx)θ=1nx(1θ)n=1nx1.\left(\sum_{\ell=1}^{n}x_{\ell}\right)\theta^{\sum_{\ell=1}^{n}x_{\ell}-1}(1-\theta)^{n-\sum_{\ell=1}^{n}x_{\ell}}-\left(n-\sum_{\ell=1}^{n}x_{\ell}\right)\theta^{\sum_{\ell=1}^{n}x_{\ell}}(1-\theta)^{n-\sum_{\ell=1}^{n}x_{\ell}-1}.
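As a sanity check, the displayed derivative can be verified symbolically; below is a minimal sketch using sympy, with the product length n = 3 chosen arbitrarily:

```python
import itertools
import sympy as sp

n = 3                                   # arbitrary small product length
theta = sp.symbols('theta', positive=True)
xs = sp.symbols('x1:4')                 # x1, x2, x3
s = sum(xs)                             # sum of the observations
f_n = theta**s * (1 - theta)**(n - s)   # n-fold Bernoulli product density
# the displayed formula for the theta-derivative
claimed = (s * theta**(s - 1) * (1 - theta)**(n - s)
           - (n - s) * theta**s * (1 - theta)**(n - s - 1))
diff_expr = sp.diff(f_n, theta) - claimed
# verify on the support {0,1}^n of the product kernel
for vals in itertools.product([0, 1], repeat=n):
    assert sp.simplify(diff_expr.subs(dict(zip(xs, vals)))) == 0
```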

We now compute n1(G)n_{1}(G) for any G=i=1kpiδθik(Θ)G=\sum_{i=1}^{k}p_{i}\delta_{\theta_{i}}\in\mathcal{E}_{k}(\Theta). Notice the support of ff is {0,1}\{0,1\} and hence the support of fnf_{n} is {0,1}n\{0,1\}^{n}. Thus (29) with k0k_{0}, NN, and θi0\theta_{i}^{0} replaced respectively by kk, nn and θi\theta_{i} becomes a system of n+1n+1 linear equations: j==1nx{0}[n]\forall j=\sum_{\ell=1}^{n}x_{\ell}\in\{0\}\cup[n]

i=1kai(j(θi)j1(1θi)nj(nj)(θi)j(1θi)nj1)+i=1kbi(θi)j(1θi)nj=0.\sum_{i=1}^{k}a_{i}\left(j(\theta_{i})^{j-1}(1-\theta_{i})^{n-j}-(n-j)(\theta_{i})^{j}(1-\theta_{i})^{n-j-1}\right)+\sum_{i=1}^{k}b_{i}(\theta_{i})^{j}(1-\theta_{i})^{n-j}=0. (30)

As a system of n+1n+1 linear equations with 2k2k unknown variables, it has nonzero solutions when n+1<2kn+1<2k. Thus n1(G)2k1n_{1}(G)\geq 2k-1 for any Gk(Θ)G\in\mathcal{E}_{k}(\Theta).

Let us now verify that n1(G)=2k1n_{1}(G)=2k-1 for any Gk(Θ)G\in\mathcal{E}_{k}(\Theta). Indeed, for any G=i=1kpiδθik(Θ)G=\sum_{i=1}^{k}p_{i}\delta_{\theta_{i}}\in\mathcal{E}_{k}(\Theta), the system of linear equations (30) with n=2k1n=2k-1 is Az=0A^{\top}z=0 with z=(b1,a1,,bk,ak)z=(b_{1},a_{1},\ldots,b_{k},a_{k})^{\top} and

Aij={fj(θm)i=2m1fj(θm)i=2m for j[2k],m[k],A_{ij}=\begin{cases}f_{j}(\theta_{m})&i=2m-1\\ f^{\prime}_{j}(\theta_{m})&i=2m\end{cases}\text{ for }j\in[2k],m\in[k],

where fj(θ)=θj1(1θ)n(j1)f_{j}(\theta)=\theta^{j-1}(1-\theta)^{n-(j-1)} with n=2k1n=2k-1. By Lemma 5.12 d), det(A)=1α<βk(θαθβ)4\text{det}(A)=\prod_{1\leq\alpha<\beta\leq k}(\theta_{\alpha}-\theta_{\beta})^{4}, with the convention 11 when k=1k=1. Thus AA is invertible and the system of linear equations (30) with n=2k1n=2k-1 has only the zero solution. Hence, by Corollary 5.10, n1(G)2k1n_{1}(G)\leq 2k-1. Combined with the conclusion of the last paragraph, n1(G)=2k1n_{1}(G)=2k-1.

In Section C.3 we also prove that n0(G,k(Θ))=2k1n_{0}(G,\cup_{\ell\leq k}\mathcal{E}_{\ell}(\Theta))=2k-1 for any Gk(Θ)G\in\mathcal{E}_{k}(\Theta). \Diamond
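The rank argument above can also be checked numerically. The sketch below (the atoms 0.2, 0.5, 0.8 are an arbitrary choice) builds the coefficient matrix of the system (30) and confirms that it has full rank 2k when n = 2k − 1, so only the zero solution exists:

```python
import numpy as np

def bernoulli_system(thetas, n):
    """Coefficient matrix of system (30): rows j = 0..n, columns (a_i, b_i) per atom."""
    k = len(thetas)
    M = np.zeros((n + 1, 2 * k))
    for i, t in enumerate(thetas):
        for j in range(n + 1):
            # coefficient of a_i (theta-derivative of the product density)
            M[j, 2 * i] = (j * t**(j - 1) * (1 - t)**(n - j)
                           - (n - j) * t**j * (1 - t)**(n - j - 1))
            # coefficient of b_i (the product density itself)
            M[j, 2 * i + 1] = t**j * (1 - t)**(n - j)
    return M

thetas = [0.2, 0.5, 0.8]                   # k = 3 distinct atoms (arbitrary choice)
k = len(thetas)
M = bernoulli_system(thetas, n=2 * k - 1)  # 2k equations, 2k unknowns
assert np.linalg.matrix_rank(M) == 2 * k   # only the zero solution: n1(G) <= 2k - 1
```

With n = 2k − 2 the matrix has fewer rows than columns, so a nonzero solution trivially exists, matching n1(G) = 2k − 1.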

The next lemma concerns the determinant of a class of generalized Vandermonde matrices. Its part d) is used in the previous example on the Bernoulli kernel.

Lemma 5.12.

Let x,yx,y\in\mathbb{R}.

  1. a)

    Let f(x)f(x) be a polynomial. Define q(1)(x,y)=f(x)f(y)xyq^{(1)}(x,y)=\frac{f(x)-f(y)}{x-y}, q(2)(x,y)=f(x)f(y)xyq^{(2)}(x,y)=\frac{f^{\prime}(x)-f^{\prime}(y)}{x-y}, q¯(2)(x,y)=q(1)(x,y)f(y)xy\bar{q}^{(2)}(x,y)=\frac{q^{(1)}(x,y)-f^{\prime}(y)}{x-y}, and q¯(3)(x,y)=q¯(2)(x,y)12q(2)(x,y)xy\bar{q}^{(3)}(x,y)=\frac{\bar{q}^{(2)}(x,y)-\frac{1}{2}q^{(2)}(x,y)}{x-y}. Then q(1)(x,y)q^{(1)}(x,y), q(2)(x,y)q^{(2)}(x,y), q¯(2)(x,y)\bar{q}^{(2)}(x,y) and q¯(3)(x,y)\bar{q}^{(3)}(x,y) are all multivariate polynomials.

  2. b)

    Let fj(x)f_{j}(x) be a polynomial and fj(x)f^{\prime}_{j}(x) its derivative for j[2k]j\in[2k] for a positive integer kk. For x1,,xkx_{1},\ldots,x_{k}\in\mathbb{R} define A(k)(x1,,xk)(2k)×(2k)A^{(k)}(x_{1},\dots,x_{k})\in\mathbb{R}^{(2k)\times(2k)} by

    Aij(k)(x1,,xk)={fj(xm)i=2m1fj(xm)i=2m for j[2k],m[k].A_{ij}^{(k)}(x_{1},\dots,x_{k})=\begin{cases}f_{j}(x_{m})&i=2m-1\\ f^{\prime}_{j}(x_{m})&i=2m\end{cases}\text{ for }j\in[2k],m\in[k].

    Then for any k2k\geq 2, det(A(k)(x1,,xk))=gk(x1,,xk)1α<βk(xαxβ)4\text{det}(A^{(k)}(x_{1},\dots,x_{k}))=g_{k}(x_{1},\ldots,x_{k})\prod_{1\leq\alpha<\beta\leq k}(x_{\alpha}-x_{\beta})^{4} , where gkg_{k} is some multivariate polynomial.

  3. c)

    For the special case fj(x)=xj1f_{j}(x)=x^{j-1}, A(k)(x1,,xk)A^{(k)}(x_{1},\ldots,x_{k}) has determinant det(A(k)(x1,,xk))=1α<βk(xαxβ)4\text{det}(A^{(k)}(x_{1},\dots,x_{k}))=\prod_{1\leq\alpha<\beta\leq k}(x_{\alpha}-x_{\beta})^{4}, with the convention 11 when k=1k=1.

  4. d)

    For the special case fj(x)=fj(x|k)=xj1(1x)n(j1)f_{j}(x)=f_{j}(x|k)=x^{j-1}(1-x)^{n-(j-1)} with n=2k1n=2k-1, A(k)(x1,,xk)A^{(k)}(x_{1},\ldots,x_{k}) has determinant det(A(k)(x1,,xk))=1α<βk(xαxβ)4\text{det}(A^{(k)}(x_{1},\dots,x_{k}))=\prod_{1\leq\alpha<\beta\leq k}(x_{\alpha}-x_{\beta})^{4}, with the convention 11 when k=1k=1.
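Part c) can be verified symbolically for small k; a minimal sketch for k = 2 (the general case follows the same pattern):

```python
import sympy as sp

def confluent_vandermonde(xs):
    """A^{(k)} of Lemma 5.12 with f_j(x) = x^{j-1}: rows alternate f_j(x_m), f'_j(x_m)."""
    k = len(xs)
    A = sp.zeros(2 * k, 2 * k)
    for m, x in enumerate(xs):
        for j in range(2 * k):
            A[2 * m, j] = x**j                                # f_{j+1}(x_m) = x^j
            A[2 * m + 1, j] = j * x**(j - 1) if j > 0 else 0  # f'_{j+1}(x_m)
    return A

x1, x2 = sp.symbols('x1 x2')
det = sp.factor(confluent_vandermonde([x1, x2]).det())
# part c) for k = 2: determinant equals (x1 - x2)^4 (the power 4 makes the sign immaterial)
assert sp.simplify(det - (x1 - x2)**4) == 0
```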

Example 5.13 (Continuation on two-parameter gamma kernel).

Consider the gamma distribution f(x|α,β)f(x|\alpha,\beta) discussed in Example 4.17. Let k02k_{0}\geq 2. By Example 4.17, for any G0k0(Θ)\𝒢G_{0}\in\mathcal{E}_{k_{0}}(\Theta)\backslash\mathcal{G} we have n1(G0)=1n_{1}(G_{0})=1, while for any G0𝒢G_{0}\in\mathcal{G}, where we recall that 𝒢\mathcal{G} denotes the pathological subset of the gamma mixture’s parameter space,

lim infGW1G0Gk0(Θ)h(PG,PG0)D1(G,G0)=lim infGW1G0Gk0(Θ)V(PG,PG0)D1(G,G0)=0.\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{h(P_{G},P_{G_{0}})}{D_{1}(G,G_{0})}=\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G},P_{G_{0}})}{D_{1}(G,G_{0})}=0.

This means n1(G0)2n_{1}(G_{0})\geq 2 for G0𝒢G_{0}\in\mathcal{G}.

We now show that n1(G0)2n_{1}(G_{0})\leq 2 for every G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta), and hence n1(G0)=2n_{1}(G_{0})=2 for G0𝒢G_{0}\in\mathcal{G}. Let

f2(x1,x2|α,β):=f(x1|α,β)f(x2|α,β)=β2α(Γ(α))2(x1x2)α1eβ(x1+x2)𝟏(0,)2(x1,x2)f_{2}(x_{1},x_{2}|\alpha,\beta):=f(x_{1}|\alpha,\beta)f(x_{2}|\alpha,\beta)=\frac{\beta^{2\alpha}}{(\Gamma(\alpha))^{2}}(x_{1}x_{2})^{\alpha-1}e^{-\beta(x_{1}+x_{2})}\bm{1}_{(0,\infty)^{2}}(x_{1},x_{2})

be the density of the 22-fold product w.r.t. Lebesgue measure on 2\mathbb{R}^{2}. Let g(α,β)=(Γ(α))2/β2αg(\alpha,\beta)=(\Gamma(\alpha))^{2}/\beta^{2\alpha}, which is a differentiable function on Θ\Theta, and let f~2(x1,x2|α,β):=g(α,β)f2(x1,x2|α,β)\tilde{f}_{2}(x_{1},x_{2}|\alpha,\beta):=g(\alpha,\beta)f_{2}(x_{1},x_{2}|\alpha,\beta) be the density with the normalization constant removed. Note that αf~2(x1,x2|α,β)=f~2(x1,x2|α,β)ln(x1x2)\frac{\partial}{\partial\alpha}\tilde{f}_{2}(x_{1},x_{2}|\alpha,\beta)=\tilde{f}_{2}(x_{1},x_{2}|\alpha,\beta)\ln(x_{1}x_{2}) and βf~2(x1,x2|α,β)=(x1+x2)f~2(x1,x2|α,β)\frac{\partial}{\partial\beta}\tilde{f}_{2}(x_{1},x_{2}|\alpha,\beta)=-(x_{1}+x_{2})\tilde{f}_{2}(x_{1},x_{2}|\alpha,\beta). Then (9a) with ff replaced by f~2\tilde{f}_{2} is

i=1k0(ai(α)ln(x1x2)ai(β)(x1+x2)+bi)(x1x2)αi1eβi(x1+x2)=0.\sum_{i=1}^{k_{0}}\left(a^{(\alpha)}_{i}\ln(x_{1}x_{2})-a^{(\beta)}_{i}(x_{1}+x_{2})+b_{i}\right)(x_{1}x_{2})^{\alpha_{i}-1}e^{-\beta_{i}(x_{1}+x_{2})}=0. (31)

Let i=1k{βi}={β1,β2,,βk}\bigcup_{i=1}^{k}\{\beta_{i}\}=\{\beta^{\prime}_{1},\beta^{\prime}_{2},\cdots,\beta^{\prime}_{k^{\prime}}\} with β1<β2<<βk\beta^{\prime}_{1}<\beta^{\prime}_{2}<\ldots<\beta^{\prime}_{k^{\prime}}, where kk^{\prime} is the number of distinct elements. Define I(β)={i[k]|βi=β}I(\beta^{\prime})=\{i\in[k]|\beta_{i}=\beta^{\prime}\}. Then (31) becomes, for μ\mu-a.e. x1,x2(0,)x_{1},x_{2}\in(0,\infty),

0=\displaystyle 0= j=1k(iI(βj)(ai(α)ln(x1x2)ai(β)(x1+x2)+bi)(x1x2)αi1)eβj(x1+x2)\displaystyle\sum_{j=1}^{k^{\prime}}\left(\sum_{i\in I(\beta^{\prime}_{j})}\left(a^{(\alpha)}_{i}\ln(x_{1}x_{2})-a^{(\beta)}_{i}(x_{1}+x_{2})+b_{i}\right)(x_{1}x_{2})^{\alpha_{i}-1}\right)e^{-\beta^{\prime}_{j}(x_{1}+x_{2})}
=\displaystyle= j=1keβjx2eβjx1(iI(βj)ai(α)(x1x2)αi1ln(x1)\displaystyle\sum_{j=1}^{k^{\prime}}e^{-\beta^{\prime}_{j}x_{2}}e^{-\beta^{\prime}_{j}x_{1}}\left(\sum_{i\in I(\beta^{\prime}_{j})}a^{(\alpha)}_{i}(x_{1}x_{2})^{\alpha_{i}-1}\ln(x_{1})\right.
+iI(βj)(ai(α)ln(x2)ai(β)(x1+x2)+bi)(x1x2)αi1)\displaystyle\left.+\sum_{i\in I(\beta^{\prime}_{j})}\left(a^{(\alpha)}_{i}\ln(x_{2})-a^{(\beta)}_{i}(x_{1}+x_{2})+b_{i}\right)(x_{1}x_{2})^{\alpha_{i}-1}\right)

Fix any x2x_{2} in the μ\mu-a.e. set on which the preceding equation holds. By Lemma B.3 b), for any j[k]j\in[k^{\prime}], iI(βj)ai(α)(x1x2)αi10\sum_{i\in I(\beta^{\prime}_{j})}a^{(\alpha)}_{i}(x_{1}x_{2})^{\alpha_{i}-1}\equiv 0 for any x10x_{1}\neq 0. Since the αi\alpha_{i} are distinct for iI(βj)i\in I(\beta^{\prime}_{j}) and x2>0x_{2}>0, ai(α)=0a_{i}^{(\alpha)}=0 for any iI(βj)i\in I(\beta^{\prime}_{j}) and any j[k]j\in[k^{\prime}]. That is, ai(α)=0a_{i}^{(\alpha)}=0 for any i[k]i\in[k]. Analogously, fixing x1x_{1} produces ai(β)=0a_{i}^{(\beta)}=0 for any i[k]i\in[k]. Plugging these back into the preceding display, one obtains for μ\mu-a.e. x1,x2(0,)x_{1},x_{2}\in(0,\infty)

0=j=1k(iI(βj)bi(x1x2)αi1)eβjx2eβjx10=\sum_{j=1}^{k^{\prime}}\left(\sum_{i\in I(\beta^{\prime}_{j})}b_{i}(x_{1}x_{2})^{\alpha_{i}-1}\right)e^{-\beta^{\prime}_{j}x_{2}}e^{-\beta^{\prime}_{j}x_{1}}

Fixing any x2x_{2} in the μ\mu-a.e. set on which the preceding equation holds, apply Lemma B.3 b) again to obtain bi=0b_{i}=0 for i[k]i\in[k]. Thus (31) for any Gk(Θ)G\in\mathcal{E}_{k}(\Theta) has only the zero solution. By Lemma 4.16, for G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta)

lim infGW1G0Gk0(Θ)V(PG,2,PG0,2)D1(G,G0)>0.\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,2},P_{G_{0},2})}{D_{1}(G,G_{0})}>0.

Thus n1(G0)2n_{1}(G_{0})\leq 2, and hence n1(G0)=2n_{1}(G_{0})=2 for any G0𝒢G_{0}\in\mathcal{G}.

Following an analogous analysis, one can show that {f(x|θi)}i=1k\{f(x|\theta_{i})\}_{i=1}^{k} are linearly independent for any kk and any distinct θ1,,θkΘ\theta_{1},\ldots,\theta_{k}\in\Theta. The linear independence immediately implies that pGp_{G} is identifiable on j=1j(Θ)\bigcup_{j=1}^{\infty}\mathcal{E}_{j}(\Theta), i.e. for any Gk(Θ)G\in\mathcal{E}_{k}(\Theta) and any Gk(Θ)G^{\prime}\in\mathcal{E}_{k^{\prime}}(\Theta) with GGG\not=G^{\prime}, PGPGP_{G}\not=P_{G^{\prime}}. Thus, n0(G,j=1j(Θ))=1n_{0}(G,\bigcup_{j=1}^{\infty}\mathcal{E}_{j}(\Theta))=1 for any Gj=1j(Θ)G\in\bigcup_{j=1}^{\infty}\mathcal{E}_{j}(\Theta). \Diamond
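The linear independence driving this example can be probed numerically: evaluate the functions f̃2 and their partial derivatives at generic points and check that the resulting matrix has full column rank. The sketch below uses two arbitrary atoms (1.5, 1.0) and (3.0, 0.5) and random evaluation points:

```python
import numpy as np

rng = np.random.default_rng(0)
atoms = [(1.5, 1.0), (3.0, 0.5)]            # two distinct (alpha, beta) pairs (arbitrary)
pts = rng.uniform(0.1, 5.0, size=(40, 2))   # generic points (x1, x2) in (0, 5)^2

cols = []
for a, b in atoms:
    base = (pts[:, 0] * pts[:, 1])**(a - 1) * np.exp(-b * (pts[:, 0] + pts[:, 1]))
    cols += [base,                                  # f~2 itself
             base * np.log(pts[:, 0] * pts[:, 1]),  # d/d alpha
             -base * (pts[:, 0] + pts[:, 1])]       # d/d beta
M = np.column_stack(cols)                   # 40 x 6 evaluation matrix
assert np.linalg.matrix_rank(M) == len(cols)   # full column rank: no linear relation
```

A rank-deficient M would witness a nonzero solution of (31); full column rank at generic points is consistent with the zero-solution conclusion above.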

The above examples demonstrate the remarkable benefits of having repeated (exchangeable) measurements: via the N{N}-fold product kernel j=1Nf(xj|θ)\prod_{j=1}^{{N}}f(x_{j}|\theta) for sufficiently large N{N}, one can completely erase the effect of parameter non-identifiability in Bernoulli mixtures, and the effect of weak identifiability in the pathological subset of the parameter space in two-parameter gamma mixtures. We have also seen that it is challenging to determine the 0- or 1-identifiable lengths even for these simple examples of kernels. It is even more so when we move to a broader class of probability kernels well beyond the exponential families.

5.3 General probability kernels

Unlike Section 5.2, which specializes to probability kernels in the exponential families, in this section no such assumption will be required. In fact, we shall not require that the family of probability distributions {Pθ}θΘ\{P_{\theta}\}_{\theta\in\Theta} on 𝔛\mathfrak{X} admit a density function. Since the primary object of inference is the parameter θΘq\theta\in\Theta\subset\mathbb{R}^{q}, the assumptions on the kernel PθP_{\theta} will center on the existence of a measurable map T:(𝔛,𝒜)(s,(s))T:(\mathfrak{X},\mathcal{A})\to(\mathbb{R}^{s},\mathcal{B}(\mathbb{R}^{s})) for some sqs\geq q, and regularity conditions on the push-forward measure on s\mathbb{R}^{s}: T#Pθ:=PθT1T_{\#}P_{\theta}:=P_{\theta}\circ T^{-1}. The use of the push-forward measure to prove (23) stems from the observation that the variational distance between two distributions is bounded below by the variational distance between their push-forward measures, which is equivalent to restricting to a subclass of the Borel sets in the definition of the variational distance.

Definition 5.14 (Admissible transform).

A Borel measurable map T:𝔛sT:\mathfrak{X}\to\mathbb{R}^{s} is admissible with respect to a set Θ1Θ\Theta_{1}\subset\Theta^{\circ} if for each θ0Θ1\theta_{0}\in\Theta_{1} there exist γ>0\gamma>0 and r1r\geq 1 such that TT satisfies the following three properties.

  1. (A1)

    (Moment condition) For θB(θ0,γ)Θ\theta\in B(\theta_{0},\gamma)\subset\Theta^{\circ}, the open ball centered at θ0\theta_{0} with radius γ\gamma, suppose λ(θ)=λθ=𝔼θTX1\lambda(\theta)=\lambda_{\theta}=\mathbb{E}_{\theta}TX_{1} and Λθ:=𝔼θ(TX1𝔼θTX1)(TX1𝔼θTX1)\Lambda_{\theta}:=\mathbb{E}_{\theta}(TX_{1}-\mathbb{E}_{\theta}TX_{1})(TX_{1}-\mathbb{E}_{\theta}TX_{1})^{\top} exist where X1PθX_{1}\sim P_{\theta}. Moreover, Λθ\Lambda_{\theta} is positive definite on B(θ0,γ)B(\theta_{0},\gamma) and is continuous at θ0\theta_{0}.

  2. (A2)

    (Exchangeability of partial derivatives of characteristic functions) Denote by ϕT(ζ|θ)\phi_{T}(\zeta|\theta) the characteristic function of the pushforward probability measure T#PθT_{\#}P_{\theta} on s\mathbb{R}^{s}, i.e., ϕT(ζ|θ):=𝔼θeiζ,TX1\phi_{T}(\zeta|\theta):=\mathbb{E}_{\theta}e^{i\langle\zeta,TX_{1}\rangle}, where X1PθX_{1}\sim P_{\theta}. ϕT(ζ|θ)θ(i)\frac{\partial\phi_{T}(\zeta|\theta)}{\partial\theta^{(i)}} exists in B(θ0,γ)B(\theta_{0},\gamma) and as a function of ζ\zeta it is twice continuously differentiable on s\mathbb{R}^{s} with derivatives satisfying: θB(θ0,γ)\forall\theta\in B(\theta_{0},\gamma)

2ϕT(ζ|θ)ζ(j)θ(i)=2ϕT(ζ|θ)θ(i)ζ(j),3ϕT(ζ|θ)ζ()ζ(j)θ(i)=3ϕT(ζ|θ)θ(i)ζ()ζ(j),ζs,j,[s],i[q]\frac{\partial^{2}\phi_{T}(\zeta|\theta)}{\partial\zeta^{(j)}\partial\theta^{(i)}}=\frac{\partial^{2}\phi_{T}(\zeta|\theta)}{\partial\theta^{(i)}\partial\zeta^{(j)}},\ \frac{\partial^{3}\phi_{T}(\zeta|\theta)}{\partial\zeta^{(\ell)}\partial\zeta^{(j)}\partial\theta^{(i)}}=\frac{\partial^{3}\phi_{T}(\zeta|\theta)}{\partial\theta^{(i)}\partial\zeta^{(\ell)}\partial\zeta^{(j)}},\quad\forall\zeta\in\mathbb{R}^{s},\ j,\ell\in[s],\ i\in[q]

where the right hand sides of both equations exist.

  3. (A3)

(Continuity and integrability conditions of characteristic function) ϕT(ζ|θ)\phi_{T}(\zeta|\theta) as a function of θ\theta is twice continuously differentiable in B(θ0,γ)B(\theta_{0},\gamma). The following hold for any i[q],j[s]i\in[q],j\in[s]:

    supθB(θ0,γ)max{supζs|ϕT(ζ|θ)θ(i)|,supζ2<1|2ϕT(ζ|θ)ζ(j)θ(i)|,supζ2<1|2ϕT(ζ|θ)θ(j)θ(i)|}<,\sup_{\theta\in B(\theta_{0},\gamma)}\max\left\{\sup_{\zeta\in\mathbb{R}^{s}}\left|\frac{\partial\phi_{T}(\zeta|\theta)}{\partial\theta^{(i)}}\right|,\sup_{\|\zeta\|_{2}<1}\left|\frac{\partial^{2}\phi_{T}(\zeta|\theta)}{\partial\zeta^{(j)}\partial\theta^{(i)}}\right|,\sup_{\|\zeta\|_{2}<1}\left|\frac{\partial^{2}\phi_{T}(\zeta|\theta)}{\partial\theta^{(j)}\partial\theta^{(i)}}\right|\right\}<\infty, (32)

    and for any i,j[q]i,j\in[q],

    supθB(θ0,γ)s|ϕT(ζ|θ)|r(1+|2ϕT(ζ|θ)θ(j)θ(i)|)dζ<.\sup_{\theta\in B(\theta_{0},\gamma)}\int_{\mathbb{R}^{s}}\left|\phi_{T}(\zeta|\theta)\right|^{r}\left(1+\left|\frac{\partial^{2}\phi_{T}(\zeta|\theta)}{\partial\theta^{(j)}\partial\theta^{(i)}}\right|\right)d\zeta<\infty. (33)
Remark 5.15.

The above definition of the admissible transform TT contains relatively mild regularity conditions concerning continuity, differentiability and integrability. In particular, (A1) guarantees the existence of the first two moments of TX1TX_{1}, which are required for the application of a central limit theorem as outlined in Section 2.2. (A2) and (A3) are used in the essential technical lemma (Lemma D.1) to guarantee the following statement in Section 2.2: for any sequence zzz_{\ell}\rightarrow z, there holds Ψ(z)Ψ(z)\Psi_{\ell}(z_{\ell})\rightarrow\Psi(z) for certain functions Ψ\Psi_{\ell} and Ψ:q\Psi:\mathbb{R}^{q}\rightarrow\mathbb{R}. The inequality (33) is also used to obtain the existence of the Fourier inversion formula (more specifically, to imply existence of a density of j=1NTXj\sum_{j=1}^{N}TX_{j} with respect to Lebesgue measure for NrN\geq r). Since the characteristic function has modulus at most 11, the larger the rr, the smaller the left hand side of (33). Here we only require existence of some r1r\geq 1 in (33), which is a mild condition. For more discussion on the role of rr, see Theorem 2 in Section 5, Chapter XV of [14]. The conditions of the admissible transform are typically straightforward to verify if a closed-form formula of ϕT(ζ|θ)\phi_{T}(\zeta|\theta) is available; examples will be provided in the sequel. \Diamond

Theorem 5.16.

Fix G0=i=1k0pi0δθi0k0(Θ)G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). Assume that for each θi0\theta_{i}^{0}, there exists a measurable transform Ti:(𝔛,𝒜)(si,(si))T_{i}:(\mathfrak{X},\mathcal{A})\to(\mathbb{R}^{s_{i}},\mathcal{B}(\mathbb{R}^{s_{i}})) that is admissible with respect to {θi0}i=1k0\{\theta_{i}^{0}\}_{i=1}^{k_{0}} with siqs_{i}\geq q such that 1) the mean map λi(θ)\lambda_{i}(\theta) of TiT_{i} defined in (A1) is identifiable at θi0\theta_{i}^{0} over the set {θi0}i=1k0\{\theta_{i}^{0}\}_{i=1}^{k_{0}}, i.e., λi(θj0)λi(θi0)\lambda_{i}(\theta_{j}^{0})\not=\lambda_{i}(\theta_{i}^{0}) for any j[k0]\{i}j\in[k_{0}]\backslash\{i\} and 2) the Jacobian matrix of λi\lambda_{i} is of full column rank at θi0\theta_{i}^{0}. Then (21) and (23) hold.

Note that the condition siqs_{i}\geq q is necessary for the Jacobian matrix of λi\lambda_{i}, which is of dimension si×qs_{i}\times q, to be of full column rank. The following corollary is useful when the admissible maps TiT_{i} are identical for all ii, which is the case in many (if not most) examples.

Corollary 5.17.

Fix G0=i=1k0pi0δθi0k0(Θ)G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). If there exists one measurable transform T:(𝔛,𝒜)(s,(s))T:(\mathfrak{X},\mathcal{A})\to(\mathbb{R}^{s},\mathcal{B}(\mathbb{R}^{s})) that is admissible with respect to {θi0}i=1k0\{\theta_{i}^{0}\}_{i=1}^{k_{0}} with sqs\geq q such that 1) the mean map λ(θ)\lambda(\theta) of TT defined in (A1) is identifiable over the set {θi0}i=1k0\{\theta_{i}^{0}\}_{i=1}^{k_{0}}, i.e., λ(θj0)λ(θi0)\lambda(\theta_{j}^{0})\not=\lambda(\theta_{i}^{0}) for any distinct i,j[k0]i,j\in[k_{0}] and 2) the Jacobian matrix of λ\lambda is of full column rank at θi0\theta_{i}^{0} for any i[k0]i\in[k_{0}]. Then (21) and (23) hold.

The proofs of Theorem 5.8 and Theorem 5.16 contain a number of potentially useful techniques and are deferred to Section D. We make additional remarks.

Remark 5.18 (Choices of admissible transform TT).

If the probability kernel PθP_{\theta} has a smooth closed-form expression for the characteristic function and 𝔛\mathfrak{X} is of dimension no less than the dimension of Θ\Theta, one may take TT to be the identity map (see Example 5.20 in the sequel). If 𝔛\mathfrak{X} is of dimension less than the dimension of Θ\Theta, then one may take TT to be a moment map (see Example 5.21 and Example 5.22). On the other hand, if the probability kernel does not have a smooth closed-form expression for the characteristic function, then one may consider TT to be the composition of moment maps and indicator functions of suitable subsets of 𝔛\mathfrak{X} (see Example 5.23). Unlike the three previous examples, the chosen TT in Example 5.23 depends on the atoms {θi0}i=1k0\{\theta_{i}^{0}\}_{i=1}^{k_{0}} of G0G_{0}. All these examples were obtained by constructing a single admissible map TT following Corollary 5.17. There might exist cases for which it is difficult to come up with a single admissible map TT that satisfies the conditions of Corollary 5.17; for such cases Theorem 5.16 will potentially be more useful. \Diamond

Remark 5.19 (Comparisons between Theorem 5.16 and Theorem 5.8).

While Theorem 5.16 appears more powerful than Theorem 5.8, the latter is significant in its own right. Indeed, Theorem 5.16 provides an inverse bound for a very broad range of probability kernels, but it seems not straightforward to apply it to non-degenerate discrete distributions on lattice points, such as Poisson, Bernoulli, or geometric distributions. The reason is that the characteristic function of a non-degenerate discrete distribution on lattice points is periodic (see Lemma 4 in Chapter XV, Section 1 of [14]), and hence does not belong to LrL^{r} for any r1r\geq 1. Thus, it does not satisfy (A3) for TT the identity map in the definition of the admissible transform. In order to apply Theorem 5.16 to such distributions one has to come up with suitable measurable transforms TT which induce distributions over a countable support that is not a set of lattice points. In contrast, Theorem 5.8 is readily applicable to discrete distributions that are in the exponential family, including Poisson, Bernoulli, geometric distributions, etc. \Diamond
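The periodicity obstruction is easy to see concretely. The following sketch (our numerical illustration, not part of the original argument) checks that the characteristic function of a Poisson distribution is 2π2\pi-periodic, so its modulus cannot vanish at infinity and the function is not in LrL^{r} for any r1r\geq 1.

```python
import numpy as np

# Characteristic function of Poisson(lam): phi(zeta) = exp(lam * (e^{i zeta} - 1)),
# which is 2*pi-periodic in zeta since e^{i zeta} is.
lam = 3.0  # arbitrary illustrative rate

def phi(zeta):
    return np.exp(lam * (np.exp(1j * zeta) - 1))

zeta = np.linspace(-10.0, 10.0, 1001)
# Periodicity: |phi| repeats forever, hence cannot decay at infinity,
# so phi fails the L^r integrability required by (A3) for T the identity map.
assert np.allclose(phi(zeta + 2 * np.pi), phi(zeta))
```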

5.4 Examples of non-standard probability kernels

The power of Theorem 5.16 lies in its applicability to classes of kernels that do not belong to the exponential families.

Example 5.20 (Continuation on uniform probability kernel).

In Example 4.7 this example has been shown to satisfy the inverse bounds (11) and (12) for any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta). Note this family is not an exponential family, so neither Theorem 5.8 nor Corollary 5.9 is applicable. Take TT in Corollary 5.17 to be the identity map. Then λ(θ)=θ2\lambda(\theta)=\frac{\theta}{2}, Λθ=θ212\Lambda_{\theta}=\frac{\theta^{2}}{12}. So condition (A1) is satisfied. The characteristic function is

ϕT(ζ|θ)=e𝒊ζθ1𝒊ζθ𝟏(ζ0)+𝟏(ζ=0).\phi_{T}(\zeta|\theta)=\frac{e^{\bm{i}\zeta\theta}-1}{\bm{i}\zeta\theta}\bm{1}(\zeta\not=0)+\bm{1}(\zeta=0).

One can then calculate

θϕT(ζ|θ)=\displaystyle\frac{\partial}{\partial\theta}\phi_{T}(\zeta|\theta)= e𝒊ζθ(e𝒊ζθ1(𝒊ζθ))𝒊ζθ2𝟏(ζ0),\displaystyle\frac{e^{\bm{i}\zeta\theta}(e^{-\bm{i}\zeta\theta}-1-(-\bm{i}\zeta\theta))}{\bm{i}\zeta\theta^{2}}\bm{1}(\zeta\not=0),
2ζθϕT(ζ|θ)=\displaystyle\frac{\partial^{2}}{\partial\zeta\partial\theta}\phi_{T}(\zeta|\theta)= e𝒊ζθ(e𝒊ζθ1(𝒊ζθ)(𝒊ζθ)2)𝒊ζ2θ2𝟏(ζ0)+𝒊2𝟏(ζ=0),\displaystyle\frac{-e^{\bm{i}\zeta\theta}(e^{-\bm{i}\zeta\theta}-1-(-\bm{i}\zeta\theta)-(-\bm{i}\zeta\theta)^{2})}{\bm{i}\zeta^{2}\theta^{2}}\bm{1}(\zeta\not=0)+\frac{\bm{i}}{2}\bm{1}(\zeta=0),
2θ2ϕT(ζ|θ)=\displaystyle\frac{\partial^{2}}{\partial\theta^{2}}\phi_{T}(\zeta|\theta)= 2e𝒊ζθ(e𝒊ζθ1(𝒊ζθ)12(𝒊ζθ)2)𝒊ζθ3𝟏(ζ0),\displaystyle\frac{-2e^{\bm{i}\zeta\theta}(e^{-\bm{i}\zeta\theta}-1-(-\bm{i}\zeta\theta)-\frac{1}{2}(-\bm{i}\zeta\theta)^{2})}{\bm{i}\zeta\theta^{3}}\bm{1}(\zeta\not=0),

and verify the condition (A2). To verify (A3), the following inequality (see [35, (9.5)])

|e𝒊xk=0j(𝒊x)kk!|2|x|jj!\left|e^{\bm{i}x}-\sum_{k=0}^{j}\frac{(\bm{i}x)^{k}}{k!}\right|\leq 2\frac{|x|^{j}}{j!}

comes in handy. It then follows that

|θϕT(ζ|θ)|2θ,|2ζθϕT(ζ|θ)|32,|2θ2ϕT(ζ|θ)|2|ζ|θ.\displaystyle\left|\frac{\partial}{\partial\theta}\phi_{T}(\zeta|\theta)\right|\leq\frac{2}{\theta},\quad\left|\frac{\partial^{2}}{\partial\zeta\partial\theta}\phi_{T}(\zeta|\theta)\right|\leq\frac{3}{2},\quad\left|\frac{\partial^{2}}{\partial\theta^{2}}\phi_{T}(\zeta|\theta)\right|\leq\frac{2|\zeta|}{\theta}.

Then (32) holds. Finally, taking r=3r=3 one obtains

|ϕT(ζ|θ)|3(1+|2θ2ϕT(ζ|θ)|){1+2θ|ζ|18|ζ|3θ3(1+2|ζ|θ)|ζ|>1.|\phi_{T}(\zeta|\theta)|^{3}\left(1+\left|\frac{\partial^{2}}{\partial\theta^{2}}\phi_{T}(\zeta|\theta)\right|\right)\leq\begin{cases}1+\frac{2}{\theta}&|\zeta|\leq 1\\ \frac{8}{|\zeta|^{3}\theta^{3}}\left(1+\frac{2|\zeta|}{\theta}\right)&|\zeta|>1\end{cases}.

Thus (33) holds. We have then verified that the identity map TT is admissible on Θ\Theta.

It is easy to see that λ(θ)=θ/2\lambda(\theta)=\theta/2 is injective and that its Jacobian Jλ(θ)=12J_{\lambda}(\theta)=\frac{1}{2} is full rank. Then by Corollary 5.17, (21) and (23) hold for any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta) for any k01k_{0}\geq 1. Moreover, by Remark 5.2, n1(G0)=N¯1(G0)=n2(G0)=N¯2(G0)=1n_{1}(G_{0})=\underline{{N}}_{1}(G_{0})=n_{2}(G_{0})=\underline{{N}}_{2}(G_{0})=1 for any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta) for any k01k_{0}\geq 1. \Diamond
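The derivative bounds above are also easy to sanity-check numerically. A minimal sketch (our illustration, with an arbitrary choice θ=1.7\theta=1.7): it confirms the closed-form θ\theta-derivative of the Uniform(0,θ)(0,\theta) characteristic function against a finite difference and verifies the bound |θϕT|2/θ|\partial_{\theta}\phi_{T}|\leq 2/\theta.

```python
import numpy as np

def phi(zeta, theta):
    # characteristic function of Uniform(0, theta), for zeta != 0
    return (np.exp(1j * zeta * theta) - 1) / (1j * zeta * theta)

def dphi_dtheta(zeta, theta):
    # closed-form partial derivative in theta, from the display above
    return (np.exp(1j * zeta * theta)
            * (np.exp(-1j * zeta * theta) - 1 + 1j * zeta * theta)
            / (1j * zeta * theta**2))

theta = 1.7                              # arbitrary illustrative parameter value
zeta = np.linspace(0.05, 50.0, 2000)     # avoid zeta = 0, which is handled separately

# bound |d phi / d theta| <= 2 / theta from the text
assert np.abs(dphi_dtheta(zeta, theta)).max() <= 2 / theta + 1e-9

# cross-check the closed form against a central finite difference in theta
h = 1e-6
fd = (phi(zeta, theta + h) - phi(zeta, theta - h)) / (2 * h)
assert np.max(np.abs(fd - dphi_dtheta(zeta, theta))) < 1e-5
```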

Example 5.21 (Continuation on location-scale exponential kernel).

In Example 4.9 this example has been shown to satisfy (11) for any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta), while (12) fails for some G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta) for any k02k_{0}\geq 2. Note this family is not an exponential family, so neither Theorem 5.8 nor Corollary 5.9 is applicable. Take TT in Corollary 5.17 to be Tx=(x,x2)Tx=(x,x^{2})^{\top} as a map from 2\mathbb{R}\to\mathbb{R}^{2}. In Appendix C.5 we show that all conditions of Corollary 5.17 are satisfied and hence (21) and (23) hold for any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta) for any k01k_{0}\geq 1. Moreover, by Remark 5.2, n1(G0)=N¯1(G0)=1n_{1}(G_{0})=\underline{{N}}_{1}(G_{0})=1 for any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta) for any k01k_{0}\geq 1. Regarding n2(G0)n_{2}(G_{0}), for every k02k_{0}\geq 2, there exists some G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta) such that 1<n2(G0)<1<n_{2}(G_{0})<\infty. \Diamond

Example 5.22 (PθP_{\theta} is itself a mixture distribution).

We consider the situation where PθP_{\theta} is a rather complex object: it is itself a mixture distribution. With this we move from a standard mixture of product distributions to hierarchical models (i.e., mixtures of mixture distributions). Such models are central tools in Bayesian statistics. Theorem 5.8 and Corollary 5.9 are obviously not applicable in this example, which requires the full strength of Theorem 5.16 or Corollary 5.17. The application, however, is non-trivial, requiring the development of tools for evaluating the oscillatory integrals of interest. Such tools also prove useful in other contexts (such as Example 5.23). A full treatment is deferred to Section 7. \Diamond

Example 5.23 (PθP_{\theta} is a mixture of Dirichlet processes).

This example illustrates the applicability of our theory to models using probability kernels defined in abstract spaces. Such kernels are commonly found in nonparametric Bayesian literature [24, 17]. In particular, in our specification of mixture of product distributions we will employ Dirichlet processes as the basic building block [15, 4]. Full details are presented in Section 7.4. \Diamond

6 Posterior contraction of de Finetti mixing measures

The data are m{m} independent sequences of exchangeable observations, X[Ni]i=(Xi1,Xi2,,XiNi)𝔛NiX_{[{N}_{i}]}^{i}=(X_{i1},X_{i2},\cdots,X_{i{N}_{i}})\in\mathfrak{X}^{{N}_{i}} for i[m]i\in[{m}]. Each sequence X[Ni]iX_{[{N}_{i}]}^{i} is assumed to be a sample drawn from a mixture of Ni{N}_{i}-product distributions PG,NiP_{G,{N}_{i}} for some "true" de Finetti mixing measure G=G0k0(Θ)G=G_{0}\in\mathcal{E}_{k_{0}}(\Theta). The problem is to estimate G0G_{0} given the mm independent exchangeable sequences. A Bayesian statistician endows (k0(Θ),(k0(Θ)))(\mathcal{E}_{k_{0}}(\Theta),\mathcal{B}(\mathcal{E}_{k_{0}}(\Theta))) with a prior distribution Π\Pi and obtains the posterior distribution Π(dG|X[N1]1,,X[Nm]m)\Pi(dG|X^{1}_{[{N}_{1}]},\ldots,X^{{m}}_{[{N}_{m}]}) by Bayes’ rule, where (k0(Θ))\mathcal{B}(\mathcal{E}_{k_{0}}(\Theta)) is the Borel sigma algebra w.r.t. the D1D_{1} distance. In this section we study the asymptotic behavior of this posterior distribution as the amount of data m×N{m}\times{N} tends to infinity.

Suppose throughout this section that {Pθ}θΘ\{P_{\theta}\}_{\theta\in\Theta} has density {f(x|θ)}θΘ\{f(x|\theta)\}_{\theta\in\Theta} w.r.t. a σ\sigma-finite dominating measure μ\mu on 𝔛\mathfrak{X}; then PG,NiP_{G,{N}_{i}} for G=i=1k0piδθiG=\sum_{i=1}^{k_{0}}p_{i}\delta_{\theta_{i}} has density w.r.t. μ\mu:

pG,Ni(x¯)=i=1k0pij=1Nif(xj|θi),for x¯=(x1,x2,,xNi)𝔛Ni.p_{G,{N}_{i}}(\bar{x})=\sum_{i=1}^{k_{0}}p_{i}\prod_{j=1}^{{N}_{i}}f(x_{j}|\theta_{i}),\quad\text{for }\bar{x}=(x_{1},x_{2},\cdots,x_{{N}_{i}})\in\mathfrak{X}^{{N}_{i}}. (34)
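The density (34) multiplies Ni{N}_{i} likelihood terms per mixture component and is best evaluated on the log scale. A minimal sketch, assuming for illustration only a standard Gaussian kernel f(x|θ)=N(θ,1)f(x|\theta)=N(\theta,1):

```python
import numpy as np

def log_pGN(xbar, p, atoms):
    """log p_{G,N}(xbar) as in display (34), with illustrative kernel f(x|theta) = N(theta, 1)."""
    xbar = np.asarray(xbar)
    # sum of log f(x_j | theta_i) over the sequence, for each atom theta_i
    lp = np.array([np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (xbar - t) ** 2)
                   for t in atoms])
    a = np.log(p) + lp
    amax = a.max()                           # log-sum-exp for numerical stability
    return amax + np.log(np.exp(a - amax).sum())

# agreement with the naive product-then-sum evaluation on a short sequence
p, atoms, xbar = np.array([0.5, 0.5]), np.array([0.0, 1.0]), np.array([0.2, 0.3])
naive = sum(pi * np.prod(np.exp(-0.5 * (xbar - t) ** 2) / np.sqrt(2 * np.pi))
            for pi, t in zip(p, atoms))
assert np.isclose(np.exp(log_pGN(xbar, p, atoms)), naive)
```

The log-sum-exp trick matters here: for long sequences the per-component product of NN densities underflows in double precision, while the log-scale form remains stable.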

Then the density of X[Ni]iX_{[{N}_{i}]}^{i} conditioned on GG is pG,Ni()p_{G,{N}_{i}}(\cdot). Since Θ\Theta as a subset of q\mathbb{R}^{q} is separable, k0(Θ)\mathcal{E}_{k_{0}}(\Theta) is separable. Moreover, suppose the map θPθ\theta\mapsto P_{\theta} from (Θ,2)(\Theta,\|\cdot\|_{2}) to ({Pθ}θΘ,h(,))(\{P_{\theta}\}_{\theta\in\Theta},h(\cdot,\cdot)) is continuous, where h(,)h(\cdot,\cdot) is the Hellinger distance. Then the map GPG,NG\mapsto P_{G,N} from (k0(Θ),D1)(pG,N,h(,))(\mathcal{E}_{k_{0}}(\Theta),D_{1})\to(p_{G,{N}},h(\cdot,\cdot)) is also continuous by Lemma 8.2. Then by [2, Lemma 4.51], (x,G)pG,N(x)(x,G)\mapsto p_{G,{N}}(x) is measurable for each N{N}. Thus, the posterior distribution (a version of regular conditional distribution) is the random measure given by

Π(B|X[N1]1,,X[Nm]m)=Bi=1mpG,Ni(X[Ni]i)dΠ(G)k0(Θ)i=1mpG,Ni(X[Ni]i)dΠ(G),\Pi(B|X_{[{N}_{1}]}^{1},\ldots,X_{[{N}_{{m}}]}^{{m}})=\frac{\int_{B}\prod_{i=1}^{m}p_{G,{N}_{i}}(X_{[{N}_{i}]}^{i})d\Pi(G)}{\int_{\mathcal{E}_{k_{0}}(\Theta)}\prod_{i=1}^{m}p_{G,{N}_{i}}(X_{[{N}_{i}]}^{i})d\Pi(G)}, (35)

for any Borel measurable subset Bk0(Θ)B\subset\mathcal{E}_{k_{0}}(\Theta). For further details on why the last quantity is a valid posterior distribution, we refer to Section 1.3 in [17]. It is customary to express the above model equivalently in the hierarchical Bayesian fashion:

GΠ,θ1,θ2,,θm|Gi.i.d.G\displaystyle G\sim\Pi,\quad\theta_{1},\theta_{2},\cdots,\theta_{m}|G\overset{i.i.d.}{\sim}G
Xi1,Xi2,,XiNi|θii.i.d.f(x|θi)for i=1,,m.\displaystyle X_{i1},X_{i2},\cdots,X_{i{N}_{i}}|\theta_{i}\overset{i.i.d.}{\sim}f(x|\theta_{i})\quad\text{for }i=1,\cdots,{m}.

As above, the m{m} independent data sequences are denoted by X[Ni]i=(Xi1,,XiNi)𝔛NiX_{[{N}_{i}]}^{i}=(X_{i1},\cdots,X_{i{N}_{i}})\in\mathfrak{X}^{{N}_{i}} for i[m]i\in[{m}]. The following assumptions are required for the main theorems of this section.
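The hierarchical scheme above is straightforward to simulate. A hedged sketch, assuming for illustration a Gaussian kernel f(x|θ)=N(θ,1)f(x|\theta)=N(\theta,1), a fixed two-atom GG, and equal sequence lengths (all choices ours, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.3, 0.7])          # mixing proportions of G (illustrative)
atoms = np.array([-2.0, 2.0])     # atoms theta_1, theta_2 of G (illustrative)

def sample_sequences(m, N):
    """Draw m exchangeable sequences of length N from P_{G,N}."""
    labels = rng.choice(len(p), size=m, p=p)        # theta_i | G ~ G, i.i.d.
    means = atoms[labels][:, None]                  # one shared latent atom per sequence
    return means + rng.standard_normal((m, N))      # X_ij | theta_i ~ f(x|theta_i), i.i.d.

X = sample_sequences(m=5000, N=10)
# Within each sequence all N observations share one latent atom, so the
# per-sequence means cluster around -2 and 2 with weights (0.3, 0.7).
seq_means = X.mean(axis=1)
assert X.shape == (5000, 10)
assert abs((seq_means > 0).mean() - p[1]) < 0.03
```

The clustering of per-sequence means is exactly the exchangeable structure the posterior exploits: each sequence informs its own atom at rate N1/2N^{-1/2}, while the mm sequences jointly inform the mixing proportions.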

  1. (B1)

(Prior assumption) There is a prior measure Πθ\Pi_{\theta} on Θ1Θ\Theta_{1}\subset\Theta with its Borel sigma algebra possessing a density w.r.t. Lebesgue measure that is bounded away from zero and infinity, where Θ1\Theta_{1} is a compact subset of Θ\Theta. Define the k0k_{0}-probability simplex Δk0:={(p1,,pk0)+k0|i=1k0pi=1}\Delta^{k_{0}}:=\{(p_{1},\ldots,p_{k_{0}})\in\mathbb{R}_{+}^{k_{0}}|\sum_{i=1}^{k_{0}}p_{i}=1\}. Suppose there is a prior measure Πp\Pi_{p} on the k0k_{0}-probability simplex possessing a density w.r.t. Lebesgue measure on k01\mathbb{R}^{k_{0}-1} that is bounded away from zero and infinity. Then Πp×Πθk0\Pi_{p}\times\Pi_{\theta}^{k_{0}} is a measure on {((p1,θ1),,(pk0,θk0))|pi0,θiΘ1,i=1k0pi=1}\{((p_{1},\theta_{1}),\ldots,(p_{k_{0}},\theta_{k_{0}}))|p_{i}\geq 0,\theta_{i}\in\Theta_{1},\sum_{i=1}^{k_{0}}p_{i}=1\}, which induces a probability measure on k0(Θ1)\mathcal{E}_{k_{0}}(\Theta_{1}). Here, the prior Π\Pi is generated via independent Πp\Pi_{p} and Πθ\Pi_{\theta}, and the support Θ1\Theta_{1} of Πθ\Pi_{\theta} is such that G0k0(Θ1)G_{0}\in\mathcal{E}_{k_{0}}(\Theta_{1}).

  2. (B2)

    (Kernel assumption) Suppose that for every θ1,θ2Θ1\theta_{1},\theta_{2}\in\Theta_{1}, θ0{θi0}i=1k0\theta_{0}\in\{\theta_{i}^{0}\}_{i=1}^{k_{0}} and some positive constants α0,L1,β0,L2\alpha_{0},L_{1},\beta_{0},L_{2},

    K(f(x|θ0),f(x|θ2))\displaystyle K(f(x|\theta_{0}),f(x|\theta_{2}))\leq L1θ0θ22α0,\displaystyle L_{1}\|\theta_{0}-\theta_{2}\|_{2}^{\alpha_{0}}, (36)
    h(f(x|θ1),f(x|θ2))\displaystyle h(f(x|\theta_{1}),f(x|\theta_{2}))\leq L2θ1θ22β0.\displaystyle L_{2}\|\theta_{1}-\theta_{2}\|_{2}^{\beta_{0}}. (37)
Remark 6.1.

(B1) on the compactness of the support Θ1\Theta_{1} is a standard assumption for obtaining parameter convergence rates in finite mixture models (see [9, 32, 25, 26, 17, 22, 47]). See Section 9.1 for more discussion of the compactness assumption and a relaxation to a boundedness assumption. The unbounded setting seems challenging and is beyond the scope of this paper. In this paper, for simplicity we consider the prior on finite mixing measures to be generated by independent priors on the component parameters θ\theta and an independent prior on the mixing proportions pip_{i} [38, 32, 19] for a general probability kernel {f(x|θ)}\{f(x|\theta)\}. It is not difficult to extend our theorem to more complex forms of prior specification when a specific kernel {f(x|θ)}\{f(x|\theta)\} is considered.

The condition (B2) is not uncommon in parameter estimation (e.g. Theorem 8.25 in [17]). Note that the conditions in (B2) imply some implicit constraints on α0\alpha_{0} and β0\beta_{0}. Specifically, if (11) holds for G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}) and (B2) holds, then β01\beta_{0}\leq 1 and α02\alpha_{0}\leq 2. Indeed, for any sequence G=i=2k0pi0δθi0+p10δθ1k0(Θ)\{G0}G_{\ell}=\sum_{i=2}^{k_{0}}p^{0}_{i}\delta_{\theta_{i}^{0}}+p^{0}_{1}\delta_{\theta^{\ell}_{1}}\in\mathcal{E}_{k_{0}}(\Theta)\backslash\{G_{0}\} converges to G0=i=1k0pi0δθi0G_{0}=\sum_{i=1}^{k_{0}}p^{0}_{i}\delta_{\theta_{i}^{0}}, by (11), Lemma C.3 with N=1{N}=1 and (B2), for large \ell

C(G0)θ1θ102=C(G0)D1(G,G0)V(PG,PG0)V(f(x|θ1),f(x|θ10))h(f(x|θ1),f(x|θ10))L2θ1θ102β0,C(G_{0})\|\theta_{1}^{\ell}-\theta_{1}^{0}\|_{2}=C(G_{0})D_{1}(G_{\ell},G_{0})\leq V(P_{G_{\ell}},P_{G_{0}})\\ \leq V(f(x|\theta^{\ell}_{1}),f(x|\theta^{0}_{1}))\leq h(f(x|\theta^{\ell}_{1}),f(x|\theta^{0}_{1}))\leq L_{2}\|\theta_{1}^{\ell}-\theta^{0}_{1}\|^{\beta_{0}}_{2}, (38)

which implies β01\beta_{0}\leq 1 (by dividing both sides by θ1θ102\|\theta_{1}^{\ell}-\theta_{1}^{0}\|_{2} and letting \ell\to\infty). In the preceding display C(G0)=12lim infGW1G0Gk0(Θ)V(pG,pG0)D1(G,G0)>0C(G_{0})=\frac{1}{2}\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(p_{G},p_{G_{0}})}{D_{1}(G,G_{0})}>0. By (38) and Pinsker’s inequality,

C(G0)θ1θ102V(f(x|θ1),f(x|θ10))12K(f(x|θ10),f(x|θ1))12L1θ1θ102α0,C(G_{0})\|\theta_{1}^{\ell}-\theta_{1}^{0}\|_{2}\leq V(f(x|\theta^{\ell}_{1}),f(x|\theta^{0}_{1}))\leq\sqrt{\frac{1}{2}K(f(x|\theta^{0}_{1}),f(x|\theta^{\ell}_{1}))}\leq\sqrt{\frac{1}{2}L_{1}\|\theta_{1}^{\ell}-\theta^{0}_{1}\|^{\alpha_{0}}_{2}},

for large \ell, which implies α02\alpha_{0}\leq 2. The same conclusion holds if one replaces (11) with (21) by an analogous argument. \Diamond
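Pinsker's inequality V2K/2V\leq\sqrt{2K}/2 used in the preceding display can be spot-checked numerically. A small sketch on random discrete distributions (our illustration; it uses the normalization V(P,Q)=12|PQ|V(P,Q)=\frac{1}{2}\sum|P-Q|):

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(200):
    P = rng.dirichlet(np.ones(5))            # random probability vector on 5 points
    Q = rng.dirichlet(np.ones(5))
    V = 0.5 * np.abs(P - Q).sum()            # total variation distance V(P, Q)
    K = np.sum(P * np.log(P / Q))            # KL divergence K(P, Q)
    # Pinsker's inequality: V <= sqrt(K / 2)
    assert V <= np.sqrt(0.5 * K) + 1e-12
```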

A useful quantity is the average sequence length N¯m=1mi=1mNi\bar{{N}}_{{m}}=\frac{1}{{m}}\sum_{i=1}^{{m}}{N}_{i}. The posterior contraction theorem will be characterized in terms of the distance DN¯m(,)D_{\bar{{N}}_{{m}}}(\cdot,\cdot), which extends the original notion of the distance DN(,)D_{N}(\cdot,\cdot) by allowing the real-valued weight N¯m\bar{{N}}_{{m}}.

Theorem 6.2.

Fix G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). Suppose (B1), (B2) and additionally (21) hold.

  1. a)

    There exists some constant C(G0)>0C(G_{0})>0 such that as long as n0(G0,kk0k(Θ1))n1(G0)miniNisupiNi<n_{0}(G_{0},\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1}))\vee n_{1}(G_{0})\leq\min_{i}{N}_{i}\leq\sup_{i}{N}_{i}<\infty, for every M¯m\bar{M}_{m}\to\infty there holds

    Π(Gk0(Θ1):DN¯m(G,G0)C(G0)M¯mln(mN¯m)m|X[N1]1,,X[Nm]m)0\displaystyle\Pi\biggr{(}G\in\mathcal{E}_{k_{0}}(\Theta_{1}):D_{\bar{{N}}_{{m}}}(G,G_{0})\geq C(G_{0})\bar{M}_{m}\sqrt{\frac{\ln({m}\bar{{N}}_{{m}})}{{m}}}\biggr{|}X_{[{N}_{1}]}^{1},\ldots,X_{[{N}_{{m}}]}^{{m}}\biggr{)}\to 0

    in i=1mPG0,Ni\bigotimes_{i=1}^{m}P_{G_{0},{N}_{i}}-probability as m{m}\to\infty.

  2. b)

If, in addition, (11) is satisfied, then the claim in part a) holds with n1(G0)=1n_{1}(G_{0})=1.

Remark 6.3.

As discussed in Remark 5.4, [44] establishes that n0(G0,kk0k(Θ))2k01n_{0}(G_{0},\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta))\leq 2k_{0}-1 for any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta). While the uniform upper bound 2k012k_{0}-1 might not be tight, it does show n0(G0,kk0k(Θ1))<n_{0}(G_{0},\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1}))<\infty. By Proposition 5.1 b), n1(G0)<n_{1}(G_{0})<\infty is a direct consequence of (21). Hence n0(G0,kk0k(Θ1))n1(G0)<n_{0}(G_{0},\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1}))\vee n_{1}(G_{0})<\infty. \Diamond

Remark 6.4.

(a) In the above statement, note that the constant C(G0)C(G_{0}) also depends on Θ1\Theta_{1}, k0k_{0}, qq, upper and lower bounds of the densities of Πθ\Pi_{\theta}, Πp\Pi_{p} and the density family f(x|θ)f(x|\theta) (including α0\alpha_{0}, β0\beta_{0}, L1L_{1}, L2L_{2} etc). All such dependence is suppressed for the sake of a clean presentation; it is the dependence on G0G_{0} and the independence of m,{Ni}i1{m},\{{N}_{i}\}_{i\geq 1} and N0:=supiNi<{N}_{0}:=\sup_{i}{N}_{i}<\infty that we want to emphasize. In addition, although C(G0)C(G_{0}), and hence the vanishing radius of the ball characterized by DN¯mD_{\bar{{N}}_{m}}, does not depend on N0{N}_{0}, the rate at which the posterior probability of this ball tends to zero may depend on it.

(b) Roughly speaking, the theorem produces the following posterior contraction rates. The rate toward the mixing probabilities pi0p_{i}^{0} is OP((ln(mN¯m)/m)1/2)O_{P}((\ln({m}\bar{{N}}_{m})/{m})^{1/2}). The individual atoms θi0\theta_{i}^{0} have a much faster contraction rate, which utilizes the full volume of the data set:

OP(ln(mN¯m)mN¯m)=OP(ln(i=1mNi)i=1mNi).O_{P}\left(\sqrt{\frac{\ln({m}\bar{{N}}_{m})}{{m}\bar{{N}}_{m}}}\right)=O_{P}\biggr{(}\sqrt{\frac{\ln(\sum_{i=1}^{{m}}{N}_{i})}{\sum_{i=1}^{{m}}{N}_{i}}}\biggr{)}. (39)

Note that the condition supiNi<\sup_{i}N_{i}<\infty implies that N¯m\bar{N}_{m} remains bounded as mm\to\infty. Since the constant C(G0)C(G_{0}) is independent of N¯m\bar{N}_{m} and mm, the theorem establishes that the larger the average length of the observed sequences, the faster the posterior contraction as mm\to\infty.

(c) The distinction between the two parts of the theorem highlights the role of first-order identifiability in mixtures of N{N}-product distributions. Under first-order identifiability, (11) is satisfied, so we can establish the aforementioned posterior contraction behavior for the full range of sequence lengths Ni{N}_{i}, as long as they are uniformly bounded by an arbitrary unknown constant. When first-order identifiability fails, so that (11) may not hold, the same posterior behavior can be ascertained when the sequence lengths exceed a certain threshold depending on the true G0G_{0}, namely, n1(G0)n0(G0,kk0k(Θ1))n_{1}(G_{0})\vee n_{0}(G_{0},\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1})).

(d) The proof of Theorem 6.2 utilizes general techniques of Bayesian asymptotics (see [17, Chapter 8]) to deduce a posterior contraction rate for the density in the Hellinger distance h(PG,N,PG0,N)h(P_{G,N},P_{G_{0},N}). The main novelty lies in the application of the inverse bounds for mixtures of product distributions of exchangeable sequences established in Section 5. These are lower bounds on the distances h(PG0,N,PG,N)h(P_{G_{0},{N}},P_{G,{N}}) between a pair of distributions (PG0,N,PG,NP_{G_{0},{N}},P_{G,{N}}) in terms of the distance DN(G0,G)D_{{N}}(G_{0},G) between the corresponding (G0,G)(G_{0},G). The distance DN(G0,G)D_{{N}}(G_{0},G) brings out the role of the sample size N{N} of the exchangeable sequences, resulting in the rate N1/2{N}^{-1/2} (or N¯m1/2\bar{N}_{m}^{-1/2}, modulo the logarithm). \Diamond

The gist of the proof of Theorem 6.2 lies in the following lemma, where we consider equal-length data sequences to distill the essence. This lemma also illustrates the connection between the inverse bound (21) and the convergence rate for the mixing measure G0G_{0}.

Lemma 6.5.

Let Nn1(G0)N\geq n_{1}(G_{0}) be fixed and suppose (21) holds. Let X[N]ii.i.d.pG0,NX^{i}_{[N]}\overset{\operatorname{i.i.d.}}{\sim}p_{G_{0},N} and let Π\Pi be a prior distribution on k0(Θ)\mathcal{E}_{k_{0}}(\Theta). Suppose the posterior contraction rate towards the true mixture density pG0,Np_{G_{0},N} is ϵm,N\epsilon_{m,N}: for any M¯m\bar{M}_{m}\to\infty, Π(V(PG,N,PG0,N)M¯mϵm,N|X[N]1,,X[N]m)0\Pi(V(P_{G,N},P_{G_{0},N})\geq\bar{M}_{m}\epsilon_{m,N}|X^{1}_{[N]},\ldots,X^{m}_{[N]})\to 0 in probability as mm\to\infty. Suppose posterior consistency at the true mixing measure G0G_{0} w.r.t. the distance W1W_{1} holds: for any a>0a>0, Π(W1(G,G0)a|X[N]1,,X[N]m)0\Pi(W_{1}(G,G_{0})\geq a|X^{1}_{[N]},\ldots,X^{m}_{[N]})\to 0 in probability as mm\to\infty. Then the posterior contraction rate to G0G_{0} w.r.t. the distance DND_{N} is ϵm,N\epsilon_{m,N}, i.e. Π(DN(G,G0)M¯mϵm,N|X[N]1,,X[N]m)0\Pi(D_{N}(G,G_{0})\geq\bar{M}_{m}\epsilon_{m,N}|X^{1}_{[N]},\ldots,X^{m}_{[N]})\to 0 in probability as mm\to\infty.

Proof.

All probabilities presented in this proof are posterior probabilities conditioned on the data X[N]1,,X[N]mX^{1}_{[N]},\ldots,X^{m}_{[N]}; the conditioning notation is suppressed for brevity.

Π(DN(G,G0)M¯mϵm,N)\displaystyle\Pi(D_{N}(G,G_{0})\geq\bar{M}_{m}\epsilon_{m,N})
\displaystyle\leq Π(DN(G,G0)M¯mϵm,N,W1(G,G0)<c(G0,N))+Π(W1(G,G0)c(G0,N))\displaystyle\Pi(D_{N}(G,G_{0})\geq\bar{M}_{m}\epsilon_{m,N},W_{1}(G,G_{0})<c(G_{0},N))+\Pi(W_{1}(G,G_{0})\geq c(G_{0},N))
\displaystyle\leq Π(V(PG,N,PG0,N)C(G0)M¯mϵm,N)+Π(W1(G,G0)c(G0,N)),\displaystyle\Pi(V(P_{G,N},P_{G_{0},N})\geq C(G_{0})\bar{M}_{m}\epsilon_{m,N})+\Pi(W_{1}(G,G_{0})\geq c(G_{0},N)), (40)

where in the first inequality c(G0,N)c(G_{0},N) is the radius in Lemma 5.5 with N0=NN_{0}=N, and the second inequality follows by Lemma 5.5. The proof is completed by noticing that the quantity in (40) converges to 0 in probability as mm\to\infty by the hypotheses, for any M¯m\bar{M}_{m}\to\infty. ∎

Remark 6.6.

Roughly speaking, the hypothesis of posterior consistency guarantees that as mm\to\infty, GG lies in a small ball around G0G_{0} w.r.t. the W1W_{1} distance, and Lemma 5.5 then transfers the convergence rate from mixture densities to mixing measures. No particular structure of the posterior distribution is used, and one can easily adapt the above lemma to other estimators, for instance the maximum likelihood estimator.

Theorem 6.2 can be seen as providing sufficient conditions on the prior Π\Pi and the kernel ff such that the hypotheses of Lemma 6.5 are satisfied. The setup in Theorem 6.2 is slightly more general, in that each data sequence may have a different length. \Diamond

Finally, the conditions (B2) and (21) can be verified for full rank exponential families and hence we have the following corollary from Theorem 6.2.

Corollary 6.7.

Consider a full rank exponential family for kernel PθP_{\theta} specified as in Corollary 5.9 and assume all the requirements there are met. Fix G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). Suppose that (B1) holds with Θ1Θ\Theta_{1}\subset\Theta^{\circ}. Then the conclusions a), b) of Theorem 6.2 hold.

Example 6.8 (Posterior contraction for weakly identifiable kernels: Bernoulli and gamma).

Fix G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). For the Bernoulli kernel studied in Example 5.11, n1(G0)=n0(G0,kk0k(Θ))=2k01n_{1}(G_{0})=n_{0}(G_{0},\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta))=2k_{0}-1. Suppose that (B1) holds with compact Θ1Θ=(0,1)\Theta_{1}\subset\Theta^{\circ}=(0,1). Then by Corollary 6.7, the conclusion a) of Theorem 6.2 holds provided miniNi2k01\min_{i}N_{i}\geq 2k_{0}-1. For the gamma kernel studied in Examples 4.17 and 5.13, n1(G0)=2n_{1}(G_{0})=2 when G0𝒢G_{0}\in\mathcal{G} and n1(G0)=1n_{1}(G_{0})=1 when G0k0(Θ)\𝒢G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ})\backslash\mathcal{G}; n0(G0,kk0k(Θ))=1n_{0}(G_{0},\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta))=1. Suppose that (B1) holds with compact Θ1Θ=(0,)×(0,)\Theta_{1}\subset\Theta^{\circ}=(0,\infty)\times(0,\infty). Then by Corollary 6.7, the conclusion a) of Theorem 6.2 holds provided miniNi2\min_{i}N_{i}\geq 2. Moreover, no requirement on miniNi\min_{i}N_{i} is needed if G0𝒢G_{0}\not\in\mathcal{G} is given. \Diamond

Example 6.9 (Posterior contraction for weakly identifiable kernels: beyond exponential family).

Here we present the posterior contraction rates for the four examples studied in Section 5.4; the verification details are in Appendix E.3. Assume that the prior distribution satisfies (B1) for each example below. For the uniform probability kernel studied in Example 5.20, the conclusion of Theorem 6.2 holds for any N1N\geq 1. For the location-scale exponential kernel studied in Example 5.21, the conclusion of Theorem 6.2 holds for any N1N\geq 1. For the case where the kernel is a location mixture of Gaussians as in Example 5.22, the conclusion of Theorem 6.2 holds, and the specific values of n0(G0,kk0k(Θ1))n_{0}(G_{0},\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1})) and n1(G0)n_{1}(G_{0}) are left as exercises. The kernel in Example 5.23 does not possess a density, which is needed in (35), and thus the results in this section on posterior contraction do not apply. \Diamond

7 Hierarchical model: kernel PθP_{\theta} is itself a mixture distribution

In this section we apply Theorem 5.16 to the cases where PθP_{\theta} itself is a rather complex object: a finite mixture of distributions. Combining this kernel with a discrete mixing measure Gk0(Θ)G\in\mathcal{E}_{k_{0}}(\Theta), the resulting PGP_{G} represents a mixture of finite mixtures of distributions, while PG,NP_{G,{N}} becomes a k0k_{0}-mixture of N{N}-products of finite mixtures of distributions. These recursively defined objects represent a popular and formidable device in the statistical modeling world: the world of hierarchical models. We shall illustrate Theorem 5.16 on only two examples of such models. However, the tools required for these applications are quite general, chief among them are bounds on relevant oscillatory integrals for suitable statistical maps TT. We shall first describe such tools in Section 7.1 and then address the case PθP_{\theta} is a kk-component Gaussian location mixture (Example 5.22) and the case PθP_{\theta} is a mixture of Dirichlet processes (Example 5.23).

7.1 Bounds on oscillatory integrals

A key condition in Theorem 5.16, namely condition (A3), is reduced to the LrL^{r} integrability of certain oscillatory integrals:

𝔛e𝒊ζTxf(x)𝑑xLr(s)\left\|\int_{\mathfrak{X}}e^{\bm{i}\zeta^{\top}Tx}f(x)dx\right\|_{L^{r}(\mathbb{R}^{s})} (41)

for a broad class of functions f:𝔛f:\mathfrak{X}\rightarrow\mathbb{R} and multi-dimensional maps T:𝔛sT:\mathfrak{X}\rightarrow\mathbb{R}^{s}. When 𝔛=d\mathfrak{X}=\mathbb{R}^{d}, the oscillatory integral 𝔛e𝒊ζTxf(x)𝑑x\int_{\mathfrak{X}}e^{\bm{i}\zeta^{\top}Tx}f(x)dx is also known as the Fourier transform of measures supported on curves or surfaces; bounds for such quantities are important topics in harmonic analysis and geometric analysis. We refer to [6] and the textbook [40, Chapter 8] for further details and broader contexts. Although many such results exist, they are typically established when f(x)f(x) is supported on a compact interval or is smooth, i.e., ff has derivatives of all orders. For our purposes we shall develop an upper bound on (41) that verifies the integrability condition in (A3) for a broad class of ff, under conditions usually satisfied by probability density functions.

We start with the following bounds for oscillatory integrals of the form e𝒊λϕ(x)ψ(x)𝑑x\int e^{\bm{i}\lambda\phi(x)}\psi(x)dx, where function ϕ\phi is called the phase, and function ψ\psi the amplitude.

Lemma 7.1 (van der Corput’s Lemma).

Suppose ϕ(x)C(a,b)\phi(x)\in C^{\infty}(a,b), and that |ϕ(k)(x)|1|\phi^{(k)}(x)|\geq 1 for all x(a,b)x\in(a,b). Let ψ(x)\psi(x) be absolutely continuous on [a,b][a,b]. Then

|[a,b]eiλϕ(x)ψ(x)𝑑x|ckλ1k[|ψ(b)|+[a,b]|ψ(x)|𝑑x]\left|\int_{[a,b]}e^{i\lambda\phi(x)}\psi(x)dx\right|\leq c_{k}\lambda^{-\frac{1}{k}}\left[|\psi(b)|+\int_{[a,b]}|\psi^{\prime}(x)|dx\right]

and

|[a,b]eiλϕ(x)ψ(x)𝑑x|ckλ1k[|ψ(a)|+[a,b]|ψ(x)|𝑑x]\left|\int_{[a,b]}e^{i\lambda\phi(x)}\psi(x)dx\right|\leq c_{k}\lambda^{-\frac{1}{k}}\left[|\psi(a)|+\int_{[a,b]}|\psi^{\prime}(x)|dx\right]

hold when either i) k2k\geq 2, or ii) k=1k=1 and ϕ(x)\phi^{\prime}(x) is monotonic. The constant ckc_{k} is independent of ϕ\phi, ψ\psi, λ\lambda and the interval [a,b][a,b].

Proof.

See [40, the Corollary on Page 334] for the proof of the first display; although in its original version in this reference ψ\psi is assumed to be CC^{\infty}, the proof only requires ψ\psi to be absolutely continuous on [a,b][a,b]. The second display follows by applying the first display to ψ~(x)=ψ(a+bx)\tilde{\psi}(x)=\psi(a+b-x). ∎
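As a numerical illustration of the λ^{-1/k} decay in Lemma 7.1 (a sanity-check sketch, not part of the proof; the phase, amplitude and grid below are our own choices), take ϕ(x) = x² on [0,1], so that |ϕ''| = 2 ≥ 1, and ψ ≡ 1; the resulting Fresnel-type integral then decays at the rate λ^{-1/2}:

```python
import numpy as np

def trap(y, x):
    """Simple trapezoidal rule (avoids NumPy-version differences)."""
    return np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0

def osc_integral(lam, n=2_000_000):
    """Approximate the integral of exp(i*lam*x^2) over [0, 1] on a fine grid."""
    x = np.linspace(0.0, 1.0, n)
    return trap(np.exp(1j * lam * x**2), x)

I_small = abs(osc_integral(1e2))
I_big = abs(osc_integral(1e4))
# van der Corput with k = 2 predicts decay like lam**(-1/2), so
# multiplying lam by 100 should shrink the integral by roughly 10.
print(I_small, I_big, I_big / I_small)
```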

It can be observed from Lemma 7.1 that the condition on the derivatives of the phase function plays a crucial role. For our purpose the phase function will be supplied by a monomial map TT. Hence, the following technical lemma will be needed.

Lemma 7.2.

Let A(x)d×dA(x)\in\mathbb{R}^{d\times d} with entries Aαβ(x)=0A_{\alpha\beta}(x)=0 for α>β\alpha>\beta and Aαβ(x)=jβ!(jβjα)!xjβjαA_{\alpha\beta}(x)=\frac{j_{\beta}!}{(j_{\beta}-j_{\alpha})!}x^{j_{\beta}-j_{\alpha}} for 1αβd1\leq\alpha\leq\beta\leq d, where 1j1<<jd1\leq j_{1}<\ldots<j_{d} are given. Let Smin(A(x))S_{\text{min}}(A(x)) be the smallest singular value of A(x)A(x). Then Smin(A(x))c3max{1,|x|}(jdj1)(d1)S_{\text{min}}(A(x))\geq c_{3}\max\{1,|x|\}^{-(j_{d}-j_{1})(d-1)}, where c3c_{3} is a constant that depends only on d,j1,,jdd,j_{1},\ldots,j_{d}.
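A quick numerical probe of Lemma 7.2 (an illustrative sketch; the choice d = 3 and (j_1, j_2, j_3) = (1, 3, 5) is ours, not from the text): the smallest singular value of A(x), rescaled by max{1, |x|}^{(j_d − j_1)(d−1)}, should stay bounded away from zero.

```python
import numpy as np
from math import factorial

JS = (1, 3, 5)                              # illustrative exponents j_1 < j_2 < j_3
ALPHA = (JS[-1] - JS[0]) * (len(JS) - 1)    # (j_d - j_1)(d - 1) = 8

def A(x, js=JS):
    """Upper-triangular matrix with A[a, b] = j_b!/(j_b - j_a)! * x**(j_b - j_a)."""
    d = len(js)
    M = np.zeros((d, d))
    for a in range(d):
        for b in range(a, d):
            M[a, b] = factorial(js[b]) // factorial(js[b] - js[a]) * x ** (js[b] - js[a])
    return M

for x in (0.0, 0.5, -1.0, 3.0, -10.0, 50.0):
    s_min = np.linalg.svd(A(x), compute_uv=False)[-1]
    # Lemma 7.2: s_min >= c3 * max(1, |x|)**(-ALPHA) for some constant c3 > 0
    assert s_min * max(1.0, abs(x)) ** ALPHA > 1e-3
```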

The following lemma provides a crucial uniform bound on oscillatory integrals whose phase is given by a monomial map TT.

Lemma 7.3.

Let T:dT:\mathbb{R}\to\mathbb{R}^{d} be defined by Tx=(xj1,xj2,,xjd)Tx=(x^{j_{1}},x^{j_{2}},\ldots,x^{j_{d}})^{\top} with 1j1<j2<<jd1\leq j_{1}<j_{2}<\ldots<j_{d}. Consider a bounded non-negative function f(x)f(x) that is differentiable on \{bi}i=1\mathbb{R}\backslash\{b_{i}\}_{i=1}^{\ell}, where b1<b2<<bb_{1}<b_{2}<\ldots<b_{\ell} with \ell a finite number. Suppose the derivative f(x)L1()f^{\prime}(x)\in L^{1}(\mathbb{R}) and that it is continuous wherever it exists. Moreover, suppose f(x)f(x) and |x|α1f(x)|x|^{{\alpha_{1}}}f(x) are both increasing when x<c1x<-c_{1} and decreasing when x>c1x>c_{1} for some c1max{|b1|,|b|}c_{1}\geq\max\{|b_{1}|,|b_{\ell}|\}, where α1=(jdj1)(d1)/j1\alpha_{1}=(j_{d}-j_{1})(d-1)/j_{1}. Then for λ>1\lambda>1,

supwSd1|exp(𝒊λwTx)f(x)𝑑x|\displaystyle\sup_{w\in S^{d-1}}\left|\int_{\mathbb{R}}\exp(\bm{i}\lambda w^{\top}Tx)f(x)dx\right|
\displaystyle\leq C1λ1jd(c1+2)α1(|x|α1f(x)L1()+(+1)fL()+(|x|α1+1)f(x)L1()),\displaystyle C_{1}\lambda^{-\frac{1}{j_{d}}}(c_{1}+2)^{\alpha_{1}}\left(\left\||x|^{\alpha_{1}}f(x)\right\|_{L^{1}(\mathbb{R})}+(\ell+1)\|f\|_{L^{\infty}(\mathbb{R})}+\left\|\left(|x|^{\alpha_{1}}+1\right)f^{\prime}(x)\right\|_{L^{1}(\mathbb{R})}\right),

where C1C_{1} is a positive constant that only depends on d,j1,j2,,jdd,j_{1},j_{2},\ldots,j_{d}.
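The uniformity over directions w in Lemma 7.3 can be probed numerically (an illustrative sketch with d = 2, Tx = (x, x²) and f the standard normal density, all our own choices; the guaranteed rate is then λ^{-1/j_d} = λ^{-1/2}):

```python
import numpy as np

def sup_osc(lam, n=4_000_000, n_dir=16):
    """Max over sampled unit directions w of |int exp(i*lam*w.Tx) f(x) dx|,
    with Tx = (x, x**2) and f the standard normal density."""
    x = np.linspace(-8.0, 8.0, n)
    dx = x[1] - x[0]
    f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    best = 0.0
    for t in np.linspace(0.0, np.pi, n_dir, endpoint=False):
        w1, w2 = np.cos(t), np.sin(t)
        g = np.exp(1j * lam * (w1 * x + w2 * x**2)) * f
        val = abs(np.sum(g[1:] + g[:-1]) * dx / 2.0)   # trapezoidal rule
        best = max(best, val)
    return best

s_small, s_big = sup_osc(25.0), sup_osc(2500.0)
# multiplying lam by 100 should shrink the supremum by roughly 10
print(s_small, s_big)
```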

Applying Lemma 7.3 we obtain a bound for the oscillatory integral in question.

Lemma 7.4.

Let TT and ff satisfy the same conditions as in Lemma 7.3. Define g(ζ)=e𝐢ζTxf(x)𝑑xg(\zeta)=\int_{\mathbb{R}}e^{\bm{i}\zeta^{\top}Tx}f(x)dx for ζd\zeta\in\mathbb{R}^{d}. Then for r>djdr>dj_{d},

g(ζ)Lr(d)\displaystyle\|g(\zeta)\|_{L^{r}(\mathbb{R}^{d})}
\displaystyle\leq C2(c1+2)α1(|x|α1f(x)L1()+(+1)fL()+(|x|α1+1)f(x)L1()+fL1())\displaystyle C_{2}(c_{1}+2)^{{\alpha_{1}}}(\||x|^{\alpha_{1}}f(x)\|_{L^{1}(\mathbb{R})}+(\ell+1)\|f\|_{L^{\infty}(\mathbb{R})}+\|(|x|^{\alpha_{1}}+1)f^{\prime}(x)\|_{L^{1}(\mathbb{R})}+\|f\|_{L^{1}(\mathbb{R})})

where C2C_{2} is a positive constant that depends on r,d,j1,j2,,jdr,d,j_{1},j_{2},\ldots,j_{d}.

7.2 Kernel PθP_{\theta} is a location mixture of Gaussian distributions

We are now ready for an application of Theorem 5.16 to the case where the kernel PθP_{\theta} is a mixture of kk Gaussian distributions. As discussed in Example 5.22, with this example we are moving from a standard mixture of product distributions to hierarchical models (i.e., mixtures of mixture distributions). Such models are central tools in Bayesian statistics.

Let

Θ={θ=(π1,,πk1,μ1,,μk)2k1|0<πi<1,i;μi<μj,1i<jk}\Theta=\{\theta=(\pi_{1},\ldots,\pi_{k-1},\mu_{1},\ldots,\mu_{k})\in\mathbb{R}^{2k-1}|0<\pi_{i}<1,\ \forall i;\ \mu_{i}<\mu_{j},\ \forall 1\leq i<j\leq k\} (42)

and PθP_{\theta} w.r.t. Lebesgue measure on \mathbb{R} has probability density

f(x|θ)=i=1kπif𝒩(x|μi,σ2)f(x|\theta)=\sum_{i=1}^{k}\pi_{i}f_{\mathcal{N}}(x|\mu_{i},\sigma^{2}) (43)

where πk=1i=1k1πi\pi_{k}=1-\sum_{i=1}^{k-1}\pi_{i} and f𝒩(x|μ,σ2)f_{\mathcal{N}}(x|\mu,\sigma^{2}) is the density of 𝒩(μ,σ2)\mathcal{N}(\mu,\sigma^{2}) with σ\sigma a known constant. For the eligibility of this parametrization, see Section 9.2. It follows from the classical result [42, Proposition 1] that the map θf(x|θ)\theta\mapsto f(x|\theta) is injective on Θ\Theta. The mixture of product distributions PG,NP_{G,N} admits the density pG,Np_{G,N} given in (34) (with Ni=NN_{i}=N) w.r.t. Lebesgue measure on N\mathbb{R}^{N}. Fix G0=i=1k0pi0δθi0G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}} with θi0=(π1i0,,π(k1)i0,μ1i0,,μki0)\theta_{i}^{0}=(\pi_{1i}^{0},\ldots,\pi_{(k-1)i}^{0},\mu_{1i}^{0},\ldots,\mu_{ki}^{0}).

Let us now verify that Corollary 5.17 with the map Tx=(x,x2,,x2k1)Tx=(x,x^{2},\ldots,x^{2k-1})^{\top} can be applied for this model. The mean of TX1TX_{1} is λ(θ)2k1\lambda(\theta)\in\mathbb{R}^{2k-1} with its jj-th entry given by

λ(j)(θ)=𝔼θX1j=i=1kπi𝔼(σY+μi)j,j=1,,2k1\lambda^{(j)}(\theta)=\mathbb{E}_{\theta}X_{1}^{j}=\sum_{i=1}^{k}\pi_{i}\mathbb{E}(\sigma Y+\mu_{i})^{j},\quad j=1,\ldots,2k-1 (44)

where X1X_{1} has density (43) and YY has the standard Gaussian distribution 𝒩(0,1)\mathcal{N}(0,1). The covariance matrix of TX1TX_{1} is Λ(θ)(2k1)×(2k1)\Lambda(\theta)\in\mathbb{R}^{(2k-1)\times(2k-1)} with its (j,β)(j,\beta) entries given by

Λjβ(θ)=𝔼θX1j+βλ(j)(θ)λ(β)(θ)=i=1kπi𝔼(σY+μi)j+βλ(j)(θ)λ(β)(θ).\Lambda_{j\beta}(\theta)=\mathbb{E}_{\theta}X_{1}^{j+\beta}-\lambda^{(j)}(\theta)\lambda^{(\beta)}(\theta)=\sum_{i=1}^{k}\pi_{i}\mathbb{E}(\sigma Y+\mu_{i})^{j+\beta}-\lambda^{(j)}(\theta)\lambda^{(\beta)}(\theta).

It follows immediately from these formulae that λ(θ)\lambda(\theta) and Λ(θ)\Lambda(\theta) are continuous on Θ\Theta. That is, (A1) in Definition 5.14 is satisfied. The characteristic function of TX1TX_{1} is

ϕT(ζ|θ)=𝔼θexp(𝒊ζTX1)=i=1kπih(ζ|μi,σ)\phi_{T}(\zeta|\theta)=\mathbb{E}_{\theta}\exp(\bm{i}\zeta^{\top}TX_{1})=\sum_{i=1}^{k}\pi_{i}h(\zeta|\mu_{i},\sigma) (45)

where h(ζ|μ,σ)=𝔼exp(𝒊ζT(σY+μ))h(\zeta|\mu,\sigma)=\mathbb{E}\exp(\bm{i}\zeta^{\top}T(\sigma Y+\mu)). Denote by f𝒩(x|μ,σ)f_{\mathcal{N}}(x|\mu,\sigma) the density of 𝒩(μ,σ2)\mathcal{N}(\mu,\sigma^{2}). The verification of (A2) in Definition 5.14 is omitted since it is a straightforward application of the dominated convergence theorem. In Appendix F.2 it is shown by some calculations that, to verify condition (A3), it remains to establish that there exists some r1r\geq 1 such that 2k1|ϕT(ζ|θ)|rdζ\int_{\mathbb{R}^{2k-1}}\left|\phi_{T}(\zeta|\theta)\right|^{r}d\zeta is upper bounded on Θ\Theta by a finite continuous function of θ\theta.
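As a sanity check on the moment formula (44) (an illustrative numerical sketch; the mixture parameters below are arbitrary choices of ours), 𝔼(σY + μ)^j can be expanded with the Gaussian moments 𝔼Y^ℓ = (ℓ−1)!! for even ℓ (zero for odd ℓ) and compared against direct numerical integration of x^j f(x|θ):

```python
import numpy as np
from math import comb

def gauss_moment(mu, sigma, j):
    """E (sigma*Y + mu)^j for Y ~ N(0,1), via binomial expansion."""
    total = 0.0
    for l in range(0, j + 1, 2):                            # odd Gaussian moments vanish
        dfact = np.prod(np.arange(l - 1, 0, -2)) if l else 1.0   # (l-1)!!
        total += comb(j, l) * sigma**l * dfact * mu ** (j - l)
    return total

pis, mus, sigma = [0.3, 0.7], [-1.0, 2.0], 0.5
x = np.linspace(-12.0, 14.0, 400_000)
dx = x[1] - x[0]
dens = sum(p * np.exp(-(x - m)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
           for p, m in zip(pis, mus))
for j in range(1, 6):                                       # moments lambda^(j)(theta)
    closed = sum(p * gauss_moment(m, sigma, j) for p, m in zip(pis, mus))
    numeric = np.sum(x**j * dens) * dx
    assert abs(closed - numeric) < 1e-6 * max(1.0, abs(closed))
```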

Note that f𝒩(x|μ,σ)f_{\mathcal{N}}(x|\mu,\sigma) is differentiable everywhere and f𝒩(x|μ,σ)xL1()\frac{\partial f_{\mathcal{N}}(x|\mu,\sigma)}{\partial x}\in L^{1}(\mathbb{R}). Moreover α1\alpha_{1} in Lemma 7.4 for TT is 4(k1)24(k-1)^{2} and f𝒩(x|μ,σ)f_{\mathcal{N}}(x|\mu,\sigma), x4(k1)2f𝒩(x|μ,σ)x^{4(k-1)^{2}}f_{\mathcal{N}}(x|\mu,\sigma) are increasing on (,|μ|+μ2+16(k1)2σ22)\left(-\infty,-\frac{|\mu|+\sqrt{\mu^{2}+16(k-1)^{2}\sigma^{2}}}{2}\right) and decreasing on (|μ|+μ2+16(k1)2σ22,)\left(\frac{|\mu|+\sqrt{\mu^{2}+16(k-1)^{2}\sigma^{2}}}{2},\infty\right). By Lemma 7.4, for r>(2k1)2r>(2k-1)^{2}, and for Tx=(x,x2,,x2k1)Tx=(x,x^{2},\cdots,x^{2k-1})

eiζTxf𝒩(x|μ,σ)dxLr(2k1)\displaystyle\left\|\int_{\mathbb{R}}e^{i\zeta^{\top}Tx}f_{\mathcal{N}}(x|\mu,\sigma)dx\right\|_{L^{r}(\mathbb{R}^{2k-1})}
\displaystyle\leq C(r)(|μ|+μ2+16(k1)2σ22+2)4(k1)2\displaystyle C(r)\left(\frac{|\mu|+\sqrt{\mu^{2}+16(k-1)^{2}\sigma^{2}}}{2}+2\right)^{4(k-1)^{2}}
(|x|4(k1)2f𝒩(x|μ,σ)L1()+12πσ+(|x|4(k1)2+1)f𝒩(x|μ,σ)xL1()+1)\displaystyle\left(\||x|^{4(k-1)^{2}}f_{\mathcal{N}}(x|\mu,\sigma)\|_{L^{1}(\mathbb{R})}+\frac{1}{\sqrt{2\pi}\sigma}+\left\|(|x|^{4(k-1)^{2}}+1)\frac{\partial f_{\mathcal{N}}(x|\mu,\sigma)}{\partial x}\right\|_{L^{1}(\mathbb{R})}+1\right)
:=\displaystyle:= h3(μ,σ),\displaystyle h_{3}(\mu,\sigma),

where C(r)C(r) is a constant that depends only on rr. It can be verified easily by the dominated convergence theorem that h3(μ,σ)h_{3}(\mu,\sigma) is a continuous function of μ\mu. Then

ϕT(ζ|θ)Lr(2k1)i=1kπieiζTxf𝒩(x|μi,σ)dxLr(2k1)i=1kπih3(μi,σ),\displaystyle\|\phi_{T}(\zeta|\theta)\|_{L^{r}(\mathbb{R}^{2k-1})}\leq\sum_{i=1}^{k}\pi_{i}\left\|\int_{\mathbb{R}}e^{i\zeta^{\top}Tx}f_{\mathcal{N}}(x|\mu_{i},\sigma)dx\right\|_{L^{r}(\mathbb{R}^{2k-1})}\leq\sum_{i=1}^{k}\pi_{i}h_{3}(\mu_{i},\sigma),

which is a finite continuous function of θ=(π1,,πk1,μ1,,μk)\theta=(\pi_{1},\ldots,\pi_{k-1},\mu_{1},\ldots,\mu_{k}). Thus (A3) is verified. We have then verified that TT is admissible with respect to Θ\Theta. That the mean map λ(θ)\lambda(\theta) is injective is a classical result (e.g. [16, Corollary 3.3]). To apply Corollary 5.17 it remains to check that the Jacobian matrix Jλ(θ)J_{\lambda}(\theta) of λ(θ)\lambda(\theta) is of full column rank. Such details are established in Section 7.3.

In summary, we have shown that all conditions in Corollary 5.17 are satisfied and thus, for PθP_{\theta} having the density in (43), the inverse bounds (21) and (23) hold for any G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta).

7.3 Moment map for location mixture of Gaussian distributions has full-rank Jacobian

In this subsection we verify that the Jacobian Jλ(θ)J_{\lambda}(\theta) for the moment map λ(θ)\lambda(\theta) specified in Section 7.2 is of full rank. By (44), for any j[2k1]j\in[2k-1]:

λ(j)(θ)=i=1kπi(μij+=1j(j)σ𝔼Yμij)=i=1kπi(μij+=2 evenj(j)σ(1)!!μij)=i=1kπiμij+=2 evenj(j)σ(1)!!i=1kπiμij.\lambda^{(j)}(\theta)=\sum_{i=1}^{k}\pi_{i}\left(\mu_{i}^{j}+\sum_{\ell=1}^{j}\binom{j}{\ell}\sigma^{\ell}\mathbb{E}Y^{\ell}\mu_{i}^{j-\ell}\right)=\sum_{i=1}^{k}\pi_{i}\left(\mu_{i}^{j}+\sum_{\begin{subarray}{c}\ell=2\\ \ell\text{ even}\end{subarray}}^{j}\binom{j}{\ell}\sigma^{\ell}(\ell-1)!!\mu_{i}^{j-\ell}\right)\\ =\sum_{i=1}^{k}\pi_{i}\mu_{i}^{j}+\sum_{\begin{subarray}{c}\ell=2\\ \ell\text{ even}\end{subarray}}^{j}\binom{j}{\ell}\sigma^{\ell}(\ell-1)!!\sum_{i=1}^{k}\pi_{i}\mu_{i}^{j-\ell}. (46)

Denote λ¯(j)(θ)=i=1kπiμij\bar{\lambda}^{(j)}(\theta)=\sum_{i=1}^{k}\pi_{i}\mu_{i}^{j} and λ¯(θ)=(λ¯(1)(θ),,λ¯(2k1)(θ))2k1\bar{\lambda}(\theta)=(\bar{\lambda}^{(1)}(\theta),\ldots,\bar{\lambda}^{(2k-1)}(\theta))\in\mathbb{R}^{2k-1}. By (46), λ(j)(θ)=λ¯(j)(θ)+=2 evenj(j)σ(1)!!λ¯(j)(θ)\lambda^{(j)}(\theta)=\bar{\lambda}^{(j)}(\theta)+\sum_{\begin{subarray}{c}\ell=2\\ \ell\text{ even}\end{subarray}}^{j}\binom{j}{\ell}\sigma^{\ell}(\ell-1)!!\bar{\lambda}^{(j-\ell)}(\theta), which implies

θλ(j)(θ)=θλ¯(j)(θ)+=2 evenj(j)σ(1)!!θλ¯(j)(θ).\nabla_{\theta}\lambda^{(j)}(\theta)=\nabla_{\theta}\bar{\lambda}^{(j)}(\theta)+\sum_{\begin{subarray}{c}\ell=2\\ \ell\text{ even}\end{subarray}}^{j}\binom{j}{\ell}\sigma^{\ell}(\ell-1)!!\nabla_{\theta}\bar{\lambda}^{(j-\ell)}(\theta).

Since θλ(j)(θ)\nabla_{\theta}\lambda^{(j)}(\theta) and θλ¯(j)(θ)\nabla_{\theta}\bar{\lambda}^{(j)}(\theta) are respectively the jj-th row of Jλ(θ)J_{\lambda}(\theta) and Jλ¯(θ)J_{\bar{\lambda}}(\theta),

det(Jλ(θ))=det(Jλ¯(θ)).\text{det}(J_{\lambda}(\theta))=\text{det}(J_{\bar{\lambda}}(\theta)). (47)

Also, observe

det(Jλ¯(θ))\displaystyle\text{det}(J_{\bar{\lambda}}(\theta))
=\displaystyle= (=1kπ)det(μ1μk,μk1μk,1,1μ12μk2,μk12μk2,2μ1,2μkμ12k1μk2k1,μk12k1μk2k1,(2k1)μ12k1,(2k1)μk2k1)\displaystyle\left(\prod_{\ell=1}^{k}\pi_{\ell}\right)\text{det}\begin{pmatrix}\mu_{1}-\mu_{k},&\ldots&\mu_{k-1}-\mu_{k},&1,&\ldots&1\\ \mu_{1}^{2}-\mu_{k}^{2},&\ldots&\mu_{k-1}^{2}-\mu_{k}^{2},&2\mu_{1},&\ldots&2\mu_{k}\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ \mu_{1}^{2k-1}-\mu_{k}^{2k-1},&\ldots&\mu_{k-1}^{2k-1}-\mu_{k}^{2k-1},&(2k-1)\mu_{1}^{2k-1},&\ldots&(2k-1)\mu_{k}^{2k-1}\end{pmatrix}
=\displaystyle= (=1kπ)(1)k+1det(1,1,1,0,0μ1,μk1,μk,1,1μ12,μk12,μk22μ1,2μkμ12k1,μk12k1,μk2k1(2k1)μ12k1,(2k1)μk2k1)\displaystyle\left(\prod_{\ell=1}^{k}\pi_{\ell}\right)(-1)^{k+1}\text{det}\begin{pmatrix}1,&\ldots&1,&1,&0,&\ldots&0\\ \mu_{1},&\ldots&\mu_{k-1},&\mu_{k},&1,&\ldots&1\\ \mu_{1}^{2},&\ldots&\mu_{k-1}^{2},&\mu_{k}^{2}&2\mu_{1},&\ldots&2\mu_{k}\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ \mu_{1}^{2k-1},&\ldots&\mu_{k-1}^{2k-1},&\mu_{k}^{2k-1}&(2k-1)\mu_{1}^{2k-1},&\ldots&(2k-1)\mu_{k}^{2k-1}\end{pmatrix}
=\displaystyle= (=1kπ)(1)k+1(i=1k(1)k+i2i)1α<βk(μαμβ)4\displaystyle\left(\prod_{\ell=1}^{k}\pi_{\ell}\right)(-1)^{k+1}\left(\prod_{i=1}^{k}(-1)^{k+i-2i}\right)\prod_{1\leq\alpha<\beta\leq k}(\mu_{\alpha}-\mu_{\beta})^{4} (48)

where the second equality holds since we may subtract the kk-th column of the 2k×2k2k\times 2k matrix from each of its first k1k-1 columns and then do Laplace expansion along its first row, and the last equality follows by observing that the (k+i)(k+i)-th column of the 2k×2k2k\times 2k matrix is the derivative of the ii-th column and by applying Lemma 5.12 c) after some column permutation. By (47) and (48), det(Jλ(θ))0\text{det}(J_{\lambda}(\theta))\neq 0 on Θ\Theta. That is Jλ(θ)J_{\lambda}(\theta) is of full column rank for any θΘ\theta\in\Theta.
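A numerical sanity check of (48) (an illustrative sketch for k = 2, with arbitrary parameter values of our choosing; in this case the closed form reduces to π₁π₂(μ₁ − μ₂)⁴): build Jλ̄ from λ̄^{(j)}(θ) = Σᵢ πᵢμᵢʲ in the coordinates (π₁, μ₁, μ₂) and compare determinants.

```python
import numpy as np

def jac_lambda_bar(p1, mu1, mu2):
    """Jacobian of (lbar^(1), lbar^(2), lbar^(3)) w.r.t. theta = (pi_1, mu_1, mu_2),
    where lbar^(j) = pi_1 * mu_1**j + (1 - pi_1) * mu_2**j."""
    p2 = 1.0 - p1
    rows = []
    for j in (1, 2, 3):
        rows.append([mu1**j - mu2**j,           # d/d pi_1 (with pi_2 = 1 - pi_1)
                     p1 * j * mu1 ** (j - 1),   # d/d mu_1
                     p2 * j * mu2 ** (j - 1)])  # d/d mu_2
    return np.array(rows)

p1, mu1, mu2 = 0.3, -1.0, 2.0
det = np.linalg.det(jac_lambda_bar(p1, mu1, mu2))
closed = p1 * (1 - p1) * (mu1 - mu2) ** 4       # formula (48) specialized to k = 2
assert np.isclose(det, closed)
```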

7.4 Kernel PθP_{\theta} is mixture of Dirichlet processes

Now we tackle Example 5.23, which is motivated by modeling techniques in nonparametric Bayesian statistics. In particular, the kernel PθP_{\theta} is given as a distribution on a space of measures: PθP_{\theta} is a mixture of Dirichlet processes (DPs), so that PG,NP_{G,{N}} is a finite mixture of products of mixtures of DPs. This should not be confused with the use of a DP as a prior for the mixing measures arising in mixture models. Rather, it is more akin to the use of DPs as probability kernels in the famous hierarchical Dirichlet processes [41] (that model actually uses the DP both as a prior and as a kernel). The purpose of this example is to illustrate Theorem 5.16 when (mixtures of) Dirichlet processes are treated as kernels.

Let 𝔛=𝒫()\mathfrak{X}=\mathscr{P}(\mathfrak{Z}) be the space of all probability measures on a Polish space (,𝒵)(\mathfrak{Z},\mathscr{Z}). 𝔛\mathfrak{X} is equipped with the weak topology and the corresponding Borel sigma algebra 𝒜\mathcal{A}. Let 𝒟αH\mathscr{D}_{\alpha H} denote the Dirichlet distribution on (𝔛,𝒜)(\mathfrak{X},\mathcal{A}), which is specified by two parameters, concentration parameter α(0,)\alpha\in(0,\infty) and base measure H𝔛H\in\mathfrak{X}. Formal definition and key properties of the Dirichlet distributions can be found in the original paper of [15], or a recent textbook [17]. In this example, we take the probability kernel PθP_{\theta} to be a mixture of two Dirichlet distributions with different concentration parameters, while the base measure is fixed and known: Pθ=π1𝒟α1H+(1π1)𝒟α2HP_{\theta}=\pi_{1}\mathscr{D}_{\alpha_{1}H}+(1-\pi_{1})\mathscr{D}_{\alpha_{2}H}. Thus, the parameter vector is three dimensional which shall be restricted by the following constraint: θ:=(π1,α1,α2)Θ={(π1,α1,α2)|0<π1<1,2<α1<α2}\theta:=(\pi_{1},\alpha_{1},\alpha_{2})\in\Theta=\{(\pi_{1},\alpha_{1},\alpha_{2})|0<\pi_{1}<1,2<\alpha_{1}<\alpha_{2}\}. It can be easily verified that the map θPθ\theta\to P_{\theta} is injective. Kernel PθP_{\theta} so defined is a simple instance of the so-called mixture of Dirichlet processes first studied by [4], but considerably more complex instances of model using Dirichlet as the building block have become a main staple in the lively literature of Bayesian nonparametrics [24, 41, 37, 8]. For notational convenience in the following we also denote Qα:=𝒟αHQ_{\alpha}:=\mathscr{D}_{\alpha H} for α=α1\alpha=\alpha_{1} and α=α2\alpha=\alpha_{2}, noting that HH is fixed, so we may write Pθ=π1Qα1+(1π1)Qα2P_{\theta}=\pi_{1}Q_{\alpha_{1}}+(1-\pi_{1})Q_{\alpha_{2}}.

Having specified the kernel PθP_{\theta}, let Gk(Θ)G\in\mathcal{E}_{k}(\Theta). The mixture of product distributions PG,NP_{G,N} is defined in the same way as before (see Eq. (1)). Now we show that for G0k0(Θ)=k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ})=\mathcal{E}_{k_{0}}(\Theta), (21) and (23) hold by applying Corollary 5.17 via a suitable map TT.

Consider a map T:𝔛3T:\mathfrak{X}\to\mathbb{R}^{3} defined by Tx=((x(B))2,(x(B))3,(x(B))4)Tx=((x(B))^{2},(x(B))^{3},(x(B))^{4})^{\top} for some B𝒵B\in\mathscr{Z} to be specified later. The reason we restrict the domain Θ\Theta as above is that this particular choice of map can then be shown to be admissible. Define T1:𝔛T_{1}:\mathfrak{X}\to\mathbb{R} by T1x=x(B)T_{1}x=x(B) and T2:3T_{2}:\mathbb{R}\to\mathbb{R}^{3} by T2z=(z2,z3,z4)T_{2}z=(z^{2},z^{3},z^{4})^{\top}. Then T=T2T1T=T_{2}\circ T_{1}. For XPθX\sim P_{\theta}, T1XT_{1}X has distribution

PθT11=π1(Qα1T11)+π2(Qα2T11).P_{\theta}\circ T_{1}^{-1}=\pi_{1}\left(Q_{\alpha_{1}}\circ T_{1}^{-1}\right)+\pi_{2}\left(Q_{\alpha_{2}}\circ T_{1}^{-1}\right).

where π2=1π1\pi_{2}=1-\pi_{1}. By a standard property of the Dirichlet distribution, as Qα=𝒟αHQ_{\alpha}=\mathscr{D}_{\alpha H}, the push-forward QαT11Q_{\alpha}\circ T_{1}^{-1} is the Beta distribution Beta(αH(B),α(1H(B)))\text{Beta}(\alpha H(B),\alpha(1-H(B))). Thus with ξ=H(B)\xi=H(B), QαT11Q_{\alpha}\circ T_{1}^{-1} has density w.r.t. Lebesgue measure on \mathbb{R}

g(z|α,ξ)=1B(αξ,α(1ξ))zαξ1(1z)α(1ξ)1𝟏(0,1)(z),g(z|\alpha,\xi)=\frac{1}{B(\alpha\xi,\alpha(1-\xi))}z^{\alpha\xi-1}(1-z)^{\alpha(1-\xi)-1}\bm{1}_{(0,1)}(z),

where B(,)B(\cdot,\cdot) is the beta function. Then PθT11P_{\theta}\circ T_{1}^{-1} has density w.r.t. Lebesgue measure π1g(z|α1,ξ)+π2g(z|α2,ξ)\pi_{1}g(z|\alpha_{1},\xi)+\pi_{2}g(z|\alpha_{2},\xi).

Now, the push-forward measure PθT1=(PθT11)T21P_{\theta}\circ T^{-1}=(P_{\theta}\circ T_{1}^{-1})\circ T_{2}^{-1} has mean λ(θ)3\lambda(\theta)\in\mathbb{R}^{3} with

λ(j)(θ)=i=12πizj+1g(z|αi,ξ)𝑑z=i=12πi=0jαiξ+αi+j=1,2,3\lambda^{(j)}(\theta)=\sum_{i=1}^{2}\pi_{i}\int_{\mathbb{R}}z^{j+1}g(z|\alpha_{i},\xi)dz=\sum_{i=1}^{2}\pi_{i}\prod_{\ell=0}^{j}\frac{\alpha_{i}\xi+\ell}{\alpha_{i}+\ell}\quad\forall j=1,2,3

and has covariance matrix Λ\Lambda with its jβj\beta entry given by

Λjβ(θ)=i=12πizj+β+2g(z|αi,ξ)𝑑zλj(θ)λβ(θ)=i=12πi=0j+β+1αiξ+αi+λj(θ)λβ(θ).\Lambda_{j\beta}(\theta)=\sum_{i=1}^{2}\pi_{i}\int_{\mathbb{R}}z^{j+\beta+2}g(z|\alpha_{i},\xi)dz-\lambda^{j}(\theta)\lambda^{\beta}(\theta)=\sum_{i=1}^{2}\pi_{i}\prod_{\ell=0}^{j+\beta+1}\frac{\alpha_{i}\xi+\ell}{\alpha_{i}+\ell}-\lambda^{j}(\theta)\lambda^{\beta}(\theta).

It follows immediately from these formulae that λ(θ)\lambda(\theta) and Λ(θ)\Lambda(\theta) are continuous on Θ\Theta, i.e., (A1) in Definition 5.14 is satisfied. Furthermore, observe that PθT1P_{\theta}\circ T^{-1} has characteristic function

ϕT(ζ|θ)=π1h(ζ|α1,ξ)+π2h(ζ|α2,ξ)\phi_{T}(\zeta|\theta)=\pi_{1}h(\zeta|\alpha_{1},\xi)+\pi_{2}h(\zeta|\alpha_{2},\xi)

where h(ζ|α,ξ)=exp(𝒊j=13ζ(j)zj)g(z|α,ξ)𝑑zh(\zeta|\alpha,\xi)=\int_{\mathbb{R}}\exp(\bm{i}\sum_{j=1}^{3}\zeta^{(j)}z^{j})g(z|\alpha,\xi)dz. The verification of (A2) in Definition 5.14 is omitted since it is a straightforward application of the dominated convergence theorem. In Appendix F.3 we provide detailed calculations to partially verify condition (A3), so that it remains to establish that there exists some r1r\geq 1 such that 3|ϕT(ζ|θ)|rdζ\int_{\mathbb{R}^{3}}\left|\phi_{T}(\zeta|\theta)\right|^{r}d\zeta is upper bounded on Θ\Theta by a finite continuous function of θ\theta. So far we have verified (A1), (A2) and some parts of (A3) for the chosen TT for every BB.
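The product formula for the Beta moments entering λ^{(j)}(θ) can be checked numerically (an illustrative sketch; the values of α and ξ below are arbitrary, chosen so that the Beta density stays bounded):

```python
import numpy as np
from math import gamma

def beta_density(z, a, b):
    """Density of Beta(a, b) on (0, 1)."""
    return z ** (a - 1) * (1 - z) ** (b - 1) * gamma(a + b) / (gamma(a) * gamma(b))

alpha, xi = 5.0, 0.4                       # so a = alpha*xi = 2, b = alpha*(1-xi) = 3
a, b = alpha * xi, alpha * (1 - xi)
z = np.linspace(0.0, 1.0, 400_000)
dens = beta_density(z, a, b)
dz = z[1] - z[0]
for j in (1, 2, 3):
    # E Z^(j+1) = prod_{l=0}^{j} (alpha*xi + l) / (alpha + l)
    closed = np.prod([(alpha * xi + l) / (alpha + l) for l in range(j + 1)])
    numeric = np.sum(z ** (j + 1) * dens) * dz
    assert abs(closed - numeric) < 1e-4
```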

To continue the verification of (A3) we now specify BB. For G0=i=1k0pi0δθi0G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}} with θi0=(π1i0,α1i0,α2i0)Θ\theta_{i}^{0}=(\pi_{1i}^{0},\alpha_{1i}^{0},\alpha_{2i}^{0})\in\Theta, let BB be such that ξ=H(B)(1/mini[k0]α1i0,11/mini[k0]α1i0)\xi=H(B)\in(1/\min_{i\in[k_{0}]}\alpha_{1i}^{0},1-1/\min_{i\in[k_{0}]}\alpha_{1i}^{0}). Notice that since α1i0>2\alpha_{1i}^{0}>2, (1/mini[k0]α1i0,11/mini[k0]α1i0)(1/\min_{i\in[k_{0}]}\alpha_{1i}^{0},1-1/\min_{i\in[k_{0}]}\alpha_{1i}^{0}) is not empty. Hence to verify the condition (A3) in Definition 5.14 w.r.t. {θi0}i=1k0\{\theta_{i}^{0}\}_{i=1}^{k_{0}} for TT with the BB specified it suffices to establish there exists some r1r\geq 1 such that 3|ϕT(ζ|θ)|rdζ\int_{\mathbb{R}^{3}}\left|\phi_{T}(\zeta|\theta)\right|^{r}d\zeta in a small neighborhood of θ0\theta_{0} is upper bounded by a finite continuous function of θ\theta for each θ0{θi0}i=1k0\theta_{0}\in\{\theta_{i}^{0}\}_{i=1}^{k_{0}}.

The function g(z|α,ξ)g(z|\alpha,\xi) is differentiable w.r.t. zz on \{0,1}\mathbb{R}\backslash\{0,1\}, and for z0,1z\neq 0,1, g(z|α,ξ)z\frac{\partial g(z|\alpha,\xi)}{\partial z} is

𝟏(0,1)(z)B(αξ,α(1ξ))((αξ1)zαξ2(1z)α(1ξ)1(α(1ξ)1)zαξ1(1z)α(1ξ)2),\frac{\bm{1}_{(0,1)}(z)}{B(\alpha\xi,\alpha(1-\xi))}\left((\alpha\xi-1)z^{\alpha\xi-2}(1-z)^{\alpha(1-\xi)-1}-(\alpha(1-\xi)-1)z^{\alpha\xi-1}(1-z)^{\alpha(1-\xi)-2}\right),

which is in L1L^{1} provided that αξ>1\alpha\xi>1 and α(1ξ)>1\alpha(1-\xi)>1; this holds whenever αmini[k0]α1i0γ\alpha\geq\min_{i\in[k_{0}]}\alpha_{1i}^{0}-\gamma for some small γ>0\gamma>0 that depends on TT through ξ\xi. Moreover, g(z|α,ξ)g(z|\alpha,\xi) and z2g(z|α,ξ)z^{2}g(z|\alpha,\xi) are both increasing on (,1)(-\infty,-1) and decreasing on (1,)(1,\infty). Now, by appealing to Lemma 7.4, for r>12r>12 and for αmini[k0]α1i0γ\alpha\geq\min_{i\in[k_{0}]}\alpha_{1i}^{0}-\gamma

h(ζ|α,ξ)Lr(3)\displaystyle\left\|h(\zeta|\alpha,\xi)\right\|_{L^{r}(\mathbb{R}^{3})}
\displaystyle\leq C(r)(1+2)2(z2g(z|α,ξ)L1+3g(z|α,ξ)L+(z2+1)g(z|α,ξ)zL1+1)\displaystyle C(r)(1+2)^{2}\left(\|z^{2}g(z|\alpha,\xi)\|_{L^{1}}+3\|g(z|\alpha,\xi)\|_{L^{\infty}}+\left\|(z^{2}+1)\frac{\partial g(z|\alpha,\xi)}{\partial z}\right\|_{L^{1}}+1\right)
:=\displaystyle:= h5(α,ξ),\displaystyle h_{5}(\alpha,\xi),

where C(r)C(r) is a constant that depends only on rr. It can be verified easily by the dominated convergence theorem that h5(α,ξ)h_{5}(\alpha,\xi) is a continuous function of α\alpha. Then for θ\theta in a neighborhood of θ0{θi0}i=1k0\theta_{0}\in\{\theta_{i}^{0}\}_{i=1}^{k_{0}} such that α1,α2α1i0γ\alpha_{1},\alpha_{2}\geq\alpha_{1i}^{0}-\gamma,

ϕT(ζ|θ)Lr(3)π1h(ζ|α1,ξ)Lr+π2h(ζ|α2,ξ)Lrπ1h5(α1,ξ)+π2h5(α2,ξ),\displaystyle\|\phi_{T}(\zeta|\theta)\|_{L^{r}(\mathbb{R}^{3})}\leq\pi_{1}\left\|h(\zeta|\alpha_{1},\xi)\right\|_{L^{r}}+\pi_{2}\left\|h(\zeta|\alpha_{2},\xi)\right\|_{L^{r}}\leq\pi_{1}h_{5}(\alpha_{1},\xi)+\pi_{2}h_{5}(\alpha_{2},\xi),

which is a finite continuous function of θ=(π1,α1,α2)\theta=(\pi_{1},\alpha_{1},\alpha_{2}). We have thus verified that TT with the specified BB is admissible w.r.t. {θi0}i=1k0\{\theta_{i}^{0}\}_{i=1}^{k_{0}}.

Moreover, it can also be verified that λ(θ)\lambda(\theta) for TT is injective on Θ\Theta provided that ξ13,12,23\xi\neq\frac{1}{3},\frac{1}{2},\frac{2}{3}. By calculation, the Jacobian matrix Jλ(θ)J_{\lambda}(\theta) of λ(θ)\lambda(\theta) satisfies

det(Jλ)(θ)=6(ξ1)3ξ3(2ξ1)(3ξ1)(3ξ2)π1π2(α1α2)4i=12((1+αi)2(2+αi)2(3+αi)2)0\text{det}(J_{\lambda})(\theta)=-\frac{6(\xi-1)^{3}\xi^{3}(2\xi-1)(3\xi-1)(3\xi-2)\pi_{1}\pi_{2}(\alpha_{1}-\alpha_{2})^{4}}{\prod_{i=1}^{2}\left((1+\alpha_{i})^{2}(2+\alpha_{i})^{2}(3+\alpha_{i})^{2}\right)}\not=0

on Θ\Theta provided that ξ13,12,23\xi\neq\frac{1}{3},\frac{1}{2},\frac{2}{3}; so Jλ(θ)J_{\lambda}(\theta) is of full rank for each θΘ\theta\in\Theta provided that ξ13,12,23\xi\neq\frac{1}{3},\frac{1}{2},\frac{2}{3}. In summary, for G0=i=1k0pi0δθi0G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}} with θi0=(π1i0,α1i0,α2i0)Θ\theta_{i}^{0}=(\pi_{1i}^{0},\alpha_{1i}^{0},\alpha_{2i}^{0})\in\Theta, Tx=((x(B))2,(x(B))3,(x(B))4)Tx=((x(B))^{2},(x(B))^{3},(x(B))^{4})^{\top} with BB such that

ξ=H(B)(1mini[k0]α1i0,11mini[k0]α1i0)\{13,12,23}\xi=H(B)\in\left(\frac{1}{\min_{i\in[k_{0}]}\alpha_{1i}^{0}},1-\frac{1}{\min_{i\in[k_{0}]}\alpha_{1i}^{0}}\right)\biggr{\backslash}\left\{\frac{1}{3},\frac{1}{2},\frac{2}{3}\right\}

satisfies all the conditions in Corollary 5.17 and thus (21) and (23) hold.

8 Sharpness of bounds and minimax theorem

8.1 Sharpness of inverse bounds

In this subsection we consider reverse upper bounds for (21), which are also reverse upper bounds for (23) by (14). Inverse bounds of the form (21) hold only under some identifiability conditions, while the following upper bound holds generally and is much easier to show.

Lemma 8.1.

Let k02k_{0}\geq 2 and fix G0=i=1k0pi0δθi0k0(Θ)G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}}\in\mathcal{E}_{k_{0}}(\Theta). Then for any N1{N}\geq 1

lim infGW1G0Gk0(Θ)V(PG,N,PG0,N)DN(G,G0)lim infGW1G0Gk0(Θ)V(PG,N,PG0,N)D1(G,G0)12.\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{{N}}(G,G_{0})}\leq\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{1}(G,G_{0})}\leq\frac{1}{2}.
Proof.

Consider G=i=1k0piδθi0G_{\ell}=\sum_{i=1}^{k_{0}}p_{i}^{\ell}\delta_{\theta_{i}^{0}} with pi=pi0p_{i}^{\ell}=p_{i}^{0} for 3k03\leq\ell\leq k_{0} and p1=p10+1p_{1}^{\ell}=p_{1}^{0}+\frac{1}{\ell}, p2=p201p_{2}^{\ell}=p_{2}^{0}-\frac{1}{\ell}. Then for sufficiently large \ell, p1,p2(0,1)p_{1}^{\ell},p_{2}^{\ell}\in(0,1) and hence Gk0(Θ)\{G0}G_{\ell}\in\mathcal{E}_{k_{0}}(\Theta)\backslash\{G_{0}\} and satisfies DN(G,G0)=D1(G,G0)=2/D_{{N}}(G_{\ell},G_{0})=D_{1}(G_{\ell},G_{0})=2/\ell. Thus for sufficiently large \ell,

V(PGℓ,N,PG0,N)D1(Gℓ,G0)=2supA𝒜N|1NPθ10(A)1NPθ20(A)|=12V(NPθ10,NPθ20)12.\frac{V(P_{G_{\ell},{N}},P_{G_{0},{N}})}{D_{1}(G_{\ell},G_{0})}=\frac{\ell}{2}\sup_{A\in\mathcal{A}^{N}}\left|\frac{1}{\ell}\bigotimes^{{N}}P_{\theta_{1}^{0}}(A)-\frac{1}{\ell}\bigotimes^{{N}}P_{\theta_{2}^{0}}(A)\right|=\frac{1}{2}V\left(\bigotimes^{{N}}P_{\theta_{1}^{0}},\bigotimes^{{N}}P_{\theta_{2}^{0}}\right)\leq\frac{1}{2}. ∎
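The construction in this proof can be illustrated numerically (a sketch with N = 1, Gaussian kernels, and arbitrary parameter values of our choosing): perturbing only the mixing weights of two shared atoms yields a ratio V/D₁ equal to V(P_{θ₁⁰}, P_{θ₂⁰})/2 ≤ 1/2.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 200_000)
dx = x[1] - x[0]
phi = lambda m: np.exp(-(x - m)**2 / 2) / np.sqrt(2 * np.pi)
f1, f2 = phi(-1.0), phi(1.5)               # the two shared atoms P_theta1, P_theta2

p = np.array([0.4, 0.6])                   # weights of G_0
eps = 0.05                                 # the 1/ell perturbation of the weights
pg0 = p[0] * f1 + p[1] * f2
pg = (p[0] + eps) * f1 + (p[1] - eps) * f2

V = 0.5 * np.sum(np.abs(pg - pg0)) * dx    # total variation distance V(P_G, P_G0)
D1 = 2 * eps                               # D_1(G_ell, G_0) as in the proof
ratio = V / D1                             # equals V(f1, f2)/2 <= 1/2
assert ratio <= 0.5 + 1e-9
```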

The next lemma establishes an upper bound on the Hellinger distance between two mixtures of product measures in terms of the Hellinger distances between the individual components. It is an improvement of [34, Lemma 3.2 (a)]. Such a result is useful in Lemma 8.3. A similar result for the variational distance is Lemma C.3.

Lemma 8.2.

For any G=i=1k0piδθiG=\sum_{i=1}^{k_{0}}p_{i}\delta_{\theta_{i}} and G=i=1k0piδθiG^{\prime}=\sum_{i=1}^{k_{0}}p^{\prime}_{i}\delta_{\theta^{\prime}_{i}},

h(PG,N,PG,N)minτ(Nmax1ik0h(Pθi,Pθτ(i))+12i=1k0|pipτ(i)|),h(P_{G,{N}},P_{G^{\prime},{N}})\leq\min_{\tau}\left(\sqrt{{N}}\max_{1\leq i\leq k_{0}}h\left(P_{\theta_{i}},P_{\theta^{\prime}_{\tau(i)}}\right)+\sqrt{\frac{1}{2}\sum_{i=1}^{k_{0}}\left|p_{i}-p_{\tau(i)}^{\prime}\right|}\right),

where the minimum is taken over all τ\tau in the permutation group Sk0S_{k_{0}}.
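Lemma 8.2 can be checked numerically in a simple instance (a sketch with N = 1, k₀ = 2, Gaussian kernels, the identity permutation in place of the minimizing τ, and the convention h²(p, q) = ½∫(√p − √q)², under which h² ≤ V; all parameter values below are arbitrary):

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 200_000)
dx = x[1] - x[0]
phi = lambda m: np.exp(-(x - m)**2 / 2) / np.sqrt(2 * np.pi)

def hellinger(f, g):
    return np.sqrt(0.5 * np.sum((np.sqrt(f) - np.sqrt(g))**2) * dx)

p, q = [0.4, 0.6], [0.5, 0.5]              # mixing weights of G and G'
mus, mus_p = [-1.0, 2.0], [-0.8, 2.3]      # atoms of G and perturbed atoms of G'
pg = p[0] * phi(mus[0]) + p[1] * phi(mus[1])
pg_p = q[0] * phi(mus_p[0]) + q[1] * phi(mus_p[1])

lhs = hellinger(pg, pg_p)                  # h(P_G, P_G') with N = 1
rhs = (max(hellinger(phi(mus[0]), phi(mus_p[0])),
           hellinger(phi(mus[1]), phi(mus_p[1])))
       + np.sqrt(0.5 * (abs(p[0] - q[0]) + abs(p[1] - q[1]))))
assert lhs <= rhs                          # the bound of Lemma 8.2 (identity tau)
```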

The inverse bounds expressed by Eq. (21) are optimal as far as the role of N{N} in DND_{N} is concerned. This is made precise by the following result.

Lemma 8.3 (Optimality of N\sqrt{{N}} for atoms).

Fix G0=i=1k0piδθi0k0(Θ)G_{0}=\sum_{i=1}^{k_{0}}p_{i}\delta_{\theta_{i}^{0}}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). Suppose there exists j[k0]j\in[k_{0}] such that lim infθθj0h(Pθ,Pθj0)θθj02<\liminf\limits_{\theta\to\theta_{j}^{0}}\frac{h(P_{\theta},P_{\theta^{0}_{j}})}{\|\theta-\theta_{j}^{0}\|_{2}}<\infty . Then for ψ(N)\psi({N}) such that ψ(N)N\frac{\psi({N})}{{N}}\to\infty,

lim supNlim infGW1G0Gk0(Θ)h(PG,N,PG0,N)Dψ(N)(G,G0)=0.\limsup_{{N}\to\infty}\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{h(P_{G,{N}},P_{G_{0},{N}})}{D_{\psi({N})}(G,G_{0})}=0.

Lemma 8.3 establishes that N\sqrt{N} is optimal for the coefficients of the component parameters θi\theta_{i} in DND_{N}. The next lemma establishes that the constant coefficients of the mixing proportions pip_{i} in DND_{N} are also optimal. For G=i=1k0piδθiG=\sum_{i=1}^{k_{0}}p_{i}\delta_{\theta_{i}} and G=i=1k0piδθiG^{\prime}=\sum_{i=1}^{k_{0}}p^{\prime}_{i}\delta_{\theta^{\prime}_{i}}, define

D¯r(G,G)=minτSk0i=1k0(θτ(i)θi2+r|pτ(i)pi|).\bar{D}_{r}(G,G^{\prime})=\min_{\tau\in S_{k_{0}}}\sum_{i=1}^{k_{0}}\left(\|\theta_{\tau(i)}-\theta^{\prime}_{i}\|_{2}+r|p_{\tau(i)}-p^{\prime}_{i}|\right).

It states that the vanishing of V(PG,N,PG0,N)V(P_{G,N},P_{G_{0},N}) may not induce a faster convergence rate for the mixing proportions pip_{i} in terms of NN as the exchangeable length NN increases.

Lemma 8.4 (Optimality of constant coefficient for mixing proportions).

Fix G0=i=1k0piδθi0k0(Θ)G_{0}=\sum_{i=1}^{k_{0}}p_{i}\delta_{\theta_{i}^{0}}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). Suppose that the map θPθ\theta\to P_{\theta} is injective. Then for ψ(N)\psi({N}) such that ψ(N)\psi({N})\to\infty,

lim supNlim infGW1G0Gk0(Θ)V(PG,N,PG0,N)D¯ψ(N)(G,G0)=0.\limsup_{{N}\to\infty}\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{\bar{D}_{\psi({N})}(G,G_{0})}=0.
Proof.

Consider G=i=1k0piδθik0(Θ)G_{\ell}=\sum_{i=1}^{k_{0}}p_{i}^{\ell}\delta_{\theta_{i}^{\ell}}\in\mathcal{E}_{k_{0}}(\Theta) with θi=θi0\theta_{i}^{\ell}=\theta_{i}^{0} for any ii and pi=pi0p_{i}^{\ell}=p_{i}^{0} for i3i\geq 3, p1=p10+1/p_{1}^{\ell}=p_{1}^{0}+1/\ell, p2=p201/p_{2}^{\ell}=p_{2}^{0}-1/\ell. Then for large \ell, D¯ψ(N)(G,G0)=ψ(N)(|p1p10|+|p2p20|)=2ψ(N)/\bar{D}_{\psi(N)}(G_{\ell},G_{0})=\psi(N)(|p_{1}^{\ell}-p_{1}^{0}|+|p_{2}^{\ell}-p_{2}^{0}|)=2\psi(N)/\ell. Note that V(PG,N,PG0,N)=V(NPθ10,NPθ20)/V(P_{G_{\ell},N},P_{G_{0},N})=V(\bigotimes^{N}P_{\theta_{1}^{0}},\bigotimes^{N}P_{\theta_{2}^{0}})/\ell and hence

lim infGW1G0Gk0(Θ)V(PG,N,PG0,N)D¯ψ(N)(G,G0)V(PGℓ,N,PG0,N)D¯ψ(N)(Gℓ,G0)=V(NPθ10,NPθ20)2ψ(N)12ψ(N),\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{\bar{D}_{\psi({N})}(G,G_{0})}\leq\frac{V(P_{G_{\ell},{N}},P_{G_{0},{N}})}{\bar{D}_{\psi({N})}(G_{\ell},G_{0})}=\frac{V\left(\bigotimes^{{N}}P_{\theta_{1}^{0}},\bigotimes^{{N}}P_{\theta_{2}^{0}}\right)}{2\psi(N)}\leq\frac{1}{2\psi({N})},

which completes the proof. ∎

A slightly curious and pedantic way to gauge the meaning of the double infimum limiting arguments in the inverse bound (21), is to express its claim as follows:

0<lim infNlim infGW1G0Gk0(Ξ)V(PG,N,PG0,N)DN(G,G0)=limkinfNklimϵ0infGBW1(G0,ϵ)\{G0}V(PG,N,PG0,N)DN(G,G0),0<\liminf_{{N}\to\infty}\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Xi)\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{{N}}(G,G_{0})}=\lim_{k\to\infty}\ \inf_{{N}\geq k}\ \lim_{\epsilon\to 0}\ \inf_{G\in B_{W_{1}}(G_{0},\epsilon)\backslash\{G_{0}\}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{{N}}(G,G_{0})},

where BW1(G0,R)k0(Θ)B_{W_{1}}(G_{0},R)\subset\mathcal{E}_{k_{0}}(\Theta) is defined in (10). It is possible to alter the order of the four operations and consider the resulting outcome. The following lemma shows that the order in the last display is the only one that can possibly yield a positive outcome.

Lemma 8.5.
  1. a)
    limklimϵ0infNkinfGBW1(G0,ϵ)\{G0}V(PG,N,PG0,N)DN(G,G0)=limklimϵ0infGBW1(G0,ϵ)\{G0}infNkV(PG,N,PG0,N)DN(G,G0)=0\lim_{k\to\infty}\ \lim_{\epsilon\to 0}\ \inf_{{N}\geq k}\ \inf_{G\in B_{W_{1}}(G_{0},\epsilon)\backslash\{G_{0}\}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{{N}}(G,G_{0})}\\ =\lim_{k\to\infty}\ \lim_{\epsilon\to 0}\ \inf_{G\in B_{W_{1}}(G_{0},\epsilon)\backslash\{G_{0}\}}\ \inf_{{N}\geq k}\ \frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{{N}}(G,G_{0})}=0
  2. b)
    limϵ0limkinfNkinfGBW1(G0,ϵ)\{G0}V(PG,N,PG0,N)DN(G,G0)=limϵ0limkinfGBW1(G0,ϵ)\{G0}infNkV(PG,N,PG0,N)DN(G,G0)=0\lim_{\epsilon\to 0}\ \lim_{k\to\infty}\ \inf_{{N}\geq k}\ \inf_{G\in B_{W_{1}}(G_{0},\epsilon)\backslash\{G_{0}\}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{{N}}(G,G_{0})}\\ =\lim_{\epsilon\to 0}\ \lim_{k\to\infty}\ \inf_{G\in B_{W_{1}}(G_{0},\epsilon)\backslash\{G_{0}\}}\ \inf_{{N}\geq k}\ \frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{{N}}(G,G_{0})}=0
  3. c)
    limϵ0infGBW1(G0,ϵ)\{G0}limkinfNkV(PG,N,PG0,N)DN(G,G0)=0.\lim_{\epsilon\to 0}\ \inf_{G\in B_{W_{1}}(G_{0},\epsilon)\backslash\{G_{0}\}}\lim_{k\to\infty}\ \inf_{{N}\geq k}\ \frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{{N}}(G,G_{0})}=0.
Proof.

The claims follow from

infNkinfGBW1(G0,ϵ)\{G0}V(PG,N,PG0,N)DN(G,G0)=infGBW1(G0,ϵ)\{G0}infNkV(PG,N,PG0,N)DN(G,G0)\inf_{{N}\geq k}\ \inf_{G\in B_{W_{1}}(G_{0},\epsilon)\backslash\{G_{0}\}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{{N}}(G,G_{0})}=\inf_{G\in B_{W_{1}}(G_{0},\epsilon)\backslash\{G_{0}\}}\inf_{{N}\geq k}\ \frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{{N}}(G,G_{0})}

and

infNkV(PG,N,PG0,N)DN(G,G0)infNk1DN(G,G0)=0,\inf_{{N}\geq k}\ \frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{{N}}(G,G_{0})}\leq\inf_{{N}\geq k}\ \frac{1}{D_{{N}}(G,G_{0})}=0,

where the last equality holds for any GG whose atoms do not all coincide with those of G0G_{0}, since for such GG we have DN(G,G0)D_{{N}}(G,G_{0})\to\infty as N{N}\to\infty, and every ball BW1(G0,ϵ)B_{W_{1}}(G_{0},\epsilon) contains such GG. ∎
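The mechanism behind the vanishing infimum is the √N growth of D_N whenever the atoms of G and G0 differ, while V is bounded by 1. A small numerical sketch (with hypothetical distance values) of how the ratio 1/D_N decays:

```python
import math

def D_N_lower(d_atoms, d_weights, N):
    # cf. (51): D_N(G, G_0) >= sqrt(N) * d_atoms + d_weights
    return math.sqrt(N) * d_atoms + d_weights

# For a fixed G whose atoms differ from those of G_0 (d_atoms > 0),
# V <= 1 while D_N grows without bound, so inf_{N >= k} V/D_N = 0.
d_atoms, d_weights = 0.3, 0.1           # hypothetical distances
ratios = [1.0 / D_N_lower(d_atoms, d_weights, N) for N in (1, 10**2, 10**4, 10**6)]
print(ratios)
```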

8.2 Minimax lower bounds

Given G=i=1k0piδθik0(Θ)G=\sum_{i=1}^{k_{0}}p_{i}\delta_{\theta_{i}}\in\mathcal{E}_{k_{0}}(\Theta) and G=i=1k0piδθik0(Θ)G^{\prime}=\sum_{i=1}^{k_{0}}p^{\prime}_{i}\delta_{\theta^{\prime}_{i}}\in\mathcal{E}_{k_{0}}(\Theta), define additional notions of distances

d𝚯(G,G):=minτSk0i=1k0θτ(i)θi2\displaystyle d_{\bm{\Theta}}(G^{\prime},G):=\min_{\tau\in S_{k_{0}}}\sum_{i=1}^{k_{0}}\|\theta^{\prime}_{\tau(i)}-\theta_{i}\|_{2} (49)
d𝒑(G,G):=minτSk0i=1k0|pτ(i)pi|.\displaystyle d_{\bm{p}}(G^{\prime},G):=\min_{\tau\in S_{k_{0}}}\sum_{i=1}^{k_{0}}|p^{\prime}_{\tau(i)}-p_{i}|. (50)

Notice that dΘd_{\Theta} denotes a distance on Θ\Theta in Section 3, while the d𝚯d_{\bm{\Theta}} with bold subscript is defined on k0(Θ)\mathcal{E}_{k_{0}}(\Theta). These two notions of distance are pseudometrics on the space of measures k0(Θ)\mathcal{E}_{k_{0}}(\Theta), i.e., they share the same properties as a metric except that they allow the distance between two distinct points to be zero. d𝚯(G,G)d_{\bm{\Theta}}(G^{\prime},G) focuses on the distance between the atoms of the two mixing measures, while d𝒑(G,G)d_{\bm{p}}(G^{\prime},G) focuses on their mixing probabilities. It is clear that

DN(G,G)Nd𝚯(G,G)+d𝒑(G,G).D_{N}(G,G^{\prime})\geq\sqrt{N}d_{\bm{\Theta}}(G,G^{\prime})+d_{\bm{p}}(G,G^{\prime}). (51)
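The pseudometrics (49)–(50) and the inequality (51) are straightforward to compute by enumerating permutations for small k0. The sketch below assumes scalar atoms and that D_N optimizes a single permutation over the combined cost √N·|θ′−θ| + |p′−p|, which is consistent with (51); the example values are hypothetical.

```python
import math
from itertools import permutations

def d_theta(G1, G2):
    # atom pseudometric (49): best matching of atom locations
    (_, th), (_, sh) = G1, G2
    k = len(th)
    return min(sum(abs(sh[t[i]] - th[i]) for i in range(k))
               for t in permutations(range(k)))

def d_p(G1, G2):
    # weight pseudometric (50): best matching of mixing probabilities
    (p, _), (q, _) = G1, G2
    k = len(p)
    return min(sum(abs(q[t[i]] - p[i]) for i in range(k))
               for t in permutations(range(k)))

def D_N(G1, G2, N):
    # one permutation optimizing the combined cost sqrt(N)*|dtheta| + |dp|
    (p, th), (q, sh) = G1, G2
    k = len(p)
    return min(sum(math.sqrt(N) * abs(sh[t[i]] - th[i]) + abs(q[t[i]] - p[i])
                   for i in range(k))
               for t in permutations(range(k)))

G  = ([0.3, 0.7], [0.0, 1.0])    # (weights, atoms), hypothetical
Gp = ([0.4, 0.6], [0.1, 1.2])
N = 25
# inequality (51): D_N >= sqrt(N) * d_theta + d_p
print(D_N(G, Gp, N), math.sqrt(N) * d_theta(G, Gp) + d_p(G, Gp))
```

Allowing separate optimal permutations in d_Θ and d_p can only decrease each term, which is exactly why (51) holds.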

We proceed to present minimax lower bounds for any sequence of estimators G^\hat{G}, which are measurable functions of X[N]1,,X[N]mX^{1}_{[{N}]},\ldots,X_{[{N}]}^{{m}}, where the sequence lengths are assumed to be equal for simplicity. The minimax bounds are stated in terms of the aforementioned (pseudo-)metrics d𝒑d_{\bm{p}} and d𝚯d_{\bm{\Theta}}, as well as the metric DND_{N} studied earlier.

Theorem 8.6 (Minimax Lower Bound).

In the following three bounds the infimum is taken over all G^\hat{G} that are measurable functions of X[N]1,,X[N]mX^{1}_{[{N}]},\ldots,X_{[{N}]}^{{m}}.

  1. a)

    Suppose there exists θ0Θ\theta_{0}\in\Theta^{\circ} and β0>0\beta_{0}>0 such that lim supθθ0h(Pθ,Pθ0)θθ02β0<\limsup\limits_{\theta\to\theta_{0}}\frac{h\left(P_{\theta},P_{\theta_{0}}\right)}{\|{\theta}-{\theta_{0}}\|_{2}^{\beta_{0}}}<\infty. Moreover, suppose there exists a set of distinct k01k_{0}-1 points {θi}i=1k01Θ\{θ0}\{\theta_{i}\}_{i=1}^{k_{0}-1}\subset\Theta\backslash\{\theta_{0}\} satisfying min0i<jk01h(Pθi,Pθj)>0\min_{0\leq i<j\leq k_{0}-1}h(P_{\theta_{i}},P_{\theta_{j}})>0. Then

    infG^k0(Θ)supGk0(Θ)𝔼mPG,Nd𝚯(G^,G)C(β0,k0)(1mN)1β0,\inf_{\hat{G}\in\mathcal{E}_{k_{0}}(\Theta)}\sup_{G\in\mathcal{E}_{k_{0}}(\Theta)}\mathbb{E}_{\bigotimes^{m}P_{G,{N}}}d_{\bm{\Theta}}(\hat{G},G)\geq C(\beta_{0},k_{0})\left(\frac{1}{\sqrt{{m}}\sqrt{{N}}}\right)^{\frac{1}{\beta_{0}}},

    where C(β0,k0)C(\beta_{0},k_{0}) is a constant depending on β0\beta_{0}, k0k_{0} and the probability family PθP_{\theta}.

  2. b)

    Let k02k_{0}\geq 2.

    infG^k0(Θ)supGk0(Θ)𝔼mPG,Nd𝒑(G^,G)C(k0)1m,\inf_{\hat{G}\in\mathcal{E}_{k_{0}}(\Theta)}\sup_{G\in\mathcal{E}_{k_{0}}(\Theta)}\mathbb{E}_{\bigotimes^{m}P_{G,{N}}}d_{\bm{p}}(\hat{G},G)\geq C(k_{0})\frac{1}{{m}},

    where C(k0)C(k_{0}) is a constant depending on k0k_{0} and the probability family PθP_{\theta}.

  3. c)

    Let k02k_{0}\geq 2. Suppose the conditions of part (a) hold. Then,

    infG^k0(Θ)supGk0(Θ)𝔼mPG,NDN(G^,G)C(β0,k0)N(1mN)1β0+C(k0)1m.\inf_{\hat{G}\in\mathcal{E}_{k_{0}}(\Theta)}\sup_{G\in\mathcal{E}_{k_{0}}(\Theta)}\mathbb{E}_{\bigotimes^{m}P_{G,{N}}}D_{N}(\hat{G},G)\geq C(\beta_{0},k_{0})\sqrt{N}\left(\frac{1}{\sqrt{{m}}\sqrt{{N}}}\right)^{\frac{1}{\beta_{0}}}+C(k_{0})\frac{1}{{m}}.
Remark 8.7.
  1. a)

    The condition that there exists a set of distinct k01k_{0}-1 points {θi}i=1k01Θ\{θ0}\{\theta_{i}\}_{i=1}^{k_{0}-1}\subset\Theta\backslash\{\theta_{0}\} satisfying min0i<jk01h(Pθi,Pθj)>0\min_{0\leq i<j\leq k_{0}-1}h(P_{\theta_{i}},P_{\theta_{j}})>0 immediately follows from the injectivity of the map θPθ\theta\mapsto P_{\theta} (recall that this condition is assumed throughout the paper).

  2. b)

The condition that there exists θ0Θ\theta_{0}\in\Theta^{\circ} and β0>0\beta_{0}>0 such that lim supθθ0h(Pθ,Pθ0)θθ02β0<\limsup\limits_{\theta\to\theta_{0}}\frac{h\left(P_{\theta},P_{\theta_{0}}\right)}{\|{\theta}-{\theta_{0}}\|_{2}^{\beta_{0}}}<\infty holds for most probability kernels considered in practice. For example, it is satisfied with β0=1\beta_{0}=1 for all full rank exponential families of distributions in their canonical form, as shown by Lemma E.3. It can then be shown that this condition with β0=1\beta_{0}=1 is also satisfied by full rank exponential families in the general form specified in Corollary 5.9. Notice that the same remark applies to the condition in Lemma 8.3.

  3. c)

    If conditions of Theorem 8.6 a) hold with β0=1\beta_{0}=1, then

    infG^k0(Θ)supGk0(Θ)𝔼mPG,Nd𝚯(G^,G)CmN.\inf_{\hat{G}\in\mathcal{E}_{k_{0}}(\Theta)}\sup_{G\in\mathcal{E}_{k_{0}}(\Theta)}\mathbb{E}_{\bigotimes^{m}P_{G,{N}}}d_{\bm{\Theta}}(\hat{G},G)\geq\frac{C}{\sqrt{{m}}\sqrt{{N}}}.

That is, the convergence rate of the best possible estimator in the worst case is at least 1mN\frac{1}{\sqrt{{m}}\sqrt{{N}}}. Recall that Theorem 6.2 implies that the convergence rate of the atoms is OP(ln(mN)mN)O_{P}(\sqrt{\frac{\ln({m}N)}{{m}N}}), which is obtained by replacing N¯m\bar{N}_{m} with NN in (39). It is worth noting that while the minimax rate seems to match the posterior contraction rate of the atoms up to a logarithmic factor, such a comparison is not very meaningful, as pointwise posterior contraction bounds and minimax lower bounds are generally not directly comparable. In particular, in the posterior contraction Theorem 6.2, the truth G0G_{0} is fixed and the constant hidden in OP(ln(mN)mN)O_{P}(\sqrt{\frac{\ln({m}N)}{{m}N}}) depends on G0G_{0}, which is clearly not the case in the above results obtained under the minimax framework. In short, we do not claim that the Bayesian procedure described in Theorem 6.2 is optimal in the minimax sense; nor do we claim that the bounds given in Theorem 8.6 are sharp (i.e., achievable by some statistical procedure). \Diamond
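As a quick arithmetic check of the comparison in part c): the ratio of the posterior contraction rate √(ln(mN)/(mN)) to the minimax lower rate 1/√(mN) is exactly √(ln(mN)), so the gap between the two rates is only logarithmic.

```python
import math

for mN in (10**3, 10**6, 10**9):
    minimax_rate = 1 / math.sqrt(mN)                 # lower-bound rate, beta_0 = 1
    posterior_rate = math.sqrt(math.log(mN) / mN)    # contraction rate of atoms
    # the ratio equals sqrt(ln(mN)): a slowly growing logarithmic factor
    print(mN, posterior_rate / minimax_rate, math.sqrt(math.log(mN)))
```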

9 Extensions and discussions

9.1 On compactness assumption

In Theorem 6.2 we impose the condition that the parameter subset Θ1\Theta_{1} is compact. This appears to be a strong assumption, although it is a rather standard one for most theoretical investigations of parameter estimation in finite mixture models (see [9, 32, 25, 26, 17, 47]). We surmise that in the context of mixture models it might not be possible to achieve the global parameter estimation rate without a suitable constraint on the parameter space, such as compactness. In this subsection we clarify the role of the compactness condition within our approach and discuss possible ways to relax compactness to boundedness.

The proof of Theorem 6.2 follows the basic structure of Lemma 6.5. To obtain the posterior contraction rate for mixture densities and the posterior consistency w.r.t. W1W_{1} for a general probability kernel f(x|θ)f(x|\theta), the global inverse bound of Lemma 5.6 is applied (for example, the posterior contraction rate for mixture densities and Lemma 5.6 together imply the posterior consistency w.r.t. W1W_{1}). The compactness of Θ1\Theta_{1} is only used to establish Lemma 5.6. It might be possible to obtain a posterior contraction result for the population density estimation and the posterior consistency result without Lemma 5.6 (e.g. by an existence-of-tests argument), but such an approach would additionally require stronger and perhaps explicit knowledge of the kernel f(x|θ)f(x|\theta), and thus is beyond the scope of this paper.


In this subsection we provide a substitute for the compactness assumption in Lemma 5.6, which removes the compactness assumption in Theorem 6.2. It is clear that Θ\Theta is required to be a bounded set. The compactness assumption may then be relaxed to this necessary boundedness assumption, provided that an identifiability condition additionally holds. This can be seen from the following claim.

Lemma 9.1.

Fix G0k0(Θ)G_{0}\in\mathcal{E}_{k_{0}}(\Theta^{\circ}). Suppose Θ\Theta is bounded. Let n1(G0)n_{1}(G_{0}) be given by (24). Suppose there exists n01n_{0}\geq 1 such that for every ϵ>0\epsilon>0

infGkk0k(Θ):W1(G,G0)>ϵh(pG,n0,pG0,n0)>0.\inf_{G\in\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta):W_{1}(G,G_{0})>\epsilon}h(p_{G,n_{0}},p_{G_{0},n_{0}})>0. (52)

Then

h(PG,N,PG0,N)C(G0,Θ)W1(G,G0),Gk=1k0k(Θ1),Nn1(G0)n0,h(P_{G,{N}},P_{G_{0},{N}})\geq C(G_{0},\Theta)W_{1}(G,G_{0}),\quad\forall G\in\bigcup_{k=1}^{k_{0}}\mathcal{E}_{k}(\Theta_{1}),\;\forall{N}\geq n_{1}(G_{0})\vee n_{0},

provided n1(G0)n0<n_{1}(G_{0})\vee n_{0}<\infty, where C(G0,Θ)>0C(G_{0},\Theta)>0 is a constant that depends on G0G_{0} and Θ\Theta.

Proof.

In this proof we write n1n_{1} for n1(G0)n_{1}(G_{0}). By the definition of n1n_{1}, for any Nn1{N}\geq n_{1}

lim infGW1G0Gk0(Θ)V(PG,N,PG0,N)D1(G,G0)>0.\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{1}(G,G_{0})}>0. (53)

By Lemma 3.2 b) one may replace the D1(G,G0)D_{1}(G,G_{0}) in the preceding display by W1(G,G0)W_{1}(G,G_{0}). Fix N1=n1n0{N}_{1}=n_{1}\vee n_{0}. Then there exists R>0R>0 depending on G0G_{0} such that

infGBW1(G0,R)\{G0}V(PG,N1,PG0,N1)W1(G,G0)>0,\inf_{G\in B_{W_{1}}(G_{0},R)\backslash\{G_{0}\}}\frac{V(P_{G,N_{1}},P_{G_{0},N_{1}})}{W_{1}(G,G_{0})}>0, (54)

where BW1(G0,R)B_{W_{1}}(G_{0},R) is the open ball in metric space (k=1k0k(Θ),W1)(\bigcup_{k=1}^{k_{0}}\mathcal{E}_{k}(\Theta),W_{1}) with center at G0G_{0} and radius RR. Here we used the fact that any sufficiently small open ball in (k=1k0k(Θ),W1)(\bigcup_{k=1}^{k_{0}}\mathcal{E}_{k}(\Theta),W_{1}) with center in k0(Θ)\mathcal{E}_{k_{0}}(\Theta) is in k0(Θ)\mathcal{E}_{k_{0}}(\Theta). By assumption (52), for any Nn0{N}\geq n_{0}

infGk=1k0k(Θ1)\BW1(G0,R)h(PG,N,PG0,N)W1(G,G0)>0.\inf_{G\in\bigcup_{k=1}^{k_{0}}\mathcal{E}_{k}(\Theta_{1})\backslash B_{W_{1}}(G_{0},R)}\frac{h(P_{G,{N}},P_{G_{0},{N}})}{W_{1}(G,G_{0})}>0.

Combining the last display with N=N1N={N}_{1} and (54) yields h(PG,N1,PG0,N1)C(G0,Θ)W1(G,G0).h(P_{G,{N}_{1}},P_{G_{0},{N}_{1}})\geq C(G_{0},\Theta)W_{1}(G,G_{0}). Observing that h(PG,N,PG0,N)h(P_{G,{N}},P_{G_{0},{N}}) increases with N{N}, the proof is complete. ∎

Despite the above possibilities for relaxing the compactness assumption on Θ1\Theta_{1}, we want to point out that other assumptions may still implicitly require compactness. For example, suppose that the kernel ff takes the explicit form f𝒩(x|μ,σ)f_{\mathcal{N}}(x|\mu,\sigma), the density of the univariate normal distribution with mean μ\mu and standard deviation σ\sigma. Then h(f𝒩(x|μ,σ1),f𝒩(x|μ,σ2))=12σ1σ2σ12+σ22h(f_{\mathcal{N}}(x|\mu,\sigma_{1}),f_{\mathcal{N}}(x|\mu,\sigma_{2}))=1-\sqrt{\frac{2\sigma_{1}\sigma_{2}}{\sigma_{1}^{2}+\sigma_{2}^{2}}}. With σ2=2σ1\sigma_{2}=2\sigma_{1}, h(f𝒩(x|μ,σ1),f𝒩(x|μ,σ2))=145h(f_{\mathcal{N}}(x|\mu,\sigma_{1}),f_{\mathcal{N}}(x|\mu,\sigma_{2}))=1-\sqrt{\frac{4}{5}}, which cannot be upper bounded by L2|σ2σ1|β0=L2σ1β0L_{2}|\sigma_{2}-\sigma_{1}|^{\beta_{0}}=L_{2}\sigma_{1}^{\beta_{0}}, a quantity that converges to 0 as σ1\sigma_{1} converges to 0. That is, the assumption (B2) cannot hold if σ\sigma is not bounded away from 0, which excludes bounded intervals of the form (0,a)(0,a).
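The displayed Hellinger computation rests on the Gaussian affinity identity ∫√(f1 f2) dx = √(2σ1σ2/(σ1²+σ2²)) for two normal densities with a common mean, so that with σ2 = 2σ1 the resulting quantity 1 − √(4/5) is indeed independent of σ1. A numerical sketch (simple Riemann-sum integration; the grid choices are arbitrary):

```python
import numpy as np

def affinity(mu, s1, s2, n=200001):
    # Bhattacharyya affinity between N(mu, s1^2) and N(mu, s2^2),
    # integrated on a grid covering mu +/- 10 standard deviations
    w = 10 * max(s1, s2)
    x = np.linspace(mu - w, mu + w, n)
    f1 = np.exp(-(x - mu) ** 2 / (2 * s1 ** 2)) / (s1 * np.sqrt(2 * np.pi))
    f2 = np.exp(-(x - mu) ** 2 / (2 * s2 ** 2)) / (s2 * np.sqrt(2 * np.pi))
    return np.sum(np.sqrt(f1 * f2)) * (x[1] - x[0])

for s1 in [1.0, 0.1, 0.01]:
    # closed form: sqrt(2*s1*s2/(s1^2+s2^2)) = sqrt(4/5), regardless of s1
    print(s1, 1 - affinity(0.0, s1, 2 * s1), 1 - np.sqrt(4 / 5))
```

The value stays bounded away from zero as σ1 → 0, which is exactly why (B2) fails near the boundary σ = 0.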

9.2 Kernel PθP_{\theta} is a location-scale mixture of Gaussian distributions

In Section 7.2 we demonstrated an application of Theorem 5.16 to obtain the inverse bound (23) when the kernel PθP_{\theta} is a location mixture of Gaussian distributions. It is of interest to extend Theorem 5.16 to richer kernels often employed in practice; the location-scale mixture of Gaussian distributions represents a salient example. Here, we shall discuss several technical difficulties that arise in such a pursuit. The first difficulty is that in Theorem 5.16, the parameter space Θ\Theta is assumed to be a subset of a Euclidean space obtained via a suitable (i.e., homeomorphic) parametrization. For the location-scale Gaussian mixture kernel, such a parametrization is elusive.

Recall that the parameter set of a kk-component location mixture of Gaussian distributions given by (43) is Θ¯:={{(πi,μi)}i=1k:i=1kπi=1,μiμj},\bar{\Theta}:=\{\{(\pi_{i},\mu_{i})\}_{i=1}^{k}:\sum_{i=1}^{k}\pi_{i}=1,\mu_{i}\neq\mu_{j}\}, or Θ~:={i=1kπiδμi:i=1kπi=1,μiμj}=k()\tilde{\Theta}:=\{\sum_{i=1}^{k}\pi_{i}\delta_{\mu_{i}}:\sum_{i=1}^{k}\pi_{i}=1,\mu_{i}\neq\mu_{j}\}=\mathcal{E}_{k}(\mathbb{R}). To apply the result in Theorem 5.16 we parametrize the kernel and index it by parameters in a subset of a suitable Euclidean space. In Section 7.2 we identified Θ¯\bar{\Theta} or Θ~\tilde{\Theta} with a subset Θ\Theta of Euclidean space as in (42) by ranking the μi\mu_{i} in increasing order. This identification is a bijection and moreover a homeomorphism. Both properties are necessary for the reparametrization, since we need convergence of the parameters in the reparametrization space to be equivalent to convergence in the original parameter space Θ¯\bar{\Theta}. So the parametrization is suitable for the application of Theorem 5.16. However, this scheme does not generalize readily to the case where the atom space (the space of μ\mu in this particular example) has more than one dimension, as discussed below.

For the case of kk-component location-scale mixtures of Gaussian distributions, the parameter set is Θ~:={i=1kπiδ(μi,σi):i=1kπi=1,(μi,σi)(μj,σj) and σi>0}=k(×+).\tilde{\Theta}:=\{\sum_{i=1}^{k}\pi_{i}\delta_{(\mu_{i},\sigma_{i})}:\sum_{i=1}^{k}\pi_{i}=1,(\mu_{i},\sigma_{i})\neq(\mu_{j},\sigma_{j})\text{ and }\sigma_{i}>0\}=\mathcal{E}_{k}(\mathbb{R}\times\mathbb{R}_{+}). Similar to the location mixture, one may attempt to reparametrize Θ~\tilde{\Theta} by ranking the (μi,σi)(\mu_{i},\sigma_{i}) in lexicographically increasing order. While this reparametrization is a bijection, it is not a homeomorphism. To see this, consider k=2k=2 and F0=12δ(1,3)+12δ(1,2)Θ~F_{0}=\frac{1}{2}\delta_{(1,3)}+\frac{1}{2}\delta_{(1,2)}\in\tilde{\Theta}. The reparametrization of F0F_{0} is (12,1,2,1,3)(\frac{1}{2},1,2,1,3) since (1,2)<(1,3)(1,2)<(1,3) in the lexicographic order. Consider Fn=12δ(11n,3)+12δ(1+1n,2)F_{n}=\frac{1}{2}\delta_{(1-\frac{1}{n},3)}+\frac{1}{2}\delta_{(1+\frac{1}{n},2)}, whose reparametrization is (12,11n,3,1+1n,2)(\frac{1}{2},1-\frac{1}{n},3,1+\frac{1}{n},2). It is clear that W1(Fn,F0)0W_{1}(F_{n},F_{0})\to 0 as nn\to\infty but the Euclidean distance between the corresponding reparametrized parameters does not converge to 0. This issue underscores one of many challenges that arise as we look at increasingly richer models that have already been widely applied in numerous application domains.
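The failure of the lexicographic reparametrization to be a homeomorphism can be verified directly on the sequence Fn above. A minimal sketch (for two equal-weight measures with the same number of atoms, W1 reduces to a best matching of the atoms):

```python
import numpy as np
from itertools import permutations

def W1_equal_weights(atoms_a, atoms_b):
    # W1 between two k-atom measures with weights 1/k each:
    # an optimal coupling is a best matching of the atoms
    k = len(atoms_a)
    return min(sum(np.linalg.norm(np.subtract(atoms_a[i], atoms_b[t[i]]))
                   for i in range(k)) / k
               for t in permutations(range(k)))

def lex_param(atoms, weight=0.5):
    # rank atoms lexicographically, then flatten into a Euclidean vector
    return np.array([weight] + [c for atom in sorted(atoms) for c in atom])

F0 = [(1.0, 3.0), (1.0, 2.0)]
for n in (10, 100, 1000):
    Fn = [(1 - 1/n, 3.0), (1 + 1/n, 2.0)]
    w1 = W1_equal_weights(Fn, F0)
    d_euc = np.linalg.norm(lex_param(Fn) - lex_param(F0))
    print(n, w1, d_euc)   # w1 -> 0, while d_euc stays near sqrt(2)
```

Here W1(Fn, F0) = 1/n, but the lexicographic order swaps the two atoms of Fn relative to F0, so the reparametrized vectors stay roughly √2 apart.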

9.3 Other extensions

A direction of interest is the study of overfitted mixture models, i.e., the setting where the true number of mixture components k0k_{0} may be unknown and k0kk_{0}\leq k. As previous studies suggest, a stronger notion of identifiability such as second-order identifiability may play a fundamental role (see [27]). Observing that (23) can also be viewed as a uniform version of (21), since it holds for any fixed GG in a neighborhood of G0G_{0} and any HW1GH\overset{W_{1}}{\to}G, it would also be interesting to generalize Theorem 6.2 to a uniform result beyond a fixed G0G_{0}. In addition, if PθP_{\theta} is taken to be a mixture distribution, what happens if this mixture is also overfitted? We can expect a much richer range of parameter estimation behavior and more complex roles for mm and NN in the rates of convergence.

References

  • [1] David J. Aldous. Exchangeability and related topics. In P. L. Hennequin, editor, École d’Été de Probabilités de Saint-Flour XIII — 1983, pages 1–198, Berlin, Heidelberg, 1985. Springer Berlin Heidelberg.
  • [2] Charalambos D. Aliprantis and Kim C. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer-Verlag Berlin Heidelberg, third edition, 2006.
  • [3] Elizabeth S Allman, Catherine Matias, and John A Rhodes. Identifiability of parameters in latent structure models with many observed variables. Annals of Statistics, 37(6A):3099–3132, 2009.
  • [4] Charles E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2(6):1152–1174, 1974.
  • [5] Patrick Billingsley. Probability and measure. John Wiley & Sons, third edition, 1996.
  • [6] Luca Brandolini, Giacomo Gigante, Allan Greenleaf, Alexander Iosevich, Andreas Seeger, and Giancarlo Travaglini. Average decay estimates for Fourier transforms of measures supported on curves. Journal of Geometric Analysis, 17(1):15–40, 2007.
  • [7] Federico Camerlenghi, David B Dunson, Antonio Lijoi, Igor Prünster, and Abel Rodríguez. Latent nested nonparametric priors (with discussion). Bayesian Analysis, 14(4):1303–1356, 2019.
  • [8] Federico Camerlenghi, Antonio Lijoi, Peter Orbanz, and Igor Prünster. Distribution theory for hierarchical processes. Annals of Statistics, 47(1):67–92, 2019.
  • [9] Jiahua Chen. Optimal rate of convergence for finite mixture models. Annals of Statistics, 23(1):221–233, 1995.
  • [10] IR Cruz-Medina, TP Hettmansperger, and H Thomas. Semiparametric mixture models and repeated measures: the multinomial cut point model. Journal of the Royal Statistical Society: Series C (Applied Statistics), 53(3):463–474, 2004.
  • [11] Ryan Elmore, Peter Hall, and Amnon Neeman. An application of classical invariant theory to identifiability in nonparametric mixtures. In Annales de l’institut Fourier, volume 55, pages 1–28, 2005.
  • [12] Ryan T Elmore, Thomas P Hettmansperger, and Hoben Thomas. Estimating component cumulative distribution functions in finite mixture models. Communications in Statistics-Theory and Methods, 33(9):2075–2086, 2004.
  • [13] Ryan T Elmore and Shaoli Wang. Identifiability and estimation in finite mixture models with multinomial coefficients. Technical Report 03-04, Penn State University, 2003.
  • [14] William Feller. An introduction to probability theory and its applications, volume 2. John Wiley & Sons, third edition, 2008.
  • [15] Thomas S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230, 1973.
  • [16] Kavish Gandhi and Yonah Borns-Weil. Moment-based learning of mixture distributions. unpublished paper, 2016.
  • [17] Subhashis Ghosal and Aad van der Vaart. Fundamentals of Nonparametric Bayesian Inference, volume 44 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2017.
  • [18] Subhashis Ghosal and Aad W Van Der Vaart. Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. Annals of Statistics, 29(5):1233–1263, 2001.
  • [19] Aritra Guha, Nhat Ho, and XuanLong Nguyen. On posterior contraction of parameters and interpretability in Bayesian mixture modeling. Bernoulli, 27(4):2159–2188, 2021.
  • [20] Peter Hall, Amnon Neeman, Reza Pakyari, and Ryan Elmore. Nonparametric inference in multivariate mixtures. Biometrika, 92(3):667–678, 2005.
  • [21] Peter Hall and Xiao-Hua Zhou. Nonparametric estimation of component distributions in a multivariate mixture. Annals of Statistics, 31(1):201–224, 2003.
  • [22] Philippe Heinrich and Jonas Kahn. Strong identifiability and optimal minimax rates for finite mixture estimation. Annals of Statistics, 46(6A):2844–2870, 2018.
  • [23] TP Hettmansperger and Hoben Thomas. Almost nonparametric inference for repeated measures in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4):811–825, 2000.
  • [24] Nils Lid Hjort, Chris Holmes, Peter Müller, and Stephen G Walker. Bayesian Nonparametrics, volume 28 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2010.
  • [25] Nhat Ho and XuanLong Nguyen. Convergence rates of parameter estimation for some weakly identifiable finite mixtures. Annals of Statistics, 44(6):2726–2755, 2016.
  • [26] Nhat Ho and XuanLong Nguyen. On strong identifiability and convergence rates of parameter estimation in finite mixtures. Electronic Journal of Statistics, 10(1):271–307, 2016.
  • [27] Nhat Ho and XuanLong Nguyen. Singularity structures and impacts on parameter estimation in finite mixtures of distributions. SIAM Journal on Mathematics of Data Science, 1(4):730–758, 2019.
  • [28] K. Jochmans, S. Bonhomme, and J.-M. Robin. Nonparametric estimation of finite mixtures from repeated measurements. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.
  • [29] Olav Kallenberg. Probabilistic symmetries and invariance principles. Probability and Its Applications. Springer Science & Business Media, 2006.
  • [30] Bruce G Lindsay. Mixture models: theory, geometry and applications. In NSF-CBMS regional conference series in probability and statistics, pages i–163. JSTOR, 1995.
  • [31] XuanLong Nguyen. Wasserstein distances for discrete measures and convergence in nonparametric mixture models. Technical report, Department of Statistics, University of Michigan, 2011.
  • [32] XuanLong Nguyen. Convergence of latent mixing measures in finite and infinite mixture models. Annals of Statistics, 41(1):370–400, 2013.
  • [33] XuanLong Nguyen. Posterior contraction of the population polytope in finite admixture models. Bernoulli, 21(1):618–646, 2015.
  • [34] XuanLong Nguyen. Borrowing strength in hierarchical Bayes: Posterior concentration of the Dirichlet base measure. Bernoulli, 22(3):1535–1571, 2016.
  • [35] Sidney I Resnick. A probability path. Modern Birkhäuser Classics. Springer, fourth edition, 2014.
  • [36] Alexander Ritchie, Robert A Vandermeulen, and Clayton Scott. Consistent estimation of identifiable nonparametric mixture models from grouped observations. arXiv preprint arXiv:2006.07459, 2020.
  • [37] Abel Rodriguez, David B Dunson, and Alan E Gelfand. The nested Dirichlet process. Journal of the American Statistical Association, 103(483):1131–1154, 2008.
  • [38] Judith Rousseau and Kerrie Mengersen. Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):689–710, 2011.
  • [39] V.A. Statulyavichus. Limit theorems for densities and asymptotic expansions for distributions of sums of independent random variables. Theory of Probability and Its Applications, 10(4):582–595, 1965.
  • [40] Elias M Stein and Timothy S Murphy. Harmonic analysis: real-variable methods, orthogonality, and oscillatory integrals, volume 3 of Monographs in Harmonic Analysis. Princeton University Press, 1993.
  • [41] Yee W Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
  • [42] Henry Teicher. Identifiability of finite mixtures. Annals of Mathematical Statistics, 34(4):1265–1269, 1963.
  • [43] Henry Teicher. Identifiability of mixtures of product measures. Annals of Mathematical Statistics, 38(4):1300–1302, 1967.
  • [44] Robert A Vandermeulen and Clayton D Scott. An operator theoretic approach to nonparametric mixture models. Annals of Statistics, 47(5):2704–2733, 2019.
  • [45] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
  • [46] Aad W. van der Vaart and Jon A. Wellner. Weak convergence and empirical processes: with applications to statistics. Springer Series in Statistics. Springer Science & Business Media, 1996.
  • [47] Yihong Wu and Pengkun Yang. Optimal estimation of Gaussian mixtures via denoised method of moments. Annals of Statistics, 48(4):1981–2007, 2020.

Appendix A Examples and Proofs for Section 3

Example A.1.

Consider G1=p11δθ1+p21δθ2G_{1}=p_{1}^{1}\delta_{\theta_{1}}+p_{2}^{1}\delta_{\theta_{2}} and G2=p12δθ1+p22δθ22(Θ)G_{2}=p_{1}^{2}\delta_{\theta_{1}}+p_{2}^{2}\delta_{\theta_{2}}\in\mathcal{E}_{2}(\Theta) with p11p12p_{1}^{1}\not=p_{1}^{2}. When N{N} is sufficiently large, DN(G1,G2)=|p11p12|+|p21p22|D_{N}(G_{1},G_{2})=|p_{1}^{1}-p_{1}^{2}|+|p_{2}^{1}-p_{2}^{2}|, a constant independent of N{N}. But with dΘd_{\Theta} being the Euclidean distance multiplied by N\sqrt{{N}},

Wpp(G1,G2;dΘ)=\displaystyle W_{p}^{p}(G_{1},G_{2};d_{\Theta})= min𝒒(q12+q21)(Nθ1θ22)p\displaystyle\min_{\bm{q}}(q_{12}+q_{21})\left(\sqrt{{N}}\|\theta_{1}-\theta_{2}\|_{2}\right)^{p}
=\displaystyle= (Nθ1θ22)p12(|p11p12|+|p21p22|),\displaystyle\left(\sqrt{{N}}\|\theta_{1}-\theta_{2}\|_{2}\right)^{p}\frac{1}{2}\left(|p_{1}^{1}-p_{1}^{2}|+|p_{2}^{1}-p_{2}^{2}|\right),

where 𝒒\bm{q} is a coupling as in (7). So

Wp(G1,G2;dΘ)=Nθ1θ22(12(|p11p12|+|p21p22|))1/p,W_{p}(G_{1},G_{2};d_{\Theta})=\sqrt{{N}}\|\theta_{1}-\theta_{2}\|_{2}\left(\frac{1}{2}(|p_{1}^{1}-p_{1}^{2}|+|p_{2}^{1}-p_{2}^{2}|)\right)^{1/p},

which increases to \infty when N{N}\to\infty. Even though G1G_{1} and G2G_{2} share the set of atoms, Wp(G1,G2;dΘ)W_{p}(G_{1},G_{2};d_{\Theta}) is still of order N\sqrt{{N}}. Thus, Wp(G1,G2;dΘ)W_{p}(G_{1},G_{2};d_{\Theta}) couples atoms and probabilities; in other words it does not separate them in the way DND_{{N}} does. \Diamond
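A quick numerical illustration of this example: with the two measures sharing atoms, D_N stays constant in N while W1 under the √N-scaled metric grows like √N. (The weight values below are hypothetical; for two shared atoms the optimal W1 coupling simply moves half the total weight discrepancy between the atoms.)

```python
import math
from itertools import permutations

theta = [0.0, 1.0]                   # shared atoms
pa, pb = [0.3, 0.7], [0.5, 0.5]      # hypothetical differing weights

def W1_scaled(p, q, th, N):
    # ground metric sqrt(N)*|.|: for two shared atoms the optimal coupling
    # moves |p_1 - q_1| of mass between them
    return abs(p[0] - q[0]) * math.sqrt(N) * abs(th[0] - th[1])

def D_N(p, q, th, N):
    # single-permutation combined cost, cf. Example A.1
    return min(sum(math.sqrt(N) * abs(th[t[i]] - th[i]) + abs(q[t[i]] - p[i])
                   for i in range(2))
               for t in permutations(range(2)))

for N in (1, 100, 10000):
    print(N, W1_scaled(pa, pb, theta, N), D_N(pa, pb, theta, N))
```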

Proof of Lemma 3.2.

a) The proof is trivial and is therefore omitted.

b) Let G=i=1kpiδθiG=\sum_{i=1}^{k}p_{i}\delta_{\theta_{i}} and G=i=1kpiδθiG^{\prime}=\sum_{i=1}^{k}p^{\prime}_{i}\delta_{\theta^{\prime}_{i}}. Let τ\tau be the optimal permutation that achieves D1(G,G)=i=1k(θτ(i)θi2+|pτ(i)pi|)D_{1}(G,G^{\prime})=\sum_{i=1}^{k}\left(\|\theta_{\tau(i)}-\theta^{\prime}_{i}\|_{2}+|p_{\tau(i)}-p^{\prime}_{i}|\right). Let 𝒒\bm{q} be a coupling of the mixing probabilities 𝒑=(p1,,pk)\bm{p}=(p_{1},\ldots,p_{k}) and 𝒑=(p1,,pk)\bm{p}^{\prime}=(p^{\prime}_{1},\ldots,p^{\prime}_{k}) such that qτ(i),i=min{pτ(i),pi}q_{\tau(i),i}=\min\{p_{\tau(i)},p^{\prime}_{i}\}; the remaining mass to be assigned is then i=1k(pτ(i)qτ(i),i)=12i=1k|pτ(i)pi|\sum_{i=1}^{k}(p_{\tau(i)}-q_{\tau(i),i})=\frac{1}{2}\sum_{i=1}^{k}|p_{\tau(i)}-p^{\prime}_{i}|. Thus,

W1(G,G)\displaystyle W_{1}(G,G^{\prime})\leq i=1kqτ(i),iθτ(i)θi2+12i=1k|pτ(i)pi|diam(Θ)\sum_{i=1}^{k}q_{\tau(i),i}\|\theta_{\tau(i)}-\theta^{\prime}_{i}\|_{2}+\frac{1}{2}\sum_{i=1}^{k}|p_{\tau(i)}-p^{\prime}_{i}|\text{diam}(\Theta)
\displaystyle\leq max{1,diam(Θ)2}D1(G,G).\displaystyle\max\left\{1,\frac{\text{diam}(\Theta)}{2}\right\}D_{1}(G,G^{\prime}).

The proof for the case p1p\neq 1 proceeds by the same argument.
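The coupling constructed in part b) can be made concrete: keep min{p_τ(i), p′_i} of mass in place along the optimal permutation and ship the residual arbitrarily (greedily below); the resulting transport cost upper bounds W1 and obeys the stated bound. A sketch with scalar atoms and hypothetical values:

```python
import math
from itertools import permutations

def coupling_cost(p, th, q, sh, tau):
    # keep min(p_tau(i), q_i) in place along tau, ship leftover mass greedily
    k = len(p)
    kept = [min(p[tau[i]], q[i]) for i in range(k)]
    cost = sum(kept[i] * abs(th[tau[i]] - sh[i]) for i in range(k))
    surplus = [p[tau[i]] - kept[i] for i in range(k)]   # leaves atom th[tau[i]]
    deficit = [q[i] - kept[i] for i in range(k)]        # arrives at atom sh[i]
    i = j = 0
    while i < k and j < k:
        m = min(surplus[i], deficit[j])
        cost += m * abs(th[tau[i]] - sh[j])
        surplus[i] -= m
        deficit[j] -= m
        if surplus[i] <= 1e-15:
            i += 1
        if deficit[j] <= 1e-15:
            j += 1
    return cost

p, th = [0.3, 0.7], [0.0, 2.0]     # hypothetical G
q, sh = [0.6, 0.4], [0.1, 2.5]     # hypothetical G'
tau = min(permutations(range(2)),
          key=lambda t: sum(abs(th[t[i]] - sh[i]) + abs(p[t[i]] - q[i])
                            for i in range(2)))
D1 = sum(abs(th[tau[i]] - sh[i]) + abs(p[tau[i]] - q[i]) for i in range(2))
diam = max(th + sh) - min(th + sh)
W1_upper = coupling_cost(p, th, q, sh, tau)   # >= W1(G, G')
print(W1_upper <= max(1.0, diam / 2) * D1)    # True
```

Any feasible coupling's cost upper bounds W1, so the inequality of part b) follows whenever the bound holds for this particular coupling.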

c) Consider any Gnk(Θ)G_{n}\in\mathcal{E}_{k}(\Theta) with GnW1G0G_{n}\overset{W_{1}}{\to}G_{0}; one may write Gn=i=1kpinδθinG_{n}=\sum_{i=1}^{k}p_{i}^{n}\delta_{\theta_{i}^{n}} for n0n\geq 0 such that pinpi0p_{i}^{n}\to p_{i}^{0} and θinθi0\theta_{i}^{n}\to\theta_{i}^{0}. Then when nn is sufficiently large, Gnk(Θ1)G_{n}\in\mathcal{E}_{k}(\Theta_{1}) for Θ1=i=1k0B(θi0,12)\Theta_{1}=\bigcup_{i=1}^{k_{0}}B(\theta_{i}^{0},\frac{1}{2}), where B(θi0,ρ)qB(\theta_{i}^{0},\rho)\subset\mathbb{R}^{q} is the open ball with center at θi0\theta_{i}^{0} of radius ρ\rho. Then by b), for large nn, W1(Gn,G0)C(G0)D1(Gn,G0)W_{1}(G_{n},G_{0})\leq C(G_{0})D_{1}(G_{n},G_{0}), which entails lim infGW1G0Gk(Θ)D1(G,G0)W1(G,G0)>0\liminf\limits_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k}(\Theta)\end{subarray}}\frac{D_{1}(G,G_{0})}{W_{1}(G,G_{0})}>0.

Denote 𝒑n=(p1n,,pkn)\bm{p}^{n}=(p_{1}^{n},\ldots,p_{k}^{n}) for n0n\geq 0. Let 𝒒n=(qijn)i,j[k]\bm{q}_{n}=(q^{n}_{ij})_{i,j\in[k]} be a coupling between 𝒑n\bm{p}^{n} and 𝒑0\bm{p}^{0} such that W1(Gn,G0)=ijqijnθinθj02W_{1}(G_{n},G_{0})=\sum_{ij}q^{n}_{ij}\|\theta^{n}_{i}-\theta^{0}_{j}\|_{2}. Since θjnθj0\theta^{n}_{j}\to\theta^{0}_{j}, when nn is large,

W1(Gn,G0)=ijqijnθinθj02ijqijnθjnθj02=j=1kpj0θjnθj02.W_{1}(G_{n},G_{0})=\sum_{ij}q^{n}_{ij}\|\theta^{n}_{i}-\theta^{0}_{j}\|_{2}\geq\sum_{ij}q^{n}_{ij}\|\theta^{n}_{j}-\theta^{0}_{j}\|_{2}=\sum_{j=1}^{k}p_{j}^{0}\|\theta^{n}_{j}-\theta^{0}_{j}\|_{2}. (55)

Moreover, when nn is large, θαnθβ0212min1i<kθi0θ02:=12ρ\|\theta_{\alpha}^{n}-\theta_{\beta}^{0}\|_{2}\geq\frac{1}{2}\min_{1\leq i<\ell\leq k}\|\theta_{i}^{0}-\theta_{\ell}^{0}\|_{2}:=\frac{1}{2}\rho for any αβ\alpha\neq\beta. Thus when nn is large,

W1(Gn,G0)αβqαβnθαnθβ0212ραβqαβn14ρj=1k|pjnpj0|,W_{1}(G_{n},G_{0})\geq\sum_{\alpha\neq\beta}q^{n}_{\alpha\beta}\|\theta^{n}_{\alpha}-\theta^{0}_{\beta}\|_{2}\geq\frac{1}{2}\rho\sum_{\alpha\neq\beta}q^{n}_{\alpha\beta}\geq\frac{1}{4}\rho\sum_{j=1}^{k}|p_{j}^{n}-p_{j}^{0}|, (56)

where the last inequality follows from

12j=1k|pjnpj0|=V(𝒑n,𝒑0)=inf𝝅 coupling of 𝒑n and 𝒑0αβπαβαβqαβn.\frac{1}{2}\sum_{j=1}^{k}|p_{j}^{n}-p_{j}^{0}|=V(\bm{p}^{n},\bm{p}^{0})=\inf_{\bm{\pi}\text{ coupling of }\bm{p}^{n}\text{ and }\bm{p}^{0}}\ \sum_{\alpha\neq\beta}\pi_{\alpha\beta}\leq\sum_{\alpha\neq\beta}q^{n}_{\alpha\beta}.

Combining (55) and (56), for sufficiently large nn,

W1(Gn,G0)\displaystyle W_{1}(G_{n},G_{0})\geq 12j=1kpj0θjnθj02+18ρj=1k|pjnpj0|\displaystyle\frac{1}{2}\sum_{j=1}^{k}p_{j}^{0}\|\theta_{j}^{n}-\theta_{j}^{0}\|_{2}+\frac{1}{8}\rho\sum_{j=1}^{k}|p_{j}^{n}-p_{j}^{0}|
\displaystyle\geq 12min{minp0,14ρ}j=1k(θjnθj02+|pjnpj0|)\displaystyle\frac{1}{2}\min\left\{\min_{\ell}p_{\ell}^{0},\frac{1}{4}\rho\right\}\sum_{j=1}^{k}(\|\theta_{j}^{n}-\theta_{j}^{0}\|_{2}+|p_{j}^{n}-p_{j}^{0}|)
=\displaystyle= 12min{minp0,14ρ}D1(Gn,G0),\displaystyle\frac{1}{2}\min\left\{\min_{\ell}p_{\ell}^{0},\frac{1}{4}\rho\right\}D_{1}(G_{n},G_{0}),

which entails lim infGW1G0Gk(Θ)W1(G,G0)D1(G,G0)>0\liminf\limits_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k}(\Theta)\end{subarray}}\frac{W_{1}(G,G_{0})}{D_{1}(G,G_{0})}>0.

d) Based on c), there exists c(G0)>0c(G_{0})>0 such that for Gk0(Θ)G\in\mathcal{E}_{k_{0}}(\Theta) satisfying W1(G,G0)<c(G0)W_{1}(G,G_{0})<c(G_{0}): W1(G,G0)C1(G0)D1(G,G0).W_{1}(G,G_{0})\geq C_{1}(G_{0})D_{1}(G,G_{0}). For Gk0(Θ)G\in\mathcal{E}_{k_{0}}(\Theta) satisfying W1(G,G0)c(G0)W_{1}(G,G_{0})\geq c(G_{0}):

W1(G,G0)D1(G,G0)c(G0)k0diam(Θ)+1.\frac{W_{1}(G,G_{0})}{D_{1}(G,G_{0})}\geq\frac{c(G_{0})}{k_{0}\text{diam}(\Theta)+1}. ∎

Appendix B Additional examples and proofs for Section 4

B.1 Additional examples and proofs for Section 4.1

Example B.1 (Location gamma kernel).

For the gamma distribution with fixed \alpha\in(0,1)\bigcup(1,2) and \beta>0, consider its location family with density

f(x|\theta)=\frac{\beta^{\alpha}(x-\theta)^{\alpha-1}e^{-\beta(x-\theta)}}{\Gamma(\alpha)}\bm{1}_{(\theta,\infty)}(x)

w.r.t. the Lebesgue measure \mu on \mathfrak{X}=\mathbb{R}. The parameter space is \Theta=\mathbb{R}. Observe that

\lim_{a\to 0^{+}}\frac{f(\theta_{0}|\theta_{0}+a)-f(\theta_{0}|\theta_{0})}{a}=0

and

\lim_{a\to 0^{+}}\frac{f(\theta_{0}|\theta_{0}-a)-f(\theta_{0}|\theta_{0})}{a}=\frac{\beta^{\alpha}}{\Gamma(\alpha)}\lim_{a\to 0^{+}}a^{\alpha-2}e^{-\beta a}=\infty,

since \alpha<2. Then for any x, f(x|\theta) as a function of \theta is not differentiable at \theta=x. So this family is not identifiable in the first order as defined in [26]. However, it does satisfy the (\{\theta_{i}\}_{i=1}^{k},\mathcal{N}) first-order identifiability condition with \mathcal{N}=\bigcup_{i=1}^{k}(\theta_{i}-\rho,\theta_{i}+\rho), where \rho=\frac{1}{4}\min_{1\leq i<j\leq k}|\theta_{i}-\theta_{j}|. Indeed, observing that

\frac{\partial}{\partial\theta}f(x|\theta)=\left(\beta-\frac{\alpha-1}{x-\theta}\right)f(x|\theta),\quad\forall\theta\not=x,

(9a) becomes

0=\sum_{i=1}^{k}\left(a_{i}\beta-a_{i}\frac{\alpha-1}{x-\theta_{i}}+b_{i}\right)f(x|\theta_{i})\quad\text{for }\ \mu-a.e.\ x\in\mathbb{R}\backslash\mathcal{N}.

Without loss of generality, assume \theta_{1}<\ldots<\theta_{k}. Then for \mu-a.e. x\in(\theta_{1},\theta_{2})\backslash\mathcal{N}=[\theta_{1}+\rho,\theta_{2}-\rho], the above display becomes

\left(a_{1}\beta-a_{1}\frac{\alpha-1}{x-\theta_{1}}+b_{1}\right)\frac{\beta^{\alpha}(x-\theta_{1})^{\alpha-1}e^{-\beta(x-\theta_{1})}}{\Gamma(\alpha)}=0,

which implies a_{1}=b_{1}=0 since \alpha\neq 1. Repeating the above argument on the intervals (\theta_{2},\theta_{3}),\ldots,(\theta_{k},\infty) shows a_{i}=b_{i}=0 for every i\in[k]. So this family is (\{\theta_{i}\}_{i=1}^{k},\mathcal{N}) first-order identifiable. Moreover, for every x\in\mathbb{R}\backslash\mathcal{N}, f(x|\theta) is continuously differentiable w.r.t. \theta in a neighborhood of \theta_{i}^{0} for each i\in[k_{0}]. By Lemma 4.2 b), (12) holds for any G_{0}\in\mathcal{E}_{k_{0}}(\Theta). \Diamond
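The two one-sided limits in this example are easy to verify numerically. The sketch below uses the illustrative values \alpha=1.5, \beta=1: the right difference quotient is identically 0 because both densities vanish at x=\theta_{0}, while the left quotient behaves like a^{\alpha-2}\to\infty since \alpha<2.

```python
import math

# Numeric check of the one-sided difference quotients of the location-gamma
# kernel at theta = x; alpha = 1.5 and beta = 1 are illustrative values.
def f(x, theta, alpha=1.5, beta=1.0):
    if x <= theta:
        return 0.0  # the density vanishes outside (theta, infinity)
    return (beta ** alpha * (x - theta) ** (alpha - 1)
            * math.exp(-beta * (x - theta)) / math.gamma(alpha))

theta0 = 0.0
for a in (1e-2, 1e-4, 1e-6):
    right = (f(theta0, theta0 + a) - f(theta0, theta0)) / a
    left = (f(theta0, theta0 - a) - f(theta0, theta0)) / a
    assert right == 0.0          # right quotient is exactly zero
    assert left > 1.0 / a ** 0.4  # left quotient diverges as a -> 0+
```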

Proof of Lemma 4.2 b).

Suppose equation (12) fails. Then there exist G_{\ell},H_{\ell}\in\mathcal{E}_{k_{0}}(\Theta) such that

\begin{cases}G_{\ell}\neq H_{\ell},&\forall\ell\\ G_{\ell},H_{\ell}\overset{W_{1}}{\to}G_{0},&\text{ as }\ell\to\infty\\ \frac{V(P_{G_{\ell}},P_{H_{\ell}})}{D_{1}(G_{\ell},H_{\ell})}\to 0,&\text{ as }\ell\to\infty.\end{cases}

We may relabel the atoms of G_{\ell} and H_{\ell} such that G_{\ell}=\sum_{i=1}^{k_{0}}p^{\ell}_{i}\delta_{\theta_{i}^{\ell}}, H_{\ell}=\sum_{i=1}^{k_{0}}\pi_{i}^{\ell}\delta_{\eta_{i}^{\ell}} with \theta_{i}^{\ell},\eta_{i}^{\ell}\to\theta_{i}^{0} and p_{i}^{\ell},\pi_{i}^{\ell}\to p_{i}^{0} as \ell\to\infty for every i\in[k_{0}]. Passing to subsequences if necessary, we may further require

\frac{\theta_{i}^{\ell}-\eta_{i}^{\ell}}{D_{1}(G_{\ell},H_{\ell})}\to a_{i}\in\mathbb{R}^{q},\quad\frac{p_{i}^{\ell}-\pi_{i}^{\ell}}{D_{1}(G_{\ell},H_{\ell})}\to b_{i}\in\mathbb{R},\quad\forall 1\leq i\leq k_{0}, (57)

where b_{i} and the components of a_{i} lie in [-1,1] and \sum_{i=1}^{k_{0}}b_{i}=0. Moreover, D_{1}(G_{\ell},H_{\ell})=\sum_{i=1}^{k_{0}}\left(\|\theta^{\ell}_{i}-\eta_{i}^{\ell}\|_{2}+|p^{\ell}_{i}-\pi_{i}^{\ell}|\right) for sufficiently large \ell, which implies

\sum_{i=1}^{k_{0}}\|a_{i}\|_{2}+\sum_{i=1}^{k_{0}}|b_{i}|=1.

It then follows that at least one a_{i} is not \bm{0}\in\mathbb{R}^{q} or at least one b_{i} is not 0. On the other hand,

\displaystyle 0=\lim_{\ell\to\infty}\frac{2V(P_{G_{\ell}},P_{H_{\ell}})}{D_{1}(G_{\ell},H_{\ell})}
\displaystyle\geq\lim_{\ell\to\infty}\int_{\mathfrak{X}\backslash\mathcal{N}}\left|\sum_{i=1}^{k_{0}}p_{i}^{\ell}\frac{f(x|\theta_{i}^{\ell})-f(x|\eta_{i}^{\ell})}{D_{1}(G_{\ell},H_{\ell})}+\sum_{i=1}^{k_{0}}f(x|\eta_{i}^{\ell})\frac{p_{i}^{\ell}-\pi_{i}^{\ell}}{D_{1}(G_{\ell},H_{\ell})}\right|\mu(dx)
\displaystyle\geq\int_{\mathfrak{X}\backslash\mathcal{N}}\liminf_{\ell\to\infty}\left|\sum_{i=1}^{k_{0}}p_{i}^{\ell}\frac{f(x|\theta_{i}^{\ell})-f(x|\eta_{i}^{\ell})}{D_{1}(G_{\ell},H_{\ell})}+\sum_{i=1}^{k_{0}}f(x|\eta_{i}^{\ell})\frac{p_{i}^{\ell}-\pi_{i}^{\ell}}{D_{1}(G_{\ell},H_{\ell})}\right|\mu(dx)
\displaystyle=\int_{\mathfrak{X}\backslash\mathcal{N}}\left|\sum_{i=1}^{k_{0}}p_{i}^{0}a_{i}^{\top}\nabla_{\theta}f(x|\theta_{i}^{0})+\sum_{i=1}^{k_{0}}f(x|\theta_{i}^{0})b_{i}\right|\mu(dx),

where the second inequality follows from Fatou's lemma, and the last step follows from Lemma B.2 a). Then \sum_{i=1}^{k_{0}}p_{i}^{0}a_{i}^{\top}\nabla_{\theta}f(x|\theta_{i}^{0})+\sum_{i=1}^{k_{0}}f(x|\theta_{i}^{0})b_{i}=0 for \mu-a.e. x\in\mathfrak{X}\backslash\mathcal{N}. Thus we have found a nonzero solution to (9a), (9b) with k,\theta_{i} replaced by k_{0},\theta_{i}^{0}.

However, the last statement contradicts the definition of (\{\theta_{i}^{0}\}_{i=1}^{k_{0}},\mathcal{N}) first-order identifiability. ∎

Proof of Lemma 4.4.

By Lemma 4.13 b), (a_{1},b_{1},\ldots,a_{k_{0}},b_{k_{0}}) is also a nonzero solution of the system of equations (9a), (9b). Let a^{\prime}_{i}=\frac{a_{i}/p_{i}^{0}}{\sum_{i=1}^{k_{0}}\left(\|a_{i}/p_{i}^{0}\|_{2}+|b_{i}|\right)} and b^{\prime}_{i}=\frac{b_{i}}{\sum_{i=1}^{k_{0}}\left(\|a_{i}/p_{i}^{0}\|_{2}+|b_{i}|\right)}. Then a^{\prime}_{i} and b^{\prime}_{i} satisfy \sum_{i=1}^{k_{0}}\left(\|a^{\prime}_{i}\|_{2}+|b^{\prime}_{i}|\right)=1 and (p_{1}^{0}a^{\prime}_{1},b^{\prime}_{1},\ldots,p_{k_{0}}^{0}a^{\prime}_{k_{0}},b^{\prime}_{k_{0}}) is also a nonzero solution of the system of equations (9a), (9b) with k,\theta_{i} replaced respectively by k_{0},\theta_{i}^{0}. Let G_{\ell}=\sum_{i=1}^{k_{0}}p_{i}^{\ell}\delta_{\theta_{i}^{\ell}} with p_{i}^{\ell}=p_{i}^{0}+b^{\prime}_{i}\frac{1}{\ell} and \theta_{i}^{\ell}=\theta_{i}^{0}+\frac{1}{\ell}a^{\prime}_{i} for 1\leq i\leq k_{0}. When \ell is large, 0<p_{i}^{\ell}<1 and \theta_{i}^{\ell}\in\Theta since 0<p_{i}^{0}<1 and \theta_{i}^{0}\in\Theta^{\circ}. Moreover, \sum_{i=1}^{k_{0}}p_{i}^{\ell}=1 since \sum_{i=1}^{k_{0}}b^{\prime}_{i}=0. Then G_{\ell}\in\mathcal{E}_{k_{0}}(\Theta) and G_{\ell}\not=G_{0} since at least one of the a^{\prime}_{i} or b^{\prime}_{i} is nonzero. When \ell is large, D_{1}(G_{\ell},G_{0})=\sum_{i=1}^{k_{0}}\left(\|\theta_{i}^{\ell}-\theta_{i}^{0}\|_{2}+|p_{i}^{\ell}-p_{i}^{0}|\right)=\frac{1}{\ell}. Thus when \ell is large,

\frac{2V(P_{G_{\ell}},P_{G_{0}})}{D_{1}(G_{\ell},G_{0})}=\int_{\mathfrak{X}\backslash\mathcal{N}}\left|\sum_{i=1}^{k_{0}}p_{i}^{\ell}\frac{f(x|\theta_{i}^{\ell})-f(x|\theta_{i}^{0})}{1/\ell}+\sum_{i=1}^{k_{0}}b^{\prime}_{i}f(x|\theta_{i}^{0})\right|\mu(dx). (58)

Since, by condition c), for large \ell,

\left|\frac{f(x|\theta_{i}^{\ell})-f(x|\theta_{i}^{0})}{1/\ell}\right|=\left|\frac{f(x|\theta_{i}^{0}+\frac{1}{\ell}\frac{\|a^{\prime}_{i}\|_{2}}{\|a_{i}\|_{2}}a_{i})-f(x|\theta_{i}^{0})}{1/\ell}\right|\leq\frac{\|a^{\prime}_{i}\|_{2}}{\|a_{i}\|_{2}}\bar{f}(x|\theta_{i}^{0},a_{i}),

the integrand of (58) is bounded by \sum\limits_{i=1}^{k_{0}}\frac{1/p_{i}^{0}}{\sum_{i=1}^{k_{0}}\left(\|a_{i}/p_{i}^{0}\|_{2}+|b_{i}|\right)}\bar{f}(x|\theta_{i}^{0},a_{i})+\sum\limits_{i=1}^{k_{0}}|b^{\prime}_{i}|f(x|\theta_{i}^{0}), which is integrable w.r.t. \mu on \mathfrak{X}\backslash\mathcal{N}. Then by the dominated convergence theorem,

\lim_{\ell\to\infty}\frac{2V(P_{G_{\ell}},P_{G_{0}})}{D_{1}(G_{\ell},G_{0})}=\int_{\mathfrak{X}\backslash\mathcal{N}}\left|\sum_{i=1}^{k_{0}}p_{i}^{0}\langle a^{\prime}_{i},\nabla_{\theta}f(x|\theta_{i}^{0})\rangle+\sum_{i=1}^{k_{0}}b^{\prime}_{i}f(x|\theta_{i}^{0})\right|\mu(dx)=0.

Thus

\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G},P_{G_{0}})}{D_{1}(G,G_{0})}=0,

and the proof is completed by (14). ∎
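The rescaling step used in this proof can be checked with concrete numbers. The sketch below (all values illustrative, not from the paper) normalizes an arbitrary nonzero (a_i, b_i) with \sum_i b_i = 0 into (a'_i, b'_i) with \sum_i(\|a'_i\|_2 + |b'_i|) = 1, and verifies that the perturbed mixing measure with weights p_i^0 + b'_i/\ell and atoms \theta_i^0 + a'_i/\ell sits at D_1-distance exactly 1/\ell from G_0.

```python
import math

# Illustrative inputs: three one-dimensional atoms, a nonzero perturbation
# direction (a_i, b_i) with the b_i summing to zero as (9b) requires.
p0 = [0.5, 0.3, 0.2]
theta0 = [(0.0,), (1.0,), (2.5,)]
a = [(0.2,), (-0.1,), (0.05,)]
b = [0.1, -0.25, 0.15]

norm = lambda v: math.sqrt(sum(c * c for c in v))
Z = sum(norm(tuple(c / p for c in ai)) + abs(bi) for ai, bi, p in zip(a, b, p0))
a_prime = [tuple(c / p / Z for c in ai) for ai, p in zip(a, p0)]
b_prime = [bi / Z for bi in b]

# normalization: sum_i (||a'_i||_2 + |b'_i|) = 1, and sum_i b'_i = 0
assert abs(sum(norm(ai) + abs(bi) for ai, bi in zip(a_prime, b_prime)) - 1.0) < 1e-12
assert abs(sum(b_prime)) < 1e-12

ell = 100.0
theta_l = [tuple(t + c / ell for t, c in zip(ti, ai)) for ti, ai in zip(theta0, a_prime)]
p_l = [p + bi / ell for p, bi in zip(p0, b_prime)]
D1 = sum(norm(tuple(u - v for u, v in zip(tl, t0))) + abs(pl - p)
         for tl, t0, pl, p in zip(theta_l, theta0, p_l, p0))
assert abs(D1 - 1.0 / ell) < 1e-12   # D_1(G_l, G_0) = 1/l by construction
```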

Proof of Lemma 4.8.

It suffices to prove (12) since (11) is a direct consequence of (12).

Without loss of generality, assume \theta_{1}^{0}<\theta_{2}^{0}<\ldots<\theta_{k_{0}}^{0}. Let \mathcal{N}=\bigcup_{i=1}^{k_{0}}(\theta_{i}^{0}-\rho,\theta_{i}^{0}+\rho), where \rho=\frac{1}{4}\min_{1\leq i<j\leq k_{0}}|\theta_{i}^{0}-\theta_{j}^{0}|. Notice that for x\in\mathbb{R}\backslash\mathcal{N}, f(x|\theta) as a function of \theta is continuously differentiable on (\theta_{i}^{0}-\rho,\theta_{i}^{0}+\rho) for each i\in[k_{0}].

Suppose (12) is not true. Proceed exactly as in the proof of Lemma 4.2 b), except for the last paragraph, to obtain a nonzero solution (p_{i}^{0}a_{i},b_{i}:i\in[k_{0}]) of (9a), (9b) with k,\theta_{i} replaced by k_{0},\theta_{i}^{0}. For the uniform distribution family, one may argue that the nonzero solution has to satisfy

-p_{i}^{0}a_{i}/\theta_{i}^{0}+b_{i}=0\quad\forall i\in[k_{0}]. (59)

Indeed, starting from the rightmost interval that intersects the support of only one mixture component, for \mu-a.e. x\in(\theta_{k_{0}-1}^{0},\theta_{k_{0}}^{0})\backslash\mathcal{N}=[\theta_{k_{0}-1}^{0}+\rho,\theta_{k_{0}}^{0}-\rho],

\displaystyle 0=\sum_{i=1}^{k_{0}}\left(p_{i}^{0}a_{i}\frac{\partial}{\partial\theta}f(x|\theta_{i}^{0})+b_{i}f(x|\theta_{i}^{0})\right)
\displaystyle=\sum_{i=1}^{k_{0}}\left(-p_{i}^{0}a_{i}/\theta_{i}^{0}+b_{i}\right)f(x|\theta_{i}^{0})
\displaystyle=\left(-p_{k_{0}}^{0}a_{k_{0}}/\theta_{k_{0}}^{0}+b_{k_{0}}\right)/\theta_{k_{0}}^{0},

which implies -p_{k_{0}}^{0}a_{k_{0}}/\theta_{k_{0}}^{0}+b_{k_{0}}=0. Repeating the above argument on the intervals (\theta_{k_{0}-2}^{0},\theta_{k_{0}-1}^{0}), \ldots, (\theta_{1}^{0},\theta_{2}^{0}), (0,\theta_{1}^{0}) establishes (59).

Combining (59) with the fact that some a_{i} or b_{i} is nonzero, it follows that |a_{\alpha}|>0 for some \alpha\in[k_{0}]. When \ell is sufficiently large, \theta_{i}^{\ell},\eta_{i}^{\ell}\in(\theta_{i}^{0}-\rho,\theta_{i}^{0}+\rho). For sufficiently large \ell,

\displaystyle\frac{2V(P_{G_{\ell}},P_{H_{\ell}})}{D_{1}(G_{\ell},H_{\ell})}
\displaystyle\geq\frac{1}{D_{1}(G_{\ell},H_{\ell})}\int_{\min\{\theta_{\alpha}^{\ell},\eta_{\alpha}^{\ell}\}}^{\max\{\theta_{\alpha}^{\ell},\eta_{\alpha}^{\ell}\}}\left|p_{G_{\ell}}(x)-p_{H_{\ell}}(x)\right|dx
\displaystyle\overset{(*)}{=}\frac{1}{D_{1}(G_{\ell},H_{\ell})}\int_{\min\{\theta_{\alpha}^{\ell},\eta_{\alpha}^{\ell}\}}^{\max\{\theta_{\alpha}^{\ell},\eta_{\alpha}^{\ell}\}}\left|\frac{\pi_{\alpha}^{\ell}\bm{1}(\theta_{\alpha}^{\ell}<\eta_{\alpha}^{\ell})+p_{\alpha}^{\ell}\bm{1}(\theta_{\alpha}^{\ell}\geq\eta_{\alpha}^{\ell})}{\max\{\theta_{\alpha}^{\ell},\eta_{\alpha}^{\ell}\}}+\sum_{i=\alpha+1}^{k_{0}}\frac{p_{i}^{\ell}}{\theta_{i}^{\ell}}-\sum_{i=\alpha+1}^{k_{0}}\frac{\pi_{i}^{\ell}}{\eta_{i}^{\ell}}\right|dx
\displaystyle\overset{(**)}{=}\frac{|\theta_{\alpha}^{\ell}-\eta_{\alpha}^{\ell}|}{D_{1}(G_{\ell},H_{\ell})}\left|\frac{\pi_{\alpha}^{\ell}\bm{1}(\theta_{\alpha}^{\ell}<\eta_{\alpha}^{\ell})+p_{\alpha}^{\ell}\bm{1}(\theta_{\alpha}^{\ell}\geq\eta_{\alpha}^{\ell})}{\max\{\theta_{\alpha}^{\ell},\eta_{\alpha}^{\ell}\}}+\sum_{i=\alpha+1}^{k_{0}}\frac{p_{i}^{\ell}}{\theta_{i}^{\ell}}-\sum_{i=\alpha+1}^{k_{0}}\frac{\pi_{i}^{\ell}}{\eta_{i}^{\ell}}\right|
\displaystyle\to|a_{\alpha}|\frac{p_{\alpha}^{0}}{\theta_{\alpha}^{0}}>0,

where step (*) follows from carefully examining the supports of f(x|\theta), step (**) follows since the integrand is constant, and the last step follows from (57). The last display contradicts the choice of G_{\ell},H_{\ell}, which satisfy \frac{V(P_{G_{\ell}},P_{H_{\ell}})}{D_{1}(G_{\ell},H_{\ell})}\to 0. ∎
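The key mechanism of this proof — moving one atom of a uniform mixture keeps the total variation ratio bounded away from zero — can be seen numerically. The sketch below uses an illustrative two-atom mixture of uniform kernels f(x|\theta)=(1/\theta)\bm{1}_{(0,\theta)}(x) (not from the paper): only the top atom is moved by \delta=1/\ell, and the ratio stays above the proof's lower bound p_{2}^{0}/\theta_{2}^{0}.

```python
# Exact L1 distance between two mixtures of uniform(0, theta) densities,
# computed by integrating the piecewise-constant difference between the
# breakpoints; the returned value equals 2 V(P_G, P_H).
def l1_uniform_mixtures(p, thetas_g, thetas_h):
    pts = sorted(set([0.0] + list(thetas_g) + list(thetas_h)))
    total = 0.0
    for lo, hi in zip(pts, pts[1:]):
        mid = 0.5 * (lo + hi)
        dg = sum(pi / t for pi, t in zip(p, thetas_g) if mid < t)
        dh = sum(pi / t for pi, t in zip(p, thetas_h) if mid < t)
        total += abs(dg - dh) * (hi - lo)
    return total

p = [0.4, 0.6]          # illustrative mixing weights
for ell in (10.0, 100.0, 1000.0):
    delta = 1.0 / ell   # move only the second atom: theta_2 = 2 + delta vs 2
    ratio = l1_uniform_mixtures(p, (1.0, 2.0 + delta), (1.0, 2.0)) / delta
    assert ratio >= p[1] / 2.0                        # lower bound p_2^0 / theta_2^0
    assert abs(ratio - 1.2 / (2.0 + delta)) < 1e-12   # exact value in this example
```

The exact value 1.2/(2+\delta) here stays near 0.6 as \delta\to 0, so the ratio 2V/D_{1} does not vanish, matching the conclusion of the lemma for this kernel.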

Proof of Lemma 4.10.

(a): inverse bound (11) holds
Without loss of generality, assume \xi_{1}^{0}\leq\xi_{2}^{0}\leq\ldots\leq\xi_{k_{0}}^{0}. Let \mathcal{N}=\bigcup_{i=1}^{k_{0}}\{\xi_{i}^{0}\}. Notice that for x\in\mathbb{R}\backslash\mathcal{N}, f(x|\theta) as a function of \theta is differentiable at \theta_{i}^{0}=(\xi_{i}^{0},\sigma_{i}^{0}) for each i\in[k_{0}].

Suppose (11) is not true. Proceed exactly as in the proof of Lemma 4.2 a), except for the last paragraph, to obtain a nonzero solution (p_{i}^{0}a_{i},b_{i}:i\in[k_{0}]) of (9a), (9b) with k,\theta_{i} replaced by k_{0},\theta_{i}^{0}. Write the two-dimensional vector a_{i} as a_{i}=(a_{i}^{(\xi)},a_{i}^{(\sigma)}). For the location-scale exponential distribution, one may argue that the nonzero solution has to satisfy

a_{i}^{(\sigma)}=0,\quad p_{i}^{0}a^{(\xi)}_{i}/\sigma_{i}^{0}+b_{i}=0,\quad\forall i\in[k_{0}]. (60)

Indeed, let \bigcup_{i=1}^{k_{0}}\{\xi_{i}^{0}\}=\{\xi^{\prime}_{1},\xi^{\prime}_{2},\ldots,\xi^{\prime}_{k^{\prime}}\} with \xi^{\prime}_{1}<\xi^{\prime}_{2}<\ldots<\xi^{\prime}_{k^{\prime}}, where k^{\prime} is the number of distinct elements. Define I^{\prime}(\xi)=\{i\in[k_{0}]:\xi_{i}^{0}=\xi\}. Then for \mu-a.e. x\in\mathbb{R}\backslash\mathcal{N},

\displaystyle 0=\sum_{i=1}^{k_{0}}\left(p_{i}^{0}\langle a_{i},\nabla_{(\xi,\sigma)}f(x|\xi_{i}^{0},\sigma_{i}^{0})\rangle+b_{i}f(x|\xi_{i}^{0},\sigma_{i}^{0})\right)
\displaystyle=\sum_{j=1}^{k^{\prime}}\sum_{i\in I^{\prime}(\xi^{\prime}_{j})}\left(p_{i}^{0}\langle a_{i},\nabla_{(\xi,\sigma)}f(x|\xi^{\prime}_{j},\sigma_{i}^{0})\rangle+b_{i}f(x|\xi^{\prime}_{j},\sigma_{i}^{0})\right)
\displaystyle=\sum_{j=1}^{k^{\prime}}\sum_{i\in I^{\prime}(\xi^{\prime}_{j})}\left(p_{i}^{0}a_{i}^{(\xi)}\frac{1}{\sigma_{i}^{0}}+p_{i}^{0}a_{i}^{(\sigma)}\frac{x-\xi_{i}^{0}-\sigma_{i}^{0}}{(\sigma_{i}^{0})^{2}}+b_{i}\right)f(x|\xi^{\prime}_{j},\sigma_{i}^{0}).

Starting from the leftmost interval that intersects the support of only one mixture component, for \mu-a.e. x\in(\xi^{\prime}_{1},\xi^{\prime}_{2})\backslash\mathcal{N}=(\xi^{\prime}_{1},\xi^{\prime}_{2}),

\displaystyle 0=\sum_{i\in I^{\prime}(\xi^{\prime}_{1})}\left(p_{i}^{0}a_{i}^{(\xi)}\frac{1}{\sigma_{i}^{0}}+p_{i}^{0}a_{i}^{(\sigma)}\frac{x-\xi_{i}^{0}-\sigma_{i}^{0}}{(\sigma_{i}^{0})^{2}}+b_{i}\right)f(x|\xi^{\prime}_{1},\sigma_{i}^{0})
\displaystyle=\sum_{i\in I^{\prime}(\xi^{\prime}_{1})}\left(p_{i}^{0}a_{i}^{(\xi)}\frac{1}{\sigma_{i}^{0}}+p_{i}^{0}a_{i}^{(\sigma)}\frac{x-\xi_{i}^{0}-\sigma_{i}^{0}}{(\sigma_{i}^{0})^{2}}+b_{i}\right)\exp\left(\frac{\xi^{\prime}_{1}}{\sigma_{i}^{0}}\right)\exp\left(-\frac{x}{\sigma_{i}^{0}}\right).

Since the \sigma_{i}^{0} for i\in I^{\prime}(\xi^{\prime}_{1}) are all distinct, by Lemma B.3 a),

a_{i}^{(\sigma)}=0,\quad p_{i}^{0}a^{(\xi)}_{i}/\sigma_{i}^{0}+b_{i}=0,\quad\forall i\in I^{\prime}(\xi^{\prime}_{1}).

Repeating the above argument on the intervals (\xi^{\prime}_{2},\xi^{\prime}_{3}),\ldots,(\xi^{\prime}_{k^{\prime}-1},\xi^{\prime}_{k^{\prime}}),(\xi^{\prime}_{k^{\prime}},\infty) establishes (60).

Since at least one of the a_{i} or b_{i} is not zero, from (60) it is clear that at least one of \{b_{i}\}_{i=1}^{k_{0}} is not zero. Then by \sum_{i=1}^{k_{0}}b_{i}=0, at least one b_{i} is positive, so by (60) at least one a_{i}^{(\xi)} is negative. Let \alpha\in\operatorname*{arg\,max}_{i\in\{j\in[k_{0}]:a_{j}^{(\xi)}<0\}}a_{i}^{(\xi)}; that is, a_{\alpha}^{(\xi)} is a largest negative element among \{a_{i}^{(\xi)}\}_{i\in[k_{0}]}. Let \rho=\frac{1}{2}\min_{1\leq i<j\leq k^{\prime}}|\xi^{\prime}_{i}-\xi^{\prime}_{j}| be half of the smallest distance among the distinct \{\xi^{\prime}_{i}\}_{i=1}^{k^{\prime}}. By a subsequence argument if necessary, we require \xi_{i}^{\ell}\in(\xi_{i}^{0}-\rho,\xi_{i}^{0}+\rho) for every i\in[k_{0}].

Let I(\alpha)=\{i\in[k_{0}]:\xi_{i}^{0}=\xi_{\alpha}^{0}\} be the set of indices sharing the same \xi_{i}^{0} as \xi_{\alpha}^{0}. We now pass to subsequences so that the \xi_{i}^{\ell} for i\in I(\alpha) satisfy finer properties, as follows. Divide the index set I(\alpha) into three subsets: J(\alpha):=\{i\in I(\alpha):a_{i}^{(\xi)}=a_{\alpha}^{(\xi)}\}, J_{<}(\alpha):=\{i\in I(\alpha):a_{i}^{(\xi)}<a_{\alpha}^{(\xi)}\} and J_{>}(\alpha):=\{i\in I(\alpha):a_{i}^{(\xi)}>a_{\alpha}^{(\xi)}\}. Note that J(\alpha) collects the indices sharing the same \xi_{i}^{0} as \xi_{\alpha}^{0} and the same a_{i}^{(\xi)} as a_{\alpha}^{(\xi)} (so their a_{i}^{(\xi)} are also largest negative elements among \{a_{i}^{(\xi)}\}_{i\in[k_{0}]}), while J_{>}(\alpha) collects the indices i for which \xi_{i}^{0}=\xi_{\alpha}^{0} and a_{i}^{(\xi)}\geq 0, and J_{<}(\alpha) collects the indices i for which \xi_{i}^{0}=\xi_{\alpha}^{0} and a_{i}^{(\xi)}<a_{\alpha}^{(\xi)}. To be clear, the two subsets J_{<}(\alpha) and J_{>}(\alpha) may be empty, but J(\alpha) is non-empty by our definition.

For any i\in J_{<}(\alpha) and j\in J(\alpha),

\frac{\xi_{i}^{\ell}-\xi_{\alpha}^{0}}{D_{1}(G_{\ell},G_{0})}\to a_{i}^{(\xi)}<a_{\alpha}^{(\xi)}\leftarrow\frac{\xi_{j}^{\ell}-\xi_{\alpha}^{0}}{D_{1}(G_{\ell},G_{0})}.

Then for large \ell, \xi_{i}^{\ell}<\xi_{j}^{\ell} for any i\in J_{<}(\alpha) and j\in J(\alpha). Similarly, for large \ell, \xi_{j}^{\ell}<\xi_{k}^{\ell} for any j\in J(\alpha) and k\in J_{>}(\alpha). Thus by a subsequence argument if necessary, we require the \xi_{i}^{\ell} to additionally satisfy the conditions specified in the last two sentences for all \ell.

Consider \max_{j\in J(\alpha)}\{\xi_{j}^{\ell}\}; since J(\alpha) has finite cardinality, there exists \bar{\alpha}\in J(\alpha) such that \xi_{\bar{\alpha}}^{\ell}=\max_{j\in J(\alpha)}\{\xi_{j}^{\ell}\} for infinitely many \ell. By a subsequence argument if necessary, we require \xi_{\bar{\alpha}}^{\ell}=\max_{j\in J(\alpha)}\{\xi_{j}^{\ell}\} for all \ell. Moreover, since a_{\bar{\alpha}}^{(\xi)}=a_{\alpha}^{(\xi)}<0, we may further require \xi_{\bar{\alpha}}^{\ell}<\xi_{\alpha}^{0} for all \ell. Finally, for each k\in J_{>}(\alpha) such that a_{k}^{(\xi)}>0, we may further require \xi_{k}^{\ell}>\xi_{\alpha}^{0} for all \ell by subsequences. To sum up, the \{\xi_{i}^{\ell}\} for i\in I(\alpha) satisfy:

\begin{cases}\xi_{i}^{\ell}\leq\xi_{\bar{\alpha}}^{\ell}<\xi_{\alpha}^{0},&\forall\ell,\quad\forall i\in J_{<}(\alpha)\bigcup J(\alpha)\\ \xi_{i}^{\ell}>\xi_{\bar{\alpha}}^{\ell},&\forall\ell,\quad\forall i\in J_{>}(\alpha)\\ \xi_{i}^{\ell}>\xi_{\alpha}^{0},&\forall\ell,\quad\forall i\in J_{>}(\alpha)\text{ and }a_{i}^{(\xi)}>0\end{cases}. (61)

Let \bar{\xi}^{\ell}=\min\left\{\min\limits_{i\in\{j\in I(\alpha):a_{j}^{(\xi)}=0\}}\xi_{i}^{\ell},\ \xi_{\alpha}^{0}\right\}, with the convention that the minimum over an empty set is \infty. Then \bar{\xi}^{\ell}\leq\xi_{\alpha}^{0} and \bar{\xi}^{\ell}\to\xi_{\alpha}^{0}. Moreover, by property (61), \bar{\xi}^{\ell}>\xi_{\bar{\alpha}}^{\ell}. Thus on (\xi_{\bar{\alpha}}^{\ell},\bar{\xi}^{\ell}): 1) for any i>\max I(\alpha), f(x|\xi_{i}^{\ell},\sigma_{i}^{\ell})=0=f(x|\xi_{i}^{0},\sigma_{i}^{0}) since \xi_{i}^{\ell},\xi_{i}^{0}\geq\xi_{\alpha}^{0}\geq\bar{\xi}^{\ell}; 2) for i\in J_{>}(\alpha), f(x|\xi_{i}^{\ell},\sigma_{i}^{\ell})=0, since \xi_{i}^{\ell}\geq\bar{\xi}^{\ell} by (61) and the definition of \bar{\xi}^{\ell}; 3) for i\in I(\alpha), f(x|\xi_{i}^{0},\sigma_{i}^{0})=0 since \xi_{i}^{0}=\xi_{\alpha}^{0}\geq\bar{\xi}^{\ell}. Then

\displaystyle\frac{2V(P_{G_{\ell}},P_{G_{0}})}{D_{1}(G_{\ell},G_{0})}
\displaystyle\geq\frac{1}{D_{1}(G_{\ell},G_{0})}\int_{\xi_{\bar{\alpha}}^{\ell}}^{\bar{\xi}^{\ell}}\left|p_{G_{\ell}}(x)-p_{G_{0}}(x)\right|dx
\displaystyle=\frac{1}{D_{1}(G_{\ell},G_{0})}\int_{\xi_{\bar{\alpha}}^{\ell}}^{\bar{\xi}^{\ell}}\left|\sum_{i\in J_{<}(\alpha)\bigcup J(\alpha)}p_{i}^{\ell}\frac{1}{\sigma_{i}^{\ell}}\exp\left(-\frac{x-\xi_{i}^{\ell}}{\sigma_{i}^{\ell}}\right)\right.
\quad\left.+\sum_{i<\min I(\alpha)}\left(p_{i}^{\ell}\frac{1}{\sigma_{i}^{\ell}}\exp\left(-\frac{x-\xi_{i}^{\ell}}{\sigma_{i}^{\ell}}\right)-p_{i}^{0}\frac{1}{\sigma_{i}^{0}}\exp\left(-\frac{x-\xi_{i}^{0}}{\sigma_{i}^{0}}\right)\right)\right|dx. (62)

Denote the integrand (including the absolute value) in the preceding display by A_{\ell}(x). Then, as a function on [\xi_{\alpha}^{0}-\rho,\xi_{\alpha}^{0}], A_{\ell}(x) converges uniformly to

\sum\limits_{i\in J_{<}(\alpha)\bigcup J(\alpha)}p_{i}^{0}\frac{1}{\sigma_{i}^{0}}\exp\left(-\frac{x-\xi_{\alpha}^{0}}{\sigma_{i}^{0}}\right):=B(x).

Since B(x) is positive and continuous on the compact interval [\xi_{\alpha}^{0}-\rho,\xi_{\alpha}^{0}], for large \ell,

|A_{\ell}(x)-B(x)|\leq\frac{1}{\ell}\leq\frac{1}{2}\min B(x)\leq\frac{1}{2}B(x),\quad\forall x\in[\xi_{\alpha}^{0}-\rho,\xi_{\alpha}^{0}],

which yields

A_{\ell}(x)\geq\frac{1}{2}B(x)\geq\frac{1}{2}p_{\bar{\alpha}}^{0}\frac{1}{\sigma_{\bar{\alpha}}^{0}}\exp\left(-\frac{x-\xi_{\alpha}^{0}}{\sigma_{\bar{\alpha}}^{0}}\right)\geq\frac{1}{2}p_{\bar{\alpha}}^{0}\frac{1}{\sigma_{\bar{\alpha}}^{0}},\quad\forall x\in[\xi_{\alpha}^{0}-\rho,\xi_{\alpha}^{0}].

Plugging the preceding display into (62), one obtains for large \ell,

\displaystyle\frac{2V(P_{G_{\ell}},P_{G_{0}})}{D_{1}(G_{\ell},G_{0})}\geq\frac{1}{D_{1}(G_{\ell},G_{0})}\int_{\xi_{\bar{\alpha}}^{\ell}}^{\bar{\xi}^{\ell}}\frac{1}{2}p_{\bar{\alpha}}^{0}\frac{1}{\sigma_{\bar{\alpha}}^{0}}dx
\displaystyle=\left(\frac{\xi_{\alpha}^{0}-\xi_{\bar{\alpha}}^{\ell}}{D_{1}(G_{\ell},G_{0})}-\frac{\xi_{\alpha}^{0}-\bar{\xi}^{\ell}}{D_{1}(G_{\ell},G_{0})}\right)\frac{1}{2}p_{\bar{\alpha}}^{0}\frac{1}{\sigma_{\bar{\alpha}}^{0}}
\displaystyle\to(-a_{\bar{\alpha}}^{(\xi)}-0)\frac{1}{2}p_{\bar{\alpha}}^{0}\frac{1}{\sigma_{\bar{\alpha}}^{0}}>0, (63)

where the convergence in the last step is due to (15). (63) contradicts the choice of G_{\ell}, which satisfies \frac{V(P_{G_{\ell}},P_{G_{0}})}{D_{1}(G_{\ell},G_{0})}\to 0.

(b): inverse bound (12) does not hold
Recall \Theta=\mathbb{R}\times(0,\infty). Consider G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{(\xi_{i}^{0},\sigma_{i}^{0})}\in\mathcal{E}_{k_{0}}(\Theta) with \xi_{1}^{0}=\xi_{2}^{0}, \sigma_{1}^{0}\neq\sigma_{2}^{0} and p_{1}^{0}/\sigma_{1}^{0}=p_{2}^{0}/\sigma_{2}^{0}. Denote \psi=p_{1}^{0}/\sigma_{1}^{0}>0. Consider G_{\ell}=p_{1}^{\ell}\delta_{(\xi_{1}^{\ell},\sigma_{1}^{0})}+p_{2}^{\ell}\delta_{(\xi_{2}^{0},\sigma_{2}^{0})}+\sum_{i=3}^{k_{0}}p_{i}^{0}\delta_{(\xi_{i}^{0},\sigma_{i}^{0})} with p_{1}^{\ell}=p_{1}^{0}+\frac{\psi}{2+2\psi}\frac{1}{\ell}, \xi_{1}^{\ell}=\xi_{1}^{0}-\frac{1}{2+2\psi}\frac{1}{\ell} and p_{2}^{\ell}=p_{2}^{0}-\frac{\psi}{2+2\psi}\frac{1}{\ell}. Consider H_{\ell}=p_{2}^{0}\delta_{(\tilde{\xi}_{2}^{\ell},\sigma_{2}^{0})}+\sum_{i=1,i\neq 2}^{k_{0}}p_{i}^{0}\delta_{(\xi_{i}^{0},\sigma_{i}^{0})} with \tilde{\xi}_{2}^{\ell}=\xi_{1}^{\ell}. It is clear that when \ell is large, G_{\ell},H_{\ell}\in\mathcal{E}_{k_{0}}(\Theta) and G_{\ell},H_{\ell}\to G_{0}. Moreover, when \ell is large, D_{1}(G_{\ell},H_{\ell})=1/\ell since \xi_{1}^{0}=\xi_{2}^{0}. Then

\frac{2V(P_{G_{\ell}},P_{H_{\ell}})}{D_{1}(G_{\ell},H_{\ell})}=\ell\int_{\mathbb{R}}\left|p_{1}^{\ell}f(x|\xi_{1}^{\ell},\sigma_{1}^{0})+p_{2}^{\ell}f(x|\xi_{1}^{0},\sigma_{2}^{0})-p_{1}^{0}f(x|\xi_{1}^{0},\sigma_{1}^{0})-p_{2}^{0}f(x|\xi_{1}^{\ell},\sigma_{2}^{0})\right|dx (64)

since \xi_{1}^{0}=\xi_{2}^{0} and \tilde{\xi}_{2}^{\ell}=\xi_{1}^{\ell}. Denote the integrand in (64) (including the absolute value) by A_{\ell}(x). By the definition of f(x|\xi,\sigma) for the location-scale exponential distribution, A_{\ell}(x)=0 on (-\infty,\xi_{1}^{\ell}).

On (\xi_{1}^{\ell},\xi_{1}^{0}), A_{\ell}(x)=\left|\frac{p_{1}^{\ell}}{\sigma_{1}^{0}}e^{-\frac{x-\xi_{1}^{\ell}}{\sigma_{1}^{0}}}-\frac{p_{2}^{0}}{\sigma_{2}^{0}}e^{-\frac{x-\xi_{1}^{\ell}}{\sigma_{2}^{0}}}\right| converges to B(x):=\left|\frac{p_{1}^{0}}{\sigma_{1}^{0}}e^{-\frac{x-\xi_{1}^{0}}{\sigma_{1}^{0}}}-\frac{p_{2}^{0}}{\sigma_{2}^{0}}e^{-\frac{x-\xi_{1}^{0}}{\sigma_{2}^{0}}}\right| uniformly. Then

\displaystyle\limsup_{\ell}\ \ell\int_{\xi_{1}^{\ell}}^{\xi_{1}^{0}}A_{\ell}(x)dx
\displaystyle\leq\limsup_{\ell}\ \ell\int_{\xi_{1}^{\ell}}^{\xi_{1}^{0}}B(x)dx+\limsup_{\ell}\ \ell(\xi_{1}^{0}-\xi_{1}^{\ell})\sup_{x\in(\xi_{1}^{\ell},\xi_{1}^{0})}{|A_{\ell}(x)-B(x)|}
\displaystyle=\limsup_{\ell}\ \frac{1}{2+2\psi}\frac{1}{\xi_{1}^{0}-\xi_{1}^{\ell}}\int_{\xi_{1}^{\ell}}^{\xi_{1}^{0}}B(x)dx
\displaystyle=\frac{1}{2+2\psi}B(\xi_{1}^{0})
\displaystyle=0, (65)

where the first equality uses \ell(\xi_{1}^{0}-\xi_{1}^{\ell})=\frac{1}{2+2\psi} together with the uniform convergence of A_{\ell} to B, the second equality follows by the fundamental theorem of calculus, and the last step follows from p_{1}^{0}/\sigma_{1}^{0}=p_{2}^{0}/\sigma_{2}^{0}.

On (\xi_{1}^{0},\infty), A_{\ell}(x)=\left|\frac{p_{1}^{\ell}}{\sigma_{1}^{0}}e^{-\frac{x-\xi_{1}^{\ell}}{\sigma_{1}^{0}}}+\frac{p_{2}^{\ell}}{\sigma_{2}^{0}}e^{-\frac{x-\xi_{1}^{0}}{\sigma_{2}^{0}}}-\frac{p_{1}^{0}}{\sigma_{1}^{0}}e^{-\frac{x-\xi_{1}^{0}}{\sigma_{1}^{0}}}-\frac{p_{2}^{0}}{\sigma_{2}^{0}}e^{-\frac{x-\xi_{1}^{\ell}}{\sigma_{2}^{0}}}\right|. Then on (\xi_{1}^{0},\infty),

\displaystyle\ell A_{\ell}(x)
\displaystyle=\ell\left|\frac{p_{1}^{\ell}}{\sigma_{1}^{0}}\left(e^{-\frac{x-\xi_{1}^{\ell}}{\sigma_{1}^{0}}}-e^{-\frac{x-\xi_{1}^{0}}{\sigma_{1}^{0}}}\right)+\frac{p_{1}^{\ell}-p_{1}^{0}}{\sigma_{1}^{0}}e^{-\frac{x-\xi_{1}^{0}}{\sigma_{1}^{0}}}+\frac{p_{2}^{\ell}-p_{2}^{0}}{\sigma_{2}^{0}}e^{-\frac{x-\xi_{1}^{0}}{\sigma_{2}^{0}}}+\frac{p_{2}^{0}}{\sigma_{2}^{0}}\left(e^{-\frac{x-\xi_{1}^{0}}{\sigma_{2}^{0}}}-e^{-\frac{x-\xi_{1}^{\ell}}{\sigma_{2}^{0}}}\right)\right|
\displaystyle\to\left|-\frac{p_{1}^{0}}{\sigma_{1}^{0}}e^{-\frac{x-\xi_{1}^{0}}{\sigma_{1}^{0}}}\frac{1}{\sigma_{1}^{0}}\frac{1}{2+2\psi}+\frac{\psi}{2+2\psi}\frac{1}{\sigma_{1}^{0}}e^{-\frac{x-\xi_{1}^{0}}{\sigma_{1}^{0}}}-\frac{\psi}{2+2\psi}\frac{1}{\sigma_{2}^{0}}e^{-\frac{x-\xi_{1}^{0}}{\sigma_{2}^{0}}}+\frac{p_{2}^{0}}{\sigma_{2}^{0}}e^{-\frac{x-\xi_{1}^{0}}{\sigma_{2}^{0}}}\frac{1}{\sigma_{2}^{0}}\frac{1}{2+2\psi}\right|
\displaystyle=0,

where the last step follows from \frac{p_{1}^{0}}{\sigma_{1}^{0}}=\frac{p_{2}^{0}}{\sigma_{2}^{0}}=\psi. It is easy to find an envelope function of \ell A_{\ell}(x) that is integrable on (\xi_{1}^{0},\infty), and thus by the dominated convergence theorem,

\lim_{\ell}\ell\int_{\xi_{1}^{0}}^{\infty}A_{\ell}(x)dx=0. (66)

The conclusion then follows from (64), (65) and (66). ∎
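The vanishing of this ratio can also be observed numerically. The sketch below takes k_{0}=2 with the illustrative values \sigma_{1}^{0}=1, \sigma_{2}^{0}=2 (hence p_{1}^{0}=1/3, p_{2}^{0}=2/3 and \psi=1/3, and \xi_{1}^{0}=0), evaluates the right-hand side of (64) by a midpoint rule, and checks that it decreases toward 0 as \ell grows.

```python
import math

# Location-scale exponential kernel f(x | xi, sigma); see part (b) above.
def f(x, xi, sigma):
    return math.exp(-(x - xi) / sigma) / sigma if x > xi else 0.0

sigma1, sigma2 = 1.0, 2.0        # illustrative scales
p1, p2 = 1.0 / 3.0, 2.0 / 3.0    # chosen so p1/sigma1 = p2/sigma2
psi = p1 / sigma1
c = 1.0 / (2.0 + 2.0 * psi)

def midpoint(g, lo, hi, n):
    h = (hi - lo) / n
    return sum(g(lo + (k + 0.5) * h) for k in range(n)) * h

def ratio(ell):
    xi1 = -c / ell               # xi_1^ell, taking xi_1^0 = 0
    q1 = p1 + psi * c / ell      # p_1^ell
    q2 = p2 - psi * c / ell      # p_2^ell
    def absdiff(x):
        return abs(q1 * f(x, xi1, sigma1) + q2 * f(x, 0.0, sigma2)
                   - p1 * f(x, 0.0, sigma1) - p2 * f(x, xi1, sigma2))
    # split the integral at the kinks x = xi_1^ell and x = 0; the tail
    # beyond x = 60 is negligible for these parameters
    return ell * (midpoint(absdiff, xi1, 0.0, 2000)
                  + midpoint(absdiff, 0.0, 60.0, 60000))

r10, r100, r1000 = ratio(10.0), ratio(100.0), ratio(1000.0)
assert r10 > r100 > r1000   # the ratio in (64) shrinks as ell grows
assert r1000 < 0.05
```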

Proof of Lemma 4.11.

Take \tilde{f}(x)=\max_{i\in[k_{0}]}\bar{f}(x)\sqrt{f(x|\theta_{i}^{0})}. Then \tilde{f}(x) is \mu-integrable by the Cauchy-Schwarz inequality. Moreover, for any i\in[k_{0}] and any 0<\Delta\leq\gamma_{0},

\left|\frac{f(x|\theta_{i}^{0}+a_{i}\Delta)-f(x|\theta_{i}^{0})}{\Delta}\right|\leq\tilde{f}(x)\quad\mu-a.e.\ x\in\mathfrak{X}.

Then by Lemma 4.13 b), (a_{1},b_{1},\ldots,a_{k_{0}},b_{k_{0}}) is a nonzero solution of the system of equations (9a), (9b).

Let a^{\prime}_{i}=\frac{a_{i}/p_{i}^{0}}{\sum_{i=1}^{k_{0}}\left(\|a_{i}/p_{i}^{0}\|_{2}+|b_{i}|\right)} and b^{\prime}_{i}=\frac{b_{i}}{\sum_{i=1}^{k_{0}}\left(\|a_{i}/p_{i}^{0}\|_{2}+|b_{i}|\right)}. Then a^{\prime}_{i}, b^{\prime}_{i} satisfy \sum_{i=1}^{k_{0}}(\|a^{\prime}_{i}\|_{2}+|b^{\prime}_{i}|)=1 and (p_{1}^{0}a^{\prime}_{1},b^{\prime}_{1},\ldots,p_{k_{0}}^{0}a^{\prime}_{k_{0}},b^{\prime}_{k_{0}}) is also a nonzero solution of (9a), (9b) with k,\theta_{i} replaced respectively by k_{0},\theta_{i}^{0}. Let G_{\ell}=\sum_{i=1}^{k_{0}}p_{i}^{\ell}\delta_{\theta_{i}^{\ell}} with p_{i}^{\ell}=p_{i}^{0}+b^{\prime}_{i}\frac{1}{\ell} and \theta_{i}^{\ell}=\theta_{i}^{0}+\frac{1}{\ell}a^{\prime}_{i} for 1\leq i\leq k_{0}. When \ell is large, 0<p_{i}^{\ell}<1 and \theta_{i}^{\ell}\in\Theta since 0<p_{i}^{0}<1 and \theta_{i}^{0}\in\Theta^{\circ}. Moreover, \sum_{i=1}^{k_{0}}p_{i}^{\ell}=1 since \sum_{i=1}^{k_{0}}b^{\prime}_{i}=0. Then G_{\ell}\in\mathcal{E}_{k_{0}}(\Theta) and G_{\ell}\not=G_{0} since at least one of the a^{\prime}_{i} or b^{\prime}_{i} is nonzero. When \ell is large, D_{1}(G_{\ell},G_{0})=\sum_{i=1}^{k_{0}}\left(\|\theta_{i}^{\ell}-\theta_{i}^{0}\|_{2}+|p_{i}^{\ell}-p_{i}^{0}|\right)=\frac{1}{\ell}. Thus when \ell is large,

2h2(PG,PG0)D12(G,G0)\displaystyle\frac{2h^{2}(P_{G_{\ell}},P_{G_{0}})}{D_{1}^{2}(G_{\ell},G_{0})}
=\displaystyle= S|pG(x)pG0(x)D1(G,G0)1pG(x)+pG0(x)|2μ(dx)\displaystyle\int_{S}\left|\frac{p_{G_{\ell}}(x)-p_{G_{0}}(x)}{D_{1}(G_{\ell},G_{0})}\frac{1}{\sqrt{p_{G_{\ell}}(x)}+\sqrt{p_{G_{0}}(x)}}\right|^{2}\mu(dx)
=\displaystyle= S\𝒩|(i=1k0pif(x|θi)f(x|θi0)1/+i=1k0bif(x|θi0))1pG(x)+pG0(x)|2μ(dx).\displaystyle\int_{S\backslash\mathcal{N}}\left|\left(\sum_{i=1}^{k_{0}}p_{i}^{\ell}\frac{f(x|\theta_{i}^{\ell})-f(x|\theta_{i}^{0})}{1/\ell}+\sum_{i=1}^{k_{0}}b^{\prime}_{i}f(x|\theta_{i}^{0})\right)\frac{1}{\sqrt{p_{G_{\ell}}(x)}+\sqrt{p_{G_{0}}(x)}}\right|^{2}\mu(dx).

The integrand of the last integral is bounded by

|i=1k0pipi0f(x|θi)f(x|θi0)1/×f(x|θi0)+i=1k0bipi0f(x|θi0)|2\displaystyle\left|\sum_{i=1}^{k_{0}}\frac{p_{i}^{\ell}}{\sqrt{p_{i}^{0}}}\frac{f(x|\theta_{i}^{\ell})-f(x|\theta_{i}^{0})}{1/\ell\times\sqrt{f(x|\theta_{i}^{0})}}+\sum_{i=1}^{k_{0}}\frac{b^{\prime}_{i}}{\sqrt{p_{i}^{0}}}\sqrt{f(x|\theta_{i}^{0})}\right|^{2}
\displaystyle\leq 2k0i=1k0(pi)2pi0|f(x|θi)f(x|θi0)1/×f(x|θi0)|2+2k0i=1k0(bi)2pi0f(x|θi0)\displaystyle 2k_{0}\sum_{i=1}^{k_{0}}\frac{(p_{i}^{\ell})^{2}}{p_{i}^{0}}\left|\frac{f(x|\theta_{i}^{\ell})-f(x|\theta_{i}^{0})}{1/\ell\times\sqrt{f(x|\theta_{i}^{0})}}\right|^{2}+2k_{0}\sum_{i=1}^{k_{0}}\frac{(b^{\prime}_{i})^{2}}{p_{i}^{0}}f(x|\theta_{i}^{0})
\displaystyle\leq 2k0i=1k01pi0(1/pi0j=1k0(aj/pj02+|bj|))2f¯2(x)+2k0i=1k0(bi)2pi0f(x|θi0),2k_{0}\sum_{i=1}^{k_{0}}\frac{1}{p_{i}^{0}}\left(\frac{1/p_{i}^{0}}{\sum_{j=1}^{k_{0}}\left(\|a_{j}/p_{j}^{0}\|_{2}+|b_{j}|\right)}\right)^{2}\bar{f}^{2}(x)+2k_{0}\sum_{i=1}^{k_{0}}\frac{(b^{\prime}_{i})^{2}}{p_{i}^{0}}f(x|\theta_{i}^{0}),

which is integrable with respect to μ\mu on S\𝒩S\backslash\mathcal{N}. Here the last inequality follows from

|f(x|θi)f(x|θi0)1/×f(x|θi0)|=|f(x|θi0+ai2ai2aiΔ)f(x|θi0)Δf(x|θi0)|ai2ai2f¯(x).\left|\frac{f(x|\theta_{i}^{\ell})-f(x|\theta_{i}^{0})}{1/\ell\times\sqrt{f(x|\theta_{i}^{0})}}\right|=\left|\frac{f(x|\theta_{i}^{0}+\frac{\|a^{\prime}_{i}\|_{2}}{\|a_{i}\|_{2}}a_{i}\Delta)-f(x|\theta_{i}^{0})}{\Delta\sqrt{f(x|\theta_{i}^{0})}}\right|\leq\frac{\|a^{\prime}_{i}\|_{2}}{\|a_{i}\|_{2}}\bar{f}(x).

Then by the dominated convergence theorem

lim2h2(PG(x),PG0(x))D12(G,G0)\displaystyle\lim_{\ell\to\infty}\frac{2h^{2}(P_{G_{\ell}}(x),P_{G_{0}}(x))}{D^{2}_{1}(G_{\ell},G_{0})}
=\displaystyle= S\𝒩|(i=1k0pi0ai,θf(x|θi0)+i=1k0bif(x|θi0))12pG0(x)|2μ(dx)\displaystyle\int_{S\backslash\mathcal{N}}\left|\left(\sum_{i=1}^{k_{0}}p_{i}^{0}\langle a^{\prime}_{i},\nabla_{\theta}f(x|\theta_{i}^{0})\rangle+\sum_{i=1}^{k_{0}}b^{\prime}_{i}f(x|\theta_{i}^{0})\right)\frac{1}{2\sqrt{p_{G_{0}}(x)}}\right|^{2}\mu(dx)
=\displaystyle= 0.\displaystyle 0.

The proof is completed by (14). ∎

B.2 Proofs for Section 4.2

Proof of Lemma 4.13.

a) For x𝔛\𝒩x\in\mathfrak{X}\backslash\mathcal{N}, θf~(x|θi)=g(θi)θf(x|θi)+f(x|θi)θg(θi)\nabla_{\theta}\tilde{f}(x|\theta_{i})=g(\theta_{i})\nabla_{\theta}f(x|\theta_{i})+f(x|\theta_{i})\nabla_{\theta}g(\theta_{i}). Then (a~1,b~1,,a~k,b~k)(\tilde{a}_{1},\tilde{b}_{1},\ldots,\tilde{a}_{k},\tilde{b}_{k}) is a solution of (9a) with ff replaced by f~\tilde{f} if and only if (a1,b1,,ak,bk)(a_{1},b_{1},\ldots,a_{k},b_{k}) with ai=g(θi)a~ia_{i}=g(\theta_{i})\tilde{a}_{i} and bi=a~i,θg(θi)+b~ig(θi)b_{i}=\langle\tilde{a}_{i},\nabla_{\theta}g(\theta_{i})\rangle+\tilde{b}_{i}g(\theta_{i}) is a solution of (9a). We can write a~i=ai/g(θi)\tilde{a}_{i}=a_{i}/g(\theta_{i}) and b~i=(biai,θg(θi)/g(θi))/g(θi)\tilde{b}_{i}=(b_{i}-\langle a_{i},\nabla_{\theta}g(\theta_{i})\rangle/g(\theta_{i}))/g(\theta_{i}). Thus (a~1,b~1,,a~k,b~k)(\tilde{a}_{1},\tilde{b}_{1},\ldots,\tilde{a}_{k},\tilde{b}_{k}) is zero if and only if (a1,b1,,ak,bk)(a_{1},b_{1},\ldots,a_{k},b_{k}) is zero.

b) Under the given conditions, by the dominated convergence theorem

𝔛\𝒩ai,θf(x|θi)𝑑μ=ai,θ𝔛\𝒩f(x|θ)𝑑μ|θ=θi=0.\int_{\mathfrak{X}\backslash\mathcal{N}}\left\langle a_{i},\nabla_{\theta}f(x|\theta_{i})\right\rangle d\mu=\left\langle a_{i},\nabla_{\theta}\left.\int_{\mathfrak{X}\backslash\mathcal{N}}f(x|\theta)d\mu\right\rangle\right|_{\theta=\theta_{i}}=0.

where the last step follows from μ(𝒩)=0\mu(\mathcal{N})=0 and the fact that f(x|θ)f(x|\theta) is a density with respect to μ\mu. Thus for any solution (a1,b1,,ak,bk)(a_{1},b_{1},\ldots,a_{k},b_{k}) of (9a),

i=1kbi=𝔛\𝒩i=1k(ai,θf(x|θi)+bif(x|θi))dμ=0.\sum_{i=1}^{k}b_{i}=\int_{\mathfrak{X}\backslash\mathcal{N}}\sum_{i=1}^{k}\left(\langle a_{i},\nabla_{\theta}f(x|\theta_{i})\rangle+b_{i}f(x|\theta_{i})\right)d\mu=0.

So (a1,b1,,ak,bk)(a_{1},b_{1},\ldots,a_{k},b_{k}) is also a solution of the system (9a),(9b).

It remains to show that (19) is equivalent to the same conditions on f~\tilde{f}. Suppose (19) is true. Then there exists a sufficiently small γ~(θi,ai)<γ(θi,ai)\tilde{\gamma}(\theta_{i},a_{i})<\gamma(\theta_{i},a_{i}) such that for 0<Δγ~(θi,ai)0<\Delta\leq\tilde{\gamma}(\theta_{i},a_{i})

|f~(x|θi+aiΔ)f~(x|θi)Δ|\displaystyle\left|\frac{\tilde{f}(x|\theta_{i}+a_{i}\Delta)-\tilde{f}(x|\theta_{i})}{\Delta}\right|
\displaystyle\leq g(θi+aiΔ)|f(x|θi+aiΔ)f(x|θi)Δ|+|g(θi+aiΔ)g(θi)Δ|f(x|θi)\displaystyle g(\theta_{i}+a_{i}\Delta)\left|\frac{f(x|\theta_{i}+a_{i}\Delta)-f(x|\theta_{i})}{\Delta}\right|+\left|\frac{g(\theta_{i}+a_{i}\Delta)-g(\theta_{i})}{\Delta}\right|f(x|\theta_{i})
\displaystyle\leq C(g,θi,ai,γ~(θi,ai))(f¯(x|θi,ai)+f(x|θi))μa.e.x𝔛C(g,\theta_{i},a_{i},\tilde{\gamma}(\theta_{i},a_{i}))(\bar{f}(x|\theta_{i},a_{i})+f(x|\theta_{i}))\quad\mu-a.e.\ x\in\mathfrak{X}

and thus one can take μ\mu-integrable f¯1(x|θi,ai)=C(g,θi,ai,γ~(θi,ai))(f¯(x|θi,ai)+f(x|θi))\bar{f}_{1}(x|\theta_{i},a_{i})=C(g,\theta_{i},a_{i},\tilde{\gamma}(\theta_{i},a_{i}))(\bar{f}(x|\theta_{i},a_{i})+f(x|\theta_{i})). The reverse direction follows similarly.

c) This is a direct consequence of parts a) and b). ∎

Proof of Lemma 4.15.

Notice that f(x|θ)f(x|\theta) is continuously differentiable at every θΘ\theta\in\Theta^{\circ} for each fixed x𝔛x\in\mathfrak{X}. By Lemma B.4 and Lemma 4.13 c), (9a) has the same solutions as the system (9a), (9b).

It is obvious that a) implies b) and that c) implies d). That a) implies c) and that b) implies d) follows from V(pG,pG0)2h(pG,pG0)V(p_{G},p_{G_{0}})\leq\sqrt{2}h(p_{G},p_{G_{0}}). That e) implies a) follows from Lemma 4.2 b). It remains to prove that d) implies e).

Suppose d) holds and the system of equations (9a), (9b) with k,θik,\theta_{i} replaced respectively by k0,θi0k_{0},\theta_{i}^{0} has a nonzero solution (a1,b1,,ak0,bk0)(a_{1},b_{1},\ldots,a_{k_{0}},b_{k_{0}}). By Lemma B.4, the condition d) of Lemma 4.11 is satisfied with γ0=mini[k0]γ(θi0,ai)\gamma_{0}=\min_{i\in[k_{0}]}\gamma(\theta_{i}^{0},a_{i}) and f¯(x)=maxi[k0]f¯(x|θi0,ai)\bar{f}(x)=\max_{i\in[k_{0}]}\bar{f}(x|\theta_{i}^{0},a_{i}). Then Lemma 4.11 implies that d) does not hold, a contradiction. Hence d) implies e). ∎

Proof of Lemma 4.16.

Consider f~(x|η):=f(x|θ)\tilde{f}(x|\eta):=f(x|\theta) to be the same kernel but under the new parameter η=η(θ)\eta=\eta(\theta). Note {f~(x|η)}ηΞ\{\tilde{f}(x|\eta)\}_{\eta\in\Xi} with Ξ:=η(Θ)\Xi:=\eta(\Theta) is the canonical parametrization of the same exponential family. Write ηi0=η(θi0)\eta_{i}^{0}=\eta(\theta_{i}^{0}). Since Jη(θ)=(η(i)θ(j)(θ))ijJ_{\eta}(\theta)=(\frac{\partial\eta^{(i)}}{\partial\theta^{(j)}}(\theta))_{ij} exists at each θi0\theta_{i}^{0}, at those points

θf(x|θi0)=(Jη(θi0))ηf~(x|ηi0),i[k0]\nabla_{\theta}f(x|\theta_{i}^{0})=\left(J_{\eta}(\theta_{i}^{0})\right)^{\top}\nabla_{\eta}\tilde{f}(x|\eta_{i}^{0}),\quad\forall i\in[k_{0}]

and thus

i=1k0(ai,θf(x|θi0)+bif(x|θi0))=i=1k0(Jη(θi0)ai,ηf~(x|ηi0)+bif~(x|ηi0)).\sum_{i=1}^{k_{0}}\left(\langle a_{i},\nabla_{\theta}f(x|\theta_{i}^{0})\rangle+b_{i}f(x|\theta_{i}^{0})\right)=\sum_{i=1}^{k_{0}}\left(\langle J_{\eta}(\theta_{i}^{0})a_{i},\nabla_{\eta}\tilde{f}(x|\eta_{i}^{0})\rangle+b_{i}\tilde{f}(x|\eta_{i}^{0})\right). (67)

Then (9a), (9b) with k,θik,\theta_{i} replaced respectively by k0,θi0k_{0},\theta_{i}^{0} has only the zero solution if and only if (9a), (9b) with k,θi,fk,\theta_{i},f replaced respectively by k0,ηi0,f~k_{0},\eta_{i}^{0},\tilde{f} has only the zero solution.

Suppose that (a1,b1,,ak0,bk0)(a_{1},b_{1},\ldots,a_{k_{0}},b_{k_{0}}) is a solution of (9a) with k,θik,\theta_{i} replaced respectively by k0,θi0k_{0},\theta_{i}^{0}. Then by (67), (a~1,b~1,,a~k0,b~k0)(\tilde{a}_{1},\tilde{b}_{1},\ldots,\tilde{a}_{k_{0}},\tilde{b}_{k_{0}}) with a~i=Jη(θi0)ai\tilde{a}_{i}=J_{\eta}(\theta_{i}^{0})a_{i}, b~i=bi\tilde{b}_{i}=b_{i} is a solution of (9a) with k,θi,fk,\theta_{i},f replaced respectively by k0,ηi0,f~k_{0},\eta_{i}^{0},\tilde{f}. Then by Lemma 4.15, necessarily i=1k0bi=i=1k0b~i=0\sum_{i=1}^{k_{0}}b_{i}=\sum_{i=1}^{k_{0}}\tilde{b}_{i}=0. That is, (a1,b1,,ak0,bk0)(a_{1},b_{1},\ldots,a_{k_{0}},b_{k_{0}}) is a solution of the system of equations (9a), (9b) with k,θik,\theta_{i} replaced respectively by k0,θi0k_{0},\theta_{i}^{0}. As a result, with k,θik,\theta_{i} replaced respectively by k0,θi0k_{0},\theta_{i}^{0}, (9a) has the same solutions as the system (9a), (9b).

The rest of the proof is completed by appealing to Lemma 5.7 and Lemma 4.15. ∎

B.3 Auxiliary lemmas for Section B.1

Lemma B.2.

Let g(x)g(x) be a function on d\mathbb{R}^{d} whose gradient g(x)\nabla g(x) exists in a neighborhood of x0x_{0} and is continuous at x0x_{0}.

  1. a)

    Then when xx0x\to x_{0} and yx0y\to x_{0}

    |g(x)g(y)g(x0),xy|=o(xy2).|g(x)-g(y)-\langle\nabla g(x_{0}),x-y\rangle|=o(\|x-y\|_{2}).
  2. b)

    If, in addition, the Hessian 2g(x)\nabla^{2}g(x) is continuous in a neighborhood of x0x_{0}, then for any x,yx,y in a closed ball BB centered at x0x_{0} and contained in that neighborhood,

    |g(x)g(y)g(x0),xy|\displaystyle|g(x)-g(y)-\langle\nabla g(x_{0}),x-y\rangle|
    \displaystyle\leq 01012g(x0+s(y+t(xy)x0))2𝑑s𝑑txy2max{xx02,yx02}\displaystyle\int_{0}^{1}\int_{0}^{1}\|\nabla^{2}g(x_{0}+s(y+t(x-y)-x_{0}))\|_{2}dsdt\ \|x-y\|_{2}\max\{\|x-x_{0}\|_{2},\|y-x_{0}\|_{2}\}
    \displaystyle\leq d1i,jd0101|2gx(i)x(j)(x0+s(y+t(xy)x0))|dsdt×\displaystyle d\sum_{1\leq i,j\leq d}\int_{0}^{1}\int_{0}^{1}\left|\frac{\partial^{2}g}{\partial x^{(i)}x^{(j)}}(x_{0}+s(y+t(x-y)-x_{0}))\right|dsdt\times
    xy2max{xx02,yx02}.\displaystyle\quad\ \|x-y\|_{2}\max\{\|x-x_{0}\|_{2},\|y-x_{0}\|_{2}\}.

    Moreover

    |g(x)g(y)g(x0),xy|Lxy2max{xx02,yx02}.|g(x)-g(y)-\langle\nabla g(x_{0}),x-y\rangle|\leq L\|x-y\|_{2}\max\{\|x-x_{0}\|_{2},\|y-x_{0}\|_{2}\}.

    where L=supxB2g(x)2<L=\sup_{x\in B}\|\nabla^{2}g(x)\|_{2}<\infty.
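The final bound in part b) can be sanity-checked numerically. The following sketch (added here for the reader's convenience; it is not part of the original argument) uses a quadratic test function, for which the Hessian is a constant matrix and LL equals its spectral norm exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic test function g(x) = 0.5 x^T A x: its Hessian is the constant
# matrix A, so L = sup_B ||Hess g(x)||_2 = ||A||_2 exactly.
A = rng.standard_normal((3, 3))
A = (A + A.T) / 2  # symmetrize

def g(x):
    return 0.5 * x @ A @ x

x0 = rng.standard_normal(3)
grad_x0 = A @ x0              # gradient of g at x0
L = np.linalg.norm(A, 2)      # spectral norm of the Hessian

for _ in range(1000):
    x = x0 + 0.5 * rng.standard_normal(3)
    y = x0 + 0.5 * rng.standard_normal(3)
    lhs = abs(g(x) - g(y) - grad_x0 @ (x - y))
    rhs = (L * np.linalg.norm(x - y)
           * max(np.linalg.norm(x - x0), np.linalg.norm(y - x0)))
    assert lhs <= rhs + 1e-9
```

For the quadratic case the inequality can also be verified exactly, since g(x)g(y)g(x0),xy=A((x+y)/2x0),xyg(x)-g(y)-\langle\nabla g(x_{0}),x-y\rangle=\langle A((x+y)/2-x_{0}),x-y\rangle.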

Lemma B.3.

Let kk be a positive integer, b1<<bkb_{1}<\ldots<b_{k} be a sequence of real numbers and let μ\mu be the Lebesgue measure on \mathbb{R}.

  1. a)

    Let {hi(x)}i=1k\{h_{i}(x)\}_{i=1}^{k} be a sequence of polynomials. Consider any nonempty interval II. Then

    i=1khi(x)ebix=0μa.e.xI\sum_{i=1}^{k}h_{i}(x)e^{b_{i}x}=0\quad\mu-a.e.\ x\in I

    implies hi(x)0h_{i}(x)\equiv 0 for any i[k]i\in[k].

  2. b)

    Let {hi(x)}i=1k\{h_{i}(x)\}_{i=1}^{k} be a sequence of functions, where each is of the form j=1miajxγj\sum_{j=1}^{m_{i}}a_{j}x^{\gamma_{j}}, i.e. a finite linear combination of power functions. Let {gi(x)}i=1k\{g_{i}(x)\}_{i=1}^{k} be another sequence of such functions. Consider any nonempty interval I(0,)I\subset(0,\infty). Then

    i=1k(hi(x)+gi(x)ln(x))ebix=0μa.e.xI\sum_{i=1}^{k}(h_{i}(x)+g_{i}(x)\ln(x))e^{b_{i}x}=0\quad\mu-a.e.\ x\in I

    implies that, for x0x\neq 0, hi(x)0h_{i}(x)\equiv 0 and gi(x)0g_{i}(x)\equiv 0 for any i[k]i\in[k].
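The linear independence asserted in Lemma B.3 can be checked numerically on a grid: with illustrative rates and exponents (chosen here only for the sanity check, not taken from the paper), the functions xγebxx^{\gamma}e^{bx} and xγln(x)ebxx^{\gamma}\ln(x)e^{bx} yield a full-column-rank design matrix on a subinterval of (0,)(0,\infty), so only the zero combination vanishes there.

```python
import numpy as np

# Grid on a subinterval I of (0, infinity); rates and exponents below are
# illustrative choices for this check only.
x = np.linspace(1.0, 2.0, 200)
rates = [0.0, 2.0]      # the b_i
gammas = [0.0, 2.0]     # exponents of the power functions

# Columns: x^g e^{bx} and x^g ln(x) e^{bx} for all combinations.
cols = []
for b in rates:
    for g in gammas:
        cols.append(x**g * np.exp(b * x))
        cols.append(x**g * np.log(x) * np.exp(b * x))
M = np.column_stack(cols)

# Full column rank on the grid: no nonzero linear combination of these
# functions vanishes identically on I.
assert np.linalg.matrix_rank(M) == M.shape[1]
```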

Lemma B.4.

Let f(x|θ)f(x|\theta) be the density of a full rank exponential family in canonical form specified as in Lemma 4.15. Then for any θΘ\theta\in\Theta^{\circ} and aqa\in\mathbb{R}^{q} there exists γ(θ,a)>0\gamma(\theta,a)>0 such that for any 0<Δγ(θ,a)0<\Delta\leq\gamma(\theta,a),

|f(x|θ+aΔ)f(x|θ)Δf(x|θ)|f¯(x|θ,a)xS={x|f(x|θ)>0}\left|\frac{f(x|\theta+a\Delta)-f(x|\theta)}{\Delta\sqrt{f(x|\theta)}}\right|\leq\bar{f}(x|\theta,a)\quad\forall x\in S=\{x|f(x|\theta)>0\}

with 𝔛f¯2(x|θ,a)𝑑μ<\int_{\mathfrak{X}}\bar{f}^{2}(x|\theta,a)d\mu<\infty and

|f(x|θ+aΔ)f(x|θ)Δ|f~(x|θ,a)x𝔛\left|\frac{f(x|\theta+a\Delta)-f(x|\theta)}{\Delta}\right|\leq\tilde{f}(x|\theta,a)\quad\forall x\in\mathfrak{X}

with 𝔛f~(x|θ,a)𝑑μ<\int_{\mathfrak{X}}\tilde{f}(x|\theta,a)d\mu<\infty. Here γ(θ,a)\gamma(\theta,a), f¯(x|θ,a)\bar{f}(x|\theta,a) and f~(x|θ,a)\tilde{f}(x|\theta,a) depend on θ\theta and aa.

Appendix C Proofs, additional lemmas and calculation details for Section 5

This section contains all proofs for Section 5 except those of Theorem 5.8 and Theorem 5.16, which occupy the bulk of the paper and will be presented in Section D. This section also contains additional lemmas on the invariance of different parametrizations and on the determinant of a type of generalized Vandermonde matrix, as well as calculation details for Examples 5.11 and 5.13.

C.1 Proofs for Section 5.1 and Corollary 5.9

Proof of Lemma 5.5.

In this proof we write n1n_{1} and N¯1\underline{{N}}_{1} for n1(G0)n_{1}(G_{0}) and N¯1(G0)\underline{{N}}_{1}(G_{0}) respectively. By Lemma 5.1 b), n1=N¯1<n_{1}=\underline{{N}}_{1}<\infty. For each N1{N}\geq 1, there exists RN(G0)>0R_{{N}}(G_{0})>0 such that for any Gk0(Θ)\{G0}G\in\mathcal{E}_{k_{0}}(\Theta)\backslash\{G_{0}\} with W1(G,G0)<RN(G0)W_{1}(G,G_{0})<R_{{N}}(G_{0})

V(PG,N,PG0,N)DN(G,G0)12lim infGW1G0Gk0(Θ)V(PG,N,PG0,N)DN(G,G0).\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{N}(G,G_{0})}\geq\frac{1}{2}\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{N}(G,G_{0})}. (68)

Take c(G0,N0)=min1iN0Ri(G0)>0c(G_{0},{N}_{0})=\min\limits_{1\leq i\leq{N}_{0}}R_{i}(G_{0})>0. Moreover, by the definition (25) for any NN¯1N\geq\underline{{N}}_{1},

lim infGW1G0Gk0(Θ)V(PG,N,PG0,N)DN(G,G0)infNN¯1lim infGW1G0Gk0(Θ)V(PG,N,PG0,N)DN(G,G0)>0.\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{N}(G,G_{0})}\geq\inf_{{N}\geq\underline{{N}}_{1}}\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{N}(G,G_{0})}>0.

Combining the last two displays completes the proof with

C(G0)=12infNN¯1lim infGW1G0Gk0(Θ)V(PG,N,PG0,N)DN(G,G0).C(G_{0})=\frac{1}{2}\inf_{{N}\geq\underline{{N}}_{1}}\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{N}(G,G_{0})}.

Proof of Lemma 5.6.

In this proof we write n1,n0n_{1},n_{0} for n1(G0),n0(G0,kk0k(Θ1))n_{1}(G_{0}),n_{0}(G_{0},\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1})) respectively. By the definition of n1n_{1}, for any Nn1{N}\geq n_{1}

lim infGW1G0Gk0(Θ1)V(PG,N,PG0,N)D1(G,G0)>0.\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta_{1})\end{subarray}}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{D_{1}(G,G_{0})}>0. (69)

By Lemma 3.2 b) one may replace the D1(G,G0)D_{1}(G,G_{0}) in the preceding display by W1(G,G0)W_{1}(G,G_{0}). Fix N1=n1n0{N}_{1}=n_{1}\vee n_{0}. Then there exists R>0R>0 depending on G0G_{0} such that

infGBW1(G0,R)\{G0}V(PG,N1,PG0,N1)W1(G,G0)>0,\inf_{G\in B_{W_{1}}(G_{0},R)\backslash\{G_{0}\}}\frac{V(P_{G,N_{1}},P_{G_{0},N_{1}})}{W_{1}(G,G_{0})}>0, (70)

where BW1(G0,R)B_{W_{1}}(G_{0},R) is the open ball in metric space (k=1k0k(Θ1),W1)(\bigcup_{k=1}^{k_{0}}\mathcal{E}_{k}(\Theta_{1}),W_{1}) with center at G0G_{0} and radius RR. Here we used the fact that any sufficiently small open ball in (k=1k0k(Θ1),W1)(\bigcup_{k=1}^{k_{0}}\mathcal{E}_{k}(\Theta_{1}),W_{1}) with center in k0(Θ1)\mathcal{E}_{k_{0}}(\Theta_{1}) is in k0(Θ1)\mathcal{E}_{k_{0}}(\Theta_{1}).

Notice that k=1k0k(Θ1)\bigcup_{k=1}^{k_{0}}\mathcal{E}_{k}(\Theta_{1}) is compact under the W1W_{1} metric if Θ1\Theta_{1} is compact. By the assumption that the map θPθ\theta\mapsto P_{\theta} is continuous, Lemma C.3 and the triangle inequality of total variation distance, V(PG,N,PG0,N)V(P_{G,{N}},P_{G_{0},{N}}) with domain (k=1k0k(Θ1),W1)(\bigcup_{k=1}^{k_{0}}\mathcal{E}_{k}(\Theta_{1}),W_{1}) is a continuous function of GG for each N{N}. Then GV(PG,N,PG0,N)W1(G,G0)G\mapsto\frac{V(P_{G,{N}},P_{G_{0},{N}})}{W_{1}(G,G_{0})} is a continuous map on k=1k0k(Θ1){G0}\bigcup_{k=1}^{k_{0}}\mathcal{E}_{k}(\Theta_{1})\setminus\{G_{0}\} for each N{N}. Moreover V(PG,N,PG0,N)W1(G,G0)\frac{V(P_{G,{N}},P_{G_{0},{N}})}{W_{1}(G,G_{0})} is positive on the compact set k=1k0k(Θ1)\BW1(G0,R)\bigcup_{k=1}^{k_{0}}\mathcal{E}_{k}(\Theta_{1})\backslash B_{W_{1}}(G_{0},R) provided Nn0{N}\geq n_{0}. As a result for each Nn0{N}\geq n_{0}

minGk=1k0k(Θ1)\BW1(G0,R)V(PG,N,PG0,N)W1(G,G0)>0.\min_{G\in\bigcup_{k=1}^{k_{0}}\mathcal{E}_{k}(\Theta_{1})\backslash B_{W_{1}}(G_{0},R)}\frac{V(P_{G,{N}},P_{G_{0},{N}})}{W_{1}(G,G_{0})}>0.

Combining the last display with N1=n1n0{N}_{1}=n_{1}\vee n_{0} and (70) yields

V(PG,N1,PG0,N1)C(G0,Θ1)W1(G,G0),V(P_{G,{N}_{1}},P_{G_{0},{N}_{1}})\geq C(G_{0},\Theta_{1})W_{1}(G,G_{0}), (71)

where C(G0,Θ1)C(G_{0},\Theta_{1}) is a constant depending on G0G_{0} and Θ1\Theta_{1}. Observing that V(PG,N,PG0,N)V(P_{G,{N}},P_{G_{0},{N}}) is nondecreasing in N{N}, the proof is complete. ∎
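The monotonicity used in the final step can be illustrated numerically. The sketch below (an illustration we add, with a Bernoulli product kernel and mixing measures chosen only for this check) computes V(PG,N,PG0,N)V(P_{G,N},P_{G_{0},N}) by exhaustive enumeration and confirms it is nondecreasing in NN:

```python
from itertools import product

def mix_pmf(ps, thetas, N):
    """pmf of P_{G,N} for a mixture of Bernoulli product kernels."""
    return {x: sum(p * th**sum(x) * (1 - th)**(N - sum(x))
                   for p, th in zip(ps, thetas))
            for x in product([0, 1], repeat=N)}

def tv(a, b):
    """Total variation distance between two pmfs on the same support."""
    return 0.5 * sum(abs(a[x] - b[x]) for x in a)

G = ([0.5, 0.5], [0.2, 0.7])     # hypothetical mixing measure G
G0 = ([0.3, 0.7], [0.25, 0.6])   # hypothetical mixing measure G0

tvs = [tv(mix_pmf(*G, N), mix_pmf(*G0, N)) for N in range(1, 7)]
assert all(tvs[i] <= tvs[i + 1] + 1e-12 for i in range(len(tvs) - 1))
```

The monotonicity itself is a general fact: PG,N1P_{G,{N}-1} is a marginal of PG,NP_{G,{N}}, and total variation cannot increase under marginalization.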

Proof of Lemma 5.7.

It is easy to see that when θ\theta is in a sufficiently small neighborhood of θi0\theta_{i}^{0},

(2(Jg(θi0))12)1θθi02g(θ)g(θi0)22Jg(θi0)2θθi02.(2\|(J_{g}(\theta_{i}^{0}))^{-1}\|_{2})^{-1}\|\theta-\theta_{i}^{0}\|_{2}\leq\|g(\theta)-g(\theta_{i}^{0})\|_{2}\leq 2\|J_{g}(\theta_{i}^{0})\|_{2}\|\theta-\theta_{i}^{0}\|_{2}.

Then when GG is in a small neighborhood of G0G_{0} under W1W_{1}

(2max1ik0(Jg(θi0))12+1)1DN(G,G0)\displaystyle(2\max_{1\leq i\leq k_{0}}\|(J_{g}(\theta_{i}^{0}))^{-1}\|_{2}+1)^{-1}D_{N}(G,G_{0})\leq DN(Gη,G0η)\displaystyle D_{N}(G^{\eta},G_{0}^{\eta})
\displaystyle\leq (2max1ik0Jg(θi0)2+1)DN(G,G0).\displaystyle(2\max_{1\leq i\leq k_{0}}\|J_{g}(\theta_{i}^{0})\|_{2}+1)D_{N}(G,G_{0}).

Moreover V(P~Gη,N,P~G0η,N)=V(PG,N,PG0,N).V(\tilde{P}_{G^{\eta},{N}},\tilde{P}_{G_{0}^{\eta},{N}})=V(P_{G,{N}},P_{G_{0},{N}}). Denote the left side and right side of (28) respectively by LL and RR. Then LC(G0)RL\leq C(G_{0})R and Lc(G0)RL\geq c(G_{0})R with

C(G0)=2max1ik0(Jg(θi0))12+1,c(G0)=(2max1ik0Jg(θi0)2+1)1.C(G_{0})=2\max_{1\leq i\leq k_{0}}\|(J_{g}(\theta_{i}^{0}))^{-1}\|_{2}+1,\quad c(G_{0})=(2\max_{1\leq i\leq k_{0}}\|J_{g}(\theta_{i}^{0})\|_{2}+1)^{-1}.

The other equation in the statement follows similarly. ∎

Proof of Corollary 5.9.

Consider f~(x|η):=f(x|θ)\tilde{f}(x|\eta):=f(x|\theta) to be the same kernel but under the new parameter η=η(θ)\eta=\eta(\theta). Note {f~(x|η)}ηΞ\{\tilde{f}(x|\eta)\}_{\eta\in\Xi} with Ξ:=η(Θ)\Xi:=\eta(\Theta) is the canonical parametrization of the same exponential family. Write ηi0=η(θi0)\eta_{i}^{0}=\eta(\theta_{i}^{0}). The proof is then completed by applying Theorem 5.8 to f~(x|η)\tilde{f}(x|\eta) and then applying Lemma 5.7. ∎

C.2 Auxiliary lemmas for Section C.1

Lemma C.1 (Lack of identifiability).

Fix G0=i=1k0pi0δθi0k0(Θ)G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}}\in\mathcal{E}_{k_{0}}(\Theta). Suppose i=1k0biPθi0=0\sum_{i=1}^{k_{0}}b_{i}P_{\theta_{i}^{0}}=0 has a nonzero solution (b1,,bk0)(b_{1},\ldots,b_{k_{0}}), where the 0 is the zero measure on 𝔛\mathfrak{X}. Then

lim infGW1G0Gk0(Θ)V(PG,PG0)D1(G,G0)=0.\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{V(P_{G},P_{G_{0}})}{D_{1}(G,G_{0})}=0. (72)
Lemma C.2.

Suppose the same conditions in Corollary 4.6 hold. Then for any aqa\in\mathbb{R}^{q}, for each i[k0]i\in[k_{0}], and for any 0<Δγ(θi0,a)0<\Delta\leq\gamma(\theta_{i}^{0},a),

|j=1Nf(xj|θi0+aΔ)j=1Nf(xj|θi0)Δ|f~Δ(x¯|θi0,a,N),Nμa.e.x¯=(x1,,xN)\left|\frac{\prod_{j=1}^{{N}}f(x_{j}|\theta_{i}^{0}+a\Delta)-\prod_{j=1}^{{N}}f(x_{j}|\theta_{i}^{0})}{\Delta}\right|\leq\tilde{f}_{\Delta}(\bar{x}|\theta_{i}^{0},a,N),\quad\bigotimes^{N}\mu-a.e.\ \bar{x}=(x_{1},\ldots,x_{N})

where f~Δ(x¯|θi0,a,N)\tilde{f}_{\Delta}(\bar{x}|\theta_{i}^{0},a,N) satisfies

limΔ0+𝔛Nf~Δ(x¯|θi0,a,N)dNμ=𝔛NlimΔ0+f~Δ(x¯|θi0,a,N)dNμ.\lim_{\Delta\to 0^{+}}\int_{\mathfrak{X}^{N}}\tilde{f}_{\Delta}(\bar{x}|\theta_{i}^{0},a,N)d\bigotimes^{N}\mu=\int_{\mathfrak{X}^{N}}\lim_{\Delta\to 0^{+}}\tilde{f}_{\Delta}(\bar{x}|\theta_{i}^{0},a,N)d\bigotimes^{N}\mu.
Lemma C.3.

For any G=i=1k0piδθiG=\sum_{i=1}^{k_{0}}p_{i}\delta_{\theta_{i}} and G=i=1k0piδθiG^{\prime}=\sum_{i=1}^{k_{0}}p^{\prime}_{i}\delta_{\theta^{\prime}_{i}},

V(PG,N,PG,N)\displaystyle V(P_{G,{N}},P_{G^{\prime},{N}})\leq minτ(2Nmax1ik0h(Pθi,Pθτ(i))+12i=1k0|pipτ(i)|),\displaystyle\min_{\tau}\left(\sqrt{2{N}}\max_{1\leq i\leq k_{0}}h\left(P_{\theta_{i}},P_{\theta^{\prime}_{\tau(i)}}\right)+\frac{1}{2}\sum_{i=1}^{k_{0}}\left|p_{i}-p_{\tau(i)}^{\prime}\right|\right),
V(PG,N,PG,N)\displaystyle V(P_{G,{N}},P_{G^{\prime},{N}})\leq minτ(Nmax1ik0V(Pθi,Pθτ(i))+12i=1k0|pipτ(i)|),\displaystyle\min_{\tau}\left(N\max_{1\leq i\leq k_{0}}V\left(P_{\theta_{i}},P_{\theta^{\prime}_{\tau(i)}}\right)+\frac{1}{2}\sum_{i=1}^{k_{0}}\left|p_{i}-p_{\tau(i)}^{\prime}\right|\right),

where the minimum is taken over all τ\tau in the permutation group Sk0S_{k_{0}}.

Proof.

The proof is similar to that of Lemma 8.2. ∎
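Lemma C.3's second bound can be sanity-checked numerically. The sketch below (added for illustration; parameters are arbitrary) uses Bernoulli kernels, for which V(Pθ,Pθ)=|θθ|V(P_{\theta},P_{\theta^{\prime}})=|\theta-\theta^{\prime}|, and enumerates the product mixtures exactly:

```python
from itertools import product, permutations

def mix_pmf(ps, thetas, N):
    """pmf of P_{G,N} for a mixture of Bernoulli product kernels."""
    return {x: sum(p * th**sum(x) * (1 - th)**(N - sum(x))
                   for p, th in zip(ps, thetas))
            for x in product([0, 1], repeat=N)}

def tv(a, b):
    return 0.5 * sum(abs(a[x] - b[x]) for x in a)

ps, th = [0.3, 0.7], [0.2, 0.6]      # hypothetical G
qs, th2 = [0.4, 0.6], [0.25, 0.55]   # hypothetical G'

for N in range(1, 5):
    lhs = tv(mix_pmf(ps, th, N), mix_pmf(qs, th2, N))
    # For Bernoulli kernels, V(P_theta, P_theta') = |theta - theta'|.
    rhs = min(N * max(abs(th[i] - th2[t[i]]) for i in range(2))
              + 0.5 * sum(abs(ps[i] - qs[t[i]]) for i in range(2))
              for t in permutations(range(2)))
    assert lhs <= rhs + 1e-12
```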

C.3 Identifiability of Bernoulli kernel in Example 5.11

In this section we prove n0(G,k(Θ))=2k1n_{0}(G,\cup_{\ell\leq k}\mathcal{E}_{\ell}(\Theta))=2k-1 for any Gk(Θ)G\in\mathcal{E}_{k}(\Theta) for the Bernoulli kernels in Example 5.11. (Note: The authors found a proof of this result in the technical report [13, Lemma 3.1 and Theorem 3.1] after the preparation of this manuscript. Since the technical report [13] is difficult to find online and the proof given below differs from the one there, we present it for completeness.)

For any Gk(Θ)G\in\mathcal{E}_{k}(\Theta), 2k12k-1 parameters are needed to determine it. For each nn, fn(x1,,xn)f_{n}(x_{1},\ldots,x_{n}) yields effectively nn equations over the distinct values of (x1,,xn)(x_{1},\ldots,x_{n}), since j=1nxj\sum_{j=1}^{n}x_{j} can take n+1n+1 values and fnf_{n} is a probability density. Thus for PG,nP_{G,n} to be strictly identifiable for Gk(Θ)G\in\mathcal{E}_{k}(\Theta), a necessary condition is n2k1n\geq 2k-1 for almost all GG under the Lebesgue measure. In fact, Lemma C.4 part e) establishes that for any n2k2n\leq 2k-2 and any Gk(Θ)G\in\mathcal{E}_{k}(\Theta) there exist infinitely many Gk(Θ)\{G}G^{\prime}\in\mathcal{E}_{k}(\Theta)\backslash\{G\} such that PG,n=PG,nP_{G^{\prime},n}=P_{G,n}, which implies n0(G,k(Θ))2k1n_{0}(G,\cup_{\ell\leq k}\mathcal{E}_{\ell}(\Theta))\geq 2k-1 for all Gk(Θ)G\in\mathcal{E}_{k}(\Theta).

Let us now verify that n0(G,k(Θ))=2k1n_{0}(G,\cup_{\ell\leq k}\mathcal{E}_{\ell}(\Theta))=2k-1 for any Gk(Θ)G\in\mathcal{E}_{k}(\Theta). In the following, set n=2k1n=2k-1. Fix any G=i=1kpiδθiG=\sum_{i=1}^{k}p_{i}\delta_{\theta_{i}} and consider G=i=1kpiδθi=1k(Θ)G^{\prime}=\sum_{i=1}^{k}p^{\prime}_{i}\delta_{\theta^{\prime}_{i}}\in\bigcup_{\ell=1}^{k}\mathcal{E}_{\ell}(\Theta) such that pG,n=pG,np_{G^{\prime},n}=p_{G,n}. Notice that G=1k(Θ)G^{\prime}\in\bigcup_{\ell=1}^{k}\mathcal{E}_{\ell}(\Theta) means that some of the pip^{\prime}_{i} may be zero. pG,n=pG,np_{G^{\prime},n}=p_{G,n} implies

i=1kpi(θi)j(1θi)nji=1kpi(θi)j(1θi)nj=0j=0,1,,n.\sum_{i=1}^{k}p^{\prime}_{i}(\theta^{\prime}_{i})^{j}(1-\theta^{\prime}_{i})^{n-j}-\sum_{i=1}^{k}p_{i}(\theta_{i})^{j}(1-\theta_{i})^{n-j}=0\quad\forall j=0,1,\cdots,n. (73)

Notice that the above system of equations does not include the constraint i=1kpi=1\sum_{i=1}^{k}p^{\prime}_{i}=1 since it is redundant: multiplying both sides of equation jj by (nj)\binom{n}{j} and summing over jj, we obtain i=1kpi=i=1kpi=1\sum_{i=1}^{k}p^{\prime}_{i}=\sum_{i=1}^{k}p_{i}=1. (In fact, in the above system of equations the equation with j=nj=n (or any fixed jj) can be replaced by i=1kpi=1.\sum_{i=1}^{k}{p^{\prime}_{i}}=1.)
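The redundancy claim is just the binomial theorem. A short numerical check with an arbitrary mixture (values chosen only for illustration):

```python
from math import comb

n = 5
ps, thetas = [0.2, 0.5, 0.3], [0.1, 0.4, 0.9]  # arbitrary illustrative mixture

# For each atom, sum_j C(n,j) theta^j (1-theta)^{n-j} = 1 (binomial theorem),
# so weighting equation j by C(n,j) and summing over j recovers sum_i p_i.
total = sum(comb(n, j) * sum(p * th**j * (1 - th)**(n - j)
                             for p, th in zip(ps, thetas))
            for j in range(n + 1))
assert abs(total - sum(ps)) < 1e-12
```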

We now show that the only solution is G=GG^{\prime}=G, beginning with the following simple observation. For a set {ξi}i=12k\{\xi_{i}\}_{i=1}^{2k} of 2k2k distinct elements in (0,1)(0,1), the system of linear equations in y=(y1,,yk)y=(y_{1},\ldots,y_{k^{\prime}}) with k2kk^{\prime}\leq 2k:

i=1kyi(ξi)j(1ξi)nj=0j=0,1,,n=2k1\sum_{i=1}^{k^{\prime}}y_{i}(\xi_{i})^{j}(1-\xi_{i})^{n-j}=0\quad\forall j=0,1,\ldots,n=2k-1

has only the zero solution since by setting y~i=(1ξi)nyi\tilde{y}_{i}=(1-\xi_{i})^{n}y_{i} the system of equations of y~\tilde{y}:

i=1ky~i(ξi1ξi)j=0j=0,1,,n\sum_{i=1}^{k^{\prime}}\tilde{y}_{i}\left(\frac{\xi_{i}}{1-\xi_{i}}\right)^{j}=0\quad\forall j=0,1,\ldots,n

has the coefficients of its first kk^{\prime} equations forming a non-singular Vandermonde matrix.
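Equivalently, the coefficient matrix with entries ξij(1ξi)nj\xi_{i}^{j}(1-\xi_{i})^{n-j} has full column rank. A quick numerical check with random points (illustration only, not part of the proof):

```python
import numpy as np

k = 3
n = 2 * k - 1
rng = np.random.default_rng(1)
xi = np.sort(rng.uniform(0.05, 0.95, size=2 * k))  # 2k distinct points in (0, 1)

# Coefficient matrix M[j, i] = xi_i^j (1 - xi_i)^{n-j}, j = 0, ..., n.
M = np.array([[x**j * (1 - x)**(n - j) for x in xi] for j in range(n + 1)])

# Full column rank: the linear system has only the zero solution.
assert np.linalg.matrix_rank(M) == 2 * k
```

Dividing column ii by (1ξi)n(1-\xi_{i})^{n} turns MM into a Vandermonde matrix in the ratios ξi/(1ξi)\xi_{i}/(1-\xi_{i}), which is exactly the reduction used in the text.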

If some θi\theta_{i} is not in {θi}i=1k\{\theta^{\prime}_{i}\}_{i=1}^{k}, then by the observation in the last paragraph pi=0p_{i}=0, which contradicts Gk(Θ)G\in\mathcal{E}_{k}(\Theta). As a result, {θi}i=1k={θi}i=1k\{\theta^{\prime}_{i}\}_{i=1}^{k}=\{\theta_{i}\}_{i=1}^{k}. Suppose θli=θi\theta^{\prime}_{l_{i}}=\theta_{i} for i[k]i\in[k]. Then the system of equations (73) becomes

i=1k(plipi)(θi)j(1θi)nj=0j=0,1,,n.\sum_{i=1}^{k}(p^{\prime}_{l_{i}}-p_{i})(\theta_{i})^{j}(1-\theta_{i})^{n-j}=0\quad\forall j=0,1,\ldots,n.

Applying the observation from the last paragraph again yields pli=pip^{\prime}_{l_{i}}=p_{i} for i[k]i\in[k]. That is, the only solution of (73) is G=GG^{\prime}=G. Thus n0(G,k(Θ))2k1n_{0}(G,\cup_{\ell\leq k}\mathcal{E}_{\ell}(\Theta))\leq 2k-1, which together with the fact that n0(G,k(Θ))2k1n_{0}(G,\cup_{\ell\leq k}\mathcal{E}_{\ell}(\Theta))\geq 2k-1 yields n0(G,k(Θ))=2k1n_{0}(G,\cup_{\ell\leq k}\mathcal{E}_{\ell}(\Theta))=2k-1 for any Gk(Θ)G\in\mathcal{E}_{k}(\Theta).

Part e) of the first lemma and part d) of the second lemma below are used in the preceding analysis of the example on the Bernoulli kernel.

Lemma C.4.
  1. a)

    Let η1,η2,,η2k\eta_{1},\eta_{2},\ldots,\eta_{2k} be 2k2k distinct real numbers. Let n2k2n\leq 2k-2. Then the system of n+1n+1 linear equations of (y1,y2,,y2k)(y_{1},y_{2},\ldots,y_{2k})

    i=12kyiηij=0j[n]{0}\sum_{i=1}^{2k}y_{i}\eta_{i}^{j}=0\quad\forall j\in[n]\cup\{0\} (74)

    has all the solutions given by

    yi=q=n+22kyqi=1n+1(ηqη)(ηiη)i[n+1]y_{i}=-\sum_{q=n+2}^{2k}y_{q}\prod_{\begin{subarray}{c}\ell\neq i\\ \ell=1\end{subarray}}^{n+1}\frac{(\eta_{q}-\eta_{\ell})}{(\eta_{i}-\eta_{\ell})}\quad\forall i\in[n+1] (75)

    for any yn+2,,y2ky_{n+2},\ldots,y_{2k}\in\mathbb{R}.

  2. b)

    For any 0<ηk+1<ηk+2<<η2k0<\eta_{k+1}<\eta_{k+2}<\ldots<\eta_{2k} and for any positive yk+1,yk+2,,y2ky_{k+1},y_{k+2},\ldots,y_{2k}, there exist infinitely many η1,η2,,ηk\eta_{1},\eta_{2},\dots,\eta_{k} satisfying

    ηk+i1<ηi<ηk+i,for 2ik, and 0<η1<ηk+1 and\displaystyle\eta_{k+i-1}<\eta_{i}<\eta_{k+i},\quad\text{for }2\leq i\leq k,\text{ and }0<\eta_{1}<\eta_{k+1}\text{ and }
    yi=y2ki=12k1(η2kη)(ηiη)k+1i2k1.\displaystyle y_{i}=-y_{2k}\prod_{\begin{subarray}{c}\ell\neq i\\ \ell=1\end{subarray}}^{2k-1}\frac{(\eta_{2k}-\eta_{\ell})}{(\eta_{i}-\eta_{\ell})}\quad\forall k+1\leq i\leq 2k-1.
  3. c)

    For any 0<ηk+1<ηk+2<<η2k0<\eta_{k+1}<\eta_{k+2}<\ldots<\eta_{2k} and for any positive yk+1,yk+2,,y2ky_{k+1},y_{k+2},\ldots,y_{2k}, the system of equations of (y1,,yk,η1,,ηk)(y_{1},\ldots,y_{k},\eta_{1},\ldots,\eta_{k})

    i=12kyiηij\displaystyle\sum_{i=1}^{2k}y_{i}\eta_{i}^{j} =0j[2k2]{0}\displaystyle=0\quad\forall j\in[2k-2]\cup\{0\}
    yi\displaystyle y_{i} <0i[k]\displaystyle<0\quad\forall i\in[k] (76)
    η1(0,ηk+1),ηi\displaystyle\eta_{1}\in(0,\eta_{k+1}),\ \eta_{i} (ηk+i1,ηk+i)2ik\displaystyle\in(\eta_{k+i-1},\eta_{k+i})\quad\forall 2\leq i\leq k (77)

    has infinitely many solutions.

  4. d)

    If PG,n=PG,nP_{G,n}=P_{G^{\prime},n} for some positive integer nn, then PG,m=PG,mP_{G,m}=P_{G^{\prime},m} for any integer 1mn1\leq m\leq n.

  5. e)

    Consider the kernel specified in Example 5.11. For any Gk(Θ)G\in\mathcal{E}_{k}(\Theta) and any n2k2n\leq 2k-2, there exist infinitely many Gk(Θ)G^{\prime}\in\mathcal{E}_{k}(\Theta) such that PG,n=PG,nP_{G,n}=P_{G^{\prime},n}. In particular, this shows n0(G,k(Θ))2k1n_{0}(G,\cup_{\ell\leq k}\mathcal{E}_{\ell}(\Theta))\geq 2k-1 for any Gk(Θ)G\in\mathcal{E}_{k}(\Theta).

Proof of Lemma C.4.

a) By the Lagrange interpolation formula over η1\eta_{1}, η2\eta_{2}, \ldots, ηn+1\eta_{n+1},

xj=i=1n+1ηiji=1n+1(xη)(ηiη),j[n]{0},x.x^{j}=\sum_{i=1}^{n+1}\eta_{i}^{j}\prod_{\begin{subarray}{c}\ell\neq i\\ \ell=1\end{subarray}}^{n+1}\frac{(x-\eta_{\ell})}{(\eta_{i}-\eta_{\ell})},\quad\forall j\in[n]\cup\{0\},\ \forall x\in\mathbb{R}.

In particular, for any n+2q2kn+2\leq q\leq 2k,

ηqj=i=1n+1ηiji=1n+1(ηqη)(ηiη),j[n]{0}.\eta_{q}^{j}=\sum_{i=1}^{n+1}\eta_{i}^{j}\prod_{\begin{subarray}{c}\ell\neq i\\ \ell=1\end{subarray}}^{n+1}\frac{(\eta_{q}-\eta_{\ell})}{(\eta_{i}-\eta_{\ell})},\quad\forall j\in[n]\cup\{0\}.

Plugging the above identity into (74), it is clear that the yiy_{i} specified in (75) are solutions of (74). Notice that the coefficient matrix of (74), A=(ηij)j[n]{0},i[2k](n+1)×(2k)A=(\eta_{i}^{j})_{j\in[n]\cup\{0\},i\in[2k]}\in\mathbb{R}^{(n+1)\times(2k)}, has rank n+1n+1 since the submatrix consisting of the first n+1n+1 columns forms a non-singular Vandermonde matrix. Thus the solutions of (74) form a subspace of 2k\mathbb{R}^{2k} of dimension 2k(n+1)2k-(n+1), which implies that (75) gives all the solutions.
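Formula (75) can be verified numerically: choosing the free coordinates yn+2,,y2ky_{n+2},\ldots,y_{2k} at random and filling in y1,,yn+1y_{1},\ldots,y_{n+1} by (75) produces a solution of (74). A sketch with illustrative values (added for the reader; 0-based indices in code):

```python
import numpy as np

k, n = 3, 2                    # any n <= 2k - 2
rng = np.random.default_rng(2)
eta = rng.permutation(np.linspace(0.1, 2.0, 2 * k))  # 2k distinct numbers
y = np.zeros(2 * k)
y[n + 1:] = rng.standard_normal(2 * k - (n + 1))     # free coordinates

# Fill y_1, ..., y_{n+1} according to (75).
for i in range(n + 1):
    s = 0.0
    for q in range(n + 1, 2 * k):
        prod = 1.0
        for l in range(n + 1):
            if l != i:
                prod *= (eta[q] - eta[l]) / (eta[i] - eta[l])
        s += y[q] * prod
    y[i] = -s

# The equations (74) hold for every j = 0, ..., n.
for j in range(n + 1):
    assert abs(np.sum(y * eta**j)) < 1e-9
```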

b) Let a>0a>0. Consider a polynomial g(x)g(x) such that g(0)=(1)k+1ag(0)=(-1)^{k+1}a, g(η2k)=1y2kg(\eta_{2k})=-\frac{1}{y_{2k}}, and for k+1i2k1k+1\leq i\leq 2k-1, g(ηi)=1yii=k+12k1(η2kη)(ηiη)g(\eta_{i})=\frac{1}{y_{i}}\prod\limits_{\begin{subarray}{c}\ell\neq i\\ \ell=k+1\end{subarray}}^{2k-1}\frac{(\eta_{2k}-\eta_{\ell})}{(\eta_{i}-\eta_{\ell})}. These k+1k+1 points uniquely determine a polynomial g(x)g(x) of degree at most kk. By our construction, g(x)g(x) satisfies

yig(ηi)=y2kg(η2k)i=k+12k1(η2kη)(ηiη),k+1i2k1y_{i}g(\eta_{i})=-y_{2k}g(\eta_{2k})\prod\limits_{\begin{subarray}{c}\ell\neq i\\ \ell=k+1\end{subarray}}^{2k-1}\frac{(\eta_{2k}-\eta_{\ell})}{(\eta_{i}-\eta_{\ell})},\quad\forall\ k+1\leq i\leq 2k-1 (78)

Moreover, notice that g(ηi)>0g(\eta_{i})>0 for odd ii between k+1k+1 and 2k2k, and g(ηi)<0g(\eta_{i})<0 for even ii between k+1k+1 and 2k2k. Then there must exist η1(0,ηk+1)\eta_{1}\in(0,\eta_{k+1}) and ηi(ηk+i1,ηk+i)\eta_{i}\in(\eta_{k+i-1},\eta_{k+i}) for 2ik2\leq i\leq k such that g(ηi)=0g(\eta_{i})=0. Then g(x)=bi=1k(xηi)g(x)=b\prod_{i=1}^{k}(x-\eta_{i}), where b<0,η1,η2,,ηkb<0,\eta_{1},\eta_{2},\ldots,\eta_{k} are constants that depend on a,ηk+1,,η2k,yk+1,,y2ka,\eta_{k+1},\ldots,\eta_{2k},y_{k+1},\ldots,y_{2k}. Plugging g(x)=bi=1k(xηi)g(x)=b\prod_{i=1}^{k}(x-\eta_{i}) into (78) shows that (η1,η2,,ηk)(\eta_{1},\eta_{2},\ldots,\eta_{k}) is a solution of the system of equations in the statement. By varying the value of aa, we obtain infinitely many solutions.

c)

First, we apply part a) with n=2k2n=2k-2: for any 2k2k distinct real numbers η1,,η2k\eta_{1},\ldots,\eta_{2k}, the system of linear equations of (x1,,x2k)(x_{1},\ldots,x_{2k})

i=12kxiηij=0j[2k2]{0}\sum_{i=1}^{2k}x_{i}\eta_{i}^{j}=0\quad\forall j\in[2k-2]\cup\{0\}

has a solution

xi=y2ki=12k1(η2kη)(ηiη)i[2k1],x_{i}=-y_{2k}\prod_{\begin{subarray}{c}\ell\neq i\\ \ell=1\end{subarray}}^{2k-1}\frac{(\eta_{2k}-\eta_{\ell})}{(\eta_{i}-\eta_{\ell})}\quad\forall i\in[2k-1],

where we have specified x2k=y2kx_{2k}=y_{2k}.

Next, for the ηk+1,,η2k\eta_{k+1},\ldots,\eta_{2k} given in the lemma’s statement, by part b) we can choose η1,,ηk\eta_{1},\ldots,\eta_{k} that satisfy the requirements there. Accordingly, xi=yix_{i}=y_{i} for k+1i2kk+1\leq i\leq 2k. Moreover, it follows from the ordering of {ηi}i=12k\{\eta_{i}\}_{i=1}^{2k} that xi<0x_{i}<0 for any i[k]i\in[k]. Thus (x1,,xk,η1,,ηk)(x_{1},\ldots,x_{k},\eta_{1},\ldots,\eta_{k}) is a solution of the system of equations in the statement. There are infinitely many solutions since, by part b), there are infinitely many choices of (η1,,ηk)(\eta_{1},\ldots,\eta_{k}).

d) $P_{G,n-1}=P_{G^{\prime},n-1}$ follows immediately from the fact that, for any $A\in\mathcal{A}^{n-1}$, the product sigma-algebra on $\mathfrak{X}^{n-1}$,

PG,n1(A)=PG,n(A×𝔛)=PG,n(A×𝔛)=PG,n1(A).P_{G,n-1}(A)=P_{G,n}(A\times\mathfrak{X})=P_{G^{\prime},n}(A\times\mathfrak{X})=P_{G^{\prime},n-1}(A).

Repeating this procedure inductively yields the conclusion.

By part d) it suffices to prove the case $n=2k-2$. Write $G=\sum_{i=1}^{k}p_{i}\delta_{\theta_{i}}$ with $\theta_{1}<\theta_{2}<\ldots<\theta_{k}$. Consider any $G^{\prime}=\sum_{i=1}^{k}p^{\prime}_{i}\delta_{\theta^{\prime}_{i}}\in\mathcal{E}_{k}(\Theta)$ with $\theta^{\prime}_{1}<\theta^{\prime}_{2}<\ldots<\theta^{\prime}_{k}$ such that $P_{G,n}=P_{G^{\prime},n}$. The condition $P_{G,n}=P_{G^{\prime},n}$ for $n=2k-2$ reads

i=1kpi(θi)j(1θi)2k2j\displaystyle\sum_{i=1}^{k}p^{\prime}_{i}(\theta^{\prime}_{i})^{j}(1-\theta^{\prime}_{i})^{2k-2-j} =i=1kpi(θi)j(1θi)2k2jj=0,1,,2k2.\displaystyle=\sum_{i=1}^{k}p_{i}(\theta_{i})^{j}(1-\theta_{i})^{2k-2-j}\quad\forall j=0,1,\cdots,2k-2. (79)
0<θ1<<θk<1,\displaystyle 0<\theta^{\prime}_{1}<\ldots<\theta^{\prime}_{k}<1, pi>0,i[k]\displaystyle\ p^{\prime}_{i}>0,\ \forall i\in[k] (80)

Note that the system of equations (79) automatically implies $\sum_{i=1}^{k}p^{\prime}_{i}=\sum_{i=1}^{k}p_{i}=1$. Let $y_{i}=-p^{\prime}_{i}(1-\theta^{\prime}_{i})^{2k-2}$, $\eta_{i}=\theta^{\prime}_{i}/(1-\theta^{\prime}_{i})$ for $i\in[k]$, and let $y_{k+i}=p_{i}(1-\theta_{i})^{2k-2}$, $\eta_{k+i}=\theta_{i}/(1-\theta_{i})$ for $i\in[k]$. Then $\eta_{k+1}<\eta_{k+2}<\ldots<\eta_{2k}$ and $y_{i}>0$ for $k+1\leq i\leq 2k$. Thus $(p^{\prime}_{1},\ldots,p^{\prime}_{k},\theta^{\prime}_{1},\ldots,\theta^{\prime}_{k})$ is a solution of (79), (80) if and only if the corresponding $(y_{1},\ldots,y_{k},\eta_{1},\ldots,\eta_{k})$ is a solution of

i=12kyiηij\displaystyle\sum_{i=1}^{2k}y_{i}\eta_{i}^{j} =0,j[2k2]{0}.\displaystyle=0,\quad\forall j\in[2k-2]\cup\{0\}.
0<η1<<ηk,\displaystyle 0<\eta_{1}<\ldots<\eta_{k}, yi<0,i[k].\displaystyle\ y_{i}<0,\ \forall i\in[k].

By part c), the system of equations in the last display has infinitely many solutions additionally satisfying (77). For each such solution, the corresponding $(p^{\prime}_{1},\ldots,p^{\prime}_{k},\theta^{\prime}_{1},\ldots,\theta^{\prime}_{k})$ is a solution of the system of equations (79), (80) additionally satisfying $0<\theta^{\prime}_{1}<\theta_{1}$ and $\theta_{i-1}<\theta^{\prime}_{i}<\theta_{i}$ for $2\leq i\leq k$. By the comments after (79) and (80) we also have $\sum_{i=1}^{k}p^{\prime}_{i}=\sum_{i=1}^{k}p_{i}=1$. Thus each such $(p^{\prime}_{1},\ldots,p^{\prime}_{k},\theta^{\prime}_{1},\ldots,\theta^{\prime}_{k})$ gives $G^{\prime}\in\mathcal{E}_{k}(\Theta)$ such that $P_{G^{\prime},2k-2}=P_{G,2k-2}$. The existence of infinitely many such $G^{\prime}$ follows from the existence of infinitely many solutions $(y_{1},\ldots,y_{k},\eta_{1},\ldots,\eta_{k})$ by part c). ∎
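To make the non-identifiability concrete, here is a hand-picked pair for $k=2$ (so $n=2k-2=2$): the two distinct mixing measures below, chosen for this sketch rather than produced by the lemma's construction, induce the same $2$-fold Bernoulli product mixture, since $P_{G,2}$ is determined by the quantities $\sum_{i}p_{i}\theta_{i}^{j}(1-\theta_{i})^{2-j}$, $j=0,1,2$, appearing in (79).

```python
from fractions import Fraction as F

def bernoulli_product_probs(ps, thetas, n):
    """P_{G,n} of any fixed binary string with j ones, j = 0..n:
    sum_i p_i * theta_i^j * (1 - theta_i)^(n - j), as in (79)."""
    return [sum(p * t**j * (1 - t)**(n - j) for p, t in zip(ps, thetas))
            for j in range(n + 1)]

# G with k = 2 atoms, and a distinct G' matching it at n = 2k-2 = 2
G  = ([F(1, 2), F(1, 2)], [F(1, 5), F(4, 5)])
Gp = ([F(9, 13), F(4, 13)], [F(3, 10), F(19, 20)])

print(bernoulli_product_probs(*G, 2) == bernoulli_product_probs(*Gp, 2))  # True
```

With $n=2k-1=3$ coordinates the two lists of probabilities differ, consistent with the classical identifiability threshold for finite Bernoulli mixtures.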

C.4 Proof of Lemma 5.12

Proof of Lemma 5.12.

a) It’s obvious that q(1)(x,y),q(2)(x,y)q^{(1)}(x,y),q^{(2)}(x,y) are multivariate polynomials and that

q(1)(y,y)=\displaystyle q^{(1)}(y,y)= limxyq(1)(x,y)=f(y),\displaystyle\lim_{x\to y}q^{(1)}(x,y)=f^{\prime}(y),
q(2)(y,y)=\displaystyle q^{(2)}(y,y)= limxyq(2)(x,y)=f′′(y).\displaystyle\lim_{x\to y}q^{(2)}(x,y)=f^{\prime\prime}(y).

This means that $q^{(1)}(x,y)-f^{\prime}(y)$ has $x-y$ as a factor, and thus $\bar{q}^{(2)}(x,y)$ is a multivariate polynomial and

q¯(2)(y,y)=limxyq(1)(x,y)f(y)xy=limxyf(x)f(y)f(y)(xy)(xy)2=12q(2)(y,y).\bar{q}^{(2)}(y,y)=\lim_{x\to y}\frac{q^{(1)}(x,y)-f^{\prime}(y)}{x-y}=\lim_{x\to y}\frac{f(x)-f(y)-f^{\prime}(y)(x-y)}{(x-y)^{2}}=\frac{1}{2}q^{(2)}(y,y).

Then $\bar{q}^{(2)}(x,y)-\frac{1}{2}q^{(2)}(x,y)$ has $x-y$ as a factor, and thus $\bar{q}^{(3)}(x,y)$ is a multivariate polynomial.

b) Write $A^{(k)}$ for $A^{(k)}(x_{1},\ldots,x_{k})$ in this proof. Denote by $\underline{A}\in\mathbb{R}^{(2k-2)\times(2k)}$ the bottom $(2k-2)\times 2k$ submatrix of $A^{(k)}$. Let $q_{j}^{(1)}(x,y)$, $q_{j}^{(2)}(x,y)$, $\bar{q}_{j}^{(2)}(x,y)$ and $\bar{q}_{j}^{(3)}(x,y)$ be defined as in part a) with $f$ replaced by $f_{j}$. Then by subtracting the third row from the first row and the fourth row from the second row, and then factoring the common factor $(x_{1}-x_{2})$ out of the resulting first two rows,

det(A(k))=\displaystyle\text{det}(A^{(k)})= (x1x2)2det(q1(1)(x1,x2),,q2k(1)(x1,x2)q1(2)(x1,x2),,q2k(2)(x1,x2)A¯)\displaystyle(x_{1}-x_{2})^{2}\text{det}\begin{pmatrix}&q_{1}^{(1)}(x_{1},x_{2}),&\ldots,&q_{2k}^{(1)}(x_{1},x_{2})\\ &q_{1}^{(2)}(x_{1},x_{2}),&\ldots,&q_{2k}^{(2)}(x_{1},x_{2})\\ &\ &\underline{A}&\\ \end{pmatrix}
=\displaystyle= (x1x2)3det(q¯1(2)(x1,x2),,q¯2k(2)(x1,x2)q1(2)(x1,x2),,q2k(2)(x1,x2)A¯)\displaystyle(x_{1}-x_{2})^{3}\text{det}\begin{pmatrix}&\bar{q}_{1}^{(2)}(x_{1},x_{2}),&\ldots,&\bar{q}_{2k}^{(2)}(x_{1},x_{2})\\ &q_{1}^{(2)}(x_{1},x_{2}),&\ldots,&q_{2k}^{(2)}(x_{1},x_{2})\\ &\ &\underline{A}&\\ \end{pmatrix}
=\displaystyle= (x1x2)4det(q¯1(3)(x1,x2),,q¯2k(3)(x1,x2)q1(2)(x1,x2),,q2k(2)(x1,x2)A¯)\displaystyle(x_{1}-x_{2})^{4}\text{det}\begin{pmatrix}&\bar{q}_{1}^{(3)}(x_{1},x_{2}),&\ldots,&\bar{q}_{2k}^{(3)}(x_{1},x_{2})\\ &q_{1}^{(2)}(x_{1},x_{2}),&\ldots,&q_{2k}^{(2)}(x_{1},x_{2})\\ &\ &\underline{A}&\\ \end{pmatrix}

where the second equality follows by subtracting the fourth row from the first row and then factoring the common factor $(x_{1}-x_{2})$ out of the resulting row. The last step of the preceding display follows by subtracting $1/2$ times the second row from the first row and then extracting the common factor $(x_{1}-x_{2})$ out of the resulting row. Thus $(x_{1}-x_{2})^{4}$ is a factor of $\text{det}(A^{(k)})$, which is a multivariate polynomial in $x_{1},\ldots,x_{k}$. By symmetry, $\prod_{1\leq\alpha<\beta\leq k}(x_{\alpha}-x_{\beta})^{4}$ is a factor of $\text{det}(A^{(k)})$.

c) We prove the statement by induction. It is easy to verify the statement when $k=1$. Suppose the statement holds for $k$. By b),

det(A(k+1)(x1,,xk+1))=gk+1(x1,,xk+1)1α<βk+1(xαxβ)4\text{det}(A^{(k+1)}(x_{1},\dots,x_{k+1}))=g_{k+1}(x_{1},\ldots,x_{k+1})\prod_{1\leq\alpha<\beta\leq k+1}(x_{\alpha}-x_{\beta})^{4}

for some multivariate polynomial $g_{k+1}(x_{1},\ldots,x_{k+1})$. By the Leibniz formula for the determinant, in $\text{det}(A^{(k+1)}(x_{1},\dots,x_{k},x_{k+1}))$ the term of highest degree in $x_{\alpha}$ is $f_{2(k+1)}(x_{\alpha})f^{\prime}_{2k+1}(x_{\alpha})$ or $f^{\prime}_{2(k+1)}(x_{\alpha})f_{2k+1}(x_{\alpha})$, both of which have degree $4k$ since $f_{j}(x)$ has degree $j-1$ and $f^{\prime}_{j}(x)$ has degree $j-2$. Moreover, in $\prod_{1\leq\alpha<\beta\leq k+1}(x_{\alpha}-x_{\beta})^{4}$ the degree of $x_{\alpha}$ is $4k$ and the corresponding term is $x_{\alpha}^{4k}$, which implies that in $g_{k+1}(x_{1},\ldots,x_{k+1})$ the degree of $x_{\alpha}$ is at most $0$ for any $\alpha\in[k+1]$. As a result, $g_{k+1}(x_{1},\ldots,x_{k+1})=q_{k+1}$ is a constant. Thus

det(A(k+1)(x1,,xk,0))=\displaystyle\text{det}(A^{(k+1)}(x_{1},\dots,x_{k},0))= qk+1(1α<βk(xαxβ)4)α=1kxα4,\displaystyle q_{k+1}\left(\prod_{1\leq\alpha<\beta\leq k}(x_{\alpha}-x_{\beta})^{4}\right)\prod_{\alpha=1}^{k}x_{\alpha}^{4}, (81)

On the other hand,

det(A(k+1)(x1,,xk,0))=\displaystyle\text{det}(A^{(k+1)}(x_{1},\dots,x_{k},0))= det(f1(x1|k+1),f2(x1|k+1),,f2(k+1)(x1|k+1)f1(x1|k+1),f2(x1|k+1),,f2(k+1)(x1|k+1)f1(xk|k+1),f2(xk|k+1),,f2(k+1)(xk|k+1)f1(xk|k+1),f2(xk|k+1),,f2(k+1)(xk|k+1)1,0,,00,1,0,,0)\displaystyle\text{det}\begin{pmatrix}f_{1}(x_{1}|k+1),&f_{2}(x_{1}|k+1),&\ldots,&f_{2(k+1)}(x_{1}|k+1)\\ f^{\prime}_{1}(x_{1}|k+1),&f^{\prime}_{2}(x_{1}|k+1),&\ldots,&f^{\prime}_{2(k+1)}(x_{1}|k+1)\\ \vdots&\vdots&\vdots\\ f_{1}(x_{k}|k+1),&f_{2}(x_{k}|k+1),&\ldots,&f_{2(k+1)}(x_{k}|k+1)\\ f^{\prime}_{1}(x_{k}|k+1),&f^{\prime}_{2}(x_{k}|k+1),&\ldots,&f^{\prime}_{2(k+1)}(x_{k}|k+1)\\ 1,&0,&\ldots,&0\\ 0,&1,&0,\ldots,&0\end{pmatrix}
$=\text{det}\begin{pmatrix}f_{3}(x_{1}|k+1)&f_{4}(x_{1}|k+1)&\ldots&f_{2(k+1)}(x_{1}|k+1)\\ f^{\prime}_{3}(x_{1}|k+1)&f^{\prime}_{4}(x_{1}|k+1)&\ldots&f^{\prime}_{2(k+1)}(x_{1}|k+1)\\ \vdots&\vdots&&\vdots\\ f_{3}(x_{k}|k+1)&f_{4}(x_{k}|k+1)&\ldots&f_{2(k+1)}(x_{k}|k+1)\\ f^{\prime}_{3}(x_{k}|k+1)&f^{\prime}_{4}(x_{k}|k+1)&\ldots&f^{\prime}_{2(k+1)}(x_{k}|k+1)\end{pmatrix}$ (82)

where the second equality follows by Laplace expansion along the last two rows. Observing that $f_{j}(x)=x^{2}f_{j-2}(x)$ and $f^{\prime}_{j}(x)=x^{2}f^{\prime}_{j-2}(x)+2xf_{j-2}(x)$, plugging these two identities into (82) and simplifying the resulting determinant, one obtains

det(A(k+1)(x1,,xk,0))=det(A(k)(x1,,xk))α=1kxα4.\text{det}(A^{(k+1)}(x_{1},\dots,x_{k},0))=\text{det}(A^{(k)}(x_{1},\dots,x_{k}))\prod_{\alpha=1}^{k}x_{\alpha}^{4}. (83)

Comparing (83) to (81), together with the induction hypothesis that the statement holds for $k$,

qk+1=1.q_{k+1}=1.

That is, we proved the statement for k+1k+1.

d) We prove $\text{det}(A^{(k)}(x_{1},\dots,x_{k}))=\prod_{1\leq\alpha<\beta\leq k}(x_{\alpha}-x_{\beta})^{4}$ by induction. Write $f_{j}(x|k)$ for $f_{j}(x)$ in the following induction to emphasize its dependence on $k$. It is easy to verify that the statement holds when $k=1$. Suppose the statement holds for $k$. By b), $\text{det}(A^{(k+1)}(x_{1},\dots,x_{k+1}))=g_{k+1}(x_{1},\ldots,x_{k+1})\prod_{1\leq\alpha<\beta\leq k+1}(x_{\alpha}-x_{\beta})^{4}$ for some multivariate polynomial $g_{k+1}$. Since $f_{j}(x|k+1)$ has degree $2(k+1)-1$ and $f^{\prime}_{j}(x|k+1)$ has degree $2k$, by the Leibniz formula for the determinant, $\text{det}(A^{(k+1)}(x_{1},\dots,x_{k},x_{k+1}))$ has degree at most $2k+(2k+1)=4k+1$ in $x_{\alpha}$ for any $\alpha\in[k+1]$. Moreover, in $\prod_{1\leq\alpha<\beta\leq k+1}(x_{\alpha}-x_{\beta})^{4}$ the degree of $x_{\alpha}$ is $4k$, which implies that in $g_{k+1}(x_{1},\ldots,x_{k+1})$ the degree of $x_{\alpha}$ is at most $1$. As a result, we may write $g_{k+1}(x_{1},\ldots,x_{k+1})=h_{1}(x_{1},\dots,x_{k})x_{k+1}+h_{2}(x_{1},\ldots,x_{k})$, where $h_{1},h_{2}$ are multivariate polynomials in $x_{1},\ldots,x_{k}$. Thus

det(A(k+1)(x1,,xk,0))=\displaystyle\text{det}(A^{(k+1)}(x_{1},\dots,x_{k},0))= h2(x1,,xk)(1α<βk(xαxβ)4)α=1kxα4,\displaystyle h_{2}(x_{1},\ldots,x_{k})\left(\prod_{1\leq\alpha<\beta\leq k}(x_{\alpha}-x_{\beta})^{4}\right)\prod_{\alpha=1}^{k}x_{\alpha}^{4}, (84)

and

det(A(k+1)(x1,,xk,1))\displaystyle\text{det}(A^{(k+1)}(x_{1},\dots,x_{k},1))
=\displaystyle= (h1(x1,,xk)+h2(x1,,xk))(1α<βk(xαxβ)4)α=1k(xα1)4.\displaystyle(h_{1}(x_{1},\ldots,x_{k})+h_{2}(x_{1},\ldots,x_{k}))\left(\prod_{1\leq\alpha<\beta\leq k}(x_{\alpha}-x_{\beta})^{4}\right)\prod_{\alpha=1}^{k}(x_{\alpha}-1)^{4}. (85)

On the other hand,

det(A(k+1)(x1,,xk,0))\displaystyle\text{det}(A^{(k+1)}(x_{1},\dots,x_{k},0))
=\displaystyle= det(f1(x1|k+1),f2(x1|k+1),,f2(k+1)(x1|k+1)f1(x1|k+1),f2(x1|k+1),,f2(k+1)(x1|k+1)f1(xk|k+1),f2(xk|k+1),,f2(k+1)(xk|k+1)f1(xk|k+1),f2(xk|k+1),,f2(k+1)(xk|k+1)1,0,,0(2(k+1)1),1,0,,0)\displaystyle\text{det}\begin{pmatrix}f_{1}(x_{1}|k+1),&f_{2}(x_{1}|k+1),&\ldots,&f_{2(k+1)}(x_{1}|k+1)\\ f^{\prime}_{1}(x_{1}|k+1),&f^{\prime}_{2}(x_{1}|k+1),&\ldots,&f^{\prime}_{2(k+1)}(x_{1}|k+1)\\ \vdots&\vdots&\vdots\\ f_{1}(x_{k}|k+1),&f_{2}(x_{k}|k+1),&\ldots,&f_{2(k+1)}(x_{k}|k+1)\\ f^{\prime}_{1}(x_{k}|k+1),&f^{\prime}_{2}(x_{k}|k+1),&\ldots,&f^{\prime}_{2(k+1)}(x_{k}|k+1)\\ 1,&0,&\ldots,&0\\ -(2(k+1)-1),&1,&0,\ldots,&0\end{pmatrix}
$=\text{det}\begin{pmatrix}f_{3}(x_{1}|k+1)&f_{4}(x_{1}|k+1)&\ldots&f_{2(k+1)}(x_{1}|k+1)\\ f^{\prime}_{3}(x_{1}|k+1)&f^{\prime}_{4}(x_{1}|k+1)&\ldots&f^{\prime}_{2(k+1)}(x_{1}|k+1)\\ \vdots&\vdots&&\vdots\\ f_{3}(x_{k}|k+1)&f_{4}(x_{k}|k+1)&\ldots&f_{2(k+1)}(x_{k}|k+1)\\ f^{\prime}_{3}(x_{k}|k+1)&f^{\prime}_{4}(x_{k}|k+1)&\ldots&f^{\prime}_{2(k+1)}(x_{k}|k+1)\end{pmatrix}$ (86)

where the second equality follows by Laplace expansion along the last two rows. Observing that $f_{j}(x|k+1)=x^{2}f_{j-2}(x|k)$ and $f^{\prime}_{j}(x|k+1)=x^{2}f^{\prime}_{j-2}(x|k)+2xf_{j-2}(x|k)$, plugging these two identities into (86) and simplifying the resulting determinant, one obtains

det(A(k+1)(x1,,xk,0))=det(A(k)(x1,,xk))α=1kxα4.\text{det}(A^{(k+1)}(x_{1},\dots,x_{k},0))=\text{det}(A^{(k)}(x_{1},\dots,x_{k}))\prod_{\alpha=1}^{k}x_{\alpha}^{4}. (87)

An analogous argument produces

det(A(k+1)(x1,,xk,1))=det(A(k)(x1,,xk))α=1k(1xα)4.\text{det}(A^{(k+1)}(x_{1},\dots,x_{k},1))=\text{det}(A^{(k)}(x_{1},\dots,x_{k}))\prod_{\alpha=1}^{k}(1-x_{\alpha})^{4}. (88)

Comparing (87) to (84), together with the induction hypothesis that the statement holds for $k$,

h2(x1,,xk)=1,x1,,xk.h_{2}(x_{1},\ldots,x_{k})=1,\quad\forall x_{1},\ldots,x_{k}.

Comparing (88) to (85), together with the induction hypothesis that the statement holds for $k$ and the preceding display,

h1(x1,,xk)=0,x1,,xk.h_{1}(x_{1},\ldots,x_{k})=0,\quad\forall x_{1},\ldots,x_{k}.

That is, gk+1(x1,,xk+1)=1g_{k+1}(x_{1},\ldots,x_{k+1})=1 for any x1,,xk+1x_{1},\ldots,x_{k+1}. ∎
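The closed form $\text{det}(A^{(k)})=\prod_{1\leq\alpha<\beta\leq k}(x_{\alpha}-x_{\beta})^{4}$ is a confluent-Vandermonde-type identity, and it can be sanity-checked in exact arithmetic. The sketch below uses the monomial basis $f_{j}(x)=x^{j-1}$ for illustration (an assumption for this check; it satisfies the degree and recursion properties $f_{j}=x^{2}f_{j-2}$ used above), with one value row and one derivative row per node.

```python
from fractions import Fraction as F

def det(M):
    """Exact determinant by cofactor expansion along the first row
    (fine for the small matrices used here)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j]
               * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def A_matrix(xs):
    """2k x 2k matrix with rows f_j(x_i) and f_j'(x_i), f_j(x) = x^(j-1)."""
    n = 2 * len(xs)
    rows = []
    for x in xs:
        rows.append([x ** j for j in range(n)])                               # f_j(x)
        rows.append([j * x ** (j - 1) if j > 0 else F(0) for j in range(n)])  # f_j'(x)
    return rows

xs = [F(1), F(2), F(-3)]  # k = 3 distinct nodes
lhs = det(A_matrix(xs))
rhs = F(1)
for a in range(len(xs)):
    for b in range(a + 1, len(xs)):
        rhs *= (xs[a] - xs[b]) ** 4
print(lhs == rhs)  # True
```

Since all the exponents are even, the sign of the node ordering is immaterial.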

C.5 Calculation details in Example 5.21

As in Example 5.21, take $Tx=(x,x^{2})^{\top}$. Then one may check $\lambda(\xi,\sigma)=(\xi+\sigma,\sigma^{2}+(\sigma+\xi)^{2})^{\top}$, so condition (A1) is satisfied. The characteristic function is

$\phi_{T}(\zeta_{1},\zeta_{2}|\xi,\sigma)=\int_{\mathbb{R}}e^{\bm{i}(\zeta_{1}x+\zeta_{2}x^{2})}f(x|\xi,\sigma)dx=\frac{1}{\sigma}e^{\frac{\xi}{\sigma}}\int_{\xi}^{\infty}e^{\bm{i}(\zeta_{1}x+\zeta_{2}x^{2})}e^{-\frac{x}{\sigma}}dx.$

The verification of (A2) and (32) is a consequence of the Leibniz rule for differentiating under the integral sign and the dominated convergence theorem, and is omitted. To verify (33), notice that $|x|f(x|\xi,\sigma)$ is increasing on $(-\infty,-|\xi|)$ and decreasing on $(\sigma\vee|\xi|,\infty)$. That is, the conditions of Lemma 7.3 are satisfied with $\alpha_{1}=1$, $b_{1}=|\xi|$ and $c_{1}=|\xi|\vee\sigma$. Moreover, it is clear that $\|f(x|\xi,\sigma)\|_{L^{\infty}(\mathbb{R})}\leq 1/\sigma$. Then by Lemma 7.4, for any $r>4$,

g(ζ|ξ,σ)Lr(2)\displaystyle\|g(\zeta|\xi,\sigma)\|_{L^{r}(\mathbb{R}^{2})}
\displaystyle\leq C2(|ξ|σ+2)(|x|f(x|ξ,σ)L1()+2σ+(|x|+1)f(x|ξ,σ)xL1()+1)\displaystyle C_{2}(|\xi|\vee\sigma+2)\biggr{(}\||x|f(x|\xi,\sigma)\|_{L^{1}(\mathbb{R})}+\frac{2}{\sigma}+\left\|(|x|+1)\frac{\partial f(x|\xi,\sigma)}{\partial x}\right\|_{L^{1}(\mathbb{R})}+1\biggr{)}
:=\displaystyle:= h(ξ,σ).\displaystyle h(\xi,\sigma).

It can be verified easily by the dominated convergence theorem that $h(\xi,\sigma)$ is a continuous function of $\theta=(\xi,\sigma)$ on $\Theta$. Thus (33) in (A3) is verified. We have then verified that $T$ is admissible with respect to $\Theta$.

One can easily check that $\lambda:\Theta\to\mathbb{R}^{2}$ is injective on $\Theta$. Moreover, by a simple calculation the Jacobian determinant of $\lambda(\theta)$ is $\text{det}(J_{\lambda})=2\sigma>0$, which implies $J_{\lambda}$ is of full rank on $\Theta$. Then by Corollary 5.17, (21) and (23) hold for any $G_{0}\in\mathcal{E}_{k_{0}}(\Theta)$ for any $k_{0}\geq 1$.
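As a quick cross-check of the Jacobian calculation (a numerical sketch with arbitrary test values), the determinant of the Jacobian of $\lambda(\xi,\sigma)=(\xi+\sigma,\sigma^{2}+(\sigma+\xi)^{2})^{\top}$ can be compared with $2\sigma$ via central differences; since both components of $\lambda$ are quadratic, the differences are exact up to roundoff.

```python
def lam(xi, s):
    """lambda(xi, sigma) = (xi + sigma, sigma^2 + (sigma + xi)^2)."""
    return (xi + s, s * s + (s + xi) ** 2)

def jac_det(xi, s, h=1e-6):
    """Central-difference approximation of det(J_lambda) at (xi, sigma)."""
    d1_xi = (lam(xi + h, s)[0] - lam(xi - h, s)[0]) / (2 * h)
    d2_xi = (lam(xi + h, s)[1] - lam(xi - h, s)[1]) / (2 * h)
    d1_s = (lam(xi, s + h)[0] - lam(xi, s - h)[0]) / (2 * h)
    d2_s = (lam(xi, s + h)[1] - lam(xi, s - h)[1]) / (2 * h)
    return d1_xi * d2_s - d1_s * d2_xi

xi, s = 0.4, 1.3
print(abs(jac_det(xi, s) - 2 * s) < 1e-6)  # True
```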

Appendix D Proofs of inverse bounds for mixtures of product distributions

For an overview of our proof techniques, please refer to Section 2. The proofs of both Theorem 5.8 and Theorem 5.16 follow the same structure. The reader should read the former first before attempting the latter, which is considerably more technical and lengthy.

D.1 Proof of Theorem 5.8

Proof of Theorem 5.8.

  
Step 1 (Proof by contradiction with subsequences)
Suppose (23) is not true. Then there exists a subsequence $\{N_{\ell}\}_{\ell=1}^{\infty}$ of natural numbers tending to infinity such that

limr0infG,HBW1(G0,r)GHV(PG,N,PH,N)DN(G,H)0 as N.\lim_{r\to 0}\ \ \inf_{\begin{subarray}{c}G,H\in B_{W_{1}}(G_{0},r)\\ G\not=H\end{subarray}}\frac{V(P_{G,{N}_{\ell}},P_{H,{N}_{\ell}})}{D_{{N}_{\ell}}(G,H)}\to 0\quad\text{ as }{N}_{\ell}\to\infty.

Then there exist $\{G_{\ell}\}_{\ell=1}^{\infty},\{H_{\ell}\}_{\ell=1}^{\infty}\subset\mathcal{E}_{k_{0}}(\Theta)$ such that

{GHDN(G,G0)0,DN(H,G0)0 as V(PG,N,PH,N)DN(G,H)0 as .\begin{cases}G_{\ell}\not=H_{\ell}&\forall\ell\\ D_{{N}_{\ell}}(G_{\ell},G_{0})\to 0,D_{{N}_{\ell}}(H_{\ell},G_{0})\to 0&\text{ as }\ell\to\infty\\ \frac{V(P_{G_{\ell},{N}_{\ell}},P_{H_{\ell},{N}_{\ell}})}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\to 0&\text{ as }{\ell}\to\infty.\end{cases} (89)

To see this, note that for each fixed $\ell$, and thus fixed $N_{\ell}$, $D_{N_{\ell}}(G,G_{0})\to 0$ if and only if $W_{1}(G,G_{0})\to 0$. Thus, there exist $G_{\ell},H_{\ell}\in\mathcal{E}_{k_{0}}(\Theta)$ such that $G_{\ell}\not=H_{\ell}$, $D_{N_{\ell}}(G_{\ell},G_{0})\leq\frac{1}{\ell}$, $D_{N_{\ell}}(H_{\ell},G_{0})\leq\frac{1}{\ell}$ and

V(PG,N,PH,N)DN(G,H)limr0infG,HBW1(G0,r)GHV(PG,N,PH,N)DN(G,H)+1,\frac{V(P_{G_{\ell},{N}_{\ell}},P_{H_{\ell},{N}_{\ell}})}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\leq\lim_{r\to 0}\ \ \inf_{\begin{subarray}{c}G,H\in B_{W_{1}}(G_{0},r)\\ G\not=H\end{subarray}}\frac{V(P_{G,{N}_{\ell}},P_{H,{N}_{\ell}})}{D_{{N}_{\ell}}(G,H)}+\frac{1}{\ell},

thereby ensuring that (89) holds.

Write $G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}}$. We may relabel the atoms of $G_{\ell}$ and $H_{\ell}$ such that $G_{\ell}=\sum_{i=1}^{k_{0}}p_{i}^{\ell}\delta_{\theta_{i}^{\ell}}$, $H_{\ell}=\sum_{i=1}^{k_{0}}\pi_{i}^{\ell}\delta_{\eta_{i}^{\ell}}$ with $\theta^{\ell}_{i},\eta_{i}^{\ell}\to\theta_{i}^{0}$ and $p_{i}^{\ell},\pi_{i}^{\ell}\to p_{i}^{0}$ for any $i\in[k_{0}]$. By passing to a further subsequence if necessary, we may require that $\{G_{\ell}\}_{\ell=1}^{\infty}$, $\{H_{\ell}\}_{\ell=1}^{\infty}$ additionally satisfy:

N(θiηi)DN(G,H)aiq,piπiDN(G,H)bi,1ik0,\frac{\sqrt{{N}_{\ell}}\left(\theta^{\ell}_{i}-\eta^{\ell}_{i}\right)}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\to a_{i}\in\mathbb{R}^{q},\quad\frac{p^{\ell}_{i}-\pi^{\ell}_{i}}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\to b_{i}\in\mathbb{R},\quad\forall 1\leq i\leq k_{0}, (90)

where the components of $a_{i}$ are in $[-1,1]$ and $\sum_{i=1}^{k_{0}}b_{i}=0$. It also follows that at least one of the $a_{i}$ is not $\bm{0}\in\mathbb{R}^{q}$ or one of the $b_{i}$ is not $0$. Fix $\alpha\in\{1\leq i\leq k_{0}:a_{i}\not=\bm{0}\text{ or }b_{i}\not=0\}$.

Step 2 (Change of measure by index α\alpha and application of CLT)
Pθ,NP_{\theta,{N}} has density w.r.t. Nμ\bigotimes^{{N}}\mu on 𝔛N\mathfrak{X}^{{N}}:

f¯(x¯|θ,N)=j=1Nf(xj|θ)=eθ(j=1NT(xj))NA(θ)j=1Nh(xj),\bar{f}(\bar{x}|\theta,{N})=\prod_{j=1}^{{N}}f(x_{j}|\theta)=e^{\theta^{\top}\left(\sum_{j=1}^{{N}}T(x_{j})\right)-{N}A(\theta)}\prod_{j=1}^{{N}}h(x_{j}),

where any x¯𝔛N\bar{x}\in\mathfrak{X}^{{N}} is partitioned into N{N} blocks as x¯=(x1,x2,,xN)\bar{x}=(x_{1},x_{2},\ldots,x_{{N}}) with xi𝔛x_{i}\in\mathfrak{X}. Then

2V(PG,N,PH,N)DN(G,H)\displaystyle\frac{2V(P_{G_{\ell},{N}_{\ell}},P_{H_{\ell},{N}_{\ell}})}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}
=\displaystyle= 𝔛N|i=1k0pieθi,j=1NT(xj)NA(θi)πieηi,j=1NT(xj)NA(ηi)DN(G,H)|j=1Nh(xj)dNμ\displaystyle\int_{\mathfrak{X}^{{N}_{\ell}}}\left|\sum_{i=1}^{k_{0}}\frac{p_{i}^{\ell}e^{\left\langle\theta_{i}^{\ell},\sum_{j=1}^{{N}_{\ell}}T(x_{j})\right\rangle-{N}_{\ell}A(\theta_{i}^{\ell})}-\pi_{i}^{\ell}e^{\left\langle\eta_{i}^{\ell},\sum_{j=1}^{{N}_{\ell}}T(x_{j})\right\rangle-{N}_{\ell}A(\eta_{i}^{\ell})}}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\right|\prod_{j=1}^{{N}_{\ell}}h(x_{j})d\bigotimes^{{N}_{\ell}}\mu
=\displaystyle= 𝔛N|i=1k0pieθi,j=1NT(xj)NA(θi)πieηi,j=1NT(xj)NA(ηi)DN(G,H)eθα0,j=1NT(xj)NA(θα0)|f¯(x¯|θα0,N)dNμ\displaystyle\int_{\mathfrak{X}^{{N}_{\ell}}}\left|\sum_{i=1}^{k_{0}}\frac{p_{i}^{\ell}e^{\left\langle\theta_{i}^{\ell},\sum_{j=1}^{{N}_{\ell}}T(x_{j})\right\rangle-{N}_{\ell}A(\theta_{i}^{\ell})}-\pi_{i}^{\ell}e^{\left\langle\eta_{i}^{\ell},\sum_{j=1}^{{N}_{\ell}}T(x_{j})\right\rangle-{N}_{\ell}A(\eta_{i}^{\ell})}}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})e^{\left\langle\theta_{\alpha}^{0},\sum_{j=1}^{{N}_{\ell}}T(x_{j})\right\rangle-{N}_{\ell}A(\theta_{\alpha}^{0})}}\right|\bar{f}(\bar{x}|\theta_{\alpha}^{0},{N}_{\ell})d\bigotimes^{{N}_{\ell}}\mu
=\displaystyle= 𝔼θα0|F(j=1NT(Xj))|,\displaystyle\mathbb{E}_{\theta_{\alpha}^{0}}\left|F_{\ell}\left(\sum_{j=1}^{{N}_{\ell}}T(X_{j})\right)\right|, (91)

where the $X_{j}$ are i.i.d. random variables with density $f(\cdot|\theta_{\alpha}^{0})$, and

F(y):=\displaystyle F_{\ell}(y):= i=1k0piexp(θi,yNA(θi))πiexp(ηi,yNA(ηi))DN(G,H)exp(θα0,yNA(θα0)).\displaystyle\sum_{i=1}^{k_{0}}\frac{p_{i}^{\ell}\exp\left(\left\langle\theta_{i}^{\ell},y\right\rangle-{N}_{\ell}A(\theta_{i}^{\ell})\right)-\pi_{i}^{\ell}\exp\left(\left\langle\eta_{i}^{\ell},y\right\rangle-{N}_{\ell}A(\eta_{i}^{\ell})\right)}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})\exp\left(\left\langle\theta_{\alpha}^{0},y\right\rangle-{N}_{\ell}A(\theta_{\alpha}^{0})\right)}.

Let $Z_{\ell}=\left(\sum_{j=1}^{N_{\ell}}T(X_{j})-N_{\ell}\mathbb{E}_{\theta_{\alpha}^{0}}T(X_{j})\right)/\sqrt{N_{\ell}}$. Since $\theta_{\alpha}^{0}\in\Theta^{\circ}$, the mean and covariance matrix of $T(X_{j})$ are respectively $\nabla_{\theta}A(\theta_{\alpha}^{0})$ and $\nabla^{2}_{\theta}A(\theta_{\alpha}^{0})$, the gradient and Hessian of $A(\theta)$ evaluated at $\theta_{\alpha}^{0}$. Then by the central limit theorem, $Z_{\ell}$ converges in distribution to $Z\sim\mathcal{N}(\bm{0},\nabla^{2}_{\theta}A(\theta_{\alpha}^{0}))$. Moreover,

F(j=1NT(Xj))=F(NZ+NθA(θα0))=Ψ(Z),F_{\ell}\left(\sum_{j=1}^{{N}_{\ell}}T(X_{j})\right)=F_{\ell}\left(\sqrt{{N}_{\ell}}Z_{\ell}+{N}_{\ell}\nabla_{\theta}A(\theta_{\alpha}^{0})\right)=\Psi_{\ell}(Z_{\ell}), (92)

where Ψ(z):=F(Nz+NθA(θα0))\Psi_{\ell}(z):=F_{\ell}\left(\sqrt{{N}_{\ell}}z+{N}_{\ell}\nabla_{\theta}A(\theta_{\alpha}^{0})\right).
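The mean and covariance used for $Z_{\ell}$ are the standard exponential-family identities $\mathbb{E}_{\theta}T=\nabla A(\theta)$ and $\text{Cov}_{\theta}T=\nabla^{2}A(\theta)$. A minimal numerical sketch, using the Bernoulli family in natural parameterization as an illustrative one-dimensional instance (so $A(\theta)=\log(1+e^{\theta})$ and $T(x)=x$; this choice is for illustration only, not the paper's general setting), checks them against finite differences.

```python
import math

# Bernoulli in natural parameterization: f(x|t) = exp(t*x - A(t)), x in {0, 1}
A = lambda t: math.log1p(math.exp(t))   # log-partition function

def moments(t):
    p = 1.0 / (1.0 + math.exp(-t))      # E T   = gradient of A
    return p, p * (1.0 - p)             # Var T = Hessian of A

t, h = 0.7, 1e-4
grad_fd = (A(t + h) - A(t - h)) / (2 * h)            # central difference
hess_fd = (A(t + h) - 2 * A(t) + A(t - h)) / h ** 2
m, v = moments(t)
print(abs(grad_fd - m) < 1e-8, abs(hess_fd - v) < 1e-5)  # True True
```

The positive definiteness of $\nabla^{2}A$ visible here ($p(1-p)>0$) is what makes the limiting Gaussian $Z$ nondegenerate in Step 3.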

Step 3 (Application of continuous mapping theorem)
Define $\Psi(z)=p_{\alpha}^{0}\left\langle a_{\alpha},z\right\rangle+b_{\alpha}$. Suppose that

Ψ(z)Ψ(z) for any sequence zzq,\Psi_{\ell}(z_{\ell})\to\Psi(z)\text{ for any sequence }z_{\ell}\to z\in\mathbb{R}^{q}, (93)

a property to be verified in the sequel, then by the generalized continuous mapping theorem ([46], Theorem 1.11.1), $\Psi_{\ell}(Z_{\ell})$ converges in distribution to $\Psi(Z)$. Applying Theorem 25.11 in [5],

𝔼|Ψ(Z)|lim inf𝔼θα0|Ψ(Z)|=0,\mathbb{E}|\Psi(Z)|\leq\liminf_{\ell\to\infty}\mathbb{E}_{\theta_{\alpha}^{0}}|\Psi_{\ell}(Z_{\ell})|=0, (94)

where the equality follows by (89), (91) and (92). Since $\Psi(z)$ is a non-zero affine function and the covariance matrix of $Z$ is positive definite due to the full rank property of the exponential family, $\Psi(Z)$ is either a nondegenerate Gaussian random variable or a non-zero constant, which contradicts (94).

It remains to verify (93). Consider any sequence $z_{\ell}\to z$. Write

Ψ(z)=i=1k0Ii,\Psi_{\ell}(z_{\ell})=\sum_{i=1}^{k_{0}}I_{i}, (95)

where

Ii:=piexp(g(θi))πiexp(g(ηi))DN(G,H)exp(g(θα0)),\displaystyle I_{i}:=\frac{p_{i}^{\ell}\exp\left(g_{\ell}(\theta_{i}^{\ell})\right)-\pi_{i}^{\ell}\exp\left(g_{\ell}(\eta_{i}^{\ell})\right)}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})\exp\left(g(\theta_{\alpha}^{0})\right)},

with

g(θ):=θ,Nz+NθA(θα0)NA(θ).g_{\ell}(\theta):=\left\langle\theta,\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\nabla_{\theta}A(\theta_{\alpha}^{0})\right\rangle-{N}_{\ell}A(\theta).

For any i[k0]i\in[k_{0}], by Taylor expansion of A(θ)A(\theta) at θi0\theta_{i}^{0} and the fact that A(θ)A(\theta) is infinitely differentiable at θi0Θ\theta_{i}^{0}\in\Theta^{\circ}, for large \ell,

|A(ηi)A(θi0)A(θi0),ηiθi0|22A(θi0)2ηiθi022,|A(\eta_{i}^{\ell})-A(\theta_{i}^{0})-\langle\nabla A(\theta_{i}^{0}),\eta_{i}^{\ell}-\theta_{i}^{0}\rangle|\leq 2\|\nabla^{2}A(\theta_{i}^{0})\|_{2}\|\eta_{i}^{\ell}-\theta_{i}^{0}\|_{2}^{2},

which implies

limN|A(ηi)A(θi0)A(θi0),ηiθi0|22A(θi0)2limDN2(H,G0)=0\lim_{\ell\to\infty}{N}_{\ell}|A(\eta_{i}^{\ell})-A(\theta_{i}^{0})-\langle\nabla A(\theta_{i}^{0}),\eta_{i}^{\ell}-\theta_{i}^{0}\rangle|\leq 2\|\nabla^{2}A(\theta_{i}^{0})\|_{2}\lim_{\ell\to\infty}D_{{N}_{\ell}}^{2}(H_{\ell},G_{0})=0 (96)

where the equality follows from (89), and the inequality follows from the fact that

DN(H,G0)=\displaystyle D_{{N}_{\ell}}(H_{\ell},G_{0})= i=1k0(Nηiθi02+|πipi0|)\displaystyle\sum_{i=1}^{k_{0}}(\sqrt{{N}_{\ell}}\|\eta_{i}^{\ell}-\theta_{i}^{0}\|_{2}+|\pi_{i}^{\ell}-p_{i}^{0}|) (97)
DN(G,G0)=\displaystyle D_{{N}_{\ell}}(G_{\ell},G_{0})= i=1k0(Nθiθi02+|pipi0|)\displaystyle\sum_{i=1}^{k_{0}}(\sqrt{{N}_{\ell}}\|\theta_{i}^{\ell}-\theta_{i}^{0}\|_{2}+|p_{i}^{\ell}-p_{i}^{0}|) (98)

for large \ell. The same conclusion holds with ηi\eta_{i}^{\ell} replaced by θi\theta_{i}^{\ell} in the last two displays.

For i[k0]i\in[k_{0}], by Lemma B.2 b) and the fact that A(θ)A(\theta) is infinitely differentiable at θi0Θ\theta_{i}^{0}\in\Theta^{\circ}, for large \ell

$|A(\theta_{i}^{\ell})-A(\eta_{i}^{\ell})-\langle\nabla A(\theta_{i}^{0}),\theta_{i}^{\ell}-\eta_{i}^{\ell}\rangle|\leq 2\|\nabla^{2}A(\theta_{i}^{0})\|_{2}\|\theta_{i}^{\ell}-\eta_{i}^{\ell}\|_{2}(\|\theta_{i}^{\ell}-\theta_{i}^{0}\|_{2}+\|\eta_{i}^{\ell}-\theta_{i}^{0}\|_{2}),$

which implies

limN|A(θi)A(ηi)A(θi0),θiηi|DN(G,H)\displaystyle\lim_{\ell\to\infty}\frac{{N}_{\ell}|A(\theta_{i}^{\ell})-A(\eta_{i}^{\ell})-\langle\nabla A(\theta_{i}^{0}),\theta_{i}^{\ell}-\eta_{i}^{\ell}\rangle|}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}
\displaystyle\leq 22A(θi0)2limNθiηi2DN(G,H)(DN(G,G0)+DN(H,G0))\displaystyle 2\|\nabla^{2}A(\theta_{i}^{0})\|_{2}\lim_{\ell\to\infty}\frac{\sqrt{{N}_{\ell}}\|\theta_{i}^{\ell}-\eta_{i}^{\ell}\|_{2}}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}(D_{{N}_{\ell}}(G_{\ell},G_{0})+D_{{N}_{\ell}}(H_{\ell},G_{0}))
=\displaystyle= 0\displaystyle 0 (99)

where the inequality follows from (97) and (98), and the equality follows from (89) and (90).

Case 1: Calculate limIα\lim_{\ell\to\infty}I_{\alpha}.
When \ell\to\infty

g(ηα)g(θα0)=ηαθα0,NzN(A(ηα)A(θα0)ηαθα0,θA(θα0))0g_{\ell}(\eta_{\alpha}^{\ell})-g_{\ell}(\theta_{\alpha}^{0})=\left\langle\eta_{\alpha}^{\ell}-\theta_{\alpha}^{0},\sqrt{{N}_{\ell}}z_{\ell}\right\rangle-{N}_{\ell}\left(A(\eta_{\alpha}^{\ell})-A(\theta_{\alpha}^{0})-\left\langle\eta_{\alpha}^{\ell}-\theta_{\alpha}^{0},\nabla_{\theta}A(\theta_{\alpha}^{0})\right\rangle\right)\to 0 (100)

by (89) and (96) with i=αi=\alpha. Similarly, one has

lim(g(θα)g(θα0))=0\lim_{\ell\to\infty}\left(g_{\ell}(\theta_{\alpha}^{\ell})-g_{\ell}(\theta_{\alpha}^{0})\right)=0 (101)

Moreover, as $\ell\to\infty$,

g(θα)g(ηα)DN(G,H)=\displaystyle\frac{g_{\ell}(\theta_{\alpha}^{\ell})-g_{\ell}(\eta_{\alpha}^{\ell})}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}= θαηα,NzN(A(θα)A(ηα)θαηα,θA(θα0))DN(G,H)\displaystyle\frac{\left\langle\theta_{\alpha}^{\ell}-\eta_{\alpha}^{\ell},\sqrt{{N}_{\ell}}z_{\ell}\right\rangle-{N}_{\ell}\left(A(\theta_{\alpha}^{\ell})-A(\eta_{\alpha}^{\ell})-\left\langle\theta_{\alpha}^{\ell}-\eta_{\alpha}^{\ell},\nabla_{\theta}A(\theta_{\alpha}^{0})\right\rangle\right)}{{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}}
\displaystyle\to aα,z\displaystyle\langle a_{\alpha},z\rangle (102)

by (90) and (99) with i=αi=\alpha.

Thus

limIα\displaystyle\lim_{\ell\to\infty}I_{\alpha}
=\displaystyle= limpαexp(g(θα)g(θα0))παexp(g(ηα)g(θα0))DN(G,H)\displaystyle\lim_{\ell\to\infty}\frac{p_{\alpha}^{\ell}\exp\left(g_{\ell}(\theta_{\alpha}^{\ell})-g_{\ell}(\theta_{\alpha}^{0})\right)-\pi_{\alpha}^{\ell}\exp\left(g_{\ell}(\eta_{\alpha}^{\ell})-g_{\ell}(\theta_{\alpha}^{0})\right)}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}
=\displaystyle= limpαeg(θα)g(θα0)eg(ηα)g(θα0)DN(G,H)+lim(pαπα)eg(ηα)g(θα0)DN(G,H)\displaystyle\lim_{\ell\to\infty}p_{\alpha}^{\ell}\frac{e^{g_{\ell}(\theta_{\alpha}^{\ell})-g_{\ell}(\theta_{\alpha}^{0})}-e^{g_{\ell}(\eta_{\alpha}^{\ell})-g_{\ell}(\theta_{\alpha}^{0})}}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}+\lim_{\ell\to\infty}\frac{(p_{\alpha}^{\ell}-\pi_{\alpha}^{\ell})e^{g_{\ell}(\eta_{\alpha}^{\ell})-g_{\ell}(\theta_{\alpha}^{0})}}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}
=()\displaystyle\overset{(*)}{=} pα0limeξ(g(θα)g(ηα))DN(G,H)+lim(pαπα)eg(ηα)g(θα0)DN(G,H)\displaystyle p_{\alpha}^{0}\lim_{\ell\to\infty}\frac{e^{\xi_{\ell}}(g_{\ell}(\theta_{\alpha}^{\ell})-g_{\ell}(\eta_{\alpha}^{\ell}))}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}+\lim_{\ell\to\infty}\frac{(p_{\alpha}^{\ell}-\pi_{\alpha}^{\ell})e^{g_{\ell}(\eta_{\alpha}^{\ell})-g_{\ell}(\theta_{\alpha}^{0})}}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}
=()\displaystyle\overset{(**)}{=} pα0limg(θα)g(ηα)DN(G,H)+limpαπαDN(G,H)\displaystyle p_{\alpha}^{0}\lim_{\ell\to\infty}\frac{g_{\ell}(\theta_{\alpha}^{\ell})-g_{\ell}(\eta_{\alpha}^{\ell})}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}+\lim_{\ell\to\infty}\frac{p_{\alpha}^{\ell}-\pi_{\alpha}^{\ell}}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}
=()\displaystyle\overset{(***)}{=} pα0aα,z+bα,\displaystyle p_{\alpha}^{0}\left\langle a_{\alpha},z\right\rangle+b_{\alpha}, (103)

where step ()(*) follows from mean value theorem with ξ\xi_{\ell} on the line segment between g(θα)g(θα0)g_{\ell}(\theta_{\alpha}^{\ell})-g_{\ell}(\theta_{\alpha}^{0}) and g(ηα)g(θα0)g_{\ell}(\eta_{\alpha}^{\ell})-g_{\ell}(\theta_{\alpha}^{0}), step ()(**) follows from g(θα)g(θα0)g_{\ell}(\theta_{\alpha}^{\ell})-g_{\ell}(\theta_{\alpha}^{0}), g(ηα)g(θα0)0g_{\ell}(\eta_{\alpha}^{\ell})-g_{\ell}(\theta_{\alpha}^{0})\to 0 due to (100), (101) and hence ξ0\xi_{\ell}\to 0, and step ()(***) follows from (102) and (90).

Case 2: Calculate limIi\lim_{\ell\to\infty}I_{i} for iαi\neq\alpha.
For iαi\not=\alpha,

exp(g(θi))exp(g(θα0))\displaystyle\frac{\exp\left(g_{\ell}(\theta_{i}^{\ell})\right)}{\exp\left(g_{\ell}(\theta_{\alpha}^{0})\right)}
=\displaystyle= exp(θiθα0,Nz+NθA(θα0)N(A(θi)A(θα0)))\displaystyle\exp\left(\left\langle\theta_{i}^{\ell}-\theta_{\alpha}^{0},\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\nabla_{\theta}A(\theta_{\alpha}^{0})\right\rangle-{N}_{\ell}\left(A(\theta_{i}^{\ell})-A(\theta_{\alpha}^{0})\right)\right)
=\displaystyle= exp(N(A(θi)A(θα0)θiθα0,θA(θα0)1Nθiθα0,z))\displaystyle\exp\left(-{N}_{\ell}\left(A(\theta_{i}^{\ell})-A(\theta_{\alpha}^{0})-\left\langle\theta_{i}^{\ell}-\theta_{\alpha}^{0},\nabla_{\theta}A(\theta_{\alpha}^{0})\right\rangle-\frac{1}{\sqrt{{N}_{\ell}}}\left\langle\theta_{i}^{\ell}-\theta_{\alpha}^{0},z_{\ell}\right\rangle\right)\right)
\displaystyle\leq exp(N2(A(θi0)A(θα0)θi0θα0,θA(θα0)))for sufficiently large ,\displaystyle\exp\left(-\frac{{N}_{\ell}}{2}\left(A(\theta_{i}^{0})-A(\theta_{\alpha}^{0})-\left\langle\theta_{i}^{0}-\theta_{\alpha}^{0},\nabla_{\theta}A(\theta_{\alpha}^{0})\right\rangle\right)\right)\quad\text{for sufficiently large }\ell, (104)

where the last inequality follows from lim1Nθiθα0,z=0\lim_{\ell\to\infty}\frac{1}{\sqrt{{N}_{\ell}}}\left\langle\theta_{i}^{\ell}-\theta_{\alpha}^{0},z_{\ell}\right\rangle=0 and

A(θi0)A(θα0)θi0θα0,θA(θα0)>0,A(\theta_{i}^{0})-A(\theta_{\alpha}^{0})-\left\langle\theta_{i}^{0}-\theta_{\alpha}^{0},\nabla_{\theta}A(\theta_{\alpha}^{0})\right\rangle>0, (105)

which is implied by the strict convexity of A(θ)A(\theta) over Θ\Theta^{\circ}, a consequence of the full-rank property of the exponential family. Similarly, for sufficiently large \ell,

exp(g(ηi))exp(g(θα0))exp(N2(A(θi0)A(θα0)θi0θα0,θA(θα0))).\frac{\exp\left(g_{\ell}(\eta_{i}^{\ell})\right)}{\exp\left(g_{\ell}(\theta_{\alpha}^{0})\right)}\leq\exp\left(-\frac{{N}_{\ell}}{2}\left(A(\theta_{i}^{0})-A(\theta_{\alpha}^{0})-\left\langle\theta_{i}^{0}-\theta_{\alpha}^{0},\nabla_{\theta}A(\theta_{\alpha}^{0})\right\rangle\right)\right). (106)

It follows that for iαi\neq\alpha

lim|Ii|\displaystyle\lim_{\ell\to\infty}|I_{i}|
\displaystyle\leq limpi|exp(g(θi))exp(g(ηi))DN(G,H)exp(g(θα0))|+lim|piπiDN(G,H)|exp(g(ηi))exp(g(θα0))\displaystyle\lim_{\ell\to\infty}p_{i}^{\ell}\left|\frac{\exp\left(g_{\ell}(\theta_{i}^{\ell})\right)-\exp\left(g_{\ell}(\eta_{i}^{\ell})\right)}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})\exp\left(g_{\ell}(\theta_{\alpha}^{0})\right)}\right|+\lim_{\ell\to\infty}\left|\frac{p_{i}^{\ell}-\pi_{i}^{\ell}}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\right|\frac{\exp\left(g_{\ell}(\eta_{i}^{\ell})\right)}{\exp\left(g_{\ell}(\theta_{\alpha}^{0})\right)}
\displaystyle\leq pi0limmax{exp(g(θi)),exp(g(ηi))}exp(g(θα0))|g(θi)g(ηi)DN(G,H)|+|bi|limexp(g(ηi))exp(g(θα0))\displaystyle p_{i}^{0}\lim_{\ell\to\infty}\frac{\max\{\exp\left(g_{\ell}(\theta_{i}^{\ell})\right),\exp\left(g_{\ell}(\eta_{i}^{\ell})\right)\}}{\exp\left(g_{\ell}(\theta_{\alpha}^{0})\right)}\left|\frac{g_{\ell}(\theta_{i}^{\ell})-g_{\ell}(\eta_{i}^{\ell})}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\right|+|b_{i}|\lim_{\ell\to\infty}\frac{\exp\left(g_{\ell}(\eta_{i}^{\ell})\right)}{\exp\left(g_{\ell}(\theta_{\alpha}^{0})\right)}
\displaystyle\leq limeN2(A(θi0)A(θα0)θi0θα0,θA(θα0))(pi0|g(θi)g(ηi)DN(G,H)|+|bi|),\displaystyle\lim_{\ell\to\infty}e^{-\frac{{N}_{\ell}}{2}\left(A(\theta_{i}^{0})-A(\theta_{\alpha}^{0})-\left\langle\theta_{i}^{0}-\theta_{\alpha}^{0},\nabla_{\theta}A(\theta_{\alpha}^{0})\right\rangle\right)}\left(p_{i}^{0}\left|\frac{g_{\ell}(\theta_{i}^{\ell})-g_{\ell}(\eta_{i}^{\ell})}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\right|+|b_{i}|\right), (107)

where the second inequality follows by applying the mean value theorem to the first term and (90) to the second term, while the last inequality follows from (104) and (106).
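The exponent appearing in (104) and (106) is the Bregman divergence of the log-partition function AA, which inequality (105) asserts is strictly positive for distinct parameters. A minimal numerical sketch of this fact, assuming the Poisson family with A(θ)=eθA(\theta)=e^{\theta} purely for illustration (this choice is not tied to the kernels in the theorem):

```python
import math

def bregman_A(theta_i, theta_alpha):
    """Bregman divergence A(t_i) - A(t_a) - (t_i - t_a) * A'(t_a)
    for the (illustrative) Poisson log-partition A(theta) = exp(theta)."""
    A = math.exp
    dA = math.exp  # A' = exp as well
    return A(theta_i) - A(theta_alpha) - (theta_i - theta_alpha) * dA(theta_alpha)

# Strict convexity of A makes the divergence strictly positive whenever
# the two natural parameters differ, which is exactly inequality (105).
for t_a in (-1.0, 0.0, 0.5):
    for t_i in (-2.0, -0.3, 1.2):
        if t_i != t_a:
            assert bregman_A(t_i, t_a) > 0.0
```

Since the divergence is bounded away from zero for fixed distinct parameters, the factor exp(N/2)\exp(-{N}_{\ell}/2\cdot\,\cdot\,) decays geometrically in N{N}_{\ell}.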

Since

lim sup1N|g(θi)g(ηi)DN(G,H)|\displaystyle\limsup_{\ell\to\infty}\frac{1}{\sqrt{{N}_{\ell}}}\left|\frac{g_{\ell}(\theta_{i}^{\ell})-g_{\ell}(\eta_{i}^{\ell})}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\right|
=\displaystyle= lim sup1N|θiηi,Nz+NθA(θα0)N(A(θi)A(ηi))DN(G,H)|\displaystyle\limsup_{\ell\to\infty}\frac{1}{\sqrt{{N}_{\ell}}}\left|\frac{\left\langle\theta_{i}^{\ell}-\eta_{i}^{\ell},\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\nabla_{\theta}A(\theta_{\alpha}^{0})\right\rangle-{N}_{\ell}\left(A(\theta_{i}^{\ell})-A(\eta_{i}^{\ell})\right)}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\right|
\displaystyle\leq lim sup|N(A(θi)A(θi0)θiηi,θA(θi0))DN(G,H)|\displaystyle\limsup_{\ell\to\infty}\left|\frac{-\sqrt{{N}_{\ell}}\left(A(\theta_{i}^{\ell})-A(\theta_{i}^{0})-\left\langle\theta_{i}^{\ell}-\eta_{i}^{\ell},\nabla_{\theta}A(\theta_{i}^{0})\right\rangle\right)}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\right|
+lim sup1N|N(θiηi),zDN(G,H)|+lim sup|Nθiηi,θA(θα0)θA(θi0)DN(G,H)|\displaystyle+\limsup_{\ell\to\infty}\frac{1}{\sqrt{{N}_{\ell}}}\left|\frac{\left\langle\sqrt{{N}_{\ell}}(\theta_{i}^{\ell}-\eta_{i}^{\ell}),z_{\ell}\right\rangle}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\right|+\limsup_{\ell\to\infty}\left|\frac{\sqrt{{N}_{\ell}}\left\langle\theta_{i}^{\ell}-\eta_{i}^{\ell},\nabla_{\theta}A(\theta_{\alpha}^{0})-\nabla_{\theta}A(\theta_{i}^{0})\right\rangle}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\right|
=\displaystyle= |ai,θA(θα0)θA(θi0)|,\displaystyle\left|\left\langle a_{i},\nabla_{\theta}A(\theta_{\alpha}^{0})-\nabla_{\theta}A(\theta_{i}^{0})\right\rangle\right|,

where the last step follows from (90) and (99). Then for sufficiently large \ell

|g(θi)g(ηi)DN(G,H)|(|ai,θA(θα0)θA(θi0)|+1)N.\left|\frac{g_{\ell}(\theta_{i}^{\ell})-g_{\ell}(\eta_{i}^{\ell})}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\right|\leq\left(\left|\left\langle a_{i},\nabla_{\theta}A(\theta_{\alpha}^{0})-\nabla_{\theta}A(\theta_{i}^{0})\right\rangle\right|+\frac{1}{\ell}\right)\sqrt{{N}_{\ell}}. (108)

Plugging (108) into (107), for any iαi\not=\alpha,

lim|Ii|\displaystyle\lim_{\ell\to\infty}|I_{i}|
\displaystyle\leq limeN2(A(θi0)A(θα0)θi0θα0,θA(θα0))(|ai,θA(θα0)θA(θi0)|+1)N\displaystyle\lim_{\ell\to\infty}e^{-\frac{{N}_{\ell}}{2}\left(A(\theta_{i}^{0})-A(\theta_{\alpha}^{0})-\left\langle\theta_{i}^{0}-\theta_{\alpha}^{0},\nabla_{\theta}A(\theta_{\alpha}^{0})\right\rangle\right)}\left(\left|\left\langle a_{i},\nabla_{\theta}A(\theta_{\alpha}^{0})-\nabla_{\theta}A(\theta_{i}^{0})\right\rangle\right|+\frac{1}{\ell}\right)\sqrt{{N}_{\ell}}
=\displaystyle= 0.\displaystyle 0. (109)

Combining (95), (103) and (109), we see that (93) is established. This concludes the proof of the theorem. ∎

D.2 Proof of Theorem 5.16

Proof of Theorem 5.16.

Step 1 (Proof by contradiction with subsequences)
This step is similar to the proof of Theorem 5.8. Suppose that (23) is not true. Then there exists a subsequence {N}=1\{{N}_{\ell}\}_{\ell=1}^{\infty} of natural numbers tending to infinity such that

limr0infG,HBW1(G0,r)GHV(PG,N,PH,N)DN(G,H)0 as N.\lim_{r\to 0}\ \ \inf_{\begin{subarray}{c}G,H\in B_{W_{1}}(G_{0},r)\\ G\not=H\end{subarray}}\frac{V(P_{G,{N}_{\ell}},P_{H,{N}_{\ell}})}{D_{{N}_{\ell}}(G,H)}\to 0\quad\text{ as }{N}_{\ell}\to\infty.

Then there exist {G}=1,{H}=1k0(Θ)\{G_{\ell}\}_{\ell=1}^{\infty},\{H_{\ell}\}_{\ell=1}^{\infty}\subset\mathcal{E}_{k_{0}}(\Theta) such that

{GHDN(G,G0)0,DN(H,G0)0 as V(PG,N,PH,N)DN(G,H)0 as .\begin{cases}G_{\ell}\not=H_{\ell}&\forall\ell\\ D_{{N}_{\ell}}(G_{\ell},G_{0})\to 0,D_{{N}_{\ell}}(H_{\ell},G_{0})\to 0&\text{ as }\ell\to\infty\\ \frac{V(P_{G_{\ell},{N}_{\ell}},P_{H_{\ell},{N}_{\ell}})}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\to 0&\text{ as }{\ell}\to\infty.\end{cases} (110)

To see this, note that for each fixed \ell, and thus fixed N{N}_{\ell}, DN(G,G0)0D_{{N}_{\ell}}(G,G_{0})\to 0 if and only if W1(G,G0)0W_{1}(G,G_{0})\to 0. Thus, there exist G,Hk0(Θ)G_{\ell},H_{\ell}\in\mathcal{E}_{k_{0}}(\Theta) such that GHG_{\ell}\not=H_{\ell}, DN(G,G0)1D_{{N}_{\ell}}(G_{\ell},G_{0})\leq\frac{1}{\ell}, DN(H,G0)1D_{{N}_{\ell}}(H_{\ell},G_{0})\leq\frac{1}{\ell} and

V(PG,N,PH,N)DN(G,H)limr0infG,HBW1(G0,r)GHV(PG,N,PH,N)DN(G,H)+1,\frac{V(P_{G_{\ell},{N}_{\ell}},P_{H_{\ell},{N}_{\ell}})}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\leq\lim_{r\to 0}\ \ \inf_{\begin{subarray}{c}G,H\in B_{W_{1}}(G_{0},r)\\ G\not=H\end{subarray}}\frac{V(P_{G,{N}_{\ell}},P_{H,{N}_{\ell}})}{D_{{N}_{\ell}}(G,H)}+\frac{1}{\ell},

thereby ensuring that (110) holds.

Write G0=i=1k0pi0δθi0G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}}. We may relabel the atoms of GG_{\ell} and HH_{\ell} such that G=i=1k0piδθiG_{\ell}=\sum_{i=1}^{k_{0}}p_{i}^{\ell}\delta_{\theta_{i}^{\ell}}, H=i=1k0πiδηiH_{\ell}=\sum_{i=1}^{k_{0}}\pi_{i}^{\ell}\delta_{\eta_{i}^{\ell}} with θi,ηiθi0\theta^{\ell}_{i},\eta_{i}^{\ell}\to\theta_{i}^{0} and pi,πipi0p_{i}^{\ell},\pi_{i}^{\ell}\to p_{i}^{0} for any i[k0]i\in[k_{0}]. By a subsequence argument if necessary, we may additionally require that {G}=1\{G_{\ell}\}_{\ell=1}^{\infty}, {H}=1\{H_{\ell}\}_{\ell=1}^{\infty} satisfy:

DN(H,G0)=\displaystyle D_{{N}_{\ell}}(H_{\ell},G_{0})= i=1k0(Nηiθi02+|πipi0|)\displaystyle\sum_{i=1}^{k_{0}}(\sqrt{{N}_{\ell}}\|\eta_{i}^{\ell}-\theta_{i}^{0}\|_{2}+|\pi_{i}^{\ell}-p_{i}^{0}|) (111)
DN(G,G0)=\displaystyle D_{{N}_{\ell}}(G_{\ell},G_{0})= i=1k0(Nθiθi02+|pipi0|)\displaystyle\sum_{i=1}^{k_{0}}(\sqrt{{N}_{\ell}}\|\theta_{i}^{\ell}-\theta_{i}^{0}\|_{2}+|p_{i}^{\ell}-p_{i}^{0}|) (112)

and

N(θiηi)DN(G,H)aiq,piπiDN(G,H)bi,1ik0,\frac{\sqrt{{N}_{\ell}}\left(\theta^{\ell}_{i}-\eta^{\ell}_{i}\right)}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\to a_{i}\in\mathbb{R}^{q},\quad\frac{p^{\ell}_{i}-\pi^{\ell}_{i}}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\to b_{i}\in\mathbb{R},\quad\forall 1\leq i\leq k_{0}, (113)

where the components of aia_{i} are in [1,1][-1,1] and i=1k0bi=0\sum_{i=1}^{k_{0}}b_{i}=0. It also follows that at least one of the aia_{i} is not 𝟎q\bm{0}\in\mathbb{R}^{q} or one of the bib_{i} is not 0. Let α\alpha be any index in {1ik0:ai𝟎 or bi0}\{1\leq i\leq k_{0}:a_{i}\not=\bm{0}\text{ or }b_{i}\not=0\}.

Step 2 (Transformation of the probability measures to measures supported on s\mathbb{R}^{s})
In this step, let T1:(𝔛,𝒜)(s,(s))T_{1}:(\mathfrak{X},\mathcal{A})\to(\mathbb{R}^{s},\mathcal{B}(\mathbb{R}^{s})) be an arbitrary measurable map. Extend T1T_{1} to the product space by defining T¯1:𝔛NNs\bar{T}_{1}:\mathfrak{X}^{N}\to\mathbb{R}^{{N}s} via T¯1x¯=((T1x1),,(T1xN))\bar{T}_{1}\bar{x}=((T_{1}x_{1})^{\top},\ldots,(T_{1}x_{{N}})^{\top})^{\top}, where any x¯𝔛N\bar{x}\in\mathfrak{X}^{{N}} is partitioned into N{N} blocks as x¯=(x1,x2,,xN)\bar{x}=(x_{1},x_{2},\ldots,x_{{N}}) with xi𝔛x_{i}\in\mathfrak{X}. Then one can easily verify that (NPθ)T¯11=N(PθT11)(\bigotimes^{{N}}P_{\theta})\circ\bar{T}_{1}^{-1}=\bigotimes^{{N}}(P_{\theta}\circ T_{1}^{-1}), and hence for any Gk0(Θ)G\in\mathcal{E}_{k_{0}}(\Theta)

PG,NT¯11=i=1k0pi(Pθi,NT¯11)=i=1k0pi(N(PθiT11)).P_{G,{N}}\circ\bar{T}_{1}^{-1}=\sum_{i=1}^{k_{0}}p_{i}(P_{\theta_{i},{N}}\circ\bar{T}_{1}^{-1})=\sum_{i=1}^{k_{0}}p_{i}\left(\bigotimes^{{N}}\left(P_{\theta_{i}}\circ T_{1}^{-1}\right)\right).

Further consider another measurable map T0:(Ns,(Ns))(s,(s))T_{0}:(\mathbb{R}^{{N}s},\mathcal{B}(\mathbb{R}^{{N}s}))\to(\mathbb{R}^{s},\mathcal{B}(\mathbb{R}^{s})) defined by T0t¯=i=1NtiT_{0}\bar{t}=\sum_{i=1}^{{N}}{t_{i}} where t¯Ns\bar{t}\in\mathbb{R}^{{N}s} is partitioned equally into N{N} blocks t¯=(t1,t2,,tN)Ns\bar{t}=(t_{1}^{\top},t_{2}^{\top},\ldots,t_{{N}}^{\top})^{\top}\in\mathbb{R}^{{N}s}. Denote the induced probability measure on s\mathbb{R}^{s} under T0T¯1T_{0}\circ\bar{T}_{1} of the Pθ,NP_{\theta,{N}} by Qθ,N:=(N(PθT11))T01Q_{\theta,{N}}:=\left(\bigotimes^{{N}}\left(P_{\theta}\circ T_{1}^{-1}\right)\right)\circ T_{0}^{-1}. Then the induced probability measure under T0T¯1T_{0}\circ\bar{T}_{1} of the mixture PG,NP_{G,{N}} is

PG,NT¯11T01=i=1k0piQθi,N:=QG,N.P_{G,{N}}\circ\bar{T}_{1}^{-1}\circ T_{0}^{-1}=\sum_{i=1}^{k_{0}}p_{i}Q_{\theta_{i},{N}}:=Q_{G,{N}}.

Note that the dependence of T¯1\bar{T}_{1} and T0T_{0} on N{N} is suppressed, as is the dependence of Qθ,NQ_{\theta,{N}} and QG,NQ_{G,{N}} on T1T_{1}.
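The identity (NPθ)T¯11=N(PθT11)(\bigotimes^{{N}}P_{\theta})\circ\bar{T}_{1}^{-1}=\bigotimes^{{N}}(P_{\theta}\circ T_{1}^{-1}) used above can be checked directly in the finite case. A minimal sketch with a hypothetical three-point measure and map, enumerating both sides for N=2{N}=2:

```python
from collections import defaultdict
from itertools import product

# A hypothetical three-point measure P on X = {0, 1, 2} and map T1: X -> {0, 1}.
P = {0: 0.5, 1: 0.3, 2: 0.2}
T1 = lambda x: x % 2  # merges the points 0 and 2

# Left-hand side: push the product measure P (x) P forward under the
# blockwise extension T1_bar(x1, x2) = (T1(x1), T1(x2)).
lhs = defaultdict(float)
for x1, x2 in product(P, P):
    lhs[(T1(x1), T1(x2))] += P[x1] * P[x2]

# Right-hand side: take the product of the single-coordinate push-forward.
PT1 = defaultdict(float)
for x in P:
    PT1[T1(x)] += P[x]
rhs = {(t1, t2): PT1[t1] * PT1[t2] for t1, t2 in product(PT1, PT1)}

# Both sides agree, as the identity asserts.
assert all(abs(lhs[k] - rhs[k]) < 1e-12 for k in rhs)
```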

Then, by the definition of the total variation distance,

V(PG,N,PH,N)V(QG,N,QH,N),N,T1.V(P_{G,{N}},P_{H,{N}})\geq V(Q_{G,{N}},Q_{H,{N}}),\quad\forall{N},\forall\ T_{1}.

The above display and (110) yield

limV(QG,N,QH,N)DN(G,H)=0,T1.\lim_{\ell\to\infty}\frac{V(Q_{G_{\ell},{N}_{\ell}},Q_{H_{\ell},{N}_{\ell}})}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}=0,\quad\forall T_{1}. (114)
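The inequality V(PG,N,PH,N)V(QG,N,QH,N)V(P_{G,{N}},P_{H,{N}})\geq V(Q_{G,{N}},Q_{H,{N}}) behind (114) is an instance of the data processing inequality for total variation: a push-forward can merge mass differences but never create them. A discrete sketch with hypothetical measures:

```python
from collections import defaultdict

def tv(p, q):
    """Total variation distance between two discrete measures given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def pushforward(p, T):
    """Push the discrete measure p forward under the map T."""
    out = defaultdict(float)
    for x, mass in p.items():
        out[T(x)] += mass
    return dict(out)

# Hypothetical measures on {0, 1, 2}; T merges the states 0 and 1.
P = {0: 0.4, 1: 0.4, 2: 0.2}
Q = {0: 0.1, 1: 0.6, 2: 0.3}
T = lambda x: 0 if x < 2 else 1

# Merging states can cancel mass differences but never create them,
# so the distance can only decrease (here, strictly).
assert tv(pushforward(P, T), pushforward(Q, T)) < tv(P, Q)
```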

Step 3 (Application of the central limit theorem)
In the rest of the proof, specialize T1T_{1} in Step 2 to be TαT_{\alpha}, and write T=TαT=T_{\alpha} to simplify notation. Let γ>0\gamma>0 and r1r\geq 1 be the same as in Definition 5.14 of T=TαT=T_{\alpha} with respect to the finite set {θi0}i=1k0\{\theta_{i}^{0}\}_{i=1}^{k_{0}}, and define Θ¯(G0):=i=1k0B(θi0,γ)\bar{\Theta}(G_{0}):=\bigcup_{i=1}^{k_{0}}B(\theta_{i}^{0},\gamma). By passing to subsequences if necessary, we may further require that G,HG_{\ell},H_{\ell} satisfy θi,ηiB(θi0,γ)\theta_{i}^{\ell},\eta_{i}^{\ell}\in B(\theta_{i}^{0},\gamma) for all i[k0]i\in[k_{0}] and that Nr{N}_{\ell}\geq r.

Consider {Xi}i=1i.i.d.Pθ\{X_{i}\}_{i=1}^{\infty}\overset{i.i.d.}{\sim}P_{\theta}. Then Y=i=1NTXiY_{\ell}=\sum_{i=1}^{{N}_{\ell}}TX_{i} is distributed according to the probability measure Qθ,NQ_{\theta,{N}_{\ell}}, which has characteristic function (ϕT(ζ|θ))N(\phi_{T}(\zeta|\theta))^{{N}_{\ell}}. For θΘ¯(G0)\theta\in\bar{\Theta}(G_{0}), by (33) in (A3) and by the Fourier inversion theorem, Qθ,NQ_{\theta,{N}_{\ell}} and YY_{\ell} have density fY(y|θ,N)f_{Y}(y|\theta,{N}_{\ell}) with respect to the Lebesgue measure, given by

fY(y|θ,N)=1(2π)sseiζy(ϕT(ζ|θ))N𝑑ζ.f_{Y}(y|\theta,{N}_{\ell})=\frac{1}{(2\pi)^{s}}\int_{\mathbb{R}^{s}}e^{-i\zeta^{\top}y}(\phi_{T}(\zeta|\theta))^{{N}_{\ell}}d\zeta. (115)

Then QG,NQ_{G_{\ell},{N}_{\ell}} has density with respect to Lebesgue measure given by i=1k0pifY(y|θi,N)\sum_{i=1}^{k_{0}}p_{i}^{\ell}f_{Y}(y|\theta_{i}^{\ell},{N}_{\ell}), and similarly QH,NQ_{H_{\ell},{N}_{\ell}} has density with respect to Lebesgue measure i=1k0πifY(y|ηi,N)\sum_{i=1}^{k_{0}}\pi_{i}^{\ell}f_{Y}(y|\eta_{i}^{\ell},{N}_{\ell}). Thus

2V(QG,N,QH,N)=s|i=1k0pifY(y|θi,N)i=1k0πifY(y|ηi,N)|dy.2V(Q_{G_{\ell},{N}_{\ell}},Q_{H_{\ell},{N}_{\ell}})=\int_{\mathbb{R}^{s}}\left|\sum_{i=1}^{k_{0}}p_{i}^{\ell}f_{Y}(y|\theta_{i}^{\ell},{N}_{\ell})-\sum_{i=1}^{k_{0}}\pi^{\ell}_{i}f_{Y}(y|\eta^{\ell}_{i},{N}_{\ell})\right|dy. (116)

When YY_{\ell} has density fY(y|θ,N)f_{Y}(y|\theta,{N}_{\ell}), define Z=(YNλθ)/NZ_{\ell}=(Y_{\ell}-{N}_{\ell}\lambda_{\theta})/\sqrt{{N}_{\ell}}, where λθ=𝔼θTX1\lambda_{\theta}=\mathbb{E}_{\theta}TX_{1}. Note that this transformation from YY_{\ell} to ZZ_{\ell} depends on the θ\theta appearing in the density of YY_{\ell}. Then, by the change of variable formula, ZZ_{\ell} has density fZ(z|θ,N)f_{Z}(z|\theta,{N}_{\ell}) with respect to the Lebesgue measure, given by

fZ(z|θ,N)=fY(Nz+Nλθ|θ,N)Ns/2,f_{Z}(z|\theta,{N}_{\ell})=f_{Y}(\sqrt{{N}_{\ell}}z+{N}_{\ell}\lambda_{\theta}|\theta,{N}_{\ell}){N}_{\ell}^{s/2},

or equivalently

fY(y|θ,N)=fZ((yNλθ)/N|θ,N)/Ns/2.f_{Y}(y|\theta,{N}_{\ell})=f_{Z}((y-{N}_{\ell}\lambda_{\theta})/\sqrt{{N}_{\ell}}|\theta,{N}_{\ell})/{N}_{\ell}^{s/2}. (117)

Now, by the local central limit theorem (Lemma D.3), fZ(z|θ,N)f_{Z}(z|\theta,{N}_{\ell}) converges uniformly in zz to f𝒩(z|θ)f_{\mathcal{N}}(z|\theta) for every θΘ¯(G0)\theta\in\bar{\Theta}(G_{0}). Here f𝒩(z|θ)f_{\mathcal{N}}(z|\theta) is the density of 𝒩(𝟎,Λθ)\mathcal{N}(\bm{0},\Lambda_{\theta}), the multivariate normal with mean 𝟎\bm{0} and covariance matrix Λθ\Lambda_{\theta}. Next, specialize the previous statement to θα0\theta^{0}_{\alpha} and define

w=sup{w0:fZ(z|θα0,N)1(2π)s/212 for all z2w}.w_{\ell}=\sup\left\{w\geq 0:f_{Z}(z|\theta^{0}_{\alpha},{N}_{\ell})\geq\frac{1}{(2\pi)^{s/2}}\frac{1}{2\ell}\text{ for all }\|z\|_{2}\leq w\right\}.

We use the convention that the supremum of \emptyset is 0 in the above display. Because of the uniform convergence of fZ(z|θα0,N)f_{Z}(z|\theta^{0}_{\alpha},{N}_{\ell}) to f𝒩(z|θα0)f_{\mathcal{N}}(z|\theta^{0}_{\alpha}), we have ww_{\ell}\to\infty as \ell\to\infty. It follows from (117) that fY(y|θα0,N)>0f_{Y}(y|\theta^{0}_{\alpha},{N}_{\ell})>0 on B:={ys|y=Nz+Nλθα0 for z2w}.B_{\ell}:=\{y\in\mathbb{R}^{s}|y=\sqrt{{N}_{\ell}}z+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}\text{ for }\|z\|_{2}\leq w_{\ell}\}. Then by (116)
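The uniform convergence supplied by the local central limit theorem can be observed numerically in a simple special case. The sketch below assumes, purely for illustration, that the TXiTX_{i} are i.i.d. Exp(1), so that YGamma(N,1)Y_{\ell}\sim\mathrm{Gamma}({N}_{\ell},1) with λθ=Λθ=1\lambda_{\theta}=\Lambda_{\theta}=1, and measures the sup-norm gap between the density of Z=(YN)/NZ_{\ell}=(Y_{\ell}-{N}_{\ell})/\sqrt{{N}_{\ell}} and the standard normal density:

```python
import math

def gamma_density(y, n):
    """Density of the sum of n i.i.d. Exp(1) variables, i.e. Gamma(n, 1)."""
    if y <= 0.0:
        return 0.0
    return math.exp((n - 1) * math.log(y) - y - math.lgamma(n))

def f_Z(z, n):
    """Density of the standardized sum Z = (Y - n) / sqrt(n)."""
    return math.sqrt(n) * gamma_density(math.sqrt(n) * z + n, n)

def phi(z):
    """Standard normal density (lambda_theta = Lambda_theta = 1 here)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

grid = [i / 10.0 for i in range(-40, 41)]
sup_err = lambda n: max(abs(f_Z(z, n) - phi(z)) for z in grid)

# The sup-norm gap shrinks as n grows, as the local CLT predicts.
assert sup_err(400) < sup_err(100) < sup_err(25)
```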

2V(QG,N,QH,N)DN(G,H)\displaystyle\frac{2V(Q_{G_{\ell},{N}_{\ell}},Q_{H_{\ell},{N}_{\ell}})}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}
=\displaystyle= s|i=1k0pi(fY(y|θi,N)fY(y|ηi,N))DN(G,H)+i=1k0(piπi)fY(y|ηi,N)DN(G,H)|𝑑y\displaystyle\int_{\mathbb{R}^{s}}\left|\sum_{i=1}^{k_{0}}\frac{p_{i}^{\ell}\left(f_{Y}(y|\theta_{i}^{\ell},{N}_{\ell})-f_{Y}(y|\eta^{\ell}_{i},{N}_{\ell})\right)}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}+\sum_{i=1}^{k_{0}}\frac{(p_{i}^{\ell}-\pi^{\ell}_{i})f_{Y}(y|\eta^{\ell}_{i},{N}_{\ell})}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})}\right|dy
\displaystyle\geq B|i=1k0pi(fY(y|θi,N)fY(y|ηi,N))+(piπi)fY(y|ηi,N)DN(G,H)fY(y|θα0,N)|fY(y|θα0,N)𝑑y\displaystyle\int_{B_{\ell}}\left|\sum_{i=1}^{k_{0}}\frac{p_{i}^{\ell}\left(f_{Y}(y|\theta_{i}^{\ell},{N}_{\ell})-f_{Y}(y|\eta^{\ell}_{i},{N}_{\ell})\right)+(p_{i}^{\ell}-\pi^{\ell}_{i})f_{Y}(y|\eta^{\ell}_{i},{N}_{\ell})}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})f_{Y}(y|\theta^{0}_{\alpha},{N}_{\ell})}\right|f_{Y}(y|\theta^{0}_{\alpha},{N}_{\ell})dy
=\displaystyle= 𝔼θα0|F(Y)|\displaystyle\mathbb{E}_{\theta_{\alpha}^{0}}|F_{\ell}(Y_{\ell})|
=\displaystyle= 𝔼θα0|Ψ(Z)|,\displaystyle\mathbb{E}_{\theta_{\alpha}^{0}}|\Psi_{\ell}(Z_{\ell})|, (118)

where

F(y)=(i=1k0pi(fY(y|θi,N)fY(y|ηi,N))+(piπi)fY(y|ηi,N)DN(G,H)fY(y|θα0,N))𝟏B(y),F_{\ell}(y)=\left(\sum_{i=1}^{k_{0}}\frac{p_{i}^{\ell}\left(f_{Y}(y|\theta_{i}^{\ell},{N}_{\ell})-f_{Y}(y|\eta^{\ell}_{i},{N}_{\ell})\right)+(p_{i}^{\ell}-\pi^{\ell}_{i})f_{Y}(y|\eta^{\ell}_{i},{N}_{\ell})}{D_{{N}_{\ell}}(G_{\ell},H_{\ell})f_{Y}(y|\theta^{0}_{\alpha},{N}_{\ell})}\right)\bm{1}_{{B_{\ell}}}(y),

and

Ψ(z)=F(Nz+Nλθα0).\Psi_{\ell}(z)=F_{\ell}(\sqrt{{N}_{\ell}}z+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}).

Observe that if ZZ_{\ell} has density fZ(z|θα0,N)f_{Z}(z|\theta^{0}_{\alpha},{N}_{\ell}), then ZZ_{\ell} converges in distribution to Z𝒩(𝟎,Λθα0)Z\sim\mathcal{N}(\bm{0},\Lambda_{\theta_{\alpha}^{0}}).

Step 4 (Application of a continuous mapping theorem)
Define Ψ(z)=pα0(Jλ(θα0)aα)Λθα01z+bα\Psi(z)=p_{\alpha}^{0}\left(J_{\lambda}(\theta_{\alpha}^{0})a_{\alpha}\right)^{\top}\Lambda_{\theta_{\alpha}^{0}}^{-1}z+b_{\alpha}, where Jλ(θα0)s×qJ_{\lambda}(\theta_{\alpha}^{0})\in\mathbb{R}^{s\times q} is the Jacobian matrix of λ(θ)\lambda(\theta) evaluated at θα0\theta_{\alpha}^{0}. Supposing that

Ψ(z)Ψ(z) for any sequence zzs,\Psi_{\ell}(z_{\ell})\to\Psi(z)\text{ for any sequence }z_{\ell}\to z\in\mathbb{R}^{s}, (119)

a property to be verified later, it follows from the generalized continuous mapping theorem ([46, Theorem 1.11.1]) that Ψ(Z)\Psi_{\ell}(Z_{\ell}) converges in distribution to Ψ(Z)\Psi(Z). Applying [5, Theorem 25.11],

𝔼|Ψ(Z)|lim inf𝔼θα0|Ψ(Z)|=0,\mathbb{E}|\Psi(Z)|\leq\liminf_{\ell\to\infty}\mathbb{E}_{\theta_{\alpha}^{0}}|\Psi_{\ell}(Z_{\ell})|=0,

where the equality follows from (118) and (114). Note that Λθ\Lambda_{\theta} is positive definite (by (A1)) and Jλ(θα0)J_{\lambda}(\theta_{\alpha}^{0}) has full column rank. In addition, by our choice of α\alpha, either aαa_{\alpha} or bαb_{\alpha} is non-zero. Hence, Ψ(z)\Psi(z) is a non-zero affine function of zz. For such a Ψ(z)\Psi(z), 𝔼|Ψ(Z)|\mathbb{E}|\Psi(Z)| cannot be zero, which results in a contradiction. It thus remains to establish (119).
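The contradiction above relies on the elementary fact that a non-zero affine function of a Gaussian vector has strictly positive expected absolute value: 𝔼|aZ+b||b|\mathbb{E}|a^{\top}Z+b|\geq|b| by Jensen's inequality, and the map b𝔼|aZ+b|b\mapsto\mathbb{E}|a^{\top}Z+b| is convex and even by the symmetry of ZZ, hence minimized at b=0b=0 where it is strictly positive when a𝟎a\neq\bm{0}. A one-dimensional Monte Carlo sketch with hypothetical coefficients:

```python
import math
import random

random.seed(0)

def mean_abs_affine(a, b, sigma, n=200_000):
    """Monte Carlo estimate of E|a Z + b| for Z ~ N(0, sigma^2)."""
    return sum(abs(a * random.gauss(0.0, sigma) + b) for _ in range(n)) / n

# Hypothetical coefficients: Psi(z) = a z + b with (a, b) != (0, 0).
a, b, sigma = 0.7, 0.3, 1.5
est = mean_abs_affine(a, b, sigma)

# Jensen: E|aZ + b| >= |E(aZ + b)| = |b|; symmetry of Z: the minimum
# over b is attained at b = 0, where E|aZ| = |a| sigma sqrt(2/pi) > 0.
# Either bound rules out E|Psi(Z)| = 0 unless a = b = 0.
assert est >= abs(b)
assert est >= abs(a) * sigma * math.sqrt(2.0 / math.pi) - 0.01
```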

We now state the following lemma and proceed to verify (119); the technical and lengthy proof of the lemma is deferred to Section D.5.

Lemma D.1.

Suppose all the conditions in Theorem 5.16 hold and let γ,r\gamma,r be defined as in the first paragraph in Step 3. For any 1ik01\leq i\leq k_{0}, for any pair of sequences θ¯i,η¯iB(θi0,γ)\bar{\theta}_{i}^{\ell},\bar{\eta}_{i}^{\ell}\in B(\theta_{i}^{0},\gamma) and for any increasing N¯r\bar{{N}}_{\ell}\geq r satisfying N¯θ¯iθi02,N¯η¯iθi020\sqrt{\bar{{N}}_{\ell}}\|\bar{\theta}_{i}^{\ell}-\theta_{i}^{0}\|_{2},\sqrt{\bar{{N}}_{\ell}}\|\bar{\eta}_{i}^{\ell}-\theta_{i}^{0}\|_{2}\to 0 and N¯\bar{{N}}_{\ell}\to\infty:

J(θ¯i,η¯i,N¯)\displaystyle J(\bar{\theta}_{i}^{\ell},\bar{\eta}_{i}^{\ell},\bar{{N}}_{\ell})
:=\displaystyle:= N¯s/2supys|fY(y|θ¯i,N¯)fY(y|η¯i,N¯)j=1qf𝒩(y|θi0,N¯)θ(j)((θ¯i)(j)(η¯i)(j))|\displaystyle\bar{{N}}_{\ell}^{s/2}\sup_{y\in\mathbb{R}^{s}}\left|f_{Y}(y|\bar{\theta}_{i}^{\ell},\bar{{N}}_{\ell})-f_{Y}(y|\bar{\eta}^{\ell}_{i},\bar{{N}}_{\ell})-\sum_{j=1}^{q}\frac{\partial f_{\mathcal{N}}(y|\theta_{i}^{0},\bar{{N}}_{\ell})}{\partial\theta^{(j)}}\left((\bar{\theta}_{i}^{\ell})^{(j)}-(\bar{\eta}_{i}^{\ell})^{(j)}\right)\right|
=\displaystyle= o(N¯θ¯iη¯i2), as ,\displaystyle o(\sqrt{\bar{{N}}_{\ell}}\|\bar{\theta}_{i}^{\ell}-\bar{\eta}_{i}^{\ell}\|_{2}),\quad\text{ as }\ell\to\infty, (120)

where f𝒩(y|θ,N)f_{\mathcal{N}}(y|\theta,{N}) is the density with respect to Lebesgue measure of 𝒩(Nλθ,NΛθ)\mathcal{N}({N}\lambda_{\theta},{N}\Lambda_{\theta}) when Λθ\Lambda_{\theta} is positive definite.

Step 5 (Verification of (119))
Write D=DN(G,H)D_{\ell}=D_{{N}_{\ell}}(G_{\ell},H_{\ell}) for brevity in the remainder of this proof. Observe that, by the local central limit theorem (Lemma D.3),

|fZ(z|θα0,N)f𝒩(z|θα0)|\displaystyle\left|f_{Z}(z_{\ell}|\theta^{0}_{\alpha},{N}_{\ell})-f_{\mathcal{N}}(z|\theta^{0}_{\alpha})\right|\leq supzs|fZ(z|θα0,N)f𝒩(z|θα0)|+|f𝒩(z|θα0)f𝒩(z|θα0)|\displaystyle\sup_{z^{\prime}\in\mathbb{R}^{s}}|f_{Z}(z^{\prime}|\theta^{0}_{\alpha},{N}_{\ell})-f_{\mathcal{N}}(z^{\prime}|\theta^{0}_{\alpha})|+|f_{\mathcal{N}}(z_{\ell}|\theta^{0}_{\alpha})-f_{\mathcal{N}}(z|\theta^{0}_{\alpha})|
\displaystyle\to 0,\displaystyle 0,

as \ell\to\infty, which implies

limfZ(z|θα0,N)=f𝒩(z|θα0).\lim_{\ell\to\infty}f_{Z}(z_{\ell}|\theta^{0}_{\alpha},{N}_{\ell})=f_{\mathcal{N}}(z|\theta^{0}_{\alpha}). (121)

Hereafter fY(Nz+Nλθα0|θi0,N)θ(j):=fY(y|θi0,N)θ(j)|y=Nz+Nλθα0\frac{\partial f_{Y}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta_{i}^{0},{N}_{\ell})}{\partial\theta^{(j)}}:=\left.\frac{\partial f_{Y}(y|\theta_{i}^{0},{N}_{\ell})}{\partial\theta^{(j)}}\right|_{y=\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}}. Similar definition applies to f𝒩(Nz+Nλθα0|θi0,N)θ(j)\frac{\partial f_{\mathcal{N}}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta_{i}^{0},{N}_{\ell})}{\partial\theta^{(j)}}. Then for each i[k0]i\in[k_{0}],

1DfY(Nz+Nλθα0|θi,N)fY(Nz+Nλθα0|ηi,N)fY(Nz+Nλθα0|θα0,N)𝟏B(Nz+Nλθα0)\displaystyle\frac{1}{D_{\ell}}\frac{f_{Y}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta_{i}^{\ell},{N}_{\ell})-f_{Y}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\eta^{\ell}_{i},{N}_{\ell})}{f_{Y}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta^{0}_{\alpha},{N}_{\ell})}\bm{1}_{{B_{\ell}}}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}})
=\displaystyle= Ns/2DfY(Nz+Nλθα0|θi,N)fY(Nz+Nλθα0|ηi,N)fZ(z|θα0,N)𝟏E(z)\displaystyle\frac{{N}_{\ell}^{s/2}}{D_{\ell}}\frac{f_{Y}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta_{i}^{\ell},{N}_{\ell})-f_{Y}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\eta^{\ell}_{i},{N}_{\ell})}{f_{Z}(z_{\ell}|\theta^{0}_{\alpha},{N}_{\ell})}\bm{1}_{{E_{\ell}}}(z_{\ell})
\displaystyle\leq (Ns/2Dj=1qf𝒩(Nz+Nλθα0|θi0,N)θ(j)((θi)(j)(ηi)(j))fZ(z|θα0,N)+J(θi,ηi,N)DfZ(z|θα0,N))𝟏E(z),\displaystyle\left(\frac{{N}_{\ell}^{s/2}}{D_{\ell}}\frac{\sum_{j=1}^{q}\frac{\partial f_{\mathcal{N}}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta_{i}^{0},{N}_{\ell})}{\partial\theta^{(j)}}\left((\theta_{i}^{\ell})^{(j)}-(\eta_{i}^{\ell})^{(j)}\right)}{f_{Z}(z_{\ell}|\theta^{0}_{\alpha},{N}_{\ell})}+\frac{J(\theta_{i}^{\ell},\eta_{i}^{\ell},{N}_{\ell})}{D_{\ell}f_{Z}(z_{\ell}|\theta^{0}_{\alpha},{N}_{\ell})}\right)\bm{1}_{{E_{\ell}}}(z_{\ell}), (122)

where the first equality follows from (117), with E={zs|z2w}E_{\ell}=\{z\in\mathbb{R}^{s}|\|z\|_{2}\leq w_{\ell}\}. Observe that for any i[k0]i\in[k_{0}],

Nθiθi0\displaystyle\sqrt{{N}_{\ell}}\|\theta_{i}^{\ell}-\theta_{i}^{0}\|\to 0,\displaystyle 0, (123)
Nηiθi0\displaystyle\sqrt{{N}_{\ell}}\|\eta_{i}^{\ell}-\theta_{i}^{0}\|\to 0,\displaystyle 0, (124)

by (111), (112) and (110). Then, by applying (120) with θ¯i,η¯i,N¯\bar{\theta}_{i}^{\ell},\bar{\eta}_{i}^{\ell},\bar{{N}}_{\ell} taken to be θi,ηi,N\theta_{i}^{\ell},\eta_{i}^{\ell},{N}_{\ell} respectively, and by (113),

limJ(θi,ηi,N)D=0,\lim_{\ell\to\infty}\frac{J(\theta_{i}^{\ell},\eta_{i}^{\ell},{N}_{\ell})}{D_{\ell}}=0,

which together with (121) yield

limJ(θi,ηi,N)DfZ(z|θα0,N)𝟏E(z)=0.\lim_{\ell\to\infty}\frac{J(\theta_{i}^{\ell},\eta_{i}^{\ell},{N}_{\ell})}{D_{\ell}f_{Z}(z_{\ell}|\theta^{0}_{\alpha},{N}_{\ell})}\bm{1}_{{E_{\ell}}}(z_{\ell})=0. (125)

Thus by (122) and (125)

i=1k0piDfY(Nz+Nλθα0|θi,N)fY(Nz+Nλθα0|θi0,N)fY(Nz+Nλθα0|θα0,N)𝟏B(Nz+Nλθα0)\displaystyle\sum_{i=1}^{k_{0}}\frac{p_{i}^{\ell}}{D_{\ell}}\frac{f_{Y}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta_{i}^{\ell},{N}_{\ell})-f_{Y}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta^{0}_{i},{N}_{\ell})}{f_{Y}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta^{0}_{\alpha},{N}_{\ell})}\bm{1}_{{B_{\ell}}}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}})
\displaystyle\to i=1k0pi0lim(Ns/2Dj=1qf𝒩(Nz+Nλθα0|θi0,N)θ(j)((θi)(j)(ηi)(j))fZ(z|θα0,N))𝟏E(z),\displaystyle\sum_{i=1}^{k_{0}}p_{i}^{0}\lim_{\ell\to\infty}\left(\frac{{N}_{\ell}^{s/2}}{D_{\ell}}\frac{\sum_{j=1}^{q}\frac{\partial f_{\mathcal{N}}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta_{i}^{0},{N}_{\ell})}{\partial\theta^{(j)}}\left((\theta_{i}^{\ell})^{(j)}-(\eta_{i}^{\ell})^{(j)}\right)}{f_{Z}(z_{\ell}|\theta^{0}_{\alpha},{N}_{\ell})}\right)\bm{1}_{{E_{\ell}}}(z_{\ell}), (126)

provided the right hand side exists.

Note that for each j[q]j\in[q], and any θΘ¯(G0)\theta\in\bar{\Theta}(G_{0}), by a standard calculation for Gaussian density,

f𝒩(y|θ,N)θ(j)\displaystyle\frac{\partial f_{\mathcal{N}}(y|\theta,{N})}{\partial\theta^{(j)}}
=\displaystyle= f𝒩(y|θ,N)(12det(Λθ)1det(Λθ)θ(j)\displaystyle f_{\mathcal{N}}(y|\theta,{N})\quad\quad\left(-\frac{1}{2}\text{det}\left(\Lambda_{\theta}\right)^{-1}\frac{\partial\text{det}\left(\Lambda_{\theta}\right)}{\partial\theta^{(j)}}\right.
+(λθθ(j))Λθ1(yNλθ)12N(yNλθ)(Λθ1θ(j))(yNλθ)),\displaystyle\quad\left.+\left(\frac{\partial\lambda_{\theta}}{\partial\theta^{(j)}}\right)^{\top}\Lambda_{\theta}^{-1}(y-{N}\lambda_{\theta})-\frac{1}{2{N}}(y-{N}\lambda_{\theta})^{\top}\left(\frac{\partial\Lambda_{\theta}^{-1}}{\partial\theta^{(j)}}\right)(y-{N}\lambda_{\theta})\right),

so we have

f𝒩(Nz+Nλθα0|θi0,N)θ(j)\displaystyle\frac{\partial f_{\mathcal{N}}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta_{i}^{0},{N}_{\ell})}{\partial\theta^{(j)}}
=\displaystyle= f𝒩(Nz+Nλθα0|θi0,N)((λθi0θ(j))Λθi01(Nz+N(λθα0λθi0))\displaystyle f_{\mathcal{N}}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta_{i}^{0},{N}_{\ell})\left(\left(\frac{\partial\lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}\right)^{\top}\Lambda_{\theta_{i}^{0}}^{-1}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}(\lambda_{\theta_{\alpha}^{0}}-\lambda_{\theta_{i}^{0}}))\right.
121det(Λθi0)det(Λθi0)θ(j)12(z+N(λθα0λθi0))Λθi01θ(j)(z+N(λθα0λθi0)))\displaystyle\left.-\frac{1}{2}\frac{1}{\text{det}\left(\Lambda_{\theta_{i}^{0}}\right)}\frac{\partial\text{det}\left(\Lambda_{\theta_{i}^{0}}\right)}{\partial\theta^{(j)}}-\frac{1}{2}(z_{\ell}+\sqrt{{N}_{\ell}}(\lambda_{\theta_{\alpha}^{0}}-\lambda_{\theta_{i}^{0}}))^{\top}\frac{\partial\Lambda_{\theta_{i}^{0}}^{-1}}{\partial\theta^{(j)}}(z_{\ell}+\sqrt{{N}_{\ell}}(\lambda_{\theta_{\alpha}^{0}}-\lambda_{\theta_{i}^{0}}))\right)
=\displaystyle= Ns2f𝒩(z+N(λθα0λθi0)|θi0)((λθi0θ(j))Λθi01(Nz+N(λθα0λθi0))\displaystyle{N}_{\ell}^{-\frac{s}{2}}f_{\mathcal{N}}(z_{\ell}+\sqrt{{N}_{\ell}}(\lambda_{\theta^{0}_{\alpha}}-\lambda_{\theta^{0}_{i}})|\theta_{i}^{0})\left(\left(\frac{\partial\lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}\right)^{\top}\Lambda_{\theta_{i}^{0}}^{-1}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}(\lambda_{\theta_{\alpha}^{0}}-\lambda_{\theta_{i}^{0}}))\right.
121det(Λθi0)det(Λθi0)θ(j)12(z+N(λθα0λθi0))Λθi01θ(j)(z+N(λθα0λθi0))).\displaystyle\left.-\frac{1}{2}\frac{1}{\text{det}\left(\Lambda_{\theta_{i}^{0}}\right)}\frac{\partial\text{det}\left(\Lambda_{\theta_{i}^{0}}\right)}{\partial\theta^{(j)}}-\frac{1}{2}(z_{\ell}+\sqrt{{N}_{\ell}}(\lambda_{\theta_{\alpha}^{0}}-\lambda_{\theta_{i}^{0}}))^{\top}\frac{\partial\Lambda_{\theta_{i}^{0}}^{-1}}{\partial\theta^{(j)}}(z_{\ell}+\sqrt{{N}_{\ell}}(\lambda_{\theta_{\alpha}^{0}}-\lambda_{\theta_{i}^{0}}))\right).

Thus, when iαi\not=\alpha,

Ns12|f𝒩(Nz+Nλθα0|θi0,N)θ(j)|\displaystyle{N}_{\ell}^{\frac{s-1}{2}}\left|\frac{\partial f_{\mathcal{N}}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta_{i}^{0},{N}_{\ell})}{\partial\theta^{(j)}}\right|\leq N12f𝒩(z+N(λθα0λθi0)|θi0)C(θi0,z)N\displaystyle{N}_{\ell}^{-\frac{1}{2}}f_{\mathcal{N}}(z_{\ell}+\sqrt{{N}_{\ell}}(\lambda_{\theta^{0}_{\alpha}}-\lambda_{\theta^{0}_{i}})|\theta_{i}^{0})C(\theta_{i}^{0},z){N}_{\ell}
\displaystyle\to 0,\displaystyle 0, (127)

where the inequality holds for sufficiently large \ell, C(θi0,z)C(\theta_{i}^{0},z) is a constant that depends only on θi0\theta_{i}^{0} and zz, and the last step follows from λθα0λθi0\lambda_{\theta^{0}_{\alpha}}\not=\lambda_{\theta^{0}_{i}}, which holds by condition 1) in the statement of the theorem.

When i=αi=\alpha,

Ns12f𝒩(Nz+Nλθα0|θi0,N)θ(j)\displaystyle{N}_{\ell}^{\frac{s-1}{2}}\frac{\partial f_{\mathcal{N}}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta_{i}^{0},{N}_{\ell})}{\partial\theta^{(j)}}
=\displaystyle= f𝒩(z|θα0)N(121det(Λθi0)det(Λθi0)θ(j)+(λθα0θ(j))Λθα01(Nz)12z(Λθα01θ(j))z)\displaystyle\frac{f_{\mathcal{N}}(z_{\ell}|\theta_{\alpha}^{0})}{\sqrt{{N}_{\ell}}}\left(-\frac{1}{2}\frac{1}{\text{det}\left(\Lambda_{\theta_{i}^{0}}\right)}\frac{\partial\text{det}\left(\Lambda_{\theta_{i}^{0}}\right)}{\partial\theta^{(j)}}+\left(\frac{\partial\lambda_{\theta_{\alpha}^{0}}}{\partial\theta^{(j)}}\right)^{\top}\Lambda_{\theta_{\alpha}^{0}}^{-1}(\sqrt{{N}_{\ell}}z_{\ell})-\frac{1}{2}z_{\ell}^{\top}\left(\frac{\partial\Lambda_{\theta_{\alpha}^{0}}^{-1}}{\partial\theta^{(j)}}\right)z_{\ell}\right)
\displaystyle\to f𝒩(z|θα0)(λθα0θ(j))Λθα01z.\displaystyle f_{\mathcal{N}}(z|\theta_{\alpha}^{0})\left(\frac{\partial\lambda_{\theta_{\alpha}^{0}}}{\partial\theta^{(j)}}\right)^{\top}\Lambda_{\theta_{\alpha}^{0}}^{-1}z. (128)

Plugging (127) and (128) into (126), and then combining with (121) and (113),

i=1k0piDfY(Nz+Nλθα0|θi,N)fY(Nz+Nλθα0|ηi,N)fY(Nz+Nλθα0|θα0,N)𝟏B(Nz+Nλθα0)\displaystyle\sum_{i=1}^{k_{0}}\frac{p_{i}^{\ell}}{D_{\ell}}\frac{f_{Y}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta_{i}^{\ell},{N}_{\ell})-f_{Y}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\eta^{\ell}_{i},{N}_{\ell})}{f_{Y}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta^{0}_{\alpha},{N}_{\ell})}\bm{1}_{{B_{\ell}}}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}})
\displaystyle\to pα0j=1qaα(j)(λθα0θ(j))Λθα01z\displaystyle p_{\alpha}^{0}\sum_{j=1}^{q}a_{\alpha}^{(j)}\left(\frac{\partial\lambda_{\theta_{\alpha}^{0}}}{\partial\theta^{(j)}}\right)^{\top}\Lambda_{\theta_{\alpha}^{0}}^{-1}z
=\displaystyle= pα0(Jλ(θα0)aα)Λθα01z.\displaystyle p_{\alpha}^{0}\left(J_{\lambda}(\theta_{\alpha}^{0})a_{\alpha}\right)^{\top}\Lambda_{\theta_{\alpha}^{0}}^{-1}z. (129)

Next, we turn to the second summation in the definition of Ψ\Psi_{\ell} in a similar fashion. By (117),

fY(Nz+Nλθα0|ηi,N)fY(Nz+Nλθα0|θα0,N)𝟏B(Nz+Nλθα0)\displaystyle\frac{f_{Y}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\eta^{\ell}_{i},{N}_{\ell})}{f_{Y}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta^{0}_{\alpha},{N}_{\ell})}\bm{1}_{{B_{\ell}}}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}})
=\displaystyle= Ns/2fY(Nz+Nλθα0|ηi,N)fZ(z|θα0,N)𝟏E(z)\displaystyle{N}_{\ell}^{s/2}\frac{f_{Y}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\eta^{\ell}_{i},{N}_{\ell})}{f_{Z}(z_{\ell}|\theta^{0}_{\alpha},{N}_{\ell})}\bm{1}_{{E_{\ell}}}(z_{\ell})
\displaystyle\leq Ns/2(j=1qf𝒩(Nz+Nλθα0|θi0,N)θ(j)((θi)(j)(θi0)(j))fZ(z|θα0,N)\displaystyle{N}_{\ell}^{s/2}\left(\frac{\sum_{j=1}^{q}\frac{\partial f_{\mathcal{N}}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta_{i}^{0},{N}_{\ell})}{\partial\theta^{(j)}}\left((\theta_{i}^{\ell})^{(j)}-(\theta_{i}^{0})^{(j)}\right)}{f_{Z}(z_{\ell}|\theta^{0}_{\alpha},{N}_{\ell})}\right.
+fY(Nz+Nλθα0|θi0,N)fZ(z|θα0,N))𝟏E(z)+J(ηi,θi0,N)fZ(z|θα0,N)𝟏E(z).\displaystyle\left.+\frac{f_{Y}(\sqrt{{N}_{\ell}}z_{\ell}+{N}_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta^{0}_{i},{N}_{\ell})}{f_{Z}(z_{\ell}|\theta^{0}_{\alpha},{N}_{\ell})}\right)\bm{1}_{{E_{\ell}}}(z_{\ell})+\frac{J(\eta_{i}^{\ell},\theta_{i}^{0},{N}_{\ell})}{f_{Z}(z_{\ell}|\theta^{0}_{\alpha},{N}_{\ell})}\bm{1}_{{E_{\ell}}}(z_{\ell}). (130)

Due to (124), by applying (120) with θ¯i,η¯i,N¯\bar{\theta}_{i}^{\ell},\bar{\eta}_{i}^{\ell},\bar{{N}}_{\ell} taken to be ηi,θi0,N\eta_{i}^{\ell},\theta_{i}^{0},{N}_{\ell} respectively, and by (110),

limJ(ηi,θi0,N)=0,\lim_{\ell\to\infty}J(\eta_{i}^{\ell},\theta_{i}^{0},{N}_{\ell})=0,

which together with (121) yield

limJ(ηi,θi0,N)fZ(z|θα0,N)𝟏E(z)=0.\lim_{\ell\to\infty}\frac{J(\eta_{i}^{\ell},\theta_{i}^{0},{N}_{\ell})}{f_{Z}(z_{\ell}|\theta^{0}_{\alpha},{N}_{\ell})}\bm{1}_{{E_{\ell}}}(z_{\ell})=0. (131)

Moreover, for any $i\in[k_{0}]$,

\begin{align}
&\left|N_{\ell}^{s/2}\sum_{j=1}^{q}\frac{\partial f_{\mathcal{N}}(\sqrt{N_{\ell}}z_{\ell}+N_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta_{i}^{0},N_{\ell})}{\partial\theta^{(j)}}\left((\theta_{i}^{\ell})^{(j)}-(\theta_{i}^{0})^{(j)}\right)\right|\nonumber\\
\leq\ &\max_{1\leq j\leq q}N_{\ell}^{(s-1)/2}\left|\frac{\partial f_{\mathcal{N}}(\sqrt{N_{\ell}}z_{\ell}+N_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta_{i}^{0},N_{\ell})}{\partial\theta^{(j)}}\right|\sqrt{q}\sqrt{N_{\ell}}\|\theta_{i}^{\ell}-\theta_{i}^{0}\|_{2}\nonumber\\
\to\ &0\tag{132}
\end{align}

by (127), (128) and (123).

Combining (130), (131), (132) and (113),

\begin{align}
&\lim_{\ell\to\infty}\sum_{i=1}^{k_{0}}\frac{p_{i}^{\ell}-\pi^{\ell}_{i}}{D_{\ell}}\frac{f_{Y}(\sqrt{N_{\ell}}z_{\ell}+N_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta^{0}_{i},N_{\ell})}{f_{Y}(\sqrt{N_{\ell}}z_{\ell}+N_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta^{0}_{\alpha},N_{\ell})}\bm{1}_{B_{\ell}}(\sqrt{N_{\ell}}z_{\ell}+N_{\ell}\lambda_{\theta^{0}_{\alpha}})\nonumber\\
=\ &\sum_{i=1}^{k_{0}}b_{i}\lim_{\ell\to\infty}N_{\ell}^{s/2}\frac{f_{Y}(\sqrt{N_{\ell}}z_{\ell}+N_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta^{0}_{i},N_{\ell})}{f_{Z}(z_{\ell}|\theta^{0}_{\alpha},N_{\ell})}\bm{1}_{E_{\ell}}(z_{\ell})\nonumber\\
=\ &\sum_{i=1}^{k_{0}}b_{i}\lim_{\ell\to\infty}\frac{f_{Z}(z_{\ell}+\sqrt{N_{\ell}}(\lambda_{\theta^{0}_{\alpha}}-\lambda_{\theta^{0}_{i}})|\theta^{0}_{i},N_{\ell})}{f_{Z}(z_{\ell}|\theta^{0}_{\alpha},N_{\ell})}\bm{1}_{E_{\ell}}(z_{\ell}),\tag{133}
\end{align}

where the last step is due to (117).

When $i=\alpha$, the term in the preceding display equals $\bm{1}_{E_{\ell}}(z_{\ell})$, which converges to $1$ as $\ell\rightarrow\infty$. When $i\not=\alpha$,

\begin{align}
&|f_{Z}(\sqrt{N_{\ell}}(\lambda_{\theta_{\alpha}^{0}}-\lambda_{\theta_{i}^{0}})+z_{\ell}|\theta^{0}_{i},N_{\ell})|\nonumber\\
\leq\ &\sup_{z^{\prime}\in\mathbb{R}^{s}}|f_{Z}(z^{\prime}|\theta^{0}_{i},N_{\ell})-f_{\mathcal{N}}(z^{\prime}|\theta^{0}_{i})|+f_{\mathcal{N}}(\sqrt{N_{\ell}}(\lambda_{\theta_{\alpha}^{0}}-\lambda_{\theta_{i}^{0}})+z_{\ell}|\theta^{0}_{i})\nonumber\\
\to\ &0,\tag{134}
\end{align}

where the last step follows from Lemma D.3 and $\lambda_{\theta_{\alpha}^{0}}\not=\lambda_{\theta_{i}^{0}}$ by condition 1) in the statement of the theorem. Plugging (134) and (121) into (133):

\[\lim_{\ell\to\infty}\sum_{i=1}^{k_{0}}\frac{p_{i}^{\ell}-\pi^{\ell}_{i}}{D_{\ell}}\frac{f_{Y}(\sqrt{N_{\ell}}z_{\ell}+N_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta^{0}_{i},N_{\ell})}{f_{Y}(\sqrt{N_{\ell}}z_{\ell}+N_{\ell}\lambda_{\theta^{0}_{\alpha}}|\theta^{0}_{\alpha},N_{\ell})}\bm{1}_{B_{\ell}}(\sqrt{N_{\ell}}z_{\ell}+N_{\ell}\lambda_{\theta^{0}_{\alpha}})=b_{\alpha}.\tag{135}\]

Finally, combine (135) and (129) to obtain

\[\lim_{\ell\to\infty}\Psi_{\ell}(z_{\ell})=\Psi(z)=p_{\alpha}^{0}\left(J_{\lambda}(\theta_{\alpha}^{0})a_{\alpha}\right)^{\top}\Lambda_{\theta_{\alpha}^{0}}^{-1}z+b_{\alpha}.\]

Thus (119) is established, which concludes the proof of the theorem. ∎

D.3 Bounds on characteristic functions for distributions with bounded density

To prove Lemma D.1, we need the next lemma, which is a generalization of the corollary to [39, Lemma 1]. It gives an upper bound on the magnitude of the characteristic function of a distribution with a bounded density with respect to Lebesgue measure.

Lemma D.2.

Consider a random vector $X\in\mathbb{R}^{d}$ with characteristic function $\phi(\zeta)$. Suppose $X$ has a density $f(x)$ with respect to Lebesgue measure bounded above by $U$, and has positive definite covariance matrix $\Lambda$. Then for all $\zeta\in\mathbb{R}^{d}$

\[|\phi(\zeta)|\leq\exp\left(-\frac{C(d)\|\zeta\|_{2}^{2}}{(\|\zeta\|_{2}^{2}\lambda_{\max}(\Lambda)+1)\lambda_{\max}^{d-1}(\Lambda)U^{2}}\right),\]

where $C(d)$ is some constant that depends only on $d$, and $\lambda_{\max}(\Lambda)$ is the largest eigenvalue of $\Lambda$.

Proof.

It suffices to prove the bound for $\zeta\not=\bm{0}$.

Step 1 In this step we prove the special case $\zeta=te_{1}$ for $t>0$, where $e_{1}$ is the first standard basis vector in $\mathbb{R}^{d}$. Define $I(\zeta)=\frac{1}{2}\left(1-|\phi(\zeta)|^{2}\right)$; since $1-x\leq e^{-x}$, it is easy to verify

\[|\phi(\zeta)|\leq\exp(-I(\zeta)).\tag{136}\]

Denote by $\tilde{f}$ the density w.r.t. Lebesgue measure of the symmetrized random vector $X-X^{\prime}$, where $X^{\prime}$ is an independent copy of $X$. Then $\tilde{f}$ is also bounded above by $U$, $|\phi(\zeta)|^{2}$ is the characteristic function of $X-X^{\prime}$, and

\[|\phi(\zeta)|^{2}=\int_{\mathbb{R}^{d}}e^{i\zeta^{\top}x}\tilde{f}(x)dx=\int_{\mathbb{R}^{d}}\cos(\zeta^{\top}x)\tilde{f}(x)dx.\tag{137}\]

Write $x=(x^{(1)},\ldots,x^{(d)})$ and let $G_{j}=\{x\in\mathbb{R}^{d}\,|\,x^{(1)}\in(\frac{j}{t}-\frac{1}{2t},\frac{j}{t}+\frac{1}{2t}]\}$ be the strip of width $\frac{1}{t}$ centered at $\frac{j}{t}$ along the $x^{(1)}$-axis. Then by (137)

\begin{align}
I(2\pi\zeta)=\ &\int_{\mathbb{R}^{d}}\sin^{2}(\pi\zeta^{\top}x)\tilde{f}(x)dx\nonumber\\
\geq\ &\int_{B}\sin^{2}(\pi tx^{(1)})\tilde{f}(x)dx\nonumber\\
=\ &\sum_{j=-\infty}^{\infty}\int_{G_{j}\bigcap B}\sin^{2}(\pi tx^{(1)})\tilde{f}(x)dx\nonumber\\
=\ &\sum_{j=-\infty}^{\infty}\int_{G_{j}\bigcap B}\sin^{2}(\pi t(x^{(1)}-j/t))\tilde{f}(x)dx\nonumber\\
\geq\ &4t^{2}\sum_{j=-\infty}^{\infty}\int_{G_{j}\bigcap B}(x^{(1)}-j/t)^{2}\tilde{f}(x)dx,\tag{138}
\end{align}

where the first inequality follows from $\zeta=te_{1}$ and $B$ is a subset of $\mathbb{R}^{d}$ to be determined, and the last inequality follows from $|\sin(\pi x)|\geq 2|x|$ for $|x|\leq\frac{1}{2}$.

Let $B=\{z\in\mathbb{R}^{d}\,|\,|z^{(i)}|<2\sqrt{d\lambda_{\max}(\Lambda)}\ \forall i\geq 2,\text{ and }|z^{(1)}|<\frac{r}{t}+\frac{1}{2t}\}$ with $r=\min\{b\in\mathbb{Z}:\frac{b}{t}+\frac{1}{2t}\geq 2\sqrt{d\lambda_{\max}(\Lambda)}\}$. Then $B\subset\bigcup_{j=-r}^{r}G_{j}$ and thus (138) becomes

\begin{align}
I(2\pi\zeta)\geq\ &4t^{2}\sum_{j=-r}^{r}\int_{G_{j}\bigcap B}(x^{(1)}-j/t)^{2}\tilde{f}(x)dx\nonumber\\
\overset{(*)}{=}\ &4t^{2}\sum_{j=-r}^{r}\int_{G}(x^{(1)}-j/t)^{2}\tilde{f}(x)\bm{1}_{G_{j}\bigcap B}(x)dx\nonumber\\
\overset{(**)}{\geq}\ &4t^{2}\sum_{j=-r}^{r}\frac{Q_{j}^{3}}{12U^{2}(4\sqrt{d\lambda_{\max}(\Lambda)})^{2(d-1)}}\nonumber\\
\overset{(***)}{\geq}\ &4t^{2}\frac{Q^{3}}{12(2r+1)^{2}U^{2}(4\sqrt{d\lambda_{\max}(\Lambda)})^{2(d-1)}},\tag{139}
\end{align}

where in step $(*)$ $G=\{z\in\mathbb{R}^{d}\,|\,|z^{(i)}|<2\sqrt{d\lambda_{\max}(\Lambda)}\ \forall i\geq 2\}$, step $(**)$ with $Q_{j}=\int_{G_{j}\bigcap B}\tilde{f}(x)dx$ follows from Lemma D.4 b), and step $(***)$ with $Q=\sum_{j=-r}^{r}Q_{j}=\int_{B}\tilde{f}(x)dx$ follows from Jensen's inequality. The inequalities in steps $(**)$ and $(***)$ are attained with $\tilde{f}(x)=U\sum_{j=-r}^{r}\bm{1}_{W_{j}}(x)\ a.e.\ x\in G$, where $W_{j}=\{z\,|\,|z^{(i)}|<2\sqrt{d\lambda_{\max}(\Lambda)}\ \forall i\geq 2,\text{ and }|z^{(1)}-j/t|<a\}$ for a positive $a$ satisfying

\[(2a)(4\sqrt{d\lambda_{\max}(\Lambda)})^{d-1}U(2r+1)=Q.\]

Observe $\{z\in\mathbb{R}^{d}\,|\,z^{\top}(2\Lambda)^{-1}z<2d\}\subset B$ and thus

\[Q=P(X-X^{\prime}\in B)\geq 1-P((X-X^{\prime})^{\top}(2\Lambda)^{-1}(X-X^{\prime})\geq 2d)\geq\frac{1}{2},\]

where the last step follows from Markov's inequality, since $\mathbb{E}(X-X^{\prime})^{\top}(2\Lambda)^{-1}(X-X^{\prime})=\operatorname{tr}\left((2\Lambda)^{-1}\cdot 2\Lambda\right)=d$. Moreover, by our choice of $r$, $2r+1\leq 4t\sqrt{d\lambda_{\max}(\Lambda)}+2$. Then (139) becomes

\begin{align*}
I(2\pi\zeta)\geq\ &t^{2}\frac{1}{24(4t\sqrt{d\lambda_{\max}(\Lambda)}+2)^{2}(4\sqrt{d\lambda_{\max}(\Lambda)})^{2(d-1)}U^{2}}\\
\geq\ &\frac{C(d)t^{2}}{(t^{2}\lambda_{\max}(\Lambda)+1)\lambda_{\max}^{d-1}(\Lambda)U^{2}},
\end{align*}

where $C(d)$ is a constant that depends only on $d$. The last display with $2\pi\zeta=2\pi te_{1}$ replaced by $\zeta=te_{1}$ (i.e. with $t$ rescaled by $2\pi$, which only changes the constant $C(d)$), together with (136), yields the desired conclusion.
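The bound $2r+1\leq 4t\sqrt{d\lambda_{\max}(\Lambda)}+2$ used above can be spelled out as a short verification (nothing new here; it only unpacks the minimality of $r$):

```latex
\[
\frac{r-1}{t}+\frac{1}{2t}<2\sqrt{d\lambda_{\max}(\Lambda)}
\;\Longrightarrow\;
r<2t\sqrt{d\lambda_{\max}(\Lambda)}+\frac{1}{2}
\;\Longrightarrow\;
2r+1<4t\sqrt{d\lambda_{\max}(\Lambda)}+2,
\]
```

where the first inequality holds because the integer $r-1$ fails the defining condition of $r$.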

Step 2 For any $\zeta\not=0$, denote $t=\|\zeta\|_{2}$ and $u_{1}=\zeta/\|\zeta\|_{2}$. Consider an orthogonal matrix $U_{\zeta}$ with first row $u_{1}^{\top}$. Then $\phi(\zeta)=\mathbb{E}e^{itu_{1}^{\top}X}=\mathbb{E}e^{ite_{1}^{\top}Z}$ where $Z=U_{\zeta}X$. Since $Z$ has density $f_{Z}(z)=f(U_{\zeta}^{\top}z)$ with respect to Lebesgue measure, $f_{Z}(z)$ has the same upper bound $U$, and $Z$ has positive definite covariance matrix $U_{\zeta}\Lambda U_{\zeta}^{\top}$, whose largest eigenvalue equals that of $\Lambda$. The result then follows by applying Step 1 to $\left|\mathbb{E}e^{ite_{1}^{\top}Z}\right|$. ∎
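As an illustrative numerical sanity check of Lemma D.2 (not part of the proof), one can compare both sides for a concrete distribution whose characteristic function is known in closed form. The sketch below takes $X$ uniform on $[0,1]^{2}$, so $U=1$, $\Lambda=I/12$ and $|\phi(\zeta)|=\prod_{j}|\sin(\zeta^{(j)}/2)/(\zeta^{(j)}/2)|$; the value $C=10^{-3}$ standing in for $C(d)$ is a hypothetical small constant chosen for this example, since the lemma only asserts existence of $C(d)$.

```python
import numpy as np

# Sanity check of Lemma D.2 for X ~ Uniform([0,1]^2): density bound U = 1,
# covariance Lambda = I/12, characteristic function factorizes coordinatewise.

def abs_cf_uniform2(z1, z2):
    # |phi(zeta)| for the uniform distribution on the unit square
    def factor(t):
        t = np.asarray(t, dtype=float)
        safe = np.where(t == 0, 1.0, t / 2)          # avoid division by zero
        return np.where(t == 0, 1.0, np.abs(np.sin(t / 2) / safe))
    return factor(z1) * factor(z2)

d, U, lam_max, C = 2, 1.0, 1.0 / 12, 1e-3            # C is a hypothetical constant
grid = np.linspace(-30, 30, 401)
Z1, Z2 = np.meshgrid(grid, grid)
n2 = Z1**2 + Z2**2                                   # ||zeta||_2^2 on the grid
lhs = abs_cf_uniform2(Z1, Z2)
rhs = np.exp(-C * n2 / ((n2 * lam_max + 1) * lam_max ** (d - 1) * U**2))
ok = bool(np.all(lhs <= rhs + 1e-12))
print(ok)
```

The bound has the right shape in both regimes: near $\zeta=\bm{0}$ the right-hand side behaves like $e^{-c\|\zeta\|_{2}^{2}}$, and for large $\|\zeta\|_{2}$ it levels off at a constant strictly below $1$.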

D.4 Auxiliary lemmas for Sections D.2 and D.3

Consider a family of probabilities $\{P_{\theta}\}_{\theta\in\Theta}$ on $\mathbb{R}^{d}$, where $\theta$ is the parameter of the family and $\Theta\subset\mathbb{R}^{q}$ is the parameter space. $\mathbb{E}_{\theta}$ denotes the expectation under the probability measure $P_{\theta}$. Let $\{X_{i}\}_{i=1}^{\infty}$ be a sequence of independent and identically distributed random vectors from $P_{\theta_{0}}$. Suppose $\mathbb{E}_{\theta_{0}}X_{1}$ exists and define $Z_{N}=\frac{\sum_{i=1}^{N}X_{i}-N\mathbb{E}_{\theta_{0}}X_{1}}{\sqrt{N}}$. The next result establishes that the density of $Z_{N}$ converges uniformly to that of a multivariate normal distribution.

Lemma D.3 (Local Central Limit Theorem).

Let $\{X_{i}\}_{i=1}^{\infty}$ be a sequence of independent and identically distributed random vectors from $P_{\theta_{0}}$. Suppose $\mathbb{E}_{\theta_{0}}X_{1}$ and $\Lambda_{\theta_{0}}:=\mathbb{E}_{\theta_{0}}(X_{1}-\mathbb{E}_{\theta_{0}}X_{1})(X_{1}-\mathbb{E}_{\theta_{0}}X_{1})^{\top}$ exist and $\Lambda_{\theta_{0}}$ is positive definite. Let the characteristic function of $P_{\theta}$ be $\phi(\zeta|\theta):=\mathbb{E}_{\theta}e^{i\zeta^{\top}X_{1}}$ and suppose there exists $r\geq 1$ such that $|\phi(\zeta|\theta_{0})|^{r}$ is Lebesgue integrable on $\mathbb{R}^{d}$. Then for $N\geq r$, $Z_{N}$ has a density with respect to Lebesgue measure on $\mathbb{R}^{d}$, and this density $f_{Z}(z|\theta_{0},N)$ converges uniformly in $z$, as $N$ tends to infinity, to $f_{\mathcal{N}}(z|\theta_{0})$, the density of $\mathcal{N}(\bm{0},\Lambda_{\theta_{0}})$.

The special case $d=1$ of the above lemma is Theorem 2 in Section 5, Chapter XV of [14]. That proof can be generalized to $d>1$ without much difficulty, and therefore the proof of Lemma D.3 is omitted.
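Lemma D.3 can be illustrated numerically (a sanity check under simple assumptions, not part of the proof): for $X_{i}\sim\mathrm{Uniform}(0,1)$ and $d=1$, the density of $Z_{N}$ can be approximated by repeated discrete convolution and compared in sup norm with the $\mathcal{N}(0,1/12)$ density, and the gap should shrink as $N$ grows.

```python
import numpy as np

# Local CLT illustration for X_i ~ Uniform(0,1): the density of
# Z_N = (sum_i X_i - N/2)/sqrt(N) should converge uniformly to N(0, 1/12).
# The density of the sum is approximated by repeated convolution on a grid.

def sup_gap(N, h=2e-3):
    x = np.arange(0.0, 1.0 + h / 2, h)
    f = np.ones_like(x)                  # Uniform(0,1) density on the grid
    g = f.copy()
    for _ in range(N - 1):
        g = np.convolve(g, f) * h        # density of the partial sum
        g /= g.sum() * h                 # renormalize to total mass 1
    s = np.arange(g.size) * h            # support grid of the sum
    z = (s - N / 2) / np.sqrt(N)         # standardized points
    fz = np.sqrt(N) * g                  # density of Z_N (change of variables)
    phi = np.sqrt(6 / np.pi) * np.exp(-6 * z**2)   # N(0, 1/12) density
    return float(np.max(np.abs(fz - phi)))

gaps = [sup_gap(N) for N in (2, 4, 8, 16)]
print(all(g1 > g2 for g1, g2 in zip(gaps, gaps[1:])))   # gaps shrink with N
```

Here the integrability condition of the lemma holds with $r=2$, since $|\phi(\zeta|\theta_{0})|\leq\min(1,2/|\zeta|)$ for the uniform distribution.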

Lemma D.4.
a) Consider a Lebesgue measurable function $f$ on $\mathbb{R}$ satisfying $0\leq f(x)\leq U$ and $\int_{\mathbb{R}}f(x)dx=E\in(0,\infty)$. Then for any $b>0$

\[\int_{\mathbb{R}}(x-b)^{2}f(x)dx\geq\frac{E^{3}}{12U^{2}},\]

and the equality holds if and only if $f(x)=U\bm{1}_{[b-\frac{E}{2U},b+\frac{E}{2U}]}(x)\ a.e.$

b) For $a>0$ define the set $G=\{z\in\mathbb{R}^{d}\,|\,|z^{(i)}|<a\ \forall i\geq 2\}$. Consider a Lebesgue measurable function $f$ on $\mathbb{R}^{d}$ satisfying $0\leq f(x)\leq U$ on $G$ and $\int_{G}f(x)dx=E\in(0,\infty)$. Then for any $b>0$

\[\int_{G}(x^{(1)}-b)^{2}f(x)dx\geq\frac{E^{3}}{12U^{2}(2a)^{2(d-1)}},\]

and the equality holds if and only if $f(x)=U\bm{1}_{G_{1}}(x)\ a.e.\ x\in G$, where $G_{1}=[b-\frac{E}{2U(2a)^{d-1}},b+\frac{E}{2U(2a)^{d-1}}]\times(-a,a)^{d-1}$.
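The extremal statement in part a) can be checked numerically; the following sketch (illustrative only, with arbitrarily chosen $U=2$, $E=1$, $b=0.3$) verifies that the indicator block attains the lower bound $E^{3}/(12U^{2})$ while a generic admissible density exceeds it.

```python
import numpy as np

# Check of Lemma D.4 a): among 0 <= f <= U with total mass E, the second
# moment about b is minimized by the height-U block centered at b.
U, E, b, h = 2.0, 1.0, 0.3, 1e-4
x = np.arange(-3.0, 3.0, h)

def second_moment(f):
    return float(np.sum((x - b) ** 2 * f) * h)

lower = E**3 / (12 * U**2)
# extremal density: height U on [b - E/(2U), b + E/(2U)]
f_star = U * ((x >= b - E / (2 * U)) & (x <= b + E / (2 * U)))
attained = abs(second_moment(f_star) - lower) < 1e-3

# a generic admissible density of the same mass has a larger second moment
rng = np.random.default_rng(0)
f = rng.uniform(0.0, U, x.size)
f *= E / (np.sum(f) * h)        # renormalize to mass E; heights stay <= U here
exceeds = second_moment(f) > lower
print(attained, exceeds)
```

With these parameters the block is $[0.05,0.55]$ and the minimum equals $1/48$, in agreement with the closed form.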

D.5 Proof of the technical lemma in the proof of Theorem 5.16

Lemma D.1 plays an essential role in the proof of Theorem 5.16 presented in Section D.2.

Proof of Lemma D.1.

We will write $\theta_{i}^{\ell},\eta_{i}^{\ell},N_{\ell}$ respectively for $\bar{\theta}_{i}^{\ell},\bar{\eta}_{i}^{\ell},\bar{N}_{\ell}$ in this proof. Note that $\theta_{i}^{\ell},\eta_{i}^{\ell},N_{\ell}$ are generic variables here and need not be the same as those in the proof of Theorem 5.16. Let $\bar{\Theta}(G_{0})$ be the same as in the first paragraph of Step 3 in the proof of Theorem 5.16.

For any $\theta\in\bar{\Theta}(G_{0})$, by condition (A1) $\left.\nabla_{\zeta}\,\phi_{T}(\zeta|\theta)\right|_{\zeta=\bm{0}}=\bm{i}\lambda_{\theta}$ and $\left.\textbf{Hess}_{\zeta}\,\phi_{T}(\zeta|\theta)\right|_{\zeta=\bm{0}}=\bm{i}^{2}\left(\Lambda_{\theta}+\lambda_{\theta}\lambda_{\theta}^{\top}\right)$ exist, and by condition (A2) $\frac{\partial\lambda_{\theta}}{\partial\theta^{(j)}}$ and $\frac{\partial\Lambda_{\theta}}{\partial\theta^{(j)}}$ exist. Then, with condition (A1), it follows from Pratt's Lemma that $\frac{\partial f_{\mathcal{N}}(y|\theta,N)}{\partial\theta^{(j)}}$ exists and is given by

\[\frac{\partial f_{\mathcal{N}}(y|\theta,N)}{\partial\theta^{(j)}}=\frac{1}{(2\pi)^{s}}\int_{\mathbb{R}^{s}}e^{-\bm{i}\zeta^{\top}y}e^{\bm{i}N\zeta^{\top}\lambda_{\theta}-\frac{N}{2}\zeta^{\top}\Lambda_{\theta}\zeta}\left(\bm{i}N\zeta^{\top}\frac{\partial\lambda_{\theta}}{\partial\theta^{(j)}}-\frac{N}{2}\zeta^{\top}\frac{\partial\Lambda_{\theta}}{\partial\theta^{(j)}}\zeta\right)d\zeta.\tag{140}\]

Plugging the Fourier inversion formula (115) and (140) into (120), and noting $|e^{-\bm{i}\zeta^{\top}y}|\leq 1$ for all $y\in\mathbb{R}^{s}$, for sufficiently large $\ell$ we obtain

\begin{align*}
&J(\theta_{i}^{\ell},\eta_{i}^{\ell},N_{\ell})\\
\leq\ &N_{\ell}^{s/2}\frac{1}{(2\pi)^{s}}\int_{\mathbb{R}^{s}}\left|(\phi_{T}(\zeta|\theta_{i}^{\ell}))^{N_{\ell}}-(\phi_{T}(\zeta|\eta_{i}^{\ell}))^{N_{\ell}}-N_{\ell}e^{\bm{i}N_{\ell}\zeta^{\top}\lambda_{\theta_{i}^{0}}-\frac{N_{\ell}}{2}\zeta^{\top}\Lambda_{\theta_{i}^{0}}\zeta}\sum_{j=1}^{q}\left((\theta_{i}^{\ell})^{(j)}-(\eta_{i}^{\ell})^{(j)}\right)\left(\bm{i}\zeta^{\top}\frac{\partial\lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}-\frac{1}{2}\zeta^{\top}\frac{\partial\Lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}\zeta\right)\right|d\zeta\\
\leq\ &\check{J}_{\ell}+\hat{J}_{\ell},
\end{align*}

where

\[\check{J}_{\ell}:=\frac{N_{\ell}^{s/2}}{(2\pi)^{s}}\int_{\mathbb{R}^{s}}\left|(\phi_{T}(\zeta|\theta_{i}^{\ell}))^{N_{\ell}}-(\phi_{T}(\zeta|\eta_{i}^{\ell}))^{N_{\ell}}-N_{\ell}\left(\phi_{T}(\zeta|\theta_{i}^{0})\right)^{N_{\ell}-1}\sum_{j=1}^{q}\left((\theta_{i}^{\ell})^{(j)}-(\eta_{i}^{\ell})^{(j)}\right)\frac{\partial\phi_{T}(\zeta|\theta_{i}^{0})}{\partial\theta^{(j)}}\right|d\zeta,\]

and

\[\hat{J}_{\ell}:=N_{\ell}^{s/2+1}\frac{1}{(2\pi)^{s}}\sum_{j=1}^{q}\left|(\theta_{i}^{\ell})^{(j)}-(\eta_{i}^{\ell})^{(j)}\right|\int_{\mathbb{R}^{s}}\left|\left(\phi_{T}(\zeta|\theta_{i}^{0})\right)^{N_{\ell}-1}\frac{\partial\phi_{T}(\zeta|\theta_{i}^{0})}{\partial\theta^{(j)}}-\exp\left(\bm{i}N_{\ell}\zeta^{\top}\lambda_{\theta_{i}^{0}}-\frac{N_{\ell}}{2}\zeta^{\top}\Lambda_{\theta_{i}^{0}}\zeta\right)\left(\bm{i}\zeta^{\top}\frac{\partial\lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}-\frac{1}{2}\zeta^{\top}\frac{\partial\Lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}\zeta\right)\right|d\zeta.\]

We will show in the sequel that $\check{J}_{\ell}=o(\sqrt{N_{\ell}}\|\theta_{i}^{\ell}-\eta_{i}^{\ell}\|_{2})$ in Step 1 and $\hat{J}_{\ell}=o(\sqrt{N_{\ell}}\|\theta_{i}^{\ell}-\eta_{i}^{\ell}\|_{2})$ in Step 2, thereby establishing (120).

Step 1 (Prove $\check{J}_{\ell}=o(\sqrt{N_{\ell}}\|\theta_{i}^{\ell}-\eta_{i}^{\ell}\|_{2})$)
By Condition (A3) and Lemma B.2 b),

\[\check{J}_{\ell}\leq\frac{N_{\ell}^{s/2}}{(2\pi)^{s}}\int_{\mathbb{R}^{s}}\left|q\sum_{1\leq j,\beta\leq q}\left(\|\theta_{i}^{\ell}-\theta_{i}^{0}\|_{2}+\|\eta_{i}^{\ell}-\theta_{i}^{0}\|_{2}\right)\|\theta_{i}^{\ell}-\eta_{i}^{\ell}\|_{2}R_{1}(\zeta;\theta_{i}^{0},\theta_{i}^{\ell},\eta_{i}^{\ell},j,\beta)\right|d\zeta,\tag{141}\]

where, with $\theta_{\ell}(t_{1},t_{2})=\theta_{i}^{0}+t_{2}(\eta_{i}^{\ell}+t_{1}(\theta_{i}^{\ell}-\eta_{i}^{\ell})-\theta_{i}^{0})$,

\begin{align*}
&R_{1}(\zeta;\theta_{i}^{0},\theta_{i}^{\ell},\eta_{i}^{\ell},j,\beta)\\
=\ &\int_{0}^{1}\int_{0}^{1}\left|N_{\ell}(N_{\ell}-1)\left(\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))\right)^{N_{\ell}-2}\frac{\partial\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(j)}}\frac{\partial\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(\beta)}}\right.\\
&\quad\left.+N_{\ell}\left(\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))\right)^{N_{\ell}-1}\frac{\partial^{2}\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(j)}\partial\theta^{(\beta)}}\right|dt_{2}dt_{1}.
\end{align*}

Then

\begin{align}
&\int_{\mathbb{R}^{s}}\left|R_{1}(\zeta;\theta_{i}^{0},\theta_{i}^{\ell},\eta_{i}^{\ell},j,\beta)\right|d\zeta\nonumber\\
\leq\ &N_{\ell}\int_{\mathbb{R}^{s}}\int_{0}^{1}\int_{0}^{1}\left|\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))\right|^{N_{\ell}-2}\left(N_{\ell}\left|\frac{\partial\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(j)}}\frac{\partial\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(\beta)}}\right|+\left|\frac{\partial^{2}\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(j)}\partial\theta^{(\beta)}}\right|\right)dt_{2}dt_{1}d\zeta\nonumber\\
=\ &N_{\ell}\int_{0}^{1}\int_{0}^{1}\int_{\mathbb{R}^{s}}\left|\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))\right|^{N_{\ell}-2}\left(N_{\ell}\left|\frac{\partial\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(j)}}\frac{\partial\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(\beta)}}\right|+\left|\frac{\partial^{2}\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(j)}\partial\theta^{(\beta)}}\right|\right)d\zeta dt_{2}dt_{1}\nonumber\\
=:\ &N_{\ell}R_{2}(\theta_{i}^{0},\theta_{i}^{\ell},\eta_{i}^{\ell},j,\beta),\tag{142}
\end{align}

where the inequality follows from the fact that $|\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))|\leq 1$, and the interchange of the order of integration follows from condition (A3), Tonelli's theorem, and the joint Lebesgue measurability of $\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))$, $\frac{\partial\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(j)}}$ and $\frac{\partial^{2}\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(j)}\partial\theta^{(\beta)}}$, as functions of $\zeta$, $t_{1}$ and $t_{2}$, by [2, Lemma 4.51]. Then following (141) and (142),

\begin{align}
\check{J}_{\ell}\leq\ &C(q,s)N_{\ell}^{s/2+1}\|\theta_{i}^{\ell}-\eta_{i}^{\ell}\|_{2}\left(\|\theta_{i}^{\ell}-\theta_{i}^{0}\|_{2}+\|\eta_{i}^{\ell}-\theta_{i}^{0}\|_{2}\right)\max_{1\leq j,\beta\leq q}R_{2}(\theta_{i}^{0},\theta_{i}^{\ell},\eta_{i}^{\ell},j,\beta)\nonumber\\
=\ &C(q,s)N_{\ell}\|\theta_{i}^{\ell}-\eta_{i}^{\ell}\|_{2}(\|\theta_{i}^{\ell}-\theta_{i}^{0}\|_{2}+\|\eta_{i}^{\ell}-\theta_{i}^{0}\|_{2})\max_{j,\beta}\int_{0}^{1}\int_{0}^{1}\int\left|\phi_{T}\left(\left.\frac{\bar{\zeta}}{\sqrt{N_{\ell}}}\right|\theta_{\ell}(t_{1},t_{2})\right)\right|^{N_{\ell}-2}\nonumber\\
&\times\left(N_{\ell}\left|\frac{\partial\phi_{T}(\frac{\bar{\zeta}}{\sqrt{N_{\ell}}}|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(j)}}\frac{\partial\phi_{T}(\frac{\bar{\zeta}}{\sqrt{N_{\ell}}}|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(\beta)}}\right|+\left|\frac{\partial^{2}\phi_{T}(\frac{\bar{\zeta}}{\sqrt{N_{\ell}}}|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(j)}\partial\theta^{(\beta)}}\right|\right)d\bar{\zeta}dt_{2}dt_{1},\tag{143}
\end{align}

where in the inequality $C(q,s)$ is some constant that depends on $q$ and $s$, and the equality follows from (142) and the change of variables $\bar{\zeta}=\sqrt{N_{\ell}}\zeta$. Denote the integrand in the last display by $E_{j,\beta}(\bar{\zeta},t_{1},t_{2})$.

In the rest of the proof, denote the left-hand sides of (32) and (33) by $U_{1}(\theta_{0})$ and $U_{2}(\theta_{0})$ respectively, for every $\theta_{0}\in\Theta_{1}=\{\theta_{i}^{0}\}_{i=1}^{k_{0}}$.

Observe that $f_{Y}(y|\theta_{\ell}(t_{1},t_{2}),r)$ exists and is bounded above by $\frac{1}{(2\pi)^{s}}\int_{\mathbb{R}^{s}}|\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))|^{r}d\zeta\leq C(s)U_{2}(\theta_{i}^{0})$ by condition (A3). Then invoking Lemma D.2, for $\|\zeta\|_{2}\leq 1$,

\begin{align}
|\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))|^{r}\leq\ &\exp\left(-\frac{C(s)\|\zeta\|_{2}^{2}}{(\lambda_{\max}(\Lambda_{\theta_{\ell}(t_{1},t_{2})})+1)\lambda_{\max}^{s-1}(\Lambda_{\theta_{\ell}(t_{1},t_{2})})U_{2}^{2}(\theta_{i}^{0})}\right)\nonumber\\
\leq\ &\exp\left(-\frac{C(s)\|\zeta\|_{2}^{2}}{U_{3}(\theta_{i}^{0})U_{2}^{2}(\theta_{i}^{0})}\right),\tag{144}
\end{align}

where the last step follows from $(\lambda_{\max}(\Lambda_{\theta_{\ell}(t_{1},t_{2})})+1)\lambda_{\max}^{s-1}(\Lambda_{\theta_{\ell}(t_{1},t_{2})})\leq U_{3}(\theta_{i}^{0})$ by condition (A1), with $U_{3}(\theta_{i}^{0})$ being some constant that depends on $\theta_{i}^{0}$.

Moreover, by the mean value theorem and condition (A3), for all $\|\zeta\|_{2}<1$,

\[\left|\frac{\partial\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(j)}}\right|=\left|\frac{\partial\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(j)}}-\frac{\partial\phi_{T}(0|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(j)}}\right|\leq\|\zeta\|_{2}\sup_{\|\zeta^{\prime}\|_{2}<1}\left\|\nabla_{\zeta}\frac{\partial\phi_{T}(\zeta^{\prime}|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(j)}}\right\|_{2}\leq\sqrt{s}U_{1}(\theta_{i}^{0})\|\zeta\|_{2}.\tag{145}\]

Then

\begin{align}
&\int_{\|\bar{\zeta}\|_{2}<\sqrt{N_{\ell}}}E_{j,\beta}(\bar{\zeta},t_{1},t_{2})d\bar{\zeta}\nonumber\\
\leq\ &\int_{\|\bar{\zeta}\|_{2}<\sqrt{N_{\ell}}}\exp\left(-\frac{C(s)\|\bar{\zeta}\|_{2}^{2}}{rU_{3}(\theta_{i}^{0})U_{2}^{2}(\theta_{i}^{0})}\frac{N_{\ell}-2}{N_{\ell}}\right)\left(\left(\sqrt{s}U_{1}(\theta_{i}^{0})\right)^{2}\|\bar{\zeta}\|_{2}^{2}+U_{1}(\theta_{i}^{0})\right)d\bar{\zeta}\nonumber\\
\leq\ &\int_{\mathbb{R}^{s}}\exp\left(-\frac{C(s)\|\bar{\zeta}\|_{2}^{2}}{2rU_{3}(\theta_{i}^{0})U_{2}^{2}(\theta_{i}^{0})}\right)\left(\left(\sqrt{s}U_{1}(\theta_{i}^{0})\right)^{2}\|\bar{\zeta}\|_{2}^{2}+U_{1}(\theta_{i}^{0})\right)d\bar{\zeta}\nonumber\\
=\ &C(s,r,\theta_{i}^{0}),\tag{146}
\end{align}

where the first inequality follows from (144) and (145).

Let $\eta:=\sup_{\|\zeta\|_{2}\geq 1}|\phi_{T}(\zeta|\theta_{i}^{0})|$. Since the density $f_{Y}(y|\theta_{i}^{0},r)$ w.r.t. Lebesgue measure exists and has characteristic function $\phi_{T}^{r}(\zeta|\theta_{i}^{0})$, we have $\phi_{T}^{r}(\zeta|\theta_{i}^{0})\to 0$ as $\|\zeta\|_{2}\to\infty$ by the Riemann–Lebesgue lemma, so the supremum defining $\eta$ is attained. Moreover, the existence of the density $f_{Y}(y|\theta_{i}^{0},r)$ w.r.t. Lebesgue measure, together with Lemma 4 in Section 1, Chapter XV of [14], yields $|\phi_{T}(\zeta|\theta_{i}^{0})|^{r}<1$ when $\zeta\not=\bm{0}$. It follows that $\eta<1$. By the mean value theorem and (A3),

\[\sup_{\zeta\in\mathbb{R}^{s}}|\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))-\phi_{T}(\zeta|\theta_{i}^{0})|\leq\sqrt{q}U_{1}(\theta_{i}^{0})(\|\theta_{i}^{\ell}-\theta_{i}^{0}\|_{2}+\|\eta_{i}^{\ell}-\theta_{i}^{0}\|_{2}),\]

which further implies $\sup_{t_{1},t_{2}\in[0,1]}\sup_{\|\zeta\|_{2}\geq 1}|\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))|<\eta+\frac{1-\eta}{2}=:\eta^{\prime}<1$ for sufficiently large $\ell$. Then for sufficiently large $\ell$,

\begin{align}
&\int_{\|\bar{\zeta}\|_{2}\geq\sqrt{N_{\ell}}}E_{j,\beta}(\bar{\zeta},t_{1},t_{2})d\bar{\zeta}\nonumber\\
\leq\ &\left(\eta^{\prime}\right)^{N_{\ell}-2-r}\int_{\mathbb{R}^{s}}\left|\phi_{T}\left(\left.\frac{\bar{\zeta}}{\sqrt{N_{\ell}}}\right|\theta_{\ell}(t_{1},t_{2})\right)\right|^{r}\left(N_{\ell}U_{1}^{2}(\theta_{i}^{0})+\left|\frac{\partial^{2}\phi_{T}(\frac{\bar{\zeta}}{\sqrt{N_{\ell}}}|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(j)}\partial\theta^{(\beta)}}\right|\right)d\bar{\zeta}\nonumber\\
\leq\ &\left(\eta^{\prime}\right)^{N_{\ell}-2-r}N_{\ell}^{s/2}(N_{\ell}U_{1}^{2}(\theta_{i}^{0})+1)\int_{\mathbb{R}^{s}}\left|\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))\right|^{r}\left(1+\left|\frac{\partial^{2}\phi_{T}(\zeta|\theta_{\ell}(t_{1},t_{2}))}{\partial\theta^{(j)}\partial\theta^{(\beta)}}\right|\right)d\zeta\nonumber\\
\leq\ &\left(\eta^{\prime}\right)^{N_{\ell}-2-r}N_{\ell}^{s/2}\left(N_{\ell}U_{1}^{2}(\theta_{i}^{0})+1\right)U_{2}(\theta_{i}^{0}),\tag{147}
\end{align}

where the first inequality follows from the definition of $\eta^{\prime}$ and condition (A3), and the last inequality follows from condition (A3). Now (146) and (147) immediately imply, for any $j$, $\beta$:

\[\limsup_{\ell\to\infty}\int_{0}^{1}\int_{0}^{1}\int_{\mathbb{R}^{s}}E_{j,\beta}(\bar{\zeta},t_{1},t_{2})d\bar{\zeta}dt_{2}dt_{1}<\infty.\tag{148}\]

The above display, together with (143) and the conditions $\sqrt{N_{\ell}}\|\theta_{i}^{\ell}-\theta_{i}^{0}\|_{2},\sqrt{N_{\ell}}\|\eta_{i}^{\ell}-\theta_{i}^{0}\|_{2}\to 0$, yields $\check{J}_{\ell}=o(\sqrt{N_{\ell}}\|\theta_{i}^{\ell}-\eta_{i}^{\ell}\|_{2})$.

Step 2 (Prove $\hat{J}_{\ell}=o(\sqrt{N_{\ell}}\|\theta_{i}^{\ell}-\eta_{i}^{\ell}\|_{2})$). A large portion of this step borrows ideas from Theorem 2 in Chapter XV, Section 5 of [14]. Observe

\[\hat{J}_{\ell}\leq\sqrt{N_{\ell}}\|\theta_{i}^{\ell}-\eta_{i}^{\ell}\|_{2}\frac{\sqrt{q}}{(2\pi)^{s}}\max_{1\leq j\leq q}K_{\ell}(j),\tag{149}\]

where, with the change of variables $\bar{\zeta}=\sqrt{N_{\ell}}\zeta$ applied in the last step as before,

\begin{align}
K_{\ell}(j):=\ &N_{\ell}^{\frac{s+1}{2}}\int_{\mathbb{R}^{s}}\left|\left(\phi_{T}(\zeta|\theta_{i}^{0})\right)^{N_{\ell}-1}\frac{\partial\phi_{T}(\zeta|\theta_{i}^{0})}{\partial\theta^{(j)}}-\exp\left(\bm{i}N_{\ell}\zeta^{\top}\lambda_{\theta_{i}^{0}}-\frac{N_{\ell}}{2}\zeta^{\top}\Lambda_{\theta_{i}^{0}}\zeta\right)\left(\bm{i}\zeta^{\top}\frac{\partial\lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}-\frac{1}{2}\zeta^{\top}\frac{\partial\Lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}\zeta\right)\right|d\zeta\nonumber\\
=\ &N_{\ell}^{\frac{s+1}{2}}\int_{\mathbb{R}^{s}}\left|\left(e^{-\bm{i}\zeta^{\top}\lambda_{\theta_{i}^{0}}}\phi_{T}(\zeta|\theta_{i}^{0})\right)^{N_{\ell}-1}\frac{\partial\phi_{T}(\zeta|\theta_{i}^{0})}{\partial\theta^{(j)}}-\exp\left(\bm{i}\zeta^{\top}\lambda_{\theta_{i}^{0}}-\frac{N_{\ell}}{2}\zeta^{\top}\Lambda_{\theta_{i}^{0}}\zeta\right)\left(\bm{i}\zeta^{\top}\frac{\partial\lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}-\frac{1}{2}\zeta^{\top}\frac{\partial\Lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}\zeta\right)\right|d\zeta\nonumber\\
=\ &\int_{\mathbb{R}^{s}}\sqrt{N_{\ell}}\left|\left(e^{-\frac{\bm{i}}{\sqrt{N_{\ell}}}\bar{\zeta}^{\top}\lambda_{\theta_{i}^{0}}}\phi_{T}\left(\frac{\bar{\zeta}}{\sqrt{N_{\ell}}}\Big|\theta_{i}^{0}\right)\right)^{N_{\ell}-1}\frac{\partial\phi_{T}(\frac{\bar{\zeta}}{\sqrt{N_{\ell}}}|\theta_{i}^{0})}{\partial\theta^{(j)}}-\exp\left(\frac{\bm{i}}{\sqrt{N_{\ell}}}\bar{\zeta}^{\top}\lambda_{\theta_{i}^{0}}-\frac{1}{2}\bar{\zeta}^{\top}\Lambda_{\theta_{i}^{0}}\bar{\zeta}\right)\left(\frac{\bm{i}}{\sqrt{N_{\ell}}}\bar{\zeta}^{\top}\frac{\partial\lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}-\frac{1}{2N_{\ell}}\bar{\zeta}^{\top}\frac{\partial\Lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}\bar{\zeta}\right)\right|d\bar{\zeta}.\tag{150}
\end{align}

Denote the integrand in the above display by AA. Since λθi0\lambda_{\theta_{i}^{0}} and Λθi0\Lambda_{\theta_{i}^{0}} exist, e𝒊ζλθi0ϕT(ζ|θi0)e^{-\bm{i}\zeta^{\top}\lambda_{\theta_{i}^{0}}}\phi_{T}(\zeta|\theta_{i}^{0}) is twice continuously differentiable on s\mathbb{R}^{s}, with gradient being 𝟎\bm{0} and Hessian being 𝒊2Λθi0\bm{i}^{2}\Lambda_{\theta_{i}^{0}} at ζ=0\zeta=0. Then by Taylor Theorem,

|eiζλθi0ϕT(ζ|θi0)|<exp(14ζΛθi0ζ) if 0<ζ2<γ1,\displaystyle\left|e^{-i\zeta^{\top}\lambda_{\theta_{i}^{0}}}\phi_{T}(\zeta|\theta_{i}^{0})\right|<\exp\left(-\frac{1}{4}\zeta^{\top}\Lambda_{\theta_{i}^{0}}\zeta\right)\quad\text{ if }0<\|\zeta\|_{2}<\gamma_{1}, (151)

for sufficiently small 0<γ1<10<\gamma_{1}<1, and

(e𝒊Nζ¯λθi0ϕT(ζ¯N|θi0))N1exp(12ζ¯Λθi0ζ¯).\left(e^{-\frac{\bm{i}}{\sqrt{{N}_{\ell}}}\bar{\zeta}^{\top}\lambda_{\theta_{i}^{0}}}\phi_{T}\left(\frac{\bar{\zeta}}{\sqrt{{N}_{\ell}}}|\theta_{i}^{0}\right)\right)^{{N}_{\ell}-1}\to\exp\left(-\frac{1}{2}\bar{\zeta}^{\top}\Lambda_{\theta_{i}^{0}}\bar{\zeta}\right). (152)

Let η′′:=supζ2γ1|ϕT(ζ|θi0)|\eta^{\prime\prime}:=\sup_{\|\zeta\|_{2}\geq\gamma_{1}}|\phi_{T}(\zeta|\theta_{i}^{0})|. By the same reasoning that established η<1\eta<1 in Step 1, η′′<1\eta^{\prime\prime}<1. Then for any a>0a>0,

sAdζ¯=ζ¯2aAdζ¯+a<ζ¯2<γ1NAdζ¯+ζ¯2γ1NAdζ¯.\displaystyle\int_{\mathbb{R}^{s}}Ad\bar{\zeta}=\int_{\|\bar{\zeta}\|_{2}\leq a}Ad\bar{\zeta}+\int_{a<\|\bar{\zeta}\|_{2}<\gamma_{1}\sqrt{{N}_{\ell}}}Ad\bar{\zeta}+\int_{\|\bar{\zeta}\|_{2}\geq\gamma_{1}\sqrt{{N}_{\ell}}}Ad\bar{\zeta}. (153)

Then, as \ell\rightarrow\infty

ζ¯2γ1NAdζ¯\displaystyle\int_{\|\bar{\zeta}\|_{2}\geq\gamma_{1}\sqrt{{N}_{\ell}}}Ad\bar{\zeta}
\displaystyle\leq (η)N1rNs|ϕT(ζ¯N|θi0)|rU1(θi0)dζ¯\displaystyle\left(\eta^{\prime\prime}\right)^{{N}_{\ell}-1-r}\sqrt{{N}_{\ell}}\int_{\mathbb{R}^{s}}\left|\phi_{T}\left(\left.\frac{\bar{\zeta}}{\sqrt{{N}_{\ell}}}\right|\theta_{i}^{0}\right)\right|^{r}U_{1}(\theta_{i}^{0})d\bar{\zeta}
+Nζ¯2γ1Nexp(12ζ¯Λθi0ζ¯)(1N|ζ¯λθi0θ(j)|+12N|ζ¯Λθi0θ(j)ζ¯|)dζ¯\displaystyle+\sqrt{{N}_{\ell}}\int_{\|\bar{\zeta}\|_{2}\geq\gamma_{1}\sqrt{{N}_{\ell}}}\exp\left(-\frac{1}{2}\bar{\zeta}^{\top}\Lambda_{\theta_{i}^{0}}\bar{\zeta}\right)\left(\frac{1}{\sqrt{{N}_{\ell}}}\left|\bar{\zeta}^{\top}\frac{\partial\lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}\right|+\frac{1}{2{N}_{\ell}}\left|\bar{\zeta}^{\top}\frac{\partial\Lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}\bar{\zeta}\right|\right)d\bar{\zeta}
=\displaystyle= (η)N1rNs+12U1(θi0)s|ϕT(ζ|θi0)|rdζ\displaystyle\left(\eta^{\prime\prime}\right)^{{N}_{\ell}-1-r}{N}_{\ell}^{\frac{s+1}{2}}U_{1}(\theta_{i}^{0})\int_{\mathbb{R}^{s}}\left|\phi_{T}\left(\left.\zeta\right|\theta_{i}^{0}\right)\right|^{r}d\zeta
+ζ¯2γ1Nexp(12ζ¯Λθi0ζ¯)(|ζ¯λθi0θ(j)|+12N|ζ¯Λθi0θ(j)ζ¯|)dζ¯\displaystyle+\int_{\|\bar{\zeta}\|_{2}\geq\gamma_{1}\sqrt{{N}_{\ell}}}\exp\left(-\frac{1}{2}\bar{\zeta}^{\top}\Lambda_{\theta_{i}^{0}}\bar{\zeta}\right)\left(\left|\bar{\zeta}^{\top}\frac{\partial\lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}\right|+\frac{1}{2\sqrt{{N}_{\ell}}}\left|\bar{\zeta}^{\top}\frac{\partial\Lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}\bar{\zeta}\right|\right)d\bar{\zeta}
\displaystyle\rightarrow 0,\displaystyle 0, (154)

where the first inequality follows from condition (A3) and the definition of η\eta^{\prime\prime}, and the last step follows from η<1\eta^{\prime\prime}<1 and condition (A3).

By condition (A2), ϕT(ζ|θi0)θ(j)\frac{\partial\phi_{T}(\zeta|\theta_{i}^{0})}{\partial\theta^{(j)}}, as a function of ζ\zeta, has gradient 𝒊λθi0θ(j)\bm{i}\frac{\partial\lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}} at ζ=0\zeta=0. Then by Taylor's theorem:

NϕT(ζ¯N|θi0)θ(j)𝒊ζ¯λθi0θ(j).\sqrt{{N}_{\ell}}\frac{\partial\phi_{T}(\frac{\bar{\zeta}}{\sqrt{{N}_{\ell}}}|\theta_{i}^{0})}{\partial\theta^{(j)}}\to\bm{i}\bar{\zeta}^{\top}\frac{\partial\lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}. (155)

Moreover, specializing t=0t=0 in (145), one has for all ζ2<1\|\zeta\|_{2}<1:

|ϕT(ζ|θi0)θ(j)|sU1(θi0)ζ2.\left|\frac{\partial\phi_{T}(\zeta|\theta_{i}^{0})}{\partial\theta^{(j)}}\right|\leq\sqrt{s}U_{1}(\theta_{i}^{0})\|\zeta\|_{2}. (156)

By combining (151) and (156), we obtain as \ell\rightarrow\infty

a<ζ¯2<γ1NAdζ¯\displaystyle\int_{a<\|\bar{\zeta}\|_{2}<\gamma_{1}\sqrt{{N}_{\ell}}}Ad\bar{\zeta}
\displaystyle\leq Na<ζ¯2<γ1Nexp(N14Nζ¯Λθi0ζ¯)sU1(θi0)(ζ¯2N)\displaystyle\sqrt{{N}_{\ell}}\int_{a<\|\bar{\zeta}\|_{2}<\gamma_{1}\sqrt{{N}_{\ell}}}\exp\left(-\frac{{N}_{\ell}-1}{4{N}_{\ell}}\bar{\zeta}^{\top}\Lambda_{\theta_{i}^{0}}\bar{\zeta}\right)\sqrt{s}U_{1}(\theta_{i}^{0})\left(\frac{\|\bar{\zeta}\|_{2}}{\sqrt{{N}_{\ell}}}\right)
+exp(12ζ¯Λθi0ζ¯)(|1Nζ¯λθi0θ(j)|+|12Nζ¯Λθi0θ(j)ζ¯|)dζ¯\displaystyle+\exp\left(-\frac{1}{2}\bar{\zeta}^{\top}\Lambda_{\theta_{i}^{0}}\bar{\zeta}\right)\left(\left|\frac{1}{\sqrt{{N}_{\ell}}}\bar{\zeta}^{\top}\frac{\partial\lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}\right|+\left|\frac{1}{2{N}_{\ell}}\bar{\zeta}^{\top}\frac{\partial\Lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}\bar{\zeta}\right|\right)d\bar{\zeta}
\displaystyle\leq a<ζ¯2<γ1N2exp(18ζ¯Λθi0ζ¯)C(θi0,s)(ζ¯2+ζ¯22)dζ¯\displaystyle\int_{a<\|\bar{\zeta}\|_{2}<\gamma_{1}\sqrt{{N}_{\ell}}}2\exp\left(-\frac{1}{8}\bar{\zeta}^{\top}\Lambda_{\theta_{i}^{0}}\bar{\zeta}\right)C(\theta_{i}^{0},s)\left(\|\bar{\zeta}\|_{2}+\|\bar{\zeta}\|_{2}^{2}\right)d\bar{\zeta}
\displaystyle\to C(θi0,s)ζ¯2>a2exp(18ζ¯Λθi0ζ¯)(ζ¯2+ζ¯22)dζ¯,\displaystyle C(\theta_{i}^{0},s)\int_{\|\bar{\zeta}\|_{2}>a}2\exp\left(-\frac{1}{8}\bar{\zeta}^{\top}\Lambda_{\theta_{i}^{0}}\bar{\zeta}\right)\left(\|\bar{\zeta}\|_{2}+\|\bar{\zeta}\|_{2}^{2}\right)d\bar{\zeta}, (157)

where in the second inequality we impose N2{N}_{\ell}\geq 2 since it is the limiting behavior that is of interest, and C(θi0,s)C(\theta_{i}^{0},s) is a constant that depends on θi0\theta_{i}^{0} and ss.

Finally by (152) and (155), when ζ¯2a\|\bar{\zeta}\|_{2}\leq a

N(e𝒊Nζ¯λθi0ϕT(ζ¯N|θi0))N1ϕT(ζ¯N|θi0)θ(j)exp(12ζ¯Λθi0ζ¯)𝒊ζ¯λθi0θ(j).\sqrt{{N}_{\ell}}\left(e^{-\frac{\bm{i}}{\sqrt{{N}_{\ell}}}\bar{\zeta}^{\top}\lambda_{\theta_{i}^{0}}}\phi_{T}\left(\frac{\bar{\zeta}}{\sqrt{{N}_{\ell}}}|\theta_{i}^{0}\right)\right)^{{N}_{\ell}-1}\frac{\partial\phi_{T}(\frac{\bar{\zeta}}{\sqrt{{N}_{\ell}}}|\theta_{i}^{0})}{\partial\theta^{(j)}}\to\exp\left(-\frac{1}{2}\bar{\zeta}^{\top}\Lambda_{\theta_{i}^{0}}\bar{\zeta}\right)\bm{i}\bar{\zeta}^{\top}\frac{\partial\lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}.

Moreover

Nexp(𝒊Nζ¯λθi012ζ¯Λθi0ζ¯)(𝒊Nζ¯λθi0θ(j)12Nζ¯Λθi0θ(j)ζ¯)\displaystyle\sqrt{{N}_{\ell}}\exp\left(\frac{\bm{i}}{\sqrt{{N}_{\ell}}}\bar{\zeta}^{\top}\lambda_{\theta_{i}^{0}}-\frac{1}{2}\bar{\zeta}^{\top}\Lambda_{\theta_{i}^{0}}\bar{\zeta}\right)\left(\frac{\bm{i}}{\sqrt{{N}_{\ell}}}\bar{\zeta}^{\top}\frac{\partial\lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}-\frac{1}{2{N}_{\ell}}\bar{\zeta}^{\top}\frac{\partial\Lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}\bar{\zeta}\right)
\displaystyle\to exp(12ζ¯Λθi0ζ¯)𝒊ζ¯λθi0θ(j)\displaystyle\exp\left(-\frac{1}{2}\bar{\zeta}^{\top}\Lambda_{\theta_{i}^{0}}\bar{\zeta}\right)\bm{i}\bar{\zeta}^{\top}\frac{\partial\lambda_{\theta_{i}^{0}}}{\partial\theta^{(j)}}

and hence limA=0\lim_{\ell\to\infty}A=0 when ζ¯2a\|\bar{\zeta}\|_{2}\leq a. One can find an integrable envelope function for AA when ζ¯2a\|\bar{\zeta}\|_{2}\leq a by steps similar to those in (157), and then by the dominated convergence theorem,

ζ¯2aAdζ¯0.\int_{\|\bar{\zeta}\|_{2}\leq a}Ad\bar{\zeta}\to 0. (158)

Plugging (158), (157) and (154) into (153) and (150), one has

lim supK(j)C(θi0,s)ζ¯2>a2exp(18ζ¯Λθi0ζ¯)(ζ¯2+ζ¯22)dζ¯.\displaystyle\limsup_{\ell\to\infty}\ \ K_{\ell}(j)\leq C(\theta_{i}^{0},s)\int_{\|\bar{\zeta}\|_{2}>a}2\exp\left(-\frac{1}{8}\bar{\zeta}^{\top}\Lambda_{\theta_{i}^{0}}\bar{\zeta}\right)\left(\|\bar{\zeta}\|_{2}+\|\bar{\zeta}\|_{2}^{2}\right)d\bar{\zeta}.

Letting aa\to\infty in the above display yields K(j)0K_{\ell}(j)\to 0, which together with (149) implies J^=o(Nθiηi2)\hat{J}_{\ell}=o(\sqrt{{N}_{\ell}}\|\theta_{i}^{\ell}-\eta_{i}^{\ell}\|_{2}). ∎

Appendix E Proofs for Section 6

E.1 Proofs of Theorem 6.2 and Corollary 6.7

For BB a subset of a metric space with metric DD, the minimal number of balls with centers in BB and of radius ϵ\epsilon needed to cover BB is known as the ϵ\epsilon-covering number of BB and is denoted by 𝔑(ϵ,B,D)\mathfrak{N}(\epsilon,B,D). Define the root average square Hellinger metric:

dm,h(G,G0)=1mi=1mh2(pG,Ni,pG0,Ni).d_{{m},h}(G,G_{0})=\sqrt{\frac{1}{{m}}\sum_{i=1}^{{m}}h^{2}(p_{G,{N}_{i}},p_{G_{0},{N}_{i}})}.
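As a purely illustrative aside (not part of the original argument), the metric above can be computed exactly for small discrete kernels. The Python sketch below assumes a hypothetical Bernoulli(θ) kernel, so that each pG,N is a mixture of product distributions on {0,1}^N, and evaluates dm,h by enumeration; all mixing measures and sequence lengths here are invented for illustration.

```python
import itertools, math

def mixture_product_pmf(G, N, x):
    # p_{G,N}(x) = sum_i p_i * prod_{j=1}^N f(x_j | theta_i), Bernoulli kernel
    return sum(p * math.prod(th if xj else 1.0 - th for xj in x) for p, th in G)

def hellinger(G, G0, N):
    # squared Hellinger h^2(P, Q) = (1/2) * sum_x (sqrt P(x) - sqrt Q(x))^2,
    # computed exactly by enumerating all 2^N binary sequences
    s = sum((math.sqrt(mixture_product_pmf(G, N, x))
             - math.sqrt(mixture_product_pmf(G0, N, x))) ** 2
            for x in itertools.product([0, 1], repeat=N))
    return math.sqrt(s / 2.0)

def d_mh(G, G0, Ns):
    # root average square Hellinger metric over m = len(Ns) sequences
    return math.sqrt(sum(hellinger(G, G0, N) ** 2 for N in Ns) / len(Ns))

G0 = [(0.5, 0.2), (0.5, 0.8)]   # illustrative true mixing measure: (weight, theta)
G = [(0.4, 0.25), (0.6, 0.8)]   # a nearby mixing measure
dist = d_mh(G, G0, [2, 3, 5])   # sequence lengths N_1, N_2, N_3
```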
Proof of Theorem 6.2.

a) The proof structure is the same as that of Lemma 6.5, except that to take the varying sequence lengths into account, the distance dm,hd_{m,h} is used in place of the total variation distance VV for the mixture densities. We verify conditions (i) and (ii) of [17, Theorem 8.23], respectively, in Step 1 and Step 2 below to obtain a posterior contraction bound on the mixture density. In Step 3 we prove a posterior consistency result and then apply Lemma 5.5 to transfer the posterior contraction result from density estimation to parameter estimation.

Step 1 (Verification of condition (i) of [17, Theorem 8.23])
For a clean presentation, write n1n_{1} and n0n_{0} respectively for n1(G0)n_{1}(G_{0}) and n0(G0,kk0k(Θ1))n_{0}(G_{0},\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1})) throughout the proof. Note that (B2) implies that the map θPθ\theta\mapsto P_{\theta} from (Θ,2)(\Theta,\|\cdot\|_{2}) to ({Pθ}θΘ,h)(\{P_{\theta}\}_{\theta\in\Theta},h) is continuous. Then by Lemma 5.6 and Lemma 3.2 d), for any Nn1n0{N}\geq n_{1}\vee n_{0} and any Gk0(Θ1)G\in\mathcal{E}_{k_{0}}(\Theta_{1}),

2h(pG,N,pG0,N)V(pG,N,pG0,N)C(G0,Θ1)W1(G,G0)C(G0,Θ1)D1(G,G0).\sqrt{2}h(p_{G,{N}},p_{G_{0},{N}})\geq V(p_{G,{N}},p_{G_{0},{N}})\geq C(G_{0},\Theta_{1})W_{1}(G,G_{0})\geq C(G_{0},\Theta_{1})D_{1}(G,G_{0}). (159)

In the remainder of the proof Nn1n0{N}\geq n_{1}\vee n_{0} is implicitly imposed. By (159) it holds that, for all Gk0(Θ1)G\in\mathcal{E}_{k_{0}}(\Theta_{1})

dm,h(G,G0)C(G0,Θ1)W1(G,G0)C(G0,Θ1)D1(G,G0).d_{{m},h}(G,G_{0})\geq C(G_{0},\Theta_{1})W_{1}(G,G_{0})\geq C(G_{0},\Theta_{1})D_{1}(G,G_{0}). (160)

Then

{Gk0(Θ1):dm,h(G,G0)ϵ}{Gk0(Θ1):D1(G,G0)ϵC(G0,Θ1)}\{G\in\mathcal{E}_{k_{0}}(\Theta_{1}):d_{{m},h}(G,G_{0})\leq\epsilon\}\subset\left\{G\in\mathcal{E}_{k_{0}}(\Theta_{1}):D_{1}(G,G_{0})\leq\frac{\epsilon}{C(G_{0},\Theta_{1})}\right\} (161)

and thus for any jj\in\mathbb{N},

Π(dm,h(G,G0)2jϵ)Π(D1(G,G0)2jϵC(G0,Θ1))k0!(2jϵC(G0,Θ1))k01(2jϵC(G0,Θ1))qk0,\Pi\left(d_{{m},h}(G,G_{0})\leq 2j\epsilon\right)\leq\Pi\left(D_{1}(G,G_{0})\leq\frac{2j\epsilon}{C(G_{0},\Theta_{1})}\right)\\ \lesssim k_{0}!\left(\frac{2j\epsilon}{C(G_{0},\Theta_{1})}\right)^{k_{0}-1}\left(\frac{2j\epsilon}{C(G_{0},\Theta_{1})}\right)^{qk_{0}}, (162)

where the last inequality follows from (B1).

By an argument similar to [34, Lemma 3.2 (a)], for any G=i=1k0piδθik0(Θ1)G=\sum_{i=1}^{k_{0}}p_{i}\delta_{\theta_{i}}\in\mathcal{E}_{k_{0}}(\Theta_{1})

K(pG0,Ni,pG,Ni)\displaystyle K(p_{G_{0},{N}_{i}},p_{G,{N}_{i}})\leq NiL1Wα0α0(G,G0)\displaystyle{N}_{i}L_{1}W_{\alpha_{0}}^{\alpha_{0}}(G,G_{0})
\displaystyle\leq NiC(diam(Θ1),α0,L1)minτSk0i=1k0(θτ(i)θ0i2α0+|pτ(i)p0i|),\displaystyle{N}_{i}C(\text{diam}(\Theta_{1}),\alpha_{0},L_{1})\min_{\tau\in S_{k_{0}}}\sum_{i=1}^{k_{0}}\left(\|\theta_{\tau(i)}-\theta^{0}_{i}\|_{2}^{\alpha_{0}}+|p_{\tau(i)}-p^{0}_{i}|\right),

where the second inequality follows from Lemma 3.2 b) and (B2). Then

1mi=1mK(pG0,Ni,pG,Ni)N¯mC(diam(Θ1),α0)minτSk0i=1k0(θτ(i)θ0i2α0+|pτ(i)p0i|),\displaystyle\frac{1}{{m}}\sum_{i=1}^{{m}}K(p_{G_{0},{N}_{i}},p_{G,{N}_{i}})\leq\bar{{N}}_{{m}}C(\text{diam}(\Theta_{1}),\alpha_{0})\min_{\tau\in S_{k_{0}}}\sum_{i=1}^{k_{0}}\left(\|\theta_{\tau(i)}-\theta^{0}_{i}\|_{2}^{\alpha_{0}}+|p_{\tau(i)}-p^{0}_{i}|\right),

and

Π(1mi=1mK(pG0,Ni,pG,Ni)ϵ2)\displaystyle\Pi\left(\frac{1}{{m}}\sum_{i=1}^{{m}}K(p_{G_{0},{N}_{i}},p_{G,{N}_{i}})\leq\epsilon^{2}\right)
\displaystyle\gtrsim (ϵ2N¯mC(diam(Θ1),α0,L1))qk0/α0(ϵ2N¯mC(diam(Θ1),α0,L1))k01.\displaystyle\left(\frac{\epsilon^{2}}{\bar{{N}}_{{m}}C(\text{diam}(\Theta_{1}),\alpha_{0},L_{1})}\right)^{qk_{0}/\alpha_{0}}\left(\frac{\epsilon^{2}}{\bar{{N}}_{{m}}C(\text{diam}(\Theta_{1}),\alpha_{0},L_{1})}\right)^{k_{0}-1}. (163)

Combining (162) and (163),

Π(dm,h(G,G0)2jϵ)Π(1mi=1mK(pG0,Ni,pG,Ni)ϵ2)\displaystyle\frac{\Pi\left(d_{{m},h}(G,G_{0})\leq 2j\epsilon\right)}{\Pi\left(\frac{1}{{m}}\sum_{i=1}^{{m}}K(p_{G_{0},{N}_{i}},p_{G,{N}_{i}})\leq\epsilon^{2}\right)}
\displaystyle\leq C(G0,Θ1,q,α0,k0,L1)jqk0+k01N¯mqk0/α0+k01ϵqk0(2/α01)(k01).\displaystyle C(G_{0},\Theta_{1},q,\alpha_{0},k_{0},L_{1})j^{qk_{0}+k_{0}-1}\bar{{N}}_{{m}}^{qk_{0}/\alpha_{0}+k_{0}-1}\epsilon^{-qk_{0}(2/\alpha_{0}-1)-(k_{0}-1)}.

By Remark 6.1, α02\alpha_{0}\leq 2. Then based on the last display one may verify that with

ϵm,N¯m=C(G0,Θ1,q,k0,α0,β0,L1,L2)ln(mN¯m)m\epsilon_{{m},\bar{{N}}_{{m}}}=C(G_{0},\Theta_{1},q,k_{0},\alpha_{0},\beta_{0},L_{1},L_{2})\sqrt{\frac{\ln({m}\bar{{N}}_{{m}})}{{m}}}

for some large enough constant C(G0,Θ1,q,k0,α0,β0)C(G_{0},\Theta_{1},q,k_{0},\alpha_{0},\beta_{0}),

Π(dm,h(G,G0)2jϵm,N¯m)Π(1mi=1mK(pG0,Ni,pG,Ni)ϵ2m,N¯m)exp(14jmϵm,N¯m2).\frac{\Pi\left(d_{{m},h}(G,G_{0})\leq 2j\epsilon_{{m},\bar{{N}}_{{m}}}\right)}{\Pi\left(\frac{1}{{m}}\sum_{i=1}^{{m}}K(p_{G_{0},{N}_{i}},p_{G,{N}_{i}})\leq\epsilon^{2}_{{m},\bar{{N}}_{{m}}}\right)}\leq\exp\left(\frac{1}{4}j{m}\epsilon_{{m},\bar{{N}}_{{m}}}^{2}\right).

Step 2 (Verification of condition (ii) of [17, Theorem 8.23])
By (161),

supϵϵm,N¯mln𝔑(136ϵ,{Gk0(Θ1):dm,h(G,G0)2ϵ},dm,h)\displaystyle\sup_{\epsilon\geq\epsilon_{{m},\bar{{N}}_{{m}}}}\ln\mathfrak{N}\left(\frac{1}{36}\epsilon,\{G\in\mathcal{E}_{k_{0}}(\Theta_{1}):d_{{m},h}(G,G_{0})\leq 2\epsilon\},d_{{m},h}\right)
\displaystyle\leq supϵϵm,N¯mln𝔑(136ϵ,{Gk0(Θ1):D1(G,G0)2ϵC(G0,Θ1)},dm,h)\displaystyle\sup_{\epsilon\geq\epsilon_{{m},\bar{{N}}_{{m}}}}\ln\mathfrak{N}\left(\frac{1}{36}\epsilon,\left\{G\in\mathcal{E}_{k_{0}}(\Theta_{1}):D_{1}(G,G_{0})\leq\frac{2\epsilon}{C(G_{0},\Theta_{1})}\right\},d_{{m},h}\right)
\displaystyle\leq qk0ln(1+4×(144L2)1β0C(G0,Θ1)N¯m12β0ϵm,N¯m(1β01))+(k01)ln(1+10×722ϵm,N¯m2),\displaystyle qk_{0}\ln\left(1+\frac{4\times(144L_{2})^{\frac{1}{\beta_{0}}}}{C(G_{0},\Theta_{1})}\bar{{N}}_{{m}}^{\frac{1}{2\beta_{0}}}\epsilon_{{m},\bar{{N}}_{{m}}}^{-(\frac{1}{\beta_{0}}-1)}\right)+(k_{0}-1)\ln\left(1+10\times 72^{2}\epsilon_{{m},\bar{{N}}_{{m}}}^{-2}\right),

where the last inequality follows from Lemma E.1. By Remark 6.1 β01\beta_{0}\leq 1. Then based on the last display one may verify with ϵm,N¯m=C(G0,Θ1,q,k0,α0,β0,L1,L2)ln(mN¯m)m\epsilon_{{m},\bar{{N}}_{{m}}}=C(G_{0},\Theta_{1},q,k_{0},\alpha_{0},\beta_{0},L_{1},L_{2})\sqrt{\frac{\ln({m}\bar{{N}}_{{m}})}{{m}}} for some large enough constant C(G0,Θ1,q,k0,α0,β0)C(G_{0},\Theta_{1},q,k_{0},\alpha_{0},\beta_{0}),

supϵϵm,N¯mln𝔑(136ϵ,{Gk0(Θ1):dm,h(G,G0)2ϵ},dm,h)mϵ2m,N¯m.\sup_{\epsilon\geq\epsilon_{{m},\bar{{N}}_{{m}}}}\ln\mathfrak{N}\left(\frac{1}{36}\epsilon,\{G\in\mathcal{E}_{k_{0}}(\Theta_{1}):d_{{m},h}(G,G_{0})\leq 2\epsilon\},d_{{m},h}\right)\leq{m}\epsilon^{2}_{{m},\bar{{N}}_{{m}}}. (164)

Now we invoke [17, Theorem 8.23] (the Hellinger distance defined in [17] differs from our definition by a constant factor, but this factor only affects the coefficients of ϵm,N¯m\epsilon_{{m},\bar{{N}}_{{m}}}, not the conclusion of convergence): for every M¯m\bar{M}_{m}\to\infty,

Π(Gk0(Θ1):dm,h(G,G0)M¯mϵm,N¯m|X[N1]1,,X[Nm]m)0\Pi(G\in\mathcal{E}_{k_{0}}(\Theta_{1}):d_{{m},h}(G,G_{0})\geq\bar{M}_{m}\epsilon_{{m},\bar{{N}}_{{m}}}|X_{[{N}_{1}]}^{1},\ldots,X_{[{N}_{{m}}]}^{{m}})\to 0 (165)

in PG0,N1PG0,NmP_{G_{0},{N}_{1}}\bigotimes\cdots\bigotimes P_{G_{0},{N}_{{m}}}-probability as m{m}\to\infty.

Step 3 (From convergence of densities to that of parameters) Since n1NiN0:=supiNin_{1}\leq{N}_{i}\leq{N}_{0}:=\sup_{i}{N}_{i}, by Lemma 5.5 for Gk0(Θ)G\in\mathcal{E}_{k_{0}}(\Theta) satisfying W1(G,G0)<c(G0,N0)W_{1}(G,G_{0})<c(G_{0},{N}_{0})

dm,h(G,G0)C(G0)1mi=1mD2Ni(G,G0).\displaystyle d_{{m},h}(G,G_{0})\geq C(G_{0})\sqrt{\frac{1}{{m}}\sum_{i=1}^{{m}}D^{2}_{{N}_{i}}(G,G_{0})}. (166)

By Lemma E.2, for G=j=1k0pjδθjk0(Θ1)G=\sum_{j=1}^{k_{0}}p_{j}\delta_{\theta_{j}}\in\mathcal{E}_{k_{0}}(\Theta_{1}) satisfying D1(G,G0)<12ρD_{1}(G,G_{0})<\frac{1}{2}\rho, where ρ:=min1i<jk0θi0θj02\rho:=\min_{1\leq i<j\leq k_{0}}\|\theta_{i}^{0}-\theta_{j}^{0}\|_{2}, there exists a τSk0\tau\in S_{k_{0}} such that

1mi=1mD2Ni(G,G0)=\displaystyle\sqrt{\frac{1}{{m}}\sum_{i=1}^{{m}}D^{2}_{{N}_{i}}(G,G_{0})}= 1mi=1m(Nij=1k0θτ(j)θj02+j=1k0|pτ(j)pj0|)2\displaystyle\sqrt{\frac{1}{{m}}\sum_{i=1}^{{m}}\left(\sqrt{{N}_{i}}\sum_{j=1}^{k_{0}}\|\theta_{\tau(j)}-\theta_{j}^{0}\|_{2}+\sum_{j=1}^{k_{0}}|p_{\tau(j)}-p_{j}^{0}|\right)^{2}}
\displaystyle\geq 1mi=1m(Ni(j=1k0θτ(j)θj02)2+(j=1k0|pτ(j)pj0|)2)\displaystyle\sqrt{\frac{1}{{m}}\sum_{i=1}^{{m}}\left({N}_{i}\left(\sum_{j=1}^{k_{0}}\|\theta_{\tau(j)}-\theta_{j}^{0}\|_{2}\right)^{2}+\left(\sum_{j=1}^{k_{0}}|p_{\tau(j)}-p_{j}^{0}|\right)^{2}\right)}
=\displaystyle= N¯m(j=1k0θτ(j)θj02)2+(j=1k0|pτ(j)pj0|)2\displaystyle\sqrt{\bar{{N}}_{{m}}\left(\sum_{j=1}^{k_{0}}\|\theta_{\tau(j)}-\theta_{j}^{0}\|_{2}\right)^{2}+\left(\sum_{j=1}^{k_{0}}|p_{\tau(j)}-p_{j}^{0}|\right)^{2}}
\displaystyle\geq 12(N¯mj=1k0θτ(j)θj02+j=1k0|pτ(j)pj0|)\displaystyle\frac{1}{\sqrt{2}}\left(\sqrt{\bar{{N}}_{{m}}}\sum_{j=1}^{k_{0}}\|\theta_{\tau(j)}-\theta_{j}^{0}\|_{2}+\sum_{j=1}^{k_{0}}|p_{\tau(j)}-p_{j}^{0}|\right)
=\displaystyle= 12DN¯m(G,G0).\displaystyle\frac{1}{\sqrt{2}}D_{\bar{{N}}_{{m}}}(G,G_{0}). (167)

Let 𝒢={Gk0(Θ1)|W1(G,G0)<c(G0,N0),D1(G,G0)<12ρ}\mathcal{G}=\{G\in\mathcal{E}_{k_{0}}(\Theta_{1})|W_{1}(G,G_{0})<c(G_{0},{N}_{0}),D_{1}(G,G_{0})<\frac{1}{2}\rho\}. Combining (166) and (167), for any G𝒢G\in\mathcal{G}

dm,h(G,G0)C(G0)2DN¯m(G,G0).d_{{m},h}(G,G_{0})\geq\frac{C(G_{0})}{\sqrt{2}}D_{\bar{{N}}_{{m}}}(G,G_{0}).

By the union bound,

Π(Gk0(Θ1):DN¯m(G,G0)2M¯mC(G0)ϵm,N¯m|X[N1]1,,X[Nm]m)\displaystyle\Pi(G\in\mathcal{E}_{k_{0}}(\Theta_{1}):D_{\bar{{N}}_{{m}}}(G,G_{0})\geq\frac{\sqrt{2}\bar{M}_{m}}{C(G_{0})}\epsilon_{{m},{\bar{{N}}_{{m}}}}|X_{[{N}_{1}]}^{1},\ldots,X_{[{N}_{{m}}]}^{{m}})
\displaystyle\leq Π(Gk0(Θ1):dm,h(G,G0)M¯mϵm,N¯m|X[N1]1,,X[Nm]m)+Π(𝒢c|X[N1]1,,X[Nm]m)\displaystyle\Pi(G\in\mathcal{E}_{k_{0}}(\Theta_{1}):d_{{m},h}(G,G_{0})\geq\bar{M}_{m}\epsilon_{{m},\bar{{N}}_{{m}}}|X_{[{N}_{1}]}^{1},\ldots,X_{[{N}_{{m}}]}^{{m}})+\Pi(\mathcal{G}^{c}|X_{[{N}_{1}]}^{1},\ldots,X_{[{N}_{{m}}]}^{{m}})
\displaystyle\to 0\displaystyle 0

in i=1mPG0,Ni\bigotimes_{i=1}^{{m}}P_{G_{0},{N}_{i}}-probability as m{m}\to\infty, by applying (165) to the first term. It remains to explain why the second term vanishes. That the second term converges to 00 is essentially a posterior consistency result with respect to the W1W_{1} (or D1D_{1}) metric; here we prove it via (160) and (165). By (160),

𝒢c{Gk0(Θ1):dm,h(G,G0)>C(G0,ρ,N0,Θ1)}\mathcal{G}^{c}\subset\{G\in\mathcal{E}_{k_{0}}(\Theta_{1}):d_{{m},h}(G,G_{0})>C(G_{0},\rho,{N}_{0},\Theta_{1})\}

for some constant C(G0,ρ,N0,Θ1)>0C(G_{0},\rho,{N}_{0},\Theta_{1})>0. For some slowly increasing M¯m\bar{M}^{\prime}_{{m}}\to\infty such that M¯mϵm,N¯m0\bar{M}^{\prime}_{{m}}\epsilon_{{m},\bar{{N}}_{{m}}}\to 0 as m{m}\to\infty,

{Gk0(Θ1):dm,h(G,G0)>C(G0,ρ,N0,Θ1)}\displaystyle\{G\in\mathcal{E}_{k_{0}}(\Theta_{1}):d_{{m},h}(G,G_{0})>C(G_{0},\rho,{N}_{0},\Theta_{1})\}
\displaystyle\subset {Gk0(Θ1):dm,h(G,G0)>M¯mϵm,N¯m}\displaystyle\{G\in\mathcal{E}_{k_{0}}(\Theta_{1}):d_{{m},h}(G,G_{0})>\bar{M}^{\prime}_{{m}}\epsilon_{{m},\bar{{N}}_{{m}}}\}

holds for large m{m}. Combining the last two displays and (165) yields

Π(𝒢c|X[N1]1,,X[Nm]m)0.\Pi(\mathcal{G}^{c}|X_{[{N}_{1}]}^{1},\ldots,X_{[{N}_{{m}}]}^{{m}})\\ \to 0.

The proof is concluded.

b) If the additional condition of part b) is satisfied, then by Remark 5.2, n1(G0)=1n_{1}(G_{0})=1. That is, the claim of part a) holds with n1(G0)=1n_{1}(G_{0})=1. ∎

Proof of Corollary 6.7.

Recall f(x|θ)=exp(η(θ),T(x)B(θ))h(x)f(x|\theta)=\exp\left(\langle\eta(\theta),T(x)\rangle-B(\theta)\right)h(x). By a direct calculation,

|K(f(x|θ1),f(x|θ2))|=|θ1θ2,𝔼θ1T(x)(B(θ1)B(θ2))|L1(Θ1)θ1θ22.|K(f(x|\theta_{1}),f(x|\theta_{2}))|=|\langle\theta_{1}-\theta_{2},\mathbb{E}_{\theta_{1}}T(x)\rangle-(B(\theta_{1})-B(\theta_{2}))|\leq L_{1}(\Theta_{1})\|\theta_{1}-\theta_{2}\|_{2}.

By changing to the canonical parametrization and appealing to Lemma E.3 b),

|h(f(x|θ1),f(x|θ2))|L2(Θ1)θ1θ22.|h(f(x|\theta_{1}),f(x|\theta_{2}))|\leq L_{2}(\Theta_{1})\|\theta_{1}-\theta_{2}\|_{2}.

Here L1(Θ1)L_{1}(\Theta_{1}) and L2(Θ1)L_{2}(\Theta_{1}) are constants that depend on Θ1\Theta_{1}. In summary (B2) is satisfied. Then the conclusions are obtained by applying Theorem 6.2. ∎

E.2 Auxiliary lemmas for Section E.1

Lemma E.1.

Fix G0=i=1k0pi0δθi0k0(Θ1)G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}}\in\mathcal{E}_{k_{0}}(\Theta_{1}). Suppose h(f(x|θ1),f(x|θ2))L2θ1θ22β0h(f(x|\theta_{1}),f(x|\theta_{2}))\leq L_{2}\|\theta_{1}-\theta_{2}\|_{2}^{\beta_{0}} for some 0<β010<\beta_{0}\leq 1 and some L2>0L_{2}>0, where θ1,θ2\theta_{1},\theta_{2} are any two distinct elements of Θ1\Theta_{1}. Then

𝔑(136ϵ,{Gk0(Θ1):D1(G,G0)2ϵC(G0,diam(Θ1))},dm,h)\displaystyle\mathfrak{N}\left(\frac{1}{36}\epsilon,\left\{G\in\mathcal{E}_{k_{0}}(\Theta_{1}):D_{1}(G,G_{0})\leq\frac{2\epsilon}{C(G_{0},\text{diam}(\Theta_{1}))}\right\},d_{{m},h}\right)
\displaystyle\leq (1+4×(144L2)1β0C(G0,diam(Θ1))N¯m12β0ϵ(1β01))qk0(1+10×722ϵ2)k01.\displaystyle\left(1+\frac{4\times(144L_{2})^{\frac{1}{\beta_{0}}}}{C(G_{0},\text{diam}(\Theta_{1}))}\bar{{N}}_{{m}}^{\frac{1}{2\beta_{0}}}\epsilon^{-(\frac{1}{\beta_{0}}-1)}\right)^{qk_{0}}\left(1+10\times 72^{2}\epsilon^{-2}\right)^{k_{0}-1}.
Lemma E.2.

Let G0=i=1k0pi0δθi0k0(Θ)G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}}\in\mathcal{E}_{k_{0}}(\Theta) with ρ=min1i<jk0θi0θj02\rho=\min_{1\leq i<j\leq k_{0}}\|\theta_{i}^{0}-\theta_{j}^{0}\|_{2}. If G=i=1k0piδθik0(Θ)G=\sum_{i=1}^{k_{0}}p_{i}\delta_{\theta_{i}}\in\mathcal{E}_{k_{0}}(\Theta) satisfies D1(G,G0)<12ρD_{1}(G,G_{0})<\frac{1}{2}\rho, then there exists a unique τSk0\tau\in S_{k_{0}} such that for every real number r1r\geq 1

Dr(G,G0)=i=1k0(rθτ(i)θi02+|pτ(i)pi0|).D_{r}(G,G_{0})=\sum_{i=1}^{k_{0}}\left(\sqrt{r}\|\theta_{\tau(i)}-\theta_{i}^{0}\|_{2}+|p_{\tau(i)}-p_{i}^{0}|\right).
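To illustrate the content of this lemma, one can check numerically that once D1(G,G0) < ρ/2, the optimal matching does not change with r. The sketch below assumes Dr takes the form of the displayed matching cost minimized over permutations (an assumption made for illustration, with scalar atoms and invented weights).

```python
import itertools, math

def matching_cost(G, G0, r, tau):
    # sum_i ( sqrt(r) * |theta_{tau(i)} - theta_i^0| + |p_{tau(i)} - p_i^0| )
    return sum(math.sqrt(r) * abs(G[tau[i]][1] - G0[i][1])
               + abs(G[tau[i]][0] - G0[i][0]) for i in range(len(G0)))

def D_r(G, G0, r):
    # assumed form: minimum of the matching cost over permutations tau
    return min(matching_cost(G, G0, r, t)
               for t in itertools.permutations(range(len(G0))))

def argmin_tau(G, G0, r):
    return min(itertools.permutations(range(len(G0))),
               key=lambda t: matching_cost(G, G0, r, t))

# atoms of G0 are separated by rho = 1; G is a small perturbation with its
# atoms listed in a different order, so that D_1(G, G0) < rho / 2
G0 = [(0.5, 0.0), (0.5, 1.0)]     # (weight, atom) pairs, illustrative
G = [(0.45, 1.05), (0.55, 0.10)]
taus = {argmin_tau(G, G0, r) for r in (1, 4, 9, 100)}
```

For this example the optimal permutation is the same for every r, in line with the uniqueness claim of Lemma E.2.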
Lemma E.3.

Consider the density function f(x|θ)f(x|\theta) of a full-rank exponential family with respect to a dominating measure μ\mu on 𝔛\mathfrak{X}, which takes the form

f(x|θ)=exp(θT(x)A(θ))h(x),\displaystyle f(x|\theta)=\exp\left(\theta^{\top}T(x)-A(\theta)\right)h(x),

where Θ={θ|A(θ)<}s\Theta=\{\theta|A(\theta)<\infty\}\subset\mathbb{R}^{s} is the parameter space of θ\theta.

  a)

    For any θ0Θ\theta_{0}\in\Theta^{\circ}

    lim supθθ0h(f(x|θ),f(x|θ0))θθ02λmax(θ2A(θ0))/8,\limsup_{\theta\to\theta_{0}}\frac{h(f(x|\theta),f(x|\theta_{0}))}{\|\theta-\theta_{0}\|_{2}}\leq\sqrt{\lambda_{\text{max}}(\nabla_{\theta}^{2}A(\theta_{0}))/8},

    where λmax()\lambda_{\text{max}}(\cdot) is the maximum eigenvalue of a symmetric matrix.

  b)

    For any compact subset ΘΘ\Theta^{\prime}\subset\Theta^{\circ}, there exists L2>0L_{2}>0 such that

    h(f(x|θ1),f(x|θ2))L2θ1θ22θ1,θ2conv(Θ),h(f(x|\theta_{1}),f(x|\theta_{2}))\leq L_{2}\|\theta_{1}-\theta_{2}\|_{2}\quad\forall\ \theta_{1},\theta_{2}\in\operatorname{conv}(\Theta^{\prime}),

    where conv(Θ)\operatorname{conv}(\Theta^{\prime}) is the convex hull of Θ\Theta^{\prime}.

E.3 Calculation Details in Example 6.9

Details for the uniform probability kernel in Example 5.20. Consider the uniform kernel in Example 4.7 and Example 5.20. Write G0=i=1k0pi0δθi0G_{0}=\sum_{i=1}^{k_{0}}p_{i}^{0}\delta_{\theta_{i}^{0}} with θ10<θ20<<θk00\theta_{1}^{0}<\theta_{2}^{0}<\ldots<\theta_{k_{0}}^{0}. Let Θ1\Theta_{1} be a compact subset of Θ=(0,)\Theta=(0,\infty) such that the condition (B1) holds, and additionally satisfies maxΘ1>θk00\max\Theta_{1}>\theta_{k_{0}}^{0}. The reason for the additional condition will be discussed in the next paragraph. It is easy to establish that for any θ1,θ2Θ1\theta_{1},\theta_{2}\in\Theta_{1}

h2(f(x|θ1),f(x|θ2))=1min{θ1,θ2}max{θ1,θ2}1min{θ1,θ2}max{θ1,θ2}1minΘ1|θ2θ1|,h^{2}(f(x|\theta_{1}),f(x|\theta_{2}))=1-\sqrt{\frac{\min\{\theta_{1},\theta_{2}\}}{\max\{\theta_{1},\theta_{2}\}}}\leq 1-\frac{\min\{\theta_{1},\theta_{2}\}}{\max\{\theta_{1},\theta_{2}\}}\leq\frac{1}{\min\Theta_{1}}|\theta_{2}-\theta_{1}|,

and thus (37) holds with β0=12\beta_{0}=\frac{1}{2}.
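The closed-form computation above admits a quick numerical sanity check (illustrative only): the sketch below integrates the Hellinger affinity of two uniform densities by a midpoint rule, compares it with the displayed formula, and checks the resulting bound with β0 = 1/2; the parameter values are invented.

```python
import math

def hell_sq_uniform_numeric(t1, t2, n=100000):
    # midpoint-rule integration of sqrt(f(x|t1) f(x|t2)) over (0, max(t1, t2));
    # h^2 = 1 - affinity, with f(x|t) = 1/t on (0, t)
    hi = max(t1, t2)
    dx = hi / n
    aff = 0.0
    for i in range(n):
        x = (i + 0.5) * dx
        f1 = 1.0 / t1 if x < t1 else 0.0
        f2 = 1.0 / t2 if x < t2 else 0.0
        aff += math.sqrt(f1 * f2) * dx
    return 1.0 - aff

t1, t2 = 1.3, 2.0                                       # illustrative parameters
closed = 1.0 - math.sqrt(min(t1, t2) / max(t1, t2))     # formula in the display
numeric = hell_sq_uniform_numeric(t1, t2)
```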

Additional care is needed for this example since the support of f(x|θ)f(x|\theta) depends on θ\theta, and K(f(x|θ1),f(x|θ2))=K(f(x|\theta_{1}),f(x|\theta_{2}))=\infty for θ1>θ2\theta_{1}>\theta_{2}. In particular, the condition (36) does not hold for the uniform kernel. Since the condition (36) is only used to guarantee (163), we may directly verify (163) for the uniform kernel, so that the conclusions of Theorem 6.2 still hold. Note that the additional condition maxΘ1>θk00\max\Theta_{1}>\theta_{k_{0}}^{0} is necessary for (163), since maxΘ1=θk00\max\Theta_{1}=\theta_{k_{0}}^{0} implies Π(1mi=1mK(pG0,Ni,pG,Ni)ϵ2)=0\Pi\left(\frac{1}{{m}}\sum_{i=1}^{{m}}K(p_{G_{0},{N}_{i}},p_{G,{N}_{i}})\leq\epsilon^{2}\right)=0. (In fact, the condition maxΘ1>θk00\max\Theta_{1}>\theta_{k_{0}}^{0} is necessary for a common condition called the Kullback-Leibler property [17, Definition 6.15].)

We now verify (163). Denote θk0+10:=maxΘ1\theta_{k_{0}+1}^{0}:=\max\Theta_{1} and ρ:=12min1ik0(θi+10θi0)\rho:=\frac{1}{2}\min_{1\leq i\leq k_{0}}(\theta_{i+1}^{0}-\theta_{i}^{0}). In what follows for this example we always write G=i=1k0piδθik0(Θ1)G=\sum_{i=1}^{k_{0}}p_{i}\delta_{\theta_{i}}\in\mathcal{E}_{k_{0}}(\Theta_{1}) in its increasing representation w.r.t. θ\theta, i.e. θ1<θ2<<θk0\theta_{1}<\theta_{2}<\ldots<\theta_{k_{0}}. Consider the following set

A(G0):={G=i=1k0piδθik0(Θ1)|θi[θi0,θi0+ρ],pjpj0,i[k0],j2}.A(G_{0}):=\left\{\left.G=\sum_{i=1}^{k_{0}}p_{i}\delta_{\theta_{i}}\in\mathcal{E}_{k_{0}}(\Theta_{1})\right|\theta_{i}\in[\theta_{i}^{0},\ \theta_{i}^{0}+\rho],p_{j}\geq p_{j}^{0},\ \forall i\in[k_{0}],j\geq 2\right\}.

For any G=i=1k0piδθiA(G0)G=\sum_{i=1}^{k_{0}}p_{i}\delta_{\theta_{i}}\in A(G_{0}), let QQ be a coupling between G0G_{0} and GG specified as follows:

Q:=(α,β)Iqαβδ(θα0,θβ),Q:=\sum_{(\alpha,\beta)\in I}q_{\alpha\beta}\delta_{(\theta_{\alpha}^{0},\theta_{\beta})},

where

I:=\displaystyle\quad I:= I1I2,I1:=i=2k0{(i,i)},I2:=β=1k0{(1,β)},\displaystyle I_{1}\bigcup I_{2},\quad I_{1}:=\bigcup_{i=2}^{k_{0}}\{(i,i)\},\quad I_{2}:=\bigcup_{\beta=1}^{k_{0}}\{(1,\beta)\},
qαβ:=\displaystyle\quad q_{\alpha\beta}:= {pβ0,(α,β)I1pβpβ0,(α,β)I2,β2p1,(α,β)=(1,1).\displaystyle\begin{cases}p_{\beta}^{0},&(\alpha,\beta)\in I_{1}\\ p_{\beta}-p_{\beta}^{0},&(\alpha,\beta)\in I_{2},\beta\geq 2\\ p_{1},&(\alpha,\beta)=(1,1)\end{cases}.
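The coupling Q just defined can be verified numerically. The following hypothetical sketch (k0 = 3, 0-based indices, invented weights) constructs the array qαβ and checks that it is nonnegative with marginals equal to the weights of G0 and G.

```python
def build_coupling(p0, p):
    # q_{i,i} = p0_i for i >= 2 (1-based), q_{1,b} = p_b - p0_b for b >= 2,
    # q_{1,1} = p_1; translated here to 0-based indices
    k = len(p0)
    q = {(i, i): p0[i] for i in range(1, k)}               # pairs in I_1
    q.update({(0, b): p[b] - p0[b] for b in range(1, k)})  # pairs (1, beta)
    q[(0, 0)] = p[0]
    return q

p0 = [0.5, 0.3, 0.2]    # weights of G0 (illustrative)
p = [0.35, 0.35, 0.3]   # weights of G in A(G0): p_j >= p0_j for j >= 2
q = build_coupling(p0, p)
row = [sum(v for (a, b), v in q.items() if a == i) for i in range(3)]
col = [sum(v for (a, b), v in q.items() if b == j) for j in range(3)]
```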

Then for any N1N\geq 1,

K(pG0,N,pG,N)=\displaystyle K(p_{G_{0},N},p_{G,N})= K((α,β)Iqαβj=1Nf(xj|θα0),(α,β)Iqαβj=1Nf(xj|θβ))\displaystyle K\left(\sum_{(\alpha,\beta)\in I}q_{\alpha\beta}\prod_{j=1}^{N}f(x_{j}|\theta_{\alpha}^{0}),\sum_{(\alpha,\beta)\in I}q_{\alpha\beta}\prod_{j=1}^{N}f(x_{j}|\theta_{\beta})\right)
\displaystyle\leq (α,β)IqαβK(j=1Nf(xj|θα0),j=1Nf(xj|θβ))\displaystyle\sum_{(\alpha,\beta)\in I}q_{\alpha\beta}K\left(\prod_{j=1}^{N}f(x_{j}|\theta_{\alpha}^{0}),\prod_{j=1}^{N}f(x_{j}|\theta_{\beta})\right)
=\displaystyle= (α,β)IqαβNK(f(x1|θα0),f(x1|θβ)),\displaystyle\sum_{(\alpha,\beta)\in I}q_{\alpha\beta}NK\left(f(x_{1}|\theta_{\alpha}^{0}),f(x_{1}|\theta_{\beta})\right), (168)

where the inequality follows from the joint convexity of the Kullback-Leibler divergence. For any θ1,θ2Θ1\theta_{1},\theta_{2}\in\Theta_{1} with θ1θ2\theta_{1}\leq\theta_{2},

K(f(x|θ1),f(x|θ2))=ln(θ2θ1)θ2θ1θ1θ2θ1minΘ1.K(f(x|\theta_{1}),f(x|\theta_{2}))=\ln\left(\frac{\theta_{2}}{\theta_{1}}\right)\leq\frac{\theta_{2}-\theta_{1}}{\theta_{1}}\leq\frac{\theta_{2}-\theta_{1}}{\min\Theta_{1}}. (169)

By our choice of II, θα0θβ\theta_{\alpha}^{0}\leq\theta_{\beta} for any (α,β)I(\alpha,\beta)\in I. Plugging (169) into (168),

K(pG0,N,pG,N)\displaystyle K(p_{G_{0},N},p_{G,N})\leq NminΘ1(α,β)Iqαβ(θβθα0)\displaystyle\frac{N}{\min\Theta_{1}}\sum_{(\alpha,\beta)\in I}q_{\alpha\beta}(\theta_{\beta}-\theta_{\alpha}^{0})
\displaystyle\leq NminΘ1max{1,diam(Θ1)}(β=1k0(θβθβ0)+β=2k0(pβpβ0)).\frac{N}{\min\Theta_{1}}\max\{1,diam(\Theta_{1})\}\left(\sum_{\beta=1}^{k_{0}}\left(\theta_{\beta}-\theta_{\beta}^{0}\right)+\sum_{\beta=2}^{k_{0}}\left(p_{\beta}-p_{\beta}^{0}\right)\right). (170)

In fact, one can show (α,β)Iqαβ(θβθα0)=W1(G0,G)\sum_{(\alpha,\beta)\in I}q_{\alpha\beta}(\theta_{\beta}-\theta_{\alpha}^{0})=W_{1}(G_{0},G), but we do not need this fact here. Now by (170), for any GA(G0)G\in A(G_{0}),

1mi=1mK(pG0,Ni,pG,Ni)C(Θ1)N¯m(β=1k0(θβθβ0)+β=2k0(pβpβ0)).\frac{1}{{m}}\sum_{i=1}^{{m}}K(p_{G_{0},{N}_{i}},p_{G,{N}_{i}})\leq C(\Theta_{1})\bar{N}_{m}\left(\sum_{\beta=1}^{k_{0}}\left(\theta_{\beta}-\theta_{\beta}^{0}\right)+\sum_{\beta=2}^{k_{0}}\left(p_{\beta}-p_{\beta}^{0}\right)\right).

Thus

Π(1mi=1mK(pG0,Ni,pG,Ni)ϵ2)\displaystyle\Pi\left(\frac{1}{{m}}\sum_{i=1}^{{m}}K(p_{G_{0},{N}_{i}},p_{G,{N}_{i}})\leq\epsilon^{2}\right)
\displaystyle\geq Π(A(G0){C(Θ1)N¯m(β=1k0(θβθβ0)+β=2k0(pβpβ0))ϵ2})\displaystyle\Pi\left(A(G_{0})\bigcap\left\{C(\Theta_{1})\bar{N}_{m}\left(\sum_{\beta=1}^{k_{0}}\left(\theta_{\beta}-\theta_{\beta}^{0}\right)+\sum_{\beta=2}^{k_{0}}\left(p_{\beta}-p_{\beta}^{0}\right)\right)\leq\epsilon^{2}\right\}\right)
\displaystyle\gtrsim (ϵ2N¯mC(Θ1))k0(ϵ2N¯mC(Θ1))k01,\displaystyle\left(\frac{\epsilon^{2}}{\bar{{N}}_{{m}}C(\Theta_{1})}\right)^{k_{0}}\left(\frac{\epsilon^{2}}{\bar{{N}}_{{m}}C(\Theta_{1})}\right)^{k_{0}-1},

which is (163) for the example of f(x|θ)f(x|\theta) being uniform kernels.

As a result, the conclusions of Theorem 6.2 hold. Moreover, by Example 5.20, n1(G0)=1n_{1}(G_{0})=1, and one can directly verify that n0(G0,kk0k(Θ1))=1n_{0}(G_{0},\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1}))=1.

Details for the location-scale exponential kernel in Example 5.21. By calculations similar to those above, one may show that the conclusion of Theorem 6.2 holds even though the KL-divergence is not Lipschitz as required in assumption (B2). By Example 5.21, n1(G0)=1n_{1}(G_{0})=1, and one can directly verify that n0(G0,kk0k(Θ1))=1n_{0}(G_{0},\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1}))=1.

Details for the case where the kernel is a location-mixture Gaussian in Example 5.22. It suffices to verify assumption (B2) so that Theorem 6.2 can be applied.

Note that

h(f(x|θ1),f(x|θ2))=\displaystyle h(f(x|\theta_{1}),f(x|\theta_{2}))= h(i=1kπi1f𝒩(x|μi1,σ2),i=1kπi2f𝒩(x|μi2,σ2))\displaystyle h\left(\sum_{i=1}^{k}\pi_{i1}f_{\mathcal{N}}(x|\mu_{i1},\sigma^{2}),\sum_{i=1}^{k}\pi_{i2}f_{\mathcal{N}}(x|\mu_{i2},\sigma^{2})\right)
()\displaystyle\overset{(*)}{\leq} max1ikh(f𝒩(x|μi1,σ2),f𝒩(x|μi2,σ2))+12i=1k|πi1πi2|\displaystyle\max_{1\leq i\leq k}h\left(f_{\mathcal{N}}(x|\mu_{i1},\sigma^{2}),f_{\mathcal{N}}(x|\mu_{i2},\sigma^{2})\right)+\sqrt{\frac{1}{2}\sum_{i=1}^{k}\left|\pi_{i1}-\pi_{i2}\right|}
()\displaystyle\overset{(**)}{\leq} 122σmax1ik|μi1μi2|+12i=1k|πi1πi2|\displaystyle\frac{1}{2\sqrt{2}\sigma}\max_{1\leq i\leq k}|\mu_{i1}-\mu_{i2}|+\sqrt{\frac{1}{2}\sum_{i=1}^{k}\left|\pi_{i1}-\pi_{i2}\right|}
\displaystyle\leq C(σ,k,Θ1)θ1θ22,\displaystyle C(\sigma,k,\Theta_{1})\sqrt{\|\theta_{1}-\theta_{2}\|_{2}},

where step ()(*) follows from Lemma 8.2, and step ()(**) follows from the formula of Hellinger distance between two Gaussian distributions. We also have

K(f(x|θ1),f(x|θ2))=\displaystyle K(f(x|\theta_{1}),f(x|\theta_{2}))= K(i=1kπi1f𝒩(x|μi1,σ2),j=1kπj2f𝒩(x|μj2,σ2))\displaystyle K\left(\sum_{i=1}^{k}\pi_{i1}f_{\mathcal{N}}(x|\mu_{i1},\sigma^{2}),\sum_{j=1}^{k}\pi_{j2}f_{\mathcal{N}}(x|\mu_{j2},\sigma^{2})\right)
=()\displaystyle\overset{(*)}{=} minqK(i,j=1kqijf𝒩(x|μi1,σ2),i,j=1kqijf𝒩(x|μj2,σ2))\displaystyle\min_{q}K\left(\sum_{i,j=1}^{k}q_{ij}f_{\mathcal{N}}(x|\mu_{i1},\sigma^{2}),\sum_{i,j=1}^{k}q_{ij}f_{\mathcal{N}}(x|\mu_{j2},\sigma^{2})\right)
()\displaystyle\overset{(**)}{\leq} minqi,j=1kqijK(f𝒩(x|μi1,σ2),f𝒩(x|μj2,σ2))\displaystyle\min_{q}\sum_{i,j=1}^{k}q_{ij}K\left(f_{\mathcal{N}}(x|\mu_{i1},\sigma^{2}),f_{\mathcal{N}}(x|\mu_{j2},\sigma^{2})\right)
=()\displaystyle\overset{(***)}{=} minqi,j=1kqij|μi1μj2|22σ2\displaystyle\min_{q}\sum_{i,j=1}^{k}q_{ij}\frac{|\mu_{i1}-\mu_{j2}|^{2}}{2\sigma^{2}}
\displaystyle\leq C(σ,k,Θ1)θ1θ22,\displaystyle C(\sigma,k,\Theta_{1})\|\theta_{1}-\theta_{2}\|_{2},

where in step ()(*) (qij)i,j[k](q_{ij})_{i,j\in[k]} is any coupling between (πi1)i[k](\pi_{i1})_{i\in[k]} and (πj2)j[k](\pi_{j2})_{j\in[k]} and the minimization is taken over all such couplings, step ()(**) follows from the joint convexity of the KL-divergence (cf. Lemma 8.2), and step ()(***) follows from the formula for the KL-divergence between two Gaussian distributions.

Thus assumption (B2) is satisfied. Moreover, by Appendix 9.2, n1(G0)<n_{1}(G_{0})<\infty. Hence Theorem 6.2 holds. The calculations of n0(G0,kk0k(Θ1))n_{0}(G_{0},\cup_{k\leq k_{0}}\mathcal{E}_{k}(\Theta_{1})) and n1(G0)n_{1}(G_{0}) are left as exercises for interested readers.

Appendix F Proofs and calculation details for Section 7

F.1 Proofs for Section 7.1

Proof of Lemma 7.2.

Throughout this proof, c()c(\cdot) denotes a positive constant that depends only on the parameters indicated.
Claim 1: There exists c6>0c_{6}>0 depending only on d,j1,,jdd,j_{1},\ldots,j_{d} such that Smin(A(x))c6|x|(jdj1)(d1)S_{\text{min}}(A(x))\geq c_{6}|x|^{-(j_{d}-j_{1})(d-1)} for any |x|>1|x|>1. Suppose this is not true. Then there exists a sequence {xm}m=1\{x_{{m}}\}_{{m}=1}^{\infty} such that |xm|>1|x_{{m}}|>1 and, as mm\to\infty,

|xm|(jdj1)(d1)Smin(A(xm))0.|x_{{m}}|^{(j_{d}-j_{1})(d-1)}S_{\text{min}}(A(x_{{m}}))\to 0. (171)

Let B(x,t)=|x|tA(x)B(x,t)=|x|^{t}A(x) with tt being some positive number to be specified. The characteristic polynomial of B(x,t)B(x,t)B(x,t)B^{\top}(x,t) is

det(λIB(x,t)B(x,t))=λd+i=0d1γi(x,t)λi.\text{det}\left(\lambda I-B(x,t)B^{\top}(x,t)\right)=\lambda^{d}+\sum_{i=0}^{d-1}\gamma_{i}(x,t)\lambda^{i}.

When |x|>1|x|>1, since |Aαβ(x)|c4(d,j1,,jd)|x|jdj1|A_{\alpha\beta}(x)|\leq c_{4}(d,j_{1},\cdots,j_{d})|x|^{j_{d}-j_{1}} for any α,β[d]\alpha,\beta\in[d], the entries of B(x,t)B(x,t)B(x,t)B^{\top}(x,t) are bounded by d(c4(d,j1,,jd)|x|(jdj1+t))2d\left(c_{4}(d,j_{1},\cdots,j_{d})|x|^{(j_{d}-j_{1}+t)}\right)^{2}. Thus |γi(x,t)|c8(d,j1,,jd)(|x|(jdj1+t))2(di)|\gamma_{i}(x,t)|\leq c_{8}(d,j_{1},\cdots,j_{d})\left(|x|^{(j_{d}-j_{1}+t)}\right)^{2(d-i)} for 1id11\leq i\leq d-1. Moreover,

|γ0(x,t)|=|xdtdet(A(x))|2=(i=1dji!)2|x|2dt=c5(d,j1,,jd)|x|2dt,|\gamma_{0}(x,t)|=\left|x^{dt}\text{det}(A(x))\right|^{2}=\left(\prod_{i=1}^{d}j_{i}!\right)^{2}|x|^{2dt}=c_{5}(d,j_{1},\cdots,j_{d})|x|^{2dt},

with c5(d,j1,,jd)=(i=1dji!)2>0c_{5}(d,j_{1},\cdots,j_{d})=(\prod_{i=1}^{d}j_{i}!)^{2}>0. Let λmin(x,t)0\lambda_{\text{min}}(x,t)\geq 0 be the smallest eigenvalue of B(x,t)B(x,t)B(x,t)B^{\top}(x,t). Then

λmind(x,t)+i=0d1γi(x,t)λmini(x,t)=0.\lambda_{\text{min}}^{d}(x,t)+\sum_{i=0}^{d-1}\gamma_{i}(x,t)\lambda_{\text{min}}^{i}(x,t)=0.

When x0x\not=0, λmin(x,t)>0\lambda_{\text{min}}(x,t)>0 since γ0(x,t)0\gamma_{0}(x,t)\not=0. Thus when x0x\not=0,

1λmin(x,t)=1γ0(x,t)λmind1(x,t)i=1d1γi(x,t)γ0(x,t)λmini1(x,t).\frac{1}{\lambda_{\text{min}}(x,t)}=-\frac{1}{\gamma_{0}(x,t)}\lambda_{\text{min}}^{d-1}(x,t)-\sum_{i=1}^{d-1}\frac{\gamma_{i}(x,t)}{\gamma_{0}(x,t)}\lambda_{\text{min}}^{i-1}(x,t). (172)

Moreover, when |x|>1|x|>1, |γi(x,t)γ0(x,t)|c8(d,j1,,jd)c5(d,j1,,jd)|x|2(jdj1)(di)|x|2tic8(d,j1,,jd)c5(d,j1,,jd)|x|2(jdj1)(d1)|x|2t\left|\frac{\gamma_{i}(x,t)}{\gamma_{0}(x,t)}\right|\leq\frac{c_{8}(d,j_{1},\cdots,j_{d})}{c_{5}(d,j_{1},\cdots,j_{d})}\frac{|x|^{2(j_{d}-j_{1})(d-i)}}{|x|^{2ti}}\leq\frac{c_{8}(d,j_{1},\cdots,j_{d})}{c_{5}(d,j_{1},\cdots,j_{d})}\frac{|x|^{2(j_{d}-j_{1})(d-1)}}{|x|^{2t}} for any 1id11\leq i\leq d-1. Then by (171), λmin(xm,t0)=(|xm|(jdj1)(d1)Smin(A(xm)))20\lambda_{\text{min}}(x_{{m}},t_{0})=\left(|x_{{m}}|^{(j_{d}-j_{1})(d-1)}S_{\text{min}}(A(x_{{m}}))\right)^{2}\to 0, where t0=(jdj1)(d1)t_{0}=(j_{d}-j_{1})(d-1), so 1λmin(xm,t0)\frac{1}{\lambda_{\text{min}}(x_{m},t_{0})}\to\infty. On the other hand, since |1γ0(xm,t0)||\frac{1}{\gamma_{0}(x_{{m}},t_{0})}| and γi(xm,t0)γ0(xm,t0)\frac{\gamma_{i}(x_{{m}},t_{0})}{\gamma_{0}(x_{{m}},t_{0})} are bounded and λmin(xm,t0)0\lambda_{\text{min}}(x_{{m}},t_{0})\to 0,

lim supm|1γ0(xm,t0)λmind1(xm,t0)i=1d1γi(xm,t0)γ0(xm,t0)λmini1(xm,t0)|=lim supm|γ1(xm,t0)γ0(xm,t0)|c8(d,j1,,jd)c5(d,j1,,jd).\limsup_{{m}\to\infty}\left|-\frac{1}{\gamma_{0}(x_{{m}},t_{0})}\lambda_{\text{min}}^{d-1}(x_{{m}},t_{0})-\sum_{i=1}^{d-1}\frac{\gamma_{i}(x_{{m}},t_{0})}{\gamma_{0}(x_{{m}},t_{0})}\lambda_{\text{min}}^{i-1}(x_{{m}},t_{0})\right|\\ =\limsup_{{m}\to\infty}\left|\frac{\gamma_{1}(x_{{m}},t_{0})}{\gamma_{0}(x_{{m}},t_{0})}\right|\leq\frac{c_{8}(d,j_{1},\cdots,j_{d})}{c_{5}(d,j_{1},\cdots,j_{d})}.

This contradicts (172), and hence the claim at the beginning of this paragraph is established.

Since Smin(A(x))>0S_{\text{min}}(A(x))>0 on |x|1|x|\leq 1 and Smin(A(x))S_{\text{min}}(A(x)) is continuous,

minx[1,1]Smin(A(x))c7>0.\min_{x\in[-1,1]}S_{\text{min}}(A(x))\geq c_{7}>0.

Then take c3=min{c6,c7}c_{3}=\min\{c_{6},c_{7}\} and the proof is complete. ∎
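As a numerical illustration of the rate in Claim 1 (an arbitrary low-dimensional example, not part of the proof): for d = 2 and (j1, j2) = (1, 2), so that α0 = (jd − j1)(d − 1) = 1, the rescaled quantity Smin(A(x)) |x|^{α0} stabilizes near a positive constant as |x| grows, consistent with the lower bound c6 |x|^{−α0}.

```python
import numpy as np
from math import factorial

def A(x, js):
    # The matrix from the proof of Lemma 7.3: A_{ab} = j_b!/(j_b - j_a)! * x^(j_b - j_a)
    # for a <= b, and 0 below the diagonal (0-based indices here).
    d = len(js)
    M = np.zeros((d, d))
    for a in range(d):
        for b in range(a, d):
            M[a, b] = factorial(js[b]) / factorial(js[b] - js[a]) * x ** (js[b] - js[a])
    return M

js = [1, 2]                                  # d = 2
alpha0 = (js[-1] - js[0]) * (len(js) - 1)    # alpha0 = 1
smin = lambda x: np.linalg.svd(A(x, js), compute_uv=False)[-1]
ratios = [smin(x) * abs(x) ** alpha0 for x in (5.0, 50.0, 500.0)]
assert all(0.9 < r < 1.1 for r in ratios)
```

Here det(A(x))² = (Π ji!)² is constant in x, so the smallest eigenvalue of A(x)A(x)ᵀ is comparable to det²/λmax, which for this example decays exactly at the rate |x|^{−2α0}.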

Proof of Lemma 7.3.

Let ψw(x)=wTx\psi_{w}(x)=w^{\top}Tx. Then

(dj1ψw(x)dxj1,dj2ψw(x)dxj2,,djdψw(x)dxjd)=A(x)w\left(\frac{d^{j_{1}}\psi_{w}(x)}{dx^{j_{1}}},\frac{d^{j_{2}}\psi_{w}(x)}{dx^{j_{2}}},\ldots,\frac{d^{j_{d}}\psi_{w}(x)}{dx^{j_{d}}}\right)^{\top}=A(x)w

where A(x)d×dA(x)\in\mathbb{R}^{d\times d} with entries Aαβ(x)=0A_{\alpha\beta}(x)=0 for α>β\alpha>\beta and Aαβ(x)=jβ!(jβjα)!xjβjαA_{\alpha\beta}(x)=\frac{j_{\beta}!}{(j_{\beta}-j_{\alpha})!}x^{j_{\beta}-j_{\alpha}} for αβ\alpha\leq\beta. Then for any wSd1w\in S^{d-1},

max1id|djiψw(x)dxji|=A(x)w1dA(x)w21dSmin(A(x))1dc3max{1,|x|}α0,\max_{1\leq i\leq d}\left|\frac{d^{j_{i}}\psi_{w}(x)}{dx^{j_{i}}}\right|=\|A(x)w\|_{\infty}\geq\frac{1}{\sqrt{d}}\|A(x)w\|_{2}\geq\frac{1}{\sqrt{d}}S_{\text{min}}(A(x))\geq\frac{1}{\sqrt{d}}c_{3}\max\{1,|x|\}^{-\alpha_{0}}, (173)

where α0=(jdj1)(d1)\alpha_{0}=(j_{d}-j_{1})(d-1) and the last inequality follows from Lemma 7.2.

Case 1: (j1>1j_{1}>1). Partition the real line according to the increasing sequence {at}t=\{a_{t}\}_{t=-\infty}^{\infty} where

at={2at+1t1c11t=0bt1tc1+1t=+12at1t+2.a_{t}=\begin{cases}2a_{t+1}&t\leq-1\\ \lfloor-c_{1}\rfloor-1&t=0\\ b_{t}&1\leq t\leq\ell\\ \lceil c_{1}\rceil+1&t=\ell+1\\ 2a_{t-1}&t\geq\ell+2\end{cases}.

For t1t\leq-1, by (173) we know max1id|djiψw(x)dxji|1dc3|at|α0\max\limits_{1\leq i\leq d}\left|\frac{d^{j_{i}}\psi_{w}(x)}{dx^{j_{i}}}\right|\geq\frac{1}{\sqrt{d}}c_{3}|a_{t}|^{-\alpha_{0}} for all x[at,at+1]x\in[a_{t},a_{t+1}]. In order to appeal to Lemma 7.1, we need to specify the points {tβ}β=0β0\{t_{\beta}\}_{\beta=0}^{\beta_{0}} with t0=at<t1<<tβ0=at+1t_{0}=a_{t}<t_{1}<\ldots<t_{\beta_{0}}=a_{t+1}, where {tβ}β=1β01\{t_{\beta}\}_{\beta=1}^{\beta_{0}-1} is defined as the set of roots in (at,at+1)(a_{t},a_{t+1}) of any of the following d1d-1 equations,

|djiψw(x)dxji|=1dc3|at|α0,i[d1].\left|\frac{d^{j_{i}}\psi_{w}(x)}{dx^{j_{i}}}\right|=\frac{1}{\sqrt{d}}c_{3}|a_{t}|^{-\alpha_{0}},\quad i\in[d-1].

Thus {tβ}β=0β0\{t_{\beta}\}_{\beta=0}^{\beta_{0}} is a partition of [at,at+1][a_{t},a_{t+1}] such that for each 0ββ010\leq\beta\leq\beta_{0}-1, |djkβψw(x)dxjkβ|1dc3|at|α0\left|\frac{d^{j_{k_{\beta}}}\psi_{w}(x)}{dx^{j_{k_{\beta}}}}\right|\geq\frac{1}{\sqrt{d}}c_{3}|a_{t}|^{-\alpha_{0}} holds for some index kβ[d]k_{\beta}\in[d] and for all x[tβ,tβ+1]x\in[t_{\beta},t_{\beta+1}]. Since each djmψw(x)dxjm\frac{d^{j_{m}}\psi_{w}(x)}{dx^{j_{m}}} is a polynomial of degree jdjmj_{d}-j_{m}, it follows that β012m=1d(jdjm)\beta_{0}-1\leq 2\sum_{m=1}^{d}(j_{d}-j_{m}). Let c~0\tilde{c}_{0} be the maximum of {c~jm}m=1d\{\tilde{c}_{j_{m}}\}_{m=1}^{d}, where c~jm\tilde{c}_{j_{m}} are the coefficients ckc_{k} corresponding to k=jmk=j_{m} in Lemma 7.1. Then by Lemma 7.1, for λ>1\lambda>1

|[tβ,tβ+1]e𝒊λψw(x)f(x)dx|\displaystyle\left|\int_{[t_{\beta},t_{\beta+1}]}e^{\bm{i}\lambda\psi_{w}(x)}f(x)dx\right|
\displaystyle\leq c~0(c3|at|α0λd)1jkβ(|f(tβ+1)|+[tβ,tβ+1]|f(x)|dx)\displaystyle\tilde{c}_{0}\left(\frac{c_{3}|a_{t}|^{-\alpha_{0}}\lambda}{\sqrt{d}}\right)^{-\frac{1}{j_{k_{\beta}}}}\left(|f(t_{\beta+1})|+\int_{[t_{\beta},t_{\beta+1}]}|f^{\prime}(x)|dx\right)
\displaystyle\leq c~0max{c31j1,c31jd}(d)1j1λ1jd(|at|α0)1j1(f(at+1)+[tβ,tβ+1]|f(x)|dx),\displaystyle\tilde{c}_{0}\max\left\{c_{3}^{-\frac{1}{j_{1}}},c_{3}^{-\frac{1}{j_{d}}}\right\}(\sqrt{d})^{\frac{1}{j_{1}}}\lambda^{-\frac{1}{j_{d}}}(|a_{t}|^{\alpha_{0}})^{\frac{1}{j_{1}}}\left(f(a_{t+1})+\int_{[t_{\beta},t_{\beta+1}]}|f^{\prime}(x)|dx\right), (174)

where the last step follows from f(x)f(x) being increasing on (,c1)(-\infty,-c_{1}). Then for λ>1\lambda>1

|[at,at+1]e𝒊λψw(x)f(x)dx|\displaystyle\left|\int_{[a_{t},a_{t+1}]}e^{\bm{i}\lambda\psi_{w}(x)}f(x)dx\right|
\displaystyle\leq β=0β01|[tβ,tβ+1]e𝒊λψw(x)f(x)dx|\displaystyle\sum_{\beta=0}^{\beta_{0}-1}\left|\int_{[t_{\beta},t_{\beta+1}]}e^{\bm{i}\lambda\psi_{w}(x)}f(x)dx\right|
()\displaystyle\overset{(*)}{\leq} c~0max{c31j1,c31jd}(d)1j1λ1jd(|at|α0)1j1(β0f(at+1)+[at,at+1]|f(x)|dx)\displaystyle\tilde{c}_{0}\max\left\{c_{3}^{-\frac{1}{j_{1}}},c_{3}^{-\frac{1}{j_{d}}}\right\}(\sqrt{d})^{\frac{1}{j_{1}}}\lambda^{-\frac{1}{j_{d}}}(|a_{t}|^{\alpha_{0}})^{\frac{1}{j_{1}}}\left(\beta_{0}f(a_{t+1})+\int_{[a_{t},a_{t+1}]}|f^{\prime}(x)|dx\right)
()\displaystyle\overset{(**)}{\leq} c~0(c31j1+c31jd)d12j1λ1jd2α0j1(β1|at+1|α0j1f(at+1)+|at+1|α0j1[at,at+1]|f(x)|dx)\displaystyle\tilde{c}_{0}\left(c_{3}^{-\frac{1}{j_{1}}}+c_{3}^{-\frac{1}{j_{d}}}\right)d^{\frac{1}{2j_{1}}}\lambda^{-\frac{1}{j_{d}}}2^{\frac{\alpha_{0}}{j_{1}}}\left(\beta_{1}|a_{t+1}|^{\frac{\alpha_{0}}{j_{1}}}f(a_{t+1})+|a_{t+1}|^{\frac{\alpha_{0}}{j_{1}}}\int_{[a_{t},a_{t+1}]}|f^{\prime}(x)|dx\right)
\displaystyle\leq C(d,j1,,jd)λ1jd(|at+1|α0j1f(at+1)+[at,at+1]|x|α0j1|f(x)|dx),\displaystyle C(d,j_{1},\cdots,j_{d})\lambda^{-\frac{1}{j_{d}}}\left(|a_{t+1}|^{\frac{\alpha_{0}}{j_{1}}}f(a_{t+1})+\int_{[a_{t},a_{t+1}]}|x|^{\frac{\alpha_{0}}{j_{1}}}|f^{\prime}(x)|dx\right), (175)

where step ()(*) follows from (174), step ()(**) follows from at=2at+1a_{t}=2a_{t+1} and β0β1:=2m=1d(jdjm)+1\beta_{0}\leq\beta_{1}:=2\sum_{m=1}^{d}(j_{d}-j_{m})+1, and the last step follows from β11\beta_{1}\geq 1, |at||x||at+1||a_{t}|\geq|x|\geq|a_{t+1}| for all x[at,at+1]x\in[a_{t},a_{t+1}] and C(d,j1,,jd)=c~0(c31j1+c31jd)d12j12α0j1β1C(d,j_{1},\cdots,j_{d})=\tilde{c}_{0}\left(c_{3}^{-\frac{1}{j_{1}}}+c_{3}^{-\frac{1}{j_{d}}}\right)d^{\frac{1}{2j_{1}}}2^{\frac{\alpha_{0}}{j_{1}}}\beta_{1}.

For t+1t\geq\ell+1, following similar steps as in the case t1t\leq-1, one obtains

|[at,at+1]e𝒊λψw(x)f(x)dx|\displaystyle\left|\int_{[a_{t},a_{t+1}]}e^{\bm{i}\lambda\psi_{w}(x)}f(x)dx\right|
\displaystyle\leq C(d,j1,,jd)λ1jd(|at|α0j1f(at)+[at,at+1]|x|α0j1|f(x)|dx),\displaystyle C(d,j_{1},\cdots,j_{d})\lambda^{-\frac{1}{j_{d}}}\left(|a_{t}|^{\frac{\alpha_{0}}{j_{1}}}f(a_{t})+\int_{[a_{t},a_{t+1}]}|x|^{\frac{\alpha_{0}}{j_{1}}}|f^{\prime}(x)|dx\right), (176)

where C(d,j1,,jd)C(d,j_{1},\cdots,j_{d}) is the same as in (175).

For 0t0\leq t\leq\ell, since ff^{\prime} is continuous on (at,at+1)(a_{t},a_{t+1}) and Lebesgue integrable on [at,at+1][a_{t},a_{t+1}], the limits limxat+1f(x)\lim_{x\to a_{t+1}^{-}}f(x) and limxat+f(x)\lim_{x\to a_{t}^{+}}f(x) exist. Define f~(x)=f(x)𝟏(at,at+1)(x)+𝟏{at+1}(x)limxat+1f(x)+𝟏{at}(x)limxat+f(x)\tilde{f}(x)=f(x)\bm{1}_{(a_{t},a_{t+1})}(x)+\bm{1}_{\{a_{t+1}\}}(x)\lim_{x\to a_{t+1}^{-}}f(x)+\bm{1}_{\{a_{t}\}}(x)\lim_{x\to a_{t}^{+}}f(x). Then f~(x)\tilde{f}(x) is absolutely continuous on [at,at+1][a_{t},a_{t+1}]. Moreover, by (173) we know max1id|djiψw(x)dxji|1dc3(c1+2)α0\max\limits_{1\leq i\leq d}\left|\frac{d^{j_{i}}\psi_{w}(x)}{dx^{j_{i}}}\right|\geq\frac{1}{\sqrt{d}}c_{3}(c_{1}+2)^{-\alpha_{0}} for all x[at,at+1]x\in[a_{t},a_{t+1}]. Following the same argument as in the case t1t\leq-1, define {t~β}β=0β~0\{\tilde{t}_{\beta}\}_{\beta=0}^{\tilde{\beta}_{0}} with t~0=at<t~1<<t~β~0=at+1\tilde{t}_{0}=a_{t}<\tilde{t}_{1}<\ldots<\tilde{t}_{\tilde{\beta}_{0}}=a_{t+1}, where {t~β}β=1β~01\{\tilde{t}_{\beta}\}_{\beta=1}^{\tilde{\beta}_{0}-1} is the set of roots in (at,at+1)(a_{t},a_{t+1}) of the following d1d-1 equations

|djiψw(x)dxji|=1dc3(c1+2)α0,i[d1].\left|\frac{d^{j_{i}}\psi_{w}(x)}{dx^{j_{i}}}\right|=\frac{1}{\sqrt{d}}c_{3}(c_{1}+2)^{-\alpha_{0}},\quad i\in[d-1].

Then {t~β}β=0β~0\{\tilde{t}_{\beta}\}_{\beta=0}^{\tilde{\beta}_{0}} is a partition of [at,at+1][a_{t},a_{t+1}] such that for each 0ββ~010\leq\beta\leq\tilde{\beta}_{0}-1, |djkβψw(x)dxjkβ|1dc3(c1+2)α0\left|\frac{d^{j_{k_{\beta}}}\psi_{w}(x)}{dx^{j_{k_{\beta}}}}\right|\geq\frac{1}{\sqrt{d}}c_{3}(c_{1}+2)^{-\alpha_{0}} for some kβ[d]k_{\beta}\in[d] and for all x[t~β,t~β+1]x\in[\tilde{t}_{\beta},\tilde{t}_{\beta+1}]. Since each djmψw(x)dxjm\frac{d^{j_{m}}\psi_{w}(x)}{dx^{j_{m}}} is a polynomial of degree jdjmj_{d}-j_{m}, we have β~012m=1d(jdjm)\tilde{\beta}_{0}-1\leq 2\sum_{m=1}^{d}(j_{d}-j_{m}). Thus by Lemma 7.1, for any λ>1\lambda>1

|[t~β,t~β+1]e𝒊λψw(x)f(x)dx|\displaystyle\left|\int_{[\tilde{t}_{\beta},\tilde{t}_{\beta+1}]}e^{\bm{i}\lambda\psi_{w}(x)}f(x)dx\right|
=\displaystyle= |[t~β,t~β+1]e𝒊λψw(x)f~(x)dx|\displaystyle\left|\int_{[\tilde{t}_{\beta},\tilde{t}_{\beta+1}]}e^{\bm{i}\lambda\psi_{w}(x)}\tilde{f}(x)dx\right|
\displaystyle\leq c~0(c3(c1+2)α0λd)1jkβ(|f~(t~β+1)|+[t~β,t~β+1]|f(x)|dx)\displaystyle\tilde{c}_{0}\left(\frac{c_{3}(c_{1}+2)^{-\alpha_{0}}\lambda}{\sqrt{d}}\right)^{-\frac{1}{j_{k_{\beta}}}}\left(|\tilde{f}(\tilde{t}_{\beta+1})|+\int_{[\tilde{t}_{\beta},\tilde{t}_{\beta+1}]}|f^{\prime}(x)|dx\right)
\displaystyle\leq c~0max{c31j1,c31jd}(d)1j1λ1jd((c1+2)α0)1j1(fL+[t~β,t~β+1]|f(x)|dx),\displaystyle\tilde{c}_{0}\max\left\{c_{3}^{-\frac{1}{j_{1}}},c_{3}^{-\frac{1}{j_{d}}}\right\}(\sqrt{d})^{\frac{1}{j_{1}}}\lambda^{-\frac{1}{j_{d}}}((c_{1}+2)^{\alpha_{0}})^{\frac{1}{j_{1}}}\left(\|f\|_{L^{\infty}}+\int_{[\tilde{t}_{\beta},\tilde{t}_{\beta+1}]}|f^{\prime}(x)|dx\right), (177)

where the last step follows from |f~(t~β+1)|fL|\tilde{f}(\tilde{t}_{\beta+1})|\leq\|f\|_{L^{\infty}}. Then for any λ>1\lambda>1

|[at,at+1]e𝒊λψw(x)f(x)dx|\displaystyle\left|\int_{[a_{t},a_{t+1}]}e^{\bm{i}\lambda\psi_{w}(x)}f(x)dx\right|
\displaystyle\leq β=0β~01|[t~β,t~β+1]e𝒊λψw(x)f(x)dx|\displaystyle\sum_{\beta=0}^{\tilde{\beta}_{0}-1}\left|\int_{[\tilde{t}_{\beta},\tilde{t}_{\beta+1}]}e^{\bm{i}\lambda\psi_{w}(x)}f(x)dx\right|
\displaystyle\leq c~0max{c31j1,c31jd}(d)1j1λ1jd((c1+2)α0)1j1(β~0fL+[at,at+1]|f(x)|dx)\displaystyle\tilde{c}_{0}\max\left\{c_{3}^{-\frac{1}{j_{1}}},c_{3}^{-\frac{1}{j_{d}}}\right\}(\sqrt{d})^{\frac{1}{j_{1}}}\lambda^{-\frac{1}{j_{d}}}((c_{1}+2)^{\alpha_{0}})^{\frac{1}{j_{1}}}\left(\tilde{\beta}_{0}\|f\|_{L^{\infty}}+\int_{[a_{t},a_{t+1}]}|f^{\prime}(x)|dx\right)
\displaystyle\leq C(d,j1,,jd)λ1jd(c1+2)α0j1(fL+[at,at+1]|f(x)|dx),\displaystyle C(d,j_{1},\ldots,j_{d})\lambda^{-\frac{1}{j_{d}}}(c_{1}+2)^{\frac{\alpha_{0}}{j_{1}}}\left(\|f\|_{L^{\infty}}+\int_{[a_{t},a_{t+1}]}|f^{\prime}(x)|dx\right), (178)

where C(d,j1,,jd)C(d,j_{1},\cdots,j_{d}) is the same as in (175).

Hence,

|e𝒊λψw(x)f(x)dx|\displaystyle\left|\int_{\mathbb{R}}e^{\bm{i}\lambda\psi_{w}(x)}f(x)dx\right|
=\displaystyle= |t=[at,at+1]e𝒊λψw(x)f(x)dx|\displaystyle\left|\sum_{t=-\infty}^{\infty}\int_{[a_{t},a_{t+1}]}e^{\bm{i}\lambda\psi_{w}(x)}f(x)dx\right|
\displaystyle\leq t=|[at,at+1]e𝒊λψw(x)f(x)dx|\displaystyle\sum_{t=-\infty}^{\infty}\left|\int_{[a_{t},a_{t+1}]}e^{\bm{i}\lambda\psi_{w}(x)}f(x)dx\right|
()\displaystyle\overset{(*)}{\leq} C(d,j1,,jd)λ1jd(c1+2)α0j1(t1|at+1|α0j1f(at+1)+\displaystyle C(d,j_{1},\ldots,j_{d})\lambda^{-\frac{1}{j_{d}}}(c_{1}+2)^{\frac{\alpha_{0}}{j_{1}}}\left(\sum_{t\leq-1}|a_{t+1}|^{\frac{\alpha_{0}}{j_{1}}}f(a_{t+1})+\right.
t+1|at|α0j1f(at)+(+1)fL+(|x|α0j1+1)f(x)L1)\displaystyle\left.\sum_{t\geq\ell+1}|a_{t}|^{\frac{\alpha_{0}}{j_{1}}}f(a_{t})+(\ell+1)\|f\|_{L^{\infty}}+\left\|\left(|x|^{\frac{\alpha_{0}}{j_{1}}}+1\right)f^{\prime}(x)\right\|_{L^{1}}\right)
()\displaystyle\overset{(**)}{\leq} C(d,j1,,jd)λ1jd(c1+2)α0j1((,c1]|x|α0j1f(x)dx+\displaystyle C(d,j_{1},\ldots,j_{d})\lambda^{-\frac{1}{j_{d}}}(c_{1}+2)^{\frac{\alpha_{0}}{j_{1}}}\left(\int_{(-\infty,-c_{1}]}|x|^{\frac{\alpha_{0}}{j_{1}}}f(x)dx+\right.
[c1,)|x|α0j1f(x)dx+(+1)fL+(|x|α0j1+1)f(x)L1)\displaystyle\left.\int_{[c_{1},\infty)}|x|^{\frac{\alpha_{0}}{j_{1}}}f(x)dx+(\ell+1)\|f\|_{L^{\infty}}+\left\|\left(|x|^{\frac{\alpha_{0}}{j_{1}}}+1\right)f^{\prime}(x)\right\|_{L^{1}}\right)
\displaystyle\leq C(d,j1,,jd)λ1jd(c1+2)α0j1×\displaystyle C(d,j_{1},\ldots,j_{d})\lambda^{-\frac{1}{j_{d}}}(c_{1}+2)^{\frac{\alpha_{0}}{j_{1}}}\times
(|x|α0j1f(x)L1+(+1)fL+(|x|α0j1+1)f(x)L1)\displaystyle\left(\left\||x|^{\frac{\alpha_{0}}{j_{1}}}f(x)\right\|_{L^{1}}+(\ell+1)\|f\|_{L^{\infty}}+\left\|\left(|x|^{\frac{\alpha_{0}}{j_{1}}}+1\right)f^{\prime}(x)\right\|_{L^{1}}\right) (179)

where the first equality follows from the dominated convergence theorem, step ()(*) follows from (175), (176) and (178), and step ()(**) follows from the monotonicity of |x|α0j1f|x|^{\frac{\alpha_{0}}{j_{1}}}f on (,c1)(-\infty,-c_{1}) and (c1,)(c_{1},\infty).

Case 2: (j1=1j_{1}=1). Fix any wSd1w\in S^{d-1}. There exist x1<x2<<xsx_{1}<x_{2}<\ldots<x_{s} that partition \mathbb{R} into s+1s+1 disjoint open intervals such that dψw(x)dx\frac{d\psi_{w}(x)}{dx} is monotone on each of these intervals. Notice that sjd2s\leq j_{d}-2 since dψw(x)dx\frac{d\psi_{w}(x)}{dx} is a polynomial of degree jd1j_{d}-1, and that x1,x2,,xsx_{1},x_{2},\ldots,x_{s} depend on ww. For t1t\leq-1, when subdividing the interval [at,at+1][a_{t},a_{t+1}], besides the partition points {tβ}β=0β0\{t_{\beta}\}_{\beta=0}^{\beta_{0}}, the points in {x1,x2,,xs}[at,at+1]\{x_{1},x_{2},\ldots,x_{s}\}\cap[a_{t},a_{t+1}] should also be added to the partition. The new set of partition points has at most β0+1+sβ1+jd\beta_{0}+1+s\leq\beta_{1}+j_{d} points and hence subdivides [at,at+1][a_{t},a_{t+1}] into at most β1+jd1\beta_{1}+j_{d}-1 intervals such that on each subinterval max1id|djiψw(x)dxji|1dc3|at|α0\max\limits_{1\leq i\leq d}\left|\frac{d^{j_{i}}\psi_{w}(x)}{dx^{j_{i}}}\right|\geq\frac{1}{\sqrt{d}}c_{3}|a_{t}|^{-\alpha_{0}} and dψw(x)dx\frac{d\psi_{w}(x)}{dx} is monotone. Hence Lemma 7.1 (part ii) can be applied on each subinterval. The remaining steps proceed as in Case 1, and one obtains

|[at,at+1]e𝒊λψw(x)f(x)dx|\displaystyle\left|\int_{[a_{t},a_{t+1}]}e^{\bm{i}\lambda\psi_{w}(x)}f(x)dx\right|
\displaystyle\leq C~(d,j1,,jd)λ1jd(|at+1|α0j1f(at+1)+[at,at+1]|x|α0j1|f(x)|dx),\displaystyle\tilde{C}(d,j_{1},\cdots,j_{d})\lambda^{-\frac{1}{j_{d}}}\left(|a_{t+1}|^{\frac{\alpha_{0}}{j_{1}}}f(a_{t+1})+\int_{[a_{t},a_{t+1}]}|x|^{\frac{\alpha_{0}}{j_{1}}}|f^{\prime}(x)|dx\right), (180)

where C~(d,j1,,jd)=c~0max{c31j1,c31jd}(d)1j12α0j1(β1+jd1)\tilde{C}(d,j_{1},\cdots,j_{d})=\tilde{c}_{0}\max\left\{c_{3}^{-\frac{1}{j_{1}}},c_{3}^{-\frac{1}{j_{d}}}\right\}(\sqrt{d})^{\frac{1}{j_{1}}}2^{\frac{\alpha_{0}}{j_{1}}}(\beta_{1}+j_{d}-1), a constant that depends only on d,j1,,jdd,j_{1},\ldots,j_{d}. Following the same reasoning one can obtain (176) for t+1t\geq\ell+1 and (178) for 0t0\leq t\leq\ell, both with C(d,j1,,jd)C(d,j_{1},\cdots,j_{d}) replaced by C~(d,j1,,jd)\tilde{C}(d,j_{1},\cdots,j_{d}). As a result, one obtains (179) with C(d,j1,,jd)C(d,j_{1},\cdots,j_{d}) replaced by C~(d,j1,,jd)\tilde{C}(d,j_{1},\cdots,j_{d}). ∎
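The λ^{−1/jd} decay in Lemma 7.3 can be observed numerically in the simplest setting. The sketch below (an illustration with simplifying choices, not the generality of the lemma) takes ψ(x) = x², so that jd = 2, and f the standard Gaussian density; here the integral has the closed form (1 − 2iλ)^{−1/2}, with magnitude (1 + 4λ²)^{−1/4} ≈ λ^{−1/2}:

```python
import numpy as np

def osc_integral(lam):
    # |int exp(i * lam * x^2) f(x) dx| with f the standard Gaussian density,
    # on a grid fine enough to resolve the fastest oscillation (phase step << 1).
    x = np.linspace(-10.0, 10.0, 1_000_001)
    dx = x[1] - x[0]
    f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    return abs(np.sum(np.exp(1j * lam * x**2) * f) * dx)

lam = 100.0
val = osc_integral(lam)
exact = (1 + 4 * lam**2) ** (-0.25)   # magnitude of (1 - 2i*lam)^(-1/2)
assert abs(val - exact) < 1e-3
assert val * lam ** 0.5 < 1.0         # consistent with the lambda^(-1/j_d) rate, j_d = 2
```

The quadrature agrees with the closed form because the Gaussian tails beyond |x| = 10 are negligible and the grid oversamples every oscillation.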

Proof of Lemma 7.4.

By Lemma 7.3, when ζ2>1\|\zeta\|_{2}>1,

|g(ζ)|rC(f,r,d,j1,,jd)ζ2rjd,|g(\zeta)|^{r}\leq C(f,r,d,j_{1},\ldots,j_{d})\|\zeta\|_{2}^{-\frac{r}{j_{d}}},

where

C(f,r,d,j1,,jd)=Cr(d,j1,,jd)(c1+2)α1r(|x|α1f(x)L1+(+1)fL+(|x|α1+1)f(x)L1)r.C(f,r,d,j_{1},\ldots,j_{d})=\\ C^{r}(d,j_{1},\ldots,j_{d})(c_{1}+2)^{\alpha_{1}r}\left(\left\||x|^{\alpha_{1}}f(x)\right\|_{L^{1}}+(\ell+1)\|f\|_{L^{\infty}}+\left\|\left(|x|^{\alpha_{1}}+1\right)f^{\prime}(x)\right\|_{L^{1}}\right)^{r}.

Let |Sd1||S^{d-1}| denote the area of Sd1S^{d-1}. Then

ζ2>1|g(ζ)|rdζ\displaystyle\int_{\|\zeta\|_{2}>1}|g(\zeta)|^{r}d\zeta
\displaystyle\leq C(f,r,d,j1,,jd)ζ2>1ζ2rjddζ\displaystyle C(f,r,d,j_{1},\ldots,j_{d})\int_{\|\zeta\|_{2}>1}\|\zeta\|_{2}^{-\frac{r}{j_{d}}}d\zeta
\displaystyle\leq C(f,r,d,j1,,jd)|Sd1|(1,)λrjdλd1dλ\displaystyle C(f,r,d,j_{1},\ldots,j_{d})|S^{d-1}|\int_{\left(1,\infty\right)}\lambda^{-\frac{r}{j_{d}}}\lambda^{d-1}d\lambda
=\displaystyle= C(r,d,j1,,jd)(c1+2)α1r(|x|α1f(x)L1+(+1)fL+(|x|α1+1)f(x)L1)r,\displaystyle C(r,d,j_{1},\ldots,j_{d})(c_{1}+2)^{{\alpha_{1}r}}\left(\left\||x|^{\alpha_{1}}f(x)\right\|_{L^{1}}+(\ell+1)\|f\|_{L^{\infty}}+\left\|\left(|x|^{\alpha_{1}}+1\right)f^{\prime}(x)\right\|_{L^{1}}\right)^{r}, (181)

where the last inequality follows from the fact that (1,)λrjdλd1dλ\int_{\left(1,\infty\right)}\lambda^{-\frac{r}{j_{d}}}\lambda^{d-1}d\lambda is finite whenever r>djdr>dj_{d}, and C(r,d,j1,,jd)=Cr(d,j1,,jd)|Sd1|(1,)λrjdλd1dλC(r,d,j_{1},\ldots,j_{d})=C^{r}(d,j_{1},\ldots,j_{d})|S^{d-1}|\int_{\left(1,\infty\right)}\lambda^{-\frac{r}{j_{d}}}\lambda^{d-1}d\lambda.

In addition,

ζ21|g(ζ)|rdζζ21fL1rdζ=C(d)fL1r,\int_{\|\zeta\|_{2}\leq 1}|g(\zeta)|^{r}d\zeta\leq\int_{\|\zeta\|_{2}\leq 1}\|f\|_{L^{1}}^{r}d\zeta=C(d)\|f\|_{L^{1}}^{r}, (182)

where C(d)C(d) is a constant that depends on dd.

The proof is then completed by combining (181), (182) and the inequality (ar+br)(a+b)r(a^{r}+b^{r})\leq(a+b)^{r} for any a,b>0a,b>0 and r1r\geq 1. ∎

F.2 Calculation details for Section 7.2

In this subsection we verify parts of (A3) for the TT specified in Section 7.2. It is easy to verify by the dominated convergence theorem or Pratt’s Lemma:

h(ζ|μ,σ)μ\displaystyle\frac{\partial h(\zeta|\mu,\sigma)}{\partial\mu} =exp(𝒊i=1kζ(i)xi)f𝒩(x|μ,σ)μdx,\displaystyle=\int_{\mathbb{R}}\exp\left(\bm{i}\sum_{i=1}^{k}\zeta^{(i)}x^{i}\right)\frac{\partial f_{\mathcal{N}}(x|\mu,\sigma)}{\partial\mu}dx,
2h(ζ|μ,σ)μ2\displaystyle\frac{\partial^{2}h(\zeta|\mu,\sigma)}{\partial\mu^{2}} =exp(𝒊i=1kζ(i)xi)2f𝒩(x|μ,σ)μ2dx,\displaystyle=\int_{\mathbb{R}}\exp\left(\bm{i}\sum_{i=1}^{k}\zeta^{(i)}x^{i}\right)\frac{\partial^{2}f_{\mathcal{N}}(x|\mu,\sigma)}{\partial\mu^{2}}dx,
h(ζ|μ,σ)ζ(j)\displaystyle\frac{\partial h(\zeta|\mu,\sigma)}{\partial\zeta^{(j)}} =𝒊xjexp(𝒊i=1kζ(i)xi)f𝒩(x|μ,σ)dx,j[k]\displaystyle=\int_{\mathbb{R}}\bm{i}x^{j}\exp\left(\bm{i}\sum_{i=1}^{k}\zeta^{(i)}x^{i}\right)f_{\mathcal{N}}(x|\mu,\sigma)dx,\quad j\in[k]

and

2h(ζ|μ,σ)ζ(j)μ=𝒊xjexp(𝒊i=1kζ(i)xi)f𝒩(x|μ,σ)μdx,j[k].\frac{\partial^{2}h(\zeta|\mu,\sigma)}{\partial\zeta^{(j)}\partial\mu}=\int_{\mathbb{R}}\bm{i}x^{j}\exp\left(\bm{i}\sum_{i=1}^{k}\zeta^{(i)}x^{i}\right)\frac{\partial f_{\mathcal{N}}(x|\mu,\sigma)}{\partial\mu}dx,\quad j\in[k].

Then

|h(ζ|μ,σ)μ|\displaystyle\left|\frac{\partial h(\zeta|\mu,\sigma)}{\partial\mu}\right| |f𝒩(x|μ,σ)μ|dx=2π1σ,\displaystyle\leq\int_{\mathbb{R}}\left|\frac{\partial f_{\mathcal{N}}(x|\mu,\sigma)}{\partial\mu}\right|dx=\sqrt{\frac{2}{\pi}}\frac{1}{\sigma}, (183)
|2h(ζ|μ,σ)μ2|\displaystyle\left|\frac{\partial^{2}h(\zeta|\mu,\sigma)}{\partial\mu^{2}}\right| |2f𝒩(x|μ,σ)μ2|dx2σ2,\displaystyle\leq\int_{\mathbb{R}}\left|\frac{\partial^{2}f_{\mathcal{N}}(x|\mu,\sigma)}{\partial\mu^{2}}\right|dx\leq\frac{2}{\sigma^{2}}, (184)
maxj[k]|h(ζ|μ,σ)ζ(j)|\displaystyle\max_{j\in[k]}\left|\frac{\partial h(\zeta|\mu,\sigma)}{\partial\zeta^{(j)}}\right| maxj[k]|xjf𝒩(x|μ,σ)|dx:=h1(μ),\displaystyle\leq\max_{j\in[k]}\int_{\mathbb{R}}\left|x^{j}f_{\mathcal{N}}(x|\mu,\sigma)\right|dx:=h_{1}(\mu), (185)
maxj[k]|2h(ζ|μ,σ)ζ(j)μ|\displaystyle\max_{j\in[k]}\left|\frac{\partial^{2}h(\zeta|\mu,\sigma)}{\partial\zeta^{(j)}\partial\mu}\right| maxj[k]|xjf𝒩(x|μ,σ)μ|dx:=h2(μ),\displaystyle\leq\max_{j\in[k]}\int_{\mathbb{R}}\left|x^{j}\frac{\partial f_{\mathcal{N}}(x|\mu,\sigma)}{\partial\mu}\right|dx:=h_{2}(\mu), (186)

where h1(μ)h_{1}(\mu) and h2(μ)h_{2}(\mu) are continuous functions of μ\mu by the dominated convergence theorem, with their dependence on the constant σ\sigma suppressed.
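The equality in (183) can be verified directly: |∂f_N(x|μ,σ)/∂μ| = (|x − μ|/σ²) f_N(x|μ,σ), and E|X − μ| = σ√(2/π) for X ∼ N(μ, σ²). A quick numerical sketch (arbitrary parameter values, illustrative only):

```python
import numpy as np

mu, sigma = 0.7, 1.3
x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 200001)
dx = x[1] - x[0]
f = np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
# |d f_N / d mu| = (|x - mu| / sigma^2) * f_N(x | mu, sigma)
l1 = float(np.sum(np.abs(x - mu) / sigma**2 * f) * dx)
assert abs(l1 - np.sqrt(2 / np.pi) / sigma) < 1e-6
```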

It follows that the gradient of ϕT(ζ|θ)\phi_{T}(\zeta|\theta) with respect to θ\theta is

θϕT(ζ|θ)=\displaystyle\nabla_{\theta}\phi_{T}(\zeta|\theta)= (h(ζ|μ1,σ)h(ζ|μk,σ),,\displaystyle\left(h(\zeta|\mu_{1},\sigma)-h(\zeta|\mu_{k},\sigma),\ldots,\right.
h(ζ|μk1,σ)h(ζ|μk,σ),π1h(ζ|μ1,σ)μ,,πkh(ζ|μk,σ)μ)\displaystyle h(\zeta|\mu_{k-1},\sigma)-h(\zeta|\mu_{k},\sigma),\pi_{1}\frac{\partial h(\zeta|\mu_{1},\sigma)}{\partial\mu},\ldots,\pi_{k}\frac{\partial h(\zeta|\mu_{k},\sigma)}{\partial\mu})^{\top} (187)

and the Hessian with respect to θ\theta has (i,j)(i,j) entry, for jij\geq i, given by

2θ(j)θ(i)ϕT(ζ|θ)={h(ζ|μi,σ)μi[k1],j=k1+ih(ζ|μk,σ)μi[k1],j=2k1πi(k1)2h(ζ|μi(k1),σ)μ2ki2k1,j=i0otherwise\frac{\partial^{2}}{\partial\theta^{(j)}\partial\theta^{(i)}}\phi_{T}(\zeta|\theta)=\begin{cases}\frac{\partial h(\zeta|\mu_{i},\sigma)}{\partial\mu}&i\in[k-1],j=k-1+i\\ -\frac{\partial h(\zeta|\mu_{k},\sigma)}{\partial\mu}&i\in[k-1],j=2k-1\\ \pi_{i-(k-1)}\frac{\partial^{2}h(\zeta|\mu_{i-(k-1)},\sigma)}{\partial\mu^{2}}&k\leq i\leq 2k-1,j=i\\ 0&\text{otherwise}\end{cases} (188)

and the lower triangular part is determined by symmetry.

Then by (183), (184), (185), (186), (187) and (188), for any i,j[k]i,j\in[k]:

|ϕT(ζ|θ)θ(i)|\displaystyle\left|\frac{\partial\phi_{T}(\zeta|\theta)}{\partial\theta^{(i)}}\right| 2+2π1σ,\displaystyle\leq 2+\sqrt{\frac{2}{\pi}}\frac{1}{\sigma},
|2ϕT(ζ|θ)θ(i)θ(j)|\displaystyle\left|\frac{\partial^{2}\phi_{T}(\zeta|\theta)}{\partial\theta^{(i)}\partial\theta^{(j)}}\right| 2π1σ+2σ2,\displaystyle\leq\sqrt{\frac{2}{\pi}}\frac{1}{\sigma}+\frac{2}{\sigma^{2}},
|2ϕT(ζ|θ)ζ(j)θ(i)|\displaystyle\left|\frac{\partial^{2}\phi_{T}(\zeta|\theta)}{\partial\zeta^{(j)}\partial\theta^{(i)}}\right| i=1k(h1(μi)+h2(μi)),\displaystyle\leq\sum_{i=1}^{k}\left(h_{1}(\mu_{i})+h_{2}(\mu_{i})\right),

where the right-hand side of the last display is a continuous function of θ\theta since h1h_{1} and h2h_{2} are continuous. Hence, to verify condition (A3) it remains to establish that there exists some r1r\geq 1 such that 2k1|ϕT(ζ|θ)|rdζ\int_{\mathbb{R}^{2k-1}}\left|\phi_{T}(\zeta|\theta)\right|^{r}d\zeta is upper bounded on Θ\Theta by a finite continuous function of θ\theta.

F.3 Calculation details for Section 7.4

In this subsection we verify parts of (A3) for the TT specified in Section 7.4. The development parallels Appendix F.2.

It is easy to verify by the dominated convergence theorem or Pratt’s Lemma:

h(ζ|α,ξ)α\displaystyle\frac{\partial h(\zeta|\alpha,\xi)}{\partial\alpha} =exp(𝒊i=13ζ(i)zi+1)g(z|α,ξ)αdz,\displaystyle=\int_{\mathbb{R}}\exp\left(\bm{i}\sum_{i=1}^{3}\zeta^{(i)}z^{i+1}\right)\frac{\partial g(z|\alpha,\xi)}{\partial\alpha}dz,
2h(ζ|α,ξ)α2\displaystyle\frac{\partial^{2}h(\zeta|\alpha,\xi)}{\partial\alpha^{2}} =exp(𝒊i=13ζ(i)zi+1)2g(z|α,ξ)α2dz,\displaystyle=\int_{\mathbb{R}}\exp\left(\bm{i}\sum_{i=1}^{3}\zeta^{(i)}z^{i+1}\right)\frac{\partial^{2}g(z|\alpha,\xi)}{\partial\alpha^{2}}dz,
h(ζ|α,ξ)ζ(j)\displaystyle\frac{\partial h(\zeta|\alpha,\xi)}{\partial\zeta^{(j)}} =𝒊zj+1exp(𝒊i=13ζ(i)zi+1)g(z|α,ξ)dz,j=1,2,3\displaystyle=\int_{\mathbb{R}}\bm{i}z^{j+1}\exp\left(\bm{i}\sum_{i=1}^{3}\zeta^{(i)}z^{i+1}\right)g(z|\alpha,\xi)dz,\quad j=1,2,3

and

2h(ζ|α,ξ)ζ(j)α=2h(ζ|α,ξ)αζ(j)=𝒊zj+1exp(𝒊i=13ζ(i)zi+1)g(z|α,ξ)αdz,j=1,2,3.\frac{\partial^{2}h(\zeta|\alpha,\xi)}{\partial\zeta^{(j)}\partial\alpha}=\frac{\partial^{2}h(\zeta|\alpha,\xi)}{\partial\alpha\partial\zeta^{(j)}}=\int_{\mathbb{R}}\bm{i}z^{j+1}\exp\left(\bm{i}\sum_{i=1}^{3}\zeta^{(i)}z^{i+1}\right)\frac{\partial g(z|\alpha,\xi)}{\partial\alpha}dz,\quad j=1,2,3.

From the preceding four displays,

|h(ζ|α,ξ)α|\displaystyle\left|\frac{\partial h(\zeta|\alpha,\xi)}{\partial\alpha}\right| |g(z|α,ξ)α|dz:=h1(α)\displaystyle\leq\int_{\mathbb{R}}\left|\frac{\partial g(z|\alpha,\xi)}{\partial\alpha}\right|dz:=h_{1}(\alpha) (189)
|2h(ζ|α,ξ)α2|\displaystyle\left|\frac{\partial^{2}h(\zeta|\alpha,\xi)}{\partial\alpha^{2}}\right| |2g(z|α,ξ)α2|dz:=h2(α),\displaystyle\leq\int_{\mathbb{R}}\left|\frac{\partial^{2}g(z|\alpha,\xi)}{\partial\alpha^{2}}\right|dz:=h_{2}(\alpha), (190)
maxj=1,2,3|h(ζ|α,ξ)ζ(j)|\displaystyle\max_{j=1,2,3}\left|\frac{\partial h(\zeta|\alpha,\xi)}{\partial\zeta^{(j)}}\right| maxj=1,2,3|zj+1g(z|α,ξ)|dz:=h3(α),\displaystyle\leq\max_{j=1,2,3}\int_{\mathbb{R}}\left|z^{j+1}g(z|\alpha,\xi)\right|dz:=h_{3}(\alpha), (191)
maxj=1,2,3|2h(ζ|α,ξ)ζ(j)α|\displaystyle\max_{j=1,2,3}\left|\frac{\partial^{2}h(\zeta|\alpha,\xi)}{\partial\zeta^{(j)}\partial\alpha}\right| maxj=1,2,3|zj+1g(z|α,ξ)α|dz:=h4(α),\displaystyle\leq\max_{j=1,2,3}\int_{\mathbb{R}}\left|z^{j+1}\frac{\partial g(z|\alpha,\xi)}{\partial\alpha}\right|dz:=h_{4}(\alpha), (192)

where h1(α)h_{1}(\alpha), h2(α)h_{2}(\alpha), h3(α)h_{3}(\alpha) and h4(α)h_{4}(\alpha) are continuous functions of α\alpha by the dominated convergence theorem, with their dependence on the constant ξ\xi suppressed.

It follows that the gradient of ϕT(ζ|θ)\phi_{T}(\zeta|\theta) with respect to θ\theta is

θϕT(ζ|θ)=(h(ζ|α1,ξ)h(ζ|α2,ξ),π1h(ζ|α1,ξ)α,π2h(ζ|α2,ξ)α),\nabla_{\theta}\phi_{T}(\zeta|\theta)=\left(h(\zeta|\alpha_{1},\xi)-h(\zeta|\alpha_{2},\xi),\pi_{1}\frac{\partial h(\zeta|\alpha_{1},\xi)}{\partial\alpha},\pi_{2}\frac{\partial h(\zeta|\alpha_{2},\xi)}{\partial\alpha}\right)^{\top}, (193)

and the Hessian with respect to θ\theta is

HessθϕT(ζ|θ)=(0h(ζ|α1,ξ)αh(ζ|α2,ξ)αh(ζ|α1,ξ)απ12h(ζ|α1,ξ)α20h(ζ|α2,ξ)α0π22h(ζ|α2,ξ)α2).\textbf{Hess}_{\theta}\phi_{T}(\zeta|\theta)=\begin{pmatrix}0&\frac{\partial h(\zeta|\alpha_{1},\xi)}{\partial\alpha}&-\frac{\partial h(\zeta|\alpha_{2},\xi)}{\partial\alpha}\\ \frac{\partial h(\zeta|\alpha_{1},\xi)}{\partial\alpha}&\pi_{1}\frac{\partial^{2}h(\zeta|\alpha_{1},\xi)}{\partial\alpha^{2}}&0\\ -\frac{\partial h(\zeta|\alpha_{2},\xi)}{\partial\alpha}&0&\pi_{2}\frac{\partial^{2}h(\zeta|\alpha_{2},\xi)}{\partial\alpha^{2}}\end{pmatrix}. (194)

Then by (189), (190), (191), (192), (193) and (194), for any i,j{1,2,3}i,j\in\{1,2,3\}:

|ϕT(ζ|θ)θ(i)|\displaystyle\left|\frac{\partial\phi_{T}(\zeta|\theta)}{\partial\theta^{(i)}}\right| 2+h1(α1)+h1(α2),\displaystyle\leq 2+h_{1}(\alpha_{1})+h_{1}(\alpha_{2}),
|2ϕT(ζ|θ)θ(i)θ(j)|\displaystyle\left|\frac{\partial^{2}\phi_{T}(\zeta|\theta)}{\partial\theta^{(i)}\partial\theta^{(j)}}\right| i=12(h1(αi)+h2(αi)),\displaystyle\leq\sum_{i=1}^{2}\left(h_{1}(\alpha_{i})+h_{2}(\alpha_{i})\right),
|2ϕT(ζ|θ)ζ(j)θ(i)|\displaystyle\left|\frac{\partial^{2}\phi_{T}(\zeta|\theta)}{\partial\zeta^{(j)}\partial\theta^{(i)}}\right| i=12(h3(αi)+h4(αi)),\displaystyle\leq\sum_{i=1}^{2}\left(h_{3}(\alpha_{i})+h_{4}(\alpha_{i})\right),

where the right-hand sides of the preceding three displays are continuous functions of θ\theta since h1h_{1}, h2h_{2}, h3h_{3} and h4h_{4} are continuous.

Appendix G Proofs for Section 8

Proof of Lemma 8.2.

Step 1: Suppose pi=pip^{\prime}_{i}=p_{i} for any i[k0]i\in[k_{0}]. In this case,

h2(PG,N,PG,N)=\displaystyle h^{2}(P_{G,{N}},P_{G^{\prime},{N}})= h2(i=1k0piPθi,N,i=1k0piPθi,N)\displaystyle h^{2}\left(\sum_{i=1}^{k_{0}}p_{i}P_{\theta_{i},{N}},\sum_{i=1}^{k_{0}}p_{i}P_{\theta^{\prime}_{i},{N}}\right)
\displaystyle\leq i=1k0pih2(Pθi,N,Pθi,N)\displaystyle\sum_{i=1}^{k_{0}}p_{i}h^{2}\left(P_{\theta_{i},{N}},P_{\theta^{\prime}_{i},{N}}\right)
\displaystyle\leq Ni=1k0pih2(Pθi,Pθi)\displaystyle{N}\sum_{i=1}^{k_{0}}p_{i}h^{2}\left(P_{\theta_{i}},P_{\theta^{\prime}_{i}}\right)
\displaystyle\leq Nmax1ik0h2(Pθi,Pθi),\displaystyle{N}\max_{1\leq i\leq k_{0}}h^{2}\left(P_{\theta_{i}},P_{\theta^{\prime}_{i}}\right),

where the first inequality follows from the joint convexity of any ff-divergence (of which the squared Hellinger distance is a member), and the second inequality follows from

h2(Pθi,N,Pθi,N)=1(1h2(Pθi,Pθi))NNh2(Pθi,Pθi).h^{2}\left(P_{\theta_{i},{N}},P_{\theta^{\prime}_{i},{N}}\right)=1-\left(1-h^{2}\left(P_{\theta_{i}},P_{\theta^{\prime}_{i}}\right)\right)^{{N}}\leq{N}h^{2}\left(P_{\theta_{i}},P_{\theta^{\prime}_{i}}\right).
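The displayed identity and bound can be checked numerically: the Hellinger affinity factorizes over product measures, giving the exact formula 1 − (1 − h²)^N, and 1 − (1 − t)^N ≤ Nt for t ∈ [0, 1] by Bernoulli's inequality. The sketch below (arbitrary example parameters, illustrative only) uses the closed form h²(N(μ1, σ²), N(μ2, σ²)) = 1 − exp(−(μ1 − μ2)²/(8σ²)):

```python
import numpy as np

mu1, mu2, sigma, N = 0.0, 0.5, 1.0, 8

# Squared Hellinger distance between the two Gaussians, by quadrature.
x = np.linspace(-15.0, 15.0, 300001)
dx = x[1] - x[0]
p = np.exp(-(x - mu1) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
q = np.exp(-(x - mu2) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
h2 = 1.0 - float(np.sum(np.sqrt(p * q)) * dx)
assert abs(h2 - (1 - np.exp(-(mu1 - mu2) ** 2 / (8 * sigma**2)))) < 1e-8

# Tensorization over N i.i.d. coordinates, and the linear upper bound N * h^2.
h2_N = 1.0 - (1.0 - h2) ** N
assert h2_N <= N * h2
```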

Step 2: Suppose θi=θi\theta^{\prime}_{i}=\theta_{i} for any i[k0]i\in[k_{0}]. Let 𝒑=(p1,p2,,pk0)\bm{p}=(p_{1},p_{2},\ldots,p_{k_{0}}) be the discrete probability distribution associated to the weights of GG and define 𝒑\bm{p}^{\prime} similarly. Consider any Q=(qij)i,j=1k0Q=(q_{ij})_{i,j=1}^{k_{0}} to be a coupling of 𝒑\bm{p} and 𝒑\bm{p^{\prime}}. Then

h2(PG,N,PG,N)=\displaystyle h^{2}(P_{G,{N}},P_{G^{\prime},{N}})= h2(i=1k0j=1k0qijPθi,N,i=1k0j=1k0qijPθj,N)\displaystyle h^{2}\left(\sum_{i=1}^{k_{0}}\sum_{j=1}^{k_{0}}q_{ij}P_{\theta_{i},{N}},\sum_{i=1}^{k_{0}}\sum_{j=1}^{k_{0}}q_{ij}P_{\theta_{j},{N}}\right)
\displaystyle\leq i=1k0j=1k0qijh2(Pθi,N,Pθj,N)\displaystyle\sum_{i=1}^{k_{0}}\sum_{j=1}^{k_{0}}q_{ij}h^{2}\left(P_{\theta_{i},{N}},P_{\theta_{j},{N}}\right)
\displaystyle\leq i=1k0j=1k0qij𝟏(θiθj),\displaystyle\sum_{i=1}^{k_{0}}\sum_{j=1}^{k_{0}}q_{ij}\bm{1}(\theta_{i}\not=\theta_{j}), (195)

where the first inequality follows from the joint convexity of ff-divergences, and the second follows from the fact that the Hellinger distance is bounded above by 11. Since (195) holds for any coupling QQ of 𝒑\bm{p} and 𝒑\bm{p}^{\prime},

h2(PG,N,PG,N)infQi=1k0j=1k0qij𝟏(θiθj)=V(𝒑,𝒑)=12i=1k0|pipi|.h^{2}(P_{G,{N}},P_{G^{\prime},{N}})\leq\inf_{Q}\sum_{i=1}^{k_{0}}\sum_{j=1}^{k_{0}}q_{ij}\bm{1}(\theta_{i}\not=\theta_{j})=V(\bm{p},\bm{p}^{\prime})=\frac{1}{2}\sum_{i=1}^{k_{0}}|p_{i}-p_{i}^{\prime}|.
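When the atoms are distinct, the infimum over couplings in the last display is attained by the coupling that places mass min(p_i, p'_i) on the diagonal. The sketch below verifies this on a small example; the greedy completion of the leftover mass is an implementation detail not specified in the text.

```python
def tv(p, q):
    # total variation distance V(p, q) = (1/2) sum_i |p_i - q_i|
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def diagonal_coupling(p, q):
    """Coupling with Q_ii = min(p_i, q_i); leftover mass matched greedily off-diagonal."""
    k = len(p)
    Q = [[0.0] * k for _ in range(k)]
    for i in range(k):
        Q[i][i] = min(p[i], q[i])
    r = [p[i] - Q[i][i] for i in range(k)]  # leftover row mass
    c = [q[j] - Q[j][j] for j in range(k)]  # leftover column mass
    i = j = 0
    while i < k and j < k:
        if r[i] < 1e-15:
            i += 1
        elif c[j] < 1e-15:
            j += 1
        else:
            m = min(r[i], c[j])
            Q[i][j] += m
            r[i] -= m
            c[j] -= m
    return Q

p, q = [0.5, 0.3, 0.2], [0.2, 0.5, 0.3]
Q = diagonal_coupling(p, q)
off_diag = sum(Q[i][j] for i in range(3) for j in range(3) if i != j)
# the off-diagonal mass equals V(p, p'), attaining the infimum in (195)
assert abs(off_diag - tv(p, q)) < 1e-12
```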

Step 3: General case. Let G=i=1k0piδθiG^{\prime\prime}=\sum_{i=1}^{k_{0}}{p_{i}}\delta_{\theta^{\prime}_{i}}. Then by the triangle inequality together with Step 1 and Step 2,

h(PG,N,PG,N)\displaystyle h(P_{G,{N}},P_{G^{\prime},{N}})\leq h(PG,N,PG,N)+h(PG,N,PG,N)\displaystyle h(P_{G,{N}},P_{G^{\prime\prime},{N}})+h(P_{G^{\prime\prime},{N}},P_{G^{\prime},{N}})
\displaystyle\leq Nmax1ik0h(Pθi,Pθi)+12i=1k0|pipi|.\displaystyle\sqrt{{N}}\max_{1\leq i\leq k_{0}}h\left(P_{\theta_{i}},P_{\theta^{\prime}_{i}}\right)+\sqrt{\frac{1}{2}\sum_{i=1}^{k_{0}}|p_{i}-p_{i}^{\prime}|}.

Finally, notice that the above procedure does not depend on the specific order of atoms of GG and GG^{\prime}, and thus the proof is complete. ∎

Proof of Lemma 8.3.

Since lim infθθ0jh(Pθ,Pθ0j)θθ0j2<\liminf\limits_{\theta\to\theta^{0}_{j}}\frac{h(P_{\theta},P_{\theta^{0}_{j}})}{\|{\theta}-{\theta^{0}_{j}}\|_{2}}<\infty, there exists a sequence {θjk}k=1Θ\i=1k0{θi0}\{\theta_{j}^{k}\}_{k=1}^{\infty}\subset\Theta\backslash\cup_{i=1}^{k_{0}}\{\theta_{i}^{0}\} such that θjkθj0\theta_{j}^{k}\to\theta_{j}^{0} and

h(Pθjk,Pθ0j)γθjkθ0j2h(P_{\theta_{j}^{k}},P_{\theta^{0}_{j}})\leq\gamma\|{\theta_{j}^{k}}-{\theta^{0}_{j}}\|_{2} (196)

for some γ(0,)\gamma\in(0,\infty). Suppose, for the sake of contradiction, that

lim supNlim infGW1G0Gk0(Θ)h(PG,N,PG0,N)Dψ(N)(G,G0)=β(0,],\limsup_{{N}\to\infty}\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{h(P_{G,{N}},P_{G_{0},{N}})}{D_{\psi({N})}(G,G_{0})}=\beta\in(0,\infty],

then there exists a subsequence N{N}_{\ell}\rightarrow\infty such that for every \ell

lim infGW1G0Gk0(Θ)h(PG,N,PG0,N)Dψ(N)(G,G0)34β.\liminf_{\begin{subarray}{c}G\overset{W_{1}}{\to}G_{0}\\ G\in\mathcal{E}_{k_{0}}(\Theta)\end{subarray}}\frac{h(P_{G,{N}_{\ell}},P_{G_{0},{N}_{\ell}})}{D_{\psi({N}_{\ell})}(G,G_{0})}\geq\frac{3}{4}\beta.

Thus for each \ell, there exists θjk\theta_{j}^{k_{\ell}} such that G=pj0δθjk+i=1,ijk0p0iδθ0ik0(Θ)\{G0}G_{\ell}=p_{j}^{0}\delta_{\theta_{j}^{k_{\ell}}}+\sum\limits_{i=1,i\not=j}^{k_{0}}{p^{0}_{i}}\delta_{\theta^{0}_{i}}\in\mathcal{E}_{k_{0}}(\Theta)\backslash\{G_{0}\}, and

h(PG,N,PG0,N)Dψ(N)(G,G0)β2.\frac{h(P_{G_{\ell},{N}_{\ell}},P_{G_{0},{N}_{\ell}})}{D_{\psi({N}_{\ell})}(G_{\ell},G_{0})}\geq\frac{\beta}{2}.

By our choice of GG_{\ell}, for sufficiently large \ell

h(PG,N,PG0,N)β2Dψ(N)(G,G0)=β2ψ(N)θjkθj02.h(P_{G_{\ell},{N}_{\ell}},P_{G_{0},{N}_{\ell}})\geq\frac{\beta}{2}D_{\psi({N}_{\ell})}(G_{\ell},G_{0})=\frac{\beta}{2}\sqrt{\psi({N}_{\ell})}\|\theta_{j}^{k_{\ell}}-\theta_{j}^{0}\|_{2}.

On the other hand, by Lemma 8.2,

h(PG,N,PG0,N)Nh(Pθkj,Pθ0j).h(P_{G_{\ell},{N}_{\ell}},P_{G_{0},{N}_{\ell}})\leq\sqrt{{N}_{\ell}}h(P_{\theta^{k_{\ell}}_{j}},P_{\theta^{0}_{j}}).

Combining the last two displays,

β2Nψ(N)h(Pθkj,Pθ0j)θkjθ0j2γNψ(N)0,as ,\frac{\beta}{2}\leq\sqrt{\frac{{N}_{\ell}}{\psi({N}_{\ell})}}\frac{h(P_{\theta^{k_{\ell}}_{j}},P_{\theta^{0}_{j}})}{\|{\theta^{k_{\ell}}_{j}}-{\theta^{0}_{j}}\|_{2}}\leq\gamma\sqrt{\frac{{N}_{\ell}}{\psi({N}_{\ell})}}\to 0,\quad\text{as }\ell\to\infty,

where the second inequality follows from (196). The last display contradicts β>0\beta>0. ∎

Proof of Theorem 8.6.

a) Choose a set of k01k_{0}-1 distinct points {θi}i=1k01Θ\{θ0}\{\theta_{i}\}_{i=1}^{k_{0}-1}\subset\Theta\backslash\{\theta_{0}\} satisfying

ρ1:=min0i<jk01h(Pθi,Pθj)>0.\rho_{1}:=\min_{0\leq i<j\leq k_{0}-1}h(P_{\theta_{i}},P_{\theta_{j}})>0.

Let ρ:=min0i<jk01θiθj2\rho:=\min_{0\leq i<j\leq k_{0}-1}\|\theta_{i}-\theta_{j}\|_{2}. Since lim supθθ0h(Pθ,Pθ0)θθ02β0<\limsup\limits_{\theta\to\theta_{0}}\frac{h\left(P_{\theta},P_{\theta_{0}}\right)}{\|{\theta}-{\theta_{0}}\|_{2}^{\beta_{0}}}<\infty, there exist γ(0,)\gamma\in(0,\infty) and r0(0,min{ρ,(ρ1/γ)1/β0})r_{0}\in(0,\min\{\rho,(\rho_{1}/\gamma)^{1/\beta_{0}}\}) such that

h(Pθ,Pθ0)θθ02β0<γ,0<θθ02<r0.\frac{h\left(P_{\theta},P_{\theta_{0}}\right)}{\|{\theta}-{\theta_{0}}\|_{2}^{\beta_{0}}}<\gamma,\quad\forall 0<\|\theta-\theta_{0}\|_{2}<r_{0}. (197)

Consider G1=i=1k01k0δθi1k0(Θ)G_{1}=\sum_{i=1}^{k_{0}}\frac{1}{k_{0}}\delta_{\theta_{i}^{1}}\in\mathcal{E}_{k_{0}}(\Theta) and G2=i=1k01k0δθi2k0(Θ)G_{2}=\sum_{i=1}^{k_{0}}\frac{1}{k_{0}}\delta_{\theta_{i}^{2}}\in\mathcal{E}_{k_{0}}(\Theta) with θi1=θi2=θiΘ\{θ0}\theta_{i}^{1}=\theta_{i}^{2}=\theta_{i}\in\Theta\backslash\{\theta_{0}\} for i[k01]i\in[k_{0}-1] and θk01=θ0\theta_{k_{0}}^{1}=\theta_{0}, θk02=θ\theta_{k_{0}}^{2}=\theta satisfying θθ02=2ϵ<r0\|\theta-\theta_{0}\|_{2}=2\epsilon<r_{0}. Here ϵ(0,r0/2)\epsilon\in(0,r_{0}/2) is a constant to be determined. Then d𝚯(G1,G2)=2ϵd_{\bm{\Theta}}(G_{1},G_{2})=2\epsilon. Moreover, h(Pθ,Pθ0)γ(2ϵ)β0<ρ1h(P_{\theta},P_{\theta_{0}})\leq\gamma\left(2\epsilon\right)^{\beta_{0}}<\rho_{1}.

By the two-point Le Cam bound (see [45, (15.14)]),

infG^k0(Θ)supGk0(Θ)𝔼mPG,Nd𝚯(G,G^)ϵ2(1V(mPG1,N,mPG2,N)).\inf_{\hat{G}\in\mathcal{E}_{k_{0}}(\Theta)}\sup_{G\in\mathcal{E}_{k_{0}}(\Theta)}\mathbb{E}_{\bigotimes^{m}P_{G,{N}}}d_{\bm{\Theta}}(G,\hat{G})\geq\frac{\epsilon}{2}\left(1-V\left(\bigotimes^{m}P_{G_{1},{N}},\bigotimes^{m}P_{G_{2},{N}}\right)\right). (198)

Notice

V(mPG1,N,mPG2,N)h(mPG1,N,mPG2,N)mh(PG1,N,PG2,N).V\left(\bigotimes^{m}P_{G_{1},{N}},\bigotimes^{m}P_{G_{2},{N}}\right)\leq h\left(\bigotimes^{m}P_{G_{1},{N}},\bigotimes^{m}P_{G_{2},{N}}\right)\leq\sqrt{{m}}h\left(P_{G_{1},{N}},P_{G_{2},{N}}\right).

With our choice of G1G_{1} and G2G_{2}, by Lemma 8.2, the last display becomes

V(mPG1,N,mPG2,N)\displaystyle V\left(\bigotimes^{m}P_{G_{1},{N}},\bigotimes^{m}P_{G_{2},{N}}\right)\leq mNminτSk0max1ik0h(Pθ1i,Pθ2τ(i))\displaystyle\sqrt{{m}}\sqrt{{N}}\min_{\tau\in S_{k_{0}}}\max_{1\leq i\leq k_{0}}h\left(P_{\theta^{1}_{i}},P_{\theta^{2}_{\tau(i)}}\right)
=\displaystyle= mNh(Pθ0,Pθ)\displaystyle\sqrt{{m}}\sqrt{{N}}h\left(P_{\theta_{0}},P_{\theta}\right)
\displaystyle\leq mNγ(2ϵ)β0,\displaystyle\sqrt{{m}}\sqrt{{N}}\gamma\left(2\epsilon\right)^{\beta_{0}}, (199)

where the equality follows from

minτSk0max1ik0h(Pθ1i,Pθ2τ(i))=h(Pθ1k0,Pθ2k0)=h(Pθ0,Pθ)\min_{\tau\in S_{k_{0}}}\max_{1\leq i\leq k_{0}}h(P_{\theta^{1}_{i}},P_{\theta^{2}_{\tau(i)}})=h(P_{\theta^{1}_{k_{0}}},P_{\theta^{2}_{k_{0}}})=h\left(P_{\theta_{0}},P_{\theta}\right)

due to h(Pθ0,Pθ)<ρ1h\left(P_{\theta_{0}},P_{\theta}\right)<\rho_{1}. Plugging (199) into (198),

infG^k0(Θ)supGk0(Θ)𝔼mPG,Nd𝚯(G,G^)ϵ2(1γmN(2ϵ)β0).\inf_{\hat{G}\in\mathcal{E}_{k_{0}}(\Theta)}\sup_{G\in\mathcal{E}_{k_{0}}(\Theta)}\mathbb{E}_{\bigotimes^{m}P_{G,{N}}}d_{\bm{\Theta}}(G,\hat{G})\geq\frac{\epsilon}{2}\left(1-\gamma\sqrt{{m}}\sqrt{{N}}(2\epsilon)^{\beta_{0}}\right). (200)

Consider any a(0,1)a\in(0,1) satisfying a>1γr0β0a>1-\gamma r_{0}^{\beta_{0}} and let 2ϵ=(1aγmN)1β02\epsilon=\left(\frac{1-a}{\gamma\sqrt{{m}}\sqrt{{N}}}\right)^{\frac{1}{\beta_{0}}}. Then 2ϵ(0,r0)2\epsilon\in(0,r_{0}). Plugging the specified ϵ\epsilon into (200), the right-hand side becomes

a4(1aγmN)1β0=C(β0)(1mN)1β0,\frac{a}{4}\left(\frac{1-a}{\gamma\sqrt{{m}}\sqrt{{N}}}\right)^{\frac{1}{\beta_{0}}}=C(\beta_{0})\left(\frac{1}{\sqrt{{m}}\sqrt{{N}}}\right)^{\frac{1}{\beta_{0}}},

where C(β0)C(\beta_{0}) depends on β0\beta_{0}. Notice that a,γ,r0a,\gamma,r_{0} are constants that depend on the probability family {Pθ}θΘ\{P_{\theta}\}_{\theta\in\Theta} and k0k_{0}.
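The choice of ϵ can be checked numerically: with 2ϵ as specified, the bracket in (200) equals exactly a, so the bound evaluates to (a/4)((1−a)/(γ√(mN)))^{1/β₀}. The constants below are illustrative placeholders, not values from the paper.

```python
import math

def lecam_rhs(eps, gamma, m, N, beta0):
    # right-hand side of (200) as a function of eps
    return (eps / 2.0) * (1.0 - gamma * math.sqrt(m * N) * (2.0 * eps) ** beta0)

gamma, m, N, beta0, a = 1.0, 100, 10, 1.0, 0.5   # illustrative placeholder constants
eps_star = ((1.0 - a) / (gamma * math.sqrt(m * N))) ** (1.0 / beta0) / 2.0
val = lecam_rhs(eps_star, gamma, m, N, beta0)
target = (a / 4.0) * ((1.0 - a) / (gamma * math.sqrt(m * N))) ** (1.0 / beta0)
assert abs(val - target) < 1e-12

# for beta0 = 1 and a = 1/2 this choice exactly maximizes the lower bound
for t in (0.25, 0.5, 2.0, 4.0):
    assert val >= lecam_rhs(eps_star * t, gamma, m, N, beta0)
```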

b) Consider k0>3k_{0}>3. Let 0<ϵ<(1313(k02))/20<\epsilon<(\frac{1}{3}-\frac{1}{3(k_{0}-2)})/2. Consider G1=i=1213δθi+i=3k013(k02)δθik0(Θ)G_{1}=\sum\limits_{i=1}^{2}\frac{1}{3}\delta_{\theta_{i}}+\sum\limits_{i=3}^{k_{0}}\frac{1}{3(k_{0}-2)}\delta_{\theta_{i}}\in\mathcal{E}_{k_{0}}(\Theta) and G2=(13ϵ)δθ1+(13+ϵ)δθ2+i=3k013(k02)δθik0(Θ)G_{2}=(\frac{1}{3}-\epsilon)\delta_{\theta_{1}}+(\frac{1}{3}+\epsilon)\delta_{\theta_{2}}+\sum\limits_{i=3}^{k_{0}}\frac{1}{3(k_{0}-2)}\delta_{\theta_{i}}\in\mathcal{E}_{k_{0}}(\Theta). By the range of ϵ\epsilon, G2k0(Θ)G_{2}\in\mathcal{E}_{k_{0}}(\Theta) and d𝒑(G1,G2)=2ϵd_{\bm{p}}(G_{1},G_{2})=2\epsilon. As in the proof of a),

infG^k0(Θ)supGk0(Θ)𝔼mPG,Nd𝒑(G^,G)ϵ2(1mh(PG1,N,PG2,N)).\inf_{\hat{G}\in\mathcal{E}_{k_{0}}(\Theta)}\sup_{G\in\mathcal{E}_{k_{0}}(\Theta)}\mathbb{E}_{\bigotimes^{m}P_{G,{N}}}d_{\bm{p}}(\hat{G},G)\geq\frac{\epsilon}{2}\left(1-\sqrt{{m}}h\left(P_{G_{1},{N}},P_{G_{2},{N}}\right)\right).

With our choice of G1G_{1} and G2G_{2}, by Lemma 8.2,

h(PG1,N,PG2,N)12×2ϵ=ϵ.h\left(P_{G_{1},{N}},P_{G_{2},{N}}\right)\leq\sqrt{\frac{1}{2}\times 2\epsilon}=\sqrt{\epsilon}.

Combining the last two displays,

infG^k0(Θ)supGk0(Θ)𝔼mPG,Nd𝒑(G^,G)ϵ2(1mϵ).\inf_{\hat{G}\in\mathcal{E}_{k_{0}}(\Theta)}\sup_{G\in\mathcal{E}_{k_{0}}(\Theta)}\mathbb{E}_{\bigotimes^{m}P_{G,{N}}}d_{\bm{p}}(\hat{G},G)\geq\frac{\epsilon}{2}\left(1-\sqrt{{m}}\sqrt{\epsilon}\right).

The proof is complete by specifying ϵ=1m(1313(k02))/4<(1313(k02))/2\epsilon=\frac{1}{{m}}(\frac{1}{3}-\frac{1}{3(k_{0}-2)})/4<(\frac{1}{3}-\frac{1}{3(k_{0}-2)})/2. The cases k0=2k_{0}=2 and k0=3k_{0}=3 follow similarly.

c) The conclusion follows immediately from a), b) and (51). ∎

Appendix H Proofs of Auxiliary Lemmas

H.1 Proofs for Section B.3

Proof of Lemma B.2.

a)

limxy,xx0,yx0|g(x)g(y)g(x0),xy|xy2\displaystyle\lim_{x\neq y,x\to x_{0},y\to x_{0}}\frac{|g(x)-g(y)-\langle\nabla g(x_{0}),x-y\rangle|}{\|x-y\|_{2}}
=\displaystyle= limxy,xx0,yx0|g(ξ),xyg(x0),xy|xy2\displaystyle\lim_{x\neq y,x\to x_{0},y\to x_{0}}\frac{|\langle\nabla g(\xi),x-y\rangle-\langle\nabla g(x_{0}),x-y\rangle|}{\|x-y\|_{2}}
\displaystyle\leq limxy,xx0,yx0g(ξ)g(x0)2\displaystyle\lim_{x\neq y,x\to x_{0},y\to x_{0}}\|\nabla g(\xi)-\nabla g(x_{0})\|_{2}
=\displaystyle= 0,\displaystyle 0,

where the first step follows from the mean value theorem with ξ\xi lying on the line segment connecting xx and yy, the second step follows from the Cauchy–Schwarz inequality, and the last step follows from the continuity of g(x)\nabla g(x) at x0x_{0} and the fact that ξx0\xi\to x_{0} when x,yx0x,y\to x_{0}.

b) For xyx\not=y in BB specified in the statement,

|g(x)g(y)g(x0),xy|xy2\displaystyle\frac{|g(x)-g(y)-\langle\nabla g(x_{0}),x-y\rangle|}{\|x-y\|_{2}}
=\displaystyle= |01g(y+t(xy)),xydtg(x0),xy|xy2\displaystyle\frac{|\int_{0}^{1}\langle\nabla g(y+t(x-y)),x-y\rangle dt-\langle\nabla g(x_{0}),x-y\rangle|}{\|x-y\|_{2}}
=\displaystyle= |01012g(x0+s(y+t(xy)x0)),y+t(xy)x0,xydsdt|xy2\displaystyle\frac{|\int_{0}^{1}\int_{0}^{1}\langle\langle\nabla^{2}g(x_{0}+s(y+t(x-y)-x_{0})),y+t(x-y)-x_{0}\rangle,x-y\rangle dsdt|}{\|x-y\|_{2}}
\displaystyle\leq 0101|2g(x0+s(y+t(xy)x0)),y+t(xy)x0,xy|dsdtxy2\displaystyle\frac{\int_{0}^{1}\int_{0}^{1}|\langle\langle\nabla^{2}g(x_{0}+s(y+t(x-y)-x_{0})),y+t(x-y)-x_{0}\rangle,x-y\rangle|dsdt}{\|x-y\|_{2}}
\displaystyle\leq 01012g(x0+s(y+t(xy)x0))2y+t(xy)x02dsdt\displaystyle\int_{0}^{1}\int_{0}^{1}\|\nabla^{2}g(x_{0}+s(y+t(x-y)-x_{0}))\|_{2}\|y+t(x-y)-x_{0}\|_{2}dsdt
\displaystyle\leq 01012g(x0+s(y+t(xy)x0))2dsdtmax{xx02,yx02},\displaystyle\int_{0}^{1}\int_{0}^{1}\|\nabla^{2}g(x_{0}+s(y+t(x-y)-x_{0}))\|_{2}dsdt\ \max\{\|x-x_{0}\|_{2},\|y-x_{0}\|_{2}\}, (201)

where the first two equalities follow respectively from the fundamental theorem of calculus for \mathbb{R}-valued functions and d\mathbb{R}^{d}-valued functions. Observe that for any matrix Ad×dA\in\mathbb{R}^{d\times d},

A2AFdmax1i,jd|Aij|d1i,jd|Aij|\|A\|_{2}\leq\|A\|_{F}\leq d\max_{1\leq i,j\leq d}|A_{ij}|\leq d\sum_{1\leq i,j\leq d}|A_{ij}|

where F\|\cdot\|_{F} is the Frobenius norm. Applying the preceding display to (201),

01012g(x0+s(y+t(xy)x0))2dsdt\displaystyle\int_{0}^{1}\int_{0}^{1}\|\nabla^{2}g(x_{0}+s(y+t(x-y)-x_{0}))\|_{2}dsdt
\displaystyle\leq d1i,jd0101|2gx(i)x(j)(x0+s(y+t(xy)x0))|dsdt\displaystyle d\sum_{1\leq i,j\leq d}\int_{0}^{1}\int_{0}^{1}\left|\frac{\partial^{2}g}{\partial x^{(i)}x^{(j)}}(x_{0}+s(y+t(x-y)-x_{0}))\right|dsdt

Combining the preceding display with (201),

|g(x)g(y)g(x0),xy|xy2Lmax{xx02,yx02}.\frac{|g(x)-g(y)-\langle\nabla g(x_{0}),x-y\rangle|}{\|x-y\|_{2}}\leq L\max\{\|x-x_{0}\|_{2},\|y-x_{0}\|_{2}\}.
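The matrix norm chain used in part b), ‖A‖₂ ≤ ‖A‖_F ≤ d·max|A_ij| ≤ d·Σ|A_ij|, can be verified numerically on random matrices; the sketch below is a sanity check under no assumptions beyond standard linear algebra.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))

spec = np.linalg.norm(A, 2)       # spectral norm (largest singular value)
frob = np.linalg.norm(A, 'fro')   # Frobenius norm
entry_bound = d * np.max(np.abs(A))
sum_bound = d * np.sum(np.abs(A))

# the chain ||A||_2 <= ||A||_F <= d max|A_ij| <= d sum|A_ij|
assert spec <= frob + 1e-12
assert frob <= entry_bound + 1e-12
assert entry_bound <= sum_bound + 1e-12
```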

Proof of Lemma B.3.

a) Define F(x)=i=1khi(x)ebixF(x)=\sum_{i=1}^{k}h_{i}(x)e^{b_{i}x}. By assumption, F(x)=0F(x)=0 on a dense subset of II; since FF is continuous on \mathbb{R}, F(x)=0F(x)=0 on the closure of that subset, which contains II. Let aIa\in I^{\circ} and consider its Taylor expansion F(x)=i=0F(i)(a)i!(xa)iF(x)=\sum_{i=0}^{\infty}\frac{F^{(i)}(a)}{i!}(x-a)^{i} for any xx\in\mathbb{R}. It follows from F(x)=0F(x)=0 on II that F(i)(a)=0F^{(i)}(a)=0 for any i0i\geq 0. Thus F(x)0F(x)\equiv 0 on \mathbb{R}. Then

0=limxebkxF(x)=limxhk(x).0=\lim_{x\to\infty}e^{-b_{k}x}F(x)=\lim_{x\to\infty}h_{k}(x).

This happens only when hk(x)0h_{k}(x)\equiv 0. Proceeding in the same manner shows hi(x)0h_{i}(x)\equiv 0 for ii from k1k-1 down to 11.

b) Define H(x)=i=1k(hi(x)+gi(x)ln(x))ebixH(x)=\sum_{i=1}^{k}(h_{i}(x)+g_{i}(x)\ln(x))e^{b_{i}x}. By assumption, H(x)=0H(x)=0 on a dense subset of II; since HH is continuous on (0,)(0,\infty), H(x)=0H(x)=0 on the closure of that subset excluding 0, which contains II. Let a1Ia_{1}\in I^{\circ} and consider its Taylor expansion at a1a_{1}: H(x)=i=0H(i)(a1)i!(xa1)iH(x)=\sum_{i=0}^{\infty}\frac{H^{(i)}(a_{1})}{i!}(x-a_{1})^{i} for x(0,2a1)x\in(0,2a_{1}), since the Taylor series of ln(x)\ln(x), xγx^{\gamma} at a1a_{1} converges respectively to ln(x)\ln(x), xγx^{\gamma} on (0,2a1)(0,2a_{1}) for any γ\gamma. It follows from H(x)=0H(x)=0 on II that H(i)(a1)=0H^{(i)}(a_{1})=0 for any i0i\geq 0. Thus H(x)=0H(x)=0 on (0,2a1)(0,2a_{1}). Now take a2=32a1a_{2}=\frac{3}{2}a_{1} and repeat the above analysis with a1a_{1} replaced by a2a_{2}, resulting in H(x)=0H(x)=0 on (0,2a2)=(0,3a1)(0,2a_{2})=(0,3a_{1}). Then take a3=32a2a_{3}=\frac{3}{2}a_{2} and keep repeating the process, and one obtains H(x)=0H(x)=0 on (0,)(0,\infty) since a1>0a_{1}>0. Let γ0\gamma_{0} be the smallest power among all power functions that appear in {gi(x)}i=1k\{g_{i}(x)\}_{i=1}^{k}, {hi(x)}i=1k\{h_{i}(x)\}_{i=1}^{k}, and define H~(x)=xγ0H(x)\tilde{H}(x)=x^{-\gamma_{0}}H(x). Then H~(x)=0\tilde{H}(x)=0 on (0,)(0,\infty). Hence

0=limxebkxH~(x)=limx(xγ0hk(x)+xγ0gk(x)ln(x)),0=\lim_{x\to\infty}e^{-b_{k}x}\tilde{H}(x)=\lim_{x\to\infty}(x^{-\gamma_{0}}h_{k}(x)+x^{-\gamma_{0}}g_{k}(x)\ln(x)),

which happens only when xγ0hk(x)0x^{-\gamma_{0}}h_{k}(x)\equiv 0 and xγ0gk(x)0x^{-\gamma_{0}}g_{k}(x)\equiv 0. That is, for x0x\neq 0, hk(x)0h_{k}(x)\equiv 0 and gk(x)0g_{k}(x)\equiv 0. Proceeding in the same manner shows that, for x0x\neq 0, hi(x)0h_{i}(x)\equiv 0 and gi(x)0g_{i}(x)\equiv 0 for ii from k1k-1 down to 11. ∎

Proof of Lemma B.4.

Let γ>0\gamma>0 be such that the line segment between θaγ\theta-a\gamma and θ+aγ\theta+a\gamma lies in Θ\Theta and 𝔛e4γT(x)f(x|θ)dμ<\int_{\mathfrak{X}}e^{4\gamma^{\top}T(x)}f(x|\theta)d\mu<\infty, 𝔛e4γT(x)f(x|θ)dμ<\int_{\mathfrak{X}}e^{-4\gamma^{\top}T(x)}f(x|\theta)d\mu<\infty due to the fact that the moment generating function exists in a neighborhood of the origin for any given θΘ\theta\in\Theta^{\circ}. Then for Δ(0,γ]\Delta\in(0,\gamma] and for any xSx\in S

|f(x|θ+aΔ)f(x|θ)Δf(x|θ)|\displaystyle\left|\frac{f(x|\theta+a\Delta)-f(x|\theta)}{\Delta\sqrt{f(x|\theta)}}\right|
=\displaystyle= f(x|θ)|exp(aΔ,T(x)(A(θ+aΔ)A(θ)))1Δ|\displaystyle\sqrt{f(x|\theta)}\left|\frac{\exp(\langle a\Delta,T(x)\rangle-(A(\theta+a\Delta)-A(\theta)))-1}{\Delta}\right|
()\displaystyle\overset{(*)}{\leq} f(x|θ)|a,T(x)A(θ+aΔ)A(θ)Δ|eaΔ,T(x)(A(θ+aΔ)A(θ))\displaystyle\sqrt{f(x|\theta)}\left|\langle a,T(x)\rangle-\frac{A(\theta+a\Delta)-A(\theta)}{\Delta}\right|e^{\langle a\Delta,T(x)\rangle-(A(\theta+a\Delta)-A(\theta))}
\displaystyle\leq f(x|θ)(|a,T(x)|+a2maxΔ[0,γ]θA(θ+aΔ)2)×\displaystyle\sqrt{f(x|\theta)}\left(\left|\langle a,T(x)\rangle\right|+\|a\|_{2}\max_{\Delta\in[0,\gamma]}\|\nabla_{\theta}A(\theta+a\Delta)\|_{2}\right)\times
eΔ|a,T(x)|maxΔ[0,γ]e(A(θ+aΔ)A(θ))\displaystyle\quad e^{\Delta|\langle a,T(x)\rangle|}\max_{\Delta\in[0,\gamma]}e^{-(A(\theta+a\Delta)-A(\theta))}
\displaystyle\leq f(x|θ)1γeγ|a,T(x)|+γa2maxΔ[0,γ]θA(θ+aΔ)2eγ|a,T(x)|maxΔ[0,γ]e(A(θ+aΔ)A(θ))\displaystyle\sqrt{f(x|\theta)}\frac{1}{\gamma}e^{\gamma\left|\langle a,T(x)\rangle\right|+\gamma\|a\|_{2}\max_{\Delta\in[0,\gamma]}\|\nabla_{\theta}A(\theta+a\Delta)\|_{2}}\ e^{\gamma|\langle a,T(x)\rangle|}\max_{\Delta\in[0,\gamma]}e^{-(A(\theta+a\Delta)-A(\theta))}
=\displaystyle= C(γ,a,θ)f(x|θ)e2γ|a,T(x)|\displaystyle C(\gamma,a,\theta)\sqrt{f(x|\theta)}e^{2\gamma|\langle a,T(x)\rangle|}
\displaystyle\leq C2(γ,a,θ)f(x|θ)(e4γa,T(x)+e4γa,T(x)),\displaystyle\sqrt{C^{2}(\gamma,a,\theta)f(x|\theta)\left(e^{4\gamma\langle a,T(x)\rangle}+e^{-4\gamma\langle a,T(x)\rangle}\right)}, (202)

where step ()(*) follows from |et1||t|et|e^{t}-1|\leq|t|e^{t}. Then the first conclusion holds with

f¯=C2(γ,a,θ)f(x|θ)(e4γa,T(x)+e4γa,T(x)).\bar{f}=\sqrt{C^{2}(\gamma,a,\theta)f(x|\theta)\left(e^{4\gamma\langle a,T(x)\rangle}+e^{-4\gamma\langle a,T(x)\rangle}\right)}.

Take f~(x)=f¯(x)f(x|θ)\tilde{f}(x)=\bar{f}(x)\sqrt{f(x|\theta)}; by the Cauchy–Schwarz inequality, 𝔛f~(x)dμ𝔛f¯2(x)dμ<\int_{\mathfrak{X}}\tilde{f}(x)d\mu\leq\int_{\mathfrak{X}}\bar{f}^{2}(x)d\mu<\infty. Moreover, by (202)

|f(x|θi0+Δai)f(x|θi0)Δ|f~(x)x𝔛.\left|\frac{f(x|\theta_{i}^{0}+\Delta a_{i})-f(x|\theta_{i}^{0})}{\Delta}\right|\leq\tilde{f}(x)\quad\forall x\in\mathfrak{X}.

H.2 Proofs for Section C.2

Proof of Lemma C.1.

Note that i=1k0bi=i=1k0biPθi0(𝔛)=0\sum_{i=1}^{k_{0}}b_{i}=\sum_{i=1}^{k_{0}}b_{i}P_{\theta_{i}^{0}}(\mathfrak{X})=0. Construct G=i=1k0piδθi0G_{\ell}=\sum_{i=1}^{k_{0}}p_{i}^{\ell}\delta_{\theta_{i}^{0}} with pi=pi0+bi/p_{i}^{\ell}=p_{i}^{0}+b_{i}/\ell for i[k0]i\in[k_{0}]. For large \ell, pi(0,1)p_{i}^{\ell}\in(0,1) and i=1k0pi=1\sum_{i=1}^{k_{0}}p_{i}^{\ell}=1. Then for large \ell, Gk0(Θ)G_{\ell}\in\mathcal{E}_{k_{0}}(\Theta) and GW1G0G_{\ell}\overset{W_{1}}{\to}G_{0}. The proof is then completed by observing that for large \ell

V(PG,PG0)=supA𝒜|PG(A)PG0(A)|=supA𝒜|1/i=1k0biPθi0(A)|=0,V(P_{G_{\ell}},P_{G_{0}})=\sup_{A\in\mathcal{A}}|P_{G_{\ell}}(A)-P_{G_{0}}(A)|=\sup_{A\in\mathcal{A}}|1/\ell\sum_{i=1}^{k_{0}}b_{i}P_{\theta_{i}^{0}}(A)|=0,

and D1(G,G0)=1i=1k0|bi|0.D_{1}(G_{\ell},G_{0})=\frac{1}{\ell}\sum_{i=1}^{k_{0}}|b_{i}|\neq 0.

Proof of Lemma C.2.

By decomposing the difference as a telescoping sum,

|j=1Nf(xj|θi0+aΔ)j=1Nf(xj|θi0)Δ|\displaystyle\left|\frac{\prod_{j=1}^{{N}}f(x_{j}|\theta_{i}^{0}+a\Delta)-\prod_{j=1}^{{N}}f(x_{j}|\theta_{i}^{0})}{\Delta}\right|
\displaystyle\leq =1N(j=11f(xj|θi0+aΔ))|f(x|θi0+aΔ)f(x|θi0)Δ|(j=+1Nf(xj|θi0)).\displaystyle\sum_{\ell=1}^{{N}}\left(\prod_{j=1}^{\ell-1}f(x_{j}|\theta_{i}^{0}+a\Delta)\right)\left|\frac{f(x_{\ell}|\theta_{i}^{0}+a\Delta)-f(x_{\ell}|\theta_{i}^{0})}{\Delta}\right|\left(\prod_{j=\ell+1}^{{N}}f(x_{j}|\theta_{i}^{0})\right).

Then the right hand side of the preceding display is upper bounded Nμa.e.𝔛N\bigotimes^{N}\mu-a.e.\ \mathfrak{X}^{N} by

f~Δ(x¯|θi0,a,N):==1N(j=11f(xj|θi0+aΔ))f¯Δ(x|θi0,a)(j=+1Nf(xj|θi0)).\tilde{f}_{\Delta}(\bar{x}|\theta_{i}^{0},a,N):=\sum_{\ell=1}^{{N}}\left(\prod_{j=1}^{\ell-1}f(x_{j}|\theta_{i}^{0}+a\Delta)\right)\bar{f}_{\Delta}(x_{\ell}|\theta_{i}^{0},a)\left(\prod_{j=\ell+1}^{{N}}f(x_{j}|\theta_{i}^{0})\right).

For clean presentation we write f~Δ(x¯|θi0,a)\tilde{f}_{\Delta}(\bar{x}|\theta_{i}^{0},a) for f~Δ(x¯|θi0,a,N)\tilde{f}_{\Delta}(\bar{x}|\theta_{i}^{0},a,N) in the remainder of the proof. Notice that f~Δ(x¯|θi0,a)\tilde{f}_{\Delta}(\bar{x}|\theta_{i}^{0},a) satisfies

𝔛Nf~Δ(x¯|θi0,a)dNμ==1N𝔛f¯Δ(x|θi0,a)dμN𝔛limΔ0+f¯Δ(x|θi0,a)dμ.\int_{\mathfrak{X}^{N}}\tilde{f}_{\Delta}(\bar{x}|\theta_{i}^{0},a)d\bigotimes^{N}\mu=\sum_{\ell=1}^{{N}}\int_{\mathfrak{X}}\bar{f}_{\Delta}(x_{\ell}|\theta_{i}^{0},a)d\mu\to{N}\int_{\mathfrak{X}}\lim_{\Delta\to 0^{+}}\bar{f}_{\Delta}(x|\theta_{i}^{0},a)d\mu.

Moreover, for Nμa.e.x¯𝔛N\bigotimes^{N}\mu-a.e.\ \bar{x}\in\mathfrak{X}^{N}

limΔ0+f~Δ(x¯|θi0,a)==1N(j=11f(xj|θi0))limΔ0+f¯Δ(x|θi0,a)(j=+1Nf(xj|θi0)),\lim_{\Delta\to 0^{+}}\tilde{f}_{\Delta}(\bar{x}|\theta_{i}^{0},a)=\sum_{\ell=1}^{{N}}\left(\prod_{j=1}^{\ell-1}f(x_{j}|\theta_{i}^{0})\right)\lim_{\Delta\to 0^{+}}\bar{f}_{\Delta}(x_{\ell}|\theta_{i}^{0},a)\left(\prod_{j=\ell+1}^{{N}}f(x_{j}|\theta_{i}^{0})\right),

and thus

𝔛NlimΔ0+f~Δ(x¯|θi0,a)dNμ==1N𝔛limΔ0+f¯Δ(x|θi0,a)dμ=N𝔛limΔ0+f¯Δ(x|θi0,a)dμ.\int_{\mathfrak{X}^{N}}\lim_{\Delta\to 0^{+}}\tilde{f}_{\Delta}(\bar{x}|\theta_{i}^{0},a)d\bigotimes^{N}\mu=\sum_{\ell=1}^{{N}}\int_{\mathfrak{X}}\lim_{\Delta\to 0^{+}}\bar{f}_{\Delta}(x_{\ell}|\theta_{i}^{0},a)d\mu={N}\int_{\mathfrak{X}}\lim_{\Delta\to 0^{+}}\bar{f}_{\Delta}(x|\theta_{i}^{0},a)d\mu.
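The telescoping decomposition at the start of this proof rests on the exact identity ∏a_j − ∏b_j = Σ_ℓ (∏_{j<ℓ} a_j)(a_ℓ − b_ℓ)(∏_{j>ℓ} b_j). A quick numerical check (illustrative values only):

```python
import random
from functools import reduce
from operator import mul

def prod(xs):
    return reduce(mul, xs, 1.0)

def telescope(a, b):
    # sum over l of (prod_{j<l} a_j) * (a_l - b_l) * (prod_{j>l} b_j)
    return sum(prod(a[:l]) * (a[l] - b[l]) * prod(b[l + 1:]) for l in range(len(a)))

random.seed(0)
a = [random.uniform(0.5, 1.5) for _ in range(8)]
b = [random.uniform(0.5, 1.5) for _ in range(8)]
# the telescoping identity behind the first display of the proof
assert abs((prod(a) - prod(b)) - telescope(a, b)) < 1e-9
```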

H.3 Proofs for Section D.4

Proof of Lemma lem:onecorsquintmin.

a) It suffices to prove the case b=0b=0, since the translation x=xbx^{\prime}=x-b reduces the general case bb to the special case b=0b=0. Let f1(x)=f(x)𝟏[E2U,E2U](x)f_{1}(x)=f(x)\bm{1}_{[-\frac{E}{2U},\frac{E}{2U}]}(x), f2(x)=f(x)𝟏[E2U,E2U]c(x)f_{2}(x)=f(x)\bm{1}_{[-\frac{E}{2U},\frac{E}{2U}]^{c}}(x) and fU(x)=U𝟏[E2U,E2U](x)f1(x)f_{U}(x)=U\bm{1}_{[-\frac{E}{2U},\frac{E}{2U}]}(x)-f_{1}(x). Then

[E2U,E2U]fU(x)dx=E[E2U,E2U]f1(x)dx=[E2U,E2U]cf2(x)dx\int_{[-\frac{E}{2U},\frac{E}{2U}]}f_{U}(x)dx=E-\int_{[-\frac{E}{2U},\frac{E}{2U}]}f_{1}(x)dx=\int_{[-\frac{E}{2U},\frac{E}{2U}]^{c}}f_{2}(x)dx

and hence

x2f(x)dx=\displaystyle\int_{\mathbb{R}}x^{2}f(x)dx= [E2U,E2U]x2f1(x)dx+[E2U,E2U]cx2f2(x)dx\displaystyle\int_{[-\frac{E}{2U},\frac{E}{2U}]}x^{2}f_{1}(x)dx+\int_{[-\frac{E}{2U},\frac{E}{2U}]^{c}}x^{2}f_{2}(x)dx
\displaystyle\geq [E2U,E2U]x2f1(x)dx+(E2U)2[E2U,E2U]cf2(x)dx\displaystyle\int_{[-\frac{E}{2U},\frac{E}{2U}]}x^{2}f_{1}(x)dx+\left(\frac{E}{2U}\right)^{2}\int_{[-\frac{E}{2U},\frac{E}{2U}]^{c}}f_{2}(x)dx
=\displaystyle= [E2U,E2U]x2f1(x)dx+(E2U)2[E2U,E2U]fU(x)dx\displaystyle\int_{[-\frac{E}{2U},\frac{E}{2U}]}x^{2}f_{1}(x)dx+\left(\frac{E}{2U}\right)^{2}\int_{[-\frac{E}{2U},\frac{E}{2U}]}f_{U}(x)dx
\displaystyle\geq [E2U,E2U]x2f1(x)dx+[E2U,E2U]x2fU(x)dx\displaystyle\int_{[-\frac{E}{2U},\frac{E}{2U}]}x^{2}f_{1}(x)dx+\int_{[-\frac{E}{2U},\frac{E}{2U}]}x^{2}f_{U}(x)dx
=\displaystyle= [E2U,E2U]x2Udx\displaystyle\int_{[-\frac{E}{2U},\frac{E}{2U}]}x^{2}Udx
=\displaystyle= E312U2.\displaystyle\frac{E^{3}}{12U^{2}}.

Equality holds if and only if the last two inequalities are equalities, which occurs if and only if f(x)=U𝟏[E2U,E2U](x)a.e.f(x)=U\bm{1}_{[-\frac{E}{2U},\frac{E}{2U}]}(x)\ a.e.
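The extremal property just proved can be illustrated by a Riemann-sum computation: among densities bounded by U with total mass E, the uniform block on [−E/2U, E/2U] minimizes the second moment, attaining E³/(12U²). The values of U and E below are illustrative.

```python
import numpy as np

U, E = 2.0, 1.0
half = E / (2 * U)                       # the extremal density is U on [-E/2U, E/2U]
x = np.linspace(-2.0, 2.0, 400001)
dx = x[1] - x[0]

f_uniform = np.where(np.abs(x) <= half, U, 0.0)
# an admissible competitor: same mass E, same bound U, with mass pushed away from 0
f_spread = np.where((np.abs(x) > half) & (np.abs(x) <= 2 * half), U, 0.0)

m_uniform = np.sum(x ** 2 * f_uniform) * dx
m_spread = np.sum(x ** 2 * f_spread) * dx

assert abs(np.sum(f_uniform) * dx - E) < 1e-3   # both have total mass E
assert abs(np.sum(f_spread) * dx - E) < 1e-3
assert abs(m_uniform - E ** 3 / (12 * U ** 2)) < 1e-4   # attains the lower bound
assert m_spread > m_uniform
```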

b) It suffices to prove the case b=0b=0, since the translation y(1)=x(1)by^{(1)}=x^{(1)}-b and y(i)=x(i)y^{(i)}=x^{(i)} for all 2id2\leq i\leq d reduces the general case bb to the special case b=0b=0. By Tonelli’s Theorem, h(x(1))=(a,a)d1f(x)dx(2)dx(d)h(x^{(1)})=\int_{(-a,a)^{d-1}}f(x)dx^{(2)}\ldots dx^{(d)} exists for a.e.x(1)a.e.\ x^{(1)} and h(x(1))dx(1)=E\int_{\mathbb{R}}h(x^{(1)})dx^{(1)}=E. Moreover, 0h(x(1))U(2a)d1a.e.0\leq h(x^{(1)})\leq U(2a)^{d-1}\ a.e. Then by Tonelli’s Theorem and a)

G(x(1))2f(x)dx=(x(1))2h(x(1))dx(1)E312U2(2a)2(d1).\int_{G}(x^{(1)})^{2}f(x)dx=\int_{\mathbb{R}}(x^{(1)})^{2}h(x^{(1)})dx^{(1)}\geq\frac{E^{3}}{12U^{2}(2a)^{2(d-1)}}.

The equality holds if and only if h(x(1))=U(2a)d1𝟏[E2U(2a)d1,E2U(2a)d1](x(1))a.e.h(x^{(1)})=U(2a)^{d-1}\bm{1}_{[-\frac{E}{2U(2a)^{d-1}},\frac{E}{2U(2a)^{d-1}}]}(x^{(1)})\ a.e., if and only if f(x)=Ua.e.x[E2U(2a)d1,E2U(2a)d1]×(a,a)d1f(x)=U\ a.e.x\in[-\frac{E}{2U(2a)^{d-1}},\frac{E}{2U(2a)^{d-1}}]\times(-a,a)^{d-1}. ∎

H.4 Proofs for Section E.2

Proof of Lemma E.3.

a) A direct calculation gives

1h2(f(x|θ1),f(x|θ2))=exp(A(θ1+θ22)A(θ1)+A(θ2)2).1-h^{2}(f(x|\theta_{1}),f(x|\theta_{2}))=\exp\left(A\left(\frac{\theta_{1}+\theta_{2}}{2}\right)-\frac{A(\theta_{1})+A(\theta_{2})}{2}\right). (203)

Let g(θ)=exp(A(θ0+θ2)A(θ0)+A(θ)2)g(\theta)=\exp\left(A\left(\frac{\theta_{0}+\theta}{2}\right)-\frac{A(\theta_{0})+A(\theta)}{2}\right). It is easy to verify that g(θ0)=1g(\theta_{0})=1, g(θ0)=0\nabla g(\theta_{0})=0 and 2g(θ0)=142A(θ0)\nabla^{2}g(\theta_{0})=-\frac{1}{4}\nabla^{2}A(\theta_{0}). Then by (203)

lim supθθ0h2(f(x|θ),f(x|θ0))θθ022=\displaystyle\limsup_{\theta\to\theta_{0}}\frac{h^{2}(f(x|\theta),f(x|\theta_{0}))}{\|\theta-\theta_{0}\|_{2}^{2}}= lim supθθ0g(θ)g(θ0)g(θ0),θθ0θθ022\displaystyle\limsup_{\theta\to\theta_{0}}-\frac{g(\theta)-g(\theta_{0})-\langle\nabla g(\theta_{0}),\theta-\theta_{0}\rangle}{\|\theta-\theta_{0}\|_{2}^{2}} (204)
=\displaystyle= lim supθθ018(θθ0)2A(θ0)(θθ0)+o(θθ022)θθ022\displaystyle\limsup_{\theta\to\theta_{0}}\frac{\frac{1}{8}(\theta-\theta_{0})^{\top}\nabla^{2}A(\theta_{0})(\theta-\theta_{0})+o(\|\theta-\theta_{0}\|_{2}^{2})}{\|\theta-\theta_{0}\|_{2}^{2}}
\displaystyle\leq lim supθθ0(18λmax(2A(θ0))+o(1))\displaystyle\limsup_{\theta\to\theta_{0}}\left(\frac{1}{8}\lambda_{\text{max}}(\nabla^{2}A(\theta_{0}))+o(1)\right)
=\displaystyle= 18λmax(2A(θ0)).\displaystyle\frac{1}{8}\lambda_{\text{max}}(\nabla^{2}A(\theta_{0})).

b) First assume that Θ\Theta^{\prime} is compact and convex. For each θ,θ0Θ\theta,\theta_{0}\in\Theta^{\prime}, by (204),

h2(f(x|θ),f(x|θ0))θθ022=\displaystyle\frac{h^{2}(f(x|\theta),f(x|\theta_{0}))}{\|\theta-\theta_{0}\|_{2}^{2}}= g(θ)g(θ0)g(θ0),θθ0θθ022\displaystyle-\frac{g(\theta)-g(\theta_{0})-\langle\nabla g(\theta_{0}),\theta-\theta_{0}\rangle}{\|\theta-\theta_{0}\|_{2}^{2}}
=\displaystyle= 18(θθ0)2g(ξ)(θθ0)θθ022\displaystyle-\frac{\frac{1}{8}(\theta-\theta_{0})^{\top}\nabla^{2}g(\xi)(\theta-\theta_{0})}{\|\theta-\theta_{0}\|_{2}^{2}}
\displaystyle\leq 18supθΘλmax(2g(θ)),\displaystyle\frac{1}{8}\sup_{\theta\in\Theta^{\prime}}\lambda_{\text{max}}(-\nabla^{2}g(\theta)),

where the second equality follows from Taylor’s theorem, with ξ\xi on the line segment joining θ\theta and θ0\theta_{0}, which lies in Θ\Theta^{\prime} by convexity. The result then follows with L2=18supθΘλmax(2g(θ))L_{2}=\sqrt{\frac{1}{8}\sup_{\theta\in\Theta^{\prime}}\lambda_{\text{max}}(-\nabla^{2}g(\theta))}, which is finite since 2g(θ)\nabla^{2}g(\theta), as a function of A(θ)A(\theta) and its gradient and Hessian, is continuous on Θ\Theta^{\circ}.

If Θ\Theta^{\prime} is compact but not necessarily convex, consider conv(Θ)\operatorname{conv}(\Theta^{\prime}). Note that conv(Θ)\operatorname{conv}(\Theta^{\prime}), as the convex hull of a compact set, is convex and compact. Moreover, conv(Θ)Θ\operatorname{conv}(\Theta^{\prime})\subset\Theta^{\circ} since Θ\Theta^{\circ}, as the interior of a convex set, is convex. The proof is then complete by simply repeating the above proof for conv(Θ)\operatorname{conv}(\Theta^{\prime}). ∎

Proof of Lemma E.1.

Consider an η1\eta_{1}-net Λi\Lambda_{i} with minimum cardinality of {θ:θθ0i22ϵC(G0,diam(Θ1))}\{\theta:\|\theta-\theta^{0}_{i}\|_{2}\leq\frac{2\epsilon}{C(G_{0},\text{diam}(\Theta_{1}))}\} and an η2\eta_{2}-net Λ¯\bar{\Lambda} with minimum cardinality of k0k_{0}-probability simplex {pk0:i=1k0pi=1,pi0}\{p\in\mathbb{R}^{k_{0}}:\sum_{i=1}^{k_{0}}p_{i}=1,p_{i}\geq 0\} under the l1l_{1} distance. Construct a set Λ~={G~=i=1k0piδθi:(p1,,pk0)Λ¯,θiΛi}\tilde{\Lambda}=\{\tilde{G}=\sum_{i=1}^{k_{0}}p_{i}\delta_{\theta_{i}}:(p_{1},\ldots,p_{k_{0}})\in\bar{\Lambda},\theta_{i}\in\Lambda_{i}\}. Then for any Gk0(Θ)G\in\mathcal{E}_{k_{0}}(\Theta) satisfying D1(G,G0)2ϵC(G0,diam(Θ1))D_{1}(G,G_{0})\leq\frac{2\epsilon}{C(G_{0},\text{diam}(\Theta_{1}))}, there exists some G~Λ~\tilde{G}\in\tilde{\Lambda}, such that by Lemma 8.2

h2(pG,Ni,pG~,Ni)(NiL2η1β0+12η2)22(NiL22η12β0+12η2).h^{2}(p_{G,{N}_{i}},p_{\tilde{G},{N}_{i}})\leq\left(\sqrt{{N}_{i}}L_{2}\eta_{1}^{\beta_{0}}+\frac{1}{\sqrt{2}}\sqrt{\eta_{2}}\right)^{2}\leq 2\left({N}_{i}L_{2}^{2}\eta_{1}^{2\beta_{0}}+\frac{1}{2}\eta_{2}\right).

Thus dm,h(G,G~)2L22N¯mη12β0+η2.d_{{m},h}(G,\tilde{G})\leq\sqrt{2L_{2}^{2}\bar{{N}}_{{m}}\eta_{1}^{2\beta_{0}}+\eta_{2}}.

As a result, Λ~\tilde{\Lambda} is a 2L22N¯mη12β0+η2\sqrt{2L_{2}^{2}\bar{{N}}_{{m}}\eta_{1}^{2\beta_{0}}+\eta_{2}}-net of {Gk0(Θ):D1(G,G0)2ϵC(G0,diam(Θ1))}\left\{G\in\mathcal{E}_{k_{0}}(\Theta):D_{1}(G,G_{0})\leq\frac{2\epsilon}{C(G_{0},\text{diam}(\Theta_{1}))}\right\}. Since Λ~\tilde{\Lambda} is not necessarily a subset of k0(Θ)\mathcal{E}_{k_{0}}(\Theta),

𝔑(22L22N¯mη12β0+η2,{Gk0(Θ1):D1(G,G0)2ϵC(G0,diam(Θ1))},dm,h)\displaystyle\mathfrak{N}\left(2\sqrt{2L_{2}^{2}\bar{{N}}_{{m}}\eta_{1}^{2\beta_{0}}+\eta_{2}},\left\{G\in\mathcal{E}_{k_{0}}(\Theta_{1}):D_{1}(G,G_{0})\leq\frac{2\epsilon}{C(G_{0},\text{diam}(\Theta_{1}))}\right\},d_{{m},h}\right)\leq |Λ~|\displaystyle|\tilde{\Lambda}|
=\displaystyle= |Λ¯|i=1k0|Λi|.\displaystyle|\bar{\Lambda}|\prod_{i=1}^{k_{0}}|\Lambda_{i}|. (205)

Now specify η1=(ϵ144L2N¯m)1β0\eta_{1}=\left(\frac{\epsilon}{144L_{2}\sqrt{\bar{{N}}_{{m}}}}\right)^{\frac{1}{\beta_{0}}} and thus

|Λi|(1+22ϵC(G0,diam(Θ1))/η1)q=(1+4×(144L2)1β0C(G0,diam(Θ1))N¯m12β0ϵ(1β01))q.|\Lambda_{i}|\leq\left(1+2\frac{2\epsilon}{C(G_{0},\text{diam}(\Theta_{1}))}/\eta_{1}\right)^{q}=\left(1+\frac{4\times(144L_{2})^{\frac{1}{\beta_{0}}}}{C(G_{0},\text{diam}(\Theta_{1}))}\bar{{N}}_{{m}}^{\frac{1}{2\beta_{0}}}\epsilon^{-(\frac{1}{\beta_{0}}-1)}\right)^{q}.

Moreover, specify η2=12(ϵ72)2\eta_{2}=\frac{1}{2}\left(\frac{\epsilon}{72}\right)^{2} and by [18, Lemma A.4], |Λ¯|(1+5η2)k01=(1+10×722ϵ2)k01|\bar{\Lambda}|\leq\left(1+\frac{5}{\eta_{2}}\right)^{k_{0}-1}=\left(1+10\times 72^{2}\epsilon^{-2}\right)^{k_{0}-1}. Plug η1\eta_{1} and η2\eta_{2} into (205) and the proof is complete. ∎

Proof of Lemma E.2.

Let τ\tau be any permutation in Sk0S_{k_{0}} such that

D1(G,G0)=i=1k0(θτ(i)θi02+|pτ(i)pi0|).D_{1}(G,G_{0})=\sum_{i=1}^{k_{0}}\left(\|\theta_{\tau(i)}-\theta_{i}^{0}\|_{2}+|p_{\tau(i)}-p_{i}^{0}|\right).

For any jτ(i)j\not=\tau(i), θjθi02θτ1(j)0θi02θjθτ1(j)02>ρρ/2=ρ2\|\theta_{j}-\theta_{i}^{0}\|_{2}\geq\|\theta_{\tau^{-1}(j)}^{0}-\theta_{i}^{0}\|_{2}-\|\theta_{j}-\theta_{\tau^{-1}(j)}^{0}\|_{2}>\rho-\rho/2=\frac{\rho}{2}. Then for any τSk0\tau^{\prime}\in S_{k_{0}} that is not τ\tau and for any real number r1r\geq 1

i=1k0(rθτ(i)θi02+|pτ(i)pi0|)>rρ2>rD1(G,G0)i=1k0(rθτ(i)θi02+|pτ(i)pi0|),\sum_{i=1}^{k_{0}}\left(\sqrt{r}\|\theta_{\tau^{\prime}(i)}-\theta_{i}^{0}\|_{2}+|p_{\tau^{\prime}(i)}-p_{i}^{0}|\right)>\sqrt{r}\frac{\rho}{2}\\ >\sqrt{r}D_{1}(G,G_{0})\geq\sum_{i=1}^{k_{0}}\left(\sqrt{r}\|\theta_{\tau(i)}-\theta_{i}^{0}\|_{2}+|p_{\tau(i)}-p_{i}^{0}|\right),

Taking r=1r=1 shows that our choice of τ\tau is unique, and taking r1r\geq 1 shows that τ\tau is the optimal permutation for Dr(G,G0)D_{r}(G,G_{0}). ∎
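The uniqueness argument can be illustrated by brute force over permutations: when the atoms of G₀ are ρ-separated and each atom of G lies within ρ/2 of a distinct true atom, the matching that pairs them is the unique minimizer of D₁. The numbers below are illustrative, not from the paper.

```python
from itertools import permutations

def D1(theta, p, theta0, p0, perm):
    # D_1 for a fixed matching tau: sum_i (|theta_{tau(i)} - theta_i^0| + |p_{tau(i)} - p_i^0|)
    return sum(abs(theta[perm[i]] - theta0[i]) + abs(p[perm[i]] - p0[i])
               for i in range(len(theta0)))

# true atoms with pairwise separation rho = 1, and G with atoms within rho/2 of distinct truths
theta0, p0 = [0.0, 1.0, 2.0], [0.3, 0.3, 0.4]
theta, p = [0.1, 0.9, 2.2], [0.25, 0.35, 0.4]

vals = {perm: D1(theta, p, theta0, p0, perm) for perm in permutations(range(3))}
best = min(vals, key=vals.get)
assert best == (0, 1, 2)                                   # identity matching is optimal
assert sum(v == vals[best] for v in vals.values()) == 1    # and unique
```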