
Stochastic approximation algorithm for estimating mixing distribution for dependent observations

Nilabja Guha (Department of Mathematical Sciences, University of Massachusetts Lowell) and Anindya Roy (Department of Mathematics and Statistics, University of Maryland Baltimore County)
Abstract

Estimating the mixing density of a mixture distribution remains an interesting problem in statistics. Using a stochastic approximation method, Newton and Zhang [1] introduced a fast recursive algorithm for estimating the mixing density of a mixture. Under suitably chosen weights the stochastic approximation estimator converges to the true solution. In Tokdar et al. [2] the consistency of this recursive estimation method was established. However, the existing consistency results for the resulting recursive estimator assume independence among the observations. We extend the investigation of the performance of Newton's algorithm to several dependent scenarios. We prove that the original algorithm, under certain conditions, remains consistent even when the observations arise from a weakly dependent stationary process with the target mixture as the marginal density. We show consistency under a decay condition on the dependence among observations, where the dependence is characterized by a quantity similar to the mutual information between the observations.



1 Introduction

Stochastic approximation (SA) algorithms, which are stochastic optimization techniques based on recursive updates, have many applications in optimization problems arising in different fields such as engineering and machine learning. A classical and pioneering example of an SA algorithm can be found in Robbins and Monro [3], where a recursive method is introduced for finding the root of a function. In particular, suppose we have a non-increasing function $h$ whose values are observed with error, i.e., we observe the value of $h$ at a point $x_n$ as $y_n = h(x_n) + \epsilon_n$. Here the $\epsilon_n$'s are i.i.d. errors with mean zero and finite variance. The Robbins–Monro stochastic algorithm recursively approximates the solution of $h(x) = \alpha_0$ as

$$x_{n+1} = x_n + w_n(y_n - \alpha_0), \qquad (1.1)$$

where $w_n$ is a sequence of weights satisfying $w_n > 0$, $\sum_n w_n = \infty$ and $\sum_n w_n^2 < \infty$. Under (1.1), the sequence $\{x_n\}$ approaches the true root $x_0$ satisfying $h(x_0) = \alpha_0$. Subsequent developments include rates of convergence, optimal step sizes, convergence under convexity, etc. (see Chung [4], Fabian et al. [5], Polyak and Juditsky [6]). Also see Sharia [7], Kushner [8] for recent developments such as adaptively truncating the solution to some domain. More generally, for a predictable sequence $R_t(Z_t)$ with $R_t(z_0) = 0$, $z_0 \in \mathbb{R}^m$, the recursive solution is a limit of the recursion

$$Z_t = Z_{t-1} + \gamma_t(Z_{t-1})\,[R_t(Z_{t-1}) + \epsilon_t(Z_t)], \qquad t = 1, 2, \dots \qquad (1.2)$$

where $Z_0 \in \mathbb{R}^m$ is the initial value, $\gamma_t(Z_{t-1})$ is a (possibly state-dependent) matrix of step sizes, and $\epsilon_t(Z_t)$ is a mean-zero random process. The idea of stochastic approximation was cleverly used in a predictive recursion (PR) algorithm in Newton and Zhang [1] for finding the mixing component of a mixture distribution.
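As a concrete illustration of the recursion (1.1), the following minimal Python sketch applies the Robbins–Monro update to a toy root-finding problem; the target function $h$, the noise level, and the choice $w_n = 1/n$ are illustrative assumptions and not part of the original development.

import numpy as np

rng = np.random.default_rng(0)

def noisy_h(x, sigma=0.5):
    # noisy observation of a non-increasing function h(x) = 2 - x
    return (2.0 - x) + sigma * rng.standard_normal()

def robbins_monro(alpha0=0.0, x0=0.0, n_iter=5000):
    # recursion (1.1): x_{n+1} = x_n + w_n (y_n - alpha0), with w_n = 1/n
    x = x0
    for n in range(1, n_iter + 1):
        w_n = 1.0 / n
        y_n = noisy_h(x)
        x = x + w_n * (y_n - alpha0)
    return x

print(robbins_monro())   # ends close to the true root x = 2, where h(x) = alpha0 = 0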

Mixture models are popular statistical models that provide a nice compromise between the flexibility offered by nonparametric approaches and the efficiency of parametric approaches. Mixture models are increasingly used in messy data situations where an adequate fit is not obtained using conventional parametric models. Many algorithms, mostly variants of the expectation–maximization (EM) algorithm and Markov chain Monte Carlo (MCMC) algorithms, are currently available for fitting mixture models to data. Specifically, these algorithms fit a marginal model of the form

$$m_f(x) = \int_{\Theta} p(x|\theta)\, dF(\theta) \qquad (1.3)$$

to the data $X_1,\ldots,X_n$ assuming a form of the mixing kernel $p(x|\theta)$. A related problem is recovering (estimating) the mixing distribution $F$. The problem of estimating the density $f$ of $F$ (with respect to some dominating measure $\mu$) in a nonparametric setup can be a challenging exercise and can be numerically taxing when full likelihood procedures (such as nonparametric MLE or nonparametric Bayes) are used. However, the full likelihood procedures generally enjoy desirable large sample properties, such as consistency.

Let $X \in \chi$ be a random variable with distribution $p(x|\theta)$, where $\theta \in \Theta$ is the latent variable and $f(\theta)$ is the mixing density. Let $\nu$ and $\mu$ be the sigma-finite measures associated with $\chi$ and $\Theta$, respectively. The recursive estimation algorithm of Newton and Zhang [1] for estimating $f(\theta)$ then starts with some initial $f_0(\theta)$ which has the same support as the true mixing density $f(\theta)$. The update with each new observation $X_i$ is given by

$$f_i(\theta) = (1 - w_i) f_{i-1}(\theta) + w_i \frac{p(X_i|\theta)\, f_{i-1}(\theta)}{m_{i-1}(X_i)}, \qquad (1.4)$$

where $m_{i-1}(x) = \int_{\Theta} p(x|\theta) f_{i-1}(\theta)\, d\mu(\theta)$ is the marginal density of $X$ at the $i$th iteration based on the mixing density $f_{i-1}$ obtained at the previous iteration.
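The update (1.4) is simple to implement once $\Theta$ is approximated by a finite grid. The Python sketch below is our illustration (not the authors' code): it runs the predictive recursion with a normal kernel $p(x|\theta) = N(x;\theta,1)$, weights $w_i = 1/(i+1)$, and a uniform starting density; all of these modelling choices are assumptions made only for this example.

import numpy as np
from scipy.stats import norm

def predictive_recursion(x, theta_grid, f0, kernel_sd=1.0):
    # predictive recursion (1.4) on a finite grid approximating Theta
    f = np.asarray(f0, dtype=float)
    dtheta = np.gradient(theta_grid)                          # quadrature weights on the grid
    for i, xi in enumerate(x, start=1):
        wi = 1.0 / (i + 1)                                    # weights w_i
        like = norm.pdf(xi, loc=theta_grid, scale=kernel_sd)  # p(X_i | theta)
        m_prev = np.sum(like * f * dtheta)                    # m_{i-1}(X_i)
        f = (1 - wi) * f + wi * like * f / m_prev             # update (1.4)
    return f

# toy run: X_i = theta_i + N(0,1) noise with theta_i ~ Uniform(0, 3), i.i.d. data
rng = np.random.default_rng(1)
theta_true = rng.uniform(0.0, 3.0, size=2000)
x = theta_true + rng.standard_normal(2000)
grid = np.linspace(-2.0, 5.0, 141)
f_hat = predictive_recursion(x, grid, f0=np.ones_like(grid) / 7.0)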

From equation (1.4), calculating the marginals on both sides, we obtain the associated iteration for the marginal densities:

$$m_i(x) = m_{i-1}(x)\left\{1 + w_i\left(\frac{\int p(x|\theta)\, p(X_i|\theta)\, f_{i-1}(\theta)\, d\mu(\theta)}{m_{i-1}(x)\, m_{i-1}(X_i)} - 1\right)\right\}. \qquad (1.5)$$

One way to connect the PR update to the SA algorithm is to minimize the Kullback–Leibler (KL) distance between the proposed marginal and the true marginal and consider the minimizer of the corresponding Lagrange multiplier (Ghosh and Tokdar [9]; Martin and Ghosh [10]). The support of $f_0$ can be misspecified. In such cases the PR estimate can be shown to concentrate around a density in the space of proposed mixture densities that is nearest in KL distance to the true marginal (Martin and Tokdar [11]). Similar results can be found for general posterior consistency, such as the developments in Kleijn and van der Vaart [12], where posterior consistency under misspecification was shown under conditions like convexity.

The PR algorithm is computationally much faster than the full likelihood methods. Ghosh and Tokdar [9], and later Tokdar et al. [2], established the consistency of Newton's predictive recursion algorithm, thereby putting the PR algorithm on solid theoretical footing. Some subsequent developments on the rate of convergence can be found in Martin and Ghosh [10] and Martin and Tokdar [11]. Later developments focus on applications to semiparametric models and predictive distribution calculation in Hahn et al. [13], Martin and Han [14]. As in the proof for the original SA algorithm, proving consistency of the recursive estimate of the mixing distribution depends on a martingale difference sum construction. Because the algorithm depends on the order of the $X_i$'s, the resulting estimator is not a function of sufficient statistics, and the consistency of the PR solution cannot be deduced from the consistency of frequentist and Bayesian methods (e.g., using DP mixture models; see Ghosal et al. [15], Lijoi et al. [16]). In Ghosh and Tokdar [9] and Tokdar et al. [2] the martingale based method from Robbins and Monro [3] was adapted to the density estimation setting in a novel way to show the almost sure convergence of the estimator in the weak topology and in Kullback–Leibler (KL) divergence. The PR solution was shown to be the minimizer of the KL divergence between the true marginal and the set of proposed marginals.

One of the key assumptions in the existing proofs of consistency is that the observations are independent. This assumption significantly limits the scope of application of the PR algorithm, since naturally occurring dependence may be present, for example in mixtures of Markov processes. The main result of this paper is that the predictive recursion continues to provide a consistent solution even under weakly dependent stationary processes, as long as the dependence decays reasonably fast. This vanishing dependence can be connected with, and quantified by, information theoretic quantities such as the mutual information between the marginal and the conditional densities of the process. We use a novel sub-sequence argument to tackle the dependence among the observations and prove consistency of the PR algorithm when such vanishing dependence is present. At the same time, we derive a bound on the convergence rate of the PR algorithm under such dependence. As a special case, we later consider general $M$-dependent processes, where the consistency of the recursive estimator holds under weaker conditions. In all cases we also investigate convergence under misspecification of the support of the mixing density and concentration around the KL projection onto the misspecified model.

The arrangement of this article is the following. In Section 2, we provide the background and basic framework for the martingale based argument. Section 3 presents the main results regarding convergence and the rates for the weakly dependent case, and addresses the special case of mixtures of Markov processes. In Section 4, we consider the special case of $M$-dependent sequences.

2 Preliminaries and revisiting the independent case

Our notation and initial framework will follow the original martingale type argument of Robbins and Monro [3] and the developments for the independent case in the literature, especially in Ghosh and Tokdar [9] and Tokdar et al. [2]. Thus, we first introduce the notation and revisit the main techniques used in the proof of consistency of the PR estimator in Tokdar et al. [2]. The discussion will also illustrate the need for generalization of the techniques to the dependent case.

A recursive formulation using the KL divergence along with a martingale based decomposition was used in Tokdar et al. [2], who established convergence of the recursive estimators of the mixing density and of the marginal density. Specifically, if $K_n = \int f \log(f/f_n)\, d\mu(\theta)$ and $\mathcal{F}_i = \sigma(X_1,\ldots,X_i)$ is the $\sigma$-algebra generated by the first $i$ observations, then the following recursion can be established:

$$K_n - K_0 = \sum_{i=1}^n w_i V_i - \sum_{i=1}^n w_i M_i + \sum_{i=1}^n E_i \qquad (2.1)$$

where

\begin{align*}
M_i &= -\operatorname{E}\Big[1 - \frac{m(X_i)}{m_{i-1}(X_i)} \,\Big|\, \mathcal{F}_{i-1}\Big], \qquad m(X_i) = \int p(X_i|\theta) f(\theta)\, d\mu(\theta),\\
V_i &= \Big(1 - \frac{m(X_i)}{m_{i-1}(X_i)}\Big) + M_i,\\
E_i &= \int R(X_i,\theta) f(\theta)\, d\mu(\theta),\\
R(X_i,\theta) &= w_i^2 \Big(\frac{p(X_i|\theta)}{m_{i-1}(X_i)} - 1\Big)^2 R\Big(w_i\Big(\frac{p(X_i|\theta)}{m_{i-1}(X_i)} - 1\Big)\Big),
\end{align*}

and $R(x)$ is defined through the relation $\log(1+x) = x - x^2 R(x)$ for $x > -1$. The remainder term $R$ satisfies $0 \leq R(x) \leq \max\{1, (\tfrac{1}{1+x})^2\}/2$. The corresponding recursion for the KL divergence of the marginal densities, $K_n^* = \int m \log(m/m_n)\, d\nu(x)$, is then

$$K_n^* - K_0^* = \sum_{i=1}^n w_i V_i^* - \sum_{i=1}^n w_i M_i^* + \sum_{i=1}^n E_i^* \qquad (2.2)$$

where

\begin{align*}
g_{i,x}(\theta) &= \frac{p(x|\theta)\, f_{i-1}(\theta)}{m_{i-1}(x)},\\
h_{i,x'}(x) &= \int g_{i,x'}(\theta)\, p(x|\theta)\, d\mu(\theta),\\
R^*(X_i,x) &= w_i^2\Big(\frac{h_{i,X_i}(x)}{m_{i-1}(x)} - 1\Big)^2 R\Big(w_i\Big[\frac{h_{i,X_i}(x)}{m_{i-1}(x)} - 1\Big]\Big),\\
V_i^* &= \Big(1 - \int \frac{h_{i,X_i}(x)}{m_{i-1}(x)}\, m(x)\, d\nu(x)\Big) + M_i^*,\\
M_i^* &= \int_\chi \int_\chi \frac{h_{i,x'}(x)}{m_{i-1}(x)}\, m(x)\, m(x')\, d\nu(x)\, d\nu(x') - 1,\\
E_i^* &= \int_\chi R^*(X_i,x)\, m(x)\, d\nu(x).
\end{align*}

It is assumed that $K_0$ and $K_0^*$, corresponding to the initial starting point $f_0$, are finite. The main idea in Tokdar et al. [2] was to recognize that $\sum_{i=1}^n w_i V_i$ and $\sum_{i=1}^n w_i V_i^*$ are mean zero square integrable martingales and hence almost surely convergent, that $\sum_{i=1}^n w_i M_i$ and $\sum_{i=1}^n w_i M_i^*$ are positive, and that $\sum_{i=1}^n E_i$ and $\sum_{i=1}^n E_i^*$ have finite limits almost surely. Putting these facts together, Tokdar et al. [2] established that $K_n^*$ necessarily converges to zero almost surely, thereby establishing consistency of the predictive recursion sequences $f_n$ and $m_n$ in the weak topology. Using developments in Robbins and Siegmund [17], it can be argued that the PR solution converges to the KL minimizer if the support is misspecified.

Hence, the theoretical results regarding convergence of such PR algorithms in the current literature rely on independence of the $X_i$ and also assume that the true marginal at each $i$ is the same and equal to $m(x)$. There does not seem to be any obvious way of extending the proof to the case when the observations are not independent.

We propose a proof of convergence of the predictive recursion algorithm in the case when the observations are dependent. While the proof will use a martingale construction similar to that of Tokdar et al. [2], there are significant differences in the approach that allow us to address the case of dependent observations, and the tools required to handle the dependence are detailed in the next section. The main contributions of the paper are summarized in the following:

  1. We show that the PR algorithm continues to be consistent under weakly dependent processes where the marginal stationary distribution at each $i$ is a mixture with respect to a fixed kernel. The consistency is obtained under additional conditions on the kernel and the parameter space. If the dependence decays exponentially, under additional conditions, the original PR estimate is shown to be consistent. The decaying dependence is characterized by a special case of expected $f$-divergence between the marginal and conditional densities.

  2. We establish the convergence rate when the support is correctly specified or misspecified, under decaying dependence.

  3. We show the result for finite mixtures of Markov processes and general $M$-dependent processes under milder conditions.

The next section describes the main results as well as the tools needed for the proof of consistency in the dependent case, which uses a novel sub-indexing argument.

3 Main results

To establish consistency of the PR algorithm when observations are dependent, we will need to control the expectation of ratios of $k$-fold products ($k \geq 1$) of densities at two different parameter values. Suppose the parameter $\theta$ lies in a compact subset $\Theta$ of a Euclidean space. Let $\widehat{\Theta}$ denote a closed convex set containing $\Theta$ and assume $p(x|\theta)$ is well defined for $\theta \in \widehat{\Theta}$. Assume there is a finite set $\Theta_H = \{\theta_j,\ j = 1,\dots,n_H\} \subset \widehat{\Theta}$ (typically the extreme points when $\widehat{\Theta}$ is a convex polytope) and a compact set $\chi_c \subset \chi$ such that the following holds.

  • C1. For any $x \notin \chi_c$ there exist $\theta_1^x, \theta_2^x \in \Theta_H$ such that $\sup_{\theta,\theta' \in \Theta}\big\{\frac{p(x|\theta)}{p(x|\theta')}\big\} \leq c_u \frac{p(x|\theta_2^x)}{p(x|\theta_1^x)}$ for some $c_u \geq 1$. This condition is satisfied with $c_u = 1$ when $\theta_l^x = \arg\inf_\theta p(x|\theta) \in \Theta_H$ and $\theta_u^x = \arg\sup_\theta p(x|\theta) \in \Theta_H$.

  • C2. There exists $a > 0$ such that $\inf_{x \in \chi_c,\, \theta \in \Theta} p(x|\theta) > a$.

  • C3. There exists $b < \infty$ such that $\sup_{x \in \chi_c,\, \theta \in \Theta} p(x|\theta) < b$.

Without loss of generality, we can assume that $0 < a < 1 < b$. Under C1–C3, the ratio of the marginals can be bounded by the ratios of conditionals on the finite set $\Theta_H$ and a function of $a$ and $b$. For many common problems, such as location mixtures of $p(x|\theta)$, a finite set $\Theta_H$ exists that satisfies the assumption. This condition can be seen as a generalization of the monotone likelihood ratio property to higher dimensions. The condition is satisfied for a general multivariate normal mean mixture with known covariance matrix, where the mean parameter is constrained to a set $\Theta$ in $\mathbb{R}^d$ contained in a large closed convex polytope $\widehat{\Theta}$; the set $\Theta_H$ then consists of suitably selected points on the boundary of $\widehat{\Theta}$, and $\chi_c = \widehat{\Theta}$.

Proposition 1.

Under C1–C3, for any two distributions $f_1$ and $f_2$ on compact $\Theta$, the ratio of the marginals under mixing densities $f_1$ and $f_2$ can be bounded as

$$\frac{m_{f_1}(X_i)}{m_{f_2}(X_i)} \leq A_1(X_i) = c_u \sum_{\theta_k, \theta_l \in \Theta_H} \frac{p(X_i|\theta_k)\, b}{p(X_i|\theta_l)\, a}. \qquad (3.1)$$
Proof.

Given in the Appendix. ∎

Remark 1.

For finite $\Theta$ the result in Proposition 1 holds trivially using $\Theta$ in place of $\Theta_H$.

When the $X_i$'s have a fixed marginal density $m$, the PR algorithm can still be applied in spite of dependence if the dependence decreases rapidly with distance along the sequential order in which the $X_i$'s enter the algorithm. We assume the following $\alpha$-mixing type condition for the $X_i$ sequence. The decaying dependence is expressed as a special case of expected $f$-divergence (expected $\chi^2$-distance) between the marginal and the conditional densities.

Let $X_{1:i} = \{X_1,\ldots,X_i\}$ and let $m(X_{i+n}|X_{1:i})$ be the conditional density/pmf of $X_{i+n}$ given $X_{1:i}$. We assume

$$\sup_i E\Big[\int \Big(\frac{m(X_{i+n}|X_{1:i})}{m(X_{i+n})} - 1\Big)^2 m(X_{i+n})\, d(\cdot)\Big] \leq c_0^2 \rho^{2n}, \qquad (3.2)$$

where $c_0 > 0$ and $0 < \rho < 1$.
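To make condition (3.2) concrete, the short Python sketch below (an illustration, not part of the paper) numerically estimates the expected $\chi^2$-divergence between the lag-$n$ conditional and the stationary marginal for a Gaussian AR(1) process, for which the conditional given the past depends only on the most recent value; the AR parameter and Monte Carlo sizes are arbitrary choices.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def expected_chi2_ar1(r, lag, n_x=2000, n_y=2000):
    # Monte Carlo estimate of E_X[ chi^2( m(.|X_{1:i}) || m(.) ) ] for a stationary
    # Gaussian AR(1) with parameter r; by the Markov property the conditional
    # density of X_{i+lag} given the past is N(r^lag * X_i, 1 - r^(2*lag)).
    x = rng.standard_normal(n_x)                 # X_i from the stationary marginal N(0,1)
    y = rng.standard_normal(n_y)                 # integration draws from m(.) = N(0,1)
    cond_mean = (r ** lag) * x[:, None]
    cond_sd = np.sqrt(1 - r ** (2 * lag))
    ratio = norm.pdf(y[None, :], loc=cond_mean, scale=cond_sd) / norm.pdf(y[None, :])
    return np.mean((ratio - 1.0) ** 2)           # averages over both X_i and the marginal

for lag in range(1, 6):
    print(lag, expected_chi2_ar1(0.7, lag))      # decays roughly geometrically in the lag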

Let ${\bf H}(X_{i+n}, X_{1:i}) = \int \log \frac{m(X_{i+n}, X_{1:i})}{m(X_{i+n})\, m(X_{1:i})}\, m(X_{i+n}, X_{1:i})\, d(\cdot)$ denote the mutual information between $X_{i+n}$ and $X_{1:i}$, and let ${\bf H}_a(X_{i+n}, X_{1:i}) = \int \big|\log \frac{m(X_{i+n}, X_{1:i})}{m(X_{i+n})\, m(X_{1:i})}\big|\, m(X_{i+n}, X_{1:i})\, d(\cdot)$. The condition given in (3.2) implies exponential decay of the mutual information ${\bf H}(X_{i+n}, X_{1:i})$. When the conditional densities are uniformly bounded and uniformly bounded away from zero, (3.2) is satisfied if ${\bf H}_a(X_{i+n}, X_{1:i})$ decays exponentially. This result can be summarized in the following proposition.

Proposition 2.

Condition (3.2) implies $\sup_i {\bf H}(X_{i+n}, X_{1:i}) \leq c_0^2 \rho^{2n}$, where $c_0, \rho$ are given in (3.2). If $\inf_{i,n} p(X_{i+n}|X_{1:i}) > 0$ and $\sup_{i,n} p(X_{i+n}|X_{1:i}) < \infty$, then condition (3.2) holds if $\sup_i {\bf H}_a(X_{i+n}, X_{1:i}) \leq c_1 \rho^{2n}$ for some $c_1 > 0$.

Proof.

The result follows from the relationship between $f$-divergence and mutual information and is omitted. ∎

In addition, we assume the following conditions which are similar to those in Tokdar et al. [2]:

  • B1. $w_i \downarrow 0$ and $w_i \sim i^{-\alpha}$, $\alpha \in (0.5, 1]$.

  • B2. For $\theta_{l_1}, \theta_{l_2} \in \widehat{\Theta}$, $E\big[\big(\prod_{l=1}^{n_1} \frac{p(X_{j_l}|\theta_{l_1})}{p(X_{j_l}|\theta_{l_2})}\big)^2\big] \leq b_0^{2n_1} \prod_{l=1}^{n_1} E\big[\big(\frac{p(X_{j_l}|\theta_{l_1})}{p(X_{j_l}|\theta_{l_2})}\big)^2\big]$ for some $b_0 > 0$, where the indices $j_l \in \{1, \dots, n\}$ are distinct and $n_1 \leq n$. Assume $b_0 \geq 1$ without loss of generality.

  • B3. $\sup_{\theta_1,\theta_2,\theta_3 \in \widehat{\Theta}} E_{\theta_3}\big[\big(\frac{p(X_i|\theta_1)}{p(X_i|\theta_2)}\big)^2\big] < B < \infty$.

  • B4. The map $\theta \to p(x|\theta)$ is bounded and continuous for each $x \in \chi$.

Condition [B2] is needed for $n_1$-fold products over different indices in the dependent case. This condition will later be verified for some of the examples considered. Condition [B2] can be omitted if the stricter moment condition [B2$'$] is assumed, which is given later. We can now state our main results. Theorem 3.1 shows consistency when the support of the mixing density is finite, while Theorem 3.2 establishes consistency for general support under slightly more restrictive conditions.

Theorem 3.1.

Let $\{X_i\}$ be a sequence of random variables satisfying (3.2), where $X_i$ has a fixed marginal density $m(\cdot)$ given by $m(x) = \int_\Theta p(x|\theta) f(\theta)\, d\theta$ and the support of $f$, $\Theta$, is a finite set. Assume that the initial estimate $f_0$ in (1.4) has the same support as $f$. Then under B1–B4 and C1–C3, the estimator $f_n$ in (1.4) converges to the true mixing density $f$ with probability one as the number of observations $n$ goes to infinity.

Proof.

The terms in the decomposition of the KL divergence in (2.1) can no longer be handled in the same manner as in the independent case. We use further sub-indexing of the terms to obtain appropriate convergence results under the dependence condition (3.2).

We first partition the positive natural numbers into sub-sequences. Let $(j)_r$ denote the $r$th term of the $j$th sub-sequence and let $\Psi(r,j)$ denote its value. The sub-sequences are constructed in the following manner:

  1. Let $\Psi(1,1) = 1$.

  2. Let $\Psi(1,j) = \inf\{\Psi(r, j-1) : \Psi(r, j-1) > j^K\} + 1$ for some fixed positive integer $K > 1$.

  3. Each term of the $(j-1)$th sub-sequence is immediately followed by the next available term of the $j$th sub-sequence; if there is no such term, it is followed by the next term of the first sub-sequence.

By construction, $j^K \leq (j)_1 \leq (j+1)^K$. For convenience of notation, we denote by $i_{last}$ the integer $\Psi(r-1, j)$ whenever $i = \Psi(r,j)$ and $r > 1$. We define $i_{last} = i - j$ if $i = \Psi(1,j)$. Let $\mathcal{F}_{last,i}$ denote the $\sigma$-algebra generated by the collection $\{X_1,\ldots,X_{i_{last}}\}$. Similarly, let $p_{i,last}$ denote the conditional density of $X_i$ given $X_1,\dots,X_{i_{last}}$. Let $L_n$ be the number of sub-sequences constructed up to $i = n$, that is, the number of $j$ such that $\Psi(r,j) \leq n$ for some $r \geq 1$. Clearly, $L_n \leq n$. Also, let $N_{j,n}$ be the number of terms in the $j$th sub-sequence up to $i = n$, that is, with $\Psi(r,j) \leq n$. As an example, for $K = 2$ the following pattern arises for the sub-sequences

$$\overbrace{(1)_1}^{\tilde{\bf B}_1},\ \overbrace{(1)_2}^{\tilde{\bf B}_2},\ \overbrace{(1)_3}^{\tilde{\bf B}_3},\ \overbrace{(1)_4,(2)_1}^{\tilde{\bf B}_4},\ \overbrace{(1)_5,(2)_2}^{\tilde{\bf B}_5},\ \overbrace{(1)_6,(2)_3,(3)_1}^{\tilde{\bf B}_6},\ \dots,\ \overbrace{(1)_{i_1},(2)_{i_2},\dots,(i+1)_1}^{\tilde{\bf B}_{f(i)}}$$

Thus, $\Psi(1,2) = 5$, $\Psi(1,3) = 10$, $N_{1,10} = 6$ and $L_{10} = 3$ in this example.
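The interleaving can be generated programmatically. The Python sketch below (not from the paper) reproduces the $K=2$ pattern displayed above by cycling through the currently active sub-sequences and opening sub-sequence $j+1$ once the last value assigned to sub-sequence $j$ reaches $(j+1)^K$; this is one way to read the construction rules, and it matches the example values $\Psi(1,2)=5$ and $\Psi(1,3)=10$.

def subsequence_partition(n, K=2):
    # assign the integers 1..n to sub-sequences; returns {j: [Psi(1,j), Psi(2,j), ...]}
    psi = {1: []}
    i = 1
    while i <= n:
        # one block: the next term of each currently active sub-sequence, in order
        for j in sorted(psi):
            if i > n:
                break
            psi[j].append(i)
            i += 1
            # open sub-sequence j+1 once the last active one has reached (j+1)^K
            if j == max(psi) and psi[j][-1] >= (j + 1) ** K and i <= n:
                psi[j + 1] = [i]
                i += 1
    return psi

print(subsequence_partition(10))
# {1: [1, 2, 3, 4, 6, 8], 2: [5, 7, 9], 3: [10]}  ->  Psi(1,2)=5, Psi(1,3)=10, N_{1,10}=6, L_10=3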

From equation (1.4), similarly to equation (2.1), we have

$$K_n - K_0 = \sum_{j=1}^{L_n} S_{v,j}^{(n)} - \sum_{j=1}^{L_n} S_{M,j}^{(n)} + \sum_{i=1}^n S_i^\Delta + \sum_{i=1}^n E_i, \qquad (3.3)$$

where

\begin{align*}
S_{v,j}^{(n)} &= \sum_{r=1}^{N_{j,n}} v_{\Psi(r,j)};\\
S_{M,j}^{(n)} &= \sum_{r=1}^{N_{j,n}} w_{\Psi(r,j)} M_{\Psi(r,j)};\\
v_i &= w_i\Big(\Big(1 - \frac{m(X_i)}{m_{i-1}(X_i)}\Big) - E\Big[1 - \frac{m(X_i)}{m_{i-1}(X_i)} \,\Big|\, \mathcal{F}_{last,i}\Big]\Big);\\
M_i &= E\Big[\frac{m(X_i)}{m_{i_{last}}(X_i)} - 1 \,\Big|\, \mathcal{F}_{last,i}\Big];\\
S_i^\Delta &= w_i E\Big[\frac{m(X_i)}{m_{i-1}(X_i)}\Big(\frac{m_{i-1}(X_i)}{m_{i_{last}}(X_i)} - 1\Big) \,\Big|\, \mathcal{F}_{last,i}\Big],
\end{align*}

and $E_i$ is as defined following (2.1). Next we show convergence of the different parts.

Convergence of $\sum_{j=1}^{L_n} S_{v,j}^{(n)}$: We have $E[(S_{v,j}^{(n)})^2] \leq 2\sum w_i^2 E[1 + (A_1(X_i))^2] < b_0'$ for some $b_0' > 0$. This implies that $S_{v,j}^{(n)}$ is a mean zero square integrable martingale with filtration $\sigma(X_1,\cdots,X_{\Psi(N_{j,n},j)})$, and therefore converges almost surely [18]. Thus, outside a set of probability zero, the $S_{v,j}^{(n)}$'s converge for all $j$, as $j$ varies over a countable set.

Next, we show that only finitely many martingale sequences in $\sum_{j=1}^{L_n} S_{v,j}^{(n)}$ make a significant contribution with probability one for large $n$. Choose $s > 1$ and choose $K$ large enough such that $K > 2s/(2\alpha - 1)$. From Lemma 1,

\begin{align*}
P\Big(\sup_n |S_{v,j}^{(n)}| > \frac{\epsilon}{j^s} \text{ infinitely often}\Big) &\leq \lim_{n_0 \uparrow \infty} \sum_{j=n_0}^\infty P\Big(\sup_n |S_{v,j}^{(n)}| > \frac{\epsilon}{j^s}\Big)\\
&\leq \epsilon^{-2} \lim_{n_0 \uparrow \infty} \sum_{j=n_0}^\infty C_0' j^{-K(2\alpha-1) + 2s - 1} \rightarrow 0,
\end{align*}

where $C_0' > 0$ is a constant.

Hence, outside a null set, say $\Omega_N$ (possibly depending on $\epsilon$), we have $\sup_n |S_{v,j}^{(n)}| < \frac{\epsilon}{j^s}$ for all but finitely many $j$'s, and the $S_{v,j}^{(n)}$'s converge. Fix $\omega \in \Omega \backslash \Omega_N$, where $\Omega$ is the underlying product probability space. For any $\epsilon_1 > 0$, we can choose $n_1$ (possibly depending on $\omega$) such that for $j > n_1$, $\sup_n |S_{v,j}^{(n)}| < \frac{\epsilon}{j^s}$ and $\sum_{j=n_1}^\infty |S_{v,j}^{(n)}| < \epsilon_1$. Let $n_2 > \Psi(1,n_1)$ be large enough such that for $j \leq n_1$ we have $\sum_{j=1}^{n_1} |S_{v,j}^{(n_3)} - S_{v,j}^{(n_2)}| < \epsilon_1$ whenever $n_3 > n_2$. Finally,

\begin{align*}
\Big|\sum_{j=1}^{L_{n_3}} S_{v,j}^{(n_3)} - \sum_{j=1}^{L_{n_2}} S_{v,j}^{(n_2)}\Big| &\leq \sum_{j=1}^{n_1} |S_{v,j}^{(n_3)} - S_{v,j}^{(n_2)}| + \Big|\sum_{j>n_1} (S_{v,j}^{(n_3)} - S_{v,j}^{(n_2)})\Big|\\
&\leq \epsilon_1 + \sum_{j>n_1} \sup_{n \leq n_2} |S_{v,j}^{(n)}| + \sum_{j>n_1} \sup_{n \leq n_3} |S_{v,j}^{(n)}| \leq 3\epsilon_1.
\end{align*}

Since $\epsilon_1 > 0$ is arbitrary, we can let $\epsilon_1$ go to zero over a sequence. This implies that $\sum_{j=1}^{L_n} S_{v,j}^{(n)}$ is a Cauchy sequence with probability one and therefore converges with probability one.

Convergence of $\sum_{i=1}^n E_i$: Using the expression for $A_1(x)$, we have

$$E\Big[\sum_{i>i_0} |E_i|\Big] \leq \sum_{i>i_0} 4 w_i^2 E[(1 + A_1(X_i))^2] < \infty,$$

since $w_i\big(\frac{p(X_i|\theta)}{m_{i-1}(X_i)} - 1\big) > -w_i > -\frac{1}{2}$ for $i > i_0$ for some $i_0 > 1$, as $w_i \downarrow 0$. Hence, $\sum_{i>i_0} E_i$ converges almost surely by Proposition 4, and therefore $\sum E_i$ converges almost surely.

Decomposing $M_i$: From the definition of $M_i$ following (3.3), we have

\begin{align*}
M_i &= \int \Big(\frac{m(x)}{m_{i_{last}}(x)} - 1\Big) m(x)\, \nu(dx) + \int \Big(\frac{m(x)}{m_{i_{last}}(x)} - 1\Big)\Big(\frac{p_{i,last}(x)}{m(x)} - 1\Big) m(x)\, \nu(dx)\\
&\geq K^*_{i_{last}} + \int \Big(\frac{m(x)}{m_{i_{last}}(x)} - 1\Big)\Big(\frac{p_{i,last}(x)}{m(x)} - 1\Big) m(x)\, \nu(dx).
\end{align*}

By Cauchy-Schwarz and using [B3] to bound the expression for A1()A_{1}(\cdot) in (3.1), we have

$$\delta_i = \Big|\int \Big(\frac{m(x)}{m_{i_{last}}(x)} - 1\Big)\Big(\frac{p_{i,last}(x)}{m(x)} - 1\Big) m(x)\, \nu(dx)\Big| \leq \sqrt{\int 2(A_1(x)^2 + 1)\, m(x)\, \nu(dx)}\ \sqrt{\int \Big(\frac{p_{i,last}(x)}{m(x)} - 1\Big)^2 m(x)\, \nu(dx)}.$$

Let $\delta_i' = \sqrt{\int \big(\frac{p_{i,last}(x)}{m(x)} - 1\big)^2 m(x)\, \nu(dx)}$ and $\delta_i^{(i)} = E[\delta_i']$. By condition (3.2) and Jensen's inequality, we have $\delta_i^{(i)} \leq c_0 \rho^{(i - i_{last})}$. By construction of the sub-sequences, a gap $(i - i_{last})$ equals $l$ for at most $(l+1)^K$ values of $i$. Also, by (3.1), $\sqrt{\int 2(A_1(x)^2 + 1)\, m(x)\, \nu(dx)} \leq b_1$ for some $b_1 > 0$, so that $E[\delta_i] \leq b_1 \delta_i^{(i)}$. Thus, $\sum_i E[\delta_i] \leq c_0 b_1 \sum_l l^K \rho^{l-1} < \infty$. Hence, $\sum \delta_i$ converges absolutely, and therefore converges, with probability one.

Convergence of $\sum_{i=1}^n S_i^\Delta$: By Proposition 4, it is enough to show that $\sum E\big[\big|E\big[w_i\big(\frac{m(X_i)}{m_{i-1}(X_i)} - \frac{m(X_i)}{m_{i_{last}}(X_i)}\big) \,\big|\, \mathcal{F}_{last,i}\big]\big|\big]$, which is bounded by $\sum E\big[w_i\big|\frac{m(X_i)}{m_{i-1}(X_i)} - \frac{m(X_i)}{m_{i_{last}}(X_i)}\big|\big]$, converges.

Let $n_H$ be the cardinality of $\Theta_H$. By condition [B3], $E[A_1(X_i) + 1] \leq 1 + n_H c_u B_1 b/a = B_u < \infty$, where $E[\frac{p(X_i|\theta_l)}{p(X_i|\theta_k)}] < B_1$. Then, by condition [B2], any $m$-fold product over distinct indices satisfies $E[\prod_{l=1}^m A_1(X_{i_l})] < (B_u')^m$, with $B_u' = b_0 B_u$ and $0 < B_u' < \infty$.

Expanding as a product of ratios of successive marginals,

$$\frac{m_{i-1}(X_i)}{m_{i_{last}}(X_i)} = \prod_{j=i_{last}+1}^{i-1}\left(1 + w_j\Big(\frac{\int p(X_j|\theta)\, p(X_i|\theta)\, f_{j-1}(\theta)\, d\theta}{m_{f_{j-1}}(X_i)\, m_{f_{j-1}}(X_j)} - 1\Big)\right).$$

When $X_j \notin \chi_c$, $\frac{\int p(X_j|\theta)\, p(X_i|\theta)\, f_{j-1}(\theta)\, d\theta}{m_{f_{j-1}}(X_i)\, m_{f_{j-1}}(X_j)} \leq \frac{\int p(X_j|\theta_1)\, p(X_i|\theta)\, f_{j-1}(\theta)\, d\theta}{m_{f_{j-1}}(X_i)\, p(X_j|\theta_2)} \leq A_1(X_j)$, where $p(X_j|\theta)$ is minimized and maximized at $\theta_2$ and $\theta_1$, respectively. When $X_j \in \chi_c$, we have $\frac{\int p(X_j|\theta)\, p(X_i|\theta)\, f_{j-1}(\theta)\, d\theta}{m_{f_{j-1}}(X_i)\, m_{f_{j-1}}(X_j)} \leq \frac{b}{a} \leq A_1(X_j)$.

Hence,

\begin{align*}
\Big|w_i \frac{m(X_i)}{m_{i-1}(X_i)}\Big(\frac{m_{i-1}(X_i)}{m_{i_{last}}(X_i)} - 1\Big)\Big| \leq w_i\Big(&\sum_{j_1 \in \mathscr{S}_i} w_{j_1} A_1(X_i)(A_1(X_{j_1}) + 1)\\
&+ \sum_{j_1 \neq j_2 \in \mathscr{S}_i} w_{j_1} w_{j_2} A_1(X_i)(A_1(X_{j_1}) + 1)(A_1(X_{j_2}) + 1) + \cdots\\
&+ \sum_{j_1 \neq j_2 \neq \cdots \neq j_m \in \mathscr{S}_i} w_{j_1} w_{j_2} \cdots w_{j_m} A_1(X_i)(A_1(X_{j_1}) + 1)(A_1(X_{j_2}) + 1)\cdots(A_1(X_{j_m}) + 1) + \cdots\Big),
\end{align*}

where $\mathscr{S}_i$ is the set $\{i_{last}+1, \ldots, i-1\}$.

The coefficients of the $m$-fold products $A_1(X_{j_1}) A_1(X_{j_2}) \cdots A_1(X_{j_m})$, $j_1 \neq j_2 \neq \cdots \neq j_m$, are bounded by $w_i \sum_{t=m}^\infty w_{i_{last}}^t < c_1 w_{i_{last}}^{m+1}$ for some $c_1 > 0$. The expectation of each such $m$-fold product is bounded by $(B_u')^m$. Let $d_i = i - 2 - i_{last}$. Then there are $\binom{d_i}{m}$ terms consisting of $m$-fold products of the $A_1$'s.

Since $d_i = o(i^{1/K+\epsilon})$ for any $\epsilon > 0$, choosing $N_0$ and $K$ large enough we have $w_{i_{last}} < C' i^{-\alpha_1}$ for $i > N_0$, where $C' > 0$ and $0.5 + 1/K < \alpha_1' < \alpha_1 < \alpha$, and we have

$$\sum_{i>N_0} E[|S_i^\Delta|] \leq C' \sum_i \sum_{m=2}^{d_i} i^{-m(\alpha_1' - 1/K)} (B_u')^m < \infty.$$

By Proposition 4, we have the result.

Combining the parts: Having established that $\sum_{j=1}^{L_n} S_{v,j}^{(n)}$, $\sum_{i=1}^n E_i$, $\sum_{i=1}^n S_i^\Delta$ and $\sum \delta_i$ all converge with probability one to finite quantities, we essentially follow the arguments given in Tokdar et al. [2]. Since $K_i^* > 0$, $K_i^*$ converges to zero along a subsequence with probability one: the left-hand side of (3.3) is finite, and otherwise, since $\sum w_i = \infty$, the right-hand side would be $-\infty$.

Hence, as $K_n \geq 0$, $\sum w_i K_i^*$ has to converge, since all the other sequences converge, and therefore $K_i^*$ converges to zero along a subsequence almost surely. Now, using the finiteness of $\Theta$, the proof follows because $K_n$ converges and, over that subsequence, $f_n$ has to converge to $f$, as otherwise $K_n^*$ could not converge to zero along that subsequence (Ghosh and Tokdar [9]).

In particular, if $K^*_{n_k(\omega)} \rightarrow 0$ and if $f_{n_{k_j}} \rightarrow \hat{f} \neq f$ along some sub-subsequence of $n_k$, then the corresponding marginal $m_{n_{k_j}}$ converges weakly to $\hat{m} \neq m$. But, as $K^*_{n_k} \rightarrow 0$, $m_{n_k}$ converges to $m$ in Hellinger distance and also weakly, which is a contradiction. Hence, $f_{n_{k_j}} \rightarrow f$ and therefore $K_{n_{k_j}} \rightarrow 0$. Therefore, $K_n \rightarrow 0$ and we have $f_n \rightarrow f$ with probability one. ∎

In general, a bigger value of $\rho$ in (3.2) indicates stronger dependence among the $X_i$, and hence the convergence rate is expected to be slower for bigger $\rho$. Even though it is hard to write down explicitly how $\rho$ affects the convergence, some insight can be gleaned by studying the different components of the decomposition of the KL divergence and their convergence.

Under the setting of Theorem 3.1, for convergence of $K_n$ we need the Cauchy increments $|K_m - K_n|$, $n > m$, to go to zero. From equation (3.3), the Cauchy difference is essentially made up of a martingale difference sequence, an increment of a non-negative sequence, and other terms that constitute error terms. The main term in $|K_m - K_n|$ that is influenced directly by the dependence parameter is $\sum_{j=1}^{L_n} S_{M,j}^{(m,n)}$. Following the calculation of the bounds on $M_i$ in the decomposition above, we have

$$\sum_{j=1}^{L_n} S_{M,j}^{(m,n)} \leq \sum_{i=m}^n w_i \int \Big(\frac{m(x)}{m_{i_{last}}(x)} - 1\Big) m(x)\, \nu(dx) + \sum_{l=m^{1/K}}^{(n+1)^{1/K}} w_l\, \tilde{b}_0\, l^K \rho^{l-1}$$

for some $\tilde{b}_0 > 0$. Thus, convergence of the conditional mean sequence, and hence of the KL sequence, is expected to be slower for larger values of $\rho$. This intuition is reinforced in Example 1, where the AR coefficient $r$ plays the role of the dependence parameter in (3.2). Numerically, we see that convergence is faster for smaller $r$ and slower for larger values of $r$.

As mentioned earlier, consistency of the recursive algorithm can be established in much more generality, even under mild dependence, provided we assume a slightly stronger condition. The following condition is stronger than [B2], but may be more readily verifiable.

  • B2$'$. $\sup_{\theta_1,\theta_2,\theta_3 \in \widehat{\Theta}} E_{\theta_3}\big[\big(\frac{p(X_i|\theta_1)}{p(X_i|\theta_2)}\big)^j\big] < b_3^j$ for some $b_3 > 0$ and all positive integers $j$.

The condition is sufficient for establishing consistency in the non-finite support case with dependent data. However, it need not be necessary. It may not hold in some cases where the support of $X$ is unbounded. For example, the condition does not hold for a normal location mixture with dependent data, but it does hold when a truncated normal kernel is used.

In the proof of Theorem 3.1, one could work with [B2$'$] instead of [B2].

Remark 2.

If condition [B2] is replaced with condition [B2$'$] in the statement of Theorem 3.1, the conclusions of Theorem 3.1 continue to hold.

The proof is straightforward and is omitted. Condition [B2$'$] is in essence equivalent to [B2] and [B3] when the kernel is bounded and bounded away from zero.

Remark 3.

If $\inf_\theta p(x|\theta) > 0$ and $\sup_\theta p(x|\theta) < \infty$, then B2, B3 and B2$'$ are satisfied.

Theorem 3.2.

Let $\{X_i\}$ be a sequence of random variables satisfying (3.2), where $X_i$ has a fixed marginal density $m(\cdot)$ given by $m(x) = \int_\Theta p(x|\theta) f(\theta)\, d\theta$ and the support of $f$, $\Theta$, is a compact subset of the corresponding Euclidean space. Assume the initial estimate $f_0$ in (1.4) has the same support as $f$. Let $F_n$ and $F$ denote the cdfs associated with $f_n$ and $f$, respectively. Then under B1, B2$'$, B3, B4 and C1–C3, the estimate $F_n$ from equation (1.4) converges to $F$ in the weak topology with probability one as $n$, the number of observations, goes to infinity.

Proof.

Given in the Appendix. ∎

3.1 Misspecification of support

The predictive recursion algorithm requires specification of the support of the mixing density. The support of $f$ could be misspecified, in which case the solution $f_n$ cannot converge to the true density $f$; however, one could still investigate convergence of the sequence $f_n$. Let the support of the initial density in the predictive recursion (1.4) be $\Theta_0$, a compact set possibly different from $\Theta$, the support of the true mixing density $f$. Let $\mathcal{F}_0$ be the class of all densities with the same support as $f_0(\theta)$. Specifically, let

\begin{align*}
\mathcal{F}_0 &= \{\hat{f} : \hat{f} \mbox{ is a density supported on } \Theta_0\},\\
\mathcal{M}_0 &= \{\hat{m} : \hat{m}(x) = \int_{\Theta_0} p(x|\theta)\, \hat{f}(\theta)\, \mu(d\theta);\ \hat{f} \in \mathcal{F}_0\}. \qquad (3.5)
\end{align*}

Let

$$\tilde{k} = \inf_{\hat{m} \in \mathcal{M}_0} KL(m, \hat{m}). \qquad (3.6)$$

Thus, assuming uniqueness, the minimizer $\tilde{m} \in \mathcal{M}_0$ is the information projection of $m$ onto $\mathcal{M}_0$. This minimizer is related to the Gateaux differential of the KL divergence (Patilea [19]). For independent observations, misspecification of the support has been addressed in the literature; for example, see Ghosh and Tokdar [9], Martin and Ghosh [10]. Using the decomposition for the dependent case in the proofs of Theorems 3.2 and 3.3, we extend the result to dependent cases. Let $\tilde{K}_n^* = K_n^* - \tilde{k}$, where $K_n^*$ is defined in (2.2).

Theorem 3.3.

Let $\{X_i\}$ be a sequence of random variables satisfying (3.2), where $X_i$ has a fixed marginal density $m(\cdot)$ given by $m(x) = \int_\Theta p(x|\theta) f(\theta)\, d\theta$. Assume $\Theta$ is compact and $f_0$ in (1.4) belongs to $\mathcal{F}_0$ defined in (3.5). Assume there exists a unique $\tilde{f}$ such that $KL(m, \tilde{m}) = \tilde{k}$, where $m_{\tilde{f}} = \tilde{m}(x) = \int_{\Theta_0} p(x|\theta)\, \tilde{f}(\theta)\, \mu(d\theta)$. Assume B1, B2$'$, B3, B4 and C1–C3. Then,

$$\lim_{n\to\infty} \tilde{K}_n^* = 0 \mbox{ w.p. 1.}$$

Note that since $\mathcal{F}_0$ is convex, so is the set of corresponding marginals. Thus, uniqueness of $\tilde{f}$ is guaranteed if the model is identifiable.

The proof of Theorem 3.3 is given in the Appendix, following the proof of the analogous result for the $M$-dependent case (Theorem 4.2), as the proofs involve similar arguments.

3.2 Convergence rate of the recursive algorithm

Convergence of PR type algorithms has been explored in recent work such as Martin and Tokdar [11], where the fitted PR marginal is shown to lie in a Hellinger ball of radius $\sim n^{-1/6}$ around the true marginal almost surely. Martin [20] establishes a better bound for finite mixtures under misspecification. PR convergence rates are nonparametric in nature and, in the current literature, do not attain the minimax rate (up to logarithmic factors) such as the rates shown in Ghosal and Van Der Vaart [21] and subsequent developments. The convergence rate calculations ([11], [20]) follow from the supermartingale convergence theorem of Robbins and Siegmund (Robbins and Siegmund [17], Lai [22]) for independent observations and yield a rate similar to Genovese and Wasserman [23].

In the presence of dependence, we carry out the rate calculation assuming a faster rate of decay for the weights $w_i$. Instead of condition B1 we assume:

  • B1$'$. $w_i \downarrow 0$ and $w_i \sim i^{-\alpha}$, $\alpha \in (0.75, 1]$.

We use the sub-sequence construction technique described in Section 3 for the rate calculations. Following the arguments given in the proofs of Theorems 3.1 and 3.2, we establish almost sure concentration in a Hellinger ball of radius $n^{-(1-\gamma)/2}$ around the true marginal for $\gamma > \alpha$. The rate is slower than the rate for the independent case. Let $H^2(f,g) = \int (\sqrt{f} - \sqrt{g})^2\, d(\cdot)$ denote the squared Hellinger distance between densities.

Theorem 3.4.

Assume the conditions of Theorem 3.3 hold with B1 replaced with B1\rm{B1}^{\prime}. Then

$$n^{1-\gamma}\, \tilde{K}_n^* \rightarrow 0 \mbox{ with probability one}$$

for some $\gamma \in (\alpha, 1)$. Moreover, if $\frac{m}{m_{\tilde{f}}}$ is bounded away from zero and infinity, then $H(m_{\tilde{f}}, m_n) = o(n^{-(1-\gamma)/2})$.

Proof.

From the proofs of Theorems 3.2 and 3.3 (in the Appendix), $\sum w_i \tilde{K}^*_{i_{last}}$ converges, where $i_{last} = \Psi(l-1, j)$ if $i = \Psi(l,j)$. By construction $i - i_{last} \preceq i^{1/K}$ and $\frac{w_{i_{last}}}{w_i} = O(1)$. Hence, $\sum w_i \tilde{K}^*_{i-1}$ converges almost surely. Let $a_i = \sum_{j=1}^i w_j$, $a_0 = 0$. Then, from (A.1),

$$a_n \tilde{K}_n^* - a_0 \tilde{K}_0^* = \sum_{i=1}^n w_i \tilde{K}^*_{i-1} + \sum_{j=1}^{L_n} \hat{S}_{v,j}^{*,(n)} - \sum_{j=1}^{L_n} \hat{S}_{M,j}^{*,(n)} + \sum_{i=1}^n a_i S_i^{\Delta*} + \sum_{i=1}^n a_i E_i^*. \qquad (3.7)$$

Here

\begin{align*}
\hat{S}_{v,j}^{*,(n)} &= \sum_{i'=1}^{N_{j,n}} \hat{v}^*_{\Psi(i',j)},\\
\hat{S}_{M,j}^{*,(n)} &= \sum_{i'=1}^{N_{j,n}} a_{\Psi(i',j)}\, w_{\Psi(i',j)}\, M^*_{\Psi(i',j)},\\
\hat{v}_i^* &= a_i w_i\Big(1 - \int \frac{h_{i,X_i}(x)}{m_{i-1}(x)}\, m(x)\, \nu(dx)\Big) - a_i w_i E\Big[1 - \int_\chi \frac{h_{i,X_i}(x)}{m_{i-1}(x)}\, m(x)\, \nu(dx) \,\Big|\, \mathcal{F}_{last,i}\Big].
\end{align*}

Writing $w_i' = a_i w_i \sim i^{-\alpha_2'}$ with $\alpha_2' = 2\alpha - 1 > 0.5$, from the proof of Theorem 3.2 it can be shown that $\sum_{i=1}^n a_i S_i^{\Delta*}$, $\sum_{i=1}^n a_i E_i^*$ and $\sum_{j=1}^{L_n} \hat{S}_{v,j}^{*,(n)}$ converge almost surely. We already established that $\sum w_i \tilde{K}^*_{i-1}$ converges almost surely. Hence, $\sum_{j=1}^{L_n} \hat{S}_{M,j}^{*,(n)}$ has to converge almost surely, which implies that $a_{\Psi(i',j)} w_{\Psi(i',j)} M^*_{\Psi(i',j)}$ goes to zero along a subsequence.

As $a_n \tilde{K}_n^*$ converges to a finite number for each $\omega$ outside a set of probability zero, and $a_n \sim n^{1-\alpha}$, $n^{1-\gamma} \tilde{K}_n^*$ goes to zero with probability one for $\gamma > \alpha$. As $\frac{m_{\tilde{f}}}{m}$ is bounded away from zero and infinity, $KL(m_{\tilde{f}}, m_i) \sim \int_\chi \log \frac{m_{\tilde{f}}(x)}{m_i(x)}\, m(x)\, \nu(dx)$, and the conclusion about the Hellinger distance follows since the Kullback–Leibler divergence is greater than the squared Hellinger distance. ∎

3.3 Finite Mixture of Markov processes and other examples

We consider a few examples of dependent processes, e.g., mixtures of Markov processes, where the marginal density is a mixture. Uniqueness of such representations has been studied extensively in the literature. Some of the earliest work can be found in Freedman [24], Freedman [25]; also see related work in Diaconis and Freedman [26]. We look at variants of stationary autoregressive processes of order one, AR(1), plus noise, and Gaussian processes with constant drift. While simulations are done for different sample sizes, our main objective is to study the effect of dependence on convergence of the PR algorithm and not the rate of convergence. In fact, we use $w_i = 1/(i+1)$ throughout, for which $\alpha = 1$. Thus, condition B1$'$, and hence Theorem 3.4, is not applicable to these examples.

The condition given in (3.2) is a sufficient condition, and if the marginal $m(X_i)$ is a finite mixture of marginals of latent Markov processes which individually satisfy (3.2) and B2, then a similar but slightly weaker condition holds for $m(\cdot)$, and Theorem 3.1 holds for the mixture. In particular, suppose $X_i = I_{i,1} Z_{i,1} + \dots + I_{i,k} Z_{i,k}$, where the $Z_{i,j}$'s are independent Markov processes with stationary marginal densities $m^{(1)}(\cdot), \dots, m^{(k)}(\cdot)$, $(I_{i,1},\dots,I_{i,k}) \sim \mathrm{Multinomial}(1, p_1,\dots,p_k)$, and the marginal distributions $m^{(l)}(x)$ are parametrized by $\theta_l$ (i.e., are of the form $p(x|\theta_l)$). Then we can state the following result.

Proposition 3.

Let $X_i = I_{i,1} Z_{i,1} + \dots + I_{i,k} Z_{i,k}$, where the $Z_{i,j}$'s are independent Markov processes with stationary marginal densities $m^{(1)}(\cdot), \dots, m^{(k)}(\cdot)$ and $(I_{i,1},\dots,I_{i,k}) \sim \mathrm{Multinomial}(1, p_1,\dots,p_k)$. Suppose, for each $l = 1,\ldots,k$, there exist $c_0 > 0$ and $0 < \rho < 1$ such that

\begin{align*}
\sup_i \int \Big(\frac{m^{(l)}(X_{i+n}|X_{1:i})}{m^{(l)}(X_{i+n})} - 1\Big)^2 m(X_{i+n})\, m(X_{1:i}) &= \sup_i \int \Big(\frac{m^{(l)}(X_{i+n}|X_i)}{m^{(l)}(X_{i+n})} - 1\Big)^2 m(X_{i+n})\, m(X_i)\\
&\leq c_0 \rho^{2n}. \qquad (3.8)
\end{align*}

Then Theorem 3.1 holds for the mixture process.

Proof.

By assumption, the marginal of $X_i$ is $m(X_i) = \pi_1 m^{(1)}(X_i) + \cdots + \pi_k m^{(k)}(X_i)$. Let $\mathscr{E}^{i_1}_{i_2,i',j}$ be the event that $I_{i_1,l} = 1$ for $l = j$ and that the last time $I_{i,j} = 1$ for $i \leq i'$ is $i = i_2$. Then

$$\Big(\frac{m(X_{i+n}|X_{1:i})}{m(X_{i+n})} - 1\Big)^2 = \Big(\frac{\sum_l \pi_l m^{(l)}(X_{i+n}) + \sum_{i'=1}^i \sum_l \Delta^{(l)}_{i+n|i:i'}\, p(\mathscr{E}^{i+n}_{i',i,l}|X_{1:i})}{\sum_l \pi_l m^{(l)}(X_{i+n})} - 1\Big)^2,$$

where $\Delta^{(l)}_{i+n|i:i'} = m^{(l)}(X_{i+n}|X_{1:i'}) - m^{(l)}(X_{i+n})$. Hence,

\begin{align*}
\Big(\frac{m(X_{i+n}|X_{1:i})}{m(X_{i+n})} - 1\Big)^2 &\leq \Big[\sum_l \pi_l^{-2} \sum_{i'=1}^i \Big|\frac{m^{(l)}(X_{i+n}|X_{1:i'}) - m^{(l)}(X_{i+n})}{m^{(l)}(X_{i+n})}\Big|\, p(\mathscr{E}^{i+n}_{i',i,l}|X_{1:i})\Big]^2 \qquad (3.9)\\
&\preceq \max_l \sum_{i'=1}^i \Big(\frac{m^{(l)}(X_{i+n}|X_{1:i'})}{m^{(l)}(X_{i+n})} - 1\Big)^2.
\end{align*}

From the calculation following the decomposition of $M_i$ in the proof of Theorem 3.1, it follows that $E[\delta_i^{(i)}] \preceq i_{last}\, \rho^{i - i_{last}}$, with $\rho$ given in equation (3.8) and $i_{last}$ as defined in the proof of Theorem 3.1, where $\preceq$ denotes less than or equal to up to a constant multiple. For $i$ between $l^K$ and $(l+1)^K$, $i - i_{last} = l$. Hence, we have $\sum_i E[\delta_i^{(i)}] \preceq \sum_l l^{2K} \rho^{l-1} < \infty$.

Similarly, B2 can be verified by conditioning on the indicators, and the rest of the argument from Theorem 3.1 follows. ∎

Example 1. AR(1). As an example of a mixture of independent Markov processes, we consider a mixture where one component is Gaussian white noise and the other component is a stationary AR(1) process. Let $X_{1,i} \sim AR(1)$ with marginal $N(0,1)$, and let $X_{2,i} \stackrel{iid}{\sim} N(2.5, 1)$, independent of $X_{1,i}$. Let $X_i = I_i X_{1,i} + (1 - I_i) X_{2,i}$, where the $I_i$'s are independent Bernoulli(0.3). We consider different values of the AR(1) parameter $r$ and investigate the effect on the convergence. To make the stationary variance of the AR(1) process equal to one, we choose the innovation standard deviation to be $\sqrt{1 - r^2}$. For the AR(1) process, the dependence parameter $\rho$ in condition (3.2) is a monotone function of $r$. Specifically, we use $r \in \{0.3, 0.7, 0.99, 0.999\}$. A typical example is given in Figure 1, where we see that the higher the value of $r$, the slower the convergence. While for moderate values of $r$ the effect is negligible, for strong dependence the effect on convergence is very pronounced. In this example the starting mixing probability $p = 0.7$ has been used for $f_0$, where the true mixing probability is 0.3.

Figure 1: The AR(1) example for the general dependent case. Top panels show $r = .3$ and $r = .7$ (left and right, respectively). Bottom panels show $r = .99$ and $r = .999$ (left and right, respectively). The true value of $p$ is .3. The heavily dependent cases $r = .99$ and $r = .999$ show slower convergence.
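The qualitative behaviour in Figure 1 can be reproduced with the following Python sketch (our illustration, not the authors' code), which simulates the two-component process of Example 1 and runs the recursion (1.4) on the finite support $\Theta = \{0, 2.5\}$ to track the estimated mixing probability; the sample size and random seed are arbitrary choices.

import numpy as np
from scipy.stats import norm

def simulate_mixture_ar1(n, r, p=0.3, rng=None):
    # X_i = I_i * X_{1,i} + (1 - I_i) * X_{2,i}: stationary AR(1) with marginal N(0,1)
    # mixed with i.i.d. N(2.5, 1); I_i ~ Bernoulli(p) independently
    rng = rng or np.random.default_rng(0)
    z = np.empty(n)
    z[0] = rng.standard_normal()
    for i in range(1, n):
        z[i] = r * z[i - 1] + np.sqrt(1 - r ** 2) * rng.standard_normal()
    x2 = 2.5 + rng.standard_normal(n)
    ind = rng.random(n) < p
    return np.where(ind, z, x2)

def pr_mixing_prob(x, theta=(0.0, 2.5), p0=0.7):
    # predictive recursion (1.4) on the two-point support; returns the path of f_i(0)
    f = np.array([p0, 1 - p0])
    path = []
    for i, xi in enumerate(x, start=1):
        wi = 1.0 / (i + 1)
        like = norm.pdf(xi, loc=np.array(theta), scale=1.0)
        f = (1 - wi) * f + wi * like * f / np.sum(like * f)
        path.append(f[0])
    return np.array(path)

for r in (0.3, 0.7, 0.99):
    x = simulate_mixture_ar1(5000, r)
    print(r, pr_mixing_prob(x)[-1])   # drifts toward the true p = 0.3, more slowly for large r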

Here it is easy to see that the individual components satisfy (3.2). Thus, the PR algorithm based on $\{X_i\}$ will be consistent for the marginal $m(X_i)$ provided $\{X_i\}$ satisfies the moment conditions assumed in Theorem 3.1. This part follows from conditioning on the indicators and then calculating $E[(\prod_{l=1}^{n_1} \frac{p(X_{j_l}|\theta_{l_1})}{p(X_{j_l}|\theta_{l_2})})^2]$ given the indicators, by recursively conditioning on the observed $X_i$'s from the earlier indices. In particular, it is sufficient to consider the case $n_1 = n$, $j_i = i$ and $I_i = 1$ for all $i$, that is, when $X_1,\dots,X_n$ come from the latent AR(1) process. Note that $(\frac{p(X_{j_l}|\theta_{l_1})}{p(X_{j_l}|\theta_{l_2})})^2 = c'_{\theta_{l_1},\theta_{l_2}} e^{c_{\theta_{l_1},\theta_{l_2}} X_{j_l}}$, where $|c_{\theta_{l_1},\theta_{l_2}}|$ and $|c'_{\theta_{l_1},\theta_{l_2}}|$ are uniformly bounded (say by $\tilde{c} > 0$). For convenience we write $c_{\theta_{l_1},\theta_{l_2}} = c_{l,1,2}$ and $c'_{\theta_{l_1},\theta_{l_2}} = c'_{l,1,2}$.

Let $j_l = n$, $\theta_{l_1} = \theta_{n_1}$, $\theta_{l_2} = \theta_{n_2}$. Let $\mathscr{E}_{1,\dots,n}$ denote the event that $I_i = 1$, $i = 1,\dots,n$. Hence,

$$E\Big[\Big[\frac{p(X_n|\theta_{n_1})}{p(X_n|\theta_{n_2})}\Big]^2 \,\Big|\, X_1,\dots,X_{n-1}, \mathscr{E}_{1,\dots,n}\Big] = e^{r c_{n,1,2} X_{n-1} + \frac{c^2_{n,1,2}(1-r^2)}{2}}\, c'_{n,1,2}.$$

Similarly,

\begin{align*}
&E\Big[\Big[\frac{p(X_n|\theta_{n_1})}{p(X_n|\theta_{n_2})} \frac{p(X_{n-1}|\theta_{(n-1)_1})}{p(X_{n-1}|\theta_{(n-1)_2})}\Big]^2 \,\Big|\, X_1,\dots,X_{n-2}, \mathscr{E}_{1,\dots,n}\Big]\\
&\qquad \leq \tilde{c}^2\, e^{(c_{n-1,1,2} + r c_{n,1,2}) X_{n-2} + \frac{(r c_{n,1,2} + c_{n-1,1,2})^2 (1-r^2) + c^2_{n,1,2}(1-r^2)}{2}}.
\end{align*}

Applying the bounds recursively we have the result.

Example 2. A continuous mixture of AR(1). Next we consider a continuous mixture in the mean with the error term following an AR(1) dependence. Let $X_i = \theta_i + Z_i$, where $Z_i$ follows an AR(1) model with parameter $r = 0.7$. Let $\theta_i \stackrel{iid}{\sim} TN(0,1,-3,3)$, a standard normal distribution truncated to $[-3,3]$. Using a uniform$[-3,3]$ density as $f_0$ in (1.4), the fitted values are given in the left box of Figure 2. The true mixing density is given in solid blue. Solid and dashed black show the fits using $n = 1000$ and $2000$, respectively.
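A compact simulation of Example 2 can be assembled in the same way. The sketch below is illustrative (the seed, grid resolution, and rejection sampler for the truncated normal are our assumptions): it generates $X_i = \theta_i + Z_i$ with AR(1) errors and runs the grid-based recursion (1.4) started from a uniform $f_0$ on $[-3,3]$.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n, r = 2000, 0.7

# theta_i ~ N(0,1) truncated to [-3, 3], via simple rejection sampling
theta = rng.standard_normal(3 * n)
theta = theta[np.abs(theta) <= 3][:n]

# AR(1) errors with stationary variance 1
z = np.empty(n)
z[0] = rng.standard_normal()
for i in range(1, n):
    z[i] = r * z[i - 1] + np.sqrt(1 - r ** 2) * rng.standard_normal()
x = theta + z

# predictive recursion (1.4) on a grid over Theta = [-3, 3], uniform f_0
grid = np.linspace(-3.0, 3.0, 121)
dtheta = grid[1] - grid[0]
f = np.full_like(grid, 1.0 / 6.0)            # uniform density on [-3, 3]
for i, xi in enumerate(x, start=1):
    wi = 1.0 / (i + 1)
    like = norm.pdf(xi, loc=grid, scale=1.0)
    f = (1 - wi) * f + wi * like * f / np.sum(like * f * dtheta)
# f now approximates the truncated-normal mixing density of Example 2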

The following argument verifies the conditions for consistency of the PR algorithm for a mean mixture of an AR(1) process, $X_i = \theta_i + Z_i$. Here $Z_i$ follows an AR(1) process and the $\theta_i$ are i.i.d. with density $f(\theta)$, where $f(\theta)$ is bounded, bounded away from zero, and has compact support. We write

$$\frac{m(X_{i+n}|X_{1:i})}{m(X_{i+n})} = \frac{\int p_Z(X_{i+n} - \theta \,|\, \theta_i = \theta', X_{1:i})\, f(\theta)\, f_{i|X_1,\dots,X_i}(\theta')\, d\theta\, d\theta'}{\int p_Z(X_{i+n} - \theta)\, f(\theta)\, d\theta}.$$

Here, $p_Z(X_{i+n} - \theta \,|\, \theta_i = \theta', X_{1:i}) = p_Z(X_{i+n} - \theta \,|\, X_i - \theta')$, and $f_{i|X_1,\dots,X_i}(\cdot)$ is the conditional density of $\theta_i$ given $X_1,\dots,X_i$. Hence,

\begin{align*}
\Big|\frac{m(X_{i+n}|X_{1:i})}{m(X_{i+n})} - 1\Big| &= \Big|\frac{\int \{p_Z(X_{i+n} - \theta \,|\, X_i - \theta') - p_Z(X_{i+n} - \theta)\}\, f(\theta)\, f_{i|X_1,\dots,X_i}(\theta')}{\int p_Z(X_{i+n} - \theta)\, f(\theta)}\Big|\\
&\leq F_1(X_{i+n}) \int \Big|\frac{p_Z(X_{i+n} - \theta \,|\, X_i - \theta')}{p_Z(X_{i+n} - \theta)} - 1\Big|\, f(\theta)\, f_{i|X_1,\dots,X_i}(\theta'),
\end{align*}

where F1(Xi+n)ec~Xi+n+ed~Xi+n,c~,d~>0F_{1}(X_{i+n})\preceq e^{\tilde{c}X_{i+n}}+e^{-\tilde{d}X_{i+n}},\tilde{c},\tilde{d}>0, using the fact that the support of f()f(\cdot) is bounded and f()f(\cdot) is bounded away from zero and infinity. Hence, all the moments of F1(Xi+n)F_{1}(X_{i+n}) are finite (in particular, E[(F1(Xi+n))a0]<E[(F_{1}(X_{i+n}))^{a_{0}^{\prime}}]<\infty for any a0>0a_{0}^{\prime}>0). The result follows by noticing that pZ(Xi+nθ)f(θ)dθ>eXi+n22ea0|Xi+n|b0\int p_{Z}(X_{i+n}-\theta)f(\theta)d\theta>e^{-\frac{X^{2}_{i+n}}{2}}e^{-a_{0}|X_{i+n}|}b_{0}, where a0,b0>0a_{0},b_{0}>0 are constants. It should be noted that Xi+n=Zi+n+θi+nX_{i+n}=Z_{i+n}+\theta_{i+n}, where θi+nf(θ)\theta_{i+n}\sim f(\theta) is bounded. Using a similar argument, writing the joint density of X1,,XnX_{1},\dots,X_{n} in convolution form, we have fi|X1,Xi(θ)F2(Xi,Xi1)f_{i|X_{1}\dots,X_{i}}(\theta^{\prime})\leq F_{2}(X_{i},X_{i-1}), where all the moments of F2(,)F_{2}(\cdot,\cdot) are finite.

Now, pZ(Xi+nθ|Xiθ)pz(Xi+nθ)=cn′′ecnXi+n2+dnXi+nXi+en\frac{p_{Z}(X_{i+n}-\theta|X_{i}-\theta^{\prime})}{p_{z}(X_{i+n}-\theta)}=c_{n}^{\prime\prime}e^{c_{n}^{\prime}X_{i+n}^{2}+d_{n}^{\prime}X_{i+n}X_{i}+e_{n}^{\prime}}, where cn,dn,en=O(rn);cn′′=1+O(rn)c_{n}^{\prime},d_{n}^{\prime},e_{n}^{\prime}=O(r^{n});c_{n}^{\prime\prime}=1+O(r^{n}). Hence, using the Cauchy–Schwarz inequality we have,

|m(Xi+n|X1:i)m(Xi+n)1|2m(Xi+n)m(X1:i)rn.\int|\frac{m(X_{i+n}|X_{1:i})}{m(X_{i+n})}-1|^{2}m(X_{i+n})m(X_{1:i})\preceq r^{n}.
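The order of the coefficients used above can be made explicit; this is a sketch of the computation implicit in the text. For the stationary AR(1) with unit marginal variance (the same normalization as in Example 1), Z_{i+n} given Z_i is N(r^n Z_i, 1 - r^{2n}), so writing z' = X_{i+n} - θ and z = X_i - θ',

\frac{p_{Z}(z^{\prime}\mid z)}{p_{Z}(z^{\prime})}=\frac{1}{\sqrt{1-r^{2n}}}\exp\Big\{-\frac{(z^{\prime}-r^{n}z)^{2}}{2(1-r^{2n})}+\frac{z^{\prime 2}}{2}\Big\}=\big(1+O(r^{2n})\big)\exp\Big\{-\frac{r^{2n}}{2(1-r^{2n})}z^{\prime 2}+\frac{r^{n}}{1-r^{2n}}zz^{\prime}-\frac{r^{2n}}{2(1-r^{2n})}z^{2}\Big\},

and since θ and θ' range over a bounded set, collecting terms in X_{i+n} and X_i gives c_n', d_n', e_n' = O(r^n) and c_n'' = 1 + O(r^n), as claimed.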

Hence the dependence condition is satisfied. Next we verify condition [B2][B2] for the mixture process. For i1<<ini_{1}<\cdots<i_{n^{\prime}}, such that {i1,,in}{1,,n}\{i_{1},\cdots,i_{n^{\prime}}\}\subset\{1,\cdots,n\}, and for Xi=Zi+θiX_{i}=Z_{i}+\theta_{i}, where θi[a,b]\theta_{i}\in[a,b], let θil,l=1,,n\theta_{i_{l}},l=1,\cdots,n^{\prime}, and θjl,l=1,,n\theta_{j_{l}},l=1,\cdots,n^{\prime} be such that θi,l,θj,lΘ^[a,b]\theta_{i,l},\theta_{j,l}\in\hat{\Theta}\subset[a,b]. Choosing bb large enough and aa small enough,

l=1np(Xil|θil)p(Xil|θjl)en[(ba)2+b2]l=1neZil(θjlθil)\displaystyle\prod_{l=1}^{n^{\prime}}\frac{p(X_{i_{l}}|\theta_{i_{l}})}{p(X_{i_{l}}|\theta_{j_{l}})}\leq e^{n^{\prime}[(b-a)^{2}+b^{2}]}\prod_{l=1}^{n^{\prime}}e^{-Z_{i_{l}}(\theta_{j_{l}}-\theta_{i_{l}})}

where ZiZ_{i} follows AR(1)AR(1) with parameter rr. Then by computing expectations recursively and noting that Zin|Zi1,,Zin1N(rinin1Zin1,1r2(inin1))Z_{i_{n^{\prime}}}|Z_{i_{1}},\cdots,Z_{i_{n^{\prime}-1}}\sim N(r^{i_{n^{\prime}}-i_{n^{\prime}-1}}Z_{i_{n^{\prime}-1}},1-r^{2(i_{n^{\prime}}-i_{n^{\prime}-1})}), the moment condition is established using arguments similar to those in Example 1.

Example 3. An irregular mixture with AR(1) error. Consider the last example but for a mixing distribution that has more structure. Specifically, let θiiid0.5δ{0}+0.5TN(4,1,8,8)\theta_{i}\stackrel{{\scriptstyle iid}}{{\sim}}0.5\delta_{\{0\}}+0.5TN(4,1,-8,8). Here Θ=[8,8]\Theta=[-8,8]. Using a continuous uniform f0f_{0} on [-8,8], the fitted densities are given in the middle box of Figure 2 for sample sizes n=500,1000n=500,1000, respectively. The algorithm converges to a mixing density ff that places probability approximately 0.45 in the interval (-0.5,0.5) around zero. The true mixing density is given by the solid blue line and the fits by the black lines. While the estimated density already shows bimodality, a much larger sample size was also used to see explicitly the effect of large nn. The right box in Figure 2 shows the estimated mixture density for n=5000n=5000. The estimate is markedly better around the continuous mode and the other mode is more concentrated around zero, indicating recovery of the discrete part. However, the convergence is markedly slower than that in Example 2. Verification of the conditions for consistency follows as in the previous example.
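A sketch of this example, reusing rng, truncated_normal, ar1_errors and predictive_recursion from the sketch given after Example 2 (again with the illustrative weights w_i = (i+1)^{-0.8}):

def simulate_irregular_mixture(n, r=0.7):
    # theta_i ~ 0.5*delta_{0} + 0.5*TN(4, 1, -8, 8), added to the AR(1) errors
    atom = rng.random(n) < 0.5
    theta = np.where(atom, 0.0, truncated_normal(n, 4.0, 1.0, -8.0, 8.0))
    return theta + ar1_errors(n, r)

n3 = 5000
x3 = simulate_irregular_mixture(n3)
grid3 = np.linspace(-8.0, 8.0, 401)
f0_3 = np.ones_like(grid3) / 16.0                        # uniform f_0 on [-8, 8]
w3 = np.arange(2, n3 + 2, dtype=float) ** -0.8
f_hat3 = predictive_recursion(x3, grid3, f0_3, w3)       # mass near 0 and near 4 is expected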

Figure 2: Examples with continuous mixture of AR(1). Left panel corresponds to Example 2 with θiTN(0,1,3,3)\theta_{i}\sim TN(0,1,-3,3); middle and right panels are density estimates for Example 3. For Example 2, n=1000,2000n=1000,2000 (dashed and solid, respectively). For Example 3: n=500,1000n=500,1000 (dashed and solid, respectively, in the middle box) and n=5000n=5000 in the right panel. The true mixing density is given in blue and the fitted ones in black.

Example 4. Gaussian process paths.

We consider two sub-examples with observations from a latent Gaussian process with known covariance kernel. In the first case, the observations are shifted by a fixed mean with some unknown probability. In the second case, we consider an unknown constant drift along with the mean function. In both cases the observations follow the resulting marginal mixture distribution.

a) Zero drift. Consider a continuous time process observed at times ti,i=1,,nt_{i},i=1,\ldots,n, where the observed process is the sum of two independent latent processes. Suppose X(t)𝒢𝒫(1,𝒦)X(t)\sim\mathscr{GP}(-1,\mathscr{K}) with covariance kernel 𝒦(t1,t2)=.1e(t1t2)2/10\mathscr{K}(t_{1},t_{2})=.1e^{-(t_{1}-t_{2})^{2}/10} and the observed process Y(ti)=Zia1(ti)+X(ti)Y(t_{i})=Z_{i}a_{1}(t_{i})+X(t_{i}) where a1(t)=3a_{1}(t)=3 and ZiiidBernoulli(0.3)Z_{i}\stackrel{{\scriptstyle iid}}{{\sim}}Bernoulli(0.3) independent of X(t)X(t). Using the known support, i.e., support at 1-1 and 22 for the mean, we can plot the predictive recursion update for the probability at -1 in the left panel of Figure 3 (the true value is 0.7). If instead we start with a continuous support on [3,3][-3,3] and a uniform f0f_{0}, the predictive recursion solution for n=1000n=1000 is given in the right panel.
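A minimal sketch of part (a) in Python/NumPy; the Gaussian-process path is drawn by direct multivariate-normal sampling with a small jitter added for numerical stability, and the weight sequence (i+1)^{-0.8} is an illustrative choice rather than the one used for Figure 3.

import numpy as np
rng = np.random.default_rng(1)

n = 1000
t = np.arange(1, n + 1, dtype=float)
K = 0.1 * np.exp(-(t[:, None] - t[None, :]) ** 2 / 10.0)        # covariance kernel
x = rng.multivariate_normal(-np.ones(n), K + 1e-8 * np.eye(n))  # latent GP path with mean -1
y = x + 3.0 * (rng.random(n) < 0.3)                             # observed process Y(t_i)

support = np.array([-1.0, 2.0])   # known two-point support for the marginal mean
prob = np.array([0.5, 0.5])       # initial mixing probabilities
sigma2 = K[0, 0]                  # variance of Y(t_i) given its mean (= 0.1)
for i, yi in enumerate(y, start=1):
    w = (i + 1.0) ** -0.8
    post = np.exp(-0.5 * (yi - support) ** 2 / sigma2) * prob
    prob = (1.0 - w) * prob + w * post / post.sum()   # PR update on the discrete support
# prob[0] should settle near the true value 0.7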

Figure 3: Example 4 in the weakly dependent case. Left panel shows the convergence of the mixing density under the true discrete support, where the index on the x-axis is the number of observations used. Right panel shows the fitted mixing density if f0f_{0} is Unif[3,3]Unif[-3,3], which places mass around the true discrete support.

b) Constant drift. A similar setting as in part (a) is considered with Y(ti)=Zia1(ti)+X(ti)Y(t_{i})=Z_{i}a_{1}(t_{i})+X(t_{i}), X(t)𝒢𝒫(0,𝒦)X(t)\sim\mathscr{GP}(0,\mathscr{K}) and 𝒦(t1,t2)=.1e(t1t2)2/10\mathscr{K}(t_{1},t_{2})=.1e^{-(t_{1}-t_{2})^{2}/10}, where there is a drift given by a1(t)=α+βt/100a_{1}(t)=\alpha+\beta t/100, the observed points are at the tit_{i}’s, i=1,,ni=1,\cdots,n, and α=5,β=2\alpha=5,\beta=2. The top panel of Figure 4 shows the fitted marginal and joint mixing densities for the intercept and slope parameters α\alpha and β\beta obtained by the SA algorithm, where the initial density is uniform on the rectangle [6,6]×[6,6][-6,6]\times[-6,6]. We have concentration around 0 and 55 for the intercept parameter, and 0 and 22 for the slope parameter.

The bottom panel of Figure 4 shows the marginal fit when the starting density is uniform on [3,3]×[6,6][-3,3]\times[-6,6], so that the support does not contain the true intercept parameter. In that case the mixing density for the intercept concentrates around 0 and 33, corresponding to the nearest point in Kullback–Leibler distance.
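For part (b) the same recursion can be run over a two-dimensional grid for (α, β), with kernel p(y_i | α, β) = N(α + β t_i/100, 0.1) for the shifted component. The following sketch reuses rng, n, t, K and sigma2 from the part (a) sketch, again uses Bernoulli(0.3) indicators, and uses illustrative weights; it is an outline under these assumptions, not the exact implementation used for Figure 4.

# 2D predictive recursion for part (b): mixing density over (alpha, beta)
a_grid = np.linspace(-6.0, 6.0, 61)
b_grid = np.linspace(-6.0, 6.0, 61)
A, B = np.meshgrid(a_grid, b_grid, indexing="ij")
cell = (a_grid[1] - a_grid[0]) * (b_grid[1] - b_grid[0])
f = np.full_like(A, 1.0 / 144.0)                         # uniform initial density on [-6, 6]^2

# simulated data: Y(t_i) = Z_i * (5 + 2 t_i / 100) + X(t_i), X ~ GP(0, K), Z_i ~ Bernoulli(0.3)
x0 = rng.multivariate_normal(np.zeros(n), K + 1e-8 * np.eye(n))
y2 = x0 + (rng.random(n) < 0.3) * (5.0 + 2.0 * t / 100.0)

for i, (ti, yi) in enumerate(zip(t, y2), start=1):
    w = (i + 1.0) ** -0.8
    kern = np.exp(-0.5 * (yi - (A + B * ti / 100.0)) ** 2 / sigma2)
    m = np.sum(kern * f) * cell                          # current marginal at y_i
    f = (1.0 - w) * f + w * kern * f / m
# marginal for alpha: f.sum(axis=1) * (b_grid[1] - b_grid[0]); for beta: f.sum(axis=0) * (a_grid[1] - a_grid[0])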

Figure 4: Example 4 in the weakly dependent case. Gaussian process with constant drift. Top left, middle and right panels show the estimated mixing densities for the slope, the intercept, and the joint mixing density, respectively. The bottom panel shows the marginal mixing densities for the case when the support of the initial distribution does not contain the true intercept.

4 MM-dependent processes

An important subclass of the weakly dependent processes is the class of MM-dependent processes, where XiX_{i} and XjX_{j} are independent if |ij|>M|i-j|>M for some positive integer MM. Heuristically, if the process is MM-dependent for some positive integer M=q1M=q-1, one expects the original PR algorithm to provide consistent estimates over qq different subsequences where consecutive indices are qq apart, and hence to provide an overall consistent estimator. Thus, one expects that the sufficient conditions assumed for general weakly dependent processes can be significantly relaxed. This is indeed the case, and the proof for the MM-dependent processes is also significantly different. Hence we present the case of MM-dependent processes separately.
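The heuristic can be made concrete with a small numerical sketch (Python/NumPy; the MA(1) model and the parameter value below are illustrative choices of a 1-dependent process, so q = 2):

import numpy as np
rng = np.random.default_rng(2)

# MA(1): a 1-dependent process, so q = 2
n, psi1 = 20000, 0.6
e = rng.standard_normal(n + 1)
x = e[1:] + psi1 * e[:-1]

def lag_corr(x, h):
    return np.corrcoef(x[:-h], x[h:])[0, 1]

print(lag_corr(x, 1), lag_corr(x, 2))   # about psi1 / (1 + psi1^2) at lag 1, about 0 at lag 2

# the q interleaved subsequences underlying the argument of this section:
# within each subsequence consecutive indices are q apart, hence the entries are i.i.d.
q = 2
subsequences = [x[j::q] for j in range(q)]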

Consider an MM-dependent sequence {Xi}i\{X_{i}\}_{i\in\mathbb{Z}} with fixed marginal distribution Xim(x)X_{i}\sim m(x) of the form (1.3). From the definition of the process, XiX_{i} and Xi+hX_{i+h} are independent for all ii and hqh\geq q for some positive integer qq. An example of such a process would be a (q1)(q-1)th order moving average, MA(q1)MA(q-1), defined as

x_{i}=e_{i}+\psi_{1}e_{i-1}+\cdots+\psi_{q-1}e_{i-q+1},\;\;\;i\in\mathbb{Z}

where (ψ1,,ψq1)(\psi_{1},\ldots,\psi_{q-1}) are fixed parameters and {ei}\{e_{i}\} are independent mean zero random variables with density me()m_{e}(\cdot). Let the marginal density of XiX_{i} be of the mixture form (1.3). Note that the MA(q1)MA(q-1) process considered is stationary, but for the PR example we merely need the marginal density to not change with the index. As mentioned, the assumptions could be relaxed in the MM-dependent case. The new assumptions are

  • A1

    wn0w_{n}\downarrow 0 as nn\uparrow\infty; i=1wj+iq=\sum_{i=1}^{\infty}w_{j+iq}=\infty for j=1,2,,qj=1,2,\dots,q; and i=1wi2<\sum_{i=1}^{\infty}w_{i}^{2}<\infty.

  • A2

    The map Fp(x|θ)f(θ)𝑑θF\mapsto\int p(x|\theta)f(\theta)d\theta is injective; that is the mixing density is identifiable from the mixture p(x|θ)f(θ)𝑑θ.\int p(x|\theta)f(\theta)d\theta.

  • A3

    The map θp(x|θ)\theta\to p(x|\theta) is bounded and continuous for xχx\in\chi.

  • A4

    For θ1\theta_{1}, θ2\theta_{2} and θ3\theta_{3} in Θ\Theta, {p(x|θ1)p(x|θ2)}kp(x|θ3)ν(dx)<B\int\{\frac{p(x|\theta_{1})}{p(x|\theta_{2})}\}^{k}p(x|\theta_{3})\nu(dx)<B for some B>0B>0 and for k4qk\leq 4q.

The conditions needed for convergence for a qq-dependent sequence are similar to those in the independent case, other than [A1] and [A4]. Condition [A1] is needed because we work with qq-gap martingale sums instead of the subsequence construction used in Theorems 3.1 and 3.2 for the gradually vanishing dependence. Condition [A4] is needed to account for the difference in the conditional mean terms, which will involve qq-fold products of terms similar to the A1(Xi)A_{1}(X_{i})’s.
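For instance, the polynomial weights used elsewhere in the paper satisfy [A1]: for w_i = i^{-α} with α ∈ (1/2, 1] and any j ∈ {1, …, q},

\sum_{i=1}^{\infty}w_{j+iq}=\sum_{i=1}^{\infty}(j+iq)^{-\alpha}\geq q^{-\alpha}\sum_{i=1}^{\infty}(i+j)^{-\alpha}=\infty,\qquad\sum_{i=1}^{\infty}w_{i}^{2}=\sum_{i=1}^{\infty}i^{-2\alpha}<\infty,

using j + iq ≤ q(i + j) for q ≥ 1, while w_n ↓ 0 trivially.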

Theorem 4.1.

Let X1,,XnX_{1},\ldots,X_{n} be a sample from a (q1)(q-1)-dependent process with marginal density mm of the form (1.3) with mixing density ff. Assume A1-A4 and C1-C3 hold and Θ\Theta is compact. Then the estimator FnF_{n} from equation (1.4) converges weakly to the true mixing distribution FF with probability one.

Proof.

Given in the Appendix. ∎

Analogous to the general case, a statement can be made about convergence under misspecification of support in the MM-dependent case. Assume the set up of Theorem 3.3.

Theorem 4.2 (Convergence under misspecification).

Assume the XiX_{i} are generated from a (q1)(q-1)-dependent process with fixed marginal density mm. Assume A1-A4 and C1-C3 hold. Then K~n0\tilde{K}_{n}^{*}\rightarrow 0 with probability one where K~n\tilde{K}_{n}^{*} is defined in Theorem 3.3.

The proof is given in the Appendix.

4.1 Convergence rate for MM-dependent case

Next we investigate the convergence rate of the PR algorithm in the MM-dependent case. We will assume B1\rm{B1}^{\prime} as the decay rate for the weights wiw_{i}.

Theorem 4.3.

For unique f~\tilde{f} and wiiγw_{i}\sim i^{-\gamma} with γ(3/4,1)\gamma\in(3/4,1), under the setting of Theorem 4.2 and for γ(γ,1)\gamma^{\prime}\in(\gamma,1), we have nγ+1K~n0n^{-\gamma^{\prime}+1}\tilde{K}_{n}^{*}\rightarrow 0 with probability one and H(mf~,mn)=o(n(1γ)/2)H(m_{\tilde{f}},m_{n})=o(n^{-(1-\gamma^{\prime})/2}), provided mmf~\frac{m}{m_{\tilde{f}}} is bounded away from zero and infinity.

The proof of Theorem 4.3 follows from arguments similar to those for Theorem 3.4 and is given in the Appendix.

5 Discussion

We have established consistency of the solution of the predictive recursion algorithm under various dependence scenarios. For the dependent cases considered, we have also explored the convergence rate of the solution of the predictive recursion algorithm. The theoretical development provides justification for using the original algorithm in many cases. Under stationarity but misspecification of the dependence structure, the original algorithm continues to work as long as the dependence decays reasonably fast. The best possible nonparametric rate in the dependent case is an interesting problem to explore, and conditions for the feasibility of the minimax rate need to be studied.

The proposed theoretical development justifies the possible use of stochastic approximation even when the errors in the observations come from a mean-zero moving average or autoregressive process, if certain conditions are satisfied. It is well known that stochastic approximation or predictive recursion algorithms do not give a posterior estimate. However, a similar extension establishing posterior consistency under misspecified independence, under conditions analogous to equation (3.2), may be explored.

6 Acknowledgement

The first author, Nilabja Guha, is supported by NSF grant #2015460.

A Appendix

We first prove a simple proposition which we use throughout the proofs. This is a standard result from probability theory, which we restate and prove for convenience.

Proposition 4.

Let Z1,Z2,Z_{1},Z_{2},\dots be a sequence of random variables such that i=1E[|Zi|]<\sum_{i=1}^{\infty}E[|Z_{i}|]<\infty. Then i=1nZi\sum_{i=1}^{n}Z_{i} converges almost surely and the limit is finite almost surely.

Proof.

Let Ω\Omega be the probability space corresponding to the joint distribution. For any ωΩ\omega\in\Omega, if |Zi(ω)|\sum|Z_{i}(\omega)| converges then Zi(ω)\sum Z_{i}(\omega) converges. Let ZZ_{\infty} be the limit of |Zi(ω)|\sum|Z_{i}(\omega)|, defined to be infinity at those ω\omega for which |Zi(ω)|\sum|Z_{i}(\omega)| diverges to positive infinity.

By the Monotone Convergence Theorem, E[Z]=i=1E[|Zi|]<E[Z_{\infty}]=\sum_{i=1}^{\infty}E[|Z_{i}|]<\infty (or, equivalently, this can be argued using Fatou’s lemma on the sequence of partial sums i=1n|Zi|\sum_{i=1}^{n}|Z_{i}|). Hence ZZ_{\infty} is finite with probability one. Therefore, i=1|Zi|\sum_{i=1}^{\infty}|Z_{i}| converges with probability one and hence i=1Zi\sum_{i=1}^{\infty}Z_{i} converges with probability one. ∎

Lemma 1.

From equation (3.3), P(supn|Sv,j(n)|>ϵjs)C0ϵ2jK(2α1)+2s1P(\sup_{n}|S_{v,j}^{(n)}|>\frac{\epsilon}{j^{s}})\leq C^{\prime}_{0}\epsilon^{-2}j^{-K(2\alpha-1)+2s-1}, for C0>0C^{\prime}_{0}>0, some universal constant not depending on jj.

Proof.

From the derivation after equation (3.3) and using Proposition 3.1, E[vΨ(i,j)2]E[2wΨ(i,j)2(1+A1(XΨ(i,j))2)]c0′′wΨ(i,j)2E[v_{\Psi(i,j)}^{2}]\leq E[2w_{\Psi(i,j)}^{2}(1+A_{1}(X_{\Psi(i,j)})^{2})]\leq c_{0}^{{}^{\prime\prime}}w^{2}_{\Psi(i,j)}, for some universal constant c0′′>0c_{0}^{{}^{\prime\prime}}>0, using condition A4. From the martingale construction in Section 3, Ψ(1,j)>jK\Psi(1,j)>j^{K}. By construction, the coefficients belonging to each block 𝐁~l{\bf\tilde{B}}_{l} decrease as the index jj increases; that is, if Ψ(l1,j1)\Psi(l_{1},j_{1}) and Ψ(l2,j2)\Psi(l_{2},j_{2}) are in 𝐁~l{\bf\tilde{B}}_{l}, then wΨ(l1,j1)>wΨ(l2,j2)w_{\Psi(l_{1},j_{1})}>w_{\Psi(l_{2},j_{2})} if j2>j1j_{2}>j_{1}. Hence, E[(Sv,j(n))2]j1c0′′lΨ(1,j)wl2c0′′j1lΨ(1,j)l2α<c0j1Ψ(1,j)2α+1<C0j1j2Kα+KE[(S_{v,j}^{(n)})^{2}]\leq j^{-1}c^{{}^{\prime\prime}}_{0}\sum_{l\geq\Psi(1,j)}w_{l}^{2}\leq c^{{}^{\prime\prime}}_{0}j^{-1}\sum_{l\geq\Psi(1,j)}l^{-2\alpha}<c_{0}^{\prime}j^{-1}{\Psi(1,j)}^{-2\alpha+1}<C_{0}^{\prime}j^{-1}j^{-2K\alpha+K}, where c0′′,c0,C0c^{{}^{\prime\prime}}_{0},c_{0}^{\prime},C_{0}^{\prime} are universal constants.

Finally, using Doob’s maximal inequality [18],

P(supn|Sv,j(n)|>ϵjs)=limn0P(supn<n0|Sv,j(n)|>ϵjs)C0ϵ2jK(2α1)+2s1.P(\sup_{n}|S_{v,j}^{(n)}|>\frac{\epsilon}{j^{s}})=lim_{n_{0}\uparrow\infty}P(\sup_{n<n_{0}}|S_{v,j}^{(n)}|>\frac{\epsilon}{j^{s}})\leq C^{\prime}_{0}\epsilon^{-2}j^{-K(2\alpha-1)+2s-1}.
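The form of the maximal inequality used in the last display is the L_2 version of Doob's inequality for the square integrable martingale (S_{v,j}^{(n)})_{n ≤ n_0}: for λ > 0,

P\Big(\max_{n\leq n_{0}}|S_{v,j}^{(n)}|\geq\lambda\Big)\leq\lambda^{-2}E[(S_{v,j}^{(n_{0})})^{2}],

applied with λ = ε/j^{s}, after which n_0 is let to increase to infinity.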

A.1 Proof of Proposition 3.1

Proof.

Note that mf1(Xi)mf2(Xi)p(Xi|θ1)p(Xi|θ2)A1(Xi)\frac{m_{f_{1}}(X_{i})}{m_{f_{2}}(X_{i})}\leq\frac{p(X_{i}|\theta_{1})}{p(X_{i}|\theta_{2})}\leq A_{1}(X_{i}), where p(Xi|θ)p(X_{i}|\theta) is minimized and maximized at θ2\theta_{2} and θ1\theta_{1}, respectively, when XiχcX_{i}\notin\chi_{c} (note that p(x|θ)p(x|\theta) is a continuous function on a compact set).

If XiχcX_{i}\in\chi_{c} then mf1(Xi)mf2(Xi)baA1(Xi)\frac{m_{f_{1}}(X_{i})}{m_{f_{2}}(X_{i})}\leq\frac{b}{a}\leq A_{1}(X_{i}).

A.2 Proof of Theorem 3.2

Proof.

Following (2.2) and the decomposition (3.3), we obtain an analogous KL decomposition for the marginals in the dependent case. We have,

KnK0=j=1LnSv,j,(n)j=1LnSM,j,(n)+i=1nSiΔ+i=1nEi.\displaystyle K_{n}^{*}-K_{0}^{*}=\sum_{j=1}^{{L_{n}}}S_{v,j}^{*,(n)}-\sum_{j=1}^{L_{n}}S_{M,j}^{*,(n)}+\sum_{i=1}^{n}S^{\Delta*}_{i}+\sum_{i=1}^{n}E^{*}_{i}. (A.1)

Here

Sv,j,(n)\displaystyle S_{v,j}^{*,(n)} =\displaystyle= i=1Nj,nvΨ(i,j)\displaystyle\sum_{i=1}^{N_{j,n}}v^{*}_{\Psi(i,j)}
SM,j,(n)\displaystyle S_{M,j}^{*,(n)} =\displaystyle= i=1Nj,nwΨ(i,j)MΨ(i,j)\displaystyle\sum_{i=1}^{N_{j,n}}w_{\Psi(i,j)}M^{*}_{\Psi(i,j)}
vi\displaystyle v_{i}^{*} =\displaystyle= wi((1hi,Xi(x)mi1(x)m(x)ν(dx))E[1χhi,Xi(x)mi1(x)m(x)ν(dx)|last,i])\displaystyle w_{i}\Big{(}(1-\int\frac{h_{i,X_{i}}(x)}{m_{i-1}(x)}m(x)\nu(dx))-E[1-\int_{\chi}\frac{h_{i,X_{i}}(x)}{m_{i-1}(x)}m(x)\nu(dx)|\mathcal{F}_{last,i}]\Big{)}
Mi\displaystyle M^{*}_{i} =\displaystyle= E[χhi,Xi(ilast)(x)milast(x)m(x)ν(dx)1|last,i];hi,x(ilast)(x)\displaystyle E[\int_{\chi}\frac{h^{(i_{last})}_{i,X_{i}}(x)}{m_{i_{last}}(x)}m(x)\nu(dx)-1|\mathcal{F}_{last,i}];h^{(i_{last})}_{i,x^{\prime}}(x)
=\displaystyle= p(x|θ)p(x|θ)filast(θ)𝑑μ(θ)milast(x)\displaystyle\frac{\int p(x^{\prime}|\theta)p(x|\theta)f_{i_{last}}(\theta)d\mu(\theta)}{m_{i_{last}}(x^{\prime})}
SiΔ\displaystyle S^{\Delta*}_{i} =\displaystyle= wiE[χhi,Xi(ilast)(x)milast(x)m(x)ν(dx)1|last,i]\displaystyle w_{i}E[\int_{\chi}\frac{h^{(i_{last})}_{i,X_{i}}(x)}{m_{i_{last}}(x)}m(x)\nu(dx)-1|\mathcal{F}_{last,i}]-
wiE[χhi,Xi(x)mi1(x)m(x)ν(dx)1|last,i],\displaystyle w_{i}E[\int_{\chi}\frac{h_{i,X_{i}}(x)}{m_{i-1}(x)}m(x)\nu(dx)-1|\mathcal{F}_{last,i}],

where hi,x(x)h_{i,x^{\prime}}(x) is defined in (2.2). From the proof of Theorem 3.1 it follows that the sequence KnK_{n} converges and KnK_{n}^{*} converges to zero over a subsequence with probability one, as in that case Θ\Theta was finite. Therefore, to show that KnK_{n}^{*} converges to zero with probability one, it is enough to show that KnK_{n}^{*} converges almost surely.

Convergence of j=1LnSv,j,(n)\sum_{j=1}^{{L_{n}}}S_{v,j}^{*,(n)}: From earlier calculations, |1hi,Xi(x)mi1(x)m(x)ν(dx)|1+A1(Xi)|1-\int\frac{h_{i,X_{i}}(x)}{m_{i-1}(x)}m(x)\nu(dx)|\leq 1+A_{1}(X_{i}). Hence, convergence of j=1LnSv,j,(n)\sum_{j=1}^{{L_{n}}}S_{v,j}^{*,(n)} follows in exactly the same way as the convergence of j=1LnSv,j(n)\sum_{j=1}^{{L_{n}}}S_{v,j}^{(n)} in Theorem 3.1. Since the Sv,j,(n)S_{v,j}^{*,(n)} are square integrable martingales, each of them converges almost surely. From the fact that |1hi,Xi(x)mi1(x)m(x)ν(dx)|1+A1(Xi)|1-\int\frac{h_{i,X_{i}}(x)}{m_{i-1}(x)}m(x)\nu(dx)|\leq 1+A_{1}(X_{i}) we have E[vΨ(i,j)2]E[2wΨ(i,j)2(1+A1(XΨ(i,j))2)]c0wΨ(i,j)2E[{v^{*}_{\Psi(i,j)}}^{2}]\leq E[2w_{\Psi(i,j)}^{2}(1+A_{1}(X_{\Psi(i,j)})^{2})]\leq c_{0}w^{2}_{\Psi(i,j)} for some fixed c0>0c_{0}>0. Therefore, by arguments similar to those in Lemma 1, we have

P(supn|Sv,j(,n)|>ϵjs)c0jK(2α1)+2s1,P(\sup_{n}|S_{v,j}^{(*,n)}|>\frac{\epsilon}{j^{s}})\leq c^{\prime}_{0}j^{-K(2\alpha-1)+2s-1},

for c0>0c^{\prime}_{0}>0. Hence, only finitely many martingales make a contribution to the tail sum with probability one, as jP(supn|Sv,j,n|>ϵjs)<\sum_{j}P(\sup_{n}|S_{v,j}^{*,n}|>\frac{\epsilon}{j^{s}})<\infty and supn|Sv,j,n|<ϵjs\sup_{n}|S_{v,j}^{*,n}|<\frac{\epsilon}{j^{s}} for all but finitely many jj’s with probability one, for s>1s>1. Thus, j=1LnSv,j,(n)\sum_{j=1}^{{L_{n}}}S_{v,j}^{*,(n)} is Cauchy almost surely and therefore convergent almost surely.

Convergence of SiΔ\sum{S_{i}^{\Delta}}^{*}: Let,

Δ(Xi,x)\displaystyle\Delta^{*}(X_{i},x) =\displaystyle= wi|hi,Xi(ilast)(x)milast(x)hi,Xi(x)mi1(x)|\displaystyle w_{i}|\frac{h_{i,X_{i}}^{(i_{last})}(x)}{m_{i_{last}}(x)}-\frac{h_{i,X_{i}}(x)}{m_{i-1}(x)}|
\displaystyle\leq wi{θ|p(Xi|θ)p(x|θ)filast(θ)mi1(Xi)mi1(x)(mi1(x)mi1(Xi)milast(Xi)milast(x)1)|dμ(θ)\displaystyle w_{i}\Huge\{\int_{\theta}|\frac{p(X_{i}|\theta)p(x|\theta)f_{i_{last}}(\theta)}{m_{i-1}(X_{i})m_{i-1}(x)}(\frac{m_{i-1}(x)m_{i-1}(X_{i})}{m_{i_{last}}(X_{i})m_{i_{last}}(x)}-1)|d\mu(\theta)
+\displaystyle+ θ|p(Xi|θ)p(x|θ)mi1(Xi)mi1(x)filast(θ)(fi1(θ)filast(θ)1)|dμ(θ)}\displaystyle\int_{\theta}|\frac{p(X_{i}|\theta)p(x|\theta)}{m_{i-1}(X_{i})m_{i-1}(x)}{f_{i_{last}}(\theta)}(\frac{f_{i-1}(\theta)}{f_{i_{last}}(\theta)}-1)|d\mu(\theta)\Huge\}

and let Δ(Xi,x)m(x)ν(dx)=Δ(Xi)\int\Delta^{*}(X_{i},x)m(x)\nu(dx)=\Delta^{*}(X_{i}). Note that

mi1(x)milast(x)=j=ilasti2(1+wj(p(Xj|θ)p(x|θ)fj(θ)𝑑θmfj(x)mfj(Xj)1))j=ilasti2(1+wj(A1(Xj)+1)).\frac{m_{i-1}(x)}{m_{i_{last}}(x)}=\prod_{j=i_{last}}^{i-2}(1+w_{j}(\frac{\int p(X_{j}|\theta)p(x|\theta)f_{j}(\theta)d\theta}{m_{f_{j}}(x)m_{f_{j}}(X_{j})}-1))\leq\prod_{j=i_{last}}^{i-2}(1+w_{j}(A_{1}(X_{j})+1)).

Hence, in the ll-fold products of A1(Xjl)A_{1}(X_{j_{l}})’s arising from (mi1(x)mi1(Xi)milast(Xi)milast(x)1)(\frac{m_{i-1}(x)m_{i-1}(X_{i})}{m_{i_{last}}(X_{i})m_{i_{last}}(x)}-1), each index jlj_{l} appears at most twice. In SiΔS^{\Delta*}_{i} the terms are products of terms like p(Xi|θi1)p(Xj|θi2)\frac{p(X_{i}|\theta_{i_{1}})}{p(X_{j}|\theta_{i_{2}})} and p(x|θj1)p(x|θj2)\frac{p(x|\theta_{j_{1}})}{p(x|\theta_{j_{2}})} where θi1,θi2,θj1,θj2\theta_{i_{1}},\theta_{i_{2}},\theta_{j_{1}},\theta_{j_{2}} are in ΘH{\Theta}_{H}. Let qi=iilastq_{i}=i-i_{last}. Using B2\rm{B2}^{\prime} and Hölder’s inequality, the expectation of any such ll-fold product is bounded by b3lb_{3}^{l}. The number of such products of A1()A_{1}(\cdot) is less than qilq_{i}^{l}, with qi=o(i1/K+ϵ)q_{i}=o(i^{1/K+\epsilon}) for any ϵ>0\epsilon>0. Also, an ll-fold product contains nHln_{H}^{l} many terms, whose expectations can be bounded by B3lB_{3}^{l} for some B3>0B_{3}>0. Hence, E[|SiΔ|]E|Δ(Xi)|Cwil=1il(α11/K)B~3l<C1i2α1E[|{S_{i}^{\Delta}}^{*}|]\leq E|\Delta^{*}(X_{i})|\leq Cw_{i}\sum_{l=1}^{\infty}i^{-l(\alpha_{1}-1/K)}\tilde{B}_{3}^{l}<C_{1}i^{-2\alpha_{1}^{\prime}} for some universal constants C,C1C,C_{1} and B~3\tilde{B}_{3} greater than zero, for .5<α1<α1<α.5<\alpha_{1}^{\prime}<\alpha_{1}<\alpha and large enough KK. Thus, E[|SiΔ|]i2α1<\sum E[|{S_{i}^{\Delta}}^{*}|]\preceq\sum i^{-2\alpha_{1}^{\prime}}<\infty, where for sequences {an}n1\{a_{n}\}_{n\geq 1}, {bn}n1\{b_{n}\}_{n\geq 1},an,bn>0a_{n},b_{n}>0, anbna_{n}\preceq b_{n} implies that anC0bna_{n}\leq C_{0}^{\prime}b_{n} for some C0>0C_{0}^{\prime}>0. Hence, SiΔ\sum{S_{i}^{\Delta}}^{*} converges with probability one.

Convergence of wiMi\sum w_{i}M_{i}^{*}: Note that

M_{i}^{*}=\int\delta_{i}^{*}(x^{\prime})m(x^{\prime})dx^{\prime}+\int\delta_{i}^{*}(x^{\prime})\Big(\frac{p_{i,last}(x^{\prime})}{m(x^{\prime})}-1\Big)m(x^{\prime})\nu(dx^{\prime})=I+II

where δi(x)=[χhi,x(ilast)(x)milast(x)m(x)ν(dx)1]\delta_{i}^{*}(x^{\prime})=[\int_{\chi}\frac{h^{(i_{last})}_{i,x^{\prime}}(x)}{m_{i_{last}}(x)}m(x)\nu(dx)-1]. Using the Cauchy–Schwarz inequality and condition B3, the expectation of the second term is bounded by CρiilastC^{\prime}\rho^{i-i_{last}} for some C>0C^{\prime}>0. The number of indices ii for which (iilast)(i-i_{last}) equals ll is less than (l+1)K(l+1)^{K}, where KK is defined in the martingale construction in Section 3. Therefore, the sum of IIII over all ii is absolutely convergent.

The first term,

I=χδi(x)m(x)ν(dx)\displaystyle I=\int_{\chi}\delta_{i}^{*}(x^{\prime})m(x^{\prime})\nu(dx^{\prime}) =\displaystyle= χ[χhi,x(ilast)(x)milast(x)m(x)ν(dx)ν(dx)]1\displaystyle\int_{\chi}[\int_{\chi}\frac{h^{(i_{last})}_{i,x^{\prime}}(x)}{m_{i_{last}}(x)}m(x)\nu(dx)\nu(dx^{\prime})]-1
=\displaystyle= Efilast[(χp(x|θ)milast(x)m(x)ν(dx))2]1\displaystyle E_{f_{i_{last}}}[\large(\int_{\chi}\frac{p(x|\theta)}{m_{i_{last}}(x)}m(x)\nu(dx)\large)^{2}]-1
\displaystyle\geq (Efilast[p(x|θ)milast(x)m(x)ν(dx)])21\displaystyle\large(E_{f_{i_{last}}}[\int\frac{p(x|\theta)}{m_{i_{last}}(x)}m(x)\nu(dx)]\large)^{2}-1
=\displaystyle= 11=0.\displaystyle 1-1=0.

Hence, wiMi\sum w_{i}M_{i}^{*} either converges or diverges to ++\infty. Given that the LHS of equation (A.1) cannot be -\infty and the other terms on the RHS of (A.1) converge, wiMi\sum w_{i}M_{i}^{*} has to converge with probability one.

Hence, we have KnK_{n}^{*} converging to zero along a subsequence with probability one and converging with probability one; therefore, KnK_{n}^{*} converges to zero with probability one.

We now argue that this implies FnF_{n} converges weakly to FF. Suppose not. Since Θ\Theta is compact, FnF_{n} is tight and hence has a convergent subsequence. Let FnkF_{n_{k}} be a subsequence that converges to F^F{\hat{F}}\neq F. Let m^{\hat{m}} be the marginal corresponding to F^{\hat{F}}. Then by [B4], mnkm_{n_{k}} converges pointwise to m^{\hat{m}} and hence, by Scheffé’s theorem, it converges in L1L_{1} and hence in Hellinger norm to m^{\hat{m}}. However, by the previous calculations, mnkm_{n_{k}} converges to mm almost surely in Kullback–Leibler distance and therefore in Hellinger norm, which is a contradiction as F^F\hat{F}\neq F. Hence, every convergent subsequence of FnF_{n} converges weakly to FF, and therefore FnF_{n} converges weakly to FF. ∎

A.3 Proof of Theorem 4.1


Proof.

For jqj\leq q, define lj(q,n)=max{l:j+q(l1)n}l_{j}(q,n)=\max\{l:j+q(l-1)\leq n\}. Also let for each jj the subsequences {Xj+q(l1),l=1,,lj(q,n)}\{X_{j+q(l-1)},l=1,\ldots,l_{j}(q,n)\} be denoted by {Xj,l}.\{X_{j,l}\}. By construction {Xj,l,l=1,,lj(q,n)}\{X_{j,l},l=1,\ldots,l_{j}(q,n)\} are iid with marginal distribution m()m(\cdot). Let j,l=σX1,,Xj,l\mathcal{F}_{j,l}=\sigma\langle X_{1},\ldots,X_{j,l}\rangle denote the σ\sigma-field generated by all XiX_{i}’s up to and including Xj,lX_{j,l}. Also let the marginals mj+q(l1)m_{j+q(l-1)} generated during the iterations be denoted by mj,lm_{j,l} and the weights wj+q(l1)w_{j+q(l-1)} be denoted by wj,lw_{j,l}. From equation (2.1)

KnK0=j=1qSj,nj=1qQj,n+i=1nwiDi+i=1nEi.\displaystyle K_{n}-K_{0}=\sum_{j=1}^{{q}}S_{j,n}-\sum_{j=1}^{q}Q_{j,n}+\sum_{i=1}^{n}w_{i}D_{i}+\sum_{i=1}^{n}E_{i}. (A.2)

Here

Sj,n\displaystyle S_{j,n} =\displaystyle= l=1lj(q,n)wj,lVj,l,\displaystyle\sum_{l=1}^{l_{j}(q,n)}w_{j,l}V_{j,l},
Vj,l\displaystyle V_{j,l} =\displaystyle= (1m(Xj,l)mj+q(l1)1(Xj,l))E[(1m(Xj,l)mj+q(l1)1(Xj,l))|j,l1],\displaystyle(1-\frac{m(X_{j,l})}{m_{j+q(l-1)-1}(X_{j,l})})-E\big{[}(1-\frac{m(X_{j,l})}{m_{j+q(l-1)-1}(X_{j,l})})|\mathcal{F}_{j,l-1}\big{]},
Mj,l\displaystyle M_{j,l} =\displaystyle= E[(1+m(Xj,l)mj,(l1)(Xj,l))|j,l1];l>1 and\displaystyle E[(-1+\frac{m(X_{j,l})}{m_{j,(l-1)}(X_{j,l})})|\mathcal{F}_{j,l-1}];l>1\text{ and }
Mj,l\displaystyle M_{j,l} =\displaystyle= E[(m(Xj,l)mj+q(l1)1(Xj,l)1)|j,l1];l=1,\displaystyle E\big{[}(\frac{m(X_{j,l})}{m_{j+q(l-1)-1}(X_{j,l})}-1)|\mathcal{F}_{j,l-1}];l=1,
Qj,n\displaystyle Q_{j,n} =\displaystyle= l=1lj(q,n)wj,lMj,l,\displaystyle\sum_{l=1}^{l_{j}(q,n)}w_{j,l}M_{j,l},
Di\displaystyle D_{i} =\displaystyle= E[m(Xi)mi1(Xi)(mi1(Xi)miq(Xi)1)|j,l1];i>q and Di=0, for iq,\displaystyle E[\frac{m(X_{i})}{m_{i-1}(X_{i})}(\frac{m_{i-1}(X_{i})}{m_{i-q}(X_{i})}-1)|\mathcal{F}_{j,l-1}];i>q\text{ and }D_{i}=0,\text{ for }i\leq q,
Ei\displaystyle E_{i} =\displaystyle= R(Xi,θ)f(θ)𝑑μ(θ),\displaystyle\int R(X_{i},\theta)f(\theta)d\mu(\theta),

where R(Xi,θ)R(X_{i},\theta) is defined as in (2). We follow the same argument as in the general case; first we show that KnK_{n} converges with probability one and KnK_{n}^{*} converges to zero over some subsequence with probability one. Then we show convergence of KnK_{n}^{*}. As before, we show convergence of j=1qSj,n\sum_{j=1}^{{q}}S_{j,n}, the remainder term Δn=i=1nwiDi\Delta_{n}=\sum_{i=1}^{n}w_{i}D_{i} and the error term Tn=i=1nEiT_{n}=\sum_{i=1}^{n}E_{i}. We simplify some of the expressions first. From (1.5), for i>qi>q

mi1(Xi)miq(Xi)=j=1q1[1+wij(p(Xi|θ)p(Xij|θ)fij1(θ)𝑑μ(θ)mij1(Xi)mij1(Xij)1)].\frac{m_{i-1}(X_{i})}{m_{i-q}(X_{i})}=\prod_{j=1}^{q-1}[1+w_{i-j}(\frac{\int p(X_{i}|\theta)p(X_{i-j}|\theta)f_{i-j-1}(\theta)d\mu(\theta)}{m_{i-j-1}(X_{i})m_{i-j-1}(X_{i-j})}-1)].

We have

|(p(Xi|θ)p(Xij|θ)fij1(θ)𝑑μ(θ)mij1(Xi)mij1(Xij)1)|1+A1(Xij)\displaystyle|(\frac{\int p(X_{i}|\theta)p(X_{i-j}|\theta)f_{i-j-1}(\theta)d\mu(\theta)}{m_{i-j-1}(X_{i})m_{i-j-1}(X_{i-j})}-1)|\leq 1+A_{1}(X_{i-j}) (A.3)

by using triangle inequality and using Proposition 3.1 on p(Xij|θ)mij1(Xij)\frac{p(X_{i-j}|\theta)}{m_{i-j-1}(X_{i-j})}. By Holder’s inequality and assumption A4, we have

E[j=1rA12(Xij)](cunHb/a)2qBE[\prod_{j=1}^{r}A_{1}^{2}(X_{i-j})]\leq(c_{u}n_{H}b/a)^{2q}B

for r=1,,q1.r=1,\ldots,q-1. Thus

E[|1mi1(Xi)miq(Xi)|2]wiq3q(cunHb/a)2qB.E[|1-\frac{m_{i-1}(X_{i})}{m_{i-q}(X_{i})}|^{2}]\leq w_{i-q}3^{q}{(c_{u}n_{H}b/a)^{2q}}B.

Similarly, we can bound E[|m(Xi)mi1(Xi)|2]<(cunHb/a)2B.E[|\frac{m(X_{i})}{m_{i-1}(X_{i})}|^{2}]<(c_{u}n_{H}b/a)^{2}B. By the Cauchy–Schwarz inequality we have E[|wiDi|]Cwiq2E[|w_{i}D_{i}|]\leq Cw_{i-q}^{2} where C>0C>0 is a constant. Hence by Proposition 4, i=1nwiDi\sum_{i=1}^{n}w_{i}D_{i} converges almost surely. Therefore, ΔnΔ\Delta_{n}\rightarrow\Delta_{\infty}, a finite random variable, with probability one. Note that E[|Ei|]2wi2E[(p(Xi|θ)mi1(Xi)1)2]E[|E_{i}|]\leq 2w_{i}^{2}\int E\big{[}\big{(}\frac{p(X_{i}|\theta)}{m_{i-1}(X_{i})}-1\big{)}^{2}\big{]} for ii greater than some positive integer i0i_{0}, as wi<1/2w_{i}<1/2 for large enough ii. Applying A4, we have E[|Ei|]<2(2(cunHb/a)2B+2)wi2E[|E_{i}|]<2(2(c_{u}n_{H}b/a)^{2}B+2)w_{i}^{2} and hence i>i0E[|Ei|]\sum_{i>i_{0}^{\prime}}E[|E_{i}|] converges; hence by Proposition 4, TnT_{n} converges almost surely. Similarly, E[Vj,l2]2(1+nH2B)E[V_{j,l}^{2}]\leq 2(1+n_{H}^{2}B) and Sj,nS_{j,n} is a martingale with filtration Fj,lj(q,n)F_{j,l_{j}(q,n)} with E[Sj,n2]2wj,l2(1+nH2B)<E[S_{j,n}^{2}]\leq\sum 2w_{j,l}^{2}(1+n_{H}^{2}B)<\infty, and hence Sj,nS_{j,n} converges almost surely to a finite random variable as it is a square integrable martingale.

From equation (A.2) we have convergence of j=1qQj,nQ<\sum_{j=1}^{q}Q_{j,n}\to Q_{\infty}<\infty with probability one. This statement holds since the L.H.S. of (A.2) is a fixed quantity subtracted from a positive quantity and j=1qSj,n\sum_{j=1}^{{q}}S_{j,n}, i=1nwiDi\sum_{i=1}^{n}w_{i}D_{i} and i=1nEi\sum_{i=1}^{n}E_{i} converge with probability one. Since Qj,n=l=1lj(q,n)wj,lMj,lQ_{j,n}=\sum_{l=1}^{l_{j}(q,n)}w_{j,l}M_{j,l} and Mj,l>logm(x)miq(x)m(x)ν(dx)=Kiq0M_{j,l}>\int\log\frac{m(x)}{m_{i-q}(x)}m(x)\nu(dx)=K_{{i-q}}^{*}\geq 0 for i>qi>q, no Qj,nQ_{j,n} can diverge to infinity. Moreover, as i=1wj+iq=\sum_{i=1}^{\infty}w_{j+iq}=\infty for j=1,2,,qj=1,2,\dots,q, Mj,lM_{j,l} has to go to zero along some subsequence almost surely.

Next we show convergence of KnK_{n}^{*}. Analogously, replacing ViV_{i} and MiM_{i} by ViV_{i}^{*} and MiM_{i}^{*} from the above derivation, from equation (2.2) we get

KnK0=j=1qSj,nj=1qQj,n+i=1nwiDi+i=1nEi\displaystyle K_{n}^{*}-K_{0}^{*}=\sum_{j=1}^{{q}}S^{*}_{j,n}-\sum_{j=1}^{q}Q^{*}_{j,n}+\sum_{i=1}^{n}w_{i}D^{*}_{i}+\sum_{i=1}^{n}E^{*}_{i} (A.4)

where the Sj,nS_{j,n}^{*} are the martingale sequences, which converge due to square integrability.

Sj,n\displaystyle S^{*}_{j,n} =\displaystyle= l=1lj(q,n)wj,lVj,l,\displaystyle\sum_{l=1}^{l_{j}(q,n)}w_{j,l}V^{*}_{j,l},
Vi\displaystyle V_{i}^{*} =\displaystyle= Vj,l=(1hi,Xi(x)mi1(x)m(x)ν(dx))\displaystyle V^{*}_{j,l}=(1-\int\frac{h_{i,X_{i}}(x)}{m_{i-1}(x)}m(x)\nu(dx))-
E[1χhi,Xi(x)mi1(x)m(x)ν(dx)|j,l1];i=j+q(l1),\displaystyle E[1-\int_{\chi}\frac{h_{i,X_{i}}(x)}{m_{i-1}(x)}m(x)\nu(dx)|\mathcal{F}_{j,l-1}];i=j+q(l-1),
Mi\displaystyle M_{i}^{*} =\displaystyle= E[χhi,Xi(q)(x)mj,l1(x)m(x)ν(dx)1|j,l1],i>q and\displaystyle E[\int_{\chi}\frac{h^{(q)}_{i,X_{i}}(x)}{m_{j,l-1}(x)}m(x)\nu(dx)-1|\mathcal{F}_{j,l-1}],i>q\text{ and }
Mi\displaystyle M_{i}^{*} =\displaystyle= E[χhi,Xi(x)mi1(x)m(x)ν(dx)1|j,l1];iq, for i=j+q(l1) and\displaystyle E[\int_{\chi}\frac{h_{i,X_{i}}(x)}{m_{i-1}(x)}m(x)\nu(dx)-1|\mathcal{F}_{j,l-1}];i\leq q,\text{ for }i=j+q(l-1)\text{ and }
hi,x(q)(x)\displaystyle h^{(q)}_{i,x^{\prime}}(x) =\displaystyle= p(x|θ)p(x|θ)fiq(θ)𝑑μθmiq(x),\displaystyle\frac{\int p(x^{\prime}|\theta)p(x|\theta)f_{i-q}(\theta)d\mu\theta}{m_{i-q}(x^{\prime})},
Qi\displaystyle Q_{i}^{*} =\displaystyle= l=1lj(q,n)wj,lMj,l;Mj,l=Mi for i=j+q(l1),\displaystyle\sum_{l=1}^{l_{j}(q,n)}w_{j,l}M^{*}_{j,l};M^{*}_{j,l}=M^{*}_{i}\text{ for }i=j+q(l-1),
Di\displaystyle D_{i}^{*} =\displaystyle= E[χhi,x(q)(x)mj,l1(x)m(x)ν(dx)1|j,l1]\displaystyle E[\int_{\chi}\frac{h^{(q)}_{i,x^{\prime}}(x)}{m_{j,l-1}(x)}m(x)\nu(dx)-1|\mathcal{F}_{j,l-1}]-
E[χhi,x(x)mi1(x)m(x)ν(dx)1|j,l1];i>q and Di=0;iq,\displaystyle E[\int_{\chi}\frac{h_{i,x^{\prime}}(x)}{m_{i-1}(x)}m(x)\nu(dx)-1|\mathcal{F}_{j,l-1}];i>q\text{ and }D^{*}_{i}=0;i\leq q,
R(Xi,x)\displaystyle R^{*}(X_{i},x) =\displaystyle= wi2(hi,Xi(x)mi1(x)1)2R(wi[hi,Xi(x)mi1(x)1]),\displaystyle w_{i}^{2}(\frac{h_{i,X_{i}}(x)}{m_{i-1}(x)}-1)^{2}R(w_{i}[{h_{i,X_{i}}(x)}{m_{i-1}(x)}-1]),
Ei\displaystyle E_{i}^{*} =\displaystyle= χR(Xi,x)m(x)ν(dx).\displaystyle\int_{\chi}R(X_{i},x)m(x)\nu(dx).

Then for i>qi>q,

E[hi,Xi(q)(x)miq(x)m(x)ν(dx)1|Fiq]=χχ(hi,x(q)(x)miq(x)m(x)m(x)ν(dx)ν(dx)1)\displaystyle E[\int\frac{h_{i,X_{i}}^{(q)}(x)}{m_{i-q}(x)}m(x)\nu(dx)-1|F_{i-q}]=\int_{\chi}\int_{\chi}(\frac{h^{(q)}_{i,x^{\prime}}(x)}{m_{i-q}(x)}m(x)m(x^{\prime})\nu(dx)\nu(dx^{\prime})-1)
=Efiq[(p(x|θ)miq(x)m(x)ν(dx))2]1\displaystyle=E_{f_{i-q}}[\large(\int\frac{p(x|\theta)}{m_{i-q}(x)}m(x)\nu(dx)\large)^{2}]-1
(Efiq[p(x|θ)miq(x)m(x)ν(dx)])21=11=0.\displaystyle\geq\large(E_{f_{i-q}}[\int\frac{p(x|\theta)}{m_{i-q}(x)}m(x)\nu(dx)]\large)^{2}-1=1-1=0. (A.5)

Also i=1nEi\sum_{i=1}^{n}E^{*}_{i} converges, using an argument similar to that for i=1nEi\sum_{i=1}^{n}E_{i}. The proof of martingale square integrability and the convergence of Sj,nS^{*}_{j,n} follows similarly to that of Sj,nS_{j,n}. The convergence of the difference term i=1nwiDi\sum_{i=1}^{n}w_{i}D^{*}_{i} follows similarly to the first part and is given in the next subsection. Hence, we have each Qj,nQ^{*}_{j,n}, j=1,,qj=1,\dots,q, converging to a finite quantity with probability one, as KnK_{n}^{*} is positive. Hence, KnK_{n}^{*} converges with probability one.

From the fact that Ki=KL(m,mi)K_{i}^{*}=KL(m,m_{i}) converges almost surely as ii goes to infinity and converges to zero along a subsequence almost surely, we have that it converges to zero almost surely. Therefore, by arguments given in the proof of Theorem 3.2, FnF_{n} converges weakly to FF with probability one. ∎

A.3.1 Convergence of i=1nEi\sum_{i=1}^{n}E^{*}_{i} and i=1nwiDi\sum_{i=1}^{n}w_{i}D^{*}_{i}


The proof of martingale square integrability and the convergence of the remainder terms i=1nEi\sum_{i=1}^{n}E^{*}_{i} and i=1nwiDi\sum_{i=1}^{n}w_{i}D^{*}_{i} from Theorem 4.1:

Convergence of Sj,nS^{*}_{j,n}: We show that Sj,nS^{*}_{j,n} is a square integrable martingale. Note that hi,Xi(x)mi1(x)A1(Xi)\frac{h_{i,X_{i}}(x)}{m_{i-1}(x)}\leq A_{1}(X_{i}). From A4, E[(hi,Xi(x)mi1(x)m(x)ν(dx))2]<BE[\big{(}\int\frac{h_{i,X_{i}}(x)}{m_{i-1}(x)}m(x)\nu(dx)\big{)}^{2}]<B using Holder’s inequality. Thus, we have E[(Sj,n)2]2wi2(B+2)<E\big{[}(S^{*}_{j,n})^{2}\big{]}\leq\sum 2w_{i}^{2}(B+2)<\infty which proves our claim.

Convergence of Ei\sum E_{i}^{*}:

Note that

E[\sum_{i>i_{0}^{\prime}}|E^{*}_{i}|]\preceq\sum w_{i}^{2}E[\int(\frac{h_{i,X_{i}}(x)}{m_{i-1}(x)})^{2}m(x)d\nu(x)+2]<\infty

from A4 and A1, and from the facts that wi0w_{i}\downarrow 0, wi(χhi,Xi(q)(x)mj,l1(x)m(x)ν(dx)1)>wiw_{i}(\int_{\chi}\frac{h^{(q)}_{i,X_{i}}(x)}{m_{j,l-1}(x)}m(x)\nu(dx)-1)>-w_{i} and E[A1(Xi)2]<E[A_{1}(X_{i})^{2}]<\infty (following A4). Hence, i>i0Ei\sum_{i>i_{0}^{\prime}}E_{i}^{*} and iEi\sum_{i}E_{i}^{*} converge with probability one.

Convergence of wiDi\sum w_{i}D_{i}^{*}: Let,

\Delta(X_{i},x)=w_{i}\Big|\frac{h_{i,X_{i}}^{(q)}(x)}{m_{i-q}(x)}-\frac{h_{i,X_{i}}(x)}{m_{i-1}(x)}\Big|\leq w_{i}\Big\{\int_{\theta}\Big|\frac{p(X_{i}|\theta)p(x|\theta)f_{i-q}(\theta)}{m_{i-1}(X_{i})m_{i-1}(x)}\Big(\frac{m_{i-1}(x)m_{i-1}(X_{i})}{m_{i-q}(X_{i})m_{i-q}(x)}-1\Big)\Big|d\mu(\theta)+\int_{\theta}\Big|\frac{p(X_{i}|\theta)p(x|\theta)}{m_{i-1}(X_{i})m_{i-1}(x)}f_{i-q}(\theta)\Big(\frac{f_{i-1}(\theta)}{f_{i-q}(\theta)}-1\Big)\Big|d\mu(\theta)\Big\}

and Δ(Xi,x)m(x)ν(dx)=Δ(Xi)\int\Delta(X_{i},x)m(x)\nu(dx)=\Delta(X_{i}). From the fact

mi1(Xi)miq(Xi)=j=1q1[1+wij(p(Xi|θ)p(Xij|θ)fij1(θ)𝑑μ(θ)mij1(Xi)mij1(Xij)1)]\frac{m_{i-1}(X_{i})}{m_{i-q}(X_{i})}=\prod_{j=1}^{q-1}[1+w_{i-j}(\frac{\int p(X_{i}|\theta)p(X_{i-j}|\theta)f_{i-j-1}(\theta)d\mu(\theta)}{m_{i-j-1}(X_{i})m_{i-j-1}(X_{i-j})}-1)]

and we have

|(p(Xi|θ)p(Xij|θ)fij1(θ)𝑑μ(θ)mij1(Xi)mij1(Xij)1)|1+A1(Xi).\displaystyle|(\frac{\int p(X_{i}|\theta)p(X_{i-j}|\theta)f_{i-j-1}(\theta)d\mu(\theta)}{m_{i-j-1}(X_{i})m_{i-j-1}(X_{i-j})}-1)|\leq 1+A_{1}(X_{i}).

Similarly

p(Xi|θ)p(x|θ)mi1(Xi)mi1(x)A1(Xi)A1(x).\frac{p(X_{i}|\theta)p(x|\theta)}{m_{i-1}(X_{i})m_{i-1}(x)}\leq A_{1}(X_{i})A_{1}(x).

The part |mi1(Xi)mi1(x)miq(Xi)miq(x)1||\frac{m_{i-1}(X_{i})m_{i-1}(x)}{m_{i-q}(X_{i})m_{i-q}(x)}-1| of the R.H.S. for Δ(Xi,x)\Delta(X_{i},x) can be bounded by a sum of qq^{\prime}-fold products, 1q2q21\leq q^{\prime}\leq 2q-2, of (1+A1(Xi))(1+A_{1}(X_{i})) and (1+A1(x))(1+A_{1}(x))’s, multiplied by a coefficient less than wiqqw_{i-q}^{q^{\prime}}. Similarly, for |fi1(θ)fiq(θ)1||\frac{f_{i-1}(\theta)}{f_{i-q}(\theta)}-1| we have qq^{\prime}-fold products of the 1+A1(Xi)1+A_{1}(X_{i})’s with 1qq11\leq q^{\prime}\leq q-1. Hence, integrating with respect to m(x)m(x) and taking expectations, we have a universal bound for any such product term from Hölder’s inequality. Hence, E[Δ(Xi)]Buwiq2E[\Delta(X_{i})]\leq B_{u}w_{i-q}^{2} for i>qi>q, for some universal constant Bu>0B_{u}>0. As E(i|wiDi|)iwi2<E\big{(}\sum_{i}|w_{i}D_{i}^{*}|\big{)}\preceq\sum_{i}w_{i}^{2}<\infty, wiDi\sum w_{i}D_{i}^{*} converges absolutely with probability one. Therefore, it converges with probability one.

A.4 Proof of Theorem 4.2

Proof.

From the derivation of equation (A.2) using f~(θ)\tilde{f}(\theta) instead of f(θ)f(\theta), i.e., writing logfifi1=logfif~logfi1f~\log\frac{f_{i}}{f_{i-1}}=\log\frac{f_{i}}{\tilde{f}}-\log\frac{f_{i-1}}{\tilde{f}}, we have, for K~n=KL(f~,fn)\tilde{K}_{n}=KL(\tilde{f},f_{n}),

K~nK~0=j=1qS~j,nj=1qQ~j,n+i=1nwiD~i+i=1nE~i.\displaystyle\tilde{K}_{n}-\tilde{K}_{0}=\sum_{j=1}^{{q}}\tilde{S}_{j,n}-\sum_{j=1}^{q}\tilde{Q}_{j,n}+\sum_{i=1}^{n}w_{i}\tilde{D}_{i}+\sum_{i=1}^{n}\tilde{E}_{i}. (A.6)

Here

S~j,n\displaystyle\tilde{S}_{j,n} =\displaystyle= l=1lj(q,n)wj,lV~j,l,\displaystyle\sum_{l=1}^{l_{j}(q,n)}w_{j,l}\tilde{V}_{j,l},
V~j,l\displaystyle\tilde{V}_{j,l} =\displaystyle= (1m~(Xj,l)mj+q(l1)1(Xj,l))E((1m~(Xj,l)mj+q(l1)1(Xj,l))|j,l1),\displaystyle(1-\frac{\tilde{m}(X_{j,l})}{m_{j+q(l-1)-1}(X_{j,l})})-E\big{(}(1-\frac{\tilde{m}(X_{j,l})}{m_{j+q(l-1)-1}(X_{j,l})})|\mathcal{F}_{j,l-1}\big{)},
M~j,l\displaystyle\tilde{M}_{j,l} =\displaystyle= E((1+m~(Xj,l)mj,l1(Xj,l))|j,l1);l>1\displaystyle E((-1+\frac{\tilde{m}(X_{j,l})}{m_{j,l-1}(X_{j,l})})|\mathcal{F}_{j,l-1});l>1
and M~j,l\displaystyle\text{ and }\tilde{M}_{j,l} =\displaystyle= E((1+m~(Xj,l)mj+q(l1)1(Xj,l))|j,l1) for l=1\displaystyle E((-1+\frac{\tilde{m}(X_{j,l})}{m_{j+q(l-1)-1}(X_{j,l})})|\mathcal{F}_{j,l-1})\text{ for }l=1
Q~j,n\displaystyle\tilde{Q}_{j,n} =\displaystyle= l=1lj(q,n)wj,lM~j,l,\displaystyle\sum_{l=1}^{l_{j}(q,n)}w_{j,l}\tilde{M}_{j,l},
D~i\displaystyle\tilde{D}_{i} =\displaystyle= m~(Xi)mi1(Xi)(mi1(Xi)miq(Xi)1);i>q, and D~i=0 for iq,\displaystyle\frac{\tilde{m}(X_{i})}{m_{i-1}(X_{i})}(\frac{m_{i-1}(X_{i})}{m_{i-q}(X_{i})}-1);i>q,\text{ and }\tilde{D}_{i}=0\text{ for }i\leq q,
E~i\displaystyle\tilde{E}_{i} =\displaystyle= R(Xi,θ)f~(θ)𝑑μ(θ),\displaystyle\int R(X_{i},\theta)\tilde{f}(\theta)d\mu(\theta),

For i>qi>q, M~j,llogm~(x)mj,l1(x)m(x)ν(dx)=KL(m,mj,l1)KL(m,m~)=KL(m,mj,l1)k~0\tilde{M}_{j,l}\geq\int\log\frac{\tilde{m}(x)}{m_{j,l-1}(x)}m(x)\nu(dx)=KL(m,m_{j,l-1})-KL(m,\tilde{m})=KL(m,m_{j,l-1})-\tilde{k}\geq 0. Convergence of S~j,n\tilde{S}_{j,n}, D~i\sum\tilde{D}_{i}, E~i\sum\tilde{E}_{i} follows from the proof of Theorem 4.1. Thus, similarly, each Q~j,n\tilde{Q}_{j,n} has to converge, and hence KL(m,mj,l1)k~KL(m,m_{j,l-1})-\tilde{k} converges to zero along some subsequence with probability one. Convergence of KL(m,mi)KL(m,m_{i}) follows from the proof of Theorem 4.1, which completes the proof. ∎

A.5 Proof of Theorem 3.3

Proof.

From the proof of Theorem 3.1, we decompose logfifi1=logfif~logfi1f~\log\frac{f_{i}}{f_{i-1}}=\log\frac{f_{i}}{\tilde{f}}-\log\frac{f_{i-1}}{\tilde{f}}, and have an equation analogous to (3.3)

K~nK~0=j=1LnS~v,j(n)j=1LnS~M,j(n)+i=1nS~iΔ+i=1nE~i\displaystyle\tilde{K}_{n}-\tilde{K}_{0}=\sum_{j=1}^{{L_{n}}}\tilde{S}_{v,j}^{(n)}-\sum_{j=1}^{L_{n}}\tilde{S}_{M,j}^{(n)}+\sum_{i=1}^{n}\tilde{S}^{\Delta}_{i}+\sum_{i=1}^{n}\tilde{E}_{i} (A.7)

which is derived by using f~(θ)\tilde{f}(\theta) instead of f(θ)f(\theta). Analogous to the decomposition of MiM_{i}^{*} in the proof of Theorem 3.2, we have,

M~i~=(m~(x)milast(x)1)m(x)ν(dx)+(m~(x)milast(x)1).(pi,last(x)m(x)1)m(x)ν(dx).\displaystyle\tilde{M}_{\tilde{i}}=\int(\frac{\tilde{m}(x)}{{m}_{{i}_{last}}(x)}-1)m(x)\nu(dx)+\int(\frac{\tilde{m}(x)}{{m}_{{i}_{last}}(x)}-1).(\frac{p_{{i},last}(x)}{m(x)}-1)m(x)\nu(dx).

Using arguments similar to those for Theorem 4.2, KL(m,mi)KL(m,m~)KL(m,{m}_{i})-KL(m,\tilde{m}) goes to zero along some subsequence with probability one. Convergence of KL(m,mi)KL(m,m_{i}) follows essentially the same proof as in Theorem 3.2. Together these complete the proof. ∎

A.6 Proof of Theorem 4.3

Proof.

Since wi1wi=O(1)\frac{w_{i-1}}{w_{i}}=O(1), from the proof of Theorem 4.1 we can conclude that c1wiK~iqiwiq+1K~iqcwiK~iq<c_{1}^{\prime}\sum w_{i}\tilde{K}^{*}_{i-q}\leq\sum_{i}w_{i-q+1}\tilde{K}^{*}_{i-q}\leq c^{\prime}\sum w_{i}\tilde{K}^{*}_{i-q}<\infty with probability one, where c>0,c1>0c^{\prime}>0,c^{\prime}_{1}>0 and the sums run over i=q+jl,l=1,2,i=q+jl,l=1,2,\dots.

Let an=i=1nwia_{n}=\sum_{i=1}^{n}w_{i}, a0=0a_{0}=0. Hence, for i=j+qli=j+ql, under the setting of Theorem 4.2,

aiK~i=ai1K~i1+wiK~i1+aiwiVj,laiwiMj,l+aiwiDi+aiEi\displaystyle a_{i}\tilde{K}_{i}^{*}=a_{i-1}\tilde{K}_{i-1}^{*}+w_{i}\tilde{K}^{*}_{i-1}+a_{i}w_{i}V_{j,l}^{*}-a_{i}w_{i}M^{*}_{j,l}+a_{i}w_{i}D_{i}^{*}+a_{i}E_{i}^{*}
(A.8)

and

anK~na0K~0=wiK~i1+j=1ql=1lj(q,n)aiwiVj,lj=1ql=1lj(q,n)aiwiMj,l+\displaystyle a_{n}\tilde{K}_{n}^{*}-a_{0}\tilde{K}^{*}_{0}=\sum w_{i}\tilde{K}^{*}_{i-1}+\sum_{j=1}^{q}\sum_{l=1}^{l_{j}(q,n)}a_{i}w_{i}V_{j,l}^{*}-\sum_{j=1}^{q}\sum_{l=1}^{l_{j}(q,n)}a_{i}w_{i}M^{*}_{j,l}+
i=1naiwiDi+i=1naiEi.\displaystyle\sum_{i=1}^{n}a_{i}w_{i}D_{i}^{*}+\sum_{i=1}^{n}a_{i}E_{i}^{*}. (A.9)

Let wi=iα,α(3/4,1)w_{i}=i^{-\alpha},\alpha\in(3/4,1), so that aii(α1)a_{i}\sim i^{-(\alpha-1)} and wi=aiwii(2α1)w_{i}^{\prime}=a_{i}w_{i}\sim i^{-(2\alpha-1)}. For the MM-dependent case, from the proof of Theorem 4.1, writing wiw_{i}^{\prime} instead of wiw_{i} we can show the almost sure convergence of Sj,lS^{*}_{j,l}. Convergence of aiwiDi\sum a_{i}w_{i}D_{i}^{*} and iaiEi\sum_{i}a_{i}E_{i}^{*} also follows in a similar fashion. From the fact that wi1wi=O(1)\frac{w_{i-1}}{w_{i}}=O(1), the term iwiK~i1\sum_{i}w_{i}\tilde{K}^{*}_{i-1} converges with probability one. Hence, following A1, we have j=1ql=1lj(q,n)aiwiMj,l\sum_{j=1}^{q}\sum_{l=1}^{l_{j}(q,n)}a_{i}w_{i}M^{*}_{j,l} converging and Mj,lM^{*}_{j,l} going to zero along a subsequence with probability one for all jj as ll goes to infinity, and anK~na_{n}\tilde{K}_{n}^{*} converging with probability one.

We have anK~na_{n}\tilde{K}^{*}_{n} converging to a finite number for each ω\omega outside a set of probability zero, and ann1αa_{n}\sim n^{1-\alpha}; hence n1γK~nn^{1-\gamma}\tilde{K}_{n}^{*} goes to zero with probability one for γ>α\gamma>\alpha. The conclusion about the Hellinger distance follows from the relation between the KL and Hellinger distances, by arguments as in the proof of Theorem 3.4. ∎

References

  • Newton and Zhang [1999] Michael A Newton and Yunlei Zhang. A recursive algorithm for nonparametric analysis with missing data. Biometrika, 86(1):15–26, 1999.
  • Tokdar et al. [2009] Surya T Tokdar, Ryan Martin, and Jayanta K Ghosh. Consistency of a recursive estimate of mixing distributions. The Annals of Statistics, 37(5A):2502–2522, 2009.
  • Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
  • Chung [1954] Kai Lai Chung. On a stochastic approximation method. The Annals of Mathematical Statistics, 25(3):463–483, 1954.
  • Fabian et al. [1968] Vaclav Fabian et al. On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, 39(4):1327–1332, 1968.
  • Polyak and Juditsky [1992] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
  • Sharia [2014] Teo Sharia. Truncated stochastic approximation with moving bounds: convergence. Statistical Inference for Stochastic Processes, 17(2):163–179, 2014.
  • Kushner [2010] Harold Kushner. Stochastic approximation: a survey. Wiley Interdisciplinary Reviews: Computational Statistics, 2(1):87–96, 2010.
  • Ghosh and Tokdar [2006] Jayanta K Ghosh and Surya T Tokdar. Convergence and consistency of Newton’s algorithm for estimating mixing distribution. In Frontiers in Statistics, pages 429–443. World Scientific, 2006.
  • Martin and Ghosh [2008] Ryan Martin and Jayanta K Ghosh. Stochastic approximation and Newton’s estimate of a mixing distribution. Statistical Science, 23(3):365–382, 2008.
  • Martin and Tokdar [2009] Ryan Martin and Surya T Tokdar. Asymptotic properties of predictive recursion: robustness and rate of convergence. Electronic Journal of Statistics, 3:1455–1472, 2009.
  • Kleijn and van der Vaart [2006] Bas JK Kleijn and Aad W van der Vaart. Misspecification in infinite-dimensional Bayesian statistics. The Annals of Statistics, 34(2):837–877, 2006.
  • Hahn et al. [2018] P Richard Hahn, Ryan Martin, and Stephen G Walker. On recursive Bayesian predictive distributions. Journal of the American Statistical Association, 113(523):1085–1093, 2018.
  • Martin and Han [2016] Ryan Martin and Zhen Han. A semiparametric scale-mixture regression model and predictive recursion maximum likelihood. Computational Statistics & Data Analysis, 94:75–85, 2016.
  • Ghosal et al. [1999] Subhashis Ghosal, Jayanta K Ghosh, and RV Ramamoorthi. Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics, 27(1):143–158, 1999.
  • Lijoi et al. [2005] Antonio Lijoi, Igor Prünster, and Stephen G Walker. On consistency of nonparametric normal mixtures for Bayesian density estimation. Journal of the American Statistical Association, 100(472):1292–1296, 2005.
  • Robbins and Siegmund [1971] Herbert Robbins and David Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Optimizing Methods in Statistics, pages 233–257. Elsevier, 1971.
  • Durrett [2019] Rick Durrett. Probability: Theory and Examples, volume 49. Cambridge University Press, 2019.
  • Patilea [2001] Valentin Patilea. Convex models, MLE and misspecification. The Annals of Statistics, pages 94–123, 2001.
  • Martin [2012] Ryan Martin. Convergence rate for predictive recursion estimation of finite mixtures. Statistics & Probability Letters, 82(2):378–384, 2012.
  • Ghosal and Van Der Vaart [2001] Subhashis Ghosal and Aad W Van Der Vaart. Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. The Annals of Statistics, 29(5):1233–1263, 2001.
  • Lai [2003] Tze Leung Lai. Stochastic approximation. The Annals of Statistics, 31(2):391–406, 2003.
  • Genovese and Wasserman [2000] Christopher R Genovese and Larry Wasserman. Rates of convergence for the Gaussian mixture sieve. The Annals of Statistics, 28(4):1105–1127, 2000.
  • Freedman [1962a] David A Freedman. Mixtures of Markov processes. The Annals of Mathematical Statistics, 33(1):114–118, 1962a.
  • Freedman [1962b] David A Freedman. Invariants under mixing which generalize de Finetti’s theorem. The Annals of Mathematical Statistics, 33(3):916–923, 1962b.
  • Diaconis and Freedman [1980] Persi Diaconis and David Freedman. De Finetti’s theorem for Markov chains. The Annals of Probability, pages 115–130, 1980.