Stochastic approximation algorithm for estimating mixing distribution for dependent observations
Abstract
Estimating the mixing density of a mixture distribution remains an interesting problem in statistics. Using a stochastic approximation method, Newton and Zhang [1] introduced a fast recursive algorithm for estimating the mixing density of a mixture. Under suitably chosen weights the stochastic approximation estimator converges to the true solution. In Tokdar et al. [2] the consistency of this recursive estimation method was established. However, the existing consistency results for the recursive estimator assume independence among the observations. We extend the investigation of the performance of Newton's algorithm to several dependent scenarios. We prove that, under certain conditions, the original algorithm remains consistent even when the observations arise from a weakly dependent stationary process with the target mixture as the marginal density. We show consistency under a decay condition on the dependence among observations, where the dependence is characterized by a quantity similar to the mutual information between the observations.
1 Introduction
Stochastic approximation (SA) algorithms, which are stochastic optimization techniques based on recursive updates, have many applications in optimization problems arising in different fields such as engineering and machine learning. A classical and pioneering example of SA can be found in Robbins and Monro [3], where a recursive method is introduced for finding the root of a function. In particular, suppose $M$ is a non-increasing function whose values are observed with errors; that is, at a point $x_n$ we observe $y_n = M(x_n) + \varepsilon_n$, where the $\varepsilon_n$'s are i.i.d. errors with mean zero and finite variance. The Robbins–Monro stochastic algorithm recursively approximates the root $\theta$ of $M(x) = 0$ as
(1.1)   $x_{n+1} \;=\; x_n + w_n\, y_n,$
where $\{w_n\}$ is a sequence of weights satisfying $\sum_n w_n = \infty$ and $\sum_n w_n^2 < \infty$. Under (1.1), the sequence $x_n$ approaches the true root $\theta$. Subsequent developments include the rate of convergence, optimal step sizes, convergence under convexity, etc. (see Chung [4], Fabian et al. [5], Polyak and Juditsky [6]). Also see Sharia [7], Kushner [8] for recent developments such as adaptively truncating the solution to some domain. More generally, for a predictable sequence of step sizes, the recursive solution is the limit of the recursion
(1.2)   $Z_n \;=\; Z_{n-1} + \Gamma_n(Z_{n-1})\,\bigl[R(Z_{n-1}) + \varepsilon_n\bigr],$
where $Z_0$ is the initial value, $\Gamma_n(\cdot)$ is a possibly state-dependent matrix of step sizes, $R(\cdot)$ is the mean field of the recursion, and $\{\varepsilon_n\}$ is a mean zero random process. The idea of stochastic approximation was cleverly used in a predictive recursion (PR) algorithm in Newton and Zhang [1] for finding the mixing component of a mixture distribution.
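To fix ideas, here is a minimal numerical sketch of the Robbins–Monro recursion (1.1); the target function, the noise distribution, and the weight sequence $w_n = 1/n$ are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def robbins_monro(noisy_m, x0=0.0, n_iter=5000):
    """Robbins-Monro recursion for the root of a non-increasing function M,
    observed only through noisy evaluations y_n = M(x_n) + eps_n."""
    x = x0
    for n in range(1, n_iter + 1):
        w_n = 1.0 / n            # weights with sum w_n = infinity, sum w_n^2 < infinity
        y_n = noisy_m(x)         # noisy observation of M at the current iterate
        x = x + w_n * y_n        # move upward when y_n > 0, since M is non-increasing
    return x

# Example: M(x) = 2 - x (non-increasing, root at x = 2), observed with N(0, 1) noise.
print(robbins_monro(lambda x: (2.0 - x) + rng.normal()))   # close to 2 for large n_iter
```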
Mixture models are popular statistical models that provide a nice compromise between the flexibility offered by nonparametric approaches and the efficiency of parametric approaches. Mixture models are increasingly used in messy data situations where adequate fit is not obtained using conventional parametric models. Many algorithms, mostly variants of expectation maximization (EM) algorithm and Markov chain Monte Carlo (MCMC) algorithms, are currently available for fitting mixture models to the data. Specifically, these algorithms would fit a marginal model of the form
(1.3)   $m(x) \;=\; \int_{\Theta} k(x \mid \theta)\, f(\theta)\, d\mu(\theta)$
to the data, assuming a form of the mixing kernel $k(x \mid \theta)$. A related problem is recovering (estimating) the mixing distribution. The problem of estimating the mixing density $f$ (with respect to some dominating measure $\mu$) in a nonparametric setup can be a challenging exercise and can be numerically taxing when full likelihood procedures (such as the nonparametric MLE or nonparametric Bayes) are used. However, the full likelihood procedures generally enjoy desirable large sample properties, such as consistency.
Let $X$ be a random variable which, given the latent variable $\theta$, has conditional density $k(x \mid \theta)$, and let $f$ be the mixing density of $\theta$. Let $\mu$ and $\lambda$ be the sigma-finite measures dominating the distributions of $\theta$ and $X$, respectively. The recursive estimation algorithm of Newton and Zhang [1] for estimating $f$ then starts with some initial estimate $f_0$ which has the same support as the true mixing density $f^{*}$. The update with each new observation $X_n$ is given by
(1.4)   $f_n(\theta) \;=\; (1 - w_n)\, f_{n-1}(\theta) \;+\; w_n\, \dfrac{k(X_n \mid \theta)\, f_{n-1}(\theta)}{m_{n-1}(X_n)},$
where $m_{n-1}(x) = \int_{\Theta} k(x \mid u)\, f_{n-1}(u)\, d\mu(u)$ is the marginal density at the $n$th iteration based on the mixing density obtained at the previous iteration, and $\{w_n\} \subset (0,1)$ is a sequence of weights.
From equation (1.4), computing the marginals on both sides, we obtain the associated iteration for the marginal densities:
(1.5)   $m_n(x) \;=\; (1 - w_n)\, m_{n-1}(x) \;+\; w_n\, \dfrac{\int_{\Theta} k(x \mid \theta)\, k(X_n \mid \theta)\, f_{n-1}(\theta)\, d\mu(\theta)}{m_{n-1}(X_n)}.$
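As a concrete illustration of the update (1.4), the recursion on a finite (or discretized) support takes only a few lines. The sketch below is ours; the normal location kernel, the weights $w_n = 1/(n+1)$, and the two-point example are assumptions for the illustration, not the settings used in the numerical examples later in the paper.

```python
import numpy as np

def predictive_recursion(x_obs, support, kernel, f0=None, c=1.0):
    """Newton's predictive recursion (1.4) over a finite grid of mixing support points.

    support : grid of theta values
    kernel  : kernel(x, theta) -> k(x | theta), vectorized over theta
    f0      : initial mixing weights on the grid (defaults to uniform)
    The weights w_n = c / (n + 1) satisfy the usual summability conditions.
    """
    f = np.full(len(support), 1.0 / len(support)) if f0 is None else f0.copy()
    for n, x in enumerate(x_obs, start=1):
        w_n = c / (n + 1.0)
        k_vals = kernel(x, support)        # k(X_n | theta) on the grid
        m_prev = np.sum(k_vals * f)        # marginal m_{n-1}(X_n) under the current estimate
        f = (1.0 - w_n) * f + w_n * k_vals * f / m_prev
    return f                               # estimated mixing weights

# Example: independent draws from the two-point mean mixture 0.3 N(-1, 1) + 0.7 N(1, 1).
rng = np.random.default_rng(1)
theta_true = rng.choice([-1.0, 1.0], size=2000, p=[0.3, 0.7])
x = theta_true + rng.normal(size=2000)
normal_kernel = lambda x, t: np.exp(-0.5 * (x - t) ** 2) / np.sqrt(2 * np.pi)
print(predictive_recursion(x, np.array([-1.0, 1.0]), normal_kernel))   # approx [0.3, 0.7]
```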
One way to connect the PR update to an SA algorithm is to minimize the Kullback–Leibler (KL) distance between the proposed marginal and the true marginal and to consider the minimizer of the corresponding Lagrangian (Ghosh and Tokdar [9]; Martin and Ghosh [10]). The support of the mixing density can be misspecified. In such cases the PR estimate can be shown to concentrate around the density in the space of proposed mixture densities that is nearest in KL distance to the true marginal (Martin and Tokdar [11]). Similar results can be found for general posterior consistency, such as the developments in Kleijn and van der Vaart [12], where posterior consistency under misspecification was shown under conditions like convexity.
The PR algorithm is computationally much faster than the full likelihood methods. Ghosh and Tokdar [9], and later Tokdar et al. [2], established the consistency of Newton's predictive recursion algorithm, thereby putting the PR algorithm on solid theoretical footing. Some subsequent developments on the rate of convergence can be found in Martin and Ghosh [10], Martin and Tokdar [11]. Later developments focus on applications to semiparametric models and on predictive distribution calculation (Hahn et al. [13], Martin and Han [14]). As in the proof for the original SA algorithm, the proof of consistency of the recursive estimate of the mixing distribution depends on a martingale difference sum construction. Because the algorithm depends on the order of the observations, the resulting estimator is not a function of sufficient statistics, and hence the consistency of the PR solution cannot be deduced from the consistency of frequentist and Bayesian methods (e.g., using DP mixture models; see Ghosal et al. [15], Lijoi et al. [16]). In Ghosh and Tokdar [9] and Tokdar et al. [2] the martingale based method from Robbins and Monro [3] was adapted to the density estimation setting in a novel way to show almost sure convergence of the estimator in the weak topology and in Kullback–Leibler (KL) divergence. The PR solution was shown to be the Kullback–Leibler divergence minimizer between the true marginal and the set of proposed marginals.
One of the key assumptions in the existing proof of consistency is that the observations are independent. This assumption significantly limits the scope of application of the PR algorithm, since naturally occurring dependence may be present, for example in mixtures of Markov processes. The main result of this paper is that the predictive recursion continues to provide a consistent solution even under weakly dependent stationary processes, as long as the dependence decays reasonably fast. This vanishing dependence can be quantified by information-theoretic quantities such as the mutual information between the marginal and the conditional densities of the process. We use a novel sub-sequence argument to tackle the dependence among the observations and prove consistency of the PR algorithm when such vanishing dependence is present. At the same time we derive a bound on the convergence rate of the PR algorithm under such dependence. As a special case, we later consider $m$-dependent processes, where the consistency of the recursive estimator holds under weaker conditions. In all the cases we also investigate convergence under misspecification of the support of the mixing density and concentration around the KL projection onto the misspecified model.
The arrangement of this article is the following. In Section 2, we provide the background and the basic framework for the martingale based argument. Section 3 presents the main results regarding convergence and the rates for the weakly dependent cases, and addresses the special case of mixtures of Markov processes. In Section 4, we consider the special case of $m$-dependent sequences.
2 Preliminaries and revisiting the independent case
Our notation and initial framework will follow the original martingale type argument of Robbins and Monro [3] and the developments for the independent case in the literature, especially in Ghosh and Tokdar [9] and Tokdar et al. [2]. Thus, we first introduce the notation and revisit the main techniques used in the proof of consistency of the PR estimator in Tokdar et al. [2]. The discussion will also illustrate the need for generalization of the techniques to the dependent case.
A recursive formulation using the KL divergence along with a martingale based decomposition was used in Tokdar et al. [2], who established convergence of the recursive estimators of the mixing density and of the marginal density. Specifically, if $K_n$ denotes the KL divergence between the true mixing density $f^{*}$ and the estimate $f_n$, and $\mathcal{F}_n$ is the $\sigma$-algebra generated by the first $n$ observations, then the following recursion can be established:
(2.1) |
where
and is defined through the relation for The remainder term satisfies . The corresponding similar recursion for the KL divergence of the marginal densities, , is then
(2.2) |
where
It is assumed that $K_0$ and the corresponding KL divergence between the marginals, both determined by the initial estimate $f_0$, are finite. The main idea in Tokdar et al. [2] was to recognize that the martingale terms are mean zero square integrable martingales and hence almost surely convergent, the conditional mean terms are positive, and the remainder sums have finite limits almost surely. Putting these facts together, Tokdar et al. [2] established that the KL divergence sequence necessarily converges to zero almost surely, thereby establishing consistency of the predictive recursion sequences for the mixing and the marginal densities in the weak topology. Using developments in Robbins and Siegmund [17] it can be argued that the PR solution converges to the KL minimizer if the support is misspecified.
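For reference, the Robbins–Siegmund almost-supermartingale lemma [17] that underlies these convergence arguments can be stated as follows (standard form; the notation here is ours and not the paper's). Let $V_n, a_n, b_n, c_n$ be nonnegative $\mathcal{F}_n$-measurable random variables with $\sum_n a_n < \infty$ and $\sum_n b_n < \infty$ almost surely. If
$$\mathbb{E}\bigl[V_{n+1}\mid \mathcal{F}_n\bigr] \;\le\; (1+a_n)\,V_n + b_n - c_n,$$
then $V_n$ converges almost surely to a finite limit and $\sum_n c_n < \infty$ almost surely.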
Hence, the theoretical results regarding convergence of such PR algorithms in the current literature rely on independence of the observations, and also assume that the true marginal at each step is the same and equal to the target mixture. There does not seem to be any obvious way of extending the proof to the case when the observations are not independent.
We propose a proof of convergence of the predictive recursion algorithm in the case when the observations are dependent. While the proof will use a martingale construction similar to that of Tokdar et al. [2], there are significant differences in the approach that allow us to address the case of dependent observations, and the tools required to handle the dependence are detailed in the next section. The main contributions of the paper are summarized in the following:
-
1.
We show that the PR algorithm continues to be consistent under weakly dependent processes where the marginal stationary distribution at each is a mixture with respect to a fixed kernel. The consistency is obtained under additional conditions on the kernel and the parameter space. If the dependence decays exponentially, under additional conditions, the original PR estimate is shown to be consistent. The decaying dependence is characterized by a special case of expected divergence between marginal and conditional densities.
-
2.
We establish the convergence rate when the support is correctly specified or misspecified, under decaying dependence.
-
3.
We show the result for finite mixtures of Markov processes and general $m$-dependent processes under milder conditions.
The next section describes the main results as well as the tools needed for the proof of consistency in the dependent case, which uses a novel sub-indexing argument.
3 Main results
To establish consistency of the PR algorithm when observations are dependent, we will need to control the expectation of ratios of -fold products of densities at two different parameter values. Suppose the parameter lies in a compact subset of a Euclidean space. Let denote a closed convex set containing and assume is well defined for . Assume there is a finite set (typically will be the extreme points when is a convex polytope) and a compact set such that the following holds.
-
C1. For any there exists such that , for some . This condition is satisfied with for and .
-
C2. There exists such that .
-
C3. There exists such that .
Without loss of generality, we can assume that . Under C1–C3, the ratio of the marginals can be bounded by the ratios of conditionals on the finite set and a function of and . For many common problems, such as location mixtures of normals, a finite set exists that satisfies the assumption. This condition can be seen as a generalization of the monotone likelihood ratio property in higher dimensions. The condition is satisfied in a general multivariate normal mean mixture with known covariance matrix, where the mean parameter is constrained to a set in , contained in a large closed convex polytope , and the set then consists of suitably selected points on the boundary of and .
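To illustrate how a condition like C1 can be verified, consider (as an assumed one-dimensional example, with notation of our choosing) the standard normal location kernel $k(x\mid\theta)=\phi(x-\theta)$ with $\theta$ restricted to a compact interval $[a,b]$ and the finite set taken as $\{a,b\}$. Since $k(x\mid\theta)/k(x\mid\theta') = \exp\{x(\theta-\theta') - (\theta^2-\theta'^2)/2\}$ and $t\mapsto e^{xt}$ is monotone with $\theta-\theta'\in[a-b,\,b-a]$,
$$\frac{k(x\mid\theta)}{k(x\mid\theta')} \;\le\; C_{a,b}\left(\frac{k(x\mid b)}{k(x\mid a)} + \frac{k(x\mid a)}{k(x\mid b)}\right), \qquad C_{a,b} = e^{\max(a^2,b^2)/2 + |b^2-a^2|/2},$$
so the ratio at arbitrary parameter values is controlled by the ratios at the two boundary points, which is the content of a condition of the type C1.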
Proposition 1.
Under C1–C3, for any two distributions and on compact , the ratio of the marginals under mixing densities and can be bounded as
(3.1) |
Proof.
Given in the Appendix. ∎
Remark 1.
For finite the result in Proposition 3.1 holds trivially using in place of .
When the observations have a fixed marginal density, the PR algorithm can still be applied in spite of dependence, provided the dependence decreases rapidly with distance along the sequential order in which the observations enter the algorithm. We assume the following mixing-type condition for the sequence. The decaying dependence is expressed as a special case of an expected divergence (expected distance) between the marginal and the conditional densities.
Let and let be the conditional density/pmf of given . We assume,
(3.2) |
where and .
Let denote the mutual information between and and let . The condition given in (3.2) implies exponential decay of mutual information . When the conditional densities are uniformly bounded and uniformly bounded away from zero, (3.2) is satisfied if decays exponentially. This result can be summarized in the following proposition.
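One concrete condition of the kind described above, in which the expected $L_1$ distance between the conditional density and the stationary marginal decays geometrically in the gap, is (the $L_1$ choice and the constants $M>0$, $0<\rho<1$ are illustrative assumptions on our part)
$$\mathbb{E}\left[\int \bigl| f\bigl(x \mid X_1,\dots,X_n\bigr) - f^{*}(x)\bigr|\, d\lambda(x)\right] \;\le\; M\,\rho^{\,k} \qquad \text{for all } n\ge 1,\ k\ge 1,$$
where $f(\cdot\mid X_1,\dots,X_n)$ denotes the conditional density of $X_{n+k}$ given the past and $f^{*}$ is the stationary marginal. Mutual information enters through the identity $I\bigl(X_{n+k};(X_1,\dots,X_n)\bigr) = \mathbb{E}\bigl[\mathrm{KL}\bigl(f(\cdot\mid X_1,\dots,X_n)\,\|\,f^{*}\bigr)\bigr]$, and when the conditional densities are uniformly bounded and bounded away from zero the $L_1$ and KL distances decay at comparable rates (cf. Proposition 2).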
Proposition 2.
Proof.
The result follows from the relationship between -divergence and mutual information and is omitted. ∎
In addition, we assume the following conditions which are similar to those in Tokdar et al. [2]:
-
B1
, and , .
-
B2
For ’s for , for some and where and are distinct, and . Assume, without loss of generality.
-
B3
-
B4
The map is bounded and continuous for .
Condition B2 is needed for controlling fold products over different indices in the dependent case. This condition will later be verified for some of the examples considered. Condition B2 can be omitted if a stricter moment condition is assumed, which is given later. We can now state our main results. Theorem 3.1 shows consistency when the support of the mixing density is finite, while Theorem 3.2 establishes consistency for general support under slightly more restrictive conditions.
Theorem 3.1.
Let be a sequence of random variables satisfying (3.2) where has a fixed marginal density given by and the support of , , is a finite set. Assume that the initial estimate in (1.4) has the same support as . Then under B1–B4, C1–C3, the estimator in (1.4) converges to the true mixing density with probability one, as the number of observations goes to infinity.
Proof.
The terms in the decomposition of the KL divergence in (2.1) can no longer be handled in the same manner as in the independent case. We use further sub-indexing of the terms to obtain appropriate convergence results under the dependence condition (3.2).
We first partition the positive natural numbers into sub-sequences. Let denote the th term in th sub-sequence and let denote its value. Then the sub-sequences are constructed in the following manner:
-
1.
Let .
-
2.
Let for some fixed positive integer .
-
3.
The terms in the th sub-sequence are immediately followed by the next available term in the th sub-sequence, unless there are no such terms, in which case the term is followed by the next term in the first sub-sequence.
By construction, .
For convenience of notation, we denote by the integer whenever and . We define if . Let denote the -algebra generated by the collection .
Similarly, let denote the conditional density of given .
Let be the number of sub-sequences constructed for , that is, we consider such that for some . Clearly, . Also, let be the number of terms in the th sequence up to , such that
As an example, for the following pattern arises for the sub-sequences
Thus, , and in this example.
From equation (1.4), similar to equation (2) we have
(3.3) |
where
and is as defined following (2). Next we show convergence of the different parts.
Convergence of : We have for some . This implies that is a mean zero square integrable martingale with filtration , and therefore converges almost surely [18]. Thus, outside a set of probability zero, the ’s converge for all , as varies over a countable set.
Next, we show that only finitely many of the martingale sequences will make a significant contribution, with probability one, for large . Choose and choose large enough such that . From Lemma 1,
where is a constant.
Hence, outside a null set, say (possibly depending on ), we have for all but finitely many ’s, and the ’s converge. Fix , where is the underlying product probability space. For any , we can choose (possibly depending upon ) such that for , and . Let be large enough such that for we have whenever . Finally,
As is arbitrary, we can choose going to zero over a sequence. This implies that is a Cauchy sequence with probability one and therefore converges with probability one.
Convergence of : Using the expression for , we have,
as for for some , as . Hence, converges almost surely from Proposition 4. Hence, converges almost surely.
Decomposing : We have from (3.3),
By Cauchy-Schwarz and using [B3] to bound the expression for in (3.1), we have
Let . If , then by condition (3.2) and Jensen’s inequality, we have . By construction of the sequences, a gap is equal to at most times. Also by (3.1), for some . Thus, . Hence, converges absolutely with probability one and hence, converges with probability one.
Convergence of : By Proposition 4, it is enough to show converges.
Let be the cardinality of . By condition [B3], , where . Then, by condition [B2] any -fold product .
Expanding as product of ratios of successive marginals,
When , where is minimized and maximized at and , respectively. When , then we have .
Hence,
where be the set .
Coefficients for the -fold products are bounded by , for some . The expectation of each of those -fold products is bounded by . Let . Then there are many terms consisting of -fold products of .
As , for any . Choosing and large enough, we have , when and , and we have
By Proposition 4, we have the result.
Combining the parts: Having established that all the above terms converge with probability one to finite quantities, we essentially follow the arguments given in Tokdar et al. [2]. Since , we have converging to zero along a subsequence with probability one, as the LHS in (3.3) is finite; otherwise, from the fact , the RHS would be .
Hence, as , has to converge, as all other sequences converge and therefore converges to zero in a subsequence almost surely. Now, using finiteness of , the proof follows as converges and over that subsequence has to converge to , as otherwise cannot converge to zero in that subsequence (Ghosh and Tokdar [9]).
In particular, if and if in some sub-subsequence of then corresponding marginal converges to weakly. But, as , converges to in Hellinger distance and also weakly, which is a contradiction. Hence, and therefore . Therefore, and we have with probability one. ∎
In general, a larger value of in (3.2) indicates stronger dependence among the observations, and hence the convergence rate is expected to be slower for larger . Even though it is hard to write explicitly how affects the convergence, some insight can be gleaned by studying the different components of the decomposition of the KL divergence and their convergence.
Under the setting of Theorem 3.1, for convergence of , we need the Cauchy increments to go to zero. From equation (3.3), the Cauchy difference is essentially made up of a martingale difference sequence, an increment of a non-negative sequence, and other terms that constitute error terms. The main term that is influenced directly by the dependence parameter is . Following the calculations that bound the terms in the decomposition (3.3), we have
for some . Thus, convergence of the conditional mean sequence, and hence the KL sequence is expected to be slower for larger values of . This intuition is reinforced in Example 1 where the AR coefficient plays the role of the dependence parameter in (3.2). Numerically, we see that convergence is faster for smaller and slower with larger values of .
As mentioned earlier, consistency of the recursive algorithm can be established in much more generality, even under mild dependence, provided we can assume a slightly stronger condition. The following condition is stronger than [B2], but may be more readily verifiable.
-
. , for some ,
The condition is sufficient for establishing consistency in the non-finite support case with dependent data. However, it need not be necessary. It may not hold in some cases where the support of the observations is unbounded. For example, the condition does not hold for a normal location mixture with dependent data, but it does hold when a truncated normal kernel is used.
In the proof of Theorem 3.1, one could work with instead of [B2].
Remark 2.
The proof of the corollary is straightforward and is omitted. The condition is in essence equivalent to [B2] and [B3] when the kernel is bounded and bounded away from zero.
Remark 3.
If and then B2, B3 and are satisfied.
Theorem 3.2.
Let be a sequence of random variables satisfying (3.2), where each has a fixed marginal density given by , and the support of , , is a compact subset of the corresponding Euclidean space. Assume the initial estimate in (1.4) has the same support as . Let and denote the cdfs associated with and , respectively. Then under , C1–C3, the estimate from equation (1.4) converges to in the weak topology with probability one, as the number of observations goes to infinity.
Proof.
Given in the Appendix. ∎
3.1 Misspecification of support
The predictive recursion algorithm requires specification of the support of the mixing density. The support could be misspecified, in which case the solution cannot converge to the true density ; however, one can still investigate convergence of the sequence . Let the support of the initial density in the predictive recursion (1.4) be , a compact set possibly different from , the support of the true mixing density . Let be the class of all densities with the same support as . Specifically, let
(3.5) |
Let
(3.6) |
Thus, assuming uniqueness, the minimizer is the information projection of onto . This minimizer is related to the Gateaux differential of the KL divergence (Patilea [19]). For independent observations, misspecification of the support has been addressed in the literature; see, for example, Ghosh and Tokdar [9], Martin and Ghosh [10]. Using the decomposition for the dependent case in the proofs of Theorems 3.2 and 3.3, we extend the result to dependent cases. Let where is defined in (2.2).
Theorem 3.3.
Note that since is convex, so is the set of corresponding marginals. Thus, uniqueness of is guaranteed if the model is identifiable.
3.2 Convergence rate of the recursive algorithm
Convergence of PR type algorithms has been explored in recent work such as Martin and Tokdar [11], where the fitted PR marginal is shown to lie in a Hellinger ball of radius around the true marginal almost surely. Martin [20] establishes a better bound for finite mixtures under misspecification. PR convergence rates are nonparametric in nature and, in the current literature, do not attain the minimax rate (up to logarithmic factors) shown in Ghosal and Van Der Vaart [21] and subsequent developments. The convergence rate calculations ([11], [20]) follow from the Robbins–Siegmund supermartingale convergence theorem (Robbins and Siegmund [17], Lai [22]) in the independent case and yield a rate similar to that of Genovese and Wasserman [23].
In the presence of dependence, we show the rate calculation assuming a faster rate of decay for the weights . Instead of condition B1 we will assume
-
, and , .
We use the subsequence construction technique described in Section 3 to carry out the rate calculations. Following the arguments given in the proofs of Theorems 3.1 and 3.2, we establish almost sure concentration in a Hellinger ball of radius around the true marginal for . The rate is slower than the rate for the independent case. Let denote the squared Hellinger distance between densities.
Theorem 3.4.
Assume the conditions of Theorem 3.3 hold with B1 replaced with . Then
for some . Moreover, if is bounded away from zero and infinity, then .
Proof.
From the proofs of Theorems 3.2 and 3.3 (in the Appendix), converges, where if . By construction and . Hence, converges almost surely. Let . Then, from (A.1),
(3.7) |
Here
Writing , , from the proof of Theorem 3.2, it can be shown that , , converge almost surely. We have already established that converges almost surely. Hence, has to converge almost surely, which implies that goes to zero over a subsequence.
As converges to a finite number for each outside a set of probability zero, and , goes to zero with probability one for . As is bounded away from zero and infinity, , and the conclusion about the Hellinger distance follows since the Kullback–Leibler divergence is greater than the squared Hellinger distance.
∎
3.3 Finite Mixture of Markov processes and other examples
We consider a few examples of dependent processes, e.g., mixtures of Markov processes, where the marginal density is a mixture. Uniqueness of such representations has been studied extensively in the literature. Some of the earliest work can be found in Freedman [24], Freedman [25]; also see the related work in Diaconis and Freedman [26]. Here we look at variants of stationary autoregressive processes of order one, AR(1), plus noise, and Gaussian processes with constant drift. While simulations are done for different sample sizes, our main objective is to study the effect of dependence on convergence of the PR algorithm and not the rate of convergence. In fact, we use throughout, for which . Thus, condition and hence Theorem 3.4 are not applicable for these examples.
The condition given in (3.2) is a sufficient condition, and if the marginal is a finite mixture of marginals of latent Markov processes which individually satisfy (3.2) and B2, then a similar but slightly weaker condition will hold, and Theorem 3.1 will hold for the mixture. In particular, if we have where the ’s are independent Markov processes with stationary marginal densities and , and the marginal distributions are parametrized by ’s (i.e., of the form ), we can state the following result.
Proposition 3.
Let where ’s are independent Markov processes with stationary marginal densities and . Suppose, for each there exists and such that
(3.8) |
Then Theorem 3.1 holds for the mixture process.
Proof.
By assumption, the marginal of is . Let be the event, , for and the last time , for for . Then
where Hence,
(3.9) | |||||
From the calculation following the decomposition (3.3), it follows that with given in equation (3.8) and as defined in the proof of Theorem 3.1, where denotes less than or equal to up to a constant multiple. For between and , . Hence, we have .
Similarly, B2 can be verified by conditioning on the indicators, and the rest of the argument from Theorem 3.1 follows. ∎
Example 1. AR(1). As an example of a mixture of independent Markov processes, we consider a mixture where one component is Gaussian white noise and the other component is a stationary AR(1) process. Let with marginal , and let , independent of . Let , where the ’s are independent Bernoulli(0.3). We consider different values of the AR(1) parameter, , and investigate the effect on the convergence. To make the stationary variance of the AR(1) process equal to one, we choose the innovation variance to be . For the AR(1) process, the dependence parameter in condition (3.2) is a monotone function of . Specifically, we use . A typical example is given in Figure 1, where we see that the higher the value of , the slower the convergence. While for moderate values of the effect is negligible, for strong dependence the effect on convergence is very pronounced. In this example a starting mixing probability of has been used, where the true mixing probability is 0.3.
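The following sketch reproduces the flavor of this experiment. Since the component means and the kernel are not restated here, we assume a two-point mean support at $-1$ and $+1$ with a $N(\theta,1)$ kernel; the mixing probability $0.3$ and the innovation variance $1-\phi^2$ follow the description above, while the weights $w_n = 1/(n+1)$ and the starting weight $0.5$ are our choices.

```python
import numpy as np

rng = np.random.default_rng(2)
support = np.array([-1.0, 1.0])          # assumed two-point mean support
kernel = lambda x, t: np.exp(-0.5 * (x - t) ** 2) / np.sqrt(2 * np.pi)

def simulate(n, phi, p=0.3):
    """With probability p: mean -1 plus a stationary AR(1) error; otherwise mean +1 plus white noise."""
    ar = np.empty(n)
    ar[0] = rng.normal()                 # start from the stationary N(0, 1) law
    for t in range(1, n):
        ar[t] = phi * ar[t - 1] + rng.normal(scale=np.sqrt(1.0 - phi ** 2))
    labels = rng.random(n) < p
    return np.where(labels, -1.0 + ar, 1.0 + rng.normal(size=n))

def pr_mixing_prob(x_obs):
    """Predictive recursion (1.4) on the two-point support; returns the estimated weight at -1."""
    f = np.array([0.5, 0.5])
    for n, x in enumerate(x_obs, start=1):
        w = 1.0 / (n + 1.0)
        k = kernel(x, support)
        f = (1.0 - w) * f + w * k * f / np.sum(k * f)
    return f[0]

for phi in (0.1, 0.5, 0.9):
    print(phi, round(pr_mixing_prob(simulate(20000, phi)), 3))   # true mixing probability is 0.3
```

Stronger dependence (larger $\phi$) leaves the marginal unchanged but slows the stabilization of the recursion, which is the qualitative behavior reported in Figure 1.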
Here it is easy to see that the individual components satisfy (3.2). Thus, the PR algorithm based on the observations will be consistent for the marginal provided the moment conditions assumed in Theorem 3.1 are satisfied. This part follows by conditioning on the indicators and then calculating the relevant expectations given the indicators by recursively conditioning on the observed values from the earlier indices. In particular, it is sufficient to consider the case when , , and the values of for all , that is, when the observation comes from the latent AR(1) process. Note that , where the are uniformly bounded (say by ). For convenience we write .
Let . Let denote the event that . Hence,
Similarly,
Applying the bounds recursively we have the result.
Example 2. A continuous mixture of AR(1). Next we consider a continuous mixture in the mean with the error term following an AR(1) dependence. Let , where follows an AR(1) model with parameter . Let be a standard normal distribution truncated to [-3,3]. Using a uniform density on [-3,3] as the initial estimate in (1.4), the fitted values are given in the left box of Figure 2. The true mixing density is given in solid blue. Solid and dashed black show the fits using , respectively.
The following argument verifies the conditions for consistency for the PR algorithm for a mean mixture of AR(1) process, . Here follows and are i.i.d with density , where is bounded and bounded away from zero and has compact support. We write,
Here, , and is the conditional density of given . Hence,
where , using the fact that the support of is bounded and is bounded away from zero and infinity. Hence, all the moments of are finite (in particular, for any ). The result follows by noticing that ; where are constants. It should be noted that , where and bounded. Using a similar argument, writing the joint density of in convolution form, where all the moments of are finite.
Now, , where . Hence, using the Cauchy–Schwarz inequality we have
Hence the dependence condition is satisfied. Next we verify the moment condition for the mixture process. For such that , and for , where , let , and such that . Choosing large enough and small enough,
where follows with parameter . Then by computing expectations recursively and noting that , the moment condition is established using arguments similar to those in Example 1.
Example 3. An irregular mixture with AR(1) error. Consider the last example but with a mixing distribution that has more structure. Specifically, let . Here . Using a continuous uniform on [-8,8] as the initial density, the fitted densities are given in the middle box of Figure 2 for sample sizes , respectively. The algorithm converges to a mixture structure for , with probability mass in the interval (-0.5, 0.5) around zero approximately equal to 0.45. The true mixing distribution is given by the solid line and the fits are given by the dotted line. While the estimated density shows bimodality, a much larger sample size was also used to see explicitly the effect of large . The right box in Figure 2 shows the estimated mixture density for . The estimate is markedly better around the continuous mode, and the other mode is more concentrated around zero, indicating recovery of the discrete part. However, the convergence is markedly slower than that in Example 2. Verification of the conditions for consistency follows from the previous example.
Example 4. Gaussian process paths.
We consider two sub-examples with observations from a latent Gaussian process with known covariance kernel. In the first case, the observations are shifted by a fixed mean with some unknown probability. In the second case, we consider an unknown constant drift along with the mean function. We observe from the marginal distribution.
a) Zero drift. Consider a continuous time process observed at times , where the observed process is the sum of two independent latent processes. Suppose with covariance kernel , and the observed process is , where and is independent of .
Using a known support, i.e., support at and for the mean, we can plot the predictive recursion update for the probability at -1 in the left panel of Figure 3 (true value is 0.7). If instead we start with a continuous support on and a uniform initial density, then the predictive recursion solution for is given in the right panel.
b) Constant drift. A similar setting as in part (a) is considered with , and , where there is a drift given by and the observed points are at ’s, , and . The top panel of Figure 4 shows the fitted marginal and joint mixing densities for the slope and intercept parameters obtained by the SA algorithm, where the initial density is uniform on the rectangle . We have concentration around and for the intercept parameter, and and for the slope parameter.
The bottom panel of Figure 4 shows the marginal fit when the starting density is uniform on and the support does not contain the true intercept parameter. In that case the mixing density for the intercept concentrates around and , corresponding to the nearest point in KL distance.
4 -dependent processes
An important subclass of the weakly dependent processes is the class of $m$-dependent processes, where and are independent if for some positive integer . Heuristically, if the process is $m$-dependent for some positive integer , one expects the original PR algorithm to provide consistent estimates over different subsequences in which consecutive indices are at least $m+1$ apart, and hence to provide an overall consistent estimator. Thus, one expects that the sufficient conditions assumed for general weakly dependent processes can be significantly relaxed. This is indeed the case, and the proof for $m$-dependent processes is also significantly different. Hence we present the case of $m$-dependent processes separately.
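This thinning heuristic can be sketched numerically for a 1-dependent sequence; the MA(1) construction, the two-point support, and the normal kernel below are illustrative choices of ours rather than the paper's example.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_ma1_mixture(n, b=0.6, p=0.3):
    """1-dependent data: a two-point mean mixture observed with unit-variance MA(1) noise."""
    eps = rng.normal(size=n + 1, scale=1.0 / np.sqrt(1.0 + b ** 2))
    noise = eps[1:] + b * eps[:-1]                    # MA(1) errors, stationary variance 1
    means = np.where(rng.random(n) < p, -1.0, 1.0)    # latent two-point means
    return means + noise

def pr_two_point(x_obs):
    """Predictive recursion on the support {-1, +1} with a N(theta, 1) kernel."""
    support = np.array([-1.0, 1.0])
    f = np.array([0.5, 0.5])
    for n, x in enumerate(x_obs, start=1):
        w = 1.0 / (n + 1.0)
        k = np.exp(-0.5 * (x - support) ** 2) / np.sqrt(2 * np.pi)
        f = (1.0 - w) * f + w * k * f / np.sum(k * f)
    return f[0]

x, m = simulate_ma1_mixture(30000), 1
# Observations more than m indices apart are independent, so each thinned subsequence is
# i.i.d. with the same mixture marginal and PR along it targets p = 0.3; Theorem 4.1 below
# says the recursion applied to the full (dependent) sequence is consistent as well.
print([round(pr_two_point(x[j::m + 1]), 3) for j in range(m + 1)],
      round(pr_two_point(x), 3))
```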
Consider an $m$-dependent sequence with a fixed marginal distribution of the form (1.3). From the definition of the process, and are independent for all and for some positive integer . An example of such a process would be a $q$th order moving average process, MA($q$), defined as
$X_i \;=\; \sum_{j=0}^{q} b_j\, \varepsilon_{i-j},$
where the $b_j$’s are fixed parameters and the $\varepsilon_i$’s are independent mean zero random variables with density . Let the marginal density of be of the mixture form (1.3). Note that the process considered is stationary, but for the PR example we merely need the marginal density to not change with the index. As mentioned, the assumptions can be relaxed in the $m$-dependent case. The new assumptions are
-
A1
as ; for ; and .
-
A2
The map is injective; that is, the mixing density is identifiable from the mixture.
-
A3
The map is bounded and continuous for .
-
A4
For , and in , for some and for .
The conditions needed for convergence in the dependent case are similar to those in the independent case, other than [A4] and [A1]. Condition [A1] is needed because, instead of the subsequence construction used in Theorems 3.1 and 3.2 for gradually vanishing dependence, we will look at a gap martingale sum. Condition [A4] is needed to account for the difference in the conditional mean terms, which involves fold products of terms similar to the ’s. Under the assumed conditions we have the following result.
Theorem 4.1.
Proof.
Given in the Appendix. ∎
Analogous to the general case, a statement can be made about convergence under misspecification of the support in the $m$-dependent case. Assume the setup of Theorem 3.3.
Theorem 4.2 (Convergence under misspecification).
Assume the observations are generated from an $m$-dependent process with fixed marginal density . Assume A1–A4 and C1–C3 hold. Then with probability one, where is defined in Theorem 3.3.
The proof is given in the Appendix.
4.1 Convergence rate for -dependent case
Next we investigate the convergence rate of the PR algorithm in the -dependent case. We will assume as the decay rate for the weights .
Theorem 4.3.
For unique , , under the setting of Theorem 4.2, for , with probability one and , if is bounded away from zero and infinity.
5 Discussion
We have established consistency of the solution of the predictive recursion algorithm under various dependence scenarios. For the dependent cases considered, we have also explored the convergence rate of the solution of the predictive recursion algorithm. The theoretical development provides justification for using the original algorithm in many cases: under stationarity but misspecification of the dependence structure, the original algorithm continues to work as long as the dependence decays reasonably fast. The best possible nonparametric rate for dependent cases is an interesting problem to explore, and conditions for feasibility of the minimax rate need to be studied.
The proposed theoretical development justifies the possible use of stochastic approximation even when the errors in the observations come from a moving average or autoregressive mean zero process, if certain conditions are satisfied. It is well known that stochastic approximation or predictive recursion algorithms do not give a posterior estimate. However, a similar extension for posterior consistency under misspecified independence, under conditions analogous to equation (3.2), may be explored.
6 Acknowledgement
The first author Nilabja Guha is supported by the NSF grant #2015460.
A Appendix
We first prove a simple proposition which we use throughout the proofs. This is a standard result from probability theory, which we restate and prove for convenience.
Proposition 4.
Let be a sequence of random variables such that . Then converges almost surely and the limit is finite almost surely.
Proof.
Let be the probability space corresponding to the joint distribution. For some , if converges, then converges. Let be the limit of , which is defined to be infinity at any where the series diverges to positive infinity.
By the Monotone Convergence Theorem, (equivalently, this can be argued using Fatou’s lemma on the sequence of partial sums ). Hence is finite with probability one. Therefore, converges absolutely with probability one and hence converges with probability one. ∎
Lemma 1.
From equation (3.3), , for , some universal constant not depending on .
Proof.
From the derivation after equation (3.3) and using Proposition 3.1, , for some universal constant , using condition A4. From the martingale construction in Section 3, . By construction, the coefficient attached to each block decreases as the index increases; that is, if and is in , then if . Hence, , where are universal constants. ∎
A.1 Proof of Proposition 3.1
Proof.
Note that , where at and , is minimized and maximized, respectively, when (note that a continuous function on a compact set attains its minimum and maximum).
If then .
∎
A.2 Proof of Theorem 3.2
Proof.
Following (2.2) and the decomposition (3.3), we have an analogous KL decomposition for the marginals in the dependent case. We have
(A.1) |
Here
where is defined in (2.2). From the proof of Theorem 3.1 it follows that the sequence converges and converges to zero over a subsequence with probability one, as in that case was finite. Therefore, we first show that converges to zero with probability one by showing that converges almost surely.
Convergence of : From earlier calculations, . Hence, the convergence of follows in exactly the same way as the convergence of in Theorem 3.1. As are square integrable martingales, each of them converges almost surely. From the fact that , we have for some fixed . Therefore, by an argument similar to that in Lemma 1, we have
for . Hence, only finitely many martingales make contribution to the tail sum with probability one as and for all but finitely many ’s with probability one with . Thus, is Cauchy almost surely and therefore, convergent almost surely.
Convergence of : Let,
and let . Note that
Hence, in the fold products of the ’s, the index appears at most twice in . In , the terms are products of terms like and , where are in . Let . Using and Hölder’s inequality, the expectation of any such fold product is bounded by . The number of such products of is less than , for any . Also, the fold product contains many terms whose expectation can be bounded by for . Hence, for some universal constants and greater than zero, for and large enough . Thus, , where for sequences , , means that for some . Hence, converges with probability one.
Convergence of : Note that
where . Using the Cauchy–Schwarz inequality and condition B3, the expectation of the second term is bounded by for some . The number of times is equal to is less than , where is defined in the martingale construction in Section 3. Therefore, for , the sum over all is absolutely convergent.
The first term,
Hence, either converges or diverges to . Given that the LHS in equation (A.1) cannot be and the other terms on the RHS of (A.1) converge, has to converge with probability one.
Hence, we have converging to zero in a sub-sequence with probability one, and converging with probability one. Hence, converges to zero with probability one.
We now argue that this implies that converges weakly to . Suppose not. Since is compact, is tight and hence has a convergent subsequence. Let be a subsequence that converges to . Let be the marginal corresponding to . Then by [B4], converges pointwise to and hence, by Scheffé’s theorem, it converges in and hence in Hellinger norm to . However, by the previous calculations, converges to almost surely in Kullback–Leibler distance and therefore in Hellinger norm, which is a contradiction as . Hence, converges to weakly along every subsequence. ∎
A.3 Proof of Theorem 4.1
Proof.
For , define . Also let, for each , the subsequences be denoted by . By construction, these are i.i.d. with marginal distribution . Let denote the $\sigma$-field generated by all ’s up to and including . Also let the marginals generated during the iterations be denoted by and the weights be denoted by . From equation (2.1),
(A.2) |
Here
where is defined as in (2). We follow the same argument as in the general case; first we show that converges with probability one and converges to zero over some subsequence with probability one. Then we show convergence of . As before, we show convergence of , the remainder term and the error term . We simplify some of the expressions first. From (1.5), for
We have
(A.3) |
by using the triangle inequality and using Proposition 3.1 on . By Hölder’s inequality and assumption A4, we have
for Thus
Similarly, we can bound . By Cauchy–Schwarz we have , where is a constant. Hence, by Proposition 4, converges almost surely. Therefore, , a finite random variable, with probability one. Note that for greater than some positive integer , since for large enough . Applying A4, we have , and hence converges; hence, by Proposition 4, converges almost surely. Similarly, and is a martingale with filtration and , and hence converges almost surely to a finite random variable, as it is a square integrable martingale.
From equation (A.2) we have convergence of with probability one. This statement holds because the LHS in (A.2) is a fixed quantity subtracted from a positive quantity, and , and converge with probability one. As and , for . Hence, cannot diverge to infinity. Moreover, as for , has to go to zero along some subsequence almost surely.
Next we show convergence of . Analogously, replacing and by and from the above derivation, from equation (2.2) we get
(A.4) |
where are the martingale sequences, which converge due to square integrability.
Then for ,
(A.5) |
Also, converges, using an argument similar to that for . The proof of the martingale square integrability and the convergence of follow similarly as for . The convergence of the difference term follows similarly to the first part and is given in the next subsection. Hence, we have the ’s converging to finite quantities with probability one for each , as is positive. Hence, converges with probability one.
From the fact converges almost surely as goes to infinity and it converges to zero in a subsequence almost surely, we have it converging to zero almost surely. Therefore, by arguments given in the proof of Theorem 3.2, converges weakly to with probability one. ∎
A.3.1 Convergence of and
The proofs of the martingale square integrability and of the convergence of the remainder terms and from Theorem 4.1 are as follows.
Convergence of : We show that is a square integrable martingale. Note that . From A4, , using Hölder’s inequality. Thus, we have , which proves our claim.
Convergence of :
Note that
from A4 and A1, and from the fact and and (following A4). Hence, and converges with probability one.
Convergence of : Let,
and . From the fact
and we have
Similarly
The part of the RHS for can be bounded by the sum of fold products of and ’s multiplied by a coefficient less than . Similarly, for we have fold products of ’s. Hence, integrating with respect to and taking expectations, we have a universal bound for any such product term from Hölder’s inequality. Hence, for , for some universal constant . As , converges absolutely with probability one. Therefore, it converges with probability one.
A.4 Proof of Theorem 4.2
Proof.
From the derivation of equation (A.2) using instead of , i.e writing , we have, for ,
(A.6) |
Here
For , . Convergence of , , follows from the proof of Theorem 4.1. Thus, similarly, each has to converge and hence converges to zero in some subsequence with probability one. Convergence of follows from the proof of Theorem 4.1, which completes the proof.
∎
A.5 Proof of Theorem 3.3
Proof.
From the proof of Theorem 3.1, we decompose , and have an equation analogous to (3.3)
(A.7) |
which is derived by using instead of . Analogous to the decomposition in (3.3), we have
Using an argument similar to that in Theorem 4.2, goes to zero in some subsequence with probability one as goes to infinity. The convergence of follows by essentially the same proof as in Theorem 3.2. Together these complete the proof. ∎
A.6 Proof of Theorem 4.3
Proof.
Since , from the proof of Theorem 4.1 we can conclude that , with , converges with probability one.
Let , . Hence, for for Theorem 4.2
(A.8) |
and
(A.9) |
Let and . For the dependent case, from the proof of Theorem 4.1, writing instead of , we can show the almost sure convergence of . Convergence of and also follows in a similar fashion. From the fact that , the term converges with probability one. Hence, following A1, we have converging and going to zero in a subsequence with probability one for all as goes to infinity, and converging with probability one.
We have converging to a finite number for each outside a set of probability zero, and ; hence goes to zero with probability one for . The conclusion about the Hellinger distance follows from the relation between the KL and Hellinger distances, by the argument in the proof of Theorem 3.4.
∎
References
- Newton and Zhang [1999] Michael A Newton and Yunlei Zhang. A recursive algorithm for nonparametric analysis with missing data. Biometrika, 86(1):15–26, 1999.
- Tokdar et al. [2009] Surya T Tokdar, Ryan Martin, and Jayanta K Ghosh. Consistency of a recursive estimate of mixing distributions. The Annals of Statistics, 37(5A):2502–2522, 2009.
- Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
- Chung [1954] Kai Lai Chung. On a stochastic approximation method. The Annals of Mathematical Statistics, 25(3):463–483, 1954.
- Fabian et al. [1968] Vaclav Fabian et al. On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, 39(4):1327–1332, 1968.
- Polyak and Juditsky [1992] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838–855, 1992.
- Sharia [2014] Teo Sharia. Truncated stochastic approximation with moving bounds: convergence. Statistical Inference for Stochastic Processes, 17(2):163–179, 2014.
- Kushner [2010] Harold Kushner. Stochastic approximation: a survey. Wiley Interdisciplinary Reviews: Computational Statistics, 2(1):87–96, 2010.
- Ghosh and Tokdar [2006] Jayanta K Ghosh and Surya T Tokdar. Convergence and consistency of Newton’s algorithm for estimating mixing distribution. In Frontiers in statistics, pages 429–443. World Scientific, 2006.
- Martin and Ghosh [2008] Ryan Martin and Jayanta K Ghosh. Stochastic approximation and Newton’s estimate of a mixing distribution. Statistical Science, 23(3):365–382, 2008.
- Martin and Tokdar [2009] Ryan Martin and Surya T Tokdar. Asymptotic properties of predictive recursion: robustness and rate of convergence. Electronic Journal of Statistics, 3:1455–1472, 2009.
- Kleijn and van der Vaart [2006] Bas JK Kleijn and Aad W van der Vaart. Misspecification in infinite-dimensional Bayesian statistics. The Annals of Statistics, 34(2):837–877, 2006.
- Hahn et al. [2018] P Richard Hahn, Ryan Martin, and Stephen G Walker. On recursive Bayesian predictive distributions. Journal of the American Statistical Association, 113(523):1085–1093, 2018.
- Martin and Han [2016] Ryan Martin and Zhen Han. A semiparametric scale-mixture regression model and predictive recursion maximum likelihood. Computational Statistics & Data Analysis, 94:75–85, 2016.
- Ghosal et al. [1999] Subhashis Ghosal, Jayanta K Ghosh, and RV Ramamoorthi. Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics, 27(1):143–158, 1999.
- Lijoi et al. [2005] Antonio Lijoi, Igor Prünster, and Stephen G Walker. On consistency of nonparametric normal mixtures for Bayesian density estimation. Journal of the American Statistical Association, 100(472):1292–1296, 2005.
- Robbins and Siegmund [1971] Herbert Robbins and David Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Optimizing methods in statistics, pages 233–257. Elsevier, 1971.
- Durrett [2019] Rick Durrett. Probability: theory and examples, volume 49. Cambridge university press, 2019.
- Patilea [2001] Valentin Patilea. Convex models, MLE and misspecification. The Annals of Statistics, pages 94–123, 2001.
- Martin [2012] Ryan Martin. Convergence rate for predictive recursion estimation of finite mixtures. Statistics & Probability Letters, 82(2):378–384, 2012.
- Ghosal and Van Der Vaart [2001] Subhashis Ghosal and Aad W Van Der Vaart. Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. The Annals of Statistics, 29(5):1233–1263, 2001.
- Lai [2003] Tze Leung Lai. Stochastic approximation. The Annals of Statistics, 31(2):391–406, 2003.
- Genovese and Wasserman [2000] Christopher R Genovese and Larry Wasserman. Rates of convergence for the Gaussian mixture sieve. The Annals of Statistics, 28(4):1105–1127, 2000.
- Freedman [1962a] David A Freedman. Mixtures of Markov processes. The Annals of Mathematical Statistics, 33(1):114–118, 1962a.
- Freedman [1962b] David A Freedman. Invariants under mixing which generalize de Finetti’s theorem. The Annals of Mathematical Statistics, 33(3):916–923, 1962b.
- Diaconis and Freedman [1980] Persi Diaconis and David Freedman. de Finetti’s theorem for Markov chains. The Annals of Probability, pages 115–130, 1980.