


A probabilistic view on predictive constructions for Bayesian learning

Patrizia Berti (Dipartimento di Scienze Fisiche, Informatiche e Matematiche, Università di Modena e Reggio-Emilia, via Campi 213/B, 41100 Modena, Italy), Emanuela Dreassi (Dipartimento di Statistica, Informatica, Applicazioni, Università di Firenze, viale Morgagni 59, 50134 Firenze, Italy), Fabrizio Leisen (School of Mathematical Sciences, University of Nottingham, University Park, Nottingham, NG7 2RD, UK), Luca Pratelli (Accademia Navale, viale Italia 72, 57100 Livorno, Italy), Pietro Rigo (Dipartimento di Scienze Statistiche “P. Fortunati”, Università di Bologna, via delle Belle Arti 41, 40126 Bologna, Italy)
Abstract

Given a sequence X=(X_{1},X_{2},\ldots) of random observations, a Bayesian forecaster aims to predict X_{n+1} based on (X_{1},\ldots,X_{n}) for each n\geq 0. To this end, in principle, she only needs to select a collection \sigma=(\sigma_{0},\sigma_{1},\ldots), called a “strategy” in what follows, where \sigma_{0}(\cdot)=P(X_{1}\in\cdot) is the marginal distribution of X_{1} and \sigma_{n}(\cdot)=P(X_{n+1}\in\cdot\mid X_{1},\ldots,X_{n}) is the n-th predictive distribution. Because of the Ionescu-Tulcea theorem, \sigma can be assigned directly, without passing through the usual prior/posterior scheme. One main advantage is that no prior probability has to be selected. In a nutshell, this is the predictive approach to Bayesian learning. A concise review of the latter is provided in this paper. We try to put such an approach in the right framework, to clear up a few misunderstandings, and to provide a unifying view. Some recent results are discussed as well. In addition, some new strategies are introduced and the corresponding distribution of the data sequence X is determined. The strategies concern generalized Pólya urns, random change points, covariates and stationary sequences.

MSC2020 subject classifications: 62F15, 62M20, 60G09, 60G25.

Keywords: Bayesian inference, Conditional identity in distribution, Exchangeability, Predictive distribution, Sequential predictions, Stationarity.

1 Introduction

This paper has been written with the following interpretation of Bayesian inference in mind. (We declare this interpretation from the outset just to make our point of view transparent and the paper easier to follow.) Let us call \mathcal{O} the object of inference. Roughly speaking, \mathcal{O} denotes whatever we ignore but would like to know. For instance, \mathcal{O} could be a parameter (finite or infinite dimensional), a set of future observations, an unknown probability distribution, the effect of some action, or something else. In our view, the distinguishing feature of the Bayesian approach is to regard \mathcal{O} as the realization of a random element, and not as an unknown but fixed constant. As a consequence, the main goal of any Bayesian inferential procedure is to determine the conditional distribution of \mathcal{O} given the available information.

Note that, unless \mathcal{O} itself is a parameter, no other parameter is necessarily involved.

Prediction of unknown observable quantities is a fundamental part of statistics. Initially, it was probably the most prevalent form of statistical inference. The wind changed at the beginning of the 20th century, when statisticians’ attention shifted to other issues, such as parametric estimation and testing; see e.g. [36]. Nowadays, prediction is back in the limelight, and it plays a role in modern topics including machine learning and data mining; see e.g. [17, 18, 27, 43].

This paper deals with prediction of future observations, based on the past ones, from the Bayesian point of view. Precisely, we focus on a sequence

X=(X_{1},X_{2},\ldots)

of random observations and, at each time n, we aim to predict X_{n+1} based on (X_{1},\ldots,X_{n}). Hence, for each n, the object of inference is \mathcal{O}=X_{n+1}, the available information is (X_{1},\ldots,X_{n}), and the target is the predictive distribution P(X_{n+1}\in\cdot\mid X_{1},\ldots,X_{n}). We point out that, apart from technicalities, most of our considerations could be generalized to the case where \mathcal{O} is an arbitrary (measurable) function of the future observations, say

\mathcal{O}=f(X_{n+1},X_{n+2},\ldots).

This case has recently been the object of increasing attention; see e.g. [29, 40].

No parameter \theta plays a role at this stage. The forecaster may involve some \theta, if she thinks it helps, but she is not interested in \theta as such. To involve \theta means to model the probability distribution of X as depending on \theta, and then to exploit this fact to calculate the predictive distributions P(X_{n+1}\in\cdot\mid X_{1},\ldots,X_{n}).

To better address our prediction problem, it is convenient to introduce the notion of strategy. Let (S,\mathcal{B}) be a measurable space, with S to be viewed as the set where the observations X_{n} take values. Following Dubins and Savage [26], a strategy is a sequence

\sigma=(\sigma_{0},\sigma_{1},\sigma_{2},\ldots)

such that

  • \sigma_{0} and \sigma_{n}(x) are probability measures on \mathcal{B} for all n\geq 1 and x\in S^{n};

  • The map x\mapsto\sigma_{n}(x,A) is \mathcal{B}^{n}-measurable for fixed n\geq 1 and A\in\mathcal{B}.

Here, \sigma_{0} should be regarded as the marginal distribution of X_{1} and \sigma_{n}(x) as the conditional distribution of X_{n+1} given that (X_{1},\ldots,X_{n})=x. Moreover, \sigma_{n}(x,A) denotes the value taken at A by the probability measure \sigma_{n}(x). We also note that strategies are often called prediction rules in the framework of species sampling sequences; see [54, p. 251].

Strategies are a natural tool to frame a prediction problem from the Bayesian standpoint. In fact, a strategy \sigma can be regarded as the collection of all predictive distributions (including the marginal distribution of X_{1}) in the sense that \sigma_{n}(x,\,\cdot)=P\bigl(X_{n+1}\in\cdot\mid(X_{1},\ldots,X_{n})=x\bigr) for all n\geq 0 and x\in S^{n}. Thus, in a sense, everything a Bayesian forecaster has to do is select a strategy \sigma. Obviously, the problem is how to do it. A related problem is whether, in order to choose \sigma, involving a parameter \theta is convenient or not.
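To fix ideas, the following is a minimal computational sketch, not part of the original formulation, of how a strategy determines a data sequence by forward sampling (in the spirit of the Ionescu-Tulcea construction recalled below). The function names and the Pólya-type predictive rule used here are purely illustrative assumptions.

import random

def simulate(strategy_sampler, n_steps, seed=None):
    # Draw X_1, ..., X_n sequentially: the (n+1)-th value is sampled from
    # sigma_n(x), where x is the prefix observed so far.
    rng = random.Random(seed)
    x = []
    for _ in range(n_steps):
        x.append(strategy_sampler(tuple(x), rng))
    return x

# Illustrative strategy on S = {0, 1}: sigma_n(x) puts mass (1 + sum(x)) / (2 + n) on 1.
def polya_sampler(x, rng):
    p_one = (1 + sum(x)) / (2 + len(x))
    return 1 if rng.random() < p_one else 0

print(simulate(polya_sampler, 10, seed=0))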

An important special case is exchangeability. In fact, if X is assumed to be exchangeable, there is a natural way to involve a parameter \theta. To see this, take the parameter space \Theta as

\Theta=\bigl\{\text{all probability measures on }\mathcal{B}\bigr\}.

Moreover, for each \theta\in\Theta, denote by P_{\theta} a probability measure which makes X i.i.d. with common distribution \theta, i.e.,

P_{\theta}\bigl(X_{1}\in A_{1},\ldots,X_{n}\in A_{n}\bigr)=\prod_{i=1}^{n}\theta(A_{i})

for all n\geq 1 and A_{1},\ldots,A_{n}\in\mathcal{B}. Then, under mild conditions on (S,\mathcal{B}), de Finetti’s theorem yields

P(X\in\cdot)=\int_{\Theta}P_{\theta}(X\in\cdot)\,\pi(d\theta)

for some (unique) prior probability \pi on \Theta. Thus, conditionally on \theta, the observations are i.i.d. with common distribution \theta. This suggests calculating the strategy \sigma as follows.

  • (i)

    Select a prior \pi on \Theta;

  • (ii)

    For each n\geq 1 and x\in S^{n}, evaluate the posterior of \theta given x, namely, the conditional distribution of \theta given that (X_{1},\ldots,X_{n})=x;

  • (iii)

    Calculate \sigma as

    \sigma_{n}(x,A)=\int_{\Theta}\theta(A)\,\pi_{n}(d\theta\mid x)\quad\quad\text{for all }A\in\mathcal{B},

    where \pi_{n}(\cdot\mid x) is the posterior and \pi_{0}(\cdot\mid x) is meant as \pi_{0}(\cdot\mid x)=\pi.

Steps (i)-(ii)-(iii) are familiar in a Bayesian framework. Henceforth, if \sigma is selected via (i)-(ii)-(iii), the forecaster is said to follow the inferential approach (I.A.).
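As a simple worked instance of (i)-(ii)-(iii), not treated explicitly in this paper, take S=\{0,1\}, identify \theta with the probability of observing 1, and choose a Beta prior. Then the posterior and the predictive are available in closed form:

\pi=\mathrm{Beta}(a,b),\qquad \pi_{n}(\cdot\mid x)=\mathrm{Beta}\Bigl(a+\sum_{i=1}^{n}x_{i},\;b+n-\sum_{i=1}^{n}x_{i}\Bigr),

\sigma_{n}(x,\{1\})=\int_{0}^{1}\theta\,\pi_{n}(d\theta\mid x)=\frac{a+\sum_{i=1}^{n}x_{i}}{a+b+n}.

In this toy example the strategy could have been written down directly, which is precisely the point of the predictive approach discussed next.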

1.1 Predictive approach to Bayesian modeling

There is another approach to Bayesian prediction, usually called the predictive approach (P.A.), which is quite recurrent in the Bayesian literature and has recently gained increasing attention. (Such an approach, incidentally, has been referred to as the “non-standard approach” in [8, 9]). According to P.A., the forecaster directly selects her strategy \sigma. Namely, for each n\geq 0, she selects the predictive \sigma_{n} without passing through the prior/posterior scheme described above. Among others, P.A. is supported by de Finetti, Savage, Dubins [22, 23, 26] and more recently by Diaconis and Regazzini [4, 16, 24, 25, 31]. P.A. is also strictly connected to Dawid’s prequential approach [19, 20, 21] and to Pitman’s treatment of species sampling sequences [54, 55, 56]. In addition, several prediction procedures arising in not necessarily Bayesian frameworks, such as machine learning and data mining, are consistent with P.A.; see e.g. [17, 18, 27, 43]. Some further related references are [8, 9, 29, 30, 32, 40, 41, 44].

The theoretical foundation of P.A. is the Ionescu-Tulcea theorem; see e.g. [46, p. 159]. Roughly speaking, this theorem states that, to assign the joint distribution of X, it suffices to choose, in an arbitrary way, the marginal distribution of X_{1}, the conditional distribution of X_{2} given X_{1}, the conditional distribution of X_{3} given (X_{1},X_{2}), and so on. Note that this fact would be obvious if X were replaced by a finite dimensional random vector (X_{1},\ldots,X_{m}). So, in a sense, the Ionescu-Tulcea theorem extends to infinite sequences a straightforward property of finite dimensional vectors. In any case, a formal statement of the theorem is as follows.

Theorem 1.

(Ionescu-Tulcea). For each n\geq 1, let X_{n} be the n-th coordinate random variable on (S^{\infty},\mathcal{B}^{\infty}). Then, for any strategy \sigma, there is a unique probability measure P_{\sigma} on (S^{\infty},\mathcal{B}^{\infty}) such that

P_{\sigma}(X_{1}\in\cdot)=\sigma_{0}(\cdot)\quad\quad\text{and} (1)
P_{\sigma}\bigl(X_{n+1}\in\cdot\mid(X_{1},\ldots,X_{n})=x\bigr)=\sigma_{n}(x,\cdot)

for all n\geq 1 and P_{\sigma}-almost all x\in S^{n}.

Because of Theorem 1, to make predictions on the sequence X, the forecaster is free to select an arbitrary strategy \sigma. In fact, for any \sigma, there is a (unique) probability distribution for X, denoted above by P_{\sigma}, whose predictives P_{\sigma}\bigl(X_{n+1}\in\cdot\mid X_{1},\ldots,X_{n}\bigr) agree with \sigma in the sense of equation (1).

The strengths and weaknesses of I.A. versus P.A. are discussed in a number of papers; see e.g. [8, 18, 27, 36, 58] and references therein. Here, we summarize this issue (from our point of view) under the assumption that prediction is the main target.

I.A. is not motivated by prediction alone. The main goal of I.A. is to make inference on other features of the data distribution (typically some parameters) and in this case the prior \pi is fundamental. It should be added that \pi often provides meaningful information on the data generating process. However, assessing \pi is not an easy task. In addition, once \pi is selected, evaluating the posterior \pi_{n}(\cdot\mid x) is quite difficult as well. Frequently, \pi_{n}(\cdot\mid x) cannot be written in closed form but only approximated numerically. In short, I.A. is a cornerstone of Bayesian inference, but, when prediction is the main target, it is actually quite involved.

In turn, P.A. has essentially four merits. First, P.A. allows one to avoid an explicit choice of the prior \pi. Indeed, when prediction is the main target, why select \pi explicitly? Rather than wondering about \pi, it seems reasonable to reflect on how the information in (X_{1},\ldots,X_{n}) is conveyed in the prediction of X_{n+1}. Second, no distributional assumption on the data sequence X is required. This point is developed in Subsections 1.2 and 1.3. For now, we stress one consequence of this point: the Bayesian nature of a prediction procedure does not depend on the data distribution. For instance, a forecaster applying P.A. is certainly Bayesian, independently of the distribution attached to X. Third, P.A. requires the assignment of probabilities on observable facts only. The value of X_{n+1} is actually observable, while \pi and \pi_{n} (being probabilities on \Theta) do not necessarily deal with observable facts. Fourth, the strategy \sigma may be assigned stepwise. At each time n, the forecaster has observed x=(x_{1},\ldots,x_{n})\in S^{n} and has already selected \sigma_{0},\sigma_{1}(x_{1}),\ldots,\sigma_{n-1}(x_{1},\ldots,x_{n-1}). Then, to predict X_{n+1}, she is still free to select \sigma_{n}(x) as she wants. No choice of \sigma_{n}(x) is precluded. This is consistent with the Bayesian view, where the observed data are fixed and one should condition on them. In spite of these advantages, P.A. has an obvious drawback: assigning a strategy \sigma directly may be very difficult, in principle as difficult as selecting a prior \pi.

A last (basic) remark is that, if X is exchangeable, both I.A. and P.A. completely determine the probability distribution of X. Selecting a prior \pi or choosing a strategy \sigma are just equivalent routes to fix the distribution of X. In particular, selecting \sigma uniquely determines \pi. An intriguing line of research is in fact to identify the prior corresponding to a given \sigma; see e.g. [10, 24, 25, 31].

1.2 Characterizations

Recall that, for any strategy \sigma, there is a unique probability measure P_{\sigma} on (S^{\infty},\mathcal{B}^{\infty}) satisfying condition (1).

In principle, when applying P.A., the data sequence X is free to have any probability distribution. Nevertheless, in most applications, it is reasonable (if not mandatory) to impose some conditions on X. For instance, the forecaster may wish X to be exchangeable, or stationary, or Markov, or a martingale, and so on. In these cases, \sigma is subject to some constraints. If X is required to be exchangeable, for instance, \sigma should be such that P_{\sigma} is exchangeable. Hence, those strategies \sigma which make P_{\sigma} exchangeable should be characterized.

More generally, fix any collection \mathcal{C} of probability measures on (S^{\infty},\mathcal{B}^{\infty}) and suppose the data distribution is required to belong to \mathcal{C}. Then, P.A. gives rise to the following problem:

  • Problem (*): Characterize those strategies \sigma such that P_{\sigma}\in\mathcal{C}.

Sometimes, Problem (*) is trivial (Markov, martingales) but sometimes it is not (stationarity, exchangeability). To illustrate, we mention three examples (which correspond to the three dependence forms examined in the sequel).

In the exchangeable case, Problem (*) admits a solution [31, Th. 3.1] but the conditions on \sigma are quite hard to check in real problems. Hence, applying P.A. to exchangeable data is usually difficult (even if there are some exceptions; see Section 2).

A condition weaker than exchangeability is conditional identity in distribution. Say that X is conditionally identically distributed (c.i.d.) if X_{2}\overset{d}{=}X_{1} and, for each n\geq 1, the conditional distribution of X_{k} given (X_{1},\ldots,X_{n}) is the same for all k>n; see Section 3. It can be shown that

X\text{ is exchangeable}\quad\Leftrightarrow\quad X\text{ is stationary and c.i.d.;}

see [5, 47]. Hence, conditional identity in distribution can be regarded as one of the two basic ingredients of exchangeability (the other being stationarity). Now, in the c.i.d. case, Problem (*) has been solved [6, Th. 3.1] and the conditions on \sigma are quite simple. The class of admissible strategies includes several meaningful elements which cannot be used if X is required to be exchangeable. As a consequence, P.A. works quite well for c.i.d. data; see [8, 9].

The stationary case is more involved. In fact, to our knowledge, there is no general characterization of the strategies \sigma which make P_{\sigma} stationary. However, such a characterization is available in some meaningful special cases (e.g. when P_{\sigma} is also required to be Markov); see Section 4.

Finally, Problem (*) is usually easier in a few (meaningful) special cases. For instance, Problem (*) is simpler if P_{\sigma} is also asked to be Markov; see e.g. [33] and Section 4. Or else, if the strategy \sigma is required to be dominated.

  • Dominated strategies: Let \lambda be a \sigma-finite measure on (S,\mathcal{B}). Say that a strategy \sigma is dominated by \lambda if each \sigma_{n}(x) admits a density f_{n}(\cdot\mid x) with respect to \lambda, namely,

    \sigma_{0}(dy)=f_{0}(y)\,\lambda(dy)\quad\quad\text{and}
    \sigma_{n}(x,\,dy)=f_{n}(y\mid x)\,\lambda(dy)

    for all n\geq 1 and x\in S^{n}. Here, f_{0}:S\rightarrow\mathbb{R}^{+} and f_{n}:S\times S^{n}\rightarrow\mathbb{R}^{+} are non-negative measurable functions.

For instance, if S=\mathbb{R} and \sigma_{n}(x) is a non-degenerate normal distribution for all n and x, then \sigma is dominated by \lambda=\,Lebesgue measure. Or else, if S is countable, any strategy is dominated by \lambda=\,counting measure. Instead, if S is uncountable, a non-dominated strategy is \sigma_{n}(x_{1},\ldots,x_{n})=\delta_{x_{n}}, where \delta_{x_{n}} denotes the unit mass at the point x_{n}. Another non-dominated strategy is the empirical measure \sigma_{n}(x_{1},\ldots,x_{n})=(1/n)\,\sum_{i=1}^{n}\delta_{x_{i}}.

In a sense, dominated strategies play a role analogous to that of the usual dominated models in parametric statistical inference. The main advantage is that one can use the conditional density f_{n}(\cdot\mid x) instead of the conditional measure \sigma_{n}(x). A related advantage is that, if one fixes \lambda and restricts to strategies dominated by \lambda, Problem (*) becomes simpler. However, even in applied data analysis, various familiar strategies are not dominated. In the framework of species sampling sequences, for instance, most strategies are not dominated. Therefore, in this paper, we focus on general strategies while the dominated ones are regarded as an important special case.

1.3 Content of this paper and further notation

This is a review paper on P.A. which also includes some (minor) new results. Our perspective is mainly on the probabilistic aspects of Bayesian predictive constructions. Moreover, we tacitly assume that the major target is to predict future observations (and not to make inference on other random elements, such as random parameters).

Essentially, we aim to achieve three goals. First, we try to put P.A. in the right framework, to provide a unifying view, and to clear up a few misunderstandings. This has been done in the Introduction. Second, in Section 2 and Subsection 3.1, we report some known results. Third, we provide some new strategies and we prove a few related results. The strategies, introduced by means of examples, deal with generalized Pólya urns, random change points, covariates and stationary sequences. The results consist in determining the distribution of the data sequence X under such strategies. To our knowledge, Examples 7, 9, 12, 14 and Theorems 8, 11, 13 are actually new, while Theorem 6 makes precise a claim contained in [29]. Moreover, as far as we know, Section 4 is the first attempt to develop P.A. for stationary data. It provides a brief discussion of Problem (*) and introduces two large classes of stationary sequences.

As already noted, even if X could potentially be given any distribution, in most applications X is required to satisfy some conditions. There are obviously many such conditions. Among them, we decided to focus on exchangeability, stationarity and conditional identity in distribution. This choice seems reasonable to keep the paper focused, but of course it leaves out various interesting conditions, such as partial exchangeability. To write a paper of reasonable length, however, some choice was necessary.

To defend our choice, we note that, in addition to being natural in various practical problems, exchangeability is the usual assumption in Bayesian prediction. Hence, taking exchangeability into account is more or less mandatory. Moreover, since X is exchangeable if and only if it is stationary and c.i.d., the other two conditions can be motivated as the basic components of exchangeability. But there are also other reasons for dealing with them. Stationarity is in fact a routine assumption in the classical treatment of time series, and it is reasonable to consider it from the Bayesian point of view as well. Conditional identity in distribution, even if not that popular, seems to be quite suitable for P.A.; see Section 3.

The rest of the paper is organized in three sections, each concerned with a specific assumption on XX, plus a final section of open problems. All the proofs are gathered in the Appendix.

We close this Introduction with some further notation.

As usual, \delta_{u} is the unit mass at the point u. For each x\in S^{n}, where n is a positive integer or n=\infty, we denote by x_{i} the i-th coordinate of x. Moreover, we take X to be the sequence of coordinate random variables on S^{\infty}, namely,

X_{i}(x)=x_{i}\quad\quad\text{for all }i\geq 1\text{ and }x\in S^{\infty}.

From now on, we fix a strategy \sigma and we assume

X\overset{d}{=}P_{\sigma}.

We write \nu instead of \sigma_{0} (i.e., we let \sigma_{0}=\nu). Hence, \nu is a probability measure on \mathcal{B} to be regarded as the distribution of X_{1} under the strategy \sigma. Finally, to avoid technicalities, S is assumed to be a Borel subset of a Polish space and \mathcal{B} the Borel \sigma-field on S.

2 Exchangeable data

A permutation of S^{n} is a map \phi:S^{n}\rightarrow S^{n} of the form

\phi(x)=(x_{j_{1}},\ldots,x_{j_{n}})\quad\quad\text{for all }x\in S^{n}

where (j_{1},\ldots,j_{n}) is a fixed permutation of (1,\ldots,n). A sequence Y=(Y_{1},Y_{2},\ldots) of random variables is exchangeable if

\phi(Y_{1},\ldots,Y_{n})\overset{d}{=}(Y_{1},\ldots,Y_{n})

for all n\geq 2 and all permutations \phi of S^{n}.

As noted in Subsection 1.2, if X is required to be exchangeable, applying P.A. is usually hard. But there are a few exceptions and two of them are discussed in this section. We first recall that X is a Dirichlet sequence (or a Pólya sequence, see [11]) if

\sigma_{n}(x)=\frac{c\,\nu+\sum_{i=1}^{n}\delta_{x_{i}}}{n+c}\quad\quad\text{for all }n\geq 0\text{ and }x\in S^{n},

where c>0 is a constant, \nu a probability measure on \mathcal{B}, and \sigma_{0}(x) is meant as \sigma_{0}(x)=\nu. The role of Dirichlet sequences is actually huge in various frameworks, including Bayesian nonparametrics, population genetics, ecology, combinatorics and number theory; see e.g. [28, 37, 45, 54, 55, 56]. From our point of view, however, two facts are to be stressed. First, a Dirichlet sequence is exchangeable. Second, being defined through its predictive distributions, a Dirichlet sequence is a natural candidate for P.A.

2.1 Species sampling sequences

For n\geq 1 and x=(x_{1},\ldots,x_{n})\in S^{n}, denote by k_{n}=k_{n}(x) the number of distinct values in the vector x and by x_{1}^{*},\ldots,x_{k_{n}}^{*} such distinct values (in the order in which they appear). Say that X is a species sampling sequence if it is exchangeable, \sigma_{0}=\nu is non-atomic, and

\sigma_{n}(x)=\sum_{j=1}^{k_{n}}p_{j,n}(x)\,\delta_{x_{j}^{*}}+q_{n}(x)\,\nu
\quad\text{for all }n\geq 1\text{ and }x\in S^{n}

where the p_{j,n} are non-negative measurable functions on S^{n} and q_{n}=1-\sum_{j=1}^{k_{n}}p_{j,n}. Under this strategy, quoting from [42, p. 253], X can be regarded as: “the sequence of species of individuals in a process of sequential random sampling from some hypothetical infinite population of individuals of various species. The species of the first individual to be observed is assigned a random tag X_{1}=X_{1}^{*} distributed according to \nu. Given the tags X_{1},\ldots,X_{n} of the first n individuals observed, it is supposed that the next individual is one of the j-th species observed so far with probability p_{j,n}, and one of a new species with probability q_{n}”.

A nice consequence of the definition is that p_{j,n}(x) depends on x only through the vector (N_{1,n},\ldots,N_{k_{n},n}), where

N_{j,n}=N_{j,n}(x)=\text{card}\bigl\{k:1\leq k\leq n,\,x_{k}=x_{j}^{*}\bigr\}

is the number of times that x_{j}^{*} appears in the vector x; see [42, 54].

The most popular example of species sampling sequence is probably the two-parameter Poisson-Dirichlet, introduced by Pitman in [53], which corresponds to the weights

p_{j,n}(x)=\frac{N_{j,n}-b}{n+c}\quad\text{and}\quad q_{n}(x)=\frac{b\,k_{n}+c}{n+c}

where b and c are constants such that: either (i) 0\leq b<1 and c>-b or (ii) b<0 and c=-m\,b for some integer m\geq 2. In this model, if L denotes the number of distinct values appearing in the sequence X, one obtains L\overset{a.s.}{=}\infty under (i) and L\overset{a.s.}{=}m under (ii). Note also that X reduces to a Dirichlet sequence in the special case b=0.
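For intuition, the next snippet is a minimal simulation sketch (our own illustration, not part of the paper) of the two-parameter Poisson-Dirichlet predictive rule above, with a uniform base measure standing in for an arbitrary non-atomic \nu; setting b=0 recovers a Dirichlet sequence.

import random

def pitman_yor_sample(n_steps, b, c, seed=0):
    # Sequential sampling from the predictive rule: an old value x_j* is repeated
    # with probability (N_{j,n} - b)/(n + c); a new value, drawn from the base
    # measure nu (here uniform on (0,1)), appears with probability (b*k_n + c)/(n + c).
    rng = random.Random(seed)
    x, values, counts = [], [], []
    for n in range(n_steps):
        u = rng.random() * (n + c)
        acc, is_new = 0.0, True
        for j, (v, N) in enumerate(zip(values, counts)):
            acc += N - b
            if u < acc:
                x.append(v); counts[j] += 1; is_new = False
                break
        if is_new:
            v = rng.random()
            x.append(v); values.append(v); counts.append(1)
    return x

sample = pitman_yor_sample(200, b=0.5, c=1.0)
print(len(set(sample)), "distinct values among", len(sample))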

Another example, due to [39], is

p_{j,n}(x)=\frac{(N_{j,n}+1)(n-k_{n}+b)}{n^{2}+bn+c}
\quad\text{and}\quad q_{n}(x)=\frac{k_{n}^{2}-bk_{n}+c}{n^{2}+bn+c}

where b>0 and c is such that k^{2}-bk+c>0 for all integers k>0. This time, unlike the two-parameter Poisson-Dirichlet, L is a finite but non-degenerate random variable.

In general, to obtain a species sampling sequence, the forecaster needs to select \nu and the weights p_{j,n}. While the choice of \nu is free (apart from non-atomicity), the p_{j,n} are subject to the constraint that X should be exchangeable. (Incidentally, the choice of p_{j,n} is a good example of the difficulty of applying P.A. when X is required to be exchangeable). The usual method to select p_{j,n} involves exchangeable random partitions. Let \mathbb{N}=\{1,2,\ldots\} and let \Pi be a random partition of \mathbb{N}. For each n\geq 1, call \Pi_{n} the restriction of \Pi to \{1,\ldots,n\}, namely, the random partition of \{1,\ldots,n\} whose elements are of the form \{1,\ldots,n\}\cap A for some A\in\Pi. Say that \Pi is exchangeable if

\varphi(\Pi_{n})\overset{d}{=}\Pi_{n}

for all n\geq 1 and all permutations \varphi of (1,\ldots,n), where \varphi(\Pi_{n}) denotes the random partition \varphi(\Pi_{n})=\bigl\{\varphi(B):B\in\Pi_{n}\bigr\}. For instance, given any sequence Y=(Y_{1},Y_{2},\ldots) of random variables, define \Pi to be the random partition of \mathbb{N} induced by the equivalence relation i\sim j \Leftrightarrow Y_{i}=Y_{j}. Then, \Pi is exchangeable provided Y is exchangeable. Now, the weights p_{j,n} of a species sampling sequence correspond, in a canonical way, to the probability law of an exchangeable partition; see [53, 54]. Hence, choosing the p_{j,n} essentially amounts to choosing an exchangeable partition. We stop here since a detailed discussion of exchangeable partitions is beyond the scope of this paper. The interested reader is referred to [38, 39, 48, 49, 53, 56] and references therein.

A last remark is that the definition of species sampling sequences can be generalized. In particular, non-atomicity of \nu can be dropped (as in [3] and [13]) and exchangeability can be replaced by some weaker condition (as in [1] and [2]).

2.2 Kernel based Dirichlet sequences

In [10], to generalize Dirichlet sequences while preserving their main properties, a class of strategies has been introduced. Among other things, such strategies make X exchangeable.

A kernel \alpha on (S,\mathcal{B}) is a collection

\alpha=\bigl\{\alpha(\cdot\mid x):x\in S\bigr\}

such that \alpha(\cdot\mid x) is a probability measure on \mathcal{B}, for each x\in S, and the map x\mapsto\alpha(A\mid x) is measurable for each A\in\mathcal{B}. Sometimes, to ease notation, we write \alpha_{x} instead of \alpha(\cdot\mid x). A straightforward example of kernel is \alpha_{x}=\delta_{x} for each x\in S.

Fix a probability measure \nu on \mathcal{B}, a constant c>0, a kernel \alpha on (S,\mathcal{B}), and define the strategy

\sigma_{n}(x)=\frac{c\,\nu+\sum_{i=1}^{n}\alpha_{x_{i}}}{n+c} (2)

for all n\geq 0 and x\in S^{n}. Clearly, X reduces to a Dirichlet sequence if \alpha=\delta. In this case, we also say that X is a classical Dirichlet sequence.

If \alpha is an arbitrary kernel, X may fail to be exchangeable. However, a useful sufficient condition for exchangeability is available. In fact, X is exchangeable if \alpha agrees with the conditional distribution for \nu given some sub-\sigma-field \mathcal{G}\subset\mathcal{B}. For instance, if \mathcal{G}=\mathcal{B}, then \alpha=\delta and X is a classical Dirichlet sequence. At the opposite extreme, if \mathcal{G} is the trivial \sigma-field, then \alpha_{x}=\nu for all x\in S and X is i.i.d. with common distribution \nu. In general, for fixed \nu and c, a strategy \sigma which makes X exchangeable can be associated with any sub-\sigma-field \mathcal{G}\subset\mathcal{B}. It suffices to take \alpha as the conditional distribution for \nu given \mathcal{G}.

Example 2.

(Countable partitions). Let \mathcal{H} be a (non-random) countable partition of S such that H\in\mathcal{B} and \nu(H)>0 for all H\in\mathcal{H}. For x\in S, denote by H_{x} the only H\in\mathcal{H} such that x\in H. The conditional distribution for \nu given the sub-\sigma-field generated by \mathcal{H} is

\alpha(\cdot\mid x)=\sum_{H\in\mathcal{H}}1_{H}(x)\,\nu\bigl(\cdot\mid H\bigr)=\nu\bigl(\cdot\mid H_{x}\bigr)\quad\text{for all }x\in S.

Hence, X is exchangeable whenever

\sigma_{n}(x)=\frac{c\,\nu+\sum_{i=1}^{n}\nu\bigl(\cdot\mid H_{x_{i}}\bigr)}{n+c}\quad\text{for all }n\geq 0\text{ and }x\in S^{n}.

Some remarks on the above strategy \sigma are in order.

  • \sigma may be reasonable when the basic information provided by each observation x_{i} is H_{x_{i}}, namely, the element of the partition \mathcal{H} including x_{i}.

  • If S is countable, each sub-\sigma-field \mathcal{G}\subset\mathcal{B} is generated by a partition \mathcal{H} of S. Hence, \alpha is necessarily as above.

  • \sigma_{n}(x) is absolutely continuous with respect to \nu for all n and x. This is a striking difference with classical Dirichlet sequences. To give an example, call \sigma^{*} the strategy obtained from \sigma by replacing \alpha with \delta. Under \sigma^{*}, X is a classical Dirichlet sequence. Moreover, suppose \nu is nonatomic and define the set B(x)=\{x_{1},\ldots,x_{n}\} for each x=(x_{1},\ldots,x_{n})\in S^{n}. Since \nu is nonatomic and B(x) is finite,

    P_{\sigma}\Bigl(X_{n+1}=X_{i}\text{ for some }i\leq n\mid(X_{1},\ldots,X_{n})=x\Bigr)=\sigma_{n}\bigl(x,\,B(x)\bigr)=0.

    On the other hand, since \delta_{x_{i}}(B(x))=1 for each i=1,\ldots,n,

    P_{\sigma^{*}}\Bigl(X_{n+1}=X_{i}\text{ for some }i\leq n\mid(X_{1},\ldots,X_{n})=x\Bigr)=\sigma_{n}^{*}\bigl(x,B(x)\bigr)=n/(n+c).

    As a consequence, one obtains

    P_{\sigma}\Bigl(\text{all the observations are distinct}\Bigr)=1,
    P_{\sigma^{*}}\Bigl(\text{all the observations are distinct}\Bigr)=0.

  • \sigma can be generalized by replacing \alpha with

    \beta(\cdot\mid x)=1_{A}(x)\,\delta_{x}\,+\,1_{A^{c}}(x)\,\nu\bigl(\cdot\mid A^{c}\cap H_{x}\bigr),

    where A\in\mathcal{B} is a suitable set. Note that \beta reduces to \alpha if A=\emptyset. Roughly speaking, \beta is reasonable in those problems where there is a set A such that x_{i} is informative about the future observations only if x_{i}\in A. Otherwise, if x_{i}\notin A, the only relevant information provided by x_{i} is H_{x_{i}}. As a trivial example, take S=\mathbb{R} and

    \mathcal{H}=\bigl\{(-\infty,0),\,\{0\},\,(0,\infty)\bigr\},\quad A=[-u,u]

    for some u>0. Then, \beta is reasonable if x_{i} is informative only when \lvert x_{i}\rvert\leq u. Otherwise, if \lvert x_{i}\rvert>u, the only meaningful information provided by x_{i} is its sign.

Example 3.

(Pólya urns). Some Pólya urns are covered by Example 2. It follows that, for such urns, the sequence X of observed colors is exchangeable. To our knowledge, this fact was previously unknown.

As an example, consider sequential draws from an urn and denote by X_{n} the color of the ball extracted at time n\geq 1. At time n=0, the urn contains m_{j} balls of color j, where j\in\{1,\ldots,k\}. Define

S=\{1,\ldots,k\},\quad m=\sum_{j=1}^{k}m_{j}\quad\text{and}\quad\nu\{j\}=\frac{m_{j}}{m}

for each j\in S. The sampling scheme is as follows. Fix a partition \mathcal{H} of S and define

m_{j}^{*}=m\,\nu\bigl(\{j\}\mid H_{j}\bigr)=\frac{m\,m_{j}}{\sum_{i\in H_{j}}m_{i}}.

For each n\geq 1, one obtains X_{n}\in H for some unique H\in\mathcal{H}. In this case (i.e., if X_{n}\in H) the extracted ball is replaced together with m_{j}^{*} more balls of color j for each j\in H. In other terms, if the observed color belongs to H, each color in H is reinforced (and not only the observed color). In particular, after each draw, m new balls are added to the urn. Hence, denoting by \sigma the strategy of Example 2 with c=1, one obtains

P\bigl(X_{n+1}=j\mid(X_{1},\ldots,X_{n})=x\bigr)
=\frac{m_{j}+\sum_{i=1}^{n}1_{H_{j}}(x_{i})\,m_{j}^{*}}{m+m\,n}
=\frac{\nu\{j\}+\sum_{i=1}^{n}1_{H_{j}}(x_{i})\,\nu\bigl(\{j\}\mid H_{j}\bigr)}{1+n}
=\frac{c\,\nu\{j\}+\sum_{i=1}^{n}\nu\bigl(\{j\}\mid H_{x_{i}}\bigr)}{c+n}=\sigma_{n}(x)\{j\}.
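The reinforcement scheme of Example 3 is easy to simulate. Below is a small illustrative sketch (our own; the toy urn with three colors and the partition \{\{1,2\},\{3\}\} are assumptions), in which drawing a color in a block H reinforces every color j\in H by m_{j}^{*}.

import random

def generalized_polya_urn(n_draws, m, partition, seed=1):
    # m: dict color -> initial number of balls; partition: list of sets of colors.
    # After a draw in block H, each color j in H gains m_j* = m_tot * m[j] / sum_{i in H} m[i]
    # balls, so exactly m_tot new balls enter the urn at every step.
    rng = random.Random(seed)
    m_tot = sum(m.values())
    comp = dict(m)                              # current urn composition
    block_of = {j: H for H in partition for j in H}
    draws = []
    for _ in range(n_draws):
        u = rng.random() * sum(comp.values())
        acc = 0.0
        for j, w in comp.items():
            acc += w
            if u < acc:
                color = j
                break
        draws.append(color)
        H = block_of[color]
        s_H = sum(m[i] for i in H)
        for j in H:
            comp[j] += m_tot * m[j] / s_H       # reinforce every color in the block
    return draws

print(generalized_polya_urn(10, {1: 2, 2: 1, 3: 3}, [{1, 2}, {3}]))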

If \sigma is the strategy (2), in addition to exchangeability, X satisfies various other properties of classical Dirichlet sequences. We refer to [10] for details. Here, we just note that the prior \pi and the posterior \pi_{n} can be explicitly determined. In particular, up to replacing \delta with \alpha, Sethuraman’s representation of \pi (see [57]) is still true. Precisely, \pi is the probability distribution of a random probability measure \mu of the form

\mu(\cdot)=\sum_{j}V_{j}\,\alpha(\cdot\mid Z_{j})

where:

  • (Z_{j}) and (V_{j}) are independent sequences of random variables;

  • (Z_{j}) is i.i.d. with common distribution \nu;

  • V_{j}=U_{j}\,\prod_{i=1}^{j-1}(1-U_{i}) for all j\geq 1, where (U_{i}) is i.i.d. with common distribution beta(1,c). Namely, (V_{j}) has the stick breaking distribution with parameter c.
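For illustration, a truncated stick-breaking sketch of the random measure \mu above follows (our own sketch, not taken from [10]; the choices \nu=\mathcal{N}(0,1) and \alpha(\cdot\mid z)=\mathcal{N}(z,0.1) are arbitrary assumptions made only to have something concrete to run).

import random

def stick_breaking_mixture(c, nu_sampler, alpha_sampler, n_terms=200, seed=2):
    # Truncated version of mu(.) = sum_j V_j alpha(.|Z_j): Z_j i.i.d. from nu and
    # V_j = U_j * prod_{i<j}(1 - U_i) with U_i i.i.d. beta(1, c).
    rng = random.Random(seed)
    weights, atoms, remaining = [], [], 1.0
    for _ in range(n_terms):
        u = rng.betavariate(1, c)
        weights.append(remaining * u)
        atoms.append(nu_sampler(rng))
        remaining *= 1 - u
    def sample_from_mu():
        r, acc = rng.random() * sum(weights), 0.0   # renormalize to offset truncation
        for w, z in zip(weights, atoms):
            acc += w
            if r < acc:
                return alpha_sampler(z, rng)
        return alpha_sampler(atoms[-1], rng)
    return sample_from_mu

mu = stick_breaking_mixture(2.0, lambda r: r.gauss(0, 1), lambda z, r: r.gauss(z, 0.1))
print([round(mu(), 2) for _ in range(5)])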

3 Conditionally identically distributed data

A sequence Y=(Y_{1},Y_{2},\ldots) of random variables is conditionally identically distributed (c.i.d.) if Y_{2}\overset{d}{=}Y_{1} and

P\bigl(Y_{k}\in\cdot\mid Y_{1},\ldots,Y_{n}\bigr)=P\bigl(Y_{n+1}\in\cdot\mid Y_{1},\ldots,Y_{n}\bigr)\quad\text{a.s.}

for all k>n\geq 1. A c.i.d. sequence Y is identically distributed. It is also asymptotically exchangeable in the sense that, as n\rightarrow\infty, the probability distribution of the shifted sequence (Y_{n},Y_{n+1},\ldots) converges weakly to an exchangeable law. Moreover, as already stressed, Y is exchangeable if and only if it is stationary and c.i.d.

C.i.d. sequences have been introduced in [5, 47] and then investigated or applied in various papers; see e.g. [1, 2, 6, 7, 8, 9, 14, 15, 29, 30, 34].

There are reasons for taking c.i.d. data into account in Bayesian prediction. In fact, in a sense, c.i.d. sequences have been introduced with prediction in mind. If X is c.i.d., at each time n, the future observations (X_{k}:k>n) are identically distributed given the past, and this is reasonable in several prediction problems. Examples arise in clinical trials, generalized Pólya urns, species sampling models, survival analysis and disease surveillance; see [1, 2, 5, 8, 9, 14, 15, 29, 30, 35]. A further reason for assuming X c.i.d. is that the asymptotics is very close to that of exchangeable sequences. As a consequence, a meaningful part of the usual Bayesian machinery can be developed under the sole assumption that X is c.i.d.; see [29]. Finally, the strategies which make X c.i.d. can be easily characterized; see Theorem 15 in the Appendix. Hence, unlike the exchangeable case, P.A. can be easily implemented for c.i.d. data. A number of interesting strategies, which cannot be used if X is required to be exchangeable, become available if X is only asked to be c.i.d.; see e.g. [8, 9].

As a concrete example, fix a constant q\in(0,1) and define

\sigma_{n}(x)=q^{n}\nu+(1-q)\sum_{i=1}^{n}q^{n-i}\delta_{x_{i}} (3)

for all n\geq 0 and x\in S^{n}. Using \sigma to make predictions corresponds to exponential smoothing. It may be reasonable when the forecaster has only vague opinions on the dependence structure of the data, and yet she feels that the weight of the i-th observation x_{i} should be a decreasing function of n-i. In this case, X is not exchangeable, since \sigma_{n}(x) is not invariant under permutations of x, but it can be easily seen to be c.i.d.; see [8, Ex. 7].
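A minimal sketch of how one might sample from the exponential smoothing strategy (3) is reported below (our illustration; the choice \nu=\mathcal{N}(0,1) is an assumption). Note that the weights q^{n} and (1-q)q^{n-i}, i=1,\ldots,n, sum to one.

import random

def exp_smoothing_sample(x, q, nu_sampler, rng):
    # One draw from sigma_n(x) = q^n * nu + (1 - q) * sum_i q^(n-i) * delta_{x_i}.
    n = len(x)
    u = rng.random()
    if u < q ** n:
        return nu_sampler(rng)               # draw from the base measure nu
    u -= q ** n
    for i, xi in enumerate(x, start=1):
        w = (1 - q) * q ** (n - i)
        if u < w:
            return xi                        # repeat a past observation
        u -= w
    return x[-1]                             # numerical safety net

rng = random.Random(3)
x = []
for _ in range(10):                          # forward simulation of the c.i.d. sequence
    x.append(exp_smoothing_sample(x, q=0.7, nu_sampler=lambda r: r.gauss(0, 1), rng=rng))
print([round(v, 2) for v in x])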

In this section, following [8, 9], P.A. is applied to c.i.d. data. We first report some known strategies (Subsection 3.1) and then we introduce two new strategies which make X c.i.d. (Subsection 3.2).

3.1 Fast recursive update of predictive distributions

A possible condition for a strategy \sigma is

\sigma_{n+1}(x,y)\text{ is a function of }\sigma_{n}(x)\text{ and }y (4)

for all n\geq 0, x\in S^{n} and y\in S, where y denotes the (n+1)-th observation and

(x,y)=(x_{1},\ldots,x_{n},y).

Under (4), the predictive \sigma_{n+1}(x,y) is just a recursive update of the previous predictive \sigma_{n}(x) and the last observation y. Recursive properties of this type are useful in applications. They have a long history (see e.g. [51, 52, 59]) and have recently been investigated in [41].

For each n\geq 0, let q_{n}:S^{n}\rightarrow[0,1] be a measurable function (with q_{0} constant) and \alpha_{n} a kernel on (S,\mathcal{B}). Define a strategy \sigma through the recursive equations

\sigma_{0}=\nu\quad\quad\text{and} (5)
\sigma_{n+1}(x,y)=q_{n}(x)\,\sigma_{n}(x)+(1-q_{n}(x))\,\alpha_{n}(\cdot\mid y)

for all n\geq 0, x\in S^{n} and y\in S. Since \sigma_{n+1}(x,y) is a convex combination of the previous predictive \sigma_{n}(x) and the kernel \alpha_{n}(\cdot\mid y), which depends only on y, the strategy \sigma satisfies condition (4). The obvious interpretation is that, at time n+1, after observing (x,y), the next observation is drawn from \sigma_{n}(x) with probability q_{n}(x) and from \alpha_{n}(\cdot\mid y) with probability 1-q_{n}(x).
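The recursion (5) only rescales the existing mixture weights and appends one component, so each update is cheap. A minimal sketch of this bookkeeping follows (our illustration; predictives are represented as lists of weighted components, \alpha_{n}(\cdot\mid y)=\delta_{y} is an assumed choice, and q_{n}=(n+1)/(n+2) corresponds to c=1 in the Dirichlet-type weights used later).

def update_predictive(sigma_n, y, q_n, alpha_n):
    # sigma_{n+1}(x, y) = q_n * sigma_n(x) + (1 - q_n) * alpha_n(.|y); the predictive is a
    # list of (weight, component) pairs, and the weights keep summing to one.
    return [(q_n * w, comp) for (w, comp) in sigma_n] + [(1 - q_n, alpha_n(y))]

sigma = [(1.0, ("nu",))]                      # sigma_0 = nu, kept as an opaque tag
for n, y in enumerate([2, 5, 2]):
    sigma = update_predictive(sigma, y, q_n=(n + 1) / (n + 2), alpha_n=lambda v: ("delta", v))
print(sigma)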

An example of a strategy satisfying equation (5) is Newton’s algorithm [51, 52]. More precisely, Newton’s algorithm aims to estimate the latent distribution in a mixture model rather than to make predictions. However, if reinterpreted as a predictive rule, Newton’s algorithm corresponds to a strategy \sigma, and such a \sigma meets equation (5) for a suitable choice of q_{n} and \alpha_{n}; see e.g. [35, p. 1095]. Moreover, as shown in [35], \sigma makes X c.i.d.

The strategies satisfying equation (5) are investigated in [9]. Under such strategies, X is usually not exchangeable but it is c.i.d. under some conditions on the kernels \alpha_{n}. Precisely, X is c.i.d. if \alpha_{n} is the conditional distribution for \nu given \mathcal{G}_{n} for each n\geq 0, where

\mathcal{G}_{0}\subset\mathcal{G}_{1}\subset\mathcal{G}_{2}\subset\ldots\subset\mathcal{B}

is any filtration (i.e., any increasing sequence of sub-\sigma-fields of \mathcal{B}). This condition is trivially true if \alpha_{n}(\cdot\mid y)=\delta_{y} for all y\in S (just take \mathcal{G}_{n}=\mathcal{B} for all n\geq 0).

Example 4.

(Finer countable partitions). For each n\geq 0, let \mathcal{H}_{n} be a countable partition of S such that H\in\mathcal{B} and \nu(H)>0 for all H\in\mathcal{H}_{n}. Suppose that \mathcal{H}_{n+1} is finer than \mathcal{H}_{n} for all n\geq 0. Define \sigma through equation (5) with

\alpha_{n}(\cdot\mid y)=\sum_{H\in\mathcal{H}_{n}}1_{H}(y)\,\nu(\cdot\mid H)=\nu\bigl(\cdot\mid H_{y}^{n}\bigr)

where H_{y}^{n} denotes the only H\in\mathcal{H}_{n} such that y\in H. The kernel \alpha_{n} is the conditional distribution for \nu given \mathcal{G}_{n}, where \mathcal{G}_{n} is the \sigma-field generated by \mathcal{H}_{n}. Since \mathcal{H}_{n+1} is finer than \mathcal{H}_{n}, one obtains \mathcal{G}_{n}\subset\mathcal{G}_{n+1}. Hence, X is c.i.d. Note also that the \mathcal{H}_{n} could be chosen such that

\{y\}=\bigcap_{n}H_{y}^{n}\quad\quad\text{for all }y\in S.

In this case, as n\rightarrow\infty, the partitions \mathcal{H}_{n} shrink to the partition of S into singletons.

For instance, in Example 2, suppose the forecaster wants to replace the fixed partition \mathcal{H} with a sequence \mathcal{H}_{n} of finer partitions. This is possible at the price of having X c.i.d. instead of exchangeable. In fact, with q_{n}=\frac{n+c}{n+1+c}, one obtains

\sigma_{n}(x)=\frac{c\,\nu+\sum_{i=1}^{n}\alpha_{i-1}(\cdot\mid x_{i})}{n+c}=\frac{c\,\nu+\sum_{i=1}^{n}\nu\bigl(\cdot\mid H_{x_{i}}^{i-1}\bigr)}{n+c}.

Similarly, to decrease the impact of the observed data while preserving the c.i.d. condition, the strategy (3) could be modified as

\sigma_{n}(x)=q^{n}\nu+(1-q)\sum_{i=1}^{n}q^{n-i}\nu\bigl(\cdot\mid H_{x_{i}}^{i-1}\bigr).

We next turn to a strategy introduced in [41]. Once again, under this strategy, the data are c.i.d. but not necessarily exchangeable.

Example 5.

(Hahn, Martin and Walker; Copulas). In this example, S=\mathbb{R} and “density function” means “density function with respect to Lebesgue measure”. A bivariate copula is a distribution function on \mathbb{R}^{2} whose marginals are uniform on (0,1). The density function of a bivariate copula, provided it exists, is said to be a copula density.

In [41], in order to realize condition (4), the following updating rule is introduced. Fix a density f_{0} and a sequence c_{1},c_{2},\ldots of bivariate copula densities. For the sake of simplicity, we assume f_{0}>0 and c_{n}>0 for all n\geq 1. For n=0, define \sigma_{0}(dz)=f_{0}(z)\,dz and call F_{0} the distribution function corresponding to \sigma_{0}. Then, for each y\in\mathbb{R}, define

\sigma_{1}(y,\,dz)=f_{1}(z\mid y)\,dz\quad\quad\text{where}
f_{1}(z\mid y)=c_{1}\bigl\{F_{0}(z),\,F_{0}(y)\bigr\}\,f_{0}(z).

In general, for each n\geq 0 and x\in\mathbb{R}^{n}, suppose \sigma_{n}(x) has been defined and denote by f_{n}(\cdot\mid x) and F_{n}(\cdot\mid x) the density and the distribution function of \sigma_{n}(x). Then, for all y\in\mathbb{R}, one can define

\sigma_{n+1}(x,y,\,dz)=f_{n+1}(z\mid x,y)\,dz\quad\text{where} (6)
f_{n+1}(z\mid x,y)=c_{n+1}\bigl\{F_{n}(z\mid x),\,F_{n}(y\mid x)\bigr\}\,f_{n}(z\mid x).

Equation (6) defines a strategy \sigma dominated by Lebesgue measure.
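To make the update (6) concrete, here is a small grid-based sketch (our own illustration; the paper does not prescribe a particular c_{n}). It assumes a Gaussian copula density with correlation 0.3 and a standard normal f_{0}, and renormalizes only to absorb discretization error.

import numpy as np
from scipy.stats import norm

def gaussian_copula_density(u, v, rho):
    # Density of the bivariate Gaussian copula evaluated at (u, v).
    a, b = norm.ppf(u), norm.ppf(v)
    return np.exp(-(rho**2 * (a**2 + b**2) - 2 * rho * a * b) / (2 * (1 - rho**2))) / np.sqrt(1 - rho**2)

z = np.linspace(-6, 6, 2001)
dz = z[1] - z[0]
f = norm.pdf(z)                                        # f_0
for y in [1.0, 0.5, 1.5]:                              # observed data points
    F = np.clip(np.cumsum(f) * dz, 1e-12, 1 - 1e-12)   # F_n on the grid
    F_y = float(np.interp(y, z, F))
    f = gaussian_copula_density(F, F_y, rho=0.3) * f   # update (6)
    f /= f.sum() * dz                                  # correct the grid error only
print("predictive mean after 3 observations:", round(float((z * f).sum() * dz), 3))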

In [41] (but not here) the c_{n} are also required to be symmetric. Furthermore, in [41], equation (6) is not necessarily viewed as a method for obtaining a strategy but is deduced as a consequence of exchangeability. From our point of view, instead, equation (6) defines a strategy \sigma which we call HMW’s strategy.

Under HMW’s strategy, X is not necessarily exchangeable, even if the c_{n} are symmetric and c_{n}\rightarrow 1 (in some sense) as n\rightarrow\infty. To see this, recall that X is i.i.d. if and only if it is exchangeable and X_{1} is independent of X_{2}. In turn, X_{1} is independent of X_{2} if c_{1} is the independence copula density (i.e., c_{1}(u,v)=1 for all (u,v)\in[0,1]^{2}). Therefore, X fails to be exchangeable whenever c_{1} is the independence copula density and c_{2}\neq c_{1}. However, as noted in [29], X turns out to be c.i.d.

Theorem 6.

If \sigma is HMW’s strategy, then X is c.i.d.

A proof of Theorem 6 is provided in the Appendix. We note that, for Theorem 6 to hold, the positivity assumption on f_{0} and c_{n} may be dropped and the c_{n} can be taken to be conditional copula densities; see Remark 16.

3.2 Further examples

In the next example, the data are exchangeable until a stopping time T and then go on so as to form a c.i.d. sequence. The time T should be regarded as the first time when something meaningful happens, possibly something modifying the nature of the observed phenomenon. Although seemingly involved, the example could find some applications. For instance, it could be used to model censored survival times, with T-1 the first time when a given number of survival times is observed.

Example 7.

(Change points). A predictable stopping time is a function T on S^{\infty}, with values in \{2,3,\ldots,\infty\}, satisfying

\bigl\{T=n+1\bigr\}=\bigl\{(X_{1},\ldots,X_{n})\in A_{n}\bigr\} (7)

for some set A_{n}\in\mathcal{B}^{n}. Basically, condition (7) means that the event \{T=n+1\} depends only on (X_{1},\ldots,X_{n}). Similarly, \{T\leq n+1\}=\bigcup_{j=2}^{n+1}\{T=j\} depends only on (X_{1},\ldots,X_{n}). Therefore, for all x\in S^{n} and y\in S, the indicators of \{T\leq n+1\} and \{T>n+1\} depend on x but not on y.

Fix a predictable stopping time T and a strategy \beta=(\beta_{0},\beta_{1},\ldots) which makes X exchangeable. Moreover, as in Subsection 3.1, fix the measurable functions q_{n}:S^{n}\rightarrow[0,1]. Then, define \sigma_{0}=\beta_{0}, \sigma_{1}=\beta_{1}, and

\sigma_{n+1}(x,y)=1_{\{T>n+1\}}(x)\,\beta_{n+1}(x,y)\,+\,1_{\{T\leq n+1\}}(x)\,\Bigl\{q_{n}(x)\,\sigma_{n}(x)\,+\,(1-q_{n}(x))\,\delta_{y}\Bigr\}

for all n\geq 1, x\in S^{n} and y\in S. In the Appendix, it is shown that:

Theorem 8.

The above strategy \sigma makes X c.i.d. Moreover, if

A_{n}\text{ is invariant under permutations of }S^{n}\text{ for all }n\geq 1,

where A_{n} is the set involved in condition (7), then (X_{1},\ldots,X_{n}) is exchangeable conditionally on T>n. Precisely,

P_{\sigma}\Bigl(\phi(X_{1},\ldots,X_{n})\in\cdot\mid T>n\Bigr)=P_{\sigma}\Bigl((X_{1},\ldots,X_{n})\in\cdot\mid T>n\Bigr)

for all n such that P_{\sigma}(T>n)>0 and all permutations \phi of S^{n}.

Theorem 8 is still valid if \sigma is defined differently at the times subsequent to T. For instance, given a countable partition \mathcal{H} of S, the conclusions of Theorem 8 are true even if

\sigma_{n+1}(x,y)=q_{n}(x)\,\sigma_{n}(x)+(1-q_{n}(x))\,\sigma_{n}(x,\,\cdot\mid H_{y})

for all x\in S^{n} and y\in S such that T\leq n+1 and \sigma_{n}(x,\,H_{y})>0. Here, \sigma_{n}(x,\,\cdot\mid H_{y}) denotes the probability measure

\sigma_{n}(x,\,A\mid H_{y})=\frac{\sigma_{n}(x,\,A\cap H_{y})}{\sigma_{n}(x,\,H_{y})}\quad\quad\text{for all }A\in\mathcal{B}.

Censored survival times are a possible application of \sigma. Suppose that S=\{0,1\}\times(0,\infty) and the i-th observation is a pair x_{i}=(j_{i},t_{i}), where t_{i} is the survival time of item i, or the time when item i leaves the trial, according to whether j_{i}=1 or j_{i}=0. In this framework, T-1 could be the first time when a fixed number k of survival times is observed, namely,

T=1+\inf\bigl\{n:\sum_{i=1}^{n}j_{i}=k\bigr\}

with the usual convention \inf\emptyset=\infty. Finally, the strategy \beta could be as in Subsection 2.2. In fact, classical Dirichlet sequences are a quite popular model for describing censored survival times, but they have the drawback of ties. This drawback may be overcome if \beta is of the form

\beta_{n}(x)=\frac{c\,\nu+\sum_{i=1}^{n}\alpha_{x_{i}}}{n+c},

where the kernel \alpha satisfies the conditions of Subsection 2.2 and \nu and \alpha_{x} are nonatomic for all x\in S.

So far, the n-th predictive distribution has been meant as the conditional distribution of X_{n+1} given (X_{1},\ldots,X_{n}). But the information available at time n is often strictly larger than (X_{1},\ldots,X_{n}). To model this situation, we suppose that we observe the sequence

Y=(X_{1},Z_{1},X_{2},Z_{2},\ldots)

where Z=(Z_{1},Z_{2},\ldots) is any sequence of random variables. The Z_{n} can be regarded as covariates. At each time n, the forecaster aims to predict X_{n+1} based on (X_{1},Z_{1},\ldots,X_{n},Z_{n}). She is not interested in Z_{n+1} as such, but Z_{1},\ldots,Z_{n} cannot be neglected since they are informative on X_{n+1}. Moreover, she wants X to be c.i.d. and Z as unconstrained as possible. One solution could be a strategy which makes Y c.i.d. However, if Y is c.i.d., both X and Z are marginally c.i.d., and having Z c.i.d. may be unwelcome. In the next example, X is c.i.d. but Z is not. In addition, X satisfies a condition stronger than the c.i.d. one, that is, X_{2}\overset{d}{=}X_{1} and

P\bigl(X_{k}\in\cdot\mid X_{1},Z_{1},\ldots,X_{n},Z_{n}\bigr)=P\bigl(X_{n+1}\in\cdot\mid X_{1},Z_{1},\ldots,X_{n},Z_{n}\bigr) (8)

a.s. for all k>n\geq 1; see [5].

Example 9.

(Covariates). Let S=\mathbb{R}^{2} and let

0=b_{0}<b_{1}<b_{2}<\ldots,\quad\sup_{n}b_{n}\leq 1,

be a bounded strictly increasing sequence of real numbers. Take \sigma_{0} as the probability distribution of (U+V,V) where

U\text{ independent of }V,\quad U\overset{d}{=}\mathcal{N}(0,b_{1}),\quad V\overset{d}{=}\mathcal{N}(0,1-b_{1}).

Similarly, for each n\geq 1 and

y=(y_{1},\ldots,y_{n})=(x_{1},z_{1},\ldots,x_{n},z_{n}),

take \sigma_{n}(y) as the probability distribution of (U_{n}(y)+V_{n}(y),\,V_{n}(y)) where

U_{n}(y)\text{ independent of }V_{n}(y),\quad U_{n}(y)\overset{d}{=}\mathcal{N}\bigl(x_{n}-z_{n},\,b_{n+1}-b_{n}\bigr),\quad V_{n}(y)\overset{d}{=}\mathcal{N}(0,1-b_{n+1}).

Then, Z is not c.i.d. while X satisfies condition (8). Furthermore, arguing as in [9, Sect. 4], the normal distribution could be replaced by any symmetric stable law.

To see that Z is not c.i.d., just note that Z fails to be identically distributed. To prove condition (8), take a collection \{T_{n},W_{n}:n\geq 1\} of independent standard normal random variables and define the sequence

Y^{*}=(X_{1}^{*},Z_{1}^{*},X_{2}^{*},Z_{2}^{*},\ldots),

where Z_{n}^{*}=\sqrt{1-b_{n}}\,W_{n} and

X_{n}^{*}=\sum_{j=1}^{n}\sqrt{b_{j}-b_{j-1}}\,T_{j}+Z_{n}^{*}.

It is not hard to verify that Y^{*}\overset{d}{=}Y. Hence, it suffices to prove (8) with Y^{*} in the place of Y, and this can be done as in [5, Ex. 1.2]. We omit the explicit calculations.
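As a quick sanity check (our own, not in the paper), the representation Y^{*} above is easy to simulate; in particular each X_{n}^{*} should be standard normal, since its variance is \sum_{j\leq n}(b_{j}-b_{j-1})+(1-b_{n})=1.

import random

def simulate_X_star(n, b, rng):
    # X*_n = sum_{j<=n} sqrt(b_j - b_{j-1}) T_j + sqrt(1 - b_n) W_n, with T_j, W_n i.i.d. N(0,1).
    s = sum((b[j] - b[j - 1]) ** 0.5 * rng.gauss(0, 1) for j in range(1, n + 1))
    return s + (1 - b[n]) ** 0.5 * rng.gauss(0, 1)

rng = random.Random(5)
b = [0.0] + [1 - 2 ** -k for k in range(1, 11)]          # 0 = b_0 < b_1 < ... <= 1
draws = [simulate_X_star(10, b, rng) for _ in range(20000)]
print(round(sum(v * v for v in draws) / len(draws), 3))  # empirical variance, close to 1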

4 Stationary data

A sequence Y=(Y_{1},Y_{2},\ldots) of random variables is stationary if

(Y_{2},\ldots,Y_{n+1})\overset{d}{=}(Y_{1},\ldots,Y_{n})\quad\quad\text{for all }n\geq 1.

In the non-Bayesian approaches to prediction, stationarity is a classical assumption. In a Bayesian framework, instead, stationarity seems to be less popular. In particular, to our knowledge, there is no systematic treatment of P.A. for stationary data. This section aims to fill this gap and begins an investigation of P.A. when X is required to be stationary. It is just a preliminary step and much more work remains to be done.

After some general remarks on Problem (*), two large classes of stationary sequences will be introduced. Incidentally, these two classes may look unusual to a Bayesian forecaster. We do not know whether this is true, but we recall that P.A. is consistent with any probability distribution for X. Hence, in a Bayesian framework, using data coming from such classes is certainly admissible.

If XX is required to be stationary, for P.A. to apply, the strategies which make XX stationary should be characterized. Hence, one comes across Problem (*) with 𝒞\mathcal{C} the class of stationary probability measures on (S,)(S^{\infty},\mathcal{B}^{\infty}). This version of Problem (*) is quite hard and we are not aware of any general solution; see e.g. [12, 50] and references therein. Fortunately, however, Problem (*) is simple (or even trivial) in a few special cases. As an example, a strategy σ\sigma makes XX a stationary (first order) Markov chain if and only if

σ1(x,)σ0(dx)=σ0()andσn(x)=σ1(xn)\displaystyle\int\sigma_{1}(x,\,\cdot)\,\sigma_{0}(dx)=\sigma_{0}(\cdot)\quad\text{and}\quad\sigma_{n}(x)=\sigma_{1}(x_{n})

for all n1n\geq 1 and PσP_{\sigma}-almost all xSnx\in S^{n}. Though obvious, this fact has a useful practical consequence: if the data are required to be stationary and Markov, applying P.A. to make Bayesian predictions is straightforward.
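On a finite SS, the first requirement is the classical invariance equation of Markov chain theory (with σ1\sigma_1 a transition matrix PP and σ0\sigma_0 a row vector π\pi), while the second just says that the same kernel is used at every step. A minimal numerical sketch in Python, with an assumed 3-state transition matrix:

import numpy as np

# sigma_1 as a transition matrix P on S = {0, 1, 2}; the displayed condition
# reduces to the invariance equation pi P = pi for sigma_0 = pi.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# pi = left eigenvector of P for the eigenvalue 1, normalized to sum to 1.
eigval, eigvec = np.linalg.eig(P.T)
pi = np.real(eigvec[:, np.isclose(eigval, 1.0)][:, 0])
pi = pi / pi.sum()

print(np.allclose(pi @ P, pi))   # True: with sigma_0 = pi, X is a stationary Markov chain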

Another remark is that, unlike the exchangeable case, a finite-dimensional stationary random vector can always be extended to an (infinite) stationary sequence. To formalize this fact, we first recall that the probability distribution of the random vector (X1,,Xn)(X_{1},\ldots,X_{n}) is completely determined by σ0,σ1,,σn1\sigma_{0},\sigma_{1},\ldots,\sigma_{n-1}.

Lemma 10.

Fix n1n\geq 1, select σ0,σ1,,σn1\sigma_{0},\sigma_{1},\ldots,\sigma_{n-1} and define

σj(u,x)=σn1(x)\displaystyle\sigma_{j}(u,x)=\sigma_{n-1}(x)

for all j>n1j>n-1, uSjn+1u\in S^{j-n+1} and xSn1x\in S^{n-1}. Then, XX is stationary provided (X2,,Xn)=𝑑(X1,,Xn1)(X_{2},\ldots,X_{n})\overset{d}{=}(X_{1},\ldots,X_{n-1}).

Lemma 10 is probably well known, but again we do not know of any explicit reference. Anyway, the proof is straightforward. It suffices to note that, under the strategy of Lemma 10, Xj+1X_{j+1} is conditionally independent of (X1,,Xjn+1)(X_{1},\ldots,X_{j-n+1}) given (Xjn+2,,Xj)(X_{j-n+2},\ldots,X_{j}).

A last remark is that Problem (*) admits an obvious solution for dominated strategies. In this case, incidentally, Problem (*) can be easily solved even for exchangeable data.

Theorem 11.

Let λ\lambda be a σ\sigma-finite measure on (S,)(S,\mathcal{B}) and σ\sigma a strategy dominated by λ\lambda, say

σ0(dy)=f0(y)λ(dy)andσn(x,dy)=fn(yx)λ(dy)\displaystyle\sigma_{0}(dy)=f_{0}(y)\,\lambda(dy)\quad\text{and}\quad\sigma_{n}(x,\,dy)=f_{n}(y\mid x)\,\lambda(dy)

for all n1n\geq 1 and xSnx\in S^{n}. Define

gn(x)=f0(x1)f1(x2x1)fn1(xnx1,,xn1)\displaystyle g_{n}(x)=f_{0}(x_{1})\,f_{1}(x_{2}\mid x_{1})\ldots f_{n-1}(x_{n}\mid x_{1},\ldots,x_{n-1})

for all n1n\geq 1 and xSnx\in S^{n}. Then,

  • PσP_{\sigma} is stationary if and only if

    gn(x)=gn+1(u,x)λ(du)\displaystyle g_{n}(x)=\int g_{n+1}(u,x)\,\lambda(du)

    for all n1n\geq 1 and PσP_{\sigma}-almost all xSnx\in S^{n}.

  • PσP_{\sigma} is exchangeable if and only if

    gn(ϕ(x))=gn(x)\displaystyle g_{n}(\phi(x))=g_{n}(x)

    for all n2n\geq 2, all permutations ϕ\phi of SnS^{n} and PσP_{\sigma}-almost all xSnx\in S^{n}.

The proof of Theorem 11 is given in the Appendix.
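As a numerical illustration of the first condition, consider the Gaussian AR(1) strategy of Example 12 below (an assumed instance, with f(x) = c x, mu = N(0, b), nu = N(0, b/(1-c^2)) and lambda the Lebesgue measure). For n = 1, the condition reads f_0(x) = ∫ f_0(u) f_1(x | u) du, which can be checked by quadrature:

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Check of the stationarity condition of Theorem 11 (case n = 1) for a Gaussian AR(1)
# strategy: f_0 is the N(0, b/(1-c^2)) density and f_1(x2 | x1) the N(c*x1, b) density.
b, c = 1.0, 0.7
f0 = lambda u: norm.pdf(u, loc=0.0, scale=np.sqrt(b / (1.0 - c ** 2)))
f1 = lambda x2, x1: norm.pdf(x2, loc=c * x1, scale=np.sqrt(b))

# g_1(x) = f_0(x) and g_2(u, x) = f_0(u) f_1(x | u); stationarity requires
# g_1(x) = \int g_2(u, x) du for almost all x.
for x in (-2.0, 0.0, 1.5):
    lhs = f0(x)
    rhs, _ = quad(lambda u: f0(u) * f1(x, u), -np.inf, np.inf)
    print(x, lhs, rhs)   # lhs and rhs agree up to numerical error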

We finally give two examples. In both, XX is a stationary Markov sequence, possibly of order greater than 1.

Example 12.

(Generalized autoregressive sequences). Let S=S=\mathbb{R}. Fix a probability measure μ\mu on \mathcal{B} and a measurable function f:f:\mathbb{R}\rightarrow\mathbb{R}. Define

σ1(x,A)=P(f(x)+UA)for all x and A,\displaystyle\sigma_{1}(x,A)=P\bigl{(}f(x)+U\in A)\quad\quad\text{for all }x\in\mathbb{R}\text{ and }A\in\mathcal{B},

where UU is a real random variable such that U=𝑑μU\overset{d}{=}\mu. Suppose now that

σ1(x,A)ν(dx)=ν(A),A,\displaystyle\int\sigma_{1}(x,A)\,\nu(dx)=\nu(A),\quad\quad A\in\mathcal{B}, (9)

for some probability measure ν\nu on \mathcal{B}. Then, XX is a stationary Markov chain provided

σ0=νandσn(x)=σ1(xn)for all n2 and xn.\displaystyle\sigma_{0}=\nu\quad\text{and}\quad\sigma_{n}(x)=\sigma_{1}(x_{n})\quad\text{for all }n\geq 2\text{ and }x\in\mathbb{R}^{n}.

Note that Y=𝑑PσY\overset{d}{=}P_{\sigma} for any sequence Y=(Y1,Y2,)Y=(Y_{1},Y_{2},\ldots) such that

Y1=𝑑νandYn=f(Yn1)+Un for n2,\displaystyle Y_{1}\overset{d}{=}\nu\quad\text{and}\quad Y_{n}=f(Y_{n-1})+U_{n}\text{ for }n\geq 2,

where (Un:n2)(U_{n}:n\geq 2) is i.i.d., independent of Y1Y_{1}, and U2=𝑑μU_{2}\overset{d}{=}\mu. Thus, μ\mu can be regarded as the distribution of the “errors” UnU_{n} and ν\nu as the marginal distribution of the observations YnY_{n}. For instance, the usual Gaussian (first order) autoregressive processes correspond to f(x)=cxf(x)=c\,x, μ=𝒩(0,b)\mu=\mathcal{N}(0,b) and ν=𝒩(0,b/(1c2))\nu=\mathcal{N}(0,\,b/(1-c^{2})), where c(1,1)c\in(-1,1) and b>0b>0 are constants.
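A quick simulation sketch (with the assumed values b = 1 and c = 0.8) confirms that, starting from Y_1 distributed as nu, the marginal law of Y_n does not change along the sequence:

import numpy as np

# Gaussian AR(1): f(x) = c*x, mu = N(0, b), nu = N(0, b/(1 - c^2)).
rng = np.random.default_rng(1)
b, c, n_steps, n_paths = 1.0, 0.8, 50, 200_000

y = rng.normal(0.0, np.sqrt(b / (1 - c ** 2)), size=n_paths)   # Y_1 ~ nu
for _ in range(n_steps):
    y = c * y + rng.normal(0.0, np.sqrt(b), size=n_paths)      # Y_n = f(Y_{n-1}) + U_n

print(y.mean(), y.var())   # close to 0 and b/(1 - c^2) = 2.78: the marginal law is still nu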

To make the above argument concrete, the following problem is to be solved: For fixed ff and μ\mu, give conditions for the existence of ν\nu satisfying equation (9). More importantly, give an explicit formula for ν\nu provided it exists. We next focus on this problem in the (meaningful) special case where μ\mu is a symmetric stable law.

Let γ(0,2]\gamma\in(0,2] be a constant and ZZ a real random variable with characteristic function

E{exp(itZ)}=exp(|t|γ2)for all t.\displaystyle E\bigl{\{}\exp(i\,t\,Z)\bigr{\}}=\exp\Bigl{(}-\frac{\lvert t\rvert^{\gamma}}{2}\Bigr{)}\quad\quad\text{for all }t\in\mathbb{R}.

(The exponent γ\gamma is usually denoted by α\alpha, but this notation cannot be adopted in this paper since α\alpha denotes a kernel). For aa\in\mathbb{R} and b>0b>0, denote by 𝒮(a,b)\mathcal{S}(a,b) the probability distribution of a+b1/γZa+b^{1/\gamma}Z, namely

𝒮(a,b;A)=P(a+b1/γZA)for all A.\displaystyle\mathcal{S}(a,b;\,A)=P\bigl{(}a+b^{1/\gamma}Z\in A)\quad\quad\text{for all }A\in\mathcal{B}.

The probability measure 𝒮(a,b)\mathcal{S}(a,b) is said to be a symmetric stable law with exponent γ\gamma. Note that 𝒮(a,b)=𝒩(a,b)\mathcal{S}(a,b)=\mathcal{N}(a,b) if γ=2\gamma=2 and 𝒮(a,b)=𝒞(a,b)\mathcal{S}(a,b)=\mathcal{C}(a,b) if γ=1\gamma=1, where 𝒞(a,b)\mathcal{C}(a,b) is the Cauchy distribution with density f(x)=2bπ1b2+4(xa)2f(x)=\frac{2\,b}{\pi}\,\frac{1}{b^{2}+4\,(x-a)^{2}} (the standard Cauchy distribution corresponds to a=0a=0 and b=2b=2).

Theorem 13.

Let c(1,1)c\in(-1,1) be a constant. If μ=𝒮(a,b)\mu=\mathcal{S}(a,b) and f(x)=a+cxf(x)=-a+c\,x, then equation (9) is satisfied by

ν=𝒮(0,b1|c|γ).\displaystyle\nu=\mathcal{S}\left(0,\,\frac{b}{1-\lvert c\rvert^{\gamma}}\right).

By Theorem 13, which is proved in the Appendix, one obtains (first order) stationary autoregressive processes with any symmetric stable marginal distribution.
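For instance, in the Cauchy case γ = 1, Theorem 13 can be checked by a short Monte Carlo sketch (with the assumed values a = 1, b = 2 and c = -0.6). Recall that, in the above notation, S(a, b) is the Cauchy law with location a and scale b/2; since Cauchy laws have no finite moments, quartiles are compared instead of variances:

import numpy as np
from scipy import stats

# Take mu = S(a, b), f(x) = -a + c*x and nu = S(0, b/(1 - |c|)); the marginal
# law of Y_n should remain equal to nu.
rng = np.random.default_rng(2)
a, b, c, n_steps, n_paths = 1.0, 2.0, -0.6, 30, 200_000

y = stats.cauchy.rvs(loc=0.0, scale=(b / (1 - abs(c))) / 2, size=n_paths, random_state=rng)
for _ in range(n_steps):
    u = stats.cauchy.rvs(loc=a, scale=b / 2, size=n_paths, random_state=rng)   # U_n ~ mu
    y = -a + c * y + u                                                          # Y_n = f(Y_{n-1}) + U_n

print(np.quantile(y, [0.25, 0.5, 0.75]))                                        # empirical quartiles
print(stats.cauchy.ppf([0.25, 0.5, 0.75], loc=0.0, scale=(b / (1 - abs(c))) / 2))  # quartiles of nu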

Example 14.

(Markov sequences of arbitrary order). Let λ\lambda be a σ\sigma-finite measure on (S,)(S,\mathcal{B}). Fix n2n\geq 2 and a measurable function hh on SnS^{n} such that h>0h>0 and h𝑑λn=1\int h\,d\lambda^{n}=1. Given hh, define a further function gg via cyclic permutations of hh, namely

g(x)=1n{h(x1,,xn)+h(x2,,xn,x1)+\displaystyle g(x)=\frac{1}{n}\,\bigl{\{}h(x_{1},\ldots,x_{n})+h(x_{2},\ldots,x_{n},x_{1})+
++h(xn,x1,,xn1)}\displaystyle+\ldots+h(x_{n},x_{1},\ldots,x_{n-1})\bigr{\}}

for all xSnx\in S^{n}. Such a gg is still a density with respect to λn\lambda^{n} (since g𝑑λn=1\int g\,d\lambda^{n}=1) and satisfies

g(x,y)=g(y,x)for all xSn1 and yS.\displaystyle g(x,y)=g(y,x)\quad\text{for all }x\in S^{n-1}\text{ and }y\in S. (10)

Next, define

f0(x)=g(x,v)λn1(dv)for all xS,\displaystyle f_{0}(x)=\int g(x,v)\,\lambda^{n-1}(dv)\quad\quad\text{for all }x\in S,
fn1(xnx1,,xn1)=g(x)g(x1,,xn1,v)λ(dv)\displaystyle f_{n-1}(x_{n}\mid x_{1},\ldots,x_{n-1})=\frac{g(x)}{\int g(x_{1},\ldots,x_{n-1},v)\,\lambda(dv)}

for all xSnx\in S^{n}, and

fj1(xjx1,,xj1)=g(x,v)λnj(dv)g(x1,,xj1,v)λnj+1(dv)f_{j-1}(x_{j}\mid x_{1},\ldots,x_{j-1})=\frac{\int g(x,v)\,\lambda^{n-j}(dv)}{\int g(x_{1},\ldots,x_{j-1},v)\,\lambda^{n-j+1}(dv)}

for all 2jn12\leq j\leq n-1 and xSjx\in S^{j}. Finally, define a strategy σ\sigma dominated by λ\lambda as

σ0(dz)=f0(z)λ(dz),\displaystyle\sigma_{0}(dz)=f_{0}(z)\,\lambda(dz),
σj(x,dz)=fj(zx)λ(dz)\displaystyle\sigma_{j}(x,\,dz)=f_{j}(z\mid x)\,\lambda(dz)

if 1jn11\leq j\leq n-1 and xSjx\in S^{j}, and

σj(u,x)=σn1(x)\displaystyle\sigma_{j}(u,x)=\sigma_{n-1}(x)

if j>n1j>n-1, uSjn+1u\in S^{j-n+1} and xSn1x\in S^{n-1}. Under σ\sigma, a density of (X1,,Xn)(X_{1},\ldots,X_{n}) is given by gg. By equation (10),

g(v,x)λ(dv)=g(x,v)λ(dv)for all xSn1\displaystyle\int g(v,x)\,\lambda(dv)=\int g(x,v)\,\lambda(dv)\quad\quad\text{for all }x\in S^{n-1}

and this in turn implies

(X2,,Xn)=𝑑(X1,,Xn1).\displaystyle(X_{2},\ldots,X_{n})\overset{d}{=}(X_{1},\ldots,X_{n-1}).

Therefore, XX is stationary because of Lemma 10. Note also that XX is a Markov sequence of order n1n-1.
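A discrete sketch may help to visualize the construction. Take S = {0, 1}, λ the counting measure, n = 3 and a randomly generated h (an assumed toy case); then g is the average of h over cyclic shifts, property (10) holds, and (X_2, X_3) has the same law as (X_1, X_2):

import numpy as np

# h = a strictly positive density on S^3 with respect to counting measure.
rng = np.random.default_rng(3)
h = rng.random((2, 2, 2)) + 0.1
h = h / h.sum()

# g = average of h over the cyclic shifts of the coordinates.
g = (h + np.transpose(h, (1, 2, 0)) + np.transpose(h, (2, 0, 1))) / 3.0

# Property (10): g(x1, x2, y) = g(y, x1, x2).
print(np.allclose(g, np.transpose(g, (1, 2, 0))))    # True

# Hence (X_2, X_3) and (X_1, X_2) have the same law, so X is stationary by Lemma 10.
law_12 = g.sum(axis=2)                               # law of (X_1, X_2)
law_23 = g.sum(axis=0)                               # law of (X_2, X_3)
print(np.allclose(law_12, law_23))                   # True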

5 Concluding remarks and open problems

When prediction is the main target, P.A. has some advantages with respect to I.A. This is only our opinion, obviously, and we have tried to support it throughout this paper. Even if one agrees, however, further work is needed to make P.A. a concrete tool. We close this paper with a brief list of open problems and possible hints for future research.

  • In various applications, the available information strictly includes the past observations on the variable to be predicted. For instance, as in Example 9, suppose one aims to predict Xn+1X_{n+1} based on (X1,Z1,,Xn,Zn)(X_{1},Z_{1},\ldots,X_{n},Z_{n}) where Z1,,ZnZ_{1},\ldots,Z_{n} are any random elements. Suppose also that Z1,,ZnZ_{1},\ldots,Z_{n} cannot be neglected, since they are informative about Xn+1X_{n+1}. In this case, one needs the conditional distribution of Xn+1X_{n+1} given (X1,Z1,,Xn,Zn)(X_{1},Z_{1},\ldots,X_{n},Z_{n}). Situations of this type are practically meaningful and should be investigated further.

  • Section 4 should be expanded. It would be nice to have a general solution of Problem (*) for both the stationary and the stationary-ergodic cases. Further examples of stationary sequences (possibly, non-Markovian) would be welcome as well.

  • Obviously, P.A. could be investigated under other distributional assumptions, in addition to exchangeability, stationarity and conditional identity in distribution. In particular, partial exchangeability should be taken into account.

  • A question, related to Example 5, is: under what conditions is XX exchangeable when σ\sigma is HMW’s strategy?

  • While probably hard, the problem raised in Example 12 looks intriguing. In Theorem 13, such a problem has been addressed when μ\mu is a symmetric stable law and ff has a special form. What happens if μ\mu and ff are arbitrary?

  • In the case of I.A., the empirical Bayes point of view (where the prior is allowed to depend on the data) may be problematic. In the case of P.A., instead, this point of view is certainly admissible. In fact, suppose a strategy σ\sigma depends on some unknown constants, and an empirical Bayes forecaster decides to estimate these constants based on the available data. Acting in this way, she is merely replacing one strategy with another. Instead of σ\sigma, she is working with σ^\hat{\sigma}, where σ^\hat{\sigma} is the strategy obtained from σ\sigma by estimating the unknown constants. This empirical form of P.A. looks reasonable and could be investigated.

Appendix

This appendix contains the proofs of some claims scattered throughout the text. We will need the following characterization of c.i.d. sequences in terms of strategies.

Theorem 15.

(Theorem 3.1 of [6]). Let σ\sigma be a strategy. Then, PσP_{\sigma} is c.i.d. if and only if

σn(x,A)=σn+1(x,y,A)σn(x,dy)\displaystyle\sigma_{n}(x,A)=\int\sigma_{n+1}(x,y,\,A)\,\sigma_{n}(x,\,dy) (11)

for all n0n\geq 0, all AA\in\mathcal{B} and PσP_{\sigma}-almost all xSnx\in S^{n}.

Proof of Theorem 6.

In this proof, “density function” stands for “density function with respect to Lebesgue measure”. We first recall a well known fact.

Let CC be a bivariate copula and F1F_{1}, F2F_{2} distribution functions on \mathbb{R}. Suppose that CC, F1F_{1} and F2F_{2} all have densities, say cc, f1f_{1} and f2f_{2}, respectively. Then,

F(x,y)=C{F1(x),F2(y)}\displaystyle F(x,y)=C\bigl{\{}F_{1}(x),F_{2}(y)\bigr{\}}

is a distribution function on 2\mathbb{R}^{2} and

f(x,y)=c{F1(x),F2(y)}f1(x)f2(y)\displaystyle f(x,y)=c\bigl{\{}F_{1}(x),F_{2}(y)\bigr{\}}\,f_{1}(x)\,f_{2}(y)

is a density of FF. Therefore, for all yy\in\mathbb{R} with f2(y)>0f_{2}(y)>0, one obtains

c{F1(x),F2(y)}f1(x)𝑑x=f(x,y)f2(y)𝑑x=1.\displaystyle\int c\bigl{\{}F_{1}(x),F_{2}(y)\bigr{\}}\,f_{1}(x)\,dx=\int\frac{f(x,y)}{f_{2}(y)}\,dx=1.

We next show that equation (6) actually defines a strategy σ\sigma. Fix a density f0>0f_{0}>0 and a sequence c1,c2,c_{1},c_{2},\ldots of strictly positive bivariate copula densities. For each yy\in\mathbb{R},

f1(zy)𝑑z=c1{F0(z),F0(y)}f0(z)𝑑z=1\displaystyle\int f_{1}(z\mid y)\,dz=\int c_{1}\bigl{\{}F_{0}(z),F_{0}(y)\bigr{\}}\,f_{0}(z)\,dz=1

since f0(y)>0f_{0}(y)>0. Moreover, f1(zy)>0f_{1}(z\mid y)>0 for all zz due to f0>0f_{0}>0 and c1>0c_{1}>0. Next, suppose that fn(x)f_{n}(\cdot\mid x) is a strictly positive density for some n1n\geq 1 and xnx\in\mathbb{R}^{n}. Then, for all yy\in\mathbb{R},

fn+1(zx,y)𝑑z\displaystyle\int f_{n+1}(z\mid x,y)\,dz
=cn+1{Fn(zx),Fn(yx)}fn(zx)𝑑z=1\displaystyle=\int c_{n+1}\bigl{\{}F_{n}(z\mid x),F_{n}(y\mid x)\bigr{\}}\,f_{n}(z\mid x)\,dz=1

since fn(yx)>0f_{n}(y\mid x)>0. Furthermore, fn+1(zx,y)>0f_{n+1}(z\mid x,y)>0 for all zz since fn(x)>0f_{n}(\cdot\mid x)>0 and cn+1>0c_{n+1}>0. By induction, this proves that fn(x)f_{n}(\cdot\mid x) is a density for all n1n\geq 1 and xnx\in\mathbb{R}^{n}. Therefore, equation (6) defines a strategy σ\sigma (called HMW’s strategy in Example 5).

Finally, we prove that PσP_{\sigma} is c.i.d. if σ\sigma is HMW’s strategy. By Theorem 15, it suffices to prove condition (11). In turn, since σ\sigma is dominated by the Lebesgue measure, condition (11) reduces to

fn(zx)=fn+1(zx,y)fn(yx)𝑑y\displaystyle f_{n}(z\mid x)=\int f_{n+1}(z\mid x,y)\,f_{n}(y\mid x)\,dy

for all n0n\geq 0, almost all zz\in\mathbb{R} and PσP_{\sigma}-almost all xnx\in\mathbb{R}^{n}. Such a condition follows directly from the definition of σ\sigma. In fact, for all n0n\geq 0 and xnx\in\mathbb{R}^{n}, one obtains

fn+1(zx,y)fn(yx)𝑑y\displaystyle\int f_{n+1}(z\mid x,y)\,f_{n}(y\mid x)\,dy
=cn+1{Fn(zx),Fn(yx)}fn(zx)fn(yx)𝑑y\displaystyle=\int c_{n+1}\bigl{\{}F_{n}(z\mid x),F_{n}(y\mid x)\bigr{\}}\,f_{n}(z\mid x)\,f_{n}(y\mid x)\,dy
=fn(zx)for almost all z.\displaystyle=f_{n}(z\mid x)\quad\quad\text{for almost all }z.

This concludes the proof. ∎
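The key identity above, f_n(z | x) = ∫ f_{n+1}(z | x, y) f_n(y | x) dy, can also be checked numerically. The following sketch assumes, for illustration only, that f_n(· | x) is the standard normal density and that c_{n+1} is a Gaussian copula density with correlation ρ = 0.5:

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rho = 0.5

def c_gauss(u, v):
    # Gaussian copula density: phi_rho(a, b) / (phi(a) phi(b)) with a = ppf(u), b = ppf(v).
    a, b = norm.ppf(u), norm.ppf(v)
    return np.exp(-(rho ** 2 * (a ** 2 + b ** 2) - 2.0 * rho * a * b)
                  / (2.0 * (1.0 - rho ** 2))) / np.sqrt(1.0 - rho ** 2)

f_n, F_n = norm.pdf, norm.cdf          # current predictive density and distribution function

def f_next(z, y):                      # f_{n+1}(z | x, y) = c_{n+1}{F_n(z|x), F_n(y|x)} f_n(z|x)
    return c_gauss(F_n(z), F_n(y)) * f_n(z)

for z in (-1.0, 0.0, 2.0):
    mixed, _ = quad(lambda y: f_next(z, y) * f_n(y), -8.0, 8.0)   # finite range carries all the mass
    print(z, f_n(z), mixed)            # the two values agree up to numerical error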

Remark 16.

HMW’s strategy σ\sigma has been defined under the assumption that f0>0f_{0}>0 and cn>0c_{n}>0 for all n1n\geq 1. Such an assumption is superfluous and has been made only to avoid annoying complications in the definition of σ\sigma. Similarly, XX is c.i.d. even if the cnc_{n} are conditional copulas, in the sense that they are allowed to depend on past data. Precisely, for each n1n\geq 1 and xnx\in\mathbb{R}^{n}, fix a bivariate copula density cn+1(x)c_{n+1}(\cdot\mid x). Then, the proof of Theorem 6 still applies if fn+1(zx,y)f_{n+1}(z\mid x,y) is rewritten as

fn+1(zx,y)=cn+1{Fn(zx),Fn(yx)x}fn(zx).\displaystyle f_{n+1}(z\mid x,y)=c_{n+1}\bigl{\{}F_{n}(z\mid x),\,F_{n}(y\mid x)\mid x\bigr{\}}\,f_{n}(z\mid x).

Proof of Theorem 8.

We show that XX is c.i.d. via Theorem 15. Fix AA\in\mathcal{B} and n0n\geq 0. Since PβP_{\beta} is exchangeable (and thus c.i.d.), Theorem 15 yields

βn(x,A)=βn+1(x,y,A)βn(x,dy)\displaystyle\beta_{n}(x,A)=\int\beta_{n+1}(x,y,\,A)\,\beta_{n}(x,\,dy) (12)

for PβP_{\beta}-almost all xSnx\in S^{n}. Hence, up to changing β\beta on a PβP_{\beta}-null set, equation (12) can be assumed to hold for all xSnx\in S^{n}. If n=0n=0,

σ1(y,A)σ0(dy)=β1(y,A)β0(dy)=β0(A)=σ0(A)\displaystyle\int\sigma_{1}(y,A)\,\sigma_{0}(dy)=\int\beta_{1}(y,A)\,\beta_{0}(dy)=\beta_{0}(A)=\sigma_{0}(A)

where the first equality is because σ0=β0\sigma_{0}=\beta_{0} and σ1=β1\sigma_{1}=\beta_{1} while the second follows from (12). Next, suppose n1n\geq 1 and take xSnx\in S^{n} and ySy\in S. By assumption, the events {T>n+1}\{T>n+1\} and {Tn+1}\{T\leq n+1\} depend on xx but not on yy. If T>n+1T>n+1, one obtains σn+1(x,y)=βn+1(x,y)\sigma_{n+1}(x,y)=\beta_{n+1}(x,y) and σn(x)=βn(x)\sigma_{n}(x)=\beta_{n}(x). Hence, equation (12) implies again

σn+1(x,y,A)σn(x,dy)=βn+1(x,y,A)βn(x,dy)\displaystyle\int\sigma_{n+1}(x,y,\,A)\,\sigma_{n}(x,\,dy)=\int\beta_{n+1}(x,y,\,A)\,\beta_{n}(x,\,dy)
=βn(x,A)=σn(x,A).\displaystyle=\beta_{n}(x,A)=\sigma_{n}(x,A).

Similarly, if Tn+1T\leq n+1,

σn+1(x,y,A)σn(x,dy)\displaystyle\int\sigma_{n+1}(x,y,\,A)\,\sigma_{n}(x,\,dy)
={qn(x)σn(x,A)+(1qn(x))δy(A)}σn(x,dy)\displaystyle=\int\bigl{\{}q_{n}(x)\,\sigma_{n}(x,A)\,+\,(1-q_{n}(x))\,\delta_{y}(A)\bigr{\}}\,\sigma_{n}(x,\,dy)
=qn(x)σn(x,A)+(1qn(x))δy(A)σn(x,dy)\displaystyle=q_{n}(x)\,\sigma_{n}(x,A)+(1-q_{n}(x))\,\int\delta_{y}(A)\,\sigma_{n}(x,\,dy)
=σn(x,A).\displaystyle=\sigma_{n}(x,A).

In view of Theorem 15, this proves that XX is c.i.d.

Finally, suppose that AnA_{n} is invariant under permutations of SnS^{n} for each n1n\geq 1. We have to show that (X1,,Xn)(X_{1},\ldots,X_{n}) is exchangeable conditionally on T>nT>n. Fix nn, a set CnC\in\mathcal{B}^{n}, and a permutation ϕ\phi of SnS^{n}. For each jnj\geq n, it is easily seen that

Pσ(T=j+1,ϕ(X1,,Xn)C)\displaystyle P_{\sigma}\Bigl{(}T=j+1,\,\,\phi(X_{1},\ldots,X_{n})\in C\Bigr{)}
=Pβ(T=j+1,ϕ(X1,,Xn)C).\displaystyle=P_{\beta}\Bigl{(}T=j+1,\,\,\phi(X_{1},\ldots,X_{n})\in C\Bigr{)}.

Therefore,

Pσ(T=j+1,ϕ(X1,,Xn)C)\displaystyle P_{\sigma}\Bigl{(}T=j+1,\,\,\phi(X_{1},\ldots,X_{n})\in C\Bigr{)}
=Pβ(T=j+1,ϕ(X1,,Xn)C)\displaystyle=P_{\beta}\Bigl{(}T=j+1,\,\,\phi(X_{1},\ldots,X_{n})\in C\Bigr{)}
=Pβ((X1,,Xj)Aj,ϕ(X1,,Xn)C)\displaystyle=P_{\beta}\Bigl{(}(X_{1},\ldots,X_{j})\in A_{j},\,\,\phi(X_{1},\ldots,X_{n})\in C\Bigr{)}
=Pβ((X1,,Xj)Aj,(X1,,Xn)C)\displaystyle=P_{\beta}\Bigl{(}(X_{1},\ldots,X_{j})\in A_{j},\,\,(X_{1},\ldots,X_{n})\in C\Bigr{)}

where the last equality is because PβP_{\beta} is exchangeable and AjA_{j} is invariant under permutations of SjS^{j}. In turn, this implies

Pσ(T>n,ϕ(X1,,Xn)C)\displaystyle P_{\sigma}\Bigl{(}T>n,\,\,\phi(X_{1},\ldots,X_{n})\in C\Bigr{)}
=jnPσ(T=j+1,ϕ(X1,,Xn)C)\displaystyle=\sum_{j\geq n}P_{\sigma}\Bigl{(}T=j+1,\,\,\phi(X_{1},\ldots,X_{n})\in C\Bigr{)}
=jnPβ(T=j+1,(X1,,Xn)C)\displaystyle=\sum_{j\geq n}P_{\beta}\Bigl{(}T=j+1,\,\,(X_{1},\ldots,X_{n})\in C\Bigr{)}
=jnPσ(T=j+1,(X1,,Xn)C)\displaystyle=\sum_{j\geq n}P_{\sigma}\Bigl{(}T=j+1,\,\,(X_{1},\ldots,X_{n})\in C\Bigr{)}
=Pσ(T>n,(X1,,Xn)C).\displaystyle=P_{\sigma}\Bigl{(}T>n,\,\,(X_{1},\ldots,X_{n})\in C\Bigr{)}.

This concludes the proof. ∎


Proof of Theorem 11.

Just note that gng_{n} is a density of (X1,,Xn)(X_{1},\ldots,X_{n}) with respect to λn\lambda^{n}. Therefore, Theorem 11 follows from the very definitions of stationarity and exchangeability, after noting that gn+1(u,)λ(du)\int g_{n+1}(u,\cdot)\,\lambda(du) is a density of (X2,,Xn+1)(X_{2},\ldots,X_{n+1}) with respect to λn\lambda^{n}. ∎


Proof of Theorem 13.

We first recall that

𝒮(x,b;A)𝒮(0,r;dx)=𝒮(0,b+r;A)\displaystyle\int\mathcal{S}(x,b;\,A)\,\mathcal{S}(0,r;\,dx)=\mathcal{S}(0,b+r;\,A)

for all AA\in\mathcal{B} and b,r>0b,\,r>0. This can be checked by a direct calculation. For a proof, we refer to the Claim of [9, Th. 10]. Having noted this fact, define

μ=𝒮(a,b),f(x)=a+cx,ν=𝒮(0,b1|c|γ),\displaystyle\mu=\mathcal{S}(a,b),\quad f(x)=-a+c\,x,\quad\nu=\mathcal{S}\left(0,\,\frac{b}{1-\lvert c\rvert^{\gamma}}\right),

and denote by ZZ a real random variable such that Z=𝑑𝒮(0,1)Z\overset{d}{=}\mathcal{S}(0,1). Define also

r=b|c|γ1|c|γ,h(x)=cx,\displaystyle r=\frac{b\,\lvert c\rvert^{\gamma}}{1-\lvert c\rvert^{\gamma}},\quad h(x)=c\,x,

and call ν\nu^{*} the probability distribution of hh under ν\nu. On noting that

a+b1/γZ=𝑑μandν=𝒮(0,r),\displaystyle a+b^{1/\gamma}Z\overset{d}{=}\mu\quad\text{and}\quad\nu^{*}=\mathcal{S}(0,\,r),

one obtains

σ1(x,A)ν(dx)=P(f(x)+a+b1/γZA)ν(dx)\displaystyle\int\sigma_{1}(x,A)\,\nu(dx)=\int P\bigl{(}f(x)+a+b^{1/\gamma}Z\in A\bigr{)}\,\nu(dx)
=P(h(x)+b1/γZA)ν(dx)\displaystyle=\int P\bigl{(}h(x)+b^{1/\gamma}Z\in A\bigr{)}\,\nu(dx)
=P(x+b1/γZA)ν(dx)\displaystyle=\int P\bigl{(}x+b^{1/\gamma}Z\in A\bigr{)}\,\nu^{*}(dx)
=𝒮(x,b;A)𝒮(0,r;dx)=𝒮(0,b+r;A)\displaystyle=\int\mathcal{S}(x,b;\,A)\,\mathcal{S}(0,r;\,dx)=\mathcal{S}(0,\,b+r;\,A)
=𝒮(0,b1|c|γ;A)=ν(A).\displaystyle=\mathcal{S}\left(0,\,\frac{b}{1-\lvert c\rvert^{\gamma}};\,A\right)=\nu(A).

Therefore, equation (9) holds. ∎
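For γ = 1, the convolution identity recalled at the beginning of the proof can also be verified by simulation (with the assumed values b = 1.5 and r = 2.5); recall that, for γ = 1, S(0, B) is the Cauchy law with location 0 and scale B/2:

import numpy as np
from scipy import stats

# If X ~ S(0, r) and Z ~ S(0, 1) are independent, then X + b^{1/gamma} Z ~ S(0, b + r).
rng = np.random.default_rng(4)
b, r, n_sim = 1.5, 2.5, 300_000

x = stats.cauchy.rvs(scale=r / 2, size=n_sim, random_state=rng)     # X ~ S(0, r)
z = stats.cauchy.rvs(scale=1 / 2, size=n_sim, random_state=rng)     # Z ~ S(0, 1)
s = x + b * z                                                        # should be ~ S(0, b + r)

print(np.quantile(s, [0.25, 0.5, 0.75]))                             # empirical quartiles
print(stats.cauchy.ppf([0.25, 0.5, 0.75], scale=(b + r) / 2))        # quartiles of S(0, b + r)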

Acknowledgments: We are grateful to Federico Bassetti and Paola Bortot for very useful conversations.

References

  • [1] Airoldi E.M., Costa T., Bassetti F., Leisen F., Guindani M. (2014) Generalized species sampling priors with latent beta reinforcements, J.A.S.A., 109, 1466-1480.
  • [2] Bassetti F., Crimaldi I., Leisen F. (2010) Conditionally identically distributed species sampling sequences, Adv. in Appl. Probab., 42, 433-459.
  • [3] Bassetti F., Ladelli L. (2020) Asymptotic number of clusters for species sampling sequences with non-diffuse base measure, Stat. Prob. Letters, 162, 108749.
  • [4] Berti P., Regazzini E., Rigo P. (1997) Well-calibrated, coherent forecasting systems, Theory Probab. Appl., 42, 82-102.
  • [5] Berti P., Pratelli L., Rigo P. (2004) Limit theorems for a class of identically distributed random variables, Ann. Probab., 32, 2029-2052.
  • [6] Berti P., Pratelli L., Rigo P. (2012) Limit theorems for empirical processes based on dependent data, Electronic J. Probab., 17, 1-18.
  • [7] Berti P., Pratelli L., Rigo P. (2013) Exchangeable sequences driven by an absolutely continuous random measure, Ann. Probab., 41, 2090-2102.
  • [8] Berti P., Dreassi E., Pratelli L., Rigo P. (2021) A class of models for Bayesian predictive inference, Bernoulli, 27, 702-726.
  • [9] Berti P., Dreassi E., Leisen F., Pratelli L., Rigo P. (2023) Bayesian predictive inference without a prior, Statistica Sinica, 33.
  • [10] Berti P., Dreassi E., Leisen F., Pratelli L., Rigo P. (2022) Kernel based Dirichlet sequences, Bernoulli, to appear, available at arXiv:2106.00114 [math.PR].
  • [11] Blackwell D., MacQueen J.B. (1973) Ferguson distributions via Pólya urn schemes, Ann. Statist., 1, 353-355.
  • [12] Bladt M., McNeil A.J. (2022) Time series models with infinite-order partial copula dependence, Dependence Modeling, 10, 87-107.
  • [13] Canale A., Lijoi A., Nipoti B., Pruenster I. (2017) On the Pitman–Yor process with spike and slab base measure, Biometrika, 104, 681-697.
  • [14] Cassese A., Zhu W., Guindani M., Vannucci M. (2019) A Bayesian nonparametric spiked process prior for dynamic model selection, Bayesian Analysis, 14, 553-572.
  • [15] Chen K., Shen W., Zhu W. (2023) Covariate dependent Beta-GOS process, Computat. Stat. Data Anal., 180.
  • [16] Cifarelli D.M., Regazzini E. (1996) De Finetti’s contribution to probability and statistics, Statist. Science, 11, 253-282.
  • [17] Clarke B., Fokoue E., Zhang H.H. (2009) Principles and theory for data mining and machine learning, Springer, New York.
  • [18] Clarke B., Clarke J. (2018) Predictive statistics: Analysis and inference beyond models, Cambridge University Press, Cambridge.
  • [19] Dawid A.P. (1984) Present position and potential developments: Some personal views: Statistical Theory: The prequential approach, J. Royal Stat. Soc. A, 147, 278-292.
  • [20] Dawid A.P. (1992) Prequential data analysis, In Current Issues in Statistical Inference: Essays in Honor of D. Basu, Edited by M. Ghosh and P.K. Pathak, IMS Lecture Notes - Monograph Series, 17, 113-126.
  • [21] Dawid A.P., Vovk V.G. (1999) Prequential probability: principles and properties, Bernoulli, 5, 125-162.
  • [22] de Finetti B. (1931) Sul significato soggettivo della probabilità, Fund. Math., 17, 298–329.
  • [23] de Finetti B. (1937) La prévision: Ses lois logiques, ses sources subjectives, Ann. Inst. H. Poincaré, 7, 1–68.
  • [24] Diaconis P., Ylvisaker D. (1979) Conjugate priors for exponential families, Ann. Statist., 7, 269-281.
  • [25] Diaconis P., Freedman D.A. (1990) Cauchy’s equation and de Finetti’s theorem, Scand. J. Stat., 17, 235-249.
  • [26] Dubins L.E., Savage L.J. (1965) How to gamble if you must: Inequalities for stochastic processes, McGraw Hill.
  • [27] Efron B. (2020) Prediction, estimation, and attribution, J.A.S.A., 115, 636-655.
  • [28] Ferguson T.S. (1973) A Bayesian analysis of some nonparametric problems, Ann. Statist., 1, 209-230.
  • [29] Fong E., Holmes C., Walker S.G. (2023) Martingale posterior distributions (with discussion), J. Royal Stat. Soc. B, to appear.
  • [30] Fong E., Lehmann B. (2022) A predictive approach to Bayesian nonparametric survival analysis, arXiv: 2202.10361v1 [stat.ME].
  • [31] Fortini S., Ladelli L., Regazzini E. (2000) Exchangeability, predictive distributions and parametric models, Sankhya A, 62, 86-109.
  • [32] Fortini S., Petrone S. (2012) Predictive construction of priors in Bayesian nonparametrics, Brazilian J. Probab. Statist., 26, 423-449.
  • [33] Fortini S., Petrone S. (2017) Predictive characterizations of mixtures of Markov chains, Bernoulli, 23, 1538-1565.
  • [34] Fortini S., Petrone S., Sporysheva P. (2018) On a notion of partially conditionally identically distributed sequences, Stoch. Proc. Appl., 128, 819-846.
  • [35] Fortini S., Petrone S. (2020) Quasi-Bayes properties of a procedure for sequential learning in mixture models, J. Royal Stat. Soc. B, 82, 1087-1114.
  • [36] Geisser S. (1993) Predictive inference: An introduction, Chapman and Hall, New York.
  • [37] Ghosal S., van der Vaart A. (2017) Fundamentals of nonparametric Bayesian inference, Cambridge University Press, Cambridge.
  • [38] Gnedin A., Pitman J. (2006) Exchangeable Gibbs partitions and Stirling triangles. J. Math. Sci., 138, 5674-5685.
  • [39] Gnedin A. (2010) A species sampling model with finitely many types, Electron. Commun. Probab., 15, 79-88.
  • [40] Hahn P.R. (2017) Predictivist Bayes density estimation, unpublished technical report, available at https://math.la.asu.edu/ prhahn/pred-bayes.pdf
  • [41] Hahn P.R., Martin R., Walker S.G. (2018) On recursive Bayesian predictive distributions, J.A.S.A., 113, 1085-1093.
  • [42] Hansen B., Pitman J. (2000) Prediction rules for exchangeable sequences related to species sampling, Stat. Prob. Letters, 46, 251-256.
  • [43] Hastie T., Tibshirani R., Friedman J. (2009) The elements of statistical learning: Data Mining, Inference, and Prediction, Springer, New York.
  • [44] Hill B.M. (1993) Parametric models for AnA_{n}: splitting processes and mixtures, J. Royal Stat. Soc. B, 55, 423-433.
  • [45] Hjort N.L., Holmes C., Muller P., Walker S.G. (2010) Bayesian nonparametrics, Cambridge University Press, Cambridge.
  • [46] Hoffmann-Jorgensen J. (1994) Probability with a view toward statistics, Vol. II, Chapman and Hall, New York.
  • [47] Kallenberg O. (1988) Spreading and predictable sampling in exchangeable sequences and processes, Ann. Probab., 16, 508-534.
  • [48] Lee J., Quintana F.A., Muller P., Trippa L. (2013) Defining predictive probability functions for species sampling models, Statist. Science, 28, 209-222.
  • [49] Lijoi A., Pruenster I., Walker S.G. (2008) Bayesian nonparametric estimators derived from conditional Gibbs structures, Ann. Appl. Probab., 18, 1519-1547.
  • [50] Morvai G., Weiss B. (2021) On universal algorithms for classifying and predicting stationary processes, Probab. Surveys, 18, 77-131.
  • [51] Newton M.A., Zhang Y. (1999) A recursive algorithm for nonparametric analysis with missing data, Biometrika, 86, 15-26.
  • [52] Newton M.A. (2002) On a nonparametric recursive estimator of the mixing distribution, Sankhya, 64, 306-322.
  • [53] Pitman J. (1995) Exchangeable and partially exchangeable random partitions, Probab. Theory Rel. Fields, 102, 145-158.
  • [54] Pitman J. (1996) Some developments of the Blackwell-MacQueen urn scheme, Statistics, Probability and Game Theory, IMS Lect. Notes Mon. Series, 30, 245-267.
  • [55] Pitman J., Yor M. (1997) The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator, Ann. Probab., 25, 855-900.
  • [56] Pitman J. (2006) Combinatorial stochastic processes, Lectures from the XXXII Summer School in Saint-Flour, 2002, Springer, Berlin.
  • [57] Sethuraman J. (1994) A constructive definition of Dirichlet priors, Stat. Sinica, 4, 639-650.
  • [58] Shmueli G. (2010) To explain or to predict ?, Statist. Science, 25, 289-310.
  • [59] Smith A.F.M., Makov U.E. (1978) A quasi-Bayes sequential procedure for mixtures, J. Royal Stat. Soc. B, 40, 106-112.