A probabilistic view on predictive constructions for Bayesian learning
Abstract
Given a sequence $X = (X_1, X_2, \ldots)$ of random observations, a Bayesian forecaster aims to predict $X_{n+1}$ based on $(X_1,\ldots,X_n)$ for each $n \ge 0$. To this end, in principle, she only needs to select a collection $\sigma = (\sigma_0, \sigma_1, \ldots)$, called a “strategy” in what follows, where $\sigma_0$ is the marginal distribution of $X_1$ and $\sigma_n$ the $n$-th predictive distribution. Because of the Ionescu-Tulcea theorem, $\sigma$ can be assigned directly, without passing through the usual prior/posterior scheme. One main advantage is that no prior probability has to be selected. In a nutshell, this is the predictive approach to Bayesian learning. A concise review of the latter is provided in this paper. We try to put such an approach in the right framework, to clear up a few misunderstandings, and to provide a unifying view. Some recent results are discussed as well. In addition, some new strategies are introduced and the corresponding distribution of the data sequence is determined. The strategies concern generalized Pólya urns, random change points, covariates and stationary sequences.
1 Introduction
This paper has been written with the following interpretation of Bayesian inference in mind. (We declare this interpretation from the outset just to make our point of view transparent and the paper easier to understand.) Let us refer to the quantity we want to learn as the object of inference. Roughly speaking, it denotes whatever we do not know but would like to know. For instance, it could be a parameter (finite or infinite dimensional), a set of future observations, an unknown probability distribution, the effect of some action, or something else. According to us, the distinguishing feature of the Bayesian approach is to regard the object of inference as the realization of a random element, and not as an unknown but fixed constant. As a consequence, the main goal of any Bayesian inferential procedure is to determine the conditional distribution of the object of inference given the available information.
Note that, unless the object of inference is itself a parameter, no other parameter is necessarily involved.
Prediction of unknown observable quantities is a fundamental part of statistics. Initially, it was probably the most prevalent form of statistical inference. The wind changed at the beginning of the 20th century, when statisticians’ attention shifted to other issues, such as parametric estimation and testing; see e.g. [36]. Nowadays, prediction is back in the limelight, and it plays a role in modern topics including machine learning and data mining; see e.g. [17, 18, 27, 43].
This paper deals with prediction of future observations, based on past ones, from the Bayesian point of view. Precisely, we focus on a sequence
$$X = (X_1, X_2, \ldots)$$
of random observations and, at each time $n$, we aim to predict $X_{n+1}$ based on $(X_1,\ldots,X_n)$. Hence, for each $n$, the object of inference is $X_{n+1}$, the available information is $(X_1,\ldots,X_n)$, and the target is the predictive distribution of $X_{n+1}$ given $(X_1,\ldots,X_n)$. We point out that, apart from technicalities, most of our considerations could be generalized to the case where the object of inference is an arbitrary (measurable) function of the future observations, say $f(X_{n+1}, X_{n+2}, \ldots)$.
This case has recently been the object of increasing attention; see e.g. [29, 40].
No parameter plays a role at this stage. The forecaster may involve some parameter $\theta$, if she thinks it helps, but she is not interested in $\theta$ as such. To involve $\theta$ means to model the probability distribution of $X$ as depending on $\theta$, and then to exploit this fact to calculate the predictive distributions.
To better address our prediction problem, it is convenient to introduce the notion of strategy. Let $(S, \mathcal{B})$ be a measurable space, with $S$ to be viewed as the set where the observations take values. Following Dubins and Savage [26], a strategy is a sequence
$$\sigma = (\sigma_0, \sigma_1, \ldots)$$
such that
• $\sigma_0$ and $\sigma_n(x)$ are probability measures on $\mathcal{B}$ for all $n \ge 1$ and $x \in S^n$;
• The map $x \mapsto \sigma_n(x)(A)$ is measurable for fixed $n \ge 1$ and $A \in \mathcal{B}$.
Here, $\sigma_0$ should be regarded as the marginal distribution of $X_1$ and $\sigma_n(x)$ as the conditional distribution of $X_{n+1}$ given that $(X_1,\ldots,X_n) = x$. Moreover, $\sigma_n(x)(A)$ denotes the value taken at $A$ by the probability measure $\sigma_n(x)$. We also note that strategies are often called prediction rules in the framework of species sampling sequences; see [54, p. 251].
Strategies are a natural tool to frame a prediction problem from the Bayesian standpoint. In fact, a strategy $\sigma$ can be regarded as the collection of all predictive distributions (including the marginal distribution of $X_1$), in the sense that $\sigma_n(X_1,\ldots,X_n)$ is a version of the conditional distribution of $X_{n+1}$ given $(X_1,\ldots,X_n)$ for all $n$. Thus, in a sense, everything a Bayesian forecaster has to do is to select a strategy $\sigma$. Obviously, the problem is how to do it. A related problem is whether, in order to choose $\sigma$, involving a parameter is convenient or not.
An important special case is exchangeability. In fact, if $X$ is assumed to be exchangeable, there is a natural way to involve a parameter $\theta$. To see this, take the parameter space to be
$$\Theta = \{\text{all probability measures on } (S, \mathcal{B})\}.$$
Moreover, for each $\theta \in \Theta$, denote by $M_\theta$ a probability measure which makes $X$ i.i.d. with common distribution $\theta$, i.e.,
$$M_\theta(X_1 \in A_1, \ldots, X_n \in A_n) = \prod_{i=1}^n \theta(A_i)$$
for all $n \ge 1$ and $A_1, \ldots, A_n \in \mathcal{B}$. Then, under mild conditions on $(S, \mathcal{B})$, de Finetti’s theorem yields
$$P(\cdot) = \int_\Theta M_\theta(\cdot)\,\pi(d\theta)$$
for some (unique) prior probability $\pi$ on $\Theta$, where $P$ denotes the probability distribution of $X$. Thus, conditionally on $\theta$, the observations are i.i.d. with common distribution $\theta$. This suggests calculating the strategy $\sigma$ as follows.
(i) Select a prior $\pi$ on $\Theta$;
(ii) For each $n$ and $x = (x_1,\ldots,x_n) \in S^n$, evaluate the posterior $\pi_n(\cdot \mid x)$, namely, the conditional distribution of $\theta$ given that $(X_1,\ldots,X_n) = x$;
(iii) Calculate $\sigma$ as
$$\sigma_0(A) = \int_\Theta \theta(A)\,\pi(d\theta) \quad\text{and}\quad \sigma_n(x)(A) = \int_\Theta \theta(A)\,\pi_n(d\theta \mid x),$$
where $\pi_n(\cdot \mid x)$ is the posterior and $\sigma_n(x)$ is meant as $\sigma_n(x_1,\ldots,x_n)$.
Steps (i)-(ii)-(iii) are familiar in a Bayesian framework. Henceforth, if $\sigma$ is selected via (i)-(ii)-(iii), the forecaster is said to follow the inferential approach (I.A.).
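To fix ideas, the following minimal sketch carries out steps (i)-(ii)-(iii) in the simplest conjugate setting, a Beta-Bernoulli model. The model, the prior parameters and the function names are illustrative choices of ours, not taken from the text or the cited references.

```python
import numpy as np

# Sketch of I.A. for 0/1 data, assuming (for illustration only) a Beta(a, b)
# prior on the success probability theta.

def posterior_params(a, b, data):
    """Step (ii): posterior Beta parameters given observed 0/1 data."""
    data = np.asarray(data)
    return a + data.sum(), b + len(data) - data.sum()

def predictive_prob(a, b, data):
    """Step (iii): P(X_{n+1} = 1 | x_1, ..., x_n), i.e. the posterior mean."""
    a_n, b_n = posterior_params(a, b, data)
    return a_n / (a_n + b_n)

data = [1, 0, 1, 1]
print(predictive_prob(1.0, 1.0, data))   # predictive probability of a further success
```

Even in this toy case, the predictive is obtained only after the detour through the prior and the posterior, which is the point contrasted with P.A. below.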
1.1 Predictive approach to Bayesian modeling
There is another approach to Bayesian prediction, usually called the predictive approach (P.A.), which is quite recurrent in the Bayesian literature and has recently gained increasing attention. (Such an approach, incidentally, has been referred to as the “non-standard approach” in [8, 9]). According to P.A., the forecaster directly selects her strategy $\sigma$. Namely, for each $n$, she selects the predictive $\sigma_n$ without passing through the prior/posterior scheme described above. Among others, P.A. is supported by de Finetti, Savage and Dubins [22, 23, 26] and, more recently, by Diaconis and Regazzini [4, 16, 24, 25, 31]. P.A. is also strictly connected to Dawid’s prequential approach [19, 20, 21] and to Pitman’s treatment of species sampling sequences [54, 55, 56]. In addition, several prediction procedures arising in not necessarily Bayesian frameworks, such as machine learning and data mining, are consistent with P.A.; see e.g. [17, 18, 27, 43]. Some further related references are [8, 9, 29, 30, 32, 40, 41, 44].
The theoretical foundation of P.A. is the Ionescu-Tulcea theorem; see e.g. [46, p. 159]. Roughly speaking, this theorem states that, to assign the joint distribution of $X = (X_1, X_2, \ldots)$, it suffices to choose, in an arbitrary way, the marginal distribution of $X_1$, the conditional distribution of $X_2$ given $X_1$, the conditional distribution of $X_3$ given $(X_1, X_2)$, and so on. Note that this fact would be obvious if $X$ were replaced by a finite dimensional random vector $(X_1,\ldots,X_n)$. So, in a sense, the Ionescu-Tulcea theorem extends to infinite sequences a straightforward property of finite dimensional vectors. In any case, a formal statement of the theorem is as follows.
Theorem 1.
(Ionescu-Tulcea). For each $n \ge 1$, let $X_n$ be the $n$-th coordinate random variable on $(S^\infty, \mathcal{B}^\infty)$. Then, for any strategy $\sigma$, there is a unique probability measure $P$ on $(S^\infty, \mathcal{B}^\infty)$ such that
$$P(X_1 \in \cdot) = \sigma_0(\cdot) \quad\text{and}\quad P\bigl(X_{n+1} \in \cdot \mid X_1 = x_1, \ldots, X_n = x_n\bigr) = \sigma_n(x_1,\ldots,x_n)(\cdot) \tag{1}$$
for all $n \ge 1$ and $P$-almost all $(x_1,\ldots,x_n) \in S^n$.
Because of Theorem 1, to make predictions on the sequence $X$, the forecaster is free to select an arbitrary strategy $\sigma$. In fact, for any $\sigma$, there is a (unique) probability distribution for $X$, denoted above by $P$, whose predictives agree with $\sigma$ in the sense of equation (1).
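Operationally, Theorem 1 means that a data sequence can be generated step by step from any rule mapping the past into a distribution for the next observation. The sketch below illustrates this with a purely illustrative predictive rule on $\{0,1\}$ (the rule itself is our own toy choice, not one discussed in the text).

```python
import numpy as np

rng = np.random.default_rng(0)

# A strategy is any rule (x_1, ..., x_n) -> distribution of X_{n+1}.
# Toy rule on {0, 1}: the next observation equals 1 with probability
# (1 + number of ones observed so far) / (2 + n).
def predictive(past):
    n = len(past)
    p1 = (1 + sum(past)) / (2 + n)
    return np.array([1 - p1, p1])

def sample_sequence(predictive, n_steps):
    """Sequentially sample X_1, X_2, ... from a strategy (Ionescu-Tulcea in action)."""
    past = []
    for _ in range(n_steps):
        probs = predictive(past)
        past.append(int(rng.choice([0, 1], p=probs)))
    return past

print(sample_sequence(predictive, 10))
```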
The strengths and weaknesses of I.A. versus P.A. are discussed in a number of papers; see e.g. [8, 18, 27, 36, 58] and references therein. Here, we summarize this issue (from our point of view) under the assumption that prediction is the main target.
I.A. is not motivated by prediction alone. The main goal of I.A. is to make inference on other features of the data distribution (typically some parameters), and in this case the prior $\pi$ is fundamental. It should be added that $\pi$ often provides various meaningful pieces of information on the data generating process. However, assessing $\pi$ is not an easy task. In addition, once $\pi$ is selected, evaluating the posterior is quite difficult as well. Frequently, the posterior cannot be written in closed form but only approximated numerically. In short, I.A. is a cornerstone of Bayesian inference, but, when prediction is the main target, it is actually quite involved.
In turn, P.A. has essentially four merits. First, P.A. allows the forecaster to avoid an explicit choice of the prior $\pi$. Indeed, when prediction is the main target, why select $\pi$ explicitly? Rather than wondering about $\pi$, it seems reasonable to reflect on how the information in $(X_1,\ldots,X_n)$ is conveyed into the prediction of $X_{n+1}$. Second, no distributional assumption on the data sequence $X$ is required. This point is developed in Subsections 1.2 and 1.3. For now, we stress a consequence of such a point: the Bayesian nature of a prediction procedure does not depend on the data distribution. For instance, a forecaster applying P.A. is certainly Bayesian, independently of the distribution attached to $X$. Third, P.A. requires the assignment of probabilities on observable facts only. The value of $X_{n+1}$ is actually observable, while $\theta$ and $\pi$ (being probabilities on $S$ and on $\Theta$, respectively) do not necessarily deal with observable facts. Fourth, the strategy may be assigned stepwise. At each time $n$, the forecaster has observed $(x_1,\ldots,x_n)$ and has already selected $\sigma_0,\ldots,\sigma_{n-1}$. Then, to predict $X_{n+1}$, she is still free to select $\sigma_n$ as she wants. No choice of $\sigma_n$ is precluded. This is consistent with the Bayesian view, where the observed data are fixed and one should condition on them. In spite of these advantages, P.A. has an obvious drawback: assigning a strategy directly may be very difficult, in principle as difficult as selecting a prior $\pi$.
A last (basic) remark is that, if $X$ is exchangeable, both I.A. and P.A. completely determine the probability distribution of $X$. Selecting a prior $\pi$ or choosing a strategy $\sigma$ are just equivalent routes to fix the distribution of $X$. In particular, selecting $\sigma$ uniquely determines $\pi$. An intriguing line of research is in fact to identify the prior $\pi$ corresponding to a given $\sigma$; see e.g. [10, 24, 25, 31].
1.2 Characterizations
Recall that, for any strategy $\sigma$, there is a unique probability measure $P$ on $(S^\infty, \mathcal{B}^\infty)$ satisfying condition (1).
In principle, when applying P.A., the data sequence $X$ is free to have any probability distribution. Nevertheless, in most applications, it is reasonable (if not mandatory) to impose some conditions on $P$. For instance, the forecaster may wish $X$ to be exchangeable, or stationary, or Markov, or a martingale, and so on. In these cases, $\sigma$ is subject to some constraints. If $X$ is required to be exchangeable, for instance, $\sigma$ should be such that $P$ is exchangeable. Hence, those strategies which make $X$ exchangeable should be characterized.
More generally, fix any collection $\mathcal{C}$ of probability measures on $(S^\infty, \mathcal{B}^\infty)$ and suppose the data distribution is required to belong to $\mathcal{C}$. Then, P.A. gives rise to the following problem:
Problem (*): Characterize those strategies $\sigma$ such that $P \in \mathcal{C}$.
Sometimes, Problem (*) is trivial (Markov sequences, martingales), but sometimes it is not (stationarity, exchangeability). To illustrate, we mention three examples (which correspond to the three dependence forms examined in the sequel).
In the exchangeable case, Problem (*) admits a solution [31, Th. 3.1], but the conditions on $\sigma$ are quite hard to check in real problems. Hence, applying P.A. to exchangeable data is usually difficult (even if there are some exceptions; see Section 2).
A condition weaker than exchangeability is conditional identity in distribution. Say that $X$ is conditionally identically distributed (c.i.d.) if, for each $n \ge 0$, the conditional distribution of $X_k$ given $(X_1,\ldots,X_n)$ is the same for all $k > n$ (the case $n = 0$ meaning that the $X_k$ are identically distributed); see Section 3. It can be shown that
$$X \text{ exchangeable} \iff X \text{ stationary and c.i.d.};$$
see [5, 47]. Hence, conditional identity in distribution can be regarded as one of the two basic ingredients of exchangeability (the other being stationarity). Now, in the c.i.d. case, Problem (*) has been solved [6, Th. 3.1] and the conditions on $\sigma$ are quite simple. The class of admissible strategies includes several meaningful elements which cannot be used if $X$ is required to be exchangeable. As a consequence, P.A. works quite well for c.i.d. data; see [8, 9].
The stationary case is more involved. In fact, to our knowledge, there is no general characterization of the strategies which make $X$ stationary. However, such a characterization is available in some meaningful special cases (e.g. when $X$ is also required to be Markov); see Section 4.
Finally, Problem (*) is usually easier in a few (meaningful) special cases. For instance, Problem (*) is simpler if $X$ is also asked to be Markov; see e.g. [33] and Section 4. Or else, if the strategy is required to be dominated.
Dominated strategies: Let $\lambda$ be a $\sigma$-finite measure on $\mathcal{B}$. Say that a strategy $\sigma$ is dominated by $\lambda$ if each $\sigma_n$ admits a density with respect to $\lambda$, namely,
$$\sigma_0(dy) = f_0(y)\,\lambda(dy) \quad\text{and}\quad \sigma_n(x)(dy) = f_n(x, y)\,\lambda(dy)$$
for all $n \ge 1$ and $x \in S^n$. Here, $f_0$ and $f_n$ are non-negative measurable functions.
For instance, if $S = \mathbb{R}$ and $\sigma_n(x)$ is a non-degenerate normal distribution for all $n$ and $x$, then $\sigma$ is dominated by Lebesgue measure. Or else, if $S$ is countable, any strategy is dominated by counting measure. Instead, if $S$ is uncountable, a non-dominated strategy is $\sigma_n(x) = \delta_{x_n}$, where $\delta_y$ denotes the unit mass at the point $y$. Another non-dominated strategy is the empirical measure
$$\sigma_n(x) = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}.$$
In a sense, dominated strategies play a role analogous to that of the usual dominated models in parametric statistical inference. The main advantage is that one can work with the conditional density $f_n$ instead of the conditional measure $\sigma_n$. A related advantage is that, if one fixes $\lambda$ and restricts to strategies dominated by $\lambda$, Problem (*) becomes simpler. However, even in applied data analysis, various familiar strategies are not dominated. In the framework of species sampling sequences, for instance, most strategies are not dominated. Therefore, in this paper, we focus on general strategies, while the dominated ones are regarded as an important special case.
1.3 Content of this paper and further notation
This is a review paper on P.A. which also includes some (minor) new results. Our perspective is mainly on the probabilistic aspects of Bayesian predictive constructions. Moreover, we tacitly assume that the major target is to predict future observations (and not to make inference on other random elements, such as random parameters).
Essentially, we aim to achieve three goals. First, we try to put P.A. in the right framework, to provide a unifying view, and to clear up a few misunderstandings. This has been done in the Introduction. Second, in Section 2 and Subsection 3.1, we report some known results. Third, we provide some new strategies and we prove a few related results. The strategies, introduced by means of examples, deal with generalized Pólya urns, random change points, covariates and stationary sequences. The results consist in determining the distribution of the data sequence under such strategies. To our knowledge, Examples 7, 9, 12, 14 and Theorems 8, 11, 13 are actually new, while Theorem 6 makes precise a claim contained in [29]. Moreover, as far as we know, Section 4 is the first attempt to develop P.A. for stationary data. It provides a brief discussion of Problem (*) and introduces two large classes of stationary sequences.
As already noted, even if $X$ could potentially be given any distribution, in most applications some conditions on $X$ are required. There is obviously a number of such conditions. Among them, we decided to focus on exchangeability, stationarity and conditional identity in distribution. This choice seems reasonable to keep the paper focused, but of course it leaves out various interesting conditions, such as partial exchangeability. To write a paper of reasonable length, however, some choice was necessary.
To defend our choice, we note that, in addition to being natural in various practical problems, exchangeability is the usual assumption in Bayesian prediction. Hence, taking exchangeability into account is more or less mandatory. Moreover, since $X$ is exchangeable if and only if it is stationary and c.i.d., the other two conditions can be motivated as the basic components of exchangeability. But there are also other reasons for dealing with them. Stationarity is in fact a routine assumption in the classical treatment of time series, and it is reasonable to consider it from the Bayesian point of view as well. Conditional identity in distribution, even if not that popular, seems to be quite suitable for P.A.; see Section 3.
The rest of the paper is organized in three sections, each concerned with a specific assumption on $X$, plus a final section of open problems. All the proofs are gathered in the Appendix.
We close this Introduction with some further notation.
As usual, $\delta_y$ is the unit mass at the point $y$. For each $x \in S^k$, where $k$ is a positive integer or $k = \infty$, we denote by $x_i$ the $i$-th coordinate of $x$. Moreover, we take $X = (X_1, X_2, \ldots)$ to be the sequence of coordinate random variables on $(S^\infty, \mathcal{B}^\infty)$, namely,
$$X_n(x) = x_n \quad\text{for all } x \in S^\infty \text{ and } n \ge 1.$$
From now on, we fix a strategy $\sigma$ and we assume that $X$ is distributed according to the probability measure provided by Theorem 1, which we denote by $P$. Hence, $P$ is a probability measure on $(S^\infty, \mathcal{B}^\infty)$ to be regarded as the distribution of $X$ under the strategy $\sigma$. Finally, to avoid technicalities, $S$ is assumed to be a Borel subset of a Polish space and $\mathcal{B}$ the Borel $\sigma$-field on $S$.
2 Exchangeable data
A permutation of $S^n$ is a map of the form
$$(x_1, \ldots, x_n) \mapsto (x_{\pi(1)}, \ldots, x_{\pi(n)}),$$
where $\pi$ is a fixed permutation of $\{1, \ldots, n\}$. A sequence $X$ of random variables is exchangeable if
$$(X_{\pi(1)}, \ldots, X_{\pi(n)}) \sim (X_1, \ldots, X_n)$$
for all $n$ and all permutations $\pi$ of $\{1, \ldots, n\}$.
As noted in Subsection 1.2, if $X$ is required to be exchangeable, applying P.A. is usually hard. But there are a few exceptions, and two of them are discussed in this section. We first recall that $X$ is a Dirichlet sequence (or a Pólya sequence, see [11]) if
$$\sigma_0 = \alpha \quad\text{and}\quad \sigma_n(x_1,\ldots,x_n) = \frac{c\,\alpha + \sum_{i=1}^n \delta_{x_i}}{c + n},$$
where $c > 0$ is a constant, $\alpha$ a probability measure on $\mathcal{B}$, and $\sigma_n$ is meant as $\sigma_n(X_1,\ldots,X_n)$. The role of Dirichlet sequences is actually huge in various frameworks, including Bayesian nonparametrics, population genetics, ecology, combinatorics and number theory; see e.g. [28, 37, 45, 54, 55, 56]. From our point of view, however, two facts are to be stressed. First, a Dirichlet sequence is exchangeable. Second, being defined through its predictive distributions, a Dirichlet sequence is a natural candidate for P.A.
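Since the Dirichlet sequence is specified directly through its predictives, it can be simulated sequentially without ever touching a prior. The sketch below does so for a nonatomic base measure; the choice of a standard normal base measure is ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_dirichlet_sequence(n, c, base_sampler):
    """Simulate a Dirichlet (Polya) sequence from its predictive rule:
    X_{n+1} is a fresh draw from the base measure with probability c/(c+n),
    and a uniformly chosen past value otherwise (Blackwell-MacQueen urn)."""
    xs = []
    for i in range(n):
        if rng.random() < c / (c + i):
            xs.append(base_sampler())        # new value from the base measure
        else:
            xs.append(xs[rng.integers(i)])   # copy of a past observation
    return xs

xs = sample_dirichlet_sequence(20, c=2.0, base_sampler=rng.standard_normal)
print(len(set(xs)), "distinct values among", len(xs))
```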
2.1 Species sampling sequences
For $n \ge 1$ and $x = (x_1,\ldots,x_n) \in S^n$, denote by $k = k(x)$ the number of distinct values in the vector $x$ and by $x_1^*, \ldots, x_k^*$ such distinct values (in the order in which they appear). Say that $X$ is a species sampling sequence if it is exchangeable, $\sigma_0$ is non-atomic, and
$$\sigma_n(x) = \sum_{j=1}^{k} p_{n,j}(x)\,\delta_{x_j^*} + q_n(x)\,\sigma_0,$$
where the $p_{n,j}$ and $q_n$ are non-negative measurable functions on $S^n$ with $\sum_{j=1}^{k} p_{n,j}(x) + q_n(x) = 1$. Under this strategy, quoting from [42, p. 253], $X$ can be regarded as: “the sequence of species of individuals in a process of sequential random sampling from some hypothetical infinite population of individuals of various species. The species of the first individual to be observed is assigned a random tag distributed according to [$\sigma_0$]. Given the tags of the first [$n$] individuals observed, it is supposed that the next individual is one of the [$j$]-th species observed so far with probability [$p_{n,j}(x)$], and one of a new species with probability [$q_n(x)$]”.
A nice consequence of the definition is that the weights depend on $x$ only through the vector $(N_{n,1}(x), \ldots, N_{n,k}(x))$, where
$$N_{n,j}(x) = \#\{i \le n : x_i = x_j^*\}$$
is the number of times that $x_j^*$ appears in the vector $x$; see [42, 54].
The most popular example of a species sampling sequence is probably the two-parameter Poisson-Dirichlet, introduced by Pitman in [53], which corresponds to the weights
$$p_{n,j}(x) = \frac{N_{n,j}(x) - d}{n + \theta}, \qquad q_n(x) = \frac{\theta + k\,d}{n + \theta},$$
where $d$ and $\theta$ are constants such that: either (i) $0 \le d < 1$ and $\theta > -d$, or (ii) $d < 0$ and $\theta = -m\,d$ for some integer $m$. In this model, if $D$ denotes the number of distinct values appearing in the sequence $X$, one obtains $D = \infty$ a.s. under (i) and $D \le m$ a.s. under (ii). Note also that $X$ reduces to a Dirichlet sequence in the special case $d = 0$.
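The weights above translate directly into a sequential sampling scheme (the generalized Chinese restaurant construction). The sketch below simulates species labels under case (i); the parameter names `d` and `theta` mirror the notation introduced above and are otherwise our own.

```python
import numpy as np

rng = np.random.default_rng(2)

def pitman_yor_species(n, d, theta, base_sampler):
    """Sequentially sample species under the two-parameter Poisson-Dirichlet
    weights: after n_obs observations with k distinct species of multiplicities
    n_1, ..., n_k, the next observation is species j with probability
    (n_j - d)/(theta + n_obs) and a new species with probability
    (theta + k*d)/(theta + n_obs)."""
    species, counts, labels = [], [], []
    for n_obs in range(n):
        k = len(species)
        probs = [(c - d) / (theta + n_obs) for c in counts]
        probs.append((theta + k * d) / (theta + n_obs))
        j = rng.choice(k + 1, p=np.array(probs))
        if j == k:
            species.append(base_sampler())   # tag of a brand new species
            counts.append(1)
        else:
            counts[j] += 1
        labels.append(species[j])
    return labels, counts

labels, counts = pitman_yor_species(50, d=0.5, theta=1.0,
                                    base_sampler=rng.standard_normal)
print("number of distinct species:", len(counts))
```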
Another example, due to [39], is a species sampling sequence whose weights involve an additional parameter. This time, unlike the two-parameter Poisson-Dirichlet case, the number of distinct values appearing in $X$ is a finite but non-degenerate random variable.
In general, to obtain a species sampling sequence, the forecaster needs to select $\sigma_0$ and the weights $p_{n,j}$ and $q_n$. While the choice of $\sigma_0$ is free (apart from non-atomicity), the weights are subject to the constraint that $X$ should be exchangeable. (Incidentally, the choice of the weights is a good example of the difficulty of applying P.A. when $X$ is required to be exchangeable.) The usual method to select the weights involves exchangeable random partitions. Let $\Pi$ be a random partition of $\{1, 2, \ldots\}$. For each $n$, call $\Pi_n$ the restriction of $\Pi$ to $\{1,\ldots,n\}$, namely, the random partition of $\{1,\ldots,n\}$ whose elements are of the form $B \cap \{1,\ldots,n\}$ for some $B \in \Pi$. Say that $\Pi$ is exchangeable if
$$\pi(\Pi_n) \sim \Pi_n$$
for all $n$ and all permutations $\pi$ of $\{1,\ldots,n\}$, where $\pi(\Pi_n)$ denotes the random partition $\{\pi(B) : B \in \Pi_n\}$. For instance, given any sequence $Y = (Y_1, Y_2, \ldots)$ of random variables, define $\Pi$ to be the random partition of $\{1, 2, \ldots\}$ induced by the equivalence relation $i \sim j \iff Y_i = Y_j$. Then, $\Pi$ is exchangeable provided $Y$ is exchangeable. Now, the weights of a species sampling sequence correspond, in a canonical way, to the probability law of an exchangeable partition; see [53, 54]. Hence, choosing the weights essentially amounts to choosing an exchangeable partition. We stop here since a detailed discussion of exchangeable partitions is beyond the scope of this paper. The interested reader is referred to [38, 39, 48, 49, 53, 56] and references therein.
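For concreteness, the partition induced by a finite sequence (the restriction $\Pi_n$ above) can be computed with a few lines; this small helper is our own illustration.

```python
def induced_partition(xs):
    """Partition of {1, ..., n} induced by the relation i ~ j iff x_i = x_j."""
    blocks = {}
    for i, x in enumerate(xs, start=1):
        blocks.setdefault(x, []).append(i)
    return list(blocks.values())

print(induced_partition(["a", "b", "a", "c", "b"]))   # [[1, 3], [2, 5], [4]]
```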
2.2 Kernel based Dirichlet sequences
In [10], to generalize Dirichlet sequences while preserving their main properties, a class of strategies has been introduced. Among other things, such strategies make $X$ exchangeable.
A kernel on $(S, \mathcal{B})$ is a collection
$$K = \{K(x) : x \in S\}$$
such that $K(x)$ is a probability measure on $\mathcal{B}$, for each $x \in S$, and the map $x \mapsto K(x)(A)$ is measurable for each $A \in \mathcal{B}$. Sometimes, to make the notation easier, we will write $K(x, A)$ instead of $K(x)(A)$. A straightforward example of a kernel is $K(x) = \delta_x$ for each $x \in S$.
Fix a probability measure $\nu$ on $\mathcal{B}$, a constant $c > 0$, a kernel $K$ on $(S, \mathcal{B})$, and define the strategy
$$\sigma_0 = \nu \quad\text{and}\quad \sigma_n(x_1,\ldots,x_n) = \frac{c\,\nu + \sum_{i=1}^n K(x_i)}{c + n} \tag{2}$$
for all $n \ge 1$ and $(x_1,\ldots,x_n) \in S^n$. Clearly, $X$ reduces to a Dirichlet sequence if $K(x) = \delta_x$. In this case, we also say that $X$ is a classical Dirichlet sequence.
If $K$ is an arbitrary kernel, $X$ may fail to be exchangeable. However, a useful sufficient condition for exchangeability is available. In fact, $X$ is exchangeable if $K$ agrees with a conditional distribution for $\nu$ given some sub-$\sigma$-field $\mathcal{G} \subset \mathcal{B}$. For instance, if $\mathcal{G} = \mathcal{B}$, then $K(x) = \delta_x$ and $X$ is a classical Dirichlet sequence. At the opposite extreme, if $\mathcal{G}$ is the trivial $\sigma$-field, then $K(x) = \nu$ for all $x$ and $X$ is i.i.d. with common distribution $\nu$. In general, for fixed $\nu$ and $c$, a strategy which makes $X$ exchangeable can be associated with any sub-$\sigma$-field $\mathcal{G}$. It suffices to take $K$ as a conditional distribution for $\nu$ given $\mathcal{G}$.
Example 2.
(Countable partitions). Let $\{H_1, H_2, \ldots\}$ be a (non-random) countable partition of $S$ such that $H_j \in \mathcal{B}$ and $\nu(H_j) > 0$ for all $j$. For $x \in S$, denote by $H(x)$ the only $H_j$ such that $x \in H_j$. The conditional distribution for $\nu$ given the sub-$\sigma$-field generated by the partition is
$$K(x)(A) = \nu\bigl(A \mid H(x)\bigr) = \frac{\nu\bigl(A \cap H(x)\bigr)}{\nu\bigl(H(x)\bigr)}.$$
Hence, $X$ is exchangeable whenever
$$\sigma_n(x_1,\ldots,x_n) = \frac{c\,\nu + \sum_{i=1}^n \nu\bigl(\cdot \mid H(x_i)\bigr)}{c + n}.$$
Some remarks on the above strategy are in order.
• The above strategy may be reasonable when the basic information provided by each observation $x$ is $H(x)$, namely, the element of the partition including $x$.
• If $S$ is countable, each sub-$\sigma$-field of $\mathcal{B}$ is generated by a partition of $S$. Hence, $K$ is necessarily as above.
• $\sigma_n(x)$ is absolutely continuous with respect to $\nu$ for all $n \ge 1$ and $x \in S^n$. This is a striking difference with classical Dirichlet sequences. To make an example, call $\sigma^*$ the strategy obtained by replacing $K(x_i)$ with $\delta_{x_i}$ in (2). Under $\sigma^*$, $X$ is a classical Dirichlet sequence. Moreover, suppose $\nu$ is nonatomic and define the set $A_n(x) = \{x_1,\ldots,x_n\}$ for each $x \in S^n$. Since $\nu$ is nonatomic and $A_n(x)$ is finite,
$$\sigma_n(x)\bigl(A_n(x)\bigr) = 0.$$
On the other hand, since $\delta_{x_i}\bigl(A_n(x)\bigr) = 1$ for each $i \le n$,
$$\sigma^*_n(x)\bigl(A_n(x)\bigr) \ge \frac{n}{c + n}.$$
As a consequence, one obtains that ties among the observations have probability zero under $\sigma$, while they occur with positive probability under $\sigma^*$.
• The strategy can be generalized by replacing $K(x) = \nu(\cdot \mid H(x))$ with
$$K(x) = \begin{cases} \delta_x & \text{if } x \in D,\\[2pt] \nu\bigl(\cdot \mid H(x)\bigr) & \text{if } x \notin D,\end{cases}$$
where $D \in \mathcal{B}$ is a suitable set. Note that this kernel reduces to the previous one if $D = \emptyset$. Roughly speaking, such a kernel is reasonable in those problems where there is a set $D$ such that $x$ is informative about the future observations only if $x \in D$. Otherwise, if $x \notin D$, the only relevant information provided by $x$ is $H(x)$. As a trivial example, take $S = \mathbb{R}$, the partition given by the two half-lines $(-\infty, 0)$ and $[0, \infty)$, and $D$ the set of points far enough from the origin. Then, this kernel is reasonable if $x$ is informative only if $x \in D$. Otherwise, if $x \notin D$, the only meaningful information provided by $x$ is its sign.
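A minimal sketch of the strategy (2) with the partition kernel of Example 2 is given below. Here we assume, purely for illustration, that $S = \mathbb{R}$, $\nu$ is standard normal and the partition consists of the two half-lines determined by the sign; the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(3)

def cell(x):
    """Index of the partition cell containing x: 0 = (-inf, 0), 1 = [0, inf)."""
    return 0 if x < 0 else 1

def sample_from_nu_given_cell(j):
    """Draw from nu = N(0,1) conditioned on the cell j (a half-normal, by symmetry)."""
    z = abs(rng.standard_normal())
    return -z if j == 0 else z

def sample_kernel_dirichlet(n, c):
    """X_{n+1} ~ (c*nu + sum_i nu(. | cell of X_i)) / (c + n)."""
    xs = []
    for i in range(n):
        if rng.random() < c / (c + i):
            xs.append(rng.standard_normal())                   # draw from nu
        else:
            past = xs[rng.integers(i)]                          # pick a past point
            xs.append(sample_from_nu_given_cell(cell(past)))    # resample within its cell
    return xs

xs = sample_kernel_dirichlet(25, c=1.0)
print(len(set(xs)), "distinct values (ties have probability zero here)")
```

Unlike the classical Dirichlet case, the simulated values never repeat, which is the absolute-continuity point made in the third remark above.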
Example 3.
(Pólya urns). Some Pólya urns are covered by Example 2. It follows that, for such urns, the sequence of observed colors is exchangeable. To our knowledge, this fact was previously unknown.
As an example, consider sequential draws from an urn and denote by $X_n$ the color of the ball extracted at time $n$. At time $n = 0$, the urn contains a given number of balls of each color. Fix a partition of the set of colors. The sampling scheme is as follows: after each draw, the extracted ball is replaced together with additional balls whose colors all belong to the same element of the partition as the observed color, in proportion to the initial composition of the urn within that element. In other terms, if the observed color belongs to a given element of the partition, each color in that element is reinforced (and not only the observed color). In particular, after each draw, the same total number of new balls is added to the urn. Hence, denoting by $\sigma$ the strategy of Example 2 with $\nu$ proportional to the initial composition of the urn, one obtains that the sequence of observed colors is distributed according to $P$.
If $\sigma$ is the strategy (2), in addition to exchangeability, $X$ satisfies various other properties of classical Dirichlet sequences. We refer to [10] for details. Here, we just note that the prior and the posterior can be explicitly determined. In particular, up to replacing the point masses $\delta_{Z_j}$ with $K(Z_j)$, Sethuraman’s representation of the prior (see [57]) is still true. Precisely, the prior is the probability distribution of a random probability measure of the form
$$\widetilde{P} = \sum_{j=1}^{\infty} V_j\,K(Z_j),$$
where:
• $(V_j)$ and $(Z_j)$ are independent sequences of random variables;
• $(Z_j)$ is i.i.d. with common distribution $\nu$;
• $V_j = B_j \prod_{i<j}(1 - B_i)$ for all $j$, where $(B_j)$ is i.i.d. with common distribution beta$(1, c)$. Namely, $(V_j)$ has the stick breaking distribution with parameter $c$.
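As a numerical companion to the representation above, the sketch below generates a (truncated) stick-breaking random measure. The truncation level and the standard normal choice of the distribution of the $Z_j$ are illustrative assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(4)

def stick_breaking_weights(c, n_atoms):
    """V_j = B_j * prod_{i<j}(1 - B_i), with B_i i.i.d. Beta(1, c)."""
    betas = rng.beta(1.0, c, size=n_atoms)
    return betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))

def random_measure_atoms(c, base_sampler, n_atoms=1000):
    """Truncated Sethuraman-type representation: sum_j V_j * delta_{Z_j};
    in the kernel-based case, delta_{Z_j} is replaced by K(Z_j)."""
    weights = stick_breaking_weights(c, n_atoms)
    atoms = np.array([base_sampler() for _ in range(n_atoms)])
    return atoms, weights

atoms, weights = random_measure_atoms(c=2.0, base_sampler=rng.standard_normal)
print("total mass captured by 1000 atoms:", weights.sum())
```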
3 Conditionally identically distributed data
A sequence $X$ of random variables is conditionally identically distributed (c.i.d.) if
$$P\bigl(X_k \in \cdot \mid X_1, \ldots, X_n\bigr) = P\bigl(X_{n+1} \in \cdot \mid X_1, \ldots, X_n\bigr) \quad\text{a.s.}$$
for all $k > n \ge 0$. A c.i.d. sequence is identically distributed. It is also asymptotically exchangeable, in the sense that, as $n \to \infty$, the probability distribution of the shifted sequence $(X_{n+1}, X_{n+2}, \ldots)$ converges weakly to an exchangeable law. Moreover, as already stressed, $X$ is exchangeable if and only if it is stationary and c.i.d.
C.i.d. sequences have been introduced in [5, 47] and then investigated or applied in various papers; see e.g. [1, 2, 6, 7, 8, 9, 14, 15, 29, 30, 34].
There are reasons for taking c.i.d. data into account in Bayesian prediction. In fact, in a sense, c.i.d. sequences have been introduced with prediction in mind. If $X$ is c.i.d., at each time $n$, the future observations are identically distributed given the past, and this is reasonable in several prediction problems. Examples arise in clinical trials, generalized Pólya urns, species sampling models, survival analysis and disease surveillance; see [1, 2, 5, 8, 9, 14, 15, 29, 30, 35]. A further reason for assuming the c.i.d. condition is that the asymptotics is very close to that of exchangeable sequences. As a consequence, a meaningful part of the usual Bayesian machinery can be developed under the sole assumption that $X$ is c.i.d.; see [29]. Finally, the strategies which make $X$ c.i.d. can be easily characterized; see Theorem 15 in the Appendix. Hence, unlike the exchangeable case, P.A. can be easily implemented for c.i.d. data. A number of interesting strategies, which cannot be used if $X$ is required to be exchangeable, become available if $X$ is only asked to be c.i.d.; see e.g. [8, 9].
As a concrete example, fix a constant $q \in (0,1)$ and a probability measure $\nu$ on $\mathcal{B}$, and define
$$\sigma_0 = \nu \quad\text{and}\quad \sigma_n(x_1,\ldots,x_n) = q^n\,\nu + (1-q)\sum_{i=1}^n q^{n-i}\,\delta_{x_i} \tag{3}$$
for all $n \ge 1$ and $(x_1,\ldots,x_n) \in S^n$. Using $\sigma$ to make predictions corresponds to exponential smoothing. It may be reasonable when the forecaster has only vague opinions on the dependence structure of the data, and yet she feels that the weight of the $i$-th observation should be a decreasing function of $n - i$. In this case, $X$ is not exchangeable, since $\sigma_n(x_1,\ldots,x_n)$ is not invariant under permutations of $(x_1,\ldots,x_n)$, but it can be easily seen to be c.i.d.; see [8, Ex. 7].
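A minimal sketch of sampling from the exponential-smoothing predictive (3) is given below; the standard normal choice of $\nu$ is an illustrative assumption of ours.

```python
import numpy as np

rng = np.random.default_rng(5)

def smoothing_weights(n, q):
    """Weights of sigma_n = q^n * nu + (1-q) * sum_i q^(n-i) * delta_{x_i}."""
    past_w = (1 - q) * q ** np.arange(n - 1, -1, -1)   # weight of delta_{x_i}, i = 1..n
    return q ** n, past_w                               # (weight of nu, weights of past points)

def sample_next(xs, q, base_sampler):
    """Draw X_{n+1} from the exponential-smoothing predictive."""
    w_nu, w_past = smoothing_weights(len(xs), q)
    if rng.random() < w_nu:
        return base_sampler()
    j = rng.choice(len(xs), p=w_past / w_past.sum())
    return xs[j]

xs = [0.3, -1.2, 0.7]
print(sample_next(xs, q=0.8, base_sampler=rng.standard_normal))
```

Note the recursive form $\sigma_n = q\,\sigma_{n-1} + (1-q)\,\delta_{x_n}$, which anticipates the updates of Subsection 3.1.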
In this section, following [8, 9], P.A. is applied to c.i.d. data. We first report some known strategies (Subsection 3.1) and then we introduce two new strategies which make $X$ c.i.d. (Subsection 3.2).
3.1 Fast recursive update of predictive distributions
A possible condition for a strategy $\sigma$ is
$$\sigma_n(x_1,\ldots,x_n) = g_n\bigl(\sigma_{n-1}(x_1,\ldots,x_{n-1}),\, x_n\bigr) \tag{4}$$
for all $n \ge 1$ and $(x_1,\ldots,x_n) \in S^n$, where $x_n$ denotes the $n$-th observation and $g_n$ is a fixed updating rule.
Under (4), the predictive $\sigma_n$ is just a recursive update of the previous predictive $\sigma_{n-1}$ and the last observation $x_n$. Recursive properties of this type are useful in applications. They have a long history (see e.g. [51, 52, 59]) and have been recently investigated in [41].
For each $n \ge 1$, let $\alpha_n : S \to [0,1]$ be a measurable function and $K_n$ a kernel on $(S,\mathcal{B})$. Moreover, fix a probability measure $\sigma_0$ on $\mathcal{B}$ and define a strategy $\sigma$ through the recursive equations
$$\sigma_n(x_1,\ldots,x_n) = \alpha_n(x_n)\,\sigma_{n-1}(x_1,\ldots,x_{n-1}) + \bigl(1 - \alpha_n(x_n)\bigr)\,K_n(x_n) \tag{5}$$
for all $n \ge 1$ and $(x_1,\ldots,x_n) \in S^n$. Since $\sigma_n$ is a convex combination of the previous predictive $\sigma_{n-1}$ and the kernel $K_n(x_n)$, which depends only on $x_n$, the strategy $\sigma$ satisfies condition (4). The obvious interpretation is that, at time $n$, after observing $x_n$, the next observation is drawn from $\sigma_{n-1}$ with probability $\alpha_n(x_n)$ and from $K_n(x_n)$ with probability $1 - \alpha_n(x_n)$.
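The following sketch implements one pass of a recursion of type (5) on a finite grid. The constant weights, the Gaussian smoothing kernel and the grid are illustrative choices of ours, not prescribed by the text.

```python
import numpy as np

grid = np.linspace(-4, 4, 201)

def kernel(x, bandwidth=0.5):
    """Discretized Gaussian kernel K(x): a probability vector on the grid."""
    dens = np.exp(-0.5 * ((grid - x) / bandwidth) ** 2)
    return dens / dens.sum()

def update(sigma_prev, x_new, w):
    """One recursive step: convex combination of sigma_{n-1} and K(x_new)."""
    return w * sigma_prev + (1 - w) * kernel(x_new)

sigma = kernel(0.0, bandwidth=2.0)            # initial predictive sigma_0 (diffuse)
for n, x in enumerate([0.4, -1.1, 0.9], start=1):
    sigma = update(sigma, x, w=n / (n + 1))    # weight of the old predictive grows with n
print("predictive mass on (-2, 2):", sigma[(grid > -2) & (grid < 2)].sum())
```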
An example of a strategy satisfying equation (5) is Newton’s algorithm [51, 52]. More precisely, Newton’s algorithm aims to estimate the latent distribution in a mixture model rather than to make predictions. However, if reinterpreted as a prediction rule, Newton’s algorithm corresponds to a strategy $\sigma$, and such a $\sigma$ meets equation (5) for a suitable choice of the weights and kernels; see e.g. [35, p. 1095]. Moreover, as shown in [35], $\sigma$ makes $X$ c.i.d.
The strategies satisfying equation (5) are investigated in [9]. Under such strategies, $X$ is usually not exchangeable, but it is c.i.d. under some conditions on the kernels $K_n$. Precisely, $X$ is c.i.d. if, for each $n$, $K_n$ agrees with a conditional distribution for $K_{n+1}$ given the $n$-th element of some filtration, i.e., some increasing sequence of sub-$\sigma$-fields of $\mathcal{B}$. This condition is trivially true if the kernels do not depend on $n$ (just take a constant filtration).
Example 4.
(Finer countable partitions). For each $n$, let $\{H_{n,1}, H_{n,2}, \ldots\}$ be a countable partition of $S$ such that $H_{n,j} \in \mathcal{B}$ and $\nu(H_{n,j}) > 0$ for all $j$. Suppose that the partition at time $n+1$ is finer than the partition at time $n$, for all $n$. Define $\sigma$ through equation (5) with
$$K_n(x) = \nu\bigl(\cdot \mid H_n(x)\bigr),$$
where $H_n(x)$ denotes the only $H_{n,j}$ such that $x \in H_{n,j}$. The kernel $K_n$ is the conditional distribution for $\nu$ given $\mathcal{G}_n$, where $\mathcal{G}_n$ is the $\sigma$-field generated by the $n$-th partition. Since the $(n+1)$-th partition is finer than the $n$-th one, one obtains $\mathcal{G}_n \subset \mathcal{G}_{n+1}$. Hence, $X$ is c.i.d. Note also that the partitions could be chosen in such a way that, as $n \to \infty$, they shrink to the partition of $S$ into singletons.
For instance, in Example 2, suppose the forecaster wants to replace the fixed partition with a sequence of finer and finer partitions. This is possible at the price of having $X$ c.i.d. instead of exchangeable: it suffices to use equation (5) with the kernels $K_n$ above and suitable weights. Similarly, to decrease the impact of the observed data while preserving the c.i.d. condition, the strategy (3) could be modified by replacing each point mass $\delta_{x_i}$ with a smoothed version of the type $K(x_i)$.
We next turn to a strategy introduced in [41]. Once again, under this strategy, the data are c.i.d. but not necessarily exchangeable.
Example 5.
(Hahn, Martin and Walker; Copulas). In this example, $S = \mathbb{R}$ and “density function” means “density function with respect to Lebesgue measure”. A bivariate copula is a distribution function on $[0,1]^2$ whose marginals are uniform on $[0,1]$. The density function of a bivariate copula, provided it exists, is said to be a copula density.
In [41], in order to realize condition (4), the following updating rule is introduced. Fix a density $f_0$ and a sequence $(c_1, c_2, \ldots)$ of bivariate copula densities. For the sake of simplicity, we assume $f_0 > 0$ and $c_n > 0$ for all $n$. For $n = 0$, define $\sigma_0$ as the probability measure with density $f_0$ and call $F_0$ the distribution function corresponding to $f_0$. Then, for each $x \in \mathbb{R}$, define
$$f_1(x) = c_1\bigl(F_0(x), F_0(x_1)\bigr)\,f_0(x).$$
In general, for each $n \ge 1$ and $(x_1,\ldots,x_n) \in \mathbb{R}^n$, suppose $\sigma_{n-1}(x_1,\ldots,x_{n-1})$ has been defined and denote by $f_{n-1}$ and $F_{n-1}$ the density and the distribution function of $\sigma_{n-1}(x_1,\ldots,x_{n-1})$. Then, for all $x \in \mathbb{R}$, one can define
$$f_n(x) = c_n\bigl(F_{n-1}(x), F_{n-1}(x_n)\bigr)\,f_{n-1}(x), \tag{6}$$
where $f_n$ is taken as the density of $\sigma_n(x_1,\ldots,x_n)$.
Equation (6) defines a strategy dominated by Lebesgue measure.
In [41] (but not here) the $c_n$ are also required to be symmetric. Furthermore, in [41], equation (6) is not necessarily viewed as a method for obtaining a strategy but is deduced as a consequence of exchangeability. From our point of view, instead, equation (6) defines a strategy, which we call HMW’s strategy.
Under HMW’s strategy, $X$ is not necessarily exchangeable, even if the $c_n$ are symmetric and (in some sense) approach the independence copula density as $n \to \infty$. To see this, recall that $X$ is i.i.d. if and only if it is exchangeable and $X_2$ is independent of $X_1$. In turn, $X_2$ is independent of $X_1$ if $c_1$ is the independence copula density (i.e., $c_1(u, v) = 1$ for all $u, v$). Therefore, $X$ fails to be exchangeable whenever $c_1$ is the independence copula density and $X$ is not i.i.d. However, as noted in [29], $X$ turns out to be c.i.d.
Theorem 6.
If is HMW’s strategy, then is c.i.d.
3.2 Further examples
In the next example, the data are exchangeable until a (random) time $\tau$ and then go on so as to form a c.i.d. sequence. The time $\tau$ should be regarded as the first time when something meaningful happens, possibly something modifying the nature of the observed phenomenon. Even if apparently involved, the example could find some applications; for instance, to model censored survival times, with $\tau$ the first time when a given number of survival times is observed.
Example 7.
(Change points). A predictable stopping time is a function $\tau$ on $S^\infty$, with values in $\{1, 2, \ldots\} \cup \{\infty\}$, satisfying
$$\{\tau = n\} = \bigl\{(x_1,\ldots,x_{n-1}) \in A_{n-1}\bigr\} \tag{7}$$
for some set $A_{n-1} \in \mathcal{B}^{n-1}$. Basically, condition (7) means that the event $\{\tau = n\}$ depends only on $(x_1,\ldots,x_{n-1})$. Similarly, $\{\tau > n\}$ depends only on $(x_1,\ldots,x_{n-1})$. Therefore, for all $n$, the indicators of $\{\tau = n\}$ and $\{\tau > n\}$ depend on $(x_1,\ldots,x_{n-1})$ but not on the subsequent coordinates.
Fix a predictable stopping time $\tau$ and a strategy which makes $X$ exchangeable. Moreover, as in Subsection 3.1, fix measurable weights and kernels as in equation (5). Then, define a new strategy which agrees with the exchangeable one at all times before $\tau$ and, from time $\tau$ onwards, updates the predictive through a recursion of the form (5). In the Appendix, it is shown that:
Theorem 8.
The above strategy makes $X$ c.i.d. Moreover, if the sets involved in condition (7) are invariant under permutations of their coordinates, then $X$ is exchangeable conditionally on $\tau$. Precisely, for every $n$, the conditional distribution of $(X_1,\ldots,X_n)$ given $\tau$ is invariant under permutations of $(X_1,\ldots,X_n)$.
Theorem 8 is still valid if the strategy is defined differently at the times subsequent to $\tau$. For instance, given a countable partition of $S$ as in Example 2, the conclusions of Theorem 8 remain true if, after time $\tau$, each predictive reinforces the conditional distribution of the base measure given the partition cell of the observations, rather than the observations themselves.
Censored survival times are a possible application. Suppose that each observation is a pair $(t, d)$, where $t$ is the survival time of the item, or the time when the item leaves the trial, according to whether $d = 1$ or $d = 0$. In this framework, $\tau$ could be the first time when a fixed number of survival times is observed (with the usual convention that $\tau = \infty$ if such a time never occurs). Finally, the strategy up to time $\tau$ could be as in Subsection 2.2. In fact, classical Dirichlet sequences are a quite popular model for censored survival times, but they have the drawback of ties. This drawback may be overcome if the strategy is of the form (2), where the kernel satisfies the conditions of Subsection 2.2 and the relevant measures are nonatomic.
So far, the $n$-th predictive distribution has been meant as the conditional distribution of $X_{n+1}$ given $(X_1,\ldots,X_n)$. But the information available at time $n$ is often strictly larger than $(X_1,\ldots,X_n)$. To model this situation, we suppose we observe the sequence
$$(X_1, Z_1, X_2, Z_2, \ldots),$$
where $Z = (Z_1, Z_2, \ldots)$ is any sequence of random variables. The $Z_n$ can be regarded as covariates. At each time $n$, the forecaster aims to predict $X_{n+1}$ based on $(X_1, Z_1, \ldots, X_n, Z_n)$. She is not interested in the $Z_n$ as such, but they cannot be neglected since they are informative on $X_{n+1}$. Moreover, she wants $X$ to be c.i.d. and unconstrained as much as possible. One solution could be a strategy which makes the whole observed sequence c.i.d. However, if the whole sequence is c.i.d., both $X$ and $Z$ are marginally c.i.d., and having $Z$ c.i.d. may be unwelcome. In the next example, $X$ is c.i.d. but $Z$ is not. In addition, $X$ satisfies a condition stronger than the c.i.d. one, that is,
$$P\bigl(X_k \in \cdot \mid X_1, Z_1, \ldots, X_n, Z_n\bigr) = P\bigl(X_{n+1} \in \cdot \mid X_1, Z_1, \ldots, X_n, Z_n\bigr) \tag{8}$$
a.s. for all $k > n \ge 0$; see [5].
Example 9.
(Covariates). Let and
a bounded strictly increasing sequence of real numbers. Take as the probability distribution of where
Similarly, for each and
take as the probability distribution of where
Then, $Z$ is not c.i.d. while $X$ satisfies condition (8). Furthermore, arguing as in [9, Sect. 4], the normal distribution could be replaced by any symmetric stable law.
To see that $Z$ is not c.i.d., just note that $Z$ fails to be identically distributed. To prove condition (8), take a collection of independent standard normal random variables and define an auxiliary sequence as follows:
where and
It is not hard to verify that . Hence, it suffices to prove (8) with in the place of , and this can be done as in [5, Ex. 1.2]. We omit the explicit calculations.
4 Stationary data
A sequence $X$ of random variables is stationary if
$$(X_{1+k}, X_{2+k}, \ldots) \sim (X_1, X_2, \ldots) \quad\text{for all } k \ge 1.$$
In the non-Bayesian approaches to prediction, stationarity is a classical assumption. In a Bayesian framework, instead, stationarity seems to be less popular. In particular, to our knowledge, there is no systematic treatment of P.A. for stationary data. This section aims to fill this gap and begins an investigation of P.A. when $X$ is required to be stationary. It is just a preliminary step, and much more work remains to be done.
After some general remarks on Problem (*), two large classes of stationary sequences will be introduced. Incidentally, these two classes may look unusual to a Bayesian forecaster. We do not know whether this is true, but we recall that P.A. is consistent with any probability distribution for $X$. Hence, in a Bayesian framework, using data coming from such classes is certainly admissible.
If $X$ is required to be stationary, for P.A. to apply, the strategies which make $X$ stationary should be characterized. Hence, one comes across Problem (*) with $\mathcal{C}$ the class of stationary probability measures on $(S^\infty, \mathcal{B}^\infty)$. This version of Problem (*) is quite hard, and we are not aware of any general solution; see e.g. [12, 50] and references therein. Fortunately, however, Problem (*) is simple (or even trivial) in a few special cases. As an example, a strategy $\sigma$ makes $X$ a stationary (first order) Markov chain if and only if
$$\sigma_n(x_1,\ldots,x_n) = \sigma_1(x_n) \quad\text{and}\quad \int_S \sigma_1(x)(A)\,\sigma_0(dx) = \sigma_0(A)$$
for all $n$, all $A \in \mathcal{B}$ and $P$-almost all $(x_1,\ldots,x_n)$; that is, the predictive depends only on the last observation and $\sigma_0$ is invariant for the resulting transition kernel. Even if obvious, this fact has a useful practical consequence. If the data are required to be stationary and Markov, in order to make Bayesian predictions, applying P.A. is straightforward.
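In the finite-state case the condition above is just an invariance check, as in the following minimal sketch; the two-state transition matrix is an illustrative assumption of ours.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.3, 0.7]])                  # sigma_1: one-step transition kernel on {0, 1}

def invariant_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to a probability vector."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

sigma0 = invariant_distribution(P)
print(sigma0)                                # [0.75, 0.25]
print(np.allclose(sigma0 @ P, sigma0))       # True: this sigma_0 makes the chain stationary
```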
Another remark is that, unlike the exchangeable case, a finite dimensional stationary random vector can always be extended to an (infinite) stationary sequence. To formalize this fact, we first recall that the probability distribution of the random vector $(X_1,\ldots,X_{k+1})$ is completely determined by $(\sigma_0,\ldots,\sigma_k)$.
Lemma 10.
Fix $k \ge 1$, select $\sigma_0, \ldots, \sigma_k$ and define
$$\sigma_n(x_1,\ldots,x_n) = \sigma_k(x_{n-k+1},\ldots,x_n)$$
for all $n > k$ and $(x_1,\ldots,x_n) \in S^n$. Then, $X$ is stationary provided $(X_1,\ldots,X_k) \sim (X_2,\ldots,X_{k+1})$.
Lemma 10 is probably well known, but again we do not know of any explicit reference. Anyway, the proof is straightforward. It suffices to note that, under the strategy of Lemma 10, $X_{n+1}$ is conditionally independent of $(X_1,\ldots,X_{n-k})$ given $(X_{n-k+1},\ldots,X_n)$.
A last remark is that Problem (*) admits an obvious solution for dominated strategies. In this case, incidentally, Problem (*) can be easily solved even for exchangeable data.
Theorem 11.
Let $\lambda$ be a $\sigma$-finite measure on $\mathcal{B}$ and $\sigma$ a strategy dominated by $\lambda$, say
$$\sigma_0(dy) = f_0(y)\,\lambda(dy) \quad\text{and}\quad \sigma_n(x_1,\ldots,x_n)(dy) = f_n(x_1,\ldots,x_n, y)\,\lambda(dy)$$
for all $n \ge 1$ and $(x_1,\ldots,x_n) \in S^n$. Define
$$g_n(x_1,\ldots,x_n) = f_0(x_1)\,f_1(x_1, x_2)\cdots f_{n-1}(x_1,\ldots,x_n)$$
for all $n \ge 1$ and $(x_1,\ldots,x_n) \in S^n$. Then,
• $X$ is stationary if and only if
$$\int_S g_{n+1}(y, x_1,\ldots,x_n)\,\lambda(dy) = g_n(x_1,\ldots,x_n)$$
for all $n \ge 1$ and $\lambda^n$-almost all $(x_1,\ldots,x_n)$.
• $X$ is exchangeable if and only if
$$g_n(x_{\pi(1)},\ldots,x_{\pi(n)}) = g_n(x_1,\ldots,x_n)$$
for all $n \ge 1$, all permutations $\pi$ of $\{1,\ldots,n\}$ and $\lambda^n$-almost all $(x_1,\ldots,x_n)$.
The proof of Theorem 11 is given in the Appendix.
We finally give two examples. In both, is a stationary Markov sequence, possibly of order greater than 1.
Example 12.
(Generalized autoregressive sequences). Let $S = \mathbb{R}$. Fix a probability measure $\nu$ on $\mathcal{B}$ and a measurable function $f : \mathbb{R} \to \mathbb{R}$. Define
$$\sigma_n(x_1,\ldots,x_n) = \text{probability distribution of } f(x_n) + \epsilon,$$
where $\epsilon$ is a real random variable such that $\epsilon \sim \nu$. Suppose now that
$$\mu = \text{probability distribution of } f(Y) + \epsilon, \quad\text{with } Y \sim \mu \text{ and } \epsilon \sim \nu \text{ independent}, \tag{9}$$
for some probability measure $\mu$ on $\mathcal{B}$. Then, $X$ is a stationary Markov chain provided $\sigma_0 = \mu$.
Note that, under this strategy, $X$ is distributed as any sequence satisfying
$$X_{n+1} = f(X_n) + \epsilon_{n+1},$$
where $(\epsilon_n)$ is i.i.d. with common distribution $\nu$, independent of $X_1$, and $X_1 \sim \mu$. Thus, $\nu$ can be regarded as the distribution of the “errors” and $\mu$ as the marginal distribution of the observations. For instance, the usual Gaussian (first order) autoregressive processes correspond to $f(x) = b\,x$, $\nu = \mathcal{N}(0, s^2)$ and $\mu = \mathcal{N}\bigl(0, s^2/(1-b^2)\bigr)$, where $b \in (-1,1)$ and $s > 0$ are constants.
To make the above argument concrete, the following problem is to be solved: for fixed $\nu$ and $f$, give conditions for the existence of $\mu$ satisfying equation (9). More importantly, give an explicit formula for $\mu$ provided it exists. We next focus on this problem in the (meaningful) special case where $\nu$ is a symmetric stable law.
Let $a \in (0, 2]$ be a constant and $U$ a real random variable with characteristic function
$$E\bigl[e^{itU}\bigr] = \exp\bigl(-|t|^a\bigr).$$
(The exponent is usually denoted by $\alpha$, but this notation is not adopted here to avoid confusion with notation used elsewhere in the paper.) For $a \in (0,2]$ and $c > 0$, denote by $\mathcal{S}(a, c)$ the probability distribution of $c\,U$, namely
$$\mathcal{S}(a, c)(A) = P(c\,U \in A).$$
The probability measure $\mathcal{S}(a, c)$ is said to be a symmetric stable law with exponent $a$ and scale $c$. Note that $\mathcal{S}(2, c) = \mathcal{N}(0, 2c^2)$ and $\mathcal{S}(1, c)$ is the Cauchy distribution with density $x \mapsto c/\bigl(\pi(c^2 + x^2)\bigr)$ (the standard Cauchy distribution corresponds to $a = 1$ and $c = 1$).
Theorem 13.
Let $b \in (-1, 1)$ be a constant. If $f(x) = b\,x$ and $\nu = \mathcal{S}(a, c)$, then equation (9) is satisfied by
$$\mu = \mathcal{S}\Bigl(a,\ \frac{c}{(1 - |b|^a)^{1/a}}\Bigr).$$
By Theorem 13, which is proved in the Appendix, one obtains (first order) stationary autoregressive processes with any symmetric stable marginal distribution.
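A simulation sketch of such a process is given below. The errors are drawn with the standard Chambers-Mallows-Stuck formula for symmetric stable laws, and the error scale is chosen so that the marginal scale stays constant along the path; the parametrization details are reconstructed under the standard stable scaling property and should be read as illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

def sym_stable(alpha, scale, size):
    """Symmetric alpha-stable draws (Chambers-Mallows-Stuck, beta = 0)."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    x = (np.sin(alpha * u) / np.cos(u) ** (1 / alpha)
         * (np.cos((1 - alpha) * u) / w) ** ((1 - alpha) / alpha))
    return scale * x

def stable_ar1(n, b, alpha, scale=1.0):
    """X_{k+1} = b * X_k + eps_{k+1}, with the error scale chosen so that
    every X_k keeps the same symmetric stable marginal law."""
    eps_scale = scale * (1 - abs(b) ** alpha) ** (1 / alpha)
    x = sym_stable(alpha, scale, 1)[0]                 # X_1 from the stationary marginal
    path = [x]
    for e in sym_stable(alpha, eps_scale, n - 1):
        x = b * x + e
        path.append(x)
    return np.array(path)

path = stable_ar1(1000, b=0.6, alpha=1.5)
print(np.median(np.abs(path)))                         # heavy-tailed but stationary
```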
Example 14.
(Markov sequences of arbitrary order). Let be a -finite measure on . Fix and a measurable function on such that and . Given , define a further function via cyclic permutations of , namely
for all . Such a is still a density with respect to (since ) and satisfies
(10) |
Next, define
for all , and
for all and . Finally, define a strategy dominated by as
if and , and
if , and . Under , a density of is given by . By equation (10),
and this in turn implies
Therefore, is stationary because of Lemma 10. Note also that is a Markov sequence of order .
5 Concluding remarks and open problems
When prediction is the main target, P.A. has some advantages with respect to I.A. This is only our opinion, obviously, and we have tried to support it throughout this paper. Even if one agrees, however, some further work is needed to make P.A. a concrete tool. We close this paper with a brief list of open problems and possible hints for future research.
• In various applications, the available information strictly includes the past observations of the variable to be predicted. For instance, as in Example 9, suppose one aims to predict $X_{n+1}$ based on $(X_1, Z_1, \ldots, X_n, Z_n)$, where the $Z_i$ are any random elements. Suppose also that the $Z_i$ cannot be neglected, for they are informative on $X_{n+1}$. In this case, one needs the conditional distribution of $X_{n+1}$ given $(X_1, Z_1, \ldots, X_n, Z_n)$. Situations of this type are practically meaningful and should be investigated further.
• Section 4 should be expanded. It would be nice to have a general solution of Problem (*) for both the stationary and the stationary-ergodic cases. Further examples of stationary sequences (possibly non-Markovian) would be welcome as well.
• Obviously, P.A. could be investigated under other distributional assumptions, in addition to exchangeability, stationarity and conditional identity in distribution. In particular, partial exchangeability should be taken into account.
• A question, related to Example 5, is: under what conditions is $X$ exchangeable when $\sigma$ is HMW’s strategy?
• In the case of I.A., the empirical Bayes point of view (where the prior is allowed to depend on the data) may be problematic. In the case of P.A., instead, this point of view is certainly admissible. In fact, suppose a strategy $\sigma$ depends on some unknown constants, and an empirical Bayes forecaster decides to estimate these constants based on the available data. Acting in this way, she is merely replacing one strategy with another. Instead of $\sigma$, she is working with $\hat\sigma$, where $\hat\sigma$ is the strategy obtained from $\sigma$ by estimating the unknown constants. This empirical form of P.A. looks reasonable and could be investigated.
Appendix
This appendix contains the proofs of some claims scattered throughout the text. We will need the following characterization of c.i.d. sequences in terms of strategies.
Theorem 15.
(Theorem 3.1 of [6]). Let $\sigma$ be a strategy. Then, $X$ is c.i.d. if and only if
$$\sigma_n(x)(A) = \int_S \sigma_{n+1}(x, y)(A)\,\sigma_n(x)(dy) \tag{11}$$
for all $n \ge 0$, all $A \in \mathcal{B}$ and $P$-almost all $x \in S^n$.
Proof of Theorem 6.
In this proof, “density function” stands for “density function with respect to Lebesgue measure”. We first recall a well known fact.
Let be a bivariate copula and , distribution functions on . Suppose that , and all have densities, say , and , respectively. Then,
is a distribution function on and
is a density of . Therefore, for all with , one obtains
We next show that equation (6) actually defines a strategy . Fix a density and a sequence of strictly positive bivariate copula densities. For each ,
since . Moreover, for all due to and . Next, suppose that is a strictly positive density for some and . Then, for all ,
since . Furthermore, for all since and . By induction, this proves that is a density for all and . Therefore, equation (6) defines a strategy (called HMW’s strategy in Example 5).
Finally, we prove that is c.i.d. if is HMW’s strategy. By Theorem 15, it suffices to prove condition (11). In turn, since is dominated by the Lebesgue measure, condition (11) reduces to
for all $n$, almost all $x$ and $P$-almost all past observations. Such a condition follows directly from the definition of the strategy. In fact, for all $n$ and $x$, one obtains
This concludes the proof. ∎
Remark 16.
HMW’s strategy has been defined under the assumption that and for all . Such an assumption is superfluous and has been made only to avoid annoying complications in the definition of . Similarly, is c.i.d. even if the are conditional copulas, in the sense that they are allowed to depend on past data. Precisely, for each and , fix a bivariate copula density . Then, the proof Theorem 6 still applies if is rewritten as
Proof of Theorem 8.
We show that is c.i.d. via Theorem 15. Fix and . Since is exchangeable (and thus c.i.d.) Theorem 15 yields
(12) |
for -almost all . Hence, up to changing on a -null set, equation (12) can be assumed to hold for all . If ,
where the first equality is because and while the second follows from (12). Next, suppose and take and . By assumption, the events and depend on but not on . If , one obtains and . Hence, equation (12) implies again
Similarly, if ,
In view of Theorem 15, this proves that is c.i.d.
Finally, suppose that is invariant under permutations of for each . We have to show that is exchangeable conditionally on . Fix , a set , and a permutation of . For each , it is easily seen that
Therefore,
where the last equality is because is exchangeable and is invariant under permutations of . In turn, this implies
This concludes the proof. ∎
Proof of Theorem 11.
Just note that is a density of with respect to . Therefore, Theorem 11 follows from the very definitions of stationarity and exchangeability, after noting that is a density of with respect to . ∎
Proof of Theorem 13.
We first recall that
for all and . This can be checked by a direct calculation. For a proof, we refer to the Claim of [9, Th. 10]. Having noted this fact, define
and denote by a real random variable such that . Define also
and call the probability distribution of under . On noting that
one obtains
Therefore, equation (9) holds. ∎
Acknowledgments: We are grateful to Federico Bassetti and Paola Bortot for very useful conversations.
References
- [1] Airoldi E.M., Costa T., Bassetti F., Leisen F., Guindani M. (2014) Generalized species sampling priors with latent beta reinforcements, J.A.S.A., 109, 1466-1480.
- [2] Bassetti F., Crimaldi I., Leisen F. (2010) Conditionally identically distributed species sampling sequences, Adv. in Appl. Probab., 42, 433-459.
- [3] Bassetti F., Ladelli L. (2020) Asymptotic number of clusters for species sampling sequences with non-diffuse base measure, Stat. Prob. Letters, 162, 108749.
- [4] Berti P., Regazzini E., Rigo P. (1997) Well-calibrated, coherent forecasting systems, Theory Probab. Appl., 42, 82-102.
- [5] Berti P., Pratelli L., Rigo P. (2004) Limit theorems for a class of identically distributed random variables, Ann. Probab., 32, 2029-2052.
- [6] Berti P., Pratelli L., Rigo P. (2012) Limit theorems for empirical processes based on dependent data, Electronic J. Probab., 17, 1-18.
- [7] Berti P., Pratelli L., Rigo P. (2013) Exchangeable sequences driven by an absolutely continuous random measure, Ann. Probab., 41, 2090-2102.
- [8] Berti P., Dreassi E., Pratelli L., Rigo P. (2021) A class of models for Bayesian predictive inference, Bernoulli, 27, 702-726.
- [9] Berti P., Dreassi E., Leisen F., Pratelli L., Rigo P. (2023) Bayesian predictive inference without a prior, Statistica Sinica, 33.
- [10] Berti P., Dreassi E., Leisen F., Pratelli L., Rigo P. (2022) Kernel based Dirichlet sequences, Bernoulli, to appear, available at arXiv:2106.00114 [math.PR].
- [11] Blackwell D., Mac Queen J.B. (1973) Ferguson distributions via Pólya urn schemes, Ann. Statist., 1, 353-355.
- [12] Bladt M., McNeil A.J. (2022) Time series models with infinite-order partial copula dependence, Dependence Modeling, 10, 87-107.
- [13] Canale A., Lijoi A., Nipoti B., Pruenster I. (2017) On the Pitman–Yor process with spike and slab base measure, Biometrika, 104, 681-697.
- [14] Cassese A., Zhu W., Guindani M., Vannucci M. (2019) A Bayesian nonparametric spiked process prior for dynamic model selection, Bayesian Analysis, 14, 553-572.
- [15] Chen K., Shen W., Zhu W. (2023) Covariate dependent Beta-GOS process, Computat. Stat. Data Anal., 180.
- [16] Cifarelli D.M., Regazzini E. (1996) De Finetti’s contribution to probability and statistics, Statist. Science, 11, 253-282.
- [17] Clarke B., Fokoue E., Zhang H.H. (2009) Principles and theory for data mining and machine learning, Springer, New York.
- [18] Clarke B., Clarke J. (2018) Predictive statistics: Analysis and inference beyond models, Cambridge University Press, Cambridge.
- [19] Dawid A.P. (1984) Present position and potential developments: Some personal views: Statistical Theory: The prequential approach, J. Royal Stat. Soc. A, 147, 278-292.
- [20] Dawid A.P. (1992) Prequential data analysis, In Current Issues in Statistical Inference: Essays in Honor of D. Basu, Edited by M. Ghosh and P.K. Pathak, IMS Lecture Notes - Monograph Series, 17, 113-126.
- [21] Dawid A.P., Vovk V.G. (1999) Prequential probability: principles and properties, Bernoulli, 5, 125-162.
- [22] de Finetti B. (1931) Sul significato soggettivo della probabilità, Fund. Math., 17, 298–329.
- [23] de Finetti B. (1937) La prévision: Ses lois logiques, ses sources subjectives, Ann. Inst. H. Poincaré, 7, 1–68.
- [24] Diaconis P., Ylvisaker D. (1979) Conjugate priors for exponential families, Ann. Statist., 7, 269-281.
- [25] Diaconis P., Freedman D.A. (1990) Cauchy’s equation and de Finetti’s theorem, Scand. J. Stat., 17, 235-249.
- [26] Dubins L.E., Savage L.J. (1965) How to gamble if you must: Inequalities for stochastic processes, McGraw Hill.
- [27] Efron B. (2020) Prediction, estimation, and attribution, J.A.S.A., 115, 636-655.
- [28] Ferguson T.S. (1973) A Bayesian analysis of some nonparametric problems, Ann. Statist., 1, 209-230.
- [29] Fong E., Holmes C., Walker S.G. (2023) Martingale posterior distributions (with discussion), J. Royal Stat. Soc. B, to appear.
- [30] Fong E., Lehmann B. (2022) A predictive approach to Bayesian nonparametric survival analysis, arXiv: 2202.10361v1 [stat.ME].
- [31] Fortini S., Ladelli L., Regazzini E. (2000) Exchangeability, predictive distributions and parametric models, Sankhya A, 62, 86-109.
- [32] Fortini S., Petrone S. (2012) Predictive construction of priors in Bayesian nonparametrics, Brazilian J. Probab. Statist., 26, 423-449.
- [33] Fortini S., Petrone S. (2017) Predictive characterizations of mixtures of Markov chains, Bernoulli, 23, 1538-1565.
- [34] Fortini S., Petrone S., Sporysheva P. (2018) On a notion of partially conditionally identically distributed sequences, Stoch. Proc. Appl., 128, 819-846.
- [35] Fortini S., Petrone S. (2020) Quasi-Bayes properties of a procedure for sequential learning in mixture models, J. Royal Stat. Soc. B, 82, 1087-1114.
- [36] Geisser S. (1993) Predictive inference: An introduction, Chapman and Hall, New York.
- [37] Ghosal S., van der Vaart A. (2017) Fundamentals of nonparametric Bayesian inference, Cambridge University Press, Cambridge.
- [38] Gnedin A., Pitman J. (2006) Exchangeable Gibbs partitions and Stirling triangles. J. Math. Sci., 138, 5674-5685.
- [39] Gnedin A. (2010) A species sampling model with finitely many types, Electron. Commun. Probab., 15, 79-88.
- [40] Hahn P.R. (2017) Predictivist Bayes density estimation, unpublished technical report, available at https://math.la.asu.edu/ prhahn/pred-bayes.pdf
- [41] Hahn P.R., Martin R., Walker S.G. (2018) On recursive Bayesian predictive distributions, J.A.S.A., 113, 1085-1093.
- [42] Hansen B., Pitman J. (2000) Prediction rules for exchangeable sequences related to species sampling, Stat. Prob. Letters, 46, 251-256.
- [43] Hastie T., Tibshirani R., Friedman J. (2009) The elements of statistical learning: Data Mining, Inference, and Prediction, Springer, New York.
- [44] Hill B.M. (1993) Parametric models for A_n: splitting processes and mixtures, J. Royal Stat. Soc. B, 55, 423-433.
- [45] Hjort N.L., Holmes C., Muller P., Walker S.G. (2010) Bayesian nonparametrics, Cambridge University Press, Cambridge.
- [46] Hoffmann-Jorgensen J. (1994) Probability with a view toward statistics, Vol. II, Chapman and Hall, New York.
- [47] Kallenberg O. (1988) Spreading and predictable sampling in exchangeable sequences and processes, Ann. Probab., 16, 508-534.
- [48] Lee J., Quintana F.A., Muller P., Trippa L. (2013) Defining predictive probability functions for species sampling models, Statist. Science, 28, 209-222.
- [49] Lijoi A., Pruenster I., Walker S.G. (2008) Bayesian nonparametric estimators derived from conditional Gibbs structures, Ann. Appl. Probab., 18, 1519-1547.
- [50] Morvai G., Weiss B. (2021) On universal algorithms for classifying and predicting stationary processes, Probab. Surveys, 18, 77-131.
- [51] Newton M.A., Zhang Y. (1999) A recursive algorithm for nonparametric analysis with missing data, Biometrika, 86, 15-26.
- [52] Newton M.A. (2002) On a nonparametric recursive estimator of the mixing distribution, Sankhya, 64, 306-322.
- [53] Pitman J. (1995) Exchangeable and partially exchangeable random partitions, Probab. Theory Rel. Fields, 102, 145-158.
- [54] Pitman J. (1996) Some developments of the Blackwell-MacQueen urn scheme, Statistics, Probability and Game Theory, IMS Lect. Notes Mon. Series, 30, 245-267.
- [55] Pitman J., Yor M. (1997) The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator, Ann. Probab., 25, 855-900.
- [56] Pitman J. (2006) Combinatorial stochastic processes, Lectures from the XXXII Summer School in Saint-Flour, 2002, Springer, Berlin.
- [57] Sethuraman J. (1994) A constructive definition of Dirichlet priors, Stat. Sinica, 4, 639-650.
- [58] Shmueli G. (2010) To explain or to predict ?, Statist. Science, 25, 289-310.
- [59] Smith A.F.M., Makov U.E. (1978) A quasi-Bayes sequential procedure for mixtures, J. Royal Stat. Soc. B, 40, 106-112.