A probabilistic view on predictive constructions for Bayesian learning
Abstract
Given a sequence $X = (X_1, X_2, \ldots)$ of random observations, a Bayesian forecaster aims to predict $X_{n+1}$ based on $(X_1,\ldots,X_n)$ for each $n \ge 0$. To this end, in principle, she only needs to select a collection $\sigma = (\sigma_0, \sigma_1, \ldots)$, called a “strategy” in what follows, where $\sigma_0$ is the marginal distribution of $X_1$ and $\sigma_n$ the $n$-th predictive distribution. Because of the Ionescu-Tulcea theorem, $\sigma$ can be assigned directly, without passing through the usual prior/posterior scheme. One main advantage is that no prior probability has to be selected. In a nutshell, this is the predictive approach to Bayesian learning. A concise review of the latter is provided in this paper. We try to put such an approach in the right framework, to clear up a few misunderstandings, and to provide a unifying view. Some recent results are discussed as well. In addition, some new strategies are introduced and the corresponding distribution of the data sequence is determined. The strategies concern generalized Pólya urns, random change points, covariates and stationary sequences.
1 Introduction
This paper has been written with the following interpretation of Bayesian inference in mind. (We declare this interpretation from the outset just to make our point of view transparent and the paper easier to understand.) Let us refer to the quantity we want to learn as the object of inference. Roughly speaking, it denotes whatever we do not know but would like to know. For instance, it could be a parameter (finite or infinite dimensional), a set of future observations, an unknown probability distribution, the effect of some action, or something else. According to us, the distinguishing feature of the Bayesian approach is to regard the object of inference as the realization of a random element, and not as an unknown but fixed constant. As a consequence, the main goal of any Bayesian inferential procedure is to determine the conditional distribution of the object of inference given the available information.
Note that, unless the object of inference is itself a parameter, no other parameter is necessarily involved.
Prediction of unknown observable quantities is a fundamental part of statistics. Initially, it was probably the most prevalent form of statistical inference. The wind changed at the beginning of the 20th century, when statisticians’ attention shifted to other issues, such as parametric estimation and testing; see e.g. [36]. Nowadays, prediction is back in the limelight, and it plays a role in modern topics including machine learning and data mining; see e.g. [17, 18, 27, 43].
This paper deals with prediction of future observations, based on past ones, from the Bayesian point of view. Precisely, we focus on a sequence
$$X = (X_1, X_2, \ldots)$$
of random observations and, at each time $n$, we aim to predict $X_{n+1}$ based on $(X_1,\ldots,X_n)$. Hence, for each $n$, the object of inference is $X_{n+1}$, the available information is $(X_1,\ldots,X_n)$, and the target is the predictive distribution of $X_{n+1}$ given $(X_1,\ldots,X_n)$. We point out that, apart from technicalities, most of our considerations could be generalized to the case where the object of inference is an arbitrary (measurable) function of the future observations, say $f(X_{n+1}, X_{n+2}, \ldots)$.
This case has recently been the object of increasing attention; see e.g. [29, 40].
No parameter plays a role at this stage. The forecaster may involve some parameter $\theta$, if she thinks it helps, but she is not interested in $\theta$ as such. To involve $\theta$ means to model the probability distribution of $X$ as depending on $\theta$, and then to exploit this fact to calculate the predictive distributions.
To better address our prediction problem, it is convenient to introduce the notion of strategy. Let $(S, \mathcal{B})$ be a measurable space, with $S$ to be viewed as the set where the observations take values. Following Dubins and Savage [26], a strategy is a sequence
$$\sigma = (\sigma_0, \sigma_1, \ldots)$$
such that
• $\sigma_0$ and $\sigma_n(x)$ are probability measures on $\mathcal{B}$ for all $n \ge 1$ and $x \in S^n$;
• The map $x \mapsto \sigma_n(x)(A)$ is measurable for fixed $n \ge 1$ and $A \in \mathcal{B}$.
Here, $\sigma_0$ should be regarded as the marginal distribution of $X_1$ and $\sigma_n(x)$ as the conditional distribution of $X_{n+1}$ given that $(X_1,\ldots,X_n) = x$. Moreover, $\sigma_n(x)(A)$ denotes the value taken at $A$ by the probability measure $\sigma_n(x)$. We also note that strategies are often called prediction rules in the framework of species sampling sequences; see [54, p. 251].
Strategies are a natural tool to frame a prediction problem from the Bayesian standpoint. In fact, a strategy $\sigma$ can be regarded as the collection of all predictive distributions (including the marginal distribution of $X_1$), in the sense that $\sigma_n(X_1,\ldots,X_n)$ is a version of the conditional distribution of $X_{n+1}$ given $(X_1,\ldots,X_n)$ for all $n$. Thus, in a sense, everything a Bayesian forecaster has to do is to select a strategy $\sigma$. Obviously, the problem is how to do it. A related problem is whether, in order to choose $\sigma$, involving a parameter is convenient or not.
An important special case is exchangeability. In fact, if $X$ is assumed to be exchangeable, there is a natural way to involve a parameter $\theta$. To see this, take the parameter space to be
$$\Theta = \{\text{all probability measures on } (S, \mathcal{B})\}.$$
Moreover, for each $\theta \in \Theta$, denote by $M_\theta$ a probability measure which makes $X$ i.i.d. with common distribution $\theta$, i.e.,
$$M_\theta(X_1 \in A_1, \ldots, X_n \in A_n) = \prod_{i=1}^n \theta(A_i)$$
for all $n \ge 1$ and $A_1, \ldots, A_n \in \mathcal{B}$. Then, under mild conditions on $(S, \mathcal{B})$, de Finetti’s theorem yields
$$P(\cdot) = \int_\Theta M_\theta(\cdot)\,\pi(d\theta)$$
for some (unique) prior probability $\pi$ on $\Theta$, where $P$ denotes the probability distribution of $X$. Thus, conditionally on $\theta$, the observations are i.i.d. with common distribution $\theta$. This suggests calculating the strategy $\sigma$ as follows.
(i) Select a prior $\pi$ on $\Theta$;
(ii) For each $n$ and $x = (x_1,\ldots,x_n) \in S^n$, evaluate the posterior $\pi_n(\cdot \mid x)$, namely, the conditional distribution of $\theta$ given that $(X_1,\ldots,X_n) = x$;
(iii) Calculate $\sigma$ as
$$\sigma_0(A) = \int_\Theta \theta(A)\,\pi(d\theta) \quad\text{and}\quad \sigma_n(x)(A) = \int_\Theta \theta(A)\,\pi_n(d\theta \mid x),$$
where $\pi_n(\cdot \mid x)$ is the posterior and $\sigma_n(x)$ is meant as $\sigma_n(x_1,\ldots,x_n)$.
Steps (i)-(ii)-(iii) are familiar in a Bayesian framework. Henceforth, if $\sigma$ is selected via (i)-(ii)-(iii), the forecaster is said to follow the inferential approach (I.A.).
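To fix ideas, the following minimal sketch carries out steps (i)-(ii)-(iii) in the simplest conjugate setting, a Beta-Bernoulli model. The model, the prior parameters and the function names are illustrative choices of ours, not taken from the text or the cited references.

```python
import numpy as np

# Sketch of I.A. for 0/1 data, assuming (for illustration only) a Beta(a, b)
# prior on the success probability theta.

def posterior_params(a, b, data):
    """Step (ii): posterior Beta parameters given observed 0/1 data."""
    data = np.asarray(data)
    return a + data.sum(), b + len(data) - data.sum()

def predictive_prob(a, b, data):
    """Step (iii): P(X_{n+1} = 1 | x_1, ..., x_n), i.e. the posterior mean."""
    a_n, b_n = posterior_params(a, b, data)
    return a_n / (a_n + b_n)

data = [1, 0, 1, 1]
print(predictive_prob(1.0, 1.0, data))   # predictive probability of a further success
```

Even in this toy case, the predictive is obtained only after the detour through the prior and the posterior, which is the point contrasted with P.A. below.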
1.1 Predictive approach to Bayesian modeling
There is another approach to Bayesian prediction, usually called the predictive approach (P.A.), which is quite recurrent in the Bayesian literature and has recently gained increasing attention. (Such an approach, incidentally, has been referred to as the “non-standard approach” in [8, 9]). According to P.A., the forecaster directly selects her strategy $\sigma$. Namely, for each $n$, she selects the predictive $\sigma_n$ without passing through the prior/posterior scheme described above. Among others, P.A. is supported by de Finetti, Savage and Dubins [22, 23, 26] and, more recently, by Diaconis and Regazzini [4, 16, 24, 25, 31]. P.A. is also strictly connected to Dawid’s prequential approach [19, 20, 21] and to Pitman’s treatment of species sampling sequences [54, 55, 56]. In addition, several prediction procedures arising in not necessarily Bayesian frameworks, such as machine learning and data mining, are consistent with P.A.; see e.g. [17, 18, 27, 43]. Some further related references are [8, 9, 29, 30, 32, 40, 41, 44].
The theoretical foundation of P.A. is the Ionescu-Tulcea theorem; see e.g. [46, p. 159]. Roughly speaking, this theorem states that, to assign the joint distribution of $X = (X_1, X_2, \ldots)$, it suffices to choose, in an arbitrary way, the marginal distribution of $X_1$, the conditional distribution of $X_2$ given $X_1$, the conditional distribution of $X_3$ given $(X_1, X_2)$, and so on. Note that this fact would be obvious if $X$ were replaced by a finite dimensional random vector $(X_1,\ldots,X_n)$. So, in a sense, the Ionescu-Tulcea theorem extends to infinite sequences a straightforward property of finite dimensional vectors. In any case, a formal statement of the theorem is as follows.
Theorem 1.
(Ionescu-Tulcea). For each $n \ge 1$, let $X_n$ be the $n$-th coordinate random variable on $(S^\infty, \mathcal{B}^\infty)$. Then, for any strategy $\sigma$, there is a unique probability measure $P$ on $(S^\infty, \mathcal{B}^\infty)$ such that
$$P(X_1 \in \cdot) = \sigma_0(\cdot) \quad\text{and}\quad P\bigl(X_{n+1} \in \cdot \mid X_1 = x_1, \ldots, X_n = x_n\bigr) = \sigma_n(x_1,\ldots,x_n)(\cdot) \tag{1}$$
for all $n \ge 1$ and $P$-almost all $(x_1,\ldots,x_n) \in S^n$.
Because of Theorem 1, to make predictions on the sequence $X$, the forecaster is free to select an arbitrary strategy $\sigma$. In fact, for any $\sigma$, there is a (unique) probability distribution for $X$, denoted above by $P$, whose predictives agree with $\sigma$ in the sense of equation (1).
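Operationally, Theorem 1 means that a data sequence can be generated step by step from any rule mapping the past into a distribution for the next observation. The sketch below illustrates this with a purely illustrative predictive rule on $\{0,1\}$ (the rule itself is our own toy choice, not one discussed in the text).

```python
import numpy as np

rng = np.random.default_rng(0)

# A strategy is any rule (x_1, ..., x_n) -> distribution of X_{n+1}.
# Toy rule on {0, 1}: the next observation equals 1 with probability
# (1 + number of ones observed so far) / (2 + n).
def predictive(past):
    n = len(past)
    p1 = (1 + sum(past)) / (2 + n)
    return np.array([1 - p1, p1])

def sample_sequence(predictive, n_steps):
    """Sequentially sample X_1, X_2, ... from a strategy (Ionescu-Tulcea in action)."""
    past = []
    for _ in range(n_steps):
        probs = predictive(past)
        past.append(int(rng.choice([0, 1], p=probs)))
    return past

print(sample_sequence(predictive, 10))
```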
The strengths and weaknesses of I.A. versus P.A. are discussed in a number of papers; see e.g. [8, 18, 27, 36, 58] and references therein. Here, we summarize this issue (from our point of view) under the assumption that prediction is the main target.
I.A. is not motivated by prediction alone. The main goal of I.A. is to make inference on other features of the data distribution (typically some parameters), and in this case the prior $\pi$ is fundamental. It should be added that $\pi$ often provides various meaningful pieces of information on the data generating process. However, assessing $\pi$ is not an easy task. In addition, once $\pi$ is selected, evaluating the posterior is quite difficult as well. Frequently, the posterior cannot be written in closed form but only approximated numerically. In short, I.A. is a cornerstone of Bayesian inference, but, when prediction is the main target, it is actually quite involved.
In turn, P.A. has essentially four merits. First, P.A. allows the forecaster to avoid an explicit choice of the prior $\pi$. Indeed, when prediction is the main target, why select $\pi$ explicitly? Rather than wondering about $\pi$, it seems reasonable to reflect on how the information in $(X_1,\ldots,X_n)$ is conveyed into the prediction of $X_{n+1}$. Second, no distributional assumption on the data sequence $X$ is required. This point is developed in Subsections 1.2 and 1.3. For now, we stress a consequence of such a point: the Bayesian nature of a prediction procedure does not depend on the data distribution. For instance, a forecaster applying P.A. is certainly Bayesian, independently of the distribution attached to $X$. Third, P.A. requires the assignment of probabilities on observable facts only. The value of $X_{n+1}$ is actually observable, while $\theta$ and $\pi$ (being probabilities on $S$ and on $\Theta$, respectively) do not necessarily deal with observable facts. Fourth, the strategy may be assigned stepwise. At each time $n$, the forecaster has observed $(x_1,\ldots,x_n)$ and has already selected $\sigma_0,\ldots,\sigma_{n-1}$. Then, to predict $X_{n+1}$, she is still free to select $\sigma_n$ as she wants. No choice of $\sigma_n$ is precluded. This is consistent with the Bayesian view, where the observed data are fixed and one should condition on them. In spite of these advantages, P.A. has an obvious drawback: assigning a strategy directly may be very difficult, in principle as difficult as selecting a prior $\pi$.
A last (basic) remark is that, if $X$ is exchangeable, both I.A. and P.A. completely determine the probability distribution of $X$. Selecting a prior $\pi$ or choosing a strategy $\sigma$ are just equivalent routes to fix the distribution of $X$. In particular, selecting $\sigma$ uniquely determines $\pi$. An intriguing line of research is in fact to identify the prior $\pi$ corresponding to a given $\sigma$; see e.g. [10, 24, 25, 31].
1.2 Characterizations
Recall that, for any strategy $\sigma$, there is a unique probability measure $P$ on $(S^\infty, \mathcal{B}^\infty)$ satisfying condition (1).
In principle, when applying P.A., the data sequence $X$ is free to have any probability distribution. Nevertheless, in most applications, it is reasonable (if not mandatory) to impose some conditions on $P$. For instance, the forecaster may wish $X$ to be exchangeable, or stationary, or Markov, or a martingale, and so on. In these cases, $\sigma$ is subject to some constraints. If $X$ is required to be exchangeable, for instance, $\sigma$ should be such that $P$ is exchangeable. Hence, those strategies which make $X$ exchangeable should be characterized.
More generally, fix any collection $\mathcal{C}$ of probability measures on $(S^\infty, \mathcal{B}^\infty)$ and suppose the data distribution is required to belong to $\mathcal{C}$. Then, P.A. gives rise to the following problem:
Problem (*): Characterize those strategies $\sigma$ such that $P \in \mathcal{C}$.
Sometimes, Problem (*) is trivial (Markov sequences, martingales), but sometimes it is not (stationarity, exchangeability). To illustrate, we mention three examples (which correspond to the three dependence forms examined in the sequel).
In the exchangeable case, Problem (*) admits a solution [31, Th. 3.1], but the conditions on $\sigma$ are quite hard to check in real problems. Hence, applying P.A. to exchangeable data is usually difficult (even if there are some exceptions; see Section 2).
A condition weaker than exchangeability is conditional identity in distribution. Say that $X$ is conditionally identically distributed (c.i.d.) if, for each $n \ge 0$, the conditional distribution of $X_k$ given $(X_1,\ldots,X_n)$ is the same for all $k > n$ (the case $n = 0$ meaning that the $X_k$ are identically distributed); see Section 3. It can be shown that
$$X \text{ exchangeable} \iff X \text{ stationary and c.i.d.};$$
see [5, 47]. Hence, conditional identity in distribution can be regarded as one of the two basic ingredients of exchangeability (the other being stationarity). Now, in the c.i.d. case, Problem (*) has been solved [6, Th. 3.1] and the conditions on $\sigma$ are quite simple. The class of admissible strategies includes several meaningful elements which cannot be used if $X$ is required to be exchangeable. As a consequence, P.A. works quite well for c.i.d. data; see [8, 9].
The stationary case is more involved. In fact, to our knowledge, there is no general characterization of the strategies which make $X$ stationary. However, such a characterization is available in some meaningful special cases (e.g. when $X$ is also required to be Markov); see Section 4.
Finally, Problem (*) is usually easier in a few (meaningful) special cases. For instance, Problem (*) is simpler if $X$ is also asked to be Markov; see e.g. [33] and Section 4. Or else, if the strategy is required to be dominated.
Dominated strategies: Let $\lambda$ be a $\sigma$-finite measure on $\mathcal{B}$. Say that a strategy $\sigma$ is dominated by $\lambda$ if each $\sigma_n$ admits a density with respect to $\lambda$, namely,
$$\sigma_0(dy) = f_0(y)\,\lambda(dy) \quad\text{and}\quad \sigma_n(x)(dy) = f_n(x, y)\,\lambda(dy)$$
for all $n \ge 1$ and $x \in S^n$. Here, $f_0$ and $f_n$ are non-negative measurable functions.
For instance, if $S = \mathbb{R}$ and $\sigma_n(x)$ is a non-degenerate normal distribution for all $n$ and $x$, then $\sigma$ is dominated by Lebesgue measure. Or else, if $S$ is countable, any strategy is dominated by counting measure. Instead, if $S$ is uncountable, a non-dominated strategy is $\sigma_n(x) = \delta_{x_n}$, where $\delta_y$ denotes the unit mass at the point $y$. Another non-dominated strategy is the empirical measure
$$\sigma_n(x) = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}.$$
In a sense, dominated strategies play a role analogous to that of the usual dominated models in parametric statistical inference. The main advantage is that one can work with the conditional density $f_n$ instead of the conditional measure $\sigma_n$. A related advantage is that, if one fixes $\lambda$ and restricts to strategies dominated by $\lambda$, Problem (*) becomes simpler. However, even in applied data analysis, various familiar strategies are not dominated. In the framework of species sampling sequences, for instance, most strategies are not dominated. Therefore, in this paper, we focus on general strategies, while the dominated ones are regarded as an important special case.
1.3 Content of this paper and further notation
This is a review paper on P.A. which also includes some (minor) new results. Our perspective is mainly on the probabilistic aspects of Bayesian predictive constructions. Moreover, we tacitly assume that the major target is to predict future observations (and not to make inference on other random elements, such as random parameters).
Essentially, we aim to achieve three goals. First, we try to put P.A. in the right framework, to provide a unifying view, and to clear up a few misunderstandings. This has been done in the Introduction. Second, in Section 2 and Subsection 3.1, we report some known results. Third, we provide some new strategies and we prove a few related results. The strategies, introduced by means of examples, deal with generalized Pólya urns, random change points, covariates and stationary sequences. The results consist in determining the distribution of the data sequence under such strategies. To our knowledge, Examples 7, 9, 12, 14 and Theorems 8, 11, 13 are actually new, while Theorem 6 makes precise a claim contained in [29]. Moreover, as far as we know, Section 4 is the first attempt to develop P.A. for stationary data. It provides a brief discussion of Problem (*) and introduces two large classes of stationary sequences.
As already noted, even if $X$ could potentially be given any distribution, in most applications some conditions on $X$ are required. There is obviously a number of such conditions. Among them, we decided to focus on exchangeability, stationarity and conditional identity in distribution. This choice seems reasonable to keep the paper focused, but of course it leaves out various interesting conditions, such as partial exchangeability. To write a paper of reasonable length, however, some choice was necessary.
To defend our choice, we note that, in addition to being natural in various practical problems, exchangeability is the usual assumption in Bayesian prediction. Hence, taking exchangeability into account is more or less mandatory. Moreover, since $X$ is exchangeable if and only if it is stationary and c.i.d., the other two conditions can be motivated as the basic components of exchangeability. But there are also other reasons for dealing with them. Stationarity is in fact a routine assumption in the classical treatment of time series, and it is reasonable to consider it from the Bayesian point of view as well. Conditional identity in distribution, even if not that popular, seems to be quite suitable for P.A.; see Section 3.
The rest of the paper is organized in three sections, each concerned with a specific assumption on $X$, plus a final section of open problems. All the proofs are gathered in the Appendix.
We close this Introduction with some further notation.
As usual, $\delta_y$ is the unit mass at the point $y$. For each $x \in S^k$, where $k$ is a positive integer or $k = \infty$, we denote by $x_i$ the $i$-th coordinate of $x$. Moreover, we take $X = (X_1, X_2, \ldots)$ to be the sequence of coordinate random variables on $(S^\infty, \mathcal{B}^\infty)$, namely,
$$X_n(x) = x_n \quad\text{for all } x \in S^\infty \text{ and } n \ge 1.$$
From now on, we fix a strategy $\sigma$ and we assume that $X$ is distributed according to the probability measure provided by Theorem 1, which we denote by $P$. Hence, $P$ is a probability measure on $(S^\infty, \mathcal{B}^\infty)$ to be regarded as the distribution of $X$ under the strategy $\sigma$. Finally, to avoid technicalities, $S$ is assumed to be a Borel subset of a Polish space and $\mathcal{B}$ the Borel $\sigma$-field on $S$.
2 Exchangeable data
A permutation of $S^n$ is a map of the form
$$(x_1, \ldots, x_n) \mapsto (x_{\pi(1)}, \ldots, x_{\pi(n)}),$$
where $\pi$ is a fixed permutation of $\{1, \ldots, n\}$. A sequence $X$ of random variables is exchangeable if
$$(X_{\pi(1)}, \ldots, X_{\pi(n)}) \sim (X_1, \ldots, X_n)$$
for all $n$ and all permutations $\pi$ of $\{1, \ldots, n\}$.
As noted in Subsection 1.2, if $X$ is required to be exchangeable, applying P.A. is usually hard. But there are a few exceptions, and two of them are discussed in this section. We first recall that $X$ is a Dirichlet sequence (or a Pólya sequence, see [11]) if
$$\sigma_0 = \alpha \quad\text{and}\quad \sigma_n(x_1,\ldots,x_n) = \frac{c\,\alpha + \sum_{i=1}^n \delta_{x_i}}{c + n},$$
where $c > 0$ is a constant, $\alpha$ a probability measure on $\mathcal{B}$, and $\sigma_n$ is meant as $\sigma_n(X_1,\ldots,X_n)$. The role of Dirichlet sequences is actually huge in various frameworks, including Bayesian nonparametrics, population genetics, ecology, combinatorics and number theory; see e.g. [28, 37, 45, 54, 55, 56]. From our point of view, however, two facts are to be stressed. First, a Dirichlet sequence is exchangeable. Second, being defined through its predictive distributions, a Dirichlet sequence is a natural candidate for P.A.
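Since the Dirichlet sequence is specified directly through its predictives, it can be simulated sequentially without ever touching a prior. The sketch below does so for a nonatomic base measure; the choice of a standard normal base measure is ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_dirichlet_sequence(n, c, base_sampler):
    """Simulate a Dirichlet (Polya) sequence from its predictive rule:
    X_{n+1} is a fresh draw from the base measure with probability c/(c+n),
    and a uniformly chosen past value otherwise (Blackwell-MacQueen urn)."""
    xs = []
    for i in range(n):
        if rng.random() < c / (c + i):
            xs.append(base_sampler())        # new value from the base measure
        else:
            xs.append(xs[rng.integers(i)])   # copy of a past observation
    return xs

xs = sample_dirichlet_sequence(20, c=2.0, base_sampler=rng.standard_normal)
print(len(set(xs)), "distinct values among", len(xs))
```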
2.1 Species sampling sequences
For $n \ge 1$ and $x = (x_1,\ldots,x_n) \in S^n$, denote by $k = k(x)$ the number of distinct values in the vector $x$ and by $x_1^*, \ldots, x_k^*$ such distinct values (in the order in which they appear). Say that $X$ is a species sampling sequence if it is exchangeable, $\sigma_0$ is non-atomic, and
$$\sigma_n(x) = \sum_{j=1}^{k} p_{n,j}(x)\,\delta_{x_j^*} + q_n(x)\,\sigma_0,$$
where the $p_{n,j}$ and $q_n$ are non-negative measurable functions on $S^n$ with $\sum_{j=1}^{k} p_{n,j}(x) + q_n(x) = 1$. Under this strategy, quoting from [42, p. 253], $X$ can be regarded as: “the sequence of species of individuals in a process of sequential random sampling from some hypothetical infinite population of individuals of various species. The species of the first individual to be observed is assigned a random tag distributed according to [$\sigma_0$]. Given the tags of the first [$n$] individuals observed, it is supposed that the next individual is one of the [$j$]-th species observed so far with probability [$p_{n,j}(x)$], and one of a new species with probability [$q_n(x)$]”.
A nice consequence of the definition is that the weights depend on $x$ only through the vector $(N_{n,1}(x), \ldots, N_{n,k}(x))$, where
$$N_{n,j}(x) = \#\{i \le n : x_i = x_j^*\}$$
is the number of times that $x_j^*$ appears in the vector $x$; see [42, 54].
The most popular example of a species sampling sequence is probably the two-parameter Poisson-Dirichlet, introduced by Pitman in [53], which corresponds to the weights
$$p_{n,j}(x) = \frac{N_{n,j}(x) - d}{n + \theta}, \qquad q_n(x) = \frac{\theta + k\,d}{n + \theta},$$
where $d$ and $\theta$ are constants such that: either (i) $0 \le d < 1$ and $\theta > -d$, or (ii) $d < 0$ and $\theta = -m\,d$ for some integer $m$. In this model, if $D$ denotes the number of distinct values appearing in the sequence $X$, one obtains $D = \infty$ a.s. under (i) and $D \le m$ a.s. under (ii). Note also that $X$ reduces to a Dirichlet sequence in the special case $d = 0$.
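The weights above translate directly into a sequential sampling scheme (the generalized Chinese restaurant construction). The sketch below simulates species labels under case (i); the parameter names `d` and `theta` mirror the notation introduced above and are otherwise our own.

```python
import numpy as np

rng = np.random.default_rng(2)

def pitman_yor_species(n, d, theta, base_sampler):
    """Sequentially sample species under the two-parameter Poisson-Dirichlet
    weights: after n_obs observations with k distinct species of multiplicities
    n_1, ..., n_k, the next observation is species j with probability
    (n_j - d)/(theta + n_obs) and a new species with probability
    (theta + k*d)/(theta + n_obs)."""
    species, counts, labels = [], [], []
    for n_obs in range(n):
        k = len(species)
        probs = [(c - d) / (theta + n_obs) for c in counts]
        probs.append((theta + k * d) / (theta + n_obs))
        j = rng.choice(k + 1, p=np.array(probs))
        if j == k:
            species.append(base_sampler())   # tag of a brand new species
            counts.append(1)
        else:
            counts[j] += 1
        labels.append(species[j])
    return labels, counts

labels, counts = pitman_yor_species(50, d=0.5, theta=1.0,
                                    base_sampler=rng.standard_normal)
print("number of distinct species:", len(counts))
```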
Another example, due to [39], is a species sampling sequence whose weights involve an additional parameter. This time, unlike the two-parameter Poisson-Dirichlet case, the number of distinct values appearing in $X$ is a finite but non-degenerate random variable.
In general, to obtain a species sampling sequence, the forecaster needs to select $\sigma_0$ and the weights $p_{n,j}$ and $q_n$. While the choice of $\sigma_0$ is free (apart from non-atomicity), the weights are subject to the constraint that $X$ should be exchangeable. (Incidentally, the choice of the weights is a good example of the difficulty of applying P.A. when $X$ is required to be exchangeable.) The usual method to select the weights involves exchangeable random partitions. Let $\Pi$ be a random partition of $\{1, 2, \ldots\}$. For each $n$, call $\Pi_n$ the restriction of $\Pi$ to $\{1,\ldots,n\}$, namely, the random partition of $\{1,\ldots,n\}$ whose elements are of the form $B \cap \{1,\ldots,n\}$ for some $B \in \Pi$. Say that $\Pi$ is exchangeable if
$$\pi(\Pi_n) \sim \Pi_n$$
for all $n$ and all permutations $\pi$ of $\{1,\ldots,n\}$, where $\pi(\Pi_n)$ denotes the random partition $\{\pi(B) : B \in \Pi_n\}$. For instance, given any sequence $Y = (Y_1, Y_2, \ldots)$ of random variables, define $\Pi$ to be the random partition of $\{1, 2, \ldots\}$ induced by the equivalence relation $i \sim j \iff Y_i = Y_j$. Then, $\Pi$ is exchangeable provided $Y$ is exchangeable. Now, the weights of a species sampling sequence correspond, in a canonical way, to the probability law of an exchangeable partition; see [53, 54]. Hence, choosing the weights essentially amounts to choosing an exchangeable partition. We stop here since a detailed discussion of exchangeable partitions is beyond the scope of this paper. The interested reader is referred to [38, 39, 48, 49, 53, 56] and references therein.
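For concreteness, the partition induced by a finite sequence (the restriction $\Pi_n$ above) can be computed with a few lines; this small helper is our own illustration.

```python
def induced_partition(xs):
    """Partition of {1, ..., n} induced by the relation i ~ j iff x_i = x_j."""
    blocks = {}
    for i, x in enumerate(xs, start=1):
        blocks.setdefault(x, []).append(i)
    return list(blocks.values())

print(induced_partition(["a", "b", "a", "c", "b"]))   # [[1, 3], [2, 5], [4]]
```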
2.2 Kernel based Dirichlet sequences
In [10], to generalize Dirichlet sequences while preserving their main properties, a class of strategies has been introduced. Among other things, such strategies make $X$ exchangeable.
A kernel on $(S, \mathcal{B})$ is a collection
$$K = \{K(x) : x \in S\}$$
such that $K(x)$ is a probability measure on $\mathcal{B}$, for each $x \in S$, and the map $x \mapsto K(x)(A)$ is measurable for each $A \in \mathcal{B}$. Sometimes, to make the notation easier, we will write $K(x, A)$ instead of $K(x)(A)$. A straightforward example of a kernel is $K(x) = \delta_x$ for each $x \in S$.
Fix a probability measure $\nu$ on $\mathcal{B}$, a constant $c > 0$, a kernel $K$ on $(S, \mathcal{B})$, and define the strategy
$$\sigma_0 = \nu \quad\text{and}\quad \sigma_n(x_1,\ldots,x_n) = \frac{c\,\nu + \sum_{i=1}^n K(x_i)}{c + n} \tag{2}$$
for all $n \ge 1$ and $(x_1,\ldots,x_n) \in S^n$. Clearly, $X$ reduces to a Dirichlet sequence if $K(x) = \delta_x$. In this case, we also say that $X$ is a classical Dirichlet sequence.
If $K$ is an arbitrary kernel, $X$ may fail to be exchangeable. However, a useful sufficient condition for exchangeability is available. In fact, $X$ is exchangeable if $K$ agrees with a conditional distribution for $\nu$ given some sub-$\sigma$-field $\mathcal{G} \subset \mathcal{B}$. For instance, if $\mathcal{G} = \mathcal{B}$, then $K(x) = \delta_x$ and $X$ is a classical Dirichlet sequence. At the opposite extreme, if $\mathcal{G}$ is the trivial $\sigma$-field, then $K(x) = \nu$ for all $x$ and $X$ is i.i.d. with common distribution $\nu$. In general, for fixed $\nu$ and $c$, a strategy which makes $X$ exchangeable can be associated with any sub-$\sigma$-field $\mathcal{G}$. It suffices to take $K$ as a conditional distribution for $\nu$ given $\mathcal{G}$.
Example 2.
(Countable partitions). Let $\{H_1, H_2, \ldots\}$ be a (non-random) countable partition of $S$ such that $H_j \in \mathcal{B}$ and $\nu(H_j) > 0$ for all $j$. For $x \in S$, denote by $H(x)$ the only $H_j$ such that $x \in H_j$. The conditional distribution for $\nu$ given the sub-$\sigma$-field generated by the partition is
$$K(x)(A) = \nu\bigl(A \mid H(x)\bigr) = \frac{\nu\bigl(A \cap H(x)\bigr)}{\nu\bigl(H(x)\bigr)}.$$
Hence, $X$ is exchangeable whenever
$$\sigma_n(x_1,\ldots,x_n) = \frac{c\,\nu + \sum_{i=1}^n \nu\bigl(\cdot \mid H(x_i)\bigr)}{c + n}.$$
Some remarks on the above strategy are in order.
• The above strategy may be reasonable when the basic information provided by each observation $x$ is $H(x)$, namely, the element of the partition including $x$.
• If $S$ is countable, each sub-$\sigma$-field of $\mathcal{B}$ is generated by a partition of $S$. Hence, $K$ is necessarily as above.
• $\sigma_n(x)$ is absolutely continuous with respect to $\nu$ for all $n \ge 1$ and $x \in S^n$. This is a striking difference with classical Dirichlet sequences. To make an example, call $\sigma^*$ the strategy obtained by replacing $K(x_i)$ with $\delta_{x_i}$ in (2). Under $\sigma^*$, $X$ is a classical Dirichlet sequence. Moreover, suppose $\nu$ is nonatomic and define the set $A_n(x) = \{x_1,\ldots,x_n\}$ for each $x \in S^n$. Since $\nu$ is nonatomic and $A_n(x)$ is finite,
$$\sigma_n(x)\bigl(A_n(x)\bigr) = 0.$$
On the other hand, since $\delta_{x_i}\bigl(A_n(x)\bigr) = 1$ for each $i \le n$,
$$\sigma^*_n(x)\bigl(A_n(x)\bigr) \ge \frac{n}{c + n}.$$
As a consequence, one obtains that ties among the observations have probability zero under $\sigma$, while they occur with positive probability under $\sigma^*$.
• The strategy can be generalized by replacing $K(x) = \nu(\cdot \mid H(x))$ with
$$K(x) = \begin{cases} \delta_x & \text{if } x \in D,\\[2pt] \nu\bigl(\cdot \mid H(x)\bigr) & \text{if } x \notin D,\end{cases}$$
where $D \in \mathcal{B}$ is a suitable set. Note that this kernel reduces to the previous one if $D = \emptyset$. Roughly speaking, such a kernel is reasonable in those problems where there is a set $D$ such that $x$ is informative about the future observations only if $x \in D$. Otherwise, if $x \notin D$, the only relevant information provided by $x$ is $H(x)$. As a trivial example, take $S = \mathbb{R}$, the partition given by the two half-lines $(-\infty, 0)$ and $[0, \infty)$, and $D$ the set of points far enough from the origin. Then, this kernel is reasonable if $x$ is informative only if $x \in D$. Otherwise, if $x \notin D$, the only meaningful information provided by $x$ is its sign.
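A minimal sketch of the strategy (2) with the partition kernel of Example 2 is given below. Here we assume, purely for illustration, that $S = \mathbb{R}$, $\nu$ is standard normal and the partition consists of the two half-lines determined by the sign; the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(3)

def cell(x):
    """Index of the partition cell containing x: 0 = (-inf, 0), 1 = [0, inf)."""
    return 0 if x < 0 else 1

def sample_from_nu_given_cell(j):
    """Draw from nu = N(0,1) conditioned on the cell j (a half-normal, by symmetry)."""
    z = abs(rng.standard_normal())
    return -z if j == 0 else z

def sample_kernel_dirichlet(n, c):
    """X_{n+1} ~ (c*nu + sum_i nu(. | cell of X_i)) / (c + n)."""
    xs = []
    for i in range(n):
        if rng.random() < c / (c + i):
            xs.append(rng.standard_normal())                   # draw from nu
        else:
            past = xs[rng.integers(i)]                          # pick a past point
            xs.append(sample_from_nu_given_cell(cell(past)))    # resample within its cell
    return xs

xs = sample_kernel_dirichlet(25, c=1.0)
print(len(set(xs)), "distinct values (ties have probability zero here)")
```

Unlike the classical Dirichlet case, the simulated values never repeat, which is the absolute-continuity point made in the third remark above.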
Example 3.
(Pólya urns). Some Pólya urns are covered by Example 2. It follows that, for such urns, the sequence of observed colors is exchangeable. To our knowledge, this fact was previously unknown.
As an example, consider sequential draws from an urn and denote by $X_n$ the color of the ball extracted at time $n$. At time $n = 0$, the urn contains a given number of balls of each color. Fix a partition of the set of colors. The sampling scheme is as follows: after each draw, the extracted ball is replaced together with additional balls whose colors all belong to the same element of the partition as the observed color, in proportion to the initial composition of the urn within that element. In other terms, if the observed color belongs to a given element of the partition, each color in that element is reinforced (and not only the observed color). In particular, after each draw, the same total number of new balls is added to the urn. Hence, denoting by $\sigma$ the strategy of Example 2 with $\nu$ proportional to the initial composition of the urn, one obtains that the sequence of observed colors is distributed according to $P$.
If $\sigma$ is the strategy (2), in addition to exchangeability, $X$ satisfies various other properties of classical Dirichlet sequences. We refer to [10] for details. Here, we just note that the prior and the posterior can be explicitly determined. In particular, up to replacing the point masses $\delta_{Z_j}$ with $K(Z_j)$, Sethuraman’s representation of the prior (see [57]) is still true. Precisely, the prior is the probability distribution of a random probability measure of the form
$$\widetilde{P} = \sum_{j=1}^{\infty} V_j\,K(Z_j),$$
where:
• $(V_j)$ and $(Z_j)$ are independent sequences of random variables;
• $(Z_j)$ is i.i.d. with common distribution $\nu$;
• $V_j = B_j \prod_{i<j}(1 - B_i)$ for all $j$, where $(B_j)$ is i.i.d. with common distribution beta$(1, c)$. Namely, $(V_j)$ has the stick breaking distribution with parameter $c$.
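As a numerical companion to the representation above, the sketch below generates a (truncated) stick-breaking random measure. The truncation level and the standard normal choice of the distribution of the $Z_j$ are illustrative assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(4)

def stick_breaking_weights(c, n_atoms):
    """V_j = B_j * prod_{i<j}(1 - B_i), with B_i i.i.d. Beta(1, c)."""
    betas = rng.beta(1.0, c, size=n_atoms)
    return betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))

def random_measure_atoms(c, base_sampler, n_atoms=1000):
    """Truncated Sethuraman-type representation: sum_j V_j * delta_{Z_j};
    in the kernel-based case, delta_{Z_j} is replaced by K(Z_j)."""
    weights = stick_breaking_weights(c, n_atoms)
    atoms = np.array([base_sampler() for _ in range(n_atoms)])
    return atoms, weights

atoms, weights = random_measure_atoms(c=2.0, base_sampler=rng.standard_normal)
print("total mass captured by 1000 atoms:", weights.sum())
```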
3 Conditionally identically distributed data
A sequence $X$ of random variables is conditionally identically distributed (c.i.d.) if
$$P\bigl(X_k \in \cdot \mid X_1, \ldots, X_n\bigr) = P\bigl(X_{n+1} \in \cdot \mid X_1, \ldots, X_n\bigr) \quad\text{a.s.}$$
for all $k > n \ge 0$. A c.i.d. sequence is identically distributed. It is also asymptotically exchangeable, in the sense that, as $n \to \infty$, the probability distribution of the shifted sequence $(X_{n+1}, X_{n+2}, \ldots)$ converges weakly to an exchangeable law. Moreover, as already stressed, $X$ is exchangeable if and only if it is stationary and c.i.d.
C.i.d. sequences have been introduced in [5, 47] and then investigated or applied in various papers; see e.g. [1, 2, 6, 7, 8, 9, 14, 15, 29, 30, 34].
There are reasons for taking c.i.d. data into account in Bayesian prediction. In fact, in a sense, c.i.d. sequences have been introduced with prediction in mind. If $X$ is c.i.d., at each time $n$, the future observations are identically distributed given the past, and this is reasonable in several prediction problems. Examples arise in clinical trials, generalized Pólya urns, species sampling models, survival analysis and disease surveillance; see [1, 2, 5, 8, 9, 14, 15, 29, 30, 35]. A further reason for assuming the c.i.d. condition is that the asymptotics is very close to that of exchangeable sequences. As a consequence, a meaningful part of the usual Bayesian machinery can be developed under the sole assumption that $X$ is c.i.d.; see [29]. Finally, the strategies which make $X$ c.i.d. can be easily characterized; see Theorem 15 in the Appendix. Hence, unlike the exchangeable case, P.A. can be easily implemented for c.i.d. data. A number of interesting strategies, which cannot be used if $X$ is required to be exchangeable, become available if $X$ is only asked to be c.i.d.; see e.g. [8, 9].
As a concrete example, fix a constant $q \in (0,1)$ and a probability measure $\nu$ on $\mathcal{B}$, and define
$$\sigma_0 = \nu \quad\text{and}\quad \sigma_n(x_1,\ldots,x_n) = q^n\,\nu + (1-q)\sum_{i=1}^n q^{n-i}\,\delta_{x_i} \tag{3}$$
for all $n \ge 1$ and $(x_1,\ldots,x_n) \in S^n$. Using $\sigma$ to make predictions corresponds to exponential smoothing. It may be reasonable when the forecaster has only vague opinions on the dependence structure of the data, and yet she feels that the weight of the $i$-th observation should be a decreasing function of $n - i$. In this case, $X$ is not exchangeable, since $\sigma_n(x_1,\ldots,x_n)$ is not invariant under permutations of $(x_1,\ldots,x_n)$, but it can be easily seen to be c.i.d.; see [8, Ex. 7].
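A minimal sketch of sampling from the exponential-smoothing predictive (3) is given below; the standard normal choice of $\nu$ is an illustrative assumption of ours.

```python
import numpy as np

rng = np.random.default_rng(5)

def smoothing_weights(n, q):
    """Weights of sigma_n = q^n * nu + (1-q) * sum_i q^(n-i) * delta_{x_i}."""
    past_w = (1 - q) * q ** np.arange(n - 1, -1, -1)   # weight of delta_{x_i}, i = 1..n
    return q ** n, past_w                               # (weight of nu, weights of past points)

def sample_next(xs, q, base_sampler):
    """Draw X_{n+1} from the exponential-smoothing predictive."""
    w_nu, w_past = smoothing_weights(len(xs), q)
    if rng.random() < w_nu:
        return base_sampler()
    j = rng.choice(len(xs), p=w_past / w_past.sum())
    return xs[j]

xs = [0.3, -1.2, 0.7]
print(sample_next(xs, q=0.8, base_sampler=rng.standard_normal))
```

Note the recursive form $\sigma_n = q\,\sigma_{n-1} + (1-q)\,\delta_{x_n}$, which anticipates the updates of Subsection 3.1.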
In this section, following [8, 9], P.A. is applied to c.i.d. data. We first report some known strategies (Subsection 3.1) and then we introduce two new strategies which make $X$ c.i.d. (Subsection 3.2).
3.1 Fast recursive update of predictive distributions
A possible condition for a strategy $\sigma$ is
$$\sigma_n(x_1,\ldots,x_n) = g_n\bigl(\sigma_{n-1}(x_1,\ldots,x_{n-1}),\, x_n\bigr) \tag{4}$$
for all $n \ge 1$ and $(x_1,\ldots,x_n) \in S^n$, where $x_n$ denotes the $n$-th observation and $g_n$ is a fixed updating rule.
Under (4), the predictive $\sigma_n$ is just a recursive update of the previous predictive $\sigma_{n-1}$ and the last observation $x_n$. Recursive properties of this type are useful in applications. They have a long history (see e.g. [51, 52, 59]) and have been recently investigated in [41].
For each $n \ge 1$, let $\alpha_n : S \to [0,1]$ be a measurable function and $K_n$ a kernel on $(S,\mathcal{B})$. Moreover, fix a probability measure $\sigma_0$ on $\mathcal{B}$ and define a strategy $\sigma$ through the recursive equations
$$\sigma_n(x_1,\ldots,x_n) = \alpha_n(x_n)\,\sigma_{n-1}(x_1,\ldots,x_{n-1}) + \bigl(1 - \alpha_n(x_n)\bigr)\,K_n(x_n) \tag{5}$$
for all $n \ge 1$ and $(x_1,\ldots,x_n) \in S^n$. Since $\sigma_n$ is a convex combination of the previous predictive $\sigma_{n-1}$ and the kernel $K_n(x_n)$, which depends only on $x_n$, the strategy $\sigma$ satisfies condition (4). The obvious interpretation is that, at time $n$, after observing $x_n$, the next observation is drawn from $\sigma_{n-1}$ with probability $\alpha_n(x_n)$ and from $K_n(x_n)$ with probability $1 - \alpha_n(x_n)$.
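The following sketch implements one pass of a recursion of type (5) on a finite grid. The constant weights, the Gaussian smoothing kernel and the grid are illustrative choices of ours, not prescribed by the text.

```python
import numpy as np

grid = np.linspace(-4, 4, 201)

def kernel(x, bandwidth=0.5):
    """Discretized Gaussian kernel K(x): a probability vector on the grid."""
    dens = np.exp(-0.5 * ((grid - x) / bandwidth) ** 2)
    return dens / dens.sum()

def update(sigma_prev, x_new, w):
    """One recursive step: convex combination of sigma_{n-1} and K(x_new)."""
    return w * sigma_prev + (1 - w) * kernel(x_new)

sigma = kernel(0.0, bandwidth=2.0)            # initial predictive sigma_0 (diffuse)
for n, x in enumerate([0.4, -1.1, 0.9], start=1):
    sigma = update(sigma, x, w=n / (n + 1))    # weight of the old predictive grows with n
print("predictive mass on (-2, 2):", sigma[(grid > -2) & (grid < 2)].sum())
```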
An example of a strategy satisfying equation (5) is Newton’s algorithm [51, 52]. More precisely, Newton’s algorithm aims to estimate the latent distribution in a mixture model rather than to make predictions. However, if reinterpreted as a prediction rule, Newton’s algorithm corresponds to a strategy $\sigma$, and such a $\sigma$ meets equation (5) for a suitable choice of the weights and kernels; see e.g. [35, p. 1095]. Moreover, as shown in [35], $\sigma$ makes $X$ c.i.d.
The strategies satisfying equation (5) are investigated in [9]. Under such strategies, $X$ is usually not exchangeable, but it is c.i.d. under some conditions on the kernels $K_n$. Precisely, $X$ is c.i.d. if, for each $n$, $K_n$ agrees with a conditional distribution for $K_{n+1}$ given the $n$-th element of some filtration, i.e., some increasing sequence of sub-$\sigma$-fields of $\mathcal{B}$. This condition is trivially true if the kernels do not depend on $n$ (just take a constant filtration).
Example 4.
(Finer countable partitions). For each $n$, let $\{H_{n,1}, H_{n,2}, \ldots\}$ be a countable partition of $S$ such that $H_{n,j} \in \mathcal{B}$ and $\nu(H_{n,j}) > 0$ for all $j$. Suppose that the partition at time $n+1$ is finer than the partition at time $n$, for all $n$. Define $\sigma$ through equation (5) with
$$K_n(x) = \nu\bigl(\cdot \mid H_n(x)\bigr),$$
where $H_n(x)$ denotes the only $H_{n,j}$ such that $x \in H_{n,j}$. The kernel $K_n$ is the conditional distribution for $\nu$ given $\mathcal{G}_n$, where $\mathcal{G}_n$ is the $\sigma$-field generated by the $n$-th partition. Since the $(n+1)$-th partition is finer than the $n$-th one, one obtains $\mathcal{G}_n \subset \mathcal{G}_{n+1}$. Hence, $X$ is c.i.d. Note also that the partitions could be chosen in such a way that, as $n \to \infty$, they shrink to the partition of $S$ into singletons.
For instance, in Example 2, suppose the forecaster wants to replace the fixed partition with a sequence of finer and finer partitions. This is possible at the price of having $X$ c.i.d. instead of exchangeable: it suffices to use equation (5) with the kernels $K_n$ above and suitable weights. Similarly, to decrease the impact of the observed data while preserving the c.i.d. condition, the strategy (3) could be modified by replacing each point mass $\delta_{x_i}$ with a smoothed version of the type $K(x_i)$.
We next turn to a strategy introduced in [41]. Once again, under this strategy, the data are c.i.d. but not necessarily exchangeable.
Example 5.
(Hahn, Martin and Walker; Copulas). In this example, $S = \mathbb{R}$ and “density function” means “density function with respect to Lebesgue measure”. A bivariate copula is a distribution function on $[0,1]^2$ whose marginals are uniform on $[0,1]$. The density function of a bivariate copula, provided it exists, is said to be a copula density.
In [41], in order to realize condition (4), the following updating rule is introduced. Fix a density $f_0$ and a sequence $(c_1, c_2, \ldots)$ of bivariate copula densities. For the sake of simplicity, we assume $f_0 > 0$ and $c_n > 0$ for all $n$. For $n = 0$, define $\sigma_0$ as the probability measure with density $f_0$ and call $F_0$ the distribution function corresponding to $f_0$. Then, for each $x \in \mathbb{R}$, define
$$f_1(x) = c_1\bigl(F_0(x), F_0(x_1)\bigr)\,f_0(x).$$
In general, for each $n \ge 1$ and $(x_1,\ldots,x_n) \in \mathbb{R}^n$, suppose $\sigma_{n-1}(x_1,\ldots,x_{n-1})$ has been defined and denote by $f_{n-1}$ and $F_{n-1}$ the density and the distribution function of $\sigma_{n-1}(x_1,\ldots,x_{n-1})$. Then, for all $x \in \mathbb{R}$, one can define
$$f_n(x) = c_n\bigl(F_{n-1}(x), F_{n-1}(x_n)\bigr)\,f_{n-1}(x), \tag{6}$$
where $f_n$ is taken as the density of $\sigma_n(x_1,\ldots,x_n)$.
Equation (6) defines a strategy dominated by Lebesgue measure.
In [41] (but not here) the $c_n$ are also required to be symmetric. Furthermore, in [41], equation (6) is not necessarily viewed as a method for obtaining a strategy but is deduced as a consequence of exchangeability. From our point of view, instead, equation (6) defines a strategy, which we call HMW’s strategy.
Under HMW’s strategy, $X$ is not necessarily exchangeable, even if the $c_n$ are symmetric and (in some sense) approach the independence copula density as $n \to \infty$. To see this, recall that $X$ is i.i.d. if and only if it is exchangeable and $X_2$ is independent of $X_1$. In turn, $X_2$ is independent of $X_1$ if $c_1$ is the independence copula density (i.e., $c_1(u, v) = 1$ for all $u, v$). Therefore, $X$ fails to be exchangeable whenever $c_1$ is the independence copula density and $X$ is not i.i.d. However, as noted in [29], $X$ turns out to be c.i.d.
Theorem 6.
If is HMW’s strategy, then is c.i.d.
3.2 Further examples
In the next example, the data are exchangeable until a (random) time $\tau$ and then go on so as to form a c.i.d. sequence. The time $\tau$ should be regarded as the first time when something meaningful happens, possibly something modifying the nature of the observed phenomenon. Even if apparently involved, the example could find some applications; for instance, to model censored survival times, with $\tau$ the first time when a given number of survival times is observed.
Example 7.
(Change points). A predictable stopping time is a function $\tau$ on $S^\infty$, with values in $\{1, 2, \ldots\} \cup \{\infty\}$, satisfying
$$\{\tau = n\} = \bigl\{(x_1,\ldots,x_{n-1}) \in A_{n-1}\bigr\} \tag{7}$$
for some set $A_{n-1} \in \mathcal{B}^{n-1}$. Basically, condition (7) means that the event $\{\tau = n\}$ depends only on $(x_1,\ldots,x_{n-1})$. Similarly, $\{\tau > n\}$ depends only on $(x_1,\ldots,x_{n-1})$. Therefore, for all $n$, the indicators of $\{\tau = n\}$ and $\{\tau > n\}$ depend on $(x_1,\ldots,x_{n-1})$ but not on the subsequent coordinates.
Fix a predictable stopping time $\tau$ and a strategy which makes $X$ exchangeable. Moreover, as in Subsection 3.1, fix measurable weights and kernels as in equation (5). Then, define a new strategy which agrees with the exchangeable one at all times before $\tau$ and, from time $\tau$ onwards, updates the predictive through a recursion of the form (5). In the Appendix, it is shown that:
Theorem 8.
The above strategy makes $X$ c.i.d. Moreover, if the sets involved in condition (7) are invariant under permutations of their coordinates, then $X$ is exchangeable conditionally on $\tau$. Precisely, for every $n$, the conditional distribution of $(X_1,\ldots,X_n)$ given $\tau$ is invariant under permutations of $(X_1,\ldots,X_n)$.
Theorem 8 is still valid if the strategy is defined differently at the times subsequent to $\tau$. For instance, given a countable partition of $S$ as in Example 2, the conclusions of Theorem 8 remain true if, after time $\tau$, each predictive reinforces the conditional distribution of the base measure given the partition cell of the observations, rather than the observations themselves.
Censored survival times are a possible application. Suppose that each observation is a pair $(t, d)$, where $t$ is the survival time of the item, or the time when the item leaves the trial, according to whether $d = 1$ or $d = 0$. In this framework, $\tau$ could be the first time when a fixed number of survival times is observed (with the usual convention that $\tau = \infty$ if such a time never occurs). Finally, the strategy up to time $\tau$ could be as in Subsection 2.2. In fact, classical Dirichlet sequences are a quite popular model for censored survival times, but they have the drawback of ties. This drawback may be overcome if the strategy is of the form (2), where the kernel satisfies the conditions of Subsection 2.2 and the relevant measures are nonatomic.
So far, the $n$-th predictive distribution has been meant as the conditional distribution of $X_{n+1}$ given $(X_1,\ldots,X_n)$. But the information available at time $n$ is often strictly larger than $(X_1,\ldots,X_n)$. To model this situation, we suppose we observe the sequence
$$(X_1, Z_1, X_2, Z_2, \ldots),$$
where $Z = (Z_1, Z_2, \ldots)$ is any sequence of random variables. The $Z_n$ can be regarded as covariates. At each time $n$, the forecaster aims to predict $X_{n+1}$ based on $(X_1, Z_1, \ldots, X_n, Z_n)$. She is not interested in the $Z_n$ as such, but they cannot be neglected since they are informative on $X_{n+1}$. Moreover, she wants $X$ to be c.i.d. and unconstrained as much as possible. One solution could be a strategy which makes the whole observed sequence c.i.d. However, if the whole sequence is c.i.d., both $X$ and $Z$ are marginally c.i.d., and having $Z$ c.i.d. may be unwelcome. In the next example, $X$ is c.i.d. but $Z$ is not. In addition, $X$ satisfies a condition stronger than the c.i.d. one, that is,
$$P\bigl(X_k \in \cdot \mid X_1, Z_1, \ldots, X_n, Z_n\bigr) = P\bigl(X_{n+1} \in \cdot \mid X_1, Z_1, \ldots, X_n, Z_n\bigr) \tag{8}$$
a.s. for all $k > n \ge 0$; see [5].
Example 9.
(Covariates). Let and
a bounded strictly increasing sequence of real numbers. Take as the probability distribution of where
Similarly, for each and
take as the probability distribution of where
Then, $Z$ is not c.i.d. while $X$ satisfies condition (8). Furthermore, arguing as in [9, Sect. 4], the normal distribution could be replaced by any symmetric stable law.
To see that $Z$ is not c.i.d., just note that $Z$ fails to be identically distributed. To prove condition (8), take a collection of independent standard normal random variables and define an auxiliary sequence as follows:
where and
It is not hard to verify that . Hence, it suffices to prove (8) with in the place of , and this can be done as in [5, Ex. 1.2]. We omit the explicit calculations.
4 Stationary data
A sequence $X$ of random variables is stationary if
$$(X_{1+k}, X_{2+k}, \ldots) \sim (X_1, X_2, \ldots) \quad\text{for all } k \ge 1.$$
In the non-Bayesian approaches to prediction, stationarity is a classical assumption. In a Bayesian framework, instead, stationarity seems to be less popular. In particular, to our knowledge, there is no systematic treatment of P.A. for stationary data. This section aims to fill this gap and begins an investigation of P.A. when $X$ is required to be stationary. It is just a preliminary step, and much more work remains to be done.
After some general remarks on Problem (*), two large classes of stationary sequences will be introduced. Incidentally, these two classes may look unusual to a Bayesian forecaster. We do not know whether this is true, but we recall that P.A. is consistent with any probability distribution for $X$. Hence, in a Bayesian framework, using data coming from such classes is certainly admissible.
If $X$ is required to be stationary, for P.A. to apply, the strategies which make $X$ stationary should be characterized. Hence, one comes across Problem (*) with $\mathcal{C}$ the class of stationary probability measures on $(S^\infty, \mathcal{B}^\infty)$. This version of Problem (*) is quite hard, and we are not aware of any general solution; see e.g. [12, 50] and references therein. Fortunately, however, Problem (*) is simple (or even trivial) in a few special cases. As an example, a strategy $\sigma$ makes $X$ a stationary (first order) Markov chain if and only if
$$\sigma_n(x_1,\ldots,x_n) = \sigma_1(x_n) \quad\text{and}\quad \int_S \sigma_1(x)(A)\,\sigma_0(dx) = \sigma_0(A)$$
for all $n$, all $A \in \mathcal{B}$ and $P$-almost all $(x_1,\ldots,x_n)$; that is, the predictive depends only on the last observation and $\sigma_0$ is invariant for the resulting transition kernel. Even if obvious, this fact has a useful practical consequence. If the data are required to be stationary and Markov, in order to make Bayesian predictions, applying P.A. is straightforward.
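In the finite-state case the condition above is just an invariance check, as in the following minimal sketch; the two-state transition matrix is an illustrative assumption of ours.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.3, 0.7]])                  # sigma_1: one-step transition kernel on {0, 1}

def invariant_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to a probability vector."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

sigma0 = invariant_distribution(P)
print(sigma0)                                # [0.75, 0.25]
print(np.allclose(sigma0 @ P, sigma0))       # True: this sigma_0 makes the chain stationary
```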
Another remark is that, unlike the exchangeable case, a finite dimensional stationary random vector can always be extended to an (infinite) stationary sequence. To formalize this fact, we first recall that the probability distribution of the random vector $(X_1,\ldots,X_{k+1})$ is completely determined by $(\sigma_0,\ldots,\sigma_k)$.
Lemma 10.
Fix $k \ge 1$, select $\sigma_0, \ldots, \sigma_k$ and define
$$\sigma_n(x_1,\ldots,x_n) = \sigma_k(x_{n-k+1},\ldots,x_n)$$
for all $n > k$ and $(x_1,\ldots,x_n) \in S^n$. Then, $X$ is stationary provided $(X_1,\ldots,X_k) \sim (X_2,\ldots,X_{k+1})$.
Lemma 10 is probably well known, but again we do not know of any explicit reference. Anyway, the proof is straightforward. It suffices to note that, under the strategy of Lemma 10, $X_{n+1}$ is conditionally independent of $(X_1,\ldots,X_{n-k})$ given $(X_{n-k+1},\ldots,X_n)$.
A last remark is that Problem (*) admits an obvious solution for dominated strategies. In this case, incidentally, Problem (*) can be easily solved even for exchangeable data.
Theorem 11.
Let $\lambda$ be a $\sigma$-finite measure on $\mathcal{B}$ and $\sigma$ a strategy dominated by $\lambda$, say
$$\sigma_0(dy) = f_0(y)\,\lambda(dy) \quad\text{and}\quad \sigma_n(x_1,\ldots,x_n)(dy) = f_n(x_1,\ldots,x_n, y)\,\lambda(dy)$$
for all $n \ge 1$ and $(x_1,\ldots,x_n) \in S^n$. Define
$$g_n(x_1,\ldots,x_n) = f_0(x_1)\,f_1(x_1, x_2)\cdots f_{n-1}(x_1,\ldots,x_n)$$
for all $n \ge 1$ and $(x_1,\ldots,x_n) \in S^n$. Then,
• $X$ is stationary if and only if
$$\int_S g_{n+1}(y, x_1,\ldots,x_n)\,\lambda(dy) = g_n(x_1,\ldots,x_n)$$
for all $n \ge 1$ and $\lambda^n$-almost all $(x_1,\ldots,x_n)$.
• $X$ is exchangeable if and only if
$$g_n(x_{\pi(1)},\ldots,x_{\pi(n)}) = g_n(x_1,\ldots,x_n)$$
for all $n \ge 1$, all permutations $\pi$ of $\{1,\ldots,n\}$ and $\lambda^n$-almost all $(x_1,\ldots,x_n)$.
The proof of Theorem 11 is given in the Appendix.
We finally give two examples. In both, is a stationary Markov sequence, possibly of order greater than 1.
Example 12.
(Generalized autoregressive sequences). Let $S = \mathbb{R}$. Fix a probability measure $\nu$ on $\mathcal{B}$ and a measurable function $f : \mathbb{R} \to \mathbb{R}$. Define
$$\sigma_n(x_1,\ldots,x_n) = \text{probability distribution of } f(x_n) + \epsilon,$$
where $\epsilon$ is a real random variable such that $\epsilon \sim \nu$. Suppose now that
$$\mu = \text{probability distribution of } f(Y) + \epsilon, \quad\text{with } Y \sim \mu \text{ and } \epsilon \sim \nu \text{ independent}, \tag{9}$$
for some probability measure $\mu$ on $\mathcal{B}$. Then, $X$ is a stationary Markov chain provided $\sigma_0 = \mu$.
Note that, under this strategy, $X$ is distributed as any sequence satisfying
$$X_{n+1} = f(X_n) + \epsilon_{n+1},$$
where $(\epsilon_n)$ is i.i.d. with common distribution $\nu$, independent of $X_1$, and $X_1 \sim \mu$. Thus, $\nu$ can be regarded as the distribution of the “errors” and $\mu$ as the marginal distribution of the observations. For instance, the usual Gaussian (first order) autoregressive processes correspond to $f(x) = b\,x$, $\nu = \mathcal{N}(0, s^2)$ and $\mu = \mathcal{N}\bigl(0, s^2/(1-b^2)\bigr)$, where $b \in (-1,1)$ and $s > 0$ are constants.
To make the above argument concrete, the following problem is to be solved: for fixed $\nu$ and $f$, give conditions for the existence of $\mu$ satisfying equation (9). More importantly, give an explicit formula for $\mu$ provided it exists. We next focus on this problem in the (meaningful) special case where $\nu$ is a symmetric stable law.
Let $a \in (0, 2]$ be a constant and $U$ a real random variable with characteristic function
$$E\bigl[e^{itU}\bigr] = \exp\bigl(-|t|^a\bigr).$$
(The exponent is usually denoted by $\alpha$, but this notation is not adopted here to avoid confusion with notation used elsewhere in the paper.) For $a \in (0,2]$ and $c > 0$, denote by $\mathcal{S}(a, c)$ the probability distribution of $c\,U$, namely
$$\mathcal{S}(a, c)(A) = P(c\,U \in A).$$
The probability measure $\mathcal{S}(a, c)$ is said to be a symmetric stable law with exponent $a$ and scale $c$. Note that $\mathcal{S}(2, c) = \mathcal{N}(0, 2c^2)$ and $\mathcal{S}(1, c)$ is the Cauchy distribution with density $x \mapsto c/\bigl(\pi(c^2 + x^2)\bigr)$ (the standard Cauchy distribution corresponds to $a = 1$ and $c = 1$).
Theorem 13.
Let $b \in (-1, 1)$ be a constant. If $f(x) = b\,x$ and $\nu = \mathcal{S}(a, c)$, then equation (9) is satisfied by
$$\mu = \mathcal{S}\Bigl(a,\ \frac{c}{(1 - |b|^a)^{1/a}}\Bigr).$$
By Theorem 13, which is proved in the Appendix, one obtains (first order) stationary autoregressive processes with any symmetric stable marginal distribution.
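A simulation sketch of such a process is given below. The errors are drawn with the standard Chambers-Mallows-Stuck formula for symmetric stable laws, and the error scale is chosen so that the marginal scale stays constant along the path; the parametrization details are reconstructed under the standard stable scaling property and should be read as illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

def sym_stable(alpha, scale, size):
    """Symmetric alpha-stable draws (Chambers-Mallows-Stuck, beta = 0)."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    x = (np.sin(alpha * u) / np.cos(u) ** (1 / alpha)
         * (np.cos((1 - alpha) * u) / w) ** ((1 - alpha) / alpha))
    return scale * x

def stable_ar1(n, b, alpha, scale=1.0):
    """X_{k+1} = b * X_k + eps_{k+1}, with the error scale chosen so that
    every X_k keeps the same symmetric stable marginal law."""
    eps_scale = scale * (1 - abs(b) ** alpha) ** (1 / alpha)
    x = sym_stable(alpha, scale, 1)[0]                 # X_1 from the stationary marginal
    path = [x]
    for e in sym_stable(alpha, eps_scale, n - 1):
        x = b * x + e
        path.append(x)
    return np.array(path)

path = stable_ar1(1000, b=0.6, alpha=1.5)
print(np.median(np.abs(path)))                         # heavy-tailed but stationary
```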
Example 14.
(Markov sequences of arbitrary order). Let be a -finite measure on . Fix and a measurable function on such that and . Given , define a further function via cyclic permutations of , namely
for all . Such a is still a density with respect to (since ) and satisfies
(10) |
Next, define
for all , and
for all and . Finally, define a strategy dominated by as
if and , and
if , and . Under , a density of is given by . By equation (10),
and this in turn implies
Therefore, is stationary because of Lemma 10. Note also that is a Markov sequence of order .
5 Concluding remarks and open problems
When prediction is the main target, P.A. has some advantages with respect to I.A. This is only our opinion, obviously, and we have tried to support it throughout this paper. Even if one agrees, however, some further work is needed to make P.A. a concrete tool. We close this paper with a brief list of open problems and possible hints for future research.
• In various applications, the available information strictly includes the past observations of the variable to be predicted. For instance, as in Example 9, suppose one aims to predict $X_{n+1}$ based on $(X_1, Z_1, \ldots, X_n, Z_n)$, where the $Z_i$ are any random elements. Suppose also that the $Z_i$ cannot be neglected, for they are informative on $X_{n+1}$. In this case, one needs the conditional distribution of $X_{n+1}$ given $(X_1, Z_1, \ldots, X_n, Z_n)$. Situations of this type are practically meaningful and should be investigated further.
• Section 4 should be expanded. It would be nice to have a general solution of Problem (*) for both the stationary and the stationary-ergodic cases. Further examples of stationary sequences (possibly non-Markovian) would be welcome as well.
• Obviously, P.A. could be investigated under other distributional assumptions, in addition to exchangeability, stationarity and conditional identity in distribution. In particular, partial exchangeability should be taken into account.
• A question, related to Example 5, is: under what conditions is $X$ exchangeable when $\sigma$ is HMW’s strategy?
• In the case of I.A., the empirical Bayes point of view (where the prior is allowed to depend on the data) may be problematic. In the case of P.A., instead, this point of view is certainly admissible. In fact, suppose a strategy $\sigma$ depends on some unknown constants, and an empirical Bayes forecaster decides to estimate these constants based on the available data. Acting in this way, she is merely replacing one strategy with another. Instead of $\sigma$, she is working with $\hat\sigma$, where $\hat\sigma$ is the strategy obtained from $\sigma$ by estimating the unknown constants. This empirical form of P.A. looks reasonable and could be investigated.
Appendix
This appendix contains the proofs of some claims scattered throughout the text. We will need the following characterization of c.i.d. sequences in terms of strategies.
Theorem 15.
(Theorem 3.1 of [6]). Let $\sigma$ be a strategy. Then, $X$ is c.i.d. if and only if
$$\sigma_n(x)(A) = \int_S \sigma_{n+1}(x, y)(A)\,\sigma_n(x)(dy) \tag{11}$$
for all $n \ge 0$, all $A \in \mathcal{B}$ and $P$-almost all $x \in S^n$.
Proof of Theorem 6.
In this proof, “density function” stands for “density function with respect to Lebesgue measure”. We first recall a well known fact.
Let be a bivariate copula and , distribution functions on . Suppose that , and all have densities, say , and , respectively. Then,
is a distribution function on and
is a density of . Therefore, for all with , one obtains
We next show that equation (6) actually defines a strategy . Fix a density and a sequence of strictly positive bivariate copula densities. For each ,
since . Moreover, for all due to and . Next, suppose that is a strictly positive density for some and . Then, for all ,
since . Furthermore, for all since and . By induction, this proves that is a density for all and . Therefore, equation (6) defines a strategy (called HMW’s strategy in Example 5).
Finally, we prove that is c.i.d. if is HMW’s strategy. By Theorem 15, it suffices to prove condition (11). In turn, since is dominated by the Lebesgue measure, condition (11) reduces to
for all $n$, almost all $x$ and $P$-almost all past observations. Such a condition follows directly from the definition of the strategy. In fact, for all $n$ and $x$, one obtains
This concludes the proof. ∎
Remark 16.
HMW’s strategy has been defined under the assumption that and for all . Such an assumption is superfluous and has been made only to avoid annoying complications in the definition of . Similarly, is c.i.d. even if the are conditional copulas, in the sense that they are allowed to depend on past data. Precisely, for each and , fix a bivariate copula density . Then, the proof Theorem 6 still applies if is rewritten as
Proof of Theorem 8.
We show that is c.i.d. via Theorem 15. Fix and . Since is exchangeable (and thus c.i.d.) Theorem 15 yields
(12) |
for -almost all . Hence, up to changing on a -null set, equation (12) can be assumed to hold for all . If ,
where the first equality is because and while the second follows from (12). Next, suppose and take and . By assumption, the events and depend on but not on . If , one obtains and . Hence, equation (12) implies again
Similarly, if ,
In view of Theorem 15, this proves that is c.i.d.
Finally, suppose that is invariant under permutations of for each . We have to show that is exchangeable conditionally on . Fix , a set , and a permutation of . For each , it is easily seen that
Therefore,
where the last equality is because is exchangeable and is invariant under permutations of . In turn, this implies
This concludes the proof. ∎
Proof of Theorem 11.
Just note that is a density of with respect to . Therefore, Theorem 11 follows from the very definitions of stationarity and exchangeability, after noting that is a density of with respect to . ∎
Proof of Theorem 13.
We first recall that
for all and . This can be checked by a direct calculation. For a proof, we refer to the Claim of [9, Th. 10]. Having noted this fact, define
and denote by a real random variable such that . Define also
and call the probability distribution of under . On noting that
one obtains
Therefore, equation (9) holds. ∎
Acknowledgments: We are grateful to Federico Bassetti and Paola Bortot for very useful conversations.
References
- [1] Airoldi E.M., Costa T., Bassetti F., Leisen F., Guindani M. (2014) Generalized species sampling priors with latent beta reinforcements, J.A.S.A., 109, 1466-1480.
- [2] Bassetti F., Crimaldi I., Leisen F. (2010) Conditionally identically distributed species sampling sequences, Adv. in Appl. Probab., 42, 433-459.
- [3] Bassetti F., Ladelli L. (2020) Asymptotic number of clusters for species sampling sequences with non-diffuse base measure, Stat. Prob. Letters, 162, 108749.
- [4] Berti P., Regazzini E., Rigo P. (1997) Well-calibrated, coherent forecasting systems, Theory Probab. Appl., 42, 82-102.
- [5] Berti P., Pratelli L., Rigo P. (2004) Limit theorems for a class of identically distributed random variables, Ann. Probab., 32, 2029-2052.
- [6] Berti P., Pratelli L., Rigo P. (2012) Limit theorems for empirical processes based on dependent data, Electronic J. Probab., 17, 1-18.
- [7] Berti P., Pratelli L., Rigo P. (2013) Exchangeable sequences driven by an absolutely continuous random measure, Ann. Probab., 41, 2090-2102.
- [8] Berti P., Dreassi E., Pratelli L., Rigo P. (2021) A class of models for Bayesian predictive inference, Bernoulli, 27, 702-726.
- [9] Berti P., Dreassi E., Leisen F., Pratelli L., Rigo P. (2023) Bayesian predictive inference without a prior, Statistica Sinica, 33.
- [10] Berti P., Dreassi E., Leisen F., Pratelli L., Rigo P. (2022) Kernel based Dirichlet sequences, Bernoulli, to appear, available at arXiv:2106.00114 [math.PR].
- [11] Blackwell D., Mac Queen J.B. (1973) Ferguson distributions via Pólya urn schemes, Ann. Statist., 1, 353-355.
- [12] Bladt M., McNeil A.J. (2022) Time series models with infinite-order partial copula dependence, Dependence Modeling, 10, 87-107.
- [13] Canale A., Lijoi A., Nipoti B., Pruenster I. (2017) On the Pitman–Yor process with spike and slab base measure, Biometrika, 104, 681-697.
- [14] Cassese A., Zhu W., Guindani M., Vannucci M. (2019) A Bayesian nonparametric spiked process prior for dynamic model selection, Bayesian Analysis, 14, 553-572.
- [15] Chen K., Shen W., Zhu W. (2023) Covariate dependent Beta-GOS process, Computat. Stat. Data Anal., 180.
- [16] Cifarelli D.M., Regazzini E. (1996) De Finetti’s contribution to probability and statistics, Statist. Science, 11, 253-282.
- [17] Clarke B., Fokoue E., Zhang H.H. (2009) Principles and theory for data mining and machine learning, Springer, New York.
- [18] Clarke B., Clarke J. (2018) Predictive statistics: Analysis and inference beyond models, Cambridge University Press, Cambridge.
- [19] Dawid A.P. (1984) Present position and potential developments: Some personal views: Statistical Theory: The prequential approach, J. Royal Stat. Soc. A, 147, 278-292.
- [20] Dawid A.P. (1992) Prequential data analysis, In Current Issues in Statistical Inference: Essays in Honor of D. Basu, Edited by M. Ghosh and P.K. Pathak, IMS Lecture Notes - Monograph Series, 17, 113-126.
- [21] Dawid A.P., Vovk V.G. (1999) Prequential probability: principles and properties, Bernoulli, 5, 125-162.
- [22] de Finetti B. (1931) Sul significato soggettivo della probabilità, Fund. Math., 17, 298–329.
- [23] de Finetti B. (1937) La prévision: Ses lois logiques, ses sources subjectives, Ann. Inst. H. Poincaré, 7, 1–68.
- [24] Diaconis P., Ylvisaker D. (1979) Conjugate priors for exponential families, Ann. Statist., 7, 269-281.
- [25] Diaconis P., Freedman D.A. (1990) Cauchy’s equation and de Finetti’s theorem, Scand. J. Stat., 17, 235-249.
- [26] Dubins L.E., Savage L.J. (1965) How to gamble if you must: Inequalities for stochastic processes, McGraw Hill.
- [27] Efron B. (2020) Prediction, estimation, and attribution, J.A.S.A., 115, 636-655.
- [28] Ferguson T.S. (1973) A Bayesian analysis of some nonparametric problems, Ann. Statist., 1, 209-230.
- [29] Fong E., Holmes C., Walker S.G. (2023) Martingale posterior distributions (with discussion), J. Royal Stat. Soc. B, to appear.
- [30] Fong E., Lehmann B. (2022) A predictive approach to Bayesian nonparametric survival analysis, arXiv: 2202.10361v1 [stat.ME].
- [31] Fortini S., Ladelli L., Regazzini E. (2000) Exchangeability, predictive distributions and parametric models, Sankhya A, 62, 86-109.
- [32] Fortini S., Petrone S. (2012) Predictive construction of priors in Bayesian nonparametrics, Brazilian J. Probab. Statist., 26, 423-449.
- [33] Fortini S., Petrone S. (2017) Predictive characterizations of mixtures of Markov chains, Bernoulli, 23, 1538-1565.
- [34] Fortini S., Petrone S., Sporysheva P. (2018) On a notion of partially conditionally identically distributed sequences, Stoch. Proc. Appl., 128, 819-846.
- [35] Fortini S., Petrone S. (2020) Quasi-Bayes properties of a procedure for sequential learning in mixture models, J. Royal Stat. Soc. B, 82, 1087-1114.
- [36] Geisser S. (1993) Predictive inference: An introduction, Chapman and Hall, New York.
- [37] Ghosal S., van der Vaart A. (2017) Fundamentals of nonparametric Bayesian inference, Cambridge University Press, Cambridge.
- [38] Gnedin A., Pitman J. (2006) Exchangeable Gibbs partitions and Stirling triangles. J. Math. Sci., 138, 5674-5685.
- [39] Gnedin A. (2010) A species sampling model with finitely many types, Electron. Commun. Probab., 15, 79-88.
- [40] Hahn P.R. (2017) Predictivist Bayes density estimation, unpublished technical report, available at https://math.la.asu.edu/ prhahn/pred-bayes.pdf
- [41] Hahn P.R., Martin R., Walker S.G. (2018) On recursive Bayesian predictive distributions, J.A.S.A., 113, 1085-1093.
- [42] Hansen B., Pitman J. (2000) Prediction rules for exchangeable sequences related to species sampling, Stat. Prob. Letters, 46, 251-256.
- [43] Hastie T., Tibshirani R., Friedman J. (2009) The elements of statistical learning: Data Mining, Inference, and Prediction, Springer, New York.
- [44] Hill B.M. (1993) Parametric models for A_n: splitting processes and mixtures, J. Royal Stat. Soc. B, 55, 423-433.
- [45] Hjort N.L., Holmes C., Muller P., Walker S.G. (2010) Bayesian nonparametrics, Cambridge University Press, Cambridge.
- [46] Hoffmann-Jorgensen J. (1994) Probability with a view toward statistics, Vol. II, Chapman and Hall, New York.
- [47] Kallenberg O. (1988) Spreading and predictable sampling in exchangeable sequences and processes, Ann. Probab., 16, 508-534.
- [48] Lee J., Quintana F.A., Muller P., Trippa L. (2013) Defining predictive probability functions for species sampling models, Statist. Science, 28, 209-222.
- [49] Lijoi A., Pruenster I., Walker S.G. (2008) Bayesian nonparametric estimators derived from conditional Gibbs structures, Ann. Appl. Probab., 18, 1519-1547.
- [50] Morvai G., Weiss B. (2021) On universal algorithms for classifying and predicting stationary processes, Probab. Surveys, 18, 77-131.
- [51] Newton M.A., Zhang Y. (1999) A recursive algorithm for nonparametric analysis with missing data, Biometrika, 86, 15-26.
- [52] Newton M.A. (2002) On a nonparametric recursive estimator of the mixing distribution, Sankhya, 64, 306-322.
- [53] Pitman J. (1995) Exchangeable and partially exchangeable random partitions, Probab. Theory Rel. Fields, 102, 145-158.
- [54] Pitman J. (1996) Some developments of the Blackwell-MacQueen urn scheme, Statistics, Probability and Game Theory, IMS Lect. Notes Mon. Series, 30, 245-267.
- [55] Pitman J., Yor M. (1997) The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator, Ann. Probab., 25, 855-900.
- [56] Pitman J. (2006) Combinatorial stochastic processes, Lectures from the XXXII Summer School in Saint-Flour, 2002, Springer, Berlin.
- [57] Sethuraman J. (1994) A constructive definition of Dirichlet priors, Stat. Sinica, 4, 639-650.
- [58] Shmueli G. (2010) To explain or to predict ?, Statist. Science, 25, 289-310.
- [59] Smith A.F.M., Makov U.E. (1978) A quasi-Bayes sequential procedure for mixtures, J. Royal Stat. Soc. B, 40, 106-112.