Posterior Convergence of Nonparametric Binary and Poisson Regression Under Possible Misspecifications
Abstract
In this article, we investigate posterior convergence of nonparametric binary and Poisson regression
under possible model misspecification, assuming a general stochastic process prior with appropriate properties.
Our model setup and objective for binary regression are similar to those of Ghosal and Roy (2006), where the authors used
entropy bounds and exponentially consistent tests together with the sieve method to achieve consistency with respect to their Gaussian process prior.
In contrast, for both binary and Poisson regression with a general stochastic process prior, our approach involves verification
of the asymptotic equipartition property along with the method of sieves, building on the general results of Shalizi (2009), which remain useful even
for misspecified models. Moreover, we establish not only posterior consistency but also the rate at which the posterior probabilities converge,
which turns out to be the Kullback-Leibler divergence rate. We also investigate the traditional posterior convergence rates.
Interestingly, from the subjective Bayesian viewpoint we show that the posterior predictive distribution can accurately approximate
the best possible predictive distribution, in the sense that the Hellinger distance, as well
as the total variation distance between the two distributions, can tend to zero, in spite of
misspecification.
Keywords: Binary/Poisson regression; Cumulative distribution function; Infinite dimension; Kullback-Leibler divergence rate; Misspecification;
Posterior convergence.
‡ Indian Statistical Institute
Corresponding author: [email protected]
1 Introduction
Nonparametric regression is frequently needed in practical scenarios where no parametric model fits the data. In particular, nonparametric regression for binary dependent variables is very common in several branches of statistics, such as medical and spatial statistics, whereas the nonparametric version of Poisson regression has recently been used in many non-trivial scenarios, such as analyzing the likelihood and severity of vehicle crashes (Ye et al. (2018)). Interestingly, despite the wide applicability of both binary and Poisson regression, the available literature on nonparametric Poisson regression is scarce in comparison to that on nonparametric binary regression. The Bayesian approach to the nonparametric binary regression problem has been accounted for in Diaconis and Freedman (1993). An account of posterior consistency for the Gaussian process prior in nonparametric binary regression modeling can be found in Ghosal and Roy (2006), where the authors suggested that similar consistency results should hold for the nonparametric Poisson regression setup. Literature on consistency results for nonparametric Poisson regression is very limited. Pillai et al. (2007) obtained consistency results for Poisson regression using an approach similar to that of Ghosal and Roy (2006) under certain assumptions, but without explicit specification of, and details on, the prior. On the other hand, our approach is based on the results of Shalizi (2009), which is quite different from that of Ghosal and Roy (2006) and is capable of handling model misspecification. Unlike the previous works, the approach of Shalizi (2009) also enables us to investigate the rate at which the posterior converges, which turns out to be the Kullback-Leibler (KL) divergence rate, as well as the traditional posterior convergence rate.
In this article, we investigate posterior convergence of nonparametric binary and Poisson regression where the nonparametric regression is modeled as some suitable stochastic process. In the binary situation, we consider a setup similar to that of Ghosal and Roy (2006), where the authors considered binary observations with response probability an unknown smooth function of a set of covariates, modeled using a Gaussian process. Here we consider a binary response variable and a -dimensional covariate belonging to a compact subset. The probability function is given by along with a prior for induced by some appropriate stochastic process, with the relation for a known, non-decreasing and continuously differentiable cumulative distribution function . We establish a posterior convergence theory for nonparametric binary regression under possible misspecification, based on the general theory of posterior convergence of Shalizi (2009). Our theory also covers misspecified models, that is, the case where the true regression function is not even supported by the prior. This approach to Bayesian asymptotics also permits us to show that the relevant posterior probabilities converge at the KL divergence rate, and that the posterior convergence rate with respect to KL-divergence is just slower than , where denotes the number of observations. We further show that even in the case of misspecification, the posterior predictive distribution can approximate the best possible predictive distribution adequately, in the sense that the Hellinger distance, as well as the total variation distance, between the two distributions can tend to zero.
For nonparametric Poisson regression, given in the compact space of covariates, we model the mean function as , where is a continuously differentiable function. Again, we investigate the general theory of posterior convergence, including misspecifications, rate of convergence of the posterior distribution and the usual posterior convergence rate, in Shalizi’s framework.
The rest of our paper is structured as follows. In Section 2 we provide a brief overview and intuitive explanation of the main assumptions and results of Shalizi (2009) relevant to our approach. The basic premises for nonparametric binary and Poisson regression are provided in Sections 3 and 4, respectively. The required assumptions and their discussions are provided in Section 5. Our main results on posterior convergence of binary and Poisson regression are provided in Section 6, the rate of posterior convergence is addressed in Section 7, and Section 8 details the consequences of misspecification. Concluding remarks are provided in Section 9.
The technical details are presented in the Appendix. Specifically, details of the necessary assumptions and results of Shalizi (2009) are provided in Appendix A. The detailed proofs of verification of Shalizi’s assumptions are provided in Appendix B and Appendix C for binary and Poisson regression setups, respectively.
2 An outline of the main assumptions and results of Shalizi
Let the set of random variables for the response be denoted by . For a given parameter space , let be the observed likelihood and be the true likelihood. We assume but the truth need not be in , thus allowing possible misspecification.
The KL divergence is a measure of divergence between two probability densities and . The KL divergence is related to likelihood ratios, since by the Strong Law of Large Numbers (SLLN) for independent and identically distributed (iid) situations,
For every , the KL divergence rate is given by:
(2.1) |
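To make the connection explicit, in the iid case the SLLN-based relation between likelihood ratios and the KL divergence alluded to above takes the following standard form (a sketch in generic notation of our own, with $f_{\theta_0}$ and $f_{\theta}$ denoting the true and hypothesized densities):
\[
\frac{1}{n}\sum_{i=1}^{n}\log\frac{f_{\theta_0}(X_i)}{f_{\theta}(X_i)}
\;\longrightarrow\;
E_{\theta_0}\!\left[\log\frac{f_{\theta_0}(X_1)}{f_{\theta}(X_1)}\right]
= KL\left(f_{\theta_0}\,\|\,f_{\theta}\right)
\quad \text{almost surely},
\]
and the KL divergence rate in (2.1) generalizes this limit beyond the iid setting.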
The key ingredient associated with the approach of Shalizi (2009) for proving convergence of the posterior distribution of is to show that the asymptotic equipartition property holds. To illustrate, let us consider the following likelihood ratio:
(2.2) |
In the iid setup, reduces to the KL divergence between the true and the hypothesized model. For each , the “asymptotic equipartition” property is as follows:
(2.3) |
Here “asymptotic equipartition” refers to dividing up into factors for large such that all the factors are asymptotically equal. For illustration, in the iid scenario, each factor converges to the same KL divergence between the true and the postulated model. The purpose of asymptotic equipartition is to ensure that, relative to the true distribution, the likelihood of each decreases to zero exponentially fast, with rate being the KL divergence rate.
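As a purely illustrative numerical check, not part of the theory above, the following Python sketch simulates iid Bernoulli data and verifies that the per-observation log-likelihood ratio between the true and a postulated success probability stabilizes at the corresponding KL divergence; the Bernoulli example and all names are our own assumptions.

```python
import numpy as np

def kl_bernoulli(p0, p):
    # KL divergence between Bernoulli(p0) and Bernoulli(p)
    return p0 * np.log(p0 / p) + (1 - p0) * np.log((1 - p0) / (1 - p))

rng = np.random.default_rng(0)
p0, p = 0.7, 0.5                      # true and postulated success probabilities
n = 100_000
y = rng.binomial(1, p0, size=n)

# average per-observation log-likelihood ratio: log f_{theta_0} - log f_{theta}
loglik_true = y * np.log(p0) + (1 - y) * np.log(1 - p0)
loglik_postulated = y * np.log(p) + (1 - y) * np.log(1 - p)
empirical_rate = np.mean(loglik_true - loglik_postulated)

print(f"empirical rate       : {empirical_rate:.4f}")
print(f"KL(Bern(p0)||Bern(p)): {kl_bernoulli(p0, p):.4f}")
```

The two printed numbers nearly coincide for large n, illustrating how the likelihood ratio decays exponentially at the KL divergence rate.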
For , let
(2.4)
(2.5)
(2.6)
where roughly represent the minimum KL-divergence between the postulated and the true model over the set A. If , it indicates model misspecification. However, as we shall show, model misspecification need not always imply that . One such counterexample is also given in Chatterjee and Bhattacharya (2020).
Observe that, for , . For the prior, it is required to construct an appropriate sequence of sieve sets as such that:
(1) , as .
(2) .
The sets can be interpreted as sieves in the sense that the behaviour of the likelihood ratio and the posterior on the sets essentially carries over to .
Let denote the posterior distribution of given . Then with the above notions, verification of (2.3) along with several other technical conditions (details given in Appendix A) ensures that for any for which ,
(2.7) |
almost surely, provided that . The latter implies positive KL-divergence in , even if . That is, is the set in which the postulated model fails to capture the true model in terms of the KL-divergence. Hence, expectedly, the posterior probability of that set converges to zero.
Under mild assumptions, it also holds that
(2.8) |
almost surely. This result shows that the rate at which the posterior probability of converges to zero is about . From the above results it is clear that the posterior concentrates on sets of the form , for any .
Shalizi addressed the rate of posterior convergence as follows. Letting , where such that , Shalizi showed, under an additional technical assumption, that almost surely,
(2.9) |
Moreover, it was shown by Shalizi that the squares of the Hellinger and the total variation distances between the posterior predictive distribution and the best possible predictive distribution under the truth, are asymptotically almost surely bounded above by and , respectively. That is, if , then this allows very accurate approximation of the true predictive distribution by the posterior predictive distribution.
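For reference, and assuming the usual conventions (the factor 1/2 varies across references), the two distances appearing in this bound are the standard ones: for probability measures $P$ and $Q$ with densities $p$ and $q$ with respect to a common dominating measure $\mu$,
\[
H^2(P,Q)=\frac{1}{2}\int\left(\sqrt{p}-\sqrt{q}\right)^2 d\mu,
\qquad
d_{TV}(P,Q)=\frac{1}{2}\int\left|p-q\right| d\mu .
\]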
3 Model setup and preliminaries of the binary regression
Let be a binary outcome variable and X a vector of covariates. Suppose are some independent binary responses conditional on unobserved covariates . We assume that the covariate space is compact. Let be the binary response random variables against the covariate vector . The corresponding observed values will be denoted by and respectively. Let the model be specified as follows: for :
(3.1)
(3.2)
(3.3)
where is the prior for some suitable stochastic process. Note that the prior for is induced by the prior for . Our concern is to infer about the success probability function when the number of observations goes to infinity. We assume that the functions have continuous first partial derivatives. We denote this class of functions by . We do not assume that the truth lies in , thus allowing misspecification. The link function is a known, non-decreasing, continuously differentiable cumulative distribution function on the real line . It is standard to assume the link function to be known as part of the model assumption. For example, in logistic regression we choose the standard logistic cumulative distribution function as the link function, whereas in probit regression is chosen to be the standard normal cumulative distribution function . More discussion on link functions, along with several other examples, can be found in Choudhuri et al. (2007), Newton et al. (1996) and Gelfand and Kuo (1991). A Bayesian method for estimation of has been provided in Choudhuri et al. (2007). It has been shown in Ghosal and Roy (2006) that the sample paths of Gaussian processes can well approximate a large class of functions, and hence it is not essential to consider additional uncertainty in the link function .
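To make the data-generating mechanism concrete, the following minimal Python sketch simulates binary responses whose success probabilities arise from a smooth function composed with a probit link; the particular function, the covariate distribution on [0,1], and all variable names are illustrative assumptions of ours, not part of the model specification above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def eta(x):
    # an arbitrary smooth function on the compact covariate space [0, 1]
    return np.sin(2 * np.pi * x) + 0.5 * x

n = 500
x = rng.uniform(0.0, 1.0, size=n)   # covariates on a compact space
p = norm.cdf(eta(x))                # success probabilities via the probit link
y = rng.binomial(1, p)              # binary responses

print(np.column_stack([x[:5].round(3), p[:5].round(3), y[:5]]))
```

With a logit link one would simply replace norm.cdf by the standard logistic distribution function.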
Let be the counting measure on . Then according to the model assumption, the conditional density of given with respect to will be represented by the density function as follows:
(3.4) |
The prior for will be denoted by . Let and denote the true density and the true success probability, respectively. Then under the truth, the joint density is:
(3.5) |
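Since the displayed densities are the standard ones for binary data, (3.4) and (3.5) presumably take the usual Bernoulli form; writing $p(x)=F(\eta(x))$ for the modeled success probability and $p_0(x)$ for the true one (our notation), a sketch is
\[
f_{\eta}(y\mid x)=\left\{F(\eta(x))\right\}^{y}\left\{1-F(\eta(x))\right\}^{1-y},
\qquad
f_{0}(y\mid x)=p_0(x)^{y}\left(1-p_0(x)\right)^{1-y},
\qquad y\in\{0,1\}.
\]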
One of the main objectives of this article is to show consistency of the posterior distribution of , treated as a parameter arising from the parameter space specified as follows:
(3.6) |
or simply, .
4 Model setup and preliminaries of Poisson regression
For the Poisson regression model setup, let be a count outcome variable and a vector of covariates. Here denotes the set of non-negative integers. Suppose are some independent responses conditional on covariates . We assume that the covariate space is compact. Let be the response random variables against the covariate vector . The corresponding observed values will be denoted by and , respectively. Let the parameter space be specified as follows:
(4.1) |
The link function is a known, non-negative continuously differentiable function on . We equivalently define the parameter space as . Thus, in what follows, we shall use both and to denote the parameter space, depending on convenience. Then the model is specified as follows: for ,
(4.2)
(4.3)
(4.4)
Similar to binary regression, here our concern will be to infer about when the number of observations goes to infinity. As before, we do not assume that the truth lies in , allowing misspecification.
Now, let be the counting measure on . According to the model assumption for Poisson regression, the conditional density of given with respect to will be represented by the density function as follows:
(4.5) |
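The conditional density in (4.5) is presumably the usual Poisson probability mass function; writing $\lambda(x)$ for the modeled mean function (our notation), a sketch is
\[
f_{\lambda}(y\mid x)=\frac{e^{-\lambda(x)}\,\lambda(x)^{y}}{y!},
\qquad y\in\{0,1,2,\ldots\}.
\]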
The prior for will be denoted by . Let and denote the true density and the true mean function, respectively. Again, one of our main aims is to establish consistency of the posterior distribution of , treated as a parameter arising from .
5 Assumptions and their discussions
We need to make some appropriate assumptions for establishing convergence of both the binary and Poisson regression models equipped with the stochastic process prior. The prior itself also requires suitable assumptions. Many of the assumptions are similar to those of Chatterjee and Bhattacharya (2020), and hence serve purposes similar to those discussed there; we briefly touch upon them here.
Assumption 1.
is a compact, -dimensional space, for some finite , equipped with a suitable metric.
Assumption 2.
Recall that in our notation, denotes the class of continuously partially differentiable functions on . In other words, the functions are continuous on and for such functions the limit
(5.1) |
exists for each and is continuous . Here is the d-dimensional vector with the -th element equal to 1 and all other elements equal to zero.
Assumption 3.
The prior for is chosen such that, for ,
where and ; , are positive constants.
We treat the covariates as either random (observed or unobserved) or non-random (observed). Accordingly, in Assumption 4 below we provide conditions pertaining to these aspects.
Assumption 4.
-
(i)
is an observed or unobserved sample from an iid sequence associated with some probability measure , supported on , which is independent of .
-
(ii)
is an observed non-random sample. In this case, we consider a specific partition of the -dimensional space into n subsets such that each subset of the partition contains at least one and has Lebesgue measure , for some .
Assumption 5.
The true function is bounded in sup norm. In other words, the truth satisfies the following for some constant :
(5.2) |
Observe that in general . For random covariate , we assume that is measurable.
Assumption 6.
For binary regression model set up we assume a uniform positive lower bound for . In other words, for all ,
(5.3) |
where is as defined in expression (3.6).
Assumption 7.
For Poisson regression model set up we assume a uniform positive lower bound for . In other words, for all ,
(5.4) |
where is as defined in expression 4.1.
5.1 Discussion of the assumptions
Assumption 1 is on compactness of , which guarantees that continuous functions on will have finite sup-norms.
Assumption 2 is the same as in Chatterjee and Bhattacharya (2020) and is needed for constructing appropriate sieves in order to establish the posterior convergence results. More precisely, Assumption 2 is required to ensure that is Lipschitz continuous on the sieves. Since a differentiable function is Lipschitz if and only if its partial derivatives are bounded, this serves our purpose, as continuity of the partial derivatives of guarantees boundedness on the compact domain . In particular, if is a Gaussian process, conditions presented in Adler (1981), Adler and Taylor (2007) and Cramer and Leadbetter (1967) guarantee the continuity and smoothness properties required by Assumption 2. We refer to Chatterjee and Bhattacharya (2020) for more discussion.
Assumption 3 is required for ensuring that the complements of the sieves have exponentially small probabilities. In particular, this assumption is satisfied if is a Gaussian process, even if is replaced with .
Assumption 4 concerns the covariates , according as they are considered an observed random sample, an unobserved random sample, or non-random. Note that, thanks to the strong law of large numbers (SLLN), given any in the complement of some null set with respect to the prior, and given any sequence , Assumption 4 (i) ensures that for any integrable function , as ,
(5.5) |
where is some probability measure supported on .
Assumption 4 (ii) ensures that is a particular Riemann sum and hence (5.5) holds with being the Lebesgue measure on . We continue to denote the limit in this case by .
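In either case, the limit referred to in (5.5) presumably takes the following form, with $g$ a generic integrable function, $x_1,\ldots,x_n$ the covariate values and $Q$ the limiting measure described above (our notation):
\[
\frac{1}{n}\sum_{i=1}^{n} g(x_i)\;\longrightarrow\;\int g\, dQ,
\qquad \text{as } n\rightarrow\infty .
\]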
Assumption 5 is equivalent to Assumption (T) of Ghosal and Roy (2006). Assumption 5 actually implies that is bounded away from 0 and 1, and hence the corresponding true function given by is uniformly bounded above and below.
Since is uniformly bounded above and below, is also bounded away from 0 and 1. For the Poisson regression model setup it follows that .
It is to be noted that here we do not need to assume that or , allowing model misspecification.
Observe that, similar to Pillai et al. (2007), we need the parameter space for Poisson regression to be bounded away from zero (Assumption 7). As pointed out in Pillai et al. (2007), this requirement cannot be bypassed and is not merely a device for our proofs. The reason is that if almost all observations in a sample from a Poisson distribution are zero, then it is impossible to extract information about the (log) mean. Hence we must impose some condition that keeps the mean bounded away from zero. A similar argument also applies to binary regression, which is reflected in Assumption 6.
It is important to remark that Assumptions 6 and 7 are needed only to validate Assumption (S6) of Shalizi, and are unnecessary elsewhere. The reasons are clarified in Remarks 1 and 2. Although many of our proofs would be simpler if Assumptions 6 and 7 were used throughout, we reserve these assumptions only for validating Assumption (S6) of Shalizi.
6 Main results on posterior convergence
Here we summarize our main results regarding posterior convergence of nonparametric binary and Poisson regression. The key results associated with the asymptotic equipartition property are provided in Theorems 1–4, proofs of which are given in Appendix B (for binary regression) and Appendix C (for Poisson regression).
Theorem 1.
Let and the counting measure on be the measures associated with the random variable and the binary random variable respectively. Denote and . Then under the nonparametric binary regression model, under Assumption 4, the KL divergence rate exists for , and is given by
(6.1) |
Alternatively, admits the following form:
(6.2) |
Theorem 2.
Let and the counting measure on be associated with the random variable and the count random variable , respectively. Denote and . Then under the nonparametric Poisson regression model, under Assumption 4, the KL divergence rate exists for , and is given by
(6.3) |
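Although the displayed expressions are not reproduced here, for independent observations one expects the KL divergence rates in (6.1) and (6.3) to be the covariate-averaged KL divergences between the true and postulated conditional distributions; in our own notation, with $p_0,p$ the true and postulated success probabilities, $\lambda_0,\lambda$ the true and postulated Poisson means, and $Q$ the limiting covariate measure, a sketch is
\[
h_{\mathrm{binary}}(\eta)=\int\left[p_0(x)\log\frac{p_0(x)}{p(x)}+\left(1-p_0(x)\right)\log\frac{1-p_0(x)}{1-p(x)}\right]dQ(x),
\]
\[
h_{\mathrm{Poisson}}(\eta)=\int\left[\lambda_0(x)\log\frac{\lambda_0(x)}{\lambda(x)}-\left(\lambda_0(x)-\lambda(x)\right)\right]dQ(x).
\]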
Theorem 3.
Under the nonparametric binary regression model and Assumption 4, the asymptotic equipartition property holds, and is given by
(6.4) |
The convergence is uniform on any compact subset of .
Theorem 4.
Under the nonparametric Poisson regression model and Assumption 4, the asymptotic equipartition property holds, and is given by
(6.5) |
The convergence is uniform on any compact subset of .
Theorems 1 and 3 for binary regression and Theorems 2 and 4 for Poisson regression ensure that conditions (S1) to (S3) of Shalizi (2009) hold, and (S4) holds for both binary and Poisson regression because of compactness of and continuity of and . The detailed proofs are presented in Appendix B.4 and Appendix C.4, respectively.
We construct the sieves for the binary regression model setup as follows:
(6.6)
It follows that as , where the parameter space is given by (3.6).
In a similar manner, we construct the sieves for Poisson regression as follows:
(6.7)
Then similarly it will also follow that as , where the parameter space is given by (4.1).
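Although the explicit expressions in (6.6) and (6.7) are not reproduced here, a plausible form of such sieves, in the spirit of Chatterjee and Bhattacharya (2020), bounds the sup-norms of $\eta$ and of its partial derivatives by sequences growing with $n$; in our own notation, a sketch is
\[
\mathcal{G}_n=\left\{\eta:\ \|\eta\|_{\infty}\le\exp(\beta n),\ \left\|\partial\eta/\partial x_j\right\|_{\infty}\le\exp(\beta n),\ j=1,\ldots,d\right\},
\]
for some $\beta>0$, so that Assumption 3 yields exponentially small prior probabilities for the complements $\mathcal{G}_n^c$.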
Assumption 3 ensures that for binary regression, for some , and similarly for Poisson regression. Now, these results, continuity of (the proofs of continuity of and follow using the same techniques as in Appendices B.1 and C.1), compactness of , and the uniform convergence results of Theorems 3 and 4 together ensure (S5) for both model setups.
Now, as pointed out in Chatterjee and Bhattacharya (2020), we observe that the aim of assumption (S6) is to ensure that (see the proof of Lemma 7 of Shalizi (2009)) for every and for all sufficiently large ,
(6.8) |
As as , it is enough to verify that for every and for all sufficiently large,
(6.9) |
First we observe that
(6.10) |
For large enough , consider .
Lemma 1.
is a compact set.
Proof.
First recall that the proof of continuity of in follows easily using the same techniques as in Appendix B.1.
Now note that, if , then there exists such that either or . Hence, as . Thus, is a coercive function.
Since is continuous and coercive, it follows that is a compact set.
∎
In a very similar manner, the following lemma also holds for the Poisson model setup.
Lemma 2.
is a compact set.
Proof.
Again, recall that continuity of in can be shown using the same techniques as in Appendix C.1, and it is easily seen that if , then . Thus, is continuous and coercive, ensuring that is compact. ∎
Using compactness of , in the same way as in Chatterjee and Bhattacharya (2020), condition (S6) of Shalizi can be shown to be equivalent to (6.11) and (6.12) in Theorems 5 and 6 below, corresponding to binary and Poisson cases. In the supplement we show that these equivalent conditions are satisfied in our model setups.
Theorem 5.
Theorem 6.
Assumption (S7) of Shalizi also holds for both the model setups because of continuity of and . Hence, all the assumptions (S1)–(S7) stated in Appendix A are satisfied for binary and Poisson regression setups.
Overall, our results lead to the following theorems.
Theorem 7.
7 Rate of convergence
Consider a sequence of positive reals such that while as and the set . Then the following result of Shalizi holds.
Theorem 9 (Shalizi (2009)).
Assume (S1) to (S7) of Appendix A. If for each ,
(7.1) |
eventually almost surely, then almost surely the following holds:
(7.2) |
To investigate the rate of convergence in our cases (and also in the case of Chatterjee and Bhattacharya (2020)), it has been proved in Chatterjee and Bhattacharya (2020) that is the rate of convergence for , as , provided we can show that the following hold:
(7.3) |
(7.4) |
for any and all sufficiently large.
Following arguments similar to those of Chatterjee and Bhattacharya (2020), we find that the posterior rate of convergence with respect to KL-divergence is just slower than . To put it another way, it is just slower than with respect to the Hellinger distance for the model setups we consider. Our results are formally stated in Theorem 10 for binary regression and in Theorem 11 for Poisson regression.
Theorem 10.
8 Consequences of model misspecification
Suppose that the true function has a countable number of discontinuities but has continuous first order partial derivatives at all other points. Then . However, there exists some such that for all where is continuous. A similar situation is mentioned in Chatterjee and Bhattacharya (2020). Observe that, if the probability measure of is dominated by the Lebesgue measure, then from Theorem 1 we have . Then the posterior of concentrates around , which is the same as except at the countable number of discontinuities of . The corresponding and will also differ from and . If and are such that and , respectively, then the posteriors concentrate around the minimizers of and , provided such minimizers exist in and , respectively.
8.1 Consequences from the subjective Bayesian perspective
Bayesian posterior consistency has two apparently different viewpoints, namely, classical and subjective. Bayesian analysis starts with prior knowledge and updates that knowledge given the data, forming the posterior. It is of utmost importance to know whether the updated knowledge becomes more and more accurate and precise as data are collected indefinitely. This requirement is called consistency of the posterior distribution. From the classical Bayesian point of view we should believe in the existence of a true model. On the contrary, from the subjective Bayesian viewpoint, we need not believe in true models. A subjective Bayesian thinks only in terms of the predictive distribution of future observations. But Blackwell and Dubins (1962) and Diaconis and Freedman (1986) have shown that consistency is equivalent to intersubjective agreement, which means that two Bayesians will ultimately have very close posterior predictive distributions.
Let us define the one-step-ahead predictive distribution of and , one-step-ahead best predictor (which is the best prediction one could make had the true model, , been known) and the posterior predictive distribution (Shalizi (2009)), with the convention that gives the marginal distribution of the first observation, as follows:
- (One-step-ahead predictive distribution of ): ,
- (One-step-ahead predictive distribution of ): ,
- (One-step-ahead best predictor): ,
- (The posterior predictive distribution): .
With the above definitions, the following results have been proved by Shalizi.
Theorem 12 (Shalizi (2009)).
Let and be Hellinger and total variation metrics, respectively. Then with probability 1,
In our nonparametric setup, and if has a countable number of discontinuities. Hence, from Theorem 12 it is clear that in spite of such misspecification, the posterior predictive distribution does a good job of learning the best possible predictive distribution in terms of the popular Hellinger and total variation distances. We state our result formally as follows.
Theorem 13.
Consider the setups of nonparametric binary and Poisson regression. Assume that the true function has a countable number of discontinuities but has continuous first order partial derivatives at all other points. Then under Assumptions 1–6 (for binary regression) or under Assumptions 1–5 and 7 (for Poisson regression), the following hold:
9 Conclusion and future work
In this paper we have addressed posterior convergence of nonparametric binary and Poisson regression, along with the rate of convergence, while also allowing for misspecification, using the approach of Shalizi (2009). We have also shown that, even in the case of misspecification, the posterior predictive distribution can be quite accurate asymptotically, which should be of interest from the subjective Bayesian viewpoint. The asymptotic equipartition property plays a central role here. It is one of the crucial assumptions and yet is relatively easy to establish under mild conditions. It brings forward the KL property of the posterior, which in turn characterizes posterior convergence, as well as the rate of posterior convergence and the effect of misspecification.
Appendix
Appendix A Assumptions and theorems of Shalizi
Following Shalizi (2009), let us consider a probability space , a sequence of random variables taking values in the measurable space , having infinite-dimensional distribution . The theoretical development requires no restrictive assumptions on such as it being a product measure, Markovian, or exchangeable, thus paving the way for great generality.
Let denote the natural filtration, that is, the -algebra generated by . Also, let the distributions of the processes adapted to be denoted by , where takes values in a measurable space . Here denotes the hypothesized probability measure associated with the unknown distribution of and is the set of hypothesized probability measures. In other words, assuming that is the infinite-dimensional distribution of the stochastic process , denotes the -dimensional marginal distribution associated with ; is suppressed for ease of notation. For parametric models, the probability measure corresponds to some probability density with respect to some dominating measure (such as the Lebesgue or counting measure) and consists of a finite number of unknown parameters. For nonparametric models, is usually associated with an infinite number of parameters and may not even have any density with respect to -finite measures.
As in Shalizi (2009), we assume that and all the are dominated by a common measure with densities and , respectively. In Shalizi (2009) and in our case, the assumption that , is not required, so that all possible models are allowed to be misspecified. Indeed, Shalizi (2009) provides an example of such misspecification where the true model is not Markov but all the hypothesized models indexed by are -th order stationary binary Markov models, for . As shown in Shalizi (2009), the results of posterior convergence hold even in the case of such misspecification, essentially because the true model can be approximated by the -th order Markov models belonging to .
Given a prior on , we assume that the posterior distributions are dominated by a common measure for all .
A.1 Assumptions
-
(S1)
Letting be the likelihood under parameter and be the likelihood under the true parameter , given the true model , consider the following likelihood ratio:
(A.1) Assume that is -measurable for all .
-
(S2)
For every , the KL divergence rate
(A.2) exists (possibly being infinite) and is -measurable. Note that in the set-up, reduces to the KL divergence between the true and the hypothesized model, so that (A.2) may be regarded as a generalized KL divergence measure.
-
(S3)
For each , the generalized or relative asymptotic equipartition property holds, and so, almost surely with respect to ,
(A.3) where is given by (A.2).
Intuitively, the terminology “asymptotic equipartition” refers to dividing up into factors for large such that all the factors are asymptotically equal. Again, considering the iid scenario helps clarify this point, as in this case each factor converges to the same KL divergence between the true and the postulated model. With this understanding, note that the purpose of condition (S3) is to ensure that, relative to the true distribution, the likelihood of each decreases to zero exponentially fast, with rate being the KL divergence rate (A.3).
-
(S4)
Let . The prior on satisfies . Failure of this assumption entails extreme misspecification of almost all the hypothesized models relative to the true model . With such extreme misspecification, posterior consistency is not expected to hold.
-
(S5)
There exists a sequence of sets as such that:
1. , as .
2. The following inequality holds for some .
3. The convergence in (S3) is uniform in over .
The sets can be loosely interpreted as sieves. The method of sieves is common in the Bayesian nonparametric approach; the behaviour of the likelihood ratio and the posterior on the sets essentially carries over to . This can be anticipated from the first and second parts of the assumption, the second part ensuring in particular that the parts of on which the log-likelihood ratio may be ill-behaved have exponentially small prior probabilities. The third part is more of a technical condition that is useful in proving posterior convergence through the sets . For further details, see Shalizi (2009).
For each measurable , for every , there exists a random natural number such that
(A.4) |
for all , provided . Regarding this, the following assumption has been made by Shalizi:
-
(S6)
The sets of (S5) can be chosen such that for every , the inequality holds almost surely for all sufficiently large .
To understand the essence of this assumption, note that for almost every data set there exists such that equation (A.4) holds with replaced by for all . Since are sets with large enough prior probabilities, the assumption formalizes our expectation that decays fast enough on so that is nearly stable in the sense that it is not only finite but also not significantly different for different data sets when is large. See Shalizi (2009) for more detailed explanation.
-
(S7)
The sets of (S5) and (S6) can be chosen such that for any set with ,
(A.5)
Under the above assumptions, Shalizi (2009) proved the following results.
Theorem 14 (Shalizi (2009)).
Consider assumptions (S1)–(S7) and any set with and . Then,
The rate of convergence of the log-posterior is given by the following result.
Theorem 15 (Shalizi (2009)).
Consider assumptions (S1)–(S7) and any set with . If , where corresponds to assumption (S5), or if for some , then
Appendix B Verification of (S1) to (S7) for binary regression
B.1 Verification of (S1) for binary regression
Observe that
(B.1)
(B.2)
Therefore,
(B.3) |
To show measurability of , first note that for any ,
(B.4) |
Note that for given , there exists such that , for all . Now consider a sequence , such that , as . Then, with , note that there exists such that for , , for all . Hence, using the inequality for , we obtain and , for some , for all . Hence, for ,
(B.5) |
Now, since is continuously differentiable, using Taylor’s series expansion up to the first order we obtain,
(B.6) |
where lies between and . Since , as , it follows from (B.6) that , as . This again implies, thanks to (B.5), that , as .
B.2 Verification of (S2) for binary regression
For every , we need to show that the KL divergence rate
exists (possibly being infinite) and is -measurable.
Now,
(B.7)
Therefore,
(B.8)
(B.9)
The last line follows from Assumption 4 and the SLLN. Here .
Hence,
(B.10) |
B.3 Verification of (S3) for binary regression
Here we need to verify the asymptotic equipartition, that is, almost surely with respect to ,
(B.11) |
Observe that,
By rearranging the terms we get,
Using the inequality for , compactness of , and continuity of in for given , we obtain and , for some . Hence,
(B.12)
(B.13)
Observe that are observations from independent random variables. Hence by Kolmogorov’s SLLN for independent random variables,
almost surely, as .
B.4 Verification of (S4) for binary regression
If then we need to show . Note that due to compactness of and continuity of and , given , is bounded away from and . Hence, , almost surely. In other words, (S4) holds.
B.5 Verification of (S5) for binary regression
In our model, the parameter space is . We need to show that there exists a sequence of sets as such that:
1. , as .
2. The inequality holds for some .
3. The convergence in (S3) is uniform in over .
We shall work with the following sequence of sieve sets considered in Chatterjee and Bhattacharya (2020): for ,
(B.14) |
Then as (Chatterjee and Bhattacharya (2020)).
B.5.1 Verification of (S5) (1)
We now verify that , as . Observe that:
(B.15) |
Recall that is continuous in and is continuous in , which follows from (B.6). Hence, continuity of , compactness of along with its non-decreasing nature with respect to implies that , as .
B.5.2 Verification of (S5) (2)
where the last inequality follows from Assumption 3.
B.5.3 Verification of (S5) (3)
We need to show that the convergence in (S3) is uniform in over , where is as in subsection B.4. In our case, . Hence, we need to show uniform convergence in (S3) in over . We need to establish that is compact, but this has already been shown by Chatterjee and Bhattacharya (2020). In a nutshell, Chatterjee and Bhattacharya (2020) proved compactness of for each by showing that is closed, bounded and equicontinuous, and then using the Arzela-Ascoli theorem to conclude compactness. It should be noted that boundedness of the partial derivatives as in Assumption 2 is used to show Lipschitz continuity, hence equicontinuity.
Consider . Now, to show uniform convergence we only need to show the following (see, for example, Chatterjee and Bhattacharya (2020)):
(i) is stochastically equicontinuous almost surely in ,
(ii) for all as .
We have already shown almost sure pointwise convergence of to in Appendix B.3. Hence it is enough to verify stochastic equicontinuity of in . Stochastic equicontinuity usually follows easily if one can show that the function concerned is almost surely Lipschitz continuous (Chatterjee and Bhattacharya (2020)). Observe that if we can show that both and are Lipschitz, then is Lipschitz, since a sum of Lipschitz functions is Lipschitz.
We now show that and are both Lipschitz in . Now,
(B.16) |
Let correspond to . Note that, since on (), it follows that , for all . Thus, there exists such that and , for . Hence,
showing Lipschitz continuity of with respect to corresponding to . Since is continuously differentiable, and are bounded on , with the same bound for all , it follows that is Lipschitz on .
To see that is also Lipschitz in , it is enough to note that
and the result follows since is Lipschitz on .
B.6 Verification of (S6) for binary regression
We need to show:
(B.17) |
Let us take . Observe that,
It follows that:
(B.18)
(B.19)
(B.20)
Since are binary, it follows using the inequalities , for and Assumptions 5 and 6, that the random variables and are absolutely bounded by , for some . We shall apply Hoeffding’s inequality (Hoeffding (1963)) separately on the two terms of (B.20) involving and .
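For completeness, the version of Hoeffding's inequality invoked here is the standard one for sums of independent bounded random variables: if $Z_1,\ldots,Z_n$ are independent with $a_i\le Z_i\le b_i$ almost surely, then for every $t>0$,
\[
P\left(\left|\sum_{i=1}^{n}\left(Z_i-E Z_i\right)\right|\ge t\right)
\le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^{n}(b_i-a_i)^2}\right).
\]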
Note that for ,
(B.21) |
where is the Lipschitz constant associated with . Here it is important to note that for , is Lipschitz in thanks to continuous differentiability of , and boundedness of and by the same constant on . Also note that (B.21) holds irrespective of whether is random or non-random (see Chatterjee and Bhattacharya (2020)).
Similarly, for ,
(B.22) |
Now,
(B.23) |
and
(B.24) |
Then proceeding in the same way as (S-2.25) – (S-2.30) of Chatterjee and Bhattacharya (2020), and noting that , we obtain (B.17).
Hence (S6) holds.
Remark 1.
It is important to clarify the role of Assumption 6 here. Note that we need a lower bound for . For instance, if , then even if on , it holds that for all , for all , for some constant . In our bounding method using the inequality for , we have . It would then follow that the exponent of the Hoeffding inequality is . This would fail to ensure summability of the corresponding terms involving . Thus, we need to ensure that is bounded away from . Similarly, the infinite sum associated with would not be finite unless is bounded away from .
B.7 Verification of (S7) for binary regression
This verification follows from the fact that is continuous. Indeed, for any set with , . It follows from continuity of that as and hence (S7) holds.
Appendix C Verification of (S1) to (S7) for Poisson regression
C.1 Verification of (S1) for Poisson regression
Observe that
Therefore,
(C.1) |
and,
(C.2) |
Note that for any , . Let ; be such that , as . Then, letting , for all , it follows, since on , that there exists such that for , . Hence, using the inequalities for , we obtain , for some , for . It follows that
in the same way as in the binary regression, using Taylor’s series expansion up to the first order. Hence, is continuous in , ensuring measurability of , and hence of . It follows that is measurable.
Also, continuity of with respect to ensures measurability of . Thus, , and hence , is measurable.
C.2 Verification of (S2) for Poisson regression
For every , we need to show that the KL divergence rate
exists (possibly being infinite) and is -measurable.
Now,
Therefore,
The last line holds due to Assumption 4 and the SLLN. Here . In other words,
(C.3) |
C.3 Verification of (S3) for Poisson regression
Here we need to verify the asymptotic equipartition property, that is, almost surely with respect to the true model ,
(C.4) |
C.4 Verification of (S4) for Poisson regression
If then we need to show . But this holds in almost the same way as for binary regression. In other words, (S4) holds for Poisson regression.
C.5 Verification of (S5) for Poisson regression
The parameter space here remains the same as in the binary regression case, that is, . We also consider the same sequence as in binary regression. We need to verify that
1. , as ;
2. The inequality holds for some ;
3. The convergence in (S3) is uniform over .
C.5.1 Verification of (S5) (1)
We now need to verify that as . But this holds in the same way as for binary regression.
C.5.2 Verification of (S5) (2)
Again, this holds in the same way as for binary regression.
C.5.3 Verification of (S5) (3)
Using the same arguments as in the binary regression case, here we only need to show that and are both Lipschitz.
Recall that
For any , there exists such that , for all , where and . Hence,
Thus, is almost surely Lipschitz with respect to . Since, by Kolmogorov’s SLLN for independent variables, , as , and since is Lipschitz in in the same way as in binary regression, the desired stochastic equicontinuity follows. Lipschitz continuity of in follows using similar techniques.
C.6 Verification of (S6) for Poisson Regression
Since
(C.6) |
and the second term of (C.6) is finite, it is enough to show that the first term of (C.6) is finite.
Let us take . Observe that for ,
(C.7)
(C.8)
(C.9)
Using Hoeffding’s inequality and Lipschitz continuity of in as in binary regression, we find that (C.7) and (C.8) are bounded above by , and , for some and . These bounds hold even if the covariates are non-random.
To bound (C.9), we shall first show that the summands are sub-exponential, and then shall apply Bernstein’s inequality (see, for example, Uspensky (1937), Bennett (1962), Massart (2003)). Direct calculation yields
(C.10) |
The first factor of (C.10) has the following upper bound:
(C.11) |
A bound for the second factor of (C.10) is given as follows:
(C.12) |
for , where , for some .
Combining (C.10), (C.11) and (C.12) we see that (C.10) is bounded above by provided that
(C.13) |
The rightmost bound of (C.13) is close to zero if is chosen sufficiently small. Now consider the function , where is given by (C.10). Since is continuous in and and on , it follows that on the sufficiently small interval , . In other words, (C.10) is bounded above by for . Thus, are independent sub-exponential variables with parameter .
Bernstein’s inequality, in conjunction with Lipschitz continuity of on then ensures that (C.9) is bounded above by , for positive constants and .
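For reference, a standard form of Bernstein's inequality for sub-exponential variables, stated in our notation (constants vary across references), is the following: if $Z_1,\ldots,Z_n$ are independent, mean-zero and sub-exponential with parameters $(\nu,b)$, then for every $t>0$,
\[
P\left(\left|\frac{1}{n}\sum_{i=1}^{n}Z_i\right|\ge t\right)
\le 2\exp\left(-\frac{n}{2}\min\left\{\frac{t^2}{\nu^2},\frac{t}{b}\right\}\right),
\]
which produces exponential bounds of the kind used for (C.9).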
The rest of the proof of finiteness of (C.6) follows in the same (indeed, simpler) way as Chatterjee and Bhattacharya (2020). Hence (S6) holds.
Remark 2.
Arguments similar to those of Remark 1 show that it is essential to have bounded away from zero.
C.7 Verification of (S7) for Poisson regression
This verification follows from the fact that is continuous, similar to binary regression.
References
- Adler (1981) Adler, R. J. (1981). The Geometry of Random Fields. John Wiley & Sons Ltd., New York.
- Adler and Taylor (2007) Adler, R. J. and Taylor, J. E. (2007). Random Fields and Geometry. Springer, New York.
- Bennett (1962) Bennett, G. (1962). Probability Inequalities for the Sums of Independent Random Variables. Journal of the American Statistical Association, 57, 33–45.
- Blackwell and Dubins (1962) Blackwell, D. and Dubins, L. (1962). Merging of Opinions With Increasing Information. The Annals of Mathematical Statistics, 33, 882–886.
- Chatterjee and Bhattacharya (2020) Chatterjee, D. and Bhattacharya, S. (2020). On Posterior Convergence of Gaussian and General Stochastic Process Regression Under Possible Misspecifications. ArXiv preprint.
- Choudhuri et al. (2007) Choudhuri, N., Ghosal, S., and Roy, A. (2007). Nonparametric Binary Regression Using a Gaussian Process Prior. Statistical Methodology, 4, 227–243.
- Cramer and Leadbetter (1967) Cramer, H. and Leadbetter, M. R. (1967). Stationary and Related Stochastic Processes. Wiley, New York.
- Diaconis and Freedman (1986) Diaconis, P. and Freedman, D. (1986). On the consistency of Bayes estimates (with discussion). Annals of Statistics, 14, 1–67.
- Diaconis and Freedman (1993) Diaconis, P. and Freedman, D. A. (1993). Nonparametric Binary Regression: A Bayesian Approach. The Annals of Statistics, 21, 2108–2137.
- Gelfand and Kuo (1991) Gelfand, A. E. and Kuo, L. (1991). Nonparametric Bayesian Bioassay Including Ordered Polytomous Response. Biometrika, 78, 657–666.
- Ghosal and Roy (2006) Ghosal, S. and Roy, A. (2006). Posterior Consistency of Gaussian Process Prior for Nonparametric Binary Regression. The Annals of Statistics, 34, 2413–2429.
- Hoeffding (1963) Hoeffding, W. (1963). Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 58, 13–30.
- Massart (2003) Massart, P. (2003). Concentration Inequalities and Model Selection. Volume 1896 of Lecture Notes in Mathematics. Springer-Verlag. Lectures given at the 33rd Probability Summer School in Saint-Flour.
- Newton et al. (1996) Newton, M. A., Czado, C., and Chappell, R. (1996). Bayesian Inference for Semiparametric Binary Regression. Journal of the American Statistical Association, 91, 142–153.
- Pillai et al. (2007) Pillai, N., Wolpert, R. L., and Clyde, M. A. (2007). A Note on Posterior Consistency of Nonparametric Poisson Regression Models. Available at https://pdfs.semanticscholar.org/27f5/af4d00cef092c8b19662951cc316c2e222b7.pdf.
- Shalizi (2009) Shalizi, C. R. (2009). Dynamics of Bayesian Updating With Dependent Data and Misspecified Models. Electronic Journal of Statistics, 3, 1039–1074.
- Uspensky (1937) Uspensky, J. V. (1937). Introduction to Mathematical Probability. McGraw-Hill, New York, USA.
- Ye et al. (2018) Ye, X., Wang, K., Zou, Y., and Lord, D. (2018). A Semi-nonparametric Poisson Regression Model for Analyzing Motor Vehicle Crash Data. PLoS One, 23, 15.