
Posterior Convergence of Nonparametric Binary and Poisson Regression Under Possible Misspecifications

Debashis Chatterjee and Sourabh Bhattacharya†,+
Abstract

In this article, we investigate posterior convergence of nonparametric binary and Poisson regression under possible model misspecification, assuming a general stochastic process prior with appropriate properties. Our model setup and objective for binary regression are similar to those of Ghosal and Roy (2006), where the authors used the approach of entropy bounds and exponentially consistent tests with the sieve method to achieve consistency with respect to their Gaussian process prior. In contrast, for both binary and Poisson regression, using a general stochastic process prior, our approach involves verification of the asymptotic equipartition property along with the method of sieves, which is an adaptation of the general results of Shalizi (2009), useful even for misspecified models. Moreover, we establish not only posterior consistency but also the rate at which the posterior probabilities converge, which turns out to be the Kullback-Leibler divergence rate. We also investigate the traditional posterior convergence rate. Interestingly, from the subjective Bayesian viewpoint we show that the posterior predictive distribution can accurately approximate the best possible predictive distribution, in the sense that the Hellinger distance, as well as the total variation distance, between the two distributions can tend to zero, in spite of misspecifications.
Keywords: Binary/Poisson regression; Cumulative distribution function; Infinite dimension; Kullback-Leibler divergence rate; Misspecification; Posterior convergence.

Indian Statistical Institute

+ Corresponding author: [email protected]

1 Introduction

Nonparametric regression is frequently needed in practical scenarios where no parametric model fits the data. In particular, nonparametric regression for binary dependent variables is very common in various branches of statistics such as medical and spatial statistics, whereas the nonparametric version of Poisson regression has recently been used in many non-trivial scenarios, such as analyzing the likelihood and severity of vehicle crashes (Ye et al. (2018)). Interestingly, despite the wide applicability of both binary and Poisson regression, the available literature on nonparametric Poisson regression is scarce in comparison with that on nonparametric binary regression. The Bayesian approach to the nonparametric binary regression problem has been accounted for in Diaconis and Freedman (1993). An account of posterior consistency for Gaussian process priors in nonparametric binary regression modeling can be found in Ghosal and Roy (2006), where the authors suggested that similar consistency results should hold for the nonparametric Poisson regression setup. Literature on consistency results for nonparametric Poisson regression is very limited. Pillai et al. (2007) obtained consistency results for Poisson regression using an approach similar to that of Ghosal and Roy (2006) under certain assumptions, but so far without explicit specifications and details on the prior. On the other hand, our approach is based on the results of Shalizi (2009), which is much different from that of Ghosal and Roy (2006) and capable of handling model misspecification. Unlike the previous works, the approach of Shalizi (2009) also enables us to investigate the rate at which the posterior converges, which turns out to be the Kullback-Leibler (KL) divergence rate, as well as the traditional posterior convergence rate.

In this article, we investigate posterior convergence of nonparametric binary and Poisson regression where the nonparametric regression is modeled as some suitable stochastic process. In the binary situation, we consider a setup similar to that of Ghosal and Roy (2006), where the authors considered binary observations with response probability an unknown smooth function of a set of covariates, modeled using a Gaussian process. Here we consider a binary response variable $Y$ and a $d$-dimensional covariate $x$ belonging to a compact subset. The probability function is given by $p(x)=P(Y=1|X=x)$, along with a prior for $p$ induced by some appropriate stochastic process $\eta(x)$ through the relation $p(x)=H\left(\eta(x)\right)$ for a known, non-decreasing and continuously differentiable cumulative distribution function $H(\cdot)$. We establish a posterior convergence theory for nonparametric binary regression under possible misspecifications based on the general theory of posterior convergence of Shalizi (2009). Our theory also covers misspecified models, that is, the case where the true regression function is not even supported by the prior. This approach to Bayesian asymptotics also permits us to show that the relevant posterior probabilities converge at the KL divergence rate, and that the posterior convergence rate with respect to the KL divergence is just slower than $\frac{1}{n}$, where $n$ denotes the number of observations. We further show that even in the case of misspecification, the posterior predictive distribution can approximate the best possible predictive distribution adequately, in the sense that the Hellinger distance, as well as the total variation distance, between the two distributions can tend to zero.

For nonparametric Poisson regression, given $x$ in the compact space of covariates, we model the mean function $\lambda(x)$ as $\lambda(x)=H(\eta(x))$, where $H$ is a continuously differentiable function. Again, we investigate the general theory of posterior convergence, including misspecification, the rate of convergence of the posterior distribution and the usual posterior convergence rate, in Shalizi's framework.

The rest of our paper is structured as follows. In Section 2 we provide a brief overview and intuitive explanation of the main assumptions and results of Shalizi (2009) relevant to our approach. The basic premises of nonparametric binary and Poisson regression are provided in Sections 3 and 4, respectively. The required assumptions and their discussion are provided in Section 5. In Section 6, our main results on posterior convergence of binary and Poisson regression are provided, Section 7 addresses the rate of convergence, while Section 8 details the consequences of misspecification. Concluding remarks are provided in Section 9.

The technical details are presented in the Appendix. Specifically, details of the necessary assumptions and results of Shalizi (2009) are provided in Appendix A. The detailed proofs of verification of Shalizi’s assumptions are provided in Appendix B and Appendix C for binary and Poisson regression setups, respectively.

2 An outline of the main assumptions and results of Shalizi

Let the set of random variables for the response be denoted by $\mathbf{Y}_{n}=\left(Y_{1},Y_{2},\ldots,Y_{n}\right)$. For a given parameter space $\Theta$, let $f_{\theta}(\mathbf{Y}_{n})$ be the observed likelihood and $f_{\theta_{0}}(\mathbf{Y}_{n})$ the true likelihood. We assume $\theta\in\Theta$, but the truth $\theta_{0}$ need not be in $\Theta$, thus allowing possible misspecification.

The KL divergence $KL(f,g)=\int f\log\left(\frac{f}{g}\right)$ is a measure of divergence between two probability densities $f$ and $g$. The KL divergence is related to likelihood ratios, since by the Strong Law of Large Numbers (SLLN) in the independent and identically distributed ($iid$) situation,

\frac{1}{n}\sum_{i=1}^{n}\log\left[\frac{f(Y_{i})}{g(Y_{i})}\right]\rightarrow KL(f,g).

For every $\theta\in\Theta$, the KL divergence rate is given by:

h(\theta)=\lim_{n\rightarrow\infty}\frac{1}{n}E\left[\log\left\{\frac{f_{\theta_{0}}(\mathbf{Y}_{n})}{f_{\theta}(\mathbf{Y}_{n})}\right\}\right]. (2.1)

The key ingredient of the approach of Shalizi (2009) for proving convergence of the posterior distribution of $\theta$ is to show that the asymptotic equipartition property holds. To illustrate, let us consider the following likelihood ratio:

R_{n}(\theta)=\frac{f_{\theta}(\mathbf{Y}_{n})}{f_{\theta_{0}}(\mathbf{Y}_{n})}. (2.2)

If we think of the $iid$ setup, $h(\theta)$ reduces to the KL divergence between the true and the hypothesized model. For each $\theta\in\Theta$, the “asymptotic equipartition” property is as follows:

\lim_{n\rightarrow\infty}\frac{1}{n}\log\left[R_{n}(\theta)\right]=-h(\theta). (2.3)

Here “asymptotic equipartition” refers to dividing up $\log\left[R_{n}(\theta)\right]$ into $n$ factors for large $n$ such that all the factors are asymptotically equal. For illustration, in the $iid$ scenario, each factor converges to the same KL divergence between the true and the postulated model. The purpose of asymptotic equipartition is to ensure that, relative to the true distribution, the likelihood of each $\theta$ decreases to zero exponentially fast, with the rate being the KL divergence rate.
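In the $iid$ case, (2.3) can be checked numerically. The following minimal sketch (our own illustration, not part of the formal development; the Bernoulli probabilities and sample size are arbitrary choices) simulates $iid$ Bernoulli data from a true density $f_{0}$ and verifies that $\frac{1}{n}\log R_{n}$ approaches $-KL(f_{0},f)$ for a hypothesized density $f$.

# Illustrative check of the asymptotic equipartition property (2.3)
# in the iid Bernoulli case; a sketch, not the paper's formal argument.
import numpy as np

rng = np.random.default_rng(0)
p0, p = 0.6, 0.4          # true and hypothesized success probabilities (our choices)
n = 200_000
y = rng.binomial(1, p0, size=n)

# log R_n = sum_i [ y_i*log(p/p0) + (1-y_i)*log((1-p)/(1-p0)) ]
log_Rn = np.cumsum(y * np.log(p / p0) + (1 - y) * np.log((1 - p) / (1 - p0)))
scaled = log_Rn / np.arange(1, n + 1)

kl = p0 * np.log(p0 / p) + (1 - p0) * np.log((1 - p0) / (1 - p))
print(scaled[-1], -kl)    # (1/n) log R_n is close to -KL(f0, f) for large n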

For $A\subseteq\Theta$, let

h\left(A\right)=\underset{\theta\in A}{\mbox{ess~inf}}~h(\theta); (2.4)
J(\theta)=h(\theta)-h(\Theta); (2.5)
J(A)=\underset{\theta\in A}{\mbox{ess~inf}}~J(\theta), (2.6)

where $h(A)$ roughly represents the minimum KL divergence between the postulated and the true model over the set $A$. If $h(\Theta)>0$, this indicates model misspecification. However, as we shall show, model misspecification need not always imply that $h(\Theta)>0$. One such counterexample is given in Chatterjee and Bhattacharya (2020).

Observe that, for $A\subset\Theta$, $J(A)>0$. For the prior, it is required to construct an appropriate sequence of sieve sets $\mathcal{G}_{n}\rightarrow\Theta$ as $n\rightarrow\infty$ such that:

  1. $h\left(\mathcal{G}_{n}\right)\rightarrow h\left(\Theta\right)$, as $n\rightarrow\infty$.

  2. $\pi\left(\mathcal{G}_{n}\right)\geq 1-\alpha\exp\left(-\beta n\right)$, for some $\alpha>0$, $\beta>2h(\Theta)$.

The sets $\mathcal{G}_{n}$ can be interpreted as sieves in the sense that the behaviour of the likelihood ratio and the posterior on the sets $\mathcal{G}_{n}$ essentially carries over to $\Theta$.

Let $\pi(\cdot|\mathbf{Y}_{n})$ denote the posterior distribution of $\theta$ given $\mathbf{Y}_{n}$. With the above notions, verification of (2.3), along with several other technical conditions (details given in Appendix A), ensures that for any $A\subseteq\Theta$ with $\pi(A)>0$,

\lim_{n\rightarrow\infty}\pi(A|\mathbf{Y}_{n})=0, (2.7)

almost surely, provided that $h(A)>h(\Theta)$. The condition $h(A)>h(\Theta)$ implies positive KL divergence on $A$, even if $h(\Theta)=0$. That is, $A$ is a set on which the postulated model fails to capture the true model in terms of the KL divergence. Hence, expectedly, the posterior probability of that set converges to zero.

Under mild assumptions, it also holds that

\lim_{n\rightarrow\infty}\frac{1}{n}\log\pi(A|\mathbf{Y}_{n})=-J(A), (2.8)

almost surely. This result shows that the rate at which the posterior probability of $A$ converges to zero is about $\exp(-nJ(A))$. From the above results it is clear that the posterior concentrates on sets of the form $N_{\epsilon}=\left\{\theta:h(\theta)\leq h(\Theta)+\epsilon\right\}$, for any $\epsilon>0$.

Shalizi addressed the rate of posterior convergence as follows. Letting $N_{\epsilon_{n}}=\left\{\theta:h(\theta)\leq h(\Theta)+\epsilon_{n}\right\}$, where $\epsilon_{n}\rightarrow 0$ such that $n\epsilon_{n}\rightarrow\infty$, Shalizi showed, under an additional technical assumption, that almost surely,

\lim_{n\rightarrow\infty}\pi\left(N_{\epsilon_{n}}|\mathbf{Y}_{n}\right)=1. (2.9)

Moreover, it was shown by Shalizi that the squares of the Hellinger and the total variation distances between the posterior predictive distribution and the best possible predictive distribution under the truth are asymptotically almost surely bounded above by $h(\Theta)$ and $4h(\Theta)$, respectively. That is, if $h(\Theta)=0$, then this allows very accurate approximation of the true predictive distribution by the posterior predictive distribution.

3 Model setup and preliminaries of the binary regression

Let $Y\in\{0,1\}$ be a binary outcome variable and $X$ a vector of covariates. Suppose $Y_{1},Y_{2},\ldots,Y_{n}\in\{0,1\}$ are independent binary responses conditional on unobserved covariates $X_{1},X_{2},\ldots,X_{n}\in\mathfrak{X}\subset\Re^{d}$. We assume that the covariate space $\mathfrak{X}$ is compact. Let $\mathbf{Y}_{n}=(Y_{1},Y_{2},\ldots,Y_{n})^{T}$ be the binary response random variables against the covariate vector $\mathbf{X}_{n}=(X_{1},X_{2},\ldots,X_{n})^{T}$. The corresponding observed values will be denoted by $\mathbf{y}_{n}=(y_{1},y_{2},\ldots,y_{n})$ and $\mathbf{x}_{n}=(x_{1},x_{2},\ldots,x_{n})$, respectively. The model is specified as follows: for $i=1,2,\ldots,n$:

Y_{i}|X_{i}\sim\mbox{Binomial}\left(1,p(X_{i})\right) (3.1)
p(x)=H\left(\eta(x)\right) (3.2)
\eta(\cdot)\sim\pi_{\eta}, (3.3)

where $\pi_{\eta}$ is the prior corresponding to some suitable stochastic process. Note that the prior for $p$ is induced by the prior for $\eta$. Our concern is to infer about the success probability function $p(x)=P(Y=1|X=x)$ as the number of observations goes to infinity. We assume that the functions $\eta$ have continuous first partial derivatives; we denote this class of functions by $\mathcal{C}^{\prime}(\mathfrak{X})$. We do not assume that the truth $\eta_{0}$ lies in $\mathcal{C}^{\prime}(\mathfrak{X})$, allowing misspecification. The link function $H$ is a known, non-decreasing, continuously differentiable cumulative distribution function on the real line $\Re$. It is standard to assume the link function $H(\cdot)$ to be known as part of the model assumption. For example, in logistic regression we choose the standard logistic cumulative distribution function as the link function, whereas in probit regression $H$ is chosen to be the standard normal cumulative distribution function $\Phi$. More discussion on link functions, along with several other examples, can be found in Choudhuri et al. (2007), Newton et al. (1996) and Gelfand and Kuo (1991). A Bayesian method for estimation of $p$ has been provided in Choudhuri et al. (2007). It has been shown in Ghosal and Roy (2006) that the sample paths of Gaussian processes can well approximate a large class of functions, and hence it is not essential to consider additional uncertainty in the link function $H$.
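The data-generating scheme (3.1)–(3.3) is easy to simulate. The sketch below is purely illustrative (not part of the paper's development): it draws $\eta$ from a squared-exponential Gaussian process prior on a one-dimensional covariate space, a choice of the kind discussed in Section 5, with our own kernel and hyperparameter values, and uses the probit link $H=\Phi$.

# A minimal sketch of (3.1)-(3.3): eta drawn from an illustrative
# squared-exponential Gaussian process prior on [0, 1], probit link.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0.0, 1.0, size=n)                 # covariates in the compact set [0, 1]

def se_kernel(a, b, ell=0.2, sigma=1.0):          # illustrative kernel choice
    return sigma**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

K = se_kernel(x, x) + 1e-8 * np.eye(n)
eta = rng.multivariate_normal(np.zeros(n), K)     # one draw of eta(x_1),...,eta(x_n)

p = norm.cdf(eta)                                 # p(x) = H(eta(x)), probit link
y = rng.binomial(1, p)                            # Y_i | X_i ~ Binomial(1, p(X_i))
print(y[:10], p[:10].round(3))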

Let $\mathfrak{C}$ be the counting measure on $\{0,1\}$. Then according to the model assumption, the conditional density of $y$ given $x$ with respect to $\mathfrak{C}$ is represented by the density function $f$ as follows:

f(y|x)=p(x)^{y}\left(1-p(x)\right)^{1-y}. (3.4)

The prior for $f$ will be denoted by $\pi$. Let $f_{0}$ and $p_{0}$ denote the true density and the true success probability, respectively. Then under the truth, the conditional density is:

f_{0}(y|x)=p_{0}(x)^{y}\left(1-p_{0}(x)\right)^{1-y}. (3.5)

One of the main objectives of this article is to show consistency of the posterior distribution of $p$, treated as a parameter arising from the parameter space $\Theta$ specified as follows:

\Theta=\left\{p(\cdot):p(x)=H\left(\eta(x)\right),\eta\in\mathcal{C}^{\prime}(\mathfrak{X})\right\}, (3.6)

or simply, $\Theta=\mathcal{C}^{\prime}(\mathfrak{X})$.

4 Model setup and preliminaries of Poisson regression

For the Poisson regression model setup, let $Y\in\mathbb{N}$ be a count outcome variable and $X$ a vector of covariates, where $\mathbb{N}$ denotes the set of non-negative integers. Suppose $Y_{1},Y_{2},\ldots,Y_{n}\in\mathbb{N}$ are independent responses conditional on covariates $X_{1},X_{2},\ldots,X_{n}\in\mathfrak{X}\subset\Re^{d}$. We assume that the covariate space $\mathfrak{X}$ is compact. Let $\mathbf{Y}_{n}=(Y_{1},Y_{2},\ldots,Y_{n})^{T}$ be the response random variables against the covariate vector $\mathbf{X}_{n}=(X_{1},X_{2},\ldots,X_{n})^{T}$. The corresponding observed values will be denoted by $\mathbf{y}_{n}=(y_{1},y_{2},\ldots,y_{n})$ and $\mathbf{x}_{n}=(x_{1},x_{2},\ldots,x_{n})$, respectively. Let the parameter space be specified as follows:

\Lambda=\left\{\lambda(\cdot):\lambda(x)=H\left(\eta(x)\right),\eta\in\mathcal{C}^{\prime}(\mathfrak{X})\right\}. (4.1)

The link function $H$ is a known, non-negative, continuously differentiable function on $\Re$. We equivalently define the parameter space as $\Theta=\mathcal{C}^{\prime}(\mathfrak{X})$. Thus, in what follows, we shall use both $\Lambda$ and $\Theta$ to denote the parameter space, depending on convenience. The model is specified as follows: for $i=1,2,\ldots,n$,

Y_{i}|X_{i}\sim\exp\left(-\lambda(X_{i})\right)\dfrac{(\lambda(X_{i}))^{y}}{y!} (4.2)
\lambda(x)=H\left(\eta(x)\right); (4.3)
\eta(\cdot)\sim\pi_{\eta}. (4.4)

Similar to binary regression, our concern here is to infer about $\lambda(x)$ as the number of observations goes to infinity. As before, we do not assume that the truth $\eta_{0}$ lies in $\mathcal{C}^{\prime}(\mathfrak{X})$, allowing misspecification.

Now, let $\mathfrak{C}$ be the counting measure on $\mathbb{N}$. According to the model assumption for Poisson regression, the conditional density of $y$ given $x$ with respect to $\mathfrak{C}$ is represented by the density function $f$ as follows:

f(y|x)=\exp\left(-\lambda(x)\right)\dfrac{(\lambda(x))^{y}}{y!}. (4.5)

The prior for $f$ will be denoted by $\Pi$. Let $f_{0}$ and $\lambda_{0}$ denote the true density and the true mean function, respectively. Again, one of our main aims is to establish consistency of the posterior distribution of $\lambda$, treated as a parameter arising from $\Lambda$.
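For concreteness, the Poisson analogue of the earlier simulation sketch is given below. It is again purely illustrative: the kernel and hyperparameters are our own choices, and $H=\exp$ is used only for simplicity (the truncated link (5.7) introduced later would additionally enforce Assumption 7).

# A minimal sketch of the Poisson setup (4.2)-(4.4): illustrative GP prior
# for eta, with a positive smooth link H (here H = exp, for simplicity).
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0.0, 1.0, size=n)

def se_kernel(a, b, ell=0.2, sigma=1.0):
    return sigma**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

K = se_kernel(x, x) + 1e-8 * np.eye(n)
eta = rng.multivariate_normal(np.zeros(n), K)

lam = np.exp(eta)                 # lambda(x) = H(eta(x)), non-negative and smooth
y = rng.poisson(lam)              # Y_i | X_i ~ Poisson(lambda(X_i))
print(y[:10], lam[:10].round(3))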

5 Assumptions and their discussions

We need to make some appropriate assumptions for establishing convergence of both the binary and Poisson regression models equipped with stochastic process priors; the priors themselves also require suitable assumptions. Many of the assumptions are similar to those made in Chatterjee and Bhattacharya (2020), and hence serve the same purposes as discussed there, which we briefly touch upon here.

Assumption 1.

$\mathfrak{X}$ is a compact, $d$-dimensional space, for some finite $d\geq 1$, equipped with a suitable metric.

Assumption 2.

Recall that in our notation, $\mathcal{C}^{\prime}(\mathfrak{X})$ denotes the class of continuously partially differentiable functions on $\mathfrak{X}$. In other words, the functions $\eta\in\mathcal{C}^{\prime}(\mathfrak{X})$ are continuous on $\mathfrak{X}$ and for such functions the limit

\eta_{j}^{\prime}(\mathbf{x})=\dfrac{\partial\eta(\mathbf{x})}{\partial x_{j}}=\lim_{h\rightarrow 0}\dfrac{\eta\left(\mathbf{x}+h\mathbf{\delta}_{j}\right)-\eta(\mathbf{x})}{h} (5.1)

exists for each $\mathbf{x}\in\mathfrak{X}$ and is continuous on $\mathfrak{X}$. Here $\mathbf{\delta}_{j}$ is the $d$-dimensional vector with the $j$-th element equal to 1 and all other elements zero.

Assumption 3.

The prior for $\eta$ is chosen such that for $\beta>2h\left(\Theta\right)$,

\pi\left(\|\eta\|\leq\exp\left(\left(\beta n\right)^{1/4}\right)\right)\geq 1-c_{\eta}\exp\left(-\beta n\right);
\pi\left(\|\eta^{\prime}_{j}\|\leq\exp\left(\left(\beta n\right)^{1/4}\right)\right)\geq 1-c_{\eta^{\prime}_{j}}\exp\left(-\beta n\right),~\mbox{for}~j=1,\ldots,d;

where $c_{\eta}$ and $c_{\eta^{\prime}_{j}}$; $j=1,\ldots,d$, are positive constants.

We treat the covariates as either random (observed or unobserved) or non-random (observed). Accordingly, in Assumption 4 below we provide conditions pertaining to these aspects.

Assumption 4.
  • (i)

    $\{x_{i}:i=1,2,\ldots\}$ is an observed or unobserved $iid$ sample associated with some probability measure $Q$, supported on $\mathfrak{X}$, which is independent of $\{y_{i}:i=1,2,\ldots\}$.

  • (ii)

    $\{x_{i}:i=1,2,\ldots\}$ is an observed non-random sample. In this case, we consider a specific partition of the $d$-dimensional space $\mathfrak{X}$ into $n$ subsets such that each subset of the partition contains at least one $x\in\{x_{i}:i=1,2,\ldots\}$ and has Lebesgue measure $\frac{L}{n}$, for some $L>0$.

Assumption 5.

The true function $\eta_{0}$ is bounded in sup norm. In other words, the truth $\eta_{0}$ satisfies the following for some constant $\kappa_{0}$:

\|\eta_{0}\|_{\infty}<\kappa_{0}<\infty. (5.2)

Observe that in general $\eta_{0}\notin\mathcal{C}^{\prime}(\mathfrak{X})$. For random covariate $X$, we assume that $\eta_{0}(X)$ is measurable.

Assumption 6.

For the binary regression model setup we assume a uniform positive lower bound $\kappa_{B}$ for $\min\{p(\cdot),1-p(\cdot)\}$. In other words, for all $p\in\Theta$,

\inf\{\min\left(p(x),1-p(x)\right):x\in\mathfrak{X}\}\geq\kappa_{B}>0, (5.3)

where $\Theta$ is as defined in (3.6).

Assumption 7.

For the Poisson regression model setup we assume a uniform positive lower bound $\kappa_{P}$ for $\lambda(\cdot)$. In other words, for all $\lambda\in\Lambda$,

\inf\{\lambda(x):x\in\mathfrak{X}\}\geq\kappa_{P}>0, (5.4)

where $\Lambda$ is as defined in (4.1).

5.1 Discussion of the assumptions

Assumption 1 is on compactness of $\mathfrak{X}$, which guarantees that continuous functions on $\mathfrak{X}$ have finite sup-norms.

Assumption 2 is as taken in Chatterjee and Bhattacharya (2020) for the purpose of constructing appropriate sieves in order to establish posterior convergence results. More precisely, Assumption 2 is required to ensure that $\eta$ is Lipschitz continuous on the sieves. Since a differentiable function is Lipschitz if and only if its partial derivatives are bounded, this serves our purpose, as continuity of the partial derivatives of $\eta$ guarantees their boundedness on the compact domain $\mathfrak{X}$. In particular, if $\eta$ is a Gaussian process, conditions presented in Adler (1981), Adler and Taylor (2007) and Cramér and Leadbetter (1967) guarantee the continuity and smoothness properties required by Assumption 2. We refer to Chatterjee and Bhattacharya (2020) for more discussion.

Assumption 3 is required to ensure that the complements of the sieves have exponentially small prior probabilities. In particular, this assumption is satisfied if $\eta$ is a Gaussian process, even if $\exp\left(\left(\beta n\right)^{1/4}\right)$ is replaced with $\sqrt{\beta n}$.
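The sup-norm bound of Assumption 3 can be examined by simulation. The sketch below is an illustration only (the kernel, $\beta$, grid size and sample size are our own choices, and a finite grid only approximates the sup norm): it estimates $\pi\left(\|\eta\|\leq\exp\left((\beta n)^{1/4}\right)\right)$ for a Gaussian process prior by Monte Carlo.

# Monte Carlo sketch of the sup-norm bound in Assumption 3 for a GP prior
# evaluated on a grid; illustrative choices throughout.
import numpy as np

rng = np.random.default_rng(3)
grid = np.linspace(0.0, 1.0, 50)
K = np.exp(-0.5 * (grid[:, None] - grid[None, :])**2 / 0.2**2) + 1e-8 * np.eye(50)

beta, n_obs, n_sim = 0.5, 100, 2000
threshold = np.exp((beta * n_obs) ** 0.25)        # exp((beta*n)^(1/4))

draws = rng.multivariate_normal(np.zeros(50), K, size=n_sim)
sup_norms = np.abs(draws).max(axis=1)
print((sup_norms <= threshold).mean())            # essentially 1 for moderate n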

Assumption 4 concerns the covariates $x_{i}$, accordingly as they are considered an observed random sample, an unobserved random sample, or non-random. Note that, thanks to the strong law of large numbers (SLLN), given any $\eta$ in the complement of some null set with respect to the prior, and given any sequence $\left\{\mathbf{x}_{i}:i=1,2,\ldots\right\}$, Assumption 4 (i) ensures that for any integrable function $g$, as $n\rightarrow\infty$,

\frac{1}{n}\sum_{i=1}^{n}g(\mathbf{x}_{i})\rightarrow\int_{\mathfrak{X}}g(\mathbf{x})dQ(\mathbf{x})=E_{\mathbf{X}}\left[g(\mathbf{X})\right]~\mbox{(say)}, (5.5)

where $Q$ is some probability measure supported on $\mathfrak{X}$.

Assumption 4 (ii) ensures that $\frac{1}{n}\sum_{i=1}^{n}g(\mathbf{x}_{i})$ is a particular Riemann sum and hence (5.5) holds with $Q$ being the Lebesgue measure on $\mathfrak{X}$. We continue to denote the limit in this case by $E_{\mathbf{X}}\left[g(\mathbf{X})\right]$.
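A quick numerical illustration of (5.5) under both parts of Assumption 4 is given below; the test function $g$ and the choice $Q=\mbox{Uniform}(0,1)$ (equivalently, Lebesgue measure on $[0,1]$) are ours, chosen so that the exact integral is known.

# Numerical illustration of (5.5): averages of g over a random iid design
# (Assumption 4(i)) and over a regular grid with one point per cell of
# length 1/n (Assumption 4(ii)) both approach the integral of g.
import numpy as np

rng = np.random.default_rng(4)
g = lambda x: np.sin(2 * np.pi * x) ** 2          # an integrable test function
exact = 0.5                                       # integral of g over [0, 1]

n = 100_000
x_random = rng.uniform(0.0, 1.0, size=n)          # iid design, Q = Uniform(0, 1)
x_grid = (np.arange(n) + 0.5) / n                 # fixed design, Lebesgue case

print(g(x_random).mean(), g(x_grid).mean(), exact)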

Assumption 5 is equivalent to Assumption (T) of Ghosal and Roy (2006). Since $\eta_{0}$ is uniformly bounded above and below, $p_{0}(x)=H(\eta_{0}(x))$ is bounded away from 0 and 1; conversely, if $p_{0}$ is bounded away from 0 and 1, then $\eta_{0}(x)=H^{-1}(p_{0}(x))$ is uniformly bounded above and below. For the Poisson regression model setup, Assumption 5 implies that $\|\lambda_{0}\|_{\infty}<\infty$.

Note that we do not require $p_{0}\in\Theta$ or $\lambda_{0}\in\Lambda$, thereby allowing model misspecification.

Observe that, similar to Pillai et al. (2007), we need the parameter space for Poisson regression to be bounded away from zero (Assumption 7). As pointed out in Pillai et al. (2007), we cannot bypass this requirement; it is not merely a device to simplify our proofs. Indeed, if almost all observations in a sample from a Poisson distribution are zero, then it is impossible to extract information about the (log) mean. Hence we must impose some condition keeping the mean function bounded away from zero. A similar argument applies to binary regression, which is reflected in Assumption 6.

It is important to remark that Assumptions 6 and 7 are needed only to validate Assumption (S6) of Shalizi, and are unnecessary elsewhere. The reasons are clarified in Remarks 1 and 2. Although many of our proofs would be simpler if Assumptions 6 and 7 were used, we reserve these assumptions only for validating Assumption (S6) of Shalizi.

To achieve Assumptions 6 and 7, we set, for all $x\in\Re$,

H(x)=\kappa_{B}\mathbb{I}_{\left\{G(x)\leq\kappa_{B}\right\}}(x)+G(x)\mathbb{I}_{\left\{\kappa_{B}<G(x)<1-\kappa_{B}\right\}}(x)+(1-\kappa_{B})\mathbb{I}_{\left\{G(x)\geq 1-\kappa_{B}\right\}}(x), (5.6)

for the binary case, where $0<\kappa_{B}<1/2$, and

H(x)=\kappa_{P}\mathbb{I}_{\left\{G(x)\leq\kappa_{P}\right\}}(x)+G(x)\mathbb{I}_{\left\{G(x)>\kappa_{P}\right\}}(x), (5.7)

for the Poisson case, where $\kappa_{P}>0$. In (5.6), $G$ is a continuously differentiable distribution function on $\Re$, and in (5.7), $G$ is a non-negative continuously differentiable function on $\Re$.
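For concreteness, (5.6) amounts to clipping $G$ into $[\kappa_{B},1-\kappa_{B}]$ and (5.7) to flooring $G$ at $\kappa_{P}$. The sketch below is an illustration with our own choices of $G$, $\kappa_{B}$ and $\kappa_{P}$.

# Sketch of the truncated link functions (5.6) and (5.7), which enforce
# Assumptions 6 and 7; G's, kappa_B and kappa_P below are illustrative.
import numpy as np
from scipy.stats import norm

def H_binary(x, kappa_B=0.05, G=norm.cdf):
    # (5.6): keep G(x) inside [kappa_B, 1 - kappa_B]
    return np.clip(G(x), kappa_B, 1.0 - kappa_B)

def H_poisson(x, kappa_P=0.1, G=np.exp):
    # (5.7): replace G(x) by kappa_P whenever G(x) <= kappa_P
    return np.maximum(G(x), kappa_P)

x = np.linspace(-5, 5, 11)
print(H_binary(x).round(3))
print(H_poisson(x).round(3))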

6 Main results on posterior convergence

Here we state a summary of our main results regarding posterior convergence of nonparametric binary and Poisson regression. The key results associated with the asymptotic equipartition property are provided in Theorems 1–4, the proofs of which are given in Appendix B (for binary regression) and Appendix C (for Poisson regression).

Theorem 1.

Let $Q$ and the counting measure $\mathfrak{C}$ on $\{0,1\}$ be the measures associated with the random variable $X$ and the binary random variable $Y$, respectively. Denote $E_{\mathbf{X},\mathbf{Y}}(\cdot)=\int\int\cdot\ d\mathfrak{C}\ dQ$ and $E_{\mathbf{X}}(\cdot)=\int\cdot\ dQ$. Then under the nonparametric binary regression model and Assumption 4, the KL divergence rate $h(p)$ exists for $p\in\Theta$, and is given by

h(p)=\left[E_{\mathbf{X}}\left(p_{0}(\mathbf{X})\log\left\{\dfrac{p_{0}(\mathbf{X})}{p(\mathbf{X})}\right\}\right)+E_{\mathbf{X}}\left((1-p_{0}(\mathbf{X}))\log\left\{\dfrac{\left(1-p_{0}(\mathbf{X})\right)}{\left(1-p(\mathbf{X})\right)}\right\}\right)\right]. (6.1)

Alternatively, $h(p)$ admits the following form:

h(p)=E_{\mathbf{X},\mathbf{Y}}\left(f_{0}(\mathbf{X},\mathbf{Y})\log\left\{\dfrac{f_{0}(\mathbf{X},\mathbf{Y})}{f(\mathbf{X},\mathbf{Y})}\right\}\right), (6.2)

where $f$ and $f_{0}$ are as defined in (3.4) and (3.5).

Theorem 2.

Let $Q$ and the counting measure $\mathfrak{C}$ on $\mathbb{N}$ be associated with the random variable $X$ and the count random variable $Y$, respectively. Denote $E_{\mathbf{X},\mathbf{Y}}(\cdot)=\int\int\cdot\ d\mathfrak{C}\ dQ$ and $E_{\mathbf{X}}(\cdot)=\int\cdot\ dQ$. Then under the nonparametric Poisson regression model and Assumption 4, the KL divergence rate $h(\lambda)$ exists for $\lambda\in\Lambda$, and is given by

h(\lambda)=\left[E_{\mathbf{X}}\left(\lambda(\mathbf{X})-\lambda_{0}(\mathbf{X})\right)+E_{\mathbf{X}}\left(\lambda_{0}(\mathbf{X})\log\left\{\dfrac{\lambda_{0}(\mathbf{X})}{\lambda(\mathbf{X})}\right\}\right)\right]. (6.3)
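Since the KL divergence rates (6.1) and (6.3) are expectations over $X\sim Q$ only, they are easy to approximate by Monte Carlo. The sketch below is illustrative; the particular $p_{0}$, $p$, $\lambda_{0}$, $\lambda$ and $Q=\mbox{Uniform}(0,1)$ are our own choices.

# Monte Carlo sketch of the KL divergence rates (6.1) and (6.3),
# averaging over draws of X ~ Q; all functional forms are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, size=200_000)            # X ~ Q = Uniform(0, 1)

# Binary case, equation (6.1)
p0, p = norm.cdf(np.sin(3 * x)), norm.cdf(1.0 - x)
h_p = np.mean(p0 * np.log(p0 / p) + (1 - p0) * np.log((1 - p0) / (1 - p)))

# Poisson case, equation (6.3)
lam0, lam = 1.0 + x**2, np.exp(0.5 * x)
h_lam = np.mean((lam - lam0) + lam0 * np.log(lam0 / lam))

print(h_p, h_lam)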
Theorem 3.

Under the nonparametric binary regression model and Assumption 4, the asymptotic equipartition property holds and is given by

\lim_{n\rightarrow\infty}\frac{1}{n}\log\left[R_{n}(p)\right]=-h(p). (6.4)

The convergence is uniform on any compact subset of $\Theta$.

Theorem 4.

Under the nonparametric Poisson regression model and Assumption 4, the asymptotic equipartition property holds and is given by

\lim_{n\rightarrow\infty}\frac{1}{n}\log\left[R_{n}(\lambda)\right]=-h(\lambda). (6.5)

The convergence is uniform on any compact subset of $\Lambda$.

Theorems 1 and 3 for binary regression and Theorems 2 and 4 for Poisson regression ensure that conditions (S1)–(S3) of Shalizi (2009) hold, and (S4) holds for both binary and Poisson regression because of compactness of $\mathfrak{X}$ and continuity of $H$ and $\eta$. The detailed proofs are presented in Appendix B.4 and Appendix C.4, respectively.

We construct the sieves $\mathcal{G}_{n}$ for the binary regression model setup as follows:

\mathcal{G}_{n}=\left\{\eta\in\mathcal{C}^{\prime}(\mathfrak{X}):\|\eta\|\leq\exp\left(\left(\beta n\right)^{1/4}\right),\|\eta_{j}^{\prime}\|\leq\exp\left(\left(\beta n\right)^{1/4}\right);j=1,2,\ldots,d\right\}. (6.6)

It follows that $\mathcal{G}_{n}\rightarrow\Theta$ as $n\rightarrow\infty$, where the parameter space $\Theta$ is given by (3.6).

In a similar manner, we construct the sieves $\mathbb{G}_{n}$ for Poisson regression as follows:

\mathbb{G}_{n}=\left\{\lambda(\cdot):\lambda(x)=H(\eta(x)),\eta\in\mathcal{C}^{\prime}(\mathfrak{X}),\|\eta\|\leq\exp\left(\left(\beta n\right)^{1/4}\right),\|\eta_{j}^{\prime}\|\leq\exp\left(\left(\beta n\right)^{1/4}\right);j=1,2,\ldots,d\right\}. (6.7)

Then it similarly follows that $\mathbb{G}_{n}\rightarrow\Lambda$ as $n\rightarrow\infty$, where the parameter space $\Lambda$ is given by (4.1).

Assumption 3 ensures that for binary regression, $\Pi\left(\mathcal{G}^{c}_{n}\right)\leq\alpha\exp(-\beta n)$ for some $\alpha>0$, and similarly $\Pi\left(\mathbb{G}^{c}_{n}\right)\leq\alpha\exp(-\beta n)$ for Poisson regression. These results, continuity of $h(p)$ and $h(\lambda)$ (the proofs of continuity follow using the same techniques as in Appendices B.1 and C.1), compactness of $\mathcal{G}_{n}$ and $\mathbb{G}_{n}$, and the uniform convergence results of Theorems 3 and 4 together ensure (S5) for both model setups.

Now, as pointed out in Chatterjee and Bhattacharya (2020), we observe that the aim of assumption (S6) is to ensure that (see the proof of Lemma 7 of Shalizi (2009)), for every $\epsilon>0$ and for all sufficiently large $n$,

\dfrac{1}{n}\log\int_{\mathcal{G}_{n}}R_{n}(p)\ d\pi(p)\leq-h(\mathcal{G}_{n})+\epsilon,\ \ \text{almost surely}. (6.8)

Since $h(\mathcal{G}_{n})\rightarrow h(\Theta)$ as $n\rightarrow\infty$, it is enough to verify that for every $\epsilon>0$ and for all sufficiently large $n$,

\dfrac{1}{n}\log\int_{\mathcal{G}_{n}}R_{n}(p)\ d\pi(p)\leq-h(\Theta)+\epsilon,\ \ \text{almost surely}. (6.9)

First we observe that

\dfrac{1}{n}\log\int_{\mathcal{G}_{n}}R_{n}(p)\ d\pi(p)\leq\dfrac{1}{n}\sup_{p\in\mathcal{G}_{n}}\log R_{n}(p). (6.10)

For large enough $\kappa>h(\Theta)$, consider $S=\{p:h(p)\leq\kappa\}$.

Lemma 1.

$S=\{p:h(p)\leq\kappa\}$ is a compact set.

Proof.

First recall that the proof of continuity of $h(p)$ in $p$ follows easily using the same techniques as in Appendix B.1.

Now note that if $\|\eta\|_{\infty}\rightarrow\infty$, then there exists $\mathcal{X}\subseteq\mathfrak{X}$ such that either $E_{\mathbf{X}}\left[p_{0}(\mathbf{X})\log\left(\frac{p_{0}(\mathbf{X})}{p(\mathbf{X})}\right)I_{\mathcal{X}}\right]\rightarrow\infty$ or $E_{\mathbf{X}}\left[(1-p_{0}(\mathbf{X}))\log\left(\frac{1-p_{0}(\mathbf{X})}{1-p(\mathbf{X})}\right)I_{\mathcal{X}}\right]\rightarrow\infty$. Hence, $h(p)\rightarrow\infty$ as $\|\eta\|_{\infty}\rightarrow\infty$; that is, $h(p)$ is a coercive function.

Since $h(p)$ is continuous and coercive, it follows that $S$ is a compact set. ∎

In a very similar manner, the following lemma holds for the Poisson model setup.

Lemma 2.

$S=\{\lambda:h(\lambda)\leq\kappa\}$ is a compact set.

Proof.

Again, recall that continuity of $h(\lambda)$ in $\lambda$ can be shown using the same techniques as in Appendix C.1, and it is easily seen that if $\|\eta\|_{\infty}\rightarrow\infty$, then $h(\lambda)\rightarrow\infty$. Thus, $h(\lambda)$ is continuous and coercive, ensuring that $S$ is compact. ∎

Using compactness of $S$, in the same way as in Chatterjee and Bhattacharya (2020), condition (S6) of Shalizi can be shown to be equivalent to (6.11) and (6.12) in Theorems 5 and 6 below, corresponding to the binary and Poisson cases, respectively. In the supplement we show that these equivalent conditions are satisfied in our model setups.

Theorem 5.

For the binary regression setup, (S6) is equivalent to the following, which holds under Assumptions 1–6:

\sum_{n=1}^{\infty}\int_{S^{c}}P\left(\left|\dfrac{1}{n}\log R_{n}(p)+h(p)\right|>\kappa-h(\Theta)\right)\ d\pi(p)<\infty. (6.11)

Theorem 6.

For the Poisson regression model setup, (S6) is equivalent to the following, which holds under Assumptions 1–5 and 7:

\sum_{n=1}^{\infty}\int_{S^{c}}P\left(\left|\dfrac{1}{n}\log R_{n}(\lambda)+h(\lambda)\right|>\kappa-h(\Lambda)\right)\ d\pi(\lambda)<\infty. (6.12)

Assumption (S7) of Shalizi also holds for both model setups because of continuity of $h(p)$ and $h(\lambda)$. Hence, all the assumptions (S1)–(S7) stated in Appendix A are satisfied for the binary and Poisson regression setups.

Overall, our results lead to the following theorems.

Theorem 7.

Assume the nonparametric binary regression setup. Then under Assumptions 1–6, for any measurable set $A\subseteq\Theta$ with $\pi(A)>0$ and $h(A)>h(\Theta)$,

\lim_{n\rightarrow\infty}\pi(A|\mathbf{Y}_{n})=0. (6.13)

Also, for any measurable set $A$ with $\pi(A)>0$, if $\beta>2h(A)$, where $h$ is given by equation (6.1), or if $A\subset\bigcap_{k=n}^{\infty}\mathcal{G}_{k}$ for some $n$, where $\mathcal{G}_{k}$ is given by (6.6), then the following hold:

  1. (i)
    \lim_{n\rightarrow\infty}\frac{1}{n}\log\left[\pi(A|\mathbf{Y}_{n})\right]=-J(A), (6.14)
  2. (ii)
    h(A)>h(\Theta),\pi(A)>0\ \ \Rightarrow\ \ \lim_{n\rightarrow\infty}\pi\left(A|\mathbf{Y}_{n}\right)=0. (6.15)
Theorem 8.

Assume the nonparametric Poisson regression setup. Then under Assumptions 1–5 and 7, for any measurable set $A\subseteq\Lambda$ with $\pi(A)>0$ and $h(A)>h(\Lambda)$,

\lim_{n\rightarrow\infty}\pi(A|\mathbf{Y}_{n})=0. (6.16)

Also, for any measurable set $A$ with $\pi(A)>0$, if $\beta>2h(A)$, where $h$ is given by equation (6.3), or if $A\subset\bigcap_{k=n}^{\infty}\mathbb{G}_{k}$ for some $n$, where $\mathbb{G}_{k}$ is given by (6.7), then the following hold:

  1. (i)
    \lim_{n\rightarrow\infty}\frac{1}{n}\log\left[\pi(A|\mathbf{Y}_{n})\right]=-J(A), (6.17)
  2. (ii)
    h(A)>h(\Lambda),\pi(A)>0\ \ \Rightarrow\ \ \lim_{n\rightarrow\infty}\pi\left(A|\mathbf{Y}_{n}\right)=0. (6.18)

7 Rate of convergence

Consider a sequence of positive reals $\epsilon_{n}$ such that $\epsilon_{n}\rightarrow 0$ while $n\epsilon_{n}\rightarrow\infty$ as $n\rightarrow\infty$, and the set $N_{\epsilon_{n}}=\{p:h(p)\leq h(\Theta)+\epsilon_{n}\}$. Then the following result of Shalizi holds.

Theorem 9 (Shalizi (2009)).

Assume (S1) to (S7) of Appendix A. If for each $\delta>0$,

\tau\left(\mathcal{G}_{n}\cap N_{\epsilon_{n}}^{c},\delta\right)\leq n (7.1)

eventually almost surely, then almost surely the following holds:

\lim_{n\rightarrow\infty}\pi\left(N_{\epsilon_{n}}|\mathbf{Y}_{n}\right)=1. (7.2)

To investigate the rate of convergence in our cases (and also in the case of Chatterjee and Bhattacharya (2020)), it has been proved in Chatterjee and Bhattacharya (2020) that $\epsilon_{n}$, with $\epsilon_{n}\rightarrow 0$ and $n\epsilon_{n}\rightarrow\infty$ as $n\rightarrow\infty$, is the rate of convergence if the following hold:

\dfrac{1}{n}\log\int_{\mathcal{G}_{n}\cap N_{\epsilon_{n}}^{c}}R_{n}(p)\ d\pi(p)\leq-h(\Theta)+\epsilon, (7.3)
\dfrac{1}{n}\log\int_{\mathbb{G}_{n}\cap N_{\epsilon_{n}}^{c}}R_{n}(\lambda)\ d\pi(\lambda)\leq-h(\Lambda)+\epsilon, (7.4)

for any $\epsilon>0$ and all $n$ sufficiently large.

Following arguments similar to those of Chatterjee and Bhattacharya (2020), we find that the posterior rate of convergence with respect to the KL divergence is just slower than $n^{-1}$. To put it another way, it is just slower than $n^{-\frac{1}{2}}$ with respect to the Hellinger distance for the model setups we consider. Our results are formally stated in Theorem 10 for binary regression and in Theorem 11 for Poisson regression.
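As an illustration (our own example, not a prescription of the theory), a valid choice of the sequence is

\epsilon_{n}=\frac{\log n}{n},\qquad n\epsilon_{n}=\log n\rightarrow\infty,

so that the KL rate is just slower than $n^{-1}$; since the squared Hellinger distance is bounded above by the KL divergence, the corresponding Hellinger rate is $\sqrt{\epsilon_{n}}=n^{-1/2}\sqrt{\log n}$, just slower than $n^{-1/2}$.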

Theorem 10.

For the nonparametric binary regression setup, under Assumptions 1–6, $\lim_{n\rightarrow\infty}\pi\left(N_{\epsilon_{n}}|\mathbf{Y}_{n}\right)=1$ holds almost surely, where $N_{\epsilon_{n}}=\{p:h(p)\leq h(\Theta)+\epsilon_{n}\}$, $\epsilon_{n}\rightarrow 0$ and $n\epsilon_{n}\rightarrow\infty$ as $n\rightarrow\infty$.

Theorem 11.

For the nonparametric Poisson regression setup, under Assumptions 1–5 and 7, $\lim_{n\rightarrow\infty}\pi\left(N_{\epsilon_{n}}|\mathbf{Y}_{n}\right)=1$ holds almost surely, where $N_{\epsilon_{n}}=\{\lambda:h(\lambda)\leq h(\Lambda)+\epsilon_{n}\}$, $\epsilon_{n}\rightarrow 0$ and $n\epsilon_{n}\rightarrow\infty$ as $n\rightarrow\infty$.

8 Consequences of model misspecification

Suppose that the true function $\eta_{0}$ has a countable number of discontinuities but has continuous first-order partial derivatives at all other points. Then $\eta_{0}\notin\mathcal{C}^{\prime}(\mathfrak{X})$. However, there exists some $\tilde{\eta}\in\mathcal{C}^{\prime}(\mathfrak{X})$ such that $\tilde{\eta}(x)=\eta_{0}(x)$ for all $x\in\mathfrak{X}$ where $\eta_{0}$ is continuous. A similar situation is mentioned in Chatterjee and Bhattacharya (2020). Observe that if the probability measure $Q$ of the $X_{i}$ is dominated by the Lebesgue measure, then from Theorem 1 we have $h(\Theta)=0$. Then the posterior of $\eta$ concentrates around $\tilde{\eta}$, which is the same as $\eta_{0}$ except at the countably many discontinuities of $\eta_{0}$. The corresponding $\tilde{p}=H(\tilde{\eta})$ and $\tilde{\lambda}=H(\tilde{\eta})$ will likewise differ from $p_{0}$ and $\lambda_{0}$ only at those points. If $p_{0}$ and $\lambda_{0}$ are such that $0<h(\Theta)<\infty$ and $0<h(\Lambda)<\infty$, respectively, then the posteriors concentrate around the minimizers of $h(p)$ and $h(\lambda)$, provided such minimizers exist in $\Theta$ and $\Lambda$, respectively.

8.1 Consequences from the subjective Bayesian perspective

Bayesian posterior consistency has two apparently different viewpoints, namely, classical and subjective. Bayesian analysis starts with prior knowledge and updates that knowledge given the data, forming the posterior. It is of utmost importance to know whether the updated knowledge becomes more and more accurate and precise as data are collected indefinitely. This requirement is called consistency of the posterior distribution. From the classical Bayesian point of view we should believe in the existence of a true model. On the contrary, from the subjective Bayesian viewpoint, we need not believe in true models; a subjective Bayesian thinks only in terms of the predictive distribution of future observations. But Blackwell and Dubins (1962) and Diaconis and Freedman (1986) have shown that consistency is equivalent to intersubjective agreement, which means that two Bayesians will ultimately have very close posterior predictive distributions.

Let us define the one-step-ahead predictive distributions of $p$ and $\lambda$, the one-step-ahead best predictor (which is the best prediction one could make had the true model $P$ been known) and the posterior predictive distribution (Shalizi (2009)), with the convention that $n=1$ gives the marginal distribution of the first observation, as follows:

  • (One-step-ahead predictive distribution of $p$): $F_{p}^{n}=F_{p}\left(Y_{n}|Y_{1},\ldots,Y_{n-1}\right)$,

  • (One-step-ahead predictive distribution of $\lambda$): $F_{\lambda}^{n}=F_{\lambda}\left(Y_{n}|Y_{1},\ldots,Y_{n-1}\right)$,

  • (One-step-ahead best predictor): $P^{n}=P^{n}\left(Y_{n}|Y_{1},\ldots,Y_{n-1}\right)$,

  • (The posterior predictive distribution): $F_{\pi}^{n}=\int F_{p}^{n}\ d\pi(p|\mathbf{Y}_{n})$.

With the above definitions, the following results have been proved by Shalizi.

Theorem 12 (Shalizi (2009)).

Let $\rho_{H}$ and $\rho_{TV}$ be the Hellinger and total variation metrics, respectively. Then with probability 1,

\limsup_{n\rightarrow\infty}\rho_{H}^{2}\left(P^{n},F_{\pi}^{n}\right)\leq h(\Theta);
\limsup_{n\rightarrow\infty}\rho_{TV}^{2}\left(P^{n},F_{\pi}^{n}\right)\leq 4h(\Theta).

In our nonparametric setup, $h(\Theta)=0$ and $h(\Lambda)=0$ if $\eta_{0}$ has only a countable number of discontinuities. Hence, from Theorem 12 it is clear that, in spite of such misspecification, the posterior predictive distribution does a good job of learning the best possible predictive distribution in terms of the popular Hellinger and total variation distances. We state our result formally as follows.

Theorem 13.

Consider the setups of nonparametric binary and Poisson regression. Assume that the true function $\eta_{0}$ has a countable number of discontinuities but has continuous first-order partial derivatives at all other points. Then under Assumptions 1–6 (for binary regression) or under Assumptions 1–5 and 7 (for Poisson regression), the following hold:

\limsup_{n\rightarrow\infty}\rho_{H}^{2}\left(P^{n},F_{\pi}^{n}\right)=0;
\limsup_{n\rightarrow\infty}\rho_{TV}^{2}\left(P^{n},F_{\pi}^{n}\right)=0.

9 Conclusion and future work

In this paper we attempted to address posterior convergence of nonparametric binary and Poisson regression, along with the rate of convergence, while also allowing for misspecification, using the approach of Shalizi (2009). We have also shown that, even in the case of misspecification, the posterior predictive distribution can be quite accurate asymptotically, which should be of interest from the subjective Bayesian viewpoint. The asymptotic equipartition property plays a central role here; it is one of the crucial assumptions and yet relatively easy to establish under mild conditions. It brings forward the KL property of the posterior, which in turn characterizes posterior convergence, as well as the rate of posterior convergence and the effect of misspecification.

Appendix

Appendix A Assumptions and theorems of Shalizi

Following Shalizi (2009), let us consider a probability space $(\Omega,\mathcal{F},P)$ and a sequence of random variables $\{Y_{1},Y_{2},\ldots\}$ taking values in the measurable space $(\aleph,\mathcal{X})$, having infinite-dimensional distribution $P$. The theoretical development requires no restrictive assumptions on $P$ such as it being a product measure, Markovian, or exchangeable, thus paving the way for great generality.

Let $\mathcal{F}_{n}=\sigma(\mathbf{Y}_{n})$ denote the natural filtration, that is, the $\sigma$-algebra generated by $\mathbf{Y}_{n}$. Also, let the distributions of the processes adapted to $\mathcal{F}_{n}$ be denoted by $F_{\theta}$, where $\theta$ takes values in a measurable space $(\Theta,\mathcal{T})$. Here $\theta$ denotes the hypothesized probability measure associated with the unknown distribution of $\{Y_{1},Y_{2},\ldots\}$ and $\Theta$ is the set of hypothesized probability measures. In other words, assuming that $\theta$ is the infinite-dimensional distribution of the stochastic process $\{Y_{1},Y_{2},\ldots\}$, $F_{\theta}$ denotes the $n$-dimensional marginal distribution associated with $\theta$; $n$ is suppressed for ease of notation. For parametric models, the probability measure $\theta$ corresponds to some probability density with respect to some dominating measure (such as the Lebesgue or counting measure) and consists of an unknown, but finite, number of parameters. For nonparametric models, $\theta$ is usually associated with an infinite number of parameters and may not even have any density with respect to $\sigma$-finite measures.

As in Shalizi (2009), we assume that $P$ and all the $F_{\theta}$ are dominated by a common measure, with densities $p$ and $f_{\theta}$, respectively. In Shalizi (2009), and in our case, the assumption that $P\in\Theta$ is not required, so that all possible models are allowed to be misspecified. Indeed, Shalizi (2009) provides an example of such misspecification where the true model $P$ is not Markov but all the hypothesized models indexed by $\theta$ are $k$-th order stationary binary Markov models, for $k=1,2,\ldots$. As shown in Shalizi (2009), the results of posterior convergence hold even in the case of such misspecification, essentially because the true model can be approximated by the $k$-th order Markov models belonging to $\Theta$.

Given a prior $\pi$ on $\theta$, we assume that the posterior distributions $\pi(\cdot|\mathbf{Y}_{n})$ are dominated by a common measure for all $n>0$.

A.1 Assumptions

  • (S1)

    Letting $f_{\theta}(\mathbf{Y}_{n})$ be the likelihood under parameter $\theta$ and $f_{\theta_{0}}(\mathbf{Y}_{n})$ be the likelihood under the true parameter $\theta_{0}$, given the true model $P$, consider the following likelihood ratio:

    R_{n}(\theta)=\frac{f_{\theta}(\mathbf{Y}_{n})}{f_{\theta_{0}}(\mathbf{Y}_{n})}. (A.1)

    Assume that $R_{n}(\theta)$ is $\mathcal{F}_{n}\times\mathcal{T}$-measurable for all $n>0$.

  • (S2)

    For every $\theta\in\Theta$, the KL divergence rate

    h(\theta)=\lim_{n\rightarrow\infty}\frac{1}{n}E\left[\log\left\{\frac{f_{\theta_{0}}(\mathbf{Y}_{n})}{f_{\theta}(\mathbf{Y}_{n})}\right\}\right] (A.2)

    exists (possibly being infinite) and is $\mathcal{T}$-measurable. Note that in the $iid$ set-up, $h(\theta)$ reduces to the KL divergence between the true and the hypothesized model, so that (A.2) may be regarded as a generalized KL divergence measure.

  • (S3)

    For each $\theta\in\Theta$, the generalized or relative asymptotic equipartition property holds, and so, almost surely with respect to $P$,

    \lim_{n\rightarrow\infty}\frac{1}{n}\log\left[R_{n}(\theta)\right]=-h(\theta), (A.3)

    where $h(\theta)$ is given by (A.2).

    Intuitively, the terminology “asymptotic equipartition” refers to dividing up $\log\left[R_{n}(\theta)\right]$ into $n$ factors for large $n$ such that all the factors are asymptotically equal. Again, considering the $iid$ scenario helps clarify this point, as in this case each factor converges to the same KL divergence between the true and the postulated model. With this understanding, note that the purpose of condition (S3) is to ensure that, relative to the true distribution, the likelihood of each $\theta$ decreases to zero exponentially fast, with the rate being the KL divergence rate (A.3).

  • (S4)

    Let $I=\left\{\theta:h(\theta)=\infty\right\}$. The prior $\pi$ on $\theta$ satisfies $\pi(I)<1$. Failure of this assumption entails extreme misspecification of almost all the hypothesized models $f_{\theta}$ relative to the true model $p$. With such extreme misspecification, posterior consistency is not expected to hold.

  • (S5)

    There exists a sequence of sets $\mathcal{G}_{n}\rightarrow\Theta$ as $n\rightarrow\infty$ such that:

    1. $h\left(\mathcal{G}_{n}\right)\rightarrow h\left(\Theta\right)$, as $n\rightarrow\infty$.

    2. The following inequality holds for some $\alpha>0$, $\beta>2h(\Theta)$:

       \pi\left(\mathcal{G}_{n}\right)\geq 1-\alpha\exp\left(-\beta n\right);

    3. The convergence in (S3) is uniform in $\theta$ over $\mathcal{G}_{n}\setminus I$.

    The sets $\mathcal{G}_{n}$ can be loosely interpreted as sieves. The method of sieves is common in the Bayesian nonparametric approach; the behaviour of the likelihood ratio and the posterior on the sets $\mathcal{G}_{n}$ essentially carries over to $\Theta$. This can be anticipated from the first and second parts of the assumption, the second part ensuring in particular that the parts of $\Theta$ on which the log-likelihood ratio may be ill-behaved have exponentially small prior probabilities. The third part is more of a technical condition that is useful in proving posterior convergence through the sets $\mathcal{G}_{n}$. For further details, see Shalizi (2009).

For each measurable $A\subseteq\Theta$ and for every $\delta>0$, there exists a random natural number $\tau(A,\delta)$ such that

\frac{1}{n}\log\left[\int_{A}R_{n}(\theta)\pi(\theta)d\theta\right]\leq\delta+\underset{n\rightarrow\infty}{\lim\sup}~\frac{1}{n}\log\left[\int_{A}R_{n}(\theta)\pi(\theta)d\theta\right], (A.4)

for all $n>\tau(A,\delta)$, provided $\underset{n\rightarrow\infty}{\lim\sup}~\frac{1}{n}\log\left[\int_{A}R_{n}(\theta)\pi(\theta)d\theta\right]<\infty$. Regarding this, the following assumption has been made by Shalizi:

  • (S6)

    The sets $\mathcal{G}_{n}$ of (S5) can be chosen such that for every $\delta>0$, the inequality $n>\tau(\mathcal{G}_{n},\delta)$ holds almost surely for all sufficiently large $n$.

    To understand the essence of this assumption, note that for almost every data set $\{Y_{1},Y_{2},\ldots\}$ there exists $\tau(\mathcal{G}_{n},\delta)$ such that (A.4) holds with $A$ replaced by $\mathcal{G}_{n}$ for all $n>\tau(\mathcal{G}_{n},\delta)$. Since the $\mathcal{G}_{n}$ are sets with large enough prior probabilities, the assumption formalizes our expectation that $R_{n}(\theta)$ decays fast enough on $\mathcal{G}_{n}$ so that $\tau(\mathcal{G}_{n},\delta)$ is nearly stable, in the sense that it is not only finite but also not significantly different for different data sets when $n$ is large. See Shalizi (2009) for a more detailed explanation.

  • (S7)

    The sets $\mathcal{G}_{n}$ of (S5) and (S6) can be chosen such that for any set $A$ with $\pi(A)>0$,

    \lim_{n\rightarrow\infty}h\left(\mathcal{G}_{n}\cap A\right)=h(A). (A.5)

Under the above assumptions, Shalizi (2009) proved the following results.

Theorem 14 (Shalizi (2009)).

Consider assumptions (S1)–(S7) and any set $A\in\mathcal{T}$ with $\pi(A)>0$ and $h(A)>h(\Theta)$. Then,

\lim_{n\rightarrow\infty}\pi(A|\mathbf{Y}_{n})=0,~\mbox{almost surely}.

The rate of convergence of the log-posterior is given by the following result.

Theorem 15 (Shalizi (2009)).

Consider assumptions (S1)–(S7) and any set $A\in\mathcal{T}$ with $\pi(A)>0$. If $\beta>2h(A)$, where $\beta$ corresponds to assumption (S5), or if $A\subset\cap_{k=n}^{\infty}\mathcal{G}_{k}$ for some $n$, then

\lim_{n\rightarrow\infty}\frac{1}{n}\log\pi(A|\mathbf{Y}_{n})=-J(A),~\mbox{almost surely.}

Appendix B Verification of (S1) to (S7) for binary regression

B.1 Verification of (S1) for binary regression

Observe that

f_{p}(\mathbf{Y}_{n}|\mathbf{X}_{n})=\prod_{i=1}^{n}f(y_{i}|x_{i})=\prod_{i=1}^{n}p(x_{i})^{y_{i}}\left(1-p(x_{i})\right)^{1-y_{i}}, (B.1)
f_{p_{0}}(\mathbf{Y}_{n}|\mathbf{X}_{n})=\prod_{i=1}^{n}f_{0}(y_{i}|x_{i})=\prod_{i=1}^{n}p_{0}(x_{i})^{y_{i}}\left(1-p_{0}(x_{i})\right)^{1-y_{i}}. (B.2)

Therefore,

\frac{1}{n}\log R_{n}(p)=\frac{1}{n}\sum_{i=1}^{n}\left\{y_{i}\log\left(\frac{p(x_{i})}{p_{0}(x_{i})}\right)+(1-y_{i})\log\left(\frac{1-p(x_{i})}{1-p_{0}(x_{i})}\right)\right\}. (B.3)

To show measurability of $R_{n}(p)$, first note that for any $a\in\Re$,

\left\{(y_{i},\eta):y_{i}\log\left(\frac{p(x_{i})}{p_{0}(x_{i})}\right)+(1-y_{i})\log\left(\frac{1-p(x_{i})}{1-p_{0}(x_{i})}\right)<a\right\}=\left\{\eta:\log\left(\frac{p(x_{i})}{p_{0}(x_{i})}\right)<a\right\}\bigcup\left\{\eta:\log\left(\frac{1-p(x_{i})}{1-p_{0}(x_{i})}\right)<a\right\}. (B.4)

Note that for given $p$, there exists $0<\epsilon<1/2$ such that $\epsilon<p(x)<1-\epsilon$, for all $x\in\mathfrak{X}$. Now consider a sequence $\tilde{\eta}_{j}$, $j=1,2,\ldots$, such that $\|\tilde{\eta}_{j}-\eta\|_{\infty}\rightarrow 0$ as $j\rightarrow\infty$. Then, with $\tilde{p}_{j}(x)=H\left(\tilde{\eta}_{j}(x)\right)$, note that there exists $j_{0}\geq 1$ such that for $j\geq j_{0}$, $\epsilon<\tilde{p}_{j}(x)<1-\epsilon$, for all $x\in\mathfrak{X}$. Hence, using the inequality $1-\frac{1}{x}\leq\log x\leq x-1$ for $x>0$, we obtain $\left|\log\left(\frac{\tilde{p}_{j}(x_{i})}{p(x_{i})}\right)\right|\leq C\|\tilde{p}_{j}-p\|_{\infty}$ and $\left|\log\left(\frac{1-\tilde{p}_{j}(x_{i})}{1-p(x_{i})}\right)\right|\leq C\|\tilde{p}_{j}-p\|_{\infty}$, for some $C>0$, for all $x\in\mathfrak{X}$. Hence, for $j\geq j_{0}$,

\left|\log\left(\frac{\tilde{p}_{j}(x_{i})}{p_{0}(x_{i})}\right)-\log\left(\frac{p(x_{i})}{p_{0}(x_{i})}\right)\right|=\left|\log\left(\frac{\tilde{p}_{j}(x_{i})}{p(x_{i})}\right)\right|\leq C\|\tilde{p}_{j}-p\|_{\infty}. (B.5)

Now, since $H$ is continuously differentiable, using Taylor series expansion up to the first order we obtain

\|\tilde{p}_{j}-p\|_{\infty}=\underset{x\in\mathfrak{X}}{\sup}~\left|H\left(\tilde{\eta}_{j}(x)\right)-H\left(\eta(x)\right)\right|
\leq\underset{x\in\mathfrak{X}}{\sup}~\left|H^{\prime}\left(u(\tilde{\eta}_{j}(x),\eta(x))\right)\right|\|\tilde{\eta}_{j}-\eta\|_{\infty}, (B.6)

where $u(\tilde{\eta}_{j}(x),\eta(x))$ lies between $\eta(x)$ and $\tilde{\eta}_{j}(x)$. Since $\|\tilde{\eta}_{j}-\eta\|_{\infty}\rightarrow 0$ as $j\rightarrow\infty$, it follows from (B.6) that $\|\tilde{p}_{j}-p\|_{\infty}\rightarrow 0$ as $j\rightarrow\infty$. This again implies, thanks to (B.5), that $\left|\log\left(\frac{\tilde{p}_{j}(x_{i})}{p_{0}(x_{i})}\right)-\log\left(\frac{p(x_{i})}{p_{0}(x_{i})}\right)\right|\rightarrow 0$ as $j\rightarrow\infty$.

In other words, $\log\left(\frac{p(x_{i})}{p_{0}(x_{i})}\right)$ is continuous in $\eta$, and hence $\left\{\eta:\log\left(\frac{p(x_{i})}{p_{0}(x_{i})}\right)<a\right\}$ of (B.4) is measurable. Similarly, $\log\left(\frac{1-p(x_{i})}{1-p_{0}(x_{i})}\right)$ is also continuous in $\eta$, so that $\left\{\eta:\log\left(\frac{1-p(x_{i})}{1-p_{0}(x_{i})}\right)<a\right\}$ is also measurable. Hence, the individual terms in (B.3) are measurable. Since sums of measurable functions are measurable, it follows that $\log R_{n}(p)$, and hence $R_{n}(p)$, is measurable.

B.2 Verification of (S2) for binary regression

For every $p\in\Theta$, we need to show that the KL divergence rate

h(p)=\underset{n\rightarrow\infty}{\lim}~\frac{1}{n}E_{p_{0}}\left[\log\left\{\frac{f_{p_{0}}(\mathbf{Y}_{n}|\mathbf{X}_{n})}{f_{p}(\mathbf{Y}_{n}|\mathbf{X}_{n})}\right\}\right]=\underset{n\rightarrow\infty}{\lim}~\frac{1}{n}E_{p_{0}}\left[-\log\left\{R_{n}(p)\right\}\right]

exists (possibly being infinite) and is $\mathcal{T}$-measurable.

Now,

\displaystyle\frac{1}{n}\log R_{n}(p)=\frac{1}{n}\sum_{i=1}^{n}\left\{y_{i}\log p(x_{i})+(1-y_{i})\log\left(1-p(x_{i})\right)\right\} (B.7)
\displaystyle\qquad\qquad\qquad-\frac{1}{n}\sum_{i=1}^{n}\left\{y_{i}\log p_{0}(x_{i})+(1-y_{i})\log\left(1-p_{0}(x_{i})\right)\right\}.

Therefore,

\displaystyle\frac{1}{n}E_{p_{0}}\left[-\log\left\{R_{n}(p)\right\}\right]=\frac{1}{n}\sum_{i=1}^{n}\left\{p_{0}(x_{i})\log p_{0}(x_{i})+(1-p_{0}(x_{i}))\log\left(1-p_{0}(x_{i})\right)\right\} (B.8)
\displaystyle\qquad\qquad-\frac{1}{n}\sum_{i=1}^{n}\left\{p_{0}(x_{i})\log p(x_{i})+(1-p_{0}(x_{i}))\log\left(1-p(x_{i})\right)\right\}.
\displaystyle\underset{n\rightarrow\infty}{\lim}~\frac{1}{n}E_{p_{0}}\left[-\log\left\{R_{n}(p)\right\}\right]=\underset{n\rightarrow\infty}{\lim}~\frac{1}{n}\sum_{i=1}^{n}\left\{p_{0}(x_{i})\log p_{0}(x_{i})+(1-p_{0}(x_{i}))\log\left(1-p_{0}(x_{i})\right)\right\}
\displaystyle\qquad\qquad-\underset{n\rightarrow\infty}{\lim}~\frac{1}{n}\sum_{i=1}^{n}\left\{p_{0}(x_{i})\log p(x_{i})+(1-p_{0}(x_{i}))\log\left(1-p(x_{i})\right)\right\}
\displaystyle\qquad=E_{\mathbf{X}}\left\{p_{0}(\mathbf{X})\log p_{0}(\mathbf{X})+(1-p_{0}(\mathbf{X}))\log\left(1-p_{0}(\mathbf{X})\right)\right\}
\displaystyle\qquad\qquad-E_{\mathbf{X}}\left\{p_{0}(\mathbf{X})\log p(\mathbf{X})+(1-p_{0}(\mathbf{X}))\log\left(1-p(\mathbf{X})\right)\right\}. (B.9)

The last line follows from Assumption 4 and the SLLN. Here $E_{\mathbf{X}}(\cdot)=\int_{\mathfrak{X}}\cdot\ dQ$.

Hence,

h(p)=\left[E_{\mathbf{X}}\left(p_{0}(\mathbf{X})\log\left\{\dfrac{p_{0}(\mathbf{X})}{p(\mathbf{X})}\right\}\right)+E_{\mathbf{X}}\left((1-p_{0}(\mathbf{X}))\log\left\{\dfrac{1-p_{0}(\mathbf{X})}{1-p(\mathbf{X})}\right\}\right)\right]. (B.10)
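To fix ideas, the following is a minimal numerical sketch, outside the formal development, that approximates the KL divergence rate (B.10) by Monte Carlo. The uniform covariate distribution $Q$, the logistic link $H$, and the particular functions playing the roles of $\eta_{0}$ and $\eta$ are illustrative assumptions only.

    import numpy as np

    rng = np.random.default_rng(0)

    def H(z):
        # Logistic link, adopted here purely for illustration.
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical choices standing in for eta_0 (truth) and eta (a candidate).
    eta0 = lambda x: np.sin(2 * np.pi * x)
    eta = lambda x: 0.8 * np.sin(2 * np.pi * x) + 0.1

    def h_binary(eta, eta0, n_mc=10**6):
        # Monte Carlo approximation of (B.10) with Q = Uniform(0,1).
        x = rng.uniform(size=n_mc)
        p, p0 = H(eta(x)), H(eta0(x))
        kl = p0 * np.log(p0 / p) + (1 - p0) * np.log((1 - p0) / (1 - p))
        return kl.mean()

    print(h_binary(eta, eta0))  # approximately h(p); equals 0 iff p = p0 almost everywhere [Q]

The returned value is simply the average of the integrand in (B.10) over draws from $Q$, and is nonnegative by the usual properties of KL divergence.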

B.3 Verification of (S3) for binary regression

Here we need to verify the asymptotic equipartition property, that is, almost surely with respect to $P$,

\underset{n\rightarrow\infty}{\lim}~\frac{1}{n}\log\left[R_{n}(p)\right]=-h(p)=\underset{n\rightarrow\infty}{\lim}~\frac{1}{n}E\left[\log\left\{\frac{f_{p}(\mathbf{Y}_{n}|\mathbf{X}_{n})}{f_{p_{0}}(\mathbf{Y}_{n}|\mathbf{X}_{n})}\right\}\right]. (B.11)

Observe that,

\displaystyle\frac{1}{n}\log R_{n}(p)=\frac{1}{n}\sum_{i=1}^{n}\left\{y_{i}\log p(x_{i})+(1-y_{i})\log\left(1-p(x_{i})\right)\right\}
\displaystyle\qquad\qquad\qquad-\frac{1}{n}\sum_{i=1}^{n}\left\{y_{i}\log p_{0}(x_{i})+(1-y_{i})\log\left(1-p_{0}(x_{i})\right)\right\}.

By rearranging the terms we get,

\displaystyle-\frac{1}{n}\log R_{n}(p)=\frac{1}{n}\sum_{i=1}^{n}\left\{y_{i}\log\left(\dfrac{p_{0}(x_{i})}{p(x_{i})}\right)+(1-y_{i})\log\left(\dfrac{1-p_{0}(x_{i})}{1-p(x_{i})}\right)\right\}.

Using the inequality $1-\frac{1}{x}\leq\log x\leq x-1$ for $x>0$, compactness of $\mathfrak{X}$, and continuity of $p(x)$ in $x\in\mathfrak{X}$ for given $p\in\Theta$, $\left|\log\left(\frac{p_{0}(x_{i})}{p(x_{i})}\right)\right|\leq C\|p-p_{0}\|_{\infty}$ and $\left|\log\left(\frac{1-p_{0}(x_{i})}{1-p(x_{i})}\right)\right|\leq C\|p-p_{0}\|_{\infty}$, for some $C>0$. Hence,

\displaystyle\sum_{i=1}^{\infty}i^{-2}\mathrm{Var}\left[y_{i}\log\left(\dfrac{p_{0}(x_{i})}{p(x_{i})}\right)+(1-y_{i})\log\left(\dfrac{1-p_{0}(x_{i})}{1-p(x_{i})}\right)\right] (B.12)
\displaystyle=\sum_{i=1}^{\infty}i^{-2}p_{0}(x_{i})(1-p_{0}(x_{i}))\left\{\left[\log\left(\dfrac{p_{0}(x_{i})}{p(x_{i})}\right)\right]^{2}+\left[\log\left(\dfrac{1-p_{0}(x_{i})}{1-p(x_{i})}\right)\right]^{2}-2\log\left(\dfrac{p_{0}(x_{i})}{p(x_{i})}\right)\log\left(\dfrac{1-p_{0}(x_{i})}{1-p(x_{i})}\right)\right\}
\displaystyle\leq 4C^{2}\|p_{0}\|_{\infty}\|p-p_{0}\|^{2}_{\infty}\sum_{i=1}^{\infty}i^{-2}
\displaystyle<\infty. (B.13)

Observe that $y_{i}$ are observations from independent random variables. Hence, by Kolmogorov's SLLN for independent random variables,

\displaystyle-\frac{1}{n}\log R_{n}(p)=\frac{1}{n}\sum_{i=1}^{n}\left\{y_{i}\log\left(\dfrac{p_{0}(x_{i})}{p(x_{i})}\right)+(1-y_{i})\log\left(\dfrac{1-p_{0}(x_{i})}{1-p(x_{i})}\right)\right\}
\displaystyle\rightarrow\left[E_{\mathbf{X}}\left(p_{0}(\mathbf{X})\log\left\{\dfrac{p_{0}(\mathbf{X})}{p(\mathbf{X})}\right\}\right)+E_{\mathbf{X}}\left((1-p_{0}(\mathbf{X}))\log\left\{\dfrac{1-p_{0}(\mathbf{X})}{1-p(\mathbf{X})}\right\}\right)\right]=h(p),

almost surely, as $n\rightarrow\infty$.
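The convergence just established can also be visualised by simulation; the sketch below, with the same illustrative link and hypothetical $\eta_{0}$, $\eta$ as before, tracks the running average $-\frac{1}{n}\log R_{n}(p)$, which should stabilise near $h(p)$ by Kolmogorov's SLLN.

    import numpy as np

    rng = np.random.default_rng(1)
    H = lambda z: 1.0 / (1.0 + np.exp(-z))              # illustrative link
    eta0 = lambda x: np.sin(2 * np.pi * x)              # hypothetical truth
    eta = lambda x: 0.8 * np.sin(2 * np.pi * x) + 0.1   # hypothetical candidate

    n = 200_000
    x = rng.uniform(size=n)             # covariates drawn from Q = Uniform(0,1)
    p0, p = H(eta0(x)), H(eta(x))
    y = rng.binomial(1, p0)             # binary responses from the true model

    # Summands of -(1/n) log R_n(p); cf. (B.3) with the sign reversed.
    terms = y * np.log(p0 / p) + (1 - y) * np.log((1 - p0) / (1 - p))
    running = np.cumsum(terms) / np.arange(1, n + 1)
    print(running[999], running[-1])    # the running average settles near h(p)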

B.4 Verification of (S4) for binary regression

If $I=\{p:\ h(p)=\infty\}$ then we need to show $\Pi(I)<1$. Note that due to compactness of $\mathfrak{X}$ and continuity of $H$ and $\eta$, given $\eta\in\Theta$, $p$ is bounded away from $0$ and $1$. Hence, $h(p)\leq\|p-p_{0}\|_{\infty}\times\left(\frac{1}{\underset{x\in\mathfrak{X}}{\inf}~p(x)}+\frac{1}{1-\underset{x\in\mathfrak{X}}{\sup}~p(x)}\right)<\infty$, almost surely. In other words, (S4) holds.

B.5 Verification of (S5) for binary regression

In our model, the parameter space is $\Theta=\mathcal{C}^{\prime}(\mathfrak{X})$. We need to show that there exists a sequence of sets $\mathcal{G}_{n}\rightarrow\Theta$ as $n\rightarrow\infty$ such that:

  1. $h\left(\mathcal{G}_{n}\right)\rightarrow h\left(\Theta\right)$, as $n\rightarrow\infty$.

  2. The inequality $\pi\left(\mathcal{G}_{n}\right)\geq 1-\alpha\exp\left(-\beta n\right)$ holds for some $\alpha>0$, $\beta>2h(\Theta)$.

  3. The convergence in (S3) is uniform in $p$ over $\mathcal{G}_{n}\setminus I$.

We shall work with the following sequence of sieve sets considered in Chatterjee and Bhattacharya (2020): for $n\geq 1$,

\displaystyle\mathcal{G}_{n}=\left\{\eta\in\mathcal{C}^{\prime}(\mathfrak{X}):\ \|\eta\|_{\infty}\leq\exp\left((\beta n)^{1/4}\right),\ \|\eta_{j}^{\prime}\|_{\infty}\leq\exp\left((\beta n)^{1/4}\right);\ j=1,2,\ldots,d\right\}. (B.14)

Then $\mathcal{G}_{n}\rightarrow\mathcal{C}^{\prime}(\mathfrak{X})$ as $n\rightarrow\infty$ (Chatterjee and Bhattacharya (2020)).
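For intuition only, membership of a given $\eta$ in the sieve (B.14) can be checked numerically by comparing grid-based sup-norms of $\eta$ and its derivatives with the threshold $\exp\left((\beta n)^{1/4}\right)$; the one-dimensional example and the value of $\beta$ below are hypothetical.

    import numpy as np

    def in_sieve(eta, eta_prime, n, beta=1.0, grid_size=10_000):
        # Grid-based check of (B.14) for d = 1 on [0,1]; purely illustrative.
        x = np.linspace(0.0, 1.0, grid_size)
        bound = np.exp((beta * n) ** 0.25)
        return max(np.abs(eta(x)).max(), np.abs(eta_prime(x)).max()) <= bound

    eta = lambda x: np.sin(2 * np.pi * x)                    # hypothetical eta
    eta_prime = lambda x: 2 * np.pi * np.cos(2 * np.pi * x)
    print([in_sieve(eta, eta_prime, n) for n in (1, 5, 50)])  # True once the threshold exceeds 2*pi

This reflects the non-decreasing nature of $\mathcal{G}_{n}$: any fixed $\eta\in\mathcal{C}^{\prime}(\mathfrak{X})$ with bounded derivatives eventually falls inside the sieve.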

B.5.1 Verification of (S5) (1)

We now verify that $h\left(\mathcal{G}_{n}\right)\rightarrow h\left(\Theta\right)$, as $n\rightarrow\infty$. Observe that:

h(p)=\left[E_{\mathbf{X}}\left(p_{0}(\mathbf{X})\log\left\{\dfrac{p_{0}(\mathbf{X})}{p(\mathbf{X})}\right\}\right)+E_{\mathbf{X}}\left((1-p_{0}(\mathbf{X}))\log\left\{\dfrac{1-p_{0}(\mathbf{X})}{1-p(\mathbf{X})}\right\}\right)\right]. (B.15)

Recall that $h(p)$ is continuous in $p$ and $p$ is continuous in $\eta$, which follows from (B.6). Hence, continuity of $h(p)$ and compactness of $\mathcal{G}_{n}$, along with its non-decreasing nature with respect to $n$, imply that $h\left(\mathcal{G}_{n}\right)\rightarrow h\left(\Theta\right)$, as $n\rightarrow\infty$.

B.5.2 Verification of (S5) (2)

\displaystyle\pi(\mathcal{G}_{n})=\pi\left(\|\eta\|_{\infty}\leq\exp\left((\beta n)^{1/4}\right),\ \|\eta_{j}^{\prime}\|_{\infty}\leq\exp\left((\beta n)^{1/4}\right);\ j=1,2,\ldots,d\right)
\displaystyle\qquad\geq 1-\Pi\left(\|\eta\|_{\infty}>\exp\left((\beta n)^{1/4}\right)\right)-\sum_{j=1}^{d}\Pi\left(\|\eta_{j}^{\prime}\|_{\infty}>\exp\left((\beta n)^{1/4}\right)\right)
\displaystyle\qquad\geq 1-\left(c_{\eta}+\sum_{j=1}^{d}c_{\eta_{j}^{\prime}}\right)\exp(-\beta n),

where the last inequality follows from Assumption 3.

B.5.3 Verification of (S5) (3)

We need to show that the convergence in (S3) is uniform in $p$ over $\mathcal{G}_{n}\setminus I$, where $I=\{p:\ h(p)=\infty\}$ as in subsection B.4. In our case, $I=\emptyset$. Hence, we need to show uniform convergence in (S3) in $p$ over $\mathcal{G}_{n}$. We need to establish that $\mathcal{G}_{n}$ is compact, but this has already been shown by Chatterjee and Bhattacharya (2020). In a nutshell, Chatterjee and Bhattacharya (2020) proved compactness of $\mathcal{G}_{n}$ for each $n\geq 1$ by showing that $\mathcal{G}_{n}$ is closed, bounded and equicontinuous, and then applying the Arzela-Ascoli theorem. It should be noted that boundedness of the partial derivatives as in Assumption 1 is used to show Lipschitz continuity, hence equicontinuity.

Consider $\mathcal{G}\in\left\{\mathcal{G}_{n}:\ n=1,2,\ldots\right\}$. Now, to show uniform convergence we only need to show the following (see, for example, Chatterjee and Bhattacharya (2020)):

  (i) $\dfrac{1}{n}\log(R_{n}(p))+h(p)$ is stochastically equicontinuous almost surely in $p\in\mathcal{G}$;

  (ii) $\dfrac{1}{n}\log(R_{n}(p))+h(p)\rightarrow 0$ for all $p\in\mathcal{G}$, as $n\rightarrow\infty$.

We have already shown almost sure pointwise convergence of $n^{-1}\log(R_{n}(p))$ to $-h(p)$ in Appendix B.3. Hence it is enough to verify stochastic equicontinuity of $\dfrac{1}{n}\log(R_{n}(p))+h(p)$ in $\mathcal{G}\in\left\{\mathcal{G}_{n}:\ n=1,2,\ldots\right\}$. Stochastic equicontinuity usually follows easily if one can prove that the function concerned is almost surely Lipschitz continuous (Chatterjee and Bhattacharya (2020)). Observe that, if we can show that both $\dfrac{1}{n}\log(R_{n}(p))$ and $h(p)$ are Lipschitz, then this would imply that $\dfrac{1}{n}\log(R_{n}(p))+h(p)$ is Lipschitz, since the sum of Lipschitz functions is Lipschitz.

We now show that $\dfrac{1}{n}\log(R_{n}(p))$ and $h(p)$ are both Lipschitz in $\mathcal{G}$. Now,

\displaystyle\frac{1}{n}\log R_{n}(p)=\dfrac{1}{n}\sum_{i=1}^{n}\left\{y_{i}\log\left(\dfrac{p(x_{i})}{p_{0}(x_{i})}\right)+(1-y_{i})\log\left(\dfrac{1-p(x_{i})}{1-p_{0}(x_{i})}\right)\right\}. (B.16)

Let $p_{1},p_{2}$ correspond to $\eta_{1},\eta_{2}\in\Theta$. Note that, since $\|\eta\|_{\infty}\leq\exp\left((\beta m)^{1/4}\right)$ on $\mathcal{G}=\mathcal{G}_{m}$ ($m\geq 1$), it follows that $0<\kappa_{B}\leq p_{1}(x),p_{2}(x)\leq 1-\kappa_{B}<1$, for all $x\in\mathfrak{X}$. Thus, there exists $C>0$ such that $\left|\log\left(\frac{p_{1}(x)}{p_{2}(x)}\right)\right|\leq C\|p_{1}-p_{2}\|_{\infty}$ and $\left|\log\left(\frac{1-p_{1}(x)}{1-p_{2}(x)}\right)\right|\leq C\|p_{1}-p_{2}\|_{\infty}$, for all $x\in\mathfrak{X}$. Hence,

\displaystyle\left|\frac{1}{n}\log R_{n}(p_{1})-\frac{1}{n}\log R_{n}(p_{2})\right|
\displaystyle\qquad=\left|\frac{1}{n}\sum_{i=1}^{n}\left\{y_{i}\log\left(\frac{p_{1}(x_{i})}{p_{2}(x_{i})}\right)+(1-y_{i})\log\left(\frac{1-p_{1}(x_{i})}{1-p_{2}(x_{i})}\right)\right\}\right|
\displaystyle\qquad\leq 2C\|p_{1}-p_{2}\|_{\infty},

showing Lipschitz continuity of $\frac{1}{n}\log R_{n}(p)$ with respect to $p$ corresponding to $\eta\in\mathcal{G}=\mathcal{G}_{m}$. Since $H$ is continuously differentiable, and $\eta$ and $\eta^{\prime}$ are bounded on $\mathcal{G}$, with the same bound for all $\eta$, it follows that $p=H(\eta)$ is Lipschitz in $\eta$ on $\mathcal{G}$.

To see that $h(p)$ is also Lipschitz in $\mathcal{G}=\mathcal{G}_{m}$, it is enough to note that

\displaystyle\left|h(p_{1})-h(p_{2})\right|=\left|E_{\mathbf{X}}\left(p_{0}(\mathbf{X})\log\left(\frac{p_{2}(\mathbf{X})}{p_{1}(\mathbf{X})}\right)\right)+E_{\mathbf{X}}\left((1-p_{0}(\mathbf{X}))\log\left(\frac{1-p_{2}(\mathbf{X})}{1-p_{1}(\mathbf{X})}\right)\right)\right|
\displaystyle\qquad\leq 2C\|p_{1}-p_{2}\|_{\infty},

and the result follows since $p$ is Lipschitz on $\mathcal{G}$.

B.6 Verification of (S6) for binary regression

We need to show:

\sum_{n=1}^{\infty}\int_{S^{c}}P\left(\left|\dfrac{1}{n}\log R_{n}(p)+h(p)\right|>\kappa-h(\Theta)\right)\ d\pi(p)<\infty. (B.17)

Let us take $\kappa_{1}=\kappa-h(\Theta)$. Observe that,

\displaystyle\frac{1}{n}\log R_{n}(p)+h(p)
\displaystyle=\dfrac{1}{n}\sum_{i=1}^{n}\left\{y_{i}\log\left(\dfrac{p(x_{i})}{p_{0}(x_{i})}\right)+(1-y_{i})\log\left(\dfrac{1-p(x_{i})}{1-p_{0}(x_{i})}\right)\right\}
\displaystyle\qquad+\left[E_{\mathbf{X}}\left(p_{0}(\mathbf{X})\log\left\{\dfrac{p_{0}(\mathbf{X})}{p(\mathbf{X})}\right\}\right)+E_{\mathbf{X}}\left((1-p_{0}(\mathbf{X}))\log\left\{\dfrac{1-p_{0}(\mathbf{X})}{1-p(\mathbf{X})}\right\}\right)\right]
\displaystyle=\dfrac{1}{n}\sum_{i=1}^{n}\left\{y_{i}\log\left(\dfrac{p(x_{i})}{p_{0}(x_{i})}\right)-E_{\mathbf{X}}\left(p_{0}(\mathbf{X})\log\left\{\dfrac{p(\mathbf{X})}{p_{0}(\mathbf{X})}\right\}\right)\right\}
\displaystyle\qquad+\dfrac{1}{n}\sum_{i=1}^{n}\left\{(1-y_{i})\log\left(\dfrac{1-p(x_{i})}{1-p_{0}(x_{i})}\right)-E_{\mathbf{X}}\left((1-p_{0}(\mathbf{X}))\log\left\{\dfrac{1-p(\mathbf{X})}{1-p_{0}(\mathbf{X})}\right\}\right)\right\}.

It follows that:

\displaystyle P\left(\left|\dfrac{1}{n}\log R_{n}(p)+h(p)\right|>\kappa_{1}\right) (B.18)
\displaystyle\leq P\left(\left|\dfrac{1}{n}\sum_{i=1}^{n}\left\{y_{i}\log\left(\dfrac{p(x_{i})}{p_{0}(x_{i})}\right)-E_{\mathbf{X}}\left(p_{0}(\mathbf{X})\log\left\{\dfrac{p(\mathbf{X})}{p_{0}(\mathbf{X})}\right\}\right)\right\}\right|>\dfrac{\kappa_{1}}{2}\right) (B.19)
\displaystyle\qquad+P\left(\left|\dfrac{1}{n}\sum_{i=1}^{n}\left\{(1-y_{i})\log\left(\dfrac{1-p(x_{i})}{1-p_{0}(x_{i})}\right)-E_{\mathbf{X}}\left((1-p_{0}(\mathbf{X}))\log\left\{\dfrac{1-p(\mathbf{X})}{1-p_{0}(\mathbf{X})}\right\}\right)\right\}\right|>\dfrac{\kappa_{1}}{2}\right). (B.20)

Since the $y_{i}$ are binary, it follows, using the inequalities $1-\frac{1}{x}\leq\log x\leq x-1$ for $x>0$ and Assumptions 5 and 6, that the random variables $V_{i}=y_{i}\log\left(\dfrac{p(x_{i})}{p_{0}(x_{i})}\right)$ and $W_{i}=(1-y_{i})\log\left(\dfrac{1-p(x_{i})}{1-p_{0}(x_{i})}\right)$ are absolutely bounded by $C\|p-p_{0}\|_{\infty}$, for some $C>0$. We shall apply Hoeffding's inequality (Hoeffding (1963)) separately to the terms (B.19) and (B.20), which involve $V_{i}$ and $W_{i}$, respectively.

Note that for $\eta\in\mathcal{G}_{n}$,

\displaystyle P\left(\left|\dfrac{1}{n}\sum_{i=1}^{n}\left\{y_{i}\log\left(\dfrac{p(x_{i})}{p_{0}(x_{i})}\right)-E_{\mathbf{X}}\left(p_{0}(\mathbf{X})\log\left\{\dfrac{p(\mathbf{X})}{p_{0}(\mathbf{X})}\right\}\right)\right\}\right|>\dfrac{\kappa_{1}}{2}\right)
\displaystyle\leq P\left(\left|\dfrac{1}{n}\sum_{i=1}^{n}\left\{y_{i}\log\left(\dfrac{p(x_{i})}{p_{0}(x_{i})}\right)-p_{0}(x_{i})\log\left(\frac{p(x_{i})}{p_{0}(x_{i})}\right)\right\}\right|>\dfrac{\kappa_{1}}{4}\right)
\displaystyle\qquad+P\left(\left|\dfrac{1}{n}\sum_{i=1}^{n}\left\{p_{0}(x_{i})\log\left(\frac{p(x_{i})}{p_{0}(x_{i})}\right)-E_{\mathbf{X}}\left(p_{0}(\mathbf{X})\log\left\{\dfrac{p(\mathbf{X})}{p_{0}(\mathbf{X})}\right\}\right)\right\}\right|>\dfrac{\kappa_{1}}{4}\right)
\displaystyle\leq 4\exp\left\{-\dfrac{n\kappa^{2}_{1}}{8C^{2}\|p-p_{0}\|^{2}_{\infty}}\right\}\leq 4\exp\left\{-\dfrac{n\kappa^{2}_{1}}{8C^{2}L^{2}\|\eta-\eta_{0}\|^{2}_{\infty}}\right\}, (B.21)

where $L>0$ is the Lipschitz constant associated with $H$. Here it is important to note that for $\eta\in\mathcal{G}_{n}$, $H(\eta)$ is Lipschitz in $\eta$ thanks to continuous differentiability of $H$, and boundedness of $\eta$ and $\eta^{\prime}$ by the same constant on $\mathcal{G}_{n}$. Also note that (B.21) holds irrespective of $x_{i}$; $i=1,\ldots,n$, being random or non-random (see Chatterjee and Bhattacharya (2020)).
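For a rough sense of scale, the right-hand side of (B.21) can be evaluated directly; the values of $C$, $L$, $\kappa_{1}$ and $\|\eta-\eta_{0}\|_{\infty}$ below are illustrative placeholders, since the actual constants depend on the assumptions and on $\eta_{0}$.

    import numpy as np

    def hoeffding_bound_B21(n, kappa1, C, L, eta_dist):
        # Right-hand side of (B.21): 4 * exp(-n * kappa1^2 / (8 * C^2 * L^2 * ||eta - eta0||_inf^2)).
        return 4.0 * np.exp(-n * kappa1**2 / (8.0 * C**2 * L**2 * eta_dist**2))

    # Illustrative values only; the point is the geometric decay in n,
    # which is what drives summability in (B.23) and (B.24).
    for n in (100, 1_000, 10_000):
        print(n, hoeffding_bound_B21(n, kappa1=0.1, C=2.0, L=0.25, eta_dist=1.0))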

Similarly, for $\eta\in\mathcal{G}_{n}$,

\displaystyle P\left(\left|\dfrac{1}{n}\sum_{i=1}^{n}\left\{(1-y_{i})\log\left(\dfrac{1-p(x_{i})}{1-p_{0}(x_{i})}\right)-E_{\mathbf{X}}\left((1-p_{0}(\mathbf{X}))\log\left\{\dfrac{1-p(\mathbf{X})}{1-p_{0}(\mathbf{X})}\right\}\right)\right\}\right|>\dfrac{\kappa_{1}}{2}\right)
\displaystyle\leq 4\exp\left\{-\dfrac{n\kappa^{2}_{1}}{8C^{2}L^{2}\|\eta-\eta_{0}\|^{2}_{\infty}}\right\}. (B.22)

Now,

\displaystyle\sum_{n=1}^{\infty}\int_{S^{c}}P\left(\left|\dfrac{1}{n}\sum_{i=1}^{n}\left\{y_{i}\log\left(\dfrac{p(x_{i})}{p_{0}(x_{i})}\right)-E_{\mathbf{X}}\left(p_{0}(\mathbf{X})\log\left\{\dfrac{p(\mathbf{X})}{p_{0}(\mathbf{X})}\right\}\right)\right\}\right|>\dfrac{\kappa_{1}}{2}\right)\ d\pi(p)
\displaystyle\qquad\leq\sum_{n=1}^{\infty}\int_{\mathcal{G}_{n}}4\exp\left\{-\dfrac{n\kappa^{2}_{1}}{8C^{2}L^{2}\|\eta-\eta_{0}\|^{2}_{\infty}}\right\}d\pi(\eta)+\sum_{n=1}^{\infty}\pi\left(\mathcal{G}^{c}_{n}\right), (B.23)

and

\displaystyle\sum_{n=1}^{\infty}\int_{S^{c}}P\left(\left|\dfrac{1}{n}\sum_{i=1}^{n}\left\{(1-y_{i})\log\left(\dfrac{1-p(x_{i})}{1-p_{0}(x_{i})}\right)-E_{\mathbf{X}}\left((1-p_{0}(\mathbf{X}))\log\left\{\dfrac{1-p(\mathbf{X})}{1-p_{0}(\mathbf{X})}\right\}\right)\right\}\right|>\dfrac{\kappa_{1}}{2}\right)\ d\pi(p)
\displaystyle\qquad\leq\sum_{n=1}^{\infty}\int_{\mathcal{G}_{n}}4\exp\left\{-\dfrac{n\kappa^{2}_{1}}{8C^{2}L^{2}\|\eta-\eta_{0}\|^{2}_{\infty}}\right\}d\pi(\eta)+\sum_{n=1}^{\infty}\pi\left(\mathcal{G}^{c}_{n}\right). (B.24)

Then proceeding in the same way as (S-2.25)–(S-2.30) of Chatterjee and Bhattacharya (2020), and noting that $\sum_{n=1}^{\infty}\pi\left(\mathcal{G}^{c}_{n}\right)<\infty$, we obtain (B.17).

Hence (S6) holds.

Remark 1.

It is important to clarify the role of Assumption 6 here. Note that we need a lower bound for $\log\left(\frac{p(x)}{p_{0}(x)}\right)$. For instance, if $H(\eta(x))=\frac{\exp\left(\eta(x)\right)}{1+\exp\left(\eta(x)\right)}$, then even if $\|\eta\|_{\infty}\leq\sqrt{\beta n}$ on $\mathcal{G}_{n}$, it only holds that $\log\left(\frac{p(x)}{p_{0}(x)}\right)\geq C-\sqrt{\beta n}$ for all $x\in\mathfrak{X}$, for all $\eta\in\mathcal{G}_{n}$, for some constant $C$. In our bounding method using the inequality $\log x\geq 1-1/x$ for $x>0$, we have $\log\left(\frac{p(x)}{p_{0}(x)}\right)\geq-\frac{\|p-p_{0}\|_{\infty}}{p(x)}\geq-2\exp\left(\sqrt{\beta n}\right)\|p-p_{0}\|_{\infty}$. It would then follow that the exponent of the Hoeffding inequality is $O(1)$, which would fail to ensure summability of the corresponding terms involving $V_{i}$. Thus, we need to ensure that $p(x)$ is bounded away from $0$. Similarly, the infinite sum associated with $W_{i}$ would not be finite unless $1-p(x)$ is bounded away from $0$.
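The failure mode described in this remark can be seen numerically with purely hypothetical constants: if the summands are bounded by a fixed constant, the Hoeffding-type term decays geometrically in $n$ and is summable, whereas if the bound inflates like $2\exp\left(\sqrt{\beta n}\right)\|p-p_{0}\|_{\infty}$, the term remains of order one.

    import numpy as np

    def hoeffding_term(n, kappa1, range_width):
        # Hoeffding-type tail bound 4 * exp(-n * kappa1^2 / (8 * range_width^2)),
        # where range_width bounds the absolute value of the centred summands.
        return 4.0 * np.exp(-n * kappa1**2 / (8.0 * range_width**2))

    beta, kappa1, sup_dist = 1.0, 0.1, 0.5   # illustrative constants
    for n in (10, 100, 1000):
        bounded = hoeffding_term(n, kappa1, range_width=sup_dist)
        inflated = hoeffding_term(n, kappa1, range_width=2.0 * np.exp(np.sqrt(beta * n)) * sup_dist)
        print(n, bounded, inflated)          # the second column stays close to 4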

B.7 Verification of (S7) for binary regression

This verification follows from the fact that $h(p)$ is continuous. Indeed, for any set $A$ with $\pi(A)>0$, $\mathcal{G}_{n}\cap A\uparrow A$. It follows from continuity of $h$ that $h\left(\mathcal{G}_{n}\cap A\right)\downarrow h(A)$ as $n\rightarrow\infty$, and hence (S7) holds.

Appendix C Verification of (S1) to (S7) for Poisson regression

C.1 Verification of (S1) for Poisson regression

Observe that

\displaystyle f_{\lambda}(\mathbf{Y}_{n}|\mathbf{X}_{n})=\prod_{i=1}^{n}f(y_{i}|x_{i})=\prod_{i=1}^{n}\exp\left(-\lambda(x_{i})\right)\dfrac{(\lambda(x_{i}))^{y_{i}}}{y_{i}!},
\displaystyle f_{\lambda_{0}}(\mathbf{Y}_{n}|\mathbf{X}_{n})=\prod_{i=1}^{n}f_{0}(y_{i}|x_{i})=\prod_{i=1}^{n}\exp\left(-\lambda_{0}(x_{i})\right)\dfrac{(\lambda_{0}(x_{i}))^{y_{i}}}{y_{i}!}.

Therefore,

R_{n}(\lambda)=\exp\left(-\sum_{i=1}^{n}[\lambda(x_{i})-\lambda_{0}(x_{i})]\right)\prod_{i=1}^{n}\left(\dfrac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)^{y_{i}} (C.1)

and,

\frac{1}{n}\log R_{n}(\lambda)=-\frac{1}{n}\sum_{i=1}^{n}[\lambda(x_{i})-\lambda_{0}(x_{i})]+\frac{1}{n}\sum_{i=1}^{n}y_{i}\log\left(\dfrac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right). (C.2)

Note that for any $a\in\Re$, $\left\{(y_{i},\eta):y_{i}\log\left(\dfrac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)<a\right\}=\bigcup_{r=1}^{\infty}\left\{\eta:r\log\left(\dfrac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)<a\right\}$. Let $\tilde{\eta}_{j}$; $j=1,2,\ldots$ be such that $\|\tilde{\eta}_{j}-\eta\|_{\infty}\rightarrow 0$, as $j\rightarrow\infty$. Then, letting $\tilde{\lambda}_{j}(x)=H(\tilde{\eta}_{j}(x))$, for all $x\in\mathfrak{X}$, it follows, since $0<C_{1}\leq\lambda(x)\leq C_{2}<\infty$ on $\mathfrak{X}$, that there exists $j_{0}\geq 1$ such that for $j\geq j_{0}$, $0<C_{1}\leq\tilde{\lambda}_{j}(x)\leq C_{2}<\infty$. Hence, using the inequalities $1-\frac{1}{x}\leq\log x\leq x-1$ for $x>0$, we obtain $\left|\log\left(\frac{\tilde{\lambda}_{j}(x_{i})}{\lambda(x_{i})}\right)\right|\leq C\|\tilde{\lambda}_{j}-\lambda\|_{\infty}$, for some $C>0$, for $j\geq j_{0}\geq 1$. It follows that

\displaystyle\left|r\log\left(\dfrac{\tilde{\lambda}_{j}(x_{i})}{\lambda_{0}(x_{i})}\right)-r\log\left(\dfrac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)\right|=r\left|\log\left(\dfrac{\tilde{\lambda}_{j}(x_{i})}{\lambda(x_{i})}\right)\right|\leq rC\|\tilde{\lambda}_{j}-\lambda\|_{\infty}\rightarrow 0,

in the same way as in the binary regression, using Taylor's series expansion up to the first order. Hence, $r\log\left(\dfrac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)$ is continuous in $\eta$, ensuring measurability of $\left\{\eta:r\log\left(\dfrac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)<a\right\}$, and hence of $\left\{(y_{i},\eta):y_{i}\log\left(\dfrac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)<a\right\}$. It follows that $\frac{1}{n}\sum_{i=1}^{n}y_{i}\log\left(\dfrac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)$ is measurable.

Also, continuity of $\lambda(x_{i})-\lambda_{0}(x_{i})$ with respect to $\eta$ ensures measurability of $-\frac{1}{n}\sum_{i=1}^{n}[\lambda(x_{i})-\lambda_{0}(x_{i})]$. Thus, $\frac{1}{n}\log R_{n}(\lambda)$, and hence $R_{n}(\lambda)$, is measurable.

C.2 Verification of (S2) for Poisson regression

For every $\lambda\in\Lambda$, we need to show that the KL divergence rate

h(\lambda)=\underset{n\rightarrow\infty}{\lim}~\frac{1}{n}E_{\lambda_{0}}\left[\log\left\{\frac{f_{\lambda_{0}}(\mathbf{Y}_{n}|\mathbf{X}_{n})}{f_{\lambda}(\mathbf{Y}_{n}|\mathbf{X}_{n})}\right\}\right]=\underset{n\rightarrow\infty}{\lim}~\frac{1}{n}E_{\lambda_{0}}\left[-\log\left\{R_{n}(\lambda)\right\}\right]

exists (possibly being infinite) and is $\mathcal{T}$-measurable.

Now,

\frac{1}{n}\log R_{n}(\lambda)=-\frac{1}{n}\sum_{i=1}^{n}[\lambda(x_{i})-\lambda_{0}(x_{i})]+\frac{1}{n}\sum_{i=1}^{n}y_{i}\log\left(\dfrac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right).

Therefore,

\displaystyle\frac{1}{n}E_{\lambda_{0}}\left[-\log\left\{R_{n}(\lambda)\right\}\right]=\frac{1}{n}\sum_{i=1}^{n}[\lambda(x_{i})-\lambda_{0}(x_{i})]+\frac{1}{n}\sum_{i=1}^{n}\lambda_{0}(x_{i})\log\left(\dfrac{\lambda_{0}(x_{i})}{\lambda(x_{i})}\right).
\displaystyle\underset{n\rightarrow\infty}{\lim}~\frac{1}{n}E_{\lambda_{0}}\left[-\log\left\{R_{n}(\lambda)\right\}\right]=\underset{n\rightarrow\infty}{\lim}~\frac{1}{n}\sum_{i=1}^{n}[\lambda(x_{i})-\lambda_{0}(x_{i})]+\underset{n\rightarrow\infty}{\lim}~\frac{1}{n}\sum_{i=1}^{n}\lambda_{0}(x_{i})\log\left(\dfrac{\lambda_{0}(x_{i})}{\lambda(x_{i})}\right)
\displaystyle\qquad=E_{\mathbf{X}}\left[\lambda(\mathbf{X})-\lambda_{0}(\mathbf{X})\right]+E_{\mathbf{X}}\left[\lambda_{0}(\mathbf{X})\log\left(\dfrac{\lambda_{0}(\mathbf{X})}{\lambda(\mathbf{X})}\right)\right].

The last line holds due to Assumption 4 and the SLLN. Here $E_{\mathbf{X}}(\cdot)=\int_{\mathfrak{X}}\cdot\ dQ$. In other words,

h(\lambda)=E_{\mathbf{X}}\left[\lambda(\mathbf{X})-\lambda_{0}(\mathbf{X})\right]+E_{\mathbf{X}}\left[\lambda_{0}(\mathbf{X})\log\left(\dfrac{\lambda_{0}(\mathbf{X})}{\lambda(\mathbf{X})}\right)\right]. (C.3)
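As with (B.10), the KL divergence rate (C.3) can be approximated by a short Monte Carlo sketch; the uniform covariate distribution, the exponential link and the particular $\eta_{0}$, $\eta$ below are assumptions made purely for illustration.

    import numpy as np

    rng = np.random.default_rng(2)

    H = np.exp                                            # illustrative link
    eta0 = lambda x: 1.0 + 0.5 * np.sin(2 * np.pi * x)    # hypothetical truth
    eta = lambda x: 1.2 + 0.4 * np.sin(2 * np.pi * x)     # hypothetical candidate

    def h_poisson(eta, eta0, n_mc=10**6):
        # Monte Carlo approximation of (C.3) with Q = Uniform(0,1).
        x = rng.uniform(size=n_mc)
        lam, lam0 = H(eta(x)), H(eta0(x))
        return np.mean(lam - lam0 + lam0 * np.log(lam0 / lam))

    print(h_poisson(eta, eta0))  # approximately h(lambda); equals 0 iff lambda = lambda0 a.e. [Q]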

C.3 Verification of (S3) for Poisson regression

Here we need to verify the asymptotic equipartition property, that is, almost surely with respect to the true model $P$,

\underset{n\rightarrow\infty}{\lim}~\frac{1}{n}\log\left[R_{n}(\lambda)\right]=-h(\lambda)=\underset{n\rightarrow\infty}{\lim}~\frac{1}{n}E\left[\log\left\{\frac{f_{\lambda}(\mathbf{Y}_{n}|\mathbf{X}_{n})}{f_{\lambda_{0}}(\mathbf{Y}_{n}|\mathbf{X}_{n})}\right\}\right]. (C.4)

Now,

-\frac{1}{n}\log R_{n}(\lambda)=\frac{1}{n}\sum_{i=1}^{n}\left\{\left[\lambda(x_{i})-\lambda_{0}(x_{i})\right]+y_{i}\log\left(\dfrac{\lambda_{0}(x_{i})}{\lambda(x_{i})}\right)\right\}.

As before, for given $\lambda$, there exists $C>0$ such that $\left|\log\left(\frac{\lambda_{0}(x_{i})}{\lambda(x_{i})}\right)\right|\leq C\|\lambda-\lambda_{0}\|_{\infty}$. Hence,

\displaystyle\sum_{i=1}^{\infty}i^{-2}\mathrm{Var}\left[\left[\lambda(x_{i})-\lambda_{0}(x_{i})\right]+y_{i}\log\left(\dfrac{\lambda_{0}(x_{i})}{\lambda(x_{i})}\right)\right]
\displaystyle=\sum_{i=1}^{\infty}i^{-2}\lambda_{0}(x_{i})\left[\log\left(\dfrac{\lambda_{0}(x_{i})}{\lambda(x_{i})}\right)\right]^{2}
\displaystyle\leq C^{2}H(\kappa_{0})\|\lambda-\lambda_{0}\|^{2}_{\infty}\sum_{i=1}^{\infty}i^{-2}
\displaystyle<\infty. (C.5)

Observe that $y_{i}$ are observations from independent random variables. Hence, from Kolmogorov's SLLN for independent random variables and from Assumption 4, (C.4) holds as $n\rightarrow\infty$.

C.4 Verification of (S4) for Poisson regression

If $I=\{\lambda:\ h(\lambda)=\infty\}$ then we need to show $\Pi(I)<1$. But this holds in almost the same way as for binary regression. In other words, (S4) holds for Poisson regression.

C.5 Verification of (S5) for Poisson regression

The parameter space here remains the same as in the binary regression case, that is, $\Theta=\mathcal{C}^{\prime}(\mathfrak{X})$. We also consider the same sequence $\mathcal{G}_{n}$ as in binary regression. We need to verify that

  1. $h\left(\mathcal{G}_{n}\right)\rightarrow h\left(\Lambda\right)$, as $n\rightarrow\infty$;

  2. The inequality $\pi\left(\mathcal{G}_{n}\right)\geq 1-\alpha\exp\left(-\beta n\right)$ holds for some $\alpha>0$, $\beta>2h(\Lambda)$;

  3. The convergence in (S3) is uniform over $\mathcal{G}_{n}\setminus I$.

C.5.1 Verification of (S5) (1)

We now need to verify that $h\left(\mathcal{G}_{n}\right)\rightarrow h\left(\Lambda\right)$ as $n\rightarrow\infty$. But this holds in the same way as for binary regression.

C.5.2 Verification of (S5) (2)

Again, this holds in the same way as for binary regression.

C.5.3 Verification of (S5) (3)

Using the same arguments as in the binary regression case, here we only need to show that $\dfrac{1}{n}\log(R_{n}(\lambda))$ and $h(\lambda)$ are both Lipschitz.

Recall that

\displaystyle\frac{1}{n}\log R_{n}(\lambda)=\frac{1}{n}\sum_{i=1}^{n}\left\{\left[\lambda_{0}(x_{i})-\lambda(x_{i})\right]+y_{i}\log\left(\dfrac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)\right\}.

For any $\eta_{1},\eta_{2}\in\mathcal{G}$, there exists $C>0$ such that $\left|\log\left(\frac{\lambda_{1}(x)}{\lambda_{2}(x)}\right)\right|\leq C\|\lambda_{1}-\lambda_{2}\|_{\infty}$, for all $x\in\mathfrak{X}$, where $\lambda_{1}=H(\eta_{1})$ and $\lambda_{2}=H(\eta_{2})$. Hence,

\displaystyle\left|\frac{1}{n}\log R_{n}(\lambda_{1})-\frac{1}{n}\log R_{n}(\lambda_{2})\right|\leq\|\lambda_{1}-\lambda_{2}\|_{\infty}\left(1+C\times\frac{1}{n}\sum_{i=1}^{n}y_{i}\right).

Thus, $\frac{1}{n}\log R_{n}(\lambda)$ is almost surely Lipschitz with respect to $\lambda$. Since, by Kolmogorov's SLLN for independent random variables, $\frac{1}{n}\sum_{i=1}^{n}y_{i}\stackrel{a.s.}{\longrightarrow}E_{\mathbf{X}}\left(\lambda_{0}(\mathbf{X})\right)<\infty$, as $n\rightarrow\infty$, and since $\lambda=H(\eta)$ is Lipschitz in $\eta\in\mathcal{G}_{n}$ in the same way as in binary regression, the desired stochastic equicontinuity follows. Lipschitz continuity of $h(\lambda)$ in $\mathcal{G}_{n}$ follows using similar techniques.

C.6 Verification of (S6) for Poisson regression

Since

\displaystyle\sum_{n=1}^{\infty}\int_{S^{c}}P\left(\left|\dfrac{1}{n}\log R_{n}(\lambda)+h(\lambda)\right|>\kappa-h(\Lambda)\right)\ d\pi(\lambda)
\displaystyle\qquad\leq\sum_{n=1}^{\infty}\int_{\mathcal{G}_{n}}P\left(\left|\dfrac{1}{n}\log R_{n}(\lambda)+h(\lambda)\right|>\kappa-h(\Lambda)\right)\ d\pi(\lambda)
\displaystyle\qquad\quad+\sum_{n=1}^{\infty}\int_{\mathcal{G}^{c}_{n}}P\left(\left|\dfrac{1}{n}\log R_{n}(\lambda)+h(\lambda)\right|>\kappa-h(\Lambda)\right)\ d\pi(\lambda)
\displaystyle\qquad\leq\sum_{n=1}^{\infty}\int_{\mathcal{G}_{n}}P\left(\left|\dfrac{1}{n}\log R_{n}(\lambda)+h(\lambda)\right|>\kappa-h(\Lambda)\right)\ d\pi(\lambda)+\sum_{n=1}^{\infty}\pi\left(\mathcal{G}^{c}_{n}\right), (C.6)

and the second term of (C.6) is finite, it is enough to show that the first term of (C.6) is finite.

Let us take $\kappa_{1}=\kappa-h(\Lambda)$. Observe that for $\eta\in\mathcal{G}_{n}$,

\displaystyle P\left(\left|\dfrac{1}{n}\log R_{n}(\lambda)+h(\lambda)\right|>\kappa_{1}\right)
\displaystyle\leq P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left[\lambda_{0}(x_{i})\log\left(\frac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)-E_{\mathbf{X}}\left(\lambda_{0}(\mathbf{X})\log\left(\frac{\lambda(\mathbf{X})}{\lambda_{0}(\mathbf{X})}\right)\right)\right]\right|>\frac{\kappa_{1}}{3}\right) (C.7)
\displaystyle\qquad+P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left[\left(\lambda_{0}(x_{i})-\lambda(x_{i})\right)-E_{\mathbf{X}}\left(\lambda_{0}(\mathbf{X})-\lambda(\mathbf{X})\right)\right]\right|>\frac{\kappa_{1}}{3}\right) (C.8)
\displaystyle\qquad+P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left[y_{i}\log\left(\frac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)-\lambda_{0}(x_{i})\log\left(\frac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)\right]\right|>\frac{\kappa_{1}}{3}\right). (C.9)

Using Hoeffding's inequality and Lipschitz continuity of $H$ in $\mathcal{G}_{n}$ as in binary regression, we find that (C.7) and (C.8) are bounded above by $2\exp\left(-\frac{C_{1}n\kappa^{2}_{1}}{\|\eta-\eta_{0}\|^{2}_{\infty}}\right)$ and $\exp\left(-\frac{C_{2}n\kappa^{2}_{1}}{\|\eta-\eta_{0}\|^{2}_{\infty}}\right)$, for some $C_{1}>0$ and $C_{2}>0$. These bounds hold even if the covariates are non-random.

To bound (C.9), we shall first show that the summands are sub-exponential, and then shall apply Bernstein’s inequality (see, for example, Uspensky (1937), Bennett (1962), Massart (2003)). Direct calculation yields

\displaystyle E\left[\exp\left\{t\left(y_{i}\log\left(\frac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)-\lambda_{0}(x_{i})\log\left(\frac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)\right)\right\}\right]
\displaystyle\qquad=\exp\left[-t\lambda_{0}(x_{i})\log\left(\frac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)\right]\times\exp\left[\lambda_{0}(x_{i})\left\{\exp\left(t\log\left(\frac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)\right)-1\right\}\right]. (C.10)

The first factor of (C.10) has the following upper bound:

\exp\left[-t\lambda_{0}(x_{i})\log\left(\frac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)\right]\leq\exp\left(c_{\lambda}\|\lambda\|_{\infty}|t|\right). (C.11)

A bound for the second factor of (C.10) is given as follows:

\displaystyle\exp\left[\lambda_{0}(x_{i})\left\{\exp\left(t\log\left(\frac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)\right)-1\right\}\right]
\displaystyle\qquad\leq\exp\left[\|\lambda_{0}\|_{\infty}\left(\exp\left(\frac{t\|\lambda-\lambda_{0}\|_{\infty}}{\kappa_{P}}\right)-1\right)\right]
\displaystyle\qquad\leq\exp\left[\|\lambda_{0}\|_{\infty}\left(c_{\lambda}|t|+c^{2}_{\lambda}t^{2}\right)\right], (C.12)

for $|t|\leq c^{-1}_{\lambda}$, where $c_{\lambda}=C\|\lambda-\lambda_{0}\|_{\infty}$, for some $C>0$.

Combining (C.10), (C.11) and (C.12), we see that (C.10) is bounded above by $\exp\left(c^{2}_{\lambda}t^{2}\right)$ provided that

c_{\lambda}|t|\geq 2/\left(\|\lambda_{0}\|^{-1}_{\infty}-1\right)\geq 2/\left(\kappa^{-1}_{P}-1\right). (C.13)

The rightmost bound of (C.13) is close to zero if $\kappa_{P}$ is chosen sufficiently small. Now consider the function $g(t)=\exp\left(c^{2}_{\lambda}t^{2}\right)-f(t)$, where $f(t)$ is given by (C.10). Since $g(t)$ is continuous in $t$, $g(0)=0$ and $g(t)>0$ on $2/\left(\kappa^{-1}_{P}-1\right)\leq|t|\leq c^{-1}_{\lambda}$, it follows that on the sufficiently small interval $0\leq|t|\leq 2/\left(\kappa^{-1}_{P}-1\right)$, $g(t)\geq 0$. In other words, (C.10) is bounded above by $\exp\left(c^{2}_{\lambda}t^{2}\right)$ for $0\leq|t|\leq c^{-1}_{\lambda}$. Thus, $z_{i}=y_{i}\log\left(\frac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)-\lambda_{0}(x_{i})\log\left(\frac{\lambda(x_{i})}{\lambda_{0}(x_{i})}\right)$ are independent sub-exponential random variables with parameter $c_{\lambda}$.
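The sub-exponential bound can be sanity-checked numerically: for $y_{i}\sim\mathrm{Poisson}(\lambda_{0}(x_{i}))$ and $r=\lambda(x_{i})/\lambda_{0}(x_{i})$, the exact moment generating function of $z_{i}$ is $\exp\left[\lambda_{0}(x_{i})\left(r^{t}-1-t\log r\right)\right]$, and the sketch below compares its logarithm with $c^{2}_{\lambda}t^{2}$ over $|t|\leq c^{-1}_{\lambda}$. The particular values of $\lambda_{0}(x_{i})$, $\lambda(x_{i})$ and $c_{\lambda}$ are illustrative choices, not the constants dictated by (C.11)-(C.13).

    import numpy as np

    lam0, lam = 2.0, 2.5     # illustrative values of lambda_0(x_i) and lambda(x_i)
    r = lam / lam0
    c_lam = 1.0              # illustrative sub-exponential parameter c_lambda

    t = np.linspace(-1.0 / c_lam, 1.0 / c_lam, 2001)

    # Exact log-MGF of z_i = (y_i - lam0) * log(r) with y_i ~ Poisson(lam0).
    log_mgf = lam0 * (r**t - 1.0 - t * np.log(r))
    quad_bound = (c_lam * t) ** 2

    print(np.all(log_mgf <= quad_bound))   # True for these illustrative values
    print(log_mgf.max(), quad_bound.max())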

Bernstein's inequality, in conjunction with Lipschitz continuity of $H$ on $\mathcal{G}_{n}$, then ensures that (C.9) is bounded above by $2\exp\left[-\frac{n}{2}\min\left\{\frac{C_{1}\kappa^{2}_{1}}{\|\eta-\eta_{0}\|^{2}_{\infty}},\frac{C_{2}\kappa_{1}}{\|\eta-\eta_{0}\|_{\infty}}\right\}\right]$, for positive constants $C_{1}$ and $C_{2}$.

The rest of the proof of finiteness of (C.6) follows in the same (indeed, simpler) way as in Chatterjee and Bhattacharya (2020). Hence (S6) holds.

Remark 2.

Arguments similar to those of Remark 1 show that it is essential to have $\lambda$ bounded away from zero.

C.7 Verification of (S7) for Poisson regression

This verification follows from the fact that $h(\lambda)$ is continuous, as in the binary regression case.

References

  • Adler, R. J. (1981). The Geometry of Random Fields. John Wiley & Sons Ltd., New York.
  • Adler, R. J. and Taylor, J. E. (2007). Random Fields and Geometry. Springer, New York.
  • Bennett, G. (1962). Probability Inequalities for the Sums of Independent Random Variables. Journal of the American Statistical Association, 57, 33–45.
  • Blackwell, D. and Dubins, L. (1962). Merging of Opinions With Increasing Information. The Annals of Mathematical Statistics, 33, 882–886.
  • Chatterjee, D. and Bhattacharya, S. (2020). On Posterior Convergence of Gaussian and General Stochastic Process Regression Under Possible Misspecifications. ArXiv preprint.
  • Choudhuri, N., Ghosal, S., and Roy, A. (2007). Nonparametric Binary Regression Using a Gaussian Process Prior. Statistical Methodology, 4, 227–243.
  • Cramer, H. and Leadbetter, M. R. (1967). Stationary and Related Stochastic Processes. Wiley, New York.
  • Diaconis, P. and Freedman, D. (1986). On the Consistency of Bayes Estimates (with discussion). The Annals of Statistics, 14, 1–67.
  • Diaconis, P. and Freedman, D. A. (1993). Nonparametric Binary Regression: A Bayesian Approach. The Annals of Statistics, 21, 2108–2137.
  • Gelfand, A. E. and Kuo, L. (1991). Nonparametric Bayesian Bioassay Including Ordered Polytomous Response. Biometrika, 78, 657–666.
  • Ghosal, S. and Roy, A. (2006). Posterior Consistency of Gaussian Process Prior for Nonparametric Binary Regression. The Annals of Statistics, 34, 2413–2429.
  • Hoeffding, W. (1963). Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 58, 13–30.
  • Massart, P. (2003). Concentration Inequalities and Model Selection. Volume 1896 of Lecture Notes in Mathematics. Springer-Verlag. Lectures given at the 33rd Probability Summer School in Saint-Flour.
  • Newton, M. A., Czado, C., and Chappell, R. (1996). Bayesian Inference for Semiparametric Binary Regression. Journal of the American Statistical Association, 91, 142–153.
  • Pillai, N., Wolpert, R. L., and Clyde, M. A. (2007). A Note on Posterior Consistency of Nonparametric Poisson Regression Models. Available at https://pdfs.semanticscholar.org/27f5/af4d00cef092c8b19662951cc316c2e222b7.pdf.
  • Shalizi, C. R. (2009). Dynamics of Bayesian Updating With Dependent Data and Misspecified Models. Electronic Journal of Statistics, 3, 1039–1074.
  • Uspensky, J. V. (1937). Introduction to Mathematical Probability. McGraw-Hill, New York, USA.
  • Ye, X., Wang, K., Zou, Y., and Lord, D. (2018). A Semi-nonparametric Poisson Regression Model for Analyzing Motor Vehicle Crash Data. PLoS One, 23, 15.