
Mass-shifting phenomenon of truncated multivariate normal priors

Shuang Zhou, Pallavi Ray, Debdeep Pati, and Anirban Bhattacharya
Department of Statistics, Texas A&M University, College Station, Texas 77843, USA
Abstract

We show that lower-dimensional marginal densities of dependent zero-mean normal distributions truncated to the positive orthant exhibit a mass-shifting phenomenon. Despite the truncated multivariate normal density having a mode at the origin, the marginal density assigns increasingly small mass near the origin as the dimension increases. The phenomenon is accentuated with stronger correlation between the random variables. A precise quantification characterizing the roles of the dimension and the dependence is provided. This surprising behavior has serious implications for Bayesian constrained estimation and inference, where the prior, in addition to having full support, is required to assign a substantial probability near the origin to capture flat parts of the true function of interest. We show that, without further modification, truncated normal priors are not suitable for modeling flat regions, and we propose a novel alternative strategy based on shrinking the coordinates using a multiplicative scale parameter. The proposed shrinkage prior is empirically shown to guard against the mass-shifting phenomenon while retaining computational efficiency.

1 Introduction

Let $p(\cdot)$ denote the density of a $\mathcal{N}_N(\mathbf{0},\Sigma)$ distribution truncated to the non-negative orthant in $\mathbb{R}^N$,

p(\theta) \propto e^{-\theta^{\mathrm{T}}\Sigma^{-1}\theta/2}\,\mathbbm{1}_{\mathcal{C}}(\theta), \quad \mathcal{C} = [0,\infty)^N := \{\theta \in \mathbb{R}^N : \theta_1 \geq 0,\ldots,\theta_N \geq 0\}.   (1.1)

The density $p$ is clearly unimodal with its mode at the origin. However, for certain classes of non-diagonal $\Sigma$, we surprisingly observe that the lower-dimensional marginal distributions increasingly shift mass away from the origin as $N$ increases. This observation is quantified in Theorem 2, where we provide non-asymptotic estimates for marginal probabilities of events of the form $\{\theta_1 \leq \delta\}$, for $\delta > 0$. En route to the proof, we derive a novel Gaussian comparison inequality in Lemma 1. An immediate implication of this mass-shifting phenomenon is that corner regions of the support $\mathcal{C}$, where a subset of the coordinates take values close to zero, increasingly become low-probability regions under $p(\cdot)$ as the dimension increases. From a statistical perspective, this helps explain a paradoxical behavior in Bayesian constrained regression empirically observed in Neelon and Dunson (2004) and Curtis and Ghosh (2011), where truncated normal priors led to biased posterior inference when the underlying function had flat regions.

A common approach to Bayesian constrained regression expands the function in a flexible basis which facilitates representation of the functional constraints in terms of simple constraints on the coefficient space, and then specifies a prior distribution on the coefficients obeying the said constraints. In this context, the multivariate normal distribution subject to linear constraints arises as a natural conjugate prior in Gaussian models and beyond. Various bases, such as Bernstein polynomials (Curtis and Ghosh, 2011), regression splines (Cai and Dunson, 2007; Meyer et al., 2011), penalized splines (Brezger and Steiner, 2008), cumulative distribution functions (Bornkamp and Ickstadt, 2009), restricted splines (Shively et al., 2011), and compactly supported bases (Maatouk and Bay, 2017) have been employed in the literature. For the numerical illustrations in this article, we use the formulation of Maatouk and Bay (2017), where various restrictions such as boundedness, monotonicity, and convexity were equivalently translated into non-negativity constraints on the coefficients under an appropriate basis expansion. They used a truncated normal prior as in (1.1) on the coefficients, with $\Sigma$ induced from a parent Gaussian process on the regression function; see Appendix A for more details.

Figure 1: Monotone function estimation using the basis of Maatouk and Bay (2017) and a joint truncated normal prior $p(\cdot)$ on the coefficients. The red solid curve corresponds to the true function, the blue solid curve is the posterior mean, the region within the two dotted blue curves represents a pointwise 95% credible interval, and the green dots are the observed data points. Left panel: true function is strictly monotone. Right panel: true function is monotone with a near-flat region.

To motivate our theoretical investigations, the two panels in Figure 1 depict the estimation of two different monotone smooth functions on $[0,1]$ based on 100 samples, using the basis of Maatouk and Bay (2017) and a joint prior $p(\cdot)$ as in (1.1) on the $N = 50$ dimensional basis coefficients. The same prior scale matrix $\Sigma$ was employed across the two settings; the specifics are deferred to Section 3 and Appendix A. Observe that the function in the left panel is strictly monotone, while the one in the right panel is relatively flat over a region. While the point estimate (posterior mean) as well as the credible intervals look reasonable for the function in the left panel, the situation is significantly worse for the function in the right panel. The posterior mean incurs a large bias, and the pointwise 95% credible intervals fail to capture the true function over a substantial part of the input domain, suggesting that the entire posterior distribution is biased away from the truth. This behavior is perplexing; we are fitting a well-specified model with a prior that has full support on the parameter space (the prior probability assigned to arbitrarily small Kullback–Leibler neighborhoods of any point is positive), which under mild conditions implies good first-order asymptotic properties (Ghosal et al., 2000) such as posterior consistency. However, the finite-sample behavior of the posterior under the second scenario clearly suggests otherwise.

Functions with flat regions as in the right panel of Figure 1 routinely appear in many applications; for example, dose-response curves are assumed to be non-decreasing with the possibility that the dose-response relationship is flat over certain regions (Neelon and Dunson, 2004). A similar biased behavior of the posterior for such functions under truncated normal priors was observed by Neelon and Dunson (2004) while using a piecewise linear model, and also by Curtis and Ghosh (2011) under a Bernstein polynomial basis. However, a clear explanation for such behavior, as well as the extent to which it is prevalent, has been missing in the literature, and the mass-shifting phenomenon alluded to above offers a clarification. Under the basis of Maatouk and Bay (2017), a subset of the basis coefficients are required to shrink close to zero to accurately approximate functions with such flat regions. However, the truncated normal posterior pushes mass away from such corner regions, leading to the bias. Importantly, our theory also suggests that the problem would not disappear and would rather get accentuated in the large-sample scenario if one follows the standard practice of scaling up the number of basis functions with increasing sample size, since the mass-shifting gets more pronounced with increasing dimension. To illustrate this point, Figure 2 shows the estimation of the same function as in the right panel of Figure 1, now based on 500 samples and $N = 50$ and $N = 250$ basis functions in the left and right panels respectively. Increasing the number of basis functions indeed results in a noticeable increase in the bias, as clearly seen from the insets which zoom into two disjoint regions of the covariate domain. A similar story holds for the bases of Neelon and Dunson (2004) and Curtis and Ghosh (2011).

Figure 2: Monotone function estimation using the basis of Maatouk and Bay (2017) and a joint truncated normal prior on the coefficients. The red solid curve corresponds to the true function, the blue solid curve is the posterior mean, the region within the two dotted blue curves represents a pointwise 95% credible interval, and the green dots are the observed data points; $N = 50$ (left panel) and $N = 250$ (right panel).

Neelon and Dunson (2004) and Curtis and Ghosh (2011) both used point-mass mixture priors as a remedy, which is a natural choice under a non-decreasing constraint. However, their introduction becomes somewhat cumbersome under the non-negativity constraint in (1.1). As a simple remedy, we suggest introducing a multiplicative scale parameter for each coordinate a priori and further equipping it with a prior mixing distribution which has positive density at the origin and heavy tails; a default candidate is the half-Cauchy density (Carvalho et al., 2010). The resulting prior shrinks much more aggressively towards the origin, and we empirically illustrate its superior performance over the truncated normal prior. This empirical exercise provides further support to our argument.

2 Mass-shifting phenomenon of truncated normal distributions

2.1 Marginal densities of truncated normal distributions

Our main focus is on studying the properties of marginal densities of the truncated normal distributions described in equation (1.1) and quantifying how they behave with increasing dimension. We begin by introducing some notation. We use $\mathcal{N}(\gamma,\Omega)$ to denote the $d$-dimensional normal distribution with mean $\gamma \in \mathbb{R}^d$ and positive definite covariance matrix $\Omega$; we also let $\mathcal{N}(x;\gamma,\Omega)$ denote its density evaluated at $x \in \mathbb{R}^d$. We reserve the notation $\Sigma_d(\rho)$ for the $d \times d$ compound-symmetry correlation matrix with diagonal elements equal to $1$ and off-diagonal elements equal to $\rho \in (0,1)$,

\Sigma_d(\rho) = (1-\rho)\,\mathrm{I}_d + \rho\,\mathbf{1}_d\mathbf{1}_d^{\mathrm{T}},   (2.1)

with $\mathbf{1}_d$ the vector of ones in $\mathbb{R}^d$ and $\mathrm{I}_d$ the $d \times d$ identity matrix.

For a subset $\mathcal{C} \subset \mathbb{R}^N$ with positive Lebesgue measure, let $\mathcal{N}_{\mathcal{C}}(\gamma,\Omega)$ denote a $\mathcal{N}(\gamma,\Omega)$ distribution truncated to $\mathcal{C}$, with density

\widetilde{p}(\theta) = m_{\mathcal{C}}^{-1}\,\mathcal{N}_N(\theta;\gamma,\Omega)\,\mathbbm{1}_{\mathcal{C}}(\theta),   (2.2)

where $m_{\mathcal{C}} = \mathbb{P}(X \in \mathcal{C})$ for $X \sim \mathcal{N}(\gamma,\Omega)$ is the constant of integration and $\mathbbm{1}_{\mathcal{C}}(\cdot)$ the indicator function of the set $\mathcal{C}$. We throughout assume $\mathcal{C}$ to be the positive orthant of $\mathbb{R}^N$ as in equation (1.1), namely $\mathcal{C} = [0,\infty)^N$; a general $\mathcal{C}$ defined by linear inequality constraints can be reduced to rectangular constraints using a linear transformation; see, for example, Section 2 of Botev (2017). The dimension $N$ will typically be evident from the context.

Our investigations were originally motivated by the following observation. Consider $\theta \sim \mathcal{N}_{\mathcal{C}}(\mathbf{0},\Sigma_2(\rho))$ for $\rho \in (0,1)$. Then, the marginal distribution of $\theta_1$ has density proportional to $e^{-\theta_1^2/2}\,\Phi\{\rho\theta_1/(1-\rho^2)^{1/2}\}$ on $(0,\infty)$, where $\Phi$ denotes the $\mathcal{N}(0,1)$ cumulative distribution function. This distribution is readily recognized as a skew-normal density (Azzalini and Valle, 1996) truncated to $(0,\infty)$. Interestingly, the marginal of $\theta_1$ has a strictly positive mode, while the joint distribution of $\theta$ has its mode at $\mathbf{0}$. Cartinhour (1990) noted that the truncated normal family is not closed under marginalization for non-diagonal $\Sigma$, and derived a general formula for the univariate marginal as the product of a univariate normal density with a skewing factor. In Proposition 1 below, we generalize Cartinhour's result to any lower-dimensional marginal density. We write the scale matrix $\Sigma_N$ in block form as $\Sigma_N = [\Sigma_{k,k},\ \Sigma_{k,N-k};\ \Sigma_{N-k,k},\ \Sigma_{N-k,N-k}]$.

Proposition 1.

Suppose $\theta \sim \mathcal{N}_{\mathcal{C}}(\mathbf{0}_N,\Sigma_N)$. The marginal density $\widetilde{p}_{k,N}$ of $\theta^{(k)} = (\theta_1,\ldots,\theta_k)^{\mathrm{T}}$ is

\widetilde{p}_{k,N}(\theta_1,\ldots,\theta_k) = (2\pi)^{-k/2}\,m_{\mathcal{C}}^{-1}\,e^{-\frac{1}{2}{\theta^{(k)}}^{\mathrm{T}}\Sigma_{k,k}^{-1}\theta^{(k)}}\,\mathbb{P}\big(\widetilde{X}_{N-k} \leq \Sigma_{N-k,k}\,\Sigma_{k,k}^{-1}\,\theta^{(k)}\big)\,\prod_{i=1}^{k}\mathbbm{1}_{[0,\infty)}(\theta_i),

where $\widetilde{X}_{N-k} \sim \mathcal{N}(\mathbf{0}_{N-k},\widetilde{\Sigma}^{-1}_{N-k,N-k})$ with $\widetilde{\Sigma}_{N-k,N-k} = (\Sigma_{N-k,N-k} - \Sigma_{N-k,k}\,\Sigma_{k,k}^{-1}\,\Sigma_{k,N-k})^{-1}$, and the $\leq$ symbol is to be interpreted elementwise. Here, the constant $m_{\mathcal{C}} = \mathbb{P}(X \in \mathcal{C})$ for $X \sim \mathcal{N}(\mathbf{0}_N,\Sigma_N)$.

When $k = 1$, Proposition 1 implies

\widetilde{p}_{1,N} \propto e^{-\theta_1^2/(2\Sigma_{1,1})}\,\mathbb{P}\big(\widetilde{X}_{N-1} \leq \Sigma_{N-1,1}\,\theta_1/\Sigma_{1,1}\big)\,\mathbbm{1}_{[0,\infty)}(\theta_1).   (2.3)

Let $\mathcal{S}_N$ denote the set of $N \times N$ covariance matrices whose correlation coefficients are all non-negative. The map $\theta_1 \mapsto e^{-\theta_1^2/(2\Sigma_{1,1})}$ is decreasing and, when $\Sigma_N \in \mathcal{S}_N$, the map $\theta_1 \mapsto \mathbb{P}(\widetilde{X}_{N-1} \leq \Sigma_{N-1,1}\,\theta_1/\Sigma_{1,1})$ is increasing on $(0,\infty)$. Thus, if $\Sigma_N \in \mathcal{S}_N$, $\widetilde{p}_{1,N}$ is unimodal with a strictly positive mode.
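The mass-shifting implied by equation (2.3) is easy to see numerically. The sketch below is a minimal illustration in Python (using numpy and scipy; the function name is ours and not part of any package): it evaluates the unnormalized marginal in (2.3) on a grid, normalizes it, and reports the mass placed near the origin for a compound-symmetry scale matrix of increasing dimension.

```python
import numpy as np
from scipy.stats import multivariate_normal

def marginal_theta1(grid, Sigma):
    """Grid-normalized marginal density of theta_1 under N_C(0, Sigma),
    C = [0, infinity)^N, computed from equation (2.3)."""
    N = Sigma.shape[0]
    s11 = Sigma[0, 0]
    s21 = Sigma[1:, 0]                                   # Cov(theta_{2:N}, theta_1)
    cond = Sigma[1:, 1:] - np.outer(s21, s21) / s11      # Schur complement
    dens = np.array([
        np.exp(-t ** 2 / (2 * s11)) *
        multivariate_normal.cdf(s21 * t / s11, mean=np.zeros(N - 1), cov=cond)
        for t in grid
    ])
    return dens / (dens.sum() * (grid[1] - grid[0]))

grid = np.linspace(0.0, 4.0, 41)
for N in (5, 10, 20):
    Sigma = 0.3 * np.eye(N) + 0.7 * np.ones((N, N))      # compound symmetry, rho = 0.7
    dens = marginal_theta1(grid, Sigma)
    mass_near_zero = dens[grid <= 0.5].sum() * (grid[1] - grid[0])
    print(f"N = {N:3d}:  P(theta_1 <= 0.5) approx {mass_near_zero:.3f}")
```

The probability assigned to the fixed neighborhood $(0,0.5]$ visibly decreases as $N$ grows, anticipating the quantitative statement in Theorem 2 below.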

As another special case, suppose $\Sigma = \Sigma_N(\rho)$ for some $\rho \in (0,1)$ and let $k = N-1$. We then have

\widetilde{p}_{N-1,N} \propto e^{-{\theta^{(N-1)}}^{\mathrm{T}}\,\Sigma_{N-1}^{-1}(\rho)\,\theta^{(N-1)}}\,\Phi\big(a^{\mathrm{T}}\theta^{(N-1)}\big)\,\prod_{i=1}^{N-1}\mathbbm{1}_{[0,\infty)}(\theta_i),

with $a = C_{\rho}\,\mathbf{1}_{N-1}$, so that $a^{\mathrm{T}}\theta^{(N-1)} = C_{\rho}\sum_{i=1}^{N-1}\theta_i$, where $C_{\rho}$ is a positive constant. This density can be recognized as a multivariate skew-normal distribution (Azzalini and Valle, 1996) truncated to the non-negative orthant.

2.2 Mass-shifting phenomenon of marginal densities

While the results in the previous section imply that the marginal distributions shift mass away from the origin, they do not precisely characterize the severity of the phenomenon. In this section, we show that under appropriate conditions, the univariate marginals assign increasingly smaller mass to a fixed neighborhood of the origin with increasing dimension. In other words, the skewing factor noted by Cartinhour begins to dominate when the ambient dimension is large. In addition to the dimension, we also quantify the amount of dependence in $\Sigma_N$ contributing to this mass-shifting. To the best of our knowledge, this has not been observed or quantified in the literature.

We state our results for $\Sigma_N \in \mathcal{B}_{N,K}$, where for $2 \leq K \leq N-1$, $\mathcal{B}_{N,K}$ denotes the space of $K$-banded non-negative correlation matrices,

\mathcal{B}_{N,K} = \Big\{\Sigma_N = (\rho_{ij}) \in \mathcal{S}_N \,:\, \rho_{ii} = 1\ \forall\, i,\ \ \rho_{ij} = 0\ \forall\ |i-j| \geq K\Big\}.   (2.4)

While our main theorem below can be proved for other dependence structures, the banded structure naturally arises in statistical applications as discussed in the next section.

Given $\Sigma_N = (\rho_{ij}) \in \mathcal{B}_{N,K}$, define $\rho_{\max} = \max_{i\neq j}\rho_{ij}$ and $\rho_{\min} = \min_{i\neq j,\,|i-j| < K}\rho_{ij}$ to be the maximum and minimum correlation values within the band. For $\theta \sim \mathcal{N}_{\mathcal{C}}(\mathbf{0},\Sigma_N)$ with $\mathcal{C} = [0,\infty)^N$, let $\alpha_{N,\delta} = \mathbb{P}(\theta_1 \leq \delta)$. With these definitions, we are ready to state our main theorem.

Theorem 2.

Let $\Sigma_N \in \mathcal{B}_{N,K}$ be such that $(\rho_{\min},\rho_{\max}) \in \mathcal{Q}$, where

\mathcal{Q} = \bigg\{(u,v) \in (0,1)^2 : u \leq v,\ \frac{u}{2(1-u)} \geq v\bigg\}.

Then, there exists a constant $K_0$ such that whenever $K \geq K_0$, we have for any $\delta > 0$,

\alpha_{N,\delta} \leq C^{\prime}_{\rho_{\min},\rho_{\max}}\,\delta\,(\log K)^{1/2}\,K^{-G(\rho_{\min},\rho_{\max})},

where $G$ is a positive rational function of $(\rho_{\min},\rho_{\max})$, and $C^{\prime}_{\rho_{\min},\rho_{\max}} > 0$ is a constant free of $N$.

In particular, if we consider a sequence of $K_N$-banded correlation matrices $\Sigma_N \in \mathcal{B}_{N,K_N}$ with $K_N \to \infty$ as $N \to \infty$, then under the conditions of Theorem 2, $\lim_{N\to\infty}\alpha_{N,\delta} = 0$ for any fixed $\delta > 0$. Theorem 2, being non-asymptotic in nature, additionally characterizes the rate of decay of $\alpha_{N,\delta}$. To contrast the conclusion of Theorem 2 with two closely related cases, consider first the case when $\theta \sim \mathcal{N}(\mathbf{0},\Sigma_N)$. For any $N$, the marginal distribution of $\theta_1$ is always $\mathcal{N}(0,1)$, and hence $\alpha_{N,\delta}$ does not depend on $N$. Similarly, if $\theta \sim \mathcal{N}_{\mathcal{C}}(\mathbf{0},\Sigma_N)$ with $\Sigma_N$ a diagonal correlation matrix, then for any $N \geq 1$, the marginal distribution of $\theta_1$ is $\mathcal{N}(0,1)$ truncated to $(0,\infty)$ and $\alpha_{N,\delta}$ again does not depend on $N$. In particular, in both these cases, $\alpha_{N,\delta} \asymp \delta$ for $\delta$ small. However, when a combination of dependence and truncation is present, an additional $(\log K)^{1/2}\,K^{-G(\rho_{\min},\rho_{\max})}$ penalty is incurred.
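The decay of $\alpha_{N,\delta}$ can also be checked by simulation. Below is a minimal sketch (Python; a simple systematic-scan Gibbs sampler written purely for illustration, not the sampler used in our experiments) that draws from $\mathcal{N}_{\mathcal{C}}(\mathbf{0},\Sigma_N(\rho))$, the exchangeable case in which every pair of coordinates is correlated, and estimates $\mathbb{P}(\theta_1 \leq \delta)$ for increasing $N$; the estimate shrinks as $N$ grows, mirroring the message of the theorem.

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_tn_orthant(Sigma, n_iter=3000, burn=500, seed=0):
    """Systematic-scan Gibbs sampler for N_C(0, Sigma), C = [0, inf)^N.
    Each full conditional is a univariate normal truncated to [0, inf)."""
    rng = np.random.default_rng(seed)
    Lam = np.linalg.inv(Sigma)                  # precision matrix
    N = Sigma.shape[0]
    theta = np.ones(N)
    kept = []
    for it in range(n_iter):
        for i in range(N):
            v = 1.0 / Lam[i, i]                                   # conditional variance
            m = -v * (Lam[i] @ theta - Lam[i, i] * theta[i])      # conditional mean
            theta[i] = truncnorm.rvs((0.0 - m) / np.sqrt(v), np.inf,
                                     loc=m, scale=np.sqrt(v), random_state=rng)
        if it >= burn:
            kept.append(theta[0])
    return np.array(kept)

rho, delta = 0.7, 0.5
for N in (5, 25, 50):
    Sigma = (1 - rho) * np.eye(N) + rho * np.ones((N, N))   # Sigma_N(rho), eq. (2.1)
    th1 = gibbs_tn_orthant(Sigma)
    print(f"N = {N:3d}:  P(theta_1 <= {delta}) approx {np.mean(th1 <= delta):.3f}")
```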

Figure 3: The region shaded in black depicts $\mathcal{Q}$ from the statement of Theorem 2.

For the conclusion of Theorem 2 to hold, our current proof technique requires $(\rho_{\min},\rho_{\max})$ to lie in the region $\mathcal{Q}$, which is pictorially represented by the black shaded region in Figure 3. As a special case, if all the non-zero correlations are the same, i.e., $\rho_{\min} = \rho_{\max}$, then the condition simplifies to $\rho_{\min} > 0.5$. More generally, if we write $\rho_{\min} = \kappa\rho_{\max}$ for some $\kappa \in (0,1]$, then the condition reduces to $\rho_{\min} \geq 1 - \kappa/2$.

Remark 1.

For any fixed $N$, the marginal density of $\theta_1$ evaluated at the origin is $\widetilde{p}_{1,N}(0) = \lim_{\delta\to 0}\alpha_{N,\delta}/\delta$. Theorem 2 thus implies in particular that $\lim_{N\to\infty}\widetilde{p}_{1,N}(0) = 0$. Also, for any fixed $1 \leq k \leq N$, if we denote $\beta_{N,k,\delta} = \mathbb{P}(\theta_1\leq\delta,\ldots,\theta_k\leq\delta)$, it is immediate that $\beta_{N,k,\delta} < \alpha_{N,\delta}$, and hence $\lim_{N\to\infty}\beta_{N,k,\delta} = 0$, meaning the probability of a corner region is vanishingly small for large $N$.

We now empirically illustrate the conclusion of the theorem by presenting the univariate marginal density $\widetilde{p}_{1,N}$ for different values of the dimension $N$ and the bandwidth $K$. The density calculations were performed using the R package tmvtnorm, which is based on the numerical approximation algorithm proposed in Cartinhour (1990) and subsequent refinements in Genz (1992, 1993) and Genz and Bretz (2009).

Figure 4: Left panel shows the marginal density $\widetilde{p}_{1,N}$ for $K = 2$ (black), $K = 5$ (red) and $K = 20$ (blue) with $N = 100$. Middle panel shows $\widetilde{p}_{1,N}$ for $N = 10$ (black), $N = 50$ (red) and $N = 100$ (blue) with $K = 5$. Right panel shows $\widetilde{p}_{1,N}$ for $(K,N) = (5,25)$ (black), $(20,100)$ (red) and $(50,250)$ (blue).

The left panel of Figure 4 shows that for $N$ fixed at a moderately large value, the probability assigned to a small neighborhood of the origin decreases with increasing $K$. Also, the mode of the marginal density increasingly shifts away from zero. A similar effect is seen for a fixed $K$ and increasing $N$ in the middle panel, and also for an increasing pair $(K,N)$ in the right panel, although the mass-shifting effect is somewhat weakened compared to the left panel. This behavior perfectly aligns with the main message of the theorem that the interplay between the truncation and the dependence brings forth the mass-shifting phenomenon.

The proof of Theorem 2 is non-trivial; we provide the overall chain of arguments in the next subsection, deferring the proofs of several auxiliary results to the supplemental document. The marginal probability of $(0,\delta)$ is a ratio of the probabilities of two rectangular regions under a $\mathcal{N}(\mathbf{0},\Sigma_N)$ distribution, with the denominator appearing due to the truncation. While there is a rich literature on estimating tail probabilities under correlated multivariate normals using multivariate extensions of the Mills ratio (Savage, 1962; Ruben, 1964; Sidák, 1968; Steck, 1979; Hashorva and Hüsler, 2003; Lu, 2016), the existing bounds are more suited for numerical evaluation (Cartinhour, 1990; Genz, 1992; Genz and Bretz, 2009) and pose analytic difficulties due to their complicated forms. Moreover, the current bounds lose their accuracy when the region boundary is close to the origin (Gasull and Utzet, 2014), which is precisely our object of interest. Our argument instead relies on a novel use of Gaussian comparison inequalities such as Slepian's inequality; see Li and Shao (2001) and Vershynin (2018a) for book-level treatments. We additionally derive a generalization of Slepian's inequality in Lemma 1, which might be of independent interest. As an important reduction step, we introduce a blocking idea to carefully approximate the banded scale matrix $\Sigma_N$ by a block-diagonal matrix to simplify the analysis.

2.3 Proof of Theorem 2

By definition,

\alpha_{N,\delta} = \mathbb{P}(\theta_1\leq\delta) = \frac{\mathbb{P}(0\leq Z_1\leq\delta,\,Z_2\geq 0,\ldots,Z_N\geq 0)}{\mathbb{P}(Z_1\geq 0,\,Z_2\geq 0,\ldots,Z_N\geq 0)},   (2.5)

where $Z \sim \mathcal{N}(\mathbf{0},\Sigma_N)$. We now proceed to separately bound the numerator and denominator in the above display.

We first consider the denominator in equation (2.5), and use Slepian's lemma to bound it from below. It follows from Slepian's inequality (see the comment after Lemma 3 in the supplemental document) that if $X, Y$ are centered $d$-dimensional Gaussian random vectors with $\mathbb{E}(X_i^2) = \mathbb{E}(Y_i^2)$ for all $i$, and $\mathbb{E}(X_iX_j) \leq \mathbb{E}(Y_iY_j)$ for all $i \neq j$, then

\mathbb{P}(X_1\geq 0,\ldots,X_d\geq 0) \leq \mathbb{P}(Y_1\geq 0,\ldots,Y_d\geq 0).   (2.6)
Figure 5: Left panel: example of $\Sigma_N$ with $N = 18$, $K = 3$. Right panel: the corresponding block approximation $\widetilde{\Sigma}_N$.

Slepian's inequality is a prominent example of a Gaussian comparison inequality, originally developed to bound the supremum of Gaussian processes. To apply Slepian's inequality in the present context, we construct another $N$-dimensional centered Gaussian random vector $S \sim \mathcal{N}_N(\mathbf{0},\widetilde{\Sigma}_N)$ such that (i) $S_{[1:K]} \overset{d}{=} Z_{[1:K]}$, $S_{[(K+1):2K]} \overset{d}{=} Z_{[(K+1):2K]}$ and $S_{[(2K+1):N]} \overset{d}{=} Z_{[(2K+1):N]}$, and (ii) the sub-vectors $S_{[1:K]}$, $S_{[(K+1):2K]}$ and $S_{[(2K+1):N]}$ are mutually independent. The correlation matrix $\widetilde{\Sigma}_N$ of $S$ clearly satisfies $(\Sigma_N)_{ij} \geq (\widetilde{\Sigma}_N)_{ij}$ for all $i \neq j$ by construction. Figure 5 pictorially depicts this block approximation in an example with $N = 18$ and $K = 3$. Applying Slepian's inequality, we then have

\mathbb{P}(Z_1\geq 0,\ldots,Z_N\geq 0) \geq \mathbb{P}(S_1\geq 0,\ldots,S_N\geq 0)
= \mathbb{P}(S_{[1:K]}\geq\mathbf{0})\,\mathbb{P}(S_{[(K+1):2K]}\geq\mathbf{0})\,\mathbb{P}(S_{[(2K+1):N]}\geq\mathbf{0})
= \mathbb{P}(Z_{[1:K]}\geq\mathbf{0})\,\mathbb{P}(Z_{[(K+1):2K]}\geq\mathbf{0})\,\mathbb{P}(Z_{[(2K+1):N]}\geq\mathbf{0}).   (2.7)
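The block approximation used above is simple to construct explicitly. The sketch below (Python, illustrative only) builds a $K$-banded correlation matrix, forms the covariance $\widetilde{\Sigma}_N$ of $S$ by zeroing every correlation between the three blocks $[1:K]$, $[(K+1):2K]$ and $[(2K+1):N]$, and checks the entrywise domination $(\widetilde{\Sigma}_N)_{ij} \leq (\Sigma_N)_{ij}$ required for Slepian's inequality.

```python
import numpy as np

def block_approximation(Sigma, K):
    """Covariance of S: keep the three diagonal blocks [0:K], [K:2K], [2K:N]
    of Sigma and zero out every cross-block correlation."""
    N = Sigma.shape[0]
    groups = np.zeros(N, dtype=int)
    groups[K:2 * K] = 1
    groups[2 * K:] = 2
    mask = groups[:, None] == groups[None, :]        # True within a block
    return np.where(mask, Sigma, 0.0)

# Example with N = 18, K = 3: a K-banded correlation matrix with linearly
# decaying correlations inside the band (a valid, positive definite choice).
N, K = 18, 3
idx = np.arange(N)
Sigma = np.maximum(0.0, 1.0 - np.abs(idx[:, None] - idx[None, :]) / K)
Sigma_tilde = block_approximation(Sigma, K)
assert np.all(Sigma_tilde <= Sigma + 1e-12)          # entrywise comparison for Slepian
```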

Next, we consider the numerator in equation (2.5). We have,

\mathbb{P}(0\leq Z_1\leq\delta,\,Z_2\geq 0,\ldots,Z_N\geq 0)
\leq \mathbb{P}\big(0\leq Z_1\leq\delta,\ Z_{[2:K]}\geq\mathbf{0},\ Z_{[(K+1):2K]}\in\mathbb{R}^{K},\ Z_{[(2K+1):N]}\geq\mathbf{0}\big)
= \mathbb{P}\big(0\leq Z_1\leq\delta,\ Z_{[2:K]}\geq\mathbf{0}\big)\,\mathbb{P}\big(Z_{[(2K+1):N]}\geq\mathbf{0}\big).   (2.8)

The last equality crucially uses the fact that $Z_{[1:K]}$ and $Z_{[(2K+1):N]}$ are independent, which is a consequence of $\Sigma_N$ being $K$-banded. Taking the ratio of the bounds in equations (2.8) and (2.7), the term $\mathbb{P}(Z_{[(2K+1):N]}\geq\mathbf{0})$ cancels, so that

\alpha_{N,\delta} \leq \frac{\mathbb{P}(0\leq Z_1\leq\delta,\ Z_{[2:K]}\geq\mathbf{0})}{\mathbb{P}(Z_{[1:K]}\geq\mathbf{0})\,\mathbb{P}(Z_{[(K+1):2K]}\geq\mathbf{0})} = R.   (2.9)

To bound the terms $\mathbb{P}(Z_{[1:K]}\geq\mathbf{0})$ and $\mathbb{P}(Z_{[(K+1):2K]}\geq\mathbf{0})$ in the denominator of $R$, we resort to another round of Slepian's inequality. Let $Z^{\prime\prime} \sim \mathcal{N}(\mathbf{0},\Sigma_K(\rho_{\min}))$, where recall that $\rho_{\min}$ is the minimum non-zero correlation in $\Sigma_N$. Also, recall from equation (2.1) that $\Sigma_K(\rho_{\min})$ denotes the $K\times K$ compound-symmetry correlation matrix with all correlations equal to $\rho_{\min}$. By construction, for any $1\leq i\neq j\leq K$, $\mathbb{E}(Z_iZ_j),\,\mathbb{E}(Z_{K+i}Z_{K+j}) \geq \rho_{\min} = \mathbb{E}(Z^{\prime\prime}_iZ^{\prime\prime}_j)$. Thus, applying Slepian's inequality as in equation (2.6),

\mathbb{P}(Z_{[1:K]}\geq\mathbf{0})\,\mathbb{P}(Z_{[(K+1):2K]}\geq\mathbf{0}) \geq \{\mathbb{P}(Z^{\prime\prime}\geq\mathbf{0})\}^2.

The numerator of equation (2.9) cannot be directly tackled by Slepian’s inequality, and we prove the following comparison inequality in the supplemental document.

Lemma 1.

(Generalized Slepian's inequality) Let $X, Y$ be centered $d$-dimensional Gaussian vectors with $\mathbb{E}X_i^2 = \mathbb{E}Y_i^2$ for all $i$ and $\mathbb{E}(X_iX_j) \leq \mathbb{E}(Y_iY_j)$ for all $i\neq j$. Then for any $0\leq\ell_1<u_1$ and $u_2,\ldots,u_d\in\mathbb{R}$, we have

\mathbb{P}(\ell_1\leq X_1\leq u_1,\ X_2\geq u_2,\ldots,X_d\geq u_d) \leq \mathbb{P}(\ell_1\leq Y_1\leq u_1,\ Y_2\geq u_2,\ldots,Y_d\geq u_d).

Define a random variable $Z^{\prime} \sim \mathcal{N}(\mathbf{0},\Sigma_K(\rho_{\max}))$ and use Lemma 1 to conclude that $\mathbb{P}(0\leq Z_1\leq\delta,\ Z_{[2:K]}\geq\mathbf{0}) \leq \mathbb{P}(0\leq Z^{\prime}_1\leq\delta,\ Z^{\prime}_{[2:K]}\geq\mathbf{0})$.

Substituting these bounds in equation (2.9), we obtain

R \leq R^{\prime} = \frac{\mathbb{P}(0\leq Z^{\prime}_1\leq\delta,\ Z^{\prime}_2\geq 0,\ldots,Z^{\prime}_K\geq 0)}{\{\mathbb{P}(Z^{\prime\prime}_1\geq 0,\ldots,Z^{\prime\prime}_K\geq 0)\}^2}.   (2.10)

The primary reduction achieved by bounding $R$ by $R^{\prime}$ is that we now only need to estimate Gaussian probabilities under a compound-symmetry covariance structure. We prove the following inequalities in the supplemental document that provide these estimates.

Lemma 2.

Let $X \sim \mathcal{N}(\mathbf{0},\Sigma_d(\rho))$ with $\rho\in(0,1)$. Fix $\delta>0$. Define $\bar{\rho} = (1-\rho)/\rho$. Then,

\mathbb{P}(0\leq X_1<\delta,\ X_2\geq 0,\ldots,X_d\geq 0) \leq \delta\,\{2(1-\alpha)\bar{\rho}\log(d-1)\}^{-1/2}\,(d-1)^{-(1-\alpha)/\rho} + \exp(-d^{\alpha}),

for any $\alpha\in(0,1)$. Also,

\mathbb{P}(X_1\geq 0,\ldots,X_d\geq 0) \geq \frac{(2\,\bar{\rho}\log d)^{1/2}}{2\,\bar{\rho}\log d+1}\,d^{-\bar{\rho}}.

A key aspect of the compound-symmetry structure that we exploit is that for $X\sim\mathcal{N}(\mathbf{0},\Sigma_d(\rho))$ with $\rho\in(0,1)$, we can represent $X_i \overset{d}{=} \rho^{1/2}\,w + (1-\rho)^{1/2}\,W_i$, where $w$ and the $W_i$ are independent $\mathcal{N}(0,1)$ variables.
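This one-factor representation reduces orthant probabilities under $\Sigma_d(\rho)$ to one-dimensional integrals: conditionally on $w$ the coordinates are independent, so $\mathbb{P}(X_1\geq 0,\ldots,X_d\geq 0) = \mathbb{E}\,\Phi\{(\rho/(1-\rho))^{1/2}w\}^{d}$. The sketch below (Python, illustrative only) evaluates this integral by quadrature and checks it against a crude Monte Carlo estimate.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def orthant_prob_cs(d, rho):
    """P(X_1 >= 0, ..., X_d >= 0) for X ~ N(0, Sigma_d(rho)), using the
    representation X_i = sqrt(rho) * w + sqrt(1 - rho) * W_i."""
    c = np.sqrt(rho / (1.0 - rho))
    val, _ = quad(lambda w: norm.pdf(w) * norm.cdf(c * w) ** d, -8.0, 8.0)
    return val

def orthant_prob_mc(d, rho, n_sim=100_000, seed=0):
    """Brute-force Monte Carlo check by sampling X directly from Sigma_d(rho)."""
    rng = np.random.default_rng(seed)
    Sigma = (1.0 - rho) * np.eye(d) + rho * np.ones((d, d))
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n_sim)
    return np.mean(np.all(X >= 0.0, axis=1))

for d in (5, 10, 50):
    print(f"d = {d:2d}: quadrature = {orthant_prob_cs(d, 0.7):.4f}, "
          f"Monte Carlo = {orthant_prob_mc(d, 0.7):.4f}")
```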

Using Lemma 2, we can bound, for any $\alpha\in(0,1)$,

R^{\prime} \leq \frac{\delta\,\{2\,\bar{\rho}_{\max}\,(1-\alpha)\log(K-1)\}^{-1/2}\,(K-1)^{-(1-\alpha)/\rho_{\max}} + \exp\{-(K-1)^{\alpha}\}}{2\,\bar{\rho}_{\min}\log K\,(2\,\bar{\rho}_{\min}\log K+1)^{-2}\,K^{-2\,\bar{\rho}_{\min}}}
\leq 2^{(1-\alpha)/\rho_{\max}}\,\delta\,\frac{(2\,\bar{\rho}_{\min}\log K+1)^{2}}{2\,\bar{\rho}_{\min}\,\log K\,\{2\,\bar{\rho}_{\max}\,(1-\alpha)\log(K-1)\}^{1/2}}\,K^{-\{(1-\alpha)/\rho_{\max}-2\,\bar{\rho}_{\min}\}}
\qquad + \exp\{-(K-1)^{\alpha}\}\,K^{2\,\bar{\rho}_{\min}}\,(2\,\bar{\rho}_{\min}\log K+1)^{2}/(2\,\bar{\rho}_{\min}\log K)
\leq C\,\delta\,(\log K)^{1/2}\,K^{-\{(1-\alpha)/\rho_{\max}-2\,\bar{\rho}_{\min}\}} + 4\,\bar{\rho}_{\min}\,\exp\{-(K-1)^{\alpha}\}\,K^{2\,\bar{\rho}_{\min}}\log K,   (2.11)

with $C = 5\bar{\rho}_{\min}/\{(1-\alpha)\bar{\rho}_{\max}\}^{1/2}$. Since $(\rho_{\min},\rho_{\max})\in\mathcal{Q}$, we have $\rho_{\min}/\{2(1-\rho_{\min})\}\geq\rho_{\max}$, or equivalently, $2\bar{\rho}_{\min} < 1/\rho_{\max}$. Thus, we can always find $\alpha>0$ such that $(1-\alpha)/\rho_{\max} - 2\,\bar{\rho}_{\min} > 0$. Fix such an $\alpha$, and substitute in equation (2.11). The proof is now completed by choosing $K_0$ large enough so that for any $K > K_0$, the second term in the last line of (2.11) is smaller than the first; this is possible since the second term decreases exponentially in $K$ while the first does so only polynomially.

3 Connections with Bayesian constrained inference

In this section, we connect the theoretical findings in the previous section to posterior inference in Bayesian constrained regression models. We work under the setup of a usual Gaussian regression model,

y_i = f(x_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0,\sigma^2) \quad (i = 1,\ldots,n),   (3.1)

where we assume $x_i \in [0,1]$ for simplicity. We are interested in the situation where the regression function $f$ is constrained to lie in some space $\mathcal{C}_f$, a subset of the space of all continuous functions on $[0,1]$, determined by linear restrictions on $f$ and possibly its higher-order derivatives. Common examples include bounded, monotone, convex, and concave functions.

As discussed in the introduction, a general approach is to expand $f$ in some basis $\{\phi_j\}$ as $f(\cdot) = \sum_{j=1}^{N}\theta_j\phi_j(\cdot)$ so that the restrictions on $f$ can be posed as linear restrictions on the vector of basis coefficients $\theta\in\mathbb{R}^N$, with the parameter space $\mathcal{C}$ for $\theta$ of the form $\mathcal{C} = \{\theta\in\mathbb{R}^N : A\theta\geq b\}$. For example, when $\mathcal{C}_f$ corresponds to monotone increasing functions, the set $\mathcal{C}$ is of the form $\{\theta_1\leq\theta_2\leq\ldots\leq\theta_N\}$ under the Bernstein polynomial basis (Curtis and Ghosh, 2011) and $[0,\infty)^N$ under the integrated triangular basis of Maatouk and Bay (2017). For the sake of concreteness, we shall henceforth work with $\mathcal{C} = [0,\infty)^N$. Under such a basis representation, the model (3.1) can be expressed as

Y = \Phi\theta + \epsilon, \quad \epsilon\sim\mathcal{N}(\mathbf{0},\mathrm{I}_n), \quad \theta\in\mathcal{C},   (3.2)

where $Y = (y_1,\ldots,y_n)^{\mathrm{T}}$ and $\Phi = \{\phi_j(x_i)\}_{ij}$ is an $n\times N$ basis matrix.

The truncated normal prior $\theta\sim\mathcal{N}_{\mathcal{C}}(\mathbf{0},\Omega_N)$ is conjugate, with posterior $\theta\mid Y\sim\mathcal{N}_{\mathcal{C}}(\mu_N,\Sigma_N)$, where

\mu_N = \Sigma_N\Phi^{\mathrm{T}}Y, \quad \Sigma_N = (\Omega_N^{-1} + \Phi^{\mathrm{T}}\Phi)^{-1}.
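As a minimal sketch of this conjugate update (Python; the helper names are ours and purely illustrative), the posterior parameters are obtained in a single linear-algebra step, after which posterior draws can be generated with, for example, a truncated-normal Gibbs sampler like the one sketched in Section 2, adapted to a non-zero center.

```python
import numpy as np

def posterior_params(Phi, Y, Omega):
    """Conjugate update for model (3.2) with prior theta ~ N_C(0, Omega_N):
    returns (mu_N, Sigma_N) with Sigma_N = (Omega^{-1} + Phi' Phi)^{-1}."""
    Sigma_N = np.linalg.inv(np.linalg.inv(Omega) + Phi.T @ Phi)
    mu_N = Sigma_N @ (Phi.T @ Y)
    return mu_N, Sigma_N

def to_correlation(S):
    """Rescale a covariance matrix to a correlation matrix; useful when
    inspecting the near-banded structure of Sigma_N (cf. Figure 6)."""
    d = 1.0 / np.sqrt(np.diag(S))
    return S * np.outer(d, d)
```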

To motivate their prior choice, Maatouk and Bay (2017) begin with an unconstrained mean-zero Gaussian process prior on $f$, $f\sim\textsc{gp}(0,K)$, with covariance kernel $K$. Since their basis coefficients correspond to evaluations of the function and its derivatives at the grid points (see Appendix A for details), this induces a multivariate zero-mean Gaussian prior $\mathcal{N}(\mathbf{0},\Sigma_N)$ on $\theta$, provided the covariance kernel $K$ of the parent Gaussian process is sufficiently smooth. Having obtained this unconstrained Gaussian prior on $\theta$, Maatouk and Bay (2017) multiply it with the indicator function $\mathbbm{1}_{\mathcal{C}}(\theta)$ of the truncation region to obtain the truncated normal prior.

We are now in a position to connect the posterior bias in Figures 1 and 2 to the mass-shifting phenomenon characterized in the previous section. Since the posterior is $\theta\mid Y\sim\mathcal{N}_{\mathcal{C}}(\mu_N,\Sigma_N)$, a draw from the posterior can be represented as

\theta = \mu_N + \theta_c, \quad \theta_c\sim\mathcal{N}_{\mathcal{C}}(\mathbf{0},\Sigma_N).

Consider an extreme scenario where the true function is entirely flat. In this case, the optimal parameter value is $\theta_0 = \mathbf{0}_N$ and, under mild assumptions, $\mu_N$ is concentrated near the origin with high probability under the true data distribution; see §E of the supplementary material. The mass-shifting phenomenon pushes $\theta_c$ away from the origin, resulting in the bias. On the other hand, when the true function is strictly monotone as in the left panel of Figure 1, all the entries of $\mu_N$ are bounded away from zero, which masks the effect of the shift in $\theta_c$.

In strict technical terms, our theory is not directly applicable to $\theta_c$ since the scale matrix $\Sigma_N$ is a dense matrix in general. However, we show below that $\Sigma_N$ is approximately banded under mild conditions. Figure 6 shows image plots of $\Sigma_N$ for three choices of $N$ using the basis of Maatouk and Bay (2017) and sample size $n = 500$. In all cases, $\Sigma_N$ is seen to have a near-banded structure.

Figure 6: Scaled posterior scale matrix $\widetilde{\Sigma}_N$ of dimension $N = 50$ (left), $N = 250$ (middle) and $N = 500$ (right).

We make this empirical observation concrete below. We first state the assumptions on the basis matrix $\Phi$ and the prior covariance matrix $\Omega_N$ that allow the construction of a strictly banded matrix approximation.

Assumption 1.

We assume the basis matrix $\Phi$ is such that $\Phi^{\mathrm{T}}\Phi$ is $q$-banded for some $2\leq q\leq N$; also, there exist constants $0 < C_1 < C_2 < \infty$ such that

C_1\,(n/N)\,\mathrm{I}_N \leq \Phi^{\mathrm{T}}\Phi \leq C_2\,(n/N)\,\mathrm{I}_N.

One example of a basis satisfying Assumption 1 is the B-spline basis of fixed order $q$, denoted $B_{N,q}(x)$, with $N = J + q$ basis functions defined over $J > 0$ quasi-uniform knots; see, for example, Yoo et al. (2016).

Regarding the prior covariance matrix, we first define a uniform class of symmetric, positive definite, well-conditioned matrices (Bickel et al., 2008b) as

\mathcal{M}(\lambda_0,\alpha,k) = \Big\{\Omega_N \,:\, \max_j\sum_i\{|(\Omega_N)_{ij}| \,:\, |i-j| > k\}\leq C\,k^{-\alpha}\ \text{for all}\ k > 0,
\text{and}\ 0 < \lambda_0\leq\lambda_{\min}(\Omega_N)\leq\lambda_{\max}(\Omega_N)\leq 1/\lambda_0\Big\}.   (3.3)

Assumption 2.

We assume the prior covariance matrix $\Omega_N \in \mathcal{M}(\lambda_0,\alpha,k)$ defined in (3.3).

Assumption 2 ensures the covariance matrix is “approximately bandable”, a condition common in covariance matrix estimation with thresholding techniques (Bickel et al., 2008a; Bickel et al., 2008b).

Under the above assumptions, we now show that the posterior scale matrix $\Sigma_N$ can be approximated by a banded, symmetric, positive definite matrix.

Proposition 3.

Consider the posterior scale matrix $\Sigma_N = (\Omega_N^{-1} + \Phi^{\mathrm{T}}\Phi)^{-1}$ with $\Phi$ satisfying Assumption 1 and $\Omega_N$ satisfying Assumption 2. For sufficiently small $0 < \epsilon < 1/\lambda_0$, there exists $r\succsim\log(1/\epsilon)$ such that, for sufficiently large $n_0$, we can always find a $\max(n_0^2 r,\,n_0 q)$-banded, symmetric and positive definite matrix $\Sigma^{\prime}_N$ satisfying

\|\Sigma_N - \Sigma^{\prime}_N\| \precsim \delta_{\epsilon,\kappa},   (3.4)

where $\delta_{\epsilon,\kappa} = (\epsilon + \kappa^{n_0+1})\max\{(N/n),(N/n)^2\}$ and $0 < \kappa < 1$ is a fixed constant.

Proposition 3 states that under mild conditions we can always construct a banded positive definite matrix that approximates $\Sigma_N$ in the operator norm. Applying Theorem 2 to a truncated normal distribution whose scale matrix is this banded approximation, the marginal densities exhibit the mass-shifting behavior. If the bandwidth $K$ is controlled so that the approximation is close enough to the posterior scale matrix, the marginal posterior distribution is expected to behave similarly: it shifts its probability mass away from the origin, and the probability of the “corner region” decreases to zero as the dimension $N$ grows. This helps explain the bias in the posterior mean over the flat region seen in Figure 1.

4 A de-biasing remedy based on a shrinkage prior

As concluded in the previous sections, we view the mass-shifting behavior of the posterior marginals as the cause of the bias in posterior estimation for flat functions. In this section, we provide empirical evidence that a simple modification of the truncated normal prior can alleviate the issues related to the mass-shifting phenomenon. Among remedies proposed in the literature, Curtis and Ghosh (2011) proposed independent shrinkage priors on the parameter vector $\{u_k\}$, given by a mixture of a point mass at zero and a univariate normal distribution truncated to the positive real line, as an alternative to the truncated normal prior,

u_k\sim(1-\pi)\,\delta_0 + \pi\,\mathcal{N}_{+}(\mu,\sigma^2).

Similar mixture priors were also previously used by Neelon and Dunson (2004) and Dunson (2005). The mass at zero assigns positive prior probability to functions having exactly flat regions. Although possible in principle, the introduction of such point masses while retaining the dependence structure between the coefficients becomes somewhat cumbersome, in addition to being computationally burdensome. With this motivation, and the additional consideration that in most real scenarios a function is only approximately flat in certain regions, we propose a shrinkage procedure as a remedy, replacing the coefficients $\theta\in\mathcal{C}$ by $\xi = (\xi_1,\ldots,\xi_N)^{\mathrm{T}}$, where

\xi_j = \tau\,\lambda_j\,\theta_j \quad (j = 1,\ldots,N).   (4.1)

The parameter $\tau$ provides global shrinkage towards the origin while the $\lambda_j$ provide coefficient-specific deviations. We consider default half-Cauchy priors $\mathcal{C}_{+}(0,1)$ (Carvalho et al., 2010) on $\tau$ and the $\lambda_j$ independently. The $\mathcal{C}_{+}(0,1)$ distribution has density proportional to $(1+t^2)^{-1}\mathbbm{1}_{(0,\infty)}(t)$. We continue to use a dependent truncated normal prior $\theta\sim\mathcal{N}_{\mathcal{C}}(\mathbf{0}_N,\Sigma_N)$, which in turn induces dependence among the $\xi_j$. Our prior on $\xi$ can thus be considered a dependent extension of the global-local shrinkage priors (Carvalho et al., 2010) widely used in the high-dimensional regression context. Figure 9 in §D.1 of the supplementary material shows prior draws of the first and third components of both $\theta$ and $\xi$, based on which the marginal distribution of the $\xi_j$ is clearly seen to place more mass near the origin while retaining heavy tails.
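The effect of the extra scales in (4.1) on the prior mass near the origin is easy to see by simulation. The sketch below (Python, illustrative only; a simple rejection step is used to draw the low-dimensional truncated normal) compares the prior probability of a small neighborhood of zero under $\theta$ and under $\xi$, in the spirit of Figure 9.

```python
import numpy as np
from scipy.stats import halfcauchy

rng = np.random.default_rng(0)
N, rho, n_draws = 3, 0.7, 50_000
Sigma = (1 - rho) * np.eye(N) + rho * np.ones((N, N))

# Rejection sampling from N_C(0, Sigma), C = [0, inf)^N (fine for small N).
theta = []
while len(theta) < n_draws:
    Z = rng.multivariate_normal(np.zeros(N), Sigma, size=4 * n_draws)
    theta.extend(Z[np.all(Z >= 0, axis=1)])
theta = np.array(theta[:n_draws])

# Global-local shrinkage layer of (4.1): xi_j = tau * lambda_j * theta_j.
tau = halfcauchy.rvs(size=(n_draws, 1), random_state=rng)
lam = halfcauchy.rvs(size=(n_draws, N), random_state=rng)
xi = tau * lam * theta

print("P(theta_1 <= 0.1) under the truncated normal prior:", np.mean(theta[:, 0] <= 0.1))
print("P(xi_1    <= 0.1) under the shrinkage prior       :", np.mean(xi[:, 0] <= 0.1))
```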

We provide an illustration of the proposed shrinkage procedure in the context of estimating monotone functions as described in (A.1). The procedure can be readily adapted to include various other constraints. Replacing $\theta$ by $\xi$ in (M) in (A.1), we can write (3.1) in vector notation as

Y = \xi_0\mathbf{1}_n + \tau\,\Psi\Lambda\theta + \varepsilon, \quad \varepsilon\sim\mathcal{N}_n(\mathbf{0}_n,\sigma^2\mathrm{I}_n).   (4.2)

Here, $\Psi$ is an $n\times N$ basis matrix with $i$th row $\Psi_i^{\mathrm{T}}$, where $\Psi_{ij} = \psi_{j-1}(x_i)$ for $j = 1,\ldots,N$ and the basis functions $\psi_j$ are as in (A.1). Also, $Y = (y_1,\ldots,y_n)^{\mathrm{T}}$, $\Lambda = \mathrm{diag}(\lambda_1,\ldots,\lambda_N)$ and $\varepsilon = (\epsilon_1,\ldots,\epsilon_n)^{\mathrm{T}}$.

The model is parametrized by $\xi_0\in\mathbb{R}$, $\theta = (\theta_1,\ldots,\theta_N)^{\mathrm{T}}\in\mathcal{C}$, $\lambda = (\lambda_1,\ldots,\lambda_N)^{\mathrm{T}}\in\mathcal{C}$, $\sigma\in\mathbb{R}^{+}$ and $\tau\in\mathbb{R}^{+}$. We place a flat prior $\pi(\xi_0)\propto 1$ on $\xi_0$. We place a truncated normal prior $\mathcal{N}_{\mathcal{C}}(\mathbf{0}_N,\Sigma_N)$ on $\theta$, independently of $\xi_0$, $\tau$ and $\lambda$, with

\Sigma_N = (\Sigma_{jj^{\prime}}), \quad \Sigma_{jj^{\prime}} = k(u_j - u_{j^{\prime}}), \quad u_j = j/(N-1) \quad (j = 0,1,\ldots,N-1),

and $k(\cdot)$ is the stationary Matérn kernel with smoothness parameter $\nu > 0$ and length-scale parameter $\ell > 0$. To complete the prior specification, we place the improper prior $\pi(\sigma^2)\propto 1/\sigma^2$ on $\sigma^2$ and compactly supported priors $\nu\sim\mathcal{U}(0.5,1)$ and $\ell\sim\mathcal{U}(0.1,1)$ on $\nu$ and $\ell$. We develop a data-augmentation Gibbs sampler which, combined with the embedding technique of Ray et al. (2019), results in an efficient MCMC algorithm to sample from the joint posterior of $(\xi_0,\theta,\lambda,\sigma^2,\tau^2,\nu,\ell)$; the details are deferred to §D.2 of the supplementary material.

We conduct a small-scale simulation study to illustrate the efficacy of the proposed shrinkage procedure. We consider model (3.1) with true $\sigma = 0.5$ and two different choices of the true $f$, namely,

f_1(x) = (5x-3)^3\,\mathbbm{1}_{[0.6,1]}(x), \quad f_2(x) = \sqrt{2}\,\sum_{l=1}^{100} l^{-1.7}\,\sin(l)\,\cos\big(\pi(l-0.5)(1-x)\big)

for $x\in[0,1]$. The function $f_1$, which is non-decreasing and flat between $0$ and $0.6$, was used as the motivating example in the introduction. The function $f_2$ is also approximately flat between $0.7$ and $1$, although it is not strictly non-decreasing in this region, which allows us to evaluate the performance under slight model misspecification.

To showcase the improvement due to the shrinkage, we consider a cascading sequence of priors, beginning with only a truncated normal prior and gradually adding more structure to eventually arrive at the proposed shrinkage prior. Specifically, the variants considered are:
No shrinkage and fixed hyperparameters: Here, we set $\Lambda = \mathrm{I}_N$ and $\tau = 1$ in (4.2), and also fix $\nu$ and $\ell$, so that we have a truncated normal prior on the coefficients. This was implemented as part of the motivating examples in the introduction. We fix $\nu = 0.75$ and choose $\ell$ so that the correlation $k(1)$ between the maximally separated points in the covariate domain equals $0.05$.
No shrinkage with hyperparameter updates: The only difference from the previous case is that $\nu$ and $\ell$ are both assigned the priors described previously and updated within the MCMC algorithm.
Global shrinkage: We continue with $\Lambda = \mathrm{I}_N$ and place a half-Cauchy prior on the global shrinkage parameter $\tau$. The hyperparameters $\nu$ and $\ell$ are updated.
Global-local shrinkage: This is the proposed procedure, where the $\lambda_j$ are also assigned half-Cauchy priors and the hyperparameters are updated.

We generate $500$ pairs of responses and covariates, and randomly divide the data into $300$ training samples and $200$ test samples. For all of the variants above, we set the number of knots $N = 150$. We provide plots of the function fits along with pointwise $95\%$ credible intervals in Figures 7 and 8 respectively, and also report the mean squared prediction error (mspe) at the bottom of the sub-plots. As expected, using only the truncated normal prior leads to a large bias in the flat region. Adding some global structure to the truncated normal prior, for instance by updating the gp hyperparameters and adding a global shrinkage term, improves estimation around the flat region, but still lacks the flexibility to transition from the flat region to the strictly increasing region. The global-local shrinkage performs the best, both visually and in terms of mspe.

Additionally, the shrinkage procedure performed at least as well as bsar, a recent state-of-the-art method developed by Lenk and Choi (2017) and implemented in the R package bsamGP. For the out-of-sample prediction performance of bsar, refer to Figure 10 in §D.3 of the supplementary material, based on which it is clear that the performance of the global-local shrinkage procedure is comparable with that of bsar. It is important to point out that bsar is also a shrinkage-based method that allows for exact zeros in the coefficients of a transformed Gaussian process prior through a spike-and-slab specification.

Figure 7: Out-of-sample prediction accuracy for $f_1$ using the four variants. The red solid curve corresponds to the true function, the black solid curve is the mean prediction, the region within the two dotted blue curves represents a pointwise 95% prediction interval, and the green dots are the $200$ test data points. MSPE values corresponding to each method are also shown in the plots.
Figure 8: Same as Figure 7, now for the function $f_2$.

5 Discussion

A seemingly natural way to define a prior distribution on a constrained parameter space is to consider the restriction of a standard unrestricted prior to the constrained space. The conjugacy properties of the unrestricted prior typically carry over to the restricted case, facilitating computation. Moreover, reference priors on constrained parameters are typically the unconstrained reference prior multiplied by the indicator of the constrained parameter space (Sun and Berger, 1998). Despite these various attractive properties, the findings of this article pose a caveat towards routine truncation of priors in moderate to high-dimensional parameter spaces, which might lead to biased inference. This issue gets increasingly severe with increasing dimension due to the concentration of measure phenomenon (Talagrand, 1995; Boucheron et al., 2013), which forces the prior to increasingly concentrate away from statistically relevant portions of the parameter space. A somewhat related issue with certain high-dimensional shrinkage priors has been noted in Bhattacharya et al. (2016). Overall, our results suggest a careful study of the geometry of truncated priors as a useful practice. Understanding the cause of the biased behavior also suggests natural shrinkage procedures that can guard against such unintended consequences. We note that post-processing approaches based on projection (Lin and Dunson, 2014) and constraint relaxation (Duan et al., 2020) do not suffer from this unintended bias. The same is also true for the recently proposed monotone bart (Bayesian additive regression trees; Chipman et al., 2010) method.

It would be interesting to explore the presence of similar issues arising from truncations beyond the constrained regression setting. Possible examples include correlation matrix estimation and simultaneous quantile regression. Priors on correlation matrices are often prescribed in terms of constrained priors on covariance matrices, and truncated normal priors are used to maintain ordering between quantile functions corresponding to different quantiles, and this might leave the door open for unintended bias to creep in.

Appendix

Appendix A Basis representation of Maatouk and Bay (2017)

As the example that motivates the main results of this paper, we consider the more recent basis sequence of Maatouk and Bay (2017). Let $u_j = j/(N-1)$, $j = 0,1,\ldots,N-1$, be equally spaced points on $[0,1]$, with spacing $\delta_N = 1/(N-1)$. Let

h_j(x) = h\Big(\frac{x - u_j}{\delta_N}\Big), \quad \psi_j(x) = \int_0^x h_j(t)\,dt, \quad \phi_j(x) = \int_0^x\int_0^t h_j(u)\,du\,dt,

for $j = 0,1,\ldots,N-1$, where $h(x) = (1-|x|)\,\mathbbm{1}_{[-1,1]}(x)$ is the “hat function” on $[-1,1]$. For any continuous function $f : [0,1]\to\mathbb{R}$, the function $\widetilde{f}(\cdot) = \sum_{j=0}^{N-1} f(u_j)\,h_j(\cdot)$ approximates $f$ by linearly interpolating between the function values at the knots $\{u_j\}$, with the quality of the approximation improving with increasing $N$. With no additional smoothness assumption, this suggests a model for $f$ as $f(\cdot) = \sum_{j=0}^{N-1}\theta_{j+1} h_j(\cdot)$.

The bases $\{\psi_j\}$ and $\{\phi_j\}$ take advantage of higher-order smoothness. If $f$ is once or twice continuously differentiable, respectively, then by the fundamental theorem of calculus,

f(x) - f(0) = \int_0^x f^{\prime}(t)\,dt, \quad f(x) - f(0) - x f^{\prime}(0) = \int_0^x\int_0^t f^{\prime\prime}(s)\,ds\,dt.

Expanding $f^{\prime}$ and $f^{\prime\prime}$ in the interpolation basis as in the previous paragraph yields, respectively, the models

f(x) = \theta_0 + \sum_{j=0}^{N-1}\theta_{j+1}\psi_j(x) \quad (M), \qquad f(x) = \theta_0 + \theta^{\ast}x + \sum_{j=0}^{N-1}\theta_{j+1}\phi_j(x) \quad (C).   (A.1)

Under the above, the coefficients have a natural interpretation as evaluations of the function or its derivatives at the grid points. For example, under (M), $f^{\prime}(u_j) = \theta_{j+1}$ for $j = 0,1,\ldots,N-1$, while under (C), $f^{\prime\prime}(u_j) = \theta_{j+1}$ for $j = 0,1,\ldots,N-1$.

Maatouk and Bay (2017) showed that under the representation (M) in (A.1), $f$ is monotone non-decreasing if and only if $\theta_i\geq 0$ for all $i = 1,\ldots,N$. Similarly, under (C), $f$ is convex non-decreasing if and only if $\theta_i\geq 0$ for all $i = 1,\ldots,N$. The ability to equivalently express various constraints in terms of linear restrictions on the vector $\theta = (\theta_1,\ldots,\theta_N)^{\mathrm{T}}$ is an attractive feature of this basis not necessarily shared by other bases.
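As a concrete illustration of the representation (M), the sketch below (Python; the numerical integration scheme is ours and purely illustrative) constructs the hat functions $h_j$ and their integrals $\psi_j$ on a fine grid and verifies that any non-negative coefficient vector yields a non-decreasing fit.

```python
import numpy as np

def hat_basis(x, N):
    """h_j(x) = h((x - u_j) / delta_N), j = 0, ..., N-1, with h the hat function."""
    u = np.linspace(0.0, 1.0, N)                         # knots u_j = j / (N - 1)
    delta = 1.0 / (N - 1)
    return np.maximum(0.0, 1.0 - np.abs((x[:, None] - u[None, :]) / delta))

def integrated_hat_basis(x, N):
    """psi_j(x) = int_0^x h_j(t) dt, via the trapezoidal rule on a fine grid."""
    grid = np.linspace(0.0, 1.0, 2001)
    H = hat_basis(grid, N)                               # shape (len(grid), N)
    step = grid[1] - grid[0]
    cum = np.concatenate([np.zeros((1, N)),
                          np.cumsum(0.5 * (H[1:] + H[:-1]) * step, axis=0)])
    # interpolate each psi_j back onto the requested x values
    return np.column_stack([np.interp(x, grid, cum[:, j]) for j in range(N)])

x = np.linspace(0.0, 1.0, 200)
N = 10
rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 1.0, size=N)                    # non-negative coefficients
f = 0.3 + integrated_hat_basis(x, N) @ theta             # model (M) with theta_0 = 0.3
assert np.all(np.diff(f) >= -1e-10)                      # monotone non-decreasing
```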

In either case, the parameter space $\mathcal{C}$ for $\theta$ is the non-negative orthant $[0,\infty)^N$. If $f$ were unrestricted, a gp prior on $f$ would induce a dependent Gaussian prior on $\theta$. The approach of Maatouk and Bay (2017) is to restrict this dependent prior subject to the linear restrictions, resulting in a truncated normal prior.

Supplementary Material

In this supplementary document, we first collect all remaining technical proofs in the first two sections. §D provides additional details on prior illustration, posterior computation, and posterior performance. In §E we formulate the concentration property of the posterior center $\mu_N$. Several auxiliary results used in the proofs are listed in §F.

Appendix B Proofs of auxiliary results in the proof of Theorem 2

In this section, we provide the proofs of Lemma 1 and Lemma 2, which were used to prove Theorem 2 in the main manuscript. For any $d$-dimensional vector $a = [a_1,\ldots,a_d]^{\mathrm{T}}$ we denote its sub-vector $a_{[i_1:i_2]} = [a_{i_1},\ldots,a_{i_2}]^{\mathrm{T}}$ for any $1\leq i_1 < i_2\leq d$. For two vectors $a$ and $b$ of the same length, let $a\geq b$ ($a\leq b$) denote the event $a_i\geq b_i$ ($a_i\leq b_i$) for all $i$. For two random variables $X$ and $Y$, we write $X\overset{d}{=}Y$ if $X$ and $Y$ are identical in distribution.

B.1 Proof of Lemma 1

For random vectors $X\sim\mathcal{N}(\mathbf{0},\Sigma_X)$ and $Y\sim\mathcal{N}(\mathbf{0},\Sigma_Y)$, to show $\mathbb{P}(\ell_1\leq X_1\leq u_1,\,X_2\geq u_2,\ldots,X_d\geq u_d)\leq\mathbb{P}(\ell_1\leq Y_1\leq u_1,\,Y_2\geq u_2,\ldots,Y_d\geq u_d)$, it suffices to show

\mathbb{P}(Y_1\geq u_1,\,Y_2\geq u_2,\ldots,Y_d\geq u_d) - \mathbb{P}(X_1\geq u_1,\,X_2\geq u_2,\ldots,X_d\geq u_d)   (B.1)
\leq \mathbb{P}(Y_1\geq\ell_1,\,Y_2\geq u_2,\ldots,Y_d\geq u_d) - \mathbb{P}(X_1\geq\ell_1,\,X_2\geq u_2,\ldots,X_d\geq u_d).

We define $d$-dimensional indicator functions $G(x) = \mathbbm{1}_{[u_1,\infty)}(x_1)\prod_{j=2}^{d}\mathbbm{1}_{(u_j,\infty)}(x_j)$ and $F(x) = \mathbbm{1}_{[\ell_1,\infty)}(x_1)\prod_{j=2}^{d}\mathbbm{1}_{(u_j,\infty)}(x_j)$; it is then equivalent to show

\mathbb{E}\{G(Y)\} - \mathbb{E}\{G(X)\} \leq \mathbb{E}\{F(Y)\} - \mathbb{E}\{F(X)\}.   (B.2)

We now construct non-decreasing approximating functions of G,FG,F with continuous second order derivatives respectively. Let νC2()\nu\in C^{2}({\mathbb{R}}) be a non-decreasing twice differentiable function with ν(t)=0\nu(t)=0 for t0t\leq 0, ν(t)[0,1]\nu(t)\in[0,1] for t[0,1]t\in[0,1], and ν(t)=1\nu(t)=1 for t1t\geq 1. Also, choose ν\nu so that ν<C\|\nu^{\prime}\|_{\infty}<C for some universal constant C>0C>0. For η>0\eta>0, we define mη(x)=ν(ηx)m_{\eta}(x)=\nu(\eta x). It is clear that mη(x)m_{\eta}(x) approximates 𝟙[0,)(x)\mathbbm{1}_{[0,\infty)}(x) for large η\eta. In fact, for any x0x\neq 0, limηmη(x)=𝟙[0,)(x)\lim_{\eta\to\infty}m_{\eta}(x)=\mathbbm{1}_{[0,\infty)}(x).

Given the above, let gjη(xj)=ν{η(xjuj)}g^{\eta}_{j}(x_{j})=\nu\{\eta(x_{j}-u_{j})\} for j=1,,dj=1,\dots,d, and f1η(x1)=ν{η(x11)}f^{\eta}_{1}(x_{1})=\nu\{\eta(x_{1}-\ell_{1})\}, fjη(xj)=ν{η(xjuj)}f^{\eta}_{j}(x_{j})=\nu\{\eta(x_{j}-u_{j})\} for j=2,,dj=2,\dots,d. Define

gη(x)=Πj=1dgjη(xj)andfη(x)=Πj=1dfjη(xj).\displaystyle g^{\eta}(x)=\Pi_{j=1}^{d}g^{\eta}_{j}(x_{j})\quad\mbox{and}\quad f^{\eta}(x)=\Pi_{j=1}^{d}f^{\eta}_{j}(x_{j}).

It then follows that gηg^{\eta} and fηf^{\eta} provide increasingly better approximations of GG and FF as η\eta\to\infty. It thus suffices to show

𝔼{gη(Y)}𝔼{gη(X)}𝔼{fη(Y)}𝔼{fη(X)},\displaystyle\mathbb{E}\{g^{\eta}(Y)\}-\mathbb{E}\{g^{\eta}(X)\}\leq\mathbb{E}\{f^{\eta}(Y)\}-\mathbb{E}\{f^{\eta}(X)\}, (B.3)

for sufficiently large η>0\eta>0 to be chosen later. We henceforth drop the superscript η\eta from gg and ff for notational brevity.

We proceed to utilize an interpolation technique commonly used to prove comparison inequalities (see Chapter 7 of Vershynin, 2018b ). We construct a sequence of interpolating random variables based on the independent random variables X,YX,Y:

St=(1t2)1/2X+tY,t[0,1].\displaystyle S_{t}=(1-t^{2})^{1/2}X+tY,\quad t\in[0,1].

Specifically, we have S0=XS_{0}=X, S1=YS_{1}=Y, and for any t[0,1]t\in[0,1], St𝒩(𝟎,Σ~t)S_{t}\sim\mathcal{N}({\bf 0},\widetilde{\Sigma}_{t}) where Σ~t=(1t2)ΣX+t2ΣY\widetilde{\Sigma}_{t}=(1-t^{2})\Sigma_{X}+t^{2}\Sigma_{Y}. For any twice differentiable function hh, we have the following identity

𝔼{h(Y)}𝔼{h(X)}=01ddt𝔼{h(St)}𝑑t.\displaystyle\mathbb{E}\{h(Y)\}-\mathbb{E}\{h(X)\}=\int_{0}^{1}\frac{d}{dt}\mathbb{E}\{h(S_{t})\}\,dt. (B.4)

Applying a multivariate version of Stein’s lemma (Lemma 7.2.7 in Vershynin, 2018b ) to the integrand in (B.4), one obtains

ddt𝔼{h(St)}\displaystyle\frac{d}{dt}\mathbb{E}\{h(S_{t})\} =ti,j=1d𝔼[{𝔼(YiYj)𝔼(XiXj)}2hxixj(St)].\displaystyle=t\sum_{i,j=1}^{d}\mathbb{E}\bigg{[}\{\mathbb{E}(Y_{i}Y_{j})-\mathbb{E}(X_{i}X_{j})\}\frac{\partial^{2}h}{\partial x_{i}\partial x_{j}}(S_{t})\bigg{]}. (B.5)

To show (B.3), we define the difference Δ=[𝔼{f(Y)}𝔼{f(X)}][𝔼{g(Y)}𝔼{g(X)}]\Delta=[\mathbb{E}\{f(Y)\}-\mathbb{E}\{f(X)\}]-[\mathbb{E}\{g(Y)\}-\mathbb{E}\{g(X)\}]. We further decompose Δ\Delta as

Δ\displaystyle\Delta =[𝔼{f(Y)}𝔼{f(X)}][𝔼{g(Y)}𝔼{g(X)}]\displaystyle=[\mathbb{E}\{f(Y)\}-\mathbb{E}\{f(X)\}]-[\mathbb{E}\{g(Y)\}-\mathbb{E}\{g(X)\}]
=01𝑑t{ddt𝔼{f(St)}ddt𝔼{g(St)}}\displaystyle=\int_{0}^{1}dt\,\bigg{\{}\frac{d}{dt}\mathbb{E}\{f(S_{t})\}-\frac{d}{dt}\mathbb{E}\{g(S_{t})\}\bigg{\}}
=01𝑑t{ti,j=1d𝔼[{𝔼(YiYj)𝔼(XiXj)}(2fxixj(St)2gxixj(St))]}\displaystyle=\int_{0}^{1}dt\,\bigg{\{}t\,\sum_{i,j=1}^{d}\mathbb{E}\bigg{[}\,\{\mathbb{E}(Y_{i}Y_{j})-\mathbb{E}(X_{i}X_{j})\}\bigg{(}\frac{\partial^{2}f}{\partial x_{i}\partial x_{j}}(S_{t})-\frac{\partial^{2}g}{\partial x_{i}\partial x_{j}}(S_{t})\bigg{)}\bigg{]}\bigg{\}}
=201𝑑t{tj=2d𝔼[{𝔼(Y1Yj)𝔼(X1Xj)}(2fx1xj(St)2gx1xj(St))]}\displaystyle=2\int_{0}^{1}dt\,\bigg{\{}t\,\sum_{j=2}^{d}\mathbb{E}\bigg{[}\,\{\mathbb{E}(Y_{1}Y_{j})-\mathbb{E}(X_{1}X_{j})\}\bigg{(}\frac{\partial^{2}f}{\partial x_{1}\partial x_{j}}(S_{t})-\frac{\partial^{2}g}{\partial x_{1}\partial x_{j}}(S_{t})\bigg{)}\bigg{]}\bigg{\}}
+01𝑑t{ti,j=2d𝔼[{𝔼(YiYj)𝔼(XiXj)}(2fxixj(St)2gxixj(St))]}\displaystyle~{}~{}~{}+\int_{0}^{1}dt\,\bigg{\{}t\,\sum_{i,j=2}^{d}\mathbb{E}\bigg{[}\,\{\mathbb{E}(Y_{i}Y_{j})-\mathbb{E}(X_{i}X_{j})\}\bigg{(}\frac{\partial^{2}f}{\partial x_{i}\partial x_{j}}(S_{t})-\frac{\partial^{2}g}{\partial x_{i}\partial x_{j}}(S_{t})\bigg{)}\bigg{]}\bigg{\}}
=Δ1+Δ2.\displaystyle=\Delta_{1}+\Delta_{2}.

The second equality follows from (B.4) and the third equality follows from (B.5). First we show Δ10\Delta_{1}\geq 0. Since 𝔼(Y1Yj)𝔼(X1Xj)\mathbb{E}(Y_{1}Y_{j})\geq\mathbb{E}(X_{1}X_{j}) for all j>1j>1, it suffices to show that for any fixed t[0,1]t\in[0,1] and for any j=2,,dj=2,\dots,d,

D1=𝔼(2fx1xj(St)2gx1xj(St))0.\displaystyle D_{1}=\mathbb{E}\bigg{(}\frac{\partial^{2}f}{\partial x_{1}\partial x_{j}}(S_{t})-\frac{\partial^{2}g}{\partial x_{1}\partial x_{j}}(S_{t})\bigg{)}\geq 0.

We consider a generic interpolating random variable S𝒩(𝟎,Σ~)S\sim\mathcal{N}\big{(}\mathbf{0},\widetilde{\Sigma}\big{)} by dropping the tt-subscript; let ϕ(s1,,sd)\phi(s_{1},\dots,s_{d}) denote its probability density function. Then we have

D1\displaystyle D_{1} ={f1(s1)fj(sj)g1(s1)gj(sj)}Πl1,jfl(sl)ϕ(s1,,sd)𝑑s1𝑑sd\displaystyle=\int_{-\infty}^{\infty}\dots\int_{-\infty}^{\infty}\{f^{\prime}_{1}(s_{1})f^{\prime}_{j}(s_{j})-g^{\prime}_{1}(s_{1})g^{\prime}_{j}(s_{j})\}\,\Pi_{l\neq 1,j}f_{l}(s_{l})\,\phi(s_{1},\dots,s_{d})\,ds_{1}\dots ds_{d}
=[{f1(s1)g1(s1)}ϕ(s1,,sd)𝑑s1]fj(sj)Πl1,jfl(sl)𝑑s2𝑑sd.\displaystyle=\int_{-\infty}^{\infty}\dots\int_{-\infty}^{\infty}\bigg{[}\int_{-\infty}^{\infty}\{f^{\prime}_{1}(s_{1})-g^{\prime}_{1}(s_{1})\}\,\phi(s_{1},\dots,s_{d})\,ds_{1}\bigg{]}\,f^{\prime}_{j}(s_{j})\,\Pi_{l\neq 1,j}f_{l}(s_{l})\,ds_{2}\dots ds_{d}.

To guarantee D1D_{1} is non-negative we need the integral over s1s_{1} to be non-negative. Based on the definition of f1f_{1} and g1g_{1}, the integral over s1s_{1} can be simplified to

{f1(s1)g1(s1)}ϕ(s1,,sd)𝑑s1\displaystyle\int_{-\infty}^{\infty}\{f^{\prime}_{1}(s_{1})-g^{\prime}_{1}(s_{1})\}\,\phi(s_{1},\ldots,s_{d})\,ds_{1}
=11+1/η{ην(η(s11))}ϕ(s1,,sd)𝑑s1u1u1+1/η{ην(η(s1u1))}ϕ(s1,,sd)𝑑s1\displaystyle=\int_{\ell_{1}}^{\ell_{1}+1/\eta}\big{\{}\eta\,\nu^{\prime}\big{(}\eta(s_{1}-\ell_{1})\big{)}\big{\}}\,\phi(s_{1},\ldots,s_{d})\,ds_{1}-\int_{u_{1}}^{u_{1}+1/\eta}\big{\{}\eta\,\nu^{\prime}\big{(}\eta(s_{1}-u_{1})\big{)}\big{\}}\,\phi(s_{1},\ldots,s_{d})\,ds_{1}
=01/ηην(ηs1){ϕ(s1+1,s2,,sd)ϕ(s1+u1,s2,,sd)}𝑑s1.\displaystyle=\int_{0}^{1/\eta}\,\eta\,\nu^{\prime}(\eta s_{1})\{\phi(s_{1}+\ell_{1},s_{2},\ldots,s_{d})-\phi(s_{1}+u_{1},s_{2},\ldots,s_{d})\}ds_{1}. (B.6)

Let us denote the inverse of the covariance matrix Σ~\widetilde{\Sigma} as

Σ~1=[Σ~111Σ~121Σ~211Σ~221],\widetilde{\Sigma}^{-1}=\begin{bmatrix}\widetilde{\Sigma}^{-1}_{11}&\widetilde{\Sigma}^{-1}_{12}\\ \widetilde{\Sigma}^{-1}_{21}&\widetilde{\Sigma}^{-1}_{22}\\ \end{bmatrix},

where Σ~111\widetilde{\Sigma}^{-1}_{11} is a scalar. To check the non-negativity of the last line in (B.6), we now estimate the ratio

ϕ(s1+1,s2,,sd)ϕ(s1+u1,s2,,sd)=e{(u1212)+2s1(u11)}Σ~111/2+(u11)Σ~121s~2,\displaystyle\frac{\phi(s_{1}+\ell_{1},s_{2},\ldots,s_{d})}{\phi(s_{1}+u_{1},s_{2},\ldots,s_{d})}=e^{\{(u_{1}^{2}-\ell_{1}^{2})+2s_{1}\,(u_{1}-\ell_{1})\}\,\widetilde{\Sigma}^{-1}_{11}/2+\,(u_{1}-\ell_{1})\,\widetilde{\Sigma}^{-1}_{12}\,\tilde{s}_{2}},

where s~2=(s2,,sd)T\tilde{s}_{2}=(s_{2},\dots,s_{d})^{{\mathrm{\scriptscriptstyle T}}}. Since sj[0,1/η]s_{j}\in[0,1/\eta], we have s1(u11)Σ~111>0s_{1}\,(u_{1}-\ell_{1})\,\widetilde{\Sigma}^{-1}_{11}>0. We denote ρ~=max{Σ~121}\tilde{\rho}=\max\{\widetilde{\Sigma}^{-1}_{12}\} as the largest element of Σ~121\widetilde{\Sigma}^{-1}_{12}. Then, one can choose η\eta large enough such that

(u1+1)Σ~1112(d1)ρ~/η0,\displaystyle(u_{1}+\ell_{1})\,\widetilde{\Sigma}^{-1}_{11}-2(d-1)\tilde{\rho}/\eta\geq 0,

to guarantee D10D_{1}\geq 0. For example η=4(d1)ρ~Σ~11/(u1+1)\eta=4(d-1)\,\tilde{\rho}\,\widetilde{\Sigma}_{11}/(u_{1}+\ell_{1}) satisfies the above inequality.

Now we show Δ20\Delta_{2}\geq 0. We have 𝔼(YiYj)𝔼(XiXj)\mathbb{E}(Y_{i}Y_{j})\geq\mathbb{E}(X_{i}X_{j}) for all i,j=2,,di,j=2,\dots,d. For any i,j2i,j\geq 2, for any fixed t[0,1]t\in[0,1], we define
D2=𝔼(2fxixj(St)2gxixj(St))=𝔼{(f1g1)figjΠk1,i,jfk}.\displaystyle D_{2}=\mathbb{E}\bigg{(}\frac{\partial^{2}f}{\partial x_{i}\partial x_{j}}(S_{t})-\frac{\partial^{2}g}{\partial x_{i}\partial x_{j}}(S_{t})\bigg{)}=\mathbb{E}\big{\{}(f_{1}-g_{1})\,f^{\prime}_{i}\,g^{\prime}_{j}\,\Pi_{k\neq 1,i,j}f_{k}\big{\}}.
Since f1g10f_{1}-g_{1}\geq 0, and fj0f^{\prime}_{j}\geq 0 for all j>1j>1, it follows that D20D_{2}\geq 0 and thus Δ20\Delta_{2}\geq 0. Combining this with the non-negativity of Δ1\Delta_{1} completes the proof of Lemma 1.

B.2 Proof of Lemma 2

For X𝒩(𝟎,Σd(ρ))X\sim\mathcal{N}(\mathbf{0},\Sigma_{d}(\rho)) with ρ(0,1)\rho\in(0,1), we will repeatedly use its equivalent expression

Xi=ρ1/2w+(1ρ)1/2Wi(i=1,,d),\displaystyle X_{i}=\rho^{1/2}\,w+(1-\rho)^{1/2}\,W_{i}\quad(i=1,\ldots,d), (B.7)

where w,Wiw,W_{i}’s are independent standard normal variables.
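The representation (B.7) is easy to verify empirically. The following minimal sketch (dimension, correlation, and Monte Carlo size are illustrative) simulates XX via (B.7) and checks that its sample covariance matches Σd(ρ)\Sigma_{d}(\rho), which has unit diagonal entries and off-diagonal entries equal to ρ\rho.

```python
import numpy as np

rng = np.random.default_rng(1)
d, rho, n_mc = 5, 0.7, 200_000            # illustrative settings

w = rng.standard_normal(n_mc)             # shared factor w
W = rng.standard_normal((n_mc, d))        # independent W_1, ..., W_d
X = np.sqrt(rho) * w[:, None] + np.sqrt(1.0 - rho) * W   # representation (B.7)

Sigma_hat = np.cov(X, rowvar=False)
Sigma = (1.0 - rho) * np.eye(d) + rho * np.ones((d, d))  # Sigma_d(rho)
print("max |Sigma_hat - Sigma_d(rho)| =", np.max(np.abs(Sigma_hat - Sigma)))
```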

Proof of the upper bound. We recall ρ¯=(1ρ)/ρ\bar{\rho}=(1-\rho)/\rho. For any fixed δ>0\delta>0 and α(0,1)\alpha\in(0,1), we have

(0X1<δ,X20,,Xd0)\displaystyle\mathbb{P}(0\leq X_{1}<\delta,X_{2}\geq 0,\ldots,X_{d}\geq 0) (B.8)
=(0ρ1/2w+(1ρ)1/2W1δ,wρ¯1/2max2idWi)\displaystyle=\mathbb{P}\Big{(}0\leq{\rho}^{1/2}\,w+{(1-\rho)}^{1/2}\,W_{1}\leq\delta,w\geq{\bar{\rho}}^{1/2}\max_{2\leq i\leq d}W_{i}\Big{)}
=({0ρ1/2w+(1ρ)1/2W1δ,wρ¯1/2max2idWi}\displaystyle=\mathbb{P}\Big{(}\Big{\{}0\leq{\rho}^{1/2}\,w+{(1-\rho)}^{1/2}\,W_{1}\leq\delta,w\geq{\bar{\rho}}^{1/2}\max_{2\leq i\leq d}W_{i}\Big{\}}
([max2idWi{2(1α)log(d1)}1/2][max2idWi{2(1α)log(d1)}1/2]))\displaystyle~{}~{}~{}~{}\cap\Big{(}\Big{[}\max_{2\leq i\leq d}W_{i}\geq\{2(1-\alpha)\log(d-1)\}^{1/2}\Big{]}\cup\Big{[}\max_{2\leq i\leq d}W_{i}\leq\{2(1-\alpha)\log(d-1)\}^{1/2}\Big{]}\Big{)}\Big{)}
[0ρ1/2w+(1ρ)1/2W1δ,w{2ρ¯(1α)log(d1)}1/2]\displaystyle\leq\mathbb{P}\Big{[}0\leq{\rho}^{1/2}\,w+{(1-\rho)}^{1/2}\,W_{1}\leq\delta,w\geq\{2\,\bar{\rho}\,(1-\alpha)\log(d-1)\}^{1/2}\Big{]}
+[max2idWi{2(1α)log(d1)}1/2]\displaystyle~{}~{}~{}~{}+\mathbb{P}\Big{[}\max_{2\leq i\leq d}W_{i}\leq\{2(1-\alpha)\log(d-1)\}^{1/2}\Big{]}
=P1+P2.\displaystyle=P_{1}+P_{2}.

First, we estimate P1P_{1} in (B.8). By applying the equivalent expression of XX in (B.7), we have

P1\displaystyle P_{1} =[W1{(ρ1ρ)1/2w,δ/(1ρ)1/2(ρ1ρ)1/2w}w{2ρ¯(1α)log(d1)}1/2]\displaystyle=\mathbb{P}\bigg{[}W_{1}\in\bigg{\{}-\bigg{(}\frac{\rho}{1-\rho}\bigg{)}^{1/2}w,\delta/(1-\rho)^{1/2}-\bigg{(}\frac{\rho}{1-\rho}\bigg{)}^{1/2}w\bigg{\}}\mid w\geq\{2\,\bar{\rho}\,(1-\alpha)\log(d-1)\}^{1/2}\bigg{]}
[w{2ρ¯(1α)log(d1)}1/2]\displaystyle~{}~{}~{}~{}~{}\mathbb{P}\big{[}w\geq\{2\,\bar{\rho}\,(1-\alpha)\log(d-1)\}^{1/2}\,\big{]}
[W1({2(1α)log(d1)}1/2,δ(1ρ)1/2{2(1α)log(d1)}1/2)]\displaystyle\leq\mathbb{P}\big{[}W_{1}\in\big{(}-\{2(1-\alpha)\log(d-1)\}^{1/2},\delta(1-\rho)^{-1/2}-\{2(1-\alpha)\log(d-1)\}^{1/2}\,\big{)}\big{]}
(w{2ρ¯(1α)log(d1)}1/2)\displaystyle~{}~{}~{}~{}~{}\mathbb{P}\big{(}w\geq\{2\,\bar{\rho}\,(1-\alpha)\log(d-1)\}^{1/2}\,\big{)}
δ(2π)1/2exp([δ(1ρ)1/2{2(1α)log(d1)}1/2]2/2)\displaystyle\leq\delta(2\pi)^{-1/2}\,\exp\big{(}-\big{[}\delta(1-\rho)^{-1/2}-\{2(1-\alpha)\log(d-1)\}^{1/2}\big{]}^{2}/2\big{)}
{2ρ¯(1α)log(d1)}1/2exp({2ρ¯(1α)log(d1)}/2).\displaystyle~{}~{}~{}~{}~{}\{2\,\bar{\rho}\,(1-\alpha)\log(d-1)\}^{-1/2}\,\exp\big{(}-\{2\,\bar{\rho}\,(1-\alpha)\log(d-1)\}/2\big{)}.

The last inequality follows from Lemma 5 in Appendix F.

Now we move to estimate the term P2P_{2} in (B.8). We have,

[maxidWi{2(1α)logd}1/2]\displaystyle\mathbb{P}\big{[}\max_{i\leq d}W_{i}\leq\{2\,(1-\alpha)\log d\}^{1/2}\big{]} =(1[Z{2(1α)logd}1/2])d\displaystyle=\big{(}1-\mathbb{P}\big{[}Z\geq\{2\,(1-\alpha)\,\log d\}^{1/2}\,\big{]}\big{)}^{d}
exp(d[Z{2(1α)logd}1/2])exp(dα),\displaystyle\leq\exp\big{(}-d\,\mathbb{P}\big{[}Z\geq\{2\,(1-\alpha)\,\log d\}^{1/2}\,\big{]}\big{)}\leq\exp(-d\,^{\alpha}),

where Z𝒩(0,1)Z\sim\mathcal{N}(0,1).

Combining bounds for P1P_{1} and P2P_{2}, we obtain

(0X1<δ,X20,,Xd0)δ{2ρ¯(1α)log(d1)}1/2(d1)(1α)/ρ+exp(dα),\displaystyle\mathbb{P}(0\leq X_{1}<\delta,X_{2}\geq 0,\ldots,X_{d}\geq 0)\leq\delta\,\{2\bar{\rho}\,(1-\alpha)\log(d-1)\}^{-1/2}\,(d-1)^{-(1-\alpha)/\rho}+\exp(-d\,^{\alpha}),

for any α(0,1)\alpha\in(0,1). This completes the proof of the upper bound.

Proof of the lower bound. We provide a more general result for the lower bound. We show that for any scalar a0a\geq 0, we have

(Xa𝟏d)aρ1/2+(2ρ¯logd)1/2{aρ1/2+(2ρ¯logd)1/2}2+1exp[12{aρ1/2+(2ρ¯logd)1/2}2],\displaystyle\mathbb{P}(X\geq a\mathbf{1}_{d})\geq\frac{a{\rho}^{-1/2}+(2\,\bar{\rho}\,\log d)^{1/2}}{\big{\{}a{\rho}^{-1/2}+(2\,\bar{\rho}\,\log d)^{1/2}\big{\}}^{2}+1}\exp\bigg{[}-\frac{1}{2}\bigg{\{}a{\rho}^{-1/2}+(2\,\bar{\rho}\,\log d)^{1/2}\bigg{\}}^{2}\bigg{]}, (B.9)

where 𝟏d\mathbf{1}_{d} denotes the dd-dimensional vector of ones. Taking a=0a=0 leads to the desired lower bound in Lemma 2.

Now we prove the lower bound in (B.9). First,

(Xa𝟏d)\displaystyle\mathbb{P}(X\geq a\mathbf{1}_{d}) =(ρ1/2w+(1ρ)1/2Wia,fori=1,,d)\displaystyle=\mathbb{P}\big{(}\rho^{1/2}\,w+(1-\rho)^{1/2}\,W_{i}\geq a,\,\mbox{for}\,i=1,\dots,d\big{)} (B.10)
=𝔼([wρ1/2{a(1ρ)1/2Wi},i=1,,dW1,,Wd])\displaystyle=\mathbb{E}\big{(}\,\mathbb{P}\big{[}w\geq{\rho}^{-1/2}\{a-(1-\rho)^{1/2}\,W_{i}\},\,i=1,\dots,d\mid{W_{1},\ldots,W_{d}}\big{]}\big{)}
=(i)𝔼([wρ1/2{a+(1ρ)1/2maxiWi}W1,,Wd])\displaystyle\overset{\mathrm{(i)}}{=}\mathbb{E}\big{(}\,\mathbb{P}\big{[}w\geq\rho^{-1/2}\big{\{}a+(1-\rho)^{1/2}\,\max_{i}W_{i}\big{\}}\mid W_{1},\ldots,W_{d}\big{]}\big{)}
=𝔼{1Φ(aρ1/2+ρ¯1/2maxiWi)},\displaystyle=\mathbb{E}\bigg{\{}1-\Phi\bigg{(}a\rho^{-1/2}+{\bar{\rho}}^{1/2}\max_{i}W_{i}\bigg{)}\bigg{\}},

where W=[W1,,Wd]TW=[W_{1},\dots,W_{d}]^{\mathrm{\scriptscriptstyle T}}. Here, (i)(\mbox{i}) holds since Wi=dWi-W_{i}\overset{\mathrm{d}}{=}W_{i} for i=1,,di=1,\dots,d and maxid(Wi)=dmaxid(Wi)\max_{i\leq d}(-W_{i})\overset{\mathrm{d}}{=}\max_{i\leq d}(W_{i}).
 
We now proceed to lower bound the right hand side of the last equation in (B.10). To that end, we define g(a,b)=1Φ(aρ1/2+ρ¯1/2b)g(a,b)=1-\Phi(a\rho^{-1/2}+{\bar{\rho}}^{1/2}\,b), where g:+×[0,1]g:\mathbb{R}_{+}\times\mathbb{R}\to[0,1]. Importantly, gg is a non-increasing function of a,ba,b for a,ba,b\in\mathbb{R}, and gg is a convex function of (a,b)(a,b) for a,b>0a,b>0. For any fixed a0a\geq 0, since g(a,maxiWi)g(a,\max_{i}W_{i}) is non-increasing in maxiWi\max_{i}W_{i}, we have g(a,maxiWi)g(a,maxi|Wi|)g(a,\max_{i}W_{i})\geq g(a,\max_{i}|W_{i}|). We then apply Jensen’s inequality,

𝔼{g(a,max1id|Wi|)}g{a,𝔼(max1id|Wi|)}g{a,(2logd)1/2}.\displaystyle{\mathbb{E}}\Big{\{}g\Big{(}a,\max_{1\leq i\leq d}|W_{i}|\Big{)}\Big{\}}\geq g\Big{\{}a,\mathbb{E}\Big{(}\max_{1\leq i\leq d}|W_{i}|\Big{)}\Big{\}}\geq g\big{\{}a,(2\log d)^{1/2}\big{\}}.

The last inequality holds by applying Lemma 4 in Appendix F. To lower bound g{a,(2logd)1/2}g\big{\{}a,(2\log d)^{1/2}\big{\}} we apply Lemma 5 in Appendix F. Eventually, we obtain

𝔼{g(a,max1id|Wi|)}aρ1/2+(2ρ¯logd)1/2{aρ1/2+(2ρ¯logd)1/2}2+1exp[{aρ1/2+(2ρ¯logd)1/2}2/2].\displaystyle\mathbb{E}\,\Big{\{}g\Big{(}a,\max_{1\leq i\leq d}|W_{i}|\Big{)}\Big{\}}\geq\frac{a\rho^{-1/2}+(2\bar{\rho}\log d)^{1/2}}{\big{\{}a\rho^{-1/2}+(2\bar{\rho}\log d)^{1/2}\big{\}}^{2}+1}\,\exp\big{[}-\big{\{}a\rho^{-1/2}+(2\bar{\rho}\log d)^{1/2}\big{\}}^{2}/2\big{]}.

This completes the proof for the lower bound.
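As an informal check of Lemma 2 and of the mass-shifting phenomenon it quantifies, the sketch below (the values of ρ\rho, δ\delta, dd, and the Monte Carlo size are illustrative) uses the representation (B.7) to estimate the orthant probability and the corner probability appearing in the two bounds; their ratio, the conditional probability of a small first coordinate given the positive orthant, shrinks as dd grows.

```python
import numpy as np

rng = np.random.default_rng(2)
rho, delta, n_mc = 0.7, 0.2, 100_000      # illustrative settings

for d in (5, 20, 50):
    w = rng.standard_normal(n_mc)
    W = rng.standard_normal((n_mc, d))
    X = np.sqrt(rho) * w[:, None] + np.sqrt(1.0 - rho) * W  # X ~ N(0, Sigma_d(rho)), cf. (B.7)

    orthant = np.all(X >= 0.0, axis=1)                       # event {X >= 0}
    corner = orthant & (X[:, 0] < delta)                     # event {0 <= X_1 < delta, X >= 0}

    p_orthant = orthant.mean()
    p_corner = corner.mean()
    print(f"d={d:3d}  P(X >= 0) ~ {p_orthant:.4f}   "
          f"P(X_1 < delta | X >= 0) ~ {p_corner / p_orthant:.4f}")
```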

Appendix C Remaining proofs from the main document

C.1 Proof of Proposition 1

Now we derive the kk-dimensional marginal density function. We denote θ(k)=(θ1,,θk)T\theta^{(k)}=(\theta_{1},\dots,\theta_{k})^{{\mathrm{\scriptscriptstyle T}}} and θ(Nk)=(θk+1,,θN)T\theta^{(N-k)}=(\theta_{k+1},\dots,\theta_{N})^{{\mathrm{\scriptscriptstyle T}}}. We partition ΣN\Sigma_{N} into appropriate blocks as

ΣN=[Σk,kΣk,NkΣNk,kΣNk,Nk].\Sigma_{N}=\begin{bmatrix}\Sigma_{k,k}&\Sigma_{k,N-k}\\ \Sigma_{N-k,k}&\Sigma_{N-k,N-k}\\ \end{bmatrix}.

We also partition its inverse matrix Σ~N\widetilde{\Sigma}_{N},

Σ~N=[Σ~k,kΣ~k,NkΣ~Nk,kΣ~Nk,Nk].\widetilde{\Sigma}_{N}=\begin{bmatrix}\widetilde{\Sigma}_{k,k}&\widetilde{\Sigma}_{k,N-k}\\ \widetilde{\Sigma}_{N-k,k}&\widetilde{\Sigma}_{N-k,N-k}\\ \end{bmatrix}.

Then the kk-dimensional marginal p~k,N(θ1,,θk)=\widetilde{p}_{k,N}(\theta_{1},\dots,\theta_{k})=

(12π)N/2{det(Σ)}1/200exp{(θ(k)TΣ~k,kθ(k)\displaystyle\bigg{(}\frac{1}{2\pi}\bigg{)}^{N/2}\,\{\mbox{det}(\Sigma)\}^{-1/2}\,\int_{0}^{\infty}\dots\int_{0}^{\infty}\exp\big{\{}-\big{(}{\theta^{(k)}}^{{\mathrm{\scriptscriptstyle T}}}\widetilde{\Sigma}_{k,k}\,\theta^{(k)}
2θ(k)TΣ~k,Nkθ(Nk)+θ(Nk)TΣ~Nk,Nkθ(Nk))/2}dθ(Nk)\displaystyle~{}~{}~{}~{}-2{\theta^{(k)}}^{{\mathrm{\scriptscriptstyle T}}}\,\widetilde{\Sigma}_{k,N-k}\,\theta^{(N-k)}+{\theta^{(N-k)}}^{\mathrm{\scriptscriptstyle T}}\,\widetilde{\Sigma}_{N-k,N-k}\,\theta^{(N-k)}\big{)}/2\big{\}}\,d\theta^{(N-k)}
=(12π)k/2exp{θ(k)TΣk,k1θ(k)/2}Πi=1k𝟙[0,)(θi)(12π)(Nk)/2{det(Σ~Nk,Nk)}1/2\displaystyle=\bigg{(}\frac{1}{2\pi}\bigg{)}^{k/2}\exp\big{\{}-{\theta^{(k)}}^{{\mathrm{\scriptscriptstyle T}}}\Sigma^{-1}_{k,k}\theta^{(k)}/2\big{\}}\cdot\Pi_{i=1}^{k}\mathbbm{1}_{[0,\infty)}(\theta_{i})\bigg{(}\frac{1}{2\pi}\bigg{)}^{(N-k)/2}\{\mbox{det}(\widetilde{\Sigma}_{N-k,N-k})\}^{-1/2}
00exp{Σ~Nk,Nk12(θ(Nk)ΣNk,kΣk,k1θ(k))2/2}dθ(Nk)\displaystyle~{}~{}~{}~{}\cdot\int_{0}^{\infty}\dots\int_{0}^{\infty}\exp\big{\{}-\|\widetilde{\Sigma}_{N-k,N-k}^{\frac{1}{2}}\,\big{(}\theta^{(N-k)}-\Sigma_{N-k,k}\,\Sigma^{-1}_{k,k}\,\theta^{(k)}\big{)}\|^{2}/2\big{\}}d\theta^{(N-k)}
=(12π)k/2exp{θ(k)TΣk,k1θ(k)/2}(X~NkΣNk,kΣk,k1θ(k))Πi=1k𝟙[0,)(θi),\displaystyle=\bigg{(}\frac{1}{2\pi}\bigg{)}^{k/2}\exp\{-{\theta^{(k)}}^{{\mathrm{\scriptscriptstyle T}}}\Sigma^{-1}_{k,k}\,\theta^{(k)}/2\}\,\mathbb{P}(\widetilde{X}_{N-k}\leq\Sigma_{N-k,k}\,\Sigma^{-1}_{k,k}\,\theta^{(k)})\cdot\Pi_{i=1}^{k}\mathbbm{1}_{[0,\infty)}(\theta_{i}),

where

Σ~k,k\displaystyle\widetilde{\Sigma}_{k,k} =Σk,k1+Σk,k1Σk,NkΣ~Nk,NkΣNk,kΣk,k1,\displaystyle=\Sigma^{-1}_{k,k}+\Sigma^{-1}_{k,k}\,\Sigma_{k,N-k}\,\widetilde{\Sigma}_{N-k,N-k}\,\Sigma_{N-k,k}\,\Sigma^{-1}_{k,k},
Σ~k,Nk\displaystyle\widetilde{\Sigma}_{k,N-k} =Σk,k1Σk,NkΣ~Nk,Nk,\displaystyle=\Sigma^{-1}_{k,k}\,\Sigma_{k,N-k}\,\widetilde{\Sigma}_{N-k,N-k},
Σ~Nk,Nk\displaystyle\widetilde{\Sigma}_{N-k,N-k} =(ΣNk,NkΣNk,kΣk,k1Σk,Nk)1,\displaystyle=(\Sigma_{N-k,N-k}-\Sigma_{N-k,k}\,\Sigma^{-1}_{k,k}\,\Sigma_{k,N-k})^{-1},

and X~Nk𝒩Nk(𝟎Nk,Σ~Nk,Nk1)\widetilde{X}_{N-k}\sim\mathcal{N}_{N-k}(\mathbf{0}_{N-k},\widetilde{\Sigma}^{-1}_{N-k,N-k}).
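The k=1k=1 case of the above expression can be checked numerically. The sketch below (covariance, grid, and sample size are illustrative) compares a rejection-sampling estimate of the first marginal of the truncated normal with the unnormalized form exp{θ12/(2Σ11)}\exp\{-\theta_{1}^{2}/(2\Sigma_{11})\} times the Gaussian distribution function term obtained from the completing-the-square step, evaluating the latter with the multivariate normal CDF.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
N, rho = 4, 0.5                                          # illustrative dimension and correlation
Sigma = (1.0 - rho) * np.eye(N) + rho * np.ones((N, N))

# Rejection sampling from N(0, Sigma) truncated to the non-negative orthant.
Z = rng.multivariate_normal(np.zeros(N), Sigma, size=500_000)
T = Z[np.all(Z >= 0.0, axis=1)]

# Unnormalized k = 1 marginal: exp(-t^2 / (2 Sigma_11)) * P(Xtilde <= Sigma_{2:N,1} t / Sigma_11),
# where Xtilde ~ N(0, Schur complement of Sigma_11).
S = Sigma[1:, 1:] - np.outer(Sigma[1:, 0], Sigma[0, 1:]) / Sigma[0, 0]
def marg_unnorm(t):
    shift = Sigma[1:, 0] * t / Sigma[0, 0]
    return np.exp(-0.5 * t ** 2 / Sigma[0, 0]) * multivariate_normal.cdf(
        shift, mean=np.zeros(N - 1), cov=S)

edges = np.linspace(0.0, 4.0, 81)
mids = 0.5 * (edges[:-1] + edges[1:])
formula = np.array([marg_unnorm(t) for t in mids])
formula /= formula.sum() * (edges[1] - edges[0])         # normalize numerically on [0, 4]
hist, _ = np.histogram(T[:, 0], bins=edges, density=True)

for t in (0.25, 0.75, 1.5, 2.5):
    k = np.argmin(np.abs(mids - t))
    print(f"theta_1 ~ {mids[k]:.2f}   monte carlo {hist[k]:.3f}   formula {formula[k]:.3f}")
```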

C.2 Proof of Proposition 3

We first introduce some notation that is used in the proof. For an N×NN\times N matrix AA, we denote λj(A)\lambda_{j}(A) as its jjth eigenvalue, and denote λmin(A)\lambda_{\min}(A) and λmax(A)\lambda_{\max}(A) as its minimum and maximum eigenvalues, respectively. For a matrix AA, we define its operator norm as A={λmax(ATA)}1/2\|A\|=\{\lambda_{\max}(A^{\mathrm{\scriptscriptstyle T}}A)\}^{1/2}. For two quantities a,ba,b, we write aba\asymp b when a/ba/b can be bounded from below and above by two finite positive constants.

We repeatedly apply the Neumann series and Lemma 6 in Appendix F to construct an approximation to the posterior scale matrix ΣN\Sigma_{N}. Under Assumption 2, we have the prior covariance matrix ΩN(λ0,α,k)\Omega_{N}\in\mathcal{M}(\lambda_{0},\alpha,k) for some universal constants λ0,α,k>0\lambda_{0},\alpha,k>0. Then for any ϵ(0,λ0/2)\epsilon\in(0,\lambda_{0}/2), by choosing rlog(C/ϵ)/αr\geq\log(C/\epsilon)/\alpha, one can find an rr-banded symmetric and positive definite matrix ΩN,r\Omega_{N,r} such that

ΩNΩN,rϵ.\displaystyle\|\Omega_{N}-\Omega_{N,r}\|\leq\epsilon. (C.1)

Now we let M=λmax(ΩN,r)M=\lambda_{\max}(\Omega_{N,r}) and m=λmin(ΩN,r)m=\lambda_{\min}(\Omega_{N,r}). Given (C.1), we have

λ0ϵmM1/λ0+ϵ.\displaystyle\lambda_{0}-\epsilon\leq m\leq M\leq 1/\lambda_{0}+\epsilon. (C.2)

By choosing ξ=2/(M+m)\xi=2/(M+m), a simple calculation gives INξΩN,r<1\|I_{N}-\xi\,\Omega_{N,r}\|<1. We now apply the Neumann series to construct a polynomial of ΩN,r\Omega_{N,r} of degree n1n_{1}, defined as Ω~1=ξj=0n1(IξΩN,r)j\widetilde{\Omega}^{-1}=\xi\,\sum_{j=0}^{n_{1}}(I-\xi\,\Omega_{N,r})^{j}, for some integer n1>0n_{1}>0 to be chosen later. Applying Lemma 6 in Appendix F, we have

ΩN,r1Ω~1κ0n1+1/(λ0ϵ),\displaystyle\|\Omega_{N,r}^{-1}-\widetilde{\Omega}^{-1}\|\leq\kappa_{0}^{n_{1}+1}/(\lambda_{0}-\epsilon), (C.3)

where κ0=(Mm)/(M+m)\kappa_{0}=(M-m)/(M+m). Applying Lemma 6 we guarantee Ω~1\widetilde{\Omega}^{-1} is (n1r)(n_{1}\,r)-banded and positive definite. Combining results in (C.2) and (C.3), we have

λ0/(1+λ0ϵ)κ0n1+1/(λ0ϵ)λmin(Ω~1)λmax(Ω~1)1/(λ0ϵ)+κ0n1+1/(λ0ϵ).\displaystyle\lambda_{0}/(1+\lambda_{0}\epsilon)-\kappa_{0}^{n_{1}+1}/(\lambda_{0}-\epsilon)\leq\lambda_{\min}(\widetilde{\Omega}^{-1})\leq\lambda_{\max}(\widetilde{\Omega}^{-1})\leq 1/(\lambda_{0}-\epsilon)+\kappa_{0}^{n_{1}+1}/(\lambda_{0}-\epsilon). (C.4)

Now we let Σ~1=Ω~1+ΦTΦ\widetilde{\Sigma}^{-1}=\widetilde{\Omega}^{-1}+\Phi^{\mathrm{\scriptscriptstyle T}}\Phi. Under Assumption 1 we have Σ~1\widetilde{\Sigma}^{-1} is kk-banded with k=max{n1r,q}.k=\max\{n_{1}\,r,q\}. We then define λ~1=λmax(Σ~1)\tilde{\lambda}_{1}=\lambda_{\max}(\widetilde{\Sigma}^{-1}) and λ~N=λmin(Σ~1)\tilde{\lambda}_{N}=\lambda_{\min}(\widetilde{\Sigma}^{-1}). Thus, given (C.4), we have

C1(n/N)+λ0/(1+λ0ϵ)κ0n1+1/(λ0ϵ)λ~Nλ~1C2(n/N)+1/(λ0ϵ)+κ0n1+1/(λ0ϵ),\displaystyle C_{1}\,(n/N)+\lambda_{0}/(1+\lambda_{0}\epsilon)-\kappa_{0}^{n_{1}+1}/(\lambda_{0}-\epsilon)\leq\tilde{\lambda}_{N}\leq\tilde{\lambda}_{1}\leq C_{2}\,(n/N)+1/(\lambda_{0}-\epsilon)+\kappa_{0}^{n_{1}+1}/(\lambda_{0}-\epsilon),

for constants 0<C1<C2<0<C_{1}<C_{2}<\infty in Assumption 2.

We first consider the case where N/naN/n\to a for some constant a(0,1)a\in(0,1), as n,Nn,N\to\infty. For sufficiently large n,Nn,N, we obtain

C1a+λ0/(1+λ0ϵ)λ~Nλ~1C2a+1/(λ0ϵ),\displaystyle C^{\prime}_{1}a+\lambda_{0}/(1+\lambda_{0}\epsilon)\leq\tilde{\lambda}_{N}\leq\tilde{\lambda}_{1}\leq C^{\prime}_{2}\,a+1/(\lambda_{0}-\epsilon), (C.5)

for constants C1,C2C^{\prime}_{1},C^{\prime}_{2} satisfying C1<C1C^{\prime}_{1}<C_{1} and C2<C2C_{2}<C^{\prime}_{2}.

Secondly, we consider the case where N/n0N/n\to 0 as n,Nn,N\to\infty. In this case, the n/Nn/N term dominates the eigenvalues of Σ~1\widetilde{\Sigma}^{-1}. Thus, for sufficiently large n,Nn,N, we have

C1(n/N)λ~Nλ~1C2(n/N).\displaystyle C_{1}\,(n/N)\leq\tilde{\lambda}_{N}\leq\tilde{\lambda}_{1}\leq C_{2}\,(n/N). (C.6)

Now we apply Lemma 6 one more time to construct the approximation matrix to the inverse of Σ~1\widetilde{\Sigma}^{-1}. Again, by taking γ=2/(λ~1+λ~N)\gamma=2/(\tilde{\lambda}_{1}+\tilde{\lambda}_{N}), we have INγΣ~1<1\|I_{N}-\gamma\,\widetilde{\Sigma}^{-1}\|<1. Now we define Σ=γj=0m1(INγΣ~1)j\Sigma^{\prime}=\gamma\,\sum_{j=0}^{m_{1}}(I_{N}-\gamma\,\widetilde{\Sigma}^{-1})^{j} for some positive integer m1m_{1}. Also, it follows

Σ~Σκ~m1+1/λ~N,\displaystyle\|\widetilde{\Sigma}-\Sigma^{\prime}\|\leq\tilde{\kappa}^{m_{1}+1}/{\tilde{\lambda}_{N}}, (C.7)

where κ~=(λ~1λ~N)/(λ~1+λ~N)\tilde{\kappa}=(\tilde{\lambda}_{1}-\tilde{\lambda}_{N})/(\tilde{\lambda}_{1}+\tilde{\lambda}_{N}). By construction Σ\Sigma^{\prime} is (m1k)(m_{1}\,k)-banded.

Now we estimate κ~\tilde{\kappa}. For large enough N,nN,n in the first case, we can upper bound

κ~κ1=(C2C1)a+1/(λ0ϵ)λ0/(1+λ0ϵ)(C2+C1)a+1/(λ0ϵ)+λ0/(1+λ0ϵ).\tilde{\kappa}\leq\kappa_{1}=\frac{(C^{\prime}_{2}-C^{\prime}_{1})\,a+1/(\lambda_{0}-\epsilon)-\lambda_{0}/(1+\lambda_{0}\epsilon)}{(C^{\prime}_{2}+C^{\prime}_{1})\,a+1/(\lambda_{0}-\epsilon)+\lambda_{0}/(1+\lambda_{0}\epsilon)}.

The inequality holds since the map x(1x)/(1+x)x\mapsto(1-x)/(1+x) is non-increasing in x(0,1)x\in(0,1). Combining this with the result in (C.5) and taking x=λ~N/λ~1x=\tilde{\lambda}_{N}/\tilde{\lambda}_{1} leads to the expression of κ1\kappa_{1}. Based on (C.7), we have Σ~Σκ1m1+1/{C1a+λ0/(1+λ0ϵ)}\|\widetilde{\Sigma}-\Sigma^{\prime}\|\leq\kappa_{1}^{m_{1}+1}/\{C^{\prime}_{1}a+\lambda_{0}/(1+\lambda_{0}\epsilon)\}. For N,nN,n in the second case, following a similar line of argument, we have Σ~Σκ~m1+1N/(C1n)\|\widetilde{\Sigma}-\Sigma^{\prime}\|\leq\tilde{\kappa}^{m_{1}+1}\,N/(C_{1}n) with κ~=(C2C1)/(C2+C1)\tilde{\kappa}=(C_{2}-C_{1})/(C_{2}+C_{1}).

We recall the posterior scale matrix ΣN=(ΩN1+ΦTΦ)1\Sigma_{N}=(\Omega_{N}^{-1}+\Phi^{\mathrm{\scriptscriptstyle T}}\Phi)^{-1}. Then we have

ΣNΣ\displaystyle\|\Sigma_{N}-\Sigma^{\prime}\| ΣNΣ~+Σ~Σ\displaystyle\leq\|\Sigma_{N}-\widetilde{\Sigma}\|+\|\widetilde{\Sigma}-\Sigma^{\prime}\|
ΣN(ΩN1ΩN,r1+ΩN,r1Ω~1)Σ~+Σ~Σ\displaystyle\leq\|\Sigma_{N}\|(\|\Omega_{N}^{-1}-\Omega_{N,r}^{-1}\|+\|\Omega_{N,r}^{-1}-\widetilde{\Omega}^{-1}\|)\|\widetilde{\Sigma}\|+\|\widetilde{\Sigma}-\Sigma^{\prime}\|
ΣNΣ~(c1ϵ+c2κ0n1+1)+Σ~Σ\displaystyle\leq\|\Sigma_{N}\|\|\widetilde{\Sigma}\|(c_{1}\,\epsilon+c_{2}\,\kappa_{0}^{n_{1}+1})+\|\widetilde{\Sigma}-\Sigma^{\prime}\|

where c1=ΩN1ΩN,r1c_{1}=\|\Omega_{N}^{-1}\|\|\Omega_{N,r}^{-1}\| and c2=1/(λ0ϵ)c_{2}=1/(\lambda_{0}-\epsilon). The first inequality follows from the triangle inequality and the second inequality follows from the bound A1B1A1ABB1\|A^{-1}-B^{-1}\|\leq\|A^{-1}\|\|A-B\|\|B^{-1}\| for invertible matrices A,BA,B. The last inequality follows from results in (C.1) and (C.3).

For N,nN,n in the first case, ΣN\|\Sigma_{N}\| and Σ~\|\widetilde{\Sigma}\| are upper bounded by some constants that are free of n,Nn,N given (C.5). Then we obtain

ΣNΣC(ϵ+κ0n1+1+κ1m1+1),\displaystyle\|\Sigma_{N}-\Sigma^{\prime}\|\leq C^{\prime}(\epsilon+\kappa_{0}^{n_{1}+1}+\,\kappa_{1}^{m_{1}+1}),

where C=max{c1,c2,C1a+λ0/(1+λ0ϵ)}/{C1a+λ0/(1+λ0ϵ)}2C^{\prime}=\max\{c_{1},c_{2},C^{\prime}_{1}\,a+\lambda_{0}/(1+\lambda_{0}\epsilon)\}/\{C^{\prime}_{1}\,a+\lambda_{0}/(1+\lambda_{0}\epsilon)\}^{2}.

For N,nN,n in the second case, for sufficiently large N,nN,n we have ΣN(N/n)\|\Sigma_{N}\|\asymp(N/n) given (C.6). Then we have

ΣNΣC′′{(N/n)2(ϵ+κ0n1+1)+(N/n)κ~m1+1},\displaystyle\|\Sigma_{N}-\Sigma^{\prime}\|\leq C^{\prime\prime}\,\{(N/n)^{2}(\epsilon+\kappa_{0}^{n_{1}+1})+(N/n)\tilde{\kappa}^{m_{1}+1}\},

where C′′=C12max{c1,c2,C1}C^{\prime\prime}=C^{-2}_{1}\,\max\{c_{1},c_{2},C_{1}\}. Letting κ=max{κ0,κ1,κ~}\kappa=\max\{\kappa_{0},\kappa_{1},\tilde{\kappa}\}, n0=min{n1,m1}n_{0}=\min\{n_{1},m_{1}\}, and δϵ,κ=(ϵ+κn0+1)max{(N/n),(N/n)2}\delta_{\epsilon,\kappa}=(\epsilon+\kappa^{n_{0}+1})\max\{(N/n),(N/n)^{2}\} yields the result in Proposition 3.

Appendix D Additional details on the numerical studies

D.1 Prior draws

We consider equation (4.1) and the prior specified in section 4. Prior samples on both θ\theta and ξ\xi of dimension N=100N=100 were drawn. Figure 9 shows prior draws for the first and third components of both θ\theta and ξ\xi.

Figure 9: Prior draws from the distributions of θ\theta (left panel) and ξ\xi (right panel). Top and bottom panels correspond to the first and third components, respectively, for both θ\theta and ξ\xi.

D.2 Posterior Computations

We now consider model (4.2) and the prior specified in section 4. Then the full conditional distribution of θ\theta

π(θY,ξ0,λ,τ,σ)exp{12σ2Y~ΨΛθ2}exp{12τ2θTK1θ} 1𝒞θ(θ)\displaystyle\pi(\theta\mid Y,\xi_{0},\lambda,\tau,\sigma)\propto\exp\bigg{\{}-\frac{1}{2\sigma^{2}}\|\widetilde{Y}-\Psi\Lambda\theta\|^{2}\bigg{\}}\,\exp\bigg{\{}-\frac{1}{2\tau^{2}}\theta^{\mathrm{\scriptscriptstyle T}}K^{-1}\theta\bigg{\}}\,\mathbbm{1}_{\mathcal{C}_{\theta}}(\theta)

can be approximated by

π(θY,ξ0,λ,τ,σ)exp{12σ2Y~Ψλθ2}exp{12τ2θTK1θ}{j=1N+1eηθj1+eηθj}\displaystyle\pi(\theta\mid Y,\xi_{0},\lambda,\tau,\sigma)\propto\exp\bigg{\{}-\frac{1}{2\sigma^{2}}\|\widetilde{Y}-\Psi_{\lambda}\,\theta\|^{2}\bigg{\}}\,\exp\bigg{\{}-\frac{1}{2\tau^{2}}\theta^{\mathrm{\scriptscriptstyle T}}K^{-1}\theta\bigg{\}}\,\bigg{\{}\prod_{j=1}^{N+1}\frac{e^{\eta\,\theta_{j}}}{1+e^{\eta\,\theta_{j}}}\bigg{\}}
=[exp{12σ2Y~Ψλθ2}{j=1N+1eηθj1+eηθj}]exp{12τ2θTK1θ}\displaystyle=\bigg{[}\exp\bigg{\{}-\frac{1}{2\sigma^{2}}\|\widetilde{Y}-\Psi_{\lambda}\,\theta\|^{2}\bigg{\}}\,\bigg{\{}\prod_{j=1}^{N+1}\frac{e^{\eta\,\theta_{j}}}{1+e^{\eta\,\theta_{j}}}\bigg{\}}\bigg{]}\,\exp\bigg{\{}-\frac{1}{2\tau^{2}}\theta^{\mathrm{\scriptscriptstyle T}}K^{-1}\theta\bigg{\}}

where η\eta is a large constant, Y~=Yξ01n\widetilde{Y}=Y-\xi_{0}\mathrm{1}_{n} and Ψλ=ΨΛ\Psi_{\lambda}=\Psi\Lambda. The above is the same as equation (5) of Ray et al., (2019) and thus falls under the framework of their sampling scheme. For more details on the sampling scheme and the approximation, one can refer to Ray et al., (2019).

Note that λj𝒞+(0,1),j=1,,N\lambda_{j}\sim\mathcal{C}_{+}(0,1),\,\,j=1,\ldots,N, can be equivalently given by λjwj𝒩(0,wj1)𝟙(λj>0),wj𝒢(0.5,0.5),j=1,,N\lambda_{j}\mid w_{j}\sim\mathcal{N}(0,w_{j}^{-1})\mathbbm{1}(\lambda_{j}>0)\,,\,\,w_{j}\sim\mathcal{G}(0.5,0.5)\,,\,\,j=1,\ldots,N. Thus the full conditional distribution of λ\lambda can be approximated by:

π(λY,ξ0,θ,w,τ,σ)[exp{12σ2Y~Ψθλ2}{j=1N+1eζλj1+eζλj}]exp{12λTWλ}\displaystyle\pi(\lambda\mid Y,\xi_{0},\theta,w,\tau,\sigma)\propto\bigg{[}\exp\bigg{\{}-\frac{1}{2\sigma^{2}}\|\widetilde{Y}-\Psi_{\theta}\,\lambda\|^{2}\bigg{\}}\,\bigg{\{}\prod_{j=1}^{N+1}\frac{e^{\zeta\,\lambda_{j}}}{1+e^{\zeta\,\lambda_{j}}}\bigg{\}}\bigg{]}\,\exp\bigg{\{}-\frac{1}{2}\lambda^{\mathrm{\scriptscriptstyle T}}W\lambda\bigg{\}}

where ζ\zeta plays the same role as η\eta, w=(w1,,wN)Tw=(w_{1},\ldots,w_{N})^{\mathrm{\scriptscriptstyle T}}, W=diag(w1,,wN)W=\mbox{diag}(w_{1},\ldots,w_{N}), Ψθ=ΨΘ\Psi_{\theta}=\Psi\Theta and Θ=diag(θ1,,θN)\Theta=\mbox{diag}(\theta_{1},\ldots,\theta_{N}). Thus, λ\lambda can be sampled efficiently using the algorithm proposed in Ray et al., (2019).
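For concreteness, the following is a minimal sketch of the sigmoid relaxation used in the two displays above: the indicator of the constraint set is replaced by a product of sigmoids, which makes the log full conditional of θ\theta differentiable. The dimensions, the matrices Ψλ\Psi_{\lambda} and KK, and the value of η\eta are placeholders; the actual sampling scheme is the one of Ray et al., (2019).

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, eta, sigma, tau = 50, 10, 50.0, 0.5, 1.0     # placeholder settings

Psi_lambda = rng.standard_normal((n, N))           # placeholder for Psi * Lambda
K = np.eye(N)                                      # placeholder for the prior matrix K
Y_tilde = rng.standard_normal(n)                   # placeholder for Y - xi_0 * 1_n

def approx_log_full_conditional(theta):
    """Log of the relaxed full conditional: Gaussian likelihood and prior terms, plus
    sum_j log{exp(eta*theta_j) / (1 + exp(eta*theta_j))} in place of the indicator."""
    resid = Y_tilde - Psi_lambda @ theta
    log_lik = -0.5 * np.sum(resid ** 2) / sigma ** 2
    log_prior = -0.5 * theta @ np.linalg.solve(K, theta) / tau ** 2
    log_relax = np.sum(eta * theta - np.log1p(np.exp(eta * theta)))
    return log_lik + log_prior + log_relax

theta = np.full(N, 0.1)
print(approx_log_full_conditional(theta))
print(approx_log_full_conditional(-theta))  # negative coordinates are heavily penalized
```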

D.3 Performance of bsar

Consider the simulation set-up specified in section 4. Figure 10 shows the out-of-sample prediction performance of bsar, developed by Lenk and Choi, (2017), and implemented in the R package bsamGP.

Figure 10: Out-of-sample prediction accuracy using bsar for f1f_{1} and f2f_{2}. The red solid curve corresponds to the true function, the black solid curve is the mean prediction, the region within the two dotted blue curves represents the 95% pointwise prediction interval, and the green dots are 200200 test data points. mspe values corresponding to each method are also shown in the plots.

Appendix E Concentration result of the posterior mean μ\mu

We start by introducing some new notation and assumptions. For two variables X,YX,Y, we denote the conditional probability measure, conditional expectation and conditional variance of YY given XX as YX\mathbb{P}_{Y\mid X}, 𝔼YX\mathbb{E}_{Y\mid X}, and varYX\mathop{\rm var}_{Y\mid X}, respectively. For two quantities a,ba,b, we write aba\succsim b when aa is bounded below by a constant multiple of bb. For matrices A,BA,B of the same size, we say ABA\leq B if BAB-A is positive semi-definite. In the following, we state the assumptions on the choice of basis and on the prior.

Assumption 3.

We assume that the number of basis functions NN and the sample size nn satisfy N/n0N/n\to 0 as N,nN,n\to\infty. Also, we assume the basis matrix Φn×N\Phi_{n\times N} satisfies

c1(n/N)INΦTΦc2(n/N)IN,c_{1}(n/N)\,I_{N}\leq\Phi^{\mathrm{\scriptscriptstyle T}}\Phi\leq c_{2}(n/N)\,I_{N},

for some constants 0<c1<c2<0<c_{1}<c_{2}<\infty.

As an example of a basis satisfying Assumption 3, one can take a qqth (q2q\geq 2) order B-spline basis associated with NqN-q knots. Moreover, under mild conditions, it can be shown that the optimal number of basis functions is of order NncN\asymp n^{c} for some c(0,1)c\in(0,1) in the regression setting (Yoo et al.,, 2016).
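As a quick illustration of the scaling in Assumption 3, the sketch below (an equally spaced design and a piecewise-linear, i.e. second-order, spline basis; all settings illustrative) checks that the eigenvalues of ΦTΦ\Phi^{\mathrm{\scriptscriptstyle T}}\Phi grow at the rate n/Nn/N.

```python
import numpy as np

def hat_design(x, N):
    """n x N design matrix of piecewise-linear (second-order B-spline) basis functions."""
    u = np.linspace(0.0, 1.0, N)
    delta = u[1] - u[0]
    return np.clip(1.0 - np.abs(x[:, None] - u[None, :]) / delta, 0.0, None)

for n, N in [(500, 10), (2000, 20), (8000, 40)]:
    x = np.linspace(0.0, 1.0, n)              # equally spaced design points
    Phi = hat_design(x, N)
    eig = np.linalg.eigvalsh(Phi.T @ Phi)
    print(f"n={n:5d}  N={N:3d}   lambda_min*(N/n)={eig[0] * N / n:.3f}"
          f"   lambda_max*(N/n)={eig[-1] * N / n:.3f}")
```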

Assumption 4.

For the prior distribution θ𝒩(𝟎,ΩN)\theta\sim\mathcal{N}(\mathbf{0},\Omega_{N}) with NN satisfying Assumption 3, we assume the covariance matrix ΩN\Omega_{N} satisfies λmin(ΩN)(N/n)\lambda_{\min}(\Omega_{N})\succsim(N/n).

Now we are ready to state the concentration result of the posterior center μN\mu_{N}.

Proposition 4.

Under Assumption 3 and Assumption 4, for the truncated normal posterior 𝒩𝒞(μN,ΣN)\mathcal{N}_{\mathcal{C}}(\mu_{N},\Sigma_{N}) in §3 of the main manuscript, with probability at least 12N21-2\,N^{-2} with respect to YX\mathbb{P}_{Y\mid X}, we have

μNϵN,\|\mu_{N}\|_{\infty}\leq\epsilon_{N},

for ϵN2(c2/c12)1/2(NlogN/n)1/2\epsilon_{N}\geq 2\,(c_{2}/c^{2}_{1})^{1/2}\,(N\log N/n)^{1/2} with c1,c2c_{1},c_{2} defined in Assumption 3.

Proof.

Under model (3.2) with true coefficient vector θ0=𝟎\theta_{0}={\bf 0}, we have Yθ0,X𝒩(𝟎n,In)Y\mid\theta_{0},X\sim\mathcal{N}({\bf 0}_{n},I_{n}). We henceforth write the posterior center as μ(Y)=ΣNΦTY\mu(Y)=\Sigma_{N}\Phi^{\mathrm{\scriptscriptstyle T}}Y; we then have 𝔼YX{μ(Y)}=𝟎\mathbb{E}_{Y\mid X}\{\mu(Y)\}=\mathbf{0} and varYX{μ(Y)}=ΣNΦTΦΣN\mathop{\rm var}_{Y\mid X}\{\mu(Y)\}=\Sigma_{N}\Phi^{\mathrm{\scriptscriptstyle T}}\Phi\Sigma_{N}. Further, we denote σj2=varYX{μj(Y)}\sigma^{2}_{j}=\mathop{\rm var}_{Y\mid X}\{\mu_{j}(Y)\} for j=1,,Nj=1,\ldots,N. For the basis matrix Φ\Phi and ΩN\Omega_{N} satisfying Assumption 3 and Assumption 4, respectively, we have

c1(n/N)λmin(ΩN1+ΦTΦ)λmax(ΩN1+ΦTΦ)(c2+D)(n/N).\displaystyle c_{1}(n/N)\leq\lambda_{\min}(\Omega_{N}^{-1}+\Phi^{T}\Phi)\leq\lambda_{\max}(\Omega_{N}^{-1}+\Phi^{T}\Phi)\leq(c_{2}+D)(n/N).

Indeed, under Assumption 4 the prior covariance matrix ΩN\Omega_{N} satisfies ΩN1D(n/N)\|\Omega^{-1}_{N}\|\leq D\,(n/N) for some constant D>0D>0, and λmin(ΩN1)0\lambda_{\min}(\Omega_{N}^{-1})\geq 0. Further, we have

c1(c1+D)2(N/n)INΣNΦTΦΣNc2c12(N/n)IN.\displaystyle\frac{c_{1}}{(c_{1}+D)^{2}}(N/n)\,I_{N}\leq\Sigma_{N}\Phi^{\mathrm{\scriptscriptstyle T}}\Phi\Sigma_{N}\leq\frac{c_{2}}{c^{2}_{1}}(N/n)\,I_{N}. (E.1)

We define σmax2=maxjN{σj2}\sigma^{2}_{\max}=\max_{j\leq N}\{\sigma^{2}_{j}\}; then (E.1) implies σmax2(c2/c12)(N/n)\sigma^{2}_{\max}\leq(c_{2}/c^{2}_{1})(N/n). Since μ(Y)\mu(Y) is a linear function of YY, it is well known that maxjN|μj(Y)|\max_{j\leq N}|\mu_{j}(Y)| is a Lipschitz function of YY with Lipschitz constant σmax\sigma_{\max}. We can also upper bound the expectation as

𝔼Y|X(maxjN|μj|){2log(2N)}1/2maxjN{σj}M0(NlogN/n)1/2,\mathbb{E}_{Y|X}\Big{(}\max_{j\leq N}|\mu_{j}|\Big{)}\leq\{2\log(2N)\}^{1/2}\max_{j\leq N}\{\sigma_{j}\}\leq M_{0}(N\log N/n)^{1/2},

where M0=2(c2/c12)1/2M_{0}=2{(c_{2}/c^{2}_{1})}^{1/2}.

Thus, taking ϵN2M0(NlogN/n)1/2\epsilon_{N}\geq 2M_{0}\,(N\log N/n)^{1/2}, we have

YX(μN>ϵN)\displaystyle\mathbb{P}_{Y\mid X}(\|\mu_{N}\|_{\infty}>\epsilon_{N}) YX{|maxjN|μj|𝔼YX(maxiN|μi|)|>ϵN𝔼YX(maxiN|μi|)}\displaystyle\leq\mathbb{P}_{Y\mid X}\Big{\{}\Big{|}\max_{j\leq N}|\mu_{j}|-\mathbb{E}_{Y\mid X}\Big{(}\max_{i\leq N}|\mu_{i}|\Big{)}\Big{|}>\epsilon_{N}-\mathbb{E}_{Y\mid X}\Big{(}\max_{i\leq N}|\mu_{i}|\Big{)}\Big{\}}
YX{|maxjN|μj|𝔼YX(maxiN|μi|)|>ϵN/2}\displaystyle\leq\mathbb{P}_{Y\mid X}\Big{\{}\Big{|}\max_{j\leq N}|\mu_{j}|-\mathbb{E}_{Y\mid X}\Big{(}\max_{i\leq N}|\mu_{i}|\Big{)}\Big{|}>\epsilon_{N}/2\Big{\}}
2exp{ϵN2/(8σmax2)}2N2.\displaystyle\leq 2\exp\{-\epsilon_{N}^{2}/(8\,\sigma^{2}_{\max})\}\leq 2\,N^{-2}.

This establishes the result. ∎
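A small simulation sketch of Proposition 4 (illustrative settings: a hat-function design matrix, ΩN=(N/n)IN\Omega_{N}=(N/n)I_{N}, and θ0=𝟎\theta_{0}={\bf 0}): it draws YY from the null model, forms μN=ΣNΦTY\mu_{N}=\Sigma_{N}\Phi^{\mathrm{\scriptscriptstyle T}}Y, and compares μN\|\mu_{N}\|_{\infty} with the (NlogN/n)1/2(N\log N/n)^{1/2} scale.

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 4000, 20                                  # illustrative sizes with N/n small
x = np.linspace(0.0, 1.0, n)
u = np.linspace(0.0, 1.0, N)
Phi = np.clip(1.0 - np.abs(x[:, None] - u[None, :]) * (N - 1), 0.0, None)  # hat-function design

Omega = (N / n) * np.eye(N)                      # prior covariance with lambda_min of order N/n
Sigma_N = np.linalg.inv(np.linalg.inv(Omega) + Phi.T @ Phi)

rate = np.sqrt(N * np.log(N) / n)
sup_norms = [np.max(np.abs(Sigma_N @ Phi.T @ rng.standard_normal(n))) for _ in range(200)]
print("median ||mu_N||_inf =", np.median(sup_norms), "  (N log N / n)^{1/2} =", rate)
```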

Appendix F Auxiliary results

Lemma 3.

(Slepian’s lemma) Let X,YX,Y be centered Gaussian vectors on d\mathbb{R}^{d}. Suppose 𝔼Xi2=𝔼Yi2\mathbb{E}X_{i}^{2}=\mathbb{E}Y_{i}^{2} for all ii, and 𝔼(XiXj)𝔼(YiYj)\mathbb{E}(X_{i}X_{j})\leq\mathbb{E}(Y_{i}Y_{j}) for all iji\neq j. Then, for any xx\in\mathbb{R},

(max1idXix)(max1idYix).\displaystyle\mathbb{P}\Big{(}\max_{1\leq i\leq d}X_{i}\leq x\Big{)}\leq\mathbb{P}\Big{(}\max_{1\leq i\leq d}Y_{i}\leq x\Big{)}.

We use Slepian’s lemma in the following way in the main document. We have,

(X10,,Xd0)=(min1idXi0)=(max1idXi0),\displaystyle\mathbb{P}(X_{1}\geq 0,\ldots,X_{d}\geq 0)=\mathbb{P}\Big{(}\min_{1\leq i\leq d}X_{i}\geq 0\Big{)}=\mathbb{P}\Big{(}\max_{1\leq i\leq d}X_{i}\leq 0\Big{)},

where the second equality uses X=𝑑XX\overset{d}{=}-X. We use Slepian’s inequality to arrive at equation (2.6) in the main document.

Lemma 4.

Let Z1,,ZNZ_{1},\ldots,Z_{N} be iid 𝒩(0,1)\mathcal{N}(0,1) random variables. Then we have

C12logN𝔼maxi=1,,NZi𝔼maxi=1,,N|Zi|2logN.\displaystyle C_{1}\sqrt{2\log N}\leq\mathbb{E}\max_{i=1,\dots,N}Z_{i}\leq\mathbb{E}\max_{i=1,\dots,N}|Z_{i}|\leq\sqrt{2\log N}. (F.1)

for some constant 0<C1<10<C_{1}<1.

Lemma 5.

(Mill’s ratio bound) Let X𝒩(0,1)X\sim\mathcal{N}(0,1). We have, for x>0x>0, that

xx2+1ex2/21Φ(x)1xex2/2,\displaystyle\frac{x}{x^{2}+1}\mbox{e}^{-x^{2}/2}\leq 1-\Phi(x)\leq\frac{1}{x}\mbox{e}^{-x^{2}/2},

where Φ()\Phi(\cdot) is the cumulative distribution function of XX.

Lemma 6.

(Lemma 2.1 in Bickel and Lindner, (2012)) Let matrix AA be kk-banded, symmetric, and positive definite. We denote M=AM=\|A\| and m=1/A1m=1/\|A^{-1}\|, and for n0n\in\mathbb{N}_{0}, we define

Bn=γj=0n(IγA)j,\displaystyle B_{n}=\gamma\sum_{j=0}^{n}(I-\gamma A)^{j}, (F.2)

where γ=2/(M+m)\gamma=2/(M+m). Then BnB_{n} is a symmetric positive definite (nk)(nk)-banded matrix; moreover, A1Bnκn+1/m\|A^{-1}-B_{n}\|\leq\kappa^{n+1}/m, where κ=(Mm)/(M+m)<1\kappa=(M-m)/(M+m)<1.
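A minimal numerical sketch of Lemma 6 (the banded test matrix and the truncation level nn are illustrative): it forms BnB_{n} from the truncated Neumann series and compares A1Bn\|A^{-1}-B_{n}\| with the bound κn+1/m\kappa^{n+1}/m.

```python
import numpy as np

# A 1-banded (tridiagonal), symmetric, positive definite test matrix.
d = 50
A = 2.0 * np.eye(d) + 0.5 * (np.eye(d, k=1) + np.eye(d, k=-1))

M = np.linalg.norm(A, 2)                        # M = ||A||
m = 1.0 / np.linalg.norm(np.linalg.inv(A), 2)   # m = 1 / ||A^{-1}||
gamma = 2.0 / (M + m)
kappa = (M - m) / (M + m)

n = 8                                           # truncation level of the Neumann series
B = gamma * sum(np.linalg.matrix_power(np.eye(d) - gamma * A, j) for j in range(n + 1))

err = np.linalg.norm(np.linalg.inv(A) - B, 2)
print(f"||A^-1 - B_n|| = {err:.3e}   bound kappa^(n+1)/m = {kappa ** (n + 1) / m:.3e}")
```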

References

  • Azzalini and Valle, [1996] Azzalini, A. and Valle, A. D. (1996). The multivariate skew-normal distribution. Biometrika, 83(4):715–726.
  • Bhattacharya et al., [2016] Bhattacharya, A., Dunson, D. B., Pati, D., and Pillai, N. S. (2016). Sub-optimality of some continuous shrinkage priors. Stochastic Processes and their Applications, 126(12):3828 – 3842. In Memoriam: Evarist Giné.
  • Bickel and Lindner, [2012] Bickel, P. and Lindner, M. (2012). Approximating the inverse of banded matrices by banded matrices with applications to probability and statistics. Theory of Probability & Its Applications, 56(1):1–20.
  • [4] Bickel, P. J., Levina, E., et al. (2008a). Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577–2604.
  • [5] Bickel, P. J., Levina, E., et al. (2008b). Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1):199–227.
  • Bornkamp and Ickstadt, [2009] Bornkamp, B. and Ickstadt, K. (2009). Bayesian nonparametric estimation of continuous monotone functions with applications to dose–response analysis. Biometrics, 65(1):198–205.
  • Botev, [2017] Botev, Z. I. (2017). The normal law under linear restrictions: simulation and estimation via minimax tilting. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1):125–148.
  • Boucheron et al., [2013] Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford university press.
  • Brezger and Steiner, [2008] Brezger, A. and Steiner, W. J. (2008). Monotonic Regression Based on Bayesian P–splines: An Application to Estimating Price Response Functions from Store-Level Scanner Data. Journal of Business & Economic Statistics, 26(1):90–104.
  • Cai and Dunson, [2007] Cai, B. and Dunson, D. B. (2007). Bayesian multivariate isotonic regression splines: Applications to carcinogenicity studies. Journal of the American Statistical Association, 102(480):1158–1171.
  • Cartinhour, [1990] Cartinhour, J. (1990). One-dimensional marginal density functions of a truncated multivariate normal density function. Communications in Statistics-Theory and Methods, 19(1):197–203.
  • Carvalho et al., [2010] Carvalho, C., Polson, N., and Scott, J. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480.
  • Chipman et al., [2010] Chipman, H. A., George, E. I., McCulloch, R. E., et al. (2010). Bart: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298.
  • Curtis and Ghosh, [2011] Curtis, S. M. and Ghosh, S. K. (2011). A variable selection approach to monotonic regression with Bernstein polynomials. Journal of Applied Statistics, 38(5):961–976.
  • Duan et al., [2020] Duan, L. L., Young, A. L., Nishimura, A., and Dunson, D. B. (2020). Bayesian constraint relaxation. Biometrika, 107(1):191–204.
  • Dunson, [2005] Dunson, D. B. (2005). Bayesian semiparametric isotonic regression for count data. Journal of the American Statistical Association, 100(470):618–627.
  • Gasull and Utzet, [2014] Gasull, A. and Utzet, F. (2014). Approximating Mill’s ratio. Journal of Mathematical Analysis and Applications, 420(2):1832–1853.
  • Genz, [1992] Genz, A. (1992). Numerical computation of multivariate normal probabilities. Journal of computational and graphical statistics, 1(2):141–149.
  • Genz, [1993] Genz, A. (1993). Comparison of methods for the computation of multivariate normal probabilities. Computing Science and Statistics, pages 400–400.
  • Genz and Bretz, [2009] Genz, A. and Bretz, F. (2009). Computation of multivariate normal and t probabilities, volume 195. Springer Science & Business Media.
  • Ghosal et al., [2000] Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. The Annals of Statistics, 28(2):500–531.
  • Hashorva and Hüsler, [2003] Hashorva, E. and Hüsler, J. (2003). On multivariate Gaussian tails. Annals of the Institute of Statistical Mathematics, 55(3):507–522.
  • Lenk and Choi, [2017] Lenk, P. J. and Choi, T. (2017). Bayesian analysis of shape-restricted functions using Gaussian process priors. Statistica Sinica, 27:43–69.
  • Li and Shao, [2001] Li, W. V. and Shao, Q.-M. (2001). Gaussian processes: inequalities, small ball probabilities and applications. Handbook of Statistics, 19:533–597.
  • Lin and Dunson, [2014] Lin, L. and Dunson, D. B. (2014). Bayesian monotone regression using gaussian process projection. Biometrika, 101(2):303–317.
  • Lu, [2016] Lu, D. (2016). A note on the estimates of multivariate Gaussian probability. Communications in Statistics-Theory and Methods, 45(5):1459–1465.
  • Maatouk and Bay, [2017] Maatouk, H. and Bay, X. (2017). Gaussian process emulators for computer experiments with inequality constraints. Mathematical Geosciences, 49(5):557–582.
  • Meyer et al., [2011] Meyer, M. C., Hackstadt, A. J., and Hoeting, J. A. (2011). Bayesian estimation and inference for generalised partial linear models using shape-restricted splines. Journal of Nonparametric Statistics, 23(4):867–884.
  • Neelon and Dunson, [2004] Neelon, B. and Dunson, D. B. (2004). Bayesian isotonic regression and trend analysis. Biometrics, 60(2):398–406.
  • Ray et al., [2019] Ray, P., Pati, D., and Bhattacharya, A. (2019). Efficient Bayesian shape-restricted function estimation with constrained Gaussian process priors. arXiv preprint arXiv:1902.04701.
  • Ruben, [1964] Ruben, H. (1964). An asymptotic expansion for the multivariate normal distribution and Mill’s ratio. Journal of Research of the National Bureau of Standards, Mathematics and Mathematical Physics B, 68:1.
  • Savage, [1962] Savage, I. R. (1962). Mill’s ratio for multivariate normal distributions. J. Res. Nat. Bur. Standards Sect. B, 66:93–96.
  • Shively et al., [2011] Shively, T. S., Walker, S. G., and Damien, P. (2011). Nonparametric function estimation subject to monotonicity, convexity and other shape constraints. Journal of Econometrics, 161(2):166–181.
  • Sidák, [1968] Sidák, Z. (1968). On multivariate normal probabilities of rectangles: their dependence on correlations. The Annals of Mathematical Statistics, 39(5):1425–1434.
  • Steck, [1979] Steck, G. P. (1979). Lower bounds for the multivariate normal Mill’s ratio. The Annals of Probability, 7(3):547–551.
  • Sun and Berger, [1998] Sun, D. and Berger, J. O. (1998). Reference priors with partial information. Biometrika, 85(1):55–71.
  • Talagrand, [1995] Talagrand, M. (1995). Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l’Institut des Hautes Etudes Scientifiques, 81(1):73–205.
  • [38] Vershynin, R. (2018a). High–Dimensional Probability : An Introduction with Applications in Data Science. Cambridge University Press, Cambridge, United Kingdom New York, NY.
  • [39] Vershynin, R. (2018b). High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press.
  • Yoo et al., [2016] Yoo, W. W., Ghosal, S., et al. (2016). Supremum norm posterior contraction and credible sets for nonparametric multivariate regression. The Annals of Statistics, 44(3):1069–1102.