
Unified Bayesian theory of sparse linear regression with nuisance parameters

Research partially supported by a Faculty Research and Professional Development Grant from the College of Sciences of North Carolina State University.

Seonghyun Jeong ([email protected])
Department of Statistics and Data Science, Department of Applied Statistics, Yonsei University, Seoul 03722, South Korea

Subhashis Ghosal ([email protected])
Department of Statistics, North Carolina State University, Raleigh, NC 27607, USA
Abstract

We study frequentist asymptotic properties of Bayesian procedures for high-dimensional Gaussian sparse regression when unknown nuisance parameters are involved. Nuisance parameters can be finite-, high-, or infinite-dimensional. A mixture of point masses at zero and continuous distributions is used as the prior distribution on the sparse regression coefficients, and appropriate prior distributions are used for the nuisance parameters. The optimal posterior contraction of the sparse regression coefficients, which can be hampered by the presence of nuisance parameters, is also examined and discussed. It is shown that the procedure yields strong model selection consistency. A Bernstein-von Mises-type theorem for the sparse regression coefficients is also obtained for uncertainty quantification through credible sets with guaranteed frequentist coverage. Asymptotic properties of numerous examples are investigated using the theories developed in this study.

MSC classification: 62F15

Keywords: Bernstein-von Mises theorems, High-dimensional regression, Model selection consistency, Posterior contraction rates, Sparse priors

1 Introduction

While Bayesian model selection for classical low-dimensional problems has a long history, sparse estimation in high-dimensional regression was studied much later; see Bondell and Reich [5], Johnson and Rossell [20], and Narisetty and He [24] for consistent Bayesian model selection methods in high-dimensional linear models. Extensive theoretical investigations, however, have been carried out only very recently. Since the pioneering work of Castillo et al. [8], frequentist asymptotic properties of Bayesian sparse regression have been established under various settings, and there is now a substantial body of literature [e.g., 23, 1, 28, 3, 26, 2, 10, 25, 14, 19, 18].

Most of the existing studies deal with sparse regression setups without nuisance parameters, and there are only a few exceptions. An unknown variance parameter, the simplest type of nuisance parameter, was incorporated into high-dimensional linear regression by Song and Liang [28] and Bai et al. [2]. In these studies, the optimal properties of Bayesian procedures are characterized with continuous shrinkage priors. For more involved models, Chae et al. [10] adopted a nonparametric approach to estimate unknown symmetric error densities in sparse linear regression. Ning et al. [25] considered a sparse linear model for vector-valued response variables with unknown covariance matrices.

Although nuisance parameters may not be of primary interest, a modeling framework requires a complete description of their roles, as they explicitly parameterize the model. One therefore wants to achieve optimal estimation properties for the sparse regression coefficients regardless of the form of the nuisance parameter. It may also be of interest to examine posterior contraction of the nuisance parameter as a secondary objective. Despite this, no attempt has so far been made to treat a general class of high-dimensional regression models with nuisance parameters. In this study, we consider a general form of Gaussian sparse regression in the presence of nuisance parameters and establish a theoretical framework for Bayesian procedures.

We formulate a general framework to treat sparse regression models in a unified way as follows. Let $\eta$ be a possibly infinite-dimensional nuisance parameter taking values in a set $\mathbb{H}$. For each $\eta\in\mathbb{H}$ and integers $m_i\in\{1,\dots,\overline{m}\}$ for some $\overline{m}\geq 1$, suppose that there are a vector $\xi_{\eta,i}\in\mathbb{R}^{m_i}$ and a positive definite matrix $\Delta_{\eta,i}\in\mathbb{R}^{m_i\times m_i}$ which define a regression model for a vector-valued response variable $Y_i\in\mathbb{R}^{m_i}$ against covariates $X_i\in\mathbb{R}^{m_i\times p}$ given by

$$Y_i = X_i\theta + \xi_{\eta,i} + \varepsilon_i,\quad \varepsilon_i\overset{\rm ind}{\sim}\mathrm{N}_{m_i}(0,\Delta_{\eta,i}),\quad i=1,\dots,n,\qquad(1)$$

where $\theta\in\mathbb{R}^p$ is a vector of regression coefficients. Here $m_i$ (and $\overline{m}$) can increase with $n$. We consider the high-dimensional situation where $p>n$, but $\theta$ is assumed to be sparse, with many coordinates zero. The form in (1) clearly includes sparse linear regression with unknown error variances. Our main interest lies in more complicated setups. As will be discussed shortly in Section 1.1, many interesting examples belong to form (1).
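To make the setup concrete, the following minimal Python sketch simulates one data set from model (1). Everything here (the function name, the dimensions, and the simple choices $\xi_{\eta,i}=0$ and $\Delta_{\eta,i}=I$) is an illustrative assumption, not part of the paper.

```python
import numpy as np

# Minimal sketch: draw Y_i = X_i theta + xi_i + eps_i with
# eps_i ~ N_{m_i}(0, Delta_i), independently over i, as in model (1).
# All names and dimensions here are illustrative.
def simulate_model(X_list, theta, xi_list, Delta_list, rng):
    """One draw of the response vectors Y_1, ..., Y_n from model (1)."""
    return [rng.multivariate_normal(X_i @ theta + xi_i, Delta_i)
            for X_i, xi_i, Delta_i in zip(X_list, xi_list, Delta_list)]

rng = np.random.default_rng(0)
n, p, m = 5, 20, 3                          # p > n: high-dimensional regime
theta = np.zeros(p)
theta[[0, 3]] = [1.5, -2.0]                 # sparse coefficient vector
X_list = [rng.standard_normal((m, p)) for _ in range(n)]
xi_list = [np.zeros(m)] * n                 # no mean shift from eta
Delta_list = [np.eye(m)] * n                # identity error covariance
Y = simulate_model(X_list, theta, xi_list, Delta_list, rng)
```

The examples of Section 1.1 are obtained by specializing `xi_list` and `Delta_list`.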

In this paper, we develop a unified theory of posterior asymptotics for the high-dimensional sparse regression models described by (1). To the best of our knowledge, no study thus far has considered a general modeling framework for sparse regression as in (1), even from the frequentist perspective. Existing results on complicated high-dimensional regression models are available only at the level of specific models and cannot be used universally across different model classes. Our approach, in contrast, provides a unified theoretical treatment of the general model structure in (1) under the Bayesian framework. We establish general theorems on nearly optimal posterior contraction rates, a Bernstein-von Mises theorem via shape approximation to the posterior distribution of $\theta$, and model selection consistency.

The general theory of posterior contraction using the canonical root-average-squared Hellinger metric on the joint density [16] is not very useful in this context, since recovering rates in terms of the metric of interest on the regression coefficients requires some boundedness conditions [19]. To deal with this issue, we construct an exponentially powerful likelihood ratio test over small pieces of the alternative that are sufficiently separated from the true parameters in terms of the average Rényi divergence of order $1/2$ (which coincides with the average negative log-affinity). This test provides posterior contraction relative to the corresponding divergence. The posterior contraction rates of $\theta$ and $\eta$ can then be recovered in terms of the metrics of interest under mild conditions on the parameter space. Due to the nuisance parameter $\eta$, the resulting posterior contraction for $\theta$ may be suboptimal. Conditions for optimal posterior contraction will also be examined. Our results show that the obtained posterior contraction rates are adaptive to the unknown sparsity level.

For a Bernstein-von Mises theorem and selection consistency, stronger conditions are required than those used for posterior contraction, in line with the existing literature [e.g., 8, 23]. As pointed out by Chae et al. [10], the Bernstein-von Mises theorems for finite-dimensional parameters in classical semiparametric models [e.g., 7] may not be directly useful in the high-dimensional context. We therefore directly characterize a version of the Bernstein-von Mises theorem for model (1). The key idea is to find a suitable orthogonal projection that satisfies the required conditions, which is typically straightforward if the support of the prior for $\xi_{\eta,i}$ is a linear space. The complexity of the space of covariance matrices, measured by its metric entropy, also plays an important role in deriving the Bernstein-von Mises theorem and selection consistency. Combining these two ingredients yields an approximation by a single normal component, which enables correct quantification of the remaining uncertainty about the parameter through the posterior distribution.

1.1 Sparse linear regression with nuisance parameters

As briefly discussed above, the form in (1) is general and includes many interesting statistical models. Here we describe specific examples belonging to (1) in detail. In Section 5, the main results developed in this study will be applied to these examples.

Example 1 (Multiple response models with missing components).

We consider a general multiple response model with missing values, which is very common in practice. Suppose that for each $i$, a vector of $\overline{m}$ responses with covariance matrix $\Sigma$ is supposed to be observed, but for the $i$th group (or subject) only $m_i$ entries are actually observed, with the rest missing. Letting $Y_i\in\mathbb{R}^{m_i}$ be the $i$th observation and $Y_i^{\rm aug}\in\mathbb{R}^{\overline{m}}$ be the augmented vector of $Y_i$ and the missing entries, we can write $Y_i=E_i^TY_i^{\rm aug}$ and $\mathrm{Cov}(Y_i)=E_i^T\Sigma E_i$, where $E_i\in\mathbb{R}^{\overline{m}\times m_i}$ is the submatrix of the $\overline{m}\times\overline{m}$ identity matrix whose $j$th column is included if the $j$th element of $Y_i^{\rm aug}$ is observed, $j=1,\dots,\overline{m}$. Assuming that the mean of $Y_i$ is $X_i\theta$ for covariates $X_i\in\mathbb{R}^{m_i\times p}$ and sparse coefficients $\theta\in\mathbb{R}^p$ with $p>n$, the model of interest can be written as $Y_i=X_i\theta+\varepsilon_i$, $\varepsilon_i\overset{\rm ind}{\sim}\mathrm{N}_{m_i}(0,E_i^T\Sigma E_i)$, $i=1,\dots,n$. The model belongs to the class described by (1) with $\xi_{\eta,i}=0_{m_i}$ and $\Delta_{\eta,i}=E_i^T\Sigma E_i$ for $\eta=\Sigma$.
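The bookkeeping with the selection matrix $E_i$ can be sketched in a few lines; the function name and numbers below are hypothetical, and the check simply confirms that $E_i^T\Sigma E_i$ picks out the observed block of $\Sigma$.

```python
import numpy as np

# Example 1's bookkeeping: E_i holds the columns of the identity matrix
# for the observed coordinates, so Y_i = E_i^T Y_i^aug has covariance
# E_i^T Sigma E_i, the observed block of Sigma.  Values are illustrative.
def selection_matrix(observed, m_bar):
    """Submatrix of the m_bar x m_bar identity with the observed columns."""
    return np.eye(m_bar)[:, observed]

m_bar = 4
Sigma = np.array([[2.0, 0.5, 0.0, 0.1],
                  [0.5, 1.0, 0.3, 0.0],
                  [0.0, 0.3, 1.5, 0.2],
                  [0.1, 0.0, 0.2, 1.0]])
observed = [0, 2, 3]                  # second coordinate is missing
E = selection_matrix(observed, m_bar)
cov_Y = E.T @ Sigma @ E               # covariance of the observed entries
```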

Example 2 (Multivariate measurement error models).

Suppose that a scalar response variable $Y_i^\ast\in\mathbb{R}$ is connected to fixed covariates $X_i^\ast\in\mathbb{R}^p$ with $p>n$ and random covariates $Z_i\in\mathbb{R}^q$ with fixed $q\geq 1$, through the linear additive relationship $Y_i^\ast=\alpha+X_i^{\ast T}\theta+Z_i^T\beta+\varepsilon_i^\ast$, $Z_i\overset{\rm iid}{\sim}\mathrm{N}_q(\mu,\Sigma)$, $\varepsilon_i^\ast\overset{\rm iid}{\sim}\mathrm{N}(0,\sigma^2)$, $i=1,\dots,n$. While $X_i^\ast$ is fully observed without noise, we observe a surrogate $W_i$ of $Z_i$ given by $W_i=Z_i+\tau_i$, $\tau_i\overset{\rm iid}{\sim}\mathrm{N}_q(0,\Psi)$, where, to ensure identifiability, $\Psi$ is assumed to be known. This type of model is called a measurement error model or an errors-in-variables model; see Fuller [13] and Carroll et al. [6] for a complete overview. By direct calculation, the joint distribution of $(Y_i^\ast,W_i)$ is given by

$$\begin{pmatrix}Y_i^\ast\\ W_i\end{pmatrix}\overset{\rm ind}{\sim}\mathrm{N}_{q+1}\left(\begin{pmatrix}\alpha+X_i^{\ast T}\theta+\mu^T\beta\\ \mu\end{pmatrix},\begin{pmatrix}\beta^T\Sigma\beta+\sigma^2&\beta^T\Sigma\\ \Sigma\beta&\Sigma+\Psi\end{pmatrix}\right).$$

Writing $Y_i=(Y_i^\ast,W_i^T)^T\in\mathbb{R}^{q+1}$, $X_i=(X_i^\ast,0_{p\times q})^T\in\mathbb{R}^{(q+1)\times p}$, $\xi_{\eta,i}=(\alpha+\mu^T\beta,\mu^T)^T\in\mathbb{R}^{q+1}$, and $\Delta_{\eta,i}=\left(\begin{smallmatrix}\beta^T\Sigma\beta+\sigma^2&\beta^T\Sigma\\ \Sigma\beta&\Sigma+\Psi\end{smallmatrix}\right)\in\mathbb{R}^{(q+1)\times(q+1)}$ with $\eta=(\alpha,\beta,\mu,\sigma^2,\Sigma)$, the model is of form (1) with $m_i=q+1$.
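The displayed covariance can be verified mechanically: $(Y_i^\ast,W_i)$ is a linear map $A$ of the independent sources $(Z_i-\mu,\varepsilon_i^\ast,\tau_i)$, so its covariance is $ASA^T$ with $S$ block-diagonal. A small numerical check with made-up parameter values (a sketch, not the paper's code):

```python
import numpy as np

# Check of Example 2's joint covariance: (Y*_i, W_i) is a linear map A of
# the independent sources (Z_i - mu, eps*_i, tau_i), so Cov = A S A^T
# with S block-diagonal.  Parameter values are made up.
q = 2
beta = np.array([0.7, -1.2])
sigma2 = 0.5
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
Psi = 0.25 * np.eye(q)

# Closed form from the display
Delta = np.block([[np.array([[beta @ Sigma @ beta + sigma2]]),
                   (beta @ Sigma)[None, :]],
                  [(Sigma @ beta)[:, None], Sigma + Psi]])

# Independent reconstruction via the linear map A
A = np.zeros((q + 1, 2 * q + 1))
A[0, :q] = beta            # Y* loads on Z - mu ...
A[0, q] = 1.0              # ... and on eps*
A[1:, :q] = np.eye(q)      # W loads on Z - mu ...
A[1:, q + 1:] = np.eye(q)  # ... and on tau
S = np.zeros((2 * q + 1, 2 * q + 1))
S[:q, :q] = Sigma
S[q, q] = sigma2
S[q + 1:, q + 1:] = Psi
Delta_check = A @ S @ A.T
```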

Example 3 (Parametric correlation structure).

For $m_i\geq 1$, $i=1,\dots,n$, suppose that we have a response variable $Y_i\in\mathbb{R}^{m_i}$ and covariates $X_i\in\mathbb{R}^{m_i\times p}$ with $p>n$. We consider the standard regression model $Y_i=X_i\theta+\varepsilon_i$, $\varepsilon_i\overset{\rm ind}{\sim}\mathrm{N}_{m_i}(0,\Sigma_i)$, $i=1,\dots,n$, where $m_i$ is possibly increasing. For a known parametric correlation structure $G_i$ and a fixed-dimensional Euclidean parameter $\alpha$, we model the covariance matrix as $\Sigma_i=\sigma^2G_i(\alpha)$ using a variance parameter $\sigma^2$ and a correlation matrix $G_i(\alpha)\in\mathbb{R}^{m_i\times m_i}$. Examples of $G_i$ include first-order autoregressive and moving average correlation matrices. The model belongs to (1) by writing $\xi_{\eta,i}=0_{m_i}$ and $\Delta_{\eta,i}=\sigma^2G_i(\alpha)$ with $\eta=(\alpha,\sigma^2)$.
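As a concrete instance of $G_i(\alpha)$, the AR(1) correlation matrix has $(j,k)$ entry $\alpha^{|j-k|}$. A small sketch (names and values are illustrative):

```python
import numpy as np

# A concrete G_i(alpha) for Example 3: the first-order autoregressive
# correlation matrix, with Sigma_i = sigma2 * G_i(alpha).
def ar1_corr(m, alpha):
    """AR(1) correlation matrix: entry (j, k) is alpha ** |j - k|."""
    idx = np.arange(m)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

sigma2, alpha, m = 2.0, 0.6, 4
Sigma_i = sigma2 * ar1_corr(m, alpha)
```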

Example 4 (Mixed effects models).

For $m_i\geq 1$, $i=1,\dots,n$, let $Y_i\in\mathbb{R}^{m_i}$ be a response variable, and let $X_i\in\mathbb{R}^{m_i\times p}$ with $p>n$ and $Z_i\in\mathbb{R}^{m_i\times q}$ with fixed $q\geq 1$ be covariates. Consider the mixed effects model $Y_i=X_i\theta+Z_ib_i+\varepsilon_i^\ast$, $b_i\overset{\rm iid}{\sim}{\rm N}_q(0,\Psi)$, $\varepsilon_i^\ast\overset{\rm ind}{\sim}{\rm N}_{m_i}(0,\sigma^2I_{m_i})$, $i=1,\dots,n$, where $\Psi\in\mathbb{R}^{q\times q}$ is a positive definite matrix. The marginal law of $Y_i$ is then given by $Y_i=X_i\theta+\varepsilon_i$, $\varepsilon_i\overset{\rm ind}{\sim}{\rm N}_{m_i}(0,\sigma^2I_{m_i}+Z_i\Psi Z_i^T)$. We assume that $\sigma^2$ is known. The model belongs to (1) by letting $\xi_{\eta,i}=0_{m_i}$ and $\Delta_{\eta,i}=\sigma^2I_{m_i}+Z_i\Psi Z_i^T$ with $\eta=\Psi$.
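Marginalizing the random effect is a one-line covariance computation. The sketch below (with made-up $Z_i$ and $\Psi$) builds $\Delta_{\eta,i}=\sigma^2I+Z_i\Psi Z_i^T$ and confirms it is symmetric with eigenvalues bounded below by $\sigma^2$, since $Z_i\Psi Z_i^T$ is positive semi-definite:

```python
import numpy as np

# Example 4's marginal covariance: integrating out b_i ~ N_q(0, Psi) in
# Y_i = X_i theta + Z_i b_i + eps*_i gives Delta_i = sigma2 I + Z_i Psi Z_i^T.
# Since Z_i Psi Z_i^T is positive semi-definite, every eigenvalue of
# Delta_i is at least sigma2.  Values below are illustrative.
rng = np.random.default_rng(1)
m_i, q, sigma2 = 4, 2, 0.5
Z_i = rng.standard_normal((m_i, q))
Psi = np.array([[1.0, 0.2],
                [0.2, 0.8]])

Delta_i = sigma2 * np.eye(m_i) + Z_i @ Psi @ Z_i.T
eigvals = np.linalg.eigvalsh(Delta_i)
```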

Example 5 (Graphical structure with sparse precision matrices).

For a response variable $Y_i\in\mathbb{R}^{\overline{m}}$ and covariates $X_i\in\mathbb{R}^{\overline{m}\times p}$ with increasing $\overline{m}\geq 1$ and $p>n$, consider the model $Y_i=X_i\theta+\varepsilon_i$, $\varepsilon_i\overset{\rm iid}{\sim}\mathrm{N}_{\overline{m}}(0,\Omega^{-1})$, $i=1,\dots,n$, where $\theta$ is a sparse coefficient vector and the precision matrix $\Omega\in\mathbb{R}^{\overline{m}\times\overline{m}}$ is positive definite. Along with $\theta$, we also impose sparsity on the off-diagonal entries of $\Omega$, which accounts for a graphical structure between observations. More precisely, a zero off-diagonal entry implies conditional independence of the two corresponding entries of $\varepsilon_i$ given the remaining ones, and we suppose that most off-diagonal entries are actually zero, even though we do not know their locations. The model is then seen to be a special case of (1) by letting $\xi_{\eta,i}=0_{\overline{m}}$ and $\Delta_{\eta,i}=\Omega^{-1}$ with $\eta=\Omega$.
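The conditional-independence reading of zero off-diagonal entries can be illustrated numerically. For a Gaussian vector with precision matrix $\Omega$, the conditional precision of a pair of coordinates given all the others is the corresponding $2\times 2$ submatrix of $\Omega$, so $\Omega_{jk}=0$ forces zero conditional covariance even when the marginal covariance is nonzero. A tridiagonal toy example (values are illustrative):

```python
import numpy as np

# Omega[0, 2] = 0 encodes conditional independence of eps_0 and eps_2
# given eps_1: the conditional precision of the pair (0, 2) is the 2x2
# submatrix of Omega, whose off-diagonal is zero, even though the
# marginal covariance Sigma = inv(Omega) has a nonzero (0, 2) entry.
Omega = np.array([[2.0, 0.6, 0.0],
                  [0.6, 2.0, 0.6],
                  [0.0, 0.6, 2.0]])

Sigma = np.linalg.inv(Omega)   # marginal covariance: entry (0, 2) nonzero
cond_cov = np.linalg.inv(Omega[np.ix_([0, 2], [0, 2])])  # given eps_1
```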

Example 6 (Nonparametric heteroskedastic regression models).

For a response variable $Y_i\in\mathbb{R}$ and a row vector of covariates $X_i\in\mathbb{R}^{1\times p}$, a linear regression model with a nonparametric heteroskedastic error is given by $Y_i=X_i\theta+\varepsilon_i$, $\varepsilon_i\overset{\rm ind}{\sim}\mathrm{N}(0,v(z_i))$, $i=1,\dots,n$, where $\theta$ is a sparse coefficient vector, $v:[0,1]\mapsto(0,\infty)$ is a univariate variance function, and $z_i\in[0,1]$ is a one-dimensional variable associated with the $i$th observation that controls the variance of $Y_i$ through the variance function $v$. The model then belongs to (1) by letting $\xi_{\eta,i}=0$ and $\Delta_{\eta,i}=v(z_i)$ with $\eta=v$.

Example 7 (Partial linear models).

Consider a partial linear model given by $Y_i=X_i\theta+g(z_i)+\varepsilon_i$, $\varepsilon_i\overset{\rm iid}{\sim}\mathrm{N}(0,\sigma^2)$, $i=1,\dots,n$, where $Y_i\in\mathbb{R}$ is a response variable, $X_i\in\mathbb{R}^{1\times p}$ is a row vector of covariates with $p>n$, $\theta\in\mathbb{R}^p$ is a sparse coefficient vector, $g:[0,1]\mapsto\mathbb{R}$ is a univariate function, and $z_i\in[0,1]$ is a scalar predictor. This model is expressed in form (1) by writing $\xi_{\eta,i}=g(z_i)$ and $\Delta_{\eta,i}=\sigma^2$ with $\eta=(g,\sigma^2)$.

1.2 Outline

The rest of this paper is organized as follows. In Section 2, notation is introduced and a prior distribution on the sparse regression coefficients is specified. Sections 3 and 4 provide our main results on posterior contraction, the Bernstein-von Mises phenomenon, and selection consistency of the posterior distribution. In Section 5, our general theorems are applied to the examples considered above to derive the posterior asymptotic properties in each specific example. All technical proofs are provided in the Appendix.

2 Setup, notation, and prior specification

2.1 Notation

Here we describe the notation used throughout this paper. For a vector $\theta=(\theta_j)\in\mathbb{R}^p$ and a set $S\subset\{1,\dots,p\}$ of indices, we write $S_\theta=\{j:\theta_j\neq 0\}$ for the support of $\theta$, $s\coloneqq|S|$ (or $s_\theta\coloneqq|S_\theta|$) for the cardinality of $S$ (or $S_\theta$), and $\theta_S=\{\theta_j:j\in S\}$ and $\theta_{S^c}=\{\theta_j:j\notin S\}$ for the components of $\theta$ separated by $S$. In particular, the support of the true parameter $\theta_0$ and its cardinality are written as $S_0$ and $s_0\coloneqq|S_0|$, respectively. The notation $\lVert\theta\rVert_q=(\sum_j|\theta_j|^q)^{1/q}$, $1\leq q<\infty$, stands for the $\ell_q$-norm, and $\lVert\theta\rVert_\infty=\max_j|\theta_j|$ denotes the maximum norm. We write $\rho_{\min}(A)$ and $\rho_{\max}(A)$ for the minimum and maximum eigenvalues of a square matrix $A$, respectively. For a matrix $X=(\!(x_{ij})\!)$, let $\lVert X\rVert_{\rm sp}=\rho_{\max}^{1/2}(X^TX)$ stand for the spectral norm and $\lVert X\rVert_{\rm F}=(\sum_{i,j}x_{ij}^2)^{1/2}$ for the Frobenius norm of $X$. We also define the matrix norm $\lVert X\rVert_\ast=\max_j\lVert X_{\cdot j}\rVert_2$, for $X_{\cdot j}$ the $j$th column of $X$, which is used for compatibility conditions. The column space of $X$ is denoted by ${\rm span}(X)$. For further convenience, we write $\varsigma_{\min}(X)=\rho_{\min}^{1/2}(X^TX)$ for the minimum singular value of $X$. The notation $X_S$ means the submatrix of $X$ with columns chosen by $S$. For sequences $a_n$ and $b_n$, $a_n\lesssim b_n$ (or $b_n\gtrsim a_n$) means $a_n\leq Cb_n$ for some constant $C>0$ independent of $n$, and $a_n\asymp b_n$ means $a_n\lesssim b_n\lesssim a_n$. These inequalities are also used for relations involving constant sequences.

For given parameters $\theta$ and $\eta$, we write the joint density as $p_{\theta,\eta}=\prod_{i=1}^np_{\theta,\eta,i}$, for $p_{\theta,\eta,i}$ the density of the $i$th observation vector $Y_i$. In particular, the true joint density is expressed as $p_0=\prod_{i=1}^np_{0,i}$ for $p_{0,i}\coloneqq p_{\theta_0,\eta_0,i}$, with the true parameters $\theta_0$ and $\eta_0$. The notation $\mathbb{E}_0$ denotes the expectation operator under the true density $p_0$. For two probability measures $P$ and $Q$, let $\lVert P-Q\rVert_{\rm TV}$ denote the total variation distance between $P$ and $Q$. For two $n$-variate densities $f\coloneqq\prod_{i=1}^nf_i$ and $g\coloneqq\prod_{i=1}^ng_i$ of independent variables, denote the average Rényi divergence (of order $1/2$) by $R_n(f,g)=-n^{-1}\sum_{i=1}^n\log\int\sqrt{f_ig_i}$.
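For Gaussian factors the per-term negative log-affinity $-\log\int\sqrt{f_ig_i}$ has a closed form (the Bhattacharyya distance). The sketch below, for two univariate normals, checks the closed form against direct numerical integration; the function names are ours, not the paper's.

```python
import numpy as np

# The average Renyi divergence of order 1/2 is the average negative
# log-affinity -n^{-1} sum_i log int sqrt(f_i g_i).  For two univariate
# normals the per-term quantity has a closed (Bhattacharyya) form.
def neg_log_affinity_normal(mu1, v1, mu2, v2):
    """-log int sqrt(f g) for f = N(mu1, v1), g = N(mu2, v2)."""
    return (0.25 * (mu1 - mu2) ** 2 / (v1 + v2)
            + 0.5 * np.log((v1 + v2) / (2.0 * np.sqrt(v1 * v2))))

def neg_log_affinity_numeric(mu1, v1, mu2, v2):
    """Trapezoidal-rule evaluation of the same integral."""
    x = np.linspace(-30.0, 30.0, 200001)
    f = np.exp(-(x - mu1) ** 2 / (2 * v1)) / np.sqrt(2 * np.pi * v1)
    g = np.exp(-(x - mu2) ** 2 / (2 * v2)) / np.sqrt(2 * np.pi * v2)
    h = np.sqrt(f * g)
    dx = x[1] - x[0]
    return -np.log(np.sum(0.5 * (h[1:] + h[:-1])) * dx)

closed = neg_log_affinity_normal(0.3, 1.0, -0.5, 2.0)
numeric = neg_log_affinity_numeric(0.3, 1.0, -0.5, 2.0)
```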

For any $\eta_1,\eta_2\in\mathbb{H}$, we define $d_n^2(\eta_1,\eta_2)=d_{A,n}^2(\eta_1,\eta_2)+d_{B,n}^2(\eta_1,\eta_2)$ for the two squared pseudo-metrics

$$d_{A,n}^2(\eta_1,\eta_2)=\frac{1}{n}\sum_{i=1}^n\lVert\xi_{\eta_1,i}-\xi_{\eta_2,i}\rVert_2^2,\quad d_{B,n}^2(\eta_1,\eta_2)=\frac{1}{n}\sum_{i=1}^n\lVert\Delta_{\eta_1,i}-\Delta_{\eta_2,i}\rVert_{\rm F}^2.$$

For compatibility conditions, the uniform compatibility number $\phi_1$ and the smallest scaled singular value $\phi_2$ are defined as

$$\phi_1(s)=\inf_{\theta:1\leq|S_\theta|\leq s}\frac{\lVert X\theta\rVert_2|S_\theta|^{1/2}}{\lVert X\rVert_\ast\lVert\theta\rVert_1},\quad \phi_2(s)=\inf_{\theta:1\leq|S_\theta|\leq s}\frac{\lVert X\theta\rVert_2}{\lVert X\rVert_\ast\lVert\theta\rVert_2}.$$
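For tiny $p$, $\phi_2(s)$ can be computed by brute force: for $\theta$ supported on a fixed $S$, the infimum of $\lVert X\theta\rVert_2/\lVert\theta\rVert_2$ is the smallest singular value of $X_S$, so $\phi_2(s)=\min_{1\leq|S|\leq s}\varsigma_{\min}(X_S)/\lVert X\rVert_\ast$. A sketch for illustration, not an efficient algorithm:

```python
import numpy as np
from itertools import combinations

# Brute-force phi_2(s) for tiny p: enumerate all supports S with
# 1 <= |S| <= s and take the smallest singular value of X_S, scaled
# by ||X||_* (the maximum column norm).
def x_star_norm(X):
    """||X||_* = max_j ||X_{.j}||_2, the maximum column norm."""
    return np.linalg.norm(X, axis=0).max()

def phi2(X, s):
    p = X.shape[1]
    smin = min(np.linalg.svd(X[:, list(S)], compute_uv=False).min()
               for size in range(1, s + 1)
               for S in combinations(range(p), size))
    return smin / x_star_norm(X)

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 6))
val = phi2(X, 2)
```

By construction $\phi_2$ is nonincreasing in $s$, and $\phi_2(1)\leq 1$ since a single-column $X_S$ has singular value $\lVert X_{\cdot j}\rVert_2\leq\lVert X\rVert_\ast$.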

We write $Y^{(n)}=(Y_1^T,\dots,Y_n^T)^T$ for the observation vector, $n_\ast=\sum_{i=1}^nm_i$ for the dimension of $Y^{(n)}$, and $\Theta=\mathbb{R}^p$ for the parameter space of $\theta$. Lastly, for a (pseudo-)metric space $({\cal F},d)$, let $N(\epsilon,{\cal F},d)$ denote the $\epsilon$-covering number, the minimal number of $\epsilon$-balls needed to cover ${\cal F}$.

2.2 Prior for the high-dimensional coefficients

In this subsection, we specify a prior distribution for the high-dimensional regression coefficients $\theta$. A prior for $\eta$ should satisfy the conditions required for the main results, so its specific characterization is deferred to Section 3. The prior for $\theta$ specified here, on the other hand, serves all our purposes and satisfies every requirement.

We first select a dimension $s$ from a prior $\pi_p$, and then randomly choose $S\subset\{1,\dots,p\}$ of the given size $s$. The nonzero part $\theta_S$ of $\theta$ is then drawn from a prior $g_S$ on $\mathbb{R}^s$, while $\theta_{S^c}$ is fixed at zero. The resulting prior specification for $(S,\theta)$ is formulated as

$$(S,\theta)\mapsto\frac{\pi_p(s)}{\binom{p}{s}}g_S(\theta_S)\delta_0(\theta_{S^c}),\qquad(2)$$

where $\delta_0$ is the Dirac measure at zero on $\mathbb{R}^{p-s}$ with suppressed dimensionality. For the prior $\pi_p$ on the model dimension, we consider a prior satisfying the following: for some constants $A_1,A_2,A_3,A_4>0$,

$$A_1p^{-A_3}\pi_p(s-1)\leq\pi_p(s)\leq A_2p^{-A_4}\pi_p(s-1),\quad s=1,\dots,p.\qquad(3)$$

Examples of priors satisfying (3) can be found in Castillo and van der Vaart [9] and Castillo et al. [8]. For the prior $g_S$, the $s$-fold product of the exponential power density is considered, where the regularization parameter is allowed to vary with $p$ and $\lVert X\rVert_\ast$, i.e.,

$$g_S(\theta_S)=\prod_{j\in S}\frac{\lambda}{2}\exp\left(-\lambda|\theta_j|\right),\quad\frac{\lVert X\rVert_\ast}{L_1p^{L_2}}\leq\lambda\leq\frac{L_3\lVert X\rVert_\ast}{\sqrt{n}},\qquad(4)$$

for some constants $L_1,L_2,L_3>0$. The order of $\lambda$ is important in that it determines the boundedness requirement on the true signal $\theta_0$ (see condition (C3) below). A particularly interesting case is obtained when $\lambda$ is set to the lower bound $\lVert X\rVert_\ast/(L_1p^{L_2})$; the boundedness condition then becomes very mild if $L_2$ is chosen sufficiently large. When $\lambda$ is set to the upper bound, the boundedness condition is still reasonably mild. It can actually be relaxed further if the true signal is known to be small enough, though we do not pursue this generalization in this study. In Section 4, we shall see that values of $\lambda$ that do not increase too fast are in fact necessary for a distributional approximation and selection consistency.
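A draw from the prior (2)–(4) can be sketched as follows: a dimension $s$ from $\pi_p$, a uniformly random support of size $s$, and independent Laplace$(\lambda)$ coordinates on the support. Here $\pi_p(s)\propto c^{-s}$ is one simple choice, satisfying (3) when $c$ grows like a power of $p$; the value $c=2$ below is purely for illustration, as are all names.

```python
import numpy as np

# One draw of (S, theta) from the spike-and-slab prior (2)-(4).
# pi_p(s) proportional to c^{-s} is a simple choice of dimension prior;
# for (3), c should grow like a power of p, but c = 2 keeps this toy
# draw interesting.
def draw_theta(p, lam, c, rng):
    s_grid = np.arange(p + 1)
    w = c ** (-s_grid.astype(float))          # unnormalized pi_p(s)
    s = rng.choice(s_grid, p=w / w.sum())     # model dimension
    S = rng.choice(p, size=s, replace=False)  # uniformly random support
    theta = np.zeros(p)
    # The density (lambda/2) exp(-lambda |t|) is Laplace with scale 1/lambda.
    theta[S] = rng.laplace(scale=1.0 / lam, size=s)
    return theta, np.sort(S)

rng = np.random.default_rng(3)
theta, S = draw_theta(p=50, lam=1.0, c=2.0, rng=rng)
```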

Remark 1.

Since, unlike Castillo et al. [8], some size restriction on $\theta_0$ will be imposed, we note that the use of the Laplace density is not essential and other prior distributions may also be used for $\theta$. For example, normal densities can be used for $g_S$ to exploit semi-conjugacy. However, if its precision parameter is fixed independent of $n$, a normal prior requires a stronger restriction on the true signal than (C3) below. To achieve the nearly optimal posterior contraction, other densities with similar tail properties should also work, with appropriate modifications for the true signal size (see, e.g., Jeong and Ghosal [19]). Instead of the spike-and-slab prior in (2) and (3), a class of continuous shrinkage priors may also be used, at the expense of substantial modifications in the technical details [28]. In this paper, we consider only the prior in (2)–(4).

3 Posterior contraction rates

The prior for the nuisance parameter $\eta$ should be chosen to complete the prior specification. Once priors are assigned to the full set of parameters, the posterior distribution $\Pi(\cdot\,|\,Y^{(n)})$ is defined by Bayes' rule. How the prior for $\eta$ is chosen is crucial for obtaining desirable asymptotic properties of the posterior distribution. In this section, we examine such conditions on the prior distribution for the nuisance parameter and study the posterior contraction rates for both $\theta$ and $\eta$.

The prior for $\eta$ is put on a subspace ${\cal H}\subset\mathbb{H}$. In many instances we take ${\cal H}=\mathbb{H}$, especially when the nuisance parameter is finite dimensional, but the flexibility of a subspace may be beneficial in infinite-dimensional situations. We need to choose ${\cal H}$ to satisfy certain conditions.

(C1) There exists a nondecreasing sequence $a_n=o(n)$ such that

$$a_n\max_{1\leq i\leq n}\lVert\Delta_{\eta^{\prime},i}-\Delta_{\eta_0,i}\rVert_{\rm F}^2\eqqcolon e_n\rightarrow 0\quad\text{for some }\eta^{\prime}\in{\cal H},$$

$$\max_{1\leq i\leq n}\lVert\Delta_{\eta_1,i}-\Delta_{\eta_2,i}\rVert_{\rm F}^2\leq a_nd_{B,n}^2(\eta_1,\eta_2),\quad\eta_1,\eta_2\in{\cal H}.$$

(C2) For some sequence $\bar{\epsilon}_n$ such that $a_n\bar{\epsilon}_n^2\rightarrow 0$ and $n\bar{\epsilon}_n^2\rightarrow\infty$, with $a_n$ satisfying (C1),

$$\log\Pi\left(\eta\in{\cal H}:d_n(\eta,\eta_0)\leq\bar{\epsilon}_n\right)\gtrsim-n\bar{\epsilon}_n^2.$$

The first condition of (C1) implies that the parameter set ${\cal H}$ contains a good approximation to the true parameter value. It holds trivially if there exists $\eta^{\prime}\in{\cal H}$ such that $\Delta_{\eta^{\prime},i}=\Delta_{\eta_0,i}$ for every $i\leq n$, which is obviously true if $\eta_0\in{\cal H}$. The second condition of (C1) means that on ${\cal H}$, the maximum Frobenius distance between covariance matrices is controlled by the average Frobenius distance multiplied by the sequence $a_n$. Clearly, this holds with $a_n=1$ if $\Delta_{\eta,i}$ is the same for every $i\leq n$. By the triangle inequality, we see that (C1) implies that

$$\max_{1\leq i\leq n}\lVert\Delta_{\eta,i}-\Delta_{\eta_0,i}\rVert_{\rm F}^2\lesssim e_n+a_nd_{B,n}^2(\eta,\eta_0),\quad\eta\in{\cal H},\qquad(5)$$

which is used throughout the paper. Condition (C2) is typically called the prior concentration condition; it requires the prior to put sufficient mass around the true parameter $\eta_0$, as measured by the pseudo-metric $d_n$. As in other infinite-dimensional situations, this closeness translates into closeness in terms of the Kullback-Leibler divergence and variation (see Lemma 1 in the Appendix for more details).

As noted in Section 1, the true parameters should be restricted to a certain norm-bounded subset of the parameter space. This is made precise as follows.

(C3) The true signal satisfies $\lVert\theta_0\rVert_\infty\lesssim\lambda^{-1}\log p$.

(C4) The eigenvalues of the true covariance matrices satisfy

$$1\lesssim\min_{1\leq i\leq n}\rho_{\min}(\Delta_{\eta_0,i})\leq\max_{1\leq i\leq n}\rho_{\max}(\Delta_{\eta_0,i})\lesssim 1.$$

Condition (C3) is required to apply the general strategy for posterior contraction to our modeling framework containing nuisance parameters. More specifically, the condition is imposed so that the prior assigns sufficient mass to a Kullback-Leibler neighborhood of $\theta_0$. If nuisance parameters are not present, one can handle the model directly and such a restriction may be removed [e.g., 8, 14]. One may refer to Song and Liang [28], Ning et al. [25], and Bai et al. [2] for conditions similar to ours, where a variance parameter plays the role of the nuisance parameter. Still, the condition is mild if $\lambda$ is chosen to decrease at an appropriate order. In particular, if $\lambda$ is matched to the lower bound $\lVert X\rVert_\ast/(L_1p^{L_2})$, the condition becomes $\lVert\theta_0\rVert_\infty\lesssim(p^{L_2}\log p)/\lVert X\rVert_\ast$, which is very mild if $L_2$ is sufficiently large. Even if the upper bound $L_3\lVert X\rVert_\ast/\sqrt{n}$ is chosen, the condition is not restrictive, as the right-hand side of the condition can be made nondecreasing as long as $\lVert X\rVert_\ast$ increases at a suitable order. Condition (C4) implies that the eigenvalues of the true covariance matrices are bounded below and above. The lower and upper bounds are needed in many technical steps, including the construction of an exponentially powerful test in Lemma 2 in the Appendix.

Remark 2.

Condition (C3) is actually stronger than needed, but is adopted for ease of interpretation. For Theorem 3 below to hold, it suffices that $\lambda\lVert\theta_0\rVert_1\leq(s_0\log p)\vee n\bar{\epsilon}_n^2$ for $\bar{\epsilon}_n$ satisfying (C2). For the optimal posterior contraction in Theorem 4 below, a slightly stronger bound is needed: $\lambda\lVert\theta_0\rVert_1\leq s_0\log p$ (see Lemma 6 and its proof in the Appendix).

3.1 Rényi posterior contraction and recovery

The goal of this subsection is to study posterior contraction of $\theta$ relative to the $\ell_1$- and $\ell_2$-metrics. To do so, we derive the posterior contraction rate with respect to the average Rényi divergence $R_n(f,g)$, and then the rates for $\theta$ relative to more concrete metrics will be recovered from the Rényi contraction.

To proceed, we first need to examine a dimensionality property of the support of $\theta$. The following theorem shows that the posterior distribution is concentrated on models of relatively small sizes.

Theorem 1 (Dimension).

Suppose that (C1)–(C4) are satisfied. Then for $s_\star\coloneqq s_0\vee(n\bar{\epsilon}_n^2/\log p)$, there exists a constant $K_1$ such that

$$\mathbb{E}_0\Pi\left(\theta:s_\theta>K_1s_\star\,\big|\,Y^{(n)}\right)\rightarrow 0.$$

Compared to the literature [e.g., 8, 23, 3], the rate in Theorem 1 is floored by the extra term $n\bar{\epsilon}_n^2/\log p$. This arises from the presence of the nuisance parameter in the model formulation. To minimize its impact, the prior on $\eta$ should be chosen so that (C2) holds with as small an $\bar{\epsilon}_n$ as possible; a suitable choice induces the (nearly) optimal contraction rate.

Building on the basic result in Theorem 1, the next theorem gives the rate at which the posterior distribution contracts around the truth with respect to the average Rényi divergence. The theorem requires additional assumptions on the prior.

  1. (C5)

    For ss0(nϵ¯n2/logp)s_{\star}\coloneqq s_{0}\vee(n\bar{\epsilon}_{n}^{2}/\log p) with ϵ¯n\bar{\epsilon}_{n} satisfying (C2), a sufficiently large B>0B>0, and some sequences γn\gamma_{n} and ϵnslog(pm¯γn)/n\epsilon_{n}\geq\sqrt{s_{\star}\log(p\vee\overline{m}\vee\gamma_{n})/n} satisfying ϵn2/m¯0\epsilon_{n}^{2}/\overline{m}\rightarrow 0, there exists a subset n{\cal H}_{n}\subset{\cal H} such that

    min1ininfηnρmin(Δη,i)\displaystyle\min_{1\leq i\leq n}\inf_{\eta\in{\cal H}_{n}}\rho_{\min}(\Delta_{\eta,i}) 1γn,\displaystyle\geq\frac{1}{\gamma_{n}}, (6)
    logN(16m¯γnn3/2,n,dn)\displaystyle\log N\left(\frac{1}{6\overline{m}\gamma_{n}n^{3/2}},{\cal H}_{n},d_{n}\right) nϵn2,\displaystyle\lesssim n\epsilon_{n}^{2}, (7)
    eBslogpΠ(n)\displaystyle e^{Bs_{\star}\log p}\Pi({\cal H}\setminus{\cal H}_{n}) 0.\displaystyle\rightarrow 0. (8)

The above conditions are related to the classical ones in the literature (e.g., see Theorem 2.1 of Ghosal et al., [15]). Condition (6) requires that for every ini\leq n, the minimum eigenvalue of Δη,i\Delta_{\eta,i} is not too small on a sieve n{\cal H}_{n}. Although γn\gamma_{n} can be any positive sequence, a sequence increasing exponentially fast makes the entropy in (7) too large, resulting in a suboptimal rate ϵn\epsilon_{n}. If γn\gamma_{n} can be chosen smaller than pp and m¯\overline{m}, then the rate ϵn\epsilon_{n} does not deteriorate. The entropy condition (7) is actually stronger than needed. Scrutinizing the proof of the theorem, one can see that the entropy appearing in the theorem is obtained using pieces that are smaller than those giving the exponentially powerful test in Lemma 2 in the Appendix. However, the covering number with those pieces looks more complicated, and the form in (7) suffices for all examples in the present paper. Lastly, condition (8) requires that the complement of the sieve n{\cal H}_{n} receive sufficiently small prior mass to kill the factor slogps_{\star}\log p arising from the lower bound on the denominator of the posterior distribution. In fact, conditions similar to (C2), (7) and (8) are also required for the prior of θ\theta. By reading the proof, it is easy to see that the prior (2) explicitly satisfies the analogous conditions on an appropriately chosen sieve.

Theorem 2 (Contraction rate, Rényi).

Suppose that (C1)(C5) are satisfied. Then there exists a constant K2K_{2} such that

𝔼0Π((θ,η):Rn(pθ,η,p0)>K2ϵn2|Y(n))0.\displaystyle{\mathbb{E}}_{0}\Pi\left((\theta,\eta):R_{n}(p_{\theta,\eta},p_{0})>K_{2}\epsilon_{n}^{2}\,\big{|}\,Y^{(n)}\right)\rightarrow 0.

We want to sharpen the rate ϵnslog(pm¯γn)/n\epsilon_{n}\geq\sqrt{s_{\star}\log(p\vee\overline{m}\vee\gamma_{n})/n} as much as possible. In most instances, γn\gamma_{n} can be chosen such that logγnlogp\log\gamma_{n}\lesssim\log p. This is trivially satisfied if γn\gamma_{n} is some polynomial in nn as in the examples in this paper. If pp is known to increase much faster than nn, e.g., logpnc\log p\asymp n^{c} for some c(0,1)c\in(0,1), then γn\gamma_{n} need not be a polynomial in nn and the condition can be met more easily with a sequence that grows even faster. Note also that we typically have logm¯logp\log\overline{m}\lesssim\log p in most cases. These postulates lead to ϵn(slogp)/n\epsilon_{n}\geq\sqrt{(s_{\star}\log p)/n}. Indeed, it is often possible to choose ϵn=(slogp)/n\epsilon_{n}=\sqrt{(s_{\star}\log p)/n}, which is commonly guaranteed by choosing an appropriate sieve n{\cal H}_{n} and a prior. The condition will be made precise in (C5) below for recovery and we only consider the situation that ϵn=(slogp)/n\epsilon_{n}=\sqrt{(s_{\star}\log p)/n} in what follows.

Although Theorem 2 provides the basic results for posterior contraction, it does not give precise interpretations for the parameters θ\theta and η\eta themselves, because of the abstruse expression of the average Rényi divergence. The contraction rates with respect to more concrete metrics are recovered under some additional conditions. Under the additional assumption anϵn20a_{n}\epsilon_{n}^{2}\rightarrow 0, it can be shown that Theorem 1 and Theorem 2 explicitly imply that for the set

𝒜n={\displaystyle{\cal A}_{n}=\bigg{\{} (θ,η)Θ×:sθK1s,\displaystyle(\theta,\eta)\in\Theta\times{\cal H}:s_{\theta}\leq K_{1}s_{\star},
1ni=1nXi(θθ0)+ξη,iξη0,i22+dB,n2(η,η0)M1ϵn2},\displaystyle\qquad\qquad\frac{1}{n}\sum_{i=1}^{n}\lVert X_{i}(\theta-\theta_{0})+\xi_{\eta,i}-\xi_{\eta_{0},i}\rVert_{2}^{2}+d_{B,n}^{2}(\eta,\eta_{0})\leq M_{1}\epsilon_{n}^{2}\bigg{\}},

with a sufficiently large constant M1M_{1}, the posterior mass of 𝒜n{\cal A}_{n} goes to one in probability (see the proof of Theorem 3). To complete the recovery, we need to separate the sum of squares of the mean into X(θθ0)2\lVert X(\theta-\theta_{0})\rVert_{2} and ndA,n2(η,η0)nd_{A,n}^{2}(\eta,\eta_{0}), which requires an additional condition. The conditions required for the recovery are clarified as follows.

  1. (C5)

    While logm¯logp\log\overline{m}\lesssim\log p, (C5) holds for γn\gamma_{n} and ϵn=(slogp)/n\epsilon_{n}=\sqrt{(s_{\star}\log p)/n} such that logγnlogp\log\gamma_{n}\lesssim\log p and anϵn20a_{n}\epsilon_{n}^{2}\rightarrow 0 with ana_{n} satisfying (C1).

  2. (C6)

    For ss_{\star} satisfying (C5), there exists η\eta_{\ast}\in\mathbb{H} such that

    lim infn1inf(θ,η)𝒜ni=1n(θθ0)TXiT(ξη,iξη,i)X(θθ0)22+ndA,n2(η,η)\displaystyle\liminf_{n\geq 1}\inf_{(\theta,\eta)\in{\cal A}_{n}}\frac{\sum_{i=1}^{n}(\theta-\theta_{0})^{T}X_{i}^{T}(\xi_{\eta,i}-\xi_{\eta_{\ast},i})}{\lVert X(\theta-\theta_{0})\rVert_{2}^{2}+nd_{A,n}^{2}(\eta,\eta_{\ast})} >12,\displaystyle>-\frac{1}{2},
    dA,n(η,η0)\displaystyle d_{A,n}(\eta_{\ast},\eta_{0}) slogpn,\displaystyle\lesssim\sqrt{\frac{s_{\star}\log p}{n}},

    where ϵn\epsilon_{n} in 𝒜n\mathcal{A}_{n} satisfies ϵn=(slogp)/n\epsilon_{n}=\sqrt{(s_{\star}\log p)/n}.

By expanding the quadratic term for the mean in 𝒜n{\cal A}_{n}, one can see that the separation is possible if (C6) is satisfied. Clearly, (C6) is trivially satisfied if the model has only XθX\theta for its mean, in which case we take ξη,iξη,i=ξη,iξη0,i=0\xi_{\eta,i}-\xi_{\eta_{\ast},i}=\xi_{\eta_{\ast},i}-\xi_{\eta_{0},i}=0 for every ini\leq n. In many cases where there exists η\eta^{\prime}\in{\cal H} such that dA,n(η,η0)=0d_{A,n}(\eta^{\prime},\eta_{0})=0, we can often take η=η\eta_{\ast}=\eta^{\prime} so that the second inequality of (C6) holds automatically.
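To make the role of (C6) concrete, the following expansion shows how the separation works in the simplest case $\eta_{\ast}=\eta_{0}$, under the illustrative assumption that $nd_{A,n}^{2}(\eta,\eta^{\prime})=\sum_{i=1}^{n}\lVert\xi_{\eta,i}-\xi_{\eta^{\prime},i}\rVert_{2}^{2}$ (the general case adds only a triangle-inequality step through $\eta_{\ast}$):

\[
\frac{1}{n}\sum_{i=1}^{n}\lVert X_{i}(\theta-\theta_{0})+\xi_{\eta,i}-\xi_{\eta_{0},i}\rVert_{2}^{2}=\frac{1}{n}\Big(\lVert X(\theta-\theta_{0})\rVert_{2}^{2}+2\sum_{i=1}^{n}(\theta-\theta_{0})^{T}X_{i}^{T}(\xi_{\eta,i}-\xi_{\eta_{0},i})+nd_{A,n}^{2}(\eta,\eta_{0})\Big).
\]

If the first inequality of (C6) holds with a margin, i.e., the cross term is at least $-\tfrac{1-\delta}{2}\big(\lVert X(\theta-\theta_{0})\rVert_{2}^{2}+nd_{A,n}^{2}(\eta,\eta_{0})\big)$ for some fixed $\delta>0$, then the displayed quantity is at least $\tfrac{\delta}{n}\big(\lVert X(\theta-\theta_{0})\rVert_{2}^{2}+nd_{A,n}^{2}(\eta,\eta_{0})\big)$, so the bound $M_{1}\epsilon_{n}^{2}$ on $\mathcal{A}_{n}$ carries over, up to the constant $\delta^{-1}M_{1}$, to $\lVert X(\theta-\theta_{0})\rVert_{2}^{2}/n$ and $d_{A,n}^{2}(\eta,\eta_{0})$ separately.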

The following theorem shows that the posterior distribution of θ\theta and η\eta contracts around their respective true values at some rates, relative to more easily comprehensible metrics than the average Rényi divergence. In the expressions, if K1s+s0<1K_{1}s_{\star}+s_{0}<1, the compatibility numbers should be understood to be equal to 1 for interpretation.

Theorem 3 (Recovery).

Suppose that (C1)(C4), (C5), and (C6) are satisfied. Then, there exists a constant K3K_{3} such that

𝔼0Π(θ:θθ01>K3slogpϕ1(K1s+s0)X|Y(n))0,𝔼0Π(θ:θθ02>K3slogpϕ2(K1s+s0)X|Y(n))0,𝔼0Π(θ:X(θθ0)2>K3slogp|Y(n))0,𝔼0Π(η:dn(η,η0)>K3slogpn|Y(n))0.\displaystyle\begin{split}{\mathbb{E}}_{0}\Pi\left(\theta:\lVert\theta-\theta_{0}\rVert_{1}>\frac{K_{3}s_{\star}\sqrt{\log p}}{\phi_{1}(K_{1}s_{\star}+s_{0})\lVert X\rVert_{\ast}}\,\bigg{|}\,Y^{(n)}\right)&\rightarrow 0,\\ {\mathbb{E}}_{0}\Pi\left(\theta:\lVert\theta-\theta_{0}\rVert_{2}>\frac{K_{3}\sqrt{s_{\star}\log p}}{\phi_{2}(K_{1}s_{\star}+s_{0})\lVert X\rVert_{\ast}}\,\bigg{|}\,Y^{(n)}\right)&\rightarrow 0,\\ {\mathbb{E}}_{0}\Pi\left(\theta:\lVert X(\theta-\theta_{0})\rVert_{2}>K_{3}\sqrt{s_{\star}\log p}\,\big{|}\,Y^{(n)}\right)&\rightarrow 0,\\ {\mathbb{E}}_{0}\Pi\left(\eta:d_{n}(\eta,\eta_{0})>K_{3}\sqrt{\frac{s_{\star}\log p}{n}}\,\bigg{|}\,Y^{(n)}\right)&\rightarrow 0.\end{split} (9)

The thresholds for contraction depend upon the compatibility numbers, which makes their implications somewhat opaque. As K1s+s0K_{1}s_{\star}+s_{0} is much smaller than nn_{\ast}, it is not unreasonable to assume that ϕ1(K1s+s0)\phi_{1}(K_{1}s_{\star}+s_{0}) and ϕ2(K1s+s0)\phi_{2}(K_{1}s_{\star}+s_{0}) are bounded away from zero, whence the compatibility numbers are removed from the rates. We refer to Example 7 of Castillo et al., [8] for more discussion. In the next subsection, we will see that one of these restrictions is actually necessary for shape approximation or selection consistency.

Remark 3.

The separation condition (C6) can be left as an assumption to be satisfied, but can also be verified by a stronger condition on the design matrix without resorting to the values of the parameters. Suppose that for some integer q1q\geq 1, there exists a matrix Zimi×qZ_{i}\in\mathbb{R}^{m_{i}\times q} such that ξη,i=Zih(η)\xi_{\eta,i}=Z_{i}h(\eta) for every η\eta\in{\cal H}, with some map h:qh:{\cal H}\mapsto\mathbb{R}^{q}. Since we can write ξη,iξη,i=Zi(h(η)h(η))\xi_{\eta,i}-\xi_{\eta_{\ast},i}=Z_{i}(h(\eta)-h(\eta_{\ast})) for any η,η\eta,\eta_{\ast}\in{\cal H}, the Cauchy-Schwarz inequality indicates that the first inequality of (C6) is implied by

lim infn1inf(θ,η)Θ×:sθK1s(θθ0)TXTZ(h(η)h(η))X(θθ0)2Z(h(η)h(η))2>1,\displaystyle\liminf_{n\geq 1}\inf_{(\theta,\eta)\in\Theta\times{\cal H}:s_{\theta}\leq K_{1}s_{\star}}\frac{(\theta-\theta_{0})^{T}X^{T}Z(h(\eta)-h(\eta_{\ast}))}{\lVert X(\theta-\theta_{0})\rVert_{2}\lVert Z(h(\eta)-h(\eta_{\ast}))\rVert_{2}}>-1,

for Z=(Z1T,,ZnT)TZ=(Z_{1}^{T},\dots,Z_{n}^{T})^{T}. The left hand side is always between 1-1 and 11 by the Cauchy-Schwarz inequality, and is exactly equal to 1-1 or 11 if and only if the two vectors are linearly dependent. A sufficient condition for the preceding display is thus min{ςmin([XS,Z]):sK1s+s0}1\min\{\varsigma_{\min}([X_{S},Z]):{s\leq K_{1}s_{\star}+s_{0}}\}\gtrsim 1 since the linear dependence cannot happen under such a condition due to the inequality sθθ0sθ+s0K1s+s0s_{\theta-\theta_{0}}\leq s_{\theta}+s_{0}\leq K_{1}s_{\star}+s_{0} for θ\theta such that sθK1ss_{\theta}\leq K_{1}s_{\star}. This sufficient condition is not restrictive at all if q=o(n)q=o(n) as we already have K1s+s0=o(n)K_{1}s_{\star}+s_{0}=o(n). Since there typically exists η\eta_{\ast}\in{\cal H} satisfying the second inequality of (C6) as long as \cal H provides a good approximation for the true parameter η0\eta_{0}, condition (C6) can be easily satisfied if the sufficient condition is met.
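The geometry behind the sufficient condition in Remark 3 can be checked numerically. The sketch below (illustrative sizes, not taken from the paper) uses the fact that the infimum over v,w of the normalized inner product between X_S v and Zw equals minus the cosine of the smallest principal angle between span(X_S) and span(Z), i.e., minus the largest singular value of Q_X^T Q_Z for orthonormal bases Q_X, Q_Z; full column rank of [X_S, Z] rules out a shared direction and keeps this infimum strictly above -1.

```python
import numpy as np

# Illustrative sizes; any full-column-rank design behaves the same way.
rng = np.random.default_rng(1)
n, s, q = 50, 3, 2
XS = rng.standard_normal((n, s))
Z = rng.standard_normal((n, q))

# Orthonormal bases of the two column spans.
QX, _ = np.linalg.qr(XS)
QZ, _ = np.linalg.qr(Z)

# Largest singular value of QX^T QZ = cosine of the smallest principal angle.
sigma1 = np.linalg.svd(QX.T @ QZ, compute_uv=False)[0]
min_cosine = -sigma1  # inf over v, w of the normalized inner product

# Smallest singular value of [X_S, Z]: bounded away from zero <=> no shared
# direction <=> sigma1 < 1 <=> min_cosine > -1.
smin = np.linalg.svd(np.hstack([XS, Z]), compute_uv=False)[-1]
print(min_cosine, smin)
```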

Notwithstanding the lack of formal study of minimax rates with additional complications, we still want to match our rates for θ\theta with those in simple linear regression, which we call the “optimal” rates. In this sense, Theorem 3 only provides the suboptimal rates for θ\theta if s0=o(s)s_{0}=o(s_{\star}). Although the theorem gives the optimal results if s0logpnϵ¯n2s_{0}\log p\gtrsim n\bar{\epsilon}_{n}^{2}, it is practically hard to check this condition as s0s_{0} is unknown. If s0s_{0} is known to be nonzero, the desired conclusion is trivially achieved as soon as nϵ¯n2/logp1n\bar{\epsilon}_{n}^{2}/\log p\lesssim 1. The following corollary, however, shows that the optimal rates are still available even if s0=0s_{0}=0, with restrictions on ϵ¯n\bar{\epsilon}_{n} and the prior.

Corollary 1 (Optimality under restriction).

For ϵ¯n\bar{\epsilon}_{n} satisfying the conditions for Theorem 3, we have the following assertions.

  1. (a)

    Assume that nϵ¯n2/logp0n\bar{\epsilon}_{n}^{2}/\log p\rightarrow 0. Then, Theorems 1 and 3 hold for ss_{\star} replaced by s0s_{0}.

  2. (b)

    Assume that nϵ¯n2/logp1n\bar{\epsilon}_{n}^{2}/\log p\lesssim 1. Then, Theorems 1 and 3 hold for ss_{\star} replaced by s0s_{0} if either A4A_{4} in (3) is chosen large enough or s0>0s_{0}>0.

The corollary is useful in limited situations, especially when a parametric rate is available for a nuisance parameter. Even if nϵ¯n2=lognn\bar{\epsilon}_{n}^{2}=\log n, we need to further assume that logn=o(logp)\log n=o(\log p), i.e., the ultra high-dimensional setup, to conclude that (a) holds, while we can always apply (b) because lognlogp\log n\lesssim\log p. Although assertion (b) holds for any s00s_{0}\geq 0 if A4A_{4} is chosen sufficiently large, the specific threshold is not directly available. Indeed, by carefully reading the proof of Theorem 1 together with Lemma 1 in the Appendix, one can see that the threshold depends on the unknown constant bounds for the eigenvalues of the true covariance matrix in (C4). Still, (b) holds for any A4>0A_{4}>0 if s0>0s_{0}>0. We believe that the assumption s0>0s_{0}>0 is very mild, and hence simply apply (b) with this assumption to conclude the optimal contraction for models with finite-dimensional nuisance parameters. The optimal rates can still be achieved for any s00s_{0}\geq 0 by verifying the conditions in the following subsection. With finite-dimensional nuisance parameters, we do not pursue this direction as it seems overkill given the mildness of the assumption s0>0s_{0}>0, though those conditions are actually required for the Bernstein-von Mises theorem and selection consistency in Section 4.

In semiparametric situations with high- or infinite-dimensional nuisance parameters, neither (a) nor (b) generally works unless pp increases sufficiently fast. Still, the optimal rates can be achieved under stronger conditions using semiparametric theory, as the following subsection shows.

3.2 Optimal posterior contraction for θ\theta

Recall that only suboptimal rates may be available from Theorem 3 if s0logpnϵ¯n2s_{0}\log p\lesssim n\bar{\epsilon}_{n}^{2}. In many semiparametric situations, however, it is often possible to obtain parametric rates for finite-dimensional parameters under stronger conditions, even when there are infinite-dimensional nuisance parameters in a model [4, 7]. It has also been shown that a similar argument holds in some high-dimensional semiparametric regression models [10]. Therefore, it is naturally of interest to examine under what conditions we can replace ss_{\star} by s0s_{0} in the rates for θ\theta, even if s0logpnϵ¯n2s_{0}\log p\lesssim n\bar{\epsilon}_{n}^{2}. Similar to other semiparametric settings [4, 10], this can be established by semiparametric theory, but it requires stronger conditions than those in traditional fixed-dimensional parametric cases because of the high dimensionality of the parameters in our setup.

To proceed, some additional conditions are required for technical reasons; these concern the size of ϵ¯n\bar{\epsilon}_{n}, as the optimal rates are automatically attained if s0logpnϵ¯n2s_{0}\log p\gtrsim n\bar{\epsilon}_{n}^{2}. Still, in practice, the conditions almost always need to be verified to reach the optimal rates, since only oracle rates are generally available and we do not know which term is greater.

In what follows, we write s¯nϵ¯n2/logp\bar{s}_{\star}\coloneqq n\bar{\epsilon}_{n}^{2}/\log p for ϵ¯n\bar{\epsilon}_{n} satisfying the conditions of Theorem 3 through the definition of ϵn\epsilon_{n}. We first assume the following condition on the uniform compatibility number.

  1. (C3)

    For a sufficiently large MM, the uniform compatibility number ϕ1(Ms¯+s0)\phi_{1}(M\bar{s}_{\star}+s_{0}) is bounded away from zero.

This condition is weaker than assuming that the smallest scaled singular value ϕ2(Ms¯+s0)\phi_{2}(M\bar{s}_{\star}+s_{0}) is bounded away from zero, as we have ϕ1(s)ϕ2(s)\phi_{1}(s)\geq\phi_{2}(s) for any s>0s>0 by the Cauchy-Schwarz inequality. We will also rely on a slightly stronger condition on ϕ1\phi_{1} for a distributional approximation in the following section. In this sense, our condition is weaker than those for Theorem 4 of Castillo et al., [8]. Condition (C3) is not restrictive as (C5) requires s=o(n)s_{\star}=o(n); we again refer to Example 7 of Castillo et al., [8].

To describe the other conditions precisely, hereafter we use the following additional notation. We write

X~=(Δη0,11/2X1Δη0,n1/2Xn)n×p,ξ~η=(Δη0,11/2ξη,1Δη0,n1/2ξη,n)n,\displaystyle\tilde{X}=\left(\begin{matrix}\Delta_{\eta_{0},1}^{-1/2}X_{1}\\ \vdots\\ \Delta_{\eta_{0},n}^{-1/2}X_{n}\end{matrix}\right)\in\mathbb{R}^{n_{\ast}\times p},\quad\tilde{\xi}_{\eta}=\left(\begin{matrix}\Delta_{\eta_{0},1}^{-1/2}\xi_{\eta,1}\\ \vdots\\ \Delta_{\eta_{0},n}^{-1/2}\xi_{\eta,n}\end{matrix}\right)\in\mathbb{R}^{n_{\ast}},

and Δ~η\tilde{\Delta}_{\eta} to denote the collection of Δη,i\Delta_{\eta,i} for i=1,,ni=1,\dots,n. In particular, X~Sn×|S|\tilde{X}_{S}\in\mathbb{R}^{n_{\ast}\times|S|} denotes the submatrix of X~\tilde{X} with columns chosen by an index set SS. We also define the following neighborhoods of the true parameters: for s¯\bar{s}_{\star} and ϵ¯n\bar{\epsilon}_{n} satisfying (C5), and sufficiently large constants M~1\tilde{M}_{1} and M~2\tilde{M}_{2},

Θ~n={θΘ:sθK1s¯,X(θθ0)2M~1nϵ¯n},~n={η:dn(η,η0)M~2ϵ¯n}.\displaystyle\begin{split}\widetilde{\Theta}_{n}&=\left\{\theta\in\Theta:s_{\theta}\leq K_{1}\bar{s}_{\star},\,\lVert X(\theta-\theta_{0})\rVert_{2}\leq\tilde{M}_{1}\sqrt{n}\bar{\epsilon}_{n}\right\},\\ \widetilde{\cal H}_{n}&=\left\{\eta\in{\cal H}:d_{n}(\eta,\eta_{0})\leq\tilde{M}_{2}\bar{\epsilon}_{n}\right\}.\end{split} (10)

Combined with the other conditions, Theorem 3 implies that the posterior probabilities of these neighborhoods tend to one in probability if s0logpnϵ¯n2s_{0}\log p\lesssim n\bar{\epsilon}_{n}^{2}. We need some bounding conditions on these neighborhoods, which will be specified below.

Let Φ(η)=(ξ~η,Δ~η)\Phi(\eta)=(\tilde{\xi}_{\eta},\tilde{\Delta}_{\eta}) for any given η\eta\in\mathcal{H}. For a given θ\theta, we choose a bijective map ηη~n(θ,η):\eta\mapsto\tilde{\eta}_{n}(\theta,\eta):\mathcal{H}\mapsto\mathcal{H} such that Φ(η~n(θ,η))=(ξ~η+HX~(θθ0),Δ~η)\Phi(\tilde{\eta}_{n}(\theta,\eta))=(\tilde{\xi}_{\eta}+H\tilde{X}(\theta-\theta_{0}),\tilde{\Delta}_{\eta}) for some orthogonal projection HH which may depend on the true parameter values, but not on θ\theta and η\eta. The projection HH plays a key role here and in the distributional approximation in the following section, and thus should be chosen appropriately to satisfy the following conditions.

  1. (C4)

    The orthogonal projection HH satisfies

    1(s01)logpsupη~n(IH)(ξ~ηξ~η0)22\displaystyle\frac{1}{(s_{0}\vee 1)\log p}\sup_{\eta\in\widetilde{\cal H}_{n}}\lVert(I-H)(\tilde{\xi}_{\eta}-\tilde{\xi}_{\eta_{0}})\rVert_{2}^{2} 0,\displaystyle\rightarrow 0,
    minS:sK1s¯infvs:v2=1(IH)X~Sv2X~Sv2\displaystyle\min_{S:s\leq K_{1}\bar{s}_{\star}}\inf_{v\in\mathbb{R}^{s}:\lVert v\rVert_{2}=1}\frac{\lVert(I-H)\tilde{X}_{S}v\rVert_{2}}{\lVert\tilde{X}_{S}v\rVert_{2}} 1.\displaystyle\gtrsim 1.
  2. (C5)

    The conditional law Πn,θ\Pi_{n,\theta} of η~n(θ,η)\tilde{\eta}_{n}(\theta,\eta) given θ\theta, induced by the prior, is absolutely continuous relative to its distribution Πn,θ0\Pi_{n,\theta_{0}} at θ=θ0\theta=\theta_{0} (which is the same as the prior for η\eta), and the Radon-Nikodym derivative dΠn,θ/dΠn,θ0d\Pi_{n,\theta}/d\Pi_{n,\theta_{0}} satisfies

    supθΘ~nsupη~n|logdΠn,θdΠn,θ0(η)|1.\displaystyle\sup_{\theta\in\widetilde{\Theta}_{n}}\sup_{\eta\in\widetilde{\cal H}_{n}}\left\lvert\log\frac{d\Pi_{n,\theta}}{d\Pi_{n,\theta_{0}}}(\eta)\right\rvert\lesssim 1.

By reading the proof, one can see that Theorem 4 below is based on an approximate likelihood ratio. The first condition of (C4) is required to control the remainder of the approximation. The second condition of (C4) implies that u2(IH)u2u2\lVert u\rVert_{2}\lesssim\lVert(I-H)u\rVert_{2}\leq\lVert u\rVert_{2} for every uspan(X~S)u\in{\rm span}(\tilde{X}_{S}) with SS such that sK1s¯s\leq K_{1}\bar{s}_{\star}, as the second inequality trivially holds by the fact that IHI-H is an orthogonal projection. The use of the shifting map ηη~n(θ,η)\eta\mapsto\tilde{\eta}_{n}(\theta,\eta) is justified by condition (C5), which implies that a shift in certain directions does not substantially affect the prior on η\eta. This is related in spirit to the absolute continuity condition in the semiparametric Bernstein-von Mises theorem (see, for example, Theorem 12.8 of Ghosal and van der Vaart, [17]). We will see that a distributional approximation also requires similar, but stronger, conditions.

Lastly, the complexity of the neighborhood ~n\widetilde{\mathcal{H}}_{n} should also be controlled. Specifically, we make the following condition.

  1. (C6)

    For ana_{n} and ene_{n} satisfying (C1) and a sufficiently large C>0C>0,

    nϵ¯n2(en+anϵ¯n2)(s01)logp+an0Cϵ¯nlogN(δ,~n,dB,n)𝑑δ0.\displaystyle\sqrt{\frac{n\bar{\epsilon}_{n}^{2}(e_{n}+a_{n}\bar{\epsilon}_{n}^{2})}{(s_{0}\vee 1)\log p}}+\sqrt{a_{n}}\int_{0}^{C\bar{\epsilon}_{n}}\sqrt{\log N(\delta,\widetilde{\mathcal{H}}_{n},d_{B,n})}d\delta\rightarrow 0.
  2. (C7)

    The parameter space {\mathcal{H}} is separable with the pseudo-metric dB,nd_{B,n}.

Similar to (C4), these conditions are required to control the remainder of an approximation. The integral term comes from the expected supremum of a separable Gaussian process, exploiting the Gaussian likelihood of the model and the separability of ~n\widetilde{\cal H}_{n} with the standard deviation metric. Condition (C7) is crucial for this reason. Since we usually put a prior on η\eta in an explicit way, condition (C7) is rarely violated in practice. One may see a connection between the first term of (C6) and the conditions for Corollary 1. The former easily tends to zero even if nϵ¯n2/logpn\bar{\epsilon}_{n}^{2}/\log p is increasing, due to the extra term ϵ¯n\bar{\epsilon}_{n}, which commonly tends to zero at a polynomial rate. Note also that the term s01s_{0}\vee 1 appears in (C4) and (C6). Although this gives sharper bounds, the conditions often need to be verified with s01s_{0}\vee 1 replaced by 11 as s0s_{0} is unknown.

Under the conditions specified above, we obtain the following theorem for the contraction rates for θ\theta which do not depend on ϵ¯n\bar{\epsilon}_{n}. The compatibility numbers below should be understood to be 1 if s0=0s_{0}=0.

Theorem 4 (Optimal posterior contraction).

Suppose that (C1)(C4), (C5), and (C6)(C7) are satisfied. Then, there exist constants K4K_{4} and K5K_{5} such that

𝔼0Π(θ:sθ>K4s0|Y(n))0,𝔼0Π(θ:θθ01>K5s0logpϕ1((K4+1)s0)X|Y(n))0,𝔼0Π(θ:θθ02>K5s0logpϕ2((K4+1)s0)X|Y(n))0,𝔼0Π(θ:X(θθ0)2>K5s0logp|Y(n))0.\displaystyle\begin{split}{\mathbb{E}}_{0}\Pi\left(\theta:s_{\theta}>K_{4}s_{0}\,\Big{|}\,Y^{(n)}\right)&\rightarrow 0,\\ {\mathbb{E}}_{0}\Pi\left(\theta:\lVert\theta-\theta_{0}\rVert_{1}>\frac{K_{5}s_{0}\sqrt{\log p}}{\phi_{1}((K_{4}+1)s_{0})\lVert X\rVert_{\ast}}\,\bigg{|}\,Y^{(n)}\right)&\rightarrow 0,\\ {\mathbb{E}}_{0}\Pi\left(\theta:\lVert\theta-\theta_{0}\rVert_{2}>\frac{K_{5}\sqrt{s_{0}\log p}}{\phi_{2}((K_{4}+1)s_{0})\lVert X\rVert_{\ast}}\,\bigg{|}\,Y^{(n)}\right)&\rightarrow 0,\\ {\mathbb{E}}_{0}\Pi\left(\theta:\lVert X(\theta-\theta_{0})\rVert_{2}>K_{5}\sqrt{s_{0}\log p}\,\big{|}\,Y^{(n)}\right)&\rightarrow 0.\end{split} (11)

Similar to the discussion following Theorem 3, the compatibility numbers are easily bounded away from zero, so that they can be removed from the expressions. The requirements here are actually weaker than before as s0ss_{0}\leq s_{\star}. The simplified rates are then available for ease of interpretation.

Remark 4.

In regression models where no additional mean part ξη,i\xi_{\eta,i} exists, conditions (C4) and (C5) are trivially satisfied by choosing the zero matrix for HH. This is also true for (C8) and (C9) specified in the next section.

Remark 5.

Suppose that there exists a matrix Zimi×qZ_{i}\in\mathbb{R}^{m_{i}\times q} such that ξη,i=Zih(η)\xi_{\eta,i}=Z_{i}h(\eta) for every η\eta\in{\cal H} with some map h:qh:{\cal H}\mapsto\mathbb{R}^{q}. Then, a general strategy to choose HH is to set H=Z~(Z~TZ~)1Z~TH=\tilde{Z}(\tilde{Z}^{T}\tilde{Z})^{-1}\tilde{Z}^{T} for Z~=(Z1TΔη0,11/2,,ZnTΔη0,n1/2)T\tilde{Z}=(Z_{1}^{T}\Delta_{\eta_{0},1}^{-1/2},\dots,Z_{n}^{T}\Delta_{\eta_{0},n}^{-1/2})^{T}. In this case, by the triangle inequality, the first condition of (C4) is satisfied if there exists η\eta_{\ast}\in{\cal H} such that ndA,n2(η,η0)/(s0logp)0nd_{A,n}^{2}(\eta_{\ast},\eta_{0})/(s_{0}\log p)\rightarrow 0. For (C8) in the next section, this is replaced by (s2logp)ndA,n2(η,η0)0(s_{\star}^{2}\log p)nd_{A,n}^{2}(\eta_{\ast},\eta_{0})\rightarrow 0. These are trivially the case if there exists η\eta^{\prime}\in{\cal H} such that dA,n(η,η0)=0d_{A,n}(\eta^{\prime},\eta_{0})=0. Also similar to Remark 3, a sufficient condition for the second line of (C4) is min{ςmin([XS,Z]):sK1s¯}1\min\{\varsigma_{\min}([X_{S},Z]):{s\leq K_{1}\bar{s}_{\star}}\}\gtrsim 1, as pre-multiplication of XSX_{S} and ZZ by a positive definite matrix is an isomorphism. This is also sufficient for (C8) in the next section with s¯\bar{s}_{\star} replaced by ss_{\star}.
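The projection strategy of Remark 5 is easy to probe numerically. The sketch below takes Delta_{eta_0,i} = I for simplicity (so Z tilde = Z and X tilde = X; all sizes are hypothetical choices) and verifies that H = Z(Z^T Z)^{-1} Z^T is an orthogonal projection and that the ratio ||(I - H) X_S v|| / ||X_S v|| stays bounded away from zero over unit v on a random design.

```python
import numpy as np

# Hypothetical sizes for illustration.
rng = np.random.default_rng(2)
n, s, q = 60, 4, 3
XS = rng.standard_normal((n, s))
Z = rng.standard_normal((n, q))

# H = Z (Z^T Z)^{-1} Z^T: orthogonal projection onto span(Z).
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)
is_projection = np.allclose(H @ H, H) and np.allclose(H, H.T)

# inf over unit v of ||(I - H) X_S v|| / ||X_S v|| equals the smallest
# singular value of (I - H) Q, where Q is an orthonormal basis of span(X_S).
Q, _ = np.linalg.qr(XS)
ratio_min = np.linalg.svd((np.eye(n) - H) @ Q, compute_uv=False)[-1]
print(is_projection, ratio_min)
```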

Remark 6.

In many instances, for every δ>0\delta>0 and ζn>0\zeta_{n}>0, we typically have

logN(δ,{η:dB,n(η,η0)ζn},dB,n)0rnlog(bnζnδ),\displaystyle\log N\left(\delta,\{\eta\in\mathcal{H}:d_{B,n}(\eta,\eta_{0})\leq\zeta_{n}\},d_{B,n}\right)\leq 0\vee r_{n}\log\left(\frac{b_{n}\zeta_{n}}{\delta}\right),

for some sequences rnr_{n} and bnb_{n}, especially when the part of η\eta involved with dB,nd_{B,n} is an rnr_{n}-dimensional Euclidean parameter. Note that 0Cζn0rnlog(bnζn/δ)𝑑δ\int_{0}^{C\zeta_{n}}\sqrt{0\vee r_{n}\log(b_{n}\zeta_{n}/{\delta})}d\delta is equal to

0(Cbn)ζnrnlog(bnζnδ)𝑑δ\displaystyle\int_{0}^{(C\wedge b_{n})\zeta_{n}}\sqrt{r_{n}\log\left(\frac{b_{n}\zeta_{n}}{\delta}\right)}d\delta
=(Cbn)ζnrnlog(bnCbn)+bnζnrnlog(bn/(Cbn))et2𝑑t.\displaystyle\quad=(C\wedge b_{n})\zeta_{n}\sqrt{r_{n}\log\left(\frac{b_{n}}{C\wedge b_{n}}\right)}+b_{n}\zeta_{n}\sqrt{r_{n}}\int_{\sqrt{\log(b_{n}/(C\wedge b_{n}))}}^{\infty}e^{-t^{2}}dt.

If bnb_{n} is increasing, the right hand side is bounded by a multiple of ζnrnlogbn\zeta_{n}\sqrt{r_{n}\log b_{n}} by the tail probability of a normal distribution, while it is bounded by a multiple of ζnbnrn\zeta_{n}b_{n}\sqrt{r_{n}} for nonincreasing bnb_{n}. This simplification is useful to verify (C6) in many applications, and can also be used for (C10) in the next section.
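The displayed identity for the entropy integral can be checked numerically against the closed form, using that the Gaussian tail integral equals (sqrt(pi)/2)·erfc(a). The constants below are arbitrary illustrative choices (with C < b, so the 0-vee truncation is inactive on the integration range).

```python
import math
import numpy as np

# Arbitrary illustrative constants with C < b.
r, b, C, zeta = 3.0, 10.0, 2.0, 0.5

# Left-hand side: trapezoidal rule on a log-spaced grid; the sqrt-log
# singularity as delta -> 0 is integrable and the [0, 1e-12] tail is tiny.
x = np.logspace(-12, np.log10(min(C, b) * zeta), 200000)
f = np.sqrt(r * np.log(b * zeta / x))
lhs = float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x)))

# Right-hand side: the closed form from the display, with
# int_a^infty e^{-t^2} dt = (sqrt(pi)/2) * erfc(a).
a = math.sqrt(math.log(b / min(C, b)))
rhs = (min(C, b) * zeta * math.sqrt(r * math.log(b / min(C, b)))
       + b * zeta * math.sqrt(r) * (math.sqrt(math.pi) / 2) * math.erfc(a))
print(lhs, rhs)
```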

4 Bernstein-von Mises and selection consistency

An extremely important question is whether the true support S0S_{0} is recovered with probability tending to one, a property called selection consistency. We will show this based on a distributional approximation to the posterior distribution. Combined with selection consistency, the shape approximation leads to an approximation by the product of a point mass and a normal distribution, which we call the Bernstein-von Mises theorem. This reduced approximate distribution enables us to correctly quantify the remaining uncertainty in the parameter through the posterior distribution.

4.1 Shape approximation to the posterior distribution

It is worth noting that selection consistency can often be verified without a distributional approximation. For example, in sparse linear regression with a scalar unknown variance σ2\sigma^{2}, Song and Liang, [28] deployed the marginal likelihood of the model support, which can be obtained by integrating out θ\theta and σ2\sigma^{2} from the likelihood using the inverse gamma kernel. In our general formulation, however, this approach is hard to implement due to the arbitrary structure of the nuisance parameter η\eta. Indeed, the approach is not directly available even for a parametric covariance matrix with dimension m¯2\overline{m}\geq 2. In this sense, using a shape approximation is a natural solution to the problem, though it may require some extra conditions on the parameter space and on the priors for θ\theta and η\eta.
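For concreteness, the marginalization device in the scalar-variance case can be written in closed form. Under a hypothetical conjugate prior $\theta_{S}\mid\sigma^{2}\sim\mathcal{N}_{s}(0,\sigma^{2}V)$ and $\sigma^{2}\sim\mathcal{IG}(a,b)$ (an illustrative choice, not necessarily the prior of Song and Liang, [28]), standard conjugacy gives the marginal likelihood of a support $S$:

\[
m(Y\mid S)=(2\pi)^{-n/2}\left(\frac{\det V_{n}}{\det V}\right)^{1/2}\frac{b^{a}\,\Gamma(a+n/2)}{\Gamma(a)\,b_{n}^{a+n/2}},
\]

where $V_{n}=(V^{-1}+X_{S}^{T}X_{S})^{-1}$ and $b_{n}=b+\tfrac{1}{2}Y^{T}(I-X_{S}V_{n}X_{S}^{T})Y$. Model weights proportional to $\pi_{p}(s)m(Y\mid S)/\binom{p}{s}$ can then be compared directly across supports; no such closed form is available once a general nuisance parameter $\eta$ enters the model, which is what motivates the shape approximation.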

Recall that the results in Section 3.2 are based on the semiparametric theory. In this section we will need very similar conditions as before, but the requirements are generally stronger, as the remainder of the approximation must be tightly controlled. Since the setup is high-dimensional, our conditions are even more restrictive than those for semiparametric models with a fixed-dimensional parametric component [e.g., 7]. One may refer to Section 3.3 of Chae et al., [10] for a relevant discussion.

Throughout this section, we only consider ss_{\star} that satisfies the conditions of Theorem 3. First of all, we make a modification of (C3). The following condition is slightly stronger than (C3), but is still not too restrictive as (C5) requires s=o(n)s_{\star}=o(n).

  1. (C7)

    Condition (C3) is satisfied with s¯\bar{s}_{\star} replaced by ss_{\star}.

The assumption on the prior for θ\theta is made only through the regularization parameter λ\lambda. As in Castillo et al., [8], λ\lambda should not increase too fast and should satisfy λslogp/X0\lambda s_{\star}\sqrt{\log p}/\lVert X\rVert_{\ast}\rightarrow 0. In our setup, the range of λ\lambda induces a sufficient condition for this: s2logp=o(n)s_{\star}^{2}\log p=o(n). Since this is weaker than a condition that will be imposed later in this section, the “small lambda regime” is automatically satisfied by the stronger condition required for the distributional approximation (see (C10) below and the following paragraph).

For sufficiently large constants M^1\hat{M}_{1} and M^2\hat{M}_{2}, we now define the neighborhoods,

Θ^n={θΘ:sθK1s,θθ01M^1slogp/X},^n={η:dA,n(η,η0)M^2slogpn,dB,n(η,η0)M^2slogpn}.\displaystyle\begin{split}\widehat{\Theta}_{n}&=\Big{\{}\theta\in\Theta:s_{\theta}\leq K_{1}s_{\star},\,\lVert\theta-\theta_{0}\rVert_{1}\leq\hat{M}_{1}s_{\star}\sqrt{\log p}/\lVert X\rVert_{\ast}\Big{\}},\\ \widehat{\cal H}_{n}&=\Bigg{\{}\eta\in{\cal H}:d_{A,n}(\eta,\eta_{0})\leq\hat{M}_{2}s_{\star}\sqrt{\frac{\log p}{n}},d_{B,n}(\eta,\eta_{0})\leq\hat{M}_{2}\sqrt{\frac{s_{\star}\log p}{n}}\Bigg{\}}.\end{split} (12)

Note that Θ^n\widehat{\Theta}_{n} is defined with an 1\ell_{1}-ball, which makes it contract more slowly than Θ~n\widetilde{\Theta}_{n} in (10) under (C7). This is for the technical reason that, for a distributional approximation, the 1\ell_{1}-ball must be handled directly on the complement of Θ^n\widehat{\Theta}_{n}. The neighborhood ^n\widehat{\cal H}_{n} is also enlarged to match Θ^n\widehat{\Theta}_{n}. We leave further details to the reader; see the proof of Theorem 5 below.

As in Section 3.2, we choose a bijective map ηη~n(θ,η)\eta\mapsto\tilde{\eta}_{n}(\theta,\eta) which gives rise to Φ(η~n(θ,η))=(ξ~η+HX~(θθ0),Δ~η)\Phi(\tilde{\eta}_{n}(\theta,\eta))=(\tilde{\xi}_{\eta}+H\tilde{X}(\theta-\theta_{0}),\tilde{\Delta}_{\eta}) for some orthogonal projection HH. Again, the orthogonal projection HH should be carefully chosen to satisfy some boundedness conditions. The conditions are similar to, but stronger than, those in Section 3.2. This is not only because of the enlarged neighborhoods Θ^n\widehat{\Theta}_{n} and ^n\widehat{\cal H}_{n}, but also because the remainder of the approximation should be bounded on their complements. We state the required conditions precisely below.

  1. (C8)

    The orthogonal projection HH satisfies

    s2logpsupη^n(IH)(ξ~ηξ~η0)22\displaystyle s_{\star}^{2}\log p\sup_{\eta\in\widehat{\cal H}_{n}}\lVert(I-H)(\tilde{\xi}_{\eta}-\tilde{\xi}_{\eta_{0}})\rVert_{2}^{2} 0,\displaystyle\rightarrow 0,
    minS:sK1sinfvs:v2=1(IH)X~Sv2X~Sv2\displaystyle\min_{S:s\leq K_{1}s_{\star}}\inf_{v\in\mathbb{R}^{s}:\lVert v\rVert_{2}=1}\frac{\lVert(I-H)\tilde{X}_{S}v\rVert_{2}}{\lVert\tilde{X}_{S}v\rVert_{2}} 1.\displaystyle\gtrsim 1.
  2. (C9)

    The conditional law Πn,θ\Pi_{n,\theta} of η~n(θ,η)\tilde{\eta}_{n}(\theta,\eta) given θ\theta, induced by the prior, is absolutely continuous relative to its distribution Πn,θ0\Pi_{n,\theta_{0}} at θ=θ0\theta=\theta_{0}, and the Radon-Nikodym derivative dΠn,θ/dΠn,θ0d\Pi_{n,\theta}/d\Pi_{n,\theta_{0}} satisfies

    supθΘ^nsupη^n|logdΠn,θdΠn,θ0(η)|0.\displaystyle\sup_{\theta\in\widehat{\Theta}_{n}}\sup_{\eta\in\widehat{\cal H}_{n}}\left\lvert\log\frac{d\Pi_{n,\theta}}{d\Pi_{n,\theta_{0}}}(\eta)\right\rvert\rightarrow 0.
  3. (C10)

    For ana_{n} and ene_{n} satisfying (C1) and a sufficiently large C>0C>0,

    slogp{\displaystyle s_{\star}\log p\Bigg{\{} sen+anslogpn\displaystyle s_{\star}\sqrt{e_{n}+\frac{a_{n}s_{\star}\log p}{n}}
    +an0C(slogp)/nlogN(δ,^n,dB,n)dδ}0.\displaystyle+\sqrt{a_{n}}\int_{0}^{C\sqrt{(s_{\star}\log p)/n}}\sqrt{\log N\left(\delta,\widehat{\cal H}_{n},d_{B,n}\right)}d\delta\Bigg{\}}\rightarrow 0.

Conditions (C8)–(C10) are required for reasons similar to those in Section 3.2. We mention that (C10) is sufficient for the small lambda regime, since it necessitates s5log3p=o(n)s_{\star}^{5}\log^{3}p=o(n), which is stronger than s2logp=o(n)s_{\star}^{2}\log p=o(n). This necessary condition for (C10) is often also sufficient in many finite-dimensional models.

We define the standardized vector,

U=(Δη0,11/2(Y1X1θ0ξη0,1)Δη0,n1/2(YnXnθ0ξη0,n))n.\displaystyle U=\left(\begin{matrix}\Delta_{\eta_{0},1}^{-1/2}(Y_{1}-X_{1}\theta_{0}-\xi_{\eta_{0},1})\\ \vdots\\ \Delta_{\eta_{0},n}^{-1/2}(Y_{n}-X_{n}\theta_{0}-\xi_{\eta_{0},n})\end{matrix}\right)\in\mathbb{R}^{n_{\ast}}.

Under the assumptions above, the posterior distribution of θ\theta is approximated by Π\Pi^{\infty} given by

Π(θ|Y(n))\displaystyle\Pi^{\infty}(\theta\in\cdot\,|\,Y^{(n)}) =S:sK1sw^S(𝒩θ^S,X~ST(IH)X~SSδ0Sc)(θ),\displaystyle=\sum_{S:s\leq K_{1}s_{\star}}\hat{w}_{S}\left({\cal N}_{\hat{\theta}_{S},\tilde{X}_{S}^{T}(I-H)\tilde{X}_{S}}^{S}\otimes\delta_{0}^{S^{c}}\right)(\theta\in\cdot), (13)

where 𝒩μ,ΩS\mathcal{N}_{\mu,\Omega}^{S} is the Gaussian measure with mean μ\mu and precision Ω\Omega on the coordinate SS, δ0Sc\delta_{0}^{S^{c}} is the Dirac measure at zero on ScS^{c}, θ^S\hat{\theta}_{S} is the least squares solution θ^S=(X~ST(IH)X~S)1X~ST(IH)(U+X~θ0)\hat{\theta}_{S}=(\tilde{X}_{S}^{T}(I-H)\tilde{X}_{S})^{-1}\tilde{X}_{S}^{T}(I-H)(U+\tilde{X}\theta_{0}), and the weights w^S\hat{w}_{S} satisfy

w^Sπp(s)(ps)(λ2)s(2π)s/2det(X~ST(IH)X~S)1/2exp{12(IH)X~Sθ^S22}.\displaystyle\hat{w}_{S}\propto\frac{{\pi_{p}(s)}}{{\binom{p}{s}}}\left(\frac{\lambda}{2}\right)^{s}(2\pi)^{s/2}\det\Big{(}\tilde{X}_{S}^{T}(I-H)\tilde{X}_{S}\Big{)}^{-1/2}\exp\bigg{\{}\frac{1}{2}\lVert(I-H)\tilde{X}_{S}\hat{\theta}_{S}\rVert_{2}^{2}\bigg{\}}.

Another way to express Π\Pi^{\infty}, for any measurable p{\cal B}\subset\mathbb{R}^{p}, is

Π(θ|Y(n))\displaystyle\Pi^{\infty}(\theta\in{\cal B}\,|\,Y^{(n)}) =S:sK1sπp(s)(ps)1(λ/2)sΛn(θ)d{(θS)δ0(θSc)}S:sK1sπp(s)(ps)1(λ/2)spΛn(θ)d{(θS)δ0(θSc)},\displaystyle=\frac{\sum_{S:s\leq K_{1}s_{\star}}{\pi_{p}(s)}{\binom{p}{s}}^{-1}\left({\lambda}/{2}\right)^{s}\int_{\cal B}\Lambda_{n}^{\star}(\theta)d\{\mathcal{L}(\theta_{S})\otimes\delta_{0}(\theta_{S^{c}})\}}{\sum_{S:s\leq K_{1}s_{\star}}{\pi_{p}(s)}{\binom{p}{s}}^{-1}\left({\lambda}/{2}\right)^{s}\int_{\mathbb{R}^{p}}\Lambda_{n}^{\star}(\theta)d\{\mathcal{L}(\theta_{S})\otimes\delta_{0}(\theta_{S^{c}})\}},

where \mathcal{L} denotes the Lebesgue measure and

Λn(θ)=exp{12(IH)X~(θθ0)22+UT(IH)X~(θθ0)}.\displaystyle\Lambda_{n}^{\star}(\theta)=\exp\left\{{-\frac{1}{2}\lVert(I-H)\tilde{X}(\theta-\theta_{0})\rVert_{2}^{2}+U^{T}(I-H)\tilde{X}(\theta-\theta_{0})}\right\}. (14)

It can easily be checked that the two expressions are equivalent. The results are summarized in the following theorem.
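To fix ideas, the following is a minimal numerical sketch of the mixture approximation (13) in the simplest setting H = 0 on a toy design: it enumerates the small supports, computes the least-squares centers and the weights from the displayed formula for w^S\hat{w}_{S}. The function name, the illustrative complexity prior πp(s) ∝ p^{-A₄s}, and all tuning values are assumptions for the sketch, not the paper's implementation.

```python
import itertools
from math import comb

import numpy as np

def mixture_approximation(X, U, theta0, lam=1.0, A4=2.0, max_size=2):
    """Sketch of the limiting mixture (13) with H = 0: enumerate supports S,
    compute least-squares centers theta_hat_S and normalized weights w_S."""
    n, p = X.shape
    z = U + X @ theta0                       # U + X-tilde theta_0 in theta_hat_S
    supports, logw, centers = [], [], []
    for size in range(max_size + 1):
        for S in itertools.combinations(range(p), size):
            S = list(S)
            if size == 0:
                supports.append(S); logw.append(0.0); centers.append(np.zeros(0))
                continue
            XS = X[:, S]
            G = XS.T @ XS                    # X_S^T (I - H) X_S with H = 0
            th = np.linalg.solve(G, XS.T @ z)
            lw = (-A4 * size * np.log(p)     # illustrative prior pi_p(s) ~ p^{-A4 s}
                  - np.log(comb(p, size))    # uniform prior over supports of size s
                  + size * np.log(lam / 2)
                  + 0.5 * size * np.log(2 * np.pi)
                  - 0.5 * np.linalg.slogdet(G)[1]
                  + 0.5 * th @ G @ th)       # exp(||(I - H) X_S theta_hat_S||^2 / 2)
            supports.append(S); logw.append(lw); centers.append(th)
    logw = np.array(logw)
    w = np.exp(logw - logw.max())            # normalize on the log scale for stability
    return supports, w / w.sum(), centers
```

With a strong signal on one coordinate, essentially all of the mixture mass falls on supports containing that coordinate, illustrating the concentration behavior used in Section 4.2.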

Theorem 5 (Distributional approximation).

Suppose that (C1)–(C4), (C5), (C6), (C7)–(C10), and (C11) are satisfied for some orthogonal projection HH. Then

𝔼0Π(θ|Y(n))Π(θ|Y(n))TV0.\displaystyle\mathbb{E}_{0}\left\lVert\Pi(\theta\in\cdot\,|\,Y^{(n)})-\Pi^{\infty}(\theta\in\cdot\,|\,Y^{(n)})\right\rVert_{\rm TV}\rightarrow 0. (15)

4.2 Model selection consistency

The shape approximation to the posterior distribution facilitates the next theorem, which shows that the posterior distribution assigns vanishing probability to strict supersets of the true support. This result then serves as the basis for selection consistency. As in the literature, the theorem requires an additional condition on the prior, as follows.

  1. (C12)

    The prior satisfies A4>1A_{4}>1 and spas_{\star}\lesssim p^{a} for a<A41a<A_{4}-1.

Theorem 6 (Selection, no supersets).

Suppose that (C1)–(C4), (C5), (C6), (C7)–(C10), and (C11)–(C12) are satisfied for some orthogonal projection HH. Then

𝔼0Π(θ:SθS0,SθS0|Y(n))0.\displaystyle{\mathbb{E}}_{0}\Pi\left(\theta:S_{\theta}\supset S_{0},S_{\theta}\neq S_{0}\,|\,Y^{(n)}\right)\rightarrow 0. (16)

Since coefficients that are too close to zero cannot be identified by any selection strategy, a threshold on the true nonzero coefficients is needed for detection. The need for such a threshold is a fundamental limitation of high-dimensional setups. We impose the following threshold, the so-called beta-min condition, which is formulated in view of the third assertion of Theorem 4. The second assertion could also be used to form a similar threshold, but we consider only the one below, as it is generally weaker.

  1. (C13)

    The true parameter satisfies

    minθ0,j0|θ0,j|>K5s0logpϕ2((K4+1)s0)X.\displaystyle\min_{\theta_{0,j}\neq 0}|\theta_{0,j}|>\frac{K_{5}\sqrt{s_{0}\log p}}{\phi_{2}((K_{4}+1)s_{0})\lVert X\rVert_{\ast}}.

Since Theorem 3 implies that the support of θ\theta under the posterior contains the true support with probability tending to one, selection consistency is an easy consequence of Theorem 6 under the beta-min condition (C13). Moreover, this refines the distributional approximation in (15) so that the posterior distribution can be approximated by a single component of the mixture; that is, the Bernstein-von Mises theorem holds for the parameter component θS0\theta_{S_{0}}. These arguments are summarized in the following two corollaries, whose proofs are straightforward and thus omitted.

Corollary 2 (Selection consistency).

Suppose that (C1)–(C4), (C5), (C6), (C7)–(C10), and (C11)–(C13) are satisfied for some orthogonal projection HH. Then

𝔼0Π(θ:SθS0|Y(n))0.\displaystyle{\mathbb{E}}_{0}\Pi\left(\theta:S_{\theta}\neq S_{0}\,|\,Y^{(n)}\right)\rightarrow 0. (17)
Corollary 3 (Bernstein-von Mises).

Suppose that (C1)–(C4), (C5), (C6), (C7)–(C10), and (C11)–(C13) are satisfied for some orthogonal projection HH. Then

𝔼0Π(θ|Y(n))(𝒩θ^S0,X~S0T(IH)X~S0S0δ0S0c)(θ)TV0.\displaystyle\begin{split}{\mathbb{E}}_{0}\bigg{\lVert}&\Pi(\theta\in\cdot\,|\,Y^{(n)})-\left({\cal N}_{\hat{\theta}_{S_{0}},\tilde{X}_{S_{0}}^{T}(I-H)\tilde{X}_{S_{0}}}^{S_{0}}\otimes\delta_{0}^{S_{0}^{c}}\right)(\theta\in\cdot)\bigg{\rVert}_{\rm TV}\rightarrow 0.\end{split} (18)

Corollary 3 enables us to quantify the remaining uncertainty in the parameter through the posterior distribution. Specifically, we can construct credible sets for the individual components of θ0\theta_{0} as in Castillo et al., [8]. It is easy to see from the definition of θ^S0\hat{\theta}_{S_{0}} that its jjth component has a normal sampling distribution, whose mean is the jjth element of θ0,S0\theta_{0,S_{0}} and whose variance is the jjth diagonal element of (X~S0T(IH)X~S0)1(\tilde{X}_{S_{0}}^{T}(I-H)\tilde{X}_{S_{0}})^{-1}. Correct uncertainty quantification is thus guaranteed by the weak convergence.
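As a sketch of this construction, the following computes componentwise credible intervals from the limiting normal in (18) with H = 0; the helper name and toy inputs are illustrative assumptions, not the paper's code.

```python
import numpy as np
from statistics import NormalDist

def marginal_credible_intervals(X_S0, theta_hat_S0, level=0.95):
    """Componentwise credible intervals from the limiting posterior
    N(theta_hat_S0, (X_S0^T X_S0)^{-1}) on S_0 in (18), taking H = 0."""
    cov = np.linalg.inv(X_S0.T @ X_S0)       # (X_S0^T (I - H) X_S0)^{-1}
    se = np.sqrt(np.diag(cov))               # j-th diagonal gives the j-th variance
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = theta_hat_S0 - z * se
    upper = theta_hat_S0 + z * se
    return np.column_stack([lower, upper])
```

Each interval is centered at the corresponding component of the posterior mean, so its frequentist coverage follows from the matching sampling distribution of that center noted above.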

5 Applications

In this section, we apply the main results established in this study to the examples considered in Section 1.1. The main objective is to obtain nearly optimal posterior contraction rates and selection consistency via shape approximation to the posterior distribution with the Bernstein-von Mises phenomenon.

To use Corollary 1 for the optimal posterior contraction when nϵ¯n2=lognn\bar{\epsilon}_{n}^{2}=\log n, we simply assume that s0>0s_{0}>0 for all examples in this section, although Theorem 4 can also be applied under stronger conditions. The assumption s0>0s_{0}>0 is extremely mild compared with restricting attention to the ultra high-dimensional case, i.e., logn=o(logp)\log n=o(\log p). A large enough A4A_{4} would also suffice in place of the assumption s0>0s_{0}>0, but we do not pursue this direction as a specific threshold is not available. We check the conditions of Theorem 4 only for the more complicated models where nϵ¯n2>lognn\bar{\epsilon}_{n}^{2}>\log n.

5.1 Multiple response models with missing components

We first apply the main results to Example 1. To recover posterior contraction of Σ\Sigma from the primitive results, it is necessary to assume that every pair of entries of the response is jointly observed sufficiently many times. To be more specific, let eije_{ij} be 1 if the jjth entry of YiaugY_{i}^{\rm aug} is observed and zero otherwise. The contraction rate of the (j,k)(j,k)th element of Σ\Sigma is directly determined by the order of n1i=1neijeikn^{-1}\sum_{i=1}^{n}e_{ij}e_{ik}. The ideal case is when this quantity is bounded away from zero, that is, the entries are jointly observed at a rate proportional to nn. Then the recovery is possible without any loss of information. If n1i=1neijeikn^{-1}\sum_{i=1}^{n}e_{ij}e_{ik} decays to zero, the optimal recovery is not attainable, but consistent estimation may still be possible with slower rates. With an inverse Wishart prior on Σ\Sigma, the following theorem studies the posterior asymptotic properties of the given model.
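The joint-observation rates driving these recovery rates can be computed directly from the missingness pattern; a small sketch with illustrative names:

```python
import numpy as np

def joint_observation_rates(E):
    """Given a 0/1 missingness indicator matrix E of shape (n, m), return the
    m x m matrix whose (j, k) entry is n^{-1} sum_i e_ij e_ik, the fraction of
    observations in which entries j and k are jointly observed."""
    E = np.asarray(E, dtype=float)
    return (E.T @ E) / E.shape[0]
```

The rate c_n in the theorem below then behaves like the reciprocal of the smallest entry of this matrix, so a single rarely co-observed pair controls the contraction rate for Σ.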

Theorem 7.

Assume that s0>0s_{0}>0, 1ρmin(Σ0)ρmax(Σ0)11\lesssim\rho_{\min}(\Sigma_{0})\leq\rho_{\max}(\Sigma_{0})\lesssim 1, θ0λ1logp\lVert\theta_{0}\rVert_{\infty}\lesssim\lambda^{-1}\log p, and minj,kn1i=1neijeikcn1\min_{j,k}n^{-1}\sum_{i=1}^{n}e_{ij}e_{ik}\gtrsim c_{n}^{-1} for some nondecreasing cnc_{n} such that cns0logp=o(n)c_{n}s_{0}\log p=o(n). Then the following assertions hold.

  1. (a)

    The optimal posterior contraction rates for θ\theta in (11) are obtained.

  2. (b)

    The posterior contraction rate for Σ\Sigma is cn(s0logp)/n\sqrt{c_{n}(s_{0}\log p)/n} with respect to the Frobenius norm.

Assume further that cn(s02logcn)(s0logp)3=o(n)c_{n}(s_{0}^{2}\vee\log c_{n})(s_{0}\log p)^{3}=o(n) and ϕ1(Ds0)1\phi_{1}(Ds_{0})\gtrsim 1 for a sufficiently large DD. Then the following assertions hold.

  1. (c)

    For Hn×nH\in\mathbb{R}^{n_{\ast}\times n_{\ast}} the zero matrix, the distributional approximation in (15) holds.

  2. (d)

    If A4>1A_{4}>1 and s0pas_{0}\lesssim p^{a} for a<A41a<A_{4}-1, then the no-superset result in (16) holds.

  3. (e)

    Under the beta-min condition as well as the conditions for (d), the selection consistency in (17) and the Bernstein-von Mises theorem in (18) hold.

5.2 Multivariate measurement error models

We now consider Example 2. For convenience, we write Y=(Y1,,Yn)TnY^{\ast}=(Y_{1}^{\ast},\dots,Y_{n}^{\ast})^{T}\in\mathbb{R}^{n}, W=(W1T,,WnT)TnqW=(W_{1}^{T},\dots,W_{n}^{T})^{T}\in\mathbb{R}^{nq}, and X=(X1,,Xn)Tn×pX^{\ast}=(X_{1}^{\ast},\dots,X_{n}^{\ast})^{T}\in\mathbb{R}^{n\times p} in what follows. In this subsection, we use the symbol \otimes for the Kronecker product of matrices. For the priors on the nuisance parameters, normal prior distributions are assigned to the location parameters (α\alpha, β\beta, and μ\mu), and an inverse gamma prior and an inverse Wishart prior are used for the scale parameters (σ2\sigma^{2} and Σ\Sigma, respectively). The next theorem shows the posterior asymptotic properties of the model. In particular, the specific forms of the mean and covariance in the shape approximation are provided, reflecting the modeling structure.

Theorem 8.

Assume that s0>0s_{0}>0, s0logp=o(n)s_{0}\log p=o(n), |α0|β0μ01|\alpha_{0}|\vee\lVert\beta_{0}\rVert_{\infty}\vee\lVert\mu_{0}\rVert_{\infty}\lesssim 1, 1σ0211\lesssim\sigma_{0}^{2}\lesssim 1, 1ρmin(Σ0)ρmax(Σ0)11\lesssim\rho_{\min}(\Sigma_{0})\leq\rho_{\max}(\Sigma_{0})\lesssim 1, θ0λ1logp\lVert\theta_{0}\rVert_{\infty}\lesssim\lambda^{-1}\log p, and min{ςmin([XS,1n]):sDs0}1\min\{\varsigma_{\min}([X_{S}^{\ast},1_{n}]):s\leq Ds_{0}\}\gtrsim 1 for a sufficiently large DD. Then the following assertions hold.

  1. (a)

    The optimal posterior contraction rates for θ\theta in (11) are obtained.

  2. (b)

    The contraction rates for α\alpha, β\beta, μ\mu, and σ2\sigma^{2} are (s0logp)/n\sqrt{(s_{0}\log p)/n} relative to the 2\ell_{2}-norms. The same rate is also obtained for Σ\Sigma with respect to the Frobenius norm.

Assume further that s05log3p=o(n)s_{0}^{5}\log^{3}p=o(n) and ϕ1(Ds0)1\phi_{1}(Ds_{0})\gtrsim 1 for a sufficiently large DD. Then the following assertions hold.

  1. (c)

    The distributional approximation in (15) holds with the mean vector

    θ^S=\displaystyle\hat{\theta}_{S}= (XSTHXS)1XST{H[(Y(α0+μ0Tβ0)1n)\displaystyle(X_{S}^{\ast T}H^{\ast}X_{S}^{\ast})^{-1}X_{S}^{\ast T}\Big{\{}H^{\ast}\Big{[}\left(Y^{\ast}-(\alpha_{0}+\mu_{0}^{T}\beta_{0})1_{n}\right)
    (In(β0TΣ0(Σ0+Ψ)1))(W1nμ0)]}\displaystyle\qquad\qquad\qquad\qquad-\left(I_{n}\otimes(\beta_{0}^{T}\Sigma_{0}(\Sigma_{0}+\Psi)^{-1})\right)\left(W-1_{n}\otimes\mu_{0}\right)\Big{]}\Big{\}}

    and the covariance matrix (σ02+β0TΣ0(Σ0+Ψ)1Ψβ0)(XSTHXS)1(\sigma_{0}^{2}+\beta_{0}^{T}\Sigma_{0}(\Sigma_{0}+\Psi)^{-1}\Psi\beta_{0})(X_{S}^{\ast T}H^{\ast}X_{S}^{\ast})^{-1} for H=Inn11n1nTH^{\ast}=I_{n}-n^{-1}1_{n}1_{n}^{T}.

  2. (d)

    If A4>1A_{4}>1 and s0pas_{0}\lesssim p^{a} for a<A41a<A_{4}-1, then the no-superset result in (16) holds.

  3. (e)

    Under the beta-min condition as well as the conditions for (d), the selection consistency in (17) and the Bernstein-von Mises theorem in (18) hold.

We note that the marginal law of WiW_{i} is given by WiN(μ,Σ+Ψ)W_{i}\sim{\rm N}(\mu,\Sigma+\Psi). This gives hope that the rates for μ\mu and Σ\Sigma may actually be improved up to the parametric rate n1/2n^{-1/2} (possibly up to logarithmic factors). However, the other parameters are connected to the high-dimensional coefficients θ\theta, so such a parametric rate may not be obtainable for them.
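The projection H* = I_n − n⁻¹1_n 1_nᵀ appearing in Theorem 8 is simply the centering operator that removes the intercept direction; a quick sketch with toy numbers verifies its two defining properties (the values are illustrative only):

```python
import numpy as np

n = 5
H_star = np.eye(n) - np.ones((n, n)) / n     # H* = I_n - n^{-1} 1_n 1_n^T
x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

# Applying H* subtracts the sample mean, annihilating the intercept direction 1_n,
# which is why the unknown intercept alpha drops out of the shape approximation.
centered = H_star @ x
idempotent = np.allclose(H_star @ H_star, H_star)   # orthogonal projection
```

This is why the mean vector in assertion (c) involves only centered responses and surrogates.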

5.3 Parametric correlation structure

Next, our main results are applied to Example 3. A correlation matrix Gi(α)G_{i}(\alpha) should be chosen so that the conditions in the main theorems can be satisfied. Here we consider compound-symmetric, first-order autoregressive, and first-order moving average correlation matrices: for α(b1,b2)\alpha\in(b_{1},b_{2}) with fixed boundaries b1b_{1} and b2b_{2}, {GiCS(α)}j,k=𝟙(j=k)+α𝟙(jk)\{G_{i}^{\rm CS}(\alpha)\}_{j,k}=\mathbbm{1}(j=k)+\alpha\mathbbm{1}(j\neq k), {GiAR(α)}j,k=α|jk|\{G_{i}^{\rm AR}(\alpha)\}_{j,k}=\alpha^{|j-k|}, and {GiMA(α)}j,k=𝟙(j=k)+α𝟙(|jk|=1)\{G_{i}^{\rm MA}(\alpha)\}_{j,k}=\mathbbm{1}(j=k)+\alpha\mathbbm{1}(|j-k|=1). The range is chosen so that the corresponding correlation matrix is positive definite, i.e., (b1,b2)=(0,1)(b_{1},b_{2})=(0,1) for GiCS(α)G_{i}^{\rm CS}(\alpha), (b1,b2)=(1,1)(b_{1},b_{2})=(-1,1) for GiAR(α)G_{i}^{\rm AR}(\alpha), and (b1,b2)=(1/2,1/2)(b_{1},b_{2})=(-1/2,1/2) for GiMA(α)G_{i}^{\rm MA}(\alpha). Again, an inverse gamma prior is assigned to σ2\sigma^{2}. For the prior on α\alpha, we consider the density

Π(dα)exp{1(αb1)c1(b2α)c2},α(b1,b2),\displaystyle\Pi(d\alpha)\propto\exp\left\{-\frac{1}{(\alpha-b_{1})^{c_{1}}(b_{2}-\alpha)^{c_{2}}}\right\},\quad\alpha\in(b_{1},b_{2}),

for some c1,c2>0c_{1},c_{2}>0 such that Π(α<t)exp((tb1)c1)\Pi(\alpha<t)\lesssim\exp(-(t-b_{1})^{-c_{1}}) for t>b1t>b_{1} close to b1b_{1} and Π(α>t)exp((b2t)c2)\Pi(\alpha>t)\lesssim\exp(-(b_{2}-t)^{-c_{2}}) for t<b2t<b_{2} close to b2b_{2}.
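The three working correlation structures, and the role of the stated ranges, can be illustrated concretely; the function below is a sketch with illustrative names, and the positive-definiteness check confirms that stepping outside the stated range breaks the MA structure.

```python
import numpy as np

def corr_matrix(kind, alpha, m):
    """Sketch of the three working correlation structures of Section 5.3."""
    idx = np.arange(m)
    D = np.abs(idx[:, None] - idx[None, :])   # |j - k|
    if kind == "CS":                          # compound symmetry
        return np.where(D == 0, 1.0, alpha)
    if kind == "AR":                          # first-order autoregressive
        return alpha ** D
    if kind == "MA":                          # first-order moving average
        return np.where(D == 0, 1.0, np.where(D == 1, alpha, 0.0))
    raise ValueError(kind)
```

For instance, an MA matrix with α just outside (−1/2, 1/2) already has a negative eigenvalue for moderate m, which is exactly why the prior support is restricted to (b₁, b₂).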

Theorem 9.

Assume that s0>0s_{0}>0, s0logp=o(n)s_{0}\log p=o(n), m¯nn\overline{m}n\asymp n_{\ast}, θ0λ1logp\lVert\theta_{0}\rVert_{\infty}\lesssim\lambda^{-1}\log p, σ021\sigma_{0}^{2}\asymp 1, and α0[b1+ϵ,b2ϵ]\alpha_{0}\in[b_{1}+\epsilon,b_{2}-\epsilon] for some fixed ϵ>0\epsilon>0. Suppose further that m¯1\overline{m}\lesssim 1 for the compound-symmetric correlation matrix and logm¯logp\log\overline{m}\lesssim\log p for the autoregressive and moving average correlation matrices. Then the following assertions hold.

  1. (a)

    For any correlation matrix discussed above, the optimal posterior contraction rates for θ\theta in (11) are obtained.

  2. (b)

    For the autoregressive and moving average correlation matrices, the posterior contraction rates for σ2\sigma^{2} and α\alpha are (s0logp)/(m¯n)\sqrt{(s_{0}\log p)/(\overline{m}n)} with respect to the 2\ell_{2}-norms. For the compound-symmetric correlation matrix, their contraction rates are (s0logp)/n\sqrt{(s_{0}\log p)/n} relative to the 2\ell_{2}-norm.

Assume further that s05log3p=o(n)s_{0}^{5}\log^{3}p=o(n) and ϕ1(Ds0)1\phi_{1}(Ds_{0})\gtrsim 1 for a sufficiently large DD. Then the following assertions hold.

  1. (c)

    For Hn×nH\in\mathbb{R}^{n_{\ast}\times n_{\ast}} the zero matrix, the distributional approximation in (15) holds.

  2. (d)

    If A4>1A_{4}>1 and spas_{\star}\lesssim p^{a} for a<A41a<A_{4}-1, then the no-superset result in (16) holds.

  3. (e)

    Under the beta-min condition as well as the conditions for (d), the selection consistency in (17) and the Bernstein-von Mises theorem in (18) hold.

As for the prior on α\alpha, the property that the tail probabilities decay to zero exponentially fast near both boundaries b1b_{1} and b2b_{2} is crucial for the optimal posterior contraction rates. It should be noted that many common probability distributions with compact support (e.g., beta distributions) do not satisfy this requirement.

The main difference between this example and those in the preceding subsections is that we allow possibly increasing mim_{i} here. Although we have the same form of contraction rates for θ\theta as in the previous examples, the implication is not the same due to a different order of X\lVert X\rVert_{\ast}. For increasing mim_{i}, we expect Xn\lVert X\rVert_{\ast}\asymp\sqrt{n_{\ast}}, which is commonly the case in regression settings. This reduces to Xn\lVert X\rVert_{\ast}\asymp\sqrt{n} when mim_{i} is fixed, and hence increasing mim_{i} may help obtain faster rates. While increasing mim_{i} often benefits the contraction properties of θ\theta, this may or may not be the case for the nuisance parameters, since it depends on the dimensionality of η\eta. In the example of this subsection, the dimension of the nuisance parameters is fixed although mim_{i} can increase, which makes their posterior contraction rates faster than those with fixed mim_{i}. This may fail, however, if η\eta has increasing dimension; see the example in Section 5.5.

5.4 Mixed effects models

For the mixed effects models with sparse regression coefficients in Example 4, we assume that the maximum of Zisp\lVert Z_{i}\rVert_{\rm sp} is bounded, which is particularly mild if m¯\overline{m} is bounded. We also assume that i=1n𝟙(miq)n\sum_{i=1}^{n}\mathbbm{1}(m_{i}\geq q)\asymp n and mini{ςmin(Zi):miq}1\min_{i}\{\varsigma_{\min}(Z_{i}):m_{i}\geq q\}\gtrsim 1, that is, a nonvanishing fraction of the mim_{i} are at least qq and the corresponding ZiZ_{i} have full rank. These conditions are required for (C1) to hold. We put an inverse Wishart prior on Ψ\Psi as in the other examples. The following theorem shows the posterior asymptotic properties of the mixed effects models.

Theorem 10.

Assume that s0>0s_{0}>0, s0logp=o(n)s_{0}\log p=o(n), 1ρmin(Ψ0)ρmax(Ψ0)11\lesssim\rho_{\min}(\Psi_{0})\leq\rho_{\max}(\Psi_{0})\lesssim 1, θ0λ1logp\lVert\theta_{0}\rVert_{\infty}\lesssim\lambda^{-1}\log p, i=1n𝟙(miq)n\sum_{i=1}^{n}\mathbbm{1}(m_{i}\geq q)\asymp n, mini{ςmin(Zi):miq}1\min_{i}\{\varsigma_{\min}(Z_{i}):m_{i}\geq q\}\gtrsim 1, and maxiZisp1\max_{i}\lVert Z_{i}\rVert_{\rm sp}\lesssim 1. Then the following assertions hold.

  1. (a)

    The optimal posterior contraction rates for θ\theta in (11) are obtained.

  2. (b)

    The posterior contraction rate for Ψ\Psi is (s0logp)/n\sqrt{(s_{0}\log p)/n} with respect to the Frobenius norm.

Assume further that s05log3p=o(n)s_{0}^{5}\log^{3}p=o(n) and ϕ1(Ds0)1\phi_{1}(Ds_{0})\gtrsim 1 for a sufficiently large DD. Then the following assertions hold.

  1. (c)

    For Hn×nH\in\mathbb{R}^{n_{\ast}\times n_{\ast}} the zero matrix, the distributional approximation in (15) holds.

  2. (d)

    If A4>1A_{4}>1 and s0pas_{0}\lesssim p^{a} for a<A41a<A_{4}-1, then the no-superset result in (16) holds.

  3. (e)

    Under the beta-min condition as well as the conditions for (d), the selection consistency in (17) and the Bernstein-von Mises theorem in (18) hold.

Note that we assume that σ2\sigma^{2} is known, which is actually unnecessary at the modeling stage. The assumption was made to find a sequence ana_{n} satisfying (C1) with ease. This can be relaxed only with stronger assumptions on ZiZ_{i}. For example, if q=1q=1 and ZiZ_{i} is an all-one vector, then the model is equivalent to that with a compound-symmetric correlation matrix in Section 5.3 with some reparameterization, in which σ2\sigma^{2} can be treated as unknown.

5.5 Graphical structure with sparse precision matrices

For the graphical structure models in Example 5, we define an edge-inclusion indicator Υ={υjk:1jkm¯}\Upsilon=\{\upsilon_{jk}:1\leq j\leq k\leq\overline{m}\} such that υjk=1\upsilon_{jk}=1 if ωjk0\omega_{jk}\neq 0 and υjk=0\upsilon_{jk}=0 otherwise, where ωjk\omega_{jk} is the (j,k)(j,k)th element of Ω\Omega. We assign a prior with a density f1f_{1} on \mathbb{R} to the nonzero off-diagonal entries and a prior with a density f2f_{2} on (0,)(0,\infty) to the diagonal entries of Ω\Omega, with the support truncated to a matrix space with restricted eigenvalues and entries. For the edge-inclusion indicator, we use a binomial prior with probability ϖ\varpi given |Υ|j,kυjk|\Upsilon|\coloneqq\sum_{j,k}\upsilon_{jk}, and assign a prior to |Υ||\Upsilon| such that logΠ(|Υ|r¯)r¯logr¯\log\Pi(|\Upsilon|\geq\bar{r})\lesssim-\bar{r}\log\bar{r}. The prior specification is summarized as

Π(Ω|Υ)\displaystyle\Pi(\Omega|\Upsilon) j,k:υjk=1f1(ωjk)j=1m¯f2(ωjj)𝟙0+(L)(Ω),\displaystyle\propto\prod_{j,k:\upsilon_{jk}=1}f_{1}(\omega_{jk})\prod_{j=1}^{\overline{m}}f_{2}(\omega_{jj})\mathbbm{1}_{{\cal M}_{0}^{+}(L)}(\Omega),
Π(Υ)\displaystyle\Pi(\Upsilon) ϖr¯(1ϖ)(m¯2)r¯Π(|Υ|=r¯),logΠ(|Υ|r¯)r¯logr¯,\displaystyle\propto\varpi^{\bar{r}}(1-\varpi)^{\binom{\overline{m}}{2}-\bar{r}}\Pi(|\Upsilon|=\bar{r}),\quad\log\Pi(|\Upsilon|\geq\bar{r})\lesssim-\bar{r}\log\bar{r},

where 0+(L){\cal M}_{0}^{+}(L) is a collection of m¯×m¯\overline{m}\times\overline{m} positive definite matrices for a sufficiently large LL, whose eigenvalues lie in [L1,L][L^{-1},L] and whose entries are bounded by LL in absolute value.
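The truncation set on which the prior is supported can be written as a simple membership check; the following is a sketch with an illustrative function name.

```python
import numpy as np

def in_M0_plus(Omega, L):
    """Check membership in M_0^+(L): symmetric positive definite matrices whose
    eigenvalues lie in [1/L, L] and whose entries are at most L in absolute value."""
    Omega = np.asarray(Omega, dtype=float)
    if not np.allclose(Omega, Omega.T):
        return False
    eig = np.linalg.eigvalsh(Omega)
    return bool(eig.min() >= 1 / L and eig.max() <= L
                and np.abs(Omega).max() <= L)
```

Truncating to this set is what keeps the spectrum of Ω bounded away from zero and infinity, a property used repeatedly in the proofs.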

Theorem 11.

Let s=s0s¯s_{\star}=s_{0}\vee\bar{s}_{\star} for s¯=(m¯+d)(logn)/logp\bar{s}_{\star}=(\overline{m}+d)(\log n)/\log p. Assume that s0>0s_{0}>0, s0logp=o(n)s_{0}\log p=o(n), m¯logn=o(n)\overline{m}\log n=o(n), |Υ0|d|\Upsilon_{0}|\leq d for some dd such that dlogn=o(n)d\log n=o(n), Ω00+(cL)\Omega_{0}\in{\cal M}_{0}^{+}(cL) for some 0<c<10<c<1, and θ0λ1logp\lVert\theta_{0}\rVert_{\infty}\lesssim\lambda^{-1}\log p. Then the following assertions hold.

  1. (a)

    The posterior contraction rates for θ\theta are given by (9). If s¯1\bar{s}_{\star}\lesssim 1, the optimal rates in (11) are obtained.

  2. (b)

    The posterior contraction rate of Ω\Omega is (s0logp(m¯+d)logn)/n\sqrt{(s_{0}\log p\vee(\overline{m}+d)\log n)/n} with respect to the Frobenius norm.

If further (s¯m¯2)s¯logp=o(n)(\bar{s}_{\star}\vee\overline{m}^{2})\bar{s}_{\star}\log p=o(n) and ϕ1(Ds¯)1\phi_{1}(D\bar{s}_{\star})\gtrsim 1 for a sufficiently large DD, then the following assertion holds.

  1. (c)

    The optimal posterior contraction rates for θ\theta in (11) are obtained even if s¯\bar{s}_{\star}\rightarrow\infty.

Assume further that (sm¯)2(slogp)3=o(n)(s_{\star}\vee\overline{m})^{2}(s_{\star}\log p)^{3}=o(n) and ϕ1(Ds)1\phi_{1}(Ds_{\star})\gtrsim 1 for a sufficiently large DD. Then the following assertions hold.

  1. (d)

    For Hn×nH\in\mathbb{R}^{n_{\ast}\times n_{\ast}} the zero matrix, the distributional approximation in (15) holds.

  2. (e)

    If A4>1A_{4}>1 and spas_{\star}\lesssim p^{a} for a<A41a<A_{4}-1, then the no-superset result in (16) holds.

  3. (f)

    Under the beta-min condition as well as the conditions for (e), the selection consistency in (17) and the Bernstein-von Mises theorem in (18) hold.

Note that increasing m¯\overline{m} is likely to improve the 2\ell_{2}-norm contraction rate for θ\theta, as we expect Xm¯n\lVert X\rVert_{\ast}\asymp\sqrt{\overline{m}n}. In particular, the improvement clearly occurs if dm¯d\lesssim\overline{m} and ϕ2(Ds)1\phi_{2}(Ds_{\star})\gtrsim 1 for a sufficiently large DD. However, as pointed out in Section 5.3, this is not the case for Ω\Omega, as its dimension is also increasing.

If we assume that lognlogm¯\log n\lesssim\log\overline{m}, then the term (m¯+d)(logn)/n\sqrt{(\overline{m}+d)(\log n)/n} arising from the sparse precision matrix Ω\Omega becomes (m¯+d)(logm¯)/n\sqrt{(\overline{m}+d)(\log\overline{m})/n}. The latter is comparable to the frequentist convergence rate of the graphical lasso in Rothman et al., [27]. Therefore, our rate is deemed to be optimal considering the additional complication due to the mean term involving sparse regression coefficients.

5.6 Nonparametric heteroskedastic regression models

Next, we use the main results for Example 6. For a bounded, convex subset 𝒳{\cal X}\subset\mathbb{R}, define the α\alpha-Hölder class α(𝒳)\mathfrak{C}^{\alpha}({\cal X}) as the collection of functions f:𝒳f:{\cal X}\rightarrow\mathbb{R} such that fα<\lVert f\rVert_{\mathfrak{C}^{\alpha}}<\infty, where

fα=max0kαsupx𝒳|f(k)(x)|+supx,y𝒳:xy|f(α)(x)f(α)(y)||xy|αα,\displaystyle\lVert f\rVert_{\mathfrak{C}^{\alpha}}=\max_{0\leq k\leq\lfloor\alpha\rfloor}\sup_{x\in{\cal X}}|f^{(k)}(x)|+\sup_{x,y\in{\cal X}:x\neq y}\frac{|f^{(\lfloor\alpha\rfloor)}(x)-f^{(\lfloor\alpha\rfloor)}(y)|}{|x-y|^{\alpha-\lfloor\alpha\rfloor}},

with the kkth derivative f(k)f^{(k)} of ff and α\lfloor\alpha\rfloor the largest integer that is strictly smaller than α\alpha. Let the true function v0v_{0} belong to α[0,1]\mathfrak{C}^{\alpha}[0,1] under the assumption that v0v_{0} is strictly positive. While α>1/2\alpha>1/2 suffices for the basic posterior contraction, we will see that the optimal posterior contraction for θ\theta requires α>1\alpha>1. An even stronger condition, α>2\alpha>2, is needed for the Bernstein-von Mises theorem and the selection consistency, but all these conditions are mild if the true function is sufficiently smooth.

We put a prior on vv through B-splines. The function is expressed as a linear combination of JJ-dimensional B-spline basis terms BJB_{J} of order qαq\geq\alpha, i.e., vβ(z)=βTBJ(z)v_{\beta}(z)=\beta^{T}B_{J}(z), and an inverse Gaussian prior distribution is independently assigned to each entry of β\beta. For any measurable function f:[0,1]f:[0,1]\mapsto\mathbb{R}, we let f=supz[0,1]|f(z)|\lVert f\rVert_{\infty}=\sup_{z\in[0,1]}|f(z)| and f2,n=(n1i=1n|f(zi)|2)1/2\lVert f\rVert_{2,n}=(n^{-1}\sum_{i=1}^{n}|f(z_{i})|^{2})^{1/2} denote the sup-norm and the empirical L2L_{2}-norm, respectively. To exploit the properties of B-splines, we assume that the ziz_{i} are sufficiently regularly distributed on [0,1][0,1].

Theorem 12.

The true function v0v_{0} is assumed to be strictly positive on [0,1][0,1] and to belong to α[0,1]\mathfrak{C}^{\alpha}[0,1] with α>1/2\alpha>1/2. We choose J(n/logn)1/(2α+1)J\asymp(n/\log n)^{1/(2\alpha+1)}. Let s=s0s¯s_{\star}=s_{0}\vee\bar{s}_{\star} for s¯=(logn)2α/(2α+1)n1/(2α+1)/logp\bar{s}_{\star}=(\log n)^{2\alpha/(2\alpha+1)}n^{1/(2\alpha+1)}/\log p and assume that s0>0s_{0}>0, Js0logp=o(n)Js_{0}\log p=o(n), and θ0λ1logp\lVert\theta_{0}\rVert_{\infty}\lesssim\lambda^{-1}\log p. Then the following assertions hold.

  1. (a)

    The posterior contraction rates for θ\theta are given by (9). If s¯1\bar{s}_{\star}\lesssim 1, the optimal rates in (11) are obtained.

  2. (b)

    The posterior contraction rate for vv is (s0logp)/n(logn/n)α/(2α+1)\sqrt{(s_{0}\log p)/n}\vee(\log n/n)^{\alpha/(2\alpha+1)} with respect to the 2,n\lVert\cdot\rVert_{2,n}-norm.

If further α>1\alpha>1 and ϕ1(Ds¯)1\phi_{1}(D\bar{s}_{\star})\gtrsim 1 for a sufficiently large DD, then the following assertion holds.

  1. (c)

    The optimal posterior contraction rates for θ\theta in (11) are obtained even if s¯\bar{s}_{\star}\rightarrow\infty.

Assume further that α>2\alpha>2, J(s2J)(slogp)3=o(n)J(s_{\star}^{2}\vee J)(s_{\star}\log p)^{3}=o(n) and ϕ1(Ds)1\phi_{1}(Ds_{\star})\gtrsim 1 for a sufficiently large DD. Then the following assertions hold.

  1. (d)

    The distributional approximation in (15) holds with HH the n×nn\times n zero matrix.

  2. (e)

    If A4>1A_{4}>1 and spas_{\star}\lesssim p^{a} for a<A41a<A_{4}-1, then the no-superset result in (16) holds.

  3. (f)

    Under the beta-min condition as well as the conditions for (e), the selection consistency in (17) and the Bernstein-von Mises theorem in (18) hold.

An inverse Gaussian prior is used due to the property that its tail probabilities at both zero and infinity decay to zero exponentially fast. The exponentially decaying tail probabilities in both directions are essential to obtain the optimal contraction rate. Note that standard choices such as gamma and inverse gamma distributions do not satisfy this property.

By investigating the proof, it can be seen that the condition α>1/2\alpha>1/2 is required to satisfy condition (C1) for posterior contraction, so it is unavoidable in applying the main theorems. Unlike Theorem 13 below, assertion (c) does not require any further boundedness condition, because the restriction α>1\alpha>1 makes the required bound tend to zero. For the Bernstein-von Mises theorem and the selection consistency, α>2\alpha>2 is necessary, but not sufficient, for the condition J(s2J)(slogp)3=o(n)J(s_{\star}^{2}\vee J)(s_{\star}\log p)^{3}=o(n). Although the requirement α>2\alpha>2 is thus implied by the latter condition, we state it explicitly due to its importance. We refer to the proof of Theorem 12 for more details.

5.7 Partial linear models

Lastly, we consider Example 7. We assume that the true function g0g_{0} belongs to α[0,1]\mathfrak{C}^{\alpha}[0,1] with α>0\alpha>0. Any α>0\alpha>0 suffices for the basic posterior contraction, but stronger restrictions are required for the further assertions, as in Theorem 12. We put a prior on gg through JJ-dimensional B-spline basis terms of order qαq\geq\alpha, i.e., gβ(z)=βTBJ(z)g_{\beta}(z)=\beta^{T}B_{J}(z). With a given JJ, we define the design matrix WJ=(BJ(z1),,BJ(zn))Tn×JW_{J}=(B_{J}(z_{1}),\dots,B_{J}(z_{n}))^{T}\in\mathbb{R}^{n\times J}. The standard normal prior is independently assigned to each component of β\beta, and an inverse gamma prior is assigned to σ2\sigma^{2}. As in Section 5.6, we assume that the ziz_{i} are sufficiently regularly distributed on [0,1][0,1].
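As a concrete sketch of this setup, the following builds the design matrix W_J with a self-contained Cox-de Boor recursion (a library routine could be used instead) and forms the projection H = W_J(W_JᵀW_J)⁻¹W_Jᵀ used in Theorem 13 below; the uniform clamped knots, J = 6, cubic order q = 4, and design points in [0, 1) are illustrative assumptions.

```python
import numpy as np

def bspline_design(z, J, q=4):
    """Design matrix W_J = (B_J(z_1), ..., B_J(z_n))^T for J clamped B-spline
    basis functions of order q (degree q - 1) on [0, 1], via the Cox-de Boor
    recursion; design points are assumed to lie in [0, 1)."""
    z = np.asarray(z, dtype=float)
    k = q - 1
    t = np.r_[np.zeros(k), np.linspace(0.0, 1.0, J - k + 1), np.ones(k)]

    def safe_div(num, den):
        # B-spline convention: a term with zero denominator is zero
        return np.where(den > 0, num / np.where(den > 0, den, 1.0), 0.0)

    # degree 0: indicators of the knot intervals [t_j, t_{j+1})
    B = [((t[j] <= z) & (z < t[j + 1])).astype(float) for j in range(len(t) - 1)]
    for d in range(1, k + 1):                # Cox-de Boor degree elevation
        B = [safe_div(z - t[j], t[j + d] - t[j]) * B[j]
             + safe_div(t[j + d + 1] - z, t[j + d + 1] - t[j + 1]) * B[j + 1]
             for j in range(len(t) - d - 1)]
    return np.column_stack(B)                # shape (n, J)

# The orthogonal projection onto the B-spline column space, as in Theorem 13:
z = np.linspace(0.01, 0.99, 40)
W = bspline_design(z, J=6)
H = W @ np.linalg.solve(W.T @ W, W.T)        # H = W_J (W_J^T W_J)^{-1} W_J^T
```

Projecting out the column space of W_J is how the nonparametric component g is removed before the sparse linear part is analyzed, which is why H enters assertion (d).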

Theorem 13.

The true function is assumed to satisfy g0α[0,1]g_{0}\in\mathfrak{C}^{\alpha}[0,1] with α>0\alpha>0. We choose J(n/logn)1/(2α¯+1)J\asymp(n/\log n)^{1/(2\bar{\alpha}+1)} for some α¯α\bar{\alpha}\leq\alpha. Let s=s0s¯s_{\star}=s_{0}\vee\bar{s}_{\star} for s¯=(logn)2α¯/(2α¯+1)n1/(2α¯+1)/logp\bar{s}_{\star}=(\log n)^{2\bar{\alpha}/(2\bar{\alpha}+1)}n^{1/(2\bar{\alpha}+1)}/\log p and assume that s0>0s_{0}>0, s0logp=o(n)s_{0}\log p=o(n), σ021\sigma_{0}^{2}\asymp 1, θ0λ1logp\lVert\theta_{0}\rVert_{\infty}\lesssim\lambda^{-1}\log p, and min{ςmin([XS,WJ]):sDs}1\min\{\varsigma_{\min}([X_{S},W_{J}]):s\leq Ds_{\star}\}\gtrsim 1 for a sufficiently large DD. Then the following assertions hold.

  1. (a)

    The posterior contraction rates for θ\theta are given by (9). If s¯1\bar{s}_{\star}\lesssim 1, the optimal rates in (11) are obtained.

  2. (b)

    The contraction rates for gg and σ2\sigma^{2} are (s0logp)/n(logn/n)α¯/(2α¯+1)\sqrt{(s_{0}\log p)/n}\vee(\log n/n)^{\bar{\alpha}/(2\bar{\alpha}+1)} with respect to the 2,n\lVert\cdot\rVert_{2,n}- and 2\ell_{2}-norms, respectively.

If further 1/2α¯<α1/2\leq\bar{\alpha}<\alpha, (logn)2(α2α¯)/(2α¯+1)n(2(α2α¯)+2α¯+1)/(2α¯+1)=o(logp)(\log n)^{2(\alpha\wedge 2\bar{\alpha})/(2\bar{\alpha}+1)}n^{(-2(\alpha\wedge 2\bar{\alpha})+2\bar{\alpha}+1)/(2\bar{\alpha}+1)}=o(\log p), and ϕ1(Ds¯)1\phi_{1}(D\bar{s}_{\star})\gtrsim 1 for a sufficiently large DD, then the following assertion holds.

  1. (c)

    The optimal posterior contraction rates for θ\theta in (11) are obtained even if s¯\bar{s}_{\star}\rightarrow\infty.

Assume that 1<α¯<α1/21<\bar{\alpha}<\alpha-1/2, (s2logp)(logn)2α/(2α¯+1)n(2(α¯α)+1)/(2α¯+1)=o(1)(s_{\star}^{2}\log p)(\log n)^{2\alpha/(2\bar{\alpha}+1)}n^{(2(\bar{\alpha}-\alpha)+1)/(2\bar{\alpha}+1)}=o(1), s5log3p=o(n)s_{\star}^{5}\log^{3}p=o(n), and ϕ1(Ds)1\phi_{1}(Ds_{\star})\gtrsim 1 for a sufficiently large DD. Then the following assertions hold.

  1. (d)

    The distributional approximation in (15) holds for the projection matrix H=WJ(WJTWJ)1WJTH=W_{J}(W_{J}^{T}W_{J})^{-1}W_{J}^{T}.

  2. (e)

    If A4>1A_{4}>1 and spas_{\star}\lesssim p^{a} for a<A41a<A_{4}-1, then the no-superset result in (16) holds.

  3. (f)

    Under the beta-min condition as well as the conditions for (e), the selection consistency in (17) and the Bernstein-von Mises theorem in (18) hold.

Here we elaborate more on the choice of the number JJ of basis terms. For assertions (a)–(b), JJ can be chosen such that α¯=α\bar{\alpha}=\alpha, which gives rise to the optimal rates for the nuisance parameters. This choice, however, does not satisfy (C4) and (C8), and hence we need a better approximation for (IH)ξ~η02\lVert(I-H)\tilde{\xi}_{\eta_{0}}\rVert_{2} with some α¯<α\bar{\alpha}<\alpha to strictly control the remaining bias. For example, if α¯=α\bar{\alpha}=\alpha, the boundedness condition for (c) reduces to s¯=o(1)\bar{s}_{\star}=o(1), which already gives the optimal contraction for θ\theta by (a). Therefore, to incorporate the case that s¯\bar{s}_{\star}\rightarrow\infty, one needs to consider an appropriate α¯\bar{\alpha} that is strictly smaller than α\alpha. For the Bernstein-von Mises theorem and the selection consistency, the required restriction becomes even stronger: α¯<α1/2\bar{\alpha}<\alpha-1/2.
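The tradeoff in choosing α¯\bar{\alpha} can be made concrete with a small numerical illustration (our own example, with arbitrary sample sizes; the constants hidden in \asymp are ignored): a smoother choice of α¯\bar{\alpha} yields a smaller basis dimension JJ and a smaller effective sparsity s¯\bar{s}_{\star}.

```python
import math

def rates(n, p, alpha_bar):
    """Illustrative computation of J ~ (n / log n)^{1/(2a+1)} and
    s_bar = (log n)^{2a/(2a+1)} n^{1/(2a+1)} / log p for a = alpha_bar,
    dropping the unspecified proportionality constants."""
    a = alpha_bar
    J = (n / math.log(n)) ** (1.0 / (2 * a + 1))
    s_bar = (math.log(n) ** (2 * a / (2 * a + 1))
             * n ** (1.0 / (2 * a + 1)) / math.log(p))
    return J, s_bar

n, p = 10_000, 1_000
J_smooth, s_smooth = rates(n, p, alpha_bar=2.0)  # smoother: fewer basis terms
J_rough, s_rough = rates(n, p, alpha_bar=0.5)    # rougher: more basis terms
```

Taking α¯\bar{\alpha} smaller (rougher) inflates both JJ and s¯\bar{s}_{\star}, which is why the bias-control conditions above force α¯\bar{\alpha} strictly below α\alpha only when needed.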

Appendix A Proofs for the main results

In this section, we provide proofs of the main theorems. We first describe the additional notation used in the proofs. For a matrix XX, we write ρ1(X)ρ2(X)\rho_{1}(X)\geq\rho_{2}(X)\geq\cdots for the eigenvalues of XX in decreasing order. The notation Λn(θ,η)=i=1n(pθ,η,i/p0,i)(Yi)\Lambda_{n}(\theta,\eta)=\prod_{i=1}^{n}({p_{\theta,\eta,i}}/{p_{0,i}})(Y_{i}) stands for the likelihood ratio of pθ,ηp_{\theta,\eta} and p0p_{0}. Let 𝔼θ,η\mathbb{E}_{\theta,\eta} denote the expectation operator with the density pθ,ηp_{\theta,\eta} and let 0\mathbb{P}_{0} denote the probability operator with the true density. For two densities ff and gg, let K(f,g)=flog(f/g)K(f,g)=\int f\log(f/g) and V(f,g)=f|log(f/g)K(f,g)|2V(f,g)=\int f|\log(f/g)-K(f,g)|^{2} stand for the Kullback-Leibler divergence and variation, respectively. Using constants ρ¯0,ρ¯0>0\underline{\rho}_{0},\overline{\rho}_{0}>0, we rewrite (C4) as ρ¯0miniρmin(Δη0,i)maxiρmax(Δη0,i)ρ¯0\underline{\rho}_{0}\leq\min_{i}\rho_{\min}(\Delta_{\eta_{0},i})\leq\max_{i}\rho_{\max}(\Delta_{\eta_{0},i})\leq\overline{\rho}_{0} for clarity.

A.1 Proof of Theorem 1

We first state a lemma showing that the denominator of the posterior distribution is bounded below by a factor with probability tending to one, which will be used to prove the main theorems.

Lemma 1.

Suppose that (C1)(C4) are satisfied. Then there exists a constant K0K_{0} such that

0(Θ×Λn(θ,η)dΠ(θ,η)πp(s0)eK0(s0logp+nϵ¯n2))1.\displaystyle\begin{split}{\mathbb{P}}_{0}\bigg{(}&\int_{\Theta\times{\cal H}}\Lambda_{n}(\theta,\eta)d\Pi(\theta,\eta)\geq\pi_{p}(s_{0})e^{-K_{0}(s_{0}\log p+n\bar{\epsilon}_{n}^{2})}\bigg{)}\rightarrow 1.\end{split} (19)
Proof.

We define the Kullback-Leibler-type neighborhood n={(θ,η)Θ×:i=1nK(p0,i,pθ,η,i)C1nϵ¯n2,i=1nV(p0,i,pθ,η,i)C1nϵ¯n2}{\cal B}_{n}=\{(\theta,\eta)\in\Theta\times{\cal H}:\sum_{i=1}^{n}K(p_{0,i},p_{\theta,\eta,i})\leq C_{1}n\bar{\epsilon}_{n}^{2},\sum_{i=1}^{n}V(p_{0,i},p_{\theta,\eta,i})\leq C_{1}n\bar{\epsilon}_{n}^{2}\} for a sufficiently large C1C_{1}. Then Lemma 10 of Ghosal and van der Vaart, [16] implies that for any C>0C>0,

0(nΛn(θ,η)dΠ(θ,η)e(1+C)C1nϵ¯n2Π(n))1C2C1nϵ¯n2.\displaystyle{\mathbb{P}}_{0}\left(\int_{{\cal B}_{n}}\Lambda_{n}(\theta,\eta)d\Pi(\theta,\eta)\leq e^{-(1+C)C_{1}n\bar{\epsilon}_{n}^{2}}\Pi({\cal B}_{n})\right)\leq\frac{1}{C^{2}C_{1}n\bar{\epsilon}_{n}^{2}}. (20)

Hence, it suffices to show that Π(n)\Pi({\cal B}_{n}) is bounded below as in the lemma. By Lemma 9, the Kullback-Leibler divergence and variation of the iith observation are given by

K(p0,i,pθ,η,i)\displaystyle K(p_{0,i},p_{\theta,\eta,i}) =12{k=1milogρi,kk=1mi(1ρi,k)\displaystyle=\frac{1}{2}\bigg{\{}-\sum_{k=1}^{m_{i}}\log\rho_{i,k}^{\ast}-\sum_{k=1}^{m_{i}}(1-\rho_{i,k}^{\ast})
+Δη,i1/2(Xi(θθ0)+ξη,iξη0,i)22},\displaystyle\qquad\quad+\lVert\Delta_{\eta,i}^{-1/2}(X_{i}(\theta-\theta_{0})+\xi_{\eta,i}-\xi_{\eta_{0},i})\rVert_{2}^{2}\bigg{\}},
V(p0,i,pθ,η,i)\displaystyle V(p_{0,i},p_{\theta,\eta,i}) =12k=1mi(1ρi,k)2+Δη0,i1/2Δη,i1(Xi(θθ0)+ξη,iξη0,i)22,\displaystyle=\frac{1}{2}\sum_{k=1}^{m_{i}}(1-\rho_{i,k}^{\ast})^{2}+\lVert\Delta_{\eta_{0},i}^{1/2}\Delta_{\eta,i}^{-1}(X_{i}(\theta-\theta_{0})+\xi_{\eta,i}-\xi_{\eta_{0},i})\rVert_{2}^{2},

where ρi,k,k=1,,mi,\rho_{i,k}^{\ast},~{}k=1,\dots,m_{i}, are the eigenvalues of Δη0,i1/2Δη,i1Δη0,i1/2\Delta_{\eta_{0},i}^{1/2}\Delta_{\eta,i}^{-1}\Delta_{\eta_{0},i}^{1/2}. For n,δ={1in:k=1mi(1ρi,k)2δ}{\cal I}_{n,\delta}=\{1\leq i\leq n:\sum_{k=1}^{m_{i}}(1-\rho_{i,k}^{\ast})^{2}\geq\delta\} with small δ>0\delta>0 and |n,δ||{\cal I}_{n,\delta}| the cardinality of n,δ{\cal I}_{n,\delta}, we see that on n{\cal B}_{n},

anϵ¯n2anni=1nk=1mi(1ρi,k)2anδ|n,δ|n+annin,δk=1mi(1ρi,k)2.\displaystyle\begin{split}a_{n}\bar{\epsilon}_{n}^{2}&\gtrsim\frac{a_{n}}{n}\sum_{i=1}^{n}\sum_{k=1}^{m_{i}}(1-\rho_{i,k}^{\ast})^{2}\geq\frac{a_{n}\delta|{\cal I}_{n,\delta}|}{n}+\frac{a_{n}}{n}\sum_{i\notin{\cal I}_{n,\delta}}\sum_{k=1}^{m_{i}}(1-\rho_{i,k}^{\ast})^{2}.\end{split} (21)

Since every in,δi\notin\mathcal{I}_{n,\delta} satisfies k=1mi(1ρi,k)2<δ\sum_{k=1}^{m_{i}}(1-\rho_{i,k}^{\ast})^{2}<\delta for small δ>0\delta>0, observe that

in,δk=1mi(1ρi,k)2\displaystyle\sum_{i\notin{\cal I}_{n,\delta}}\sum_{k=1}^{m_{i}}(1-\rho_{i,k}^{\ast})^{2} in,δk=1mi(11/ρi,k)21ρ¯02in,δΔη,iΔη0,iF2,\displaystyle\gtrsim\sum_{i\notin{\cal I}_{n,\delta}}\sum_{k=1}^{m_{i}}(1-1/\rho_{i,k}^{\ast})^{2}\geq\frac{1}{\overline{\rho}_{0}^{2}}\sum_{i\notin{\cal I}_{n,\delta}}\lVert\Delta_{\eta,i}-\Delta_{\eta_{0},i}\rVert_{\rm F}^{2},

where the first inequality follows by the relation |1x||1x1||1-x|\asymp|1-x^{-1}| as x1x\rightarrow 1 and the second inequality holds by (i) of Lemma 10 in Appendix. Since an|n,δ|/nanϵ¯n2a_{n}|{\cal I}_{n,\delta}|/n\lesssim a_{n}\bar{\epsilon}_{n}^{2} by (21), it follows using (5) that for some constants C2,C3>0C_{2},C_{3}>0,

annin,δΔη,iΔη0,iF2\displaystyle\frac{a_{n}}{n}\sum_{i\notin{\cal I}_{n,\delta}}\lVert\Delta_{\eta,i}-\Delta_{\eta_{0},i}\rVert_{\rm F}^{2} andB,n2(η,η0)an|n,δ|nmax1inΔη,iΔη0,iF2\displaystyle\geq a_{n}d_{B,n}^{2}(\eta,\eta_{0})-\frac{a_{n}|{\cal I}_{n,\delta}|}{n}\max_{1\leq i\leq n}\lVert\Delta_{\eta,i}-\Delta_{\eta_{0},i}\rVert_{\rm F}^{2}
(C2C3anϵ¯n2)max1inΔη,iΔη0,iF2en.\displaystyle\geq(C_{2}-C_{3}a_{n}\bar{\epsilon}_{n}^{2})\max_{1\leq i\leq n}\lVert\Delta_{\eta,i}-\Delta_{\eta_{0},i}\rVert_{\rm F}^{2}-e_{n}.

Combining this with (21), we conclude that anϵ¯n2+enmaxiΔη,iΔη0,iF2a_{n}\bar{\epsilon}_{n}^{2}+e_{n}\gtrsim\max_{i}\lVert\Delta_{\eta,i}-\Delta_{\eta_{0},i}\rVert_{\rm F}^{2} on n{\cal B}_{n}, which implies that maxi,k|1ρi,k|\max_{i,k}|1-\rho_{i,k}^{\ast}| is small for all sufficiently large nn, by (i) of Lemma 10 and the inequality |1x||1x1||1-x|\asymp|1-x^{-1}| as x1x\rightarrow 1. Hence, logρi,k\log\rho_{i,k}^{\ast} can be expanded in the powers of (1ρi,k)(1-\rho_{i,k}^{\ast}) to get logρi,k(1ρi,k)(1ρi,k)2/2-\log\rho_{i,k}^{\ast}-(1-\rho_{i,k}^{\ast})\sim(1-\rho_{i,k}^{\ast})^{2}/2 for every ii and kk. Furthermore, since maxi,k|1ρi,k|\max_{i,k}|1-\rho_{i,k}^{\ast}| is sufficiently small, we obtain that k=1mi(1ρi,k)2k=1mi(11/ρi,k)2Δη,iΔη0,iF2\sum_{k=1}^{m_{i}}(1-\rho_{i,k}^{\ast})^{2}\lesssim\sum_{k=1}^{m_{i}}(1-1/\rho_{i,k}^{\ast})^{2}\lesssim\lVert\Delta_{\eta,i}-\Delta_{\eta_{0},i}\rVert_{\rm F}^{2} by (i) of Lemma 10, and that Δη,i1spΔη0,i1/2Δη,i1Δη0,i1/2sp1\lVert\Delta_{\eta,i}^{-1}\rVert_{\rm sp}\lesssim\lVert\Delta_{\eta_{0},i}^{1/2}\Delta_{\eta,i}^{-1}\Delta_{\eta_{0},i}^{1/2}\rVert_{\rm sp}\lesssim 1 by the restriction on the eigenvalues of Δη0,i\Delta_{\eta_{0},i}. Combining these results, it follows that on n{\cal B}_{n}, both n1i=1nK(p0,i,pθ,η,i)n^{-1}\sum_{i=1}^{n}K(p_{0,i},p_{\theta,\eta,i}) and n1i=1nV(p0,i,pθ,η,i)n^{-1}\sum_{i=1}^{n}V(p_{0,i},p_{\theta,\eta,i}) are bounded above by a constant multiple of n1X(θθ0)22+dn2(η,η0)n^{-1}\lVert X(\theta-\theta_{0})\rVert_{2}^{2}+d_{n}^{2}(\eta,\eta_{0}). Hence, C1C_{1} can be chosen sufficiently large such that

Π(n)Π{(θ,η)Θ×:n1X2θθ012+dn2(η,η0)2ϵ¯n2}Π{θΘ:n1X2θθ012ϵ¯n2}Π{η:dn2(η,η0)ϵ¯n2},\displaystyle\begin{split}\Pi({\cal B}_{n})&\geq\Pi\left\{(\theta,\eta)\in\Theta\times{\cal H}:n^{-1}\lVert X\rVert_{\ast}^{2}\lVert\theta-\theta_{0}\rVert_{1}^{2}+d_{n}^{2}(\eta,\eta_{0})\leq 2\bar{\epsilon}_{n}^{2}\right\}\\ &\geq\Pi\left\{\theta\in\Theta:n^{-1}\lVert X\rVert_{\ast}^{2}\lVert\theta-\theta_{0}\rVert_{1}^{2}\leq\bar{\epsilon}_{n}^{2}\right\}\Pi\left\{\eta\in{\cal H}:d_{n}^{2}(\eta,\eta_{0})\leq\bar{\epsilon}_{n}^{2}\right\},\end{split} (22)

by the inequality Xθ2j=1p|θj|Xj2Xθ1\lVert X\theta\rVert_{2}\leq\sum_{j=1}^{p}\lvert\theta_{j}\rvert\lVert X_{\cdot j}\rVert_{2}\leq\lVert X\rVert_{\ast}\lVert\theta\rVert_{1}. The logarithm of the second term on the rightmost side is bounded below by a constant multiple of nϵ¯n2-n\bar{\epsilon}_{n}^{2} by (C2). To find the lower bound for the first term, we shall first work with the case s01s_{0}\geq 1, and then show that the same lower bound is obtained even when s0=0s_{0}=0.

Now, assume that s01s_{0}\geq 1 and let Θ0,n={θS0s0:n1/2XθS0θ0,S01ϵ}\Theta_{0,n}=\{\theta_{S_{0}}\in\mathbb{R}^{s_{0}}:n^{-1/2}\lVert X\rVert_{\ast}\lVert\theta_{S_{0}}-\theta_{0,S_{0}}\rVert_{1}\leq\epsilon\} for ϵ>0\epsilon>0 to be chosen later. Then

Π{θΘ:n1/2Xθθ01ϵ}πp(s0)(ps0)Θ0,ngS0(θS0)dθS0πp(s0)(ps0)eλθ01Θ0,ngS0(θS0θ0,S0)dθS0\displaystyle\begin{split}&\Pi\{\theta\in\Theta:n^{-1/2}\lVert X\rVert_{\ast}\lVert\theta-\theta_{0}\rVert_{1}\leq\epsilon\}\\ &\quad\geq\frac{\pi_{p}(s_{0})}{\binom{p}{s_{0}}}\int_{\Theta_{0,n}}g_{S_{0}}(\theta_{S_{0}})d\theta_{S_{0}}\\ &\quad\geq\frac{\pi_{p}(s_{0})}{\binom{p}{s_{0}}}e^{-\lambda\lVert\theta_{0}\rVert_{1}}\int_{\Theta_{0,n}}g_{S_{0}}(\theta_{S_{0}}-\theta_{0,S_{0}})d\theta_{S_{0}}\end{split} (23)

by the inequality gS0(θS0)eλθ01gS0(θS0θ0,S0)g_{S_{0}}(\theta_{S_{0}})\geq e^{-\lambda\lVert\theta_{0}\rVert_{1}}g_{S_{0}}(\theta_{S_{0}}-\theta_{0,S_{0}}). Using the relation (6.2) of Castillo et al., [8] and the assumption on the prior in (4), the integral on the rightmost side satisfies

Θ0,ngS0(θS0θ0,S0)dθS0eλϵn/X(λϵn/X)s0s0!eL3ϵ(ϵn/L1pL2)s0s0!,\displaystyle\begin{split}\int_{\Theta_{0,n}}g_{S_{0}}(\theta_{S_{0}}-\theta_{0,S_{0}})d\theta_{S_{0}}&\geq e^{-\lambda\epsilon\sqrt{n}/\lVert X\rVert_{\ast}}\frac{(\lambda\epsilon\sqrt{n}/\lVert X\rVert_{\ast})^{s_{0}}}{s_{0}!}\\ &\geq e^{-L_{3}\epsilon}\frac{(\epsilon\sqrt{n}/L_{1}p^{L_{2}})^{s_{0}}}{s_{0}!},\end{split} (24)

for s0>0s_{0}>0, and thus the rightmost side of (23) is bounded below by

πp(s0)(ϵn)s0exp{λθ01L3ϵ(L1+1)s0logps0logL1},\displaystyle\pi_{p}(s_{0})(\epsilon\sqrt{n})^{s_{0}}\exp\left\{-\lambda\lVert\theta_{0}\rVert_{1}-L_{3}\epsilon-(L_{1}+1)s_{0}\log p-s_{0}\log L_{1}\right\},

by the inequality (ps0)s0!ps0\binom{p}{s_{0}}s_{0}!\leq p^{s_{0}}. Choosing ϵ=ϵ¯n\epsilon=\bar{\epsilon}_{n}, the first term on the rightmost side of (22) satisfies

Π{θΘ:n1X2θθ012ϵ¯n2}\displaystyle\Pi\left\{\theta\in\Theta:n^{-1}\lVert X\rVert_{\ast}^{2}\lVert\theta-\theta_{0}\rVert_{1}^{2}\leq\bar{\epsilon}_{n}^{2}\right\}
πp(s0)(nϵ¯n2)s0/2exp{λθ01L3ϵ¯n(L1+1)s0logps0logL1}.\displaystyle\quad\geq\pi_{p}(s_{0})(n\bar{\epsilon}_{n}^{2})^{s_{0}/2}\exp\left\{-\lambda\lVert\theta_{0}\rVert_{1}-L_{3}\bar{\epsilon}_{n}-(L_{1}+1)s_{0}\log p-s_{0}\log L_{1}\right\}.

Note that nϵ¯n2>1n\bar{\epsilon}_{n}^{2}>1 and s0+ϵ¯n+s0logps0logps_{0}+\bar{\epsilon}_{n}+s_{0}\log p\lesssim s_{0}\log p if s0>0s_{0}>0, and thus the last display implies that there exists a constant C4>0C_{4}>0 such that

Π(n)πp(s0)exp{C4(λθ01+s0logp+nϵ¯n2)}.\displaystyle\Pi({\cal B}_{n})\geq\pi_{p}(s_{0})\exp\left\{-C_{4}(\lambda\lVert\theta_{0}\rVert_{1}+s_{0}\log p+n\bar{\epsilon}_{n}^{2})\right\}.

If s0=0s_{0}=0, the first term of (22) is clearly bounded below by πp(0)\pi_{p}(0), so that the same lower bound for Π(n)\Pi({\cal B}_{n}) in the last display is also obtained since we have λθ01+s0logp=0\lambda\lVert\theta_{0}\rVert_{1}+s_{0}\log p=0. Finally, the lemma follows from (20). ∎
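The small-ball prior bound (24) rests on relation (6.2) of Castillo et al. for the 1\ell_{1}-ball mass of a product Laplace density. A purely illustrative numerical check (our own, not part of the argument): for s=1s=1 the relation reduces to the elementary inequality 1euueu1-e^{-u}\geq ue^{-u}, and for s=2s=2 it can be verified by Monte Carlo.

```python
import numpy as np

# s = 1: the Laplace mass of {|x| <= t} is 1 - exp(-u) with u = lambda * t,
# and (6.2) asserts it is at least exp(-u) * u^1 / 1!.
u = np.linspace(1e-6, 50.0, 100_000)
gap = (1.0 - np.exp(-u)) - u * np.exp(-u)   # exact mass minus the lower bound

# s = 2: the Laplace mass of the l1-ball of radius t should be at least
# exp(-lam * t) * (lam * t)^2 / 2!.  Monte Carlo estimate with 2 * 10^5 draws.
rng = np.random.default_rng(1)
lam, t = 1.0, 2.0
X = rng.laplace(scale=1.0 / lam, size=(200_000, 2))
p_hat = np.mean(np.abs(X).sum(axis=1) <= t)
bound = np.exp(-lam * t) * (lam * t) ** 2 / 2.0
```

In both cases the exact (or estimated) ball mass dominates the lower bound, consistent with the use of (6.2) in the proof.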

Proof of Theorem 1.

For the set ={(θ,η):sθ>s¯}{\cal B}=\{(\theta,\eta):s_{\theta}>\bar{s}\} with any integer s¯s0\bar{s}\geq s_{0}, we see that Π()\Pi({\cal B}) is equal to

s=s¯+1pπp(s)πp(s0)s=s¯+1p(A2pA4)ss0πp(s0)(A2pA4)s¯+1s0j=0(A2pA4)j.\displaystyle\sum_{s=\bar{s}+1}^{p}\pi_{p}(s)\leq\pi_{p}(s_{0})\sum_{s=\bar{s}+1}^{p}\left(\frac{A_{2}}{p^{A_{4}}}\right)^{s-s_{0}}\leq\pi_{p}(s_{0})\left(\frac{A_{2}}{p^{A_{4}}}\right)^{\bar{s}+1-s_{0}}\sum_{j=0}^{\infty}\left(\frac{A_{2}}{p^{A_{4}}}\right)^{j}.

Let n{\cal E}_{n} be the event in (19). Since Λn(θ,η)\Lambda_{n}(\theta,\eta) is nonnegative, by Fubini’s theorem and Lemma 1,

𝔼0Π(|Y(n))𝟙n=𝔼0[Λn(θ,η)dΠ(θ,η)Λn(θ,η)dΠ(θ,η)𝟙n]πp(s0)1exp{C1(s0logp+nϵ¯n2)}Π()exp{(s¯+1s0)(logA2A4logp)+2C1slogp},\displaystyle\begin{split}{\mathbb{E}}_{0}\Pi({\cal B}\,|\,Y^{(n)})\mathbbm{1}_{{\cal E}_{n}}&={\mathbb{E}}_{0}\left[\frac{\int_{{\cal B}}\Lambda_{n}(\theta,\eta)d\Pi(\theta,\eta)}{\int\Lambda_{n}(\theta,\eta)d\Pi(\theta,\eta)}\mathbbm{1}_{{\cal E}_{n}}\right]\\ &\leq\pi_{p}(s_{0})^{-1}\exp\{C_{1}(s_{0}\log p+n\bar{\epsilon}_{n}^{2})\}\Pi({\cal B})\\ &\lesssim\exp\left\{(\bar{s}+1-s_{0})(\log A_{2}-A_{4}\log p)+2C_{1}s_{\star}\log p\right\},\end{split} (25)

for some constant C1C_{1} and sufficiently large pp. For a sufficiently large constant C2C_{2}, choose s¯\bar{s} to be the largest integer smaller than C2sC_{2}s_{\star}. Replacing s¯+1\bar{s}+1 by C2sC_{2}s_{\star} in the last display, it is easy to see that the rightmost side goes to zero. The proof is complete since 0(nc)0{\mathbb{P}}_{0}({\cal E}_{n}^{c})\rightarrow 0 by Lemma 1. ∎

A.2 Proof of Theorems 2–3 and Corollary 1

The following lemma shows that a small piece of the alternative centered at any (θ1,η1)Θ×(\theta_{1},\eta_{1})\in\Theta\times{\cal H} is locally testable with exponentially small errors, provided that the center is sufficiently separated from the truth with respect to the average Rényi divergence. Theorem 2 for posterior contraction relative to the average Rényi divergence will then be proved by showing that the number of such pieces is controlled by the target rate. We write p1p_{1} for the density with (θ1,η1)(\theta_{1},\eta_{1}), and 𝔼1\mathbb{E}_{1} and 1\mathbb{P}_{1} for the expectation and probability with p1p_{1}, respectively.

Lemma 2.

For a given sequence γn>0\gamma_{n}^{\prime}>0, a sequence ana_{n} satisfying (C1), and a given (θ1,η1)Θ×(\theta_{1},\eta_{1})\in\Theta\times{\cal H} such that Rn(p0,p1)δn2R_{n}(p_{0},p_{1})\geq\delta_{n}^{2} with δn=o(m¯)\delta_{n}=o(\sqrt{\overline{m}}), define

1,n={(θ,η)Θ×:1ni=1nXi(θθ1)+ξη,iξη1,i22δn216γn,dB,n(η,η1)δn22m¯γnan,max1inΔη,i1spγn}.\displaystyle\begin{split}{\cal F}_{1,n}=\bigg{\{}(\theta,\eta)\in\Theta\times{\cal H}\,:\,\frac{1}{n}\sum_{i=1}^{n}\lVert X_{i}(\theta-\theta_{1})+\xi_{\eta,i}-\xi_{\eta_{1},i}\rVert_{2}^{2}\leq\frac{\delta_{n}^{2}}{16\gamma_{n}^{\prime}},\\ d_{B,n}(\eta,\eta_{1})\leq\frac{\delta_{n}^{2}}{2\overline{m}\gamma_{n}^{\prime}\sqrt{a_{n}}},\,\max_{1\leq i\leq n}\lVert\Delta_{\eta,i}^{-1}\rVert_{\rm sp}\leq\gamma_{n}^{\prime}\bigg{\}}.\end{split} (26)

Then under (C1), there exists a test φ¯n\bar{\varphi}_{n} such that

𝔼0φ¯nenδn2,sup(θ,η)1,n𝔼θ,η(1φ¯n)enδn2/16.\displaystyle\mathbb{E}_{0}\bar{\varphi}_{n}\leq e^{-n\delta_{n}^{2}},\qquad\sup_{(\theta,\eta)\in{\cal F}_{1,n}}\mathbb{E}_{\theta,\eta}(1-\bar{\varphi}_{n})\leq e^{-n\delta_{n}^{2}/16}.
Proof.

For given (θ1,η1)Θ×(\theta_{1},\eta_{1})\in\Theta\times{\cal H} such that Rn(p0,p1)δn2R_{n}(p_{0},p_{1})\geq\delta_{n}^{2}, consider the most powerful test φ¯n=𝟙{Λn(θ1,η1)1}\bar{\varphi}_{n}=\mathbbm{1}_{\{\Lambda_{n}(\theta_{1},\eta_{1})\geq 1\}} given by the Neyman-Pearson lemma. It is then easy to see that

𝔼0φ¯n=0(Λn(θ1,η1)1)p0p1enδn2,𝔼1(1φ¯n)=1(Λn(θ1,η1)1)p0p1enδn2.\displaystyle\begin{split}\mathbb{E}_{0}\bar{\varphi}_{n}&=\mathbb{P}_{0}\left(\sqrt{\Lambda_{n}(\theta_{1},\eta_{1})}\geq 1\right)\leq\int\sqrt{p_{0}p_{1}}\leq e^{-n\delta_{n}^{2}},\\ \mathbb{E}_{1}(1-\bar{\varphi}_{n})&=\mathbb{P}_{1}\left(\sqrt{\Lambda_{n}(\theta_{1},\eta_{1})}\leq 1\right)\leq\int\sqrt{p_{0}p_{1}}\leq e^{-n\delta_{n}^{2}}.\end{split} (27)

The first inequality of the lemma is a direct consequence of the first line of the preceding display. For the second inequality of the lemma, note that by the Cauchy-Schwarz inequality, we have

{𝔼θ,η(1φ¯n)}2𝔼1(1φ¯n)𝔼1((pθ,η/p1)(Y(n)))2.\displaystyle\left\{\mathbb{E}_{\theta,\eta}(1-\bar{\varphi}_{n})\right\}^{2}\leq\mathbb{E}_{1}(1-\bar{\varphi}_{n})\;\mathbb{E}_{1}(({p_{\theta,\eta}}/{p_{1}})(Y^{(n)}))^{2}.

Thus, by the second line of (27), it suffices to show 𝔼1((pθ,η/p1)(Y(n)))2e7nδn2/8\mathbb{E}_{1}(({p_{\theta,\eta}}/{p_{1}})(Y^{(n)}))^{2}\leq e^{7n\delta_{n}^{2}/8} for every (θ,η)1,n(\theta,\eta)\in{\cal F}_{1,n}. Defining Δη,i=Δη,i1/2Δη1,iΔη,i1/2\Delta_{\eta,i}^{\ast}=\Delta_{\eta,i}^{-1/2}\Delta_{\eta_{1},i}\Delta_{\eta,i}^{-1/2}, observe that

max1inΔη,iIsp\displaystyle\max_{1\leq i\leq n}\lVert\Delta_{\eta,i}^{\ast}-I\rVert_{\rm sp} max1inΔη,i1spΔη,iΔη1,isp\displaystyle\leq\max_{1\leq i\leq n}\lVert\Delta_{\eta,i}^{-1}\rVert_{\rm sp}\lVert\Delta_{\eta,i}-\Delta_{\eta_{1},i}\rVert_{\rm sp}
max1inΔη,i1spandB,n(η,η1)δn22m¯,\displaystyle\leq\max_{1\leq i\leq n}\lVert\Delta_{\eta,i}^{-1}\rVert_{\rm sp}\sqrt{a_{n}}d_{B,n}(\eta,\eta_{1})\leq\frac{\delta_{n}^{2}}{2\overline{m}},

on the set 1,n{\cal F}_{1,n}, where the second inequality is due to (C1). Since the leftmost side of the display is further bounded below by maxi|ρk(Δη,i)1|\max_{i}|\rho_{k}(\Delta_{\eta,i}^{\ast})-1| for every kmik\leq m_{i}, we have that

1δn22m¯min1inρmin(Δη,i)max1inρmax(Δη,i)1+δn22m¯.\displaystyle 1-\frac{\delta_{n}^{2}}{2\overline{m}}\leq\min_{1\leq i\leq n}\rho_{\min}(\Delta_{\eta,i}^{\ast})\leq\max_{1\leq i\leq n}\rho_{\max}(\Delta_{\eta,i}^{\ast})\leq 1+\frac{\delta_{n}^{2}}{2\overline{m}}. (28)

Since δn2/m¯0\delta_{n}^{2}/\overline{m}\rightarrow 0 and ρk(2Δη,iI)=2ρk(Δη,i)1\rho_{k}(2\Delta_{\eta,i}^{\ast}-I)=2\rho_{k}(\Delta_{\eta,i}^{\ast})-1 for every kmik\leq m_{i}, (28) implies that 2Δη,iI2\Delta_{\eta,i}^{\ast}-I is nonsingular for every ini\leq n, and hence on 1,n{\cal F}_{1,n}, it can be shown that 𝔼1((pθ,η/p1)(Y(n)))2\mathbb{E}_{1}(({p_{\theta,\eta}}/{p_{1}})(Y^{(n)}))^{2} equals

i=1n{det(Δη,i)1/2det(2IΔη,i1)1/2}×exp{i=1n(2Δη,iI)1/2Δη,i1/2(Xi(θθ1)+ξη,iξη1,i)22}.\displaystyle\begin{split}&\prod_{i=1}^{n}\left\{\det(\Delta_{\eta,i}^{\ast})^{1/2}\det(2I-{\Delta_{\eta,i}^{\ast-1}})^{-1/2}\right\}\\ &\times\exp\Bigg{\{}\sum_{i=1}^{n}\lVert(2\Delta_{\eta,i}^{\ast}-I)^{-1/2}\Delta_{\eta,i}^{-1/2}(X_{i}(\theta-\theta_{1})+\xi_{\eta,i}-\xi_{\eta_{1},i})\rVert_{2}^{2}\Bigg{\}}.\end{split} (29)

To bound this, note that det(Δη,i)1/2det(2IΔη,i1)1/2\det(\Delta_{\eta,i}^{\ast})^{1/2}\det(2I-{\Delta_{\eta,i}^{\ast-1}})^{-1/2} is equal to

k=1mi{ρk(Δη,i)2ρk1(Δη,i)}1/2(1δn4/4m¯21δn2/m¯)mi/2(1+3δn22m¯)mi/2e3δn2/4,\displaystyle\begin{split}\prod_{k=1}^{m_{i}}\left\{\frac{\rho_{k}(\Delta_{\eta,i}^{\ast})}{2-\rho_{k}^{-1}(\Delta_{\eta,i}^{\ast})}\right\}^{1/2}\leq\left(\frac{1-\delta_{n}^{4}/4\overline{m}^{2}}{1-\delta_{n}^{2}/\overline{m}}\right)^{m_{i}/2}\leq\left(1+\frac{3\delta_{n}^{2}}{2\overline{m}}\right)^{m_{i}/2}\leq e^{3\delta_{n}^{2}/4},\end{split} (30)

where the first inequality holds by (28), the second inequality holds by the inequality (1x2)/(12x)1+3x(1-x^{2})/(1-2x)\leq 1+3x for small x>0x>0, and the last inequality holds by the inequality x+1exx+1\leq e^{x}. Now, for every (θ,η)1,n(\theta,\eta)\in{\cal F}_{1,n}, observe that the exponent in (29) is bounded above by

max1in(2Δη,iI)1spmax1inΔη,i1spi=1nXi(θθ1)+ξη,iξη1,i22nδn28,\displaystyle\max_{1\leq i\leq n}\lVert(2\Delta_{\eta,i}^{\ast}-I)^{-1}\rVert_{\rm sp}\max_{1\leq i\leq n}\lVert\Delta_{\eta,i}^{-1}\rVert_{\rm sp}\sum_{i=1}^{n}\lVert X_{i}(\theta-\theta_{1})+\xi_{\eta,i}-\xi_{\eta_{1},i}\rVert_{2}^{2}\leq\frac{n\delta_{n}^{2}}{8},

since maxi(2Δη,iI)1sp2\max_{i}\lVert(2\Delta_{\eta,i}^{\ast}-I)^{-1}\rVert_{\rm sp}\leq 2 for large nn. Combined with (29) and (30), the display completes the proof. ∎
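The two error bounds in (27) follow from the Hellinger affinity p0p1enδn2\int\sqrt{p_{0}p_{1}}\leq e^{-n\delta_{n}^{2}}. As a toy illustration (our own simulation, not part of the proof): for nn i.i.d. observations from N(0,1)N(0,1) versus N(μ,1)N(\mu,1), the affinity equals enμ2/8e^{-n\mu^{2}/8}, and both error probabilities of the Neyman-Pearson test φ¯n=𝟙{Λn1}\bar{\varphi}_{n}=\mathbbm{1}_{\{\Lambda_{n}\geq 1\}} fall below this bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, reps = 20, 1.0, 50_000
# Affinity bound: int sqrt(p0 p1) = exp(-n mu^2 / 8) for N(0,1) vs N(mu,1).
bound = np.exp(-n * mu**2 / 8)

def log_lr(y):
    # Log likelihood ratio of N(mu,1) to N(0,1) for an iid sample y (rows).
    return mu * y.sum(axis=1) - n * mu**2 / 2

y0 = rng.normal(0.0, 1.0, size=(reps, n))  # data generated under p0
y1 = rng.normal(mu, 1.0, size=(reps, n))   # data generated under p1
type1 = np.mean(log_lr(y0) >= 0.0)         # E0[phi],    phi = 1{Lambda >= 1}
type2 = np.mean(log_lr(y1) < 0.0)          # E1[1 - phi]
```

Both simulated error rates are well below the affinity bound, matching the first two inequalities of (27) for this Gaussian pair.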

Proof of Theorem 2.

Let Θn={θΘ:sθK1s}\Theta_{n}=\left\{\theta\in\Theta:s_{\theta}\leq K_{1}s_{\star}\right\} and Rn(θ,η)=Rn(pθ,η,p0)R_{n}^{\star}(\theta,\eta)=R_{n}(p_{\theta,\eta},p_{0}). Then for every ϵ>0\epsilon>0,

𝔼0Π((θ,η)Θ×:Rn(θ,η)>ϵ|Y(n))𝔼0Π((θ,η)Θn×:Rn(θ,η)>ϵ|Y(n))+𝔼0Π(Θnc|Y(n)),\displaystyle\begin{split}&{\mathbb{E}}_{0}\Pi\left((\theta,\eta)\in\Theta\times{\cal H}:\sqrt{R_{n}^{\star}(\theta,\eta)}>\epsilon\,|\,Y^{(n)}\right)\\ &\quad\leq{\mathbb{E}}_{0}\Pi\left((\theta,\eta)\in\Theta_{n}\times{\cal H}:\sqrt{R_{n}^{\star}(\theta,\eta)}>\epsilon\,|\,Y^{(n)}\right)+{\mathbb{E}}_{0}\Pi\left(\Theta_{n}^{c}\,|\,Y^{(n)}\right),\end{split} (31)

where the second term on the right hand side goes to zero by Theorem 1. Hence, it suffices to show that the first term goes to zero for ϵ>0\epsilon>0 chosen to be the threshold in the theorem. Now, let Θn={θΘ:sθK1s,θpL2+2/X}\Theta_{n}^{\ast}=\{\theta\in\Theta:s_{\theta}\leq K_{1}s_{\star},\lVert\theta\rVert_{\infty}\leq p^{L_{2}+2}/\lVert X\rVert_{\ast}\} and define 1,n{\cal F}_{1,n} as in (26) with γn=γn\gamma_{n}^{\prime}=\gamma_{n} and δn=ϵn\delta_{n}=\epsilon_{n}. Then Lemma 2 implies that small pieces of the alternative densities can be tested with exponentially small errors as long as the center is ϵn\epsilon_{n}-separated from the true parameter values relative to the average Rényi divergence. To complete the proof, we shall show that the minimal number NnN_{n}^{\ast} of the small pieces needed to cover Θn×n\Theta_{n}^{\ast}\times{\cal H}_{n} is controlled appropriately in terms of ϵn\epsilon_{n}, and that the prior mass of ΘnΘn\Theta_{n}\setminus\Theta_{n}^{\ast} and n{\cal H}\setminus{\cal H}_{n} decreases fast enough to balance the denominator of the posterior distribution. (For more discussion on the construction of tests using metric entropies, see Sections D.2 and D.3 of Ghosal and van der Vaart, [17].)

Note that for every θ,θΘ\theta,\theta^{\prime}\in\Theta and η,η\eta,\eta^{\prime}\in{\cal H},

1ni=1nXi(θθ)+ξη,iξη,i222{p2nX2θθ2+dA,n2(η,η)},\displaystyle\frac{1}{n}\sum_{i=1}^{n}\lVert X_{i}(\theta-\theta^{\prime})+\xi_{\eta,i}-\xi_{\eta^{\prime},i}\rVert_{2}^{2}\leq 2\left\{\frac{p^{2}}{n}\lVert X\rVert_{\ast}^{2}\lVert\theta-\theta^{\prime}\rVert_{\infty}^{2}+d_{A,n}^{2}(\eta,\eta^{\prime})\right\},

by the inequality X(θθ)2Xθθ1pXθθ\lVert X(\theta-\theta^{\prime})\rVert_{2}\leq\lVert X\rVert_{\ast}\lVert\theta-\theta^{\prime}\rVert_{1}\leq p\lVert X\rVert_{\ast}\lVert\theta-\theta^{\prime}\rVert_{\infty} and the Cauchy-Schwarz inequality. Since an<na_{n}<n and ϵn2>n1\epsilon_{n}^{2}>n^{-1}, it is easy to see that we have 1,n1,n{\cal F}_{1,n}\supset{\cal F}_{1,n}^{\prime} for

1,n={(θ,η)Θ×:\displaystyle{\cal F}_{1,n}^{\prime}=\bigg{\{}(\theta,\eta)\in\Theta\times{\cal H}:\, p2nX2θθ12+dn2(η,η1)132m¯2γn2n3,\displaystyle\frac{p^{2}}{n}\lVert X\rVert_{\ast}^{2}\lVert\theta-\theta_{1}\rVert_{\infty}^{2}+d_{n}^{2}(\eta,\eta_{1})\leq\frac{1}{32\overline{m}^{2}\gamma_{n}^{2}n^{3}},
max1inΔη,i1spγn},\displaystyle\max_{1\leq i\leq n}\lVert\Delta_{\eta,i}^{-1}\rVert_{\rm sp}\leq\gamma_{n}\bigg{\}},

with the same (θ1,η1)(\theta_{1},\eta_{1}) used to define 1,n{\cal F}_{1,n}. Hence, logNn\log N_{n}^{\ast} is bounded above by

logN(16m¯γnnpX,Θn,)+logN(16m¯γnn3/2,n,dn).\displaystyle\log N\left(\frac{1}{6\overline{m}\gamma_{n}np\lVert X\rVert_{\ast}},\Theta_{n}^{\ast},\lVert\cdot\rVert_{\infty}\right)+\log N\left(\frac{1}{6\overline{m}\gamma_{n}n^{3/2}},{\cal H}_{n},d_{n}\right). (32)

Note that for any small δ>0\delta>0,

N(δ,Θn,)(pK1s)(3pL2+2δX)K1s(3pL2+3δX)K1s,\displaystyle N(\delta,\Theta_{n}^{\ast},\lVert\cdot\rVert_{\infty})\leq\binom{p}{\lfloor K_{1}s_{\star}\rfloor}\left(\frac{3p^{L_{2}+2}}{\delta\lVert X\rVert_{\ast}}\right)^{\lfloor K_{1}s_{\star}\rfloor}\leq\left(\frac{3p^{L_{2}+3}}{\delta\lVert X\rVert_{\ast}}\right)^{K_{1}s_{\star}},

and thus we obtain

logN(16m¯γnnpX,Θn,)\displaystyle\log N\left(\frac{1}{6\overline{m}\gamma_{n}np\lVert X\rVert_{\ast}},\Theta_{n}^{\ast},\lVert\cdot\rVert_{\infty}\right) s(logm¯+logγn+logp)nϵn2.\displaystyle\lesssim s_{\star}(\log\overline{m}+\log\gamma_{n}+\log p)\lesssim n\epsilon_{n}^{2}.

Using the last display and the entropy condition (7), the right hand side of (32) is bounded above by a constant multiple of nϵn2n\epsilon_{n}^{2}. Hence, by Lemma D.3 of Ghosal and van der Vaart, [17], for every ϵ>ϵn\epsilon>\epsilon_{n}, there exists a test φn\varphi_{n} such that for some C1>0C_{1}>0, 𝔼0φn2exp(C1nϵn2nϵ2){\mathbb{E}}_{0}\varphi_{n}\leq 2\exp(C_{1}n\epsilon_{n}^{2}-n\epsilon^{2}) and 𝔼θ,η(1φn)exp(nϵ2/16){\mathbb{E}}_{\theta,\eta}(1-\varphi_{n})\leq\exp(-n\epsilon^{2}/16) for every (θ,η)Θn×n(\theta,\eta)\in\Theta_{n}^{\ast}\times{\cal H}_{n} such that Rn(θ,η)>ϵ\sqrt{R_{n}^{\star}(\theta,\eta)}>\epsilon. Note that under condition (3) on the prior distribution, we have logπp(s0)s0logplogπp(0)slogp-\log\pi_{p}(s_{0})\lesssim s_{0}\log p-\log\pi_{p}(0)\lesssim s_{\star}\log p since πp(0)\pi_{p}(0) is bounded away from zero. Hence, for n{\cal E}_{n} the event in (19) and some constant C2>0C_{2}>0, the first term on the right hand side of (31) is bounded by

𝔼0Π((θ,η)Θn×:Rn(θ,η)>ϵ|Y(n))𝟙n(1φn)+𝔼0(φn+𝟙nc)\displaystyle{\mathbb{E}}_{0}\Pi\left((\theta,\eta)\in\Theta_{n}\times{\cal H}:\sqrt{R_{n}^{\star}(\theta,\eta)}>\epsilon\,|\,Y^{(n)}\right)\mathbbm{1}_{{\cal E}_{n}}(1-\varphi_{n})+{\mathbb{E}}_{0}(\varphi_{n}+\mathbbm{1}_{{\cal E}_{n}^{c}})
{sup(θ,η)Θn×n:Rn(θ,η)>ϵ2𝔼θ,η(1φn)+Π(ΘnΘn)+Π(n)}eC2slogp\displaystyle~{}~{}\leq\Bigg{\{}\!\sup_{(\theta,\eta)\in\Theta_{n}^{\ast}\times{\cal H}_{n}:R_{n}^{\star}(\theta,\eta)>\epsilon^{2}}\!{\mathbb{E}}_{\theta,\eta}(1-\varphi_{n})+\Pi(\Theta_{n}\!\setminus\!\Theta_{n}^{\ast})+\Pi({\cal H}\!\setminus\!{\cal H}_{n})\Bigg{\}}e^{C_{2}s_{\star}\log p}
+𝔼0φn+0nc,\displaystyle\qquad+{\mathbb{E}}_{0}\varphi_{n}+{\mathbb{P}}_{0}{\cal E}_{n}^{c},

where the term 0nc{\mathbb{P}}_{0}{\cal E}_{n}^{c} converges to zero by Lemma 1. Choosing ϵ=C3ϵn\epsilon=C_{3}\epsilon_{n} for a sufficiently large C3C_{3}, we have

𝔼0φn0,sup(θ,η)Θn×n:Rn(θ,η)>ϵ2𝔼θ,η(1φn)eC2slogp0.\displaystyle{\mathbb{E}}_{0}\varphi_{n}\rightarrow 0,\quad\sup_{(\theta,\eta)\in\Theta_{n}^{\ast}\times{\cal H}_{n}:R_{n}^{\star}(\theta,\eta)>\epsilon^{2}}{\mathbb{E}}_{\theta,\eta}(1-\varphi_{n})e^{C_{2}s_{\star}\log p}\rightarrow 0.

Furthermore, Π(n)eC2slogp\Pi({\cal H}\setminus{\cal H}_{n})e^{C_{2}s_{\star}\log p} goes to zero by condition (8). Now, to show that Π(ΘnΘn)\Pi(\Theta_{n}\setminus\Theta_{n}^{\ast}) goes to zero exponentially fast, observe that

Π(ΘnΘn)\displaystyle\Pi(\Theta_{n}\setminus\Theta_{n}^{\ast}) =Π{θΘ:sθK1s,θ>pL2+2/X}\displaystyle=\Pi\left\{\theta\in\Theta:s_{\theta}\leq K_{1}s_{\star},\lVert\theta\rVert_{\infty}>p^{L_{2}+2}/\lVert X\rVert_{\ast}\right\}
=S:sK1sπp(s)(ps){θS:θS>pL2+2/X}gS(θS)dθS\displaystyle=\sum_{S:s\leq K_{1}s_{\star}}\frac{\pi_{p}(s)}{\binom{p}{s}}\int_{\{\theta_{S}:\lVert\theta_{S}\rVert_{\infty}>p^{L_{2}+2}/\lVert X\rVert_{\ast}\}}g_{S}(\theta_{S})d\theta_{S}
S:sK1s(A2pA4)s(ps){θS:θS>pL2+2/X}gS(θS)dθS.\displaystyle\leq\sum_{S:s\leq K_{1}s_{\star}}\frac{(A_{2}p^{-A_{4}})^{s}}{\binom{p}{s}}\int_{\{\theta_{S}:\lVert\theta_{S}\rVert_{\infty}>p^{L_{2}+2}/\lVert X\rVert_{\ast}\}}g_{S}(\theta_{S})d\theta_{S}.

by the inequality πp(s)(A2pA4)sπp(0)\pi_{p}(s)\leq(A_{2}p^{-A_{4}})^{s}\pi_{p}(0) for every SS. Since the tail probability of the Laplace distribution is given by |x|>t21λeλ|x|dx=exp(λt)\int_{|x|>t}2^{-1}\lambda e^{-\lambda|x|}dx=\exp(-\lambda t) for every t>0t>0, the rightmost side of the last display is bounded above by a constant multiple of

s=1K1sseλpL2+2/X(A2pA4)sseλpL2+2/X.\displaystyle\sum_{s=1}^{K_{1}s_{\star}}se^{-\lambda p^{L_{2}+2}/\lVert X\rVert_{\ast}}\left(\frac{A_{2}}{p^{A_{4}}}\right)^{s}\lesssim s_{\star}e^{-\lambda p^{L_{2}+2}/\lVert X\rVert_{\ast}}.

Since λpL2+2/Xp2\lambda p^{L_{2}+2}/\lVert X\rVert_{\ast}\gtrsim p^{2} by (4), the right hand side is bounded by eC4p2e^{-C_{4}p^{2}} for some C4>0C_{4}>0, and thus Π(ΘnΘn)eC2slogp\Pi(\Theta_{n}\setminus\Theta_{n}^{\ast})e^{C_{2}s_{\star}\log p} goes to zero since slogp=o(p2)s_{\star}\log p=o(p^{2}). Finally, we conclude that the left hand side of (31) goes to zero with ϵ=C3ϵn\epsilon=C_{3}\epsilon_{n}. ∎
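The last step uses the closed-form Laplace tail mass |x|>t21λeλ|x|dx=eλt\int_{|x|>t}2^{-1}\lambda e^{-\lambda|x|}dx=e^{-\lambda t}. A quick quadrature check (purely illustrative, with arbitrary values of λ\lambda and tt):

```python
import numpy as np
from scipy.integrate import quad

lam, t = 2.0, 1.5
# Two-sided tail of the Laplace(lam) density: by symmetry it equals
# int_t^inf lam * exp(-lam * x) dx, whose closed form is exp(-lam * t).
tail, _ = quad(lambda x: lam * np.exp(-lam * x), t, np.inf)
closed_form = np.exp(-lam * t)
```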

Proof of Theorem 3.

By Theorem 2, we obtain the contraction rate of the posterior distribution with respect to the average Rényi divergence Rn(pθ,η,p0)R_{n}(p_{\theta,\eta},p_{0}) between pθ,ηp_{\theta,\eta} and p0p_{0} given by

Rn(pθ,η,p0)=\displaystyle R_{n}(p_{\theta,\eta},p_{0})= 1ni=1nlog{(detΔη,i)1/4(detΔη0,i)1/4det((Δη,i+Δη0,i)/2)1/2}\displaystyle-\frac{1}{n}\sum_{i=1}^{n}\log\left\{\frac{(\det\Delta_{\eta,i})^{1/4}(\det\Delta_{\eta_{0},i})^{1/4}}{\det((\Delta_{\eta,i}+\Delta_{\eta_{0},i})/2)^{1/2}}\right\}
+14ni=1n(Δη,i+Δη0,i)1/2(Xi(θθ0)+ξη,iξη0,i)22.\displaystyle+\frac{1}{4n}\sum_{i=1}^{n}\lVert(\Delta_{\eta,i}+\Delta_{\eta_{0},i})^{-1/2}(X_{i}(\theta-\theta_{0})+\xi_{\eta,i}-\xi_{\eta_{0},i})\rVert_{2}^{2}.

Define

g2(Δη,i,Δη0,i)=1(detΔη,i)1/4(detΔη0,i)1/4det((Δη,i+Δη0,i)/2)1/2.\displaystyle g^{2}(\Delta_{\eta,i},\Delta_{\eta_{0},i})=1-\frac{(\det\Delta_{\eta,i})^{1/4}(\det\Delta_{\eta_{0},i})^{1/4}}{\det((\Delta_{\eta,i}+\Delta_{\eta_{0},i})/2)^{1/2}}. (33)

Then Theorem 2 implies that by the last display,

ϵn21ni=1nlog(1g2(Δη,i,Δη0,i))1ni=1ng2(Δη,i,Δη0,i),\displaystyle\epsilon_{n}^{2}\gtrsim-\frac{1}{n}\sum_{i=1}^{n}\log(1-g^{2}(\Delta_{\eta,i},\Delta_{\eta_{0},i}))\geq\frac{1}{n}\sum_{i=1}^{n}g^{2}(\Delta_{\eta,i},\Delta_{\eta_{0},i}), (34)

where the second inequality holds by the inequality logxx1\log x\leq x-1. Note that by combining (i) and (ii) of Lemma 10 in the Appendix, we obtain g2(Δη,i,Δη0,i)Δη,iΔη0,iF2g^{2}(\Delta_{\eta,i},\Delta_{\eta_{0},i})\gtrsim\lVert\Delta_{\eta,i}-\Delta_{\eta_{0},i}\rVert_{\rm F}^{2} if the left hand side is small. Thus, using the same approach as in the proof of Lemma 1, (34) is further bounded below by

C1dB,n2(η,η0)C2ϵn2max1inΔη,iΔη0,iF2(C1C3anϵn2)dB,n2(η,η0)C3enϵn2,\displaystyle\begin{split}&C_{1}d_{B,n}^{2}(\eta,\eta_{0})-C_{2}\epsilon_{n}^{2}\max_{1\leq i\leq n}\lVert\Delta_{\eta,i}-\Delta_{\eta_{0},i}\rVert_{\rm F}^{2}\\ &\quad\geq(C_{1}-C_{3}a_{n}\epsilon_{n}^{2})d_{B,n}^{2}(\eta,\eta_{0})-C_{3}e_{n}\epsilon_{n}^{2},\end{split} (35)

for some constants C1,C2,C3>0C_{1},C_{2},C_{3}>0. Since C1C3anϵn2C_{1}-C_{3}a_{n}\epsilon_{n}^{2} is bounded away from zero and ene_{n} is decreasing, (34) and (35) imply that ϵndB,n(η,η0)\epsilon_{n}\gtrsim d_{B,n}(\eta,\eta_{0}). Now, it is easy to see that by (5),

max1inΔη,i+Δη0,isp2\displaystyle\max_{1\leq i\leq n}\lVert\Delta_{\eta,i}+\Delta_{\eta_{0},i}\rVert_{\rm sp}^{2} 2max1inΔη,iΔη0,isp2+8max1inΔη0,isp2\displaystyle\leq 2\max_{1\leq i\leq n}\lVert\Delta_{\eta,i}-\Delta_{\eta_{0},i}\rVert_{\rm sp}^{2}+8\max_{1\leq i\leq n}\lVert\Delta_{\eta_{0},i}\rVert_{\rm sp}^{2}
en+andB,n2(η,η0)+1,\displaystyle\lesssim e_{n}+a_{n}d_{B,n}^{2}(\eta,\eta_{0})+1,

which is bounded since en+anϵn2=o(1)e_{n}+a_{n}\epsilon_{n}^{2}=o(1). Hence, we see that for η\eta_{\ast} satisfying (C6), n1X(θθ0)22+dA,n2(η,η0)n^{-1}\lVert X(\theta-\theta_{0})\rVert_{2}^{2}+d_{A,n}^{2}(\eta,\eta_{0}) is bounded by a constant multiple of

1nX(θθ0)22+dA,n2(η,η)+dA,n2(η,η0)\displaystyle\frac{1}{n}\lVert X(\theta-\theta_{0})\rVert_{2}^{2}+d_{A,n}^{2}(\eta,\eta_{\ast})+d_{A,n}^{2}(\eta_{\ast},\eta_{0})
1ni=1nXi(θθ0)+ξη,iξη,i22+dA,n2(η,η0)\displaystyle\quad\lesssim\frac{1}{n}\sum_{i=1}^{n}\lVert X_{i}(\theta-\theta_{0})+\xi_{\eta,i}-\xi_{\eta_{\ast},i}\rVert_{2}^{2}+d_{A,n}^{2}(\eta_{\ast},\eta_{0})
1ni=1n(Δη,i+Δη0,i)1(Xi(θθ0)+ξη,iξη0,i)22+dA,n2(η,η0).\displaystyle\quad\lesssim\frac{1}{n}\sum_{i=1}^{n}\lVert(\Delta_{\eta,i}+\Delta_{\eta_{0},i})^{-1}(X_{i}(\theta-\theta_{0})+\xi_{\eta,i}-\xi_{\eta_{0},i})\rVert_{2}^{2}+d_{A,n}^{2}(\eta_{\ast},\eta_{0}).

The display implies that X(θθ0)22+ndA,n2(η,η0)nϵn2\lVert X(\theta-\theta_{0})\rVert_{2}^{2}+nd_{A,n}^{2}(\eta,\eta_{0})\lesssim n\epsilon_{n}^{2} by Theorem 2 and (C6). Combining the results verifies the third and fourth assertions of the theorem. For the remaining assertions, observe that sθθ0sθ+s0K1s+s0ss_{\theta-\theta_{0}}\leq s_{\theta}+s_{0}\leq K_{1}s_{\star}+s_{0}\lesssim s_{\star} for θ\theta such that sθK1ss_{\theta}\leq K_{1}s_{\star}. Therefore, by Theorem 1, the first and the second assertions readily follow from the definitions of ϕ1\phi_{1} and ϕ2\phi_{2}. ∎

Proof of Corollary 1.

We first verify assertion (a). If s0>0s_{0}>0, the assertion is trivial. If s0=0s_{0}=0, the condition nϵ¯n2/logp0n\bar{\epsilon}_{n}^{2}/\log p\rightarrow 0 implies that s0s_{\star}\rightarrow 0, and hence Theorem 1 holds with s=0s_{\star}=0. Since this means that θ=θ0=0\theta=\theta_{0}=0 when s0=0s_{0}=0, we can plug in s0s_{0} for ss_{\star} in Theorem 3.

Similarly, assertion (b) trivially holds if s0>0s_{0}>0, so we only need to verify the case s0=0s_{0}=0. An inspection of the proof of Theorem 1 shows that (25) goes to zero for large enough A4A_{4} if s0=0s_{0}=0. This completes the proof. ∎

A.3 Proof of Theorem 4

To prove Theorem 4, we first provide preliminary results. Some of these will also be used to prove Theorems 5 and 6.

Lemma 3.

Suppose that (C1), (C2), (C3), (C4) and (C6) are satisfied for some orthogonal projection HH. Then, for Λn(θ,η)=(pθ,η/pθ0,η~n(θ,η))(Y(n))\Lambda_{n}^{\ast}(\theta,\eta)=(p_{\theta,\eta}/p_{\theta_{0},{\tilde{\eta}_{n}(\theta,\eta)}})(Y^{(n)}) and Λn(θ)\Lambda_{n}^{\star}(\theta) in (14) with the corresponding HH, there exists a positive sequence δn0\delta_{n}\rightarrow 0 such that for any θ\theta with sθK1s¯s_{\theta}\leq K_{1}\bar{s}_{\star},

0(supη~n|logΛn(θ,η)logΛn(θ)|δn{X(θθ0)2(sθ+s0)logp+X(θθ0)22})1.\displaystyle\begin{split}\mathbb{P}_{0}\Bigg{(}&\sup_{\eta\in\widetilde{\mathcal{H}}_{n}}|\log\Lambda_{n}^{\ast}(\theta,\eta)-\log\Lambda_{n}^{\star}(\theta)|\\ &\quad\leq\delta_{n}\left\{\lVert X(\theta-\theta_{0})\rVert_{2}\sqrt{(s_{\theta}+s_{0})\log p}+\lVert X(\theta-\theta_{0})\rVert_{2}^{2}\right\}\Bigg{)}\rightarrow 1.\end{split} (36)
Proof.

If sθ=s0=0s_{\theta}=s_{0}=0, the left-hand side of the inequality inside the probability in (36) is zero, and the assertion trivially holds. We thus only consider the case sθ+s0>0s_{\theta}+s_{0}>0 below.

By Markov’s inequality, it suffices to show that there exists a positive sequence δn=o(δn)\delta_{n}^{\prime}=o(\delta_{n}) such that

𝔼0supη~n|logΛn(θ,η)logΛn(θ)|δn{X(θθ0)2(sθ+s0)logp+X(θθ0)22}.\displaystyle\begin{split}&\mathbb{E}_{0}\sup_{\eta\in\widetilde{\mathcal{H}}_{n}}|\log\Lambda_{n}^{\ast}(\theta,\eta)-\log\Lambda_{n}^{\star}(\theta)|\\ &\quad\leq\delta_{n}^{\prime}\left\{\lVert X(\theta-\theta_{0})\rVert_{2}\sqrt{(s_{\theta}+s_{0})\log p}+\lVert X(\theta-\theta_{0})\rVert_{2}^{2}\right\}.\end{split} (37)

Let Δηn×n\Delta_{\eta}^{\star}\in\mathbb{R}^{n_{\ast}\times n_{\ast}} be the block-diagonal matrix formed by stacking Δη0,i1/2Δη,i1Δη0,i1/2\Delta_{\eta_{0},i}^{1/2}\Delta_{\eta,i}^{-1}\Delta_{\eta_{0},i}^{1/2}, i=1,,ni=1,\dots,n, and observe that

logΛn(θ,η)=\displaystyle\log\Lambda_{n}^{\ast}(\theta,\eta)= 12Δη1/2(IH)X~(θθ0)22\displaystyle-\frac{1}{2}\lVert{\Delta_{\eta}^{\star}}^{1/2}(I-H)\tilde{X}(\theta-\theta_{0})\rVert_{2}^{2}
+(θθ0)TX~T(IH)Δη{U(ξ~ηξ~η0)HX~(θθ0)}.\displaystyle+(\theta-\theta_{0})^{T}\tilde{X}^{T}(I-H)\Delta_{\eta}^{\star}\{U-(\tilde{\xi}_{\eta}-\tilde{\xi}_{\eta_{0}})-H\tilde{X}(\theta-\theta_{0})\}.

The left hand side of (37) is thus bounded by the sum of the following terms:

supη~n|(θθ0)TX~T(IH)(IΔη)(IH)X~(θθ0)|\displaystyle\sup_{\eta\in\widetilde{\cal H}_{n}}\big{\lvert}(\theta-\theta_{0})^{T}\tilde{X}^{T}(I-H)(I-\Delta_{\eta}^{\star})(I-H)\tilde{X}(\theta-\theta_{0})\big{\rvert} , (38)
supη~n|(θθ0)TX~T(IH)Δη(ξ~ηξ~η0+HX~(θθ0))|\displaystyle\sup_{\eta\in\widetilde{\cal H}_{n}}\big{\lvert}(\theta-\theta_{0})^{T}\tilde{X}^{T}(I-H)\Delta_{\eta}^{\star}(\tilde{\xi}_{\eta}-\tilde{\xi}_{\eta_{0}}+H\tilde{X}(\theta-\theta_{0}))\big{\rvert} , (39)
𝔼0supη~n|(θθ0)TX~T(IH)(IΔη)U|\displaystyle{\mathbb{E}}_{0}\sup_{\eta\in\widetilde{\cal H}_{n}}\big{\lvert}(\theta-\theta_{0})^{T}\tilde{X}^{T}(I-H)(I-\Delta_{\eta}^{\star})U\big{\rvert} . (40)

First, observe that (38) is bounded above by a constant multiple of

supη~nIΔηspX~(θθ0)22X(θθ0)22supη~nmax1inΔη,i1Δη0,i1F.\displaystyle\begin{split}&\sup_{\eta\in\widetilde{\cal H}_{n}}\lVert I-\Delta_{\eta}^{\star}\rVert_{\rm sp}\lVert\tilde{X}(\theta-\theta_{0})\rVert_{2}^{2}\lesssim\lVert X(\theta-\theta_{0})\rVert_{2}^{2}\sup_{\eta\in\widetilde{\cal H}_{n}}\max_{1\leq i\leq n}\lVert\Delta_{\eta,i}^{-1}-\Delta_{\eta_{0},i}^{-1}\rVert_{\rm F}.\end{split} (41)

Using (i) of Lemma 10 and the relation |1x||1x1||1-x|\asymp|1-x^{-1}| as x1x\rightarrow 1, we obtain that for ρi,k=ρk(Δη0,i1/2Δη,i1Δη0,i1/2)\rho_{i,k}^{\ast}=\rho_{k}(\Delta_{\eta_{0},i}^{1/2}\Delta_{\eta,i}^{-1}\Delta_{\eta_{0},i}^{1/2}),

Δη,i1Δη0,i1F2\displaystyle\lVert\Delta_{\eta,i}^{-1}-\Delta_{\eta_{0},i}^{-1}\rVert_{\rm F}^{2} k=1mi(1ρi,k)2k=1mi(11/ρi,k)2Δη,iΔη0,iF2,\displaystyle\lesssim\sum_{k=1}^{m_{i}}\left(1-\rho_{i,k}^{\ast}\right)^{2}\lesssim\sum_{k=1}^{m_{i}}\left(1-1/\rho_{i,k}^{\ast}\right)^{2}\lesssim\lVert\Delta_{\eta,i}-\Delta_{\eta_{0},i}\rVert_{\rm F}^{2}, (42)

provided that the rightmost side is sufficiently small. Because maxiΔη,iΔη0,iF2en+andB,n2(η,η0)en+anϵ¯n2\max_{i}\lVert\Delta_{\eta,i}-\Delta_{\eta_{0},i}\rVert_{\rm F}^{2}\leq e_{n}+a_{n}d_{B,n}^{2}(\eta,\eta_{0})\lesssim e_{n}+a_{n}\bar{\epsilon}_{n}^{2} on ~n\widetilde{\cal H}_{n}, this smallness requirement is met and (42) applies. This implies that for all sufficiently large nn, the right hand side of (41) is bounded above by a constant multiple of

X(θθ0)22supη~nen+andB,n2(η,η0)X(θθ0)22en+anϵ¯n2,\displaystyle\lVert X(\theta-\theta_{0})\rVert_{2}^{2}\sup_{\eta\in\widetilde{\cal H}_{n}}\sqrt{e_{n}+a_{n}d_{B,n}^{2}(\eta,\eta_{0})}\lesssim\lVert X(\theta-\theta_{0})\rVert_{2}^{2}\sqrt{e_{n}+a_{n}\bar{\epsilon}_{n}^{2}},

where en+anϵ¯n2=o(1)e_{n}+a_{n}\bar{\epsilon}_{n}^{2}=o(1) due to (C1) and (C2).
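The inverse-difference bound (42) reflects the elementary identity A1B1=A1(BA)B1A^{-1}-B^{-1}=A^{-1}(B-A)B^{-1}, which yields A1B1FA1spB1spABF\lVert A^{-1}-B^{-1}\rVert_{\rm F}\leq\lVert A^{-1}\rVert_{\rm sp}\lVert B^{-1}\rVert_{\rm sp}\lVert A-B\rVert_{\rm F} for positive definite A,BA,B. A short numerical check of this submultiplicative bound (illustrative only, with matrices close to the identity standing in for Δη,i\Delta_{\eta,i} and Δη0,i\Delta_{\eta_{0},i}):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5
for _ in range(100):
    E = rng.standard_normal((m, m))
    E = (E + E.T) / 2
    A = np.eye(m) + 0.01 * E   # SPD and close to B, as on the sieve
    B = np.eye(m)
    Ainv, Binv = np.linalg.inv(A), np.linalg.inv(B)
    lhs = np.linalg.norm(Ainv - Binv, "fro")
    # A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}, so the Frobenius norm is bounded by
    # ||A^{-1}||_sp ||B^{-1}||_sp ||A - B||_F
    rhs = (np.linalg.norm(Ainv, 2) * np.linalg.norm(Binv, 2)
           * np.linalg.norm(A - B, "fro"))
    assert lhs <= rhs + 1e-12
```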

Next, (39) is equal to

supη~n|(θθ0)TX~T(IH){(ξ~ηξ~η0)(IΔη)(ξ~ηξ~η0+HX~(θθ0))}|.\displaystyle\sup_{\eta\in\widetilde{\cal H}_{n}}\Big{\lvert}(\theta-\theta_{0})^{T}\tilde{X}^{T}(I-H)\left\{(\tilde{\xi}_{\eta}-\tilde{\xi}_{\eta_{0}})-(I-\Delta_{\eta}^{\star})(\tilde{\xi}_{\eta}-\tilde{\xi}_{\eta_{0}}+H\tilde{X}(\theta-\theta_{0}))\right\}\Big{\rvert}.

By the triangle inequality, the display is bounded by a constant multiple of

X(θθ0)2supη~n(IH)(ξ~ηξ~η0)2+supη~n{X(θθ0)22+X(θθ0)2ndA,n(η,η0)}max1inΔη,i1Δη0,i1sp.\displaystyle\begin{split}&\lVert X(\theta-\theta_{0})\rVert_{2}\sup_{\eta\in\widetilde{\cal H}_{n}}\lVert(I-H)(\tilde{\xi}_{\eta}-\tilde{\xi}_{\eta_{0}})\rVert_{2}\\ &+\sup_{\eta\in\widetilde{\cal H}_{n}}\Big{\{}\lVert X(\theta-\theta_{0})\rVert_{2}^{2}+\lVert X(\theta-\theta_{0})\rVert_{2}\sqrt{n}d_{A,n}(\eta,\eta_{0})\Big{\}}\max_{1\leq i\leq n}\lVert\Delta_{\eta,i}^{-1}-\Delta_{\eta_{0},i}^{-1}\rVert_{\rm sp}.\end{split} (43)

Using the same approach used in (42), the second term is further bounded above by a constant multiple of

X(θθ0)22en+anϵ¯n2+X(θθ0)2nϵ¯n2(en+anϵ¯n2).\displaystyle\lVert X(\theta-\theta_{0})\rVert_{2}^{2}\sqrt{e_{n}+a_{n}\bar{\epsilon}_{n}^{2}}+\lVert X(\theta-\theta_{0})\rVert_{2}\sqrt{n\bar{\epsilon}_{n}^{2}(e_{n}+a_{n}\bar{\epsilon}_{n}^{2})}.

Therefore, by (C4) and (C6), (43) is bounded by δn{X(θθ0)2(s01)logp+X(θθ0)22}\delta_{n}^{\prime}\{\lVert X(\theta-\theta_{0})\rVert_{2}\sqrt{(s_{0}\vee 1)\log p}+\lVert X(\theta-\theta_{0})\rVert_{2}^{2}\} for some δn0\delta_{n}^{\prime}\rightarrow 0. This is not more than the right hand side of (37) if sθ+s0>0s_{\theta}+s_{0}>0.

Note also that (40) is bounded by

θθ01𝔼0supη~nX~T(IH)(IΔη)U\displaystyle\left\lVert\theta-\theta_{0}\right\rVert_{1}{\mathbb{E}}_{0}\sup_{\eta\in\widetilde{\cal H}_{n}}\lVert\tilde{X}^{T}(I-H)(I-\Delta_{\eta}^{\star})U\rVert_{\infty}
sθ+s0X(θθ0)2ϕ1(sθ+s0)X𝔼0supη~nX~T(IH)(IΔη)U.\displaystyle\quad\leq\frac{\sqrt{s_{\theta}+s_{0}}\lVert X(\theta-\theta_{0})\rVert_{2}}{\phi_{1}(s_{\theta}+s_{0})\lVert X\rVert_{\ast}}{\mathbb{E}}_{0}\sup_{\eta\in\widetilde{\cal H}_{n}}\lVert\tilde{X}^{T}(I-H)(I-\Delta_{\eta}^{\star})U\rVert_{\infty}.

We have that ϕ1(sθ+s0)ϕ1(K1s¯+s0)1\phi_{1}(s_{\theta}+s_{0})\geq\phi_{1}(K_{1}\bar{s}_{\star}+s_{0})\gtrsim 1 by condition (C3). By Lemma 4 below, one can see that

𝔼0supη~nX~T(IH)(IΔη)UXlogp{en+anϵ¯n2+an0C3ϵ¯nlogN(δ,~n,dB,n)dδ},\displaystyle\begin{split}&{\mathbb{E}}_{0}\sup_{\eta\in\widetilde{\cal H}_{n}}\lVert\tilde{X}^{T}(I-H)(I-\Delta_{\eta}^{\star})U\rVert_{\infty}\\ &\quad\lesssim\lVert X\rVert_{\ast}\sqrt{\log p}\left\{\sqrt{e_{n}+a_{n}\bar{\epsilon}_{n}^{2}}+\sqrt{a_{n}}\int_{0}^{C_{3}\bar{\epsilon}_{n}}\sqrt{\log N(\delta,\widetilde{\mathcal{H}}_{n},d_{B,n})}d\delta\right\},\end{split} (44)

for some C3>0C_{3}>0. The term in the braces goes to zero by (C6). Combining the bounds, we easily see that there exists δn0\delta_{n}^{\prime}\rightarrow 0 satisfying (37). The assertion holds by choosing δn=δn\delta_{n}=\sqrt{\delta_{n}^{\prime}}. ∎

Lemma 4.

Consider a neighborhood n={η:dB,n(η,η0)ζn}\mathcal{H}_{n}^{\ast}=\{\eta\in\mathcal{H}:d_{B,n}(\eta,\eta_{0})\leq\zeta_{n}\} with any given ζn=o(an1/2)\zeta_{n}=o(a_{n}^{-1/2}) for ana_{n} satisfying (C1). Then, for any orthogonal projection PP and a sufficiently large C>0C>0, we have that under (C1),

𝔼0supηnX~TP(IΔη)U\displaystyle{\mathbb{E}}_{0}\sup_{\eta\in{\cal H}_{n}^{\ast}}\lVert\tilde{X}^{T}P(I-\Delta_{\eta}^{\star})U\rVert_{\infty}
Xlogp{en+anζn2+an0CζnlogN(δ,n,dB,n)dδ},\displaystyle\quad\lesssim\lVert X\rVert_{\ast}\sqrt{\log p}\left\{\sqrt{e_{n}+a_{n}\zeta_{n}^{2}}+\sqrt{a_{n}}\int_{0}^{C\zeta_{n}}\sqrt{\log N\left(\delta,{\cal H}_{n}^{\ast},d_{B,n}\right)}d\delta\right\},

where Δηn×n\Delta_{\eta}^{\star}\in\mathbb{R}^{n_{\ast}\times n_{\ast}} is the block-diagonal matrix formed by stacking the matrices Δη0,i1/2Δη,i1Δη0,i1/2\Delta_{\eta_{0},i}^{1/2}\Delta_{\eta,i}^{-1}\Delta_{\eta_{0},i}^{1/2}, i=1,,ni=1,\dots,n.

Proof.

Let Wη,j=X~jTP(IΔη)UW_{\eta,j}=\tilde{X}_{\cdot j}^{T}P(I-\Delta_{\eta}^{\star})U, where X~jn\tilde{X}_{\cdot j}\in\mathbb{R}^{n_{\ast}} denotes the jjth column of X~\tilde{X}. Then, by Lemma 2.2.2 of van der Vaart and Wellner, [29] applied with ψ(x)=ex21\psi(x)=e^{x^{2}}-1, the expectation in the lemma satisfies

𝔼0max1jpsupηn|Wη,j|max1jpsupηn|Wη,j|ψlogpmax1jpsupηn|Wη,j|ψ,\displaystyle\begin{split}{\mathbb{E}}_{0}\max_{1\leq j\leq p}\sup_{\eta\in{\cal H}_{n}^{\ast}}|W_{\eta,j}|&\leq\bigg{\lVert}\max_{1\leq j\leq p}\sup_{\eta\in{\cal H}_{n}^{\ast}}|W_{\eta,j}|\bigg{\rVert}_{\psi}\lesssim\sqrt{\log p}\max_{1\leq j\leq p}\bigg{\lVert}\sup_{\eta\in{\cal H}_{n}^{\ast}}|W_{\eta,j}|\bigg{\rVert}_{\psi},\end{split} (45)

where ψ\lVert\cdot\rVert_{\psi} is the Orlicz norm for ψ\psi. For any η1,η2n\eta_{1},\eta_{2}\in{\cal H}_{n}^{\ast}, define the standard deviation pseudo-metric between Wη1,jW_{\eta_{1},j} and Wη2,jW_{\eta_{2},j} as

dσ,j(η1,η2)\displaystyle d_{\sigma,j}(\eta_{1},\eta_{2}) Var(Wη1,jWη2,j)=(Δη1Δη2)PX~j2.\displaystyle\coloneqq\sqrt{{\rm Var}(W_{\eta_{1},j}-W_{\eta_{2},j})}=\lVert(\Delta_{\eta_{1}}^{\star}-\Delta_{\eta_{2}}^{\star})P\tilde{X}_{\cdot j}\rVert_{2}.

Using the tail bound for normal distributions and Lemma 2.2.1 of van der Vaart and Wellner, [29], we see that Wη1,jWη2,jψdσ,j(η1,η2)\lVert W_{\eta_{1},j}-W_{\eta_{2},j}\rVert_{\psi}\lesssim d_{\sigma,j}(\eta_{1},\eta_{2}) for every η1,η2n\eta_{1},\eta_{2}\in{\cal H}_{n}^{\ast}. We shall show that n{\cal H}_{n}^{\ast} is a separable pseudo-metric space with dσ,jd_{\sigma,j} for every jpj\leq p. Then, under the true model 0{\mathbb{P}}_{0}, we see that {Wη,j:ηn}\{W_{\eta,j}:\eta\in{\cal H}_{n}^{\ast}\} is a separable Gaussian process for dσ,jd_{\sigma,j}. Hence, by Corollary 2.2.5 of van der Vaart and Wellner, [29], for any fixed ηn\eta^{\prime}\in{\cal H}_{n}^{\ast},

supηn|Wη,j|ψWη,jψ+0diamj(n)logN(ϵ/2,n,dσ,j)dϵ,\displaystyle\bigg{\lVert}\sup_{\eta\in{\cal H}_{n}^{\ast}}|W_{\eta,j}|\bigg{\rVert}_{\psi}\lesssim\lVert W_{\eta^{\prime},j}\rVert_{\psi}+\int_{0}^{{\rm diam}_{j}({\cal H}_{n}^{\ast})}\sqrt{\log N(\epsilon/2,{\cal H}_{n}^{\ast},d_{\sigma,j})}d\epsilon, (46)

where diamj(n)=sup{dσ,j(η1,η2):η1,η2n}{\rm diam}_{j}({\cal H}_{n}^{\ast})=\sup\{d_{\sigma,j}(\eta_{1},\eta_{2}):{\eta_{1},\eta_{2}\in{\cal H}_{n}^{\ast}}\}. It is clear that Wη,jW_{\eta^{\prime},j} has a normal distribution with mean zero and variance (IΔη)PX~j22\lVert(I-\Delta_{\eta^{\prime}}^{\star})P\tilde{X}_{\cdot j}\rVert_{2}^{2}.

Using Lemma 2.2.1 of van der Vaart and Wellner, [29] again, we see that

Wη,jψ(IΔη)PX~j2max1inΔη,i1Δη0,i12Xj2Xen+anζn2,\displaystyle\begin{split}\lVert W_{\eta^{\prime},j}\rVert_{\psi}&\lesssim\lVert(I-\Delta_{\eta^{\prime}}^{\star})P\tilde{X}_{\cdot j}\rVert_{2}\\ &\lesssim\max_{1\leq i\leq n}\lVert\Delta_{\eta^{\prime},i}^{-1}-\Delta_{\eta_{0},i}^{-1}\rVert_{2}\lVert X_{\cdot j}\rVert_{2}\\ &\lesssim\lVert X\rVert_{\ast}\sqrt{e_{n}+a_{n}\zeta_{n}^{2}},\end{split} (47)

for every ηn\eta^{\prime}\in{\cal H}_{n}^{\ast}. Here the last inequality holds by using (42) and the fact that maxiΔη,iΔη0,iF2en+andB,n2(η,η0)en+anζn2=o(1)\max_{i}\lVert\Delta_{\eta,i}-\Delta_{\eta_{0},i}\rVert_{\rm F}^{2}\leq e_{n}+a_{n}d_{B,n}^{2}(\eta,\eta_{0})\lesssim e_{n}+a_{n}\zeta_{n}^{2}=o(1) on n{\cal H}_{n}^{\ast}, under (C1).

Next, to further bound the second term in (46), note that for every η1,η2n\eta_{1},\eta_{2}\in{\cal H}_{n}^{\ast},

anζn2\displaystyle a_{n}\zeta_{n}^{2} k=122andB,n2(ηk,η0)andB,n2(η1,η2)max1inΔη1,iΔη2,iF2,\displaystyle\gtrsim\sum_{k=1}^{2}2a_{n}d_{B,n}^{2}(\eta_{k},\eta_{0})\geq a_{n}d_{B,n}^{2}(\eta_{1},\eta_{2})\geq\max_{1\leq i\leq n}\lVert\Delta_{\eta_{1},i}-\Delta_{\eta_{2},i}\rVert_{\rm F}^{2},

which is further bounded below by

min1inρmin2(Δη2,i)max1ink=1mi{11/ρk(Δη2,i1/2Δη1,i1Δη2,i1/2)}2,\displaystyle\min_{1\leq i\leq n}\rho_{\min}^{2}(\Delta_{\eta_{2},i})\max_{1\leq i\leq n}\sum_{k=1}^{m_{i}}\left\{1-1/\rho_{k}(\Delta_{\eta_{2},i}^{1/2}\Delta_{\eta_{1},i}^{-1}\Delta_{\eta_{2},i}^{1/2})\right\}^{2},

using (i) of Lemma 10. In the last display, we see that miniρmin(Δη2,i)\min_{i}\rho_{\min}(\Delta_{\eta_{2},i}) is bounded away from zero since

max1inΔη2,i1sp\displaystyle\max_{1\leq i\leq n}\lVert\Delta_{\eta_{2},i}^{-1}\rVert_{\rm sp} max1inΔη2,i1Δη0,i1sp+max1inΔη0,i1spen+anζn2+1,\displaystyle\leq\max_{1\leq i\leq n}\lVert\Delta_{\eta_{2},i}^{-1}-\Delta_{\eta_{0},i}^{-1}\rVert_{\rm sp}+\max_{1\leq i\leq n}\lVert\Delta_{\eta_{0},i}^{-1}\rVert_{\rm sp}\lesssim\sqrt{e_{n}+a_{n}\zeta_{n}^{2}}+1,

and hence every eigenvalue ρk(Δη2,i1/2Δη1,i1Δη2,i1/2)\rho_{k}(\Delta_{\eta_{2},i}^{1/2}\Delta_{\eta_{1},i}^{-1}\Delta_{\eta_{2},i}^{1/2}) is bounded below and above by a multiple of its reciprocal, as anζn20a_{n}\zeta_{n}^{2}\rightarrow 0. This implies that anζn2a_{n}\zeta_{n}^{2} is further bounded below by a constant multiple of

max1ink=1mi{1ρk(Δη2,i1/2Δη1,i1Δη2,i1/2)}2\displaystyle\max_{1\leq i\leq n}\sum_{k=1}^{m_{i}}\left\{1-\rho_{k}(\Delta_{\eta_{2},i}^{1/2}\Delta_{\eta_{1},i}^{-1}\Delta_{\eta_{2},i}^{1/2})\right\}^{2}
min1inρmin2(Δη2,i)max1inΔη1,i1Δη2,i1F2.\displaystyle\quad\geq\min_{1\leq i\leq n}\rho_{\min}^{2}(\Delta_{\eta_{2},i})\max_{1\leq i\leq n}\lVert\Delta_{\eta_{1},i}^{-1}-\Delta_{\eta_{2},i}^{-1}\rVert_{\rm F}^{2}.

By the definition of dσ,jd_{\sigma,j} and the preceding displays, we thus obtain

dσ,j(η1,η2)Δη1Δη2spX~j2Xj2max1inΔη1,i1Δη2,i1spXj2andB,n(η1,η2),\displaystyle\begin{split}d_{\sigma,j}(\eta_{1},\eta_{2})&\leq\lVert\Delta_{\eta_{1}}^{\star}-\Delta_{\eta_{2}}^{\star}\rVert_{\rm sp}\lVert\tilde{X}_{\cdot j}\rVert_{2}\\ &\lesssim\lVert X_{\cdot j}\rVert_{2}\max_{1\leq i\leq n}\lVert\Delta_{\eta_{1},i}^{-1}-\Delta_{\eta_{2},i}^{-1}\rVert_{\rm sp}\\ &\lesssim\lVert X_{\cdot j}\rVert_{2}\sqrt{a_{n}}d_{B,n}(\eta_{1},\eta_{2}),\end{split} (48)

for every η1,η2n\eta_{1},\eta_{2}\in{\cal H}_{n}^{\ast}. Hence, using that diamj(n)Xj2ζnan{\rm diam}_{j}({\cal H}_{n}^{\ast})\lesssim\lVert X_{\cdot j}\rVert_{2}\zeta_{n}\sqrt{a_{n}}, we can bound the second term in (46) above by a constant multiple of

0C1Xj2ζnanlogN(ϵ/C2Xj2an,n,dB,n)dϵ,\displaystyle\int_{0}^{C_{1}\lVert X_{\cdot j}\rVert_{2}\zeta_{n}\sqrt{a_{n}}}\sqrt{\log N\left({\epsilon}/{C_{2}\lVert X_{\cdot j}\rVert_{2}\sqrt{a_{n}}},{\cal H}_{n}^{\ast},d_{B,n}\right)}d\epsilon,

for some C1,C2>0C_{1},C_{2}>0. This can be further bounded by replacing Xj2\lVert X_{\cdot j}\rVert_{2} in the display by X\lVert X\rVert_{\ast}. Then, using (45), (46), and (47), and by the substitution δ=ϵ/(C2Xan)\delta={\epsilon}/{(C_{2}\lVert X\rVert_{\ast}\sqrt{a_{n}})} for the last display, we bound (45) above by a constant multiple of

Xlogp{en+anζn2+an0C3ζnlogN(δ,n,dB,n)dδ},\displaystyle\lVert X\rVert_{\ast}\sqrt{\log p}\left\{\sqrt{e_{n}+a_{n}\zeta_{n}^{2}}+\sqrt{a_{n}}\int_{0}^{C_{3}\zeta_{n}}\sqrt{\log N\left(\delta,{\cal H}_{n}^{\ast},d_{B,n}\right)}d\delta\right\},

for some C3>0C_{3}>0.

To complete the proof, it remains to show that n{\cal H}_{n}^{\ast} is a separable pseudo-metric space with dσ,jd_{\sigma,j} for every jpj\leq p. By (48), we see that dσ,j(η1,η2)XandB,n(η1,η2)d_{\sigma,j}(\eta_{1},\eta_{2})\lesssim\lVert X\rVert_{\ast}\sqrt{a_{n}}d_{B,n}(\eta_{1},\eta_{2}) for every η1,η2n\eta_{1},\eta_{2}\in{\cal H}_{n}^{\ast}. This implies that n{\cal H}_{n}^{\ast} is separable with dσ,jd_{\sigma,j} since \mathcal{H} is separable with dB,nd_{B,n}. ∎

Lemma 5.

For any orthogonal projection PP,

0(X~TPU>2ρ¯01/2logpX)\displaystyle{\mathbb{P}}_{0}\left(\lVert\tilde{X}^{T}PU\rVert_{\infty}>2\underline{\rho}_{0}^{-1/2}\sqrt{\log p}\lVert X\rVert_{\ast}\right) 2p.\displaystyle\leq\frac{2}{p}.
Proof.

Note first that X~jTPU\tilde{X}_{\cdot j}^{T}PU has a normal distribution with mean zero and variance PX~j22\lVert P\tilde{X}_{\cdot j}\rVert_{2}^{2}, and hence we have

0(X~TPU>tmax1jpPX~j2)2pet2/2,t>0,\displaystyle{\mathbb{P}}_{0}\left(\lVert\tilde{X}^{T}PU\rVert_{\infty}>t\max_{1\leq j\leq p}\lVert P\tilde{X}_{\cdot j}\rVert_{2}\right)\leq 2pe^{-t^{2}/2},\quad t>0,

by the tail probabilities of normal distributions. By choosing t=2logpt=2\sqrt{\log p} and using the inequality PX~j2X~j2ρ¯01/2X\lVert P\tilde{X}_{\cdot j}\rVert_{2}\leq\lVert\tilde{X}_{\cdot j}\rVert_{2}\leq\underline{\rho}_{0}^{-1/2}\lVert X\rVert_{\ast} for every jpj\leq p, we verify the assertion. ∎
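The Gaussian tail and union bound used in the proof of Lemma 5 can be checked empirically: with t=2logpt=2\sqrt{\log p}, the bound 2pet2/22pe^{-t^{2}/2} equals exactly 2/p2/p. A Monte Carlo sketch (illustrative only, with i.i.d. unit-variance coordinates standing in for the normalized entries of X~TPU\tilde{X}^{T}PU):

```python
import numpy as np

rng = np.random.default_rng(2)
p, reps = 50, 2000
t = 2 * np.sqrt(np.log(p))

# empirical frequency that the max of p unit-variance Gaussians exceeds t
Z = rng.standard_normal((reps, p))
emp = np.mean(np.abs(Z).max(axis=1) > t)

union_bound = 2 * p * np.exp(-t ** 2 / 2)  # equals 2/p for t = 2*sqrt(log p)
assert np.isclose(union_bound, 2 / p)
assert emp <= union_bound
```

The empirical frequency is typically far below the union bound, which is loose but dimension-free in the sense needed for the proof.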

Lemma 6.

If (C3) and (C6) are satisfied and s0logpnϵ¯n2s_{0}\log p\lesssim n\bar{\epsilon}_{n}^{2}, there exists a constant K0>0K_{0}^{\prime}>0 such that

0(infη~npθ,ηpθ0,η(Y(n))dΠ(θ)eK0(1+s0logp))1.\displaystyle\mathbb{P}_{0}\left(\inf_{\eta\in\widetilde{\mathcal{H}}_{n}}\int\frac{p_{\theta,\eta}}{p_{\theta_{0},\eta}}(Y^{(n)})d\Pi(\theta)\geq e^{-K_{0}^{\prime}(1+s_{0}\log p)}\right)\rightarrow 1. (49)
Proof.

Let Θn={θΘ:sθ=s0,X(θθ0)221}\Theta_{n}^{\ast}=\{\theta\in\Theta:s_{\theta}=s_{0},\lVert X(\theta-\theta_{0})\rVert_{2}^{2}\leq 1\}. Restricting the integral to this set, the left hand side of the inequality in (49) is bounded below by

infη~nΘnpθ,ηpθ0,η(Y(n))dΠ(θ)Θninfη~npθ,ηpθ0,η(Y(n))dΠ(θ)=Θnexp(infη~nlogpθ,ηpθ0,η(Y(n)))dΠ(θ).\displaystyle\begin{split}\inf_{\eta\in\widetilde{\mathcal{H}}_{n}}\int_{\Theta_{n}^{\ast}}\frac{p_{\theta,\eta}}{p_{\theta_{0},\eta}}(Y^{(n)})d\Pi(\theta)&\geq\int_{\Theta_{n}^{\ast}}\inf_{\eta\in\widetilde{\mathcal{H}}_{n}}\frac{p_{\theta,\eta}}{p_{\theta_{0},\eta}}(Y^{(n)})d\Pi(\theta)\\ &=\int_{\Theta_{n}^{\ast}}\exp\left(\inf_{\eta\in\widetilde{\mathcal{H}}_{n}}\log\frac{p_{\theta,\eta}}{p_{\theta_{0},\eta}}(Y^{(n)})\right)d\Pi(\theta).\end{split} (50)

The exponent is equal to

infη~n{(θθ0)TX~TΔη(Uξ~η+ξ~η0)12Δη1/2X~(θθ0)22}θθ01supη~nX~TΔηUX(θθ0)2supη~nξ~ηξ~η02X(θθ0)22,\displaystyle\begin{split}&\inf_{\eta\in\widetilde{\mathcal{H}}_{n}}\left\{(\theta-\theta_{0})^{T}\tilde{X}^{T}\Delta_{\eta}^{\star}(U-\tilde{\xi}_{\eta}+\tilde{\xi}_{\eta_{0}})-\frac{1}{2}\lVert\Delta_{\eta}^{\star 1/2}\tilde{X}(\theta-\theta_{0})\rVert_{2}^{2}\right\}\\ &\quad\gtrsim-\lVert\theta-\theta_{0}\rVert_{1}\sup_{\eta\in\widetilde{\mathcal{H}}_{n}}\lVert\tilde{X}^{T}\Delta_{\eta}^{\star}U\rVert_{\infty}\\ &\qquad-\lVert X(\theta-\theta_{0})\rVert_{2}\sup_{\eta\in\widetilde{\mathcal{H}}_{n}}\lVert\tilde{\xi}_{\eta}-\tilde{\xi}_{\eta_{0}}\rVert_{2}-\lVert X(\theta-\theta_{0})\rVert_{2}^{2},\end{split} (51)

since Δηsp1\lVert\Delta_{\eta}^{\star}\rVert_{\rm sp}\lesssim 1 on ~n\widetilde{\mathcal{H}}_{n}. We first consider the case s0>0s_{0}>0. Observe that supη~nX~TΔηUX~TU+supη~nX~T(IΔη)U\sup_{\eta\in\widetilde{\mathcal{H}}_{n}}\lVert\tilde{X}^{T}\Delta_{\eta}^{\star}U\rVert_{\infty}\leq\lVert\tilde{X}^{T}U\rVert_{\infty}+\sup_{\eta\in\widetilde{\mathcal{H}}_{n}}\lVert\tilde{X}^{T}(I-\Delta_{\eta}^{\star})U\rVert_{\infty}, where the first term is bounded by a constant multiple of Xlogp\lVert X\rVert_{\ast}\sqrt{\log p} with 0\mathbb{P}_{0}-probability tending to one, due to Lemma 5. By Lemma 4 applied with P=IP=I together with (C6), the expected value of the second term is bounded by δnXlogp\delta_{n}\lVert X\rVert_{\ast}\sqrt{\log p} for some δn0\delta_{n}\rightarrow 0. Hence, for any MnM_{n}\rightarrow\infty,

0(supη~nX~T(IΔη)UMnδnXlogp)1.\displaystyle\mathbb{P}_{0}\left(\sup_{\eta\in\widetilde{\mathcal{H}}_{n}}\lVert\tilde{X}^{T}(I-\Delta_{\eta}^{\star})U\rVert_{\infty}\leq M_{n}\delta_{n}\lVert X\rVert_{\ast}\sqrt{\log p}\right)\rightarrow 1.

Consequently, taking a sufficiently slowly increasing MnM_{n} for the above, (51) is bounded below by a constant multiple of

Xθθ01logpX(θθ0)22,\displaystyle-\lVert X\rVert_{\ast}\lVert\theta-\theta_{0}\rVert_{1}\sqrt{\log p}-\lVert X(\theta-\theta_{0})\rVert_{2}^{2},

with 0\mathbb{P}_{0}-probability tending to one. Note that Xθθ01sθ+s0X(θθ0)2/ϕ1(sθ+s0)\lVert X\rVert_{\ast}\lVert\theta-\theta_{0}\rVert_{1}\leq\sqrt{s_{\theta}+s_{0}}\lVert X(\theta-\theta_{0})\rVert_{2}/\phi_{1}(s_{\theta}+s_{0}) and ϕ1(sθ+s0)=ϕ1(2s0)1\phi_{1}(s_{\theta}+s_{0})=\phi_{1}(2s_{0})\gtrsim 1 on Θn\Theta_{n}^{\ast} by (C3), if s0logpnϵ¯n2s_{0}\log p\lesssim n\bar{\epsilon}_{n}^{2}. The last display is thus bounded below by C1s0logp-C_{1}s_{0}\log p for some C1>0C_{1}>0, uniformly over θΘn\theta\in\Theta_{n}^{\ast}. Consequently, with 0\mathbb{P}_{0}-probability tending to one, (50) is bounded below by

eC1s0logpΠ(Θn)πp(s0)eC2s0logp,\displaystyle e^{-C_{1}s_{0}\log p}\Pi(\Theta_{n}^{\ast})\geq\pi_{p}(s_{0})e^{-C_{2}s_{0}\log p},

for some C2>0C_{2}>0, where the inequality holds by (23) and (24) since λθ01s0logp\lambda\lVert\theta_{0}\rVert_{1}\leq s_{0}\log p by (C3). Since logπp(s0)s0logp-\log\pi_{p}(s_{0})\lesssim s_{0}\log p if s0>0s_{0}>0, the display is further bounded below as in the assertion.

If s0=0s_{0}=0, (51) is equal to zero on Θn\Theta_{n}^{\ast}, as Θn\Theta_{n}^{\ast} is then the singleton set {θ:θ=0}\{\theta:\theta=0\}. This means that (50) is bounded below by πp(0)\pi_{p}(0), which is also bounded away from zero. This leads to the desired assertion. ∎
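Several steps above, as well as in the proof of Lemma 3, rely on the elementary Cauchy–Schwarz bound v1svv2\lVert v\rVert_{1}\leq\sqrt{s_{v}}\lVert v\rVert_{2} for a vector vv with svs_{v} nonzero entries, which is the ingredient behind the appearance of the compatibility number ϕ1\phi_{1}. A quick numerical illustration of this bound alone (the compatibility constant itself is model-specific and not simulated here):

```python
import numpy as np

rng = np.random.default_rng(3)
p, s = 100, 7
for _ in range(100):
    v = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    v[support] = rng.standard_normal(s)
    # Cauchy-Schwarz restricted to the support: ||v||_1 <= sqrt(s) ||v||_2
    assert np.abs(v).sum() <= np.sqrt(s) * np.linalg.norm(v) + 1e-12
```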

Proof of Theorem 4.

The idea of our proof is similar in part to that of Theorem 3.5 in Chae et al., [10]. We only need to verify the first and fourth assertions. The second and third assertions then follow from the definitions of ϕ1\phi_{1} and ϕ2\phi_{2}. Note also that we only need to consider the case s0logpnϵ¯n2s_{0}\log p\lesssim n\bar{\epsilon}_{n}^{2}, as the assertions follow from Theorems 1 and 3 if s0logpnϵ¯n2s_{0}\log p\gtrsim n\bar{\epsilon}_{n}^{2}.

Let n={θΘ:sθ>K4s0}{θΘ:X(θθ0)22>K5s0logp}\mathcal{B}_{n}=\{\theta\in\Theta:s_{\theta}>K_{4}s_{0}\}\cup\{\theta\in\Theta:\lVert X(\theta-\theta_{0})\rVert_{2}^{2}>K_{5}s_{0}\log p\}. Also define ~n\widetilde{\mathcal{H}}_{n}^{\prime} as ~n\widetilde{\mathcal{H}}_{n} but using a constant M~2M~2\tilde{M}_{2}^{\prime}\leq\tilde{M}_{2} such that ~n~n\widetilde{\mathcal{H}}_{n}^{\prime}\subset\widetilde{\mathcal{H}}_{n}. Then, by Theorem 3, we have that

𝔼0Π(θn|Y(n))\displaystyle\mathbb{E}_{0}\Pi(\theta\in\mathcal{B}_{n}|Y^{(n)}) 𝔼0Π(θnΘ~n,η~n|Y(n))+o(1)\displaystyle\leq\mathbb{E}_{0}\Pi(\theta\in\mathcal{B}_{n}\cap\widetilde{\Theta}_{n},\eta\in\widetilde{\mathcal{H}}_{n}^{\prime}|Y^{(n)})+o(1)
𝔼0Π(θnΘ~n,η~n|Y(n),η~n)+o(1).\displaystyle\leq\mathbb{E}_{0}\Pi(\theta\in\mathcal{B}_{n}\cap\widetilde{\Theta}_{n},\eta\in\widetilde{\mathcal{H}}_{n}^{\prime}|Y^{(n)},\eta\in\widetilde{\mathcal{H}}_{n})+o(1).

Let Ω\Omega be the intersection of the events in (36), (49), and the event {X~T(IH)U2ρ¯01/2logpX}\{\lVert\tilde{X}^{T}(I-H)U\rVert_{\infty}\leq 2\underline{\rho}_{0}^{-1/2}\sqrt{\log p}\lVert X\rVert_{\ast}\}, whose complement has probability tending to zero by Lemma 5. Since 0(Ωc)0\mathbb{P}_{0}(\Omega^{c})\rightarrow 0, it suffices to show that

𝔼0Π(θnΘ~n,η~n|Y(n),η~n)𝟙Ω=𝔼0Θ~nn~npθ,η(Y(n))dΠ(η)dΠ(θ)~npθ,η(Y(n))dΠ(η)dΠ(θ)\displaystyle\begin{split}&\mathbb{E}_{0}\Pi(\theta\in\mathcal{B}_{n}\cap\widetilde{\Theta}_{n},\eta\in\widetilde{\mathcal{H}}_{n}^{\prime}|Y^{(n)},\eta\in\widetilde{\mathcal{H}}_{n})\mathbbm{1}_{\Omega}\\ &\quad=\mathbb{E}_{0}\frac{\int_{\widetilde{\Theta}_{n}\cap{\mathcal{B}}_{n}}\int_{\widetilde{\mathcal{H}}_{n}^{\prime}}p_{\theta,\eta}(Y^{(n)})d\Pi(\eta)d\Pi(\theta)}{\int\int_{\widetilde{\mathcal{H}}_{n}}p_{\theta,\eta}(Y^{(n)})d\Pi(\eta)d\Pi(\theta)}\end{split} (52)

tends to zero. Observe that by Fubini’s theorem, the denominator of the ratio is equal to

~npθ,ηpθ0,η(Y(n))dΠ(θ)pθ0,ηdΠ(η)\displaystyle\int_{\widetilde{\mathcal{H}}_{n}}\int\frac{p_{\theta,\eta}}{p_{\theta_{0},\eta}}(Y^{(n)})d\Pi(\theta)p_{\theta_{0},\eta}d\Pi(\eta)
{infη~npθ,ηpθ0,η(Y(n))dΠ(θ)}~npθ0,η(Y(n))dΠ(η).\displaystyle\quad\geq\left\{\inf_{\eta\in\widetilde{\mathcal{H}}_{n}}\int\frac{p_{\theta,\eta}}{p_{\theta_{0},\eta}}(Y^{(n)})d\Pi(\theta)\right\}\int_{\widetilde{\mathcal{H}}_{n}}p_{\theta_{0},\eta}(Y^{(n)})d\Pi(\eta).

By Lemma 6, the term in the braces on the right hand side is further bounded below by eK0(1+s0logp)e^{-K_{0}^{\prime}(1+s_{0}\log p)} on the event Ω\Omega. Note also that the numerator of the ratio in (52) is equal to

Θ~nn~nΛn(θ,η)pθ0,η~n(θ,η)(Y(n))dΠ(η)dΠ(θ)\displaystyle\int_{\widetilde{\Theta}_{n}\cap{\mathcal{B}}_{n}}\int_{\widetilde{\mathcal{H}}_{n}^{\prime}}\Lambda_{n}^{\ast}(\theta,\eta)p_{\theta_{0},\tilde{\eta}_{n}(\theta,\eta)}(Y^{(n)})d\Pi(\eta)d\Pi(\theta)
{Θ~nnΛn(θ)supη~nΛn(θ,η)Λn(θ)dΠ(θ)}supθΘ~nn~npθ0,η~n(θ,η)(Y(n))dΠ(η).\displaystyle\quad\leq\left\{\int_{\widetilde{\Theta}_{n}\cap{\mathcal{B}}_{n}}\Lambda_{n}^{\star}(\theta)\sup_{\eta\in\widetilde{\mathcal{H}}_{n}^{\prime}}\frac{\Lambda_{n}^{\ast}(\theta,\eta)}{\Lambda_{n}^{\star}(\theta)}d\Pi(\theta)\right\}\sup_{\theta\in\widetilde{\Theta}_{n}\cap{\mathcal{B}}_{n}}\int_{\widetilde{\mathcal{H}}_{n}^{\prime}}p_{\theta_{0},\tilde{\eta}_{n}(\theta,\eta)}(Y^{(n)})d\Pi(\eta).

Combining the bounds, on the event Ω\Omega, the ratio in (52) is bounded by

eK0(1+s0logp)supθΘ~nn~npθ0,η~n(θ,η)(Y(n))dΠ(η)~npθ0,η(Y(n))dΠ(η)\displaystyle e^{K_{0}^{\prime}(1+s_{0}\log p)}\sup_{\theta\in\widetilde{\Theta}_{n}\cap{\mathcal{B}}_{n}}\frac{\int_{\widetilde{\mathcal{H}}_{n}^{\prime}}p_{\theta_{0},\tilde{\eta}_{n}(\theta,\eta)}(Y^{(n)})d\Pi(\eta)}{\int_{\widetilde{\mathcal{H}}_{n}}p_{\theta_{0},\eta}(Y^{(n)})d\Pi(\eta)}
×Θ~nnΛn(θ)supη~nΛn(θ,η)Λn(θ)dΠ(θ).\displaystyle\quad\times\int_{\widetilde{\Theta}_{n}\cap{\mathcal{B}}_{n}}\Lambda_{n}^{\star}(\theta)\sup_{\eta\in\widetilde{\mathcal{H}}_{n}^{\prime}}\frac{\Lambda_{n}^{\ast}(\theta,\eta)}{\Lambda_{n}^{\star}(\theta)}d\Pi(\theta).

At the end of this proof, we will verify that

supθΘ~nn~npθ0,η~n(θ,η)(Y(n))dΠ(η)~npθ0,η(Y(n))dΠ(η)1,\displaystyle\sup_{\theta\in\widetilde{\Theta}_{n}\cap{\mathcal{B}}_{n}}\frac{\int_{\widetilde{\mathcal{H}}_{n}^{\prime}}p_{\theta_{0},\tilde{\eta}_{n}(\theta,\eta)}(Y^{(n)})d\Pi(\eta)}{\int_{\widetilde{\mathcal{H}}_{n}}p_{\theta_{0},\eta}(Y^{(n)})d\Pi(\eta)}\lesssim 1, (53)

with 0\mathbb{P}_{0}-probability tending to one. Assuming this for now and letting Ω\Omega^{\ast} be the event on which (53) holds, we see that (52) is bounded by

eK0(1+s0logp)𝔼0Θ~nnΛn(θ)supη~nΛn(θ,η)Λn(θ)dΠ(θ)𝟙ΩΩ+o(1).\displaystyle e^{K_{0}^{\prime}(1+s_{0}\log p)}\mathbb{E}_{0}\int_{\widetilde{\Theta}_{n}\cap{\mathcal{B}}_{n}}\Lambda_{n}^{\star}(\theta)\sup_{\eta\in\widetilde{\mathcal{H}}_{n}^{\prime}}\frac{\Lambda_{n}^{\ast}(\theta,\eta)}{\Lambda_{n}^{\star}(\theta)}d\Pi(\theta)\mathbbm{1}_{\Omega\cap\Omega^{\ast}}+o(1).

To show that this tends to zero, for δn\delta_{n} in Lemma 3, define 1,n={θΘ~n:sθ>K4s0,X(θθ0)22δn1/2(sθ+s0)logp}\mathcal{B}_{1,n}=\{\theta\in\widetilde{\Theta}_{n}:s_{\theta}>K_{4}s_{0},\lVert X(\theta-\theta_{0})\rVert_{2}^{2}\leq\delta_{n}^{-1/2}(s_{\theta}+s_{0})\log p\}, 2,n={θΘ~n:sθ>K4s0,X(θθ0)22>δn1/2(sθ+s0)logp}\mathcal{B}_{2,n}=\{\theta\in\widetilde{\Theta}_{n}:s_{\theta}>K_{4}s_{0},\lVert X(\theta-\theta_{0})\rVert_{2}^{2}>\delta_{n}^{-1/2}(s_{\theta}+s_{0})\log p\}, and 3,n={θΘ~n:sθK4s0,X(θθ0)22>K5s0logp}\mathcal{B}_{3,n}=\{\theta\in\widetilde{\Theta}_{n}:s_{\theta}\leq K_{4}s_{0},\lVert X(\theta-\theta_{0})\rVert_{2}^{2}>K_{5}s_{0}\log p\} such that Θ~nn=k=13k,n\widetilde{\Theta}_{n}\cap{\mathcal{B}}_{n}=\cup_{k=1}^{3}\mathcal{B}_{k,n}. Below we will show that

A(k,n)\displaystyle A(\mathcal{B}_{k,n}) eK0(1+s0logp)\displaystyle\coloneqq e^{K_{0}^{\prime}(1+s_{0}\log p)}
×𝔼0k,nΛn(θ)supηnΛn(θ,η)Λn(θ)dΠ(θ)𝟙ΩΩ0,k=1,2,3.\displaystyle\quad\times\mathbb{E}_{0}\int_{{\mathcal{B}}_{k,n}}\Lambda_{n}^{\star}(\theta)\sup_{\eta\in\mathcal{H}_{n}}\frac{\Lambda_{n}^{\ast}(\theta,\eta)}{\Lambda_{n}^{\star}(\theta)}d\Pi(\theta)\mathbbm{1}_{\Omega\cap\Omega^{\ast}}\rightarrow 0,\quad k=1,2,3.

Since 𝔼0Λn(θ)=1\mathbb{E}_{0}\Lambda_{n}^{\star}(\theta)=1 by the moment generating function of normal distributions, we obtain that

A(1,n)\displaystyle A(\mathcal{B}_{1,n}) 𝔼01,nΛn(θ)eK0(1+s0logp)+2δn1/2(sθ+s0)logpdΠ(θ)\displaystyle\leq\mathbb{E}_{0}\int_{{\mathcal{B}}_{1,n}}\Lambda_{n}^{\star}(\theta)e^{K_{0}^{\prime}(1+s_{0}\log p)+2\delta_{n}^{1/2}(s_{\theta}+s_{0})\log p}d\Pi(\theta)
πp(0)s>K4s0eK0(1+s0logp)+2δn1/2(s+s0)logp(A2pA4)ss0.\displaystyle\leq\pi_{p}(0)\sum_{s>K_{4}s_{0}}e^{K_{0}^{\prime}(1+s_{0}\log p)+2\delta_{n}^{1/2}(s+s_{0})\log p}\left(\frac{A_{2}}{p^{A_{4}}}\right)^{s-s_{0}}.

If s0=0s_{0}=0, the rightmost side goes to zero for any K4>0K_{4}>0. If s0>0s_{0}>0, it still goes to zero for K4K_{4} sufficiently large relative to K0K_{0}^{\prime}.

Note also that by conditions (C3) and (C4), we have that for some C1,C2>0C_{1},C_{2}>0 and any θ\theta,

logΛn(θ)=12(IH)X~(θθ0)22+(θθ0)TX~T(IH)UC1X(θθ0)22+θθ01X~T(IH)UC1X(θθ0)22+C2X(θθ0)2(sθ+s0)logp,\displaystyle\begin{split}\log\Lambda_{n}^{\star}(\theta)&=-\frac{1}{2}\lVert(I-H)\tilde{X}(\theta-\theta_{0})\rVert_{2}^{2}+(\theta-\theta_{0})^{T}\tilde{X}^{T}(I-H)U\\ &\leq-C_{1}\lVert X(\theta-\theta_{0})\rVert_{2}^{2}+\lVert\theta-\theta_{0}\rVert_{1}\lVert\tilde{X}^{T}(I-H)U\rVert_{\infty}\\ &\leq-C_{1}\lVert X(\theta-\theta_{0})\rVert_{2}^{2}+C_{2}\lVert X(\theta-\theta_{0})\rVert_{2}\sqrt{(s_{\theta}+s_{0})\log p},\end{split} (54)

on the event Ω\Omega. Hence by (36) and (54), for every θ2,n\theta\in\mathcal{B}_{2,n},

log{Λn(θ)supηnΛn(θ,η)Λn(θ)}\displaystyle\log\left\{\Lambda_{n}^{\star}(\theta)\sup_{\eta\in\mathcal{H}_{n}}\frac{\Lambda_{n}^{\ast}(\theta,\eta)}{\Lambda_{n}^{\star}(\theta)}\right\} (C2δn1/4+δn+δn5/4C1)X(θθ0)220,\displaystyle\leq(C_{2}\delta_{n}^{1/4}+\delta_{n}+\delta_{n}^{5/4}-C_{1})\lVert X(\theta-\theta_{0})\rVert_{2}^{2}\leq 0,

on the event Ω\Omega. Therefore,

A(2,n)\displaystyle A(\mathcal{B}_{2,n}) eK0(1+s0logp)2,ndΠ(θ)+o(1)\displaystyle\leq e^{K_{0}^{\prime}(1+s_{0}\log p)}\int_{{\mathcal{B}}_{2,n}}d\Pi(\theta)+o(1)
πp(0)eK0(1+s0logp)s>K4s0(A2pA4)ss0+o(1).\displaystyle\leq\pi_{p}(0)e^{K_{0}^{\prime}(1+s_{0}\log p)}\sum_{s>K_{4}s_{0}}\left(\frac{A_{2}}{p^{A_{4}}}\right)^{s-s_{0}}+o(1).

This tends to zero if K4K_{4} is sufficiently large.

If s0=0s_{0}=0, 3,n\mathcal{B}_{3,n} is the empty set as it implies θ=θ0=0\theta=\theta_{0}=0. Hence it suffices to consider the case that s0>0s_{0}>0 below. By (36) and (54) again, there exists a constant C3>0C_{3}>0 such that for every θ3,n\theta\in\mathcal{B}_{3,n},

log{Λn(θ)supηnΛn(θ,η)Λn(θ)}\displaystyle\log\left\{\Lambda_{n}^{\star}(\theta)\sup_{\eta\in\mathcal{H}_{n}}\frac{\Lambda_{n}^{\ast}(\theta,\eta)}{\Lambda_{n}^{\star}(\theta)}\right\}
C1X(θθ0)22+{C2K4+1K5+δn(1+1K5)}X(θθ0)22\displaystyle\quad\leq-C_{1}\lVert X(\theta-\theta_{0})\rVert_{2}^{2}+\left\{C_{2}\sqrt{\frac{K_{4}+1}{K_{5}}}+\delta_{n}\left(1+\frac{1}{\sqrt{K_{5}}}\right)\right\}\lVert X(\theta-\theta_{0})\rVert_{2}^{2}
C3X(θθ0)22,\displaystyle\quad\leq-C_{3}\lVert X(\theta-\theta_{0})\rVert_{2}^{2},

on the event Ω\Omega, where the last inequality holds by choosing K5K_{5} much larger than K4K_{4}. Therefore,

A(3,n)\displaystyle A(\mathcal{B}_{3,n}) eK0(1+s0logp)3,neC3X(θθ0)22dΠ(θ)\displaystyle\leq e^{K_{0}^{\prime}(1+s_{0}\log p)}\int_{{\mathcal{B}}_{3,n}}e^{-C_{3}\lVert X(\theta-\theta_{0})\rVert_{2}^{2}}d\Pi(\theta)
eK0(1+s0logp)C3K5s0logp,\displaystyle\leq e^{K_{0}^{\prime}(1+s_{0}\log p)-C_{3}K_{5}s_{0}\log p},

which tends to zero for K5K_{5} that is much larger than K0K_{0}^{\prime}, if s0>0s_{0}>0.

It only remains to show (53). Since the map ηη~n(θ,η)\eta\mapsto\tilde{\eta}_{n}(\theta,\eta) is bijective for every fixed θ\theta, for the set defined by η~n(θ,~n)={η~n(θ,η):η~n}\tilde{\eta}_{n}(\theta,\widetilde{\cal H}_{n}^{\prime})=\{\tilde{\eta}_{n}(\theta,\eta):\eta\in\widetilde{\cal H}_{n}^{\prime}\} with given θΘ~n\theta\in\widetilde{\Theta}_{n}, we see that

~npθ0,η~n(θ,η)(Y(n))dΠ(η)\displaystyle\int_{\widetilde{\cal H}_{n}^{\prime}}p_{\theta_{0},\tilde{\eta}_{n}(\theta,\eta)}(Y^{(n)})d\Pi(\eta) =η~n(θ,~n)pθ0,η(Y(n))dΠn,θ(η),\displaystyle=\int_{\tilde{\eta}_{n}(\theta,\widetilde{\cal H}_{n}^{\prime})}p_{\theta_{0},\eta}(Y^{(n)})d\Pi_{n,\theta}(\eta), (55)

by the substitution in the integral. Writing Δ0\Delta_{0}^{\ast} for the block diagonal matrix formed by stacking Δη0,i1/2\Delta_{\eta_{0},i}^{1/2}, i=1,,ni=1,\dots,n, it can be seen that

η~n(θ,~n)={η:Δ0(ξ~ηξ~0HX~(θθ0))22+dB,n2(η,η0)M~2ϵ¯n}.\displaystyle\tilde{\eta}_{n}(\theta,\widetilde{\cal H}_{n}^{\prime})=\bigg{\{}\eta\in{\cal H}:\sqrt{\lVert\Delta_{0}^{\ast}(\tilde{\xi}_{\eta}-\tilde{\xi}_{0}-H\tilde{X}(\theta-\theta_{0}))\rVert_{2}^{2}+d_{B,n}^{2}(\eta,\eta_{0})}\leq\tilde{M}_{2}^{\prime}\bar{\epsilon}_{n}\bigg{\}}.

Hence, we see that M~2\tilde{M}_{2} can be chosen sufficiently large relative to M~2\tilde{M}_{2}^{\prime} such that η~n(θ,~n)~n\tilde{\eta}_{n}(\theta,\widetilde{\cal H}_{n}^{\prime})\subset\widetilde{\cal H}_{n} for every θΘ~n\theta\in\widetilde{\Theta}_{n}, since we have ndA,n(η,η0)ξ~ηξ~η0HX~(θθ0)2+X(θθ0)2\sqrt{n}d_{A,n}(\eta,\eta_{0})\lesssim\lVert\tilde{\xi}_{\eta}-\tilde{\xi}_{\eta_{0}}-H\tilde{X}(\theta-\theta_{0})\rVert_{2}+\lVert X(\theta-\theta_{0})\rVert_{2}. Therefore, (55) is bounded by

~npθ0,η(Y(n))exp(|logdΠn,θ(η)dΠ(η)|)dΠ(η)~npθ0,η(Y(n))dΠ(η),\displaystyle\int_{\widetilde{\cal H}_{n}}p_{\theta_{0},\eta}(Y^{(n)})\exp\left(\left|\log\frac{d\Pi_{n,\theta}(\eta)}{d\Pi(\eta)}\right|\right)d\Pi(\eta)\lesssim\int_{\widetilde{\cal H}_{n}}p_{\theta_{0},\eta}(Y^{(n)})d\Pi(\eta),

by (C5), since dΠ(η)=dΠn,θ0(η)d\Pi(\eta)=d\Pi_{n,\theta_{0}}(\eta). This verifies (53) and thus the proof is complete. ∎

A.4 Proof of Theorems 5–6

To prove the shape approximation in Theorem 5 and the selection results in Theorem 6, we first obtain two lemmas. The first, a stronger version of Lemma 3, shows that the remainder of the approximation goes to zero in 0\mathbb{P}_{0}-probability. The second implies that with a point mass prior for θ\theta at θ0\theta_{0}, we obtain a contraction rate no worse than that in Theorem 3.

Lemma 7.

Suppose that (C1), (C4), (C8), and (C10) are satisfied for some orthogonal projection HH. Then, for Λn(θ,η)=(pθ,η/pθ0,η~n(θ,η))(Y(n))\Lambda_{n}^{\ast}(\theta,\eta)=(p_{\theta,\eta}/p_{\theta_{0},{\tilde{\eta}_{n}(\theta,\eta)}})(Y^{(n)}) and Λn(θ)\Lambda_{n}^{\star}(\theta) in (14) with the corresponding HH, we have that

𝔼0supθΘ^nsupη^n|logΛn(θ,η)logΛn(θ)|0.\displaystyle{\mathbb{E}}_{0}\sup_{\theta\in\widehat{\Theta}_{n}}\sup_{\eta\in\widehat{\cal H}_{n}}\left\lvert\log\Lambda_{n}^{\ast}(\theta,\eta)-\log\Lambda_{n}^{\star}(\theta)\right\rvert\rightarrow 0.
Proof.

Similar to the proof of Lemma 3, it suffices to show the following three assertions:

supθΘ^nsupη^n|(θθ0)TX~T(IH)(IΔη)(IH)X~(θθ0)|\displaystyle\sup_{\theta\in\widehat{\Theta}_{n}}\sup_{\eta\in\widehat{\cal H}_{n}}\big{\lvert}(\theta-\theta_{0})^{T}\tilde{X}^{T}(I-H)(I-\Delta_{\eta}^{\star})(I-H)\tilde{X}(\theta-\theta_{0})\big{\rvert} 0,\displaystyle\rightarrow 0, (56)
supθΘ^nsupη^n|(θθ0)TX~T(IH)Δη(ξ~ηξ~η0+HX~(θθ0))|\displaystyle\sup_{\theta\in\widehat{\Theta}_{n}}\sup_{\eta\in\widehat{\cal H}_{n}}\big{\lvert}(\theta-\theta_{0})^{T}\tilde{X}^{T}(I-H)\Delta_{\eta}^{\star}(\tilde{\xi}_{\eta}-\tilde{\xi}_{\eta_{0}}+H\tilde{X}(\theta-\theta_{0}))\big{\rvert} 0,\displaystyle\rightarrow 0, (57)
𝔼0supθΘ^nsupη^n|(θθ0)TX~T(IH)(IΔη)U|\displaystyle{\mathbb{E}}_{0}\sup_{\theta\in\widehat{\Theta}_{n}}\sup_{\eta\in\widehat{\cal H}_{n}}\big{\lvert}(\theta-\theta_{0})^{T}\tilde{X}^{T}(I-H)(I-\Delta_{\eta}^{\star})U\big{\rvert} 0.\displaystyle\rightarrow 0. (58)

First, note that the left side of (56) is bounded above by a constant multiple of

supθΘ^nsupη^nIΔηspX~(θθ0)22supθΘ^nX(θθ0)22supη^nmax1inΔη,iΔη0,iF,\displaystyle\begin{split}&\sup_{\theta\in\widehat{\Theta}_{n}}\sup_{\eta\in\widehat{\cal H}_{n}}\lVert I-\Delta_{\eta}^{\star}\rVert_{\rm sp}\lVert\tilde{X}(\theta-\theta_{0})\rVert_{2}^{2}\\ &\quad\lesssim\sup_{\theta\in\widehat{\Theta}_{n}}\lVert X(\theta-\theta_{0})\rVert_{2}^{2}\sup_{\eta\in\widehat{\cal H}_{n}}\max_{1\leq i\leq n}\lVert\Delta_{\eta,i}-\Delta_{\eta_{0},i}\rVert_{\rm F},\end{split} (59)

where the inequality holds by (42) and the fact that maxiΔη,iΔη0,iF2en+andB,n2(η,η0)en+an(slogp)/n=o(1)\max_{i}\lVert\Delta_{\eta,i}-\Delta_{\eta_{0},i}\rVert_{\rm F}^{2}\leq e_{n}+a_{n}d_{B,n}^{2}(\eta,\eta_{0})\lesssim e_{n}+a_{n}(s_{\star}\log p)/n=o(1) on ^n\widehat{\cal H}_{n}. We see that (59) is bounded above by a constant multiple of

supθΘ^nXθθ012supη^nen+andB,n2(η,η0)s2logpen+anslogpn,\displaystyle\sup_{\theta\in\widehat{\Theta}_{n}}\lVert X\rVert_{\ast}\lVert\theta-\theta_{0}\rVert_{1}^{2}\sup_{\eta\in\widehat{\cal H}_{n}}\sqrt{e_{n}+a_{n}d_{B,n}^{2}(\eta,\eta_{0})}\lesssim s_{\star}^{2}\log p\sqrt{e_{n}+\frac{a_{n}s_{\star}\log p}{n}},

which goes to zero by (C10).

Next, similar to (43), the left side of (57) is bounded by

supθΘ^nX(θθ0)2supη^n(IH)(ξ~ηξ~η0)2\displaystyle\sup_{\theta\in\widehat{\Theta}_{n}}\lVert X(\theta-\theta_{0})\rVert_{2}\sup_{\eta\in\widehat{\cal H}_{n}}\lVert(I-H)(\tilde{\xi}_{\eta}-\tilde{\xi}_{\eta_{0}})\rVert_{2}
+supθΘ^nsupη^n{(X(θθ0)22+X(θθ0)2ndA,n(η,η0))\displaystyle+\sup_{\theta\in\widehat{\Theta}_{n}}\sup_{\eta\in\widehat{\cal H}_{n}}\Big{\{}\Big{(}\lVert X(\theta-\theta_{0})\rVert_{2}^{2}+\lVert X(\theta-\theta_{0})\rVert_{2}\sqrt{n}d_{A,n}(\eta,\eta_{0})\Big{)}
×max1inΔη,i1Δη0,i1sp}.\displaystyle\qquad\qquad\qquad\times\max_{1\leq i\leq n}\lVert\Delta_{\eta,i}^{-1}-\Delta_{\eta_{0},i}^{-1}\rVert_{\rm sp}\Big{\}}.

Using the same approach used in (42), the display is further bounded above by a constant multiple of

slogpsupη^n(IH)(ξ~ηξ~η0)2+s2logpen+anslogpn,\displaystyle s_{\star}\sqrt{\log p}\sup_{\eta\in\widehat{\cal H}_{n}}\lVert(I-H)(\tilde{\xi}_{\eta}-\tilde{\xi}_{\eta_{0}})\rVert_{2}+s_{\star}^{2}\log p\sqrt{e_{n}+\frac{a_{n}s_{\star}\log p}{n}},

which goes to zero by (C8) and (C10).

Now, using Lemma 4, note that (58) is bounded above by

supθΘ^nθθ01𝔼0supη^nX~T(IH)(IΔη)Uslogp{en+anslogpn+an0C1(slogp)/nlogN(δ,^n,dB,n)dδ},\displaystyle\begin{split}&\sup_{\theta\in\widehat{\Theta}_{n}}\left\lVert\theta-\theta_{0}\right\rVert_{1}{\mathbb{E}}_{0}\sup_{\eta\in\widehat{\cal H}_{n}}\lVert\tilde{X}^{T}(I-H)(I-\Delta_{\eta}^{\star})U\rVert_{\infty}\\ &\quad\lesssim s_{\star}\log p\Bigg{\{}\sqrt{e_{n}+\frac{a_{n}s_{\star}\log p}{n}}\\ &\qquad\qquad\qquad+\sqrt{a_{n}}\int_{0}^{C_{1}\sqrt{(s_{\star}\log p)/n}}\sqrt{\log N(\delta,\widehat{\mathcal{H}}_{n},d_{B,n})}d\delta\Bigg{\}},\end{split}

for some C1>0C_{1}>0. This tends to zero by (C10).

Lemma 8.

Suppose that (C1)–(C4), (C5), and (C6) are satisfied. Then there exists a constant K6>0K_{6}>0 such that

𝔼0Πθ0(dn(η,η0)>K6ϵ¯n|Y(n))0,\displaystyle\mathbb{E}_{0}\Pi^{\theta_{0}}\left(d_{n}(\eta,\eta_{0})>K_{6}\bar{\epsilon}_{n}\,\big{|}\,Y^{(n)}\right)\rightarrow 0,

where Πθ0(|Y(n))\Pi^{\theta_{0}}(\cdot\,|\,Y^{(n)}) is the posterior distribution induced by the point mass prior for θ\theta at θ0\theta_{0}, i.e., δθ0(θ)\delta_{\theta_{0}}(\theta), in place of the prior in (4).

Proof.

Since the prior for θ\theta is the point mass at θ0\theta_{0}, we can reduce to a low-dimensional model YiYiXiθ0=ξη,i+εiY_{i}^{\ast}\coloneqq Y_{i}-X_{i}\theta_{0}=\xi_{\eta,i}+\varepsilon_{i}, i=1,,ni=1,\dots,n. Then the lemma can be easily verified using the main results on posterior contraction in Section 3. The denominator of the posterior distribution with the Dirac prior at θ0\theta_{0} is bounded as in Lemma 1, which can be shown using (20) for the prior concentration condition (C2) and the expressions for the Kullback-Leibler divergence K(p0,i,pθ0,η,i)K(p_{0,i},p_{\theta_{0},\eta,i}) and variation V(p0,i,pθ0,η,i)V(p_{0,i},p_{\theta_{0},\eta,i}) with the true value θ0\theta_{0}. For a local test relative to the average Rényi divergence, Lemma 2, applied with 1,n{\cal F}_{1,n} modified so that it involves only a given η1\eta_{1} such that Rn(p0,pθ0,η1)ϵ¯n2R_n(p_{0},p_{\theta_{0},\eta_{1}})\geq\bar{\epsilon}_{n}^{2}, implies that each small piece of the alternative can be tested with exponentially small errors. Hence, by (C5), we obtain the contraction rate ϵ¯n2\bar{\epsilon}_{n}^{2} relative to Rn(p0,pθ0,η)R_{n}(p_{0},p_{\theta_{0},\eta}) for Πθ0(|Y(n))\Pi^{\theta_{0}}(\cdot\,|\,Y^{(n)}), as in the proof of Theorem 2. The lemma is then obtained by recovering the contraction rate of η\eta with respect to dnd_{n} using the approach in the proof of Theorem 3. ∎

Proof of Theorem 5.

Our proof is based on the proof of Theorem 6 in Castillo et al., [8], but is more involved due to the presence of η\eta. We use the fact that for any probability measure QQ and its renormalized restriction Q𝒜()=Q(𝒜)/Q(𝒜)Q_{\cal A}(\cdot)=Q(\cdot\cap{\cal A})/Q({\cal A}) to a set 𝒜{\cal A}, we have QQ𝒜TV2Q(𝒜c)\lVert Q-Q_{\cal A}\rVert_{\rm TV}\leq 2Q({\cal A}^{c}). First, using a sufficiently large constant M^2\hat{M}_{2}^{\prime} that is smaller than M^2\hat{M}_{2}, define ^n\widehat{\cal H}_{n}^{\prime} analogously to ^n\widehat{\cal H}_{n} in (12), so that ^n^n\widehat{\cal H}_{n}^{\prime}\subset\widehat{\cal H}_{n}. Let Π~((θ,η))\widetilde{\Pi}((\theta,\eta)\in\cdot) be the prior distribution restricted and renormalized on Θ^n×^n\widehat{\Theta}_{n}\times\widehat{\cal H}_{n}^{\prime} and Π~((θ,η)|Y(n))\widetilde{\Pi}((\theta,\eta)\in\cdot\,|\,Y^{(n)}) be the corresponding posterior distribution. Also, Π~(θ|Y(n))\widetilde{\Pi}^{\infty}(\theta\in\cdot\,|\,Y^{(n)}) is the restricted and renormalized version of Π(θ|Y(n))\Pi^{\infty}(\theta\in\cdot\,|\,Y^{(n)}) to the set Θ^n\widehat{\Theta}_{n}. Then the left-hand side of the theorem is bounded above by

Π(θ|Y(n))Π~(θ|Y(n))TV+Π~(θ|Y(n))Π~(θ|Y(n))TV+Π(θ|Y(n))Π~(θ|Y(n))TV,\displaystyle\begin{split}&\left\lVert\Pi(\theta\in\cdot\,|\,Y^{(n)})-\widetilde{\Pi}(\theta\in\cdot\,|\,Y^{(n)})\right\rVert_{\rm TV}+\left\lVert\widetilde{\Pi}(\theta\in\cdot\,|\,Y^{(n)})-\widetilde{\Pi}^{\infty}(\theta\in\cdot\,|\,Y^{(n)})\right\rVert_{\rm TV}\\ &+\left\lVert\Pi^{\infty}(\theta\in\cdot\,|\,Y^{(n)})-\widetilde{\Pi}^{\infty}(\theta\in\cdot\,|\,Y^{(n)})\right\rVert_{\rm TV},\end{split} (60)

where the first summand goes to zero in 0{\mathbb{P}}_{0}-probability since Π((θ,η)Θ^n×^n|Y(n))1\Pi((\theta,\eta)\in\widehat{\Theta}_{n}\times\widehat{\cal H}_{n}^{\prime}\,|\,Y^{(n)})\rightarrow 1 in 0{\mathbb{P}}_{0}-probability by Theorem 1 and Theorem 3.

To show that the second summand goes to zero in 0{\mathbb{P}}_{0}-probability, note that for every measurable p{\cal B}\subset\mathbb{R}^{p}, we obtain

Π~(θ|Y(n))\displaystyle\widetilde{\Pi}(\theta\in{\cal B}\,|\,Y^{(n)}) Θ^n^npθ,η(Y(n))eλθ1dΠ(η)dV(θ)\displaystyle\propto\int_{{\cal B}\cap\widehat{\Theta}_{n}}\int_{\widehat{\cal H}_{n}^{\prime}}{p_{\theta,\eta}}(Y^{(n)})\>e^{-\lambda\lVert\theta\rVert_{1}}d\Pi(\eta)dV(\theta)
=Θ^n^nΛn(θ,η)eλθ1pθ0,η~n(θ,η)(Y(n))dΠ(η)dV(θ),\displaystyle=\int_{{\cal B}\cap\widehat{\Theta}_{n}}\int_{\widehat{\cal H}_{n}^{\prime}}\Lambda_{n}^{\ast}(\theta,\eta)\>e^{-\lambda\lVert\theta\rVert_{1}}p_{\theta_{0},\tilde{\eta}_{n}(\theta,\eta)}(Y^{(n)})\,d\Pi(\eta)dV(\theta),
Π~(θ|Y(n))\displaystyle\widetilde{\Pi}^{\infty}(\theta\in{\cal B}\,|\,Y^{(n)}) Θ^nΛn(θ)dV(θ)\displaystyle\propto\int_{{\cal B}\cap\widehat{\Theta}_{n}}\Lambda_{n}^{\star}(\theta)dV(\theta)
Θ^nΛn(θ)eλθ01pθ0,η(Y(n))dΠ(η)dV(θ),\displaystyle\propto\int_{{\cal B}\cap\widehat{\Theta}_{n}}\Lambda_{n}^{\star}(\theta)\>e^{-\lambda\lVert\theta_{0}\rVert_{1}}\int_{{\cal H}}p_{\theta_{0},\eta}(Y^{(n)})d\Pi(\eta)dV(\theta),

where dV(θ)=S:sK1sπp(s)(ps)1(λ/2)sd{(θS)δ0(θSc)}dV(\theta)=\sum_{S:s\leq K_{1}s_{\star}}{\pi_{p}(s)}{\binom{p}{s}}^{-1}(\lambda/2)^{s}d\{\mathcal{L}(\theta_{S})\otimes\delta_{0}(\theta_{S^{c}})\}. In the last line, the factor eλθ01pθ0,η(Y(n))dΠ(η)e^{-\lambda\lVert\theta_{0}\rVert_{1}}\int_{{\cal H}}p_{\theta_{0},\eta}(Y^{(n)})d\Pi(\eta) cancels out in the normalizing constant, but is inserted for the sake of comparison. For any sequences of measures {μS}\{\mu_{S}\} and {νS}\{\nu_{S}\}, if νS\nu_{S} is absolutely continuous with respect to μS\mu_{S} with the Radon-Nikodym derivative dνS/dμSd{\nu_{S}}/d{\mu_{S}}, then it can be easily verified that

SμSSμSTVSνSSνSTVTV\displaystyle\left\lVert\frac{\sum_{S}\mu_{S}}{\lVert\sum_{S}\mu_{S}\rVert_{\rm TV}}-\frac{\sum_{S}\nu_{S}}{\lVert\sum_{S}\nu_{S}\rVert_{\rm TV}}\right\rVert_{\rm TV} 2SμSνSTVSμSTV2supS1dνSdμS.\displaystyle\leq\frac{2\sum_{S}\lVert\mu_{S}-\nu_{S}\rVert_{\rm TV}}{\lVert\sum_{S}\mu_{S}\rVert_{\rm TV}}\leq 2\sup_{S}\left\lVert 1-\frac{d{\nu_{S}}}{d{\mu_{S}}}\right\rVert_{\infty}.
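The mixture bound in the display above is elementary but easy to get wrong. The following self-contained numerical sketch illustrates both inequalities with discrete measures on a finite grid, under the convention that the total variation norm of a signed measure is half the ℓ1 norm of its mass vector; the grid size, number of components, and densities are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete illustration of the chain
#   || sum_S mu_S / |sum_S mu_S|_TV - sum_S nu_S / |sum_S nu_S|_TV ||_TV
#     <= 2 sum_S ||mu_S - nu_S||_TV / |sum_S mu_S|_TV
#     <= 2 sup_S || 1 - d nu_S / d mu_S ||_inf,
# with measures represented by nonnegative mass vectors on a finite grid.
n_points, n_measures = 50, 4
mu = rng.uniform(0.1, 1.0, size=(n_measures, n_points))    # positive measures mu_S
dens = rng.uniform(0.5, 1.5, size=(n_measures, n_points))  # Radon-Nikodym derivatives
nu = mu * dens                                             # nu_S << mu_S by construction

def tv(m):
    """TV norm of a (signed) measure given by its mass vector."""
    return 0.5 * np.abs(m).sum()

mix_mu, mix_nu = mu.sum(axis=0), nu.sum(axis=0)
lhs = tv(mix_mu / mix_mu.sum() - mix_nu / mix_nu.sum())
mid = 2 * sum(tv(mu[s] - nu[s]) for s in range(n_measures)) / tv(mix_mu)
rhs = 2 * np.abs(1.0 - dens).max()
assert lhs <= mid + 1e-12 and mid <= rhs + 1e-12
```

The same chain holds under the full-ℓ1 convention for the TV norm, as long as the numerator and denominator use the same convention.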

Hence, for Cn=pθ0,η(Y(n))dΠ(η)C_{n}=\int_{{\cal H}}p_{\theta_{0},\eta}(Y^{(n)})d\Pi(\eta), we see that the second summand of (60) is bounded by

2supθΘ^n|11Cn^nΛn(θ,η)eλθ1Λn(θ)eλθ01pθ0,η~n(θ,η)(Y(n))dΠ(η)|.\displaystyle 2\sup_{\theta\in\widehat{\Theta}_{n}}\left\lvert 1-\frac{1}{C_{n}}\int_{\widehat{\cal H}_{n}^{\prime}}\frac{\Lambda_{n}^{\ast}(\theta,\eta)e^{-\lambda\lVert\theta\rVert_{1}}}{\Lambda_{n}^{\star}(\theta)e^{-\lambda\lVert\theta_{0}\rVert_{1}}}p_{\theta_{0},\tilde{\eta}_{n}(\theta,\eta)}(Y^{(n)})d\Pi(\eta)\right\rvert.

Using the fact that |λ(θ1θ01)|λθθ01λslogp/X0|\lambda(\lVert\theta\rVert_{1}-\lVert\theta_{0}\rVert_{1})|\leq\lambda\lVert\theta-\theta_{0}\rVert_{1}\lesssim\lambda s_{\star}\sqrt{\log p}/\lVert X\rVert_{\ast}\rightarrow 0 on Θ^n\widehat{\Theta}_{n} and that sup{|1Λn(θ,η)/Λn(θ)|:θΘ^n,η^n}\sup\{|1-\Lambda_{n}^{\ast}(\theta,\eta)/\Lambda_{n}^{\star}(\theta)|:\theta\in\widehat{\Theta}_{n},\eta\in\widehat{\cal H}_{n}^{\prime}\} goes to zero in 0{\mathbb{P}}_{0}-probability by Lemma 7, the last display is further bounded by

2supθΘ^n|1{1+o(1)+o0(1)}1Cn^npθ0,η~n(θ,η)(Y(n))dΠ(η)|.\displaystyle 2\sup_{\theta\in\widehat{\Theta}_{n}}\left\lvert 1-\left\{1+o(1)+o_{{\mathbb{P}}_{0}}(1)\right\}\frac{1}{C_{n}}\int_{\widehat{\cal H}_{n}^{\prime}}p_{\theta_{0},\tilde{\eta}_{n}(\theta,\eta)}(Y^{(n)})d\Pi(\eta)\right\rvert. (61)

Now, note that the map ηη~n(θ,η)\eta\mapsto\tilde{\eta}_{n}(\theta,\eta) is bijective for every fixed θΘ^n\theta\in\widehat{\Theta}_{n}. Thus for the set defined by η~n(θ,^n)={η~n(θ,η):η^n}\tilde{\eta}_{n}(\theta,\widehat{\cal H}_{n}^{\prime})=\{\tilde{\eta}_{n}(\theta,\eta):\eta\in\widehat{\cal H}_{n}^{\prime}\} with given θΘ^n\theta\in\widehat{\Theta}_{n}, we see that

^npθ0,η~n(θ,η)(Y(n))dΠ(η)\displaystyle\int_{\widehat{\cal H}_{n}^{\prime}}p_{\theta_{0},\tilde{\eta}_{n}(\theta,\eta)}(Y^{(n)})d\Pi(\eta) =η~n(θ,^n)pθ0,η(Y(n))dΠn,θ(η),\displaystyle=\int_{\tilde{\eta}_{n}(\theta,\widehat{\cal H}_{n}^{\prime})}p_{\theta_{0},\eta}(Y^{(n)})d\Pi_{n,\theta}(\eta), (62)

by the substitution in the integral. Similar to the proof of Theorem 4, observe that

η~n(θ,^n)={η:\displaystyle\tilde{\eta}_{n}(\theta,\widehat{\cal H}_{n}^{\prime})=\Big{\{}\eta\in{\cal H}: Δ0(ξ~ηξ~0HX~(θθ0))2M^2s(logp)/n,\displaystyle\,\lVert\Delta_{0}^{\ast}(\tilde{\xi}_{\eta}-\tilde{\xi}_{0}-H\tilde{X}(\theta-\theta_{0}))\rVert_{2}\leq\hat{M}_{2}^{\prime}s_{\star}\sqrt{(\log p)/n},
dB,n(η,η0)M^2(slogp)/n}.\displaystyle~{}d_{B,n}(\eta,\eta_{0})\leq\hat{M}_{2}^{\prime}\sqrt{(s_{\star}\log p)/n}\Big{\}}.

Hence, we see that M^2\hat{M}_{2} can be chosen sufficiently large such that η~n(θ,^n)^n\tilde{\eta}_{n}(\theta,\widehat{\cal H}_{n}^{\prime})\subset\widehat{\cal H}_{n} for every θΘ^n\theta\in\widehat{\Theta}_{n} as we have ndA,n(η,η0)ξ~ηξ~η0HX~(θθ0)2+Xθθ01\sqrt{n}d_{A,n}(\eta,\eta_{0})\lesssim\lVert\tilde{\xi}_{\eta}-\tilde{\xi}_{\eta_{0}}-H\tilde{X}(\theta-\theta_{0})\rVert_{2}+\lVert X\rVert_{\ast}\lVert\theta-\theta_{0}\rVert_{1}. Therefore, since dΠ(η)=dΠn,θ0(η)d\Pi(\eta)=d\Pi_{n,\theta_{0}}(\eta), one can see that (62) is written as

{1+o(1)}η~n(θ,^n)pθ0,η(Y(n))dΠ(η),\displaystyle\{1+o(1)\}\int_{\tilde{\eta}_{n}(\theta,\widehat{\cal H}_{n}^{\prime})}p_{\theta_{0},\eta}(Y^{(n)})d\Pi(\eta),

by (C9), and hence (61) is equal to

2supθΘ^n|1{1+o0(1)}η~n(θ,^n)pθ0,η(Y(n))dΠ(η)pθ0,η(Y(n))dΠ(η)|.\displaystyle 2\sup_{\theta\in\widehat{\Theta}_{n}}\left\lvert 1-\left\{1+o_{{\mathbb{P}}_{0}}(1)\right\}\frac{\int_{\tilde{\eta}_{n}(\theta,\widehat{\cal H}_{n}^{\prime})}p_{\theta_{0},\eta}(Y^{(n)})d\Pi(\eta)}{\int_{{\cal H}}p_{\theta_{0},\eta}(Y^{(n)})d\Pi(\eta)}\right\rvert. (63)

Now, observe that we also have the inequality in the other direction: ξ~ηξ~η0HX~(θθ0)2ndA,n(η,η0)+Xθθ01\lVert\tilde{\xi}_{\eta}-\tilde{\xi}_{\eta_{0}}-H\tilde{X}(\theta-\theta_{0})\rVert_{2}\lesssim\sqrt{n}d_{A,n}(\eta,\eta_{0})+\lVert X\rVert_{\ast}\lVert\theta-\theta_{0}\rVert_{1}. This means that M^2\hat{M}_{2}^{\prime} can be chosen sufficiently large such that {η:dn(η,η0)K6ϵ¯n}η~n(θ,^n)\{\eta\in{\cal H}:d_{n}(\eta,\eta_{0})\leq K_{6}\bar{\epsilon}_{n}\}\subset\tilde{\eta}_{n}(\theta,\widehat{\cal H}_{n}^{\prime}) for every θΘ^n\theta\in\widehat{\Theta}_{n}. Hence, with appropriately chosen constants, we obtain

infθΘ^nη~n(θ,^n)pθ0,η(Y(n))dΠ(η)pθ0,η(Y(n))dΠ(η)\displaystyle\inf_{\theta\in\widehat{\Theta}_{n}}\frac{\int_{\tilde{\eta}_{n}(\theta,\widehat{\cal H}_{n}^{\prime})}p_{\theta_{0},\eta}(Y^{(n)})d\Pi(\eta)}{\int_{{\cal H}}p_{\theta_{0},\eta}(Y^{(n)})d\Pi(\eta)} =infθΘ^nΠθ0(ηη~n(θ,^n)|Y(n))\displaystyle=\inf_{\theta\in\widehat{\Theta}_{n}}\Pi^{\theta_{0}}\left(\eta\in\tilde{\eta}_{n}(\theta,\widehat{\cal H}_{n}^{\prime})\,|\,Y^{(n)}\right)
Πθ0(dn(η,η0)K6ϵ¯n|Y(n)).\displaystyle\geq\Pi^{\theta_{0}}\left(d_{n}(\eta,\eta_{0})\leq K_{6}\bar{\epsilon}_{n}\,\big{|}\,Y^{(n)}\right).

The rightmost term goes to one with probability tending to one by Lemma 8. This implies that (63) goes to zero in 0\mathbb{P}_{0}-probability, completing the proof for the second part of (60).

Next, we show that Π(θΘ^n|Y(n))\Pi^{\infty}(\theta\in\widehat{\Theta}_{n}\,|\,Y^{(n)}) goes to one in 0{\mathbb{P}}_{0}-probability to verify that the last summand in (60) goes to zero in 0{\mathbb{P}}_{0}-probability. Observe that Π(θΘ^nc|Y(n))\Pi^{\infty}(\theta\in\widehat{\Theta}_{n}^{c}\,|\,Y^{(n)}) is equal to

Θ^ncexp{12(IH)X~(θθ0)22+UT(IH)X~(θθ0)}dV(θ)pexp{12(IH)X~(θθ0)22+UT(IH)X~(θθ0)}dV(θ).\displaystyle\frac{\int_{\widehat{\Theta}_{n}^{c}}\exp\left\{{-\frac{1}{2}\lVert(I-H)\tilde{X}(\theta-\theta_{0})\rVert_{2}^{2}+U^{T}(I-H)\tilde{X}(\theta-\theta_{0})}\right\}dV(\theta)}{\int_{\mathbb{R}^{p}}\exp\left\{{-\frac{1}{2}\lVert(I-H)\tilde{X}(\theta-\theta_{0})\rVert_{2}^{2}+U^{T}(I-H)\tilde{X}(\theta-\theta_{0})}\right\}dV(\theta)}. (64)

Clearly, the denominator is bounded below by

πp(s0)(ps0)(λ2)s0s0exp{12(IH)X~S0(θS0θ0,S0)22+UT(IH)X~S0(θS0θ0,S0)}dθS0.\displaystyle\begin{split}\frac{\pi_{p}(s_{0})}{\binom{p}{s_{0}}}\left(\frac{\lambda}{2}\right)^{s_{0}}\int_{\mathbb{R}^{s_{0}}}\exp\bigg{\{}&-\frac{1}{2}\lVert(I-H)\tilde{X}_{S_{0}}(\theta_{S_{0}}-\theta_{0,{S_{0}}})\rVert_{2}^{2}\\ &+U^{T}(I-H)\tilde{X}_{S_{0}}(\theta_{S_{0}}-\theta_{0,{S_{0}}})\bigg{\}}d\theta_{S_{0}}.\end{split} (65)

Since the measure QQ defined by Q(dθS0)=exp{(1/2)(IH)X~S0(θS0θ0,S0)22}Q(d\theta_{S_{0}})=\exp\{-(1/2)\lVert(I-H)\tilde{X}_{S_{0}}(\theta_{S_{0}}-\theta_{0,S_{0}})\rVert_{2}^{2}\} is symmetric about θ0,S0\theta_{0,S_{0}}, the mean of (θS0θ0,S0)(\theta_{S_{0}}-\theta_{0,S_{0}}) with respect to the normalized probability measure Q~=Q/Q(s0)\widetilde{Q}=Q/Q(\mathbb{R}^{s_{0}}) is zero. Note also that ΓS=X~ST(IH)X~S\Gamma_{S}=\tilde{X}_{S}^{T}(I-H)\tilde{X}_{S} is nonsingular for every SS such that sK1ss\leq K_{1}s_{\star} by (C8). Thus, by Jensen’s inequality, (65) is bounded below by

πp(s0)(ps0)(λ2)s0s0exp{12(IH)X~S0(θS0θ0,S0)22}dθS0\displaystyle\frac{\pi_{p}(s_{0})}{\binom{p}{s_{0}}}\left(\frac{\lambda}{2}\right)^{s_{0}}\int_{\mathbb{R}^{s_{0}}}\exp\left\{{-\frac{1}{2}\lVert(I-H)\tilde{X}_{S_{0}}(\theta_{S_{0}}-\theta_{0,{S_{0}}})\rVert_{2}^{2}}\right\}d\theta_{S_{0}}
=πp(s0)(ps0)(λ2)s0(2π)s0/2det(ΓS0)1/2.\displaystyle\quad=\frac{\pi_{p}(s_{0})}{\binom{p}{s_{0}}}\left(\frac{\lambda}{2}\right)^{s_{0}}\frac{(2\pi)^{s_{0}/2}}{\det(\Gamma_{S_{0}})^{1/2}}.

Applying the arithmetic-geometric mean inequality to the eigenvalues, we obtain det(ΓS0)(tr(ΓS0)/s0)s0(IH)X~S02s0ρ¯0s0X2s0\det(\Gamma_{S_{0}})\leq({\rm tr}(\Gamma_{S_{0}})/s_{0})^{s_{0}}\leq\lVert(I-H)\tilde{X}_{S_{0}}\rVert_{\ast}^{2s_{0}}\leq\underline{\rho}_{0}^{-s_{0}}\lVert X\rVert_{\ast}^{2s_{0}}, and hence det(ΓS0)1/2/λs0ρ¯0s0/2(L1pL2)s0\det(\Gamma_{S_{0}})^{1/2}/\lambda^{s_{0}}\leq\underline{\rho}_{0}^{-s_{0}/2}(L_{1}p^{L_{2}})^{s_{0}} by (4). Furthermore, we have πp(s0)A1s0pA3s0\pi_{p}(s_{0})\gtrsim A_{1}^{s_{0}}p^{-A_{3}s_{0}} by (3) and (ps0)ps0\binom{p}{s_{0}}\leq p^{s_{0}}. Hence, the preceding display is further bounded below by a constant multiple of

p(1+L2+A3)s0(A1ρ¯0πL12)s0.\displaystyle p^{-(1+L_{2}+A_{3})s_{0}}\left(\frac{A_{1}\sqrt{\underline{\rho}_{0}\pi}}{L_{1}\sqrt{2}}\right)^{s_{0}}. (66)
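The determinant bound det(ΓS0) ≤ (tr(ΓS0)/s0)^{s0} used above is simply the arithmetic-geometric mean inequality applied to the eigenvalues. A minimal numerical sketch, with an arbitrary positive definite matrix standing in for ΓS0:

```python
import numpy as np

rng = np.random.default_rng(1)
s0 = 6
A = rng.standard_normal((s0, s0))
Gamma = A @ A.T + np.eye(s0)              # an arbitrary s0 x s0 positive definite matrix

eig = np.linalg.eigvalsh(Gamma)
# AM-GM on the eigenvalues:
#   det(Gamma) = prod(eig) <= (mean(eig))^s0 = (tr(Gamma)/s0)^s0
det_bound = (np.trace(Gamma) / s0) ** s0
assert np.linalg.det(Gamma) <= det_bound * (1 + 1e-10)
```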

To bound the numerator of (64), let Dn=2ρ¯01/2logpXD_{n}=2\underline{\rho}_{0}^{-1/2}\sqrt{\log p}\lVert X\rVert_{\ast} and 𝒰n={X~T(IH)UDn}{\cal U}_{n}=\{\lVert\tilde{X}^{T}(I-H)U\rVert_{\infty}\leq D_{n}\}. Then it suffices to show that (64) goes to zero in 0{\mathbb{P}}_{0}-probability on the set 𝒰n{\cal U}_{n}, since 0(𝒰nc)0\mathbb{P}_{0}({\cal U}_{n}^{c})\rightarrow 0 by Lemma 5. Note that on the set 𝒰n{\cal U}_{n} we have

UT(IH)X~(θθ0)\displaystyle U^{T}(I-H)\tilde{X}(\theta-\theta_{0}) Dnθθ01\displaystyle\leq D_{n}\lVert\theta-\theta_{0}\rVert_{1}
Dn2ρ¯0X~(θθ0)2|Sθθ0|1/2Xϕ1(|Sθθ0|)Dnθθ01.\displaystyle\leq D_{n}\frac{2\sqrt{\overline{\rho}_{0}}\lVert\tilde{X}(\theta-\theta_{0})\rVert_{2}|S_{\theta-\theta_{0}}|^{1/2}}{\lVert X\rVert_{\ast}\phi_{1}(|S_{\theta-\theta_{0}}|)}-D_{n}\lVert\theta-\theta_{0}\rVert_{1}.

Using that u2(IH)u2\lVert u\rVert_{2}\lesssim\lVert(I-H)u\rVert_{2} for every uspan(X~S)u\in{\rm span}(\tilde{X}_{S}) with sK1ss\leq K_{1}s_{\star} by (C8), the preceding display is, for some constant C1>0C_{1}>0, further bounded above by

Dn2ρ¯0C1(IH)X~(θθ0)2|Sθθ0|1/2Xϕ1(|Sθθ0|)Dnθθ01\displaystyle D_{n}\frac{2\sqrt{\overline{\rho}_{0}}C_{1}\lVert(I-H)\tilde{X}(\theta-\theta_{0})\rVert_{2}|S_{\theta-\theta_{0}}|^{1/2}}{\lVert X\rVert_{\ast}\phi_{1}(|S_{\theta-\theta_{0}}|)}-D_{n}\lVert\theta-\theta_{0}\rVert_{1}
12(IH)X~(θθ0)22+2ρ¯0C12Dn2|Sθθ0|X2ϕ1(|Sθθ0|)2Dnθθ01,\displaystyle\quad\leq\frac{1}{2}\lVert(I-H)\tilde{X}(\theta-\theta_{0})\rVert_{2}^{2}+\frac{2\overline{\rho}_{0}C_{1}^{2}D_{n}^{2}|S_{\theta-\theta_{0}}|}{\lVert X\rVert_{\ast}^{2}\phi_{1}(|S_{\theta-\theta_{0}}|)^{2}}-D_{n}\lVert\theta-\theta_{0}\rVert_{1},

by the Cauchy-Schwarz inequality. We have sθθ0K1s+s0s_{\theta-\theta_{0}}\leq K_{1}s_{\star}+s_{0} on the support of the measure VV. Hence, on the event 𝒰n{\cal U}_{n}, the numerator of (64) is bounded above by

exp{2ρ¯0C12Dn2(K1s+s0)X2ϕ1(K1s+s0)2M^1Dnslogp2X}\displaystyle\exp\left\{\frac{2\overline{\rho}_{0}C_{1}^{2}D_{n}^{2}(K_{1}s_{\star}+s_{0})}{\lVert X\rVert_{\ast}^{2}\phi_{1}(K_{1}s_{\star}+s_{0})^{2}}-\frac{\hat{M}_{1}D_{n}s_{\star}\sqrt{\log p}}{2\lVert X\rVert_{\ast}}\right\}
×S:sK1sπp(s)(ps)(λ2)se(Dn/2)θSθ0,S1dθS\displaystyle\times\sum_{S:s\leq K_{1}s_{\star}}\frac{\pi_{p}(s)}{\binom{p}{s}}\int\left(\frac{\lambda}{2}\right)^{s}e^{-({D_{n}}/{2})\lVert\theta_{S}-\theta_{0,S}\rVert_{1}}d\theta_{S}
exp{8ρ¯0C12(K1+1)slogpρ¯0ϕ1(K1s+s0)2M^1slogpρ¯0}s=0pπp(s)(L3ρ¯0n)s,\displaystyle\quad\leq\exp\left\{\frac{8\overline{\rho}_{0}C_{1}^{2}(K_{1}+1)s_{\star}\log p}{\underline{\rho}_{0}\phi_{1}(K_{1}s_{\star}+s_{0})^{2}}-\frac{\hat{M}_{1}s_{\star}\log p}{\sqrt{\underline{\rho}_{0}}}\right\}\sum_{s=0}^{p}\pi_{p}(s)\left(L_{3}\sqrt{\frac{{\underline{\rho}_{0}}}{n}}\right)^{s},

since Dn/2λn/(L3ρ¯0)D_{n}/2\geq\lambda\sqrt{n}/(L_{3}\sqrt{\underline{\rho}_{0}}). Note that we have

s=0pπp(s)(L3ρ¯0n)ss=0p(A2L3pA4ρ¯0n)s1,\displaystyle\sum_{s=0}^{p}\pi_{p}(s)\left(L_{3}\sqrt{\frac{{\underline{\rho}_{0}}}{n}}\right)^{s}\lesssim\sum_{s=0}^{p}\left(\frac{A_{2}L_{3}}{p^{A_{4}}}\sqrt{\frac{{\underline{\rho}_{0}}}{n}}\right)^{s}\lesssim 1,

by (3), and the fact that ϕ1(K1s+s0)\phi_{1}(K_{1}s_{\star}+s_{0}) in the denominators is bounded away from zero by assumption. Thus, the last display combined with (66) shows that (64) goes to zero on the event 𝒰n{\cal U}_{n}, provided that M^1\hat{M}_{1} is chosen sufficiently large.

Finally, we conclude that (60) goes to zero in 0\mathbb{P}_{0}-probability. Since the total variation distance is bounded by 2, convergence in mean also holds, as in the assertion. ∎

Proof of Theorem 6.

Our proof follows the proof of Theorem 4 in Castillo et al., [8]. Since 𝔼0Π(θ|Y(n))Π(θ|Y(n))TV{\mathbb{E}}_{0}\lVert\Pi(\theta\in\cdot\,|\,Y^{(n)})-\Pi^{\infty}(\theta\in\cdot\,|\,Y^{(n)})\rVert_{\rm TV} tends to zero by Theorem 5, it suffices to show that 𝔼0Π(θ:Sθ𝒮n|Y(n))0{\mathbb{E}}_{0}\Pi^{\infty}(\theta:S_{\theta}\in{\cal S}_{n}\,|\,Y^{(n)})\rightarrow 0 for 𝒮n={S:sK1s,SS0,SS0}{\cal S}_{n}=\{S:s\leq K_{1}s_{\star},S\supset S_{0},S\neq S_{0}\}. For the orthogonal projection defined by H~S=(IH)X~SΓS1X~ST(IH)\tilde{H}_{S}=(I-H)\tilde{X}_{S}\Gamma_{S}^{-1}\tilde{X}_{S}^{T}(I-H) with ΓS=X~ST(IH)X~S\Gamma_{S}=\tilde{X}_{S}^{T}(I-H)\tilde{X}_{S}, we see that Π(θ:Sθ𝒮n|Y(n))\Pi^{\infty}(\theta:S_{\theta}\in{\cal S}_{n}\,|\,Y^{(n)}) is bounded by

s=s0+1K1sπp(s)(ps0)(ps0ss0)πp(s0)(ps)(λπ2)ss0maxS𝒮n:|S|=s[det(ΓS0)1/2det(ΓS)1/2e(H~SH~S0)U22/2],\displaystyle\sum_{s=s_{0}+1}^{K_{1}s_{\star}}\frac{\pi_{p}(s)\binom{p}{s_{0}}\binom{p-s_{0}}{s-s_{0}}}{\pi_{p}(s_{0})\binom{p}{s}}\left(\frac{\lambda\sqrt{\pi}}{\sqrt{2}}\right)^{s-s_{0}}\max_{S\in{\cal S}_{n}:|S|=s}\left[\frac{\det(\Gamma_{S_{0}})^{1/2}}{\det(\Gamma_{S})^{1/2}}e^{\lVert(\tilde{H}_{S}-\tilde{H}_{S_{0}})U\rVert_{2}^{2}/2}\right],

by (13), since (H~SH~S0)X~θ0=(H~SH~S0)(IH)X~S0θ0,S0=0(\tilde{H}_{S}-\tilde{H}_{S_{0}})\tilde{X}\theta_{0}=(\tilde{H}_{S}-\tilde{H}_{S_{0}})(I-H)\tilde{X}_{S_{0}}\theta_{0,S_{0}}=0 for every S𝒮nS\in{\cal S}_{n} due to S0SS_{0}\subset S on 𝒮n{\cal S}_{n}. Note that ρk(ΓS0)ρk(ΓS)\rho_{k}(\Gamma_{S_{0}})\leq\rho_{k}(\Gamma_{S}) for k=1,,s0k=1,\dots,s_{0}, because ΓS0\Gamma_{S_{0}} is a principal submatrix of ΓS\Gamma_{S}. Hence, det(ΓS0)\det(\Gamma_{S_{0}}) is equal to

k=1s0ρk(ΓS0)k=1s0ρk(ΓS)det(ΓS)ρmin(ΓS)ss0det(ΓS)(C1ρ¯01/2ϕ2(s)X)2(ss0),\displaystyle\begin{split}\prod_{k=1}^{s_{0}}\rho_{k}(\Gamma_{S_{0}})\leq\prod_{k=1}^{s_{0}}\rho_{k}(\Gamma_{S})\leq\frac{\det(\Gamma_{S})}{\rho_{\min}(\Gamma_{S})^{s-s_{0}}}\leq\frac{\det(\Gamma_{S})}{(C_{1}\overline{\rho}_{0}^{-1/2}\phi_{2}(s)\lVert X\rVert_{\ast})^{2(s-s_{0})}},\end{split} (67)

for some C1>0C_{1}>0. The last inequality holds since, by (C8), the constant C1>0C_{1}>0 can be taken such that C12v22(IH)v22C_{1}^{2}\lVert v\rVert_{2}^{2}\leq\lVert(I-H)v\rVert_{2}^{2} for every vspan(X~S)v\in{\rm span}(\tilde{X}_{S}) with sK1ss\leq K_{1}s_{\star}, and hence, by the definition of ϕ2\phi_{2},

ρmin(ΓS)=infus,u0(IH)X~Su22u22C12ϕ2(s)2X2ρ¯0.\displaystyle\rho_{\min}(\Gamma_{S})=\inf_{u\in\mathbb{R}^{s},u\neq 0}\frac{\lVert(I-H)\tilde{X}_{S}u\rVert_{2}^{2}}{\lVert u\rVert_{2}^{2}}\geq\frac{C_{1}^{2}\phi_{2}(s)^{2}\lVert X\rVert_{\ast}^{2}}{\overline{\rho}_{0}}.
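The eigenvalue comparison ρk(ΓS0) ≤ ρk(ΓS) invoked in (67) is the Cauchy interlacing property of principal submatrices. A quick numerical check, with a generic Gram matrix standing in for ΓS and its leading block for ΓS0:

```python
import numpy as np

rng = np.random.default_rng(2)
n, s, s0 = 12, 8, 5
B = rng.standard_normal((n, s))
Gamma_S = B.T @ B                          # an s x s Gram matrix (positive semidefinite)
Gamma_S0 = Gamma_S[:s0, :s0]               # leading s0 x s0 principal submatrix

big = np.sort(np.linalg.eigvalsh(Gamma_S))[::-1]   # eigenvalues in decreasing order
sub = np.sort(np.linalg.eigvalsh(Gamma_S0))[::-1]

# Cauchy interlacing: rho_k(Gamma_S0) <= rho_k(Gamma_S) for k = 1, ..., s0
assert np.all(sub <= big[:s0] + 1e-10)
```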

Now, we shall show that for any fixed b>2b>2,

0((H~SH~S0)U22b(ss0)logp, for every S𝒮n)1.\displaystyle{\mathbb{P}}_{0}\left(\lVert(\tilde{H}_{S}-\tilde{H}_{S_{0}})U\rVert_{2}^{2}\leq b(s-s_{0})\log p,\text{~{}for every $S\in{\cal S}_{n}$}\right)\rightarrow 1. (68)

Note that (H~SH~S0)U22\lVert(\tilde{H}_{S}-\tilde{H}_{S_{0}})U\rVert_{2}^{2} has a chi-squared distribution with ss0s-s_{0} degrees of freedom. Therefore, by Lemma 5 of Castillo et al., [8], there exists a constant C2C_{2} such that for every b>2b>2 and given ss0+1s\geq s_{0}+1,

0(maxS𝒮n:|S|=s(H~SH~S0)U22>blogNs)\displaystyle{\mathbb{P}}_{0}\left(\max_{S\in{\cal S}_{n}:|S|=s}\lVert(\tilde{H}_{S}-\tilde{H}_{S_{0}})U\rVert_{2}^{2}>b\log N_{s}\right) (1Ns)(b2)/4eC2(ss0),\displaystyle\leq\left(\frac{1}{N_{s}}\right)^{(b-2)/4}e^{C_{2}(s-s_{0})},

where Ns=(ps0ss0)N_{s}=\binom{p-s_{0}}{s-s_{0}} is the cardinality of the set {S𝒮n:|S|=s}\{S\in{\cal S}_{n}:|S|=s\}. Since Ns(ps0)ss0pss0N_{s}\leq(p-s_{0})^{s-s_{0}}\leq p^{s-s_{0}}, writing 𝒯n{\cal T}_{n} for the event in (68), it follows that

0(𝒯nc)s=s0+1K1s(1Ns)(b2)/4eC2(ss0).\displaystyle{\mathbb{P}}_{0}({\cal T}_{n}^{c})\leq\sum_{s=s_{0}+1}^{K_{1}s_{\star}}\left(\frac{1}{N_{s}}\right)^{(b-2)/4}e^{C_{2}(s-s_{0})}.

This goes to zero as pp\rightarrow\infty, since for sK1ss\leq K_{1}s_{\star},

Ns(ps)ss0(ss0)!(pK1s)ss0(ss0)ss0(pK1sK1s)ss0,\displaystyle N_{s}\geq\frac{(p-s)^{s-s_{0}}}{(s-s_{0})!}\geq\frac{(p-K_{1}s_{\star})^{s-s_{0}}}{(s-s_{0})^{s-s_{0}}}\geq\left(\frac{p-K_{1}s_{\star}}{K_{1}s_{\star}}\right)^{s-s_{0}},

and s/p=o(1)s_{\star}/p=o(1). To complete the proof, it remains to show that Π(θ:Sθ𝒮n|Y(n))\Pi^{\infty}(\theta:S_{\theta}\in{\cal S}_{n}\,|\,Y^{(n)}) goes to zero on the set 𝒯n{\cal T}_{n}. Combining (67) and (68), we see that Π(θ:Sθ𝒮n|Y(n))𝟙𝒯n\Pi^{\infty}(\theta:S_{\theta}\in{\cal S}_{n}\,|\,Y^{(n)})\mathbbm{1}_{{\cal T}_{n}} is bounded by

s=s0+1K1sπp(s)(ps0)(ps0ss0)πp(s0)(ps)(λπ2)ss0(ρ¯0pbC1ϕ2(s)X)ss0\displaystyle\sum_{s=s_{0}+1}^{K_{1}s_{\star}}\frac{\pi_{p}(s)\binom{p}{s_{0}}\binom{p-s_{0}}{s-s_{0}}}{\pi_{p}(s_{0})\binom{p}{s}}\left(\frac{\lambda\sqrt{\pi}}{\sqrt{2}}\right)^{s-s_{0}}\left(\frac{\sqrt{\overline{\rho}_{0}p^{b}}}{C_{1}\phi_{2}(s)\lVert X\rVert_{\ast}}\right)^{s-s_{0}}
s=s0+1K1s(A2pA4)ss0(ss0)(L3C1ϕ1(K1s)K1sπρ¯0pb2n)ss0,\displaystyle\quad\leq\sum_{s=s_{0}+1}^{K_{1}s_{\star}}\left(\frac{A_{2}}{p^{A_{4}}}\right)^{s-s_{0}}\binom{s}{s_{0}}\left(\frac{L_{3}}{C_{1}\phi_{1}(K_{1}s_{\star})}\sqrt{\frac{K_{1}s_{\star}\pi\overline{\rho}_{0}p^{b}}{2n}}\right)^{s-s_{0}},

which holds by the inequalities πp(s)/πp(s0)(A2pA4)ss0{\pi_{p}(s)}/{\pi_{p}(s_{0})}\leq(A_{2}p^{-A_{4}})^{s-s_{0}} and (ps0)(ps0ss0)/(ps)=(ss0)\binom{p}{s_{0}}\binom{p-s_{0}}{s-s_{0}}/\binom{p}{s}=\binom{s}{s_{0}}. Note that for sK1ss\leq K_{1}s_{\star}, we have that (ss0)=(sss0)(K1s)ss0(K1C2pa)ss0\binom{s}{s_{0}}=\binom{s}{s-s_{0}}\leq(K_{1}s_{\star})^{s-s_{0}}\leq(K_{1}C_{2}p^{a})^{s-s_{0}} for some C2>0C_{2}>0. Hence, the preceding display goes to zero provided that aA4+b/2<0a-A_{4}+b/2<0 since s=o(n)s_{\star}=o(n). This condition can be translated to a<A41a<A_{4}-1 by choosing bb arbitrarily close to 2. ∎
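A distributional fact used in the proof above is that the squared norm of the difference of two nested orthogonal projections applied to standard normal noise is chi-squared with degrees of freedom equal to the rank difference. A short Monte Carlo sketch checks the first two moments; a generic random design stands in for the projected design matrices, which is an illustrative assumption rather than the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(3)
n, s, s0 = 40, 7, 3
X = rng.standard_normal((n, s))           # generic design; columns in general position

def proj(M):
    """Orthogonal projection onto the column span of M."""
    return M @ np.linalg.solve(M.T @ M, M.T)

H_S, H_S0 = proj(X), proj(X[:, :s0])      # nested spans: span(X[:, :s0]) in span(X)

U = rng.standard_normal((n, 100_000))     # standard normal noise, many replicates
stats = np.sum(((H_S - H_S0) @ U) ** 2, axis=0)

# ||(H_S - H_S0) U||_2^2 should be chi-squared with s - s0 degrees of freedom,
# hence mean s - s0 and variance 2(s - s0)
assert abs(stats.mean() - (s - s0)) < 0.05
assert abs(stats.var() - 2 * (s - s0)) < 0.3
```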

Appendix B Proofs for the applications

B.1 Proof of Theorem 7

We first verify the conditions for Theorem 3 to prove assertions (a) and (b).

  • Verification of (C1): Let \bar{\sigma}_{jk} be the (j,k)th element of \Sigma-\Sigma_{0}. Observe that d_{n}^{2}(\Sigma,\Sigma_{0}) is equal to

    \displaystyle\begin{split}\frac{1}{n}\sum_{i=1}^{n}\lVert E_{i}^{T}(\Sigma-\Sigma_{0})E_{i}\rVert_{\rm F}^{2}&=\frac{1}{n}\sum_{j=1}^{\overline{m}}\sum_{k=1}^{\overline{m}}\left[\bar{\sigma}_{jk}^{2}\sum_{i=1}^{n}e_{ij}e_{ik}\right]\gtrsim\frac{1}{c_{n}}\lVert\Sigma-\Sigma_{0}\rVert_{\rm F}^{2}.\end{split} (69)

    Hence, we see that c_{n} plays the same role as a_{n}. We also have e_{n}=0 as the true \Sigma_{0} belongs to the support of the prior.

  • Verification of (C2): Note that

    \displaystyle d_{n}^{2}(\Sigma_{1},\Sigma_{2})=\frac{1}{n}\sum_{i=1}^{n}\lVert E_{i}^{T}(\Sigma_{1}-\Sigma_{2})E_{i}\rVert_{\rm F}^{2}\leq\lVert\Sigma_{1}-\Sigma_{2}\rVert_{\rm F}^{2}, (70)

    for every \Sigma_{1},\Sigma_{2}\in{\cal H}. Hence we obtain that for every \bar{\epsilon}_{n}>n^{-1/2},

    \displaystyle\log\Pi(d_{n}(\Sigma,\Sigma_{0})\leq\bar{\epsilon}_{n})\geq\log\Pi(\lVert\Sigma-\Sigma_{0}\rVert_{\rm F}\leq\bar{\epsilon}_{n})\gtrsim\log\bar{\epsilon}_{n}\gtrsim-\log n,

    since 1\lesssim\rho_{\min}(\Sigma_{0})\leq\rho_{\max}(\Sigma_{0})\lesssim 1. This leads us to choose \bar{\epsilon}_{n}=\sqrt{(\log n)/n} for (C2) to be satisfied.

  • Verification of (C3): The assumption \lVert\theta_{0}\rVert_{\infty}\lesssim\lambda^{-1}\log p given in the theorem directly satisfies (C3).

  • Verification of (C4): We have the inequalities \rho_{\min}(\Sigma_{0})\leq\rho_{\min}(E_{i}^{T}\Sigma_{0}E_{i})\leq\rho_{\max}(E_{i}^{T}\Sigma_{0}E_{i})\leq\rho_{\max}(\Sigma_{0}) for every i\leq n as E_{i}^{T}\Sigma_{0}E_{i} is a principal submatrix of \Sigma_{0}. Hence (C4) is directly satisfied by the assumption on \Sigma_{0}.

  • Verification of (C5): For a sufficiently large M>0 and s_{\star}=s_{0}\vee(\log n/\log p), choose {\cal H}_{n}=\{\Sigma:n^{-M}\leq\rho_{\min}(\Sigma)\leq\rho_{\max}(\Sigma)\leq e^{Ms_{\star}\log p}\}. Since E_{i}^{T}\Sigma E_{i} is a principal submatrix of \Sigma, we have \rho_{\min}(E_{i}^{T}\Sigma E_{i})\geq\rho_{\min}(\Sigma)\geq n^{-M} for every i\leq n and \Sigma\in{\cal H}_{n}. Hence the minimum eigenvalue condition (6) is satisfied with \log\gamma_{n}\asymp\log n. Also, the entropy relative to d_{n} is given by

    \displaystyle\log N\left(\frac{1}{6\overline{m}n^{M+3/2}},{\cal H}_{n},d_{n}\right)
    \displaystyle\quad\leq\log N\left(\frac{1}{6\overline{m}n^{M+3/2}},\left\{\Sigma:\lVert\Sigma\rVert_{\rm F}\leq\sqrt{\overline{m}}e^{Ms_{\star}\log p}\right\},\lVert\cdot\rVert_{\rm F}\right)
    \displaystyle\quad\lesssim\log n+{s_{\star}\log p}.

    The entropy condition in (7) is thus satisfied if we choose \epsilon_{n}=\sqrt{(s_{\star}\log p)/n}. To verify the sieve condition (8), note that for some positive constants b_{1}, b_{2}, b_{3}, b_{4} and b_{5}, an inverse Wishart distribution satisfies

    \displaystyle\begin{split}\Pi(\Sigma:\rho_{\min}(\Sigma)<n^{-M})&\leq b_{1}e^{-b_{2}n^{b_{3}M}},\\ \Pi(\Sigma:\rho_{\max}(\Sigma)>e^{Ms_{\star}\log p})&\leq b_{4}e^{-b_{5}Ms_{\star}\log p};\end{split} (71)

    see, for example, Lemma 9.16 of Ghosal and van der Vaart, [17]. The sieve condition (8) is met provided that M is chosen sufficiently large. Note that the condition a_{n}\epsilon_{n}^{2}\rightarrow 0 is satisfied by the assumption c_{n}s_{\star}\log p=o(n).

  • Verification of (C6): The separability condition is trivially satisfied in this example as there is no nuisance mean part.

Therefore, the contraction properties in Theorem 3 are obtained with s_{\star}=s_{0}\vee(\log n/\log p), and s_{\star} may be replaced by s_{0} since s_{0}>0 and \log n\lesssim\log p. The contraction rate for \Sigma with respect to the Frobenius norm follows from (69). The optimal posterior contraction directly follows from Corollary 1. Assertions (a) and (b) are thus proved.

Next, we verify conditions (C8)–(C10) and (C7) to apply Theorems 5–6 and Corollaries 2–3.

  • Verification of (C8)–(C9): These conditions are trivially satisfied with the zero matrix H as there is no nuisance mean part.

  • Verification of (C10): Since the entropy in (C10) is bounded above by a constant multiple of \log N(\delta,\{\Sigma:\lVert\Sigma-\Sigma_{0}\rVert_{\rm F}\leq\hat{M}_{2}\sqrt{c_{n}}\epsilon_{n}\},\lVert\cdot\rVert_{\rm F})\lesssim 0\vee\log(3\hat{M}_{2}\sqrt{c_{n}}\epsilon_{n}/\delta) using (69) and (70), the term in (C10) is bounded by a multiple of (s_{\star}\vee\sqrt{\log c_{n}})\sqrt{c_{n}(s_{\star}\log p)^{3}/n} by Remark 6. This term tends to zero as s_{\star} can be replaced by s_{0}.

  • Verification of (C7): Note that d_{B,n}(\Sigma_{1},\Sigma_{2})\leq\lVert\Sigma_{1}-\Sigma_{2}\rVert_{\rm F} for every \Sigma_{1},\Sigma_{2} by (70), and hence it suffices to show that {\cal H} is a separable metric space under the Frobenius norm. Since the support of the prior for \Sigma is Euclidean, separability under the Frobenius norm is trivial.

Hence, under (C7), Theorem 5 can be applied to obtain the distributional approximation in (15) with the zero matrix H. Under (C7) and (C12), Theorem 6 implies the no-superset result in (16). If the beta-min condition (C13) is also met, the strong results in Corollary 2 and Corollary 3 hold. These establish (c)–(e).

B.2 Proof of Theorem 8

We first verify the conditions for Theorem 3 for (a) and (b).

  • Verification of (C1): Since \Delta_{\eta,i} is the same for every i\leq n and the true parameters belong to the support of the prior, we see that a_{n}=1 and e_{n}=0 satisfy (C1).

  • Verification of (C2): Observe that for every \eta_{1},\eta_{2}\in{\cal H},

    \displaystyle\begin{split}\lVert\xi_{\eta_{1}}-\xi_{\eta_{2}}\rVert_{2}^{2}&=\lvert(\alpha_{1}-\alpha_{2})+(\mu_{1}^{T}\beta_{1}-\mu_{2}^{T}\beta_{2})\rvert^{2}+\lVert\mu_{1}-\mu_{2}\rVert_{2}^{2}\\ &\lesssim\lvert\alpha_{1}-\alpha_{2}\rvert^{2}+\lVert\mu_{1}\rVert_{2}^{2}\lVert\beta_{1}-\beta_{2}\rVert_{2}^{2}+(\lVert\beta_{2}\rVert_{2}^{2}+1)\lVert\mu_{1}-\mu_{2}\rVert_{2}^{2},\\ \lVert\Delta_{\eta_{1}}-\Delta_{\eta_{2}}\rVert_{\rm F}^{2}&=|(\beta_{1}^{T}\Sigma_{1}\beta_{1}-\beta_{2}^{T}\Sigma_{2}\beta_{2})+(\sigma_{1}^{2}-\sigma_{2}^{2})|^{2}\\ &\quad+2\lVert\Sigma_{1}\beta_{1}-\Sigma_{2}\beta_{2}\rVert_{2}^{2}+\lVert\Sigma_{1}-\Sigma_{2}\rVert_{\rm F}^{2}\\ &\lesssim(\lVert\beta_{1}\rVert_{2}^{2}+1)^{2}\lVert\Sigma_{1}-\Sigma_{2}\rVert_{\rm F}^{2}+|\sigma_{1}^{2}-\sigma_{2}^{2}|^{2}\\ &\quad+(\lVert\beta_{1}\rVert_{2}^{2}+\lVert\beta_{2}\rVert_{2}^{2}+1)\lVert\Sigma_{2}\rVert_{\rm F}^{2}\lVert\beta_{1}-\beta_{2}\rVert_{2}^{2}.\end{split} (72)

    Since \lVert\beta_{0}\rVert_{2}, \lvert\sigma_{0}^{2}\rvert, and \lVert\Sigma_{0}\rVert_{\rm F} are bounded, it follows from the last display that there exists a constant C_{1} such that \lvert\alpha-\alpha_{0}\rvert+\lVert\beta-\beta_{0}\rVert_{2}+\lVert\mu-\mu_{0}\rVert_{2}+\lvert\sigma^{2}-\sigma_{0}^{2}\rvert+\lVert\Sigma-\Sigma_{0}\rVert_{\rm F}\leq C_{1}\bar{\epsilon}_{n} implies d_{n}(\eta,\eta_{0})\leq\bar{\epsilon}_{n} for any small \bar{\epsilon}_{n}. This shows that (C2) is satisfied as long as we choose \bar{\epsilon}_{n}=\sqrt{\log n/n}, as we have |\alpha_{0}|\vee\lVert\beta_{0}\rVert_{\infty}\vee\lVert\mu_{0}\rVert_{\infty}\lesssim 1, \sigma_{0}^{2}\asymp 1, and 1\lesssim\rho_{\min}(\Sigma_{0})\leq\rho_{\max}(\Sigma_{0})\lesssim 1.

  • Verification of (C3): The assumption \lVert\theta_{0}\rVert_{\infty}\lesssim\lambda^{-1}\log p given in the theorem directly satisfies (C3).

  • Verification of (C4): Since \Delta_{\eta} can be written as the sum of a positive semidefinite matrix and a positive definite matrix as

    \displaystyle\Delta_{\eta}=\begin{pmatrix}\beta^{T}\Sigma\beta&\beta^{T}\Sigma\\ \Sigma\beta&\Sigma\end{pmatrix}+\begin{pmatrix}\sigma^{2}&0\\ 0&\Psi\end{pmatrix},

    condition (C4) is satisfied as we obtain \sigma_{0}^{2}\wedge\rho_{\min}(\Psi)\leq\rho_{\min}(\Delta_{\eta_{0}})\leq\rho_{\max}(\Delta_{\eta_{0}})\leq\lVert\Delta_{\eta_{0}}\rVert_{\rm F} by Weyl’s inequality.

  • Verification of (C5): For a sufficiently large M and s_{\star}=s_{0}\vee(\log n/\log p), choose a sieve as

    \displaystyle{\cal H}_{n}=\{(\alpha,\beta,\mu):\lvert\alpha\rvert^{2}+\lVert\beta\rVert_{2}^{2}+\lVert\mu\rVert_{2}^{2}\leq n^{2M}\}\times\{\sigma:n^{-M}\leq\sigma^{2}\leq e^{Ms_{\star}\log p}\}
    \displaystyle\quad\times\{\Sigma:n^{-M}\leq\rho_{\min}(\Sigma)\leq\rho_{\max}(\Sigma)\leq e^{Ms_{\star}\log p}\}.

    Then we have \rho_{\min}(\Delta_{\eta})\geq\sigma^{2}\wedge\rho_{\min}(\Psi)\geq n^{-M} for large n, and hence the minimum eigenvalue condition (6) is directly met with \log\gamma_{n}\asymp\log n by the definition of the sieve. To see the entropy condition, observe from (72) that for every \eta_{1},\eta_{2}\in{\cal H}_{n},

    \displaystyle d_{n}^{2}(\eta_{1},\eta_{2})\lesssim n^{4M}e^{2Ms_{\star}\log p}\big{(}\lvert\alpha_{1}-\alpha_{2}\rvert^{2}+\lVert\beta_{1}-\beta_{2}\rVert_{2}^{2}+\lVert\mu_{1}-\mu_{2}\rVert_{2}^{2}
    +\lVert\Sigma_{1}-\Sigma_{2}\rVert_{\rm F}^{2}+|\sigma_{1}^{2}-\sigma_{2}^{2}|^{2}\big{)}.

    Therefore, for \delta_{n}=1/(6\overline{m}n^{3M+3/2}e^{Ms_{\star}\log p}), the entropy relative to d_{n} is bounded above by

    \displaystyle\log N\left(\delta_{n},\{(\alpha,\beta,\mu):\lvert\alpha\rvert^{2}+\lVert\beta\rVert_{2}^{2}+\lVert\mu\rVert_{2}^{2}\leq n^{2M}\},\lVert\cdot\rVert_{2}\right)
    +\log N\left(\delta_{n},\{\sigma:\sigma^{2}\leq e^{Ms_{\star}\log p}\},\lVert\cdot\rVert_{2}\right)
    +\log N\left(\delta_{n},\{\Sigma:\lVert\Sigma\rVert_{\rm F}\leq\sqrt{q}e^{Ms_{\star}\log p}\},\lVert\cdot\rVert_{\rm F}\right),

    each summand of which is bounded by a multiple of \log n+s_{\star}\log p. This shows that the choice \epsilon_{n}=\sqrt{(s_{\star}\log p)/n} satisfies the entropy condition in (7). Further, it is easy to see that condition (8) holds using the tail bounds for normal and inverse Wishart distributions as in (71).

  • Verification of (C6): Note that the mean of Y is expressed as X\theta+Z\xi_{\eta} for Z=1_{n}\otimes I_{q+1}. Since the condition \varsigma_{\min}([X_{S}^{\ast},1_{n}])\gtrsim 1 implies \varsigma_{\min}([X_{S},Z])\gtrsim 1, condition (C6) is satisfied by Remark 3.

Therefore we obtain the contraction properties of the posterior distribution as in (9) with s_{\star} replaced by s_{0}, as s_{0}>0 and \log n\lesssim\log p. The rates for \eta with respect to more concrete metrics than d_{n} can now be obtained. Note that for small \delta>0, d_{n}(\eta,\eta_{0})\leq\delta directly implies \lVert\mu-\mu_{0}\rVert_{2}\leq\delta and \lVert\Sigma-\Sigma_{0}\rVert_{\rm F}\leq\delta by the definition of d_{n}. For \beta, observe that

\displaystyle\lVert\beta-\beta_{0}\rVert_{2}\leq\lVert\Sigma^{-1}\rVert_{\rm sp}\lVert\Sigma(\beta-\beta_{0})\rVert_{2}
\displaystyle\quad\leq\lVert\Sigma^{-1}\rVert_{\rm sp}(\lVert\Sigma\beta-\Sigma_{0}\beta_{0}\rVert_{2}+\lVert\Sigma-\Sigma_{0}\rVert_{\rm F}\lVert\beta_{0}\rVert_{2})
\displaystyle\quad\lesssim\lVert\Sigma^{-1}\rVert_{\rm sp}\delta.

Since \lVert\Sigma^{-1}\rVert_{\rm sp} is bounded when \lVert\Sigma-\Sigma_{0}\rVert_{\rm F}\leq\delta, the preceding display implies \lVert\beta-\beta_{0}\rVert_{2}\lesssim\delta. Moreover, we have

\displaystyle|\alpha-\alpha_{0}|\leq|\mu^{T}\beta-\mu_{0}^{T}\beta_{0}|+\delta
\displaystyle\quad\lesssim\lVert\mu\rVert_{2}\lVert\beta-\beta_{0}\rVert_{2}+\lVert\beta_{0}\rVert_{2}\lVert\mu-\mu_{0}\rVert_{2}+\delta
\displaystyle\quad\lesssim(\lVert\mu\rVert_{2}+1)\delta,

and

\displaystyle|\sigma^{2}-\sigma_{0}^{2}|\leq|\beta^{T}\Sigma\beta-\beta_{0}^{T}\Sigma_{0}\beta_{0}|+|(\beta^{T}\Sigma\beta+\sigma^{2})-(\beta_{0}^{T}\Sigma_{0}\beta_{0}+\sigma_{0}^{2})|
\displaystyle\quad\leq\lVert\beta\rVert_{2}\lVert\Sigma\beta-\Sigma_{0}\beta_{0}\rVert_{2}+\lVert\beta_{0}\rVert_{2}\lVert\Sigma_{0}\rVert_{\rm sp}\lVert\beta-\beta_{0}\rVert_{2}+\delta
\displaystyle\quad\lesssim(\lVert\beta\rVert_{2}+1)\delta.

These show that |\alpha-\alpha_{0}|+|\sigma^{2}-\sigma_{0}^{2}|\lesssim\delta as \lVert\mu\rVert_{2} and \lVert\beta\rVert_{2} are bounded. We finally conclude that \lvert\alpha-\alpha_{0}\rvert+\lVert\beta-\beta_{0}\rVert_{2}+\lVert\mu-\mu_{0}\rVert_{2}+\lvert\sigma^{2}-\sigma_{0}^{2}\rvert+\lVert\Sigma-\Sigma_{0}\rVert_{\rm F} contracts at the same rate as d_{n}. The optimal posterior contraction is directly obtained by Corollary 1. Thus assertions (a) and (b) hold.

Next, we verify conditions (C8)–(C10) and (C7) to apply Theorems 5–6 and Corollaries 2–3. The orthogonal projection defined by H=\tilde{Z}(\tilde{Z}^{T}\tilde{Z})^{-1}\tilde{Z}^{T} with \tilde{Z}=1_{n}\otimes\Delta_{\eta_{0}}^{-1/2} is used to check the conditions.

  • Verification of (C8): For H defined above, it is easy to see that the first condition of (C8) is satisfied. The second condition is directly satisfied by Remark 5.

  • Verification of (C9): Choose the map (\alpha,\beta,\mu,\sigma^{2},\Sigma)\mapsto(\alpha+n^{-1}1_{n}^{T}X^{\ast}(\theta-\theta_{0}),\beta,\mu,\sigma^{2},\Sigma) for \eta\mapsto\tilde{\eta}_{n}(\theta,\eta). To check (C9), we shall verify that this map induces \Phi(\tilde{\eta}_{n}(\theta,\eta))=(\tilde{\xi}_{\eta}+H\tilde{X}(\theta-\theta_{0}),\tilde{\Delta}_{\eta}) as follows. Note that for matrices R_{k}, k=1,\dots,6, the Kronecker product satisfies the mixed-product property (R_{1}\otimes R_{2})(R_{3}\otimes R_{4})=(R_{1}R_{3})\otimes(R_{2}R_{4}) and the inversion property (R_{5}\otimes R_{6})^{-1}=R_{5}^{-1}\otimes R_{6}^{-1} whenever the matrices allow such operations. Using these properties, we see that H satisfies

    \displaystyle H=(1_{n}\otimes\Delta_{\eta_{0}}^{-1/2})(1_{n}^{T}1_{n}\otimes\Delta_{\eta_{0}}^{-1})^{-1}(1_{n}\otimes\Delta_{\eta_{0}}^{-1/2})^{T}
    \displaystyle\quad=\frac{1}{n}(1_{n}\otimes\Delta_{\eta_{0}}^{-1/2})\Delta_{\eta_{0}}(1_{n}\otimes\Delta_{\eta_{0}}^{-1/2})^{T}
    \displaystyle\quad=\frac{1}{n}(1_{n}\otimes I_{q+1})(1_{n}^{T}\otimes I_{q+1})
    \displaystyle\quad=\frac{1}{n}(1_{n}1_{n}^{T}\otimes I_{q+1}).

    Hence,

    \displaystyle Z(\tilde{Z}^{T}\tilde{Z})^{-1}\tilde{Z}^{T}\tilde{X}(\theta-\theta_{0})=(I_{n}\otimes\Delta_{\eta_{0}}^{1/2})H(I_{n}\otimes\Delta_{\eta_{0}}^{-1/2})X(\theta-\theta_{0})
    \displaystyle\quad=HX(\theta-\theta_{0})=1_{n}\otimes\begin{pmatrix}n^{-1}1_{n}^{T}X^{\ast}(\theta-\theta_{0})\\ 0_{q\times 1}\end{pmatrix},

    which implies that shifting only \alpha as in the given map yields \Phi(\tilde{\eta}_{n}(\theta,\eta))=(\tilde{\xi}_{\eta}+H\tilde{X}(\theta-\theta_{0}),\tilde{\Delta}_{\eta}). Without loss of generality, we assume that the standard normal prior is used for \alpha. Now, observe that

    \displaystyle\left\lvert\log\frac{d\Pi_{n,\theta}}{d\Pi_{n,\theta_{0}}}(\eta)\right\rvert\lesssim\left\lvert\alpha^{2}-(\alpha+n^{-1}1_{n}^{T}X^{\ast}(\theta-\theta_{0}))^{2}\right\rvert
    \displaystyle\quad\leq 2|\alpha||n^{-1}1_{n}^{T}X^{\ast}(\theta-\theta_{0})|+(n^{-1}1_{n}^{T}X^{\ast}(\theta-\theta_{0}))^{2},

    since the priors for the other parameters cancel out due to invariance. One can note that

    \displaystyle\sup_{\eta\in\widehat{\cal H}_{n}}|\alpha|\lesssim s_{\star}\sqrt{(\log p)/n}+|\alpha_{0}|\lesssim 1,

    and

    \displaystyle\frac{1}{\sqrt{n}}\sup_{\theta\in\widehat{\Theta}_{n}}\lVert X(\theta-\theta_{0})\rVert_{2}\lesssim s_{\star}\sqrt{(\log p)/n}.

    Thus, condition (C9) is satisfied.

  • Verification of (C10): Note again that d_{B,n}(\eta,\eta_{0})\lesssim\lVert\Sigma-\Sigma_{0}\rVert_{\rm F}+|\sigma^{2}-\sigma_{0}^{2}|+\lVert\beta-\beta_{0}\rVert_{2} for every \eta\in\widehat{\cal H}_{n}. The reverse inequality also holds for every \eta\in\widehat{\cal H}_{n}, by the same argument used for the recovery in the proof of (a)–(b) above. Hence, for some constants C_{1},C_{2}>0, the entropy in (C10) is bounded above by

    \displaystyle\log N\left(C_{1}\delta,\left\{\beta:\lVert\beta-\beta_{0}\rVert_{2}\leq C_{2}\hat{M}_{2}\epsilon_{n}\right\},\lVert\cdot\rVert_{2}\right)
    +\log N\left(C_{1}\delta,\left\{\sigma^{2}:\lvert\sigma^{2}-\sigma_{0}^{2}\rvert\leq C_{2}\hat{M}_{2}\epsilon_{n}\right\},\lvert\cdot\rvert\right)
    +\log N\left(C_{1}\delta,\left\{\Sigma:\lVert\Sigma-\Sigma_{0}\rVert_{\rm F}\leq C_{2}\hat{M}_{2}\epsilon_{n}\right\},\lVert\cdot\rVert_{\rm F}\right).

    Since all nuisance parameters are of fixed dimensions, the last display is bounded by a multiple of 0\vee\log(3C_{2}\hat{M}_{2}\epsilon_{n}/C_{1}\delta) for every \delta>0, so that the term in (C10) is bounded by (s_{\star}^{5}\log^{3}p/n)^{1/2} by Remark 6. Since s_{\star}\lesssim s_{0} in this case, the condition is verified.

  • Verification of (C7): Note that by (72), d_{B,n}(\eta_{1},\eta_{2})\lesssim\lVert\Sigma_{1}-\Sigma_{2}\rVert_{\rm F}+|\sigma_{1}^{2}-\sigma_{2}^{2}|+\lVert\beta_{1}-\beta_{2}\rVert_{2} for every \eta_{1},\eta_{2}\in\widehat{\cal H}_{n}. Since each of the parameter spaces of \Sigma, \sigma^{2}, and \beta is a separable metric space under the corresponding norm, (C7) is satisfied.
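The mixed-product property of the Kronecker product used in the computation of H in (C9) can be sanity-checked numerically; the following pure standard-library sketch (the helper names matmul and kron are ours) verifies (R_{1}\otimes R_{2})(R_{3}\otimes R_{4})=(R_{1}R_{3})\otimes(R_{2}R_{4}) on random matrices:

```python
import random

# Check the Kronecker mixed-product property on random rectangular matrices:
# (R1 (x) R2)(R3 (x) R4) = (R1 R3) (x) (R2 R4).
def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def kron(A, B):
    # Standard Kronecker product: block (i, j) of the result is A[i][j] * B.
    p, q = len(B), len(B[0])
    return [[A[i // p][j // q] * B[i % p][j % q]
             for j in range(len(A[0]) * q)]
            for i in range(len(A) * p)]

random.seed(0)
rnd = lambda r, c: [[random.uniform(-1, 1) for _ in range(c)] for _ in range(r)]
R1, R3 = rnd(2, 3), rnd(3, 2)   # R1 R3 is 2x2
R2, R4 = rnd(2, 2), rnd(2, 2)   # R2 R4 is 2x2
left = matmul(kron(R1, R2), kron(R3, R4))
right = kron(matmul(R1, R3), matmul(R2, R4))
assert all(abs(left[i][j] - right[i][j]) < 1e-12
           for i in range(len(left)) for j in range(len(left[0])))
print("mixed-product property verified")
```

The same identity with R_{1}=1_{n}, R_{2}=\Delta_{\eta_{0}}^{-1/2} underlies the simplification of H above.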

Therefore, under (C7), Theorem 5 implies that the distributional approximation in (15) holds. Under (C7) and (C12), we obtain the no-superset result in (16). The remaining assertions in the theorem are direct consequences of Corollary 2 and Corollary 3 if the beta-min condition (C13) is also satisfied. These prove (c)–(e).

We complete the proof by showing that the covariance matrix of the nonzero part can be written as in the theorem. For given S, we obtain

\displaystyle\tilde{X}_{S}^{T}(I_{n(q+1)}-H)\tilde{X}_{S}
\displaystyle\quad=X_{S}^{\ast T}\left(I_{n}\otimes\{\Delta_{\eta_{0}}^{-1/2}\}_{\cdot 1}^{T}\right)\left(I_{n}\otimes I_{q+1}-H\right)\left(I_{n}\otimes\{\Delta_{\eta_{0}}^{-1/2}\}_{\cdot 1}\right)X_{S}^{\ast}
\displaystyle\quad=\{\Delta_{\eta_{0}}^{-1/2}\}_{\cdot 1}^{T}\{\Delta_{\eta_{0}}^{-1/2}\}_{\cdot 1}X_{S}^{\ast T}H^{\ast}X_{S}^{\ast},

where \{\Delta_{\eta_{0}}^{-1/2}\}_{\cdot 1} is the first column of \Delta_{\eta_{0}}^{-1/2}. Note that \{\Delta_{\eta_{0}}^{-1/2}\}_{\cdot 1}^{T}\{\Delta_{\eta_{0}}^{-1/2}\}_{\cdot 1}=\{\Delta_{\eta_{0}}^{-1}\}_{1,1}, where \{\Delta_{\eta_{0}}^{-1}\}_{1,1} is the top-left element of \Delta_{\eta_{0}}^{-1}, which is equal to (\beta_{0}^{T}\Sigma_{0}\beta_{0}+\sigma_{0}^{2}-\beta_{0}^{T}\Sigma_{0}(\Sigma_{0}+\Psi)^{-1}\Sigma_{0}\beta_{0})^{-1}=(\sigma_{0}^{2}+\beta_{0}^{T}\Sigma_{0}(\Sigma_{0}+\Psi)^{-1}\Psi\beta_{0})^{-1} by direct calculations. For the mean \hat{\theta}_{S}, observe that

\displaystyle\tilde{X}_{S}^{T}(I_{n(q+1)}-H)(U+\tilde{X}\theta_{0})
\displaystyle\quad=X_{S}^{\ast T}\left(I_{n}\otimes\{\Delta_{\eta_{0}}^{-1/2}\}_{\cdot 1}^{T}\right)\left(I_{n}\otimes I_{q+1}-\frac{1}{n}1_{n}1_{n}^{T}\otimes I_{q+1}\right)
\displaystyle\qquad\times\Big\{\left(I_{n}\otimes\{\Delta_{\eta_{0}}^{-1/2}\}_{\cdot 1}\right)\left(Y^{\ast}-(\alpha_{0}+\mu_{0}^{T}\beta_{0})1_{n}\right)
\displaystyle\quad\qquad\quad+\left(I_{n}\otimes\{\Delta_{\eta_{0}}^{-1/2}\}_{\cdot(-1)}\right)\left(W-1_{n}\otimes\mu_{0}\right)\Big\},

where \{\Delta_{\eta_{0}}^{-1/2}\}_{\cdot(-1)} is the submatrix of \Delta_{\eta_{0}}^{-1/2} consisting of all columns except the first column \{\Delta_{\eta_{0}}^{-1/2}\}_{\cdot 1}. Since \{\Delta_{\eta_{0}}^{-1/2}\}_{\cdot 1}^{T}\{\Delta_{\eta_{0}}^{-1/2}\}_{\cdot(-1)}=\{\Delta_{\eta_{0}}^{-1}\}_{1,(-1)}, where \{\Delta_{\eta_{0}}^{-1}\}_{1,(-1)} is the first row of \Delta_{\eta_{0}}^{-1} with the top-left element excluded, the last display is equal to

\displaystyle X_{S}^{\ast T}\Big\{H^{\ast}\Big[\{\Delta_{\eta_{0}}^{-1}\}_{1,1}\left(Y^{\ast}-(\alpha_{0}+\mu_{0}^{T}\beta_{0})1_{n}\right)
+\left(I_{n}\otimes\{\Delta_{\eta_{0}}^{-1}\}_{1,(-1)}\right)\left(W-1_{n}\otimes\mu_{0}\right)\Big]\Big\}.

As we have \{\Delta_{\eta_{0}}^{-1}\}_{1,(-1)}=-\{\Delta_{\eta_{0}}^{-1}\}_{1,1}\beta_{0}^{T}\Sigma_{0}(\Sigma_{0}+\Psi)^{-1} by direct calculations with the block-matrix inverse, it follows that

\displaystyle\hat{\theta}_{S}=\left(X_{S}^{\ast T}H^{\ast}X_{S}^{\ast}\right)^{-1}X_{S}^{\ast T}\Big\{H^{\ast}\Big[\left(Y^{\ast}-(\alpha_{0}+\mu_{0}^{T}\beta_{0})1_{n}\right)
-\left(I_{n}\otimes(\beta_{0}^{T}\Sigma_{0}(\Sigma_{0}+\Psi)^{-1})\right)\left(W-1_{n}\otimes\mu_{0}\right)\Big]\Big\}.

This completes the proof.
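The Schur-complement identity behind \{\Delta_{\eta_{0}}^{-1}\}_{1,1} above can be checked exactly in the scalar case q=1, where all quantities are numbers; the following standard-library sketch uses exact rational arithmetic (the parameter values are arbitrary test points, not taken from the theorem):

```python
from fractions import Fraction as F

# Scalar (q = 1) check of
# (b'S0 b + s0^2 - b'S0 (S0+Psi)^{-1} S0 b)^{-1} = (s0^2 + b'S0 (S0+Psi)^{-1} Psi b)^{-1},
# i.e. equality of the quantities inside the inverses, with exact rationals.
for beta0, sigma0_sq, Sigma0, Psi in [
    (F(1, 2), F(2), F(3), F(1)),
    (F(3), F(1, 4), F(1, 2), F(5)),
    (F(-2), F(7), F(4), F(1, 3)),
]:
    lhs = beta0**2 * Sigma0 + sigma0_sq - beta0**2 * Sigma0**2 / (Sigma0 + Psi)
    rhs = sigma0_sq + beta0**2 * Sigma0 * Psi / (Sigma0 + Psi)
    assert lhs == rhs
print("Schur-complement identity verified")
```

The equality holds algebraically since \Sigma_{0}-\Sigma_{0}(\Sigma_{0}+\Psi)^{-1}\Sigma_{0}=\Sigma_{0}(\Sigma_{0}+\Psi)^{-1}\Psi in the scalar case.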

B.3 Proof of Theorem 9

We shall verify the conditions for the posterior contraction in Theorem 3 to prove (a)–(b). First we give bounds for the eigenvalues of each correlation matrix. It can be shown that

\displaystyle 1-\alpha=\rho_{\min}\left(G_{i}^{\rm CS}(\alpha)\right)\leq\rho_{\max}\left(G_{i}^{\rm CS}(\alpha)\right)=1+(m_{i}-1)\alpha, (73)
\displaystyle\frac{1-\alpha^{2}}{(1+|\alpha|)^{2}}\leq\rho_{\min}\left(G_{i}^{\rm AR}(\alpha)\right)\leq\rho_{\max}\left(G_{i}^{\rm AR}(\alpha)\right)\leq\frac{1-\alpha^{2}}{(1-|\alpha|)^{2}}, (74)
\displaystyle 1-2|\alpha|\leq\rho_{\min}\left(G_{i}^{\rm MA}(\alpha)\right)\leq\rho_{\max}\left(G_{i}^{\rm MA}(\alpha)\right)\leq 1+2|\alpha|. (75)

The first assertion in (73) follows directly from the identity \rho_{k}(G_{i}^{\rm CS}(\alpha))=\rho_{k}(\alpha 1_{m_{i}}1_{m_{i}}^{T})+1-\alpha for every k\leq m_{i}. For (74), see Theorem 2.1 and Theorem 3.5 of Fikioris, [12]. The assertion in (75) is due to Theorem 2.2 of Kulkarni et al., [21].
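The eigenpairs behind (73) can be checked directly: for G^{\rm CS}(\alpha)=(1-\alpha)I+\alpha 1 1^{T}, the all-ones vector is an eigenvector with eigenvalue 1+(m-1)\alpha and any contrast e_{j}-e_{k} has eigenvalue 1-\alpha. A standard-library sketch (helper names are ours):

```python
# Verify the eigenpairs of the compound-symmetric correlation matrix
# G^CS(a) = (1-a) I + a 1 1^T without any linear-algebra library.
def matvec(G, v):
    return [sum(G[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

def G_cs(m, a):
    return [[1.0 if i == j else a for j in range(m)] for i in range(m)]

m, a = 5, 0.3
G = G_cs(m, a)
ones = [1.0] * m                          # eigenvalue 1 + (m-1) a
contrast = [1.0, -1.0] + [0.0] * (m - 2)  # eigenvalue 1 - a
assert all(abs(x - (1 + (m - 1) * a) * y) < 1e-12
           for x, y in zip(matvec(G, ones), ones))
assert all(abs(x - (1 - a) * y) < 1e-12
           for x, y in zip(matvec(G, contrast), contrast))
print("CS eigenpairs verified")
```

Since the contrasts span the orthogonal complement of 1, these two eigenvalues exhaust the spectrum, giving both extremes in (73).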

  • Verification of (C1): For the autoregressive correlation matrix, note that

    \displaystyle\max_{1\leq i\leq n}\left\lVert\sigma^{2}G_{i}^{\rm AR}(\alpha)-\sigma_{0}^{2}G_{i}^{\rm AR}(\alpha_{0})\right\rVert_{\rm F}^{2}=\overline{m}(\sigma^{2}-\sigma_{0}^{2})^{2}+2\sum_{k=1}^{\overline{m}-1}(\overline{m}-k)(\sigma^{2}\alpha^{k}-\sigma_{0}^{2}\alpha_{0}^{k})^{2}.

    Using \overline{m}n\asymp n_{\ast}, we have that

    \displaystyle\sum_{k=1}^{\overline{m}-1}(\overline{m}-k)(\sigma^{2}\alpha^{k}-\sigma_{0}^{2}\alpha_{0}^{k})^{2}\lesssim\frac{1}{n}\sum_{k=1}^{\overline{m}-1}(\sigma^{2}\alpha^{k}-\sigma_{0}^{2}\alpha_{0}^{k})^{2}\sum_{i=1}^{n}\{(m_{i}-k)\vee 0\}
    \displaystyle\quad=\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{m_{i}-1}(m_{i}-k)(\sigma^{2}\alpha^{k}-\sigma_{0}^{2}\alpha_{0}^{k})^{2},

    and hence

    \displaystyle\max_{1\leq i\leq n}\lVert\sigma^{2}G_{i}^{\rm AR}(\alpha)-\sigma_{0}^{2}G_{i}^{\rm AR}(\alpha_{0})\rVert_{\rm F}^{2}\lesssim\frac{1}{n}\sum_{i=1}^{n}\lVert\sigma^{2}G_{i}^{\rm AR}(\alpha)-\sigma_{0}^{2}G_{i}^{\rm AR}(\alpha_{0})\rVert_{\rm F}^{2}.

    This gives us a_{n}\asymp 1 for the autoregressive matrices. Similarly, we can show that a_{n}\asymp 1 satisfies (C1) for the compound-symmetric and the moving average correlation matrices. Also, we have e_{n}=0 for (C1) as the true parameter values \alpha_{0} and \sigma_{0}^{2} are in the support of the prior.

  • Verification of (C2): Since the nuisance parameters are of fixed dimensions, condition (C2) is satisfied with \bar{\epsilon}_{n}=\sqrt{(\log n)/n} due to the restricted range of the true parameters, \sigma_{0}^{2}\asymp 1 and \alpha_{0}\in[b_{1}+\epsilon,b_{2}-\epsilon] for some fixed \epsilon>0.

  • Verification of (C3): The assumption \lVert\theta_{0}\rVert_{\infty}\lesssim\lambda^{-1}\log p given in the theorem directly satisfies (C3).

  • Verification of (C4): Using (73)–(75), we see that for the compound-symmetric correlation matrix, condition (C4) is satisfied with the bounded range of the true parameters provided that \overline{m} is bounded. For the other correlation matrices, condition (C4) is satisfied even with increasing \overline{m}.

  • Verification of (C5): For a sufficiently large M>0 and s_{\star}=s_{0}\vee(\log n/\log p), choose a sieve {\cal H}_{n}=\{\sigma^{2}:n^{-M}\leq\sigma^{2}\leq e^{Ms_{\star}\log p}\}\times\{\alpha:b_{1}+n^{-M}\leq\alpha\leq b_{2}-n^{-M}\}. Then using (73)–(75), it is easy to see that the minimum eigenvalue of each correlation matrix is bounded below by a polynomial in n, which implies that condition (6) is satisfied with \log\gamma_{n}\asymp\log n. For the entropy calculation, note that for every type of correlation matrix,

    \displaystyle\begin{split}d_{n}^{2}(\eta_{1},\eta_{2})&=\frac{1}{n}\sum_{i=1}^{n}\lVert\sigma_{1}^{2}G_{i}(\alpha_{1})-\sigma_{2}^{2}G_{i}(\alpha_{2})\rVert_{\rm F}^{2}\\ &\leq\frac{1}{n}\sum_{i=1}^{n}\left\{(\sigma_{1}^{2}-\sigma_{2}^{2})^{2}\lVert G_{i}(\alpha_{1})\rVert_{\rm F}^{2}+\sigma_{2}^{4}\lVert G_{i}(\alpha_{1})-G_{i}(\alpha_{2})\rVert_{\rm F}^{2}\right\}.\end{split} (76)

    From the identity \alpha_{1}^{k}-\alpha_{2}^{k}=(\alpha_{1}-\alpha_{2})\sum_{j=0}^{k-1}\alpha_{1}^{j}\alpha_{2}^{k-1-j} for every integer k\geq 1, we have that |\alpha_{1}^{k}-\alpha_{2}^{k}|\lesssim k|\alpha_{1}-\alpha_{2}| for every \alpha_{1},\alpha_{2}\in(b_{1},b_{2}). By this inequality we obtain \lVert G_{i}(\alpha_{1})-G_{i}(\alpha_{2})\rVert_{\rm F}^{2}\lesssim\overline{m}^{4}|\alpha_{1}-\alpha_{2}|^{2} for every correlation matrix. Then, the last display is bounded by a multiple of \overline{m}^{2}(\sigma_{1}^{2}-\sigma_{2}^{2})^{2}+e^{2Ms_{\star}\log p}\overline{m}^{4}(\alpha_{1}-\alpha_{2})^{2} for every \eta_{1},\eta_{2}\in{\cal H}_{n}. The entropy in (7) is thus bounded by

    \displaystyle\log N\big{(}\delta_{n},\{\sigma^{2}:0<\sigma^{2}\leq e^{Ms_{\star}\log p}\},|\cdot|\big{)}+\log N\big{(}\delta_{n},\{\alpha:0<\alpha<1\},|\cdot|\big{)},

    for \delta_{n}=(6\overline{m}^{3}n^{3/2+C_{1}}e^{Ms_{\star}\log p})^{-1} with some constant C_{1}>0. It can be easily checked that each term in the last display is bounded by a multiple of s_{\star}\log p, by which the entropy condition in (7) is satisfied with \epsilon_{n}=\sqrt{(s_{\star}\log p)/n}. Using the tail bounds of inverse gamma distributions and the properties of the density \Pi(d\alpha) near the boundaries, condition (8) is satisfied as long as M is chosen sufficiently large.

  • Verification of (C6): The separation condition is trivially satisfied as there is no nuisance mean part.
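The factorization \alpha_{1}^{k}-\alpha_{2}^{k}=(\alpha_{1}-\alpha_{2})\sum_{j=0}^{k-1}\alpha_{1}^{j}\alpha_{2}^{k-1-j} behind the Lipschitz bound in (C5), and the resulting bound |\alpha_{1}^{k}-\alpha_{2}^{k}|\leq k|\alpha_{1}-\alpha_{2}| on [-1,1], can be checked numerically with the standard library:

```python
import random

# Check the telescoping factorization of a1^k - a2^k and the implied
# Lipschitz bound |a1^k - a2^k| <= k |a1 - a2| for a1, a2 in [-1, 1].
random.seed(1)
for _ in range(100):
    a1, a2 = random.uniform(-1, 1), random.uniform(-1, 1)
    k = random.randint(1, 10)
    lhs = a1**k - a2**k
    rhs = (a1 - a2) * sum(a1**j * a2**(k - 1 - j) for j in range(k))
    assert abs(lhs - rhs) < 1e-12
    assert abs(lhs) <= k * abs(a1 - a2) + 1e-12
print("factorization verified")
```

The Lipschitz bound follows since each of the k summands is at most 1 in absolute value when |\alpha_{1}|,|\alpha_{2}|\leq 1.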

Therefore, we obtain the posterior contraction properties of θ\theta with s=s0(logn/logp)s_{\star}=s_{0}\vee(\log n/\log p) by Theorem 3. The term ss_{\star} can be replaced by s0s_{0} since s0>0s_{0}>0 and lognlogp\log n\lesssim\log p. Since we have mi(σ2σ02)2σ2Gi(α)σ02Gi(α0)F2m_{i}(\sigma^{2}-\sigma_{0}^{2})^{2}\leq\lVert\sigma^{2}G_{i}(\alpha)-\sigma_{0}^{2}G_{i}(\alpha_{0})\rVert_{\rm F}^{2} by the diagonal entries of each matrix, the contraction rate (s0logp)/(m¯n)\sqrt{(s_{0}\log p)/(\overline{m}n)} is obtained for σ2\sigma^{2} with respect to the 2\ell_{2}-norm, for every correlation matrix, as m¯nn\overline{m}n\asymp n_{\ast}. In particular, for the compound-symmetric correlation matrix, this rate is reduced to (s0logp)/n\sqrt{(s_{0}\log p)/n} since m¯\overline{m} is bounded in that case. We also have mi(σ2ασ02α0)2σ2Gi(α)σ02Gi(α0)F2m_{i}(\sigma^{2}\alpha-\sigma_{0}^{2}\alpha_{0})^{2}\leq\lVert\sigma^{2}G_{i}(\alpha)-\sigma_{0}^{2}G_{i}(\alpha_{0})\rVert_{\rm F}^{2} for every correlation matrix, as there are more than mim_{i} entries that is equal to σ2ασ02α0\sigma^{2}\alpha-\sigma_{0}^{2}\alpha_{0}. Hence, by the relation |αα0||σ2ασ02α0|+|α||σ2σ02||\alpha-\alpha_{0}|\lesssim|\sigma^{2}\alpha-\sigma_{0}^{2}\alpha_{0}|+|\alpha||\sigma^{2}-\sigma_{0}^{2}|, the same rate is also obtained for α\alpha relative to the 2\ell_{2}-norm. The optimal posterior contraction directly follows from Corollary 1. Thus assertions (a)(b) hold.

Next, we verify conditions (C8)(C10) and (C7) to apply Theorems 56 and Corollaries 23.

  • Verification of (C8)(C9): These conditions are trivially satisfied with the zero matrix HH since there is no nuisance mean part.

  • Verification of (C10): Using the results of contraction rates of σ2\sigma^{2} and α\alpha, note that there exists a constant C2>0C_{2}>0 such that {η:dB,n(η,η0)M^2ϵn}{σ2:|σ2σ02|C2ϵn/m¯}×{α:|αα0|C2ϵn/m¯}\{\eta\in{\cal H}:d_{B,n}(\eta,\eta_{0})\leq\hat{M}_{2}\epsilon_{n}\}\subset\{\sigma^{2}:|\sigma^{2}-\sigma_{0}^{2}|\leq C_{2}\epsilon_{n}/\sqrt{\overline{m}}\}\times\{\alpha:|\alpha-\alpha_{0}|\leq C_{2}\epsilon_{n}/\sqrt{\overline{m}}\}. Thus the entropy in (C10) is bounded by 02log(3C2ϵn/m¯δ)0\vee 2\log(3C_{2}\epsilon_{n}/\sqrt{\overline{m}}\delta). By Remark 6, (C10) is bounded by a multiple of {(s5log3p)/n}1/2\{(s_{\star}^{5}\log^{3}p)/n\}^{1/2}, which goes to zero by the assumption since ss0s_{\star}\lesssim s_{0}.

  • Verification of (C7): Using (76), we have dB,n(η1,η2)m¯|σ12σ22|+m¯2|α1α2|d_{B,n}(\eta_{1},\eta_{2})\lesssim\overline{m}|\sigma_{1}^{2}-\sigma_{2}^{2}|+\overline{m}^{2}|\alpha_{1}-\alpha_{2}| for every η1,η2^n\eta_{1},\eta_{2}\in\widehat{\cal H}_{n}. Since the parameter spaces of α\alpha and σ2\sigma^{2} are Euclidean and hence separable under the 2\ell_{2}-metric, condition (C7) is satisfied.

Therefore, under (C7), the distributional approximation in (15) holds with the zero matrix HH by Theorem 5. Under (C7) and (C12), Theorem 6 implies that the no-superset result in (16) holds. The strong results in Corollary 2 and Corollary 3 follow if the beta-min condition (C13) is also met. These prove (c)(e).
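The one-dimensional entropy bounds of the form 0log(3ϵ/δ)0\vee\log(3\epsilon/\delta) invoked in verifying (C10) rest on the standard fact that an interval of radius ϵ\epsilon can be covered by at most 3ϵ/δ13\epsilon/\delta\vee 1 intervals of radius δ\delta (one such log-term per scalar parameter). A minimal numerical sketch, with a helper name of our own choosing:

```python
import math

def delta_net_size(eps, delta):
    """Size of a uniform delta-net of the interval [-eps, eps]:
    centers spaced 2*delta apart, so each point is within delta of a center."""
    return max(1, math.ceil(eps / delta))

# Covering-number bound N(delta, [-eps, eps], |.|) <= max(1, 3*eps/delta),
# whose logarithm is the 0 v log(3*eps/delta) term in the entropy calculations.
for eps, delta in [(1.0, 0.1), (0.5, 0.01), (1.0, 2.0)]:
    assert delta_net_size(eps, delta) <= max(1, 3 * eps / delta)
```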

B.4 Proof of Theorem 10

We verify the conditions for the posterior contraction in Theorem 3 to show (a)(b).

  • Verification of (C1): Using the assumption maxiZisp1\max_{i}\lVert Z_{i}\rVert_{\rm sp}\lesssim 1, note that

    max1inZi(ΨΨ0)ZiTF2ΨΨ0F2max1inZisp41i=1n𝟙(miq)i:miqZi(ΨΨ0)ZiTF2(ZiTZi)1ZiTsp41ni=1nZi(ΨΨ0)ZiTF2,\displaystyle\begin{split}&\max_{1\leq i\leq n}\lVert Z_{i}(\Psi-\Psi_{0})Z_{i}^{T}\rVert_{\rm F}^{2}\\ &\quad\leq\lVert\Psi-\Psi_{0}\rVert_{\rm F}^{2}\max_{1\leq i\leq n}\lVert Z_{i}\rVert_{\rm sp}^{4}\\ &\quad\lesssim\frac{1}{\sum_{i=1}^{n}\mathbbm{1}(m_{i}\geq q)}\sum_{i:m_{i}\geq q}\lVert Z_{i}(\Psi-\Psi_{0})Z_{i}^{T}\rVert_{\rm F}^{2}\lVert(Z_{i}^{T}Z_{i})^{-1}Z_{i}^{T}\rVert_{\rm sp}^{4}\\ &\quad\lesssim\frac{1}{n}\sum_{i=1}^{n}\lVert Z_{i}(\Psi-\Psi_{0})Z_{i}^{T}\rVert_{\rm F}^{2},\end{split} (77)

    where the last inequality holds since mini{ςmin(Zi):miq}1\min_{i}\{\varsigma_{\min}(Z_{i}):m_{i}\geq q\}\gtrsim 1 and i=1n𝟙(miq)n\sum_{i=1}^{n}\mathbbm{1}(m_{i}\geq q)\asymp n. Thus we have an1a_{n}\asymp 1 and en=0e_{n}=0.

  • Verification of (C2): The condition is satisfied with ϵ¯n=(logn)/n\bar{\epsilon}_{n}=\sqrt{(\log n)/n} as Ψ\Psi is fixed dimensional and we have 1ρmin(Ψ0)ρmax(Ψ0)11\lesssim\rho_{\min}(\Psi_{0})\leq\rho_{\max}(\Psi_{0})\lesssim 1.

  • Verification of (C3): The assumption θ0λ1logp\lVert\theta_{0}\rVert_{\infty}\lesssim\lambda^{-1}\log p given in the theorem directly satisfies (C3).

  • Verification of (C4): By Weyl’s inequality, we obtain that

    min1inρmin(σ2Imi+ZiΨ0ZiT)\displaystyle\min_{1\leq i\leq n}\rho_{\min}(\sigma^{2}I_{m_{i}}+Z_{i}\Psi_{0}Z_{i}^{T}) σ2+min1inρmin(ZiΨ0ZiT),\displaystyle\geq\sigma^{2}+\min_{1\leq i\leq n}\rho_{\min}(Z_{i}\Psi_{0}Z_{i}^{T}), (78)
    max1inρmax(σ2Imi+ZiΨ0ZiT)\displaystyle\max_{{1\leq i\leq n}}\rho_{\max}(\sigma^{2}I_{m_{i}}+Z_{i}\Psi_{0}Z_{i}^{T}) σ2+ρmax(Ψ0)max1inZisp2.\displaystyle\leq\sigma^{2}+\rho_{\max}(\Psi_{0})\max_{1\leq i\leq n}\lVert Z_{i}\rVert_{\rm sp}^{2}. (79)

    Since ZiΨ0ZiTZ_{i}\Psi_{0}Z_{i}^{T} is nonnegative definite, the right hand side of (78) is further bounded below by σ2\sigma^{2}, while the right hand side of (79) is bounded. The condition (C4) is thus satisfied.

  • Verification of (C5): For a sufficiently large MM and s=s0(logn/logp)s_{\star}=s_{0}\vee(\log n/\log p), define a sieve as n={Ψ:nMρmin(Σ)ρmax(Σ)eMslogp}{\cal H}_{n}=\{\Psi:n^{-M}\leq\rho_{\min}(\Sigma)\leq\rho_{\max}(\Sigma)\leq e^{Ms_{\star}\log p}\}, so that the minimum eigenvalue condition (6) can be satisfied with logγnlogn\log\gamma_{n}\asymp\log n. Similar to the proof of Theorem 7, it can be easily shown that conditions (7) and (8) are satisfied with ϵn=(slogp)/n\epsilon_{n}=\sqrt{(s_{\star}\log p)/n}.

  • Verification of (C6): The separation condition is trivially satisfied as there is no nuisance mean part.
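In the verification of (C4) above, the Weyl bound (78) is in fact exact: adding σ2I\sigma^{2}I shifts every eigenvalue of ZiΨ0ZiTZ_{i}\Psi_{0}Z_{i}^{T} by exactly σ2\sigma^{2}. A toy numerical check of (78)–(79), with randomly generated matrices standing in for the quantities of the theorem:

```python
import numpy as np

rng = np.random.default_rng(1)
m, q = 5, 3
Z = rng.standard_normal((m, q))       # plays the role of a design matrix Z_i
A = rng.standard_normal((q, q))
Psi0 = A @ A.T + np.eye(q)            # a positive definite Psi_0
sigma2 = 0.7

M = Z @ Psi0 @ Z.T                    # nonnegative definite, rank at most q
eigs_M = np.linalg.eigvalsh(M)
eigs_sum = np.linalg.eigvalsh(sigma2 * np.eye(m) + M)

# sigma^2 I + M has eigenvalues sigma^2 + eig(M), so (78) holds with equality
assert np.allclose(eigs_sum, sigma2 + eigs_M)
# lower bound: rho_min >= sigma^2 since M is nonnegative definite
assert eigs_sum.min() >= sigma2 - 1e-10
# upper bound (79): rho_max <= sigma^2 + rho_max(Psi0) * ||Z||_sp^2
upper = sigma2 + np.linalg.eigvalsh(Psi0).max() * np.linalg.norm(Z, 2) ** 2
assert eigs_sum.max() <= upper + 1e-10
```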

Therefore, the posterior contraction rates for θ\theta are given by Theorem 3 with ss_{\star} replaced by s0s_{0} since s0>0s_{0}>0 and lognlogp\log n\lesssim\log p. The contraction rate for Σ\Sigma relative to the Frobenius norm is a direct consequence of (77). The optimal posterior contraction easily follows from Corollary 1. Thus assertions (a)(b) hold.

Now, we verify conditions (C8)(C10) and (C7) to apply Theorems 56 and Corollaries 23.

  • Verification of (C8)(C9): These conditions are trivially satisfied with the zero matrix HH since there is no nuisance mean part.

  • Verification of (C10): For some C1>0C_{1}>0, the entropy in (C10) is bounded above by a multiple of logN(δ,{Σ:ΣΣ0FM^2C1ϵn},F)0log(3M^2C1ϵn/δ)\log N(\delta,\{\Sigma:\lVert\Sigma-\Sigma_{0}\rVert_{\rm F}\leq\hat{M}_{2}C_{1}\epsilon_{n}\},\lVert\cdot\rVert_{\rm F})\lesssim 0\vee\log(3\hat{M}_{2}C_{1}\epsilon_{n}/\delta) by (77). The expression in (C10) is thus bounded by a constant multiple of {(s5log3p)/n}1/2\{(s_{\star}^{5}\log^{3}p)/n\}^{1/2} by Remark 6. This tends to zero by the assumption since ss0s_{\star}\lesssim s_{0}.

  • Verification of (C7): It is easy to see that dB,n(η,η0)ΨΨ0Fd_{B,n}(\eta,\eta_{0})\lesssim\lVert\Psi-\Psi_{0}\rVert_{\rm F} since maxiZisp1\max_{i}\lVert Z_{i}\rVert_{\rm sp}\lesssim 1. The separability of the space is thus trivial.

Hence, under (C7), Theorem 5 can be applied to obtain the distributional approximation in (15) with the zero matrix HH. Under (C7) and (C12), we obtain the no-superset result in (16) by Theorem 6. The strong results in Corollary 2 and Corollary 3 follow if the beta-min condition (C13) is also met. These establish (c)(e).

B.5 Proof of Theorem 11

We verify the conditions for the posterior contraction in Theorem 3.

  • Verification of (C1): Since Δη,i=Ω1\Delta_{\eta,i}=\Omega^{-1} for every ini\leq n and Ω00+(cL)\Omega_{0}\in{\cal M}_{0}^{+}(cL) for some 0<c<10<c<1, an=1a_{n}=1 and en=0e_{n}=0 satisfy (C1).

  • Verification of (C2): Using (i) of Lemma 10 and the relation 1x1x11-x\asymp 1-x^{-1} as x1x\rightarrow 1, observe that Ω1Ω01FΩΩ0Fϵ¯n\lVert\Omega^{-1}-\Omega_{0}^{-1}\rVert_{\rm F}\lesssim\lVert\Omega-\Omega_{0}\rVert_{\rm F}\lesssim\bar{\epsilon}_{n} if the right hand side is small enough. Thus, there exists a constant C1>0C_{1}>0 such that {Ω:Ω1Ω01Fϵ¯n}{Ω:ΩΩ0FC1ϵ¯n}\{\Omega:\lVert\Omega^{-1}-\Omega_{0}^{-1}\rVert_{\rm F}\leq\bar{\epsilon}_{n}\}\supset\{\Omega:\lVert\Omega-\Omega_{0}\rVert_{\rm F}\leq C_{1}\bar{\epsilon}_{n}\}. Furthermore, although the components of Ω\Omega are not a priori independent as the prior is truncated to 0+(L){\cal M}_{0}^{+}(L), the truncation can only increase prior concentration since Ω00+(cL)\Omega_{0}\in{\cal M}_{0}^{+}(cL) for some 0<c<10<c<1. Hence, for some C2>0C_{2}>0,

    Π(Ω1Ω01Fϵ¯n)Π(ΩΩ0C2ϵ¯n/m¯)(C2ϵ¯nm¯)m¯+d,\displaystyle\Pi\left(\lVert\Omega^{-1}-\Omega_{0}^{-1}\rVert_{\rm F}\leq\bar{\epsilon}_{n}\right)\geq\Pi\left(\lVert\Omega-\Omega_{0}\rVert_{\infty}\leq C_{2}\bar{\epsilon}_{n}/\overline{m}\right)\gtrsim\left(\frac{C_{2}\bar{\epsilon}_{n}}{\overline{m}}\right)^{\overline{m}+d},

    which justifies the choice ϵ¯n(m¯+d)(logn)/n\bar{\epsilon}_{n}\asymp\sqrt{(\overline{m}+d)(\log n)/n} for (C2).

  • Verification of (C3): The assumption θ0λ1logp\lVert\theta_{0}\rVert_{\infty}\lesssim\lambda^{-1}\log p given in the theorem directly satisfies (C3).

  • Verification of (C4): This is trivially met as Ω00+(cL)\Omega_{0}\in{\cal M}_{0}^{+}(cL) for some 0<c<10<c<1.

  • Verification of (C5): Note that the minimum eigenvalue condition (6) is trivially satisfied with γn=1\gamma_{n}=1 since the prior is put on 0+(L){\cal M}_{0}^{+}(L). Now, for r¯n=Mslogp/logn\bar{r}_{n}=Ms_{\star}\log p/\log n with s=s0(nϵ¯n2/logp)s_{\star}=s_{0}\vee(n\bar{\epsilon}_{n}^{2}/\log p) and sufficiently large MM, choose a sieve as n={Ω0+(L):j,k𝟙{ωjk0}r¯n}{\cal H}_{n}=\{\Omega\in{\cal M}_{0}^{+}(L):\sum_{j,k}\mathbbm{1}{\{\omega_{jk}\neq 0\}}\leq\bar{r}_{n}\}, that is, the maximum number of edges of Ω\Omega does not exceed r¯n\bar{r}_{n}. Then, for δn=1/6m¯n3/2\delta_{n}=1/6\overline{m}n^{3/2}, the entropy in (7) is bounded by

    logN(δn/m¯,n,)\displaystyle\log N(\delta_{n}/\overline{m},{\cal H}_{n},\lVert\cdot\rVert_{\infty}) log{(m¯Lδn)m¯+r¯n((m¯2)r¯n)}\displaystyle\leq\log\left\{\left(\frac{\overline{m}L}{\delta_{n}}\right)^{\overline{m}+\bar{r}_{n}}\binom{\binom{\overline{m}}{2}}{\bar{r}_{n}}\right\}
    (m¯+r¯n)log(m¯L/δn)+2r¯nlogm¯,\displaystyle\leq(\overline{m}+\bar{r}_{n})\log(\overline{m}L/\delta_{n})+2\bar{r}_{n}\log\overline{m},

    where the factor (m¯L/δn)m¯(\overline{m}L/\delta_{n})^{\overline{m}} comes from the diagonal elements of Ω\Omega, while the rest comes from the off-diagonal entries. It is easy to see that the last display is bounded by a multiple of slogps_{\star}\log p with the chosen r¯n\bar{r}_{n}, and hence the entropy condition in (7) is satisfied. Lastly, note that for some C3>0C_{3}>0,

    logΠ(n)\displaystyle\log\Pi({\cal H}\setminus{\cal H}_{n}) =logΠ(|Υ|>r¯n)r¯nlogr¯nC3Mslogp.\displaystyle=\log\Pi(|\Upsilon|>\bar{r}_{n})\lesssim-\bar{r}_{n}\log\bar{r}_{n}\leq-C_{3}Ms_{\star}\log p.

    Therefore, condition (8) is satisfied with sufficiently large MM.

  • Verification of (C6): The separation condition is trivially met as there is no nuisance mean part.

Therefore, we obtain the posterior contraction properties for θ\theta by Theorem 3. The theorem also implies that the posterior distribution of Ω1\Omega^{-1} contracts to Ω01\Omega_{0}^{-1} at the rate ϵn=(s0logp(m¯+d)logn)/n\epsilon_{n}=\sqrt{(s_{0}\log p\vee(\overline{m}+d)\log n)/n} with respect to the Frobenius norm. This also translates into contraction of Ω\Omega to Ω0\Omega_{0} at the same rate, since we obtain

ΩΩ0F2Ω1Ω01F2ϵn2,\displaystyle\lVert\Omega-\Omega_{0}\rVert_{\rm F}^{2}\lesssim\lVert\Omega^{-1}-\Omega_{0}^{-1}\rVert_{\rm F}^{2}\lesssim\epsilon_{n}^{2}, (80)

by (i) of Lemma 10 and the inequality 1x1x11-x\asymp 1-x^{-1} as x1x\rightarrow 1. The assertion for the optimal posterior contraction is directly justified by Corollary 1. These prove (a)(b).

Next, we verify conditions (C4)(C7) to obtain the optimal posterior contraction by applying Theorem 4.

  • Verification of (C4)(C5): These conditions are trivially satisfied with the zero matrix HH since there is no nuisance mean part.

  • Verification of (C6): Note that by (80), there exists a constant C4>0C_{4}>0 such that the entropy in (C6) is bounded by logN(δ,{Ω:ΩΩ0FC4ϵ¯n},dB,n)\log N(\delta,\{\Omega:\lVert\Omega-\Omega_{0}\rVert_{\rm F}\leq C_{4}\bar{\epsilon}_{n}\},d_{B,n}) for every δ>0\delta>0. Using (81), the entropy is further bounded by logN(C5δ,{Ω:ΩΩ0FC4ϵ¯n},F)\log N(C_{5}\delta,\{\Omega:\lVert\Omega-\Omega_{0}\rVert_{\rm F}\leq C_{4}\bar{\epsilon}_{n}\},\lVert\cdot\rVert_{\rm F}) for some C5>0C_{5}>0. This is clearly bounded by a multiple of 0m¯2log(3C4ϵ¯n/C5δ)0\vee\overline{m}^{2}\log(3C_{4}\bar{\epsilon}_{n}/C_{5}\delta), and hence using Remark 6 we bound (C6) by a multiple of (s¯m¯)(s¯logp)/n(\sqrt{\bar{s}_{\star}}\vee\overline{m})\sqrt{(\bar{s}_{\star}\log p)/n} which goes to zero by assumption.

  • Verification of (C7): For every Ω1,Ω2^n\Omega_{1},\Omega_{2}\in\widehat{\cal H}_{n}, note that

    Ω11Ω21FΩ1Ω2FΩ11Ω21Fϵn,\displaystyle\lVert\Omega_{1}^{-1}-\Omega_{2}^{-1}\rVert_{\rm F}\lesssim\lVert\Omega_{1}-\Omega_{2}\rVert_{\rm F}\lesssim\lVert\Omega_{1}^{-1}-\Omega_{2}^{-1}\rVert_{\rm F}\lesssim\epsilon_{n}, (81)

    using (i) of Lemma 10 and the inequality 1x1x11-x\asymp 1-x^{-1} as x1x\rightarrow 1 again. By the first inequality, it suffices to show that \cal H is a separable metric space with the Frobenius norm. This is trivial as the parameter space is Euclidean.
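The first inequality in (81) is an instance of the fact that matrix inversion is Lipschitz in the Frobenius norm on matrices whose eigenvalues lie in a fixed interval [1/L,L][1/L,L], via the identity A1B1=A1(BA)B1A^{-1}-B^{-1}=A^{-1}(B-A)B^{-1}. A quick randomized check, with a helper name of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)

def random_spd(r, lo=0.5, hi=2.0):
    """Random symmetric matrix with eigenvalues drawn from [lo, hi]."""
    Q, _ = np.linalg.qr(rng.standard_normal((r, r)))
    return Q @ np.diag(rng.uniform(lo, hi, size=r)) @ Q.T

L = 2.0  # eigenvalues lie in [1/L, L], so ||A^{-1}||_sp <= L
for _ in range(100):
    A, B = random_spd(4), random_spd(4)
    lhs = np.linalg.norm(np.linalg.inv(A) - np.linalg.inv(B), "fro")
    # A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1} gives
    # ||A^{-1} - B^{-1}||_F <= ||A^{-1}||_sp ||B^{-1}||_sp ||A - B||_F
    #                       <= L^2 ||A - B||_F
    assert lhs <= L ** 2 * np.linalg.norm(A - B, "fro") + 1e-9
```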

Hence, under condition (C3), Theorem 4 verifies (c).

Now, we verify conditions (C8)(C10) to apply Theorems 56 and Corollaries 23.

  • Verification of (C8)(C9): These are trivially satisfied for the same reason as (C4)(C5).

  • Verification of (C10): Similar to the verification of (C6), the entropy in (C10) is bounded by a multiple of 0m¯2log(3C6ϵn/δ)0\vee\overline{m}^{2}\log(3C_{6}\epsilon_{n}/\delta) for some C6>0C_{6}>0. Hence using Remark 6 we bound (C10) by a multiple of (sm¯)(slogp)3/n(s_{\star}\vee\overline{m})\sqrt{(s_{\star}\log p)^{3}/n} which goes to zero by assumption.

Therefore, under (C7), we obtain the distributional approximation in (15) with the zero matrix HH by Theorem 5. Under (C7) and (C12), the no-superset result in (16) holds by Theorem 6. Lastly, we obtain the strong results in Corollary 2 and Corollary 3 if the beta-min condition (C13) is also met. These prove (d)(f).

B.6 Proof of Theorem 12

To verify the conditions for Theorem 3, we will use the following properties of B-splines.

For any fα[0,1]f\in\mathfrak{C}^{\alpha}[0,1], there exists βJ\beta_{\ast}\in\mathbb{R}^{J} with β<fα\lVert\beta_{\ast}\rVert_{\infty}<\lVert f\rVert_{\mathfrak{C}^{\alpha}} such that

βTBJfJαfα,\displaystyle\lVert\beta_{\ast}^{T}B_{J}-f\rVert_{\infty}\lesssim J^{-\alpha}\lVert f\rVert_{\mathfrak{C}^{\alpha}}, (82)

by the well-known approximation theory of B-splines [11, page 170]. Writing fβ=βTBJf_{\beta}=\beta^{T}B_{J}, this gives

fβf2,nfβfJαfα+fβfβ.\displaystyle\lVert f_{\beta}-f\rVert_{2,n}\leq\lVert f_{\beta}-f\rVert_{\infty}\lesssim J^{-\alpha}\lVert f\rVert_{\mathfrak{C}^{\alpha}}+\lVert f_{\beta}-f_{\beta_{\ast}}\rVert_{\infty}. (83)

We also use the following inequalities: for every βJ\beta\in\mathbb{R}^{J},

βfββ,β2Jfβ2,nβ2.\displaystyle\lVert\beta\rVert_{\infty}\lesssim\lVert f_{\beta}\rVert_{\infty}\leq\lVert\beta\rVert_{\infty},\quad\lVert\beta\rVert_{2}\lesssim\sqrt{J}\lVert f_{\beta}\rVert_{2,n}\lesssim\lVert\beta\rVert_{2}. (84)

See Lemma E.6 of Ghosal and van der Vaart, [17] for proofs with respect to the LL_{\infty}- and L2L_{2}-norms; this formally justifies the first relation. For the second relation with respect to the empirical L2L_{2}-norm, we assume that ziz_{i} are sufficiently regularly distributed as in (7.12) of Ghosal and van der Vaart, [16].
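The approximation property (82) can be illustrated numerically in the simplest case: degree-one B-splines (hat functions), whose interpolant of a twice continuously differentiable function converges at the sup-norm rate J2J^{-2}, so doubling JJ cuts the error roughly fourfold. A minimal sketch; the test function and helper name are our own choices:

```python
import numpy as np

def linear_spline_error(f, J, grid_size=4001):
    """Sup-norm error of the degree-one B-spline (piecewise-linear)
    interpolant of f on [0, 1] with J equal knot intervals."""
    knots = np.linspace(0.0, 1.0, J + 1)
    x = np.linspace(0.0, 1.0, grid_size)
    return np.max(np.abs(np.interp(x, knots, f(knots)) - f(x)))

f = lambda x: np.sin(2 * np.pi * x)  # a smooth test function
err10 = linear_spline_error(f, 10)
err20 = linear_spline_error(f, 20)
assert err20 < err10 / 3  # consistent with the J^{-2} rate (alpha = 2)
```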

  • Verification of (C1): If v0v_{0} is strictly positive on [0,1][0,1], then v0v_{0} satisfies the same approximation rule in (82) for some β(0,)J\beta_{\ast}\in(0,\infty)^{J} with β<v0α\lVert\beta_{\ast}\rVert_{\infty}<\lVert v_{0}\rVert_{\mathfrak{C}^{\alpha}} (see Lemma E.5 of Ghosal and van der Vaart, [17]). Therefore the approximation in (83) also holds for v0v_{0} even if β\beta is restricted to have positive entries only, and thus by (82) and (84),

    vβv0\displaystyle\lVert v_{\beta_{\ast}}-v_{0}\rVert_{\infty} Jα,for some β(0,)J,\displaystyle\lesssim J^{-\alpha},\quad\text{for some $\beta_{\ast}\in(0,\infty)^{J}$},
    vβ1vβ2\displaystyle\lVert v_{\beta_{1}}-v_{\beta_{2}}\rVert_{\infty} Jvβ1vβ22,n,β1,β2(0,)J,\displaystyle\lesssim\sqrt{J}\lVert v_{\beta_{1}}-v_{\beta_{2}}\rVert_{2,n},\quad\beta_{1},\beta_{2}\in(0,\infty)^{J},

    which tells us that we have anJa_{n}\asymp J and enJ12αe_{n}\asymp J^{1-2\alpha} for (C1).

  • Verification of (C2): Note that if Jαϵ¯nJ^{-\alpha}\lesssim\bar{\epsilon}_{n}, it follows that for some C1>0C_{1}>0,

    logΠ(β:vβv02,nϵ¯n)\displaystyle\log\Pi(\beta:\lVert v_{\beta}-v_{0}\rVert_{2,n}\leq\bar{\epsilon}_{n}) logΠ(β:ββC1ϵ¯n)Jlogϵ¯n.\displaystyle\geq\log\Pi(\beta:\lVert\beta-\beta_{\ast}\rVert_{\infty}\leq C_{1}\bar{\epsilon}_{n})\gtrsim J\log\bar{\epsilon}_{n}.

    This implies that condition (C2) is satisfied with ϵ¯n=(Jlogn)/n\bar{\epsilon}_{n}=\sqrt{(J\log n)/n}.

  • Verification of (C3): The assumption θ0λ1logp\lVert\theta_{0}\rVert_{\infty}\lesssim\lambda^{-1}\log p given in the theorem directly satisfies (C3).

  • Verification of (C4): Since v0v_{0} is strictly positive on [0,1][0,1] and belongs to a fixed multiple of the unit ball of α[0,1]\mathfrak{C}^{\alpha}[0,1], we have that

    1infz[0,1]v0(z)supz[0,1]v0(z)1.\displaystyle 1\lesssim\inf_{z\in[0,1]}v_{0}(z)\leq\sup_{z\in[0,1]}v_{0}(z)\lesssim 1.

    The condition (C4) is thus satisfied.

  • Verification of (C5): For a sufficiently large MM, choose a sieve as n=j=1J{βj:nMβjnM}{\cal H}_{n}=\prod_{j=1}^{J}\{\beta_{j}:n^{-M}\leq\beta_{j}\leq n^{M}\}. Then the minimum eigenvalue condition (6) is satisfied with logγnlogn\log\gamma_{n}\asymp\log n because for every ini\leq n,

    infβnvβ(zi)=infβnj=1JBJ,j(zi)βjinfβnmin1jJβjj=1JBJ,j(zi)nM,\displaystyle\inf_{\beta\in{\cal H}_{n}}v_{\beta}(z_{i})=\inf_{\beta\in{\cal H}_{n}}\sum_{j=1}^{J}B_{J,j}(z_{i})\beta_{j}\geq\inf_{\beta\in{\cal H}_{n}}\min_{1\leq j\leq J}\beta_{j}\sum_{j=1}^{J}B_{J,j}(z_{i})\geq n^{-M},

    where BJ,jB_{J,j} and βj\beta_{j} denote the jjth components of BJB_{J} and β\beta, respectively. To check the entropy condition in (7), note that for every η1,η2n\eta_{1},\eta_{2}\in{\cal H}_{n}, we have dn(η1,η2)β1β2d_{n}(\eta_{1},\eta_{2})\lesssim\lVert\beta_{1}-\beta_{2}\rVert_{\infty} by (84). Hence, for some C2>0C_{2}>0, the entropy in (7) is bounded above by a multiple of

    logN(1C2m¯nM+3/2,{β:βnM},)Jlogn.\displaystyle\log N\left(\frac{1}{C_{2}\overline{m}n^{M+3/2}},\{\beta:\lVert\beta\rVert_{\infty}\leq n^{M}\},\lVert\cdot\rVert_{\infty}\right)\lesssim J\log n.

    The condition (8) holds since an inverse Gaussian prior on each βj\beta_{j} produces Π(n)JeC3nM\Pi({\cal H}\setminus{\cal H}_{n})\lesssim Je^{-C_{3}n^{M}} for some constant C3C_{3}, by its exponentially small bounds for tail probabilities on both sides. By matching Jαϵ¯nJ^{-\alpha}\asymp\bar{\epsilon}_{n} and nϵ¯n2Jlognn\bar{\epsilon}_{n}^{2}\asymp J\log n, we obtain J(n/logn)1/(2α+1)J\asymp(n/\log n)^{1/(2\alpha+1)} and ϵ¯n=(logn/n)α/(2α+1)\bar{\epsilon}_{n}=(\log n/n)^{\alpha/(2\alpha+1)}. Note that the conditions anϵn20a_{n}\epsilon_{n}^{2}\rightarrow 0 and en0e_{n}\rightarrow 0 hold only if α>1/2\alpha>1/2.

  • Verification of (C6): The separation condition holds as there is no additional mean part.

Hence, we obtain the posterior contraction rates for θ\theta by Theorem 3. The contraction rate for vv is also obtained by the same theorem. The assertion for the optimal posterior contraction is directly justified by Corollary 1. Hence we have verified (a)(b).
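The choice of JJ in the verification of (C5) above comes from balancing the spline approximation error against the prior concentration rate:

```latex
% Rate matching: J^{-alpha} ~ bar-epsilon_n and n bar-epsilon_n^2 ~ J log n.
J^{-\alpha}\asymp\bar{\epsilon}_{n},\qquad
n\bar{\epsilon}_{n}^{2}\asymp J\log n
\;\Longrightarrow\;
nJ^{-2\alpha}\asymp J\log n
\;\Longrightarrow\;
J\asymp\Bigl(\frac{n}{\log n}\Bigr)^{1/(2\alpha+1)},\quad
\bar{\epsilon}_{n}\asymp\Bigl(\frac{\log n}{n}\Bigr)^{\alpha/(2\alpha+1)}.
```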

Now, we verify (C4)(C7) for the optimal posterior contraction in Theorem 4.

  • Verification of (C4)(C5): These conditions are trivially satisfied as there is no nuisance mean part.

  • Verification of (C6): Note that by the inequality vβv02,nvβvβ2,n+ϵ¯n\lVert v_{\beta}-v_{0}\rVert_{2,n}\lesssim\lVert v_{\beta}-v_{\beta_{\ast}}\rVert_{2,n}+\bar{\epsilon}_{n}, the entropy in the integrand is bounded by

    logN(δJ,{β:ββ2C4Jϵ¯n},2)0Jlog(3C4ϵ¯nδ),\displaystyle\log N\left(\delta\sqrt{J},\left\{\beta:\lVert\beta-\beta_{\ast}\rVert_{2}\leq C_{4}\sqrt{J}\bar{\epsilon}_{n}\right\},\lVert\cdot\rVert_{2}\right)\lesssim 0\vee J\log\left(\frac{3C_{4}\bar{\epsilon}_{n}}{\delta}\right),

    for some C4>0C_{4}>0. Thus, the second term of (C6) is bounded by Jϵ¯nJ\bar{\epsilon}_{n} by Remark 6, while the first term is bounded by Js¯2(logp)/n\sqrt{J\bar{s}_{\star}^{2}(\log p)/n}. Since s¯=(Jlogn)/logpJ\bar{s}_{\star}=(J\log n)/\log p\lesssim J, (C6) is bounded by Jϵ¯n=(n/logn)(1α)/(2α+1)J\bar{\epsilon}_{n}=(n/\log n)^{(1-\alpha)/(2\alpha+1)}, which tends to zero as α>1\alpha>1.

  • Verification of (C7): For every vβ1,vβ2^nv_{\beta_{1}},v_{\beta_{2}}\in\widehat{\cal H}_{n}, note that dB,n(η1,η2)=vβ1vβ22,nβ1β22d_{B,n}(\eta_{1},\eta_{2})=\lVert v_{\beta_{1}}-v_{\beta_{2}}\rVert_{2,n}\lesssim\lVert\beta_{1}-\beta_{2}\rVert_{2} by (84). Since we put a prior for vv using the B-splines through a Euclidean parameter β\beta, the separability is trivially satisfied.

Therefore, since (C3) is satisfied by the assumption, assertion (c) holds by Theorem 4.

Next, we verify conditions (C8)(C10) to apply Theorems 56 and Corollaries 23.

  • Verification of (C8)(C9): These are trivially satisfied for the same reason as before.

  • Verification of (C10): Similar to the verification of (C6), the entropy of interest is bounded by a constant multiple of 0Jlog(3C5ϵn/δ)0\vee J\log({3C_{5}\epsilon_{n}}/{\delta}) for some C5>0C_{5}>0. Thus, (C10) is bounded above by a multiple of {(s2J)J(slogp)3/n}1/2\{(s_{\star}^{2}\vee J)J(s_{\star}\log p)^{3}/n\}^{1/2} by Remark 6, and hence goes to zero by the assumption. The condition α>2\alpha>2 is seen to be necessary by the inequality

    (s2J)J(slogp)3/nJ2n2ϵ¯n6=n2(α+2)/(2α+1)(logn)2(3α1)/(2α+1).(s_{\star}^{2}\vee J)J(s_{\star}\log p)^{3}/n\geq J^{2}n^{2}\bar{\epsilon}_{n}^{6}=n^{2(-\alpha+2)/(2\alpha+1)}(\log n)^{2(3\alpha-1)/(2\alpha+1)}.

Under (C7), the distributional approximation in (15) holds with the zero matrix HH by Theorem 5. Under (C7) and (C12), the no-superset result in (16) holds by Theorem 6. We also obtain the strong results in Corollary 2 and Corollary 3 if the beta-min condition (C13) is also met. These prove (d)(f).

B.7 Proof of Theorem 13

We verify the conditions for the posterior contraction in Theorem 3.

  • Verification of (C1): Since Δη,i=σ2\Delta_{\eta,i}=\sigma^{2} for every ini\leq n and σ02\sigma_{0}^{2} belongs to the support of the prior, we have an=1a_{n}=1 and en=0e_{n}=0.

  • Verification of (C2): Note that we write dn2(η,η0)=|σ2σ02|2+gβg02,n2d_{n}^{2}(\eta,\eta_{0})=\lvert\sigma^{2}-\sigma_{0}^{2}\rvert^{2}+\lVert g_{\beta}-g_{0}\rVert_{2,n}^{2}. To verify the prior concentration condition, observe that

    logΠ(η:dn(η,η0)ϵ¯n)\displaystyle\log\Pi\left(\eta\in{\cal H}:d_{n}(\eta,\eta_{0})\leq\bar{\epsilon}_{n}\right)
    logΠ(β:gβg02,nϵ¯n2)+logΠ(σ:|σ2σ02|ϵ¯n2),\displaystyle\quad\geq\log\Pi\left(\beta:\lVert g_{\beta}-g_{0}\rVert_{2,n}\leq\frac{\bar{\epsilon}_{n}}{\sqrt{2}}\right)+\log\Pi\left(\sigma:|\sigma^{2}-\sigma_{0}^{2}|\leq\frac{\bar{\epsilon}_{n}}{\sqrt{2}}\right),

    where the second term on the right hand side is trivially bounded below by a constant multiple of logϵ¯n\log\bar{\epsilon}_{n}. Using (82)–(84), it is easy to see that if Jαϵ¯nJ^{-\alpha}\lesssim\bar{\epsilon}_{n},

    logΠ(β:gβg02,nϵ¯n2)logΠ(β:ββC1ϵ¯n)Jlogϵ¯n,\displaystyle\log\Pi\left(\beta:\lVert g_{\beta}-g_{0}\rVert_{2,n}\leq\frac{\bar{\epsilon}_{n}}{\sqrt{2}}\right)\geq\log\Pi(\beta:\lVert\beta-\beta_{\ast}\rVert_{\infty}\leq C_{1}\bar{\epsilon}_{n})\gtrsim J\log\bar{\epsilon}_{n},

    for some C1>0C_{1}>0. Since α¯α\bar{\alpha}\leq\alpha, this implies that (C2) is satisfied with ϵ¯n=(Jlogn)/n\bar{\epsilon}_{n}=\sqrt{(J\log n)/n}.

  • Verification of (C3): The assumption θ0λ1logp\lVert\theta_{0}\rVert_{\infty}\lesssim\lambda^{-1}\log p given in the theorem directly satisfies the condition.

  • Verification of (C4): This is directly satisfied by σ021\sigma_{0}^{2}\asymp 1.

  • Verification of (C5): For a sufficiently large constant MM and s=s0(Jlogn/logp)s_{\star}=s_{0}\vee(J\log n/\log p), choose n={gβ:βnM}×{σ:nMσ2eMslogp}{\cal H}_{n}=\{g_{\beta}:\lVert\beta\rVert_{\infty}\leq n^{M}\}\times\{\sigma:n^{-M}\leq\sigma^{2}\leq e^{Ms_{\star}\log p}\}, from which the minimum eigenvalue condition (6) is directly satisfied with logγnlogn\log\gamma_{n}\asymp\log n. To check the entropy condition in (7), note that for every η1,η2n\eta_{1},\eta_{2}\in{\cal H}_{n}, we have dn2(η1,η2)β1β22+|σ12σ22|2d_{n}^{2}(\eta_{1},\eta_{2})\lesssim\lVert\beta_{1}-\beta_{2}\rVert_{\infty}^{2}+\lvert\sigma_{1}^{2}-\sigma_{2}^{2}\rvert^{2} by (84). Hence, for some C3>0C_{3}>0, the entropy in (7) is bounded above by a multiple of

    logN(1C3m¯nM+3/2,{β:βnM},)\displaystyle\log N\left(\frac{1}{C_{3}\overline{m}n^{M+3/2}},\{\beta:\lVert\beta\rVert_{\infty}\leq n^{M}\},\lVert\cdot\rVert_{\infty}\right)
    +logN(1C3m¯nM+3/2,{σ:σ2eMslogp},||).\displaystyle\quad+\log N\left(\frac{1}{C_{3}\overline{m}n^{M+3/2}},\{\sigma:\sigma^{2}\leq e^{Ms_{\star}\log p}\},\lvert\cdot\rvert\right).

    The display is further bounded by a multiple of Jlogn+slogpJ\log n+s_{\star}\log p, and hence (7) is satisfied with ϵn=(slogp)/n\epsilon_{n}=\sqrt{(s_{\star}\log p)/n}. Using the tail bounds of normal and inverse gamma distributions, condition (8) is also satisfied.

  • Verification of (C6): The separation condition holds by Remark 3 as we have dA,n(η,η0)=gβg02,nϵ¯nd_{A,n}(\eta_{\ast},\eta_{0})=\lVert g_{\beta_{\ast}}-g_{0}\rVert_{2,n}\lesssim\bar{\epsilon}_{n} for η=(gβ,σ02)\eta_{\ast}=(g_{\beta_{\ast}},\sigma_{0}^{2}) in view of (82).

Therefore, the contraction rates for θ\theta are given by Theorem 3. The rate for gg is also obtained by the same theorem. The assertion for the optimal posterior contraction is directly justified by Corollary 1. We thus see (a)(b) hold.

Now, we verify (C4)(C7) for Theorem 4.

  • Verification of (C4): Observe that the left hand side of the first line of (C4) is equal to

    1(s01)logpξ~η0Hξ~η022\displaystyle\frac{1}{(s_{0}\vee 1)\log p}\lVert\tilde{\xi}_{\eta_{0}}-H\tilde{\xi}_{\eta_{0}}\rVert_{2}^{2} =nσ02(s01)logpg0β^JTBJ2,n2,\displaystyle=\frac{n}{\sigma_{0}^{2}(s_{0}\vee 1)\log p}\lVert g_{0}-\hat{\beta}_{J}^{T}B_{J}\rVert_{2,n}^{2},

    where β^J=(WJTWJ)1WJT(g0(z1),,g0(zn))T\hat{\beta}_{J}=(W_{J}^{T}W_{J})^{-1}W_{J}^{T}(g_{0}(z_{1}),\dots,g_{0}(z_{n}))^{T} is the least squares solution. Since β^J\hat{\beta}_{J} is the solution minimizing g0β^JTBJ2,n2\lVert g_{0}-\hat{\beta}_{J}^{T}B_{J}\rVert_{2,n}^{2}, for some βJ\beta_{\ast}\in\mathbb{R}^{J}, the last display is bounded above by

    nσ02logpg0βTBJ2nJ2αlogp,\displaystyle\frac{n}{\sigma_{0}^{2}\log p}\lVert g_{0}-\beta_{\ast}^{T}B_{J}\rVert_{\infty}^{2}\lesssim\frac{n}{J^{2\alpha}\log p}, (85)

    by (82), where s01s_{0}\vee 1 is replaced by 11 as s0s_{0} is unknown. Plugging in J(n/logn)1/(2α¯+1)J\asymp(n/\log n)^{1/(2\bar{\alpha}+1)}, it is easy to see that the right hand side of (85) is of the same order as (logn)2α/(2α¯+1)n(2α+2α¯+1)/(2α¯+1)/logp(\log n)^{2\alpha/(2\bar{\alpha}+1)}n^{(-2\alpha+2\bar{\alpha}+1)/(2\bar{\alpha}+1)}/\log p. This tends to zero by the given boundedness assumption. The necessary condition α¯<α\bar{\alpha}<\alpha is implied by this, because logp=o(n)\log p=o(n). The second condition of (C4) is satisfied by Remark 5.

  • Verification of (C5): Let η~n(θ,η)=(gβ()+BJT()(WJTWJ)1WJTX(θθ0),σ2)\tilde{\eta}_{n}(\theta,\eta)=(g_{\beta}(\cdot)+B_{J}^{T}(\cdot)(W_{J}^{T}W_{J})^{-1}W_{J}^{T}X(\theta-\theta_{0}),\sigma^{2}) for a given θ\theta, where η=(gβ(),σ2)\eta=(g_{\beta}(\cdot),\sigma^{2}). This setting satisfies Φ(η~n(θ,η))=(ξ~η+HX~(θθ0),Δ~η)\Phi(\tilde{\eta}_{n}(\theta,\eta))=(\tilde{\xi}_{\eta}+H\tilde{X}(\theta-\theta_{0}),\tilde{\Delta}_{\eta}). Since each entry of β\beta has the standard normal prior, gβ()g_{\beta}(\cdot) is a zero mean Gaussian process with the covariance kernel K(t1,t2)=BJ(t1)TBJ(t2)K(t_{1},t_{2})=B_{J}(t_{1})^{T}B_{J}(t_{2}), and thus its reproducing kernel Hilbert space (RKHS) 𝕂\mathbb{K} is the set of all functions of the form kζkBJ(tk)TBJ()\sum_{k}\zeta_{k}B_{J}(t_{k})^{T}B_{J}(\cdot) with coefficients ζk\zeta_{k}, k{1,2,}k\in\{1,2,\dots\}. It is easy to see that the shift (θθ0)TXTWJ(WJTWJ)1BJ()(\theta-\theta_{0})^{T}X^{T}W_{J}(W_{J}^{T}W_{J})^{-1}B_{J}(\cdot) is in the RKHS 𝕂\mathbb{K} since it is expressed as (θθ0)TXTWJ(WJTWJ)1W~J1W~JBJ()(\theta-\theta_{0})^{T}X^{T}W_{J}(W_{J}^{T}W_{J})^{-1}\tilde{W}_{J}^{-1}\tilde{W}_{J}B_{J}(\cdot) using an invertible matrix W~JJ×J\tilde{W}_{J}\in\mathbb{R}^{J\times J} with rows BJ(tk)B_{J}(t_{k}) evaluated at some tkt_{k}, k=1,,Jk=1,\dots,J. Hence, by the Cameron-Martin theorem, for ν=(ν1,,νJ)T=(W~JT)1(WJTWJ)1WJTX(θθ0)\nu=(\nu_{1},\dots,\nu_{J})^{T}=(\tilde{W}_{J}^{T})^{-1}(W_{J}^{T}W_{J})^{-1}W_{J}^{T}X(\theta-\theta_{0}) and 𝕂\lVert\cdot\rVert_{\mathbb{K}} the RKHS norm, we see that

    logdΠn,θdΠn,θ0(η)\displaystyle\log\frac{d\Pi_{n,\theta}}{d\Pi_{n,\theta_{0}}}(\eta) =k=1Jνkgβ(tk)12νTW~JBJ𝕂2=νTW~Jβ12W~JTν22,\displaystyle=\sum_{k=1}^{J}\nu_{k}g_{\beta}(t_{k})-\frac{1}{2}\lVert\nu^{T}\tilde{W}_{J}B_{J}\rVert_{\mathbb{K}}^{2}=\nu^{T}\tilde{W}_{J}\beta-\frac{1}{2}\lVert\tilde{W}_{J}^{T}\nu\rVert_{2}^{2},

    almost surely. This gives that

    |logdΠn,θdΠn,θ0(η)|β2(WJTWJ)1WJTX(θθ0)2+(WJTWJ)1WJTX(θθ0)22.\displaystyle\begin{split}\left\lvert\log\frac{d\Pi_{n,\theta}}{d\Pi_{n,\theta_{0}}}(\eta)\right\rvert&\lesssim\lVert\beta\rVert_{2}\lVert(W_{J}^{T}W_{J})^{-1}W_{J}^{T}X(\theta-\theta_{0})\rVert_{2}\\ &\quad+\lVert(W_{J}^{T}W_{J})^{-1}W_{J}^{T}X(\theta-\theta_{0})\rVert_{2}^{2}.\end{split} (86)

    Note that we have

    supη~nβ2\displaystyle\sup_{\eta\in\widetilde{\cal H}_{n}}\lVert\beta\rVert_{2} supη~nββ2+β2\displaystyle\leq\sup_{\eta\in\widetilde{\cal H}_{n}}\lVert\beta-\beta_{\ast}\rVert_{2}+\lVert\beta_{\ast}\rVert_{2}
    Jsupη~ngβgβ2,n+1Jϵ¯n+1,\displaystyle\lesssim\sqrt{J}\sup_{\eta\in\widetilde{\cal H}_{n}}\lVert g_{\beta}-g_{\beta_{\ast}}\rVert_{2,n}+1\lesssim\sqrt{J}\bar{\epsilon}_{n}+1,

    and

    supθΘ~n(WJTWJ)1WJTX(θθ0)2\displaystyle\sup_{\theta\in\widetilde{\Theta}_{n}}\lVert(W_{J}^{T}W_{J})^{-1}W_{J}^{T}X(\theta-\theta_{0})\rVert_{2} WJspsupθΘ~nX(θθ0)2ρmin(WJTWJ)Jϵ¯n,\displaystyle\lesssim\frac{\lVert W_{J}\rVert_{\rm sp}\sup_{\theta\in\widetilde{\Theta}_{n}}\lVert X(\theta-\theta_{0})\rVert_{2}}{\rho_{\min}(W_{J}^{T}W_{J})}\lesssim\sqrt{J}\bar{\epsilon}_{n},

    using (84). Since Jϵ¯n\sqrt{J}\bar{\epsilon}_{n} is bounded due to α¯1/2\bar{\alpha}\geq 1/2, (86) is bounded.

  • Verification of (C6): Since the entropy in the integral in (C6) is bounded above by a multiple of 0log(3M~2ϵ¯n/δ)0\vee\log(3\tilde{M}_{2}\bar{\epsilon}_{n}/\delta) for every δ>0\delta>0, the second term of (C6) is bounded by a constant multiple of ϵ¯n\bar{\epsilon}_{n} due to Remark 6. The first term is ϵ¯n2n/logp=(logn)2α¯/(2α¯+1)n(α¯+1/2)/(2α¯+1)/logp\bar{\epsilon}_{n}^{2}\sqrt{n/\log p}=(\log n)^{2\bar{\alpha}/(2\bar{\alpha}+1)}n^{(-\bar{\alpha}+1/2)/(2\bar{\alpha}+1)}/\sqrt{\log p} that tends to zero by the boundedness assumption.

  • Verification of (C7): Since we have dB,n(η1,η2)=|σ12σ22|d_{B,n}(\eta_{1},\eta_{2})=|\sigma_{1}^{2}-\sigma_{2}^{2}| for every σ12,σ22(0,)\sigma_{1}^{2},\sigma_{2}^{2}\in(0,\infty) and the parameter space of σ2\sigma^{2} is Euclidean, the condition is trivially satisfied.

Therefore, assertion (c) holds by Theorem 4 since (C3) is also satisfied by the given assumption.

Lastly, we verify conditions (C8)(C10) to apply Theorems 56 and Corollaries 23.

  • Verification of (C8): Similar to the verification of (C4), the first line of (C8) is equal to

    s2logpξ~η0Hξ~η022\displaystyle s_{\star}^{2}\log p\lVert\tilde{\xi}_{\eta_{0}}-H\tilde{\xi}_{\eta_{0}}\rVert_{2}^{2} ns2logpJ2α.\displaystyle\lesssim\frac{ns_{\star}^{2}\log p}{J^{2\alpha}}.

    Plugging in J(n/logn)1/(2α¯+1)J\asymp(n/\log n)^{1/(2\bar{\alpha}+1)}, it is easy to see that this tends to zero by the given boundedness condition, which requires that α¯<α1/2\bar{\alpha}<\alpha-1/2.

  • Verification of (C9): Similar to the verification of (C5), we now have

    supη^nβ2\displaystyle\sup_{\eta\in\widehat{\cal H}_{n}}\lVert\beta\rVert_{2} s(Jlogp)/n+1,\displaystyle\lesssim s_{\star}\sqrt{(J\log p)/n}+1,
    supθΘ^n(WJTWJ)1WJTX(θθ0)2\displaystyle\sup_{\theta\in\widehat{\Theta}_{n}}\lVert(W_{J}^{T}W_{J})^{-1}W_{J}^{T}X(\theta-\theta_{0})\rVert_{2} WJspsupθΘ^nX(θθ0)2ρmin(WJTWJ)\displaystyle\lesssim\frac{\lVert W_{J}\rVert_{\rm sp}\sup_{\theta\in\widehat{\Theta}_{n}}\lVert X(\theta-\theta_{0})\rVert_{2}}{\rho_{\min}(W_{J}^{T}W_{J})}
    s(Jlogp)/n.\displaystyle\lesssim s_{\star}\sqrt{(J\log p)/n}.

    Since Jlogn=nϵ¯n2slogpJ\log n=n\bar{\epsilon}_{n}^{2}\leq s_{\star}\log p, (86) tends to zero as s5log3p=o(n)s_{\star}^{5}\log^{3}p=o(n).

  • Verification of (C10): By the similar calculations as before, we see that (C10) is bounded by (s5log3p/n)1/2(s_{\star}^{5}\log^{3}p/n)^{1/2} which tends to zero. The condition α¯>1\bar{\alpha}>1 is necessary since (s5log3p)/nn2ϵ¯n6=(logn)6α¯/(2α¯+1)n2(α¯1)/(2α¯+1)(s_{\star}^{5}\log^{3}p)/n\geq n^{2}\bar{\epsilon}_{n}^{6}=(\log n)^{6\bar{\alpha}/(2\bar{\alpha}+1)}n^{-2(\bar{\alpha}-1)/(2\bar{\alpha}+1)}.

Therefore, under (C7), we have the distributional approximation in (15) by Theorem 5. Under (C7) and (C12), Theorem 6 implies that the no-superset result in (16) holds. The stronger assertions in (17) and (18) are explicitly derived from Corollary 2 and Corollary 3 if the beta-min condition (C13) is also met.

Appendix C Auxiliary results

Here we provide some auxiliary results used to prove the main results.

Lemma 9.

Let pkp_{k} be the density of Nr(μk,Σk){\rm N}_{r}(\mu_{k},\Sigma_{k}) for k=1,2k=1,2. Then,

\begin{align*}
K(p_{1},p_{2})&=\frac{1}{2}\left\{\log\frac{\det\Sigma_{2}}{\det\Sigma_{1}}+{\rm tr}(\Sigma_{1}\Sigma_{2}^{-1})-r+\lVert\Sigma_{2}^{-1/2}(\mu_{1}-\mu_{2})\rVert_{2}^{2}\right\},\\
V(p_{1},p_{2})&=\frac{1}{2}\Big\{{\rm tr}(\Sigma_{1}\Sigma_{2}^{-1}\Sigma_{1}\Sigma_{2}^{-1})-2\,{\rm tr}(\Sigma_{1}\Sigma_{2}^{-1})+r\Big\}+\lVert\Sigma_{1}^{1/2}\Sigma_{2}^{-1}(\mu_{1}-\mu_{2})\rVert_{2}^{2}.
\end{align*}
Proof.

Let $Z=\Sigma_{1}^{-1/2}(X-\mu_{1})\sim{\rm N}_{r}(0,I)$ for $X\sim p_{1}$ and $A=\Sigma_{1}^{1/2}\Sigma_{2}^{-1}\Sigma_{1}^{1/2}$. Then by direct calculations, we have

\begin{align*}
K(p_{1},p_{2})&=\mathbb{E}_{p_{1}}\left\{\log\frac{p_{1}}{p_{2}}(X)\right\}\\
&=\frac{1}{2}\left\{\log\frac{\det\Sigma_{2}}{\det\Sigma_{1}}+\mathbb{E}_{p_{1}}Z^{T}AZ-r+(\mu_{1}-\mu_{2})^{T}\Sigma_{2}^{-1}(\mu_{1}-\mu_{2})\right\},
\end{align*}

which verifies the first assertion because $\mathbb{E}_{p_{1}}Z^{T}AZ={\rm tr}\,A$. After some algebra, we also obtain

\begin{align*}
V(p_{1},p_{2})&=\mathbb{E}_{p_{1}}\left\{\log\frac{p_{1}}{p_{2}}(X)-K(p_{1},p_{2})\right\}^{2}\\
&=\frac{1}{4}\mathbb{E}_{p_{1}}\left\{-Z^{T}Z+Z^{T}AZ+2(\mu_{1}-\mu_{2})^{T}\Sigma_{2}^{-1}\Sigma_{1}^{1/2}Z-{\rm tr}(A)+r\right\}^{2}.
\end{align*}

The rightmost side involves terms of the form $\mathbb{E}_{p_{1}}(ZZ^{T}Q_{1}Z)$ and $\mathbb{E}_{p_{1}}(Z^{T}Q_{1}ZZ^{T}Q_{2}Z)$ for two positive definite matrices $Q_{1}$ and $Q_{2}$. It is easy to see that the former is zero, while it can be shown that the latter equals $2\,{\rm tr}(Q_{1}Q_{2})+{\rm tr}(Q_{1}){\rm tr}(Q_{2})$; see, for example, Lemma 6.2 of Magnus, [22]. Substituting these expressions for the expected values of the products of quadratic forms, it is easy (but tedious) to verify the second assertion. ∎
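As a numerical sanity check on Lemma 9 (not part of the proof), the closed-form expressions for $K(p_{1},p_{2})$ and $V(p_{1},p_{2})$ can be compared with Monte Carlo estimates of the mean and variance of $\log(p_{1}/p_{2})(X)$ under $X\sim p_{1}$. The parameter values below are arbitrary test choices, and the tolerances reflect Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 3
# Arbitrary means and SPD covariance matrices (illustrative test values).
mu1, mu2 = rng.normal(size=r), rng.normal(size=r)
B1, B2 = rng.normal(size=(r, r)), rng.normal(size=(r, r))
S1, S2 = B1 @ B1.T + r * np.eye(r), B2 @ B2.T + r * np.eye(r)
S2inv = np.linalg.inv(S2)
d = mu1 - mu2

# Closed-form K(p1, p2) and V(p1, p2) from Lemma 9.
K = 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1))
           + np.trace(S1 @ S2inv) - r + d @ S2inv @ d)
M = S1 @ S2inv
V = 0.5 * (np.trace(M @ M) - 2.0 * np.trace(M) + r) + d @ S2inv @ S1 @ S2inv @ d

# Monte Carlo: sample X ~ p1 and evaluate log(p1/p2)(X) directly.
def logpdf(X, mu, S):
    Z = X - mu
    _, logdet = np.linalg.slogdet(S)
    quad = np.einsum('ij,jk,ik->i', Z, np.linalg.inv(S), Z)
    return -0.5 * (quad + logdet + len(mu) * np.log(2.0 * np.pi))

X = rng.multivariate_normal(mu1, S1, size=200000)
L = logpdf(X, mu1, S1) - logpdf(X, mu2, S2)
K_err, V_err = abs(L.mean() - K), abs(L.var() - V)
print(K_err, V_err)  # both small, up to Monte Carlo error
```

The sample mean of the log-likelihood ratio should match $K$ and its sample variance should match $V$, up to the usual $O(n^{-1/2})$ simulation noise.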

Lemma 10.

For $r\times r$ positive definite matrices $\Sigma_{1}$ and $\Sigma_{2}$, let $d_{1},\dots,d_{r}$ be the eigenvalues of $\Sigma_{2}^{1/2}\Sigma_{1}^{-1}\Sigma_{2}^{1/2}$. Then the following assertions hold:

  1. (i)

    $\rho_{\max}^{-2}(\Sigma_{2})\lVert\Sigma_{1}-\Sigma_{2}\rVert_{\rm F}^{2}\leq\sum_{k=1}^{r}(d_{k}^{-1}-1)^{2}\leq\rho_{\min}^{-2}(\Sigma_{2})\lVert\Sigma_{1}-\Sigma_{2}\rVert_{\rm F}^{2}$,

  2. (ii)

    $\max_{k}|d_{k}-1|$ can be made arbitrarily small if $g^{2}(\Sigma_{1},\Sigma_{2})$ is chosen sufficiently small, where $g$ is defined in (33).

Proof.

Let $A=\Sigma_{2}^{-1/2}\Sigma_{1}\Sigma_{2}^{-1/2}$. Since the eigenvalues of $A-I_{r}$ are $d_{1}^{-1}-1,\dots,d_{r}^{-1}-1$, we can see that $\lVert\Sigma_{1}-\Sigma_{2}\rVert_{\rm F}^{2}$ is equal to

\[
\lVert\Sigma_{2}^{1/2}(A-I_{r})\Sigma_{2}^{1/2}\rVert_{\rm F}^{2}\leq\rho_{\max}^{2}(\Sigma_{2})\lVert A-I_{r}\rVert_{\rm F}^{2}=\rho_{\max}^{2}(\Sigma_{2})\sum_{k=1}^{r}(d_{k}^{-1}-1)^{2}.
\]

For the reverse inequality, using the bound $\lVert BC\rVert_{\rm F}\leq\lVert B\rVert_{\rm sp}\lVert C\rVert_{\rm F}$, it can be seen that $\sum_{k=1}^{r}(d_{k}^{-1}-1)^{2}$ is equal to

\[
\lVert A-I_{r}\rVert_{\rm F}^{2}=\lVert\Sigma_{2}^{-1/2}(\Sigma_{1}-\Sigma_{2})\Sigma_{2}^{-1/2}\rVert_{\rm F}^{2}\leq\rho_{\max}^{2}(\Sigma_{2}^{-1})\lVert\Sigma_{1}-\Sigma_{2}\rVert_{\rm F}^{2}=\rho_{\min}^{-2}(\Sigma_{2})\lVert\Sigma_{1}-\Sigma_{2}\rVert_{\rm F}^{2}.
\]

These verify (i). Now, note that by direct calculations,

\begin{align*}
\frac{(\det\Sigma_{1})^{1/4}(\det\Sigma_{2})^{1/4}}{\det((\Sigma_{1}+\Sigma_{2})/2)^{1/2}}&=\left\{\frac{1}{2^{r}}\det(A^{1/2}+A^{-1/2})\right\}^{-1/2}\\
&=\left\{\prod_{k=1}^{r}\frac{1}{2}(d_{k}^{1/2}+d_{k}^{-1/2})\right\}^{-1/2}.
\end{align*}

Hence, $g^{2}(\Sigma_{1},\Sigma_{2})<\delta$ for a sufficiently small $\delta>0$ implies that

\[
\prod_{k=1}^{r}\frac{1}{2}(d_{k}^{1/2}+d_{k}^{-1/2})<(1-\delta^{2}/2)^{-2}.
\]

Since every term in the product in the last display is greater than or equal to 1, we have $(d_{k}^{1/2}+d_{k}^{-1/2})/2<(1-\delta^{2}/2)^{-2}$ for every $k$. As a function of $d_{k}$, $(d_{k}^{1/2}+d_{k}^{-1/2})/2$ attains its global minimum at $d_{k}=1$, and hence $\delta$ can be chosen small enough to make $|d_{k}-1|$ small for every $k=1,\dots,r$, which establishes (ii). ∎
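The two-sided bound in part (i) of Lemma 10 can likewise be verified numerically; the matrices below are arbitrary positive definite test choices.

```python
import numpy as np

rng = np.random.default_rng(1)
r = 4
# Arbitrary SPD matrices Sigma_1, Sigma_2 (illustrative test values).
B1, B2 = rng.normal(size=(r, r)), rng.normal(size=(r, r))
S1, S2 = B1 @ B1.T + np.eye(r), B2 @ B2.T + np.eye(r)

# d_1, ..., d_r: eigenvalues of Sigma_2^{1/2} Sigma_1^{-1} Sigma_2^{1/2}.
w2, U2 = np.linalg.eigh(S2)
S2h = U2 @ np.diag(np.sqrt(w2)) @ U2.T          # symmetric square root of S2
d = np.linalg.eigvalsh(S2h @ np.linalg.inv(S1) @ S2h)

mid = np.sum((1.0 / d - 1.0) ** 2)              # sum_k (d_k^{-1} - 1)^2
fro2 = np.linalg.norm(S1 - S2, 'fro') ** 2
lower = fro2 / w2.max() ** 2                    # rho_max^{-2}(S2) * ||S1 - S2||_F^2
upper = fro2 / w2.min() ** 2                    # rho_min^{-2}(S2) * ||S1 - S2||_F^2
ok = lower <= mid <= upper
print(ok)  # True
```

The middle quantity equals $\lVert A-I_{r}\rVert_{\rm F}^{2}$ for $A=\Sigma_{2}^{-1/2}\Sigma_{1}\Sigma_{2}^{-1/2}$, so the check exercises exactly the chain of inequalities in the proof.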

References

  • [1] Atchadé, Y. A. (2017). On the contraction properties of some high-dimensional quasi-posterior distributions. The Annals of Statistics, 45(5):2248–2273.
  • [2] Bai, R., Moran, G. E., Antonelli, J., Chen, Y., and Boland, M. R. (2020). Spike-and-slab group lassos for grouped regression and sparse generalized additive models. Journal of the American Statistical Association, to appear.
  • [3] Belitser, E. and Ghosal, S. (2020). Empirical Bayes oracle uncertainty quantification for regression. The Annals of Statistics, 48(6):3113–3137.
  • [4] Bickel, P. J. and Kleijn, B. J. (2012). The semiparametric Bernstein–von Mises theorem. The Annals of Statistics, 40(1):206–237.
  • [5] Bondell, H. D. and Reich, B. J. (2012). Consistent high-dimensional Bayesian variable selection via penalized credible regions. Journal of the American Statistical Association, 107(500):1610–1624.
  • [6] Carroll, R. J., Ruppert, D., Crainiceanu, C. M., and Stefanski, L. A. (2006). Measurement Error in Nonlinear Models: A Modern Perspective. Chapman and Hall/CRC.
  • [7] Castillo, I. (2012). A semiparametric Bernstein–von Mises theorem for Gaussian process priors. Probability Theory and Related Fields, 152(1-2):53–99.
  • [8] Castillo, I., Schmidt-Hieber, J., and van der Vaart, A. (2015). Bayesian linear regression with sparse priors. The Annals of Statistics, 43(5):1986–2018.
  • [9] Castillo, I. and van der Vaart, A. (2012). Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. The Annals of Statistics, 40(4):2069–2101.
  • [10] Chae, M., Lin, L., and Dunson, D. B. (2019). Bayesian sparse linear regression with unknown symmetric error. Information and Inference: A Journal of the IMA, 8(3):621–653.
  • [11] De Boor, C. (1978). A Practical Guide to Splines. Springer, New York.
  • [12] Fikioris, G. (2018). Spectral properties of Kac–Murdock–Szegö matrices with a complex parameter. Linear Algebra and its Applications, 553:182–210.
  • [13] Fuller, W. A. (1987). Measurement Error Models. John Wiley & Sons.
  • [14] Gao, C., van der Vaart, A. W., and Zhou, H. H. (2020). A general framework for Bayes structured linear models. The Annals of Statistics, 48(5):2848–2878.
  • [15] Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. The Annals of Statistics, 28(2):500–531.
  • [16] Ghosal, S. and van der Vaart, A. (2007). Convergence rates of posterior distributions for noniid observations. The Annals of Statistics, 35(1):192–223.
  • [17] Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press.
  • [18] Jeong, S. (2020). Posterior contraction in group sparse logit models for categorical responses. arXiv preprint arXiv:2010.03513.
  • [19] Jeong, S. and Ghosal, S. (2020). Posterior contraction in sparse generalized linear models. Biometrika, to appear.
  • [20] Johnson, V. E. and Rossell, D. (2012). Bayesian model selection in high-dimensional settings. Journal of the American Statistical Association, 107(498):649–660.
  • [21] Kulkarni, D., Schmidt, D., and Tsui, S.-K. (1999). Eigenvalues of tridiagonal pseudo-Toeplitz matrices. Linear Algebra and its Applications, 297:63–80.
  • [22] Magnus, J. R. (1978). The moments of products of quadratic forms in normal variables. Statistica Neerlandica, 32(4):201–210.
  • [23] Martin, R., Mess, R., and Walker, S. G. (2017). Empirical Bayes posterior concentration in sparse high-dimensional linear models. Bernoulli, 23(3):1822–1847.
  • [24] Narisetty, N. N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. The Annals of Statistics, 42(2):789–817.
  • [25] Ning, B., Jeong, S., and Ghosal, S. (2020). Bayesian linear regression for multivariate responses under group sparsity. Bernoulli, 26(3):2353–2382.
  • [26] Ročková, V. (2018). Bayesian estimation of sparse signals with a continuous spike-and-slab prior. The Annals of Statistics, 46(1):401–437.
  • [27] Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515.
  • [28] Song, Q. and Liang, F. (2017). Nearly optimal Bayesian shrinkage for high dimensional regression. arXiv preprint arXiv:1712.08964.
  • [29] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer.