Unified Bayesian theory of sparse linear regression with nuisance parameters

Research is partially supported by a Faculty Research and Professional Development Grant from the College of Sciences of North Carolina State University.
Abstract
We study frequentist asymptotic properties of Bayesian procedures for high-dimensional Gaussian sparse regression when unknown nuisance parameters are involved. Nuisance parameters can be finite-, high-, or infinite-dimensional. A mixture of point masses at zero and continuous distributions is used for the prior distribution on sparse regression coefficients, and appropriate prior distributions are used for nuisance parameters. The optimal posterior contraction of sparse regression coefficients, hampered by the presence of nuisance parameters, is also examined and discussed. It is shown that the procedure yields strong model selection consistency. A Bernstein-von Mises-type theorem for sparse regression coefficients is also obtained for uncertainty quantification through credible sets with guaranteed frequentist coverage. Asymptotic properties of numerous examples are investigated using the theories developed in this study.
1 Introduction
While Bayesian model selection for classical low-dimensional problems has a long history, sparse estimation in high-dimensional regression was studied much later; see Bondell and Reich, [5], Johnson and Rossell, [20], and Narisetty and He, [24] for consistent Bayesian model selection methods in high-dimensional linear models. Extensive theoretical investigations, however, have been carried out only very recently. Since the pioneering work of Castillo et al., [8], frequentist asymptotic properties of Bayesian sparse regression have been discovered under various settings, and there is now a substantial body of literature [e.g., 23, 1, 28, 3, 26, 2, 10, 25, 14, 19, 18].
Most of the existing studies deal with sparse regression setups without nuisance parameters and there are only a few exceptions. An unknown variance parameter, the simplest type of nuisance parameters, was incorporated for high-dimensional linear regression in Song and Liang, [28] and Bai et al., [2]. In these studies, the optimal properties of Bayesian procedures are characterized with continuous shrinkage priors. For more involved models, Chae et al., [10] adopted a nonparametric approach to estimate unknown symmetric densities in sparse linear regression. Ning et al., [25] considered a sparse linear model for vector-valued response variables with unknown covariance matrices.
Although nuisance parameters may not be of primary interest, a modeling framework requires a complete description of their roles since they explicitly parameterize the model. One may therefore want to achieve optimal estimation properties for the sparse regression coefficients regardless of the nuisance parameter. It may also be of interest to examine posterior contraction of nuisance parameters as a secondary objective. Despite this, there has been no attempt to consider a general class of high-dimensional regression models with nuisance parameters. In this study, we consider a general form of Gaussian sparse regression in the presence of nuisance parameters and establish a theoretical framework for Bayesian procedures.
We formulate a general framework to treat sparse regression models in a unified way as follows. Let be possibly an infinite-dimensional nuisance parameter taking values in a set . For each and an integer for some , suppose that there are a vector and a positive definite matrix which define a regression model for a vector-valued response variable against covariates given by
(1)
where is a vector of regression coefficients. Here (and ) can increase with . We consider the high-dimensional situation where , but is assumed to be sparse, with many coordinates zero. The form in (1) clearly includes sparse linear regression with unknown error variances. Our main interest lies in more complicated setups. As will be shortly discussed in Section 1.1, many interesting examples belong to form (1).
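As a concrete illustration of a model of form (1), the following minimal sketch simulates one assumed instance: each response vector is Gaussian with a mean that is linear in the sparse coefficients plus a nuisance-dependent shift, and with a nuisance-dependent covariance. The particular choice of nuisance parameter (a scalar error variance) and all names are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def simulate_model(n, p, s0, rng=None):
    """Minimal sketch of a Gaussian sparse-regression model of form (1).

    Assumed specification (illustrative only): each response vector has mean
    X_i @ theta + xi_i(eta) and covariance Sigma_i(eta), where the nuisance
    eta is taken here to be a scalar error variance.
    """
    rng = np.random.default_rng(rng)
    m = 3                                   # dimension of each response vector
    theta = np.zeros(p)
    theta[:s0] = rng.normal(0.0, 2.0, s0)   # sparse coefficients: s0 nonzero entries
    eta = 1.5                               # nuisance parameter (error variance here)
    data = []
    for _ in range(n):
        X_i = rng.normal(size=(m, p))
        xi_i = np.zeros(m)                  # nuisance-dependent mean shift (zero here)
        Sigma_i = eta * np.eye(m)           # nuisance-dependent covariance
        Y_i = rng.multivariate_normal(X_i @ theta + xi_i, Sigma_i)
        data.append((Y_i, X_i))
    return theta, data

theta_true, data = simulate_model(n=50, p=200, s0=5, rng=0)
```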
In this paper, we develop a unified theory of posterior asymptotics in the high-dimensional sparse regression models described by form (1). To the best of our knowledge, there is no study thus far considering a general modeling framework of sparse regression as in (1), even from the frequentist perspective. The results on complicated high-dimensional regression models are only available at model-specific levels and cannot be universally used for different model classes. On the other hand, our approach is a unified theoretical treatment of the general model structure in (1) under the Bayesian framework. We establish general theorems on nearly optimal posterior contraction rates, a Bernstein-von Mises theorem via shape approximation to the posterior distribution of , and model selection consistency.
The general theory of posterior contraction using the canonical root-average-squared Hellinger metric on the joint density [16] is not very useful in this context, since recovering rates in terms of the metric of interest on the regression coefficients requires some boundedness conditions [19]. To deal with this issue, we construct an exponentially powerful likelihood ratio test on small pieces of the alternative that are sufficiently separated from the true parameters in terms of the average Rényi divergence of order 1/2 (which coincides with the average negative log-affinity). This test provides posterior contraction relative to the corresponding divergence. The posterior contraction rates of and can then be recovered in terms of the metrics of interest under mild conditions on the parameter space. Due to a nuisance parameter , the resulting posterior contraction for may be suboptimal. Conditions for the optimal posterior contraction will also be examined. Our results show that the obtained posterior contraction rates are adaptive to the unknown sparsity level.
For a Bernstein-von Mises theorem and selection consistency, stronger conditions are required than those used for posterior contraction, in line with the existing literature [e.g., 8, 23]. As pointed out by Chae et al., [10], the Bernstein-von Mises theorems for finite-dimensional parameters in classical semiparametric models [e.g., 7] may not be directly useful in the high-dimensional context. We thus directly characterize a version of the Bernstein-von Mises theorem for model (1). The key idea is to find a suitable orthogonal projection that satisfies some required conditions, which is typically straightforward if the support of a prior for is a linear space. The complexity of the space of covariance matrices, measured by its metric entropy, also plays an important role in deriving the Bernstein-von Mises theorem and selection consistency. Combining these two ingredients leads to a single component of normal distributions for an approximation, which enables us to correctly quantify the remaining uncertainty about the parameter through the posterior distribution.
1.1 Sparse linear regression with nuisance parameters
As briefly discussed above, the form in (1) is general and includes many interesting statistical models. Here we provide specific examples belonging to (1) in detail. In Section 5, these examples will be used to apply the main results developed in this study.
Example 1 (Multiple response models with missing components).
We consider a general multiple response model with missing values, which is very common in practice. Suppose that for each , a vector of responses with covariance matrix is supposed to be observed, but for the th group (or subject) only entries are actually observed with the rest missing. Letting be the th observation and be the augmented vector of and missing entries, we can write and , where is the submatrix of the identity matrix with the th column included if the th element of is observed, . Assuming that the mean of is only for covariates and sparse coefficients with , the model of interest can be written as , , . The model belongs to the class described by (1) with and for .
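A small numerical sketch of this missing-data structure follows, under assumed dimensions and an assumed compound-symmetric covariance: a selection matrix built from rows of the identity extracts the observed entries of the response, the design, and the covariance. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 4, 20
theta = np.zeros(p)
theta[:3] = [1.0, -2.0, 0.5]                             # sparse coefficients
Sigma = 0.5 * np.eye(m) + 0.5                            # assumed common covariance (compound symmetric)

def observe(X_full, observed_idx):
    """Return the observed part of one response together with its implied design and covariance."""
    E = np.eye(m)[observed_idx, :]                       # selection matrix: rows of the identity
    Y_full = rng.multivariate_normal(X_full @ theta, Sigma)
    return E @ Y_full, E @ X_full, E @ Sigma @ E.T       # observed response, design, covariance

X_full = rng.normal(size=(m, p))
Y_obs, X_obs, Sigma_obs = observe(X_full, observed_idx=[0, 2, 3])
```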
Example 2 (Multivariate measurement error models).
Suppose that a scalar response variable is connected to fixed covariates with and random covariates with fixed , through the following linear additive relationship: , , , . While is fully observed without noise, we observe a surrogate of as , , where to ensure identifiability, is assumed to be known. This type of model is called a measurement error model or an errors-in-variables model; see Fuller, [13] and Carroll et al., [6] for a complete overview. By direct calculations, the joint distribution of is given by
By writing , , , and with , the model is of form (1) with .
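The direct calculation can be illustrated for one common parameterization of such a model, in which the response depends linearly on fully observed covariates z and latent covariates w, and a noisy surrogate v of w is observed with known measurement-error covariance. The parameterization and all symbol names below are assumptions made only for illustration.

```python
import numpy as np

def joint_law(z, beta_z, beta_w, mu_w, Sigma_w, Sigma_u, sigma2_eps):
    """Joint Gaussian law of (y, v) given z under the assumed parameterization:
    w ~ N(mu_w, Sigma_w), y = z@beta_z + w@beta_w + eps with eps ~ N(0, sigma2_eps),
    and surrogate v = w + u with u ~ N(0, Sigma_u), Sigma_u known."""
    mean_y = z @ beta_z + mu_w @ beta_w
    var_y = beta_w @ Sigma_w @ beta_w + sigma2_eps
    cov_yv = Sigma_w @ beta_w                             # Cov(v, y)
    cov = np.block([[np.array([[var_y]]), cov_yv[None, :]],
                    [cov_yv[:, None], Sigma_w + Sigma_u]])
    return np.concatenate(([mean_y], mu_w)), cov

z = np.array([1.0, 0.5]); beta_z = np.array([2.0, -1.0])
beta_w = np.array([1.0, 0.0, -0.5]); mu_w = np.zeros(3)
mean, cov = joint_law(z, beta_z, beta_w, mu_w, np.eye(3), 0.1 * np.eye(3), sigma2_eps=1.0)
```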
Example 3 (Parametric correlation structure).
For , , suppose that we have a response variable and covariates with . We consider a standard regression model given by , , , but is considered to be possibly increasing. For a known parametric correlation structure and a fixed dimensional Euclidean parameter , we model the covariance matrix as using a variance parameter and a correlation matrix . Examples of include first order autoregressive and moving average correlation matrices. The model belongs to (1) by writing and with .
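The three correlation structures mentioned above admit simple explicit constructions. The sketch below builds them as functions of a correlation parameter rho; the admissible range of rho differs across the three structures, as discussed later in Section 5.3.

```python
import numpy as np

def compound_symmetric(m, rho):
    """Compound-symmetric correlation matrix: 1 on the diagonal, rho elsewhere."""
    return (1 - rho) * np.eye(m) + rho * np.ones((m, m))

def ar1(m, rho):
    """First-order autoregressive correlation matrix: rho**|i-j|."""
    idx = np.arange(m)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def ma1(m, rho):
    """First-order moving average correlation matrix: rho on the first off-diagonals."""
    idx = np.arange(m)
    return np.where(np.abs(idx[:, None] - idx[None, :]) == 1, rho, 0.0) + np.eye(m)

# A covariance matrix of the assumed form: variance parameter times correlation matrix.
Sigma = 2.0 * ar1(5, rho=0.6)
```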
Example 4 (Mixed effects models).
For , , consider a response variable and covariates with and with fixed . A mixed effects model is given by , , , , where is a positive definite matrix. Then the marginal law of is given by , . We assume that is known. The model belongs to (1) by letting and with .
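As a small illustration of how the random effect enters the covariance, the sketch below computes the marginal covariance obtained by integrating out a Gaussian random effect; the names and the assumption that the error variance enters as sigma2 times the identity are illustrative.

```python
import numpy as np

def marginal_covariance(Z_i, Psi, sigma2):
    """Marginal covariance of Y_i = X_i@theta + Z_i@b_i + eps_i after
    integrating out b_i ~ N(0, Psi), with eps_i ~ N(0, sigma2*I) (assumed form)."""
    return Z_i @ Psi @ Z_i.T + sigma2 * np.eye(Z_i.shape[0])

rng = np.random.default_rng(2)
Z_i = rng.normal(size=(4, 2))                    # known random-effect design
Psi = np.array([[1.0, 0.3], [0.3, 0.5]])         # random-effect covariance (nuisance)
Sigma_i = marginal_covariance(Z_i, Psi, sigma2=1.0)
```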
Example 5 (Graphical structure with sparse precision matrices).
For a response variable and covariates with increasing and , consider a model given by , , , where is a sparse coefficient vector and the precision matrix is a positive definite matrix. Along with , we also impose sparsity on the off-diagonal entries of , which accounts for a graphical structure between observations. More precisely, if an off-diagonal entry is zero, it implies the conditional independence between the two concerned entries of given the remaining ones, and we suppose that most off-diagonal entries are actually zero, even though we do not know their locations. The model is then seen to be a special case of (1) by letting and with .
Example 6 (Nonparametric heteroskedastic regression models).
For a response variable and a row vector of covariates , a linear regression model with a nonparametric heteroskedastic error is given by , , , where is a sparse coefficient vector, is a univariate variance function, and is a one-dimensional variable associated with the th observation that controls the variance of through the variance function . Then the model belongs to (1) by letting and with .
Example 7 (Partial linear models).
Consider a partial linear model given by , , , where is a response variable, is a row vector of covariates with , is a sparse coefficient vector, is a univariate function, and is a scalar predictor. This model is expressed in form (1) by writing and with .
1.2 Outline
The rest of this paper is organized as follows. In Section 2, some notations are introduced and a prior distribution on sparse regression coefficients is specified. Sections 3–4 provide our main results on the posterior contraction, the Bernstein-von Mises phenomenon, and selection consistency of the posterior distribution. In Section 5, our general theorems are applied to the examples considered above to derive the posterior asymptotic properties in each specific example. All technical proofs are provided in the Appendix.
2 Setup, notations, and prior specification
2.1 Notation
Here we describe the notations we use throughout this paper. For a vector and a set of indices, we write to denote the support of , (or ) to denote the cardinality of (or ), and and to separate components of using . In particular, the support of the true parameter and its cardinality are written as and , respectively. The notation , , stands for the -norm and denotes the maximum norm. We write and for the minimum and maximum eigenvalues of a square matrix , respectively. For a matrix , let stand for the spectral norm and stand for the Frobenius norm of . We also define a matrix norm for the th column of , which is used for compatibility conditions. The column space of is denoted by . For further convenience, we write for the minimum singular value of . The notation means the submatrix of with columns chosen by . For sequences and , (or ) stands for for some constant independent of , and means . These inequalities are also used for relations involving constant sequences.
For given parameters and , we write the joint density as for the density of the th observation vector . In particular, the true joint density is expressed as for with the true parameters and . The notation denotes the expectation operator with the true density . For two probability measures and , let denote the total variation between and . For two -variate densities and of independent variables, denote the average Rényi divergence (of order ) by .
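For Gaussian observations, the per-observation negative log-affinity (the Bhattacharyya distance) has a closed form, so an average divergence of this type can be computed explicitly. The sketch below assumes the per-observation divergence is exactly the negative log-affinity between two multivariate normals; the function names are illustrative.

```python
import numpy as np

def neg_log_affinity(mu1, Sigma1, mu2, Sigma2):
    """Negative log-affinity (Bhattacharyya distance) between two Gaussians,
    using the standard closed form for multivariate normal densities."""
    Sbar = 0.5 * (Sigma1 + Sigma2)
    diff = mu1 - mu2
    quad = 0.125 * diff @ np.linalg.solve(Sbar, diff)
    logdet = lambda A: np.linalg.slogdet(A)[1]
    return quad + 0.5 * (logdet(Sbar) - 0.5 * (logdet(Sigma1) + logdet(Sigma2)))

def average_divergence(params1, params2):
    """Average of the per-observation negative log-affinities over i, a sketch
    of the average Renyi-type divergence used in the paper (assumed definition)."""
    return float(np.mean([neg_log_affinity(m1, S1, m2, S2)
                          for (m1, S1), (m2, S2) in zip(params1, params2)]))

# Two observations under each of two parameter values:
p1 = [(np.zeros(2), np.eye(2)), (np.ones(2), np.eye(2))]
p2 = [(np.zeros(2), 2 * np.eye(2)), (np.zeros(2), np.eye(2))]
print(average_divergence(p1, p2))
```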
For any , we define for the two squared pseudo-metrics:
For compatibility conditions, the uniform compatibility number and the smallest scaled singular value are defined as
We write for the observation vector, for the dimension of , and for the parameter space of . Lastly, for a (pseudo-)metric space , let denote the -covering number, the minimal number of -balls that cover .
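As a rough numerical illustration of the compatibility quantities, the sketch below approximates, for a single small design matrix, the uniform compatibility number and the smallest scaled singular value in the sense of Castillo et al., [8], by searching over sparse supports and random directions. The quantities in this paper adapt the same idea to the stacked multi-observation design, so this is only a schematic approximation.

```python
import numpy as np
from itertools import combinations

def compatibility_sketch(X, s, n_dirs=500, rng=None):
    """Approximate the uniform compatibility number and the smallest scaled
    singular value over supports of size at most s (brute force; small p only)."""
    rng = np.random.default_rng(rng)
    p = X.shape[1]
    col_norm = np.max(np.linalg.norm(X, axis=0))      # largest column norm of X
    phi = psi = np.inf
    for k in range(1, s + 1):
        for S in combinations(range(p), k):
            # exact smallest singular value on this support
            sv_min = np.linalg.svd(X[:, list(S)], compute_uv=False)[-1]
            psi = min(psi, sv_min / col_norm)
            # random-direction search for the compatibility ratio on this support
            for _ in range(n_dirs):
                theta = np.zeros(p)
                theta[list(S)] = rng.normal(size=k)
                ratio = (np.linalg.norm(X @ theta) * np.sqrt(k)
                         / (col_norm * np.linalg.norm(theta, 1)))
                phi = min(phi, ratio)
    return phi, psi

X = np.random.default_rng(0).normal(size=(30, 8))
print(compatibility_sketch(X, s=2))
```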
2.2 Prior for the high-dimensional coefficients
In this subsection, we specify a prior distribution for the high-dimensional regression coefficients . A prior for should satisfy the conditions required for the main results, so its specific characterization is deferred to Section 3. On the other hand, the prior for specified here satisfies all the requirements needed for our purposes.
We first select a dimension from a prior , and then randomly choose for given . A nonzero part of is then selected from a prior on while is fixed to zero. The resulting prior specification for is formulated as
(2)
where is the Dirac measure at zero on with suppressed dimensionality. For the prior on the model dimensions, we consider a prior satisfying the following: for some constants ,
(3)
Examples of priors satisfying (3) can be found in Castillo and van der Vaart, [9] and Castillo et al., [8]. For the prior , the -fold product of the exponential power density is considered, where the regularization parameter is allowed to vary with and , i.e.,
(4)
for some constants . The order of is important in that it determines the boundedness requirement of the true signal (see condition (C3) below). A particularly interesting case is obtained when is set to the lower bound . Then the boundedness condition becomes very mild by choosing sufficiently large. When is set to the upper bound, the boundedness condition is still reasonably mild. However, it can actually be relaxed if the true signal is known to be small enough, though we do not pursue this generalization in this study. In Section 4, we shall see that values of that do not increase too fast are in fact necessary for a distributional approximation and selection consistency.
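Sampling from the prior in (2)–(4) is straightforward: draw a dimension, draw a support of that size uniformly at random, and fill the selected coordinates with independent draws from the slab density. The sketch below uses a Laplace slab and a dimension prior proportional to (c p^a)^(-s), which is one example of the priors satisfying (3) from Castillo and van der Vaart, [9]; the constants and the choice of the regularization parameter are illustrative.

```python
import numpy as np

def sample_spike_and_slab(p, lam, c=2.0, a=1.0, rng=None):
    """One draw from a spike-and-slab prior of the form (2)-(4) (illustrative constants)."""
    rng = np.random.default_rng(rng)
    s_grid = np.arange(0, p + 1)
    log_w = -s_grid * (np.log(c) + a * np.log(p))     # unnormalized log prior on the dimension
    w = np.exp(log_w - log_w.max())
    s = rng.choice(s_grid, p=w / w.sum())             # model size
    support = rng.choice(p, size=s, replace=False)    # uniformly chosen support
    theta = np.zeros(p)
    theta[support] = rng.laplace(0.0, 1.0 / lam, size=s)  # Laplace slab with scale 1/lambda
    return theta

theta_draw = sample_spike_and_slab(p=100, lam=np.sqrt(np.log(100)), rng=5)
```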
Remark 1.
Since, unlike Castillo et al., [8], some size restriction on will be imposed, we note that the use of the Laplace density is not essential and other prior distributions may also be used for . For example, normal densities can be used for to exploit semi-conjugacy. However, if its precision parameter is fixed independent of , a normal prior requires a stronger restriction on the true signal than (C3) below. To achieve the nearly optimal posterior contraction, other densities with similar tail properties should also work with appropriate modifications for the true signal size (see, e.g., Jeong and Ghosal, [19]). Instead of the spike-and-slab prior in (2) and (3), a class of continuous shrinkage priors may also be used at the expense of substantial modifications in the technical details [28]. In this paper, we only consider the prior in (2)–(4).
3 Posterior contraction rates
The prior for a nuisance parameter should be chosen to complete the prior specification. Once we assign the prior for the full parameters, the posterior distribution is defined by Bayes’ rule. How the prior for is chosen is crucial for obtaining desirable asymptotic properties of the posterior distribution. In this section, we shall examine such conditions on the prior distribution for a nuisance parameter and study the posterior contraction rates for both and .
The prior for is put on a subspace . In many instances, we take , especially when a nuisance parameter is finite dimensional, but the flexibility of a subspace may be beneficial in infinite-dimensional situations. We need to choose to satisfy certain conditions.
- (C1) There exists a nondecreasing sequence such that
- (C2) For some sequence such that and with satisfying (C1),
The first condition of (C1) implies that we have a good approximation to the true parameter value in the parameter set . This holds trivially if there exists such that for every , which is obviously true if . The second condition of (C1) means that in , the maximum Frobenius norm of the difference between covariance matrices can be controlled by the average Frobenius norm multiplied by the sequence . Clearly, this holds with if is the same for every . By the triangle inequality, we see that (C1) implies that
(5)
which is used throughout the paper. Condition (C2) is typically called the prior concentration condition, which requires a prior to put sufficient mass around the true parameter , measured by the pseudo-metric . As in other infinite-dimensional situations, such a closeness is translated into the closeness in terms of the Kullback-Leibler divergence and variation (see Lemma 1 in Appendix for more details).
As noted in Section 1, the true parameters should be restricted to certain norm-bounded subset of the parameter space. This is clarified as follows.
- (C3) The true signal satisfies .
- (C4) The eigenvalues of the true covariance matrix satisfy
Condition (C3) is required to apply the general strategy for posterior contraction to our modeling framework containing nuisance parameters. More specifically, the condition is imposed so that the prior assigns sufficient mass to a Kullback-Leibler neighborhood of . If nuisance parameters are not present, one can directly handle the model and such a restriction may be removed [e.g., 8, 14]. One may refer to Song and Liang, [28], Ning et al., [25], and Bai et al., [2] for conditions similar to ours, where a variance parameter stands for a nuisance parameter. Still, the condition is mild if is chosen to decrease at an appropriate order. In particular, if is matched to the lower bound , the condition becomes which is very mild if is sufficiently large. Even if the upper bound is chosen, the condition is not restrictive as the right hand side of the condition can be made nondecreasing as long as is increasing at a suitable order. Condition (C4) implies that the eigenvalues of the true covariance matrix are bounded below and above. The lower and upper bounds are required for many of the technical arguments, including the construction of an exponentially powerful test in Lemma 2 in the Appendix.
Remark 2.
Condition (C3) is stronger than necessary, but is adopted for ease of interpretation. For Theorem 3 below to hold, it suffices if we have for satisfying (C2). For the optimal posterior contraction in Theorem 4 below, a slightly stronger bound is needed: (see Lemma 6 and its proof in the Appendix).
3.1 Rényi posterior contraction and recovery
The goal of this subsection is to study posterior contraction of relative to the - and -metrics. To do so, we derive the posterior contraction rate with respect to the average Rényi divergence , and then the rates for relative to more concrete metrics will be recovered from the Rényi contraction.
To proceed, we first need to examine a dimensionality property of the support of . The following theorem shows that the posterior distribution is concentrated on models of relatively small sizes.
Theorem 1 (Dimension).
Compared to the literature [e.g., 8, 23, 3], the rate in Theorem 1 is floored by the extra term . This arises from the presence of a nuisance parameter in the model formulation. To minimize its impact, a prior on should be chosen such that (C2) holds for as small as possible; a suitable choice induces the (nearly) optimal contraction rate.
Using the basic results in Theorem 1, the next theorem obtains the rate at which the posterior distribution contracts at the truth with respect to the average Rényi divergence. The theorem requires additional assumptions on a prior.
- (C5) For with satisfying (C2), a sufficiently large , and some sequences and satisfying , there exists a subset such that
(6) (7) (8)
The above conditions are related to the classical ones in the literature (e.g., see Theorem 2.1 of Ghosal et al., [15]). Condition (6) requires that for every , the minimum eigenvalue of is not too small on a sieve . Although can be any positive sequence, a sequence increasing exponentially fast makes the entropy in (7) too large, resulting in a suboptimal rate . If can be chosen to be smaller than and , then this does not lead to any deterioration of the rate in . The entropy condition (7) is actually stronger than needed. Scrutinizing the proof of the theorem, one can see that the entropy appearing in the theorem is obtained using pieces that are smaller than those giving the exponentially powerful test in Lemma 2 in the Appendix. However, the covering number with those pieces looks more complicated and the form in (7) suffices for all examples in the present paper. Lastly, condition (8) requires that the exterior of the sieve possess sufficiently small prior mass, in order to counteract the factor arising from the lower bound on the denominator of the posterior distribution. In fact, conditions similar to (C2), (7) and (8) are also required for the prior of . By reading the proof, it is easy to see that the prior (2) explicitly satisfies the analogous conditions on an appropriately chosen sieve.
Theorem 2 (Contraction rate, Rényi).
We want to sharpen the rate as much as possible. In most instances, can be chosen such that . This is trivially satisfied if is some polynomial in as in the examples in this paper. If is known to increase much faster than , e.g., for some , then need not be a polynomial in and the condition can be met more easily with a sequence that grows even faster. Note also that we typically have in most cases. These postulates lead to . Indeed, it is often possible to choose , which is commonly guaranteed by choosing an appropriate sieve and a prior. The condition will be made precise in (C5∗) below for recovery and we only consider the situation that in what follows.
Although Theorem 2 provides the basic results for posterior contraction, it does not give precise interpretations for the parameters and themselves, because of the abstruse expression of the average Rényi divergence. The contraction rates with respect to more concrete metrics are recovered under some additional conditions. Under the additional assumption , it can be shown that Theorem 1 and Theorem 2 explicitly imply that for the set
with a sufficiently large constant , the posterior mass of goes to one in probability (see the proof of Theorem 3). To complete the recovery, we need to separate the sum of squares of the mean into and , which requires an additional condition. The conditions required for the recovery are clarified as follows.
- (C5∗)
- (C6)
By expanding the quadratic term for the mean in , one can see that the separation is possible if (C6) is satisfied. Clearly, (C6) is trivially satisfied if the model has only for its mean, in which case we take for every . In many cases where there exists such that , we can often take for the second inequality of (C6) to hold automatically.
The following theorem shows that the posterior distribution of and contracts around their respective true values at some rates, relative to more easily comprehensible metrics than the average Rényi divergence. In the expressions, if , the compatibility numbers should be understood to be equal to 1 for interpretation.
Theorem 3 (Recovery).
The thresholds for contraction depend upon the compatibility conditions, which makes their implications somewhat vague. As is much smaller than , it is not unreasonable to assume that and are bounded away from zero, whence the compatibility numbers are removed from the rates. We refer to Example 7 of Castillo et al., [8] for more discussion. In the next subsection, we will see that one of these restrictions is actually necessary for shape approximation or selection consistency.
Remark 3.
The separation condition (C6) can be left as an assumption to be satisfied, but can also be verified by a stronger condition on the design matrix without resorting to the values of the parameters. Suppose that for some integer , there exists a matrix such that for every , with some map . Since we can write for any , the Cauchy-Schwarz inequality indicates that the first inequality of (C6) is implied by
for . The left hand side is always between and by the Cauchy-Schwarz inequality, and is exactly equal to or if and only if the two vectors are linearly dependent. A sufficient condition for the preceding display is thus since the linear dependence cannot happen under such a condition due to the inequality for such that . This sufficient condition is not restrictive at all if as we already have . Since there typically exists satisfying the second inequality of (C6) as long as provides a good approximation for the true parameter , condition (C6) can be easily satisfied if the sufficient condition is met.
Notwithstanding the lack of formal study of minimax rates with additional complications, we still want to match our rates for with those in simple linear regression, which we call the “optimal” rates. In this sense, Theorem 3 only provides the suboptimal rates for if . Although the theorem gives the optimal results if , it is practically hard to check this condition as is unknown. If is known to be nonzero, the desired conclusion is trivially achieved as soon as . The following corollary, however, shows that the optimal rates are still available even if , with restrictions on and the prior.
Corollary 1 (Optimality under restriction).
The corollary is useful in limited situations, especially when a parametric rate is available for a nuisance parameter. Even if , we need to further assume that , i.e., the ultra high-dimensional setup, to conclude that (a) holds, while we can always apply (b) because . Although assertion (b) holds for any if is chosen sufficiently large, its specific threshold is not directly available. Indeed, by carefully reading the proof of Theorem 1 together with Lemma 1 in Appendix, one can see that the threshold depends on unknown constant bounds for the eigenvalues of the true covariance matrix in (C4). Still, (b) holds for any if . We believe that the assumption is very mild, and hence simply apply (b) with this assumption to conclude the optimal contraction for models with finite dimensional nuisance parameters. The optimal rates can still be achieved for any by verifying the conditions in the following subsection. With finite dimensional nuisance parameters, we do not pursue this direction as it seems an overkill considering the mildness of the assumption , though those conditions are actually required for the Bernstein-von Mises theorem and selection consistency in Section 4.
3.2 Optimal posterior contraction for
Recall that only suboptimal rates may be available from Theorem 3 if . In many semiparametric situations, however, it is often possible to obtain parametric rates for finite-dimensional parameters under stronger conditions, even when there are infinite-dimensional nuisance parameters in a model [4, 7]. It has also been shown that a similar argument holds in some high-dimensional semiparametric regression models [10]. Therefore, it is naturally of interest to examine under what conditions we can replace by in the rates for , even if . Similar to other semiparametric settings [4, 10], this can be established by semiparametric theory, but it requires stronger conditions than those in traditional fixed-dimensional parametric cases because of the high dimensionality of the parameters in our setup.
To proceed, some additional conditions are required for technical reasons, which are made for the size of as the optimal rates are automatically attained if . Still, in a practical sense, the conditions almost always need to be verified to reach the optimal rates, since only oracle rates are generally available and we do not know which term is greater.
In what follows, we write for satisfying the conditions of Theorem 3 through the definition of . We first assume the following condition on the uniform compatibility number.
- (C3) For a sufficiently large , the uniform compatibility number is bounded away from zero.
This condition is weaker than assuming that the smallest scaled singular value is bounded away from zero, as we have for any by the Cauchy-Schwarz inequality. We will also rely on a slightly stronger condition with respect to for a distributional approximation in the following section. In this sense, our condition is weaker than those for Theorem 4 of Castillo et al., [8]. Condition (C3) is not restrictive as (C5∗) requires ; we again refer to Example 7 of Castillo et al., [8].
To precisely describe other conditions, hereafter we use the following additional notations. We write
and to denote the collection of for . In particular, denotes the submatrix of with columns chosen by an index set . We also define the following neighborhoods of the true parameters: for and satisfying (C5∗), and sufficiently large constants and ,
(10)
Combined with the other conditions, Theorem 3 implies that the posterior probabilities of these neighborhoods tend to one in probability if . We need some bounding conditions on these neighborhoods, which will be specified below.
Let for any given . For a given , we choose a bijective map such that for some orthogonal projection which may depend on the true parameter values, but not on and . The projection plays a key role here and for a distributional approximation in the following section, and thus should be appropriately chosen to satisfy the following conditions.
- (C4) The orthogonal projection satisfies
- (C5) The conditional law of given , induced by the prior, is absolutely continuous relative to its distribution at (which is the same as the prior for ), and the Radon-Nikodym derivative satisfies
By reading the proof, one can see that Theorem 4 below is based on the approximate likelihood ratio. The first condition of (C4) is required to control the remainder of an approximation. The second condition of (C4) implies that for every with such that , as the second inequality trivially holds by the fact that is an orthogonal projection. The use of the shifting map is justified by condition (C5), which implies that a shift in certain directions does not substantially affect the prior on . This is related in spirit to the absolute continuity condition in the semiparametric Bernstein-von Mises theorem (see, for example, Theorem 12.8 of Ghosal and van der Vaart, [17]). We will see that a distributional approximation also requires similar, but stronger conditions.
Lastly, the complexity of the neighborhood should also be controlled. Specifically, we make the following condition.
- (C6) For and satisfying (C1) and a sufficiently large ,
- (C7) The parameter space is separable with the pseudo-metric .
Similar to (C4), these conditions are required to control the remainder of an approximation. The integral term comes from the expected supremum of a separable Gaussian process, exploiting the Gaussian likelihood of the model and the separability of with the standard deviation metric. Condition (C7) is crucial for this reason. Since we usually put a prior on in an explicit way, condition (C7) is rarely violated in practice. One may see a connection between the first term of (C6) and the conditions for Corollary 1. The former easily tends to zero even if is increasing, due to the extra term which commonly tends to zero in a polynomial order. Note also that the term appears in (C4) and (C6). Although this gives sharper bounds, the conditions often need to be verified with replaced by as is unknown.
Under the conditions specified above, we obtain the following theorem for the contraction rates for which do not depend on . The compatibility numbers below should be understood to be 1 if .
Theorem 4 (Optimal posterior contraction).
Similar to the discussion following Theorem 3, the compatibility numbers are easily bounded away from zero so that they can be removed from the expressions. These requirements are actually weaker than before as . The simplified rates are then available for ease of interpretation.
Remark 4.
Remark 5.
Suppose that there exists a matrix such that for every with some map . Then, a general strategy to choose is to set for . In this case, by the triangle inequality, the first condition of (C4) is satisfied if there exists such that . For (C8∗) in the next section, this is replaced by . These are trivially the case if there exists such that . Also similar to Remark 3, a sufficient condition for the second line of (C4) is as pre-multiplication of a positive definite matrix by and is an isomorphism. This is also sufficient for (C8∗) in the next section with replaced by .
Remark 6.
In many instances, for every and , we typically have
for some sequences and , especially when the part of involved with is an -dimensional Euclidean parameter. Note that is equal to
If is increasing, the right hand side is bounded by a multiple of by the tail probability of a normal distribution, while it is bounded by a multiple of for nonincreasing . This simplification is useful to verify (C6) in many applications, and can also be used for (C10∗) in the next section.
4 Bernstein-von Mises and selection consistency
An extremely important question is whether the true support is recovered with probability tending to one, a property called selection consistency. We will show this based on a distributional approximation to the posterior distribution. Combined with selection consistency, the shape approximation also leads to an approximation by the product of a point mass and a normal distribution, which we call the Bernstein-von Mises theorem. This reduced approximate distribution enables us to correctly quantify the remaining uncertainty of the parameter through the posterior distribution.
4.1 Shape approximation to the posterior distribution
It is worth noting that selection consistency can often be verified without a distributional approximation. For example, in sparse linear regression with scalar unknown variance , Song and Liang, [28] deployed the marginal likelihood of the model support which can be obtained by integrating out and from the likelihood using the inverse gamma kernel. In our general formulation, however, this approach is hard to implement due to the arbitrary structure of a nuisance parameter . Indeed, the approach is not directly available even for a parametric covariance matrix with dimension . In this sense, using a shape approximation could be a natural solution to the problem, which may require some extra conditions on the parameter space and on the priors for and .
Recall that the results in Section 3.2 are based on semiparametric theory. In this section we will need conditions very similar to those used before, but the requirements are generally stronger, as the remainder of the approximation must be controlled more strictly. Since the setup is high-dimensional, our conditions are even more restrictive than those for semiparametric models with a fixed-dimensional parametric segment [e.g., 7]. One may refer to Section 3.3 of Chae et al., [10] for a relevant discussion.
Throughout this section, we only consider that satisfies the conditions of Theorem 3. First of all, we make a modification of (C3). The following condition is slightly stronger than (C3), but is still not too restrictive as (C5∗) requires .
- (C7∗) Condition (C3) is satisfied with replaced by .
The assumption on the prior for is made only through the regularization parameter . As in Castillo et al., [8], should not increase too fast and should satisfy . In our setup, the range of induces a sufficient condition for this: . Since this is weaker than the one that will be made later in this section, the “small lambda regime” is automatically met by a stronger condition for the entire procedure for a distributional approximation (see (C10∗) below and the following paragraph).
For sufficiently large constants and , we now define the neighborhoods,
(12)
Note that is defined with an -ball, which makes it contract more slowly than in (10) under (C7∗). This is for technical reasons: for a distributional approximation, the -ball must be handled directly on the complement of . The neighborhood is also enlarged to match . We leave more details on this to the reader; refer to the proof of Theorem 5 below.
As in Section 3.2, we choose a bijective map which gives rise to for some orthogonal projection . Again, the orthogonal projection should be carefully chosen to satisfy some boundedness conditions. The conditions are similar to, but stronger than those in Section 3.2. This is not only because of the increased neighborhoods and , but also because the remainder of an approximation should be bounded on their complements. We precisely make the required conditions below.
- (C8∗) The orthogonal projection satisfies
- (C9∗) The conditional law of given , induced by the prior, is absolutely continuous relative to its distribution at , and the Radon-Nikodym derivative satisfies
- (C10∗) For and satisfying (C1) and a sufficiently large ,
Conditions (C8∗)–(C10∗) are required for similar reasons as in Section 3.2. We mention that (C10∗) is a sufficient condition for the small lambda regime, since its necessary condition is that is stronger than . This necessary condition for (C10∗) is often a sufficient condition in many finite dimensional models.
We define the standardized vector,
Under the assumptions above, the posterior distribution of is approximated by given by
(13)
where is the Gaussian measure with mean and precision on the coordinate , is the Dirac measure at zero on , is the least squares solution , and the weights satisfy
Another way to express , for any measurable , is
where denotes the Lebesgue measure and
(14)
It can be easily checked that the two expressions are equivalent. The results are summarized in the following theorem.
4.2 Model selection consistency
The shape approximation to the posterior distribution facilitates obtaining the next theorem which shows that the posterior distribution is concentrated on subsets of the true support with probability tending to one. The result is then used as the basis of selection consistency. Similar to the literature, the theorem requires an additional condition on the prior as follows.
- (C12) The prior satisfies and for .
Theorem 6 (Selection, no supersets).
Since coefficients that are too close to zero cannot be identified by any selection strategy, some threshold for the true nonzero coefficients is needed for detection. The requirement of a threshold is a fundamental limitation in high-dimensional setups. We impose the following threshold, the so-called beta-min condition. The condition is formulated in view of the third assertion of Theorem 4. The second assertion can also be used to obtain a similar threshold, but we only consider the one given below as it is generally weaker.
- (C13) The true parameter satisfies
Since Theorem 3 implies that, under the beta-min condition (C13), the support of includes the true support with posterior probability tending to one, selection consistency is an easy consequence of Theorem 6. Moreover, this improves the distributional approximation in (15) so that the posterior distribution can be approximated by a single component of the mixture; that is, the Bernstein-von Mises theorem holds for the parameter component . The arguments here are summarized in the following two corollaries, whose proofs are straightforward and thus are omitted.
Corollary 2 (Selection consistency).
Corollary 3 (Bernstein-von Mises).
Corollary 3 enables us to quantify the remaining uncertainty of the parameter through the posterior distribution. Specifically, we can construct credible sets for the individual components of as in Castillo et al., [8]. It is easy to see that by the definition of , its th component has a normal distribution, whose mean is the th element of and variance is the th diagonal element of . Correct uncertainty quantification is thus guaranteed by the weak convergence.
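As a concrete illustration of how such credible sets can be formed from the approximation, the sketch below builds marginal credible intervals from a Gaussian approximation with a given center and covariance; the particular center and covariance are placeholders standing in for the quantities delivered by the Bernstein-von Mises approximation.

```python
import numpy as np
from scipy import stats

def componentwise_credible_intervals(center, cov, level=0.95):
    """Marginal credible intervals for each selected coefficient, treating it as
    Gaussian with the given mean and the corresponding diagonal variance."""
    z = stats.norm.ppf(0.5 + level / 2.0)
    half = z * np.sqrt(np.diag(cov))
    return np.column_stack([center - half, center + half])

center = np.array([1.2, -0.7, 0.4])           # approximating mean (placeholder)
cov = np.diag([0.02, 0.05, 0.01])             # approximating covariance (placeholder)
print(componentwise_credible_intervals(center, cov))
```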
5 Applications
In this section, we apply the main results established in this study to the examples considered in Section 1.1. The main objective is to obtain nearly optimal posterior contraction rates and selection consistency via shape approximation to the posterior distribution with the Bernstein-von Mises phenomenon.
To use Corollary 1 for the optimal posterior contraction when , we simply assume that for all examples in this section, although Theorem 4 can also be applied under stronger conditions. The assumption is extremely mild compared with requiring the ultra high-dimensional case, i.e., . A large enough is also sufficient in place of the assumption , but we do not pursue this direction as a specific threshold is not available. We check the conditions of Theorem 4 only for the more complicated models where .
5.1 Multiple response models with missing components
We first apply the main results to Example 1. To recover posterior contraction of from the primitive results, it is necessary to assume that every entry of the response is jointly observed sufficiently many times. To be more specific, let be 1 if the th entry of is observed and be zero otherwise. The contraction rate of the th element of is directly determined by the order of . The ideal case is when this quantity is bounded away from zero, that is, the entries are jointly observed at a rate proportional to . Then the recovery is possible without any loss of information. If decays to zero, then the optimal recovery is not attainable, but consistent estimation may still be possible with slower rates. With an inverse Wishart prior on , the following theorem studies the posterior asymptotic properties of the given model.
Theorem 7.
Assume that , , , and for some nondecreasing such that . Then the following assertions hold.
- (a) The optimal posterior contraction rates for in (11) are obtained.
- (b) The posterior contraction rate for is with respect to the Frobenius norm.
Assume further that and for a sufficiently large . Then the following assertions hold.
5.2 Multivariate measurement error models
We now consider Example 2. For convenience we write , , and in what follows. In this subsection, we use the symbol for the Kronecker product of matrices. For the priors on the nuisance parameters, normal prior distributions are assigned to the location parameters (, , and ), and inverse gamma and inverse Wishart priors are used for the scale parameters ( and ). The next theorem shows the posterior asymptotic properties of the model. In particular, specific forms of the mean and variance for the shape approximation are provided, reflecting the modeling structure.
Theorem 8.
Assume that , , , , , , and for a sufficiently large . Then the following assertions hold.
- (a) The optimal posterior contraction rates for in (11) are obtained.
- (b) The contraction rates for , , , and are relative to the -norms. The same rate is also obtained for with respect to the Frobenius norm.
Assume further that and for a sufficiently large . Then the following assertions hold.
- (c)
- (d) If and for , then the no-superset result in (16) holds.
- (e)
We note that the marginal law of is given by . This suggests that the rates for and may actually be improved to the parametric rate (possibly up to logarithmic factors). However, the other parameters are connected to the high-dimensional coefficients , so such a parametric rate may not be obtainable for them.
5.3 Parametric correlation structure
Next, our main results are applied to Example 3. A correlation matrix should be chosen so that the conditions in the main theorems can be satisfied. Here we consider compound-symmetric, first-order autoregressive, and first-order moving average correlation matrices: for with fixed boundaries and of the range, respectively, , , and . The range is chosen so that the corresponding correlation matrix is positive definite, i.e., for , for , and for . Again, an inverse gamma prior is assigned to . For a prior on , we consider a density
for some such that for close to and for close to .
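One concrete density with this kind of boundary behavior, given here only as an illustrative assumption and not necessarily the choice used in the paper, is proportional to exp{-1/(r - l) - 1/(u - r)} on the interval (l, u); its mass vanishes extremely fast at both endpoints.

```python
import numpy as np
from scipy.integrate import quad

def make_boundary_decaying_prior(l, u):
    """Density on (l, u) proportional to exp(-1/(r - l) - 1/(u - r)); an
    illustrative prior whose tails vanish very fast at both boundaries."""
    def unnorm(r):
        r = np.clip(r, l + 1e-12, u - 1e-12)
        return np.exp(-1.0 / (r - l) - 1.0 / (u - r))
    Z, _ = quad(unnorm, l, u)                     # numerical normalizing constant
    return lambda r: unnorm(r) / Z

prior = make_boundary_decaying_prior(l=-0.9, u=0.9)
print(prior(0.0), prior(0.89))                    # the density is negligible near the boundary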
Theorem 9.
Assume that , , , , , for some fixed . Suppose further that for the compound-symmetric correlation matrix and for the autoregressive and moving average correlation matrices. Then the following assertions hold.
- (a) For any correlation matrix discussed above, the optimal posterior contraction rates for in (11) are obtained.
- (b) For the autoregressive and moving average correlation matrices, the posterior contraction rates for and are with respect to the -norms. For the compound-symmetric correlation matrix, their contraction rates are relative to the -norm.
Assume further that and for a sufficiently large . Then the following assertions hold.
As for the prior for , the property that the tail probabilities decay to zero exponentially fast near both zero and one is crucial for the optimal posterior contraction rates. It should be noted that many common probability distributions with compact supports may not be enough for this purpose (e.g., beta distributions).
The main difference between this example and those in the preceding subsections is that we consider a possibly increasing here. Although we have the same form of contraction rates for as in the previous examples, the implication is not the same due to a different order of . For increasing , it is expected that , which is commonly the case in regression settings. This reduces to for the cases with fixed , and hence an increasing may help achieve faster rates. While the increasing dimensionality of is often a benefit for the contraction properties of , this may or may not be the case for the nuisance parameters, since it depends on the dimensionality of . In the example in this subsection, the dimension of the nuisance parameters is fixed although can increase, which makes their posterior contraction rates faster than those with fixed . However, this may not be true if is of increasing dimension; see the example in Section 5.5.
5.4 Mixed effects models
For the mixed effects models with sparse regression coefficients in Example 4, we assume that the maximum of is bounded, which is particularly mild if is bounded. We also assume that and , that is, is likely to be larger than with fixed probability and has full rank. These conditions are required for (C1) to hold. We put an inverse Wishart prior on as in other examples. The following theorem shows the posterior asymptotic properties of the mixed effects models.
Theorem 10.
Assume that , , , , , , and . Then the following assertions hold.
- (a) The optimal posterior contraction rates for in (11) are obtained.
- (b) The posterior contraction rate for is with respect to the Frobenius norm.
Assume further that and for a sufficiently large . Then the following assertions hold.
Note that we assume that is known, which is actually unnecessary at the modeling stage. The assumption was made to find a sequence satisfying (C1) with ease. This can be relaxed only with stronger assumptions on . For example, if and is an all-one vector, then the model is equivalent to that with a compound-symmetric correlation matrix in Section 5.3 with some reparameterization, in which can be treated as unknown.
5.5 Graphical structure with sparse precision matrices
For the graphical structure models in Example 5, we define an edge-inclusion indicator such that if and otherwise, where is the th element of . We put a prior with a density on to the nonzero off-diagonal entries and a prior with a density on to the diagonal entries of , such that the support is truncated to a matrix space with restricted eigenvalues and entries. For the edge-inclusion indicator, we use a binomial prior with probability when is given, and assign a prior to such that . The prior specification is summarized as
where is a collection of positive definite matrices for a sufficiently large , in which eigenvalues are between and entries are also bounded by in absolute value.
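The sketch below mimics this prior by drawing edge indicators, filling the corresponding off-diagonal entries, and keeping only draws that land in a truncated set of well-conditioned matrices; the densities, the eigenvalue bounds, and the fixed edge probability are illustrative stand-ins for the quantities left unspecified above.

```python
import numpy as np

def draw_sparse_precision(m, edge_prob, rng, eig_bounds=(0.1, 10.0), max_tries=1000):
    """One draw of a sparse precision matrix, rejected until it falls in a
    truncated set of positive definite matrices with bounded eigenvalues and entries."""
    lo, hi = eig_bounds
    for _ in range(max_tries):
        Omega = np.diag(rng.uniform(1.0, 2.0, m))          # diagonal entries
        for i in range(m):
            for j in range(i + 1, m):
                if rng.random() < edge_prob:                # edge-inclusion indicator
                    Omega[i, j] = Omega[j, i] = rng.normal(0.0, 0.3)
        eigs = np.linalg.eigvalsh(Omega)
        if lo < eigs.min() and eigs.max() < hi and np.abs(Omega).max() < hi:
            return Omega                                    # accepted: inside the truncated set
    raise RuntimeError("no draw fell in the truncated matrix space")

Omega = draw_sparse_precision(m=6, edge_prob=0.2, rng=np.random.default_rng(3))
```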
Theorem 11.
Let for . Assume that , , , for some such that , for some , and . Then the following assertions hold.
- (a)
- (b) The posterior contraction rate of is with respect to the Frobenius norm.
If further and for a sufficiently large , then the following assertion holds.
- (c) The optimal posterior contraction rates for in (11) are obtained even if .
Assume further that and for a sufficiently large . Then the following assertions hold.
Note that increasing is likely to improve the -norm contraction rate for as we expect that . In particular, the improvement is clearly the case if and for a sufficiently large . However, as pointed out in Section 5.3, this is not the case for as its dimension is also increasing.
If we assume that , then the term arising from the sparse precision matrix becomes . The latter is comparable to the frequentist convergence rate of the graphical lasso in Rothman et al., [27]. Therefore, our rate is deemed to be optimal considering the additional complication due to the mean term involving sparse regression coefficients.
5.6 Nonparametric heteroskedastic regression models
Next, we use the main results for Example 6. For a bounded, convex subset , define the -Hölder class as the collection of functions such that , where
with the th derivative of and the largest integer that is strictly smaller than . Let the true function belong to with the assumption that is strictly positive. While suffices for the basic posterior contraction, we will see that the optimal posterior contraction for requires . The stronger condition is needed even for the Bernstein-von Mises theorem and the selection consistency, but all these conditions are mild if the true function is sufficiently smooth.
We put a prior on through B-splines. The function is expressed as a linear combination of -dimensional B-spline basis terms of order , i.e., , while an inverse Gaussian prior distribution is independently assigned to each entry of . For any measurable function , we let and denote the sup-norm and empirical -norm, respectively. To deploy the properties of B-splines, we assume that are sufficiently regularly distributed on .
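A draw from this prior can be produced directly with standard B-spline tools. The sketch below uses an assumed number of basis terms, spline order, and inverse Gaussian parameters, so it only illustrates the construction; positive coefficients keep the resulting variance function positive because B-spline basis functions are nonnegative.

```python
import numpy as np
from scipy.interpolate import BSpline

def draw_variance_function(J=10, order=4, rng=None):
    """One draw of the variance function: a B-spline expansion whose J
    coefficients get independent inverse Gaussian (Wald) draws."""
    rng = np.random.default_rng(rng)
    k = order - 1                                          # spline degree
    knots = np.r_[np.zeros(k), np.linspace(0.0, 1.0, J - k + 1), np.ones(k)]
    coef = rng.wald(1.0, 1.0, size=J)                      # positive coefficients
    return BSpline(knots, coef, k, extrapolate=False)

v = draw_variance_function(rng=6)
print(v(np.linspace(0.05, 0.95, 5)))                       # the draw evaluated on a grid
```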
Theorem 12.
The true function is assumed to be strictly positive on and belong to with . We choose . Let for and assume that , , and . Then the following assertions hold.
- (a)
- (b) The posterior contraction rate for is with respect to the -norm.
If further and for a sufficiently large , then the following assertion holds.
- (c) The optimal posterior contraction rates for in (11) are obtained even if .
Assume further that , and for a sufficiently large . Then the following assertions hold.
An inverse Gaussian prior is used due to the property that its tail probabilities at both zero and infinity decay to zero exponentially fast. The exponentially decaying tail probabilities in both directions are essential to obtain the optimal contraction rate. Note that standard choices such as gamma and inverse gamma distributions do not satisfy this property.
By investigating the proof, it can be seen that the condition is required to satisfy condition (C1) for posterior contraction, so this condition is not avoidable in applying the main theorems. Unlike Theorem 13 below, assertion (c) does not require any further boundedness condition. This is because the restriction makes the required bound tend to zero. For the Bernstein-von Mises theorem and the selection consistency, it can be seen that is necessary for the condition but not sufficient. Although the requirement is implied by the latter condition, we specify this in the statement due to its importance. We refer to the proof of Theorem 12 for more details.
5.7 Partial linear models
Lastly, we consider Example 7. We assume that the true function belongs to for with . Any suffices for the basic posterior contraction, but stronger restrictions are required for further assertions as in Theorem 12. We put a prior on through -dimensional B-spline basis terms of order , i.e., . With a given , we define the design matrix . The standard normal prior is independently assigned to each component of and an inverse gamma prior is assigned to . Similar to Section 5.6, we assume that are sufficiently regularly distributed on .
Theorem 13.
The true function is assumed to satisfy with . We choose for some . Let for and assume that , , , , and for a sufficiently large . Then the following assertions hold.
- (a)
- (b) The contraction rates for and are with respect to the - and -norms, respectively.
If further , , and for a sufficiently large , then the following assertion holds.
- (c) The optimal posterior contraction rates for in (11) are obtained even if .
Assume that , , , and for a sufficiently large . Then the following assertions hold.
Here we elaborate more on the choice of the number of basis terms. For assertions (a)–(b), can be chosen such that , which gives rise to the optimal rates for the nuisance parameters. This choice, however, does not satisfy (C4) and (C8∗), and hence we need a better approximation for with some to control the remaining bias strictly. For example, if , the boundedness condition for (c) reduces to , which gives the optimal contraction for by (a). Therefore, to incorporate the case that , there is a need to consider some appropriate that is strictly smaller than . For the Bernstein-von Mises theorem and the selection consistency, the required restriction becomes even stronger such that .
Appendix A Proofs for the main results
In this section, we provide proofs of the main theorems. We first describe the additional notations used for the proofs. For a matrix , we write for the eigenvalues of in decreasing order. The notation stands for the likelihood ratio of and . Let denote the expectation operator with the density and let denote the probability operator with the true density. For two densities and , let and stand for the Kullback-Leibler divergence and variation, respectively. Using some constants , we rewrite (C4) as for clarity.
A.1 Proof of Theorem 1
We first state a lemma showing that the denominator of the posterior distribution is bounded below by a factor with probability tending to one, which will be used to prove the main theorems.
Proof.
We define the Kullback-Leibler-type neighborhood for a sufficiently large . Then Lemma 10 of Ghosal and van der Vaart, [16] implies that for any ,
(20)
Hence, it suffices to show that is bounded below as in the lemma. By Lemma 9, the Kullback-Leibler divergence and variation of the th observation are given by
where are the eigenvalues of . For with small and the cardinality of , we see that on ,
(21)
Since every satisfies for small , observe that
where the first inequality follows by the relation as and the second inequality holds by (i) of Lemma 10 in Appendix. Since by (21), it follows using (5) that for some constants ,
Combining this with (21), we conclude that on , which implies that is small for all sufficiently large , by (i) of Lemma 10 and the inequality as . Hence, can be expanded in the powers of to get for every and . Furthermore, since is sufficiently small, we obtain that by (i) of Lemma 10, and that by the restriction on the eigenvalues of . Combining these results, it follows that on , both and are bounded above by a constant multiple of . Hence, can be chosen sufficiently large such that
(22)
by the inequality . The logarithm of the second term on the rightmost side is bounded below by a constant multiple of by (C2). To find the lower bound for the first term, we shall first work with the case , and then show that the same lower bound is obtained even when .
Now, assume that and let for to be chosen later. Then
(23)
by the inequality . Using the relation (6.2) of Castillo et al., [8] and the assumption on the prior in (4), the integral on the rightmost side satisfies
(24)
for , and thus the rightmost side of (23) is bounded below by
by the inequality . Choosing , the first term on the rightmost side of (22) satisfies
Note that and if , and thus the last display implies that there exists a constant such that
If , the first term of (22) is clearly bounded below by , so that the same lower bound for in the last display is also obtained since we have . Finally, the lemma follows from (20). ∎
Proof of Theorem 1.
For the set with any integer , we see that is equal to
Let be the event in (19). Since is nonnegative, by Fubini’s theorem and Lemma 1,
(25)
for some constant and sufficiently large . For a sufficiently large constant , choose the largest integer that is smaller than for . Replacing by in the last display, it is easy to see that the rightmost side goes to zero. The proof is complete since by Lemma 1. ∎
A.2 Proof of Theorems 2–3 and Corollary 1
The following lemma shows that a small piece of the alternative centered at any is locally testable with exponentially small errors, provided that the center is sufficiently separated from the truth with respect to the average Rényi divergence. Theorem 2 for posterior contraction relative to the average Rényi divergence will then be proved by showing that the number of those pieces is controlled by the target rate. We write for the density with , and and for the expectation and probability with , respectively.
Lemma 2.
Proof.
For given such that , consider the most powerful test given by the Neyman-Pearson lemma. It is then easy to see that
(27)
The first inequality of the lemma is a direct consequence of the first line of the preceding display. For the second inequality of the lemma, note that by the Cauchy-Schwarz inequality, we have
Thus, by the second line of (27), it suffices to show for every . Defining , observe that
on the set , where the second inequality is due to (C1). Since the leftmost side of the display is further bounded below by for every , we have that
(28)
Since and for every , (28) implies that is nonsingular for every , and hence on , it can be shown that can be written as
(29)
To bound this, note that is equal to
(30)
where the first inequality holds by (28), the second inequality holds by the inequality for small , and the last inequality holds by the inequality . Now, for every , observe that the exponent in (29) is bounded above by
since for large . Combined with (29) and (30), this completes the proof. ∎
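For reference, the classical fact behind the Neyman-Pearson construction used at the beginning of the proof can be sketched as follows; here $p_i$ and $q_i$ denote generic densities of the independent observations under the truth and under the alternative, and the notation is ours rather than that of the lemma. Writing $P = \prod_{i=1}^{n} p_i$ and $Q = \prod_{i=1}^{n} q_i$, the most powerful test $\phi = \mathbb{1}\{Q \ge P\}$ satisfies
\[
\mathbb{E}_{P}\,\phi + \mathbb{E}_{Q}(1-\phi)
= \int P \wedge Q
\;\le\; \int \sqrt{PQ}
= \prod_{i=1}^{n}\int \sqrt{p_i\, q_i}
= \exp\!\big(-n R_n\big),
\]
where $R_n = -n^{-1}\sum_{i=1}^{n}\log\int\sqrt{p_i q_i}$ is the average Rényi divergence of order $1/2$, so both error probabilities decay exponentially once $R_n$ is bounded away from zero at the target rate.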
Proof of Theorem 2.
Let and . Then for every ,
(31)
where the second term on the right hand side goes to zero by Theorem 1. Hence, it suffices to show that the first term goes to zero for chosen to be the threshold in the theorem. Now, let and define as in (26) with and . Then Lemma 2 implies that small pieces of the alternative densities can be tested with exponentially small errors as long as the center is -separated from the true parameter values relative to the average Rényi divergence. To complete the proof, we shall show that the minimal number of those small pieces that are needed to cover is controlled appropriately in terms of , and that the prior mass of and decreases fast enough to balance the denominator of the posterior distribution. (For more discussion on a construction of a test using metric entropies, see Section D.2 and Section D.3 of Ghosal and van der Vaart, [17].)
Note that for every and ,
by the inequality and the Cauchy-Schwarz inequality. Since and , it is easy to see that we have for
with the same used to define . Hence, is bounded above by
(32)
Note that for any small ,
and thus we obtain
Using the last display and the entropy condition (7), the right hand side of (32) is bounded above by a constant multiple of . Hence, by Lemma D.3 of Ghosal and van der Vaart, [17], for every , there exists a test such that for some , and for every such that . Note that under condition (3) on the prior distribution, we have since is bounded away from zero. Hence, for the event in (19) and some constant , the first term on the right hand side of (31) is bounded by
where the term converges to zero by Lemma 1. Choosing for a sufficiently large , we have
Furthermore, goes to zero by condition (8). Now, to show that goes to zero exponentially fast, observe that
by the inequality for every . Since the tail probability of the Laplace distribution is given by for every , the rightmost side of the last display is bounded above by a constant multiple of
Since by (4), the right hand side is bounded by for some , and thus goes to zero since . Finally, we conclude that the left hand side of (31) goes to zero with . ∎
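The Laplace tail bound invoked above is elementary; for completeness, a generic version is recorded below, where $\lambda > 0$ denotes the scale of the Laplace prior in (4) and the notation is ours.
\[
\theta \sim \mathrm{Laplace}(\lambda), \ \text{i.e., with density } \tfrac{\lambda}{2}e^{-\lambda|\theta|},
\qquad\Longrightarrow\qquad
\Pr\big(|\theta| > t\big) = \int_{t}^{\infty}\lambda e^{-\lambda u}\,du = e^{-\lambda t}, \quad t \ge 0.
\]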
Proof of Theorem 3.
By Theorem 2, we obtain the contraction rate of the posterior distribution with respect to the average Rényi divergence between and given by
Define
(33)
Then Theorem 2 implies that by the last display,
(34)
where the second inequality holds by the inequality . Note that by combining (i) and (ii) of Lemma 10 in Appendix, we obtain if the left hand side is small. Thus, using the same approach in the proof of Lemma 1, (34) is further bounded below by
(35)
for some constants . Since is bounded away from zero and is decreasing, (34) and (35) imply that . Now, it is easy to see that by (5),
which is bounded since . Hence, we see that for satisfying (C6), is bounded by a constant multiple of
The display implies that by Theorem 2 and (C6). Combining the results verifies the third and fourth assertions of the theorem. For the remainder, observe that for such that . Therefore by Theorem 1, the first and the second assertions readily follow from the definitions of and . ∎
A.3 Proof of Theorem 4
To prove Theorem 4, we first provide preliminary results. Some of these will also be used to prove Theorems 5–6.
Lemma 3.
Proof.
If , the left hand side in the probability operator is zero, and the assertion trivially holds. We thus only consider the case below.
By Markov’s inequality, it suffices to show that there exists a positive sequence such that
(37)
Let be the block-diagonal matrix formed by stacking , , and observe that
The left hand side of (37) is thus bounded by the sum of the following terms:
(38)
(39)
(40)
First, observe that (38) is bounded above by a constant multiple of
(41)
Using (i) of Lemma 10 and the inequality as , we obtain that for ,
(42)
provided that the rightmost side is sufficiently small. Because on , (42) holds. This implies that for all sufficiently large , the right hand side of (41) is bounded above by a constant multiple of
Next, (39) is equal to
By the triangle inequality, the display is bounded by a constant multiple of
(43)
Using the same approach used in (42), the second term is further bounded above by a constant multiple of
Therefore, by (C4) and (C6), (43) is bounded by for some . This is not more than the right hand side of (37) if .
Lemma 4.
Proof.
Let for the th column of . Then, by Lemma 2.2.2 of van der Vaart and Wellner, [29] applied with , the expectation in the lemma is equal to
(45)
where is the Orlicz norm for . For any , define the standard deviation pseudo-metric between and as
Using the tail bound for normal distributions and Lemma 2.2.1 of van der Vaart and Wellner, [29], we see that for every . We shall show that is a separable pseudo-metric space with for every . Then, under the true model , we see that is a separable Gaussian process for . Hence, by Corollary 2.2.5 of van der Vaart and Wellner, [29], for any fixed ,
(46)
where . It is clear that possesses a normal distribution with mean zero and variance .
Using Lemma 2.2.1 of van der Vaart and Wellner, [29] again, we see that
(47)
for every . Here the last inequality holds by using (42) and the fact that on , under (C1).
Next, to further bound the second term in (46), note that for every ,
which is further bounded below by
using (i) of Lemma 10. In the last display, we see that is bounded away from zero since
and hence every eigenvalue is bounded below and above by a multiple of its reciprocal, as . This implies that is further bounded below by a constant multiple of
By the definition of and the preceding displays, we thus obtain
(48)
for every . Hence, using that , we can bound the second term in (46) above by a constant multiple of
for some . This can be further bounded by replacing in the display by . Then, using (45), (46), and (47), and by the substitution for the last display, we bound (45) above by a constant multiple of
for some .
To complete the proof, it remains to show that is a separable pseudo-metric space with for every . By (48), we see that for every . This implies that is separable with since is separable with . ∎
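The maximal inequality used at the start of the proof (Lemma 2.2.2 of van der Vaart and Wellner, [29]) has, for the Orlicz function $\psi_2(x) = e^{x^2}-1$, the following generic form; the random variables $X_1, \dots, X_m$ and the universal constant $K$ below are placeholders in our own notation.
\[
\Big\| \max_{1\le i\le m}|X_i| \Big\|_{\psi_2}
\;\le\; K\,\psi_2^{-1}(m)\,\max_{1\le i\le m}\|X_i\|_{\psi_2}
\;=\; K\sqrt{\log(1+m)}\,\max_{1\le i\le m}\|X_i\|_{\psi_2}.
\]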
Lemma 5.
For any orthogonal projection ,
Proof.
Note first that has a normal distribution with mean zero and variance , and hence we have
by the tail probabilities of normal distributions. By choosing and using the inequality for every , we verify the assertion. ∎
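The normal tail bound used in the proof can be written generically as follows, with $Z \sim N(0, \sigma^2)$ in our notation:
\[
\Pr(Z > t) \;\le\; e^{-t^{2}/(2\sigma^{2})},
\qquad
\Pr(|Z| > t) \;\le\; 2e^{-t^{2}/(2\sigma^{2})}, \qquad t \ge 0.
\]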
Proof.
Let . Restricting the integral to this set, the left hand side of the inequality in (49) is bounded below by
(50)
The exponent is equal to
(51)
since on . We first consider the case . Observe that , where the first term is bounded by a constant multiple of with -probability tending to one, due to Lemma 5. By Lemma 4 applied with together with (C6), the expected value of the second term is bounded by for some . Hence, for any ,
Consequently, taking a sufficiently slowly increasing for the above, (51) is bounded below by a constant multiple of
with -probability tending to one. Note that and on by (C3), if . The last display is thus bounded below by for some , uniformly over . Consequently, with -probability tending to one, (50) is bounded below by
for some , where the inequality holds by (23) and (24) since by (C3). Since if , the display is further bounded below as in the assertion.
Proof of Theorem 4.
The idea of our proof is similar in part to that of Theorem 3.5 in Chae et al., [10]. We only need to verify the first and fourth assertions. The second and third assertions then follow from the definitions of and . Note also that we only need to consider the case , as the assertions follow from Theorems 1 and 3 if .
Let . Also define as but using a constant such that . Then, by Theorem 3, we have that
Let be the event that is an intersection of the events in (36), (49), and the event whose probability goes to zero by Lemma 5. Since , it suffices to show that
(52)
tends to zero. Observe that by Fubini’s theorem, the denominator of the ratio is equal to
By Lemma 6, the term in the braces on the right hand side is further bounded below by on the event . Note also that the numerator of the ratio in (52) is equal to
Combining the bounds, on the event , the ratio in (52) is bounded by
At the end of this proof, we will verify that
(53)
with -probability tending to one. Assuming that this is true for now and letting be the event satisfying (53), we see that (52) is bounded by
To show that this tends to zero, for in Lemma 3, define , , and such that . Below we will show that
Since by the moment generating function of normal distributions, we obtain that
If , the rightmost side goes to zero for any . If , it still goes to zero for that is much larger than .
Note also that by conditions (C3) and (C4), we have that for some and any ,
(54)
on the event . Hence by (36) and (54), for every ,
on the event . Therefore,
This tends to zero if is sufficiently large.
If , is the empty set as it implies . Hence it suffices to consider the case that below. By (36) and (54) again, there exists a constant such that for every ,
on the event , where the last inequality holds by choosing much larger than . Therefore,
which tends to zero for that is much larger than , if .
It only remains to show (53). Since the map is bijective for every fixed , for the set defined by with given , we see that
(55)
by the substitution in the integral. Writing the block diagonal matrix formed by stacking , , it can be seen that
Hence, we see that can be chosen sufficiently larger than such that for every as we have . Therefore, (55) is bounded by
by (C5), since . This verifies (53) and thus the proof is complete. ∎
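The moment generating function step in the middle of the proof is the usual Chernoff bound for Gaussian variables; a generic sketch, in our own notation with $Z \sim N(0, \sigma^2)$, reads
\[
\mathbb{E}\, e^{sZ} = e^{s^{2}\sigma^{2}/2} \ \ (s \in \mathbb{R}),
\qquad
\Pr(Z > t) \;\le\; \inf_{s > 0} e^{-st}\,\mathbb{E}\, e^{sZ} \;=\; e^{-t^{2}/(2\sigma^{2})}, \quad t \ge 0.
\]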
A.4 Proof of Theorems 5–6
To prove the shape approximation in Theorem 5 and the selection results in Theorem 6, we first obtain two lemmas. The first shows that the remainder of the approximation goes to zero in -probability, which is a stronger version of Lemma 3. The second implies that with a point-mass prior for at , we also obtain a rate that is not worse than that in Theorem 3.
Lemma 7.
Proof.
Similar to the proof of Lemma 3, it suffices to show the following three assertions:
(56)
(57)
(58)
First, note that the left side of (56) is bounded above by a constant multiple of
(59)
where the inequality holds by (42) and the fact that on . We see that (59) is bounded above by a constant multiple of
which goes to zero by (C10∗).
Next, similar to (43), the left side of (57) is bounded by
Using the same approach used in (42), the display is further bounded above by a constant multiple of
∎
Lemma 8.
Proof.
Since the prior for is the point mass at , we can reduce to a low-dimensional model , . Then the lemma can be easily verified using the main results on posterior contraction in Section 3. The denominator of the posterior distribution with the Dirac prior at is bounded as in Lemma 1, which can be shown using (20) for the prior concentration condition (C2) and the expressions for the Kullback-Leibler divergence and variation with the true value . For a local test relative to the average Rényi divergence, Lemma 2 applied with , modified so that it involves only a given such that , implies that a small piece of the alternative is tested with exponentially small errors. Hence, by (C5∗), we obtain the contraction rate relative to for , as in the proof of Theorem 2. The lemma is then obtained by recovering the contraction rate of with respect to using the approach in the proof of Theorem 3. ∎
Proof of Theorem 5.
Our proof is based on the proof of Theorem 6 in Castillo et al., [8], but is more involved due to . We use the fact that for any probability measure and its renormalized restriction to a set , we have . First, using a sufficiently large constant that is smaller than , define as in (12) such that . Let be the prior distribution restricted and renormalized on and be the corresponding posterior distribution. Also, is the restricted and renormalized version of to the set . Then the left hand side of the theorem is bounded above by
(60)
where the first summand goes to zero in -probability since in -probability by Theorem 1 and Theorem 3.
To show that the second summand goes to zero in -probability, note that for every measurable , we obtain
where . In the last line, the factor cancels out in the normalizing constant, but is inserted for the sake of comparison. For any sequences of measures and , if is absolutely continuous with respect to with the Radon-Nikodym derivative , then it can be easily verified that
Hence, for , we see that the second summand of (60) is bounded by
Using the fact that on and that goes to zero in -probability by Lemma 7, the last display is further bounded by
(61)
Now, note that the map is bijective for every fixed . Thus for the set defined by with given , we see that
(62)
by the substitution in the integral. Similar to the proof of Theorem 4, observe that
Hence, we see that can be chosen sufficiently large such that for every as we have . Therefore, since , one can see that (62) is written as
by (C9∗), and hence (61) is equal to
(63)
Now, observe that we also have the inequality in the other direction: . This means that can be chosen sufficiently large such that for every . Hence, with appropriately chosen constants, we obtain
The rightmost term goes to one with probability tending to one by Lemma 8. This implies that (63) goes to zero in -probability, completing the proof for the second part of (60).
Next, we show that goes to one in -probability to verify that the last summand in (60) goes to zero in -probability. Observe that is equal to
(64)
Clearly, the denominator is bounded below by
(65)
Since the measure defined by is symmetric about , the mean of with respect to the normalized probability measure is zero. Note also that is nonsingular for every such that by (C8∗). Thus, by Jensen’s inequality, (65) is bounded below by
Applying the arithmetic-geometric mean inequality to the eigenvalues, we obtain , and hence by (4). Furthermore, we have by (3) and . Hence, the preceding display is further bounded below by a constant multiple of
(66)
To bound the numerator of (64), let and . Then it suffices to show that (64) goes to zero in -probability on the set as by Lemma 5. Note that on the set we have
Using that for every with by (C8∗), the preceding display is, for some constant , further bounded above by
by the Cauchy-Schwarz inequality. We have on the support of the measure . Hence, on the event , the numerator of (64) is bounded above by
since . Note that we have
by (3) and that in the denominators is bounded away from zero by the assumption. Thus, the last display combined with (66) shows that (64) goes to zero on the event , provided that is chosen sufficiently large.
Finally we conclude that (60) goes to zero in -probability. Since the total variation metric is bounded by 2, the convergence in mean holds as in the assertion. ∎
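The arithmetic-geometric mean step applied to the eigenvalues in the lower bound for (65) is the following elementary inequality, stated here for a generic positive definite $p \times p$ matrix $A$ with eigenvalues $\lambda_1(A), \dots, \lambda_p(A)$ (our notation):
\[
\det(A) \;=\; \prod_{i=1}^{p}\lambda_i(A)
\;\le\; \Big(\frac{1}{p}\sum_{i=1}^{p}\lambda_i(A)\Big)^{p}
\;=\; \Big(\frac{\operatorname{tr}(A)}{p}\Big)^{p}.
\]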
Proof of Theorem 6.
Our proof follows the proof of Theorem 4 in Castillo et al., [8]. Since tends to zero by Theorem 5, it suffices to show that for . For the orthogonal projection defined by with , we see that is bounded by
by (13), since for every due to on . Note that for , because is a principal submatrix of . Hence, is equal to
(67)
for some . The last inequality holds since by (C8∗), there exists a constant such that for every with , and hence we have that by the definition of ,
Now, we shall show that for any fixed ,
(68)
Note that has a chi-squared distribution with degrees of freedom . Therefore, by Lemma 5 of Castillo et al., [8], there exists a constant such that for every and given ,
where is the cardinality of the set . Since , for the event in the relation (68), it follows that
This goes to zero as , since for ,
and . To complete the proof, it remains to show that goes to zero on the set . Combining (67) and (68), we see that is bounded by
which holds by the inequalities and . Note that for , we have that for some . Hence, the preceding display goes to zero provided that since . This condition can be translated to by choosing arbitrarily close to 2. ∎
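Chi-squared tail bounds of the following Laurent-Massart type underlie the step referring to Lemma 5 of Castillo et al., [8]; we record a generic version for a $\chi^2_d$ variable, which is not necessarily the exact form of that lemma.
\[
\Pr\big(\chi^{2}_{d} \ge d + 2\sqrt{dx} + 2x\big) \;\le\; e^{-x}, \qquad x \ge 0,
\]
so that, in particular, $\Pr(\chi^{2}_{d} \ge 5x) \le e^{-x}$ for every $x \ge d$.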
Appendix B Proofs for the applications
B.1 Proof of Theorem 7
We first verify the conditions for Theorem 3 to prove assertions (a) and (b).
-
•
Verification of (C1): Let be the th element of . Observe that is equal to
(69)
Hence, we see that has the same role as . We also have as the true belongs to the support of the prior.
- •
- •
- •
-
•
Verification of (C5∗): For a sufficiently large and , choose . Since is a principal submatrix of , we have for every and . Hence the minimum eigenvalue condition (6) is satisfied with . Also, the entropy relative to is given by
The entropy condition in (7) is thus satisfied if we choose . To verify the sieve condition (8), note that for some positive constants , , , and , an inverse Wishart distribution satisfies
(71)
see, for example, Lemma 9.16 of Ghosal and van der Vaart, [17]. The sieve condition (8) is met provided that is chosen sufficiently large. Note that the condition is satisfied by the assumption .
-
•
Verification of (C6): The separability condition is trivially satisfied in this example as there is no nuisance mean part.
Therefore, the contraction properties in Theorem 3 are obtained with , but is replaced by since and . The contraction rate for with respect to the Frobenius norm follows from (69). The optimal posterior contraction directly follows from Corollary 1. Assertions (a) and (b) are thus proved.
Next, we verify conditions (C8∗)–(C10∗) and (C7) to apply Theorems 5–6 and Corollaries 2–3.
- •
- •
- •
Hence, under (C7∗), Theorem 5 can be applied to obtain the distributional approximation in (15) with the zero matrix . Under (C7∗) and (C12), Theorem 6 implies the no-superset result in (16). If the beta-min condition (C13) is also met, the strong results in Corollary 2 and Corollary 3 hold. These establish (c)–(e).
B.2 Proof of Theorem 8
We first verify the conditions for Theorem 3 for (a) and (b).
- •
- •
- •
- •
-
•
Verification of (C5∗): For a sufficiently large and , choose a sieve as
Then we have for large , and hence the minimum eigenvalue condition (6) is directly met with by the definition of the sieve. To see the entropy condition, observe from (72) that for every ,
Therefore, for , the entropy relative to is bounded above by
each summand of which is bounded by a multiple of . This shows that the choice satisfies the entropy condition in (7). Further, it is easy to see that condition (8) holds using the tail bounds for normal and inverse Wishart distributions as in (71).
- •
Therefore we obtain the contraction properties of the posterior distribution as in (9) with replaced by as and . The rates for with respect to more concrete metrics than can now be obtained. Note that for small , directly implies and by the definition of . For , observe that
Since is bounded as , the preceding display implies . Moreover, we have
and
These show that as and are bounded. We finally conclude that contracts at the same rate of . The optimal posterior contraction is directly obtained by Corollary 1. Thus assertions (a) and (b) hold.
Next, we verify conditions (C8∗)–(C10∗) and (C7) to apply Theorems 5–6 and Corollaries 2–3. The orthogonal projection defined by with is used to check the conditions.
- •
-
•
Verification of (C9∗): Choose a map for . To check (C9∗), we shall verify that this map induces as follows. Note that for matrices , , we have the properties of the Kronecker product that and if the matrices allow such operations. Using these properties, we see that satisfies
Hence,
which implies that the shift only for as in the given map provides . Without loss of generality, we assume that the standard normal prior is used for . Now, observe that
since the priors for the other parameters cancel out due to invariance. One can note that
and
Thus, condition (C9∗) is satisfied.
-
•
Verification of (C10∗): Note again that for every . The inequality also holds in the other direction for every , by the same argument used for the recovery in the proof of Theorem 8, (a)–(b). Hence, for some constants , the entropy in (C10∗) is bounded above by
Since all nuisance parameters are of fixed dimensions, the last display is bounded by a multiple of for every , so that (C10∗) is bounded by by Remark 6. Since in this case, the condition is verified.
- •
Therefore, under (C7∗), Theorem 5 implies that the distributional approximation in (15) holds. Under (C7∗) and (C12), we obtain the no-superset result in (16). The remaining assertions in the theorem are direct consequences of Corollary 2 and Corollary 3 if the beta-min condition (C13) is also satisfied. These prove (c)–(e).
We complete the proof by showing that the covariance matrix of the nonzero part can be written as in the theorem. For given , we obtain
where is the first column of . Note that , where is the top-left element of , which is equal to by direct calculations. For the mean , observe that
where is the submatrix of consisting of columns except for the first column. Since , where is the first row of with the top-left element excluded, the last display is equal to
As we have by direct calculations, it follows that
This completes the proof.
B.3 Proof of Theorem 9
We shall verify the conditions for the posterior contraction in Theorem 3 to prove (a)–(b). First we give the bounds for the eigenvalues of each correlation matrix. It can be shown that
(73)
(74)
(75)
The first assertion in (73) follows directly from the identity for every . For (74), see Theorem 2.1 and Theorem 3.5 of Fikioris, [12]. The assertion in (75) is due to Theorem 2.2 of Kulkarni et al., [21].
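For instance, in the compound-symmetry case the bounds in (73) can be read off from the exact spectrum. Writing the $p \times p$ correlation matrix as $(1-\rho)I_p + \rho\,\mathbf{1}_p\mathbf{1}_p^{\top}$ with $\rho \in (0,1)$ (our parameterization), its eigenvalues are
\[
\lambda_{1} = 1 + (p-1)\rho, \qquad \lambda_{2} = \cdots = \lambda_{p} = 1 - \rho,
\]
so the smallest eigenvalue is bounded below by $1-\rho$ while the largest grows linearly in $p$.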
-
•
Verification of (C1): For the autoregressive correlation matrix, note that
Using , we have that
and hence
This gives us for the autoregressive matrices. Similarly, we can also show that satisfies (C1) for the compound-symmetric and the moving average correlation matrices. Also, we have for (C1) as the true parameter values and are in the support of the prior.
- •
- •
- •
-
•
Verification of (C5∗): For a sufficiently large and , choose a sieve . Then using (73)–(75), it is easy to see that the minimum eigenvalue of each correlation matrix is bounded below by a polynomial in , which implies that condition (6) is satisfied with . For the entropy calculation, note that for every type of correlation matrix,
(76)
From the identity for every integer , we have that for every . By this inequality we obtain for every correlation matrix. Then, the last display is bounded by a multiple of for every . The entropy in (7) is thus bounded by
for with some constant . It can be easily checked that each term in the last display is bounded by a multiple of , by which the entropy condition in (7) is satisfied with . Using the tail bounds of inverse gamma distributions and properties of the density near the boundaries, condition (8) is satisfied as long as is chosen sufficiently large.
-
•
Verification of (C6): The separation condition is trivially satisfied as there is no nuisance mean part.
Therefore, we obtain the posterior contraction properties of with by Theorem 3. The term can be replaced by since and . Since we have by the diagonal entries of each matrix, the contraction rate is obtained for with respect to the -norm, for every correlation matrix, as . In particular, for the compound-symmetric correlation matrix, this rate is reduced to since is bounded in that case. We also have for every correlation matrix, as there are more than entries that are equal to . Hence, by the relation , the same rate is also obtained for relative to the -norm. The optimal posterior contraction directly follows from Corollary 1. Thus assertions (a)–(b) hold.
Next, we verify conditions (C8∗)–(C10∗) and (C7) to apply Theorems 5–6 and Corollaries 2–3.
- •
- •
- •
Therefore, under (C7∗), the distributional approximation in (15) holds with the zero matrix by Theorem 5. Under (C7∗) and (C12), Theorem 6 implies that the no-superset result in (16) holds. The strong results in Corollary 2 and Corollary 3 follow explicitly from the beta-min condition (C13). These prove (c)–(e).
B.4 Proof of Theorem 10
We verify the conditions for the posterior contraction in Theorem 3 to show (a)–(b).
-
•
Verification of (C1): Using the assumption , note that
(77)
where the last inequality holds since and . Thus we have and .
-
•
Verification of (C2): The condition is satisfied with as is fixed dimensional and we have .
- •
- •
- •
-
•
Verification of (C6): The separation condition is trivially satisfied as there is no nuisance mean part.
Therefore, the posterior contraction rates for are given by Theorem 3 with replaced by since and . The contraction rate for relative to the Frobenius norm is a direct consequence of (77). The optimal posterior contraction easily follows from Corollary 1. Thus assertions (a)–(b) hold.
Now, we verify conditions (C8∗)–(C10∗) and (C7) to apply Theorems 5–6 and Corollaries 2–3.
- •
- •
-
•
Verification of (C7): It is easy to see that since . The separability of the space is thus trivial.
Hence, under (C7∗), Theorem 5 can be applied to obtain the distributional approximation in (15) with the zero matrix . Under (C7∗) and (C12), we obtain the no-superset result in (16) by Theorem 6. The strong results in Corollary 2 and Corollary 3 follow explicitly from the beta-min condition (C13). These establish (c)–(e).
B.5 Proof of Theorem 11
We verify the conditions for the posterior contraction in Theorem 3.
- •
-
•
Verification of (C2): Using (i) of Lemma 10 and the relation as , observe that if the right hand side is small enough. Thus, there exists a constant such that . Furthermore, although the components of are not a priori independent as the prior is truncated to , the truncation can only increase prior concentration since for some . Hence, for some ,
which justifies the choice for (C2).
- •
-
•
Verification of (C4): This is trivially met as for some .
-
•
Verification of (C5∗): Note that the minimum eigenvalue condition (6) is trivially satisfied with since the prior is put on . Now, for with and sufficiently large , choose a sieve as , that is, the maximum number of edges of does not exceed . Then, for , the entropy in (7) is bounded by
where in the second term, the factor comes from the diagonal elements of , while the rest is from the off-diagonal entries. It is easy to see that the last display is bounded by a multiple of with chosen , and hence the entropy condition in (7) is satisfied. Lastly, note that for some ,
Therefore, condition (8) is satisfied with sufficiently large .
-
•
Verification of (C6): The separation condition is trivially met as there is no nuisance mean part.
Therefore, we obtain the posterior contraction properties for by Theorem 3. The theorem also implies that the posterior distribution of contracts to at the rate with respect to the Frobenius norm. This also translates into convergence of to at the same rate, since we obtain
(80)
by (i) of Lemma 10 and the inequality as . The assertion for the optimal posterior contraction is directly justified by Corollary 1. These prove (a)–(b).
Next, we verify conditions (C4)–(C7) to obtain the optimal posterior contraction by applying Theorem 4.
- •
-
•
Verification of (C6): Note that by (80), there exists a constant such that the entropy in (C6) is bounded by for every . Using (81), the entropy is further bounded by for some . This is clearly bounded by a multiple of , and hence using Remark 6 we bound (C6) by a multiple of which goes to zero by assumption.
- •
Now, we verify conditions (C8∗)–(C10∗) to apply Theorems 5–6 and Corollaries 2–3.
- •
- •
Therefore, under (C7∗), we obtain the distributional approximation in (15) with the zero matrix by Theorem 5. Under (C7∗) and (C12), the no-superset result in (16) holds by Theorem 6. Lastly, we obtain the strong results in Corollary 2 and Corollary 3 if the beta-min condition (C13) is also met. These prove (d)–(f).
B.6 Proof of Theorem 12
To verify the conditions for Theorem 3, we will use the following properties of B-splines.
For any , there exists with such that
(82)
by the well-known approximation theory of B-splines [11, page 170]. Writing , this gives
(83)
We also use the following inequalities: for every ,
(84)
See Lemma E.6 of Ghosal and van der Vaart, [17] for proofs with respect to - and -norms. Hence the first relation can be formally justified. For the second relation with respect to the empirical -norm, we assume that are sufficiently regularly distributed as in (7.12) of Ghosal and van der Vaart, [16].
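For concreteness, the approximation property in (82) is of the following classical form (De Boor, [11]): if $f$ has Hölder smoothness $\alpha$ on $[0,1]$ and $B = (B_1, \dots, B_J)^{\top}$ is a B-spline basis of order $q \ge \alpha$, then there exists $\theta \in \mathbb{R}^{J}$ such that
\[
\big\| f - \theta^{\top} B \big\|_{\infty} \;\le\; C\, J^{-\alpha}\, \| f \|_{\mathcal{C}^{\alpha}},
\]
for a constant $C$ depending only on $\alpha$ and $q$; the symbols $f$, $\alpha$, $J$, and $B$ here are generic and need not match the notation used in the theorem.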
-
•
Verification of (C1): If is strictly positive on , then satisfies the same approximation rule in (82) for some with (see Lemma E.5 of Ghosal and van der Vaart, [17]). Therefore the approximation in (83) also holds for even if is restricted to have positive entries only, and thus by (82) and (84),
which tells us that we have and for (C1).
- •
- •
- •
-
•
Verification of (C5∗): For a sufficiently large , choose a sieve as . Then the minimum eigenvalue condition (6) is satisfied with because for every ,
where and denote the th components of and , respectively. To check the entropy condition in (7), note that for every , we have by (84). Hence, for some , the entropy in (7) is bounded above by a multiple of
The condition (8) holds since an inverse Gaussian prior on each produces for some constant , by its exponentially small tail probabilities on both sides. By matching and , we obtain and . Note that the conditions and hold only if .
-
•
Verification of (C6): The separation condition holds as there is no additional mean part.
Hence, we obtain the posterior contraction rates for by Theorem 3. The contraction rate for is also obtained by the same theorem. The assertion for the optimal posterior contraction is directly justified by Corollary 1. Hence we have verified (a)–(b).
Now, we verify (C4)–(C7) for the optimal posterior contraction in Theorem 4.
- •
- •
- •
Therefore, since (C3) is satisfied by the assumption, assertion (c) holds by Theorem 4.
Next, we verify conditions (C8∗)–(C10∗) to apply Theorems 5–6 and Corollaries 2–3.
- •
- •
Under (C7∗), the distributional approximation in (15) holds with the zero matrix by Theorem 5. Under (C7∗) and (C12), the no-superset result in (16) holds by Theorem 6. We also obtain the strong results in Corollary 2 and Corollary 3 if the beta-min condition (C13) is also met. These prove (d)–(f).
B.7 Proof of Theorem 13
We verify the conditions for the posterior contraction in Theorem 3.
-
•
Verification of (C1): Since for every and belongs to the support of the prior, we have and .
-
•
Verification of (C2): Note that we write . To verify the prior concentration condition, observe that
where the second term on the right hand side is trivially bounded below by a constant multiple of . Using (82)–(84), it is easy to see that if ,
for some . Since , this implies that (C2) is satisfied with .
-
•
Verification of (C3): The assumption given in the theorem directly satisfies the condition.
-
•
Verification of (C4): This is directly satisfied by .
-
•
Verification of (C5∗): For a sufficiently large constant and , choose , from which the minimum eigenvalue condition (6) is directly satisfied with . To check the entropy condition in (7), note that for every , we have by (84). Hence, for some , the entropy in (7) is bounded above by a multiple of
The display is further bounded by a multiple of , and hence (7) is satisfied with . Using the tail bounds of normal and inverse gamma distributions, condition (8) is also satisfied.
- •
Therefore, the contraction rates for are given by Theorem 3. The rate for is also obtained by the same theorem. The assertion for the optimal posterior contraction is directly justified by Corollary 1. We thus see (a)–(b) hold.
Now, we verify (C4)–(C7) for Theorem 4.
-
•
Verification of (C4): Observe that the left hand side of the first line of (C4) is equal to
where is the least squares solution. Since is the solution minimizing , for some , the last display is bounded above by
(85)
by (82), where is replaced by as is unknown. Plugging in , it is easy to see that the right hand side of (85) is of the same order as . This tends to zero by the given boundedness assumption. The necessary condition is implied by this, because . The second condition of (C4) is satisfied by Remark 5.
-
•
Verification of (C5): Let for a given , where . This setting satisfies . Since each entry of has the standard normal prior, is a zero mean Gaussian process with the covariance kernel , and thus its reproducing kernel Hilbert space (RKHS) is the set of all functions of the form with coefficients , . It is easy to see that the shift is in the RKHS since it is expressed as using an invertible matrix with rows evaluated by some , . Hence, by the Cameron-Martin theorem, for and the RKHS norm, we see that
almost surely. This gives that
(86)
Note that we have
and
- •
-
•
Verification of (C7): Since we have for every and the parameter space of is Euclidean, the condition is trivially satisfied.
Therefore, assertion (c) holds by Theorem 4 since (C3) is also satisfied by the given assumption.
Lastly, we verify conditions (C8∗)–(C10∗) to apply Theorems 5–6 and Corollaries 2–3.
- •
- •
- •
Therefore, under (C7∗), we have the distributional approximation in (15) by Theorem 5. Under (C7∗) and (C12), Theorem 6 implies that the no-superset result in (16) holds. The stronger assertions in (17) and (18) are explicitly derived from Corollary 2 and Corollary 3 if the beta-min condition (C13) is also met.
Appendix C Auxiliary results
Here we provide some auxiliary results used to prove the main results.
Lemma 9.
Let be the density of for . Then,
Proof.
Let for and . Then by direct calculations, we have
which verifies the first assertion because . After some algebra, we also obtain
The rightmost side involves forms of and for two positive definite matrices and . It is easy to see that the former is zero, while it can be shown that the latter equals ; for example, see Lemma 6.2 of Magnus, [22]. Plugging this in for the expected values of the products of quadratic forms, it is easy (but tedious) to verify the second assertion. ∎
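For reference, with $\mu_1, \mu_2$ and $\Sigma_1, \Sigma_2$ generic mean vectors and positive definite covariance matrices of dimension $d$ (our notation, not necessarily that of the lemma), the Kullback-Leibler divergence computed above takes the familiar closed form
\[
K\big(N(\mu_1, \Sigma_1),\, N(\mu_2, \Sigma_2)\big)
= \frac{1}{2}\Big\{ \operatorname{tr}\big(\Sigma_2^{-1}\Sigma_1\big) - d
+ (\mu_2 - \mu_1)^{\top}\Sigma_2^{-1}(\mu_2 - \mu_1)
+ \log\frac{\det \Sigma_2}{\det \Sigma_1} \Big\}.
\]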
Lemma 10.
For positive definite matrices and , let be the eigenvalues of . Then the following assertions hold:
- (i)
,
- (ii)
can be made arbitrarily small if is chosen sufficiently small, where is defined in (33).
Proof.
Let . Since the eigenvalues of are , we can see that is equal to
Conversely, using the sub-multiplicative property of the Frobenius norm, , it can be seen that is equal to
These verify (i). Now, note that by direct calculations,
Hence, for a sufficiently small implies that
Since every term in the product of the last display is greater than or equal to 1, we have for every . As a function of , has the global minimum at , and hence can be chosen sufficiently small to make small for every , which establishes (ii). ∎
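The elementary fact invoked at the end of the proof can be sketched as follows, writing $f(\lambda) = \lambda - 1 - \log\lambda$ for $\lambda > 0$ (our notation):
\[
f'(\lambda) = 1 - \lambda^{-1}, \qquad f''(\lambda) = \lambda^{-2} > 0,
\]
so $f$ is strictly convex with its unique global minimum $f(1) = 0$; hence $\lambda - 1 - \log\lambda \ge 0$ for every $\lambda > 0$, with the second-order expansion $f(\lambda) = \tfrac{1}{2}(\lambda-1)^2 + o\big((\lambda-1)^2\big)$ near $\lambda = 1$.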
References
- Atchadé, [2017] Atchadé, Y. A. (2017). On the contraction properties of some high-dimensional quasi-posterior distributions. The Annals of Statistics, 45(5):2248–2273.
- Bai et al., [2020] Bai, R., Moran, G. E., Antonelli, J., Chen, Y., and Boland, M. R. (2020). Spike-and-slab group lassos for grouped regression and sparse generalized additive models. Journal of the American Statistical Association, to appear.
- Belitser and Ghosal, [2020] Belitser, E. and Ghosal, S. (2020). Empirical Bayes oracle uncertainty quantification for regression. The Annals of Statistics, 48(6):3113–3137.
- Bickel and Kleijn, [2012] Bickel, P. J. and Kleijn, B. J. (2012). The semiparametric Bernstein–von Mises theorem. The Annals of Statistics, 40(1):206–237.
- Bondell and Reich, [2012] Bondell, H. D. and Reich, B. J. (2012). Consistent high-dimensional Bayesian variable selection via penalized credible regions. Journal of the American Statistical Association, 107(500):1610–1624.
- Carroll et al., [2006] Carroll, R. J., Ruppert, D., Crainiceanu, C. M., and Stefanski, L. A. (2006). Measurement Error in Nonlinear Models: A Modern Perspective. Chapman and Hall/CRC.
- Castillo, [2012] Castillo, I. (2012). A semiparametric Bernstein–von Mises theorem for Gaussian process priors. Probability Theory and Related Fields, 152(1-2):53–99.
- Castillo et al., [2015] Castillo, I., Schmidt-Hieber, J., and van der Vaart, A. (2015). Bayesian linear regression with sparse priors. The Annals of Statistics, 43(5):1986–2018.
- Castillo and van der Vaart, [2012] Castillo, I. and van der Vaart, A. (2012). Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. The Annals of Statistics, 40(4):2069–2101.
- Chae et al., [2019] Chae, M., Lin, L., and Dunson, D. B. (2019). Bayesian sparse linear regression with unknown symmetric error. Information and Inference: A Journal of the IMA, 8(3):621–653.
- De Boor, [1978] De Boor, C. (1978). A Practical Guide to Splines. New York: Springer.
- Fikioris, [2018] Fikioris, G. (2018). Spectral properties of Kac–Murdock–Szegö matrices with a complex parameter. Linear Algebra and its Applications, 553:182–210.
- Fuller, [1987] Fuller, W. A. (1987). Measurement Error Models. John Wiley & Sons.
- Gao et al., [2020] Gao, C., van der Vaart, A. W., and Zhou, H. H. (2020). A general framework for Bayes structured linear models. The Annals of Statistics, 48(5):2848–2878.
- Ghosal et al., [2000] Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. The Annals of Statistics, 28(2):500–531.
- Ghosal and van der Vaart, [2007] Ghosal, S. and van der Vaart, A. (2007). Convergence rates of posterior distributions for noniid observations. The Annals of Statistics, 35(1):192–223.
- Ghosal and van der Vaart, [2017] Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press.
- Jeong, [2020] Jeong, S. (2020). Posterior contraction in group sparse logit models for categorical responses. arXiv preprint arXiv:2010.03513.
- Jeong and Ghosal, [2020] Jeong, S. and Ghosal, S. (2020). Posterior contraction in sparse generalized linear models. Biometrika, to appear.
- Johnson and Rossell, [2012] Johnson, V. E. and Rossell, D. (2012). Bayesian model selection in high-dimensional settings. Journal of the American Statistical Association, 107(498):649–660.
- Kulkarni et al., [1999] Kulkarni, D., Schmidt, D., and Tsui, S.-K. (1999). Eigenvalues of tridiagonal pseudo-Toeplitz matrices. Linear Algebra and its Applications, 297:63–80.
- Magnus, [1978] Magnus, J. R. (1978). The moments of products of quadratic forms in normal variables. Statistica Neerlandica, 32(4):201–210.
- Martin et al., [2017] Martin, R., Mess, R., and Walker, S. G. (2017). Empirical Bayes posterior concentration in sparse high-dimensional linear models. Bernoulli, 23(3):1822–1847.
- Narisetty and He, [2014] Narisetty, N. N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. The Annals of Statistics, 42(2):789–817.
- Ning et al., [2020] Ning, B., Jeong, S., and Ghosal, S. (2020). Bayesian linear regression for multivariate responses under group sparsity. Bernoulli, 26(3):2353–2382.
- Ročková, [2018] Ročková, V. (2018). Bayesian estimation of sparse signals with a continuous spike-and-slab prior. The Annals of Statistics, 46(1):401–437.
- Rothman et al., [2008] Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515.
- Song and Liang, [2017] Song, Q. and Liang, F. (2017). Nearly optimal Bayesian shrinkage for high dimensional regression. arXiv preprint arXiv:1712.08964.
- van der Vaart and Wellner, [1996] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer.