Unified Bayesian theory of sparse linear regression with nuisance parameters

Research is partially supported by a Faculty Research and Professional Development Grant from the College of Sciences of North Carolina State University.
Abstract
We study frequentist asymptotic properties of Bayesian procedures for high-dimensional Gaussian sparse regression when unknown nuisance parameters are involved. Nuisance parameters can be finite-, high-, or infinite-dimensional. A mixture of point masses at zero and continuous distributions is used for the prior distribution on sparse regression coefficients, and appropriate prior distributions are used for nuisance parameters. The optimal posterior contraction of sparse regression coefficients, hampered by the presence of nuisance parameters, is also examined and discussed. It is shown that the procedure yields strong model selection consistency. A Bernstein-von Mises-type theorem for sparse regression coefficients is also obtained for uncertainty quantification through credible sets with guaranteed frequentist coverage. Asymptotic properties of numerous examples are investigated using the theories developed in this study.
1 Introduction
While Bayesian model selection for classical low-dimensional problems has a long history, sparse estimation in high-dimensional regression was studied much later; see Bondell and Reich, [5], Johnson and Rossell, [20], and Narisetty and He, [24] for consistent Bayesian model selection methods in high-dimensional linear models. Extensive theoretical investigations, however, have been carried out only very recently. Since the pioneering work of Castillo et al., [8], frequentist asymptotic properties of Bayesian sparse regression have been discovered under various settings, and there is now a substantial body of literature [e.g., 23, 1, 28, 3, 26, 2, 10, 25, 14, 19, 18].
Most of the existing studies deal with sparse regression setups without nuisance parameters and there are only a few exceptions. An unknown variance parameter, the simplest type of nuisance parameters, was incorporated for high-dimensional linear regression in Song and Liang, [28] and Bai et al., [2]. In these studies, the optimal properties of Bayesian procedures are characterized with continuous shrinkage priors. For more involved models, Chae et al., [10] adopted a nonparametric approach to estimate unknown symmetric densities in sparse linear regression. Ning et al., [25] considered a sparse linear model for vector-valued response variables with unknown covariance matrices.
Although nuisance parameters may not be of primary interest, a modeling framework requires a complete description of their roles since they explicitly parameterize the model. One may therefore want to achieve optimal estimation properties for the sparse regression coefficients regardless of the nuisance parameter. It may also be of interest to examine posterior contraction of nuisance parameters as a secondary objective. Despite this, there has been no attempt to consider a general class of high-dimensional regression models with nuisance parameters. In this study, we consider a general form of Gaussian sparse regression in the presence of nuisance parameters and establish a theoretical framework for Bayesian procedures.
We formulate a general framework to treat sparse regression models in a unified way as follows. Let be possibly an infinite-dimensional nuisance parameter taking values in a set . For each and an integer for some , suppose that there are a vector and a positive definite matrix which define a regression model for a vector-valued response variable against covariates given by
(1)
where is a vector of regression coefficients. Here (and ) can increase with . We consider the high-dimensional situation where , but is assumed to be sparse, with many coordinates zero. The form in (1) clearly includes sparse linear regression with unknown error variances. Our main interest lies in more complicated setups. As will be shortly discussed in Section 1.1, many interesting examples belong to form (1).
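As a concrete illustration of a model of form (1), the following minimal sketch simulates one assumed instance: each response vector is Gaussian with a mean that is linear in the sparse coefficients plus a nuisance-dependent shift, and with a nuisance-dependent covariance. The particular choice of nuisance parameter (a scalar error variance) and all names are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def simulate_model(n, p, s0, rng=None):
    """Minimal sketch of a Gaussian sparse-regression model of form (1).

    Assumed specification (illustrative only): each response vector has mean
    X_i @ theta + xi_i(eta) and covariance Sigma_i(eta), where the nuisance
    eta is taken here to be a scalar error variance.
    """
    rng = np.random.default_rng(rng)
    m = 3                                   # dimension of each response vector
    theta = np.zeros(p)
    theta[:s0] = rng.normal(0.0, 2.0, s0)   # sparse coefficients: s0 nonzero entries
    eta = 1.5                               # nuisance parameter (error variance here)
    data = []
    for _ in range(n):
        X_i = rng.normal(size=(m, p))
        xi_i = np.zeros(m)                  # nuisance-dependent mean shift (zero here)
        Sigma_i = eta * np.eye(m)           # nuisance-dependent covariance
        Y_i = rng.multivariate_normal(X_i @ theta + xi_i, Sigma_i)
        data.append((Y_i, X_i))
    return theta, data

theta_true, data = simulate_model(n=50, p=200, s0=5, rng=0)
```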
In this paper, we develop a unified theory of posterior asymptotics in the high-dimensional sparse regression models described by form (1). To the best of our knowledge, there is no study thus far considering a general modeling framework of sparse regression as in (1), even from the frequentist perspective. The results on complicated high-dimensional regression models are only available at model-specific levels and cannot be universally used for different model classes. On the other hand, our approach is a unified theoretical treatment of the general model structure in (1) under the Bayesian framework. We establish general theorems on nearly optimal posterior contraction rates, a Bernstein-von Mises theorem via shape approximation to the posterior distribution of , and model selection consistency.
The general theory of posterior contraction using the canonical root-average-squared Hellinger metric on the joint density [16] is not very useful in this context, since recovering rates in terms of the metric of interest on the regression coefficients requires some boundedness conditions [19]. To deal with this issue, we construct an exponentially powerful likelihood ratio test on small pieces of the alternative that are sufficiently separated from the true parameters in terms of the average Rényi divergence of order 1/2 (which coincides with the average negative log-affinity). This test provides posterior contraction relative to the corresponding divergence. The posterior contraction rates of and can then be recovered in terms of the metrics of interest under mild conditions on the parameter space. Due to a nuisance parameter , the resulting posterior contraction for may be suboptimal. Conditions for the optimal posterior contraction will also be examined. Our results show that the obtained posterior contraction rates are adaptive to the unknown sparsity level.
For a Bernstein-von Mises theorem and selection consistency, stronger conditions are required than those used for posterior contraction, in line with the existing literature [e.g., 8, 23]. As pointed out by Chae et al., [10], the Bernstein-von Mises theorems for finite-dimensional parameters in classical semiparametric models [e.g., 7] may not be directly useful in the high-dimensional context. We thus directly characterize a version of the Bernstein-von Mises theorem for model (1). The key idea is to find a suitable orthogonal projection that satisfies some required conditions, which is typically straightforward if the support of a prior for is a linear space. The complexity of the space of covariance matrices, measured by its metric entropy, also plays an important role in deriving the Bernstein-von Mises theorem and selection consistency. Combining these two ingredients leads to a single component of normal distributions for an approximation, which enables us to correctly quantify the remaining uncertainty about the parameter through the posterior distribution.
1.1 Sparse linear regression with nuisance parameters
As briefly discussed above, the form in (1) is general and includes many interesting statistical models. Here we provide specific examples belonging to (1) in detail. In Section 5, these examples will be used to apply the main results developed in this study.
Example 1 (Multiple response models with missing components).
We consider a general multiple response model with missing values, which is very common in practice. Suppose that for each , a vector of responses with covariance matrix is supposed to be observed, but for the th group (or subject) only entries are actually observed with the rest missing. Letting be the th observation and be the augmented vector of and missing entries, we can write and , where is the submatrix of the identity matrix with the th column included if the th element of is observed, . Assuming that the mean of is only for covariates and sparse coefficients with , the model of interest can be written as , , . The model belongs to the class described by (1) with and for .
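A small numerical sketch of this missing-data structure follows, under assumed dimensions and an assumed compound-symmetric covariance: a selection matrix built from rows of the identity extracts the observed entries of the response, the design, and the covariance. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 4, 20
theta = np.zeros(p)
theta[:3] = [1.0, -2.0, 0.5]                             # sparse coefficients
Sigma = 0.5 * np.eye(m) + 0.5                            # assumed common covariance (compound symmetric)

def observe(X_full, observed_idx):
    """Return the observed part of one response together with its implied design and covariance."""
    E = np.eye(m)[observed_idx, :]                       # selection matrix: rows of the identity
    Y_full = rng.multivariate_normal(X_full @ theta, Sigma)
    return E @ Y_full, E @ X_full, E @ Sigma @ E.T       # observed response, design, covariance

X_full = rng.normal(size=(m, p))
Y_obs, X_obs, Sigma_obs = observe(X_full, observed_idx=[0, 2, 3])
```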
Example 2 (Multivariate measurement error models).
Suppose that a scalar response variable is connected to fixed covariates with and random covariates with fixed , through the following linear additive relationship: , , , . While is fully observed without noise, we observe a surrogate of as , , where to ensure identifiability, is assumed to be known. This type of model is called a measurement error model or an errors-in-variables model; see Fuller, [13] and Carroll et al., [6] for a complete overview. By direct calculations, the joint distribution of is given by
By writing , , , and with , the model is of form (1) with .
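The direct calculation can be illustrated for one common parameterization of such a model, in which the response depends linearly on fully observed covariates z and latent covariates w, and a noisy surrogate v of w is observed with known measurement-error covariance. The parameterization and all symbol names below are assumptions made only for illustration.

```python
import numpy as np

def joint_law(z, beta_z, beta_w, mu_w, Sigma_w, Sigma_u, sigma2_eps):
    """Joint Gaussian law of (y, v) given z under the assumed parameterization:
    w ~ N(mu_w, Sigma_w), y = z@beta_z + w@beta_w + eps with eps ~ N(0, sigma2_eps),
    and surrogate v = w + u with u ~ N(0, Sigma_u), Sigma_u known."""
    mean_y = z @ beta_z + mu_w @ beta_w
    var_y = beta_w @ Sigma_w @ beta_w + sigma2_eps
    cov_yv = Sigma_w @ beta_w                             # Cov(v, y)
    cov = np.block([[np.array([[var_y]]), cov_yv[None, :]],
                    [cov_yv[:, None], Sigma_w + Sigma_u]])
    return np.concatenate(([mean_y], mu_w)), cov

z = np.array([1.0, 0.5]); beta_z = np.array([2.0, -1.0])
beta_w = np.array([1.0, 0.0, -0.5]); mu_w = np.zeros(3)
mean, cov = joint_law(z, beta_z, beta_w, mu_w, np.eye(3), 0.1 * np.eye(3), sigma2_eps=1.0)
```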
Example 3 (Parametric correlation structure).
For , , suppose that we have a response variable and covariates with . We consider a standard regression model given by , , , but is considered to be possibly increasing. For a known parametric correlation structure and a fixed dimensional Euclidean parameter , we model the covariance matrix as using a variance parameter and a correlation matrix . Examples of include first order autoregressive and moving average correlation matrices. The model belongs to (1) by writing and with .
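The three correlation structures mentioned above admit simple explicit constructions. The sketch below builds them as functions of a correlation parameter rho; the admissible range of rho differs across the three structures, as discussed later in Section 5.3.

```python
import numpy as np

def compound_symmetric(m, rho):
    """Compound-symmetric correlation matrix: 1 on the diagonal, rho elsewhere."""
    return (1 - rho) * np.eye(m) + rho * np.ones((m, m))

def ar1(m, rho):
    """First-order autoregressive correlation matrix: rho**|i-j|."""
    idx = np.arange(m)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def ma1(m, rho):
    """First-order moving average correlation matrix: rho on the first off-diagonals."""
    idx = np.arange(m)
    return np.where(np.abs(idx[:, None] - idx[None, :]) == 1, rho, 0.0) + np.eye(m)

# A covariance matrix of the assumed form: variance parameter times correlation matrix.
Sigma = 2.0 * ar1(5, rho=0.6)
```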
Example 4 (Mixed effects models).
For , , consider a response variable and covariates with and with fixed . A mixed effects model is given by , , , , where is a positive definite matrix. Then the marginal law of is given by , . We assume that is known. The model belongs to (1) by letting and with .
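As a small illustration of how the random effect enters the covariance, the sketch below computes the marginal covariance obtained by integrating out a Gaussian random effect; the names and the assumption that the error variance enters as sigma2 times the identity are illustrative.

```python
import numpy as np

def marginal_covariance(Z_i, Psi, sigma2):
    """Marginal covariance of Y_i = X_i@theta + Z_i@b_i + eps_i after
    integrating out b_i ~ N(0, Psi), with eps_i ~ N(0, sigma2*I) (assumed form)."""
    return Z_i @ Psi @ Z_i.T + sigma2 * np.eye(Z_i.shape[0])

rng = np.random.default_rng(2)
Z_i = rng.normal(size=(4, 2))                    # known random-effect design
Psi = np.array([[1.0, 0.3], [0.3, 0.5]])         # random-effect covariance (nuisance)
Sigma_i = marginal_covariance(Z_i, Psi, sigma2=1.0)
```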
Example 5 (Graphical structure with sparse precision matrices).
For a response variable and covariates with increasing and , consider a model given by , , , where is a sparse coefficient vector and the precision matrix is a positive definite matrix. Along with , we also impose sparsity on the off-diagonal entries of , which accounts for a graphical structure between observations. More precisely, if an off-diagonal entry is zero, it implies the conditional independence between the two concerned entries of given the remaining ones, and we suppose that most off-diagonal entries are actually zero, even though we do not know their locations. The model is then seen to be a special case of (1) by letting and with .
Example 6 (Nonparametric heteroskedastic regression models).
For a response variable and a row vector of covariates , a linear regression model with a nonparametric heteroskedastic error is given by , , , where is a sparse coefficient vector, is a univariate variance function, and is a one-dimensional variable associated with the th observation that controls the variance of through the variance function . Then the model belongs to (1) by letting and with .
Example 7 (Partial linear models).
Consider a partial linear model given by , , , where is a response variable, is a row vector of covariates with , is a sparse coefficient vector, is a univariate function, and is a scalar predictor. This model is expressed in form (1) by writing and with .
1.2 Outline
The rest of this paper is organized as follows. In Section 2, some notations are introduced and a prior distribution on sparse regression coefficients is specified. Sections 3–4 provide our main results on the posterior contraction, the Bernstein-von Mises phenomenon, and selection consistency of the posterior distribution. In Section 5, our general theorems are applied to the examples considered above to derive the posterior asymptotic properties in each specific example. All technical proofs are provided in the Appendix.
2 Setup, notations, and prior specification
2.1 Notation
Here we describe the notations we use throughout this paper. For a vector and a set of indices, we write to denote the support of , (or ) to denote the cardinality of (or ), and and to separate components of using . In particular, the support of the true parameter and its cardinality are written as and , respectively. The notation , , stands for the -norm and denotes the maximum norm. We write and for the minimum and maximum eigenvalues of a square matrix , respectively. For a matrix , let stand for the spectral norm and stand for the Frobenius norm of . We also define a matrix norm for the th column of , which is used for compatibility conditions. The column space of is denoted by . For further convenience, we write for the minimum singular value of . The notation means the submatrix of with columns chosen by . For sequences and , (or ) stands for for some constant independent of , and means . These inequalities are also used for relations involving constant sequences.
For given parameters and , we write the joint density as for the density of the th observation vector . In particular, the true joint density is expressed as for with the true parameters and . The notation denotes the expectation operator with the true density . For two probability measures and , let denote the total variation between and . For two -variate densities and of independent variables, denote the average Rényi divergence (of order ) by .
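For Gaussian observations, the per-observation negative log-affinity (the Bhattacharyya distance) has a closed form, so an average divergence of this type can be computed explicitly. The sketch below assumes the per-observation divergence is exactly the negative log-affinity between two multivariate normals; the function names are illustrative.

```python
import numpy as np

def neg_log_affinity(mu1, Sigma1, mu2, Sigma2):
    """Negative log-affinity (Bhattacharyya distance) between two Gaussians,
    using the standard closed form for multivariate normal densities."""
    Sbar = 0.5 * (Sigma1 + Sigma2)
    diff = mu1 - mu2
    quad = 0.125 * diff @ np.linalg.solve(Sbar, diff)
    logdet = lambda A: np.linalg.slogdet(A)[1]
    return quad + 0.5 * (logdet(Sbar) - 0.5 * (logdet(Sigma1) + logdet(Sigma2)))

def average_divergence(params1, params2):
    """Average of the per-observation negative log-affinities over i, a sketch
    of the average Renyi-type divergence used in the paper (assumed definition)."""
    return float(np.mean([neg_log_affinity(m1, S1, m2, S2)
                          for (m1, S1), (m2, S2) in zip(params1, params2)]))

# Two observations under each of two parameter values:
p1 = [(np.zeros(2), np.eye(2)), (np.ones(2), np.eye(2))]
p2 = [(np.zeros(2), 2 * np.eye(2)), (np.zeros(2), np.eye(2))]
print(average_divergence(p1, p2))
```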
For any , we define for the two squared pseudo-metrics:
For compatibility conditions, the uniform compatibility number and the smallest scaled singular value are defined as
We write for the observation vector, for the dimension of , and for the parameter space of . Lastly, for a (pseudo-)metric space , let denote the -covering number, the minimal number of -balls that cover .
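As a rough numerical illustration of the compatibility quantities, the sketch below approximates, for a single small design matrix, the uniform compatibility number and the smallest scaled singular value in the sense of Castillo et al., [8], by searching over sparse supports and random directions. The quantities in this paper adapt the same idea to the stacked multi-observation design, so this is only a schematic approximation.

```python
import numpy as np
from itertools import combinations

def compatibility_sketch(X, s, n_dirs=500, rng=None):
    """Approximate the uniform compatibility number and the smallest scaled
    singular value over supports of size at most s (brute force; small p only)."""
    rng = np.random.default_rng(rng)
    p = X.shape[1]
    col_norm = np.max(np.linalg.norm(X, axis=0))      # largest column norm of X
    phi = psi = np.inf
    for k in range(1, s + 1):
        for S in combinations(range(p), k):
            # exact smallest singular value on this support
            sv_min = np.linalg.svd(X[:, list(S)], compute_uv=False)[-1]
            psi = min(psi, sv_min / col_norm)
            # random-direction search for the compatibility ratio on this support
            for _ in range(n_dirs):
                theta = np.zeros(p)
                theta[list(S)] = rng.normal(size=k)
                ratio = (np.linalg.norm(X @ theta) * np.sqrt(k)
                         / (col_norm * np.linalg.norm(theta, 1)))
                phi = min(phi, ratio)
    return phi, psi

X = np.random.default_rng(0).normal(size=(30, 8))
print(compatibility_sketch(X, s=2))
```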
2.2 Prior for the high-dimensional coefficients
In this subsection, we specify a prior distribution for the high-dimensional regression coefficients . A prior for should satisfy the conditions required for the main results, so its specific characterization is deferred to Section 3. On the other hand, the prior for specified here satisfies all the requirements needed for our purposes.
We first select a dimension from a prior , and then randomly choose for given . A nonzero part of is then selected from a prior on while is fixed to zero. The resulting prior specification for is formulated as
(2)
where is the Dirac measure at zero on with suppressed dimensionality. For the prior on the model dimensions, we consider a prior satisfying the following: for some constants ,
(3)
Examples of priors satisfying (3) can be found in Castillo and van der Vaart, [9] and Castillo et al., [8]. For the prior , the -fold product of the exponential power density is considered, where the regularization parameter is allowed to vary with and , i.e.,
(4)
for some constants . The order of is important in that it determines the boundedness requirement of the true signal (see condition (C3) below). A particularly interesting case is obtained when is set to the lower bound . Then the boundedness condition becomes very mild by choosing sufficiently large. When is set to the upper bound, the boundedness condition is still reasonably mild. However, it can actually be relaxed if the true signal is known to be small enough, though we do not pursue this generalization in this study. In Section 4, we shall see that values of that do not increase too fast are in fact necessary for a distributional approximation and selection consistency.
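Sampling from the prior in (2)–(4) is straightforward: draw a dimension, draw a support of that size uniformly at random, and fill the selected coordinates with independent draws from the slab density. The sketch below uses a Laplace slab and a dimension prior proportional to (c p^a)^(-s), which is one example of the priors satisfying (3) from Castillo and van der Vaart, [9]; the constants and the choice of the regularization parameter are illustrative.

```python
import numpy as np

def sample_spike_and_slab(p, lam, c=2.0, a=1.0, rng=None):
    """One draw from a spike-and-slab prior of the form (2)-(4) (illustrative constants)."""
    rng = np.random.default_rng(rng)
    s_grid = np.arange(0, p + 1)
    log_w = -s_grid * (np.log(c) + a * np.log(p))     # unnormalized log prior on the dimension
    w = np.exp(log_w - log_w.max())
    s = rng.choice(s_grid, p=w / w.sum())             # model size
    support = rng.choice(p, size=s, replace=False)    # uniformly chosen support
    theta = np.zeros(p)
    theta[support] = rng.laplace(0.0, 1.0 / lam, size=s)  # Laplace slab with scale 1/lambda
    return theta

theta_draw = sample_spike_and_slab(p=100, lam=np.sqrt(np.log(100)), rng=5)
```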
Remark 1.
Since, unlike Castillo et al., [8], some size restriction on will be imposed, we note that the use of the Laplace density is not essential and other prior distributions may also be used for . For example, normal densities can be used for to exploit semi-conjugacy. However, if its precision parameter is fixed independent of , a normal prior requires a stronger restriction on the true signal than (C3) below. To achieve the nearly optimal posterior contraction, other densities with similar tail properties should also work with appropriate modifications for the true signal size (see, e.g., Jeong and Ghosal, [19]). Instead of the spike-and-slab prior in (2) and (3), a class of continuous shrinkage priors may also be used at the expense of substantial modifications in the technical details [28]. In this paper, we only consider the prior in (2)–(4).
3 Posterior contraction rates
The prior for a nuisance parameter should be chosen to complete the prior specification. Once we assign the prior for the full parameters, the posterior distribution is defined by Bayes’ rule. How the prior for is chosen is crucial for obtaining desirable asymptotic properties of the posterior distribution. In this section, we shall examine such conditions on the prior distribution for a nuisance parameter and study the posterior contraction rates for both and .
The prior for is put on a subspace . In many instances, we take , especially when a nuisance parameter is finite dimensional, but the flexibility of a subspace may be beneficial in infinite-dimensional situations. We need to choose to satisfy certain conditions.
- (C1) There exists a nondecreasing sequence such that
- (C2) For some sequence such that and with satisfying (C1),
The first condition of (C1) implies that we have a good approximation to the true parameter value in the parameter set . This holds trivially if there exists such that for every , which is obviously true if . The second condition of (C1) means that in , the maximum Frobenius norm of the difference between covariance matrices can be controlled by the average Frobenius norm multiplied by the sequence . Clearly, this holds with if is the same for every . By the triangle inequality, we see that (C1) implies that
(5)
which is used throughout the paper. Condition (C2) is typically called the prior concentration condition, which requires a prior to put sufficient mass around the true parameter , measured by the pseudo-metric . As in other infinite-dimensional situations, such a closeness is translated into the closeness in terms of the Kullback-Leibler divergence and variation (see Lemma 1 in Appendix for more details).
As noted in Section 1, the true parameters should be restricted to certain norm-bounded subset of the parameter space. This is clarified as follows.
- (C3) The true signal satisfies .
- (C4) The eigenvalues of the true covariance matrix satisfy
Condition (C3) is required to apply the general strategy for posterior contraction to our modeling framework containing nuisance parameters. More specifically, the condition is imposed so that the prior assigns sufficient mass to a Kullback-Leibler neighborhood of . If nuisance parameters are not present, one can directly handle the model and such a restriction may be removed [e.g., 8, 14]. One may refer to Song and Liang, [28], Ning et al., [25], and Bai et al., [2] for conditions similar to ours, where a variance parameter stands for a nuisance parameter. Still, the condition is mild if is chosen to decrease at an appropriate order. In particular, if is matched to the lower bound , the condition becomes which is very mild if is sufficiently large. Even if the upper bound is chosen, the condition is not restrictive as the right hand side of the condition can be made nondecreasing as long as is increasing at a suitable order. Condition (C4) implies that the eigenvalues of the true covariance matrix are bounded below and above. The lower and upper bounds are required for many of the technical arguments, including the construction of an exponentially powerful test in Lemma 2 in the Appendix.
Remark 2.
Condition (C3) is stronger than necessary, but is adopted for ease of interpretation. For Theorem 3 below to hold, it suffices if we have for satisfying (C2). For the optimal posterior contraction in Theorem 4 below, a slightly stronger bound is needed: (see Lemma 6 and its proof in the Appendix).
3.1 Rényi posterior contraction and recovery
The goal of this subsection is to study posterior contraction of relative to the - and -metrics. To do so, we derive the posterior contraction rate with respect to the average Rényi divergence , and then the rates for relative to more concrete metrics will be recovered from the Rényi contraction.
To proceed, we first need to examine a dimensionality property of the support of . The following theorem shows that the posterior distribution is concentrated on models of relatively small sizes.
Theorem 1 (Dimension).
Compared to the literature [e.g., 8, 23, 3], the rate in Theorem 1 is floored by the extra term . This arises from the presence of a nuisance parameter in the model formulation. To minimize its impact, a prior on should be chosen such that (C2) holds for as small as possible; a suitable choice induces the (nearly) optimal contraction rate.
Using the basic results in Theorem 1, the next theorem obtains the rate at which the posterior distribution contracts at the truth with respect to the average Rényi divergence. The theorem requires additional assumptions on a prior.
- (C5) For with satisfying (C2), a sufficiently large , and some sequences and satisfying , there exists a subset such that
(6) (7) (8)
The above conditions are related to the classical ones in the literature (e.g., see Theorem 2.1 of Ghosal et al., [15]). Condition (6) requires that for every , the minimum eigenvalue of is not too small on a sieve . Although can be any positive sequence, a sequence increasing exponentially fast makes the entropy in (7) too large, resulting in a suboptimal rate . If can be chosen to be smaller than and , then this does not lead to any deterioration of the rate in . The entropy condition (7) is actually stronger than needed. Scrutinizing the proof of the theorem, one can see that the entropy appearing in the theorem is obtained using pieces that are smaller than those giving the exponentially powerful test in Lemma 2 in the Appendix. However, the covering number with those pieces looks more complicated and the form in (7) suffices for all examples in the present paper. Lastly, condition (8) requires that the exterior of the sieve possess sufficiently small prior mass, in order to counteract the factor arising from the lower bound on the denominator of the posterior distribution. In fact, conditions similar to (C2), (7) and (8) are also required for the prior of . By reading the proof, it is easy to see that the prior (2) explicitly satisfies the analogous conditions on an appropriately chosen sieve.
Theorem 2 (Contraction rate, Rényi).
We want to sharpen the rate as much as possible. In most instances, can be chosen such that . This is trivially satisfied if is some polynomial in as in the examples in this paper. If is known to increase much faster than , e.g., for some , then need not be a polynomial in and the condition can be met more easily with a sequence that grows even faster. Note also that we typically have in most cases. These postulates lead to . Indeed, it is often possible to choose , which is commonly guaranteed by choosing an appropriate sieve and a prior. The condition will be made precise in (C5∗) below for recovery and we only consider the situation that in what follows.
Although Theorem 2 provides the basic results for posterior contraction, it does not give precise interpretations for the parameters and themselves, because of the abstruse expression of the average Rényi divergence. The contraction rates with respect to more concrete metrics are recovered under some additional conditions. Under the additional assumption , it can be shown that Theorem 1 and Theorem 2 explicitly imply that for the set
with a sufficiently large constant , the posterior mass of goes to one in probability (see the proof of Theorem 3). To complete the recovery, we need to separate the sum of squares of the mean into and , which requires an additional condition. The conditions required for the recovery are clarified as follows.
- (C5∗)
- (C6)
By expanding the quadratic term for the mean in , one can see that the separation is possible if (C6) is satisfied. Clearly, (C6) is trivially satisfied if the model has only for its mean, in which case we take for every . In many cases where there exists such that , we can often take for the second inequality of (C6) to hold automatically.
The following theorem shows that the posterior distribution of and contracts around their respective true values at some rates, relative to more easily comprehensible metrics than the average Rényi divergence. In the expressions, if , the compatibility numbers should be understood to be equal to 1 for interpretation.
Theorem 3 (Recovery).
The thresholds for contraction depend upon the compatibility conditions, which makes their implications somewhat vague. As is much smaller than , it is not unreasonable to assume that and are bounded away from zero, whence the compatibility numbers are removed from the rates. We refer to Example 7 of Castillo et al., [8] for more discussion. In the next subsection, we will see that one of these restrictions is actually necessary for shape approximation or selection consistency.
Remark 3.
The separation condition (C6) can be left as an assumption to be satisfied, but can also be verified by a stronger condition on the design matrix without resorting to the values of the parameters. Suppose that for some integer , there exists a matrix such that for every , with some map . Since we can write for any , the Cauchy-Schwarz inequality indicates that the first inequality of (C6) is implied by
for . The left hand side is always between and by the Cauchy-Schwarz inequality, and is exactly equal to or if and only if the two vectors are linearly dependent. A sufficient condition for the preceding display is thus since the linear dependence cannot happen under such a condition due to the inequality for such that . This sufficient condition is not restrictive at all if as we already have . Since there typically exists satisfying the second inequality of (C6) as long as provides a good approximation for the true parameter , condition (C6) can be easily satisfied if the sufficient condition is met.
Notwithstanding the lack of formal study of minimax rates with additional complications, we still want to match our rates for with those in simple linear regression, which we call the “optimal” rates. In this sense, Theorem 3 only provides the suboptimal rates for if . Although the theorem gives the optimal results if , it is practically hard to check this condition as is unknown. If is known to be nonzero, the desired conclusion is trivially achieved as soon as . The following corollary, however, shows that the optimal rates are still available even if , with restrictions on and the prior.
Corollary 1 (Optimality under restriction).
The corollary is useful in limited situations, especially when a parametric rate is available for a nuisance parameter. Even if , we need to further assume that , i.e., the ultra high-dimensional setup, to conclude that (a) holds, while we can always apply (b) because . Although assertion (b) holds for any if is chosen sufficiently large, its specific threshold is not directly available. Indeed, by carefully reading the proof of Theorem 1 together with Lemma 1 in Appendix, one can see that the threshold depends on unknown constant bounds for the eigenvalues of the true covariance matrix in (C4). Still, (b) holds for any if . We believe that the assumption is very mild, and hence simply apply (b) with this assumption to conclude the optimal contraction for models with finite dimensional nuisance parameters. The optimal rates can still be achieved for any by verifying the conditions in the following subsection. With finite dimensional nuisance parameters, we do not pursue this direction as it seems an overkill considering the mildness of the assumption , though those conditions are actually required for the Bernstein-von Mises theorem and selection consistency in Section 4.
3.2 Optimal posterior contraction for
Recall that only suboptimal rates may be available from Theorem 3 if . In many semiparametric situations, however, it is often possible to obtain parametric rates for finite-dimensional parameters under stronger conditions, even when there are infinite-dimensional nuisance parameters in a model [4, 7]. It has also been shown that a similar argument holds in some high-dimensional semiparametric regression models [10]. Therefore, it is naturally of interest to examine under what conditions we can replace by in the rates for , even if . Similar to other semiparametric settings [4, 10], this can be established by semiparametric theory, but it requires stronger conditions than those in traditional fixed-dimensional parametric cases because of the high dimensionality of the parameters in our setup.
To proceed, some additional conditions are required for technical reasons, which are made for the size of as the optimal rates are automatically attained if . Still, in a practical sense, the conditions almost always need to be verified to reach the optimal rates, since only oracle rates are generally available and we do not know which term is greater.
In what follows, we write for satisfying the conditions of Theorem 3 through the definition of . We first assume the following condition on the uniform compatibility number.
- (C3) For a sufficiently large , the uniform compatibility number is bounded away from zero.
This condition is weaker than assuming that the smallest scaled singular value is bounded away from zero, as we have for any by the Cauchy-Schwarz inequality. We will also rely on a slightly stronger condition with respect to for a distributional approximation in the following section. In this sense, our condition is weaker than those for Theorem 4 of Castillo et al., [8]. Condition (C3) is not restrictive as (C5∗) requires ; we again refer to Example 7 of Castillo et al., [8].
To precisely describe other conditions, hereafter we use the following additional notations. We write
and to denote the collection of for . In particular, denotes the submatrix of with columns chosen by an index set . We also define the following neighborhoods of the true parameters: for and satisfying (C5∗), and sufficiently large constants and ,
(10)
Combined with the other conditions, Theorem 3 implies that the posterior probabilities of these neighborhoods tend to one in probability if . We need some bounding conditions on these neighborhoods, which will be specified below.
Let for any given . For a given , we choose a bijective map such that for some orthogonal projection which may depend on the true parameter values, but not on and . The projection plays a key role here and for a distributional approximation in the following section, and thus should be appropriately chosen to satisfy the following conditions.
- (C4) The orthogonal projection satisfies
- (C5) The conditional law of given , induced by the prior, is absolutely continuous relative to its distribution at (which is the same as the prior for ), and the Radon-Nikodym derivative satisfies
By reading the proof, one can see that Theorem 4 below is based on the approximate likelihood ratio. The first condition of (C4) is required to control the remainder of an approximation. The second condition of (C4) implies that for every with such that , as the second inequality trivially holds by the fact that is an orthogonal projection. The use of the shifting map is justified by condition (C5), which implies that a shift in certain directions does not substantially affect the prior on . This is related in spirit to the absolute continuity condition in the semiparametric Bernstein-von Mises theorem (see, for example, Theorem 12.8 of Ghosal and van der Vaart, [17]). We will see that a distributional approximation also requires similar, but stronger conditions.
Lastly, the complexity of the neighborhood should also be controlled. Specifically, we make the following condition.
- (C6) For and satisfying (C1) and a sufficiently large ,
- (C7) The parameter space is separable with the pseudo-metric .
Similar to (C4), these conditions are required to control the remainder of an approximation. The integral term comes from the expected supremum of a separable Gaussian process, exploiting the Gaussian likelihood of the model and the separability of with the standard deviation metric. Condition (C7) is crucial for this reason. Since we usually put a prior on in an explicit way, condition (C7) is rarely violated in practice. One may see a connection between the first term of (C6) and the conditions for Corollary 1. The former easily tends to zero even if is increasing, due to the extra term which commonly tends to zero in a polynomial order. Note also that the term appears in (C4) and (C6). Although this gives sharper bounds, the conditions often need to be verified with replaced by as is unknown.
Under the conditions specified above, we obtain the following theorem for the contraction rates for which do not depend on . The compatibility numbers below should be understood to be 1 if .
Theorem 4 (Optimal posterior contraction).
Similar to the discussion following Theorem 3, the compatibility numbers are easily bounded away from zero so that they can be removed from the expressions. These requirements are actually weaker than before as . The simplified rates are then available for ease of interpretation.
Remark 4.
Remark 5.
Suppose that there exists a matrix such that for every with some map . Then, a general strategy to choose is to set for . In this case, by the triangle inequality, the first condition of (C4) is satisfied if there exists such that . For (C8∗) in the next section, this is replaced by . These are trivially the case if there exists such that . Also similar to Remark 3, a sufficient condition for the second line of (C4) is as pre-multiplication of a positive definite matrix by and is an isomorphism. This is also sufficient for (C8∗) in the next section with replaced by .
Remark 6.
In many instances, for every and , we typically have
for some sequences and , especially when the part of involved with is an -dimensional Euclidean parameter. Note that is equal to
If is increasing, the right hand side is bounded by a multiple of by the tail probability of a normal distribution, while it is bounded by a multiple of for nonincreasing . This simplification is useful to verify (C6) in many applications, and can also be used for (C10∗) in the next section.
4 Bernstein-von Mises and selection consistency
An extremely important question is whether the true support is recovered with probability tending to one, a property called selection consistency. We will show this based on a distributional approximation to the posterior distribution. Combined with selection consistency, the shape approximation also leads to an approximation by the product of a point mass and a normal distribution, which we call the Bernstein-von Mises theorem. This reduced approximate distribution enables us to correctly quantify the remaining uncertainty of the parameter through the posterior distribution.
4.1 Shape approximation to the posterior distribution
It is worth noting that selection consistency can often be verified without a distributional approximation. For example, in sparse linear regression with scalar unknown variance , Song and Liang, [28] deployed the marginal likelihood of the model support which can be obtained by integrating out and from the likelihood using the inverse gamma kernel. In our general formulation, however, this approach is hard to implement due to the arbitrary structure of a nuisance parameter . Indeed, the approach is not directly available even for a parametric covariance matrix with dimension . In this sense, using a shape approximation could be a natural solution to the problem, which may require some extra conditions on the parameter space and on the priors for and .
Recall that the results in Section 3.2 are based on semiparametric theory. In this section we will need conditions very similar to those used before, but the requirements are generally stronger, as the remainder of the approximation must be controlled more strictly. Since the setup is high-dimensional, our conditions are even more restrictive than those for semiparametric models with a fixed-dimensional parametric segment [e.g., 7]. One may refer to Section 3.3 of Chae et al., [10] for a relevant discussion.
Throughout this section, we only consider that satisfies the conditions of Theorem 3. First of all, we make a modification of (C3). The following condition is slightly stronger than (C3), but is still not too restrictive as (C5∗) requires .
- (C7∗) Condition (C3) is satisfied with replaced by .
The assumption on the prior for is made only through the regularization parameter . As in Castillo et al., [8], should not increase too fast and should satisfy . In our setup, the range of induces a sufficient condition for this: . Since this is weaker than the one that will be made later in this section, the “small lambda regime” is automatically met by a stronger condition for the entire procedure for a distributional approximation (see (C10∗) below and the following paragraph).
For sufficiently large constants and , we now define the neighborhoods,
(12)
Note that is defined with an -ball, which makes it contract more slowly than in (10) under (C7∗). This is for technical reasons: for a distributional approximation, the -ball must be handled directly on the complement of . The neighborhood is also enlarged to match . We leave more details on this to the reader; refer to the proof of Theorem 5 below.
As in Section 3.2, we choose a bijective map which gives rise to for some orthogonal projection . Again, the orthogonal projection should be carefully chosen to satisfy some boundedness conditions. The conditions are similar to, but stronger than those in Section 3.2. This is not only because of the increased neighborhoods and , but also because the remainder of an approximation should be bounded on their complements. We precisely make the required conditions below.
- (C8∗) The orthogonal projection satisfies
- (C9∗) The conditional law of given , induced by the prior, is absolutely continuous relative to its distribution at , and the Radon-Nikodym derivative satisfies
- (C10∗) For and satisfying (C1) and a sufficiently large ,
Conditions (C8∗)–(C10∗) are required for similar reasons as in Section 3.2. We mention that (C10∗) is a sufficient condition for the small lambda regime, since its necessary condition is that is stronger than . This necessary condition for (C10∗) is often a sufficient condition in many finite dimensional models.
We define the standardized vector,
Under the assumptions above, the posterior distribution of is approximated by given by
(13)
where is the Gaussian measure with mean and precision on the coordinate , is the Dirac measure at zero on , is the least squares solution , and the weights satisfy
Another way to express , for any measurable , is
where denotes the Lebesgue measure and
(14)
It can be easily checked that the two expressions are equivalent. The results are summarized in the following theorem.
4.2 Model selection consistency
The shape approximation to the posterior distribution facilitates obtaining the next theorem which shows that the posterior distribution is concentrated on subsets of the true support with probability tending to one. The result is then used as the basis of selection consistency. Similar to the literature, the theorem requires an additional condition on the prior as follows.
- (C12) The prior satisfies and for .
Theorem 6 (Selection, no supersets).
Since coefficients that are too close to zero cannot be identified by any selection strategy, some threshold for the true nonzero coefficients is needed for detection. The requirement of a threshold is a fundamental limitation in high-dimensional setups. We impose the following threshold, the so-called beta-min condition. The condition is formulated in view of the third assertion of Theorem 4. The second assertion can also be used to obtain a similar threshold, but we only consider the one given below as it is generally weaker.
- (C13) The true parameter satisfies
Since Theorem 3 implies that, under the beta-min condition (C13), the support of includes the true support with posterior probability tending to one, selection consistency is an easy consequence of Theorem 6. Moreover, this improves the distributional approximation in (15) so that the posterior distribution can be approximated by a single component of the mixture; that is, the Bernstein-von Mises theorem holds for the parameter component . The arguments here are summarized in the following two corollaries, whose proofs are straightforward and thus are omitted.
Corollary 2 (Selection consistency).
Corollary 3 (Bernstein-von Mises).
Corollary 3 enables us to quantify the remaining uncertainty of the parameter through the posterior distribution. Specifically, we can construct credible sets for the individual components of as in Castillo et al., [8]. It is easy to see that by the definition of , its th component has a normal distribution, whose mean is the th element of and variance is the th diagonal element of . Correct uncertainty quantification is thus guaranteed by the weak convergence.
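As a concrete illustration of how such credible sets can be formed from the approximation, the sketch below builds marginal credible intervals from a Gaussian approximation with a given center and covariance; the particular center and covariance are placeholders standing in for the quantities delivered by the Bernstein-von Mises approximation.

```python
import numpy as np
from scipy import stats

def componentwise_credible_intervals(center, cov, level=0.95):
    """Marginal credible intervals for each selected coefficient, treating it as
    Gaussian with the given mean and the corresponding diagonal variance."""
    z = stats.norm.ppf(0.5 + level / 2.0)
    half = z * np.sqrt(np.diag(cov))
    return np.column_stack([center - half, center + half])

center = np.array([1.2, -0.7, 0.4])           # approximating mean (placeholder)
cov = np.diag([0.02, 0.05, 0.01])             # approximating covariance (placeholder)
print(componentwise_credible_intervals(center, cov))
```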
5 Applications
In this section, we apply the main results established in this study to the examples considered in Section 1.1. The main objective is to obtain nearly optimal posterior contraction rates and selection consistency via shape approximation to the posterior distribution with the Bernstein-von Mises phenomenon.
To use Corollary 1 for the optimal posterior contraction when , we simply assume that for all examples in this section, although Theorem 4 can also be applied under stronger conditions. The assumption is extremely mild compared with requiring the ultra high-dimensional case, i.e., . A large enough is also sufficient in place of the assumption , but we do not pursue this direction as a specific threshold is not available. We check the conditions of Theorem 4 only for the more complicated models where .
5.1 Multiple response models with missing components
We first apply the main results to Example 1. To recover posterior contraction of from the primitive results, it is necessary to assume that every entry of the response is jointly observed sufficiently many times. To be more specific, let be 1 if the th entry of is observed and be zero otherwise. The contraction rate of the th element of is directly determined by the order of . The ideal case is when this quantity is bounded away from zero, that is, the entries are jointly observed at a rate proportional to . Then the recovery is possible without any loss of information. If decays to zero, then the optimal recovery is not attainable, but consistent estimation may still be possible with slower rates. With an inverse Wishart prior on , the following theorem studies the posterior asymptotic properties of the given model.
Theorem 7.
Assume that , , , and for some nondecreasing such that . Then the following assertions hold.
- (a) The optimal posterior contraction rates for in (11) are obtained.
- (b) The posterior contraction rate for is with respect to the Frobenius norm.
Assume further that and for a sufficiently large . Then the following assertions hold.
5.2 Multivariate measurement error models
We now consider Example 2. For convenience we write , , and in what follows. In this subsection, we use the symbol for the Kronecker product of matrices. For the priors on the nuisance parameters, normal prior distributions are assigned to the location parameters (, , and ), and inverse gamma and inverse Wishart priors are used for the scale parameters ( and ). The next theorem shows the posterior asymptotic properties of the model. In particular, specific forms of the mean and variance for the shape approximation are provided, reflecting the modeling structure.
Theorem 8.
Assume that , , , , , , and for a sufficiently large . Then the following assertions hold.
- (a) The optimal posterior contraction rates for in (11) are obtained.
- (b) The contraction rates for , , , and are relative to the -norms. The same rate is also obtained for with respect to the Frobenius norm.
Assume further that and for a sufficiently large . Then the following assertions hold.
- (c)
- (d) If and for , then the no-superset result in (16) holds.
- (e)
We note that the marginal law of is given by . This suggests that the rates for and may actually be improved to the parametric rate (possibly up to logarithmic factors). However, the other parameters are connected to the high-dimensional coefficients , so such a parametric rate may not be obtainable for them.
5.3 Parametric correlation structure
Next, our main results are applied to Example 3. A correlation matrix should be chosen so that the conditions in the main theorems can be satisfied. Here we consider compound-symmetric, first-order autoregressive, and first-order moving average correlation matrices: for with fixed boundaries and of the range, respectively, , , and . The range is chosen so that the corresponding correlation matrix is positive definite, i.e., for , for , and for . Again, an inverse gamma prior is assigned to . For a prior on , we consider a density
for some such that for close to and for close to .
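One concrete density with this kind of boundary behavior, given here only as an illustrative assumption and not necessarily the choice used in the paper, is proportional to exp{-1/(r - l) - 1/(u - r)} on the interval (l, u); its mass vanishes extremely fast at both endpoints.

```python
import numpy as np
from scipy.integrate import quad

def make_boundary_decaying_prior(l, u):
    """Density on (l, u) proportional to exp(-1/(r - l) - 1/(u - r)); an
    illustrative prior whose tails vanish very fast at both boundaries."""
    def unnorm(r):
        r = np.clip(r, l + 1e-12, u - 1e-12)
        return np.exp(-1.0 / (r - l) - 1.0 / (u - r))
    Z, _ = quad(unnorm, l, u)                     # numerical normalizing constant
    return lambda r: unnorm(r) / Z

prior = make_boundary_decaying_prior(l=-0.9, u=0.9)
print(prior(0.0), prior(0.89))                    # the density is negligible near the boundary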
Theorem 9.
Assume that , , , , , for some fixed . Suppose further that for the compound-symmetric correlation matrix and for the autoregressive and moving average correlation matrices. Then the following assertions hold.
- (a) For any correlation matrix discussed above, the optimal posterior contraction rates for in (11) are obtained.
- (b) For the autoregressive and moving average correlation matrices, the posterior contraction rates for and are with respect to the -norms. For the compound-symmetric correlation matrix, their contraction rates are relative to the -norm.
Assume further that and for a sufficiently large . Then the following assertions hold.
As for the prior for , the property that the tail probabilities decay to zero exponentially fast near both zero and one is crucial for the optimal posterior contraction rates. It should be noted that many common probability distributions with compact supports may not be enough for this purpose (e.g., beta distributions).
The main difference between this example and those in the preceding subsections is that we consider a possibly increasing here. Although we have the same form of contraction rates for as in the previous examples, the implication is not the same due to a different order of . For increasing , it is expected that , which is commonly the case in regression settings. This reduces to for the cases with fixed , and hence an increasing may help achieve faster rates. While the increasing dimensionality of is often a benefit for the contraction properties of , this may or may not be the case for the nuisance parameters, since it depends on the dimensionality of . In the example in this subsection, the dimension of the nuisance parameters is fixed although can increase, which makes their posterior contraction rates faster than those with fixed . However, this may not be true if is of increasing dimension; see the example in Section 5.5.
5.4 Mixed effects models
For the mixed effects models with sparse regression coefficients in Example 4, we assume that the maximum of is bounded, which is particularly mild if is bounded. We also assume that and , that is, is likely to be larger than with fixed probability and has full rank. These conditions are required for (C1) to hold. We put an inverse Wishart prior on as in other examples. The following theorem shows the posterior asymptotic properties of the mixed effects models.
Theorem 10.
Assume that , , , , , , and . Then the following assertions hold.
- (a) The optimal posterior contraction rates for in (11) are obtained.
- (b) The posterior contraction rate for is with respect to the Frobenius norm.
Assume further that and for a sufficiently large . Then the following assertions hold.
Note that we assume that is known, which is actually unnecessary at the modeling stage. The assumption was made to find a sequence satisfying (C1) with ease. This can be relaxed only with stronger assumptions on . For example, if and is an all-one vector, then the model is equivalent to that with a compound-symmetric correlation matrix in Section 5.3 with some reparameterization, in which can be treated as unknown.
5.5 Graphical structure with sparse precision matrices
For the graphical structure models in Example 5, we define an edge-inclusion indicator such that if and otherwise, where is the th element of . We put a prior with a density on to the nonzero off-diagonal entries and a prior with a density on to the diagonal entries of , such that the support is truncated to a matrix space with restricted eigenvalues and entries. For the edge-inclusion indicator, we use a binomial prior with probability when is given, and assign a prior to such that . The prior specification is summarized as
where is a collection of positive definite matrices for a sufficiently large , in which eigenvalues are between and entries are also bounded by in absolute value.
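The sketch below mimics this prior by drawing edge indicators, filling the corresponding off-diagonal entries, and keeping only draws that land in a truncated set of well-conditioned matrices; the densities, the eigenvalue bounds, and the fixed edge probability are illustrative stand-ins for the quantities left unspecified above.

```python
import numpy as np

def draw_sparse_precision(m, edge_prob, rng, eig_bounds=(0.1, 10.0), max_tries=1000):
    """One draw of a sparse precision matrix, rejected until it falls in a
    truncated set of positive definite matrices with bounded eigenvalues and entries."""
    lo, hi = eig_bounds
    for _ in range(max_tries):
        Omega = np.diag(rng.uniform(1.0, 2.0, m))          # diagonal entries
        for i in range(m):
            for j in range(i + 1, m):
                if rng.random() < edge_prob:                # edge-inclusion indicator
                    Omega[i, j] = Omega[j, i] = rng.normal(0.0, 0.3)
        eigs = np.linalg.eigvalsh(Omega)
        if lo < eigs.min() and eigs.max() < hi and np.abs(Omega).max() < hi:
            return Omega                                    # accepted: inside the truncated set
    raise RuntimeError("no draw fell in the truncated matrix space")

Omega = draw_sparse_precision(m=6, edge_prob=0.2, rng=np.random.default_rng(3))
```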
Theorem 11.
Let for . Assume that , , , for some such that , for some , and . Then the following assertions hold.
- (a)
- (b) The posterior contraction rate of is with respect to the Frobenius norm.
If further and for a sufficiently large , then the following assertion holds.
- (c) The optimal posterior contraction rates for in (11) are obtained even if .
Assume further that and for a sufficiently large . Then the following assertions hold.
Note that increasing is likely to improve the -norm contraction rate for as we expect that . In particular, the improvement is clearly the case if and for a sufficiently large . However, as pointed out in Section 5.3, this is not the case for as its dimension is also increasing.
If we assume that , then the term arising from the sparse precision matrix becomes . The latter is comparable to the frequentist convergence rate of the graphical lasso in Rothman et al., [27]. Therefore, our rate is deemed to be optimal considering the additional complication due to the mean term involving sparse regression coefficients.
5.6 Nonparametric heteroskedastic regression models
Next, we use the main results for Example 6. For a bounded, convex subset , define the -Hölder class as the collection of functions such that , where
with the th derivative of and the largest integer that is strictly smaller than . Let the true function belong to with the assumption that is strictly positive. While suffices for the basic posterior contraction, we will see that the optimal posterior contraction for requires . The stronger condition is needed even for the Bernstein-von Mises theorem and the selection consistency, but all these conditions are mild if the true function is sufficiently smooth.
We put a prior on through B-splines. The function is expressed as a linear combination of -dimensional B-spline basis terms of order , i.e., , while an inverse Gaussian prior distribution is independently assigned to each entry of . For any measurable function , we let and denote the sup-norm and empirical -norm, respectively. To deploy the properties of B-splines, we assume that are sufficiently regularly distributed on .
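A draw from this prior can be produced directly with standard B-spline tools. The sketch below uses an assumed number of basis terms, spline order, and inverse Gaussian parameters, so it only illustrates the construction; positive coefficients keep the resulting variance function positive because B-spline basis functions are nonnegative.

```python
import numpy as np
from scipy.interpolate import BSpline

def draw_variance_function(J=10, order=4, rng=None):
    """One draw of the variance function: a B-spline expansion whose J
    coefficients get independent inverse Gaussian (Wald) draws."""
    rng = np.random.default_rng(rng)
    k = order - 1                                          # spline degree
    knots = np.r_[np.zeros(k), np.linspace(0.0, 1.0, J - k + 1), np.ones(k)]
    coef = rng.wald(1.0, 1.0, size=J)                      # positive coefficients
    return BSpline(knots, coef, k, extrapolate=False)

v = draw_variance_function(rng=6)
print(v(np.linspace(0.05, 0.95, 5)))                       # the draw evaluated on a grid
```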
Theorem 12.
The true function is assumed to be strictly positive on and belong to with . We choose . Let for and assume that , , and . Then the following assertions hold.
- (a)
- (b) The posterior contraction rate for is with respect to the -norm.
If further and for a sufficiently large , then the following assertion holds.
- (c) The optimal posterior contraction rates for in (11) are obtained even if .
Assume further that , and for a sufficiently large . Then the following assertions hold.
An inverse Gaussian prior is used due to the property that its tail probabilities at both zero and infinity decay to zero exponentially fast. The exponentially decaying tail probabilities in both directions are essential to obtain the optimal contraction rate. Note that standard choices such as gamma and inverse gamma distributions do not satisfy this property.
By investigating the proof, it can be seen that the condition is required to satisfy condition (C1) for posterior contraction, so this condition is not avoidable in applying the main theorems. Unlike Theorem 13 below, assertion (c) does not require any further boundedness condition. This is because the restriction makes the required bound tend to zero. For the Bernstein-von Mises theorem and the selection consistency, it can be seen that is necessary for the condition but not sufficient. Although the requirement is implied by the latter condition, we specify this in the statement due to its importance. We refer to the proof of Theorem 12 for more details.
5.7 Partial linear models
Lastly, we consider Example 7. We assume that the true function belongs to for with . Any suffices for the basic posterior contraction, but stronger restrictions are required for further assertions as in Theorem 12. We put a prior on through -dimensional B-spline basis terms of order , i.e., . With a given , we define the design matrix . The standard normal prior is independently assigned to each component of and an inverse gamma prior is assigned to . Similar to Section 5.6, we assume that are sufficiently regularly distributed on .
Theorem 13.
The true function is assumed to satisfy with . We choose for some . Let for and assume that , , , , and for a sufficiently large . Then the following assertions hold.
- (a)
- (b) The contraction rates for and are with respect to the - and -norms, respectively.
If further , , and for a sufficiently large , then the following assertion holds.
- (c) The optimal posterior contraction rates for in (11) are obtained even if .
Assume that , , , and for a sufficiently large . Then the following assertions hold.
Here we elaborate more on the choice of the number of basis terms. For assertions (a)–(b), can be chosen such that , which gives rise to the optimal rates for the nuisance parameters. This choice, however, does not satisfy (C4) and (C8∗), and hence we need a better approximation for with some to control the remaining bias strictly. For example, if , the boundedness condition for (c) reduces to , which gives the optimal contraction for by (a). Therefore, to incorporate the case that , there is a need to consider some appropriate that is strictly smaller than . For the Bernstein-von Mises theorem and the selection consistency, the required restriction becomes even stronger such that .
Appendix A Proofs for the main results
In this section, we provide proofs of the main theorems. We first describe the additional notations used for the proofs. For a matrix , we write for the eigenvalues of in decreasing order. The notation stands for the likelihood ratio of and . Let denote the expectation operator with the density and let denote the probability operator with the true density. For two densities and , let and stand for the Kullback-Leibler divergence and variation, respectively. Using some constants , we rewrite (C4) as for clarity.
A.1 Proof of Theorem 1
We first state a lemma showing that the denominator of the posterior distribution is bounded below by a factor with probability tending to one, which will be used to prove the main theorems.
Proof.
We define the Kullback-Leibler-type neighborhood for a sufficiently large . Then Lemma 10 of Ghosal and van der Vaart, [16] implies that for any ,
(20)
Hence, it suffices to show that is bounded below as in the lemma. By Lemma 9, the Kullback-Leibler divergence and variation of the th observation are given by
where are the eigenvalues of . For with small and the cardinality of , we see that on ,
(21)
Since every satisfies for small , observe that
where the first inequality follows by the relation as and the second inequality holds by (i) of Lemma 10 in Appendix. Since by (21), it follows using (5) that for some constants ,
Combining this with (21), we conclude that on , which implies that is small for all sufficiently large , by (i) of Lemma 10 and the inequality as . Hence, can be expanded in the powers of to get for every and . Furthermore, since is sufficiently small, we obtain that by (i) of Lemma 10, and that by the restriction on the eigenvalues of . Combining these results, it follows that on , both and are bounded above by a constant multiple of . Hence, can be chosen sufficiently large such that
(22)
by the inequality . The logarithm of the second term on the rightmost side is bounded below by a constant multiple of by (C2). To find the lower bound for the first term, we shall first work with the case , and then show that the same lower bound is obtained even when .
Now, assume that and let for to be chosen later. Then
(23)
by the inequality . Using the relation (6.2) of Castillo et al., [8] and the assumption on the prior in (4), the integral on the rightmost side satisfies
(24)
for , and thus the rightmost side of (23) is bounded below by
by the inequality . Choosing , the first term on the rightmost side of (22) satisfies
Note that and if , and thus the last display implies that there exists a constant such that
If , the first term of (22) is clearly bounded below by , so that the same lower bound for in the last display is also obtained since we have . Finally, the lemma follows from (20). ∎
Proof of Theorem 1.
For the set with any integer , we see that is equal to
Let be the event in (19). Since is nonnegative, by Fubini’s theorem and Lemma 1,
(25)
for some constant and sufficiently large . For a sufficiently large constant , choose the largest integer that is smaller than for . Replacing by in the last display, it is easy to see that the rightmost side goes to zero. The proof is complete since by Lemma 1. ∎
A.2 Proof of Theorems 2–3 and Corollary 1
The following lemma shows that a small piece of the alternative centered at any is locally testable with exponentially small errors, provided that the center is sufficiently separated from the truth with respect to the average Rényi divergence. Theorem 2 for posterior contraction relative to the average Rényi divergence will then be proved by showing that the number of those pieces is controlled by the target rate. We write for the density with , and and for the expectation and probability with , respectively.
Lemma 2.
Proof.
For given such that , consider the most powerful test given by the Neyman-Pearson lemma. It is then easy to see that
(27)
The first inequality of the lemma is a direct consequence of the first line of the preceding display. For the second inequality of the lemma, note that by the Cauchy-Schwarz inequality, we have
Thus, by the second line of (27), it suffices to show for every . Defining , observe that
on the set , where the second inequality is due to (C1). Since the leftmost side of the display is further bounded below by for every , we have that
(28)
Since and for every , (28) implies that is nonsingular for every , and hence on , it can be shown that can be written as
(29)
To bound this, note that is equal to
(30)
where the first inequality holds by (28), the second inequality holds by the inequality for small , and the last inequality holds by the inequality . Now, for every , observe that the exponent in (29) is bounded above by
since for large . Combined with (29) and (30), this completes the proof. ∎
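For reference, the classical fact behind the Neyman-Pearson construction used at the beginning of the proof can be sketched as follows; here $p_i$ and $q_i$ denote generic densities of the independent observations under the truth and under the alternative, and the notation is ours rather than that of the lemma. Writing $P = \prod_{i=1}^{n} p_i$ and $Q = \prod_{i=1}^{n} q_i$, the most powerful test $\phi = \mathbb{1}\{Q \ge P\}$ satisfies
\[
\mathbb{E}_{P}\,\phi + \mathbb{E}_{Q}(1-\phi)
= \int P \wedge Q
\;\le\; \int \sqrt{PQ}
= \prod_{i=1}^{n}\int \sqrt{p_i\, q_i}
= \exp\!\big(-n R_n\big),
\]
where $R_n = -n^{-1}\sum_{i=1}^{n}\log\int\sqrt{p_i q_i}$ is the average Rényi divergence of order $1/2$, so both error probabilities decay exponentially once $R_n$ is bounded away from zero at the target rate.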
Proof of Theorem 2.
Let and . Then for every ,
(31)
where the second term on the right hand side goes to zero by Theorem 1. Hence, it suffices to show that the first term goes to zero for chosen to be the threshold in the theorem. Now, let and define as in (26) with and . Then Lemma 2 implies that small pieces of the alternative densities can be tested with exponentially small errors as long as the center is -separated from the true parameter values relative to the average Rényi divergence. To complete the proof, we shall show that the minimal number of those small pieces that are needed to cover is controlled appropriately in terms of , and that the prior mass of and decreases fast enough to balance the denominator of the posterior distribution. (For more discussion on a construction of a test using metric entropies, see Section D.2 and Section D.3 of Ghosal and van der Vaart, [17].)
Note that for every and ,
by the inequality and the Cauchy-Schwarz inequality. Since and , it is easy to see that we have for
with the same used to define . Hence, is bounded above by
(32)
Note that for any small ,
and thus we obtain
Using the last display and the entropy condition (7), the right hand side of (32) is bounded above by a constant multiple of . Hence, by Lemma D.3 of Ghosal and van der Vaart, [17], for every , there exists a test such that for some , and for every such that . Note that under condition (3) on the prior distribution, we have since is bounded away from zero. Hence, for the event in (19) and some constant , the first term on the right hand side of (31) is bounded by
where the term converges to zero by Lemma 1. Choosing for a sufficiently large , we have
Furthermore, goes to zero by condition (8). Now, to show that goes to zero exponentially fast, observe that
by the inequality for every . Since the tail probability of the Laplace distribution is given by for every , the rightmost side of the last display is bounded above by a constant multiple of
Since by (4), the right hand side is bounded by for some , and thus goes to zero since . Finally, we conclude that the left hand side of (31) goes to zero with . ∎
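The Laplace tail bound invoked above is elementary; for completeness, a generic version is recorded below, where $\lambda > 0$ denotes the scale of the Laplace prior in (4) and the notation is ours.
\[
\theta \sim \mathrm{Laplace}(\lambda), \ \text{i.e., with density } \tfrac{\lambda}{2}e^{-\lambda|\theta|},
\qquad\Longrightarrow\qquad
\Pr\big(|\theta| > t\big) = \int_{t}^{\infty}\lambda e^{-\lambda u}\,du = e^{-\lambda t}, \quad t \ge 0.
\]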
Proof of Theorem 3.
By Theorem 2, we obtain the contraction rate of the posterior distribution with respect to the average Rényi divergence between and given by
Define
(33)
Then Theorem 2 implies that by the last display,
(34)
where the second inequality holds by the inequality . Note that by combining (i) and (ii) of Lemma 10 in Appendix, we obtain if the left hand side is small. Thus, using the same approach in the proof of Lemma 1, (34) is further bounded below by
(35)
for some constants . Since is bounded away from zero and is decreasing, (34) and (35) imply that . Now, it is easy to see that by (5),
which is bounded since . Hence, we see that for satisfying (C6), is bounded by a constant multiple of
The display implies that by Theorem 2 and (C6). Combining the results verifies the third and fourth assertions of the theorem. For the remainder, observe that for such that . Therefore by Theorem 1, the first and the second assertions readily follow from the definitions of and . ∎
A.3 Proof of Theorem 4
To prove Theorem 4, we first provide preliminary results. Some of these will also be used to prove Theorems 5–6.
Lemma 3.
Proof.
If , the left hand side in the probability operator is zero, and the assertion trivially holds. We thus only consider the case below.
By Markov’s inequality, it suffices to show that there exists a positive sequence such that
(37)
Let be the block-diagonal matrix formed by stacking , , and observe that
The left hand side of (37) is thus bounded by the sum of the following terms:
(38)
(39)
(40)
First, observe that (38) is bounded above by a constant multiple of
(41)
Using (i) of Lemma 10 and the inequality as , we obtain that for ,
(42)
provided that the rightmost side is sufficiently small. Because on , (42) holds. This implies that for all sufficiently large , the right hand side of (41) is bounded above by a constant multiple of
Next, (39) is equal to
By the triangle inequality, the display is bounded by a constant multiple of
(43)
Using the same approach used in (42), the second term is further bounded above by a constant multiple of
Therefore, by (C4) and (C6), (43) is bounded by for some . This is not more than the right hand side of (37) if .
Lemma 4.
Proof.
Let for the th column of . Then, by Lemma 2.2.2 of van der Vaart and Wellner, [29] applied with , the expectation in the lemma is equal to
(45)
where is the Orlicz norm for . For any , define the standard deviation pseudo-metric between and as
Using the tail bound for normal distributions and Lemma 2.2.1 of van der Vaart and Wellner, [29], we see that for every . We shall show that is a separable pseudo-metric space with for every . Then, under the true model , we see that is a separable Gaussian process for . Hence, by Corollary 2.2.5 of van der Vaart and Wellner, [29], for any fixed ,
(46)
where . It is clear that possesses a normal distribution with mean zero and variance .
Using Lemma 2.2.1 of van der Vaart and Wellner, [29] again, we see that
(47)
for every . Here the last inequality holds by using (42) and the fact that on , under (C1).
Next, to further bound the second term in (46), note that for every ,
which is further bounded below by
using (i) of Lemma 10. In the last display, we see that is bounded away from zero since
and hence every eigenvalue is bounded below and above by a multiple of its reciprocal, as . This implies that is further bounded below by a constant multiple of
By the definition of and the preceding displays, we thus obtain
(48)
for every . Hence, using that , we can bound the second term in (46) above by a constant multiple of
for some . This can be further bounded by replacing in the display by . Then, using (45), (46), and (47), and by the substitution for the last display, we bound (45) above by a constant multiple of
for some .
To complete the proof, it remains to show that is a separable pseudo-metric space with for every . By (48), we see that for every . This implies that is separable with since is separable with . ∎
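The maximal inequality used at the start of the proof (Lemma 2.2.2 of van der Vaart and Wellner, [29]) has, for the Orlicz function $\psi_2(x) = e^{x^2}-1$, the following generic form; the random variables $X_1, \dots, X_m$ and the universal constant $K$ below are placeholders in our own notation.
\[
\Big\| \max_{1\le i\le m}|X_i| \Big\|_{\psi_2}
\;\le\; K\,\psi_2^{-1}(m)\,\max_{1\le i\le m}\|X_i\|_{\psi_2}
\;=\; K\sqrt{\log(1+m)}\,\max_{1\le i\le m}\|X_i\|_{\psi_2}.
\]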
Lemma 5.
For any orthogonal projection ,
Proof.
Note first that has a normal distribution with mean zero and variance , and hence we have
by the tail probabilities of normal distributions. By choosing and using the inequality for every , we verify the assertion. ∎
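The normal tail bound used in the proof can be written generically as follows, with $Z \sim N(0, \sigma^2)$ in our notation:
\[
\Pr(Z > t) \;\le\; e^{-t^{2}/(2\sigma^{2})},
\qquad
\Pr(|Z| > t) \;\le\; 2e^{-t^{2}/(2\sigma^{2})}, \qquad t \ge 0.
\]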
Proof.
Let . Restricting the integral to this set, the left hand side of the inequality in (49) is bounded below by
(50)
The exponent is equal to
(51)
since on . We first consider the case . Observe that , where the first term is bounded by a constant multiple of with -probability tending to one, due to Lemma 5. By Lemma 4 applied with together with (C6), the expected value of the second term is bounded by for some . Hence, for any ,
Consequently, taking a sufficiently slowly increasing for the above, (51) is bounded below by a constant multiple of
with -probability tending to one. Note that and on by (C3), if . The last display is thus bounded below by for some , uniformly over . Consequently, with -probability tending to one, (50) is bounded below by
for some , where the inequality holds by (23) and (24) since by (C3). Since if , the display is further bounded below as in the assertion.
Proof of Theorem 4.
The idea of our proof is similar in part to that of Theorem 3.5 in Chae et al., [10]. We only need to verify the first and fourth assertions. The second and third assertions then follow from the definitions of and . Note also that we only need to consider the case , as the assertions follow from Theorems 1 and 3 if .
Let . Also define as but using a constant such that . Then, by Theorem 3, we have that
Let be the event that is an intersection of the events in (36), (49), and the event whose probability goes to zero by Lemma 5. Since , it suffices to show that
(52)
tends to zero. Observe that by Fubini’s theorem, the denominator of the ratio is equal to
By Lemma 6, the term in the braces on the right hand side is further bounded below by on the event . Note also that the numerator of the ratio in (52) is equal to
Combining the bounds, on the event , the ratio in (52) is bounded by
At the end of this proof, we will verify that
(53)
with -probability tending to one. Assuming that this is true for now and letting be the event satisfying (53), we see that (52) is bounded by
To show that this tends to zero, for in Lemma 3, define , , and such that . Below we will show that
Since by the moment generating function of normal distributions, we obtain that
If , the rightmost side goes to zero for any . If , it still goes to zero for that is much larger than .
Note also that by conditions (C3) and (C4), we have that for some and any ,
(54)
on the event . Hence by (36) and (54), for every ,
on the event . Therefore,
This tends to zero if is sufficiently large.
If , is the empty set as it implies . Hence it suffices to consider the case that below. By (36) and (54) again, there exists a constant such that for every ,
on the event , where the last inequality holds by choosing much larger than . Therefore,
which tends to zero for that is much larger than , if .
It only remains to show (53). Since the map is bijective for every fixed , for the set defined by with given , we see that
(55)
by the substitution in the integral. Writing the block diagonal matrix formed by stacking , , it can be seen that
Hence, we see that can be chosen sufficiently larger than such that for every as we have . Therefore, (55) is bounded by
by (C5), since . This verifies (53) and thus the proof is complete. ∎
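The moment generating function step in the middle of the proof is the usual Chernoff bound for Gaussian variables; a generic sketch, in our own notation with $Z \sim N(0, \sigma^2)$, reads
\[
\mathbb{E}\, e^{sZ} = e^{s^{2}\sigma^{2}/2} \ \ (s \in \mathbb{R}),
\qquad
\Pr(Z > t) \;\le\; \inf_{s > 0} e^{-st}\,\mathbb{E}\, e^{sZ} \;=\; e^{-t^{2}/(2\sigma^{2})}, \quad t \ge 0.
\]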
A.4 Proof of Theorems 5–6
To prove the shape approximation in Theorem 5 and the selection results in Theorem 6, we first obtain two lemmas. The first shows that the remainder of the approximation goes to zero in -probability, which is a stronger version of Lemma 3. The second implies that with a point-mass prior for at , we also obtain a rate that is not worse than that in Theorem 3.
Lemma 7.
Proof.
Similar to the proof of Lemma 3, it suffices to show the following three assertions:
(56)
(57)
(58)
First, note that the left side of (56) is bounded above by a constant multiple of
(59)
where the inequality holds by (42) and the fact that on . We see that (59) is bounded above by a constant multiple of
which goes to zero by (C10∗).
Next, similar to (43), the left side of (57) is bounded by
Using the same approach used in (42), the display is further bounded above by a constant multiple of
∎
Lemma 8.
Proof.
Since the prior for is the point mass at , we can reduce to a low-dimensional model , . Then the lemma can be easily verified using the main results on posterior contraction in Section 3. The denominator of the posterior distribution with the Dirac prior at is bounded as in Lemma 1, which can be shown using (20) for the prior concentration condition (C2) and the expressions for the Kullback-Leibler divergence and variation with the true value . For a local test relative to the average Rényi divergence, Lemma 2 applied with , modified so that it involves only a given such that , implies that a small piece of the alternative is tested with exponentially small errors. Hence, by (C5∗), we obtain the contraction rate relative to for , as in the proof of Theorem 2. The lemma is then obtained by recovering the contraction rate of with respect to using the approach in the proof of Theorem 3. ∎
Proof of Theorem 5.
Our proof is based on the proof of Theorem 6 in Castillo et al., [8], but is more involved due to . We use the fact that for any probability measure and its renormalized restriction to a set , we have . First, using a sufficiently large constant that is smaller than , define as in (12) such that . Let be the prior distribution restricted and renormalized on and be the corresponding posterior distribution. Also, is the restricted and renormalized version of to the set . Then the left hand side of the theorem is bounded above by
(60)
where the first summand goes to zero in -probability since in -probability by Theorem 1 and Theorem 3.
To show that the second summand goes to zero in -probability, note that for every measurable , we obtain
where . In the last line, the factor cancels out in the normalizing constant, but is inserted for the sake of comparison. For any sequences of measures and , if is absolutely continuous with respect to with the Radon-Nikodym derivative , then it can be easily verified that
Hence, for , we see that the second summand of (60) is bounded by
Using the fact that on and that goes to zero in -probability by Lemma 7, the last display is further bounded by
(61)
Now, note that the map is bijective for every fixed . Thus for the set defined by with given , we see that
(62)
by the substitution in the integral. Similar to the proof of Theorem 4, observe that
Hence, we see that can be chosen sufficiently large such that for every as we have . Therefore, since , one can see that (62) is written as
by (C9∗), and hence (61) is equal to
(63)
Now, observe that we also have the inequality in the other direction: . This means that can be chosen sufficiently large such that for every . Hence, with appropriately chosen constants, we obtain
The rightmost term goes to one with probability tending to one by Lemma 8. This implies that (63) goes to zero in -probability, completing the proof for the second part of (60).
Next, we show that goes to one in -probability to verify that the last summand in (60) goes to zero in -probability. Observe that is equal to
(64)
Clearly, the denominator is bounded below by
(65)
Since the measure defined by is symmetric about , the mean of with respect to the normalized probability measure is zero. Note also that is nonsingular for every such that by (C8∗). Thus, by Jensen’s inequality, (65) is bounded below by
Applying the arithmetic-geometric mean inequality to the eigenvalues, we obtain , and hence by (4). Furthermore, we have by (3) and . Hence, the preceding display is further bounded below by a constant multiple of
(66)
To bound the numerator of (64), let and . Then it suffices to show that (64) goes to zero in -probability on the set as by Lemma 5. Note that on the set we have
Using that for every with by (C8∗), the preceding display is, for some constant , further bounded above by
by the Cauchy-Schwarz inequality. We have on the support of the measure . Hence, on the event , the numerator of (64) is bounded above by
since . Note that we have
by (3) and that in the denominators is bounded away from zero by the assumption. Thus, the last display combined with (66) shows that (64) goes to zero on the event , provided that is chosen sufficiently large.
Finally we conclude that (60) goes to zero in -probability. Since the total variation metric is bounded by 2, the convergence in mean holds as in the assertion. ∎
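The arithmetic-geometric mean step applied to the eigenvalues in the lower bound for (65) is the following elementary inequality, stated here for a generic positive definite $p \times p$ matrix $A$ with eigenvalues $\lambda_1(A), \dots, \lambda_p(A)$ (our notation):
\[
\det(A) \;=\; \prod_{i=1}^{p}\lambda_i(A)
\;\le\; \Big(\frac{1}{p}\sum_{i=1}^{p}\lambda_i(A)\Big)^{p}
\;=\; \Big(\frac{\operatorname{tr}(A)}{p}\Big)^{p}.
\]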
Proof of Theorem 6.
Our proof follows the proof of Theorem 4 in Castillo et al., [8]. Since tends to zero by Theorem 5, it suffices to show that for . For the orthogonal projection defined by with , we see that is bounded by
by (13), since for every due to on . Note that for , because is a principal submatrix of . Hence, is equal to
(67)
for some . The last inequality holds since by (C8∗), there exists a constant such that for every with , and hence we have that by the definition of ,
Now, we shall show that for any fixed ,
(68)
Note that has a chi-squared distribution with degrees of freedom . Therefore, by Lemma 5 of Castillo et al., [8], there exists a constant such that for every and given ,
where is the cardinality of the set . Since , for the event in the relation (68), it follows that
This goes to zero as , since for ,
and . To complete the proof, it remains to show that goes to zero on the set . Combining (67) and (68), we see that is bounded by
which holds by the inequalities and . Note that for , we have that for some . Hence, the preceding display goes to zero provided that since . This condition can be translated to by choosing arbitrarily close to 2. ∎
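Chi-squared tail bounds of the following Laurent-Massart type underlie the step referring to Lemma 5 of Castillo et al., [8]; we record a generic version for a $\chi^2_d$ variable, which is not necessarily the exact form of that lemma.
\[
\Pr\big(\chi^{2}_{d} \ge d + 2\sqrt{dx} + 2x\big) \;\le\; e^{-x}, \qquad x \ge 0,
\]
so that, in particular, $\Pr(\chi^{2}_{d} \ge 5x) \le e^{-x}$ for every $x \ge d$.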
Appendix B Proofs for the applications
B.1 Proof of Theorem 7
We first verify the conditions for Theorem 3 to prove assertions (a) and (b).
-
•
Verification of (C1): Let be the th element of . Observe that is equal to
(69)
Hence, we see that has the same role as . We also have as the true belongs to the support of the prior.
- •
- •
- •
-
•
Verification of (C5∗): For a sufficiently large and , choose . Since is a principal submatrix of , we have for every and . Hence the minimum eigenvalue condition (6) is satisfied with . Also, the entropy relative to is given by
The entropy condition in (7) is thus satisfied if we choose . To verify the sieve condition (8), note that for some positive constants , , , and , an inverse Wishart distribution satisfies
(71)
see, for example, Lemma 9.16 of Ghosal and van der Vaart, [17]. The sieve condition (8) is met provided that is chosen sufficiently large. Note that the condition is satisfied by the assumption .
-
•
Verification of (C6): The separability condition is trivially satisfied in this example as there is no nuisance mean part.
Therefore, the contraction properties in Theorem 3 are obtained with , but is replaced by since and . The contraction rate for with respect to the Frobenius norm follows from (69). The optimal posterior contraction directly follows from Corollary 1. Assertions (a) and (b) are thus proved.
Next, we verify conditions (C8∗)–(C10∗) and (C7) to apply Theorems 5–6 and Corollaries 2–3.
- •
- •
- •
Hence, under (C7∗), Theorem 5 can be applied to obtain the distributional approximation in (15) with the zero matrix . Under (C7∗) and (C12), Theorem 6 implies the no-superset result in (16). If the beta-min condition (C13) is also met, the strong results in Corollary 2 and Corollary 3 hold. These establish (c)–(e).
B.2 Proof of Theorem 8
We first verify the conditions for Theorem 3 for (a) and (b).
- •
- •
- •
- •
-
•
Verification of (C5∗): For a sufficiently large and , choose a sieve as
Then we have for large , and hence the minimum eigenvalue condition (6) is directly met with by the definition of the sieve. To see the entropy condition, observe from (72) that for every ,
Therefore, for , the entropy relative to is bounded above by
each summand of which is bounded by a multiple of . This shows that the choice satisfies the entropy condition in (7). Further, it is easy to see that condition (8) holds using the tail bounds for normal and inverse Wishart distributions as in (71).
- •
Therefore we obtain the contraction properties of the posterior distribution as in (9) with replaced by as and . The rates for with respect to more concrete metrics than can now be obtained. Note that for small , directly implies and by the definition of . For , observe that
Since is bounded as , the preceding display implies . Moreover, we have
and
These show that as and are bounded. We finally conclude that contracts at the same rate of . The optimal posterior contraction is directly obtained by Corollary 1. Thus assertions (a) and (b) hold.
Next, we verify conditions (C8∗)–(C10∗) and (C7) to apply Theorems 5–6 and Corollaries 2–3. The orthogonal projection defined by with is used to check the conditions.
- •
-
•
Verification of (C9∗): Choose a map for . To check (C9∗), we shall verify that this map induces as follows. Note that for matrices , , we have the properties of the Kronecker product that and if the matrices allow such operations. Using these properties, we see that satisfies
Hence,
which implies that the shift only for as in the given map provides . Without loss of generality, we assume that the standard normal prior is used for . Now, observe that
since the priors for the other parameters cancel out due to invariance. One can note that
and
Thus, condition (C9∗) is satisfied.
-
•
Verification of (C10∗): Note again that for every . The inequality also holds in the other direction for every , by the same argument used for the recovery in the proof of Theorem 8, (a)–(b). Hence, for some constants , the entropy in (C10∗) is bounded above by
Since all nuisance parameters are of fixed dimensions, the last display is bounded by a multiple of for every , so that (C10∗) is bounded by by Remark 6. Since in this case, the condition is verified.
- •
Therefore, under (C7∗), Theorem 5 implies that the distributional approximation in (15) holds. Under (C7∗) and (C12), we obtain the no-superset result in (16). The remaining assertions in the theorem are direct consequences of Corollary 2 and Corollary 3 if the beta-min condition (C13) is also satisfied. These prove (c)–(e).
We complete the proof by showing that the covariance matrix of the nonzero part can be written as in the theorem. For given , we obtain
where is the first column of . Note that , where is the top-left element of , which is equal to by direct calculations. For the mean , observe that
where is the submatrix of consisting of columns except for the first column. Since , where is the first row of with the top-left element excluded, the last display is equal to
As we have by direct calculations, it follows that
This completes the proof.
B.3 Proof of Theorem 9
We shall verify the conditions for the posterior contraction in Theorem 3 to prove (a)–(b). First we give the bounds for the eigenvalues of each correlation matrix. It can be shown that
(73)
(74)
(75)
The first assertion in (73) follows directly from the identity for every . For (74), see Theorem 2.1 and Theorem 3.5 of Fikioris, [12]. The assertion in (75) is due to Theorem 2.2 of Kulkarni et al., [21].
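For instance, in the compound-symmetry case the bounds in (73) can be read off from the exact spectrum. Writing the $p \times p$ correlation matrix as $(1-\rho)I_p + \rho\,\mathbf{1}_p\mathbf{1}_p^{\top}$ with $\rho \in (0,1)$ (our parameterization), its eigenvalues are
\[
\lambda_{1} = 1 + (p-1)\rho, \qquad \lambda_{2} = \cdots = \lambda_{p} = 1 - \rho,
\]
so the smallest eigenvalue is bounded below by $1-\rho$ while the largest grows linearly in $p$.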
-
•
Verification of (C1): For the autoregressive correlation matrix, note that
Using , we have that
and hence
This gives us for the autoregressive matrices. Similarly, we can also show that satisfies (C1) for the compound-symmetric and the moving average correlation matrices. Also, we have for (C1) as the true parameter values and are in the support of the prior.
- •
- •
- •
-
•
Verification of (C5∗): For a sufficiently large and , choose a sieve . Then using (73)–(75), it is easy to see that the minimum eigenvalue of each correlation matrix is bounded below by a polynomial in , which implies that condition (6) is satisfied with . For the entropy calculation, note that for every type of correlation matrix,
(76)
From the identity for every integer , we have that for every . By this inequality we obtain for every correlation matrix. Then, the last display is bounded by a multiple of for every . The entropy in (7) is thus bounded by
for with some constant . It can be easily checked that each term in the last display is bounded by a multiple of , by which the entropy condition in (7) is satisfied with . Using the tail bounds of inverse gamma distributions and properties of the density near the boundaries, condition (8) is satisfied as long as is chosen sufficiently large.
-
•
Verification of (C6): The separation condition is trivially satisfied as there is no nuisance mean part.
Therefore, we obtain the posterior contraction properties of with by Theorem 3. The term can be replaced by since and . Since we have by the diagonal entries of each matrix, the contraction rate is obtained for with respect to the -norm, for every correlation matrix, as . In particular, for the compound-symmetric correlation matrix, this rate is reduced to since is bounded in that case. We also have for every correlation matrix, as there are more than entries that are equal to . Hence, by the relation , the same rate is also obtained for relative to the -norm. The optimal posterior contraction directly follows from Corollary 1. Thus assertions (a)–(b) hold.
Next, we verify conditions (C8∗)–(C10∗) and (C7) to apply Theorems 5–6 and Corollaries 2–3.
- •
- •
- •
Therefore, under (C7∗), the distributional approximation in (15) holds with the zero matrix by Theorem 5. Under (C7∗) and (C12), Theorem 6 implies that the no-superset result in (16) holds. The strong results in Corollary 2 and Corollary 3 follow explicitly from the beta-min condition (C13). These prove (c)–(e).
B.4 Proof of Theorem 10
We verify the conditions for the posterior contraction in Theorem 3 to show (a)–(b).
-
•
Verification of (C1): Using the assumption , note that
(77)
where the last inequality holds since and . Thus we have and .
-
•
Verification of (C2): The condition is satisfied with as is fixed dimensional and we have .
- •
- •
- •
-
•
Verification of (C6): The separation condition is trivially satisfied as there is no nuisance mean part.
Therefore, the posterior contraction rates for are given by Theorem 3 with replaced by since and . The contraction rate for relative to the Frobenius norm is a direct consequence of (77). The optimal posterior contraction easily follows from Corollary 1. Thus assertions (a)–(b) hold.
Now, we verify conditions (C8∗)–(C10∗) and (C7) to apply Theorems 5–6 and Corollaries 2–3.
- •
- •
-
•
Verification of (C7): It is easy to see that since . The separability of the space is thus trivial.
Hence, under (C7∗), Theorem 5 can be applied to obtain the distributional approximation in (15) with the zero matrix . Under (C7∗) and (C12), we obtain the no-superset result in (16) by Theorem 6. The strong results in Corollary 2 and Corollary 3 follow explicitly from the beta-min condition (C13). These establish (c)–(e).
B.5 Proof of Theorem 11
We verify the conditions for the posterior contraction in Theorem 3.
- •
-
•
Verification of (C2): Using (i) of Lemma 10 and the relation as , observe that if the right hand side is small enough. Thus, there exists a constant such that . Furthermore, although the components of are not a priori independent as the prior is truncated to , the truncation can only increase prior concentration since for some . Hence, for some ,
which justifies the choice for (C2).
- •
-
•
Verification of (C4): This is trivially met as for some .
-
•
Verification of (C5∗): Note that the minimum eigenvalue condition (6) is trivially satisfied with since the prior is put on . Now, for with and sufficiently large , choose a sieve as , that is, the maximum number of edges of does not exceed . Then, for , the entropy in (7) is bounded by
where in the second term, the factor comes from the diagonal elements of , while the rest is from the off-diagonal entries. It is easy to see that the last display is bounded by a multiple of with chosen , and hence the entropy condition in (7) is satisfied. Lastly, note that for some ,
Therefore, condition (8) is satisfied with sufficiently large .
-
•
Verification of (C6): The separation condition is trivially met as there is no nuisance mean part.
Therefore, we obtain the posterior contraction properties for by Theorem 3. The theorem also implies that the posterior distribution of contracts to at the rate with respect to the Frobenius norm. This also translates into convergence of to at the same rate, since we obtain
(80)
by (i) of Lemma 10 and the inequality as . The assertion for the optimal posterior contraction is directly justified by Corollary 1. These prove (a)–(b).
Next, we verify conditions (C4)–(C7) to obtain the optimal posterior contraction by applying Theorem 4.
- •
-
•
Verification of (C6): Note that by (80), there exists a constant such that the entropy in (C6) is bounded by for every . Using (81), the entropy is further bounded by for some . This is clearly bounded by a multiple of , and hence using Remark 6 we bound (C6) by a multiple of which goes to zero by assumption.
- •
Now, we verify conditions (C8∗)–(C10∗) to apply Theorems 5–6 and Corollaries 2–3.
- •
- •
Therefore, under (C7∗), we obtain the distributional approximation in (15) with the zero matrix by Theorem 5. Under (C7∗) and (C12), the no-superset result in (16) holds by Theorem 6. Lastly, we obtain the strong results in Corollary 2 and Corollary 3 if the beta-min condition (C13) is also met. These prove (d)–(f).
B.6 Proof of Theorem 12
To verify the conditions for Theorem 3, we will use the following properties of B-splines.
For any , there exists with such that
(82)
by the well-known approximation theory of B-splines [11, page 170]. Writing , this gives
(83)
We also use the following inequalities: for every ,
(84)
See Lemma E.6 of Ghosal and van der Vaart, [17] for proofs with respect to - and -norms. Hence the first relation can be formally justified. For the second relation with respect to the empirical -norm, we assume that are sufficiently regularly distributed as in (7.12) of Ghosal and van der Vaart, [16].
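For concreteness, the approximation property in (82) is of the following classical form (De Boor, [11]): if $f$ has Hölder smoothness $\alpha$ on $[0,1]$ and $B = (B_1, \dots, B_J)^{\top}$ is a B-spline basis of order $q \ge \alpha$, then there exists $\theta \in \mathbb{R}^{J}$ such that
\[
\big\| f - \theta^{\top} B \big\|_{\infty} \;\le\; C\, J^{-\alpha}\, \| f \|_{\mathcal{C}^{\alpha}},
\]
for a constant $C$ depending only on $\alpha$ and $q$; the symbols $f$, $\alpha$, $J$, and $B$ here are generic and need not match the notation used in the theorem.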
-
•
Verification of (C1): If is strictly positive on , then satisfies the same approximation rule in (82) for some with (see Lemma E.5 of Ghosal and van der Vaart, [17]). Therefore the approximation in (83) also holds for even if is restricted to have positive entries only, and thus by (82) and (84),
which tells us that we have and for (C1).
- •
- •
- •
-
•
Verification of (C5∗): For a sufficiently large , choose a sieve as . Then the minimum eigenvalue condition (6) is satisfied with because for every ,
where and denote the th components of and , respectively. To check the entropy condition in (7), note that for every , we have by (84). Hence, for some , the entropy in (7) is bounded above by a multiple of
The condition (8) holds since an inverse Gaussian prior on each produces for some constant , by its exponentially small tail probabilities on both sides. By matching and , we obtain and . Note that the conditions and hold only if .
-
•
Verification of (C6): The separation condition holds as there is no additional mean part.
Hence, we obtain the posterior contraction rates for by Theorem 3. The contraction rate for is also obtained by the same theorem. The assertion for the optimal posterior contraction is directly justified by Corollary 1. Hence we have verified (a)–(b).
Now, we verify (C4)–(C7) for the optimal posterior contraction in Theorem 4.
- •
- •
- •
Therefore, since (C3) is satisfied by the assumption, assertion (c) holds by Theorem 4.
Next, we verify conditions (C8∗)–(C10∗) to apply Theorems 5–6 and Corollaries 2–3.
- •
- •
Under (C7∗), the distributional approximation in (15) holds with the zero matrix by Theorem 5. Under (C7∗) and (C12), the no-superset result in (16) holds by Theorem 6. We also obtain the strong results in Corollary 2 and Corollary 3 if the beta-min condition (C13) is also met. These prove (d)–(f).
B.7 Proof of Theorem 13
We verify the conditions for the posterior contraction in Theorem 3.
-
•
Verification of (C1): Since for every and belongs to the support of the prior, we have and .
-
•
Verification of (C2): Note that we write . To verify the prior concentration condition, observe that
where the second term on the right hand side is trivially bounded below by a constant multiple of . Using (82)–(84), it is easy to see that if ,
for some . Since , this implies that (C2) is satisfied with .
-
•
Verification of (C3): The assumption given in the theorem directly satisfies the condition.
-
•
Verification of (C4): This is directly satisfied by .
-
•
Verification of (C5∗): For a sufficiently large constant and , choose , from which the minimum eigenvalue condition (6) is directly satisfied with . To check the entropy condition in (7), note that for every , we have by (84). Hence, for some , the entropy in (7) is bounded above by a multiple of
The display is further bounded by a multiple of , and hence (7) is satisfied with . Using the tail bounds of normal and inverse gamma distributions, condition (8) is also satisfied.
- •
Therefore, the contraction rates for are given by Theorem 3. The rate for is also obtained by the same theorem. The assertion for the optimal posterior contraction is directly justified by Corollary 1. We thus see (a)–(b) hold.
Now, we verify (C4)–(C7) for Theorem 4.
-
•
Verification of (C4): Observe that the left hand side of the first line of (C4) is equal to
where is the least squares solution. Since is the solution minimizing , for some , the last display is bounded above by
(85)
by (82), where is replaced by as is unknown. Plugging in , it is easy to see that the right hand side of (85) is of the same order as . This tends to zero by the given boundedness assumption. The necessary condition is implied by this, because . The second condition of (C4) is satisfied by Remark 5.
-
•
Verification of (C5): Let for a given , where . This setting satisfies . Since each entry of has the standard normal prior, is a zero mean Gaussian process with the covariance kernel , and thus its reproducing kernel Hilbert space (RKHS) is the set of all functions of the form with coefficients , . It is easy to see that the shift is in the RKHS since it is expressed as using an invertible matrix with rows evaluated by some , . Hence, by the Cameron-Martin theorem, for and the RKHS norm, we see that
almost surely. This gives that
(86)
Note that we have
and
- •
-
•
Verification of (C7): Since we have for every and the parameter space of is Euclidean, the condition is trivially satisfied.
Therefore, assertion (c) holds by Theorem 4 since (C3) is also satisfied by the given assumption.
Lastly, we verify conditions (C8∗)–(C10∗) to apply Theorems 5–6 and Corollaries 2–3.
- •
- •
- •
Therefore, under (C7∗), we have the distributional approximation in (15) by Theorem 5. Under (C7∗) and (C12), Theorem 6 implies that the no-superset result in (16) holds. The stronger assertions in (17) and (18) are explicitly derived from Corollary 2 and Corollary 3 if the beta-min condition (C13) is also met.
Appendix C Auxiliary results
Here we provide some auxiliary results used to prove the main results.
Lemma 9.
Let be the density of for . Then,
Proof.
Let for and . Then by direct calculations, we have
which verifies the first assertion because . After some algebra, we also obtain
The rightmost side involves forms of and for two positive definite matrices and . It is easy to see that the former is zero, while it can be shown that the latter equals ; for example, see Lemma 6.2 of Magnus, [22]. Plugging this in for the expected values of the products of quadratic forms, it is easy (but tedious) to verify the second assertion. ∎
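For reference, with $\mu_1, \mu_2$ and $\Sigma_1, \Sigma_2$ generic mean vectors and positive definite covariance matrices of dimension $d$ (our notation, not necessarily that of the lemma), the Kullback-Leibler divergence computed above takes the familiar closed form
\[
K\big(N(\mu_1, \Sigma_1),\, N(\mu_2, \Sigma_2)\big)
= \frac{1}{2}\Big\{ \operatorname{tr}\big(\Sigma_2^{-1}\Sigma_1\big) - d
+ (\mu_2 - \mu_1)^{\top}\Sigma_2^{-1}(\mu_2 - \mu_1)
+ \log\frac{\det \Sigma_2}{\det \Sigma_1} \Big\}.
\]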
Lemma 10.
For positive definite matrices and , let be the eigenvalues of . Then the following assertions hold:
- (i)
,
- (ii)
can be made arbitrarily small if is chosen sufficiently small, where is defined in (33).
Proof.
Let . Since the eigenvalues of are , we can see that is equal to
Conversely, using the sub-multiplicative property of the Frobenius norm, , it can be seen that is equal to
These verify (i). Now, note that by direct calculations,
Hence, for a sufficiently small implies that
Since every term in the product of the last display is greater than or equal to 1, we have for every . As a function of , has the global minimum at , and hence can be chosen sufficiently small to make small for every , which establishes (ii). ∎
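The elementary fact invoked at the end of the proof can be sketched as follows, writing $f(\lambda) = \lambda - 1 - \log\lambda$ for $\lambda > 0$ (our notation):
\[
f'(\lambda) = 1 - \lambda^{-1}, \qquad f''(\lambda) = \lambda^{-2} > 0,
\]
so $f$ is strictly convex with its unique global minimum $f(1) = 0$; hence $\lambda - 1 - \log\lambda \ge 0$ for every $\lambda > 0$, with the second-order expansion $f(\lambda) = \tfrac{1}{2}(\lambda-1)^2 + o\big((\lambda-1)^2\big)$ near $\lambda = 1$.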
References
- Atchadé, [2017] Atchadé, Y. A. (2017). On the contraction properties of some high-dimensional quasi-posterior distributions. The Annals of Statistics, 45(5):2248–2273.
- Bai et al., [2020] Bai, R., Moran, G. E., Antonelli, J., Chen, Y., and Boland, M. R. (2020). Spike-and-slab group lassos for grouped regression and sparse generalized additive models. Journal of the American Statistical Association, to appear.
- Belitser and Ghosal, [2020] Belitser, E. and Ghosal, S. (2020). Empirical Bayes oracle uncertainty quantification for regression. The Annals of Statistics, 48(6):3113–3137.
- Bickel and Kleijn, [2012] Bickel, P. J. and Kleijn, B. J. (2012). The semiparametric Bernstein–von Mises theorem. The Annals of Statistics, 40(1):206–237.
- Bondell and Reich, [2012] Bondell, H. D. and Reich, B. J. (2012). Consistent high-dimensional Bayesian variable selection via penalized credible regions. Journal of the American Statistical Association, 107(500):1610–1624.
- Carroll et al., [2006] Carroll, R. J., Ruppert, D., Crainiceanu, C. M., and Stefanski, L. A. (2006). Measurement Error in Nonlinear Models: A Modern Perspective. Chapman and Hall/CRC.
- Castillo, [2012] Castillo, I. (2012). A semiparametric Bernstein–von Mises theorem for Gaussian process priors. Probability Theory and Related Fields, 152(1-2):53–99.
- Castillo et al., [2015] Castillo, I., Schmidt-Hieber, J., and van der Vaart, A. (2015). Bayesian linear regression with sparse priors. The Annals of Statistics, 43(5):1986–2018.
- Castillo and van der Vaart, [2012] Castillo, I. and van der Vaart, A. (2012). Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. The Annals of Statistics, 40(4):2069–2101.
- Chae et al., [2019] Chae, M., Lin, L., and Dunson, D. B. (2019). Bayesian sparse linear regression with unknown symmetric error. Information and Inference: A Journal of the IMA, 8(3):621–653.
- De Boor, [1978] De Boor, C. (1978). A Practical Guide to Splines. New York: Springer.
- Fikioris, [2018] Fikioris, G. (2018). Spectral properties of Kac–Murdock–Szegö matrices with a complex parameter. Linear Algebra and its Applications, 553:182–210.
- Fuller, [1987] Fuller, W. A. (1987). Measurement Error Models. John Wiley & Sons.
- Gao et al., [2020] Gao, C., van der Vaart, A. W., and Zhou, H. H. (2020). A general framework for Bayes structured linear models. The Annals of Statistics, 48(5):2848–2878.
- Ghosal et al., [2000] Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. The Annals of Statistics, 28(2):500–531.
- Ghosal and van der Vaart, [2007] Ghosal, S. and van der Vaart, A. (2007). Convergence rates of posterior distributions for noniid observations. The Annals of Statistics, 35(1):192–223.
- Ghosal and van der Vaart, [2017] Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press.
- Jeong, [2020] Jeong, S. (2020). Posterior contraction in group sparse logit models for categorical responses. arXiv preprint arXiv:2010.03513.
- Jeong and Ghosal, [2020] Jeong, S. and Ghosal, S. (2020). Posterior contraction in sparse generalized linear models. Biometrika, to appear.
- Johnson and Rossell, [2012] Johnson, V. E. and Rossell, D. (2012). Bayesian model selection in high-dimensional settings. Journal of the American Statistical Association, 107(498):649–660.
- Kulkarni et al., [1999] Kulkarni, D., Schmidt, D., and Tsui, S.-K. (1999). Eigenvalues of tridiagonal pseudo-Toeplitz matrices. Linear Algebra and its Applications, 297:63–80.
- Magnus, [1978] Magnus, J. R. (1978). The moments of products of quadratic forms in normal variables. Statistica Neerlandica, 32(4):201–210.
- Martin et al., [2017] Martin, R., Mess, R., and Walker, S. G. (2017). Empirical Bayes posterior concentration in sparse high-dimensional linear models. Bernoulli, 23(3):1822–1847.
- Narisetty and He, [2014] Narisetty, N. N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. The Annals of Statistics, 42(2):789–817.
- Ning et al., [2020] Ning, B., Jeong, S., and Ghosal, S. (2020). Bayesian linear regression for multivariate responses under group sparsity. Bernoulli, 26(3):2353–2382.
- Ročková, [2018] Ročková, V. (2018). Bayesian estimation of sparse signals with a continuous spike-and-slab prior. The Annals of Statistics, 46(1):401–437.
- Rothman et al., [2008] Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515.
- Song and Liang, [2017] Song, Q. and Liang, F. (2017). Nearly optimal Bayesian shrinkage for high dimensional regression. arXiv preprint arXiv:1712.08964.
- van der Vaart and Wellner, [1996] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer.