Double shrinkage priors for a normal mean matrix
Abstract
We consider estimation of a normal mean matrix under the Frobenius loss. Motivated by the Efron–Morris estimator, a generalization of Stein's prior has recently been developed that is superharmonic and shrinks the singular values towards zero. The generalized Bayes estimator with respect to this prior is minimax and dominates the maximum likelihood estimator. However, we show here by using Brown's condition that it is inadmissible. We then develop two types of priors that provide improved generalized Bayes estimators and examine their performance numerically. The proposed priors attain risk reduction by adding scalar shrinkage or column-wise shrinkage to singular value shrinkage. Parallel results for Bayesian predictive densities are also given.
1 Introduction
Suppose that we have independent matrix observations whose entries are independent normal random variables , where is an unknown mean matrix. In the notation of Gupta and Nagar (2000), this is expressed as for , where denotes the -dimensional identity matrix. We consider estimation of under the Frobenius loss
By sufficiency reduction, it suffices to consider the average in estimation of . We assume in the following. Note that vectorization reduces this problem to estimation of a normal mean vector from under the quadratic loss, which has been well studied in shrinkage estimation theory (Fourdrinier et al., 2018).
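To make this reduction explicit, the following identity (written with generic symbols $X, M \in \mathbb{R}^{p \times q}$, which may differ from the notation used above) shows that the Frobenius loss is simply the quadratic loss after vectorization:
\[
\| X - M \|_{\mathrm{F}}^2 \;=\; \sum_{i=1}^{p} \sum_{j=1}^{q} (X_{ij} - M_{ij})^2 \;=\; \| \operatorname{vec}(X) - \operatorname{vec}(M) \|^2 .
\]
Hence estimating $M$ from $X$ under the Frobenius loss is equivalent to estimating the $pq$-dimensional vector $\operatorname{vec}(M)$ from $\operatorname{vec}(X)$ under the quadratic loss.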
Efron and Morris (1972) proposed an empirical Bayes estimator:
(1)
This estimator can be viewed as a generalization of the James–Stein estimator () for a normal mean vector. Efron and Morris (1972) showed that is minimax and dominates the maximum likelihood estimator under the Frobenius loss. Let , , , be the singular value decomposition of , where and are the singular values of . Stein (1974) pointed out that does not change the singular vectors but shrinks the singular values of towards zero:
where
See Tsukuma and Kubokawa (2020) and Yuasa and Kubokawa (2023a, b) for recent developments related to the Efron–Morris estimator.
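To make the singular value form concrete, here is a minimal NumPy sketch of Efron–Morris-type singular value shrinkage. It assumes a p x q observation matrix with p - q - 1 > 0 and the shrinkage factor 1 - (p - q - 1)/sigma_i^2 on each singular value; these conventions are assumptions to be checked against (1), not a verbatim transcription of it.

import numpy as np

def efron_morris_type(X):
    """Singular value shrinkage in the spirit of the Efron-Morris estimator (1).
    Assumes X is p x q with p - q - 1 > 0; the factor 1 - (p - q - 1) / sigma_i^2
    is an assumed convention rather than a transcription of (1)."""
    p, q = X.shape
    c = p - q - 1
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = s - c / s            # sigma_i * (1 - c / sigma_i^2)
    return (U * s_shrunk) @ Vt      # same singular vectors, shrunken singular values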
As a Bayesian counterpart of , Matsuda and Komaki (2015) proposed a singular value shrinkage prior
(2)
and showed that the generalized Bayes estimator with respect to dominates the maximum likelihood estimator under the Frobenius loss. This prior can be viewed as a generalization of Stein’s prior for a normal mean vector () by Stein (1974). Similarly to in (1) , shrinks the singular values towards zero. Thus, it works well when the true matrix is close to low-rank. See Matsuda and Strawderman (2022) and Matsuda (2023) for details on the risk behavior of and .
In this paper, we show that the generalized Bayes estimator with respect to the singular value shrinkage prior in (2) is inadmissible under the Frobenius loss. Then, we develop two types of priors that provide improved generalized Bayes estimators asymptotically. The first type adds scalar shrinkage while the second type adds column-wise shrinkage. We conduct numerical experiments and confirm the effectiveness of the proposed priors in finite samples. We also provide parallel results for Bayesian prediction as well as a similar improvement of the blockwise Stein prior, which was conjectured by Brown and Zhao (2009).
This paper is organized as follows. In Section 2, we prove the inadmissibility of the generalized Bayes estimator with respect to the singular value shrinkage prior in (2). In Sections 3 and 4, we provide two types of priors that asymptotically dominate the singular value shrinkage prior in (2) by adding scalar or column-wise shrinkage, respectively. Numerical results are also given. In Section 5, we provide parallel results for Bayesian prediction. Technical details and similar results for the blockwise Stein prior are given in the Appendix.
2 Inadmissibility of the singular value shrinkage prior
Here, we show that the generalized Bayes estimator with respect to the singular value shrinkage prior in (2) is inadmissible under the Frobenius loss. Since does not affect admissibility results, we fix for convenience in this section.
For estimation of a normal mean vector under the quadratic loss, Brown (1971) derived the following sufficient condition for inadmissibility of generalized Bayes estimators.
Lemma 2.1.
(Brown, 1971) In estimation of from under the quadratic loss, the generalized Bayes estimator of with respect to a prior is inadmissible if
for some , where
and is the uniform measure on the sphere of radius in .
After vectorization, estimation of a normal mean matrix from under the Frobenius loss reduces to estimation of a normal mean vector from under the quadratic loss. Then, by using Brown’s condition in Lemma 2.1, we obtain the following.
Theorem 2.1.
When , the generalized Bayes estimator with respect to in (2) is inadmissible under the Frobenius loss.
Proof.
From and the AM-GM inequality
we have
where . Therefore,
(3)
where and we used the triangle inequality. As ,
which yields
(4)
Now, we apply Lemma 2.1 by noting that estimation of a normal mean matrix from under the Frobenius loss is equivalent to estimation of a normal mean vector from under the quadratic loss. Let be the uniform measure on the sphere of radius in , where the Frobenius norm is adopted for radius. Then, from (3) and (4),
for some constant . Therefore, since when ,
Thus, from Lemma 2.1, the generalized Bayes estimator with respect to is inadmissible under the Frobenius loss. ∎
3 Improvement by additional scalar shrinkage
Here, motivated by the result of Efron and Morris (1976), we develop a class of priors for which the generalized Bayes estimators asymptotically dominate that with respect to the singular value shrinkage prior in (2). Efron and Morris (1976) proved that the estimator
(5)
dominates in (1) under the Frobenius loss. This estimator shrinks the singular values of more strongly than :
where
(6)
In other words, adds scalar shrinkage to . Konno (1990, 1991) showed corresponding results in the unknown covariance setting. By extending these results, Tsukuma and Kubokawa (2007) derived a general method for improving matrix mean estimators by adding scalar shrinkage.
Motivated by in (5), we construct priors by adding scalar shrinkage to in (2):
(7)
where . Note that Tsukuma and Kubokawa (2017) studied this type of prior in the context of Bayesian prediction. Let
be the marginal density of under the prior .
Lemma 3.1.
If , then for every .
Proof.
Since is interpreted as the expectation of under , it suffices to show that is locally integrable at every .
First, consider . Since
for every from Lemma 1 of Matsuda and Komaki (2015), is locally integrable at . Also, for some in a neighborhood of . Thus, is locally integrable at if .
Next, consider and take its neighborhood for . To evaluate the integral on , we use the variable transformation from to , where and so that . We have . Also, from ,
Thus,
The integral with respect to is finite if , which is equivalent to . The integral with respect to is finite due to the local integrability of , which corresponds to , at . Therefore, is locally integrable at if .
Hence, is locally integrable at every if . ∎
From Lemma 3.1, the generalized Bayes estimator with respect to is well-defined when . We denote it by .
Theorem 3.1.
Proof.
Let and be the th entry of . From
(9)
we have
Therefore,
(10)
From (8), the choice attains the minimum risk among . Note that also appears in the singular value decomposition form of the modified Efron–Morris estimator in (6).
Now, we examine the performance of in (7) by Monte Carlo simulation. Figure 1 plots the Frobenius risk of the generalized Bayes estimators with respect to in (7) with , in (2) and , which is Stein's prior on the vectorization of , for , and . We computed the generalized Bayes estimators by using the random-walk Metropolis–Hastings algorithm with a Gaussian proposal of variance . Note that the Frobenius risk of these estimators depends only on the singular values of due to orthogonal invariance. Similarly to the Efron–Morris estimator and , works well when is close to low rank. Also, attains large risk reduction when is close to the zero matrix like . Thus, has the best of both worlds. Figure 2 plots the Frobenius risk for , and , computed by the random-walk Metropolis–Hastings algorithm with proposal variance . The risk behavior is similar to that in Figure 1. Figure 3 plots the Frobenius risk for , and , computed by the random-walk Metropolis–Hastings algorithm with proposal variance . Again, the risk behavior is similar to that in Figure 1. Note that the value of is the same as in Figure 1.
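For reference, the following Python sketch shows how such generalized Bayes estimates can be approximated by a random-walk Metropolis–Hastings run. The prior is written, as an assumption, as a singular value shrinkage density det(M^T M)^{-(p-q-1)/2} multiplied by a scalar shrinkage factor ||M||_F^{-gamma}; the exponents, proposal variance, and chain length are placeholders to be matched to (2), (7), and the settings reported above.

import numpy as np

rng = np.random.default_rng(0)

def log_prior(M, gamma):
    """Assumed double shrinkage log-prior: singular value shrinkage
    det(M^T M)^{-(p-q-1)/2} times scalar shrinkage ||M||_F^{-gamma}
    (placeholders standing in for (2) and (7))."""
    p, q = M.shape
    s = np.linalg.svd(M, compute_uv=False)
    return -(p - q - 1) * np.sum(np.log(s)) - gamma * np.log(np.linalg.norm(M))

def log_posterior(M, X, gamma):
    # unit-variance Gaussian likelihood plus the log-prior
    return -0.5 * np.sum((X - M) ** 2) + log_prior(M, gamma)

def generalized_bayes_estimate(X, gamma=1.0, n_iter=50000, prop_sd=0.3):
    """Posterior mean approximated by random-walk Metropolis-Hastings."""
    M = X.copy()
    lp = log_posterior(M, X, gamma)
    running_sum = np.zeros_like(X)
    for _ in range(n_iter):
        proposal = M + prop_sd * rng.standard_normal(X.shape)
        lp_prop = log_posterior(proposal, X, gamma)
        if np.log(rng.random()) < lp_prop - lp:   # Metropolis acceptance step
            M, lp = proposal, lp_prop
        running_sum += M
    return running_sum / n_iter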
Improvement by additional scalar shrinkage holds even under the matrix quadratic loss (Matsuda and Strawderman, 2022; Matsuda, 2024).
Theorem 3.2.
Proof.
The generalized Bayes estimator with respect to in (7) attains minimaxity in some cases as follows.
Theorem 3.3.
If , and , then the generalized Bayes estimator with respect to in (7) is minimax under the Frobenius loss.
Proof.
It is an interesting problem whether the generalized Bayes estimator with respect to in (7) is admissible. In addition to Lemma 2.1, Brown (1971) derived the following sufficient condition for admissibility of generalized Bayes estimators, which may be useful here. While condition (14) can be verified by an argument similar to that of Theorem 2.1, verifying the uniform boundedness of seems difficult. We leave further investigation for future work.
Lemma 3.2.
(Brown, 1971) In estimation of from under the quadratic loss, the generalized Bayes estimator of with respect to a prior is admissible if is uniformly bounded and
(14)
where
and is the uniform measure on the sphere of radius in .
4 Improvement by additional column-wise shrinkage
Here, instead of scalar shrinkage, we consider priors with additional column-wise shrinkage:
(15)
where for every and denotes the norm of the -th column vector of . Let
be the marginal density of under the prior .
Lemma 4.1.
If for every , then for every .
Proof.
Similarly to Lemma 3.1, it suffices to show that is locally integrable at . Consider the neighborhood of defined by for . To evaluate the integral on , we use the variable transformation from to , where each and with are defined by and for the -th column vector of so that (polar coordinate). Since ,
Also, from with and ,
Thus,
(16)
By variable transformation from to and , the first integral in (16) is reduced to
The integral with respect to is finite if , which is equivalent to . The integral with respect to is finite if for every . On the other hand, the second integral in (16) is finite due to the local integrability of , which corresponds to , at . Therefore, is locally integrable at if for every . ∎
From Lemma 4.1, the generalized Bayes estimator with respect to is well-defined when . We denote it by .
Theorem 4.1.
Proof.
Let
Then,
(18)
(19)
From (17), the choice attains the minimum risk among .
Now, we examine the performance of in (15) by Monte Carlo simulation. Figure 4 plots the Frobenius risk of the generalized Bayes estimators with respect to in (15) with , in (7) with , in (2) and , which is Stein's prior on the vectorization of . We computed the generalized Bayes estimators by using the random-walk Metropolis–Hastings algorithm with proposal variance . We set , where and . For this , the Frobenius risk of the estimators compared here depends only on the singular values of . Overall, performs better than . Also, even dominates and except when is sufficiently small, which is explained by the column-wise shrinkage effect of . Figure 5 plots the Frobenius risk for , and , computed by the random-walk Metropolis–Hastings algorithm with proposal variance . The risk behavior is similar to that in Figure 4. Figure 6 plots the Frobenius risk for , and , computed by the random-walk Metropolis–Hastings algorithm with proposal variance . Again, the risk behavior is similar to that in Figure 4. Note that the value of is the same as in Figure 4.
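The same sampler sketched in Section 3 can be reused for the column-wise prior by swapping in per-column shrinkage factors; the exponents gamma_a below are placeholders for the factors and conditions attached to (15).

import numpy as np

def log_prior_columnwise(M, gammas):
    """Assumed column-wise double shrinkage log-prior: singular value shrinkage
    times prod_a ||a-th column||^{-gamma_a} (a placeholder standing in for (15))."""
    p, q = M.shape
    s = np.linalg.svd(M, compute_uv=False)
    col_norms = np.linalg.norm(M, axis=0)
    return -(p - q - 1) * np.sum(np.log(s)) - np.sum(np.asarray(gammas) * np.log(col_norms))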
Improvement by additional column-wise shrinkage holds even under the matrix quadratic loss (Matsuda and Strawderman, 2022; Matsuda, 2024).
Theorem 4.2.
Proof.
We use the same notation as in the proof of Theorem 4.1. By using (10), (18), and (19), we obtain
Therefore, from Lemma A.3,
Hence, we obtain (23).
Suppose that and . Let be the operator norm. Since is diagonal and ,
which yields . Therefore,
∎
The generalized Bayes estimator with respect to in (15) attains minimaxity in some cases as follows. Investigating its admissibility is an interesting direction for future work.
Theorem 4.3.
If , and , then the generalized Bayes estimator with respect to in (15) is minimax under the Frobenius loss.
5 Bayesian prediction
Here, we consider Bayesian prediction and provide parallel results to those in Sections 3 and 4. Suppose that we observe and predict by a predictive density . We evaluate predictive densities by the Kullback–Leibler loss:
The Bayesian predictive density based on a prior is defined as
where is the posterior distribution of given , and it minimizes the Bayes risk (Aitchison, 1975):
The Bayesian predictive density with respect to the uniform prior is minimax. However, it is inadmissible and dominated by Bayesian predictive densities based on superharmonic priors (Komaki, 2001; George, Liang and Xu, 2006). In particular, the Bayesian predictive density based on the singular value shrinkage prior in (2) is minimax and dominates that based on the uniform prior (Matsuda and Komaki, 2015).
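In practice, a Bayesian predictive density can be approximated by averaging the sampling density of the future observation over posterior draws of the mean matrix. The sketch below takes such draws as given (for example, from a Metropolis–Hastings run as in Section 3) and assumes unit variances for the future observation; that variance convention is an assumption, not part of the paper's setup.

import numpy as np

def predictive_density(Y, posterior_draws):
    """Monte Carlo approximation of the Bayesian predictive density
    p_pi(Y | X) = E[ p(Y | M) | X ], where p(Y | M) is matrix normal with unit variances.

    posterior_draws: array of shape (n_draws, p, q) of posterior samples of M given X."""
    diffs = Y[None, :, :] - posterior_draws                   # (n_draws, p, q)
    log_lik = -0.5 * np.sum(diffs ** 2, axis=(1, 2))          # Gaussian exponents
    log_lik -= 0.5 * Y.size * np.log(2.0 * np.pi)             # normalizing constant
    return float(np.mean(np.exp(log_lik)))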
The asymptotic expansion of the difference between the Kullback–Leibler risk of two Bayesian predictive densities is obtained as follows.
Lemma 5.1.
As , the difference between the Kullback–Leibler risk of and is expanded as
(24)
Proof.
For the normal model with known covariance, the information geometrical quantities (Amari, 1985) are given by
Also, the Jeffreys prior coincides with the uniform prior . Therefore, from equation (3) of Komaki (2006), the Kullback–Leibler risk of the Bayesian predictive density based on a prior is expanded as
(25)
where is a function independent of . Substituting and into (25) and taking the difference, we obtain (24). ∎
By comparing Lemma 5.1 to Lemma A.2, we obtain the following connection between estimation and prediction.
Proposition 5.1.
For every ,
Therefore, if asymptotically dominates under the quadratic loss, then asymptotically dominates under the Kullback–Leibler loss.
Thus, Theorems 3.1 and 4.1 extend to Bayesian prediction as follows. Other results in the previous sections can be extended to Bayesian prediction similarly.
Theorem 5.1.
Theorem 5.2.
Figures 7 and 8 plot the Kullback–Leibler risk of Bayesian predictive densities in settings analogous to those of Figures 1 and 4, respectively. They show that the risk behavior in prediction is qualitatively the same as that in estimation, which is consistent with Theorems 5.1 and 5.2.
Acknowledgments
We thank the referees for helpful comments. This work was supported by JSPS KAKENHI Grant Numbers 19K20220, 21H05205, 22K17865 and JST Moonshot Grant Number JPMJMS2024.
References
- Aitchison, J. (1975). Goodness of prediction fit. Biometrika 62, 547–554.
- Amari, S. (1985). Differential-Geometrical Methods in Statistics. New York: Springer.
- Brown, L. D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Annals of Mathematical Statistics 42, 855–903.
- Brown, L. D. & Zhao, L. H. (2009). Estimators for Gaussian models having a block-wise structure. Statistica Sinica 19, 885–903.
- Efron, B. & Morris, C. (1972). Empirical Bayes on vector observations: an extension of Stein's method. Biometrika 59, 335–347.
- Efron, B. & Morris, C. (1976). Multivariate empirical Bayes and estimation of covariance matrices. Annals of Statistics 4, 22–32.
- Fourdrinier, D., Strawderman, W. E. & Wells, M. (2018). Shrinkage Estimation. New York: Springer.
- George, E. I., Liang, F. & Xu, X. (2006). Improved minimax predictive densities under Kullback–Leibler loss. Annals of Statistics 34, 78–91.
- Gupta, A. K. & Nagar, D. K. (2000). Matrix Variate Distributions. New York: Chapman & Hall.
- Komaki, F. (2001). A shrinkage predictive distribution for multivariate normal observables. Biometrika 88, 859–864.
- Komaki, F. (2006). Shrinkage priors for Bayesian prediction. Annals of Statistics 34, 808–819.
- Konno, Y. (1990). Families of minimax estimators of matrix of normal means with unknown covariance matrix. Journal of the Japan Statistical Society 20, 191–201.
- Konno, Y. (1991). On estimation of a matrix of normal means with unknown covariance matrix. Journal of Multivariate Analysis 36, 44–55.
- Matsuda, T. (2023). Adapting to arbitrary quadratic loss via singular value shrinkage. IEEE Transactions on Information Theory, accepted.
- Matsuda, T. (2024). Matrix quadratic risk of orthogonally invariant estimators for a normal mean matrix. Japanese Journal of Statistics and Data Science, accepted.
- Matsuda, T. & Komaki, F. (2015). Singular value shrinkage priors for Bayesian prediction. Biometrika 102, 843–854.
- Matsuda, T. & Strawderman, W. E. (2019). Improved loss estimation for a normal mean matrix. Journal of Multivariate Analysis 169, 300–311.
- Matsuda, T. & Strawderman, W. E. (2022). Estimation under matrix quadratic loss and matrix superharmonicity. Biometrika 109, 503–519.
- Stein, C. (1974). Estimation of the mean of a multivariate normal distribution. Proceedings of the Prague Symposium on Asymptotic Statistics 2, 345–381.
- Tsukuma, H. & Kubokawa, T. (2007). Methods for improvement in estimation of a normal mean matrix. Journal of Multivariate Analysis 98, 1592–1610.
- Tsukuma, H. & Kubokawa, T. (2017). Proper Bayes and minimax predictive densities related to estimation of a normal mean matrix. Journal of Multivariate Analysis 159, 138–150.
- Tsukuma, H. & Kubokawa, T. (2020). Shrinkage Estimation for Mean and Covariance Matrices. Springer.
- Yuasa, R. & Kubokawa, T. (2023a). Generalized Bayes estimators with closed forms for the normal mean and covariance matrices. Journal of Statistical Planning and Inference 222, 182–194.
- Yuasa, R. & Kubokawa, T. (2023b). Weighted shrinkage estimators of normal mean matrices and dominance properties. Journal of Multivariate Analysis 194, 1–17.
Appendix A Asymptotic expansion of risk
Here, we provide asymptotic expansion formulas for estimators of a normal mean vector. Consider the problem of estimating from the observation under the quadratic loss . As shown in Stein (1974), the generalized Bayes estimator with respect to a prior is expressed as
where
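This representation (the posterior mean equals the observation plus the gradient of the log-marginal density) can be checked numerically. The sketch below uses a Gaussian prior, chosen only because its posterior mean has a closed form, and compares a finite-difference evaluation of the formula with the exact answer; the prior choice and helper names are illustrative assumptions.

import numpy as np

def marginal_density(x, tau2):
    """Marginal density of x when mu ~ N(0, tau2 * I) and x | mu ~ N(mu, I)."""
    d = x.size
    var = 1.0 + tau2
    return np.exp(-0.5 * np.sum(x ** 2) / var) / (2.0 * np.pi * var) ** (d / 2)

def tweedie_estimate(x, tau2, eps=1e-5):
    """x + grad log m(x), with the gradient computed by central finite differences."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (np.log(marginal_density(x + e, tau2))
                   - np.log(marginal_density(x - e, tau2))) / (2.0 * eps)
    return x + grad

x = np.array([1.0, -2.0, 0.5])
print(tweedie_estimate(x, tau2=4.0))      # approximately 4/5 * x
print(4.0 / 5.0 * x)                      # exact posterior mean under this Gaussian prior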
The asymptotic difference between the quadratic risk of two generalized Bayes estimators as is given as follows.
Lemma A.1.
As , the difference between the quadratic risk of and is expanded as
(26)
Proof.
We extend the above formula to matrices by using the matrix derivative notation of Matsuda and Strawderman (2022). For a function , its matrix gradient is defined as
(28)
For a function , its matrix Laplacian is defined as
(29)
Then, the above formulas can be straightforwardly extended to matrix-variate normal distributions as follows.
Lemma A.2.
As , the difference between the Frobenius risk of and is expanded as
Lemma A.3.
As , the difference between the matrix quadratic risk of and is expanded as
Komaki (2006) derived the asymptotic expansion of the Kullback–Leibler risk of Bayesian predictive densities. For the normal model discussed in Section 5, the result shows that Stein's prior dominates the Jeffreys prior in the term at the origin and in the term at other points, which is reminiscent of superefficiency theory. A similar phenomenon should exist in estimation as well. Unlike Stein's prior, priors for a normal mean matrix such as diverge at many points, such as low-rank matrices. Investigating the asymptotic risk of such priors in detail is an interesting problem for future work.
Appendix B Laplacian of and
Lemma B.1.
Proposition B.1.
The Laplacian of in (7) is given by
Proof.
Proposition B.2.
The Laplacian of in (15) is given by
Appendix C Improving on the block-wise Stein prior
Here, we develop priors that asymptotically dominate the block-wise Stein prior in estimation and prediction. Suppose that we observe and estimate or predict . We assume that the -dimensional mean vector is split into disjoint blocks of size , where . For example, such a situation appears in balanced ANOVA and wavelet regression (Brown and Zhao, 2009). Then, the block-wise Stein prior is defined as
(30)
which puts Stein’s prior on each block. Since it is superharmonic, the generalized Bayes estimator with respect to is minimax. However, Brown and Zhao (2009) showed that is inadmissible and dominated by an estimator with additional James–Stein type shrinkage defined by
where . From this result, Brown and Zhao (2009) conjectured in their Remark 3.2 that the block-wise Stein prior can be improved by multiplying it by a Stein-type shrinkage prior. Following their conjecture, we construct priors by adding scalar shrinkage to the block-wise Stein prior:
(31)
where . Let
Lemma C.1.
If for every , then for every .
Proof.
Since is interpreted as the expectation of under , it suffices to show that is locally integrable at every .
First, consider . Since for every (Brown and Zhao, 2009), is locally integrable at . Also, for some in a neighborhood of . Thus, is locally integrable at .
Next, consider and take the neighborhood for . From the AM-GM inequality,
Thus,
where and is a constant. Therefore, is locally integrable at if for every , which is equivalent to for every . ∎
From Lemma C.1, the generalized Bayes estimator with respect to is well-defined when for every . We denote it by .
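For intuition about the block-wise structure, the following sketch applies James–Stein shrinkage within each block and then an additional overall shrinkage, in the spirit of the Brown–Zhao improvement described above; the shrinkage constants are illustrative placeholders rather than the exact factors of their estimator or of the prior in (31).

import numpy as np

def blockwise_js_with_extra_shrinkage(x, blocks):
    """James-Stein shrinkage within each block, followed by an additional overall
    James-Stein-type shrinkage. Constants are placeholders, not those of Brown and Zhao (2009).

    blocks: list of index arrays partitioning {0, ..., len(x) - 1}."""
    est = x.astype(float).copy()
    for idx in blocks:
        xj = x[idx]
        dj = xj.size
        if dj > 2:
            est[idx] = max(1.0 - (dj - 2) / np.sum(xj ** 2), 0.0) * xj
    d = x.size
    if d > 2:
        est *= max(1.0 - (d - 2) / np.sum(x ** 2), 0.0)   # additional scalar shrinkage
    return est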