Double shrinkage priors for a normal mean matrix

Takeru Matsuda (The University of Tokyo; RIKEN Center for Brain Science), Fumiyasu Komaki (The University of Tokyo; RIKEN Center for Brain Science), and William E. Strawderman (Rutgers University)
Abstract

We consider estimation of a normal mean matrix under the Frobenius loss. Motivated by the Efron–Morris estimator, a generalization of Stein’s prior has been recently developed, which is superharmonic and shrinks the singular values towards zero. The generalized Bayes estimator with respect to this prior is minimax and dominates the maximum likelihood estimator. However, here we show that it is inadmissible by using Brown’s condition. Then, we develop two types of priors that provide improved generalized Bayes estimators and examine their performance numerically. The proposed priors attain risk reduction by adding scalar shrinkage or column-wise shrinkage to singular value shrinkage. Parallel results for Bayesian predictive densities are also given.

1 Introduction

Suppose that we have independent matrix observations Y(1),,Y(N)n×pY^{(1)},\dots,Y^{(N)}\in\mathbb{R}^{n\times p} whose entries are independent normal random variables Yij(t)N(Mij,1)Y^{(t)}_{ij}\sim{\rm N}(M_{ij},1), where Mn×pM\in\mathbb{R}^{n\times p} is an unknown mean matrix. In the notation of Gupta and Nagar (2000), this is expressed as Y(t)Nn,p(M,In,Ip)Y^{(t)}\sim{\rm N}_{n,p}(M,I_{n},I_{p}) for t=1,,Nt=1,\dots,N, where IkI_{k} denotes the kk-dimensional identity matrix. We consider estimation of MM under the Frobenius loss

l(M,M^)=M^MF2=a=1ni=1p(M^aiMai)2.\displaystyle l(M,\hat{M})=\|\hat{M}-M\|_{{\rm F}}^{2}=\sum_{a=1}^{n}\sum_{i=1}^{p}(\hat{M}_{ai}-M_{ai})^{2}.

By sufficiency reduction, it suffices to consider the average Y=(Y(1)++Y(N))/NN(M,In,N1Ip){Y}=(Y^{(1)}+\dots+Y^{(N)})/N\sim{\rm N}(M,I_{n},N^{-1}I_{p}) in estimation of MM. We assume np1>0n-p-1>0 in the following. Note that vectorization reduces this problem to estimation of a normal mean vector vec(M){\rm vec}(M) from vec(Y)Nnp(vec(M),N1Inp){\rm vec}({Y})\sim{\rm N}_{np}({\rm vec}(M),N^{-1}I_{np}) under the quadratic loss, which has been well studied in shrinkage estimation theory (Fourdrinier et al., 2018).

Efron and Morris (1972) proposed an empirical Bayes estimator:

M^EM=Y(Ipnp1N(YY)1).\displaystyle\hat{M}_{{\rm EM}}=Y\left(I_{p}-\frac{n-p-1}{N}(Y^{\top}Y)^{-1}\right). (1)

This estimator can be viewed as a generalization of the James–Stein estimator (p=1p=1) for a normal mean vector. Efron and Morris (1972) showed that M^EM\hat{M}_{{\rm EM}} is minimax and dominates the maximum likelihood estimator M^=Y\hat{M}=Y under the Frobenius loss. Let Y=UΛVY=U\Lambda V^{\top}, Un×pU\in\mathbb{R}^{n\times p}, Vp×pV\in\mathbb{R}^{p\times p}, Λ=diag(σ1(Y),,σp(Y))\Lambda={\rm diag}(\sigma_{1}(Y),\ldots,\sigma_{p}(Y)) be the singular value decomposition of YY, where UU=VV=IpU^{\top}U=V^{\top}V=I_{p} and σ1(Y)σp(Y)0\sigma_{1}(Y)\geq\cdots\geq\sigma_{p}(Y)\geq 0 are the singular values of YY. Stein (1974) pointed out that M^EM\hat{M}_{{\rm EM}} does not change the singular vectors but shrinks the singular values of YY towards zero:

M^EM=UΛ^EMV,Λ^EM=diag(σ1(M^EM),,σp(M^EM)),\displaystyle\hat{M}_{{\rm EM}}=U\hat{\Lambda}_{{\rm EM}}V^{\top},\quad\hat{\Lambda}_{{\rm EM}}={\rm diag}(\sigma_{1}(\hat{M}_{{\rm EM}}),\ldots,\sigma_{p}(\hat{M}_{{\rm EM}})),

where

σi(M^EM)=(1np1Nσi(Y)2)σi(Y),i=1,,p.\displaystyle\sigma_{i}(\hat{M}_{{\rm EM}})=\left(1-\frac{n-p-1}{N\sigma_{i}(Y)^{2}}\right)\sigma_{i}(Y),\quad i=1,\ldots,p.
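As a concrete illustration (not part of the original analysis), the following Python sketch computes the Efron–Morris estimator both from the closed form (1) and by shrinking the singular values as above, and checks that the two agree; the dimensions and the random truth are arbitrary.

```python
# Minimal numerical sketch of the Efron--Morris estimator (illustrative values only).
import numpy as np

rng = np.random.default_rng(0)
n, p, N = 10, 3, 1
M = rng.normal(size=(n, p))                       # an arbitrary mean matrix
Y = M + rng.normal(size=(n, p)) / np.sqrt(N)      # Y ~ N_{n,p}(M, I_n, N^{-1} I_p)

# Closed form (1)
M_EM = Y @ (np.eye(p) - (n - p - 1) / N * np.linalg.inv(Y.T @ Y))

# Equivalent singular value shrinkage form
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
s_shrunk = (1 - (n - p - 1) / (N * s ** 2)) * s
M_EM_svd = (U * s_shrunk) @ Vt

print(np.allclose(M_EM, M_EM_svd))                # True: the two forms coincide
```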

See Tsukuma and Kubokawa (2020); Yuasa and Kubokawa (2023a, b) for recent developments around the Efron–Morris estimator.

As a Bayesian counterpart of M^EM\hat{M}_{{\rm EM}}, Matsuda and Komaki (2015) proposed a singular value shrinkage prior

πSVS(M)=det(MM)(np1)/2,\displaystyle\pi_{{\rm SVS}}(M)=\det(M^{\top}M)^{-(n-p-1)/2}, (2)

and showed that the generalized Bayes estimator M^SVS\hat{M}_{{\rm SVS}} with respect to πSVS\pi_{{\rm SVS}} dominates the maximum likelihood estimator M^=Y\hat{M}=Y under the Frobenius loss. This prior can be viewed as a generalization of Stein’s prior π(μ)=μ2n\pi(\mu)=\|\mu\|^{2-n} for a normal mean vector μ\mu (p=1p=1) by Stein (1974). Similarly to M^EM\hat{M}_{{\rm EM}} in (1) , M^SVS\hat{M}_{{\rm SVS}} shrinks the singular values towards zero. Thus, it works well when the true matrix is close to low-rank. See Matsuda and Strawderman (2022) and Matsuda (2023) for details on the risk behavior of M^EM\hat{M}_{{\rm EM}} and M^SVS\hat{M}_{{\rm SVS}}.
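For reference, log πSVS can be evaluated directly from the singular values of M, since det(M^⊤M) = Π_i σ_i(M)². A small Python sketch (illustrative only):

```python
# log pi_SVS(M) = -((n-p-1)/2) log det(M^T M) = -(n-p-1) * sum_i log sigma_i(M)
import numpy as np

def log_pi_svs(M):
    n, p = M.shape
    s = np.linalg.svd(M, compute_uv=False)
    return -(n - p - 1) * np.sum(np.log(s))

rng = np.random.default_rng(0)
M = rng.normal(size=(10, 3))
sign, logdet = np.linalg.slogdet(M.T @ M)
print(np.isclose(log_pi_svs(M), -(10 - 3 - 1) / 2 * logdet))   # True
```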

In this paper, we show that the generalized Bayes estimator with respect to the singular value shrinkage prior πSVS\pi_{{\rm SVS}} in (2) is inadmissible under the Frobenius loss. Then, we develop two types of priors that provide improved generalized Bayes estimators asymptotically. The first type adds scalar shrinkage while the second type adds column-wise shrinkage. We conduct numerical experiments and confirm the effectiveness of the proposed priors in finite samples. We also provide parallel results for Bayesian prediction as well as a similar improvement of the blockwise Stein prior, which was conjectured by Brown and Zhao (2009).

This paper is organized as follows. In Section 2, we prove the inadmissibility of the generalized Bayes estimator with respect to the singular value shrinkage prior πSVS\pi_{{\rm SVS}} in (2). In Sections 3 and 4, we provide two types of priors that asymptotically dominate the singular value shrinkage prior πSVS\pi_{{\rm SVS}} in (2) by adding scalar or column-wise shrinkage, respectively. Numerical results are also given. In Section 5, we provide parallel results for Bayesian prediction. Technical details and similar results for the blockwise Stein prior are given in the Appendix.

2 Inadmissibility of the singular value shrinkage prior

Here, we show that the generalized Bayes estimator with respect to the singular value shrinkage prior πSVS\pi_{{\rm SVS}} in (2) is inadmissible under the Frobenius loss. Since NN does not affect admissibility results, we fix N=1N=1 for convenience in this section.

For estimation of a normal mean vector under the quadratic loss, Brown (1971) derived the following sufficient condition for inadmissibility of generalized Bayes estimators.

Lemma 2.1.

(Brown, 1971) In estimation of θ\theta from YNd(θ,Id)Y\sim\mathrm{N}_{d}(\theta,I_{d}) under the quadratic loss, the generalized Bayes estimator of θ\theta with respect to a prior π(θ)\pi(\theta) is inadmissible if

cr1dm¯(r)dr<\int_{c}^{\infty}{r^{1-d}\underline{m}(r)}{\rm d}r<\infty

for some c>0c>0, where

m¯(r)=1mπ(y)dUr(y),\underline{m}(r)=\int\frac{1}{m_{\pi}(y)}{\rm d}U_{r}(y),
mπ(y)=p(yθ)π(θ)dθ,m_{\pi}(y)=\int p(y\mid\theta)\pi(\theta){\rm d}\theta,
p(yθ)=1(2π)d/2exp(yθ22),p(y\mid\theta)=\frac{1}{(2\pi)^{d/2}}\exp\left(-\frac{\|y-\theta\|^{2}}{2}\right),

and UrU_{r} is the uniform measure on the sphere of radius rr in d\mathbb{R}^{d}.
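The quantities in Lemma 2.1 can be approximated numerically. The rough (and high-variance) Monte Carlo sketch below, with illustrative sample sizes, estimates m_π(y) as an average of π over N(y, I) draws and then averages 1/m_π over the sphere of radius r; inspecting r^{1−d} m̄(r) for increasing r gives a feel for the integrability condition, which is verified analytically for πSVS in Theorem 2.1 below.

```python
# Rough Monte Carlo sketch of Brown's inadmissibility criterion for pi_SVS
# (illustrative sample sizes; estimates are noisy).
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 3
d = n * p

def pi_svs(M):
    s = np.linalg.svd(M, compute_uv=False)
    return np.prod(s) ** (-(n - p - 1))

def m_pi(Y, n_mc=2000):
    # m_pi(Y) = E[pi_SVS(Y + Z)] with Z a standard normal n x p matrix
    return np.mean([pi_svs(Y + rng.normal(size=(n, p))) for _ in range(n_mc)])

def m_lower(r, n_dir=50):
    # spherical average of 1 / m_pi over matrices with Frobenius norm r
    vals = []
    for _ in range(n_dir):
        Y = rng.normal(size=(n, p))
        Y *= r / np.linalg.norm(Y)
        vals.append(1.0 / m_pi(Y))
    return np.mean(vals)

for r in [2.0, 4.0, 8.0]:
    print(r, r ** (1 - d) * m_lower(r))   # decays rapidly in r
```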

After vectorization, estimation of a normal mean matrix MM from YNn,p(M,In,Ip)Y\sim{\rm N}_{n,p}(M,I_{n},I_{p}) under the Frobenius loss reduces to estimation of a normal mean vector vec(M){\rm vec}(M) from vec(Y)Nnp(vec(M),Inp){\rm vec}(Y)\sim{\rm N}_{np}({\rm vec}(M),I_{np}) under the quadratic loss. Then, by using Brown’s condition in Lemma 2.1, we obtain the following.

Theorem 2.1.

When p2p\geq 2, the generalized Bayes estimator with respect to πSVS\pi_{{\rm SVS}} in (2) is inadmissible under the Frobenius loss.

Proof.

From np1>0n-p-1>0 and the AM-GM inequality

(i=1pσi(M)2)1/p1pi=1pσi(M)2,\displaystyle\left(\prod_{i=1}^{p}\sigma_{i}(M)^{2}\right)^{1/p}\leq\frac{1}{p}\sum_{i=1}^{p}\sigma_{i}(M)^{2},

we have

πSVS(M)=(i=1pσi(M)2)(np1)/2\displaystyle\pi_{{\rm SVS}}(M)=\left(\prod_{i=1}^{p}\sigma_{i}(M)^{2}\right)^{-(n-p-1)/2} (1pi=1pσi(M)2)p(np1)/2\displaystyle\geq\left(\frac{1}{p}\sum_{i=1}^{p}\sigma_{i}(M)^{2}\right)^{-p(n-p-1)/2}
=An,pMFp(np1),\displaystyle=A_{n,p}\|M\|_{{\rm F}}^{-p(n-p-1)},

where An,p=pp(np1)/2A_{n,p}=p^{p(n-p-1)/2}. Therefore,

mSVS(Y)\displaystyle m_{{\rm SVS}}(Y) =πSVS(M)p(YM)dM\displaystyle=\int\pi_{{\rm SVS}}(M)p(Y\mid M){\rm d}M
An,pMFp(np1)p(YM)dM\displaystyle\geq A_{n,p}\int\|M\|_{{\rm F}}^{-p(n-p-1)}p(Y\mid M){\rm d}M
=An,pE[Y+ZFp(np1)]\displaystyle=A_{n,p}{\rm E}[\|Y+Z\|_{{\rm F}}^{-p(n-p-1)}]
An,pE[(YF+ZF)p(np1)],\displaystyle\geq A_{n,p}{\rm E}[(\|Y\|_{{\rm F}}+\|Z\|_{{\rm F}})^{-p(n-p-1)}], (3)

where Z=MYNn,p(O,In,Ip)Z=M-Y\sim{\rm N}_{n,p}(O,I_{n},I_{p}) and we used the triangle inequality. As YF\|Y\|_{{\rm F}}\to\infty,

YFp(np1)E[(YF+ZF)p(np1)]=E[(1+ZFYF)p(np1)]1,\displaystyle\|Y\|_{{\rm F}}^{p(n-p-1)}{\rm E}[(\|Y\|_{{\rm F}}+\|Z\|_{{\rm F}})^{-p(n-p-1)}]={\rm E}\left[\left(1+\frac{\|Z\|_{{\rm F}}}{\|Y\|_{{\rm F}}}\right)^{-p(n-p-1)}\right]\to 1,

which yields

E[(YF+ZF)p(np1)]=O(YFp(np1)).\displaystyle{\rm E}[(\|Y\|_{{\rm F}}+\|Z\|_{{\rm F}})^{-p(n-p-1)}]=O(\|Y\|_{{\rm F}}^{-p(n-p-1)}). (4)

Now, we apply Lemma 2.1 by noting that estimation of a normal mean matrix MM from YNn,p(M,In,Ip)Y\sim{\rm N}_{n,p}(M,I_{n},I_{p}) under the Frobenius loss is equivalent to estimation of a normal mean vector vec(M){\rm vec}(M) from vec(Y)Nnp(vec(M),Inp){\rm vec}(Y)\sim{\rm N}_{np}({\rm vec}(M),I_{np}) under the quadratic loss. Let UrU_{r} be the uniform measure on the sphere of radius rr in n×p\mathbb{R}^{n\times p}, where the Frobenius norm is adopted for radius. Then, from (3) and (4),

m¯SVS(r)\displaystyle\underline{m}_{{\rm SVS}}(r) =1mSVS(Y)dUr(Y)Crp(np1)\displaystyle=\int\frac{1}{m_{{\rm SVS}}(Y)}{\rm d}U_{r}(Y)\leq Cr^{p(n-p-1)}

for some constant CC. Therefore, since p2p+1<1-p^{2}-p+1<-1 when p2p\geq 2,

1r1npm¯SVS(r)drC1rp2p+1dr<.\int_{1}^{\infty}r^{1-np}\underline{m}_{{\rm SVS}}(r){\rm d}r\leq C\int_{1}^{\infty}r^{-p^{2}-p+1}{\rm d}r<\infty.

From Lemma 2.1, this implies the inadmissibility of the generalized Bayes estimator with respect to πSVS\pi_{{\rm SVS}} under the Frobenius loss. ∎

3 Improvement by additional scalar shrinkage

Here, motivated by the result of Efron and Morris (1976), we develop a class of priors for which the generalized Bayes estimators asymptotically dominate that with respect to the singular value shrinkage prior πSVS\pi_{{\rm SVS}} in (2). Efron and Morris (1976) proved that the estimator

M^MEM=Y(Ipnp1N(YY)1p2+p2NYF2Ip)\displaystyle\hat{M}_{{\rm MEM}}=Y\left(I_{p}-\frac{n-p-1}{N}(Y^{\top}Y)^{-1}-\frac{p^{2}+p-2}{N\|Y\|_{\mathrm{F}}^{2}}I_{p}\right) (5)

dominates M^EM\hat{M}_{{\rm EM}} in (1) under the Frobenius loss. This estimator shrinks the singular values of YY more strongly than M^EM\hat{M}_{{\rm EM}}:

M^MEM=UΛ^MEMV,Λ^MEM=diag(σ1(M^MEM),,σp(M^MEM)),\displaystyle\hat{M}_{{\rm MEM}}=U\hat{\Lambda}_{{\rm MEM}}V^{\top},\quad\hat{\Lambda}_{{\rm MEM}}={\rm diag}(\sigma_{1}(\hat{M}_{{\rm MEM}}),\ldots,\sigma_{p}(\hat{M}_{{\rm MEM}})),

where

σi(M^MEM)=(1np1Nσi(Y)2p2+p2NYF2)σi(Y),i=1,,p.\displaystyle\sigma_{i}(\hat{M}_{{\rm MEM}})=\left(1-\frac{n-p-1}{N\sigma_{i}(Y)^{2}}-\frac{p^{2}+p-2}{N\|Y\|_{\mathrm{F}}^{2}}\right)\sigma_{i}(Y),\quad i=1,\ldots,p. (6)

In other words, M^MEM\hat{M}_{{\rm MEM}} adds scalar shrinkage to M^EM\hat{M}_{{\rm EM}}. Konno (1990, 1991) showed corresponding results in the unknown covariance setting. By extending these results, Tsukuma and Kubokawa (2007) derived a general method for improving matrix mean estimators by adding scalar shrinkage.
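Since both estimators are available in closed form, their Frobenius risks are easy to compare by simulation. A quick Python sketch (with an illustrative nearly low-rank truth, not the paper's experimental settings):

```python
# Monte Carlo comparison of the Efron--Morris estimator (1) and its modification (5).
import numpy as np

rng = np.random.default_rng(1)
n, p, N = 10, 3, 1
M = np.zeros((n, p))
M[0, 0] = 5.0                                     # a rank-one truth

def em(Y):
    return Y @ (np.eye(p) - (n - p - 1) / N * np.linalg.inv(Y.T @ Y))

def mem(Y):
    return Y @ (np.eye(p) - (n - p - 1) / N * np.linalg.inv(Y.T @ Y)
                - (p ** 2 + p - 2) / (N * np.sum(Y ** 2)) * np.eye(p))

def frobenius_risk(estimator, n_rep=5000):
    losses = []
    for _ in range(n_rep):
        Y = M + rng.normal(size=(n, p)) / np.sqrt(N)
        losses.append(np.sum((estimator(Y) - M) ** 2))
    return np.mean(losses)

print(frobenius_risk(em), frobenius_risk(mem))    # the second should be smaller
```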

Motivated by M^MEM\hat{M}_{{\rm MEM}} in (5), we construct priors by adding scalar shrinkage to πSVS\pi_{{\rm SVS}} in (2):

πMSVS1(M)=πSVS(M)MFγ,\displaystyle\pi_{{\rm MSVS1}}(M)=\pi_{{\rm SVS}}(M)\|M\|_{{\rm F}}^{-\gamma}, (7)

where γ0\gamma\geq 0. Note that Tsukuma and Kubokawa (2017) studied this type of prior in the context of Bayesian prediction. Let

mMSVS1(Y)=p(YM)πMSVS1(M)dM\displaystyle m_{{\rm MSVS1}}(Y)=\int p(Y\mid M)\pi_{{\rm MSVS1}}(M){\rm d}M

be the marginal density of YY under the prior πMSVS1(M)\pi_{{\rm MSVS1}}(M).

Lemma 3.1.

If 0γ<p2+p0\leq\gamma<p^{2}+p, then mMSVS1(Y)<m_{{\rm MSVS1}}(Y)<\infty for every YY.

Proof.

Since mMSVS1(Y)m_{{\rm MSVS1}}(Y) is interpreted as the expectation of πMSVS1(M)\pi_{{\rm MSVS1}}(M) under MNn,p(Y,In,Ip)M\sim{\rm N}_{n,p}(Y,I_{n},I_{p}), it suffices to show that πMSVS1(M)\pi_{{\rm MSVS1}}(M) is locally integrable at every MM.

First, consider MOM\neq O. Since

mSVS(Y)\displaystyle m_{{\rm SVS}}(Y) =πSVS(M)p(YM)dM<\displaystyle=\int\pi_{{\rm SVS}}(M)p(Y\mid M){\rm d}M<\infty

for every YY from Lemma 1 of Matsuda and Komaki (2015), πSVS(M)\pi_{{\rm SVS}}(M) is locally integrable at MM. Also, MF>c\|M\|_{{\rm F}}>c for some c>0c>0 in a neighborhood of MM. Thus, πMSVS1(M)=πSVS(M)MFγ\pi_{{\rm MSVS1}}(M)=\pi_{{\rm SVS}}(M)\|M\|_{{\rm F}}^{-\gamma} is locally integrable at MM if γ0\gamma\geq 0.

Next, consider M=OM=O and take its neighborhood A={MMFε}A=\{M\mid\|M\|_{\mathrm{F}}\leq\varepsilon\} for ε>0\varepsilon>0. To evaluate the integral on AA, we use the variable transformation from MM to (r,U)(r,U), where r=MFr=\|M\|_{\mathrm{F}} and U=M/rU=M/r so that M=rUM=rU. We have dM=rnp1drdU{\rm d}M=r^{np-1}{\rm d}r{\rm d}U. Also, from det(MM)=r2pdet(UU)\det(M^{\top}M)=r^{2p}\det(U^{\top}U),

πMSVS1(M)=rp(np1)γdet(UU)(np1)/2.\displaystyle\pi_{{\rm MSVS1}}(M)=r^{-p(n-p-1)-\gamma}\det(U^{\top}U)^{-(n-p-1)/2}.

Thus,

AπMSVS1(M)dM\displaystyle\int_{A}\pi_{{\rm MSVS1}}(M){\rm d}M
=\displaystyle= 0εrp(np1)γ+np1drdet(UU)(np1)/2dU\displaystyle\int_{0}^{\varepsilon}r^{-p(n-p-1)-\gamma+np-1}{\rm d}r\int\det(U^{\top}U)^{-(n-p-1)/2}{\rm d}U
=\displaystyle= 0εrp2+pγ1drdet(UU)(np1)/2dU.\displaystyle\int_{0}^{\varepsilon}r^{p^{2}+p-\gamma-1}{\rm d}r\int\det(U^{\top}U)^{-(n-p-1)/2}{\rm d}U.

The integral with respect to rr is finite if p2+pγ1>1p^{2}+p-\gamma-1>-1, which is equivalent to γ<p2+p\gamma<p^{2}+p. The integral with respect to UU is finite due to the local integrability of πSVS\pi_{\mathrm{SVS}}, which corresponds to γ=0\gamma=0, at M=OM=O. Therefore, πMSVS1(M)\pi_{{\rm MSVS1}}(M) is locally integrable at M=OM=O if 0γ<p2+p0\leq\gamma<p^{2}+p.

Hence, πMSVS1(M)\pi_{{\rm MSVS1}}(M) is locally integrable at every MM if 0γ<p2+p0\leq\gamma<p^{2}+p. ∎

From Lemma 3.1, the generalized Bayes estimator with respect to πMSVS1\pi_{{\rm MSVS1}} is well-defined when 0γ<p2+p0\leq\gamma<p^{2}+p. We denote it by M^MSVS1\hat{M}_{{\rm MSVS1}}.

Theorem 3.1.

For every MM,

N2(EM[M^MSVS1MF2]EM[M^SVSMF2])γ(γ2p22p+4)tr(MM)\displaystyle N^{2}({\rm E}_{M}[\|\hat{M}_{{{\rm MSVS1}}}-M\|_{{\rm F}}^{2}]-{\rm E}_{M}[\|\hat{M}_{{{\rm SVS}}}-M\|_{{\rm F}}^{2}])\to\frac{\gamma(\gamma-2p^{2}-2p+4)}{{\rm tr}(M^{\top}M)} (8)

as NN\to\infty. Therefore, if p2p\geq 2 and 0<γ<p2+p0<\gamma<p^{2}+p, then the generalized Bayes estimator with respect to πMSVS1\pi_{{\rm MSVS1}} in (7) asymptotically dominates that with respect to πSVS\pi_{{\rm SVS}} in (2) under the Frobenius loss.

Proof.

Let K=MMK=M^{\top}M and KijK^{ij} be the (i,j)(i,j)th entry of K1K^{-1}. From

KjkMai=δikMaj+δijMak,KijdetK=KijdetK,\displaystyle\frac{\partial K_{jk}}{\partial M_{ai}}=\delta_{ik}M_{aj}+\delta_{ij}M_{ak},\quad\frac{\partial}{\partial K_{ij}}\det K=K^{ij}\det K, (9)

we have

MaidetK\displaystyle\frac{\partial}{\partial M_{ai}}\det K =j,kKjkMaiKjkdetK=2jMajKijdetK.\displaystyle=\sum_{j,k}\frac{\partial K_{jk}}{\partial M_{ai}}\frac{\partial}{\partial K_{jk}}\det K=2\sum_{j}M_{aj}K^{ij}\det K.

Therefore,

MailogπSVS(M)=(np1)jMajKij.\displaystyle\frac{\partial}{\partial M_{ai}}\log\pi_{{\rm SVS}}(M)=-(n-p-1)\sum_{j}M_{aj}K^{ij}. (10)

Let πS(M)=MFγ=(trK)γ/2\pi_{{\rm S}}(M)=\|M\|_{{\rm F}}^{-\gamma}=({\rm tr}K)^{-\gamma/2}. Since

MaitrK=2Mai\displaystyle\frac{\partial}{\partial M_{ai}}{\rm tr}K=2M_{ai}

from (9), we have

MailogπS(M)\displaystyle\frac{\partial}{\partial M_{ai}}\log\pi_{{\rm S}}(M) =γMai(trK)1,\displaystyle=-\gamma M_{ai}({\rm tr}K)^{-1}, (11)
2Mai2logπS(M)\displaystyle\frac{\partial^{2}}{\partial M_{ai}^{2}}\log\pi_{{\rm S}}(M) =γ(trK2Mai2)(trK)2.\displaystyle=-\gamma({\rm tr}K-2M_{ai}^{2})({\rm tr}K)^{-2}. (12)

By using (10), (11), and (12), we obtain

tr(~logπSVS(M)~logπS(M))\displaystyle{\rm tr}(\widetilde{\nabla}\log\pi_{{\rm SVS}}(M)^{\top}\widetilde{\nabla}\log\pi_{{\rm S}}(M)) =γp(np1)(trK)1,\displaystyle=\gamma p(n-p-1)({\rm tr}K)^{-1},
tr(~logπS(M)~logπS(M))\displaystyle{\rm tr}(\widetilde{\nabla}\log\pi_{{\rm S}}(M)^{\top}\widetilde{\nabla}\log\pi_{{\rm S}}(M)) =γ2(trK)1,\displaystyle=\gamma^{2}({\rm tr}K)^{-1},
tr(Δ~logπS(M))\displaystyle{\rm tr}(\widetilde{\Delta}\log\pi_{{\rm S}}(M)) =γ(np2)(trK)1,\displaystyle=-\gamma(np-2)({\rm tr}K)^{-1},

where we used the matrix derivative notations (28) and (29). Therefore, from Lemma A.2,

EM[M^MSVS1MF2]EM[M^SVSMF2]\displaystyle{\rm E}_{M}[\|\hat{M}_{{\rm MSVS1}}-M\|_{{\rm F}}^{2}]-{\rm E}_{M}[\|\hat{M}_{{\rm SVS}}-M\|_{{\rm F}}^{2}]
=\displaystyle= 1N2tr(2~logπSVS(M)~logπS(M)+~logπS(M)~logπS(M)+2Δ~logπS(M))\displaystyle\frac{1}{N^{2}}{\rm tr}(2\widetilde{\nabla}\log\pi_{{\rm SVS}}(M)^{\top}\widetilde{\nabla}\log\pi_{{\rm S}}(M)+\widetilde{\nabla}\log\pi_{{\rm S}}(M)^{\top}\widetilde{\nabla}\log\pi_{{\rm S}}(M)+2\widetilde{\Delta}\log\pi_{{\rm S}}(M))
+o(N2)\displaystyle\quad+o(N^{-2})
=\displaystyle= 1N2γ(γ2p22p+4)(trK)1+o(N2).\displaystyle\frac{1}{N^{2}}\gamma(\gamma-2p^{2}-2p+4)({\rm tr}K)^{-1}+o(N^{-2}).

Hence, we obtain (8). ∎

From (8), the choice γ=p2+p2\gamma=p^{2}+p-2 attains the minimum risk among 0<γ<p2+p0<\gamma<p^{2}+p. Note that p2+p2p^{2}+p-2 also appears in the singular value decomposition form of the modified Efron–Morris estimator M^MEM\hat{M}_{{\rm MEM}} in (6).

Now, we examine the performance of πMSVS1\pi_{{\rm MSVS1}} in (7) by Monte Carlo simulation. Figure 1 plots the Frobenius risk of generalized Bayes estimators with respect to πMSVS1\pi_{{\rm MSVS1}} in (7) with γ=p2+p2\gamma=p^{2}+p-2, πSVS\pi_{{\rm SVS}} in (2) and πS(M)=MF2np\pi_{{\rm S}}(M)=\|M\|_{{\rm F}}^{2-np}, which is Stein’s prior on the vectorization of MM, for n=10n=10, p=3p=3 and N=1N=1. We computed the generalized Bayes estimators by using the random-walk Metropolis–Hastings algorithm with Gaussian proposal of variance 0.10.1. Note that the Frobenius risk of these estimators depends only on the singular values of MM due to orthogonal invariance. Similarly to the Efron–Morris estimator and πSVS\pi_{{\rm SVS}}, πMSVS1\pi_{{\rm MSVS1}} works well when MM is close to low-rank. Also, πMSVS1\pi_{{\rm MSVS1}} attains large risk reduction when MM is close to the zero matrix, like πS\pi_{{\rm S}}. Thus, πMSVS1\pi_{{\rm MSVS1}} has the best of both worlds. Figure 2 plots the Frobenius risk for n=10n=10, p=3p=3 and N=10N=10, computed by the random-walk Metropolis–Hastings algorithm with proposal variance 0.0050.005. The risk behavior is similar to that in Figure 1. Figure 3 plots the Frobenius risk for n=20n=20, p=3p=3 and N=2N=2, computed by the random-walk Metropolis–Hastings algorithm with proposal variance 0.010.01. Again, the risk behavior is similar to that in Figure 1. Note that the value of np/N=30np/N=30 is the same as in Figure 1.
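For concreteness, the following Python sketch outlines the random-walk Metropolis–Hastings computation of the generalized Bayes estimator (the posterior mean) under πMSVS1; the chain length, burn-in, seed, and the rank-one truth are illustrative choices, with proposal variance 0.1 as in the text.

```python
# Random-walk Metropolis--Hastings approximation of the generalized Bayes
# estimator under pi_MSVS1 (illustrative chain settings).
import numpy as np

rng = np.random.default_rng(0)
n, p, N = 10, 3, 1
gamma = p ** 2 + p - 2                   # asymptotically optimal scalar shrinkage

def log_pi_msvs1(M):
    # log pi_MSVS1(M) = log pi_SVS(M) - gamma * log ||M||_F
    s = np.linalg.svd(M, compute_uv=False)
    return -(n - p - 1) * np.sum(np.log(s)) - 0.5 * gamma * np.log(np.sum(M ** 2))

def log_posterior(M, Y):
    # Y ~ N_{n,p}(M, I_n, N^{-1} I_p), so the log-likelihood is -N/2 ||Y - M||_F^2 + const
    return -0.5 * N * np.sum((Y - M) ** 2) + log_pi_msvs1(M)

def gb_estimate(Y, n_iter=100000, burn_in=20000, prop_var=0.1):
    M, cur = Y.copy(), log_posterior(Y, Y)
    total, kept = np.zeros_like(Y), 0
    for t in range(n_iter):
        prop = M + rng.normal(scale=np.sqrt(prop_var), size=M.shape)
        new = log_posterior(prop, Y)
        if np.log(rng.uniform()) < new - cur:     # Metropolis accept/reject
            M, cur = prop, new
        if t >= burn_in:
            total += M
            kept += 1
    return total / kept                           # posterior mean

M_true = np.zeros((n, p))
M_true[0, 0] = 5.0
Y = M_true + rng.normal(size=(n, p)) / np.sqrt(N)
print(np.sum((gb_estimate(Y) - M_true) ** 2))     # one realization of the Frobenius loss
```

Averaging such losses over many replications of Y gives Monte Carlo risk estimates of the kind plotted below.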

[Two panels plotting the Frobenius risk against σ1(M) (left) and σ2(M) (right); curves: MSVS1, SVS, Stein.]
Figure 1: Frobenius risk of generalized Bayes estimators for n=10n=10, p=3p=3 and N=1N=1. Left: σ2=σ3=0\sigma_{2}=\sigma_{3}=0. Right: σ1=10\sigma_{1}=10, σ3=0\sigma_{3}=0. solid: πMSVS1\pi_{{\rm MSVS1}} with γ=p2+p2\gamma=p^{2}+p-2, dashed: πSVS\pi_{{\rm SVS}}, dotted: Stein’s prior πS\pi_{{\rm S}}. Note that the minimax risk is np/N=30np/N=30.
[Two panels plotting the Frobenius risk against σ1(M) (left) and σ2(M) (right); curves: MSVS1, SVS, Stein.]
Figure 2: Frobenius risk of generalized Bayes estimators for n=10n=10, p=3p=3 and N=10N=10. Left: σ2=σ3=0\sigma_{2}=\sigma_{3}=0. Right: σ1=10\sigma_{1}=10, σ3=0\sigma_{3}=0. solid: πMSVS1\pi_{{\rm MSVS1}} with γ=p2+p2\gamma=p^{2}+p-2, dashed: πSVS\pi_{{\rm SVS}}, dotted: Stein’s prior πS\pi_{{\rm S}}. Note that the minimax risk is np/N=3np/N=3.
[Two panels plotting the Frobenius risk against σ1(M) (left) and σ2(M) (right); curves: MSVS1, SVS, Stein.]
Figure 3: Frobenius risk of generalized Bayes estimators for n=20n=20, p=3p=3 and N=2N=2. Left: σ2=σ3=0\sigma_{2}=\sigma_{3}=0. Right: σ1=10\sigma_{1}=10, σ3=0\sigma_{3}=0. solid: πMSVS1\pi_{{\rm MSVS1}} with γ=p2+p2\gamma=p^{2}+p-2, dashed: πSVS\pi_{{\rm SVS}}, dotted: Stein’s prior πS\pi_{{\rm S}}. Note that the minimax risk is np/N=30np/N=30.

Improvement by additional scalar shrinkage holds even under the matrix quadratic loss (Matsuda and Strawderman, 2022; Matsuda, 2024).

Theorem 3.2.

For every MM,

N2(EM[(M^MSVS1M)\displaystyle N^{2}({\rm E}_{M}[(\hat{M}_{{{\rm MSVS1}}}-M)^{\top} (M^MSVS1M)]EM[(M^SVSM)(M^SVSM)])\displaystyle(\hat{M}_{{{\rm MSVS1}}}-M)]-{\rm E}_{M}[(\hat{M}_{{{\rm SVS}}}-M)^{\top}(\hat{M}_{{{\rm SVS}}}-M)])
γ(trK)2(2(p+1)(trK)Ip+(γ+4)K)\displaystyle\to\gamma({\rm tr}K)^{-2}(-2(p+1)({\rm tr}K)I_{p}+(\gamma+4)K) (13)

as NN\to\infty. Therefore, if p2p\geq 2 and 0<γ<2p20<\gamma<2p-2, then the generalized Bayes estimator with respect to πMSVS1\pi_{{\rm MSVS1}} in (7) asymptotically dominates that with respect to πSVS\pi_{{\rm SVS}} in (2) under the matrix quadratic loss.

Proof.

We use the same notation with the proof of Theorem 3.1. By using (10), (11), and (12), we obtain

(~logπSVS(M)~logπS(M))\displaystyle(\widetilde{\nabla}\log\pi_{{\rm SVS}}(M)^{\top}\widetilde{\nabla}\log\pi_{{\rm S}}(M)) =γ(np1)(trK)1Ip,\displaystyle=\gamma(n-p-1)({\rm tr}K)^{-1}I_{p},
(~logπS(M)~logπS(M))\displaystyle(\widetilde{\nabla}\log\pi_{{\rm S}}(M)^{\top}\widetilde{\nabla}\log\pi_{{\rm S}}(M)) =γ2(trK)2K,\displaystyle=\gamma^{2}({\rm tr}K)^{-2}K,
(Δ~logπS(M))\displaystyle(\widetilde{\Delta}\log\pi_{{\rm S}}(M)) =nγ(trK)1Ip+2γ(trK)2K.\displaystyle=-n\gamma({\rm tr}K)^{-1}I_{p}+2\gamma({\rm tr}K)^{-2}K.

Therefore, from Lemma A.3,

EM[(M^MSVS1M)(M^MSVS1M)]EM[(M^SVSM)(M^SVSM)]\displaystyle{\rm E}_{M}[(\hat{M}_{{{\rm MSVS1}}}-M)^{\top}(\hat{M}_{{{\rm MSVS1}}}-M)]-{\rm E}_{M}[(\hat{M}_{{{\rm SVS}}}-M)^{\top}(\hat{M}_{{{\rm SVS}}}-M)]
=\displaystyle= 1N2(2~logπSVS(M)~logπS(M)+~logπS(M)~logπS(M)+2Δ~logπS(M))\displaystyle\frac{1}{N^{2}}(2\widetilde{\nabla}\log\pi_{{\rm SVS}}(M)^{\top}\widetilde{\nabla}\log\pi_{{\rm S}}(M)+\widetilde{\nabla}\log\pi_{{\rm S}}(M)^{\top}\widetilde{\nabla}\log\pi_{{\rm S}}(M)+2\widetilde{\Delta}\log\pi_{{\rm S}}(M))
+o(N2)\displaystyle\quad+o(N^{-2})
=\displaystyle= 1N2γ(trK)2(2(p+1)(trK)Ip+(γ+4)K)+o(N2).\displaystyle\frac{1}{N^{2}}\gamma({\rm tr}K)^{-2}(-2(p+1)({\rm tr}K)I_{p}+(\gamma+4)K)+o(N^{-2}).

Hence, we obtain (13). Since K(trK)IpK\preceq({\rm tr}K)I_{p} from KOK\succeq O,

2(p+1)(trK)Ip+(γ+4)K(γ2p+2)(trK)IpO\displaystyle-2(p+1)({\rm tr}K)I_{p}+(\gamma+4)K\preceq(\gamma-2p+2)({\rm tr}K)I_{p}\prec O

if 0<γ<2p20<\gamma<2p-2. ∎

The generalized Bayes estimator with respect to πMSVS1\pi_{{\rm MSVS1}} in (7) attains minimaxity in some cases as follows.

Theorem 3.3.

If p2p\geq 2, p+2n<2p+22/pp+2\leq n<2p+2-2/p and 0<γnp+2p2+2p20<\gamma\leq-np+2p^{2}+2p-2, then the generalized Bayes estimator with respect to πMSVS1\pi_{{\rm MSVS1}} in (7) is minimax under the Frobenius loss.

Proof.

From Proposition B.1,

ΔπMSVS1(M)\displaystyle\Delta\pi_{{\rm MSVS1}}(M) =γ(γ+np2p22p+2)MF2πMSVS1(M)0.\displaystyle=\gamma(\gamma+np-2p^{2}-2p+2)\|M\|_{{\rm F}}^{-2}\pi_{{\rm MSVS1}}(M)\leq 0.

Thus, πMSVS1(M)\pi_{{\rm MSVS1}}(M) is superharmonic, which indicates the minimaxity of the generalized Bayes estimator with respect to πMSVS1\pi_{{\rm MSVS1}} in (7) under the Frobenius loss from Stein’s classical result (Stein, 1974; Matsuda and Komaki, 2015). ∎

It is an interesting problem whether or not the generalized Bayes estimator with respect to πMSVS1\pi_{{\rm MSVS1}} in (7) is admissible. In addition to Lemma 2.1, Brown (1971) derived the following sufficient condition for admissibility of generalized Bayes estimators, which may be useful here. While the condition (14) can be verified by an argument similar to that in Theorem 2.1, verifying the uniform boundedness of logmπ(y)\|\nabla\log m_{\pi}(y)\| seems difficult. We leave further investigation for future work.

Lemma 3.2.

(Brown, 1971) In estimation of θ\theta from YNd(θ,Id)Y\sim\mathrm{N}_{d}(\theta,I_{d}) under the quadratic loss, the generalized Bayes estimator of θ\theta with respect to a prior π(θ)\pi(\theta) is admissible if logmπ(y)\|\nabla\log m_{\pi}(y)\| is uniformly bounded and

\displaystyle\int_{c}^{\infty}\frac{r^{1-d}}{\overline{m}(r)}{\rm d}r=\infty, (14)

where

\overline{m}(r)=\int m_{\pi}(y)\,{\rm d}U_{r}(y)

and UrU_{r} is the uniform measure on the sphere of radius rr in d\mathbb{R}^{d}.

4 Improvement by additional column-wise shrinkage

Here, instead of scalar shrinkage, we consider priors with additional column-wise shrinkage:

πMSVS2(M)=πSVS(M)i=1pMiγi,\displaystyle\pi_{{\rm MSVS2}}(M)=\pi_{{\rm SVS}}(M)\prod_{i=1}^{p}\|M_{\cdot i}\|^{-\gamma_{i}}, (15)

where γi0\gamma_{i}\geq 0 for every ii and Mi\|M_{\cdot i}\| denotes the norm of the ii-th column vector of MM. Let

mMSVS2(Y)=p(YM)πMSVS2(M)dM\displaystyle m_{{\rm MSVS2}}(Y)=\int p(Y\mid M)\pi_{{\rm MSVS2}}(M){\rm d}M

be the marginal density of YY under the prior πMSVS2(M)\pi_{{\rm MSVS2}}(M).
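On the log scale, πMSVS2 simply adds a column-wise penalty to log πSVS, so it plugs into the same Metropolis–Hastings computation as above. A small sketch (illustrative only):

```python
# log pi_MSVS2(M) = log pi_SVS(M) - sum_i gamma_i * log ||M_{.i}||
import numpy as np

def log_pi_msvs2(M, gammas):
    n, p = M.shape
    s = np.linalg.svd(M, compute_uv=False)
    col_norms = np.linalg.norm(M, axis=0)          # ||M_{.1}||, ..., ||M_{.p}||
    return -(n - p - 1) * np.sum(np.log(s)) - np.sum(gammas * np.log(col_norms))

rng = np.random.default_rng(0)
M = rng.normal(size=(10, 3))
print(log_pi_msvs2(M, gammas=np.full(3, 3 - 1)))   # gamma_1 = ... = gamma_p = p - 1
```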

Lemma 4.1.

If 0γip0\leq\gamma_{i}\leq p for every ii, then mMSVS2(Y)<m_{{\rm MSVS2}}(Y)<\infty for every YY.

Proof.

Similarly to Lemma 3.1, it suffices to show that πMSVS2(M)\pi_{{\rm MSVS2}}(M) is locally integrable at M=OM=O. Consider the neighborhood of M=OM=O defined by A={MMFε}A=\{M\mid\|M\|_{\mathrm{F}}\leq\varepsilon\} for ε>0\varepsilon>0. To evaluate the integral on AA, we use the variable transformation from MM to (r1,,rp,u1,,up)(r_{1},\dots,r_{p},u_{1},\dots,u_{p}), where each ri[0,)r_{i}\in[0,\infty) and uinu_{i}\in\mathbb{R}^{n} with ui=1\|u_{i}\|=1 are defined by ri=Mir_{i}=\|M_{\cdot i}\| and ui=Mi/riu_{i}=M_{\cdot i}/r_{i} for the ii-th column vector MiM_{\cdot i} of MM so that Mi=riuiM_{\cdot i}=r_{i}u_{i} (polar coordinate). Since dMi=rin1dridui{\rm d}M_{\cdot i}=r_{i}^{n-1}{\rm d}r_{i}{\rm d}u_{i},

\displaystyle{\rm d}M=r_{1}^{n-1}\cdots r_{p}^{n-1}\,{\rm d}r_{1}\cdots{\rm d}r_{p}\,{\rm d}u_{1}\cdots{\rm d}u_{p}.

Also, from MM=D(UU)DM^{\top}M=D(U^{\top}U)D with D=diag(r1,,rp)D={\rm diag}(r_{1},\dots,r_{p}) and U=(u1up)U=(u_{1}\dots u_{p}),

πMSVS2(M)\displaystyle\pi_{{\rm MSVS2}}(M) =det(D)(np1)det(UU)(np1)/2i=1priγi\displaystyle=\det(D)^{-(n-p-1)}\det(U^{\top}U)^{-(n-p-1)/2}\prod_{i=1}^{p}r_{i}^{-\gamma_{i}}
=i=1pri(np1)γidet(UU)(np1)/2.\displaystyle=\prod_{i=1}^{p}r_{i}^{-(n-p-1)-\gamma_{i}}\cdot\det(U^{\top}U)^{-(n-p-1)/2}.

Thus,

AπMSVS2(M)dM\displaystyle\int_{A}\pi_{{\rm MSVS2}}(M){\rm d}M
\displaystyle=\int_{\|r\|\leq\varepsilon}\left(\prod_{i=1}^{p}r_{i}^{p-\gamma_{i}}\right){\rm d}r_{1}\cdots{\rm d}r_{p}\int\det(U^{\top}U)^{-(n-p-1)/2}\,{\rm d}u_{1}\cdots{\rm d}u_{p}. (16)

By variable transformation from r=(r1,,rp)r=(r_{1},\dots,r_{p}) to s=rs=\|r\| and v=r/sv=r/s, the first integral in (16) is reduced to

0εsp2i=1pγi+p1dsv=1(i=1pvipγi)dv.\displaystyle\int_{0}^{\varepsilon}s^{p^{2}-\sum_{i=1}^{p}\gamma_{i}+p-1}{\rm d}s\int_{\|v\|=1}\left(\prod_{i=1}^{p}v_{i}^{p-\gamma_{i}}\right){\rm d}v.

The integral with respect to ss is finite if p2i=1pγi+p1>1p^{2}-\sum_{i=1}^{p}\gamma_{i}+p-1>-1, which is equivalent to i=1pγi<p2+p\sum_{i=1}^{p}\gamma_{i}<p^{2}+p. The integral with respect to vv is finite if pγi0p-\gamma_{i}\geq 0 for every ii. On the other hand, the second integral in (16) is finite due to the local integrability of πSVS\pi_{\mathrm{SVS}}, which corresponds to γ1==γp=0\gamma_{1}=\dots=\gamma_{p}=0, at M=OM=O. Therefore, πMSVS2(M)\pi_{{\rm MSVS2}}(M) is locally integrable at M=OM=O if 0γip0\leq\gamma_{i}\leq p for every ii. ∎

From Lemma 4.1, the generalized Bayes estimator with respect to πMSVS2\pi_{{\rm MSVS2}} is well-defined when 0γip0\leq\gamma_{i}\leq p for every ii. We denote it by M^MSVS2\hat{M}_{{\rm MSVS2}}.

Theorem 4.1.

For every MM,

N2(EM[M^MSVS2MF2]EM[M^SVSMF2])i=1pγi(γi2p+2)Mi2\displaystyle N^{2}({\rm E}_{M}[\|\hat{M}_{{{\rm MSVS2}}}-M\|_{{\rm F}}^{2}]-{\rm E}_{M}[\|\hat{M}_{{{\rm SVS}}}-M\|_{{\rm F}}^{2}])\to\sum_{i=1}^{p}\gamma_{i}(\gamma_{i}-2p+2)\|M_{\cdot i}\|^{-2} (17)

as NN\to\infty. Therefore, if p2p\geq 2 and 0<γip0<\gamma_{i}\leq p for every ii, then the generalized Bayes estimator with respect to πMSVS2\pi_{{\rm MSVS2}} in (15) asymptotically dominates that with respect to πSVS\pi_{{\rm SVS}} in (2) under the Frobenius loss.

Proof.

Let

πCS(M)=i=1pMiγi.\displaystyle\pi_{{\rm CS}}(M)=\prod_{i=1}^{p}\|M_{\cdot i}\|^{-\gamma_{i}}.

Then,

MailogπCS(M)=γiMaiMi2,\displaystyle\frac{\partial}{\partial M_{ai}}\log\pi_{{\rm CS}}(M)=-\gamma_{i}M_{ai}\|M_{\cdot i}\|^{-2}, (18)
2Mai2logπCS(M)=γi(Mi22Mai2)Mi4.\displaystyle\frac{\partial^{2}}{\partial M_{ai}^{2}}\log\pi_{{\rm CS}}(M)=-\gamma_{i}\left(\|M_{\cdot i}\|^{2}-2M_{ai}^{2}\right)\|M_{\cdot i}\|^{-4}. (19)

From (10), (18), and (19),

tr(~logπSVS(M)~logπCS(M))=(np1)i=1pγiMi2,\displaystyle{\rm tr}(\widetilde{\nabla}\log\pi_{{\rm SVS}}(M)^{\top}\widetilde{\nabla}\log\pi_{{\rm CS}}(M))=(n-p-1)\sum_{i=1}^{p}\gamma_{i}\|M_{\cdot i}\|^{-2}, (20)
tr(~logπCS(M)~logπCS(M))=i=1pγi2Mi2,\displaystyle{\rm tr}(\widetilde{\nabla}\log\pi_{{\rm CS}}(M)^{\top}\widetilde{\nabla}\log\pi_{{\rm CS}}(M))=\sum_{i=1}^{p}\gamma_{i}^{2}\|M_{\cdot i}\|^{-2}, (21)
tr(Δ~logπCS(M))=(n2)i=1pγiMi2,\displaystyle{\rm tr}(\widetilde{\Delta}\log\pi_{{\rm CS}}(M))=-(n-2)\sum_{i=1}^{p}\gamma_{i}\|M_{\cdot i}\|^{-2}, (22)

where we used the matrix derivative notations (28) and (29). Therefore, from Lemma A.2,

EM[M^MSVS2MF2]EM[M^SVSMF2]\displaystyle{\rm E}_{M}[\|\hat{M}_{{\rm MSVS2}}-M\|_{{\rm F}}^{2}]-{\rm E}_{M}[\|\hat{M}_{{\rm SVS}}-M\|_{{\rm F}}^{2}]
=\displaystyle= 1N2tr(2~logπSVS(M)~logπCS(M)+~logπCS(M)~logπCS(M)+2Δ~logπCS(M))\displaystyle\frac{1}{N^{2}}{\rm tr}(2\widetilde{\nabla}\log\pi_{{\rm SVS}}(M)^{\top}\widetilde{\nabla}\log\pi_{{\rm CS}}(M)+\widetilde{\nabla}\log\pi_{{\rm CS}}(M)^{\top}\widetilde{\nabla}\log\pi_{{\rm CS}}(M)+2\widetilde{\Delta}\log\pi_{{\rm CS}}(M))
+o(N2)\displaystyle\quad+o(N^{-2})
=\displaystyle= 1N2i=1pγi(γi2p+2)Mi2+o(N2).\displaystyle\frac{1}{N^{2}}\sum_{i=1}^{p}\gamma_{i}(\gamma_{i}-2p+2)\|M_{\cdot i}\|^{-2}+o(N^{-2}).

Hence, we obtain (17). ∎

From (17), the choice γ1==γp=p1\gamma_{1}=\dots=\gamma_{p}=p-1 attains the minimum risk among 0<γip0<\gamma_{i}\leq p.

Now, we examine the performance of πMSVS2\pi_{{\rm MSVS2}} in (15) by Monte Carlo simulation. Figure 4 plots the Frobenius risk of generalized Bayes estimators with respect to πMSVS2\pi_{{\rm MSVS2}} in (15) with γ1==γp=p1\gamma_{1}=\dots=\gamma_{p}=p-1, πMSVS1\pi_{{\rm MSVS1}} in (7) with γ=p2+p2\gamma=p^{2}+p-2, πSVS\pi_{{\rm SVS}} in (2) and πS(M)=MF2np\pi_{{\rm S}}(M)=\|M\|_{{\rm F}}^{2-np}, which is Stein’s prior on the vectorization of MM, for n=10n=10, p=3p=3 and N=1N=1. We computed the generalized Bayes estimators by using the random-walk Metropolis–Hastings algorithm with proposal variance 0.10.1. We set M=UΣM=U\Sigma, where UU=IpU^{\top}U=I_{p} and Σ=diag(σ1,,σp)\Sigma={\rm diag}(\sigma_{1},\dots,\sigma_{p}). For this MM, the Frobenius risk of the estimators compared here depends only on the singular values σ1,,σp\sigma_{1},\dots,\sigma_{p} of MM. Overall, πMSVS2\pi_{{\rm MSVS2}} performs better than πSVS\pi_{{\rm SVS}}. Also, πMSVS2\pi_{{\rm MSVS2}} even dominates πMSVS1\pi_{{\rm MSVS1}} and πS\pi_{{\rm S}} except when σ1\sigma_{1} is sufficiently small. This is understood from the column-wise shrinkage effect of πMSVS2\pi_{{\rm MSVS2}}. Figure 5 plots the Frobenius risk for n=10n=10, p=3p=3 and N=10N=10, computed by the random-walk Metropolis–Hastings algorithm with proposal variance 0.010.01. The risk behavior is similar to that in Figure 4. Figure 6 plots the Frobenius risk for n=20n=20, p=3p=3 and N=2N=2, computed by the random-walk Metropolis–Hastings algorithm with proposal variance 0.010.01. Again, the risk behavior is similar to that in Figure 4. Note that the value of np/N=30np/N=30 is the same as in Figure 4.

[Two panels plotting the Frobenius risk against σ1(M) (left) and σ2(M) (right); curves: MSVS2, MSVS1, SVS, Stein.]
Figure 4: Frobenius risk of generalized Bayes estimators for n=10n=10, p=3p=3 and N=1N=1 where M=UΣM=U\Sigma with UU=IpU^{\top}U=I_{p} and Σ=diag(σ1,,σp)\Sigma={\rm diag}(\sigma_{1},\dots,\sigma_{p}). Left: σ2=σ3=0\sigma_{2}=\sigma_{3}=0. Right: σ1=10\sigma_{1}=10, σ3=0\sigma_{3}=0. solid: πMSVS2\pi_{{\rm MSVS2}} with γ1==γp=p1\gamma_{1}=\dots=\gamma_{p}=p-1, dash-dotted: πMSVS1\pi_{{\rm MSVS1}} with γ=p2+p2\gamma=p^{2}+p-2, dashed: πSVS\pi_{{\rm SVS}}, dotted: Stein’s prior πS\pi_{{\rm S}}. Note that the minimax risk is np/N=30np/N=30.
[Two panels plotting the Frobenius risk against σ1(M) (left) and σ2(M) (right); curves: MSVS2, MSVS1, SVS, Stein.]
Figure 5: Frobenius risk of generalized Bayes estimators for n=10n=10, p=3p=3 and N=10N=10 where M=UΣM=U\Sigma with UU=IpU^{\top}U=I_{p} and Σ=diag(σ1,,σp)\Sigma={\rm diag}(\sigma_{1},\dots,\sigma_{p}). Left: σ2=σ3=0\sigma_{2}=\sigma_{3}=0. Right: σ1=10\sigma_{1}=10, σ3=0\sigma_{3}=0. solid: πMSVS2\pi_{{\rm MSVS2}} with γ1==γp=p1\gamma_{1}=\dots=\gamma_{p}=p-1, dash-dotted: πMSVS1\pi_{{\rm MSVS1}} with γ=p2+p2\gamma=p^{2}+p-2, dashed: πSVS\pi_{{\rm SVS}}, dotted: Stein’s prior πS\pi_{{\rm S}}. Note that the minimax risk is np/N=3np/N=3.
[Two panels plotting the Frobenius risk against σ1(M) (left) and σ2(M) (right); curves: MSVS2, MSVS1, SVS, Stein.]
Figure 6: Frobenius risk of generalized Bayes estimators for n=20n=20, p=3p=3 and N=2N=2 where M=UΣM=U\Sigma with UU=IpU^{\top}U=I_{p} and Σ=diag(σ1,,σp)\Sigma={\rm diag}(\sigma_{1},\dots,\sigma_{p}). Left: σ2=σ3=0\sigma_{2}=\sigma_{3}=0. Right: σ1=10\sigma_{1}=10, σ3=0\sigma_{3}=0. solid: πMSVS2\pi_{{\rm MSVS2}} with γ1==γp=p1\gamma_{1}=\dots=\gamma_{p}=p-1, dash-dotted: πMSVS1\pi_{{\rm MSVS1}} with γ=p2+p2\gamma=p^{2}+p-2, dashed: πSVS\pi_{{\rm SVS}}, dotted: Stein’s prior πS\pi_{{\rm S}}. Note that the minimax risk is np/N=30np/N=30.

Improvement by additional column-wise shrinkage holds even under the matrix quadratic loss (Matsuda and Strawderman, 2022; Matsuda, 2024).

Theorem 4.2.

For every MM,

N2(EM[(M^MSVS2M)\displaystyle N^{2}({\rm E}_{M}[(\hat{M}_{{{\rm MSVS2}}}-M)^{\top} (M^MSVS2M)]EM[(M^SVSM)(M^SVSM)])\displaystyle(\hat{M}_{{{\rm MSVS2}}}-M)]-{\rm E}_{M}[(\hat{M}_{{{\rm SVS}}}-M)^{\top}(\hat{M}_{{{\rm SVS}}}-M)])
2(p1)D+DMMD\displaystyle\to-2(p-1)D+DM^{\top}MD (23)

as NN\to\infty, where D=diag(γ1M12,,γpMp2)D={\rm diag}(\gamma_{1}\|M_{\cdot 1}\|^{-2},\dots,\gamma_{p}\|M_{\cdot p}\|^{-2}). Therefore, if p2p\geq 2 and 0<γ1==γp<22/p0<\gamma_{1}=\dots=\gamma_{p}<2-2/p, then the generalized Bayes estimator with respect to πMSVS2\pi_{{\rm MSVS2}} in (15) asymptotically dominates that with respect to πSVS\pi_{{\rm SVS}} in (2) under the matrix quadratic loss.

Proof.

We use the same notation with the proof of Theorem 4.1. By using (10), (18), and (19), we obtain

~logπSVS(M)~logπCS(M)=(np1)D,\displaystyle\widetilde{\nabla}\log\pi_{{\rm SVS}}(M)^{\top}\widetilde{\nabla}\log\pi_{{\rm CS}}(M)=(n-p-1)D,
~logπCS(M)~logπCS(M)=DMMD,\displaystyle\widetilde{\nabla}\log\pi_{{\rm CS}}(M)^{\top}\widetilde{\nabla}\log\pi_{{\rm CS}}(M)=DM^{\top}MD,
Δ~logπCS(M)=(n2)D.\displaystyle\widetilde{\Delta}\log\pi_{{\rm CS}}(M)=-(n-2)D.

Therefore, from Lemma A.3,

EM[(M^MSVS2M)(M^MSVS2M)]EM[(M^SVSM)(M^SVSM)]\displaystyle{\rm E}_{M}[(\hat{M}_{{{\rm MSVS2}}}-M)^{\top}(\hat{M}_{{{\rm MSVS2}}}-M)]-{\rm E}_{M}[(\hat{M}_{{{\rm SVS}}}-M)^{\top}(\hat{M}_{{{\rm SVS}}}-M)]
=\displaystyle= 1N2(2~logπSVS(M)~logπCS(M)+~logπCS(M)~logπCS(M)+2Δ~logπCS(M))\displaystyle\frac{1}{N^{2}}(2\widetilde{\nabla}\log\pi_{{\rm SVS}}(M)^{\top}\widetilde{\nabla}\log\pi_{{\rm CS}}(M)+\widetilde{\nabla}\log\pi_{{\rm CS}}(M)^{\top}\widetilde{\nabla}\log\pi_{{\rm CS}}(M)+2\widetilde{\Delta}\log\pi_{{\rm CS}}(M))
+o(N2)\displaystyle\quad+o(N^{-2})
\displaystyle=\frac{1}{N^{2}}(-2(p-1)D+DM^{\top}MD)+o(N^{-2}).

Hence, we obtain (23).

Suppose that p2p\geq 2 and 0<γ1==γp<22/p0<\gamma_{1}=\dots=\gamma_{p}<2-2/p. Let \|\cdot\| be the operator norm. Since DOD\succeq O is diagonal and M2MF2=i=1pMi2\|M\|^{2}\leq\|M\|_{\mathrm{F}}^{2}=\sum_{i=1}^{p}\|M_{\cdot i}\|^{2},

D1/2MMD1/2\displaystyle\|D^{1/2}M^{\top}MD^{1/2}\| D1/2MMD1/2\displaystyle\leq\|D^{1/2}\|\|M^{\top}M\|\|D^{1/2}\|
=(maxiDii)M2\displaystyle=(\max_{i}D_{ii})\|M\|^{2}
=γ1M2miniMi2\displaystyle=\gamma_{1}\frac{\|M\|^{2}}{\min_{i}\|M_{\cdot i}\|^{2}}
γ1i=1pMi2miniMi2\displaystyle\leq\gamma_{1}\frac{\sum_{i=1}^{p}\|M_{\cdot i}\|^{2}}{\min_{i}\|M_{\cdot i}\|^{2}}
γ1p\displaystyle\leq\gamma_{1}p
<2(p1),\displaystyle<2(p-1),

which yields D1/2MMD1/22(p1)IpD^{1/2}M^{\top}MD^{1/2}\prec 2(p-1)I_{p}. Therefore,

2(p1)D+DMMD=D1/2(2(p1)Ip+D1/2MMD1/2)D1/2O.\displaystyle-2(p-1)D+DM^{\top}MD=D^{1/2}(-2(p-1)I_{p}+D^{1/2}M^{\top}MD^{1/2})D^{1/2}\prec O. ∎

The generalized Bayes estimator with respect to πMSVS2\pi_{{\rm MSVS2}} in (15) attains minimaxity in some cases as follows. It is an interesting problem for future work to investigate its admissibility.

Theorem 4.3.

If p3p\geq 3, p+2n<2pp+2\leq n<2p and 0<γ1==γp=γn+2p0<\gamma_{1}=\dots=\gamma_{p}=\gamma\leq-n+2p, then the generalized Bayes estimator with respect to πMSVS2\pi_{{\rm MSVS2}} in (15) is minimax under the Frobenius loss.

Proof.

From Proposition B.2,

ΔπMSVS2(M)\displaystyle\Delta\pi_{{\rm MSVS2}}(M) =γ(γ+n2p)(i=1pMi2)πMSVS2(M)0.\displaystyle=\gamma(\gamma+n-2p)\left(\sum_{i=1}^{p}\|M_{\cdot i}\|^{-2}\right)\pi_{{\rm MSVS2}}(M)\leq 0.

Thus, πMSVS2(M)\pi_{{\rm MSVS2}}(M) is superharmonic, which indicates the minimaxity of the generalized Bayes estimator with respect to πMSVS2\pi_{{\rm MSVS2}} in (15) under the Frobenius loss from Stein’s classical result (Stein, 1974; Matsuda and Komaki, 2015). ∎

5 Bayesian prediction

Here, we consider Bayesian prediction and provide parallel results to those in Sections 3 and 4. Suppose that we observe YNn,p(M,In,N1Ip)Y\sim{\rm N}_{n,p}(M,I_{n},N^{-1}I_{p}) and predict Y~Nn,p(M,In,Ip)\widetilde{Y}\sim{\rm N}_{n,p}(M,I_{n},I_{p}) by a predictive density p^(Y~Y)\hat{p}(\widetilde{Y}\mid Y). We evaluate predictive densities by the Kullback–Leibler loss:

D(p(M),p^(Y))=p(Y~M)logp(Y~M)p^(Y~Y)dY~.\displaystyle D({p}(\cdot\mid M),\hat{p}(\cdot\mid Y))=\int{p}(\widetilde{Y}\mid M)\log\frac{{p}(\widetilde{Y}\mid M)}{\hat{p}(\widetilde{Y}\mid Y)}{\rm d}\widetilde{Y}.

The Bayesian predictive density based on a prior π(M)\pi(M) is defined as

p^π(Y~Y)\displaystyle\hat{p}_{\pi}(\widetilde{Y}\mid Y) =p(Y~M)π(MY)dM,\displaystyle=\int{p}(\widetilde{Y}\mid M)\pi(M\mid Y){\rm d}M,

where π(MY)\pi(M\mid Y) is the posterior distribution of MM given YY, and it minimizes the Bayes risk (Aitchison, 1975):

p^π(Y~Y)\displaystyle\hat{p}_{\pi}(\widetilde{Y}\mid Y) =argminp^D(p(M),p^(Y))p(YM)π(M)dYdM.\displaystyle=\operatorname*{arg\,min}_{\hat{p}}\int D({p}(\cdot\mid M),\hat{p}(\cdot\mid Y))p(Y\mid M)\pi(M){\rm d}Y{\rm d}M.

The Bayesian predictive density with respect to the uniform prior is minimax. However, it is inadmissible and dominated by Bayesian predictive densities based on superharmonic priors (Komaki, 2001; George, Liang and Xu, 2006). In particular, the Bayesian predictive density based on the singular value shrinkage prior πSVS\pi_{{\rm SVS}} in (2) is minimax and dominates that based on the uniform prior (Matsuda and Komaki, 2015).
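To make these objects concrete, the Python sketch below approximates a Bayesian predictive density by averaging p(Ỹ∣M) over posterior draws and estimates its Kullback–Leibler loss by Monte Carlo. For checkability it uses the uniform prior, where the predictive density is exactly N_{n,p}(Y, I_n, (1+1/N)I_p) and the loss has a closed form; the sample sizes are illustrative.

```python
# Monte Carlo Kullback--Leibler loss of a Bayesian predictive density
# (uniform prior, so the answer can be checked in closed form).
import numpy as np

rng = np.random.default_rng(0)
n, p, N = 10, 3, 10
d = n * p
M = rng.normal(size=(n, p))
Y = M + rng.normal(size=(n, p)) / np.sqrt(N)

# Posterior draws under the uniform prior: M | Y ~ N_{n,p}(Y, I_n, N^{-1} I_p)
S = 4000
draws = (Y + rng.normal(size=(S, n, p)) / np.sqrt(N)).reshape(S, d)

# Monte Carlo KL loss: average of log p(Yt | M) - log p_hat(Yt | Y) over Yt ~ p(. | M)
T = 2000
Yt = (M + rng.normal(size=(T, n, p))).reshape(T, d)
sq = (Yt ** 2).sum(1)[:, None] + (draws ** 2).sum(1)[None, :] - 2 * Yt @ draws.T
log_mix = np.logaddexp.reduce(-0.5 * sq, axis=1) - np.log(S) - 0.5 * d * np.log(2 * np.pi)
log_true = -0.5 * ((Yt - M.reshape(d)) ** 2).sum(1) - 0.5 * d * np.log(2 * np.pi)
kl_mc = np.mean(log_true - log_mix)

# Closed form: the uniform-prior predictive density is N_{n,p}(Y, I_n, (1 + 1/N) I_p)
v = 1 + 1 / N
kl_exact = 0.5 * (d * np.log(v) - d + d / v + np.sum((M - Y) ** 2) / v)
print(kl_mc, kl_exact)                             # should roughly agree
```

For a shrinkage prior such as πSVS, the same mixture approximation applies with the posterior draws replaced by Metropolis–Hastings samples.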

The asymptotic expansion of the difference between the Kullback–Leibler risk of two Bayesian predictive densities is obtained as follows.

Lemma 5.1.

As NN\to\infty, the difference between the Kullback–Leibler risk of pπ1(Y~Y)p_{\pi_{1}}(\widetilde{Y}\mid Y) and pπ1π2(Y~Y)p_{\pi_{1}\pi_{2}}(\widetilde{Y}\mid Y) is expanded as

EM[D(p(Y~M),pπ1π2(Y~Y))]EM[D(p(Y~M),pπ1(Y~Y))]\displaystyle{\rm E}_{M}[D(p(\widetilde{Y}\mid M),p_{\pi_{1}\pi_{2}}(\widetilde{Y}\mid Y))]-{\rm E}_{M}[D(p(\widetilde{Y}\mid M),p_{\pi_{1}}(\widetilde{Y}\mid Y))]
=\displaystyle= 12N2tr(2(~logπ1(M))(~logπ2(M))+(~logπ2(M))(~logπ2(M))+2Δ~logπ2(M))\displaystyle\frac{1}{2N^{2}}{\rm tr}(2(\widetilde{\nabla}\log\pi_{1}(M))^{\top}(\widetilde{\nabla}\log\pi_{2}(M))+(\widetilde{\nabla}\log\pi_{2}(M))^{\top}(\widetilde{\nabla}\log\pi_{2}(M))+2\widetilde{\Delta}\log\pi_{2}(M))
+o(N2).\displaystyle\quad+o(N^{-2}). (24)
Proof.

For the normal model with known covariance, the information geometrical quantities (Amari, 1985) are given by

gij=gij=δij,Γijk=0,Tijk=0.\displaystyle g_{ij}=g^{ij}=\delta_{ij},\quad\Gamma_{ij}^{k}=0,\quad T_{ijk}=0.

Also, the Jeffreys prior coincides with the uniform prior π(M)1\pi(M)\equiv 1. Therefore, from equation (3) of Komaki (2006), the Kullback–Leibler risk of the Bayesian predictive density pπ(Y~Y)p_{\pi}(\widetilde{Y}\mid Y) based on a prior π(M)\pi(M) is expanded as

EM[D(p(Y~M),pπ(Y~Y))]\displaystyle{\rm E}_{M}[D(p(\widetilde{Y}\mid M),p_{\pi}(\widetilde{Y}\mid Y))]
=\displaystyle= np2N+12N2tr((~logπ(M))(~logπ(M))+2Δ~logπ(M))+g(M)+o(N2),\displaystyle\frac{np}{2N}+\frac{1}{2N^{2}}{\rm tr}((\widetilde{\nabla}\log\pi(M))^{\top}(\widetilde{\nabla}\log\pi(M))+2\widetilde{\Delta}\log\pi(M))+g(M)+o(N^{-2}), (25)

where g(M)g(M) is a function independent of π(M)\pi(M). Substituting π=π1π2\pi=\pi_{1}\pi_{2} and π=π1\pi=\pi_{1} into (25) and taking difference, we obtain (24). ∎

By comparing Lemma 5.1 to Lemma A.2, we obtain the following connection between estimation and prediction.

Proposition 5.1.

For every MM,

limNN2(EM[D(p(Y~M),p^π1π2(Y~Y))]EM[D(p(Y~M),p^π1(Y~Y))])\displaystyle\lim_{N\to\infty}N^{2}({\rm E}_{M}[D(p(\widetilde{Y}\mid M),\hat{p}_{\pi_{1}\pi_{2}}(\widetilde{Y}\mid Y))]-{\rm E}_{M}[D(p(\widetilde{Y}\mid M),\hat{p}_{\pi_{1}}(\widetilde{Y}\mid Y))])
=\displaystyle= 12limNN2(EM[M^π1π2MF2]EM[M^π1MF2]).\displaystyle\frac{1}{2}\lim_{N\to\infty}N^{2}({\rm E}_{M}[\|\hat{M}^{\pi_{1}\pi_{2}}-M\|_{\mathrm{F}}^{2}]-{\rm E}_{M}[\|\hat{M}^{\pi_{1}}-M\|_{\mathrm{F}}^{2}]).

Therefore, if M^π1π2\hat{M}^{\pi_{1}\pi_{2}} asymptotically dominates M^π1\hat{M}^{\pi_{1}} under the Frobenius loss, then p^π1π2(Y~Y)\hat{p}_{\pi_{1}\pi_{2}}(\widetilde{Y}\mid Y) asymptotically dominates p^π1(Y~Y)\hat{p}_{\pi_{1}}(\widetilde{Y}\mid Y) under the Kullback–Leibler loss.

Therefore, Theorems 3.1 and 4.1 are extended to Bayesian prediction as follows. Other results in the previous sections can be extended to Bayesian prediction similarly.

Theorem 5.1.

For every MM,

N2(EM[D(p(M),p^MSVS1(Y))]EM[D(p(M),p^SVS(Y))])\displaystyle N^{2}({\rm E}_{M}[D(p(\cdot\mid M),\hat{p}_{{\rm MSVS1}}(\cdot\mid Y))]-{\rm E}_{M}[D(p(\cdot\mid M),\hat{p}_{{\rm SVS}}(\cdot\mid Y))])
\displaystyle\to γ(γ2p22p+4)2tr(MM)\displaystyle\frac{\gamma(\gamma-2p^{2}-2p+4)}{2{\rm tr}(M^{\top}M)}

as NN\to\infty. Therefore, if p2p\geq 2 and 0<γ<p2+p0<\gamma<p^{2}+p, then the Bayesian predictive density with respect to πMSVS1\pi_{{\rm MSVS1}} in (7) asymptotically dominates that with respect to πSVS\pi_{{\rm SVS}} in (2) under the Kullback–Leibler loss.

Theorem 5.2.

For every MM,

N2(EM[D(p(M),p^MSVS2(Y))]EM[D(p(M),p^SVS(Y))])\displaystyle N^{2}({\rm E}_{M}[D(p(\cdot\mid M),\hat{p}_{{\rm MSVS2}}(\cdot\mid Y))]-{\rm E}_{M}[D(p(\cdot\mid M),\hat{p}_{{\rm SVS}}(\cdot\mid Y))])
\displaystyle\to 12i=1pγi(γi2p+2)Mi2\displaystyle\frac{1}{2}\sum_{i=1}^{p}\gamma_{i}(\gamma_{i}-2p+2)\|M_{\cdot i}\|^{-2}

as NN\to\infty. Therefore, if p2p\geq 2 and 0<γip0<\gamma_{i}\leq p for every ii, then the Bayesian predictive density with respect to πMSVS2\pi_{{\rm MSVS2}} in (15) asymptotically dominates that with respect to πSVS\pi_{{\rm SVS}} in (2) under the Kullback–Leibler loss.

Figures 7 and 8 plot the Kullback–Leibler risk of Bayesian predictive densities in settings similar to those of Figures 1 and 4, respectively. They show that the risk behavior in prediction is qualitatively the same as that in estimation, which is consistent with Theorems 5.1 and 5.2.

[Two panels plotting the Kullback–Leibler risk against σ1(M) (left) and σ2(M) (right); curves: MSVS1, SVS, Stein.]
Figure 7: Kullback–Leibler risk of Bayesian predictive densities for n=10n=10, p=3p=3 and N=1N=1. Left: σ2=σ3=0\sigma_{2}=\sigma_{3}=0. Right: σ1=10\sigma_{1}=10, σ3=0\sigma_{3}=0. solid: πMSVS1\pi_{{\rm MSVS1}} with γ=p2+p2\gamma=p^{2}+p-2, dashed: πSVS\pi_{{\rm SVS}}, dotted: Stein’s prior πS\pi_{{\rm S}}.
[Two panels plotting the Kullback–Leibler risk against σ1(M) (left) and σ2(M) (right); curves: MSVS2, MSVS1, SVS, Stein.]
Figure 8: Kullback–Leibler risk of Bayesian predictive densities for n=10n=10, p=3p=3 and N=1N=1 where M=UΣM=U\Sigma with UU=IpU^{\top}U=I_{p} and Σ=diag(σ1,,σp)\Sigma={\rm diag}(\sigma_{1},\dots,\sigma_{p}). Left: σ2=σ3=0\sigma_{2}=\sigma_{3}=0. Right: σ1=10\sigma_{1}=10, σ3=0\sigma_{3}=0. solid: πMSVS2\pi_{{\rm MSVS2}} with γ1==γp=p1\gamma_{1}=\dots=\gamma_{p}=p-1, dash-dotted: πMSVS1\pi_{{\rm MSVS1}} with γ=p2+p2\gamma=p^{2}+p-2, dashed: πSVS\pi_{{\rm SVS}}, dotted: Stein’s prior πS\pi_{{\rm S}}.

Acknowledgments

We thank the referees for helpful comments. This work was supported by JSPS KAKENHI Grant Numbers 19K20220, 21H05205, 22K17865 and JST Moonshot Grant Number JPMJMS2024.

References

  • Aitchison (1975) Aitchison, J. (1975). Goodness of prediction fit. Biometrika 62, 547–554.
  • Amari (1985) Amari, S. (1985). Differential-Geometrical Methods in Statistics. New York: Springer.
  • Brown (1971) Brown, L. D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Annals of Mathematical Statistics 36, 855–903.
  • Brown and Zhao (2009) Brown, L. D. & Zhao, L. H. (2009). Estimators for Gaussian models having a block-wise structure. Statistica Sinica 19, 885–903.
  • Efron and Morris (1972) Efron, B. & Morris, C. (1972). Empirical Bayes on vector observations: an extension of Stein’s method. Biometrika 59, 335–347.
  • Efron and Morris (1976) Efron, B. & Morris, C. (1976). Multivariate empirical Bayes and estimation of covariance matrices. Annals of Statistics 4, 22–32.
  • Fourdrinier et al. (2018) Fourdrinier, D., Strawderman, W. E. & Wells, M. (2018). Shrinkage Estimation. Springer, New York.
  • George, Liang and Xu (2006) George, E. I., Liang, F. & Xu, X. (2006). Improved minimax predictive densities under Kullback–Leibler loss. Annals of Statistics 34, 78–91.
  • Gupta and Nagar (2000) Gupta, A. K. & Nagar, D. K. (2000). Matrix Variate Distributions. New York: Chapman & Hall.
  • Komaki (2001) Komaki, F. (2001). A shrinkage predictive distribution for multivariate normal observables. Biometrika 88, 859–864.
  • Komaki (2006) Komaki, F. (2006). Shrinkage priors for Bayesian prediction. Annals of Statistics 34, 808–819.
  • Konno (1990) Konno, Y. (1990). Families of minimax estimators of matrix of normal means with unknown covariance matrix. J. Japan Statist. Soc. 20, 191–201.
  • Konno (1991) Konno, Y. (1991). On estimation of a matrix of normal means with unknown covariance matrix. Journal of Multivariate Analysis 36, 44–55.
  • Matsuda (2023) Matsuda, T. (2023). Adapting to arbitrary quadratic loss via singular value shrinkage. IEEE Transactions on Information Theory, accepted.
  • Matsuda (2024) Matsuda, T. (2024). Matrix quadratic risk of orthogonally invariant estimators for a normal mean matrix. Japanese Journal of Statistics and Data Science, accepted.
  • Matsuda and Komaki (2015) Matsuda, T. & Komaki, F. (2015). Singular value shrinkage priors for Bayesian prediction. Biometrika 102, 843–854.
  • Matsuda and Strawderman (2019) Matsuda, T. & Strawderman, W. E. (2019). Improved loss estimation for a normal mean matrix. Journal of Multivariate Analysis 169, 300–311.
  • Matsuda and Strawderman (2022) Matsuda, T. & Strawderman, W. E. (2022). Estimation under matrix quadratic loss and matrix superharmonicity. Biometrika 109, 503–519.
  • Stein (1974) Stein, C. (1974). Estimation of the mean of a multivariate normal distribution. Proc. Prague Symp. Asymptotic Statistics 2, 345–381.
  • Tsukuma and Kubokawa (2007) Tsukuma, H. & Kubokawa, T. (2007). Methods for improvement in estimation of a normal mean matrix. Journal of Multivariate Analysis 98, 1592–1610.
  • Tsukuma and Kubokawa (2017) Tsukuma, H. & Kubokawa, T. (2017). Proper Bayes and minimax predictive densities related to estimation of a normal mean matrix. Journal of Multivariate Analysis 159, 138–150.
  • Tsukuma and Kubokawa (2020) Tsukuma, H. & Kubokawa, T. (2020). Shrinkage estimation for mean and covariance matrices. Springer.
  • Yuasa and Kubokawa (2023a) Yuasa, R. & Kubokawa, T. (2023). Generalized Bayes estimators with closed forms for the normal mean and covariance matrices. Journal of Statistical Planning and Inference 222, 182–194.
  • Yuasa and Kubokawa (2023b) Yuasa, R. & Kubokawa, T. (2023). Weighted shrinkage estimators of normal mean matrices and dominance properties. Journal of Multivariate Analysis 194, 1–17.

Appendix A Asymptotic expansion of risk

Here, we provide asymptotic expansion formulas for estimators of a normal mean vector. Consider the problem of estimating θ\theta from the observation YNd(θ,N1Id)Y\sim{\rm N}_{d}(\theta,N^{-1}I_{d}) under the quadratic loss l(θ,θ^)=θ^θ2l(\theta,\hat{\theta})=\|\hat{\theta}-\theta\|^{2}. As shown in Stein (1974), the generalized Bayes estimator θ^π\hat{\theta}^{\pi} with respect to a prior π(θ)\pi(\theta) is expressed as

θ^π(y)=y+1Nylogmπ(y),\displaystyle\hat{\theta}^{\pi}(y)=y+\frac{1}{N}\nabla_{y}\log m_{\pi}(y),

where

mπ(y)=p(yθ)π(θ)dθ.\displaystyle m_{\pi}(y)=\int p(y\mid\theta)\pi(\theta){\rm d}\theta.
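The representation above can be checked numerically: approximate m_π by Monte Carlo, differentiate log m_π by finite differences (with common random numbers), and compare with a case where the generalized Bayes estimator is known in closed form. The sketch below uses an illustrative conjugate N(0, τ²I_d) prior, for which θ̂^π(y) = τ²/(τ²+1/N)·y; the dimensions and sample size are arbitrary.

```python
# Numerical check of theta_hat(y) = y + (1/N) * grad log m_pi(y)
# with an illustrative conjugate Gaussian prior.
import numpy as np

rng = np.random.default_rng(0)
d, N, tau2 = 5, 4, 2.0
y = rng.normal(size=d)
Z = rng.normal(size=(200000, d)) / np.sqrt(N)      # common random numbers, Z ~ N(0, N^{-1} I)

def log_m(yy):
    # Monte Carlo estimate of log m_pi(yy) = log E[pi(yy + Z)], pi = N(0, tau2 I) density
    theta = yy + Z
    log_pi = -0.5 * np.sum(theta ** 2, axis=1) / tau2 - 0.5 * d * np.log(2 * np.pi * tau2)
    return np.logaddexp.reduce(log_pi) - np.log(len(Z))

h = 1e-3
grad = np.array([(log_m(y + h * e) - log_m(y - h * e)) / (2 * h) for e in np.eye(d)])
print(y + grad / N)                                # Tweedie-type representation
print(tau2 / (tau2 + 1 / N) * y)                   # closed-form posterior mean
```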

The asymptotic difference between the quadratic risk of two generalized Bayes estimators as NN\to\infty is given as follows.

Lemma A.1.

As NN\to\infty, the difference between the quadratic risk of θ^π1\hat{\theta}^{\pi_{1}} and θ^π1π2\hat{\theta}^{\pi_{1}\pi_{2}} is expanded as

Eθ[θ^π1π2θ2]Eθ[θ^π1θ2]\displaystyle{\rm E}_{\theta}[\|\hat{\theta}^{\pi_{1}\pi_{2}}-\theta\|^{2}]-{\rm E}_{\theta}[\|\hat{\theta}^{\pi_{1}}-\theta\|^{2}]
=\displaystyle= 1N2(2logπ1(θ)logπ2(θ)+logπ2(θ)2+2Δlogπ2(θ))+o(N2).\displaystyle\frac{1}{N^{2}}(2\nabla\log\pi_{1}(\theta)^{\top}\nabla\log\pi_{2}(\theta)+\|\nabla\log\pi_{2}(\theta)\|^{2}+2\Delta\log\pi_{2}(\theta))+o(N^{-2}). (26)
Proof.

By using Stein’s lemma (Fourdrinier et al., 2018) and mπ(y)=π(y)+o(1)m_{\pi}(y)=\pi(y)+o(1) as NN\to\infty, the quadratic risk of the generalized Bayes estimator θ^π\hat{\theta}^{\pi} is calculated as

Eθ[θ^π(y)θ2]\displaystyle{\rm E}_{\theta}[\|\hat{\theta}^{\pi}(y)-\theta\|^{2}]
=\displaystyle= Eθ[yθ2]+2NEθ[(yθ)logmπ(y)]+1N2Eθ[logmπ(y)2]\displaystyle{\rm E}_{\theta}[\|y-\theta\|^{2}]+\frac{2}{N}{\rm E}_{\theta}[(y-\theta)^{\top}\nabla\log m_{\pi}(y)]+\frac{1}{N^{2}}{\rm E}_{\theta}[\|\nabla\log m_{\pi}(y)\|^{2}]
=\displaystyle= dN+1N2Eθ[logmπ(y)2+2Δlogmπ(y)]\displaystyle\frac{d}{N}+\frac{1}{N^{2}}{\rm E}_{\theta}[\|\nabla\log m_{\pi}(y)\|^{2}+2\Delta\log m_{\pi}(y)]
=\displaystyle= dN+1N2(logπ(θ)2+2Δlogπ(θ))+o(N2).\displaystyle\frac{d}{N}+\frac{1}{N^{2}}\left(\|\nabla\log\pi(\theta)\|^{2}+2\Delta\log\pi(\theta)\right)+o(N^{-2}). (27)

Substituting π=π1π2\pi=\pi_{1}\pi_{2} and π=π1\pi=\pi_{1} into (27) and taking difference, we obtain (26). ∎

We extend the above formula to matrices by using the matrix derivative notations from Matsuda and Strawderman (2022). For a function f:n×pf:\mathbb{R}^{n\times p}\to\mathbb{R}, its matrix gradient ~f:n×pn×p\widetilde{\nabla}f:\mathbb{R}^{n\times p}\to\mathbb{R}^{n\times p} is defined as

(~f(X))ai=Xaif(X).\displaystyle(\widetilde{\nabla}f(X))_{ai}=\frac{\partial}{\partial X_{ai}}f(X). (28)

For a C2C^{2} function f:n×pf:\mathbb{R}^{n\times p}\to\mathbb{R}, its matrix Laplacian Δ~f:n×pp×p\widetilde{\Delta}f:\mathbb{R}^{n\times p}\to\mathbb{R}^{p\times p} is defined as

(Δ~f(X))ij=a=1n2XaiXajf(X).\displaystyle(\widetilde{\Delta}f(X))_{ij}=\sum_{a=1}^{n}\frac{\partial^{2}}{\partial X_{ai}\partial X_{aj}}f(X). (29)

Then, the above formulas can be straightforwardly extended to matrix-variate normal distributions as follows.

Lemma A.2.

As NN\to\infty, the difference between the Frobenius risk of M^π1\hat{M}^{\pi_{1}} and M^π1π2\hat{M}^{\pi_{1}\pi_{2}} is expanded as

EM[M^π1π2MF2]EM[M^π1MF2]\displaystyle{\rm E}_{M}[\|\hat{M}^{\pi_{1}\pi_{2}}-M\|_{\mathrm{F}}^{2}]-{\rm E}_{M}[\|\hat{M}^{\pi_{1}}-M\|_{\mathrm{F}}^{2}]
=\displaystyle= 1N2tr(2~logπ1(M)~logπ2(M)+~logπ2(M)~logπ2(M)+2Δ~logπ2(M))\displaystyle\frac{1}{N^{2}}{\rm tr}(2\widetilde{\nabla}\log\pi_{1}(M)^{\top}\widetilde{\nabla}\log\pi_{2}(M)+\widetilde{\nabla}\log\pi_{2}(M)^{\top}\widetilde{\nabla}\log\pi_{2}(M)+2\widetilde{\Delta}\log\pi_{2}(M))
+o(N2).\displaystyle\quad+o(N^{-2}).
Lemma A.3.

As NN\to\infty, the difference between the matrix quadratic risk of M^π1\hat{M}^{\pi_{1}} and M^π1π2\hat{M}^{\pi_{1}\pi_{2}} is expanded as

EM[(M^π1π2M)(M^π1π2M)(M^π1M)(M^π1M)]\displaystyle{\rm E}_{M}[(\hat{M}^{\pi_{1}\pi_{2}}-M)^{\top}(\hat{M}^{\pi_{1}\pi_{2}}-M)-(\hat{M}^{\pi_{1}}-M)^{\top}(\hat{M}^{\pi_{1}}-M)]
=\displaystyle= 1N2(2~logπ1(M)~logπ2(M)+~logπ2(M)~logπ2(M)+2Δ~logπ2(M))\displaystyle\frac{1}{N^{2}}(2\widetilde{\nabla}\log\pi_{1}(M)^{\top}\widetilde{\nabla}\log\pi_{2}(M)+\widetilde{\nabla}\log\pi_{2}(M)^{\top}\widetilde{\nabla}\log\pi_{2}(M)+2\widetilde{\Delta}\log\pi_{2}(M))
+o(N2).\displaystyle\quad+o(N^{-2}).

Komaki (2006) derived the asymptotic expansion of the Kullback–Leibler risk of Bayesian predictive densities. For the normal model as discussed in Section 5, the result shows that Stein’s prior dominates the Jeffreys prior in O(N1)O(N^{-1}) term at the origin and O(N2)O(N^{-2}) term at other points, which is reminiscent of superefficiency theory. A similar phenomenon should exist in estimation as well. Unlike Stein’s prior, the priors for a normal mean matrix such as πSVS\pi_{\mathrm{SVS}} diverge at many points such as low-rank matrices. It is an interesting future problem to investigate the asymptotic risk of such priors in detail.

Appendix B Laplacian of πMSVS1\pi_{{\rm MSVS1}} and πMSVS2\pi_{{\rm MSVS2}}

Lemma B.1.

(Stein, 1974; Matsuda and Strawderman, 2019) Suppose that f:n×pf:\mathbb{R}^{n\times p}\to\mathbb{R} is represented as f(X)=f~(σ)f(X)=\widetilde{f}(\sigma), where npn\geq p and σ=(σ1(X),,σp(X))\sigma=(\sigma_{1}(X),\ldots,\sigma_{p}(X)) denotes the singular values of XX. If ff is twice weakly differentiable, then its Laplacian is

Δf\displaystyle\Delta f =a=1ni=1p2fXai2=2i<jσif~/σiσjf~/σjσi2σj2+(np)i=1p1σif~σi+i=1p2f~σi2.\displaystyle=\sum_{a=1}^{n}\sum_{i=1}^{p}\frac{\partial^{2}f}{\partial X_{ai}^{2}}=2\sum_{i<j}\frac{\sigma_{i}{\partial\widetilde{f}}/{\partial\sigma_{i}}-\sigma_{j}{\partial\widetilde{f}}/{\partial\sigma_{j}}}{\sigma_{i}^{2}-\sigma_{j}^{2}}+(n-p)\sum_{i=1}^{p}\frac{1}{\sigma_{i}}\frac{\partial\widetilde{f}}{\partial\sigma_{i}}+\sum_{i=1}^{p}\frac{\partial^{2}\widetilde{f}}{\partial\sigma_{i}^{2}}.
Proposition B.1.

The Laplacian of πMSVS1\pi_{{\rm MSVS1}} in (7) is given by

ΔπMSVS1(M)\displaystyle\Delta\pi_{{\rm MSVS1}}(M) =γ(γ+np2p22p+2)MF2πMSVS1(M).\displaystyle=\gamma(\gamma+np-2p^{2}-2p+2)\|M\|_{{\rm F}}^{-2}\pi_{{\rm MSVS1}}(M).
Proof.

Let

f~(σ)=(i=1pσi(np1))(i=1pσi2)γ/2\displaystyle\widetilde{f}(\sigma)=\left(\prod_{i=1}^{p}\sigma_{i}^{-(n-p-1)}\right)\left(\sum_{i=1}^{p}\sigma_{i}^{2}\right)^{-\gamma/2}

so that πMSVS1(M)=f~(σ)\pi_{{\rm MSVS1}}(M)=\widetilde{f}(\sigma) with σ=(σ1(M),,σp(M))\sigma=(\sigma_{1}(M),\dots,\sigma_{p}(M)). From Lemma B.1 and

f~σi=(np1)σi1f~γσi(j=1pσj2)1f~,\displaystyle\frac{\partial\widetilde{f}}{\partial\sigma_{i}}=-(n-p-1)\sigma_{i}^{-1}\widetilde{f}-\gamma\sigma_{i}\left(\sum_{j=1}^{p}\sigma_{j}^{2}\right)^{-1}\widetilde{f},
2f~σi2=(np)(np1)σi2f~+(2n2p3)γ(j=1pσj2)1f~+γ(γ+2)σi2(j=1pσj2)2f~,\displaystyle\frac{\partial^{2}\widetilde{f}}{\partial\sigma_{i}^{2}}=(n-p)(n-p-1)\sigma_{i}^{-2}\widetilde{f}+(2n-2p-3)\gamma\left(\sum_{j=1}^{p}\sigma_{j}^{2}\right)^{-1}\widetilde{f}+\gamma(\gamma+2)\sigma_{i}^{2}\left(\sum_{j=1}^{p}\sigma_{j}^{2}\right)^{-2}\widetilde{f},

we have

ΔπMSVS1(M)\displaystyle\Delta\pi_{{\rm MSVS1}}(M) =2i<jσif~/σiσjf~/σjσi2σj2+(np)i=1p1σif~σi+i=1p2f~σi2\displaystyle=2\sum_{i<j}\frac{\sigma_{i}{\partial\widetilde{f}}/{\partial\sigma_{i}}-\sigma_{j}{\partial\widetilde{f}}/{\partial\sigma_{j}}}{\sigma_{i}^{2}-\sigma_{j}^{2}}+(n-p)\sum_{i=1}^{p}\frac{1}{\sigma_{i}}\frac{\partial\widetilde{f}}{\partial\sigma_{i}}+\sum_{i=1}^{p}\frac{\partial^{2}\widetilde{f}}{\partial\sigma_{i}^{2}}
=2p(p1)2γ(i=1pσi2)1f~(np)(np1)(i=1pσi2)f~\displaystyle=-2\cdot\frac{p(p-1)}{2}\gamma\left(\sum_{i=1}^{p}\sigma_{i}^{2}\right)^{-1}\widetilde{f}-(n-p)(n-p-1)\left(\sum_{i=1}^{p}\sigma_{i}^{-2}\right)\widetilde{f}
p(np)γ(i=1pσi2)1f~+(np)(np1)(i=1pσi2)f~\displaystyle\quad-p(n-p)\gamma\left(\sum_{i=1}^{p}\sigma_{i}^{2}\right)^{-1}\widetilde{f}+(n-p)(n-p-1)\left(\sum_{i=1}^{p}\sigma_{i}^{-2}\right)\widetilde{f}
+γ(p(2n2p3)+γ+2)(i=1pσi2)1f~\displaystyle\quad+\gamma(p(2n-2p-3)+\gamma+2)\left(\sum_{i=1}^{p}\sigma_{i}^{2}\right)^{-1}\widetilde{f}
=γ(γ+np2p22p+2)(i=1pσi2)1f~.\displaystyle=\gamma(\gamma+np-2p^{2}-2p+2)\left(\sum_{i=1}^{p}\sigma_{i}^{2}\right)^{-1}\widetilde{f}.
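The closed form in Proposition B.1 can be checked numerically in the same spirit. The sketch below (illustrative only; n, p, γ and the seed are arbitrary) compares a finite-difference Laplacian of πMSVS1, written as in the proof above, with the right-hand side of the proposition.

```python
import numpy as np

def pi_msvs1(M, gamma):
    """pi_MSVS1(M) = (prod_i sigma_i^{-(n-p-1)}) * (sum_i sigma_i^2)^{-gamma/2}."""
    n, p = M.shape
    s = np.linalg.svd(M, compute_uv=False)
    return np.prod(s ** (-(n - p - 1))) * np.sum(s**2) ** (-gamma / 2)

def laplacian_fd(f, M, h=1e-4):
    """Entrywise central second differences, summed over all entries of M."""
    lap = 0.0
    for a in range(M.shape[0]):
        for i in range(M.shape[1]):
            E = np.zeros_like(M)
            E[a, i] = h
            lap += (f(M + E) - 2 * f(M) + f(M - E)) / h**2
    return lap

n, p, gamma = 6, 3, 2.0
rng = np.random.default_rng(1)
M = rng.standard_normal((n, p))
lhs = laplacian_fd(lambda X: pi_msvs1(X, gamma), M)
rhs = gamma * (gamma + n * p - 2 * p**2 - 2 * p + 2) * pi_msvs1(M, gamma) / np.sum(M**2)
print(lhs, rhs)  # should agree up to discretization error
```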

Proposition B.2.

The Laplacian of πMSVS2\pi_{{\rm MSVS2}} in (15) is given by

ΔπMSVS2(M)\displaystyle\Delta\pi_{{\rm MSVS2}}(M) =γ(γ+n2p)(i=1pMi2)πMSVS2(M).\displaystyle=\gamma(\gamma+n-2p)\left(\sum_{i=1}^{p}\|M_{\cdot i}\|^{-2}\right)\pi_{{\rm MSVS2}}(M).
Proof.

From

Δlogf=Δfflogf2\displaystyle\Delta\log f=\frac{\Delta f}{f}-\|\nabla\log f\|^{2}

and (20), (21) and (22),

ΔπMSVS2(M)πMSVS2(M)\displaystyle\frac{\Delta\pi_{{\rm MSVS2}}(M)}{\pi_{{\rm MSVS2}}(M)} =ΔlogπMSVS2(M)+logπMSVS2(M)2\displaystyle=\Delta\log\pi_{{\rm MSVS2}}(M)+\|\nabla\log\pi_{{\rm MSVS2}}(M)\|^{2}
=ΔlogπSVS(M)+ΔlogπCS(M)+logπSVS(M)+logπCS(M)2\displaystyle=\Delta\log\pi_{{\rm SVS}}(M)+\Delta\log\pi_{{\rm CS}}(M)+\|\nabla\log\pi_{{\rm SVS}}(M)+\nabla\log\pi_{{\rm CS}}(M)\|^{2}
=ΔπSVS(M)πSVS(M)+((n2)γ+2(np1)γ+γ2)i=1pMi2\displaystyle=\frac{\Delta\pi_{{\rm SVS}}(M)}{\pi_{{\rm SVS}}(M)}+(-(n-2)\gamma+2(n-p-1)\gamma+\gamma^{2})\sum_{i=1}^{p}\|M_{\cdot i}\|^{-2}
=γ(γ+n2p)i=1pMi2,\displaystyle=\gamma(\gamma+n-2p)\sum_{i=1}^{p}\|M_{\cdot i}\|^{-2},

where we used ΔπSVS(M)=0\Delta\pi_{{\rm SVS}}(M)=0 (Theorem 2 of Matsuda and Komaki, 2015). ∎
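Proposition B.2 can be checked the same way. In the sketch below we take πMSVS2 in (15) to be πSVS(M) = ∏_i σ_i(M)^{-(n-p-1)} multiplied by the column-wise factor ∏_i ||M_{·i}||^{-γ}, matching the factorization πSVS·πCS used in the proof; the dimensions, γ, and the seed are arbitrary.

```python
import numpy as np

def pi_msvs2(M, gamma):
    n, p = M.shape
    s = np.linalg.svd(M, compute_uv=False)
    pi_svs = np.prod(s ** (-(n - p - 1)))                     # singular value shrinkage factor
    pi_cs = np.prod(np.linalg.norm(M, axis=0) ** (-gamma))    # column-wise shrinkage factor
    return pi_svs * pi_cs

def laplacian_fd(f, M, h=1e-4):
    lap = 0.0
    for a in range(M.shape[0]):
        for i in range(M.shape[1]):
            E = np.zeros_like(M)
            E[a, i] = h
            lap += (f(M + E) - 2 * f(M) + f(M - E)) / h**2
    return lap

n, p, gamma = 7, 3, 1.5
rng = np.random.default_rng(3)
M = rng.standard_normal((n, p))
lhs = laplacian_fd(lambda X: pi_msvs2(X, gamma), M)
rhs = gamma * (gamma + n - 2 * p) * np.sum(np.linalg.norm(M, axis=0) ** (-2)) * pi_msvs2(M, gamma)
print(lhs, rhs)  # should agree up to discretization error
```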

Appendix C Improving on the block-wise Stein prior

Here, we develop priors that asymptotically dominate the block-wise Stein prior in estimation and prediction. Suppose that we observe YNd(θ,N1Id)Y\sim{\rm N}_{d}(\theta,N^{-1}I_{d}) and estimate θ\theta or predict Y~Nd(θ,Id)\widetilde{Y}\sim{\rm N}_{d}(\theta,I_{d}). We assume that the dd-dimensional mean vector θ\theta is split into BB disjoint blocks θ(1),,θ(B)\theta^{(1)},\dots,\theta^{(B)} with sizes d1,,dBd_{1},\dots,d_{B}, where d1++dB=dd_{1}+\dots+d_{B}=d. For example, such a situation appears in balanced ANOVA and wavelet regression (Brown and Zhao, 2009). Then, the block-wise Stein prior is defined as

πBS(θ)=b=1Bθ(b)Rb,Rb=(db2)+,\displaystyle\pi_{{\rm BS}}(\theta)=\prod_{b=1}^{B}\|\theta^{(b)}\|^{R_{b}},\quad R_{b}=-(d_{b}-2)_{+}, (30)

which puts Stein’s prior on each block. Since it is superharmonic, the generalized Bayes estimator θ^πBS\hat{\theta}^{\pi_{{\rm BS}}} with respect to πBS\pi_{{\rm BS}} is minimax. However, Brown and Zhao (2009) showed that θ^πBS\hat{\theta}^{\pi_{{\rm BS}}} is inadmissible and dominated by an estimator with additional James–Stein type shrinkage defined by

θ^(y)=θ^πBS(y)R#+d2y2y,\displaystyle\hat{\theta}(y)=\hat{\theta}^{\pi_{{\rm BS}}}(y)-\frac{R_{\#}+d-2}{\|y\|^{2}}y,

where R#=bRb>2dR_{\#}=\sum_{b}R_{b}>2-d. From this result, Brown and Zhao (2009) conjectured in their Remark 3.2 that the block-wise Stein prior can be improved by multiplying it by a Stein-type shrinkage prior. Following their conjecture, we construct the following class of priors by adding scalar shrinkage to the block-wise Stein prior:

πMBS(θ)=πBS(θ)θγ,\displaystyle\pi_{{\rm MBS}}(\theta)=\pi_{{\rm BS}}(\theta)\|\theta\|^{-\gamma}, (31)

where γ0\gamma\geq 0. Let

mMBS(y)=p(yθ)πMBS(θ)dθ.\displaystyle m_{{\rm MBS}}(y)=\int p(y\mid\theta)\pi_{{\rm MBS}}(\theta){\rm d}\theta.
Lemma C.1.

If 0γ<B(Rb+db)0\leq\gamma<B(R_{b}+d_{b}) for every bb, then mMBS(y)<m_{{\rm MBS}}(y)<\infty for every yy.

Proof.

Since mMBS(y)m_{{\rm MBS}}(y) is the expectation of πMBS(θ)\pi_{{\rm MBS}}(\theta) under θNd(y,N1Id)\theta\sim{\rm N}_{d}(y,N^{-1}I_{d}), it suffices to show that πMBS(θ)\pi_{{\rm MBS}}(\theta) is locally integrable at every θ\theta.

First, consider θ0\theta\neq 0. Since mBS(y)<m_{{\rm BS}}(y)<\infty for every yy (Brown and Zhao, 2009), πBS(θ)\pi_{{\rm BS}}(\theta) is locally integrable at θ\theta. Also, θ>c\|\theta\|>c for some c>0c>0 in a neighborhood of θ\theta. Thus, πMBS(θ)=πBS(θ)θγ\pi_{{\rm MBS}}(\theta)=\pi_{{\rm BS}}(\theta)\|\theta\|^{-\gamma} is locally integrable at θ\theta.

Next, consider θ=0\theta=0 and take the neighborhood A={θθ(1)s,,θ(B)s}A=\{\theta\mid\|\theta^{(1)}\|\leq s,\dots,\|\theta^{(B)}\|\leq s\} for s>0s>0. From the AM-GM inequality,

θ2=bθ(b)2B(bθ(b)2)1/B=Bbθ(b)2/B.\displaystyle\|\theta\|^{2}=\sum_{b}\|\theta^{(b)}\|^{2}\geq B\left(\prod_{b}\|\theta^{(b)}\|^{2}\right)^{1/B}=B\prod_{b}\|\theta^{(b)}\|^{2/B}.

Thus,

AπMBS(θ)dθ\displaystyle\int_{A}\pi_{{\rm MBS}}(\theta){\rm d}\theta\leq\ C0s0sbrbRb+db1γ/Bdr1drB,\displaystyle C\int_{0}^{s}\cdots\int_{0}^{s}\prod_{b}r_{b}^{R_{b}+d_{b}-1-\gamma/B}{\rm d}r_{1}\cdots{\rm d}r_{B},

where rb=θ(b)r_{b}=\|\theta^{(b)}\| and CC is a constant. Therefore, πMBS(θ)\pi_{{\rm MBS}}(\theta) is locally integrable at θ=0\theta=0 if Rb+db1γ/B>1R_{b}+d_{b}-1-\gamma/B>-1 for every bb, which is equivalent to γ<B(Rb+db)\gamma<B(R_{b}+d_{b}) for every bb. ∎

From Lemma C.1, the generalized Bayes estimator with respect to πMBS\pi_{{\rm MBS}} is well-defined when 0γ<B(Rb+db)0\leq\gamma<B(R_{b}+d_{b}) for every bb. We denote it by θ^MBS\hat{\theta}_{{\rm MBS}}, and write θ^BS\hat{\theta}_{{\rm BS}} for the generalized Bayes estimator with respect to πBS\pi_{{\rm BS}}.
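For concreteness, here is a small illustrative helper (ours, with arbitrary block sizes) that evaluates πMBS in (31) for a given block structure and reports the range of γ for which Lemma C.1 guarantees a well-defined generalized Bayes estimator.

```python
import numpy as np

def pi_mbs(theta, sizes, gamma):
    """Evaluate pi_MBS in (31) for theta split into consecutive blocks of the given sizes."""
    blocks = np.split(np.asarray(theta, dtype=float), np.cumsum(sizes)[:-1])
    R = [-max(len(b) - 2, 0) for b in blocks]                 # R_b = -(d_b - 2)_+
    pi_bs = np.prod([np.linalg.norm(b) ** r for b, r in zip(blocks, R)])
    return pi_bs * np.linalg.norm(theta) ** (-gamma)          # extra scalar shrinkage

sizes = [3, 4, 5]
B = len(sizes)
R = [-max(db - 2, 0) for db in sizes]
gamma_max = min(B * (r + db) for r, db in zip(R, sizes))      # Lemma C.1: gamma < B(R_b + d_b)
print("range guaranteed by Lemma C.1: 0 <= gamma <", gamma_max)  # here B(R_b + d_b) = 3 * 2 = 6

theta = np.arange(1.0, 13.0)    # any point with all blocks nonzero
print(pi_mbs(theta, sizes, gamma=2.0))
```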

Theorem C.1.

For every θ\theta,

N2(Eθ[θ^MBSθ2]Eθ[θ^BSθ2])γ(γ2(R#+d2))θ2\displaystyle N^{2}({\rm E}_{\theta}[\|\hat{\theta}_{{{\rm MBS}}}-\theta\|^{2}]-{\rm E}_{\theta}[\|\hat{\theta}_{{{\rm BS}}}-\theta\|^{2}])\to\gamma(\gamma-2(R_{\#}+d-2))\|\theta\|^{-2} (32)

as NN\to\infty. Therefore, if 0<γ<2(R#+d2)0<\gamma<2(R_{\#}+d-2), then the generalized Bayes estimator with respect to πMBS\pi_{{\rm MBS}} in (31) asymptotically dominates that with respect to πBS\pi_{{\rm BS}} in (30) under the quadratic loss.

Proof.

Let πS(θ)=θγ\pi_{{\rm S}}(\theta)=\|\theta\|^{-\gamma}. By straightforward calculation, we obtain

logπBS(θ)logπS(θ)=γR#θ2,\displaystyle\nabla\log\pi_{{\rm BS}}(\theta)^{\top}\nabla\log\pi_{{\rm S}}(\theta)=-\gamma R_{\#}\|\theta\|^{-2},
logπS(θ)logπS(θ)=γ2θ2,\displaystyle\nabla\log\pi_{{\rm S}}(\theta)^{\top}\nabla\log\pi_{{\rm S}}(\theta)=\gamma^{2}\|\theta\|^{-2},
ΔlogπS(θ)=γ(d2)θ2.\displaystyle\Delta\log\pi_{{\rm S}}(\theta)=-\gamma(d-2)\|\theta\|^{-2}.

Therefore, from Lemma A.1,

Eθ[θ^MBSθ2]Eθ[θ^BSθ2]\displaystyle{\rm E}_{\theta}[\|\hat{\theta}_{{{\rm MBS}}}-\theta\|^{2}]-{\rm E}_{\theta}[\|\hat{\theta}_{{{\rm BS}}}-\theta\|^{2}]
=\displaystyle= 1N2(2logπBS(θ)logπS(θ)+logπS(θ)2+2ΔlogπS(θ))+o(N2)\displaystyle\frac{1}{N^{2}}\left(2\nabla\log\pi_{{\rm BS}}(\theta)^{\top}\nabla\log\pi_{{\rm S}}(\theta)+\|\nabla\log\pi_{{\rm S}}(\theta)\|^{2}+2\Delta\log\pi_{{\rm S}}(\theta)\right)+o(N^{-2})
=\displaystyle= 1N2γ(γ2(R#+d2))θ2+o(N2).\displaystyle\frac{1}{N^{2}}\gamma(\gamma-2(R_{\#}+d-2))\|\theta\|^{-2}+o(N^{-2}).

Hence, we obtain (32). ∎
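The three identities displayed in the proof are elementary but easy to get wrong. The following sketch (ours, with arbitrary block sizes, γ, and seed) verifies them by finite differences.

```python
import numpy as np

def log_pi_bs(theta, blocks, R):
    # log pi_BS for blocks given as index arrays with exponents R_b
    return sum(R[b] * np.log(np.linalg.norm(theta[ix])) for b, ix in enumerate(blocks))

def log_pi_s(theta, gamma):
    return -gamma * np.log(np.linalg.norm(theta))

def grad_fd(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def lap_fd(f, x, h=1e-4):
    return sum((f(x + h * e) - 2 * f(x) + f(x - h * e)) / h**2 for e in np.eye(len(x)))

sizes = [3, 4, 5]                                   # all d_b >= 2, so R_b = -(d_b - 2)
blocks = np.split(np.arange(sum(sizes)), np.cumsum(sizes)[:-1])
R = [-(db - 2) for db in sizes]
gamma, d, R_sharp = 2.5, sum(sizes), sum(R)

rng = np.random.default_rng(2)
theta = rng.standard_normal(d)
sq = theta @ theta                                  # ||theta||^2

gB = grad_fd(lambda t: log_pi_bs(t, blocks, R), theta)
gS = grad_fd(lambda t: log_pi_s(t, gamma), theta)
print(gB @ gS, -gamma * R_sharp / sq)               # first displayed identity
print(gS @ gS, gamma**2 / sq)                       # second displayed identity
print(lap_fd(lambda t: log_pi_s(t, gamma), theta),
      -gamma * (d - 2) / sq)                        # third displayed identity
```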

From (32), the choice γ=R#+d2\gamma=R_{\#}+d-2 is asymptotically optimal. As in Section 5, Theorem C.1 extends to Bayesian prediction as follows.

Theorem C.2.

For every θ\theta,

N2(Eθ[D(p(θ),p^MBS(y))]Eθ[D(p(θ),p^BS(y))])γ(γ2(R#+d2))2θ2\displaystyle N^{2}({\rm E}_{\theta}[D(p(\cdot\mid\theta),\hat{p}_{{\rm MBS}}(\cdot\mid y))]-{\rm E}_{\theta}[D(p(\cdot\mid\theta),\hat{p}_{{\rm BS}}(\cdot\mid y))])\to\frac{\gamma(\gamma-2(R_{\#}+d-2))}{2}\|\theta\|^{-2}

as NN\to\infty. Therefore, if 0<γ<2(R#+d2)0<\gamma<2(R_{\#}+d-2), then the Bayesian predictive density with respect to πMBS\pi_{{\rm MBS}} in (31) asymptotically dominates that with respect to πBS\pi_{{\rm BS}} in (30) under the Kullback–Leibler loss.
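To close, a small illustration (ours, with arbitrary block sizes) of the asymptotic coefficients in Theorems C.1 and C.2: evaluating γ(γ - 2(R_# + d - 2)) on a grid shows that both the estimation and prediction coefficients are negative exactly for 0 < γ < 2(R_# + d - 2) and are minimized at γ = R_# + d - 2, the prediction coefficient being half of the estimation one.

```python
import numpy as np

sizes = [3, 4, 5]
d = sum(sizes)
R_sharp = sum(-(db - 2) for db in sizes)   # R_b = -(d_b - 2)_+ with all d_b >= 2
c = R_sharp + d - 2                        # here c = 4

gammas = np.linspace(0.0, 2 * c, 9)
est = gammas * (gammas - 2 * c)            # estimation coefficient, Theorem C.1
pred = est / 2                             # prediction coefficient, Theorem C.2
for g, e, q in zip(gammas, est, pred):
    print(f"gamma = {g:4.1f}: estimation {e:7.2f}, prediction {q:7.2f}")
# both coefficients are negative exactly for 0 < gamma < 2c and smallest at gamma = c
```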