
Nonparametric Mean and Variance Adaptive Classification Rule for High-Dimensional Data with Heteroscedastic Variances

Seungyeon Oh and Hoyoung Park
Department of Statistics, Sookmyung Women’s University, Seoul, Korea
Abstract

In this study, we introduce an innovative methodology aimed at enhancing Fisher’s Linear Discriminant Analysis (LDA) in the context of high-dimensional data classification scenarios, specifically addressing situations where each feature exhibits distinct variances. Our approach leverages Nonparametric Maximum Likelihood Estimation (NPMLE) techniques to estimate both the mean and variance parameters. By accommodating varying variances among features, our proposed method leads to notable improvements in classification performance. In particular, unlike numerous prior studies that assume the distribution of heterogeneous variances follows a right-skewed inverse gamma distribution, our proposed method demonstrates excellent performance even when the distribution of heterogeneous variances takes on left-skewed, symmetric, or right-skewed forms. We conducted a series of rigorous experiments to empirically validate the effectiveness of our approach. The results of these experiments demonstrate that our proposed methodology excels in accurately classifying high-dimensional data characterized by heterogeneous variances.


Keywords: Bayes rule, Empirical Bayes, Heteroscedastic Variances, Kiefer-Wolfowitz estimator, Linear discriminant analysis, Nonparametric maximum likelihood estimation

1 Introduction

Recently, researchers have made numerous attempts to obtain new insights by classifying high-dimensional data into two groups. In this context, Fisher’s Linear Discriminant Analysis (LDA) is a widely used and acknowledged method in various practical applications. However, the high dimensionality of the data often causes serious difficulties in utilizing this method. Specifically, the challenges commonly occur in the following setting. We consider binary classification using data consisting of $p$-dimensional features extracted from $n=n_1+n_2$ samples. Here, the feature vector and labeling variable of the $i$-th observation are denoted as $\boldsymbol{X}_i=(X_{i1},\ldots,X_{ip})^T$ and $Y_i\in\{1,2\}$, respectively.

\boldsymbol{X}_i \mid \{Y_i=1; \boldsymbol{\mu}_1, \boldsymbol{\Sigma}\} \overset{iid}{\sim} N_p(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}), \quad i=1,\ldots,n_1,   (1)
\boldsymbol{X}_i \mid \{Y_i=2; \boldsymbol{\mu}_2, \boldsymbol{\Sigma}\} \overset{iid}{\sim} N_p(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}), \quad i=n_1+1,\ldots,n_1+n_2.

Assuming condition (1) and equal prior probabilities $\pi_1=\pi_2=0.5$ for the two groups, the following classification rule, known as the Bayes rule, is available.

\delta_{OPT}(\boldsymbol{X}^{new}) = \left(\boldsymbol{X}^{new}-\frac{\boldsymbol{\mu}_1+\boldsymbol{\mu}_2}{2}\right)^T \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2) - \log\left(\frac{\pi_2}{\pi_1}\right),   (2)
\widehat{Y}^{new} = \begin{cases} 1 & \text{for } \delta_{OPT}(\boldsymbol{X}^{new}) \geq 0, \\ 2 & \text{for } \delta_{OPT}(\boldsymbol{X}^{new}) < 0, \end{cases}   (5)

where $\widehat{Y}^{new}$ corresponds to the labeling variable for the feature vector $\boldsymbol{X}^{new}$.
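For illustration, the following minimal Python sketch evaluates the Bayes rule (2) when all parameters are known; the function and variable names are ours and purely illustrative.

```python
import numpy as np

def bayes_rule(x_new, mu1, mu2, Sigma, pi1=0.5, pi2=0.5):
    """Evaluate the Bayes discriminant (2) for one observation and return a label in {1, 2}."""
    # Sigma^{-1} (mu1 - mu2) via a linear solve instead of an explicit inverse
    direction = np.linalg.solve(Sigma, mu1 - mu2)
    delta = (x_new - (mu1 + mu2) / 2) @ direction - np.log(pi2 / pi1)
    return 1 if delta >= 0 else 2
```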

It is important to substitute the unknown parameters in the optimal classification rule (2) with robust estimators, since this can potentially enhance the performance of the classifier. Unfortunately, in high-dimensional settings, a large $p$ causes inaccurate estimation of the covariance matrix. In addition, the High Dimensional Low Sample Size (HDLSS) situation, in which $p\gg n$, precludes obtaining the inverse of the estimated covariance matrix due to its singularity. These difficulties hinder efforts to employ LDA in such settings. As a strategy to circumvent these impediments, we can consider deriving a shrinkage estimator of the covariance matrix so that the singularity problem is resolved. Relevant studies include Ledoit and Wolf, (2004); Bodnar et al., (2014); Ledoit and Wolf, (2020, 2022), among others. Furthermore, recent studies by Park et al., 2022b and Kim et al., (2022) have shown that standardizing the data based on an estimate of the precision matrix, combined with Nonparametric Maximum Likelihood Estimation (NPMLE) or Nonparametric Empirical Bayes (NPEB) estimation of the mean vector, improves Linear Discriminant Analysis (LDA). However, since the computational complexity of classifiers based on these methods grows rapidly with $p$, it is infeasible to apply them when $p$ is excessively large.

To mitigate the computational burden of the above strategy, adding the Independent Rule (IR) assumption to the optimal classification rule is a feasible approach. Under the assumption that the components of the feature vector are mutually independent, the covariance matrix simplifies to $\boldsymbol{D}={\rm diag}(\boldsymbol{\Sigma})={\rm diag}(\sigma_1^2,\ldots,\sigma_p^2)$. This drastically reduces the number of parameters to be estimated and allows the estimated covariance matrix to be inverted without a rank-deficiency problem. The approach has been widely researched due to its straightforward interpretation and low computational burden, even in high-dimensional settings. For instance, naive Bayes (NB) is one of the most straightforward methods: it estimates $\boldsymbol{\mu}_k$ by the sample mean of the $k$-th group and $\sigma_j^2$ by the pooled sample variance of the $j$-th feature. Based on Stein’s unbiased risk estimate (SURE), Ji et al., (2020) estimate the mean and inhomogeneous variance parameters using parametric maximum likelihood estimation. SURE exhibits satisfactory performance only in dense cases, where the mean differences between the two groups are significant in a large number of features. This approach was further extended by adopting NPMLE in Park et al., 2022a. The NPMLE-based method is more adaptive to the structure of the mean differences as it assumes a nonparametric distribution for the mean parameters. Nevertheless, there remains room for improvement in that the method still relies on parametric maximum likelihood estimation for the variance parameters.

Aside from these methods, some approaches apply the IR assumption and remove noise features simultaneously. The nearest shrunken centroid (NSC) of Tibshirani et al., (2002) eliminates uninformative features using soft thresholding. The features annealed independence rules (FAIR) of Fan and Fan, (2008) identifies significant features through a two-sample $t$-test for each of the $p$ features; FAIR applies harder thresholding than NSC. Moreover, one can also consider directly estimating $\boldsymbol{\beta}^{Bayes}=\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)$. Witten and Tibshirani, (2011) proposed penalized linear discriminant analysis (PLDA), which estimates $\widehat{\boldsymbol{\beta}}^{PLDA}$ using $l_1$ and fused lasso penalties. However, these three methods have the disadvantage of performing well only in sparse cases, where the mean differences between the two groups are significant in a relatively small number of features.

Considering all the preceding methodologies, we aspire to a robust classifier that performs well regardless of the structure of the data, such as the sparsity of the mean differences and the distributions of the mean and variance parameters. Our proposed Mean and Variance Adaptive (MVA) linear discriminant rule fulfills all of these requirements. Since MVA employs the NPMLE to estimate both mean and variance parameters, it is subject to fewer constraints on the structure of the data than the existing methods. In particular, for SURE, the strong assumption that the mean parameters follow a normal distribution can limit its effectiveness in specific cases. The semi-parametric method proposed by Park et al., 2022a compensates for this weakness by employing the NPMLE for the mean parameters; however, its performance can still be suboptimal in certain scenarios due to the underlying assumption about the distribution of the variance parameters. In contrast to SURE and the NPMLE-based method of Park et al., 2022a, which make specific assumptions about the distribution of the mean or variance parameters, we introduce a more versatile model that does not confine itself to specific distributions when estimating the mean and variance parameters. The flexibility inherent in our proposed approach is anticipated to enhance accuracy, particularly in the context of high-dimensional binary classification. The rest of the paper is organized as follows. In the subsequent section, we provide a comprehensive explanation of the simultaneous nonparametric maximum likelihood estimation of the mean differences and variances, which underlies our MVA, and introduce a series of algorithms to approximate the Bayes rule. Section 3 describes the results of simulation studies conducted on data with various structures. Section 4 compares our MVA with classifiers based on several prior studies using gene expression datasets. Lastly, we conclude the paper in Section 5 with a concise summary and an outline of future work.

2 Methodology

2.1 A Simultaneous Estimation Method for Mean Differences and Variances

In this section, we describe the high-dimensional parameter estimation method used in the classification rule proposed in this study. Our proposed classification rule applies the IR assumption so that it can alleviate the computational complexity caused by the high dimensionality. Given the data $\{(Y_i,\boldsymbol{X}_i)\}_{i=1}^n$ satisfying condition (1), the following $\delta_I(\textbf{X})$ is the optimal classification rule incorporating the IR assumption.

\delta_I(\textbf{X}) = \left(\textbf{X}-\frac{\boldsymbol{\mu}_1+\boldsymbol{\mu}_2}{2}\right)^T \boldsymbol{D}^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2) \coloneqq \sum_{j=1}^p a_j X_j + a_0,   (6)

where $\boldsymbol{D}={\rm diag}(\sigma_1^2,\ldots,\sigma_p^2)$. Then, the components to be estimated are represented as follows:

a_j = \frac{\mu_j}{\sigma_j^2} \quad\text{for}\quad \mu_j = \mu_{1j}-\mu_{2j}, \qquad a_0 = -2^{-1}\sum_{j=1}^p a_j(\mu_{1j}+\mu_{2j}).

Considering the necessity of substituting each $a_j$ with a robust estimator, our primary objective lies in effectively estimating the mean differences $\mu_j=\mu_{1j}-\mu_{2j}$ and variances $\sigma_j^2$, $1\leq j\leq p$, included in (6) from the observed data. Let the observed data be $\{(Y_i,\boldsymbol{X}_i)\}_{i=1}^n$, where $\boldsymbol{X}_i=(X_{i1},\ldots,X_{ip})^T$, $Y_i=1$ for $i\in\mathcal{C}_1=\{1,\ldots,n_1\}$, and $Y_i=2$ for $i\in\mathcal{C}_2=\{n_1+1,\ldots,n_1+n_2=n\}$, with $\mathcal{C}_k$ denoting the index set of the observations belonging to the $k$-th group. Assuming the normality of $\boldsymbol{X}_i$, then

X_{ij} \mid \{Y_i=k; \mu_{kj}, \sigma_j^2\} \sim N(\mu_{kj}, \sigma_j^2), \quad i\in\mathcal{C}_k, \quad k=1,2.   (7)

To estimate the unknown components of (6), we define $X_j$ as the sample mean difference between the two groups and $V_j$ as the pooled sample variance, as in (8).

X_j = X_j^{(1)} - X_j^{(2)}, \quad\text{where } X_j^{(k)} = n_k^{-1}\sum_{i\in\mathcal{C}_k} X_{ij}, \quad k=1,2,
V_j = \frac{1}{n_1+n_2-2}\left\{\sum_{i\in\mathcal{C}_1}\left(X_{ij}-X_j^{(1)}\right)^2 + \sum_{i\in\mathcal{C}_2}\left(X_{ij}-X_j^{(2)}\right)^2\right\}.   (8)
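The statistics in (8) reduce to simple per-feature computations. A minimal Python sketch (with illustrative names) is given below.

```python
import numpy as np

def summary_statistics(X, y):
    """Per-feature mean differences X_j and pooled variances V_j as in (8).

    X is an (n, p) data matrix and y a length-n label vector with values in {1, 2}.
    """
    X1, X2 = X[y == 1], X[y == 2]
    n1, n2 = X1.shape[0], X2.shape[0]
    mean_diff = X1.mean(axis=0) - X2.mean(axis=0)            # X_j = X_j^(1) - X_j^(2)
    ss1 = ((X1 - X1.mean(axis=0)) ** 2).sum(axis=0)
    ss2 = ((X2 - X2.mean(axis=0)) ** 2).sum(axis=0)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)                 # V_j
    return mean_diff, pooled_var
```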

More precisely, we adopt the following Bayesian hierarchical model for the observed $p$ independent pairs $(X_j,V_j)$, $1\leq j\leq p$:

X_j \mid \mu_j, \sigma_j^2 \overset{ind}{\sim} N\left(\mu_j, \frac{n_1+n_2}{n_1 n_2}\sigma_j^2\right), \quad \mu_j = \mu_{1j}-\mu_{2j},   (9)
\frac{(n_1+n_2-2)}{\sigma_j^2} V_j \mid \sigma_j^2 \overset{iid}{\sim} \chi^2_{n_1+n_2-2},   (10)
\mu_j \overset{iid}{\sim} G_0 \in \mathcal{G},   (11)
\sigma_j^2 \overset{iid}{\sim} F_0 \in \mathcal{F},   (12)

where $\mathcal{G}$ and $\mathcal{F}$ represent the classes of all distributions supported on $(-\infty,\infty)$ and $[0,\infty)$, respectively. Here, we regard $G_0$ and $F_0$ as unknown distributions for the location and scale parameters, respectively, and we estimate these distributions by incorporating the observed data structure. Our model assumptions (11) and (12), which relax the assumptions on the location and scale distributions and estimate them from the data, are intended to be effective in a broader range of situations than previous studies that impose assumptions on the distributions of the location or scale parameters, as in Ji et al., (2020); Park et al., 2022a.

We hereby present a concurrent nonparametric technique for estimating the location parameters $\mu_j$ and the scale parameters $\sigma_j^2$, $1\leq j\leq p$, as delineated in the model (9)–(12). Specifically, this paper employs a configuration in which these two parameters are considered independent of each other.

To begin, we derive the optimal Bayes rules for both the location parameter $\mu_j$ and the scale parameter $\sigma_j^2$ for each $j$, and we introduce a method to approximate these rules. Let $f_{X,V}(X_j,V_j\mid\mu,\sigma^2)$ denote the joint density function of $X_j$ and $V_j$ given $\mu$ and $\sigma^2$, and let $f_V(V_j\mid\sigma^2)$ denote the density function of $V_j$ given $\sigma^2$. For $j=1,\ldots,p$, we can characterize the two Bayes rules from the model (9)–(12) as

\widetilde{\mu}_j = E(\mu \mid X_j, V_j) = \frac{\iint \mu\, f_{X,V}(X_j,V_j\mid\mu,\sigma^2)\, dF_0(\sigma^2)\, dG_0(\mu)}{\iint f_{X,V}(X_j,V_j\mid\mu,\sigma^2)\, dF_0(\sigma^2)\, dG_0(\mu)},   (13)

and

\widetilde{\sigma}_j^2 = E(\sigma^2 \mid V_j) = \frac{\int \sigma^2 f_V(V_j\mid\sigma^2)\, dF_0(\sigma^2)}{\int f_V(V_j\mid\sigma^2)\, dF_0(\sigma^2)}.   (14)

2.2 Nonparametric Maximum Likelihood Estimation for $G_0$ and $F_0$

In most cases, we lack information about the true distributions $F_0$ and $G_0$, making the Bayes rules (13) and (14) inaccessible. Consequently, our primary approach is to approximate the Bayes rules by obtaining estimates of $F_0$ and $G_0$ through Nonparametric Maximum Likelihood Estimation (NPMLE) (Kiefer and Wolfowitz, 1956) and subsequently plugging them in.

Consider a parametric family of probability density functions $\{f(\cdot\mid\theta):\theta\in\Theta\}$, where $\Theta$ is a parameter space that is a subset of the real numbers $\mathbb{R}$, and the densities are defined with respect to a dominating measure $d$ on $\mathbb{R}$. Given a distribution $Q$ on $\Theta$, we define the resulting mixture density as follows:

f_Q(y) \triangleq \int_\Theta f(y\mid\theta)\, dQ(\theta).   (15)

Kiefer and Wolfowitz, (1956) introduced the nonparametric maximum likelihood estimator of the distribution $Q(\cdot)$ as the maximizer of the mixture likelihood given a dataset of $p$ observations $y_1,\ldots,y_p$:

\widehat{Q} = \arg\max_{Q\in\mathcal{M}(\Theta)} \frac{1}{p}\sum_{i=1}^p \log f_Q(y_i),   (16)

where $\mathcal{M}(\Theta)$ represents the set of all probability measures on $\Theta$. For a comprehensive treatment of the NPMLE, readers may refer to the well-known monograph Lindsay, (1995). Within the NPMLE framework, we can therefore estimate the two distributions $F_0$ and $G_0$ as follows.

\widehat{F}_0 = \operatorname*{argmax}_{F\in\mathcal{F}} \frac{1}{p}\sum_{j=1}^p \log\left\{\int f_V(V_j\mid\sigma^2)\, dF(\sigma^2)\right\},   (17)

where $f_V(V_j\mid\sigma^2)$ is the conditional density function of $V_j$ given $\sigma^2$, and

\widehat{G}_0 = \operatorname*{argmax}_{G\in\mathcal{G}} \frac{1}{p}\sum_{j=1}^p \log\left\{\int f_{\widehat{F}_0}(X_j,V_j\mid\mu)\, dG(\mu)\right\},   (18)

where $f_{F_0}(X_j,V_j\mid\mu)=\int f(X_j,V_j\mid\mu,\sigma^2)\, dF_0(\sigma^2)$ and $f(X_j,V_j\mid\mu,\sigma^2)$ is the product of the conditional density of $X_j$ given $\mu$ and $\sigma^2$ and the conditional density of $V_j$ given $\sigma^2$.

Although the convex optimization problem (16) is infinite-dimensional, various computationally efficient algorithms have been developed over the years (Jiang and Zhang, 2009; Koenker and Mizera, 2014; Gu and Koenker, 2017). Following Lindsay, (1995); Koenker and Mizera, (2014); Dicker and Zhao, (2016); Feng and Dicker, (2016); Park et al., 2022a, we approximate $\widehat{F}_0$ and $\widehat{G}_0$ on support sets of fine grid points. Concretely, we take a grid of $K$ logarithmically equispaced points $\{\log v_1,\ldots,\log v_K\}$ within the interval $[\log\min_{1\leq j\leq p} V_j, \log\max_{1\leq j\leq p} V_j]$ and apply the exponential function to each component of the grid to obtain $\{v_1,\ldots,v_K\}$. Also, we consider $L$ points $\{u_1,\ldots,u_L\}$ evenly spaced within the interval $[\min_{1\leq j\leq p} X_j, \max_{1\leq j\leq p} X_j]$. Define $\widehat{\mathcal{F}}_K$ and $\widehat{\mathcal{G}}_L$ as the classes of probability distributions supported on the sets $\{v_1,\ldots,v_K\}$ and $\{u_1,\ldots,u_L\}$, respectively. We obtain approximate versions $\widehat{F}_K$ and $\widehat{G}_L$ by substituting $\widehat{\mathcal{F}}_K$ and $\widehat{\mathcal{G}}_L$ for $\mathcal{F}$ and $\mathcal{G}$ in equations (17) and (18). These tasks can be easily accomplished using the convex optimization algorithms introduced by Koenker and Mizera, (2014); Kim et al., (2020).
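As an illustration of the grid approximation, the sketch below estimates the mixture weights of $\widehat{F}_K$ and $\widehat{G}_L$ with a simple EM-type fixed-point iteration rather than the interior-point and sequential quadratic programming solvers cited above; all function names, defaults, and iteration counts are illustrative.

```python
import numpy as np
from scipy.stats import chi2, norm

def npmle_weights(lik, n_iter=500):
    """EM-type fixed-point iteration for mixture weights on a fixed grid.

    lik is a (p, K) matrix with lik[j, k] = density of observation j at grid point k.
    """
    p, K = lik.shape
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        post = lik * w                              # unnormalized posterior over grid points
        post /= post.sum(axis=1, keepdims=True)
        w = post.mean(axis=0)                       # weight update
    return w

def fit_F_and_G(mean_diff, pooled_var, n1, n2, K=100, L=100):
    """Grid approximations of F_0 (variances) and G_0 (mean differences), cf. (17)-(18)."""
    df = n1 + n2 - 2
    c = (n1 + n2) / (n1 * n2)
    # log-equispaced grid for sigma^2 and equispaced grid for mu
    v = np.exp(np.linspace(np.log(pooled_var.min()), np.log(pooled_var.max()), K))
    u = np.linspace(mean_diff.min(), mean_diff.max(), L)
    # f_V(V_j | v_k): density of V_j when df * V_j / sigma^2 ~ chi^2_df
    lik_V = chi2.pdf(df * pooled_var[:, None] / v[None, :], df) * df / v[None, :]
    w1 = npmle_weights(lik_V)                       # weights of F_hat_K, cf. (17)
    # f_{F_hat}(X_j, V_j | u_l) = sum_k w1_k N(X_j; u_l, c v_k) f_V(V_j | v_k)
    lik_XV = np.empty((len(mean_diff), L))
    for l, ul in enumerate(u):
        dens = norm.pdf(mean_diff[:, None], loc=ul, scale=np.sqrt(c * v[None, :]))
        lik_XV[:, l] = (dens * lik_V * w1).sum(axis=1)
    w2 = npmle_weights(lik_XV)                      # weights of G_hat_L, cf. (18)
    return v, w1, u, w2
```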

Let

\widehat{F}_K(\sigma^2) \equiv \sum_{k=1}^K \widehat{w}_{1,k} I(v_k\leq\sigma^2),   (19)
\widehat{G}_L(\mu) \equiv \sum_{l=1}^L \widehat{w}_{2,l} I(u_l\leq\mu),   (20)

where $\widehat{w}_{1,k}\geq 0$, $1\leq k\leq K$, $\sum_{k=1}^K \widehat{w}_{1,k}=1$, and $\widehat{w}_{2,l}\geq 0$, $1\leq l\leq L$, $\sum_{l=1}^L \widehat{w}_{2,l}=1$. Using equations (19) and (20), we obtain viable estimates $\widehat{\sigma}_j^2$ and $\widehat{\mu}_j$ for $j=1,\ldots,p$ as follows.

\widehat{\sigma}_j^2 = \widehat{E}(\sigma^2\mid V_j) = \frac{\int \sigma^2 f_V(V_j\mid\sigma^2)\, d\widehat{F}_K(\sigma^2)}{\int f_V(V_j\mid\sigma^2)\, d\widehat{F}_K(\sigma^2)} = \frac{\sum_{k=1}^K v_k \widehat{w}_{1,k} f_V(V_j\mid v_k)}{\sum_{k=1}^K \widehat{w}_{1,k} f_V(V_j\mid v_k)},   (21)

\widehat{\mu}_j = \widehat{E}(\mu\mid X_j,V_j) = \frac{\iint \mu\, f_{X,V}(X_j,V_j\mid\mu,\sigma^2)\, d\widehat{F}_K(\sigma^2)\, d\widehat{G}_L(\mu)}{\iint f_{X,V}(X_j,V_j\mid\mu,\sigma^2)\, d\widehat{F}_K(\sigma^2)\, d\widehat{G}_L(\mu)} = \frac{\sum_{l=1}^L u_l \widehat{w}_{2,l}\sum_{k=1}^K \widehat{w}_{1,k} f_{X,V}(X_j,V_j\mid u_l,v_k)}{\sum_{l=1}^L \widehat{w}_{2,l}\sum_{k=1}^K \widehat{w}_{1,k} f_{X,V}(X_j,V_j\mid u_l,v_k)}.   (22)
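Given grids and weights such as those returned by the sketch above, the posterior means (21) and (22) are finite weighted averages. An illustrative Python sketch:

```python
import numpy as np
from scipy.stats import chi2, norm

def posterior_means(mean_diff, pooled_var, v, w1, u, w2, n1, n2):
    """Plug the grid estimates into (21) and (22) to obtain sigma2_hat_j and mu_hat_j."""
    df = n1 + n2 - 2
    c = (n1 + n2) / (n1 * n2)
    lik_V = chi2.pdf(df * pooled_var[:, None] / v[None, :], df) * df / v[None, :]   # (p, K)
    # (21): posterior mean of sigma^2 given V_j under F_hat_K
    sigma2_hat = (lik_V * w1 * v).sum(axis=1) / (lik_V * w1).sum(axis=1)
    # (22): posterior mean of mu given (X_j, V_j) under F_hat_K and G_hat_L
    num = np.zeros(len(mean_diff))
    den = np.zeros(len(mean_diff))
    for ul, w2l in zip(u, w2):
        dens = norm.pdf(mean_diff[:, None], loc=ul, scale=np.sqrt(c * v[None, :]))
        mix = (dens * lik_V * w1).sum(axis=1)      # sum_k w1_k f_{X,V}(X_j, V_j | u_l, v_k)
        num += ul * w2l * mix
        den += w2l * mix
    mu_hat = num / den
    return mu_hat, sigma2_hat
```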

2.3 Proposed classification rule

Algorithm 1: Nonparametric Mean and Variance Adaptive Classification Rule

Input: the $n$ observed data $\{(Y_i,\boldsymbol{X}_i)\}_{i=1}^n$, $\boldsymbol{X}_i=(X_{i1},\ldots,X_{ip})^T$, $Y_i\in\{1,2\}$.

1. Refer to (7) and (8) to compute the sample mean difference between the two groups $X_j$ and the pooled sample variance $V_j$ for each $j=1,\ldots,p$.
2. Derive the estimators of $F_0$ and $G_0$ through the nonparametric maximum likelihood estimation given by (17) and (18).
3. Substituting the estimates $\widehat{F}_0$ and $\widehat{G}_0$ for $F_0$ and $G_0$ in (13) and (14), obtain Bayes estimates $(\widehat{\mu}_j,\widehat{\sigma}_j^2)$ of $(\mu_j,\sigma_j^2)$ for each $j=1,\ldots,p$.
4. Plug $\widehat{\mu}_j$ and $\widehat{\sigma}_j^2$ into the optimal classification rule (6), giving $\widehat{a}_j=\widehat{\mu}_j/\widehat{\sigma}_j^2$, $j=1,\ldots,p$, and $\widehat{a}_0=-2^{-1}\sum_{j=1}^p \widehat{a}_j(X_j^{(1)}+X_j^{(2)})$.
5. Classify the new observation $\boldsymbol{X}^{new}$ according to the proposed classification rule below:

\delta_{MVA}(\boldsymbol{X}^{new}) = \sum_{j=1}^p \widehat{a}_j X_j^{new} + \widehat{a}_0 - \log\left(\frac{\hat{\pi}_2}{\hat{\pi}_1}\right),   (23)
\widehat{Y}^{new} = \begin{cases} 1 & \text{for } \delta_{MVA}(\boldsymbol{X}^{new}) \geq 0, \\ 2 & \text{for } \delta_{MVA}(\boldsymbol{X}^{new}) < 0, \end{cases}   (26)

where $\hat{\pi}_k$ is the proportion of observations belonging to the $k$-th group among all observations.

In this section, we describe our proposed MVA classification rule using the nonparametric estimation methods for the mean difference and variance parameters introduced in Section 2.2. The MVA rule serves as a classifier that approximates the optimal classification rule $\delta_I(\textbf{X})$ in (6) under the IR assumption. The improvement in the estimation of the unknown parameters included in $\delta_I(\textbf{X})$ contributes to the enhancement of classification performance. Algorithm 1 outlines the construction procedure of the MVA rule.
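Putting the pieces together, the following illustrative sketch assembles Algorithm 1 from the helper functions sketched above (summary_statistics, fit_F_and_G, and posterior_means are our illustrative names, not a published implementation).

```python
import numpy as np

def mva_fit_predict(X_train, y_train, X_test):
    """Illustrative assembly of the MVA rule (Algorithm 1) from the sketches above."""
    n1, n2 = int(np.sum(y_train == 1)), int(np.sum(y_train == 2))
    xbar1 = X_train[y_train == 1].mean(axis=0)
    xbar2 = X_train[y_train == 2].mean(axis=0)
    mean_diff, pooled_var = summary_statistics(X_train, y_train)      # step 1
    v, w1, u, w2 = fit_F_and_G(mean_diff, pooled_var, n1, n2)         # step 2
    mu_hat, sigma2_hat = posterior_means(mean_diff, pooled_var,
                                         v, w1, u, w2, n1, n2)        # step 3
    a = mu_hat / sigma2_hat                                           # step 4: a_hat_j
    a0 = -0.5 * np.sum(a * (xbar1 + xbar2))                           # a_hat_0
    delta = X_test @ a + a0 - np.log(n2 / n1)                         # step 5: rule (23)
    return np.where(delta >= 0, 1, 2)
```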

3 Simulation Study

This section presents a simulation study to assess the performance of MVA and compare it with several existing methods. Our main focus is comparing MVA with Stein’s unbiased risk estimate (SURE) from Ji et al., (2020) and the nonparametric maximum likelihood estimation based method (NPMLE) from Park et al., 2022a, both of which involve specific assumptions about the distribution of the mean or variance values. Our simulation also includes the features annealed independence rule (FAIR) from Fan and Fan, (2008), penalized linear discriminant analysis (PLDA) from Witten and Tibshirani, (2011), naive Bayes (NB), and the nearest shrunken centroid (NSC) from Tibshirani et al., (2002). For SURE, we use estimators shrunken toward the grand mean. FAIR is conducted using the R-package HiDimDA, PLDA using the R-package penalizedLDA, and NSC is implemented through the R-package pamr.

In particular, to compare our MVA with SURE from Ji et al., (2020) and the NPMLE-based method from Park et al., 2022a, which assume a right-skewed distribution for the variance parameters, we set the structure of the variance parameters’ distribution as follows. Concretely, we explore three scenarios for the skewness of the distribution of the variances under various structures of mean differences. The first scenario involves situations where the distribution of the variances $F_0$ has a long tail on the left side, while the second includes cases where the distribution of the variances $F_0$ is right-skewed. In the last scenario, we evaluate the performance under symmetric distributions for the variances. Section 3.1.1 portrays the performance of the models in the first scenario, Section 3.1.2 covers the second scenario, and Section 3.1.3 focuses on the last one. This simulation illustrates our proposed classifier’s superior performance across these scenarios.

3.1 Data Generation

Given $Y_i=k$, all observations are generated from $N_p(\boldsymbol{\mu}_k,\boldsymbol{D})$, where $\boldsymbol{\mu}_k=(\mu_{k1},\ldots,\mu_{kp})$ and $\boldsymbol{D}={\rm diag}(\sigma_1^2,\ldots,\sigma_p^2)$. The specific settings for generating the dataset are as follows:

1. The data dimension $p$ is set to 10,000.

2. The sample size of the two groups is $n_1=n_2=125$.

3. We independently generate $(\boldsymbol{\mu}_k,\boldsymbol{D})$ for each group, and subsequently generate $\boldsymbol{X}_i=(X_{i1},\ldots,X_{ip})$, $i=1,\ldots,n_1+n_2$.

4. For each simulation, we explore both non-sparse and sparse structures of mean differences.

   • (Sparse case) The first 100 components of $\boldsymbol{\mu}_1=(\mu_{1j})_{1\leq j\leq p}$ are 1 and the remaining components of $\boldsymbol{\mu}_1$ are 0, while all components of $\boldsymbol{\mu}_2=(\mu_{2j})_{1\leq j\leq p}$ are 0. In other words, $\boldsymbol{\mu}_1=(1,\ldots,1,0,\ldots,0)^T$ with 100 ones followed by $p-100$ zeros, and $\boldsymbol{\mu}_2=(0,\ldots,0)^T$.

   • (Non-sparse case) The first 100 components of $\boldsymbol{\mu}_1=(\mu_{1j})_{1\leq j\leq p}$ are 1, and the remaining $p-100$ components are generated from $N(0,0.1^2)$. $\boldsymbol{\mu}_2$ is set to be the same as in the sparse case.

5. We split the dataset: for each group, 25 samples are used for training the models and 100 samples are used for evaluation. We repeat the evaluation 500 times on independent test datasets in all cases. (A sketch of this data-generating process is given below.)
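A minimal sketch of this data-generating process under the settings listed above (names and defaults are illustrative):

```python
import numpy as np

def generate_data(p=10000, n1=125, n2=125, sparse=True, sigma2=None, seed=None):
    """Generate one simulation dataset under the settings above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    if sigma2 is None:
        sigma2 = np.ones(p)                       # variances come from the scenario at hand
    mu1 = np.zeros(p)
    mu1[:100] = 1.0                               # first 100 mean differences equal 1
    if not sparse:
        mu1[100:] = rng.normal(0.0, 0.1, size=p - 100)   # non-sparse case
    mu2 = np.zeros(p)
    sd = np.sqrt(sigma2)
    X = np.vstack([rng.normal(mu1, sd, size=(n1, p)),
                   rng.normal(mu2, sd, size=(n2, p))])
    y = np.concatenate([np.ones(n1, dtype=int), np.full(n2, 2, dtype=int)])
    return X, y
```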

3.1.1 Left-skewed case for $F_0$

In this section, we deal with data generated from various situations in which the distribution of the variances $F_0$ is left-skewed. Here, we consider both discrete and continuous distributions for the variances. To determine the value of each $\sigma_j^2$ under the discrete distributions, we set $\sigma_{base}^2$ as the baseline value for $\sigma_j^2$, $\delta$ as the proportion of $\sigma_j^2$s taking the value $\sigma_{base}^2$, and $\Delta$ as the remaining value apart from $\sigma_{base}^2$. For instance, consider the scenario where $\sigma_{base}^2=1$, $\delta=0.005$, and $\Delta=6$. In this case, for $j=1,\ldots,p$, $\sigma_j^2$ follows a distribution in which a proportion $1-\delta=0.995$ of the $\sigma_j^2$ values equal $\Delta=6$ and the remaining proportion $\delta=0.005$ equal 1. This implies that the majority of the $\sigma_j^2$, $j=1,\ldots,p$, are concentrated on the right side, and as a result the distribution of the variances $F_0$ is left-skewed. In addition, we use left-skewed beta distributions as continuous distributions for the variances. To calculate the misclassification rates of the classifiers in more diverse situations, we consider three different $\sigma_{base}^2$ values $(1,2,3)$ and two $\delta$ values $(0.005,0.05)$; the $\Delta$ value is fixed at 6 across all situations. Moreover, keeping the scale fixed at 5, we examine three values $(1.5,2.5,3.5)$ for the shape parameter of the beta distributions, denoted as $\beta$. In these situations, we generate $\{\sigma_j^2\}_{j=1}^p$ from $\sigma_j^2/5\sim\text{Beta}(5,\beta)$ to accommodate the wider support of the distribution of the variances $F_0$.
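For concreteness, the left-skewed variance configurations described above can be generated as follows (an illustrative sketch; the right-skewed and symmetric scenarios of Sections 3.1.2 and 3.1.3 are analogous with inverse gamma and uniform draws).

```python
import numpy as np

def discrete_left_skewed_variances(p, sigma2_base=1.0, delta=0.005, big=6.0, seed=None):
    """Discrete left-skewed F_0: a delta-proportion of the sigma_j^2 equal sigma2_base,
    the remaining 1 - delta proportion equal the larger value (Delta in the text)."""
    rng = np.random.default_rng(seed)
    sigma2 = np.full(p, big)
    idx = rng.choice(p, size=int(round(delta * p)), replace=False)
    sigma2[idx] = sigma2_base
    return sigma2

def beta_left_skewed_variances(p, beta=1.5, seed=None):
    """Continuous left-skewed F_0: sigma_j^2 / 5 ~ Beta(5, beta)."""
    rng = np.random.default_rng(seed)
    return 5.0 * rng.beta(5.0, beta, size=p)
```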

Figures 1 and 2 describe the performance of the classifiers in the left-skewed case for $F_0$. The left and center panels of Figure 1 correspond to the discrete distributions for $\sigma_j^2$, while the right panel corresponds to the continuous beta distributions for $\sigma_j^2$. Specifically, the left and center panels have $\delta$ set to 0.005 and 0.05, respectively; hence, the classification problems in the center panel are easier than those in the left panel. The configuration of Figure 2 is the same as Figure 1, except for the structure of the mean differences.

Figure 1: Left-skewed case for $F_0$ with non-sparse setting of mean differences. The left panel shows the situations where $\sigma_{base}^2=(1,2,3)$, $\delta=0.005$, and $\Delta=6$; the middle panel shows the situations where $\sigma_{base}^2=(1,2,3)$, $\delta=0.05$, and $\Delta=6$; and the right panel shows the situations where $\sigma_j^2/5\sim\text{Beta}(5,\beta)$, $\beta=(1.5,2.5,3.5)$.
Figure 2: Left-skewed case for $F_0$ with sparse setting of mean differences. In each panel, $F_0$ is set in the same way as in Figure 1.

In both figures, MVA demonstrates the best performance. When the distributions of $\sigma_j^2$ are discrete, the misclassification rates of MVA and FAIR approach zero as the value of $\sigma_{base}^2$ is reduced. On the contrary, some models such as NPMLE(Parks), NSC, and SURE perform poorly in these situations. In particular, NPMLE(Parks) and SURE, which assume a right-skewed inverse gamma distribution for $\sigma_j^2$, show inferior performance compared to MVA, because their assumption conflicts with the actual data structure. Our proposed classifier, MVA, attains comparatively smaller misclassification rates due to its flexibility, as it does not make any assumption about the distribution of $\sigma_j^2$. In the right panels of Figures 1 and 2, the skewness of the distribution of $\sigma_j^2$ is mitigated compared to the two previous cases; nevertheless, no model catches up with the performance of MVA. These two figures show that MVA attains the best performance among the various classifiers under left-skewed distributions for $\sigma_j^2$.

3.1.2 Right-skewed case for $F_0$

In this section, we examine the performance of the classifiers, including MVA, in situations where the distribution of the variances $F_0$ is right-skewed. As in Section 3.1.1, we take into account both discrete and continuous distributions for the variances. For the discrete distributions, we explore three different $\sigma_{base}^2$ values $(8,9,10)$, two $\delta$ values $(0.005,0.05)$, and a single $\Delta$ value of 6. For the continuous distributions, inverse gamma distributions with a scale parameter of 10 are employed, and three values $(2,4,6)$ are considered for the shape parameter $\alpha$ of the inverse gamma distribution.

Figure 3: Right-skewed case for $F_0$ with non-sparse setting of mean differences. The left panel shows the situations where $\sigma_{base}^2=(8,9,10)$, $\delta=0.005$, and $\Delta=6$; the middle panel shows the situations where $\sigma_{base}^2=(8,9,10)$, $\delta=0.05$, and $\Delta=6$; and the right panel shows the situations where $\sigma_j^2\sim\Gamma^{-1}(\alpha,10)$, $\alpha=(2,4,6)$.
Figure 4: Right-skewed case for $F_0$ with sparse setting of mean differences. In each panel, $F_0$ is set in the same way as in Figure 3.

Figures 3 and 4 present the misclassification rates of the models when the distributions of $\sigma_j^2$ are right-skewed. In this case, we anticipate that NPMLE(Parks) and SURE have an advantage over MVA, as their assumptions are consistent with the actual form of the distribution of $\sigma_j^2$. In the left and center panels of Figures 3 and 4, $\delta$ is set to 0.005 and 0.05, respectively. These four plots show that NPMLE(Parks), SURE, and MVA perform well, in contrast to the other models. In particular, NPMLE(Parks) and SURE exhibit outstanding performance at every $\sigma_{base}^2$, unlike in Section 3.1.1. When $\sigma_{base}^2$ becomes 10, MVA falls behind NPMLE(Parks) and SURE, but it achieves relatively lower misclassification rates than the other models when $\sigma_{base}^2$ reaches its smallest value. In the right panels of Figures 3 and 4, although the distribution of $\sigma_j^2$ is right-skewed, the performance of MVA is similar to that of NPMLE(Parks). As illustrated in these figures, our MVA achieves strong performance even when the skewness of the distribution of $\sigma_j^2$ is positive.

3.1.3 Symmetric case

Figure 5: Symmetric case for $F_0$ with both sparse and non-sparse settings of mean differences. The left panel is set to sparse structures of mean differences, and the right panel is set to non-sparse structures of mean differences. In both panels, $\sigma_j^2\sim U(1,\zeta)$, $\zeta=(3,5,7,9)$.

In this section, we address various situations in which the distribution of the variances $F_0$ is symmetric, specifically following a uniform distribution. For each situation, we investigate how the performance of the classifiers changes as we decrease the value of $\zeta$, the maximum of a uniform distribution whose minimum is fixed at 1. In this context, we consider four different values $(3,5,7,9)$ for $\zeta$. Figure 5 shows the results of the simulation when the distributions of $\sigma_j^2$ are continuous uniform distributions, with both non-sparse and sparse structures of mean differences. Here, when $\zeta$ takes a relatively larger value, the support of the distribution of $\sigma_j^2$ is wider and the classification problem is more challenging. MVA and NPMLE(Parks) work well irrespective of the sparseness of the mean differences, both when the support of the distribution of $\sigma_j^2$ is relatively broad and when it is narrower. Ultimately, we show that our proposed MVA has promising performance regardless of the skewness of the distribution of $\sigma^2$ and the sparseness of the mean differences.

Figure 6: $f_V(V)$ and estimated marginal probability density functions of the pooled sample variances in the sparse case. The left panel is a left-skewed case where $\sigma_j^2/5\sim\text{Beta}(5,1.5)$, the middle is a right-skewed case where $\sigma_j^2\sim\Gamma^{-1}(6,10)$, and the right is a symmetric case where $\sigma_j^2\sim U(1,9)$.

In the above experiments, our proposed model demonstrates stable performance compared to the other models in every case. One of the factors that significantly affects the performance of the classifiers is how accurately they estimate the mean and variance values included in the optimal classification rule (2). In particular, our model has the advantage of estimating the variance values more adaptively than NPMLE(Parks) and SURE. To indirectly confirm that MVA estimates the distribution of the variance values more flexibly, we examine how each model estimates the marginal probability density function of the pooled sample variances, utilizing the fact that the pooled sample variance $V_j$ is an unbiased estimator of $\sigma_j^2$.

Figure 6 compares the estimated marginal probability density functions of the pooled sample variances for MVA, NPMLE(Parks), and SURE with the corresponding true marginal density, $f_V(V)=\int f_V(V\mid\sigma^2)\, dF_0(\sigma^2)$. In the left panel of Figure 6, where the distribution $F_0$ of $\sigma_j^2$ is left-skewed, MVA demonstrates remarkable adaptability in closely approximating the true distribution of the pooled sample variances, whereas the estimates of NPMLE(Parks) and SURE deviate slightly from it. In the middle panel, which represents one of the right-skewed cases for $F_0$, all of the models estimate $f_V(V)$ similarly well. Furthermore, as illustrated in the right panel, MVA flexibly estimates $f_V(V)$ by reflecting the actual structure of the data even when $F_0$ is symmetric. In contrast, NPMLE(Parks) and SURE are constrained by their strong assumption on $F_0$, leading to a failure to estimate $f_V(V)$ accurately.

To sum up, our suggested model outperforms the other models except in the situations where the distribution $F_0$ of $\sigma_j^2$ is right-skewed and a few $\sigma_j^2$ have relatively large values such as 10. In particular, in the left-skewed cases, our classifier exhibits outstanding performance, sharply reducing its misclassification rate as the value of $\sigma_{base}^2$ decreases. In the right-skewed cases, NPMLE(Parks) and SURE perform satisfactorily, and our model is comparable to them in terms of performance. This tendency holds in both the sparse and non-sparse structures of mean differences. Consequently, our model fulfills our desire for a robust classifier that is not influenced by the sparseness of the mean differences or the skewness of the distribution of the variances. As evidence, we present a comparison of the estimated marginal densities of the pooled sample variances, which shows that the robustness of our model stems from adaptive estimation that captures the structure of the actual data. From the simulation study above, we confirm that the classifier based on our methodology has the advantage of ensuring effective performance in every case, since it estimates the mean and variance values without restrictive constraints or assumptions.

4 Case Study

In this section, we evaluate the performance of MVA and several other methods by applying them to four benchmark datasets: Breast Cancer from Zhu et al., (2007), Huntington’s Disease from Borovecki et al., (2005), Leukemia from Golub et al., (1999), and Central Nervous System (CNS) from Pomeroy et al., (2002). The Breast Cancer dataset is available at https://csse.szu.edu.cn/staff/zhuzx/Datasets.html, and the rest are provided in the R-package datamicroarray.

Before analyzing the datasets, we carry out the following preprocessing steps. To alleviate the computational burden, we conduct a two-sample $t$-test for each feature and employ a feature screening procedure to remove noisy features, using a significance level of 0.2. These tasks are carried out for all considered datasets. Additionally, we apply min-max scaling to Huntington’s Disease and Leukemia so that each feature’s minimum and maximum values are adjusted to 0 and 1, respectively. The details of the four datasets are presented in Table 1.
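A minimal sketch of these screening and scaling steps, assuming that features with a $t$-test p-value below the 0.2 level are retained (function and argument names are illustrative):

```python
import numpy as np
from scipy.stats import ttest_ind

def preprocess(X, y, alpha=0.2, minmax=False):
    """Feature screening by per-feature two-sample t-tests, plus optional min-max scaling."""
    _, pvals = ttest_ind(X[y == 1], X[y == 2], axis=0)
    keep = pvals < alpha                     # assumed retention rule at level 0.2
    X_reduced = X[:, keep]
    if minmax:
        lo, hi = X_reduced.min(axis=0), X_reduced.max(axis=0)
        X_reduced = (X_reduced - lo) / (hi - lo)
    return X_reduced, keep
```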

Dataset | n | p | Classes | Reduced dimension
Breast Cancer | 97 | 24481 | non-relapse (51), relapse (46) | 6906
Huntington’s Disease | 31 | 22283 | control (14), symptomatic (17) | 11455
Leukemia | 72 | 7129 | ALL (47), AML (25) | 3502
CNS | 60 | 7128 | died (21), survived (39) | 1151
Table 1: Details of the utilized datasets
Dataset | MVA | NPMLE(Parks) | FAIR | PLDA | SURE | NB | NSC
Breast Cancer | 0.206 | 0.216 | 0.351 | 0.268 | 0.443 | 0.268 | 0.309
Huntington’s Disease | 0.000 | 0.032 | 0.065 | 0.065 | 0.065 | 0.032 | 0.065
Leukemia | 0.083 | 0.083 | 0.083 | 0.083 | 0.097 | 0.083 | 0.028
CNS | 0.200 | 0.217 | 0.383 | 0.233 | 0.217 | 0.217 | 0.250
Table 2: Misclassification rates of classifiers

More specifically, the breast cancer dataset was employed to predict the relapse status of 97 individuals, resulting in a preprocessed dataset with 6906 genes; among the samples, 46 patients experienced a recurrence of breast cancer. The second dataset focused on Huntington’s disease, aiming to distinguish 17 symptomatic individuals from 14 controls, resulting in a dataset with 11455 genes after preprocessing. The third dataset, the leukemia data, comprised 47 patients diagnosed with acute lymphoblastic leukemia (ALL) and 25 with acute myeloid leukemia (AML), with the data dimension reduced to 3502 through feature screening. Lastly, the CNS dataset is associated with the survival outcomes (died or survived) of 60 patients with central nervous system embryonal tumors, of whom 21 died and the remaining survived; after feature screening, the dataset retained 1151 genes.

Figure 7: Histogram and estimated probability density functions of log pooled sample variances of the breast cancer dataset

The misclassification rates of the models, calculated through Leave-One-Out Cross-Validation (LOOCV), are shown in Table 2. Our proposed model demonstrates satisfactory performance across all datasets. As seen in the first row of Table 2, our model exhibits superior performance on the breast cancer data, while the other models, except NPMLE(Parks), show remarkably low performance compared to MVA. The results for the Huntington’s disease data are found in the second row of Table 2, from which we can see that only our proposed model correctly predicts the status of all samples. The third row of Table 2 presents the misclassification rates for the leukemia data; in this classification problem our model is not the best, but it still exhibits performance on par with NSC, which is the optimal model for this dataset. From the last row of Table 2, our model attains the top performance on the CNS data, and NPMLE(Parks), SURE, and NB share the same performance slightly below that of MVA, whereas FAIR and NSC perform worse on this dataset. Consequently, our classifier maintains robust performance across all the datasets, unlike the classifiers based on existing methodologies that perform well only on specific datasets.
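The LOOCV error reported in Table 2 can be computed with a generic loop such as the illustrative sketch below, where `classifier` is any fit-and-predict routine (for example, the `mva_fit_predict` sketch above).

```python
import numpy as np

def loocv_error(X, y, classifier):
    """Leave-one-out misclassification rate.

    classifier(X_train, y_train, X_test) must return predicted labels.
    """
    n = len(y)
    errors = 0
    for i in range(n):
        mask = np.arange(n) != i
        pred = classifier(X[mask], y[mask], X[i][None, :])
        errors += int(pred[0] != y[i])
    return errors / n
```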

Additionally, using the analysis of the breast cancer data, we provide evidence supporting the effectiveness of our proposed model in real data analysis. Figure 7 aims in particular to show the enhanced adaptability of MVA in estimating variance values compared to SURE and NPMLE(Parks). We remark that the primary difference between our model and the others lies in the assumption on the distribution of the variance values. Since the estimated probability density functions of the variances cannot be compared directly across methods, we instead examine which method estimates the distribution of the log pooled sample variances more effectively. In Figure 7, our model effectively detects the bimodality of the distribution shown in the histogram, while the others fail to do so.

In summary, we confirm that our MVA has stable performance in most cases since it is based on a more flexible estimation of the parameters included in the optimal classification rule. In other words, if we are given a microarray dataset with noise appropriately removed, we anticipate that our model performs well regardless of the structures of the actual dataset in the context of the high-dimensional binary classification task.

5 Conclusion

Our methodology has shown significant promise in enhancing the effectiveness of Fisher’s Linear Discriminant Analysis (LDA) for high-dimensional data classification. By addressing the challenge of varying variances among features, we have overcome a critical limitation of traditional LDA methods that assume uniform variances. Our approach, based on Nonparametric Maximum Likelihood Estimation (NPMLE) techniques, not only accommodates distinct variances but also demonstrates strong performance across a range of variance distribution profiles, including left-skewed, symmetric, and right-skewed forms. Our empirical experiments have provided strong evidence of the practical benefits of our approach: the methodology excels in accurately classifying high-dimensional data characterized by heterogeneous variances, contributing to the advancement of discriminant analysis techniques in high-dimensional settings. Furthermore, we validated the effectiveness of our methodology through comprehensive analyses of various real datasets, including gene expression datasets. These findings open up new opportunities for improving classification accuracy in a variety of real-world applications. Of course, our research is based on the Independent Rule (IR) assumption, which is not always reasonable. Therefore, as part of future work, we aim to extend our proposed Mean and Variance Adaptive (MVA) rule by relaxing the IR assumption. Additionally, extending MVA to multiclass classification problems or to class-imbalanced cases presents an intriguing avenue for further investigation.

Acknowledgement

Seungyeon Oh and Hoyoung Park were supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00212502).

References

  • Bodnar et al., (2014) Bodnar, T., Gupta, A. K., and Parolya, N. (2014). On the strong convergence of the optimal linear shrinkage estimator for large dimensional covariance matrix. Journal of Multivariate Analysis, 132:215–228.
  • Borovecki et al., (2005) Borovecki, F., Lovrecic, L., Zhou, J., Jeong, H., Then, F., Rosas, H., Hersch, S., Hogarth, P., Bouzou, B., Jensen, R., et al. (2005). Genome-wide expression profiling of human blood reveals biomarkers for huntington’s disease. Proceedings of the National Academy of Sciences, 102(31):11023–11028.
  • Dicker and Zhao, (2016) Dicker, L. H. and Zhao, S. D. (2016). High-dimensional classification via nonparametric empirical bayes and maximum likelihood inference. Biometrika, 103(1):21–34.
  • Fan and Fan, (2008) Fan, J. and Fan, Y. (2008). High Dimensional Classification using Features Annealed Independence Rules. The Annals of Statistics, 36(6):2605–2637.
  • Feng and Dicker, (2016) Feng, L. and Dicker, L. H. (2016). Approximate nonparametric maximum likelihood inference for mixture models via convex optimization. arXiv preprint arXiv:1606.02011.
  • Golub et al., (1999) Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. science, 286(5439):531–537.
  • Gu and Koenker, (2017) Gu, J. and Koenker, R. (2017). REBayes: An R package for empirical Bayes mixture methods. cemmap working paper CWP37/17, London.
  • Ji et al., (2020) Ji, Z., Wei, Y., and Li, Z. (2020). Sure estimates for high dimensional classification. Statistical Analysis and Data Mining: The ASA Data Science Journal, 13(5):423–436.
  • Jiang and Zhang, (2009) Jiang, W. and Zhang, C.-H. (2009). General maximum likelihood empirical Bayes estimation of normal means. The Annals of Statistics, 37(4):1647–1684.
  • Kiefer and Wolfowitz, (1956) Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, 27(4):887–906.
  • Kim et al., (2022) Kim, J., Park, H., and Park, J. (2022). High dimensional discriminant rules with shrinkage estimators of covariance matrix and mean vector. arXiv preprint arXiv:2211.15063.
  • Kim et al., (2020) Kim, Y., Carbonetto, P., Stephens, M., and Anitescu, M. (2020). A Fast Algorithm for Maximum Likelihood Estimation of Mixture Proportions Using Sequential Quadratic Programming. Journal of Computational and Graphical Statistics, 29(2):261–273.
  • Koenker and Mizera, (2014) Koenker, R. and Mizera, I. (2014). Convex Optimization, Shape Constraints, Compound Decisions, and Empirical Bayes Rules. Journal of the American Statistical Association, 109(506):674–685.
  • Ledoit and Wolf, (2004) Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of multivariate analysis, 88(2):365–411.
  • Ledoit and Wolf, (2020) Ledoit, O. and Wolf, M. (2020). Analytical nonlinear shrinkage of large-dimensional covariance matrices.
  • Ledoit and Wolf, (2022) Ledoit, O. and Wolf, M. (2022). Quadratic shrinkage for large covariance matrices. Bernoulli, 28(3):1519–1547.
  • Lindsay, (1995) Lindsay, B. G. (1995). Mixture Models: Theory, Geometry and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, 5:i–163.
  • Park et al., (2022a) Park, H., Baek, S., and Park, J. (2022a). High-dimensional classification based on nonparametric maximum likelihood estimation under unknown and inhomogeneous variances. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(2):193–205.
  • Park et al., (2022b) Park, H., Baek, S., and Park, J. (2022b). High-dimensional linear discriminant analysis using nonparametric methods. Journal of Multivariate Analysis, 188:104836.
  • Pomeroy et al., (2002) Pomeroy, S. L., Tamayo, P., Gaasenbeek, M., Sturla, L. M., Angelo, M., McLaughlin, M. E., Kim, J. Y., Goumnerova, L. C., Black, P. M., Lau, C., et al. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870):436–442.
  • Tibshirani et al., (2002) Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences, 99(10):6567–6572.
  • Witten and Tibshirani, (2011) Witten, D. M. and Tibshirani, R. (2011). Penalized classification using Fisher’s linear discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):753–772.
  • Zhu et al., (2007) Zhu, Z., Ong, Y.-S., and Dash, M. (2007). Markov blanket-embedded genetic algorithm for gene selection. Pattern Recognition, 40(11):3236–3248.