
A new classification framework for high-dimensional data

Xiangbo Mo  
Department of Statistics, University of California, Davis
and
Hao Chen
Department of Statistics, University of California, Davis
The authors were supported in part by NSF awards DMS-1848579 and DMS-2311399.
Abstract

Classification, a fundamental problem in many fields, faces significant challenges when handling a large number of features, a scenario commonly encountered in modern applications, such as identifying tumor subtypes from genomic data or categorizing customer attitudes based on online reviews. We propose a novel framework that utilizes the ranks of pairwise distances among observations and identifies consistent patterns in moderate- to high-dimensional data, which previous methods have overlooked. The proposed method exhibits superior performance across a variety of scenarios, from high-dimensional data to network data. We further explore a typical setting to investigate key quantities that play essential roles in our framework, which reveals the framework’s capabilities in distinguishing differences in the first and/or second moment, as well as distinctions in higher moments.


Keywords: curse of dimensionality, ranks of pairwise distances, multi-class classification, network data

1 Introduction

In recent decades, the presence of high-dimensional data in classification has become increasingly common. For example, gene expression data with thousands of genes has been used to classify tumor subtypes (Golub et al., 1999; Ayyad et al., 2019), online review data with hundreds of features has been used to classify reviewers’ attitudes (Ye et al., 2009; Bansal and Srivastava, 2018), and speech signal data with thousands of utterances has been used to classify speakers’ sentiment (Burkhardt et al., 2010). To address the challenges in classifying high-dimensional data, many methods have been proposed.

One of the early methods used for high-dimensional classification was the support vector machine (SVM) (Boser et al., 1992; Brown et al., 1999; Furey et al., 2000; Schölkopf et al., 2001). Recently, Ghaddar and Naoum-Sawaya (2018) combined the SVM model with a feature selection approach to better handle high-dimensional data, and Hussain (2019) improved the traditional SVM by using a new semantic kernel. Another well-known approach is linear discriminant analysis (LDA) (Fisher, 1936; Rao, 1948), which has been extended to high-dimensional data by addressing the singularity of the covariance matrix. Extensions include generalized LDA (GLDA) (Li et al., 2005), dimension reduction with PCA followed by LDA in a low-dimensional space (Paliwal and Sharma, 2012), and regularized LDA (RLDA) (Yang and Wu, 2014). The k-nearest neighbor classification is also a common approach (Cover and Hart, 1967) with many variants (Liu et al., 2006; Tang et al., 2011). Recently, Pal et al. (2016) applied the k-nearest neighbor criterion based on an inter-point dissimilarity measure that utilizes the mean absolute difference to address the issue of the concentration of pairwise distances. Other methods include the nearest centroids classifier with feature selection (Fan and Fan, 2008) and ensemble methods such as boosting and random forest (Freund et al., 1996; Buehlmann, 2006; Mayr et al., 2012; Breiman, 2001; Ye et al., 2013). There are also other methods available, such as partial least squares regression and multivariate adaptive regression splines, as reviewed in Fernández-Delgado et al. (2014).

In addition, neural network classification frameworks have shown promising results in various kinds of tasks, such as convolutional neural networks and their variants (LeCun et al., 1989; Ranzato et al., 2006; Shi et al., 2015; Cao et al., 2020) in image processing tasks, recurrent neural networks in sound classification (Deng et al., 2020; Zhang et al., 2021) and text classification (Liu et al., 2016), and generative adversarial networks in semi-supervised learning (Kingma et al., 2014; Zhou et al., 2020) and unsupervised learning (Radford et al., 2015; Kim and Hong, 2021).

While numerous methods exist for classification, a common underlying principle is that observations or their projections tend to be closer if they belong to the same class than if they come from different classes. This principle is effective in low-dimensional spaces but can fail in high-dimensional scenarios due to the curse of dimensionality. Consider a typical classification problem where observations are drawn from two distributions: X_{1},X_{2},\dots,X_{50}\stackrel{\text{i.i.d.}}{\sim}F_{X} and Y_{1},Y_{2},\dots,Y_{50}\stackrel{\text{i.i.d.}}{\sim}F_{Y}. Here F_{X} and F_{Y} are unknown for the classification task, while X_{1},X_{2},\dots,X_{50} and Y_{1},Y_{2},\dots,Y_{50} are observed and labeled by class. The task is to classify a future observation, W, as belonging to either class X or class Y. In our simulation, we set X_{i}=\mathbf{A}(x_{i1},x_{i2},\dots,x_{id})^{\top}, where \mathbf{A}\in\mathbb{R}^{d\times d}, x_{ik}\stackrel{\text{i.i.d.}}{\sim}t_{5}, d=1000, and \mathbf{A}\mathbf{A}^{\top}=\Sigma, with \Sigma_{r,c}=0.1^{|r-c|}. Similarly, Y_{j}=\mathbf{B}(y_{j1},y_{j2},\dots,y_{jd})^{\top}+\mu, where y_{jk}\stackrel{\text{i.i.d.}}{\sim}t_{5}, \mathbf{B}\mathbf{B}^{\top}=a^{2}\Sigma, and \mu=\mu_{0}\frac{\mu^{\prime}}{\|\mu^{\prime}\|}, with \mu^{\prime} being a random vector from N_{d}(\mathbf{0}_{d},\mathbf{I}_{d}). By varying \mu_{0} and a, we can generate distributions that differ in mean and/or variance. Fifty new observations are generated from each of the two distributions and are classified in each trial. The average misclassification rate is calculated over fifty trials. While many classifiers were tested in this simple setting, we show results for a few representative ones from different categories that are either commonly used or have shown good performance: generalized LDA (GLDA) (Li et al., 2005), support vector machine (SVM) (Schölkopf et al., 2001), random forest (RF) (Breiman, 2001), boosting (Freund et al., 1996), FAIR (Features Annealed Independence Rules) (Fan and Fan, 2008), NN-MADD (k-Nearest Neighbor classifier based on the Mean Absolute Difference of Distances) (Pal et al., 2016), and several top-rated methods based on the results in the review paper (Fernández-Delgado et al., 2014): decision tree (DT) (Salzberg, 1994), multivariate adaptive regression splines (MARS) (Leathwick et al., 2005), partial least squares regression (PLSR) (Martens and Naes, 1992), and extreme learning machine (ELM) (Huang et al., 2011). Among the various kinds of deep neural network structures, we choose the multilayer perceptron (MLP) (Popescu et al., 2009), which is not task-specific but still illustrates the core idea of deep neural networks. The structure of the MLP is discussed in Appendix A.
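
For concreteness, the following is a minimal NumPy sketch of this data-generating process (the function name and the use of a Cholesky factor for \mathbf{A} are ours; any \mathbf{A} with \mathbf{A}\mathbf{A}^{\top}=\Sigma would do):

import numpy as np

def make_two_class_sample(n=50, d=1000, mu0=4.0, a=1.0, df=5, seed=None):
    # Sigma[r, c] = 0.1^{|r-c|}; A A^T = Sigma via a Cholesky factor (one valid choice of A).
    rng = np.random.default_rng(seed)
    idx = np.arange(d)
    Sigma = 0.1 ** np.abs(idx[:, None] - idx[None, :])
    A = np.linalg.cholesky(Sigma)
    # X_i = A x_i with x_ik ~ t_5; Y_j = a * A y_j + mu with ||mu||_2 = mu0 in a random direction.
    X = rng.standard_t(df, size=(n, d)) @ A.T
    mu_dir = rng.standard_normal(d)
    mu = mu0 * mu_dir / np.linalg.norm(mu_dir)
    Y = a * rng.standard_t(df, size=(n, d)) @ A.T + mu
    return X, Y, A, mu

Setting mu0=4, a=1 corresponds to the mean-difference row of Table 1, and mu0=0, a=1.05 to the variance-difference row.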

Table 1: Misclassification rate (those smaller than “the lowest misclassification rate + 0.01” are in bold).
\mu_{0} a GLDA SVM RF Boosting FAIR NN-MADD DT MARS PLSR ELM MLP
4 1 \mathbf{0.264} \mathbf{0.260} 0.328 0.430 0.327 0.504 0.476 0.512 \mathbf{0.256} \mathbf{0.260} 0.316
0 1.05 0.498 0.381 0.476 0.498 0.505 \mathbf{0.357} 0.532 0.524 0.480 0.452 0.483

Table 1 presents the average misclassification rate for these classification methods when the distributions differ either in mean or variance. When there is only a mean difference, PLSR, ELM, SVM and GLDA perform the best. When there is only a variance difference, NN-MADD performs the best. However, NN-MADD performs poorly when there is only a mean difference. Therefore, there is a need to devise a new classification rule that works more universally.

The organization of the rest of the paper is as follows. In Section 2, we investigate the above example in more detail and propose new classification algorithms that are capable of distinguishing between the two classes in both scenarios. In Section 3, we evaluate the performance of the new algorithms in various simulation settings. To gain a deeper understanding of the new method, Section 4 explores key quantities and mechanisms underlying the new approach. The paper concludes with a brief discussion in Section 5.

2 Method and theory

2.1 Intuition

It is well known that the equality of two multivariate distributions, X\sim F_{X} and Y\sim F_{Y}, can be characterized by the equality of three distributions: the distribution of the inter-point distance between two observations from F_{X}, the distribution of the inter-point distance between two observations from F_{Y}, and the distribution of the distance between one observation from F_{X} and one from F_{Y} (Maa et al., 1996). We utilize this fact as the basis for our approach.

We begin by examining the inter-point distances of observations in both settings shown in Table 1. Heatmaps of these distances in a typical simulation run are shown in the top panel of Figure 1, where the data is arranged in the order of X_{1},X_{2},\dots,X_{50} and Y_{1},Y_{2},\dots,Y_{50}. To better visualize the patterns, we also include data with larger differences in the bottom panel of Figure 1: \mu_{0}=15, a=1 (left) and \mu_{0}=0, a=1.15 (right).

Figure 1: Heatmap of inter-point distances for observations generated from the two settings in Table 1 (top panel) and with larger differences (bottom panel).

We denote the distance between two observations both from class X as D_{XX}, the distance between one observation from class X and the other from class Y as D_{XY}, and the distance between two observations both from class Y as D_{YY}. We see that, under the mean difference setting, the between-group distance (D_{XY}) tends to be larger than the within-group distances (D_{XX}, D_{YY}) (left panel of Figure 1). However, under the variance difference setting, where class Y has the larger variance, we see that D_{XX}<D_{XY}<D_{YY} in general (right panel of Figure 1). This phenomenon is due to the curse of dimensionality: the volume of the d-dimensional space increases exponentially in d, causing observations from a distribution with a larger variance to scatter farther apart compared to those from a distribution with a smaller variance. This behavior of high-dimensional data has been discussed in Chen and Friedman (2017).
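
This ordering is easy to observe numerically; a small sketch (reusing the hypothetical make_two_class_sample helper above) computes the average squared within- and between-group distances:

import numpy as np

def mean_squared_distances(X, Y):
    # Average squared distances: within class X, between classes, within class Y.
    dxx = np.mean([np.sum((X[i] - X[j]) ** 2) for i in range(len(X)) for j in range(i + 1, len(X))])
    dxy = np.mean([np.sum((x - y) ** 2) for x in X for y in Y])
    dyy = np.mean([np.sum((Y[i] - Y[j]) ** 2) for i in range(len(Y)) for j in range(i + 1, len(Y))])
    return dxx, dxy, dyy

# With mu0 = 0 and a = 1.15 (variance difference only), one typically observes
# dxx < dxy < dyy, matching the right panels of Figure 1.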

Based on the observations above, we propose using (D_{XX},D_{XY}) and (D_{XY},D_{YY}) as summary statistics for classes X and Y, respectively. Specifically, under the mean difference setting, the between-group distance (D_{XY}) tends to be larger than the within-group distances (D_{XX}, D_{YY}). This creates differences in both dimensions: between D_{XX} and D_{XY}, and between D_{XY} and D_{YY}. Under the variance difference setting, we observe D_{XX}<D_{XY}<D_{YY} (or D_{XX}>D_{XY}>D_{YY} if class X has the larger variance), so differences exist in both dimensions as well. In either scenario, the summary statistic is distinguishable between the two classes.

This idea can also be extended to k-class classification problems. For a k-class classification problem with class labels 1,2,\dots,k, let D_{ij} be the distance between one observation from the i^{th} class and one from the j^{th} class, and let D_{ii} be the inter-point distance between two observations from the i^{th} class. We can use D_{i}=(D_{i1},D_{i2},\dots,D_{ik}) as the summary statistic for the i^{th} class. For two classes i and j that differ in distribution, (D_{ii},D_{ij}) and (D_{ij},D_{jj}) also differ in distribution. Hence, the summary statistic D_{i} can distinguish class i from the other classes.

We provide two classification approaches. Approach 1 is based on the pairwise distances, while Approach 2 is based on the ranks of pairwise distances. The rank-based approach inherits the good properties of ranks and is more robust to outliers than the distance-based approach (see Section 3.4).

2.2 Proposed method

Let S=\left\{(Z_{i},g_{i}),\ i=1,2,\dots,N\right\} be a training set consisting of N observations, where Z_{i}\in\mathbb{R}^{d} represents the i-th observation and g_{i}\in\left\{1,2,\dots,k\right\} is its corresponding class label. We assume that all Z_{i}’s are independent and unique, and that if g_{i}=j, then Z_{i} is drawn from the distribution F_{j}. Let n_{j}=\sum_{i=1}^{N}\mathbbm{1}\left\{g_{i}=j\right\} denote the number of observations in the training set that belong to class j. The goal is to classify a new observation W\in\mathbb{R}^{d} that is generated from one of the k distributions \left\{F_{j}\right\}_{j=1}^{k} to one of the k classes.

The distance-based approach and the rank-based approach are described in Algorithm 1 and Algorithm 2, respectively. In the distance-based approach, we first compute the pairwise distance matrix D on the training set S and the distance mean matrix M^{(D)}. Then, we compute a distance vector D_{W}, which contains all the pairwise distances from the new observation W to the training set, and the group distance mean vector M_{W}^{(D)}. The last step is to classify W by comparing the group distance mean vector M_{W}^{(D)} with M^{(D)}. In the rank-based approach, we add steps to compute the rank matrix (R) and rank vector (R_{W}), along with the rank mean matrix (M^{(R)}) and rank mean vector (M^{(R)}_{W}), and classify W based on the rank mean vector in the last step.

Algorithm 1 Distance-based classification

1. Construct a distance matrix D\in\mathbb{R}^{N\times N}: D[i,j]=\|Z_{i}-Z_{j}\|_{2}^{2}.

2. Construct a distance mean matrix M^{(D)}\in\mathbb{R}^{N\times k}:
M^{(D)}[i,j]:=M^{(D)}_{j}(Z_{i})=\frac{1}{n_{j}-\mathbbm{1}\left\{g_{i}=j\right\}}\sum_{g_{l}=j,l\neq i}D[i,l].

3. Construct a distance vector D_{W}=(D_{W,1},D_{W,2},\dots,D_{W,N})\in\mathbb{R}^{N}, where D_{W,i}=\|W-Z_{i}\|_{2}^{2}.

4. Construct a group distance mean vector M^{(D)}_{W}=(M^{(D)}_{1}(W),M^{(D)}_{2}(W),\dots,M^{(D)}_{k}(W))\in\mathbb{R}^{k}, where
M^{(D)}_{i}(W)=\frac{1}{n_{i}}\sum_{g_{l}=i}D_{W,l}.

5. Use QDA to classify M^{(D)}_{W}:
g^{(D)}(W)=\arg\max_{j}\left\{-\frac{1}{2}\log|\hat{\Sigma}^{(D)}_{j}|-\frac{1}{2}(M^{(D)}_{W}-\hat{\mu}_{j}^{(D)})^{\top}\hat{\Sigma}_{j}^{{(D)}^{-1}}(M^{(D)}_{W}-\hat{\mu}^{(D)}_{j})+\log\frac{n_{j}}{N}\right\},
where \hat{\mu}^{(D)}_{j}=\frac{1}{n_{j}}\sum_{i=1}^{N}M_{j}^{(D)}(Z_{i})\mathbbm{1}\left\{g_{i}=j\right\} and
\hat{\Sigma}^{(D)}_{j}=\frac{1}{n_{j}-1}\sum_{i=1}^{N}\left(M_{j}^{(D)}(Z_{i})-\hat{\mu}^{(D)}_{j}\right)\left(M_{j}^{(D)}(Z_{i})-\hat{\mu}^{(D)}_{j}\right)^{\top}\mathbbm{1}\left\{g_{i}=j\right\}.
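
A compact NumPy/scikit-learn sketch of Algorithm 1 follows (function names are ours, class labels are assumed to be 0,\dots,k-1, and scikit-learn’s QDA is used as a convenient stand-in for the closed-form rule in step 5):

import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def distance_mean_features(Z, g, k):
    # Steps 1-2: squared Euclidean distance matrix and per-class mean distances.
    # D[i, i] = 0, so summing over the whole class and adjusting the denominator
    # by 1{g_i = j} reproduces step 2 exactly.
    D = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    M = np.zeros((Z.shape[0], k))
    for j in range(k):
        mask = (g == j)
        M[:, j] = D[:, mask].sum(axis=1) / (mask.sum() - mask.astype(int))
    return M

def distance_mean_of_new(W, Z, g, k):
    # Steps 3-4: distances from the new observation W to the training set,
    # averaged within each class.
    dW = np.sum((Z - W) ** 2, axis=1)
    return np.array([dW[g == j].mean() for j in range(k)])

def classify_distance_based(Z, g, W_new, k):
    # Step 5: QDA on the k-dimensional summary vectors.
    M = distance_mean_features(Z, g, k)
    qda = QuadraticDiscriminantAnalysis().fit(M, g)
    M_W = np.vstack([distance_mean_of_new(w, Z, g, k) for w in W_new])
    return qda.predict(M_W)

For the two-class example of Section 1, one would take Z = np.vstack([X, Y]) and g = np.r_[np.zeros(50, int), np.ones(50, int)].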

Algorithm 2 Rank-based classification

1. Construct a distance matrix D\in\mathbb{R}^{N\times N}, where D[i,j]=\|Z_{i}-Z_{j}\|.

2. Construct a distance rank matrix R\in\mathbb{R}^{N\times N}, where R[i,j]=\textit{the rank of}\ D[i,j]\ \textit{in the}\ j^{th}\ \textit{column}.

3. Construct a rank mean matrix M^{(R)}\in\mathbb{R}^{N\times k}:
M^{(R)}[i,j]:=M^{(R)}_{j}(Z_{i})=\frac{1}{n_{j}-\mathbbm{1}\left\{g_{i}=j\right\}}\sum_{g_{l}=j,l\neq i}R[i,l].

4. Construct a distance vector D_{W}=(D_{W,1},D_{W,2},\dots,D_{W,N})\in\mathbb{R}^{N}, where D_{W,i}=\|W-Z_{i}\|.

5. Construct a rank vector R_{W}=(R_{W,1},R_{W,2},\dots,R_{W,N})\in\mathbb{R}^{N}, where
R_{W,i}=\frac{1}{2}+\sum_{t=1}^{N}\mathbbm{1}\left\{\|Z_{t}-Z_{i}\|<\|W-Z_{i}\|\right\}+\frac{1}{2}\sum_{t=1}^{N}\mathbbm{1}\left\{\|Z_{t}-Z_{i}\|=\|W-Z_{i}\|\right\}.

6. Construct a group rank mean vector M^{(R)}_{W}=(M^{(R)}_{1}(W),M^{(R)}_{2}(W),\dots,M^{(R)}_{k}(W))\in\mathbb{R}^{k}, where
M^{(R)}_{i}(W)=\frac{1}{n_{i}}\sum_{g_{l}=i}R_{W,l}.

7. Use QDA to classify M^{(R)}_{W}:
g(W)=\arg\max_{j}\left\{-\frac{1}{2}\log|\hat{\Sigma}_{j}|-\frac{1}{2}(M^{(R)}_{W}-\hat{\mu}_{j})^{\top}\hat{\Sigma}_{j}^{-1}(M^{(R)}_{W}-\hat{\mu}_{j})+\log\frac{n_{j}}{N}\right\},
where \hat{\mu}_{j}=\frac{1}{n_{j}}\sum_{i=1}^{N}M^{(R)}_{j}(Z_{i})\mathbbm{1}\left\{g_{i}=j\right\} and \hat{\Sigma}_{j}=\frac{1}{n_{j}-1}\sum_{i=1}^{N}(M^{(R)}_{j}(Z_{i})-\hat{\mu}_{j})(M^{(R)}_{j}(Z_{i})-\hat{\mu}_{j})^{\top}\mathbbm{1}\left\{g_{i}=j\right\}.
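
A corresponding sketch of Algorithm 2 (again with our own function names and scikit-learn’s QDA standing in for step 7); SciPy’s rankdata provides the mid-ranks used in steps 2 and 5:

import numpy as np
from scipy.stats import rankdata
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def rank_mean_features(Z, g, k):
    # Steps 1-3: Euclidean distances, column-wise mid-ranks, then per-class mean
    # ranks excluding each point's own column.
    N = Z.shape[0]
    D = np.sqrt(np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1))
    R = np.apply_along_axis(rankdata, 0, D)   # R[i, j]: rank of D[i, j] within column j
    M = np.zeros((N, k))
    for j in range(k):
        own = np.where(g == j)[0]
        for i in range(N):
            cols = own[own != i]
            M[i, j] = R[i, cols].mean()
    return M

def rank_mean_of_new(W, Z, g, k, D):
    # Steps 4-6: mid-rank of ||W - Z_i|| among all ||Z_t - Z_i||, averaged by class.
    dW = np.sqrt(np.sum((Z - W) ** 2, axis=1))
    R_W = 0.5 + (D < dW[None, :]).sum(axis=0) + 0.5 * (D == dW[None, :]).sum(axis=0)
    return np.array([R_W[g == j].mean() for j in range(k)])

def classify_rank_based(Z, g, W_new, k):
    # Step 7: QDA on the k-dimensional rank-mean vectors.
    D = np.sqrt(np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1))
    M = rank_mean_features(Z, g, k)
    qda = QuadraticDiscriminantAnalysis().fit(M, g)
    M_W = np.vstack([rank_mean_of_new(w, Z, g, k, D) for w in W_new])
    return qda.predict(M_W)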

Remark 1.

In the first step of both algorithms, the distance can be chosen as the Euclidean distance or any other suitable distance measure. In this paper, we default to the Euclidean distance. In the last step of both algorithms, the task is to classify M^{(\cdot)}_{W} based on all M^{(\cdot)}(Z_{i})’s, i=1,2,\dots,N. Since M^{(\cdot)}_{W} is k-dimensional, where k is the number of classes, other low-dimensional classification methods can also be used.

To illustrate how the first three steps perform in separating the different classes, for the four datasets in Figure 1, we plot the distance mean vectors M^{(D)}(Z_{i}), i=1,2,\dots,100, in Figure 2 and the rank mean vectors M^{(R)}(Z_{i}), i=1,2,\dots,100, in Figure 3. As shown in Figures 2 and 3, the classes are well separated, indicating the effectiveness of the first three steps.

Figure 2: The scatter plot of the distance mean vector M^{(D)}(\cdot) for the four datasets in Figure 1. The decision boundary is given by quadratic discriminant analysis.
Figure 3: The scatter plot of the rank mean vector M^{(R)}(\cdot) for the four datasets in Figure 1. The decision boundary is given by quadratic discriminant analysis.
Table 2: Misclassification rate under the same settings as in Table 1
\mu_{0} a Dist Rank GLDA SVM RF FAIR NN-MADD PLSR ELM MLP
4 1 \mathbf{0.258} 0.273 \mathbf{0.264} \mathbf{0.260} 0.328 0.327 0.504 \mathbf{0.256} \mathbf{0.260} 0.316
0 1.05 \mathbf{0.290} \mathbf{0.282} 0.498 0.381 0.476 0.505 0.357 0.480 0.452 0.483

We applied Algorithms 1 and 2 to the same two settings as in Table 1, and the results are presented in Table 2. We see that under the mean difference setting, the new methods perform similarly to PLSR, ELM, SVM and GLDA, which are the best performers in this setting. Under the variance difference setting, the new approaches outperform all other methods.

3 Performance comparisons

Here, we examine the performance of the new methods by comparing them to other classification methods under various settings. Because DT and MARS are not much better than random guessing under either mean or variance differences, as indicated in Table 1, they are omitted from the following comparisons.

3.1 Two-class classification

In each trial, we generate two independent samples, X_{1},X_{2},\dots,X_{50}\stackrel{\text{i.i.d.}}{\sim}F_{X} and Y_{1},Y_{2},\dots,Y_{50}\stackrel{\text{i.i.d.}}{\sim}F_{Y}, as the training set, with X_{i}=\mathbf{A}(x_{i1},x_{i2},\dots,x_{id})^{\top}, x_{ik}\stackrel{\text{i.i.d.}}{\sim}F_{x}, and Y_{j}=\mathbf{B}(y_{j1},y_{j2},\dots,y_{jd})^{\top}+\mu, y_{jk}\stackrel{\text{i.i.d.}}{\sim}F_{y}. We set \mathbf{A}\mathbf{A}^{\top}=\Sigma with \Sigma_{r,c}=0.1^{|r-c|}, d=1000, \mathbf{B}=a\mathbf{A}, and \mu=\mu_{0}\frac{\mu^{\prime}}{\|\mu^{\prime}\|}, with \mu^{\prime} a random vector generated from N_{d}(0,\mathbf{I}_{d}). The testing samples are W_{x_{1}},W_{x_{2}},\dots,W_{x_{50}}\stackrel{\text{i.i.d.}}{\sim}F_{X} and W_{y_{1}},W_{y_{2}},\dots,W_{y_{50}}\stackrel{\text{i.i.d.}}{\sim}F_{Y}. We consider a few different scenarios (a sketch of the innovation draws follows the list):

  • Scenario S1: F_{x}=N(0,1), F_{y}=N(0,1);

  • Scenario S2: F_{x}=t_{5}, F_{y}=t_{5};

  • Scenario S3: F_{x}=\chi^{2}_{5}-5, F_{y}=\chi^{2}_{5}-5;

  • Scenario S4: F_{x}=N(0,1), F_{y}=t_{5}.
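
The only ingredient that changes across S1-S4 is the innovation distribution; a sketch of the draws, assuming the same construction of \mathbf{A}, \mathbf{B}=a\mathbf{A} and \mu as in the sketch of Section 1:

import numpy as np

def draw_innovations(name, size, rng):
    # Centered innovation distributions used in scenarios S1-S4.
    if name == "normal":
        return rng.standard_normal(size)
    if name == "t5":
        return rng.standard_t(5, size=size)
    if name == "chisq5":
        return rng.chisquare(5, size=size) - 5.0
    raise ValueError(name)

rng = np.random.default_rng(0)
# e.g. scenario S4: F_x = N(0,1), F_y = t_5 (A, a, mu constructed as before)
# X = draw_innovations("normal", (50, 1000), rng) @ A.T
# Y = a * draw_innovations("t5", (50, 1000), rng) @ A.T + mu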

Table 3 shows the average misclassification rates from 50 trials. We see that when there is only a mean difference, the new methods (‘Dist’ and ‘Rank’) have a misclassification rate close to that of the best of the other methods. When there is only a variance difference, the new methods perform the best among all the methods. When there are both mean and variance differences, the new methods again have the lowest misclassification rate.

Table 3: Two-class misclassification rate for 1000-dimensional data (those smaller than “the lowest misclassification rate + 0.01” are in bold)
Scenario \mu_{0} a Dist Rank GLDA SVM RF Boosting FAIR NN-MADD PLSR ELM MLP
S1 6 1 \mathbf{0.022} \mathbf{0.027} \mathbf{0.027} \mathbf{0.023} 0.115 0.354 0.077 0.343 \mathbf{0.028} \mathbf{0.020} 0.146
S1 0 1.1 \mathbf{0.019} \mathbf{0.020} 0.506 0.106 0.425 0.465 0.507 \mathbf{0.025} 0.412 0.440 0.472
S1 6 1.1 \mathbf{0.002} \mathbf{0.003} 0.044 \mathbf{0.005} 0.107 0.368 0.107 0.024 0.064 0.076 0.201
S2 6 1 \mathbf{0.094} 0.109 \mathbf{0.102} \mathbf{0.092} 0.161 0.385 0.170 0.476 0.116 \mathbf{0.104} 0.218
S2 0 1.1 \mathbf{0.105} \mathbf{0.100} 0.483 0.172 0.424 0.473 0.493 0.145 0.536 0.548 0.534
S2 6 1.1 \mathbf{0.041} \mathbf{0.042} 0.128 \mathbf{0.051} 0.162 0.389 0.197 0.140 0.176 0.188 0.249
S3 6 1 \mathbf{0.181} 0.198 \mathbf{0.191} 0.184 \mathbf{0.172} 0.378 0.335 0.261 0.487 0.336 0.422
S3 0 1.1 \mathbf{0.070} \mathbf{0.071} 0.478 0.141 0.345 0.460 0.495 0.092 0.504 0.532 0.473
S3 6 1.1 \mathbf{0.070} \mathbf{0.069} 0.417 0.124 0.278 0.431 0.434 0.096 0.428 0.472 0.431
S4 0 1 \mathbf{0.276} \mathbf{0.278} 0.500 0.380 0.483 0.487 0.506 0.361 0.476 0.536 0.485

3.2 Multi-class classification

In each trial, we randomly generate n=50 observations from each of four distributions, Z_{ij}\stackrel{\text{i.i.d.}}{\sim}F_{i}, i=1,\dots,4, j=1,\dots,50, as the training set, with Z_{ij}=a_{i}\mathbf{A}(z_{ij1},z_{ij2},\dots,z_{ijd})^{\top}+\mu_{i}, z_{ijk}\stackrel{\text{i.i.d.}}{\sim}F_{z}, i=1,\dots,4, j=1,2,\dots,50. We set d=1000, \mathbf{A}\mathbf{A}^{\top}=\Sigma, \Sigma_{r,c}=0.1^{|r-c|}, and \mu=\mu_{0}\frac{\mu^{\prime}}{\|\mu^{\prime}\|}, with \mu^{\prime} a random vector generated from N_{d}(0,\mathbf{I}_{d}). The a_{i}’s and \mu_{i}’s are set as follows: a_{1}=a_{3}=1, a_{2}=a_{4}=1.1; \mu_{1}=\mu_{2}=0, \mu_{3}=\mu_{4}=\mu, \mu_{0}=12. Under these settings, the four distributions have two different means and two different variances. The testing samples are W_{ij}\stackrel{\text{i.i.d.}}{\sim}F_{i}, i=1,\dots,4, j=1,\dots,50. We consider the following scenarios:

  • Scenario S5: F_{z}=N(0,1);

  • Scenario S6: F_{z}=t_{5};

  • Scenario S7: F_{z}=\chi^{2}_{5}-5.

Table 4: Multi-class misclassification rate for 1000-dimensional data (those smaller than “the lowest misclassification rate + 0.01” are in bold)
Scenario Dist Rank GLDA SVM RF Boosting NN-MADD PLSR ELM MLP
S5 \mathbf{0.028} \mathbf{0.023} 0.495 0.118 0.444 0.502 0.640 0.498 0.479 0.482
S6 \mathbf{0.125} \mathbf{0.133} 0.495 0.190 0.456 0.513 0.603 0.506 0.488 0.478
S7 \mathbf{0.216} \mathbf{0.216} 0.584 0.306 0.482 0.583 0.696 0.505 0.573 0.585

The average misclassification rates over 50 trials are shown in Table 4 (FAIR can only be applied to two-class problems and is not included here). We see that the new methods have the lowest misclassification rate across all these scenarios.

3.3 Network data classification

We generate random graphs using the configuration model G(v,\vec{k}), where v is the number of vertices and \vec{k} is a vector containing the degrees of the v vertices, with k_{i} assigned to vertex v_{i}. In each trial, we generate two independent samples, \left\{X_{1},X_{2},\dots,X_{30}\right\} and \left\{Y_{1},Y_{2},\dots,Y_{30}\right\}, with degree vectors \vec{k}^{x} and \vec{k}^{y}, respectively. The testing samples are W_{x_{1}},W_{x_{2}},\dots,W_{x_{20}}\stackrel{\text{i.i.d.}}{\sim}G(v,\vec{k}^{x}) and W_{y_{1}},W_{y_{2}},\dots,W_{y_{20}}\stackrel{\text{i.i.d.}}{\sim}G(v,\vec{k}^{y}). We consider the following scenarios:

  • Scenario S8: k^{x}_{1}=k^{x}_{2}=\dots=k^{x}_{20}=1, k^{x}_{21}=k^{x}_{22}=\dots=k^{x}_{40}=3; k^{y}_{1}=k^{y}_{2}=\dots=k^{y}_{20-a}=1, k^{y}_{21-a}=k^{y}_{22-a}=\dots=k^{y}_{20}=2, k^{y}_{21}=k^{y}_{22}=\dots=k^{y}_{20+a}=4, k^{y}_{21+a}=k^{y}_{22+a}=\dots=k^{y}_{40}=3;

  • Scenario S9: k^{x}_{1}=k^{x}_{2}=\dots=k^{x}_{40}=20; k^{y}_{1}=k^{y}_{2}=\dots=k^{y}_{40-a}=20, k^{y}_{41-a}=k^{y}_{42-a}=\dots=k^{y}_{40}=30.

When comparing the performance of the methods, we convert the network data with 40 nodes into 40\times 40 adjacency matrices, which are further converted into 1600-dimensional vectors. We do this conversion so that all methods in the comparison can be applied. Our approach, in contrast, can be applied to network data directly by using a distance on network data in step 1 of Algorithms 1 and 2.
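
For illustration, a sketch of this pipeline using NetworkX’s configuration model (how parallel edges and self-loops are handled is our assumption; the paper does not specify):

import numpy as np
import networkx as nx

def graph_to_vector(deg_seq, seed=None):
    # One configuration-model graph, flattened to a vector of adjacency entries.
    G = nx.configuration_model(deg_seq, seed=seed)
    G = nx.Graph(G)                             # collapse parallel edges (assumption)
    G.remove_edges_from(nx.selfloop_edges(G))   # drop self-loops (assumption)
    A = nx.to_numpy_array(G, nodelist=range(len(deg_seq)))
    return A.reshape(-1)                        # 40 x 40 adjacency -> 1600-dim vector

# Scenario S8 with a = 5: degree sequences for the two classes.
k_x = [1] * 20 + [3] * 20
k_y = [1] * 15 + [2] * 5 + [4] * 5 + [3] * 15
X = np.vstack([graph_to_vector(k_x, seed=s) for s in range(30)])
Y = np.vstack([graph_to_vector(k_y, seed=1000 + s) for s in range(30)])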

The results are presented in Table 5. We see that the new method is among the best performers in all settings, while the other good performers work well under some settings but fail in others.

Table 5: Network data misclassification rate (those smaller than “the lowest misclassification rate + 0.01” are in bold)
Scenario a Dist Rank GLDA SVM RF Boosting NN-MADD PLSR ELM MLP
S8 5 0.194 \mathbf{0.054} 0.377 0.382 0.405 0.445 0.119 0.318 0.411 0.433
S8 10 0.089 \mathbf{0.011} 0.342 0.321 0.371 0.462 \mathbf{0.008} 0.306 0.352 0.411
S8 15 0.017 \mathbf{0.002} 0.331 0.326 0.401 0.483 \mathbf{0.001} 0.376 0.294 0.427
S8 20 \mathbf{0.006} \mathbf{0.000} 0.315 0.290 0.380 0.470 \mathbf{0.000} 0.476 0.482 0.504
S9 4 0.118 \mathbf{0.100} \mathbf{0.090} 0.151 0.220 0.404 0.386 0.254 0.458 0.482
S9 8 0.020 \mathbf{0.007} \mathbf{0.004} 0.026 0.127 0.344 0.164 0.203 0.077 0.392
S9 12 \mathbf{0.006} \mathbf{0.001} \mathbf{0.004} \mathbf{0.004} 0.059 0.289 0.080 0.088 \mathbf{0.005} 0.208
S9 16 \mathbf{0.000} \mathbf{0.000} \mathbf{0.001} \mathbf{0.001} 0.041 0.271 0.045 0.059 \mathbf{0.000} 0.213

3.4 Robustness analysis

If there are outliers in the data, the distance-based approach is much less robust. Consider the simulation setting of Scenario S1 (multivariate normal distribution) in Section 3.1, but contaminated by outliers X_{i}\sim(5(a-1)+1)\mathbf{A}(x_{i1},x_{i2},\dots,x_{id})^{\top}+5\mu, i=1,2,\dots,n_{o}, where n_{o} is the number of outliers; all other observations are simulated in the same way as before. Table 6 shows the misclassification rates of the two approaches. We see that with outliers, the distance-based approach has a much higher misclassification rate than the rank-based approach. Therefore, the rank-based approach is recommended in practice for its robustness. However, under the ideal scenario of no outliers, the performance of the two approaches is similar, and we can study the distance-based approach as an approximation to the rank-based version, since the former is much easier to analyze.
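
A sketch of this contamination for Scenario S1 (helper name is ours; the clean observations are generated as before):

import numpy as np

def contaminate_class_x(X, A, mu, a, n_o, seed=None):
    # Replace the first n_o observations of class X by outliers drawn with
    # inflated scale (5(a-1)+1) and shifted mean 5*mu, as in Section 3.4.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    X = X.copy()
    X[:n_o] = (5 * (a - 1) + 1) * rng.standard_normal((n_o, d)) @ A.T + 5 * mu
    return X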

Table 6: The two-class misclassification rates of the two approaches when the data is contaminated with outliers
\mu_{0} a n_{o} Rank Dist \mu_{0} a n_{o} Rank Dist
0 1.1 1 0.0243 0.3081 0 1.1 3 0.0350 0.3246
6 1 1 0.0303 0.1275 6 1 3 0.0439 0.1410
6 1.1 1 0.0114 0.1269 6 1.1 3 0.0262 0.1510
0 1.1 5 0.0433 0.3352 0 1.1 7 0.0396 0.3378
6 1 5 0.0530 0.1694 6 1 7 0.0708 0.2287
6 1.1 5 0.0451 0.1747 6 1.1 7 0.0657 0.2265

4 Explore quantities that play important roles

In this section, we aim to explore the key factors that play important roles in the framework. We consider the following two-class setting (\star):

X_{i}=\mathbf{A}(x_{i1},x_{i2},\dots,x_{id})^{\top},\quad i=1,2,\dots,n;
\text{where }\mathbf{A}\in\mathbb{R}^{d\times d},\ x_{ik}\stackrel{i.i.d.}{\sim}F_{x},\ \text{with }\mathbb{E}x_{11}=0,\ \mathbb{E}x_{11}^{4}<\infty;
Y_{j}=\mathbf{B}(y_{j1},y_{j2},\dots,y_{jd})^{\top}+\mu,\quad j=1,2,\dots,m;
\text{where }\mathbf{B}\in\mathbb{R}^{d\times d},\ y_{jk}\stackrel{i.i.d.}{\sim}F_{y},\ \text{with }\mathbb{E}y_{11}=0,\ \mathbb{E}y_{11}^{4}<\infty.

We use this setting (\star) as the difference between the two classes can be controlled via F_{x}, F_{y}, \mathbf{A} and \mathbf{B}. Our objective is to approximate the misclassification rate using quantities derived from F_{x}, F_{y}, \mathbf{A} and \mathbf{B}. Directly handling the rank-based approach is challenging due to the complexities related to ranks; therefore, we work on the distance-based approach, which offers similar performance but is more manageable.

Define

D(X_{i})=\binom{D_{X}(X_{i})}{D_{Y}(X_{i})},\quad D(Y_{i})=\binom{D_{X}(Y_{i})}{D_{Y}(Y_{i})},

where D_{X}(X_{i})=\frac{1}{n-1}\sum_{k\neq i}\|X_{k}-X_{i}\|_{2}^{2}, D_{Y}(X_{i})=\frac{1}{m}\sum_{k=1}^{m}\|Y_{k}-X_{i}\|_{2}^{2}, D_{X}(Y_{i})=\frac{1}{n}\sum_{k=1}^{n}\|X_{k}-Y_{i}\|_{2}^{2}, and D_{Y}(Y_{i})=\frac{1}{m-1}\sum_{k\neq i}\|Y_{k}-Y_{i}\|_{2}^{2}. Under Setting (\star), we can compute the expectation and covariance matrix of D(X_{i}) and D(Y_{i}) through the following theorems. The proofs for these theorems are provided in the Supplemental materials.

Theorem 4.1.

Let \left\{X_{i}\right\}_{i=1}^{n} and \left\{Y_{j}\right\}_{j=1}^{m} be generated from Setting (\star). Then

\mathbb{E}(D(X_{i}))=\mu_{D_{X}}=f(\mathbf{A},\mathbf{B},\mu,F_{x},F_{y}),
\mathbb{E}(D(Y_{i}))=\mu_{D_{Y}}=f(\mathbf{B},\mathbf{A},-\mu,F_{y},F_{x}),
\textsf{Cov}(D(X_{i}))=\Sigma_{D_{X}}=g(n,m,\mathbf{A},\mathbf{B},\mu,F_{x},F_{y}),
\textsf{Cov}(D(Y_{i}))=\Sigma_{D_{Y}}=g(m,n,\mathbf{B},\mathbf{A},-\mu,F_{y},F_{x}),

where f(\cdot)\in\mathbb{R}^{2} and g(\cdot)\in\mathbb{R}^{2\times 2} are given by

f(\mathbf{A},\mathbf{B},\mu,F_{x},F_{y})=\left(\begin{matrix}2\|\mathbf{A}\|_{F}^{2}\mathbb{E}x_{11}^{2}\\ \|\mathbf{A}\|_{F}^{2}\mathbb{E}x_{11}^{2}+\|\mathbf{B}\|_{F}^{2}\mathbb{E}y_{11}^{2}+\|\mu\|_{2}^{2}\end{matrix}\right),

g(n,m,\mathbf{A},\mathbf{B},\mu,F_{x},F_{y})=-f(\mathbf{A},\mathbf{B},\mu,F_{x},F_{y})f(\mathbf{A},\mathbf{B},\mu,F_{x},F_{y})^{\top}
+\left(\begin{array}{cc}\frac{1}{n-1}\left\langle\left(\begin{array}{l}h_{1}(\mathbf{A},\mathbf{A},\mathbf{A},\mathbf{0},\mathbf{0},F_{x},F_{x},F_{x})\\ h_{2}(\mathbf{A},\mathbf{A},\mathbf{0},F_{x},F_{x})\end{array}\right),\left(\begin{array}{c}1\\ n-2\end{array}\right)\right\rangle&h_{1}(\mathbf{A},\mathbf{A},\mathbf{B},\mathbf{0},\mu,F_{x},F_{x},F_{y})\\ h_{1}(\mathbf{A},\mathbf{A},\mathbf{B},\mathbf{0},\mu,F_{x},F_{x},F_{y})&\frac{1}{m}\left\langle\left(\begin{array}{l}h_{1}(\mathbf{A},\mathbf{B},\mathbf{B},\mu,\mu,F_{x},F_{y},F_{y})\\ h_{2}(\mathbf{A},\mathbf{B},\mu,F_{x},F_{y})\end{array}\right),\left(\begin{array}{c}1\\ m-1\end{array}\right)\right\rangle\end{array}\right),

and h_{1} and h_{2} are defined in Lemma 4.2.
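
As a quick sanity check, the f(\cdot) above is easy to evaluate and compare with sample means of D(X_{i}) from simulated data (a sketch; the helper name is ours):

import numpy as np

def f_vec(A, B, mu, Ex2, Ey2):
    # f(A, B, mu, F_x, F_y) from Theorem 4.1; only the second moments of the
    # centered innovation distributions enter.
    a2 = np.sum(A ** 2) * Ex2          # ||A||_F^2 * E[x_11^2]
    b2 = np.sum(B ** 2) * Ey2          # ||B||_F^2 * E[y_11^2]
    return np.array([2.0 * a2, a2 + b2 + np.sum(mu ** 2)])

# For standard normal innovations, Ex2 = Ey2 = 1, and the average of the vectors
# D(X_i) computed from simulated data should be close to f_vec(A, B, mu, 1, 1).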

Lemma 4.2.

For U=\mathbf{C}(u_{1},u_{2},\dots,u_{d})^{\top} and V=\mathbf{D}(v_{1},v_{2},\dots,v_{d})^{\top}, where u_{i}\stackrel{i.i.d.}{\sim}F_{u} with \mathbb{E}u=0, \mathbb{E}u^{4}<\infty, v_{i}\stackrel{i.i.d.}{\sim}F_{v} with \mathbb{E}v=0, \mathbb{E}v^{4}<\infty, and X_{i}=\mathbf{A}(x_{i1},x_{i2},\dots,x_{id})^{\top}, i=1,2,\dots,n, x_{ik}\stackrel{i.i.d.}{\sim}F_{x} with \mathbb{E}x_{11}=0, \mathbb{E}x_{11}^{4}<\infty, \mathbf{A},\mathbf{C},\mathbf{D}\in\mathbb{R}^{d\times d}, we have

h_{1}(\mathbf{A},\mathbf{C},\mathbf{D},\mu_{u},\mu_{v},F_{x},F_{u},F_{v})=\|\mathbf{C}\|_{F}^{2}\mathbb{E}u_{1}^{2}\|\mathbf{D}\|_{F}^{2}\mathbb{E}v_{1}^{2}+\|\mathbf{D}\|_{F}^{2}\mathbb{E}v_{1}^{2}\|\mu_{u}\|_{2}^{2}
+\sum_{i=1}^{d}\sum_{j=1}^{d}\Big(\sum_{k=1}^{d}a_{ik}^{2}a_{jk}^{2}\mathbb{E}x_{11}^{4}+\big(\sum_{k_{1}\neq k_{2}}a_{ik_{1}}^{2}a_{jk_{2}}^{2}+2\sum_{k_{1}\neq k_{2}}a_{ik_{1}}a_{ik_{2}}a_{jk_{1}}a_{jk_{2}}\big)(\mathbb{E}x_{11}^{2})^{2}\Big)
+(\|\mathbf{C}\|_{F}^{2}\mathbb{E}u_{1}^{2}+\|\mathbf{D}\|_{F}^{2}\mathbb{E}v_{1}^{2})\|\mathbf{A}\|_{F}^{2}\mathbb{E}x_{11}^{2}+(\|\mu_{u}\|_{2}^{2}+\|\mu_{v}\|_{2}^{2})\|\mathbf{A}\|_{F}^{2}\mathbb{E}x_{11}^{2}+\|\mu_{u}\|_{2}^{2}\|\mu_{v}\|_{2}^{2}
-2\sum_{i=1}^{d}\sum_{j=1}^{d}\Big(\sum_{k=1}^{d}a_{ik}^{2}a_{jk}(\mu_{uk}+\mu_{vk})\Big)\mathbb{E}x_{11}^{3}+\|\mathbf{C}\|_{F}^{2}\mathbb{E}u_{1}^{2}\|\mu_{v}\|_{2}^{2},

h_{2}(\mathbf{A},\mathbf{C},\mu_{u},F_{x},F_{u})=2\|\mu_{u}\|_{2}^{2}(\|\mathbf{C}\|_{F}^{2}\mathbb{E}u_{1}^{2}+\|\mathbf{A}\|_{F}^{2}\mathbb{E}x_{11}^{2})+4\|\mu_{u}\mathbf{C}\|_{F}^{2}\mathbb{E}u_{1}^{2}
+\sum_{i=1}^{d}\sum_{j=1}^{d}\Big(\sum_{k=1}^{d}c_{ik}^{2}c_{jk}^{2}\mathbb{E}u_{1}^{4}+\big(\sum_{k_{1}\neq k_{2}}c_{ik_{1}}^{2}c_{jk_{2}}^{2}+2\sum_{k_{1}\neq k_{2}}c_{ik_{1}}c_{ik_{2}}c_{jk_{1}}c_{jk_{2}}\big)(\mathbb{E}u_{1}^{2})^{2}\Big)
+\sum_{i=1}^{d}\sum_{j=1}^{d}\Big(\sum_{k=1}^{d}a_{ik}^{2}a_{jk}^{2}\mathbb{E}x_{11}^{4}+\big(\sum_{k_{1}\neq k_{2}}a_{ik_{1}}^{2}a_{jk_{2}}^{2}+2\sum_{k_{1}\neq k_{2}}a_{ik_{1}}a_{ik_{2}}a_{jk_{1}}a_{jk_{2}}\big)(\mathbb{E}x_{11}^{2})^{2}\Big)
+4\sum_{i=1}^{d}\sum_{j=1}^{d}\Big(\sum_{k=1}^{d}c_{ik}^{2}c_{jk}\mu_{k}\Big)\mathbb{E}u_{1}^{3}-4\sum_{i=1}^{d}\sum_{j=1}^{d}\Big(\sum_{k=1}^{d}a_{ik}^{2}a_{jk}\mu_{k}\Big)\mathbb{E}x_{11}^{3}+4\|\mu_{u}\mathbf{A}\|_{F}^{2}\mathbb{E}x_{11}^{2}
+2\|\mathbf{C}\|_{F}^{2}\mathbb{E}u_{1}^{2}\|\mathbf{A}\|_{F}^{2}\mathbb{E}x_{11}^{2}+\|\mu_{u}\|_{2}^{4}+4\|\mathbf{A}^{\top}\mathbf{C}\|_{F}^{2}\mathbb{E}u_{1}^{2}\mathbb{E}x_{11}^{2}.

For a testing sample, suppose W_{x}\sim F_{X} (the distribution of the X_{i}’s) and W_{y}\sim F_{Y} (the distribution of the Y_{j}’s). We can obtain the expectation and covariance matrix of D(W_{x}) and D(W_{y}) in a similar way, where D(W)=\binom{D_{X}(W)}{D_{Y}(W)} with D_{X}(W)=\frac{1}{n}\sum_{k=1}^{n}\|X_{k}-W\|_{2}^{2} and D_{Y}(W)=\frac{1}{m}\sum_{k=1}^{m}\|Y_{k}-W\|_{2}^{2}.

Theorem 4.3.

Under Setting (\star), the expectation and covariance matrix of D(W_{x}) and D(W_{y}) are given by:

\mathbb{E}(D(W_{x}))=\mu_{D_{W_{x}}}=f(\mathbf{A},\mathbf{B},\mu,F_{x},F_{y}),
\textsf{Cov}(D(W_{x}))=\Sigma_{D_{W_{x}}}=g(n+1,m,\mathbf{A},\mathbf{B},\mu,F_{x},F_{y}),
\mathbb{E}(D(W_{y}))=\mu_{D_{W_{y}}}=f(\mathbf{B},\mathbf{A},-\mu,F_{y},F_{x}),
\textsf{Cov}(D(W_{y}))=\Sigma_{D_{W_{y}}}=g(m+1,n,\mathbf{B},\mathbf{A},-\mu,F_{y},F_{x}).

If we further add constraints on \mathbf{A} and \mathbf{B}, we can obtain the asymptotic distribution of the distance mean vector D(\cdot).

Theorem 4.4.

In Setting (\star), let s=\lfloor d^{\frac{1}{5}}\rfloor, t=\lfloor d/s\rfloor, r=d-st,

\mathbf{A}=\left(\begin{matrix}a_{1}^{T}\\ a_{2}^{T}\\ \vdots\\ a_{d}^{T}\end{matrix}\right),\ \mathbf{B}=\left(\begin{matrix}b_{1}^{T}\\ b_{2}^{T}\\ \vdots\\ b_{d}^{T}\end{matrix}\right),\ R_{i}=\sum_{k=(i-1)s+1}^{is-m_{0}}\begin{pmatrix}\frac{1}{n-1}\sum_{j\neq i}^{n}(a_{k}^{T}X_{j}-a_{k}^{T}X_{1})^{2}\\ \frac{1}{m}\sum_{j=1}^{m}(b_{k}^{T}Y_{j}-a_{k}^{T}X_{1}+\mu_{k})^{2}\end{pmatrix},

and \Sigma_{t}=\textsf{Cov}(\sum_{i=1}^{t}R_{i}). If \mathbf{A} and \mathbf{B} are band matrices, with a_{ij}=0\ \forall\ |i-j|\geq m_{0} and b_{ij}=0\ \forall\ |i-j|\geq m_{0}, where m_{0} is a fixed number, \max\{|\mu_{i}|\}<\mu_{0} and \max\left\{|a_{ij}|,|b_{ij}|\right\}<c_{0}, with c_{0} and \mu_{0} fixed numbers, and \lim_{t\to\infty}\sum_{i=1}^{t}\mathbb{E}\left(\|\Sigma_{t}^{-\frac{1}{2}}R_{i}\|_{2}^{3}\right)\to 0, then \frac{1}{\sqrt{d}}(D(X_{i})-\mu_{D_{X}}) converges to a normal distribution as d\to\infty.

Remark 2.

A special case in which the condition \lim_{t\to\infty}\sum_{i=1}^{t}\mathbb{E}\left(\|\Sigma_{t}^{-\frac{1}{2}}R_{i}\|_{2}^{3}\right)\to 0 holds is when all \mu_{i}’s are the same, the non-zero elements of the a_{i}^{T}’s are the same, and the non-zero elements of the b_{i}^{T}’s are the same, i.e., \exists\ \alpha=(\alpha_{1},\alpha_{2},\dots,\alpha_{2m_{0}-1}) and \beta=(\beta_{1},\beta_{2},\dots,\beta_{2m_{0}-1}) such that \mathbf{A}_{ij}=\alpha_{m_{0}+j-i}, \mathbf{B}_{ij}=\beta_{m_{0}+j-i}, \forall\ |i-j|<m_{0}.

Remark 3.

We can also prove the asymptotic normality of \frac{1}{\sqrt{d}}(D(Y_{j})-\mu_{D(Y)}), \frac{1}{\sqrt{d}}(D(W_{x_{i}})-\mu_{D(W_{x})}) and \frac{1}{\sqrt{d}}(D(W_{y_{j}})-\mu_{D(W_{y})}) by changing the conditions in Theorem 4.4. By substituting \mathbf{A} with \mathbf{B}, \mathbf{B} with \mathbf{A}, \mu with -\mu, n with m and m with n, we obtain the conditions for \frac{1}{\sqrt{d}}(D(Y_{j})-\mu_{D(Y)}); by substituting n with n+1, we obtain the conditions for \frac{1}{\sqrt{d}}(D(W_{x_{i}})-\mu_{D(W_{x})}); and by substituting \mathbf{A} with \mathbf{B}, \mathbf{B} with \mathbf{A}, \mu with -\mu, n with m+1 and m with n, we obtain the conditions for \frac{1}{\sqrt{d}}(D(W_{y_{j}})-\mu_{D(W_{y})}).

It should be noted that the distance vector D_{W} depends on the pairwise distances in the distance matrix; therefore, M_{W}^{(D)} is dependent on the M_{j}^{(D)}(Z_{i})’s. We studied an independent version of the distance-based approach in Supplement S3 and found that this dependency has minimal impact on the misclassification rate. Therefore, we sample D_{W} independently in our estimation. By sampling D_{W} from 0.5N(\mu_{D_{W_{x}}},\Sigma_{D_{W_{x}}})+0.5N(\mu_{D_{W_{y}}},\Sigma_{D_{W_{y}}}) and applying the decision rule to each D_{W} with

\delta_{X}=-\frac{1}{2}\log|\Sigma_{D_{X}}|-\frac{1}{2}(D_{W}-\mu_{D_{X}})^{\top}\Sigma_{D_{X}}^{-1}(D_{W}-\mu_{D_{X}}),
\delta_{Y}=-\frac{1}{2}\log|\Sigma_{D_{Y}}|-\frac{1}{2}(D_{W}-\mu_{D_{Y}})^{\top}\Sigma_{D_{Y}}^{-1}(D_{W}-\mu_{D_{Y}}),

we can estimate the misclassification rate by simulating the last step.
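
A sketch of this Monte Carlo step (drawing equal numbers from the two mixture components, which is equivalent to sampling the 0.5/0.5 mixture and conditioning on the true class):

import numpy as np

def estimate_error(mu_DX, Sig_DX, mu_DY, Sig_DY,
                   mu_DWx, Sig_DWx, mu_DWy, Sig_DWy, B=100_000, seed=None):
    rng = np.random.default_rng(seed)

    def qda_score(D, mu, Sig):
        # delta_X / delta_Y from the decision rule above (equal priors).
        diff = D - mu
        return (-0.5 * np.log(np.linalg.det(Sig))
                - 0.5 * np.einsum("ij,jk,ik->i", diff, np.linalg.inv(Sig), diff))

    W_x = rng.multivariate_normal(mu_DWx, Sig_DWx, B)   # D_W for test points truly from class X
    W_y = rng.multivariate_normal(mu_DWy, Sig_DWy, B)   # D_W for test points truly from class Y
    err_x = np.mean(qda_score(W_x, mu_DX, Sig_DX) < qda_score(W_x, mu_DY, Sig_DY))
    err_y = np.mean(qda_score(W_y, mu_DY, Sig_DY) < qda_score(W_y, mu_DX, Sig_DX))
    return 0.5 * (err_x + err_y)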

We now test our estimation through numerical simulations. Under Setting (\star), we set n=50, m=50, d=2000, \mathbf{B}=a\mathbf{A}, and \mathbf{A}\mathbf{A}^{\top}=\Sigma with \Sigma_{r,c}=0.1^{|r-c|}, where \mu=\mu_{0}\frac{\mu^{\prime}}{\|\mu^{\prime}\|} and \mu^{\prime} is a random vector generated from N_{d}(0,\mathbf{I}_{d}). The testing samples are W_{x_{1}},W_{x_{2}},\dots,W_{x_{50}}\stackrel{\text{i.i.d.}}{\sim}F_{X} and W_{y_{1}},W_{y_{2}},\dots,W_{y_{50}}\stackrel{\text{i.i.d.}}{\sim}F_{Y}. We consider the following scenarios:

  • Scenario S10: F_{x}=N(0,1), F_{y}=N(0,1); fix a=1 and change \mu_{0};

  • Scenario S11: F_{x}=N(0,1), F_{y}=N(0,1); fix \mu=0 and change a;

  • Scenario S12: F_{x}=t_{10}, F_{y}=t_{10}; fix a=1 and change \mu_{0};

  • Scenario S13: F_{x}=t_{10}, F_{y}=t_{10}; fix \mu=0 and change a.

Figure 4: Analytic (red) and simulated (rank-based approach: green; distance-based approach: blue) misclassification rates for scenarios S10 (upper-left panel), S11 (upper-right panel), S12 (bottom-left panel), and S13 (bottom-right panel).

Figure 4 compares the analytic misclassification rate, calculated using the formulas from this section, with the simulated misclassification rate, obtained from 50 simulation runs of Algorithms 1 and 2 under these scenarios. We see that the analytic misclassification rates closely match the simulated ones. This suggests that the formulas used for estimating the misclassification rate effectively capture the key quantities from the distributions that are critical for the proposed approaches.

By examining these formulas, we see that the first and second moments of F_{x} and F_{y} are particularly significant (through the f(\cdot) function). This likely explains why the proposed approaches perform well in scenarios with differences in means and/or variances. Additionally, the third and fourth moments of F_{x} and F_{y} also contribute, albeit to a lesser extent, through the g(\cdot) function.

5 Conclusion

We propose a novel framework for high-dimensional classification that utilizes common patterns found in high-dimensional data under both the mean difference and variance difference scenarios. This framework exhibits superior performance in high-dimensional data classification and network data classification. Additionally, we provide theoretical analysis to understand the key quantities that influence the method’s performance.

While the l_{2} distance is the default choice for computing pairwise distances in the proposed algorithms, the framework is not limited to this measure and can be extended to other types of distance measures. Exploring the performance of the method with different distances is an interesting avenue for future research, and we plan to investigate this further.

Appendix A Discussion about multilayer perceptron (MLP)

In this section, we examine the impact of different parameters in a multilayer perceptron (MLP) on high-dimensional classification tasks. All simulations are conducted under Setting (\star) with F_{x}=N(0,1), F_{y}=N(0,1), d=1000, \mathbf{B}=a\mathbf{A}, \mathbf{A}\mathbf{A}^{\top}=\Sigma, and \mu=\mu_{0}\frac{\mu^{\prime}}{\|\mu^{\prime}\|}. Here, \Sigma_{r,c}=0.1^{|r-c|}, and \mu^{\prime} is a random vector generated from N_{d}(\mathbf{0}_{d},\mathbf{I}_{d}). The testing samples are W_{x_{1}},W_{x_{2}},\dots,W_{x_{50}}\stackrel{\text{i.i.d.}}{\sim}F_{X} and W_{y_{1}},W_{y_{2}},\dots,W_{y_{50}}\stackrel{\text{i.i.d.}}{\sim}F_{Y}.

For the MLP structure, we used three hidden layers, each with the same number of nodes and the same activation function, and an output layer with the softmax activation function. We first tested the performance of different activation functions with 1,000 nodes in each layer and 800 training samples. The results are shown in Table 7. We observed that the “relu” (x\mathbbm{1}\left\{x>0\right\}) and “softplus” (\ln(1+e^{x})) activation functions slightly outperformed the “gelu” (\frac{1}{2}x(1+\mathrm{erf}(\frac{x}{\sqrt{2}}))) function when classifying mean differences. However, only the “gelu” function works for classifying variance differences. Therefore, we selected the “gelu” activation function for the subsequent simulations.
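
A sketch of this architecture in Keras (the optimizer, loss, and number of training epochs are our assumptions; the paper specifies only the layer structure and activation functions):

import tensorflow as tf
from tensorflow.keras import layers

def build_mlp(d=1000, k=2, width=1000, activation="gelu"):
    # Three equal-width hidden layers and a softmax output layer, as described above.
    model = tf.keras.Sequential(
        [layers.Dense(width, activation=activation, input_shape=(d,))]
        + [layers.Dense(width, activation=activation) for _ in range(2)]
        + [layers.Dense(k, activation="softmax")]
    )
    # Training configuration below is illustrative, not taken from the paper.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model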

Table 7: The two-class misclassification rate of MLP with different activation functions
\mu_{0} a gelu relu tanh softplus selu
6 1 0.028 \mathbf{0.013} 0.195 \mathbf{0.006} 0.173
0 1.1 \mathbf{0.462} 0.502 0.496 0.498 0.488
0 1.4 \mathbf{0.057} 0.496 0.490 0.471 0.385

Next, we examine the impact of different training set sizes on the performance of the MLP with 1,000 nodes per layer and the “gelu” activation function. The results are shown in Table 8. We observe that, compared to other classification methods, the MLP requires significantly larger sample sizes to achieve optimal performance. In practice, the performance of MLP may be constrained by the available training set size.

Table 8: The two-class misclassification rate of MLP under different training set sizes
\mu_{0} a 100 200 400 800
6 1 0.146 0.055 0.035 0.028
0 1.1 0.610 0.495 0.470 0.462
0 1.4 0.410 0.275 0.130 0.057

Table 9 presents the performance of the other methods under the same settings as in Table 7. A comparison of these results reveals that the MLP performs comparably to the best-performing methods when classifying mean differences. However, when classifying variance differences, the MLP fails to perform effectively even when the signal is already large enough for the other effective methods (second row of Table 9).

Table 9: The two-class misclassification rates of other methods
\mu_{0} a Dist Rank GLDA SVM RF Boosting FAIR NN-MADD PLSR ELM
6 1 \mathbf{0.022} \mathbf{0.027} \mathbf{0.027} \mathbf{0.023} 0.115 0.354 0.077 0.343 \mathbf{0.028} \mathbf{0.020}
0 1.1 \mathbf{0.019} \mathbf{0.020} 0.506 0.106 0.425 0.465 0.507 \mathbf{0.025} 0.412 0.440
0 1.4 \mathbf{0.000} \mathbf{0.000} 0.457 \mathbf{0.000} 0.106 0.339 0.500 \mathbf{0.000} 0.459 0.470

References

  • Golub et al. [1999] Todd R Golub, Donna K Slonim, Pablo Tamayo, Christine Huard, Michelle Gaasenbeek, Jill P Mesirov, Hilary Coller, Mignon L Loh, James R Downing, Mark A Caligiuri, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. science, 286(5439):531–537, 1999.
  • Ayyad et al. [2019] Sarah M Ayyad, Ahmed I Saleh, and Labib M Labib. Gene expression cancer classification using modified k-nearest neighbors technique. Biosystems, 176:41–51, 2019.
  • Ye et al. [2009] Qiang Ye, Ziqiong Zhang, and Rob Law. Sentiment classification of online reviews to travel destinations by supervised machine learning approaches. Expert systems with applications, 36(3):6527–6535, 2009.
  • Bansal and Srivastava [2018] Barkha Bansal and Sangeet Srivastava. Sentiment classification of online consumer reviews using word vector representations. Procedia computer science, 132:1147–1153, 2018.
  • Burkhardt et al. [2010] Felix Burkhardt, Martin Eckert, Wiebke Johannsen, and Joachim Stegmann. A database of age and gender annotated telephone speech. In LREC. Malta, 2010.
  • Boser et al. [1992] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, 1992.
  • Brown et al. [1999] M Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Sugnet, Manuel Ares, and David Haussler. Support vector machine classification of microarray gene expression data. University of California, Santa Cruz, Technical Report UCSC-CRL-99-09, 1999.
  • Furey et al. [2000] Terrence S Furey, Nello Cristianini, Nigel Duffy, David W Bednarski, Michel Schummer, and David Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906–914, 2000.
  • Schölkopf et al. [2001] Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distribution. Neural computation, 13(7):1443–1471, 2001.
  • Ghaddar and Naoum-Sawaya [2018] Bissan Ghaddar and Joe Naoum-Sawaya. High dimensional data classification and feature selection using support vector machines. European Journal of Operational Research, 265(3):993–1004, 2018.
  • Hussain [2019] Syed Fawad Hussain. A novel robust kernel for classifying high-dimensional data using support vector machines. Expert Systems with Applications, 131:116–131, 2019.
  • Fisher [1936] Ronald A Fisher. The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2):179–188, 1936.
  • Rao [1948] C Radhakrishna Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society. Series B (Methodological), 10(2):159–203, 1948.
  • Li et al. [2005] Haifeng Li, Keshu Zhang, and Tao Jiang. Robust and accurate cancer classification with gene expression profiling. In 2005 IEEE Computational Systems Bioinformatics Conference (CSB’05), pages 310–321. IEEE, 2005.
  • Paliwal and Sharma [2012] Kuldip K Paliwal and Alok Sharma. Improved pseudoinverse linear discriminant analysis method for dimensionality reduction. International Journal of Pattern Recognition and Artificial Intelligence, 26(01):1250002, 2012.
  • Yang and Wu [2014] Wuyi Yang and Houyuan Wu. Regularized complete linear discriminant analysis. Neurocomputing, 137:185–191, 2014.
  • Cover and Hart [1967] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1):21–27, 1967.
  • Liu et al. [2006] Ting Liu, Andrew W Moore, Alexander Gray, and Claire Cardie. New algorithms for efficient high-dimensional nonparametric classification. Journal of Machine Learning Research, 7(6), 2006.
  • Tang et al. [2011] Lu-An Tang, Yu Zheng, Xing Xie, Jing Yuan, Xiao Yu, and Jiawei Han. Retrieving k-nearest neighboring trajectories by a set of point locations. In International Symposium on Spatial and Temporal Databases, pages 223–241. Springer, 2011.
  • Pal et al. [2016] Arnab K Pal, Pronoy K Mondal, and Anil K Ghosh. High dimensional nearest neighbor classification based on mean absolute differences of inter-point distances. Pattern Recognition Letters, 74:1–8, 2016.
  • Fan and Fan [2008] Jianqing Fan and Yingying Fan. High dimensional classification using features annealed independence rules. Annals of statistics, 36(6):2605, 2008.
  • Freund et al. [1996] Yoav Freund, Robert E Schapire, et al. Experiments with a new boosting algorithm. In icml, volume 96, pages 148–156. Citeseer, 1996.
  • Buehlmann [2006] Peter Buehlmann. Boosting for high-dimensional linear models. The Annals of Statistics, 34(2):559–583, 2006.
  • Mayr et al. [2012] Andreas Mayr, Nora Fenske, Benjamin Hofner, Thomas Kneib, and Matthias Schmid. Generalized additive models for location, scale and shape for high dimensional data—a flexible approach based on boosting. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61(3):403–427, 2012.
  • Breiman [2001] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  • Ye et al. [2013] Yunming Ye, Qingyao Wu, Joshua Zhexue Huang, Michael K Ng, and Xutao Li. Stratified sampling for feature subspace selection in random forests for high dimensional data. Pattern Recognition, 46(3):769–787, 2013.
  • Fernández-Delgado et al. [2014] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems? The journal of machine learning research, 15(1):3133–3181, 2014.
  • LeCun et al. [1989] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
  • Ranzato et al. [2006] Marc’Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann Cun. Efficient learning of sparse representations with an energy-based model. Advances in neural information processing systems, 19, 2006.
  • Shi et al. [2015] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. Advances in neural information processing systems, 28, 2015.
  • Cao et al. [2020] Xiangyong Cao, Jing Yao, Zongben Xu, and Deyu Meng. Hyperspectral image classification with convolutional neural network and active learning. IEEE Transactions on Geoscience and Remote Sensing, 58(7):4604–4616, 2020.
  • Deng et al. [2020] Muqing Deng, Tingting Meng, Jiuwen Cao, Shimin Wang, Jing Zhang, and Huijie Fan. Heart sound classification based on improved mfcc features and convolutional recurrent neural networks. Neural Networks, 130:22–32, 2020.
  • Zhang et al. [2021] Zhichao Zhang, Shugong Xu, Shunqing Zhang, Tianhao Qiao, and Shan Cao. Attention based convolutional recurrent neural network for environmental sound classification. Neurocomputing, 453:896–903, 2021.
  • Liu et al. [2016] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101, 2016.
  • Kingma et al. [2014] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. Advances in neural information processing systems, 27, 2014.
  • Zhou et al. [2020] Huaji Zhou, Licheng Jiao, Shilian Zheng, Lifeng Yang, Weiguo Shen, and Xiaoniu Yang. Generative adversarial network-based electromagnetic signal classification: A semi-supervised learning framework. China Communications, 17(10):157–169, 2020.
  • Radford et al. [2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • Kim and Hong [2021] Dahye Kim and Byung-Woo Hong. Unsupervised feature elimination via generative adversarial networks: application to hair removal in melanoma classification. IEEE Access, 9:42610–42620, 2021.
  • Salzberg [1994] Steven L Salzberg. C4. 5: Programs for machine learning by j. ross quinlan. morgan kaufmann publishers, inc., 1993, 1994.
  • Leathwick et al. [2005] JR Leathwick, D Rowe, J Richardson, Jane Elith, and T Hastie. Using multivariate adaptive regression splines to predict the distributions of new zealand’s freshwater diadromous fish. Freshwater Biology, 50(12):2034–2052, 2005.
  • Martens and Naes [1992] Harald Martens and Tormod Naes. Multivariate calibration. John Wiley & Sons, 1992.
  • Huang et al. [2011] Guang-Bin Huang, Hongming Zhou, Xiaojian Ding, and Rui Zhang. Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(2):513–529, 2011.
  • Popescu et al. [2009] Marius-Constantin Popescu, Valentina E Balas, Liliana Perescu-Popescu, and Nikos Mastorakis. Multilayer perceptron and neural networks. WSEAS Transactions on Circuits and Systems, 8(7):579–588, 2009.
  • Maa et al. [1996] Jen-Fue Maa, Dennis K Pearl, and Robert Bartoszyński. Reducing multidimensional two-sample data to one-dimensional interpoint comparisons. The annals of statistics, 24(3):1069–1074, 1996.
  • Chen and Friedman [2017] Hao Chen and Jerome H Friedman. A new graph-based two-sample test for multivariate and object data. Journal of the American statistical association, 112(517):397–409, 2017.