
On the Consistency of Optimal Bayesian Feature Selection in the Presence of Correlations

Ali Foroughi pour and Lori A. Dalton
Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210
Abstract

Optimal Bayesian feature selection (OBFS) is a multivariate supervised screening method designed from the ground up for biomarker discovery. In this work, we prove that Gaussian OBFS is strongly consistent under mild conditions, and provide rates of convergence for key posteriors in the framework. These results are of enormous importance, since they identify precisely what features are selected by OBFS asymptotically, characterize the relative rates of convergence for posteriors on different types of features, provide conditions that guarantee convergence, justify the use of OBFS when its internal assumptions are invalid, and set the stage for understanding the asymptotic behavior of other algorithms based on the OBFS framework.

MSC subject classifications: 62F15, 62C10, 62F07, 92C37.

Keywords: Bayesian decision theory, variable selection, biomarker discovery.

1 Introduction

Biomarker discovery aims to identify biological markers (genes or gene products) that lead to improved diagnostic or prognostic tests, better treatment recommendations, or advances in our understanding of the disease or biological condition of interest (Ilyin et al., 2004; Rifai et al., 2006; Ramachandran et al., 2008). Reliable and reproducible biomarker discovery has proven to be difficult (Diamandis, 2010); while one reason is that preliminary small-sample datasets are inherently difficult to analyze, another factor is that most popular algorithms and methods employed in bioinformatics have inherent limitations that make them unsuitable for many discovery applications.

Consider univariate filter methods like t-tests, which are perhaps the most ubiquitous throughout the bioinformatics literature. Since they focus on only one feature at a time, filter methods cannot take advantage of potential correlations between markers. In particular, they cannot identify pairs (or sets) of markers with tell-tale behaviors only when observed in tandem. Multivariate methods, on the other hand, can account for correlations, but the vast majority are wrapper or embedded feature selection algorithms designed to aid in classification or regression model reduction. Model construction is not a reasonable goal in most small-scale exploratory studies, particularly in biology where the expected number of variables is large and the nature of their interactions can be highly complex and context dependent. Furthermore, feature selection methods designed for model reduction intrinsically penalize redundant features and reward smaller feature sets (Sima and Dougherty, 2008; Awada et al., 2012; Ang et al., 2016; Li et al., 2017). This can be counterproductive in biomarker discovery; so much so that filter methods often far outperform multivariate methods (Sima and Dougherty, 2006, 2008; Foroughi pour and Dalton, 2018b).

Optimal Bayesian feature selection (OBFS) is a supervised multivariate screening method designed to address these problems (Dalton, 2013; Foroughi pour and Dalton, 2019). The OBFS modeling framework, discussed in detail in Section 2, assumes that features can be partitioned into two types of independent blocks: blocks of correlated “good” features that have distinct joint distributions between two classes, and blocks of correlated “bad” features that have identical joint distributions in all classes (Foroughi pour and Dalton, 2018a). Given distributional assumptions on each block, and priors on the distribution parameters, OBFS can be implemented using one of many selection criteria, for example the maximum number correct (MNC) criterion, which maximizes the posterior expected number of features correctly identified as belonging to good versus bad blocks, or the constrained MNC (CMNC) criterion, which additionally constrains the number of selected features. In this way, OBFS aims to optimally identify and rank the set of all features with distinct distributions between the classes, which represents the predicted set of biomarkers, while accounting for correlations.

Optimal Bayesian feature filtering (OBF) is a special case of OBFS that assumes that features are independent (blocks of size one) and the parameters governing each feature are independent (Foroughi pour and Dalton, 2015). Gaussian OBF (which assumes independent Gaussian features with conjugate priors, henceforth referred to as simply OBF) has very low computation cost and robust performance when its modeling assumptions are violated, and particularly excels at identifying strong markers with low correlations (Foroughi pour and Dalton, 2017a, 2018a). Furthermore, it has been shown that OBF is strongly consistent under very mild conditions, including cases where the features are dependent and non-Gaussian (Foroughi pour and Dalton, 2019). In particular, OBF, in the long run, selects precisely the set of features with distinct means or variances between the classes. The consistency theorem in (Foroughi pour and Dalton, 2019) formalizes our intuition that OBF cannot take advantage of correlations; although OBF identifies features that are individually discriminating, we see that it has no capacity to identify features that only discriminate when grouped together, or features that are merely correlated with (or linked through a chain of correlation with) other discriminating features.

In bioinformatics, features that only discriminate when grouped together and features that are correlated with other strong markers are of tremendous interest, since they may represent important actors in the biological condition under study. While multivariate methods have the potential to discover such features, as discussed above, methods focused on model reduction tend to do so unreliably and inconsistently. Rather than involve classification or regression models, OBFS searches for intrinsic differences between features. Like most multivariate methods, however, OBFS is computationally expensive (except in certain special cases, like OBF). If an exact solution is desired, even Gaussian OBFS (which assumes independent blocks of Gaussian features with conjugate priors, henceforth referred to as simply OBFS) is currently only tractable in small-scale problems with up to approximately ten features. This has led to the development of a number of promising computationally efficient heuristic algorithms based on OBFS theory, for example 2MNC-Robust (Foroughi pour and Dalton, 2014), POFAC, REMAIN, and SPM (Foroughi pour and Dalton, 2018a).

Although OBFS and heuristic algorithms based on OBFS have demonstrated better performance than other multivariate methods in biomarker discovery, it is currently unknown precisely what features they select in the long run. Our main contribution is a theorem presented in Section 3, analogous to the consistency theorem presented in (Foroughi pour and Dalton, 2019) for OBF, that shows that OBFS is strongly consistent under mild conditions. The consistency proof for OBFS utilizes nine lemmas and is significantly more complex than that of OBF. This theorem identifies precisely what features are selected by OBFS asymptotically, provides conditions that guarantee convergence, justifies the use of OBFS in non-design settings where its internal assumptions are invalid, and characterizes rates of convergence for key posteriors in the framework, including marginal posteriors on different types of features.

The asymptotic behavior of optimal feature selection provides a frame of reference to understand the performance of heuristic algorithms based on OBFS; for instance, we may compare the sets of features selected asymptotically, the conditions required to guarantee convergence, and rates of convergence. Furthermore, while numerous works emphasize the need to account for gene interactions and to detect families of interacting biomarkers [e.g., see Han et al. (2017), Xi et al. (2018) and Fang et al. (2019)], typically the focus is on simple statistics that measure pairwise interactions in the data, rather than on establishing intrinsic characteristics of complete marker families, or on quantifying the performance and properties of selection algorithms designed to identify marker families. Thus, this work is also important because it proposes a formal definition for marker families (correlated blocks of features), and shows that these marker families are identifiable (via OBFS).

2 Gaussian Optimal Bayesian Feature Selection

We begin with an overview of the Gaussian modeling framework presented in Foroughi pour and Dalton (2018a), and a derivation of the corresponding optimal selection rule, OBFS. Let FF be the set of feature indices, each corresponding to a real-valued feature. We call feature indices we wish to select “true good features,” denoted by G¯\bar{G}, and we call feature indices we wish to not select “true bad features,” denoted by B¯=F\G¯\bar{B}=F\backslash\bar{G}. In this model, good features assign (w.p. 11 over the prior) distinct distributions between two classes, labeled y=0y=0 and y=1y=1, while bad features assign the same distribution across the whole sample. We further assume that G¯\bar{G} and B¯\bar{B} can each be partitioned into sub-blocks of interacting features. We call the set of all sub-blocks a “true feature partition,” denoted by P¯\bar{P}. If G¯\bar{G} and B¯\bar{B} are non-empty, we write P¯=({G¯1,,G¯u¯},{B¯1,,B¯v¯})\bar{P}=(\{{\bar{G}_{1}},\ldots,{\bar{G}_{\bar{u}}}\},\{{\bar{B}_{1}},\ldots,{\bar{B}_{\bar{v}}}\}), where u¯\bar{u} and v¯\bar{v} are positive integers, and the set of all G¯i\bar{G}_{i}’s and B¯j\bar{B}_{j}’s are disjoint such that G¯=i=1u¯G¯i\bar{G}=\cup_{i=1}^{\bar{u}}\bar{G}_{i}, B¯=j=1v¯B¯j\bar{B}=\cup_{j=1}^{\bar{v}}\bar{B}_{j}, and all G¯i\bar{G}_{i}’s and B¯j\bar{B}_{j}’s are non-empty. If G¯\bar{G} is empty, then we still denote P¯\bar{P} this way, but also define Gi=\cup G_{i}=\varnothing, and define sums and products from 11 to u¯\bar{u} to be 0 and 11, respectively. We define similar conventions when B¯\bar{B} is empty.

In the Bayesian model, G¯\bar{G}, B¯\bar{B} and P¯\bar{P} are random sets, and we denote instantiations of these random sets by GG, BB, and P=({G1,,Gu},{B1,,Bv})P=(\{{G_{1}},\ldots,{G_{u}}\},\{{B_{1}},\ldots,{B_{v}}\}), respectively. We define a prior on the true feature partition, π(P)P(P¯=P)\pi(P)\equiv\text{P}(\bar{P}=P), which induces a prior on the true good features, π(G)P(G¯=G)=P:Gi=Gπ(P)\pi(G)\equiv\text{P}(\bar{G}=G)=\sum_{P:\cup G_{i}=G}\pi(P).
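To make the partition space concrete: for a small feature set, the sum defining π(G) can be evaluated by brute force. The following Python sketch enumerates every feature partition of a toy set F and collapses an assumed uniform π(P) into the induced π(G); the uniform prior and the helper names are illustrative assumptions only, not part of the model.

from itertools import product

def set_partitions(items):
    """Generate all set partitions of a list of feature indices."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        yield [[first]] + smaller                 # `first` in its own block
        for i in range(len(smaller)):             # or added to an existing block
            yield smaller[:i] + [smaller[i] + [first]] + smaller[i + 1:]

def feature_partitions(F):
    """All feature partitions P = (good blocks, bad blocks): each block of a
    set partition of F is labeled either good or bad."""
    for blocks in set_partitions(sorted(F)):
        for labels in product(("good", "bad"), repeat=len(blocks)):
            good = frozenset(frozenset(b) for b, l in zip(blocks, labels) if l == "good")
            bad = frozenset(frozenset(b) for b, l in zip(blocks, labels) if l == "bad")
            yield (good, bad)

# Toy example: a uniform prior pi(P) over all partitions of F = {0, 1, 2}
# (assumed only for illustration), and the induced prior
# pi(G) = sum over {P : union of the G_i equals G} of pi(P).
F = {0, 1, 2}
partitions = list(feature_partitions(F))
pi_P = {P: 1.0 / len(partitions) for P in partitions}

pi_G = {}
for (good, _), p in pi_P.items():
    G = frozenset().union(*good) if good else frozenset()
    pi_G[G] = pi_G.get(G, 0.0) + p

for G, p in sorted(pi_G.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(G), round(p, 4))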

Given a true feature partition P¯=P\bar{P}=P, define θyGi\theta^{G_{i}}_{y} for each class y=0,1y=0,1 and each good block index i=1,,ui=1,\ldots,u, and θBj\theta^{B_{j}} for each bad block index j=1,vj=1,\ldots v. Let θP\theta^{P} denote the collection of all θyGi\theta^{G_{i}}_{y}’s and θBj\theta^{B_{j}}’s. Assume the θyGi\theta^{G_{i}}_{y}’s and θBj\theta^{B_{j}}’s are mutually independent, i.e., we have a prior of the form

π(θP|P¯=P)=i=1uπ(θ0Gi)π(θ1Gi)j=1vπ(θBj).\pi(\theta^{P}|\bar{P}=P)=\prod_{i=1}^{u}\pi(\theta^{G_{i}}_{0})\pi(\theta^{G_{i}}_{1})\prod_{j=1}^{v}\pi(\theta^{B_{j}}). (2.1)

For good block A=GiA=G_{i}, assume θyA=[μyA,ΣyA]\theta^{A}_{y}=[\mu^{A}_{y},\Sigma^{A}_{y}] for each class y=0,1y=0,1, where μyA\mu^{A}_{y} is a length |A||A| column vector, ΣyA\Sigma^{A}_{y} is an |A|×|A||A|\times|A| matrix, and |A||A| denotes the cardinality of set AA. We assume π(θyA)=π(μyA|ΣyA)π(ΣyA)\pi(\theta^{A}_{y})=\pi(\mu^{A}_{y}|\Sigma^{A}_{y})\pi(\Sigma^{A}_{y}) is normal-inverse-Wishart:

π(ΣyA)\displaystyle\pi(\Sigma^{A}_{y}) =KyA|ΣyA|0.5(κyA+|A|+1)exp(0.5tr(SyA(ΣyA)1)),\displaystyle=K^{A}_{y}|\Sigma^{A}_{y}|^{-0.5(\kappa^{A}_{y}+|A|+1)}\exp\left(-0.5\operatorname{tr}(S^{A}_{y}(\Sigma^{A}_{y})^{-1})\right), (2.2)
π(μyA|ΣyA)\displaystyle\pi(\mu^{A}_{y}|\Sigma^{A}_{y}) =LyA|ΣyA|0.5exp(0.5νyA(μyAmyA)T(ΣyA)1(μyAmyA)),\displaystyle=L^{A}_{y}|\Sigma^{A}_{y}|^{-0.5}\exp\left(-0.5\nu^{A}_{y}(\mu^{A}_{y}-m^{A}_{y})^{T}(\Sigma^{A}_{y})^{-1}(\mu^{A}_{y}-m^{A}_{y})\right), (2.3)

where |X||X| is the determinant, tr(X)\operatorname{tr}(X) is the trace, and XTX^{T} is the transpose of matrix XX. For a proper prior, κyA>|A|1\kappa^{A}_{y}>|A|-1, SyAS^{A}_{y} is a symmetric positive-definite |A|×|A||A|\times|A| matrix, νyA>0\nu^{A}_{y}>0, myAm^{A}_{y} is an |A|×1|{A}|\times 1 vector,

KyA\displaystyle K^{A}_{y} =|SyA|0.5κyA20.5κyA|A|/Γ|A|(0.5κyA),\displaystyle=|S^{A}_{y}|^{0.5\kappa^{A}_{y}}2^{-0.5\kappa^{A}_{y}|{A}|}/\Gamma_{|{A}|}(0.5\kappa^{A}_{y}), (2.4)
LyA\displaystyle L^{A}_{y} =(2π/νyA)0.5|A|,\displaystyle=(2\pi/\nu^{A}_{y})^{-0.5|{A}|}, (2.5)

and Γd\Gamma_{d} denotes the multivariate gamma function, where dd is a positive integer. Likewise, for bad block A=BjA=B_{j} assume θA=[μA,ΣA]\theta^{A}=[\mu^{A},\Sigma^{A}], and that π(θA)\pi(\theta^{A}) is normal-inverse-Wishart with hyperparameters κA\kappa^{A}, SAS^{A}, νA\nu^{A}, and mAm^{A}, and scaling constants KAK^{A} and LAL^{A}.

Let SnS_{n} be a sample composed of nn labeled points with nyn_{y} points in class yy, where labels are determined by a process independent from P¯\bar{P} and θP¯\theta^{\bar{P}} (for instance, using random sampling or separate sampling). Given P¯=P\bar{P}=P, θP\theta^{P}, and the labels, assume all sample points are mutually independent, and features in different blocks are also independent. Assume that features in block GiG_{i}, i=1,,ui=1,\ldots,u, are jointly Gaussian with mean μyGi\mu^{G_{i}}_{y} and covariance matrix ΣyGi\Sigma^{G_{i}}_{y} under class yy, and that features in block BjB_{j}, j=1,,vj=1,\ldots,v, are jointly Gaussian with mean μBj\mu^{B_{j}} and covariance ΣBj\Sigma^{B_{j}} across the whole sample.

The posterior on P¯\bar{P} over the set of all valid feature partitions is given by the normalized product of the prior and likelihood function. It can be shown that π(P)P(P¯=P|Sn)π(P)q(P)a(P)\pi^{*}(P)\equiv\text{P}(\bar{P}=P|S_{n})\propto\pi(P)q(P)a(P), where π(P)\pi^{*}(P) is normalized to have unit sum, and

q(P)\displaystyle q(P) =i=1,,u,y=0,1QyGi(κyGi|Gi|1)0.5|Gi|κyGi\displaystyle=\prod_{\begin{subarray}{c}i=1,\ldots,u,\\ y=0,1\end{subarray}}Q^{G_{i}}_{y}(\kappa^{G_{i}*}_{y}-|G_{i}|-1)^{-0.5|G_{i}|\kappa^{G_{i}*}_{y}}
×j=1,,vQBj(κBj|Bj|1)0.5|Bj|κBj,\displaystyle\quad\times\prod_{j=1,\ldots,v}Q^{B_{j}}(\kappa^{B_{j}*}-|B_{j}|-1)^{-0.5|B_{j}|\kappa^{B_{j}*}}, (2.6)
a(P)\displaystyle a(P) =(i=1,,u,y=0,1|CyGi|κyGij=1,,v|CBj|κBj)0.5.\displaystyle=\Bigg{(}\prod_{\begin{subarray}{c}i=1,\ldots,u,\\ y=0,1\end{subarray}}|C^{G_{i}}_{y}|^{\kappa^{G_{i}*}_{y}}\prod_{j=1,\ldots,v}|C^{B_{j}}|^{\kappa^{B_{j}*}}\Bigg{)}^{-0.5}. (2.7)

For each good block A=GiA=G_{i}, i=1,,ui=1,\ldots,u, and class y=0,1y=0,1,

νyA\displaystyle\nu^{A*}_{y} =νyA+ny,\displaystyle=\nu^{A}_{y}+n_{y}, (2.8)
κyA\displaystyle\kappa^{A*}_{y} =κyA+ny,\displaystyle=\kappa^{A}_{y}+n_{y}, (2.9)
QyA\displaystyle Q^{A}_{y} =KyALyA20.5κyA|A|Γ|A|(0.5κyA)(2π/νyA)0.5|A|,\displaystyle=K^{A}_{y}L^{A}_{y}2^{0.5\kappa^{A*}_{y}|A|}\Gamma_{|A|}(0.5\kappa^{A*}_{y})\left({2\pi}/{\nu^{A*}_{y}}\right)^{0.5|A|}, (2.10)
SyA\displaystyle S^{A*}_{y} =SyA+(ny1)Σ^yA+νyAnyνyA(μ^yAmyA)(μ^yAmyA)T,\displaystyle=S^{A}_{y}+(n_{y}-1)\hat{\Sigma}^{A}_{y}+\frac{\nu^{A}_{y}n_{y}}{\nu^{A*}_{y}}(\hat{\mu}^{A}_{y}-m^{A}_{y})(\hat{\mu}^{A}_{y}-m^{A}_{y})^{T}, (2.11)
CyA\displaystyle C^{A}_{y} =SyA/(κyA|A|1),\displaystyle={S^{A*}_{y}}/{(\kappa^{A*}_{y}-|A|-1)}, (2.12)

and μ^yA\hat{\mu}^{A}_{y} and Σ^yA\hat{\Sigma}^{A}_{y} are the sample mean and unbiased sample covariance of features in AA under class yy. Similarly, for each bad block A=BjA=B_{j}, we find νA\nu^{A*}, κA\kappa^{A*}, QAQ^{A}, SAS^{A*} and CAC^{A} using (2.8) through (2.12) with all subscript yy’s removed, where μ^A\hat{\mu}^{A} and Σ^A\hat{\Sigma}^{A} are the sample mean and unbiased sample covariance of features in AA across the whole sample.
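These block-level quantities are all that is needed to evaluate the unnormalized posterior π(P)q(P)a(P) from data. Below is a minimal log-space sketch in Python (NumPy/SciPy) of (2.4), (2.5) and (2.8)–(2.12) assembled into (2.6)–(2.7); the hyperparameter choices are arbitrary placeholders and the same prior is reused for both classes of a good block purely for brevity, so it should be read as an illustration rather than a reference implementation.

import numpy as np
from scipy.special import multigammaln

def block_posterior(X, nu, kappa, m, S):
    """Posterior hyperparameters (2.8)-(2.12) for one block, from the data
    matrix X whose rows are sample points restricted to the block's features."""
    n, d = X.shape
    mu_hat = X.mean(axis=0)                                      # sample mean
    Sigma_hat = np.atleast_2d(np.cov(X, rowvar=False, ddof=1))   # unbiased sample covariance
    nu_star, kappa_star = nu + n, kappa + n                      # (2.8), (2.9)
    diff = (mu_hat - m).reshape(-1, 1)
    S_star = S + (n - 1) * Sigma_hat + (nu * n / nu_star) * (diff @ diff.T)  # (2.11)
    C = S_star / (kappa_star - d - 1)                            # (2.12)
    return nu_star, kappa_star, S_star, C

def log_block_term(X, nu, kappa, m, S):
    """log of one block's factor in q(P)a(P): Q, the (kappa* - |A| - 1) power
    from (2.6) and the |C| power from (2.7), using the constants (2.4)-(2.5)."""
    n, d = X.shape
    nu_star, kappa_star, _, C = block_posterior(X, nu, kappa, m, S)
    log_K = (0.5 * kappa * np.linalg.slogdet(S)[1]
             - 0.5 * kappa * d * np.log(2) - multigammaln(0.5 * kappa, d))   # (2.4)
    log_L = -0.5 * d * np.log(2 * np.pi / nu)                                # (2.5)
    log_Q = (log_K + log_L + 0.5 * kappa_star * d * np.log(2)
             + multigammaln(0.5 * kappa_star, d)
             + 0.5 * d * np.log(2 * np.pi / nu_star))                        # (2.10)
    return (log_Q - 0.5 * d * kappa_star * np.log(kappa_star - d - 1)
            - 0.5 * kappa_star * np.linalg.slogdet(C)[1])

def log_unnormalized_posterior(P, X0, X1, log_prior, hyper):
    """log of pi(P) q(P) a(P) for P = (good blocks, bad blocks).  hyper(A)
    returns (nu, kappa, m, S) for block A; the same hyperparameters are reused
    for both classes of a good block, purely for brevity."""
    good_blocks, bad_blocks = P
    X = np.vstack([X0, X1])
    total = log_prior
    for A in good_blocks:
        nu, kappa, m, S = hyper(A)
        total += log_block_term(X0[:, A], nu, kappa, m, S)   # class 0 factor
        total += log_block_term(X1[:, A], nu, kappa, m, S)   # class 1 factor
    for A in bad_blocks:
        nu, kappa, m, S = hyper(A)
        total += log_block_term(X[:, A], nu, kappa, m, S)    # pooled factor
    return total

# Illustration with arbitrary placeholder hyperparameters (not recommendations).
rng = np.random.default_rng(0)
X0 = rng.normal(size=(30, 3))
X1 = rng.normal(size=(30, 3))
X1[:, 0] += 1.0                                  # feature 0 has a shifted mean in class 1
hyper = lambda A: (1.0, len(A) + 2.0, np.zeros(len(A)), np.eye(len(A)))
P_a = ([[0]], [[1], [2]])                        # {0} good; {1}, {2} bad
P_b = ([], [[0], [1], [2]])                      # every feature bad
print(log_unnormalized_posterior(P_a, X0, X1, np.log(0.5), hyper))
print(log_unnormalized_posterior(P_b, X0, X1, np.log(0.5), hyper))  # typically smaller here

Normalizing these values over all feature partitions (for example, those produced by the enumeration sketch earlier in this section) yields π*(P).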

The posterior on P¯\bar{P} induces a posterior on G¯\bar{G} over all subsets GFG\subseteq F,

π(G)P(G¯=G|Sn)\displaystyle\pi^{*}(G)\equiv\text{P}(\bar{G}=G|S_{n}) =P:Gi=Gπ(P),\displaystyle=\sum_{P:\cup G_{i}=G}\pi^{*}(P), (2.13)

as well as posterior probabilities that each individual feature fFf\in F is in G¯\bar{G},

π(f)P(fG¯|Sn)\displaystyle\pi^{*}(f)\equiv\text{P}(f\in\bar{G}|S_{n}) =G:fGπ(G).\displaystyle=\sum_{G:f\in G}\pi^{*}(G). (2.14)

Note π(f)=P(fG¯|Sn)\pi^{*}(f)=\text{P}(f\in\bar{G}|S_{n}) and π({f})=P(G¯={f}|Sn)\pi^{*}(\{f\})=\text{P}(\bar{G}=\{f\}|S_{n}) are distinct.

The objective of OBFS is to identify the set of true good features, G¯\bar{G}. We will consider two objective criteria: MNC and CMNC. MNC maximizes the expected number of correctly identified features; that is, MNC outputs the set GFG\subseteq F with complement B=F\GB=F\backslash G that maximizes E[|GG¯|+|BB¯|]\text{E}[|G\cap\bar{G}|+|B\cap\bar{B}|], which is given by GMNC={fF:π(f)>0.5}G^{\textrm{MNC}}=\{f\in F:\pi^{*}(f)>0.5\} (Foroughi pour and Dalton, 2014). CMNC maximizes the expected number of correctly identified features under the constraint of selecting a specified number of features, DD, and GCMNCG^{\textrm{CMNC}} is found by picking the DD features with largest π(f)\pi^{*}(f) (Foroughi pour and Dalton, 2017b). Both MNC and CMNC require computing π(f)\pi^{*}(f) for all fFf\in F, which generally requires computing π(P)\pi(P) for all valid feature partitions PP, and is generally intractable unless |F||F| is small.
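As a small illustration, both criteria reduce to simple rules once the marginal posteriors π*(f) are available; the feature names and probabilities in the sketch below are made up for illustration only.

def select_mnc(post):
    """MNC: select every feature with marginal posterior above 0.5."""
    return {f for f, p in post.items() if p > 0.5}

def select_cmnc(post, D):
    """CMNC: select the D features with the largest marginal posteriors."""
    return set(sorted(post, key=post.get, reverse=True)[:D])

post = {"f1": 0.97, "f2": 0.64, "f3": 0.31, "f4": 0.08}   # made-up pi*(f) values
print(select_mnc(post))         # {'f1', 'f2'}
print(select_cmnc(post, D=3))   # {'f1', 'f2', 'f3'}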

Under proper priors, Gaussian OBFS takes the following modeling parameters as input: (1) π(P)\pi(P) for each valid feature partition, PP, (2) νyA>0\nu^{A}_{y}>0, myAm^{A}_{y}, κyA>|A|1\kappa^{A}_{y}>|A|-1, and symmetric positive-definite SyAS^{A}_{y} for all yy and all possible good blocks AA in valid PP, and (3) νA>0\nu^{A}>0, mAm^{A}, κA>|A|1\kappa^{A}>|A|-1, and symmetric positive-definite SAS^{A} for all possible bad blocks AA in valid PP. If CMNC is used, it also takes DD as input. When π(θP|P¯=P)\pi(\theta^{P}|\bar{P}=P) is improper, the above derivations are invalid, but π(P)π(P)q(P)a(P)\pi^{*}(P)\propto\pi(P)q(P)a(P) can still be taken as a definition and computed as long as: (1) π(P)\pi(P) is proper, (2) νyA>0\nu^{A*}_{y}>0, κyA>|A|1\kappa^{A*}_{y}>|A|-1, and SyAS^{A*}_{y} is symmetric positive-definite, (3) νA>0\nu^{A*}>0, κA>|A|1\kappa^{A*}>|A|-1, and SAS^{A*} is symmetric positive-definite, and (4) KyAK^{A}_{y} and LyAL^{A}_{y} are no longer given by (2.4) and (2.5), and similarly KAK^{A} and LAL^{A} are no longer given by analogous equations; instead these are positive input parameters specified by the user.

Gaussian OBF is a special case where π(P)\pi(P) assumes that all blocks G¯1,,G¯u¯\bar{G}_{1},\ldots,\bar{G}_{\bar{u}} and B¯1,,B¯v¯\bar{B}_{1},\ldots,\bar{B}_{\bar{v}} are of size one, and the events {fG¯}\{f\in\bar{G}\} are mutually independent. For each fFf\in F, OBF takes as input (1) marginal priors π(f)P(fG¯)\pi(f)\equiv\text{P}(f\in\bar{G}), (2) scalars νy{f}\nu_{y}^{\{f\}}, my{f}m_{y}^{\{f\}}, κy{f}\kappa_{y}^{\{f\}} and Sy{f}S_{y}^{\{f\}} for all yy, (3) scalars ν{f}\nu^{\{f\}}, m{f}m^{\{f\}}, κ{f}\kappa^{\{f\}} and S{f}S^{\{f\}}, and (4) LfK0{f}L0{f}K1{f}L1{f}/(K{f}L{f})L^{f}\equiv K_{0}^{\{f\}}L_{0}^{\{f\}}K_{1}^{\{f\}}L_{1}^{\{f\}}/(K^{\{f\}}L^{\{f\}}) if improper priors are used. OBF also takes DD as input if CMNC is used. Under OBF, it can be shown that π(f)\pi^{*}(f), defined in (2.14), simplifies to π(f)=h(f)/(1+h(f))\pi^{*}(f)=h(f)/(1+h(f)), where

h(f)=\frac{\pi(f)}{1-\pi(f)}\cdot\frac{Q_{0}^{\{f\}}Q_{1}^{\{f\}}}{Q^{\{f\}}}\cdot\frac{(S^{\{f\}*})^{0.5\kappa^{\{f\}*}}}{(S_{0}^{\{f\}*})^{0.5\kappa_{0}^{\{f\}*}}(S_{1}^{\{f\}*})^{0.5\kappa_{1}^{\{f\}*}}}, (2.15)

κy{f}\kappa_{y}^{\{f\}*}, Qy{f}Q_{y}^{\{f\}} and Sy{f}S_{y}^{\{f\}*} are defined in (2.9), (2.10) and (2.11), respectively, and κ{f}\kappa^{\{f\}*}, Q{f}Q^{\{f\}} and S{f}S^{\{f\}*} are defined similarly. Rather than evaluate π(P)\pi^{*}(P) for all feature partitions, OBF under both MNC and CMNC reduces to simple filtering, where features are scored by the h(f)h(f) given in (2.15).

3 Consistency

Let \mathcal{F}_{\infty} be an arbitrary sampling distribution on an infinite sample, SS_{\infty}. Each sample point in SS_{\infty} consists of a binary label, y=0,1y=0,1, and a set of features corresponding to a set of feature indices, FF. For each n=1,2,n=1,2,\ldots, let SnS_{n} denote a sample consisting of the first nn points in SS_{\infty}, let nyn_{y} denote the number of points in class yy, and define ρn=n0/n\rho_{n}=n_{0}/n.

The goal of feature selection is to identify a specific subset of features (say those with different distributions between two classes), which we denote by G¯\bar{G}. Proving strong consistency for a feature selection algorithm thus amounts to showing that

limnG^(Sn)=G¯\displaystyle\lim_{n\to\infty}\hat{G}(S_{n})=\bar{G} (3.1)

with probability 11 (w.p. 11) over the infinite sampling distribution, where nn is the sample size, SnS_{n} is a sample of size nn, and G^(Sn)\hat{G}(S_{n}) is the output of the selection algorithm. Here, G^(Sn)\hat{G}(S_{n}) and G¯\bar{G} are sets, and we define limnG^(Sn)=G¯\lim_{n\to\infty}\hat{G}(S_{n})=\bar{G} to mean that G^(Sn)=G¯\hat{G}(S_{n})=\bar{G} for all but a finite number of nn.

OBFS and OBF fix the sample and model the set of features we wish to select and the sampling distribution as random. To study consistency, we reverse this: now the sampling distribution is fixed and the sample is random. We begin by reviewing a few definitions and a special case of the strong consistency theorem for OBF.

Definition 1 (Foroughi pour and Dalton, 2019).

G¯\bar{G} is an independent unambiguous set of good features if the following hold, where μyf\mu^{f}_{y} and σyf\sigma^{f}_{y} are the mean and variance of feature ff under class yy, respectively:

  1.

    For each gG¯g\in\bar{G}, μyg\mu^{g}_{y} and σyg\sigma^{g}_{y} exist for all yy such that either μ0gμ1g\mu^{g}_{0}\neq\mu^{g}_{1} or σ0gσ1g\sigma^{g}_{0}\neq\sigma^{g}_{1}.

  2.

    For each bG¯b\notin\bar{G}, μyb\mu^{b}_{y} and σyb\sigma^{b}_{y} exist for all yy such that μ0b=μ1b\mu^{b}_{0}=\mu^{b}_{1} and σ0b=σ1b\sigma^{b}_{0}=\sigma^{b}_{1}.

Definition 2 (Foroughi pour and Dalton, 2019).

An infinite sample, SS_{\infty}, is a balanced sample if lim infnρn>0\liminf_{n\to\infty}\rho_{n}>0, lim supnρn<1\limsup_{n\to\infty}\rho_{n}<1, and, conditioned on the labels, sample points are independent with points belonging to the same class identically distributed.

Theorem 1 (Foroughi pour and Dalton, 2019).

Let SS_{\infty} be a fixed infinite sample and let GG be a fixed subset of FF. If limnπ(G)=1\lim_{n\to\infty}\pi^{*}(G)=1, then limnG^MNC(Sn)=G\lim_{n\to\infty}\hat{G}_{\mathrm{MNC}}(S_{n})=G and limnG^CMNC(Sn)=G\lim_{n\to\infty}\hat{G}_{\mathrm{CMNC}}(S_{n})=G if D=|G|D=|G|, where G^MNC(Sn)\hat{G}_{\mathrm{MNC}}(S_{n}) and G^CMNC(Sn)\hat{G}_{\mathrm{CMNC}}(S_{n}) are the MNC and CMNC feature sets under π(G)\pi^{*}(G), respectively.

Theorem 2 (Foroughi pour and Dalton, 2019).

Suppose the following are true:

  (i)

    G¯\bar{G} is the independent unambiguous set of good features.

  (ii)

    For each feature not in G¯\bar{G}, the fourth order moment across the whole sample exists.

  (iii)

    SS_{\infty} is a balanced sample w.p. 11.

  (iv)

    π(G)\pi^{*}(G) assumes a fixed Gaussian OBF model for all nn with 0<π(f)<10<\pi(f)<1 for all ff.

Then limnπ(G¯)=1\lim_{n\to\infty}\pi^{*}(\bar{G})=1 for \mathcal{F}_{\infty}-almost all infinite samples.

By Theorems 1 and 2, Gaussian OBF under MNC and Gaussian OBF under CMNC with “correct” DD (the user knows in advance how many features to select) are strongly consistent if the conditions of Theorem 2 hold. Condition (i) uses Definition 1, and characterizes the features that OBF aims to select. Condition (iii) uses Definition 2, and characterizes requirements of the sample. Conditions (i)–(iii) impose very mild assumptions on the data generating process, which is not required to satisfy the Gaussian OBF modeling assumptions in Condition (iv).

Let us now turn to OBFS. Denote the true mean and covariance of features in feature set AA and class yy by μyA\mu^{A}_{y} and ΣyA\Sigma^{A}_{y}, respectively (the features need not be Gaussian). OBFS aims to select the smallest set of features, G¯\bar{G}, with different means or covariances between the classes, where B¯=F\G¯\bar{B}=F\backslash\bar{G} has the same means and covariances between the classes and is uncorrelated with G¯\bar{G}. This is formalized in the following definition.

Definition 3.

G¯\bar{G} is an unambiguous set of good features if the following hold:

  1.

    μyF\mu^{F}_{y} and ΣyF\Sigma^{F}_{y} exist for all yy.

  2.

    Either μ0G¯μ1G¯\mu^{\bar{G}}_{0}\neq\mu^{\bar{G}}_{1} or Σ0G¯Σ1G¯\Sigma^{\bar{G}}_{0}\neq\Sigma^{\bar{G}}_{1}.

  3.

    μ0B¯=μ1B¯\mu^{\bar{B}}_{0}=\mu^{\bar{B}}_{1} and Σ0B¯=Σ1B¯\Sigma^{\bar{B}}_{0}=\Sigma^{\bar{B}}_{1}, where B¯=F\G¯\bar{B}=F\backslash\bar{G}.

  4.

    Each feature in G¯\bar{G} is uncorrelated with each feature in B¯\bar{B} in both classes.

  5.

    Conditions 1-4 do not all hold for any strict subset GG¯G\subset\bar{G}.

Assuming all first and second order moments exist, an unambiguous set of good features always exists and is unique. To prove uniqueness, let G¯1\bar{G}_{1} and G¯2\bar{G}_{2} be arbitrary unambiguous sets of good features, and let G¯3=G¯1G¯2\bar{G}_{3}=\bar{G}_{1}\cap\bar{G}_{2}. By condition 4, ΣyF\Sigma_{y}^{F}, has a block diagonal structure for each y=0,1y=0,1 corresponding to the sets of features G¯3\bar{G}_{3}, G¯1\G¯3\bar{G}_{1}\backslash\bar{G}_{3}, G¯2\G¯3\bar{G}_{2}\backslash\bar{G}_{3}, and F\(G¯1G¯2)F\backslash(\bar{G}_{1}\cup\bar{G}_{2}). Thus, condition 4 holds for G¯3\bar{G}_{3}. By condition 3, μ0A=μ1A\mu^{A}_{0}=\mu^{A}_{1} and Σ0A=Σ1A\Sigma^{A}_{0}=\Sigma^{A}_{1} for all of these blocks except G¯3\bar{G}_{3}. Thus, condition 3 holds for G¯3\bar{G}_{3}. If μ0G¯1μ1G¯1\mu^{\bar{G}_{1}}_{0}\neq\mu^{\bar{G}_{1}}_{1}, then (since the means for each class are equal for G¯1\G¯3\bar{G}_{1}\backslash\bar{G}_{3}) μ0G¯3μ1G¯3\mu^{\bar{G}_{3}}_{0}\neq\mu^{\bar{G}_{3}}_{1}. Alternatively, if Σ0G¯1Σ1G¯1\Sigma^{\bar{G}_{1}}_{0}\neq\Sigma^{\bar{G}_{1}}_{1}, then (since G¯3\bar{G}_{3} and G¯1\G¯3\bar{G}_{1}\backslash\bar{G}_{3} are uncorrelated for each class, and the covariances between each class are equal for G¯1\G¯3\bar{G}_{1}\backslash\bar{G}_{3}) Σ0G¯3Σ1G¯3\Sigma^{\bar{G}_{3}}_{0}\neq\Sigma^{\bar{G}_{3}}_{1}. In either case, condition 2 holds for G¯3\bar{G}_{3}. Since G¯3G¯1\bar{G}_{3}\subseteq\bar{G}_{1} and G¯3G¯2\bar{G}_{3}\subseteq\bar{G}_{2}, by condition 5 we must have G¯1=G¯2=G¯3\bar{G}_{1}=\bar{G}_{2}=\bar{G}_{3}. We denote the unique unambiguous set of good features by G¯\bar{G}, and its complement by B¯\bar{B}, throughout.
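When the true moments are known, one way to read Definition 3 algorithmically is to link features that are correlated in either class and keep every linked group whose joint mean vector or covariance matrix differs between the classes; the Python sketch below follows this reading on a toy example. It is offered as an illustration of the definition under that interpretation, not as a selection procedure from the paper.

import numpy as np

def unambiguous_good_set(mu0, Sigma0, mu1, Sigma1, tol=1e-12):
    """One reading of Definition 3 when the true moments are known: connect
    features that are correlated in either class, then keep every connected
    component whose joint mean vector or covariance matrix differs between
    the classes.  (Illustrative sketch under that interpretation.)"""
    d = len(mu0)
    adj = (np.abs(Sigma0) > tol) | (np.abs(Sigma1) > tol)   # f ~ g if correlated in either class
    unseen, components = set(range(d)), []
    while unseen:                                           # connected components by traversal
        stack, comp = [unseen.pop()], set()
        while stack:
            f = stack.pop()
            comp.add(f)
            nbrs = {g for g in unseen if adj[f, g]}
            unseen -= nbrs
            stack.extend(nbrs)
        components.append(sorted(comp))
    G_bar = set()
    for comp in components:
        idx = np.ix_(comp, comp)
        if (np.any(np.abs(mu0[comp] - mu1[comp]) > tol)
                or np.any(np.abs(Sigma0[idx] - Sigma1[idx]) > tol)):
            G_bar |= set(comp)
    return sorted(G_bar)

# Toy example: feature 0 has a shifted mean; feature 1 has identical marginal
# moments in both classes but is correlated with feature 0 in class 0, so it
# joins the good set through the chain; features 2 and 3 are unchanged.
mu0 = np.zeros(4)
mu1 = np.array([1.0, 0.0, 0.0, 0.0])
Sigma0 = np.array([[1.0, 0.5, 0.0, 0.0],
                   [0.5, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0]])
Sigma1 = np.eye(4)
print(unambiguous_good_set(mu0, Sigma0, mu1, Sigma1))   # [0, 1]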

Similar to the Bayesian model, define a feature partition to be an ordered set of the form P=({G1,,Gu},{B1,,Bv})P=(\{G_{1},\ldots,G_{u}\},\{B_{1},\ldots,B_{v}\}), where the set of GiG_{i}’s and BjB_{j}’s partition FF. We call each GiG_{i} a “good block” and each BjB_{j} a “bad block”, keeping in mind that these terms are always relative to a specified arbitrary partition. Feature partitions with no good blocks or no bad blocks are permitted, with appropriate conventions for unions, sums, and products over the (empty set of) corresponding blocks. The following definitions formalize a non-Bayesian analog of the true feature partition.

Definition 4.

Let P1P_{1} and P2P_{2} be arbitrary feature partitions. P1P_{1} is a mesh of P2P_{2} if every block in P1P_{1} is a subset of a block in P2P_{2}. P1P_{1} is a refinement of P2P_{2} if every good block in P1P_{1} is a subset of a good block in P2P_{2} and every bad block in P1P_{1} is a subset of a bad block in P2P_{2}. P1P_{1} is a strict refinement of P2P_{2} if it is a refinement and P1P2P_{1}\neq P_{2}.
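These relations are simple set inclusions and can be checked mechanically; in the sketch below a partition is represented as a pair of block lists, an ad hoc representation chosen only for illustration.

def is_mesh(P1, P2):
    """P1 is a mesh of P2: every block of P1 is a subset of some block of P2."""
    blocks1 = [set(b) for b in P1[0] + P1[1]]
    blocks2 = [set(b) for b in P2[0] + P2[1]]
    return all(any(b1 <= b2 for b2 in blocks2) for b1 in blocks1)

def is_refinement(P1, P2):
    """P1 is a refinement of P2: good blocks refine good blocks, bad blocks
    refine bad blocks."""
    good_ok = all(any(set(g1) <= set(g2) for g2 in P2[0]) for g1 in P1[0])
    bad_ok = all(any(set(b1) <= set(b2) for b2 in P2[1]) for b1 in P1[1])
    return good_ok and bad_ok

def is_strict_refinement(P1, P2):
    return is_refinement(P1, P2) and P1 != P2

# A partition is written here as (good blocks, bad blocks), each block a list.
P2 = ([[0, 1]], [[2, 3]])
P1 = ([[0], [1]], [[2], [3]])    # splits every block of P2: a strict refinement
P3 = ([[0]], [[1], [2], [3]])    # a mesh of P2, but it relabels 1 as bad: not a refinement
print(is_strict_refinement(P1, P2), is_refinement(P3, P2), is_mesh(P3, P2))   # True False True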

Definition 5.

P¯=({G¯1,,G¯u¯},{B¯1,,B¯v¯})\bar{P}=(\{\bar{G}_{1},\ldots,\bar{G}_{\bar{u}}\},\{\bar{B}_{1},\ldots,\bar{B}_{\bar{v}}\}) is an unambiguous feature partition if the following hold:

  1.

    G¯=i=1u¯G¯i\bar{G}=\cup_{i=1}^{\bar{u}}\bar{G}_{i} is an unambiguous set of good features.

  2.

    Each feature in any arbitrary block is uncorrelated with each feature in any other block in both classes.

  3.

    Conditions 1 and 2 do not hold for any feature partition PP that is a strict refinement of P¯\bar{P}.

An unambiguous feature partition always exists and is unique, and we denote it by P¯\bar{P} throughout. By definition, the unambiguous feature partition P¯\bar{P} induces the unambiguous set of good features G¯\bar{G}.

Our main result is given in the following theorem (Theorem 3), which provides sufficient conditions for the (almost sure) convergence of π(P)\pi^{*}(P) to a point mass at P¯\bar{P}, thereby guaranteeing the (almost sure) convergence of π(G)\pi^{*}(G) to a point mass at G¯\bar{G}. By Theorem 1, Gaussian OBFS under MNC and Gaussian OBFS under CMNC with “correct” DD are strongly consistent if the conditions of Theorem 3 hold, which are very mild. Condition (i) is based on Definition 3 and essentially guarantees that G¯\bar{G} really represents the type of features OBFS aims to select (those with different means or covariances between the classes). Conditions (i) and (ii) ensure certain moments exist, and Condition (iii) ensures that all covariances are full rank. There are no other distributional requirements. Condition (iv) is based on Definition 2 and characterizes the sampling strategy; the proportion of points observed in any class must not converge to zero, and sample points should be independent. Condition (v) places constraints on the inputs to the OBFS algorithm; we must assign a non-zero probability to the true feature partition.

Finally, from the proof of Theorem 3 we have that, under the conditions stated in the theorem, w.p. 11 there exist 0<r<10<r<1 and N>0N>0 such that

π(P)π(P¯)<rn\displaystyle\frac{\pi^{*}(P)}{\pi^{*}(\bar{P})}<r^{n} (3.2)

for all n>Nn>N and all PP¯P\neq\bar{P} that label any good features as bad or that put features that are correlated in separate blocks. Also, w.p. 11 there exist s,N>0s,N>0 such that

π(P)π(P¯)<ns\displaystyle\frac{\pi^{*}(P)}{\pi^{*}(\bar{P})}<n^{-s} (3.3)

for all n>Nn>N and all PP¯P\neq\bar{P}. Fix fG¯f\in\bar{G}. By (3.2), for each feature partition PP such that fGif\notin\cup G_{i}, w.p. 11 there exist 0<rP<10<r_{P}<1 and NP>0N_{P}>0 such that π(P)/π(P¯)<rPn\pi^{*}(P)/\pi^{*}(\bar{P})<r_{P}^{n} for all n>NPn>N_{P}. Therefore, w.p. 11,

1π(f)π(P¯)=P:fGiπ(P)π(P¯)<rn\frac{1-\pi^{*}(f)}{\pi^{*}(\bar{P})}=\sum_{P:f\notin\cup G_{i}}\frac{\pi^{*}(P)}{\pi^{*}(\bar{P})}<r^{n} (3.4)

for all n>Nn>N, where maxP:fGirP<r<1\max_{P:f\notin\cup G_{i}}r_{P}<r<1 and N>maxP:fGiNPN>\max_{P:f\notin\cup G_{i}}N_{P}. Thus, w.p. 11,

π(f)>1π(P¯)rn1rn\pi^{*}(f)>1-\pi^{*}(\bar{P})r^{n}\geq 1-r^{n} (3.5)

for all n>Nn>N. The marginal posterior of good features converges to 11 at least exponentially w.p. 11. Now fix fB¯f\in\bar{B}. By (3.3), w.p. 11 there exist s,N>0s,N>0 such that

π(f)π(P¯)=P:fGiπ(P)π(P¯)<ns\frac{\pi^{*}(f)}{\pi^{*}(\bar{P})}=\sum_{P:f\in\cup G_{i}}\frac{\pi^{*}(P)}{\pi^{*}(\bar{P})}<n^{-s} (3.6)

for all n>Nn>N. Thus, w.p. 11,

π(f)<ns\pi^{*}(f)<n^{-s} (3.7)

for all n>Nn>N. In other words, the marginal posterior of bad features converges to 0 at least polynomially w.p. 11. This characterizes rates of convergence of the posterior on feature partitions, and the marginal posterior probabilities on individual features.

Throughout, “ff and gg are asymptotically equivalent” and “fgf\sim g as nn\to\infty” mean that limnf(n)/g(n)=1\lim_{n\to\infty}f(n)/g(n)=1, v(i)v(i) denotes the iith element of vector vv, M(i,j)M(i,j) denotes the iith row, jjth column element of matrix MM, 0a,b0_{a,b} denotes the all-zero a×ba\times b matrix, and the sample mean and unbiased sample covariance of features in block AA and class yy are denoted by μ^yA\hat{\mu}_{y}^{A} and Σ^yA\hat{\Sigma}_{y}^{A}, respectively.

Theorem 3.

Suppose the following hold:

  (i)

    P¯\bar{P} is the unambiguous feature partition.

  (ii)

    For all fFf\in F, the fourth order moment across both classes exists and is finite.

  (iii)

    Σ0F\Sigma_{0}^{F} and Σ1F\Sigma_{1}^{F} are full rank.

  (iv)

    SS_{\infty} is a balanced sample w.p. 11.

  (v)

    π(P)\pi^{*}(P) assumes a fixed Gaussian OBFS model for all nn with π(P¯)>0\pi(\bar{P})>0.

Then limnπ(P¯)=1\lim_{n\to\infty}\pi^{*}(\bar{P})=1 for \mathcal{F}_{\infty}-almost all sequences.

Proof.

Since Pπ(P)=1\sum_{P}\pi^{*}(P)=1, it suffices to show that for all feature partitions PP¯P\neq\bar{P},

limnπ(P)π(P¯)=limnπ(P)π(P¯)×q(P)q(P¯)×a(P)a(P¯)=0w.p. 1.\lim_{n\to\infty}\frac{\pi^{*}(P)}{\pi^{*}(\bar{P})}=\lim_{n\to\infty}\frac{\pi(P)}{\pi(\bar{P})}\times\frac{q(P)}{q(\bar{P})}\times\frac{a(P)}{a(\bar{P})}=0\quad\mbox{w.p.~{}$1$}. (3.8)

Fix PP¯P\neq\bar{P}. If π(P)=0\pi(P)=0 then (3.8) holds trivially. In the remainder of this proof, assume π(P)0\pi(P)\neq 0. It suffices to show:

limnq(P)q(P¯)×a(P)a(P¯)=0w.p. 1.\lim_{n\to\infty}\frac{q(P)}{q(\bar{P})}\times\frac{a(P)}{a(\bar{P})}=0\quad\mbox{w.p.~{}$1$}. (3.9)

We prove this by constructing a sequence of partitions from PP to P¯\bar{P} in six steps. In Step 1, blocks that are labeled bad in PP are split into subsets of bad blocks in P¯\bar{P} and subsets of the unambiguous set of good features G¯\bar{G}. In Steps 2 and 3, blocks that are labeled bad in the previous partition but that are actually subsets of G¯\bar{G} are labeled good (and in Step 3 they are also merged with other good blocks). In Step 4, blocks that are labeled good in the previous partition are split into subsets of (good and bad) blocks in P¯\bar{P}. In Step 5, blocks that are labeled good in the previous partition but that are actually subsets of bad blocks in P¯\bar{P} are labeled bad. In Step 6, we merge blocks in the previous partition until we arrive at P¯\bar{P}. An example of this sequence of partitions is illustrated in Fig. 1. Finally, Step 7 uses the sequence of partitions constructed in Steps 1-6 to show (3.9).

Figure 1: An example of the sequence of partitions constructed in the consistency proof for the OBFS algorithm, starting from the partition PP (top) and ending at P¯\bar{P} (bottom). Green represents good blocks in the current partition (thus G¯\bar{G} is the set of green blocks at the bottom), and red represents bad blocks in the current partition (thus B¯\bar{B} is the set of red blocks at the bottom).

Step 1. Let P11,,P1K1P_{1}^{1},\ldots,P_{1}^{K_{1}}, K12K_{1}\geq 2, be a sequence of partitions where (1) we let P11=PP_{1}^{1}=P, (2) for each k=1,,K11k=1,\ldots,K_{1}-1, we let P1k+1P_{1}^{k+1} be identical to P1kP_{1}^{k} except with one bad block partitioned into two bad blocks such that one of the smaller blocks is the intersection of the original block with a bad block in P¯\bar{P}, and (3) we iterate until no more blocks can be split. The order we split blocks does not matter; at the end we always obtain the partition P1=P1K1P_{1}=P_{1}^{K_{1}} that is identical to PP except with each bad block split into smaller blocks that are each either a subset of a bad block in P¯\bar{P} or a subset of the unambiguous set of good features G¯\bar{G}. Suppose PP1P\neq P_{1}. Note P1P_{1} is a strict refinement of PP. By Lemma 1, w.p. 11 there exist s1,N1>0s_{1},N_{1}>0 (which may depend on the sample) such that q(P)/q(P1)<ns1q(P)/q(P_{1})<n^{-s_{1}} for all n>N1n>N_{1}. Furthermore,

a(P)a(P1)=k=1K11a(P1k)a(P1k+1).\displaystyle\frac{a(P)}{a(P_{1})}=\prod_{k=1}^{K_{1}-1}\frac{a(P_{1}^{k})}{a(P_{1}^{k+1})}. (3.10)

For each k=1,,K11k=1,\ldots,K_{1}-1, by Lemma 3, w.p. 11 there exist c1k,N1k>0c_{1}^{k},N_{1}^{k}>0 (which may depend on the sample) such that a(P1k)/a(P1k+1)<(logn)c1ka(P_{1}^{k})/a(P_{1}^{k+1})<(\log n)^{c_{1}^{k}} for all n>N1kn>N_{1}^{k}. By (3.10), w.p. 11 there exist s1,c1,N>0s_{1},c_{1},N>0 (namely s1s_{1} from above, c1=c11++c1K11c_{1}=c_{1}^{1}+\ldots+c_{1}^{K_{1}-1}, and N=max(N1,N11,,N1K11)N=\max(N_{1},N_{1}^{1},\ldots,N_{1}^{K_{1}-1})) such that (q(P)/q(P1))×(a(P)/a(P1))<ns1(logn)c1(q(P)/q(P_{1}))\times(a(P)/a(P_{1}))<n^{-s_{1}}(\log n)^{c_{1}} for all n>Nn>N. Finally, since ns1n^{-s_{1}} for s1>0s_{1}>0 dominates (logn)c1(\log n)^{c_{1}}, (q(P)/q(P1))×(a(P)/a(P1))0(q(P)/q(P_{1}))\times(a(P)/a(P_{1}))\to 0 w.p. 11.

Step 2. Construct a sequence of partitions starting from P1P_{1} where, in each step, one bad block in the current partition that is (a) contained in G¯\bar{G}, and (b) has either different means or different covariances between the classes, is labeled good in the next partition, and iterate until no more blocks can be re-labeled. The order we re-label blocks does not matter; we always obtain the same final partition, which we call P2P_{2}. Suppose P1P2P_{1}\neq P_{2}. By Lemma 1, w.p. 11 there exists s2s_{2} such that q(P1)/q(P2)<ns2q(P_{1})/q(P_{2})<n^{-s_{2}} for all nn large enough. By Lemma 4, w.p. 11 there exists 0<r2<10<r_{2}<1 such that a(P1)/a(P2)<r2na(P_{1})/a(P_{2})<r_{2}^{n} for all nn large enough. Since r2nr_{2}^{n} for 0<r2<10<r_{2}<1 dominates ns2n^{-s_{2}}, q(P1)/q(P2)×a(P1)/a(P2)0q(P_{1})/q(P_{2})\times a(P_{1})/a(P_{2})\to 0 w.p. 11.

Step 3. Some bad blocks in P2P_{2} may have equal means and covariances between the classes, but may still be contained in G¯\bar{G} because they are correlated with, or connected through a chain of correlation with, features in G¯\bar{G} that have either different means or different covariances between the classes. Here we construct a sequence of partitions to correct the label of these features. The construction goes as follows: starting from P2P_{2}, in each step merge one bad block in the current partition that is contained in G¯\bar{G} (which must have the same means and covariances between classes after Step 2) with one good block in the current partition that is correlated in at least one class (because it is correlated it must also be in G¯\bar{G}), label the resulting block as good in the next partition, and iterate until no more blocks can be merged.

First only blocks directly correlated with good blocks in the original partition can be moved, then only blocks directly correlated with good blocks in the original partition or the block just moved can be moved, etc. All bad blocks in P2P_{2} that are actually in G¯\bar{G} will eventually be moved because: (1) all features that have either different means or different variances have already been correctly labeled good in Step 2, and (2) the definition of P¯\bar{P} guarantees that every feature in G¯\bar{G} is connected to at least one feature in G¯\bar{G} with either different means or different variances via at least one chain of pairwise-correlated features that are all in the same good block in P¯\bar{P}. Thus, each bad block in the final partition, call it P3P_{3}, is guaranteed to be a subset of a bad block in P¯\bar{P}.

Suppose P2P3P_{2}\neq P_{3}. By Lemma 1, w.p. 11 there exists s3s_{3} such that q(P2)/q(P3)<ns3q(P_{2})/q(P_{3})<n^{-s_{3}} for all nn large enough. By Lemma 5, w.p. 11 there exists 0<r3<10<r_{3}<1 such that a(P2)/a(P3)<r3na(P_{2})/a(P_{3})<r_{3}^{n} for all nn large enough. Since r3nr_{3}^{n} for 0<r3<10<r_{3}<1 dominates ns3n^{-s_{3}}, q(P2)/q(P3)×a(P2)/a(P3)0q(P_{2})/q(P_{3})\times a(P_{2})/a(P_{3})\to 0 w.p. 11.

Step 4. Construct a sequence of partitions starting from P3P_{3} where, in each step, one good block in the current partition is split into two good blocks such that one of the smaller blocks is the intersection of the original block with a (good or bad) block in P¯\bar{P}, and iterate until no more blocks can be split. The order we split blocks does not matter; in the final partition, call it P4P_{4}, each bad block is a subset of a bad block in P¯\bar{P} and each good block is a subset of a good or bad block in P¯\bar{P}. Suppose P3P4P_{3}\neq P_{4}. Note that P4P_{4} is a strict refinement of P3P_{3}. By Lemma 1, w.p. 11 there exists s4>0s_{4}>0 such that q(P3)/q(P4)<ns4q(P_{3})/q(P_{4})<n^{-s_{4}} for all nn large enough. By Lemma 6, w.p. 11 there exists c4>0c_{4}>0 such that a(P3)/a(P4)<(logn)c4a(P_{3})/a(P_{4})<(\log n)^{c_{4}} for all nn large enough. Finally, since ns4n^{-s_{4}} for s4>0s_{4}>0 dominates (logn)c4(\log n)^{c_{4}}, q(P3)/q(P4)×a(P3)/a(P4)0q(P_{3})/q(P_{4})\times a(P_{3})/a(P_{4})\to 0 w.p. 11.

Step 5. Construct a sequence of partitions starting from P4P_{4} where, in each step, one good block in the current partition that is contained in a bad block in P¯\bar{P} is labeled bad in the next partition, and iterate until no more blocks can be re-labeled. The order we re-label blocks does not matter; in the final partition, call it P5P_{5}, each block is a subset of a block in P¯\bar{P} of the same type. Suppose P4P5P_{4}\neq P_{5}. Note that P5P_{5} is a mesh of P4P_{4}, every good block in P5P_{5} is a good block in P4P_{4}, and every bad block in P5P_{5} is a block (good or bad) in P4P_{4}. By Lemma 1, w.p. 11 there exists s5>0s_{5}>0 such that q(P4)/q(P5)<ns5q(P_{4})/q(P_{5})<n^{-s_{5}} for all nn large enough. By Lemma 7, w.p. 11 there exists c5>0c_{5}>0 such that a(P4)/a(P5)<(logn)c5a(P_{4})/a(P_{5})<(\log n)^{c_{5}} for all nn large enough. Since ns5n^{-s_{5}} for s5>0s_{5}>0 dominates (logn)c5(\log n)^{c_{5}}, q(P4)/q(P5)×a(P4)/a(P5)0q(P_{4})/q(P_{5})\times a(P_{4})/a(P_{5})\to 0 w.p. 11.

Step 6. Construct a sequence of partitions from P5P_{5} to P¯\bar{P} where each partition is a strict refinement of P¯\bar{P} and, in each step, two blocks of the same type (good or bad) in the current partition are merged into a larger block of the same type in the next partition that is contained in a block of the same type in P¯\bar{P}. The order we merge blocks does not matter, as long as the pair of blocks merged in each step are correlated in at least one class. Suppose P5P¯P_{5}\neq\bar{P}. By Lemma 1, w.p. 11 there exists s6s_{6} such that q(P5)/q(P¯)<ns6q(P_{5})/q(\bar{P})<n^{-s_{6}} for all nn large enough. By Lemmas 8 and 9, w.p. 11 there exists 0<r6<10<r_{6}<1 such that a(P5)/a(P¯)<r6na(P_{5})/a(\bar{P})<r_{6}^{n} for all nn large enough. Since r6nr_{6}^{n} for 0<r6<10<r_{6}<1 dominates ns6n^{-s_{6}}, q(P5)/q(P¯)×a(P5)/a(P¯)0q(P_{5})/q(\bar{P})\times a(P_{5})/a(\bar{P})\to 0 w.p. 11.

Step 7. We have that

q(P)a(P)q(P¯)a(P¯)=q(P)a(P)q(P1)a(P1)q(P1)a(P1)q(P2)a(P2)q(P2)a(P2)q(P3)a(P3)q(P3)a(P3)q(P4)a(P4)q(P4)a(P4)q(P5)a(P5)q(P5)a(P5)q(P¯)a(P¯).\frac{q(P)a(P)}{q(\bar{P})a(\bar{P})}=\frac{q(P)a(P)}{q(P_{1})a(P_{1})}\cdot\frac{q(P_{1})a(P_{1})}{q(P_{2})a(P_{2})}\cdot\frac{q(P_{2})a(P_{2})}{q(P_{3})a(P_{3})}\cdot\frac{q(P_{3})a(P_{3})}{q(P_{4})a(P_{4})}\cdot\frac{q(P_{4})a(P_{4})}{q(P_{5})a(P_{5})}\cdot\frac{q(P_{5})a(P_{5})}{q(\bar{P})a(\bar{P})}. (3.11)

Since PP¯P\neq\bar{P}, at least one of Steps 1-6 applies. For the steps that apply, the corresponding ratio in (3.11) converges to 0 w.p. 11. For the steps that do not apply, the corresponding ratio equals 11. Thus, (3.9) holds. ∎

Lemma 1.

Let P=({G1,,Gu},{B1,,Bv})P=(\{G_{1},\ldots,G_{u}\},\{B_{1},\ldots,B_{v}\}) and P=({G1,,Gu},{B1,,Bv})P^{\prime}=(\{G_{1}^{\prime},\ldots,G_{u^{\prime}}^{\prime}\},\{B_{1}^{\prime},\ldots,B_{v^{\prime}}^{\prime}\}) be arbitrary feature partitions, and let SS_{\infty} be a fixed sample such that lim infnρn>0\liminf_{n\to\infty}\rho_{n}>0 and lim supnρn<1\limsup_{n\to\infty}\rho_{n}<1. Then there exists ss\in\mathbb{R} and N>0N>0 such that q(P)/q(P)<ns{q(P)}/{q(P^{\prime})}<n^{-s} for all n>Nn>N. Further, this holds for s>0s>0 if: (i) PP^{\prime} is a strict refinement of PP, or (ii) PP^{\prime} is a mesh of PP where PPP\neq P^{\prime}, every good block in PP^{\prime} is a subset of a good block in PP, and every bad block in PP^{\prime} is a subset of a good or bad block in PP.

Proof.

For any feature partition PP, we may write

q(P)\displaystyle q(P) =c1Pi=1,,u,y=0,1(νyGi)0.5|Gi|Γ|Gi|(0.5κyGi)(κyGi|Gi|1)0.5|Gi|κyGi\displaystyle=c^{P}_{1}\prod_{\begin{subarray}{c}i=1,\ldots,u,\\ y=0,1\end{subarray}}\left(\nu^{G_{i}*}_{y}\right)^{-0.5|G_{i}|}\Gamma_{|G_{i}|}(0.5\kappa^{G_{i}*}_{y})(\kappa^{G_{i}*}_{y}-|G_{i}|-1)^{-0.5|G_{i}|\kappa^{G_{i}*}_{y}}
×j=1,,v(νBj)0.5|Bj|Γ|Bj|(0.5κBj)(κBj|Bj|1)0.5|Bj|κBj,\displaystyle\quad\times\prod_{j=1,\ldots,v}\left(\nu^{B_{j}*}\right)^{-0.5|B_{j}|}\Gamma_{|B_{j}|}(0.5\kappa^{B_{j}*})(\kappa^{B_{j}*}-|B_{j}|-1)^{-0.5|B_{j}|\kappa^{B_{j}*}}, (3.12)

where c1Pc^{P}_{1} is a positive constant that does not depend on nn. We first focus on each block individually. νyGi\nu^{G_{i}*}_{y} and κyGi\kappa^{G_{i}*}_{y} are asymptotically equivalent to nyn_{y}, and νBj\nu^{B_{j}*} and κBj\kappa^{B_{j}*} are asymptotically equivalent to nn. Further, for all positive integers dd,

Γd(x)=πd(d1)/4k=1dΓ(x+(1k)/2),\Gamma_{d}(x)=\pi^{d(d-1)/4}\prod_{k=1}^{d}\Gamma\left(x+(1-k)/2\right), (3.13)

and by Stirling’s formula,

Γd(x)πd(d1)/4k=1d2πx+(1k)/2(x+(1k)/2e)x+(1k)/2\Gamma_{d}(x)\sim\pi^{d(d-1)/4}\prod_{k=1}^{d}\sqrt{\frac{2\pi}{x+(1-k)/2}}\left(\frac{x+(1-k)/2}{e}\right)^{x+(1-k)/2} (3.14)

as xx\to\infty. Thus, for all AFA\subseteq F,

(νyA)0.5|A|Γ|A|(0.5κyA)(κyA|A|1)0.5|A|κyA\displaystyle\left(\nu^{A*}_{y}\right)^{-0.5|A|}\Gamma_{|A|}(0.5\kappa^{A*}_{y})(\kappa^{A*}_{y}-|A|-1)^{-0.5|A|\kappa^{A*}_{y}} (3.15)
c2Any0.75|A|0.25|A|2(2e)0.5|A|nyk=1|A|(ny+κyA+1kny+κyA|A|1)0.5ny\displaystyle\qquad\sim c_{2}^{A}n_{y}^{-0.75|A|-0.25|A|^{2}}(2e)^{-0.5|A|n_{y}}\prod_{k=1}^{|A|}\left(\frac{n_{y}+\kappa_{y}^{A}+1-k}{n_{y}+\kappa_{y}^{A}-|A|-1}\right)^{0.5n_{y}} (3.16)

as nyn_{y}\to\infty, where c2A>0c_{2}^{A}>0 does not depend on nn. For any constants c1c_{1} and c2c_{2},

(xc1)c2xec1c2xc2x(x-c_{1})^{-c_{2}x}\sim e^{c_{1}c_{2}}x^{-c_{2}x} (3.17)

as xx\to\infty. Hence,

(νyA)0.5|A|Γ|A|(0.5κyA)(κyA|A|1)0.5|A|κyA\displaystyle\left(\nu^{A*}_{y}\right)^{-0.5|A|}\Gamma_{|A|}(0.5\kappa^{A*}_{y})(\kappa^{A*}_{y}-|A|-1)^{-0.5|A|\kappa^{A*}_{y}} c3Any0.75|A|0.25|A|2(2e)0.5|A|ny\displaystyle\sim c_{3}^{A}n_{y}^{-0.75|A|-0.25|A|^{2}}(2e)^{-0.5|A|n_{y}} (3.18)

as nyn_{y}\to\infty, where c3A>0c_{3}^{A}>0 does not depend on nn. Similarly,

(νA)0.5|A|Γ|A|(0.5κA)(κA|A|1)0.5|A|κA\displaystyle\left(\nu^{A*}\right)^{-0.5|A|}\Gamma_{|A|}(0.5\kappa^{A*})(\kappa^{A*}-|A|-1)^{-0.5|A|\kappa^{A*}} c4An0.75|A|0.25|A|2(2e)0.5|A|n\displaystyle\sim c_{4}^{A}n^{-0.75|A|-0.25|A|^{2}}(2e)^{-0.5|A|n} (3.19)

as nn\to\infty, where c4A>0c_{4}^{A}>0 does not depend on nn. Applying (3.18) and (3.19) in (3.12), we have that q(P)c5P(n0n1)r1Pnr2P(2e)0.5|F|nq(P)\sim c^{P}_{5}(n_{0}n_{1})^{-r_{1}^{P}}n^{-r_{2}^{P}}(2e)^{-0.5|F|n} as nn\to\infty, where c5P>0c^{P}_{5}>0 does not depend on nn, r1P0.75|G|+0.25i=1u|Gi|2r_{1}^{P}\equiv 0.75|G|+0.25\sum_{i=1}^{u}|G_{i}|^{2}, r2P0.75|B|+0.25j=1v|Bj|2r_{2}^{P}\equiv 0.75|B|+0.25\sum_{j=1}^{v}|B_{j}|^{2}, and we treat n0n_{0} and n1n_{1} as functions of nn. Applying this in the ratio, we have:

q(P)q(P)\displaystyle\frac{q(P)}{q(P^{\prime})} c(ρn(1ρn))r1nr2\displaystyle\sim c(\rho_{n}(1-\rho_{n}))^{-r_{1}}n^{-r_{2}} (3.20)

as n\to\infty, where c\equiv c^{P}_{5}/c^{P^{\prime}}_{5}, r_{1}\equiv r_{1}^{P}-r_{1}^{P^{\prime}} and r_{2}\equiv 2r_{1}^{P}-2r_{1}^{P^{\prime}}+r_{2}^{P}-r_{2}^{P^{\prime}}. We always have \limsup_{n\to\infty}\rho_{n}(1-\rho_{n})\leq 0.25, and by our assumed bounds on the limit inferior and limit superior of \rho_{n} we also have \liminf_{n\to\infty}\rho_{n}(1-\rho_{n})>0. Thus, for all s<r_{2},

limnq(P)/q(P)ns\displaystyle\lim_{n\to\infty}\frac{q(P)/q(P^{\prime})}{n^{-s}} =limnc(ρn(1ρn))r1nr2ns=0.\displaystyle=\lim_{n\to\infty}\frac{c(\rho_{n}(1-\rho_{n}))^{-r_{1}}n^{-r_{2}}}{n^{-s}}=0. (3.21)

Further,

r2\displaystyle r_{2} =14(3|G|3|G|+2i=1u|Gi|22i=1u|Gi|2+j=1v|Bj|2j=1v|Bj|2).\displaystyle=\frac{1}{4}\left(3|G|-3|G^{\prime}|+2\sum_{i=1}^{u}|G_{i}|^{2}-2\sum_{i=1}^{u^{\prime}}|G_{i}^{\prime}|^{2}+\sum_{j=1}^{v}|B_{j}|^{2}-\sum_{j=1}^{v^{\prime}}|B_{j}^{\prime}|^{2}\right). (3.22)

If P^{\prime} is a strict refinement of P, then r_{2}>0. Also, if P^{\prime} is a mesh of P where P\neq P^{\prime}, every good block in P^{\prime} is a subset of a good block in P, and every bad block in P^{\prime} is a subset of a good or bad block in P, then r_{2}>0. If r_{2}>0, then (3.21) holds for all 0<s<r_{2}. ∎
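The asymptotics invoked above are easy to verify numerically. The sketch below compares log Γ_d(x) (available as scipy.special.multigammaln) with the logarithm of the Stirling-based approximation (3.14), and checks (3.17) on the log scale; the values of d, x, c_1 and c_2 are arbitrary test values.

import numpy as np
from scipy.special import multigammaln

def log_stirling_multigamma(x, d):
    """Logarithm of the right-hand side of (3.14)."""
    z = x + (1 - np.arange(1, d + 1)) / 2
    return (d * (d - 1) / 4 * np.log(np.pi)
            + np.sum(0.5 * np.log(2 * np.pi / z) + z * (np.log(z) - 1)))

d = 3
for x in [10.0, 100.0, 1000.0]:
    exact = multigammaln(x, d)                            # log Gamma_d(x), cf. (3.13)
    print(x, exact - log_stirling_multigamma(x, d))       # gap in logs shrinks as x grows

# (3.17): (x - c1)^(-c2 x) ~ e^(c1 c2) x^(-c2 x); compare on the log scale.
c1, c2 = 2.0, 0.5
for x in [10.0, 100.0, 1000.0]:
    lhs = -c2 * x * np.log(x - c1)
    rhs = c1 * c2 - c2 * x * np.log(x)
    print(x, lhs - rhs)                                   # tends to 0 as x grows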

Lemma 2.

Let AFA\subseteq F be any non-empty feature set such that Σ0A\Sigma_{0}^{A} and Σ1A\Sigma_{1}^{A} are full rank. Suppose for all fFf\in F the fourth order moment across both classes exists and is finite. Let SS_{\infty} be a balanced sample. Then, w.p. 11 there exist K,N>0K,N>0 (which may depend on the sample) such that

|μ^yA(i)μyA(i)|\displaystyle|\hat{\mu}_{y}^{A}(i)-\mu_{y}^{A}(i)| <Kloglognn,\displaystyle<K\sqrt{\frac{\log\log n}{n}}, (3.23)
||CyA||Σ^yA|1|\displaystyle\left|\frac{|C_{y}^{A}|}{|\hat{\Sigma}_{y}^{A}|}-1\right| <Kn,\displaystyle<\frac{K}{n}, (3.24)
|Σ^yA(i,j)ΣyA(i,j)|\displaystyle\left|\hat{\Sigma}_{y}^{A}(i,j)-\Sigma_{y}^{A}(i,j)\right| <Kloglognn,\displaystyle<K\sqrt{\frac{\log\log n}{n}}, (3.25)
||Σ^yA||ΣyA|1|\displaystyle\left|\frac{|\hat{\Sigma}_{y}^{A}|}{|\Sigma_{y}^{A}|}-1\right| <Kloglognn,\displaystyle<K\sqrt{\frac{\log\log n}{n}}, (3.26)
|CyA(i,j)ΣyA(i,j)|\displaystyle\left|C_{y}^{A}(i,j)-\Sigma_{y}^{A}(i,j)\right| <Kloglognn,\displaystyle<K\sqrt{\frac{\log\log n}{n}}, (3.27)
||CA||Σ^A|1|\displaystyle\left|\frac{|C^{A}|}{|\hat{\Sigma}^{A}|}-1\right| <Kn,\displaystyle<\frac{K}{n}, (3.28)
||Σ^A||ΣnA|1|\displaystyle\left|\frac{|\hat{\Sigma}^{A}|}{|\Sigma_{n}^{A}|}-1\right| <Kloglognn,\displaystyle<K\sqrt{\frac{\log\log n}{n}}, (3.29)
|CA(i,j)ΣnA(i,j)|\displaystyle\left|C^{A}(i,j)-\Sigma^{A}_{n}(i,j)\right| <Kloglognn\displaystyle<K\sqrt{\frac{\log\log n}{n}} (3.30)

all hold for all n>Nn>N, y=0,1y=0,1 and i,j=1,,|A|i,j=1,\ldots,|A|, where

ΣnA\displaystyle\Sigma_{n}^{A} =ρnΣ0A+(1ρn)Σ1A+ρn(1ρn)(μ0Aμ1A)(μ0Aμ1A)T.\displaystyle=\rho_{n}\Sigma_{0}^{A}+\left(1-\rho_{n}\right)\Sigma_{1}^{A}+\rho_{n}(1-\rho_{n})(\mu_{0}^{A}-\mu_{1}^{A})(\mu_{0}^{A}-\mu_{1}^{A})^{T}. (3.31)
Proof.

Fix AA. Since SS_{\infty} is a balanced sample, nyn_{y}\to\infty as nn\to\infty, and by the strong law of large numbers μ^yA\hat{\mu}_{y}^{A} converges to μyA\mu_{y}^{A} and Σ^yA\hat{\Sigma}_{y}^{A} converges to ΣyA\Sigma_{y}^{A} w.p. 11 for both y=0,1y=0,1. In the rest of this proof, we only consider the event where μ^yA\hat{\mu}_{y}^{A} and Σ^yA\hat{\Sigma}_{y}^{A} converge.

Since SS_{\infty} is a balanced sample, there exist 0<p0,p1<10<p_{0},p_{1}<1 and N1>0N_{1}>0 such that ny/n>pyn_{y}/n>p_{y} for all n>N1n>N_{1} and y=0,1y=0,1. By the law of the iterated logarithm (Hartman and Wintner, 1941), w.p. 11 there exist K1>0K_{1}>0 and N2>N1N_{2}>N_{1} such that

|μ^yA(i)μyA(i)|\displaystyle|\hat{\mu}_{y}^{A}(i)-\mu_{y}^{A}(i)| <K1loglognyny\displaystyle<K_{1}\sqrt{\frac{\log\log n_{y}}{n_{y}}}
<K1pyloglognn\displaystyle<\frac{K_{1}}{\sqrt{p_{y}}}\sqrt{\frac{\log\log n}{n}} (3.32)

for all n>N2n>N_{2} and i=1,,|A|i=1,\ldots,|A| (Foroughi pour and Dalton, 2019). This proves (3.23).

We can decompose the sample covariance for class yy as follows:

Σ^yA(i,j)=1ny1k=1ny(xy,kiμ^yA(i))(xy,kjμ^yA(j))\displaystyle\hat{\Sigma}_{y}^{A}(i,j)=\frac{1}{n_{y}-1}\sum_{k=1}^{n_{y}}(x_{y,k}^{i}-\hat{\mu}_{y}^{A}(i))(x_{y,k}^{j}-\hat{\mu}_{y}^{A}(j))
=1ny1k=1ny(xy,kiμyA(i))(xy,kjμyA(j))nyny1(μ^yA(i)μyA(i))(μ^yA(j)μyA(j))\displaystyle=\frac{1}{n_{y}-1}\sum_{k=1}^{n_{y}}(x_{y,k}^{i}-\mu_{y}^{A}(i))(x_{y,k}^{j}-\mu_{y}^{A}(j))-\frac{n_{y}}{n_{y}-1}(\hat{\mu}_{y}^{A}(i)-\mu_{y}^{A}(i))(\hat{\mu}_{y}^{A}(j)-\mu_{y}^{A}(j)) (3.33)

where xy,kix_{y,k}^{i} is the kkth sample of feature ii in class yy. By (3.23), w.p. 11 there exists K2>0K_{2}>0 and N3>N2N_{3}>N_{2} such that

|nyny1(μ^yA(i)μyA(i))(μ^yA(j)μyA(j))|\displaystyle\left|\frac{n_{y}}{n_{y}-1}(\hat{\mu}_{y}^{A}(i)-\mu_{y}^{A}(i))(\hat{\mu}_{y}^{A}(j)-\mu_{y}^{A}(j))\right| <K2loglognn\displaystyle<K_{2}\frac{\log\log n}{n} (3.34)

for all n>N3n>N_{3} and i,j=1,,|A|i,j=1,\ldots,|A|. (xkiμyA(i))(xkjμyA(j))(x_{k}^{i}-\mu_{y}^{A}(i))(x_{k}^{j}-\mu_{y}^{A}(j)) is a random variable with mean ΣyA(i,j)\Sigma_{y}^{A}(i,j) and finite variance Vy(i,j)V_{y}(i,j). Suppose Vy(i,j)>0V_{y}(i,j)>0. By the law of the iterated logarithm, w.p. 11

lim supny12nyloglognyk=1ny(xy,kiμyA(i))(xy,kjμyA(j))ΣyA(i,j)Vy(i,j)=1\displaystyle\limsup_{n_{y}\to\infty}\frac{1}{\sqrt{2n_{y}\log\log n_{y}}}\sum_{k=1}^{n_{y}}\frac{(x_{y,k}^{i}-\mu_{y}^{A}(i))(x_{y,k}^{j}-\mu_{y}^{A}(j))-\Sigma_{y}^{A}(i,j)}{\sqrt{V_{y}(i,j)}}=1 (3.35)

for all i,j=1,,|A|i,j=1,\ldots,|A|. Thus, w.p. 11 there exists K3>0K_{3}>0 and N4>N3N_{4}>N_{3} such that

|1ny1k=1ny(xy,kiμyA(i))(xy,kjμyA(j))ΣyA(i,j)|\displaystyle\left|\frac{1}{n_{y}-1}\sum_{k=1}^{n_{y}}(x_{y,k}^{i}-\mu_{y}^{A}(i))(x_{y,k}^{j}-\mu_{y}^{A}(j))-\Sigma_{y}^{A}(i,j)\right|
|1ny1k=1ny(xy,kiμyA(i))(xy,kjμyA(j))nyny1ΣyA(i,j)|+1ny1ΣyA(i,j)\displaystyle\qquad\leq\left|\frac{1}{n_{y}-1}\sum_{k=1}^{n_{y}}(x_{y,k}^{i}-\mu_{y}^{A}(i))(x_{y,k}^{j}-\mu_{y}^{A}(j))-\frac{n_{y}}{n_{y}-1}\Sigma_{y}^{A}(i,j)\right|+\frac{1}{n_{y}-1}\Sigma_{y}^{A}(i,j)
<2Vy(i,j)nyloglognyny1+1ny1ΣyA(i,j)\displaystyle\qquad<\frac{2\sqrt{V_{y}(i,j)n_{y}\log\log n_{y}}}{n_{y}-1}+\frac{1}{n_{y}-1}\Sigma_{y}^{A}(i,j)
<K3loglognn\displaystyle\qquad<K_{3}\sqrt{\frac{\log\log n}{n}} (3.36)

for all n>N4n>N_{4} and i,j=1,,|A|i,j=1,\ldots,|A|. A similar inequality holds when Vy(i,j)=0V_{y}(i,j)=0. Combining (3.33), (3.34) and (3.36), w.p. 11 there exists K4>0K_{4}>0 such that

|Σ^yA(i,j)ΣyA(i,j)|\displaystyle|\hat{\Sigma}_{y}^{A}(i,j)-\Sigma_{y}^{A}(i,j)| <K3loglognn+K2loglognn\displaystyle<K_{3}\sqrt{\frac{\log\log n}{n}}+K_{2}\frac{\log\log n}{n}
<K4loglognn\displaystyle<K_{4}\sqrt{\frac{\log\log n}{n}} (3.37)

for all n>N4n>N_{4} and i,j=1,,|A|i,j=1,\ldots,|A|. Thus (3.25) holds. Since |Σ^yA||\hat{\Sigma}_{y}^{A}| is a polynomial function of the Σ^yA(i,j)\hat{\Sigma}_{y}^{A}(i,j), where each term has degree |A||A| and coefficient ±1\pm 1, w.p. 11 there exists K5>0K_{5}>0 such that

||Σ^yA||ΣyA||<K5loglognn\displaystyle||\hat{\Sigma}_{y}^{A}|-|\Sigma_{y}^{A}||<K_{5}\sqrt{\frac{\log\log n}{n}} (3.38)

for all n>N4n>N_{4}. Dividing both sides by |ΣyA||\Sigma_{y}^{A}| proves (3.26).

Note that,

CyA\displaystyle C^{A}_{y} =ny1ny+κyA|A|1Σ^yA+1ny+κyA|A|1SyA\displaystyle=\frac{n_{y}-1}{n_{y}+\kappa^{A}_{y}-|A|-1}\hat{\Sigma}^{A}_{y}+\frac{1}{n_{y}+\kappa^{A}_{y}-|A|-1}S^{A}_{y}
+νyAny(ny+κyA|A|1)(ny+νyA)(μ^yAmyA)(μ^yAmyA)T.\displaystyle\quad+\frac{\nu^{A}_{y}n_{y}}{(n_{y}+\kappa^{A}_{y}-|A|-1)(n_{y}+\nu^{A}_{y})}(\hat{\mu}^{A}_{y}-m^{A}_{y})(\hat{\mu}^{A}_{y}-m^{A}_{y})^{T}. (3.39)

Further (3.23) implies that w.p. 11 μ^yA(i)myA(i)\hat{\mu}_{y}^{A}(i)-m_{y}^{A}(i) is bounded for all n>N2n>N_{2}. Similarly, (3.25) implies that w.p. 11 Σ^yA(i,j)\hat{\Sigma}_{y}^{A}(i,j) is bounded for all n>N3n>N_{3}. Thus, from (3.39), w.p. 11 there exists K6>0K_{6}>0 and N5>N4N_{5}>N_{4} such that

|CyA(i,j)Σ^yA(i,j)|\displaystyle|C_{y}^{A}(i,j)-\hat{\Sigma}_{y}^{A}(i,j)| K6n\displaystyle\leq\frac{K_{6}}{n} (3.40)

for all n>N5n>N_{5} and i,j=1,|A|i,j=1,\ldots|A|. Again noting that the determinant is a polynomial, w.p. 11 there exists K7>0K_{7}>0 such that

||CyA||Σ^yA||<K7n\displaystyle||C_{y}^{A}|-|\hat{\Sigma}_{y}^{A}||<\frac{K_{7}}{n} (3.41)

for all n>N5n>N_{5}. By (3.38), |Σ^yA||\hat{\Sigma}_{y}^{A}| is bounded for all n>N4n>N_{4} w.p. 11. Thus, (3.24) holds. Applying the triangle inequality on CyA(i,j)ΣyA(i,j)C_{y}^{A}(i,j)-\Sigma_{y}^{A}(i,j) and applying (3.25) and (3.40), we have that (3.27) also holds.

The sample covariance across all samples in both classes can be decomposed as

Σ^A\displaystyle\hat{\Sigma}^{A} =(ρn1ρnn1)Σ^0A+(1ρnρnn1)Σ^1A\displaystyle=\left(\rho_{n}-\frac{1-\rho_{n}}{n-1}\right)\hat{\Sigma}_{0}^{A}+\left(1-\rho_{n}-\frac{\rho_{n}}{n-1}\right)\hat{\Sigma}_{1}^{A}
+ρn(1ρn)nn1(μ^0Aμ^1A)(μ^0Aμ^1A)T.\displaystyle\quad+\frac{\rho_{n}(1-\rho_{n})n}{n-1}(\hat{\mu}_{0}^{A}-\hat{\mu}_{1}^{A})(\hat{\mu}_{0}^{A}-\hat{\mu}_{1}^{A})^{T}. (3.42)

Define

Σ^nA\displaystyle\hat{\Sigma}_{n}^{A} =ρnΣ^0A+(1ρn)Σ^1A+ρn(1ρn)(μ^0Aμ^1A)(μ^0Aμ^1A)T.\displaystyle=\rho_{n}\hat{\Sigma}_{0}^{A}+(1-\rho_{n})\hat{\Sigma}_{1}^{A}+\rho_{n}(1-\rho_{n})(\hat{\mu}_{0}^{A}-\hat{\mu}_{1}^{A})(\hat{\mu}_{0}^{A}-\hat{\mu}_{1}^{A})^{T}. (3.43)

Again since w.p. 1 $\hat{\mu}_{y}^{A}(i)$ and $\hat{\Sigma}_{y}^{A}(i,j)$ are bounded for all $n>N_{3}$, by the triangle inequality w.p. 1 there exists $K_{8}>0$ such that

|\hat{\Sigma}^{A}(i,j)-\hat{\Sigma}^{A}_{n}(i,j)|
\leq\frac{1-\rho_{n}}{n-1}\hat{\Sigma}_{0}^{A}(i,j)+\frac{\rho_{n}}{n-1}\hat{\Sigma}_{1}^{A}(i,j)
\quad+\frac{\rho_{n}(1-\rho_{n})}{n-1}|\hat{\mu}_{0}^{A}(i)-\hat{\mu}_{1}^{A}(i)||\hat{\mu}_{0}^{A}(j)-\hat{\mu}_{1}^{A}(j)|
<\frac{K_{8}}{n} \quad (3.44)

for all $n>N_{3}$ and $i,j=1,\ldots,|A|$. Further, by the triangle inequality,

|\hat{\Sigma}^{A}_{n}(i,j)-\Sigma^{A}_{n}(i,j)|
\leq\rho_{n}|\hat{\Sigma}_{0}^{A}(i,j)-\Sigma_{0}^{A}(i,j)|+(1-\rho_{n})|\hat{\Sigma}_{1}^{A}(i,j)-\Sigma_{1}^{A}(i,j)|
\quad+\rho_{n}(1-\rho_{n})d(i,j), \quad (3.45)

for all $n>N_{3}$ and $i,j=1,\ldots,|A|$ w.p. 1, where

d(i,j) =\left|(\hat{\mu}_{0}^{A}(i)-\hat{\mu}_{1}^{A}(i))(\hat{\mu}_{0}^{A}(j)-\hat{\mu}_{1}^{A}(j))-(\mu_{0}^{A}(i)-\mu_{1}^{A}(i))(\mu_{0}^{A}(j)-\mu_{1}^{A}(j))\right|
=\left|(d_{0}(i)+d_{01}(i)+d_{1}(i))(d_{0}(j)+d_{01}(j)+d_{1}(j))-d_{01}(i)d_{01}(j)\right|
\leq|d_{0}(i)d_{0}(j)|+|d_{0}(i)d_{01}(j)|+|d_{0}(i)d_{1}(j)|+|d_{01}(i)d_{0}(j)|+|d_{01}(i)d_{1}(j)|
\quad+|d_{1}(i)d_{0}(j)|+|d_{1}(i)d_{01}(j)|+|d_{1}(i)d_{1}(j)|, \quad (3.46)

$d_{0}=\hat{\mu}_{0}^{A}-\mu_{0}^{A}$, $d_{01}=\mu_{0}^{A}-\mu_{1}^{A}$ and $d_{1}=\mu_{1}^{A}-\hat{\mu}_{1}^{A}$. Each $d_{01}(i)$, $i=1,\ldots,|A|$, is a constant that does not depend on $n$. By (3.23), w.p. 1 there exist $K_{9}>0$ and $N_{6}>N_{5}$ such that $|d_{y}(i)|<K_{9}\sqrt{\log\log n/n}$ for all $n>N_{6}$, $y=0,1$ and $i=1,\ldots,|A|$. Thus, w.p. 1 there exists $K_{10}>0$ such that $d(i,j)<K_{10}\sqrt{\log\log n/n}$ for all $n>N_{6}$ and $i,j=1,\ldots,|A|$. Applying this fact and (3.25) to (3.45), w.p. 1 there exist $K_{11}>0$ and $N_{7}>N_{6}$ such that

|\hat{\Sigma}_{n}^{A}(i,j)-\Sigma_{n}^{A}(i,j)|<K_{11}\sqrt{\frac{\log\log n}{n}} \quad (3.47)

for all $n>N_{7}$ and $i,j=1,\ldots,|A|$. Combining this fact with (3.44), w.p. 1 there exists $K_{12}>0$ such that

|\hat{\Sigma}^{A}(i,j)-\Sigma_{n}^{A}(i,j)|<K_{12}\sqrt{\frac{\log\log n}{n}} \quad (3.48)

for all $n>N_{7}$ and $i,j=1,\ldots,|A|$. Again since the determinant is a polynomial, w.p. 1 there exists $K_{13}>0$ such that

\left||\hat{\Sigma}^{A}|-|\Sigma_{n}^{A}|\right|<K_{13}\sqrt{\frac{\log\log n}{n}} \quad (3.49)

for all $n>N_{7}$. Observe from (3.31) that $\Sigma_{n}^{A}(i,j)$ is lower bounded by

\min(\Sigma_{0}^{A}(i,j),\Sigma_{1}^{A}(i,j))+\min(0,0.25(\mu_{0}^{A}(i)-\mu_{1}^{A}(i))(\mu_{0}^{A}(j)-\mu_{1}^{A}(j))), \quad (3.50)

and upper bounded by

\max(\Sigma_{0}^{A}(i,j),\Sigma_{1}^{A}(i,j))+\max(0,0.25(\mu_{0}^{A}(i)-\mu_{1}^{A}(i))(\mu_{0}^{A}(j)-\mu_{1}^{A}(j))). \quad (3.51)

Since $|\Sigma_{n}^{A}|$ is a polynomial function of the $\Sigma_{n}^{A}(i,j)$, where each term has degree $|A|$ and coefficient $\pm 1$, $|\Sigma_{n}^{A}|$ must also be upper and lower bounded when $n$ is large enough. (3.29) follows by dividing both sides of (3.49) by $|\Sigma_{n}^{A}|$, and applying a bound on $|\Sigma_{n}^{A}|$ on the right hand side.

Observe that

C^{A} =\frac{n-1}{n+\kappa^{A}-|A|-1}\hat{\Sigma}^{A}+\frac{1}{n+\kappa^{A}-|A|-1}S^{A}
\quad+\frac{\nu^{A}n}{(n+\kappa^{A}-|A|-1)(n+\nu^{A})}(\hat{\mu}^{A}-m^{A})(\hat{\mu}^{A}-m^{A})^{T}. \quad (3.52)

Since $\hat{\mu}^{A}=\rho_{n}\hat{\mu}_{0}^{A}+(1-\rho_{n})\hat{\mu}_{1}^{A}$, (3.23) implies that w.p. 1 $\hat{\mu}^{A}(i)-m^{A}(i)$ is bounded for all $n>N_{2}$. Similarly, (3.48) and the fact that $\Sigma_{n}^{A}(i,j)$ is bounded for $n$ large enough imply that w.p. 1 there exists $N_{8}>N_{7}$ such that $\hat{\Sigma}^{A}(i,j)$ is bounded for all $n>N_{8}$. Thus, from (3.52), w.p. 1 there exist $K_{14}>0$ and $N_{9}>N_{8}$ such that

|C^{A}(i,j)-\hat{\Sigma}^{A}(i,j)| \leq\frac{K_{14}}{n} \quad (3.53)

for all $n>N_{9}$ and $i,j=1,\ldots,|A|$. Again since the determinant is a polynomial, w.p. 1 there exists $K_{15}>0$ such that

\left||C^{A}|-|\hat{\Sigma}^{A}|\right|<\frac{K_{15}}{n} \quad (3.54)

for all $n>N_{9}$. By (3.49) and the fact that $|\Sigma_{n}^{A}|$ is bounded for $n$ large enough, w.p. 1 there exists $N_{10}>N_{9}$ such that $|\hat{\Sigma}^{A}|$ is bounded for all $n>N_{10}$. Thus, (3.28) holds. Applying the triangle inequality on $C^{A}(i,j)-\Sigma_{n}^{A}(i,j)$ and applying (3.48) and (3.53), we have that (3.30) also holds. ∎
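
All of the rates in Lemma 2 are driven by the law of the iterated logarithm. As a rough empirical illustration (an informal simulation with arbitrary parameters, not part of the proof), one can track the worst-case entrywise error of a sample covariance against the $\sqrt{\log\log n/n}$ scale:

```python
import numpy as np

# Informal illustration of the sqrt(log log n / n) scale used throughout
# Lemma 2: entrywise error of the sample covariance relative to that rate.
# Distribution parameters are arbitrary illustrative choices.
rng = np.random.default_rng(2)
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
for n in [10**3, 10**4, 10**5, 10**6]:
    x = rng.multivariate_normal([0.0, 0.0], Sigma, size=n)
    err = np.max(np.abs(np.cov(x, rowvar=False) - Sigma))
    print(n, err / np.sqrt(np.log(np.log(n)) / n))   # ratio stays O(1)
```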

Corollary 1.

Let $A\subseteq F$ be any non-empty feature set such that $\Sigma_{0}^{A}$ and $\Sigma_{1}^{A}$ are full rank. Suppose for all $f\in F$ the fourth order moment across both classes exists and is finite. Let $S_{\infty}$ be a balanced sample. Then, w.p. 1 the following all hold for all $y=0,1$ and $i,j=1,\ldots,|A|$, where $\Sigma_{n}^{A}$ is given in (3.31).

  1. $C_{y}^{A}(i,j)\to\Sigma_{y}^{A}(i,j)$ as $n\to\infty$.

  2. $|C_{y}^{A}|\to|\Sigma_{y}^{A}|$ as $n\to\infty$.

  3. If $\Sigma^{A}\equiv\Sigma_{0}^{A}=\Sigma_{1}^{A}$ and $\mu_{0}^{A}=\mu_{1}^{A}$, then $C^{A}(i,j)\to\Sigma^{A}(i,j)$ as $n\to\infty$.

  4. If $\Sigma^{A}\equiv\Sigma_{0}^{A}=\Sigma_{1}^{A}$ and $\mu_{0}^{A}=\mu_{1}^{A}$, then $|C^{A}|\to|\Sigma^{A}|$ as $n\to\infty$.

  5. $C^{A}(i,j)-\Sigma_{n}^{A}(i,j)\to 0$ as $n\to\infty$.

  6. $|C^{A}|\sim|\Sigma_{n}^{A}|$ as $n\to\infty$.

  7. There exist $L,U\in\mathbb{R}$ and $N>0$ such that $L\leq C^{A}(i,j)\leq U$ for all $n>N$.

  8. There exist $L,U,N>0$ such that $L\leq|C^{A}|\leq U$ for all $n>N$.

Proof.

(1) follows from (3.27). (2) follows by combining (3.24) and (3.26). (5) follows from (3.30). (6) follows by combining (3.28) and (3.29). (3) and (4) are special cases of (5) and (6), where in this case $\Sigma_{n}^{A}=\Sigma^{A}$ is constant. Recall that in Lemma 2 we showed that $\Sigma_{n}^{A}(i,j)$ and $|\Sigma_{n}^{A}|$ are lower and upper bounded when $n$ is large enough. By (3.30), $C^{A}(i,j)$ must also be upper and lower bounded when $n$ is large enough, so (7) holds. Finally, since by item (6) $|C^{A}|$ and $|\Sigma_{n}^{A}|$ are asymptotically equivalent, $|C^{A}|$ must also be upper and lower bounded when $n$ is large enough, so (8) holds. ∎
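
To see items (5) and (6) in action when the classes genuinely differ, the following informal sketch compares $C^{A}$ from (3.52) with the mixture covariance $\Sigma_{n}^{A}=\rho_{n}\Sigma_{0}^{A}+(1-\rho_{n})\Sigma_{1}^{A}+\rho_{n}(1-\rho_{n})(\mu_{0}^{A}-\mu_{1}^{A})(\mu_{0}^{A}-\mu_{1}^{A})^{T}$, the expression appearing in the denominator of (3.69); all distribution parameters and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Informal check of Corollary 1, items (5)-(6): with distinct classes, C^A
# from (3.52) tracks the mixture covariance Sigma_n^A, not either class
# covariance alone.  All parameters are illustrative assumptions.
rng = np.random.default_rng(3)
d = 2
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 0.5])
Sig0 = np.array([[1.0, 0.2], [0.2, 1.0]])
Sig1 = np.array([[2.0, -0.3], [-0.3, 1.5]])
kappa, nu, m, S = d + 2, 1.0, np.zeros(d), np.eye(d)

for n0 in [100, 10000]:
    n, rho = 2 * n0, 0.5
    x = np.vstack([rng.multivariate_normal(mu0, Sig0, size=n0),
                   rng.multivariate_normal(mu1, Sig1, size=n0)])
    diff = x.mean(axis=0) - m
    C = ((n - 1) * np.cov(x, rowvar=False) + S
         + nu * n / (n + nu) * np.outer(diff, diff)) / (n + kappa - d - 1)
    Sigma_n = (rho * Sig0 + (1 - rho) * Sig1
               + rho * (1 - rho) * np.outer(mu0 - mu1, mu0 - mu1))
    print(n, np.max(np.abs(C - Sigma_n)),
          np.linalg.det(C) / np.linalg.det(Sigma_n))   # gap shrinks, ratio -> 1
```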

Lemma 3.

Let $A_{1}$ and $A_{2}$ be any disjoint feature sets such that $\mu_{0}^{A_{1}}=\mu_{1}^{A_{1}}$, $\Sigma^{A_{1}}\equiv\Sigma_{0}^{A_{1}}=\Sigma_{1}^{A_{1}}$, features in $A_{1}$ and $A_{2}$ are uncorrelated in both classes, and $\Sigma_{0}^{A}$ and $\Sigma_{1}^{A}$ are full rank, where $A=A_{1}\cup A_{2}$. Suppose for all $f\in F$ the fourth order moment across both classes exists and is finite. Let $S_{\infty}$ be a balanced sample. Then, w.p. 1 there exist $c,N>0$ such that

\left(\frac{|C^{A_{1}}|^{\kappa^{A_{1}*}}|C^{A_{2}}|^{\kappa^{A_{2}*}}}{|C^{A}|^{\kappa^{A*}}}\right)^{0.5}<(\log n)^{c} \quad (3.55)

for all $n>N$. The left hand side of (3.55) is $a(P)/a(P^{\prime})$ when $P$ and $P^{\prime}$ are identical feature partitions except $A$ is a bad block in $P$, and $A_{1}$ and $A_{2}$ are bad blocks in $P^{\prime}$.

Proof.

We have that

\left(\frac{|C^{A_{1}}|^{\kappa^{A_{1}*}}|C^{A_{2}}|^{\kappa^{A_{2}*}}}{|C^{A}|^{\kappa^{A*}}}\right)^{0.5} =\left(\frac{|C^{A_{1}}|^{\kappa^{A_{1}}}|C^{A_{2}}|^{\kappa^{A_{2}}}}{|C^{A}|^{\kappa^{A}}}\right)^{0.5}\times R_{n}, \quad (3.56)

where

R_{n}=\left(\frac{|C^{A_{1}}||C^{A_{2}}|}{|C^{A}|}\right)^{0.5n}. \quad (3.57)

By Corollary 1, the first term in (3.56) is upper bounded by a positive finite constant for $n$ large enough w.p. 1. Without loss of generality, assume features in $A$ are ordered such that

C^{A}=\begin{pmatrix}C^{A_{1}}&V\\ V^{T}&C^{A_{2}}\end{pmatrix} \quad (3.58)

for some matrix $V$. Observe that

|C^{A}|=|C^{A_{1}}||C^{A_{2}}-W|, \quad (3.59)

where $W\equiv V^{T}(C^{A_{1}})^{-1}V$. Since $\mu_{0}^{A_{1}}=\mu_{1}^{A_{1}}$, $\Sigma_{0}^{A_{1}}=\Sigma_{1}^{A_{1}}$, and features in $A_{1}$ and $A_{2}$ are uncorrelated in both classes,

\Sigma^{A}_{n}=\begin{pmatrix}\Sigma^{A_{1}}&0_{|A_{1}|,|A_{2}|}\\ 0_{|A_{2}|,|A_{1}|}&\Sigma^{A_{2}}_{n}\end{pmatrix}, \quad (3.60)

where $\Sigma^{A}_{n}$ and $\Sigma^{A_{2}}_{n}$ are defined in (3.31). By Lemma 2, w.p. 1 there exist $K_{1},N_{1}>0$ such that for all $n>N_{1}$, $i=1,\ldots,|A_{1}|$ and $j=1,\ldots,|A_{2}|$,

|V(i,j)|<K_{1}\sqrt{\frac{\log\log n}{n}}. \quad (3.61)

Since $W$ is comprised of only quadratic terms in $V$ and $C^{A_{1}}(i,j)\to\Sigma^{A_{1}}(i,j)$ w.p. 1 (Corollary 1), w.p. 1 there exist $K_{2},N_{2}>0$ such that for all $n>N_{2}$ and $i,j=1,\ldots,|A_{2}|$,

|W(i,j)|<K_{2}\frac{\log\log n}{n}. \quad (3.62)

Further, since $|C^{A_{2}}|-|C^{A_{2}}-W|$ is a polynomial function of the elements of $W$, where each term has a degree between $1$ and $|A_{2}|$ and a coefficient that is a polynomial function of the elements of $C^{A_{2}}$ of at most degree $|A_{2}|-1$, and since $C^{A_{2}}(i,j)$ is upper and lower bounded for $n$ large enough w.p. 1 for all $i,j=1,\ldots,|A_{2}|$ (Corollary 1), w.p. 1 there exist $K_{3},N_{3}>0$ such that for all $n>N_{3}$,

\left||C^{A_{2}}|-|C^{A_{2}}-W|\right|<K_{3}\frac{\log\log n}{n}. \quad (3.63)

Therefore,

R_{n} =\left(\frac{|C^{A_{1}}||C^{A_{2}}|}{|C^{A_{1}}||C^{A_{2}}-W|}\right)^{0.5n}
<\left(1-\frac{K_{3}}{|C^{A_{2}}|}\frac{\log\log n}{n}\right)^{-0.5n} \quad (3.64)

for $n>N_{3}$ w.p. 1, where the first line follows from (3.57) and (3.59) and the second line from (3.63). Further, by Corollary 1, w.p. 1 there exist $K_{4}>0$ and $N_{4}>N_{3}$ such that $|C^{A_{2}}|>K_{4}$ for all $n>N_{4}$, thus

R_{n} <\left(1-\frac{K_{3}}{K_{4}}\frac{\log\log n}{n}\right)^{-0.5n}
<\left(1+\frac{2K_{3}}{K_{4}}\frac{\log\log n}{n}\right)^{0.5n}
<(\log n)^{K_{3}/K_{4}} \quad (3.65)

for all $n>N_{4}$. The second line holds for $N_{4}$ large enough that $x=(K_{3}/K_{4})\log\log n/n$ is between $0$ and $0.5$ for all $n>N_{4}$, so that $(1-x)^{-1}<1+2x$. The last line follows from the fact that $(1+t/x)^{x}<e^{t}$ for all $x,t>0$. From (3.56), the theorem holds with $c=K_{5}+K_{3}/K_{4}$ and $N=N_{4}$, where $K_{5}$ is such that $(\log n)^{K_{5}}$ exceeds a bound on the first term in (3.56). ∎
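
The only linear algebra used above is the block determinant identity (3.59); a quick numerical sanity check on an arbitrary positive-definite matrix (an informal illustration, not part of the proof) is:

```python
import numpy as np

# Numerical sanity check of the block determinant identity (3.59):
# |C^A| = |C^{A1}| * |C^{A2} - V^T (C^{A1})^{-1} V|.
# The matrix below is an arbitrary positive-definite example.
rng = np.random.default_rng(4)
d1, d2 = 2, 3
M = rng.standard_normal((d1 + d2, d1 + d2))
CA = M @ M.T + (d1 + d2) * np.eye(d1 + d2)
C1, V, C2 = CA[:d1, :d1], CA[:d1, d1:], CA[d1:, d1:]
W = V.T @ np.linalg.inv(C1) @ V
print(np.linalg.det(CA), np.linalg.det(C1) * np.linalg.det(C2 - W))  # equal
```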

Lemma 4.

Let $G$ be any feature set such that either $\mu_{0}^{G}\neq\mu_{1}^{G}$ or $\Sigma_{0}^{G}\neq\Sigma_{1}^{G}$, and $\Sigma_{0}^{G}$ and $\Sigma_{1}^{G}$ are full rank. Let $S_{\infty}$ be a balanced sample. Then, w.p. 1 there exist $0\leq r<1$ and $N>0$ such that

\left(\frac{|C^{G}_{0}|^{\kappa^{G*}_{0}}|C^{G}_{1}|^{\kappa^{G*}_{1}}}{|C^{G}|^{\kappa^{G*}}}\right)^{0.5}<r^{n} \quad (3.66)

for all $n>N$. The left hand side of (3.66) is $a(P)/a(P^{\prime})$ when $P$ and $P^{\prime}$ are identical feature partitions except $G$ is a bad block in $P$ and a good block in $P^{\prime}$.

Proof.

We have that

\left(\frac{|C^{G}_{0}|^{\kappa^{G*}_{0}}|C^{G}_{1}|^{\kappa^{G*}_{1}}}{|C^{G}|^{\kappa^{G*}}}\right)^{0.5} =\left(\frac{|C^{G}_{0}|^{\kappa^{G}_{0}}|C^{G}_{1}|^{\kappa^{G}_{1}}}{|C^{G}|^{\kappa^{G}}}\right)^{0.5}\times R_{n}^{n}, \quad (3.67)

where

R_{n}=\left(\frac{|C^{G}_{0}|^{\rho_{n}}|C^{G}_{1}|^{1-\rho_{n}}}{|C^{G}|}\right)^{0.5}. \quad (3.68)

By Corollary 1, the first term in (3.67) is upper bounded by a positive finite constant for $n$ large enough, and

R_{n}\sim\left(\frac{|\Sigma^{G}_{0}|^{\rho_{n}}|\Sigma^{G}_{1}|^{1-\rho_{n}}}{|\rho_{n}\Sigma_{0}^{G}+(1-\rho_{n})\Sigma_{1}^{G}+\rho_{n}(1-\rho_{n})(\mu_{0}^{G}-\mu_{1}^{G})(\mu_{0}^{G}-\mu_{1}^{G})^{T}|}\right)^{0.5}\equiv T_{n} \quad (3.69)

as $n\to\infty$, both w.p. 1. Suppose $\Sigma^{G}\equiv\Sigma_{0}^{G}=\Sigma_{1}^{G}$ and $\mu_{0}^{G}\neq\mu_{1}^{G}$. Then,

T_{n} =\left(\frac{|\Sigma^{G}|}{|\Sigma^{G}+\rho_{n}(1-\rho_{n})(\mu_{0}^{G}-\mu_{1}^{G})(\mu_{0}^{G}-\mu_{1}^{G})^{T}|}\right)^{0.5}
=\left(1+\rho_{n}(1-\rho_{n})(\mu_{0}^{G}-\mu_{1}^{G})^{T}(\Sigma^{G})^{-1}(\mu_{0}^{G}-\mu_{1}^{G})\right)^{-0.5}. \quad (3.70)

The last line follows from the matrix determinant lemma (Harville, 1998). Since $S_{\infty}$ is balanced and $\Sigma^{G}$ is positive-definite, there exists $0<T<1$ such that $0\leq T_{n}<T$ for all $n$ large enough. The theorem holds for any $T<r<1$. If $\Sigma_{0}^{G}\neq\Sigma_{1}^{G}$, then

T_{n}\leq\left(\frac{|\Sigma_{0}^{G}|^{\rho_{n}}|\Sigma_{1}^{G}|^{1-\rho_{n}}}{|\rho_{n}\Sigma_{0}^{G}+(1-\rho_{n})\Sigma_{1}^{G}|}\right)^{0.5}. \quad (3.71)

Fan (1950) showed that for any symmetric positive-definite $X$ and $Y$ and $0\leq p\leq 1$,

|pX+(1-p)Y|\geq|X|^{p}|Y|^{1-p}. \quad (3.72)

Although not mentioned by Fan, if $0<p<1$ then equality holds if and only if $X=Y$ (by the weighted arithmetic-geometric-mean inequality). Suppose $X\neq Y$ and fix $0<\delta<0.5$. Then there exists $\epsilon>0$ such that $|X|^{p}|Y|^{1-p}/|pX+(1-p)Y|<1-\epsilon$ for all $p\in(\delta,1-\delta)$. Applying this fact here, since $S_{\infty}$ is balanced there exists $0<T<1$ such that $0\leq T_{n}<T$. The theorem holds for any $T<r<1$. ∎
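
Both determinant facts invoked in this proof are easy to probe numerically; the sketch below checks the matrix determinant lemma behind (3.70) and Fan's inequality (3.72) on arbitrary symmetric positive-definite examples (an informal check, not part of the proof).

```python
import numpy as np

# Informal checks of the matrix determinant lemma (used in (3.70)) and of
# Fan's inequality (3.72), on arbitrary symmetric positive-definite examples.
p = 0.4
Sigma = np.diag([1.0, 2.0, 0.5]) + 0.1                 # positive definite
delta = np.array([1.0, -0.5, 0.25])                    # plays the role of mu_0 - mu_1
lhs = np.linalg.det(Sigma + p * (1 - p) * np.outer(delta, delta))
rhs = np.linalg.det(Sigma) * (1 + p * (1 - p) * delta @ np.linalg.inv(Sigma) @ delta)
print(lhs, rhs)                                        # equal up to rounding

X = np.array([[2.0, 0.3, 0.0], [0.3, 1.0, 0.1], [0.0, 0.1, 0.5]])
Y = np.array([[1.0, -0.2, 0.0], [-0.2, 3.0, 0.4], [0.0, 0.4, 1.5]])
mix = np.linalg.det(p * X + (1 - p) * Y)
geo = np.linalg.det(X) ** p * np.linalg.det(Y) ** (1 - p)
print(mix >= geo, geo / mix)                           # True, ratio < 1 since X != Y
```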

Lemma 5.

Let $G_{1}$ and $G_{2}$ be any disjoint feature sets such that $\mu_{0}^{G_{2}}=\mu_{1}^{G_{2}}$, $\Sigma^{G_{2}}\equiv\Sigma_{0}^{G_{2}}=\Sigma_{1}^{G_{2}}$, there exists at least one feature in $G_{1}$ and one feature in $G_{2}$ that are correlated in at least one class, and $\Sigma_{0}^{G}$ and $\Sigma_{1}^{G}$ are full rank, where $G=G_{1}\cup G_{2}$. Let $S_{\infty}$ be a balanced sample. Then, w.p. 1 there exist $0\leq r<1$ and $N>0$ such that

\left(\frac{|C^{G}_{0}|^{\kappa^{G*}_{0}}|C^{G}_{1}|^{\kappa^{G*}_{1}}}{|C^{G_{1}}_{0}|^{\kappa^{G_{1}*}_{0}}|C^{G_{1}}_{1}|^{\kappa^{G_{1}*}_{1}}|C^{G_{2}}|^{\kappa^{G_{2}*}}}\right)^{0.5}<r^{n} \quad (3.73)

for all $n>N$. The left hand side of (3.73) is $a(P)/a(P^{\prime})$ when $P$ and $P^{\prime}$ are identical feature partitions except $G_{1}$ is a good block in $P$, $G_{2}$ is a bad block in $P$, and $G$ is a good block in $P^{\prime}$.

Proof.

We have that

\left(\frac{|C^{G}_{0}|^{\kappa^{G*}_{0}}|C^{G}_{1}|^{\kappa^{G*}_{1}}}{|C^{G_{1}}_{0}|^{\kappa^{G_{1}*}_{0}}|C^{G_{1}}_{1}|^{\kappa^{G_{1}*}_{1}}|C^{G_{2}}|^{\kappa^{G_{2}*}}}\right)^{0.5} =\left(\frac{|C^{G}_{0}|^{\kappa^{G}_{0}}|C^{G}_{1}|^{\kappa^{G}_{1}}}{|C^{G_{1}}_{0}|^{\kappa^{G_{1}}_{0}}|C^{G_{1}}_{1}|^{\kappa^{G_{1}}_{1}}|C^{G_{2}}|^{\kappa^{G_{2}}}}\right)^{0.5}\times R_{0,n}^{n_{0}}R_{1,n}^{n_{1}}, \quad (3.74)

where

R_{y,n}=\left(\frac{|C^{G}_{y}|}{|C^{G_{1}}_{y}||C^{G_{2}}|}\right)^{0.5}. \quad (3.75)

By Corollary 1, the first term in (3.74) converges to a positive finite constant w.p. 1, and

R_{y,n}\to\left(\frac{|\Sigma^{G}_{y}|}{|\Sigma^{G_{1}}_{y}||\Sigma^{G_{2}}_{y}|}\right)^{0.5}\equiv R_{y} \quad (3.76)

as $n\to\infty$, both w.p. 1. Without loss of generality, assume features in $G$ are ordered such that

\Sigma^{G}_{y}=\begin{pmatrix}\Sigma^{G_{1}}_{y}&V_{y}\\ V_{y}^{T}&\Sigma^{G_{2}}_{y}\end{pmatrix}, \quad (3.77)

where $V_{y}$ is a matrix of correlations between features in $G_{1}$ and $G_{2}$. Since

|\Sigma_{y}^{G}|=|\Sigma_{y}^{G_{1}}||\Sigma_{y}^{G_{2}}-V_{y}^{T}(\Sigma_{y}^{G_{1}})^{-1}V_{y}|, \quad (3.78)

we have that

R_{y}=\left(\frac{|\Sigma_{y}^{G_{2}}-V_{y}^{T}(\Sigma_{y}^{G_{1}})^{-1}V_{y}|}{|\Sigma_{y}^{G_{2}}|}\right)^{0.5}. \quad (3.79)

We always have $|\Sigma_{y}^{G_{2}}-V_{y}^{T}(\Sigma_{y}^{G_{1}})^{-1}V_{y}|\leq|\Sigma^{G_{2}}_{y}|$, and thus $0\leq R_{y}\leq 1$. Since there exists at least one feature in $G_{1}$ and one feature in $G_{2}$ that are correlated in at least one class, in this class $V_{y}$ is non-zero, $|\Sigma_{y}^{G_{2}}-V_{y}^{T}(\Sigma_{y}^{G_{1}})^{-1}V_{y}|<|\Sigma^{G_{2}}_{y}|$, and $R_{y}<1$. Since $S_{\infty}$ is balanced, there exist $0<p_{0},p_{1}<1$ such that $n_{0}/n<p_{0}$ and $n_{1}/n<p_{1}$ for $n$ large enough. Thus,

R_{0,n}^{n_{0}}R_{1,n}^{n_{1}}\leq R_{n}^{n} \quad (3.80)

for $n$ large enough, where $R_{n}=R_{0,n}^{p_{0}}R_{1,n}^{p_{1}}$. Note $R_{n}\to R_{0}^{p_{0}}R_{1}^{p_{1}}\equiv R$, where $0\leq R<1$. The theorem holds for any $R<r<1$. ∎

Lemma 6.

Let $A_{1}$ and $A_{2}$ be any disjoint feature sets such that features in $A_{1}$ and $A_{2}$ are uncorrelated in both classes, and $\Sigma_{0}^{A}$ and $\Sigma_{1}^{A}$ are full rank, where $A=A_{1}\cup A_{2}$. Suppose for all $f\in F$ the fourth order moment across both classes exists and is finite. Let $S_{\infty}$ be a balanced sample. Then, w.p. 1 there exist $c,N>0$ such that

\left(\frac{|C^{A_{1}}_{0}|^{\kappa^{A_{1}*}_{0}}|C^{A_{2}}_{0}|^{\kappa^{A_{2}*}_{0}}|C^{A_{1}}_{1}|^{\kappa^{A_{1}*}_{1}}|C^{A_{2}}_{1}|^{\kappa^{A_{2}*}_{1}}}{|C^{A}_{0}|^{\kappa^{A*}_{0}}|C^{A}_{1}|^{\kappa^{A*}_{1}}}\right)^{0.5}<(\log n)^{c} \quad (3.81)

for all $n>N$. The left hand side of (3.81) is $a(P)/a(P^{\prime})$ when $P$ and $P^{\prime}$ are identical feature partitions except $A$ is a good block in $P$, and $A_{1}$ and $A_{2}$ are good blocks in $P^{\prime}$.

Proof.

We have that

\left(\frac{|C^{A_{1}}_{0}|^{\kappa^{A_{1}*}_{0}}|C^{A_{2}}_{0}|^{\kappa^{A_{2}*}_{0}}|C^{A_{1}}_{1}|^{\kappa^{A_{1}*}_{1}}|C^{A_{2}}_{1}|^{\kappa^{A_{2}*}_{1}}}{|C^{A}_{0}|^{\kappa^{A*}_{0}}|C^{A}_{1}|^{\kappa^{A*}_{1}}}\right)^{0.5}
=\left(\frac{|C^{A_{1}}_{0}|^{\kappa^{A_{1}}_{0}}|C^{A_{2}}_{0}|^{\kappa^{A_{2}}_{0}}}{|C^{A}_{0}|^{\kappa^{A}_{0}}}\times\frac{|C^{A_{1}}_{1}|^{\kappa^{A_{1}}_{1}}|C^{A_{2}}_{1}|^{\kappa^{A_{2}}_{1}}}{|C^{A}_{1}|^{\kappa^{A}_{1}}}\right)^{0.5}\times R_{0,n}R_{1,n}, \quad (3.82)

where

R_{y,n}=\left(\frac{|C^{A_{1}}_{y}||C^{A_{2}}_{y}|}{|C^{A}_{y}|}\right)^{0.5n_{y}}. \quad (3.83)

By Corollary 1, the first term in (3.82) converges to a positive finite constant w.p. 1. Fix $y\in\{0,1\}$. Without loss of generality, assume features in $A$ are ordered such that

C^{A}_{y}=\begin{pmatrix}C^{A_{1}}_{y}&V_{y}\\ V_{y}^{T}&C^{A_{2}}_{y}\end{pmatrix} \quad (3.84)

for some matrix $V_{y}$. Observe that

|C^{A}_{y}|=|C^{A_{1}}_{y}||C^{A_{2}}_{y}-W_{y}|, \quad (3.85)

where $W_{y}\equiv V^{T}_{y}(C^{A_{1}}_{y})^{-1}V_{y}$. Since features in $A_{1}$ and $A_{2}$ are uncorrelated in both classes,

\Sigma^{A}_{y}=\begin{pmatrix}\Sigma^{A_{1}}_{y}&0_{|A_{1}|,|A_{2}|}\\ 0_{|A_{2}|,|A_{1}|}&\Sigma^{A_{2}}_{y}\end{pmatrix}. \quad (3.86)

By Lemma 2, w.p. 1 there exist $K_{1},N_{1}>0$ such that for all $n>N_{1}$, $y=0,1$, $i=1,\ldots,|A_{1}|$ and $j=1,\ldots,|A_{2}|$ we have

|V_{y}(i,j)|<K_{1}\sqrt{\frac{\log\log n}{n}}. \quad (3.87)

Since $W_{y}$ is comprised of only quadratic terms in $V_{y}$ and $C^{A_{1}}_{y}(i,j)\to\Sigma^{A_{1}}_{y}(i,j)$ w.p. 1 (Corollary 1), w.p. 1 there exist $K_{2},N_{2}>0$ such that for all $n>N_{2}$, $y=0,1$, and $i,j=1,\ldots,|A_{2}|$,

|W_{y}(i,j)|<K_{2}\frac{\log\log n}{n}. \quad (3.88)

Further, since $|C^{A_{2}}_{y}|-|C^{A_{2}}_{y}-W_{y}|$ is a polynomial function of the elements of $W_{y}$, where each term has a degree between $1$ and $|A_{2}|$ and a coefficient that is a polynomial function of the elements of $C^{A_{2}}_{y}$ of at most degree $|A_{2}|-1$, and since $C^{A_{2}}_{y}(i,j)\to\Sigma^{A_{2}}_{y}(i,j)$ w.p. 1 (Corollary 1), w.p. 1 there exist $K_{3},N_{3}>0$ such that for all $n>N_{3}$ and $y=0,1$,

\left||C^{A_{2}}_{y}|-|C^{A_{2}}_{y}-W_{y}|\right|<K_{3}\frac{\log\log n}{n}. \quad (3.89)

Therefore,

R_{y,n} =\left(\frac{|C^{A_{1}}_{y}||C^{A_{2}}_{y}|}{|C^{A_{1}}_{y}||C^{A_{2}}_{y}-W_{y}|}\right)^{0.5n_{y}}
<\left(1-\frac{K_{3}}{|C^{A_{2}}_{y}|}\frac{\log\log n}{n}\right)^{-0.5n_{y}} \quad (3.90)

for $n>N_{3}$ w.p. 1, where the first line follows from (3.83) and (3.85) and the second line from (3.89). Further, w.p. 1 there exists $N_{4}>N_{3}$ such that for all $n>N_{4}$ and $y=0,1$,

R_{y,n} <\left(1-\frac{2K_{3}}{|\Sigma^{A_{2}}_{y}|}\frac{\log\log n}{n}\right)^{-0.5n_{y}}
<\left(1+\frac{4K_{3}}{|\Sigma^{A_{2}}_{y}|}\frac{\log\log n}{n}\right)^{0.5n_{y}}
<\left(1+\frac{4p_{y}K_{3}}{|\Sigma^{A_{2}}_{y}|}\frac{\log\log n}{n_{y}}\right)^{0.5n_{y}}
<(\log n)^{2p_{y}K_{3}/|\Sigma^{A_{2}}_{y}|}. \quad (3.91)

The first line holds as long as $N_{4}$ is large enough that $|C^{A_{2}}_{y}|>|\Sigma^{A_{2}}_{y}|/2$ for all $n>N_{4}$ and $y=0,1$ (this is possible since both $C^{A_{2}}_{y}$ converge). The second line holds as long as $N_{4}$ is large enough that $x=(2K_{3}/|\Sigma^{A_{2}}_{y}|)\log\log n/n$ is between $0$ and $0.5$ for all $n>N_{4}$ and $y=0,1$, so that $(1-x)^{-1}<1+2x$. The third line holds as long as $N_{4}$ is large enough that $n_{y}/n<p_{y}$ for all $n>N_{4}$ and $y=0,1$ and some $0<p_{0},p_{1}<1$ (this is possible since $S_{\infty}$ is balanced). Finally, the last line follows from the fact that $(1+t/x)^{x}<e^{t}$ for all $x,t>0$. From (3.82), the theorem holds with $c=K_{4}+2p_{0}K_{3}/|\Sigma^{A_{2}}_{0}|+2p_{1}K_{3}/|\Sigma^{A_{2}}_{1}|$ and $N=N_{4}$, where $K_{4}$ is such that $(\log n)^{K_{4}}$ exceeds a bound on the first term in (3.82). ∎

Lemma 7.

Let $B$ be any feature set such that $\mu^{B}\equiv\mu_{0}^{B}=\mu_{1}^{B}$, $\Sigma^{B}\equiv\Sigma_{0}^{B}=\Sigma_{1}^{B}$, and $\Sigma^{B}$ is full rank. Suppose for all $f\in F$ the fourth order moment across both classes exists and is finite. Let $S_{\infty}$ be a balanced sample. Then, w.p. 1 there exist $c,N>0$ such that

\left(\frac{|C^{B}|^{\kappa^{B*}}}{|C^{B}_{0}|^{\kappa^{B*}_{0}}|C^{B}_{1}|^{\kappa^{B*}_{1}}}\right)^{0.5}<(\log n)^{c} \quad (3.92)

for all $n>N$. The left hand side of (3.92) is $a(P)/a(P^{\prime})$ when $P$ and $P^{\prime}$ are identical feature partitions except $B$ is a good block in $P$ and a bad block in $P^{\prime}$.

Proof.

We have that

\left(\frac{|C^{B}|^{\kappa^{B*}}}{|C^{B}_{0}|^{\kappa^{B*}_{0}}|C^{B}_{1}|^{\kappa^{B*}_{1}}}\right)^{0.5} =\left(\frac{|C^{B}|^{\kappa^{B}}}{|C^{B}_{0}|^{\kappa^{B}_{0}}|C^{B}_{1}|^{\kappa^{B}_{1}}}\right)^{0.5}\times R_{n}, \quad (3.93)

where

R_{n}=\left(\frac{|C^{B}|}{|C^{B}_{0}|^{\rho_{n}}|C^{B}_{1}|^{1-\rho_{n}}}\right)^{0.5n}. \quad (3.94)

By Corollary 1, the first term in (3.93) converges to a positive finite constant w.p. 1. By Lemma 2, w.p. 1 there exist $K_{1},N_{1}>0$ such that for all $n>N_{1}$ and $y=0,1$,

\left|\frac{|C^{B}_{y}|}{|\hat{\Sigma}^{B}_{y}|}-1\right| <\frac{K_{1}}{n}, \quad (3.95)
\left|\frac{|C^{B}|}{|\hat{\Sigma}^{B}|}-1\right| <\frac{K_{1}}{n}. \quad (3.96)

Let $N_{2}>N_{1}$ be such that $x=K_{1}/n$ is between $0$ and $0.5$ for all $n>N_{2}$, so that $(1-x)^{-1}<1+2x$. Then for all $n>N_{2}$,

R_{n} <\left(\frac{|\hat{\Sigma}^{B}|\left(1+\frac{K_{1}}{n}\right)}{|\hat{\Sigma}_{0}^{B}|^{\rho_{n}}\left(1-\frac{K_{1}}{n}\right)^{\rho_{n}}|\hat{\Sigma}_{1}^{B}|^{1-\rho_{n}}\left(1-\frac{K_{1}}{n}\right)^{1-\rho_{n}}}\right)^{0.5n}
<\left(1+\frac{K_{1}}{n}\right)^{0.5n}\left(1+\frac{2K_{1}}{n}\right)^{0.5n\rho_{n}}\left(1+\frac{2K_{1}}{n}\right)^{0.5n(1-\rho_{n})}\left(\frac{|\hat{\Sigma}^{B}|}{|\hat{\Sigma}_{0}^{B}|^{\rho_{n}}|\hat{\Sigma}_{1}^{B}|^{1-\rho_{n}}}\right)^{0.5n}
<e^{1.5K_{1}}\left(\frac{|\hat{\Sigma}^{B}|}{|\hat{\Sigma}^{B}_{0}|^{\rho_{n}}|\hat{\Sigma}^{B}_{1}|^{1-\rho_{n}}}\right)^{0.5n} \quad (3.97)

w.p. 1, where the last line follows from the fact that $(1+t/x)^{x}<e^{t}$ for all $x,t>0$. From (3.42),

\hat{\Sigma}^{B} =\Sigma^{B}+\rho_{n}E_{0}+(1-\rho_{n})E_{1}+E, \quad (3.98)

and for $y=0,1$,

E_{y} =\hat{\Sigma}^{B}_{y}-\Sigma^{B}, \quad (3.99)
E =-\frac{1-\rho_{n}}{n-1}\hat{\Sigma}^{B}_{0}-\frac{\rho_{n}}{n-1}\hat{\Sigma}^{B}_{1}+\frac{\rho_{n}(1-\rho_{n})n}{n-1}(e_{0}-e_{1})(e_{0}-e_{1})^{T}, \quad (3.100)

where $e_{y}=\hat{\mu}_{y}^{B}-\mu^{B}$. By Lemma 2, w.p. 1 there exist $K_{2}>0$ and $N_{3}>N_{2}$ such that

|E_{y}(i,j)|<K_{2}\sqrt{\frac{\log\log n}{n}} \quad (3.101)

for all $n>N_{3}$, $y=0,1$ and $i,j=1,\ldots,|B|$. Since $S_{\infty}$ is balanced, there exist $0<p_{0},p_{1}<1$ and $N_{4}>N_{3}$ such that $n_{y}/n<p_{y}$ for all $n>N_{4}$ and $y=0,1$. Since $\hat{\Sigma}_{y}^{B}\to\Sigma^{B}$ as $n\to\infty$ w.p. 1, there exists $B>0$ such that $|\hat{\Sigma}_{y}^{B}(i,j)|<B$ for all $n$, $y$, $i$ and $j$ w.p. 1. By Lemma 2, w.p. 1 there exist $K_{3}>0$ and $N_{5}>N_{4}$ such that $|e_{y}(i)|<K_{3}\sqrt{\log\log n/n}$ for all $n>N_{5}$ and $y=0,1$. By the triangle inequality,

|E(i,j)| \leq\frac{1-\rho_{n}}{n-1}|\hat{\Sigma}^{B}_{0}(i,j)|+\frac{\rho_{n}}{n-1}|\hat{\Sigma}^{B}_{1}(i,j)|
\quad+\frac{\rho_{n}(1-\rho_{n})n}{n-1}\left(|e_{0}(i)|+|e_{1}(i)|\right)\left(|e_{0}(j)|+|e_{1}(j)|\right)
\leq\frac{p_{1}B}{n-1}+\frac{p_{0}B}{n-1}+4p_{0}p_{1}K_{3}^{2}\frac{\log\log n}{n-1} \quad (3.102)

for all $n>N_{5}$ and $i,j=1,\ldots,|B|$ w.p. 1. In particular, there exists $K_{4}>0$ such that

|E(i,j)| <K_{4}\frac{\log\log n}{n}. \quad (3.103)

Observe that $|\hat{\Sigma}^{B}|$ is a polynomial function of the elements of $\Sigma^{B}$, $\rho_{n}E_{0}$, $(1-\rho_{n})E_{1}$ and $E$, where each term has a degree of $|B|$ and a coefficient of $\pm 1$. In particular,

|\hat{\Sigma}^{B}| =\sum_{i_{1}=1}^{|B|}\cdots\sum_{i_{|B|}=1}^{|B|}\varepsilon_{i_{1},\ldots,i_{|B|}}\hat{\Sigma}^{B}(1,i_{1})\cdots\hat{\Sigma}^{B}(|B|,i_{|B|})
=\sum_{i_{1}=1}^{|B|}\cdots\sum_{i_{|B|}=1}^{|B|}\varepsilon_{i_{1},\ldots,i_{|B|}}\sum_{k_{1}=1}^{4}\cdots\sum_{k_{|B|}=1}^{4}X_{k_{1}}(1,i_{1})\cdots X_{k_{|B|}}(|B|,i_{|B|})
=\sum_{k_{1}=1}^{4}\cdots\sum_{k_{|B|}=1}^{4}m(k_{1},\ldots,k_{|B|}), \quad (3.104)

where $\varepsilon_{i_{1},\ldots,i_{|B|}}$ is the Levi-Civita symbol, equal to $+1$ if $(i_{1},\ldots,i_{|B|})$ is an even permutation of $(1,\ldots,|B|)$, $-1$ if it is an odd permutation, and $0$ otherwise, $X_{1}=\Sigma^{B}$, $X_{2}=\rho_{n}E_{0}$, $X_{3}=(1-\rho_{n})E_{1}$, $X_{4}=E$, and

m(k_{1},\ldots,k_{|B|})=\sum_{i_{1}=1}^{|B|}\cdots\sum_{i_{|B|}=1}^{|B|}\varepsilon_{i_{1},\ldots,i_{|B|}}X_{k_{1}}(1,i_{1})\cdots X_{k_{|B|}}(|B|,i_{|B|}). \quad (3.105)

From (3.101) and (3.103), w.p. 1 there exists $K_{5}\in\mathbb{R}$ such that

|\hat{\Sigma}^{B}| =m(1,\ldots,1)+\sum_{(k_{1},\ldots,k_{|B|})\in\mathcal{M}_{2}}m(k_{1},\ldots,k_{|B|})+\sum_{(k_{1},\ldots,k_{|B|})\in\mathcal{M}_{3}}m(k_{1},\ldots,k_{|B|})
\quad+\sum_{(k_{1},\ldots,k_{|B|})\in\mathcal{M}}m(k_{1},\ldots,k_{|B|})
<m(1,\ldots,1)+\sum_{(k_{1},\ldots,k_{|B|})\in\mathcal{M}_{2}}m(k_{1},\ldots,k_{|B|})+\sum_{(k_{1},\ldots,k_{|B|})\in\mathcal{M}_{3}}m(k_{1},\ldots,k_{|B|})
\quad+K_{5}\frac{\log\log n}{n} \quad (3.106)

for all $n>N_{5}$, where

\mathcal{M}_{2} =\{(k_{1},\ldots,k_{|B|}):\text{exactly one $k$ equals $2$, the rest equal $1$}\},
\mathcal{M}_{3} =\{(k_{1},\ldots,k_{|B|}):\text{exactly one $k$ equals $3$, the rest equal $1$}\},
\mathcal{M} =\{(k_{1},\ldots,k_{|B|}):\text{at least two $k$'s are in $\{2,3\}$, or at least one $k$ equals $4$}\}.

Further, w.p. 1 there exists $K_{6}\in\mathbb{R}$ such that

\rho_{n}|\hat{\Sigma}_{0}^{B}| =\rho_{n}|\Sigma^{B}+E_{0}|
=\rho_{n}\sum_{k_{1}=1}^{2}\cdots\sum_{k_{|B|}=1}^{2}m^{\prime}(k_{1},\ldots,k_{|B|})
=\rho_{n}m(1,\ldots,1)+\sum_{(k_{1},\ldots,k_{|B|})\in\mathcal{M}_{2}}m(k_{1},\ldots,k_{|B|})+\rho_{n}\sum_{(k_{1},\ldots,k_{|B|})\in\mathcal{M}^{\prime}}m^{\prime}(k_{1},\ldots,k_{|B|})
>\rho_{n}m(1,\ldots,1)+\sum_{(k_{1},\ldots,k_{|B|})\in\mathcal{M}_{2}}m(k_{1},\ldots,k_{|B|})-K_{6}\frac{\log\log n}{n} \quad (3.107)

for all $n>N_{5}$, where

m^{\prime}(k_{1},\ldots,k_{|B|}) =\sum_{i_{1}=1}^{|B|}\cdots\sum_{i_{|B|}=1}^{|B|}\varepsilon_{i_{1},\ldots,i_{|B|}}X^{\prime}_{k_{1}}(1,i_{1})\cdots X^{\prime}_{k_{|B|}}(|B|,i_{|B|}),
\mathcal{M}^{\prime} =\{(k_{1},\ldots,k_{|B|}):\text{at least two $k$'s equal $2$, the rest equal $1$}\}, \quad (3.108)

$X^{\prime}_{1}=\Sigma^{B}$, $X^{\prime}_{2}=E_{0}$, and we have used the facts that $m^{\prime}(1,\ldots,1)=m(1,\ldots,1)$ and $\rho_{n}m^{\prime}(k_{1},\ldots,k_{|B|})=m(k_{1},\ldots,k_{|B|})$ when $(k_{1},\ldots,k_{|B|})\in\mathcal{M}_{2}$. Similarly, w.p. 1 there exists $K_{7}\in\mathbb{R}$ such that

(1-\rho_{n})|\hat{\Sigma}_{1}^{B}| >(1-\rho_{n})m(1,\ldots,1)+\sum_{(k_{1},\ldots,k_{|B|})\in\mathcal{M}_{3}}m(k_{1},\ldots,k_{|B|})-K_{7}\frac{\log\log n}{n} \quad (3.109)

for all $n>N_{5}$. Combining (3.106), (3.107) and (3.109), w.p. 1 there exists $K_{8}\in\mathbb{R}$ such that

|\hat{\Sigma}^{B}|<\rho_{n}|\hat{\Sigma}_{0}^{B}|+(1-\rho_{n})|\hat{\Sigma}_{1}^{B}|+K_{8}\frac{\log\log n}{n} \quad (3.110)

for all $n>N_{5}$. Thus, w.p. 1 there exists $K_{9}\in\mathbb{R}$ such that

\frac{|\hat{\Sigma}^{B}|}{|\hat{\Sigma}^{B}_{0}|^{\rho_{n}}|\hat{\Sigma}^{B}_{1}|^{1-\rho_{n}}} <\frac{\rho_{n}|\hat{\Sigma}_{0}^{B}|+(1-\rho_{n})|\hat{\Sigma}_{1}^{B}|}{|\hat{\Sigma}^{B}_{0}|^{\rho_{n}}|\hat{\Sigma}^{B}_{1}|^{1-\rho_{n}}}+K_{9}\frac{\log\log n}{n}
=\rho_{n}\left(\frac{|\hat{\Sigma}^{B}_{0}|}{|\hat{\Sigma}^{B}_{1}|}\right)^{1-\rho_{n}}+(1-\rho_{n})\left(\frac{|\hat{\Sigma}^{B}_{1}|}{|\hat{\Sigma}^{B}_{0}|}\right)^{\rho_{n}}+K_{9}\frac{\log\log n}{n} \quad (3.111)

for all $n>N_{5}$, where $K_{9}$ is chosen based on the fact that $|\hat{\Sigma}^{B}_{y}|\to|\Sigma^{B}|$ w.p. 1 (an application of Corollary 1 with hyperparameters $\nu_{y}^{B}=0$, $\kappa_{y}^{B}=|B|$ and $S_{y}^{B}=0$ in place of the hyperparameters used by the selection rule), and thus $|\hat{\Sigma}^{B}_{y}|$ must be bounded for all $n$. In addition, w.p. 1 there exist $K_{10}>0$ and $N_{6}>N_{5}$ such that

\left|\frac{|\hat{\Sigma}^{B}_{0}|}{|\hat{\Sigma}^{B}_{1}|}-1\right| =\frac{\left||\hat{\Sigma}^{B}_{0}|-|\hat{\Sigma}^{B}_{1}|\right|}{|\hat{\Sigma}^{B}_{1}|}
\leq\frac{\left||\hat{\Sigma}^{B}_{0}|-|\Sigma^{B}|\right|+\left||\hat{\Sigma}^{B}_{1}|-|\Sigma^{B}|\right|}{0.5|\Sigma^{B}|}
=2\left|\frac{|\hat{\Sigma}^{B}_{0}|}{|\Sigma^{B}|}-1\right|+2\left|\frac{|\hat{\Sigma}^{B}_{1}|}{|\Sigma^{B}|}-1\right|
<4K_{10}\sqrt{\frac{\log\log n}{n}} \quad (3.112)

for $n>N_{6}$, where in the second line we have applied the triangle inequality and the fact that $|\hat{\Sigma}_{1}^{B}|\to|\Sigma^{B}|$ w.p. 1, so $N_{6}$ is chosen such that $|\hat{\Sigma}^{B}_{1}|>0.5|\Sigma^{B}|$ for all $n>N_{6}$, and the last line follows from Lemma 2. By Lemma S3 in Foroughi pour and Dalton (2019), there exists $r\in(0,1)$ such that for all $\rho\in(0,1)$ and $x\in(1-r,1+r)$,

\rho x^{1-\rho}+(1-\rho)x^{-\rho}\leq 1+\rho(1-\rho)(x-1)^{2}\leq 1+0.25(x-1)^{2}, \quad (3.113)

where in the last inequality we use the fact that $\rho(1-\rho)\leq 0.25$ for all $\rho\in[0,1]$. Thus, w.p. 1 there exists $N_{7}>N_{6}$ such that

\frac{|\hat{\Sigma}^{B}|}{|\hat{\Sigma}^{B}_{0}|^{\rho_{n}}|\hat{\Sigma}^{B}_{1}|^{1-\rho_{n}}} <1+0.25\left(\frac{|\hat{\Sigma}^{B}_{0}|}{|\hat{\Sigma}^{B}_{1}|}-1\right)^{2}+K_{9}\frac{\log\log n}{n}
<1+K_{11}\frac{\log\log n}{n} \quad (3.114)

for all $n>N_{7}$, where $K_{11}=4K_{10}^{2}+K_{9}$. Thus, from (3.97), w.p. 1

R_{n} <e^{1.5K_{1}}\left(1+K_{11}\frac{\log\log n}{n}\right)^{0.5n}
<e^{1.5K_{1}}(\log n)^{0.5K_{11}} \quad (3.115)

for all $n>N_{7}$, where the last line follows from the fact that $(1+t/x)^{x}<e^{t}$ for all $x,t>0$. From (3.93), the theorem holds with $c=K_{12}+0.5K_{11}$ and $N=N_{7}$, where $K_{12}$ is such that $(\log n)^{K_{12}}$ exceeds $e^{1.5K_{1}}$ times a bound on the first term in (3.93). ∎
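
The scalar inequality (3.113), quoted from Lemma S3 of Foroughi pour and Dalton (2019), can be probed numerically on a grid near $x=1$ (an informal sanity check with arbitrary grid values, not part of the proof):

```python
import numpy as np

# Informal grid check of (3.113):
#   rho*x**(1-rho) + (1-rho)*x**(-rho) <= 1 + rho*(1-rho)*(x-1)**2
# for x near 1.  The grid is an arbitrary illustrative choice.
rho = np.linspace(0.01, 0.99, 99)[:, None]
x = np.linspace(0.75, 1.25, 101)[None, :]
lhs = rho * x ** (1 - rho) + (1 - rho) * x ** (-rho)
rhs = 1 + rho * (1 - rho) * (x - 1) ** 2
print(bool(np.all(lhs <= rhs + 1e-12)))   # True on this grid
```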

Lemma 8.

Let $G_{1}$ and $G_{2}$ be any disjoint feature sets such that there exists at least one feature in $G_{1}$ and one feature in $G_{2}$ that are correlated in at least one class, and $\Sigma_{0}^{G}$ and $\Sigma_{1}^{G}$ are full rank, where $G=G_{1}\cup G_{2}$. Let $S_{\infty}$ be a balanced sample. Then, w.p. 1 there exist $0\leq r<1$ and $N>0$ such that

\left(\frac{|C^{G}_{0}|^{\kappa^{G*}_{0}}|C^{G}_{1}|^{\kappa^{G*}_{1}}}{|C^{G_{1}}_{0}|^{\kappa^{G_{1}*}_{0}}|C^{G_{2}}_{0}|^{\kappa^{G_{2}*}_{0}}|C^{G_{1}}_{1}|^{\kappa^{G_{1}*}_{1}}|C^{G_{2}}_{1}|^{\kappa^{G_{2}*}_{1}}}\right)^{0.5}<r^{n} \quad (3.116)

for all $n>N$. The left hand side of (3.116) is $a(P)/a(P^{\prime})$ when $P$ and $P^{\prime}$ are strict refinements of the unambiguous feature partition that are identical except $G_{1}$ and $G_{2}$ are good blocks in $P$ and $G$ is a good block in $P^{\prime}$.

Proof.

We have that

\left(\frac{|C^{G}_{0}|^{\kappa^{G*}_{0}}|C^{G}_{1}|^{\kappa^{G*}_{1}}}{|C^{G_{1}}_{0}|^{\kappa^{G_{1}*}_{0}}|C^{G_{2}}_{0}|^{\kappa^{G_{2}*}_{0}}|C^{G_{1}}_{1}|^{\kappa^{G_{1}*}_{1}}|C^{G_{2}}_{1}|^{\kappa^{G_{2}*}_{1}}}\right)^{0.5}
=\left(\frac{|C^{G}_{0}|^{\kappa^{G}_{0}}}{|C^{G_{1}}_{0}|^{\kappa^{G_{1}}_{0}}|C^{G_{2}}_{0}|^{\kappa^{G_{2}}_{0}}}\times\frac{|C^{G}_{1}|^{\kappa^{G}_{1}}}{|C^{G_{1}}_{1}|^{\kappa^{G_{1}}_{1}}|C^{G_{2}}_{1}|^{\kappa^{G_{2}}_{1}}}\right)^{0.5}\times R_{0,n}^{n_{0}}R_{1,n}^{n_{1}}, \quad (3.117)

where

R_{y,n}=\left(\frac{|C^{G}_{y}|}{|C^{G_{1}}_{y}||C^{G_{2}}_{y}|}\right)^{0.5}. \quad (3.118)

By Corollary 1, the first term in (3.117) converges to a positive finite constant w.p. 1, and (3.76) holds. The rest of the proof proceeds exactly as in Lemma 5. ∎

Lemma 9.

Let $B_{1}$ and $B_{2}$ be any disjoint feature sets such that $\mu_{0}^{B}=\mu_{1}^{B}$, $\Sigma^{B}\equiv\Sigma_{0}^{B}=\Sigma_{1}^{B}$, there exists at least one feature in $B_{1}$ and one feature in $B_{2}$ that are correlated in at least one class, and $\Sigma^{B}$ is full rank, where $B=B_{1}\cup B_{2}$. Let $S_{\infty}$ be a balanced sample. Then, w.p. 1 there exist $0\leq r<1$ and $N>0$ such that

\left(\frac{|C^{B}|^{\kappa^{B*}}}{|C^{B_{1}}|^{\kappa^{B_{1}*}}|C^{B_{2}}|^{\kappa^{B_{2}*}}}\right)^{0.5}<r^{n} \quad (3.119)

for all $n>N$. The left hand side of (3.119) is $a(P)/a(P^{\prime})$ when $P$ and $P^{\prime}$ are strict refinements of the unambiguous feature partition that are identical except $B_{1}$ and $B_{2}$ are bad blocks in $P$ and $B$ is a bad block in $P^{\prime}$.

Proof.

We have that

\left(\frac{|C^{B}|^{\kappa^{B*}}}{|C^{B_{1}}|^{\kappa^{B_{1}*}}|C^{B_{2}}|^{\kappa^{B_{2}*}}}\right)^{0.5} =\left(\frac{|C^{B}|^{\kappa^{B}}}{|C^{B_{1}}|^{\kappa^{B_{1}}}|C^{B_{2}}|^{\kappa^{B_{2}}}}\right)^{0.5}\times R_{n}^{n}, \quad (3.120)

where

R_{n}=\left(\frac{|C^{B}|}{|C^{B_{1}}||C^{B_{2}}|}\right)^{0.5}. \quad (3.121)

By Corollary 1, the first term in (3.120) converges to a positive finite constant w.p. 1, and

R_{n}\to\left(\frac{|\Sigma^{B}|}{|\Sigma^{B_{1}}||\Sigma^{B_{2}}|}\right)^{0.5}\equiv R \quad (3.122)

as $n\to\infty$, both w.p. 1, where $\Sigma^{B_{1}}\equiv\Sigma^{B_{1}}_{0}=\Sigma^{B_{1}}_{1}$ and $\Sigma^{B_{2}}\equiv\Sigma^{B_{2}}_{0}=\Sigma^{B_{2}}_{1}$. Without loss of generality, assume features in $B$ are ordered such that

\Sigma^{B}=\begin{pmatrix}\Sigma^{B_{1}}&V\\ V^{T}&\Sigma^{B_{2}}\end{pmatrix}, \quad (3.123)

where $V$ is a matrix of correlations between features in $B_{1}$ and $B_{2}$. Since

|\Sigma^{B}|=|\Sigma^{B_{1}}||\Sigma^{B_{2}}-V^{T}(\Sigma^{B_{1}})^{-1}V|, \quad (3.124)

we have that

R=\left(\frac{|\Sigma^{B_{2}}-V^{T}(\Sigma^{B_{1}})^{-1}V|}{|\Sigma^{B_{2}}|}\right)^{0.5}. \quad (3.125)

Since there exists at least one feature in $B_{1}$ and one feature in $B_{2}$ that are correlated, $V$ is non-zero, $|\Sigma^{B_{2}}-V^{T}(\Sigma^{B_{1}})^{-1}V|<|\Sigma^{B_{2}}|$, and $0\leq R<1$. The theorem holds for any $R<r<1$. ∎
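
The final step rests on a strict Schur complement inequality: whenever the cross-covariance $V$ is nonzero, $|\Sigma^{B_{2}}-V^{T}(\Sigma^{B_{1}})^{-1}V|<|\Sigma^{B_{2}}|$. A minimal numerical illustration with arbitrary matrices (an informal check, not part of the proof) follows.

```python
import numpy as np

# Informal illustration of the strict inequality behind (3.125): a nonzero
# cross-covariance V strictly shrinks the Schur complement determinant.
# Matrices are arbitrary illustrative choices.
S1 = np.array([[1.0, 0.2], [0.2, 1.0]])        # plays the role of Sigma^{B1}
S2 = np.array([[2.0, 0.0], [0.0, 1.0]])        # plays the role of Sigma^{B2}
V = np.array([[0.3, 0.0], [0.0, 0.1]])         # nonzero cross-covariance
schur = S2 - V.T @ np.linalg.inv(S1) @ V
print(np.linalg.det(schur), np.linalg.det(S2))            # strictly smaller
print(np.sqrt(np.linalg.det(schur) / np.linalg.det(S2)))  # the limit R < 1 in (3.125)
```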

4 Conclusion

The consistency theory presented herein is important because it provides a richer understanding of the types of features selected by OBFS. Furthermore, we have characterized rates of convergence for the posterior on feature partitions and for the marginal posterior probabilities on individual features.

Although here we focus on identifying $\bar{G}$ using the posterior $\pi^{*}(G)$, the OBFS framework can be used for optimal Bayesian partition selection, which aims to identify $\bar{P}$ using the full posterior $\pi^{*}(P)$. Partition selection may be of interest, for instance, if one wishes to identify communities of closely interacting genes. Since Theorem 3 proves that $\pi^{*}(P)$ converges to a point mass at $\bar{P}$, this theorem has direct implications on the consistency of optimal Bayesian partition selection as well.

The conditions in Theorem 3 are sufficient but not necessary. For example, it may be possible to relax Condition (iii) of Theorem 3. It is also possible for $\pi^{*}(G)$ to converge to a point mass at $\bar{G}$, but for $\pi^{*}(P)$ to not converge to a point mass at $\bar{P}$. For example, if non-zero correlations are present in the data generation process, then the OBF variant of OBFS sets $\pi(\bar{P})=0$, which means that Condition (v) of Theorem 3 does not hold. However, if the independent unambiguous set of good features given in Definition 1 and the unambiguous set of good features given in Definition 3 happen to be equal (the latter always contains the former), then OBF can be strongly consistent relative to the unambiguous set of good features.

OBFS searches for the unambiguous set of good features, which expands upon the independent unambiguous set of good features targeted by OBF. In particular, the unambiguous set of good features includes features that are only strongly discriminating when grouped together, features that are individually weak but strongly correlated with strong features, and features that are linked to discriminating features only through a chain of correlation. The unambiguous feature set is complete in the sense that any features that are not included must not have any first or second order distributional differences between the classes and must be uncorrelated with all features that are included, and minimal in the sense that it is the smallest feature set with this property. The unambiguous feature set is thus of great interest and importance in bioinformatics and many other application areas, although we are not aware of any selection algorithms besides OBFS that aim to identify this set.

Acknowledgement

This work is supported by the National Science Foundation (CCF-1422631 and CCF-1453563).

References

  • Ang et al. (2016) Ang, J. C., Mirzal, A., Haron, H., and Hamed, H. N. A. (2016). “Supervised, unsupervised, and semi-supervised feature selection: A review on gene selection.” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 13(5): 971–989.
  • Awada et al. (2012) Awada, W., Khoshgoftaar, T. M., Dittman, D., Wald, R., and Napolitano, A. (2012). “A review of the stability of feature selection techniques for bioinformatics data.” In Proceedings of the 2012 IEEE 13th International Conference on Information Reuse and Integration (IRI), 356–363.
  • Dalton (2013) Dalton, L. A. (2013). “Optimal Bayesian feature selection.” 65–68. IEEE.
  • Diamandis (2010) Diamandis, E. P. (2010). “Cancer biomarkers: Can we turn recent failures into success?” Journal of the National Cancer Institute, 102(19): 1462–1467.
  • Fan (1950) Fan, K. (1950). “On a theorem of Weyl concerning eigenvalues of linear transformations II.” Proceedings of the National Academy of Sciences, 36(1): 31–35.
  • Fang et al. (2019) Fang, G., Wang, W., Paunic, V., Heydari, H., Costanzo, M., Liu, X., Liu, X., VanderSluis, B., Oately, B., Steinbach, M., et al. (2019). “Discovering genetic interactions bridging pathways in genome-wide association studies.” Nature communications, 10(1): 1–18.
  • Foroughi pour and Dalton (2014) Foroughi pour, A. and Dalton, L. A. (2014). “Optimal Bayesian feature selection on high dimensional gene expression data.” In Proceedings of the 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 1402–1405.
  • Foroughi pour and Dalton (2015) — (2015). “Optimal Bayesian feature filtering.” In Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), 651–652.
  • Foroughi pour and Dalton (2017a) — (2017a). “Robust feature selection for block covariance Bayesian models.” In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2696–2700.
  • Foroughi pour and Dalton (2017b) — (2017b). “Robust feature selection for block covariance Bayesian models.”
  • Foroughi pour and Dalton (2018a) — (2018a). “Heuristic algorithms for feature selection under Bayesian models with block-diagonal covariance structure.” BMC Bioinformatics, 19(Suppl 3): 70.
  • Foroughi pour and Dalton (2018b) — (2018b). “Optimal Bayesian filtering for biomarker discovery: Performance and robustness.” IEEE/ACM Transactions on Computational Biology and Bioinformatics, early access.
  • Foroughi pour and Dalton (2019) — (2019). “Theory of Optimal Bayesian Feature Filtering.” Bayesian Analysis, accepted.
  • Han et al. (2017) Han, K., Jeng, E. E., Hess, G. T., Morgens, D. W., Li, A., and Bassik, M. C. (2017). “Synergistic drug combinations for cancer identified in a CRISPR screen for pairwise genetic interactions.” Nature biotechnology, 35(5): 463.
  • Hartman and Wintner (1941) Hartman, P. and Wintner, A. (1941). “On the law of the iterated logarithm.” American Journal of Mathematics, 63(1): 169–176.
  • Harville (1998) Harville, D. A. (1998). “Matrix algebra from a statistician’s perspective.” Technometrics, 40(2): 164–164.
  • Ilyin et al. (2004) Ilyin, S. E., Belkowski, S. M., and Plata-Salamán, C. R. (2004). “Biomarker discovery and validation: Technologies and integrative approaches.” Trends in Biotechnology, 22(8): 411–416.
  • Li et al. (2017) Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., and Liu, H. (2017). “Feature selection: A data perspective.” ACM Computing Surveys, 50(6): 94.
  • Ramachandran et al. (2008) Ramachandran, N., Srivastava, S., and LaBaer, J. (2008). “Applications of protein microarrays for biomarker discovery.” Proteomics - Clinical Applications, 2(10–11): 1444–1459.
  • Rifai et al. (2006) Rifai, N., Gillette, M. A., and Carr, S. A. (2006). “Protein biomarker discovery and validation: The long and uncertain path to clinical utility.” Nature Biotechnology, 24(8): 971–983.
  • Sima and Dougherty (2006) Sima, C. and Dougherty, E. R. (2006). “What should be expected from feature selection in small-sample settings.” Bioinformatics, 22(19): 2430–2436.
  • Sima and Dougherty (2008) — (2008). “The peaking phenomenon in the presence of feature-selection.” Pattern Recognition Letters, 29(11): 1667–1674.
  • Xi et al. (2018) Xi, J., Li, A., and Wang, M. (2018). “A novel unsupervised learning model for detecting driver genes from pan-cancer data through matrix tri-factorization framework with pairwise similarities constraints.” Neurocomputing, 296: 64–73.