Data-Driven Representations for Testing Independence: Modeling, Analysis and Connection with Mutual Information Estimation
Abstract
This work addresses testing the independence of two continuous and finite-dimensional random variables from the design of a data-driven partition. The empirical log-likelihood statistic is adopted to approximate the sufficient statistics of an oracle test against independence (that knows the two hypotheses). It is shown that approximating the sufficient statistics of the oracle test offers a learning criterion for designing a data-driven partition that connects with the problem of mutual information estimation. Applying these ideas in the context of a data-dependent tree-structured partition (TSP), we derive conditions on the TSP’s parameters to achieve a strongly consistent distribution-free test of independence over the family of probabilities equipped with a density. Complementing this result, we present finite-length results that show our TSP scheme’s capacity to detect the scenario of independence structurally with the data-driven partition as well as new sampling complexity bounds for this detection. Finally, some experimental analyses provide evidence regarding our scheme’s advantage for testing independence compared with some strategies that do not use data-driven representations.
Index Terms:
Independence testing, non-parametric learning, learning representations, data-driven partitions, tree-structured partitions, mutual information, consistency, finite-length analysis.
I Introduction
The problem of detecting independence from i.i.d. samples in a learning setting (distribution-free) is fundamental in statistics and has found numerous applications in statistical signal processing, data analysis, machine learning, inference, and decision problems [1, 2, 3, 4]. For instance, the capacity to detect independent and conditionally independent structures from data is relevant when dealing with invariances and probabilistic symmetries in decision and machine learning models [5, 6, 7, 8]. This capacity has been used in blind source separation, independent component analysis (ICA) [1, 2, 3, 4, 9] and the detection of associations in data [10]. Detecting independence from data has been systematically studied, and the literature is vast. Different non-parametric strategies have been proposed for this task using kernel-based statistics [11, 12, 13, 14], distance-based approaches [15, 16, 17, 18, 19], histogram-based approaches [20, 21, 22, 10, 23, 24], log-likelihood statistics [20, 25], correlation-based approaches [26, 27], φ-divergence estimators [25, 20], entropy and mutual information estimators [28, 29, 30, 10], among many others.
A natural strategy, and the one we follow in this paper, is to partition the observation space (the binning approach) to evaluate the discrepancy (in some sense) between two empirical distributions [20, 12, 29, 15, 22, 21]. We highlight the following work using this approach: Gretton and Györfi [20] introduced a family of product data-independent partitions using the $L_1$-statistics [31] and the I-divergence statistics [32]. With these two approaches, sufficient conditions are established on the partition and other learning parameters to achieve strong consistency for detecting independence distribution-free in a continuous multivariate scenario. Székely et al. [15] also use partitions of the sample space, utilizing empirical distance correlation and distance covariance statistics to introduce a novel test of independence. The test is consistent and shows better empirical results than classical likelihood-ratio tests in scenarios with non-linear dependencies. Shi et al. [33] address the multivariate scenario by incorporating the concept of transportation-based ranks and signs. They propose plugging the center-outward distribution function into the mentioned distance covariance statistic [15] to determine a distribution-free test of independence.
More recently, Zhang [22] proposed a non-parametric test of independence assuming a rich class of models with known marginal distributions (uniform over $[0,1]$). This work proposed a binary expansion filtration of the space and studied in depth some symmetry properties to create a novel test that is uniformly consistent (in the power loss) and minimax in the power (with respect to the sample size) over a large class of alternative distributions. Zhang et al. [34] presented a novel extension of the binary expansion filtration strategy [22] to a multidimensional setting by cleverly approximating an oracle Neyman-Pearson (NP) statistic. They propose a data-adaptive weight strategy (for the binary expansion) to approximate the ideal weight of the NP test. This new test unifies several important tests (e.g., $\chi^2$ test statistics and distance correlation) and shows promising empirical results.
On the important connection between testing independence and information measures [35, 36], we highlight the work of Kontoyiannis and Skoularidou [30], where the problem of estimating directed information (directed information rate) between two discrete (finite alphabet) processes was explored based on a plug-in estimator. Importantly, the authors used this plug-in estimator to test the presence or absence of causality between the processes. Along the same lines, Suzuki [29] explored different strategies to estimate mutual information between discrete and continuous random variables and then evaluated those strategies experimentally for the task of testing independence. Finally, Reshef et al. [10] introduced the maximal information coefficient (MIC) to measure complex functional dependencies between a pair of variables by maximizing the (empirical) mutual information computed for collections of (adaptive) partitions of different resolutions.
In many of the methods mentioned that use partitions (or binning) for testing independence, a component that has not been explored systematically is the role played by data-driven partitions [37, 38, 10]. Data-driven partitions use data for the binning process (vector quantization) to construct the cells and define the final structure of the partition. This flexibility offers the capacity to better address inference and learning problems. Supporting this idea, data-driven representations have shown great approximation properties and better decision-estimation performance in numerous non-parametric learning problems [37, 38, 39, 40, 41].
In our context, the motivation that piqued our interest in data-driven partitions is the fact that under the hypothesis of independence, the trivial partition (the partition with one cell that contains all the space) is a sufficient representation of the problem. Our initial conjecture is that a well-designed adaptive data-driven partition could have the ability to detect (learn from data) this trivial solution by implementing a form of explicit regularization in its design. This adaptive ability could provide better detection of independence than non-adaptive partition schemes, which has prompted the exploration of data-driven methods for testing independence in this work.
Motivated by this, we look at the problem of designing a new learning criterion that selects a data-driven partition of the space (adaptive binning) as the optimal trade-off between estimation and approximation errors in a learning problem. To formulate this representation learning task, we adopt ideas from universal source coding to introduce a regret term that measures the discrepancy attributed to the use of empirical log-likelihood statistics — restricted over a partition — with respect to the oracle sufficient statistics of an ideal test that knows the true (two) probabilities: the oracle test against independence [36, 42]. Using this regret analysis, we establish a novel connection with the problem of mutual information (MI) estimation [30]. Furthermore, general conditions are derived to obtain a strongly consistent test of independence in the strong (almost-sure and distribution-free) sense introduced in [20].
We apply this framework in the context of tree-structured partitions (TSP) [39], which is the main application focus of this work. TSP algorithms were selected for the following important reasons: their implementation is simple; TSPs have a binary structure that can be used to address learning tasks efficiently for a large class of optimization problems (minimum cost-tree pruning) [43, 44, 45]; and they have been shown to be expressive (with good approximation properties) when compared with other partition strategies in non-parametric learning and decision tasks [37, 38, 39, 40, 41, 43, 44, 46].
On the proposed test, we derive concrete connections with results on mutual information estimation [39]. From these connections, we establish new design conditions to obtain a strongly consistent test of independence and, more importantly, non-asymptotic results that express our framework’s capacity to approximate the information term of the oracle (ideal) test with a finite sample size. Going a step further in this finite-length performance analysis, we study our scheme’s capacity to detect independence structurally with the underlying data-driven partition, which was one of our original motivations for exploring data-driven partitions for this task.
Indeed, we show that under the independence assumption, our data-driven partition collapses to the trivial solution with one cell (implying that the resulting MI estimator is zero) with a finite sample-size almost surely. This is a remarkable property attributed to our scheme’s capacity to adapt its representation to the sufficient statistics of the problem, which is the trivial partition in the context of independence. From this ability, we improve our original result concerning our test’s consistency (density-free) and provide refined sampling complexity bounds for detecting the scenario of independence.
To implement our approach, we propose a learning algorithm that offers a computationally efficient implementation of our data-driven strategy for testing independence. The algorithm is divided into two phases (a growing stage and a pruning stage), both of which are shown to have polynomial complexity in the size of the problem (the sample size). Finally, we provide empirical evidence of the advantage of our strategy in some controlled simulation scenarios. For this last analysis, we introduce concrete non-asymptotic metrics (in the form of sampling complexity indicators) to measure the test’s ability to detect the correct hypothesis with a finite number of samples.
A preliminary version of this work was presented in [47]. This work significantly expands the technical contribution in [47] and explores implementations and experimental analyses.
I-A Organization of the Paper
Section II introduces the problem and basic notations. Section III presents the general histogram-based scheme, and Section IV introduces the regret-based analysis proposed in this work and its connection with the problem of MI estimation. Section IV-B presents a general consistency result for our problem (Theorem 1). Section V introduces the family of tree-structured data-driven partitions and presents consistency results (Lemmas 1 and 2 and Corollary 1). Section VI analyzes some finite-length properties (Theorem 2) and elaborates on the structural capacity of our scheme to detect independence under the null hypothesis (Lemma 3, and Theorems 3 and 4). Sections VII and VIII cover implementation and empirical analysis, respectively. Finally, Section IX concludes with some discussion and future directions. The proofs are presented in the Appendix.
II Preliminaries
Let us consider two random variables $X$ and $Y$ taking values in finite-dimensional Euclidean spaces $\mathcal{X}$ and $\mathcal{Y}$, respectively, with joint distribution $P_{XY}$ defined on the product space $\mathcal{X}\times\mathcal{Y}$ equipped with its Borel sigma field. Let us denote by $\mathcal{P}_0$ the class of probabilities on $\mathcal{X}\times\mathcal{Y}$ under which $X$ and $Y$ are independent, meaning that if $P_{XY}\in\mathcal{P}_0$, then $P_{XY}(A\times B)=P_X(A)\cdot P_Y(B)$ for any Borel sets $A\subseteq\mathcal{X}$ and $B\subseteq\mathcal{Y}$.
Given $n$ i.i.d. samples of $(X,Y)$, denoted by $Z_1^n=(Z_1,\dots,Z_n)$ with $Z_i=(X_i,Y_i)$ and driven by the model $P_{XY}$, the problem is to decide from $Z_1^n$ whether $X$ and $Y$ are independent (meaning that $P_{XY}\in\mathcal{P}_0$) or, alternatively, whether $X$ and $Y$ have some statistical dependency, i.e., $P_{XY}\notin\mathcal{P}_0$. In this context, a decision rule of length $n$ is a function mapping $Z_1^n$ to a binary decision, where one of the two outputs indicates that the rule decides that the underlying probability producing $Z_1^n$ belongs to $\mathcal{P}_0$. Then, for any decision rule of length $n$, we recognize two types of error:
• Assuming that $P_{XY}\in\mathcal{P}_0$ (independence), the non-detection of independence between $X$ and $Y$ is measured by the probability, under the process (i.i.d.) distribution induced by $P_{XY}$, that the rule declares dependence.
• Assuming that $P_{XY}\notin\mathcal{P}_0$, the false detection of independence between $X$ and $Y$ is measured by the probability, under the process distribution induced by $P_{XY}$, that the rule declares independence.
The following is the classical notion of strong-consistency used to evaluate the asymptotic goodness of a universal scheme (collection of rules of different lengths) for detecting independence from data [20].
Definition 1
Given a scheme of decision rules (one rule for each sample length $n$), we say that the scheme is strongly consistent for detecting independence if
i) under independence ($P_{XY}\in\mathcal{P}_0$): the rule eventually declares independence with probability one, i.e., almost surely (a.s.);
ii) under dependence ($P_{XY}\notin\mathcal{P}_0$): the rule eventually declares dependence a.s.
II-A The Divergence and Mutual Information
Let $P$ and $Q$ be two probability measures on $\mathcal{X}\times\mathcal{Y}$ such that $P\ll Q$, and let us consider a measurable partition $\pi=\{A_i\}_{i\in\mathcal{I}}$ of $\mathcal{X}\times\mathcal{Y}$, where $\mathcal{I}$ is finite or countable. The divergence of $P$ with respect to $Q$ restricted over $\pi$ (or over the sub-sigma field induced by $\pi$, denoted by $\sigma(\pi)$) is given by [32, 36]
$$D_{\pi}(P\Vert Q) \;=\; \sum_{i\in\mathcal{I}} P(A_i)\,\log\frac{P(A_i)}{Q(A_i)}. \qquad (1)$$
The divergence of $P$ with respect to $Q$ is [32, 36]
$$D(P\Vert Q) \;=\; \sup_{\pi\in\mathcal{Q}(\mathcal{X}\times\mathcal{Y})} D_{\pi}(P\Vert Q), \qquad (2)$$
where $\mathcal{Q}(\mathcal{X}\times\mathcal{Y})$ is the collection of measurable (Borel) partitions of $\mathcal{X}\times\mathcal{Y}$.
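As a concrete illustration of definitions (1) and (2) (the code below is a sketch of ours, with made-up cell probabilities, not an object from this paper), the restricted divergence is a simple finite sum over the cells of a partition, and by (2) refining the partition can only move this quantity up toward $D(P\Vert Q)$:

```python
import math

def restricted_divergence(p_cells, q_cells):
    """Partition-restricted divergence: sum_i P(A_i) * log(P(A_i) / Q(A_i)).

    p_cells, q_cells: probabilities assigned by P and Q to the cells of a
    common finite partition (each list sums to 1; Q(A_i) > 0 whenever
    P(A_i) > 0, reflecting the absolute-continuity assumption P << Q)."""
    return sum(p * math.log(p / q) for p, q in zip(p_cells, q_cells) if p > 0)

# Illustrative 4-cell partition of the joint space (numbers are placeholders).
p = [0.40, 0.10, 0.10, 0.40]   # cell probabilities under the joint model P
q = [0.25, 0.25, 0.25, 0.25]   # cell probabilities under the product model Q
print(restricted_divergence(p, q))   # strictly positive: dependence visible at this resolution
```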
III The Empirical Log-Likelihood Statistics from a Data-Driven Partition
In this work, we adopt a histogram-based log-likelihood statistics approach [20]. We interpret this approach as an empirical version of the Neyman-Pearson (NP) test against independence [42]. The ideal (oracle) NP test against independence decides between two known cases: the samples are driven by the joint model $P_{XY}$ (for some fixed $P_{XY}$) or by its product counterpart $Q_{XY}$, where $Q_{XY}(A\times B)=P_X(A)\cdot P_Y(B)$ for any Borel sets $A$ and $B$ ($Q_{XY}$ can be interpreted as the projection of $P_{XY}$ onto $\mathcal{P}_0$ in the information divergence sense, i.e., $Q_{XY}$ is the solution of $\min_{Q\in\mathcal{P}_0} D(P_{XY}\Vert Q)$).
The specific approach proposed in this work is a two-stage process that uses the data twice: first, to estimate , and its projected version (over ), with the caveat that the estimation is restricted over the events of a finite partition of the joint space (i.e., a histogram-based estimation of and , respectively); second, to compute an empirical version of the log likelihood-ratio to decide (using a threshold) if (independence) or (non-independence).
There are two elements of learning design that determine our empirical test. The first is a data-driven partition rule, which is a function that maps data sequences $Z_1^n$ into finite measurable partitions of $\mathcal{X}\times\mathcal{Y}$. The second element is a non-negative threshold. Then, given a partition rule, a threshold, and some data $Z_1^n$, the data-driven test is constructed as follows:
1) Use the quantization induced by the data-driven partition to estimate the joint distribution and its product (independent) counterpart over the cells:
$$\hat{P}_n(A) \;=\; \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{A}(X_i,Y_i), \qquad (3)$$
$$\hat{Q}_n(A) \;=\; \Big[\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{A^1}(X_i)\Big]\cdot\Big[\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{A^2}(Y_i)\Big], \qquad (4)$$
for any cell $A$ of the partition. In Eq. (4) we assume that the cells of the partition have a product structure, i.e., $A=A^1\times A^2$ with $A^1\subseteq\mathcal{X}$ and $A^2\subseteq\mathcal{Y}$ (this event-wise product structure on the cells is needed to estimate both the joint model and its product counterpart from i.i.d. samples of $(X,Y)$).
2) Project the data over the cells of the partition. A simple projection (or representation) function maps each sample to the unique cell of the partition that contains it (Eq. (5)). The projected (quantized) data is obtained by applying this function to every sample in $Z_1^n$.
3) Compute the log-likelihood ratio of the empirical distributions in (3) and (4) using the quantized (projected) data. In particular, we consider the following log-likelihood ratio per sample:
$$\hat{i}_n \;=\; \frac{1}{n}\sum_{i=1}^{n}\log\frac{\hat{P}_n\big(A(Z_i)\big)}{\hat{Q}_n\big(A(Z_i)\big)}, \qquad (6)$$
where $A(Z_i)$ denotes the cell of the data-driven partition containing $Z_i$. By construction, $\{\hat{P}_n(A)\}$ and $\{\hat{Q}_n(A)\}$ are precisely the empirical distributions of the quantized samples in their (finite) representation space.
4) Finally, the decision is given by the following threshold rule (Eq. (7)): declare statistical dependence if the statistic in (6) is greater than the non-negative threshold, and declare independence otherwise.
The key object in this construction is the representation function introduced in the second step, Eq. (5), which captures the role played by the data-driven partition in our log-likelihood statistics in (6).
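To fix ideas, the following minimal sketch implements steps 1)–4) end to end. It is an illustration only: it uses a fixed product grid of marginal-quantile bins instead of the data-driven TSP of Section V, and the function names, the number of bins, and the threshold value are ours, not part of the paper.

```python
import numpy as np

def empirical_independence_test(x, y, num_bins=4, threshold=0.05):
    """Steps 1)-4) over a fixed product grid (illustration only; the paper
    uses a data-driven tree-structured partition instead of this grid)."""
    n = len(x)
    # Product cells A = A^1 x A^2 built from marginal quantile bins.
    x_edges = np.quantile(x, np.linspace(0, 1, num_bins + 1))
    y_edges = np.quantile(y, np.linspace(0, 1, num_bins + 1))
    ix = np.clip(np.searchsorted(x_edges, x, side="right") - 1, 0, num_bins - 1)
    iy = np.clip(np.searchsorted(y_edges, y, side="right") - 1, 0, num_bins - 1)
    # Step 1: empirical joint (Eq. (3)) and product of empirical marginals (Eq. (4)).
    joint = np.zeros((num_bins, num_bins))
    for a, b in zip(ix, iy):
        joint[a, b] += 1.0 / n
    prod = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    # Steps 2-3: per-sample log-likelihood ratio on the projected data (Eq. (6)).
    stat = sum(np.log(joint[a, b] / prod[a, b]) for a, b in zip(ix, iy)) / n
    # Step 4: threshold rule (Eq. (7)): a large statistic means dependence.
    return stat, ("dependent" if stat > threshold else "independent")

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
print(empirical_independence_test(x, rng.normal(size=2000))[1])            # independent pair
print(empirical_independence_test(x, x + 0.5 * rng.normal(size=2000))[1])  # dependent pair
```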
IV Regret Analysis: The Representation Learning Problem
To analyze the quality of the proposed test, in this section we compare the proposed statistics in (6) with the statistics used by an oracle NP test. We consider the data-driven partition in the pair as a learning agent. For this analysis, let us first fix a finite partition . Given i.i.d. realizations from , we consider the empirical-quantized log-likelihood statistics in (6). Adopting the notion of regret from universal source coding [35], let us consider as a reference the true likelihood ratio of with respect to associated with the expression in (6) but with no quantization effects, i.e., 666By construction ; therefore, the RN derivative is well defined in (8) where it is clear that if , then .
(8) |
is the ideal (oracle) information term used by the NP test against independence [35]. We measure the (sample-wise) regret of as
(9) |
which measures the discrepancy between the empirical-quantized statistics in (6) and the oracle term in (8).
IV-A Connection with Divergence Estimation
The regret in (9) has two error sources: one associated with the quantization of the space (the representation quality of ) and the other associated with the fact that we use the empirical distribution in (3) instead of the true model . To isolate these components, it is useful to introduce the oracle-quantized information given by
(10) |
where the quantities above are short-hand for the true probabilities induced by $P_{XY}$ and $Q_{XY}$ over the cells of the partition. Then, the regret in (9) can be decomposed into two components:
(11) |
The first term (I) on the right hand side (RHS) of Eq. (11) captures an information loss attributed to the quantization (5) (the approximation error); this expression converges, as $n$ tends to infinity, to the corresponding approximation-error term almost surely. The second term (II) on the RHS of (11) captures the discrepancy in the information density attributed to the use of empirical distributions. Crucially, these two error sources are driven by the partition. At this point, it is worth noticing that the empirical information term in (6) can be expressed as a histogram-based estimator of the divergence restricted over the cells of the partition [39, 48, 40]. From these results, the RHS of (11) can be re-structured as follows
(12) |
The first term on the RHS of (12) goes to zero with probability one (by the strong law of large numbers, assuming the mutual information is finite) independent of the partition. Therefore, from the perspective of evaluating the effect of the partition on the regret, this term can be overlooked. On the other hand, the second term captures the approximation error, or what we lose in discrimination (as the number of samples goes to infinity) when projecting the data over the partition with respect to the lossless term in (8). Finally, the last term measures the estimation error, which is the discrepancy between the true (oracle) distributions and the empirical distributions in the information divergence sense over the events of the partition (see Eq. (1)).
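Since the displayed equations of this subsection did not survive extraction, the decomposition described above can be sketched in our own (generic) notation, writing $i_n$ for the oracle statistic in (8), $i_n^{\pi}$ for the oracle-quantized statistic in (10), $\hat i_n$ for the empirical-quantized statistic in (6), and $r_n(\pi)$ for the regret in (9):
$$r_n(\pi) \;=\; \underbrace{\big(i_n - i_n^{\pi}\big)}_{\text{(I) quantization}} + \underbrace{\big(i_n^{\pi} - \hat i_n\big)}_{\text{(II) empirical estimation}} \;=\; \underbrace{\big(i_n - I(X;Y)\big)}_{\to\,0\ \text{a.s. (SLLN)}} + \underbrace{\big(I(X;Y) - D_{\pi}(P_{XY}\Vert Q_{XY})\big)}_{\text{approximation error}} + \underbrace{\big(D_{\pi}(P_{XY}\Vert Q_{XY}) - D_{\pi}(\hat P_n\Vert \hat Q_n)\big)}_{\text{estimation error}},$$
using the facts that $\hat i_n = D_{\pi}(\hat P_n\Vert \hat Q_n)$ (the histogram-based estimator mentioned above) and that $I(X;Y) = D(P_{XY}\Vert Q_{XY})$.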
IV-A1 The Representation Problem
Here we introduce the main design problem of this work, which is to find a data-driven partition that offers an optimal balance between the two relevant terms presented in (IV-A). For the estimation error, the idea is to adopt distribution-free (universal) error bounds for of the form (more details in Section V-B):
where is the confidence interval for the confidence probability . Equipped with this result, we could use the following upper bound for the regret: , with high probability. Finally, we could formulate the problem of selecting over a family of representations (subset of ) as the solution of the following regularized info-max problem:
(13) |
The learning task in (13) is still intractable: it resembles an oracle learner agent (teacher) that selects the partition by solving the trade-off between the two errors, one of which requires the true model. Section V addresses a tractable info-max version of (13), where the empirical distribution in (3) is adopted instead of the true model.
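In our own notation (a sketch of the form of the criterion, since the precise expression in (13) did not survive extraction), the oracle representation problem reads
$$\pi_n^{*} \;\in\; \arg\max_{\pi\in\Lambda}\; \Big\{\, D_{\pi}\big(P_{XY}\Vert Q_{XY}\big) \;-\; \epsilon_n(\pi,\delta) \,\Big\},$$
where $\Lambda$ is the family of admissible partitions, $D_{\pi}(P_{XY}\Vert Q_{XY})$ is the partition-restricted divergence in (1) (governing the approximation error), and $\epsilon_n(\pi,\delta)$ is the distribution-free confidence bound on the estimation error (the regularizer). The tractable version studied in Section V replaces the unknown $D_{\pi}(P_{XY}\Vert Q_{XY})$ by its empirical counterpart $D_{\pi}(\hat P_n\Vert \hat Q_n)$.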
IV-B Strong Consistency: A Basic Requirement
In this subsection, we introduce the idea of measuring the discrepancy between and the ideal (oracle) information . From this perspective, we could introduce a new notion of consistency based on this learning objective.
Definition 2
Our scheme is said to be strongly consistent on the regret if for any and i.i.d. process with , it follows that
-almost surely (a.s.).
A simple result follows:
PROPOSITION 1
If is strongly consistent on the regret, then is a strongly consistent estimator of the MI between and .777From Definition 2, it follows directly that , -almost surely.
Returning to our original problem, the next result offers sufficient conditions for a data-driven scheme to be strongly consistent for detecting independence (Def. 1).
THEOREM 1
The proof is presented in Appendix A.
Consistency is a basic asymptotic requirement that is not sufficient when looking into practical applications that operate with a finite sample size. Hence, in the next sections, we present a test that is strongly consistent (Th. 3) but also satisfies some relevant non-asymptotic properties in the form of finite-length performance results (Ths. 2, 4 and Lemma 3).
V Tree-Structured Data-Driven Partitions
In this section, an empirical version of (13) is studied considering a dictionary of tree-structured data-driven partitions (TSP). This TSP family was introduced in [39] for the problem of MI estimation. The focus of this section is to demonstrate its potential for the problem of testing independence.
V-A The Collection of Partitions
The construction of this family begins by defining a “root” node that indexes the trivial partition (the whole joint space). The process continues by selecting a coordinate of the joint space on which to project the data (1D projection), ordering the projected (scalar) data, selecting the median of the ordered sequence, and finally creating a statistically equivalent partition of the cell using the selected coordinate axis and the median to split it. Two new cells are created, each containing almost half of the sample points (exactly half when the number of points is even).
These new cells are indexed as the left and right children of the “root”. The process continues by selecting a new coordinate, and the statistically equivalent binary splitting criterion is iterated in each of the two cells until a stopping condition is met. The stopping criterion adopted here imposes a minimum empirical probability on each created cell. Therefore, at the end of this growing binary process, a collection of nodes is produced, associated (by construction) with a binary tree, together with the nodes’ respective cells.
Using the Breiman et al. [43] convention, a binary tree is a collection of nodes: one node of degree 2 (the “root”), and the remaining nodes of degree 3 (internal nodes) or degree 1 (leaf or terminal nodes) [43]. In this convention, the full tree is the tree produced by the growing process above. If a subset of nodes forms a binary tree by itself, we say that it is a subtree of the full tree; moreover, if both have the same root, we say that it is a pruned version of the full tree. Finally, if we consider the leaf nodes of an arbitrary pruned tree, it is simple to verify that they induce a data-driven partition of the joint space where every cell has the desired product structure in Eq. (4) [39] (the interested reader is referred to [39] for a systematic explanation of this TSP construction).
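For concreteness, the following is a minimal sketch of the growing phase just described (a simplification of the construction in [39]: the cyclic choice of the projection coordinate, the parameter names, and the dictionary representation of the tree are ours):

```python
import numpy as np

def grow_tsp(points, n_total, min_prob=0.05, axis=0):
    """Grow a tree-structured partition by statistically equivalent binary splits.

    points : (m, d) array of the samples that fall in the current cell.
    n_total: total sample size n (so m / n_total is the empirical cell probability).
    min_prob: minimum empirical probability allowed per cell (stopping rule).
    The splitting coordinate alternates cyclically (a simplification; the
    construction in [39] selects the coordinate to project at each step)."""
    m, d = points.shape
    # Stop if a split would create a child with empirical probability < min_prob.
    if (m / 2.0) / n_total < min_prob:
        return {"leaf": True, "count": m}
    median = float(np.median(points[:, axis]))     # statistically equivalent split point
    left = points[points[:, axis] <= median]
    right = points[points[:, axis] > median]
    if len(left) == 0 or len(right) == 0:          # guard against massive ties
        return {"leaf": True, "count": m}
    nxt = (axis + 1) % d
    return {"leaf": False, "axis": axis, "median": median,
            "left": grow_tsp(left, n_total, min_prob, nxt),
            "right": grow_tsp(right, n_total, min_prob, nxt)}

rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 2))                  # i.i.d. joint samples (X_i, Y_i)
full_tree = grow_tsp(data, n_total=len(data))
```

Each split is axis-aligned, so every cell of the resulting tree is a product set, as required by Eq. (4).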
V-B Regularized (Empirical) Information Maximization
To address the info-max design objective in (13) using our (data-driven) tree-indexed family , we have to first find an expression for in (13) for any . The next result offers the following:
LEMMA 1
(Silva et al.[39, Th. 1]) Let be the family of pruned TSPs of size . Then, , , and any small , there is a threshold where
(14) |
where specifically
(15) |
Importantly, the bound in Lemma 1 is distribution-free [39] and a function of $n$ and the size of the family. From this distribution-free concentration result, the union bound tells us that the following events in
(16) |
with , happen simultaneously for any with probability at least (with respect to ). Equipped with these penalizations, i.e., in (16) for any , the empirical version of the info-max problem in (13) is
(17) |
Finally, it is important to note that both and determine the trees and TS partitions that we denote here by . In addition, if we include the thresholds , we denote by the rules induced by and in (7).
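Again in our own notation (a sketch of the form of the rule, since Eqs. (16)–(17) did not survive extraction), the regularized empirical criterion selects
$$\hat T_n \;\in\; \arg\max_{T \preceq T_{\mathrm{full}}}\; \Big\{\, D_{\pi_T}\big(\hat P_n \Vert \hat Q_n\big) \;-\; \epsilon_n\big(|T|,\delta_n\big) \,\Big\},$$
i.e., over all pruned versions $T$ of the full tree, the empirical (histogram-based) information of the induced partition $\pi_T$ is penalized by the distribution-free confidence bound of Lemma 1, which grows with the size of the tree; larger trees must therefore earn their additional cells with a sufficiently large empirical information gain.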
V-C Consistency: A Preliminary Analysis
The following results show that the TSP scheme in (17) is strongly consistent on the regret (Def.2).
LEMMA 2
Let us assume that has a density999 is absolutely continuous with respect to the Lebesgue measure in . in .
i) Under the conditions that for , is and is , we have that
(18) |
ii) Assuming that and are independent (), if we select , it follows that for any
(19) |
The proof is presented in Appendix B.
In addition to achieving consistency on the regret, stated in part i), part ii) of this result shows that under independence the regret achieves a super-polynomial velocity of convergence to zero with probability one. Therefore, the empirical information rate tends to zero faster than any polynomial order in $n$ with probability one, which is a remarkable capacity to detect this condition from data. In light of Theorem 1, we could have a range of admissible vanishing thresholds where strong consistency for detecting independence can be achieved (Def. 1). This is stated in the following result.
COROLLARY 1
VI Main Finite-Length Results
In this section, we focus on finite-length (non-asymptotic) results. We establish conditions where the solution of (17) nearly matches the performance of its equivalent oracle version in (13) with high probability. This non-asymptotic result is instrumental to show later one of the main findings of our work: the capacity of our scheme to detect independence with the structure of the partition.
THEOREM 2
Under the conditions that
for ,
is and is ,
we have that
i) under : for any there is such that , the equality
(20) |
holds with -probability .
ii) under : for any there is such that the bound
(21) |
holds with -probability .
The proof of this result is presented in Appendix C.
This result presents two optimality bounds for the information term of the TSP scheme under the two main hypotheses of our problem. Under independence, our regularization approach is capable of detecting this structure (with an arbitrarily high probability) in the sense that the solution of (17) reduces to the trivial partition. From this, we obtain the zero-regret condition. This means that the solution of (17) by itself (with no threshold) is capable of detecting independence.
On the other hand, under the alternative, we notice that it is only relevant to bound the under-estimation of the oracle information with the empirical (log-likelihood ratio) information. Limiting the analysis to the underestimation can be argued from the well-known observation that quantization (on average) provides a biased (under-)estimation of the information. More importantly for our problem, an overestimation of the oracle information does not increase the type 2 error when using the empirical statistic instead of the oracle under the alternative.
In summary, Theorem 2 indicates that with an arbitrarily high probability, the empirical TSP solution of (17) achieves the performance of the oracle solution that optimizes the upper bound on the regret derived in Eq. (13) over the TSP family.
Remark 1
In the proof of Theorem 2, we determine a set (see Eq.(39)) where if then (20) holds under and (2) holds under . Importantly, we obtain that from Lemma 1. Therefore, in Theorem 2, is fully determined by for any . Indeed, , which is well-defined since is . In particular, for the case being , it follows that is .
VI-A Detecting Independence with the Structure of
From (20), our data-driven partition has the capacity to detect independence (under ) in a stronger structural way: (with high probability) the data-driven partition collapses to the trivial cell, i.e., , with a finite sample size. Here, we improve this result by showing that the condition happens within a finite sample size almost surely.
Let us introduce the (random) time at which the partition process collapses to the trivial set :
Definition 3
(22) |
If this time is finite, it means that the data-driven partition equals the trivial solution for every sufficiently large sample size; in that case, this value expresses the last time the data-driven partition is different from the trivial solution. On the other hand, if this time is infinite, non-trivial partitions are observed infinitely often (i.o.) in the sequence. In general, we could have an infinite collapse time under independence, which would not contradict the capacity of the scheme to detect independence consistently under independence (see Corollary 1).
The following result shows that is finite with probability one under some mild conditions.
LEMMA 3
Let us assume that for and is , is . Then under the hypothesis that : .
The proof is presented in Appendix D.
Importantly, the condition implies that reaches eventually with probability one for any sequence . From this observation, we obtain the following result that improves the regime of parameters where a consistent test is achieved (established in Corollary 1).
THEOREM 3
If for , is , is , and is , then the scheme is strongly consistent for detecting independence (Def. 1).
The proof is presented in Appendix E.
Finally, if we focus on the admissible solution where , we have the following refined description for the distribution of under .
THEOREM 4
Under the assumption of Theorem 3, let us consider that . Under , we have that for any and for some universal constant .
The proof is presented in Appendix F.
VI-B Remarks about Theorem 4:
1: If we introduce
(23) |
as the first time when the binary process reaches and stays at , it is simple to verify that
for any (see Appendix E). Therefore, under and the assumptions of Theorem 4, we have as a corollary that for any sequence as long as is .
2: With this bound, we could determine (under ) a lower bound on the number of samples needed to guarantee that we detect independence structurally, i.e., reaching , and also with our schemes with a confidence probability . This critical number of samples is achieved for such that
and . This number scales like .
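As a worked illustration of Remark 2 with generic constants (the exact constants of Theorem 4 did not survive extraction), suppose the tail bound takes the exponential form $\mathbb{P}(T > n) \le C e^{-c\,n}$ for some $C, c > 0$. Requiring $\mathbb{P}(T \le n)\ge 1-\delta$ then amounts to $C e^{-c\,n} \le \delta$, i.e., $n \ge \tfrac{1}{c}\ln\tfrac{C}{\delta}$, so under this form the critical number of samples for structural detection of independence scales like $\ln(1/\delta)$ in the confidence level.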
VII Implementation
VII-A The Learning Algorithm
Let us briefly revisit the stages presented in Section V for the construction of our TSP scheme. Beginning with the partition, there are two phases: growing and pruning. In the growing phase, presented in Section V-A, we have a collection of tree-indexed partitions where and by construction for any . Here, we added a scalar parameter as a relevant design element for our numerical analysis. In the pruning phase, we implement the following complexity-regularized information maximization:
(24) |
where we include a regularization parameter.

VII-B Computational Cost
To analyze the cost of implementing (24), we consider each of its two phases individually.
Growing Phase: During the growing stage (see Section V-A), we obtain a collection of tree-indexed partitions , where . The full tree is obtained after a series of statistically equivalent binary partitions. The computational cost of the growing phase depends on the number of sub-partitions (or binary splittings) needed to obtain the full tree (which depends on the number of cells), and the cost of sorting the projected data of each cell and selecting the median of the ordered sequence to partition the cell. The number of cells in the full tree can be estimated considering that our design criterion states that for any , where and are design parameters of our method. Then, we should have at least samples per cell. Therefore, the number of cells associated with is bounded by
(25) |
On the other hand, the cost of sorting the projected data of each cell depends on the number of data points to be ordered. In particular, if a cell has $m$ points, then the sorting cost is $O(m\log m)$ [49]. As we reach deeper levels in the tree, the number of points per cell decreases. If we consider a level of depth $d$ (where $d=0$ corresponds to the root and the largest depth corresponds to the full tree), there are $2^d$ cells and each cell contains roughly $n/2^d$ points. Therefore, the order of the total number of operations during the growing phase can be approximated by summing the per-level sorting costs over the depth of the full tree.
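As a sketch of this accounting (our own derivation, under the simplifying assumption of exactly balanced splits): the sorting work at depth $d$ is $2^{d}\cdot O\big((n/2^{d})\log(n/2^{d})\big) = O(n\log n)$, so summing over the levels of the full tree (whose depth is of the order of the logarithm of the number of cells bounded in (25)) yields a total growing cost of $O(n\log n)$ times the depth, which is polynomial (in fact near-linear up to logarithmic factors) in $n$.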
Pruning Phase: The main regularized info-max problem in Eq. (24) is solved in this stage. The first term is additive (in the sense presented in [45]), whereas the penalty scales like , and, therefore, it is a sub-additive function of the size of the tree. In this scenario, Scott [45, Th. 1] showed that the family of minimum cost trees :
(26) |
is embedded, i.e., . Furthermore, [45, Th. 2] states that the solution of our problem in (24) corresponds to one of these embedded trees: i.e., for any , such that . This embedded property allows computationally efficient algorithms to be designed. Indeed, Scott [45] presented two algorithms to solve (24) with a worst case cost of . In our case we have that , then the pruning stage in (24) has a computational cost of , which is a polynomial (sub-linear) function of . The pseudo code of this stage is presented in Algorithm 1.
In summary, the construction of our partition (growing and pruning) has a computational cost that is on the sample size .
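Algorithm 1 and the embedded-family algorithms of Scott [45] are not reproduced here. As an illustration of the pruning principle only, the sketch below solves the simpler variant in which the penalty is additive (a fixed cost lam per leaf), for which a bottom-up dynamic program is exact; the sub-additive penalty in (24) requires the algorithms of [45]. The node fields p (empirical cell probability) and q (product-of-empirical-marginals probability of the same cell) are our own convention.

```python
import math

def prune_additive(node, lam):
    """Bottom-up pruning for an additive penalty (lam per leaf):
    maximize  sum_{leaves A} p(A) * log(p(A) / q(A))  -  lam * (#leaves).
    Exact only for this additive penalty; the sub-additive penalty in (24)
    requires the embedded-family algorithms of Scott [45]."""
    p, q = node["p"], node["q"]
    leaf_value = (p * math.log(p / q) if p > 0 and q > 0 else 0.0) - lam
    if node["leaf"]:
        return leaf_value, {"leaf": True, "p": p, "q": q}
    lv, lt = prune_additive(node["left"], lam)
    rv, rt = prune_additive(node["right"], lam)
    if lv + rv > leaf_value:                       # splitting pays for the extra leaves
        return lv + rv, {"leaf": False, "p": p, "q": q, "left": lt, "right": rt}
    return leaf_value, {"leaf": True, "p": p, "q": q}  # collapse this subtree to one cell

# Toy tree: under near-independence the split does not pay for the penalty.
toy = {"leaf": False, "p": 1.0, "q": 1.0,
       "left":  {"leaf": True, "p": 0.5, "q": 0.48},
       "right": {"leaf": True, "p": 0.5, "q": 0.52}}
value, pruned = prune_additive(toy, lam=0.01)
print(pruned["leaf"])   # True: the partition collapses to the trivial cell
```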

VIII Empirical Analysis
In this section, we present some controlled synthetic scenarios to evaluate the performance of our method. We begin by analyzing how the selection of parameters affects the capacity of our scheme to estimate MI (given the space limit, this preliminary analysis on MI estimation is relegated to the Supplemental Material, Appendix G-A). From these results and insights, we analyze the performance of the solutions for testing independence. For the rest of this section, we refer to the induced TS partition, MI estimator, and final test of our scheme.
VIII-A Testing Independence: Non-Asymptotic Empirical Analysis
We analyze the problem of testing independence with our data-driven test in (7). Here we focus on the non-asymptotic capacity of our framework to detect the two hypotheses. For this, we propose the following detection times (to simplify the notation, their dependency on the data and on the scalar parameters introduced in Sec. VII-A will be considered implicit):
(27) | |||
(28) |
and are random variables (rvs.) in determining when our test reaches and , respectively. Indeed, Theorem 3 (under specific conditions) tells us that under , and that under , . However, we are interested in the complete distribution of the rvs. and under and , respectively. In particular, we are interested in evaluating the pmf of , i.e., under (with ). Looking at these distributions, for any we can define
(29) | ||||
(30) |
(and ) indicates how many observations (sampling complexity) are needed to detect independence (and statistical dependency) with a confidence probability under (and ). In the context of our solution, for a given TSP scheme (function of the parameters , , and the sequence ), we will look at the trade-off expressed in the pair of induced tests when varying key parameters of our method.
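In practice, these sampling-complexity indicators can be estimated from independent simulation runs as empirical quantiles of the observed detection times; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def empirical_sampling_complexity(detection_times, delta):
    """Smallest n such that the empirical probability of detecting the correct
    hypothesis within n samples is at least 1 - delta (in the spirit of (29)-(30)).

    detection_times: detection time observed in each independent run
    (use np.inf for runs in which the test never settled)."""
    times = np.sort(np.asarray(detection_times, dtype=float))
    k = int(np.ceil((1.0 - delta) * len(times)))          # (1 - delta)-quantile index
    return times[k - 1]

# Illustrative detection times from 10 hypothetical runs of the test.
runs = [120, 250, 90, 400, 310, 150, 220, 180, 500, 275]
print(empirical_sampling_complexity(runs, delta=0.2))     # n needed to cover 80% of the runs
```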

VIII-A1 Simulation Setting
We consider a joint vector following a zero-mean Gaussian distribution, where the correlation coefficient determining the dependency is parametrized over a grid of values. Concerning the alternative hypothesis, we consider different levels of MI indexed by the correlation coefficient. We use i.i.d. realizations of the process under the different scenarios (independence and the alternatives) to run our test. By doing so, we obtain realizations of the rvs. in (27) and (28) and, with those realizations, we obtain empirical estimations of their pmfs and of the sampling complexities in (29) and (30), respectively (by the law of large numbers, these estimators are strongly consistent, and the number of runs used was shown to be sufficient for the purpose of the analysis).
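A sketch of this data-generation step is given below (the specific correlation values used in the experiments did not survive extraction, so those in the code are placeholders); for a zero-mean bivariate Gaussian with correlation $\rho$, the MI level indexing the alternative is available in closed form as $I(X;Y) = -\tfrac{1}{2}\ln(1-\rho^{2})$.

```python
import numpy as np

def sample_bivariate_gaussian(n, rho, rng):
    """n i.i.d. samples of a zero-mean bivariate Gaussian with correlation rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    xy = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    return xy[:, 0], xy[:, 1]

def gaussian_mutual_information(rho):
    """Closed-form MI of a bivariate Gaussian: I(X;Y) = -0.5 * ln(1 - rho^2)."""
    return -0.5 * np.log(1.0 - rho ** 2)

rng = np.random.default_rng(7)
for rho in [0.0, 0.3, 0.6, 0.9]:          # rho = 0 gives the null (independence)
    x, y = sample_bivariate_gaussian(5000, rho, rng)
    print(rho, gaussian_mutual_information(rho))
```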
VIII-A2 Parameter Selection for
To evaluate the sensitivity of our TSP scheme, we consider the following fixed sequences and in the admissible regime established in Theorem 3 (other configurations for these sequences can be explored within the admissible range declared in Theorem 3). Considering the growing-phase parameters and the regularization used to solve Eq. (24), we rely on a preliminary analysis on MI estimation to select a range of reasonable values (Supplemental Material, Appendix G-A). We proceed as follows: given fixed growing-phase parameters, we explore values of the regularization parameter to express the trade-off between the two sampling complexities under different data scenarios.
VIII-A3 Results
Figure 1 shows the curves expressing the trade-off between and for different parameter configurations of our method in the range for and mentioned above and for . In general, we notice that the effect of these two parameters is relevant in the trade-off expressed in the curves. Both and determine the growing phase (i.e., the creation of the full tree) and also the pruning phase (regularization in Eq.(24)) because is also a function of . Therefore, the effects of on the results are not easy to express theoretically.
Describing the curves, we could say in general that a bigger value of within the explored range reduces the full tree’s size in the growing phase. This effect is expressed in a better trade-off of the pair (, ) in the regime where , at the expense of a worse trade-off of (, ) when . A similar general effect in the trade-off (, ) is produced when decreasing the value of within the explored range. Our family of solutions offers a collection of different trade-offs between and by exploring different values of in our solution. Therefore, the selection of the best parameters should be a function of the regime of sample size we want to consider for the detection of . For the final comparison with alternative approaches, we decided to consider one of the curves with a less prominent decreasing transition in Figure 1 (obtained for and ).

Figure 2 illustrates the estimated (empirical) pmfs of and under () and (), respectively. These histograms illustrate how the distributions are affected by the regularization parameter and, with that, the trade-off in and observed in Figure 1. Specifically, a small value of concentrates the distribution of in a high number of observations, while concentrates on a lower number of observations. This leads to a high value of and a low value of for that specific . The opposite holds for higher values of .
Finally, we compared our TSP scheme, in terms of the trade-off in the sampling complexity pair, with three strategies presented in the literature [20]: the $L_1$ test, the log-likelihood test, and the Pearson-$\chi^2$ test. Each trade-off curve is generated by multiplying a well-selected range of values by the corresponding threshold for the independence detection [20]. For the parameters of the alternative tests, we consider regimes where these tests are strongly consistent (we use a number of partitions per dimension that satisfies the strong-consistency conditions of the $L_1$ and the log-likelihood tests established in [20]; although there is no explicit proof of strong consistency for the Pearson-$\chi^2$ test, we use a regime of values according to the range suggested in [20]). Within that, we selected parameters that offer a smooth transition between their sampling complexity pairs’ trade-off, as we did to select the parameters of our method (more information is presented in the Supplemental Material, Appendix G-B).
Figure 3 presents these curves for our method and the alternative approaches. These curves express the trade-off between the ability to detect independence and non-product probabilistic structure ( and ) from data. In general, we could say that our method shows better results in almost all explored scenarios () where the TSP trade-off curves dominate the others. There is one exception to this general trend in the regime of for lower values of correlation for . However, in all the other regimes, our method performs better than the alternatives. In particular, it performs better than its closest relative, which is the log-likelihood test that uses non-adaptive partitions [20]. Interestingly, our method’s performance improvements increase with the magnitude of (for ). This shows that our approach’s advantage is more prominent when the alternative scenario has a higher level of mutual information.

VIII-B Multidimensional Analysis
We conclude this analysis by evaluating our method in higher dimensions. As in the previous analysis , we need to select the parameters and of our scheme. In this multi-dimensional context, we select a value of that offers a smooth trade-off curve according to the experimental results in Figure 1. However, as has a stronger impact on the resulting number of partitions, rather than choosing it individually for every dimension , we propose as a heuristic rule for selecting according to , where and .
This rule and its parameters were designed on the principle that every dimension of the joint space should be explored at least once in the growing phase. This criterion ensures that our data-driven scheme explores every coordinate of in the pursuit of detecting relevant statistical dependencies under . For a dimension , the full tree needs to have at least leaves (cells). For this, a basic condition is that is lower bounded by (our stopping criterion requires at least one sample per cell), which implies that to explore all the dimensions we need at least i.i.d. samples to meet this requirement. Then assuming the non-trivial regime , we selected and in our rule with the objective that is met independent of the value of (as long as ). This happens if becomes smaller when increases, which happens in our rule for and .
Figure 4 shows the number of cells of the full-tree, namely , for two settings: ( and ) in the left panel, and ( and ) in the right panel. The curves are shown across for three scenarios of sample-size , and . We also include the curve (dashed line) to indicate the critical lower bound. Both settings satisfy the proposed criterion of partitioning each dimension at least once (points above the diagonal) in the whole regime when . The horizontal line shows the points where the number of partitions generated by our method is equal to the number of samples (which happens when ). Interestingly, when approaches the critical condition , our full tree meets a scenario where every cell has one sample. Of the two settings, the selection on the right panel is more conservative, and it is the one we will use for the rest of this analysis. This selection partitions each dimension just above the minimum requirement (for the values of where ) and achieves the critical condition as increases.
Using this rule and , we consider a zero-mean joint Gaussian vector in , where and and, under , a diagonal co-variance of the form (with the Kronecker’s delta). For these results, we consider i.i.d samples meeting the condition that for . For the alternative, we consider for (), and we choose the values of for such that the mutual information between and is the same as that obtained for with . Therefore, for the alternative, we create three scenarios of constant MI across dimensions. This experimental design allows us to isolate the effect of dimensionality from the discrimination of the underlying task, which is known to be proportional to the value of MI under [32, 35, 36].
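One way to realize this matching (our own derivation, under the assumption that every coordinate pair $(X_i, Y_i)$ shares the same correlation $\rho_d$; the paper's exact choice did not survive extraction): for this coordinate-wise paired Gaussian model the mutual information factorizes over coordinates, $I(X;Y) = -\tfrac{d}{2}\ln(1-\rho_d^{2})$, so matching the one-dimensional level $-\tfrac{1}{2}\ln(1-\rho_1^{2})$ amounts to choosing $\rho_d = \sqrt{1-(1-\rho_1^{2})^{1/d}}$, which decreases with $d$ while keeping the MI of the alternative constant.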
Figure 5 presents the trade-off between for the following values of (under ) and for different dimensions . First, we confirm that our scheme detects both hypotheses with a finite sample size as our theoretical results predict. Here, we observe that by increasing the dimensions of the problem the curves show sharper transitions (beyond a critical value for ) and overall the problems become (in general) more difficult: for a given value of , its respective increases with the dimension . In other words, for a relatively good capacity to detect independence (when ), it is more challenging for our test to detect the alternative as is higher. This performance trend could be attributed to the observation that in higher dimensions it is more challenging for a non-parametric framework to detect salient features under the alternative and, consequently, more observations (evidence) are required to correctly detect the alternative with probability . On the other hand, if we fix a dimension, let us say , the higher the MI the better the trade-off between (), which is a trend consistent with our observation that MI is an indicator of the task discrimination, and consequently, the performance of our scheme is sensitive to the level of MI under the alternative.
IX Summary and Final Discussion
We present a novel framework to address the problem of universal testing of independence using data-driven representations. Consistency and finite length (non-asymptotic) results express our solution’s capacity to adapt its representations to the problem’s sufficient statistics. This capacity is particularly evident under the hypothesis of independence. Precise results show that our scheme detects this scenario – collapsing to the trivial partition with one cell for representing the joint space – with a finite sample size. On the experimental side, our solution offers a computationally efficient implementation.
In a controlled experimental setting, we show that our scheme offers a clear advantage compared with alternative non-parametric solutions that do not adapt their representations to the data. These results confirm the critical role that data-driven partitions could play in implementing a universal test of independence. Further analysis is needed to fully uncover this method’s potential for machine learning (ML) problems and other applications. We anticipate that ML algorithms could benefit from both the representations obtained from our solutions (as an approximator of sufficient statistics) and our solution’s capacity to detect independence with a finite number of samples.
IX-A Limitations and Future Work
Our distribution-free results in Section VI provide a range of admissible parameters for our test; however, they do not provide a criterion for selecting them in a specific problem setting. To address this practical aspect, we select a set of parameters from empirical observations and some basic heuristic rules. It remains an open research avenue to study conditions under which an optimal (or nearly optimal) selection of parameters for our test can be made under specific assumptions. Along these lines, it is relevant to further investigate specific classes of models and, based on such model assumptions, find a more constrained range of good parameters with improved finite-length performance results. In this vein, there are interesting ideas and results on adaptivity in statistical learning [50] and universal source coding [51, 52] that are worth exploring.
IX-B A Recent Related Work
The authors in [34] have recently introduced a non-parametric test with a similar oracle-based design principle. They proposed a data-adaptive weight strategy to approximate the ideal weights of an NP statistic. That strategy has an interesting connection with our design approach in Section IV because they adapt key parameters of their method to approximate a best (oracle) statistic. Regarding the differences, the authors in [34] address the independence test by performing a binary expansion filtration of the observations and adapting the weight matrix of a quadratic form. Instead, our work addresses the independence test by estimating the mutual information (our statistic) through an adaptive tree-structured partition of the observation space. Therefore, although both approaches share a similar learning principle, the methods and components used to adapt them to the sample are entirely different.
IX-C Statistical Significance Analysis
We want to conclude with a discussion about the statistical significance of our test. Given our test , and assuming , the statistical significance is expressed by . Importantly, from the proofs of Theorem 2 and Theorem 3, we have the following:
PROPOSITION 2
This result implies that the statistical significance of our test is fully controlled by the sequence , which is one of our design parameters. Furthermore, the proof of this result shows that this bound is uniform over the class of models in . Direct corollaries of Proposition 2 are the following:
-
•
adding the assumptions of Theorem 3 and under , it follows that . Importantly, this convergence to zero is uniform for every model in . [Uniform vanishing condition on ]
-
•
adding the assumptions of Theorem 4 and under , it follows that for any , . Therefore, we achieve an exponentially fast vanishing error under . Importantly, this velocity is obtained uniformly (distribution-free) over .
These uniform bounds on the statistical significance are obtained under independence. Under the alternative, we have from strong consistency (Definition 1) that the power of the test tends to one as the sample size tends to infinity (this argument is presented in the Supplemental Material, Appendix G-D.2).
X Acknowledgment
This manuscript is based on work supported by grants of CONICYT-Chile, Fondecyt 1210315, and the Advanced Center for Electrical and Electronic Engineering, Basal Project FB0008. We thank the anonymous reviewers for providing valuable comments and suggestions that helped to improve the technical quality and organization of this work. Finally, we thank Diane Greenstein for editing and proofreading all this material.
Appendix A Proof of Theorem 1
Proof:
In general, first we know that if reaches eventually with probability one, it is equivalent to saying that the process does not visit the event infinitely often (i.o.). This observation reduces to verify the following
(31) |
Equivalently, saying that reaches eventually with probability one reduces to
(32) |
For the rest of the proof, we use to denote and to denote .
Under , we have from the third hypothesis that tends to with probability one, which means that for any
(33) |
where . Under , let us consider the error event of by
If (by definition of the rule ), it follows that (see (7)). Then from definition of , for any it follows that . This implies that which from (33) implies that
(34) |
On the other hand, under the assumption that (), let us choose . Then, from the second assumption (see Definition 2), it follows that converges to something strictly greater than with probability one. This formally means that
(35) |
The error event of under in this case is
(36) |
Finally using the condition that tends to zero with ( is ), we have that
(37) |
eventually in , which implies from (35) that
(38) |
∎
Appendix B Proof of Lemma 2
We use the following results from [39].
LEMMA 5
Appendix C Proof of Theorem 2
Proof:
If we define the event
(39) |
for all , we have that the conditions stated on and are the weakest to obtain from (15) that . This last condition is crucial to being able to apply Lemma 1 in . In fact, by definition of in (16) and the condition expressed in (15).
Importantly under (i.e., ), for , we can consider that , because for any . Consequently, for any , it follows that
(40) | ||||
(41) |
for any . The first inequality in (C) is from the definition of in (17) and the second inequality in (41) is from the fact that if (see (39)), then
(42) |
On the other hand, if , then
(43) |
At this point, we use the independence assumption191919Under , we have that for any . and the two inequalities (41) and (C) to obtain that for any :
(44) |
Finally using that , it is simple to verify that for any the trivial tree (with one cell) is a solution for (17), i.e., . Then from (44), , which concludes the argument using that and the assumption that is .
Appendix D Proof of Lemma 3
Proof:
Let us use the definition of the typical set introduced in (39) and . Under the assumption of this result, we know that . In addition, in the proof of Theorem 2, it is shown that if , then , which means that . Therefore, we have that202020Here we use the notation to make explicit that the partition is data-driven.
(45) |
Using the definition of , we have that
(46) | ||||
(47) | ||||
(48) | ||||
(49) |
The equality in (46) comes from definition in (22). The inequality in (48) comes from the sub-additivity of and the bounds in (49) follow from (45) and the construction of . Finally, using the assumption that in (49), we have that
(50) |
∎
Appendix E Proof of Theorem 3
Proof:
Under , the assumptions on and are within the admissible range stated in Theorem 2 part i). Consequently, we have consistency on the regret, i.e., , -almost surely. This result is sufficient to obtain that reaches eventually with probability one, from the arguments presented in the proof of Theorem 1.
Under , it is useful to define the following last exit time associated with the detection of independence:
(51) |
Saying that reaches eventually with probability one is equivalent to the condition that
(52) |
from the arguments presented in the proof of Theorem 1. At this point, it is important to revisit the definition of in (22), where if for some we have that ; this implies that (details on Appendix C), and therefore, independent of (see Eq.(7)). Consequently, we have that
(53) |
for any , and it follows that
(54) | ||||
(55) | ||||
(56) |
The identity in (E) follows from the definition in (E), and equations (55) and (56) follow from (E) and (46), respectively. The argument concludes from Lemma 3, as we know that under the assumptions stated on and , , which implies the result in (52) from (56). ∎
Appendix F Proof of Theorem 4
References
- [1] C.-J. Ku and T. L. Fine, “Testing for stochastic independence: Application to blind source separation,” IEEE Transactions on Signal Processing, vol. 53, no. 5, pp. 1815–1826, 2005.
- [2] P. Comon, “Independent component analysis: a new concept?” Signal Processing, vol. 36, pp. 287–314, 1994.
- [3] J. F. Cardoso, “Blind separation of instantaneous mixtures of non-stationary sources,” IEEE Transactions on Signal Processing, vol. 49, no. 9, pp. 1837–1848, 2001.
- [4] J. Cheng, K. K. Tan, Z. Yi, and S. H. and, “Convergence analysis of a class of hyvarinen–oja’s ica learning algorithms with constant learning rates,” IEEE Transactions on Signal Processing, vol. 57, no. 5, pp. 1811–1824, 2009.
- [5] B. Bloem-Reddy and Y. W. Teh, “Probabilistic symmetry and invariant neural networks,” 2019, https://arxiv.org/abs/1901.06082.
- [6] J. Runge, “Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information,” in Proceedings of Machine Learning Research, 2018.
- [7] A. Bellot and M. van der Schaar, “Conditional independence testing using generative adversarial networks,” in Conference on Neural Information Processing Systems, 2019.
- [8] R. Sen, A. T. Suresh, K. Shanmugam, A. G. Dimakis, and S. Shakkottai, “Model-powered conditional independence test,” in Conference on Neural Information Processing Systems, 2017.
- [9] C. Ku and T. Fine, “A bayesian independence test for small datasets,” IEEE Transactions on Signal Processing, vol. 54, no. 10, pp. 4026 – 4031, 2006.
- [10] D. Reshef, Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti, “Detecting novel associations in large data sets,” Science, vol. 334, no. 6062, pp. 1518–1524, 2011.
- [11] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf, “Measuring statistical dependence with Hilbert-Schmidt norms,” in Algorithmic Learning Theory: 16th International Conference. Springer, 2005, pp. 63–78.
- [12] A. Gretton and L. Györfi, “Nonparametric independence tests: Space partitioning and kernel approaches,” in Algorithmic Learning Theory: 19th International Conference. Springer, 2008, pp. 183–198.
- [13] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola, “A kernel statistical test of independence,” Advances in Neural Information Processing Systems, pp. 585–592, 2007.
- [14] N. Pfister, P. Bühlmann, B. Schölkopf, and J. Peters, “Kernel-based tests for joint independence,” arXiv: Statistics Theory, no. 1603.00285, 2016.
- [15] G. Székely, M. Rizzo, and N. Bakirov, “Measuring and testing dependency by correlation of distances,” The Annals of Statistics, vol. 35, no. 6, pp. 2769–2794, 2007.
- [16] G. J. Székely and M. L. Rizzo, “Brownian distance covariance,” The Annals of Applied Statistics, vol. 3, pp. 1236–1265, 2009.
- [17] G. J. Székely, “The distance correlation t-test of independence in high dimension,” Journal of Multivariate Analysis, vol. 117, pp. 193–213, 2013.
- [18] S. Zheng, N. Shi, and Z. Zhang, “Generalized measures of correlation for asymmetry, nonlinearity, and beyond,” Journal of the American Statistical Association, vol. 107, pp. 1239–1252, 2012.
- [19] X. Wang, B. Jiang, and J. Liu, “Generalized r-squared for detecting non-independence,” Biometrika, vol. 104, pp. 129–139, 2017.
- [20] A. Gretton and L. Györfi, “Consistent nonparametric tests of independence,” Journal of Machine Learning Research, vol. 11, pp. 1391–1423, 2010.
- [21] L. Ma and J. Mao, “Fisher exact scanning for dependency,” Journal of the American Statistical Association, vol. 114, no. 525, pp. 245–258, 2019.
- [22] K. Zhang, “BET on independence,” Journal of the American Statistical Association, vol. 114, no. 528, pp. 1620–1637, 2019.
- [23] R. Heller, Y. Heller, S. Kaufman, B. Brill, and M. Gorfine, “Consistent distribution-free k-sample and independence tests for univariate random variables,” Journal of Machine Learning Research, vol. 17, pp. 1–54, 2016.
- [24] R. Heller and Y. Heller, “Multivariate tests of association based on univariate tests,” Advances in Neural Information Processing Systems, vol. 29, pp. 208–216, 2016.
- [25] M. Menéndez, J. Pardo, L. Pardo, and K. Zografos, “On tests of independence based on minimum -divergence estimator with constraint: An application to modeling DNA,” Computational Statistics and Data Analysis, vol. 51, pp. 1100–1118, 2006.
- [26] F. Han, S. Chen, and H. Liu, “Distribution-free tests of independence in high dimensions,” Biometrika, vol. 104, pp. 813–828, 2017.
- [27] J. Dauxois and G. M. Nkiet, “Nonlinear canonical analysis and independence tests,” The Annals of Statistics, vol. 26, no. 4, pp. 1254–1278, 1998.
- [28] A. Dionisio, R. Menezes, and D. A. Mendes, “Entropy-based independence test,” Nonlinear Dynamics, vol. 44, pp. 351–357, 2006.
- [29] J. Suzuki, “An estimator of mutual information and its application to independence testing,” Entropy, vol. 18, no. 109, pp. 1–18, March 2016.
- [30] I. Kontoyiannis and M. Skoularidou, “Estimating the directed information and testing for causality,” IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 6053–6067, 2016.
- [31] L. Devroye and G. Lugosi, Combinatorial Methods in Density Estimation. Springer-Verlag, New York, 2001.
- [32] S. Kullback, Information Theory and Statistics. New York: Wiley, 1958.
- [33] H. Shi, M. Drton, and F. Han, “Distribution-free consistent independence tests via center-outward ranks and signs,” Journal of the American Statistical Association, pp. 1–16, 2020.
- [34] K. Zhang, Z. Zhao, and W. Zhou, “Beauty powered beast,” arXiv preprint arXiv:2103.00674, 2021.
- [35] I. Csiszár and P. Shields, Information Theory and Statistics: A Tutorial. Now Publishers, 2004.
- [36] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley Interscience, New York, 2006.
- [37] G. Lugosi and A. B. Nobel, “Consistency of data-driven histogram methods for density estimation and classification,” The Annals of Statistics, vol. 24, no. 2, pp. 687–706, 1996.
- [38] G. A. Darbellay and I. Vajda, “Estimation of the information by an adaptive partition of the observation space,” IEEE Transactions on Information Theory, vol. 45, no. 4, pp. 1315–1321, 1999.
- [39] J. F. Silva and S. Narayanan, “Complexity-regularized tree-structured partition for mutual information estimation,” IEEE Transactions on Information Theory, vol. 58, no. 3, pp. 1940–1952, March 2012.
- [40] ——, “Non-product data-dependent partitions for mutual information estimation: Strong consistency and applications,” IEEE Transactions on Signal Processing, vol. 58, pp. 3497–3511, 2010.
- [41] ——, “Information divergence estimation based on data-dependent partitions,” Journal of Statistical Planning and Inference, vol. 140, no. 11, pp. 3180–3198, November 2010.
- [42] R. E. Blahut, “Hypothesis testing and information theory,” IEEE Transactions on Information Theory, vol. it-20, no. 4, pp. 405–417, July 1974.
- [43] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth, 1984.
- [44] C. Scott and R. D. Nowak, “Minimax-optimal classification with dyadic decision trees,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1335–1353, April 2006.
- [45] C. Scott, “Tree pruning with subadditive penalties,” IEEE Transactions on Signal Processing, vol. 53, no. 12, pp. 4518–4525, 2005.
- [46] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag, 1996.
- [47] M. Gonzalez and J. F. Silva, “Data-driven representations for testing independence: A connection with mutual information estimation,” in IEEE International Symposium on Information Theory, 2020, pp. 1–5.
- [48] J. F. Silva and S. Narayanan, “Universal consistency of data-driven partitions for divergence estimation,” in IEEE International Symposium on Information Theory, June 2007.
- [49] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, 3rd ed. The MIT Press, 2009.
- [50] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. A. Wellner, Efficient and Adaptive Estimation for Semiparametric Models. Springer-Verlag, New York, 1998.
- [51] N. Merhav and M. Feder, “Universal prediction,” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2124–2147, October 1998.
- [52] D. Bontemps, S. Boucheron, and E. Gassiat, “About adaptive coding on countable alphabets,” IEEE Transactions on Information Theory, vol. 60, no. 2, pp. 808–821, 2014.
Appendix G Supplemental Material
In Section G-A, the estimation of mutual information (MI) based on our tree-structured partition (TSP) approach is evaluated. Section G-B details the parameter selection for the test, the log-likelihood test, and the Pearson- test; these results were used to select the final curves presented in Figure 3. Section G-C complements the empirical results of our method with experiments on a heavy-tailed distribution and a multi-modal distribution. Finally, Section G-D presents some results regarding the statistical significance of our test.
G-A Preliminary Analysis of Mutual Information Estimation
Analyzing our tree-structured partition (TSP) scheme’s capacity to estimate MI in the non-asymptotic regime is important because it provides an indication of the trade-off between the two types of errors that we will encounter when analyzing the performance of its induced test. Here, we present a basic analysis of the effects of the parameters of our scheme: and (), and within the admissible regime provided by our results in Section VIII of the main content for and . For this analysis, we consider a joint vector following a zero-mean Gaussian distribution in where the correlation coefficient determining is parametrized by . We evaluate empirically how the estimator approaches the true value as grows and the effects of the mentioned parameters under () and (). For illustration purposes, we will consider as a relatively high MI regime under . Similar analyses were conducted under different levels of correlation (under ), which are not reported here for brevity.
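As a reference for this analysis, the ground-truth MI of the bivariate Gaussian model has a closed form, so the estimation error can be measured exactly. The following minimal sketch (in Python, with illustrative function names and correlation values that are not taken from the main text) computes the true MI for a given correlation coefficient and draws the i.i.d. samples that would feed any MI estimator.

```python
import numpy as np

def gaussian_mi(rho):
    """True mutual information (in nats) of a zero-mean bivariate Gaussian
    with correlation coefficient rho: I(X;Y) = -0.5 * log(1 - rho^2)."""
    return -0.5 * np.log(1.0 - rho ** 2)

def sample_bivariate_gaussian(rho, n, seed=0):
    """Draw n i.i.d. pairs (X, Y) from the zero-mean Gaussian model."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    return rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)

# rho = 0 corresponds to independence (MI = 0); a large rho gives a high-MI regime.
for rho in (0.0, 0.5, 0.9):
    print(f"rho = {rho}: I(X;Y) = {gaussian_mi(rho):.4f} nats")
```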

To begin, Figure 6 shows the effects of fixing and . Under , we look at the performance of the full tree , which provides the baseline capacity of our regularized solution in Eq.(17) of the main content. Figure 6 shows that as increases (approaching its least conservative value, associated with the largest full tree in the growing phase), the capacity of our scheme to estimate the true MI under improves, as expected. In contrast, the estimation of the true MI under is better for smaller values of , although beyond samples, the specific value of is not that critical under .
Similarly, Figure 7 presents the effect of the scaling factor in and its impact on the estimation of MI in the two aforementioned regimes and . For these plots, we fix and , i.e., we focus on the performance of the full tree. In the finite-sampling regime , the effect of is relevant in terms of the tendency to over-estimate under and to under-estimate or over-estimate under . The effect of on the performance of the full-tree solution is more prominent than the effect of in both scenarios ( and ).

Finally, we look at the effects of the scalar in Eq.(24), for which we select and . Figure 8 shows different selections of this factor in the two scenarios already introduced ( and ) and . We present three sets of figures: one plotting the plug-in estimations , another the regularized MI estimations , and finally the size of the optimal tree . In general, we observe that the best results under are obtained when is very small, on the order of . This finding is interesting because it shows that, in general, the theoretical confidence interval used to obtain the regularization term in Eq.(16) is very conservative. This means that in the non-asymptotic regime, this term needs to be corrected to obtain good results. It is worth noting that this scaling factor is fixed (independent of ) and does not affect the theoretical properties of our scheme: in particular, the structural capacity of our TSP scheme to detect independence under presented in Section VI. This capacity is observed in Figure 8, which presents the size of . Here, the trivial tree (with one cell) is obtained in a finite number of steps, as predicted by our results in Section VI.A. On the other hand, in the regime of high MI, we observe that the tree’s size is comparable to the size of the full tree (obtained for ). Importantly, both of these results support the adaptive nature of our data-driven solution.
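To make the role of this scaling factor concrete, the following sketch shows the generic form of the complexity-penalized selection discussed above: among the candidate subtrees produced in the growing/pruning phase, pick the one maximizing the plug-in MI minus a scaled regularization term. The exact penalty is given by Eqs.(16) and (24) of the main content; here `plug_in_mi`, `penalty`, and `c` are placeholders, so this is only an illustrative sketch and not the precise implementation.

```python
def select_regularized_tree(candidate_trees, plug_in_mi, penalty, c):
    """Generic complexity-regularized selection (illustrative sketch).

    candidate_trees : iterable of subtrees from the growing/pruning phase
    plug_in_mi(t)   : empirical (plug-in) MI of the partition induced by t
    penalty(t)      : confidence-interval-based regularization term for t
    c               : fixed scaling factor on the penalty; a small c makes
                      the criterion less conservative, as observed in Figure 8
    """
    return max(candidate_trees, key=lambda t: plug_in_mi(t) - c * penalty(t))
```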

G-B Complementing Results for Figure 3
To select the parameters of the alternative methods (the test, the log-likelihood test, and the Pearson- test), we implemented them and explored their performance in the admissible range where those approaches are known to be strongly consistent [20]. In particular, we consider the number of partitions per dimension as with (details in [20]), which satisfies the strong-consistency conditions for the and the log-likelihood tests established in [20]. Although there is no explicit proof of strong consistency for the Pearson- test, we use a regime of values for according to the range suggested in [20].
Under the experimental setting presented in Section VIII, Figure 9 reports different performance curves (the trade-off between and ) obtained when selecting different parameters for these methods. From these curves, the parameters selected for each method were those offering a smooth transition in the trade-off between their sampling complexity pairs. This criterion is consistent with the one used to select the parameters of our method (see Figure 1). Interestingly, for the three methods, this was obtained consistently for in the range of . Importantly, this selection offers reasonable solutions in all the regimes of correlation presented in Figure 9.
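For reference, the non-adaptive log-likelihood statistic used by these baselines can be sketched as follows: a product partition with a fixed number of bins per axis is built (quantile bins here, one common non-adaptive choice), and the joint histogram is compared against the product of its marginals. The function below is only illustrative of the statistic analyzed in [20]; the bin-edge convention and the number of bins per dimension are assumptions and should follow the consistency conditions discussed above.

```python
import numpy as np

def log_likelihood_statistic(x, y, m):
    """Log-likelihood (G) statistic on a non-adaptive m-by-m product partition."""
    # Quantile (equal-probability) bin edges per axis.
    x_edges = np.quantile(x, np.linspace(0.0, 1.0, m + 1))
    y_edges = np.quantile(y, np.linspace(0.0, 1.0, m + 1))
    i = np.clip(np.searchsorted(x_edges, x, side="right") - 1, 0, m - 1)
    j = np.clip(np.searchsorted(y_edges, y, side="right") - 1, 0, m - 1)
    counts = np.zeros((m, m))
    np.add.at(counts, (i, j), 1.0)
    n = float(len(x))
    # Expected counts under independence: outer product of marginal counts / n.
    expected = counts.sum(axis=1, keepdims=True) @ counts.sum(axis=0, keepdims=True) / n
    mask = counts > 0
    return 2.0 * np.sum(counts[mask] * np.log(counts[mask] / expected[mask]))
```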

G-C Testing the Method on Different Distributions
To complement the experimental results in Section VIII.A (obtained for Gaussian distributions in the scalar case), we explore the performance of our method against non-adaptive histogram-based tests by studying the detection-time trade-off curves on data samples generated by two important families of distributions: a heavy-tailed distribution (Student’s t) and a multi-modal distribution.
Figure 10 presents the trade-off between for different values of correlation (under ) using samples generated by a Student’s t distribution with degrees of freedom (a heavy-tailed distribution). The curves show that the performance of our test is superior to the other methods in all correlation scenarios indexed by . The performance gain is significant in the scenarios of high correlation and for a relatively high independence-detection time (i.e., when ). For the non-adaptive tests, the relatively flat (insensitive) trade-off observed in all the explored scenarios is intriguing: the curves are close to a horizontal line. This flat trade-off behaviour in is observed even when considering a significantly larger number of cells (using ). In fact, this selection () produces a number of cells that is significantly higher than the number of cells obtained with our adaptive method (more on this analysis below). Therefore, this observation shows that our scheme, with fewer data-driven cells, produces a better performance trade-off in general. In conclusion, these performance curves show the benefits of the adaptive partitioning in our test, which can deal with heavy-tailed distributions significantly better than tests that use non-adaptive partitions (binning) of the space.
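For reproducibility of this kind of experiment, one simple way to generate heavy-tailed pairs with tunable dependence is sketched below; the paper’s exact construction is not restated here, so this generator and its parameter names are only an assumption for illustration. Note that in the correlated case a bivariate t couples the coordinates both through the correlation and through the shared scale variable, so the independent scenario is produced with two separate univariate t draws.

```python
import numpy as np

def sample_heavy_tailed_pairs(rho, nu, n, seed=0):
    """Heavy-tailed (X, Y) pairs with nu degrees of freedom (illustrative)."""
    rng = np.random.default_rng(seed)
    if rho == 0.0:
        # Independent scenario: two independent Student-t variables.
        return np.column_stack([rng.standard_t(nu, size=n),
                                rng.standard_t(nu, size=n)])
    # Dependent scenario: bivariate t = correlated Gaussian / shared chi-square scale.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    w = rng.chisquare(nu, size=n) / nu
    return z / np.sqrt(w)[:, None]
```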

Figure 11, on the other hand, presents the trade-off between and using samples generated by a mixture of four Gaussians (a multi-modal distribution). Following the setting introduced in Gretton (2008), the dependence level is produced by rotating counter-clockwise (with a magnitude ) a 2D random vector with the following Gaussian mixture density:
(60)
where denotes the bi-variate Gaussian density function with mean and covariance matrix , evaluated at . In this case, , , , , . The independent scenario is obtained by choosing , while three different dependent scenarios () are explored by choosing . Similar to the results presented in Figure 3 (the Gaussian case), we observe that our TSP test outperforms the other tests in the regime of , while the other tests present better performance in the opposite regime. It is important to notice that, unlike the Gaussian case, the number of partitions of each alternative test is determined by , which induces a notably higher number of cells than those obtained by the parameter configuration of our TSP test. It is noteworthy that a value of induces partitions for the alternative tests, while and induce just partitions for the tree without pruning, i.e., the final number of partitions will be lower than . This demonstrates the efficiency of the proposed method, which can represent the statistical dependence of the random variables with a considerably lower number of cells because of its adaptive capacity. Due to the multi-modal nature of the distribution used in these experiments, we present an alternative configuration of the TSP test with and (denoted as TSP-long in Figure 11), inducing more adaptive cells (but still fewer than the other tests) in order to capture the statistical dependence of the multi-modal distribution in a less conservative way. We observe a considerable benefit from this new choice in the regime when , with a slight to moderate deterioration in the opposite regime.
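A minimal generator for this rotated-mixture setting is sketched below. Since the specific means, covariances, and weights of Eq.(60) are not reproduced in this copy, they are passed as arguments and must be taken from the main text; the rotation by an angle theta is what induces dependence between the two coordinates when the unrotated mixture has independent marginals.

```python
import numpy as np

def sample_rotated_mixture(theta, n, means, covs, weights, seed=0):
    """2D Gaussian-mixture samples rotated counter-clockwise by angle theta."""
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(means), size=n, p=weights)   # mixture component per sample
    samples = np.array([rng.multivariate_normal(means[c], covs[c]) for c in comp])
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return samples @ rot.T                             # rotate each (x, y) pair
```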

To complement this analysis, Figure 12 shows the partitions generated by the non-adaptive schemes and by our adaptive partition scheme. For this, samples were considered for each of the three distributions explored (Gaussian, Student’s t, and Gaussian mixture). The adaptive attribute of our scheme is observed and contrasted with the non-adaptive partitions, which is particularly evident in the multi-modal case. We observe that our scheme, as expected, represents the distribution of samples more effectively (with a lower number of cells) than the non-adaptive schemes. Indeed, non-adaptive partitions use a considerably higher number of cells to represent the statistical dependence observed in the data, which helps to explain the difference in performance with our method. In particular, there is a clear issue observed for the non-adaptive schemes: most of the cells have almost no samples, which is an ineffective way to represent the information of the data. This could significantly affect the discrimination ability of these non-adaptive methods for small sample sizes (see Figure 12). From our perspective, this is one of the primary explanations for the erratic (insensitive) flat behaviour observed in many of the trade-off curves presented in Figures 10 and 11, and, overall, for the better (and more expressive) performance trade-off offered by our approach.

G-D Statistical Significance Analysis in Section IX.C.
G-D1 Proof of Proposition 2
Proof:
Let us consider the set
(61)
i.e., the points in the space where our test of length decides . Under , the probability of rejecting the null is given by:
(62)
where denotes the probability of induced by (i.e., the -fold product distribution). Using the argument presented in the proof of Theorem 3 (see Eq.(51)), we show that when our data-driven partition achieves the trivial case (the partition with one cell), then decides , independent of the value of . This structural detection of independence (see Section VI.A for more details) implies that:
(63)
because implies that independent of the magnitude of .
At this point, we use the assumptions of Theorem 2. In the proof of Theorem 2 (see Appendix III), it was shown that the set (see Eq.(39) in Appendix III) satisfies that if then (i.e., ). Consequently, we have that:
(64)
Furthermore, the proof of Theorem 2 in Appendix III shows that , where this bound is distribution-free (independent of ) because it relies on the distribution-free concentration inequality in Lemma 1 (see the argument in Appendix III). Therefore, from (63) and (64), we conclude that
(65)
where this upper bound is valid for any model . ∎
G-D2 General Connection with Strong Consistency
There is a connection between strong consistency (in Definition 1) and the statistical significance (and also the power) of a test that can be obtained from the proof of Theorem 1 in Appendix I. Using the notation of Theorem 1, if we have a general test , the test is consistent if, under , the binary process reaches eventually, almost surely, which is equivalent to stating that:
(66)
where . Using this notation, the statistical significance of under is . Therefore, under , we have that:
(67)
Therefore, if the test is strongly consistent, it follows from (66) and the bound in (67) that the statistical significance of the test tends to zero for any model where and are independent ().
Concerning the power, under we have that the binary process reaches eventually, almost surely (under the assumption of consistency), which is equivalent to stating that:
(68)
Therefore, from (68), the power of the test tends to one as tends to infinity, for any model where and are not independent ().
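The two steps above can be summarized in the following sketch, written with generic test notation $\phi_n \in \{0,1\}$ (deciding the alternative when $\phi_n = 1$), since some of the original symbols do not render in this copy. The key step is that almost-sure convergence of the decision sequence implies convergence in probability.

```latex
% Sketch with generic notation: \phi_n is the binary decision after n samples.
\begin{align*}
\text{Under } H_0:\ \phi_n \xrightarrow{\;a.s.\;} 0
  &\;\Longrightarrow\; \alpha_n \,=\, \mathbb{P}_{H_0}(\phi_n = 1) \longrightarrow 0,\\
\text{Under } H_1:\ \phi_n \xrightarrow{\;a.s.\;} 1
  &\;\Longrightarrow\; \mathbb{P}_{H_1}(\phi_n = 1) \longrightarrow 1,
\end{align*}
% i.e., the significance vanishes and the power tends to one as n \to \infty,
% since almost-sure convergence implies convergence in probability.
```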
In conclusion, the definition of consistency used in this work (Definition 1) implies a vanishing error (as tends to infinity) under both hypotheses. However, this asymptotic property does not offer specific bounds for the errors when is finite, in contrast to the result shown in Eq.(65).