Data-Driven Representations for Testing Independence: Modeling, Analysis and Connection with Mutual Information Estimation

Mauricio E. Gonzalez, Jorge F. Silva, Miguel Videla, and Marcos E. Orchard M. Gonzalez, J. F. Silva, M. Videla, and M. Orchard are with the Information and Decision System Group, Department of Electrical Engineering, University of Chile, Av. Tupper 2007 Santiago, 412-3, Room 508, Chile, Tel: 56-2-9784090, Fax: 56-2 -6953881, (email: [email protected]).

Abstract

This work addresses testing the independence of two continuous and finite-dimensional random variables from the design of a data-driven partition. The empirical log-likelihood statistic is adopted to approximate the sufficient statistics of an oracle test against independence (that knows the two hypotheses). It is shown that approximating the sufficient statistics of the oracle test offers a learning criterion for designing a data-driven partition that connects with the problem of mutual information estimation. Applying these ideas in the context of a data-dependent tree-structured partition (TSP), we derive conditions on the TSP’s parameters to achieve a strongly consistent distribution-free test of independence over the family of probabilities equipped with a density. Complementing this result, we present finite-length results that show our TSP scheme’s capacity to detect the scenario of independence structurally with the data-driven partition as well as new sampling complexity bounds for this detection. Finally, some experimental analyses provide evidence regarding our scheme’s advantage for testing independence compared with some strategies that do not use data-driven representations.

Index Terms:

Independence testing, non-parametric learning, learning representations, data-driven partitions, tree-structure partitions, mutual information, consistency, finite-length analysis.

I Introduction

The problem of detecting independence from i.i.d. samples in a learning setting (distribution-free) is fundamental in statistics and has found numerous applications in statistical signal processing, data analysis, machine learning, inference, and decision problems [1, 2, 3, 4]. For instance, the capacity to detect independent and conditional independent structures from data is relevant when having invariances and probabilistic symmetries in decision and machine learning models [5, 6, 7, 8]. This capacity has been used in blind source separation, independent component analysis (ICA) [1, 2, 3, 4, 9] and the detection of associations in data [10]. Detecting independence from data has been systematically studied, and the literature is vast. Different non-parametric strategies have been proposed for this task using kernel-based statistics [11, 12, 13, 14], distance-based approaches [15, 16, 17, 18, 19], histogram-based approaches [20, 21, 22, 10, 23, 24], log-likelihood statistics [20, 25], correlation-based approaches [26, 27], $\phi$ -divergence estimators [25, 20], entropy and mutual information estimators [28, 29, 30, 10, 10], among many others.

A natural strategy, and the one we follow in this paper, is to partition the observation space (the binning approach) to evaluate the discrepancy (in some sense) between two empirical distributions [20, 12, 29, 15, 22, 21]. We highlight the following work using this approach: Gretton and Györfi [20] introduced a family of product data-independent partitions using the $L_{1}$ -statistics [31] and the I-divergence statistics [32]. With these two approaches, sufficient conditions are established on the partition and other learning parameters to achieve strong consistency for detecting independence distribution-free in a continuous multivariate scenario. Szekely et al. [15] also use partitions on the sample space, utilizing empirical distance correlation and distance covariance statistics to introduce a novel test of independence. The test is consistent and shows better empirical results than classical likelihood-ratio tests in scenarios with non-linear dependencies. Shi et al. [33] address the multivariate scenario by incorporating the concept of transportation-based ranks and signs. They propose plugging the center-outward distribution function into the mentioned distance covariance statistic [15] to determine a distribution-free test of independence.

More recently, Zhang [22] proposed a non-parametric test of independence assuming a rich class of models with known marginal distributions (uniform over $[0,1]$ ). This work proposed a binary expansion filtration of the space and deeply studied some symmetric properties to create a novel test that is uniformly consistent (on the power loss) and minimax for the power (in the sample size) assuming a large class of alternative distributions. Zhang et al. [34] presented a novel extension of the binary expansion filtration strategy [22] to a multidimensional setting by cleverly approximating an oracle Neyman-Pearson (NP) statistic. They propose a data-adaptive weight strategy (for the binary expansion) to approximate the ideal weight of the NP test. This new test unifies several important tests ( $\chi^{2}$ test statistics, distance correlation) and shows promising empirical results.

On the important connection between testing independence and information measures [35, 36], we highlight the work of Kontoyiannis and Skoularidou [30], where the problem of estimating directed information (directed information rate) between two discrete (finite alphabet) processes was explored based on a plug-in estimator. Importantly, the authors used this plug-in estimator to test the presence or absence of causality between the processes. Along the same lines, Suzuki [29] explored different strategies to estimate mutual information between discrete and continuous random variables and then evaluated those strategies experimentally for the task of testing independence. Finally, Reshef et al. [10] introduced the maximal information coefficient (MIC) to measure complex functional dependencies between a pair of variables by maximizing the (empirical) mutual information computed for collections of (adaptive) partitions of different resolutions.

In many of the methods mentioned that use partitions (or binning) for testing independence, a component that has not been explored systematically is the role played by data-driven partitions [37, 38, 10]. Data-driven partitions use data for the binning process (vector quantization) to construct the cells and define the final structure of the partition. This flexibility offers the capacity to better address inference and learning problems. Supporting this idea, data-driven representations have shown great approximation properties and better decision-estimation performance in numerous non-parametric learning problems [37, 38, 39, 40, 41].

In our context, the motivation that piqued our interest in data-driven partitions is the fact that under the hypothesis of independence, the trivial partition (the partition with one cell that contains all the space) is a sufficient representation of the problem. Our initial conjecture is that a well-designed adaptive data-driven partition could have the ability to detect (learn from data) this trivial solution by implementing a form of explicit regularization in its design. This adaptive ability could provide better detection of independence than non-adaptive partition schemes, which has prompted the exploration of data-driven methods for testing independence in this work.

Motivated by this, we look at the problem of designing a new learning criterion that selects a data-driven partition of the space (adaptive binning) as the optimal trade-off between estimation and approximation errors in a learning problem. To formulate this representation learning task, we adopt ideas from universal source coding to introduce a regret term that measures the discrepancy attributed to the use of empirical log-likelihood statistics — restricted over a partition — with respect to the oracle sufficient statistics of an ideal test that knows the true (two) probabilities: the oracle test against independence [36, 42]. Using this regret analysis, we establish a novel connection with the problem of mutual information (MI) estimation [30]. Furthermore, general conditions are derived to obtain a strongly consistent test of independence in the strong (almost-sure and distribution-free) sense introduced in [20].

We apply this framework in the context of tree-structured partitions (TSP) [39], which is the main application focus of this work. TSP algorithms were selected for the following important reasons: Their implementation is simple, TSP have a binary structure that can be used to address learning tasks efficiently for a large class of optimization problems (minimum cost-tree pruning) [43, 44, 45], and they have been shown to be expressive (with good approximation properties) when compared with other partition strategies in non-parametric learning and decision tasks [37, 38, 39, 40, 41, 43, 46, 37, 44].

On the proposed test, we derive concrete connections with results on mutual information estimation [39]. From these connections, we established new design conditions to obtain a strongly consistent test of independence and, more importantly, non-asymptotic results that express our framework’s capacity to approximate the information term of the oracle (ideal) test with a finite sample-size. Going a step further in this finite length performance analysis, we study our scheme’s capacity to detect independence structurally with the underlying data-driven partition, which was one of the original motivations used to explore data-driven partition for this task.

Indeed, we show that under the independence assumption, our data-driven partition collapses to the trivial solution with one cell (implying that the resulting MI estimator is zero) with a finite sample-size almost surely. This is a remarkable property attributed to our scheme’s capacity to adapt its representation to the sufficient statistics of the problem, which is the trivial partition in the context of independence. From this ability, we improve our original result concerning our test’s consistency (density-free) and provide refined sampling complexity bounds for detecting the scenario of independence.

To implement our approach, we propose a learning algorithm that offers a computationally efficient implementation of our data-driven strategy for testing independence. The algorithm is divided in two important phases (a growing and pruning stage), which are both shown to have a polynomial complexity on the side of the problem (the sample size). Finally, we provide empirical evidence of the advantage of our strategy in some controlled simulation scenarios. For this last analysis, we introduce concrete non-asymptotic metrics (in the form of sampling complexity indicators) to measure the test’s ability to detect the correct hypothesis with a finite number of samples.

A preliminary version of this work was presented in [47]. This work significantly expands the technical contribution in [47] and explores implementations and experimental analyses.

I-A Organization of the Paper

Section II introduces the problem and basic notations. Section III presents the general histogram-based scheme, and Section IV introduces the regret-based analysis proposed in this work and its connection with the problem of MI estimation. Section IV-B presents a general consistency result for our problem (Theorem 1). Section V introduces the family of tree-structured data-driven partitions and presents consistency results (Theorem 2, Lemma 2 and Corollary 1). Section VI analyzes some finite-length properties (Theorem 2) and elaborates on the structural capacity of our scheme to detect independence under the null hypothesis (Lemma 3, and Theorems 3 and 4). Sections VII and VIII cover implementation and empirical analysis, respectively. Finally, Section IX concludes with some discussion and future directions. The proofs are presented in the Appendix.

II Preliminaries

Let us consider two random variables $X$ and $Y$ taking values in $\mathbb{R}^{p}$ and $\mathbb{R}^{q}$ , respectively, with joint distribution $P$ in $(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$ , where $d=p+q$ and $\mathcal{B}(\mathbb{R}^{d})$ is the Borel sigma field. Let us denote by $\mathcal{P}_{0}$ the class of probabilities in $(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$ under which $(X,Y)$ are independent, meaning that if $P\in\mathcal{P}_{0}$ , then $P(A\times B)=P(\mathbb{R}^{p}\times B)\cdot P(A\times\mathbb{R}^{q})$ for any $A\in\mathcal{B}(\mathbb{R}^{p})$ and $B\in\mathcal{B}(\mathbb{R}^{q})$ .

Given $n$ i.i.d. samples of $(X,Y)$ , denoted by $Z^{n}_{1}\equiv(Z_{1}\equiv(X_{1},Y_{1}),Z_{2}\equiv(X_{2},Y_{2}),..,Z_{n}\equiv(X_{n},Y_{n}))\in\mathbb{R}^{d}$ driven by the model $P\in\mathcal{P}(\mathbb{R}^{d})$ , the problem is to decide from $Z^{n}_{1}$ whether $X$ and $Y$ are independent (meaning that $P\in\mathcal{P}_{0}$ ) or, alternatively, $X$ and $Y$ have some statistical dependency, i.e., $P\notin\mathcal{P}_{0}$ . In this context, a decision rule $\phi_{n}(\cdot)$ of length $n$ is a function from $(\mathbb{R}^{d})^{n}$ to $\left\{0,1\right\}$ , where $\phi_{n}(z^{n}_{1})=0$ means that the rule decides that the underlying probability producing $z^{n}_{1}$ belongs to $\mathcal{P}_{0}$ . Then, for any decision rule $\phi_{n}\in\Pi_{n}$ ¹¹1 $\Pi_{n}$ denotes the collection of binary rules acting on $\mathbb{R}^{dn}$ . of length $n$ , we recognize two errors:

•

Assuming that $P\in\mathcal{P}_{0}$ ( $\mathcal{H}_{0}$ ), the non-detection of independence between $X$ and $Y$ is measured by²²2 $\mathbb{P}$ denotes the entire process distribution of $(Z_{n})_{n\geq 1}$ and $P^{n}$ is the $n$ -fold probability in $\mathbb{R}^{dn}$ induced by the marginal $P$ .

$\mathbb{P}(\phi_{n}(Z^{n}_{1})=1)=P^{n}(\left\{z^{n}_{1}:\phi_{n}(z^{n}_{1})=1\right\}),$
•

Assuming that $P\notin\mathcal{P}_{0}$ ( $\mathcal{H}_{1}$ ), the false detection of independence between $X$ and $Y$ is measured by

$\mathbb{P}(\phi_{n}(Z^{n}_{1})=0)=P^{n}(\left\{z^{n}_{1}:\phi_{n}(z^{n}_{1})=0\right\}).$

The following is the classical notion of strong-consistency used to evaluate the asymptotic goodness of a universal scheme (collection of rules of different lengths) for detecting independence from data [20].

Definition 1

Given a scheme $\xi=\left\{\phi_{n}\in\Pi_{n},n\geq 1\right\}$ , where $\phi_{n}$ is a decision rule of length $n$ , we say that $\xi$ is strongly consistent for detecting independence if

i)

under $\mathcal{H}_{0}$ : $(\phi_{n}(Z^{n}_{1}))_{n\geq 1}$ reaches $0$ eventually with probability one, or $\mathbb{P}$ -almost surely (a.s.),
ii)

under $\mathcal{H}_{1}$ : $(\phi_{n}(Z^{n}_{1}))_{n\geq 1}$ reaches $1$ eventually $\mathbb{P}$ -a.s.

II-A The Divergence and Mutual Information

Let $P$ , $Q$ be two probability measures in $(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$ such that $P\ll Q$ , and let us consider $\pi=\left\{A_{i},i\in\mathcal{I}\right\}$ a measurable partition of $\mathbb{R}^{d}$ where $I$ is finite or countable. The divergence of $P$ with respect to $Q$ restricted over $\pi$ (or the sub-sigma field induced by $\pi$ denoted by $\sigma(\pi)$ ) is given by [32, 36]

D_{\sigma(\pi)}(P||Q)\equiv\sum_{i\in\mathcal{I}}P(A_{i})\log\frac{P(A_{i})}{Q(A_{i})}.

(1)

The divergence of $P$ with respect to $Q$ is [32, 36]

D(P||Q)\equiv\sup_{\pi\in\mathcal{Q}(\mathbb{R}^{d})}D_{\sigma(\pi)}(P||Q)

(2)

where $\mathcal{Q}(\mathbb{R}^{d})$ is the collection of measurable³³3 $\mathcal{B}(\mathbb{R}^{d})$ is the sigma field of Borel sets. partitions of $\mathbb{R}^{d}$ .

III The Empirical Log-Likelihood Statistics from a Data-Driven Partition

In this work, we adopt a histogram-based log-likelihood statistics approach [20]. We interpret this approach as an empirical version of the Neyman-Pearson (NP) test against independence [42]. The ideal (oracle) NP test against independence decides whether the samples belong to the following two known cases: $P$ (for some $P\notin\mathcal{P}_{0}$ ) or $Q^{*}(P)\in\mathcal{P}_{0}$ , where $Q^{*}(A\times B)=P(\mathbb{R}^{p}\times B)\cdot P(A\times\mathbb{R}^{q})$ for any $A\in\mathcal{B}(\mathbb{R}^{p})$ and $B\in\mathcal{B}(\mathbb{R}^{q})$ .⁴⁴4 $Q^{*}(P)$ can be interpreted as the projection of $P$ on $\mathcal{P}_{0}$ in the information divergence sense, i.e., $Q^{*}(P)$ is the solution of $\min_{Q\in\mathcal{P}_{0}}D(P||Q)$ .

The specific approach proposed in this work is a two-stage process that uses the data $Z^{n}_{1}\sim P^{n}$ twice: first, to estimate $P$ , and its projected version $Q^{*}(P)$ (over $\mathcal{P}_{0}$ ), with the caveat that the estimation is restricted over the events of a finite partition of the joint space $\mathbb{R}^{d}$ (i.e., a histogram-based estimation of $P$ and $Q^{*}(P)$ , respectively); second, to compute an empirical version of the log likelihood-ratio to decide (using a threshold) if $\hat{P}=\hat{Q}^{*}$ (independence) or $\hat{P}\neq\hat{Q}^{*}$ (non-independence).

There are two elements of learning design that determine our empirical test. The first is a data-driven partition rule denoted by $\pi_{n}(\cdot)$ , which is a function that maps sequences in $(\mathbb{R}^{d})^{n}$ into a finite measurable partitions of $\mathbb{R}^{d}$ . The second element is a non-negative threshold denoted by $a_{n}\in\mathbb{R}^{+}$ . Then given a pair $(\pi_{n}(\cdot),a_{n})$ and some data $z^{n}_{1}\in\mathbb{R}^{dn}$ , the data-driven test is constructed as follows:
1) Use the quantization $\pi_{n}(z^{n}_{1})=\left\{A_{i},i=1,..,\left|\pi_{n}(z^{n}_{1})\right|\right\}\subset\mathcal{B}(\mathbb{R}^{d})$ to estimate $P$ over the cells $\pi_{n}(z^{n}_{1})$ :

\hat{P}_{n}(A)\equiv\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{A}(z_{i})\text{ and }

(3)

\hat{Q}^{*}_{n}(\underbrace{A^{1}\times A^{2}}_{A=})\equiv\hat{P}_{n}(A^{1}\times\mathbb{R}^{q})\cdot\hat{P}_{n}(\mathbb{R}^{p}\times A^{2}),

(4)

for any $A\in\pi_{n}(z^{n}_{1})$ . In Eq.(4) we assume that the cells of $\pi_{n}(z^{n}_{1})$ have a product structure, i.e., $A=A^{1}\times A^{2}$ where $A^{1}\in\mathcal{B}(\mathbb{R}^{p})$ and $A^{2}\in\mathcal{B}(\mathbb{R}^{q})$ .⁵⁵5This event-wise product structure on the cells of $\pi_{n}(z^{n}_{1})$ is needed to estimate both $P$ and $Q^{*}$ from i.i.d. samples of $P$ .
2) Project the data $z^{n}_{1}$ over the cells of $\pi_{n}(z^{n}_{1})$ . A simple projection (or representation) function is the following:

O_{\pi_{n}}(z)\equiv\sum_{j=1}^{\left|\pi_{n}(z^{n}_{1})\right|}j\cdot\mathbf{1}_{A_{j}}(z)\in\left\{1,..,\left|\pi_{n}(z^{n}_{1})\right|\right\}.

(5)

The projected data is given and denoted by $o^{n}_{1}\equiv(o_{1}=O_{\pi_{n}}(z_{1}),..,o_{n}=O_{\pi_{n}}(z_{n}))$ .
3) Compute the log-likelihood ratio of the empirical distributions in (3) and (4) using the quantized (or projected over $\sigma(\pi_{n}(z^{n}_{1}))$ ) data $o^{n}_{1}$ . In particular, we consider the following log-likelihood ratio per sample as

\hat{i}_{\pi_{n}}(o^{n}_{1})\equiv\frac{1}{n}log\frac{\hat{P}_{O_{\pi_{n}}}(o_{1})\hat{P}_{O_{\pi_{n}}}(o_{2})\cdots\hat{P}_{O_{\pi_{n}}}(o_{n})}{\hat{Q}_{O_{\pi_{n}}}(o_{1})\hat{Q}_{O_{\pi_{n}}}(o_{2})\cdots\hat{Q}_{O_{\pi_{n}}}(o_{n})},

(6)

where $\hat{P}_{O_{\pi_{n}}}$ and $\hat{Q}^{*}_{O_{\pi_{n}}}$ denote the empirical distributions of the quantized random variable $O_{\pi_{n}}$ in its representation space $\left\{1,..,\left|\pi_{n}(z^{n}_{1})\right|\right\}$ , meaning that by construction $\hat{P}_{O_{\pi_{n}}}(j)=\hat{P}_{n}(A_{j})$ and $\hat{Q}_{O_{\pi_{n}}}(j)=\hat{Q}^{*}_{n}(A_{j})$ for all $j\in\left\{1,..,\left|\pi_{n}(z^{n}_{1})\right|\right\}$ .
4) Finally, the decision is given by the following rule:

\phi_{\pi_{n},a_{n}}(o^{n}_{1})\equiv\left\{\begin{array}[]{lll}0&\textrm{ if }&i_{\pi_{n}}(o^{n}_{1})<a_{n}\\ 1&\textrm{ if }&i_{\pi_{n}}(o^{n}_{1})\geq a_{n}\end{array}\right..

(7)

The second step introduced $O_{\pi_{n}}(Z)$ in (5). This object captures the role played by the data-driven partition ( $\pi_{n}$ ) in our log-likelihood statistics in (6).

IV Regret Analysis: The Representation Learning Problem

To analyze the quality of the proposed test, in this section we compare the proposed statistics in (6) with the statistics used by an oracle NP test. We consider the data-driven partition in the pair $(\pi_{n}(\cdot),a_{n})$ as a learning agent. For this analysis, let us first fix a finite partition $\pi=\left\{A_{i},i=1,..,\left\{1,..,J\right\}\right\}$ . Given i.i.d. realizations $Z_{1},..,Z_{n}$ from $P\in\mathcal{P}(\mathbb{R}^{d})$ , we consider the empirical-quantized log-likelihood statistics in (6). Adopting the notion of regret from universal source coding [35], let us consider as a reference the true likelihood ratio of $P$ with respect to $Q^{*}(P)$ associated with the expression in (6) but with no quantization effects, i.e., ⁶⁶6By construction $P\ll Q^{*}$ ; therefore, the RN derivative $\frac{dP}{dQ^{*}}$ is well defined in (8) where it is clear that if $(\log\frac{dP}{dQ^{*}}(z))_{z\in\mathbb{R}^{d}}\in\ell_{1}(P)$ , then $I(X,Y)=\mathbb{E}_{Z\sim P}\left\{\log\frac{dP}{dQ^{*}}(Z))\right\}<\infty$ .

i_{n}(z_{1},...,z_{n})\equiv\frac{1}{n}\log\frac{dP}{dQ^{*}}(z_{1})\cdot\frac{dP}{dQ^{*}}(z_{2})\cdots\frac{dP}{dQ^{*}}(z_{n}).

(8)

$i_{n}(z_{1},...,z_{n})$ is the ideal (oracle) information term used by the NP test against independence [35]. We measure the (sample-wise) regret of $\pi$ as

\displaystyle i_{n}(z_{1},...,z_{n})-\hat{i}_{\pi}(O_{\pi}(z_{1}),....,O_{\pi}(z_{n})),

(9)

which measures the discrepancy between the empirical-quantized statistics in (6) and the oracle term in (8).

IV-A Connection with Divergence Estimation

The regret in (9) has two error sources: one associated with the quantization of the space (the representation quality of $\pi$ ) and the other associated with the fact that we use the empirical distribution $\hat{P}_{n}$ in (3) instead of the true model $P$ . To isolate these components, it is useful to introduce the oracle-quantized information given by

i_{\pi}(o^{n}_{1})\equiv\frac{1}{n}\log\prod_{i=1}^{n}\frac{P_{O_{\pi}}(o_{i})}{Q^{*}_{O_{\pi}}(o_{i})}\text{ with}\ o^{n}_{1}\in\left\{1,..,J\right\}^{n},

(10)

where $P_{O_{\pi}}(j)=P(A_{j})$ and $Q^{*}_{O_{\pi}}(j)={Q}^{*}(A_{j})$ for all $j\in\left\{1,..,J\right\}$ are short-hand for the true probabilities induced by $O_{\pi}(\cdot)$ , $P$ and $Q^{*}$ in $\left\{1,..,J\right\}$ . Then, the regret in (9) can be decomposed in two components:

		$\displaystyle i_{n}(z_{1},...,z_{n})-\hat{i}_{\pi}(O_{\pi}(z_{1}),....,O_{\pi}(z_{n}))=$
		$\displaystyle\underbrace{i_{n}(z_{1},...,z_{n})-i_{\pi}(O_{\pi}(z_{1}),....,O_{\pi}(z_{n}))}_{I}+$
		$\displaystyle\underbrace{i_{\pi}(O_{\pi}(z_{1}),....,O_{\pi}(z_{n}))-\hat{i}_{\pi}(O_{\pi}(z_{1}),....,O_{\pi}(z_{n}))}_{II}.$		(11)

The first term (I) on the right hand side (RHS) of Eq.(IV-A) captures an information loss attributed to the quantization (5) (the approximation error). This expression convergences as $n$ tends to infinity to $D(P||Q^{*})-D_{\sigma(\pi)}(P||Q^{*})\geq 0$ almost surely. The second term (II) on the RHS of (IV-A) captures the discrepancy in the information density attributed to the use of empirical distributions. Crucially, these two error sources are driven by the partition $\pi$ . At this point, it is worth noticing that the empirical information term can be expressed as $\hat{i}_{\pi}\left(O_{\pi}(z_{1}),....,O_{\pi}(z_{n})\right)=\sum_{A\in\pi}\hat{P}_{n}(A)\log\frac{\hat{P}_{n}(A)}{\hat{Q}_{n}^{*}(A)}$ , which is a histogram-based estimator of the divergence restricted over the cells of $\pi$ [39, 48, 40]. From these results, the RHS of (IV-A) can be re-structured as follows

		$\displaystyle i_{n}(Z_{1},...,Z_{n})-\hat{i}_{\pi}(O_{\pi}(Z_{1}),....,O_{\pi}(Z_{n}))=$
		$\displaystyle\underbrace{i_{n}(Z_{1},...,Z_{n})-D(P\|\|Q^{})}_{I}+\underbrace{D(P\|\|Q^{})-D_{\sigma({\pi})}({P}\|\|{Q}^{*})}_{II}+$
		$\displaystyle\underbrace{D_{\sigma({\pi})}({P}\|\|{Q}^{})-D_{\sigma({\pi})}(\hat{P}_{n}\|\|\hat{Q}_{n}^{})}_{III}.$		(12)

The first term ( $I$ ) on the RHS of (IV-A) goes to zero with probability one (by the strong law of large numbers assuming that $D(P||Q^{*})<\infty$ ) independent of $\pi$ . Therefore, from the perspective of evaluating the effect of $\pi$ in the regret, this term can be overlooked. On the other hand, the second term ( $II$ ) captures the approximation error, or what we lost in discrimination (as the number of samples goes to infinity) when projecting the data over $\sigma(\pi)$ with respect to the lossless term in (8). Finally, the last term ( $III$ ) measures the estimation error, which is the discrepancy between the true (oracle) distributions and the empirical distributions in the information divergence sense over events of $\sigma(\pi)$ (see Eq.(1)).

IV-A1 The Representation Problem

Here we introduce the main design problem of this work, which is to find a data-driven partition that offers an optimal balance between the two relevant terms presented in (IV-A). For the estimation error, the idea is to adopt distribution-free (universal) error bounds for $\left|D_{\sigma({\pi})}({P}||{Q}^{*})-D_{\sigma({\pi})}(\hat{P}||\hat{Q}^{*})\right|$ of the form (more details in Section V-B):

\mathbb{P}\left(\left|D_{\sigma({\pi})}({P}||{Q}^{*})-D_{\sigma({\pi})}(\hat{P}_{n}||\hat{Q}_{n}^{*})\right|\leq r_{n}(\pi)\right)\geq 1-\delta,

where $r_{n}(\pi)$ is the confidence interval for the confidence probability $1-\delta$ . Equipped with this result, we could use the following upper bound for the regret: $D(P||Q^{*})-\hat{i}_{\pi}(O_{\pi}(z_{1}),....,O_{\pi}(z_{n}))\leq\left[D(P||Q^{*})-D_{\sigma({\pi})}({P}||{Q}^{*})\right]+r_{n}(\pi)$ , with high probability. Finally, we could formulate the problem of selecting $\pi$ over a family of representations $\Pi$ (subset of $\mathcal{Q}(\mathbb{R}^{d})$ ) as the solution of the following regularized info-max problem:

\displaystyle\pi^{*}_{n}\equiv\arg\max_{\pi\in\Pi}D_{\sigma({\pi})}({P}||{Q}^{*})-r_{n}(\pi).

(13)

The learning task in (13) is still intractable. It resembles an oracle learner agent (teacher) that selects $\pi$ in $\Pi$ solving the trade-off between the two errors, one of which needs the true model $P$ . Section V addresses a tractable info-max version of (13), where instead of $P$ the empirical distribution $\hat{P}$ in (3) is adopted.

IV-B Strong Consistency: A Basic Requirement

In this subsection, we introduce the idea of measuring the discrepancy between $\hat{i}_{\pi_{n}}(O_{\pi_{n}}(z_{1}),....,O_{\pi_{n}}(z_{n}))$ and the ideal (oracle) information $i_{n}(z_{1},...,z_{n})$ . From this perspective, we could introduce a new notion of consistency based on this learning objective.

Definition 2

Our scheme $\left\{(\pi_{n},a_{n}),n\geq 1\right\}$ is said to be strongly consistent on the regret if for any $P\in\mathcal{P}(\mathbb{R}^{d})$ and i.i.d. process $(Z_{n})_{n\geq 1}$ with $Z_{i}\sim P$ , it follows that

\lim_{n\rightarrow\infty}\left|i_{n}(Z_{1},...,Z_{n})-\hat{i}_{\pi_{n}}(O_{\pi_{n}}(Z_{1}),....,O_{\pi_{n}}(Z_{n}))\right|=0,

$\mathbb{P}$ -almost surely (a.s.).

A simple result follows:

PROPOSITION 1

If $\left\{(\pi_{n},a_{n}),n\geq 1\right\}$ is strongly consistent on the regret, then $\hat{i}_{\pi_{n}}(O_{\pi_{n}}(Z_{1}),....,O_{\pi_{n}}(Z_{n}))$ is a strongly consistent estimator of the MI between $X$ and $Y$ .⁷⁷7From Definition 2, it follows directly that $\lim_{n\rightarrow\infty}i_{n}(Z_{1},...,Z_{n})=\lim_{n\rightarrow\infty}D_{\sigma({\pi})}(\hat{P}_{n}||\hat{Q}_{n}^{*})=D(P||Q^{*})=I(X,Y)$ , $\mathbb{P}$ -almost surely.

Returning to our original problem, the next result offers sufficient conditions for a data-driven scheme $\left\{(\pi_{n},a_{n})n\geq 1\right\}$ to be strongly consistent for detecting independence (Def. 1).

THEOREM 1

Let $\left\{(\pi_{n},a_{n}),n\geq 1\right\}$ be the data-driven scheme of Section III.

i)

If $(a_{n})_{n}$ is $o(1)$ ,
ii)

$\left\{\pi_{n},n\geq 1\right\}$ is strongly consistent on the regret (Def. 2),
iii)

under $\mathcal{H}_{0}$ , $(\hat{i}_{\pi_{n}}(O_{\pi_{n}}(Z_{1}),....,O_{\pi_{n}}(Z_{n})))_{n\geq 1}$ is $o(a_{n})$ in the sense that $\lim_{n\rightarrow\infty}\frac{\hat{i}_{\pi_{n}}(O_{\pi_{n}}(Z_{1}),....,O_{\pi_{n}}(Z_{n}))}{a_{n}}=0$ $\mathbb{P}$ -a.s.,

then $(\phi_{\pi_{n},a_{n}}(O_{\pi_{n}}(\cdot)))_{n\geq 1}$ in (7) is strongly consistent for detecting independence (Def. 1).

The proof is presented in Appendix A.

Consistency is a basic asymptotic requirement that is non-sufficient when looking into practical applications that operate with a finite sample size. Hence, in the next sections, we present a test that is strongly consistent (Th.3) but also satisfies some relevant non-asymptotic properties in the form of finite-length performance results (Ths. 2, 4 and Lemma 3).

V Tree-Structured Data-Driven Partitions

In this section, an empirical version of (13) is studied considering for $\Pi$ a dictionary of tree-structured data-driven partitions (TSP). This TSP family was introduced in [39] for the problem of MI estimation. The focus of this section is to demonstrate its potential for the problem of testing independence.

V-A The Collection of Partitions

The construction of this family begins by defining a “root” node to index the trivial partition $A_{root}\equiv\left\{\mathbb{R}^{d}\right\}$ . The process continues by selecting a coordinate in $\left\{1,..,d\right\}$ to project the data (1D projection), then order the projected data (scalar values), select the median of the ordered sequence, and finally create a statistically equivalent partition of $A_{root}$ using the selected coordinate axis and the median to split the cell. Two new cells are created from $A_{root}$ with almost half the sample points in each (exactly half when $n$ is an even integer).

These new cells are indexed as the left and right children of the “root” denoted by $(l(root),r(root))$ ; i.e., the new cells are $A_{l(root)}$ and $A_{r(root)}$ where $A_{root}=A_{l(root)}\cup A_{r(root)}$ . The process continues by selecting a new coordinate in $\left\{1,..,d\right\}$ , where the statistically equivalent binary splitting criterion is iterated in $A_{l(root)}$ and $A_{r(root)}$ until a stopping condition is met. The stopping criterion adopted here imposed a minimum empirical probability in each created cell that we denote by $b_{n}\in(0,1)$ . Therefore, at the end of this growing binary process, a collection of nodes is produced $\mathcal{I}\equiv\left\{root,l(root),r(root),...\right\}$ associated with a binary tree (by construction) and the nodes’ respective cells $\left\{A_{v},v\in\mathcal{I}\right\}$ where $\hat{P}_{n}(A_{v})\geq b_{n}$ for any $v\in\mathcal{I}$ .

Using the Breiman et al. [43] convention, a binary tree $T$ is a collection of nodes in $\mathcal{I}$ : one node of degree 2 (the ”root”), and the remaining nodes of degree 3 (internal nodes) or degree 1 (leaf or terminal nodes) [43]. In this convention, the full-tree is denoted and given by $T^{full}_{b_{n}}\equiv\mathcal{I}$ . Importantly, if $\tilde{T}\subset T^{full}_{b_{n}}$ and $\tilde{T}$ is a binary tree by itself, then we say that $\tilde{T}$ is a subtree of $T^{full}_{b_{n}}$ . Moreover, if both have the same root, we say that $\tilde{T}$ is a pruned version of $T^{full}_{b_{n}}$ , and we denote this by $\tilde{T}\ll T^{full}_{b_{n}}$ . Finally, if we denote by $\mathcal{L}(T)$ the leaf nodes of an arbitrary tree $T\ll T^{full}_{b_{n}}$ , it is simple to verify that $\pi_{T}\equiv\left\{A_{v},v\in\mathcal{L}(T)\right\}$ is a data-driven partition of $\mathbb{R}^{d}$ indexed by $T$ where every cell in $\pi_{T}$ has the desired product structure in Eq.(4) [39].⁸⁸8The interested reader is referenced to [39] for a systematic explanation of this TSP construction.

V-B Regularized (Empirical) Information Maximization

To address the info-max design objective in (13) using our (data-driven) tree-indexed family $\Pi_{n}\equiv\left\{\pi_{T}:T\ll T^{full}_{b_{n}}\right\}$ , we have to first find an expression for $r_{n}(\pi)$ in (13) for any $\pi_{T}\in\Pi_{n}$ . The next result offers the following:

LEMMA 1

(Silva et al.[39, Th. 1]) Let $\mathcal{G}^{k}_{b_{n}}\equiv\left\{T\ll T^{full}_{b_{n}}:\left|T\right|=k\right\}$ be the family of pruned TSPs of size $k$ . Then, $\forall k\in\left\{1,..,\left|T^{full}_{b_{n}}\right|\right\}$ , $\forall n>0$ , and any small $\delta\in(0,1)$ , there is a threshold $\epsilon_{c}(\cdot)$ where

		$\displaystyle\mathbb{P}\left(\sup_{T\in\mathcal{G}^{k}_{b_{n}}}\left\|{D_{\sigma({\pi_{T}})}({P}\|\|{Q}^{})-D_{\sigma({\pi_{T}})}(\hat{P}_{n}\|\|\hat{Q}_{n}^{})}\right\|\leq\epsilon_{c}(\cdot)\right)\geq$
		$\displaystyle 1-\delta,$		(14)

where specifically $\epsilon_{c}(n,b_{n},d,\delta,k)=$

\displaystyle\frac{24\sqrt{2}}{b_{n}\sqrt{n}}\sqrt{\ln(8/\delta)+k\left[(d+1)\ln(2)+d\ln(n)\right]}.

(15)

Importantly the bound in (1) is distribution-free [39], and a function of $b_{n}$ and $k$ (the size of the family $\mathcal{G}^{k}_{b_{n}}$ ). From this distribution-free concentration result, the union bound tells us that the following events in $\mathbb{R}^{dn}$

\displaystyle\left\{\sup_{T\in\mathcal{G}^{k}_{b_{n}}}\left|D_{\sigma({\pi_{T}})}({P}||{Q}^{*})-D_{\sigma({\pi_{T}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})\right|\leq r_{b_{n},\delta}(k)\right\},

(16)

with $r_{b_{n},\delta}(k)\equiv\epsilon_{c}(n,b_{n},d,\delta\cdot b_{n},k)$ , happen simultaneously for any $k=1,..,\left|T^{full}_{b_{n}}\right|$ with probability at least $1-\delta$ (with respect to $\mathbb{P}$ ). Equipped with these penalizations, i.e., $r_{b_{n},\delta}(\left|T\right|)$ in (16) for any $T\ll T^{full}_{b_{n}}$ , the empirical version of the info-max problem in (13) is

\displaystyle\hat{T}_{b_{n},\delta_{n}}\equiv\arg\max_{T\ll T^{full}_{b_{n}}}D_{\sigma({\pi_{T}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})-r_{b_{n},\delta_{n}}(\left|T\right|).

(17)

Finally, it is important to note that both $(b_{n})_{n\geq 1}$ and $(\delta_{n})_{n\geq 1}$ determine the trees $(\hat{T}_{b_{n},\delta_{n}})_{n\geq 1}$ and TS partitions that we denote here by $({\pi}_{b_{n},\delta_{n}})_{n\geq 1}$ . In addition, if we include the thresholds $(a_{n})_{n\geq 1}$ , we denote by $(\phi_{b_{n},\delta_{n},a_{n}}(\cdot))_{n\geq 1}$ the rules induced by $({\pi}_{b_{n},\delta_{n}})_{n\geq 1}$ and $(a_{n})_{n\geq 1}$ in (7).

V-C Consistency: A Preliminary Analysis

The following results show that the TSP scheme in (17) is strongly consistent on the regret (Def.2).

LEMMA 2

Let us assume that $P$ has a density⁹⁹9 $P$ is absolutely continuous with respect to the Lebesgue measure in $(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$ . in $\mathbb{R}^{d}$ .
i) Under the conditions that $(b_{n})\approx(n^{-l})$ for $l\in(0,{1}/{3})$ , $(\delta_{n})$ is $o(1)$ and $(1/\delta_{n})$ is $\mathcal{O}({e}^{n^{1/3}})$ , we have that

		$\displaystyle\lim_{n\rightarrow\infty}\left\|i_{n}(Z_{1},...,Z_{n})-\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),....,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))\right\|$
		$\displaystyle=0,\ \mathbb{P}\text{-a.s.}.$		(18)

ii) Assuming that $X$ and $Y$ are independent ( $\mathcal{H}_{0}$ ), if we select $(1/\delta_{n})_{n}\approx(e^{n^{1/3}})$ , it follows that for any $p>0$

\displaystyle(\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),....,O_{\pi_{b_{n},\delta_{n}}}(Z_{n})))_{n\geq 1}\text{ is }o(n^{-p}),\mathbb{P}\text{-a.s.}.

(19)

The proof is presented in Appendix B.

In addition to achieve consistency on the regret, stated in part i), the part ii) of this result shows that under $\mathcal{H}_{0}$ the regret achieves a super polynomial velocity of convergence to zero: $i_{n}(Z_{1},...,Z_{n})=0$ with probability one under $\mathcal{H}_{0}$ . Therefore, the empirical information rate $\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),....,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))$ tends to zero faster than any polynomial order in $n$ with probability one, which is a remarkable capacity to detect this condition from data. In light of Theorem 1, we could have a range of admissible vanishing thresholds $(a_{n})_{n\geq 1}$ where strong consistency for detecting independence can be achieved (Def. 1). This is stated in the following result.

COROLLARY 1

Let us assume the setting and conditions on $(b_{n})_{n\geq 1}$ and $(\delta_{n})_{n\geq 1}$ stated in Lemma 2 part ii) for the TSP scheme. If $(a_{n})_{n\geq 1}$ is $\mathcal{O}(n^{-q})$ for some arbitrary $q>0$ , then the TSP scheme $\left\{\phi_{{b_{n},\delta_{n}},a_{n}},n\geq 1\right\}$ is strongly consistent for detecting independence (Def. 1).

Proof:

The proof derives directly from Theorem 1 and the two results obtained in Lemma 2. ∎

VI Main Finite-Length Results

In this section, we focus on finite-length (non-asymptotic) results. We establish conditions where the solution of (17) nearly matches the performance of its equivalent oracle version in (13) with high probability. This non-asymptotic result is instrumental to show later one of the main findings of our work: the capacity of our scheme to detect $\mathcal{H}_{0}$ with the structure of the partition.

THEOREM 2

Under the conditions that $(b_{n})\approx(n^{-l})$ for $l\in(0,{1}/{3})$ , $(\delta_{n})$ is $o(1)$ and $(1/\delta_{n})$ is $\mathcal{O}({e}^{n^{1/3}})$ , we have that
i) under $\mathcal{H}_{0}$ : for any $\epsilon>0$ there is $N(\epsilon)>0$ such that $\forall n\geq N(\epsilon)$ , the equality

\displaystyle\underbrace{\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),....,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))=i_{n}(Z_{1},...,Z_{n})}_{\text{zero regret regime}}=0,

(20)

holds with $\mathbb{P}$ -probability $1-\epsilon$ .

ii) under $\mathcal{H}_{1}$ : for any $\epsilon>0$ there is $N(\epsilon)$ such that $\forall n\geq N(\epsilon)$ the bound

		$\displaystyle i_{n}(Z_{1},...,Z_{n})-\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),....,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))$
		$\displaystyle\leq i_{n}(Z_{1},...,Z_{n})-I(X,Y)+$
		$\displaystyle\min_{T\ll T^{full}_{b_{n}}}\left[D(P\|\|Q^{}(P))-D_{\sigma({\pi_{T}})}({P}\|\|{Q}^{}(P))\right]+2r_{b_{n},\delta_{n}}(\left\|T\right\|).$		(21)

holds with $\mathbb{P}$ -probability $1-\epsilon$ .

The proof of this result is presented in Appendix C.

This result presents two optimality bounds for the information term of the TSP scheme under the two main hypotheses ( $\mathcal{H}_{0}$ and $\mathcal{H}_{1}$ ) of our problem. Under $\mathcal{H}_{0}$ , our regularization approach is capable of detecting this structure (with an arbitrary high probability $1-\epsilon$ ) in the sense that $\hat{T}_{b_{n},\delta_{n}}$ in (17) reduces to the trivial partition $\left\{\mathbb{R}^{d}\right\}$ . From this, we obtain the zero regret condition $\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),....,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))=i_{n}(Z_{1},...,Z_{n})$ . This means that the solution of $\hat{T}_{b_{n},\delta_{n}}$ by itself (with no threshold) is capable of detecting independence.

On the other hand, under $\mathcal{H}_{1}$ , we notice that it is only relevant to bound the under-estimation of $i_{n}(Z_{1},...,Z_{n})$ with the empirical (log-likelihood ratio) information $\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),....,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))$ . Limiting the analysis only to the underestimation of $i_{n}(Z_{1},...,Z_{n})$ can be argued from the well-known observation that quantization (on average) provides a bias (under-estimation) estimation of $\mathbb{E}(i_{n}(Z_{1},...,Z_{n}))=I(X;Y)$ . More importantly for our problem, an overestimation of the oracle information $i_{n}(Z_{1},...,Z_{n})$ does not increase the type 2 error when using $\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),....,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))$ instead of the oracle $i_{n}(Z_{1},...,Z_{n})$ under $\mathcal{H}_{1}$ .

In summary, Theorem 2 indicates that with an arbitrary high probability, the empirical TSP solution of (17) achieves the performance of the oracle solution that optimizes the upper bound on the regret derived in Eq.(13) over the TSP family $\left\{\pi_{T},T\ll T^{full}_{b_{n}}\right\}$ .

Remark 1

In the proof of Theorem 2, we determine a set $\mathcal{E}^{n}_{\delta_{n},b_{n}}$ (see Eq.(39)) where if $z^{n}_{1}\in\mathcal{E}^{n}_{\delta_{n},b_{n}}$ then (20) holds under $\mathcal{H}_{0}$ and (2) holds under $\mathcal{H}_{1}$ . Importantly, we obtain that $\mathbb{P}(\mathcal{E}^{n}_{\delta_{n},b_{n}})\geq 1-\delta_{n}$ from Lemma 1. Therefore, in Theorem 2, $N(\epsilon)$ is fully determined by $(\delta_{n})_{n\geq 1}$ for any $\epsilon>0$ . Indeed, $N(\epsilon)=\inf\left\{n\geq 1,st.\ \delta_{n}<\epsilon\right\}$ , which is well-defined since $(\delta_{n})$ is $o(1)$ . In particular, for the case $(1/\delta_{n})$ being $\mathcal{O}(e^{n^{1/3}})$ , it follows that $N(\epsilon)$ is $\mathcal{O}(\ln(1/\epsilon)^{3})$ .

VI-A Detecting Independence with the Structure of $\hat{T}_{b_{n},\delta_{n}}$

From (20), our data-driven partition has the capacity to detect independence (under $\mathcal{H}_{0}$ ) in a stronger structural way: (with high probability) the data-driven partition collapses to the trivial cell, i.e., $\pi_{b_{n},\delta_{n}}=\left\{\mathbb{R}^{d}\right\}$ , with a finite sample size. Here, we improve this result by showing that the condition $\pi_{b_{n},\delta_{n}}=\left\{\mathbb{R}^{d}\right\}$ happens within a finite sample size almost surely.

Let us introduce the (random) time at which the partition process $(\pi_{b_{n},\delta_{n}})_{n\geq 1}$ collapses to the trivial set $A_{root}=\left\{\mathbb{R}^{d}\right\}$ :

Definition 3

$\mathcal{T}_{0}((Z_{n})_{n\geq 1})\equiv$

\sup{\left\{m\geq 1:\left|\hat{T}_{b_{m},\delta_{m}}\right|>1\right\}}\in\mathbb{N}^{*}\equiv\mathbb{N}\cup\left\{\infty\right\}.

(22)

If $\mathcal{T}_{0}((Z_{n})_{n\geq 1})=k$ , it means that $\left|\hat{T}_{b_{n},\delta_{n}}\right|=1\Leftrightarrow\pi_{b_{n},\delta_{n}}=\left\{\mathbb{R}^{d}\right\}$ for any $n>k$ . Therefore, if $\mathcal{T}_{0}((Z_{n})_{n\geq 1})\in\mathbb{N}$ , this value is expressing the last time the data-driven partition is different from the trivial solution $\left\{\mathbb{R}^{d}\right\}$ . On the other hand, if $\mathcal{T}_{0}((Z_{n})_{n\geq 1})=\infty$ , this condition means that non-trivial partitions are observed infinitely often (i.o.) in the sequence $(\pi_{b_{n},\delta_{n}}(Z^{n}))_{n\geq 1}$ . In general, we could have that $\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})=\infty)>0$ under $\mathcal{H}_{0}$ , which does not contradict the capacity that $(\phi_{b_{n},\delta_{n},a_{n}}(\cdot))_{n\geq 1}$ has for detecting independence consistently under $\mathcal{H}_{0}$ (see Corollary 1).

The following result shows that $\mathcal{T}_{0}((Z_{n})_{n\geq 1})$ is finite with probability one under some mild conditions.

LEMMA 3

Let us assume that $(b_{n})\approx(n^{-l})$ for $l\in(0,{1}/{3})$ and $(\delta_{n})$ is $\ell_{1}(\mathbb{N})$ , $(1/\delta_{n})$ is $\mathcal{O}({e}^{n^{1/3}})$ . Then under the hypothesis that $P\in\mathcal{P}_{0}$ : $\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})<\infty)=1$ .

The proof is presented in Appendix D.

Importantly, the condition $\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})<\infty)=1$ implies that $(\phi_{{b_{n},\delta_{n}},a_{n}}(Z_{1},....,Z_{n}))_{n\geq 1}$ reaches $0$ eventually with probability one for any sequence $(a_{n})_{n\geq 1}\in[0,1]^{\mathbb{N}}$ . From this observation, we obtain the following result that improves the regime of parameters where a consistent test is achieved (established in Corollary 1).

THEOREM 3

If $(b_{n})_{n\geq 1}\approx(n^{-l})$ for $l\in(0,{1}/{3})$ , $(\delta_{n})_{n\geq 1}$ is $\ell_{1}(\mathbb{N})$ , $(1/\delta_{n})_{n\geq 1}$ is $\mathcal{O}({e}^{n^{1/3}})$ , and $(a_{n})_{n\geq 1}$ is $o(1)$ , then the scheme $(\phi_{{b_{n},\delta_{n}},a_{n}}(Z_{1},....,Z_{n}))_{n\geq 1}$ is strongly consistent for detecting independence (Def. 1).

The proof is presented in Appendix E.

Finally, if we focus on the admissible solution where $(1/\delta_{n})\approx(e^{n^{1/3}})$ , we have the following refined description for the distribution of $\mathcal{T}_{0}((Z_{n})_{n\geq 1})$ under $\mathcal{H}_{0}$ .

THEOREM 4

Under the assumption of Theorem 3, let us consider that $(1/\delta_{n})\approx(e^{n^{1/3}})$ . Under $\mathcal{H}_{0}$ , we have that $\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})\geq m)\leq Ke^{-m^{1/3}}\sim\mathcal{O}(e^{-m^{1/3}})$ for any $m\geq 1$ and for some universal constant $K>0$ .

The proof is presented in Appendix F.

VI-B Remarks about Theorem 4:

1: If we introduce $\mathcal{T}((Z_{n})_{n\geq 1},(a_{n})_{n\geq 1})\equiv$

\sup{\left\{m\geq 1:\phi_{b_{m},\delta_{m},a_{m}}(Z_{1},..,Z_{m}))=1\right\}},

(23)

as the first time when the binary process $(\phi_{b_{n},\delta_{n},a_{n}}(Z_{1},..,Z_{n}))_{n\geq 1}$ reaches and stays at $0$ , it is simple to verify that $\mathbb{P}(\mathcal{T}((Z_{n})_{n\geq 1},(a_{n})_{n\geq 1})\geq m)\leq\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})\geq m)$ for any $(a_{n})_{n}$ (see Appendix E). Therefore, under $\mathcal{H}_{0}$ and the assumptions of Theorem 4, we have as a corollary that $\mathbb{P}(\mathcal{T}((Z_{n})_{n\geq 1},(a_{n})_{n\geq 1})\geq m)\leq Ke^{-m^{1/3}}\sim\mathcal{O}(e^{-m^{1/3}})$ for any sequence $(a_{n})_{n\geq 1}$ as long as $(a_{n})_{n\geq 1}$ is $o(1)$ .
2: With this bound, we could determine (under $\mathcal{H}_{0}$ ) a lower bound on the number of samples needed to guarantee that we detect independence structurally, i.e., reaching $\pi_{b_{n},\delta_{n}}=\left\{\mathbb{R}^{d}\right\}$ , and also with our schemes $(\phi_{b_{n},\delta_{n},a_{n}}(\cdot))_{n\geq 1}$ with a confidence probability $1-\epsilon$ . This critical number of samples is achieved for $m(\epsilon)$ such that $Ke^{-m(\epsilon)^{1/3}}<\epsilon$ and $Ke^{-(m(\epsilon)-1)^{1/3}}\geq\epsilon$ . This number scales like $(m(\epsilon))_{\epsilon}\approx\left(\ln(K/\epsilon)\right)^{3}\sim\mathcal{O}(\ln(1/\epsilon)^{3})$ .

VII Implementation

VII-A The Learning Algorithm

Let us briefly revisit the stages presented in Section V for the construction of our TSP scheme. Beginning with the partition, there are two phases: growing and pruning. In the growing phase, presented in Section V-A, we have a collection of tree-indexed partitions $\left\{\pi_{T},T\ll T^{full}_{b_{n}}\right\}$ where $\pi_{T}=\left\{A_{v},v\in\mathcal{L}(T)\right\}$ and by construction $\hat{P}_{n}(A_{v})\geq b_{n}=w\cdot n^{-l}$ for any $v\in T$ . Here, we added a scalar parameter $w>0$ as a relevant design element for our numerical analysis. In the pruning phase, we implement the following complexity-regularized information maximization:

\displaystyle\hat{T}_{b_{n},\delta_{n}}(\alpha)\equiv\arg\max_{T\ll T^{full}_{b_{n}}}D_{\sigma({\pi_{T}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})-\alpha\cdot r_{b_{n},\delta_{n}}(\left|T\right|),

(24)

where we included a regularized parameter $\alpha\in\mathbb{R}^{+}$ .

Refer to caption — Figure 1: Illustration of the trade-off between $M_{0}(\epsilon=0.05)$ and $M^{\sigma}_{1}(\epsilon=0.05)$ , presented in Eqs. (29) and (30), for different parameters of our TSP approach. Each curve is associated with fixed values of $l$ and $w$ and is produced using different values of $\alpha$ . Three scenarios of correlation (under $\mathcal{H}_{1}$ ) are $\sigma=\mathbb{E}(XY)\in\left\{0.3,0.5,0.7\right\}$ . For the graph in the first row $w=0.05$ , whereas $l=0.167$ in the graph depicted in the second row.

VII-B Computational Cost

To analize the cost of implementing (24), we analyze the cost of each of its two phases individually.

Growing Phase: During the growing stage (see Section V-A), we obtain a collection of tree-indexed partitions $\left\{\pi_{T},T\ll T^{full}_{b_{n}}\right\}$ , where $\pi_{T}=\left\{A_{v},v\in\mathcal{L}(T)\right\}$ . The full tree $T^{full}_{b_{n}}$ is obtained after a series of statistically equivalent binary partitions. The computational cost of the growing phase depends on the number of sub-partitions (or binary splittings) needed to obtain the full tree (which depends on the number of cells), and the cost of sorting the projected data of each cell and selecting the median of the ordered sequence to partition the cell. The number of cells in the full tree $\left|\pi_{T^{full}_{b_{n}}}\right|$ can be estimated considering that our design criterion states that $\hat{P}_{n}(A_{v})\geq b_{n}=w\cdot n^{-l}$ for any $v\in T$ , where $w\in\mathbb{R}^{+}$ and $l\in(0,1/3)$ are design parameters of our method. Then, we should have at least $b_{n}\cdot n$ samples per cell. Therefore, the number of cells associated with $T^{full}_{b_{n}}$ is bounded by

\left|\pi_{T^{full}_{b_{n}}}\right|\leq\frac{n}{b_{n}\cdot n}=\frac{1}{b_{n}}\approx n^{l}

(25)

On the other hand, the cost of sorting the projected data of each cell depends on the number of data points to be ordered. In particular, if a cell has $m$ points, then the sorting cost is $\mathcal{O}(m\log(m))$ [49]. As we reach deeper levels in the tree, the number of points per cell reduces. If we consider a level of depth $k$ (where $k=1$ corresponds to the root and $k=\left\lceil\log\left(\left|\pi_{T^{full}_{b_{n}}}\right|\right)\right\rceil\approx\lceil\log(n^{l})\rceil$ corresponds to the full tree), there are $2^{k-1}$ cells and each cell contains roughly $n/2^{k}$ points. Therefore, the order of the total number of operations during the growing phase can be approximated to

\mathcal{O}\left(\sum_{k=1}^{\lceil\log(n^{l})\rceil}2^{k-1}\cdot\frac{n}{2^{k-1}}\log\left(\frac{n}{2^{k-1}}\right)\right)=\mathcal{O}\left(n\log^{2}(n)\right)

Pruning Phase: The main regularized info-max problem in Eq. (24) is solved in this stage. The first term $D_{\sigma({\pi_{T}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})$ is additive (in the sense presented in [45]), whereas the penalty $r_{b_{n},\delta_{n}}(\left|T\right|)$ scales like $\mathcal{O}(\sqrt{\left|T\right|})$ , and, therefore, it is a sub-additive function of the size of the tree. In this scenario, Scott [45, Th. 1] showed that the family of minimum cost trees $\{R_{i}\}_{i=1}^{m}$ :

R_{i}\equiv\arg\max_{\begin{subarray}{c}T\ll T^{full}_{b_{n}}\\ |T|=m-i+1\end{subarray}}D_{\sigma({\pi_{T}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})\quad,\,i\in\{1\dots m\}

(26)

is embedded, i.e., $R_{1}=T^{full}_{b_{n}}\gg R_{2}\gg\ldots\gg R_{m}=\left\{A_{root}\right\}$ . Furthermore, [45, Th. 2] states that the solution of our problem in (24) corresponds to one of these embedded trees: i.e., for any $\alpha>0$ , $\exists i\in\{1\dots m\}$ such that $\hat{T}_{b_{n},\delta_{n}}(\alpha)=R_{i}$ . This embedded property allows computationally efficient algorithms to be designed. Indeed, Scott [45] presented two algorithms to solve (24) with a worst case cost of $\mathcal{O}\left(\left|T^{full}_{b_{n}}\right|^{2}\right)$ . In our case we have that $\left|T^{full}_{b_{n}}\right|\leq 1/b_{n}$ , then the pruning stage in (24) has a computational cost of $\mathcal{O}\left(n^{2/3}\right)$ , which is a polynomial (sub-linear) function of $n$ . The pseudo code of this stage is presented in Algorithm 1.

In summary, the construction of our partition (growing and pruning) has a computational cost that is $\mathcal{O}(n\log^{2}(n))$ on the sample size $n$ .

Input : full tree

T^{full}_{b_{n}}

, minimum probability of partitions

b_{n}

, confidence probability

\delta_{n}

, regularization parameter

\alpha

Output : pruned tree

\hat{T}_{b_{n},\delta_{n}}

in Eq.(24)

Initialization;

\hat{T}_{b_{n}}^{1}=\{root\}

;

Family pruning problem of embedded trees;

for $k=2$ to $|T^{full}_{b_{n}}|$ do

Maximum empirical information tree of size

k

;

v^{*}=\underset{v\in\mathcal{L}(\hat{T}_{b_{n}}^{k-1})}{\operatorname*{arg\,max}}D_{\sigma{{}_{\big{(}}}\pi_{\hat{T}_{b_{n}}^{k-1}\cup\{l(v),r(v)\}}{{}_{\big{)}}}}(\hat{P}_{n}||\hat{Q}_{n}^{*})

;

\hat{T}_{b_{n}}^{k}=\hat{T}_{b_{n}}^{k-1}\cup\{l(v^{*}),r(v^{*})\}

;

end for

Regularized empirical information maximization;

\hat{T}_{b_{n},\delta_{n}}=\underset{T\in\left\{\hat{T}_{b_{n}}^{1},...,\hat{T}_{b_{n}}^{\left|T_{b_{n}}^{full}\right|}\right\}}{\operatorname*{arg\,max}}D_{\sigma(\pi_{T})}(\hat{P}_{n}||\hat{Q}_{n}^{*})-\alpha\cdot r_{b_{n},\delta_{n}}(|T|)

;

r_{b_{n},\delta_{n}}(|\hat{T}_{b_{n}}^{1}|)=0

Algorithm 1 TSP family pruning

VIII Empirical Analysis

In this section, we present some controlled synthetic scenarios to evaluate the performance of our method. We begin analyzing how the selection of parameters affects the capacity of our scheme to estimate MI¹⁰¹⁰10Given the space limit, this preliminary analysis on MI estimation is relegated to the Supplemental Material - Appendix G-A.. From these results and insights, we analyze the performance of the solutions $\left\{\hat{T}_{b_{n},\delta_{n}}(\alpha),\alpha\geq 0\right\}$ for testing independence. For the rest of this section, $\pi_{b_{n},\delta_{n}}^{\alpha}(\cdot)$ denotes the TS partition, $\hat{i}_{b_{n},\delta_{n}}^{\alpha}(\cdot)$ denotes the MI estimator, and $\phi_{b_{n},\delta_{n},a_{n}}^{\alpha}(\cdot)$ denotes the final test.

VIII-A Testing Independence: Non-Asymptotic Empirical Analysis

We analyze the problem of testing independence with our data-driven test $\phi_{b_{n},\delta_{n},a_{n}}^{\alpha}(\cdot)$ in (7). Here we focus on the non-asymptotic capacity of our framework to detect the two hypotheses. For this, we propose the following detection times:¹¹¹¹11To simplify the notation, the dependency of $\mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1})$ and $\mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1})$ on $(b_{m})$ , $(\delta_{m})$ , $(a_{m})$ and the scalar parameters $w$ , $\alpha$ introduced in Sec.VII-A will be considered implicit.

	$\displaystyle\mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1})\equiv\sup{\left\{m\geq 1:\phi^{\alpha}_{b_{m},\delta_{m},a_{m}}(Z_{1},..,Z_{m})=1\right\}},$		(27)
	$\displaystyle\mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1})\equiv\sup{\left\{m\geq 1:\phi^{\alpha}_{b_{m},\delta_{m},a_{m}}(Z_{1},..,Z_{m})=0\right\}}.$		(28)

$\mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1})$ and $\mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1})$ are random variables (rvs.) in $\mathbb{N}^{*}=\mathbb{N}\cup\left\{\infty\right\}$ determining when our test reaches $0$ and $1$ , respectively. Indeed, Theorem 3 (under specific conditions) tells us that under $\mathcal{H}_{0}$ , $\mathbb{P}(\mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1})<\infty)=1$ and that under $\mathcal{H}_{1}$ , $\mathbb{P}(\mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1})<\infty)=1$ . However, we are interested in the complete distribution of the rvs. $\mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1})$ and $\mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1})$ under $\mathcal{H}_{0}$ and $\mathcal{H}_{1}$ , respectively. In particular, we are interested in evaluating the pmf of $\mathcal{\tilde{T}}_{i}(\cdot)$ , i.e., $(\mathbb{P}(\mathcal{\tilde{T}}_{i}((Z_{n})_{n\geq 1})=k))_{k\geq 1}$ under $H_{i}$ (with $i\in\left\{0,1\right\}$ ). Looking at these distributions, for any $\epsilon>0$ we can define

	$\displaystyle M_{0}(\epsilon)$	$\displaystyle\equiv\min\left\{m\geq 1,\ \mathbb{P}(\mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1})\leq m)\geq 1-\epsilon\right\}$		(29)
	$\displaystyle M_{1}(\epsilon)$	$\displaystyle\equiv\min\left\{m\geq 1,\ \mathbb{P}(\mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1})\leq m)\geq 1-\epsilon\right\}$		(30)

$M_{0}(\epsilon)$ (and $M_{1}(\epsilon)$ ) indicates how many observations (sampling complexity) are needed to detect independence (and statistical dependency) with a confidence probability $1-\epsilon$ under $\mathcal{H}_{0}$ (and $\mathcal{H}_{1}$ ). In the context of our solution, for a given TSP scheme (function of the parameters $l$ , $w$ , $\alpha$ and the sequence $(a_{n})$ ), we will look at the trade-off expressed in the pair $(M_{0}(\epsilon),M_{1}(\epsilon))$ of induced tests when varying key parameters of our method.

VIII-A1 Simulation Setting

We consider a joint vector $Z=(X,Y)$ following a zero-mean Gaussian distribution in $\mathbb{R}^{2}$ where the correlation coefficient determining $I(X,Y)$ is parametrized by $\sigma=\mathbb{E}(XY)$ . Concerning the alternative hypothesis ( $\mathcal{H}_{1}$ ), we consider different levels of MI indexed by $\sigma=\mathbb{E}(XY)\in\left\{0.1,0.3,0.5,0.7,0.9\right\}$ . We use $1,000$ iid realizations of $(X,Y)$ under the different scenarios ( $\mathcal{H}_{0}$ and $\mathcal{H}^{\sigma}_{1}$ ) to run our test with these iid samples. By doing so, we obtain $1,000$ realizations of the rvs. $\mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1})$ and $\mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1})$ and with those realizations we obtain an empirical estimation of their pmfs. and empirical estimations of $M_{0}(\epsilon)$ and $M^{\sigma}_{1}(\epsilon)$ (indexed by the value $\sigma$ ), respectively.¹²¹²12By the law of large numbers, the estimators of $M_{0}(\epsilon)$ and $M^{\sigma}_{1}(\epsilon)$ are strongly consistent. $1,000$ samples shown to be sufficient for the purpose of the analysis.

VIII-A2 Parameter Selection for $\phi_{b_{n},\delta_{n},a_{n}}^{\alpha}(\cdot)$

To evaluate the sensitivity of our TSP scheme, we consider the following fixed sequences $(a_{n})=(0.5\cdot n^{-1})$ and $(\delta_{n})=(e^{-n^{1/3}})$ in the admissible regime established in Theorem 3.¹³¹³13 $(a_{n})_{n\geq 1}$ needs to be $o(1)$ and $(\delta_{n})$ needs to be $\ell_{1}(\mathbb{N})$ . Other configurations for $(a_{n})$ and $(\delta_{n})$ can be explored within the admissible range declared in Theorem 3. Considering $(b_{n})=(w\cdot n^{-l})$ (parametrized by $w$ and $l$ ) and $\alpha$ (used to solve $\hat{T}_{b_{n},\delta_{n}}(\alpha)$ in Eq.(24)), we consider a preliminary analysis on MI estimation to select a range of reasonable values (in Supplemental Material - Appendix G-A). In particular, we consider $l\in\left\{0.001,0.125,0.167,0.25,0.3\right\}$ and $w\in\left\{0.005,0.01,0.05,0.1,0.2\right\}$ . We proceed as follows: given fixed parameters $l$ and $w$ , we explore values of $\alpha$ in $[0,4\cdot 10^{-4})$ to express the trade-off between $M_{0}(\epsilon)$ and $M^{\sigma}_{1}(\epsilon)$ under different data scenarios: $\sigma\in\left\{0.3,0.5,0.7\right\}$ .

VIII-A3 Results

Figure 1 shows the curves expressing the trade-off between $M_{0}(\epsilon)$ and $M^{\sigma}_{1}(\epsilon)$ for different parameter configurations of our method in the range for $l$ and $w$ mentioned above and for $\epsilon=0.05$ . In general, we notice that the effect of these two parameters is relevant in the trade-off expressed in the curves. Both $w$ and $l$ determine the growing phase (i.e., the creation of the full tree) and also the pruning phase (regularization in Eq.(24)) because $r_{b_{n},\delta_{n}}(\left|T\right|)$ is also a function of $b_{n}$ . Therefore, the effects of $(b_{n})_{n\geq 1}$ on the results are not easy to express theoretically.

Describing the curves, we could say in general that a bigger value of $w$ within the explored range reduces the full tree’s size in the growing phase. This effect is expressed in a better trade-off of the pair ( $M_{0}(\epsilon)$ , $M^{\sigma}_{1}(\epsilon)$ ) in the regime where $M_{0}(\epsilon)\leq 10^{2}$ , at the expense of a worse trade-off of ( $M_{0}(\epsilon)$ , $M^{\sigma}_{1}(\epsilon)$ ) when $M_{0}(\epsilon)\geq 10^{3}$ . A similar general effect in the trade-off ( $M_{0}(\epsilon)$ , $M^{\sigma}_{1}(\epsilon)$ ) is produced when decreasing the value of $l$ within the explored range. Our family of solutions offers a collection of different trade-offs between $M_{0}(\epsilon)$ and $M^{\sigma}_{1}(\epsilon)$ by exploring different values of $\alpha\in[0,4\cdot 10^{-4})$ in our solution. Therefore, the selection of the best parameters should be a function of the regime of sample size we want to consider for the detection of $\mathcal{H}_{0}$ . For the final comparison with alternative approaches, we decided to consider one of the curves with a less prominent decreasing transition in Figure 1 (obtained for $w=0.1$ and $l=0.001$ ).

Figure 2 illustrates the estimated (empirical) pmfs of $\mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1})$ and $\mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1})$ under $\sigma=0$ ( $\mathcal{H}_{0}$ ) and $\sigma=0.3$ ( $\mathcal{H}_{1}$ ), respectively. These histograms illustrate how the distributions are affected by the regularization parameter $\alpha$ and, with that, the trade-off in $M_{0}(\epsilon)$ and $M^{\sigma}_{1}(\epsilon)$ observed in Figure 1. Specifically, a small value of $\alpha$ concentrates the distribution of $\mathcal{T}_{0}((Z_{n})_{n\geq 1})$ in a high number of observations, while $\mathcal{T}_{1}((Z_{n})_{n\geq 1})$ concentrates on a lower number of observations. This leads to a high value of $M_{0}(\epsilon)$ and a low value of $M_{1}^{\sigma}(\epsilon)$ for that specific $\alpha$ . The opposite holds for higher values of $\alpha$ .

Finally, we compared our TSP scheme in terms of the trade-off in $(M_{0}(\epsilon),M^{\sigma}_{1}(\epsilon))$ with $\epsilon=0.05$ and $\sigma\in\left\{0.1,0.3,0.5,0.7,0.9\right\}$ (for $\mathcal{H}_{1}$ ) exploring three strategies presented in the literature [20]: the $L_{1}$ test, the log-likelihood test, and the Pearson- $\chi^{2}$ test. Each trade-off curve is generated by multiplying a well-selected range of values $C\in\mathbb{R}$ to the corresponding threshold for the independence detection [20]. For the parameters of the alternative tests, we consider regimes where these tests are strongly consistent¹⁴¹⁴14We consider the number of partitions per dimension as $m(n)=n^{p}$ with $p\in(0,0.5)$ , which satisfies the strong-consistency conditions of the $L_{1}$ and the log-likelihood tests established in [20]. Although there is no explicit proof of Pearson- $\chi^{2}$ test strong-consistency, we use a regime of values for $p\in(0,1/3)$ according to the range suggested in [20].. Within that, we selected parameters that offer a smooth transition between their sampling complexity pairs’ trade-off as we did to select the parameters of our method.¹⁵¹⁵15More information is presented in the Supplemental Material - Appendix G-B.

Figure 3 presents these curves for our method and the alternative approaches. These curves express the trade-off between the ability to detect independence and non-product probabilistic structure ( $\mathcal{H}_{0}$ and $\mathcal{H}_{1}$ ) from data. In general, we could say that our method shows better results in almost all explored scenarios ( $\sigma\in\left\{0.1,0.3,0.5,0.7,0.9\right\}$ ) where the TSP trade-off curves dominate the others. There is one exception to this general trend in the regime of $M_{0}(\epsilon)<10^{2}$ for lower values of correlation for $\mathcal{H}_{1}$ . However, in all the other regimes, our method performs better than the alternatives. In particular, it performs better than its closest relative, which is the log-likelihood test that uses non-adaptive partitions [20]. Interestingly, our method’s performance improvements increase with the magnitude of $\sigma$ (for $\mathcal{H}_{1}$ ). This shows that our approach’s advantage is more prominent when the alternative scenario has a higher level of mutual information.

VIII-B Multidimensional Analysis

We conclude this analysis by evaluating our method in higher dimensions. As in the previous analysis $d=2$ , we need to select the parameters $w$ and $l$ of our scheme. In this multi-dimensional context, we select a value of $l$ that offers a smooth trade-off curve according to the experimental results in Figure 1. However, as $w$ has a stronger impact on the resulting number of partitions, rather than choosing it individually for every dimension $d$ , we propose $w=C^{p\cdot d}$ as a heuristic rule for selecting $w$ according to $d$ , where $C>0$ and $p>0$ .

This rule and its parameters were designed on the principle that every dimension of the joint space $\mathbb{R}^{d}$ should be explored at least once in the growing phase. This criterion ensures that our data-driven scheme explores every coordinate of $\mathbb{R}^{d}$ in the pursuit of detecting relevant statistical dependencies under $\mathcal{H}_{1}$ . For a dimension $d>0$ , the full tree needs to have at least $2^{d}$ leaves (cells). For this, a basic condition is that $b_{n}$ is lower bounded by $1/n$ (our stopping criterion requires at least one sample per cell), which implies that to explore all the $d$ dimensions we need at least $2^{d}$ i.i.d. samples to meet this requirement. Then assuming the non-trivial regime $n\geq 2^{d}$ , we selected $C$ and $p$ in our rule with the objective that $\left|T^{full}_{b_{n}}\right|\geq 2^{d}$ is met independent of the value of $d$ (as long as $n\geq 2^{d}$ ). This happens if $w$ becomes smaller when $d$ increases, which happens in our rule for $C<1$ and $p>0$ .

Figure 4 shows the number of cells of the full-tree, namely $\left|T^{full}_{b_{n}}\right|$ , for two settings: $w=0.1^{d/2}$ ( $C=0.1$ and $p=1/2$ ) in the left panel, and $w=0.225^{d/2}$ ( $C=0.225$ and $p=1/2$ ) in the right panel. The curves are shown across $d$ for three scenarios of sample-size $n=10^{3}$ , $n=10^{4}$ and $n=10^{5}$ . We also include the curve $2^{d}$ (dashed line) to indicate the critical lower bound. Both settings satisfy the proposed criterion of partitioning each dimension at least once (points above the diagonal) in the whole regime when $n\geq 2^{d}$ . The horizontal line shows the points where the number of partitions generated by our method is equal to the number of samples (which happens when $b_{n}=1/n$ ). Interestingly, when $d$ approaches the critical condition $2^{d}=n$ , our full tree meets a scenario where every cell has one sample. Of the two settings, the selection on the right panel is more conservative, and it is the one we will use for the rest of this analysis. This selection partitions each dimension just above the minimum requirement (for the values of $d$ where $n\geq 2^{d}$ ) and achieves the critical condition $n=2^{d}$ as $d$ increases.

Using this rule $w=0.225^{d/2}$ and $l=0.001$ , we consider a zero-mean joint Gaussian vector $({X},{Y})$ in $\mathbb{R}^{d}$ , where $X=(X_{1},..,X_{p})$ and ${Y}=(Y_{1},..,Y_{q})$ and, under $\mathcal{H}_{1}$ , a diagonal co-variance of the form $cov(X_{i},Y_{j})=\delta_{ij}\sigma(d)^{2}$ (with $\delta_{ij}$ the Kronecker’s delta). For these results, we consider $n=10^{4}$ i.i.d samples meeting the condition that $n\geq 2^{d}$ for $d\in\left\{4,6,8,10,12\right\}$ . For the alternative, we consider $\sigma\in\{0.3,0.5,0.7\}$ for $d=2$ ( $p=q=1$ ), and we choose the values of $\sigma(d)$ for $d\geq 4$ such that the mutual information between $X$ and $Y$ is the same as that obtained for $d=2$ with $\sigma\in\{0.3,0.5,0.7\}$ . Therefore, for the alternative, we create three scenarios of constant MI $I(X;Y)\in\{0.06803,0.20752,0.48572\}$ across dimensions. This experimental design allows us to isolate the effect of dimensionality from the discrimination of the underlying task, which is known to be proportional to the value of MI under $\mathcal{H}_{1}$ [32, 35, 36].

Figure 5 presents the trade-off between $(M_{0}(\epsilon),M^{\sigma}_{1}(\epsilon))$ for the following values of $\sigma(d)$ (under $\mathcal{H}_{1}$ ) and for different dimensions $d\in\left\{4,6,8,10,12\right\}$ . First, we confirm that our scheme detects both hypotheses with a finite sample size as our theoretical results predict. Here, we observe that by increasing the dimensions of the problem the curves $(M_{0}(\epsilon),M^{\sigma}_{1}(\epsilon))$ show sharper transitions (beyond a critical value for $M_{0}(\epsilon)$ ) and overall the problems become (in general) more difficult: for a given value of $M_{0}(\epsilon)$ , its respective $M^{\sigma}_{1}(\epsilon)$ increases with the dimension $d=p+q$ . In other words, for a relatively good capacity to detect independence (when $M_{0}(\epsilon)\leq 10^{3}$ ), it is more challenging for our test to detect the alternative as $d$ is higher. This performance trend could be attributed to the observation that in higher dimensions it is more challenging for a non-parametric framework to detect salient features under the alternative $\mathcal{H}_{1}$ and, consequently, more observations (evidence) are required to correctly detect the alternative with probability $1-\epsilon$ . On the other hand, if we fix a dimension, let us say $d=p+q=8$ , the higher the MI the better the trade-off between ( $(M_{0}(\epsilon),M^{\sigma}_{1}(\epsilon))$ ), which is a trend consistent with our observation that MI is an indicator of the task discrimination, and consequently, the performance of our scheme is sensitive to the level of MI under the alternative.

IX Summary and Final Discussion

We present a novel framework to address the problem of universal testing of independence using data-driven representations. Consistency and finite length (non-asymptotic) results express our solution’s capacity to adapt its representations to the problem’s sufficient statistics. This capacity is particularly evident under the hypothesis of independence. Precise results show that our scheme detects this scenario – collapsing to the trivial partition with one cell for representing the joint space – with a finite sample size. On the experimental side, our solution offers a computationally efficient implementation.

In a controlled experimental setting, we show that our scheme offers a clear advantage compared with alternative non-parametric solutions that do not adapt their representations to the data. These results confirm the critical role that data-driven partitions could play in implementing a universal test of independence. Further analysis is needed to fully uncover this method’s potential for machine learning (ML) problems and other applications. We anticipate that ML algorithms could benefit from both the representations obtained from our solutions (as an approximator of sufficient statistics) and our solution’s capacity to detect independence with a finite number of samples.

IX-A Limitations and Future Work

Our distribution-free results in Section VI provide a range of admissible parameters for our test; however, they do not provide a criterion for selecting them in a specific problem setting. To address this last practical aspect, we select a set of parameters from empirical observations and some basic heuristic rules. Then, it is an open research avenue to fully study conditions that would make a selection of optimal parameters (or nearly optimal) for our test under some specific conditions. Along these lines, it is relevant to further investigate specific classes of models, and based on this model assumption find a more constrained range of good parameters with improved finite-length performance results. In this vein, there are interesting ideas and results worth exploring about adaptivity that have been explored in statistical learning [50] and universal source coding [51, 52].

IX-B A Recent Related Work

The authors in [34] have recently introduced a non-parametric test with a similar oracle-base design principle. They proposed a data-adaptive weight strategy to approximate the ideal weights of a NP statistic. That strategy has an interesting connection with our design approach in Section IV because they adapt key parameters of their method to approximate a best (oracle) statistic. On the differences, the authors in [34] addressed the independence test by performing a binary expansion filtration of the observations and adapting the weights matrix of a quadratic form. Instead, our work addresses the independence test by estimating the mutual information (our statistics) through an adaptive tree-structured partition of the observation space. Therefore, although both approaches share a similar learning principle, the methods and components used to adapt them to the sample are entirely different.

IX-C Statistical Significance Analysis

We want to conclude with a discussion about the statistical significance of our test. Given our test $\phi_{b_{n},\delta_{n},a_{n}}(\cdot)$ , and assuming $\mathcal{H}_{0}$ , the statistical significance is expressed by $\alpha(b_{n},\delta_{n},a_{n})\equiv\mathbb{P}(\left\{\phi_{b_{n},\delta_{n},a_{n}}(Z_{1},..,Z_{n})=1\right\})$ . Importantly, from the proofs of Theorem 2 and Theorem 3, we have the following:

PROPOSITION 2

Under the assumptions of Theorem 2 on $(b_{n})$ and $(\delta_{n})$ , for any $P\in\mathcal{P}_{0}$ ( $\mathcal{H}_{0}$ ) and any $a_{n}>0$ , it follows that $\alpha(b_{n},\delta_{n},a_{n})\leq\delta_{n}$ , $\forall n\geq 1$ .¹⁶¹⁶16The proof is presented in the Supplemental Material - Appendix G-D.1).

This result implies that the statistical significance of our test is fully controlled by the sequence $(\delta_{n})$ , which is one of our design parameters. Furthermore, the proof of this result shows that this bound is uniform over the class of models in $\mathcal{P}_{0}$ . Direct corollaries of Proposition 2 are the following:

•

adding the assumptions of Theorem 3 and under $\mathcal{H}_{0}$ , it follows that $\lim_{n\rightarrow\infty}\alpha(b_{n},\delta_{n},a_{n})=0$ . Importantly, this convergence to zero is uniform for every model in $\mathcal{P}_{0}$ . [Uniform vanishing condition on $(\alpha(b_{n},\delta_{n},a_{n}))_{n\geq 1}$ ]
•

adding the assumptions of Theorem 4 and under $\mathcal{H}_{0}$ , it follows that for any $n\geq 1$ , $\alpha(b_{n},\delta_{n},a_{n})\leq\delta_{n}\approx e^{-n^{1/3}}$ . Therefore, we achieve an exponentially fast vanishing error under $\mathcal{H}_{0}$ . Importantly, this velocity is obtained uniformly (distribution-free) over $P\in\mathcal{P}_{0}$ .

These uniform bounds on the statistical significance are obtained under $\mathcal{H}_{0}$ . Under $\mathcal{H}_{1}$ , we have from strong consistency (Definition 1) that $\mathbb{P}(\left\{\phi_{b_{n},\delta_{n},a_{n}}(Z_{1},..,Z_{n})=1\right\})$ (the power of the test) tends to $1$ as $n$ tends to infinity.¹⁷¹⁷17This argument is presented in the Supplemental Material - Appendix G-D.2).

X Acknowledgment

This manuscript is based on work supported by grants of CONICYT-Chile, Fondecyt 1210315, and the Advanced Center for Electrical and Electronic Engineering, Basal Project FB0008. We thank the anonymous reviewers for providing valuable comments and suggestions that helped to improve the technical quality and organization of this work. Finally, we thank Diane Greenstein for editing and proofreading all this material.

Appendix A Proof of Theorem 1

Proof:

In general, first we know that if $(\phi_{n}(Z^{n}_{1}))_{n\geq 1}$ reaches $0$ eventually with probability one, it is equivalent to saying that the process $(\phi_{n}(Z^{n}_{1}))_{n\geq 1}$ does not visit the event $\left\{1\right\}$ infinitely often (i.o.). This observation reduces to verify the following

		$\displaystyle\mathbb{P}\left(\lim\sup_{n}\left\{z^{n}_{1}\in\mathbb{R}^{dn}:\phi_{n}(z^{n}_{1})=1\right\}\right)\equiv$
		$\displaystyle\mathbb{P}\left(\bigcap_{m\geq 1}\bigcup_{n\geq m}\left\{z^{n}_{1}\in\mathbb{R}^{dn}:\phi_{n}(z^{n}_{1})=1\right\}\right)=0.$		(31)

Equivalently, saying that $(\phi_{n}(Z^{n}_{1}))_{n\geq 1}$ reaches $1$ eventually with probability one reduces to

		$\displaystyle\mathbb{P}\left(\lim\sup_{n}\left\{z^{n}_{1}\in\mathbb{R}^{dn}:\phi_{n}(z^{n}_{1})=0\right\}\right)\equiv$
		$\displaystyle\mathbb{P}\left(\bigcap_{m\geq 1}\bigcup_{n\geq m}\left\{z^{n}_{1}\in\mathbb{R}^{dn}:\phi_{n}(z^{n}_{1})=0\right\}\right)=0.$		(32)

For the rest of the proof, we use $O^{n}_{1},..,O^{n}_{n}$ to denote $O_{\pi_{n}}(Z_{1}),....,O_{\pi_{n}}(Z_{n})$ and $o^{n}_{1},..,o^{n}_{n}$ to denote $O_{\pi_{n}}(z_{1}),....,O_{\pi_{n}}(z_{n})$ .

Under $\mathcal{H}_{0}$ , we have from the third hypothesis that ${\hat{i}_{\pi_{n}}(O^{n}_{1},....,O^{n}_{n})}/{a_{n}}$ tends to $0$ with probability one, which means that for any $\epsilon>0$

\mathbb{P}\left(\bigcap_{m\geq 1}\bigcup_{n\geq m}B^{\pi_{n},a_{n}}_{\epsilon,n}\right)=0,

(33)

where $B_{\epsilon,n}^{\pi_{n},a_{n}}\equiv\left\{z^{n}_{1}\in\mathbb{R}^{nd}:\hat{i}_{\pi_{n}}(o^{n}_{1},....,o^{n}_{n})>\epsilon\cdot a_{n}\right\}$ . Under $\mathcal{H}_{0}$ , let us consider the error event of $\phi_{\pi_{n},a_{n}}()$ by

E_{n}^{\pi_{n},a_{n}}\equiv\left\{z^{n}_{1}\in\mathbb{R}^{nd}:\phi_{\pi_{n},a_{n}}(o^{n}_{1},..,o^{n}_{n})=1\right\}.

If $z^{n}\in E_{n}^{\pi_{n},a_{n}}$ (by definition of the rule $\phi_{\pi_{n},a_{n}}(\cdot)$ ), it follows that $\hat{i}_{{\pi_{n}}}(o^{n}_{1},....,o^{n}_{n})\geq a_{n}$ (see (7)). Then from definition of $B_{\epsilon,n}^{\pi_{n},a_{n}}$ , for any $\epsilon<1$ it follows that $E_{n}^{\pi_{n},a_{n}}\subset B_{\epsilon,n}^{\pi_{n},a_{n}}$ . This implies that $\bigcap_{m\geq 1}\bigcup_{n\geq m}E_{n}^{\pi_{n},a_{n}}\subset\bigcap_{m\geq 1}\bigcup_{n\geq m}B_{\epsilon,n}^{\pi_{n},a_{n}}$ which from (33) implies that

\mathbb{P}\left(\bigcap_{m\geq 1}\bigcup_{n\geq m}E_{n}^{\pi_{n},a_{n}}\right)=0.

(34)

On the other hand, under the assumption that $I(X,Y)>0$ ( $\mathcal{H}_{1}$ ), let us choose $\epsilon_{0}\in(0,I(X,Y))$ . Then, from the second assumption (see Definition 2), it follows that $\hat{i}_{\pi_{n}}(O^{n}_{1},....,O^{n}_{n})$ convergences to something strictly greater than $\epsilon_{0}$ with probability one. This formally means that

\mathbb{P}\left(\bigcap_{m\geq 1}\bigcup_{n\geq m}\left\{z^{n}_{1}\in\mathbb{R}^{nd}:\hat{i}_{\pi_{n}}(o^{n}_{1},....,o^{n}_{n})\in(0,\epsilon_{0})\right\}\right)=0,

(35)

The error event of $\phi_{\pi_{n},a_{n}}()$ under $\mathcal{H}_{1}$ in this case is

\tilde{E}_{n}^{\pi_{n},a_{n}}\equiv\left\{z^{n}_{1}\in\mathbb{R}^{nd}:\hat{i}_{\pi_{n}}(o^{n}_{1},....,o^{n}_{n})<a_{n}\right\}.

(36)

Finally using the condition that $(a_{n})_{n\geq 1}$ tends to zero with $n$ ( $(a_{n})_{n}$ is $o(1)$ ), we have that

\tilde{E}_{n}^{\pi_{n},a_{n}}\subset\left\{z^{n}_{1}\in\mathbb{R}^{nd}:\hat{i}_{\pi_{n}}(o^{n}_{1},....,o^{n}_{n})\in(0,\epsilon_{0})\right\}

(37)

eventually in $n$ , which implies from (35) that

\mathbb{P}\left(\bigcap_{m\geq 1}\bigcup_{n\geq m}\tilde{E}_{n}^{\pi_{n},a_{n}}\right)=0.

(38)

∎

Appendix B Proof of Lemma 2

We use the following results from [39].

LEMMA 4

[39, Th.3] Under the assumptions of Lemma 2, $\lim_{n\rightarrow\infty}\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{1},....,O_{n})=I(X,Y)$ with probability one.

LEMMA 5

[39, Th.4] Under the assumptions of Lemma 2 part ii), for any $p>0$ , $\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{1},....,O_{n})$ is $o(n^{-p})$ . ¹⁸¹⁸18 This means formally that $\lim_{n\rightarrow\infty}\frac{\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{1},....,O_{n})}{n^{-p}}=0$ with probability one.

Proof:

Eq.(2) follows from Lemma 4, the fact that by the strong law of large number $\lim_{n\rightarrow\infty}i_{n}(Z_{1},...,Z_{n})=I(X,Y)$ with probability one and the union bound.

Under the assumption that $P\in\mathcal{P}_{0}$ (i.e. $I(X,Y)=0$ ), we have that $P=Q^{*}(P)$ . This implies by definition that $i_{n}(Z_{1},...,Z_{n})=0$ with probability one. Then the regret is basically our empirical information term $\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{1},....,O_{n})$ . Then, Lemma 5 implies the result in (19). ∎

Appendix C Proof of Theorem 2

Proof:

If we define the event $\mathcal{E}^{n,k}_{\delta_{n},b_{n}}\equiv$

\left\{z^{n}_{1}:\sup_{T\in\mathcal{G}^{k}_{b_{n}}}\left|{D_{\sigma({\pi_{T}})}({P}||{Q}^{*})-D_{\sigma({\pi_{T}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})}\right|\leq r_{b_{n},\delta_{n}}(k)\right\},

(39)

for all $k\in\left\{1,..,,\left|T^{full}_{b_{n}}\right|\right\}$ , we have that the conditions stated on $(b_{n})$ and $(\delta_{n})$ are the weakest to obtain from (15) that $\lim_{n\rightarrow\infty}\sup_{k\in\left\{1,..,,\left|T^{full}_{b_{n}}\right|\right\}}r_{b_{n},\delta_{n}}(k)=0$ . This last condition is crucial to being able to apply Lemma 1 in $\left\{\mathcal{E}^{n,k}_{\delta_{n},b_{n}},k\right\}$ . In fact, $\mathbb{P}\left(\mathcal{E}^{n}_{\delta_{n},b_{n}}\equiv\bigcap_{k=1}^{\left|T^{full}_{b_{n}}\right|}\mathcal{E}^{n,k}_{\delta_{n},b_{n}}\right)\geq 1-\delta_{n}$ by definition of $r_{b_{n},\delta_{n}}(k)$ in (16) and the condition expressed in (15).

Importantly under $\mathcal{H}_{0}$ (i.e., $I(X;Y)=0$ ), for $k=1$ , we can consider that $r_{b_{n},\delta_{n}}(k)=0$ , because $D_{\sigma({\left\{\mathbb{R}^{d}\right\}})}({P}||{Q}^{*})=D_{\sigma({\left\{\mathbb{R}^{d}\right\}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})=0$ for any $z^{n}_{1}\in\mathbb{R}^{dn}$ . Consequently, for any $z^{n}_{1}\in\mathcal{E}^{n}_{\delta_{n},b_{n}}=\bigcap_{k=1}^{\left|T^{full}_{b_{n}}\right|}\mathcal{E}^{n,k}_{\delta_{n},b_{n}}$ , it follows that

		$\displaystyle D_{\sigma({\pi_{\hat{T}_{b_{n},\delta_{n}}}})}(\hat{P}_{n}\|\|\hat{Q}_{n}^{*})-r_{b_{n},\delta_{n}}\left(\left\|\hat{T}_{b_{n},\delta_{n}}\right\|\right)\geq$
		$\displaystyle\sup_{T\in\mathcal{G}^{k}_{b_{n}}}D_{\sigma({\pi_{{T}}})}(\hat{P}_{n}\|\|\hat{Q}_{n}^{*})-r_{b_{n},\delta_{n}}(k)$		(40)
		$\displaystyle\geq\sup_{T\in\mathcal{G}^{k}_{b_{n}}}D_{\sigma({\pi_{{T}}})}({P}\|\|{Q}^{*})-2r_{b_{n},\delta_{n}}(k),$		(41)

for any $k\in\left\{1,..,\left|T^{full}_{b_{n}}\right|\right\}$ . The first inequality in (C) is from the definition of $\hat{T}_{b_{n},\delta_{n}}$ in (17) and the second inequality in (41) is from the fact that if $z^{n}_{1}\in\mathcal{E}^{n,k}_{\delta_{n},b_{n}}$ (see (39)), then

\displaystyle\left|\sup_{T\in\mathcal{G}^{k}_{b_{n}}}D_{\sigma({\pi_{T}})}({P}||{Q}^{*})-\sup_{T\in\mathcal{G}^{k}_{b_{n}}}D_{\sigma({\pi_{T}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})\right|\leq r_{b_{n},\delta_{n}}(k).

(42)

On the other hand, if $z^{n}_{1}\in\mathcal{E}^{n}_{\delta_{n},b_{n}}$ , then

		$\displaystyle D_{\sigma({\pi_{\hat{T}_{b_{n},\delta_{n}}}})}(\hat{P}_{n}\|\|\hat{Q}_{n}^{*})-r_{b_{n},\delta_{n}}(\left\|\hat{T}_{b_{n},\delta_{n}}\right\|)$
		$\displaystyle\leq D_{\sigma({\pi_{\hat{T}_{b_{n},\delta_{n}}}})}({P}\|\|{Q}^{})\leq D({P}\|\|{Q}^{}).$		(43)

At this point, we use the independence assumption¹⁹¹⁹19Under $\mathcal{H}_{0}$ , we have that $D_{\sigma({\pi_{T}})}({P}||{Q}^{*})=0$ for any $T\ll T^{full}_{b_{n}}$ . and the two inequalities (41) and (C) to obtain that for any $z^{n}_{1}\in\mathcal{E}^{n}_{\delta_{n},b_{n}}$ :

\displaystyle D_{\sigma({\pi_{\hat{T}_{b_{n},\delta_{n}}}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})-r_{b_{n},\delta_{n}}(\left|\hat{T}_{b_{n},\delta_{n}}\right|)=0.

(44)

Finally using that $r_{b_{n},\delta_{n}}(1)=0$ , it is simple to verify that for any $z^{n}_{1}\in\mathcal{E}^{n}_{\delta_{n},b_{n}}$ the trivial tree (with one cell) is a solution for (17), i.e., $\left|\hat{T}_{b_{n},\delta_{n}}\right|=1$ . Then from (44), $D_{\sigma({\pi_{\hat{T}_{b_{n},\delta_{n}}}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})=\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),....,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))=0$ , which concludes the argument using that $\mathbb{P}\left(\mathcal{E}^{n}_{\delta_{n},b_{n}}\right)\geq 1-\delta_{n}$ and the assumption that $(\delta_{n})$ is $o(1)$ .

Under $\mathcal{H}_{1}$ (i.e., $I(X;Y)>0$ ), Silva et al. [39, Th. 2, Eq.(33)] showed that for any $z^{n}_{1}\in\mathcal{E}^{n}_{\delta_{n},b_{n}}$

	$\displaystyle I(X,Y)-\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),....,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))$
	$\displaystyle\leq\min_{T\ll T^{full}_{b_{n}}}\left[I(X;Y)-D_{\sigma({\pi_{T}})}({P}\|\|{Q}^{*}(P))\right]+2r_{b_{n},\delta_{n}}(T).$

Again using that $\mathbb{P}\left(\mathcal{E}^{n}_{\delta_{n},b_{n}}\right)\geq 1-\delta_{n}$ , we obtain the bound in (2). ∎

Appendix D Proof of Lemma 3

Proof:

Let us use the definition of the typical set introduced in (39) and $\mathcal{E}^{n}_{\delta_{n},b_{n}}\equiv\bigcap_{k=1}^{\left|T^{full}_{b_{n}}\right|}\mathcal{E}^{n,k}_{\delta_{n},b_{n}}$ . Under the assumption of this result, we know that $\mathbb{P}\left(\mathcal{E}^{n}_{\delta_{n},b_{n}}\right)\geq 1-\delta_{n}$ . In addition, in the proof of Theorem 2, it is shown that if $z^{n}\in\mathcal{E}^{n}_{\delta_{n},b_{n}}$ , then $\left|\hat{T}_{b_{n},\delta_{n}}\right|=1$ , which means that $\pi_{b_{n}\delta_{n}}=\left\{\mathbb{R}^{d}\right\}$ . Therefore, we have that²⁰²⁰20Here we use the notation $\pi_{b_{n},\delta_{n}}(z^{n})$ to make explicit that the partition is data-driven.

\displaystyle\mathcal{E}^{n}_{\delta_{n},b_{n}}\subset\left\{z^{n}\in\mathbb{R}^{dn}:\pi_{b_{n},\delta_{n}}(z^{n})=\left\{\mathbb{R}^{d}\right\}\right\}.

(45)

Using the definition of $\mathcal{T}_{0}((z_{n})_{n\geq 1})$ , we have that

		$\displaystyle\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})\leq M)=\mathbb{P}_{{\bf Z}}\left(\bigcap_{n>M}\left\{z^{n}:\pi_{b_{n},\delta_{n}}(z^{n})=\left\{\mathbb{R}^{d}\right\}\right\}\right)$		(46)
		$\displaystyle=1-\mathbb{P}_{{\bf Z}}\left(\bigcup_{n>M}\left\{z^{n}:\pi_{b_{n},\delta_{n}}(z^{n})\neq\left\{\mathbb{R}^{d}\right\}\right\}\right)$		(47)
		$\displaystyle\geq 1-\sum_{n>M}\mathbb{P}_{{\bf Z}}\left(\left\{z^{n}:\pi_{b_{n},\delta_{n}}(z^{n})\neq\left\{\mathbb{R}^{d}\right\}\right\}\right)$		(48)
		$\displaystyle\geq 1-\sum_{n>M}\mathbb{P}_{{\bf Z}}\left((\mathcal{E}^{n}_{\delta_{n},b_{n}})^{c}\right)\geq 1-\sum_{n>M}\delta_{n}.$		(49)

The equality in (46) comes from definition in (22). The inequality in (48) comes from the sub-additivity of $\mathbb{P}_{{\bf Z}}$ and the bounds in (49) follow from (45) and the construction of $\mathcal{E}^{n}_{\delta_{n},b_{n}}$ . Finally, using the assumption that $(\delta_{n})\in\ell_{1}(\mathbb{N})$ in (49), we have that

\displaystyle\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})<\infty)=\lim_{M\rightarrow\infty}\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})\leq M)=1.

(50)

∎

Appendix E Proof of Theorem 3

Proof:

Under $\mathcal{H}_{1}$ , the assumptions on $(b_{n})$ and $(\delta_{n})$ are within the admissible range stated in Theorem 2 part i). Consequently, we have consistency on the regret, i.e., $\lim_{n\rightarrow\infty}\left|i_{n}(Z_{1},...,Z_{n})-\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),....,O_{\pi_{b_{n},\delta_{n}}}(Z_{n})\right|=0$ , $\mathbb{P}$ -almost surely. This result is sufficient to obtain that $(\phi_{{b_{n},\delta_{n}},a_{n}}(Z_{1},....,Z_{n}))_{n\geq 1}$ reaches $1$ eventually with probability one, from the arguments presented in the proof of Theorem 1.

Under $\mathcal{H}_{0}$ , it is useful to define the following last exit time associated with the detection of independence:

		$\displaystyle\mathcal{T}((z_{n})_{n\geq 1},(a_{n})_{n\geq 1})\equiv$
		$\displaystyle\sup{\left\{m\geq 1:\phi_{b_{m},\delta_{m},a_{m}}(z_{1},..,z_{m})=1\right\}}\in\mathbb{N}\cup\left\{\infty\right\}.$		(51)

Saying that $(\phi_{{b_{n},\delta_{n}},a_{n}}(Z_{1},....,Z_{n}))_{n\geq 1}$ reaches $0$ eventually with probability one is equivalent to the condition that

\mathbb{P}(\mathcal{T}((Z_{n})_{n\geq 1},(a_{n})_{n\geq 1})<\infty)=1,

(52)

from the arguments presented in the proof of Theorem 1. At this point, it is important to revisit the definition of $\mathcal{T}_{0}((z_{n})_{n\geq 1})$ in (22), where if for some $z^{m}\in\mathbb{R}^{dm}$ we have that $\left|\hat{T}_{b_{m},\delta_{m}}\right|=1$ ; this implies that $\hat{i}_{\pi_{b_{m},\delta_{m}}}(O_{\pi_{b_{m},\delta_{m}}}(z_{1}),....,O_{\pi_{b_{m},\delta_{m}}}(z_{m})=i_{m}(z_{1},...,z_{m})=0$ (details on Appendix C), and therefore, $\phi_{b_{m},\delta_{m},a_{m}}(z_{1},..,z_{m})=0$ independent of $a_{m}$ (see Eq.(7)). Consequently, we have that

		$\displaystyle\left\{z^{m}\in\mathbb{R}^{dm}:\pi_{b_{m},\delta_{m}}(z^{m})=\left\{\mathbb{R}^{d}\right\}\right\}=\left\{z^{m}:\left\|\hat{T}_{b_{m},\delta_{m}}\right\|=1\right\}$
		$\displaystyle\subset\left\{z^{m}\in\mathbb{R}^{dm}:\phi_{b_{m},\delta_{m},a_{m}}(z_{1},..,z_{m}))=0\right\}$		(53)

for any $a_{n}$ , and it follows that

		$\displaystyle\mathbb{P}(\mathcal{T}((Z_{n})_{n\geq 1},(a_{n})_{n\geq 1})\leq M)=$
		$\displaystyle\mathbb{P}_{{\bf Z}}\left(\bigcap_{n>M}\left\{z^{n}:\phi_{b_{n},\delta_{n},a_{n}}(z_{1},..,z_{n}))=0\right\}\right)$		(54)
		$\displaystyle\geq\mathbb{P}_{{\bf Z}}\left(\bigcap_{n>M}\left\{z^{n}:\pi_{b_{n},\delta_{n}}(z^{n})=\left\{\mathbb{R}^{d}\right\}\right\}\right)$		(55)
		$\displaystyle=\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})\leq M).$		(56)

The identity in (E) follows from the definition in (E), and equations (55) and (56) follow from (E) and (46), respectively. The argument concludes from Lemma 3, as we know that under the assumptions stated on $(b_{n})$ and $(\delta_{n})$ , $\lim_{M\rightarrow\infty}\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})\leq M)=1$ , which implies the result in (52) from (56). ∎

Appendix F Proof of Theorem 4

Proof:

Under $\mathcal{H}_{0}$ , we can adopt directly the bound presented in (49):

\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})<m)\geq 1-\sum_{n\geq m}\delta_{n}=1-\sum_{n\geq m}Ce^{-n^{1/3}},

(57)

where we include the assumption that $(1/\delta_{n})_{n}\approx(e^{n^{1/3}})$ (i.e., $C$ is positive constant). For the series in (57) we have

	$\displaystyle\sum_{n\geq m}e^{-n^{1/3}}$	$\displaystyle=e^{-m^{1/3}}\left[1+\frac{e^{-(m+1)^{1/3}}}{e^{-m^{1/3}}}+...+\frac{e^{-(m+j)^{1/3}}}{e^{-m^{1/3}}}+..\right]$		(58)
		$\displaystyle=e^{-m^{1/3}}\cdot\underbrace{\sum_{j\geq 0}e^{-j^{1/3}}}_{\equiv\mathcal{I}},$		(59)

where $\mathcal{I}<\infty$ . Therefore, we have that $\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})\geq m)\leq C\mathcal{I}e^{-m^{1/3}}$ . ∎

References

[1] C.-J. Ku and T. L. Fine, “Testing for stochastic independence: Application to blind source separation,” IEEE Transactions on Signal Processing, vol. 53, no. 5, pp. 1815–1826, 2005.
[2] P. Common, “Independent component analysis: a new concept?” Signal Processing, vol. 36, pp. 287–314, 1994.
[3] J. F. Cardoso, “Blind separation of instantaneous mixtures of non-stationary sources,” IEEE Transactions on Signal Processing, vol. 49, no. 9, pp. 1837–1848, 2001.
[4] J. Cheng, K. K. Tan, Z. Yi, and S. H. and, “Convergence analysis of a class of hyvarinen–oja’s ica learning algorithms with constant learning rates,” IEEE Transactions on Signal Processing, vol. 57, no. 5, pp. 1811–1824, 2009.
[5] B. Bloem-Reddy and Y. W. Teh, “Probabilistic symetry and invariant neural netwroks,” 2019, https://arxiv.org/abs/1901.06082.
[6] J. Range, “Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information,” in Proceedings of Machine Learning Research, 2018.
[7] A. Bellot and M. van der Schaar, “Conditional independence testing using generative adversarial networks,” in Conference on Neural Information Processing Systems, 2019.
[8] R. Sen, A. T. Suresh, K. Shanmugam, A. G. Dimakis, and S. Shakkottai, “Model-powered conditional independence test,” in Conference on Neural Information Processing Systems, 2017.
[9] C. Ku and T. Fine, “A bayesian independence test for small datasets,” IEEE Transactions on Signal Processing, vol. 54, no. 10, pp. 4026 – 4031, 2006.
[10] D. Reshef, Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti, “Detecting novel associations in large data sets,” Science, vol. 334, no. 6062, pp. 1518–1524, 2011.
[11] A. Gretton, O. Busquet, A. Smola, and B. Scholkopf, “Measuring statistical dependence with hilbert-schmit norms,” in In Algorithms Learning Theory: 16rh International Conference. Springer, 2005, pp. 63–78.
[12] A. Gretton and L. Györfi, “Nonparametric independence tests: Space partitioning and kernel approaches,” in In Algorithms Learning Theory: 19th International Conference. Springer, 2008, pp. 183–198.
[13] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola, “A kernel statistical test of independence,” Advances in Neural Information Processing Systems, pp. 585–592, 2007.
[14] N. Pfister, P. Büjlmann, B. Schölkopf, and J. Peters, “Kernel-based tests for joint independence,” arXiv: Statistics Theory, no. 1603.00285, 2016.
[15] G. Székely, M. Rizzo, and N. Bakirov, “Measuring and testing dependency by correlation of distances,” The Annals of Statistics, vol. 35, no. 6, pp. 2769–2794, 2007.
[16] G. J. Székely and M. L. Rizzo, “Brownian distance covariance,” The Annals of Applied Statistics, vol. 3, pp. 1236–1265, 2009.
[17] G. J. Székely, “The distance correlation t-test of independence in high dimension,” Journal of Multivariate Analysis, vol. 117, pp. 193–213, 2013.
[18] S. Zheng, N. Shi, and Z. Zhang, “Generalized measures of correlation for asymmetry, nonlinearity, and beyond,” Journal of the American Statistical Association, vol. 107, pp. 1239–1252, 2012.
[19] X. Wang, B. Jiang, and J. Liu, “Generalized r-squared for detecting non-independence,” Biometrika, vol. 104, pp. 129–139, 2017.
[20] A. Gretton and L. Gyorfi, “Consistent nonparamtric tests of independence,” Journal of Machine Learnning Research, vol. 11, pp. 1391–1423, 2010.
[21] L. Ma and J. Mao, “Fisher exact scanning for dependency,” Journal of the American Statistical Association, vol. 114, no. 525, pp. 245–258, 2019.
[22] K. Zhang, “Bet on independence,” Journal of the American Statistical Association, vol. 114, no. 528, pp. 1620–1637, 2019.
[23] R. Heller, Y. Heller, S. Kaufman, B. Brill, and M. Gorfine, “Consistent distribution-free k-sample and independence tests for univariate random variables,” Journal of Machine Learning Research, vol. 17, pp. 1–54, 2016.
[24] R. Heller and Y. Heller, “Multivariate tests of association based on univariate tests,” Advances in Neural Information Processing Systems, vol. 29, pp. 208–216, 2016.
[25] M. Menedez, J. Pardo, L.Pardo, and K. Zografos, “On tests of independence based on minimum $\phi$ -divergnece estimator with constraint: An application to modeling DNA,” Computational Statistics and Data Analysis, vol. 51, pp. 1100–1118, 2006.
[26] F. Han, S. Chen, and H. Liu, “Distribution-free tests of independence in high dimensions,” Biometrika, vol. 104, pp. 813–828, 2017.
[27] J. Dauxois and G. M. Nkiet, “Nonlinear cannonial analysis and independence tests,” The Annals of Statistics, vol. 26, no. 4, pp. 1254–1278, 1998.
[28] A. Dionisio, R. Menezes, and D. A. Mendes, “Entropy-based independence test,” Nonlinear Dynamics, vol. 44, pp. 351–357, 2006.
[29] J. Suzuki, “An estimator of mutual information and its application to independence testing,” Entropy, vol. 18, no. 109, pp. 1–18, March 2016.
[30] I. Kontoyiannis and M. Skoularidou, “Estimating the directed information and testing for causality,” IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 6053–6067, 2016.
[31] L. Devroye and G. Lugosi, Combinatorial Methods in Density Estimation. Springer - Verlag, New York, 2001.
[32] S. Kullback, Information theory and Statistics. New York: Wiley, 1958.
[33] H. Shi, M. Drton, and F. Han, “Distribution-free consistent independence tests via center-outward ranks and signs,” Journal of the American Statistical Association, pp. 1–16, 2020.
[34] K. Zhang, Z. Zhao, and W. Zhou, “Beauty powered beast,” arXiv preprint arXiv:2103.00674, 2021.
[35] I. Csiszar and P. Shields, Information theory and Statistics: A Tutorial. Now, 2004.
[36] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley Interscience, New York, 2006.
[37] G. Lugosi and A. B. Nobel, “Consistency of data-driven histogram methods for density estimation and classification,” The Annals of Statistics, vol. 24, no. 2, pp. 687–706, 1996.
[38] G. A. Darbellay and I. Vajda, “Estimation of the information by an adaptive partition of the observation space,” IEEE Transactions on Information Theory, vol. 45, no. 4, pp. 1315–1321, 1999.
[39] J. F. Silva and S. Narayanan, “Complexity-regularized tree-structured partition for mutual information estimation,” IEEE Transactions on Information Theory, vol. 58, no. 3, pp. 1940–1952, March 2012.
[40] ——, “Non-product data-dependent partitions for mutual information estimation: Strong consistency and applications,” IEEE Transactions on Signal Processing, vol. 58, pp. 3497–3511, 2010.
[41] ——, “Information divergence estimation based on data-dependent partitions,” Journal of Statistical Planning and Inference, vol. 140, no. 11, pp. 3180–3198, November 2010.
[42] R. E. Blahut, “Hypothesis testing and information theory,” IEEE Transactions on Information Theory, vol. it-20, no. 4, pp. 405–417, July 1974.
[43] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth, 1984.
[44] C. Scott and R. D. Nowak, “Minimax-optimal classification with dyadic decision trees,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1335–1353, April 2006.
[45] C. Scott, “Tree pruning with subadditive penalties,” IEEE Transactions on Signal Processing, vol. 53, no. 12, pp. 4518–4525, 2005.
[46] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag, 1996.
[47] M. Gonzalez and J. F. Silva, “Data-driven representations for testing independence: A connection with mutual information estimation,” in IEEE International Symposium on Information Theory, 2020, pp. 1–5.
[48] J. F. Silva and S. Narayanan, “Universal consistency of data-driven partitions for divergence estimation,” in IEEE International Symposium on Information Theory, June 2007.
[49] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms. The MIT Press, 2009, 3er edition.
[50] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. A. Wellmer, Efficient and Adaptive Estimation for Semiparametric Models. Springer - Verlag, New York, 1998.
[51] N. Merhav and M. Feder, “Universal prediction,” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2124–2147, October 1998.
[52] D. Bontemps, S. Boucheron, and E. Gassiat, “About adaptive coding on countable alphabets,” IEEE Transactions on Information Theory, vol. 60, no. 2, pp. 808–821, 2014.

Appendix G Suplemental Material

In Section G-A, the estimation of mutual information (MI) based on our tree-structured partition (TSP) approach is evaluated. Section G-B details the selection of parameters of the $L_{1}$ test, the log-likelihood test, and the Pearson- $\chi^{2}$ test. These results were used to select the final curves presented in Figure 3. Section G-C complements the empirical results of our method on a heavy-tailed distribution and a multi-modal distribution. Finally, Section G-D presents some results regarding the statistical significance of our test.

G-A Preliminary Analysis of Mutual Information Estimation

Analyzing our tree-structured partition (TSP) scheme’s capacity to estimate MI in the non-asymptotic regime is important because it provides an indication of the trade-off between the two types of errors that we will encounter when analyzing the performance of its induced test. Here, we present a basic analysis on the effects of the parameters of our scheme: $l$ and $w$ ( $b_{n}=w\cdot n^{-l}$ ), and $\alpha>0$ within the admissible regime provided by our results in Section VIII of the main content for $(b_{n})_{n\geq 1}$ and $(\delta_{n})_{n\geq 1}$ . For this analysis, we consider a joint vector $Z=(X,Y)$ following a zero-mean Gaussian distribution in $\mathbb{R}^{2}$ where the correlation coefficient determining $I(X,Y)$ is parametrized by $\sigma=\mathbb{E}(XY)$ . We evaluate empirically how the estimator $\hat{i}_{b_{n},\delta_{n}}^{\alpha}(\cdot)$ achieves the true value as $n$ grows and the effects of the mentioned parameters under $\mathcal{H}_{0}$ ( $\sigma=0$ ) and $\mathcal{H}_{1}$ ( $\sigma>0$ ). For illustration purposes, we will consider $\sigma=0.7$ as a relatively high MI regime under $\mathcal{H}_{1}$ .²¹²¹21Similar analyses where conducted under different level of correlations (under $H_{1}$ ) not reported here for brevity.

To begin, Figure 6 shows effects of $l\in(0,1/3)$ fixing $w=0.05$ and $\alpha=0$ . Under $\alpha=0$ , we look at the performance of the full tree $T^{full}_{b_{n}}$ , which provides the baseline capacity of our regularized solution in Eq.(17) of the main content. Figure 6 shows that as $l$ increases (approaching its less conservative value $1/3$ associated with the biggest full tree in the growing phase), the capacity of our scheme to estimate the true MI under $\mathcal{H}_{1}$ is superior, as expected. On the contrary, the estimation of the true MI under $\mathcal{H}_{0}$ is better for smaller values of $l$ , although beyond $1.000$ samples, the specific value of $l$ is not that critical under $\mathcal{H}_{0}$ .

Similarly, Figure 7 presents the expressiveness of the scaling factor $w$ in $b_{n}$ and its effects on the estimation of MI in the two aforementioned regimes $\sigma=0$ and $\sigma=0.7$ . For these plots, we fixed $l=0.167\in(0.133,1/3)$ and $\alpha=0$ , i.e., we focus on the performance of the full tree. In the finite sampling regime, $n\in\left\{1,..,10^{5}\right\}$ , the effect of $w$ is relevant in terms of the capacity to over-estimate $I(X,Y)$ under $\sigma=0\,(\mathcal{H}_{0})$ and under-estimate or over-estimate $I(X,Y)$ under the case $\sigma=0.7$ . The effect of $w$ in the performance of the full-tree solution is more prominent than the effect of $l$ in both scenarios ( $\mathcal{H}_{0}$ and $\mathcal{H}_{1}$ ).

Finally, we look at the effects of the scalar $\alpha$ in Eq.(24), for which we select $l=0.167\in(0.133,1/3)$ and $w=0.05\in(0.005,0.1)$ . Figure 8 shows different selections for this factor in the two scenarios already introduced ( $\sigma=0$ and $\sigma=0.7$ ) and $n\in\left\{1,..,10^{5}\right\}$ . We present three sets of figures: one plotting the plug-in estimations $\hat{i}_{b_{n},\delta_{n}}(\alpha)\equiv D_{\sigma({\pi_{b_{n},\delta_{n}}^{\alpha}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})$ , the other the regularized MI estimations $\tilde{i}_{b_{n},\delta_{n}}(\alpha)\equiv\hat{i}_{b_{n},\delta_{n}}(\alpha)-\alpha\cdot r_{b_{n},\delta_{n}}(\left|\hat{T}_{b_{n},\delta_{n}}(\alpha)\right|)$ and finally, the size of the optimal tree: $\left|\hat{T}_{b_{n},\delta_{n}}(\alpha)\right|$ . In general, we observe that the best results under $\mathcal{H}_{1}$ are obtained when $\alpha$ is very small, in the order of $\sim 10^{-5}$ . This finding is interesting because it shows that, in general, the theoretical confidence interval that is used to obtain the regularization term in Eq.(16) is very conservative. This means that in the non-asymptotic regime, this term needs to be corrected to obtain good results. It is worth noticing that this scaling factor $\alpha$ is fixed (independent of $n$ ) and it does not affect the theoretical properties of our scheme: in particular, the structural capacity of our TSP scheme to detect independence under $\mathcal{H}_{0}$ presented in Section VI. This capacity is observed in Figure 8 that present the size of $\hat{T}_{b_{n},\delta_{n}}(\alpha)$ . Here, the trivial tree (with one cell) is obtained in a finite number of steps as predicted by our results in Section VI.A. On the other hand, for the regime of high MI, we observe that the tree’s size is comparable to the size of the full-tree (obtained for $\alpha=0$ ). Importantly, both of these results support the adaptive nature of our data-driven solution.

G-B Complementing Results for Figure 3

For selecting parameters of the alternative methods (the $L_{1}$ test, the log-likelihood test, and the Pearson- $\chi^{2}$ test), we implemented them and explored their performances in the admissible range where those approaches are known to be strongly consistent [20]. In particular, we consider the number of partitions per dimension as $m(n)=n^{p}$ with $p\in(0,0.5)$ (details in [20]), which satisfies the strong-consistency conditions for the $L_{1}$ and the log-likelihood tests established in [20]. Although there is no explicit proof of Pearson- $\chi^{2}$ test strong-consistency, we use a regime of values for $p\in(0,1/3)$ according to the range suggested in [20].

Under the experimental setting presented in Section VIII, Figure 9 reports different performance curves (the trade-off between $M_{0}(\epsilon)$ and $M^{\sigma}_{1}(\epsilon)$ ) when selecting different parameters of these methods. From these curves, the selected parameters for each method were the ones that offer a smooth transition between their sampling complexity pairs trade-off. This criterion is consistent with what we did to select the parameters of our method: please refer to Figure 1. Interestingly, for the three methods, this was obtained consistently for $p$ in the range of $0.2$ . Importantly, this selection offers reasonable solutions on all the regimes of correlation presented in Figure 9.

G-C Testing the Method on Different Distributions

To complement the experimental results in Section VIII.A (obtained for Gaussian distributions in the scalar case), we explore the performance of our method against non-adaptive histogram-based tests by studying the detection-time trade-off curves $(M_{0}(\epsilon),M_{1}^{\sigma}(\epsilon))$ on data samples generated by two important family of distributions: a heavy-tailed distribution (t-student law) and a multi-modal distribution.

Figure 10 presents the trade-off between $(M_{0}(\epsilon),M_{1}^{\sigma}(\epsilon))$ for different values of correlation (under $\mathcal{H}_{1}$ ) using samples generated by a t-student distribution with $2$ degrees of freedom (which is a heavy-tailed distribution). The curves show that the performance of our test is superior to the other methods in all correlation scenarios indexed by $\sigma$ . The performance gain is significant in the scenarios of high correlation and a for relatively high independence detection time (i.e., when $M_{0}(\epsilon)\geq 10^{2}$ ). For the non-adaptive tests, it is intriguing the relatively flat (insensitive) trade-off observed in all the explored scenarios, where the curves are close to an horizontal line. This flat trade-off behaviour in $(M_{0}(\epsilon),M_{1}^{\sigma}(\epsilon))$ is observed even when considering a significantly number of cells (using $p=0.3$ ). In fact, this selection ( $p=0.3$ ) produces a number of cells that is significantly higher than the number of cells obtained with our adaptive method (more of this analysis below). Therefore, this observation shows that our scheme with fewer data-driven cells produces better performance trade-off in general. In conclusion, these performance curves show the benefits of the adaptive partitioning in our test, which can deal with heavy-tailed distributions significantly better than tests that use non-adaptive partitions (binning) of the space.

Figure 11, on the other hand, presents the trade-off between $M_{0}(\epsilon)$ and $M^{\sigma}_{1}(\epsilon)$ using samples generated by a mixture of four Gaussians (a multi-modal distribution). Following the setting introduced in Gretton (2008), the dependence level is produced by rotating in a counter close-wise manner (with a magnitude $\theta$ ) a 2D random vector with the following Gaussian mixture density:

f(x,y)=\frac{1}{4}\sum_{i=1}^{4}\mathcal{N}((x,y);\mu_{i},\bar{\Sigma}),

(60)

where $\mathcal{N}((x,y);\mu,\Sigma)$ denotes the bi-variate Gaussian density function with mean $\mu$ and covariance matrix $\Sigma$ , evaluated in $(x,y)$ . In this case, $\bar{\Sigma}=\begin{bmatrix}0.05&0\\ 0&0.05\end{bmatrix}$ , $\mu_{1}=(1,1)$ , $\mu_{2}=(1,-1)$ , $\mu_{3}=(-1,1)$ , $\mu_{4}=(-1,-1)$ . The independent scenario is obtained by choosing $\theta=0$ , while three different dependent scenarios ( $\mathcal{H}_{1}$ ) are explored by choosing $\theta\in\{\pi/16,2\pi/16,3\pi/16\}$ . Similarly to the results presented in Figure 3 (for the Gaussian case), we observed that our TSP test outperforms the other tests in the regime of $M_{0}(\epsilon)\geq 10^{2}$ , while the other tests presents a better performance in the opposite regime. It is important to notice that, unlike the Gaussian case, the number of partitions of each alternative test is determined by $p=0.3$ , which induces a notably higher number of cells than the ones obtained by the parameter configuration of our TSP test. It is noteworthy that a value of $p=0.3$ induces $m(n)^{2}=\left(\left\lfloor n^{p}\right\rfloor\right)^{2}=\left(\left\lfloor 1000^{0.3}\right\rfloor\right)^{2}=49$ partitions for the alternative tests, while $l=0.001$ and $w=0.1$ induces just $8$ partitions for the tree without pruning, i.e., the final number of partitions will be lower than $8$ . This demonstrates the efficiency of the proposed method, that can represent the statistical dependence of random variable with a considerably lower number of cells because of its adaptive capacity. Due to the multi-modal nature of the distribution used in these experiments, we present an alternative configuration of a TSP test with $l=0.167$ and $w=0.05$ (denoted as TSP-long in Figure 11),inducing more adaptive cells (but still lower than the other tests) in order to capture the statistical dependence of the multi-modal distribution in a less conservative way. We observed a considerably benefit in this new choice in the regime when $M_{0}(\epsilon)\geq 10^{2}$ with a moderate-slight deterioration in the opposite regime.

To complement this analysis, Figure 12 shows the partitions generated by the non-adaptive and our adaptive partition schemes. For this, $n=1000$ samples were considered for each of the three distributions explored (Gaussian, t-student and Gaussian Mixture). The adaptive attribute of our scheme is observed and contrasted with non-adaptive partitions, which is particularly evident in the multi-modal case. We observe that our scheme, as expected, represent the distribution of samples more effectively (with a lower number of cells) than non-adaptive scheme. Indeed, non-adaptive partitions use a considerably higher number of cells to represent the statistical dependence observed in the data, which can be used to explain the difference in performance with our method. In particular, there is a clear issue observed for the non-adaptive scheme: most of the cells have almost no samples, which is an ineffective way to represent the information of the data. This could significantly affect the discrimination ability of this non-adaptive methods for small sample-size (see Figure 12). From our perspective, this is one of the primary sources to explain the erratic (insensitive) flat behaviour observed in many of the trade-off curves presented in Figs 10 and 11, and overall, the better (and more expresive) performance trade-off offered by of our approach.

G-D Statistical Significance Analysis in Section IX.C.

G-D1 Proof of Proposition 2

Proof:

Let us consider the set

A_{b_{n},\delta_{n},a_{n}}\equiv\left\{z^{n}=(x^{n},y^{n})\in{\mathcal{X}}^{n}\times{\mathcal{Y}}^{n}:\phi_{b_{n},\delta_{n},a_{n}},(z^{n})=0\right\},

(61)

i.e., the points in the space where our test of length $n$ decides $\mathcal{H}_{0}$ . Under $\mathcal{H}_{0}$ , the probability of rejecting the null is given by:

\alpha(b_{n},\delta_{n},a_{n})=\mathbb{P}_{Z^{n}}(A^{c}_{b_{n},\delta_{n},a_{n}}),

(62)

where $\mathbb{P}_{Z^{n}}$ denotes the probability of $Z^{n}$ induced by $P$ (i.e., the $n$ -fold product distribution). Using the argument presented in the proof of Theorem 3 (see Eq.(51)), we show that when our data-driven partition $\pi_{b_{n},\delta_{n}}()$ achieves the trivial case $\left\{R^{d}\right\}$ (the partition with one cell), then $\phi_{b_{n},\delta_{n},a_{n}}(\cdot)$ decides $0$ , independent of the value of $a_{n}>0$ . This structural detection of independence (see Section VI.A for more details) implies that:

B_{b_{n},\delta_{n}}\equiv\left\{z^{n}=(x^{n},y^{n}):\pi_{b_{n},\delta_{n}}(z^{n})=\left\{R^{d}\right\}\right\}\subset A_{b_{n},\delta_{n},a_{n}},

(63)

because $\pi_{b_{n},\delta_{n}}(z^{n})=\left\{R^{d}\right\}$ implies that $\phi_{b_{n},\delta_{n},a_{n}}(z^{n})=0$ independent of the magnitude of $a_{n}>0$ .

At this point we use the assumptions of Theorem 2. In the argument presented in the proof of Theorem 2, it was proved that (see Appendix III) the set $\xi^{n}_{\delta_{n},b_{n}}$ (see Eq.(39) in Appendix III) satisfies that if $z^{n}\in\xi^{n}_{\delta_{n},b_{n}}$ then $z^{n}\in B_{b_{n},\delta_{n}}$ (i.e., $\pi_{b_{n},\delta_{n}}(z^{n})=\left\{R^{d}\right\}$ ). Consequently, we have that:

\xi^{n}_{\delta_{n},b_{n}}\subset B_{b_{n},\delta_{n}}.

(64)

Furthermore, the proof of Theorem 2 in Appendix III shows that ${\mathcal{P}}_{Z^{n}}(\xi^{n}_{\delta_{n},b_{n}})\geq 1-\delta_{n}$ , where this bound is distribution-free (independent of $P\in\mathcal{P}_{0}$ ) from the use of the distribution-free concentration inequality in Lemma 1 (see the argument of this in Appendix III). Therefore, from (63) and (64), we conclude that

\alpha(b_{n},\delta_{n},a_{n})\leq{\mathcal{P}}_{Z^{n}}(B^{c}_{b_{n},\delta_{n}})\leq 1-{\mathcal{P}}_{Z^{n}}(\xi^{n}_{\delta_{n},b_{n}})\leq\delta_{n},

(65)

where this upper bound is valid for any model $P\in\mathcal{P}_{0}$ . ∎

G-D2 General Connection with Strong Consistency

There is a connection between strong consistency (in Definition 1) and the statistical significance (and also the power of a test) that can be obtained from the proof of Theorem 1 in Appendix I. Using the notation of Theorem 1, if we have a general test $(\phi_{n}(\cdot))_{n\geq 1}$ , the test is consistent if under ${\mathcal{H}}_{0}$ , the binary process $(\phi_{n}(Z^{n}))_{n\geq 1}$ reaches $0$ eventually almost surely, which is equivalent to state that:

\mathbb{P}(\lim\sup_{n}A_{n}^{c})=\mathbb{P}(\cap_{m\geq 1}\cup_{n\geq m}A_{n}^{c})=\lim_{m\rightarrow\infty}\mathbb{P}(\cup_{n\geq m}A_{n}^{c})=0

(66)

where $A_{n}\equiv\left\{z^{n}\in{\mathcal{X}}^{n}\times{\mathcal{Y}}^{n},\phi_{n}(z^{n})=0\right\}$ . Using this notation, the statistical significance of $\phi_{n}()$ under ${\mathcal{H}}_{0}$ is $\alpha_{n}\equiv\mathbb{P}_{Z^{n}}(A_{n}^{c})$ . Therefore, under ${\mathcal{H}}_{0}$ , we have that:

\alpha_{n}=\mathbb{P}_{Z^{n}}(A_{n}^{c})\leq\mathbb{P}(\cup_{k\geq n}A_{k}^{c}).

(67)

Therefore, if the test is strongly consistent from (66) and the bound in (67), it follows that the statistical significance of the test tends to zero for any model where $X$ and $Y$ are independent ( $P\in{\mathcal{P}}_{0}$ ).

Concerning the power, under ${\mathcal{H}}_{1}$ we have that the binary process $(\phi_{n}(Z^{n}))_{n\geq 1}$ reaches $1$ eventually almost surely (under the assumption of consistency), which is equivalent to state that:

\mathbb{P}(\lim\inf_{n}A_{n}^{c})=\mathbb{P}(\cup_{m\geq 1}\cap_{n\geq m}A_{n}^{c})=\lim_{m\rightarrow\infty}\mathbb{P}(\cap_{n\geq m}A_{n}^{c})=1.

(68)

Therefore using (68), the power of the test $\beta_{n}=\mathbb{P}_{Z^{n}}(A_{n}^{c})\geq\mathbb{P}(\cap_{k\geq n}A_{k}^{c})\rightarrow 1$ as $n$ tends to infinity from (68), for any model where $X$ and $Y$ are not independent ( $P\in{\mathcal{P}}_{1}$ ).

In conclusion, the definition of consistency used in this work (Definition 1) implies a vanishing error (when $n$ tends to infinity) under both hypotheses. However, this asymptotic property does not offer specifics bounds for the errors when $n$ is finite, in contrast to the result shown in Eq.(65).

		$\displaystyle D_{\sigma({\pi_{\hat{T}_{b_{n},\delta_{n}}}})}(\hat{P}_{n}\|\|\hat{Q}_{n}^{*})-r_{b_{n},\delta_{n}}\left(\left\|\hat{T}_{b_{n},\delta_{n}}\right\|\right)\geq$
		$\displaystyle\sup_{T\in\mathcal{G}^{k}_{b_{n}}}D_{\sigma({\pi_{{T}}})}(\hat{P}_{n}\|\|\hat{Q}_{n}^{*})-r_{b_{n},\delta_{n}}(k)$		(40)
		$\displaystyle\geq\sup_{T\in\mathcal{G}^{k}_{b_{n}}}D_{\sigma({\pi_{{T}}})}({P}\|\|{Q}^{*})-2r_{b_{n},\delta_{n}}(k),$		(41)

Data-Driven Representations for Testing Independence: Modeling, Analysis and Connection with Mutual Information Estimation

Abstract

Index Terms:

I Introduction

I-A Organization of the Paper

II Preliminaries

Definition 1

II-A The Divergence and Mutual Information

III The Empirical Log-Likelihood Statistics from a Data-Driven Partition

IV Regret Analysis: The Representation Learning Problem

IV-A Connection with Divergence Estimation

IV-A1 The Representation Problem

IV-B Strong Consistency: A Basic Requirement

Definition 2

PROPOSITION 1

THEOREM 1

V Tree-Structured Data-Driven Partitions

V-A The Collection of Partitions

V-B Regularized (Empirical) Information Maximization

LEMMA 1

V-C Consistency: A Preliminary Analysis

LEMMA 2

COROLLARY 1

Proof:

VI Main Finite-Length Results

THEOREM 2

Remark 1

VI-A Detecting Independence with the Structure of T^bn,δn\hat{T}_{b_{n},\delta_{n}}

Definition 3

LEMMA 3

THEOREM 3

THEOREM 4

VI-B Remarks about Theorem 4:

VII Implementation

VII-A The Learning Algorithm

VII-B Computational Cost

VIII Empirical Analysis

VIII-A Testing Independence: Non-Asymptotic Empirical Analysis

VIII-A1 Simulation Setting

VIII-A2 Parameter Selection for ϕbn,δn,anα​(⋅)\phi_{b_{n},\delta_{n},a_{n}}^{\alpha}(\cdot)

VIII-A3 Results

VIII-B Multidimensional Analysis

IX Summary and Final Discussion

IX-A Limitations and Future Work

IX-B A Recent Related Work

IX-C Statistical Significance Analysis

PROPOSITION 2

X Acknowledgment

Appendix A Proof of Theorem 1

Proof:

Appendix B Proof of Lemma 2

LEMMA 4

LEMMA 5

Proof:

Appendix C Proof of Theorem 2

Proof:

Appendix D Proof of Lemma 3

Proof:

Appendix E Proof of Theorem 3

Proof:

Appendix F Proof of Theorem 4

Proof:

References

Appendix G Suplemental Material

G-A Preliminary Analysis of Mutual Information Estimation

G-B Complementing Results for Figure 3

G-C Testing the Method on Different Distributions

G-D Statistical Significance Analysis in Section IX.C.

G-D1 Proof of Proposition 2

Proof:

G-D2 General Connection with Strong Consistency

VI-A Detecting Independence with the Structure of $\hat{T}_{b_{n},\delta_{n}}$

VIII-A2 Parameter Selection for $\phi_{b_{n},\delta_{n},a_{n}}^{\alpha}(\cdot)$