
Data-Driven Representations for Testing Independence: Modeling, Analysis and Connection with Mutual Information Estimation

Mauricio E. Gonzalez, Jorge F. Silva, Miguel Videla, and Marcos E. Orchard

M. Gonzalez, J. F. Silva, M. Videla, and M. Orchard are with the Information and Decision System Group, Department of Electrical Engineering, University of Chile, Av. Tupper 2007, Santiago 412-3, Room 508, Chile, Tel: 56-2-9784090, Fax: 56-2-6953881 (email: [email protected]).
Abstract

This work addresses testing the independence of two continuous and finite-dimensional random variables from the design of a data-driven partition. The empirical log-likelihood statistic is adopted to approximate the sufficient statistics of an oracle test against independence (that knows the two hypotheses). It is shown that approximating the sufficient statistics of the oracle test offers a learning criterion for designing a data-driven partition that connects with the problem of mutual information estimation. Applying these ideas in the context of a data-dependent tree-structured partition (TSP), we derive conditions on the TSP’s parameters to achieve a strongly consistent distribution-free test of independence over the family of probabilities equipped with a density. Complementing this result, we present finite-length results that show our TSP scheme’s capacity to detect the scenario of independence structurally with the data-driven partition as well as new sampling complexity bounds for this detection. Finally, some experimental analyses provide evidence regarding our scheme’s advantage for testing independence compared with some strategies that do not use data-driven representations.

Index Terms:
Independence testing, non-parametric learning, learning representations, data-driven partitions, tree-structured partitions, mutual information, consistency, finite-length analysis.

I Introduction

The problem of detecting independence from i.i.d. samples in a learning setting (distribution-free) is fundamental in statistics and has found numerous applications in statistical signal processing, data analysis, machine learning, inference, and decision problems [1, 2, 3, 4]. For instance, the capacity to detect independent and conditionally independent structures from data is relevant when invariances and probabilistic symmetries are present in decision and machine learning models [5, 6, 7, 8]. This capacity has been used in blind source separation, independent component analysis (ICA) [1, 2, 3, 4, 9] and the detection of associations in data [10]. Detecting independence from data has been systematically studied, and the literature is vast. Different non-parametric strategies have been proposed for this task using kernel-based statistics [11, 12, 13, 14], distance-based approaches [15, 16, 17, 18, 19], histogram-based approaches [20, 21, 22, 10, 23, 24], log-likelihood statistics [20, 25], correlation-based approaches [26, 27], $\phi$-divergence estimators [25, 20], and entropy and mutual information estimators [28, 29, 30, 10], among many others.

A natural strategy, and the one we follow in this paper, is to partition the observation space (the binning approach) to evaluate the discrepancy (in some sense) between two empirical distributions [20, 12, 29, 15, 22, 21]. We highlight the following works using this approach. Gretton and Györfi [20] introduced a family of product data-independent partitions using the $L_{1}$-statistics [31] and the I-divergence statistics [32]. With these two approaches, sufficient conditions are established on the partition and other learning parameters to achieve strong consistency for detecting independence, distribution-free, in a continuous multivariate scenario. Szekely et al. [15] also use partitions of the sample space, utilizing empirical distance correlation and distance covariance statistics to introduce a novel test of independence. The test is consistent and shows better empirical results than classical likelihood-ratio tests in scenarios with non-linear dependencies. Shi et al. [33] address the multivariate scenario by incorporating the concept of transportation-based ranks and signs. They propose plugging the center-outward distribution function into the mentioned distance covariance statistic [15] to determine a distribution-free test of independence.

More recently, Zhang [22] proposed a non-parametric test of independence assuming a rich class of models with known marginal distributions (uniform over $[0,1]$). This work proposed a binary expansion filtration of the space and deeply studied some symmetry properties to create a novel test that is uniformly consistent (on the power loss) and minimax for the power (in the sample size) assuming a large class of alternative distributions. Zhang et al. [34] presented a novel extension of the binary expansion filtration strategy [22] to a multidimensional setting by cleverly approximating an oracle Neyman-Pearson (NP) statistic. They propose a data-adaptive weight strategy (for the binary expansion) to approximate the ideal weight of the NP test. This new test unifies several important tests ($\chi^{2}$ test statistics, distance correlation) and shows promising empirical results.

On the important connection between testing independence and information measures [35, 36], we highlight the work of Kontoyiannis and Skoularidou [30], where the problem of estimating directed information (directed information rate) between two discrete (finite alphabet) processes was explored based on a plug-in estimator. Importantly, the authors used this plug-in estimator to test the presence or absence of causality between the processes. Along the same lines, Suzuki [29] explored different strategies to estimate mutual information between discrete and continuous random variables and then evaluated those strategies experimentally for the task of testing independence. Finally, Reshef et al. [10] introduced the maximal information coefficient (MIC) to measure complex functional dependencies between a pair of variables by maximizing the (empirical) mutual information computed for collections of (adaptive) partitions of different resolutions.

In many of the methods mentioned that use partitions (or binning) for testing independence, a component that has not been explored systematically is the role played by data-driven partitions [37, 38, 10]. Data-driven partitions use data for the binning process (vector quantization) to construct the cells and define the final structure of the partition. This flexibility offers the capacity to better address inference and learning problems. Supporting this idea, data-driven representations have shown great approximation properties and better decision-estimation performance in numerous non-parametric learning problems [37, 38, 39, 40, 41].

In our context, the motivation that piqued our interest in data-driven partitions is the fact that under the hypothesis of independence, the trivial partition (the partition with one cell that contains all the space) is a sufficient representation of the problem. Our initial conjecture is that a well-designed adaptive data-driven partition could have the ability to detect (learn from data) this trivial solution by implementing a form of explicit regularization in its design. This adaptive ability could provide better detection of independence than non-adaptive partition schemes, which has prompted the exploration of data-driven methods for testing independence in this work.

Motivated by this, we look at the problem of designing a new learning criterion that selects a data-driven partition of the space (adaptive binning) as the optimal trade-off between estimation and approximation errors in a learning problem. To formulate this representation learning task, we adopt ideas from universal source coding to introduce a regret term that measures the discrepancy attributed to the use of empirical log-likelihood statistics — restricted over a partition — with respect to the oracle sufficient statistics of an ideal test that knows the true (two) probabilities: the oracle test against independence [36, 42]. Using this regret analysis, we establish a novel connection with the problem of mutual information (MI) estimation [30]. Furthermore, general conditions are derived to obtain a strongly consistent test of independence in the strong (almost-sure and distribution-free) sense introduced in [20].

We apply this framework in the context of tree-structured partitions (TSP) [39], which is the main application focus of this work. TSP algorithms were selected for the following important reasons: their implementation is simple; TSPs have a binary structure that can be used to address learning tasks efficiently for a large class of optimization problems (minimum cost-tree pruning) [43, 44, 45]; and they have been shown to be expressive (with good approximation properties) when compared with other partition strategies in non-parametric learning and decision tasks [37, 38, 39, 40, 41, 43, 44, 46].

On the proposed test, we derive concrete connections with results on mutual information estimation [39]. From these connections, we establish new design conditions to obtain a strongly consistent test of independence and, more importantly, non-asymptotic results that express our framework’s capacity to approximate the information term of the oracle (ideal) test with a finite sample size. Going a step further in this finite-length performance analysis, we study our scheme’s capacity to detect independence structurally with the underlying data-driven partition, which was one of the original motivations for exploring data-driven partitions for this task.

Indeed, we show that under the independence assumption, our data-driven partition collapses to the trivial solution with one cell (implying that the resulting MI estimator is zero) with a finite sample-size almost surely. This is a remarkable property attributed to our scheme’s capacity to adapt its representation to the sufficient statistics of the problem, which is the trivial partition in the context of independence. From this ability, we improve our original result concerning our test’s consistency (density-free) and provide refined sampling complexity bounds for detecting the scenario of independence.

To implement our approach, we propose a learning algorithm that offers a computationally efficient implementation of our data-driven strategy for testing independence. The algorithm is divided into two phases (a growing stage and a pruning stage), both of which are shown to have a polynomial complexity in the size of the problem (the sample size). Finally, we provide empirical evidence of the advantage of our strategy in some controlled simulation scenarios. For this last analysis, we introduce concrete non-asymptotic metrics (in the form of sampling complexity indicators) to measure the test’s ability to detect the correct hypothesis with a finite number of samples.

A preliminary version of this work was presented in [47]. This work significantly expands the technical contribution in [47] and explores implementations and experimental analyses.

I-A Organization of the Paper

Section II introduces the problem and basic notations. Section III presents the general histogram-based scheme, and Section IV introduces the regret-based analysis proposed in this work and its connection with the problem of MI estimation. Section IV-B presents a general consistency result for our problem (Theorem 1). Section V introduces the family of tree-structured data-driven partitions and presents consistency results (Theorem 2, Lemma 2 and Corollary 1). Section VI analyzes some finite-length properties (Theorem 2) and elaborates on the structural capacity of our scheme to detect independence under the null hypothesis (Lemma 3, and Theorems 3 and 4). Sections VII and VIII cover implementation and empirical analysis, respectively. Finally, Section IX concludes with some discussion and future directions. The proofs are presented in the Appendix.

II Preliminaries

Let us consider two random variables $X$ and $Y$ taking values in $\mathbb{R}^{p}$ and $\mathbb{R}^{q}$, respectively, with joint distribution $P$ on $(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$, where $d=p+q$ and $\mathcal{B}(\mathbb{R}^{d})$ is the Borel sigma field. Let us denote by $\mathcal{P}_{0}$ the class of probabilities on $(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$ under which $(X,Y)$ are independent, meaning that if $P\in\mathcal{P}_{0}$, then $P(A\times B)=P(\mathbb{R}^{p}\times B)\cdot P(A\times\mathbb{R}^{q})$ for any $A\in\mathcal{B}(\mathbb{R}^{p})$ and $B\in\mathcal{B}(\mathbb{R}^{q})$.

Given $n$ i.i.d. samples of $(X,Y)$, denoted by $Z^{n}_{1}\equiv(Z_{1}\equiv(X_{1},Y_{1}),Z_{2}\equiv(X_{2},Y_{2}),\ldots,Z_{n}\equiv(X_{n},Y_{n}))$ and driven by the model $P\in\mathcal{P}(\mathbb{R}^{d})$, the problem is to decide from $Z^{n}_{1}$ whether $X$ and $Y$ are independent (meaning that $P\in\mathcal{P}_{0}$) or, alternatively, $X$ and $Y$ have some statistical dependency, i.e., $P\notin\mathcal{P}_{0}$. In this context, a decision rule $\phi_{n}(\cdot)$ of length $n$ is a function from $(\mathbb{R}^{d})^{n}$ to $\{0,1\}$, where $\phi_{n}(z^{n}_{1})=0$ means that the rule decides that the underlying probability producing $z^{n}_{1}$ belongs to $\mathcal{P}_{0}$. Then, for any decision rule $\phi_{n}\in\Pi_{n}$ of length $n$ (where $\Pi_{n}$ denotes the collection of binary rules acting on $\mathbb{R}^{dn}$), we recognize two errors:

  • Assuming that $P\in\mathcal{P}_{0}$ ($\mathcal{H}_{0}$), the non-detection of independence between $X$ and $Y$ is measured by (here $\mathbb{P}$ denotes the entire process distribution of $(Z_{n})_{n\geq 1}$, and $P^{n}$ is the $n$-fold probability on $\mathbb{R}^{dn}$ induced by the marginal $P$)

    $\mathbb{P}(\phi_{n}(Z^{n}_{1})=1)=P^{n}(\{z^{n}_{1}:\phi_{n}(z^{n}_{1})=1\});$

  • Assuming that $P\notin\mathcal{P}_{0}$ ($\mathcal{H}_{1}$), the false detection of independence between $X$ and $Y$ is measured by

    $\mathbb{P}(\phi_{n}(Z^{n}_{1})=0)=P^{n}(\{z^{n}_{1}:\phi_{n}(z^{n}_{1})=0\}).$

The following is the classical notion of strong consistency used to evaluate the asymptotic goodness of a universal scheme (a collection of rules of different lengths) for detecting independence from data [20].

Definition 1

Given a scheme $\xi=\{\phi_{n}\in\Pi_{n},n\geq 1\}$, where $\phi_{n}$ is a decision rule of length $n$, we say that $\xi$ is strongly consistent for detecting independence if

  • i) under $\mathcal{H}_{0}$: $(\phi_{n}(Z^{n}_{1}))_{n\geq 1}$ reaches $0$ eventually with probability one, or $\mathbb{P}$-almost surely (a.s.), and

  • ii) under $\mathcal{H}_{1}$: $(\phi_{n}(Z^{n}_{1}))_{n\geq 1}$ reaches $1$ eventually $\mathbb{P}$-a.s.

II-A The Divergence and Mutual Information

Let $P$, $Q$ be two probability measures on $(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$ such that $P\ll Q$, and let us consider $\pi=\{A_{i},i\in\mathcal{I}\}$, a measurable partition of $\mathbb{R}^{d}$, where $\mathcal{I}$ is finite or countable. The divergence of $P$ with respect to $Q$ restricted over $\pi$ (or over the sub-sigma field induced by $\pi$, denoted by $\sigma(\pi)$) is given by [32, 36]

D_{\sigma(\pi)}(P||Q)\equiv\sum_{i\in\mathcal{I}}P(A_{i})\log\frac{P(A_{i})}{Q(A_{i})}. (1)

The divergence of $P$ with respect to $Q$ is [32, 36]

D(P||Q)\equiv\sup_{\pi\in\mathcal{Q}(\mathbb{R}^{d})}D_{\sigma(\pi)}(P||Q), (2)

where $\mathcal{Q}(\mathbb{R}^{d})$ is the collection of measurable (Borel) partitions of $\mathbb{R}^{d}$.
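As a concrete illustration of Eq. (1), the following minimal Python sketch (ours, not part of the original paper) evaluates the restricted divergence from the cell probabilities that $P$ and $Q$ assign to a finite partition $\pi$:

```python
import numpy as np

def divergence_over_partition(p_cells, q_cells):
    """D_{sigma(pi)}(P||Q) in Eq. (1): sum_i P(A_i) * log(P(A_i) / Q(A_i)).

    p_cells, q_cells: arrays with the probabilities that P and Q assign to
    the cells {A_i} of a finite partition; cells with P(A_i) = 0 contribute 0.
    """
    p, q = np.asarray(p_cells, float), np.asarray(q_cells, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```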

III The Empirical Log-Likelihood Statistics from a Data-Driven Partition

In this work, we adopt a histogram-based log-likelihood statistics approach [20]. We interpret this approach as an empirical version of the Neyman-Pearson (NP) test against independence [42]. The ideal (oracle) NP test against independence decides from which of two known models the samples were produced: $P$ (for some $P\notin\mathcal{P}_{0}$) or $Q^{*}(P)\in\mathcal{P}_{0}$, where $Q^{*}(A\times B)=P(\mathbb{R}^{p}\times B)\cdot P(A\times\mathbb{R}^{q})$ for any $A\in\mathcal{B}(\mathbb{R}^{p})$ and $B\in\mathcal{B}(\mathbb{R}^{q})$. ($Q^{*}(P)$ can be interpreted as the projection of $P$ onto $\mathcal{P}_{0}$ in the information divergence sense, i.e., $Q^{*}(P)$ is the solution of $\min_{Q\in\mathcal{P}_{0}}D(P||Q)$.)

The specific approach proposed in this work is a two-stage process that uses the data $Z^{n}_{1}\sim P^{n}$ twice: first, to estimate $P$ and its projected version $Q^{*}(P)$ (over $\mathcal{P}_{0}$), with the caveat that the estimation is restricted to the events of a finite partition of the joint space $\mathbb{R}^{d}$ (i.e., a histogram-based estimation of $P$ and $Q^{*}(P)$, respectively); and second, to compute an empirical version of the log-likelihood ratio to decide (using a threshold) whether $\hat{P}=\hat{Q}^{*}$ (independence) or $\hat{P}\neq\hat{Q}^{*}$ (non-independence).

There are two elements of learning design that determine our empirical test. The first is a data-driven partition rule, denoted by $\pi_{n}(\cdot)$, which is a function that maps sequences in $(\mathbb{R}^{d})^{n}$ into finite measurable partitions of $\mathbb{R}^{d}$. The second element is a non-negative threshold, denoted by $a_{n}\in\mathbb{R}^{+}$. Then, given a pair $(\pi_{n}(\cdot),a_{n})$ and some data $z^{n}_{1}\in\mathbb{R}^{dn}$, the data-driven test is constructed as follows:
1) Use the quantization $\pi_{n}(z^{n}_{1})=\{A_{i},i=1,\ldots,|\pi_{n}(z^{n}_{1})|\}\subset\mathcal{B}(\mathbb{R}^{d})$ to estimate $P$ over the cells of $\pi_{n}(z^{n}_{1})$:

\hat{P}_{n}(A)\equiv\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{A}(z_{i})\ \text{ and} (3)
\hat{Q}^{*}_{n}(\underbrace{A^{1}\times A^{2}}_{=A})\equiv\hat{P}_{n}(A^{1}\times\mathbb{R}^{q})\cdot\hat{P}_{n}(\mathbb{R}^{p}\times A^{2}), (4)

for any $A\in\pi_{n}(z^{n}_{1})$. In Eq. (4) we assume that the cells of $\pi_{n}(z^{n}_{1})$ have a product structure, i.e., $A=A^{1}\times A^{2}$, where $A^{1}\in\mathcal{B}(\mathbb{R}^{p})$ and $A^{2}\in\mathcal{B}(\mathbb{R}^{q})$. (This event-wise product structure on the cells of $\pi_{n}(z^{n}_{1})$ is needed to estimate both $P$ and $Q^{*}$ from i.i.d. samples of $P$.)
2) Project the data $z^{n}_{1}$ onto the cells of $\pi_{n}(z^{n}_{1})$. A simple projection (or representation) function is the following:

O_{\pi_{n}}(z)\equiv\sum_{j=1}^{|\pi_{n}(z^{n}_{1})|}j\cdot\mathbf{1}_{A_{j}}(z)\in\{1,\ldots,|\pi_{n}(z^{n}_{1})|\}. (5)

The projected data is denoted by $o^{n}_{1}\equiv(o_{1}=O_{\pi_{n}}(z_{1}),\ldots,o_{n}=O_{\pi_{n}}(z_{n}))$.
3) Compute the log-likelihood ratio of the empirical distributions in (3) and (4) using the quantized (or projected over $\sigma(\pi_{n}(z^{n}_{1}))$) data $o^{n}_{1}$. In particular, we consider the following log-likelihood ratio per sample:

\hat{i}_{\pi_{n}}(o^{n}_{1})\equiv\frac{1}{n}\log\frac{\hat{P}_{O_{\pi_{n}}}(o_{1})\hat{P}_{O_{\pi_{n}}}(o_{2})\cdots\hat{P}_{O_{\pi_{n}}}(o_{n})}{\hat{Q}^{*}_{O_{\pi_{n}}}(o_{1})\hat{Q}^{*}_{O_{\pi_{n}}}(o_{2})\cdots\hat{Q}^{*}_{O_{\pi_{n}}}(o_{n})}, (6)

where $\hat{P}_{O_{\pi_{n}}}$ and $\hat{Q}^{*}_{O_{\pi_{n}}}$ denote the empirical distributions of the quantized random variable $O_{\pi_{n}}$ in its representation space $\{1,\ldots,|\pi_{n}(z^{n}_{1})|\}$, meaning that by construction $\hat{P}_{O_{\pi_{n}}}(j)=\hat{P}_{n}(A_{j})$ and $\hat{Q}^{*}_{O_{\pi_{n}}}(j)=\hat{Q}^{*}_{n}(A_{j})$ for all $j\in\{1,\ldots,|\pi_{n}(z^{n}_{1})|\}$.
4) Finally, the decision is given by the following rule:

\phi_{\pi_{n},a_{n}}(o^{n}_{1})\equiv\begin{cases}0&\text{if }\hat{i}_{\pi_{n}}(o^{n}_{1})<a_{n}\\ 1&\text{if }\hat{i}_{\pi_{n}}(o^{n}_{1})\geq a_{n}\end{cases}. (7)

The second step introduced $O_{\pi_{n}}(Z)$ in (5). This object captures the role played by the data-driven partition ($\pi_{n}$) in our log-likelihood statistics in (6).
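To make steps 1)–4) concrete, the following minimal Python sketch (our illustration, not the authors' released code) implements the test for a bivariate sample and a fixed product partition specified by bin edges on each coordinate; the data-driven construction of the partition itself is the subject of Section V.

```python
import numpy as np

def empirical_llr_test(x, y, x_edges, y_edges, a_n):
    """Steps 1)-4) of Section III for a fixed product partition.

    x, y: 1-D arrays of n paired observations; the cells A = A^1 x A^2 are
    given by the bin edges on each coordinate (assumed to cover the data).
    Returns the per-sample statistic (6) and the decision (7).
    """
    n = len(x)
    # Step 1: histogram estimates of P and its product projection Q*, Eqs. (3)-(4).
    joint, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    p_hat = joint / n
    q_hat = np.outer(p_hat.sum(axis=1), p_hat.sum(axis=0))  # product of marginals
    # Step 2: project each sample onto its cell index, Eq. (5).
    ox = np.clip(np.searchsorted(x_edges, x, side="right") - 1, 0, len(x_edges) - 2)
    oy = np.clip(np.searchsorted(y_edges, y, side="right") - 1, 0, len(y_edges) - 2)
    # Step 3: per-sample empirical log-likelihood ratio, Eq. (6).
    i_hat = float(np.mean(np.log(p_hat[ox, oy] / q_hat[ox, oy])))
    # Step 4: threshold decision, Eq. (7).
    return i_hat, int(i_hat >= a_n)
```

For instance, with `x_edges = y_edges = np.linspace(-4, 4, 5)` (a hypothetical 4x4 grid) and standard Gaussian data, the statistic concentrates near zero under independence; the bin edges here play the role of a non-adaptive surrogate for $\pi_{n}$.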

IV Regret Analysis: The Representation Learning Problem

To analyze the quality of the proposed test, in this section we compare the proposed statistics in (6) with the statistics used by an oracle NP test; we consider the data-driven partition in the pair $(\pi_{n}(\cdot),a_{n})$ as a learning agent. For this analysis, let us first fix a finite partition $\pi=\{A_{i},i\in\{1,\ldots,J\}\}$. Given i.i.d. realizations $Z_{1},\ldots,Z_{n}$ from $P\in\mathcal{P}(\mathbb{R}^{d})$, we consider the empirical-quantized log-likelihood statistics in (6). Adopting the notion of regret from universal source coding [35], we take as a reference the true likelihood ratio of $P$ with respect to $Q^{*}(P)$ associated with the expression in (6) but with no quantization effects. (By construction $P\ll Q^{*}$; therefore, the RN derivative $\frac{dP}{dQ^{*}}$ is well defined in (8), where it is clear that if $(\log\frac{dP}{dQ^{*}}(z))_{z\in\mathbb{R}^{d}}\in\ell_{1}(P)$, then $I(X;Y)=\mathbb{E}_{Z\sim P}\{\log\frac{dP}{dQ^{*}}(Z)\}<\infty$.) That is,

i_{n}(z_{1},\ldots,z_{n})\equiv\frac{1}{n}\log\frac{dP}{dQ^{*}}(z_{1})\cdot\frac{dP}{dQ^{*}}(z_{2})\cdots\frac{dP}{dQ^{*}}(z_{n}). (8)

$i_{n}(z_{1},\ldots,z_{n})$ is the ideal (oracle) information term used by the NP test against independence [35]. We measure the (sample-wise) regret of $\pi$ as

i_{n}(z_{1},\ldots,z_{n})-\hat{i}_{\pi}(O_{\pi}(z_{1}),\ldots,O_{\pi}(z_{n})), (9)

which measures the discrepancy between the empirical-quantized statistics in (6) and the oracle term in (8).

IV-A Connection with Divergence Estimation

The regret in (9) has two error sources: one associated with the quantization of the space (the representation quality of $\pi$) and the other associated with the fact that we use the empirical distribution $\hat{P}_{n}$ in (3) instead of the true model $P$. To isolate these components, it is useful to introduce the oracle-quantized information given by

i_{\pi}(o^{n}_{1})\equiv\frac{1}{n}\log\prod_{i=1}^{n}\frac{P_{O_{\pi}}(o_{i})}{Q^{*}_{O_{\pi}}(o_{i})}\ \text{ with }o^{n}_{1}\in\{1,\ldots,J\}^{n}, (10)

where $P_{O_{\pi}}(j)=P(A_{j})$ and $Q^{*}_{O_{\pi}}(j)=Q^{*}(A_{j})$ for all $j\in\{1,\ldots,J\}$ are shorthand for the true probabilities induced by $O_{\pi}(\cdot)$, $P$ and $Q^{*}$ on $\{1,\ldots,J\}$. Then, the regret in (9) can be decomposed into two components:

i_{n}(z_{1},\ldots,z_{n})-\hat{i}_{\pi}(O_{\pi}(z_{1}),\ldots,O_{\pi}(z_{n}))=\underbrace{i_{n}(z_{1},\ldots,z_{n})-i_{\pi}(O_{\pi}(z_{1}),\ldots,O_{\pi}(z_{n}))}_{I}+\underbrace{i_{\pi}(O_{\pi}(z_{1}),\ldots,O_{\pi}(z_{n}))-\hat{i}_{\pi}(O_{\pi}(z_{1}),\ldots,O_{\pi}(z_{n}))}_{II}. (11)

The first term ($I$) on the right-hand side (RHS) of Eq. (11) captures an information loss attributed to the quantization (5) (the approximation error); this expression converges to $D(P||Q^{*})-D_{\sigma(\pi)}(P||Q^{*})\geq 0$ almost surely as $n$ tends to infinity. The second term ($II$) on the RHS of (11) captures the discrepancy in the information density attributed to the use of empirical distributions. Crucially, these two error sources are driven by the partition $\pi$. At this point, it is worth noticing that the empirical information term can be expressed as $\hat{i}_{\pi}(O_{\pi}(z_{1}),\ldots,O_{\pi}(z_{n}))=\sum_{A\in\pi}\hat{P}_{n}(A)\log\frac{\hat{P}_{n}(A)}{\hat{Q}_{n}^{*}(A)}$, which is a histogram-based estimator of the divergence restricted over the cells of $\pi$ [39, 48, 40]. From these results, the RHS of (11) can be restructured as follows:

i_{n}(Z_{1},\ldots,Z_{n})-\hat{i}_{\pi}(O_{\pi}(Z_{1}),\ldots,O_{\pi}(Z_{n}))=\underbrace{i_{n}(Z_{1},\ldots,Z_{n})-D(P||Q^{*})}_{I}+\underbrace{D(P||Q^{*})-D_{\sigma({\pi})}({P}||{Q}^{*})}_{II}+\underbrace{D_{\sigma({\pi})}({P}||{Q}^{*})-D_{\sigma({\pi})}(\hat{P}_{n}||\hat{Q}_{n}^{*})}_{III}. (12)

The first term ($I$) on the RHS of (12) goes to zero with probability one (by the strong law of large numbers, assuming that $D(P||Q^{*})<\infty$) independently of $\pi$; therefore, from the perspective of evaluating the effect of $\pi$ on the regret, this term can be overlooked. The second term ($II$) captures the approximation error, i.e., what we lose in discrimination (as the number of samples goes to infinity) when projecting the data over $\sigma(\pi)$, with respect to the lossless term in (8). Finally, the last term ($III$) measures the estimation error, which is the discrepancy between the true (oracle) distributions and the empirical distributions in the information divergence sense over the events of $\sigma(\pi)$ (see Eq. (1)).

IV-A1 The Representation Problem

Here we introduce the main design problem of this work: to find a data-driven partition that offers an optimal balance between the two relevant terms presented in (12). For the estimation error, the idea is to adopt distribution-free (universal) error bounds for $\left|D_{\sigma({\pi})}({P}||{Q}^{*})-D_{\sigma({\pi})}(\hat{P}_{n}||\hat{Q}_{n}^{*})\right|$ of the form (more details in Section V-B):

\mathbb{P}\left(\left|D_{\sigma({\pi})}({P}||{Q}^{*})-D_{\sigma({\pi})}(\hat{P}_{n}||\hat{Q}_{n}^{*})\right|\leq r_{n}(\pi)\right)\geq 1-\delta,

where $r_{n}(\pi)$ is the confidence interval for the confidence probability $1-\delta$. Equipped with this result, we could use the following upper bound for the regret, which holds with high probability: $D(P||Q^{*})-\hat{i}_{\pi}(O_{\pi}(z_{1}),\ldots,O_{\pi}(z_{n}))\leq\left[D(P||Q^{*})-D_{\sigma({\pi})}({P}||{Q}^{*})\right]+r_{n}(\pi)$. Finally, we could formulate the problem of selecting $\pi$ over a family of representations $\Pi$ (a subset of $\mathcal{Q}(\mathbb{R}^{d})$) as the solution of the following regularized info-max problem:

\pi^{*}_{n}\equiv\arg\max_{\pi\in\Pi}\ D_{\sigma({\pi})}({P}||{Q}^{*})-r_{n}(\pi). (13)

The learning task in (13) is still intractable: it resembles an oracle learner agent (a teacher) that selects $\pi$ in $\Pi$ by solving the trade-off between the two errors, one of which requires the true model $P$. Section V addresses a tractable info-max version of (13) where, instead of $P$, the empirical distribution $\hat{P}_{n}$ in (3) is adopted.

IV-B Strong Consistency: A Basic Requirement

In this subsection, we introduce the idea of measuring the discrepancy between $\hat{i}_{\pi_{n}}(O_{\pi_{n}}(z_{1}),\ldots,O_{\pi_{n}}(z_{n}))$ and the ideal (oracle) information $i_{n}(z_{1},\ldots,z_{n})$. From this perspective, we introduce a new notion of consistency based on this learning objective.

Definition 2

Our scheme $\{(\pi_{n},a_{n}),n\geq 1\}$ is said to be strongly consistent on the regret if, for any $P\in\mathcal{P}(\mathbb{R}^{d})$ and any i.i.d. process $(Z_{n})_{n\geq 1}$ with $Z_{i}\sim P$, it follows that

\lim_{n\rightarrow\infty}\left|i_{n}(Z_{1},\ldots,Z_{n})-\hat{i}_{\pi_{n}}(O_{\pi_{n}}(Z_{1}),\ldots,O_{\pi_{n}}(Z_{n}))\right|=0,

$\mathbb{P}$-almost surely (a.s.).

A simple result follows:

PROPOSITION 1

If $\{(\pi_{n},a_{n}),n\geq 1\}$ is strongly consistent on the regret, then $\hat{i}_{\pi_{n}}(O_{\pi_{n}}(Z_{1}),\ldots,O_{\pi_{n}}(Z_{n}))$ is a strongly consistent estimator of the MI between $X$ and $Y$. (From Definition 2, it follows directly that $\lim_{n\rightarrow\infty}i_{n}(Z_{1},\ldots,Z_{n})=\lim_{n\rightarrow\infty}D_{\sigma({\pi_{n}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})=D(P||Q^{*})=I(X;Y)$, $\mathbb{P}$-almost surely.)

Returning to our original problem, the next result offers sufficient conditions for a data-driven scheme $\{(\pi_{n},a_{n}),n\geq 1\}$ to be strongly consistent for detecting independence (Def. 1).

THEOREM 1

Let $\{(\pi_{n},a_{n}),n\geq 1\}$ be the data-driven scheme of Section III, and assume that

  • i) $(a_{n})_{n}$ is $o(1)$;

  • ii) $\{\pi_{n},n\geq 1\}$ is strongly consistent on the regret (Def. 2);

  • iii) under $\mathcal{H}_{0}$, $(\hat{i}_{\pi_{n}}(O_{\pi_{n}}(Z_{1}),\ldots,O_{\pi_{n}}(Z_{n})))_{n\geq 1}$ is $o(a_{n})$ in the sense that $\lim_{n\rightarrow\infty}\frac{\hat{i}_{\pi_{n}}(O_{\pi_{n}}(Z_{1}),\ldots,O_{\pi_{n}}(Z_{n}))}{a_{n}}=0$ $\mathbb{P}$-a.s.

Then $(\phi_{\pi_{n},a_{n}}(O_{\pi_{n}}(\cdot)))_{n\geq 1}$ in (7) is strongly consistent for detecting independence (Def. 1).

The proof is presented in Appendix A.

Consistency is a basic asymptotic requirement that is not sufficient when looking into practical applications that operate with a finite sample size. Hence, in the next sections, we present a test that is strongly consistent (Th. 3) but that also satisfies some relevant non-asymptotic properties in the form of finite-length performance results (Ths. 2 and 4 and Lemma 3).

V Tree-Structured Data-Driven Partitions

In this section, an empirical version of (13) is studied, considering for $\Pi$ a dictionary of tree-structured data-driven partitions (TSP). This TSP family was introduced in [39] for the problem of MI estimation. The focus of this section is to demonstrate its potential for the problem of testing independence.

V-A The Collection of Partitions

The construction of this family begins by defining a “root” node indexing the trivial partition $A_{root}\equiv\{\mathbb{R}^{d}\}$. The process continues by selecting a coordinate in $\{1,\ldots,d\}$ onto which the data are projected (a 1D projection), ordering the projected data (scalar values), selecting the median of the ordered sequence, and finally creating a statistically equivalent partition of $A_{root}$, using the selected coordinate axis and the median to split the cell. Two new cells are created from $A_{root}$ with almost half the sample points in each (exactly half when $n$ is an even integer).

These new cells are indexed as the left and right children of the “root”, denoted by $(l(root),r(root))$; i.e., the new cells are $A_{l(root)}$ and $A_{r(root)}$, where $A_{root}=A_{l(root)}\cup A_{r(root)}$. The process continues by selecting a new coordinate in $\{1,\ldots,d\}$, and the statistically equivalent binary splitting criterion is iterated on $A_{l(root)}$ and $A_{r(root)}$ until a stopping condition is met. The stopping criterion adopted here imposes a minimum empirical probability, denoted by $b_{n}\in(0,1)$, on each created cell. Therefore, at the end of this growing binary process, a collection of nodes $\mathcal{I}\equiv\{root,l(root),r(root),\ldots\}$ associated with a binary tree (by construction) is produced, together with the nodes’ respective cells $\{A_{v},v\in\mathcal{I}\}$, where $\hat{P}_{n}(A_{v})\geq b_{n}$ for any $v\in\mathcal{I}$.

Using the convention of Breiman et al. [43], a binary tree $T$ is a collection of nodes in $\mathcal{I}$: one node of degree 2 (the “root”) and the remaining nodes of degree 3 (internal nodes) or degree 1 (leaf or terminal nodes) [43]. In this convention, the full tree is denoted by $T^{full}_{b_{n}}\equiv\mathcal{I}$. Importantly, if $\tilde{T}\subset T^{full}_{b_{n}}$ and $\tilde{T}$ is a binary tree by itself, then we say that $\tilde{T}$ is a subtree of $T^{full}_{b_{n}}$. Moreover, if both have the same root, we say that $\tilde{T}$ is a pruned version of $T^{full}_{b_{n}}$, which we denote by $\tilde{T}\ll T^{full}_{b_{n}}$. Finally, if we denote by $\mathcal{L}(T)$ the leaf nodes of an arbitrary tree $T\ll T^{full}_{b_{n}}$, it is simple to verify that $\pi_{T}\equiv\{A_{v},v\in\mathcal{L}(T)\}$ is a data-driven partition of $\mathbb{R}^{d}$ indexed by $T$, where every cell in $\pi_{T}$ has the desired product structure of Eq. (4) [39]. (The interested reader is referred to [39] for a systematic explanation of this TSP construction.)
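The growing phase can be sketched in a few lines of Python. The following is our illustrative rendition (not the authors' code): nodes are plain dictionaries, the split coordinate is chosen round-robin (one admissible choice; the paper delegates the coordinate-selection rule to [39]), and splitting stops as soon as a child cell would violate $\hat{P}_{n}(A_{v})\geq b_{n}$:

```python
import numpy as np

def grow_full_tree(cell_data, n_total, b_n, depth=0):
    """Growing phase of Section V-A (illustrative sketch).

    cell_data: (m, d) array with the samples that reached the current cell;
    n_total: total sample size n, so P_hat(A_v) = m / n_total;
    b_n: minimum empirical probability allowed per cell.
    Returns a nested dict: the node, its split, and (recursively) its children.
    """
    m, d = cell_data.shape
    node = {"count": m, "children": None}
    # Stop if a statistically equivalent split would create a cell whose
    # empirical probability falls below b_n.
    if (m // 2) < b_n * n_total:
        return node
    axis = depth % d                         # round-robin coordinate (assumption)
    order = np.argsort(cell_data[:, axis])   # order the projected scalar values
    left, right = order[: m // 2], order[m // 2:]  # split at the median point
    node["axis"] = axis
    node["median"] = float(cell_data[order[m // 2], axis])
    node["children"] = (grow_full_tree(cell_data[left], n_total, b_n, depth + 1),
                        grow_full_tree(cell_data[right], n_total, b_n, depth + 1))
    return node
```

Since every split is axis-aligned, the induced cells inherit the product structure required in Eq. (4).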

V-B Regularized (Empirical) Information Maximization

To address the info-max design objective in (13) using our (data-driven) tree-indexed family $\Pi_{n}\equiv\{\pi_{T}:T\ll T^{full}_{b_{n}}\}$, we first have to find an expression for $r_{n}(\pi)$ in (13) for any $\pi_{T}\in\Pi_{n}$. The next result offers the following:

LEMMA 1

(Silva et al. [39, Th. 1]) Let $\mathcal{G}^{k}_{b_{n}}\equiv\{T\ll T^{full}_{b_{n}}:|T|=k\}$ be the family of pruned TSPs of size $k$. Then, $\forall k\in\{1,\ldots,|T^{full}_{b_{n}}|\}$, $\forall n>0$, and any small $\delta\in(0,1)$, there is a threshold $\epsilon_{c}(\cdot)$ such that

\mathbb{P}\left(\sup_{T\in\mathcal{G}^{k}_{b_{n}}}\left|D_{\sigma({\pi_{T}})}({P}||{Q}^{*})-D_{\sigma({\pi_{T}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})\right|\leq\epsilon_{c}(\cdot)\right)\geq 1-\delta, (14)

where specifically $\epsilon_{c}(n,b_{n},d,\delta,k)=$

\frac{24\sqrt{2}}{b_{n}\sqrt{n}}\sqrt{\ln(8/\delta)+k\left[(d+1)\ln(2)+d\ln(n)\right]}. (15)

Importantly, the bound in (14) is distribution-free [39] and is a function of $b_{n}$ and $k$ (the common size of the trees in $\mathcal{G}^{k}_{b_{n}}$). From this distribution-free concentration result, the union bound tells us that the following events in $\mathbb{R}^{dn}$,

\left\{\sup_{T\in\mathcal{G}^{k}_{b_{n}}}\left|D_{\sigma({\pi_{T}})}({P}||{Q}^{*})-D_{\sigma({\pi_{T}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})\right|\leq r_{b_{n},\delta}(k)\right\}, (16)

with $r_{b_{n},\delta}(k)\equiv\epsilon_{c}(n,b_{n},d,\delta\cdot b_{n},k)$, happen simultaneously for every $k=1,\ldots,|T^{full}_{b_{n}}|$ with probability at least $1-\delta$ (with respect to $\mathbb{P}$). Equipped with these penalizations, i.e., $r_{b_{n},\delta}(|T|)$ in (16) for any $T\ll T^{full}_{b_{n}}$, the empirical version of the info-max problem in (13) is

\hat{T}_{b_{n},\delta_{n}}\equiv\arg\max_{T\ll T^{full}_{b_{n}}}\ D_{\sigma({\pi_{T}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})-r_{b_{n},\delta_{n}}(|T|). (17)

Finally, it is important to note that both $(b_{n})_{n\geq 1}$ and $(\delta_{n})_{n\geq 1}$ determine the trees $(\hat{T}_{b_{n},\delta_{n}})_{n\geq 1}$ and the TS partitions, which we denote here by $({\pi}_{b_{n},\delta_{n}})_{n\geq 1}$. In addition, if we include the thresholds $(a_{n})_{n\geq 1}$, we denote by $(\phi_{b_{n},\delta_{n},a_{n}}(\cdot))_{n\geq 1}$ the rules induced by $({\pi}_{b_{n},\delta_{n}})_{n\geq 1}$ and $(a_{n})_{n\geq 1}$ in (7).
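For reference, the threshold in (15) and the calibration $r_{b_{n},\delta}(k)=\epsilon_{c}(n,b_{n},d,\delta\cdot b_{n},k)$ used in (16) translate directly into code; the following is a literal transcription (a sketch; the variable names are ours):

```python
import math

def epsilon_c(n, b_n, d, delta, k):
    """Concentration threshold of Lemma 1, Eq. (15)."""
    return (24 * math.sqrt(2) / (b_n * math.sqrt(n))) * math.sqrt(
        math.log(8 / delta) + k * ((d + 1) * math.log(2) + d * math.log(n))
    )

def penalty(n, b_n, d, delta, k):
    """r_{b_n, delta}(k) in Eq. (16): epsilon_c evaluated at delta * b_n."""
    return epsilon_c(n, b_n, d, delta * b_n, k)
```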

V-C Consistency: A Preliminary Analysis

The following result shows that the TSP scheme in (17) is strongly consistent on the regret (Def. 2).

LEMMA 2

Let us assume that $P$ has a density in $\mathbb{R}^{d}$, i.e., $P$ is absolutely continuous with respect to the Lebesgue measure on $(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$.
i) Under the conditions that $(b_{n})\approx(n^{-l})$ for $l\in(0,1/3)$, $(\delta_{n})$ is $o(1)$, and $(1/\delta_{n})$ is $\mathcal{O}({e}^{n^{1/3}})$, we have that

\lim_{n\rightarrow\infty}\left|i_{n}(Z_{1},\ldots,Z_{n})-\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),\ldots,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))\right|=0,\ \mathbb{P}\text{-a.s.} (18)

ii) Assuming that $X$ and $Y$ are independent ($\mathcal{H}_{0}$), if we select $(1/\delta_{n})_{n}\approx(e^{n^{1/3}})$, it follows that for any $p>0$

(\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),\ldots,O_{\pi_{b_{n},\delta_{n}}}(Z_{n})))_{n\geq 1}\text{ is }o(n^{-p}),\ \mathbb{P}\text{-a.s.} (19)

The proof is presented in Appendix B.

In addition to achieving consistency on the regret, stated in part i), part ii) of this result shows that under $\mathcal{H}_{0}$ the regret achieves a super-polynomial velocity of convergence to zero (recall that $i_{n}(Z_{1},\ldots,Z_{n})=0$ with probability one under $\mathcal{H}_{0}$). Therefore, the empirical information rate $\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),\ldots,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))$ tends to zero faster than any polynomial order of $n$ with probability one, which is a remarkable capacity to detect this condition from data. In light of Theorem 1, we then have a range of admissible vanishing thresholds $(a_{n})_{n\geq 1}$ for which strong consistency for detecting independence can be achieved (Def. 1). This is stated in the following result.

COROLLARY 1

Let us assume the setting and conditions on $(b_{n})_{n\geq 1}$ and $(\delta_{n})_{n\geq 1}$ stated in Lemma 2, part ii), for the TSP scheme. If $(a_{n})_{n\geq 1}$ is $\mathcal{O}(n^{-q})$ for some arbitrary $q>0$, then the TSP scheme $\{\phi_{{b_{n},\delta_{n}},a_{n}},n\geq 1\}$ is strongly consistent for detecting independence (Def. 1).

Proof:

The proof derives directly from Theorem 1 and the two results obtained in Lemma 2. ∎

VI Main Finite-Length Results

In this section, we focus on finite-length (non-asymptotic) results. We establish conditions under which the solution of (17) nearly matches the performance of its equivalent oracle version in (13) with high probability. This non-asymptotic result is instrumental in showing one of the main findings of our work: the capacity of our scheme to detect $\mathcal{H}_{0}$ with the structure of the partition.

THEOREM 2

Under the conditions that $(b_{n})\approx(n^{-l})$ for $l\in(0,1/3)$, $(\delta_{n})$ is $o(1)$, and $(1/\delta_{n})$ is $\mathcal{O}({e}^{n^{1/3}})$, we have that:
i) under $\mathcal{H}_{0}$: for any $\epsilon>0$ there is $N(\epsilon)>0$ such that $\forall n\geq N(\epsilon)$, the equality

\underbrace{\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),\ldots,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))=i_{n}(Z_{1},\ldots,Z_{n})}_{\text{zero-regret regime}}=0 (20)

holds with $\mathbb{P}$-probability $1-\epsilon$;

ii) under $\mathcal{H}_{1}$: for any $\epsilon>0$ there is $N(\epsilon)$ such that $\forall n\geq N(\epsilon)$ the bound

i_{n}(Z_{1},\ldots,Z_{n})-\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),\ldots,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))\leq i_{n}(Z_{1},\ldots,Z_{n})-I(X;Y)+\min_{T\ll T^{full}_{b_{n}}}\left\{\left[D(P||Q^{*}(P))-D_{\sigma({\pi_{T}})}({P}||{Q}^{*}(P))\right]+2r_{b_{n},\delta_{n}}(|T|)\right\} (21)

holds with $\mathbb{P}$-probability $1-\epsilon$.

The proof of this result is presented in Appendix C.

This result presents two optimality bounds for the information term of the TSP scheme under the two main hypotheses ($\mathcal{H}_{0}$ and $\mathcal{H}_{1}$) of our problem. Under $\mathcal{H}_{0}$, our regularization approach is capable of detecting this structure (with an arbitrarily high probability $1-\epsilon$) in the sense that $\hat{T}_{b_{n},\delta_{n}}$ in (17) reduces to the trivial partition $\{\mathbb{R}^{d}\}$. From this, we obtain the zero-regret condition $\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),\ldots,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))=i_{n}(Z_{1},\ldots,Z_{n})$. This means that the solution $\hat{T}_{b_{n},\delta_{n}}$ by itself (with no threshold) is capable of detecting independence.

On the other hand, under $\mathcal{H}_{1}$, we notice that it is only relevant to bound the under-estimation of $i_{n}(Z_{1},\ldots,Z_{n})$ by the empirical (log-likelihood ratio) information $\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),\ldots,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))$. Limiting the analysis to the under-estimation of $i_{n}(Z_{1},\ldots,Z_{n})$ can be argued from the well-known observation that quantization (on average) produces a biased (under-)estimation of $\mathbb{E}(i_{n}(Z_{1},\ldots,Z_{n}))=I(X;Y)$. More importantly for our problem, an over-estimation of the oracle information $i_{n}(Z_{1},\ldots,Z_{n})$ does not increase the type-2 error when using $\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),\ldots,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))$ instead of the oracle $i_{n}(Z_{1},\ldots,Z_{n})$ under $\mathcal{H}_{1}$.

In summary, Theorem 2 indicates that, with an arbitrarily high probability, the empirical TSP solution of (17) achieves the performance of the oracle solution that optimizes the upper bound on the regret derived in Eq. (13) over the TSP family $\{\pi_{T},T\ll T^{full}_{b_{n}}\}$.

Remark 1

In the proof of Theorem 2, we determine a set $\mathcal{E}^{n}_{\delta_{n},b_{n}}$ (see Eq. (39)) such that if $z^{n}_{1}\in\mathcal{E}^{n}_{\delta_{n},b_{n}}$, then (20) holds under $\mathcal{H}_{0}$ and (21) holds under $\mathcal{H}_{1}$. Importantly, we obtain from Lemma 1 that $\mathbb{P}(\mathcal{E}^{n}_{\delta_{n},b_{n}})\geq 1-\delta_{n}$. Therefore, in Theorem 2, $N(\epsilon)$ is fully determined by $(\delta_{n})_{n\geq 1}$ for any $\epsilon>0$. Indeed, $N(\epsilon)=\inf\{n\geq 1\ \text{s.t. }\delta_{n}<\epsilon\}$, which is well defined since $(\delta_{n})$ is $o(1)$. In particular, for the case of $(1/\delta_{n})$ being $\mathcal{O}(e^{n^{1/3}})$, it follows that $N(\epsilon)$ is $\mathcal{O}(\ln(1/\epsilon)^{3})$.

VI-A Detecting Independence with the Structure of $\hat{T}_{b_{n},\delta_{n}}$

From (20), our data-driven partition has the capacity to detect independence (under $\mathcal{H}_{0}$) in a stronger, structural way: with high probability, the data-driven partition collapses to the trivial cell, i.e., $\pi_{b_{n},\delta_{n}}=\{\mathbb{R}^{d}\}$, with a finite sample size. Here, we improve this result by showing that the condition $\pi_{b_{n},\delta_{n}}=\{\mathbb{R}^{d}\}$ happens within a finite sample size almost surely.

Let us introduce the (random) time at which the partition process $(\pi_{b_{n},\delta_{n}})_{n\geq 1}$ collapses to the trivial set $A_{root}=\{\mathbb{R}^{d}\}$:

Definition 3

\mathcal{T}_{0}((Z_{n})_{n\geq 1})\equiv\sup\left\{m\geq 1:\left|\hat{T}_{b_{m},\delta_{m}}\right|>1\right\}\in\mathbb{N}^{*}\equiv\mathbb{N}\cup\{\infty\}. (22)

If $\mathcal{T}_{0}((Z_{n})_{n\geq 1})=k$, then $|\hat{T}_{b_{n},\delta_{n}}|=1\Leftrightarrow\pi_{b_{n},\delta_{n}}=\{\mathbb{R}^{d}\}$ for any $n>k$. Therefore, if $\mathcal{T}_{0}((Z_{n})_{n\geq 1})\in\mathbb{N}$, this value expresses the last time the data-driven partition differs from the trivial solution $\{\mathbb{R}^{d}\}$. On the other hand, if $\mathcal{T}_{0}((Z_{n})_{n\geq 1})=\infty$, non-trivial partitions are observed infinitely often (i.o.) in the sequence $(\pi_{b_{n},\delta_{n}}(Z^{n}))_{n\geq 1}$. In general, we could have that $\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})=\infty)>0$ under $\mathcal{H}_{0}$, which would not contradict the capacity of $(\phi_{b_{n},\delta_{n},a_{n}}(\cdot))_{n\geq 1}$ to detect independence consistently under $\mathcal{H}_{0}$ (see Corollary 1).

The following result shows that, under some mild conditions, $\mathcal{T}_{0}((Z_{n})_{n\geq 1})$ is finite with probability one.

LEMMA 3

Let us assume that $(b_{n})\approx(n^{-l})$ for $l\in(0,1/3)$, $(\delta_{n})$ is $\ell_{1}(\mathbb{N})$, and $(1/\delta_{n})$ is $\mathcal{O}({e}^{n^{1/3}})$. Then, under the hypothesis that $P\in\mathcal{P}_{0}$: $\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})<\infty)=1$.

The proof is presented in Appendix D.

Importantly, the condition $\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})<\infty)=1$ implies that $(\phi_{{b_{n},\delta_{n}},a_{n}}(Z_{1},\ldots,Z_{n}))_{n\geq 1}$ reaches $0$ eventually with probability one for any sequence of thresholds $(a_{n})_{n\geq 1}\in[0,1]^{\mathbb{N}}$. From this observation, we obtain the following result, which extends the regime of parameters for which a consistent test is achieved (established in Corollary 1).

THEOREM 3

If $(b_{n})_{n\geq 1}\approx(n^{-l})$ for $l\in(0,1/3)$, $(\delta_{n})_{n\geq 1}$ is $\ell_{1}(\mathbb{N})$, $(1/\delta_{n})_{n\geq 1}$ is $\mathcal{O}({e}^{n^{1/3}})$, and $(a_{n})_{n\geq 1}$ is $o(1)$, then the scheme $(\phi_{{b_{n},\delta_{n}},a_{n}}(Z_{1},\ldots,Z_{n}))_{n\geq 1}$ is strongly consistent for detecting independence (Def. 1).

The proof is presented in Appendix E.

Finally, if we focus on the admissible solution where $(1/\delta_{n})\approx(e^{n^{1/3}})$, we have the following refined description of the distribution of $\mathcal{T}_{0}((Z_{n})_{n\geq 1})$ under $\mathcal{H}_{0}$.

THEOREM 4

Under the assumptions of Theorem 3, let us consider $(1/\delta_{n})\approx(e^{n^{1/3}})$. Under $\mathcal{H}_{0}$, we have that $\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})\geq m)\leq Ke^{-m^{1/3}}\sim\mathcal{O}(e^{-m^{1/3}})$ for any $m\geq 1$ and some universal constant $K>0$.

The proof is presented in Appendix F.

VI-B Remarks about Theorem 4

1: If we introduce

\mathcal{T}((Z_{n})_{n\geq 1},(a_{n})_{n\geq 1})\equiv\sup\left\{m\geq 1:\phi_{b_{m},\delta_{m},a_{m}}(Z_{1},\ldots,Z_{m})=1\right\} (23)

as the last time the binary process $(\phi_{b_{n},\delta_{n},a_{n}}(Z_{1},\ldots,Z_{n}))_{n\geq 1}$ takes the value $1$ (after which it stays at $0$), it is simple to verify that $\mathbb{P}(\mathcal{T}((Z_{n})_{n\geq 1},(a_{n})_{n\geq 1})\geq m)\leq\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})\geq m)$ for any $(a_{n})_{n}$ (see Appendix E). Therefore, under $\mathcal{H}_{0}$ and the assumptions of Theorem 4, we have as a corollary that $\mathbb{P}(\mathcal{T}((Z_{n})_{n\geq 1},(a_{n})_{n\geq 1})\geq m)\leq Ke^{-m^{1/3}}\sim\mathcal{O}(e^{-m^{1/3}})$ for any sequence $(a_{n})_{n\geq 1}$, as long as $(a_{n})_{n\geq 1}$ is $o(1)$.
2: With this bound, we can determine (under $\mathcal{H}_{0}$) a lower bound on the number of samples needed to guarantee that we detect independence structurally, i.e., reaching $\pi_{b_{n},\delta_{n}}=\{\mathbb{R}^{d}\}$, and also with our schemes $(\phi_{b_{n},\delta_{n},a_{n}}(\cdot))_{n\geq 1}$, with a confidence probability $1-\epsilon$. This critical number of samples is the $m(\epsilon)$ such that $Ke^{-m(\epsilon)^{1/3}}<\epsilon$ and $Ke^{-(m(\epsilon)-1)^{1/3}}\geq\epsilon$. This number scales like $(m(\epsilon))_{\epsilon}\approx\left(\ln(K/\epsilon)\right)^{3}\sim\mathcal{O}(\ln(1/\epsilon)^{3})$.
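As a rough numerical illustration (under the purely hypothetical normalization $K=1$, since Theorem 4 does not specify the constant), a confidence of $1-\epsilon=0.95$ gives

m(\epsilon)=\left\lceil\left(\ln(K/\epsilon)\right)^{3}\right\rceil=\left\lceil(\ln 20)^{3}\right\rceil=27,

i.e., a few tens of samples at this confidence level under that normalization.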

VII Implementation

VII-A The Learning Algorithm

Let us briefly revisit the stages presented in Section V for the construction of our TSP scheme. Beginning with the partition, there are two phases: growing and pruning. In the growing phase, presented in Section V-A, we have a collection of tree-indexed partitions $\{\pi_{T},T\ll T^{full}_{b_{n}}\}$, where $\pi_{T}=\{A_{v},v\in\mathcal{L}(T)\}$ and, by construction, $\hat{P}_{n}(A_{v})\geq b_{n}=w\cdot n^{-l}$ for any $v\in T$. Here, we added a scalar parameter $w>0$ as a relevant design element for our numerical analysis. In the pruning phase, we implement the following complexity-regularized information maximization:

\hat{T}_{b_{n},\delta_{n}}(\alpha)\equiv\arg\max_{T\ll T^{full}_{b_{n}}}\ D_{\sigma({\pi_{T}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})-\alpha\cdot r_{b_{n},\delta_{n}}(|T|), (24)

where we included a regularization parameter $\alpha\in\mathbb{R}^{+}$.

Figure 1: Illustration of the trade-off between $M_{0}(\epsilon=0.05)$ and $M^{\sigma}_{1}(\epsilon=0.05)$, presented in Eqs. (29) and (30), for different parameters of our TSP approach. Each curve is associated with fixed values of $l$ and $w$ and is produced using different values of $\alpha$. The three scenarios of correlation (under $\mathcal{H}_{1}$) are $\sigma=\mathbb{E}(XY)\in\{0.3,0.5,0.7\}$. For the graph in the first row, $w=0.05$, whereas $l=0.167$ in the graph depicted in the second row.

VII-B Computational Cost

To assess the cost of implementing (24), we analyze each of its two phases individually.

Growing Phase: During the growing stage (see Section V-A), we obtain a collection of tree-indexed partitions $\{\pi_{T},T\ll T^{full}_{b_{n}}\}$, where $\pi_{T}=\{A_{v},v\in\mathcal{L}(T)\}$. The full tree $T^{full}_{b_{n}}$ is obtained after a series of statistically equivalent binary partitions. The computational cost of the growing phase depends on the number of sub-partitions (or binary splittings) needed to obtain the full tree (which depends on the number of cells), and on the cost of sorting the projected data of each cell and selecting the median of the ordered sequence to partition the cell. The number of cells in the full tree, $|\pi_{T^{full}_{b_{n}}}|$, can be estimated considering that our design criterion states that $\hat{P}_{n}(A_{v})\geq b_{n}=w\cdot n^{-l}$ for any $v\in T$, where $w\in\mathbb{R}^{+}$ and $l\in(0,1/3)$ are design parameters of our method. Then, we should have at least $b_{n}\cdot n$ samples per cell. Therefore, the number of cells associated with $T^{full}_{b_{n}}$ is bounded by

\left|\pi_{T^{full}_{b_{n}}}\right|\leq\frac{n}{b_{n}\cdot n}=\frac{1}{b_{n}}\approx n^{l}. (25)

On the other hand, the cost of sorting the projected data of each cell depends on the number of data points to be ordered. In particular, if a cell has $m$ points, then the sorting cost is $\mathcal{O}(m\log(m))$ [49]. As we reach deeper levels in the tree, the number of points per cell is reduced. If we consider a level of depth $k$ (where $k=1$ corresponds to the root and $k=\lceil\log(|\pi_{T^{full}_{b_{n}}}|)\rceil\approx\lceil\log(n^{l})\rceil$ corresponds to the full tree), there are $2^{k-1}$ cells, and each cell contains roughly $n/2^{k-1}$ points. Therefore, the order of the total number of operations during the growing phase can be approximated by

\mathcal{O}\left(\sum_{k=1}^{\lceil\log(n^{l})\rceil}2^{k-1}\cdot\frac{n}{2^{k-1}}\log\left(\frac{n}{2^{k-1}}\right)\right)=\mathcal{O}\left(n\log^{2}(n)\right).
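For completeness, with $K=\lceil\log(n^{l})\rceil=\mathcal{O}(\log n)$ the sum above evaluates as

\sum_{k=1}^{K}n\left[\log(n)-(k-1)\log(2)\right]=nK\log(n)-n\frac{K(K-1)}{2}\log(2)=\mathcal{O}(n\log^{2}(n)),

since each of the $K$ levels contributes at most $n\log(n)$ operations.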

Pruning Phase: The main regularized info-max problem in Eq. (24) is solved in this stage. The first term, $D_{\sigma({\pi_{T}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})$, is additive (in the sense presented in [45]), whereas the penalty $r_{b_{n},\delta_{n}}(|T|)$ scales like $\mathcal{O}(\sqrt{|T|})$ and is, therefore, a sub-additive function of the size of the tree. In this scenario, Scott [45, Th. 1] showed that the family of minimum-cost trees $\{R_{i}\}_{i=1}^{m}$:

R_{i}\equiv\arg\max_{\substack{T\ll T^{full}_{b_{n}}\\ |T|=m-i+1}}D_{\sigma({\pi_{T}})}(\hat{P}_{n}||\hat{Q}_{n}^{*}),\quad i\in\{1,\ldots,m\}, (26)

is embedded, i.e., $R_{1}=T^{full}_{b_{n}}\gg R_{2}\gg\ldots\gg R_{m}=\{A_{root}\}$. Furthermore, [45, Th. 2] states that the solution of our problem in (24) corresponds to one of these embedded trees: i.e., for any $\alpha>0$, $\exists i\in\{1,\ldots,m\}$ such that $\hat{T}_{b_{n},\delta_{n}}(\alpha)=R_{i}$. This embedding property allows computationally efficient algorithms to be designed. Indeed, Scott [45] presented two algorithms to solve (24) with a worst-case cost of $\mathcal{O}(|T^{full}_{b_{n}}|^{2})$. In our case, $|T^{full}_{b_{n}}|\leq 1/b_{n}$; hence the pruning stage in (24) has a computational cost of $\mathcal{O}(n^{2/3})$, which is a polynomial (sub-linear) function of $n$. The pseudo-code of this stage is presented in Algorithm 1.

In summary, the construction of our partition (growing and pruning) has a computational cost that is $\mathcal{O}(n\log^{2}(n))$ in the sample size $n$.

Input: full tree $T^{full}_{b_{n}}$, minimum cell probability $b_{n}$, confidence probability $\delta_{n}$, regularization parameter $\alpha$
Output: pruned tree $\hat{T}_{b_{n},\delta_{n}}$ in Eq. (24)
Initialization:
  $\hat{T}_{b_{n}}^{1}=\{root\}$;
Family of embedded pruned trees:
for $k=2$ to $|T^{full}_{b_{n}}|$ do
  Maximum empirical information tree of index $k$:
  $v^{*}=\arg\max_{v\in\mathcal{L}(\hat{T}_{b_{n}}^{k-1})}D_{\sigma(\pi_{\hat{T}_{b_{n}}^{k-1}\cup\{l(v),r(v)\}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})$;
  $\hat{T}_{b_{n}}^{k}=\hat{T}_{b_{n}}^{k-1}\cup\{l(v^{*}),r(v^{*})\}$;
end for
Regularized empirical information maximization:
  $\hat{T}_{b_{n},\delta_{n}}=\arg\max_{T\in\{\hat{T}_{b_{n}}^{1},\ldots,\hat{T}_{b_{n}}^{|T_{b_{n}}^{full}|}\}}D_{\sigma(\pi_{T})}(\hat{P}_{n}||\hat{Q}_{n}^{*})-\alpha\cdot r_{b_{n},\delta_{n}}(|T|)$, with $r_{b_{n},\delta_{n}}(|\hat{T}_{b_{n}}^{1}|)=0$;
Algorithm 1: TSP family pruning
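For concreteness, a compact Python rendition of Algorithm 1 could look as follows (our sketch, not the authors' code). It assumes each node of the full tree is a dict carrying the empirical cell probabilities $\hat{P}_{n}(A_{v})$ and $\hat{Q}^{*}_{n}(A_{v})$ (keys `p_hat` and `q_hat`, computed as in Eqs. (3)-(4)) and its pair of children (or `None` for a leaf), and that `penalty(k)` implements $r_{b_{n},\delta_{n}}(k)$ for a tree with $k$ nodes:

```python
import math

def cell_term(node):
    """Contribution P_hat(A) * log(P_hat(A) / Q_hat*(A)) of one cell to
    the empirical divergence D_{sigma(pi_T)}(P_hat || Q_hat*)."""
    p, q = node["p_hat"], node["q_hat"]
    return p * math.log(p / q) if p > 0 else 0.0

def prune_tsp(root, penalty, alpha):
    """Sketch of Algorithm 1: greedily build the embedded family of trees
    and return the leaves of the maximizer of Eq. (24)."""
    leaves = [root]
    divergence = cell_term(root)
    best_leaves, best_score = list(leaves), divergence  # r(1) = 0 by convention
    while any(v["children"] for v in leaves):
        # Expand the splittable leaf with the largest empirical-divergence gain.
        v = max((u for u in leaves if u["children"]),
                key=lambda u: sum(cell_term(c) for c in u["children"]) - cell_term(u))
        leaves.remove(v)
        leaves.extend(v["children"])
        divergence += sum(cell_term(c) for c in v["children"]) - cell_term(v)
        size = 2 * len(leaves) - 1  # node count of a binary tree with these leaves
        score = divergence - alpha * penalty(size)
        if score > best_score:
            best_leaves, best_score = list(leaves), score
    return best_leaves  # cells {A_v : v in L(T_hat)} of the selected partition
```

Because the empirical divergence is additive over the cells, each greedy expansion only updates the contribution of the split leaf, which is what makes the embedded family cheap to enumerate.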
Figure 2: Empirical estimation of the probability mass functions of $\mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1})$ in Eq. (27) and $\mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1})$ in Eq. (28). Four scenarios of the regularization parameter $\alpha$ are considered to illustrate its effect on the distributions of $\mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1})$ and $\mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1})$. For each case, the values of $M_{0}(\epsilon)$ and $M_{1}^{\sigma}(\epsilon)$ are also illustrated (vertical lines), considering $\epsilon=0.05$ and $\sigma=0.3$.

VIII Empirical Analysis

In this section, we present some controlled synthetic scenarios to evaluate the performance of our method. We begin by analyzing how the selection of parameters affects the capacity of our scheme to estimate MI (given the space limit, this preliminary analysis on MI estimation is relegated to the Supplemental Material, Appendix G-A). From these results and insights, we analyze the performance of the solutions \left\{\hat{T}_{b_{n},\delta_{n}}(\alpha),\alpha\geq 0\right\} for testing independence. For the rest of this section, \pi_{b_{n},\delta_{n}}^{\alpha}(\cdot) denotes the TS partition, \hat{i}_{b_{n},\delta_{n}}^{\alpha}(\cdot) denotes the MI estimator, and \phi_{b_{n},\delta_{n},a_{n}}^{\alpha}(\cdot) denotes the final test.

VIII-A Testing Independence: Non-Asymptotic Empirical Analysis

We analyze the problem of testing independence with our data-driven test \phi_{b_{n},\delta_{n},a_{n}}^{\alpha}(\cdot) in (7), focusing on the non-asymptotic capacity of our framework to detect the two hypotheses. For this, we propose the following detection times (to simplify the notation, the dependency of \mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1}) and \mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1}) on (b_{m}), (\delta_{m}), (a_{m}) and the scalar parameters w, \alpha introduced in Sec. VII-A is left implicit):

𝒯~0((Zn)n1)sup{m1:ϕbm,δm,amα(Z1,..,Zm)=1},\displaystyle\mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1})\equiv\sup{\left\{m\geq 1:\phi^{\alpha}_{b_{m},\delta_{m},a_{m}}(Z_{1},..,Z_{m})=1\right\}}, (27)
𝒯~1((Zn)n1)sup{m1:ϕbm,δm,amα(Z1,..,Zm)=0}.\displaystyle\mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1})\equiv\sup{\left\{m\geq 1:\phi^{\alpha}_{b_{m},\delta_{m},a_{m}}(Z_{1},..,Z_{m})=0\right\}}. (28)

\mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1}) and \mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1}) are random variables (rvs.) in \mathbb{N}^{*}=\mathbb{N}\cup\left\{\infty\right\} marking the last time our test outputs 1 and 0, respectively; beyond these times, the test settles at 0 and 1. Indeed, Theorem 3 (under specific conditions) tells us that under \mathcal{H}_{0}, \mathbb{P}(\mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1})<\infty)=1, and that under \mathcal{H}_{1}, \mathbb{P}(\mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1})<\infty)=1. However, we are interested in the complete distribution of \mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1}) under \mathcal{H}_{0} and of \mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1}) under \mathcal{H}_{1}; in particular, in evaluating the pmf of \mathcal{\tilde{T}}_{i}(\cdot), i.e., (\mathbb{P}(\mathcal{\tilde{T}}_{i}((Z_{n})_{n\geq 1})=k))_{k\geq 1} under \mathcal{H}_{i} (with i\in\left\{0,1\right\}). Looking at these distributions, for any \epsilon>0 we can define

M0(ϵ)\displaystyle M_{0}(\epsilon) min{m1,(𝒯~0((Zn)n1)m)1ϵ}\displaystyle\equiv\min\left\{m\geq 1,\ \mathbb{P}(\mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1})\leq m)\geq 1-\epsilon\right\} (29)
M1(ϵ)\displaystyle M_{1}(\epsilon) min{m1,(𝒯~1((Zn)n1)m)1ϵ}\displaystyle\equiv\min\left\{m\geq 1,\ \mathbb{P}(\mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1})\leq m)\geq 1-\epsilon\right\} (30)

M_{0}(\epsilon) and M_{1}(\epsilon) indicate how many observations (the sampling complexity) are needed to detect independence and statistical dependency, respectively, with confidence probability 1-\epsilon (under \mathcal{H}_{0} and \mathcal{H}_{1}, respectively). In the context of our solution, for a given TSP scheme (a function of the parameters l, w, \alpha and the sequence (a_{n})), we examine the trade-off expressed by the pair (M_{0}(\epsilon),M_{1}(\epsilon)) of the induced tests as key parameters of our method vary.
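In practice, M_{0}(\epsilon) and M_{1}(\epsilon) can be estimated as empirical (1-\epsilon)-quantiles of simulated detection times. A minimal sketch of this estimator follows (NumPy is assumed; the geometric sample is a hypothetical stand-in for realizations of \mathcal{\tilde{T}}_{0} or \mathcal{\tilde{T}}_{1}):

import numpy as np

def sampling_complexity(detection_times, eps=0.05):
    """Smallest m such that the empirical P(T <= m) >= 1 - eps, i.e., the
    empirical (1 - eps)-quantile of the detection-time sample (Eqs. (29)-(30))."""
    t = np.sort(np.asarray(detection_times))
    k = int(np.ceil((1.0 - eps) * t.size))       # order statistic to read off
    return int(t[k - 1])

rng = np.random.default_rng(0)
fake_times = rng.geometric(p=0.01, size=1000)    # placeholder detection times
print(sampling_complexity(fake_times))           # empirical M(0.05)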

Figure 3: Illustration of the trade-off between M_{0}(\epsilon) and M^{\sigma}_{1}(\epsilon) obtained for our TSP test, the L_{1} test, the log-likelihood test, and the Pearson-\chi^{2} test. The TSP test parameter configuration is l=0.001 and w=0.1, with \alpha\in[0.00001,0.0005]. The L_{1} test parameter configuration is m(n)=n^{0.2}, with C\in[0.7,1.46]. The log-likelihood test parameter configuration is m(n)=n^{0.2}, with C\in[0.05,0.3]. The Pearson-\chi^{2} test parameter configuration is m(n)=n^{0.25}, with C\in[0.005,0.185]. Five scenarios of correlation are presented for \mathcal{H}_{1}, considering \sigma=\mathbb{E}(XY)\in\left\{0.1,0.3,0.5,0.7,0.9\right\}.

VIII-A1 Simulation Setting

We consider a joint vector Z=(X,Y) following a zero-mean Gaussian distribution in \mathbb{R}^{2}, where the correlation coefficient determining I(X,Y) is parametrized by \sigma=\mathbb{E}(XY). Concerning the alternative hypothesis (\mathcal{H}_{1}), we consider different levels of MI indexed by \sigma=\mathbb{E}(XY)\in\left\{0.1,0.3,0.5,0.7,0.9\right\}. We run our test on 1,000 i.i.d. realizations of the process under each scenario (\mathcal{H}_{0} and \mathcal{H}^{\sigma}_{1}), thereby obtaining 1,000 realizations of the rvs. \mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1}) and \mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1}), from which we compute empirical estimates of their pmfs and of M_{0}(\epsilon) and M^{\sigma}_{1}(\epsilon) (indexed by the value \sigma), respectively. (By the law of large numbers, these estimators are strongly consistent; 1,000 samples proved sufficient for the purpose of the analysis.)
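A minimal sketch of a sampler consistent with this setting (assuming unit marginal variances, so that \sigma=\mathbb{E}(XY) is the correlation coefficient; NumPy is assumed):

import numpy as np

def sample_gaussian_pairs(n, sigma, rng):
    """n i.i.d. draws of a zero-mean Gaussian pair with E(X^2)=E(Y^2)=1 and
    E(XY)=sigma; sigma=0 gives H_0 and sigma>0 gives H_1^sigma."""
    x = rng.standard_normal(n)
    y = sigma * x + np.sqrt(1.0 - sigma ** 2) * rng.standard_normal(n)
    return x, y

rng = np.random.default_rng(1)
x, y = sample_gaussian_pairs(10_000, sigma=0.7, rng=rng)
print(np.corrcoef(x, y)[0, 1])                   # approximately 0.7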

VIII-A2 Parameter Selection for ϕbn,δn,anα()\phi_{b_{n},\delta_{n},a_{n}}^{\alpha}(\cdot)

To evaluate the sensitivity of our TSP scheme, we fix the sequences (a_{n})=(0.5\cdot n^{-1}) and (\delta_{n})=(e^{-n^{1/3}}) in the admissible regime established in Theorem 3: (a_{n})_{n\geq 1} needs to be o(1) and (\delta_{n}) needs to be \ell_{1}(\mathbb{N}); other configurations for (a_{n}) and (\delta_{n}) can be explored within this admissible range. Considering (b_{n})=(w\cdot n^{-l}) (parametrized by w and l) and \alpha (used to solve \hat{T}_{b_{n},\delta_{n}}(\alpha) in Eq. (24)), we use a preliminary analysis on MI estimation to select a range of reasonable values (Supplemental Material, Appendix G-A). In particular, we consider l\in\left\{0.001,0.125,0.167,0.25,0.3\right\} and w\in\left\{0.005,0.01,0.05,0.1,0.2\right\}. We proceed as follows: given fixed parameters l and w, we explore values of \alpha in [0,4\cdot 10^{-4}) to express the trade-off between M_{0}(\epsilon) and M^{\sigma}_{1}(\epsilon) under different data scenarios: \sigma\in\left\{0.3,0.5,0.7\right\}.
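In code, these schedules read as follows (a direct transcription of the sequences above; NumPy is assumed):

import numpy as np

def a(n): return 0.5 / n                    # (a_n): detection threshold, o(1)
def delta(n): return np.exp(-n ** (1 / 3))  # (delta_n): summable confidence sequence
def b(n, w, l): return w * n ** (-l)        # (b_n): minimum cell probability

print(a(1000), delta(1000), b(1000, w=0.1, l=0.001))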

VIII-A3 Results

Figure 1 shows the curves expressing the trade-off between M_{0}(\epsilon) and M^{\sigma}_{1}(\epsilon) for different parameter configurations of our method, in the range for l and w mentioned above and for \epsilon=0.05. Both parameters have a relevant effect on this trade-off: w and l determine the growing phase (i.e., the creation of the full tree) and also the pruning phase (the regularization in Eq. (24)), because r_{b_{n},\delta_{n}}(\left|T\right|) is also a function of b_{n}. Therefore, the effects of (b_{n})_{n\geq 1} on the results are difficult to characterize theoretically.

Describing the curves, a larger value of w within the explored range reduces the full tree's size in the growing phase. This is expressed in a better trade-off of the pair (M_{0}(\epsilon), M^{\sigma}_{1}(\epsilon)) in the regime where M_{0}(\epsilon)\leq 10^{2}, at the expense of a worse trade-off when M_{0}(\epsilon)\geq 10^{3}. Decreasing the value of l within the explored range produces a similar general effect. Overall, our family of solutions offers a collection of different trade-offs between M_{0}(\epsilon) and M^{\sigma}_{1}(\epsilon) obtained by exploring different values of \alpha\in[0,4\cdot 10^{-4}); consequently, the selection of the best parameters should be a function of the sample-size regime we want to consider for the detection of \mathcal{H}_{0}. For the final comparison with alternative approaches, we selected one of the curves with a less prominent decreasing transition in Figure 1 (obtained for w=0.1 and l=0.001).

Figure 4: Number of cells generated for the full tree of the TSP under l=0.001, as a function of the dimensionality d of the joint space of the random variables, for n\in\{10^{3},10^{4},10^{5}\} samples and two heuristic rules for the parameter w. Left: w=0.1^{d/2}. Right: w=0.225^{d/2}. The diagonal line indicates the minimum number of cells that allows each dimension of the joint space to be partitioned at least once.

Figure 2 illustrates the estimated (empirical) pmfs of \mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1}) and \mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1}) under \sigma=0 (\mathcal{H}_{0}) and \sigma=0.3 (\mathcal{H}_{1}), respectively. These histograms illustrate how the distributions are affected by the regularization parameter \alpha and, with that, the trade-off in M_{0}(\epsilon) and M^{\sigma}_{1}(\epsilon) observed in Figure 1. Specifically, a small value of \alpha concentrates the distribution of \mathcal{\tilde{T}}_{0}((Z_{n})_{n\geq 1}) on a large number of observations, while that of \mathcal{\tilde{T}}_{1}((Z_{n})_{n\geq 1}) concentrates on a smaller number of observations. This leads to a high value of M_{0}(\epsilon) and a low value of M_{1}^{\sigma}(\epsilon) for that specific \alpha. The opposite holds for larger values of \alpha.

Finally, we compared our TSP scheme, in terms of the trade-off in (M_{0}(\epsilon),M^{\sigma}_{1}(\epsilon)) with \epsilon=0.05 and \sigma\in\left\{0.1,0.3,0.5,0.7,0.9\right\} (for \mathcal{H}_{1}), against three strategies presented in the literature [20]: the L_{1} test, the log-likelihood test, and the Pearson-\chi^{2} test. Each trade-off curve is generated by scaling the corresponding independence-detection threshold by a well-selected range of constants C\in\mathbb{R} [20]. For the parameters of the alternative tests, we consider regimes where these tests are strongly consistent: the number of partitions per dimension is m(n)=n^{p} with p\in(0,0.5), which satisfies the strong-consistency conditions of the L_{1} and the log-likelihood tests established in [20]; although there is no explicit proof of the strong consistency of the Pearson-\chi^{2} test, we use a regime of values p\in(0,1/3) according to the range suggested in [20]. Within that regime, we selected parameters that offer a smooth transition between their sampling-complexity trade-off pairs, as we did to select the parameters of our method (more information is presented in the Supplemental Material, Appendix G-B).

Figure 3 presents these curves for our method and the alternative approaches. These curves express the trade-off between the ability to detect independence and the ability to detect a non-product probabilistic structure (\mathcal{H}_{0} and \mathcal{H}_{1}) from data. Our method shows better results in almost all explored scenarios (\sigma\in\left\{0.1,0.3,0.5,0.7,0.9\right\}): the TSP trade-off curves dominate the others. The one exception to this general trend arises in the regime M_{0}(\epsilon)<10^{2} for low values of correlation under \mathcal{H}_{1}. In all the other regimes, our method performs better than the alternatives; in particular, it outperforms its closest relative, the log-likelihood test with non-adaptive partitions [20]. Interestingly, our method's performance improvements increase with the magnitude of \sigma (for \mathcal{H}_{1}), which shows that our approach's advantage is more prominent when the alternative scenario has a higher level of mutual information.

Figure 5: Illustration of the trade-off between M_{0}(\epsilon) and M^{\sigma}_{1}(\epsilon) obtained for the TSP test using samples from a multidimensional Gaussian distribution with component-wise correlation cov(X_{i},Y_{j})=\delta_{ij}\sigma(d), where \sigma(d) is selected to preserve the mutual information between the variables across the dimensions p=q\in\{2,3,4,5,6\} of each random variable, under l=0.001 and w=0.225^{d/2} with d=p+q. Three scenarios of mutual information are explored, I(X;Y)\in\{0.06803,0.20752,0.48572\}, corresponding to the theoretical mutual information obtained in the univariate Gaussian case p=q=1 with \mathbf{E}(XY)=\sigma, \sigma\in\{0.3,0.5,0.7\}, respectively.

VIII-B Multidimensional Analysis

We conclude this analysis by evaluating our method in higher dimensions. As in the previous analysis (where d=2), we need to select the parameters w and l of our scheme. In this multi-dimensional context, we select a value of l that offers a smooth trade-off curve according to the experimental results in Figure 1. However, since w has a stronger impact on the resulting number of cells, rather than choosing it individually for every dimension d, we propose w=C^{p\cdot d}, with C>0 and p>0, as a heuristic rule for selecting w according to d.

This rule and its parameters were designed on the principle that every dimension of the joint space \mathbb{R}^{d} should be partitioned at least once in the growing phase. This criterion ensures that our data-driven scheme explores every coordinate of \mathbb{R}^{d} in the pursuit of detecting relevant statistical dependencies under \mathcal{H}_{1}. For a dimension d>0, the full tree then needs at least 2^{d} leaves (cells). A basic condition for this is that b_{n} is lower bounded by 1/n (our stopping criterion requires at least one sample per cell), which implies that at least 2^{d} i.i.d. samples are needed to explore all d dimensions. Assuming the non-trivial regime n\geq 2^{d}, we then selected C and p in our rule so that \left|T^{full}_{b_{n}}\right|\geq 2^{d} is met independently of the value of d (as long as n\geq 2^{d}). This requires w to become smaller as d increases, which our rule satisfies for C<1 and p>0.

Figure 4 shows the number of cells of the full tree, namely \left|T^{full}_{b_{n}}\right|, for two settings: w=0.1^{d/2} (C=0.1 and p=1/2) in the left panel, and w=0.225^{d/2} (C=0.225 and p=1/2) in the right panel. The curves are shown across d for three sample-size scenarios, n=10^{3}, n=10^{4} and n=10^{5}. We also include the curve 2^{d} (dashed line) to indicate the critical lower bound. Both settings satisfy the proposed criterion of partitioning each dimension at least once (points above the diagonal) in the whole regime where n\geq 2^{d}. The horizontal line marks the points where the number of cells generated by our method equals the number of samples (which happens when b_{n}=1/n); interestingly, as d approaches the critical condition 2^{d}=n, our full tree reaches a scenario where every cell contains one sample. Of the two settings, the selection in the right panel is more conservative, and it is the one we use for the rest of this analysis: it partitions each dimension just above the minimum requirement (for the values of d where n\geq 2^{d}) and reaches the critical condition n=2^{d} as d increases.
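This criterion can be checked numerically using the cell-count bound \left|T^{full}_{b_{n}}\right|\leq\min(n,1/b_{n}) (at least one sample per cell) as a proxy for the data-dependent count plotted in Figure 4. A sketch for the conservative rule w=0.225^{d/2} follows (pure Python; the bound, not the actual tree, is evaluated):

def max_cells(n, d, C=0.225, p=0.5, l=0.001):
    """Upper bound min(n, 1/b_n) on the number of leaves of the full tree
    under the rule w = C**(p*d) and b_n = w * n**(-l)."""
    w = C ** (p * d)
    return min(n, n ** l / w)

n = 10_000
for d in range(2, 14, 2):
    print(d, round(max_cells(n, d)), 2 ** d, max_cells(n, d) >= 2 ** d)
# every row satisfies max_cells >= 2**d while n >= 2**d, as in Figure 4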

Using this rule w=0.225^{d/2} and l=0.001, we consider a zero-mean joint Gaussian vector ({X},{Y}) in \mathbb{R}^{d}, where X=(X_{1},..,X_{p}) and {Y}=(Y_{1},..,Y_{q}) and, under \mathcal{H}_{1}, a diagonal cross-covariance of the form cov(X_{i},Y_{j})=\delta_{ij}\sigma(d) (with \delta_{ij} the Kronecker delta). For these results, we consider n=10^{4} i.i.d. samples, meeting the condition n\geq 2^{d} for d\in\left\{4,6,8,10,12\right\}. For the alternative, we consider \sigma\in\{0.3,0.5,0.7\} for d=2 (p=q=1), and we choose the values of \sigma(d) for d\geq 4 such that the mutual information between X and Y equals that obtained for d=2 with \sigma\in\{0.3,0.5,0.7\}. Therefore, for the alternative, we create three scenarios of constant MI across dimensions, I(X;Y)\in\{0.06803,0.20752,0.48572\}. This experimental design allows us to isolate the effect of dimensionality from the intrinsic discrimination of the underlying task, which is known to be proportional to the value of MI under \mathcal{H}_{1} [32, 35, 36].
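For reference, these constants follow from the closed-form MI of a Gaussian pair with correlation \sigma (a standard identity, here in bits) and, assuming unit component variances so that the p pairs (X_{i},Y_{i}) are mutually independent under the stated cross-covariance, from the additivity of MI across pairs:

I(X;Y)=-\tfrac{1}{2}\log_{2}(1-\sigma^{2}),\quad \sigma\in\{0.3,0.5,0.7\}\;\Rightarrow\; I(X;Y)\in\{0.06803,\,0.20752,\,0.48572\};

for p pairs, I(X;Y)=-\tfrac{p}{2}\log_{2}(1-\sigma(d)^{2}), so matching the d=2 values requires \sigma(d)=\sqrt{1-(1-\sigma^{2})^{1/p}}.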

Figure 5 presents the trade-off between (M_{0}(\epsilon),M^{\sigma}_{1}(\epsilon)) for the selected values of \sigma(d) (under \mathcal{H}_{1}) and for different dimensions d\in\left\{4,6,8,10,12\right\}. First, we confirm that our scheme detects both hypotheses with a finite sample size, as our theoretical results predict. We also observe that increasing the dimension of the problem makes the curves (M_{0}(\epsilon),M^{\sigma}_{1}(\epsilon)) show sharper transitions (beyond a critical value of M_{0}(\epsilon)) and makes the problems, in general, more difficult: for a given value of M_{0}(\epsilon), the respective M^{\sigma}_{1}(\epsilon) increases with the dimension d=p+q. In other words, for a relatively good capacity to detect independence (when M_{0}(\epsilon)\leq 10^{3}), detecting the alternative becomes more challenging for our test as d increases. This trend could be attributed to the fact that in higher dimensions it is harder for a non-parametric framework to detect salient features under the alternative \mathcal{H}_{1}; consequently, more observations (evidence) are required to correctly detect the alternative with probability 1-\epsilon. On the other hand, if we fix a dimension, say d=p+q=8, the higher the MI the better the trade-off between (M_{0}(\epsilon),M^{\sigma}_{1}(\epsilon)). This trend is consistent with our observation that MI is an indicator of the task's discrimination and, consequently, the performance of our scheme is sensitive to the level of MI under the alternative.

IX Summary and Final Discussion

We present a novel framework to address the problem of universal testing of independence using data-driven representations. Consistency and finite-length (non-asymptotic) results express our solution's capacity to adapt its representations to the problem's sufficient statistics. This capacity is particularly evident under the hypothesis of independence: precise results show that our scheme detects this scenario (collapsing to the trivial partition with one cell representing the joint space) with a finite sample size. On the practical side, our solution offers a computationally efficient implementation.

In a controlled experimental setting, we show that our scheme offers a clear advantage compared with alternative non-parametric solutions that do not adapt their representations to the data. These results confirm the critical role that data-driven partitions could play in implementing a universal test of independence. Further analysis is needed to fully uncover this method’s potential for machine learning (ML) problems and other applications. We anticipate that ML algorithms could benefit from both the representations obtained from our solutions (as an approximator of sufficient statistics) and our solution’s capacity to detect independence with a finite number of samples.

IX-A Limitations and Future Work

Our distribution-free results in Section VI provide a range of admissible parameters for our test; however, they do not provide a criterion for selecting them in a specific problem setting. To address this practical aspect, we selected a set of parameters from empirical observations and some basic heuristic rules. It remains an open research avenue to study conditions under which an optimal (or nearly optimal) selection of parameters can be made for our test. Along these lines, it is relevant to investigate specific classes of models and, based on such model assumptions, to find a more constrained range of good parameters with improved finite-length performance. In this vein, interesting ideas and results about adaptivity explored in statistical learning [50] and universal source coding [51, 52] are worth pursuing.

IX-B A Recent Related Work

The authors in [34] have recently introduced a non-parametric test with a similar oracle-based design principle. They proposed a data-adaptive weight strategy to approximate the ideal weights of an NP statistic. That strategy has an interesting connection with our design approach in Section IV, because they adapt key parameters of their method to approximate a best (oracle) statistic. Regarding the differences, the authors in [34] addressed the independence test by performing a binary expansion filtration of the observations and adapting the weight matrix of a quadratic form; instead, our work addresses the independence test by estimating the mutual information (our statistic) through an adaptive tree-structured partition of the observation space. Therefore, although both approaches share a similar learning principle, the methods and components used to adapt them to the sample are entirely different.

IX-C Statistical Significance Analysis

We conclude with a discussion of the statistical significance of our test. Given our test \phi_{b_{n},\delta_{n},a_{n}}(\cdot), and assuming \mathcal{H}_{0}, the statistical significance is expressed by \alpha(b_{n},\delta_{n},a_{n})\equiv\mathbb{P}(\left\{\phi_{b_{n},\delta_{n},a_{n}}(Z_{1},..,Z_{n})=1\right\}). Importantly, from the proofs of Theorem 2 and Theorem 3, we have the following:

PROPOSITION 2

Under the assumptions of Theorem 2 on (b_{n}) and (\delta_{n}), for any P\in\mathcal{P}_{0} (\mathcal{H}_{0}) and any a_{n}>0, it follows that \alpha(b_{n},\delta_{n},a_{n})\leq\delta_{n}, \forall n\geq 1 (the proof is presented in the Supplemental Material, Appendix G-D.1).

This result implies that the statistical significance of our test is fully controlled by the sequence (δn)(\delta_{n}), which is one of our design parameters. Furthermore, the proof of this result shows that this bound is uniform over the class of models in 𝒫0\mathcal{P}_{0}. Direct corollaries of Proposition 2 are the following:

  • adding the assumptions of Theorem 3 and under 0\mathcal{H}_{0}, it follows that limnα(bn,δn,an)=0\lim_{n\rightarrow\infty}\alpha(b_{n},\delta_{n},a_{n})=0. Importantly, this convergence to zero is uniform for every model in 𝒫0\mathcal{P}_{0}. [Uniform vanishing condition on (α(bn,δn,an))n1(\alpha(b_{n},\delta_{n},a_{n}))_{n\geq 1}]

  • adding the assumptions of Theorem 4 and under \mathcal{H}_{0}, it follows that for any n\geq 1, \alpha(b_{n},\delta_{n},a_{n})\leq\delta_{n}\approx e^{-n^{1/3}}. Therefore, we achieve an exponentially fast vanishing error under \mathcal{H}_{0}. Importantly, this rate is obtained uniformly (distribution-free) over P\in\mathcal{P}_{0}.

These uniform bounds on the statistical significance are obtained under \mathcal{H}_{0}. Under \mathcal{H}_{1}, we have from strong consistency (Definition 1) that \mathbb{P}(\left\{\phi_{b_{n},\delta_{n},a_{n}}(Z_{1},..,Z_{n})=1\right\}) (the power of the test) tends to 1 as n tends to infinity (this argument is presented in the Supplemental Material, Appendix G-D.2).

X Acknowledgment

This manuscript is based on work supported by grants from CONICYT-Chile, Fondecyt 1210315, and the Advanced Center for Electrical and Electronic Engineering, Basal Project FB0008. We thank the anonymous reviewers for valuable comments and suggestions that helped improve the technical quality and organization of this work. Finally, we thank Diane Greenstein for editing and proofreading all this material.

Appendix A Proof of Theorem 1

Proof:

First, saying that (\phi_{n}(Z^{n}_{1}))_{n\geq 1} reaches 0 eventually with probability one is equivalent to saying that the process (\phi_{n}(Z^{n}_{1}))_{n\geq 1} does not visit the event \left\{1\right\} infinitely often (i.o.). This observation reduces the claim to verifying the following:

(limsupn{z1ndn:ϕn(z1n)=1})\displaystyle\mathbb{P}\left(\lim\sup_{n}\left\{z^{n}_{1}\in\mathbb{R}^{dn}:\phi_{n}(z^{n}_{1})=1\right\}\right)\equiv
(m1nm{z1ndn:ϕn(z1n)=1})=0.\displaystyle\mathbb{P}\left(\bigcap_{m\geq 1}\bigcup_{n\geq m}\left\{z^{n}_{1}\in\mathbb{R}^{dn}:\phi_{n}(z^{n}_{1})=1\right\}\right)=0. (31)

Equivalently, saying that (ϕn(Z1n))n1(\phi_{n}(Z^{n}_{1}))_{n\geq 1} reaches 11 eventually with probability one reduces to

(limsupn{z1ndn:ϕn(z1n)=0})\displaystyle\mathbb{P}\left(\lim\sup_{n}\left\{z^{n}_{1}\in\mathbb{R}^{dn}:\phi_{n}(z^{n}_{1})=0\right\}\right)\equiv
(m1nm{z1ndn:ϕn(z1n)=0})=0.\displaystyle\mathbb{P}\left(\bigcap_{m\geq 1}\bigcup_{n\geq m}\left\{z^{n}_{1}\in\mathbb{R}^{dn}:\phi_{n}(z^{n}_{1})=0\right\}\right)=0. (32)

For the rest of the proof, we use O^{n}_{1},..,O^{n}_{n} to denote O_{\pi_{n}}(Z_{1}),\ldots,O_{\pi_{n}}(Z_{n}) and o^{n}_{1},..,o^{n}_{n} to denote O_{\pi_{n}}(z_{1}),\ldots,O_{\pi_{n}}(z_{n}).

Under \mathcal{H}_{0}, we have from the third hypothesis that \hat{i}_{\pi_{n}}(O^{n}_{1},\ldots,O^{n}_{n})/a_{n} tends to 0 with probability one, which means that for any \epsilon>0

(m1nmBϵ,nπn,an)=0,\mathbb{P}\left(\bigcap_{m\geq 1}\bigcup_{n\geq m}B^{\pi_{n},a_{n}}_{\epsilon,n}\right)=0, (33)

where B_{\epsilon,n}^{\pi_{n},a_{n}}\equiv\left\{z^{n}_{1}\in\mathbb{R}^{nd}:\hat{i}_{\pi_{n}}(o^{n}_{1},\ldots,o^{n}_{n})>\epsilon\cdot a_{n}\right\}. Under \mathcal{H}_{0}, let us define the error event of \phi_{\pi_{n},a_{n}}(\cdot) by

Enπn,an{z1nnd:ϕπn,an(o1n,..,onn)=1}.E_{n}^{\pi_{n},a_{n}}\equiv\left\{z^{n}_{1}\in\mathbb{R}^{nd}:\phi_{\pi_{n},a_{n}}(o^{n}_{1},..,o^{n}_{n})=1\right\}.

If z^{n}\in E_{n}^{\pi_{n},a_{n}} (by definition of the rule \phi_{\pi_{n},a_{n}}(\cdot)), it follows that \hat{i}_{{\pi_{n}}}(o^{n}_{1},\ldots,o^{n}_{n})\geq a_{n} (see (7)). Then, from the definition of B_{\epsilon,n}^{\pi_{n},a_{n}}, for any \epsilon<1 it follows that E_{n}^{\pi_{n},a_{n}}\subset B_{\epsilon,n}^{\pi_{n},a_{n}}. This implies that \bigcap_{m\geq 1}\bigcup_{n\geq m}E_{n}^{\pi_{n},a_{n}}\subset\bigcap_{m\geq 1}\bigcup_{n\geq m}B_{\epsilon,n}^{\pi_{n},a_{n}}, which from (33) implies that

(m1nmEnπn,an)=0.\mathbb{P}\left(\bigcap_{m\geq 1}\bigcup_{n\geq m}E_{n}^{\pi_{n},a_{n}}\right)=0. (34)

On the other hand, under the assumption that I(X,Y)>0 (\mathcal{H}_{1}), let us choose \epsilon_{0}\in(0,I(X,Y)). Then, from the second assumption (see Definition 2), it follows that \hat{i}_{\pi_{n}}(O^{n}_{1},\ldots,O^{n}_{n}) converges to a limit strictly greater than \epsilon_{0} with probability one. This formally means that

\mathbb{P}\left(\bigcap_{m\geq 1}\bigcup_{n\geq m}\left\{z^{n}_{1}\in\mathbb{R}^{nd}:\hat{i}_{\pi_{n}}(o^{n}_{1},\ldots,o^{n}_{n})\in(0,\epsilon_{0})\right\}\right)=0. (35)

The error event of \phi_{\pi_{n},a_{n}}(\cdot) under \mathcal{H}_{1} in this case is

\tilde{E}_{n}^{\pi_{n},a_{n}}\equiv\left\{z^{n}_{1}\in\mathbb{R}^{nd}:\hat{i}_{\pi_{n}}(o^{n}_{1},\ldots,o^{n}_{n})<a_{n}\right\}. (36)

Finally using the condition that (an)n1(a_{n})_{n\geq 1} tends to zero with nn ((an)n(a_{n})_{n} is o(1)o(1)), we have that

\tilde{E}_{n}^{\pi_{n},a_{n}}\subset\left\{z^{n}_{1}\in\mathbb{R}^{nd}:\hat{i}_{\pi_{n}}(o^{n}_{1},\ldots,o^{n}_{n})\in(0,\epsilon_{0})\right\} (37)

eventually in nn, which implies from (35) that

(m1nmE~nπn,an)=0.\mathbb{P}\left(\bigcap_{m\geq 1}\bigcup_{n\geq m}\tilde{E}_{n}^{\pi_{n},a_{n}}\right)=0. (38)

Appendix B Proof of Lemma 2

We use the following results from [39].

LEMMA 4

[39, Th. 3] Under the assumptions of Lemma 2, \lim_{n\rightarrow\infty}\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{1},\ldots,O_{n})=I(X,Y) with probability one.

LEMMA 5

[39, Th. 4] Under the assumptions of Lemma 2 part ii), for any p>0, \hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{1},\ldots,O_{n}) is o(n^{-p}); formally, \lim_{n\rightarrow\infty}\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{1},\ldots,O_{n})/n^{-p}=0 with probability one.

Proof:

Eq. (2) follows from Lemma 4, the fact that by the strong law of large numbers \lim_{n\rightarrow\infty}i_{n}(Z_{1},...,Z_{n})=I(X,Y) with probability one, and the union bound.

Under the assumption that P\in\mathcal{P}_{0} (i.e., I(X,Y)=0), we have that P=Q^{*}(P). This implies by definition that i_{n}(Z_{1},...,Z_{n})=0 with probability one. The regret then reduces to our empirical information term \hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{1},\ldots,O_{n}), and Lemma 5 implies the result in (19). ∎

Appendix C Proof of Theorem 2

Proof:

If we define the event δn,bnn,k\mathcal{E}^{n,k}_{\delta_{n},b_{n}}\equiv

{z1n:supT𝒢bnk|Dσ(πT)(P||Q)Dσ(πT)(P^n||Q^n)|rbn,δn(k)},\left\{z^{n}_{1}:\sup_{T\in\mathcal{G}^{k}_{b_{n}}}\left|{D_{\sigma({\pi_{T}})}({P}||{Q}^{*})-D_{\sigma({\pi_{T}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})}\right|\leq r_{b_{n},\delta_{n}}(k)\right\}, (39)

for all k\in\left\{1,\ldots,\left|T^{full}_{b_{n}}\right|\right\}, the conditions stated on (b_{n}) and (\delta_{n}) are the weakest that yield, from (15), \lim_{n\rightarrow\infty}\sup_{k\in\left\{1,\ldots,\left|T^{full}_{b_{n}}\right|\right\}}r_{b_{n},\delta_{n}}(k)=0. This last condition is crucial to be able to apply Lemma 1 to \left\{\mathcal{E}^{n,k}_{\delta_{n},b_{n}},k\right\}. In fact, \mathbb{P}\left(\mathcal{E}^{n}_{\delta_{n},b_{n}}\equiv\bigcap_{k=1}^{\left|T^{full}_{b_{n}}\right|}\mathcal{E}^{n,k}_{\delta_{n},b_{n}}\right)\geq 1-\delta_{n} by definition of r_{b_{n},\delta_{n}}(k) in (16) and the condition expressed in (15).

Importantly, under \mathcal{H}_{0} (i.e., I(X;Y)=0) and for k=1, we can consider that r_{b_{n},\delta_{n}}(k)=0, because D_{\sigma({\left\{\mathbb{R}^{d}\right\}})}({P}||{Q}^{*})=D_{\sigma({\left\{\mathbb{R}^{d}\right\}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})=0 for any z^{n}_{1}\in\mathbb{R}^{dn}. Consequently, for any z^{n}_{1}\in\mathcal{E}^{n}_{\delta_{n},b_{n}}=\bigcap_{k=1}^{\left|T^{full}_{b_{n}}\right|}\mathcal{E}^{n,k}_{\delta_{n},b_{n}}, it follows that

Dσ(πT^bn,δn)(P^n||Q^n)rbn,δn(|T^bn,δn|)\displaystyle D_{\sigma({\pi_{\hat{T}_{b_{n},\delta_{n}}}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})-r_{b_{n},\delta_{n}}\left(\left|\hat{T}_{b_{n},\delta_{n}}\right|\right)\geq
supT𝒢bnkDσ(πT)(P^n||Q^n)rbn,δn(k)\displaystyle\sup_{T\in\mathcal{G}^{k}_{b_{n}}}D_{\sigma({\pi_{{T}}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})-r_{b_{n},\delta_{n}}(k) (40)
supT𝒢bnkDσ(πT)(P||Q)2rbn,δn(k),\displaystyle\geq\sup_{T\in\mathcal{G}^{k}_{b_{n}}}D_{\sigma({\pi_{{T}}})}({P}||{Q}^{*})-2r_{b_{n},\delta_{n}}(k), (41)

for any k\in\left\{1,..,\left|T^{full}_{b_{n}}\right|\right\}. The first inequality in (40) follows from the definition of \hat{T}_{b_{n},\delta_{n}} in (17), and the second inequality in (41) follows from the fact that if z^{n}_{1}\in\mathcal{E}^{n,k}_{\delta_{n},b_{n}} (see (39)), then

|supT𝒢bnkDσ(πT)(P||Q)supT𝒢bnkDσ(πT)(P^n||Q^n)|rbn,δn(k).\displaystyle\left|\sup_{T\in\mathcal{G}^{k}_{b_{n}}}D_{\sigma({\pi_{T}})}({P}||{Q}^{*})-\sup_{T\in\mathcal{G}^{k}_{b_{n}}}D_{\sigma({\pi_{T}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})\right|\leq r_{b_{n},\delta_{n}}(k). (42)

On the other hand, if z1nδn,bnnz^{n}_{1}\in\mathcal{E}^{n}_{\delta_{n},b_{n}}, then

Dσ(πT^bn,δn)(P^n||Q^n)rbn,δn(|T^bn,δn|)\displaystyle D_{\sigma({\pi_{\hat{T}_{b_{n},\delta_{n}}}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})-r_{b_{n},\delta_{n}}(\left|\hat{T}_{b_{n},\delta_{n}}\right|)
Dσ(πT^bn,δn)(P||Q)D(P||Q).\displaystyle\leq D_{\sigma({\pi_{\hat{T}_{b_{n},\delta_{n}}}})}({P}||{Q}^{*})\leq D({P}||{Q}^{*}). (43)

At this point, we use the independence assumption (under \mathcal{H}_{0}, D_{\sigma({\pi_{T}})}({P}||{Q}^{*})=0 for any T\ll T^{full}_{b_{n}}) and the two inequalities (41) and (43) to obtain that for any z^{n}_{1}\in\mathcal{E}^{n}_{\delta_{n},b_{n}}:

Dσ(πT^bn,δn)(P^n||Q^n)rbn,δn(|T^bn,δn|)=0.\displaystyle D_{\sigma({\pi_{\hat{T}_{b_{n},\delta_{n}}}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})-r_{b_{n},\delta_{n}}(\left|\hat{T}_{b_{n},\delta_{n}}\right|)=0. (44)

Finally, using that r_{b_{n},\delta_{n}}(1)=0, it is simple to verify that for any z^{n}_{1}\in\mathcal{E}^{n}_{\delta_{n},b_{n}} the trivial tree (with one cell) is a solution for (17), i.e., \left|\hat{T}_{b_{n},\delta_{n}}\right|=1. Then, from (44), D_{\sigma({\pi_{\hat{T}_{b_{n},\delta_{n}}}})}(\hat{P}_{n}||\hat{Q}_{n}^{*})=\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),\ldots,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))=0, which concludes the argument using that \mathbb{P}\left(\mathcal{E}^{n}_{\delta_{n},b_{n}}\right)\geq 1-\delta_{n} and the assumption that (\delta_{n}) is o(1).

Under 1\mathcal{H}_{1} (i.e., I(X;Y)>0I(X;Y)>0), Silva et al. [39, Th. 2, Eq.(33)] showed that for any z1nδn,bnnz^{n}_{1}\in\mathcal{E}^{n}_{\delta_{n},b_{n}}

\displaystyle I(X,Y)-\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),\ldots,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))
\displaystyle\leq\min_{T\ll T^{full}_{b_{n}}}\left\{\left[I(X;Y)-D_{\sigma({\pi_{T}})}({P}||{Q}^{*}(P))\right]+2r_{b_{n},\delta_{n}}(\left|T\right|)\right\}.

Again using that (δn,bnn)1δn\mathbb{P}\left(\mathcal{E}^{n}_{\delta_{n},b_{n}}\right)\geq 1-\delta_{n}, we obtain the bound in (2). ∎

Appendix D Proof of Lemma 3

Proof:

Let us use the definition of the typical set introduced in (39) and \mathcal{E}^{n}_{\delta_{n},b_{n}}\equiv\bigcap_{k=1}^{\left|T^{full}_{b_{n}}\right|}\mathcal{E}^{n,k}_{\delta_{n},b_{n}}. Under the assumptions of this result, we know that \mathbb{P}\left(\mathcal{E}^{n}_{\delta_{n},b_{n}}\right)\geq 1-\delta_{n}. In addition, in the proof of Theorem 2 it is shown that if z^{n}\in\mathcal{E}^{n}_{\delta_{n},b_{n}}, then \left|\hat{T}_{b_{n},\delta_{n}}\right|=1, which means that \pi_{b_{n},\delta_{n}}=\left\{\mathbb{R}^{d}\right\}. Therefore, using the notation \pi_{b_{n},\delta_{n}}(z^{n}) to make explicit that the partition is data-driven, we have that

δn,bnn{zndn:πbn,δn(zn)={d}}.\displaystyle\mathcal{E}^{n}_{\delta_{n},b_{n}}\subset\left\{z^{n}\in\mathbb{R}^{dn}:\pi_{b_{n},\delta_{n}}(z^{n})=\left\{\mathbb{R}^{d}\right\}\right\}. (45)

Using the definition of 𝒯0((zn)n1)\mathcal{T}_{0}((z_{n})_{n\geq 1}), we have that

(𝒯0((Zn)n1)M)=𝐙(n>M{zn:πbn,δn(zn)={d}})\displaystyle\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})\leq M)=\mathbb{P}_{{\bf Z}}\left(\bigcap_{n>M}\left\{z^{n}:\pi_{b_{n},\delta_{n}}(z^{n})=\left\{\mathbb{R}^{d}\right\}\right\}\right) (46)
=1𝐙(n>M{zn:πbn,δn(zn){d}})\displaystyle=1-\mathbb{P}_{{\bf Z}}\left(\bigcup_{n>M}\left\{z^{n}:\pi_{b_{n},\delta_{n}}(z^{n})\neq\left\{\mathbb{R}^{d}\right\}\right\}\right) (47)
1n>M𝐙({zn:πbn,δn(zn){d}})\displaystyle\geq 1-\sum_{n>M}\mathbb{P}_{{\bf Z}}\left(\left\{z^{n}:\pi_{b_{n},\delta_{n}}(z^{n})\neq\left\{\mathbb{R}^{d}\right\}\right\}\right) (48)
1n>M𝐙((δn,bnn)c)1n>Mδn.\displaystyle\geq 1-\sum_{n>M}\mathbb{P}_{{\bf Z}}\left((\mathcal{E}^{n}_{\delta_{n},b_{n}})^{c}\right)\geq 1-\sum_{n>M}\delta_{n}. (49)

The equality in (46) comes from the definition in (22). The inequality in (48) comes from the sub-additivity of \mathbb{P}_{{\bf Z}}, and the bounds in (49) follow from (45) and the construction of \mathcal{E}^{n}_{\delta_{n},b_{n}}. Finally, using the assumption that (\delta_{n})\in\ell_{1}(\mathbb{N}) in (49), we have that

(𝒯0((Zn)n1)<)=limM(𝒯0((Zn)n1)M)=1.\displaystyle\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})<\infty)=\lim_{M\rightarrow\infty}\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})\leq M)=1. (50)

Appendix E Proof of Theorem 3

Proof:

Under \mathcal{H}_{1}, the assumptions on (b_{n}) and (\delta_{n}) are within the admissible range stated in Theorem 2 part i). Consequently, we have consistency of the regret, i.e., \lim_{n\rightarrow\infty}\left|i_{n}(Z_{1},...,Z_{n})-\hat{i}_{\pi_{b_{n},\delta_{n}}}(O_{\pi_{b_{n},\delta_{n}}}(Z_{1}),\ldots,O_{\pi_{b_{n},\delta_{n}}}(Z_{n}))\right|=0, \mathbb{P}-almost surely. This result is sufficient to obtain that (\phi_{{b_{n},\delta_{n}},a_{n}}(Z_{1},\ldots,Z_{n}))_{n\geq 1} reaches 1 eventually with probability one, from the arguments presented in the proof of Theorem 1.

Under 0\mathcal{H}_{0}, it is useful to define the following last exit time associated with the detection of independence:

𝒯((zn)n1,(an)n1)\displaystyle\mathcal{T}((z_{n})_{n\geq 1},(a_{n})_{n\geq 1})\equiv
sup{m1:ϕbm,δm,am(z1,..,zm)=1}{}.\displaystyle\sup{\left\{m\geq 1:\phi_{b_{m},\delta_{m},a_{m}}(z_{1},..,z_{m})=1\right\}}\in\mathbb{N}\cup\left\{\infty\right\}. (51)

Saying that (ϕbn,δn,an(Z1,.,Zn))n1(\phi_{{b_{n},\delta_{n}},a_{n}}(Z_{1},....,Z_{n}))_{n\geq 1} reaches 0 eventually with probability one is equivalent to the condition that

(𝒯((Zn)n1,(an)n1)<)=1,\mathbb{P}(\mathcal{T}((Z_{n})_{n\geq 1},(a_{n})_{n\geq 1})<\infty)=1, (52)

from the arguments presented in the proof of Theorem 1. At this point, it is important to revisit the definition of \mathcal{T}_{0}((z_{n})_{n\geq 1}) in (22): if for some z^{m}\in\mathbb{R}^{dm} we have that \left|\hat{T}_{b_{m},\delta_{m}}\right|=1, then \hat{i}_{\pi_{b_{m},\delta_{m}}}(O_{\pi_{b_{m},\delta_{m}}}(z_{1}),\ldots,O_{\pi_{b_{m},\delta_{m}}}(z_{m}))=i_{m}(z_{1},...,z_{m})=0 (details in Appendix C) and, therefore, \phi_{b_{m},\delta_{m},a_{m}}(z_{1},..,z_{m})=0 independently of a_{m} (see Eq. (7)). Consequently, we have that

{zmdm:πbm,δm(zm)={d}}={zm:|T^bm,δm|=1}\displaystyle\left\{z^{m}\in\mathbb{R}^{dm}:\pi_{b_{m},\delta_{m}}(z^{m})=\left\{\mathbb{R}^{d}\right\}\right\}=\left\{z^{m}:\left|\hat{T}_{b_{m},\delta_{m}}\right|=1\right\}
\subset\left\{z^{m}\in\mathbb{R}^{dm}:\phi_{b_{m},\delta_{m},a_{m}}(z_{1},..,z_{m})=0\right\} (53)

for any ana_{n}, and it follows that

(𝒯((Zn)n1,(an)n1)M)=\displaystyle\mathbb{P}(\mathcal{T}((Z_{n})_{n\geq 1},(a_{n})_{n\geq 1})\leq M)=
\mathbb{P}_{{\bf Z}}\left(\bigcap_{n>M}\left\{z^{n}:\phi_{b_{n},\delta_{n},a_{n}}(z_{1},..,z_{n})=0\right\}\right) (54)
𝐙(n>M{zn:πbn,δn(zn)={d}})\displaystyle\geq\mathbb{P}_{{\bf Z}}\left(\bigcap_{n>M}\left\{z^{n}:\pi_{b_{n},\delta_{n}}(z^{n})=\left\{\mathbb{R}^{d}\right\}\right\}\right) (55)
=(𝒯0((Zn)n1)M).\displaystyle=\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})\leq M). (56)

The identity in (54) follows from the definition in (51), and the steps in (55) and (56) follow from (53) and (46), respectively. The argument concludes from Lemma 3: under the assumptions stated on (b_{n}) and (\delta_{n}), \lim_{M\rightarrow\infty}\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})\leq M)=1, which implies the result in (52) through (56). ∎

Appendix F Proof of Theorem 4

Proof:

Under 0\mathcal{H}_{0}, we can adopt directly the bound presented in (49):

(𝒯0((Zn)n1)<m)1nmδn=1nmCen1/3,\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})<m)\geq 1-\sum_{n\geq m}\delta_{n}=1-\sum_{n\geq m}Ce^{-n^{1/3}}, (57)

where we include the assumption that (1/\delta_{n})_{n}\approx(e^{n^{1/3}}), i.e., \delta_{n}=Ce^{-n^{1/3}} for a positive constant C. For the series in (57) we have

nmen1/3\displaystyle\sum_{n\geq m}e^{-n^{1/3}} =em1/3[1+e(m+1)1/3em1/3++e(m+j)1/3em1/3+..]\displaystyle=e^{-m^{1/3}}\left[1+\frac{e^{-(m+1)^{1/3}}}{e^{-m^{1/3}}}+...+\frac{e^{-(m+j)^{1/3}}}{e^{-m^{1/3}}}+..\right] (58)
=em1/3j0ej1/3,\displaystyle=e^{-m^{1/3}}\cdot\underbrace{\sum_{j\geq 0}e^{-j^{1/3}}}_{\equiv\mathcal{I}}, (59)

where <\mathcal{I}<\infty. Therefore, we have that (𝒯0((Zn)n1)m)Cem1/3\mathbb{P}(\mathcal{T}_{0}((Z_{n})_{n\geq 1})\geq m)\leq C\mathcal{I}e^{-m^{1/3}}. ∎
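For completeness, we note that \mathcal{I} admits an explicit constant by comparing the series with an integral (the integrand is decreasing, and the substitution u=x^{1/3}, dx=3u^{2}du, yields a Gamma integral):

\mathcal{I}=\sum_{j\geq 0}e^{-j^{1/3}}\leq 1+\int_{0}^{\infty}e^{-x^{1/3}}\,dx=1+3\int_{0}^{\infty}u^{2}e^{-u}\,du=1+3\,\Gamma(3)=7,

so, in particular, one can take \mathcal{I}\leq 7 in the bound above.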

References

  • [1] C.-J. Ku and T. L. Fine, “Testing for stochastic independence: Application to blind source separation,” IEEE Transactions on Signal Processing, vol. 53, no. 5, pp. 1815–1826, 2005.
  • [2] P. Comon, “Independent component analysis: a new concept?” Signal Processing, vol. 36, pp. 287–314, 1994.
  • [3] J. F. Cardoso, “Blind separation of instantaneous mixtures of non-stationary sources,” IEEE Transactions on Signal Processing, vol. 49, no. 9, pp. 1837–1848, 2001.
  • [4] J. Cheng, K. K. Tan, Z. Yi, and S. Huang, “Convergence analysis of a class of Hyvärinen–Oja’s ICA learning algorithms with constant learning rates,” IEEE Transactions on Signal Processing, vol. 57, no. 5, pp. 1811–1824, 2009.
  • [5] B. Bloem-Reddy and Y. W. Teh, “Probabilistic symmetry and invariant neural networks,” 2019, https://arxiv.org/abs/1901.06082.
  • [6] J. Runge, “Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information,” in Proceedings of Machine Learning Research, 2018.
  • [7] A. Bellot and M. van der Schaar, “Conditional independence testing using generative adversarial networks,” in Conference on Neural Information Processing Systems, 2019.
  • [8] R. Sen, A. T. Suresh, K. Shanmugam, A. G. Dimakis, and S. Shakkottai, “Model-powered conditional independence test,” in Conference on Neural Information Processing Systems, 2017.
  • [9] C. Ku and T. Fine, “A bayesian independence test for small datasets,” IEEE Transactions on Signal Processing, vol. 54, no. 10, pp. 4026 – 4031, 2006.
  • [10] D. Reshef, Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti, “Detecting novel associations in large data sets,” Science, vol. 334, no. 6062, pp. 1518–1524, 2011.
  • [11] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf, “Measuring statistical dependence with Hilbert-Schmidt norms,” in Algorithmic Learning Theory: 16th International Conference.   Springer, 2005, pp. 63–78.
  • [12] A. Gretton and L. Györfi, “Nonparametric independence tests: Space partitioning and kernel approaches,” in Algorithmic Learning Theory: 19th International Conference.   Springer, 2008, pp. 183–198.
  • [13] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola, “A kernel statistical test of independence,” Advances in Neural Information Processing Systems, pp. 585–592, 2007.
  • [14] N. Pfister, P. Bühlmann, B. Schölkopf, and J. Peters, “Kernel-based tests for joint independence,” arXiv: Statistics Theory, no. 1603.00285, 2016.
  • [15] G. Székely, M. Rizzo, and N. Bakirov, “Measuring and testing dependence by correlation of distances,” The Annals of Statistics, vol. 35, no. 6, pp. 2769–2794, 2007.
  • [16] G. J. Székely and M. L. Rizzo, “Brownian distance covariance,” The Annals of Applied Statistics, vol. 3, pp. 1236–1265, 2009.
  • [17] G. J. Székely, “The distance correlation t-test of independence in high dimension,” Journal of Multivariate Analysis, vol. 117, pp. 193–213, 2013.
  • [18] S. Zheng, N. Shi, and Z. Zhang, “Generalized measures of correlation for asymmetry, nonlinearity, and beyond,” Journal of the American Statistical Association, vol. 107, pp. 1239–1252, 2012.
  • [19] X. Wang, B. Jiang, and J. Liu, “Generalized r-squared for detecting non-independence,” Biometrika, vol. 104, pp. 129–139, 2017.
  • [20] A. Gretton and L. Györfi, “Consistent nonparametric tests of independence,” Journal of Machine Learning Research, vol. 11, pp. 1391–1423, 2010.
  • [21] L. Ma and J. Mao, “Fisher exact scanning for dependency,” Journal of the American Statistical Association, vol. 114, no. 525, pp. 245–258, 2019.
  • [22] K. Zhang, “Bet on independence,” Journal of the American Statistical Association, vol. 114, no. 528, pp. 1620–1637, 2019.
  • [23] R. Heller, Y. Heller, S. Kaufman, B. Brill, and M. Gorfine, “Consistent distribution-free k-sample and independence tests for univariate random variables,” Journal of Machine Learning Research, vol. 17, pp. 1–54, 2016.
  • [24] R. Heller and Y. Heller, “Multivariate tests of association based on univariate tests,” Advances in Neural Information Processing Systems, vol. 29, pp. 208–216, 2016.
  • [25] M. Menéndez, J. Pardo, L. Pardo, and K. Zografos, “On tests of independence based on minimum ϕ-divergence estimator with constraint: An application to modeling DNA,” Computational Statistics and Data Analysis, vol. 51, pp. 1100–1118, 2006.
  • [26] F. Han, S. Chen, and H. Liu, “Distribution-free tests of independence in high dimensions,” Biometrika, vol. 104, pp. 813–828, 2017.
  • [27] J. Dauxois and G. M. Nkiet, “Nonlinear canonical analysis and independence tests,” The Annals of Statistics, vol. 26, no. 4, pp. 1254–1278, 1998.
  • [28] A. Dionisio, R. Menezes, and D. A. Mendes, “Entropy-based independence test,” Nonlinear Dynamics, vol. 44, pp. 351–357, 2006.
  • [29] J. Suzuki, “An estimator of mutual information and its application to independence testing,” Entropy, vol. 18, no. 109, pp. 1–18, March 2016.
  • [30] I. Kontoyiannis and M. Skoularidou, “Estimating the directed information and testing for causality,” IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 6053–6067, 2016.
  • [31] L. Devroye and G. Lugosi, Combinatorial Methods in Density Estimation.   Springer - Verlag, New York, 2001.
  • [32] S. Kullback, Information Theory and Statistics.   New York: Wiley, 1958.
  • [33] H. Shi, M. Drton, and F. Han, “Distribution-free consistent independence tests via center-outward ranks and signs,” Journal of the American Statistical Association, pp. 1–16, 2020.
  • [34] K. Zhang, Z. Zhao, and W. Zhou, “Beauty powered beast,” arXiv preprint arXiv:2103.00674, 2021.
  • [35] I. Csiszár and P. Shields, Information Theory and Statistics: A Tutorial.   Now Publishers, 2004.
  • [36] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed.   Wiley Interscience, New York, 2006.
  • [37] G. Lugosi and A. B. Nobel, “Consistency of data-driven histogram methods for density estimation and classification,” The Annals of Statistics, vol. 24, no. 2, pp. 687–706, 1996.
  • [38] G. A. Darbellay and I. Vajda, “Estimation of the information by an adaptive partition of the observation space,” IEEE Transactions on Information Theory, vol. 45, no. 4, pp. 1315–1321, 1999.
  • [39] J. F. Silva and S. Narayanan, “Complexity-regularized tree-structured partition for mutual information estimation,” IEEE Transactions on Information Theory, vol. 58, no. 3, pp. 1940–1952, March 2012.
  • [40] ——, “Non-product data-dependent partitions for mutual information estimation: Strong consistency and applications,” IEEE Transactions on Signal Processing, vol. 58, pp. 3497–3511, 2010.
  • [41] ——, “Information divergence estimation based on data-dependent partitions,” Journal of Statistical Planning and Inference, vol. 140, no. 11, pp. 3180–3198, November 2010.
  • [42] R. E. Blahut, “Hypothesis testing and information theory,” IEEE Transactions on Information Theory, vol. it-20, no. 4, pp. 405–417, July 1974.
  • [43] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees.   Belmont, CA: Wadsworth, 1984.
  • [44] C. Scott and R. D. Nowak, “Minimax-optimal classification with dyadic decision trees,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1335–1353, April 2006.
  • [45] C. Scott, “Tree pruning with subadditive penalties,” IEEE Transactions on Signal Processing, vol. 53, no. 12, pp. 4518–4525, 2005.
  • [46] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition.   New York: Springer-Verlag, 1996.
  • [47] M. Gonzalez and J. F. Silva, “Data-driven representations for testing independence: A connection with mutual information estimation,” in IEEE International Symposium on Information Theory, 2020, pp. 1–5.
  • [48] J. F. Silva and S. Narayanan, “Universal consistency of data-driven partitions for divergence estimation,” in IEEE International Symposium on Information Theory, June 2007.
  • [49] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, 3rd ed.   The MIT Press, 2009.
  • [50] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. A. Wellmer, Efficient and Adaptive Estimation for Semiparametric Models.   Springer - Verlag, New York, 1998.
  • [51] N. Merhav and M. Feder, “Universal prediction,” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2124–2147, October 1998.
  • [52] D. Bontemps, S. Boucheron, and E. Gassiat, “About adaptive coding on countable alphabets,” IEEE Transactions on Information Theory, vol. 60, no. 2, pp. 808–821, 2014.

Appendix G Supplemental Material

In Section G-A, the estimation of mutual information (MI) based on our tree-structured partition (TSP) approach is evaluated. Section G-B details the selection of parameters of the L1L_{1} test, the log-likelihood test, and the Pearson-χ2\chi^{2} test. These results were used to select the final curves presented in Figure 3. Section G-C complements the empirical results of our method on a heavy-tailed distribution and a multi-modal distribution. Finally, Section G-D presents some results regarding the statistical significance of our test.

G-A Preliminary Analysis of Mutual Information Estimation

Analyzing our tree-structured partition (TSP) scheme’s capacity to estimate MI in the non-asymptotic regime is important because it indicates the trade-off between the two types of errors that arise when analyzing the performance of its induced test. Here, we present a basic analysis of the effects of the parameters of our scheme: l and w (b_{n}=w\cdot n^{-l}), and \alpha>0, within the admissible regime provided by our results in Section VIII of the main content for (b_{n})_{n\geq 1} and (\delta_{n})_{n\geq 1}. For this analysis, we consider a joint vector Z=(X,Y) following a zero-mean Gaussian distribution in \mathbb{R}^{2} where the correlation coefficient determining I(X,Y) is parametrized by \sigma=\mathbb{E}(XY). We evaluate empirically how the estimator \hat{i}_{b_{n},\delta_{n}}^{\alpha}(\cdot) approaches the true value as n grows, and the effects of the mentioned parameters under \mathcal{H}_{0} (\sigma=0) and \mathcal{H}_{1} (\sigma>0). For illustration purposes, we consider \sigma=0.7 as a relatively high MI regime under \mathcal{H}_{1} (similar analyses were conducted under different levels of correlation under \mathcal{H}_{1}; they are not reported here for brevity).

Figure 6: Examples of a realization of \hat{i}_{b_{n},\delta_{n}}^{\alpha=0}(Z^{n}_{1}) as a function of the sample size n, under different values of the parameter l\in\left\{0.067,0.133,0.2,0.267,0.3\right\} with w=0.05 and \alpha=0 (b_{n}=w\cdot n^{-l}). For these results, two scenarios of correlation between (X,Y) are considered: \sigma=0 and \sigma=0.7.

To begin, Figure 6 shows the effects of l\in(0,1/3), fixing w=0.05 and \alpha=0. Under \alpha=0, we look at the performance of the full tree T^{full}_{b_{n}}, which provides the baseline capacity of our regularized solution in Eq. (17) of the main content. Figure 6 shows that as l increases (approaching its least conservative value 1/3, associated with the largest full tree in the growing phase), the capacity of our scheme to estimate the true MI under \mathcal{H}_{1} improves, as expected. On the contrary, the estimation of the true MI under \mathcal{H}_{0} is better for smaller values of l, although beyond 1,000 samples the specific value of l is not that critical under \mathcal{H}_{0}.

Similarly, Figure 7 presents the effect of the scaling factor w in b_{n} on the estimation of MI in the two aforementioned regimes, \sigma=0 and \sigma=0.7. For these plots, we fixed l=0.167\in(0.133,1/3) and \alpha=0, i.e., we focus on the performance of the full tree. In the finite sampling regime, n\in\left\{1,..,10^{5}\right\}, the value of w largely determines the tendency to over-estimate I(X,Y) under \sigma=0 (\mathcal{H}_{0}) and to under-estimate or over-estimate I(X,Y) under the case \sigma=0.7. The effect of w on the performance of the full-tree solution is more prominent than that of l in both scenarios (\mathcal{H}_{0} and \mathcal{H}_{1}).

Figure 7: Examples of a realization of \hat{i}_{b_{n},\delta_{n}}^{\alpha=0}(Z^{n}_{1}). Same setting and scenarios as in Figure 6; however, in this case w\in\left\{0.0005,0.001,0.005,0.01,0.05,0.1,0.5,1\right\} with l=0.167 and \alpha=0.

Finally, we look at the effects of the scalar \alpha in Eq. (24), for which we select l=0.167\in(0.133,1/3) and w=0.05\in(0.005,0.1). Figure 8 shows different selections of this factor in the two scenarios already introduced (\sigma=0 and \sigma=0.7) and n\in\left\{1,..,10^{5}\right\}. We present three sets of figures: the plug-in estimations \hat{i}_{b_{n},\delta_{n}}(\alpha)\equiv D_{\sigma({\pi_{b_{n},\delta_{n}}^{\alpha}})}(\hat{P}_{n}||\hat{Q}_{n}^{*}), the regularized MI estimations \tilde{i}_{b_{n},\delta_{n}}(\alpha)\equiv\hat{i}_{b_{n},\delta_{n}}(\alpha)-\alpha\cdot r_{b_{n},\delta_{n}}(\left|\hat{T}_{b_{n},\delta_{n}}(\alpha)\right|), and the size of the optimal tree, \left|\hat{T}_{b_{n},\delta_{n}}(\alpha)\right|. In general, we observe that the best results under \mathcal{H}_{1} are obtained when \alpha is very small, on the order of 10^{-5}. This finding is interesting because it shows that the theoretical confidence interval used to obtain the regularization term in Eq. (16) is, in general, very conservative; in the non-asymptotic regime, this term needs to be corrected to obtain good results. It is worth noticing that this scaling factor \alpha is fixed (independent of n) and does not affect the theoretical properties of our scheme, in particular the structural capacity of our TSP scheme to detect independence under \mathcal{H}_{0} presented in Section VI. This capacity is observed in the panel of Figure 8 that presents the size of \hat{T}_{b_{n},\delta_{n}}(\alpha): the trivial tree (with one cell) is obtained in a finite number of steps, as predicted by our results in Section VI.A. On the other hand, in the regime of high MI, we observe that the tree’s size is comparable to the size of the full tree (obtained for \alpha=0). Importantly, both of these results support the adaptive nature of our data-driven solution.

Figure 8: Examples of $\hat{i}_{b_{n},\delta_{n}}^{\alpha}(Z^{n}_{1})$, $\tilde{i}_{b_{n},\delta_{n}}(\alpha)$ and $|\hat{T}_{b_{n},\delta_{n}}(\alpha)|$ as a function of the sample size $n$ and for different values of $\alpha$, with $l=0.167$ and $w=0.05$. The case $\alpha=0$ corresponds to the non-regularized solution (the full tree).

G-B Complementary Results for Figure 3

To select the parameters of the alternative methods (the $L_{1}$ test, the log-likelihood test, and the Pearson-$\chi^{2}$ test), we implemented them and explored their performances in the admissible range where these approaches are known to be strongly consistent [20]. In particular, we consider the number of partitions per dimension as $m(n)=n^{p}$ with $p\in(0,0.5)$ (details in [20]), which satisfies the strong-consistency conditions for the $L_{1}$ and the log-likelihood tests established in [20]. Although there is no explicit proof of the strong consistency of the Pearson-$\chi^{2}$ test, we use values of $p\in(0,1/3)$ according to the range suggested in [20].
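As a reference for how these binning-based statistics operate, the sketch below computes the empirical log-likelihood (plug-in MI) statistic on an $m(n)\times m(n)$ product grid with $m(n)=\lfloor n^{p}\rfloor$ bins per dimension (scalar case). The rejection thresholds of [20] are omitted and the function name is ours; this is an illustration under those assumptions, not the exact implementation used in the experiments.

```python
import numpy as np

def loglik_stat(x, y, p=0.2):
    """Plug-in MI (log-likelihood) statistic on a fixed product grid with
    m(n) = floor(n**p) bins per dimension; thresholds from [20] omitted."""
    n = len(x)
    m = max(1, int(np.floor(n ** p)))
    counts, _, _ = np.histogram2d(x, y, bins=m)
    pxy = counts / n                                # joint empirical pmf
    px = pxy.sum(axis=1, keepdims=True)             # marginal of X
    py = pxy.sum(axis=0, keepdims=True)             # marginal of Y
    mask = pxy > 0                                  # avoid log(0) cells
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

rng = np.random.default_rng(0)
x, y = rng.standard_normal(10_000), rng.standard_normal(10_000)
print(loglik_stat(x, y))  # close to 0 under independence
```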

Under the experimental setting presented in Section VIII, Figure 9 reports the performance curves (the trade-off between $M_{0}(\epsilon)$ and $M^{\sigma}_{1}(\epsilon)$) obtained when selecting different parameters for these methods. From these curves, the parameters selected for each method were those offering a smooth transition in the trade-off between the sampling complexity pairs. This criterion is consistent with the one used to select the parameters of our method (see Figure 1). Interestingly, for all three methods this was obtained consistently for $p$ around $0.2$. Importantly, this selection offers reasonable solutions in all the correlation regimes presented in Figure 9.

Figure 9: Illustration of the trade-off between $M_{0}(\epsilon)$ and $M^{\sigma}_{1}(\epsilon)$ obtained for the $L_{1}$-test, the log-likelihood test and the Pearson-$\chi^{2}$ test, considering $m(n)=n^{p}$ partitions per dimension for the domain of each random variable, with $p\in\{0.2,0.25,0.3\}$. Five correlation scenarios are presented for $\mathcal{H}_{1}$, considering $\sigma=\mathbb{E}(XY)\in\{0.1,0.3,0.5,0.7,0.9\}$.

G-C Testing the Method on Different Distributions

To complement the experimental results in Section VIII.A (obtained for Gaussian distributions in the scalar case), we explore the performance of our method against non-adaptive histogram-based tests by studying the detection-time trade-off curves $(M_{0}(\epsilon),M_{1}^{\sigma}(\epsilon))$ on samples generated by two important families of distributions: a heavy-tailed distribution (the Student's t law) and a multi-modal distribution.

Figure 10 presents the trade-off between $(M_{0}(\epsilon),M_{1}^{\sigma}(\epsilon))$ for different values of correlation (under $\mathcal{H}_{1}$) using samples generated by a Student's t distribution with $2$ degrees of freedom (a heavy-tailed distribution). The curves show that the performance of our test is superior to the other methods in all correlation scenarios indexed by $\sigma$. The performance gain is significant in the scenarios of high correlation and for a relatively high independence detection time (i.e., when $M_{0}(\epsilon)\geq 10^{2}$). For the non-adaptive tests, the relatively flat (insensitive) trade-off observed in all the explored scenarios is intriguing: the curves are close to a horizontal line. This flat trade-off in $(M_{0}(\epsilon),M_{1}^{\sigma}(\epsilon))$ is observed even when considering a significantly larger number of cells (using $p=0.3$). In fact, this selection ($p=0.3$) produces a number of cells significantly higher than the number of cells obtained with our adaptive method (more on this analysis below). Therefore, this observation shows that our scheme, with fewer data-driven cells, produces a better performance trade-off in general. In conclusion, these performance curves show the benefits of the adaptive partitioning in our test, which deals with heavy-tailed distributions significantly better than tests that use non-adaptive partitions (binning) of the space.
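The exact generator of the correlated heavy-tailed pairs is a detail of the experimental setup; one standard construction of a bivariate Student's t pair with $2$ degrees of freedom and nominal correlation $\sigma$ is the Gaussian scale mixture sketched below (an assumption for illustration; note that for $2$ degrees of freedom the second moments are infinite, so $\sigma$ indexes the correlation of the underlying Gaussian).

```python
import numpy as np

def bivariate_t(n, sigma, df=2, seed=0):
    """Draw n pairs from a bivariate Student's t with `df` degrees of
    freedom via a Gaussian scale mixture: t = z / sqrt(s/df), where z is
    bivariate normal with correlation sigma and s ~ chi^2(df)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, sigma], [sigma, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=n)
    s = rng.chisquare(df, size=n)
    return z / np.sqrt(s / df)[:, None]

samples = bivariate_t(5_000, sigma=0.7)  # heavy-tailed, dependent pairs
```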

Figure 10: Illustration of the trade-off between $M_{0}(\epsilon)$ and $M^{\sigma}_{1}(\epsilon)$ obtained for the TSP test, the $L_{1}$-test, the log-likelihood test, and the Pearson-$\chi^{2}$ test using samples from a Student's t distribution with 2 degrees of freedom. The TSP test configuration is $l=0.001$ and $w=0.1$, with $\alpha\in[10^{-5},5.5\cdot 10^{-4}]$. The $L_{1}$ test configuration is $m(n)=n^{0.3}$, with $C\in[1.53,1.547]$. The log-likelihood test configuration is $m(n)=n^{0.3}$, with $C\in[0.465,0.485]$. The Pearson-$\chi^{2}$ test configuration is $m(n)=n^{0.3}$, with $C\in[0.106,0.12]$. Five correlation scenarios are presented for $\mathcal{H}_{1}$, considering $\sigma=\mathbb{E}(XY)\in\{0.1,0.3,0.5,0.7,0.9\}$.

Figure 11, on the other hand, presents the trade-off between $M_{0}(\epsilon)$ and $M^{\sigma}_{1}(\epsilon)$ using samples generated by a mixture of four Gaussians (a multi-modal distribution). Following the setting introduced in Gretton (2008), the dependence level is produced by rotating counterclockwise (by an angle $\theta$) a 2D random vector with the following Gaussian mixture density:

f(x,y)=\frac{1}{4}\sum_{i=1}^{4}\mathcal{N}\left((x,y);\mu_{i},\bar{\Sigma}\right), \qquad (60)

where $\mathcal{N}((x,y);\mu,\Sigma)$ denotes the bivariate Gaussian density function with mean $\mu$ and covariance matrix $\Sigma$, evaluated at $(x,y)$. In this case, $\bar{\Sigma}=\begin{bmatrix}0.05&0\\ 0&0.05\end{bmatrix}$, $\mu_{1}=(1,1)$, $\mu_{2}=(1,-1)$, $\mu_{3}=(-1,1)$, $\mu_{4}=(-1,-1)$. The independent scenario is obtained by choosing $\theta=0$, while three dependent scenarios ($\mathcal{H}_{1}$) are explored by choosing $\theta\in\{\pi/16,2\pi/16,3\pi/16\}$. Similarly to the results presented in Figure 3 (for the Gaussian case), we observe that our TSP test outperforms the other tests in the regime $M_{0}(\epsilon)\geq 10^{2}$, while the other tests present a better performance in the opposite regime. It is important to notice that, unlike the Gaussian case, the number of partitions of each alternative test is determined by $p=0.3$, which induces a notably higher number of cells than the parameter configuration of our TSP test. Indeed, $p=0.3$ induces $m(n)^{2}=(\lfloor n^{p}\rfloor)^{2}=(\lfloor 1000^{0.3}\rfloor)^{2}=49$ cells for the alternative tests, while $l=0.001$ and $w=0.1$ induce just $8$ cells for the tree without pruning, i.e., the final number of cells is at most $8$. This demonstrates the efficiency of the proposed method, which can represent the statistical dependence of the random variables with considerably fewer cells because of its adaptive capacity. Due to the multi-modal nature of the distribution used in these experiments, we also present an alternative configuration of the TSP test with $l=0.167$ and $w=0.05$ (denoted as TSP-long in Figure 11), inducing more adaptive cells (but still fewer than the other tests) in order to capture the statistical dependence of the multi-modal distribution in a less conservative way. We observe a considerable benefit from this new choice in the regime $M_{0}(\epsilon)\geq 10^{2}$, with a slight deterioration in the opposite regime.
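For reproducibility of this setting, a minimal sketch of the sampling scheme in Eq.(60) combined with the counterclockwise rotation is given below (the helper name is ours; the mixture weights are the uniform $1/4$ of Eq.(60), and the rotation is applied to each sample).

```python
import numpy as np

def rotated_mixture(n, theta, seed=0):
    """Draw n samples from the four-Gaussian mixture of Eq.(60)
    (means (+-1, +-1), covariance 0.05*I) and rotate them
    counterclockwise by theta; theta = 0 gives independent X, Y."""
    rng = np.random.default_rng(seed)
    means = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
    idx = rng.integers(0, 4, size=n)               # uniform mixture weights
    z = means[idx] + np.sqrt(0.05) * rng.standard_normal((n, 2))
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])              # counterclockwise rotation
    return z @ rot.T

samples = rotated_mixture(1_000, theta=np.pi / 16)  # a dependent scenario
```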

Figure 11: Illustration of the trade-off between $M_{0}(\epsilon)$ and $M^{\sigma}_{1}(\epsilon)$ obtained for two parameter configurations of the TSP test, the $L_{1}$-test, the log-likelihood test, and the Pearson-$\chi^{2}$ test using samples from a mixture of four uncorrelated Gaussians centered at $(1,1),(1,-1),(-1,1),(-1,-1)$, respectively, with $\sigma^{2}=0.05$. The TSP test configuration is $l=0.001$ and $w=0.1$, with $\alpha\in[0.00001,0.0005]$. The TSP-long configuration is $l=0.167$ and $w=0.05$, with $\alpha\in[0.000009,0.00025]$. The $L_{1}$ test configuration is $m(n)=n^{0.3}$, with $C\in[0.6,1.1]$. The log-likelihood test configuration is $m(n)=n^{0.3}$, with $C\in[0.067,0.32]$. The Pearson-$\chi^{2}$ test configuration is $m(n)=n^{0.3}$, with $C\in[0.075,0.0125]$. Three rotation scenarios are presented for $\mathcal{H}_{1}$, considering $\theta\in\{\pi/16,2\pi/16,3\pi/16\}$.

To complement this analysis, Figure 12 shows the partitions generated by the non-adaptive scheme and by our adaptive partition scheme. For this, $n=1000$ samples were considered for each of the three distributions explored (Gaussian, Student's t, and Gaussian mixture). The adaptive attribute of our scheme is observed and contrasted with the non-adaptive partitions, and it is particularly evident in the multi-modal case. We observe that our scheme, as expected, represents the distribution of the samples more effectively (with fewer cells) than the non-adaptive scheme. Indeed, non-adaptive partitions use a considerably higher number of cells to represent the statistical dependence observed in the data, which helps explain the difference in performance with respect to our method. In particular, there is a clear issue with the non-adaptive scheme: most of its cells contain almost no samples, which is an ineffective way to represent the information in the data. This can significantly affect the discrimination ability of these non-adaptive methods for small sample sizes (see Figure 12). From our perspective, this is one of the primary explanations for the erratic (insensitive) flat behaviour observed in many of the trade-off curves presented in Figs. 10 and 11, and, overall, for the better (and more expressive) performance trade-off offered by our approach.
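This effect is easy to quantify: the sketch below (our own illustrative diagnostic, not part of any of the tests) estimates the fraction of empty cells that a fixed $m(n)\times m(n)$ product grid produces on a heavy-tailed sample; under the Student's t law, most fixed-grid cells end up unpopulated because the grid must stretch to cover the tails.

```python
import numpy as np

def empty_cell_fraction(x, y, p=0.3):
    """Fraction of cells with no samples in the non-adaptive m(n) x m(n)
    product grid, with m(n) = floor(n**p)."""
    n = len(x)
    m = max(1, int(np.floor(n ** p)))
    counts, _, _ = np.histogram2d(x, y, bins=m)
    return float((counts == 0).mean())

rng = np.random.default_rng(0)
t = rng.standard_t(2, size=(10_000, 2))        # heavy-tailed sample
print(empty_cell_fraction(t[:, 0], t[:, 1]))   # typically a large fraction
```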

Figure 12: Illustration of the partitions obtained from the non-adaptive product partition scheme (first row) and the TSP adaptive partition scheme without pruning (second row), for $n=1000$ samples of a Gaussian distribution, a mixture of four Gaussians, and a Student's t distribution. The parameters of the non-adaptive product partition are $p=0.2$ for the Gaussian distribution and $p=0.3$ for the mixture of Gaussians and the Student's t distribution. The parameters of the TSP partition are $l=0.001,w=0.1$ for the Gaussian and Student's t distributions and $l=0.167,w=0.05$ (long tree) for the mixture of Gaussians.

G-D Statistical Significance Analysis in Section IX.C.

G-D1 Proof of Proposition 2

Proof:

Let us consider the set

A_{b_{n},\delta_{n},a_{n}}\equiv\left\{z^{n}=(x^{n},y^{n})\in{\mathcal{X}}^{n}\times{\mathcal{Y}}^{n}:\phi_{b_{n},\delta_{n},a_{n}}(z^{n})=0\right\}, \qquad (61)

i.e., the points in the space where our test of length $n$ decides $\mathcal{H}_{0}$. Under $\mathcal{H}_{0}$, the probability of rejecting the null is given by:

\alpha(b_{n},\delta_{n},a_{n})=\mathbb{P}_{Z^{n}}(A^{c}_{b_{n},\delta_{n},a_{n}}), \qquad (62)

where $\mathbb{P}_{Z^{n}}$ denotes the probability of $Z^{n}$ induced by $P$ (i.e., the $n$-fold product distribution). Using the argument presented in the proof of Theorem 3 (see Eq.(51)), we show that when our data-driven partition $\pi_{b_{n},\delta_{n}}(\cdot)$ achieves the trivial case $\left\{R^{d}\right\}$ (the partition with one cell), then $\phi_{b_{n},\delta_{n},a_{n}}(\cdot)$ decides $\mathcal{H}_{0}$ (i.e., outputs $0$), independently of the value of $a_{n}>0$. This structural detection of independence (see Section VI.A for more details) implies that:

B_{b_{n},\delta_{n}}\equiv\left\{z^{n}=(x^{n},y^{n}):\pi_{b_{n},\delta_{n}}(z^{n})=\left\{R^{d}\right\}\right\}\subset A_{b_{n},\delta_{n},a_{n}}, \qquad (63)

because $\pi_{b_{n},\delta_{n}}(z^{n})=\left\{R^{d}\right\}$ implies that $\phi_{b_{n},\delta_{n},a_{n}}(z^{n})=0$, independent of the magnitude of $a_{n}>0$.

At this point, we use the assumptions of Theorem 2. In the argument presented in the proof of Theorem 2 (see Appendix III), it was proved that the set $\xi^{n}_{\delta_{n},b_{n}}$ (see Eq.(39) in Appendix III) satisfies the following: if $z^{n}\in\xi^{n}_{\delta_{n},b_{n}}$, then $z^{n}\in B_{b_{n},\delta_{n}}$ (i.e., $\pi_{b_{n},\delta_{n}}(z^{n})=\left\{R^{d}\right\}$). Consequently, we have that:

\xi^{n}_{\delta_{n},b_{n}}\subset B_{b_{n},\delta_{n}}. \qquad (64)

Furthermore, the proof of Theorem 2 in Appendix III shows that ${\mathcal{P}}_{Z^{n}}(\xi^{n}_{\delta_{n},b_{n}})\geq 1-\delta_{n}$, where this bound is distribution-free (independent of $P\in\mathcal{P}_{0}$) because it follows from the distribution-free concentration inequality in Lemma 1 (see the argument in Appendix III). Therefore, from (63) and (64), we conclude that

\alpha(b_{n},\delta_{n},a_{n})\leq{\mathcal{P}}_{Z^{n}}(B^{c}_{b_{n},\delta_{n}})\leq 1-{\mathcal{P}}_{Z^{n}}(\xi^{n}_{\delta_{n},b_{n}})\leq\delta_{n}, \qquad (65)

where this upper bound is valid for any model $P\in\mathcal{P}_{0}$. ∎

G-D2 General Connection with Strong Consistency

There is a connection between strong consistency (in Definition 1) and the statistical significance (and also the power) of a test, which can be obtained from the proof of Theorem 1 in Appendix I. Using the notation of Theorem 1, a general test $(\phi_{n}(\cdot))_{n\geq 1}$ is consistent if, under ${\mathcal{H}}_{0}$, the binary process $(\phi_{n}(Z^{n}))_{n\geq 1}$ reaches $0$ eventually almost surely, which is equivalent to stating that:

\mathbb{P}(\limsup_{n}A_{n}^{c})=\mathbb{P}(\cap_{m\geq 1}\cup_{n\geq m}A_{n}^{c})=\lim_{m\rightarrow\infty}\mathbb{P}(\cup_{n\geq m}A_{n}^{c})=0 \qquad (66)

where $A_{n}\equiv\left\{z^{n}\in{\mathcal{X}}^{n}\times{\mathcal{Y}}^{n}:\phi_{n}(z^{n})=0\right\}$. Using this notation, the statistical significance of $\phi_{n}(\cdot)$ under ${\mathcal{H}}_{0}$ is $\alpha_{n}\equiv\mathbb{P}_{Z^{n}}(A_{n}^{c})$. Therefore, under ${\mathcal{H}}_{0}$, we have that:

\alpha_{n}=\mathbb{P}_{Z^{n}}(A_{n}^{c})\leq\mathbb{P}(\cup_{k\geq n}A_{k}^{c}). \qquad (67)

Therefore, if the test is strongly consistent, it follows from (66) and the bound in (67) that the statistical significance of the test tends to zero for any model where $X$ and $Y$ are independent ($P\in{\mathcal{P}}_{0}$).

Concerning the power, under ${\mathcal{H}}_{1}$ the binary process $(\phi_{n}(Z^{n}))_{n\geq 1}$ reaches $1$ eventually almost surely (under the assumption of consistency), which is equivalent to stating that:

\mathbb{P}(\liminf_{n}A_{n}^{c})=\mathbb{P}(\cup_{m\geq 1}\cap_{n\geq m}A_{n}^{c})=\lim_{m\rightarrow\infty}\mathbb{P}(\cap_{n\geq m}A_{n}^{c})=1. \qquad (68)

Therefore, using (68), the power of the test satisfies $\beta_{n}=\mathbb{P}_{Z^{n}}(A_{n}^{c})\geq\mathbb{P}(\cap_{k\geq n}A_{k}^{c})\rightarrow 1$ as $n$ tends to infinity, for any model where $X$ and $Y$ are not independent ($P\in{\mathcal{P}}_{1}$).

In conclusion, the definition of consistency used in this work (Definition 1) implies a vanishing error (as $n$ tends to infinity) under both hypotheses. However, this asymptotic property does not offer specific bounds for the errors when $n$ is finite, in contrast to the result shown in Eq.(65).