Multi-scale tests of independence powerful for detecting explicit or implicit functional relationships
Abstract
In this article, we consider the problem of testing the independence of two random variables. Our primary objective is to develop tests that are highly effective at detecting associations arising from an explicit or implicit functional relationship between the two variables. We adopt a multi-scale approach by analyzing neighborhoods of varying sizes within the dataset and aggregating the results. We introduce a general testing framework designed to enhance the power of existing independence tests toward this objective. Additionally, we propose a novel test method that is both powerful and computationally efficient. The performance of these tests is compared with that of existing methods on various simulated datasets.
1 Introduction
Tests of independence play a vital role in statistical analysis. They are used to determine relationships between variables, validate models, select relevant features from a large pool of features, and establish causal directions, among other applications. These tests are particularly important in fields such as economics, biology, social sciences, and clinical trials.
The mathematical formulation for testing independence is as follows. Consider two random variables $X$ and $Y$ with distribution functions $F_X$ and $F_Y$, respectively, and let $F_{X,Y}$ represent their joint distribution function. The objective is to test the null hypothesis $H_0$ against the alternative hypothesis $H_1$ based on independent and identically distributed (i.i.d.) observations $(X_1, Y_1), \ldots, (X_n, Y_n)$ drawn from $F_{X,Y}$, where

$$H_0: F_{X,Y}(x, y) = F_X(x)\,F_Y(y) \text{ for all } (x, y) \quad \text{vs.} \quad H_1: F_{X,Y}(x, y) \neq F_X(x)\,F_Y(y) \text{ for some } (x, y). \tag{1}$$
A substantial amount of research has been conducted on this topic, and it remains an active area of investigation. Pearson’s correlation is perhaps the most well-known classical measure that quantifies linear dependence. Spearman’s rank correlation (Spearman, 1904) and Kendall’s concordance-discordance statistic (Kendall, 1938) are used to measure monotonic associations. Hoeffding (1948) proposed a measure based on the distance between the joint distribution function and the product of the marginals. Székely et al. (2007) introduced distance correlation (dCor), based on energy distance, while Gretton et al. (2007) leveraged kernel methods to develop the Hilbert-Schmidt independence criterion (HSIC). Reshef et al. (2011) proposed the maximal information coefficient (MIC), an information-theoretic measure that reaches its maximum when one variable is a function of the other. Heller et al. (2013) introduced a powerful test that breaks down the problem into multiple contingency table independence tests and aggregates the results. Zhang (2019) proposed a test that is uniformly consistent with respect to total variation distance. Roy et al. (2020) and Roy (2020) used copula and checkerboard copula approaches to propose monotonic transformation-invariant tests of dependence for two or more variables. Chatterjee (2021) introduced an asymmetric measure of dependence with a simple form and an asymptotic normal distribution under independence, which attains its highest value only when a functional relationship exists. These are just a few of the many contributions in this field.
Different tests of independence have unique strengths and limitations. As demonstrated in Theorem 2.2 of Zhang (2019), no test of independence can achieve uniform consistency. This implies that at any fixed sample size, no test can maintain power above the nominal level uniformly across all alternatives. Therefore, selecting the most appropriate test for a given scenario is crucial. For instance, Pearson’s correlation coefficient is particularly powerful for detecting dependence between two jointly normally distributed random variables and, more generally, for identifying linear relationships. Similarly, Spearman’s rank correlation is particularly suited for detecting monotonic relationships.
In this article, we aim to develop tests of independence (as described in Equation 1) that are particularly powerful for detecting association between two random variables $X$ and $Y$ that adhere to the following parametric equations:

$$X = f(T) + \varepsilon_1, \qquad Y = g(T) + \varepsilon_2, \tag{2}$$

where $f$ and $g$ are continuous functions not constant on any interval, $T$ is a continuous random variable defined on an interval, and $\varepsilon_1$ and $\varepsilon_2$ are independent noise components that are each independent of $T$. As a result, our tests will also be powerful for detecting association between random variables $X$ and $Y$ that are functionally related to each other, i.e., when $Y$ is a function of $X$ or $X$ is a function of $Y$. In contrast to the method proposed by Chatterjee (2021), which is powerful for detecting dependence when $Y$ is a function of $X$, in our approach $X$ and $Y$ are treated symmetrically.
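To make the model concrete, here is a minimal Python sketch that draws a sample from Equation (2). The particular choices below (a uniform $T$, trigonometric $f$ and $g$, and Gaussian noise) are illustrative assumptions of ours, not specifications taken from the examples in this paper.

```python
import numpy as np

def sample_from_model(n, f, g, noise_sd=0.1, rng=np.random.default_rng(0)):
    """Draw n points from X = f(T) + eps1, Y = g(T) + eps2 (Equation 2),
    with T uniform on [0, 1] and independent Gaussian noise components.
    The uniform T and Gaussian noise are illustrative choices."""
    t = rng.uniform(0.0, 1.0, size=n)
    x = f(t) + noise_sd * rng.standard_normal(n)
    y = g(t) + noise_sd * rng.standard_normal(n)
    return x, y

# A circle-like example: f(t) = cos(2*pi*t), g(t) = sin(2*pi*t).
x, y = sample_from_model(200,
                         lambda t: np.cos(2 * np.pi * t),
                         lambda t: np.sin(2 * np.pi * t))
```

Setting `noise_sd = 0` recovers a purely implicit functional relationship between the two coordinates.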
The article is organized as follows. In Section 2, we introduce a general testing framework aimed at enhancing the power of existing tests of independence in scenarios described by Equation (2) and examine its computational complexity. In Section 3, we apply this framework to develop a novel test of independence and analyze its computational complexity. Section 4 presents a performance comparison of our methods on various simulated datasets. Finally, we conclude with a discussion and summary of our findings.
2 A General Testing Framework
Let us begin this section with a few examples of continuous dependent random variables $X$ and $Y$ that satisfy the parametric form in Equation 2. We consider three examples, referred to as Examples A, B, and C. Each example specifies a choice of $f$, $g$, the distribution of $T$, and the noise components; in Example A the components are mutually independent with specified distributions, and in Example C an additional random variable, independent of $T$ and the noise components, enters the construction.
The scatter plots for these examples are shown in Figure 1. A common feature of all these examples is that within certain neighborhoods around support points, the conditional correlation between $X$ and $Y$ is non-zero. Specifically, there exist a point $(x_0, y_0)$ in the support of the distribution and a $\delta > 0$ such that $\mathrm{Corr}(X, Y \mid (X, Y) \in N_\delta(x_0, y_0))$ is non-zero, where $N_\delta(x_0, y_0)$ represents the $\delta$-neighborhood of $(x_0, y_0)$, defined as $N_\delta(x_0, y_0) = \{(x, y) : |x - x_0| \le \delta,\ |y - y_0| \le \delta\}$. In Figure 1, red rectangles highlight instances of such neighborhoods with non-zero conditional correlation.
[Figure 1: Scatter plots of Examples A, B, and C.]
As the next example, we consider a distribution originally introduced in Zhang (2019), defined as the uniform distribution over a set of parallel and intersecting line segments indexed by a parameter; we refer to it here as the line-based distribution. Figure 2 illustrates this distribution for three values of the parameter. If $(X, Y)$ follows the line-based distribution, the marginals $X$ and $Y$ are dependent, but each follows a continuous uniform distribution. Moreover, $X$ and $Y$ can be shown to satisfy Equation 2. Although the dependence is not apparent at the global scale for this distribution, there exist neighborhoods around support points where the conditional correlation is extremely high.
[Figure 2: Scatter plots of the line-based distribution of Zhang (2019) for three parameter values.]
Red rectangles in Figure 2 highlight some such neighborhoods with high conditional correlation for each of the three parameter values.
These examples suggest that by calculating the test statistics for an existing test of independence across different neighborhoods and aggregating these values meaningfully, we can enhance its power, especially in scenarios described by Equation (2). Building on this idea, we introduce a multi-scale approach for existing tests of independence in this section.
Let $\mathcal{D}_n$ denote the i.i.d. observations $(X_1, Y_1), \ldots, (X_n, Y_n)$. We analyze the dataset using neighborhoods of varying sizes centered at each observation point, with the distances from the center point to the other observations serving as our guide for selecting neighborhood sizes. For a given sample from a bivariate distribution $F_{X,Y}$, we consider all axis-aligned rectangular neighborhoods that are centered at one observation and have another observation at a vertex, that is, sets of the form $R\big((X_i, Y_i), (X_k, Y_k)\big) = [X_i - |X_k - X_i|,\ X_i + |X_k - X_i|] \times [Y_i - |Y_k - Y_i|,\ Y_i + |Y_k - Y_i|]$ for $k \ne i$. Thus, we consider a total of $n(n-1)$ neighborhoods.
Let $T$ be a test statistic for testing independence between two univariate random variables, and let $T(\mathcal{S})$ denote the value of the test statistic computed on a sample $\mathcal{S}$. Here, we consider only those test statistics such that $T \to 0$ in probability as the sample size tends to infinity under independence, and whose rejection region is of the form $\{T > c\}$ for some constant $c$. For example, Pearson’s correlation coefficient itself cannot be considered as $T$, but its absolute value can be used. We define $T_{i,j}$ as the value of the test statistic when evaluated on the observations that fall within the neighborhood $N_{i,j}$ (introduced below). If $T$ is undefined on $N_{i,j}$ (which may happen when the neighborhood contains too few observations or when one of the coordinates is constant within it), we set $T_{i,j} = 0$. One naive way to aggregate all the findings from different neighborhoods is to sum the $T_{i,j}$’s, that is, to consider $\sum_{i=1}^{n} \sum_{j=1}^{n-1} T_{i,j}$ as our test statistic. The problem with this summation is that if the dependency information is limited to relatively small neighborhoods, summing all the values might drown this information in noise. To address this issue, we separate the $T_{i,j}$’s into distinct groups according to the proximity between $(X_i, Y_i)$ and the observation defining the neighborhood. A detailed description is as follows.
For each observation $(X_i, Y_i)$, we order the remaining observations according to their Euclidean distance from $(X_i, Y_i)$; random tie-breaking is used in case of ties while ordering. Let $(X_{(i,1)}, Y_{(i,1)}), \ldots, (X_{(i,n-1)}, Y_{(i,n-1)})$ be the observations ordered by their Euclidean distance from $(X_i, Y_i)$ in ascending order, and define $N_{i,j} = R\big((X_i, Y_i), (X_{(i,j)}, Y_{(i,j)})\big)$. Thus, $N_{i,j}$ is the neighborhood of $(X_i, Y_i)$ that has the $j$-th nearest neighbor of $(X_i, Y_i)$ at one of its vertices. It is easy to see that, for a fixed $i$, the length of the diagonal of $N_{i,j}$ increases as $j$ increases. We average the values of the test statistic keeping $j$ fixed and varying $i$ over $\{1, \ldots, n\}$. We denote this average by $\bar{T}_j$, that is, $\bar{T}_j = \frac{1}{n} \sum_{i=1}^{n} T_{i,j}$. To determine how extreme the value of $\bar{T}_j$ is, we need to know its distribution under independence of the marginals, which can be estimated by a resampling technique. We discuss this in detail next.
Given a sample $(X_1, Y_1), \ldots, (X_n, Y_n)$, a randomly permuted sample $(X_1, Y_{\pi(1)}), \ldots, (X_n, Y_{\pi(n)})$ can be generated, where $\pi$ is a random permutation of $\{1, \ldots, n\}$. We calculate the averages $\bar{T}_j$ in the same way on the permuted sample and denote their values by $\bar{T}^{*}_j$. For $M$ independent random permutations $\pi_1, \ldots, \pi_M$, we can compute $\bar{T}^{*(m)}_j$ for $m = 1, \ldots, M$ and $j = 1, \ldots, n-1$. According to the permutation test principle, the empirical distribution of $\bar{T}^{*(1)}_j, \ldots, \bar{T}^{*(M)}_j$ is an estimator of the distribution of $\bar{T}_j$ under independence. We can estimate the mean of $\bar{T}_j$ under independence by $\hat{\mu}_j = \frac{1}{M} \sum_{m=1}^{M} \bar{T}^{*(m)}_j$ and the standard deviation of $\bar{T}_j$ under independence by the sample standard deviation $\hat{\sigma}_j$ of $\bar{T}^{*(1)}_j, \ldots, \bar{T}^{*(M)}_j$. Next, we compute the Z-score of $\bar{T}_j$ with respect to its distribution under independence as $Z_j = (\bar{T}_j - \hat{\mu}_j)/\hat{\sigma}_j$; if $\hat{\sigma}_j = 0$, we define $Z_j = 0$. The Z-scores indicate how extreme the values of $\bar{T}_j$ are, and analyzing them for a sample can give valuable insight into the dependence structure. We demonstrate this point below using various bivariate distributions.
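The computations above can be summarized in a short, deliberately naive Python sketch (quadratic enumeration of neighborhoods rather than the faster bookkeeping analyzed later; the function names are our own). It computes the averages $\bar{T}_j$ and the permutation Z-scores $Z_j$ for a user-supplied statistic $T$.

```python
import numpy as np

def neighborhood_averages(x, y, stat):
    """T-bar_j = (1/n) * sum_i T_{i,j}, where T_{i,j} is `stat` evaluated on
    the observations inside the rectangle N_{i,j} centered at (x_i, y_i)
    with the j-th nearest neighbor at a vertex. Naive O(n^2) enumeration."""
    n = len(x)
    tbar = np.zeros(n - 1)
    for i in range(n):
        d = np.hypot(x - x[i], y - y[i])
        order = np.argsort(d)[1:]              # neighbors of i, nearest first
        for j, k in enumerate(order):
            a, b = abs(x[k] - x[i]), abs(y[k] - y[i])
            inside = (np.abs(x - x[i]) <= a) & (np.abs(y - y[i]) <= b)
            t = stat(x[inside], y[inside]) if inside.sum() >= 3 else 0.0
            tbar[j] += 0.0 if np.isnan(t) else t   # T_{i,j} = 0 if undefined
    return tbar / n

def zscores(x, y, stat, n_perm=100, rng=np.random.default_rng(0)):
    """Permutation Z-scores Z_j of the neighborhood averages."""
    tbar = neighborhood_averages(x, y, stat)
    perm = np.array([neighborhood_averages(x, rng.permutation(y), stat)
                     for _ in range(n_perm)])
    mu, sd = perm.mean(axis=0), perm.std(axis=0, ddof=1)
    sd_safe = np.where(sd > 0, sd, 1.0)
    return np.where(sd > 0, (tbar - mu) / sd_safe, 0.0)  # Z_j = 0 if sd = 0

def abs_corr(xs, ys):
    """|Pearson correlation|, NaN when degenerate (a constant coordinate)."""
    if xs.std() == 0.0 or ys.std() == 0.0:
        return np.nan
    return abs(np.corrcoef(xs, ys)[0, 1])
```

For example, `zscores(x, y, abs_corr)` reproduces, up to Monte Carlo error, the kind of Z-score profiles plotted below for the absolute-correlation statistic.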
In this illustration, we consider two popular choices for $T$: the absolute value of Pearson’s correlation and the distance correlation, denoted by $T^{\mathrm{cor}}$ and $T^{\mathrm{dCor}}$, respectively. We consider eight bivariate distributions; a description of each is provided in Table 1.
Distribution | Description
---|---
(a) Square | $X$ and $Y$ independent, uniform over a square
(b) Straight line | $Y$ a noiseless linear function of $X$
(c) Noisy straight line | linear relationship between $X$ and $Y$ with additive noise
(d) Sine | sinusoidal relationship between $X$ and $Y$
(e) Circle | $(X, Y)$ uniform on a circle
(f) Noisy parabola | parabolic relationship between $X$ and $Y$ with additive noise
(g) Line-based | As described earlier in this section
(h) BVN, rho=0.5 | $(X, Y)$ bivariate normal with zero means, unit variances, and correlation coefficient 0.5
The scatter plots of these distributions are presented in Figure 3. It is easy to see that, except for the ‘Square’ distribution, $X$ and $Y$ are dependent in all the other distributions.
From each distribution, we generated $n = 50$ i.i.d. observations and calculated the Z-scores $Z_1, \ldots, Z_{n-1}$. By repeating this step independently 1000 times, we then calculated the average values of the Z-scores. We plotted these average values for each distribution in Figure 4. Under independence, it is evident that the expected value of each $Z_j$ is 0. Therefore, if the average Z-scores deviate from 0, it is indicative of dependence.
[Figure 3: Scatter plots of the eight distributions of Table 1.]
[Figure 4: Average Z-scores across neighborhood sizes for each distribution, for $T^{\mathrm{cor}}$ and $T^{\mathrm{dCor}}$.]
From Figure 4, we observe that the average Z-scores are close to 0 for the ‘Square’ distribution, as $X$ and $Y$ are independent. For the ‘Straight line’ and ‘Noisy straight line’ distributions, the average Z-scores are higher in larger neighborhoods for both $T^{\mathrm{cor}}$ and $T^{\mathrm{dCor}}$. However, in the ‘Noisy straight line’, the dependence information is less apparent in smaller neighborhoods than in the ‘Straight line’. For the ‘Sine’ distribution, the dependence information is clearly noticeable in smaller neighborhoods. An interesting phenomenon occurs with the ‘Circle’ distribution, where most neighborhoods, including the largest ones, contain arcs, allowing both statistics to detect dependence effectively in most neighborhoods. In the ‘Noisy parabola’, the dependence is most prominent in mid-sized neighborhoods. For the more complex line-based distribution, these statistics detect the dependence patterns in smaller and mid-sized neighborhoods, but less so in larger ones. Finally, in ‘BVN, rho=0.5’, the dependence information is clearly evident only in larger neighborhoods.
It follows clearly from the above illustration that test statistics such as the absolute value of the correlation and the distance correlation sometimes detect dependence better in smaller neighborhoods, sometimes in mid-sized ones, and sometimes in larger ones. Therefore, it makes sense to aggregate the information from all the Z-scores $Z_1, \ldots, Z_{n-1}$. We propose an aggregation method in the following paragraph.
First, we observe that under independence, the expected value of $Z_j$ is 0 for $j = 1, \ldots, n-1$. Under dependence, if the $T_{i,j}$’s are stochastically larger, $\bar{T}_j$ will also be stochastically larger; in that case, the expected values of the Z-scores will be positive. Let $\zeta_j$ denote the expected value of $Z_j$, that is, $\zeta_j = E(Z_j)$. Thus, testing $H_0$ against $H_1$ can be done by testing $\zeta_j = 0$ for all $j$ vs. $\zeta_j > 0$ for some $j$. To test this, we aggregate the Z-scores $Z_1, \ldots, Z_{n-1}$ into a single test statistic $S$; clearly, a high value of $S$ presents evidence against the null hypothesis. To determine the distribution of $S$ under $H_0$, we again use the resampling approach, utilizing the randomly permuted samples already at our disposal. Similarly to how we computed the Z-scores for the observed sample, we compute Z-scores $Z^{(m)}_1, \ldots, Z^{(m)}_{n-1}$ for the $m$-th permuted sample, $m = 1, \ldots, M$, and aggregate them into $S^{(m)}$. Finally, we determine the p-value of $S$ as $\frac{1}{M} \sum_{m=1}^{M} \mathbf{1}\{S^{(m)} \ge S\}$, and we reject $H_0$ if this p-value turns out to be less than the level of significance.
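A sketch of the resulting procedure, building on the functions from the previous sketch, is given below. Since the exact combination rule is not reproduced here, the aggregation step is instantiated as a plain sum of Z-scores; treat it as one plausible choice rather than the paper's definitive statistic.

```python
import numpy as np

def multiscale_test(x, y, stat, n_perm=100, rng=np.random.default_rng(0)):
    """Permutation p-value for an aggregated multi-scale statistic S.
    NOTE: aggregating by summing Z-scores is an illustrative assumption.
    Each permuted sample is standardized against the same permutation
    moments, a simplification of the leave-one-out scheme."""
    tbar = neighborhood_averages(x, y, stat)       # from the earlier sketch
    perm = np.array([neighborhood_averages(x, rng.permutation(y), stat)
                     for _ in range(n_perm)])
    mu = perm.mean(axis=0)
    sd = perm.std(axis=0, ddof=1)
    sd = np.where(sd > 0, sd, 1.0)

    def aggregate(tb):                             # S = sum of Z-scores
        return ((tb - mu) / sd).sum()

    s_obs = aggregate(tbar)
    s_perm = np.array([aggregate(p) for p in perm])
    return s_obs, np.mean(s_perm >= s_obs)         # statistic and p-value
```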
Assume that computing $T$ on a sample of size $n$ requires $O(g(n))$ operations. In that case, computing the $T_{i,j}$ collectively for $1 \le i \le n$ and $1 \le j \le n-1$ requires $O(n^2 g(n))$ operations. For a fixed $i$, the merge sort algorithm allows us to compute the distance-based ordering of the remaining observations in $O(n \log n)$ operations. Therefore, determining all the neighborhoods $N_{i,j}$ collectively takes $O(n^2 \log n)$ operations (recall that there are $n(n-1)$ of them). As a result, the computational complexity of computing $\bar{T}_1, \ldots, \bar{T}_{n-1}$ and the Z-scores $Z_1, \ldots, Z_{n-1}$ together is also $O(n^2 g(n) + n^2 \log n)$, assuming that the number $M$ of randomly permuted samples is constant with respect to $n$. Thus we conclude that $O(n^2 g(n) + n^2 \log n)$ operations are required to compute the test statistic $S$.
Since the correlation coefficient can be calculated in linear time, $g(n) = O(n)$ when $T = T^{\mathrm{cor}}$; thus, the complexity of calculating $S$ in this case is $O(n^3)$. On the other hand, the distance correlation between two univariate random variables can be computed in $O(n \log n)$ operations (see Chaudhuri and Hu, 2019). Therefore, the complexity of calculating $S$ when $T = T^{\mathrm{dCor}}$ is $O(n^3 \log n)$.
3 A Special Test Method
In this section, we introduce a testing method that closely follows the framework proposed in the previous section, with a few modifications. In this setup, $X$ and $Y$ are assumed to be continuous random variables. The key distinction of this approach lies in the manner in which the values $T_{i,j}$ are computed.
Given a sample $(X_1, Y_1), \ldots, (X_n, Y_n)$, we begin by fixing two observations $(X_i, Y_i)$ and $(X_k, Y_k)$, where $k \ne i$. Similar to the previous method, we define the neighborhood as the rectangle centered at $(X_i, Y_i)$ with $(X_k, Y_k)$ positioned at a corner. Within this neighborhood, we partition the space into four quadrants, classifying observations according to whether the $x$-coordinate is less than or greater than $X_i$ and whether the $y$-coordinate is less than or greater than $Y_i$. Subsequently, a $2 \times 2$ contingency table is constructed to count the number of observations in each of these quadrants (see Table 2).
Let $O_{11}$, $O_{12}$, $O_{21}$, and $O_{22}$ represent the frequencies in this contingency table, where $O_{11}$ counts the observations in the neighborhood with $x$-coordinate less than $X_i$ and $y$-coordinate less than $Y_i$, $O_{12}$ those with $x$-coordinate less than $X_i$ and $y$-coordinate greater than $Y_i$, $O_{21}$ those with $x$-coordinate greater than $X_i$ and $y$-coordinate less than $Y_i$, and $O_{22}$ those with both coordinates greater.
$T_{i,j}$ is defined as the absolute value of the phi coefficient calculated from this contingency table:

$$T_{i,j} = \frac{|O_{11} O_{22} - O_{12} O_{21}|}{\sqrt{(O_{11} + O_{12})(O_{21} + O_{22})(O_{11} + O_{21})(O_{12} + O_{22})}}.$$

If the denominator is 0 (that is, if some row or column total of the table vanishes), we set $T_{i,j} = 0$.
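For concreteness, a small Python sketch of this computation is given below; the assignment of quadrants to the cells $O_{11}, \ldots, O_{22}$ follows the convention stated above.

```python
import numpy as np

def quadrant_counts(x, y, i, a, b):
    """Counts in the four quadrants of the rectangle with half-widths (a, b)
    centered at (x[i], y[i]), split at the center point."""
    inside = (np.abs(x - x[i]) <= a) & (np.abs(y - y[i]) <= b)
    inside[i] = False                        # exclude the center point itself
    left, below = x < x[i], y < y[i]
    o11 = int(np.sum(inside & left & below))
    o12 = int(np.sum(inside & left & ~below))
    o21 = int(np.sum(inside & ~left & below))
    o22 = int(np.sum(inside & ~left & ~below))
    return o11, o12, o21, o22

def abs_phi(o11, o12, o21, o22):
    """Absolute phi coefficient of a 2x2 table; 0 when a marginal vanishes."""
    denom = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    if denom == 0:
        return 0.0
    return abs(o11 * o22 - o12 * o21) / np.sqrt(denom)
```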
Next, we employ the same framework described in the previous section to determine a p-value. For notational convenience, the resulting test statistic will be referred to as $S_{\phi}$.
Following the setup in Figure 4, we plotted the average Z-scores for the $\phi$-based statistic across various neighborhood sizes, as shown in Figure 5, to gain a deeper understanding of its underlying mechanism. These values are represented by red dots. For comparison, the corresponding average Z-scores obtained with the distance correlation as $T$ are plotted alongside, represented by the blue dots.
[Figure 5: Average Z-scores across neighborhood sizes for the $\phi$-based statistic (red dots) and the comparison statistic (blue dots), for the eight distributions of Table 1.]
As shown in Figure 5(a), the Z-scores for the $\phi$-based statistic exhibit higher variance under independence than those of the comparison statistic. Figures 5(b) through 5(g) demonstrate that the $\phi$-based statistic consistently yields higher Z-scores for the ‘Straight line’, ‘Noisy straight line’, ‘Sine’, ‘Circle’, ‘Noisy parabola’, and line-based distributions. Additionally, its average Z-score plot appears more wiggly. In the ‘BVN, rho=0.5’ distribution, the average Z-scores of the two statistics are comparable.
For a fixed pair $(i, j)$, calculating $T_{i,j}$ requires counting the values $O_{11}$, $O_{12}$, $O_{21}$, and $O_{22}$. A straightforward approach to obtain these counts is to check each observation individually, which takes $O(n)$ operations per neighborhood; as established in the previous section, this naive approach requires $O(n^3)$ operations to compute the test statistic. However, by applying the algorithm described in Lemma 1, which computes these counts for all $j$ simultaneously in $O(n \log n)$ operations for a fixed $i$, the computational complexity can be significantly reduced. When this algorithm is implemented, calculating all the $T_{i,j}$ collectively requires $O(n^2 \log n)$ operations. As a result, the complexity of computing the test statistic $S_{\phi}$ is reduced to $O(n^2 \log n)$.
4 Performance on simulated datasets
In this section, we compare the performance of our proposed test methods on various simulated datasets with that of existing tests in the literature. From the proposed general testing framework, we select two well-known test statistics, the absolute value of Pearson’s correlation and the distance correlation, as $T$; we denote the resulting multi-scale test statistics by $S_{\mathrm{cor}}$ and $S_{\mathrm{dCor}}$, respectively. We denote the test statistic of our proposed special test method by $S_{\phi}$. As pointed out in previous sections, the computational complexities of $S_{\mathrm{cor}}$, $S_{\mathrm{dCor}}$, and $S_{\phi}$ are $O(n^3)$, $O(n^3 \log n)$, and $O(n^2 \log n)$, respectively.
From existing tests of independence, we selected the following popular tests: dCor (Székely et al., 2007), HSIC (Gretton et al., 2007), HHG (Heller et al., 2013), MIC (Reshef et al., 2011), Chatterjee’s correlation (xicor) (Chatterjee, 2021), and BET (Zhang, 2019). For performing the dCor, HSIC, HHG, and xicor tests, we used the R packages energy (Rizzo and Szekely, 2022), dHSIC (Pfister and Peters, 2019), HHG (Kaufman et al., 2019), and XICOR (Chatterjee and Holmes, 2023), respectively. We calculated the MIC statistic using the ‘mine_stat’ function from the R package minerva (Albanese et al., 2012) and subsequently performed a permutation test. For executing the BET test, we used the R code provided in the supplementary material of Zhang (2019). All tests were carried out at the same fixed nominal level. Except for BET, all tests were performed using the permutation principle, with the number of permutations set to 1000. The empirical power of a test is determined by calculating the proportion of times it rejects the null hypothesis over 1000 independently generated samples of a specific size from a given distribution.
First, we considered eight jointly continuous distributions, namely ‘Doppler’, the line-based distribution, ‘Lissajous A’, ‘Lissajous B’, ‘Rose curve’, ‘Spiral’, ‘Tilted square’, and ‘Five clouds’. Their descriptions are provided in Table 3. Scatter plots for each of these distributions are presented in Figure 8 in Appendix B.
Distribution | Description
---|---
(a) Doppler | $Y$ an oscillating (Doppler-type) function of $X$
(b) Line-based | As described in Section 2
(c) Lissajous A | points on a Lissajous curve
(d) Lissajous B | points on a second Lissajous curve
(e) Rose curve | points on a rose curve
(f) Spiral | points on a spiral
(g) Tilted square | uniform on a tilted square; $X$ and $Y$ dependent but uncorrelated
(h) Five clouds | uniform on five separated clouds; $X$ and $Y$ dependent but uncorrelated
It can be observed from the table that in the ‘Doppler’ distribution, $Y$ is directly a function of $X$. In ‘Lissajous A’, ‘Lissajous B’, ‘Rose curve’, and ‘Spiral’, $X$ and $Y$ are related to each other through an implicit function. ‘Tilted square’ and ‘Five clouds’ are two distributions where $X$ and $Y$ are dependent, but their correlation is 0.
The empirical powers of all test methods on these eight distributions are presented in Figure 6.
[Figure 6: Empirical powers of the tests on the eight distributions of Table 3.]
It can be seen from this figure that one of our proposed tests performed best in both ‘Doppler’ and the line-based distribution while performing well in all the other distributions; a second performed fairly well in all distributions except ‘Rose curve’ and ‘Tilted square’; and the third performed well in all cases, achieving the best power in ‘Lissajous A’, ‘Lissajous B’, ‘Rose curve’, ‘Spiral’, and ‘Five clouds’. HHG yielded satisfactory power overall except for ‘Lissajous B’. The power of HSIC is satisfactory only in the ‘Doppler’ and ‘Tilted square’ distributions. dCor, MIC, and xicor did not perform well. BET performed somewhat well for larger sample sizes in ‘Lissajous B’, ‘Rose curve’, and ‘Spiral’.
Next, we considered four families of distributions to assess the performance of our methods under different levels of noise. In these examples, $X = f(T) + \lambda \varepsilon_1$ and $Y = g(T) + \lambda \varepsilon_2$, where $f$ and $g$ are real functions, $T$ is a random variable, $\lambda$ is a non-negative constant, and $\varepsilon_1$ and $\varepsilon_2$ are i.i.d. normal random variables that are also independent of $T$. As $\lambda$ increases, the noise level increases. In particular, we considered four families named ‘Parabola($\lambda$)’, ‘Circle($\lambda$)’, ‘Sine($\lambda$)’, and ‘Lemniscate($\lambda$)’; a description of each can be found in Table 4.
Distribution | Description
---|---
(a) Parabola($\lambda$) | points on a parabola, with noise level $\lambda$
(b) Circle($\lambda$) | points on a circle, with noise level $\lambda$
(c) Sine($\lambda$) | points on a sine curve, with noise level $\lambda$
(d) Lemniscate($\lambda$) | points on a lemniscate, with noise level $\lambda$
We considered three noise levels by varying $\lambda$. Scatter plots of the twelve ($4 \times 3$) resulting distributions are presented in Figure 9 in Appendix B.
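As an illustration of this noise model, the following sketch generates a sample from a ‘Circle($\lambda$)’-type distribution; the specific $f$, $g$, uniform $T$, and the noise-level values below are our own stand-ins, not the exact entries of Table 4.

```python
import numpy as np

def circle_lambda(n, lam, rng=np.random.default_rng(1)):
    """X = f(T) + lam * eps1, Y = g(T) + lam * eps2 with f = cos, g = sin
    and T uniform on [0, 2*pi]; lam = 0 gives the noise-free circle."""
    t = rng.uniform(0.0, 2.0 * np.pi, size=n)
    x = np.cos(t) + lam * rng.standard_normal(n)
    y = np.sin(t) + lam * rng.standard_normal(n)
    return x, y

# Three noise levels of increasing severity (values chosen for illustration).
samples = {lam: circle_lambda(200, lam) for lam in (0.0, 0.1, 0.3)}
```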
The empirical powers of each of these tests are presented in Figure 7.
[Figure 7: Empirical powers of the tests on the noisy distributions of Table 4 at three noise levels.]
It can be observed from this figure that one of our proposed tests performs best or next to best in the absence of noise, and its power remains quite good even in the presence of noise. As the noise level increases, another of our proposed tests takes the lead. The overall performance of the third is competitive except in the ‘Circle($\lambda$)’ case. HHG performed well but lagged behind our proposed tests most of the time. HSIC performed well in the ‘Parabola($\lambda$)’ and ‘Circle($\lambda$)’ distributions but did not achieve satisfactory power in the others. dCor had somewhat competitive power in ‘Parabola($\lambda$)’ but not in the other distributions. MIC had low power overall, except at higher sample sizes in the ‘Sine($\lambda$)’ distributions. xicor also performed poorly except for ‘Sine($\lambda$)’, where it performed the best; this is not unexpected, as xicor is mainly designed to detect whether $Y$ is a function of $X$. BET had the lowest power among all tests in most distributions.
5 Discussion and conclusion
We proposed a general framework for testing independence that applies existing independence tests to different neighborhoods of the dataset and combines the results in a meaningful way. This approach has been shown to enhance performance when the variables are explicitly or implicitly functionally dependent. Additionally, we introduced a novel test of independence that leverages a similar framework. It is important to note that multiple approaches can be taken toward selecting neighborhoods, and both the power and the complexity of the resulting method can vary based on this choice; an optimal selection of neighborhoods is therefore a potential direction for future research. Our proposed method also assigns equal weight to each neighborhood, from nearest to farthest; the potential benefits of using varying weights for different neighborhoods could likewise be explored.
Appendix A
Lemma 1.
For the proposed special test method and a fixed $i$, the computation of $T_{i,j}$ for all $j \in \{1, \ldots, n-1\}$ can be done collectively in $O(n \log n)$ operations.
Proof.
Let us fix $i$ such that $1 \le i \le n$. At first, we will prove that the sequence of counts $O_{11}$ corresponding to the $n-1$ neighborhoods centered at $(X_i, Y_i)$ can be computed in $O(n \log n)$ operations.
For $k \ne i$, the quantities $a_k = |X_k - X_i|$ and $b_k = |Y_k - Y_i|$ can be computed in $O(n)$ operations. As $(X_1, Y_1), \ldots, (X_n, Y_n)$ are i.i.d. observations from a continuous distribution, there are no ties in the $x$-coordinates or in the $y$-coordinates, and hence $a_k$ and $b_k$ are positive for all $k \ne i$, without any ties.
Using the merge sort algorithm, a permutation $\sigma$ of $\{1, \ldots, n\} \setminus \{i\}$ can be computed in $O(n \log n)$ operations such that $a_{\sigma(1)} < a_{\sigma(2)} < \cdots < a_{\sigma(n-1)}$.
Next, by iterating through the indices $\sigma(1), \sigma(2), \ldots, \sigma(n-1)$ in ascending order of the $a$-values, we can store in an array $A$ those indices $k$ that satisfy $X_k < X_i$ and $Y_k < Y_i$. As a result, $A$ contains exactly the indices $k$ with $X_k < X_i$ and $Y_k < Y_i$, arranged in ascending order of $a_k$. The remaining indices, namely those with $X_k > X_i$ or $Y_k > Y_i$, are stored in another array $B$, likewise in ascending order of $a_k$. This iterative checking and storing process requires only $O(n)$ operations. Note that $A$ and $B$ have no common elements, and together they contain all the indices in $\{1, \ldots, n\} \setminus \{i\}$.
Given an array of real numbers $x_1, \ldots, x_m$, the ‘surpasser count’ is defined as the array of integers $s_1, \ldots, s_m$ with $s_p = \#\{q > p : x_q > x_p\}$; using the merge sort technique, it can be computed in $O(m \log m)$ operations (see Bird, 2010, Chapter 2). Instead of using the surpasser count directly, we use a ‘trail count’: for an array $x_1, \ldots, x_m$, the trail count is the array $t_1, \ldots, t_m$ with $t_p = \#\{q \le p : x_q \le x_p\}$. Essentially, the trail count is a surpasser count applied to the reversed array (with the inequalities reversed and each element counting itself), and it can therefore be computed with the same complexity.
We apply the trail count to the array $(b_{\sigma(1)}, \ldots, b_{\sigma(n-1)})$, resulting in an array, say, $(c_1, \ldots, c_{n-1})$. It is easy to verify that $c_p$ represents the number of observations, excluding $(X_i, Y_i)$, that lie within the neighborhood whose defining corner is $(X_{\sigma(p)}, Y_{\sigma(p)})$, for $p = 1, \ldots, n-1$: these are precisely the observations $\sigma(q)$ with $q \le p$ and $b_{\sigma(q)} \le b_{\sigma(p)}$. Next, we apply the trail count to the subarray of $b$-values corresponding to the indices stored in $A$ (in their stored order), resulting in an array, say, $(u_1, u_2, \ldots)$. It can be verified that $u_r$ is the number of observations that fall within $N \cap \{(x, y) : x < X_i,\ y < Y_i\}$, where $N$ is the neighborhood whose defining corner is the $r$-th element of $A$; thus, $O_{11} = u_r$ for every neighborhood whose defining corner lies in $A$. Similarly, applying the trail count to the subarray of $b$-values corresponding to $B$ results in an array, say, $(v_1, v_2, \ldots)$, where $v_r$ represents the number of points in $N \cap \{(x, y) : x > X_i \text{ or } y > Y_i\}$ for the neighborhood whose defining corner is the $r$-th element of $B$. Therefore, $O_{11} = c_p - v_r$ when the $r$-th element of $B$ occupies position $p$ in the overall ordering. Hence, all the values of $O_{11}$ for the $n-1$ neighborhoods centered at $(X_i, Y_i)$ can be computed together in $O(n \log n)$ operations.
Similar steps can be taken to show that, for a fixed $i$, the counts $O_{12}$, $O_{21}$, and $O_{22}$ for all $n-1$ neighborhoods can be computed collectively in $O(n \log n)$ operations. Since $T_{i,j}$ is a function of $O_{11}$, $O_{12}$, $O_{21}$, and $O_{22}$, it directly follows that computing $T_{i,j}$ collectively for $j = 1, \ldots, n-1$ also requires $O(n \log n)$ operations. ∎
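For completeness, here is a minimal Python sketch of the trail count under the definition $t_p = \#\{q \le p : x_q \le x_p\}$ adopted above; it follows the standard merge-sort counting idea (cf. Bird, 2010), with names of our own choosing.

```python
def trail_count(a):
    """Trail counts t[p] = #{q <= p : a[q] <= a[p]}, in O(n log n) via a
    merge-sort sweep (cf. the surpasser-count algorithm in Bird, 2010).
    Assumes distinct values, which holds almost surely for continuous data."""
    t = [1] * len(a)                     # each position counts itself

    def sort(items):                     # items: (value, original position)
        if len(items) <= 1:
            return items
        mid = len(items) // 2
        left, right = sort(items[:mid]), sort(items[mid:])
        merged, li, ri = [], 0, 0
        while li < len(left) and ri < len(right):
            if left[li][0] < right[ri][0]:
                merged.append(left[li]); li += 1
            else:
                # the li left-half elements merged so far sit at earlier
                # positions and carry smaller values than right[ri]
                t[right[ri][1]] += li
                merged.append(right[ri]); ri += 1
        for item in right[ri:]:          # left half exhausted first
            t[item[1]] += li
            merged.append(item)
        merged.extend(left[li:])
        return merged

    sort(list(zip(a, range(len(a)))))
    return t

# Example: trail_count([0.3, 0.1, 0.4, 0.2]) == [1, 1, 3, 2]
```

Applying `trail_count` to $(b_{\sigma(1)}, \ldots, b_{\sigma(n-1)})$ and to its subarrays indexed by $A$ and $B$ yields the quantities $c$, $u$, and $v$ of the proof in a single $O(n \log n)$ pass each.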
Appendix B
[Figure 8: Scatter plots of the eight distributions of Table 3.]
[Figure 9: Scatter plots of the twelve noisy distributions of Table 4.]
References
- Albanese et al. [2012] Davide Albanese, Michele Filosi, Roberto Visintainer, Samantha Riccadonna, Giuseppe Jurman, and Cesare Furlanello. Minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers. Bioinformatics, page bts707, 2012.
- Bird [2010] Richard Bird. Pearls of functional algorithm design. Cambridge University Press, 2010.
- Chatterjee [2021] Sourav Chatterjee. A new coefficient of correlation. Journal of the American Statistical Association, 116(536):2009–2022, 2021.
- Chatterjee and Holmes [2023] Sourav Chatterjee and Susan Holmes. XICOR: Robust and generalized correlation coefficients, 2023. URL https://CRAN.R-project.org/package=XICOR. https://github.com/spholmes/XICOR.
- Chaudhuri and Hu [2019] Arin Chaudhuri and Wenhao Hu. A fast algorithm for computing distance correlation. Computational Statistics & Data Analysis, 135:15–24, 2019.
- Gretton et al. [2007] Arthur Gretton, Kenji Fukumizu, Choon Teo, Le Song, Bernhard Schölkopf, and Alex Smola. A kernel statistical test of independence. Advances in neural information processing systems, 20, 2007.
- Heller et al. [2013] Ruth Heller, Yair Heller, and Malka Gorfine. A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2):503–510, 2013.
- Hoeffding [1948] Wassily Hoeffding. A non-parametric test of independence. The Annals of Mathematical Statistics, 19(4):546–557, 1948.
- Kaufman et al. [2019] Barak Brill & Shachar Kaufman, based in part on an earlier implementation by Ruth Heller, and Yair Heller. HHG: Heller-Heller-Gorfine Tests of Independence and Equality of Distributions, 2019. URL https://CRAN.R-project.org/package=HHG. R package version 2.3.
- Kendall [1938] Maurice G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
- Pfister and Peters [2019] Niklas Pfister and Jonas Peters. dHSIC: Independence Testing via Hilbert Schmidt Independence Criterion, 2019. URL https://CRAN.R-project.org/package=dHSIC. R package version 2.1.
- Reshef et al. [2011] David N Reshef, Yakir A Reshef, Hilary K Finucane, Sharon R Grossman, Gilean McVean, Peter J Turnbaugh, Eric S Lander, Michael Mitzenmacher, and Pardis C Sabeti. Detecting novel associations in large data sets. Science, 334(6062):1518–1524, 2011.
- Rizzo and Szekely [2022] Maria Rizzo and Gabor Szekely. energy: E-Statistics: Multivariate Inference via the Energy of Data, 2022. URL https://CRAN.R-project.org/package=energy. R package version 1.7-11.
- Roy [2020] Angshuman Roy. Some copula-based tests of independence among several random variables having arbitrary probability distributions. Stat, 9(1):e263, 2020.
- Roy et al. [2020] Angshuman Roy, Anil K Ghosh, Alok Goswami, and CA Murthy. Some new copula based distribution-free tests of independence among several random variables. Sankhya A, pages 1–41, 2020.
- Spearman [1904] Charles Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101, 1904.
- Székely et al. [2007] Gábor J Székely, Maria L Rizzo, and Nail K Bakirov. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 2007.
- Zhang [2019] Kai Zhang. BET on independence. Journal of the American Statistical Association, 2019.