A survey of some recent developments in measures of association

Sourav Chatterjee Mailing address: Department of Statistics, Stanford University, 390 Jane Stanford Way, Stanford, CA 94305, USA. Email: [email protected]. The author was partially supported by NSF grants DMS-2113242 and DMS-2153654. The author thanks Nabarun Deb, Fang Han and Bodhisattva Sen for helpful comments on a preliminary draft. Stanford University

Abstract

This paper surveys some recent developments in measures of association related to a new coefficient of correlation introduced by the author. A straightforward extension of this coefficient to standard Borel spaces (which includes all Polish spaces), overlooked in the literature so far, is proposed at the end of the survey.

Key words and phrases. Correlation, dependence, measures of association, standard Borel space
2020 Mathematics Subject Classification. 62H20, 62H15.

In honor of friend and teacher Prof. Rajeeva L. Karandikar on the occasion of his 65^th birthday.

1 Introduction

Measuring associations between variables is one of the central goals of data analysis. Arguably, the three most popular classical measures of association are Pearson’s correlation coefficient, Spearman’s $\rho$ , and Kendall’s $\tau$ . Although these coefficients are powerful for detecting monotonic associations, a practical problem is that they are not effective for detecting associations that are not monotonic. There have been many proposals to address this deficiency of the classical coefficients [66], such as the maximal correlation coefficient [59, 45, 92, 17], various coefficients based on joint cumulative distribution functions and ranks [43, 34, 54, 8, 83, 123, 124, 125, 29, 89, 61, 13, 94, 95, 30, 122], kernel-based methods [86, 48, 50, 99, 130], information theoretic coefficients [71, 74, 93], coefficients based on copulas [32, 76, 108, 98, 127], and coefficients based on pairwise distances [117, 115, 58, 41, 77].

This survey is about some recent developments in this area, beginning with a new coefficient of correlation proposed by the author in the paper [22]. This coefficient has the following desirable features: (a) It has a simple expression, like the classical coefficients. (b) It is a consistent estimator of a measure of dependence which is $0$ if and only if the variables are independent and $1$ if and only if one is a measurable function of the other. (c) It has a simple asymptotic theory under the hypothesis of independence.

The new coefficient is defined as follows. Let $(X,Y)$ be a pair of random variables defined on the same probability space, where $Y$ is not a constant. Let $(X_{1},Y_{1}),\ldots,(X_{n},Y_{n})$ be i.i.d. pairs of random variables with the same law as $(X,Y)$ , where $n\geq 2$ . Rearrange the data as $(X_{(1)},Y_{(1)}),\ldots,(X_{(n)},Y_{(n)})$ such that $X_{(1)}\leq\cdots\leq X_{(n)}$ . (Note that $Y_{(i)}$ is just the $Y$ -value ‘paired with’ $X_{(i)}$ in the original data, and not the $i^{th}$ order statistic of the $Y$ ’s.) If there are ties among the $X_{i}$ ’s, then choose an increasing rearrangement as above by breaking ties uniformly at random. Let $r_{i}$ be the number of $j$ such that $Y_{(j)}\leq Y_{(i)}$ , and let $l_{i}$ to be the number of $j$ such that $Y_{(j)}\geq Y_{(i)}$ . Then define

\displaystyle\xi_{n}(X,Y):=1-\frac{n\sum_{i=1}^{n-1}|r_{i+1}-r_{i}|}{2\sum_{i=1}^{n}l_{i}(n-l_{i})}.

(1.1)

This is the correlation coefficient proposed in [22]. When there are no ties among the $Y_{i}$ ’s, $l_{1},\ldots,l_{n}$ is just a permutation of $1,\ldots,n$ , and so the denominator in the above expression is just $n(n^{2}-1)/3$ . The following theorem is the main consistency result for $\xi_{n}$ .

Theorem 1.1 ([22]).

If $Y$ is not almost surely a constant, then as $n\to\infty$ , $\xi_{n}(X,Y)$ converges almost surely to the deterministic limit

\xi(X,Y):=\frac{\int\mathrm{Var}(\mathbb{E}(1_{\{Y\geq t\}}|X))d\mu(t)}{\int\mathrm{Var}(1_{\{Y\geq t\}})d\mu(t)},

(1.2)

where $\mu$ is the law of $Y$ . This limit belongs to the interval $[0,1]$ . It is $0$ if and only if $X$ and $Y$ are independent, and it is $1$ if and only if there is a measurable function $f:\mathbb{R}\to\mathbb{R}$ such that $Y=f(X)$ almost surely.

The limiting value $\xi(X,Y)$ appeared in the literature prior to [22], in a paper of Dette, Siburg, and Stoimenov [32] (see also [43, 70]). The paper [32] gave a copula-based estimator for $\xi(X,Y)$ when $X$ and $Y$ are continuous, that is consistent under smoothness assumptions on the copula and is computable in time $n^{5/3}$ for an optimal choice of tuning parameters.

Note that neither $\xi_{n}$ nor the limiting value $\xi$ is symmetric in $X$ and $Y$ . A symmetrized version of $\xi_{n}$ can be constructed by taking the maximum of $\xi_{n}(X,Y)$ and $\xi_{n}(Y,X)$ .

2 Why does it work?

The complete proof of Theorem 1.1 is available in the supplementary materials of [22], and also in the arXiv version of the paper. It is not too hard to see why $\xi(X,Y)$ has the properties listed in Theorem 1.1, although filling in the details takes some work. The proof of convergence of $\xi_{n}(X,Y)$ to $\xi(X,Y)$ is less obvious. The following is a very rough sketch of the proof, reproduced from a similar discussion in [22].

For simplicity, consider only the case of continuous $X$ and $Y$ , where the denominator in (1.2) is simply $n(n^{2}-1)/3$ . First, note that by the Glivenko–Cantelli theorem, $r_{i}/n\approx F(Y_{(i)})$ , where $F$ is the cumulative distribution function of $Y$ . Thus,

\displaystyle\xi_{n}(X,Y)\approx 1-\frac{3}{n}\sum_{i=1}^{n}|F(Y_{i})-F(Y_{N(i)})|,

(2.1)

where $N(i)$ is the unique index $j$ such that $X_{j}$ is immediately to the right of $X_{i}$ if we arrange the $X$ ’s in increasing order. If $X_{i}$ is the rightmost value, define $N(i)$ arbitrarily; it does not matter since the contribution of a single term in the above sum is of order $1/n$ . Next, observe that for any $x,y\in\mathbb{R}$ ,

\displaystyle|F(x)-F(y)|

\displaystyle=\int(1_{\{t\leq x\}}-1_{\{t\leq y\}})^{2}d\mu(t),

(2.2)

where $\mu$ is the law of $Y$ . This is true because the integrand is $1$ between $x$ and $y$ and $0$ outside.

Suppose that we condition on $X_{1},\ldots,X_{n}$ . Since $X_{i}$ is likely to be very close to $X_{N(i)}$ , the random variables $Y_{i}$ and $Y_{N(i)}$ are likely to be approximately i.i.d. after this conditioning. This leads to the approximation

	$\displaystyle\mathbb{E}[(1_{\{t\leq Y_{i}\}}-1_{\{t\leq Y_{N(i)}\}})^{2}\|X_{1},\ldots,X_{n}]$	$\displaystyle\approx 2\mathrm{Var}(1_{\{t\leq Y_{i}\}}\|X_{1},\ldots,X_{n})$
		$\displaystyle=2\mathrm{Var}(1_{\{t\leq Y_{i}\}}\|X_{i}).$

This gives

	$\displaystyle\mathbb{E}(1_{\{t\leq Y_{i}\}}-1_{\{t\leq Y_{N(i)}\}})^{2}$	$\displaystyle\approx 2\mathbb{E}[\mathrm{Var}(1_{\{t\leq Y\}}\|X)]$
		$\displaystyle=2\mathrm{Var}(1_{\{t\leq Y\}})-2\mathrm{Var}(\mathbb{E}(1_{\{t\leq Y\}}\|X)).$

Combining this with (2.2), we get

\displaystyle\mathbb{E}|F(Y_{i})-F(Y_{N(i)})|

\displaystyle\approx\int 2[\mathrm{Var}(1_{\{t\leq Y\}})-\mathrm{Var}(\mathbb{E}(1_{\{t\leq Y\}}|X))]d\mu(t).

But note that $\mathrm{Var}(1_{\{t\leq Y\}})=F(t)(1-F(t))$ , and $F(Y)\sim\textup{Uniform}[0,1]$ . Thus,

\int\mathrm{Var}(1_{\{t\leq Y\}})d\mu(t)=\int F(t)(1-F(t))d\mu(t)=\int_{0}^{1}x(1-x)dx=\frac{1}{6}.

Therefore by (2.1),

\displaystyle\mathbb{E}(\xi_{n}(X,Y))\approx 6\int\mathrm{Var}(\mathbb{E}(1_{\{t\leq Y\}}|X))d\mu(t)=\xi(X,Y),

where the last identity holds because $\int\mathrm{Var}(1_{\{t\leq Y\}})d\mu(t)=1/6$ , as shown above. This establishes the convergence of $\mathbb{E}(\xi_{n}(X,Y))$ to $\xi(X,Y)$ . Concentration inequalities are then used to show that $\xi_{n}(X,Y)-\mathbb{E}(\xi_{n}(X,Y))\to 0$ almost surely.

3 Asymptotic distribution

Let $X$ , $Y$ and $\xi_{n}$ be as in the previous section. For each $t\in\mathbb{R}$ , let $F(t):=\mathbb{P}(Y\leq t)$ and $G(t):=\mathbb{P}(Y\geq t)$ . Let $\phi(y,y^{\prime}):=\min\{F(y),F(y^{\prime})\}$ . Define

\tau^{2}=\frac{\mathbb{E}\phi(Y_{1},Y_{2})^{2}-2\mathbb{E}(\phi(Y_{1},Y_{2})\phi(Y_{1},Y_{3}))+(\mathbb{E}\phi(Y_{1},Y_{2}))^{2}}{(\mathbb{E}G(Y)(1-G(Y)))^{2}},

(3.1)

where $Y_{1},Y_{2},Y_{3}$ are independent copies of $Y$ . The following theorem gives the limiting distribution of $\xi_{n}$ under the null hypothesis that $X$ and $Y$ are independent.

Theorem 3.1 ([22]).

Suppose that $X$ and $Y$ are independent. Then $\sqrt{n}\xi_{n}(X,Y)$ converges to $N(0,\tau^{2})$ in distribution as $n\to\infty$ , where $\tau^{2}$ is given by the formula (3.1) stated above. The number $\tau^{2}$ is strictly positive if $Y$ is not a constant, and equals $2/5$ if $Y$ is continuous.

The reason why $\tau^{2}$ does not depend on the law of $Y$ if $Y$ is continuous is that in this case $F(Y)$ and $G(Y)$ are Uniform $[0,1]$ random variables, which implies that the expectations in (3.1) do not depend on the law of $Y$ . If $Y$ is not continuous, then $\tau^{2}$ may depend on the law of $Y$ . For example, it is not hard to show that if $Y$ is a Bernoulli $(1/2)$ random variable, then $\tau^{2}=1$ . Fortunately, if $Y$ is not continuous, there is a simple way to estimate $\tau^{2}$ from the data using the estimator

\widehat{\tau}^{2}_{n}=\frac{a_{n}-2b_{n}+c_{n}^{2}}{d_{n}^{2}},

where $a_{n}$ , $b_{n}$ , $c_{n}$ and $d_{n}$ are defined as follows. For each $i$ , let

R(i):=\#\{j:Y_{j}\leq Y_{i}\},\ \ \ L(i):=\#\{j:Y_{j}\geq Y_{i}\}.

(3.2)

(Note that $R(i)$ and $L(i)$ are different than $r_{i}$ and $l_{i}$ defined earlier.) Let $u_{1}\leq u_{2}\leq\cdots\leq u_{n}$ be an increasing rearrangement of $R(1),\ldots,R(n)$ . Let $v_{i}:=\sum_{j=1}^{i}u_{j}$ for $i=1,\ldots,n$ . Define

	$\displaystyle a_{n}:=\frac{1}{n^{4}}\sum_{i=1}^{n}(2n-2i+1)u_{i}^{2},\ \ \ b_{n}:=\frac{1}{n^{5}}\sum_{i=1}^{n}(v_{i}+(n-i)u_{i})^{2},$
	$\displaystyle c_{n}:=\frac{1}{n^{3}}\sum_{i=1}^{n}(2n-2i+1)u_{i},\ \ \ d_{n}:=\frac{1}{n^{3}}\sum_{i=1}^{n}L(i)(n-L(i)).$

Then the following holds.

Theorem 3.2 ([22]).

The estimator $\widehat{\tau}_{n}^{2}$ can be computed in time $O(n\log n)$ , and converges to $\tau^{2}$ almost surely as $n\to\infty$ .

The question of proving a central limit theorem for $\xi_{n}$ in the absence of independence is much more difficult than the independent case. This was left as an open question in [22] and recently resolved in complete generality by Lin and Han [73], following earlier proofs of Deb, Ghosal, and Sen [31] and Shi, Drton, and Han [104] under additional assumptions. Lin and Han [73] also give a consistent estimator of the asymptotic variance of $\xi_{n}$ in the absence of independence, solving another question that was left open in [22]. A central limit theorem for the symmetrized version of $\xi_{n}$ (defined as the maximum of $\xi_{n}(X,Y)$ and $\xi_{n}(Y,X)$ ) under the hypothesis of independence was proved by Zhang [129].

4 Power for testing independence

A deficiency of $\xi_{n}$ , as already pointed out in [22] through simulated examples, is that it has low power for testing independence against ‘standard’ alternatives, such as linear or monotone associations. This was theoretically confirmed by Shi, Drton, and Han [105], where it was shown that the test of independence using $\xi_{n}$ is rate-suboptimal against a family of local alternatives, whereas three other nonparametric tests of independence proposed in [61, 13, 8, 125] are rate-optimal. Like $\xi_{n}$ , the three competing test statistics considered in [105] are also computable in $O(n\log n)$ time. Similar results were obtained for a different type of competing test statistic by Cao and Bickel [21].

A more detailed analysis of the power properties of $\xi_{n}$ was carried out by Auddy, Deb, and Nandy [2], where the asymptotic distribution of $\xi_{n}$ under any changing sequence of alternatives converging to the null hypothesis of independence was computed. This analysis yielded exact detection thresholds and limiting power under natural alternatives converging to the null, such as mixture models, rotation models and noisy nonparametric regression. The detection boundary lies at distance $n^{-1/4}$ from the null, instead of the more standard $n^{-1/2}$ . This is similar to the power properties of other ‘graph-based’ statistics for testing independence, such as the Friedman–Rafsky statistic [41, 11].

A proposal for ‘boosting’ the power of $\xi_{n}$ for testing independence, by incorporating multiple nearby ranks instead of only the nearest ones, was recently proposed by Lin and Han [72]. This modified estimator was shown to attain near-optimal rates of power against certain classes of alternative hypotheses.

The conceptual reason behind the absence of local power of statistics such as $\xi_{n}$ was explained by Bickel [12]. An interesting question that remains unexplained is the following. It is seen in simulations that although $\xi_{n}$ has low power for testing independence against standard alternatives such as linear and monotone, it becomes more powerful as the signal starts to get more and more oscillatory [22]. This gives an advantage over other coefficients in applications where oscillatory signals arise naturally [25, 97], and suggests that $\xi_{n}$ may be efficient for certain kinds of local alternatives. No result of this sort has yet been proven.

5 Multivariate extensions

Many methods have been proposed for testing independence nonparametrically in the multivariate setting. This includes classical tests [88, 49, 117] as well as a flurry of recent ones proposed in the last ten years [57, 56, 58, 131, 124, 68, 30, 106, 10, 107, 31].

Most of these papers are concerned only with testing independence, and not with measuring the strength of dependence as measured by a correlation coefficient such as $\xi_{n}$ . Unfortunately, the $\xi_{n}$ coefficient and many other popular univariate coefficients do not readily generalize to the multivariate setting because they are based on ranks. Statisticians have started taking a new look at this old problem in recent years by considering a multivariate notion of rank defined using optimal transport. Roughly speaking, the idea is as follows. Let $\nu$ be a ‘reference measure’ in $\mathbb{R}^{d}$ , akin to the uniform distribution on $[0,1]$ in $\mathbb{R}$ . Given any probability measure $\mu$ in $\mathbb{R}^{d}$ , let $F^{\mu}:\mathbb{R}^{d}\to\mathbb{R}^{d}$ be the map that ‘optimally transports’ $\mu$ to $\nu$ — that is, if $X\sim\mu$ then $F^{\mu}(X)\sim\nu$ , and $F^{\mu}$ minimizes $\mathbb{E}\|X-F(X)\|^{2}$ among all such $F$ . By a theorem of McCann [80], such a map exists and is unique if $\mu$ and $\nu$ are both absolutely continuous with respect to Lebesgue measure. For example, when $d=1$ , $F^{\mu}$ is just the cumulative distribution function of $X$ , which transforms $X$ into a Uniform $[0,1]$ random variable. For properties of this map, see, e.g., Figalli [38] and Hallin et al. [51].

The above idea suggests a natural definition of multivariate rank. If $X_{1},\ldots,X_{n}$ are i.i.d. samples from $\mu$ , one can try to estimate $F^{\mu}$ using this data. Let $F_{n}^{\mu}$ be an estimate. Then $F_{n}^{\mu}(X_{i})$ can act as a ‘multivariate rank’ of $X_{i}$ among $X_{1},\ldots,X_{n}$ , divided by $n$ . Since $F^{\mu}(X_{i})\sim\nu$ , we can then assume that $F_{n}^{\mu}(X_{i})$ is approximately distributed according to $\nu$ , and then try to test for independence of random vectors using a test for independence that works when the marginal distributions are both $\nu$ . This idea has been made precise in a number of recent works in the statistics literature, such as Chernozhukov et al. [27], Deb, Ghosal, and Sen [31], Deb and Sen [30], Hallin et al. [51], Manole et al. [78], Shi, Drton, and Han [106], Ghosal and Sen [47], Mordant and Segers [82] and Shi et al. [107, 103]. For a survey and further discussions, see Han [52].

A direct generalization of $\xi_{n}$ to higher dimensional spaces has not been proposed so far, although the variant proposed in Deb et al. [31] satisfies the same properties as $\xi_{n}$ provided that the space on which $Y$ takes values admits a nonnegative definite kernel and the space on which $X$ takes values has a metric. This covers most spaces that arise in practice. There are a couple of other generalizations, proposed by Azadkia and Chatterjee [3] and Gamboa et al. [44], on measuring the dependence between a univariate random variable $Y$ and a random vector $\mathbf{X}$ . The coefficient proposed in [3] (discussed in detail in Section 6) is based on a generalization of the ideas behind the construction of $\xi_{n}$ . The coefficient proposed in [44] combines the construction of $\xi_{n}$ with ideas from the theory of Sobol indices. A new contribution of the present paper is a simple generalization of $\xi_{n}$ to standard Borel spaces, which has been overlooked in the literature until now. This is presented in Section 8.

6 Measuring conditional dependence

The problem of measuring conditional dependence has received less attention than the problem of measuring unconditional dependence, partly because it is a more difficult task. Non-parametric conditional independence can be tested for discrete data using the classical Cochran–Mantel–Haenszel test [28, 79], which can be adapted for continuous random variables by binning the data [62] or using kernels [42, 128, 111, 33, 100]. Besides these, there are methods based on estimating conditional cumulative distribution functions [75, 85], conditional characteristic functions [112, 67], conditional probability density functions [113], empirical likelihood [114], mutual information and entropy [96, 65, 87], copulas [7, 109, 119], distance correlation [121, 37, 116], and other approaches [101]. A number of interesting ideas based on resampling and permutation tests have been proposed in recent years [20, 100, 9].

In Azadkia and Chatterjee [3], a new coefficient of conditional dependence was proposed, based on the ideas behind the $\xi$ -coefficient defined in [22]. Like the $\xi$ -coefficient, this one also has a long list of desirable features, such as being fully nonparametric and working under minimal assumptions. The coefficient is defined as follows.

Let $Y$ be a random variable and $\mathbf{X}=(X_{1},\ldots,X_{p})$ and $\mathbf{Z}=(Z_{1},\ldots,Z_{q})$ be random vectors, all defined on the same probability space. Here $q\geq 1$ and $p\geq 0$ . The value $p=0$ means that $\mathbf{X}$ has no components at all. Let $\mu$ be the law of $Y$ . The following quantity was proposed in [3] as a measure of the degree of conditional dependence of $Y$ and $\mathbf{Z}$ given $\mathbf{X}$ :

T=T(Y,\mathbf{Z}|\mathbf{X}):=\frac{\int\mathbb{E}(\mathrm{Var}(\mathbb{P}(Y\geq t|\mathbf{Z},\mathbf{X})|\mathbf{X}))d\mu(t)}{\int\mathbb{E}(\mathrm{Var}(1_{\{Y\geq t\}}|\mathbf{X}))d\mu(t)}.

(6.1)

If the denominator equals zero, $T$ is undefined. If $p=0$ , then $\mathbf{X}$ has no components, and the conditional expectations and variances given $\mathbf{X}$ should be interpreted as unconditional expectations and variances. In this case we write $T(Y,\mathbf{Z})$ instead of $T(Y,\mathbf{Z}|\mathbf{X})$ . Note that $T$ is a generalization of the statistic $\xi$ appearing in Theorem 1.1. The following theorem summarizes the main properties of $T$ .

Theorem 6.1 ([3]).

Suppose that $Y$ is not almost surely equal to a measurable function of $\mathbf{X}$ (when $p=0$ , this means that $Y$ is not almost surely a constant). Then $T$ is well-defined and $0\leq T\leq 1$ . Moreover, $T=0$ if and only if $Y$ and $\mathbf{Z}$ are conditionally independent given $\mathbf{X}$ , and $T=1$ if and only if $Y$ is almost surely equal to a measurable function of $\mathbf{Z}$ given $\mathbf{X}$ . When $p=0$ , conditional independence given $\mathbf{X}$ simply means unconditional independence.

Now suppose we have data consisting of $n$ i.i.d. copies $(Y_{1},\mathbf{X}_{1},\mathbf{Z}_{1}),\ldots,(Y_{n},\mathbf{X}_{n},\mathbf{Z}_{n})$ of the triple $(Y,\mathbf{X},\mathbf{Z})$ , where $n\geq 2$ . For each $i$ , let $N(i)$ be the index $j$ such that $\mathbf{X}_{j}$ is the nearest neighbor of $\mathbf{X}_{i}$ with respect to the Euclidean metric on $\mathbb{R}^{p}$ , where ties are broken uniformly at random. Let $M(i)$ be the index $j$ such that $(\mathbf{X}_{j},\mathbf{Z}_{j})$ is the nearest neighbor of $(\mathbf{X}_{i},\mathbf{Z}_{i})$ in $\mathbb{R}^{p+q}$ , again with ties broken uniformly at random. Let $R_{i}$ be the rank of $Y_{i}$ , that is, the number of $j$ such that $Y_{j}\leq Y_{i}$ . If $p\geq 1$ , define

T_{n}=T_{n}(Y,\mathbf{Z}|\mathbf{X}):=\frac{\sum_{i=1}^{n}(\min\{R_{i},R_{M(i)}\}-\min\{R_{i},R_{N(i)}\})}{\sum_{i=1}^{n}(R_{i}-\min\{R_{i},R_{N(i)}\})}.

If $p=0$ , let $L_{i}$ be the number of $j$ such that $Y_{j}\geq Y_{i}$ , let $M(i)$ denote the $j$ such that $\mathbf{Z}_{j}$ is the nearest neighbor of $\mathbf{Z}_{i}$ (ties broken uniformly at random), and let

T_{n}=T_{n}(Y,\mathbf{Z}):=\frac{\sum_{i=1}^{n}(n\min\{R_{i},R_{M(i)}\}-L_{i}^{2})}{\sum_{i=1}^{n}L_{i}(n-L_{i})}.

In both cases, $T_{n}$ is undefined if the denominator is zero. The following theorem shows that $T_{n}$ is a consistent estimator of $T$ .

Theorem 6.2 ([3]).

Suppose that $Y$ is not almost surely equal to a measurable function of $\mathbf{X}$ . Then as $n\to\infty$ , $T_{n}\to T$ almost surely.

For various other properties of $T_{n}$ , such as rate of convergence, performance in simulations and real data, etc., see [3]. One problem that was left unsolved in [3] was the question of proving a central limit theorem for $T_{n}$ under the null hypothesis, which is crucial for carrying out tests for conditional independence. This question was partially resolved by Shi, Drton, and Han [104], who proved a central limit theorem for $T_{n}$ under the assumption that $Y$ is independent of $(\mathbf{X},\mathbf{Z})$ . An improved version of this result was proved recently by Lin and Han [73]. A version for data supported on manifolds was proved by Han and Huang [53].

The paper of Shi et al. [104] also develops the ‘conditional randomization test’ (CRT) framework of Candès et al. [20] to test conditional independence using $T_{n}$ , and find that $T_{n}$ , like $\xi_{n}$ is an inefficient test statistic. To address this concern, an improved generalization of $T_{n}$ , called ‘kernel partial correlation’ (KPC), was proposed by Huang, Deb, and Sen [63]. Unlike $T_{n}$ , KPC has the flexibility to use more than one nearest neighbor, which gives it better power properties.

Note that by the above theorems, a test of conditional independence based on $T_{n}$ is consistent against all alternatives. The problem is that in the absence of an asymptotic theory for $T_{n}$ , it is difficult to control the significance level of such a test. This is in fact an impossible problem, by a recent result of Shah and Peters [102] that proves hardness of conditional independence testing in the absence of smoothness assumptions. Assuming some degree of smoothness, minimax optimal conditional independence tests were recently constructed by Neykov, Balakrishnan, and Wasserman [84] and Kim et al. [69].

7 Application to nonparametric variable selection

The commonly used variable selection methods in the statistics literature rely on linear or additive models. This includes classical methods [14, 46, 26, 118, 35, 40, 55, 81] as well as modern ones [19, 132, 133, 126, 36, 91]. These methods are powerful and widely used in practice. However, they sometimes run into problems when significant interaction effects or nonlinearities are present. Such problems can be overcome by model-free methods [20, 60, 1, 15, 39, 55, 18, 6, 120, 16]. On the flip side, the theoretical foundations of model-free methods are usually weaker than those of model-based methods.

In an attempt to combine the best of both worlds, a new method of variable selection called Feature Ordering by Conditional Independence (FOCI), was proposed in Azadkia and Chatterjee [3]. This method uses the conditional dependence coefficient $T_{n}$ described in the previous section in a stepwise fashion, as follows. Let $Y$ be the response variable and let $\mathbf{X}=(X_{j})_{1\leq j\leq p}$ be the set of predictors. The data consists of $n$ i.i.d. copies of $(Y,\mathbf{X})$ . First, choose $j_{1}$ to be the index $j$ that maximizes $T_{n}(Y,X_{j})$ . Having obtained $j_{1},\ldots,j_{k}$ , choose $j_{k+1}$ to be the index $j\notin\{j_{1},\ldots,j_{k}\}$ that maximizes $T_{n}(Y,X_{j}|X_{j_{1}},\ldots,X_{j_{k}})$ . Continue like this until arriving at the first $k$ such that $T_{n}(Y,X_{j_{k+1}}|X_{j_{1}},\ldots,X_{j_{k}})\leq 0$ , and then declare the chosen subset to be $\widehat{S}:=\{j_{1},\ldots,j_{k}\}$ . If there is no such $k$ , define $\widehat{S}$ to be the whole set of variables. It may also happen that $T_{n}(Y,X_{j_{1}})\leq 0$ . In that case declare $\widehat{S}$ to be empty. Note that this variable selection procedure involves no choice of tuning parameters, which may be an advantage in practice.

It was shown in [3] that under mild conditions, the method selects a ‘correct’ set of variables with high probability. More precisely, it was shown that with high probability, the set $\widehat{S}$ selected by FOCI has the property that $Y$ and $(X_{j})_{j\notin S}$ are conditionally independent given $(X_{j})_{j\in\widehat{S}}$ . In other words, all the information about $Y$ that one can get from $\mathbf{X}$ is contained in $(X_{j})_{j\in\widehat{S}}$ . For further properties of FOCI and its performance in simulations and real data sets, see [3].

An improved version of FOCI called KFOCI (‘Kernel FOCI’) was proposed by Huang et al. [63]. An application of FOCI to causal inference, via an algorithm named DAG-FOCI, was introduced in Azadkia, Taeb, and Bühlmann [5]. For another application to causal inference, see Chatterjee and Vidyasagar [24].

8 A new proposal: Generalization to standard Borel spaces

In this section, a simple but wide ranging generalization of $\xi_{n}$ is proposed. In hindsight, this generalization seems obvious, but somehow this was overlooked both in the original paper [22] as well as in the subsequent developments listed in Section 5.

Recall that two measurable spaces are said to be isomorphic to each other if there is a bijection between the two spaces which is measurable and whose inverse is also measurable. Recall that a standard Borel space is a measurable space that is isomorphic to a Borel subset of a Polish space [110, Chapter 3]. In particular, every Borel subset of every Polish space is a standard Borel space. The Borel isomorphism theorem says that any uncountable standard Borel space is isomorphic to the real line (see Rao and Srivastava [90] for an elementary proof). In particular, if $\mathcal{X}$ is any standard Borel space, there is a measurable map $\varphi:\mathcal{X}\to\mathbb{R}$ such that $\varphi$ is injective, $\varphi(\mathcal{X})$ is Borel, and $\varphi^{-1}$ is measurable on $\varphi(\mathcal{X})$ . We will say that $\varphi$ is an isomorphism between $\mathcal{X}$ and a Borel subset of the real line, or simply, a ‘Borel isomorphism’.

Now let $\mathcal{X}$ and $\mathcal{Y}$ be two standard Borel spaces. Let $\varphi$ be a Borel isomorphism of $\mathcal{X}$ and $\psi$ be a Borel isomorphism of $\mathcal{Y}$ . Let $(X,Y)$ be an $\mathcal{X}\times\mathcal{Y}$ -valued pair of random variables, and let $(X_{1},Y_{1}),\ldots,(X_{n},Y_{n})$ be i.i.d. copies of $(X,Y)$ . Let $X^{\prime}:=\varphi(X)$ and $Y^{\prime}:=\psi(Y)$ , so that $(X^{\prime},Y^{\prime})$ is a pair of real-valued random variables. Let $X_{i}^{\prime}:=\varphi(X_{i})$ and $Y_{i}^{\prime}:=\psi(Y_{i})$ for each $i$ . Finally, define

\xi_{n}(X,Y):=\xi_{n}(X^{\prime},Y^{\prime}),

where $\xi_{n}(X^{\prime},Y^{\prime})$ is defined using $(X_{1}^{\prime},Y_{1}^{\prime}),\ldots,(X_{n}^{\prime},Y_{n}^{\prime})$ as in equation (1.1). Note that the definition of $\xi_{n}(X,Y)$ depends on our choices of $\varphi$ and $\psi$ . Different choices of isomorphisms would lead different definitions of $\xi_{n}$ . The following theorem generalizes Theorem 1.1.

Theorem 8.1.

If $Y$ is not almost surely a constant, then as $n\to\infty$ , $\xi_{n}(X,Y)$ converges almost surely to the deterministic limit $\xi(X,Y)$ , which equals $\xi(X^{\prime},Y^{\prime})$ , defined as in (1.2) with $X^{\prime}$ and $Y^{\prime}$ in place of $X$ and $Y$ . This limit belongs to the interval $[0,1]$ . It is $0$ if and only if $X$ and $Y$ are independent, and it is $1$ if and only if there is a measurable function $f:\mathbb{R}\to\mathbb{R}$ such that $Y=f(X)$ almost surely. Moreover, the asymptotic distribution of $\sqrt{n}\xi_{n}(X,Y)$ under the hypothesis of independence, as given by Theorems 3.1 and 3.2, also holds, provided that $\tau$ and $\widehat{\tau}_{n}$ are computed using $X_{i}^{\prime}$ and $Y_{i}^{\prime}$ instead of $X_{i}$ and $Y_{i}$ .

Proof.

The convergence is clear from Theorem 1.1. Also, by Theorem 1.1, $\xi(X^{\prime},Y^{\prime})=0$ if and only if $X^{\prime}$ and $Y^{\prime}$ are independent, and $\xi(X^{\prime},Y^{\prime})=1$ if and only if $Y^{\prime}$ is a measurable function of $X^{\prime}$ . Since $X=\varphi^{-1}(X^{\prime})$ and $Y=\psi^{-1}(Y^{\prime})$ , it follows that $X$ and $Y$ are independent if and only if $X^{\prime}$ and $Y^{\prime}$ are independent. For the same reason, $Y$ is a measurable function of $X$ almost surely if and only if $Y^{\prime}$ is a measurable function of $X^{\prime}$ almost surely. Note that $Y$ is not almost surely a constant if and only if $Y^{\prime}$ is not almost surely a constant. Lastly, since $\xi_{n}(X,Y)$ is just $\xi_{n}(X^{\prime},Y^{\prime})$ , any result about the asymptotic distribution of $\xi_{n}(X^{\prime},Y^{\prime})$ , including Theorems 3.1 and 3.2 of this draft, can be transferred to $\xi_{n}(X,Y)$ . ∎

Just like the univariate case, the generalized $\xi_{n}$ has the advantage of working under zero assumptions and having a simple asymptotic theory, as shown by Theorem 8.1. On the other hand, just like the univariate coefficient, one can expect the generalized coefficient to also suffer from low power for testing independence.

Theorem 8.1 is a nice, clean result, but to implement the idea in practice, one needs to work with actual Borel isomorphisms. Here is an example of a Borel isomorphism between $\mathbb{R}^{d}$ and a Borel subset of $\mathbb{R}$ . Take any $x=(x_{1},\ldots,x_{d})\in\mathbb{R}^{d}$ . Let

a_{i,1}\cdots a_{i,k_{i}}.b_{i,1}b_{i,2}\cdots

be the binary expansion of $|x_{i}|$ . Filling in extra $0$ ’s at the beginning if necessary, let us assume that $k_{1}=\cdots=k_{d}=k$ . Then, let us ‘interlace’ the digits to get the number

a_{1,1}a_{2,1}\cdots a_{d,1}a_{1,2}a_{2,2}\cdots a_{d,2}\cdots a_{1,k}a_{2,k}\cdots a_{d,k}.b_{1,1}b_{2,1}\cdots b_{d,1}b_{1,2}b_{2,2}\cdots b_{d,2}\cdots.

This is an encoding of the $n$ -tuple $(|x_{1}|,\ldots,|x_{d}|)$ . But we also want to encode the signs of the $x_{i}$ ’s. Let $c_{i}=1$ if $x_{i}\geq 0$ and $0$ if $x_{i}<0$ . Sticking $1c_{1}c_{2}\cdots c_{d}$ in front of the above list, we get an encoding of the vector $x$ as a real number. (The $1$ in front ensures there is no ambiguity arising from some of the leading $c_{i}$ ’s being $0$ .) It is easy to verify that this mapping is measurable, injective, and its inverse (defined on its range) is also measurable.

Numerical simulations with $\xi_{n}$ computed using the above scheme produced satisfactory results. Some examples are as follows. The examples show one potential problem with using statistics such as $\xi_{n}$ in a high dimensional setting: The bias may be quite large (even though the standard deviation is small), resulting in slow convergence of $\xi_{n}(X,Y)$ to $\xi(X,Y)$ .

Example 8.2 (Points on a sphere).

Non-uniform random points were generated on the unit sphere in $\mathbb{R}^{3}$ by drawing $\phi_{1},\ldots,\phi_{n}$ uniformly from $[-\pi,\pi]$ , drawing $\theta_{1},\ldots,\theta_{n}$ uniformly from $[0,2\pi]$ , and defining

x_{i}=\sin\phi_{i}\cos\theta_{i},\ \ y_{i}=\sin\phi_{i}\sin\theta_{i},\ \ z_{i}=\cos\phi_{i}.

Taking $X_{i}=(\phi_{i},\theta_{i})$ and $Y_{i}=(x_{i},y_{i},z_{i})$ , $\xi_{n}(X,Y)$ was computed for $n=100$ and $n=1000$ . One thousand simulations were done for each $n$ . The histograms of the values of $\xi_{n}$ are displayed in Figure 1. For $n=100$ , $\xi_{n}(X,Y)$ had a mean value of $0.617$ with a standard deviation of $0.049$ . For $n=1000$ , the mean value was $0.865$ and the standard deviation was $0.009$ . Simulations were also done for $n=10000$ , where the mean value of $\xi_{n}(X,Y)$ turned out to be $0.957$ and the standard deviation was $0.002$ . The slow convergence due to large bias is clearly apparent in this example.

Example 8.3 (Points on a sphere plus noise).

This is the same example as the previous one, except that independent random noises were added to $x_{i}$ , $y_{i}$ and $z_{i}$ . The noise variables were taken to be normal with mean zero and standard deviation $\sigma$ , for various values of $\sigma$ . Table 1 displays the means and standard deviations of $\xi_{n}(X,Y)$ in $1000$ simulations, for $n=100$ and $n=1000$ , and $\sigma=0.01$ , $\sigma=0.05$ and $\sigma=0.1$ .

Table 1: Means and standard deviations of

\xi_{n}(X,Y)

1000

simulations, when

X=

polar coordinates of a random point on a sphere and

Y=

Cartesian coordinates of the point plus independent

N(0,\sigma^{2})

errors in each coordinate (Example 8.3).

	$\sigma$
$n$	$0.01$	$0.05$	$0.1$
100	$0.545$ ( $0.062$ )	$0.440$ ( $0.071$ )	$0.357$ ( $0.074$ )
1000	$0.783$ ( $0.015$ )	$0.658$ ( $0.020$ )	$0.543$ ( $0.023$ )

Example 8.4 (Marginal independence versus joint dependence).

In this example, we have a pair of random variables $Y=(a,b)$ which is a function of a $4$ -tuple $X=(u,v,w,z)$ , but $u$ , $v$ , $w$ and $z$ are marginally independent of $Y_{i}$ . The variables are constructed as follows. Let $a$ , $b$ and $c$ be independent Uniform $[0,1]$ random variables. Let

	$\displaystyle u=a+b+c\ \ (\mathrm{mod}\ 1),\ \ \ v=\frac{a}{2}+\frac{b}{2}+c\ \ (\mathrm{mod}\ 1),$
	$\displaystyle w=\frac{4a}{3}+\frac{2b}{3}+c\ \ (\mathrm{mod}\ 1),\ \ \ z=\frac{2a}{3}+\frac{b}{3}+c\ \ (\mathrm{mod}\ 1).$

Here $x\ (\mathrm{mod}\ 1)$ denotes the ‘remainder modulo $1$ ’ of a real number $x$ . It is easy to see that individually, $u$ , $v$ , $w$ and $z$ are independent of $Y=(a,b)$ , since the operation of adding the uniform random variable $c$ and taking the remainder modulo $1$ erases all information about the quantity to which $c$ is added. However, $Y$ can be recovered from the $4$ -tuple $X=(u,v,w,z)$ because $u-v=(a+b)/2\ (\mathrm{mod}\ 1)=(a+b)/2$ and $w-z=(2a+b)/3\ (\mathrm{mod}\ 1)=(2a+b)/3$ , where the second identity holds in each case because $(a+b)/2\in[0,1]$ and $(2a+b)/3\in[0,1]$ . With the above definitions, $n$ i.i.d. copies of $(X,Y)$ were sampled, and $\xi_{n}(X,Y)$ was computed, along with the asymptotic P-value for testing independence. This was repeated one thousand times. The average values of the coefficients and the P-values for $n=100$ and $n=1000$ are reported in Table 2. The table shows that even for $n$ as small as $100$ , the hypothesis that $X$ and $Y$ are independent is rejected with average P-value $0.001$ , whereas the average P-value for the hypothesis and $u$ and $Y$ are independent is $0.491$ . On the other hand, even for $n$ as large as $1000$ , the average P-value for the hypothesis that $u$ and $Y$ are independent is $0.495$ , whereas the average P-value for the hypothesis that $X$ and $Y$ are independent is $0.000$ .

Table 2: Average values of

\xi_{n}(u,Y)

and

\xi_{n}(X,Y)

and average P-values for testing independence in Example 8.4.

$n$	$\xi_{n}(u,Y)$	$\xi_{n}(X,Y)$	$H_{0}:u\perp\!\!\!\perp Y$	$H_{0}:X\perp\!\!\!\perp Y$
	Average value		Average P-value
$100$	$0.002$	$0.313$	$0.491$	$0.001$
$1000$	$0.001$	$0.581$	$0.495$	$0.000$

The above examples indicate that it may not be ‘crazy’ to use the generalized version of $\xi_{n}$ as a measure of association in the multivariate setting and beyond. Further investigations, through applications in simulated and real data, would be necessary to arrive at a concrete verdict.

The generalized version of $\xi_{n}$ can also be used to define a coefficient of conditional dependence for random variables taking values in standard Borel spaces, as follows. Let $\mathcal{X}$ , $\mathcal{Y}$ and $\mathcal{Z}$ be standard Borel spaces, and $X$ , $Y$ and $Z$ be random variables taking values in $\mathcal{X}$ , $\mathcal{Y}$ and $\mathcal{Z}$ , respectively. Let $W$ denote the $\mathcal{X}\times\mathcal{Z}$ -valued random variable $(X,Z)$ . Using Borel isomorphisms, let us define the coefficients $\xi_{n}(X,Y)$ and $\xi_{n}(W,Y)$ as before, based on i.i.d. data $(X_{1},Y_{1},Z_{1}),\ldots,(X_{n},Y_{n},Z_{n})$ . One can then define a coefficient of conditional dependence between $Y$ and $Z$ given $X$ as

\xi_{n}(Z,Y|X):=\frac{\xi_{n}(W,Y)-\xi_{n}(X,Y)}{1-\xi_{n}(X,Y)},

leaving it undefined if the denominator is zero. The following theorem justifies its use as a measure of conditional dependence.

Theorem 8.5.

Suppose that $Y$ is not almost surely equal to a measurable function of $X$ . Then as $n\to\infty$ , $\xi_{n}(Z,Y|X)$ converges almost surely to a deterministic limit $\xi(Z,Y|X)\in[0,1]$ , which is $0$ if and only if $Y$ and $Z$ are independent given $X$ , and $1$ if and only if $Y$ is almost surely equal to a measurable function of $Z$ given $X$ .

Proof.

Let $\varphi$ , $\psi$ , $\eta$ and $\delta$ be the Borel isomorphisms of $\mathcal{X}$ , $\mathcal{Y}$ , $\mathcal{Z}$ and $\mathcal{X}\times\mathcal{Z}$ that we are using in our definitions of $\xi_{n}$ . Let $X^{\prime}:=\varphi(X)$ , $Y^{\prime}:=\psi(Y)$ , $Z^{\prime}:=\eta(Z)$ and $W^{\prime}:=\delta(W)$ . By Theorem 1.1 and the assumption that $Y$ is not almost surely a measurable function of $X$ , we get that $\xi_{n}(Z,Y|X)$ converges almost surely to

\displaystyle\xi(Z,Y|X)

\displaystyle:=\frac{\xi(W^{\prime},Y^{\prime})-\xi(X^{\prime},Y^{\prime})}{1-\xi(X^{\prime},Y^{\prime})}.

Let $\mu$ denote the law of $Y^{\prime}$ . Then

	$\displaystyle\xi(W^{\prime},Y^{\prime})-\xi(X^{\prime},Y^{\prime})$	$\displaystyle=\frac{\int\mathrm{Var}(\mathbb{P}(Y^{\prime}\geq t\|W^{\prime}))d\mu(t)-\int\mathrm{Var}(\mathbb{P}(Y^{\prime}\geq t\|X^{\prime}))d\mu(t)}{\int\mathrm{Var}(1_{\{Y^{\prime}\geq t\}})d\mu(t)}$
		$\displaystyle=\frac{\int\mathrm{Var}(\mathbb{P}(Y^{\prime}\geq t\|X^{\prime},Z^{\prime}))d\mu(t)-\int\mathrm{Var}(\mathbb{P}(Y^{\prime}\geq t\|X^{\prime}))d\mu(t)}{\int\mathrm{Var}(1_{\{Y^{\prime}\geq t\}})d\mu(t)}$
		$\displaystyle=\frac{\int\mathbb{E}(\mathrm{Var}(\mathbb{P}(Y^{\prime}\geq t\|X^{\prime},Z^{\prime})\|X^{\prime})d\mu(t)}{\int\mathrm{Var}(1_{\{Y^{\prime}\geq t\}})d\mu(t)}.$

Similarly,

1-\xi(X^{\prime},Y^{\prime})=\frac{\int\mathbb{E}(\mathrm{Var}(1_{\{Y^{\prime}\geq t\}}|X^{\prime}))d\mu(t)}{\int\mathrm{Var}(1_{\{Y^{\prime}\geq t\}})d\mu(t)}.

From the above expressions, we see that $\xi(Z,Y|X)$ is nothing but the quantity $T(Y,Z|X)$ displayed in equation (6.1). The claims of the theorem now follow from Theorem 6.1. ∎

9 R packages

An R package for calculating $\xi_{n}$ , as well as the generalized coefficient proposed in this survey, and P-values for testing independence — named XICOR — is available on CRAN [23]. An R package for calculating the conditional dependence coefficient $T_{n}$ and implementing the FOCI algorithm, called FOCI, is also available [4]. The KFOCI algorithm of Huang et al. [63] is implemented in an R package by the same name [64].

References

Amit and Geman [1997] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 9(7):1545–1588, 1997.
Auddy et al. [2021] A. Auddy, N. Deb, and S. Nandy. Exact Detection Thresholds for Chatterjee’s Correlation. arXiv preprint arXiv:2104.15140, 2021.
Azadkia and Chatterjee [2021] M. Azadkia and S. Chatterjee. A simple measure of conditional dependence. Annals of Statistics, 49(6):3070–3102, 2021.
Azadkia et al. [2020] M. Azadkia, S. Chatterjee, and N. S. Matloff. FOCI: Feature Ordering by Conditional Independence, 2020. URL https://CRAN.R-project.org/package=FOCI.
Azadkia et al. [2021] M. Azadkia, A. Taeb, and P. Bühlmann. A fast non-parametric approach for causal structure learning in polytrees. arXiv preprint arXiv:2111.14969, 2021.
Battiti [1994] R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4):537–550, 1994.
Bergsma [2004] W. Bergsma. Testing conditional independence for continuous random variables. Report Eurandom, 2004048, 2004.
Bergsma and Dassios [2014] W. Bergsma and A. Dassios. A consistent test of independence based on a sign covariance related to Kendall’s tau. Bernoulli, 20(2):1006–1028, 2014.
Berrett et al. [2020] T. B. Berrett, Y. Wang, R. F. Barber, and R. J. Samworth. The conditional permutation test for independence while controlling for confounders. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1):175–197, 2020.
Berrett et al. [2021] T. B. Berrett, I. Kontoyiannis, and R. J. Samworth. Optimal rates for independence testing via $u$ -statistic permutation tests. Annals of Statistics, 49(5):2457–2490, 2021.
Bhattacharya [2019] B. B. Bhattacharya. A general asymptotic framework for distribution-free graph-based two-sample tests. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(3):575–602, 2019.
Bickel [2022] P. J. Bickel. Measures of independence and functional dependence. arXiv preprint arXiv:2206.13663, 2022.
Blum et al. [1961] J. Blum, J. Kiefer, and M. Rosenblatt. Distribution free tests of independence based on the sample distribution function. Annals of Mathematical Statistics, 32(2):485–498, 1961.
Breiman [1995] L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.
Breiman [1996] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
Breiman [2001] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
Breiman and Friedman [1985] L. Breiman and J. H. Friedman. Estimating optimal transformations for multiple regression and correlation. Journal of the American statistical Association, 80(391):580–598, 1985.
Breiman et al. [1984] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth Press, 1984.
Candès and Tao [2007] E. Candès and T. Tao. The Dantzig Selector: Statistical estimation when $p$ is much larger than $n$ . Annals of Statistics, 35(6):2313–2351, 2007.
Candès et al. [2018] E. Candès, Y. Fan, L. Janson, and J. Lv. Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577, 2018.
Cao and Bickel [2020] S. Cao and P. J. Bickel. Correlations with tailored extremal properties. arXiv preprint arXiv:2008.10177, 2020.
Chatterjee [2021] S. Chatterjee. A new coefficient of correlation. Journal of the American Statistical Association, 116(536):2009–2022, 2021.
Chatterjee and Holmes [2020] S. Chatterjee and S. Holmes. XICOR: Association measurement through cross rank increments, 2020. URL https://CRAN.R-project.org/package=XICOR.
Chatterjee and Vidyasagar [2022] S. Chatterjee and M. Vidyasagar. Estimating large causal polytree skeletons from small samples. arXiv preprint arXiv:2209.07028, 2022.
Chen [2020] L.-P. Chen. A note of feature screening via rank-based coefficient of correlation. arXiv preprint arXiv:2008.04456, 2020.
Chen and Donoho [1994] S. Chen and D. Donoho. Basis pursuit. In Proceedings of 1994 28th Asilomar Conference on Signals, Systems and Computers, volume 1, pages 41–44. IEEE, 1994.
Chernozhukov et al. [2017] V. Chernozhukov, A. Galichon, M. Hallin, and M. Henry. Monge–Kantorovich depth, quantiles, ranks and signs. Annals of Statistics, 45(1):223–256, 2017.
Cochran [1954] W. G. Cochran. Some methods for strengthening the common $\chi^{2}$ tests. Biometrics, 10(4):417–451, 1954.
Csörgő [1985] S. Csörgő. Testing for independence by the empirical characteristic function. Journal of Multivariate Analysis, 16(3):290–299, 1985.
Deb and Sen [2021] N. Deb and B. Sen. Multivariate rank-based distribution-free nonparametric testing using measure transportation. Journal of the American Statistical Association, pages 1–16, 2021.
Deb et al. [2020] N. Deb, P. Ghosal, and B. Sen. Measuring association on topological spaces using kernels and geometric graphs. arXiv preprint arXiv:2010.01768, 2020.
Dette et al. [2013] H. Dette, K. F. Siburg, and P. A. Stoimenov. A copula-based non-parametric measure of regression dependence. Scandinavian Journal of Statistics, 40(1):21–41, 2013.
Doran et al. [2014] G. Doran, K. Muandet, K. Zhang, and B. Schölkopf. A permutation-based kernel conditional independence test. In Uncertainty in Artificial Intelligence, pages 132–141. AUAI, 2014.
Drton et al. [2020] M. Drton, F. Han, and H. Shi. High-dimensional consistent independence testing with maxima of rank correlations. Annals of Statistics, 48(6):3206–3227, 2020.
Efron et al. [2004] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of statistics, 32(2):407–499, 2004.
Fan and Li [2001] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
Fan et al. [2020] J. Fan, Y. Feng, and L. Xia. A projection-based conditional dependence measure with applications to high-dimensional undirected graphical models. Journal of Econometrics, 218(1):119–139, 2020.
Figalli [2018] A. Figalli. On the continuity of center-outward distribution and quantile functions. Nonlinear Analysis, 177:413–421, 2018.
Freund and Schapire [1996] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148–156, 1996.
Friedman [1991] J. H. Friedman. Multivariate adaptive regression splines. Annals of Statistics, 19(1):1–67, 1991.
Friedman and Rafsky [1983] J. H. Friedman and L. C. Rafsky. Graph-theoretic measures of multivariate association and prediction. Annals of Statistics, pages 377–391, 1983.
Fukumizu et al. [2007] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.
Gamboa et al. [2018] F. Gamboa, T. Klein, and A. Lagnoux. Sensitivity analysis based on Cramér–von Mises distance. SIAM/ASA Journal on Uncertainty Quantification, 6(2):522–548, 2018.
Gamboa et al. [2022] F. Gamboa, P. Gremaud, T. Klein, and A. Lagnoux. Global sensitivity analysis: A novel generation of mighty estimators based on rank statistics. Bernoulli, 28(4):2345–2374, 2022.
Gebelein [1941] H. Gebelein. Das statistische problem der korrelation als variations-und eigenwertproblem und sein zusammenhang mit der ausgleichsrechnung. Zeitschrift für Angewandte Mathematik und Mechanik, 21(6):364–379, 1941.
George and McCulloch [1993] E. I. George and R. E. McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.
Ghosal and Sen [2022] P. Ghosal and B. Sen. Multivariate ranks and quantiles using optimal transport: Consistency, rates and nonparametric testing. Annals of Statistics, 50(2):1012–1037, 2022.
Gretton et al. [2005a] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert–Schmidt norms. In Proceedings of the 16th International Conference on Algorithmic Learning Theory, pages 63–77. Springer, Berlin, 2005a.
Gretton et al. [2005b] A. Gretton, A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schölkopf, and N. Logothetis. Kernel constrained covariance for dependence measurement. In International Workshop on Artificial Intelligence and Statistics, pages 112–119. PMLR, 2005b.
Gretton et al. [2007] A. Gretton, K. Fukumizu, C. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.
Hallin et al. [2021] M. Hallin, E. Del Barrio, J. Cuesta-Albertos, and C. Matrán. Distribution and quantile functions, ranks and signs in dimension $d$ : A measure transportation approach. Annals of Statistics, 49(2):1139–1165, 2021.
Han [2021] F. Han. On extensions of rank correlation coefficients to multivariate spaces. Bernoulli News, 28(2):7–11, 2021.
Han and Huang [2022] F. Han and Z. Huang. Azadkia–Chatterjee’s correlation coefficient adapts to manifold data. arXiv preprint arXiv:2209.11156, 2022.
Han et al. [2017] F. Han, S. Chen, and H. Liu. Distribution-free tests of independence in high dimensions. Biometrika, 104(4):813–828, 2017.
Hastie et al. [2009] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Berlin, 2nd edition, 2009.
Heller and Heller [2016] R. Heller and Y. Heller. Multivariate tests of association based on univariate tests. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
Heller et al. [2012] R. Heller, M. Gorfine, and Y. Heller. A class of multivariate distribution-free tests of independence based on graphs. Journal of Statistical Planning and Inference, 142(12):3097–3106, 2012.
Heller et al. [2013] R. Heller, Y. Heller, and M. Gorfine. A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2):503–510, 2013.
Hirschfeld [1935] H. O. Hirschfeld. A connection between correlation and contingency. Mathematical Proceedings of the Cambridge Philosophical Society, 31(4):520–524, 1935.
Ho [1998] T. K. Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998.
Hoeffding [1948] W. Hoeffding. A non-parametric test of independence. Annals of Mathematical Statististics, 19(4):546–557, 1948.
Huang [2010] T.-M. Huang. Testing conditional independence using maximal nonlinear conditional correlation. Annals of Statistics, 38(4):2047–2091, 2010.
Huang et al. [2020] Z. Huang, N. Deb, and B. Sen. Kernel partial correlation coefficient — a measure of conditional dependence. arXiv preprint arXiv:2012.14804, 2020.
Huang et al. [2022] Z. Huang, N. Deb, and B. Sen. KPC: Kernel Partial Correlation Coefficient, 2022. URL https://cran.r-project.org/web/packages/KPC.
Joe [1989] H. Joe. Relative entropy measures of multivariate dependence. Journal of the American Statistical Association, 84(405):157–164, 1989.
Josse and Holmes [2016] J. Josse and S. Holmes. Measuring multivariate association and beyond. Statistics Surveys, 10:132, 2016.
Ke and Yin [2019] C. Ke and X. Yin. Expected conditional characteristic function-based measures for testing independence. Journal of the American Statistical Association, 115(530):985–996, 2019.
Kim et al. [2020] I. Kim, S. Balakrishnan, and L. Wasserman. Robust multivariate nonparametric tests via projection averaging. Annals of Statistics, 48(6):3417–3441, 2020.
Kim et al. [2021] I. Kim, M. Neykov, S. Balakrishnan, and L. Wasserman. Local permutation tests for conditional independence. arXiv preprint arXiv:2112.11666, 2021.
Kong et al. [2019] E. Kong, Y. Xia, and W. Zhong. Composite coefficient of determination and its application in ultrahigh dimensional variable screening. Journal of the American Statistical Association, 114(528):1740–1751, 2019.
Kraskov et al. [2004] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004.
Lin and Han [2022+] Z. Lin and F. Han. On boosting the power of Chatterjee’s rank correlation. Biometrika, 2022+. Forthcoming.
Lin and Han [2022] Z. Lin and F. Han. Limit theorems of Chatterjee’s rank correlation. arXiv preprint arXiv:2204.08031, 2022.
Linfoot [1957] E. H. Linfoot. An informational measure of correlation. Information and Control, 1(1):85–89, 1957.
Linton and Gozalo [1997] O. Linton and P. Gozalo. Conditional independence restrictions: testing and estimation. Cowles Foundation Discussion Paper, 1140, 1997.
Lopez-Paz et al. [2013] D. Lopez-Paz, P. Hennig, and B. Schölkopf. The randomized dependence coefficient. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
Lyons [2013] R. Lyons. Distance covariance in metric spaces. Annals of Probability, 41(5):3284–3305, 2013.
Manole et al. [2021] T. Manole, S. Balakrishnan, J. Niles-Weed, and L. Wasserman. Plugin estimation of smooth optimal transport maps. arXiv preprint arXiv:2107.12364, 2021.
Mantel and Haenszel [1959] N. Mantel and W. Haenszel. Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22(4):719–748, 1959.
McCann [1995] R. J. McCann. Existence and uniqueness of monotone measure-preserving maps. Duke Mathematical Journal, 80(2):309–323, 1995.
Miller [2002] A. Miller. Subset Selection in Regression. Chapman and Hall, 2002.
Mordant and Segers [2022] G. Mordant and J. Segers. Measuring dependence between random vectors via optimal transport. Journal of Multivariate Analysis, 189:104912, 2022.
Nandy et al. [2016] P. Nandy, L. Weihs, and M. Drton. Large-sample theory for the Bergsma–Dassios sign covariance. Electronic Journal of Statistics, 10(2):2287–2311, 2016.
Neykov et al. [2021] M. Neykov, S. Balakrishnan, and L. Wasserman. Minimax optimal conditional independence testing. Annals of Statistics, 49(4):2151–2177, 2021.
Patra et al. [2016] R. K. Patra, B. Sen, and G. J. Székely. On a nonparametric notion of residual and its applications. Statistics & Probability Letters, 109:208–213, 2016.
Pfister et al. [2018] N. Pfister, P. Bühlmann, B. Schölkopf, and J. Peters. Kernel-based tests for joint independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):5–31, 2018.
Póczos and Schneider [2012] B. Póczos and J. Schneider. Nonparametric estimation of conditional information and divergences. In Artificial Intelligence and Statistics, pages 914–923. PMLR, 2012.
Puri et al. [1970] M. Puri, P. Sen, and D. Gokhale. On a class of rank order tests for independence in multivariate distributions. Sankhyā, Series A, 32(3):271–298, 1970.
Puri and Sen [1971] M. L. Puri and P. K. Sen. Nonparametric methods in multivariate analysis. Wiley, New York, 1971.
Rao and Srivastava [1994] B. Rao and S. Srivastava. An elementary proof of the Borel isomorphism theorem. Real Analysis Exchange, 20(1):347–349, 1994.
Ravikumar et al. [2009] P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009–1030, 2009.
Rényi [1959] A. Rényi. On measures of dependence. Acta Mathematica Hungarica, 10(3-4):441–451, 1959.
Reshef et al. [2011] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti. Detecting novel associations in large data sets. Science, 334(6062):1518–1524, 2011.
Romano [1988] J. P. Romano. A bootstrap revival of some nonparametric distance tests. Journal of the American Statistical Association, 83(403):698–708, 1988.
Rosenblatt [1975] M. Rosenblatt. A quadratic measure of deviation of two-dimensional density estimates and a test of independence. Annals of Statistics, pages 1–14, 1975.
Runge [2018] J. Runge. Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. In International Conference on Artificial Intelligence and Statistics, pages 938–947. PMLR, 2018.
Sadeghi [2022] B. Sadeghi. Chatterjee Correlation Coefficient: a robust alternative for classic correlation methods in geochemical studies-(including “TripleCpy” Python package). Ore Geology Reviews, page 104954, 2022.
Schweizer and Wolff [1981] B. Schweizer and E. F. Wolff. On nonparametric measures of dependence for random variables. Annals of Statistics, 9(4):879–885, 1981.
Sen and Sen [2014] A. Sen and B. Sen. Testing independence and goodness-of-fit in linear models. Biometrika, 101(4):927–942, 2014.
Sen et al. [2017] R. Sen, A. T. Suresh, K. Shanmugam, A. G. Dimakis, and S. Shakkottai. Model-powered conditional independence test. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
Seth and Príncipe [2012] S. Seth and J. C. Príncipe. Conditional association. Neural Computation, 24(7):1882–1905, 2012.
Shah and Peters [2020] R. D. Shah and J. Peters. The hardness of conditional independence testing and the generalised covariance measure. Annals of Statistics, 48(3):1514–1538, 2020.
Shi et al. [2021a] H. Shi, M. Drton, M. Hallin, and F. Han. Center-outward sign-and rank-based quadrant, Spearman, and Kendall tests for multivariate independence. arXiv preprint arXiv:2111.15567, 2021a.
Shi et al. [2021b] H. Shi, M. Drton, and F. Han. On Azadkia–Chatterjee’s conditional dependence coefficient. arXiv preprint arXiv:2108.06827, 2021b.
Shi et al. [2022a] H. Shi, M. Drton, and F. Han. On the power of Chatterjee’s rank correlation. Biometrika, 109(2):317–333, 2022a.
Shi et al. [2022b] H. Shi, M. Drton, and F. Han. Distribution-free consistent independence tests via center-outward ranks and signs. Journal of the American Statistical Association, 117(537):395–410, 2022b.
Shi et al. [2022c] H. Shi, M. Hallin, M. Drton, and F. Han. On universally consistent and fully distribution-free rank tests of vector independence. Annals of Statistics, 50(4):1933–1959, 2022c.
Sklar [1959] M. Sklar. Fonctions de répartition à $n$ dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8:229–231, 1959.
Song [2009] K. Song. Testing conditional independence via Rosenblatt transforms. Annals of Statistics, 37(6B):4011–4045, 2009.
Srivastava [1998] S. M. Srivastava. A Course on Borel Sets. Springer-Verlag, New York, 1998.
Strobl et al. [2019] E. V. Strobl, K. Zhang, and S. Visweswaran. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. Journal of Causal Inference, 7(1), 2019.
Su and White [2007] L. Su and H. White. A consistent characteristic function-based test for conditional independence. Journal of Econometrics, 141(2):807–834, 2007.
Su and White [2008] L. Su and H. White. A nonparametric Hellinger metric test for conditional independence. Econometric Theory, 24(4):829–864, 2008.
Su and White [2014] L. Su and H. White. Testing conditional independence via empirical likelihood. Journal of Econometrics, 182(1):27–44, 2014.
Székely and Rizzo [2009] G. J. Székely and M. L. Rizzo. Brownian distance covariance. Annals of Applied Statistics, 3(4):1236–1265, 2009.
Székely and Rizzo [2014] G. J. Székely and M. L. Rizzo. Partial distance correlation with methods for dissimilarities. Annals of Statistics, 42(6):2382–2412, 2014.
Székely et al. [2007] G. J. Székely, M. L. Rizzo, and N. K. Bakirov. Measuring and testing dependence by correlation of distances. Annals of Statistics, 35(6):2769–2794, 2007.
Tibshirani [1996] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
Veraverbeke et al. [2011] N. Veraverbeke, M. Omelka, and I. Gijbels. Estimation of a conditional copula and association measures. Scandinavian Journal of Statistics, 38(4):766–780, 2011.
Vergara and Estévez [2014] J. R. Vergara and P. A. Estévez. A review of feature selection methods based on mutual information. Neural Computing and Applications, 24(1):175–186, 2014.
Wang et al. [2015] X. Wang, W. Pan, W. Hu, Y. Tian, and H. Zhang. Conditional distance correlation. Journal of the American Statistical Association, 110(512):1726–1734, 2015.
Wang et al. [2017] X. Wang, B. Jiang, and J. S. Liu. Generalized R-squared for detecting dependence. Biometrika, 104(1):129–139, 2017.
Weihs et al. [2016] L. Weihs, M. Drton, and D. Leung. Efficient computation of the Bergsma–Dassios sign covariance. Computational Statistics, 31(1):315–328, 2016.
Weihs et al. [2018] L. Weihs, M. Drton, and N. Meinshausen. Symmetric rank covariances: a generalized framework for nonparametric measures of dependence. Biometrika, 105(3):547–562, 2018.
Yanagimoto [1970] T. Yanagimoto. On measures of association and a related problem. Annals of the Institute of Statistical Mathematics, 22(1):57–63, 1970.
Yuan and Lin [2006] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
Zhang [2019] K. Zhang. BET on independence. Journal of the American Statistical Association, 114(528):1620–1637, 2019.
Zhang et al. [2012] K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernel-based conditional independence test and application in causal discovery. arXiv preprint arXiv:1202.3775, 2012.
Zhang [2022] Q. Zhang. On the asymptotic distribution of the symmetrized Chatterjee’s correlation coefficient. arXiv preprint arXiv:2205.01769, 2022.
Zhang et al. [2018] Q. Zhang, S. Filippi, A. Gretton, and D. Sejdinovic. Large-scale kernel methods for independence testing. Statistics and Computing, 28(1):113–130, 2018.
Zhu et al. [2017] L. Zhu, K. Xu, R. Li, and W. Zhong. Projection correlation between two random vectors. Biometrika, 104(4):829–843, 2017.
Zou [2006] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
Zou and Hastie [2005] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.

	$\displaystyle\xi(W^{\prime},Y^{\prime})-\xi(X^{\prime},Y^{\prime})$	$\displaystyle=\frac{\int\mathrm{Var}(\mathbb{P}(Y^{\prime}\geq t\|W^{\prime}))d\mu(t)-\int\mathrm{Var}(\mathbb{P}(Y^{\prime}\geq t\|X^{\prime}))d\mu(t)}{\int\mathrm{Var}(1_{\{Y^{\prime}\geq t\}})d\mu(t)}$
		$\displaystyle=\frac{\int\mathrm{Var}(\mathbb{P}(Y^{\prime}\geq t\|X^{\prime},Z^{\prime}))d\mu(t)-\int\mathrm{Var}(\mathbb{P}(Y^{\prime}\geq t\|X^{\prime}))d\mu(t)}{\int\mathrm{Var}(1_{\{Y^{\prime}\geq t\}})d\mu(t)}$
		$\displaystyle=\frac{\int\mathbb{E}(\mathrm{Var}(\mathbb{P}(Y^{\prime}\geq t\|X^{\prime},Z^{\prime})\|X^{\prime})d\mu(t)}{\int\mathrm{Var}(1_{\{Y^{\prime}\geq t\}})d\mu(t)}.$