
A survey of some recent developments in measures of association

Sourav Chatterjee (Stanford University). Mailing address: Department of Statistics, Stanford University, 390 Jane Stanford Way, Stanford, CA 94305, USA. Email: [email protected]. The author was partially supported by NSF grants DMS-2113242 and DMS-2153654. The author thanks Nabarun Deb, Fang Han and Bodhisattva Sen for helpful comments on a preliminary draft.
Abstract

This paper surveys some recent developments in measures of association related to a new coefficient of correlation introduced by the author. A straightforward extension of this coefficient to standard Borel spaces (a class that includes all Polish spaces), overlooked in the literature so far, is proposed at the end of the survey.

Key words and phrases. Correlation, dependence, measures of association, standard Borel space
2020 Mathematics Subject Classification. 62H20, 62H15.

In honor of friend and teacher Prof. Rajeeva L. Karandikar on the occasion of his 65th birthday.

1 Introduction

Measuring associations between variables is one of the central goals of data analysis. Arguably, the three most popular classical measures of association are Pearson's correlation coefficient, Spearman's $\rho$, and Kendall's $\tau$. Although these coefficients are powerful for detecting monotonic associations, a practical problem is that they are not effective for detecting associations that are not monotonic. There have been many proposals to address this deficiency of the classical coefficients [66], such as the maximal correlation coefficient [59, 45, 92, 17], various coefficients based on joint cumulative distribution functions and ranks [43, 34, 54, 8, 83, 123, 124, 125, 29, 89, 61, 13, 94, 95, 30, 122], kernel-based methods [86, 48, 50, 99, 130], information theoretic coefficients [71, 74, 93], coefficients based on copulas [32, 76, 108, 98, 127], and coefficients based on pairwise distances [117, 115, 58, 41, 77].

This survey is about some recent developments in this area, beginning with a new coefficient of correlation proposed by the author in the paper [22]. This coefficient has the following desirable features: (a) It has a simple expression, like the classical coefficients. (b) It is a consistent estimator of a measure of dependence which is $0$ if and only if the variables are independent and $1$ if and only if one is a measurable function of the other. (c) It has a simple asymptotic theory under the hypothesis of independence.

The new coefficient is defined as follows. Let $(X,Y)$ be a pair of random variables defined on the same probability space, where $Y$ is not a constant. Let $(X_1,Y_1),\ldots,(X_n,Y_n)$ be i.i.d. pairs of random variables with the same law as $(X,Y)$, where $n\geq 2$. Rearrange the data as $(X_{(1)},Y_{(1)}),\ldots,(X_{(n)},Y_{(n)})$ such that $X_{(1)}\leq\cdots\leq X_{(n)}$. (Note that $Y_{(i)}$ is just the $Y$-value 'paired with' $X_{(i)}$ in the original data, and not the $i^{th}$ order statistic of the $Y$'s.) If there are ties among the $X_i$'s, then choose an increasing rearrangement as above by breaking ties uniformly at random. Let $r_i$ be the number of $j$ such that $Y_{(j)}\leq Y_{(i)}$, and let $l_i$ be the number of $j$ such that $Y_{(j)}\geq Y_{(i)}$. Then define

\[
\xi_n(X,Y) := 1 - \frac{n\sum_{i=1}^{n-1}|r_{i+1}-r_i|}{2\sum_{i=1}^{n} l_i(n-l_i)}. \tag{1.1}
\]

This is the correlation coefficient proposed in [22]. When there are no ties among the $Y_i$'s, $l_1,\ldots,l_n$ is just a permutation of $1,\ldots,n$, and so the denominator in the above expression is just $n(n^2-1)/3$. The following theorem is the main consistency result for $\xi_n$.

Theorem 1.1 ([22]).

If $Y$ is not almost surely a constant, then as $n\to\infty$, $\xi_n(X,Y)$ converges almost surely to the deterministic limit

\[
\xi(X,Y) := \frac{\int \mathrm{Var}(\mathbb{E}(1_{\{Y\geq t\}}\,|\,X))\,d\mu(t)}{\int \mathrm{Var}(1_{\{Y\geq t\}})\,d\mu(t)}, \tag{1.2}
\]

where $\mu$ is the law of $Y$. This limit belongs to the interval $[0,1]$. It is $0$ if and only if $X$ and $Y$ are independent, and it is $1$ if and only if there is a measurable function $f:\mathbb{R}\to\mathbb{R}$ such that $Y=f(X)$ almost surely.
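To make the definition concrete, here is a minimal Python sketch of the coefficient, following equation (1.1) directly. The function name and the use of NumPy are illustrative choices, not part of the original proposal, and the ranks are computed in $O(n^2)$ time for clarity, although an $O(n\log n)$ implementation via sorting is straightforward.

```python
import numpy as np

def xi_n(x, y, rng=np.random.default_rng()):
    """Compute xi_n(X, Y) as in equation (1.1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Sort by X, breaking ties among the X's uniformly at random.
    order = np.lexsort((rng.random(n), x))
    ys = y[order]
    # r_i = #{j : Y_(j) <= Y_(i)},  l_i = #{j : Y_(j) >= Y_(i)}.
    r = np.array([np.sum(ys <= yi) for yi in ys])
    l = np.array([np.sum(ys >= yi) for yi in ys])
    return 1.0 - n * np.sum(np.abs(np.diff(r))) / (2.0 * np.sum(l * (n - l)))
```

When there are no ties among the $Y_i$'s, the denominator computed here equals $n(n^2-1)/3$, matching the simplification noted above.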

The limiting value $\xi(X,Y)$ appeared in the literature prior to [22], in a paper of Dette, Siburg, and Stoimenov [32] (see also [43, 70]). The paper [32] gave a copula-based estimator for $\xi(X,Y)$ when $X$ and $Y$ are continuous, which is consistent under smoothness assumptions on the copula and is computable in time $n^{5/3}$ for an optimal choice of tuning parameters.

Note that neither $\xi_n$ nor the limiting value $\xi$ is symmetric in $X$ and $Y$. A symmetrized version of $\xi_n$ can be constructed by taking the maximum of $\xi_n(X,Y)$ and $\xi_n(Y,X)$.
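In code, the symmetrized coefficient is a one-liner on top of the xi_n sketch above:

```python
def xi_sym(x, y):
    # Maximum of the two directional coefficients.
    return max(xi_n(x, y), xi_n(y, x))
```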

2 Why does it work?

The complete proof of Theorem 1.1 is available in the supplementary materials of [22], and also in the arXiv version of the paper. It is not too hard to see why $\xi(X,Y)$ has the properties listed in Theorem 1.1, although filling in the details takes some work. The proof of convergence of $\xi_n(X,Y)$ to $\xi(X,Y)$ is less obvious. The following is a very rough sketch of the proof, reproduced from a similar discussion in [22].

For simplicity, consider only the case of continuous $X$ and $Y$, where the denominator in (1.1) is simply $n(n^2-1)/3$. First, note that by the Glivenko–Cantelli theorem, $r_i/n \approx F(Y_{(i)})$, where $F$ is the cumulative distribution function of $Y$. Thus,

\[
\xi_n(X,Y) \approx 1 - \frac{3}{n}\sum_{i=1}^{n} |F(Y_i) - F(Y_{N(i)})|, \tag{2.1}
\]

where $N(i)$ is the unique index $j$ such that $X_j$ is immediately to the right of $X_i$ if we arrange the $X$'s in increasing order. If $X_i$ is the rightmost value, define $N(i)$ arbitrarily; it does not matter since the contribution of a single term in the above sum is of order $1/n$. Next, observe that for any $x,y\in\mathbb{R}$,

\[
|F(x)-F(y)| = \int (1_{\{t\leq x\}} - 1_{\{t\leq y\}})^2 \, d\mu(t), \tag{2.2}
\]

where $\mu$ is the law of $Y$. This is true because the integrand is $1$ between $x$ and $y$ and $0$ outside.

Suppose that we condition on $X_1,\ldots,X_n$. Since $X_i$ is likely to be very close to $X_{N(i)}$, the random variables $Y_i$ and $Y_{N(i)}$ are likely to be approximately i.i.d. after this conditioning. This leads to the approximation

\[
\begin{aligned}
\mathbb{E}[(1_{\{t\leq Y_i\}} - 1_{\{t\leq Y_{N(i)}\}})^2 \,|\, X_1,\ldots,X_n]
&\approx 2\,\mathrm{Var}(1_{\{t\leq Y_i\}} \,|\, X_1,\ldots,X_n) \\
&= 2\,\mathrm{Var}(1_{\{t\leq Y_i\}} \,|\, X_i).
\end{aligned}
\]

This gives

\[
\begin{aligned}
\mathbb{E}(1_{\{t\leq Y_i\}} - 1_{\{t\leq Y_{N(i)}\}})^2
&\approx 2\,\mathbb{E}[\mathrm{Var}(1_{\{t\leq Y\}} \,|\, X)] \\
&= 2\,\mathrm{Var}(1_{\{t\leq Y\}}) - 2\,\mathrm{Var}(\mathbb{E}(1_{\{t\leq Y\}} \,|\, X)).
\end{aligned}
\]

Combining this with (2.2), we get

\[
\mathbb{E}|F(Y_i) - F(Y_{N(i)})| \approx \int 2\,[\mathrm{Var}(1_{\{t\leq Y\}}) - \mathrm{Var}(\mathbb{E}(1_{\{t\leq Y\}} \,|\, X))]\, d\mu(t).
\]

But note that $\mathrm{Var}(1_{\{t\leq Y\}}) = F(t)(1-F(t))$, and $F(Y) \sim \textup{Uniform}[0,1]$. Thus,

\[
\int \mathrm{Var}(1_{\{t\leq Y\}})\, d\mu(t) = \int F(t)(1-F(t))\, d\mu(t) = \int_0^1 x(1-x)\, dx = \frac{1}{6}.
\]

Therefore by (2.1),

\[
\mathbb{E}(\xi_n(X,Y)) \approx 6\int \mathrm{Var}(\mathbb{E}(1_{\{t\leq Y\}} \,|\, X))\, d\mu(t) = \xi(X,Y),
\]

where the last identity holds because $\int \mathrm{Var}(1_{\{t\leq Y\}})\, d\mu(t) = 1/6$, as shown above. This establishes the convergence of $\mathbb{E}(\xi_n(X,Y))$ to $\xi(X,Y)$. Concentration inequalities are then used to show that $\xi_n(X,Y) - \mathbb{E}(\xi_n(X,Y)) \to 0$ almost surely.
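As a quick numerical illustration of this convergence, one can reuse the xi_n sketch from Section 1 (the example and all parameters here are mine, chosen only to show the qualitative behavior):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.random(n)
# Noisy oscillatory signal: the limit xi(X, Y) is strictly between 0 and 1.
y = np.sin(8 * np.pi * x) + 0.1 * rng.normal(size=n)
print(xi_n(x, y))              # dependent case: substantially positive
print(xi_n(x, rng.random(n)))  # independent case: close to 0
```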

3 Asymptotic distribution

Let $X$, $Y$ and $\xi_n$ be as in the previous section. For each $t\in\mathbb{R}$, let $F(t) := \mathbb{P}(Y\leq t)$ and $G(t) := \mathbb{P}(Y\geq t)$. Let $\phi(y,y') := \min\{F(y), F(y')\}$. Define

\[
\tau^2 = \frac{\mathbb{E}\phi(Y_1,Y_2)^2 - 2\,\mathbb{E}(\phi(Y_1,Y_2)\phi(Y_1,Y_3)) + (\mathbb{E}\phi(Y_1,Y_2))^2}{(\mathbb{E}\, G(Y)(1-G(Y)))^2}, \tag{3.1}
\]

where $Y_1, Y_2, Y_3$ are independent copies of $Y$. The following theorem gives the limiting distribution of $\xi_n$ under the null hypothesis that $X$ and $Y$ are independent.

Theorem 3.1 ([22]).

Suppose that $X$ and $Y$ are independent. Then $\sqrt{n}\,\xi_n(X,Y)$ converges to $N(0,\tau^2)$ in distribution as $n\to\infty$, where $\tau^2$ is given by the formula (3.1) stated above. The number $\tau^2$ is strictly positive if $Y$ is not a constant, and equals $2/5$ if $Y$ is continuous.

The reason why $\tau^2$ does not depend on the law of $Y$ if $Y$ is continuous is that in this case $F(Y)$ and $G(Y)$ are Uniform$[0,1]$ random variables, which implies that the expectations in (3.1) do not depend on the law of $Y$. If $Y$ is not continuous, then $\tau^2$ may depend on the law of $Y$. For example, it is not hard to show that if $Y$ is a Bernoulli$(1/2)$ random variable, then $\tau^2 = 1$. Fortunately, if $Y$ is not continuous, there is a simple way to estimate $\tau^2$ from the data using the estimator

\[
\widehat{\tau}_n^2 = \frac{a_n - 2b_n + c_n^2}{d_n^2},
\]

where $a_n$, $b_n$, $c_n$ and $d_n$ are defined as follows. For each $i$, let

\[
R(i) := \#\{j : Y_j \leq Y_i\}, \qquad L(i) := \#\{j : Y_j \geq Y_i\}. \tag{3.2}
\]

(Note that $R(i)$ and $L(i)$ are different from $r_i$ and $l_i$ defined earlier.) Let $u_1 \leq u_2 \leq \cdots \leq u_n$ be an increasing rearrangement of $R(1),\ldots,R(n)$. Let $v_i := \sum_{j=1}^{i} u_j$ for $i=1,\ldots,n$. Define

\[
\begin{aligned}
&a_n := \frac{1}{n^4}\sum_{i=1}^{n} (2n - 2i + 1)\, u_i^2, \qquad b_n := \frac{1}{n^5}\sum_{i=1}^{n} (v_i + (n-i)u_i)^2, \\
&c_n := \frac{1}{n^3}\sum_{i=1}^{n} (2n - 2i + 1)\, u_i, \qquad d_n := \frac{1}{n^3}\sum_{i=1}^{n} L(i)(n - L(i)).
\end{aligned}
\]

Then the following holds.

Theorem 3.2 ([22]).

The estimator $\widehat{\tau}_n^2$ can be computed in time $O(n\log n)$, and converges to $\tau^2$ almost surely as $n\to\infty$.
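The formulas above translate directly into code. The following Python sketch (the function name is mine) computes $\widehat{\tau}_n^2$; for simplicity the ranks $R(i)$ and $L(i)$ are obtained in $O(n^2)$ time, though a sorting-based $O(n\log n)$ implementation is straightforward.

```python
import numpy as np

def tau_hat_sq(y):
    """Estimate tau^2, the asymptotic variance of sqrt(n) * xi_n under
    independence, from the formulas for a_n, b_n, c_n, d_n."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    R = np.array([np.sum(y <= yi) for yi in y])  # R(i) = #{j : Y_j <= Y_i}
    L = np.array([np.sum(y >= yi) for yi in y])  # L(i) = #{j : Y_j >= Y_i}
    u = np.sort(R)                               # increasing rearrangement
    v = np.cumsum(u)                             # v_i = u_1 + ... + u_i
    i = np.arange(1, n + 1)
    a = np.sum((2 * n - 2 * i + 1) * u**2) / n**4
    b = np.sum((v + (n - i) * u)**2) / n**5
    c = np.sum((2 * n - 2 * i + 1) * u) / n**3
    d = np.sum(L * (n - L)) / n**3
    return (a - 2 * b + c**2) / d**2
```

As a sanity check, when all the $Y_i$'s are distinct the output approaches $2/5$ as $n$ grows, in agreement with Theorem 3.1.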

The question of proving a central limit theorem for $\xi_n$ in the absence of independence is much more difficult than the independent case. This was left as an open question in [22] and recently resolved in complete generality by Lin and Han [73], following earlier proofs of Deb, Ghosal, and Sen [31] and Shi, Drton, and Han [104] under additional assumptions. Lin and Han [73] also give a consistent estimator of the asymptotic variance of $\xi_n$ in the absence of independence, solving another question that was left open in [22]. A central limit theorem for the symmetrized version of $\xi_n$ (defined as the maximum of $\xi_n(X,Y)$ and $\xi_n(Y,X)$) under the hypothesis of independence was proved by Zhang [129].

4 Power for testing independence

A deficiency of $\xi_n$, as already pointed out in [22] through simulated examples, is that it has low power for testing independence against 'standard' alternatives, such as linear or monotone associations. This was theoretically confirmed by Shi, Drton, and Han [105], where it was shown that the test of independence using $\xi_n$ is rate-suboptimal against a family of local alternatives, whereas three other nonparametric tests of independence proposed in [61, 13, 8, 125] are rate-optimal. Like $\xi_n$, the three competing test statistics considered in [105] are also computable in $O(n\log n)$ time. Similar results were obtained for a different type of competing test statistic by Cao and Bickel [21].

A more detailed analysis of the power properties of $\xi_n$ was carried out by Auddy, Deb, and Nandy [2], where the asymptotic distribution of $\xi_n$ under any changing sequence of alternatives converging to the null hypothesis of independence was computed. This analysis yielded exact detection thresholds and limiting power under natural alternatives converging to the null, such as mixture models, rotation models and noisy nonparametric regression. The detection boundary lies at distance $n^{-1/4}$ from the null, instead of the more standard $n^{-1/2}$. This is similar to the power properties of other 'graph-based' statistics for testing independence, such as the Friedman–Rafsky statistic [41, 11].

A method for 'boosting' the power of $\xi_n$ for testing independence, by incorporating multiple nearby ranks instead of only the nearest ones, was recently proposed by Lin and Han [72]. This modified estimator was shown to attain near-optimal rates of power against certain classes of alternative hypotheses.

The conceptual reason behind the absence of local power of statistics such as $\xi_n$ was explained by Bickel [12]. An interesting question that remains unexplained is the following. It is seen in simulations that although $\xi_n$ has low power for testing independence against standard alternatives such as linear and monotone, it becomes more powerful as the signal starts to get more and more oscillatory [22]. This gives an advantage over other coefficients in applications where oscillatory signals arise naturally [25, 97], and suggests that $\xi_n$ may be efficient for certain kinds of local alternatives. No result of this sort has yet been proven.

5 Multivariate extensions

Many methods have been proposed for testing independence nonparametrically in the multivariate setting. This includes classical tests [88, 49, 117] as well as a flurry of recent ones proposed in the last ten years [57, 56, 58, 131, 124, 68, 30, 106, 10, 107, 31].

Most of these papers are concerned only with testing independence, and not with measuring the strength of dependence through a correlation coefficient such as $\xi_n$. Unfortunately, the $\xi_n$ coefficient and many other popular univariate coefficients do not readily generalize to the multivariate setting because they are based on ranks. Statisticians have started taking a new look at this old problem in recent years by considering a multivariate notion of rank defined using optimal transport. Roughly speaking, the idea is as follows. Let $\nu$ be a 'reference measure' on $\mathbb{R}^d$, akin to the uniform distribution on $[0,1]$ in $\mathbb{R}$. Given any probability measure $\mu$ on $\mathbb{R}^d$, let $F^\mu : \mathbb{R}^d \to \mathbb{R}^d$ be the map that 'optimally transports' $\mu$ to $\nu$ — that is, if $X\sim\mu$ then $F^\mu(X)\sim\nu$, and $F^\mu$ minimizes $\mathbb{E}\|X - F(X)\|^2$ among all such $F$. By a theorem of McCann [80], such a map exists and is unique if $\mu$ and $\nu$ are both absolutely continuous with respect to Lebesgue measure. For example, when $d=1$, $F^\mu$ is just the cumulative distribution function of $X$, which transforms $X$ into a Uniform$[0,1]$ random variable. For properties of this map, see, e.g., Figalli [38] and Hallin et al. [51].

The above idea suggests a natural definition of multivariate rank. If $X_1,\ldots,X_n$ are i.i.d. samples from $\mu$, one can try to estimate $F^\mu$ using this data. Let $F_n^\mu$ be such an estimate. Then $F_n^\mu(X_i)$ can act as a 'multivariate rank' of $X_i$ among $X_1,\ldots,X_n$, divided by $n$. Since $F^\mu(X_i)\sim\nu$, we can assume that $F_n^\mu(X_i)$ is approximately distributed according to $\nu$, and then test for independence of random vectors using a test that works when the marginal distributions are both $\nu$. This idea has been made precise in a number of recent works in the statistics literature, such as Chernozhukov et al. [27], Deb, Ghosal, and Sen [31], Deb and Sen [30], Hallin et al. [51], Manole et al. [78], Shi, Drton, and Han [106], Ghosal and Sen [47], Mordant and Segers [82] and Shi et al. [107, 103]. For a survey and further discussions, see Han [52].
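To illustrate the idea, here is a minimal Python sketch of empirical multivariate ranks computed by optimally matching the sample to a fixed set of reference points with the Hungarian algorithm (scipy's linear_sum_assignment). The choice of a grid as the reference set and all names here are illustrative assumptions, not prescriptions from the papers cited above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def empirical_ot_ranks(x, ref):
    """Match each sample point x[i] to a reference point ref[j],
    minimizing total squared Euclidean cost (empirical optimal
    transport between two point clouds of equal size)."""
    cost = ((x[:, None, :] - ref[None, :, :]) ** 2).sum(axis=2)
    row, col = linear_sum_assignment(cost)
    ranks = np.empty_like(ref)
    ranks[row] = ref[col]
    return ranks  # ranks[i] plays the role of F_n^mu(x[i])

# Example: sample in R^2, reference = uniform grid on [0, 1]^2.
rng = np.random.default_rng(0)
m = 10  # grid side; sample size is m**2
grid = np.stack(np.meshgrid(*(np.linspace(0, 1, m),) * 2), -1).reshape(-1, 2)
sample = rng.normal(size=(m**2, 2))
ranks = empirical_ot_ranks(sample, grid)
```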

A direct generalization of $\xi_n$ to higher dimensional spaces has not been proposed so far, although the variant proposed in Deb et al. [31] satisfies the same properties as $\xi_n$ provided that the space on which $Y$ takes values admits a nonnegative definite kernel and the space on which $X$ takes values has a metric. This covers most spaces that arise in practice. There are a couple of other generalizations, proposed by Azadkia and Chatterjee [3] and Gamboa et al. [44], on measuring the dependence between a univariate random variable $Y$ and a random vector $\mathbf{X}$. The coefficient proposed in [3] (discussed in detail in Section 6) is based on a generalization of the ideas behind the construction of $\xi_n$. The coefficient proposed in [44] combines the construction of $\xi_n$ with ideas from the theory of Sobol indices. A new contribution of the present paper is a simple generalization of $\xi_n$ to standard Borel spaces, which has been overlooked in the literature until now. This is presented in Section 8.

6 Measuring conditional dependence

The problem of measuring conditional dependence has received less attention than the problem of measuring unconditional dependence, partly because it is a more difficult task. Non-parametric conditional independence can be tested for discrete data using the classical Cochran–Mantel–Haenszel test [28, 79], which can be adapted for continuous random variables by binning the data [62] or using kernels [42, 128, 111, 33, 100]. Besides these, there are methods based on estimating conditional cumulative distribution functions [75, 85], conditional characteristic functions [112, 67], conditional probability density functions [113], empirical likelihood [114], mutual information and entropy [96, 65, 87], copulas [7, 109, 119], distance correlation [121, 37, 116], and other approaches [101]. A number of interesting ideas based on resampling and permutation tests have been proposed in recent years [20, 100, 9].

In Azadkia and Chatterjee [3], a new coefficient of conditional dependence was proposed, based on the ideas behind the $\xi$-coefficient defined in [22]. Like the $\xi$-coefficient, this one also has a long list of desirable features, such as being fully nonparametric and working under minimal assumptions. The coefficient is defined as follows.

Let $Y$ be a random variable and $\mathbf{X} = (X_1,\ldots,X_p)$ and $\mathbf{Z} = (Z_1,\ldots,Z_q)$ be random vectors, all defined on the same probability space. Here $q\geq 1$ and $p\geq 0$. The value $p=0$ means that $\mathbf{X}$ has no components at all. Let $\mu$ be the law of $Y$. The following quantity was proposed in [3] as a measure of the degree of conditional dependence of $Y$ and $\mathbf{Z}$ given $\mathbf{X}$:

\[
T = T(Y,\mathbf{Z}\,|\,\mathbf{X}) := \frac{\int \mathbb{E}(\mathrm{Var}(\mathbb{P}(Y\geq t\,|\,\mathbf{Z},\mathbf{X})\,|\,\mathbf{X}))\, d\mu(t)}{\int \mathbb{E}(\mathrm{Var}(1_{\{Y\geq t\}}\,|\,\mathbf{X}))\, d\mu(t)}. \tag{6.1}
\]

If the denominator equals zero, $T$ is undefined. If $p=0$, then $\mathbf{X}$ has no components, and the conditional expectations and variances given $\mathbf{X}$ should be interpreted as unconditional expectations and variances. In this case we write $T(Y,\mathbf{Z})$ instead of $T(Y,\mathbf{Z}\,|\,\mathbf{X})$. Note that $T$ is a generalization of the statistic $\xi$ appearing in Theorem 1.1. The following theorem summarizes the main properties of $T$.

Theorem 6.1 ([3]).

Suppose that $Y$ is not almost surely equal to a measurable function of $\mathbf{X}$ (when $p=0$, this means that $Y$ is not almost surely a constant). Then $T$ is well-defined and $0\leq T\leq 1$. Moreover, $T=0$ if and only if $Y$ and $\mathbf{Z}$ are conditionally independent given $\mathbf{X}$, and $T=1$ if and only if $Y$ is almost surely equal to a measurable function of $\mathbf{Z}$ given $\mathbf{X}$. When $p=0$, conditional independence given $\mathbf{X}$ simply means unconditional independence.

Now suppose we have data consisting of $n$ i.i.d. copies $(Y_1,\mathbf{X}_1,\mathbf{Z}_1),\ldots,(Y_n,\mathbf{X}_n,\mathbf{Z}_n)$ of the triple $(Y,\mathbf{X},\mathbf{Z})$, where $n\geq 2$. For each $i$, let $N(i)$ be the index $j$ such that $\mathbf{X}_j$ is the nearest neighbor of $\mathbf{X}_i$ with respect to the Euclidean metric on $\mathbb{R}^p$, where ties are broken uniformly at random. Let $M(i)$ be the index $j$ such that $(\mathbf{X}_j,\mathbf{Z}_j)$ is the nearest neighbor of $(\mathbf{X}_i,\mathbf{Z}_i)$ in $\mathbb{R}^{p+q}$, again with ties broken uniformly at random. Let $R_i$ be the rank of $Y_i$, that is, the number of $j$ such that $Y_j\leq Y_i$. If $p\geq 1$, define

\[
T_n = T_n(Y,\mathbf{Z}\,|\,\mathbf{X}) := \frac{\sum_{i=1}^{n} (\min\{R_i, R_{M(i)}\} - \min\{R_i, R_{N(i)}\})}{\sum_{i=1}^{n} (R_i - \min\{R_i, R_{N(i)}\})}.
\]

If $p=0$, let $L_i$ be the number of $j$ such that $Y_j\geq Y_i$, let $M(i)$ denote the $j$ such that $\mathbf{Z}_j$ is the nearest neighbor of $\mathbf{Z}_i$ (ties broken uniformly at random), and let

\[
T_n = T_n(Y,\mathbf{Z}) := \frac{\sum_{i=1}^{n} (n \min\{R_i, R_{M(i)}\} - L_i^2)}{\sum_{i=1}^{n} L_i(n - L_i)}.
\]

In both cases, $T_n$ is undefined if the denominator is zero. The following theorem shows that $T_n$ is a consistent estimator of $T$.

Theorem 6.2 ([3]).

Suppose that YY is not almost surely equal to a measurable function of 𝐗\mathbf{X}. Then as nn\to\infty, TnTT_{n}\to T almost surely.

For various other properties of $T_n$, such as rate of convergence, performance in simulations and real data, etc., see [3]. One problem that was left unsolved in [3] was the question of proving a central limit theorem for $T_n$ under the null hypothesis, which is crucial for carrying out tests for conditional independence. This question was partially resolved by Shi, Drton, and Han [104], who proved a central limit theorem for $T_n$ under the assumption that $Y$ is independent of $(\mathbf{X},\mathbf{Z})$. An improved version of this result was proved recently by Lin and Han [73]. A version for data supported on manifolds was proved by Han and Huang [53].

The paper of Shi et al. [104] also develops the 'conditional randomization test' (CRT) framework of Candès et al. [20] to test conditional independence using $T_n$, and finds that $T_n$, like $\xi_n$, is an inefficient test statistic. To address this concern, an improved generalization of $T_n$, called 'kernel partial correlation' (KPC), was proposed by Huang, Deb, and Sen [63]. Unlike $T_n$, KPC has the flexibility to use more than one nearest neighbor, which gives it better power properties.

Note that by the above theorems, a test of conditional independence based on $T_n$ is consistent against all alternatives. The problem is that in the absence of an asymptotic theory for $T_n$, it is difficult to control the significance level of such a test. This is in fact an impossible problem, by a recent result of Shah and Peters [102] that proves hardness of conditional independence testing in the absence of smoothness assumptions. Assuming some degree of smoothness, minimax optimal conditional independence tests were recently constructed by Neykov, Balakrishnan, and Wasserman [84] and Kim et al. [69].

7 Application to nonparametric variable selection

The commonly used variable selection methods in the statistics literature rely on linear or additive models. This includes classical methods [14, 46, 26, 118, 35, 40, 55, 81] as well as modern ones [19, 132, 133, 126, 36, 91]. These methods are powerful and widely used in practice. However, they sometimes run into problems when significant interaction effects or nonlinearities are present. Such problems can be overcome by model-free methods [20, 60, 1, 15, 39, 55, 18, 6, 120, 16]. On the flip side, the theoretical foundations of model-free methods are usually weaker than those of model-based methods.

In an attempt to combine the best of both worlds, a new method of variable selection, called Feature Ordering by Conditional Independence (FOCI), was proposed in Azadkia and Chatterjee [3]. This method uses the conditional dependence coefficient $T_n$ described in the previous section in a stepwise fashion, as follows. Let $Y$ be the response variable and let $\mathbf{X} = (X_j)_{1\leq j\leq p}$ be the set of predictors. The data consists of $n$ i.i.d. copies of $(Y,\mathbf{X})$. First, choose $j_1$ to be the index $j$ that maximizes $T_n(Y, X_j)$. Having obtained $j_1,\ldots,j_k$, choose $j_{k+1}$ to be the index $j\notin\{j_1,\ldots,j_k\}$ that maximizes $T_n(Y, X_j\,|\,X_{j_1},\ldots,X_{j_k})$. Continue like this until arriving at the first $k$ such that $T_n(Y, X_{j_{k+1}}\,|\,X_{j_1},\ldots,X_{j_k})\leq 0$, and then declare the chosen subset to be $\widehat{S} := \{j_1,\ldots,j_k\}$. If there is no such $k$, define $\widehat{S}$ to be the whole set of variables. It may also happen that $T_n(Y, X_{j_1})\leq 0$. In that case declare $\widehat{S}$ to be empty. Note that this variable selection procedure involves no choice of tuning parameters, which may be an advantage in practice. A sketch of the procedure in code is given below.
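The following Python sketch assumes the function T_n from the sketch after Theorem 6.2 (with x=None giving the unconditional coefficient); it illustrates the control flow of the algorithm and is not the reference implementation from the FOCI package.

```python
import numpy as np

def foci(y, X):
    """Greedy FOCI variable selection. y: (n,), X: (n, p).
    Returns the list of selected column indices (possibly empty)."""
    n, p = X.shape
    selected = []
    while len(selected) < p:
        remaining = [j for j in range(p) if j not in selected]
        cond = X[:, selected] if selected else None
        # Score each remaining predictor given the current selection.
        scores = {j: T_n(y, X[:, [j]], cond) for j in remaining}
        best = max(scores, key=scores.get)
        if scores[best] <= 0:
            break  # no remaining variable adds information; stop
        selected.append(best)
    return selected
```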

It was shown in [3] that under mild conditions, the method selects a 'correct' set of variables with high probability. More precisely, it was shown that with high probability, the set $\widehat{S}$ selected by FOCI has the property that $Y$ and $(X_j)_{j\notin\widehat{S}}$ are conditionally independent given $(X_j)_{j\in\widehat{S}}$. In other words, all the information about $Y$ that one can get from $\mathbf{X}$ is contained in $(X_j)_{j\in\widehat{S}}$. For further properties of FOCI and its performance in simulations and real data sets, see [3].

An improved version of FOCI called KFOCI (‘Kernel FOCI’) was proposed by Huang et al. [63]. An application of FOCI to causal inference, via an algorithm named DAG-FOCI, was introduced in Azadkia, Taeb, and Bühlmann [5]. For another application to causal inference, see Chatterjee and Vidyasagar [24].

8 A new proposal: Generalization to standard Borel spaces

In this section, a simple but wide-ranging generalization of $\xi_n$ is proposed. In hindsight, this generalization seems obvious, but somehow it was overlooked both in the original paper [22] as well as in the subsequent developments listed in Section 5.

Recall that two measurable spaces are said to be isomorphic to each other if there is a bijection between the two spaces which is measurable and whose inverse is also measurable. Recall that a standard Borel space is a measurable space that is isomorphic to a Borel subset of a Polish space [110, Chapter 3]. In particular, every Borel subset of every Polish space is a standard Borel space. The Borel isomorphism theorem says that any uncountable standard Borel space is isomorphic to the real line (see Rao and Srivastava [90] for an elementary proof). In particular, if $\mathcal{X}$ is any standard Borel space, there is a measurable map $\varphi : \mathcal{X}\to\mathbb{R}$ such that $\varphi$ is injective, $\varphi(\mathcal{X})$ is Borel, and $\varphi^{-1}$ is measurable on $\varphi(\mathcal{X})$. We will say that $\varphi$ is an isomorphism between $\mathcal{X}$ and a Borel subset of the real line, or simply, a 'Borel isomorphism'.

Now let $\mathcal{X}$ and $\mathcal{Y}$ be two standard Borel spaces. Let $\varphi$ be a Borel isomorphism of $\mathcal{X}$ and $\psi$ be a Borel isomorphism of $\mathcal{Y}$. Let $(X,Y)$ be an $\mathcal{X}\times\mathcal{Y}$-valued pair of random variables, and let $(X_1,Y_1),\ldots,(X_n,Y_n)$ be i.i.d. copies of $(X,Y)$. Let $X' := \varphi(X)$ and $Y' := \psi(Y)$, so that $(X',Y')$ is a pair of real-valued random variables. Let $X_i' := \varphi(X_i)$ and $Y_i' := \psi(Y_i)$ for each $i$. Finally, define

\[
\xi_n(X,Y) := \xi_n(X',Y'),
\]

where $\xi_n(X',Y')$ is defined using $(X_1',Y_1'),\ldots,(X_n',Y_n')$ as in equation (1.1). Note that the definition of $\xi_n(X,Y)$ depends on our choices of $\varphi$ and $\psi$. Different choices of isomorphisms would lead to different definitions of $\xi_n$. The following theorem generalizes Theorem 1.1.

Theorem 8.1.

If $Y$ is not almost surely a constant, then as $n\to\infty$, $\xi_n(X,Y)$ converges almost surely to the deterministic limit $\xi(X,Y)$, which equals $\xi(X',Y')$, defined as in (1.2) with $X'$ and $Y'$ in place of $X$ and $Y$. This limit belongs to the interval $[0,1]$. It is $0$ if and only if $X$ and $Y$ are independent, and it is $1$ if and only if there is a measurable function $f:\mathcal{X}\to\mathcal{Y}$ such that $Y = f(X)$ almost surely. Moreover, the asymptotic distribution of $\sqrt{n}\,\xi_n(X,Y)$ under the hypothesis of independence, as given by Theorems 3.1 and 3.2, also holds, provided that $\tau$ and $\widehat{\tau}_n$ are computed using $X_i'$ and $Y_i'$ instead of $X_i$ and $Y_i$.

Proof.

The convergence is clear from Theorem 1.1. Also, by Theorem 1.1, $\xi(X',Y') = 0$ if and only if $X'$ and $Y'$ are independent, and $\xi(X',Y') = 1$ if and only if $Y'$ is a measurable function of $X'$. Since $X = \varphi^{-1}(X')$ and $Y = \psi^{-1}(Y')$, it follows that $X$ and $Y$ are independent if and only if $X'$ and $Y'$ are independent. For the same reason, $Y$ is a measurable function of $X$ almost surely if and only if $Y'$ is a measurable function of $X'$ almost surely. Note that $Y$ is not almost surely a constant if and only if $Y'$ is not almost surely a constant. Lastly, since $\xi_n(X,Y)$ is just $\xi_n(X',Y')$, any result about the asymptotic distribution of $\xi_n(X',Y')$, including Theorems 3.1 and 3.2 above, can be transferred to $\xi_n(X,Y)$. ∎

Just as in the univariate case, the generalized $\xi_n$ has the advantage of working under essentially no assumptions and having a simple asymptotic theory, as shown by Theorem 8.1. On the other hand, just like the univariate coefficient, one can expect the generalized coefficient to suffer from low power for testing independence.

Theorem 8.1 is a nice, clean result, but to implement the idea in practice, one needs to work with actual Borel isomorphisms. Here is an example of a Borel isomorphism between $\mathbb{R}^d$ and a Borel subset of $\mathbb{R}$. Take any $x = (x_1,\ldots,x_d)\in\mathbb{R}^d$. Let

\[
a_{i,1}\cdots a_{i,k_i} \,.\, b_{i,1} b_{i,2} \cdots
\]

be the binary expansion of $|x_i|$. Filling in extra $0$'s at the beginning if necessary, let us assume that $k_1 = \cdots = k_d = k$. Then, let us 'interlace' the digits to get the number

\[
a_{1,1} a_{2,1}\cdots a_{d,1}\, a_{1,2} a_{2,2}\cdots a_{d,2}\cdots a_{1,k} a_{2,k}\cdots a_{d,k} \,.\, b_{1,1} b_{2,1}\cdots b_{d,1}\, b_{1,2} b_{2,2}\cdots b_{d,2}\cdots.
\]

This is an encoding of the $d$-tuple $(|x_1|,\ldots,|x_d|)$. But we also want to encode the signs of the $x_i$'s. Let $c_i = 1$ if $x_i \geq 0$ and $0$ if $x_i < 0$. Sticking $1 c_1 c_2\cdots c_d$ in front of the above list, we get an encoding of the vector $x$ as a real number. (The $1$ in front ensures there is no ambiguity arising from some of the leading $c_i$'s being $0$.) It is easy to verify that this mapping is measurable, injective, and its inverse (defined on its range) is also measurable.
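A finite-precision version of this interlacing map is easy to code. The sketch below fixes the number of integer and fractional binary digits in advance, so it only approximates the exact isomorphism (which needs infinitely many fractional digits); all names and parameters are illustrative.

```python
import numpy as np

def interlace(x, int_bits=16, frac_bits=32):
    """Encode a vector x in R^d as a single real number by interlacing
    the binary digits of the |x_i|'s, with sign bits in front.
    Assumes |x_i| < 2**int_bits."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    signs = ''.join('1' if xi >= 0 else '0' for xi in x)
    # Fixed-point binary digits of each |x_i|: int_bits digits before
    # the binary point, frac_bits digits after.
    digits = [format(int(round(xi * 2**frac_bits)), f'0{int_bits + frac_bits}b')
              for xi in np.abs(x)]
    # Interlace: first digit of every coordinate, then second, etc.
    inter = ''.join(digits[i][j]
                    for j in range(int_bits + frac_bits) for i in range(d))
    whole, frac = inter[:d * int_bits], inter[d * int_bits:]
    # Leading '1', then the sign bits, then the interlaced digits.
    return int('1' + signs + whole, 2) + int(frac, 2) / 2**len(frac)
```

Encoding each coordinate vector this way reduces multivariate data to real-valued pairs, to which the univariate $\xi_n$ can be applied; the simulations described below used this interlacing scheme.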

Numerical simulations with $\xi_n$ computed using the above scheme produced satisfactory results. Some examples are as follows. The examples show one potential problem with using statistics such as $\xi_n$ in a high dimensional setting: The bias may be quite large (even though the standard deviation is small), resulting in slow convergence of $\xi_n(X,Y)$ to $\xi(X,Y)$.

Example 8.2 (Points on a sphere).

Non-uniform random points were generated on the unit sphere in $\mathbb{R}^3$ by drawing $\phi_1,\ldots,\phi_n$ uniformly from $[-\pi,\pi]$, drawing $\theta_1,\ldots,\theta_n$ uniformly from $[0,2\pi]$, and defining

\[
x_i = \sin\phi_i\cos\theta_i, \qquad y_i = \sin\phi_i\sin\theta_i, \qquad z_i = \cos\phi_i.
\]

Taking $X_i = (\phi_i,\theta_i)$ and $Y_i = (x_i,y_i,z_i)$, $\xi_n(X,Y)$ was computed for $n=100$ and $n=1000$. One thousand simulations were done for each $n$. The histograms of the values of $\xi_n$ are displayed in Figure 1. For $n=100$, $\xi_n(X,Y)$ had a mean value of $0.617$ with a standard deviation of $0.049$. For $n=1000$, the mean value was $0.865$ and the standard deviation was $0.009$. Simulations were also done for $n=10000$, where the mean value of $\xi_n(X,Y)$ turned out to be $0.957$ and the standard deviation was $0.002$. The slow convergence due to large bias is clearly apparent in this example.

Figure 1: Histograms of values of $\xi_n(X,Y)$ when $X$ and $Y$ are the polar and Cartesian coordinates of random points from the unit sphere in $\mathbb{R}^3$ (Example 8.2). Panel (a): $n=100$; panel (b): $n=1000$.
Example 8.3 (Points on a sphere plus noise).

This is the same example as the previous one, except that independent random noise was added to $x_i$, $y_i$ and $z_i$. The noise variables were taken to be normal with mean zero and standard deviation $\sigma$, for various values of $\sigma$. Table 1 displays the means and standard deviations of $\xi_n(X,Y)$ in $1000$ simulations, for $n=100$ and $n=1000$, and $\sigma = 0.01$, $\sigma = 0.05$ and $\sigma = 0.1$.

Table 1: Means and standard deviations of $\xi_n(X,Y)$ in $1000$ simulations, when $X =$ polar coordinates of a random point on a sphere and $Y =$ Cartesian coordinates of the point plus independent $N(0,\sigma^2)$ errors in each coordinate (Example 8.3).

  $n$      $\sigma = 0.01$     $\sigma = 0.05$     $\sigma = 0.1$
  100      0.545 (0.062)       0.440 (0.071)       0.357 (0.074)
  1000     0.783 (0.015)       0.658 (0.020)       0.543 (0.023)
Example 8.4 (Marginal independence versus joint dependence).

In this example, we have a pair of random variables $Y = (a,b)$ which is a function of a $4$-tuple $X = (u,v,w,z)$, but $u$, $v$, $w$ and $z$ are each marginally independent of $Y$. The variables are constructed as follows. Let $a$, $b$ and $c$ be independent Uniform$[0,1]$ random variables. Let

\[
\begin{aligned}
&u = a + b + c \ (\mathrm{mod}\ 1), \qquad v = \frac{a}{2} + \frac{b}{2} + c \ (\mathrm{mod}\ 1), \\
&w = \frac{4a}{3} + \frac{2b}{3} + c \ (\mathrm{mod}\ 1), \qquad z = \frac{2a}{3} + \frac{b}{3} + c \ (\mathrm{mod}\ 1).
\end{aligned}
\]

Here $x\ (\mathrm{mod}\ 1)$ denotes the 'remainder modulo $1$' of a real number $x$. It is easy to see that individually, $u$, $v$, $w$ and $z$ are independent of $Y = (a,b)$, since the operation of adding the uniform random variable $c$ and taking the remainder modulo $1$ erases all information about the quantity to which $c$ is added. However, $Y$ can be recovered from the $4$-tuple $X = (u,v,w,z)$ because $u - v = (a+b)/2\ (\mathrm{mod}\ 1) = (a+b)/2$ and $w - z = (2a+b)/3\ (\mathrm{mod}\ 1) = (2a+b)/3$, where the second identity holds in each case because $(a+b)/2 \in [0,1]$ and $(2a+b)/3 \in [0,1]$. With the above definitions, $n$ i.i.d. copies of $(X,Y)$ were sampled, and $\xi_n(X,Y)$ was computed, along with the asymptotic P-value for testing independence. This was repeated one thousand times. The average values of the coefficients and the P-values for $n=100$ and $n=1000$ are reported in Table 2. The table shows that even for $n$ as small as $100$, the hypothesis that $X$ and $Y$ are independent is rejected with average P-value $0.001$, whereas the average P-value for the hypothesis that $u$ and $Y$ are independent is $0.491$. On the other hand, even for $n$ as large as $1000$, the average P-value for the hypothesis that $u$ and $Y$ are independent is $0.495$, whereas the average P-value for the hypothesis that $X$ and $Y$ are independent is $0.000$.

Table 2: Average values of $\xi_n(u,Y)$ and $\xi_n(X,Y)$ and average P-values for testing independence in Example 8.4.

  $n$      $\xi_n(u,Y)$   $\xi_n(X,Y)$   P-value, $H_0 : u \perp\!\!\!\perp Y$   P-value, $H_0 : X \perp\!\!\!\perp Y$
  100      0.002          0.313          0.491                                   0.001
  1000     0.001          0.581          0.495                                   0.000
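The construction in Example 8.4 is easy to reproduce. Here is a minimal sketch of the data-generating mechanism together with the exact recovery of $Y$ from $X$ (variable names are mine; computing the coefficients and P-values would reuse the xi_n and interlace sketches above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
a, b, c = rng.random(n), rng.random(n), rng.random(n)

# Each coordinate of X is marginally independent of Y = (a, b),
# but Y is recoverable from the full 4-tuple X = (u, v, w, z).
u = (a + b + c) % 1
v = (a / 2 + b / 2 + c) % 1
w = (4 * a / 3 + 2 * b / 3 + c) % 1
z = (2 * a / 3 + b / 3 + c) % 1

# Recover Y: (a+b)/2 = u - v (mod 1) and (2a+b)/3 = w - z (mod 1).
s, t = (u - v) % 1, (w - z) % 1
assert np.allclose(3 * t - 2 * s, a) and np.allclose(4 * s - 3 * t, b)
```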

The above examples indicate that it may not be 'crazy' to use the generalized version of $\xi_n$ as a measure of association in the multivariate setting and beyond. Further investigations, through applications in simulated and real data, would be necessary to arrive at a concrete verdict.

The generalized version of $\xi_n$ can also be used to define a coefficient of conditional dependence for random variables taking values in standard Borel spaces, as follows. Let $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$ be standard Borel spaces, and $X$, $Y$ and $Z$ be random variables taking values in $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$, respectively. Let $W$ denote the $\mathcal{X}\times\mathcal{Z}$-valued random variable $(X,Z)$. Using Borel isomorphisms, let us define the coefficients $\xi_n(X,Y)$ and $\xi_n(W,Y)$ as before, based on i.i.d. data $(X_1,Y_1,Z_1),\ldots,(X_n,Y_n,Z_n)$. One can then define a coefficient of conditional dependence between $Y$ and $Z$ given $X$ as

\[
\xi_n(Z,Y\,|\,X) := \frac{\xi_n(W,Y) - \xi_n(X,Y)}{1 - \xi_n(X,Y)},
\]

leaving it undefined if the denominator is zero. The following theorem justifies its use as a measure of conditional dependence.

Theorem 8.5.

Suppose that $Y$ is not almost surely equal to a measurable function of $X$. Then as $n\to\infty$, $\xi_n(Z,Y\,|\,X)$ converges almost surely to a deterministic limit $\xi(Z,Y\,|\,X)\in[0,1]$, which is $0$ if and only if $Y$ and $Z$ are independent given $X$, and $1$ if and only if $Y$ is almost surely equal to a measurable function of $Z$ given $X$.

Proof.

Let $\varphi$, $\psi$, $\eta$ and $\delta$ be the Borel isomorphisms of $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{Z}$ and $\mathcal{X}\times\mathcal{Z}$ that we are using in our definitions of $\xi_n$. Let $X' := \varphi(X)$, $Y' := \psi(Y)$, $Z' := \eta(Z)$ and $W' := \delta(W)$. By Theorem 1.1 and the assumption that $Y$ is not almost surely a measurable function of $X$, we get that $\xi_n(Z,Y\,|\,X)$ converges almost surely to

\[
\xi(Z,Y\,|\,X) := \frac{\xi(W',Y') - \xi(X',Y')}{1 - \xi(X',Y')}.
\]

Let $\mu$ denote the law of $Y'$. Then

\[
\begin{aligned}
\xi(W',Y') - \xi(X',Y')
&= \frac{\int \mathrm{Var}(\mathbb{P}(Y'\geq t\,|\,W'))\,d\mu(t) - \int \mathrm{Var}(\mathbb{P}(Y'\geq t\,|\,X'))\,d\mu(t)}{\int \mathrm{Var}(1_{\{Y'\geq t\}})\,d\mu(t)} \\
&= \frac{\int \mathrm{Var}(\mathbb{P}(Y'\geq t\,|\,X',Z'))\,d\mu(t) - \int \mathrm{Var}(\mathbb{P}(Y'\geq t\,|\,X'))\,d\mu(t)}{\int \mathrm{Var}(1_{\{Y'\geq t\}})\,d\mu(t)} \\
&= \frac{\int \mathbb{E}(\mathrm{Var}(\mathbb{P}(Y'\geq t\,|\,X',Z')\,|\,X'))\,d\mu(t)}{\int \mathrm{Var}(1_{\{Y'\geq t\}})\,d\mu(t)},
\end{aligned}
\]
where the last step is by the law of total variance.

Similarly,

\[
1 - \xi(X',Y') = \frac{\int \mathbb{E}(\mathrm{Var}(1_{\{Y'\geq t\}}\,|\,X'))\,d\mu(t)}{\int \mathrm{Var}(1_{\{Y'\geq t\}})\,d\mu(t)}.
\]

From the above expressions, we see that $\xi(Z,Y\,|\,X)$ is nothing but the quantity $T(Y,Z\,|\,X)$ displayed in equation (6.1). The claims of the theorem now follow from Theorem 6.1. ∎

9 R packages

An R package for calculating $\xi_n$, as well as the generalized coefficient proposed in this survey, and P-values for testing independence — named XICOR — is available on CRAN [23]. An R package for calculating the conditional dependence coefficient $T_n$ and implementing the FOCI algorithm, called FOCI, is also available [4]. The KFOCI algorithm of Huang et al. [63] is implemented in an R package by the same name [64].

References

  • Amit and Geman [1997] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 9(7):1545–1588, 1997.
  • Auddy et al. [2021] A. Auddy, N. Deb, and S. Nandy. Exact Detection Thresholds for Chatterjee’s Correlation. arXiv preprint arXiv:2104.15140, 2021.
  • Azadkia and Chatterjee [2021] M. Azadkia and S. Chatterjee. A simple measure of conditional dependence. Annals of Statistics, 49(6):3070–3102, 2021.
  • Azadkia et al. [2020] M. Azadkia, S. Chatterjee, and N. S. Matloff. FOCI: Feature Ordering by Conditional Independence, 2020. URL https://CRAN.R-project.org/package=FOCI.
  • Azadkia et al. [2021] M. Azadkia, A. Taeb, and P. Bühlmann. A fast non-parametric approach for causal structure learning in polytrees. arXiv preprint arXiv:2111.14969, 2021.
  • Battiti [1994] R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4):537–550, 1994.
  • Bergsma [2004] W. Bergsma. Testing conditional independence for continuous random variables. Report Eurandom, 2004048, 2004.
  • Bergsma and Dassios [2014] W. Bergsma and A. Dassios. A consistent test of independence based on a sign covariance related to Kendall’s tau. Bernoulli, 20(2):1006–1028, 2014.
  • Berrett et al. [2020] T. B. Berrett, Y. Wang, R. F. Barber, and R. J. Samworth. The conditional permutation test for independence while controlling for confounders. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1):175–197, 2020.
  • Berrett et al. [2021] T. B. Berrett, I. Kontoyiannis, and R. J. Samworth. Optimal rates for independence testing via $U$-statistic permutation tests. Annals of Statistics, 49(5):2457–2490, 2021.
  • Bhattacharya [2019] B. B. Bhattacharya. A general asymptotic framework for distribution-free graph-based two-sample tests. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(3):575–602, 2019.
  • Bickel [2022] P. J. Bickel. Measures of independence and functional dependence. arXiv preprint arXiv:2206.13663, 2022.
  • Blum et al. [1961] J. Blum, J. Kiefer, and M. Rosenblatt. Distribution free tests of independence based on the sample distribution function. Annals of Mathematical Statistics, 32(2):485–498, 1961.
  • Breiman [1995] L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.
  • Breiman [1996] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
  • Breiman [2001] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
  • Breiman and Friedman [1985] L. Breiman and J. H. Friedman. Estimating optimal transformations for multiple regression and correlation. Journal of the American statistical Association, 80(391):580–598, 1985.
  • Breiman et al. [1984] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth Press, 1984.
  • Candès and Tao [2007] E. Candès and T. Tao. The Dantzig Selector: Statistical estimation when $p$ is much larger than $n$. Annals of Statistics, 35(6):2313–2351, 2007.
  • Candès et al. [2018] E. Candès, Y. Fan, L. Janson, and J. Lv. Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577, 2018.
  • Cao and Bickel [2020] S. Cao and P. J. Bickel. Correlations with tailored extremal properties. arXiv preprint arXiv:2008.10177, 2020.
  • Chatterjee [2021] S. Chatterjee. A new coefficient of correlation. Journal of the American Statistical Association, 116(536):2009–2022, 2021.
  • Chatterjee and Holmes [2020] S. Chatterjee and S. Holmes. XICOR: Association measurement through cross rank increments, 2020. URL https://CRAN.R-project.org/package=XICOR.
  • Chatterjee and Vidyasagar [2022] S. Chatterjee and M. Vidyasagar. Estimating large causal polytree skeletons from small samples. arXiv preprint arXiv:2209.07028, 2022.
  • Chen [2020] L.-P. Chen. A note of feature screening via rank-based coefficient of correlation. arXiv preprint arXiv:2008.04456, 2020.
  • Chen and Donoho [1994] S. Chen and D. Donoho. Basis pursuit. In Proceedings of 1994 28th Asilomar Conference on Signals, Systems and Computers, volume 1, pages 41–44. IEEE, 1994.
  • Chernozhukov et al. [2017] V. Chernozhukov, A. Galichon, M. Hallin, and M. Henry. Monge–Kantorovich depth, quantiles, ranks and signs. Annals of Statistics, 45(1):223–256, 2017.
  • Cochran [1954] W. G. Cochran. Some methods for strengthening the common $\chi^2$ tests. Biometrics, 10(4):417–451, 1954.
  • Csörgő [1985] S. Csörgő. Testing for independence by the empirical characteristic function. Journal of Multivariate Analysis, 16(3):290–299, 1985.
  • Deb and Sen [2021] N. Deb and B. Sen. Multivariate rank-based distribution-free nonparametric testing using measure transportation. Journal of the American Statistical Association, pages 1–16, 2021.
  • Deb et al. [2020] N. Deb, P. Ghosal, and B. Sen. Measuring association on topological spaces using kernels and geometric graphs. arXiv preprint arXiv:2010.01768, 2020.
  • Dette et al. [2013] H. Dette, K. F. Siburg, and P. A. Stoimenov. A copula-based non-parametric measure of regression dependence. Scandinavian Journal of Statistics, 40(1):21–41, 2013.
  • Doran et al. [2014] G. Doran, K. Muandet, K. Zhang, and B. Schölkopf. A permutation-based kernel conditional independence test. In Uncertainty in Artificial Intelligence, pages 132–141. AUAI, 2014.
  • Drton et al. [2020] M. Drton, F. Han, and H. Shi. High-dimensional consistent independence testing with maxima of rank correlations. Annals of Statistics, 48(6):3206–3227, 2020.
  • Efron et al. [2004] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of statistics, 32(2):407–499, 2004.
  • Fan and Li [2001] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
  • Fan et al. [2020] J. Fan, Y. Feng, and L. Xia. A projection-based conditional dependence measure with applications to high-dimensional undirected graphical models. Journal of Econometrics, 218(1):119–139, 2020.
  • Figalli [2018] A. Figalli. On the continuity of center-outward distribution and quantile functions. Nonlinear Analysis, 177:413–421, 2018.
  • Freund and Schapire [1996] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148–156, 1996.
  • Friedman [1991] J. H. Friedman. Multivariate adaptive regression splines. Annals of Statistics, 19(1):1–67, 1991.
  • Friedman and Rafsky [1983] J. H. Friedman and L. C. Rafsky. Graph-theoretic measures of multivariate association and prediction. Annals of Statistics, pages 377–391, 1983.
  • Fukumizu et al. [2007] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.
  • Gamboa et al. [2018] F. Gamboa, T. Klein, and A. Lagnoux. Sensitivity analysis based on Cramér–von Mises distance. SIAM/ASA Journal on Uncertainty Quantification, 6(2):522–548, 2018.
  • Gamboa et al. [2022] F. Gamboa, P. Gremaud, T. Klein, and A. Lagnoux. Global sensitivity analysis: A novel generation of mighty estimators based on rank statistics. Bernoulli, 28(4):2345–2374, 2022.
  • Gebelein [1941] H. Gebelein. Das statistische problem der korrelation als variations-und eigenwertproblem und sein zusammenhang mit der ausgleichsrechnung. Zeitschrift für Angewandte Mathematik und Mechanik, 21(6):364–379, 1941.
  • George and McCulloch [1993] E. I. George and R. E. McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.
  • Ghosal and Sen [2022] P. Ghosal and B. Sen. Multivariate ranks and quantiles using optimal transport: Consistency, rates and nonparametric testing. Annals of Statistics, 50(2):1012–1037, 2022.
  • Gretton et al. [2005a] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert–Schmidt norms. In Proceedings of the 16th International Conference on Algorithmic Learning Theory, pages 63–77. Springer, Berlin, 2005a.
  • Gretton et al. [2005b] A. Gretton, A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schölkopf, and N. Logothetis. Kernel constrained covariance for dependence measurement. In International Workshop on Artificial Intelligence and Statistics, pages 112–119. PMLR, 2005b.
  • Gretton et al. [2007] A. Gretton, K. Fukumizu, C. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.
  • Hallin et al. [2021] M. Hallin, E. Del Barrio, J. Cuesta-Albertos, and C. Matrán. Distribution and quantile functions, ranks and signs in dimension $d$: A measure transportation approach. Annals of Statistics, 49(2):1139–1165, 2021.
  • Han [2021] F. Han. On extensions of rank correlation coefficients to multivariate spaces. Bernoulli News, 28(2):7–11, 2021.
  • Han and Huang [2022] F. Han and Z. Huang. Azadkia–Chatterjee’s correlation coefficient adapts to manifold data. arXiv preprint arXiv:2209.11156, 2022.
  • Han et al. [2017] F. Han, S. Chen, and H. Liu. Distribution-free tests of independence in high dimensions. Biometrika, 104(4):813–828, 2017.
  • Hastie et al. [2009] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Berlin, 2nd edition, 2009.
  • Heller and Heller [2016] R. Heller and Y. Heller. Multivariate tests of association based on univariate tests. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
  • Heller et al. [2012] R. Heller, M. Gorfine, and Y. Heller. A class of multivariate distribution-free tests of independence based on graphs. Journal of Statistical Planning and Inference, 142(12):3097–3106, 2012.
  • Heller et al. [2013] R. Heller, Y. Heller, and M. Gorfine. A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2):503–510, 2013.
  • Hirschfeld [1935] H. O. Hirschfeld. A connection between correlation and contingency. Mathematical Proceedings of the Cambridge Philosophical Society, 31(4):520–524, 1935.
  • Ho [1998] T. K. Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998.
  • Hoeffding [1948] W. Hoeffding. A non-parametric test of independence. Annals of Mathematical Statistics, 19(4):546–557, 1948.
  • Huang [2010] T.-M. Huang. Testing conditional independence using maximal nonlinear conditional correlation. Annals of Statistics, 38(4):2047–2091, 2010.
  • Huang et al. [2020] Z. Huang, N. Deb, and B. Sen. Kernel partial correlation coefficient — a measure of conditional dependence. arXiv preprint arXiv:2012.14804, 2020.
  • Huang et al. [2022] Z. Huang, N. Deb, and B. Sen. KPC: Kernel Partial Correlation Coefficient, 2022. URL https://cran.r-project.org/web/packages/KPC.
  • Joe [1989] H. Joe. Relative entropy measures of multivariate dependence. Journal of the American Statistical Association, 84(405):157–164, 1989.
  • Josse and Holmes [2016] J. Josse and S. Holmes. Measuring multivariate association and beyond. Statistics Surveys, 10:132, 2016.
  • Ke and Yin [2019] C. Ke and X. Yin. Expected conditional characteristic function-based measures for testing independence. Journal of the American Statistical Association, 115(530):985–996, 2019.
  • Kim et al. [2020] I. Kim, S. Balakrishnan, and L. Wasserman. Robust multivariate nonparametric tests via projection averaging. Annals of Statistics, 48(6):3417–3441, 2020.
  • Kim et al. [2021] I. Kim, M. Neykov, S. Balakrishnan, and L. Wasserman. Local permutation tests for conditional independence. arXiv preprint arXiv:2112.11666, 2021.
  • Kong et al. [2019] E. Kong, Y. Xia, and W. Zhong. Composite coefficient of determination and its application in ultrahigh dimensional variable screening. Journal of the American Statistical Association, 114(528):1740–1751, 2019.
  • Kraskov et al. [2004] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004.
  • Lin and Han [2022+] Z. Lin and F. Han. On boosting the power of Chatterjee’s rank correlation. Biometrika, 2022+. Forthcoming.
  • Lin and Han [2022] Z. Lin and F. Han. Limit theorems of Chatterjee’s rank correlation. arXiv preprint arXiv:2204.08031, 2022.
  • Linfoot [1957] E. H. Linfoot. An informational measure of correlation. Information and Control, 1(1):85–89, 1957.
  • Linton and Gozalo [1997] O. Linton and P. Gozalo. Conditional independence restrictions: testing and estimation. Cowles Foundation Discussion Paper, 1140, 1997.
  • Lopez-Paz et al. [2013] D. Lopez-Paz, P. Hennig, and B. Schölkopf. The randomized dependence coefficient. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
  • Lyons [2013] R. Lyons. Distance covariance in metric spaces. Annals of Probability, 41(5):3284–3305, 2013.
  • Manole et al. [2021] T. Manole, S. Balakrishnan, J. Niles-Weed, and L. Wasserman. Plugin estimation of smooth optimal transport maps. arXiv preprint arXiv:2107.12364, 2021.
  • Mantel and Haenszel [1959] N. Mantel and W. Haenszel. Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22(4):719–748, 1959.
  • McCann [1995] R. J. McCann. Existence and uniqueness of monotone measure-preserving maps. Duke Mathematical Journal, 80(2):309–323, 1995.
  • Miller [2002] A. Miller. Subset Selection in Regression. Chapman and Hall, 2002.
  • Mordant and Segers [2022] G. Mordant and J. Segers. Measuring dependence between random vectors via optimal transport. Journal of Multivariate Analysis, 189:104912, 2022.
  • Nandy et al. [2016] P. Nandy, L. Weihs, and M. Drton. Large-sample theory for the Bergsma–Dassios sign covariance. Electronic Journal of Statistics, 10(2):2287–2311, 2016.
  • Neykov et al. [2021] M. Neykov, S. Balakrishnan, and L. Wasserman. Minimax optimal conditional independence testing. Annals of Statistics, 49(4):2151–2177, 2021.
  • Patra et al. [2016] R. K. Patra, B. Sen, and G. J. Székely. On a nonparametric notion of residual and its applications. Statistics & Probability Letters, 109:208–213, 2016.
  • Pfister et al. [2018] N. Pfister, P. Bühlmann, B. Schölkopf, and J. Peters. Kernel-based tests for joint independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):5–31, 2018.
  • Póczos and Schneider [2012] B. Póczos and J. Schneider. Nonparametric estimation of conditional information and divergences. In Artificial Intelligence and Statistics, pages 914–923. PMLR, 2012.
  • Puri et al. [1970] M. Puri, P. Sen, and D. Gokhale. On a class of rank order tests for independence in multivariate distributions. Sankhyā, Series A, 32(3):271–298, 1970.
  • Puri and Sen [1971] M. L. Puri and P. K. Sen. Nonparametric methods in multivariate analysis. Wiley, New York, 1971.
  • Rao and Srivastava [1994] B. Rao and S. Srivastava. An elementary proof of the Borel isomorphism theorem. Real Analysis Exchange, 20(1):347–349, 1994.
  • Ravikumar et al. [2009] P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009–1030, 2009.
  • Rényi [1959] A. Rényi. On measures of dependence. Acta Mathematica Hungarica, 10(3-4):441–451, 1959.
  • Reshef et al. [2011] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti. Detecting novel associations in large data sets. Science, 334(6062):1518–1524, 2011.
  • Romano [1988] J. P. Romano. A bootstrap revival of some nonparametric distance tests. Journal of the American Statistical Association, 83(403):698–708, 1988.
  • Rosenblatt [1975] M. Rosenblatt. A quadratic measure of deviation of two-dimensional density estimates and a test of independence. Annals of Statistics, 3(1):1–14, 1975.
  • Runge [2018] J. Runge. Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. In International Conference on Artificial Intelligence and Statistics, pages 938–947. PMLR, 2018.
  • Sadeghi [2022] B. Sadeghi. Chatterjee correlation coefficient: a robust alternative for classic correlation methods in geochemical studies (including the “TripleCpy” Python package). Ore Geology Reviews, 104954, 2022.
  • Schweizer and Wolff [1981] B. Schweizer and E. F. Wolff. On nonparametric measures of dependence for random variables. Annals of Statistics, 9(4):879–885, 1981.
  • Sen and Sen [2014] A. Sen and B. Sen. Testing independence and goodness-of-fit in linear models. Biometrika, 101(4):927–942, 2014.
  • Sen et al. [2017] R. Sen, A. T. Suresh, K. Shanmugam, A. G. Dimakis, and S. Shakkottai. Model-powered conditional independence test. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • Seth and Príncipe [2012] S. Seth and J. C. Príncipe. Conditional association. Neural Computation, 24(7):1882–1905, 2012.
  • Shah and Peters [2020] R. D. Shah and J. Peters. The hardness of conditional independence testing and the generalised covariance measure. Annals of Statistics, 48(3):1514–1538, 2020.
  • Shi et al. [2021a] H. Shi, M. Drton, M. Hallin, and F. Han. Center-outward sign- and rank-based quadrant, Spearman, and Kendall tests for multivariate independence. arXiv preprint arXiv:2111.15567, 2021a.
  • Shi et al. [2021b] H. Shi, M. Drton, and F. Han. On Azadkia–Chatterjee’s conditional dependence coefficient. arXiv preprint arXiv:2108.06827, 2021b.
  • Shi et al. [2022a] H. Shi, M. Drton, and F. Han. On the power of Chatterjee’s rank correlation. Biometrika, 109(2):317–333, 2022a.
  • Shi et al. [2022b] H. Shi, M. Drton, and F. Han. Distribution-free consistent independence tests via center-outward ranks and signs. Journal of the American Statistical Association, 117(537):395–410, 2022b.
  • Shi et al. [2022c] H. Shi, M. Hallin, M. Drton, and F. Han. On universally consistent and fully distribution-free rank tests of vector independence. Annals of Statistics, 50(4):1933–1959, 2022c.
  • Sklar [1959] M. Sklar. Fonctions de répartition à nn dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8:229–231, 1959.
  • Song [2009] K. Song. Testing conditional independence via Rosenblatt transforms. Annals of Statistics, 37(6B):4011–4045, 2009.
  • Srivastava [1998] S. M. Srivastava. A Course on Borel Sets. Springer-Verlag, New York, 1998.
  • Strobl et al. [2019] E. V. Strobl, K. Zhang, and S. Visweswaran. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. Journal of Causal Inference, 7(1), 2019.
  • Su and White [2007] L. Su and H. White. A consistent characteristic function-based test for conditional independence. Journal of Econometrics, 141(2):807–834, 2007.
  • Su and White [2008] L. Su and H. White. A nonparametric Hellinger metric test for conditional independence. Econometric Theory, 24(4):829–864, 2008.
  • Su and White [2014] L. Su and H. White. Testing conditional independence via empirical likelihood. Journal of Econometrics, 182(1):27–44, 2014.
  • Székely and Rizzo [2009] G. J. Székely and M. L. Rizzo. Brownian distance covariance. Annals of Applied Statistics, 3(4):1236–1265, 2009.
  • Székely and Rizzo [2014] G. J. Székely and M. L. Rizzo. Partial distance correlation with methods for dissimilarities. Annals of Statistics, 42(6):2382–2412, 2014.
  • Székely et al. [2007] G. J. Székely, M. L. Rizzo, and N. K. Bakirov. Measuring and testing dependence by correlation of distances. Annals of Statistics, 35(6):2769–2794, 2007.
  • Tibshirani [1996] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • Veraverbeke et al. [2011] N. Veraverbeke, M. Omelka, and I. Gijbels. Estimation of a conditional copula and association measures. Scandinavian Journal of Statistics, 38(4):766–780, 2011.
  • Vergara and Estévez [2014] J. R. Vergara and P. A. Estévez. A review of feature selection methods based on mutual information. Neural Computing and Applications, 24(1):175–186, 2014.
  • Wang et al. [2015] X. Wang, W. Pan, W. Hu, Y. Tian, and H. Zhang. Conditional distance correlation. Journal of the American Statistical Association, 110(512):1726–1734, 2015.
  • Wang et al. [2017] X. Wang, B. Jiang, and J. S. Liu. Generalized R-squared for detecting dependence. Biometrika, 104(1):129–139, 2017.
  • Weihs et al. [2016] L. Weihs, M. Drton, and D. Leung. Efficient computation of the Bergsma–Dassios sign covariance. Computational Statistics, 31(1):315–328, 2016.
  • Weihs et al. [2018] L. Weihs, M. Drton, and N. Meinshausen. Symmetric rank covariances: a generalized framework for nonparametric measures of dependence. Biometrika, 105(3):547–562, 2018.
  • Yanagimoto [1970] T. Yanagimoto. On measures of association and a related problem. Annals of the Institute of Statistical Mathematics, 22(1):57–63, 1970.
  • Yuan and Lin [2006] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
  • Zhang [2019] K. Zhang. BET on independence. Journal of the American Statistical Association, 114(528):1620–1637, 2019.
  • Zhang et al. [2012] K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernel-based conditional independence test and application in causal discovery. arXiv preprint arXiv:1202.3775, 2012.
  • Zhang [2022] Q. Zhang. On the asymptotic distribution of the symmetrized Chatterjee’s correlation coefficient. arXiv preprint arXiv:2205.01769, 2022.
  • Zhang et al. [2018] Q. Zhang, S. Filippi, A. Gretton, and D. Sejdinovic. Large-scale kernel methods for independence testing. Statistics and Computing, 28(1):113–130, 2018.
  • Zhu et al. [2017] L. Zhu, K. Xu, R. Li, and W. Zhong. Projection correlation between two random vectors. Biometrika, 104(4):829–843, 2017.
  • Zou [2006] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
  • Zou and Hastie [2005] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.