The Impossibility of Testing for Dependence Using Kendall’s $\tau$ Under Missing Data of Unknown Form
Abstract
This paper discusses the statistical inference problem associated with testing for dependence between two continuous random variables using Kendall’s $\tau$ in the context of the missing data problem. We prove that the worst-case identified set for this measure of association always includes zero. Consequently, robust inference for dependence using Kendall’s $\tau$, where robustness is with respect to the form of the missingness-generating process, is impossible.
AMS 2020 subject classifications: 62H15; 62D10; 62G10
Keywords: Impossible Inference; Statistical Dependence; Kendall’s $\tau$; Partial Identification; Missing Data.
1 Introduction
Testing for statistical dependence between two random variables is an important facet of theoretical and empirical statistical research, and arises as a problem of interest in various areas of the natural and social sciences. Applications in social science include the study of the relationship between health outcomes and insurance levels (e.g., Cameron and Trivedi, 1993), survey analysis (e.g., Yu et al., 2016), stress-testing risk-management models (e.g., Asimit et al., 2016), and stock market co-movements (e.g., Horváth and Rice, 2015; Cameron and Trivedi, 1993). In the natural sciences, applications arise in contexts as diverse as cancerous somatic alteration co-occurrences (e.g., Canisius et al., 2016) and the movement of animals across time (e.g., Swihart and Slade, 1985).
Tests for dependence based on Kendall’s $\tau$ (Kendall, 1938) are a relevant tool in empirical practice for detecting monotonic dependence between two random variables. The interested reader may refer, for instance, to the monographs Nelsen (2006), Bagdonavicius et al. (2011), and the references therein. The strength of such testing procedures is that $\tau$ is a distribution-free measure of association between paired continuous random variables. In particular, let $(X, Y)$ be a pair of continuous random variables having joint distribution $F$ and marginal distributions $F_X$ and $F_Y$, respectively. In moment form, this measure of association for the random vector $(X, Y)$ is defined as
$$\tau = 4\,E_F\left[C\left(F_X(X), F_Y(Y)\right)\right] - 1, \tag{1.1}$$
where $C$ is the copula of $F$, and $E_F$ denotes the expectation operator with respect to the distribution $F$. The hypothesis testing problem for detecting monotonic dependence using $\tau$ in (1.1) has the form
$$H_0: \tau = 0 \quad \text{versus} \quad H_1: \tau \neq 0. \tag{1.2}$$
The null hypothesis in (1.2) posits no monotonic dependence between the two random variables, and the alternative hypothesis is the negation of the null.
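As a point of reference for the complete-data case, the sample analogue of $\tau$ can be computed by counting concordant and discordant pairs. The following Python sketch is illustrative only: the function name `kendall_tau` is our own, and it assumes fully observed, tie-free data, which is precisely the assumption that fails when values are missing.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Sample Kendall's tau for fully observed paired data without ties.

    Illustrative helper (not from the paper): the statistic is the number of
    concordant pairs minus the number of discordant pairs, divided by the
    total number of pairs.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    score = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:      # concordant pair
            score += 1
        elif product < 0:    # discordant pair
            score -= 1
    n = len(xs)
    return score / (n * (n - 1) / 2)


x = [0.1, 0.4, 0.5, 0.9]
tau_increasing = kendall_tau(x, [1.0, 2.0, 3.0, 4.0])  # perfectly concordant
tau_decreasing = kendall_tau(x, [4.0, 3.0, 2.0, 1.0])  # perfectly discordant
```

A value near zero is what the null hypothesis in (1.2) predicts; the two toy samples above sit at the opposite extremes.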
Statistical procedures for the hypothesis testing problem (1.2) are predicated on the assumption that the random vector $(X, Y)$ is observable. However, this assumption is violated in empirical practice because datasets can have missing values. For example, missing data can arise through nonresponse, which is inevitable in self-reported cross-sectional and longitudinal surveys, or through loss at follow-up in clinical studies. See, for example, Dutz et al. (2021) for a discussion on the prevalence of nonresponse in economics research. Missing data are also universal in ecological and evolutionary data, as in other branches of science; see, for example, the monograph Fox et al. (2015) and the references therein. Imputation methods are commonly used to address the missing data problem and enable testing with a complete dataset. However, the validity of such tests hinges on the correct specification of the imputation procedure; a misspecified procedure can lead to biased inferences. Another approach in the literature imposes assumptions on the missingness-generating process (MGP) that point-identify $\tau$ in (1.1). In the context of Kendall’s rank-correlation test, see, for example, Alvo and Cabilio (1995), who assume the MGP is either missing completely at random or weakly exogenous, and Ma (2012), who assumes that it is either missing at random or missing completely at random. While practical, such tests also ignore misspecification of the MGP, which weakens the credibility of any derived inferences (Manski, 2003).
Consequently, we ask whether it is possible to conduct non-parametric inference for dependence using $\tau$ under missing data of unknown form. The results of this paper imply that such robust inference is impossible. Reasoning from first principles, any sensible testing procedure of this sort must be based on $\tau$’s identified set because it characterizes the information about this parameter contained in the observables. The identified set for this parameter is an interval, $[\underline{\tau}, \overline{\tau}]$, whose bounds depend on observables. Therefore, the testing problem for inferring statistical dependence using this information must have the form
$$H_0: \underline{\tau} \le 0 \le \overline{\tau} \quad \text{versus} \quad H_1: \underline{\tau} > 0 \ \text{ or } \ \overline{\tau} < 0. \tag{1.3}$$
Under $H_1$ in (1.3), the identified set is a subset of either $[-1, 0)$ or $(0, 1]$. Since $\underline{\tau} \le \tau \le \overline{\tau}$ holds by definition, this hypothesis implies that either $\tau < 0$ or $\tau > 0$ holds, so that $X$ and $Y$ are statistically dependent. We show the bounds satisfy the inequalities $\underline{\tau} \le 0 \le \overline{\tau}$ for all joint distributions of $(X, Y)$ and MGPs. These inequalities show the null hypothesis in (1.3) always holds, implying that one cannot partition the underlying probability model into two submodels that are compatible with the assertions of the null and alternative hypotheses. Therefore, the worst-case bounds are useless in detecting dependence between $X$ and $Y$ through the testing problem (1.3). We prove that this property of the bounds holds in the setup where the marginal distributions are known to the practitioner, which implies that it holds in the setup where those distributions are unknown. The reason is that the bounds in the case where the marginal distributions are unknown must be less informative, and hence weakly wider, than their counterparts under known marginal distributions, implying that they must also satisfy this property. A critical step in our theoretical derivations is an innovative use of results on extremal dependence described in Puccetti and Wang (2015).
This paper contributes to the literature on impossible inference, which has a rich history that started with the classic paper of Bahadur and Savage (1956). The recent paper by Bertanha and Moreira (2020) connects this literature and presents a taxonomy of the types of impossible inferences. Our result falls under Type A in their taxonomy, as the alternative is indistinguishable from the null. However, it should be noted that our result is not a consequence of the model of the null hypothesis being dense in the set of all likely models with respect to the total-variation distance, which is the essential characteristic of Type A impossible inferences. Rather, it flows from the fact that the worst-case lower and upper bounds are uninformative because they do not define a partition of the underlying probability model into two submodels that are compatible with the assertions of the null and alternative hypotheses in (1.3).
The idea of using bounds to account for missing data problems started with the seminal paper of Manski (1989) and gained popularity with the important paper of Horowitz and Manski (1995). Since then, there has been a growing influential literature on partial identification that has shaped empirical practice; see, for example, Canay and Shaikh (2016) for a recent survey of this literature and the references therein. Inference on bounds to account for missing data in moment inequality models has been considered in a variety of settings, such as distributional analyses (e.g., Blundell et al., 2007), treatment effects (e.g., Lee, 2009), and stochastic dominance testing (e.g., Fakih et al., 2021). In contrast to those works, this paper shows that such an approach is futile in testing for dependence under missing data of unknown form using Kendall’s $\tau$ and its worst-case bounds. We also discuss how to obtain informative partitions of the underlying probability model through restrictions on the dependence between $X$ and $Y$ and/or the MGPs.
There is also a strand in the partial identification literature focusing on parameters that depend on the joint distribution of two random variables with point-identified marginal distributions; see, for example, Fan and Patton (2014) for a survey of this strand and the references therein. However, to the best of our knowledge, this strand has not considered partial identification of those parameters arising from the missing data problem. While convenient, the point-identification of the margins can be untenable in applications with missing data and can create challenges in the inference for such parameters; the results of our paper exemplify this point.
The rest of this paper is organized as follows. Section 2 introduces the statistical setup of this paper and preliminary results on extremal dependence that we utilize in the proofs of our results. Section 3 presents our results. Section 4 discusses the scope of the results and their implications for empirical practice. Section 5 concludes. All proofs are relegated to the Appendix.
2 Setup and Preliminaries
Consider the random vector having joint distribution , where and are the random variables of interest, which are continuous, and is a categorical variable supported on indicating missingness on and . In this setup,
(2.1) |
where denotes the missing variable. For simplicity, we assume that the marginal cumulative distribution functions (CDFs) of and , denoted by and , respectively, are known by the practitioner. We derive the worst-case bounds on under the following probability model.
Definition 1.
Let be the set of distributions of the random vector supported on , with generic element , such that
(i) The generic element has a density.
(ii) $(X, Y)$ is a continuous random vector having strictly positive density.
(iii) $X$ has CDF $F_X$ and $Y$ has CDF $F_Y$.
The worst-case bounds on $\tau$ without the practitioner’s knowledge of the marginal CDFs can be computed by extremizing their counterparts with known $F_X$ and $F_Y$ over feasible candidate values of these CDFs. We elaborate on this point in Section 3, and show that impossible inference for dependence under the model of Definition 1 implies that it also holds in the more general scenario where $F_X$ and $F_Y$ are unknown to the practitioner.
In this setup, a MGP is specified through restrictions on the joint distribution of . The model does not place any restrictions on the dependence between and beyond the existence of a density. To account for the missing data problem, we exhibit as a functional of . For each , an application of the Law of Total probability shows the corresponding value of has the following representation:
(2.2) |
where is the copula of the joint CDF .111The existence and uniqueness of in our setup is a result due to Sklar (1959). This representation of is useful since it clarifies the situation faced by the practitioner in our setup. In particular, it shows that can be calculated for each using the following parameters: the copula ; the conditional CDFs, for ; and the marginal probabilities of , . From sampling, asymptotically the practitioner can recover and but not and , as the data alone do not contain any information on the latter. Consequently, a MGP can be characterized in terms of a specification of the conditional CDFs .
The above analysis shows that $\tau$ is partially identified in the missing data setting when we are agnostic about the MGP. The identified set of $\tau$ in this case is a closed interval subset of $[-1, 1]$ whose boundary corresponds to the worst-case bounds on $\tau$. These bounds permit the entire spectrum of MGPs, which is especially useful when the data have a large number of missing values, as there can be a diversity of explanations for it.
The next section describes the worst-case bounds on $\tau$ in (2.2) and raises the statistical issues concerning testing for dependence between $X$ and $Y$ in a manner that is robust to the MGP. In developing our results we make use of the Fréchet-Hoeffding copula bounds and two results on extreme values of means of supermodular functions from Puccetti and Wang (2015). To describe these results, denote by $\mathcal{C}$ the set of all bivariate copulas on the unit square $[0,1]^2$. The Fréchet-Hoeffding bounds are $W(u,v) \le C(u,v) \le M(u,v)$ for all $(u,v) \in [0,1]^2$, where $W(u,v) = \max\{u+v-1, 0\}$ and $M(u,v) = \min\{u,v\}$, which hold for all $C \in \mathcal{C}$.
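The Fréchet-Hoeffding bounds are easy to check numerically for any candidate copula. The Python sketch below is a minimal illustration under our own naming (`frechet_lower`, `frechet_upper`, and `within_bounds` are hypothetical helpers, not objects from the paper); it verifies on a grid that the independence copula lies between the two bounds.

```python
def frechet_lower(u, v):
    """Fréchet-Hoeffding lower bound: max(u + v - 1, 0)."""
    return max(u + v - 1.0, 0.0)


def frechet_upper(u, v):
    """Fréchet-Hoeffding upper bound: min(u, v)."""
    return min(u, v)


def independence(u, v):
    """Independence copula, used here only as a test case."""
    return u * v


def within_bounds(copula, grid_size=50, tol=1e-12):
    """Check lower <= copula <= upper on a regular grid of the unit square.

    The tolerance absorbs floating-point rounding at the boundary.
    """
    points = [i / grid_size for i in range(grid_size + 1)]
    return all(
        frechet_lower(u, v) - tol <= copula(u, v) <= frechet_upper(u, v) + tol
        for u in points
        for v in points
    )


independence_ok = within_bounds(independence)
```

A grid check of this kind is only a sanity test; it cannot prove that an arbitrary function is a copula.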
A function $c(x, y)$ of two real arguments is called supermodular if for all $x_1 \le x_2$ and $y_1 \le y_2$,
$$c(x_1, y_1) + c(x_2, y_2) \ge c(x_1, y_2) + c(x_2, y_1),$$
important examples of which are copulas. This point is important as the bounds on $\tau$ are characterized in terms of copulas of the joint distribution. The results of Puccetti and Wang (2015) that we utilize are Theorems 2.1 and 3.1 in their paper, and we restate them in the following lemma in a form that is more suitable for the derivation of our results.
Lemma 1 (Puccetti and Wang (2015)).
Let be a supermodular function, and and be random variables with marginal CDFs and respectively. Furthermore, let be as described above.
-
1.
The moment , when viewed as a functional of the copula through the representation , is maximized when and are co-monotonic. That is,
(2.3) where .
-
2.
The functional , when viewed as a functional of the copula through the representation , is minimized when and are counter-monotonic. That is,
(2.4) where .
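To make the content of Lemma 1 concrete, the following Python sketch evaluates the mean of a supermodular function under three couplings of two uniform margins. Everything here is an illustrative assumption rather than part of the paper: we take $c(x, y) = xy$, uniform margins on $[0,1]$, and a midpoint rule in place of exact integration. The co-monotonic coupling pairs $U$ with itself and the counter-monotonic coupling pairs $U$ with $1 - U$.

```python
def c(x, y):
    """An illustrative supermodular function: c(x, y) = x * y."""
    return x * y


def is_supermodular_on_grid(f, points, tol=1e-12):
    """Check f(x1,y1) + f(x2,y2) >= f(x1,y2) + f(x2,y1) for x1<=x2, y1<=y2."""
    for x1 in points:
        for x2 in points:
            if x1 > x2:
                continue
            for y1 in points:
                for y2 in points:
                    if y1 > y2:
                        continue
                    if f(x1, y1) + f(x2, y2) < f(x1, y2) + f(x2, y1) - tol:
                        return False
    return True


def mean_on_curve(f, partner, n=20000):
    """Midpoint-rule value of E[f(U, partner(U))] for U uniform on (0, 1)."""
    return sum(f((i + 0.5) / n, partner((i + 0.5) / n)) for i in range(n)) / n


grid = [i / 10 for i in range(11)]
comonotonic = mean_on_curve(c, lambda u: u)              # E[U * U] = 1/3
countermonotonic = mean_on_curve(c, lambda u: 1.0 - u)   # E[U * (1 - U)] = 1/6
independent = 0.25                                       # E[U] * E[V]
```

The ordering of the three values (counter-monotonic below independent below co-monotonic) is the pattern the lemma describes for this particular choice of $c$.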
3 Results
The first result characterizes the worst-case bounds on Kendall’s $\tau$ in the case where the marginal CDFs of $X$ and $Y$ are known.
Theorem 1.
Let be given as in Definition 1, and suppose that . The worst-case bounds of under the distribution that satisfy , are given by:
Proof.
See Appendix A.1. ∎
The result of this theorem is that, for each distribution in the model, we can determine bounds on $\tau$ that permit the entire spectrum of MGPs. These bounds are sharp; that is, any value in the interval between the lower and upper bounds, including the endpoints, cannot be rejected as the true value of $\tau$. This property of the bounds follows from the sharpness of the Fréchet-Hoeffding bounds on a bivariate copula, which we use in the derivation of the two bounds.
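To test for statistical dependence using $\tau$ in a manner that is robust to the form of the MGP, one can only consider tests that depend on observables through $\tau$’s identified set. This means positing the hypothesis testing problem
$$H_0: P_0 \in \mathcal{P}_0 \quad \text{versus} \quad H_1: P_0 \in \mathcal{P}_1, \tag{3.1}$$
where $P_0$ is the true distribution of the data, and $\mathcal{P}_0$ is the set of distributions in the model of Definition 1 whose worst-case lower bound is non-positive and whose worst-case upper bound is non-negative. Notice that $\mathcal{P}_1$ is the relative complement of $\mathcal{P}_0$ in the model, which implies that, under $H_1$, either the lower bound is strictly positive or the upper bound is strictly negative. The next result implies that $\mathcal{P}_0$ coincides with the entire model, meaning the null hypothesis in (3.1) is always true.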
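Theorem 2. For every distribution in the model of Definition 1, the worst-case lower bound on $\tau$ in Theorem 1 is non-positive and the worst-case upper bound is non-negative.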
Proof.
See Appendix A.2. ∎
The result of Theorem 2 shows the worst-case bounds on $\tau$ are not informative, in the sense that they do not simultaneously take negative or positive values when the joint distribution of $(X, Y)$ exhibits negative or positive dependence, respectively. This property creates an impossibility in testing for dependence on the basis of $\tau$ that is robust to missingness of any form, as in (3.1), since it implies that the set of distributions compatible with the alternative hypothesis is empty.
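The logic of the testing problem can be summarized at the population level by a small decision rule on the identified set. The Python sketch below is purely illustrative (the function name `sign_from_identified_set` and its string labels are our own): the sign of $\tau$ is learned only when the interval excludes zero, which by Theorem 2 never happens for the worst-case bounds.

```python
def sign_from_identified_set(lower, upper):
    """Classify what an identified set [lower, upper] for tau reveals.

    Hypothetical helper for illustration: it works with population bounds
    and ignores sampling uncertainty entirely.
    """
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if lower > 0:
        return "positive dependence"
    if upper < 0:
        return "negative dependence"
    return "sign not identified"


worst_case = sign_from_identified_set(-0.6, 0.8)   # interval contains zero
refined = sign_from_identified_set(0.1, 0.7)       # interval excludes zero
```

The numerical intervals are made-up inputs chosen to exercise both branches; they are not values reported in the paper.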
The results of Theorem 1 and 2 have assumed that the marginal CDFs of and are known to the practitioner. In the scenario where this is not the case, neither of those distributions would be point-identified when we are agnostic about the nature of the MGP. By an application of the Law of Total Probability to and , we can obtain pointwise bounds on these marginal CDFs as and for all , with the boundaries themselves CDFs, given by
and
Denoting by and the sets of all CDFs of and that satisfy the respective bounds described above, the probability model is . Thus, one has bounds on that depend on hypothetical values of the margins and , and extremizing these bounds with respect to the margins over and yields the worst-case upper and lower bounds, and , respectively. Therefore, the conclusion of Theorem 2 also holds for these worst-case bounds since they are wider than their counterparts in the scenario where the marginal CDFs of and are known to the practitioner.222Scrutinizing the expressions of and , observe that the worst-case bounds in this larger model can be obtained in closed-form. For the upper bound, replace and in with and , respectively; and for the lower bound, replace and in with and , respectively.
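As a rough illustration of how pointwise bounds on a marginal CDF arise from the Law of Total Probability, the Python sketch below computes sample analogues for one variable. It rests on assumptions that are ours and not necessarily the paper's: the variable has a known bounded support, the lower boundary places all missing mass at the top of the support, and the upper boundary places it at the bottom, so that both boundaries are themselves CDFs. The function name `cdf_bounds` is hypothetical.

```python
def cdf_bounds(observed, n_missing, support, x):
    """Sample worst-case bounds on P(X <= x) when some values are missing.

    Assumption (ours): X is supported on [low, high]. The unknown
    conditional CDF of the missing values is replaced by a point mass at
    `high` for the lower bound and at `low` for the upper bound.
    """
    low, high = support
    n = len(observed) + n_missing
    if n == 0:
        raise ValueError("need at least one unit in the sample")
    observed_part = sum(1 for value in observed if value <= x) / n
    missing_share = n_missing / n
    lower = observed_part + (missing_share if x >= high else 0.0)
    upper = observed_part + (missing_share if x >= low else 0.0)
    return lower, upper


sample = [0.2, 0.4, 0.6]          # observed values
bounds_mid = cdf_bounds(sample, n_missing=1, support=(0.0, 1.0), x=0.5)
bounds_top = cdf_bounds(sample, n_missing=1, support=(0.0, 1.0), x=1.0)
```

The gap between the two bounds equals the share of missing observations at every interior point, which is why heavy missingness widens the set of feasible margins.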
4 Discussion
This section discusses the implications of our results. The model of Definition 1 is large, which is the reason why the worst-case bounds do not yield a partition of it that is compatible with the hypotheses in (3.1). Note that the model is non-parametric and permits (i) the entire spectrum of MGPs, and (ii) all bivariate absolutely continuous copulas in modelling the statistical dependence between $X$ and $Y$. This point raises the following question: does restricting the model give rise to an identified set for $\tau$ whose bounds are informative in detecting dependence between $X$ and $Y$? The answer is in the affirmative. Restrictions on the model can be motivated by many considerations based on the application at hand. For example, they can arise from the possession of side-information or from restrictions implied by economic theory, as in the partial identification approach in econometrics (e.g., Tamer, 2010). We elaborate on this point with an example of the former utilizing results in Nelsen et al. (2001).
Let denote the true distribution of . Suppose we possess side-information that where and are the medians of and , and . Accordingly, we must have , which represents the side-information in terms of the copula. Theorem 1 of Nelsen et al. (2001) provides the bounds on the copula under this restriction, which are given by
for all , where . Thus, the bounds on the joint distribution of are
(4.1) |
which hold for all where the probability model accounts for the side-information. We can apply identical steps as in the proof of Theorem 1 to obtain the corresponding bounds on under the model , but replacing the Fréchet-Hoeffding lower and upper bounds with and , respectively. For brevity, we omit these details. The bounds on are given by
and satisfy , for .
In contrast to the worst-case bounds on , the bounds and are informative, in the sense that such that
We demonstrate this point using a numerical example in which we specify and being uniformly distributed on for simplicity. Furthermore, we set and derive the MGP from a multinomial logit specification for the propensity probabilities; i.e.,
(4.2) |
Finally, to complete the specification of we must designate the copula of , , as the marginal probabilities of can be obtained by integrating the propensity scores with respect to this copula. Then, by Bayes’ Theorem, the MGP is given by the conditional probability density functions
We set as the bivariate Gaussian copula, with standard normal margins, and construct , , and through setting the correlation coefficient . As , and are linear combinations of moments, we calculated them using Monte Carlo simulations with random draws from the corresponding bivariate Gaussian copula.
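The Monte Carlo approach can be illustrated in the simplest complete-data case. The Python sketch below draws from a bivariate Gaussian distribution and compares the simulated Kendall’s $\tau$ with the closed form $(2/\pi)\arcsin(\rho)$ that holds for the Gaussian copula. The correlation value, the number of draws, and the seed are illustrative assumptions of ours, not the settings used in the paper, and the helper names are hypothetical. Because $\tau$ is invariant to strictly increasing transformations of the margins, using standard normal rather than uniform margins does not affect its value.

```python
import math
import random


def gaussian_pairs(rho, n, seed):
    """Draw n pairs from a bivariate normal with correlation rho."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho * rho)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        pairs.append((z1, rho * z1 + scale * z2))
    return pairs


def kendall_tau(pairs):
    """Concordant-minus-discordant pair count, scaled by the number of pairs."""
    n = len(pairs)
    score = 0
    for i in range(n - 1):
        x1, y1 = pairs[i]
        for j in range(i + 1, n):
            x2, y2 = pairs[j]
            product = (x1 - x2) * (y1 - y2)
            if product > 0:
                score += 1
            elif product < 0:
                score -= 1
    return score / (n * (n - 1) / 2)


rho = 0.5                                          # illustrative value only
tau_exact = 2.0 / math.pi * math.asin(rho)         # equals 1/3 when rho = 0.5
tau_simulated = kendall_tau(gaussian_pairs(rho, n=800, seed=0))
```

With more draws the simulated value concentrates around the closed form; the quadratic pair loop is kept for transparency rather than speed.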
The parameter specification for are as follows: ; ;;; and . This yields and . The parameter specification for are as follows: ; ;;; and . This specification has and gives rise to . Finally, the parameter specification for are identical to that under except that now . This specification has and gives rise to and .
The numerical results demonstrate the refined bounds can be informative in the detection of dependence. In such a situation, the practitioner can consider the following testing problem
(4.3) |
where and form a partition of . The bounds are linear combinations of moments, with unknown coefficients given by the marginal probabilities . As these marginal probabilities are typically estimable at the $\sqrt{n}$-rate, one can adapt moment inequality testing procedures to this situation, which are abundant and well-established (e.g., Andrews and Soares, 2010; Canay, 2010; Romano et al., 2014). Developing the details of such a testing procedure goes beyond the intended scope of the paper and is left for future research.
As this side-information only restricts the dependence between and , any valid testing procedure that rejects in favour of in (4.3) would be robust to the nature of the MGP. This robustness, however, comes at the expense of an ambiguity under the null. Specifically, such that and . The uninformative nature of in (4.3) is a consequence of circumventing assumptions on the MGP, which are unverifiable in practice. If one fails to reject the null, then, unfortunately, one cannot conclude anything informative about the dependence between and . In such a situation, we recommend empirical researchers perform a sensitivity analysis of this empirical conclusion (i.e., non-rejection of ) with respect to plausible assumptions on the MGP. The virtue of this type of analysis is that it establishes, in a transparent way, clear links between empirical outcomes and different assumptions made on the MGP. Such an analysis would reveal non-trivial links between assumptions on the MGP and inferences made. See, for example, Blundell et al. (2007) and Lee (2009) who refine worst-case distributional bounds using economic theory and develop testable implications based on them in the contexts of distributional analyses and treatment effect, respectively. See also Fakih et al. (2021) who discuss the refinement of worst-case distributional bounds of ordinal variables in the context of stochastic dominance testing by positing assumptions on the form of nonresponse in self-reported surveys.
5 Conclusion
This paper establishes the impossibility of performing inference for dependence between two continuous random variables using Kendall’s $\tau$ under missing data of unknown form. The crux of the issue is that its identified set always includes zero, implying that the sign of $\tau$ is not identified. We show how refining this identified set using additional information can address this problem, creating a pathway for robust inference based on statistical procedures from the moment inequality testing literature.
6 Acknowledgement
We thank Brendan K. Beare and Christopher D. Walker for helpful feedback and comments.
References
- Alvo and Cabilio (1995) Alvo, M. and P. Cabilio (1995). Rank correlation methods for missing data. Canadian Journal of Statistics 23(4), 345–358.
- Andrews and Soares (2010) Andrews, D. W. K. and G. Soares (2010). Inference for Parameters Defined by Moment Inequalities using Generalized Moment Selection. Econometrica 78(1), 119–157.
- Asimit et al. (2016) Asimit, A. V., R. Gerrard, Y. Hou, and L. Peng (2016). Tail dependence measure for examining financial extreme co-movements. Journal of Econometrics 194(2), 330–348.
- Bagdonavicius et al. (2011) Bagdonavicius, V., J. Kruopis, and M. Nikulin (2011). Nonparametric Tests for Complete Data (First ed.). Wiley.
- Bahadur and Savage (1956) Bahadur, R. R. and L. J. Savage (1956). The Nonexistence of Certain Statistical Procedures in Nonparametric Problems. The Annals of Mathematical Statistics 27(4), 1115–1122.
- Bertanha and Moreira (2020) Bertanha, M. and M. J. Moreira (2020). Impossible inference in econometrics: Theory and applications. Journal of Econometrics 218(2), 247–270.
- Blundell et al. (2007) Blundell, R., A. Gosling, H. Ichimura, and C. Meghir (2007). Changes in the distribution of male and female wages accounting for employment composition using bounds. Econometrica 75(2), 323–363.
- Cameron and Trivedi (1993) Cameron, A. C. and P. K. Trivedi (1993). Tests of independence in parametric models with applications and illustrations. Journal of Business & Economic Statistics 11(1), 29–43.
- Canay (2010) Canay, I. A. (2010). EL Inference for Partially Identified Models: Large Deviations Optimality and Bootstrap Validity. Journal of Econometrics 156(2), 408–425.
- Canay and Shaikh (2016) Canay, I. A. and A. M. Shaikh (2016, January). Practical and theoretical advances in inference for partially identified models. CeMMAP working papers CWP05/16, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
- Canisius et al. (2016) Canisius, S., L. Wessels, and J. W. M. Martens (2016). A novel independence test for somatic alterations in cancer shows that biology drives mutual exclusivity but chance explains most co-occurrence. Genome Biology 17(261).
- Dutz et al. (2021) Dutz, D., I. Huitfeldt, S. Lacouture, M. Mogstad, A. Torgovitsky, and W. van Dijk (2021, December). Selection in surveys. National Bureau of Economic Research. Working Paper 29549.
- Fakih et al. (2021) Fakih, A., P. Makdissi, W. Marrouch, R. V. Tabri, and M. Yazbeck (2021). A stochastic dominance test under survey nonresponse with an application to comparing trust levels in Lebanese public institutions. Journal of Econometrics.
- Fan and Patton (2014) Fan, Y. and A. J. Patton (2014). Copulas in econometrics. Annual Review of Economics 6(1), 179–200.
- Fox et al. (2015) Fox, G. A., S. Negrete-Yankelevich, and V. J. Sosa (2015). Ecological Statistics: Contemporary theory and application. Oxford University Press.
- Horowitz and Manski (1995) Horowitz, J. L. and C. F. Manski (1995). Identification and robustness with contaminated and corrupted data. Econometrica 63(2), 281–302.
- Horváth and Rice (2015) Horváth, L. and G. Rice (2015). Testing for independence between functional time series. Journal of Econometrics 189(2), 371–382.
- Kendall (1938) Kendall, M. G. (1938). A new measure of rank correlation. Biometrika 30(1), 81–93.
- Lee (2009) Lee, D. S. (2009, 07). Training, wages, and sample selection: Estimating sharp bounds on treatment effects. The Review of Economic Studies 76(3), 1071–1102.
- Ma (2012) Ma, Y. (2012). On inference for Kendall’s $\tau$ within a longitudinal data setting. Journal of Applied Statistics 39(11), 2441–2452.
- Manski (1989) Manski, C. F. (1989). Anatomy of the selection problem. The Journal of Human Resources 24(3), 343–360.
- Manski (2003) Manski, C. F. (2003). Partial Identification of Probability Distributions. Springer.
- Nelsen (2006) Nelsen, R. B. (2006). An Introduction to Copulas (Second ed.). Springer.
- Nelsen et al. (2001) Nelsen, R. B., J. J. Quesada-Molina, J. A. Rodríguez-Lallena, and M. Úbeda Flores (2001). Bounds on bivariate distribution functions with given margins and measures of association. Communications in Statistics - Theory and Methods 30(6), 1055–1062.
- Puccetti and Wang (2015) Puccetti, G. and R. Wang (2015). Extremal dependence concepts. Statistical Science 30(4), 485–517.
- Romano et al. (2014) Romano, J. P., A. M. Shaikh, and M. Wolf (2014). A practical two-step method for testing moment inequalities. Econometrica 82(5), 1979–2002.
- Sklar (1959) Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de l’Institut Statistique de l’Université de Paris 8(2), 229–231.
- Swihart and Slade (1985) Swihart, R. K. and N. A. Slade (1985). Testing for independence of observations in animal movements. Ecology (Durham) 66(4), 1176–1184.
- Tamer (2010) Tamer, E. (2010). Partial identification in econometrics. Annual Reviews in Economics 2(1), 167–195.
- Yu et al. (2016) Yu, P. L., K. Lam, and M. Alvo (2016). Nonparametric rank tests for independence in opinion surveys. Österreichische Zeitschrift für Statistik 31(4).
Appendix A Proofs of Results
A.1 Proof of Theorem 1
Proof.
The proof proceeds by the direct method. We shall derive bounds on for each and recall that
First, we focus on the upper bound, and bound each term appearing in the sum separately. Starting with , note that it is less than or equal to , since by Fréchet-Hoeffding upper bound in -dimensions we have that for all . Now focusing on the term , note that is not observed and we replace it with its largest theoretical value, . Thus we bound which holds for all . Therefore, . Similarly, we bound which holds for all . Hence, . Finally, on the event , which is when neither nor is observed, we bound from above by , which holds for all . This yields . Combining these bounds yields
Next, we focus on the lower bound, and bound each term appearing in the representation of separately. Starting with , note that it is greater than or equal to
, since by Fréchet-Hoeffding lower bound in -dimensions we have that for all . Now focusing on the term
, note that is not observed and we replace it with its smallest theoretical value, . Thus, we bound , which holds for all , implying that . Similarly we bound , which holds for all , so that . Finally, on the event , which is when neither nor is observed, we bound from below by . Combining these bounds yields
This concludes the proof. ∎
A.2 Proof of Theorem 2
Proof.
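The proof proceeds by the direct method. We split the proof into two parts: (i) showing that the worst-case lower bound is non-positive for every distribution in the model, and (ii) showing that the worst-case upper bound is non-negative for every distribution in the model.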
Part (i). Fix . Firstly, note that if , then , and the inequality trivially holds. Now, we consider the case , and note that in this case
(A.1) |
where we have used the Condition (i) in the definition of to express the integrator in terms of a density. By Condition (ii), we can re-write as follows:
Substituting this expression for in (A.1) and simplifying yields
(A.2) |
where the inequality follows from holding for all . We can maximize the integral in (A.2) with respect to the joint distribution of to find that
(A.3) |
Since the function is supermodular, Part 1 of Lemma 1 implies
where . Now, we shall argue that . As the CDFs $F_X$ and $F_Y$ are known, we define the new random variables $U = F_X(X)$ and $V = F_Y(Y)$. By the Probability Integral Transform, both $U$ and $V$ are distributed as continuous uniform on the unit interval $[0,1]$. This yields the representation
where . This copula is supported on the line segment in the unit square . Consequently,
Therefore, . Since we have chosen an arbitrary , the deduction holds for all . This concludes the proof for the lower-bound.
Part (ii). First, recall that
In the case that holds, we have that , which is the desired result. Now we consider the case where . Substituting into the expression of above and simplifying yields
(A.4) |
where the inequality (A.4) arises from the fact that and are bounded from below by with probability one (under ). Next, fix an arbitrary such that . Note that such a exists since . We will re-express
in a more convenient form to apply Part 2 of Lemma 1. It is by definition equal to
Now since we can multiply and divide by in the integrand and simplifying yields
which is equal to
Note that this re-writing applies for each for which . Now because
the expression in (A.4) is bounded from below by
(A.5) |
which in turn is bounded from below by
(A.6) |
As the function is supermodular, and the integrator in (A.6) only depends on the joint distribution of , we can apply Part 2 of Lemma 1 to deduce that the minimal value (A.6) is bounded from below:
(A.7) |
where . Now we argue that the right side of the inequality in (A.7) equals -3/4 to deduce the result of this theorem. This expected value equals
Now, using this result, we find that
(A.8) |
concluding the proof. ∎