Graph-Based Tests for Multivariate Covariate Balance Under Multi-Valued Treatments
Abstract
We propose the use of non-parametric, graph-based tests to assess the distributional balance of covariates in observational studies with multi-valued treatments. Our tests utilize graph structures ranging from Hamiltonian paths, which connect all of the data, to nearest neighbor graphs, which maximally separate the data into pairs. We consider algorithms that form minimal distance graphs, such as optimal Hamiltonian paths or non-bipartite matching, and approximate alternatives, such as greedy Hamiltonian paths or greedy nearest neighbor graphs. Extensive simulation studies demonstrate that the proposed tests are able to detect the misspecification of matching models that other methods miss. Contrary to intuition, we also find that tests run on well-formed approximate graphs do better in most cases than tests run on optimally formed graphs, and that a properly formed test on an approximate nearest neighbor graph performs best, on average. In a multi-valued treatment setting with breast cancer data, these graph-based tests can also detect imbalances otherwise missed by common matching diagnostics. We provide a new R package graphTest to implement these methods and reproduce our results.
1 Introduction
In studies of causal effects, balancing the covariate distributions in each treatment group is essential for unbiased estimates. For example, in a study of the effects of smoking on medical expenditures, multiple smoking behaviors are possible: there may be never smokers, former smokers, and current smokers. To estimate the relative effects of the different exposures, we would like the groups to be balanced on various demographic characteristics such as age, race, and income. Distributional balance is expected by design in randomized controlled trials (RCTs) for both measured and unmeasured confounders, but conducting an RCT is often not possible because of either cost or ethical restrictions, as would be the case in such a study of smoking. Instead, researchers are left to rely on observational studies to answer their causal questions.
If the covariates are few and have a small number of discrete levels, obtaining exact matches between treatment groups may be possible, thereby achieving exact distributional balance. However, such simple settings are rarely found in practice. The problem is further compounded in the multi-valued treatment setting, where there are more groups to match. Achieving distributional balance in such multi-valued treatment settings then relies on adjustments via methods such as propensity score matching [27] or cardinality matching [35]. Intuitively, these methods work by matching units that are close in terms of some covariate distance, which makes the distributions more similar between the treatment groups. But how do we know when the distributions between treatment groups are similar enough to proceed?
To answer this question, investigators typically consider univariate assessments of covariate balance. Such assessments include comparing the mean of each covariate by treatment group before and after matching, with a corresponding test for the difference of these means [30]. Other authors run a Kolmogorov-Smirnov test [19, 29] for each covariate [33, 12]. However, these methods will not generally diagnose differences in the joint covariate distribution between treatment groups.
Instead, we seek a way to identify differences in the joint distributions between the multiple groups after matching is completed—in other words, whether the observational study fails to approximate a randomized experiment in terms of observed covariate balance. Several common methods of assessing distributional balance such as likelihood ratio tests or examining the distribution of the propensity scores [28] require having correct probability models for the data. Unfortunately, these data models are rarely known in practice.
Fortunately, several non-parametric tests apply to the multi-valued treatment setting, and the ones we consider involve forming a graph-like structure on the data. Some work ranks the data in some fashion such that the observations are ordered in a one-dimensional space and hence form a graph [5]. For example, [25] consider a rank-based test that uses the data ranks of each individual covariate in a Kruskal-Wallis test [21], while [10] use data depth—measuring how far each observation is from the overall sample mean in terms of Mahalanobis distance—to calculate another rank-based test. [22] also uses a rank-based test to simultaneously examine location and scale changes. Rather than ranks, [24] instead form a non-bipartite graph and extend the crossmatch test of [26] to multi-valued treatments.
In this vein, we consider several graph structures on the data, ranging from Hamiltonian paths, which connect all of the data into a single graph, to k-nearest neighbor (kNN) graphs. Hamiltonian paths are graphs that connect all of the observations in a dataset under the constraint that each observation is visited only once. If we imagine the path as a highway, we see all of the observations along the highway, but there is no way to revisit points we have passed without turning around—in graph terminology, there are no cycles. There are many such paths through the data under this broad definition, but we define the shortest of the possible paths as the optimal Hamiltonian path. Under this graph, if the distributions vary between treatment groups, we expect units from the same distribution to be closer to each other along the path.
For nearest neighbor graphs, the goal is not to find a path but to find matches—i.e., neighbors—that are close to each other in terms of a distance metric. This can be done in a way that allows overlap between matched sets or in a way that forces the matched sets to be disjoint. In both types of nearest neighbor graphs, if the distributions vary between treatment groups, we expect observations from the same treatment group to be connected more frequently to each other than to observations from different treatments.
Several methods use these graphs for two-sample testing. For Hamiltonian paths, [7] use the Wald-Wolfowitz runs test [32] along the path to test for distributional similarity in the two-sample setting. For nearest neighbor graphs, the crossmatch test of [26] creates non-overlapping pairs via optimal non-bipartite matching and counts the number of matches that occur across samples. Recent work by [9] develops a similar test on greedy kNN graphs that instead counts the number of times nearest neighbors come from the same sample.
Unfortunately, there has not been much work extending these particular graph-based tests to the multisample setting. As previously mentioned, [24] extend Rosenbaum's crossmatch test to the multisample setting, but similar extensions exist neither for Hamiltonian paths nor for greedy nearest neighbor graphs. In addition, the empirical performance of these methods in practice is unclear, and to our knowledge, there is no readily available computational implementation of them.
To bridge these gaps with matched samples, we offer four contributions. First, we develop multisample extensions of the tests along the Hamiltonian path. Second, we provide a multisample extension of the nearest neighbor test of [9]. Third, we conduct an extensive simulation study of these tests in a variety of settings, finding that the kNN test outperforms the others. Fourth, we implement all of these tests in the new R package graphTest.
2 Setting
Suppose we have covariate data from $K$ treatment groups, $X_{k1}, \ldots, X_{kn_k} \overset{iid}{\sim} F_k$ for $k = 1, \ldots, K$, with total sample size $n = \sum_{k=1}^{K} n_k$. These groups could arise from sampling from different populations or because observations from the same population have chosen different treatments. Our goal is to evaluate whether, after matching or some other adjustment, the distribution in each group is the same—i.e., $H_0\colon F_1 = F_2 = \cdots = F_K$.
Constructing the graph-based tests studied in this work requires a notion of closeness between units and a way to denote which units are connected. We define $D$ as the $n \times n$ pairwise distance matrix between all points in the sample, with entries $D_{ij} = d(x_i, x_j)$ for some distance $d$. Further, let $A$ be a matrix whose entries $A_{ij}$ indicate whether unit $i$ is connected to unit $j$. If unit $i$ is connected to unit $j$, then $A_{ij} = 1$. Otherwise, $A_{ij} = 0$.
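To make these objects concrete, the following sketch computes $D$ and a kNN version of $A$ in base R; the covariate matrix X and the choice of five neighbors are illustrative, not part of our formal setup.

```r
# Sketch: pairwise distance matrix D and a kNN adjacency matrix A.
# X is an n x p covariate matrix (illustrative name).
D <- as.matrix(dist(X))  # Euclidean by default: D[i, j] = d(x_i, x_j)
n <- nrow(X)
k <- 5
A <- matrix(0L, n, n)
for (i in seq_len(n)) {
  # connect i to its k nearest neighbors, excluding i itself
  A[i, order(D[i, ])[2:(k + 1)]] <- 1L
}
```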
3 Motivating example
To better understand this setting and the challenges it presents, we now consider a small example where individuals can choose one of three treatments. Our setting has only two confounders, $X_1$ and $X_2$, and we generate a treatment indicator whose assignment probabilities depend on functions of these confounders beyond their first moments. We set the sample size to $n = 150$.
Typically, researchers will not know the true treatment-assignment mechanism, and we emulate this situation in our setup. We use cardinality matching [35] to balance first moments between each group and the overall sample mean, but we ignore the covariate functions necessary to capture the true propensity score model. Matching toward the overall sample means targets the average treatment effects for the whole sample [36, 2]. We use the R package designmatch [34] and set the parameters of the matching so that the group means should differ from the sample mean by no more than 0.1 standard deviations. The results of such a matching for one draw of the data are shown in Figure 1.
The covariate balance before and after matching for this draw, shown in Table 1, would indicate that the matching was a success. Further, the absolute standardized mean difference, defined here for each covariate as $\frac{1}{K}\sum_{k=1}^{K} |\bar{X}_k - \bar{X}| / \hat{\sigma}$, where $\bar{X}$ is the overall sample mean and $\hat{\sigma}$ the sample standard deviation, also indicates that all groups are within 0.1 standard deviations of the sample mean on average, and tests for differences in means fail to reject the null hypothesis that the group means are the same.
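As a concrete illustration of these univariate diagnostics, the following sketch computes the absolute standardized mean differences and the tests of group means reported in Table 1; X (the covariate matrix) and z (the vector of group labels) are illustrative names.

```r
# Sketch: univariate balance diagnostics as in Table 1.
std_diff <- function(x, z) {
  # average absolute standardized difference between each group mean
  # and the overall sample mean
  mean(abs(tapply(x, z, mean) - mean(x)) / sd(x))
}
apply(X, 2, std_diff, z = z)
# one-way ANOVA p-value for equality of group means, per covariate
apply(X, 2, function(x) summary(aov(x ~ factor(z)))[[1]][["Pr(>F)"]][1])
```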

Unfortunately, the matching model misses two important features of treatment selection: the covariate functions that enter the true propensity score model. Running this experiment 1000 times, we see that while the standardized differences of the means of $X_1$ and $X_2$ are well balanced on average, the other covariate functions are not (Figure 2(a)). Similarly, univariate tests of covariate means are insufficient to diagnose the statistically significant deviations in the other covariate functions (Figure 2(b)). Clearly, we need an omnibus test to diagnose failure without having to check every possible covariate moment.
Table 1: Covariate balance for one draw of the data, before and after cardinality matching on first moments.

| Cov. | Sample | Group 1 (before) | Group 2 (before) | Group 3 (before) | Std. diff (before) | p (before) | Group 1 (after) | Group 2 (after) | Group 3 (after) | Std. diff (after) | p (after) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $X_1$ | 0.02 | 0.15 | 0.06 | -0.11 | 0.10 | 0.49 | 0.09 | 0.06 | -0.04 | 0.05 | 0.85 |
| $X_2$ | -0.12 | -0.67 | -0.05 | 0.13 | 0.27 | 0.00 | -0.21 | -0.05 | -0.03 | 0.08 | 0.78 |


With this goal in mind, we now examine some graph-based tests that may offer a solution.
4 Structures, algorithms, and tests
This section considers two graph structures: Hamiltonian paths and nearest neighbor graphs. For each, we describe the graph structures, the algorithms to construct them, and the tests that we will run on these structures. Throughout, we will use the motivating example of the previous section for exposition in our figures.
4.1 Hamiltonian paths
As stated previously, Hamiltonian paths are graphs through the data in which each point is visited exactly once and the starting and ending points differ. A Hamiltonian cycle is similar, except that the starting and ending points are the same. Finding an optimal Hamiltonian cycle is known in the literature as the Traveling Salesman Problem, or TSP [11].
4.1.1 Algorithms
There are a variety of ways to construct Hamiltonian paths. We present three algorithms to do so in order of increasing speed and decreasing optimality (Figure 3).


Our optimal algorithm is the Concorde solver of [1]. Unfortunately, Concorde requires integer distances, which forces us to scale and round the distances, risking either a loss of precision or numeric overflow in order to preserve accuracy. Despite these concerns, the Concorde solver still finds optimal paths on the moderately sized problems considered in this paper, and it does so much faster than a brute-force integer programming solution.
The next solution we consider comes from an application of Kruskal's algorithm [20]. Using the pairwise distance matrix $D$, we add edges in a greedy fashion, starting with the two observations with the shortest pairwise distance, and then sequentially add the closest remaining unit to the end of the path until we have a full Hamiltonian path. While not globally optimal, this path will be closer to optimal locally for some subsets of observations.
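As a sketch of how a greedy path can be constructed with existing tools, the TSP package [15], which we also use in Section 5, provides a greedy nearest neighbor heuristic; the dummy-city device below is the standard way to cut a tour open into a Hamiltonian path. The construction differs in detail from the Kruskal-style edge insertion described above.

```r
library(TSP)
# Sketch: a greedy Hamiltonian path through the rows of X (illustrative name).
tsp <- TSP(dist(X))
# a dummy city at distance zero from all others lets us cut the
# resulting tour open into a Hamiltonian path
tsp <- insert_dummy(tsp, label = "cut")
tour <- solve_TSP(tsp, method = "nn")  # greedy nearest neighbor heuristic
path <- cut_tour(tour, "cut")          # ordering of the observations
```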
The final algorithm is a sorting algorithm based on the Hilbert space-filling curve [17]. The idea is to order the data points along a Hilbert space-filling curve, which induces a Hamiltonian path through the data. Because this method prescribes an ordering of the data points, it is called Hilbert sorting. Since the constructed path is deterministic, Hilbert sorting is incredibly fast—sorting 10 million observations in about 20 seconds on a circa 2018 laptop—but the path is nowhere close to optimal, globally or locally. To perform this sorting, we use the implementation freely available from [31] with an R interface relying on the approxOT package [13].
4.1.2 Test statistics
After construction, the Hamiltonian path forms a one-dimensional Radon measure on the manifold created by the corresponding graph [7]. Due to this one-dimensional nature, the Hamiltonian path admits a variety of one-dimensional statistics and corresponding tests. In this work, we consider two such tests: one based on the ranks of observations along the Hamiltonian path and one based on the runs of observations from the same treatment group [19, 29, 6, 23].
For these quantities, we consider two general forms of tests. The first is an extremum test—either a minimum or a maximum—that looks at the probability of observing the most extreme group statistic under $H_0$: if $\mathbf{S} = (S_1, \ldots, S_K)$ is the vector of group statistics, then $S_{\min} = \min_k S_k$ or $S_{\max} = \max_k S_k$ are the corresponding test statistics. The second test has the form of a Wald test—that is, given $\mathbf{S}$, its null mean $\boldsymbol{\mu} = E[\mathbf{S}]$, and its null covariance $\Sigma = \mathrm{Var}(\mathbf{S})$, then
$$W = (\mathbf{S} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{S} - \boldsymbol{\mu}).$$
The probabilities of these test statistics naturally depend on the distributions of the statistics $S_k$, which are often asymptotically Gaussian. Under a Gaussian distribution, the probability of the extremum can easily be calculated with the mvtnorm package in R [14]. Meanwhile, $W \sim \chi^2_{\nu}$, where $\nu$ is the degrees of freedom—typically $K$ or $K - 1$—with probabilities easily calculated in any statistical software.
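For instance, under a Gaussian approximation with null mean mu and covariance Sigma (illustrative names), both p-values can be computed in a few lines:

```r
library(mvtnorm)
# Sketch: p-values for a maximum test and a Wald test, assuming the
# statistic vector S is approximately N(mu, Sigma) under the null.
K <- length(S)
# P(max_k S_k >= observed max) = 1 - P(all S_k < observed max)
p_max <- 1 - pmvnorm(lower = rep(-Inf, K), upper = rep(max(S), K),
                     mean = mu, sigma = Sigma)[1]
W <- drop(t(S - mu) %*% solve(Sigma, S - mu))
p_wald <- pchisq(W, df = K, lower.tail = FALSE)  # here df = K
```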
Finally, we note that for the rank-based tests, we only consider the Wald test, since the extremum test is not easily obtained. For the runs-based test, we examine both the Wald and extremum tests.
4.2 Nearest neighbor graphs
Similar to Hamiltonian paths, nearest neighbor graphs connect observations that are close to each other in terms of their covariate distance $D_{ij}$. Nearest neighbor graphs can also be constructed in an optimal or a greedy way, as seen in Figure 4.

4.2.1 Algorithms
Optimal nearest neighbor graphs, also known as non-bipartite matching (NBM), minimize the total distance between matched units subject to constraints that force each unit to be matched exactly once; as a result, some units may not be matched to their closest possible match. Unlike the TSP, this problem is quick to solve for pairwise matching, but for larger matched sets the problem is NP-hard [18].
In contrast, the greedy formulation is quick to solve for any number of matches $k$. One simply takes the pairwise distance matrix and finds the $k$ nearest matches to the observation of interest. This is done with replacement, so matches may be shared across observations, unlike in NBM. More formally, the kNN graph solves
$$\min_{A} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} D_{ij} \quad \text{such that} \quad \sum_{j=1}^{n} A_{ij} = k, \; A_{ii} = 0, \; A_{ij} \in \{0, 1\} \;\text{ for all } i, j,$$
which separates across rows so that each observation independently selects its $k$ nearest neighbors.
Fortunately, there are algorithms that efficiently compute the nearest neighbor graph without having to construct the full distance matrix. Several of these, such as the k-d tree [3], are implemented in the R package FNN [4].
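For example, the directed kNN graph can be obtained in one call (a sketch; X and the choice of five neighbors are illustrative):

```r
library(FNN)
# Sketch: row i of nn_index lists the indices of the 5 nearest neighbors
# of observation i, computed without forming the full distance matrix.
nn_index <- get.knn(X, k = 5)$nn.index
```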
4.2.2 Test statistics
Unlike our tests along Hamiltonian paths, the tests on nearest neighbor graphs depend on how the graphs are built. As such, we describe the corresponding statistics for NBM and kNN graphs separately. As above, we consider both extremum and Wald tests.
Crossmatch statistic
The crossmatch statistic uses the structure of NBM to count how many observations are matched across treatment groups. Intuitively, if the null hypothesis $H_0\colon F_1 = \cdots = F_K$ is true, then we would expect a high number of matches across groups [26, 24]. For the crossmatch test, the extremum test is a minimum test, since we expect individuals to be matched across groups less often when $H_0$ is false.
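The statistic itself is a simple count once the matching is formed. In the sketch below, pairs is a hypothetical two-column matrix of matched unit indices from a non-bipartite matching and g is the vector of group labels; the multisample test of [24] additionally tracks these counts for each pair of groups.

```r
# Sketch: number of matched pairs whose two units come from
# different treatment groups (pairs and g are illustrative names).
crossmatch_count <- sum(g[pairs[, 1]] != g[pairs[, 2]])
```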
k-Nearest neighbors statistic
We now present our multisample version of the statistic developed by [9]. For a greedy nearest neighbor graph, we look at each observation $i$ and count how many of $i$'s nearest neighbors come from the same treatment group as $i$. Under the alternative, we would expect this count to be high, since observations from the same treatment group would be more likely to be close to one another. If this happens more often than we would expect under the null, we conclude the distributions are different.
More formally, let $g_i$ denote the treatment group of unit $i$ and define $C_k = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij}\,\mathbf{1}\{g_i = k\}\,\mathbf{1}\{g_j = k\}$, the number of times units from group $k$ are matched to other units from group $k$. We then have the following lemma for $C_k$.
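Continuing the FNN sketch from Section 4.2.1, these counts can be computed directly from the neighbor index matrix (again with illustrative names):

```r
# Sketch: C[k] counts directed edges whose endpoints both lie in group k.
# nn_index is the neighbor matrix from get.knn; g is the vector of labels.
same_group <- matrix(g[nn_index], nrow = length(g)) == g
C <- tapply(rowSums(same_group), g, sum)
```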
Lemma 1.
Let $n_k$ be the number of units in group $k$ and let $|E| = \sum_{i,j} A_{ij}$ be the number of directed edges in the graph. Under the permutation null, the expectation of $C_k$ is
$$E[C_k] = |E|\,\frac{n_k(n_k - 1)}{n(n - 1)}, \tag{1}$$
and the variance of $C_k$ (2) and the covariance of $C_k$ and $C_l$ for $k \neq l$ (3) are closed-form functions of $n$, $n_k$, $n_l$, $|E|$, the number of pairs of observations that are mutual nearest neighbors, and the number of pairs of observations that share a nearest neighbor; the exact expressions are given in LABEL:sec:proof.
For a proof, see LABEL:sec:proof. The basic idea is to sum over the edges of the graph and multiply by the probability that a given set of nodes comes from group $k$.
We now turn to the asymptotic distribution of the counts $C_k$. In principle, we could use the exact permutation distribution to calculate the $p$-values of tests using this statistic; however, this is computationally very expensive, so we do not do so in practice. Before proceeding, we note that Propositions 3.1 and 3.2 of [16] show that the scaled numbers of mutual nearest neighbor pairs and shared nearest neighbor pairs converge to constants that depend on only two things: the distance metric used to calculate the nearest neighbors and the dimension of the covariate space. We then define $Z_k = (C_k - E[C_k])/\sqrt{\mathrm{Var}(C_k)}$ as the centered and scaled version of $C_k$. This brings us to the asymptotic distribution of the $Z_k$.
Theorem 1.
Under the permutation null distribution, as $n \to \infty$,
$$(Z_1, \ldots, Z_K) \xrightarrow{d} N(\mathbf{0}, \Sigma),$$
with $\Sigma_{kk} = 1$ for all $k$. For $k \neq l$, $\Sigma_{kl}$ is the limiting correlation between $C_k$ and $C_l$, which depends only on the limiting group proportions and the constants of [16] above; the exact expression is given in LABEL:sec:proof.
We provide a proof in LABEL:sec:proof.
In practice, we utilize the finite sample correlations between the counts,
$$\hat{\rho}_{kl} = \frac{\mathrm{Cov}(C_k, C_l)}{\sqrt{\mathrm{Var}(C_k)\,\mathrm{Var}(C_l)}},$$
with the variances and covariances as given in Lemma 1. This will have $\Sigma_{kl}$ as its limit. Further, given the discrete nature of the counts, we utilize a continuity correction, shifting each count by $1/2$ before standardizing.
For the kNN statistics, the extremum test is a maximum-based test, since we expect individuals to be matched to their own group more often when $H_0$ is false. The Wald test using the kNN statistic can be calculated as $W = \mathbf{Z}^\top \hat{R}^{-1} \mathbf{Z}$, where $\mathbf{Z} = (Z_1, \ldots, Z_K)$ and $\hat{R}$ is the matrix of finite sample correlations, with $W$ compared to a $\chi^2$ distribution as before.
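A sketch of both tests, assuming the moments from Lemma 1 have already been computed (EC, VC, and Rhat are illustrative names for the vector of expectations, the vector of variances, and the correlation matrix):

```r
library(mvtnorm)
# Sketch: standardized counts with the continuity correction, then the
# maximum and Wald tests on the kNN statistic.
Z <- (C - 0.5 - EC) / sqrt(VC)
K <- length(Z)
p_max  <- 1 - pmvnorm(lower = rep(-Inf, K), upper = rep(max(Z), K),
                      sigma = Rhat)[1]
W <- drop(t(Z) %*% solve(Rhat, Z))
p_wald <- pchisq(W, df = K, lower.tail = FALSE)
```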
Finally, we note that the power of tests based on the kNN statistic can be sensitive to the choice of the number of nearest neighbors: [5] finds that this number should increase with the sample size in order for the test to be well-powered.
5 Simulation studies
We now present simulation results in a variety of settings. Note that we fix the number of nearest neighbors in each case where we use the kNN test.
In the first simulation study, we return to the motivating example from Section 3 to see if the new tests can correctly diagnose deviations from the null hypothesis that are missed by univariate tests. In the second simulation study, we use the settings of [24] to evaluate the performance of the graph-based tests. Overall, we find that greedy path and greedy nearest neighbor constructions perform better than their optimal counterparts, and that the kNN test performs best on average.
5.1 Motivating example, continued
Returning to the motivating example from Section 3, we again use cardinality matching to target distributional balance. As before, we use a misspecified setting that matches on the first moments of the two confounders but misses the important covariate functions from the true assignment model; this time, we also add a well-specified setting that includes all relevant covariate moments. We again match treatment groups so that the group means of the selected covariate functions are within 0.1 standard deviations of the sample mean, and we still use 1000 experiments with a sample size of 150. For each experiment, we run the kNN and crossmatch tests on their corresponding graphs and the runs and ranks tests on a greedy Hamiltonian path built via Kruskal's algorithm. Our implementation of Kruskal's algorithm relies on the TSP package developed by [15] in R.

Figure 5 reveals that the kNN and runs tests successfully detect distributional imbalance both at baseline and under the misspecified models. Of concern, the crossmatch and ranks tests fail to reject the null in many cases, even at baseline.
5.2 Setting of [24]
For our second simulation study, we use the setting of [24]. In their experiments, the authors use multivariate Gaussian data and change either (1) the parameters of the distribution or (2) the structure of the problem. The data in each treatment group are always drawn from a multivariate Gaussian distribution whose mean vector and covariance matrix depend on the setting, as described below. We run 1000 experiments in each setting.
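As a sketch of the data-generating step (the means and covariances below are placeholders, not the paper's setting-specific values):

```r
library(mvtnorm)
# Sketch: draw one group's data from a d-dimensional Gaussian.
d <- 5
mu_k <- rep(0, d)       # placeholder mean vector
Sigma_k <- diag(d)      # placeholder covariance matrix
X_k <- rmvnorm(150, mean = mu_k, sigma = Sigma_k)
```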
(1) Parameter values
We change the parameter values in one of three ways. The first is location changes, in which the mean vectors of some groups are shifted by a multiple of the vector of ones; the second is scale changes, in which the group covariance matrices are scaled by different constants; and the third is correlation changes, in which the group covariance matrices combine the identity matrix with a matrix of ones to induce differing common correlations. The magnitudes of these changes vary across settings.
(2) Structure changes
For changes to the problem structure, we either change the dimension of the covariates or the number of treatment groups. Note that as we change the dimension of the covariates, we keep the number of treatment groups fixed at five, and as we change the number of treatment groups, we keep the sample size fixed at 150.
Finally, we present results averaged across simulation settings in this section to simplify the presentation. For a detailed presentation of the results in each setting, see Web Figures LABEL:fig:web_1–LABEL:fig:knn_k_power.
5.2.1 Extremum versus Wald tests
As previously discussed, there are two ways we can construct our tests: either an extremum (max or min) or a Wald test. Ultimately, we find that extremum tests perform better in tests along Hamiltonian paths while the Wald tests perform better for the nearest neighbor graphs.
To simplify exposition, we average across parameter change settings and present the relative power of Wald tests compared to extremum tests, that is,
$$\text{relative power} = \frac{\text{power of the Wald test}}{\text{power of the extremum test}}.$$
Numbers greater than one indicate that the Wald test has higher power on average, while numbers less than one indicate that the extremum test has higher power for the given algorithm and structure.
Table 2: Relative power of the Wald test compared to the extremum test as the covariate dimension increases.

| Algorithm | 5 | 10 | 25 | 50 | 100 | 250 |
|---|---|---|---|---|---|---|
| Optimal | 0.69 | 0.72 | 0.71 | 0.75 | 0.80 | 0.87 |
| Greedy | 0.68 | 0.71 | 0.77 | 0.83 | 0.90 | 0.92 |
| Hilbert sort | 0.68 | 0.71 | 0.76 | 0.72 | 0.68 | 0.72 |
| NBM | 1.44 | 1.54 | 1.50 | 1.52 | 1.42 | 1.38 |
| kNN | 0.93 | 1.06 | 1.16 | 1.17 | 1.15 | 1.09 |
Table 3: Relative power of the Wald test compared to the extremum test as the number of treatment groups increases.

| Algorithm | 3 | 4 | 5 | 6 |
|---|---|---|---|---|
| Optimal | 1.03 | 0.87 | 0.82 | 0.77 |
| Greedy | 1.09 | 0.95 | 0.89 | 0.88 |
| Hilbert sort | 1.59 | 1.06 | 0.71 | 0.56 |
| NBM | 0.92 | 1.19 | 1.43 | 1.80 |
| kNN | 1.30 | 1.21 | 1.15 | 1.06 |
Table 2 presents the average results for each graph structure as the covariate dimension increases while holding the number of treatment groups fixed at five. For the Hamiltonian path algorithms, the Wald tests have lower relative power at smaller covariate dimensions, with the difference shrinking as the dimension of the problem increases. In contrast, for the nearest neighbor graphs, the Wald tests have higher power on average as the covariate dimension increases. Only kNN with 5 covariates bucks this trend.
Similarly, Table 3 shows the results as we change the number of groups. We note that the Wald tests have higher power for the Hamiltonian paths with three groups. However, this reverses as the number of groups increases, and there is a general trend for the extremum tests to have higher relative power as we consider larger numbers of groups.
In contrast, the nearest neighbor graph algorithms display higher power for the Wald tests in most cases; however, kNN and NBM display opposite trends in relative power as the number of groups increases. The kNN algorithm shows a shrinking difference between the Wald and extremum tests as the number of groups increases, while the NBM algorithm displays a growing difference, with the Wald test having higher relative power.
5.2.2 Greedy paths perform better than optimal ones
We might assume a priori that the optimal Hamiltonian path will perform better than its greedy version because it is the shortest path through the data. Unfortunately, our experiments do not support this idea.
In Table 4, we present the results from changing the dimension of the covariate space for our three algorithms: optimal, greedy, and Hilbert sorting. For comparison purposes, we present results relative to the power of the tests on an optimal Hamiltonian path, that is,
$$\text{relative power} = \frac{\text{power of the test on the given path}}{\text{power of the test on the optimal path}}.$$
As before, we average the results across parameter changes and present the average relative power for each path as we change the problem dimension (Table 4) and the number of treatment groups (Table 5). On average, the greedy path does better than the optimal path, with increasing relative power as the dimension increases. In contrast, the Hilbert sort path displays decreasing relative power as the dimension of the problem increases.
Similarly, in Table 5, we see that the relative power of tests on the greedy Hamiltonian path is consistently about four times that of tests on the optimal path found by Concorde. As before, the Hilbert sort path has the lowest relative power, but this gap shrinks as the number of groups increases.
Table 4: Power of tests on each Hamiltonian path relative to the optimal path as the covariate dimension increases.

| Path | 5 | 10 | 25 | 50 | 100 | 250 |
|---|---|---|---|---|---|---|
| Optimal | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Greedy | 2.27 | 3.41 | 3.99 | 4.36 | 4.71 | 4.35 |
| Hilbert sort | 1.02 | 1.03 | 0.84 | 0.74 | 0.65 | 0.51 |
Table 5: Power of tests on each Hamiltonian path relative to the optimal path as the number of treatment groups increases.

| Path | 3 | 4 | 5 | 6 |
|---|---|---|---|---|
| Optimal | 1.00 | 1.00 | 1.00 | 1.00 |
| Greedy | 4.73 | 4.64 | 4.35 | 4.09 |
| Hilbert sort | 0.50 | 0.56 | 0.57 | 0.60 |
5.2.3 The k-nearest neighbor test outperforms others
We now compare the two other graph-based tests—the multisample crossmatch and the kNN test—to the Kruskal-Wallis ranks test and the runs test run on greedy Hamiltonian paths. All power calculations are presented relative to the power of the kNN test.
First, we examine the effect of increasing the covariate dimension while holding the number of treatment groups fixed at five. We see in Table 6 that the kNN test outperforms the other tests on average. There are a few cases where some of the other tests perform better—for example, under location changes with a small covariate dimension, the crossmatch and runs tests can sometimes perform better. However, such effects are not observed as the dimension continues to increase, nor in settings where the covariance matrix varies by group.
Next, we examine the effect of changing the number of treatment groups while holding the covariate dimension fixed. In Table 7, we again see that the kNN test has the highest relative power compared to any other test on average. For smaller numbers of groups, the runs test performs as well as the kNN test in some settings, but this effect diminishes as the number of groups continues to increase.
Table 6: Power of each test relative to the kNN test as the covariate dimension increases.

| Test | 5 | 10 | 25 | 50 | 100 | 250 |
|---|---|---|---|---|---|---|
| Rank | 0.61 | 0.64 | 0.68 | 0.69 | 0.72 | 0.76 |
| Runs | 0.34 | 0.39 | 0.53 | 0.68 | 0.74 | 0.87 |
| Crossmatch | 0.37 | 0.39 | 0.41 | 0.52 | 0.57 | 0.71 |
| Nearest Neighbor | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
Table 7: Power of each test relative to the kNN test as the number of treatment groups increases.

| Test | 3 | 4 | 5 | 6 |
|---|---|---|---|---|
| Rank | 0.75 | 0.72 | 0.73 | 0.77 |
| Runs | 0.72 | 0.76 | 0.81 | 0.84 |
| Crossmatch | 0.49 | 0.55 | 0.63 | 0.70 |
| Nearest Neighbor | 1.00 | 1.00 | 1.00 | 1.00 |
6 Case study
Our goal is to use matching to make the distributions in the respective treatment groups more similar to the overall sample distribution. We then use our graph-based tests to see whether univariate tests miss any distributional differences. To do so, we use data from the 1987 National Medical Expenditure Survey as collected in the TriMatch package in R [8].
6.1 Data
The original survey included detailed information about participants' quantity and duration of smoking as pack-years, with one pack-year defined as smoking one pack of cigarettes per day for one year. In our data, we include the following covariates: the individual's age at the last survey, their gender, whether the individual identified as white, whether the individual attended college, whether the individual had ever been married, whether the individual fell into the low income category, and whether the individual always wore a seat belt. For the binary variables, we add a very small jitter to make matching easier for the software. Finally, we construct a composite treatment variable based on smoking history: 1 = never smoker, 2 = former smoker with less than 15 pack-years of smoking, 3 = former smoker with more than 15 pack-years of smoking, 4 = current smoker with less than 15 pack-years of smoking, and 5 = current smoker with more than 15 pack-years of smoking. The number of individuals in each treatment group is presented in Table 8.
Table 8: Number of individuals in each treatment group at baseline and after each form of cardinality matching.

| | Treatment 1 | Treatment 2 | Treatment 3 | Treatment 4 | Treatment 5 |
|---|---|---|---|---|---|
| Baseline | 9804 | 2073 | 2003 | 2326 | 3146 |
| Card. matching (A) | 8456 | 1512 | 502 | 696 | 1365 |
| Card. matching (B) | 8065 | 1381 | 458 | 637 | 1247 |
| Card. matching (C) | 8018 | 1367 | 453 | 629 | 1234 |
| Card. matching (D) | 6460 | 1070 | 0 | 0 | 755 |
6.2 Methods
We consider four forms of cardinality matching to achieve distributional balance. The first three seek to match individuals in each treatment group so that the group mean of each covariate is within 0.05, 0.01, or 0.005 standard deviations of the sample mean; we denote these as Card. Matching (A), (B), and (C), respectively. The fourth and final form targets the first through fifth moments of the covariate distributions; we denote this form as Card. Matching (D). For all matching tasks, we set the maximum solver time to one hour.
To diagnose failures of distributional balance, we use two types of diagnostics. First, we examine single-covariate measures of balance, such as the standardized difference of means and the corresponding tests for differences in means; we present these for the first moments in Table 9. Second, we use the graph-based tests described in this work: the kNN test, the runs and ranks tests on greedy Hamiltonian paths, and the crossmatch test.
6.3 Results
We now look at how the various balance assessment methods diagnose a lack of distributional overlap. Table 9 displays the average of the absolute standardized differences in means between the treatment groups and the sample means for all covariate first moments; in column 2, we see that there are some covariates with substantial differences between the sample mean and the group means, on average. After cardinality matching on the first moments with a 0.05 standard deviation caliper (columns 4 and 5 of Table 9), the average absolute standardized difference in means is within 0.1 standard deviations, but all of the $p$-values are below 0.05. For a caliper of 0.01 (columns 6 and 7), the $p$-values are now above 0.05, and the same holds for a caliper of 0.005 standard deviations (columns 8 and 9). After cardinality matching on the first through fifth moments (columns 10 and 11 of Table 9), the picture looks similar, though the $p$-values are slightly lower. From these metrics, we would conclude that sufficient balance has been achieved in most settings.
Table 9: Absolute standardized differences in means and $p$-values for tests of differences in group means, at baseline and after each form of cardinality matching.

| Variable | Std. diff (Baseline) | p (Baseline) | Std. diff (A) | p (A) | Std. diff (B) | p (B) | Std. diff (C) | p (C) | Std. diff (D) | p (D) |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0.26 | | 0.03 | 0.02 | | 0.99 | | 1.00 | | 0.94 |
| Any college | 0.05 | | | 0.04 | | 0.95 | | 1.00 | | 1.00 |
| Ever married | 0.21 | | | 0.05 | | 0.92 | | 0.99 | | 0.95 |
| Low Income | 0.08 | | 0.03 | 0.04 | | 0.98 | | 1.00 | | 0.99 |
| Male | 0.19 | | | 0.05 | | 0.94 | | 1.00 | | 0.99 |
| Seatbelt always | 0.11 | | | 0.05 | | 0.97 | | 1.00 | | 0.98 |
| White | 0.09 | | | 0.05 | | 0.97 | | 1.00 | | 0.98 |
However, all is not well with the joint distributions. While most of the univariate tests looked fine after adjustment, almost none of the multivariate tests in Table 10 indicate that distributional balance has been achieved. For example, the kNN test (column 2) has $p$-values below numeric tolerance at baseline and after cardinality matching on the first moments (Card. Matching A–C); only matching on the first through fifth moments achieves a non-significant $p$-value (Card. Matching D). A similar pattern holds for both the runs test and the crossmatch test of [24]. The ranks test on the Hamiltonian path also achieves a non-significant $p$-value under Card. Matching (C).
Table 10: $p$-values of the graph-based tests at baseline and after each form of cardinality matching.

| | k-Nearest Neighbor | Runs | Ranks | Crossmatch |
|---|---|---|---|---|
| Baseline | | | | |
| Card. matching (A) | | | | |
| Card. matching (B) | | | | |
| Card. matching (C) | | | 0.07 | |
| Card. matching (D) | 0.17 | 0.13 | 0.47 | 0.51 |
7 Summary and remarks
In this paper, we proposed multisample extensions of the Hamiltonian path tests of [7] and the nearest neighbor test of [9], and we evaluated these tests in a variety of settings. We considered various graph structures, algorithms to form these structures, and tests of distributional balance.
Findings
In our simulation studies, we found that the extremum test performs best for the runs statistic, while for the statistics on nearest neighbor graphs, the Wald tests perform best.
For both Hamiltonian paths and nearest neighbor graphs, we found in our simulation studies that approximate solutions—that is, greedy Hamiltonian paths and greedy nearest neighbor graphs—perform better than their optimal counterparts. We suspect that the approximate solutions do better locally and may have higher power in certain regions of the data to detect differences in distribution.
Overall, we found the kNN test on the approximate nearest neighbor graph to have the highest power compared to the various alternatives. The Hamiltonian path tests performed slightly worse than the kNN test in most settings, and the crossmatch test fared the worst. Of some concern, both the crossmatch test and the ranks test on a greedy Hamiltonian path missed the model misspecification in the motivating example, while the results in the other simulation studies of Section 5.2.3 are more mixed. Based on these results, we recommend the kNN test for evaluating multivariate balance in settings with multi-valued treatments.
Our case study revealed how the univariate metrics typically used to assess balance can fail to diagnose inadequate overlap in the joint distribution. Of some concern, we found that an exceedingly high number of moments was required to obtain non-significant tests of joint distributional overlap.
Further research
One area that requires further exploration is that some settings may not require the strict distributional balance we are able to measure with the kNN test. Ultimately, balancing the conditional mean functions between treatment groups is sufficient, since this removes the bias in estimating treatment effects. For example, if the conditional mean function of the outcome is linear in the covariates, then ensuring balance on the first moments of the covariates would ensure unbiased estimation of the treatment effects. However, since this conditional mean function is not known in practice, determining when sufficient balance has been achieved on the correct functions of the covariates remains a challenge.
Finally, there may be room to invert the kNN test as a method for achieving distributional balance. One could construct an optimization problem that seeks to minimize the test statistic of the kNN test in order to achieve distributional balance. This would be useful in observational settings where balance cannot be achieved through randomization.
Acknowledgments
The author would like to thank Claire Chaumont, Gang Liu, Aarón Sonabend, Lorenzo Trippa, and José Zubizarreta for helpful comments and feedback on an earlier version of this manuscript. This research was funded by generous support from NIH grant 5T32CA009337-40, the Department of Biostatistics at the Harvard T.H. Chan School of Public Health, and the David Geffen Scholarship from the David Geffen School of Medicine at UCLA.
References
- [1] David Applegate, Robert Bixby, Vasek Chvátal and William Cook “Implementing the Dantzig-Fulkerson-Johnson algorithm for large traveling salesman problems” In Mathematical Programming 97.1, 2003, pp. 91–153 DOI: 10.1007/s10107-003-0440-4
- [2] Magdalena Bennett, Juan Pablo Vielma and José R. Zubizarreta “Building Representative Matched Samples With Multi-Valued Treatments in Large Observational Studies” In Journal of Computational and Graphical Statistics 29.4 Taylor & Francis, 2020, pp. 744–757 DOI: 10.1080/10618600.2020.1753532
- [3] Jon Louis Bentley “Multidimensional binary search trees used for associative searching” In Communications of the ACM 18.9, 1975, pp. 509–517 DOI: 10.1145/361002.361007
- [4] Alina Beygelzimer et al. “FNN: Fast Nearest Neighbor Search Algorithms and Applications”, 2019 URL: https://cran.r-project.org/package=FNN
- [5] Bhaswar Bhattacharya “A general asymptotic framework for distribution-free graph-based two-sample tests” In Journal of the Royal Statistical Society. Series B: Statistical Methodology 81.3, 2019, pp. 575–602 DOI: 10.1111/rssb.12319
- [6] Z. W. Birnbaum and R. A. Hall “Small Sample Distributions for Multi-Sample Statistics of the Smirnov Type” In The Annals of Mathematical Statistics 31.3, 1960, pp. 710–720 DOI: 10.1214/aoms/1177705797
- [7] Munmun Biswas, Minerva Mukhopadhyay and Anil K. Ghosh “A distribution-free two-sample run test applicable to high-dimensional data” In Biometrika 101.4, 2014, pp. 913–926 DOI: 10.1093/biomet/asu045
- [8] Jason Breyer “TriMatch: Propensity Score Matching of Non-Binary Treatments”, 2017 URL: https://CRAN.R-project.org/package=TriMatch
- [9] Hao Chen and Dylan S Small “New multivariate tests for assessing covariate balance in matched observational studies” In Biometrics, 2020 DOI: 10.1111/biom.13395
- [10] Shojaeddin Chenouri and Christopher G. Small “A nonparametric multivariate multisample test based on data depth” In Electronic Journal of Statistics 6, 2012, pp. 760–782 DOI: 10.1214/12-EJS692
- [11] G. Dantzig, R. Fulkerson and S. Johnson “Solution of a Large-Scale Traveling-Salesman Problem” In Journal of the Operations Research Society of America 2.4, 1954, pp. 393–410 DOI: 10.1287/opre.2.4.393
- [12] Alexis Diamond and Jasjeet S. Sekhon “Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies” In Review of Economics and Statistics 95.3, 2013, pp. 932–945 DOI: 10.1162/REST_a_00318
- [13] Eric A. Dunipace “approxOT: approximate optimal transport”, 2022 URL: https://CRAN.R-project.org/package=approxOT
- [14] Alan Genz et al. “mvtnorm: Multivariate Normal and t Distributions”, 2020 URL: https://cran.r-project.org/package=mvtnorm
- [15] Michael Hahsler and Kurt Hornik “TSP – Infrastructure for the traveling salesperson problem” In Journal of Statistical Software 23.2, 2007, pp. 1–21 DOI: 10.18637/jss.v023.i02
- [16] Norbert Henze “A Multivariate Two-Sample Test Based on the Number of Nearest Neighbor Type Coincidences” In The Annals of Statistics 16.2, 1988, pp. 1403–1433 DOI: 10.1214/aos/1176350835
- [17] David Hilbert “Ueber die stetige Abbildung einer Line auf ein Flächenstück” In Mathematische Annalen 38.3, 1891, pp. 459–460 DOI: 10.1007/BF01199431
- [18] Richard M. Karp “Reducibility among combinatorial problems” In Complexity of Computer Computations New York: Plenum, 1972, pp. 219–241 DOI: 10.1007/978-3-540-68279-0_8
- [19] A. N. Kolmogorov “Sulla determinazione empirica di una legge di distribuzione” In G. Ist. Ital. Attuari 4, 1933, pp. 83–91 URL: https://ci.nii.ac.jp/naid/10030673552/en/
- [20] Joseph B. Kruskal “On the Shortest Spanning Subtree of a Graph and the Traveling Salesman Problem” In Proceedings of the American Mathematical Society 7.1, 1956, pp. 48 DOI: 10.2307/2033241
- [21] William H. Kruskal and W. Allen Wallis “Use of Ranks in One-Criterion Variance Analysis” In Journal of the American Statistical Association 47.260, 1952, pp. 583–621 DOI: 10.1080/01621459.1952.10483441
- [22] Marco Marozzi “The multisample Cucconi test” In Statistical Methods and Applications 23.2, 2014, pp. 209–227 DOI: 10.1007/s10260-014-0255-x
- [23] A. M. Mood “The distribution theory of runs” In The Annals of Mathematical Statistics 11.4, 1940, pp. 367–392
- [24] Somabha Mukherjee, Divyansh Agarwal, Nancy R. Zhang and Bhaswar B. Bhattacharya “Distribution-Free Multisample Tests Based on Optimal Matchings With Applications to Single Cell Genomics” In Journal of the American Statistical Association 0.0 Taylor & Francis, 2020, pp. 1–31 DOI: 10.1080/01621459.2020.1791131
- [25] Madan Lal Puri and Pranab Kumar Sen “On a class of multivariate multisample rank-order tests” In Sankhyā: The Indian Journal of Statistics, Series A JSTOR, 1966, pp. 353–376
- [26] Paul R. Rosenbaum “An exact distribution-free test comparing two multivariate distributions based on adjacency” In Journal of the Royal Statistical Society. Series B: Statistical Methodology 67.4, 2005, pp. 515–530 DOI: 10.1111/j.1467-9868.2005.00513.x
- [27] Paul R. Rosenbaum and Donald B. Rubin “The Central Role of the Propensity Score in Observational Studies for Causal Effects” In Biometrika 70.1, 1983, pp. 41–55
- [28] Paul R. Rosenbaum and Donald B. Rubin “Constructing a control group using multivariate matched sampling methods that incorporate the propensity score” In American Statistician 39.1, 1985, pp. 33–38 DOI: 10.1080/00031305.1985.10479383
- [29] N. Smirnov “Table for Estimating the Goodness of Fit of Empirical Distributions” In The Annals of Mathematical Statistics 19.2, 1948, pp. 279–281 DOI: 10.1214/aoms/1177730256
- [30] Elizabeth A. Stuart “Matching Methods for Causal Inference: A Review and a Look Forward” In Statistical Science 25.1, 2010, pp. 1–21 DOI: 10.1214/09-STS313
- [31] The CGAL Project “CGAL User and Reference Manual” CGAL Editorial Board, 2021 URL: https://doc.cgal.org/5.2.1/Manual/packages.html
- [32] A. Wald and J. Wolfowitz “On a Test Whether Two Samples are from the Same Population” In The Annals of Mathematical Statistics 11.2, 1940, pp. 147–162 DOI: 10.1214/aoms/1177731909
- [33] José R. Zubizarreta “Using mixed integer programming for matching in an observational study of kidney failure after surgery” In Journal of the American Statistical Association 107.500, 2012, pp. 1360–1371 DOI: 10.1080/01621459.2012.703874
- [34] José R Zubizarreta, Cinar Kilcioglu and Juan P Vielma “designmatch: Matched Samples that are Balanced and Representative by Design”, 2018 URL: https://cran.r-project.org/package=designmatch
- [35] José R. Zubizarreta, Ricardo D. Paredes and Paul R. Rosenbaum “Matching for balance, pairing for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit high schools in Chile” In Annals of Applied Statistics 8.1, 2014, pp. 204–231 DOI: 10.1214/13-AOAS713
- [36] María de los Angeles Resa and José R. Zubizarreta “Evaluation of subset matching methods and forms of covariate balance” In Statistics in Medicine 35.27, 2016, pp. 4961–4979 DOI: 10.1002/sim.7036