
Graph-Based Tests for Multivariate Covariate Balance Under Multi-Valued Treatments

Eric A. Dunipace
David Geffen School of Medicine at UCLA, Los Angeles, CA
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
[email protected]
Abstract

We propose the use of non-parametric, graph-based tests to assess the distributional balance of covariates in observational studies with multi-valued treatments. Our tests utilize graph structures ranging from Hamiltonian paths that connect all of the data to nearest neighbor graphs that maximally separate the data into pairs. We consider algorithms that form minimal-distance graphs, such as optimal Hamiltonian paths or non-bipartite matching, as well as approximate alternatives, such as greedy Hamiltonian paths or greedy nearest neighbor graphs. Extensive simulation studies demonstrate that the proposed tests are able to detect misspecification of matching models that other methods miss. Contrary to intuition, we also find that tests run on well-formed approximate graphs do better in most cases than tests run on optimally formed graphs, and that a properly formed test on an approximate nearest neighbor graph performs best, on average. In a multi-valued treatment case study using smoking data, these graph-based tests also detect imbalances otherwise missed by common matching diagnostics. We provide a new R package, graphTest, to implement these methods and reproduce our results.

1 Introduction

In studies of causal effects, balancing the covariate distributions in each treatment group is essential for unbiased estimates. For example, in a study of the effects of smoking on medical expenditures, multiple smoking behaviors are possible: there may be never smokers, former smokers, and current smokers. To estimate the relative effects of the different exposures, we would like the groups to be balanced on various demographic characteristics such as age, race, and income. Distributional balance is expected by design in randomized controlled trials (RCTs) for both measured and unmeasured confounders, but conducting an RCT is often not possible because of cost or ethical restrictions, as would be the case in such a study of smoking. Instead, researchers are left to rely on observational studies to answer their causal questions.

If there are few covariates and each takes only a small number of discrete levels, obtaining exact matches between treatment groups may be possible, thereby achieving exact distributional balance. However, such simple settings are rarely found in practice. The problem is further compounded in the multi-valued treatment setting, where there are more groups to match. Achieving distributional balance in such settings then relies on adjustments via methods such as propensity score matching [27] or cardinality matching [35]. Intuitively, these methods work by matching units that are close in terms of some covariate distance, which makes the distributions more similar between the treatment groups. But how do we know when the distributions between treatment groups are similar enough to proceed?

To answer this question, investigators typically consider univariate assessments of covariate balance, such as comparing the mean of each covariate by treatment group before and after matching along with a corresponding test for the difference of these means [30]. Other authors run a Kolmogorov-Smirnov test [19, 29] for each covariate [33, 12]. However, these methods will not generally diagnose differences in the joint covariate distribution between treatment groups.

Instead, we seek a way to identify differences in the joint distributions between the multiple groups after matching is completed—in other words, whether the observational study fails to approximate a randomized experiment in terms of observed covariate balance. Several common methods of assessing distributional balance such as likelihood ratio tests or examining the distribution of the propensity scores [28] require having correct probability models for the data. Unfortunately, these data models are rarely known in practice.

Fortunately, there are several non-parametric tests applicable to the multi-valued treatment setting, and the ones we consider involve forming a graph-like structure on the data. Some work ranks the data in some fashion such that the observations are ordered in a one-dimensional space and hence form a graph [5]. [25], for example, consider a ranks-based test that uses the data ranks for each individual covariate in a Kruskal-Wallis test [21], while [10] use data depth—measuring how far each observation is from the overall sample mean in terms of Mahalanobis distance—to calculate another ranks-based test. [22] also uses a rank-based test to simultaneously examine location and scale changes. Rather than ranks, [24] form a non-bipartite graph and extend the crossmatch test of [26] to multi-valued treatments.

In this vein, we consider several graph structures on the data, ranging from Hamiltonian paths that connect all of the data into a single graph to k-nearest neighbor (kNN) graphs. Hamiltonian paths are graphs that connect all of the observations in a dataset under the constraint that each observation is visited only once. If we imagine the path as a highway, we will see all of the observations along the highway, but there is no way to revisit old points without turning around—in graph terminology, there are no cycles. There are many such paths through the data under this broad definition, but we will define the shortest possible path as the optimal Hamiltonian path. Under this graph, if the distributions vary between treatment groups, we expect units from the same distribution to be closer to each other along the path.

For nearest neighbor graphs, the goal is not to find a path but to find matches—i.e., neighbors—that are close to each other in terms of a distance metric. This can be done in a way that allows for overlap between matched sets or in a way that forces those matched sets to be disjoint. In both types of nearest neighbor graphs, if the distributions vary between treatment groups, we expect observations from the same treatment group to be connected more frequently to each other than to observations from different treatment groups.

Several methods use these graphs for two-sample testing. For Hamiltonian paths, [7] use the Wald-Wolfowitz runs test [32] along the path to test for distributional similarity in the two-sample setting. For nearest neighbor graphs, the crossmatch test of [26] creates non-overlapping kNN graphs for $k=1$ and counts the number of matches that occur across samples. Recent work by [9] develops a similar test on greedy kNN graphs that instead counts the number of times nearest neighbors come from the same sample.

Unfortunately, there has not been much work extending these particular graph-based tests to the multisample setting. As previously mentioned, [24] extend Rosenbaum’s crossmatch test to the multisample setting, but no analogous extensions exist for Hamiltonian paths or greedy nearest neighbor graphs. In addition, the empirical performance of these methods in practice is unclear, and to our knowledge, there is no readily available computational implementation of these methods.

To bridge these gaps with matched samples, we offer four contributions. First, we develop similar extensions for tests along the Hamiltonian path. Second, we provide a multisample extension to the nearest neighbor test of [9]. Third, we conduct an extensive simulation study of these tests in a variety of settings, finding that the kNN test outperforms others. Fourth, we implement all of these tests into the new R package graphTest.

Next, we briefly describe our setting in Section 2, give a motivating example in Section 3, and describe how we construct and perform tests on the graphs in Section 4. Then we provide results from simulations and a case study in Sections 5 and 6, respectively. Finally, Section 7 concludes.

2 Setting

Suppose we have covariate data from $G$ treatment groups, $\{X_{1}^{(1)},\ldots,X_{n_{1}}^{(1)}\},\ldots,\{X_{1}^{(G)},\ldots,X_{n_{G}}^{(G)}\}$, with $n_{1}+n_{2}+\cdots+n_{G}=N$. These groups could arise from sampling from different populations or because observations from the same population have chosen different treatments. Our goal is to evaluate whether, after matching or some other adjustment, the distribution in each group is the same—i.e., $H_{0}:F_{1}=F_{2}=\cdots=F_{G}$.

Constructing the graph-based tests studied in this work requires a notion of closeness between units and a way to denote which units are connected. We define $\mathbf{D}$ as the pairwise distance matrix between all points in the sample, with entries $\mathbf{D}_{ij}=\mathbf{d}(X_{i},X_{j})$ for all $i\neq j$, $i,j\in\{1,\ldots,N\}$, for some distance $\mathbf{d}:\mathbb{R}^{d}\times\mathbb{R}^{d}\mapsto\mathbb{R}_{+}$. Further, let $\mathbf{M}$ be a matrix whose entries indicate whether unit $i$ is connected to unit $j$: if unit $i$ is connected to unit $j$, then $\mathbf{M}_{ij}=1$; otherwise, $\mathbf{M}_{ij}=0$.
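For illustration, a minimal R sketch of these objects follows; the Euclidean metric and the single-nearest-neighbor connection rule are arbitrary choices for the example, not requirements of the methods.

```r
# Minimal sketch: build D and one possible adjacency matrix M for a toy sample.
set.seed(1)
X <- matrix(rnorm(20 * 2), ncol = 2)               # N = 20 units, d = 2 covariates
D <- as.matrix(dist(X))                            # D[i, j] = d(X_i, X_j); Euclidean by default
M <- matrix(0L, nrow(X), nrow(X))                  # M[i, j] = 1 if unit i is connected to unit j
nn <- apply(D + diag(Inf, nrow(D)), 1, which.min)  # index of each unit's nearest neighbor
M[cbind(seq_len(nrow(X)), nn)] <- 1L               # connect each unit to its nearest neighbor
```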

3 Motivating example

To better understand this setting and the challenges it presents, we now consider a small example where individuals can choose one of three treatments. Our setting has only two confounders, $X_{1},X_{2}\overset{\text{iid}}{\sim}\mathcal{N}(0,1)$, and we then generate a treatment indicator $Z\sim\operatorname{Categorical}(\mathbf{p})$, where $\mathbf{p}=(p_{1},p_{2},p_{3})^{\top}$, $p_{i}=\frac{\exp(\eta_{i})}{\sum_{j=1}^{3}\exp(\eta_{j})}$, and $\eta_{1}=0.1X_{1}-0.1X_{2}-X_{1}X_{2}$, $\eta_{2}=-0.2X_{1}+0.2X_{2}+0.5X_{1}^{2}$, $\eta_{3}=-0.1X_{1}+0.2X_{2}-2X_{1}X_{2}$. We set the sample size to $N=150$.
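A short R sketch of this data-generating process is below; the random seed and the sampling helper are our own illustrative choices rather than the exact simulation code.

```r
# Sketch of the motivating example's data-generating process.
set.seed(2024)
N  <- 150
X1 <- rnorm(N); X2 <- rnorm(N)
eta <- cbind(0.1 * X1 - 0.1 * X2 - X1 * X2,        # eta_1
             -0.2 * X1 + 0.2 * X2 + 0.5 * X1^2,    # eta_2
             -0.1 * X1 + 0.2 * X2 - 2 * X1 * X2)   # eta_3
p <- exp(eta) / rowSums(exp(eta))                  # softmax gives the category probabilities
Z <- apply(p, 1, function(pr) sample(1:3, size = 1, prob = pr))
```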

Typically, researchers will not know the true treatment-assignment mechanism, and we emulate this situation in our setup. We use cardinality matching [35] to balance first moments between each group and the overall sample mean but ignore the covariate functions $X_{1}^{2}$ and $X_{1}X_{2}$ necessary to capture the true propensity score model. Matching towards the overall sample means targets the average treatment effect for the whole sample [36, 2]. We use the R package designmatch [34] and set the parameters of the matching such that the group means should differ from the sample mean by no more than 0.1 standard deviations. The results of such a matching for one draw of the data are shown in Figure 1.

The covariate balance before and after matching for this draw, shown in Table 1, would indicate that the matching was a success. Further, the absolute standardized mean difference, defined here as $\frac{1}{3}\sum_{g=1}^{3}\frac{|\overline{X}_{g}-\overline{X}|}{\sqrt{\widehat{\operatorname{Var}}(X)}}$, indicates that all groups are within 0.1 standard deviations of the sample mean on average, and $F$ tests for differences in means fail to reject the null hypothesis that the group means are the same.
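For reference, the absolute standardized mean difference defined above can be computed with a few lines of R; the function name is illustrative.

```r
# Sketch: average absolute standardized difference of group means from the sample mean.
abs_std_diff <- function(x, z) {
  mean(abs(tapply(x, z, mean) - mean(x))) / sd(x)
}
# e.g., abs_std_diff(X1, Z) or abs_std_diff(X1 * X2, Z) on the matched sample
```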

Figure 1: One draw of the covariates before and after matching with cardinality matching targeting the means of each group to the overall sample mean. Open circles in the second plot denote observations that were dropped from the sample because they were unmatched. The lines denote the regression of $X_{2}$ on $X_{1}$, which will be equal to the scaled correlation between the variables for centered covariates. Tick marks are the means in each group for the respective covariate. We can see the means are balanced but some imbalance remains in the correlations between the variables. This figure appears in color in the electronic version of this article.

Unfortunately, the matching model misses two important features of treatment selection: $X_{1}^{2}$ and $X_{1}X_{2}$. Running this experiment 1000 times, we see that while $X_{1}$ and $X_{2}$ are well balanced on average in terms of standardized differences of means, the other covariate functions are not (Figure 2(a)). Similarly, univariate tests of covariate means are insufficient to diagnose the statistically significant deviations in the other covariate functions (Figure 2(b)). Clearly, we need an omnibus test to diagnose failure without having to check every possible covariate moment.

                           Before matching                                After matching
Cov.      Sample    Group 1  Group 2  Group 3  Std. Diff  p      Group 1  Group 2  Group 3  Std. Diff  p
$X_{1}$   0.02      0.15     0.06     -0.11    0.10       0.49   0.09     0.06     -0.04    0.05       0.85
$X_{2}$   -0.12     -0.67    -0.05    0.13     0.27       0.00   -0.21    -0.05    -0.03    0.08       0.78
Table 1: Mean balance before and after mixed integer programming matching for one sample. Data were matched to obtain group means within 0.1 standard deviations of the overall sample mean. p-values (p) indicate good balance on the covariate means after matching.
(a) Absolute average standardized difference in means.
(b) $p$-values.
Figure 2: Univariate evaluations before and after matching evaluated over 1000 experiments. Dotted lines denote 0.1 standard deviations (a) and a $p$-value of 0.05 (b). All tests are $F$ tests of the null hypothesis that the group means do not differ for the listed covariate functions. This figure appears in color in the electronic version of this article.

With this goal in mind, we now examine some graph-based tests that may offer a solution.

4 Structures, algorithms, and tests

This section considers two graph structures: Hamiltonian paths and nearest neighbor graphs. For each, we describe the graph structures, the algorithms to construct them, and the tests that we will run on these structures. Throughout, we will use the motivating example of the previous section for exposition in our figures.

4.1 Hamiltonian paths

As stated previously, Hamiltonian paths are graphs through the data where each point is visited once and the starting and ending points are not the same. A Hamiltonian cycle is similar except that the starting and ending points are the same. Finding an optimal Hamiltonian cycle is known in the literature as the Traveling Salesman Problem, or TSP [11].

4.1.1 Algorithms

There are a variety of ways to construct Hamiltonian paths. We present three algorithms to do so in order of increasing speed and decreasing optimality (Figure 3).

(a) Paths from different algorithms
(b) Distances of paths from different algorithms
Figure 3: Hamiltonian paths constructed by the Concorde solver (optimal), a greedy path constructed by Kruskal’s algorithm, and a path constructed via a Hilbert sort. Data are the same as the matched sample from Figure 1. Distances show the increasing length of the Hamiltonian path for the various methods. Points in Figure (b) have been jittered to better visualize their location along the path. This figure appears in color in the electronic version of this article.

Our optimal algorithm is the Concorde solver of [1]. Unfortunately, Concorde requires integer distances, which forces us to transform $\mathbf{D}$ and risk either a loss of precision or, if we scale aggressively to preserve accuracy, numeric overflow. Despite these concerns, the Concorde solver still finds optimal paths on the moderately sized problems considered in this paper and does so much faster than a brute-force integer programming solution.

The next solution we consider comes from an application of Kruskal’s algorithm [20]. Using the pairwise distance matrix $\mathbf{D}$, we add edges in a greedy fashion, starting with the two observations with the shortest pairwise distance and then sequentially adding the next closest unit to the end of the path until we construct a full Hamiltonian path. While not globally optimal, this path will be closer to optimal locally for some subsets of observations.
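A minimal sketch of this greedy construction is below; the actual implementation used later in the paper relies on the TSP package (Section 5.1), so the function here is only illustrative.

```r
# Illustrative greedy path construction; D is the pairwise distance matrix from Section 2.
greedy_path <- function(D) {
  diag(D) <- Inf
  N <- nrow(D)
  start <- which(D == min(D), arr.ind = TRUE)[1, ]   # the closest pair seeds the path
  path <- as.integer(start)
  while (length(path) < N) {
    free <- setdiff(seq_len(N), path)
    last <- path[length(path)]
    path <- c(path, free[which.min(D[last, free])])  # extend from the current endpoint
  }
  path
}
```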

The final algorithm is a sorting algorithm based on the Hilbert space-filling curve [17]. The idea is to order the data points along a Hilbert space-filling curve, which induces a Hamiltonian path through the data; for this reason, the method is called Hilbert sorting. Since the constructed path is deterministic, Hilbert sorting is incredibly fast—successfully sorting 10 million observations in about 20 seconds on a circa 2018 laptop—but the path is nowhere close to optimal globally or locally. To perform this sorting method, we use the implementation freely available from [31] with an R interface relying on the approxOT package [13].

4.1.2 Test statistics

After path construction, the Hamiltonian path forms a one-dimensional Radon measure on the manifold created by the corresponding graph [7]. Due to this one-dimensional nature, the Hamiltonian path will admit a variety of one-dimensional statistics and corresponding tests. In this work we consider two such quantities: the ranks of observations along the Hamiltonian path and the runs of observations from the same treatment group [19, 29, 6, 23].

For these quantities, we consider two general forms of tests. The first form is an extremum test—either a minimum or maximum—that looks at the probability of observing the most extreme group under $H_{0}$: if $\mathbf{S}=(S_{1},S_{2},\ldots,S_{G})^{\top}$ is the vector of statistics, then $T_{l}=\min(\mathbf{S})$ or $T_{u}=\max(\mathbf{S})$ is the corresponding test statistic. The second form is a Wald test—that is, given $\Sigma=\operatorname{Var}(\mathbf{S})$ and $\mu=\mathbb{E}(\mathbf{S})$, then $T_{w}=(\mathbf{S}-\mu)^{\top}\Sigma^{-1}(\mathbf{S}-\mu)$.

The probabilities of these test statistics will naturally depend on the distributions of the statistics $\mathbf{S}$, which are often asymptotically Gaussian. Under a Gaussian distribution, the extremum probabilities can easily be calculated with the mvtnorm package in R [14]. Meanwhile, $T_{w}\sim\chi_{\nu}^{2}$, where $\nu$ is the degrees of freedom—typically $G$ or $G-1$—with probabilities easily calculated in any statistical software.
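As a sketch, both forms can be computed as follows, assuming $\mathbf{S}$ is asymptotically Gaussian with permutation mean $\mu$ and covariance $\Sigma$; the function names and the choice of a maximum (rather than minimum) extremum are illustrative.

```r
# Sketch of the two generic test forms under a Gaussian approximation.
library(mvtnorm)
wald_p <- function(S, mu, Sigma, df = length(S)) {
  Tw <- as.numeric(t(S - mu) %*% solve(Sigma) %*% (S - mu))
  pchisq(Tw, df = df, lower.tail = FALSE)            # df is typically G or G - 1
}
max_p <- function(S, mu, Sigma) {
  # P(max_g S_g >= observed maximum) under the multivariate normal approximation
  1 - pmvnorm(upper = rep(max(S), length(S)), mean = mu, sigma = Sigma)[1]
}
```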

Finally, we note that for the ranks-based tests, we will only consider the Wald test since the extremum test is not easily obtained. For the runs-based test, we examine both the Wald and extremum tests.

4.2 Nearest neighbor graphs

Similar to Hamiltonian paths, nearest neighbor graphs connect observations that are close to each other in terms of their covariate distance, $\mathbf{D}$. Nearest neighbor graphs can also be constructed in either an optimal or a greedy way, as can be seen in Figure 4.

Figure 4: Nearest neighbor graphs constructed by an optimal non-bipartite matching (left) or via a greedy nearest neighbor search with 12 nearest neighbors (middle and right). Data come from the matched sample of Figure 1. For clarity, the greedy nearest neighbor panels show the graphs for only two separate points. This figure appears in color in the electronic version of this article.

4.2.1 Algorithms

Optimal nearest neighbor graphs, also known as non-bipartite matching (NBM), minimize the total distance between matches subject to the constraint that each unit is matched to exactly one other unit; as a result, some units may not be matched to their closest possible match. Unlike the TSP, this problem is quick to solve for pairwise matching ($k=1$), but for larger numbers of matches the problem is NP-hard [18].

In contrast, the greedy formulation is quick to solve for any number of matches $k\in\mathbb{N}$, $k>0$. One simply takes the pairwise distance matrix $\mathbf{D}$ and finds the $k$ nearest matches to the observation of interest. This is done with replacement so that matches are allowed to be shared across observations, unlike in NBM. More formally, the kNN problem is

$$\begin{aligned} \min_{\mathbf{M}\in\{0,1\}}\quad & \sum_{i=1}^{N}\sum_{j\neq i}\mathbf{D}_{ij}\mathbf{M}_{ij} \\ \text{such that}\quad & \sum_{j=1,\,j\neq i}^{N}\mathbf{M}_{ij}=k, \quad \forall i\in\{1,\ldots,N\}, \\ & \mathbf{M}_{ii}=0, \quad \forall i\in\{1,\ldots,N\}. \end{aligned}$$

Fortunately, there are algorithms that efficiently compute the nearest neighbor graph without having to construct the full distance matrix. Several of these, such as the k-d tree [3], are implemented in the R package FNN [4].
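For instance, the greedy kNN adjacency matrix $\mathbf{M}$ can be built with the FNN package as in the sketch below; the wrapper function is ours and not part of any package.

```r
# Sketch: build the greedy kNN adjacency matrix M from a covariate matrix X.
library(FNN)
build_knn_graph <- function(X, k) {
  nn <- get.knn(X, k = k)$nn.index      # N x k matrix of nearest-neighbor indices
  N  <- nrow(X)
  M  <- matrix(0L, N, N)
  for (i in seq_len(N)) M[i, nn[i, ]] <- 1L
  M
}
```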

4.2.2 Test statistics

Unlike our tests along Hamiltonian paths, the tests on nearest neighbor graphs depend on how the graphs are built. As such, we describe the corresponding statistics for NBM and kNN graphs separately. As above, we consider both extremum and Wald tests.

Crossmatch statistic

The crossmatch statistic uses the structure of NBM to see how many observations are matched across treatment groups. Intuitively, if the null that $F_{1}=\cdots=F_{G}$ is true, then we would expect a high number of matches across groups [26, 24]. For the crossmatch test, the extremum test is a minimum test: if $H_{0}$ is false, we expect individuals to be matched across groups less often than we would expect under the null.

$k$-Nearest neighbors statistic

We now present our multisample version of the statistic developed by [9]. For a greedy nearest neighbor graph, we look at each observation $i$ and count how many of $i$'s nearest neighbors come from the same treatment group. Under the alternative, we would expect this count to be high, since observations from the same treatment group would be more likely to be close to one another. If this happens more often than we would expect under the null, we conclude the distributions are different.

More formally, define $C_{g}=\sum_{i,j}\mathbf{M}_{ij}\mathbb{I}(Z_{i}=Z_{j})\mathbb{I}(Z_{i}=g)$, $\forall g\in\{1,\ldots,G\}$, or the number of times units from group $g$ are matched to other units from group $g$. We then have the following lemma for $C_{g}$.

Lemma 1.

The expectation and variance of $C_{g}$ are

$$\mathbb{E}(C_{g})=k\frac{n_{g}(n_{g}-1)}{N-1} \qquad (1)$$

and

$$\operatorname{Var}(C_{g})=\frac{n_{g}(n_{g}-1)(N-n_{g})(N-n_{g}-1)}{N(N-1)(N-2)(N-3)}\left(kN+2J+\frac{n_{g}-2}{N-n_{g}-1}\left(2S+kN-k^{2}N\right)-\frac{2}{N-1}k^{2}N\right). \qquad (2)$$

The covariance of $C_{g}$ and $C_{h}$ for $g\neq h$ is

$$\operatorname{Cov}(C_{g},C_{h})=\frac{n_{g}(n_{g}-1)n_{h}(n_{h}-1)}{N(N-1)(N-2)(N-3)}\left(2J-2S+\frac{k^{2}N(N-3)}{N-1}\right), \qquad (3)$$

where $J=\sum_{i=1}^{N-1}\sum_{j=1}^{N}\mathbf{M}_{ij}\mathbf{M}_{ji}$ is the number of pairs of observations that are mutual nearest neighbors, and $S=\sum_{i=1}^{N}\sum_{j=1}^{N-1}\sum_{l=j}^{N}\mathbf{M}_{ji}\mathbf{M}_{li}$ is the number of pairs of observations that share a nearest neighbor.

For a proof, see LABEL:sec:proof. The basic idea is to sum over $\mathbf{M}$ and multiply by the probability that a given set of nodes comes from group $g$.
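To make the quantities concrete, the counts $C_{g}$, $J$, and $S$ can be obtained from $\mathbf{M}$ and the treatment labels $Z$ as in the sketch below, which is one direct (if not the most efficient) way to compute the pair counts described in the lemma.

```r
# Sketch: count statistics from a kNN adjacency matrix M and treatment labels Z.
C_g <- sapply(sort(unique(Z)), function(g) sum(M[Z == g, Z == g]))  # within-group matches
indeg <- colSums(M)                      # number of units that list j as a neighbor
J <- sum(M * t(M)) / 2                   # pairs of mutual nearest neighbors
S <- sum(indeg * (indeg - 1)) / 2        # pairs of units sharing a nearest neighbor
```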

We now turn to the asymptotic distribution of the counts $C_{g}$. In principle, we can use the exact permutation distribution to calculate the $p$-values of tests using this statistic; however, this is computationally very expensive so we do not do so in practice. Before proceeding, we note that Propositions 3.1 and 3.2 of [16] show that $2J/N$ and $2S/N$ converge to constants $c_{1}$ and $c_{2}$ that only depend on two things: the distance metric used to calculate the nearest neighbors and the dimension of the covariate space. Then we define $U_{g}=\frac{C_{g}-\mathbb{E}(C_{g})}{\sqrt{\operatorname{Var}(C_{g})}}$ as the centered and scaled version of $C_{g}$. This brings us to the asymptotic distribution of the $U_{g}$.

Theorem 1.

Under the permutation null distribution, as $N\to\infty$,

$$(U_{1},\ldots,U_{G})^{\top}\xrightarrow{\mathcal{L}}\mathcal{N}\left(\mathbf{0},\Omega\right),$$

with $\Omega_{gg}=1$. For $g\neq h$,

$$\Omega_{gh}=\frac{p_{g}p_{h}}{(1-p_{g})(1-p_{h})}\frac{c_{1}-c_{2}+k^{2}}{\sqrt{f_{g}f_{h}}},$$

where

$$p_{g}=\lim_{n_{g},N\to\infty}\frac{n_{g}}{N},\qquad f_{g}=k+c_{1}+\frac{p_{g}}{1-p_{g}}(c_{2}+k-k^{2}).$$

We provide a proof in LABEL:sec:proof.

In practice, we utilize the finite-sample correlations between counts,

$$\hat{\Omega}_{gh}=R_{gh}\frac{2J-2S+k^{2}\frac{N(N-3)}{N-1}}{\sqrt{\hat{f}(n_{g})\hat{f}(n_{h})}},$$

with

$$R_{gh}=\frac{\sqrt{n_{g}(n_{g}-1)n_{h}(n_{h}-1)}}{\sqrt{(N-n_{g})(N-n_{g}-1)(N-n_{h})(N-n_{h}-1)}}$$

and

$$\hat{f}(n)=kN+2J+\frac{n-2}{N-n-1}(2S+kN-k^{2}N)-\frac{2}{N-1}k^{2}N.$$

This quantity has $\Omega_{gh}$ as its limit. Further, given the discrete nature of the counts, we utilize a continuity correction such that $U_{g}=\frac{C_{g}-0.5-\mathbb{E}(C_{g})}{\sqrt{\operatorname{Var}(C_{g})}}$.

For the kNN statistics, the extremum test is a maximum-based test since we expect that individuals will be matched to their own group more often if $H_{0}$ is false. The Wald test using the kNN statistic can be calculated by $T_{\text{wald-knn}}=\mathbf{U}^{\top}\Omega^{-1}\mathbf{U}$ and $T_{\text{wald-knn}}\sim\chi_{G}^{2}$.
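Continuing the sketches above (with C_g, J, S, k, and Z already computed), the finite-sample kNN Wald test can be assembled directly from Lemma 1 and the correlations $\hat{\Omega}_{gh}$; the variable names are illustrative and this is not the graphTest implementation.

```r
# Sketch: assemble the kNN Wald test from Lemma 1 and the finite-sample correlations.
N <- length(Z); G <- length(unique(Z)); n_g <- as.numeric(table(Z))
f_hat <- function(n) {
  k * N + 2 * J + (n - 2) / (N - n - 1) * (2 * S + k * N - k^2 * N) - 2 / (N - 1) * k^2 * N
}
E_C <- k * n_g * (n_g - 1) / (N - 1)                              # E(C_g), Equation (1)
V_C <- n_g * (n_g - 1) * (N - n_g) * (N - n_g - 1) /
       (N * (N - 1) * (N - 2) * (N - 3)) * f_hat(n_g)             # Var(C_g), Equation (2)
U <- (C_g - 0.5 - E_C) / sqrt(V_C)                                # continuity-corrected counts
Omega <- diag(1, G)
for (g in 1:(G - 1)) for (h in (g + 1):G) {
  R_gh <- sqrt(n_g[g] * (n_g[g] - 1) * n_g[h] * (n_g[h] - 1)) /
          sqrt((N - n_g[g]) * (N - n_g[g] - 1) * (N - n_g[h]) * (N - n_g[h] - 1))
  Omega[g, h] <- Omega[h, g] <- R_gh *
    (2 * J - 2 * S + k^2 * N * (N - 3) / (N - 1)) / sqrt(f_hat(n_g[g]) * f_hat(n_g[h]))
}
T_wald  <- as.numeric(t(U) %*% solve(Omega) %*% U)
p_value <- pchisq(T_wald, df = G, lower.tail = FALSE)
```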

Finally, we note that the power of the tests based on the kNN statistic can be sensitive to the choice of $k$. [5] finds that $k$ should increase with the sample size in order for the test to be well-powered.

5 Simulation studies

We now present simulation results in a variety of settings. Note that we set $k=\lfloor 0.1N\rfloor$ in each case where we use the kNN test.

In the first simulation study, we return to the motivating example from Section 3 to see if the new tests can correctly diagnose deviations from the null hypothesis that $F_{1}=\cdots=F_{G}$ missed by univariate tests. Then in the second simulation study, we utilize the settings of [24] to evaluate the performance of the graph-based tests. Overall, we find that greedy paths and greedy nearest neighbor constructions perform better than their optimal counterparts, and that the kNN test performs best on average.

5.1 Motivating example, continued

Returning to the motivating example from Section 3, we again use cardinality matching to target distributional balance. As before, we use a misspecified setting that matches on $X_{1}$ and $X_{2}$ but misses the important covariate functions $X_{1}^{2}$ and $X_{1}X_{2}$, but this time we also add a well-specified setting that includes all relevant covariate moments. We again match treatment groups so that the group means on the selected covariate functions are within 0.1 standard deviations, and we still utilize 1000 experiments with a sample size of 150. For each experiment, we run the kNN and crossmatch tests on their corresponding graphs and the runs and ranks tests on a greedy Hamiltonian path built via Kruskal’s algorithm. Our implementation of Kruskal’s algorithm relies on the TSP package developed by [15] in R.

Figure 5: Graphical tests run before matching (left), after matching with a misspecified model (middle), and after matching with a well-specified model (right). This figure appears in color in the electronic version of this article.

Figure 5 reveals that the kNN and runs tests successfully detect distributional imbalance both at baseline and under the misspecified matching model. Of concern, the crossmatch and ranks tests fail to reject the null in many cases, even at baseline.

5.2 Setting of [24]

For our second simulation study, we use the setting of [24]. In their experiments, the authors use multivariate Gaussian data and change either (1) the parameters of the distribution or (2) the structure of the problem. The distributions are always drawn from

$$X_{1}^{(g)},\ldots,X_{n_{g}}^{(g)}\overset{\text{iid}}{\sim}\mathcal{N}(\mu_{g},\Sigma_{g}),$$

where $\mu_{g}\in\mathbb{R}^{d}$, $\Sigma_{g}\in\mathbb{R}^{d\times d}$, and $g\in\{1,\ldots,G\}$. We set $n_{g}=50g$ and run 1000 experiments in each setting.

(1) Parameter values

We change the parameter values in one of three ways. The first is location changes, $\mu_{g}=(g-1)\delta\mathbf{1}_{d}$ with $\Sigma_{g}=\mathbf{I}_{d}$; the second is scale changes, with $\mu_{g}=\mathbf{0}$ and $\Sigma_{g}=\{1+(g-1)\delta\}\,\mathbf{I}_{d}$; and the third is correlation changes, with $\mu_{g}=\mathbf{0}$ and $\Sigma_{g}=(1-\rho_{g})\mathbf{I}_{d}+\rho_{g}\mathbf{1}_{d\times d}$, $\rho_{g}=\frac{(g-1)\delta}{g-1}$, where $\mathbf{1}_{d\times d}$ is a $d\times d$ matrix of ones, $\mathbf{1}_{d}$ is a $d$-length vector of ones, and $\mathbf{I}_{d}$ is the $d\times d$ identity matrix. In location changes, we set $\delta\in\{0.04,0.06,0.08,0.10,0.12\}$; in scale and correlation changes, we set $\delta\in\{0.0,0.05,0.10,0.15,0.20,0.25,0.30\}$.

(2) Structure changes

For changes to the problem structure, we either change the dimension of the covariates, $d\in\{5,10,25,50,100,250\}$, or the number of treatment groups, $G\in\{3,4,5,6\}$. Note that as we change the dimension of the covariates, we keep $G$ fixed at five, and as we change the number of treatment groups, we keep $d$ fixed at 150.

Finally, we present results averaged across simulation settings in this section to help simplify results. To see a detailed presentation of the results for each setting, please see Web Figures LABEL:fig:web_1LABEL:fig:knn_k_power.

5.2.1 Extremum versus Wald tests

As previously discussed, there are two ways we can construct our tests: either an extremum (max or min) or a Wald test. Ultimately, we find that extremum tests perform better in tests along Hamiltonian paths while the Wald tests perform better for the nearest neighbor graphs.

To simplify exposition, we average across parameter change settings and present the relative power of Wald tests as compared to extremum tests, that is, $\frac{\mathbb{P}(\text{reject }H_{0}\mid\text{Wald test})}{\mathbb{P}(\text{reject }H_{0}\mid\text{extremum test})}$. Numbers greater than one indicate that the Wald test has higher power on average, while numbers less than one indicate the extremum test has higher power for the given algorithm and structure.

Dimension
Algorithm 5 10 25 50 100 250
Optimal 0.69 0.72 0.71 0.75 0.80 0.87
Greedy 0.68 0.71 0.77 0.83 0.90 0.92
Hilbert sort 0.68 0.71 0.76 0.72 0.68 0.72
NBP 1.44 1.54 1.50 1.52 1.42 1.38
kNN 0.93 1.06 1.16 1.17 1.15 1.09
Table 2: The relative power of the Wald-like test to the extremum test for the listed algorithms. The results are with an increasing dimension of the covariate space and averaged across the location, scale, and correlation changes. The algorithms above the double line are used to construct Hamiltonian paths, while those below the double line, non-bipartite matching (NBP) and k-nearest neighbors (kNN), are used to construct nearest neighbor graphs. The number of groups $G$ is held fixed at 5.
Treatments
Algorithm 3 4 5 6
Optimal 1.03 0.87 0.82 0.77
Greedy 1.09 0.95 0.89 0.88
Hilbert sort 1.59 1.06 0.71 0.56
NBP 0.92 1.19 1.43 1.80
kNN 1.30 1.21 1.15 1.06
Table 3: The relative power of the Wald-like test to the extremum test for the listed algorithms. The results are with an increasing number of treatments and averaged across the location, scale, and correlation changes. The algorithms above the double line are used to construct Hamiltonian paths, while those below the double line, non-bipartite matching (NBP) and k-nearest neighbors (kNN), are used to construct nearest neighbor graphs. The number of covariates $d$ is held fixed at 150.

Table 2 presents the average results for each graph structure as the covariate dimension is increased while holding the number of treatment groups fixed at $G=5$. For the Hamiltonian path algorithms, the Wald tests have lower relative power for smaller covariate dimensions, with decreasing differences as the dimension of the problem increases. In contrast, for the nearest neighbor graphs, the Wald-like tests have higher power on average as the covariate dimension increases. Only kNN with 5 covariates bucks this trend.

Similarly, Table 3 shows the results as we change the number of groups. We note that the Wald tests have higher power for the Hamiltonian paths with three groups. However, this reverses as the number of groups increases, and there is a general trend that the extremum tests have higher relative power as we consider larger numbers of groups.

In contrast, the nearest neighbor graph algorithms display higher power for the Wald tests in most cases; however, kNN and NBP display opposite trends in relative power as the number of groups increases. The kNN algorithm has a decreasing difference between the Wald and extremum tests as the number of groups increases, while the NBP algorithm displays an increasing difference between the two tests, with the Wald test having higher relative power.

5.2.2 Greedy paths perform better than optimal ones

We might assume a priori that the optimal Hamiltonian path will perform better than its greedy version because it is the shortest path through the data. Unfortunately, our experiments do not support this idea.

In Table 4, we present the results from changing the dimension of the covariate space for our algorithms: optimal, greedy, and Hilbert sorting. For comparison purposes, we present results relative to the power of the tests on an optimal Hamiltonian path: $\frac{\mathbb{P}(\text{reject }H_{0})}{\mathbb{P}(\text{reject }H_{0}\mid\text{optimal path})}$. As before, we average the results across parameter changes and present the average relative power for each path as we change the problem dimension, $d$, and the number of treatment groups, $G$. On average, the greedy path does better than the optimal path, with increasing relative power as the dimension increases. In contrast, the Hilbert sort path displays decreasing relative power as the dimension of the problem increases.

Similarly, in Table 5, we see that the relative power of tests on the greedy Hamiltonian path is consistently more than four times that of tests on the optimal path found by Concorde. As before, the Hilbert sort path still has the lowest relative power, though this gap narrows slightly as the number of groups increases.

Dimension
Path 5 10 25 50 100 250
Optimal 1.00 1.00 1.00 1.00 1.00 1.00
Greedy 2.27 3.41 3.99 4.36 4.71 4.35
Hilbert sort 1.02 1.03 0.84 0.74 0.65 0.51
Table 4: The relative power of the listed paths compared to the optimal path found by Concorde. The results are with an increasing dimension of the covariate space and averaged across the location, scale, and correlation changes as well as the results from the runs test and rank test. The number of groups $G$ is held fixed at 5.
Treatments
Path 3 4 5 6
Optimal 1.00 1.00 1.00 1.00
Greedy 4.73 4.64 4.35 4.09
Hilbert sort 0.50 0.56 0.57 0.60
Table 5: The relative power of the listed paths compared to the optimal path found by Concorde. The results are with an increasing number of treatments and averaged across the location, scale, and correlation changes as well as the results from the runs test and rank test. The number of covariates $d$ is held fixed at 150.

5.2.3 The $k$-nearest neighbor test outperforms others

We now compare the two other graph-based tests—the multisample crossmatch and the kNN test—to the Kruskal-Wallis ranks test and the runs test run on greedy Hamiltonian paths. All power calculations are presented relative to the power of the kNN test.

First, we examine the effect of increasing covariate dimension, $d$, while holding the number of treatment groups fixed at $G=5$. We can see in Table 6 that the kNN test outperforms the other tests, on average. There are a few cases where some of the other tests perform better—for example, in location changes with a small covariate dimension, the crossmatch and runs tests can sometimes perform better. However, such effects are not observed as the dimension continues to increase and are also not observed in settings where the covariance matrix varies by group.

Next, we examine the effect of changing the number of treatment groups, $G$, while holding the covariate dimension fixed at $d=150$. In Table 7, we see that again the kNN test has the highest relative power compared to any other test on average. For smaller numbers of groups, the runs test performs as well as the kNN test, but this effect diminishes as the number of groups continues to increase.

Dimension
Test 5 10 25 50 100 250
Rank 0.61 0.64 0.68 0.69 0.72 0.76
Runs 0.34 0.39 0.53 0.68 0.74 0.87
Crossmatch 0.37 0.39 0.41 0.52 0.57 0.71
Nearest Neighbor 1.00 1.00 1.00 1.00 1.00 1.00
Table 6: The relative power of the listed tests compared to the nearest neighbor test. The results are with an increasing dimension of the covariate space and averaged across the location, scale, and correlation changes. The nearest neighbor test uses 10% of the sample size as the number of nearest neighbors, and the rank and runs tests use a greedily constructed Hamiltonian path. The number of groups $G$ is held fixed at 5.
Treatments
Test 3 4 5 6
Rank 0.75 0.72 0.73 0.77
Runs 0.72 0.76 0.81 0.84
Crossmatch 0.49 0.55 0.63 0.70
Nearest Neighbor 1.00 1.00 1.00 1.00
Table 7: The relative power of the listed tests compared to the nearest neighbor test. The results are with an increasing number of treatments and averaged across the location, scale, and correlation changes. The nearest neighbor test uses 10% of the sample size as the number of nearest neighbors, and the rank and runs tests use a greedily constructed Hamiltonian path. The number of covariates $d$ is held fixed at 150.

6 Case study

Our goal is to use matching to make the distributions in the respective treatment groups more similar to the overall sample distribution. We will then use our graph-based tests to see if univariate tests miss any distributional differences. To do so, we use data from the 1987 National Medical Expenditure Survey as distributed with the TriMatch package in R [8].

6.1 Data

The original survey included detailed information about participants’ quantity and duration of smoking in pack-years, with one pack-year defined as smoking one pack of cigarettes per day for one year. In our data, we include the following covariates: the individual’s age at the last survey, their gender, whether the individual identified as white, whether the individual attended college, whether the individual had ever been married, whether the individual fell into the low income category, and whether the individual always wore a seat belt. For the binary variables, we add a very small jitter to make matching easier for the software. Finally, we constructed a composite treatment variable based on smoking history: 1 = never smoker, 2 = former smoker with less than 15 pack-years of smoking, 3 = former smoker with more than 15 pack-years of smoking, 4 = current smoker with less than 15 pack-years of smoking, and 5 = current smoker with more than 15 pack-years of smoking. The number of individuals in each treatment group is presented in Table 8.

Treatment 1 Treatment 2 Treatment 3 Treatment 4 Treatment 5
Baseline 9804 2073 2003 2326 3146
(A) 8456 1512 502 696 1365
(B) 8065 1381 458 637 1247
(C) 8018 1367 453 629 1234
(D) 6460 1070 0 0 755
Table 8: Sample size in each treatment group. Baseline is the data before any adjustment, and cardinality matching is done matching individuals in each treatment group to achieve means within the desired calipers of the overall sample means. (A) seeks matches on the first moments such that the means are within 0.05 standard deviations (S.D.), (B) matches on the first moments such that means are within 0.01 S.D., (C) matches on the first moments such that means are within 0.005 S.D., and (D) matches first moments within 0.005 S.D. and matches second to fifth moments within 0.025 S.D.
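As a sketch, the composite treatment coding described above can be written as a small helper function; the argument names are hypothetical and do not correspond to the actual variable names in the TriMatch data.

```r
# Hypothetical sketch of the composite treatment coding; inputs are illustrative only.
code_treatment <- function(smoking_status, pack_years) {
  ifelse(smoking_status == "never",                     1,
  ifelse(smoking_status == "former" & pack_years < 15,  2,
  ifelse(smoking_status == "former",                    3,
  ifelse(pack_years < 15,                               4, 5))))
}
```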

6.2 Methods

We consider four forms of cardinality matching to achieve distributional balance. The first three forms seek to match individuals in each treatment group so that the group mean of each covariate is within 0.05, 0.01, or 0.005 standard deviations of the sample mean; we denote these as Card. Matching (A), (B), and (C), respectively. The fourth and final form targets the first through fifth moments of the covariate distributions, and we denote this form as Card. Matching (D). For all matching tasks, we set the maximum solver time to one hour.

To diagnose failures of distributional balance, we utilize two types of methods. First, we examine single-covariate measures of distributional balance, such as the standardized difference of means and the corresponding tests for differences in means; we present these in table form for the first moments (see Table 9). Second, we utilize the graph-based tests described in this work, including the kNN test, the runs and ranks tests on greedy Hamiltonian paths, and the crossmatch test.

6.3 Results

We now look at how the various balance assessment methods diagnose a lack of distributional overlap. Table 9 displays the average of the absolute standardized differences in means between the treatment groups and the sample mean for all covariate first moments; in column 2 we see that there are some covariates with substantial differences between the sample mean and the group means, on average. After cardinality matching on the first moments with a 0.05 standard deviation caliper (columns 4 and 5 of Table 9), we see that the average absolute standardized difference in means between groups is within 0.1 standard deviations, but all of the $p$-values are below 0.05. For a caliper of 0.01 (columns 6 and 7), the $p$-values are now above 0.05, which also holds true for a caliper of 0.005 standard deviations. After cardinality matching on the first through fifth moments (columns 10 and 11 of Table 9), the picture looks similar, though the $p$-values are slightly lower. From these metrics, we would conclude that sufficient balance has been achieved in most settings.

Baseline Card. Matching (A) Card. Matching (B) Card. Matching (C) Card. Matching (D)
Variable Std. diff p-value Std. diff p-value Std. diff p-value Std. diff p-value Std. diff p-value
Age 0.26 <0.01 0.03 0.02 <0.01 0.99 <0.01 1.00 <0.01 0.94
Any college 0.05 <0.01 0.04 <0.01 <0.01 0.95 <0.01 1.00 <0.01 1.00
Ever married 0.21 <0.01 0.05 <0.01 <0.01 0.92 <0.01 0.99 <0.01 0.95
Low Income 0.08 <0.01 0.03 0.04 <0.01 0.98 <0.01 1.00 <0.01 0.99
Male 0.19 <0.01 0.05 <0.01 <0.01 0.94 <0.01 1.00 <0.01 0.99
Seatbelt always 0.11 <0.01 0.05 <0.01 <0.01 0.97 <0.01 1.00 <0.01 0.98
White 0.09 <0.01 0.05 <0.01 <0.01 0.97 <0.01 1.00 <0.01 0.98
Table 9: Univariate tests and absolute standardized mean differences applied to the matches of various methods. Baseline is the data before any adjustment, and cardinality matching is done matching individuals in each treatment group to achieve means within the desired calipers of the overall sample means. (A) seeks matches on the first moments such that the means are within 0.05 standard deviations (S.D.), (B) matches on the first moments such that means are within 0.01 S.D., (C) matches on the first moments such that means are within 0.005 S.D., and (D) matches first moments within 0.005 S.D. and matches second to fifth moments within 0.025 S.D.

However, all is not well with the joint distributions. While most of the univariate tests looked fine after adjustment, almost none of the multivariate tests in Table 10 indicate that distributional balance has been achieved. For example, the kNN test (column 2) has $p$-values that are below numeric tolerance at baseline and after cardinality matching on the first moments (Card. Matching A–C); only matching on the first through fifth moments achieves a non-significant $p$-value (Card. Matching D). A similar pattern holds for both the runs test and the crossmatch test of [24]. The ranks test on the Hamiltonian path also achieves a non-significant $p$-value in Card. Matching (C).

k-Nearest Neighbor Runs Ranks Crossmatch
Baseline $1.11\times 10^{-16}$ $1.11\times 10^{-16}$ $1.11\times 10^{-16}$ $1.11\times 10^{-16}$
Card. matching (A) $1.11\times 10^{-16}$ $1.50\times 10^{-10}$ $1.90\times 10^{-3}$ $1.11\times 10^{-16}$
Card. matching (B) $1.11\times 10^{-16}$ $1.89\times 10^{-15}$ $2.12\times 10^{-14}$ $1.84\times 10^{-10}$
Card. matching (C) $1.11\times 10^{-16}$ $1.15\times 10^{-9}$ 0.07 $8.39\times 10^{-10}$
Card. matching (D) 0.17 0.13 0.47 0.51
Table 10: Multisample tests applied to the matches from the cardinality matching methods. Baseline is the data before any adjustment and cardinality matching is done matching individuals in each treatment group to achieve group means within the desired calipers of the overall sample mean for the selected covariate moments. (A) seeks matches on the first moments such that the means are within 0.05 standard deviations (S.D.), (B) matches on the first moments such that means are within 0.01 S.D., (C) matches on the first moments such that means are within 0.005 S.D., and (D) matches first moments within 0.005 S.D. and matches second to fifth moments within 0.025 S.D.

7 Summary and remarks

In this paper, we proposed multisample extensions to the Hamiltonian path testing of [7] and the nearest neighbor testing of [9] and we evaluated these tests in a variety of settings. We considered various graph structures, algorithms to form these structures, and tests of distributional balance.

Findings

In our simulation studies, we found that the extremum test performs best for the runs statistic, while for the statistics on nearest neighbor graphs, the Wald tests perform best.

For both Hamiltonian paths and nearest neighbor graphs, we found in our simulation studies that approximate solutions—that is, greedy Hamiltonian paths and greedy nearest neighbor graphs—perform better than their optimal counterparts. We suspect that the approximate solutions are closer to optimal locally and may therefore have higher power to detect differences in distribution in certain regions of the data.

Overall, we found the kNN test on the approximate nearest neighbor graph to be the highest-powered test compared to the various alternatives. The Hamiltonian path based tests performed slightly worse than the kNN test in most settings, and the crossmatch test fared the worst. Of some concern, both the crossmatch test and the ranks test on a greedy Hamiltonian path missed the model misspecification in the motivating example, while the results in the other simulation studies in Section 5.2.3 are more mixed. Based on these results, we would recommend the kNN test to evaluate multivariate balance in settings with multi-valued treatments.

Our case study revealed how the univariate metrics typically used to assess balance fail to diagnose inadequate overlap in the joint distribution. Of some concern, we found that an exceedingly high number of moments was required to obtain non-significant tests for the overlap of the joint distribution.

Further research

One area that requires further exploration is that some settings may not require the strict distributional balance we are able to measure with the kNN test. Ultimately, balancing the conditional mean functions between treatment groups is sufficient, since this will remove the bias in estimating treatment effects. For example, if the conditional mean function of the outcome $Y_{i}$ is $\mathbb{E}(Y_{i}\mid X_{i},Z_{i})=X_{i}^{\top}\beta+\alpha Z_{i}$, then ensuring balance on the first moments of the covariates would ensure unbiased estimation of the treatment effects determined by $\alpha$. However, since this conditional mean function is not known in practice, determining when sufficient balance has been achieved on the correct functions of the covariates remains a challenge.

Finally, there may be room to invert the kNN test as a method for achieving distributional balance. One could construct an optimization problem that seeks to minimize the test statistic of the kNN test in order to achieve distributional balance. This would be useful in observational settings where balance cannot be achieved through randomization.

Acknowledgments

The author would like to thank Claire Chaumont, Gang Liu, Aarón Sonabend, Lorenzo Trippa, and José Zubizarreta for helpful comments and feedback on an earlier version of this manuscript. This research was funded by generous support from NIH grant 5T32CA009337-40, the Department of Biostatistics at the Harvard T.H. Chan School of Public Health, and the David Geffen Scholarship from the David Geffen School of Medicine at UCLA.

References

  • [1] David Applegate, Robert Bixby, Vasek Chvátal and William Cook “Implementing the Dantzig-Fulkerson-Johnson algorithm for large traveling salesman problems” In Mathematical Programming 97.1, 2003, pp. 91–153 DOI: 10.1007/s10107-003-0440-4
  • [2] Magdalena Bennett, Juan Pablo Vielma and José R. Zubizarreta “Building Representative Matched Samples With Multi-Valued Treatments in Large Observational Studies” In Journal of Computational and Graphical Statistics 29.4 Taylor & Francis, 2020, pp. 744–757 DOI: 10.1080/10618600.2020.1753532
  • [3] Jon Louis Bentley “Multidimensional binary search trees used for associative searching” In Communications of the ACM 18.9, 1975, pp. 509–517 DOI: 10.1145/361002.361007
  • [4] Alina Beygelzimer et al. “FNN: Fast Nearest Neighbor Search Algorithms and Applications”, 2019 URL: https://cran.r-project.org/package=FNN
  • [5] Bhaswar Bhattacharya “A general asymptotic framework for distribution-free graph-based two-sample tests” In Journal of the Royal Statistical Society. Series B: Statistical Methodology 81.3, 2019, pp. 575–602 DOI: 10.1111/rssb.12319
  • [6] Z. W. Birnbaum and R. A. Hall “Small Sample Distributions for Multi-Sample Statistics of the Smirnov Type” In The Annals of Mathematical Statistics 31.3, 1960, pp. 710–720 DOI: 10.1214/aoms/1177705797
  • [7] Munmun Biswas, Minerva Mukhopadhyay and Anil K. Ghosh “A distribution-free two-sample run test applicable to high-dimensional data” In Biometrika 101.4, 2014, pp. 913–926 DOI: 10.1093/biomet/asu045
  • [8] Jason Breyer “TriMatch: Propensity Score Matching of Non-Binary Treatments”, 2017 URL: https://CRAN.R-project.org/package=TriMatch
  • [9] Hao Chen and Dylan S Small “New multivariate tests for assessing covariate balance in matched observational studies” In Biometrics, 2020 DOI: 10.1111/biom.13395
  • [10] Shojaeddin Chenouri and Christopher G. Small “A nonparametric multivariate multisample test based on data depth” In Electronic Journal of Statistics 6, 2012, pp. 760–782 DOI: 10.1214/12-EJS692
  • [11] G. Dantzig, R. Fulkerson and S. Johnson “Solution of a Large-Scale Traveling-Salesman Problem” In Journal of the Operations Research Society of America 2.4, 1954, pp. 393–410 DOI: 10.1287/opre.2.4.393
  • [12] Alexis Diamond and Jasjeet S. Sekhon “Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies” In Review of Economics and Statistics 95.3, 2013, pp. 932–945 DOI: 10.1162/REST_a_00318
  • [13] Eric A. Dunipace “approxOT: approximate optimal transport”, 2022 URL: https://CRAN.R-project.org/package=approxOT
  • [14] Alan Genz et al. “mvtnorm: Multivariate Normal and t Distributions”, 2020 URL: https://cran.r-project.org/package=mvtnorm
  • [15] Michael Hahsler and Kurt Hornik “TSP – Infrastructure for the traveling salesperson problem” In Journal of Statistical Software 23.2, 2007, pp. 1–21 DOI: 10.18637/jss.v023.i02
  • [16] Norbert Henze “A Multivariate Two-Sample Test Based on the Number of Nearest Neighbor Type Coincidences” In The Annals of Statistics 16.2, 1988, pp. 1403–1433 DOI: 10.1214/aos/1176350835
  • [17] David Hilbert “Ueber die stetige Abbildung einer Line auf ein Flächenstück” In Mathematische Annalen 38.3, 1891, pp. 459–460 DOI: 10.1007/BF01199431
  • [18] Richard M. Karp “Reducibility among combinatorial problems” In Complexity of Computer Computations New York: Plenum, 1972, pp. 219–241 DOI: 10.1007/978-3-540-68279-0_8
  • [19] A. N. Kolmogorov “Sulla determinazione empirica di una legge di distribuzione” In G. Ist. Ital. Attuari 4, 1933, pp. 83–91 URL: https://ci.nii.ac.jp/naid/10030673552/en/
  • [20] Joseph B. Kruskal “On the Shortest Spanning Subtree of a Graph and the Traveling Salesman Problem” In Proceedings of the American Mathematical Society 7.1, 1956, pp. 48 DOI: 10.2307/2033241
  • [21] William H. Kruskal and W. Allen Wallis “Use of Ranks in One-Criterion Variance Analysis” In Journal of the American Statistical Association 47.260, 1952, pp. 583–621 DOI: 10.1080/01621459.1952.10483441
  • [22] Marco Marozzi “The multisample Cucconi test” In Statistical Methods and Applications 23.2, 2014, pp. 209–227 DOI: 10.1007/s10260-014-0255-x
  • [23] A. M. Mood “The distribution theory of runs” In The Annals of Mathematical Statistics 11.4, 1940, pp. 367–392 URL: http://pubsonline.informs.org/doi/abs/10.1287/opre.7.1.58
  • [24] Somabha Mukherjee, Divyansh Agarwal, Nancy R. Zhang and Bhaswar B. Bhattacharya “Distribution-Free Multisample Tests Based on Optimal Matchings With Applications to Single Cell Genomics” In Journal of the American Statistical Association 0.0 Taylor & Francis, 2020, pp. 1–31 DOI: 10.1080/01621459.2020.1791131
  • [25] Madan Lal Puri and Pranab Kumar Sen “On a class of multivariate multisample rank-order tests” In Sankhyā: The Indian Journal of Statistics, Series A JSTOR, 1966, pp. 353–376
  • [26] Paul R. Rosenbaum “An exact distribution-free test comparing two multivariate distributions based on adjacency” In Journal of the Royal Statistical Society. Series B: Statistical Methodology 67.4, 2005, pp. 515–530 DOI: 10.1111/j.1467-9868.2005.00513.x
  • [27] Paul R. Rosenbaum and Donald B. Rubin “The Central Role of the Propensity Score in Observational Studies for Causal Effects” In Biometrika 70.1, 1983, pp. 41–55
  • [28] Paul R. Rosenbaum and Donald B. Rubin “Constructing a control group using multivariate matched sampling methods that incorporate the propensity score” In American Statistician 39.1, 1985, pp. 33–38 DOI: 10.1080/00031305.1985.10479383
  • [29] N. Smirnov “Table for Estimating the Goodness of Fit of Empirical Distributions” In The Annals of Mathematical Statistics 19.2, 1948, pp. 279–281 DOI: 10.1214/aoms/1177730256
  • [30] Elizabeth A. Stuart “Matching Methods for Causal Inference: A Review and a Look Forward” In Statistical Science 25.1, 2010, pp. 1–21 DOI: 10.1214/09-STS313
  • [31] The CGAL Project “CGAL User and Reference Manual” CGAL Editorial Board, 2021 URL: https://doc.cgal.org/5.2.1/Manual/packages.html
  • [32] A. Wald and J. Wolfowitz “On a Test Whether Two Samples are from the Same Population” In The Annals of Mathematical Statistics 11.2, 1940, pp. 147–162 DOI: 10.1214/aoms/1177731909
  • [33] José R. Zubizarreta “Using mixed integer programming for matching in an observational study of kidney failure after surgery” In Journal of the American Statistical Association 107.500, 2012, pp. 1360–1371 DOI: 10.1080/01621459.2012.703874
  • [34] José R Zubizarreta, Cinar Kilcioglu and Juan P Vielma “designmatch: Matched Samples that are Balanced and Representative by Design”, 2018 URL: https://cran.r-project.org/package=designmatch
  • [35] José R. Zubizarreta, Ricardo D. Paredes and Paul R. Rosenbaum “Matching for balance, pairing for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit high schools in Chile” In Annals of Applied Statistics 8.1, 2014, pp. 204–231 DOI: 10.1214/13-AOAS713
  • [36] María de los Angeles Resa and José R. Zubizarreta “Evaluation of subset matching methods and forms of covariate balance” In Statistics in Medicine 35.27, 2016, pp. 4961–4979 DOI: 10.1002/sim.7036