Power Under Multiplicity Project (PUMP): Estimating Power, Minimum Detectable Effect Size, and Sample Size When Adjusting for Multiple Outcomes in Multi-level Experiments
Abstract
For randomized controlled trials (RCTs) with a single intervention being measured on multiple outcomes, researchers often apply a multiple testing procedure (such as Bonferroni or Benjamini-Hochberg) to adjust p-values. Such an adjustment reduces the likelihood of spurious findings, but also changes the statistical power, sometimes substantially, reducing the probability of detecting effects when they do exist. However, this consideration is frequently ignored in typical power analyses, as existing tools do not easily accommodate the use of multiple testing procedures. We introduce the PUMP R package as a tool for analysts to estimate statistical power, minimum detectable effect size, and sample size requirements for multi-level RCTs with multiple outcomes. Multiple outcomes are accounted for in two ways. First, power estimates from PUMP properly account for the adjustment in p-values from applying a multiple testing procedure. Second, as researchers change their focus from one outcome to multiple outcomes, different definitions of statistical power emerge. PUMP allows researchers to consider a variety of definitions of power, as some may be more appropriate for the goals of their study. The package estimates power for frequentist multi-level mixed effects models, and supports a variety of commonly used RCT designs and models and multiple testing procedures. In addition to the main functionality of estimating power, minimum detectable effect size, and sample size requirements, the package allows the user to easily explore the sensitivity of these quantities to changes in underlying assumptions.
1 Introduction
The PUMP R package fills an important gap in open-source software tools for designing multi-level randomized controlled trials (RCTs) with adequate statistical power. With this package, researchers can estimate statistical power, minimum detectable effect size (MDES), and needed sample size for multi-level experimental designs, in which units are nested within hierarchical structures such as students nested within schools nested within school districts. The statistical power is calculated for estimating the impact of a single intervention on multiple outcomes. The package uses a frequentist framework of mixed effects regression models, which is currently the prevailing framework for estimating impacts from experiments in education and other social policy research. (Other options include nonparametric or Bayesian methods, but these are less prevalent in applied research [1, 2].)
To our knowledge, none of the existing software tools for power calculations allow researchers to account for multiple hypothesis tests and the use of a multiple testing procedure (MTP). MTPs adjust p-values to reduce the likelihood of spurious findings when researchers are testing for effects on multiple outcomes. This adjustment can result in a substantial change in statistical power, greatly reducing the probability of detecting effects when they do exist. Unfortunately, when designing studies, researchers who plan to test for effects on multiple outcomes and employ MTPs frequently ignore the power implications of the MTPs.
Also, as researchers change their focus from one outcome to multiple outcomes, multiple definitions of statistical power emerge [3, 4, 5, 6]. The PUMP package allows researchers to consider multiple definitions of power, selecting those most suited to the goals of their study. The definitions of power include:
• individual power: the probability of detecting an effect of a particular size (specified by the researcher) or larger for each hypothesis test. Individual power corresponds to how power is defined when there is focus on a single outcome.
• 1-minimal power: the probability of detecting effects of at least a particular size on at least one outcome. Similarly, the researcher can consider d-minimal power for any d less than the number of outcomes, or fractional versions, such as the power to detect effects on at least half of the outcomes.
• complete power: the power to detect effects of at least a particular size on all outcomes.
As noted in [7], the prevailing default in many studies—individual power—may or may not be the most appropriate type of power. If the researcher’s goal is to find statistically significant estimates of effects on most or all primary outcomes of interest, then their power may be much lower than anticipated when multiplicity adjustments are taken into account. On the other hand, if the researcher’s goal is to find statistically significant estimates of effects on at least one or a small proportion of outcomes, their power may be much better than anticipated. In both of these cases, by not accounting for both the challenges and opportunities arising from multiple outcomes, a researcher may find they have wasted resources, either by designing an underpowered study that cannot detect the desired effect sizes, or by designing an overpowered study that had a larger sample size than necessary. We introduce the PUMP package to allow for directly answering questions that take multiple outcomes into account, such as:
• How many schools would I need to detect a given effect on at least three of my five outcomes?
• What size effect can I reliably detect on each outcome, given a planned MTP across all my outcomes?
• How would the power to detect a given effect change if only half my outcomes truly had impact?
The methods in the PUMP package build on those introduced in [7]. This earlier paper focused only on a single RCT design and model — a multisite RCT with the blocked randomization of individuals, in which effects are estimated using a model with block-specific intercepts and with the assumption of constant effects across all units. This earlier paper also did not produce software to assist researchers in implementing its methods. With this current paper and with the introduction of the PUMP package, we extend the methodology to nine additional multi-level RCT designs and models. Also, while [7] focused on estimates of power, PUMP goes further to also estimate MDES and sample size requirements that take multiplicity adjustments into account.
PUMP extends the functionality of the popular PowerUp! R package (and its related tools in the form of a spreadsheet and Shiny application), which computes power or MDES for multi-level RCTs with a single outcome [8]. For a wide variety of RCT designs with a single outcome, researchers can take advantage of closed-form solutions and numerous power estimation tools; for examples in education and social policy research, see [8, 9, 10, 11]. However, closed-form solutions are difficult or impossible to derive when a MTP is applied to a setting with multiple outcomes. Instead, we use a simulation-based approach to estimate power.
In order to calculate power, the researcher specifies information about the sample size at each level, the minimum detectable effect size for each outcome, the level of statistical significance, and parameters of the data generating distribution. The minimum detectable effect size is the smallest true effect size the study can detect with the desired statistical significance level, in units of standard deviations. An “effect size” generally refers to the standardized mean difference effect size, which “equals the difference in mean outcomes for the treatment group and control group, divided by the standard deviation of outcomes across subjects within experimental groups” [12]. Researchers often use effect sizes to standardize outcomes so that outcomes with different scales can be directly compared.
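In symbols, and consistent with the definition quoted above, the effect size for outcome $m$ can be written as

$$ES_m = \frac{\bar{Y}_{T,m} - \bar{Y}_{C,m}}{\sigma_m},$$

where $\bar{Y}_{T,m}$ and $\bar{Y}_{C,m}$ are the mean outcomes in the treatment and control groups and $\sigma_m$ is the standard deviation of outcome $m$ within experimental groups (the notation here is ours, for illustration).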
The package includes three core functions:
• pump_power() for calculating power given an experimental design and assumed model, parameters, and minimum detectable effect size.
• pump_mdes() for calculating the minimum detectable effect size given a target power and sample sizes.
• pump_sample() for calculating the sample size required to achieve a given target power for a given minimum detectable effect size.
For any of these core functions, the user begins with two main choices. First, the user chooses the assumed design and model of the RCT. The PUMP package covers a range of multi-level designs, up to three levels of hierarchy, that researchers typically use in practice, in which research units are nested in hierarchical groups. Our power calculations assume the user will be analyzing these RCTs using frequentist mixed-effects regression models, containing a combination of fixed or random intercepts and treatment impacts at different levels, as we explain in detail in Section 4.1 and in the Technical Appendix. Second, the user chooses the MTP to be applied. PUMP supports five common MTPs — Bonferroni, Holm, single-step and step-down versions of Westfall-Young, and Benjamini-Hochberg. After these two main choices, the user must also make a variety of decisions about parameters of the data generating distribution.
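For orientation, the following sketch shows what a call to pump_power() might look like for a design similar to Diplomas Now. The argument names (d_m, MTP, MDES, and so on) follow the package documentation, but all parameter values here are illustrative placeholders rather than estimates from the actual study, and the MTP code "HO" (Holm) is taken from the package's documented options.

```r
library(PUMP)

# Illustrative call: three-level design, randomization at level two,
# five outcomes, Holm adjustment. Values are placeholders, not estimates
# from the Diplomas Now evaluation.
pow <- pump_power(
  d_m   = "d3.2_m3fc2rc",     # design and model (see Section 5.1)
  MTP   = "HO",               # Holm multiple testing procedure
  MDES  = rep(0.10, 5),       # assumed effect size for each outcome
  M     = 5,                  # number of outcomes
  nbar  = 50,                 # average number of students per school
  J     = 3,                  # schools per block
  K     = 21,                 # number of blocks
  Tbar  = 0.50,               # proportion of schools treated
  alpha = 0.05,               # significance level
  numCovar.1 = 1,             # student-level covariates
  numCovar.2 = 1,             # school-level covariates
  R2.1  = 0.1, R2.2 = 0.5,    # explanatory power of covariates
  ICC.2 = 0.05, ICC.3 = 0.40, # intraclass correlations at levels 2 and 3
  rho   = 0.40                # correlation between test statistics
)
pow
```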
The package also includes functions that allow users to easily explore power over a range of possible values of parameters. This exploration encourages the user to determine the sensitivity of estimates to different assumptions. PUMP also visually displays results. These additional functions include:
• pump_power_grid(), pump_mdes_grid(), and pump_sample_grid() for calculating the given output over a range of possible parameter values.
• update() to re-run an existing calculation with a small number of parameters updated (see the brief sketch after this list).
• plot() on PUMP-generated objects to generate plots (including of grid outputs).
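As a brief illustration of this workflow (reusing the hypothetical pump_power() call sketched earlier, so all values remain placeholders), a user might check how results shift under a different assumed correlation between test statistics and then plot the result:

```r
# Re-run the earlier (illustrative) power calculation with one assumption
# changed, then plot power across outcomes and definitions of power.
pow_low_rho <- update(pow, rho = 0.2)
pow_low_rho
plot(pow_low_rho)
```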
The PUMP package is available on CRAN at https://CRAN.R-project.org/package=PUMP. The authors of the PUMP package have also created a web application built with R Shiny. This web application calls the PUMP package and allows users to conduct calculations with a user-friendly interface, but it is less flexible than the package, with a focus on simpler scenarios (e.g., 10 or fewer outcomes). The app can be found at https://public.mdrc.org/pump/.
The remainder of this paper is organized as follows. In Section 2, we introduce Diplomas Now, an educational experiment, to be used as a running example throughout the paper. We note, however, that the problem of power estimation for multi-level RCTs is not exclusive to the educational setting. In Section 3, we provide an overview of the multiple testing problem, the statistical challenges introduced by multiple hypothesis testing, and how MTPs protect against spurious impact findings. In Section 4, we introduce our methodology for estimating power when taking the use of MTPs into account; this section also briefly discusses our validation process. Section 5 discusses the various choices a user must make when using the package, including the designs and models, MTPs, and key design and model parameters. Section 6 provides a detailed presentation of the PUMP package, with multiple examples of using the package's functions to conduct calculations for our education RCT example. Section 7 is a brief conclusion.
2 Diplomas Now
We illustrate our package using an example of a published RCT that evaluated a secondary school model called Diplomas Now. The Diplomas Now model is designed to increase high school graduation rates and post-secondary readiness. Evaluators conducted an RCT comparing schools that implemented the model to business-as-usual schools. We refer to this example throughout the paper to illustrate key concepts and the application of the PUMP package.
The Diplomas Now model, created by three national organizations, Talent Development, City Year, and Communities In Schools, targets underfunded urban middle and high schools with many students who are not performing well academically. The model is designed to be robust enough to transform high-poverty and high-needs middle and high schools attended by many students who fall off the path to high school graduation. Diplomas Now, with MDRC as a partner, was one of the first validation grants awarded as part of the Investing in Innovation (i3) competition administered by the federal Department of Education.
We follow the general design of the Diplomas Now evaluation, conducted by MDRC. The RCT contains three levels (students within schools within districts) with random assignment at level two (schools). The initial evaluation included two cohorts of schools, with each cohort implementing the model for two years (2011-2013 for Cohort 1 and 2012-2014 for Cohort 2). The cohorts included 62 secondary schools (both middle and high schools) in 11 school districts that agreed to participate. Schools in the active treatment group were assigned to implement the Diplomas Now model, while schools in the control group continued their existing school programs or implemented other reform strategies of their choosing [13]. The MDRC researchers conducted randomization of the schools within blocks defined by district, school type, and year of roll-out. After some schools were dropped from the study for structural reasons, the researchers were left with 29 high schools and 29 middle schools grouped in 21 random assignment blocks. Within each block, schools were randomized to the active treatment or business-as-usual, resulting in 32 schools in the treatment group and 30 schools in the control group.
The evaluation focused on three categories of outcomes: Attendance, Behavior, and Course performance, called the “ABC's,” with multiple measures for each category. In addition, the evaluation measured an overall ABC composite measure of whether a student is above given thresholds on all three categories. This grouping constitutes 12 total outcomes of interest. Evaluating each of the 12 outcomes independently would not be good practice, as the chance of a spurious finding would not be well controlled. The authors of the MDRC report pre-identified three of these outcomes as primary outcomes before the start of the study in order to reduce the problem of multiple testing. We, by contrast, use this example to illustrate what could be done if there were uncertainty as to which outcomes should be primary. In particular, we illustrate how to conduct a power analysis to plan a study where one uses multiple testing adjustment, rather than predesignation, to account for the multiple outcome problem.
There are different guidelines for how to adjust for groupings of multiple outcomes in education studies. For example, [14] recommends organizing primary outcomes into domains, conducting tests on composite domain outcomes, and applying multiplicity corrections to composites across domains. The What Works Clearinghouse applies multiplicity corrections to findings within the same domain rather than across different domains. We do not provide recommendations for which guidelines to follow when investigating impacts on multiple outcomes. Rather, we address the fact that researchers across many domains are increasingly applying MTPs and therefore need to correctly estimate power, MDES and sample size requirements accounting for this choice. In our example, we elect to do a power analysis separately for each of the three outcome groups of the ABC outcomes to control family-wise error rather than overall error. This strategy means we adjust for the number of outcomes within each group independently. For illustration purposes, we focus on one outcome group, attendance, which we will assume contains five separate outcomes.
3 Overview of multiple testing
Our motivating example illustrates that researchers are often interested in testing the effectiveness of a single intervention on multiple outcomes. The resulting multiplicity of statistical hypothesis tests can lead to spurious findings of effects. (Testing the effectiveness of an intervention for multiple subgroups, at multiple points in time, or across multiple treatment groups also results in a multiplicity of statistical hypotheses and can likewise lead to spurious findings of effects, but this is beyond the scope of this paper.) Multiple testing procedures counteract this problem by adjusting p-values for effect estimates; generally, p-values are adjusted upward to require a higher burden of proof. When not using a MTP, the probability of finding false positives increases, sometimes dramatically, with the number of tests. When using a MTP, this probability is reduced. Much of the following explanation is borrowed from or parallels discussion found in [7].
We first remind the reader of how raw, or unadjusted, p-values are calculated in this setting, before introducing adjustments due to multiple testing. Consider a researcher interested in testing the impact of an intervention on $M$ outcomes. In our running example of the Diplomas Now study, if we had five outcomes in the attendance group, we would have $M = 5$. We apply a frequentist hypothesis testing framework and frame impacts in terms of effect sizes ($ES_m$). For outcome $m$, we can test a null hypothesis of no effect, $H_{0m}: ES_m = 0$, against an alternative hypothesis $H_{1m}: ES_m \neq 0$ for a two-sided test, or $H_{1m}: ES_m > 0$ or $H_{1m}: ES_m < 0$ for a one-sided test. A significance test, such as a two- or one-sided $t$-test, would then typically be driven by a test statistic given by

$$t_m = \frac{\widehat{ES}_m}{SE(\widehat{ES}_m)}, \tag{1}$$

where $SE(\widehat{ES}_m)$ is the standard error of the estimated effect size. A raw p-value is then computed as the probability of a test statistic at least as extreme as the one observed, given that the null hypothesis is true. The term “raw” is used to denote unadjusted p-values, in contrast to values that have been adjusted using a MTP. For a two-sided test, the raw p-value for test $m$ is $p_m = \Pr(|t| \geq |t_m|)$ under the null distribution. The PUMP package allows for either one-sided or two-sided tests, but we proceed assuming two-sided tests going forward. (For a one-sided test, depending on the direction of the alternative hypothesis, the raw p-value for test $m$ is computed as $\Pr(t \geq t_m)$ or $\Pr(t \leq t_m)$.)
When testing a single hypothesis under this framework, “researchers typically specify $\alpha$, the maximum acceptable probability of making a Type I error. A Type I error is the probability of erroneously rejecting the null hypothesis when it is true. The quantity $\alpha$ is also referred to as the significance level. If $\alpha = 0.05$, then the null hypothesis is rejected if the p-value is less than 0.05” [7].
In contrast, “when one tests multiple hypotheses under this framework (such that $M > 1$) and one conducts a separate test for each of the $M$ hypotheses with $\alpha = 0.05$, there is a greater than 5% overall chance of a false positive finding in the study. If the multiple tests are independent, the probability that at least one of the null hypothesis tests will be erroneously rejected is

$$1 - (1 - \alpha)^M.$$

Therefore, if researchers are estimating effects on three outcomes (and if these outcomes are independent), the probability of at least one false positive finding is $1 - (0.95)^3 \approx 14\%$. If the researchers were instead estimating effects on five independent outcomes, the probability of at least one false positive finding rises to about 23%. This Type I error inflation for independent outcomes demonstrates the crux of the multiple testing problem. In practice, however, the multiple outcomes are usually at least somewhat correlated, which makes the test statistics correlated and reduces the extent of Type I error inflation. Nonetheless, any error inflation can still make it problematic to draw reliable conclusions about the existence of effects above a specified size” [7].
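These inflation figures follow directly from the formula above; for instance, a quick check in R:

```r
# Chance of at least one false positive across M independent tests at alpha = 0.05
alpha <- 0.05
sapply(c(1, 3, 5), function(M) 1 - (1 - alpha)^M)
# roughly 0.05, 0.14, and 0.23
```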
3.1 Using MTPs to protect against spurious impact findings
As introduced above, multiple testing procedures adjust p-values to counteract the multiple testing problem. (Alternatively, MTPs can decrease the critical values for rejecting hypothesis tests; for ease of presentation, this paper focuses only on the approach of adjusting p-values.) We next describe how using a MTP protects against false positives.
Considering multiple outcomes presents both challenges and opportunities. First, we discuss the impact of MTPs on individual power. The power of an individual hypothesis test is the probability of correctly rejecting a null hypothesis when the effect is at least a specified size. We refer to a setting in which the true impact is at least as large as the desired effect size as a “false null” hypothesis, while a “true null” is a setting in which the true impact is zero. In the explanations that follow, when we refer to rejecting a null hypothesis or detecting an effect, we assume that we are detecting an effect of a certain pre-specified size. If p-values are adjusted upward, one is less likely to reject true nulls, which reduces the probability of Type I errors, or false positive findings. At the same time, MTPs increase the probability of a Type II error, or false negative finding, in which the test fails to reject a false null. Individual power is one minus the probability of a Type II error, so MTPs have the tradeoff of reducing Type I errors but also reducing individual power.
Next, we consider the impact of multiple outcomes on other definitions of power. Applying a MTP reduces power according to all definitions of power relative to the case when no MTP is applied to adjust p-values. However, as discussed previously, having multiple outcomes also allows for a wider variety of definitions of success. Recall that 1-minimal power is the probability of detecting an effect on at least one outcome. Typically, 1-minimal power, even after applying a MTP, is higher than individual power for a hypothesis test on a single, pre-specified outcome. Depending on the study, other definitions of power, such as d-minimal power for larger d or complete power, may or may not be higher than the power of a single hypothesis test.
The MTPs that are the focus of this paper have three key features that affect statistical power: (1) whether the MTP is a familywise procedure or a false discovery rate procedure; (2) whether the MTP is single-step or stepwise; and (3) whether the MTP takes the correlation between test statistics into account. Below we explain each of these features of MTPs and provide discussion of the new parameter specifications they require when estimating power.
3.2 Familywise error rate vs. false discovery error rate
Familywise procedures “reframe Type I error as a rate across the entire set or ‘family’ of multiple hypothesis tests. This rate is called the familywise error rate (FWER) [15]. The FWER is typically set to the same value as the probability of a Type I error for a single test, e.g., $\alpha = 0.05$. MTPs that control the FWER at level $\alpha$ adjust p-values in a way that ensures that the probability of at least one Type I error across the entire set of hypothesis tests is no more than $\alpha$. The MTPs introduced by Bonferroni [16, 17], Holm [18], and Westfall and Young [19] all control the FWER” [7].
MTPs that control the false discovery rate (FDR) take an entirely different approach to the multiple testing problem. The FDR, introduced by [20], is a less stringent criterion than the FWER: it is the expected proportion of all rejected hypotheses that are erroneously rejected. As laid out in [7], the two-by-two representation in Table 1 is often found in articles on multiple hypothesis testing, and helps to illustrate the difference between the FWER and the FDR. Let $M$ be the total number of tests. We then have $M$ unobserved truths: whether each null hypothesis is true or false. We also have $M$ observed decisions: whether or not each null hypothesis was rejected because its p-value was less than $\alpha$. In Table 1, $A$, $B$, $C$, and $D$ are the counts of the four possible scenarios: the numbers of true or false null hypotheses that were not rejected or rejected. $M_0$ and $M_1$ are the unobservable numbers of true null and false null hypotheses. $R$ is the number of null hypotheses that were rejected, and $M - R$ is the number of null hypotheses that were not rejected.
Table 1: Observed decisions (columns) versus unobserved truths (rows) across $M$ hypothesis tests.
| Unobserved truths | Number not rejected | Number rejected | Total |
| Number of true null hypotheses | $A$ | $B$ | $M_0$ |
| Number of false null hypotheses | $C$ | $D$ | $M_1$ |
| Total | $M - R$ | $R$ | $M$ |
In Table 1, $B$ is the number of erroneously rejected null hypotheses, or the number of false positive findings. Therefore, the FWER is equivalent to $\Pr(B \geq 1)$, the probability of at least one false positive finding. As noted in [7], “recall that Type I error is inflated when testing for effects on $M$ outcomes when no MTPs are applied. Consider the setting when all the outcomes are independent of each other. The Type I error is almost 10% when testing effects on two independent outcomes and about 23% when testing effects on five independent outcomes. These Type I error rates both correspond to the FWER. The goal of MTPs that control the FWER is to bring these percentages back down to 5%.”
Also in Table 1, the FDR is equal to $E(B/R)$, where $B/R$ is defined to be 0 when $R = 0$, i.e., when no hypotheses are rejected. “As is frequently noted in the literature (e.g., [21, 14]), the FWER and FDR have different objectives. Control of the FWER protects researchers from spurious findings and so may be preferred when even a single false positive could lead to the wrong conclusion about the effectiveness of an intervention. On the other hand, the FDR is more lenient with false positives” [7]. Researchers may be willing to accept a few false positives, $B$, when the total number of rejected hypotheses, $R$, is large. Note that under the complete null hypothesis that all null hypotheses are true, the FDR is equal to the FWER. Referring back to Table 1, under the complete null we have $B = R$, so

$$\text{FDR} = E(B/R) = \Pr(R \geq 1) = \Pr(B \geq 1) = \text{FWER}.$$

However, if any effects truly exist, then FWER $\geq$ FDR. As a result of the difference in objective between the FWER and the FDR, in the case where there is at least one false null hypothesis (at least one true effect), a MTP that controls the FDR at $\alpha$ will have a Type I error rate (FWER) that is greater than $\alpha$.
A side remark is that MTPs may provide either “weak control” or “strong control” of the error rate they target. A MTP “provides weak control of the FWER or the FDR at level $\alpha$ if the control can only be guaranteed when all null hypotheses are true, e.g., when the effects on all outcomes are zero. A MTP provides strong control of the FWER or FDR at level $\alpha$ if the control is guaranteed when some null hypotheses are true and some are false, e.g., when there may be effects on at least some outcomes. Of course, strong control is preferred” [7]. (The single-step and step-down Westfall-Young MTPs, which we discuss below, always provide at least weak control of the FWER. In order for these procedures to provide strong control of the FWER, they require the assumption of subset pivotality [22]. The distribution of the unadjusted test statistics or p-values is said to have subset pivotality if, for any subset of null hypotheses, the joint distribution of the test statistics or p-values for the subset is identical to the distribution under the complete null. A consequence of this assumption is that the permutation of test statistics or p-values can be done under the complete null hypothesis rather than under the unknown partial null hypothesis [22].)
3.3 Single-step vs. stepwise procedures
An additional feature of a MTP that affects its statistical power is whether it is a “single-step” or “stepwise” procedure. “Single-step procedures adjust each p-value independently of the other p-values. For example, the Bonferroni MTP multiplies all raw p-values by $M$, the number of tests. Therefore, one p-value adjustment does not depend on other p-value adjustments, only on the number of tests” [7].
In contrast, “stepwise procedures first order raw p-values (or test statistics), and then adjust according to the order of the tests. The adjustments depend on the null hypotheses already rejected in previous steps. For example, the Holm MTP — the stepwise counterpart to the Bonferroni MTP — orders raw p-values from smallest to largest. The procedure then multiplies the smallest p-value by $M$, the second smallest p-value by $M - 1$, and so on. The Holm MTP, like most other stepwise procedures, also enforces monotonicity: each adjusted p-value is greater than or equal to the previous adjusted p-value, and no adjusted p-value is greater than one. Overall, stepwise MTPs allow for less adjustment than single-step MTPs in later steps, and therefore preserve more power (for outcomes in the later steps). The Bonferroni and the single-step Westfall-Young procedure are single-step; the Holm and Benjamini-Hochberg MTPs and the step-down Westfall-Young procedure are stepwise” [7]. Note that stepwise procedures may be “step-down” or “step-up,” referring to whether a procedure begins with the smallest p-value, and thus the largest effect size (step-down), or the largest p-value (step-up).
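The contrast between single-step and stepwise adjustment is easy to see with p.adjust() from the stats package; the raw p-values below are made up purely for illustration:

```r
# Five hypothetical raw p-values for M = 5 tests
p_raw <- c(0.001, 0.011, 0.02, 0.04, 0.30)

p.adjust(p_raw, method = "bonferroni") # single-step: every p-value multiplied by 5 (capped at 1)
p.adjust(p_raw, method = "holm")       # step-down: multipliers 5, 4, 3, 2, 1, with monotonicity enforced
p.adjust(p_raw, method = "BH")         # step-up procedure controlling the FDR
```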
Due to the dependencies of adjustments in stepwise MTPs, a new assumption must be considered when estimating power under multiplicity: the proportion of outcomes on which there are truly impacts, or, equivalently, the number of false null hypotheses. “Researchers may be inclined to assume that there will be effects on all outcomes, as hypotheses of effects probably drive the selection of outcomes in the first place.(…) However, if the researchers are incorrect, the probability of detecting the extant effects can be diminished, sometimes substantially” [7].
As noted in [7], it is important to point out that under the assumption that some effects are truly null, we must change our notion of power for minimal powers (e.g., 1-minimal power, 2-minimal power, etc.) and complete power. While individual power is defined based on the probability of correctly rejecting false nulls, the definition is loosened here to include the probability of erroneous rejections of true nulls. For example, d-minimal power is defined as the probability of detecting effects on at least $d$ of the total $M$ outcomes, regardless of the number of outcomes with true effects. That is, d-minimal power is not defined as the probability of detecting effects only among the outcomes on which effects truly exist. This reframing of power is necessary for power to be consistent. If d-minimal power were defined based on false nulls, then the value and interpretation would change depending on what assumption the researcher is making about the number of false nulls, which is an unknown quantity. For example, with $M = 5$ outcomes, the probability of detecting at least one effect would be very different depending on whether we assume all five outcomes are false nulls or only two of them are. Complete power, which is the probability of detecting effects on all outcomes, has similar issues. We define complete power only in the context where all effects are assumed to be false nulls — if any outcomes are assumed to be true nulls, then complete power is undefined.
There is an additional technical note about the calculation of complete power. (Complete power has also been referred to as “conjunctive power” [23] and “all pairs power” [24].) To calculate complete power, we do not need to adjust the p-values, and can instead reject each individual test based on the unadjusted p-values. Complete power is the power of the omnibus test constructed by whether or not we reject all the null hypotheses. This test was originally introduced as the intersection-union test because the null hypothesis is expressed as a union and the alternative hypothesis is expressed as an intersection [25, 26]. [25] showed that if all the individual tests are level $\alpha$, the intersection-union test is also a level $\alpha$ test. To provide some intuition, we do not need to adjust p-values for complete power because it is a special case where we must reject all the hypothesis tests; there is no way for the omnibus test to be rejected by chance because of a favorable configuration [3]. For example, consider four tests, with two false nulls and two true nulls. If we consider 3-minimal power, we just need one of the two true nulls to be rejected by chance alone, and there are two ways for this to occur. For complete power, there is only one way for us to reject all of the nulls. The downside of an intersection-union test is that it is conservative: the FWER is generally less than $\alpha$. For example, if we have two independent tests with Type I error $\alpha = 0.05$ and both are true nulls, the probability of a Type I error for the omnibus test (the probability of rejecting both null hypotheses) is $\alpha^2 = 0.0025$ [27].
3.4 Correlation between test statistics
The final feature of a MTP that affects its statistical power is whether or not it takes into account the correlation between test statistics. “The Bonferroni and Holm procedures strongly control the FWER in all cases, even when the test statistics are correlated, but they adjust p-values more than is necessary in that case. Along with the Bonferroni and Holm MTPs, the Benjamini-Hochberg MTP also does not take correlations into account. (The Benjamini-Hochberg procedure was originally shown to control the FDR for independent test statistics. However, [28] showed that it also controls the FDR for true null hypotheses with ‘positive regression dependence.’ This condition is satisfied for most applications in practice.) In contrast, both of the Westfall-Young MTPs rely on the estimation of the joint distribution of test statistics when the complete null hypothesis (that there are no effects on any of the outcomes) is true. This joint distribution of the test statistics is estimated from the study’s data” [7]. For example, random permutations of the treatment indicator break the association between treatment status and outcome. Repeating these permutations a large number of times results in a distribution of test statistics under the complete null. “Because the actual data are used to generate this null distribution, correlations among the test statistics are captured. Then observed test statistics can be compared with the distribution of test statistics under the complete null hypothesis” [7]. (Instead of using test statistics, the Westfall-Young MTPs can alternatively compare raw p-values with the estimated joint null distribution of p-values.)
The correlation between test statistics is a parameter a researcher must specify in order to estimate power, MDES, or sample size requirements when using a MTP. When fitting a separate regression model for the impact on each outcome, the correlations between test statistics are equal to the “pairwise correlations between the residuals in the impact models” [7]. Then, “if there are no covariates in the impact models, or if the $R^2$’s of the covariates are equivalent in all impact models, then the correlations between the test statistics are equal to the correlations between the outcomes. However, having different $R^2$’s across the impact models reduces the correlations between the residuals and therefore between the test statistics. (For example, one of the multiple outcomes may have a baseline covariate with a high $R^2$ while another may have a baseline covariate with a smaller $R^2$. Also, block dummies may explain more variation in some outcomes than in others.) Models of outcomes that are highly correlated are more likely to have residuals that are highly correlated, because baseline covariates will tend to have similar $R^2$’s. The gaps between the correlations between outcomes and the correlations between residuals — and therefore the test statistics — may be wider for moderately or weakly correlated outcomes. In any case, the upper bounds of the correlations between the test statistics are the correlations between the outcomes” [7].
4 Estimating power, MDES and sample size in studies with multiple outcomes
4.1 Power estimation approach
We take an innovative simulation-based approach to estimating power, as introduced in [7]. This approach is then also applied to estimate MDES and sample size. In order to estimate power for a single outcome, we can often use closed-form algebraic expressions, which are derived from the assumed model. However, with multiple outcomes, finding such expressions can be quite difficult, or even impossible. In cases where it is possible to find a closed-form expression, we would need to find expressions for every design and model, MTP, and definition of power. Importantly, we would also need to find new expressions for any possible number of outcomes, which quickly becomes an intractable problem. Furthermore, in some cases, such as permutation-based procedures like Westfall-Young approaches, a closed-form solution does not exist. To avoid these complexities, we rely on simulation to calculate estimated power. The approach outlined below can estimate power for any scenario.
If we were to rely on a full simulation approach, we could use the following method to estimate power. We introduce this full simulation approach to provide intuition, but use a simplified and far less computationally intensive approach in the package, as discussed below.
1. Simulate a data sample according to the joint alternative hypothesis. First, we formulate what we will refer to as the joint alternative hypothesis, which is the set of outcomes we assume to have treatment effects above the desired size. We define $ES_m$ to be the treatment effect for outcome $m$, with $M$ total outcomes. If we have $M = 5$ outcomes, as in the Diplomas Now study, one possible joint alternative hypothesis is that all five outcomes have effects above the specified sizes. Another possible joint alternative hypothesis is one where only the first two outcomes have effects above the desired sizes, with the remaining three effects equal to zero. Once our joint alternative hypothesis is specified, we would generate simulated data under this hypothesis. To simulate data, we need to specify the full set of parameters, as mentioned in Section 5.3, that allow for data generation. The Technical Appendix contains more details about the assumed data-generating process. For example, for the Diplomas Now experiment, we would assume a specific data-generating process to allow us to simulate synthetic students, schools, and districts, including covariates, outcomes, and treatment assignment. This process would involve specifying parameter values such as $R^2$, the amount of outcome variation explained by covariates at a particular level, and translating these parameter choices into data-generating parameters, such as the coefficient values for covariates in a linear model.
2. Estimate impacts on the simulated data. Given simulated data, we could fit regression models (specified to match the experimental design and model assumptions). For the models supported by PUMP, the relevant functions would be lm(), lmer() from the lme4 library [29], and interacted_linear_estimators() from the blkvar library. (This package is currently under development on GitHub; see https://github.com/lmiratrix/blkvar.) From the model output we extract the test statistics for the estimated impacts, one statistic for each outcome, along with estimated standard errors.
3. Calculate unadjusted p-values. The test statistics and standard errors would in turn give raw (unadjusted) p-values. We can either calculate these by hand, or use the p-values routinely returned by regression functions. For Diplomas Now, we could run a regression model of each attendance measure on treatment status and student and school covariates, and extract p-values from the regression outputs.
4. Repeat the above steps (1 through 3) for a large number of iterations. Denote the number of iterations tnum. Repeating steps 1 through 3 tnum times results in a matrix of unadjusted p-values, which we call $P^{\text{raw}}$, of dimension tnum $\times$ $M$. One row corresponds to one set of simulated raw p-values from regressions for the attendance outcomes of interest for Diplomas Now.
5. Adjust p-values. For each row, corresponding to one simulated dataset, the $M$ raw p-values corresponding to the $M$ hypothesis tests can be adjusted according to the desired multiple testing procedure. This process generates a new matrix of adjusted p-values, $P^{\text{adj}}$. For Bonferroni, Holm, and Benjamini-Hochberg adjustments, we use the function p.adjust in R (found in the stats package). We developed our own functions for implementing adjustment using the Westfall-Young procedures. One row corresponds to one set of simulated adjusted p-values for the attendance outcomes of interest for Diplomas Now.
6. Calculate hypothesis rejection indicators. For any MTP, the matrix of adjusted p-values can then be compared with a specified value of $\alpha$ (the default is $\alpha = 0.05$, but the value can be changed by the user). For each row, corresponding to one iteration of simulated data, we record whether or not the null hypothesis was rejected for each outcome. This process results in a new matrix $I$, which contains hypothesis rejection indicators (still of dimension tnum $\times$ $M$). Using $I$, we can compute all definitions of power.
7. Calculate power. To compute the different definitions of power:
• Individual power for outcome $m$ is the proportion of the tnum rows in which the null hypothesis for outcome $m$ was rejected (the mean of column $m$ of $I$). We would have five different individual power values for Diplomas Now, corresponding to each outcome of interest.
• d-minimal power is the proportion of the rows in which at least $d$ of the hypotheses were rejected. (Note that others refer to 1-minimal power simply as “minimal power” [30, 3, 6], “disjunctive power” [23], or “any pair” power [24]. [3] use the terminology of “r-power” for what is referred to here as d-minimal power.) For Diplomas Now, we could consider 1-minimal power, 2-minimal power, and so on.
• Complete power is the proportion of the rows in which all of the null hypotheses were rejected based on the raw p-values rather than the adjusted p-values (based on the matrix $P^{\text{raw}}$ rather than $P^{\text{adj}}$). We would be interested in complete power if we want to evaluate whether Diplomas Now resulted in improvement for every single attendance outcome of interest. With five outcomes, this criterion is a relatively strict indicator of success.
Above, we outlined a full simulation-based approach for calculating power. This approach would be computationally intensive because of the need to generate and analyze a full simulated dataset at each iteration. We can simplify this process by skipping the simulation of data and the modeling steps. Given an assumed model and correlation structure for the test statistics, we can directly sample from the joint distribution of the test statistics under the alternative hypothesis. This shortcut vastly improves both the simplicity and the speed of computation. In summary, our approach is:
1. Generate draws of test statistics under the joint alternative hypothesis. This step produces a tnum $\times$ $M$ matrix of test statistics.
2. Calculate unadjusted p-values. This produces the matrix $P^{\text{raw}}$, as in the procedure above.
3. Adjust p-values. This produces the matrix $P^{\text{adj}}$, as in the procedure above.
4. Calculate hypothesis rejection indicators. This produces the matrix $I$, as in the procedure above.
5. Calculate power.
We now describe how to sample these test statistics directly. First, we assume a particular research design and model. In our example based on the Diplomas Now study, the research design is a three-level experiment, with randomization at level two. We plan to analyze our data with a linear regression model with fixed intercepts at the district level, random intercepts at the school level, and a constant treatment effect across schools and districts. As previously, denote by $ES_m$ the treatment effect for outcome $m$. We express treatment effects in terms of effect sizes:

$$ES_m = \frac{\tau_m}{\sigma_m},$$

where $\tau_m$ is the average impact on outcome $m$ in raw units and $\sigma_m$ is the standard deviation of outcome $m$ in the control group. In order to calculate power, we also need the standard error of the estimated impact in effect size units, which we denote as $SE(\widehat{ES}_m)$.
The quantity $SE(\widehat{ES}_m)$ is a consequence of the assumed model, the number of units at different levels, the percent of units treated, the assumed $R^2$ of covariates, and other parameters; our technical appendix shows formulae for $SE(\widehat{ES}_m)$ for all the designs and models our package supports. In our Diplomas Now example, $SE(\widehat{ES}_m)$ will be a function of the number of students, schools, and districts; the proportion of treated units; the number of student and school covariates; the explanatory power of the student and school covariates; the proportion of variation in the outcome explained by schools and districts; and the amount of impact variation relative to the amount of mean variation. Some parameters, such as the percent of units treated, will generally be known, while others, such as the $R^2$ values at different levels, would need to be supplied by the user through either estimation on pilot data or assumptions based on prior knowledge.
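As a point of reference (this is not the formula for the three-level Diplomas Now design itself), the standard error in effect size units for a simple two-level cluster-randomized design with $J$ clusters of average size $\bar{n}$, proportion treated $\bar{T}$, intraclass correlation $\text{ICC}$, and covariate explanatory power $R^2_1$ and $R^2_2$ takes the familiar form used in the PowerUp! literature [8]:

$$SE(\widehat{ES}) = \sqrt{\frac{\text{ICC}\,(1 - R^2_2)}{\bar{T}(1-\bar{T})\,J} + \frac{(1 - \text{ICC})(1 - R^2_1)}{\bar{T}(1-\bar{T})\,J\,\bar{n}}}.$$

The expressions for the three-level designs supported by PUMP have the same flavor, with additional variance components; see the Technical Appendix for the exact formulae.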
Given the effect sizes $ES_m$ and the standard errors $SE(\widehat{ES}_m)$, we can determine the distribution of the vector of test statistics. When testing the hypothesis $H_{0m}: ES_m = 0$ for outcome $m$, the test statistic for a $t$-test is

$$t_m = \frac{\widehat{ES}_m}{SE(\widehat{ES}_m)},$$

with degrees of freedom $df$ also defined by the assumed model. Under the alternative hypothesis for outcome $m$, $t_m$ has a $t$ distribution with $df$ degrees of freedom and mean $ES_m / SE(\widehat{ES}_m)$. Finally, in addition to the parameters above, we also need to choose the correlation matrix between the test statistics in order to sample from the joint distribution of $(t_1, \dots, t_M)$. With these distributions specified, we can calculate p-values.
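To make the shortcut concrete, the following stand-alone sketch mimics the core computation in plain R: draw correlated test statistics under the joint alternative, convert them to p-values, adjust, and tabulate several definitions of power. The use of the mvtnorm package, the Holm adjustment, and every parameter value below are illustrative assumptions; PUMP's internal implementation differs in its details.

```r
library(mvtnorm)

set.seed(1234)
M     <- 5             # number of outcomes
tnum  <- 10000         # number of simulation draws
df    <- 40            # degrees of freedom implied by the assumed design (illustrative)
alpha <- 0.05
ES    <- rep(0.10, M)  # assumed effect sizes
SE    <- rep(0.04, M)  # assumed standard errors in effect size units (illustrative)
rho   <- 0.40          # assumed correlation between test statistics
Sigma <- rho + (1 - rho) * diag(M)   # exchangeable correlation matrix

# Draw test statistics: t-distributed noise shifted by ES/SE under the alternative
tstats <- rmvt(tnum, sigma = Sigma, df = df) +
  matrix(ES / SE, nrow = tnum, ncol = M, byrow = TRUE)

# Two-sided raw p-values, then Holm adjustment row by row
p_raw <- 2 * pt(-abs(tstats), df = df)
p_adj <- t(apply(p_raw, 1, p.adjust, method = "holm"))

reject_adj <- p_adj < alpha
reject_raw <- p_raw < alpha

colMeans(reject_adj)            # individual power for each outcome
mean(rowSums(reject_adj) >= 1)  # 1-minimal power
mean(rowSums(reject_adj) >= 2)  # 2-minimal power
mean(rowSums(reject_raw) == M)  # complete power (uses raw p-values)
```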
Note that this approach of simulating test statistics builds on work by [31], who use simulated test statistics to identify critical values based on the distribution of the maximum test statistic. Their approach produces the same estimates as the approach described here for the single-step Westfall-Young MTP. As an alternative to a simulation-based approach, [3] derived explicit formulae for minimal powers of stepwise procedures and for complete power of single-step procedures, but only for small numbers of tests. The approach presented here is more generally applicable, as it can be used for all MTPs, for any number of tests, and for all definitions of power discussed in the present paper.
Remark. The p-value adjustment using the Westfall-Young procedures is the most complex correction procedure, so we briefly outline it here. Similar to above, we first explain a full simulation approach, and then discuss our simplification. Under a full simulation approach, we would first generate a single dataset under the joint alternative hypothesis and calculate a set of observed test statistics. Then, we would permute the single simulated dataset, say $B$ times, assuming the joint null hypothesis, and calculate test statistics on each of these permuted datasets. This process generates an empirical distribution of test statistics under the joint null. Next, we compare the observed test statistics to the generated distribution of test statistics under the joint null to calculate adjusted p-values. We would then re-generate a new simulated dataset and repeat the process. If we were to generate tnum datasets under the joint alternative hypothesis, and for each of these datasets generate $B$ permuted datasets under the joint null, we would have to generate tnum $\times$ $B$ datasets!
When we skip the step of simulating data, then for each of the tnum iterations we first generate a set of observed test statistics from the joint alternative distribution. Then, we draw $B$ samples of test statistics under the joint null rather than permuting the data $B$ times. Under the null hypothesis, $t_m$ has a $t$ distribution with $df$ degrees of freedom and mean zero. As before, we then compare the observed test statistics to the distribution of test statistics under the joint null to calculate adjusted p-values. The Westfall-Young procedures are computationally intensive, so the approach of skipping the simulated-data step is particularly helpful here; it substantially reduces computation time by drawing test statistics directly rather than permuting the data.
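As a minimal sketch of the single-step idea (a maxT-style adjustment, reusing the illustrative objects from the previous code block, not the package's internal code), the adjusted p-value for each outcome is the share of null draws whose most extreme test statistic exceeds that outcome's observed statistic:

```r
# One draw of "observed" test statistics under the alternative (from `tstats` above)
t_obs <- tstats[1, ]

# B draws of test statistics under the joint null: same correlation, mean zero
B      <- 5000
t_null <- rmvt(B, sigma = Sigma, df = df)

# Single-step Westfall-Young-style adjusted p-values (maxT form)
max_null <- apply(abs(t_null), 1, max)
p_wy_ss  <- sapply(abs(t_obs), function(t_m) mean(max_null >= t_m))
p_wy_ss
```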
4.2 Determining MDES and sample size
Frequently, a researcher’s main concern with power is calculating either the MDES for each outcome in a given study, or determining the necessary sample size to achieve a target power given a specified set of MDES values. In Diplomas Now, we might want to know what sample sizes we would need to detect at least one significant effect across our outcomes if all the outcomes had a specified effect size (corresponding to 1-minimal power) and we were planning on using the Holm procedure.
For pump_mdes() and pump_sample(), the user provides a particular target power, say 80%. The method then solves a stochastic optimization problem to determine a value (of sample size or MDES) that is within a specified tolerance of the target power with high probability. We discuss the algorithm for MDES, although the approach for determining sample size is the same.
The algorithm first determines an initial range of MDES values that likely contains the target MDES. This initial range is calculated using formulae for unadjusted power based on the standard errors and degrees of freedom. In particular, from [8], the MDES for a single outcome can in general be estimated as

$$\text{MDES} = M_{df} \times SE(\widehat{ES}),$$

where $M_{df}$, the “multiplier,” is the sum of two $t$ statistics with $df$ degrees of freedom. For one-tailed tests, $M_{df} = t_{\alpha} + t_{1-\beta}$, where $\alpha$ is the Type I error rate and $1 - \beta$ is the desired power. For two-tailed tests, $M_{df} = t_{\alpha/2} + t_{1-\beta}$. We do not explain the details of the derivations of the multiplier here; for more details, see [8] or [12]. These expressions can be further manipulated to obtain sample size formulae; see our technical appendix for all formulae used in the package.
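For instance, the two-tailed multiplier and the implied single-outcome MDES can be computed directly; the degrees of freedom and standard error below are arbitrary illustrative values:

```r
alpha <- 0.05
power <- 0.80
df    <- 40     # illustrative degrees of freedom
SE    <- 0.04   # illustrative standard error in effect size units

multiplier <- qt(1 - alpha / 2, df) + qt(power, df)  # t_{alpha/2} + t_{1-beta}
MDES <- multiplier * SE
MDES  # about 2.87 * 0.04, i.e., roughly 0.11
```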
We can calculate our initial bounds by manipulating the $\alpha$ and $1-\beta$ values in the expression above. First, to calculate the preliminary lower bound, we apply the formula as given, since individual, unadjusted power gives the smallest MDES; to calculate the preliminary upper bound, we apply the formula using $\alpha/M$, corresponding to a Bonferroni correction. We also adjust the bounds to account for different definitions of power. For example, if we are interested in complete power, we need a larger upper bound than for individual power: to achieve a complete power of 80%, each outcome would need an individual power greater than 80% (under independence, $(0.80)^{1/M}$). If we are interested in 1-minimal power, we can use a smaller lower bound: to achieve 1-minimal power of 80%, each outcome would only need a lower individual power (under independence, $1 - (1 - 0.80)^{1/M}$). We ignore correlation when setting the initial bounds; the bounds do not need to be strict, given the adaptive nature of the subsequent search.
Once the initial range is established, we use pump_power() with the complete array of design parameters, including the correlation between test statistics, to obtain rough estimates of power (using a small tnum, or number of simulation trials) for five initial values across this range. We then fit a scaled logistic curve to these five points and identify where the curve crosses the desired power level. After fitting an initial curve, we iterate, repeatedly calculating power for the targeted point and using the result to update the logistic curve model. At any point, if the current fitted curve’s range does not contain the target power, the algorithm extrapolates beyond the initial bounds for the next step. With each iteration we increase tnum to increase precision as we narrow in on the final answer; with each update to our estimated power curve, we weight the collected observations by their precisions (determined by their corresponding tnum values). Once a test point achieves the target power to within tolerance, we conduct an additional simulation check with a high number of replicates to verify that the proposed answer is within the specified tolerance of the target power; if it is not, we continue the iterative search. For example, with a tolerance of 0.01 and a target power of 0.80, we stop when we find an MDES that gives an estimated power between 0.79 and 0.81.
In practice, due to the monotonic nature of the logistic functional form, our algorithm generally converges fairly rapidly. However, in certain corner cases the algorithm may fail to converge on a value within tolerance. For more information on applying the search algorithm, see the sample size vignette.
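In package terms, the Diplomas Now question posed at the start of this subsection might be answered with a call like the following sketch. Argument names such as target.power and power.definition follow the package documentation, and the parameter values (including the 80% target and the "min1" power definition code) are illustrative assumptions.

```r
# Illustrative: smallest MDES detectable with 80% 1-minimal power under Holm,
# for five attendance outcomes. Values are placeholders, not study estimates.
mdes <- pump_mdes(
  d_m   = "d3.2_m3fc2rc", MTP = "HO", M = 5,
  target.power = 0.80, power.definition = "min1",
  nbar  = 50, J = 3, K = 21, Tbar = 0.50, alpha = 0.05,
  numCovar.1 = 1, numCovar.2 = 1,
  R2.1  = 0.1, R2.2 = 0.5, ICC.2 = 0.05, ICC.3 = 0.40,
  rho   = 0.40
)
mdes
```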
4.3 Package Validation
We completed extensive validation checks to ensure our power calculation procedures are correct. First, we compared our power estimates in scenarios with only one outcome ($M = 1$) to those from the PowerUpR package. Without a multiple testing procedure adjustment, our estimates match. Second, in order to validate our estimates under multiplicity, we followed the full simulation approach outlined above in Section 4.1. The simulation approach involves generating many iterations of full datasets according to the assumed design and model, calculating p-values, and calculating an empirical estimate of power. Using a binomial distribution, we constructed Monte Carlo confidence intervals for the power estimates from the full simulation approach. Then, we validated that the PUMP estimates fall within these confidence intervals.
A more detailed explanation of the validation procedure can be found in the Appendix, and full validation code and results are in a supplementary GitHub repository, pump_validate. For some scenarios there are apparent discrepancies from PowerUp!, but these result from different modeling choices. For example, for certain models PowerUp! assumes the intraclass correlation is zero, while we allow for nonzero values. These discrepancies are noted in the appendix.
5 User choices
5.1 Designs and models
When planning a study, the researcher first has to identify the design of the experiment, including the number of levels and the level at which randomization occurs. These decisions can be driven by the realities of the context (e.g., the treatment must be applied at the school level, and students are naturally nested in schools, making for a cluster randomization) or be deliberate (e.g., the researcher groups similar schools to block their experiment in an attempt to improve power). Second, based on the design and the inferential goals of the study, the researcher chooses an assumed model, including whether intercepts and treatment effects should be treated as constant, fixed, or random. For the same experimental design, the analyst can sometimes choose from a variety of possible models, and these two decisions should be kept conceptually separate from each other.
The design. The PUMP package supports designs with one, two, or three levels, with randomization occurring at any level. For example, a design with two levels and randomization at level one is a blocked design (or equivalently a multisite experiment), where level two forms the blocks (blocks being groups of units, some of which are treated and some not). Ideally, the blocks in a trial will be groups of relatively homogenous units, but frequently they are a consequence of the units being studied (e.g., evaluations of college supports, with students, the units, nested in colleges, the blocks). A design with two levels and randomization at level two is commonly called a cluster design (e.g., a collection of schools, with treatment applied to a subset of the schools, with outcomes at the student level); here the schools are the clusters, with a cluster being a collection of units which is entirely treated or entirely not. We can also have both blocking and clustering: randomizing schools within districts, creating a series of cluster-randomized experiments, would be a blocked (by district), cluster-randomized experiment, with randomization at level two.
The model. Given a design, the researcher can select a model via a few modeling choices. In particular, for each level beyond the first, the researcher has to make decisions about the intercepts and the treatment impacts:
• Whether level two and level three intercepts are:
  – fixed: we have a separate intercept for each unit.
  – random: we have a separate intercept for each unit as above, but model the collection of intercepts as Normally distributed, allowing for partial pooling.
• Whether level two and level three treatment effects are:
  – constant: we model all units within a group as having the same single average impact.
  – fixed: we allow each block or cluster within a level to have its own individual estimated impact (we can only do this if we have treated and control units within said block or cluster).
  – random: we allow variation as with fixed, but model the collection of treatment impacts as Normally distributed around a grand mean impact. This implicitly treats the sample as representative of a larger super-population in terms of treatment impact estimation.
We denote the research design by d, followed by the number of levels and the randomization level, so d3.1 is a three-level design with randomization at level one. The model is denoted by m, followed by the level and the assumption for the intercepts, either f (fixed) or r (random), and then the assumption for the treatment impacts, c (constant), f (fixed), or r (random). For example, m3ff2rc means at level three we assume fixed intercepts and fixed treatment impacts, and at level two we assume random intercepts and constant treatment impacts. The full design and model are specified by concatenating these together, e.g., d2.1_m2fc. The Diplomas Now model, for example, is d3.2_m3fc2rc.
The full list of supported design and model combinations is below. The user can see the list by calling pump_info(), which provides the designs and models, MTPs, power definitions, and model parameters. We also include the corresponding names from the PowerUP! package where appropriate. For more details about each combination of design and model, see the Technical Appendix.
d_m | Design | Model | PowerUp | Params |
d1.1_m1c | d1.1 | m1c | n/a | R2.1 |
d2.1_m2fc | d2.1 | m2fc | bira2_1c | R2.1, ICC.2 |
d2.1_m2ff | d2.1 | m2ff | bira2_1f | R2.1, ICC.2 |
d2.1_m2fr | d2.1 | m2fr | bira2_1r | R2.1, ICC.2, omega.2 |
d2.1_m2rr | d2.1 | m2rr | n/a | R2.1, ICC.2, omega.2 |
d2.2_m2rc | d2.2 | m2rc | cra2_2r | R2.1, R2.2, ICC.2 |
d3.1_m3rr2rr | d3.1 | m3rr2rr | bira3_1r | R2.1, ICC.2, omega.2, ICC.3, omega.3 |
d3.2_m3ff2rc | d3.2 | m3ff2rc | bcra3_2f | R2.1, R2.2, ICC.2, ICC.3 |
d3.2_m3fc2rc | d3.2 | m3fc2rc | n/a | R2.1, R2.2, ICC.2, ICC.3 |
d3.2_m3rr2rc | d3.2 | m3rr2rc | bcra3_2r | R2.1, R2.2, ICC.2, ICC.3, omega.3 |
d3.3_m3rc2rc | d3.3 | m3rc2rc | cra3_3r | R2.1, R2.2, ICC.2, R2.3, ICC.3 |
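For example, a quick way to pull this information up in an R session is sketched below; the exact structure of the object returned by pump_info() may differ slightly from what this sketch assumes.

library( PUMP )
info <- pump_info()   # tables of supported designs/models, MTPs, power definitions, and parameters
names( info )         # inspect which components are returned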
5.2 Multiple testing procedures
The supported multiple testing procedures were covered in more detail in Section 3. Here we provide a review of the multiple testing procedures supported by the PUMP package:
•  Bonferroni: adjusts p-values by multiplying them by the number of tests, M, to ensure strong control of the FWER. Bonferroni is a simple procedure, but the most conservative.
•  Holm: a step-down version of Bonferroni. Starting from the smallest and moving to the largest, p-values are sequentially adjusted by different multipliers. Holm is less conservative than Bonferroni for the larger p-values.
•  Benjamini-Hochberg: a sequential, step-up procedure that controls the FDR. Using the BH method, only null hypotheses with p-values below a certain threshold are rejected, where the threshold is determined by the number of tests and the level α.
•  Single-step Westfall-Young: a permutation-based procedure for controlling the FWER, which directly takes into account the joint correlation structure of the outcomes. In the single-step approach, all outcomes are adjusted using the permuted distribution of the minimum p-value. Although Westfall-Young procedures are less conservative while still protecting against false discoveries, they are computationally very intensive.
•  Step-down Westfall-Young: a similar approach to the single-step procedure, except that outcomes are adjusted sequentially from the smallest to the largest p-value according to the permuted distributions of the corresponding sequential p-values.
For a more detailed explanation of each MTP, see Appendix A of [7].
The following table from [7] summarizes the important features for each of the MTPs supported by PUMP.
Procedure | Control | Single-step or stepwise | Accounts for correlation |
Bonferroni (BF) | FWER | single-step | No |
Holm (HO) | FWER | stepwise | No |
Westfall-Young Single-step (WY-SS) | FWER | single-step | Yes |
Westfall-Young Step-down (WY-SD) | FWER | stepwise | Yes |
Benjamini-Hochberg (BH) | FDR | stepwise | No |
5.3 Model parameters
The table below shows the parameters that influence the standard error of the effect-size estimate for the different designs and models.
Parameter | Description |
nbar | harmonic mean of level 1 units per level 2 unit (students per school) |
J | harmonic mean of number of level 2 units per level 3 unit (schools per district) |
K | number of level 3 units (districts) |
Tbar | proportion of units assigned to treatment |
numCovar.1 | number of level 1 (individual) covariates |
numCovar.2 | number of level 2 (school) covariates |
numCovar.3 | number of level 3 (district) covariates |
R2.1 | percent of variation explained by level 1 covariates |
R2.2 | percent of variation explained by level 2 covariates |
R2.3 | percent of variation explained by level 3 covariates |
ICC.2 | level 2 intraclass correlation |
ICC.3 | level 3 intraclass correlation |
omega.2 | ratio of variance of level 2 average impacts to level 2 random intercepts |
omega.3 | ratio of variance of level 3 average impacts to level 3 random intercepts |
A few parameters warrant more explanation.
•  The quantity ICC is the unconditional intraclass correlation, and gives a measure of variation at different levels of the model. For each outcome, the ICC for a given level is defined as the ratio of the variance at that level to the overall variance of the individual outcomes. The ICC includes the variation due to covariates.
•  For each outcome, the quantity omega (ω) for a given level is the ratio between the impact variation at that level and the variation in intercepts (including covariates) at that level. It is a measure of treatment impact heterogeneity.
•  The R2 expressions are the percent of variation at a particular level predicted by covariates specific to that level. For simplicity we assume covariates at a level are group-mean centered, so only covariates at a particular level explain variance at that level.
For precise formulas of these expressions, see the Technical Appendix, which outlines the assumed data-generating process and the resulting expressions for the ICCs, omegas, and R2s.
In addition to design parameters, there are additional parameters that control the precision of the power estimates themselves:
•  tnum is the number of test statistics generated in order to estimate power. A larger number of test statistics results in greater computation time, but also a more precise estimate of power. Note that pump_mdes() and pump_sample() have multiple tnum parameters (start.tnum, tnum, and final.tnum) controlling the precision of the search.
•  B is the number of Westfall-Young permutations. Again, there is a tradeoff between precision and computation time.
•  parallel.WY.cores specifies the number of cores to use for parallel computation of the Westfall-Young step-down procedure, which is the most computationally intensive. The default of 1 does not result in parallel computation. Parallelization is done using parApply from the parallel package.
We illustrate where these parameters enter a call in the sketch below.
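The sketch below shows where these precision parameters enter a call to pump_power(); it uses a hypothetical two-level scenario, and all numeric values are placeholders rather than recommendations.

pp <- pump_power(
    d_m = "d2.1_m2fc",          # two-level design, randomization at level 1
    MTP = "WY-SD",              # step-down Westfall-Young, the most computationally intensive MTP
    MDES = 0.10, M = 3,         # three outcomes, each with an assumed effect size of 0.10
    nbar = 50, J = 20,          # 50 students per school, 20 schools
    Tbar = 0.50, alpha = 0.05,
    numCovar.1 = 1, R2.1 = 0.1,
    ICC.2 = 0.20, rho = 0.4,
    tnum = 5000,                # number of simulated test statistics
    B = 1000,                   # number of Westfall-Young permutations
    parallel.WY.cores = 2 )     # parallelize the Westfall-Young computation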
6 Using the PUMP package
In this section, we illustrate how to use the PUMP package, using our example motivated by the Diplomas Now study. Given the study’s design, we ask a natural initial question: What size of impact could we reasonably detect after using an MTP to adjust p-values to account for our multiple outcomes?
We mimic the planning process one might use for planning a study similar to Diplomas Now (e.g., if we were planning a replication trial in a slightly different context). To answer this question we therefore first have to decide on our experimental design and modeling approach. We also have to determine values for the associated design parameters that accompany these choices. In the following sections we walk through selecting these parameters (sample size, control variables, intraclass correlation coefficients, impact variation, and correlation of outcomes). We calculate MDES for the resulting context and determine how necessary sample sizes change depending on what kind of power we desire. We finally illustrate some sensitivity checks, looking at how MDES changes as a function of rho, the correlation of the test statistics.
6.1 Establishing needed design parameters
To conduct power, MDES, and sample size calculations, we first specify the design, sample sizes, analytic model, and level of statistical significance. We also must specify parameters of the data generating distribution that match the selected design and model. All of these numbers have to be determined given resource limitations, or estimated using prior knowledge, pilot studies, or other sources of information.
We next discuss selection of all needed design parameters and modeling choices. For further discussion of selecting these parameters see, for example [12] and [8]. For discussion in the multiple testing context, especially with regards to the overall power measures such as minimal or complete power, see [7]; the findings there are general, as they are a function of the final distribution of test statistics. The key insight is that power is a function of only a few summarizing elements: the individual-level standard errors, the degrees of freedom, and the correlation structure of the test statistics. Once we have these elements, regardless of the design, we can proceed.
Analytic model. We first need to specify how we will analyze our data; this choice also determines which design parameters we will need to specify. Following the original Diplomas Now report, we plan on using a multi-level model with fixed effects at level three, a random intercept at level two, and a single treatment coefficient. We represent this model as “m3fc2rc.” The “3fc” means we are including block fixed effects, and not modeling any treatment impact variation at level three. The “2rc” means random intercept and no modeled variation of treatment within each block (the “c” is for “constant”). We note that the Diplomas Now report authors call their model a “two-level” model, but this is not quite aligned with the language of this package. In particular, fixed effects included at level two are actually accounting for variation at level three; we therefore identify their model as a three level model with fixed effects at level three.
Sample sizes. We assume equal size randomization blocks and schools, as is typical of most power analysis packages. For our context, this gives about three schools per randomization block; we can later do a sensitivity check where we increase and decrease this number to see how power changes. The Diplomas Now report states there were 14,950 students, yielding around 258 students per school. Normally we would use the harmonic means of schools per randomization block and students per school as our design parameters, but that information is not available in the report. We assume 50% of the schools are treated; our calculations will be approximate here in that we could not actually treat exactly 50% in small and odd-sized blocks.
Control variables. We next need values for the R2s of the possible covariates. The report does not provide these quantities, but it does mention covariate adjustment in the presentation of the model. Given the types of outcomes we are working with, it is unlikely that there are highly predictive individual-level covariates, but our prior year school-average attendance measures are likely to be highly predictive of corresponding school-average outcomes. We thus set R2.1 = 0.1 and R2.2 = 0.7. We assume five covariates at level one and three at level two; this decision, especially for level one, usually does not matter much in practice, unless sample sizes are very small (the number of covariates along with the sample size determines the degrees of freedom for our planned tests).
ICCs. We also need a measure of where variation occurs: the individual, the school, or the randomization block level. We capture this with Intraclass Correlation Coefficients (ICCs), one for level two and one for level three. ICC measures specify overall variation in outcome across levels: e.g., do we see relatively homogeneous students within schools that are quite different, or are the schools generally the same with substantial variation within them? We typically would obtain ICCs from pilot data or external reports on similar data. We here specify a level-two ICC of 0.05, and a level-three ICC of 0.40. We set a relatively high level three ICC as we expect our school type by district blocks to isolate variation; in particular we might believe middle and high school attendance rates would be markedly different.
Correlation of outcomes. We finally need to specify the number of outcomes and the relationship among them and their associated test statistics. For illustration, we select attendance as our outcome group, and assume we have five different attendance measures. The main decision regarding outcomes is the correlation of our test statistics. As a rough proxy, we use the correlation of the outcomes at the level of randomization; in our case this would be the correlation of school-average attendance within block. We believe the attendance measures would be fairly related, so we select rho = 0.40 for all pairs of outcomes. This value is an estimate, and we strongly encourage exploration of different values of this correlation as a sensitivity check for any conducted analysis. Selecting a candidate rho is difficult, and will be new for those only familiar with power analyses of single outcomes; more research in the field, both empirical and theoretical, is needed to further guide this choice.
If the information were available, we could specify different values of the design parameters, such as the R2s and ICCs, for each outcome, if we thought the outcomes had different characteristics; for simplicity we do not do this here. The PUMP package also allows specifying different pairwise correlations between the test statistics of the different outcomes via a matrix of rho values rather than a single rho; also for simplicity, we do not do that here.
Once we have established initial values for all needed parameters, we first conduct a baseline calculation, and then explore how MDES or other quantities change as these parameters change.
6.2 Calculating MDES
We now have an initial planned design, with a set number of schools and students. But is this a large enough experiment to reliably detect reasonably sized effects? To answer this question we calculate the minimal detectable effect size (MDES), given our planned analytic strategy, for our outcomes.
To identify the MDES of a given setting we use the pump_mdes method, which conducts a search for a MDES that achieves a target level of power. The MDES depends on all the design and model parameters discussed above, but also depends on the type of power and target level of power we are interested in. For example, we could determine what size effect we can reliably detect on our first outcome, after multiplicity adjustment. Or, we could determine what size effects we would need across our five outcomes to reliably detect an impact on at least one of them. We set our goal by specifying the type (power.definition) and desired power (target.power).
Here, for example, we find the MDES if we want an 80% chance of detecting an impact on our first outcome when using the Holm procedure:
m <- pump_mdes(
d_m = "d3.2_m3fc2rc", # choice of design and analysis strategy
MTP = "HO", # multiple testing procedure
target.power = 0.80, # desired power
power.definition = "D1indiv", # power type
M = 5, # number of outcomes
J = 3, # number of schools per block
K = 21, # number districts
nbar = 258, # average number of students per school
Tbar = 0.50, # prop treated
alpha = 0.05, # significance level
numCovar.1 = 5, # number of covariates at level 1
numCovar.2 = 3, # number of covariates at level 2
R2.1 = 0.1, R2.2 = 0.7, # explanatory power of covariates for each level
ICC.2 = 0.05, ICC.3 = 0.4, # intraclass correlation coefficients
rho = 0.4 ) # how correlated outcomes are
The results are easily made into a nice table via the knitr kable() command:
knitr::kable( m, digits = 3, booktabs = TRUE,
position = "h!", caption = "MDES Estimate" ) %>%
kableExtra::kable_styling( position = "center" )
MTP | Adjusted.MDES | D1indiv.power |
HO | 0.106 | 0.797 |
The answers pump_mdes() gives are approximate, as we are calculating them via Monte Carlo simulation. To control accuracy, we can specify a tolerance (tol) for how close the estimated power needs to be to the desired target, along with the number of iterations in the search sequence (via start.tnum, tnum, and final.tnum). The search will stop when the estimated power is within tol of target.power, as estimated by final.tnum iterations. Lower tolerance and higher tnum values will give more exact results (and take more computation time).
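For instance, to tighten the search above, we could lower the tolerance and raise the iteration counts; the values below are illustrative only, and will increase computation time.

m_precise <- update( m, tol = 0.005, tnum = 20000, final.tnum = 40000 )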
Changing the type of power is straightforward: for example, to identify the MDES for minimal power (i.e., what effect do we have to assume across all observations such that we will find at least one significant result with 80% power?), we simply update our result with our new power definition:
m2 <- update( m, power.definition = "min1" )
#> mdes result: d3.2_m3fc2rc d_m with 5 outcomes #> target min1 power: 0.80 #> MTP Adjusted.MDES min1.power SE #> HO 0.08144353 0.79075 0.01 #> (10 steps in search)
The update() method can replace any number of arguments of the prior call with new ones, making exploration of different scenarios very straightforward.121212The update() method re-runs the underlying call of pump_mdes(), pump_sample(), or pump_power() with the revised set of design parameters. You can even change which call to use via the type parameter. Our results show that if we just want to detect at least one outcome with 80% power, we can reliably detect an effect of size 0.081 (assuming all five outcomes have effects of at least that size).
When estimating power for multiple outcomes, it is important to consider cases where some of the outcomes in fact have null, or very small, effects, to hedge against circumstances such as one of the outcomes not being well measured. One way to do this is to set two of our outcomes to no effect with the numZero parameter:
m3 <- update( m2, numZero = 2 )
#> mdes result: d3.2_m3fc2rc d_m with 5 outcomes #> target min1 power: 0.80 #> MTP Adjusted.MDES min1.power SE #> HO 0.09050718 0.8065 0 #> (7 steps in search)
The MDES goes up, as expected: when there are no effects on some outcomes, there are fewer good chances for detecting an effect. Therefore, an increased MDES (for the nonzero outcomes) is required to achieve the same level of desired power (80%). Below we provide a deeper dive into the extent to which numZero can affect power estimates.
6.3 Determining necessary sample size
The MDES calculator tells us what we can detect given a specific design. We might instead want to ask how much larger our design would need to be in order to achieve a desired MDES. In particular, we might want to determine the needed number of students per school, the number of schools, or the number of blocks to detect an effect of a given size. The pump_sample method will search over any one of these.
Assuming we have three schools per block, we first calculate how many blocks we would need to achieve a MDES of 0.10 for minimal power (this answers the question of how big of an experiment do we need in order to have an 80% chance of finding at least one outcome significant, if all outcomes had a true effect size of 0.10):
smp <- pump_sample(
d_m = "d3.2_m3fc2rc",
MTP = "HO",
typesample = "K",
target.power = 0.80, power.definition = "min1", tol = 0.01,
MDES = 0.10, M = 5, nbar = 258, J = 3,
Tbar = 0.50, alpha = 0.05, numCovar.1 = 5, numCovar.2 = 3,
R2.1 = 0.1, R2.2 = 0.7, ICC.2 = 0.05, ICC.3 = 0.40, rho = 0.4 )
#> sample result: d3.2_m3fc2rc d_m with 5 outcomes #> target min1 power: 0.80 #> MTP Sample.type Sample.size min1.power SE #> HO K 15 0.798 0.01 #> (4 steps in search)
We would need 15 blocks, rather than the originally specified 21, giving 45 total schools in our study, to achieve 80% minimal power.
We recommend double-checking results from the MDES and sample-size calculators, as the estimation error combined with the stochastic search can give results a bit off the target in some cases. A check is easy to do: simply run the found design through pump_power(), which directly calculates power for a given scenario, to see if we recover our originally targeted power (we can use update() and set the type to power to pass all the design parameters automatically). When we do this, we can also increase the number of iterations to get more precise estimates of power:
p_check <- update( smp, type = "power", tnum = 50000,
long.table = TRUE )
power | None | HO |
individual outcome 1 | 0.7 | 0.53 |
individual outcome 2 | 0.7 | 0.52 |
individual outcome 3 | 0.7 | 0.53 |
individual outcome 4 | 0.7 | 0.53 |
individual outcome 5 | 0.7 | 0.53 |
mean individual | 0.7 | 0.53 |
1-minimum | | 0.81 |
2-minimum | | 0.64 |
3-minimum | | 0.51 |
4-minimum | | 0.39 |
complete | | 0.33 |
When calculating power directly, we get power for all the implemented definitions of power applicable to the design.
In the above, the first five rows are the powers for rejecting each of the five outcomes; they are (up to simulation error) the same, since we are assuming the same MDES and other design parameters for each. The “mean individual” row is the mean individual power across all outcomes. The first column is power without adjustment, and the second is power with the listed p-value adjustment.
The next rows show the different multi-outcome definitions of power. In particular, 1-minimum shows the chance of rejecting at least one hypothesis. The complete row shows the power to reject all hypotheses; it is only defined if all outcomes are specified to have a non-zero effect.131313The package does not show power for these definitions without adjustment for multiple testing, as that power would be grossly inflated and meaningless.
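As a rough intuition-builder only (this is not how the package computes these quantities), if the M adjusted tests were independent, each with individual power $\pi_m$, then

$P(\text{1-minimum}) = 1 - \prod_{m=1}^{M} (1 - \pi_m), \qquad P(\text{complete}) = \prod_{m=1}^{M} \pi_m.$

With correlated test statistics, as assumed via rho, the actual values move away from these benchmarks, which is why the package works with the joint distribution of the test statistics rather than such products.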
We can look at a power curve of our pump_sample() call to assess how sensitive power is to our level two sample size:141414The points on the plots show the evaluated simulation trials, with larger points corresponding to more iterations and greater precision.
plot( smp )
![[Uncaptioned image]](https://cdn.awesomepapers.org/papers/b7faa82b-88c0-4e7c-b744-560e6b472359/x1.png)
Though increasing tnum is useful for checking the power calculation, it also increases computation time. Thus, for future calculations we save a call with the default tnum to reduce computation time.
pow <- update( p_check, tnum = 10000 )
Remark. In certain settings, a wide range of sample sizes may result in very similar levels of power. In this case, the algorithm may return a sample size that is larger than necessary. This pattern does not occur for the sample size at the highest level of the hierarchy, and only occurs for sample sizes at lower levels of the hierarchy, e.g., for nbar for all models, and for nbar and J for three-level models. In addition, due to the nature of the search algorithm, occasionally the algorithm may not converge. For a more detailed discussion of these challenges, see the package's sample size vignette.
6.4 Comparing adjustment procedures
It is easy to rerun the above using the Westfall-Young Stepdown procedure (this procedure is much more computationally intensive to run), or other procedures of interest. Alternatively, simply provide a list of procedures you wish to compare. If you provide a list, the package will re-run the power calculator for each item on the list; this can make the overall call computationally intensive. Here we obtain power for our scenario using Bonferroni, Holm and Westfall-Young adjustments, and plot the results using the default plot() method:
p2 <- update( pow,
MTP = c( "BF", "HO", "WY-SD" ),
tnum = 5000,
parallel.WY.cores = 2 )
plot( p2 )
![[Uncaptioned image]](https://cdn.awesomepapers.org/papers/b7faa82b-88c0-4e7c-b744-560e6b472359/x2.png)
To speed up computation, we set an additional option, parallel.WY.cores, to the number of desired cores to parallelize the computation. We also reduce tnum to decrease computation time.
The more sophisticated (and less conservative) adjustment exploits the correlation in our outcomes (rho = 0.4) to provide higher individual power. Note, however, that we do not see elevated rates for minimal power. Accounting for the correlation of the test statistics when adjusting p-values can drive some power (individual power) up, but on the flip side power can be driven down, as the lack of independence between tests gives fewer chances for a significant result. See [7] for further discussion; while that paper focuses on the multisite randomized trial context, the lessons learned there apply to all designs, as the only substantive difference between design and modeling choices is how we calculate the unadjusted distribution of the test statistics.
6.5 Exploring sensitivity to design parameters
Within the PUMP package we have two general ways of exploring design sensitivity. The first is update(), which allows for quickly generating a single alternate scenario. To explore sensitivity to different design parameters more systematically, use the grid() functions, which calculate power, MDES, and sample size for all combinations of a set of passed parameter values. There are two main differences between the two approaches. First, update() allows different values of a parameter for the different outcomes; the grid approach, by contrast, is more limited in this regard, and assumes the same parameter value across outcomes. Second, the grid functions are a powerful tool for systematically exploring many possible combinations, while update() only allows the user to explore one scenario at a time.
We first illustrate the update() approach, and then turn to illustrating grid() across three common areas of exploration: Intraclass Correlation Coefficients (ICCs), the correlation of test statistics, and the assumed number of non-zero effects. The last two are particularly important for multiple outcome contexts.
6.5.1 Exploring power with update()
update() allows for quickly changing some of the parameters used in a prior call; we saw update() used several times above. As a further example, here we examine what happens if the ICCs are more equally split across levels two and three:
p_b <- update( pow, ICC.2 = 0.20, ICC.3 = 0.25 )
print( p_b )
#> power result: d3.2_m3fc2rc d_m with 5 outcomes
#> MTP D1indiv D2indiv D3indiv D4indiv D5indiv indiv.mean min1 min2 min3
#> None 0.2474 0.2421 0.2353 0.2395 0.2431 0.24148 NA NA NA
#> HO 0.0991 0.0960 0.0938 0.0967 0.0927 0.09566 0.2635 0.1157 0.0579
#> min4 complete
#> NA NA
#> 0.0279 0.0215
#> 0.000 <= SE <= 0.002
We immediately see that our assumption of substantial variation in level three matters a great deal for power.
When calculating power for a given scenario, it is also easy to vary many of our design parameters by outcome. For example, if we thought we had better predictive covariates for our second outcome, we might try:
p_d <- update( pow,
R2.1 = c( 0.1, 0.3, 0.1, 0.2, 0.2 ),
R2.2 = c( 0.4, 0.8, 0.3, 0.2, 0.2 ) )
print( p_d )
#> power result: d3.2_m3fc2rc d_m with 5 outcomes
#> MTP D1indiv D2indiv D3indiv D4indiv D5indiv indiv.mean min1 min2 min3
#> None 0.4362 0.8541 0.3891 0.349 0.3407 0.47382 NA NA NA
#> HO 0.2469 0.6552 0.2153 0.191 0.1887 0.29942 0.7155 0.3782 0.213
#> min4 complete
#> NA NA
#> 0.1226 0.0878
#> 0.001 <= SE <= 0.002
Notice how the individual powers are heavily impacted. The k-minimum powers naturally take the varying outcomes into account, as we are calculating a joint distribution of test statistics that has the correct marginal distributions based on these different design parameter values.
After several update()s, we may lose track of where we are; to find out, we can always check details with print_design() or summary():
summary( p_d )
#> power result: d3.2_m3fc2rc d_m with 5 outcomes
#> MDES vector: 0.1, 0.1, 0.1, 0.1, 0.1
#> nbar: 258 J: 3 K: 15 Tbar: 0.5
#> alpha: 0.05
#> Level:
#> 1: R2: 0.1 / 0.3 / 0.1 / 0.2 / 0.2 (5 covariates)
#> 2: R2: 0.4 / 0.8 / 0.3 / 0.2 / 0.2 (3 covariates) ICC: 0.05 omega: 0
#> 3: fixed effects ICC: 0.4 omega: 0
#> rho = 0.4
#> MTP D1indiv D2indiv D3indiv D4indiv D5indiv indiv.mean min1 min2 min3
#> None 0.4362 0.8541 0.3891 0.349 0.3407 0.47382 NA NA NA
#> HO 0.2469 0.6552 0.2153 0.191 0.1887 0.29942 0.7155 0.3782 0.213
#> min4 complete
#> NA NA
#> 0.1226 0.0878
#> 0.001 <= SE <= 0.002
#> (tnum = 10000)
Using update allows for targeted comparison of major choices, but if we are interested in how power changes across a range of options, we can do this more systematically with the grid() functions, as we do next.
6.5.2 Exploring the impact of the ICC
We saw above that the ICC can impact power considerably. We next extend this evaluation by exploring a range of options for both the level two and level three ICCs, so we can assess whether our power is sufficient across a set of plausible values. The update_grid() call makes this straightforward: we pass our baseline scenario along with lists of parameters to additionally explore. We can then easily visualize the variation in min1 power by calling plot() on the resulting object.
grid <- update_grid( pow,
ICC.2 = seq( 0, 0.3, 0.05 ),
ICC.3 = seq( 0, 0.60, 0.20 ) )
plot( grid, power.definition = "min1" )
![[Uncaptioned image]](https://cdn.awesomepapers.org/papers/b7faa82b-88c0-4e7c-b744-560e6b472359/x3.png)
Note that in addition to update_grid(), there are also base functions pump_power_grid(), pump_mdes_grid(), and pump_sample_grid().
We see that higher ICC.2 radically reduces power to detect anything and ICC.3 does little. To understand why, we turn to our standard error formula for this design and model:
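The following is a sketch of that expression, written in the style of the corresponding PowerUp!-type formula for a blocked, cluster-randomized design with fixed block effects; the exact form used by the package is derived in the Technical Appendix:

$SE\!\left(\widehat{ES}\right) \approx \sqrt{ \dfrac{\mathrm{ICC}_2 \, (1 - R^2_2)}{\bar{T}(1-\bar{T})\, J K} \;+\; \dfrac{(1 - \mathrm{ICC}_2 - \mathrm{ICC}_3)(1 - R^2_1)}{\bar{T}(1-\bar{T})\, J K \, \bar{n}} }$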
In the above, the large number of students per school (258) makes the second term very small compared to the first, regardless of the value of ICC.3. The first term, however, scales directly with ICC.2; changing it will change the standard error, and therefore power, a great deal. All designs and models implemented in the package are discussed, along with their corresponding formulas, in our technical supplement accompanying this paper and package.
For grid searches we recommend reducing the number of simulated test statistics, via tnum, to speed up computation. As tnum shrinks, we will get increasingly rough estimates of power, but even these rough estimates can help us determine trends.
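For example, we might first rerun the ICC grid above with a smaller tnum (the value here is arbitrary) to get a quick, rough picture before committing to a more precise run:

grid_fast <- update_grid( pow,
    ICC.2 = seq( 0, 0.3, 0.05 ),
    ICC.3 = seq( 0, 0.60, 0.20 ),
    tnum = 1000 )   # fewer simulated test statistics: faster, but rougher, power estimates
plot( grid_fast, power.definition = "min1" )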
The grid() functions provide easy and direct ways of exploring how power changes as a function of the design parameters. We note, however, that to keep the package syntax simple, they do not allow design parameters, including the MDES, to vary by outcome. In contexts where these parameters are believed to vary, we recommend using average values for the broader searches, and then double-checking a small set of potential final designs with pump_power().
6.5.3 Exploring the impact of rho
The correlation of the test statistics, rho, is a critical parameter for how power plays out across the multiple tests. For example, with Westfall-Young, we saw that the correlation can improve our individual power, as compared to Bonferroni. We might not know what will happen to minimal power, however: on one hand, correlated statistics make the individual adjustments less severe; on the other, correlation means we tend to succeed or fail all together. We can explore this question relatively easily by letting rho vary as follows:
gridRho <- update_grid( pow,
MTP = c( "BF", "WY-SD" ),
rho = seq( 0, 0.9, by = 0.15 ),
tnum = 500,
B = 3000 )
We then plot our results.
plot( gridRho )
![[Uncaptioned image]](https://cdn.awesomepapers.org/papers/b7faa82b-88c0-4e7c-b744-560e6b472359/x4.png)
First, we see the benefit of the Westfall-Young step-down procedure over Bonferroni is minimal here. Second, the impact of rho on individual power is essentially flat, as anticipated. Third, across a very broad range of rho, we maintain good minimal power: complete power climbs as the correlation increases, while minimal power is generally unchanged.
6.5.4 Exploring the impact of null outcomes
We finally explore varying the number of outcomes with no effects. This exploration is an important way to hedge a design against the possibility that some number of the identified outcomes are measured poorly, or are simply not impacted by treatment. We use a grid search, varying the number of outcomes that have no treatment impact via the numZero design parameter:
gridZero <- update_grid( pow,
numZero = 0:4,
M = 5 )
plot( gridZero, nrow = 1 )
![[Uncaptioned image]](https://cdn.awesomepapers.org/papers/b7faa82b-88c0-4e7c-b744-560e6b472359/x5.png)
There are other ways of exploring the impact of weak or null effects on some outcomes. In particular, the pump_power() and pump_sample() methods allow the researcher to provide an MDES vector with different values for each outcome, including 0s for some outcomes. The grid() functions, by contrast, take a single MDES value for the non-null outcomes, with a separate specification of how many of the outcomes are 0. (This single value plus numZero parameter also works with pump_power() if desired.)
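For example, the numZero = 2 scenario explored earlier could equivalently be specified for pump_power() by passing an explicit MDES vector with zeros for two (arbitrarily chosen) outcomes:

p_zero <- update( pow, MDES = c( 0.10, 0.10, 0.10, 0, 0 ) )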
7 Conclusion
We introduce the Power Under Multiplicity Project (PUMP) R package, which estimates power for multi-level randomized controlled trials with multiple outcomes. PUMP allows users to estimate power, MDES, and sample size requirements for a wide variety of commonly used RCT designs and models, across different definitions of power and applying different MTPs. The functionality of PUMP fills an important gap, as existing tools do not allow researchers to conduct power, MDES, or sample size calculations when applying an MTP. An online interface is also available at https://public.mdrc.org/pump/.
The main advantage of the PUMP package is to provide easily accessible estimation procedures so that users can properly account for power when making adjustments for multiple hypothesis testing. However, one of the additional strengths of the package is the ease with which a user can explore the impact of different designs, models, and assumptions on power, MDES or sample size. Even if a user is only interested in a single outcome, PUMP provides useful functionality for more robust power calculations. A user can and should try a range of parameter values to determine the sensitivity of the power of their study to different assumptions; this package simplifies that process.
In addition to this paper, there is a variety of supporting material:
•  The package is available on CRAN, https://CRAN.R-project.org/package=PUMP.
•  The code is available on GitHub, https://github.com/MDRCNY/PUMP.
•  The Technical Appendix contains detailed information about each design and model, the assumed data generating process, and the interpretation of parameters such as the ICCs and omegas. It is a useful reference not just for users of the package, but also as a general summary of multi-level models.
•  The package has an additional vignette on understanding sample size calculations, which present unique challenges.
•  The package has supplementary functions that allow a user to simulate data from multi-level models. Although these functions are not directly related to the power calculations, we provide them as a potentially useful tool. A short vignette explains these functions.
•  The code and results for validating the package are in a separate repository, https://github.com/MDRCNY/pump_validate.
Acknowledgements
We acknowledge the Diplomas Now team at MDRC. Development of this package was supported by a grant from the Institute of Education Sciences (R305D170030). We would like to thank members of the Harvard CARES lab for their feedback on the manuscript.
References
- [1] A. Gelman, J. Hill, and M. Yajima. Why we (usually) don’t have to worry about multiple comparisons. Journal of Research on Educational Effectiveness, 5:189–211, 2012.
- [2] A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.
- [3] J. Chen, J. Luo, K. Liu, and D. Mehrotra. On power and sample size computation for multiple testing procedures. Computational Statistics and Data Analysis, 55:110–122, 2011.
- [4] S. Dudoit, J.P. Shaffer, and J.C. Boldrick. Multiple hypothesis testing in microarray experiments. Statistical Science, 18(1):71–103, 2003.
- [5] Stephen Senn and Frank Bretz. Power and sample size when multiple endpoints are considered. Pharmaceutical Statistics, 6:161–170, 2007.
- [6] Peter H Westfall, R.D. Tobias, and R. D. Wolfinger. Multiple Comparisons and Multiple Tests using SAS. The SAS Institute, 2011.
- [7] Kristin E. Porter. Statistical power in evaluations that investigate effects on multiple outcomes: A guide for researchers. Journal of Research on Educational Effectiveness, 11:267–295, 2018.
- [8] Nianbo Dong and Rebecca Maynard. Powerup!: A tool for calculating minimum detectable effect sizes and minimum required sample sizes for experimental and quasi-experimental design studies. Journal of Research on Educational Effectiveness, 6(1):24–67, 2013.
- [9] Larry V. Hedges and Christopher Rhoads. Statistical power analysis in education research. Technical report, National Center for Special Education Research, 2010.
- [10] S.W. Raudenbush, J. Spybrook, R. Congdon, X. Liu, A. Martinez, H. Bloom, and C Hill. Optimal design plus empirical evidence (version 3.0). Technical report, 2011.
- [11] Jessica Spybrook, H.S. Bloom, Richard Congdon, Carolyn J. Hill, Andres Martinez, and Stephen W. Raudenbush. Optimal design plus empirical evidence: Documentation for the “optimal design” software version 3.0. Technical report, 2011.
- [12] Howard S. Bloom. The core analytics of randomized experiments for social research, 2006.
- [13] W. Corrin, S. Sepanik, R. Rosen, and A. Shane. Addressing early warning indicators: Interim impact findings from the investing in innovation (i3) evaluation of diplomas now, 2016.
- [14] Peter Z. Schochet. Guidelines for multiple testing in impact evaluations of educational interventions. final report. Technical report, Mathematica Policy Research, Inc. P.O. Box 2393, Princeton, NJ 08543-2393, 2008.
- [15] J.W. Tukey. The problem of multiple comparisons. Technical report, Princeton University, 1953.
- [16] Olive Jean Dunn. Estimation of the medians for dependent variables. The Annals of Mathematical Statistics, 30(1):192–197, 1959.
- [17] Olive Jean Dunn. Multiple comparisons among means. Journal of the American Statistical Association, 56(293):52–64, 1961.
- [18] S. Holm. A simple sequentially rejective multiple test procedure. Scand. J. Statist., 6(2):65–70, 1979.
- [19] Peter H Westfall and S Stanley Young. Resampling-based Multiple Testing: Examples and Methods for P-value Adjustment, volume 279. John Wiley & Sons, 1993.
- [20] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57:289–300, 1995.
- [21] Juliet Popper Shaffer. Multiple hypothesis testing. Annual Review of Psychology, 46(1):561–584, 1995.
- [22] Y. Ge, S. Dudoit, and T.P. Speed. Resampling-based multiple testing for microarray data analysis. Test, 12:1–77, 2003.
- [23] F Bretz, T. Hothorn, and P. Westfall. Multiple Comparisons Using R. Chapman & Hall/CRC, 2010.
- [24] P.H. Ramsey. Power differences between pairwise multiple comparisons. Journal of the American Statistical Association, 75:479–487, 1978.
- [25] Roger L. Berger. Multiparameter hypothesis testing and acceptance sampling. Technometrics, 24(4):295–300, 1982.
- [26] Roger L. Berger and Jason C. Hsu. Bioequivalence trials, intersection–union tests and equivalence confidence sets. Statistical Science, 11(4):283–319, 1996.
- [27] Xutao Deng, Jun Xu, and Charles Wang. Improving the power for detecting overlapping genes from multiple dna microarray-derived gene lists. BMC Bioinformatics, 9, 2008.
- [28] Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29:1165–1188, 2001.
- [29] D. Bates, M. Mächler, B. Bolker, and S. Walker. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48, 2015.
- [30] W. Maurer and B. Mellein. On New Multiple Test Procedures Based on Independent P-values and the Assessment of their Powers, pages 48–66. Springer, 1988.
- [31] H. Bang, S. Jung, and S.L. George. Sample size calculation for simulation-based multiple-testing procedures. Journal of Biopharmaceutical Statistics, 15:957–967, 2005.
- [32] Luke Miratrix, Michael Weiss, and Brit Henderson. An applied researcher’s guide to estimating effects from multisite individually randomized trials: Estimands, estimators, and estimates. Journal of Research on Educational Effectiveness, 2020, forthcoming.
Appendix: Validation
In order to validate that our power estimates are working as intended, we compared three different methods of estimating power:
•  PUMP
•  PowerUpR
•  Monte Carlo simulations
We compute values from PowerUpR for individual, unadjusted power, as PowerUpR only provides power estimates in that setting. For all other power definitions and adjustments, we compare PUMP to the estimated power obtained from full Monte Carlo simulations. We follow the simulation approach outlined in detail in Section 4.1. After repeatedly simulating data and calculating p-values, we calculate power and a 95% confidence interval, using the conservative (worst-case) binomial standard error of $\sqrt{0.5 \cdot 0.5 / S}$, where $S$ is the number of simulation iterations.
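The confidence interval construction is a standard binomial calculation; the sketch below uses hypothetical values for the number of iterations and the empirical power.

S <- 10000                                        # number of simulated datasets (hypothetical)
power_hat <- 0.80                                 # empirical power from the simulations (hypothetical)
SE_mc <- sqrt( 0.5 * 0.5 / S )                    # conservative, worst-case binomial standard error
power_hat + c( -1, 1 ) * qnorm( 0.975 ) * SE_mc   # 95% Monte Carlo confidence interval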
To validate the estimates, we first check that the PUMP and PowerUpR estimates match. In some settings we expect some discrepancies between these values, because PUMP makes different assumptions than PowerUpR for certain models. For details about differences between PUMP and PowerUpR assumptions, see the Technical Appendix. Second, we check that the PUMP estimate is within the Monte Carlo confidence interval.
We also validate MDES and sample size calculations. For MDES, we choose one default scenario for each design and model, then input the already-calculated individual power and see if the output MDES is the same as the original input MDES. Similarly, for sample size validation, we input the already-calculated individual power and see if the output sample size (nbar, J, and K depending on design) is the same as the original input sample size.
Simulation parameters
In order to validate that the method works in a wide range of scenarios, we vary the following parameters. For most scenarios, we vary only one parameter at a time: we set that parameter to one of the alternative values shown in Table 5, with all other parameters held at their default values. Table 5 shows the default parameter values, as well as the other values we use to test varying each parameter.
Parameter | Default | Other values |
school size | 50 | 75, 100 |
0.1 | 0.6 | |
0.5 | 0.2, 0.8 | |
MDES | (0.125, 0.125, 0.125) | (0.125, 0, 0) |
ICC | 0.2 | 0.7 |
0.1 | 0.8 |
We do not vary:
•  M = 3
•  J and K are fixed for each scenario
•  Scalar grand mean of control outcomes
•  Correlations between school random effects and impacts
•  rho informs all correlations; we keep the same correlation between covariates, residuals, impacts, and random effects for all levels and across all outcomes.
Validation results
Figure 1 is an example of a graph we use for validation. The green dots are the PUMP estimates of power, the red dot is the PowerUpR estimate of power, and the 95% confidence intervals based on the Monte Carlo simulations are shown in blue. To validate that PUMP produces the expected result, we want the red and green points to match, and the green points to fall within the blue intervals. Figure 1 shows the results across different types of power and different MTPs. We repeat this plot for a variety of different parameter values for each design and model.

Next, we validate MDES and sample size calculations. We put in our found power, and then see if the pump_mdes() function returns the MDES we originally plugged in to achieve this power. In Table 6, the first column shows the calculated MDES, the middle column is the power we plugged into the calculation, and the last column shows the MDES that we are targeting. Thus, ideally we want the first and last columns to match.
MTP | Adjusted MDES | D1indiv Power | Target MDES |
Bonferroni | 0.122 | 0.447 | 0.125 |
BH | 0.127 | 0.578 | 0.125 |
Holm | 0.125 | 0.540 | 0.125 |
Similarly, we validate our sample size calculations. Using our found power, we see if pump_sample() returns the original sample size. In Table 7, the returned sample size should match the sample size originally targeted (here, the number of level 2 units, J).
MTP | Sample.type | Sample.size | D1indiv.power |
Bonferroni | J | 21 | 0.500 |
BH | J | 21 | 0.580 |
Holm | J | 20 | 0.544 |
Technical Appendix
Appendix A Introduction
Our package allows for power calculations across a range of common scenarios that a user might select. Each scenario is categorized by two choices. First, the user chooses the planned experimental design (e.g., clustered data, with randomization within cluster). Second, the user makes a choice of the planned analytic model they would use for the data, once they had it (e.g. a multilevel model with random impacts). To provide a concrete example, throughout we assume an education setting where we have students (level 1) nested within schools (level 2) nested, in the three-level case, within districts (level 3).
The design is characterized by the number of levels of nesting (e.g., students in schools, no districts, would be two levels) and the level of randomization (e.g., randomization of schools would be randomization at level two).
The model choices are a bit more complex, and we discuss them in detail below. In particular, for each model we support, we create a taxonomy by noting what modeling choice is used at each level of the model, and how covariates are used.
The outline of this appendix is as follows. In the introduction, we provide notation and a taxonomy for models, explain the user-set parameters, and outline the power estimation strategy. The bulk of the appendix then provides detailed information about each supported scenario, including the assumed model and standard error formula. We then describe the data generating process used for the package validation; this section may also be useful for readers who wish to understand the models and assumptions more deeply. The final two sections provide explicit formulas linking the user-specified parameters to the full set of parameters used for data generation, followed by derivations of these formulas.
A.1 Model taxonomy
To define our models, we first assume a set of observed quantities, shown in Table 8, such as sample sizes, the outcomes, and covariates. For all notation, we use i to index level 1 (individuals), j to index level 2 (schools), k to index level 3 (districts), and m to index outcomes.
Param | Description |
Number of outcomes | |
Number of level 2 units in each level 3 group (assumed constant across level 3 groups) | |
Number of level 3 units | |
Number of level 1 units (assumed constant across level 2 groups) | |
Proportion of the sample that is assigned to the treatment group (assumed constant across groups) | |
Total number of units | |
Categorical variable indicating the membership of individual to a level 2 group | |
Categorical variable indicating the membership of individual to a level 3 group | |
Potential outcome for unit in level 2 group in level 3 group for outcome given no treatment | |
Potential outcome for unit in level 2 group in level 3 group for outcome given treatment | |
Level 3 covariates | |
Number of level 3 covariates for outcome | |
Level 2 covariates | |
Number of level 2 covariates for outcome | |
Level 1 covariates | |
Number of level 1 covariates for outcome |
We also assume a set of unobserved, latent parameters, shown in Table 9. These parameters include intercepts, impacts, and coefficients on covariates.
Param | Description |
Grand mean outcome under no treatment across level 3 units for outcome | |
Grand mean impact across level 3 units for outcome | |
Grand mean outcome under no treatment across level 2 units in level 3 unit for outcome | |
Grand mean impact across level 2 units in level 3 unit for outcome | |
Mean outcome under no treatment for level 2 unit in level 3 unit for outcome | |
Mean impact for level 2 unit in level 3 unit for outcome | |
Level 3/District intercepts | |
Level 3/District impacts | |
Variance of level 3 random effects for outcome | |
Variance of level 3 impacts for outcome (cross-district treatment heterogeneity) | |
Coefficient of level 3 covariates | |
Level 2/School intercepts | |
Level 2/School impacts | |
Variance of level 2 random effects for outcome | |
Variance of level 2 impacts for outcome (cross-school treatment heterogeneity) | |
Coefficient of level 2 covariates | |
Level 1/Individual intercepts | |
Variance of individual/level 1 residuals | |
Coefficient of individual/level 1 covariates |
Now, we can create a model taxonomy, based on modeling choices for intercepts and impacts. In particular we determine for each level:
•  Whether the level 2 and level 3 intercepts are:
   –  fixed (the intercepts are estimated as separate fixed effects), or
   –  random (the intercepts are modeled as Normally distributed, allowing for partial pooling).
•  Whether the level 2 and level 3 treatment effects are:
   –  constant, e.g., all units within a level are modeled as having a single average impact;
   –  fixed, e.g., each unit has an individually estimated impact (fixed effects constrained to have mean 0), along with an overall mean impact; or
   –  random (the impacts are modeled as Normally distributed around a mean impact).
In addition, for each level the user can plan to adjust for baseline covariates, unless there are fixed effects at that level or below. Note that in some models, PowerUp! includes treatment by covariate interactions, allowing for, in principle, heterogeneous treatment effects correlated with said covariates. We do not allow for this, as including treatment by covariate interactions adds complexity with estimation of average treatment effects, and is unlikely to help with the precision of an overall average impact estimate. We view covariate by treatment interactions as primarily for modeling treatment effect heterogeneity, which is not the goal of this package or project. When this difference between PowerUp! and our approach occurs, it is noted.
For users familiar with PowerUp!, Table 10 provides a reference for translating notation between this document and PowerUpR.
PowerUp | PUMP | Description |
Mean outcome under no treatment for school in district | ||
Mean impact for school in district | ||
Individual covariates | ||
Coefficient vector for individual covariates | ||
Grand mean outcome under no treatment across schools in district | ||
Grand mean impact across schools in district | ||
School covariates | ||
Coefficient vector for school covariates | ||
School intercepts | ||
School impacts | ||
Variance of school random effects | ||
Overall variance of schools | ||
Variance of school impacts | ||
Intraclass correlation (unconditional) for level 2 | ||
Ratio of variation of impacts to residuals for level 2 | ||
Correlations between school random effects and impacts | ||
Grand mean outcome under no treatment across districts | ||
Grand mean impact across districts | ||
District covariates | ||
Coefficient vector for district covariates | ||
District intercepts | ||
District impacts | ||
Variance of district random effects | ||
Overall variance of districts | ||
Variance of district impacts | ||
Correlations between district random effects and impacts |
A.2 Scenario naming convention
We denote the research design by d, followed by the number of levels and the randomization level, so d3.1 is a three-level design with randomization at level one. The model is denoted by m, followed by the level and the assumption for the intercepts, either f (fixed) or r (random), and then the assumption for the treatment impacts, c (constant), f (fixed), or r (random). For example, m3ff2rc means at level three we assume fixed intercepts and fixed treatment impacts, and at level two we assume random intercepts and constant treatment impacts. The full design and model are specified by concatenating these together, e.g., d2.1_m2fc.
Examples:
•  d2.1_m2rr: two-level design, individual assignment, level 2 random intercepts and random treatment effects. Corresponds to PowerUp! blocked_i1_2r.
•  d3.2_m3ff2rc: three-level design, level 2 assignment, level 3 fixed intercepts and fixed treatment effects, level 2 random intercepts and constant treatment effects. Corresponds to PowerUp! blocked_c2_3f.
Table 11 shows the list of supported scenarios and their corresponding names in PowerUp!.
PowerUpR! | PUMP |
n/a | d1.1_m1c |
bira2_1c | d2.1_m2fc |
bira2_1f | d2.1_m2ff |
bira2_1r | d2.1_m2fr |
bira3_1r | d3.1_m3rr2rr |
cra2_2r | d2.2_m2rc |
cra3_3r | d3.3_m3rc2rc |
bcra3_2f | d3.2_m3ff2rc |
n/a | d3.2_m3fc2rc |
bcra3_2r | d3.2_m3rr2rc |
A.3 Derived parameters
To calculate power, a user must choose assumed values for some of the latent parameters. However, for certain parameters, the user may instead have more intuition about likely values of functions of these parameters, rather than the parameters themselves. For example, rather than choosing the value of the coefficient on a level 3 covariate, the user sets R2.3, the amount of level three variation explained by the covariates. These derived parameters, which are functions of unobserved parameters, are listed in Table 12.
Param | Description |
treatment impact in effect size units | |
level 3 (district) intraclass correlation | |
ratio of variation of district impacts to district intercepts | |
level 2 (school) intraclass correlation | |
ratio of variation of school impacts to school intercepts | |
percent of district variation explained by level 3 (district) covariates | |
percent of school variation explained by level 2 (school) covariates | |
percent of individual variation explained by level 1 (individual) covariates |
We now provide further clarification on these derived parameters. For a more detailed discussion of these expressions, see Section D.

To keep the formulas for the intraclass correlation coefficients (ICCs) and the $R^2$ terms simple and clear, we assume all covariates have unit variance and are group-mean centered. In particular, group-mean centering means covariates only explain variation at their own level; in practice, raw lower-level covariates can explain variation at higher levels if they systematically differ by group. For using the formulas, however, researchers are welcome to input total values that include the full explanatory power of all covariates for a given level.

The ICCs are unconditional intraclass correlations, meaning they include the variation explained by covariates. Writing $V_{\ell}^{(m)}$ for the total variation of outcome $m$ at level $\ell$ (the variance of the level-$\ell$ intercepts plus the variance explained by the level-$\ell$ covariates), we have

(2)   $\mathrm{ICC}_{2}^{(m)} = \dfrac{V_{2}^{(m)}}{V_{1}^{(m)} + V_{2}^{(m)} + V_{3}^{(m)}}$

(3)   $\mathrm{ICC}_{3}^{(m)} = \dfrac{V_{3}^{(m)}}{V_{1}^{(m)} + V_{2}^{(m)} + V_{3}^{(m)}}$

The quantity $\omega$ at a given level is the ratio between the variation in the average impacts of the units at that level and the variation in their control-side means (including covariates):

(4)   $\omega_{2}^{(m)} = \dfrac{\mathrm{Var}\!\left(\text{level 2 impacts for outcome } m\right)}{V_{2}^{(m)}}$

(5)   $\omega_{3}^{(m)} = \dfrac{\mathrm{Var}\!\left(\text{level 3 impacts for outcome } m\right)}{V_{3}^{(m)}}$

The $R^2$ expressions are the percent of variation at a particular level predicted by the covariates at that level. The group-mean centering makes these formulas involve only covariates at the same level as the $R^2$; this is a simplification of convenience. The standard error formulas for the models in the remainder of the document are general, however, and a user would not need to group-mean center or rescale any covariate in practice.

(6)   $R^{2}_{1,m} = \dfrac{\text{variance at level 1 explained by level 1 covariates}}{V_{1}^{(m)}}$

(7)   $R^{2}_{2,m} = \dfrac{\text{variance at level 2 explained by level 2 covariates}}{V_{2}^{(m)}}$

(8)   $R^{2}_{3,m} = \dfrac{\text{variance at level 3 explained by level 3 covariates}}{V_{3}^{(m)}}$
A.4 Power estimation strategy
The same strategy is followed for all designs. First, we lay out a model for our outcomes, . Next, we calculate the standard error of the average treatment effect estimate, . When expressing the estimated treatment effect as an effect size, the standard error is given by:
(9) |
where is some “Index Variation’ ’ that we are measuring our impacts against.
When analyzing actual data, we would, to estimate , plug in known values for , , and . Any other parameters are replaced by sample estimates. Then, when testing the null hypothesis, , the test statistic for a -test is given by
(10) |
When the null is true, the test statistic follows a t distribution with degrees of freedom that depend on the design and model.
For power calculations we compute, given our assumptions on the design and the selected model, a reasonable value for this standard error. We can then calculate the power to detect an impact expressed in effect size units.
From the power formulas, we can also calculate MDES and sample size requirements. From [8], in general the MDES can be estimated as
where M is known as the multiplier and is the sum of two t quantiles based on the relevant degrees of freedom. For one-tailed tests, M = t(1-alpha) + t(1-beta), where alpha is the type I error rate and 1-beta is the desired power. For two-tailed tests, the first term is instead t(1-alpha/2). For more details, see [8, page 31] or [12, page 22]. Manipulating this expression then yields the sample size formulas.
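For instance, the multiplier can be computed directly in R; the following is a minimal sketch (the helper name is hypothetical, and the degrees of freedom must be supplied by the user for their design):
# Sum of t quantiles used in the MDES formula; M * SE gives the approximate MDES.
mdes_multiplier <- function(alpha = 0.05, power = 0.80, df, two.tailed = TRUE) {
  t.alpha <- if (two.tailed) qt(1 - alpha / 2, df) else qt(1 - alpha, df)
  t.power <- qt(power, df)
  t.alpha + t.power
}
mdes_multiplier(df = 100)  # roughly 2.8 for alpha = 0.05, power = 0.80, two-tailed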
A note on effect sizes.
In describing the standard error of our estimators in terms of effect size, we need to carefully identify what we mean by an “effect size.” We commonly think of an effect size as the size of an impact relative to some reference amount of variation. If the reference amount of variation is different, then the effect size, for the same absolute effect, will also be different. This question of what goes in the denominator can create some tension regarding some of the power formulas, as we note in the following sections.
In particular, the effect size formula can use either the variation of the overall control group, or the within-group variation only. Overall variation includes the student variation within each site, but also how the sites vary from each other. Within-group variation is just the former component. We argue that overall variation is more natural. We also believe all the formulas should use the same definition of effect size.
Where this is most obviously a concern is with fixed effect regression. In particular, with overall variation, if we increase the ICC at level 2 or level 3, then there is less variation (relative to the reference variation) at level 1; thus an increased ICC will increase power for fixed effect regression. This simply reflects the realized gains of a blocked experiment. If the effect size is calculated relative to within-group variation, however, this gain is not seen. We note how this plays out explicitly in the following sections. An alternate approach is to have the R2 measure include the variance explained by the fixed effects. To make the formulas more directly comparable, we do not take this route, but one can obtain the same results by selecting an appropriate R2 value and then setting the ICC to 0.
Appendix B Scenarios
B.1 d1.1 designs: 1 level, randomization at level 1
This is the classic individually randomized experiment where we allocate some fraction of a single set of units to treatment.
The randomization scheme is simple random sampling:
T.x <- randomizr::simple_ra(N = nbar, prob = Tbar)
B.1.1 Constant effects (d1.1_m1c)
PowerUp name: Not applicable.
Design: 1-level design, randomization at level 1.
Model: constant intercepts, constant treatment effects, no school or district covariates.
The model for estimating impacts on outcome is given by:
(11) |
The standard error of the treatment effect estimate is:
(12) |
The degrees of freedom are:
(13) |
Sample size formula.
The sample size formula is:
(14) |
Code syntax.
The R model is
Yobs ~ 1 + T.x + C.ijk
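As a quick illustration, power for this design can be computed directly from the quantities above. The sketch below assumes the standard one-level effect-size standard error, sqrt((1 - R2.1) / (Tbar * (1 - Tbar) * nbar)), and degrees of freedom of roughly nbar minus the number of covariates minus two; both are assumptions that should be checked against the package itself.
# Approximate two-tailed power for the d1.1_m1c design (a sketch, not package code).
power_d1.1_m1c <- function(MDES, nbar, Tbar = 0.5, R2.1 = 0, numCovar.1 = 0,
                           alpha = 0.05) {
  SE <- sqrt((1 - R2.1) / (Tbar * (1 - Tbar) * nbar))
  df <- nbar - numCovar.1 - 2
  tcrit <- qt(1 - alpha / 2, df)
  # Power under a noncentral t with noncentrality MDES / SE.
  1 - pt(tcrit, df, ncp = MDES / SE) + pt(-tcrit, df, ncp = MDES / SE)
}
power_d1.1_m1c(MDES = 0.20, nbar = 400, R2.1 = 0.3)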
B.2 d2.1 designs: 2 levels, randomization at level 1
This section of designs comprises what are usually referred to as multisite experiments. In a multisite experiment, we have a collection of sites (here, schools) and are able to randomize the individuals within each site into treatment and control. This allows for estimating an average impact for each site, in principle. That being said, we are usually interested in estimating some overall summary of impacts across all our sites. These are also called blocked experiments, especially if the sites are viewed as fixed.
Critically, there are four different estimands we might consider: the average impact for persons vs. the average impact for sites, and the average impact of the sample we have vs. the average impact of the population from which the sample came. When sites are equal sized, a common assumption for power calculations, the site and person averages will be the same, so we ignore that distinction here. For finite vs. super-population, we have to be more careful. Some estimation strategies target a finite-population estimand. In this document, the ones that do are d2.1_m2fc and d2.1_m2ff. The d2.1_m2fc estimation strategy does because it assumes a constant treatment impact; given this assumption, there is no uncertainty due to the sample itself, as all samples have the same average impact by assumption. The d2.1_m2ff estimation strategy allows each school to have an individually estimated impact, but due to using fixed effects rather than random effects, it evaluates the sample at hand. See [32] for further, in-depth, discussion. Estimators that target the super-population need to take into account any uncertainty about the sample being representative of the super-population. Here, the one that does this is d2.1_m2fr, with a model of each school having an average impact drawn from some random distribution.
Regardless of the model used to analyze these data, the randomization scheme is the same: simple random assignment within each school, with proportion Tbar of units assigned to treatment in each school. In R, we could randomize as follows:
T.x <- randomizr::block_ra( S.id, prob = Tbar )
B.2.1 Constant effects (d2.1_m2fc)
PowerUp name: bira2_1c
Design: 2-level design, randomization at level 1 (blocked).
Model: fixed intercepts, constant treatment effect, no school covariates.
When we assume constant effects, each school has its own fixed intercept for the control outcome, and the treatment effect is modeled as constant across schools. We can also call this a fixed effects, constant treatment model [32]. This model allows some schools to have higher average outcomes than others (allowed for with the fixed effects), but assumes the treatment impact is the same.
The model for estimating impacts on outcome is given by:
(15) | ||||
with reduced form:
(16) |
and distributions:
(17) |
The standard error formula we use is
(18) |
See below for important details on this specific formula, and how it differs from PowerUp!
The degrees of freedom for our impact estimate are
(19) |
Sample size formula.
The sample size formulas are:
(20) | ||||
(21) |
The constant effects model means that we assume no treatment variation across our sites; i.e., the school-level impact variance (and hence omega.2) is zero.
Difference from PowerUp.
The PowerUp formula in effect assumes ICC.2 = 0, although this can be motivated differently; see below.
Code syntax.
The R model is
Yobs ~ 1 + T.x + C.ijk + S.id
PowerUp! Differences.
PowerUp assumes there is no ICC.2 term while we allow for it. This can be viewed as within (PowerUp!) vs. overall (this work) effect size metrics.
Remark on effect sizes.
The standard error of the treatment effect estimate, not in effect size units, is (taken from the PowerUp! documentation):
(22) |
To convert this to an effect size, we need to scale by overall variation. Unfortunately, under a fixed effect model, there is no natural way to express this, as we have not parameterized how the individual site intercepts vary. PowerUp! therefore indexes by within-group variation, which is
using the formula for R2.1, capturing the predictive power of our individual-level covariates on the outcomes within a given school, of
If we divide the above formula by this within-group standard deviation, we get the reported standard error formula of
with the tilde denoting that these effect size units are in terms of within-school variation, which is not common practice. Equivalently, this assumes the blocks are all homogeneous, which both runs counter to the design principles of blocking and is known to generally not hold when evaluating schools. If we want the more classic effect size indexed by cross-site variation, we need to go further.
Assume we have an ICC.2, an assumed measure of how much of the overall (control-side) variation is at the school level:
This ICC is defined even for a finite sample, if we view the above as comparing the empirical (pooled) within-group variation to the full variation. Rearranging shows the within-group variation is a (1 - ICC.2) share of the overall variation.
We can then combine this with the R2.1 formula to get
If we use this expression to scale our SE formula, we finally obtain our formula listed above.
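To make the within vs. overall distinction concrete, here is a small illustration (not package code) of converting an effect size indexed by within-school variation into one indexed by overall variation, using the fact that within-group variance is a (1 - ICC.2) share of overall variance:
# Convert an impact expressed against the within-school SD into overall-SD units.
es_within_to_overall <- function(es_within, ICC.2) {
  es_within * sqrt(1 - ICC.2)
}
es_within_to_overall(es_within = 0.25, ICC.2 = 0.20)  # about 0.22 in overall units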
B.2.2 Fixed effects (d2.1_m2ff)
PowerUp name: bira2_1f
Design: 2-level design, randomization at level 1 (blocked).
Model: fixed intercepts, fixed treatment effects, no school covariates.
The constant effects model assumes the treatment effect is the same for each block. If it is not, and the blocks are different sizes or have different proportions of units treated, the constant effects estimator is precision-weighted and can thus be biased. Some may instead choose to allow each school to have its own estimated impact, with a second averaging step where we calculate an overall site-average of the site-specific impact estimates.
We do this by interacting our site fixed effects with treatment. Now each school has its own fixed intercept for the control outcome, and each school also has its own fixed coefficient for the treatment effect. We can also call this a fixed effects with interactions model [32].
In practice, the power calculations for this model will be the same as for constant effects, unless we allow for block size variation or variable proportion treated.
The model for estimating impacts on outcome is given by:
(23) | ||||
with reduced form:
(24) |
and distributions:
(25) |
The standard error of the treatment effect estimate (and therefore the sample size formulas) is the same as in the constant effects model; i.e., for the SE we have:
(26) |
However, the degrees of freedom are different due to the additional interaction terms we need to estimate:
(27) |
Sample size formula.
The sample size formulas are:
(28) | ||||
(29) |
Difference from PowerUp.
Note that the PowerUp formula again in effect assumes ICC.2 = 0.
Code syntax.
The R model is
Yobs ~ 0 + T.x * S.id - T.x + C.ijk
The overall treatment effect is then the average of the T.x:S.id interaction terms.
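A minimal sketch of this two-step summary with lm (assuming a data frame dat containing Yobs, C.ijk, the treatment indicator T.x, and S.id as a factor):
# School fixed intercepts plus school-specific treatment effects.
fit <- lm(Yobs ~ 0 + T.x * S.id - T.x + C.ijk, data = dat)
# Average the per-school T.x:S.id coefficients to get the overall impact.
site.effects <- coef(fit)[grepl("T.x", names(coef(fit)), fixed = TRUE)]
ATE.hat <- mean(site.effects)  # simple (unweighted) average across schools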
PowerUp! Differences.
Just as with the constant effects model, PowerUp assumes there is no ICC.2 term while we allow for it. This can be viewed as within (PowerUp!) vs. overall (this work) effect size metrics.
B.2.3 Random effects (d2.1_m2fr or d2.1_m2rr)
PowerUp name: bira2_1r
Design: 2-level design, randomization at level 1 (blocked).
Model: random intercepts, random treatment effect, school covariates for intercept. PowerUp! also includes interaction terms of treatment and school covariates to allow for modeling treatment effect heterogeneity; we do not include this.
If we are interested in generalizing from our sample to a superpopulation, we may wish to view the sample of schools themselves as representative of something larger. Then, if some schools have different average impacts than other schools, we have to account for the possibility that our sample of schools has an overall average impact different from the target population. We can account for this additional uncertainty with a random effects model that has a random effect for the school-level average impacts.
The classic random effects model gives each school both a random intercept for the control average outcome and a random coefficient for the treatment effect. This is also known as the RIRC model: random intercept, random coefficient. Recently, researchers have also used a variant of this model, the Fixed Intercept, Random Coefficient (FIRC) model, to account for concerns such as varying proportions of units treated across schools. For power calculations these models have the same performance (they are also similar in practice; see [32]).
For RIRC, the model for estimating impacts on outcome is given by:
(30) | ||||
with reduced form:
(31) | ||||
and random effect and residual distributions of:
(32) | ||||
For FIRC, we only have random effects for the treatment impacts, and use fixed effects for the school intercepts. For RIRC, we assume the intercept and impact random effects are bivariate Normal, each with its own variance and with a correlation between them. The correlation structure does not heavily impact the distribution of the final test statistic.
We make an important note. In PowerUp!, they assume that school and district covariates also influence the treatment impact, but we do not make this assumption. The result is that we assume, in their notation, that the R-squared for treatment-impact variation explained by level 2 covariates is zero; we are exploiting none of the cross-site impact heterogeneity. This assumption affects the first term in the standard error formula below.
The standard error of the treatment effect estimate is given by:
(33) |
Note that this formula is simply the formula for d2.1_m2fc with an additional term capturing cross-site impact variation. This term captures the additional uncertainty from extrapolating from our sample to the super-population. The standard error under this model will therefore be larger than under the prior models to the extent that the schools differ in their impacts (the additional term is simply the variation of the random impact terms scaled by our overall variation).
The degrees of freedom are
(34) |
Sample size formula.
The sample size formulas are:
(35) | ||||
(36) |
Code syntax.
The R model is, for RIRC,
Yobs ~ 1 + T.x + X.jk + C.ijk + (1 + T.x | S.id)
For FIRC it is
Yobs ~ 0 + T.x + X.jk + C.ijk + S.id + (0 + T.x | S.id)
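For example, one could fit the RIRC specification with lme4 (a minimal sketch, assuming a data frame dat with the variables named as above):
library(lme4)
# Random intercept, random coefficient (RIRC) model.
fit <- lmer(Yobs ~ 1 + T.x + X.jk + C.ijk + (1 + T.x | S.id), data = dat)
fixef(fit)["T.x"]  # estimated average treatment effect across schools
VarCorr(fit)       # school-level intercept and impact variances and their correlation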
PowerUp! Differences.
PowerUp allows for school covariates to influence the treatment impact, while we do not allow for this. In PowerUp terms, we assume the treatment-impact R-squared for level 2 covariates is zero.
B.3 d2.2 designs: 2 levels, randomization at level 2
These are commonly called cluster randomized experiments, with the schools being the clusters.
The randomization scheme is simple random assignment of schools to treatment:
T.x <- randomizr::cluster_ra( S.id, prob = Tbar )
B.3.1 Random effects (d2.2_m2rc)
PowerUp name: cra2_2r
Design: 2-level design, randomization at level 2 (clusters).
Model: random intercepts, constant treatment effect for all schools, school covariates for intercept.
The model for estimating impacts on outcome is given by:
(37) | ||||
with reduced form:
(38) |
and distributions:
(39) | ||||
The standard error of the treatment effect estimate is given by:
(40) |
The degrees of freedom are
(41) |
The constant effects model means that we assume no treatment variation across our sites; i.e., the school-level impact variance (and hence omega.2) is zero.
Sample size formula.
(42) | ||||
(43) |
Code syntax.
The R model is
Yobs ~ 1 + T.x + X.jk + C.ijk + (1 | S.id)
B.4 d3.1 designs: 3 levels, randomization at level 1
In these designs we have schools nested in districts, and students nested in schools. The only difference here, as compared to blocked individual randomization with two levels, is the third level of district. Since we are randomizing at the student level, this will only impact how we think about where variation is in terms of our effect size units.
In this context, if we are interested in finite-sample impacts, the districts do not matter other than for calculating our reference variation for effect sizes. We can simply use the prior two-level fixed effect designs if we lump the district variation into the school-level terms. In particular, one could use d2.1_m2ff or d2.1_m2fc for the three-level case by folding the district-level share of variation into the school-level ICC. In fact, we cannot include district random or fixed effects alongside school-level fixed effects due to collinearity.
The randomization scheme is: simple random sampling occurs within each school, with proportion Tbar of units assigned to treatment in each school.
T.x <- randomizr::block_ra( S.id, prob = Tbar )
B.4.1 Random effects (d3.1_m3rr2rr)
PowerUp name: bira3_1r
Design: 3-level design, randomization at level 1 (blocked).
Model: random intercepts for districts, random treatment effects for districts, random intercepts for schools, random treatment effects for schools, school and district covariates for intercepts. PowerUp also allows for school and district covariates for cross-site impact heterogeneity.
The model for estimating impacts on outcome is given by:
(44) | ||||
with reduced form:
(45) | ||||
and distributions:
(46) | ||||
Similar to the two-level blocked model, PowerUp! further assumes that school and district covariates also influence the treatment impact, but we do not make this assumption.
The standard error of the treatment effect estimate is given by:
(47) |
The degrees of freedom are
(48) |
This is a very conservative choice of degrees of freedom.
Sample size formula.
(49) | ||||
(50) | ||||
(51) |
Code syntax.
Yobs ~ 1 + T.x + V.k + X.jk + C.ijk + (1 + T.x | S.id) + (1 + T.x | D.id)
PowerUp! Differences.
PowerUp allows for school and district covariates to influence the treatment impact, while we do not allow for this. In PowerUp terms, we assume the treatment-impact R-squared values for the level 2 and level 3 covariates are both zero. This also impacts our degrees of freedom formula, since we do not lose degrees of freedom for those treatment-by-covariate terms.
B.5 d3.2 designs: 3 levels, randomization at level 2
These are commonly called blocked, cluster-randomized experiments. You find these if, for example, schools are randomized within a set of districts, or teachers are randomized within a set of schools (with students as outcomes in both cases).
The randomization scheme is: simple random sampling occurs within each district, with proportion Tbar of schools assigned to treatment in each district. In R we have:
T.x <- randomizr::block_and_cluster_ra( blocks = D.id, clusters = S.id, prob = Tbar )
B.5.1 Fixed effects (d3.2_m3ff2rc)
PowerUp name: bcra3_2f
Design: 3-level design, randomization at level 2 (blocked cluster).
Model: fixed intercepts for districts, fixed treatment effects for districts, random intercepts for schools, constant effects for schools within a district, school covariates for intercept.
The model for estimating impacts on outcome is given by:
(52) | ||||
with reduced form:
(53) | ||||
and distributions:
(54) | ||||
The standard error of the treatment effect estimate is given by:
(55) |
The degrees of freedom are
(56) |
This model assumes: no variation of impacts within schools, and no variation at the district level.
Difference from PowerUp.
Note that the PowerUp formula makes an additional simplifying assumption here; see the discussion of within vs. overall effect size metrics in Section B.2.1.
Sample size formula.
(57) | ||||
(58) | ||||
(59) |
Code syntax.
The R model is
Yobs ~ 0 + T.x * D.id - T.x + X.jk + C.ijk + (1 | S.id)
The overall treatment effect is then the average of the T.x:D.id interaction terms.
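A minimal sketch of this two-step summary with lme4 (assuming a data frame dat with Yobs, T.x, X.jk, C.ijk, and D.id and S.id as factors):
library(lme4)
# District fixed intercepts and district-specific treatment effects,
# with random intercepts for schools.
fit <- lmer(Yobs ~ 0 + T.x * D.id - T.x + X.jk + C.ijk + (1 | S.id), data = dat)
dist.effects <- fixef(fit)[grepl("T.x", names(fixef(fit)), fixed = TRUE)]
ATE.hat <- mean(dist.effects)  # unweighted average of the district-level impacts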
B.5.2 Random effects (d3.2_m3rr2rc)
PowerUp name: bcra3_2r
Design: 3-level design, randomization at level 2 (blocked cluster).
Model: random intercepts for districts, random treatment effects for districts, random intercepts for schools, constant effects for schools within a district, school and district covariates for intercepts. PowerUp also allows for district covariates for treatment effects.
The model for estimating impacts on outcome is given by:
(60) | ||||
with reduced form:
(61) | ||||
and distributions:
(62) | ||||
Similar to the other blocked models, PowerUp! further assumes that district covariates also influence the treatment impact, but we do not make this assumption.
The standard error of the treatment effect estimate is given by:
(63) |
The degrees of freedom are
(64) |
Parameter assumptions: constant effects for schools within a district mean there is no school-level impact variation (i.e., omega.2 = 0).
PowerUp! Differences.
PowerUp allows for district covariates to influence the treatment impact, while we do not allow for this. In PowerUp terms, we assume the treatment-impact R-squared for level 3 covariates is zero. This also impacts our degrees of freedom formula, since we do not lose degrees of freedom for those treatment-by-covariate terms.
Sample size formula.
(65) | ||||
(66) | ||||
(67) |
Code syntax.
The R model is
Yobs ~ 1 + T.x + V.k + X.jk + C.ijk + (1 | S.id) + (1 + T.x | D.id)
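To check power for this design and model in the package itself, a call along the following lines could be used. This is a sketch only: the argument names follow the pattern of the package's documented interface, and the parameter values are illustrative assumptions rather than recommendations; consult the installed version's help files.
library(PUMP)
pump_power(
  d_m = "d3.2_m3rr2rc",        # design and model, named as in this appendix
  MTP = "BF",                  # Bonferroni adjustment across outcomes
  M = 3, MDES = rep(0.10, 3),  # three outcomes, assumed effect sizes
  K = 20, J = 4, nbar = 50,    # districts, schools per district, students per school
  Tbar = 0.5, alpha = 0.05,
  numCovar.1 = 1, numCovar.2 = 1,
  R2.1 = 0.1, R2.2 = 0.5,
  ICC.2 = 0.15, ICC.3 = 0.20,
  omega.3 = 0.10,              # district-level impact heterogeneity
  rho = 0.4                    # assumed correlation among test statistics
)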
B.6 d3.3 designs: 3 levels, randomization at level 3
These designs have randomization at the top level. They are cluster randomized, but we can model the nesting structure within cluster.
The randomization scheme is: simple random sampling occurs across districts, with proportion Tbar of districts assigned to treatment.
T.x <- randomizr::cluster_ra( D.id, prob = Tbar )
B.6.1 Random effects (d3.3_m3rc2rc)
PowerUp name: cra3_3r
Design: 3-level design, randomization at level 3 (cluster).
Model: random intercepts for districts, constant treatment effects for districts, random intercepts for schools, constant treatment effects for schools, school and district covariates for intercept.
The model for estimating impacts on outcome is given by:
(68) | ||||
with reduced form:
(69) | ||||
and distributions:
(70) | ||||
The standard error of the treatment effect estimate is given by:
(71) |
The degrees of freedom are
(72) |
The constant effects model means that we assume no treatment variation across our sites; i.e., both the school-level and district-level impact variances (omega.2 and omega.3) are zero.
Sample size formula.
(73) | ||||
(74) | ||||
(75) |
Code syntax.
The R model is
Yobs ~ 1 + T.x + V.k + X.jk + C.ijk + (1 | S.id) + (1 | D.id)
Appendix C The data generating process
We now discuss the assumed data generating process (DGP), indexed by parameters directly tied to the structural equations we use. The data are generated in the stages outlined below.
C.1 Determine DGP parameters
We have already discussed most of the required parameters in Section~A. However, there are a few additional parameters, related to correlations, that are required to generate data but do not directly feed into our equations for single-outcome power or MDES; these are shown in Table~13.
The parameters used in this section need to be picked based on the desired aggregate relationships in the full data. See the next section for how to translate parameters such as the ICCs into the DGP parameters.
Param | Description |
Correlation matrix of district covariates | |
Correlation matrix of district random effects | |
Correlation matrix of district impacts | |
Non-symmetric matrix of correlations between district random effects and impacts, composed of entries ) | |
Correlation matrix of school covariates | |
Correlation matrix of school random effects | |
Correlation matrix of school impacts | |
Non-symmetric matrix of correlations between school random effects and impacts, composed of entries ) | |
Correlation matrix of individual covariates | |
Correlation matrix of individual residuals |
C.2 Generate level 3 (district) data
C.2.1 Level 3 covariates
Each outcome has its own district-level covariates. We assume the covariates are standardized (mean zero, unit variance) and allow correlation between covariates across outcomes, governed by a symmetric correlation matrix whose entries give the pairwise correlations.
C.2.2 Level 3 outcomes
Each district has a grand mean outcome under no treatment and a grand mean impact across its schools.
(76) | ||||
(77) |
Without loss of generality, we set the grand mean outcome under no treatment across all districts to zero for every outcome. The remaining quantity is the grand mean impact across districts.
We now consider the distributions of the district-level random intercepts and impacts. The random intercepts are mean-zero Normal with a district-level variance and a correlation-between-outcomes matrix; the random impacts are likewise mean-zero Normal with their own variance and correlation-between-outcomes matrix.
For a single outcome, the intercept and impact of a district are jointly bivariate Normal with a specified correlation:
(78) |
But we want these values to be correlated across outcomes (e.g., a district's average math score will be correlated with its average reading score). We therefore generate the full set of a district's random effects across all outcomes as a vector of multivariate Normal residuals:
with
and
For the form of the off-diagonal blocks, we first construct a matrix of intercept-impact correlations. The diagonal entries of this matrix have been previously defined as the correlation of the intercept and impact for a given outcome. The off-diagonal entries are the correlations between the intercept for one outcome and the impact for another.
We assume these correlations are constant across districts.
For example, consider a case:
We note that this matrix does not necessarily have to be symmetric.
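As an illustration of this step, the sketch below draws each district's intercept and impact residuals across M outcomes with MASS::mvrnorm. The variable names and numeric values are hypothetical placeholders; the block structure of the correlation matrix follows the description above.
library(MASS)
M <- 2                        # number of outcomes
sd.w0 <- c(0.2, 0.2)          # SDs of district intercept residuals, per outcome
sd.w1 <- c(0.1, 0.1)          # SDs of district impact residuals, per outcome
rho.w0 <- diag(M); rho.w0[1, 2] <- rho.w0[2, 1] <- 0.5  # intercept correlations
rho.w1 <- diag(M); rho.w1[1, 2] <- rho.w1[2, 1] <- 0.5  # impact correlations
kappa.w <- matrix(0.3, M, M)                            # intercept-impact correlations
# Assemble the full 2M x 2M correlation matrix (intercepts first, then impacts),
# convert it to a covariance matrix, and draw one vector of residuals per district.
R <- rbind(cbind(rho.w0, kappa.w), cbind(t(kappa.w), rho.w1))
Sigma <- diag(c(sd.w0, sd.w1)) %*% R %*% diag(c(sd.w0, sd.w1))
K <- 20                                                 # number of districts
district.effects <- mvrnorm(K, mu = rep(0, 2 * M), Sigma = Sigma)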
C.3 Generate level 2 (school) data
C.3.1 Level 2 covariates
Each outcome has its own school-level covariate: for example, school-average reading and math pre-tests used to adjust the reading and math outcomes, respectively (in practice we might adjust each outcome with both, but with few clusters this may not be advisable due to degrees-of-freedom concerns).
As with the district-level covariates, the school covariates are standardized and correlated across outcomes via a symmetric correlation matrix.
C.3.2 Level 2 outcomes
Each school has its average outcome under no treatment and its average impact. The mean outcome and average impact of a school, for a given outcome, are
(79) | ||||
(80) |
We can easily convert from three-level to two-level models: if there are no districts, the district-level random effects, impacts, and covariate contributions are all set to zero.
The school-level random intercepts and impacts follow a multivariate Normal structure analogous to Section~C.2.2, with their own variances, correlation-between-outcomes matrices, and intercept-impact correlations.
C.4 Generate level 1 (individual) data
C.4.1 Level 1 covariates
Individuals have individual-level covariates, one for each outcome: for example, group-mean centered reading and math scores. We assume these are homoskedastic and have the same mean across sites. As with the previous covariates, they are standardized and correlated across outcomes via a symmetric correlation matrix.
C.4.2 Level 1 outcomes
For each outcome, the outcome model for the individual is
(81) | ||||
(82) |
where the first term is the potential outcome under no treatment for an individual in a given school and district, and the second is the unit's individual causal effect.
We assume constant treatment effects for individuals in the same school, but this assumption could be relaxed to allow for individual-level treatment heterogeneity.
Finally, individual-level residuals are mean-zero Normal with a common variance and are correlated across outcomes via a symmetric correlation matrix.
C.4.3 Reduced form
Putting the levels together, we have:
(83) | ||||
(84) |
C.5 Summary: Generating the full table of potential outcomes
1. For each district and each outcome:
(a) Generate district covariates.
(b) Generate district intercept and impact residuals.
(c) Calculate the district grand means (control outcome and impact).
2. For each school and each outcome:
(a) Generate school covariates.
(b) Generate school intercept and impact residuals.
(c) Calculate the school grand means (control outcome and impact).
3. For each individual and each outcome:
(a) Generate individual covariates.
(b) Generate individual residuals.
(c) Generate predicted baseline outcomes (the control potential outcome without residuals).
(d) Add residuals to the predicted outcomes to get the control potential outcome, and calculate the corresponding treated potential outcome.
C.6 Generate observed data
Once we have our full set of potential outcomes, we generate treatment assignments and, from them, the observed outcomes. We generate a treatment assignment for every individual. Once we have the set of assignments (no matter how they were obtained), we calculate the observed outcomes:
(85) |
C.6.1 Randomization schemes
We can assign at the district, school, or individual level depending on the design we are generating data for.
- Blocked individual randomization: simple random sampling occurs within each school, with proportion Tbar of units assigned to treatment in each school.
- Cluster 2-level randomization: simple random sampling occurs across schools, with proportion Tbar of schools assigned to treatment.
- Blocked cluster 2-level randomization: school-level assignment occurs within each district, with proportion Tbar of schools assigned to treatment in each district.
- Cluster 3-level randomization: simple random sampling occurs across districts, with proportion Tbar of districts assigned to treatment.
Appendix D Tuning the DGP parameters
We define two main types of parameters. First, model parameters, defined in Tables~9 and 13, define the DGP. Second, control or derived parameters, defined in Table~12, indirectly tune the model parameters. Control parameters are set by the user and are then translated into the model parameters fed into the DGP. The mapping of control parameters to model parameters is given in Table~12.
We break our model parameters into sets:
- Set 1: parameters that are set directly.
- Set 2: parameters that are functions of parameters that are set directly.
- Set 3: parameters that are tuned through the control parameters.
To translate from our control parameters to the model parameters, we derive several relationships in the following.
D.1 Calculating the variation in random effects and impacts
We have variation at the individual, school, and district level. We want to be able to tune the proportion of variation in each of these levels. We are interested in the unconditional (covariate-free) ICC.
We have for the variance of the control side:
Looking at Equation~83, we see:
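As a small illustration of this step (a sketch under the assumption that the control-side outcome variance is normalized to one, so that the level 3 and level 2 variance shares equal the ICCs; the helper name is hypothetical):
# Recover variance components from target ICC values when total variance is 1.
icc_to_variances <- function(ICC.2, ICC.3 = 0) {
  list(
    var.district   = ICC.3,             # level 3 (district) share of variation
    var.school     = ICC.2,             # level 2 (school) share of variation
    var.individual = 1 - ICC.2 - ICC.3  # level 1 (individual) remainder
  )
}
icc_to_variances(ICC.2 = 0.15, ICC.3 = 0.20)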
D.2 Calculating the covariate coefficients
D.2.1 Calculating the level 3 covariate coefficient
The regression coefficient for the level 3 covariates is dictated by the desired R2.3 value. Thus, we would like to find this coefficient as a function of the level-3 R-squared, which we define as the proportion of variance between level 3 districts predicted by level 3 covariates.
We start with
leading to
D.2.2 Calculating the level 2 covariate coefficient
We start with our level-2 R-squared being defined as the proportion of variance in level-2 schools explained by level-2 covariates:
where the conditioning denotes the variance of outcomes within a particular district. This expands to
Since our school covariates are generated independently of district, the conditional variance is the same as the overall variance. This gives
Leading to
D.2.3 Calculating the level 1 covariate coefficient
Similar to level 2, we start with our level-1 R-squared being defined as the proportion of level 1 variance in individuals explained by level 1 covariates:
where the conditioning denotes the variance of outcomes within a particular school.
We find
D.3 Calculating the grand mean impacts
The grand mean impact is a function of the target effect size. The effect size is simply the overall impact in standard deviation units, with the standard deviation usually being the marginal standard deviation of the control side:
where the standard deviation is taken over districts, schools, and individuals for a fixed outcome, as noted above.
D.4 Final results
We have produced a system of equations:
We solve the system to find our model parameters:
For details on the algebra, see Section~E.
And finally we set:
Appendix E Appendix: Derivations of parameter formulae
Let’s start off with expressions we will later use:
We also note:
And finally it’s easy to re-express :
Let’s start by plugging some of these into our expression for to find :
Proceeding by a similar method, we can use to find :
Now we can plug in to find :
And similarly: