The Causal Effect of the Two-For-One Strategy in the National Basketball Association
Abstract
This study evaluates the effectiveness of the “two-for-one” strategy in basketball by applying a causal inference framework to play-by-play data from the 2018-19 and 2021-22 National Basketball Association regular seasons. Incorporating factors such as player lineup, betting odds, and player ratings, we compute the average treatment effect and find that the two-for-one strategy has a positive impact on game outcomes, suggesting it can benefit teams when employed effectively. Additionally, we investigate potential heterogeneity in the strategy’s effectiveness using the causal forest framework, with tests indicating no significant variation across different contexts. These findings offer valuable insights into the tactical advantages of the two-for-one strategy in professional basketball.
Keywords— causal inference, average treatment effect, heterogeneous treatment effect, sports analytics
1 Introduction
In recent years, the use of data and statistical methods to analyze sports strategies has gained significant traction, driven by the availability of detailed game data and advances in data science. In professional sports leagues like the National Basketball Association (NBA), decision-makers seek to optimize team performance through strategic adjustments informed by empirical analysis. Among these strategies, the “two-for-one” tactic, where a team attempts to gain an extra possession by timing shots near the end of a quarter, has become popular. However, despite its widespread use, little rigorous evidence exists on whether this strategy provides a measurable advantage. This study aims to address this gap by applying causal inference methods to determine the effectiveness of the two-for-one strategy.
An increasing body of research has applied causal inference methods to investigate the causal effects of various sports strategies. For example, Gibbs et al. (2022) utilize causal inference to evaluate the impact of calling a timeout to disrupt an opposing team’s scoring run in NBA games. Nakahara et al. (2023) analyze pitching strategies in baseball through a causal inference framework, while Yam & Lopez (2019) examine the decision-making process surrounding fourth-down plays in the NFL, aiming to quantify the effectiveness of these high-stakes decisions. Additionally, Wu et al. (2021) explore the effect of crossing the ball in soccer, seeking to determine whether this strategy significantly influences scoring opportunities.
These studies represent a growing interest in the causal analysis of strategic decisions in sports, where the interplay of chance and skill presents unique challenges to identifying causality. By employing techniques that control for confounding variables and capitalize on observational data, such research provides insights that go beyond traditional statistical descriptions, aiming instead to uncover the causal impact of specific decisions. The increasing application of causal inference in sports analytics thus reflects a broader trend towards data-driven decision-making and evidence-based optimization of strategy across various sports contexts.
Causal inference (Imbens & Rubin, 2015) methods are essential tools for rigorously analyzing the effects of treatments or interventions in observational settings, particularly when randomized controlled trials (RCTs) are impractical, as is often the case in sports. In this study, we focus on estimating the average treatment effect (ATE) of the NBA’s two-for-one strategy. The ATE quantifies the average difference in outcomes (e.g., game score or quarter lead) between situations where the two-for-one strategy is employed and those where it is not. Estimating the ATE in observational data, however, requires addressing confounding variables—factors that influence both the treatment and the outcome, potentially biasing the results. To mitigate such bias, we apply inverse probability weighting and propensity score matching, two widely used methods in causal inference that allow for more reliable estimation of treatment effects.
Inverse probability weighting involves estimating the probability (propensity score) that a given observation will receive the treatment (the two-for-one strategy) based on observed covariates (e.g., player lineup, time in the game). Each observation is then weighted by the inverse of its propensity score to create a pseudo-population with a similar distribution of covariates between treated and control groups. Alternatively, matching pairs treated and control observations that share similar values of the propensity score, effectively matching the treatment and control groups on key covariates. This approach further reduces the risk of confounding and allows for a more direct comparison of the outcomes between the two groups, simulating the conditions of an RCT. Together, these methods help isolate the causal effect of the two-for-one strategy from other confounding influences.
Beyond estimating the ATE, it is crucial to assess whether the treatment effect varies across different contexts or player configurations. To do this, we employ the causal forest (Athey et al., 2019; Wager & Athey, 2018) framework, a machine learning-based method that enables the estimation of heterogeneous treatment effects (HTE). Traditional ATE estimation assumes that the treatment effect is constant across all observations, but in real-world settings like the NBA, the effectiveness of a strategy is likely to vary based on game context, player lineups, and other factors. The causal forest framework addresses this by partitioning the data into subgroups using decision trees (Hastie et al., 2009). Each subgroup represents a distinct set of conditions where the treatment effect may differ. Causal forests construct many such trees and aggregate the resulting estimates to produce a flexible, non-parametric estimate of treatment effect heterogeneity. By identifying subgroups with distinct treatment effects, causal forests allow us to uncover whether the two-for-one strategy is more or less effective in different situations, such as when certain players are on the court or at different points in a game.
In addition to estimating treatment effect heterogeneity, we apply the concept of the rank-weighted average treatment effect (RATE) (Yadlowsky et al., 2024) to better understand the strategy’s impact. RATE works by ranking observations based on their predicted treatment effects, identifying which game contexts or player configurations are most likely to benefit from the two-for-one strategy. The RATE method quantifies the expected impact on the top-ranked subgroups and compares it with the average effect across all observations. This ranking approach is particularly valuable in practice, as it highlights the specific conditions under which the two-for-one strategy is most advantageous, offering actionable insights for coaches and analysts looking to optimize in-game decisions.
1.1 Contributions
This study leverages detailed play-by-play data from the 2018-19 and 2021-22 NBA seasons, including variables such as player lineups, betting odds, and player ratings, to rigorously analyze the effectiveness of the two-for-one strategy. By framing the two-for-one strategy as a causal inference problem, we provide a formal statistical model that allows for reliable estimation of treatment effects while controlling for confounding factors. The methods used in this analysis—inverse probability weighting, matching, causal forests, and RATE—are all designed to ensure that our results are robust and generalizable to other contexts in sports analytics.
The main contributions of this paper are as follows:
1. We frame the two-for-one strategy as a causal inference problem, translating a complex sports decision into a statistical model that can be rigorously analyzed.
2. We estimate the average treatment effect of the two-for-one strategy, providing quantitative evidence of its effectiveness in NBA games.
3. We investigate potential heterogeneity in the strategy’s impact using the causal forest framework, finding that the strategy’s effectiveness appears relatively consistent across various game contexts.
Our findings contribute to sports analytics by offering data-driven insights into an influential NBA strategy, providing coaches and analysts with valuable, evidence-based guidance for optimizing in-game tactics.
The structure of this paper is as follows. Section 2 presents definitions and background on the two-for-one strategy and details the data set used in our analysis. Section 3 outlines the causal methodology employed to assess the impact of this strategy. Section 4 presents the results obtained from analyzing two seasons of NBA data, highlighting key findings on the effectiveness of the two-for-one approach. Section 5 concludes with a brief discussion.
2 Definitions and Data
In this section, we provide a detailed explanation of the two-for-one (TFO) strategy in the NBA, including a mathematical definition of the strategy and a description of the data sources used in this study.
The TFO (Feng, 2015; Fischer, 2015) strategy in the NBA is a time-management tactic that aims to maximize a team’s scoring opportunities by securing two offensive possessions at the end of a quarter, while limiting the opponent to just one. This strategy is commonly employed in the last 30–40 seconds of a quarter, especially during close games, where maximizing scoring chances can be crucial. The TFO strategy is primarily focused on managing the game clock to optimize possessions and increase scoring potential before the quarter ends.
To implement the TFO, a team aims to take a shot early enough in the shot clock so that even if the opponent uses a full possession, the team can secure the ball again before time expires. For example, if a team has possession with 36 seconds remaining in the quarter, they may attempt a quick, high-quality shot within 6–8 seconds. This leaves approximately 30 seconds on the clock, and even if the opposing team takes nearly the entire 24 seconds allowed per possession, the original team would regain possession with around 5–10 seconds remaining, allowing them one last shot before the quarter ends.
2.1 Definitions
In order to determine the causal effect of the TFO strategy in the NBA, we need a well-defined treatment. That is, we need to be able to classify scenarios during an NBA game as those in which a team enacted the TFO strategy and those in which it did not. This is not a trivial matter. Teams do not announce that they are attempting a TFO, and it is unclear how attuned players are to the game and shot clocks during any particular play. We therefore rely on a combination of time and play outcomes to identify TFO opportunities (TFO-O). These are scenarios in which a team comes into possession of the ball with enough time left in the quarter that taking a shot or initiating offense relatively early in the shot clock leads to an increased chance of an extra possession, while waiting until the end of the shot clock increases the chance that the opposing team will have the last possession. We then categorize those TFO-O into attempts (TFO-A) and non-attempts (TFO-NA).
Several important characteristics of NBA game play inform these definitions. The shot clock in the NBA is 24 seconds, meaning that from the time a team gains possession of the ball after a turnover, rebound, or made basket, the team has 24 seconds in which to attempt a shot. If a team gains possession with less than 24 seconds left in the quarter, the shot clock is turned off and the team in possession can attempt to have the last possession of the quarter by waiting until the very end to take a shot. Therefore, if a team wants to improve its chances of gaining the last possession of the quarter, it should hope that the opposing team gains possession with more than 24 seconds left.
However, it can take up to a few seconds for possession to change, particularly after a missed shot. As an example, suppose a player on Team A takes a shot with 6 seconds left in the quarter. It takes one second for the ball to get from the player’s hand to the rim, another second to bounce on the rim a couple of times before coming to the ground, and a third second before a player on Team B possesses it, now with only 3 seconds left in the quarter. Furthermore, if there is too little time left in the quarter when a team gains possession, that team has a very small chance of scoring any points.
With these characteristics in mind, we set a cutoff of 28 seconds left in the quarter: if a team takes a shot after this point and possession changes, it is unlikely to get a quality possession at the end of the quarter. Working backwards, then, a team needs to gain possession of the ball with enough time before the 28-second mark to potentially attempt a reasonable shot by that point. We set this lower limit to be 35 seconds. We choose 43 seconds as an upper limit on when a team can gain possession, to exclude scenarios where a team can reach this optimal shot time simply by waiting until the shot clock for its possession is close to 0. Thus we define a TFO-O as a scenario in which a team comes into possession of the ball with between 35 and 43 seconds left in the quarter.
If a team presented with a TFO-O takes a shot, or is fouled while shooting, with at least 28 seconds to go in the quarter, we consider that a clear TFO-A. Other play results are harder to categorize. We also count an opportunity as a TFO-A if the team is fouled (without shooting) with at least 28 seconds to go in the quarter, the rationale being that fouls more often occur when a team is trying to initiate offense. On the other hand, we consider an opportunity to be a TFO-NA if the team maintains possession until there are fewer than 28 seconds left in the quarter without taking a shot or being fouled.
We remove all other play outcomes that occur during a TFO-O from consideration rather than defining them as a TFO-A or a TFO-NA. The most common of these is a turnover. Turnovers can happen while a team is initiating offense (for example, a charging foul) or while it is simply trying to hold the ball until the end of the shot clock (for example, a steal far from the basket). Because play-by-play data contain little information about the circumstances of a turnover, we choose not to place turnovers in either the TFO-A or TFO-NA category. The same is true for other, less common plays such as jump balls, defensive three-second violations, and kicked-ball violations.
We define our response variable as the Post Opportunity Difference (POD), denoted by $Y$. The POD quantifies the effectiveness of the TFO strategy by measuring the change in score differential from the time of the TFO-O to the end of the quarter. To calculate $Y$, we first determine the score difference between the teams at the time of the TFO-O, denoted $S_{\text{start}}$. Next, we record the score difference at the end of the quarter, denoted $S_{\text{end}}$. The POD is then defined as:

$$Y = S_{\text{end}} - S_{\text{start}}. \qquad (1)$$

A positive value of $Y$ indicates that the team increased its score margin by the end of the quarter. Conversely, a negative value of $Y$ implies that the team’s score margin decreased after the TFO-O. Thus, if the TFO strategy is effective, the POD should be higher for teams that attempt a TFO than for those that do not.
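To make these definitions concrete, the following is a minimal R sketch of how TFO opportunities could be flagged and labeled as attempts or non-attempts, and how the POD could be computed. The data frame `pbp` and its column names are hypothetical placeholders rather than the actual schema of our data, and ambiguous outcomes (turnovers, jump balls, and so on) are assumed to have been removed already.

```r
library(dplyr)

# Hypothetical schema: one row per possession, with the seconds left in the
# quarter when possession began (poss_start_sec), the seconds left when the
# first shot or foul occurred (first_shot_or_foul_sec, NA if none), and the
# score margins at the start of the opportunity and at the end of the quarter.
classify_tfo <- function(pbp) {
  pbp %>%
    # TFO opportunity: possession gained with 35-43 seconds left in the quarter.
    filter(poss_start_sec >= 35, poss_start_sec <= 43) %>%
    mutate(
      treatment = if_else(
        !is.na(first_shot_or_foul_sec) & first_shot_or_foul_sec >= 28,
        "TFO-A",   # shot taken or foul drawn with at least 28 seconds left
        "TFO-NA"   # possession held past the 28-second mark with no shot/foul
      ),
      # Post Opportunity Difference: change in score margin from the start
      # of the opportunity to the end of the quarter.
      pod = margin_end_of_quarter - margin_at_opportunity
    )
}
```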
2.2 Data
The main source of data we use is NBA play-by-play data: a list of all the plays that took place in each game, along with supplemental information such as the players involved in the play, the score of the game at the time of the play, the quarter the play took place in, and the amount of time left in that quarter. We obtain play-by-play data for each regular season game in the 2018-2019 and 2021-2022 seasons using the nbastatR package (Bresler, 2024). The 2018-2019 season was the last regular season before the COVID-19 pandemic altered the NBA schedule, and the 2021-2022 season was the first full season after the pandemic, making these natural seasons to consider.
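As a rough illustration of the data acquisition step, play-by-play data for the two seasons can be pulled along the following lines. The nbastatR function names and arguments shown here (`game_logs()`, `play_by_play_v2()`) and the `idGame` column are our assumptions about the package interface and should be checked against its documentation.

```r
library(nbastatR)

# Regular-season team game logs for the two seasons of interest
# (nbastatR labels seasons by the year in which they end).
logs <- game_logs(seasons = c(2019, 2022),
                  result_types = "team",
                  season_types = "Regular Season")

# Play-by-play events for every game: play descriptions, scores,
# period, and time remaining in the period.
pbp <- play_by_play_v2(game_ids = unique(logs$idGame))
```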
Using this play-by-play data, we identify all TFO-O in the first three quarters of each NBA regular season game. We choose to exclude the fourth quarters because the behavior of NBA games is often much different at the end of those quarters. If the game is not close, teams are no longer necessarily worried about maximizing their possessions because of the feeling that nothing they do will affect the outcome. If the game is very close, teams tend to foul more to try to regain possession. We believe that excluding these scenarios allows us to better isolate the treatment. For the 2018-2019 season, there were a total of 1036 TFO-A and 529 TFO-NA. For the 2021-2022 season there were a total of 950 TFO-A and 462 TFO-NA.
To enhance our analysis with relevant covariates, we draw from several supplemental data sources. First, to gauge the relative strength of the teams at the time of each game, we incorporate betting odds, which indicate the favored team and the expected point spread. These odds effectively summarize expert assessments of each team’s strength. Additionally, to measure the projected pace of play, we include betting totals, which provide a prediction of the total points expected to be scored in a game. Both the betting odds and totals are sourced from the Sportsbook Reviews archive (Sportsbook Reviews, 2023).
For player-specific information, we use individual player ratings from the NBA 2K video game series (HoopsHype, 2018, 2021), which provide relative strength assessments for each player. We obtain starting lineups from BasketballReference.com (Basketball Reference, n.d.) and use this information, combined with play-by-play data, to determine which players are on the court at any given time. These covariates allow us to account for variations in team and player strength as well as game tempo, enhancing the robustness of our analysis. Table 1 provides the name and definitions of the variables we use.
Table 1: Variables used in the analysis and their definitions.

Variable Name | Variable Definition |
---|---|
Period | Period of the game; data is restricted to the first three periods. |
Time Left | Remaining time in the current period, measured in seconds. |
Score Margin | Difference between the scores of the two teams. |
Max Rating | Maximum player rating among players on the court for the team at that moment. |
Max Rating Opposition | Maximum player rating among players on the court for the opposing team at that moment. |
Mean Rating | Average player rating of the players on the court for the team at that moment. |
Mean Rating Opposition | Average player rating of the players on the court for the opposing team at that moment. |
Spread | The betting point spread for the game. |
Total Score | The projected total score from betting odds. |
In the following section, we describe the causal inference methodology we employed, detailing our computation of the average treatment effect and our investigation into treatment effect heterogeneity.
3 Methodology
Using the definitions from the previous section, we define the treatment for observation $i$ as $W_i = 1$ if the TFO-O turned into a TFO-A and $W_i = 0$ if the TFO-O turned into a TFO-NA. We let $Y_i$ be the POD. We employ the potential outcomes framework to define the primary causal effect of interest. Let $Y_i(1)$ and $Y_i(0)$ be the potential outcomes under treatment and control, respectively. That is, $Y_i(1)$ is the POD a team would obtain if it turned a TFO-O into a TFO-A, and $Y_i(0)$ is the POD it would obtain if it turned the TFO-O into a TFO-NA. A natural estimand of interest is the average treatment effect (ATE):
$$\tau_{\text{ATE}} = \mathbb{E}\left[Y(1) - Y(0)\right]. \qquad (2)$$
Other estimands may also be of interest for particular analyses. In particular, we can restrict the population of interest to be either the treatment or control group, leading to the average treatment effect on the treated (ATT),
$$\tau_{\text{ATT}} = \mathbb{E}\left[Y(1) - Y(0) \mid W = 1\right], \qquad (3)$$
and the average treatment effect on the control (ATC),
$$\tau_{\text{ATC}} = \mathbb{E}\left[Y(1) - Y(0) \mid W = 0\right]. \qquad (4)$$
We will use $\tau$ throughout the paper to represent these treatment effects, with the specific type of effect given by context.
Unfortunately, we do not observe both $Y_i(1)$ and $Y_i(0)$ for any particular observation. For each TFO-A and TFO-NA, it is impossible to know what would have happened if the team had chosen the other path. This is known as the fundamental problem of causal inference. If the scenarios in which a team chose a TFO-NA were the same, on average, in all meaningful ways as the scenarios in which a team chose a TFO-A, an assumption known as exchangeability, then we could estimate the ATE from the data as the simple difference in group means, $\hat{\tau} = \bar{Y}_{W=1} - \bar{Y}_{W=0}$. Unfortunately that assumption is not reasonable in this situation, as potentially confounding variables such as the score margin and the time left in the quarter are likely to influence a team’s decision.
The methods of causal inference have been developed for precisely such circumstances, when a causal effect estimate is desired for observational data where exchangeability does not hold. These methods typically rely on a common set of assumptions, namely consistency, no interference, conditional exchangeability, and positivity. Below we discuss these assumptions and why they are reasonably satisfied for our data, followed by a description of the specific methods we will be using to estimate treatment effects.
3.1 Causal Assumptions
The consistency assumption implies that there is a well-defined treatment with no hidden variations. In our setting we recognize that not all TFO-A look the same, as highlighted in the previous section. Some end in a shot while others end in a foul. In some cases the team that had the TFO-A gets the ball back, and in other cases it does not. However, because these differences are a part of the definition of our treatment, they are not hidden and do not form an egregious violation of the consistency assumption.
The no interference assumption says that the outcome for a particular unit does not depend on the treatment assignment of any other unit. Mathematically, this means that for any individual $i$,

$$Y_i(W_i, \mathbf{W}_{-i}) = Y_i(W_i), \qquad (5)$$
where $\mathbf{W}_{-i}$ represents the treatment vector excluding $W_i$. This assumption would be violated in our setting if the success of a particular team’s TFO-A was affected by whether another TFO-O in the study was an attempt or a non-attempt. It seems unlikely that this would be true if the two units are from separate teams. It is possible, due to the season-long nature of the data, that if a particular team has a large number of TFO-A relative to TFO-NA early in the year, its players may practice last-second shots more because they have experienced them more often in the course of play, and thus improve the outcome of TFO-A later in the year. However, players already practice these types of shots because of their importance to game play, and it seems unlikely that the amount of time they spend practicing such shots substantially depends on how often they attempt a TFO strategy. Thus the no interference assumption seems to reasonably hold in our setting.
Conditional exchangeability implies that, given a set of covariates $X$, the treatment assignment is independent of the potential outcomes:

$$Y(1),\, Y(0) \perp\!\!\!\perp W \mid X. \qquad (6)$$
This allows us to assume that treated and untreated individuals with the same values of $X$ have comparable distributions of potential outcomes, thus enabling causal comparisons. Suppose our set of covariates consists of the score margin and the time left in the quarter. Then conditional exchangeability would say, for example, that TFO-O scenarios where a team is down by 4 with 40 seconds left in the quarter are the same, on average, for both the TFO-A and TFO-NA groups. The key to reasonably satisfying this assumption is choosing a set of covariates rich enough to include any major potential confounding variables.
The list of covariates we consider is given by Table 1. When considering whether teams convert a TFO-O into a TFO-A, perhaps the most obvious factor is how much time is left in the quarter. We also believe that there may be quarter specific effects and thus include indicators for the first three quarters. The scoring margin at the time of the TFO-O is included as a team may feel more of a need to maximize possessions if it is trailing in the game. The quality of the lineup on the floor for both teams is also important and is characterized by the different rating covariates. We use the betting spread to account for how close a game is expected to be, and the betting total to account for how high or low scoring the game is expected to be. We believe these are the most important sources of potential confounding and thus that conditional exchangeability is reasonably satisfied. It is also common to check this condition by looking at the balance of the covariates in the groups created by the methods discussed below, which we do in Section 4.1.
The final assumption necessary is the positivity assumption, which says that each TFO-O must have a positive probability, conditional on our set of covariates, of being in either the treatment or control group, i.e.,

$$0 < P(W = 1 \mid X = x) < 1 \quad \text{for all } x. \qquad (7)$$
There is a known trade-off between positivity and conditional exchangeability. If you condition on too many covariates, you may satisfy the latter but not the former. If you do not condition on enough covariates, the opposite may be true. We feel that we have found a good balance between the two by carefully selecting a small but rich set of covariates. We perform a visual check of the positivity assumption in Section 4.1.
3.2 Causal Methods for Average Treatment Effects
To estimate treatment effects we make use of two common causal inference methods: inverse probability weighting (IPW) and matching. Both of these methods rely on the concept of a propensity score (Rosenbaum & Rubin, 1983), which is defined as the probability of receiving treatment given baseline covariates. For a binary treatment $W$, the propensity score, denoted by $e(x)$, is given by

$$e(x) = P(W = 1 \mid X = x). \qquad (8)$$
This scalar summary of multiple covariates facilitates covariate balance between treated and control groups, aiming to replicate the distributional balance of a randomized experiment. Propensity scores are generally unknown quantities but can be estimated using logistic regression or machine learning methods such as random forests or gradient boosting (Hastie et al., 2009), which can handle complex relationships between covariates and treatment assignment (Lee et al., 2010). For our analysis we employ the covariate balancing propensity score method of Imai & Ratkovic (2013), which improves performance by leveraging both the goal of modeling treatment assignment and that of balancing the covariates. In practice, propensity scores are often computed in software such as R, using packages like MatchIt, which provides matching methods, or WeightIt, which computes weights for propensity score weighting (Stuart et al., 2011; Greifer, 2024b). Under the assumptions of conditional exchangeability, positivity and consistency, the propensity score allows for unbiased estimation of the ATE (Imbens & Wooldridge, 2009; Stuart, 2010).
Propensity scores are used to estimate treatment effects in several ways. IPW assigns weights to individuals to create a pseudo-population in which the distribution of covariates is balanced between treatment groups. For an individual $i$ with covariates $X_i$, when the ATE is the estimand of interest, weights are computed as

$$w_i = \frac{W_i}{\hat{e}(X_i)} + \frac{1 - W_i}{1 - \hat{e}(X_i)}, \qquad (9)$$

where $\hat{e}(X_i)$ is the estimated propensity score. This weight is larger for individuals in groups with a lower probability of treatment assignment, adjusting for selection bias by up-weighting underrepresented individuals (Hirano et al., 2003). If conditional exchangeability holds in the true population, then marginal exchangeability holds in the pseudo-population. The ATE can then be estimated by the weighted difference in outcomes:

$$\hat{\tau}_{\text{ATE}} = \frac{\sum_{i} w_i W_i Y_i}{\sum_{i} w_i W_i} - \frac{\sum_{i} w_i (1 - W_i) Y_i}{\sum_{i} w_i (1 - W_i)}, \qquad (10)$$

where $Y_i$ denotes the observed outcome. If the ATT or ATC is the estimand of interest, the weights are simply modified so that the distribution of the group of interest is left unchanged, leading to

$$w_i^{\text{ATT}} = W_i + (1 - W_i)\,\frac{\hat{e}(X_i)}{1 - \hat{e}(X_i)}, \qquad (11)$$

$$w_i^{\text{ATC}} = W_i\,\frac{1 - \hat{e}(X_i)}{\hat{e}(X_i)} + (1 - W_i). \qquad (12)$$
This IPW estimator can be implemented in R using the survey package, which enables the application of IPW techniques through survey design objects, and twang, which supports estimation and diagnostics for IPW models with large datasets (Robins et al., 2000; Cefalu et al., 2024).
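For concreteness, here is a minimal sketch of the IPW workflow used later in the paper, with covariate balancing propensity scores targeting the ATC. The data frame `tfo`, its column names, the treatment indicator `W`, and the outcome `pod` are assumed for illustration.

```r
library(WeightIt)
library(survey)

# Covariate balancing propensity score weights targeting the ATC.
w_out <- weightit(W ~ period + time_left + score_margin +
                    max_rating + max_rating_opp +
                    mean_rating + mean_rating_opp +
                    spread + total_score,
                  data = tfo, method = "cbps", estimand = "ATC")

# Carry the weights in a survey design and estimate the weighted
# difference in outcomes; the coefficient on W is the estimated ATC.
tfo$ipw <- w_out$weights
des <- svydesign(ids = ~1, weights = ~ipw, data = tfo)
fit <- svyglm(pod ~ W, design = des)
summary(fit)
confint(fit)
```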
Propensity score matching pairs treated and control individuals with similar propensity scores, aiming to create a balanced sample in which the treatment and control groups are comparable on observed covariates. When estimating the ATT, each treated unit $i$ is matched to a control unit whose propensity score is as close as possible to $\hat{e}(X_i)$. When estimating the ATC, each control unit is matched to a treated unit whose propensity score is as close as possible to its own. Matching can be done using nearest-neighbor matching, which minimizes the absolute difference between scores; caliper matching, which restricts matches to pairs within a predefined range; or optimal matching, which minimizes the overall distance across matched pairs (Austin, 2011; Rosenbaum, 2002). If there are many more observations in one of the groups, several units in the larger group may be matched with a single unit in the smaller group. Typically matching is done without replacement, although matching with replacement is also possible.
Again, if conditional exchangeability holds in the true population, then marginal exchangeability holds in the matched population, and the treatment effect can be estimated as the difference in the treatment and control group averages in the matched population. Thus for the ATT we have

$$\hat{\tau}_{\text{ATT}} = \frac{1}{N_1}\sum_{i:\, W_i = 1}\left(Y_i - \frac{1}{|M(i)|}\sum_{j \in M(i)} Y_j\right), \qquad (13)$$

where $N_1$ is the number of treated individuals and $M(i)$ represents the matched control units for treated unit $i$. For the ATC we can simply swap the treatment and control groups in the above formula. This method can effectively balance covariates if quality matches can be found, making it a robust approach for estimating causal effects when the assumptions hold. In R, the MatchIt package provides flexible options for implementing nearest-neighbor, caliper, and optimal matching, along with diagnostic tools for balance checking (Stuart et al., 2011).
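A corresponding sketch for the matching workflow, again under assumed column names, following the 1:1 optimal matching and g-computation approach described in Section 4. The outcome model shown is a simplified placeholder.

```r
library(MatchIt)
library(marginaleffects)

# 1:1 optimal pair matching without replacement on covariate balancing
# propensity scores, targeting the ATC.
m_out <- matchit(W ~ period + time_left + score_margin +
                   max_rating + max_rating_opp +
                   mean_rating + mean_rating_opp +
                   spread + total_score,
                 data = tfo, method = "optimal",
                 distance = "cbps", estimand = "ATC")
md <- match.data(m_out)

# Simplified outcome model on the matched sample, followed by
# g-computation for the ATC with standard errors clustered by
# matched pair (the 'subclass' variable created by MatchIt).
fit <- lm(pod ~ W * (time_left + score_margin + spread),
          data = md, weights = weights)
avg_comparisons(fit, variables = "W", vcov = ~subclass,
                newdata = subset(md, W == 0))
```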
It is important to note the methodological distinction between these two methods. In IPW, each observation is reweighted based on the estimated propensity score, allowing us to balance the treated and control groups in expectation. In contrast, propensity score matching directly pairs treated units with control units that have similar propensity scores, creating a matched sample for estimation. Both approaches aim to reduce bias in estimating the treatment effect, but matching emphasizes the creation of comparable subgroups, whereas weighting adjusts the contribution of all observations based on their propensity scores. Both methods are widely used in causal inference applications. See Hernan & Robins (2024) for a theoretical account of both methods, and Shiba & Kawahara (2021) for a recent practical comparison. We present effect estimates from both methods in Section 4.1.
3.3 Causal Methods for Heterogeneous Treatment Effects
The ATE provides insight into the overall causal effect for a population. However, treatment effects often vary across individuals within the population. The presence of heterogeneity in treatment effects can obscure important subpopulation dynamics if only the ATE is considered. Identifying and analyzing these differences allows for more tailored interventions, improving both individual and aggregate outcomes. Furthermore, examining treatment effect heterogeneity can enhance our understanding of how covariates interact with the treatment, providing deeper insights into causal mechanisms. This motivates the need for methodologies that move beyond the ATE to uncover nuanced variations in causal effects across different subgroups.
The Conditional Average Treatment Effect (CATE) (Athey & Imbens, 2016) is a statistical measure used to capture the heterogeneity in treatment effects across different subgroups within a population. It extends the concept of the ATE, which calculates the expected difference in outcomes due to a treatment across the entire population, by conditioning on specific covariates $X$. This approach is particularly useful in causal inference to identify how treatment efficacy varies across different individuals or groups, allowing for more personalized interventions.
Mathematically, the CATE is defined as:

$$\tau(x) = \mathbb{E}\left[Y(1) - Y(0) \mid X = x\right]. \qquad (14)$$

By conditioning on $X = x$, the CATE captures the expected treatment effect for individuals with characteristics $x$. Assuming conditional exchangeability and consistency, we can estimate the CATE from observational or experimental data by modeling the outcome response surface.
Classical approaches to heterogeneous treatment effect (HTE) estimation often rely on subgroup analysis, where subgroups are defined by observed covariates. However, traditional regression-based methods may be limited by the need to pre-specify interactions or assume homogeneity within subgroups. Recent advances have thus focused on machine learning techniques, which allow for more flexible, data-driven discovery of treatment effect heterogeneity. For instance, Athey & Imbens (2016) introduced causal trees for identifying subpopulations with distinct treatment effects, while Wager & Athey (2018) extended this to causal forests, a method based on random forests that provides both individual-level treatment effect estimates and uncertainty quantification.
Propensity score-based methods are also adapted for HTE analysis, such as subclassification on propensity scores within stratified groups (Rosenbaum & Rubin, 1983). However, propensity-based models assume covariate balance is achieved within each stratum, which may not hold in practice for high-dimensional data. To address this, flexible machine learning models, including generalized additive models and Bayesian Additive Regression Trees (BART), are commonly used to improve balance across strata and yield robust HTE estimates (Hill, 2011).
In our analysis, we use the causal forest framework to estimate heterogeneous treatment effects.
3.3.1 Causal Forest
Causal forests, introduced by Athey et al. (2019), Athey & Wager (2019), and Wager & Athey (2018), build on the framework of random forests (Breiman, 2001; Hastie et al., 2009) to estimate the conditional average treatment effect (CATE), denoted $\tau(x)$, rather than the outcome $Y$ itself. These forests have become a powerful tool for causal inference, leveraging machine learning techniques to uncover treatment effect heterogeneity. The methods are implemented in the R package grf (Tibshirani et al., 2024), which serves as a comprehensive toolkit for causal forest analysis. Below, we detail the core components of this algorithm.
Orthogonalization via Double Machine Learning
A central challenge in causal inference is ensuring unbiased estimation of $\tau(x)$ in the presence of confounding variables, as we have discussed earlier. Causal forests address this through an orthogonalization procedure inspired by the double machine learning framework of Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey & Robins (2018). This process involves "residualizing" the outcome and treatment to remove the influence of confounders.
Given a dataset $\{(X_i, W_i, Y_i)\}_{i=1}^{n}$, the first step is to estimate the expected outcome, using data from both the treated and control groups. This outcome model predicts the conditional mean:

$$m(x) = \mathbb{E}\left[Y \mid X = x\right].$$

Next, the propensity score model estimates $e(x)$, the probability of receiving treatment given the covariates:

$$e(x) = P\left(W = 1 \mid X = x\right).$$

Using these models, the observed outcomes and treatments are residualized:

$$\tilde{Y}_i = Y_i - \hat{m}(X_i), \qquad \tilde{W}_i = W_i - \hat{e}(X_i).$$

Residualization ensures that the residualized outcome $\tilde{Y}_i$ is orthogonal to the confounding signal in the covariates $X$, effectively isolating the treatment effect. These residuals serve as inputs for the subsequent tree-building process, reducing bias and enhancing the robustness of the treatment effect estimation.
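To make the residualization step concrete, the following sketch uses grf's regression forests as the nuisance models; `causal_forest()` performs an equivalent step internally using out-of-bag predictions. Here `X`, `W`, and `Y` are assumed to be the covariate matrix, treatment vector, and outcome vector in R.

```r
library(grf)

# Nuisance models: m(x) = E[Y | X = x] and e(x) = P(W = 1 | X = x).
# predict() without newdata returns out-of-bag predictions, which avoids
# using an observation's own outcome to residualize itself.
m_hat <- predict(regression_forest(X, Y))$predictions
e_hat <- predict(regression_forest(X, W))$predictions

# Residualized outcome and treatment used in the tree-building step.
Y_tilde <- Y - m_hat
W_tilde <- W - e_hat
```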
Splitting Criterion: Maximizing Treatment Effect Heterogeneity
Causal forests partition the covariate space by recursively splitting it to maximize treatment effect heterogeneity. Unlike regression trees, which minimize the variance of the outcome, causal trees aim to identify regions where treatment effects differ the most. For a split on covariate $j$ at a threshold $c$, the data in a node are divided into two child nodes:

$$L = \{i : X_{ij} \le c\}, \qquad R = \{i : X_{ij} > c\}.$$

The splitting criterion evaluates the difference in estimated CATEs between these nodes, choosing the split that maximizes the squared difference:

$$\left(\hat{\tau}_L - \hat{\tau}_R\right)^2.$$

This approach encourages the tree to uncover splits that highlight heterogeneity in treatment responses, tailoring the model to detect meaningful differences across subpopulations.
Honest Estimation and Inference
To prevent overfitting and ensure valid inference, causal forests employ an honest estimation strategy, as proposed by Athey & Wager (2019). This involves splitting the data into two subsets: one for constructing the tree structure (the splitting sample) and another for estimating treatment effects (the estimation sample). For a given tree, the treatment effect in each leaf $\ell$ is estimated on the estimation sample as the difference in mean outcomes between treated and control units in that leaf:

$$\hat{\tau}_\ell = \frac{1}{|\{i \in \ell : W_i = 1\}|}\sum_{i \in \ell,\, W_i = 1} Y_i \;-\; \frac{1}{|\{i \in \ell : W_i = 0\}|}\sum_{i \in \ell,\, W_i = 0} Y_i.$$
Honest splitting mitigates the risk of overfitting by ensuring that data used to define the tree splits are separate from those used for effect estimation, thereby producing unbiased treatment effect estimates.
Ensemble of Trees: Aggregation for Stability
Causal forests aggregate multiple trees, each grown on a different bootstrap sample of the data, to enhance stability and reduce variance. For an individual with covariates $x$, the final estimate of the CATE is obtained by averaging the treatment effect estimates from all $B$ trees in the ensemble:

$$\hat{\tau}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{\tau}_b(x).$$
This ensemble approach capitalizes on the diversity of tree structures to produce robust and reliable estimates of treatment effects.
In summary, causal forests combine orthogonalization, heterogeneity-driven splits, honest estimation, and ensemble averaging to provide a powerful framework for estimating treatment effect heterogeneity. This methodological rigor ensures that causal forests remain at the forefront of modern causal inference.
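Putting the pieces together, a minimal sketch of fitting a causal forest with grf is given below. The covariate matrix construction (`covariate_names`, the data frame `tfo`) is assumed for illustration.

```r
library(grf)

# Covariate matrix, treatment indicator, and outcome (POD).
X <- as.matrix(tfo[, covariate_names])
W <- tfo$W
Y <- tfo$pod

# Honest causal forest; orthogonalization of Y and W and honest
# sample splitting are handled internally.
cf <- causal_forest(X, Y, W, num.trees = 2000)

# Out-of-bag CATE estimates for each observation.
tau_hat <- predict(cf)$predictions

# Doubly robust average treatment effect
# (use target.sample = "control" for the ATC).
average_treatment_effect(cf)

# Variable importance, used later to drop low-importance covariates.
variable_importance(cf)
```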
3.4 Assessing Heterogeneity
Once the causal forest model is built, it is essential to test for significant heterogeneity in the estimated causal effects. One approach to evaluate this is through a Calibration Test (Chernozhukov, Demirer, Duflo & Fernandez-Val, 2018), which is conducted as follows:
We fit a linear regression where the target (outcome) variable serves as the dependent variable, and the independent variables are: (1) the overall treatment effect, denoted as mean forest prediction, and (2) the predicted conditional average treatment effect (CATE) based on the out-of-bag trees, denoted as differential forest prediction.
If significant heterogeneity exists in the causal effects, the coefficient associated with differential forest prediction will be statistically different from zero. This indicates that the predicted heterogeneity in treatment effects contributes meaningfully to explaining the outcome variability.
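In grf this check is available directly via `test_calibration()`; a brief sketch assuming `cf` is the fitted causal forest from the sketch above.

```r
# Best linear predictor calibration test. A 'mean.forest.prediction'
# coefficient near 1 indicates the average prediction is well calibrated;
# a significantly positive 'differential.forest.prediction' coefficient
# is evidence of detectable treatment effect heterogeneity.
test_calibration(cf)
```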
The Rank-Weighted Average Treatment Effect (RATE) (Yadlowsky et al., 2024) provides a more formal approach to testing for heterogeneity in causal effects. This method begins with a prioritization rule based on the covariates, $S(\cdot)$. The prioritization rule assigns a score $S(X_i)$ to each sample, where a higher score reflects a higher expected benefit from the treatment. In other words, samples predicted to gain more from the treatment receive higher prioritization scores.
The method then measures the incremental benefit of administering the treatment to the top $q$ fraction of samples according to their prioritization scores, a concept captured by the Targeting Operator Characteristic (TOC) curve. Let $F$ be the distribution function of $S(X_i)$; for $q \in (0, 1]$, the TOC is defined as:

$$\mathrm{TOC}(q) = \mathbb{E}\left[Y_i(1) - Y_i(0) \,\middle|\, S(X_i) \ge F^{-1}(1 - q)\right] - \mathbb{E}\left[Y_i(1) - Y_i(0)\right]. \qquad (15)$$

At $q = 1$, the difference between the two expectations becomes zero, as the entire sample is included. The TOC is estimated by comparing the overall average treatment effect (ATE) with the ATE for the subset of samples whose prioritization scores exceed a given threshold, and the resulting curve is analogous to the receiver operating characteristic (ROC) curve.
The RATE is the area under the TOC curve, i.e.,

$$\mathrm{RATE} = \int_0^1 \mathrm{TOC}(q)\, dq. \qquad (16)$$
A good prioritization rule, one that accurately reflects treatment benefit, will yield a RATE value significantly greater than zero. Conversely, a rule that is unrelated to the treatment effect will yield a RATE value near zero.
To test for overall heterogeneity, we can use the CATE estimates as the prioritization rule, assigning higher scores to observations with higher predicted CATE values. If significant heterogeneity is present, the RATE will be significantly greater than zero.
Alternatively, we can assess heterogeneity with respect to a specific variable by basing the prioritization rule on that variable rather than on the CATE estimates. For instance, we could set $S(X_i)$ equal to a particular covariate, ranking samples by its value. In our case, we are interested in evaluating treatment effect heterogeneity when the ranking is based on the difference between the two teams' player ratings.
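A sketch of this computation with grf, mirroring the train/evaluate split used in Section 4: a forest fit on one season produces the prioritization scores, and a forest fit on the other season evaluates them. Object and column names (`X_2018`, `rating_max_diff`, and so on) are assumptions.

```r
library(grf)

# Separate forests for the two seasons (data split assumed).
cf_train <- causal_forest(X_2018, Y_2018, W_2018)
cf_eval  <- causal_forest(X_2021, Y_2021, W_2021)

# Overall heterogeneity: prioritize evaluation-season observations by the
# CATE predicted from the training-season forest.
priority_cate <- predict(cf_train, X_2021)$predictions
rate_cate <- rank_average_treatment_effect(cf_eval, priority_cate)
rate_cate        # point estimate and standard error
plot(rate_cate)  # TOC curve with confidence band

# Heterogeneity along a single covariate: prioritize by that covariate
# instead of the predicted CATE.
rate_rating <- rank_average_treatment_effect(cf_eval,
                                             X_2021[, "rating_max_diff"])
```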
We now apply the methods discussed in the previous section to our data and present the results in the following section.
4 Results
In this section, we present the results of estimating the overall causal effect of the TFO strategy using two methods: IPW and propensity score matching. We then examine potential heterogeneity in treatment effects by employing the causal forest framework.
4.1 Average Treatment Effect
In our data set there are approximately twice as many treated observations as control observations each year. Therefore, when performing propensity score matching, it is more natural to estimate the average treatment effect on the control group, the ATC. For the sake of consistency, we estimate the ATC when using IPW as well. For teams that do not attempt a TFO in a certain situation, the ATC is a measure of how their average POD would have been different if they had attempted the TFO. This is a natural estimand for the goal of convincing teams that do not currently emphasize the TFO strategy to begin to do so. A positive ATC indicates an increase in the point differential, implying that the TFO strategy is beneficial. Conversely, if the ATC is not significantly different from zero or is negative, it suggests that the TFO strategy is ineffective.
The ATC is computed separately for each season, as well as for the combined dataset that includes both seasons. Estimating the ATC separately provides insight into the stability of the treatment effect across seasons, while combining the data allows for a larger sample size, leading to narrower confidence intervals. We report both the point estimates of ATC and the corresponding 95% confidence intervals.
We perform all calculations in R (R Core Team, 2024). To estimate the ATC using IPW, we employ the WeightIt package to obtain weights based on covariate balancing propensity scores and the survey package to obtain point estimates and confidence intervals. For propensity score matching, we obtain matches using the MatchIt package with 1:1 optimal pair matching without replacement, again based on covariate balancing propensity scores. We use the marginaleffects package (Arel-Bundock et al., 2024) to obtain point estimates using g-computation, with intervals based on cluster-robust standard errors.
4.1.1 Checks for Assumptions
As mentioned in Sections 3.1 and 3.2, both IPW and matching rely on conditional exchangeability in order to create balanced groups. Figure 1 is a Love plot for the combined sample, created in R using the cobalt package (Greifer, 2024a), which compares the standardized mean differences between the treatment and control groups for the various covariates before and after IPW and matching. In the original sample, five of the covariates have standardized differences that fall outside the rule-of-thumb threshold of 0.1, including the time left variable. This indicates that the TFO-A and TFO-NA groups vary considerably, particularly in the amount of time left in the quarter. This is an expected feature of the data and highlights why causal methods are necessary.
[Figure 1: Love plot of standardized mean differences for the covariates before and after IPW and matching, combined seasons.]
Conversely, after weighting the observations, the standardized mean differences are very close to 0. This near-perfect balance is in large part due to the use of covariate balancing propensity scores. Matching also creates more similar groups than in the original data set, although the gains are not nearly as pronounced as with IPW. The time left variable still has a standardized mean difference slightly above 0.1, although the rest are under that threshold. In the combined data set there are 991 TFO-NA and 1986 TFO-A. The slightly worse balance for the matching method indicates that it was difficult to find high-quality matches for all of the TFO-NA within the pool of TFO-A. Thus we should be slightly cautious when making conclusions from the matching method, while we can be more confident in the conclusions from IPW. Love plots for the individual seasons (not pictured) show a similar pattern to the combined data set.
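The balance diagnostics can be produced with cobalt along the following lines, assuming the `w_out` and `m_out` objects from the earlier sketches.

```r
library(cobalt)

# Standardized mean differences before and after IPW, with the 0.1
# rule-of-thumb threshold marked.
love.plot(w_out, stats = "mean.diffs", abs = TRUE,
          thresholds = c(m = 0.1))

# The same diagnostic for the matched sample.
love.plot(m_out, stats = "mean.diffs", abs = TRUE,
          thresholds = c(m = 0.1))
```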
[Figure 2: Histograms of estimated propensity scores for the TFO-A and TFO-NA groups, combined seasons.]
The positivity assumption can also be assessed after fitting the IPW and matching models. Typically this is done by looking at the distribution of the estimated propensity scores. When estimating the ATC, it is important that the support of the control group is roughly contained within the support of the treatment group. Figure 2 gives histograms of the propensity scores for both groups. Other than perhaps a couple of control observations with the smallest estimated propensity scores, there is complete overlap between the two distributions. Thus we can say that the positivity assumption is reasonable for the combined data set. Similar plots for the individual seasons (not pictured) reveal that the positivity assumption is reasonable in those analyses as well.
4.1.2 Treatment Effect Estimates
Table 2: Estimated average treatment effect on the control (ATC) by method and season.

| Method | Season | Estimate | 95% Confidence Interval | p-value |
|---|---|---|---|---|
| IPW | 18-19 | 0.55 | (0.43, 0.66) | |
| IPW | 21-22 | 0.57 | (0.45, 0.69) | |
| IPW | Both | 0.55 | (0.47, 0.64) | |
| Matching | 18-19 | 0.60 | (0.35, 0.85) | |
| Matching | 21-22 | 0.64 | (0.39, 0.89) | |
| Matching | Both | 0.63 | (0.46, 0.80) | |
As shown in Table 2, the estimated ATC for both IPW and matching is consistently positive, with the 95% confidence intervals above 0 in all cases. For the IPW method, the ATC estimate for the combined dataset is 0.55 with a 95% confidence interval of (0.47, 0.64), while for the matching method, the combined ATC is slightly larger in magnitude at 0.63 with a confidence interval of (0.46, 0.80).
While there are slight differences in the point estimates between the two methods and across individual seasons, the substantial overlap in confidence intervals suggests that these variations are not significant. Overall, the positive ATC across both methods and seasons indicates that the TFO strategy consistently increases the point differential by the end of the period. On average, each time a team had a TFO-O and failed to convert it into a TFO-A, it cost itself slightly more than half a point of score differential. This may seem like a minor disadvantage, but NBA teams are often willing to spend effort and resources in pursuit of even relatively small edges.
4.2 Heterogeneous Treatment Effect
The average treatment effect provides valuable insight, indicating that the TFO strategy is beneficial on average. However, there is a need to explore how this effect varies across different values of the covariates. Understanding this heterogeneity can help identify specific scenarios where the TFO strategy should or should not be attempted, thereby informing team strategy.
To this end, we investigate treatment effect heterogeneity to address the following key questions. First, is there significant heterogeneity in the treatment effect? Second, which variables contribute to this heterogeneity? We employ the causal forest framework to answer these questions.
Before fitting the causal forest model, we augment the dataset with three additional variables: the difference in the maximum ratings between the two teams, the difference in the mean ratings between the two teams, and the absolute value of the score margin. Initially, we run the causal forest model with all available variables. We then compute the variable importance measures and retain only those variables that cumulatively account for 95% of the total importance, eliminating three variables in the process. This step helps control the standard error.
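A sketch of this pruning step with grf, assuming `cf` was fit on the full covariate matrix `X` as above; the 95% cutoff follows the rule described in the text.

```r
# Rank covariates by importance and keep the smallest set whose
# cumulative importance reaches 95% of the total.
vi        <- variable_importance(cf)
ord       <- order(vi, decreasing = TRUE)
cum_share <- cumsum(vi[ord]) / sum(vi)
keep      <- ord[seq_len(which(cum_share >= 0.95)[1])]

# Refit the causal forest on the reduced covariate set.
cf_reduced <- causal_forest(X[, keep, drop = FALSE], Y, W)
```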
As an initial check, we estimate the ATE using the causal forest. The estimated ATE is 0.61, with a standard error of 0.11, relatively close to the results obtained from IPW and matching methods.
This analysis sets the foundation for a deeper exploration of heterogeneity in treatment effects, guiding future strategy development.
4.2.1 Overall Heterogeneity
To begin our analysis, we first assess whether there is any evidence of treatment effect heterogeneity in the data. This is done through two separate tests.
The first test we employ is the calibration test, which evaluates the null hypothesis that there is no heterogeneity in the treatment effect. Table 3 shows the results of the test. The p-value of the differential forest prediction is 0.17, indicating that we do not have sufficient evidence to reject the null hypothesis. It is important to note that the standard error of this estimate is relatively high, which contributes to the high p-value and weakens our ability to detect heterogeneity, if present.
Table 3: Calibration test results for the causal forest.

| | Estimate | Std Error | t Value | p Value |
|---|---|---|---|---|
| Mean Forest Prediction | 1.00511 | 0.17876 | 5.6227 | |
| Differential Forest Prediction | 0.63352 | 0.67520 | 0.9383 | 0.1741 |
The second test we conduct is based on the rank-weighted average treatment effect (RATE) (Yadlowsky et al., 2024). In this test, we define a prioritization rule and assess the impact of applying the treatment to only the top observations under this rule. To implement this, we train the causal forest using the 2018-19 data, predict the heterogeneous treatment effects for the 2021-22 observations, and then use these predictions to form our prioritization rule. The resulting RATE estimate is not significantly different from zero relative to its standard error, which is estimated using the bootstrap (Efron, 1982) with 200 samples. Similar to the first test, this result suggests that there is insufficient evidence to reject the null hypothesis of no heterogeneity in the treatment effect. Figure 3 shows the overall TOC curve, where we can see that zero lies within the confidence band.
[Figure 3: Targeting Operator Characteristic (TOC) curve for the CATE-based prioritization rule, with confidence band.]
In summary, both the calibration test and the RATE results indicate that we do not find strong evidence of treatment effect heterogeneity.
4.2.2 Heterogeneity Across Variables
In the previous section, we observed that at an aggregate level the tests provide minimal evidence of heterogeneity in the treatment effect. However, prior studies have demonstrated scenarios in which apparent overall homogeneity does not preclude the presence of heterogeneity across specific variables. To investigate this possibility, we examine heterogeneity across the following variables: the difference in the two teams' maximum player ratings and the difference in their mean player ratings.
To evaluate potential heterogeneity, we employ the rank-weighted average treatment effect (RATE) framework. This method constructs prioritization rules based on the value of the selected variable and computes RATE estimates, reflecting the expectation that higher values of the variable should correspond to improved outcomes under treatment.
Table 4 presents the RATE estimates for the selected variables. The results indicate that the estimates are not statistically significant, providing insufficient evidence to conclude that treatment effect heterogeneity exists with respect to these variables. Thus, we conclude that there is no substantial evidence for heterogeneity in the treatment effect across the investigated variables.
Table 4: RATE estimates for covariate-based prioritization rules.

| | Estimate | Standard Error |
|---|---|---|
| Rating Max Diff | 0.028 | 0.085 |
| Rating Mean Diff | 0.0825 | 0.087 |
5 Discussion
This study explores the effectiveness of the two-for-one (TFO) strategy in the NBA, converting a real-world sports strategy into a mathematical framework to enable causal inference. Using play-by-play data from the 2018-19 and 2021-22 NBA seasons, we analyze the impact of implementing this strategy on game outcomes by estimating the increase in the point differential a team that did not attempt a TFO could expect if it had attempted the TFO instead. Our findings indicate that attempting the TFO has a statistically significant positive effect, suggesting it can be advantageous for teams in real game scenarios.
The results also consider potential heterogeneity in the effectiveness of the two-for-one strategy using the causal forest framework. While we explore overall heterogeneity and heterogeneity across two selected variables, our results do not provide significant evidence of differential effects across subsets of the data. This lack of observed heterogeneity may imply that the strategy’s impact is relatively uniform across various game contexts and player configurations, though factors such as the modest sample size may also limit our ability to detect significant variation.
While these findings contribute valuable insights, there are limitations to our analysis. First, the dataset lacks spatial information regarding players’ positions on the court during each play, which could refine estimates of shot difficulty and improve causal estimates. Such information would enhance the understanding of strategic choices made by players and coaches, but access is constrained because spatial tracking data are often proprietary. Additionally, the lack of significance in our heterogeneity estimates could result from the limited scope of our dataset, which includes only two seasons. Repeating the analysis over a larger dataset, potentially spanning ten seasons or more, could provide deeper insights and improve the statistical power of our heterogeneity analysis.
Another natural extension would be to consider leagues other than the NBA, such as the Women’s National Basketball Association (WNBA) or the National Collegiate Athletic Association (NCAA). Play-by-play information is generally available for these organizations as well, and it would be informative to see if similar treatment effects are found for the TFO strategy there.
In conclusion, while our study demonstrates the positive impact of the TFO strategy in NBA games, future work with extended datasets and enhanced play-by-play information could further validate and expand on these findings. By overcoming data limitations and increasing sample size, subsequent research may uncover new dimensions of heterogeneity and provide a more comprehensive understanding of how various contextual factors influence the efficacy of this strategy.
References
- Arel-Bundock et al. (2024) Arel-Bundock, V., Greifer, N. & Heiss, A. (2024), ‘How to interpret statistical models using marginaleffects for R and Python’, Journal of Statistical Software 111(9), 1–32.
- Athey & Imbens (2016) Athey, S. & Imbens, G. (2016), ‘Recursive partitioning for heterogeneous causal effects’, Proceedings of the National Academy of Sciences 113(27), 7353–7360.
- Athey et al. (2019) Athey, S., Tibshirani, J. & Wager, S. (2019), ‘Generalized random forests’, The Annals of Statistics 47(2), 1148–1178. https://doi.org/10.1214/18-AOS1709
- Athey & Wager (2019) Athey, S. & Wager, S. (2019), ‘Estimating treatment effects with causal forests: An application’, Observational Studies 5(2), 37–51.
- Austin (2011) Austin, P. C. (2011), ‘An introduction to propensity score methods for reducing the effects of confounding in observational studies’, Multivariate behavioral research 46(3), 399–424.
- Basketball Reference (n.d.) Basketball Reference (n.d.), ‘Basketball reference: NBA player and team stats’, https://www.basketball-reference.com.
- Breiman (2001) Breiman, L. (2001), ‘Random forests’, Machine learning 45, 5–32.
- Bresler (2024) Bresler, A. (2024), nbastatR: R’s interface to NBA data. R package version 0.1.152. https://github.com/abresler/nbastatR
- Cefalu et al. (2024) Cefalu, M., Ridgeway, G., McCaffrey, D., Morral, A., Griffin, B. A. & Burgette, L. (2024), twang: Toolkit for Weighting and Analysis of Nonequivalent Groups. R package version 2.6.1. https://CRAN.R-project.org/package=twang
- Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey & Robins (2018) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. & Robins, J. (2018), ‘Double/debiased machine learning for treatment and structural parameters’.
- Chernozhukov, Demirer, Duflo & Fernandez-Val (2018) Chernozhukov, V., Demirer, M., Duflo, E. & Fernandez-Val, I. (2018), Generic machine learning inference on heterogeneous treatment effects in randomized experiments, with an application to immunization in india, Technical report, National Bureau of Economic Research.
- Efron (1982) Efron, B. (1982), The jackknife, the bootstrap and other resampling plans, SIAM.
- Feng (2015) Feng, J. (2015), ‘Analytics of Optimal 2-for-1 Strategy in NBA Basketball’, Jay’s Life. https://racketracer.wordpress.com/2015/05/12/analytics-of-optimal-2-for-1-strategy-in-nba-basketball
- Fischer (2015) Fischer, J. (2015), ‘Optimizing End Of Quarter Shot-Timing In The NBA’, To the Mean. https://tothemean.com/2015/02/15/optimizing-end-of-quarter-shot-timing.html
- Gibbs et al. (2022) Gibbs, C. P., Elmore, R. & Fosdick, B. K. (2022), ‘The causal effect of a timeout at stopping an opposing run in the NBA’, The Annals of Applied Statistics 16(3), 1359–1379.
- Greifer (2024a) Greifer, N. (2024a), cobalt: Covariate Balance Tables and Plots. R package version 4.5.5. https://CRAN.R-project.org/package=cobalt
- Greifer (2024b) Greifer, N. (2024b), WeightIt: Weighting for Covariate Balance in Observational Studies. R package version 1.3.0. https://CRAN.R-project.org/package=WeightIt
- Hastie et al. (2009) Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning, Springer Series in Statistics, 2nd edn, Springer, New York, NY. https://doi.org/10.1007/978-0-387-84858-7
- Hernan & Robins (2024) Hernan, M. & Robins, J. (2024), Causal Inference: What If, Chapman & Hall/CRC Monographs on Statistics & Applied Probability, CRC Press.
- Hill (2011) Hill, J. L. (2011), ‘Bayesian nonparametric modeling for causal inference’, Journal of Computational and Graphical Statistics 20(1), 217–240.
- Hirano et al. (2003) Hirano, K., Imbens, G. W. & Ridder, G. (2003), ‘Efficient estimation of average treatment effects using the estimated propensity score’, Econometrica 71(4), 1161–1189.
- HoopsHype (2018) HoopsHype (2018), ‘NBA 2k19 Player Rankings’. Available at https://hoopshype.com/nba2k/2018-2019/ (Accessed November 8, 2024).
- HoopsHype (2021) HoopsHype (2021), ‘NBA 2k22 Player Rankings’. Available at https://hoopshype.com/nba2k/2021-2022/ (Accessed November 8, 2024).
- Imai & Ratkovic (2013) Imai, K. & Ratkovic, M. (2013), ‘Covariate balancing propensity score’, Journal of the Royal Statistical Society Series B: Statistical Methodology 76(1), 243–263. https://doi.org/10.1111/rssb.12027
- Imbens & Rubin (2015) Imbens, G. W. & Rubin, D. B. (2015), Causal Inference in Statistics, Social, and Biomedical Sciences, Cambridge University Press.
- Imbens & Wooldridge (2009) Imbens, G. W. & Wooldridge, J. M. (2009), ‘Recent developments in the econometrics of program evaluation’, Journal of economic literature 47(1), 5–86.
- Lee et al. (2010) Lee, B. K., Lessler, J. & Stuart, E. A. (2010), ‘Improving propensity score weighting using machine learning’, Statistics in medicine 29(3), 337–346.
- Nakahara et al. (2023) Nakahara, H., Takeda, K. & Fujii, K. (2023), ‘Pitching strategy evaluation via stratified analysis using propensity score’, Journal of Quantitative Analysis in Sports 19(2), 91–102.
- R Core Team (2024) R Core Team (2024), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
- Robins et al. (2000) Robins, J. M., Hernan, M. A. & Brumback, B. (2000), ‘Marginal structural models and causal inference in epidemiology’.
- Rosenbaum (2002) Rosenbaum, P. R. (2002), Observational Studies, Springer Series in Statistics, Springer, New York, NY. https://doi.org/10.1007/978-1-4757-3692-2
- Rosenbaum & Rubin (1983) Rosenbaum, P. R. & Rubin, D. B. (1983), ‘The central role of the propensity score in observational studies for causal effects’, Biometrika 70(1), 41–55.
- Shiba & Kawahara (2021) Shiba, K. & Kawahara, T. (2021), ‘Using propensity scores for causal inference: pitfalls and tips’, Journal of epidemiology 31(8), 457–463.
- Sportsbook Reviews (2023) Sportsbook Reviews (2023), ‘NBA Scores and Odds Archive’. Available at https://www.sportsbookreviewsonline.com/scoresoddsarchives/nba/nbaoddsarchives.htm (Accessed November 8, 2024).
- Stuart (2010) Stuart, E. A. (2010), ‘Matching methods for causal inference: A review and a look forward’, Statistical science: a review journal of the Institute of Mathematical Statistics 25(1), 1.
- Stuart et al. (2011) Stuart, E. A., King, G., Imai, K. & Ho, D. (2011), ‘Matchit: nonparametric preprocessing for parametric causal inference’, Journal of statistical software .
- Tibshirani et al. (2024) Tibshirani, J., Athey, S., Sverdrup, E. & Wager, S. (2024), grf: Generalized Random Forests. R package version 2.3.2. https://CRAN.R-project.org/package=grf
- Wager & Athey (2018) Wager, S. & Athey, S. (2018), ‘Estimation and inference of heterogeneous treatment effects using random forests’, Journal of the American Statistical Association 113(523), 1228–1242.
- Wu et al. (2021) Wu, L. Y., Danielson, A. J., Hu, X. J. & Swartz, T. B. (2021), ‘A contextual analysis of crossing the ball in soccer’, Journal of Quantitative Analysis in Sports 17(1), 57–66.
- Yadlowsky et al. (2024) Yadlowsky, S., Fleming, S., Shah, N., Brunskill, E. & Wager, S. (2024), ‘Evaluating treatment prioritization rules via rank-weighted average treatment effects’, Journal of the American Statistical Association pp. 1–14.
- Yam & Lopez (2019) Yam, D. R. & Lopez, M. J. (2019), ‘What was lost? a causal estimate of fourth down behavior in the national football league’, Journal of Sports Analytics 5(3), 153–167.