
1 Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, Canada
2 School of Public Health and Health Systems, University of Waterloo, Waterloo, ON, Canada

Evaluating real-time probabilistic forecasts with application to National Basketball Association outcome prediction

Chi-Kuang Yeh ([email protected]), Gregory Rice, and Joel A. Dubin
Abstract

Motivated by the goal of evaluating real-time forecasts of home team win probabilities in the National Basketball Association, we develop new tools for measuring the quality of continuously updated probabilistic forecasts. This includes introducing calibration surface plots, and simple graphical summaries of them, to evaluate at a glance whether a given continuously updated probability forecasting method is well-calibrated, as well as developing statistical tests and graphical tools to evaluate the skill, or relative performance, of two competing continuously updated forecasting methods. These tools are studied by means of a Monte Carlo simulation study of simulated basketball games, and demonstrated in an application to evaluate the continuously updated forecasts published by the United States-based multinational sports network ESPN on its principal web page, espn.com. This application lends statistical evidence that the forecasts published there are well-calibrated, and exhibit improved skill over several naïve models, but do not demonstrate significantly improved skill over simple logistic regression models based solely on a measurement of each team's relative strength and the evolving score difference throughout the game.

Keywords: Probability forecasting, Calibration, Skill score, Functional data, Brier score

1 Introduction

Probabilistic predictions and forecasts are ubiquitous in modern society, and many individuals consider, and make decisions based on, such forecasts on a routine basis. For example, in the United States probability of precipitation forecasts became widely publicly available starting in the late 1960s, and are now a critical factor in countless people's daily decisions (National Research Council, 2006; Murphy, 1998). Over time the number and scope of probabilistic forecasts readily accessible to the public has increased at a steady pace, and now covers prediction of phenomena ranging from sports (Silver et al., 2019), to politics (Erikson and Wlezien, 2012), to medicine (Spiegelhalter, 1986), to geology (Gomberg, 2015), among many other, some more exotic (Rowe and Beard, 2018), areas.

Many such forecasts are made initially well before the event in question occurs, and are then continuously updated as new information becomes available. The example that we focus on throughout this paper is basketball game outcome prediction in the National Basketball Association (NBA). Websites like espn.com, the main web page of the United States-based multinational sports network, ESPN, publish and update in real-time probabilistic forecasts of the home team winning for each NBA game played. Although the method by which ESPN produces these forecasts is largely proprietary, ostensibly initial probability forecasts of the home team winning are constructed based on information that is available before the game starts, e.g. the usual home court advantage in the NBA, relative team strength, player injuries, etc., and then after the game commences and progresses these forecasts are updated with new information such as the score, game time remaining, ball possession, fouls, in-game player injuries, etc. The resulting probabilistic forecasts and their fluctuations may be viewed as a curve that is a function of the in-game time; see Figure 1 for an example. Such curves arise in any similar continuously updated probabilistic forecasting task, and are evidently not unique to basketball game outcome prediction; see e.g. Silver (2020).

Figure 1: Real-time probabilistic forecasts as a function of the in-game time published on espn.com/nba (ESPN, 2020) from a November 8, 2018 game in which the Los Angeles Lakers hosted the Minnesota Timberwolves. The black points are the points at which ESPN updated the home team winning probability forecasts due to events occurring in the game, and the red line is the linear interpolation between these forecasts to produce a real-time probabilistic forecast curve.

Natural questions to ask when faced with any probabilistic forecast, including those that are continuously updated, are "are these forecasts accurate?" and "could these forecasts be improved upon?". Following the seminal work of Murphy and Winkler (1987) and Murphy and Winkler (1992) on evaluating the quality of probabilistic forecasts in meteorology, evaluating a method for producing probabilistic forecasts is often broken into the tasks of measuring its calibration and skill. A model is deemed well-calibrated if its forecasts are compatible with the observed outcomes. In other words, a model that predicts an outcome with a given probability is well-calibrated if the relative frequency that the outcome occurs matches the probability forecast in the long run. A model is deemed to have higher skill than a competing model if its predictions are "sharper" or "more concentrated" than its competitor. For example, a model that forecasts the probability of a rainy day in New York City on a given day with the long run background rate of rainy days (which happens to be about 33.1% for New York City), will in the long run be well-calibrated, but has less skill than a model, perhaps based on more complete weather data, that correctly predicts rainy days with probabilistic forecasts of zero and one. Excellent reviews and more in-depth discussions of these concepts can be found in Gneiting et al. (2007) and Gneiting and Katzfuss (2014).

The goal of this paper is to develop simple, easily interpretable, tools for evaluating the calibration and skill of methods to produce continuously updated probabilistic forecasts, and to apply these methods to evaluate the probabilistic forecasts pertaining to NBA basketball game outcome prediction published on ESPN. In terms of evaluating model calibration of probabilistic forecasts, standard tools are reliability diagrams and calibration plots, in which outcome frequencies are plotted against binned forecast probabilities; see Gneiting et al. (2007). Below we show how such curves can be extended in the continuously updated case to calibration surfaces, and how such surfaces can be summarized to show at a glance whether a given method is well-calibrated. In order to evaluate the relative skill of one continuously updated forecasting model against another, we employ the method of Lai et al. (2011) to construct confidence intervals for the average loss difference measured by the Brier score (Brier, 1950) between the two models at given time points throughout the updating process. Estimating these intervals pointwise can be used to construct a simple graphical summary of the relative skill of one model versus another. In order to measure the cumulative statistical significance of differences observed in such a plot, we develop a new significance test for the skill-differences aggregated across time based on a novel large sample result for estimating continuous loss difference curves.

For the purpose of demonstrating these methods and evaluating ESPN’s forecasts, we introduce a number of “competing” continuously updated forecasting methods for basketball outcome prediction. Some are designed to be “straw men” for the purpose of demonstration, whereas others are based on logistic or probit generalized linear models making use of in-game information such as the score difference. We show using our methods that ESPN’s model is generally well-calibrated, and exhibits significantly better skill than some naïve models, although it does not demonstrate superiority over relatively simple logistic regression models based on the score difference and relative team strength alone.

The rest of the paper is organized as follows. Section 2 introduces the details of the ESPN forecasting data that we consider, as well as some competing forecasting methods that we develop and use for the purpose of comparison. Section 3 discusses the construction of calibration surfaces for such forecasts, as well as simple graphical summaries of these surfaces. Section 4 explains the proposed methods to evaluate the relative skill of two sets of real-time probabilistic forecasts. A Monte Carlo simulation study of these methods is given in Section 5. A detailed comparison of the ESPN forecasts as well as those of the proposed models is given in Section 6. Technical details are provided in Appendix A following these sections.

2 Motivating Data and Competing Models

The specific data that we consider and that motivate this work are play-by-play records and real-time probabilistic forecasts of NBA regular season games downloaded from espn.com/nba (ESPN, 2020). The NBA is a major professional basketball league, often referred to as one of the "Big Four" professional sports leagues in North America. Since 2004, except for the lockout season in 2011 and the COVID-19-influenced season of 2020, the NBA has comprised 30 teams, with each team playing a schedule of 82 games in the regular season.

Starting in the 2017-2018 NBA season, ESPN Analytics began providing real-time in-game probabilistic forecasts of the home team winning for each NBA game played; an example of the forecasts from one game is shown in Figure 1. The data available from ESPN are quite rich, including real-time information about details such as substitutions, fouls, and ball possession. We consider here only a subset of these data that includes the real-time probabilistic forecasts provided by ESPN, as well as the evolution of the score throughout the game, for the 2017-2018 and 2018-2019 seasons. These data are updated each time there is an "event" in the game, which includes primarily score changes, fouls, and changes of possession. A typical game features between 460 and 480 events.

We excluded a small portion of these data from our analysis due to two issues. Quite often, multiple events will occur at the same instant in a game. One of the main examples that contributes to this is multiple players substituting at the same time. Although these events are all logged at the same time point, they occur in the dataset in an ordered sequence. The forecasted probabilities published by ESPN during such an event are typically contingent on this order. Therefore, we simply average the forecasts together in such a scenario to produce a probabilistic forecast at that instant. We also tried a number of other ways to handle this situation, such as using the first or last probabilistic forecast among the events recorded, and the difference in the results was negligible.

The second issue is due to games that go to overtime. If two teams’ scores are tied at the end of the 48-minute regulation game time, the teams will play an extra five-minute overtime period. For such games we remove the overtime period from the analysis, and only consider probabilistic forecasts up to the end of the game so that they are comparable to those that arise from games that did not go to overtime. Overtime games represent slightly less than 10% of the total games. Additionally, a small number of data points were discarded due to evident defects or excessive missing values.

The remaining data that we analyze are summarized in Table 1; in each season there are more than 1,100 games, with a total of over 350,000 play-by-play records available. Below we use the data from the 2017-2018 season as training data for our own models, and then we produce and evaluate forecasts for the 2018-2019 season.

Season   Mode        Games   Events   Max. events   Min. events   Avg. events
17-18    Raw         1158    530032   606           234           457.7133
         Selected    1137    517983   572           240           455.5699
         Processed   1137    354749   375           173           312.0343
18-19    Raw         1229    583443   700           124           474.7299
         Selected    1213    572546   598           366           472.0082
         Processed   1213    396991   385           241           327.2803
Table 1: Summary of the data obtained from ESPN (ESPN, 2020) for the 2017-2018 and 2018-2019 NBA regular seasons. Raw counts represent the total number of games for which ESPN provides probability forecasts. Selected refers to those games that do not contain errors or missing values. Processed represents the data after averaging multiple events recorded at the same game time.

Letting $N$ denote the total number of games with forecasts that we wish to evaluate (so $N=1213$ when we consider the 2018-2019 forecasts), the data may be denoted as $\hat{p}_{i}^{ESPN}(t)$, $1\leq i\leq N$, $t\in[0,1]$, representing the probabilistic forecast of the home team winning the $i$th game at intra-game time $t$. We assume that the game time parameter $t$ is normalized to lie between zero and one, so that it represents the proportion of the game complete. Although these forecasts are only available when events occur, because events are very dense throughout the game we simply interpolate the forecasts linearly to produce full probability forecast curves over $[0,1]$, which also makes them more comparable from one game to the next. This is illustrated in Figure 1. We also consider the data $H_{i}(t)$ and $A_{i}(t)$, $1\leq i\leq N$, $t\in[0,1]$, denoting the home team score and away team score, respectively, in the $i$th game at proportion $t$ of the game. In our analysis below we frequently make use of the score difference $ScD_{i}(t)=H_{i}(t)-A_{i}(t)$, $1\leq i\leq N$, $t\in[0,1]$.
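To make the interpolation step concrete, a minimal R sketch is given below; the vector names event_time and espn_prob are hypothetical stand-ins for the normalized event times and the published forecasts of a single game, not names used by ESPN's data feed.

```r
# Linearly interpolate event-level forecasts onto a common grid over [0, 1].
# `event_time` and `espn_prob` are hypothetical vectors holding, for a single
# game, the normalized times of recorded events and the forecasts published
# at those events.
interpolate_forecast <- function(event_time, espn_prob,
                                 grid = seq(0, 1, length.out = 201)) {
  # rule = 2 carries the first/last observed forecast out to the endpoints
  approx(x = event_time, y = espn_prob, xout = grid, rule = 2)$y
}
```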

The goal of the methods that we develop below is to evaluate the quality of the forecasts $\hat{p}_{i}^{ESPN}(t)$, $1\leq i\leq N$, $t\in[0,1]$. To do this we also develop a number of benchmark models that are used for the purpose of comparison.

2.1 Benchmark models for predicting NBA game outcomes

We use the following notation below. We let $Y_{i}$ denote the indicator random variable that the home team wins the $i$th game, so that $Y_{i}=1$ if the home team wins the $i$th game, and $Y_{i}=0$ if the home team loses the $i$th game. We are interested in forecasting or estimating the probability $p_{i}(t)$ that the home team wins, given the information up to time $t$ in the game, so that

$$p_{i}(t)=P(Y_{i}=1\,|\,\mbox{all information up to time $t$ in game $i$}).$$

A more formal definition of $p_{i}(t)$ is given in Appendix A, but is omitted here to lighten the technical detail in the text.

$\hat{p}_{i}^{ESPN}(t)$ is in principle an estimate (forecast) of $p_{i}(t)$. In order to evaluate the quality of these forecasts, we consider a number of competing benchmark models, progressing from naïve to more realistic. The most complicated covariates that we consider in building these models are the score difference $ScD_{i}(t)$ and some measure of the relative strength of the teams, which we term $RS_{i}$. There are a number of ways of evaluating the relative strength of teams, including the Elo rating system (Elo, 1978), which has been used extensively to rate the strength of basketball teams (see Silver, 2014; Silver and Fischer-Baum, 2015; Silver et al., 2019), and odds in betting markets. We use $RS_{i}=\hat{p}_{i}^{ESPN}(0)$, the pre-game probability of the home team winning as forecast by ESPN, as a proxy for relative team strength. We considered a number of alternative metrics to define $RS_{i}$, and found that the results and conclusions of the analyses below generally did not change significantly, and so we use this quantity to avoid the development of other proxies of relative team strength.

The benchmark models that we consider are listed below in order of most to least naïve. Some of these are based on generalized linear models (GLMs) for binary response data, such as logistic regression, and we use $g(\cdot)$ to denote the GLM link function, which in our case is either the logit or probit link; see, e.g., Chapter 4 of McCullagh and Nelder (1989). All GLMs were fit using the R programming language, specifically the glm function in the stats package, version 4.0.2, which uses iteratively reweighted least squares. For each model we used the 2017-2018 season data as training data, and then produced rolling forecasts on the 2018-2019 season data to compare to the ESPN forecasts.

Coin-Flip (CF): $\hat{p}_{i}(t)=0.5$ for all $t\in[0,1]$.

Historic Home team Win Probability (HomeWP): $\hat{p}_{i}(t)=0.593$, which represents the frequency at which the home team won over the course of the NBA regular season games from 2008-2017.

Pre-game Relative Strength (PgRS) model: $\hat{p}_{i}(t)$ is forecast from the GLM

$$g(p_{i}(t))=\beta_{0}(t)+\beta_{1}(t)RS_{i},$$

estimated pointwise at every game time $t$.

Leading Status (LS) model: $\hat{p}_{i}(t)$ is forecast from the GLM

$$g(p_{i}(t))=\beta_{0}(t)+\beta_{1}(t)LS_{i}(t),$$

where $LS_{i}(t)=1$ if $ScD_{i}(t)>0$, $LS_{i}(t)=0$ if $ScD_{i}(t)=0$, and $LS_{i}(t)=-1$ if $ScD_{i}(t)<0$.

Score Difference without Intercept (ScDnoInt) model: $\hat{p}_{i}(t)$ is forecast from the GLM

$$g(p_{i}(t))=\beta_{1}(t)ScD_{i}(t),$$

estimated pointwise at every game time $t$.

Score Difference (ScD) model: $\hat{p}_{i}(t)$ is forecast from the GLM

$$g(p_{i}(t))=\beta_{0}(t)+\beta_{1}(t)ScD_{i}(t),$$

estimated pointwise at every game time $t$.

Pregame Relative Strength and Leading Status (PgRSLS) model: $\hat{p}_{i}(t)$ is forecast from the GLM

$$g(p_{i}(t))=\beta_{0}(t)+\beta_{1}(t)RS_{i}+\beta_{2}(t)LS_{i}(t),$$

estimated pointwise at every game time $t$.

Pregame Relative Strength and Score Difference (PgRSScD) model: $\hat{p}_{i}(t)$ is forecast from the GLM

$$g(p_{i}(t))=\beta_{0}(t)+\beta_{1}(t)RS_{i}+\beta_{2}(t)ScD_{i}(t),$$

estimated pointwise at every game time $t$.
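To illustrate the pointwise fitting and rolling forecasting just described, the following R sketch fits the PgRSScD model separately at each point of a common time grid; the objects y, rs, and scd (an $N\times K$ matrix of score differences evaluated on the grid) are hypothetical stand-ins for the processed season data, and this is a sketch of the general recipe rather than the exact implementation used here.

```r
# Pointwise fitting of the PgRSScD model over a common time grid.
# `y` is the home-win indicator, `rs` the pre-game relative strength, and
# `scd[, k]` the score difference at grid time t_k.
fit_pointwise <- function(y, rs, scd, link = "logit") {
  lapply(seq_len(ncol(scd)), function(k) {
    df <- data.frame(y = y, rs = rs, scd = scd[, k])
    glm(y ~ rs + scd, data = df, family = binomial(link = link))
  })
}

# Rolling forecasts for a test season, one fitted model per grid time.
predict_pointwise <- function(fits, rs_test, scd_test) {
  sapply(seq_along(fits), function(k) {
    predict(fits[[k]],
            newdata = data.frame(rs = rs_test, scd = scd_test[, k]),
            type = "response")
  })
}
```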

We note that the GLMs with intercept terms are able to implicitly model the home team advantage, which refers to the phenomenon that in the NBA the home team tends to win a higher percentage of games than the away team; a model like ScDnoInt could therefore be expected to be poorly calibrated, at least at the beginning of the game. As mentioned, each GLM is fit pointwise over the game-time parameter $t$. This allows the estimated effects of relative team strength, home team advantage, and score difference to evolve throughout the game. Diagnostic plots of the pseudo $R^{2}$ (McFadden, 1973) and the relative variable importance over the course of the game for our "least naïve" model, PgRSScD, are displayed in Figure 2. It is clear from this figure that the model's predictions improve as the game progresses, evidently since the score difference covariate ultimately determines the winner, and that the relative importance of the score difference versus team strength changes inversely as the game progresses: relative team strength is the most important predictor early in the game, but becomes less important later in the game as the score difference becomes more informative.

Figure 2: The left hand panel shows the pseudo $R^{2}$ of the logistic regression model PgRSScD, which uses the covariates pre-game relative strength and score difference, as a function of the game time. The right hand panel shows the variable importance of each covariate as it contributes to the pseudo $R^{2}$.
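As a small aside, the McFadden pseudo $R^{2}$ plotted in Figure 2 can be computed for each pointwise fit by comparing the fitted log-likelihood to that of the intercept-only model; a minimal sketch, reusing the hypothetical fits from above:

```r
# McFadden pseudo R^2 for a single pointwise GLM fit: one minus the ratio of
# the fitted log-likelihood to that of the intercept-only model.
pseudo_r2 <- function(fit) {
  null_fit <- update(fit, . ~ 1, data = model.frame(fit))
  as.numeric(1 - logLik(fit) / logLik(null_fit))
}

# Pseudo R^2 as a function of game time:
# r2_curve <- sapply(fits, pseudo_r2)
```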

3 Evaluating Calibration of Continuously Updated Forecasts

We now turn to the task of evaluating the calibration of a given set of continuously updated forecasts $\hat{p}_{i}(t)$ with realized outcomes $Y_{i}$, $i=1,\ldots,N$. As mentioned in the introduction, such forecasts are traditionally evaluated using what are called calibration plots or reliability diagrams. A calibration plot is a plot of binned probabilistic forecasts against the conditional event frequency associated with forecasts in a given bin. Since, for well-calibrated forecasts, the event frequency should match the forecast probability, calibration may be measured by comparing these points against a 45-degree diagonal reference line; large departures from this line indicate poor calibration (cf. Dawid, 1986; Murphy and Winkler, 1992; Ranjan and Gneiting, 2010). We refer the reader to Gneiting and Katzfuss (2014) and the references therein for a more comprehensive discussion.

One clear way to check calibration of continuously updated forecasts is to produce a calibration plot for each $t\in[0,1]$ based on the pairs $(\hat{p}_{i}(t),Y_{i})$. While this is in essence what we propose, there are two main challenges in doing so. (1) Traditionally when producing calibration plots, the forecast probabilities are binned into fixed bins, commonly by deciles; for example, the event frequency corresponding to all forecast probabilities in $[0,0.1)$ is compared to 0.05, that for $[0.1,0.2)$ to 0.15, and so on. With continuously updated forecasts, as with the ESPN forecasts, it is typical that the forecasts fluctuate so that at certain time points $t$ they cluster around some fixed values, and are hence far from uniformly distributed over such fixed bins; in the case of the ESPN forecasts, near the end of the game the majority of forecasts cluster around 0 and 1. (2) Having constructed calibration plots for each $t$, one must examine a large number of such plots to pinpoint whether a method appears to be well-calibrated, or to diagnose whether there is some subset of times $t$ at which the method is more or less calibrated than at others. A simple summary of the many calibration plots produced would be useful.

In order to address (1), we propose to use adaptive bins in constructing the calibration plots. Specifically, for each $t$, suppose we wish to construct $M$ bins for the forecasts $\hat{p}_{i}(t)$. By calculating the ranked forecasts $\hat{p}_{(i)}(t)$, $i=1,\ldots,N$, we may group them into $M$ bins so that $\hat{p}_{(i)}(t)$ is in bin $j$ if $\lfloor N/M\rfloor(j-1)+1\leq i<\lfloor N/M\rfloor j$. We denote the collection of $\hat{p}_{i}(t)$'s in the $j$th bin as $Bin_{j}$. In simpler terms, the forecasts at a given time $t$ are grouped into $M$ bins based on their rank. As a reference point or summary of the forecasts in the $j$th bin, we use $\tilde{p}_{j}(t)=\mbox{Median}\left(\hat{p}_{(i)}(t),\;\lfloor N/M\rfloor(j-1)+1\leq i<\lfloor N/M\rfloor j\right)$. A calibration plot at time $t$ is then constructed by comparing $\tilde{p}_{j}(t)$ to $\bar{Y}_{j}(t)=\mbox{Average}(Y_{i}\mbox{ such that }\hat{p}_{i}(t)\in Bin_{j})$. Letting $n_{j}$ denote the number of forecasts in $Bin_{j}$, a $100(1-\alpha)\%$ confidence interval for the mean of the events in $Bin_{j}$ could be constructed as

$$\bar{Y}_{j}(t)\pm z_{1-\alpha/(2M)}\sqrt{\frac{\bar{Y}_{j}(t)(1-\bar{Y}_{j}(t))}{n_{j}}},$$

where $z_{\beta}$ denotes the $\beta$ quantile of the standard normal distribution, and here we have applied a Bonferroni correction to the significance level according to the number of bins $M$ used. $\alpha$ is typically taken to be 5% so that 95% confidence intervals are computed.

The above interval is the standard confidence interval for a binomial proportion based on a normal approximation to the binomial random variable, as seen in most elementary statistics textbooks, and is often used to measure the uncertainty in calibration plots. In such a continuously updated setting, however, as with the ESPN forecast data, the standard interval often performs poorly, since the event frequency may be very close to zero or one in some bins at a subset of time points $t$. As discussed at length in Brown et al. (2001), a confidence interval that is generally recommended as a replacement for the standard interval, and is more robust to event frequencies close to zero and one, is the Wilson interval (Wilson, 1927), which takes the form

$$\frac{n_{j}\bar{Y}_{j}(t)+\kappa^{2}/2}{n_{j}+\kappa^{2}}\pm\frac{\kappa n_{j}^{1/2}}{n_{j}+\kappa^{2}}\left(\bar{Y}_{j}(t)(1-\bar{Y}_{j}(t))+\kappa^{2}/(4n_{j})\right)^{1/2},$$

where $\kappa=z_{1-\alpha/(2M)}$. We have observed significant improvements from using the Wilson interval in this setting, due to event frequencies that approach zero and one near the end of the game, and so we use these intervals in the calibration plot constructions below. Additionally, so that no bins contain predominantly zero or one probability forecasts, when producing a calibration plot at a given $t$ we discard all forecasts, and corresponding events, with $\hat{p}_{i}(t)<0.005$ or $\hat{p}_{i}(t)>0.995$, and we analyze those forecasts for calibration separately; see Table 2.
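A minimal R sketch of the calibration computation at a fixed time $t$, combining the rank-based adaptive bins with the Bonferroni-corrected Wilson intervals above, might look as follows; the vectors p_hat and y are hypothetical stand-ins for the forecasts at time $t$ and the realized outcomes.

```r
# Rank-based adaptive binning with Wilson intervals at a fixed time t.
calibration_points <- function(p_hat, y, M = 10, alpha = 0.05) {
  keep  <- p_hat >= 0.005 & p_hat <= 0.995  # extremal forecasts handled separately
  p_hat <- p_hat[keep]; y <- y[keep]
  ord   <- order(p_hat)
  bins  <- cut(seq_along(ord), breaks = M, labels = FALSE)  # ~equal-count bins by rank
  kappa <- qnorm(1 - alpha / (2 * M))       # Bonferroni-corrected critical value
  t(sapply(seq_len(M), function(j) {
    idx    <- ord[bins == j]
    n_j    <- length(idx)
    y_bar  <- mean(y[idx])
    centre <- (n_j * y_bar + kappa^2 / 2) / (n_j + kappa^2)
    half   <- kappa * sqrt(n_j) / (n_j + kappa^2) *
              sqrt(y_bar * (1 - y_bar) + kappa^2 / (4 * n_j))
    c(ref = median(p_hat[idx]), lower = centre - half, upper = centre + half)
  }))
}
```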

A calibration plot may then be made by plotting the bin references $\tilde{p}_{j}(t)$ against the above confidence intervals, and comparing these intervals to the reference line $y=x$. The method is deemed well-calibrated at the given significance level if the reference line generally passes through each interval. An example based on the ESPN forecasts at $t=0.5$ using $M=10$ bins is shown in the left hand panel of Figure 3 at the 95% confidence level. By linearly interpolating the upper bounds of these intervals across both the reference proportions $\tilde{p}_{j}(t)$ and $t$, we may construct an "upper calibration surface", $U_{1-\alpha}(t,p)$. The upper 95% calibration surface $U_{0.95}(t,p)$ is displayed in the right hand panel of Figure 3 for the ESPN forecasts from the 2018-2019 season. A lower surface $L_{1-\alpha}(t,p)$ may be constructed similarly by linearly interpolating the lower bounds. A continuously updated forecasting method may then be deemed well-calibrated at a given significance level if the reference plane $f(t,p)=p$, for $t,p\in[0,1]$, is contained between the two surfaces.

Figure 3: The left hand panel shows a calibration plot of the ESPN forecasts at time point $t=0.5$. The right hand panel shows an upper calibration surface, along with the reference plane $f(t,p)=p$, for the continuously updated ESPN forecasts, obtained by interpolating the upper bounds of each confidence interval of the calibration plots across $t\in[0,1]$.

Although such surface plots are informative, it can be challenging to infer quickly from them whether a given method appears to be calibrated. In order to produce a more easily interpretable summary of the calibration surfaces, we instead consider, for each $t$, the minimum distance over $p$ between the reference plane $f(t,p)=p$ and the upper and lower calibration surfaces. Specifically, we consider plots of the functions $U^{min}_{1-\alpha}(t)=\min_{1\leq j\leq M}U_{1-\alpha}(t,\tilde{p}_{j}(t))-\tilde{p}_{j}(t)$ and $L^{max}_{1-\alpha}(t)=\max_{1\leq j\leq M}L_{1-\alpha}(t,\tilde{p}_{j}(t))-\tilde{p}_{j}(t)$ against $t$. If the upper and lower confidence surfaces contain the reference plane $f(t,p)=p$, then $U^{min}_{1-\alpha}(t)$ should always lie above zero, and $L^{max}_{1-\alpha}(t)$ should always lie below zero. Time points $t$ at which this does not hold identify times at which a given method is not well-calibrated. We note that, due to the high degree of fluctuation in continuously updated probabilistic forecasts for basketball prediction, we have found it useful for interpretation to smooth these plots with respect to $t$ using a simple moving average smoother over 5% of the game times. The conclusions drawn from these plots change little for different values of the window width.
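These summary curves and the smoothing step could be computed along the following lines; the matrices upper, lower, and p_ref are hypothetical stand-ins whose $(k,j)$ entries hold $U_{1-\alpha}(t_{k},\tilde{p}_{j}(t_{k}))$, $L_{1-\alpha}(t_{k},\tilde{p}_{j}(t_{k}))$, and $\tilde{p}_{j}(t_{k})$ on a time grid.

```r
# Distances between the reference plane and the calibration surfaces.
surface_summaries <- function(upper, lower, p_ref) {
  list(u_min = apply(upper - p_ref, 1, min),   # should stay above zero
       l_max = apply(lower - p_ref, 1, max))   # should stay below zero
}

# Simple moving-average smoother over roughly 5% of the game times.
smooth_ma <- function(x, window = max(1, round(0.05 * length(x)))) {
  as.numeric(stats::filter(x, rep(1 / window, window), sides = 2))
}
```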

These summary plots, calculated based on the ESPN forecasts as well as on the benchmark models HomeWP, ScDnoInt, and PgRSScD using the logit link function, are shown in Figure 4. From these we see that the naïve model HomeWP, which predicts that the home team will win each game at all times $t$ with the prior 10-year historic win rate of home teams in the NBA, is well-calibrated, as expected. Similar plots (not shown) for the method CF, which simply predicts that the home team will win with probability 50%, show that that method is not well-calibrated. Considering this plot for the method ScDnoInt (Panel (b) of Figure 4), we see that the forecasts are poorly calibrated at the beginning of the game, but that the calibration improves towards the end of the game. This is expected since the corresponding logit model is taken to be free of an intercept term, and so is unable to capture the home team advantage, which should force the forecast probabilities to favour the home team at the beginning of the game. Both the ESPN forecasts and those from the model PgRSScD, which incorporates team strength as well as the score difference, demonstrated generally good calibration at all game times.

A summary of the outcomes corresponding to games in which a probability forecast at some intra-game time exceeded 0.995, or was less than 0.005, is given in Table 2 for the ESPN forecasts as well as for the forecasts based on the models PgRSScD and ScDnoInt. The empirical home team win rate closely matched these thresholds in each case, suggesting that the forecasts of each of these methods are reasonably well-calibrated at these extremal probability forecast levels.

Figure 4: Plots of $U^{min}_{0.95}(t)$ and $L^{max}_{0.95}(t)$ against $t$ based on $M=10$ bins, for: (a) the historic home team win probability (HomeWP); (b) the logit model using the score difference without intercept (ScDnoInt); (c) the ESPN forecasts; (d) the logit model based on pre-game relative strength and score difference (PgRSScD).
                              ESPN                 PgRSScD              ScDnoInt
$\hat{p}_{i}(t)$ for some $t$ >0.995    <0.005     >0.995    <0.005     >0.995    <0.005
Total games                   555       343        591       358        589       374
Home team wins                553       1          588       0          587       0
Proportion                    0.9964    0.0029     0.9949    0.0000     0.9966    0.0000
Table 2: The number of games in which a probability forecast exceeded 0.995, or fell below 0.005, at some point, for the ESPN, PgRSScD (logit link), and ScDnoInt (logit link) forecasts, along with the number and proportion of these games in which the home team won.

4 Evaluating Skill of Continuously Updated Forecasts

4.1 Pointwise confidence intervals measuring the skill difference between competing methods

As described in the introduction, the skill of a probabilistic forecasting method generally refers to its "sharpness" or acuity relative to a competing or benchmark method. Formally this can be measured by defining a loss function, or scoring rule, used to measure the accuracy of a given probabilistic forecast based on the realized events. Perhaps the most frequently used loss function is $L(a,b)=(a-b)^{2}$, which defines the Brier score (Brier, 1950). We use this loss function below, but the following results generalize to any loss function that has a linear equivalent, which means that (1) $L^{\prime}(x,b)$ is linear in $x$, and (2) $L^{\prime}(x,b)-L(x,a)$ does not depend on $x$. This includes the Kullback–Leibler divergence (Kullback and Leibler, 1951) and Good's log-score, among others; see Lai et al. (2011), Gneiting et al. (2007), and Bickel (2007).

Following the work of Lai et al. (2011), ideally any method to forecast $p_{i}(t)$ would satisfy that $L(p_{i}(t),\hat{p}_{i}(t))$ is small, and further, when averaged over all forecasts, would minimize

$$L_{N}(t)=\frac{1}{N}\sum_{i=1}^{N}L\left(p_{i}(t),\hat{p}_{i}(t)\right). \quad (1)$$

Since the underlying true probabilities $p_{i}(t)$ are unobservable, a sensible estimate of $L_{N}(t)$ is obtained by replacing these probabilities with their point estimates based on the realizations $Y_{i}$, producing

$$\hat{L}_{N}(t)=\frac{1}{N}\sum_{i=1}^{N}L(Y_{i},\hat{p}_{i}(t)). \quad (2)$$

The quantity $\hat{L}_{N}(t)$ captures the skill or "sharpness" of the forecasting method at a given time point $t$ as described above, since a given forecasting method is ascribed generally lower losses, or higher scores, if $\hat{p}_{i}(t)$ is closer to $Y_{i}$, the latter of which takes the values 0 and 1.

Suppose we wish to compare two methods, call them method $A$ and method $B$, for producing continuously updated probabilistic forecasts. We denote such forecasts by $\hat{p}_{i}^{A}(t)$ and $\hat{p}_{i}^{B}(t)$, and we compare them based on the corresponding realized events $Y_{i}$, $1\leq i\leq N$. This can be done by comparing their average losses defined in (2). Specifically, we consider the function of $t$

$$\hat{\Delta}_{N}(t)=\frac{1}{N}\sum_{i=1}^{N}\left[L(Y_{i},\hat{p}_{i}^{A}(t))-L(Y_{i},\hat{p}_{i}^{B}(t))\right], \quad (3)$$

which can be viewed as an estimate of the true loss difference

$$\Delta_{N}(t)=\frac{1}{N}\sum_{i=1}^{N}\left[L(p_{i}(t),\hat{p}_{i}^{A}(t))-L(p_{i}(t),\hat{p}_{i}^{B}(t))\right]. \quad (4)$$

Values of $\hat{\Delta}_{N}(t)$ below zero favour method $A$ at the given $t$, since then method $A$ incurs the smaller average loss, whereas values above zero favour method $B$. In order to measure the statistical significance of any deviations of $\hat{\Delta}_{N}(t)$ from zero, we may view $\hat{\Delta}_{N}(t)$ as an estimator of $\Delta_{N}(t)$. By constructing suitable confidence intervals for $\Delta_{N}(t)$ based on $\hat{\Delta}_{N}(t)$, we may evaluate whether observed deviations suggest the superiority of one model over another, and further construct simple graphical summaries that illustrate the skill of one model compared to another as a function of $t$.

To construct such confidence intervals, we first define the variance of $\hat{\Delta}_{N}(t)$ as

$$s_{N}^{2}(t)=\frac{1}{N}\sum_{i=1}^{N}\delta_{i}^{2}(t)p_{i}(t)(1-p_{i}(t)), \quad (5)$$

where $\delta_{i}(t)=\left[L(1,\hat{p}_{i}^{A}(t))-L(0,\hat{p}_{i}^{A}(t))\right]-\left[L(1,\hat{p}_{i}^{B}(t))-L(0,\hat{p}_{i}^{B}(t))\right]$. The following result is proved in Lai et al. (2011) and stated for each $t\in[0,1]$.

Theorem 2, Lai et al. (2011): Suppose that for each $t\in[0,1]$, $s_{N}^{2}(t)$ converges in probability to a positive constant as $N\to\infty$, and that the variables $A_{i}(t)=L(Y_{i},\hat{p}_{i}^{A}(t))-L(p_{i}(t),\hat{p}_{i}^{A}(t))$ and $B_{i}(t)=L(Y_{i},\hat{p}_{i}^{B}(t))-L(p_{i}(t),\hat{p}_{i}^{B}(t))$ each form martingale difference sequences. Then for each $t$,

$$\frac{\sqrt{N}\left(\hat{\Delta}_{N}(t)-\Delta_{N}(t)\right)}{s_{N}(t)}\stackrel{D}{\to}\mathcal{N}(0,1),$$

where $\stackrel{D}{\to}$ denotes convergence in distribution, and $\mathcal{N}(0,1)$ denotes a standard normal random variable.

The two main conditions of the above theorem are that (1) $s_{N}^{2}(t)$, the variance of $\hat{\Delta}_{N}(t)$, should for large $N$ behave like a positive constant, and (2) the forecast loss differences should behave like martingale difference sequences. The first condition can be thought of as a non-degeneracy condition: the result only holds if the forecasts of the two methods to be compared do not coincide entirely. It is not valid for two methods that produce equivalent, or almost equivalent, forecasts. This is, in general, a reasonable assumption if the forecasts being compared are from entirely different models, or if one or both sets of forecasts come from unknown models, as with the ESPN forecast data, since it is unlikely in this case that they will produce forecasts that coincide. This assumption is in question when comparing forecasts of nested models, e.g. two GLM models whose only difference is whether a covariate is included or excluded. A more in-depth discussion of comparing forecasts from nested models, and the problems that arise from it, may be found in Clark and McCracken (2015). Regarding condition (2), it is almost always satisfied when using a loss function with a linear equivalent and constructing genuine forecasting methods that are based only on available (past) information, rather than information from the unknown future, as discussed on page 2361 of Lai et al. (2011); this condition should therefore be thought of as mild.

The above result suggests constructing a $100(1-\alpha)\%$ confidence interval for $\Delta_{N}(t)$ as

$$\hat{\Delta}_{N}(t)\pm z_{1-\alpha/2}\frac{s_{N}(t)}{\sqrt{N}}. \quad (6)$$

Note that since $p_{i}(t)(1-p_{i}(t))$ in the definition of $s_{N}^{2}(t)$ is unobserved, we may replace it with the upper bound of $1/4$ in both (5) and (6) to obtain a conservative confidence interval. Plots of $\hat{\Delta}_{N}(t)$ and the corresponding conservative confidence intervals as a function of $t$ can be used to evaluate the relative skill of one model compared to another. Points $t$ at which the associated $1-\alpha$ confidence interval for $\Delta_{N}(t)$ does not contain zero indicate a significant improvement, at level $\alpha$, of one method's average loss over the other's. Examples of plots of this form may be found in Figure 7, which we discuss in more detail below. We note again that, due to the high degree of fluctuation in continuously updated probabilistic forecasts for basketball prediction, we have found it useful for interpretation to smooth these plots with respect to $t$ using a simple moving average smoother over 5% of the game times.
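For concreteness, a minimal R sketch of the estimated Brier loss difference and its conservative pointwise intervals is given below; pA and pB are hypothetical $N\times K$ matrices of the two methods' forecasts on a common time grid, and y is the vector of realized outcomes. Note that, for the Brier score, $\delta_{i}(t)$ reduces to $2(\hat{p}_{i}^{B}(t)-\hat{p}_{i}^{A}(t))$.

```r
# Estimated loss difference between methods A and B on a time grid, with
# conservative pointwise confidence intervals (p(1-p) bounded by 1/4).
skill_difference <- function(pA, pB, y, alpha = 0.05) {
  N     <- length(y)
  delta <- colMeans((y - pA)^2 - (y - pB)^2)   # \hat{\Delta}_N(t) on the grid
  # For the Brier score, delta_i(t) = 2 * (pB - pA)
  s2    <- colMeans((2 * (pB - pA))^2 * 0.25)  # conservative s_N^2(t)
  half  <- qnorm(1 - alpha / 2) * sqrt(s2 / N)
  list(delta = delta, lower = delta - half, upper = delta + half)
}
```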

4.2 Functional tests to measure skill aggregated across $t$

Although the above confidence intervals can be used to evaluate whether two methods exhibit similar or significantly different skill at any given time point $t$, it is often also of interest to evaluate whether two continuously updated models have approximately equal predictive power when discrepancies between them are aggregated across $t\in[0,1]$. For example, one method might exhibit similar but somewhat better skill at each game time, which when viewed in aggregate suggests its superiority. Conversely, one method may exhibit apparently improved performance at a single game time $t_{0}$ that appears rather insignificant when viewed in aggregate across $t\in[0,1]$.

To make this more precise, we formulate the null hypothesis of equal predictive power aggregated across $t\in[0,1]$ of two methods as

$$H_{0}:\;\|\Delta_{N}\|^{2}=0,$$

where $\|\cdot\|^{2}$ is the standard squared $L^{2}$ norm of a function, so that

$$\|f\|^{2}=\int_{0}^{1}f^{2}(t)\,dt.$$

The hypothesis $H_{0}$ posits then that the two methods to be compared exhibit approximately equal skill on average (across $t$). Let

$$Z_{N}(t)=\sqrt{N}\hat{\Delta}_{N}(t).$$

A measure of global discrepancy across time between the two forecasting methods may be obtained by considering $\|Z_{N}\|^{2}$. In order to determine the large-sample properties of $Z_{N}$ that would inform appropriate significance levels and $p$-value estimates for tests of $H_{0}$ based on $\|Z_{N}\|^{2}$, we make use of the following result, which we state rigorously and prove in Appendix A.

Theorem 1.

Under $H_{0}$ and conditions analogous to those of Theorem 2 of Lai et al. (2011) (see Appendix A for details), there exists an infinite sequence of constants $\{\lambda_{i},\;i\geq 1\}$ satisfying

$$\lambda_{1}\geq\lambda_{2}\geq\cdots\geq 0,\quad\mbox{and}\quad\sum_{i=1}^{\infty}\lambda_{i}<\infty,$$

such that

$$\|Z_{N}\|^{2}\stackrel{D}{\to}\sum_{i=1}^{\infty}\lambda_{i}\chi_{i}^{2}(1),$$

where $\chi_{i}^{2}(1)$, $i=1,2,\ldots$, are independent and identically distributed $\chi^{2}$ random variables with one degree of freedom. Moreover, the constants $\{\lambda_{i},\;i\geq 1\}$ can be conservatively estimated by the eigenvalues of the function

$$\hat{C}_{cons}(t,s)=\frac{1}{N}\sum_{i=1}^{N}[\hat{p}^{A}_{i}(t)-\hat{p}^{B}_{i}(t)][\hat{p}^{A}_{i}(s)-\hat{p}^{B}_{i}(s)].$$

Namely, with $\{\hat{\lambda}_{i},\;i=1,\ldots,N\}$ defined so that $\hat{\lambda}_{1}\geq\hat{\lambda}_{2}\geq\cdots\geq\hat{\lambda}_{N}\geq 0$, and such that there exist functions $\hat{\phi}_{i}(t)$, $i=1,\ldots,N$, $t\in[0,1]$, with $\|\hat{\phi}_{i}\|^{2}=1$, satisfying

$$\hat{\lambda}_{i}\hat{\phi}_{i}(t)=\int_{0}^{1}\hat{C}_{cons}(t,s)\hat{\phi}_{i}(s)\,ds, \quad (7)$$

then for any fixed $j\geq 1$, $P(\hat{\lambda}_{j}>\lambda_{j})\to 1$ as $N\to\infty$.

This result suggests a simple way of conducting an approximate and conservative test of the hypothesis $H_{0}$.

Step 1: Evaluate $\|Z_{N}\|^{2}$.

Step 2: Estimate $\hat{C}_{cons}$, and compute the eigenvalues satisfying (7).

Step 3: Estimate the distribution of the random variable $Q_{D}=\sum_{i=1}^{D}\hat{\lambda}_{i}\chi_{i}^{2}(1)$, where $D$ is a large number (below we take $D=10$, and have found this choice generally adequate). This can be done easily using Monte Carlo simulation, or using the numerical method of Imhof (1961).

Step 4: Calculate an approximate and conservative $p$-value of the test of $H_{0}$ as $p=P(Q_{D}\geq\|Z_{N}\|^{2})$.

This $p$-value, combined with the confidence intervals in (6), allows for a detailed evaluation, both at particular game times $t$ and across all $t\in[0,1]$, of the relative skill of competing continuously updated methods.
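The four steps could be implemented along the following lines, discretizing the kernel $\hat{C}_{cons}$ on an equally spaced grid of $K$ points so that the eigenvalues of the integral operator are approximated by those of a scaled $K\times K$ matrix, and estimating the tail probability of $Q_{D}$ by Monte Carlo; this is a sketch under those discretization assumptions rather than the exact implementation used in the paper.

```r
# Conservative test of H0: ||Delta_N||^2 = 0, following Steps 1-4.
# `pA` and `pB` are N x K matrices of forecasts on an equally spaced grid
# (K >= D assumed), and `y` is the vector of realized outcomes.
test_equal_skill <- function(pA, pB, y, D = 10, n_mc = 1e5) {
  N <- nrow(pA); K <- ncol(pA)
  # Step 1: ||Z_N||^2, approximating the integral by the grid average.
  delta  <- colMeans((y - pA)^2 - (y - pB)^2)
  z_norm <- N * mean(delta^2)
  # Step 2: eigenvalues of the discretized conservative kernel C_cons(t, s);
  # dividing the K x K matrix by K approximates the integral operator.
  diffs  <- pA - pB
  C_cons <- crossprod(diffs) / N
  lambda <- eigen(C_cons / K, symmetric = TRUE, only.values = TRUE)$values
  lambda <- pmax(lambda[seq_len(D)], 0)
  # Steps 3-4: Monte Carlo estimate of the p-value P(Q_D >= ||Z_N||^2).
  Q_D <- colSums(lambda * matrix(rchisq(D * n_mc, df = 1), nrow = D))
  mean(Q_D >= z_norm)
}
```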

5 Simulation

Since many of the above methods are new when applied to continuously updated probabilistic forecasts, in this section we present the results of a small simulation study to investigate their respective performances, and illustrate their application in some controlled examples.

5.1 Basketball game simulation

We now turn our attention to generating data that resemble the NBA game data described in Section 2. This entails generating random quantities that play the role of the score difference and the initial relative strength of the teams. Let $\{W_{i}(t),\,t\in[0,1]\}$, $i=1,\ldots,N$, be independent standard Brownian motions, where again $N$ is the total number of games. To represent relative team strength, we take $RS_{i}\sim a\times\mathrm{Unif}(-1,1)+c$, where $\mathrm{Unif}(-1,1)$ denotes a uniform random variable on $[-1,1]$, and $a,c\in\mathbb{R}$ are constants used to calibrate the simulated data. With this defined, we model the score difference as $ScD_{i}(t)=tRS_{i}+W_{i}(t)$, and hence define the indicator variables (analogous to the home team winning) $Y_{i}=1$ if $ScD_{i}(1)>0$, and $Y_{i}=0$ otherwise. That is, the score difference is modelled as a Brownian motion with drift determined by $RS_{i}$; positive $RS_{i}$ makes it easier for the home team to win, while negative $RS_{i}$ has the opposite effect. The constants $a$ and $c$ defining $RS_{i}$ were selected so that the home team win probability is approximately 59%, in order to match the past-10-year historical home team win rate in the NBA. Similar simple Brownian motion models for NBA scores have been studied extensively; see Stern (1994), Gabel and Redner (2012), and Chen and Fan (2018).

Under these settings, it is relatively straightforward to show that

$$p_{i}(t)=P\left(Y_{i}=1\,|\,\mbox{all information up to time $t$ in game $i$}\right)=\Phi\left(\frac{ScD_{i}(t)+RS_{i}(1-t)}{\sqrt{1-t}}\right),$$

where $\Phi$ denotes the standard normal distribution function. We call this true probability function the "Oracle" model or probability below.
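A compact R sketch of this simulation design is given below; the default values of the constants a and c are illustrative placeholders rather than the calibrated values used in our study, which were chosen so that the home team win rate is approximately 59%.

```r
# Simulate N synthetic "games" on an equally spaced grid over [0, 1].
simulate_games <- function(N, K = 201, a = 1, c = 0.35) {
  t_grid <- seq(0, 1, length.out = K)
  dt     <- 1 / (K - 1)
  rs     <- a * runif(N, -1, 1) + c     # relative team strengths (a, c illustrative)
  # Brownian motion paths: cumulative sums of N(0, dt) increments, starting at 0
  incr <- matrix(rnorm(N * (K - 1), sd = sqrt(dt)), nrow = N)
  W    <- cbind(0, t(apply(incr, 1, cumsum)))
  scd  <- outer(rs, t_grid) + W         # ScD_i(t) = t * RS_i + W_i(t)
  y    <- as.integer(scd[, K] > 0)      # "home team" wins if ScD_i(1) > 0
  # Oracle probability; guard the t = 1 endpoint, where the variance vanishes
  z <- sweep(scd + outer(rs, 1 - t_grid), 2, sqrt(pmax(1 - t_grid, 1e-12)), "/")
  list(t = t_grid, rs = rs, scd = scd, y = y, oracle = pnorm(z))
}
```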

5.2 Models

We first consider the problem of constructing two "models" for this simulation experiment that have equal forecasting power and satisfy the null hypothesis $H_{0}$ of equal forecasting accuracy. This is achieved in this setting by perturbing the Oracle probability by independent random noise that has the same distribution in each model to be compared. We considered two such "models" in our simulation experiments:

The Oracle model perturbed by standard Brownian motion (OraBM):

$$\hat{p}_{i}(t)=\Phi\left(\frac{ScD_{i}(t)+RS_{i}(1-t)+BM_{i}(t)}{\sqrt{1-t}}\right),$$

where $\{BM_{i}(t),\,t\in[0,1]\}$, $i=1,\ldots,N$, are independent standard Brownian motions.

The Oracle model perturbed by a time-homogeneous Ornstein–Uhlenbeck process (OraOU):

$$\hat{p}_{i}(t)=\Phi\left(\frac{ScD_{i}(t)+RS_{i}(1-t)+OU_{i}(t)}{\sqrt{1-t}}\right),$$

where $\{OU_{i}(t),\,t\in[0,1]\}$, $i=1,\ldots,N$, are independent time-homogeneous standard Ornstein–Uhlenbeck processes. We may generate $OU_{i}(t)$ from a standard Brownian motion via $OU_{i}(t)=e^{-t/2}BM_{i}(e^{t})$.

We use the notation OraBM1 and OraBM2 (similarly OraOU1 and OraOU2) to denote two probabilistic forecasts generated according to the above models with independent noise terms. Since the noise terms in the two forecasts are equal in distribution, they intuitively have equal predictive power and satisfy $H_{0}$. We compare these "models", as well as the GLM models introduced in Section 2.1. To fit the GLM models in a way that parallels our analysis of the ESPN forecast data, we first generate two independent "seasons" following the description above with $N$ games in each season, one of which we refer to as the training data and the other as the testing data. We fit the competing models introduced in Section 2.1 on the training data with the link function $g(\cdot)$ taken to be the probit link, and then compare the model forecasts on the test data. We note that with the probit link the model PgRSScD is correctly specified, in the sense that $\Phi^{-1}(p_{i}(t))$ is, for each fixed $t$, a linear function of $RS_{i}$ and $ScD_{i}(t)$.
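Indeed, from the Oracle expression above,

$$\Phi^{-1}(p_{i}(t))=\frac{ScD_{i}(t)+RS_{i}(1-t)}{\sqrt{1-t}}=\sqrt{1-t}\,RS_{i}+\frac{1}{\sqrt{1-t}}\,ScD_{i}(t),$$

so the probit model holds exactly at each fixed $t$ with $\beta_{0}(t)=0$, $\beta_{1}(t)=\sqrt{1-t}$, and $\beta_{2}(t)=(1-t)^{-1/2}$.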

5.3 Results

For each setting of $N=100$, $250$, and $500$ and each model comparison considered, we simulated data independently 1000 times and applied the test of $H_{0}$ described in Section 4.2 for comparing forecast skill. The results, in terms of empirical rejection rates of $H_{0}$, are summarized for each comparison in Table 3.

We first compare the oracle models perturbed by noise, OraBM1 versus OraBM2 and OraOU1 versus OraOU2. Figure 5 gives a representative example of plots, as a function of $t$, of $\hat{\Delta}_{1000}(t)$ with conservative 95% pointwise confidence intervals for $\Delta_{1000}(t)$, along with an approximate $p$-value of the test of $H_{0}$ for these comparisons. In this example the plots show clearly that the two models have approximately equal skill at all time points $t$, and further that any discrepancies between the models across $t$ are not significant. We observed that the empirical rejection rates of $H_{0}$ for this example tended to be close to, though typically slightly below, the nominal levels of 10%, 5%, and 1% for each sample size $N$ used to define the size of the training and testing sets. This shows both that the asymptotic result motivating the test of $H_{0}$ is reasonably accurate for the sample sizes and examples that we considered, and that the conservative approximations made do not produce a test that is too conservative in practice. We believe this is due to the fact that the way we generate the synthetic game data leads to probabilities $p_{i}(t)$ that are close to 1/2, in which case the conservative approximation is ineffectual. We also tried adjusting the parameters of the $RS_{i}$ variable in the simulation so that the home team win rate would be closer to 90%, and found that this made the empirical rejection rates somewhat lower, as expected. Finally, we compared the oracle probability forecasts against the models with added noise, and found that the test generally rejected $H_{0}$ at a high rate over all significance levels in this case, as expected.

Figure 5: Representative plots of $\hat{\Delta}_{1000}(t)$ with 95% confidence intervals and approximate $p$-values for tests of $H_{0}$ for (a) OraBM1 versus OraBM2; (b) OraOU1 versus OraOU2.

In terms of comparing the proposed GLM models, similar representative plots comparing the correctly specified model PgRSScD to several naïve competitors are displayed in Figure 6. When comparing PgRSScD to models that do not adjust their forecasts as the game progresses, our test was always able to distinguish PgRSScD as having higher skill. From Figure 6 we see that the expected advantages of PgRSScD over the competitor models are transparent in the plots: the relative skill of PgRSScD improves over the course of the game compared to models that do not incorporate score information, and for models that do incorporate score information the relative improvement of PgRSScD decays as the game progresses. We also see in Table 3 that once the sample size reaches 500, the proposed test of $H_{0}$ is generally able to distinguish between the correctly specified and naïve models with empirical power approaching one.

Overall, we found that the proposed test performs well and as expected in many controlled examples, and tends to be conservative. While the test is certainly powerful enough to identify poorly performing models as having low skill compared to competitors, it may struggle to differentiate competitive models without a large sample size ($N\geq 500$ in the examples considered). This, along with the graphical tools proposed, allows one to easily assess the relative skill of two competing continuously updated probabilistic forecasting models.

Figure 6: Representative plots of $\hat{\Delta}_{1000}(t)$ derived from simulated data, with 95% confidence intervals and approximate $p$-values for tests of $H_{0}$, for (a) PgRSScD versus PgRS; (b) PgRSScD versus ScD; (c) PgRSScD versus LS; (d) PgRSScD versus PgRSLS.

Competing Models       N=100                  N=250                  N=500
                       10%    5%     1%       10%    5%     1%       10%    5%     1%
Ora vs. OraOU          0.997  0.993  0.960    1.000  1.000  1.000    1.000  1.000  1.000
Ora vs. OraBM          1.000  0.996  0.963    1.000  1.000  1.000    1.000  1.000  1.000
OraOU1 vs. OraOU2      0.096  0.043  0.007    0.085  0.046  0.005    0.083  0.037  0.002
OraBM1 vs. OraBM2      0.072  0.028  0.007    0.081  0.034  0.006    0.089  0.030  0.003
PgRSScD vs. PgRS       1.000  0.998  0.972    1.000  1.000  1.000    1.000  1.000  1.000
PgRSScD vs. ScD        0.510  0.377  0.176    0.831  0.745  0.509    0.995  0.982  0.907
PgRSScD vs. LS         0.795  0.704  0.415    0.990  0.967  0.898    1.000  1.000  1.000
PgRSScD vs. PgRSLS     0.820  0.648  0.253    1.000  0.999  0.958    1.000  1.000  1.000

Table 3: Empirical rejection rates at nominal levels of 10%, 5%, and 1% for the test of $H_{0}:\|\Delta_{N}\|^{2}=0$, based on 1000 independent simulations.

6 Evaluating the skill of ESPN forecasts

In this section, we apply the methods described in Section 4.1 to evaluate the skill of ESPN’s continuously updated probabilistic forecasts. In this case all GLM benchmark models are fit using the logit link function.

Figure 7 shows plots of $\hat{\Delta}_{1213}(t)$ with conservative 95% confidence intervals, as well as approximate $p$-values of the test of $H_{0}$, for comparisons of the ESPN forecasts with the naïve models PgRS, ScD, LS, and PgRSLS. These plots suggest that, in aggregate, the ESPN forecasts significantly outperform these models. The specific points in the game at which the ESPN forecasts exhibit higher skill compared to these benchmarks are also clear in the plots. For the models that use relative team strength, as encoded by ESPN's initial home win probability, as a covariate, the relative skill is similar to that of ESPN's forecasts early in the game, and similarly those that make use of the score difference improve relative to the ESPN forecasts towards the end of the game. For example, the model based on the score difference alone is strongly outperformed by the ESPN forecasts early in the game, but its forecasts have indistinguishable skill towards the end of the game.

Figure 7: Plots of $\hat{\Delta}_{1213}(t)$ based on the 2018-2019 season forecasts, with 95% confidence intervals and approximate $p$-values for tests of $H_{0}$, for (a) ESPN versus PgRS; (b) ESPN versus ScD; (c) ESPN versus LS; (d) ESPN versus PgRSLS.

We also compared the ESPN forecasts to those of the somewhat less naïve model PgRSScD using a logit link. Figure 8 shows a plot of $\hat{\Delta}_{1213}(t)$ based on the 2018-2019 season, with conservative 95% confidence intervals, as well as an approximate $p$-value of the test of $H_{0}$ for this comparison. In absolute terms, the estimated skill as measured by the Brier score generally favoured the simple logit model PgRSScD, with the exception of the last moments of the game. However, the difference is apparently not statistically significant at the 5% level at any game time point based on the conservative confidence interval estimates, nor is it significant when the differences are aggregated across time points. We found it interesting that ESPN's sophisticated proprietary model, which ostensibly makes use of more nuanced information about the game status and more sophisticated modelling, did not significantly outperform a simple logistic regression model. One might conclude from this that whatever additional information ESPN's model uses in producing these forecasts is not clearly beneficial for the purpose of forecasting, except in the final moments of the game.

Figure 8: Plot of $\hat{\Delta}_{1213}(t)$ based on the 2018-2019 season forecasts, with 95% confidence intervals and an approximate $p$-value for the test of $H_{0}$, for ESPN versus PgRSScD.

7 Discussion

Motivated by evaluating forecasts of NBA basketball games, we have developed graphical tools and statistical tests for assessing the calibration and relative skill of continuously updated probabilistic forecasts. These were studied via a simulation study of synthetic “basketball games”, and applied to evaluating and comparing the forecasts published on ESPN and a number of competing models. In terms of calibration, the ESPN forecasts, as well as forecasts produced from simple logistic regression models using the in-game score difference and/or pre-game relative strength of teams as covariates, appear reasonably well-calibrated. In terms of skill, the ESPN forecasts exhibited significantly higher skill over naïve models, but did not demonstrate superiority over simple logistic regression models based on the score difference and relative team strength.

We conclude with a few remarks about ideas that we considered but chose not to include in the paper, and avenues for future work. It is noteworthy that the confidence intervals defined in (6) may be made narrower and less conservative by using auxiliary information to improve upon the approximation of replacing $p_{i}(t)(1-p_{i}(t))$ with the upper bound 1/4. We considered a number of methods to achieve this, including employing auxiliary models and covariates to estimate $p_{i}(t)(1-p_{i}(t))$ solely in the variance estimation step, but found both that this changed the intervals little and that it led to poor performance near the end of the game. For a discussion of similar methods, see Sections 3.4 and 3.5 of Lai et al. (2011).

The basic reason why nested models often cannot be compared by means of the confidence intervals introduced above and the test of $H_{0}$ derived using Theorem 1 is that nested models may lead to a covariance kernel $\hat{C}_{cons}$ that is approximately (and asymptotically) degenerate. It might be interesting to adapt this result, and the corresponding intervals and test, to this case, for instance using methods similar to those described in Clark and McCracken (2015).

Lastly, after experimenting with many ways of constructing Figure 3 in Section 3 and the calibration plots in Figure 4 in Section 4.1, we settled on using the Wilson interval with a bin-size of $10\%$. There are many other reasonable candidates for constructing such intervals, including the typical normal approximation interval and the Agresti and Coull interval (Agresti and Coull, 1998), among several others; we found that these performed similarly well, with differences appearing only in the more extremal bins (those with centers closer to 0 or 1). Data-driven approaches for deciding on an interval type, as well as a bin-size, for these plots are worthy of further study.
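For reference, a minimal sketch of the Wilson interval computation underlying the binned calibration plots; the example counts at the end are hypothetical.

import numpy as np
from scipy.stats import norm

def wilson_interval(successes, n, alpha=0.05):
    """Wilson (1927) score interval for a binomial proportion."""
    z = norm.ppf(1 - alpha / 2)
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = (z / denom) * np.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2))
    return center - half, center + half

# e.g., 62 home wins among 90 games whose forecasts fell in the (0.6, 0.7] bin
print(wilson_interval(62, 90))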

References

  • Agresti, A. and Coull, B. A. (1998). Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician, 52(2):119–126.
  • Bickel, J. E. (2007). Some comparisons among quadratic, spherical, and logarithmic scoring rules. Decision Analysis, 4(2):49–65.
  • Bosq, D. (2000). Linear Processes in Function Spaces. Springer, New York.
  • Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3.
  • Brown, B. M. (1971). Martingale central limit theorems. The Annals of Mathematical Statistics, 42(1):59–66.
  • Brown, L. D., Cai, T. T., and DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science, 16(2):101–133.
  • Chen, T. and Fan, Q. (2018). A functional data approach to model score difference process in professional basketball games. Journal of Applied Statistics, 45(1):112–127.
  • Clark, T. E. and McCracken, M. W. (2015). Nested forecast model comparisons: A new approach to testing equal accuracy. Journal of Econometrics, 186(1):160–177.
  • Dawid, A. P. (1986). Probability forecasting, volume 8, pages 210–218. Wiley, New York.
  • Elo, A. E. (1978). The Rating of Chessplayers, Past and Present. Arco Publishing.
  • Erikson, R. S. and Wlezien, C. (2012). Markets vs. polls as election predictors: An historical assessment. Electoral Studies, 31(3):532–539.
  • ESPN (2020). National Basketball Association teams, scores, stats, news, standings, rumors. ESPN Internet Ventures. www.espn.com/nba/. Accessed on May 1, 2020.
  • Gabel, A. and Redner, S. (2012). Random walk picture of basketball scoring. Journal of Quantitative Analysis in Sports, 8(1).
  • Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007). Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268.
  • Gneiting, T. and Katzfuss, M. (2014). Probabilistic forecasting. Annual Review of Statistics and Its Application, 1:125–151.
  • Gomberg, J. (2015). Earthquake forewarning in the Cascadia region. U.S. Geological Survey Open-File Report.
  • Imhof, J. P. (1961). Computing the distribution of quadratic forms in normal variables. Biometrika, 48(3/4):419–426.
  • Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.
  • Lai, T. L., Gross, S. T., and Shen, D. B. (2011). Evaluating probability forecasts. The Annals of Statistics, 39(5):2356–2382.
  • McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, Second Edition. Chapman & Hall/CRC Monographs on Statistics and Applied Probability. Chapman & Hall.
  • McFadden, D. (1973). Conditional logit analysis of qualitative choice behaviour. In Zarembka, P., editor, Frontiers in Econometrics, pages 105–142. Academic Press, New York, NY, USA.
  • Murphy, A. H. (1998). The early history of probability forecasts: Some extensions and clarifications. Weather and Forecasting, 13(1):5–15.
  • Murphy, A. H. and Winkler, R. L. (1987). A general framework for forecast verification. Monthly Weather Review, 115(7):1330–1338.
  • Murphy, A. H. and Winkler, R. L. (1992). Diagnostic verification of probability forecasts. International Journal of Forecasting, 7(4):435–455.
  • National Research Council (2006). Completing the Forecast: Characterizing and Communicating Uncertainty for Better Decisions Using Weather and Climate Forecasts. The National Academies Press, Washington, DC.
  • Ranjan, R. and Gneiting, T. (2010). Combining probability forecasts. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(1):71–91.
  • Rowe, T. and Beard, S. (2018). Probabilities, methodologies and the evidence base in existential risk assessments. Centre for the Study of Existential Risk, University of Cambridge, Working Paper.
  • Seillier-Moiseiwitsch, F. and Dawid, A. P. (1993). On testing the validity of sequential probability forecasts. Journal of the American Statistical Association, 88(421):355–359.
  • Silver, N. (2014). Introducing NFL Elo ratings. https://fivethirtyeight.com/features/introducing-nfl-elo-ratings/. Accessed on 2020-03-18.
  • Silver, N. (2020). 2020 election forecast. https://projects.fivethirtyeight.com/2020-election-forecast/. Accessed on 2020-09-30.
  • Silver, N., Boice, J., and Paine, N. (2019). How our NBA predictions work. https://fivethirtyeight.com/methodology/how-our-nba-predictions-work/. Accessed on 2020-03-18.
  • Silver, N. and Fischer-Baum, R. (2015). How we calculate NBA Elo ratings. https://fivethirtyeight.com/features/how-we-calculate-nba-elo-ratings/. Accessed on 2020-03-11.
  • Spiegelhalter, D. J. (1986). Probabilistic prediction in patient management and clinical trials. Statistics in Medicine, 5(5):421–433.
  • Stern, H. S. (1994). A Brownian motion model for the progress of sports scores. Journal of the American Statistical Association, 89(427):1128–1134.
  • Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158):209–212.

Appendix

Appendix A Definition of $p_{i}(t)$ and proof of Theorem 1

In order to formally define $p_{i}(t)$, we first define two collections of sigma-algebras: $\mathcal{F}_{i}$, which is the information available up to and including the $i$th game, and $\mathcal{F}_{i-1,t}$, which is the information in all games up to and including the $(i-1)$st game, along with the information up to time $t$ in the $i$th game. Clearly then, for all $t\in[0,1]$, $\mathcal{F}_{i-1,t}\subset\mathcal{F}_{i}\subset\mathcal{F}_{i,t}$. Let $Y_{i}$ denote the zero-one variable encoding a win for the home team in game $i$. We define $p_{i}(t)=E[Y_{i}|\mathcal{F}_{i-1,t}]$ to be the probability that the home team wins the $i$th game given the information up to time $t$ in that game (and previous games), which is the quantity we aim to forecast.

Suppose that we have two methods, $\hat{p}^{A}_{i}(t)$ and $\hat{p}^{B}_{i}(t)$, for forecasting the in-game probability of the home team winning in game $i$ at time $t$.

Assumption A.1.

$\hat{p}^{A}_{i}(t)$ and $\hat{p}^{B}_{i}(t)$ are both measurable with respect to $\mathcal{F}_{i-1,t}$. In other words, only the information from all previous games, as well as the current game up to time $t$, is used to produce the forecasts $\hat{p}^{A}_{i}(t)$ and $\hat{p}^{B}_{i}(t)$.

Let $L^{2}[0,1]$ denote the space of square integrable functions defined on $[0,1]$, with canonical inner-product defined for $f,g\in L^{2}[0,1]$ as $\langle f,g\rangle=\int_{0}^{1}f(t)g(t)dt$. We assume here that $L(a,b)=(a-b)^{2}$ is the Brier loss, but the following results generalize to any loss function with a linear equivalent (Lai et al., 2011). Define

Z_{N}(t)=\sqrt{N}[\hat{\Delta}_{N}(t)-\Delta_{N}(t)]=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left[L(Y_{i},\hat{p}^{A}_{i}(t))-L(Y_{i},\hat{p}^{B}_{i}(t))-[L(p_{i}(t),\hat{p}^{A}_{i}(t))-L(p_{i}(t),\hat{p}^{B}_{i}(t))]\right].

Note that under the null hypothesis $H_{0}:\|\Delta_{N}\|^{2}=0$ of equivalent forecasting skill between methods $A$ and $B$, $Z_{N}(t)=\sqrt{N}\hat{\Delta}_{N}(t)$ in the $L^{2}$ sense. In order to formally state and prove Theorem 1, we also make the following assumptions. A sequence of covariance kernels that arises in calculating the distribution of $Z_{N}(t)$ is

C_{N}(t,s)=\frac{4}{N}\sum_{i=1}^{N}\Big(E[(\hat{p}^{A}_{i}(s)-\hat{p}^{B}_{i}(s))(\hat{p}^{A}_{i}(t)-\hat{p}^{B}_{i}(t))p_{i}(s)(1-p_{i}(s))|\mathcal{F}_{i-1}]\mathds{1}_{s>t}+E[(\hat{p}^{A}_{i}(s)-\hat{p}^{B}_{i}(s))(\hat{p}^{A}_{i}(t)-\hat{p}^{B}_{i}(t))p_{i}(t)(1-p_{i}(t))|\mathcal{F}_{i-1}]\mathds{1}_{t>s}\Big), \qquad (8)

where $\mathds{1}$ is the indicator function. As we will show, $C_{N}$ is essentially the average covariance function of the (centered) difference between the forecasts, $\hat{p}^{A}_{i}(t)-\hat{p}^{B}_{i}(t)$. This kernel is evidently symmetric for each $N$, and we show below that it is also positive (semi-)definite, which is to say that for all functions $f\in L^{2}[0,1]$,

\int_{0}^{1}\int_{0}^{1}C_{N}(t,s)f(t)f(s)\,dt\,ds\geq 0.
Assumption A.2.

There exists a symmetric, positive definite kernel $C$ satisfying $\int_{0}^{1}C(t,t)dt<\infty$ such that

\int_{0}^{1}\int_{0}^{1}[C_{N}(t,s)-C(t,s)]^{2}\,dt\,ds=o_{P}(1). \qquad (9)

Assumption A.2 basically entails that there is a degree of “stationarity” or “ergodicity” between subsequent forecasts. This would be implied if, for example, subsequent forecasts in each game were independent of each other and exhibited an approximately stable distribution, although independence could be relaxed a great deal here with (9) still holding. This condition is analogous to condition (a) in Theorem 1 of Seillier-Moiseiwitsch and Dawid (1993), where its validity in probabilistic forecast evaluation is discussed, and it is also central to the results of Lai et al. (2011).

The kernel $C$ defined in Assumption A.2 may be used to define a symmetric, positive definite linear operator $\mathcal{C}:L^{2}[0,1]\to L^{2}[0,1]$ via

\mathcal{C}(f)(t)=\int_{0}^{1}C(t,s)f(s)\,ds, \qquad (10)

which by Mercer's Theorem (Bosq, 2000) must have associated to it a decreasing sequence of eigenvalues $\lambda_{1}\geq\lambda_{2}\geq\cdots$ and corresponding orthonormal eigenfunctions $\phi_{1},\phi_{2},\ldots\in L^{2}[0,1]$ satisfying $\mathcal{C}(\phi_{i})=\lambda_{i}\phi_{i}$. A similar collection of eigenvalues $\lambda_{1,N}\geq\lambda_{2,N}\geq\cdots$ may be defined for the operator $\mathcal{C}_{N}$ obtained by replacing $C$ with $C_{N}$ in the definition (10).

Assumption A.3.

The sequence of variables $Z_{N}$ is uniformly tight in $L^{2}[0,1]$ (see pg. 46 of Bosq (2000)).

Uniform tightness is used in order to extend the asymptotic Gaussianity of the finite dimensional distributions of the process $Z_{N}$ to the entire process. Intuitively, assuming tightness in this context amounts to assuming a degree of continuity or “smoothness” of the forecasts $\hat{p}_{i}^{A}(t)$ and $\hat{p}_{i}^{B}(t)$ as functions of $t$. A sufficient condition implying Assumption A.3 is

\sup_{N\geq 1}\sum_{i=1}^{\infty}\lambda_{i,N}<\infty,

see the proof on page 52 of Bosq (2000).

Restatement of Theorem 1. Under $H_{0}$ and Assumptions A.1–A.3,

\|Z_{N}\|^{2}\stackrel{D}{\to}\sum_{i=1}^{\infty}\lambda_{i}\chi_{i}^{2}(1),

where the $\chi_{i}^{2}(1)$, $i=1,2,\ldots$, are independent and identically distributed $\chi^{2}$ random variables with one degree of freedom, and the constants $\{\lambda_{i},\;i\geq 1\}$ are defined via (10). These constants may be asymptotically conservatively estimated by the eigenvalues of

\hat{C}_{cons}(t,s)=\frac{1}{N}\sum_{i=1}^{N}[\hat{p}^{A}_{i}(t)-\hat{p}^{B}_{i}(t)][\hat{p}^{A}_{i}(s)-\hat{p}^{B}_{i}(s)].
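Since the limiting distribution in Theorem 1 is a weighted sum of independent $\chi^{2}(1)$ variables, in practice it must be approximated. The following sketch discretizes $\hat{C}_{cons}$ on the observation grid and simulates the weighted sum by Monte Carlo, a simple stand-in for exact methods such as that of Imhof (1961); the array conventions are the same illustrative assumptions as in the earlier sketches.

import numpy as np

def skill_test_pvalue(p_a, p_b, y, n_mc=20000, seed=0):
    """Approximate p-value for H0 of equal forecasting skill, comparing
    N * ||Delta_hat||^2 with the weighted chi-square limit of Theorem 1."""
    n_games, n_time = p_a.shape
    h = 1.0 / n_time                               # grid spacing on [0, 1]
    loss_diff = (y[:, None] - p_a) ** 2 - (y[:, None] - p_b) ** 2
    delta_hat = loss_diff.mean(axis=0)
    stat = n_games * h * np.sum(delta_hat ** 2)    # N * ||Delta_hat||^2
    diff = p_a - p_b
    c_cons = diff.T @ diff / n_games               # \hat{C}_cons on the grid
    lam = np.linalg.eigvalsh(h * c_cons)           # approximate operator eigenvalues
    lam = lam[lam > 0]
    rng = np.random.default_rng(seed)
    draws = rng.chisquare(1, size=(n_mc, lam.size)) @ lam
    return float(np.mean(draws >= stat))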
Proof.

For all test functions $v\in L^{2}[0,1]$, and using the definition of the Brier score, we have that

\langle Z_{N},v\rangle/2=\big\langle\sqrt{N}[\hat{\Delta}_{N}-\Delta_{N}],v\big\rangle/2=\frac{1}{2\sqrt{N}}\sum_{i=1}^{N}\big\langle L(Y_{i},\hat{p}^{A}_{i})-L(Y_{i},\hat{p}^{B}_{i})-[L(p_{i},\hat{p}^{A}_{i})-L(p_{i},\hat{p}^{B}_{i})],v\big\rangle=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\langle(\hat{p}^{A}_{i}-\hat{p}^{B}_{i})(p_{i}-Y_{i}),v\rangle=:\frac{1}{\sqrt{N}}\sum_{i=1}^{N}X_{i,v}.
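For completeness, the second-to-last equality rests on the following pointwise identity for the Brier loss, applied with $a=\hat{p}^{A}_{i}(t)$, $b=\hat{p}^{B}_{i}(t)$, $Y=Y_{i}$, and $p=p_{i}(t)$:

[L(Y,a)-L(p,a)]-[L(Y,b)-L(p,b)]=(Y-p)(Y+p-2a)-(Y-p)(Y+p-2b)=2(a-b)(p-Y),

with the factor of 2 absorbed by the division by 2 on the left-hand side.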

$X_{i,v}$ is a martingale difference sequence with respect to the filtration $\mathcal{F}_{i}$. To see this, notice by the tower property of conditional expectation and Fubini's theorem that

E[X_{i,v}|\mathcal{F}_{i-1}]=E\big[\big\langle E[(\hat{p}^{A}_{i}-\hat{p}^{B}_{i})(p_{i}-Y_{i})|\mathcal{F}_{i-1,t}],v\big\rangle\big|\mathcal{F}_{i-1}\big]=0,

since $E[(\hat{p}^{A}_{i}(t)-\hat{p}^{B}_{i}(t))(p_{i}(t)-Y_{i})|\mathcal{F}_{i-1,t}]=(\hat{p}^{A}_{i}(t)-\hat{p}^{B}_{i}(t))E[(p_{i}(t)-Y_{i})|\mathcal{F}_{i-1,t}]=0$, a.s. for all $t$, using Assumption A.1. Further,

E[X_{i,v}^{2}|\mathcal{F}_{i-1}]=\int_{0}^{1}\int_{0}^{1}E[(\hat{p}^{A}_{i}(t)-\hat{p}^{B}_{i}(t))(p_{i}(t)-Y_{i})(\hat{p}^{A}_{i}(s)-\hat{p}^{B}_{i}(s))(p_{i}(s)-Y_{i})v(t)v(s)|\mathcal{F}_{i-1}]\,dt\,ds. \qquad (11)

Suppose $s>t$; then, using the tower property,

E[(\hat{p}^{A}_{i}(s)-\hat{p}^{B}_{i}(s))(p_{i}(s)-Y_{i})(\hat{p}^{A}_{i}(t)-\hat{p}^{B}_{i}(t))(p_{i}(t)-Y_{i})v(t)v(s)|\mathcal{F}_{i-1}]
=E[E[(\hat{p}^{A}_{i}(s)-\hat{p}^{B}_{i}(s))(p_{i}(s)-Y_{i})(\hat{p}^{A}_{i}(t)-\hat{p}^{B}_{i}(t))(p_{i}(t)-Y_{i})v(t)v(s)|\mathcal{F}_{i-1,s}]|\mathcal{F}_{i-1}]
=E[(\hat{p}^{A}_{i}(s)-\hat{p}^{B}_{i}(s))(\hat{p}^{A}_{i}(t)-\hat{p}^{B}_{i}(t))v(t)v(s)E[(p_{i}(s)-Y_{i})(p_{i}(t)-Y_{i})|\mathcal{F}_{i-1,s}]|\mathcal{F}_{i-1}].

Since $Y_{i}|\mathcal{F}_{i-1,s}$ is a Bernoulli($p_{i}(s)$) random variable, it follows that $E[(p_{i}(s)-Y_{i})(p_{i}(t)-Y_{i})|\mathcal{F}_{i-1,s}]=p_{i}(s)(1-p_{i}(s))$. From this we obtain that for $s>t$,

E[(\hat{p}^{A}_{i}(s)-\hat{p}^{B}_{i}(s))(p_{i}(s)-Y_{i})(\hat{p}^{A}_{i}(t)-\hat{p}^{B}_{i}(t))(p_{i}(t)-Y_{i})v(t)v(s)|\mathcal{F}_{i-1}]
=E[(\hat{p}^{A}_{i}(s)-\hat{p}^{B}_{i}(s))(\hat{p}^{A}_{i}(t)-\hat{p}^{B}_{i}(t))p_{i}(s)(1-p_{i}(s))|\mathcal{F}_{i-1}]v(t)v(s).
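In detail, the Bernoulli moment computation used above is as follows: since $E[Y_{i}|\mathcal{F}_{i-1,s}]=p_{i}(s)$, $Y_{i}^{2}=Y_{i}$, and $p_{i}(t)$ is $\mathcal{F}_{i-1,s}$-measurable for $t\leq s$,

E[(p_{i}(s)-Y_{i})(p_{i}(t)-Y_{i})|\mathcal{F}_{i-1,s}]=p_{i}(s)p_{i}(t)-(p_{i}(s)+p_{i}(t))p_{i}(s)+p_{i}(s)=p_{i}(s)(1-p_{i}(s)).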

Plugging this into (11) and applying the same calculation when $t>s$ gives that

E[X_{i,v}^{2}|\mathcal{F}_{i-1}]=\int_{0}^{1}\int_{0}^{1}\big\{E[(\hat{p}^{A}_{i}(s)-\hat{p}^{B}_{i}(s))(\hat{p}^{A}_{i}(t)-\hat{p}^{B}_{i}(t))p_{i}(s)(1-p_{i}(s))|\mathcal{F}_{i-1}]\mathds{1}_{s>t}+E[(\hat{p}^{A}_{i}(s)-\hat{p}^{B}_{i}(s))(\hat{p}^{A}_{i}(t)-\hat{p}^{B}_{i}(t))p_{i}(t)(1-p_{i}(t))|\mathcal{F}_{i-1}]\mathds{1}_{t>s}\big\}v(t)v(s)\,dt\,ds.

It follows that

\frac{1}{N}\sum_{i=1}^{N}E[X_{i,v}^{2}|\mathcal{F}_{i-1}]=\int_{0}^{1}\int_{0}^{1}\frac{1}{4}C_{N}(t,s)v(t)v(s)\,dt\,ds.

Hence using Assumption A.2,

\frac{1}{N}\sum_{i=1}^{N}E[X_{i,v}^{2}|\mathcal{F}_{i-1}]\stackrel{P}{\to}\int_{0}^{1}\int_{0}^{1}\frac{1}{4}C(t,s)v(t)v(s)\,dt\,ds=\frac{1}{4}\langle\mathcal{C}(v),v\rangle=:\sigma_{v}^{2}. \qquad (12)

We consider the cases $\sigma_{v}^{2}=0$ and $\sigma_{v}^{2}>0$ separately. In the case where $\sigma_{v}^{2}=0$, we first note that, using the Cauchy-Schwarz inequality, $E[X_{i,v}^{2}|\mathcal{F}_{i-1}]\leq\|v\|^{2}E[\|(\hat{p}^{A}_{i}(\cdot)-\hat{p}^{B}_{i}(\cdot))(p_{i}(\cdot)-Y_{i})\|^{2}|\mathcal{F}_{i-1}]$, which is uniformly bounded, and hence $\frac{1}{N}\sum_{i=1}^{N}E[X_{i,v}^{2}|\mathcal{F}_{i-1}]$ is uniformly integrable. It follows then from (12) that $\sigma_{v}^{2}=0$ implies

\mathrm{Var}\left(\frac{1}{\sqrt{N}}\sum_{i=1}^{N}X_{i,v}\right)=E\left\{\frac{1}{N}\sum_{i=1}^{N}E[X_{i,v}^{2}|\mathcal{F}_{i-1}]\right\}\to 0.

When $\sigma_{v}^{2}>0$, let $S_{N,v}^{2}=\mathrm{Var}(\sum_{i=1}^{N}X_{i,v})=\sum_{i=1}^{N}E[X_{i,v}^{2}]$. Using (12), it follows that there exists a constant $D>0$ so that for all $N$ sufficiently large, $S_{N,v}^{2}\geq DN$. Thus, for all $\epsilon>0$ and $N$ sufficiently large, we have due to the uniform boundedness of $X_{i,v}^{2}$ that

\frac{1}{S_{N,v}^{2}}\sum_{i=1}^{N}E\big[X_{i,v}^{2}\mathds{1}_{\{|X_{i,v}|\geq\epsilon S_{N,v}\}}\big]=0.

Hence both the Lindeberg condition and the asymptotic constancy of the conditional variance condition of the martingale central limit theorem, c.f. Theorem 1 of Brown (1971), hold, giving that $\langle Z_{N},v\rangle/2\stackrel{D}{\to}\mathcal{N}(0,\sigma_{v}^{2})$. This shows that the finite dimensional distributions of $Z_{N}$ asymptotically coincide with those of a Gaussian process with covariance kernel $C$. This, along with Assumption A.3, implies that $Z_{N}$ converges weakly in $L^{2}[0,1]$ to a Gaussian process with covariance kernel $C$ (see Prohorov's theorem, Theorem 2.6 in Bosq (2000)). The form of the limiting distribution of $\|Z_{N}\|^{2}$ follows from the continuous mapping theorem and the Karhunen-Loève representation theorem.

The fact that the eigenvalues of the kernel $\hat{C}_{cons}$ are uniformly larger than those of $C_{N}$ follows since $\hat{C}_{cons}$ is obtained by replacing $p_{i}(t)(1-p_{i}(t))$ and $p_{i}(s)(1-p_{i}(s))$ with the upper bound $1/4$. ∎