
A Comparative Tutorial of Bayesian Sequential Design and Reinforcement Learning

Mauricio Tec$^a$, Yunshan Duan$^b$, and Peter Müller$^b$
$^a$Department of Biostatistics, Harvard T.H. Chan School of Public Health
$^b$Department of Statistics and Data Science, The University of Texas at Austin
Abstract

Reinforcement Learning (RL) is a computational approach to reward-driven learning in sequential decision problems. It implements the discovery of optimal actions by learning from an agent interacting with an environment rather than from supervised data. We contrast and compare RL with traditional sequential design, focusing on simulation-based Bayesian sequential design (BSD). Recently, there has been an increasing interest in RL techniques for healthcare applications. We introduce two related applications as motivating examples. In both applications, the sequential nature of the decisions is restricted to sequential stopping. Rather than a comprehensive survey, the focus of the discussion is on solutions using standard tools for these two relatively simple sequential stopping problems. Both problems are inspired by adaptive clinical trial design. We use examples to explain the terminology and mathematical background that underlie each framework and map one to the other. The implementations and results illustrate the many similarities between RL and BSD. The results motivate the discussion of the potential strengths and limitations of each approach.

1 Introduction

Sequential design problems (SDP) involve a sequence of decisions $D_t$ with data $Y_t$ observed at every time step $t=1,\ldots,T$ (DeGroot, 2004; Berger, 2013). The goal is to find a decision rule $(Y_1,\ldots,Y_t)\mapsto D_t$ that maximizes the expected value of a utility function. The utility function encodes an agent’s preferences as a function of hypothetical future data and assumed truth. The decision rule is prescribed before observing future data beyond $Y_t$ – as the term “design” emphasizes – assuming a probabilistic model with unknown parameters $\theta$ that generate the future data. For example, $\theta$ can be the true effect of a drug. Figure 1a summarizes the setup of a general SDP. As motivating examples in the upcoming discussion we will use two examples of clinical trial design, which naturally give rise to SDPs (Rossell et al., 2007; Berry and Ho, 1988; Christen and Nakamura, 2003). Both examples are about sequential stopping, that is, the sequential decision is to determine when and how to end the study. Figure 1b shows the setup of sequential stopping problems.

Using these examples, we compare two families of simulation-based methods for solving SDPs with applications in sequential stopping. The first is simulation-based Bayesian Sequential Design (BSD) (Müller et al., 2007), which is based on Bayesian decision theory (Berger, 2013). The other approach is Reinforcement Learning (RL), a paradigm based on the interaction between an agent and an environment that results in potential rewards (or costs) for each decision. RL has recently been proposed as a method to implement SDPs, building on recent advances in deep learning (Shen and Huan, 2021), outside the context of clinical studies. Earlier applications of RL related to clinical study design can be found in the dynamic treatment regimes literature (Zhao et al., 2009; Murphy, 2003; Murphy et al., 2007). There is a longstanding literature on such problems in statistics; see the review by Parmigiani and Inoue (2009, chapter 15) and the references therein. Implementing RL and BSD for these two motivating example problems (using standard algorithms) will illustrate many similarities between the two frameworks, while highlighting the potential strengths and limitations of each paradigm. (The implementation code is freely available at https://github.com/mauriciogtec/bsd-and-rl.)

A note about the scope of the upcoming discussion. It is meant to highlight the similarities and differences of algorithms in BSD and RL, and we provide a (partial) mapping of notations. There is no intent to provide (yet another) review of BSD or RL and its many variations. This article focuses on a few variations that best contrast the two traditions, and we try to highlight the relative advantages of each method. In short, BSD is better equipped to deal with additional structure, that is, to include more details in the inference model. For example, when dealing with delayed responses in a clinical study, one might want to include a model that uses early responses to predict such delayed outcomes. Or one might want to borrow strength across multiple related problems. Doing so would also be possible in RL, but it requires leaving the framework of Markov decision processes underlying many RL algorithms (Gaon and Brafman, 2020), which is discussed below. On the other hand, RL allows computation-efficient implementations, and is routinely used for much larger scale problems than BSD. The availability of computation-efficient implementations is critical in applications where sequential designs need to be evaluated for summaries under (hypothetical) repeated experimentation. This is the case, for example, in clinical trial design when frequentist error rates and power need to be reported. The evaluation requires Monte Carlo experiments with massive repeat simulations, making computation-efficient implementation important.

(a) Generic SDP.
(b) Sequential stopping.
Figure 1: Panel (a) shows the directed graph of a general SDP problem. In the figure, $H_t=\{Y_1,D_1,Y_2,\ldots,D_{t-1},Y_t\}$ denotes the history or information set at decision time $t$. There is an implicit arrow from every element of $H_t$ onto $D_t$, summarily implied by the arrow $H_t\to D_t$. In RL the utility function is usually written as a sum $G=\sum_{t=1}^T R_t$ of immediate rewards $R_t(H_t,D_t)$ (although it is usually called the return instead of the utility). Panel (b) shows a flow chart for the special case of sequential stopping problems, with $D_t=0$ indicating continuation. The node $T$ represents stopping the trial ($t=T$ and $D_T\neq 0$).

BSD:

BSD is a model-based paradigm for SDPs based on a sampling model $p(Y_t\mid H_{t-1},D_{t-1},\theta)$ for the observed data and a prior $p(\theta)$ reflecting the agent’s uncertainty about the unknown parameters.

To compare alternative decisions $D_t$, the agent uses an optimality criterion that is formalized as a utility function $u(Y_1,\ldots,Y_T,D_1,\ldots,D_T,\theta)$ which quantifies the agent’s preferences under hypothetical data, decisions and truth. It will be convenient to write the utility as $u(H_T,D_T=d,\theta)$, where $H_t$ denotes the history $H_t:=(Y_1,D_1,\ldots,D_{t-1},Y_t)$ at decision time $t$. Rational decision makers should act as if they were to maximize such utility in expectation, conditioning on already observed data and marginalizing with respect to any (still) unknown quantities like future data or parameters (DeGroot, 2004).

To develop a solution strategy, we start at time $T$ (final horizon or stopping time). Denote by $U(H_T,d)=E\{u(H_T,D_T=d,\theta)\mid H_T\}$ the expected utility at the stopping time $T$. Then, the rational agent would select $D^{\star}_T(H_T)=\operatorname*{arg\,max}_d U(H_T,d)$ at the stopping time. For earlier time steps, rational decisions derive from expectations over future data, with later optimal decisions plugged in,

$U(H_t,d)=E\{u(H_t,D_t=d,Y_{t+1}(D_t=d),D^{\star}_{t+1}(H_{t+1}(D_t=d)),\ldots,D^{\star}_T(H_T(D_t=d)),\theta)\mid H_t\}$   (1)

which determine the optimal decision (“Bayes rule”) as

$D^{\star}_t(H_t)=\operatorname*{arg\,max}_d U(H_t,d).$   (2)

Here, we use a potential outcomes notation $H_{t+1}(D_t=d)$ to emphasize that future history depends on the action $D_t=d$ (Robins, 1997), and similarly for other quantities. In the notation for expected utility $U(\cdot)$, the lack of an argument indicates marginalization (with respect to future data and parameters) and optimization (with respect to future actions), respectively. For example, in (1) expected utility conditions on $H_t$, but marginalizes w.r.t. future data $Y_{t+k}$ and substitutes optimal future decisions $D^{\star}_{t+k}$.

Some readers, especially those familiar with RL, might wonder why $D_t=d$ does not appear on the right-hand side of the conditional expectation (as in RL’s state-action value functions). This is because in the BSD framework actions are deterministic. There is no good reason why a rational decision maker would randomize (Berger, 2013). However, note that in clinical studies randomization is usually included and desired, but for other reasons – not to achieve an optimal decision, but to facilitate attribution of differences in outcomes to the treatment selection.

RL:

RL addresses a wider variety of sequential problems than BSD, provided one can formulate them as an agent interacting with an environment yielding rewards $R_t$ at every time step. Environments can be based on simulations. For example, popular successful RL applications with simulation-based environments include Atari video-games (Mnih et al., 2013), complex board games like chess and Go (Silver et al., 2018), robotic tasks (Tunyasuvunakool et al., 2020) and autonomous driving (Wurman et al., 2022; Sallab et al., 2017).

Just as in BSD, the interactive setup for RL (transition and rewards) can be defined by a sampling model and a prior over $\theta$. The interaction replicates the decision process, shown in Figure 1. Each draw $\theta\sim p(\theta)$ constitutes a new instance or episode of the environment. The RL agent seeks to maximize the expected sum of rewards $G=\sum_{t=1}^T R_t$, known as the return, over an episode. The return $G$ is the analogue of the utility function.

Decision rules are called policies in RL. A (stochastic) policy maps an observed history to a distribution over actions, $D_t\sim\pi(\cdot\mid H_t)$. The optimal policy $\pi_{\star}$ satisfies $\pi_{\star}=\operatorname*{arg\,max}_\pi E\{G\mid\pi\}$. As mentioned before, the notion of stochastic policies is not natural in BSD with its focus on decisions that a rational agent would take. In fact, under some regularity conditions it can be shown that the optimal RL policy is also deterministic (Puterman, 2014). So why stochastic policies? Stochastic policies in RL serve to implement exploration. In BSD it is assumed that if exploration were called for, it would be recognized by the optimal decision rule. While in theory this is the case, in practice additional reinforcement of exploration is reasonable. Also, as we shall see later, the use of stochastic policies facilitates the search for optimal policies by allowing the use of the policy gradient theorem.

Another close similarity of BSD and RL occurs in the definition of expected utility and the state-action value function in RL. The state-action value function is $Q^{\pi}(H_t,d)=E\{\sum_{k=t}^T R_k\mid H_t,D_t=d,\pi\}$. The value function of the optimal policy, $Q^{\pi_{\star}}(H_t,d)$, plays the same role as expected utility $U(H_t,d)$ when optimal decisions are substituted for future decisions $D_s$, $s>t$. An important difference is the stochastic nature of $\pi$ in the state-value function, versus the deterministic decisions $D_t$ in BSD.

From this brief introduction one can already notice many correspondences between the objects in RL and BSD. Table 1 shows a partial mapping between their respective terminologies. Not all are perfect equivalences. Sometimes common use in BSD and RL involves different levels of marginalization and/or substituting optimal values. Some of the analogies in the table will be developed in the remainder of the paper.

Table 1: A brief comparison of key quantities in BSD and RL. Variations without time subindex $t$ refer to time-invariant versions. Using states $S_t$, in many instances an argument $H_t$ can be replaced by $S_t$, as in $D_t(S_t)$. We use $Y=(Y_1,\ldots,Y_T)$ etc. to refer to lists over $t=1,\ldots,T$.

$Y_t$: data observed at time $t$ (BSD and RL).
$H_t=(Y_1,D_1,\ldots,D_{t-1},Y_t)$: history (information set) at decision time $t$ (BSD and RL).
$D_t=D_t(H_t)$: action (decision) at time $t$ (BSD and RL).
$S_t=S_t(H_t)$: summary (sufficient) statistic (BSD); state (RL).
$D_{\phi,t}$: (deterministic) action at time $t$ indexed by parameter $\phi$ (a policy) (BSD); n/a (a) (RL).
$\pi(D_t\mid H_t)$: n/a, no randomization (BSD); (random) policy (RL).
$\pi_\phi$: n/a (BSD); policy indexed by $\phi$ (RL).
Optimal decision/policy: Bayes rule $D^{\star}(H_t)$ in (2) (BSD); optimal policy $\pi_{\star}(H_t)$ (RL).
$\theta$, unknown parameter: (usually) required (BSD); optional (RL).
$R_t$: n/a (b) (BSD); (immediate) reward at time $t$ (RL).
Optimality criterion: utility $u(Y,D,\theta)=u(H_T,D_T,\theta)$ (BSD); total return $G=\sum_{t=1}^T R_t$ or remaining return $G_t=\sum_{k=t}^T R_k$ (RL).
State-action value: n/a, deterministic $D^{\star}$ (BSD); $Q^{\pi}(H_t,D_t)=E_{\pi}\{G_t\mid H_t,D_t\}$ (RL).
State value: n/a, deterministic $D^{\star}$ (BSD); $V^{\pi}(H_t)=E_{\pi}\{G_t\mid H_t\}$ (RL).
Value under optimal future actions: $U(H_t,D_t)$ (BSD); $Q^{\pi_{\star}}(H_t,D_t)$ (RL).
Optimal value: $U(H_t,D^{\star}(H_t))$ (BSD); $V^{\pi_{\star}}(H_t)$ (RL).
$J(\phi)$, expectation under the policy/decision indexed by $\phi$: $J(\phi)=E\{u(Y,D_\phi,\theta)\}$ (BSD); $J(\phi)=E\{G\mid\pi_\phi\}$ (RL).
Bellman equation / backward induction: $U(H_t,D_t)=E\{U(H_{t+1}(D_t))\mid H_t\}$ (BSD); $Q^{\pi}(H_t,D_t)=E\{R_t+V^{\pi}(H_{t+1})\mid H_t,D_t,\pi\}$ (RL).

(a) Deterministic policies $\pi_\phi(H_t)$ are not discussed in this review, but see Silver et al. (2014).
(b) An additive decomposition as $u(H_T,D_T,\theta)=\sum_t R(H_t,D_t,\theta)$ is possible, but not usually made explicit.

2 Two examples of optimal stopping in clinical trials

The two stylized examples introduced here mimic sequential stopping in a clinical trial. The agent is an investigator who is planning and overseeing the trial. The data $Y_t$ are clinical outcomes recorded for each patient. In both cases $D_t$ refers to a stopping decision after $t$ (cohorts of) patients. Under continuation ($D_t=0$), the agent incurs an additional cost $c_t$ for recruiting the next cohort of patients. Under stopping ($D_t\neq 0$), on the other hand, the agent incurs a cost if a wrong (precise meaning to be specified) recommendation is made. At each time the agent has to choose between continuing the study – to learn more – versus stopping and realizing a reward for a good final recommendation (about the treatment). Throughout we use the notions of cost (or loss) and utility interchangeably, treating loss as negative utility.

Example 1: A binary hypothesis

Consider the decision problem of choosing between $H_1:\theta=\theta_1$ and $H_2:\theta=\theta_2$. For instance, $\theta$ could represent the probability of a clinical response for an experimental therapy. Assume a binary outcome $Y_t$ with a Bernoulli sampling model $p(Y_t=1\mid\theta)=\theta$ and a discrete two-point prior $p(\theta=\theta_1)=p(\theta=\theta_2)=\frac{1}{2}$.

The possible decisions at any time are $D_t\in\{0,1,2\}$. Here $D_t=0$ indicates continuation, $D_t=1$ indicates to terminate the trial and report $H_1$ ($\theta=\theta_1$), and $D_t=2$ means terminate and report $H_2$ ($\theta=\theta_2$). The utility function includes a (fixed) sampling cost $c$ for each cohort and a final cost $K>0$ for reporting the wrong hypothesis. The utility function is

$u(H_T,D_T,\theta)=-cT-K\,\mathbb{I}(\theta\neq\theta_{D_T}).$   (3)

The relevant history $H_t$ can be represented using the summary statistic $S_t=(t,\sum_{k\leq t}Y_k/t)$, since this statistic is sufficient for the posterior of $\theta$. The implementations of simulation-based BSD and RL use this summary statistic. The problem parameters are fixed as $c=1$ and $K=100$. Example trajectories of $S_t$ assuming no stopping are shown in Figure 2(a).
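
To make the sufficiency claim concrete, here is a minimal Python sketch of the posterior under the two-point prior, which depends on $H_t$ only through $S_t=(t,p_t)$, and of the utility (3); the values of $\theta_1,\theta_2$ are illustrative placeholders, since the text does not fix them.

```python
import numpy as np

def posterior_prob_theta1(t, p_t, theta1=0.4, theta2=0.6):
    """p(theta = theta_1 | H_t) under the two-point prior; a function of (t, p_t) only."""
    s = t * p_t  # number of successes among the first t outcomes
    loglik1 = s * np.log(theta1) + (t - s) * np.log(1 - theta1)
    loglik2 = s * np.log(theta2) + (t - s) * np.log(1 - theta2)
    return 1.0 / (1.0 + np.exp(loglik2 - loglik1))  # equal prior weights 1/2

def utility(T, d, theta, theta_vals=(0.4, 0.6), c=1.0, K=100.0):
    """Utility (3): -cT - K * I(wrong report), for terminal decisions d in {1, 2}."""
    return -c * T - K * float(theta_vals[d - 1] != theta)
```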

(a) Example 1: trajectories
(b) Example 2: trajectories
(c) Example 2: dose-response models
Figure 2: Data from the two sequential stopping examples. (a) and (b) are forward simulations assuming no stopping; (c) shows the implied dose-response curves for different random draws of the prior.

Example 2: A dose-finding study

This example is a stylized version of the ASTIN trial (Grieve and Krams, 2005). The trial aims to find the optimal dose for a drug from a set of candidate doses ${\mathbb{X}}=\{{\rm x}_0,\ldots,{\rm x}_G\}$, where ${\rm x}_0=0$ stands for placebo. At each time $t$, a dose $X_t\in{\mathbb{X}}$ is assigned to the next patient (cohort), and an efficacy outcome $Y_t$ is observed. The aim is to learn about the dose-response curve $f(X_t)=\mathbb{E}(Y_t\mid X_t)$, and more specifically, to find the dose ${\rm x}_{95}$ (the ED95) that achieves 95% of the maximum possible improvement over placebo. Let $\delta_g=f({\rm x}_g)-f({\rm x}_0)$ be the advantage over the placebo at dose ${\rm x}_g$. We set up a nonlinear regression $Y_t=f(X_t\mid\theta)+\epsilon_t$ with $\epsilon_t\sim N(0,\sigma^2)$ using a dose-response function

$f({\rm x}\mid\theta)=a+b\,\frac{{\rm x}^r}{q^r+{\rm x}^r}$   (4)

with $\theta=(a,b,q,r)$ and a prior $p(\theta)=N(\theta_0,\mathrm{diag}(\lambda_0^2))$. In the PK/PD (pharmacokinetics/pharmacodynamics) literature model (4) is known as the $E_{max}$ model (Meibohm and Derendorf, 1997).

Similar to Example 1, $D_t\in\{0,1,2\}$, with $D_t=0$ indicating continuation, $D_t=1$ indicating stopping the trial and recommending no further investigation of the experimental therapy, and $D_t=2$ indicating stopping and recommending a follow-up trial set up as a pivotal trial to test the null hypothesis $H_0:\delta_{95}=0$. If continuing the trial, the next assigned dose is $X_{t+1}=\min\{X_t+\xi,\hat{x}_{95,t}\}$, where $\hat{x}_{95,t}$ is the latest estimate of the ED95 and $\xi$ is a maximum allowable dose escalation between cohorts. If a pivotal trial is recommended, $N_{\alpha,\beta}$ patients are assigned to the dose $\hat{x}_{95,T}$, with $N_{\alpha,\beta}$ computed from the observed data to achieve a desired power $1-\beta$ at a certain alternative $H_1:\theta=\theta_1$ for a test of size $\alpha$. Details are given in Appendix A.

The utility function includes a patient recruitment cost of $c$ and a prize $K>0$ if the null hypothesis $H_0$ ($\delta_{95}=0$) is rejected in the pivotal trial (meaning the agent found evidence of an effective drug). Denote $\Delta_R=\Pr(\mbox{reject } H_0 \mbox{ in the 2nd trial}\mid H_T)$. At the stopping time $T=\min_t\{D_t\neq 0\}$, utility is calculated as

$u(H_T,D_T,\theta)=\begin{cases}-cT & \mbox{if } D_T=1\\ -cT+\left\{-cN_{\alpha,\beta}(H_T)+K\Delta_R(H_T)\right\} & \mbox{if } D_T=2\end{cases}$   (5)

In our implementation we fix $a=0$, $r=1$, $\sigma=1$ in (4) and $c=1$, $K=100$ in (5), and $\xi=1$, leaving the unknown parameters $\theta=(b,q)$, that is, the maximum effect $b$ and the location $q$ of the ${\rm x}_{50}$. The prior on $\theta$ has mean $\theta_0=(1/2,1)$ with $\lambda_0=(1,1)$, and we add the constraint $q\geq 0.1$.

The summary statistic is $S_t=(\bar{\delta}_{95},s_\delta)$, where $\bar{\delta}_{95}$ and $s_\delta$ are posterior mean and standard deviation of $\delta_{95}$. Figure 2(b) shows examples of trajectories of these summaries until some maximum time horizon $T$ (i.e., assuming no stopping). The trajectories are created by sampling $\theta$ from the prior, assigning doses as described, and sampling responses using (4). Figure 2(c) shows examples of the implied dose-response curve $f({\rm x}\mid\theta)$ for different prior draws. Notice that the chosen summaries do not capture the full posterior. The statistic $S_t$ is not a sufficient statistic for the posterior. However, as shown in Appendix A, $N_{\alpha,\beta}$ and $\Delta_R$, and therefore the utility, depend on the data only through $S_t$.
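
As a small illustration of the forward model behind Figure 2(c), the following sketch draws dose-response curves from the prior in (4) with $a=0$ and $r=1$; the dose grid and the simple truncation used to enforce $q\geq 0.1$ are assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def emax(x, b, q, a=0.0, r=1.0):
    """Dose-response curve (4), with a = 0 and r = 1 fixed as in the text."""
    return a + b * x**r / (q**r + x**r)

def draw_prior_curves(n_draws=10, doses=np.linspace(0.0, 10.0, 101)):
    """Sample theta = (b, q) from the N(theta_0, diag(lambda_0^2)) prior, with q >= 0.1."""
    curves = []
    for _ in range(n_draws):
        b = rng.normal(0.5, 1.0)            # prior mean 1/2, scale 1
        q = max(rng.normal(1.0, 1.0), 0.1)  # crude handling of the constraint q >= 0.1
        curves.append(emax(doses, b, q))
    return np.array(curves)
```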

3 Simulation-based Bayesian sequential design

From the expected utility definition in (1), one immediately deduces that

$U(H_t,D_t=d)=E\{U(H_{t+1}(D_t=d),D^{\star}(H_{t+1}(D_t=d)))\mid H_t\},$   (6)

with the expectation being with respect to future data and $\theta$, and substituting optimal choices for future decisions $D_s$, $s>t$. In words, for a rational agent taking optimal actions, the expected utility given history $H_t$ must be the same as the expected utility in the next step. Thus, one can (theoretically) deduce $D^{\star}_t$ from knowing the best actions for all possible future histories by implementing backward induction starting with $T$. Equation (6) is known as the Bellman equation (Bellman, 1966).

Enumerating all histories is computationally intractable in most realistic scenarios, rendering backward induction usually infeasible, except in some special setups (Berry and Ho, 1988; Christen and Nakamura, 2003). Simulation-based BSD comes to the rescue: instead of enumerating all possible histories, we compute approximations using a set of simulated trajectories. The version presented here follows Müller et al. (2007). Similar schemes are developed in Brockwell and Kadane (2003), Kadane and Vlachos (2002), and Carlin et al. (1998). We use two strategies: first, we represent the history $H_t$ through a (low-dimensional) summary statistic $S_t$, as already hinted in Section 2 when we proposed the posterior moments of the ED95 response for Example 2. The second – and closely related – strategy is to restrict $D_t$ to depend on $H_t$ only indirectly through $S_t$. Two instances of this approach are discussed below and used to solve Examples 1 and 2.

3.1 Constrained backward induction

Constrained backward induction is an algorithm consisting of three simple steps. The first step is forward simulation. Our implementation here uses the assumption that the sequential nature is limited to sequential stopping, so trajectories can be generated assuming no stopping independently from decisions. Throughout we use $D_t=0$ to denote continuation. Other actions, $D_t\neq 0$, indicate stopping the study and choice of a terminal decision. The second step is constrained backward induction, which implements (6) using decisions restricted to depend on the history $H_t$ indirectly only through $S_t$. The third step simply keeps track of the best action and iterates until convergence. We first briefly explain these steps and then provide an illustration of their application in Example 1 and additional implementation considerations.

Step 1. Forward simulation: Simulate many trajectories, say $M$, until some maximum number of steps $T_{\text{max}}$ (e.g. cohorts in a trial). To do this, each $m=1,\ldots,M$ corresponds to a different prior draw $\theta^{(m)}\stackrel{\text{iid}}{\sim}p(\theta)$ and samples $Y^{(m)}_t\sim p(Y_t\mid\theta^{(m)})$, $t=1,\ldots,T_{\text{max}}$. For each $m$ and $t$, we evaluate and record the summary statistic $S^{(m)}_t$, discretized over a grid (see the code sketch after Step 3).

Step 2. Backward induction: For each possible decision $d$ and each grid value $S=j$, the algorithm approximates $\widehat{U}(S,d)\approx U(S,d)$ using the forward simulation and the Bellman equation as follows. Denote with $A_j=\{(m,t_m)\mid S^{(m)}_{t_m}=j\}$ the set of forward simulations that fall within grid cell $j$. Then,

$\widehat{U}(S=j,d)=\begin{cases}\frac{1}{|A_j|}\sum_{(m,t_m)\in A_j}\widehat{U}(S^{(m)}_{t_m+1},D^{\star}(S^{(m)}_{t_m+1})) & d=0\\ \frac{1}{|A_j|}\sum_{(m,t_m)\in A_j}u(S^{(m)}_{t_m},D_{t_m}=d,\theta^{(m)}) & d\neq 0.\end{cases}$   (7)

The evaluation under $d=0$ requires the optimal actions $D^{\star}(S^{(m)}_{t_m+1})$. We use an initial guess (see below), which is then iteratively updated (see next).

Step 3. Iteration: Update the table $D^{\star}(S)=\operatorname*{arg\,max}_d\widehat{U}(S,d)$ after Step 2. Repeat Steps 2 and 3 until updating in Step 3 requires no (or below a minimum number of) changes of the tabulated $D^{\star}(S)$.
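
The following is a minimal sketch of the three steps for Example 1. The values of $\theta_1,\theta_2$, the grid resolution, the initial guess for $D^{\star}$, and the handling of never-visited grid cells are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
THETA = np.array([0.4, 0.6])       # hypothetical theta_1, theta_2
M, T_MAX, N_GRID = 1000, 50, 100   # as in the text: M = 1000, T_max = 50, 100 grid values
C, K = 1.0, 100.0

# Step 1: forward simulation without stopping; record the grid cell of p_t for each (m, t).
theta = rng.choice(THETA, size=M)                        # theta^(m) ~ p(theta), iid
y = rng.binomial(1, theta[:, None], size=(M, T_MAX))     # Y_t^(m) given theta^(m)
p = np.cumsum(y, axis=1) / np.arange(1, T_MAX + 1)       # running averages p_t^(m)
cell = np.minimum((p * N_GRID).astype(int), N_GRID - 1)  # discretized p_t^(m)

# Terminal expected utilities U_hat((t, j), d), d in {1, 2}: average of (3) over cell A_j.
U = np.zeros((T_MAX, N_GRID, 3))
for d in (1, 2):
    u_term = -C * np.arange(1, T_MAX + 1) - K * (THETA[d - 1] != theta)[:, None]
    for t in range(T_MAX):
        for j in range(N_GRID):
            sel = cell[:, t] == j
            if sel.any():
                U[t, j, d] = u_term[sel, t].mean()

# Steps 2 and 3: iterate continuation values (7) and the tabulated D_star until no changes.
D_star = np.ones((T_MAX, N_GRID), dtype=int)  # simple initial guess: stop and report H_1
for _ in range(100):
    for t in reversed(range(T_MAX - 1)):      # start updating from large t
        for j in range(N_GRID):
            sel = cell[:, t] == j
            if sel.any():                     # never-visited cells keep their initial values
                nxt = cell[sel, t + 1]
                U[t, j, 0] = U[t + 1, nxt, D_star[t + 1, nxt]].mean()
    U[T_MAX - 1, :, 0] = -np.inf              # force a terminal decision at the horizon
    new_D = U.argmax(axis=2)
    if np.array_equal(new_D, D_star):
        break
    D_star = new_D
```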

Figure 3(a) shows the estimated utility function $\widehat{U}(S,d)$ in Example 1 using $M=1000$, $T_{\text{max}}=50$, and 100 grid values for the running average $p_t=\sum_{k=1}^t Y_k/t$ in $S=(t,p_t)$. Optimal actions $D^{\star}_t(S)$ are shown in Figure 3(b). The numerical uncertainty due to the Monte Carlo evaluation of the expectations is visible. If desired, one could reduce it by appropriate smoothing (MacDonald et al., 2015). One can verify, however, that the estimates are a close approximation to the analytic solution, which is available in this case (Müller et al., 2007).

We explain Step 2 by example. Consider Figure 3(b) and assume, for example, that we need the posterior expected utility at $S=(t,p_t)=(20,0.25)$. In this stylized representation, only three simulations, $A=\{m_1,m_2,m_3\}$, pass through this grid cell. In this case, $t_m=t=20$ for the three trajectories, since $t$ is part of the summary statistic. We evaluate $\widehat{U}(S,d=1)$ and $\widehat{U}(S,d=2)$ as averages $\frac{1}{3}\sum_{m\in A}u(S^{(m)}_{20},d,\theta^{(m)})$. For $\widehat{U}(S,d=0)$, we first determine the grid cells in the next period for each of the three trajectories, $S^{(m)}_{t_m+1}=(21,p^{(m)}_{21})$. We then look up the optimal decisions $D^{\star}(21,p^{(m)}_{21})$ (using in this case $t_m+1=21$) and average $\frac{1}{3}\sum_{m\in A}\widehat{U}((21,p^{(m)}_{21}),D^{\star}(21,p^{(m)}_{21}))$, as in (7).

Constrained backward induction requires iterative updates of $D^{\star}(S)$ and their values. The procedure starts with arbitrary initial values for $D^{\star}(S)$, recorded on a grid over $S$. For example, a possible initialization is $D^{\star}(S)=\operatorname*{arg\,max}_{d\neq 0}\widehat{U}(S,d)$, maximizing over all actions that do not involve continuation. With such initial values, $\widehat{U}(S,d)$ can be evaluated over the entire grid. Then, for updating the optimal actions $D^{\star}(S)$ it is best to start from grid values associated with the time horizon $T$, or at least large $t$. This is particularly easy when $t$ is an explicit part of $S_t$, as in Example 1 with $S_t=(t,p_t)$. Another typical example arises in Example 2 with $S_t=(\mu_t,\sigma_t)$, the mean and standard deviation of some quantity of interest. For large $t$ we expect small $\sigma_t$, making it advisable to start updating in each iteration with the grid cells corresponding to the smallest $\sigma_t$. We iterate until no more (or few) changes happen.

The algorithm can be understood as an implementation of (6).

Consider $f(S,d)$ as an arbitrary function over pairs $(S,d)$ and define the function operator ${\mathcal{P}}$ by $({\mathcal{P}}f)(S,d)=\max_{d^{\prime}}\mathbb{E}[f(S^{\prime},d^{\prime})\mid S]$, where $S^{\prime}$ is the summary statistic resulting from sampling one more data point $Y$ under the unknown $\theta$ and recomputing $S^{\prime}$ from $S$. Then the Bellman equation (6) can be written as $U={\mathcal{P}}U$. In other words, expected utility under the optimal decision $D^{\star}$ is a fixed point of the operator ${\mathcal{P}}$. Constrained backward induction attempts to find an approximate solution to the fixed-point equation. The same principle motivates the Q-learning algorithm in RL (Watkins and Dayan, 1992) (see Section 4). Backward induction is also closely related to the value iteration algorithm for Markov decision processes (Sutton and Barto, 2018), which relies on exact knowledge of the state transition function.

(a) BSD: expected utility estimates under constrained backward induction.
(b) BSD: optimal actions.
(c) RL: state-action value estimates with Q-learning.
(d) RL: best actions.
(e) BSD: parametric boundary.
(f) BSD: utility by parameter.
(g) RL: performance over training.
Figure 3: Example 1. Comparison of decision boundaries and fitted value functions/utilities.

3.2 Sequential design with decision boundaries

Inspection of Figure 3(b) suggests an attractive alternative algorithm. Notice the decision boundaries on $S=(t,p_t)$ that trace a funnel, with an upper boundary $\omega_2(t)$ separating $D^{\star}=2$ from $D^{\star}=0$, and a lower boundary $\omega_1(t)$ separating $D^{\star}=0$ versus $D^{\star}=1$. Recognizing such boundaries suggests an alternative approach based on searching for optimal boundaries in a suitable family $\{\omega_{\phi,1},\omega_{\phi,2}\mid\phi\in\Phi\}$. This approach turns the sequential decision problem of finding optimal $D^{\star}(S)$ into a non-sequential problem of finding an optimal $\phi^{\star}\in\Phi$. This method is used, for example, in Rossell et al. (2007).

In Example 1, we could use

$\omega_1(t)=\frac{\phi\sqrt{t-1}}{\sqrt{T-1}},\qquad \omega_2(t)=1-\frac{(1-\phi)\sqrt{t-1}}{\sqrt{T-1}},$

using a single tuning parameter $\phi\in(0,1)$. Both functions are linear in $\sqrt{t-1}$ and mimic the funnel shape seen in Figure 3(b). The decision rule implied by these boundaries is

$D_\phi(S_t)=\begin{cases}1 & \mbox{ if } p_t<\omega_1(t)\\ 2 & \mbox{ if } p_t>\omega_2(t)\\ 0 & \mbox{ otherwise. }\end{cases}$

Here, the additional subscript $\phi$ in $D_\phi(\cdot)$ indicates that the decision follows the rule implied by the decision boundaries $\omega_j(t;\phi)$. Note that $\omega_1(1)=0$ and $\omega_2(1)=1$, ensuring continuation at $t=1$.

For a given $\phi$, the forward simulations are used to evaluate expected utilities under the policy $D_\phi$. Let $T^{(m)}_\phi=\min\{t:\;D_\phi(S^{(m)}_t)\neq 0\}$ denote the stopping time under $D_\phi$. Then the expected utility under policy $D_\phi$ is

$U(\phi)=E\{u(S_{T_\phi},D_\phi(S_{T_\phi}),\theta)\}$   (8)

where the expectation is with respect to data $Y_T$ and $\theta$. It is approximated as an average over all Monte Carlo simulations, stopping each simulation at $T^{(m)}_\phi$, as determined by the parametric decision boundaries,

$\widehat{U}(\phi)=(1/M)\sum_{m=1}^M u\big(S^{(m)}_{T^{(m)}_\phi},D_\phi(S^{(m)}_{T^{(m)}_\phi}),\theta^{(m)}\big).$   (9)

Optimizing $\widehat{U}(\phi)$ w.r.t. $\phi$ we find the optimal decision boundaries $\phi^{\star}=\operatorname*{arg\,max}_\phi\widehat{U}(\phi)$. As long as the nature of the sequential decision is restricted to sequential stopping, the same set of Monte Carlo simulations can be used to evaluate all $\phi$, using different truncation to evaluate $\widehat{U}(\phi)$. In general, a separate set of forward simulations for each $\phi$, or other simplifying assumptions, might be required.
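
The two ingredients, the decision rule $D_\phi$ and the Monte Carlo estimate (9), can be sketched as follows for Example 1. The simulator and the values of $\theta_1,\theta_2$ are illustrative assumptions; for simplicity the sketch re-simulates for each $\phi$, whereas the text reuses one set of forward simulations, and the final grid search is just one simple way to maximize $\widehat{U}(\phi)$.

```python
import numpy as np

rng = np.random.default_rng(0)
THETA = np.array([0.4, 0.6])   # hypothetical theta_1, theta_2
C, K, T_MAX = 1.0, 100.0, 50

def d_phi(t, p, phi):
    """Decision rule implied by the boundaries omega_1, omega_2 (linear in sqrt(t-1))."""
    w1 = phi * np.sqrt(t - 1) / np.sqrt(T_MAX - 1)
    w2 = 1 - (1 - phi) * np.sqrt(t - 1) / np.sqrt(T_MAX - 1)
    return 1 if p < w1 else (2 if p > w2 else 0)

def u_hat(phi, M=1000):
    """Monte Carlo estimate (9): truncate each forward simulation at its stopping time."""
    total = 0.0
    for _ in range(M):
        theta = rng.choice(THETA)
        y_sum = 0
        for t in range(1, T_MAX + 1):
            y_sum += rng.binomial(1, theta)
            d = d_phi(t, y_sum / t, phi)
            if d != 0 or t == T_MAX:
                d = d if d != 0 else 2  # tie-break: force a terminal report at the horizon
                total += -C * t - K * float(THETA[d - 1] != theta)
                break
    return total / M

phis = np.linspace(0.3, 0.7, 21)
phi_star = phis[np.argmax([u_hat(f) for f in phis])]
```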

Figure 3(e) shows the decision boundaries for the best parameter, estimated at $\phi^{\star}=0.503$ in Example 1. The estimated values for $\widehat{U}(\phi)$ are shown in Figure 3(f). In Figure 3(e), the boundaries obtained using constrained backward induction are overlaid for comparison. The decision boundaries trace the optimal decisions under backward induction well. The differences in expected utility close to the decision boundary are likely very small, leaving minor variations in the decision boundary negligible.

The same approach is applied to the (slightly more complex) Example 2.

Recall the form of the summary statistics $S=(s_\delta,\bar{\delta})$, the posterior standard deviation and mean of the ED95 effect. We use the boundaries

$\omega_1(S)=-b_1 s_\delta+c,\qquad \omega_2(S)=b_2 s_\delta+c,$

parameterized by $\phi=(b_1,b_2,c)$. The implied decision rule is

$D_\phi(S)=\begin{cases}1 & \mbox{if } \bar{\delta}<\omega_1(s_\delta)\\ 2 & \mbox{if } \bar{\delta}>\omega_2(s_\delta)\\ 0 & \mbox{otherwise}.\end{cases}$

The results are in Figure 4(a). Again, the sequential decision problem is reduced to the optimization problem of finding the optimal $\phi$ in (9). Since now $\phi\in\Re^3$, the evaluation of $\widehat{U}$ requires a 3-dimensional grid. To borrow strength from Monte Carlo evaluations of (9) across neighboring grid points for $\phi$ we proceed as follows. We evaluate $\widehat{U}(\phi)$ on a coarse $10\times 10\times 10$ grid, and then fit a quadratic response surface (as a function of $\phi$) to these Monte Carlo estimates. The optimal $\phi^{*}$ is the maximizer of the quadratic fit. We find $\phi^{*}=(b_1^{*},b_2^{*},c^{*})=(1.572,1.200,0.515)$. Instead of evaluating $\widehat{U}$ on a regular grid over $\phi$, one could alternatively select a set of random design points (in $\phi$).
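
The response-surface step can be sketched as follows: evaluate $\widehat{U}(\phi)$ on a coarse grid of $\phi=(b_1,b_2,c)$, fit a full quadratic in $\phi$ by least squares, and take its maximizer (here found with a generic numerical optimizer). The grid ranges and the `u_hat` evaluator are placeholders.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def quadratic_features(phi):
    """Design vector for a full quadratic response surface in phi = (b1, b2, c)."""
    b1, b2, c = phi
    return np.array([1, b1, b2, c, b1**2, b2**2, c**2, b1*b2, b1*c, b2*c])

def fit_and_maximize(u_hat, grid_b1, grid_b2, grid_c):
    """Evaluate u_hat on the coarse grid, fit the quadratic, and maximize the fitted surface."""
    phis = np.array(list(itertools.product(grid_b1, grid_b2, grid_c)))
    X = np.array([quadratic_features(p) for p in phis])
    y = np.array([u_hat(p) for p in phis])          # Monte Carlo estimates of U(phi)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares quadratic fit
    objective = lambda p: -quadratic_features(p) @ beta
    start = phis[np.argmax(y)]                      # start from the best grid point
    return minimize(objective, start, method="Nelder-Mead").x
```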

The use of parametric boundaries is closely related to the notion of function approximation and the method of policy gradients in RL, which will be described next.

(a) BSD: parametric boundaries.
(b) RL: policy gradients.
Figure 4: Optimal decisions in Example 2. Comparison of fitted decision boundaries.

4 Reinforcement learning

The basic setup in RL is usually framed in terms of a Markov decision process (MDP) (Puterman, 2014). The Markov property ensures that optimal decisions depend only on the most recently observed state, enabling practicable algorithms. In this section we first describe MDPs and how a sequential design problem can be adapted to fit in this framework. Next, we discuss two algorithms, Q-learning (Watkins and Dayan, 1992) and policy gradients (Grondman et al., 2012), implemented in Examples 1 and 2, respectively. Both methods are implemented using neural networks. Throughout this section, the summary statistics $S_t$ are referred to as states, in keeping with the common terminology in the RL literature.

(a) SDP as generic HiMDP.
(b) Resulting belief MDP.
Figure 5: Belief MDP for the SDP.

4.1 Markov decision processes and partial observability

The Markov property for a decision process is defined by the conditions

$p(S_{t+1}\mid H_t,D_t)=p(S_{t+1}\mid S_t,D_t)\quad\mbox{ and }\quad p(R_t\mid H_t,D_t)=p(R_t\mid S_t,D_t).$

That is, the next state $S_{t+1}$ and the reward $R_t$ depend on the history only indirectly, through the current state and action. When this condition holds, the decision process is called an MDP. For MDPs, the optimal policy is only a function of the latest state $S_t$ and not of the entire history $H_t$ (Puterman, 2014). Many RL algorithms assume the Markov property. However, many sequential decision problems are more naturally characterized as partially observable MDPs (POMDPs), which satisfy the Markov property only conditional on some $\theta$ that is generated at the beginning of each episode. Such problems have been studied in the RL literature under the name of Hidden-Parameter MDPs (HiMDPs) (Doshi-Velez and Konidaris, 2016).

There is a standard – Bayesian motivated – way to cast any POMDP as an MDP using so-called belief states (Cassandra et al., 1997). Belief states are obtained by including the posterior distribution of unobserved parameters as a part of the state. With a slight abuse of notation, we may write the belief states as $S_t=p(\theta\mid H_t)$. While the belief state is, in general, a function, it can often be represented as a vector when the posterior admits a finite (sufficient) summary statistic. The reward distribution can also be written in terms of such belief states as

$p(R\mid S_t,D_t)=\int_\theta p(R\mid D_t,\theta)\,dp(\theta\mid H_t).$   (10)

See Figure 5 for a graphical representation of an HiMDP and the resulting belief MDP.

We implement this approach for Example 1. The reward is chosen to match the definition of utility. It suffices to define it in terms of $\theta$ (and let the posterior take care of the rest, using (10)). We use

$R(d,\theta)=-c\,\mathbb{I}(d=0)-K\,\mathbb{I}(\theta_d\neq\theta,\,d\neq 0),$   (11)

as in (3). Next we introduce the belief states. Recall the notation from Example 1: the posterior $p(\theta\mid H_t)$ depends on the data only through $S_t=(t,p_t)$, with $p_t=\sum_{k\leq t}Y_k/t$. The summary statistic $S_t=(t,p_t)$ is therefore a two-dimensional representation of the belief state.
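
For Example 1, the belief MDP can be wrapped as a Gymnasium-style environment with observation $S_t=(t,p_t)$ and reward (11). This is a minimal sketch under assumed values of $\theta_1,\theta_2$; the authors' exact environment may differ (e.g., in how a forced stop at the horizon is handled).

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class SequentialStoppingEnv(gym.Env):
    """Belief MDP for Example 1: observation S_t = (t, p_t); actions 0 = continue, 1/2 = stop and report."""

    def __init__(self, thetas=(0.4, 0.6), c=1.0, K=100.0, t_max=50):
        self.thetas, self.c, self.K, self.t_max = np.array(thetas), c, K, t_max
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(2,), dtype=np.float32)
        self.action_space = spaces.Discrete(3)

    def _obs(self):
        p = self.y_sum / self.t if self.t > 0 else 0.5   # before any data, use the prior mean
        return np.array([self.t, p], dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.theta = self.np_random.choice(self.thetas)  # new episode: theta ~ p(theta)
        self.t, self.y_sum = 0, 0
        return self._obs(), {}

    def step(self, action):
        if action == 0 and self.t < self.t_max:          # continue: pay c, observe Y_{t+1}
            self.t += 1
            self.y_sum += self.np_random.binomial(1, self.theta)
            return self._obs(), -self.c, False, False, {}
        d = int(action) if action != 0 else 2            # forced terminal report at t_max
        reward = -self.K * float(self.thetas[d - 1] != self.theta)
        return self._obs(), reward, True, False, {}
```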

Considering Example 2, we note that the utility function (5) depends on the state $S_t$ only, and does not involve $\theta$. We define

$R_\theta(d,S_t)=-c_1\,\mathbb{I}(d=0)+\big(-c_2 N_{\alpha,\beta}(S_t)+K\Delta_R(S_t)\big)\,\mathbb{I}(d=2).$   (12)

While the reward is clearly Markovian, the transition probability is not necessarily Markovian. This is the case because in this example the posterior moments $S_t$ are not a sufficient statistic. In practice, however, a minor violation of the Markov assumption for the transition distribution does not seem to affect the ability to obtain good policies with standard RL techniques.

4.2 Q-learning

Q-learning (Watkins and Dayan, 1992; Clifton and Laber, 2020; Murphy, 2003) is an RL algorithm that is similar in spirit to the constrained backward induction described in Section 3.1. The starting point is the Bellman optimality equation for MDPs (Bellman, 1966). Equation (6), for the optimal $D^{\star}$ and written for MDPs, becomes

$Q^{\pi_{\star}}(s,d)=\mathbb{E}\{R_t+\max_{d^{\prime}}Q^{\pi_{\star}}(S_{t+1},d^{\prime})\mid S_t=s,D_t=d\},$   (13)

where the expectation is with respect to $R_t$ and $S_{t+1}$. The optimal policy is implicitly defined as the solution to (13).

Q-learning proceeds iteratively following the fixed-point iteration principle. Let $Q^{(k)}$ be some approximation of $Q^{\star}$. We assume that a set of simulated transitions $\{(s_t,d_t,r_t,s_{t+1})\}_{t=1}^n$ is available. This collection is used like the forward simulations in the earlier discussion. In RL it is known as the “experience replay buffer”, and can be generated using any stochastic policy. Suppose, for the moment, that the state and action spaces are finite and discrete, allowing $Q^{(k)}$ to be recorded in a table. Q-learning is defined by updating $Q^{(k)}$ as

$Q^{(k+1)}(s_t,d_t)\leftarrow(1-\alpha_k)\,Q^{(k)}(s_t,d_t)+\alpha_k\big\{r_t+\max_d Q^{(k)}(s_{t+1},d)\big\}.$   (14)

Note the moving average nature of the update. The procedure iterates until convergence from a stream of transitions.
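
In a hypothetical discretized setting (e.g., with the state being a grid cell of $S_t=(t,p_t)$), update (14) from a stream of transitions looks as follows; the terminal flag, which drops the continuation term at stopping, is an assumption for the episodic setting.

```python
import numpy as np

def q_learning(transitions, n_states, n_actions, alpha=0.1):
    """Tabular Q-learning: apply update (14) to each transition (s, d, r, s_next, done)."""
    Q = np.zeros((n_states, n_actions))
    for s, d, r, s_next, done in transitions:
        target = r if done else r + Q[s_next].max()
        Q[s, d] = (1 - alpha) * Q[s, d] + alpha * target
    return Q
```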

Deep Q-networks (DQN) (Mnih et al., 2013) are an extension of Q-learning for continuous states. A neural network is used to represent $Q(\cdot)$. Let $\phi^{(k)}$ denote the parameters of the neural network at iteration $k$. Using simulated transitions from the buffer, DQN performs the updates

$\phi^{(k+1)}=\operatorname*{arg\,min}_\phi\sum_{t=1}^n\left(r_t+\max_d Q_{\phi^{(k)}}(S_{t+1},d)-Q_\phi(S_t,d_t)\right)^2.$   (15)

In practice, exact minimization is replaced by a gradient step from mini-batches, together with numerous implementation tricks (Mnih et al., 2013).

We implemented DQN in Example 1 using the Python package Stable-Baselines3 (Raffin et al., 2021). The experience replay buffer is continuously updated. The algorithm uses a random policy to produce an initial buffer and then adds experience from an $\epsilon$-greedy policy, where the current best guess for the optimal policy is chosen with probability $(1-\epsilon)$ and fully random actions are chosen with probability $\epsilon$. Figure 3(c) shows $\hat{Q}$, the estimate of $Q^{\star}$, for each state and action $d\in\{0,1,2\}$. Figure 3(d) shows the corresponding optimal actions. Overall, the results are similar to the results with constrained backward induction, but much smoother. Also, notice that the solution under DQN is usually better in terms of expected utility, as shown in Figure 3(f), even in (out-of-sample) evaluation episodes. This is likely due to the flexibility and high-dimensional nature of the neural network approximation.
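
For reference, training with Stable-Baselines3 looks roughly as follows, using the hypothetical SequentialStoppingEnv sketched in Section 4.1; the hyperparameters shown are illustrative and not the ones used for the reported results.

```python
from stable_baselines3 import DQN

env = SequentialStoppingEnv()                   # hypothetical environment from Section 4.1
model = DQN("MlpPolicy", env, learning_rate=1e-3, exploration_fraction=0.1, verbose=1)
model.learn(total_timesteps=2_000_000)          # order of magnitude mentioned in the text
obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)
```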

The better performance of RL comes at a price. First is sample efficiency (the number of simulations required by the algorithm to yield a good policy). The best $\hat{Q}$ is obtained after 2 million sampled transitions. Data efficiency is a known problem in DQN, and in RL in general (Yu, 2018). In many real applications investigators cannot afford such a high number of simulation steps. Another limitation is training instability. In particular, Figure 3(f) illustrates a phenomenon known as catastrophic forgetting, which happens when additional training decreases the ability of the agent to perform a previously learned task (Atkinson et al., 2021). Because of this instability, a typical strategy is to evaluate performance periodically and keep track of the best performing policy only. Several improvements over basic DQN have been proposed, with improved performance and efficiency (Hessel et al., 2018).

4.3 Policy gradients

The approach is similar to the use of parametric boundaries discussed before. Policy gradient (PG) approaches start from a parameterization $\pi_\phi$ of the policy. Again, consider a neural network (NN) with weights $\phi$. The goal of a PG method is to maximize the objective

$\max_\phi J(\phi)=\mathbb{E}\{G\mid\pi_\phi\}.$   (16)

This objective is the analogue of maximizing $U(\phi)$ in (8), except that here the stochastic policy $\pi_\phi$ is a probability distribution over decisions given $S_t$. The main characteristic of PG methods is the use of gradient ascent to solve (16).

The evaluation of gradients is based on the PG theorem (Sutton et al., 1999),

$\nabla_\phi J(\phi)=\mathbb{E}\left\{\left[\sum_{t=1}^T\nabla_\phi\log\pi_\phi(D_t\mid S_t)\right]\cdot G\mid\pi_\phi\right\},$   (17)

where the total return $G$ is a function $G(\tau)$ of the entire trajectory $\tau=(S_1,D_1,\ldots,S_T,D_T)$. Using gradients, PG methods can optimize over high-dimensional parameter spaces such as the weights of a neural network. In practice, estimates of the gradient are known to have high variance, affecting the optimization, but several implementation tricks exist that improve the stability and reduce the variance. Proximal policy optimization (PPO) (Schulman et al., 2017) incorporates many of these tricks and is widely used as a default for PG-based methods.
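
The estimator (17) can be written as a surrogate loss whose gradient is the Monte Carlo estimate of $-\nabla_\phi J(\phi)$; below is a minimal REINFORCE-style PyTorch sketch (not PPO), where `policy` is any network mapping states to action logits and `episodes` holds simulated trajectories.

```python
import torch

def pg_surrogate_loss(policy, episodes):
    """Average over episodes of -[sum_t log pi_phi(D_t | S_t)] * G, as in (17)."""
    loss = 0.0
    for states, actions, total_return in episodes:    # one (S_1:T, D_1:T, G) triple per episode
        logits = policy(states)                       # shape (T, n_actions)
        log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
        loss = loss - log_probs.sum() * total_return  # score-function (REINFORCE) term
    return loss / len(episodes)
```

A gradient step then calls `loss.backward()` and an optimizer such as `torch.optim.Adam`; PPO adds clipping, advantage estimation, and multiple epochs per batch of episodes.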

The PG theorem is essentially the Leibniz rule for the gradient of $J(\phi)$. With a slight abuse of notation, write $\pi_\phi(\tau)$ for the distribution of $\tau$ induced by $\pi_\phi$ for the sequential decisions. Then the Leibniz rule for the gradient of the integral gives

$\nabla_\phi J(\phi)=\nabla_\phi\int\pi_\phi(\tau)G(\tau)\,d\tau=\int\nabla_\phi\pi_\phi(\tau)\cdot G(\tau)\,d\tau=\int\left(\nabla_\phi\log\pi_\phi(\tau)\right)\pi_\phi(\tau)\cdot G(\tau)\,d\tau=\mathbb{E}\left\{\nabla_\phi\log\pi_\phi(\tau)\cdot G(\tau)\right\},$   (18)

where the expectations are with respect to the (stochastic) policy $\pi_\phi$ over $\tau$. The log probability in the last expression can be written as a sum of log probabilities, yielding (17).

PPO is implemented in Example 2 using Stable-Baselines3 (Raffin et al., 2021). The results are shown in Figure 4(b). Not surprisingly, the results are similar to those obtained earlier using parametric decision boundaries. Interestingly, the figure shows that neural networks do not necessarily extrapolate well to regions with little data. This behavior is noticeable in the lower left corner of the figure, where there could be data, but where none is observed in practice because of the early stopping implied by the boundaries.

5 Discussion

We have introduced some of the main features of RL and BSD in the context of two optimal stopping problems. In the context of these examples the two approaches are quite similar, including an almost one-to-one mapping of terminology and notation, as we attempted in Table 1. In general, however, the applicability, and especially the practical use, of RL is much wider. The restriction of the sequential problems to optimal stopping was only needed for easy application of the BSD solution. In contrast, RL methods are routinely used for a variety of other problems, such as robotics (Tunyasuvunakool et al., 2020), autonomous driving (Sallab et al., 2017; Wurman et al., 2022), and smart building energy management (Yu et al., 2021). The main attraction of BSD is the principled nature of the solution. One can argue from first principles that a rational agent should act as if he or she were optimizing expected utility as in (1). There is a well-defined and coherent propagation of uncertainties. This might be particularly important when the SDP and underlying model are only part of a bigger problem. Overall, we note that the perspective and algorithms of one method can be useful for improvements in the other. For example, policy gradients could readily be used to solve BSD if randomized decision rules were used; the latter is usually not considered. On the other hand, hierarchical Bayesian inference models could be used to combine multiple sources of evidence in making sequential decisions under RL, or multiple related problems could be linked in a well-defined manner in a larger encompassing model. For example, clinical trials are never carried out in isolation. Often the same department or group might run multiple trials on the same patient population for the same disease, with obvious opportunities to borrow strength.

Acknowledgements

The authors gratefully thank Peter Stone and the Learning Agents Research Group (LARG) for helpful discussions and feedback.

Funding

Yunshan Duan and Peter Müller are partially funded by the NSF under grant NSF/DMS 1952679.

Conflicts of Interest

The authors report there are no competing interests to declare.

References

  • Atkinson, C., B. McCane, L. Szymanski, and A. Robins (2021). Pseudo-rehearsal: Achieving deep reinforcement learning without catastrophic forgetting. Neurocomputing 428, 291–307.
  • Bellman, R. (1966). Dynamic programming. Science 153(3731), 34–37.
  • Berger, J. O. (2013). Statistical decision theory and Bayesian analysis. Springer Science & Business Media.
  • Berry, D. A. and C.-H. Ho (1988). One-sided sequential stopping boundaries for clinical trials: A decision-theoretic approach. Biometrics 44(1), 219–227.
  • Brockwell, A. E. and J. B. Kadane (2003). A gridding method for Bayesian sequential decision problems. Journal of Computational and Graphical Statistics 12(3), 566–584.
  • Carlin, B. P., J. B. Kadane, and A. E. Gelfand (1998). Approaches for optimal sequential decision analysis in clinical trials. Biometrics 54, 964–975.
  • Cassandra, A., M. L. Littman, and N. L. Zhang (1997). Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pp. 54–61.
  • Christen, J. A. and M. Nakamura (2003). Sequential stopping rules for species accumulation. Journal of Agricultural, Biological, and Environmental Statistics 8(2), 184–195.
  • Clifton, J. and E. Laber (2020). Q-learning: Theory and applications. Annual Review of Statistics and Its Application 7(1), 279–301.
  • DeGroot, M. (2004). Optimal statistical decisions. New York: Wiley-Interscience.
  • Doshi-Velez, F. and G. Konidaris (2016). Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. In IJCAI: Proceedings of the Conference, Volume 2016, pp. 1432.
  • Gaon, M. and R. Brafman (2020). Reinforcement learning with non-Markovian rewards. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
  • Grieve, A. P. and M. Krams (2005). ASTIN: A Bayesian adaptive dose–response trial in acute stroke. Clinical Trials 2(4), 340–351.
  • Grondman, I., L. Busoniu, G. A. Lopes, and R. Babuska (2012). A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42(6), 1291–1307.
  • Hessel, M., J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2018). Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Kadane, J. B. and P. K. Vlachos (2002). Hybrid methods for calculating optimal few-stage sequential strategies: Data monitoring for a clinical trial. Statistics and Computing 12, 147–152.
  • MacDonald, B., P. Ranjan, and H. Chipman (2015). GPfit: An R package for fitting a Gaussian process model to deterministic simulator outputs. Journal of Statistical Software 64, 1–23.
  • Meibohm, B. and H. Derendorf (1997). Basic concepts of pharmacokinetic/pharmacodynamic (PK/PD) modelling. International Journal of Clinical Pharmacology and Therapeutics 35(10), 401–413.
  • Mnih, V., K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013). Playing Atari with deep reinforcement learning. arXiv preprint 1312.5602.
  • Müller, P., D. A. Berry, A. P. Grieve, M. Smith, and M. Krams (2007). Simulation-based sequential Bayesian design. Journal of Statistical Planning and Inference 137(10), 3140–3150.
  • Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(2), 331–355.
  • Murphy, S. A., D. W. Oslin, A. J. Rush, and J. Zhu (2007). Methodological challenges in constructing effective treatment sequences for chronic psychiatric disorders. Neuropsychopharmacology 32(2), 257–262.
  • Parmigiani, G. and L. Inoue (2009). Decision Theory: Principles and Approaches. Wiley.
  • Puterman, M. L. (2014). Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons.
  • Raffin, A., A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann (2021). Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research 22(268), 1–8.
  • Robins, J. M. (1997). Causal inference from complex longitudinal data. In Latent Variable Modeling and Applications to Causality, pp. 69–117. Springer.
  • Rossell, D., P. Müller, and G. Rosner (2007). Screening designs for drug development. Biostatistics 8, 595–608.
  • Sallab, A. E., M. Abdou, E. Perot, and S. Yogamani (2017). Deep reinforcement learning framework for autonomous driving. Electronic Imaging 2017(19), 70–76.
  • Schulman, J., F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. arXiv preprint 1707.06347.
  • Shen, W. and X. Huan (2021). Bayesian sequential optimal experimental design for nonlinear models using policy gradient reinforcement learning. arXiv preprint 2110.15335.
  • Silver, D., T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362(6419), 1140–1144.
  • Silver, D., G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014). Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, pp. 387–395.
  • Sutton, R. S. and A. G. Barto (2018). Reinforcement learning: An introduction. MIT Press.
  • Sutton, R. S., D. McAllester, S. Singh, and Y. Mansour (1999). Policy gradient methods for reinforcement learning with function approximation. In S. Solla, T. Leen, and K. Müller (Eds.), Advances in Neural Information Processing Systems, Volume 12.
  • Tunyasuvunakool, S., A. Muldal, Y. Doron, S. Liu, S. Bohez, J. Merel, T. Erez, T. Lillicrap, N. Heess, and Y. Tassa (2020). dm_control: Software and tasks for continuous control. Software Impacts 6, 100022.
  • Watkins, C. J. and P. Dayan (1992). Q-learning. Machine Learning 8(3-4), 279–292.
  • Wurman, P. R., S. Barrett, K. Kawamoto, J. MacGlashan, K. Subramanian, T. J. Walsh, R. Capobianco, A. Devlic, F. Eckert, F. Fuchs, et al. (2022). Outracing champion Gran Turismo drivers with deep reinforcement learning. Nature 602(7896), 223–228.
  • Yu, L., S. Qin, M. Zhang, C. Shen, T. Jiang, and X. Guan (2021). A review of deep reinforcement learning for smart building energy management. IEEE Internet of Things Journal 8, 12046–12063.
  • Yu, Y. (2018). Towards sample efficient reinforcement learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence.
  • Zhao, Y., M. R. Kosorok, and D. Zeng (2009). Reinforcement learning design for cancer clinical trials. Statistics in Medicine 28(26), 3294–3315.

Appendix

Appendix A Details in Example 2

We assume a nonlinear regression sampling model

$Y_t=f(X_t\mid\theta)+\epsilon_t,\qquad\epsilon_t\sim N(0,\sigma^2),$

and the dose-response curve

$f(X_t\mid\theta)=a+b\,\frac{X_t^r}{q^r+X_t^r}.$

We fix $a$, $r$, $\sigma^2$, and put a normal prior on the unknown parameters $\theta=(b,q)$,

$p(\theta)=N(\theta_0,\mathrm{diag}(\lambda_0^2)).$

Sample size calculation

If, at time $T$, the decision is $D_T=2$, the trial stops and a pivotal trial is conducted to test $H_0:\delta_{95}=0$ vs. $H_1:\delta_{95}>0$. We need to determine the sample size $N_{\alpha,\beta}$ for the pivotal trial that can achieve the desired significance level $\alpha$ and power $(1-\beta)$, and calculate the posterior predictive probability of a significant outcome, $\Delta_R=\Pr(\mbox{reject } H_0 \mbox{ in the 2nd trial}\mid H_T)$.

Let $\bar{\delta}_{95}$ and $s_\delta$ denote the posterior mean and standard deviation of $\delta_{95}$. We calculate power based on $\delta_{95}=\delta^{*}$, where $\delta^{*}=\bar{\delta}_{95}-s_\delta$.

Now consider a test enrolling $N_{\alpha,\beta}$ patients, randomizing $N_{\alpha,\beta}/2$ at $x=0$ (placebo) and $N_{\alpha,\beta}/2$ at the estimated ED95. Assuming $\mathrm{var}(y_i)=1$, we need

$N_{\alpha,\beta}\geq 4\left[(q_\alpha+q_\beta)/\delta^{*}\right]^2,$

where $q_\alpha$ is the right-tail $\alpha$ cutoff of the $N(0,1)$ distribution, and $\alpha=5\%$ and $(1-\beta)=80\%$ are the desired significance level and power (i.e., $\beta=0.2$).

A significant outcome at the end of the 2nd trial means data in the rejection region. Let $\overline{y}_1,\overline{y}_0$ denote the sample averages of the $N_{\alpha,\beta}/2$ patients each to be enrolled in the two arms of the 2nd trial. Then the rejection region is

$R=\{(\overline{y}_1-\overline{y}_0)\sqrt{N_{\alpha,\beta}/4}\geq q_\alpha\}.$

Let $\Phi(\cdot)$ denote the standard normal c.d.f. Then

$\Delta_R=\Phi\left[\frac{\bar{\delta}\sqrt{N_{\alpha,\beta}/4}-q_\alpha}{\sqrt{1+\frac{N_{\alpha,\beta}}{4}s_\delta^2}}\right].$
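
These two quantities translate directly into code. In the sketch below, rounding $N_{\alpha,\beta}$ up to an integer and the guard for a non-positive $\delta^{*}$ are implementation choices not specified in the text.

```python
import numpy as np
from scipy.stats import norm

def pivotal_trial_summaries(delta_bar, s_delta, alpha=0.05, beta=0.20):
    """Sample size N_{alpha,beta} and predictive rejection probability Delta_R."""
    q_a, q_b = norm.ppf(1 - alpha), norm.ppf(1 - beta)  # right-tail cutoffs q_alpha, q_beta
    delta_star = delta_bar - s_delta                     # power is computed at delta* = mean - sd
    if delta_star <= 0:
        return np.inf, 0.0                               # guard: no meaningful effect size
    N = int(np.ceil(4 * ((q_a + q_b) / delta_star) ** 2))
    delta_R = norm.cdf((delta_bar * np.sqrt(N / 4) - q_a) / np.sqrt(1 + (N / 4) * s_delta**2))
    return N, delta_R
```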

Posterior simulation

We can implement independent posterior simulation:

  (i) Generate $q\sim p(q\mid H_t)$, using
      $p(q\mid H_t)\propto p(H_t\mid q)\cdot p(q)=\frac{p(H_t\mid b,q)\,p(b)}{p(b\mid q,H_t)}\cdot p(q).$
  (ii) Then generate $b$ from the posterior conditional distribution $b\sim p(b\mid q,H_t)$. Based on normal linear regression, the conditional posterior is a univariate normal distribution.
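
A sketch of this two-step simulation for the setting of Example 2 (with $a=0$, $r=1$, $\sigma=1$): step (i) is approximated on a grid of $q$ values, evaluating the marginal likelihood $p(H_t\mid q)$ with $b$ integrated out analytically (equivalent to the identity above), and step (ii) draws $b$ from its conjugate normal conditional. The grid resolution and the truncated-normal handling of the constraint $q\geq 0.1$ are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal, truncnorm

def sample_posterior(x, y, n_draws=1000, b0=0.5, q0=1.0, lam=1.0, sigma=1.0, q_min=0.1):
    """Draw (q, b) from p(q | H_t) p(b | q, H_t) for the Emax model with a = 0, r = 1."""
    rng = np.random.default_rng(0)
    q_grid = np.linspace(q_min, 6.0, 300)
    logp = np.empty_like(q_grid)
    for j, q in enumerate(q_grid):
        g = x / (q + x)                                            # regressor for fixed q
        cov = sigma**2 * np.eye(len(x)) + lam**2 * np.outer(g, g)  # b integrated out
        logp[j] = multivariate_normal.logpdf(y, mean=b0 * g, cov=cov)  # p(H_t | q)
        logp[j] += truncnorm.logpdf(q, (q_min - q0) / lam, np.inf, loc=q0, scale=lam)  # p(q)
    w = np.exp(logp - logp.max())
    w /= w.sum()
    q_draws = rng.choice(q_grid, size=n_draws, p=w)
    # Step (ii): conjugate normal conditional p(b | q, H_t), single-coefficient regression.
    b_draws = np.empty(n_draws)
    for i, q in enumerate(q_draws):
        g = x / (q + x)
        prec = 1.0 / lam**2 + g @ g / sigma**2
        mean = (b0 / lam**2 + g @ y / sigma**2) / prec
        b_draws[i] = rng.normal(mean, 1.0 / np.sqrt(prec))
    return q_draws, b_draws
```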