
A Survey of Risk-Aware Multi-Armed Bandits

Vincent Y. F. Tan, Prashanth L.A., and Krishna Jagannathan
National University of Singapore, IIT Madras
[email protected], [email protected], [email protected]
Abstract

In several applications such as clinical trials and financial portfolio optimization, the expected value (or the average reward) does not satisfactorily capture the merits of a drug or a portfolio. In such applications, risk plays a crucial role, and a risk-aware performance measure is preferable, so as to capture losses in the case of adverse events. This survey aims to consolidate and summarise the existing research on risk measures, specifically in the context of multi-armed bandits. We review various risk measures of interest, and comment on their properties. Next, we review existing concentration inequalities for various risk measures. Then, we proceed to define risk-aware bandit problems. We consider algorithms for the regret minimization setting, where the exploration-exploitation trade-off manifests, as well as the best-arm identification setting, which is a pure exploration problem—both in the context of risk-sensitive measures. We conclude by commenting on persisting challenges and fertile areas for future research. This is an unabridged version of a survey paper of the same title accepted to the 31st International Joint Conference on Artificial Intelligence and the 25th European Conference on Artificial Intelligence (IJCAI-ECAI, 2022).

1 Introduction

The multi-armed bandit (MAB) problem is an online learning problem with partial feedback that exemplifies the exploration-exploitation tradeoff. This problem has a long history dating back to Thompson (1933) and finds a wide variety of applications, from clinical trials to financial portfolio optimization. In the stochastic MAB problem, a player chooses or pulls one among several arms, each defined by a certain reward distribution. The player wishes to maximize his reward or find the best arm in the face of an uncertain environment in which the distributions are a priori unknown. There are two general sub-problems in the MAB literature, namely, regret minimization and best-arm identification (also called pure exploration). In the former, in which the exploration-exploitation trade-off manifests, the player wants to maximize his reward over a fixed time period. In the latter, the player wants to learn which arm is the best, either in the quickest time possible with a given probability of success (the fixed-confidence setting) or with the highest probability of success given a fixed playing horizon (the fixed-budget setting). In most of the MAB literature (see Lattimore and Szepesvári (2020) for an up-to-date survey), the metric of interest is simply the mean of the reward distribution associated with the arm pulled.

However, in real-world applications, the mean does not satisfactorily capture the merits of certain actions. In such applications, risk plays an important role, and a risk-aware performance measure is often preferred over the average reward. For example, if we have an important face-to-face meeting downtown, we might want to choose a route that has a slightly higher expected travel time, but whose variance is small. Some risk measures include the mean-variance (Markowitz, 1952), the conditional value-at-risk (Artzner et al., 1999; Rockafellar and Uryasev, 2000), spectral risk measures (Acerbi, 2002), utility-based shortfall risk (Artzner et al., 1999; Hu and Zhang, 2018), and cumulative prospect theory (Tversky and Kahneman, 1992). This short survey aims to consolidate these various risk measures and to comment on their concentration properties. These properties are then used to describe the latest algorithmic developments for risk-aware bandits in regret minimization as well as best arm identification settings.

2 Preliminaries

To form estimators for several risk measures, we require the empirical distribution function (EDF).

Definition 1.

Given i.i.d. samples {Xi}i=1n\{X_{i}\}_{i=1}^{n} from the distribution FF of a random variable (abbreviated in the following as r.v.) XX, the EDF is

Fn(x):=1ni=1n𝕀{Xix}for anyxF_{n}(x):=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}\left\{X_{i}\leq x\right\}\quad\mbox{for any}\;\;x\in\mathbb{R}
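To make the definition concrete, the EDF can be computed directly from the sorted samples. The following is a minimal Python sketch of ours (the function name `edf` is not from the survey):

```python
from bisect import bisect_right

def edf(samples):
    """Return the EDF F_n of the given samples as a callable."""
    xs = sorted(samples)
    n = len(xs)
    # F_n(x) = (# samples <= x) / n; bisect_right counts samples <= x
    return lambda x: bisect_right(xs, x) / n

F = edf([3, 1, 2, 2])
```

For instance, F(2) = 3/4 here, since three of the four samples are at most 2.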

To establish concentration bounds for risk measures, one requires an assumption that bounds the tail of the distribution. In this article, we consider sub-Gaussian distributions; see Theorem 2.1 of Wainwright (2019) for equivalent characterizations.

Definition 2.

A r.v. XX is sub-Gaussian with (variance proxy) parameter σ2\sigma^{2} if its cumulant generating function log𝔼[exp(rX)]r2σ2/2\log\mathbb{E}[\exp(rX)]\leq r^{2}\sigma^{2}/2 for all rr\in\mathbb{R}.

A sub-Gaussian r.v. XX with parameter σ2\sigma^{2} satisfies the following bound for any ϵ>0\epsilon>0:

(Xϵ)exp(ϵ22σ2).\displaystyle\mathbb{P}\left(X\geq\epsilon\right)\leq\exp\left(-\frac{\epsilon^{2}}{2\sigma^{2}}\right). (1)
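As a quick sanity check, the tail bound (1) can be verified numerically for a standard Gaussian, which is sub-Gaussian with variance proxy σ² = 1; this Monte Carlo sketch is purely illustrative:

```python
import math, random

# Monte Carlo check of the tail bound (1) for a standard Gaussian
# (sub-Gaussian with variance proxy sigma^2 = 1); illustrative only.
random.seed(0)
n = 100_000
samples = [random.gauss(0.0, 1.0) for _ in range(n)]
for eps in [0.5, 1.0, 2.0]:
    tail = sum(x >= eps for x in samples) / n     # empirical P(X >= eps)
    bound = math.exp(-eps ** 2 / 2)               # bound (1) with sigma = 1
    assert tail <= bound
```

The bound is loose for a Gaussian (e.g., for ε = 1 the empirical tail is roughly 0.16 versus the bound 0.61), as expected from a generic sub-Gaussian inequality.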

For unifying risk measures, we require the notion of the Wasserstein distance, which is defined below.

Definition 3.

The Wasserstein distance between two cumulative distribution functions (CDFs) F1F_{1} and F2F_{2} on \mathbb{R} is

W1(F1,F2):=infFΓ(F1,F2)2|xy|dF(x,y),\displaystyle W_{1}(F_{1},F_{2}):=\inf_{F\in\Gamma(F_{1},F_{2})}\int_{\mathbb{R}^{2}}|x-y|\,\mathrm{d}F(x,y), (2)

where Γ(F1,F2)\Gamma(F_{1},F_{2}) is the set of couplings of F1F_{1} and F2F_{2}.

The Wasserstein distance between the CDFs F1F_{1} and F2F_{2} of two r.v.s XX and YY, respectively, may alternatively be written as follows:

W1(F1,F2)=sup|𝔼[f(X)]𝔼[f(Y)]|\displaystyle W_{1}(F_{1},F_{2})=\sup\left|\mathbb{E}[f(X)]-\mathbb{E}[f(Y)]\right|
=|F1(s)F2(s)|ds=01|F11(β)F21(β)|dβ,\displaystyle=\!\int_{-\infty}^{\infty}|F_{1}(s)\!-\!F_{2}(s)|\mathrm{d}s\!=\!\int_{0}^{1}\!|F_{1}^{-1}(\beta)\!-\!F_{2}^{-1}(\beta)|\mathrm{d}\beta,

where the supremum is over all functions f:f:\mathbb{R}\rightarrow\mathbb{R} that are 11-Lipschitz.
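For two EDFs built from equally many samples, the quantile representation above reduces to the average absolute difference of the sorted samples, which gives a one-line estimator; this sketch is ours:

```python
def w1_empirical(xs, ys):
    """W1 distance between the EDFs of two equal-size samples.
    By the quantile representation, it equals the average absolute
    difference of the sorted samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```

For example, shifting every sample by 1 yields a distance of exactly 1.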

3 Estimation and concentration of risk measures

3.1 Mean-Variance

We present a measure that effectively balances expected reward and variability. Although there are a large number of models that resolve the tension between return and risk, such as the Sharpe ratio (Sharpe, 1966) and Knightian uncertainty (Knight, 1921), we commence our discussion with one of the most popular measures, namely the mean-variance proposed by Markowitz (1952).

Definition 4.

The mean-variance of a r.v. XX with mean μ\mu and variance σ2\sigma^{2} and risk tolerance γ\gamma is MV=γμσ2\mathrm{MV}=\gamma\mu-\sigma^{2}.

We can recover two extreme cases by considering the extremal values of the risk tolerance γ\gamma. When γ\gamma\to\infty, maximizing MV\mathrm{MV} simply reduces to maximizing the mean, turning the mean-variance MAB problem into a standard reward maximizing problem. When γ=0\gamma=0, the mean-variance MAB problem reduces to a variance-minimization problem. When one only has access to samples, one can estimate the mean-variance by using plug-in estimates for the mean and variance. Concentration results for the mean-variance can be found in Zhu and Tan (2020). Here, we present a simpler form of the result.

We now provide a concentration result for the mean-variance, assuming that the underlying r.v. XX is sub-Gaussian. Define the empirical mean-variance MV^n:=γμ^nσ^n2\widehat{\mathrm{MV}}_{n}:=\gamma\hat{\mu}_{n}-\hat{\sigma}_{n}^{2} where μ^n\hat{\mu}_{n} and σ^n2\hat{\sigma}_{n}^{2} are the sample mean and sample variance respectively.

Lemma 1.

Let XX be a sub-Gaussian r.v. with parameter σ2\sigma^{2}. For any nn\in\mathbb{N} and ϵ>0\epsilon>0:

(|MV^nMV|>ϵ)2exp[nϵ28γ2σ2]\displaystyle\mathbb{P}\left(|\widehat{\mathrm{MV}}_{n}-\mathrm{MV}|>\epsilon\right)\leq 2\exp\left[-\frac{n\epsilon^{2}}{8\gamma^{2}\sigma^{2}}\right]
+2exp(n16min[ϵ22σ4,ϵσ2]).\displaystyle\qquad+2\exp\left(-\frac{n}{16}\min\left[\frac{\epsilon^{2}}{2\sigma^{4}},\frac{\epsilon}{\sigma^{2}}\right]\right).

This result can be proved by using concentration bounds for sub-Gaussian and sub-exponential r.v.s; see, for example Wainwright (2019, Prop. 2.2).
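The plug-in estimator above is straightforward to compute; the sketch below (ours) uses the maximum-likelihood (1/n) form of the sample variance:

```python
def empirical_mean_variance(samples, gamma):
    """Plug-in estimate MV_n = gamma * mu_n - sigma_n^2, using the
    maximum-likelihood (1/n) form of the sample variance."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    return gamma * mu - var
```

For samples {1, 3} and γ = 1, the sample mean is 2 and the sample variance is 1, giving an estimate of 1.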

3.2 Lipschitz-continuous risk measures

In Cassel et al. (2018), the authors consider general risk measures that satisfy a Lipschitz requirement under some norm in the space of distributions. Prashanth and Bhat (2022) use the notion of Wasserstein distance as the underlying norm in defining a continuous class of risk measures as given below.

Definition 5.

Let (,W1)(\mathcal{L},W_{1}) denote the metric space of distributions. A risk measure ρ()\rho(\cdot) is Lipschitz-continuous on (,W1)(\mathcal{L},W_{1}) if there exists L>0L>0 such that, for any two distributions (CDFs) F,GF,G\in\mathcal{L}, the following bound holds:

|ρ(F)ρ(G)|LW1(F,G).\displaystyle\left|\rho(F)-\rho(G)\right|\leq L\ W_{1}(F,G). (3)

Using the EDF FnF_{n} computed from nn i.i.d. samples, we estimate the risk measure ρ(F)\rho(F) satisfying (3) as follows:

ρn:=ρ(Fn).\displaystyle\rho_{n}:=\rho(F_{n}). (4)
Theorem 1.

Let XX be a sub-Gaussian r.v. with parameter σ2\sigma^{2}. Suppose ρ:\rho:\mathcal{L}\rightarrow\mathbb{R} is a Lipschitz risk measure with parameter LL. Then, for every ϵ\epsilon satisfying 512σn<ϵL<512σn+16σ𝖾\frac{512\sigma}{\sqrt{n}}<\frac{\epsilon}{L}<\frac{512\sigma}{\sqrt{n}}+16\sigma\sqrt{\mathsf{e}}, we have

(|ρnρ(X)|>ϵ)exp(n256σ2𝖾(ϵL512σn)2).\mathbb{P}\left(|\rho_{n}\!-\!\rho(X)|\!>\!\epsilon\right)\leq\exp\bigg{(}\!-\frac{n}{256\sigma^{2}\mathsf{e}}\Big{(}\frac{\epsilon}{L}-\frac{512\sigma}{\sqrt{n}}\Big{)}^{2}\bigg{)}.\! (5)

Conditional Value-at-Risk (Rockafellar and Uryasev, 2000), the spectral risk measure (Acerbi, 2002), and the utility-based shortfall risk (Föllmer and Schied, 2002) are three popular risk measures that are Lipschitz continuous. We describe these risk measures below. For other examples of risk measures satisfying (3), the reader is referred to Cassel et al. (2018) and Bhat and Prashanth (2019).

3.3 Conditional Value-at-Risk (CVaR)

Definition 6.

Let (y)+:=max(y,0)(y)^{+}:=\max(y,0). The CVaR at level α(0,1)\alpha\in(0,1) for a r.v. XX is defined by

CVaRα(X):=infξ{ξ+1(1α)𝔼[(Xξ)+]}.\mathrm{CVaR}_{\alpha}(X):=\inf_{\xi\in\mathbb{R}}\left\{\xi+\frac{1}{(1-\alpha)}\mathbb{E}\big{[}\left(X-\xi\right)^{+}\big{]}\right\}.

It is well known (Rockafellar and Uryasev, 2000) that the infimum in the definition of CVaR above is achieved for ξ=VaRα(X)\xi=\textrm{VaR}_{\alpha}(X), where VaRα(X)=inf{ξ:(Xξ)α}\textrm{VaR}_{\alpha}(X)=\inf\{\xi:\mathbb{P}\left(X\leq\xi\right)\geq\alpha\} is the Value-at-Risk (VaR) of XX at confidence level α\alpha. In financial applications, one prefers CVaR over VaR, since CVaR is coherent, while VaR is not (Artzner et al., 1999). CVaR also admits the following equivalent representation:

CVaRα(X)=VaRα(X)+11α𝔼[(XVaRα(X))+].\mathrm{CVaR}_{\alpha}(X)=\textrm{VaR}_{\alpha}(X)+\frac{1}{1-\alpha}\mathbb{E}\left[(X-\textrm{VaR}_{\alpha}(X))^{+}\right].

The following lemma shows that CVaR is a Lipschitz risk measure in the sense of Definition 5.

Lemma 2.

Suppose XX and YY are r.v.s with CDFs FXF_{X} and FYF_{Y}, respectively, and α(0,1)\alpha\in(0,1). Then

|CVaRα(X)CVaRα(Y)|(1α)1W1(FX,FY).\displaystyle|\mathrm{CVaR}_{\alpha}(X)-\mathrm{CVaR}_{\alpha}(Y)|\leq(1-\alpha)^{-1}W_{1}(F_{X},F_{Y}).

Let X1,,XnX_{1},\ldots,X_{n} denote i.i.d. samples drawn from the distribution of XX, and let {X[k]}k=1n\{X_{[k]}\}_{k=1}^{n} denote the order statistics of these samples in decreasing order. CVaRα(X)\mathrm{CVaR}_{\alpha}(X) can be estimated as follows:

cn,α=infξ{ξ+1n(1α)i=1n(Xiξ)+}.c_{n,\alpha}=\inf_{\xi\in\mathbb{R}}\bigg{\{}\xi+\frac{1}{n(1-\alpha)}\sum_{i=1}^{n}\left(X_{i}-\xi\right)^{+}\bigg{\}}. (6)

Let YY denote a r.v. with distribution FnF_{n}, the EDF of XX. Then, it can be shown that

cn,α=CVaRα(Y)=v^n,α+1n(1α)i=1n(Xiv^n,α)+,c_{n,\alpha}\!=\!\mathrm{CVaR}_{\alpha}(Y)\!=\!\hat{v}_{n,\alpha}+\frac{1}{n(1\!-\!\alpha)}\sum_{i=1}^{n}\left(X_{i}\!-\!\hat{v}_{n,\alpha}\right)^{+},

where v^n,α=inf{x:Fn(x)α}\hat{v}_{n,\alpha}=\inf\{x\in\mathbb{R}:F_{n}(x)\geq\alpha\} is the empirical VaR estimate, which can be shown to coincide with the order statistic X[n(1α)]X_{[\lfloor n(1-\alpha)\rfloor]}. Thus, the CVaR estimator cn,αc_{n,\alpha} coincides with the general template ρn=ρ(Fn)\rho_{n}=\rho(F_{n}).
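The estimator admits a short implementation from the sorted samples; this sketch is ours and treats samples as losses (higher is worse):

```python
import math

def empirical_cvar(samples, alpha):
    """Plug-in CVaR estimate c_{n,alpha} of (6): the empirical VaR plus
    the average excess over it, scaled by 1/(1 - alpha)."""
    xs = sorted(samples)
    n = len(xs)
    # empirical VaR: smallest x with F_n(x) >= alpha,
    # i.e. the ceil(alpha * n)-th smallest sample
    v = xs[math.ceil(alpha * n) - 1]
    return v + sum(max(x - v, 0.0) for x in xs) / (n * (1 - alpha))
```

For samples {1, 2, 3, 4} and α = 0.75, the estimate is 4, i.e., the mean of the worst 25% of outcomes.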

The result below presents a concentration bound for CVaR estimation under a sub-Gaussianity assumption.

Corollary 1.

Suppose that XX is a sub-Gaussian r.v. with parameter σ\sigma. Then, for any ϵ\epsilon such that 512σn<(1α)ϵ<512σn+16σ𝖾\frac{512\sigma}{\sqrt{n}}<(1-\alpha)\epsilon<\frac{512\sigma}{\sqrt{n}}+16\sigma\sqrt{\mathsf{e}} and n1n\geq 1, we have

(|cn,αCVaRα(X)|>ϵ)\displaystyle\mathbb{P}\left(\left|c_{n,\alpha}-\mathrm{CVaR}_{\alpha}(X)\right|>\epsilon\right)
exp(n256σ2𝖾((1α)ϵ512σn)2).\displaystyle\leq\exp\left(-\frac{n}{256\sigma^{2}\mathsf{e}}\left((1-\alpha)\epsilon-\frac{512\sigma}{\sqrt{n}}\right)^{2}\right).

The proof of the bound above follows as a corollary of Theorem 1, in view of Lemma 2.

We next present an alternative bound with explicit constants, under the assumption that the r.v. XX is continuous with a density ff that satisfies a growth condition, which is made precise in the lemma below.

Lemma 3.

Suppose that the r.v. XX is sub-Gaussian with parameter σ\sigma and continuous, with a density function ff that satisfies the following condition: There exist universal constants η,δ>0\eta,\delta>0 such that f(x)>ηf(x)>\eta for all x[vαδ2,vα+δ2]x\in\left[v_{\alpha}-\frac{\delta}{2},v_{\alpha}+\frac{\delta}{2}\right], where vαv_{\alpha} denotes VaRα(X)\textrm{VaR}_{\alpha}(X). Then, for any ϵ>0\epsilon>0, we have

(|cn,αCVaRα(X)|>ϵ)2exp[nϵ2(1α)28σ2]\displaystyle\mathbb{P}\left(\left|c_{n,\alpha}-\mathrm{CVaR}_{\alpha}(X)\right|>\epsilon\right)\leq 2\exp\left[-\dfrac{n\epsilon^{2}(1-\alpha)^{2}}{8\sigma^{2}}\right]
+4exp[n(1α)2η2min(ϵ2,4δ2)64].\displaystyle\qquad\qquad+4\exp\left[-\frac{n(1-\alpha)^{2}\eta^{2}\min\left(\epsilon^{2},4\delta^{2}\right)}{64}\right]. (7)

The reader is referred to Prashanth et al. (2020) for a proof of the result above. Other works on CVaR estimation can be found in Brown (2007); Wang and Gao (2010); Kolla et al. (2019); Thomas and Learned-Miller (2019); Kagrecha et al. (2019) and Prashanth et al. (2020).

3.4 Spectral risk measure (SRM)

Given a risk spectrum ϕ:[0,1][0,)\phi:[0,1]\rightarrow[0,\infty), the SRM MϕM_{\phi} associated with ϕ\phi is defined by (Acerbi, 2002)

Mϕ(X):=01ϕ(β)FX1(β)dβ.\displaystyle M_{\phi}(X):=\int_{0}^{1}\phi(\beta)F_{X}^{-1}(\beta)\,\mathrm{d}\beta. (8)

MϕM_{\phi} is a coherent risk measure if the risk spectrum is increasing and integrates to 1. Further, MϕM_{\phi} generalizes CVaR, since setting ϕ(β)=(1α)1𝕀{βα}\phi(\beta)=(1-\alpha)^{-1}\mathbb{I}\left\{\beta\geq\alpha\right\} leads to Mϕ=CVaRα(X)M_{\phi}=\mathrm{CVaR}_{\alpha}(X).

The result below shows that an SRM is a Lipschitz risk measure if the risk spectrum ϕ\phi is bounded.

Lemma 4.

Suppose ϕ(u)K\phi(u)\leq K for all u[0,1]u\in[0,1], and let XX and YY be r.v.s with CDFs FXF_{X} and FYF_{Y}, respectively. Then

|Mϕ(X)Mϕ(Y)|KW1(FX,FY).|M_{\phi}(X)-M_{\phi}(Y)|\leq K\ W_{1}(F_{X},F_{Y}). (9)

Specializing the estimator ρn=ρ(Fn)\rho_{n}=\rho(F_{n}) for SRM leads to the following estimator:

mn,ϕ=01ϕ(β)Fn1(β)dβ.m_{n,\phi}=\int_{0}^{1}\phi(\beta)F_{n}^{-1}(\beta)\,\mathrm{d}\beta. (10)
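The estimator (10) can be approximated by discretizing the integral over the quantile levels; the sketch below (ours) uses a midpoint rule, whereas an exact computation would integrate φ in closed form over each quantile interval:

```python
def empirical_srm(samples, phi, grid=10_000):
    """Plug-in SRM estimate of (10): integrate phi(beta) * F_n^{-1}(beta)
    over (0, 1), here approximated with a midpoint rule on a fine grid."""
    xs = sorted(samples)
    n = len(xs)
    total = 0.0
    for k in range(grid):
        beta = (k + 0.5) / grid
        q = xs[min(int(beta * n), n - 1)]  # empirical quantile F_n^{-1}(beta)
        total += phi(beta) * q
    return total / grid
```

With the CVaR spectrum φ(β) = (1−α)⁻¹𝕀{β ≥ α} this recovers the empirical CVaR, and with φ ≡ 1 it recovers the sample mean.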

Using Lemma 4 and Theorem 1, it is straightforward to obtain the following concentration bound for SRM estimation.

Corollary 2.

Suppose that XX is a sub-Gaussian r.v. with parameter σ\sigma. Then, for any ϵ\epsilon such that 512σn<ϵK<512σn+16σ𝖾\frac{512\sigma}{\sqrt{n}}<\frac{\epsilon}{K}<\frac{512\sigma}{\sqrt{n}}+16\sigma\sqrt{\mathsf{e}} and n1n\geq 1, we have

(|mn,ϕMϕ(X)|>ϵ)\displaystyle\mathbb{P}\left(\left|m_{n,\phi}\!-\!M_{\phi}(X)\right|>\epsilon\right)
exp(n256σ2𝖾(ϵK512σn)2).\displaystyle\leq\exp\left(-\frac{n}{256\sigma^{2}\mathsf{e}}\left(\frac{\epsilon}{K}-\frac{512\sigma}{\sqrt{n}}\right)^{2}\right).

Since CVaR and spectral risk measures are weighted averages of the quantiles of the underlying distribution, a natural alternative to a Wasserstein distance-based approach is to employ concentration results for quantiles; cf. Prashanth et al. (2020); Pandey et al. (2021). While such an approach can provide bounds with better constants, the resulting bounds also involve distribution-dependent quantities. In this survey, we have chosen the Wasserstein distance-based approach since it provides a unified bound for an abstract Lipschitz risk measure, and this bound can easily be specialized to handle CVaR, SRM and other risk measures. Second, a template confidence-bound-type bandit algorithm can be derived from the unified concentration bound in Theorem 1.

3.5 Utility-based shortfall risk (UBSR)

In financial applications, one would be interested in bounding a high loss that occurs with a given low probability. This amounts to bounding the VaR of the loss distribution at a pre-specified level α\alpha. VaR is not a preferred risk measure because it violates sub-additivity, which implies that diversification could increase risk. In Artzner et al. (1999), the authors identified certain desirable properties for a risk measure, namely, monotonicity, sub-additivity, homogeneity, and translational invariance. A risk measure satisfying these properties is said to be coherent. CVaR is a risk measure that is closely related to VaR, and is also coherent. Convex risk measures (Föllmer and Schied, 2002) are a further generalization beyond coherence, because sub-additivity and homogeneity imply convexity. A popular convex risk measure is UBSR, which we introduce below.

For a r.v. XX, the UBSR Sα(X)S_{\alpha}(X) is defined as follows:

Sα(X):=inf{ξ:𝔼[l(Xξ)]α},\displaystyle S_{\alpha}(X):=\inf\left\{\xi\in\mathbb{R}:\mathbb{E}\left[l(X-\xi)\right]\leq\alpha\right\}, (11)

where l:l:\mathbb{R}\rightarrow\mathbb{R} is a utility function. For the purpose of showing that UBSR is a Lipschitz risk measure, we require that ll is Lipschitz. While our results below require no additional assumptions on ll, we point out that a convexity assumption on ll leads to a useful dual representation (Föllmer and Schied, 2002). Further, if the utility is increasing, then the estimation of UBSR from i.i.d. samples can be performed in a computationally efficient manner; see Hu and Zhang (2018).

The following lemma shows that UBSR is a Lipschitz risk measure in the sense of Definition 5.

Lemma 5.

Suppose the utility function ll is non-decreasing, and there exist K,k>0K,k>0 such that ll is KK-Lipschitz and satisfies l(x2)l(x1)+k(x2x1)l(x_{2})\geq l(x_{1})+k(x_{2}-x_{1}) for every x1,x2x_{1},x_{2}\in\mathbb{R} satisfying x2x1x_{2}\geq x_{1}. Let XX and YY be r.v.s with CDFs FXF_{X} and FYF_{Y}, respectively. Then

|Sα(X)Sα(Y)|KkW1(FX,FY).\displaystyle|S_{\alpha}(X)-S_{\alpha}(Y)|\leq\frac{K}{k}W_{1}(F_{X},F_{Y}). (12)

To estimate the UBSR Sα(X)S_{\alpha}(X) from nn i.i.d. samples X1,,XnX_{1},\ldots,X_{n}, we use the following sample-average approximation to the optimization problem in (11) (also see Hu and Zhang (2018)):

minξξ subject to 1ni=1nl(Xiξ)α.\displaystyle\min_{\xi\in\mathbb{R}}\;\xi\quad\textrm{ subject to }\quad\frac{1}{n}\sum_{i=1}^{n}l(X_{i}-\xi)\leq\alpha. (13)

As in the case of CVaR and SRM, the UBSR estimator provided above is a special case of the general estimate given by ρn:=ρ(Fn)\rho_{n}:=\rho(F_{n}). Further, the estimation problem in (13) can be solved in a computationally efficient fashion using a bisection method; see Hu and Zhang (2018).
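The bisection idea can be sketched as follows; this implementation is ours and assumes the utility is non-decreasing (so the constraint in (13) is monotone in ξ) and that the solution lies in the bracket [lo, hi]:

```python
def ubsr_estimate(samples, l, alpha, lo=-1e6, hi=1e6, iters=100):
    """Bisection for the sample-average UBSR problem (13): the smallest
    xi with (1/n) * sum_i l(X_i - xi) <= alpha.  Assumes l is
    non-decreasing, so feasibility is monotone in xi."""
    n = len(samples)
    def slack(xi):
        return sum(l(x - xi) for x in samples) / n - alpha
    for _ in range(iters):
        mid = (lo + hi) / 2
        if slack(mid) <= 0:
            hi = mid  # feasible: try a smaller xi
        else:
            lo = mid
    return hi
```

As a sanity check, with the linear utility l(x) = x the constraint reads mean(X) − ξ ≤ α, so the estimate converges to the sample mean minus α.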

Using Lemma 5 and Theorem 1, we obtain the following concentration bound.

Corollary 3.

Suppose that XX is a sub-Gaussian r.v. XX with parameter σ\sigma. Then, for any ϵ\epsilon such that 512σn<ϵkK<512σn+16σ𝖾\frac{512\sigma}{\sqrt{n}}<\frac{\epsilon k}{K}<\frac{512\sigma}{\sqrt{n}}+16\sigma\sqrt{\mathsf{e}} and n1n\geq 1, we have

(|ξn,αSα(X)|>ϵ)\displaystyle\mathbb{P}\left(\left|\xi_{n,\alpha}-S_{\alpha}(X)\right|>\epsilon\right)
exp(n256σ2𝖾(kϵK512σn)2).\displaystyle\leq\exp\left(-\frac{n}{256\sigma^{2}\mathsf{e}}\left(\frac{k\epsilon}{K}-\frac{512\sigma}{\sqrt{n}}\right)^{2}\right).

Next, we present an example of a risk measure that is not Lipschitz. Consider a risk measure ρ(F)=0w(1F(x))dx\rho(F)=\int_{0}^{\infty}w(1-F(x))\,\mathrm{d}x, where ww is a weight function. Letting w(p)=pw(p)=\sqrt{p}, we claim that there exists no norm \|\cdot\| on the space of distribution functions such that the following condition holds for all distribution functions F1,F2F_{1},F_{2} of non-negative r.v.s:

|ρ(F1)ρ(F2)|LF1F2.\displaystyle\left|\rho(F_{1})-\rho(F_{2})\right|\leq L\|F_{1}-F_{2}\|. (14)

Consider two Bernoulli distributions F1F_{1} and F2F_{2}, with parameters p1p_{1} and p2p_{2}, respectively. It is easy to see that

ρ(F1)=w(p1) and ρ(F2)=w(p2).\rho(F_{1})=w(p_{1})\;\textrm{ and }\;\rho(F_{2})=w(p_{2}).

Setting p1=4ϵp_{1}=4\epsilon and p2=ϵp_{2}=\epsilon, for some ϵ>0\epsilon>0, we claim that the condition in (14) does not hold for F1F_{1} and F2F_{2}. To see this, observe that the left-hand side of (14) is 4ϵϵ=ϵ\sqrt{4\epsilon}-\sqrt{\epsilon}=\sqrt{\epsilon}. On the other hand, (F1F2)(x)=ϵ𝟙[0,1)(x)(F_{1}-F_{2})(x)=\epsilon\mathbbm{1}_{[0,1)}(x). By the homogeneity property of norms, F1F2\|F_{1}-F_{2}\| must scale linearly with ϵ\epsilon, and hence the inequality in (14) cannot hold for sufficiently small ϵ\epsilon. Thus this risk measure ρ()\rho(\cdot) violates the Lipschitz condition in Definition 5.
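This failure of (14) can also be seen numerically; the small script below (ours) evaluates ρ for the two Bernoulli distributions and shows the ratio between the two sides of (14) blowing up as ε → 0:

```python
import math

def rho_bernoulli(p):
    """rho(F) = int_0^inf sqrt(1 - F(x)) dx for a Bernoulli(p) CDF:
    1 - F(x) equals p on [0, 1) and 0 beyond, so rho(F) = sqrt(p)."""
    return math.sqrt(p)

for eps in [1e-2, 1e-4, 1e-6]:
    lhs = rho_bernoulli(4 * eps) - rho_bernoulli(eps)  # = sqrt(eps)
    # any norm of F1 - F2 = eps * 1_{[0,1)} scales linearly in eps,
    # so the ratio lhs / eps = 1 / sqrt(eps) is unbounded
    ratio = lhs / eps
```

For ε = 10⁻⁶ the ratio is already 1000, so no finite Lipschitz constant L can work.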

While the above measure may appear contrived, it is not. The ρ()\rho(\cdot) defined above is closely related to a risk measure based on cumulative prospect theory, where one employs a weight function to distort cumulative probabilities. We describe such a risk measure in the next section.

3.6 Risk measures based on cumulative prospect theory (CPT)

We now present a risk measure based on CPT, which does not satisfy the Lipschitz condition in Definition 5. The CPT-value of an r.v. XX is defined as (Prashanth et al., 2016)

𝒞(X)\displaystyle\mathcal{C}(X) :=0w+((u+(X)>z))dz\displaystyle:=\int_{0}^{\infty}w^{+}\left(\mathbb{P}\left(u^{+}(X)>z\right)\right)\,\mathrm{d}z
0w((u(X)>z))dz,\displaystyle\qquad-\int_{0}^{\infty}w^{-}\left(\mathbb{P}\left(u^{-}(X)>z\right)\right)\,\mathrm{d}z, (15)

where u+,u:+u^{+},u^{-}:\mathbb{R}\rightarrow\mathbb{R}_{+} are continuous utility functions, with u+(x)=0u^{+}(x)=0 when x0x\leq 0 and increasing otherwise, and with u(x)=0u^{-}(x)=0 when x0x\geq 0 and decreasing otherwise. The weight functions w+,w:[0,1][0,1]w^{+},w^{-}:[0,1]\rightarrow[0,1] are assumed to be continuous, non-decreasing, and to satisfy w+(0)=w(0)=0w^{+}(0)=w^{-}(0)=0 and w+(1)=w(1)=1w^{+}(1)=w^{-}(1)=1.

Given a set of i.i.d. samples {Xi}i=1n\{X_{i}\}_{i=1}^{n}, we form the EDFs Fn+(x)F_{n}^{+}(x) and Fn(x)F_{n}^{-}(x) of the r.v.s u+(X)u^{+}(X) and u(X)u^{-}(X), respectively. Using these EDFs, we estimate CPT-value as follows:

𝒞¯n=0w+(1Fn+(x))dx0w(1Fn(x))dx,\displaystyle\overline{\mathcal{C}}_{n}\!=\!\int_{0}^{\infty}\!\!\!\!w^{+}\left(1\!-\!F_{n}^{+}\left(x\right)\right)\,\mathrm{d}x\!-\!\int_{0}^{\infty}\!\!\!\!w^{-}\left(1\!-\!F_{n}^{-}\left(x\right)\right)\mathrm{d}x, (16)

The integrals above can be computed using the order statistics; see Jie et al. (2018). More precisely, let X[1]X[2]X[n]X_{[1]}\leq X_{[2]}\leq\ldots\leq X_{[n]} denote the order statistics. Then, the first and second integrals in (16), denoted by 𝒞¯n+\overline{\mathcal{C}}_{n}^{+} and 𝒞¯n\overline{\mathcal{C}}_{n}^{-}, respectively, are estimated as follows:

𝒞¯n+\displaystyle\overline{\mathcal{C}}_{n}^{+} :=i=1nu+(X[i])(w+(n+1in)w+(nin)),\displaystyle:=\sum_{i=1}^{n}u^{+}(X_{[i]})\bigg{(}w^{+}\!\Big{(}\frac{n+1-i}{n}\Big{)}\!-\!w^{+}\!\Big{(}\frac{n-i}{n}\Big{)}\bigg{)},
𝒞¯n\displaystyle\overline{\mathcal{C}}_{n}^{-} :=i=1nu(X[i])(w(in)w(i1n)).\displaystyle:=\sum_{i=1}^{n}u^{-}(X_{[i]})\bigg{(}w^{-}\Big{(}\frac{i}{n}\Big{)}-w^{-}\Big{(}\frac{i-1}{n}\Big{)}\bigg{)}.
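The two sums translate directly into code; the sketch below is ours, with user-supplied utility and weight functions:

```python
def cpt_estimate(samples, u_plus, u_minus, w_plus, w_minus):
    """CPT-value estimate from the order statistics (increasing order),
    implementing the two sums for C_n^+ and C_n^-."""
    xs = sorted(samples)
    n = len(xs)
    pos = sum(u_plus(xs[i - 1]) * (w_plus((n + 1 - i) / n) - w_plus((n - i) / n))
              for i in range(1, n + 1))
    neg = sum(u_minus(xs[i - 1]) * (w_minus(i / n) - w_minus((i - 1) / n))
              for i in range(1, n + 1))
    return pos - neg
```

As a sanity check, with identity weights and u⁺(x) = (x)⁺, u⁻(x) = (−x)⁺, the weight differences telescope to 1/n and the estimate reduces to the sample mean.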

We now present a concentration result for CPT-value estimation assuming bounded support for the underlying distribution; see Prashanth et al. (2016); Jie et al. (2018) for the proofs. We start by stating some assumptions.

(A1) The weight functions w±w^{\pm} are Hölder continuous with common order α\alpha and constant HH, i.e.,

supx,y[0,1]:xy|w±(x)w±(y)||xy|αH;\sup_{x,y\in[0,1]:x\neq y}\frac{|w^{\pm}(x)-w^{\pm}(y)|}{|x-y|^{\alpha}}\leq H;

(A2) The utility functions u+,u:+u^{+},u^{-}:\mathbb{R}\rightarrow\mathbb{R}_{+} are bounded above by MM.

Proposition 1.

Under (A1) and (A2), for all ϵ>0\epsilon>0,

(|𝒞¯n𝒞(X)|ϵ)2exp(2n(ϵHM)2α).\displaystyle\mathbb{P}\left(\left|\overline{\mathcal{C}}_{n}\!-\!\mathcal{C}(X)\right|\!\geq\!\epsilon\right)\leq 2\exp\bigg{(}\!-2n\Big{(}\frac{\epsilon}{HM}\Big{)}^{\frac{2}{\alpha}}\bigg{)}. (17)

The proof is based on the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality, which establishes concentration of the EDF around the true distribution. For the sub-Gaussian case, one can use a truncated CPT-value estimator (Bhat and Prashanth, 2019) and establish an exponentially decaying tail bound. For a given truncation level τn\tau_{n} (to be specified later), let (Fn|τn)+\left(F_{n}|_{\tau_{n}}\right)^{+} and (Fn|τn)\left(F_{n}|_{\tau_{n}}\right)^{-} denote the truncated EDFs formed from the samples {u+(Xi)}i=1n\{u^{+}(X_{i})\}_{i=1}^{n} and {u(Xi)}i=1n\{u^{-}(X_{i})\}_{i=1}^{n}, respectively. Using the truncated EDFs, the CPT-value is estimated as follows:

Cn=\displaystyle C_{n}= 0τnw+(1(Fn|τn)+(x))dx\displaystyle\int_{0}^{\tau_{n}}w^{+}(1-\left(F_{n}|_{\tau_{n}}\right)^{+}(x))\,\mathrm{d}x
0τnw(1(Fn|τn)(x))dx.\displaystyle\quad-\int_{0}^{\tau_{n}}w^{-}(1-\left(F_{n}|_{\tau_{n}}\right)^{-}(x))\,\mathrm{d}x. (18)

In comparison to (16), we have used complementary (truncated) EDFs in the estimator defined above.

(A2’) The utility functions u+,u:+u^{+},u^{-}:\mathbb{R}\rightarrow\mathbb{R}_{+} are sub-Gaussian with parameter σ2\sigma^{2}.

Proposition 2.

Assume (A1) and (A2’). Set τn=(σlogn+1)\tau_{n}=\left(\sigma\sqrt{\log n}+1\right). For all n1n\geq 1, form the CPT-value estimate CnC_{n} using (18). Then, we have the following bound for ϵ>8Hαnα/2\epsilon>\frac{8H}{\alpha n^{\alpha/2}}:

(|CnC(X)|>ϵ)\displaystyle\mathbb{P}\left(\left|{C}_{n}-C(X)\right|>\epsilon\right)
2exp(2n(2H2logn)1α(ϵ8Hαnα/2)2α).\displaystyle\leq 2\exp\left(-2n\left(\frac{2}{H^{2}\log n}\right)^{\frac{1}{\alpha}}\left(\epsilon-\frac{8H}{\alpha n^{\alpha/2}}\right)^{\frac{2}{\alpha}}\right).

4 Regret minimization

We now discuss how the above-mentioned concentration bounds are used in the context of regret minimization in the theory of MABs. A bandit instance ν\nu is a collection of KK distributions (ν1,ν2,,νK)(\nu_{1},\nu_{2},\ldots,\nu_{K}). Each distribution νi\nu_{i} has mean μi\mu_{i}\in\mathbb{R}, and without loss of generality, one may assume that μ1μ2μK\mu_{1}\geq\mu_{2}\geq\ldots\geq\mu_{K}. The agent interacts with the bandit environment at each time by pulling an arm At{1,,K}A_{t}\in\{1,\ldots,K\} and observing a reward Xt,AtX_{t,A_{t}} drawn from νAt\nu_{A_{t}}. The decision of which arm to pull depends on the previous arm pulls and rewards t1=(A1,X1,A1,,At1,Xt1,At1)\mathcal{H}_{t-1}=(A_{1},X_{1,A_{1}},\ldots,A_{t-1},X_{t-1,A_{t-1}}). Over a fixed time period (or horizon) of length nn, the agent desires to maximize his/her total reward t=1nXt,At\sum_{t=1}^{n}X_{t,A_{t}} by designing an appropriate policy π={πt}t=1n\pi=\{\pi_{t}\}_{t=1}^{n}, where πt\pi_{t} is σ(t1)\sigma(\mathcal{H}_{t-1})-measurable. An equivalent measure is the regret

Rn(ν,π):=𝔼[nμ1t=1nμAt],R_{n}(\nu,\pi):=\mathbb{E}\bigg{[}n\mu_{1}-\sum_{t=1}^{n}\mu_{A_{t}}\bigg{]},

which is to be minimized over π\pi. However, this measure pays no attention to risk, which is the central theme of this survey. More generally, the agent would like to minimize the ρ\rho-regret

Rnρ(ν,π):=𝔼[nmax1iKρ(νi)t=1nρ(νAt)],R_{n}^{\rho}(\nu,\pi):=\mathbb{E}\bigg{[}n\max_{1\leq i\leq K}\rho(\nu_{i})-\sum_{t=1}^{n}\rho(\nu_{A_{t}})\bigg{]},

where ρ(ν)\rho(\nu) is an appropriate risk measure. Let Δi=ρ(νi)ρ(νi)\Delta_{i}=\rho(\nu_{i^{*}})-\rho(\nu_{i}) be the gap between the risk of the best arm ii^{*} and that of the ii-th arm.

4.1 Confidence bound-based algorithms

We present Risk-LCB, an adaptation of the well-known UCB algorithm (Auer et al., 2002), to handle an objective based on an abstract risk measure ρ\rho. The algorithm caters to Lipschitz risk measures (see Definition 5) and arms’ distributions that are sub-Gaussian. Note that here we are minimizing the risk measure instead of maximizing the reward.

To obtain the confidence widths, we first rewrite the concentration bound in Theorem 1 as follows: For any δ[exp(n),1]\delta\in[\exp(-n),1], with probability 1δ\geq 1-\delta,

|ρmρ(X)|Lσ(256𝖾log(1δ)+512)m,|\rho_{m}-\rho(X)|\leq\frac{L\sigma\left(\sqrt{256\mathsf{e}\log(\frac{1}{\delta})}+512\right)}{\sqrt{m}},

where ρm=ρ(Fm)\rho_{m}=\rho(F_{m}) is an mm-sample estimate of the risk measure ρ(X)\rho(X). The range of δ\delta is restricted to [exp(n),1][\exp(-n),1] due to the following constraint on ϵ\epsilon in Theorem 1:

512σn<ϵL<512σn+16σ𝖾.\displaystyle\frac{512\sigma}{\sqrt{n}}<\frac{\epsilon}{L}<\frac{512\sigma}{\sqrt{n}}+16\sigma\sqrt{\mathsf{e}}. (19)

For the classic bandit setup with an expected-value objective, the finite-sample analysis of Auer et al. (2002) sets δ=t4\delta=t^{-4} to obtain a sub-linear regret guarantee. In our setting, we set δ=8t4\delta=\frac{8}{t^{4}}, as this choice satisfies the constraint coming from (19), in turn facilitating the application of the high-confidence guarantee on ρm\rho_{m} given above. Under this choice of δ\delta, in any round tt of Risk-LCB and for any arm i{1,,K}i\in\{1,\ldots,K\}, we have

(ρ(νi)[ρi,Ti(t1)±wi,Ti(t1)])18t4, where\displaystyle\mathbb{P}\left(\rho(\nu_{i})\in[\rho_{i,T_{i}(t-1)}\pm w_{i,T_{i}(t-1)}]\right)\geq 1-8t^{-4},\textrm{ where}
wi,T:=Lσ(32𝖾logt+512)T.\displaystyle\qquad\quad w_{i,T}:=\frac{L\sigma\left(32\sqrt{\mathsf{e}\log t}+512\right)}{\sqrt{T}}. (20)

In the above, ρi,Ti(t1)\rho_{i,T_{i}(t-1)} is the estimate of the risk measure for arm ii computed using ρn=ρ(Fn)\rho_{n}=\rho(F_{n}) from Ti(t1)T_{i}(t-1) samples.

The Risk-LCB algorithm would play all arms once in the initialization phase, and in rounds tK+1t\geq K+1, play an arm AtA_{t} according to the following rule:

At=argmin1iK(LCBt(i):=ρi,Ti(t1)wi,Ti(t1)).A_{t}=\operatorname*{arg\,min}\limits_{1\leq i\leq K}\left(\textrm{LCB}_{t}(i):=\rho_{i,T_{i}(t-1)}-w_{i,T_{i}(t-1)}\right).
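The procedure above can be sketched in a few lines; in this illustrative implementation (ours), `arms[i]()` draws a sample from νᵢ and `rho_hat` is any plug-in risk estimate ρ(Fₙ), both placeholders we introduce, with the confidence width taken from (20):

```python
import math

def risk_lcb(arms, rho_hat, n_rounds, L=1.0, sigma=1.0):
    """Sketch of Risk-LCB: pull each arm once, then repeatedly pull the
    arm with the smallest lower confidence bound on its estimated risk."""
    K = len(arms)
    samples = [[arm()] for arm in arms]  # initialization phase: one pull each
    history = list(range(K))
    for t in range(K + 1, n_rounds + 1):
        def lcb(i):
            T = len(samples[i])
            # confidence width w_{i,T} from (20)
            w = L * sigma * (32 * math.sqrt(math.e * math.log(t)) + 512) / math.sqrt(T)
            return rho_hat(samples[i]) - w
        a = min(range(K), key=lcb)
        samples[a].append(arms[a]())
        history.append(a)
    return history
```

An arm that has been pulled few times has a wide confidence interval and hence a small LCB, which is what drives exploration.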
Theorem 2.

Consider a KK-armed stochastic bandit problem with a Lipschitz risk measure ρ\rho, where the arms’ distributions are sub-Gaussian with parameter σ2\sigma^{2}. Then, the expected regret RnρR_{n}^{\rho} of Risk-LCB satisfies the following bound:

Rnρi:Δi>0(4σ2L2[32𝖾logn+512]2Δi+28KΔi).R_{n}^{\rho}\!\leq\!\sum_{i:\Delta_{i}>0}\left(\dfrac{4\sigma^{2}L^{2}[32\sqrt{\mathsf{e}\log n}\!+\!512]^{2}}{\Delta_{i}}+28K\Delta_{i}\right).

Further, RnρR_{n}^{\rho} satisfies the following bound that does not scale inversely with the gaps:

Rnρ\displaystyle R_{n}^{\rho}
(2σL(32𝖾logn+512)+6i:Δi>0Δi)Kn.\displaystyle\leq\left(2\sigma L(32\sqrt{\mathsf{e}\log n}+512)+6\sum_{i:\Delta_{i}>0}\Delta_{i}\right)\sqrt{Kn}.

This bound mimics that of risk-neutral UCB except that the Δi\Delta_{i}’s depend on ρ\rho.

The regret bound derived in Theorem 2 is not applicable to CPT, since CPT is not a Lipschitz risk measure. However, one can derive a confidence bound-based algorithm with CPT as the risk measure using the result in Proposition 1 for the case of arms’ distributions with bounded support; see Gopalan et al. (2017). The regret bound is of the order O(i:Δi>0lognΔi2/α1)O\big{(}\sum_{i:\Delta_{i}>0}\frac{\log n}{\Delta_{i}^{{2}/{\alpha}-1}}\big{)}, where α\alpha is the Hölder exponent (see (A1) above). This bound cannot be improved in terms of its dependence on the gaps and the horizon nn; see Theorem 3 in Gopalan et al. (2017) for details. For an extension of confidence bound-based algorithms to handle sub-Gaussian arms’ distributions, see Prashanth and Bhat (2022).

4.2 Thompson sampling

Another popular class of algorithms for regret minimization is Thompson sampling (Thompson, 1933; Agrawal and Goyal, 2012; Kaufmann et al., 2012; Russo et al., 2018). Here, one starts with a prior over the set of bandit environments. In each round, as observations are obtained, the prior is updated to a posterior. The agent then samples an environment from the posterior and chooses the action that is optimal for the sampled environment.
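This prior-sample-act loop can be made concrete with a minimal risk-neutral sketch for Bernoulli arms under a uniform Beta prior (the specifics here—two arms, the payoff probabilities, the seed—are purely illustrative):

```python
import random

def thompson_step(successes, failures, rng=random):
    """One round of Thompson sampling for Bernoulli arms: draw a mean for
    each arm from its Beta posterior (uniform prior -> Beta(1+s, 1+f))
    and play the arm with the largest draw."""
    draws = [rng.betavariate(1 + s, 1 + f)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)

# Illustrative run: arm 0 pays off w.p. 0.9, arm 1 w.p. 0.1.
rng = random.Random(0)
s, f = [0, 0], [0, 0]
for _ in range(500):
    a = thompson_step(s, f, rng)
    reward = rng.random() < (0.9 if a == 0 else 0.1)
    (s if reward else f)[a] += 1
```

Over the 500 rounds the posterior concentrates and the better arm is played far more often; the risk-averse variants discussed in this subsection keep this structure but replace the Beta/Bernoulli model with posteriors suited to the risk measure.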

The first attempt at using Thompson sampling for risk-averse bandits was by Zhu and Tan (2020), in which the authors applied Thompson sampling to mean-variance bandits with Gaussian and Bernoulli arms. This work is the Thompson sampling analogue of the UCB-based algorithms for the mean-variance by Sani et al. (2012) and Vakili and Zhao (2015). For the Gaussian case, Zhu and Tan (2020) sample the precision (inverse variance) $\tau_{i,t}$ of arm $i$ at time $t$ from a Gamma distribution with parameters $\alpha_{i,t}$ and $\beta_{i,t}$. Subsequently, the mean of the Gaussian $\theta_{i,t}$ is sampled from a Gaussian whose mean is the empirical mean and whose variance is the inverse of the number of times arm $i$ has been played. The Gamma and Gaussian priors are chosen as they are conjugate to the precision and mean of a Gaussian likelihood, respectively. This algorithm, termed MVTS (for mean-variance Thompson sampling), has expected regret satisfying

$$\limsup_{n\to\infty}\frac{R_{n}^{\rho}}{\log n}\leq\sum_{i=2}^{K}\max\bigg\{\frac{2}{\Gamma_{1,i}^{2}},\frac{1}{h(\sigma_{i}^{2}/\sigma_{1}^{2})}\bigg\}\big(\Delta_{i}+2\overline{\Gamma}_{i}^{2}\big),\tag{21}$$

where $\Gamma_{i,j}:=\mu_{i}-\mu_{j}$ is the gap between the means of arms $i$ and $j$, $\overline{\Gamma}_{i}^{2}:=\max_{j=1,\ldots,K}(\mu_{j}-\mu_{i})^{2}$, $\Delta_{i}:=\mathrm{MV}_{i^{*}}-\mathrm{MV}_{i}$ is the gap between the mean-variance of arm $i$ and that of the arm $i^{*}$ with the best mean-variance, and $h(x):=\frac{1}{2}(x-1-\log x)$. This bound was shown to be order-optimal in the sense of meeting a problem-dependent information-theoretic lower bound as the risk tolerance $\gamma$ tends to either $0$ or $\infty$.
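A single posterior draw of the MVTS kind can be sketched as follows. The posterior parameters $\alpha_{i,t}$, $\beta_{i,t}$ are taken as inputs rather than derived, and the sampled mean-variance expression (and its sign convention) is an assumption for illustration, not the authors' exact formula:

```python
import random

def mvts_sample_mv(alpha, beta, emp_mean, n_plays, gamma, rng=random):
    """One posterior draw for a Gaussian arm, MVTS-style (a sketch):
    precision tau ~ Gamma(shape=alpha, rate=beta), then a mean
    theta ~ Normal(emp_mean, 1/n_plays). Returns a sampled mean-variance
    gamma * theta - 1/tau; sign conventions for the mean-variance differ
    across papers, so treat this expression as illustrative."""
    tau = rng.gammavariate(alpha, 1.0 / beta)  # Python takes scale = 1/rate
    theta = rng.gauss(emp_mean, (1.0 / n_plays) ** 0.5)
    return gamma * theta - 1.0 / tau
```

In a full MVTS loop, one such draw is made per arm per round and the arm with the best sampled mean-variance is played.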

Baudry et al. (2021) considered an MAB problem in which the quality of an arm is measured by the CVaR of its reward distribution. They proposed B-CVTS and M-CVTS for continuous bounded and multinomial rewards, respectively. Leveraging a novel analysis by Riou and Honda (2020), the authors bounded the regret of both algorithms and showed that they are asymptotically optimal. This improves on Zhu and Tan (2020), albeit for a different risk measure, since the level $\alpha$ does not have to tend to an extremal value for asymptotic optimality to hold. Generalizations of and guarantees for Thompson sampling algorithms for other risk measures, such as the empirical distribution performance measures (EDPMs) and distorted risk functionals (DRFs) proposed in Cassel et al. (2018) and Huang et al. (2021), respectively, are discussed in Chang and Tan (2022).

5 Pure exploration

We now review existing work on risk-aware pure exploration in MABs. Risk-aware pure exploration problems typically involve finding the best arm (or the best few arms), where the merit of an arm is defined to accommodate a risk measure or a constraint. Pure exploration problems have been studied in two settings, namely, fixed budget and fixed confidence.

5.1 Fixed budget algorithms

In the fixed budget setting, the objective is to find the best arm (or the best few arms) using a fixed number of arm plays, so as to minimize the probability of misidentification. To be more precise, let $n$ be the fixed "budget", i.e., the total number of arm plays allowed. In the notation of the previous section, let $\rho(\nu_{k})$ be the risk measure of interest associated with arm $k$, and let $k^{*}$ be the optimal arm (often assumed to be unique for simplicity). A given arm selection policy $\pi=\{\pi_{t}\}_{t=1}^{n}$ samples the arms as before and, at the end of the budget, identifies an arm $k_{n}^{\pi}$ as the optimal arm. The merit of the arm selection algorithm is captured by the probability of misidentifying the best arm, which is simply $\mathbb{P}(k_{n}^{\pi}\neq k^{*})$.

The recent paper by Zhang and Ong (2021) considers pure exploration in a fixed budget setting, where the risk measure is the VaR (or quantile) at a particular level. Specifically, given a quantile level $\alpha$, the objective is to identify a set of $m$ arms with the highest $\textrm{VaR}_{\alpha}$ values within a fixed budget $n$. The authors obtain the empirical VaR estimate using the order statistic $X_{(\lfloor n(1-\alpha)\rfloor)}$, as explained in Section 3.3. They then derive exponential concentration bounds for the $\textrm{VaR}_{\alpha}$ estimate under the assumptions that the underlying random variables have an increasing hazard rate and a continuously differentiable probability density function. A similar VaR concentration result is derived in Prop. 2 of Kolla et al. (2019) under milder assumptions; nevertheless, VaR concentration results necessarily involve constants that depend on the slope of the distribution in the neighbourhood of the $\alpha$-quantile.
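The order-statistic estimator takes only a couple of lines. The indexing convention below (the $\lceil n\alpha\rceil$-th smallest sample as the empirical $\alpha$-quantile) is one of several that agree asymptotically with the $X_{(\lfloor n(1-\alpha)\rfloor)}$ form above, so treat the exact index as illustrative:

```python
import math

def empirical_var(samples, alpha):
    """Empirical VaR (quantile) at level alpha via an order statistic:
    sort the samples in ascending order and return the
    ceil(n * alpha)-th smallest value. Index conventions differ slightly
    across papers; this one is illustrative."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(n * alpha))  # index of the empirical alpha-quantile
    return xs[k - 1]
```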

The algorithm proposed in Zhang and Ong (2021), called Q-SAR (Quantile Successive Accept-Reject), adapts the well-known algorithm of Bubeck et al. (2013) to the risk-aware setting. Q-SAR first divides the time budget $n$ into $M-1$ phases, where the number of samples drawn for each arm in each phase is chosen exactly as in Bubeck et al. (2013) for the risk-neutral setting. At each phase $p\in\{1,2,\dots,M-1\}$, the algorithm maintains two sets of arms, namely the active set, which contains all arms that are actively drawn in phase $p$, and the accepted set, which contains arms that have been accepted. During each phase, based on the empirical VaR estimates, an arm is removed from the active set and is either accepted or rejected. If it is accepted, it is added to the accepted set. When the time budget ends, only one arm remains in the active set, which, together with the accepted set, forms the recommended set of best arms. In Theorem 4 of Zhang and Ong (2021), the Q-SAR algorithm is shown to have a probability of error that decays exponentially in the budget $n$, using the VaR concentration bound.
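For intuition, the risk-neutral phase schedule of Audibert et al. (2010), which the successive accept-reject family follows in spirit, can be computed as below; the exact constants are reproduced from memory, so treat them as an assumption rather than a verified transcription.

```python
import math

def sr_phase_budgets(n, K):
    """Per-arm cumulative sample counts n_p for the successive-rejects
    schedule of Audibert et al. (2010): with
    log_bar(K) = 1/2 + sum_{i=2}^{K} 1/i, phase p allocates
    n_p = ceil((n - K) / (log_bar(K) * (K + 1 - p)))
    total pulls to each surviving arm, for p = 1, ..., K - 1."""
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    return [math.ceil((n - K) / (log_bar * (K + 1 - p)))
            for p in range(1, K)]
```

The schedule is increasing: later phases, with fewer surviving arms, give each of them more samples, which is what yields the exponentially small misidentification probability.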

Other papers on risk-aware pure exploration with a fixed budget include Kagrecha et al. (2019) and Prashanth et al. (2020), which consider CVaR as the risk measure. In Prashanth et al. (2020), the authors derive concentration bounds for CVaR estimates, considering separately sub-Gaussian, light-tailed (or sub-exponential) and heavy-tailed distributions. For the sub-Gaussian and light-tailed cases, the estimates are constructed as described in Section 3.3. For heavy-tailed random variables, they assume a bounded moment condition and derive a concentration bound for a truncation-based estimator. For all three distribution classes, the concentration bounds exhibit exponential decay in the number of samples. The authors then consider the best CVaR-arm identification problem under a fixed budget. Their algorithm, called CVaR-SR, adapts the well-known successive rejects algorithm of Audibert et al. (2010) to best CVaR-arm identification. Exponentially decaying bounds on the probability of incorrect arm identification are derived using the CVaR concentration results. We remark that the Wasserstein distance approach outlined in Section 3.2 gives a direct CVaR concentration bound for sub-Gaussian distributions, but does not seem to extend easily to distributions with heavier tails.
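The empirical CVaR estimate used in the sub-Gaussian and light-tailed cases amounts to averaging the tail beyond the empirical quantile. A minimal sketch (for a loss variable, with one common indexing convention; the index choice is illustrative) is:

```python
import math

def empirical_cvar(samples, alpha):
    """Empirical CVaR at level alpha for a loss variable: the average of
    the samples at or above the empirical alpha-quantile (one of several
    conventions that agree in the large-sample limit)."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(n * alpha))  # empirical VaR index
    tail = xs[k - 1:]                 # the worst (1 - alpha) fraction
    return sum(tail) / len(tail)
```

The truncation-based estimator for heavy tails clips each sample at a data-independent threshold before forming this average, which is what makes an exponential concentration bound possible under only a bounded moment condition.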

In Kagrecha et al. (2019), the authors consider a risk-aware best-arm identification problem, where the merit of an arm is defined as a linear combination of the expected reward and the associated CVaR. A notable feature of this paper is the distribution obliviousness of its algorithms, i.e., the algorithm (again a variant of successive rejects) does not require any information about the reward distributions, including tail or moment bounds.

We conclude this subsection by commenting on a desirable feature common to all the algorithms reviewed above. The SR (and SAR) family of algorithms does not require knowledge of the constants in the concentration bounds for its implementation. These constants, which are often distribution-dependent, are not known in practice; they appear only in the analysis and in the upper bound on the probability of error. This is in contrast with confidence-interval-based algorithms used for regret minimization, where the distribution-dependent constants from the concentration bound appear in the confidence term, necessitating knowledge of these constants in the algorithm's implementation.

5.2 Fixed confidence algorithms

In fixed confidence pure exploration, the objective is to identify the best arm, with a high degree of confidence (say with probability at least $1-\delta$), using the smallest possible number of arm plays. A related (and less stringent) objective is PAC (Probably Approximately Correct) identification, where the goal is to identify, as quickly as possible and with confidence at least $1-\delta$, the set of arms with reward values that are $\epsilon$-close to that of the best arm. More precisely, a policy in the fixed confidence setting consists of a stopping time $T$ (defined w.r.t. the filtration $\sigma(\mathcal{H}_{t-1})$ from the previous section) and an arm selection rule $\{\pi_{t}\}_{t=1}^{T}$. The objective is to ensure that the arm identified at the stopping time $T$ is the true best arm $k^{*}$ with probability at least $1-\delta$. The figure of merit is the expected sample complexity $\mathbb{E}[T]$, which should be as small as possible for a fixed $\delta$.

In Szorenyi et al. (2015), the authors consider PAC best-arm identification, where the quality of an arm is defined in terms of its VaR (or quantile) at a fixed risk level. The key technique therein is to employ a sup-norm concentration of the empirical distribution (an improved version of the DKW inequality (Massart, 1990)) to obtain a suitable VaR concentration result. The authors also provide a lower bound for the special case of the VaR at the level $0.75$. This lower bound was generalized to any level $\alpha$ in David and Shimkin (2016). Notably, David and Shimkin (2016) also propose two PAC algorithms (namely Maximal Quantile and Doubled Maximal Quantile) whose sample complexities are within logarithmic factors of the lower bound.

In David et al. (2018), the authors study a PAC problem with risk constraints. Therein, the objective is to find an arm with the highest mean reward (as in the classic best-arm identification problem), but only among those arms that satisfy a risk constraint, namely that the $\textrm{VaR}_{\alpha}(\cdot)$ of the arm is no smaller than a given $\beta$. The authors propose a confidence-interval-based algorithm and prove an upper bound on its sample complexity for sub-Gaussian arms' distributions. The upper bound has a form similar to the guarantee available for the risk-neutral PAC bandit problem (Even-Dar et al., 2006). They also show a lower bound on the sample complexity that matches the upper bound within logarithmic factors, using specific problem instances inspired by similar approaches in the risk-neutral setting (Mannor and Tsitsiklis, 2004). An improvement to the algorithm in David et al. (2018), based on the LUCB algorithm (Kalyanakrishnan et al., 2012), was recently proposed by Hou et al. (2022).

6 Future challenges

In this survey, we reviewed the state of the art in risk-aware MAB problems. This is likely just the tip of an enormous iceberg, and there are many fertile avenues for further research.

The concentration bounds presented herein that are based on relating the risk measure to the Wasserstein distance are likely to be suboptimal in terms of constants. Tightening these bounds would have impact and utility beyond the study of MABs.

One can explore methods other than the empirical estimators for various risk measures. For example, the potential of using importance sampling for estimating tail-based risk measures such as the CVaR has hitherto not been explored adequately.

If improved concentration bounds can be derived, we can then apply them to obtain improved, problem-dependent bounds in both the regret minimization and best-arm identification settings. These would also help ascertain the asymptotic optimality of various algorithms by comparing the achievable bounds to the information-theoretic limits.

From Section 5, we notice that there is generally much less work on pure exploration with risk, and only VaR and CVaR have been explored. A unification of the algorithm design and accompanying guarantees for various risk measures would be welcome.

Finally, from a philosophical perspective, it is natural to wonder which risk measure to use when faced with a certain application scenario, such as portfolio optimization or clinical trials. A deeper, quantitative understanding of various risk measures and the ensuing implications for the scenario at hand requires more inter-disciplinary research. Even if one is content with restricting oneself to, say, the mean-variance or CVaR, there still remains the issue of choosing an appropriate risk tolerance parameter $\gamma$ or quantile level $\alpha$.

Acknowledgements

VT is supported by a Singapore National Research Foundation (NRF) Fellowship (A-0005077-00-00) and Singapore Ministry of Education AcRF Tier 1 grants (A-0009042-00-00, A-8000189-00-00, and A-8000196-00-00).

References

  • Acerbi [2002] C. Acerbi. Spectral measures of risk: A coherent representation of subjective risk aversion. Journal of Banking & Finance, 26(7):1505–1518, 2002.
  • Agrawal and Goyal [2012] S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.
  • Artzner et al. [1999] P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.
  • Audibert et al. [2010] J.-Y. Audibert, S. Bubeck, and R. Munos. Best arm identification in multi-armed bandits. In Proceedings of the Conference on Learning Theory, pages 41–53. Citeseer, 2010.
  • Auer et al. [2002] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
  • Baudry et al. [2021] D. Baudry, R. Gautron, E. Kaufmann, and O. Maillard. Optimal Thompson sampling strategies for support-aware CVaR bandits. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 716–726. PMLR, Jul 2021.
  • Bhat and Prashanth [2019] S. P. Bhat and L. A. Prashanth. Concentration of risk measures: A Wasserstein distance approach. In Advances in Neural Information Processing Systems, pages 11739–11748, 2019.
  • Brown [2007] D. B. Brown. Large deviations bounds for estimating conditional value-at-risk. Operations Research Letters, 35(6):722–730, 2007.
  • Bubeck et al. [2013] S. Bubeck, T. Wang, and N. Viswanathan. Multiple identifications in multi-armed bandits. In International Conference on Machine Learning, pages 258–265. PMLR, 2013.
  • Cassel et al. [2018] A. Cassel, S. Mannor, and A. Zeevi. A general approach to multi-armed bandits under risk criteria. In Proceedings of the 31st Conference On Learning Theory, pages 1295–1306, 2018.
  • Chang and Tan [2022] J. Q. L. Chang and V. Y. F. Tan. A Unifying Theory of Thompson Sampling for Continuous Risk-Averse Bandits. In Proc. of the 36th AAAI Conference on Artificial Intelligence. AAAI Press, 2022.
  • David and Shimkin [2016] Y. David and N. Shimkin. Pure exploration for max-quantile bandits. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 556–571. Springer, 2016.
  • David et al. [2018] Y. David, B. Szörényi, M. Ghavamzadeh, S. Mannor, and N. Shimkin. PAC bandits with risk constraints. In ISAIM, 2018.
  • Even-Dar et al. [2006] E. Even-Dar, S. Mannor, Y. Mansour, and S. Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(39):1079–1105, 2006.
  • Föllmer and Schied [2002] H. Föllmer and A. Schied. Convex measures of risk and trading constraints. Finance and Stochastics, 6(4):429–447, 2002.
  • Gopalan et al. [2017] A. Gopalan, L. A. Prashanth, M. C. Fu, and S. I. Marcus. Weighted bandits or: How bandits learn distorted values that are not expected. In AAAI Conference on Artificial Intelligence, pages 1941–1947, 2017.
  • Hou et al. [2022] Y. Hou, V. Y. F. Tan, and Z. Zhong. Almost optimal variance-constrained best arm identification, 2022. arXiv 2201.10142.
  • Hu and Zhang [2018] Z. Hu and D. Zhang. Utility-based shortfall risk: Efficient computations via monte carlo. Naval Research Logistics (NRL), 65(5):378–392, 2018.
  • Huang et al. [2021] A. Huang, L. Liu, Z. C. Lipton, and K. Azizzadenesheli. Off-policy risk assessment in contextual bandits. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  • Jie et al. [2018] C. Jie, L. A. Prashanth, M. C. Fu, S. I. Marcus, and C. Szepesvári. Stochastic optimization in a cumulative prospect theory framework. IEEE Transactions on Automatic Control, 63(9):2867–2882, 2018.
  • Kagrecha et al. [2019] A. Kagrecha, J. Nair, and K. Jagannathan. Distribution oblivious, risk-aware algorithms for multi-armed bandits with unbounded rewards. In Advances in Neural Information Processing Systems, pages 11269–11278, 2019.
  • Kalyanakrishnan et al. [2012] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the 29th International Conference on Machine Learning, pages 227–234. PMLR, 2012.
  • Kaufmann et al. [2012] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptotically optimal finite time analysis. In Proceedings of the International Conference on Algorithmic Learning Theory, volume 7568, pages 199–213, 2012.
  • Knight [1921] F. H. Knight. Risk, Uncertainty and Profit. Houghton Mifflin Company, New York, 1921.
  • Kolla et al. [2019] R. K. Kolla, L. A. Prashanth, S. P. Bhat, and K. P. Jagannathan. Concentration bounds for empirical conditional value-at-risk: The unbounded case. Operations Research Letters, 47(1):16–20, 2019.
  • Lattimore and Szepesvári [2020] T. Lattimore and C. Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Mannor and Tsitsiklis [2004] S. Mannor and J. N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648, 2004.
  • Markowitz [1952] H. Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.
  • Massart [1990] P. Massart. The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality. The Annals of Probability, pages 1269–1283, 1990.
  • Pandey et al. [2021] A. K. Pandey, L. A. Prashanth, and S. P. Bhat. Estimation of spectral risk measures. Proceedings of the AAAI Conference on Artificial Intelligence, 35(13):12166–12173, May 2021.
  • Prashanth and Bhat [2022] L. A. Prashanth and S. P. Bhat. A Wasserstein distance approach for concentration of empirical risk estimates. arXiv preprint 1902.10709v4, 2022.
  • Prashanth et al. [2016] L. A. Prashanth, J. Cheng, M. C. Fu, S. I. Marcus, and C. Szepesvári. Cumulative prospect theory meets reinforcement learning: prediction and control. In International Conference on Machine Learning, pages 1406–1415, 2016.
  • Prashanth et al. [2020] L. A. Prashanth, K. Jagannathan, and R. K. Kolla. Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions. In International Conference on Machine Learning, volume 119, pages 5577–5586. PMLR, 2020.
  • Riou and Honda [2020] C. Riou and J. Honda. Bandit algorithms based on Thompson sampling for bounded reward distributions. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, volume 117, pages 777–826. PMLR, Feb 2020.
  • Rockafellar and Uryasev [2000] R. T. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2(3):21–41, 2000.
  • Russo et al. [2018] D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen. A Tutorial on Thompson Sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.
  • Sani et al. [2012] A. Sani, A. Lazaric, and R. Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275–3283, 2012.
  • Sharpe [1966] W. F. Sharpe. Mutual fund performance. The Journal of Business, 39(1):119–138, 1966.
  • Szorenyi et al. [2015] B. Szorenyi, R. Busa-Fekete, P. Weng, and E. Hüllermeier. Qualitative multi-armed bandits: A quantile-based approach. In International Conference on Machine Learning, pages 1660–1668. PMLR, 2015.
  • Thomas and Learned-Miller [2019] P. Thomas and E. Learned-Miller. Concentration inequalities for conditional value at risk. In International Conference on Machine Learning, pages 6225–6233, 2019.
  • Thompson [1933] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
  • Tversky and Kahneman [1992] A. Tversky and D. Kahneman. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4):297–323, 1992.
  • Vakili and Zhao [2015] S. Vakili and Q. Zhao. Mean-variance and value at risk in multi-armed bandit problems. In Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1330–1335, 2015.
  • Wainwright [2019] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
  • Wang and Gao [2010] Y. Wang and F. Gao. Deviation inequalities for an estimator of the conditional value-at-risk. Operations Research Letters, 38(3):236–239, 2010.
  • Zhang and Ong [2021] M. Zhang and C. S. Ong. Quantile bandits for best arms identification. In International Conference on Machine Learning, pages 12513–12523. PMLR, 2021.
  • Zhu and Tan [2020] Q. Zhu and V. Y. F. Tan. Thompson sampling algorithms for mean-variance bandits. In International Conference on Machine Learning, pages 2645–2654, 2020.