
From Predictions to Decisions:
The Importance of Joint Predictive Distributions

Zheng Wen    Ian Osband    Chao Qin    Xiuyuan Lu    Morteza Ibrahimi    Vikranth Dwaracherla    Mohammad Asghari    Benjamin Van Roy
Abstract

A fundamental challenge for any intelligent system is prediction: given some inputs, can you predict corresponding outcomes? Most work on supervised learning has focused on producing accurate marginal predictions for each input. However, we show that for a broad class of decision problems, accurate joint predictions are required to deliver good performance. In particular, we establish several results pertaining to combinatorial decision problems, sequential predictions, and multi-armed bandits to elucidate the essential role of joint predictive distributions. Our treatment of multi-armed bandits introduces an approximate Thompson sampling algorithm and analytic techniques that lead to a new kind of regret bound.


1 Introduction

We are motivated by the problem of an intelligent agent interacting with an unknown environment. Although the agent is initially uncertain of the true dynamics of the environment, it might learn to predict aspects of its evolution through observation and learning. Predictions can then guide decisions aimed at optimizing future outcomes.

Most work on supervised learning has focused on producing accurate marginal predictions for each input. For instance, in the image classification problem, the goal is to predict the class label of each image correctly. Though accurate marginal predictions are sufficient for some applications, in this paper, we show that for a broad class of downstream tasks, accurate joint predictions are essential. In particular, we show that even in simple combinatorial and sequential decision problems, accurate marginal predictions are insufficient for effective decisions. Rather, accurate joint predictive distributions are required to achieve good performance. We also show that, in the multi-armed bandit, a classical sequential decision problem, accurate joint predictive distributions enable near-optimal performance.

The cross-entropy loss is perhaps the most widely used performance metric in machine learning and plays a central role in training deep learning systems (LeCun et al., 2015). It is typically used to evaluate the quality of marginal predictions but readily extends to address joint predictive distributions. Baselining the cross-entropy loss with an agent-independent constant gives rise to expected KL-divergence. Hence, the cross-entropy loss is equivalent to the expected KL-divergence as a metric for ranking agents. We will focus in this paper on the use of the expected KL-divergence because the baselining allows for more elegant analysis and framing of results. In particular, the KL-divergence is closely related to fundamental notions in information theory (Cover, 1999), such as entropy and mutual information. It is worth clarifying that the cross-entropy loss and expected KL-divergence are just particular choices of evaluation metrics. The importance of evaluating joint predictions, not just marginals, should similarly be apparent under other evaluation metrics.

To elucidate the essential role of joint predictive distributions in driving downstream decision tasks, we establish several results pertaining to the universality of our metric across tasks, combinatorial decision problems, sequential predictions, and multi-armed bandits. Among these, our study of multi-armed bandits is particularly innovative, entailing the development of a novel variant of the Thompson sampling (TS) algorithm (Thompson, 1933; Russo et al., 2018) and original analytic techniques that lead to a new kind of regret bound (Theorem 5.1). Our proposed approximate TS algorithm is a generalization of the standard (exact) TS algorithm and is general in the sense that it does not depend on the agent's uncertainty representation. Instead, it only requires that the agent can simulate hypothetical observations, sampled from a joint predictive distribution. The new analytic techniques required to establish our regret bound build on recent progress in information-theoretic analysis of bandit and reinforcement learning algorithms (Russo & Van Roy, 2016; Lu et al., 2021). We believe that the key ideas can also offer new insight into approximate TS algorithms beyond the scope of this paper, including those designed for more general sequential decision problems, such as episodic reinforcement learning.

Also novel is our analysis of sequential prediction. In this case, we establish a result that highlights the importance of joint predictive distributions even when an agent aims only to produce accurate marginal predictions sequentially, as it streams data. In particular, for the agent to be able to incrementally update its knowledge representation as it streams, it must also be capable of generating accurate joint, not just marginal, predictive distributions.

The remainder of the paper is organized as follows: in Section 2, we introduce the notation for this paper via a standard supervised learning setting. We also review and discuss the evaluation metric for joint predictive distributions. In Section 3, we illustrate through a simple recommendation system example that accurate marginal predictions are insufficient to guide effective decisions in combinatorial decision problems, and that accurate joint predictive distributions are essential. In Section 4, we formally prove that in a sequential prediction problem, an agent that updates its parameters incrementally must retain sufficient information about joint predictive distributions if it is to achieve good performance, even if the performance metric depends only on marginal distributions. In Section 5, we consider the classical multi-armed bandit and show that accurate marginal predictions are insufficient to guide good decisions. We also establish a regret bound for our proposed TS algorithm: if the agent produces accurate joint predictive distributions, then the TS algorithm can attain near-optimal regret. Finally, we discuss related work in Section 6 and conclude in Section 7.

2 Evaluating joint predictive distributions

In this section, we introduce the notation for this paper via a standard supervised learning framework. We also motivate and discuss the evaluation metrics used in this paper for joint predictive distributions.

2.1 Supervised learning

Consider a sequence of pairs ((X_{t},Y_{t+1}):t=0,1,2,\ldots), where each X_{t} is a feature vector and each Y_{t+1} is its target label. Feature vectors (X_{t}:t=0,1,2,\ldots) are i.i.d. according to an input distribution P_{X}. Each target label Y_{t+1} is independent of all other data, conditioned on X_{t}, and distributed according to \mathcal{E}(\cdot|X_{t}). The conditional distribution \mathcal{E} is referred to as the environment. The environment \mathcal{E} is random; this reflects the agent's uncertainty about how labels are generated given features. Note that \mathds{P}(Y_{t+1}\in\cdot|\mathcal{E},X_{t})=\mathcal{E}(\cdot|X_{t}) and \mathds{P}(Y_{t+1}\in\cdot|X_{t})=\mathbb{E}[\mathcal{E}(\cdot|X_{t})|X_{t}].

In supervised learning, we consider an agent that learns about the environment \mathcal{E} from a training dataset

\mathcal{D}_{T}\equiv((X_{t},Y_{t+1}):t=0,1,\ldots,T-1),

and aims to predict the target labels

Y_{T+1:T+\tau}\equiv(Y_{T+1},\dots,Y_{T+\tau})

at \tau feature vectors X_{T:T+\tau-1}\equiv(X_{T},\dots,X_{T+\tau-1}).

Conditioned on the environment \mathcal{E}, a predictive distribution over the target labels is given by

P^{*}_{T+1:T+\tau}\equiv\mathds{P}\left(Y_{T+1:T+\tau}\in\cdot\,\middle|\,\mathcal{E},X_{T:T+\tau-1}\right).

Conditioned instead on the training data, the predictive distribution becomes

\overline{P}_{T+1:T+\tau}\equiv\mathds{P}\left(Y_{T+1:T+\tau}\in\cdot\,\middle|\,\mathcal{D}_{T},X_{T:T+\tau-1}\right).

In a sense, \overline{P}_{T+1:T+\tau} represents the result of perfect inference. Since the associated computations are intractable for environments of interest, we instead consider agents that approximate this predictive distribution. Specifically, we consider agents that represent the approximation in terms of a generative model. The agent's predictions are parameterized by a vector \theta_{T} that the agent learns from the training data \mathcal{D}_{T}. The vector \theta_{T} is independent of \mathcal{E} conditioned on \mathcal{D}_{T}. For any inputs X_{T:T+\tau-1}, \theta_{T} determines a predictive distribution, which could be used to sample imagined outcomes \hat{Y}_{T+1:T+\tau}. Hence, the agent's \tau^{\rm th}-order predictive distribution is given by

\hat{P}_{T+1:T+\tau}\equiv\mathds{P}(\hat{Y}_{T+1:T+\tau}\in\cdot\,|\,\theta_{T},X_{T:T+\tau-1}).

When \tau=1, we alternatively use \hat{P}_{T+1}, \overline{P}_{T+1}, and P^{*}_{T+1} to denote \hat{P}_{T+1:T+\tau}, \overline{P}_{T+1:T+\tau}, and P^{*}_{T+1:T+\tau}, respectively.

Note that if \tau=1, \hat{P}_{T+1} represents a marginal prediction: it predicts the label Y_{T+1} for a single input X_{T}. On the other hand, for \tau>1, it represents a joint prediction over labels at \tau input features.

2.2 Marginal vs. joint predictive distributions

Before proceeding, we use a simple coin flipping example to illustrate that evaluating marginal and joint predictive distributions can result in very different answers. Suppose that (Y_{t+1}:t=0,1,\ldots) are generated by repeated tosses of a possibly biased coin with unknown probability p of heads, with Y_{t+1}=1 and Y_{t+1}=0 indicating heads and tails, respectively. Consider two agents with different beliefs: Agent 1 assumes p=2/3 and models the outcome of each coin toss as independent conditioned on p. Agent 2 assumes that p=1 with probability 2/3 and p=0 with probability 1/3; that is, the coin either produces only heads or only tails. Let \hat{Y}^{1}_{t+1} and \hat{Y}^{2}_{t+1} denote the outcomes imagined by the two agents. Despite their differing assumptions, the two agents generate identical marginal predictive distributions: \mathds{P}(\hat{Y}^{1}_{t+1}=0)=\mathds{P}(\hat{Y}^{2}_{t+1}=0)=1/3.

On the other hand, the joint predictions of these two agents differ for \tau>1:

\mathds{P}(\hat{Y}^{1}_{1}=\dots=\hat{Y}^{1}_{\tau}=0)=1/3^{\tau}<1/3=\mathds{P}(\hat{Y}^{2}_{1}=\dots=\hat{Y}^{2}_{\tau}=0).

Evaluating marginal predictions cannot distinguish between the two agents, though for a specific prior distribution over p, one agent could be right and the other wrong. One must evaluate joint predictions to make this distinction.
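The following minimal sketch (Python with NumPy; the sample size, horizon, and random seed are arbitrary choices of ours) checks both claims by Monte Carlo: the two agents agree on marginal predictions but disagree sharply on the joint prediction that all \tau outcomes are tails.

import numpy as np

rng = np.random.default_rng(0)
n, tau = 200_000, 3  # number of Monte Carlo draws and prediction horizon

# Agent 1: p = 2/3, tosses independent given p.
y1 = rng.random((n, tau)) < 2 / 3

# Agent 2: p = 1 with probability 2/3, p = 0 with probability 1/3;
# all tau imagined tosses share the sampled p.
p2 = (rng.random(n) < 2 / 3).astype(float)
y2 = rng.random((n, tau)) < p2[:, None]

print("marginal P(Y = 0):     ", 1 - y1[:, 0].mean(), 1 - y2[:, 0].mean())            # both roughly 1/3
print("joint P(all tau are 0):", (~y1).all(axis=1).mean(), (~y2).all(axis=1).mean())  # roughly 1/3^tau vs 1/3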

2.3 Cross-entropy loss

Cross-entropy loss is perhaps the most widely used metric in machine learning, though it is typically only used to evaluate marginal predictive distributions. For our supervised learning formulation, an agent’s marginal cross-entropy loss takes the form

\mathbf{d}_{\mathrm{CE}}^{1}\equiv-\mathbb{E}\big[\log\hat{P}_{T+1}(Y_{T+1})\big], \qquad (1)

where the expectation is over both \hat{P}_{T+1} and Y_{T+1}, and the superscript "1" in \mathbf{d}_{\mathrm{CE}}^{1} indicates that this evaluates marginal predictions. Note that the marginal distribution \hat{P}_{T+1} is random because it depends on \theta_{T} and X_{T}.

It is straightforward to extend the cross-entropy loss to assess joint predictive distributions. Generalizing Equation (1), for any \tau=1,2,\ldots, we define the \tau^{\textrm{th}}-order cross-entropy loss:

\mathbf{d}_{\mathrm{CE}}^{\tau}\equiv-\mathbb{E}\big[\log\hat{P}_{T+1:T+\tau}(Y_{T+1:T+\tau})\big], \qquad (2)

where the expectation is over \hat{P}_{T+1:T+\tau} and Y_{T+1:T+\tau}. Note that the \tau^{\textrm{th}}-order joint distribution \hat{P}_{T+1:T+\tau} is also random, since it depends on \theta_{T} and X_{T:T+\tau-1}.

2.4 Kullback–Leibler divergence

One can use the joint cross-entropy loss defined in Equation (2) to assess joint predictive distributions. However, it can be helpful to offset the metric by a baseline to convert it into the Kullback–Leibler (KL) divergence. We will take \overline{P}_{T+1:T+\tau}, which is the perfect posterior predictive, to be our baseline. As we will see, since \overline{P}_{T+1:T+\tau} does not depend on the agent, our measure of KL-divergence and the cross-entropy loss are effectively equivalent in the sense that they only differ by a constant that does not depend on the agent. However, KL-divergence exhibits properties that allow for a more elegant mathematical analysis.

The \tau^{\textrm{th}}-order expected KL-divergence with respect to \overline{P} is defined by

\mathbf{d}_{\mathrm{KL}}^{\tau}\equiv\mathbb{E}\big[\mathbf{d}_{\mathrm{KL}}\big(\overline{P}_{T+1:T+\tau}\,\big\|\,\hat{P}_{T+1:T+\tau}\big)\big], \qquad (3)

where the expectation is over the distributions \overline{P}_{T+1:T+\tau} and \hat{P}_{T+1:T+\tau}, which depend in turn on the data \mathcal{D}_{T}, the agent parameters \theta_{T}, and the \tau inputs X_{T:T+\tau-1}.

Note that the KL-divergence is minimized when \hat{P}_{T+1:T+\tau}=\overline{P}_{T+1:T+\tau}, with the minimum being zero. Further, our two metrics are related according to

\mathbf{d}_{\mathrm{KL}}^{\tau}=\mathbf{d}_{\mathrm{CE}}^{\tau}+\mathbb{E}[\log\overline{P}_{T+1:T+\tau}(Y_{T+1:T+\tau})].

Since \overline{P}_{T+1:T+\tau}(Y_{T+1:T+\tau}) does not depend on the agent, our two metrics rank agents identically.

An unbiased estimate of cross-entropy loss can be computed based on a test data sample, according to

\mathbf{d}_{\mathrm{CE}}^{\tau}\approx-\log\hat{P}_{T+1:T+\tau}(Y_{T+1:T+\tau}).

The same is not true for \mathbf{d}_{\mathrm{KL}}^{\tau}, which can only be estimated if we are also given an estimate of \mathbb{E}[\log\overline{P}_{T+1:T+\tau}(Y_{T+1:T+\tau})]. Hence, \mathbf{d}_{\mathrm{KL}}^{\tau} serves only as a conceptual tool in our analysis and not as an evaluation metric that can be applied to empirical data. While it ranks agents identically to \mathbf{d}_{\mathrm{CE}}^{\tau}, \mathbf{d}_{\mathrm{KL}}^{\tau} is more natural as a metric since its minimum is zero and it accommodates more elegant analysis.
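As a concrete illustration of the estimate above, the following sketch (Python with NumPy; the function and argument names are our own and not part of the paper) averages -\log\hat{P} over held-out batches of \tau input-label pairs. With \tau=1 it reduces to the familiar marginal log-loss.

import numpy as np

def estimate_joint_cross_entropy(joint_log_prob, test_batches):
    # Monte Carlo estimate of the tau-th order cross-entropy loss in Equation (2).
    # joint_log_prob(xs, ys): the agent's log P_hat(ys | theta_T, xs) for one batch
    #   of tau inputs xs and the realized labels ys.
    # test_batches: iterable of (xs, ys) batches drawn from held-out data.
    losses = [-joint_log_prob(xs, ys) for xs, ys in test_batches]
    return float(np.mean(losses))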

2.5 Error in predictions versus environment

Our \mathbf{d}_{\mathrm{KL}}^{\tau} metric assesses error incurred by the predictive distribution \hat{P}_{T+1:T+\tau}. A common approach to generating such a predictive distribution is through estimating a posterior distribution over environments and using that posterior distribution to generate the predictive distribution. In such a context, \theta_{T} parameterizes the estimated posterior distribution. Let \hat{\mathcal{E}} be an imaginary environment sampled from this posterior distribution. To offer some intuition for \mathbf{d}_{\mathrm{KL}}^{\tau}, we consider in this section its relation to the KL-divergence between the distributions of the true and imaginary environments.

Let \hat{Y}_{T+1:T+\tau} denote a sequence of imaginary outcomes, with each \hat{Y}_{t+1} sampled independently from \hat{\mathcal{E}}(\cdot|X_{t}). If the support of the input distribution P_{X} is exhaustive, the support of the imaginary environment distribution \mathds{P}(\hat{\mathcal{E}}\in\cdot|\theta_{T}) contains that of the true environment distribution \mathds{P}(\mathcal{E}\in\cdot|\mathcal{D}_{T}), and the environment distributions satisfy suitable regularity conditions, then

\lim_{\tau\rightarrow\infty}\mathbf{d}_{\mathrm{KL}}^{\tau}=\mathbb{E}\big[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}(\mathcal{E}\in\cdot|\mathcal{D}_{T})\,\big\|\,\mathds{P}(\hat{\mathcal{E}}\in\cdot|\theta_{T})\big)\big].

In other words, under certain technical conditions, as the number \tau of test data pairs grows, \mathbf{d}_{\mathrm{KL}}^{\tau} converges to the error in the estimated posterior distribution over environments, measured in terms of KL-divergence.

One might wonder why we should use \mathbf{d}_{\mathrm{KL}}^{\tau} rather than the KL-divergence between the true and imaginary environments,

\mathbb{E}\big[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}(\mathcal{E}\in\cdot|\mathcal{D}_{T})\,\big\|\,\mathds{P}(\hat{\mathcal{E}}\in\cdot|\theta_{T})\big)\big] \qquad (4)

to evaluate agents. There are several reasons. First, practical agent designs often do not satisfy the requisite regularity conditions, and hence (4) becomes infinite. For example, it is common to approximate the posterior distribution over \mathcal{E} using an ensemble of environment models (see, e.g., Lu & Van Roy (2017)). Such an ensemble represents a distribution with finite support, though the posterior may have infinite support. On the other hand, for any finite \tau, \mathbf{d}_{\mathrm{KL}}^{\tau} is finite. Second, \mathbf{d}_{\mathrm{CE}}^{\tau}, which is equivalent to \mathbf{d}_{\mathrm{KL}}^{\tau} up to a constant, can be computed from data, whereas computing (4) requires access to the posterior distribution of \mathcal{E}. Finally, as we will establish later, \mathbf{d}_{\mathrm{KL}}^{\tau} with finite \tau is sufficient to support effective decisions in downstream tasks such as multi-armed bandits.

2.6 Universality of \mathbf{d}_{\mathrm{KL}}^{\tau}

We now show that, for any \tau, accuracy in terms of \mathbf{d}_{\mathrm{KL}}^{\tau} is sufficient to guarantee an effective decision if the decision is judged in relation only to Y_{T+1:T+\tau}. In particular, suppose an action a selected from a set \mathcal{A} results in an expected reward

\mathbb{E}[r(a,Y_{T+1:T+\tau})|\mathcal{D}_{T},X_{T:T+\tau-1}]=\sum_{y_{T+1:T+\tau}}\overline{P}_{T+1:T+\tau}(y_{T+1:T+\tau})\,r(a,y_{T+1:T+\tau}),

where r is a reward function with range [0,1]. The following result bounds the loss in expected reward of a decision that is based on the estimate \hat{P}_{T+1:T+\tau} instead of the posterior \overline{P}_{T+1:T+\tau}.

Proposition 2.1.

If an action \hat{a}\in\mathcal{A} maximizes

\sum_{y_{T+1:T+\tau}}\hat{P}_{T+1:T+\tau}(y_{T+1:T+\tau})\,r(a,y_{T+1:T+\tau})

then

\mathbb{E}[r(\hat{a},Y_{T+1:T+\tau})]\geq\max_{a\in\mathcal{A}}\mathbb{E}[r(a,Y_{T+1:T+\tau})]-\sqrt{2\,\mathbf{d}_{\mathrm{KL}}^{\tau}}.

This result follows from Pinsker's inequality and Jensen's inequality. A proof is provided in Appendix A.

Proposition 2.1 ensures that if a decision maximizes an approximation to expected reward, with the expectation approximated using the predictive distribution \hat{P}_{T+1:T+\tau}, then the action will be within \sqrt{2\,\mathbf{d}_{\mathrm{KL}}^{\tau}} of what is achievable given the posterior predictive distribution \overline{P}_{T+1:T+\tau}. In this sense, \mathbf{d}_{\mathrm{KL}}^{\tau} is a universal evaluation metric: its value ensures a level of performance in any decision problem.
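For concreteness, the decision rule that Proposition 2.1 bounds can be written in a few lines; in this minimal sketch (Python), the names are illustrative and p_hat is assumed to be a dictionary mapping each outcome sequence y_{T+1:T+\tau} (a tuple) to its predictive probability.

def decide(actions, p_hat, reward):
    # Choose the action maximizing expected reward under the agent's joint
    # predictive distribution p_hat, as in Proposition 2.1.
    # reward(a, y) is assumed to take values in [0, 1].
    def expected_reward(a):
        return sum(prob * reward(a, y) for y, prob in p_hat.items())
    return max(actions, key=expected_reward)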

3 Combinatorial decision problems

In this section, we present a simple example that highlights the importance of joint predictive distributions in combinatorial decision problems. Importantly, we show that there can be benefit to considering a joint predictive distribution even when the data are i.i.d. conditioned on the environment.

Consider the problem of a customer interacting with a recommendation system that proposes a selection of K>1 movies from an inventory of N movies X_{1},\ldots,X_{N}. Each X_{i}\in\mathds{R}^{d} describes the features of movie i, and d is the feature dimension. We model the probability that a user will enjoy movie i by a logistic model Y_{i}\sim{\rm logit}(\phi_{*}^{T}X_{i}), where {\rm logit} denotes the standard logistic (sigmoid) function. Note that \phi_{*}\in\mathds{R}^{d} describes the preferences of the user, is not fully known to the recommendation system, and can be viewed as a random variable. We will show that, in order to maximize the probability that the user enjoys at least one of the K>1 recommended movies, it is insufficient to examine the marginal predictive distributions.

To make this example concrete, we consider the case where the user \phi_{*} is drawn from two possible user types \{\phi_{1},\phi_{2}\}, and the recommendation system should propose K=2 movies from an inventory \{X_{1},X_{2},X_{3},X_{4}\}. Table 1 presents the numerical values of these vectors and the enjoyment probabilities implied by the logistic model, correct to two decimal places. These values are chosen to set up a tension between optimization over marginal (each X_{i} individually) and joint (pairs X_{i},X_{j}) predictions. An agent that optimizes the expected probability for each movie individually will end up recommending the pair (X_{3},X_{4}) to an unknown user \phi\sim{\rm Unif}(\phi_{1},\phi_{2}). This means that, whether the user is of type \phi_{1} or \phi_{2}, there is a greater than 10% chance they do not like either movie. By contrast, an agent that considers the joint predictive distribution for \tau\geq K=2 can see that instead selecting the pair (X_{1},X_{2}) gives close to 100% certainty that the user will enjoy one of the movies.

                                        X_{1}=(10,-10)   X_{2}=(-10,10)   X_{3}=(1,0)   X_{4}=(0,1)
\phi_{1}=(1,0)                          1                0                0.73          0.5
\phi_{2}=(0,1)                          0                1                0.5           0.73
\phi\sim{\rm Unif}(\phi_{1},\phi_{2})   0.5              0.5              0.62          0.62
Table 1: Expected probability of enjoying each movie under different user features, correct to two decimal places.
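The numbers in Table 1 and the claims above can be checked directly; the following sketch (Python with NumPy, assuming the two enjoyment events are conditionally independent given the user type) computes the marginal enjoyment probabilities and, for every pair of movies, the probability that the user enjoys at least one.

import numpy as np
from itertools import combinations

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[10., -10.], [-10., 10.], [1., 0.], [0., 1.]])  # movies X_1, ..., X_4
phis = np.array([[1., 0.], [0., 1.]])                         # user types phi_1, phi_2, equally likely

marginals = sigmoid(X @ phis.T).mean(axis=1)                  # last row of Table 1
print("marginal enjoy-probabilities:", np.round(marginals, 2))

for i, j in combinations(range(4), 2):
    # P(user enjoys at least one of the pair), averaged over the unknown user type.
    p = np.mean([1 - (1 - sigmoid(phi @ X[i])) * (1 - sigmoid(phi @ X[j])) for phi in phis])
    print(f"pair (X{i+1}, X{j+1}): P(at least one enjoyed) = {p:.3f}")

Under either user type, the pair (X_{3},X_{4}) leaves roughly a 13% chance that neither movie is enjoyed, whereas (X_{1},X_{2}) leaves essentially none.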

This didactic example is designed to be maximally simplistic, and in some sense it labours an obvious point: in combinatorial decision problems where the outcome depends on the joint predictive distribution, optimization based on the marginal predictive distribution is insufficient to guarantee good decisions. By contrast, as shown in Section 2.6, if the decision depends only on the \tau^{\rm th}-order predictive distribution, then attaining small \mathbf{d}_{\mathrm{KL}}^{\tau} is sufficient for good performance. What is perhaps interesting about this example, however, is that this coupling can occur even where, conditioned on the true environment, the data generating process is actually i.i.d. The key point is that, when the true underlying environment is unknown, a coupling in future observations is introduced. As such, examining the joint predictive distribution can be essential for good performance. In the rest of this paper, we show that these distinctions become even more important as we move towards sequential decision problems.

4 Sequential predictions

In this and the next section, we establish theoretical results that justify the importance of joint predictive distributions in sequential decision problems. We first consider sequential prediction problems, a subclass of sequential decision problems in which the agent's predictions (actions) do not influence its future observations. In particular, we show that in a standard sequential prediction problem, an agent that updates its parameters incrementally must retain a significant portion of the information about joint predictive distributions in order to achieve good performance. This is true even if the performance metric depends only on marginal distributions.

The sequential prediction problem we consider is formulated as follows: data pairs (X_{t},Y_{t+1}) arrive sequentially, one at a time. At each time t, the agent computes parameters \theta_{t} based on previously observed data pairs \mathcal{D}_{t}=(X_{0},Y_{1},X_{1},\ldots,X_{t-1},Y_{t}). Then, a new data pair (X_{t},Y_{t+1}) arrives. We assume that the feature vectors X_{t} are unconditionally independent, but not necessarily identically distributed. The target label Y_{t+1} is conditionally independently sampled from the distribution \mathcal{E}(\cdot|X_{t}), where \mathcal{E} is the environment. The agent's objective is to minimize the expected cumulative KL-divergence over the first T time steps:

\textstyle\sum_{t=0}^{T-1}\mathbb{E}\big[\mathbf{d}_{\mathrm{KL}}\big(\overline{P}_{t+1}\,\big\|\,\hat{P}_{t+1}\big)\big], \qquad (5)

where

\overline{P}_{t+1}=\mathds{P}(Y_{t+1}\in\cdot\,|\,\mathcal{D}_{t},X_{t}) \quad\text{and}\quad \hat{P}_{t+1}=\mathds{P}(\hat{Y}_{t+1}\in\cdot\,|\,\theta_{t},X_{t})

for all time t. Note that this cumulative KL-divergence (5) only depends on the marginal distributions \overline{P}_{t+1} and \hat{P}_{t+1}. Also note that this performance metric is 0 if the agent predicts the exact posterior at each time t.

We consider a setting where an agent needs to incrementally update its parameters as data arrive. Specifically, at time t=0, the agent chooses its parameters \theta_{0} based on its prior knowledge; then, at each time t=0,1,\ldots, the agent updates its parameters incrementally by sampling from a distribution that only depends on \theta_{t}, (X_{t},Y_{t+1}), and t:

\theta_{t+1}\sim\mathds{P}\left(\theta_{t+1}\in\cdot\,\middle|\,\theta_{t},X_{t},Y_{t+1},t\right). \qquad (6)

In other words, conditioned on (\theta_{t},X_{t},Y_{t+1}), \theta_{t+1} is independent of the dataset \mathcal{D}_{t} and the environment \mathcal{E}. Note that the incremental update rule in equation (6) is general: in particular, \mathcal{D}_{t} could itself be recorded in \theta_{t}. This would allow \theta_{t+1} to depend on \mathcal{D}_{t} in an arbitrary manner. However, such an approach can be impractical when there is a high volume of data. In particular, one may want to avoid sifting through a growing \mathcal{D}_{t} at each time step. In many practical applications, it is desirable for the agent to update \theta_{t+1} with fixed memory space and fixed per-step computational complexity, as the standard SGD (Goodfellow et al., 2016) and Adam (Kingma & Ba, 2015) algorithms do.
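To make the interface in equation (6) concrete, the sketch below (Python with NumPy) updates \theta_{t+1} using only (\theta_{t},X_{t},Y_{t+1},t), with fixed memory and fixed per-step computation; the logistic model and SGD step are our own illustrative choices, not an agent prescribed by the paper.

import numpy as np

class IncrementalAgent:
    # Point-estimate logistic agent with the incremental update structure of (6):
    # theta_{t+1} depends only on (theta_t, X_t, Y_{t+1}, t), never on all of D_t.

    def __init__(self, d, lr=0.1):
        self.theta = np.zeros(d)
        self.lr = lr

    def marginal_prob(self, x):
        # The agent's marginal prediction P(Y_{t+1} = 1 | theta_t, X_t = x).
        return 1.0 / (1.0 + np.exp(-self.theta @ x))

    def update(self, x, y, t):
        # One SGD step on the log-loss: fixed memory, fixed per-step compute.
        grad = (self.marginal_prob(x) - y) * x
        self.theta = self.theta - self.lr / np.sqrt(t + 1.0) * grad

Theorem 4.1 below implies that for any such incremental agent to predict marginals well over the horizon, its parameters \theta_{t} must retain nearly all of the information in \mathcal{D}_{t} about the joint distribution of future labels.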

We have the following result for this sequential prediction problem:

Theorem 4.1.

For an agent with incremental update (6), for any time t=0,1,\ldots,T-1 and any \epsilon\geq 0, if

\textstyle\sum_{t^{\prime}=t}^{T-1}\mathbb{E}\big[\mathbf{d}_{\mathrm{KL}}\big(\bar{P}_{t^{\prime}+1}\,\big\|\,\hat{P}_{t^{\prime}+1}\big)\big]\leq\epsilon,

then we have

\mathbb{I}\left(Y_{t+1:T};\theta_{t}\,\middle|\,X_{t:T-1}\right)\geq\mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t}\,\middle|\,X_{t:T-1}\right)-\epsilon.

Please refer to Appendix B for the proof of Theorem 4.1. Notice that \epsilon measures the performance loss of the agent; \mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t}\,\middle|\,X_{t:T-1}\right) is the conditional information in \mathcal{D}_{t} about the joint distribution of Y_{t+1:T}; and similarly \mathbb{I}\left(Y_{t+1:T};\theta_{t}\,\middle|\,X_{t:T-1}\right) is the conditional information about Y_{t+1:T} retained in \theta_{t}. Also notice that

\mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t}\,\middle|\,X_{t:T-1}\right)\geq\mathbb{I}\left(Y_{t+1:T};\theta_{t}\,\middle|\,X_{t:T-1}\right)

always holds by the data processing inequality. In other words, Theorem 4.1 states that to be \epsilon-near-optimal, an agent with incremental updates must retain in \theta_{t} all information in \mathcal{D}_{t} about the joint distribution of Y_{t+1:T}, except \epsilon nats (in this paper, we use natural logarithms in both KL-divergence and mutual information; note that \epsilon nats \approx 1.443\epsilon bits).

We conjecture that results similar to Theorem 4.1 also hold in broader classes of sequential decision problems, such as the multi-armed bandit problems discussed in Section 5, but we leave the formal analysis to future work.

5 Multi-armed bandits

In this section, we turn our attention from passive observations, where an agent's predictions do not influence the stream of data \mathcal{D}_{t}, to active decisions, where an agent takes actions that influence future streams of data. In this context, the ability to predict outcomes over multiple time steps can significantly impact agent performance. In particular, we will show that the quality of predictions beyond marginals is essential to effectively balance the needs of exploration with exploitation (Thompson, 1933). We begin with a problem formulation around the famous “multi-armed bandit”, a simple model of sequential decisions that serves to highlight the key issues at play. Next, we show that, in this context, even agents that make perfect marginal predictions for the outcome of each action may not make good decisions for the problem as a whole. Finally, we show that agents that make good joint predictions over \tau=O(K) time steps suffice to drive efficient exploration in a Bernoulli bandit with K actions.

5.1 Problem formulation

Consider a sequential decision problem with K actions. The environment \mathcal{E} prescribes observation probabilities. During each time step t=0,1,2,\ldots, the agent selects an action A_{t} and observes an outcome Y_{t+1} produced by the environment. Conditioned on the environment \mathcal{E} and action A_{t}, the next observation Y_{t+1} is sampled from \mathcal{E}(\cdot|A_{t}) independently across time. There is a real-valued reward function r that encodes the agent's preferences over outcomes. The objective of the agent is to optimize the expected return over some long horizon T,

\mathbb{E}\big[\textstyle\sum_{t=0}^{T-1}r(Y_{t+1})\big],

where the expectation is taken over random outcomes, algorithmic randomness, and the prior over \mathcal{E}.

5.2 Marginal predictions are insufficient

We will now show that, even if an agent can make perfect predictions for the outcomes of each action, these per-action marginal predictions can be insufficient to drive efficient exploration. The root cause is that, at any time step t, the rewards and observations at future time steps t^{\prime}>t are coupled through the unknown environment \mathcal{E}. Clearly, we can imagine problem instances where one action reveals the full structure of the environment, but does not itself produce a reward. In this case, an agent that simply makes good marginal predictions of each action's reward cannot take advantage of this informative action. However, perhaps more interestingly, this coupling can occur even when rewards evolve independently per action given the environment.

To make this point clear, we will consider an extension of the coin tossing example from Section 2.2. We consider a Bernoulli bandit with K independent actions where the reward is simply taken to be Y_{t+1}. For the first K-1 actions, the agent knows that the outcome is distributed \mathrm{Ber}(0.5), a pure 50% chance. The final action, however, produces a deterministic outcome of either 1 or 0, each equally likely. Clearly, the optimal policy for maximizing long-term reward is to select the final action first to see whether it is optimal and then, based on that outcome, choose the arm that maximizes expected reward given full knowledge of \mathcal{E} for all future steps. However, any agent that only matches marginal predictions cannot distinguish the final arm from any of the others. As such, it cannot do better than random chance over the K arms, and on average it will require \Theta(K) time steps to identify the optimal policy.

Just as in the coin flip example, the difference between the informative and uninformative actions is fully revealed when we look at \tau^{\textrm{th}}-order predictive distributions for any \tau>1. For instance, a second-order prediction assigns probability 1/4 to observing two consecutive 1s from any of the first K-1 actions, but probability 1/2 for the final action. In general, it is only by examining the evolution of future outcomes, beyond myopic marginals, that an agent can optimize long-term returns. In the next subsection, we will expand on this intuitive argument and relate the ability to predict future outcomes to agent performance. In fact, we will establish a performance bound guaranteeing that approximate predictions O(K) steps into the future suffice to drive efficient exploration.

5.3 Relating joint predictions to regret

As explained in the previous subsection, in a sequential decision problem, it is in general desirable to consider uncertainties jointly across actions in order to make effective decisions. Thus, to simplify exposition, we consider predictions for a vector of outcomes, \mathbf{Y}\in\Re^{K}, with each entry corresponding to the outcome of an action. In this section, we will relate the quality of future predictions about \mathbf{Y} to agent performance on a Bernoulli bandit with correlated arms.

Recall that in a K-armed Bernoulli bandit, the environment \mathcal{E} is identified by parameters p=(p_{1},\ldots,p_{K}), where p_{k}\in[0,1] is the expected reward of the k^{\rm th} action. We make no assumptions about the prior \mathds{P}(p\in\cdot). In particular, there could be dependencies, where an observed outcome from trying one action informs our beliefs about other actions. We define the history by time t as H_{t}=\left(A_{0},Y_{1},\ldots,A_{t-1},Y_{t}\right).

We use \tilde{\mathbf{Y}}_{1:\tau} to denote a sequence of \tau vectors sampled from the environment \mathcal{E}. Specifically, these \tau vectors are conditionally independent given \mathcal{E}. Each vector has dimension K, and the k^{\rm th} component of each vector is conditionally independently sampled from \mathrm{Ber}(p_{k}). On the other hand, consider an agent that can also generate a sequence of K-dimensional binary vectors at each time t. Let \theta_{t} denote its parameters, and \hat{\mathbf{Y}}^{t}_{1:\tau} denote a sequence of \tau binary vectors sampled from it. In this subsection, we establish that if an agent can produce good \tau^{\rm th}-order predictive distributions in the sense that

\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}(\tilde{\mathbf{Y}}_{1:\tau}\in\cdot|H_{t})\,\big\|\,\mathds{P}(\hat{\mathbf{Y}}^{t}_{1:\tau}\in\cdot|\theta_{t})\big)\right]\leq\epsilon\quad\forall t

for some small \epsilon\geq 0, then based on this agent one can design bandit algorithms that achieve near-optimal performance.

Specifically, we consider an approximate version of the Thompson sampling algorithm, described in Algorithm 1. The algorithm proceeds as follows: at each time t, it first samples \tau binary vectors \hat{\mathbf{Y}}^{t}_{1:\tau} from the agent's predictive distribution; then, it samples a vector \hat{p}^{t}\in[0,1]^{K} from the conditional distribution \mathds{P}(p\in\cdot|\tilde{\mathbf{Y}}_{1:\tau}=\hat{\mathbf{Y}}^{t}_{1:\tau}); finally, it chooses an action A_{t} greedily with respect to \hat{p}^{t} and updates the agent parameters based on the new observation. Note that Algorithm 1 is general in the sense that it does not depend on the agent's uncertainty representation. Instead, it only requires that the agent can simulate hypothetical observations, sampled from a joint predictive distribution. Also note that Algorithm 1 reduces to the standard (exact) Thompson sampling algorithm when \mathds{P}(\hat{\mathbf{Y}}_{1:\tau}\in\cdot|\theta_{t})=\mathds{P}(\tilde{\mathbf{Y}}_{1:\tau}\in\cdot|H_{t}) and \tau\rightarrow\infty. We use this algorithm to establish that an agent that performs well under a particular loss function retains enough information to enable efficient exploration. (Note that \min\operatorname*{arg\,max}_{k}\hat{p}^{t}_{k} in Algorithm 1 is well defined; \operatorname*{arg\,max}_{k}\hat{p}^{t}_{k}\subseteq\{1,\ldots,K\} is a set.)

Algorithm 1 Approximate Thompson sampling
  Input: prior over environment parameters p
     agent architecture
     agent parameter initialization/update procedure
  Initialization: compute parameters \theta_{0} based on the prior
  for t=0,1,2,\ldots do
     sample \hat{\mathbf{Y}}^{t}_{1:\tau}\sim\mathds{P}(\hat{\mathbf{Y}}_{1:\tau}\in\cdot\,|\,\theta_{t})
     sample \hat{p}^{t} from \mathds{P}(p\in\cdot\,|\,\tilde{\mathbf{Y}}_{1:\tau}=\hat{\mathbf{Y}}^{t}_{1:\tau})
     choose A_{t}=\min\operatorname*{arg\,max}_{k}\hat{p}^{t}_{k}
     compute \theta_{t+1} based on \theta_{t} and (A_{t},Y_{t+1}).
  end for
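Below is a minimal sketch of Algorithm 1 (Python with NumPy) for the special case of an independent Beta-Bernoulli prior over p; this prior choice, the Beta(1,1) hyperparameters, and the use of exact per-arm posteriors as the agent (so that \epsilon=0) are our own simplifying assumptions rather than part of the algorithm's general statement, which only requires samples \hat{\mathbf{Y}}^{t}_{1:\tau} from some joint predictive distribution.

import numpy as np

rng = np.random.default_rng(0)
K, tau, T = 5, 20, 1000
a0, b0 = np.ones(K), np.ones(K)      # independent Beta(1, 1) prior on each p_k
p_true = rng.beta(a0, b0)            # environment sampled from the prior

a, b = a0.copy(), b0.copy()          # theta_t: per-arm Beta posterior parameters
rewards = []
for t in range(T):
    # Sample Y_hat_{1:tau}^t from the agent's joint predictive distribution:
    # first an imagined p, then tau conditionally independent Bernoulli vectors.
    p_imagined = rng.beta(a, b)
    y_hat = rng.random((tau, K)) < p_imagined     # tau x K binary matrix
    s = y_hat.sum(axis=0)                         # per-arm success counts

    # Sample p_hat^t from P(p | Y_tilde_{1:tau} = y_hat): a Beta posterior under the prior.
    p_hat = rng.beta(a0 + s, b0 + tau - s)
    # (The cheaper variant discussed in Section 5.3: p_hat = s / tau.)

    A = int(np.argmax(p_hat))                     # min argmax_k p_hat_k
    y = float(rng.random() < p_true[A])           # observe the outcome/reward
    a[A] += y
    b[A] += 1.0 - y                               # incremental update of theta_t
    rewards.append(y)

print("average reward:", np.mean(rewards), " best arm mean:", p_true.max())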

As is standard in the bandit literature, we measure the performance of Algorithm 1 using the (Bayes) cumulative regret, defined as

{\rm Regret}(T)=\textstyle\sum_{t=0}^{T-1}\mathbb{E}\big[p_{A^{*}}-r(Y_{t+1})\big], \qquad (7)

where A^{*}=\min\operatorname*{arg\,max}_{k}p_{k} is an optimal action. As before, the expectation is over random outcomes, algorithmic randomness, and the prior over \mathcal{E}. We can establish the following regret bound for Algorithm 1:

Theorem 5.1.

For any integer \tau\geq 1 and any \epsilon\in\Re_{+}, if at each time t, the agent with parameters \theta_{t} can generate samples \hat{\mathbf{Y}}^{t}_{1:\tau} such that

\mathbb{E}\big[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}(\tilde{\mathbf{Y}}_{1:\tau}\in\cdot|H_{t})\,\big\|\,\mathds{P}(\hat{\mathbf{Y}}_{1:\tau}^{t}\in\cdot|\theta_{t})\big)\big]\leq\epsilon,

then under Algorithm 1, we have

{\rm Regret}(T)\leq\sqrt{\tfrac{1}{2}KT\log K}+\left(\frac{K}{\sqrt{2\tau}}+\sqrt{2\epsilon}\right)T. \qquad (8)

Please refer to Section 5.4 for a proof sketch of Theorem 5.1; the detailed proof is provided in Appendix C. The proof uses an information-ratio analysis similar to previous papers (Russo & Van Roy, 2016; Lu et al., 2021), together with the notions of environment proxy and learning target developed in Lu et al. (2021) (see Section 3.3 of that paper). Specifically, we choose the environment proxy to be \tilde{\mathbf{Y}}_{1:\tau} and the learning target to be the action \tilde{A}=\min\operatorname*{arg\,max}_{k}\tilde{p}_{k}, where \tilde{p}=(\tilde{p}_{1},\ldots,\tilde{p}_{K}) is an independent sample drawn from \mathds{P}(p\in\cdot|\tilde{\mathbf{Y}}_{1:\tau}). Note that conditioned on the environment \mathcal{E}, \tilde{\mathbf{Y}}_{1:\tau} and \tilde{A} are independent of the history H_{t}.

We now briefly discuss the regret bound in Theorem 5.1. First, note that if an agent can make good predictions \tau\geq K/\epsilon steps into the future, then this regret bound reduces to O(\sqrt{KT\log(K)}+\sqrt{K\epsilon}\,T), which is sufficient to ensure efficient exploration. Second, notice that this regret bound consists of three terms. The linear regret term \sqrt{2\epsilon}\,T is due to the expected KL-divergence loss of the agent. Specifically, if the agent makes perfect predictions in the sense that \mathds{P}(\hat{\mathbf{Y}}_{1:\tau}^{t}\in\cdot|\theta_{t})=\mathds{P}(\tilde{\mathbf{Y}}_{1:\tau}\in\cdot|H_{t}) for all t, then this linear regret term vanishes. The other linear regret term, KT/\sqrt{2\tau}, is due to the fact that we choose \tilde{A} as the learning target, which can be a sub-optimal action. As \tau\rightarrow\infty, \tilde{A} converges to A^{*} and this term also vanishes. Finally, the sublinear regret term \sqrt{\frac{1}{2}KT\log K} is exactly the regret bound for the exact Thompson sampling algorithm (Russo & Van Roy, 2016). This is not surprising, since when \epsilon=0 (i.e., \mathds{P}(\hat{\mathbf{Y}}_{1:\tau}^{t}\in\cdot|\theta_{t})=\mathds{P}(\tilde{\mathbf{Y}}_{1:\tau}\in\cdot|H_{t})) and \tau\rightarrow\infty, Algorithm 1 reduces to the exact Thompson sampling algorithm.

Finally, it is worth mentioning that we conjecture that results similar to Theorem 5.1 can be developed for more practical algorithms and more general sequential decision problems. Note that in Algorithm 1, sampling \hat{p}^{t} from \mathds{P}(p\in\cdot|\tilde{\mathbf{Y}}_{1:\tau}=\hat{\mathbf{Y}}^{t}_{1:\tau}) can be computationally expensive. A computationally more efficient alternative is to choose \hat{p}^{t} as the sample mean of \hat{\mathbf{Y}}^{t}_{1:\tau}, i.e., \hat{p}^{t}=\frac{1}{\tau}\sum_{i=1}^{\tau}\hat{\mathbf{Y}}^{t}_{i}, where \hat{\mathbf{Y}}^{t}_{i} is the i^{\rm th} vector in \hat{\mathbf{Y}}^{t}_{1:\tau}. We conjecture that one can derive a similar regret bound with this modification, but leave the analysis to future work. Moreover, we believe that similar approximate Thompson sampling algorithms and regret bounds can be developed for more general sequential decision problems, such as multi-armed bandits with unbounded noise and episodic reinforcement learning, but we leave them to future work.

5.4 Proof Sketch for Theorem 5.1

We provide a proof sketch for Theorem 5.1 in this subsection. First, notice that the expected per-step regret at time t is \mathbb{E}[p_{A^{*}}-p_{A_{t}}], which can be decomposed as

\mathbb{E}[p_{A^{*}}-p_{A_{t}}]=\mathbb{E}[p_{A^{*}}-p_{\tilde{A}}]+\mathbb{E}[p_{\tilde{A}}-p_{A_{t}}]. \qquad (9)

Recall that the action \tilde{A} is the learning target. We bound the two terms on the right-hand side of equation (9) separately. First, based on the fact that p and \tilde{p} are conditionally i.i.d. given the environment proxy \tilde{\mathbf{Y}}_{1:\tau}, we can show that

\mathbb{E}[p_{A^{*}}-p_{\tilde{A}}]\leq K/\sqrt{2\tau}.

To bound the second term \mathbb{E}[p_{\tilde{A}}-p_{A_{t}}], we consider its conditional version \mathbb{E}_{t}[p_{\tilde{A}}-p_{A_{t}}], where the subscript t denotes conditioning on the history H_{t}. Using information-ratio analysis, we can prove that

\mathbb{E}_{t}[p_{\tilde{A}}-p_{A_{t}}]\leq\sqrt{\frac{K}{2}\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})}+\big\|\mathds{P}_{t}(\tilde{A}\in\cdot)-\mathds{P}_{t}(A_{t}\in\cdot)\big\|_{1}. \qquad (10)

Using Pinsker's inequality, the data processing inequality, and the assumption on the expected KL-divergence in Theorem 5.1, we can show that

\mathbb{E}\big[\big\|\mathds{P}_{t}(\tilde{A}\in\cdot)-\mathds{P}_{t}(A_{t}\in\cdot)\big\|_{1}\big]\leq\sqrt{2\epsilon}.

On the other hand, based on the Cauchy–Schwarz inequality and the chain rule for mutual information, we have

\sum_{t=0}^{T-1}\mathbb{E}\left[\sqrt{\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})}\right]\leq\sqrt{T\,\mathbb{I}(\tilde{A};H_{T})}.

Finally, note that \mathbb{I}(\tilde{A};H_{T})\leq\mathbb{H}(\tilde{A})\leq\log K. Combining the above inequalities completes the proof of Theorem 5.1.

6 Related work

There is extensive literature focusing on evaluating marginal distributions (e.g. Bishop (2006), Friedman et al. (2001), LeCun et al. (2015)), paired with evaluation on downstream tasks such as bandits (Riquelme et al., 2018). There is also a rich literature on developing agents for uncertainty estimation (Blundell et al., 2015; Gal & Ghahramani, 2016; Osband & Van Roy, 2015; Welling & Teh, 2011), and in most cases the developed agents are evaluated only based on marginal predictive distributions.

To the best of our knowledge, there is little literature that highlights the importance of joint predictive distributions. One exception is Wang et al. (2021). However, Wang et al. (2021) states that “joint log-likelihood scores [are] determined almost entirely by the marginal log-likelihood scores…in practice, they provide little new information beyond marginal likelihoods,” and proposes an alternative metric for joint distributions based on posterior predictive correlations (PPCs). Our paper, however, establishes that the KL-divergence for joint predictive distributions, which is equivalent to the joint log-likelihood score, ensures effective decisions, while marginal metrics do not. In particular, our results in Section 5 show that agents that attain small marginal KL-divergence might not retain enough information to enable efficient exploration, while agents that attain small joint KL-divergence do. Moreover, the PPC metric proposed by Wang et al. (2021) is designed for regression problems, while cross-entropy loss can be applied more broadly, for example, to classification as well as regression problems. There is also recent empirical work that compares agents based on their joint predictive distributions (Osband et al., 2021).

There is also extensive literature on combinatorial decision-making (Papadimitriou & Steiglitz, 1998), sequential prediction (Shalev-Shwartz et al., 2011), multi-armed bandits (Auer et al., 2002; Lattimore & Szepesvári, 2020), and Thompson sampling (Thompson, 1933; Chapelle & Li, 2011; Russo et al., 2018). There is also some recent work on approximate versions of Thompson sampling (Lu & Van Roy, 2017; Phan et al., 2019; Zhang et al., 2019; Yu et al., 2020). In this paper, we propose and analyze a novel variant of the Thompson sampling algorithm for multi-armed bandits. Our analysis uses an information-theoretic approach, similar to other papers in this line of research (Russo & Van Roy, 2016, 2018; Lu et al., 2021). In addition, our analysis makes use of the notions of environment proxy and learning target developed in Lu et al. (2021).

7 Conclusion

This paper aims to elucidate the importance of joint predictive distributions for a broad class of decision problems, including combinatorial and sequential decision problems. Specifically, we have shown that in a simple combinatorial decision problem (Section 3), a sequential prediction problem (Section 4), and a multi-armed bandit problem (Section 5), accurate marginal predictions are insufficient to drive effective decisions; instead, accurate joint predictive distributions are necessary for good performance. We have also shown that, in the multi-armed bandit problem, accurate joint predictive distributions are sufficient to enable near-optimal performance of a novel variant of Thompson sampling, by establishing a new kind of regret bound via information-theoretic analysis (Theorem 5.1). While we evaluate the accuracy of marginal and joint predictive distributions based on KL-divergence (Section 2), or equivalently, cross-entropy loss, our insights on marginal versus joint predictions should extend to other evaluation metrics.

References

  • Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002.
  • Bishop (2006) Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
  • Blundell et al. (2015) Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622. PMLR, 2015.
  • Chapelle & Li (2011) Chapelle, O. and Li, L. An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 24:2249–2257, 2011.
  • Cover (1999) Cover, T. M. Elements of information theory. John Wiley & Sons, 1999.
  • Friedman et al. (2001) Friedman, J., Hastie, T., Tibshirani, R., et al. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
  • Gal & Ghahramani (2016) Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.
  • Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT press, 2016.
  • Kingma & Ba (2015) Kingma, D. and Ba, J. Adam: A Method for Stochastic Optimization. Proceedings of the International Conference on Learning Representations, 2015.
  • Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020.
  • LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436, 2015.
  • Lu & Van Roy (2017) Lu, X. and Van Roy, B. Ensemble sampling. In Advances in Neural Information Processing Systems, pp. 3260–3268, 2017.
  • Lu et al. (2021) Lu, X., Van Roy, B., Dwaracherla, V., Ibrahimi, M., Osband, I., and Wen, Z. Reinforcement learning, bit by bit. arXiv preprint arXiv:2103.04047, 2021.
  • Mattner (2003) Mattner, L. Mean absolute deviations of sample means and minimally concentrated binomials. Annals of probability, pp.  914–925, 2003.
  • Osband & Van Roy (2015) Osband, I. and Van Roy, B. Bootstrapped Thompson sampling and deep exploration. arXiv preprint arXiv:1507.00300, 2015.
  • Osband et al. (2021) Osband, I., Wen, Z., Asghari, S. M., Dwaracherla, V., Hao, B., Ibrahimi, M., Lawson, D., Lu, X., O’Donoghue, B., and Van Roy, B. Evaluating predictive distributions: Does Bayesian deep learning work? arXiv preprint arXiv:2110.04629, 2021.
  • Papadimitriou & Steiglitz (1998) Papadimitriou, C. H. and Steiglitz, K. Combinatorial optimization: algorithms and complexity. Courier Corporation, 1998.
  • Phan et al. (2019) Phan, M., Abbasi-Yadkori, Y., and Domke, J. Thompson sampling with approximate inference. arXiv preprint arXiv:1908.04970, 2019.
  • Riquelme et al. (2018) Riquelme, C., Tucker, G., and Snoek, J. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127, 2018.
  • Russo & Van Roy (2016) Russo, D. and Van Roy, B. An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471, 2016.
  • Russo & Van Roy (2018) Russo, D. and Van Roy, B. Learning to optimize via information-directed sampling. Operations Research, 66(1):230–252, 2018.
  • Russo et al. (2018) Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., and Wen, Z. A tutorial on Thompson sampling. Found. Trends Mach. Learn., 11(1):1–96, July 2018. ISSN 1935-8237.
  • Shalev-Shwartz et al. (2011) Shalev-Shwartz, S. et al. Online learning and online convex optimization. Foundations and trends in Machine Learning, 4(2):107–194, 2011.
  • Thompson (1933) Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
  • Wang et al. (2021) Wang, C., Sun, S., and Grosse, R. Beyond marginal uncertainty: How accurately can Bayesian regression models estimate posterior predictive correlations? In International Conference on Artificial Intelligence and Statistics, pp.  2476–2484. PMLR, 2021.
  • Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pp.  681–688. Citeseer, 2011.
  • Yu et al. (2020) Yu, T., Kveton, B., Wen, Z., Zhang, R., and Mengshoel, O. J. Graphical models meet bandits: A variational Thompson sampling approach. In International Conference on Machine Learning, pp. 10902–10912. PMLR, 2020.
  • Zhang et al. (2019) Zhang, R., Wen, Z., Chen, C., and Carin, L. Scalable Thompson sampling via optimal transport. arXiv preprint arXiv:1902.07239, 2019.

Appendices

Appendix A Proof for Proposition 2.1

Proof.

From Pinsker’s inequality, we have

\delta\big(\overline{P}_{T+1:T+\tau},\hat{P}_{T+1:T+\tau}\big)\leq\sqrt{\tfrac{1}{2}\mathbf{d}_{\mathrm{KL}}\big(\overline{P}_{T+1:T+\tau}\,\big\|\,\hat{P}_{T+1:T+\tau}\big)},

where \delta is the total variation distance. Without loss of generality, we assume that the supports of \overline{P}_{T+1:T+\tau} and \hat{P}_{T+1:T+\tau} are countable. Then we have \delta\big(\overline{P}_{T+1:T+\tau},\hat{P}_{T+1:T+\tau}\big)=\frac{1}{2}\left\|\overline{P}_{T+1:T+\tau}-\hat{P}_{T+1:T+\tau}\right\|_{1}. Consequently, we have

\left\|\overline{P}_{T+1:T+\tau}-\hat{P}_{T+1:T+\tau}\right\|_{1}\leq\sqrt{2\,\mathbf{d}_{\mathrm{KL}}\big(\overline{P}_{T+1:T+\tau}\,\big\|\,\hat{P}_{T+1:T+\tau}\big)}.

Recall that \hat{a}\in\mathcal{A} maximizes

\sum_{y_{T+1:T+\tau}}\hat{P}_{T+1:T+\tau}(y_{T+1:T+\tau})\,r(a,y_{T+1:T+\tau})

over all a\in\mathcal{A}. Similarly, we assume a^{*}\in\mathcal{A} maximizes

\sum_{y_{T+1:T+\tau}}\overline{P}_{T+1:T+\tau}(y_{T+1:T+\tau})\,r(a,y_{T+1:T+\tau}),

over all a\in\mathcal{A}. Then we have

\left|\sum_{y_{T+1:T+\tau}}\overline{P}_{T+1:T+\tau}(y_{T+1:T+\tau})r(a^{*},y_{T+1:T+\tau})-\sum_{y_{T+1:T+\tau}}\hat{P}_{T+1:T+\tau}(y_{T+1:T+\tau})r(a^{*},y_{T+1:T+\tau})\right|
=\left|\sum_{y_{T+1:T+\tau}}\left(\overline{P}_{T+1:T+\tau}(y_{T+1:T+\tau})-\hat{P}_{T+1:T+\tau}(y_{T+1:T+\tau})\right)r(a^{*},y_{T+1:T+\tau})\right|
\leq\sum_{y_{T+1:T+\tau}}\left|\overline{P}_{T+1:T+\tau}(y_{T+1:T+\tau})-\hat{P}_{T+1:T+\tau}(y_{T+1:T+\tau})\right|r(a^{*},y_{T+1:T+\tau})
\stackrel{(a)}{\leq}\sum_{y_{T+1:T+\tau}}\left|\overline{P}_{T+1:T+\tau}(y_{T+1:T+\tau})-\hat{P}_{T+1:T+\tau}(y_{T+1:T+\tau})\right|
=\left\|\overline{P}_{T+1:T+\tau}-\hat{P}_{T+1:T+\tau}\right\|_{1}\leq\sqrt{2\,\mathbf{d}_{\mathrm{KL}}\big(\overline{P}_{T+1:T+\tau}\,\big\|\,\hat{P}_{T+1:T+\tau}\big)},

where (a) follows from r(a^{*},y_{T+1:T+\tau})\in[0,1]. Consequently, we have

\sum_{y_{T+1:T+\tau}}\overline{P}_{T+1:T+\tau}(y_{T+1:T+\tau})r(a^{*},y_{T+1:T+\tau})\leq\sum_{y_{T+1:T+\tau}}\hat{P}_{T+1:T+\tau}(y_{T+1:T+\tau})r(a^{*},y_{T+1:T+\tau})+\sqrt{2\,\mathbf{d}_{\mathrm{KL}}\big(\overline{P}_{T+1:T+\tau}\,\big\|\,\hat{P}_{T+1:T+\tau}\big)}
\leq\sum_{y_{T+1:T+\tau}}\hat{P}_{T+1:T+\tau}(y_{T+1:T+\tau})r(\hat{a},y_{T+1:T+\tau})+\sqrt{2\,\mathbf{d}_{\mathrm{KL}}\big(\overline{P}_{T+1:T+\tau}\,\big\|\,\hat{P}_{T+1:T+\tau}\big)},

where the second inequality follows from the fact that \hat{a} is optimal under \hat{P}_{T+1:T+\tau}. Taking expectations, we have

\mathbb{E}\left[r(\hat{a},Y_{T+1:T+\tau})\right]\geq\mathbb{E}\left[r(a^{*},Y_{T+1:T+\tau})\right]-\mathbb{E}\left[\sqrt{2\,\mathbf{d}_{\mathrm{KL}}\big(\overline{P}_{T+1:T+\tau}\,\big\|\,\hat{P}_{T+1:T+\tau}\big)}\right].

From Jensen’s inequality, we have

\mathbb{E}\left[r(a^{*},Y_{T+1:T+\tau})\right]\geq\max_{a\in\mathcal{A}}\mathbb{E}\left[r(a,Y_{T+1:T+\tau})\right]

and

𝔼[2𝐝KL(P¯T+1:T+τP^T+1:T+τ)]2𝔼[𝐝KL(P¯T+1:T+τP^T+1:T+τ)]=2𝐝KLτ.\mathbb{E}\left[\sqrt{2\mathbf{d}_{\mathrm{KL}}\big{(}\overline{P}_{T+1:T+\tau}\big{\|}\hat{P}_{T+1:T+\tau}\big{)}}\right]\leq\sqrt{2\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big{(}\overline{P}_{T+1:T+\tau}\big{\|}\hat{P}_{T+1:T+\tau}\big{)}\right]}=\sqrt{2\mathbf{d}_{\mathrm{KL}}^{\tau}}.

Consequently, we have

𝔼[r(a^,yT+1:T+τ)]maxa𝒜𝔼[r(a,yT+1:T+τ)]2𝐝KLτ.\mathbb{E}\left[r(\hat{a},y_{T+1:T+\tau})\right]\geq\max_{a\in\mathcal{A}}\mathbb{E}\left[r(a,y_{T+1:T+\tau})\right]-\sqrt{2\mathbf{d}_{\mathrm{KL}}^{\tau}}.

This concludes the proof. ∎
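
The key displayed inequality above, relating the expected reward of $a^{*}$ under $\overline{P}_{T+1:T+\tau}$ to that of $\hat{a}$ under $\hat{P}_{T+1:T+\tau}$, can also be verified numerically. The sketch below uses arbitrary illustrative rewards and distributions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setting: 3 actions, 6 joint outcomes, rewards r(a, y) in [0, 1],
# a conditional distribution P_bar and an approximation P_hat (all made up).
n_actions, n_outcomes = 3, 6
r = rng.uniform(0.0, 1.0, size=(n_actions, n_outcomes))
p_bar = rng.dirichlet(np.ones(n_outcomes))
p_hat = rng.dirichlet(np.ones(n_outcomes))

a_star = int(np.argmax(r @ p_bar))           # maximizer under P_bar
a_hat = int(np.argmax(r @ p_hat))            # maximizer under P_hat

kl = float(np.sum(p_bar * np.log(p_bar / p_hat)))
lhs = float(r[a_star] @ p_bar)               # sum_y P_bar(y) r(a*, y)
rhs = float(r[a_hat] @ p_hat) + np.sqrt(2 * kl)
assert lhs <= rhs + 1e-12
print(lhs, rhs)
```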

Appendix B Proof for Theorem 4.1

B.1 Proof for Theorem 4.1

Proof.

Notice that

\begin{align*}
&\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t+1:T}\in\cdot\,\middle|\,\mathcal{D}_{t},X_{t:T-1}\right)\,\big\|\,\mathds{P}(Y_{t+1:T}\in\cdot\,|\,\theta_{t},X_{t:T-1})\big)\right]\\
&\overset{(a)}{=}\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t+1}\in\cdot\,\middle|\,\mathcal{D}_{t},X_{t}\right)\,\big\|\,\mathds{P}(Y_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\big)\right]\\
&\qquad+\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t+2:T}\in\cdot\,\middle|\,\mathcal{D}_{t+1},X_{t+1:T-1}\right)\,\big\|\,\mathds{P}(Y_{t+2:T}\in\cdot\,|\,\theta_{t},X_{t},Y_{t+1},X_{t+1:T-1})\big)\right]\\
&\overset{(b)}{\leq}\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t+1}\in\cdot\,\middle|\,\mathcal{D}_{t},X_{t}\right)\,\big\|\,\mathds{P}(Y_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\big)\right]\\
&\qquad+\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t+2:T}\in\cdot\,\middle|\,\mathcal{D}_{t+1},X_{t+1:T-1}\right)\,\big\|\,\mathds{P}(Y_{t+2:T}\in\cdot\,|\,\theta_{t+1},X_{t+1:T-1})\big)\right]\\
&\leq\cdots\\
&\leq\mathbb{E}\left[\sum_{t^{\prime}=t}^{T-1}\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t^{\prime}+1}\in\cdot\,\middle|\,\mathcal{D}_{t^{\prime}},X_{t^{\prime}}\right)\,\big\|\,\mathds{P}(Y_{t^{\prime}+1}\in\cdot\,|\,\theta_{t^{\prime}},X_{t^{\prime}})\big)\right]\\
&\overset{(c)}{\leq}\mathbb{E}\left[\sum_{t^{\prime}=t}^{T-1}\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t^{\prime}+1}\in\cdot\,\middle|\,\mathcal{D}_{t^{\prime}},X_{t^{\prime}}\right)\,\big\|\,\mathds{P}(\hat{Y}_{t^{\prime}+1}\in\cdot\,|\,\theta_{t^{\prime}},X_{t^{\prime}})\big)\right]\\
&\overset{(d)}{=}\mathbb{E}\left[\sum_{t^{\prime}=t}^{T-1}\mathbf{d}_{\mathrm{KL}}\big(\bar{P}_{t^{\prime}+1}\,\big\|\,\hat{P}_{t^{\prime}+1}\big)\right]\leq\epsilon,
\end{align*}

where (a) follows from the chain rule of KL divergence, (b) and (c) follow from the fact that the conditional predictive distribution minimizes the KL divergence to the posterior (see Lemma B.1), and (d) follows from the definition. On the other hand, by definition, we have

\[\mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t}\,\middle|\,\theta_{t},X_{t:T-1}\right)=\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t+1:T}\in\cdot\,\middle|\,\mathcal{D}_{t},X_{t:T-1}\right)\,\big\|\,\mathds{P}(Y_{t+1:T}\in\cdot\,|\,\theta_{t},X_{t:T-1})\big)\right]\leq\epsilon.\]

Moreover, from the chain rule of mutual information, we have

\begin{align*}
\mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t},\theta_{t}\,\middle|\,X_{t:T-1}\right)&=\mathbb{I}\left(Y_{t+1:T};\theta_{t}\,\middle|\,X_{t:T-1}\right)+\underbrace{\mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t}\,\middle|\,\theta_{t},X_{t:T-1}\right)}_{\leq\epsilon}\\
&=\mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t}\,\middle|\,X_{t:T-1}\right)+\underbrace{\mathbb{I}\left(Y_{t+1:T};\theta_{t}\,\middle|\,\mathcal{D}_{t},X_{t:T-1}\right)}_{=0},
\end{align*}

where $\mathbb{I}\left(Y_{t+1:T};\theta_{t}\,\middle|\,\mathcal{D}_{t},X_{t:T-1}\right)=0$ since $Y_{t+1:T}$ and $\theta_{t}$ are conditionally independent given $\mathcal{D}_{t}$ and $X_{t:T-1}$. Hence we have

\[\mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t}\,\middle|\,X_{t:T-1}\right)\geq\mathbb{I}\left(Y_{t+1:T};\theta_{t}\,\middle|\,X_{t:T-1}\right)\geq\mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t}\,\middle|\,X_{t:T-1}\right)-\epsilon.\]

This concludes the proof. ∎
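
Step (a) above, the chain rule for KL divergence, can be illustrated on a toy two-step joint distribution. The probabilities below are arbitrary illustrative values, not tied to any model in the paper:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Toy joint distributions over a pair (Y1, Y2), each binary (illustrative values).
P = np.array([[0.30, 0.20],
              [0.25, 0.25]])
Q = np.array([[0.20, 0.30],
              [0.30, 0.20]])

# Chain rule: KL(P(Y1, Y2) || Q(Y1, Y2))
#   = KL(P(Y1) || Q(Y1)) + E_{Y1 ~ P}[ KL(P(Y2 | Y1) || Q(Y2 | Y1)) ].
joint = kl(P, Q)
marginal = kl(P.sum(axis=1), Q.sum(axis=1))
conditional = sum(P.sum(axis=1)[y1] * kl(P[y1] / P[y1].sum(), Q[y1] / Q[y1].sum())
                  for y1 in range(2))
assert np.isclose(joint, marginal + conditional)
```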

B.2 Lemma B.1

Lemma B.1.

Let the agent parameter $\theta_{t}$ be conditionally independent of $\mathcal{E}$ given $\mathcal{D}_{t}$, i.e., $\theta_{t}\perp\mathcal{E}\,|\,\mathcal{D}_{t}$. Then the conditional predictive distribution $\mathds{P}(Y_{t+1}\in\cdot\,|\,\theta_{t},X_{t})$ minimizes the expected KL divergence

\[\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\left(\mathds{P}(Y_{t+1}\in\cdot\,|\,\mathcal{D}_{t},X_{t})\,\big\|\,\mathds{P}(\hat{Y}_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\right)\right]\]

over all possible predictive distributions $\mathds{P}(\hat{Y}_{t+1}\in\cdot\,|\,\theta_{t},X_{t})$.

Proof.

We have

\begin{align*}
&\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\left(\mathds{P}(Y_{t+1}\in\cdot\,|\,\mathcal{D}_{t},X_{t})\,\big\|\,\mathds{P}(\hat{Y}_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\right)\right]\\
&=\mathbb{E}\left[\sum_{y}\mathds{P}(Y_{t+1}=y\,|\,\mathcal{D}_{t},X_{t})\log\left(\frac{\mathds{P}(Y_{t+1}=y\,|\,\mathcal{D}_{t},X_{t})}{\mathds{P}(Y_{t+1}=y\,|\,\theta_{t},X_{t})}\cdot\frac{\mathds{P}(Y_{t+1}=y\,|\,\theta_{t},X_{t})}{\mathds{P}(\hat{Y}_{t+1}=y\,|\,\theta_{t},X_{t})}\right)\right]\\
&=\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\left(\mathds{P}(Y_{t+1}\in\cdot\,|\,\mathcal{D}_{t},X_{t})\,\big\|\,\mathds{P}(Y_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\right)\right]\\
&\qquad+\mathbb{E}\left[\sum_{y}\mathds{P}(Y_{t+1}=y\,|\,\mathcal{D}_{t},X_{t})\log\frac{\mathds{P}(Y_{t+1}=y\,|\,\theta_{t},X_{t})}{\mathds{P}(\hat{Y}_{t+1}=y\,|\,\theta_{t},X_{t})}\right]\\
&=\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\left(\mathds{P}(Y_{t+1}\in\cdot\,|\,\mathcal{D}_{t},X_{t})\,\big\|\,\mathds{P}(Y_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\right)\right]\\
&\qquad+\mathbb{E}\left[\sum_{y}\mathds{P}(Y_{t+1}=y\,|\,\theta_{t},X_{t})\log\frac{\mathds{P}(Y_{t+1}=y\,|\,\theta_{t},X_{t})}{\mathds{P}(\hat{Y}_{t+1}=y\,|\,\theta_{t},X_{t})}\right]\\
&=\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\left(\mathds{P}(Y_{t+1}\in\cdot\,|\,\mathcal{D}_{t},X_{t})\,\big\|\,\mathds{P}(Y_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\right)\right]\\
&\qquad+\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\left(\mathds{P}(Y_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\,\big\|\,\mathds{P}(\hat{Y}_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\right)\right]\\
&\geq\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\left(\mathds{P}(Y_{t+1}\in\cdot\,|\,\mathcal{D}_{t},X_{t})\,\big\|\,\mathds{P}(Y_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\right)\right],
\end{align*}

where the third equality uses the tower property of conditional expectation together with the fact that $Y_{t+1}$ and $\theta_{t}$ are conditionally independent given $(\mathcal{D}_{t},X_{t})$, and the final inequality follows from the non-negativity of the KL divergence.

Similarly, we can prove that

\begin{equation}
\begin{split}
&\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t+2:T}\in\cdot\,\middle|\,\mathcal{D}_{t+1},X_{t+1:T-1}\right)\,\big\|\,\mathds{P}(Y_{t+2:T}\in\cdot\,|\,\theta_{t},X_{t},Y_{t+1},X_{t+1:T-1})\big)\right]\\
&\qquad\leq\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t+2:T}\in\cdot\,\middle|\,\mathcal{D}_{t+1},X_{t+1:T-1}\right)\,\big\|\,\mathds{P}(Y_{t+2:T}\in\cdot\,|\,\theta_{t+1},X_{t+1:T-1})\big)\right].
\end{split}
\tag{11}
\end{equation}

To see this, notice that $\theta_{t+1}\sim\mathds{P}(\theta_{t+1}\in\cdot\,|\,\theta_{t},X_{t},Y_{t+1},t)$, and hence $\mathds{P}(Y_{t+2:T}\in\cdot\,|\,\theta_{t+1},X_{t+1:T-1})$ is a particular choice of $\mathds{P}(\hat{Y}_{t+2:T}\in\cdot\,|\,\theta_{t},X_{t},Y_{t+1},X_{t+1:T-1})$. Consequently, an analysis similar to the proof of Lemma B.1 establishes inequality (11). ∎
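
The content of Lemma B.1, that among all predictive distributions depending only on $(\theta_{t},X_{t})$ the conditional predictive distribution attains the smallest expected KL divergence to the posterior predictive, can also be checked numerically. The following sketch assumes a toy discrete model in which $\theta_{t}$ is a deterministic summary of $\mathcal{D}_{t}$; all quantities are illustrative assumptions, not from the paper:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Toy model (illustrative): D_t takes 3 values; theta_t = f(D_t) is a coarse summary.
p_d = np.array([0.5, 0.3, 0.2])                  # distribution of D_t
p_y_given_d = np.array([[0.9, 0.1],              # posterior predictive P(Y_{t+1} | D_t = d)
                        [0.6, 0.4],
                        [0.2, 0.8]])
theta_of_d = np.array([0, 1, 0])                 # theta_t as a deterministic function of D_t

def expected_kl(q_given_theta):
    """E[ d_KL( P(Y_{t+1} | D_t) || q(Y_{t+1} | theta_t) ) ]."""
    return sum(p_d[d] * kl(p_y_given_d[d], q_given_theta[theta_of_d[d]]) for d in range(3))

# Conditional predictive distribution P(Y_{t+1} | theta_t): mixture of P(Y|D) over P(D|theta).
cond_pred = np.zeros((2, 2))
for th in range(2):
    w = p_d * (theta_of_d == th)
    cond_pred[th] = (w[:, None] * p_y_given_d).sum(axis=0) / w.sum()

# No other theta-measurable predictive distribution achieves a smaller expected KL.
rng = np.random.default_rng(1)
for _ in range(1000):
    q = rng.dirichlet(np.ones(2), size=2)        # an arbitrary alternative predictive
    assert expected_kl(cond_pred) <= expected_kl(q) + 1e-9
```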

Appendix C Proof for Theorem 5.1

Proof.

We will now analyze the approximate TS algorithm. For this purpose, we continue to use a proxy $\tilde{\mathcal{E}}=\tilde{\mathbf{Y}}_{1:\tau}$ and target $\tilde{A}=\min\operatorname*{arg\,max}_{k}\tilde{p}_{k}$. Further, we define $\tilde{\mathbf{Y}}^{t}_{1:\tau}\sim\mathds{P}(\tilde{\mathbf{Y}}_{1:\tau}\in\cdot\,|\,H_{t})$, $\tilde{p}^{t}\sim\mathds{P}(p\in\cdot\,|\,\tilde{\mathbf{Y}}_{1:\tau}=\tilde{\mathbf{Y}}^{t}_{1:\tau})$, and $\tilde{A}_{t}=\min\operatorname*{arg\,max}_{k}\tilde{p}^{t}_{k}$.
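
For concreteness, the sketch below shows one way these proxy quantities could be instantiated in a Bernoulli bandit with Beta beliefs; the Beta-Bernoulli model, the pseudo-counts, and the variable names are assumptions made only for illustration, not the paper's general setting:

```python
import numpy as np

rng = np.random.default_rng(0)
K, tau = 5, 100                        # number of arms, length of the imagined future

# Assumed Beta-Bernoulli instantiation: each p_k has a Beta prior, and (a_t, b_t)
# are pseudo-counts summarizing the history H_t.
a0 = np.ones(K)
b0 = np.ones(K)
a_t, b_t = a0 + 3.0, b0 + 2.0          # made-up posterior counts after some history

# Y_tilde^t_{1:tau} ~ P(Y_tilde_{1:tau} in . | H_t): imagine tau outcomes per arm by
# sampling p from the posterior and then sampling Bernoulli outcomes (a joint sample).
p_post = rng.beta(a_t, b_t)
y_tilde_t = rng.binomial(1, p_post, size=(tau, K))

# p_tilde^t ~ P(p in . | Y_tilde_{1:tau} = Y_tilde^t_{1:tau}): resample p given only
# the imagined outcomes, under the prior.
s = y_tilde_t.sum(axis=0)
p_tilde_t = rng.beta(a0 + s, b0 + tau - s)

# A_tilde_t = min argmax_k p_tilde^t_k (np.argmax breaks ties toward the smallest index).
a_tilde_t = int(np.argmax(p_tilde_t))
print(a_tilde_t, np.round(p_tilde_t, 2))
```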

Notice that

\begin{equation}
\mathbb{E}[p_{A^{*}}-p_{A_{t}}]=\mathbb{E}[p_{A^{*}}-p_{\tilde{A}}]+\mathbb{E}[p_{\tilde{A}}-p_{A_{t}}],
\tag{12}
\end{equation}

and we first show that $\mathbb{E}[p_{A^{*}}-p_{\tilde{A}}]\leq\frac{K}{\sqrt{2\tau}}$. Notice that

\begin{align*}
\mathbb{E}[p_{A^{*}}-p_{\tilde{A}}]&=\mathbb{E}\big[\mathbb{E}[p_{A^{*}}-p_{\tilde{A}}\,|\,\tilde{\mathcal{E}}]\big]=\mathbb{E}\big[\mathbb{E}[\tilde{p}_{\tilde{A}}-p_{\tilde{A}}\,|\,\tilde{\mathcal{E}}]\big]\\
&\leq\mathbb{E}\big[\mathbb{E}[\max\nolimits_{k}(\tilde{p}_{k}-p_{k})\,|\,\tilde{\mathcal{E}}]\big]\leq\sum_{k=1}^{K}\mathbb{E}\big[|\tilde{p}_{k}-p_{k}|\big]\stackrel{(a)}{\leq}\frac{K}{\sqrt{2\tau}},
\end{align*}

where the second equality holds since, conditioned on $\tilde{\mathcal{E}}$, the pairs $(p,A^{*})$ and $(\tilde{p},\tilde{A})$ are identically distributed.

Please refer to (Mattner, 2003) for the proof of inequality (a).
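
Inequality (a) can also be checked by simulation. The following Monte-Carlo sketch assumes a uniform prior over each $p_{k}$ and estimates $\mathbb{E}[|\tilde{p}_{k}-p_{k}|]$, which should not exceed $1/\sqrt{2\tau}$; this is an illustration, not a proof:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, n_mc = 50, 200_000

# p ~ Uniform(0, 1); Y_tilde_{1:tau} ~ Bernoulli(p); p_tilde ~ P(p | Y_tilde_{1:tau}).
p = rng.uniform(size=n_mc)
s = rng.binomial(tau, p)                    # imagined successes out of tau
p_tilde = rng.beta(1 + s, 1 + tau - s)      # independent resample of p given Y_tilde

estimate = np.abs(p_tilde - p).mean()       # Monte-Carlo estimate of E|p_tilde - p|
bound = 1 / np.sqrt(2 * tau)
assert estimate <= bound
print(estimate, bound)
```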

Now we are going to bound the second term $\mathbb{E}[p_{\tilde{A}}-p_{A_{t}}]$ on the RHS of Equation (12) by bounding its conditional version $\mathbb{E}_{t}[p_{\tilde{A}}-p_{A_{t}}]$ in terms of the information gain $\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})$, where the subscript $t$ in $\mathbb{E}_{t}$ and $\mathbb{I}_{t}$ indicates that the expectation and the mutual information are conditioned on the realized history $H_{t}$. In other words, $\mathbb{E}_{t}[p_{\tilde{A}}-p_{A_{t}}]=\mathbb{E}[p_{\tilde{A}}-p_{A_{t}}\,|\,H_{t}]$ and $\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})=\mathbb{I}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}}\,|\,H_{t}=H_{t})$. We first note that

\begin{align*}
\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})&=\mathbb{I}_{t}(\tilde{A};A_{t})+\mathbb{I}_{t}(\tilde{A};\mathbf{Y}_{A_{t}}\,|\,A_{t})\\
&=\mathbb{I}_{t}(\tilde{A};\mathbf{Y}_{A_{t}}\,|\,A_{t})\\
&=\sum_{a}\mathds{P}_{t}(A_{t}=a)\,\mathbb{I}_{t}(\tilde{A};\mathbf{Y}_{A_{t}}\,|\,A_{t}=a)\\
&=\sum_{a}\mathds{P}_{t}(A_{t}=a)\,\mathbb{I}_{t}(\tilde{A};\mathbf{Y}_{a})\\
&=\sum_{a}\mathds{P}_{t}(A_{t}=a)\left(\sum_{\tilde{a}}\mathds{P}_{t}(\tilde{A}=\tilde{a})\,\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}_{t}(\mathbf{Y}_{a}\mid\tilde{A}=\tilde{a})\,\big\|\,\mathds{P}_{t}(\mathbf{Y}_{a})\big)\right)\\
&=\sum_{a,\tilde{a}}\mathds{P}_{t}(A_{t}=a)\,\mathds{P}_{t}(\tilde{A}=\tilde{a})\,\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}_{t}(\mathbf{Y}_{a}\mid\tilde{A}=\tilde{a})\,\big\|\,\mathds{P}_{t}(\mathbf{Y}_{a})\big),
\end{align*}

where the first equality uses the chain rule of mutual information; the second equality follows from the fact that $A_{t}$ and $\tilde{A}$ are conditionally independent given $H_{t}$, so $\mathbb{I}_{t}(\tilde{A};A_{t})=0$; the fourth equality uses the fact that $A_{t}$ is conditionally jointly independent of $\tilde{A}$ and $\mathbf{Y}=(\mathbf{Y}_{1},\ldots,\mathbf{Y}_{K})$; and the fifth equality follows from the KL-divergence form of mutual information.

Now we are ready to bound $\mathbb{E}_{t}[p_{\tilde{A}}-p_{A_{t}}]$ in terms of $\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})$:

\begin{align*}
\mathbb{E}_{t}[p_{\tilde{A}}-p_{A_{t}}]&=\mathbb{E}_{t}[\mathbf{Y}_{\tilde{A}}-\mathbf{Y}_{A_{t}}]\\
&=\sum_{a}\mathds{P}_{t}(\tilde{A}=a)\,\mathbb{E}_{t}[\mathbf{Y}_{a}\,|\,\tilde{A}=a]-\sum_{a}\mathds{P}_{t}(A_{t}=a)\,\mathbb{E}_{t}[\mathbf{Y}_{a}\,|\,A_{t}=a]\\
&=\sum_{a}\mathds{P}_{t}(\tilde{A}=a)\,\mathbb{E}_{t}[\mathbf{Y}_{a}\,|\,\tilde{A}=a]-\sum_{a}\mathds{P}_{t}(A_{t}=a)\,\mathbb{E}_{t}[\mathbf{Y}_{a}]\\
&=\sum_{a}\sqrt{\mathds{P}_{t}(A_{t}=a)\,\mathds{P}_{t}(\tilde{A}=a)}\,\big(\mathbb{E}_{t}[\mathbf{Y}_{a}\,|\,\tilde{A}=a]-\mathbb{E}_{t}[\mathbf{Y}_{a}]\big)\\
&\qquad+\sum_{a}\Big(\sqrt{\mathds{P}_{t}(\tilde{A}=a)}-\sqrt{\mathds{P}_{t}(A_{t}=a)}\Big)\Big(\sqrt{\mathds{P}_{t}(\tilde{A}=a)}\,\mathbb{E}_{t}[\mathbf{Y}_{a}\,|\,\tilde{A}=a]+\sqrt{\mathds{P}_{t}(A_{t}=a)}\,\mathbb{E}_{t}[\mathbf{Y}_{a}]\Big)\\
&\leq\sqrt{K\sum_{a}\mathds{P}_{t}(A_{t}=a)\,\mathds{P}_{t}(\tilde{A}=a)\,\big(\mathbb{E}_{t}[\mathbf{Y}_{a}\,|\,\tilde{A}=a]-\mathbb{E}_{t}[\mathbf{Y}_{a}]\big)^{2}}\\
&\qquad+\sum_{a}\Big|\sqrt{\mathds{P}_{t}(\tilde{A}=a)}-\sqrt{\mathds{P}_{t}(A_{t}=a)}\Big|\Big(\sqrt{\mathds{P}_{t}(\tilde{A}=a)}+\sqrt{\mathds{P}_{t}(A_{t}=a)}\Big)\\
&\leq\sqrt{K\sum_{a,\tilde{a}}\mathds{P}_{t}(A_{t}=a)\,\mathds{P}_{t}(\tilde{A}=\tilde{a})\,\big(\mathbb{E}_{t}[\mathbf{Y}_{a}\,|\,\tilde{A}=\tilde{a}]-\mathbb{E}_{t}[\mathbf{Y}_{a}]\big)^{2}}+\sum_{a}\big|\mathds{P}_{t}(\tilde{A}=a)-\mathds{P}_{t}(A_{t}=a)\big|\\
&\leq\sqrt{\frac{K}{2}\sum_{a,\tilde{a}}\mathds{P}_{t}(A_{t}=a)\,\mathds{P}_{t}(\tilde{A}=\tilde{a})\,\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}_{t}(\mathbf{Y}_{a}\mid\tilde{A}=\tilde{a})\,\big\|\,\mathds{P}_{t}(\mathbf{Y}_{a})\big)}+\sum_{a}\big|\mathds{P}_{t}(\tilde{A}_{t}=a)-\mathds{P}_{t}(A_{t}=a)\big|\\
&=\sqrt{\frac{K}{2}\,\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})}+\sum_{a}\big|\mathds{P}_{t}(\tilde{A}_{t}=a)-\mathds{P}_{t}(A_{t}=a)\big|,
\end{align*}

where the third equality uses the fact that $A_{t}$ is conditionally independent of $\mathbf{Y}=(\mathbf{Y}_{1},\ldots,\mathbf{Y}_{K})$; the first inequality follows from the Cauchy–Schwarz inequality and $\mathbf{Y}_{a}\in\{0,1\}$; the second inequality holds because adding the off-diagonal terms $\tilde{a}\neq a$ only increases the first sum and $\big|\sqrt{x}-\sqrt{y}\big|\big(\sqrt{x}+\sqrt{y}\big)=|x-y|$; and the third inequality uses Pinsker's inequality and $\mathds{P}_{t}(\tilde{A}=a)=\mathds{P}_{t}(\tilde{A}_{t}=a)$. Hence,

\begin{align*}
\mathbb{E}[p_{\tilde{A}}-p_{A_{t}}]&=\mathbb{E}\big[\mathbb{E}_{t}[p_{\tilde{A}}-p_{A_{t}}]\big]\\
&\leq\mathbb{E}\left[\sqrt{\frac{K}{2}\,\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})}\right]+\mathbb{E}\left[\sum_{a}\big|\mathds{P}_{t}(\tilde{A}_{t}=a)-\mathds{P}_{t}(A_{t}=a)\big|\right]\\
&\leq\sqrt{\frac{K}{2}\,\mathbb{E}\left[\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})\right]}+\mathbb{E}\left[\sqrt{2\,\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}(\tilde{p}^{t}\in\cdot\,|\,H_{t})\,\big\|\,\mathds{P}(\hat{p}^{t}\in\cdot\,|\,H_{t})\big)}\right]\\
&\leq\sqrt{\frac{K}{2}\,\mathbb{E}\left[\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})\right]}+\mathbb{E}\left[\sqrt{2\,\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}(\tilde{\mathbf{Y}}_{1:\tau}\in\cdot\,|\,H_{t})\,\big\|\,\mathds{P}(\hat{\mathbf{Y}}^{t}_{1:\tau}\in\cdot\,|\,H_{t})\big)}\right]\\
&\leq\sqrt{\frac{K}{2}\,\mathbb{E}\left[\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})\right]}+\sqrt{2\epsilon},
\end{align*}

where the second inequality follows from Jensen's inequality and Pinsker's inequality, and the third inequality uses the information processing inequality; thus

\[\mathbb{E}[p_{A^{*}}-p_{A_{t}}]=\mathbb{E}[p_{A^{*}}-p_{\tilde{A}}]+\mathbb{E}[p_{\tilde{A}}-p_{A_{t}}]\leq\frac{K}{\sqrt{2\tau}}+\sqrt{\frac{K}{2}\,\mathbb{E}\left[\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})\right]}+\sqrt{2\epsilon}.\]

Using an analysis similar to that of Lu et al. (2021), we obtain

\begin{align*}
{\rm Regret}(T)&=\sum_{t=0}^{T-1}\mathbb{E}[p_{A^{*}}-p_{A_{t}}]\\
&\leq\sum_{t=0}^{T-1}\left(\frac{K}{\sqrt{2\tau}}+\sqrt{\frac{K}{2}\,\mathbb{E}\left[\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})\right]}+\sqrt{2\epsilon}\right)\\
&\leq\sqrt{\frac{K}{2}T}\sqrt{\sum_{t=0}^{T-1}\mathbb{E}\left[\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})\right]}+\left(\frac{K}{\sqrt{2\tau}}+\sqrt{2\epsilon}\right)T\\
&\leq\sqrt{\frac{K}{2}T}\sqrt{\mathbb{I}(\tilde{A};A_{0},\mathbf{Y}_{A_{0}},\ldots,A_{T-1},\mathbf{Y}_{A_{T-1}})}+\left(\frac{K}{\sqrt{2\tau}}+\sqrt{2\epsilon}\right)T\\
&=\sqrt{\frac{K}{2}T}\sqrt{\mathbb{H}(\tilde{A})-\mathbb{H}(\tilde{A}\mid A_{0},\mathbf{Y}_{A_{0}},\ldots,A_{T-1},\mathbf{Y}_{A_{T-1}})}+\left(\frac{K}{\sqrt{2\tau}}+\sqrt{2\epsilon}\right)T\\
&\leq\sqrt{\frac{K}{2}T\log(K)}+\left(\frac{K}{\sqrt{2\tau}}+\sqrt{2\epsilon}\right)T,
\end{align*}

where the second inequality uses the Cauchy–Schwarz inequality; the third inequality follows from the chain rule for mutual information; and the last inequality follows from $\mathbb{H}(\tilde{A})\leq\log(K)$ and the non-negativity of conditional entropy. This completes the proof. ∎
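
To get a sense of how the final bound scales, the short sketch below evaluates $\sqrt{\tfrac{K}{2}T\log(K)}+\big(\tfrac{K}{\sqrt{2\tau}}+\sqrt{2\epsilon}\big)T$ for a few illustrative parameter settings (arbitrary example values, not experiments from the paper):

```python
import numpy as np

def regret_bound(K, T, tau, eps):
    """sqrt(K*T*log(K)/2) + (K/sqrt(2*tau) + sqrt(2*eps)) * T, as in the last display."""
    return np.sqrt(0.5 * K * T * np.log(K)) + (K / np.sqrt(2 * tau) + np.sqrt(2 * eps)) * T

for K, T, tau, eps in [(10, 1_000, 10_000, 0.0), (10, 1_000, 10_000, 1e-4), (50, 10_000, 10**6, 1e-5)]:
    print(f"K={K}, T={T}, tau={tau}, eps={eps}: bound = {regret_bound(K, T, tau, eps):.1f}")
```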