
From Predictions to Decisions:
The Importance of Joint Predictive Distributions

Zheng Wen    Ian Osband    Chao Qin    Xiuyuan Lu    Morteza Ibrahimi    Vikranth Dwaracherla    Mohammad Asghari    Benjamin Van Roy
Abstract

A fundamental challenge for any intelligent system is prediction: given some inputs, can you predict corresponding outcomes? Most work on supervised learning has focused on producing accurate marginal predictions for each input. However, we show that for a broad class of decision problems, accurate joint predictions are required to deliver good performance. In particular, we establish several results pertaining to combinatorial decision problems, sequential predictions, and multi-armed bandits to elucidate the essential role of joint predictive distributions. Our treatment of multi-armed bandits introduces an approximate Thompson sampling algorithm and analytic techniques that lead to a new kind of regret bound.


1 Introduction

We are motivated by the problem of an intelligent agent interacting with an unknown environment. Although the agent is initially uncertain of the true dynamics of the environment, it might learn to predict aspects of its evolution through observation and learning. Predictions can then guide decisions aimed at optimizing future outcomes.

Most work on supervised learning has focused on producing accurate marginal predictions for each input. For instance, in the image classification problem, the goal is to predict the class label of each image correctly. Though accurate marginal predictions are sufficient for some applications, in this paper, we show that for a broad class of downstream tasks, accurate joint predictions are essential. In particular, we show that even in simple combinatorial and sequential decision problems, accurate marginal predictions are insufficient for effective decisions. Rather, accurate joint predictive distributions are required to achieve good performance. We also show that, in the multi-armed bandit, a classical sequential decision problem, accurate joint predictive distributions enable near-optimal performance.

The cross-entropy loss is perhaps the most widely used performance metric in machine learning and plays a central role in training deep learning systems (LeCun et al., 2015). It is typically used to evaluate the quality of marginal predictions but readily extends to address joint predictive distributions. Baselining the cross-entropy loss with an agent-independent constant gives rise to expected KL-divergence. Hence, the cross-entropy loss is equivalent to the expected KL-divergence as a metric for ranking agents. We will focus in this paper on the use of the expected KL-divergence because the baselining allows for more elegant analysis and framing of results. In particular, the KL-divergence is closely related to fundamental notions in information theory (Cover, 1999), such as entropy and mutual information. It is worth clarifying that the cross-entropy loss and expected KL-divergence are just particular choices of evaluation metrics. The importance of evaluating joint predictions, not just marginals, should similarly be apparent under other evaluation metrics.

To elucidate the essential role of joint predictive distributions in driving downstream decision tasks, we establish several results pertaining to the universality of our metric across tasks, combinatorial decision problems, sequential predictions, and multi-armed bandits. Among these, our study of multi-armed bandits is particularly innovative, entailing the development of a novel variant of the Thompson sampling (TS) algorithm (Thompson, 1933; Russo et al., 2018) and original analytic techniques that lead to a new kind of regret bound (Theorem 5.1). Our proposed approximate TS algorithm is a generalization of the standard (exact) TS algorithm and is general in the sense that it does not depend on the agent's uncertainty representation. Instead, it only requires that the agent can simulate hypothetical observations, sampled from a joint predictive distribution. The new analytic techniques required to establish our regret bound build on recent progress in information-theoretic analysis of bandit and reinforcement learning algorithms (Russo & Van Roy, 2016; Lu et al., 2021). We believe that the key ideas can also offer new insight into approximate TS algorithms beyond the scope of this paper, including those designed for more general sequential decision problems, such as episodic reinforcement learning.

Also novel is our analysis of sequential prediction. In this case, we establish a result that highlights the importance of joint predictive distributions even when an agent aims only to produce accurate marginal predictions sequentially, as it streams data. In particular, for the agent to be able to incrementally update its knowledge representation as it streams, it must also be capable of generating accurate joint, not just marginal, predictive distributions.

The remainder of the paper is organized as follows: in Section 2, we introduce the notation for this paper via a standard supervised learning setting. We also review and discuss the evaluation metric for joint predictive distributions. In Section 3, we illustrate through a simple recommendation system example that accurate marginal predictions are insufficient to guide effective decisions in combinatorial decision problems, and that accurate joint predictive distributions are essential. In Section 4, we formally prove that in a sequential prediction problem, an agent that updates its parameters incrementally must retain sufficient information about joint predictive distributions if it is to achieve good performance, even if the performance metric depends only on marginal distributions. In Section 5, we consider the classical multi-armed bandit and show that accurate marginal predictions are insufficient to guide good decisions. We also establish a regret bound for our proposed TS algorithm: if the agent produces accurate joint predictive distributions, then the TS algorithm can attain near-optimal regret. Finally, we discuss related work in Section 6 and conclude in Section 7.

2 Evaluating joint predictive distributions

In this section, we introduce the notation for this paper via a standard supervised learning framework. We also motivate and discuss the evaluation metrics used in this paper for joint predictive distributions.

2.1 Supervised learning

Consider a sequence of pairs ((X_{t},Y_{t+1}):t=0,1,2,\ldots), where each X_{t} is a feature vector and each Y_{t+1} is its target label. Feature vectors (X_{t}:t=0,1,2,\ldots) are i.i.d. according to an input distribution P_{X}. Each target label Y_{t+1} is independent of all other data, conditioned on X_{t}, and distributed according to \mathcal{E}(\cdot|X_{t}). The conditional distribution \mathcal{E} is referred to as the environment. The environment \mathcal{E} is random; this reflects the agent's uncertainty about how labels are generated given features. Note that \mathds{P}(Y_{t+1}\in\cdot|\mathcal{E},X_{t})=\mathcal{E}(\cdot|X_{t}) and \mathds{P}(Y_{t+1}\in\cdot|X_{t})=\mathbb{E}[\mathcal{E}(\cdot|X_{t})|X_{t}].

In supervised learning, we consider an agent that learns about the environment \mathcal{E} from a training dataset

\mathcal{D}_{T}\equiv((X_{t},Y_{t+1}):t=0,1,\ldots,T-1),

and aims to predict the target labels

Y_{T+1:T+\tau}\equiv(Y_{T+1},\dots,Y_{T+\tau})

at \tau feature vectors X_{T:T+\tau-1}\equiv(X_{T},\dots,X_{T+\tau-1}).

Conditioned on the environment \mathcal{E}, a predictive distribution over the target labels is given by

P^{*}_{T+1:T+\tau}\equiv\mathds{P}\left(Y_{T+1:T+\tau}\in\cdot\,\middle|\,\mathcal{E},X_{T:T+\tau-1}\right).

Conditioned instead on the training data, the predictive distribution becomes

\overline{P}_{T+1:T+\tau}\equiv\mathds{P}\left(Y_{T+1:T+\tau}\in\cdot\,\middle|\,\mathcal{D}_{T},X_{T:T+\tau-1}\right).

In a sense, \overline{P}_{T+1:T+\tau} represents the result of perfect inference. Since the associated computations are intractable for environments of interest, we instead consider agents that approximate this predictive distribution. Specifically, we consider agents that represent the approximation in terms of a generative model. The agent's predictions are parameterized by a vector \theta_{T} that the agent learns from the training data \mathcal{D}_{T}. The vector \theta_{T} is independent of \mathcal{E} conditioned on \mathcal{D}_{T}. For any inputs X_{T:T+\tau-1}, \theta_{T} determines a predictive distribution, which could be used to sample imagined outcomes \hat{Y}_{T+1:T+\tau}. Hence, the agent's \tau^{\rm th}-order predictive distribution is given by

\hat{P}_{T+1:T+\tau}\equiv\mathds{P}(\hat{Y}_{T+1:T+\tau}\in\cdot\,|\,\theta_{T},X_{T:T+\tau-1}).

When \tau=1, we alternatively use \hat{P}_{T+1}, \overline{P}_{T+1}, and P^{*}_{T+1} to denote \hat{P}_{T+1:T+\tau}, \overline{P}_{T+1:T+\tau}, and P^{*}_{T+1:T+\tau}, respectively.

Note that if \tau=1, \hat{P}_{T+1} represents a marginal prediction: it predicts the label Y_{T+1} for a single input X_{T}. On the other hand, for \tau>1, it represents a joint prediction over labels at \tau input features.

2.2 Marginal vs. joint predictive distributions

Before proceeding, we use a simple coin flipping example to illustrate that evaluating marginal and joint predictive distributions can result in very different answers. Suppose that (Y_{t+1}:t=0,1,\ldots) are generated by repeated tosses of a possibly biased coin with unknown probability p of heads, with Y_{t+1}=1 and Y_{t+1}=0 indicating heads and tails, respectively. Consider two agents with different beliefs: Agent 1 assumes p=2/3 and models the outcome of each coin toss as independent conditioned on p. Agent 2 assumes that p=1 with probability 2/3 and p=0 with probability 1/3; that is, the coin either produces only heads or only tails. Let \hat{Y}^{1}_{t+1} and \hat{Y}^{2}_{t+1} denote the outcomes imagined by the two agents. Despite their differing assumptions, the two agents generate identical marginal predictive distributions: \mathds{P}(\hat{Y}^{1}_{t+1}=0)=\mathds{P}(\hat{Y}^{2}_{t+1}=0)=1/3.

On the other hand, the joint predictions of these two agents differ for \tau>1:

\mathds{P}(\hat{Y}^{1}_{1}=\dots=\hat{Y}^{1}_{\tau}=0)=1/3^{\tau}<1/3=\mathds{P}(\hat{Y}^{2}_{1}=\dots=\hat{Y}^{2}_{\tau}=0).

Evaluating marginal predictions cannot distinguish between the two agents, though for a specific prior distribution over p, one agent could be right and the other wrong. One must evaluate joint predictions to make this distinction.
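The following minimal sketch (Python with NumPy; the sample size, horizon, and random seed are arbitrary choices of ours) checks both claims by Monte Carlo: the two agents agree on marginal predictions but disagree sharply on the joint prediction that all \tau outcomes are tails.

import numpy as np

rng = np.random.default_rng(0)
n, tau = 200_000, 3  # number of Monte Carlo draws and prediction horizon

# Agent 1: p = 2/3, tosses independent given p.
y1 = rng.random((n, tau)) < 2 / 3

# Agent 2: p = 1 with probability 2/3, p = 0 with probability 1/3;
# all tau imagined tosses share the sampled p.
p2 = (rng.random(n) < 2 / 3).astype(float)
y2 = rng.random((n, tau)) < p2[:, None]

print("marginal P(Y = 0):     ", 1 - y1[:, 0].mean(), 1 - y2[:, 0].mean())            # both roughly 1/3
print("joint P(all tau are 0):", (~y1).all(axis=1).mean(), (~y2).all(axis=1).mean())  # roughly 1/3^tau vs 1/3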

2.3 Cross-entropy loss

Cross-entropy loss is perhaps the most widely used metric in machine learning, though it is typically only used to evaluate marginal predictive distributions. For our supervised learning formulation, an agent’s marginal cross-entropy loss takes the form

\mathbf{d}_{\mathrm{CE}}^{1}\equiv-\mathbb{E}\big[\log\hat{P}_{T+1}(Y_{T+1})\big], \qquad (1)

where the expectation is over both \hat{P}_{T+1} and Y_{T+1}, and the superscript "1" in \mathbf{d}_{\mathrm{CE}}^{1} indicates that this evaluates marginal predictions. Note that the marginal distribution \hat{P}_{T+1} is random because it depends on \theta_{T} and X_{T}.

It is straightforward to extend the cross-entropy loss to assess joint predictive distributions. Generalizing Equation (1), for any \tau=1,2,\ldots, we define the \tau^{\textrm{th}}-order cross-entropy loss:

\mathbf{d}_{\mathrm{CE}}^{\tau}\equiv-\mathbb{E}\big[\log\hat{P}_{T+1:T+\tau}(Y_{T+1:T+\tau})\big], \qquad (2)

where the expectation is over \hat{P}_{T+1:T+\tau} and Y_{T+1:T+\tau}. Note that the \tau^{\textrm{th}}-order joint distribution \hat{P}_{T+1:T+\tau} is also random, since it depends on \theta_{T} and X_{T:T+\tau-1}.

2.4 Kullback–Leibler divergence

One can use the joint cross-entropy loss defined in Equation (2) to assess joint predictive distributions. However, it can be helpful to offset the metric by a baseline to convert it into the Kullback–Leibler (KL) divergence. We will take \overline{P}_{T+1:T+\tau}, which is the perfect posterior predictive, to be our baseline. As we will see, since \overline{P}_{T+1:T+\tau} does not depend on the agent, our measure of KL-divergence and the cross-entropy loss are effectively equivalent in the sense that they only differ by a constant that does not depend on the agent. However, KL-divergence exhibits properties that allow for a more elegant mathematical analysis.

The \tau^{\textrm{th}}-order expected KL-divergence with respect to \overline{P} is defined by

\mathbf{d}_{\mathrm{KL}}^{\tau}\equiv\mathbb{E}\big[\mathbf{d}_{\mathrm{KL}}\big(\overline{P}_{T+1:T+\tau}\,\big\|\,\hat{P}_{T+1:T+\tau}\big)\big], \qquad (3)

where the expectation is over the distributions \overline{P}_{T+1:T+\tau} and \hat{P}_{T+1:T+\tau}, which depend in turn on the data \mathcal{D}_{T}, the agent parameters \theta_{T}, and the \tau inputs X_{T:T+\tau-1}.

Note that the KL-divergence is minimized when \hat{P}_{T+1:T+\tau}=\overline{P}_{T+1:T+\tau}, with the minimum being zero. Further, our two metrics are related according to

\mathbf{d}_{\mathrm{KL}}^{\tau}=\mathbf{d}_{\mathrm{CE}}^{\tau}+\mathbb{E}[\log\overline{P}_{T+1:T+\tau}(Y_{T+1:T+\tau})].

Since \overline{P}_{T+1:T+\tau}(Y_{T+1:T+\tau}) does not depend on the agent, our two metrics rank agents identically.

An unbiased estimate of cross-entropy loss can be computed based on a test data sample, according to

\mathbf{d}_{\mathrm{CE}}^{\tau}\approx-\log\hat{P}_{T+1:T+\tau}(Y_{T+1:T+\tau}).

The same is not true for \mathbf{d}_{\mathrm{KL}}^{\tau}, which can only be estimated if we are also given an estimate of \mathbb{E}[\log\overline{P}_{T+1:T+\tau}(Y_{T+1:T+\tau})]. Hence, \mathbf{d}_{\mathrm{KL}}^{\tau} serves only as a conceptual tool in our analysis and not as an evaluation metric that can be applied to empirical data. While it ranks agents identically to \mathbf{d}_{\mathrm{CE}}^{\tau}, \mathbf{d}_{\mathrm{KL}}^{\tau} is more natural as a metric since its minimum is zero and it accommodates more elegant analysis.
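As a concrete illustration of the estimate above, the following sketch (Python with NumPy; the function and argument names are our own and not part of the paper) averages -\log\hat{P} over held-out batches of \tau input-label pairs. With \tau=1 it reduces to the familiar marginal log-loss.

import numpy as np

def estimate_joint_cross_entropy(joint_log_prob, test_batches):
    # Monte Carlo estimate of the tau-th order cross-entropy loss in Equation (2).
    # joint_log_prob(xs, ys): the agent's log P_hat(ys | theta_T, xs) for one batch
    #   of tau inputs xs and the realized labels ys.
    # test_batches: iterable of (xs, ys) batches drawn from held-out data.
    losses = [-joint_log_prob(xs, ys) for xs, ys in test_batches]
    return float(np.mean(losses))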

2.5 Error in predictions versus environment

Our \mathbf{d}_{\mathrm{KL}}^{\tau} metric assesses error incurred by the predictive distribution \hat{P}_{T+1:T+\tau}. A common approach to generating such a predictive distribution is through estimating a posterior distribution over environments and using that posterior distribution to generate the predictive distribution. In such a context, \theta_{T} parameterizes the estimated posterior distribution. Let \hat{\mathcal{E}} be an imaginary environment sampled from this posterior distribution. To offer some intuition for \mathbf{d}_{\mathrm{KL}}^{\tau}, we consider in this section its relation to the KL-divergence between the distributions of the true and imaginary environments.

Let \hat{Y}_{T+1:T+\tau} denote a sequence of imaginary outcomes, with each \hat{Y}_{t+1} sampled independently from \hat{\mathcal{E}}(\cdot|X_{t}). If the support of the input distribution P_{X} is exhaustive, the support of the imaginary environment distribution \mathds{P}(\hat{\mathcal{E}}\in\cdot|\theta_{T}) contains that of the true environment distribution \mathds{P}(\mathcal{E}\in\cdot|\mathcal{D}_{T}), and the environment distributions satisfy suitable regularity conditions, then

\lim_{\tau\rightarrow\infty}\mathbf{d}_{\mathrm{KL}}^{\tau}=\mathbb{E}\big[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}(\mathcal{E}\in\cdot|\mathcal{D}_{T})\,\big\|\,\mathds{P}(\hat{\mathcal{E}}\in\cdot|\theta_{T})\big)\big].

In other words, under certain technical conditions, as the number \tau of test data pairs grows, \mathbf{d}_{\mathrm{KL}}^{\tau} converges to the error in the estimated posterior distribution over environments, measured in terms of KL-divergence.

One might wonder why we should use \mathbf{d}_{\mathrm{KL}}^{\tau} rather than the KL-divergence between the true and imaginary environments,

\mathbb{E}\big[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}(\mathcal{E}\in\cdot|\mathcal{D}_{T})\,\big\|\,\mathds{P}(\hat{\mathcal{E}}\in\cdot|\theta_{T})\big)\big] \qquad (4)

to evaluate agents. There are several reasons. First, practical agent designs often do not satisfy the requisite regularity conditions, and hence (4) becomes infinite. For example, it is common to approximate the posterior distribution over \mathcal{E} using an ensemble of environment models (see, e.g., Lu & Van Roy (2017)). Such an ensemble represents a distribution with finite support, though the posterior may have infinite support. On the other hand, for any finite \tau, \mathbf{d}_{\mathrm{KL}}^{\tau} is finite. Second, \mathbf{d}_{\mathrm{CE}}^{\tau}, which is equivalent to \mathbf{d}_{\mathrm{KL}}^{\tau} up to a constant, can be computed from data, whereas computing (4) requires access to the posterior distribution of \mathcal{E}. Finally, as we will establish later, \mathbf{d}_{\mathrm{KL}}^{\tau} with finite \tau is sufficient to support effective decisions in downstream tasks such as multi-armed bandits.

2.6 Universality of \mathbf{d}_{\mathrm{KL}}^{\tau}

We now show that, for any \tau, accuracy in terms of \mathbf{d}_{\mathrm{KL}}^{\tau} is sufficient to guarantee an effective decision if the decision is judged in relation only to Y_{T+1:T+\tau}. In particular, suppose an action a selected from a set \mathcal{A} results in an expected reward

\mathbb{E}[r(a,Y_{T+1:T+\tau})|\mathcal{D}_{T},X_{T:T+\tau-1}]=\sum_{y_{T+1:T+\tau}}\overline{P}_{T+1:T+\tau}(y_{T+1:T+\tau})\,r(a,y_{T+1:T+\tau}),

where r is a reward function with range [0,1]. The following result bounds the loss in expected reward of a decision that is based on the estimate \hat{P}_{T+1:T+\tau} instead of the posterior \overline{P}_{T+1:T+\tau}.

Proposition 2.1.

If an action \hat{a}\in\mathcal{A} maximizes

\sum_{y_{T+1:T+\tau}}\hat{P}_{T+1:T+\tau}(y_{T+1:T+\tau})\,r(a,y_{T+1:T+\tau})

then

\mathbb{E}[r(\hat{a},Y_{T+1:T+\tau})]\geq\max_{a\in\mathcal{A}}\mathbb{E}[r(a,Y_{T+1:T+\tau})]-\sqrt{2\,\mathbf{d}_{\mathrm{KL}}^{\tau}}.

This result follows from Pinsker's inequality and Jensen's inequality. A proof is provided in Appendix A.

Proposition 2.1 ensures that if a decision maximizes an approximation to expected reward, with the expectation approximated using the predictive distribution \hat{P}_{T+1:T+\tau}, then the action will be within \sqrt{2\,\mathbf{d}_{\mathrm{KL}}^{\tau}} of what is achievable given the posterior predictive distribution \overline{P}_{T+1:T+\tau}. In this sense, \mathbf{d}_{\mathrm{KL}}^{\tau} is a universal evaluation metric: its value ensures a level of performance in any decision problem.
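For concreteness, the decision rule that Proposition 2.1 bounds can be written in a few lines; in this minimal sketch (Python), the names are illustrative and p_hat is assumed to be a dictionary mapping each outcome sequence y_{T+1:T+\tau} (a tuple) to its predictive probability.

def decide(actions, p_hat, reward):
    # Choose the action maximizing expected reward under the agent's joint
    # predictive distribution p_hat, as in Proposition 2.1.
    # reward(a, y) is assumed to take values in [0, 1].
    def expected_reward(a):
        return sum(prob * reward(a, y) for y, prob in p_hat.items())
    return max(actions, key=expected_reward)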

3 Combinatorial decision problems

In this section, we present a simple example that highlights the importance of joint predictive distributions in combinatorial decision problems. Importantly, we show that there can be benefit to considering a joint predictive distribution even when the data are i.i.d. conditioned on the environment.

Consider the problem of a customer interacting with a recommendation system that proposes a selection of K>1 movies from an inventory of N movies X_{1},\ldots,X_{N}. Each X_{i}\in\mathds{R}^{d} describes the features of movie i, and d is the feature dimension. We model the probability that a user will enjoy movie i by a logistic model Y_{i}\sim{\rm logit}(\phi_{*}^{T}X_{i}), where {\rm logit} denotes the standard logistic (sigmoid) function. Note that \phi_{*}\in\mathds{R}^{d} describes the preferences of the user, is not fully known to the recommendation system, and can be viewed as a random variable. We will show that, in order to maximize the probability that the user enjoys at least one of the K>1 recommended movies, it is insufficient to examine the marginal predictive distributions.

To make this example concrete, we consider the case where the user \phi_{*} is drawn from two possible user types \{\phi_{1},\phi_{2}\}, and the recommendation system should propose K=2 movies from an inventory \{X_{1},X_{2},X_{3},X_{4}\}. Table 1 presents the numerical values of these vectors and the enjoyment probabilities implied by the logistic model, correct to two decimal places. These values are chosen to set up a tension between optimization over marginal (each X_{i} individually) and joint (pairs X_{i},X_{j}) predictions. An agent that optimizes the expected probability for each movie individually will end up recommending the pair (X_{3},X_{4}) to an unknown user \phi\sim{\rm Unif}(\phi_{1},\phi_{2}). This means that, whether the user is of type \phi_{1} or \phi_{2}, there is a greater than 10% chance they do not like either movie. By contrast, an agent that considers the joint predictive distribution for \tau\geq K=2 can see that instead selecting the pair (X_{1},X_{2}) gives close to 100% certainty that the user will enjoy one of the movies.

                                        X_{1}=(10,-10)   X_{2}=(-10,10)   X_{3}=(1,0)   X_{4}=(0,1)
\phi_{1}=(1,0)                          1                0                0.73          0.5
\phi_{2}=(0,1)                          0                1                0.5           0.73
\phi\sim{\rm Unif}(\phi_{1},\phi_{2})   0.5              0.5              0.62          0.62
Table 1: Expected probability of enjoying each movie under different user features, correct to two decimal places.
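The numbers in Table 1 and the claims above can be checked directly; the following sketch (Python with NumPy, assuming the two enjoyment events are conditionally independent given the user type) computes the marginal enjoyment probabilities and, for every pair of movies, the probability that the user enjoys at least one.

import numpy as np
from itertools import combinations

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[10., -10.], [-10., 10.], [1., 0.], [0., 1.]])  # movies X_1, ..., X_4
phis = np.array([[1., 0.], [0., 1.]])                         # user types phi_1, phi_2, equally likely

marginals = sigmoid(X @ phis.T).mean(axis=1)                  # last row of Table 1
print("marginal enjoy-probabilities:", np.round(marginals, 2))

for i, j in combinations(range(4), 2):
    # P(user enjoys at least one of the pair), averaged over the unknown user type.
    p = np.mean([1 - (1 - sigmoid(phi @ X[i])) * (1 - sigmoid(phi @ X[j])) for phi in phis])
    print(f"pair (X{i+1}, X{j+1}): P(at least one enjoyed) = {p:.3f}")

Under either user type, the pair (X_{3},X_{4}) leaves roughly a 13% chance that neither movie is enjoyed, whereas (X_{1},X_{2}) leaves essentially none.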

This didactic example is designed to be maximally simplistic, and in some sense it labours an obvious point: in combinatorial decision problems where the outcome depends on the joint predictive distribution, optimization based on the marginal predictive distribution is insufficient to guarantee good decisions. By contrast, as shown in Section 2.6, if the decision depends only on the \tau^{\rm th}-order predictive distribution, then attaining small \mathbf{d}_{\mathrm{KL}}^{\tau} is sufficient for good performance. What is perhaps interesting about this example, however, is that this coupling can occur even where, conditioned on the true environment, the data generating process is actually i.i.d. The key point is that, when the true underlying environment is unknown, a coupling in future observations is introduced. As such, examining the joint predictive distribution can be essential for good performance. In the rest of this paper, we show that these distinctions become even more important as we move towards sequential decision problems.

4 Sequential predictions

In this and the next section, we establish theoretical results that justify the importance of joint predictive distributions in sequential decision problems. We first consider sequential prediction problems, a subclass of sequential decision problems in which the agent's predictions (actions) do not influence its future observations. In particular, we show that in a standard sequential prediction problem, an agent that updates its parameters incrementally must retain a significant portion of the information about joint predictive distributions in order to achieve good performance. This is true even if the performance metric depends only on marginal distributions.

The sequential prediction problem we consider is formulated as follows: data pairs (X_{t},Y_{t+1}) arrive sequentially, one at a time. At each time t, the agent computes parameters \theta_{t} based on previously observed data pairs \mathcal{D}_{t}=(X_{0},Y_{1},X_{1},\ldots,X_{t-1},Y_{t}). Then, a new data pair (X_{t},Y_{t+1}) arrives. We assume that the feature vectors X_{t} are unconditionally independent, but not necessarily identically distributed. The target label Y_{t+1} is conditionally independently sampled from the distribution \mathcal{E}(\cdot|X_{t}), where \mathcal{E} is the environment. The agent's objective is to minimize the expected cumulative KL-divergence over the first T time steps:

\textstyle\sum_{t=0}^{T-1}\mathbb{E}\big[\mathbf{d}_{\mathrm{KL}}\big(\overline{P}_{t+1}\,\big\|\,\hat{P}_{t+1}\big)\big], \qquad (5)

where

\overline{P}_{t+1}=\mathds{P}(Y_{t+1}\in\cdot\,|\,\mathcal{D}_{t},X_{t}) \quad\text{and}\quad \hat{P}_{t+1}=\mathds{P}(\hat{Y}_{t+1}\in\cdot\,|\,\theta_{t},X_{t})

for all time t. Note that this cumulative KL-divergence (5) only depends on the marginal distributions \overline{P}_{t+1} and \hat{P}_{t+1}. Also note that this performance metric is 0 if the agent predicts the exact posterior at each time t.

We consider a setting where an agent needs to incrementally update its parameters as data arrive. Specifically, at time t=0, the agent chooses its parameters \theta_{0} based on its prior knowledge; then, at each time t=0,1,\ldots, the agent updates its parameters incrementally by sampling from a distribution that only depends on \theta_{t}, (X_{t},Y_{t+1}), and t:

\theta_{t+1}\sim\mathds{P}\left(\theta_{t+1}\in\cdot\,\middle|\,\theta_{t},X_{t},Y_{t+1},t\right). \qquad (6)

In other words, conditioned on (\theta_{t},X_{t},Y_{t+1}), \theta_{t+1} is independent of the dataset \mathcal{D}_{t} and the environment \mathcal{E}. Note that the incremental update rule in equation (6) is general: in particular, \mathcal{D}_{t} could itself be recorded in \theta_{t}. This would allow \theta_{t+1} to depend on \mathcal{D}_{t} in an arbitrary manner. However, such an approach can be impractical when there is a high volume of data. In particular, one may want to avoid sifting through a growing \mathcal{D}_{t} at each time step. In many practical applications, it is desirable for the agent to update \theta_{t+1} with fixed memory space and fixed per-step computational complexity, as the standard SGD (Goodfellow et al., 2016) and Adam (Kingma & Ba, 2015) algorithms do.
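To make the interface in equation (6) concrete, the sketch below (Python with NumPy) updates \theta_{t+1} using only (\theta_{t},X_{t},Y_{t+1},t), with fixed memory and fixed per-step computation; the logistic model and SGD step are our own illustrative choices, not an agent prescribed by the paper.

import numpy as np

class IncrementalAgent:
    # Point-estimate logistic agent with the incremental update structure of (6):
    # theta_{t+1} depends only on (theta_t, X_t, Y_{t+1}, t), never on all of D_t.

    def __init__(self, d, lr=0.1):
        self.theta = np.zeros(d)
        self.lr = lr

    def marginal_prob(self, x):
        # The agent's marginal prediction P(Y_{t+1} = 1 | theta_t, X_t = x).
        return 1.0 / (1.0 + np.exp(-self.theta @ x))

    def update(self, x, y, t):
        # One SGD step on the log-loss: fixed memory, fixed per-step compute.
        grad = (self.marginal_prob(x) - y) * x
        self.theta = self.theta - self.lr / np.sqrt(t + 1.0) * grad

Theorem 4.1 below implies that for any such incremental agent to predict marginals well over the horizon, its parameters \theta_{t} must retain nearly all of the information in \mathcal{D}_{t} about the joint distribution of future labels.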

We have the following result for this sequential prediction problem:

Theorem 4.1.

For an agent with incremental update (6), for any time t=0,1,\ldots,T-1 and any \epsilon\geq 0, if

\textstyle\sum_{t^{\prime}=t}^{T-1}\mathbb{E}\big[\mathbf{d}_{\mathrm{KL}}\big(\bar{P}_{t^{\prime}+1}\,\big\|\,\hat{P}_{t^{\prime}+1}\big)\big]\leq\epsilon,

then we have

\mathbb{I}\left(Y_{t+1:T};\theta_{t}\,\middle|\,X_{t:T-1}\right)\geq\mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t}\,\middle|\,X_{t:T-1}\right)-\epsilon.

Please refer to Appendix B for the proof of Theorem 4.1. Notice that \epsilon measures the performance loss of the agent; \mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t}\,\middle|\,X_{t:T-1}\right) is the conditional information in \mathcal{D}_{t} about the joint distribution of Y_{t+1:T}; and similarly \mathbb{I}\left(Y_{t+1:T};\theta_{t}\,\middle|\,X_{t:T-1}\right) is the conditional information about Y_{t+1:T} retained in \theta_{t}. Also notice that

\mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t}\,\middle|\,X_{t:T-1}\right)\geq\mathbb{I}\left(Y_{t+1:T};\theta_{t}\,\middle|\,X_{t:T-1}\right)

always holds by the data processing inequality. In other words, Theorem 4.1 states that to be \epsilon-near-optimal, an agent with incremental updates must retain in \theta_{t} all information in \mathcal{D}_{t} about the joint distribution of Y_{t+1:T}, except \epsilon nats (in this paper, we use natural logarithms in both KL-divergence and mutual information; note that \epsilon nats \approx 1.443\epsilon bits).

We conjecture that results similar to Theorem 4.1 also hold in broader classes of sequential decision problems, such as the multi-armed bandit problems discussed in Section 5, but we leave the formal analysis to future work.

5 Multi-armed bandits

In this section, we turn our attention from passive observations, where an agent's predictions do not influence the stream of data \mathcal{D}_{t}, to active decisions, where an agent takes actions that influence future streams of data. In this context, the ability to predict outcomes over multiple time steps can significantly impact agent performance. In particular, we will show that the quality of predictions beyond marginals is essential to effectively balance the needs of exploration with exploitation (Thompson, 1933). We begin with a problem formulation around the famous “multi-armed bandit”, a simple model of sequential decisions that serves to highlight the key issues at play. Next, we show that, in this context, even agents that make perfect marginal predictions for the outcome of each action may not make good decisions for the problem as a whole. Finally, we show that agents that make good joint predictions over \tau=O(K) time steps suffice to drive efficient exploration in a Bernoulli bandit with K actions.

5.1 Problem formulation

Consider a sequential decision problem with K actions. The environment \mathcal{E} prescribes observation probabilities. During each time step t=0,1,2,\ldots, the agent selects an action A_{t} and observes an outcome Y_{t+1} produced by the environment. Conditioned on the environment \mathcal{E} and action A_{t}, the next observation Y_{t+1} is sampled from \mathcal{E}(\cdot|A_{t}) independently across time. There is a real-valued reward function r that encodes the agent's preferences over outcomes. The objective of the agent is to optimize the expected return over some long horizon T,

\mathbb{E}\big[\textstyle\sum_{t=0}^{T-1}r(Y_{t+1})\big],

where the expectation is taken over random outcomes, algorithmic randomness, and the prior over \mathcal{E}.

5.2 Marginal predictions are insufficient

We will now show that, even if an agent can make perfect predictions for the outcomes of each action, these per-action marginal predictions can be insufficient to drive efficient exploration. The root cause is that, at any time step t, the rewards and observations at future time steps t^{\prime}>t are coupled through the unknown environment \mathcal{E}. Clearly, we can imagine problem instances where one action reveals the full structure of the environment, but does not itself produce a reward. In this case, an agent that simply makes good marginal predictions of each action's reward cannot take advantage of this informative action. However, perhaps more interestingly, this coupling can occur even when rewards evolve independently per action given the environment.

To make this point clear, we will consider an extension of the coin tossing example from Section 2.2. We consider a Bernoulli bandit with K independent actions where the reward is simply taken to be Y_{t+1}. For the first K-1 actions, the agent knows that the outcome is distributed \mathrm{Ber}(0.5), a pure 50% chance. The final action, however, produces a deterministic outcome of either 1 or 0, each equally likely. Clearly, the optimal policy for maximizing long-term reward is to select the final action first to see whether it is optimal and then, based on that outcome, choose the arm that maximizes expected reward given full knowledge of \mathcal{E} for all future steps. However, any agent that only matches marginal predictions cannot distinguish the final arm from any of the others. As such, it cannot do better than random chance over the K arms, and on average it will require \Theta(K) time steps to identify the optimal policy.

Just as in the coin flip example, the difference between the informative and uninformative actions is fully revealed when we look at \tau^{\textrm{th}}-order predictive distributions for any \tau>1. For instance, a second-order prediction assigns probability 1/4 to observing two consecutive 1s from any of the first K-1 actions, but probability 1/2 for the final action. In general, it is only by examining the evolution of future outcomes, beyond myopic marginals, that an agent can optimize long-term returns. In the next subsection, we will expand on this intuitive argument and relate the ability to predict future outcomes to agent performance. In fact, we will establish a performance bound guaranteeing that approximate predictions O(K) steps into the future suffice to drive efficient exploration.

5.3 Relating joint predictions to regret

As explained in the previous subsection, in a sequential decision problem, it is in general desirable to consider uncertainties jointly across actions in order to make effective decisions. Thus, to simplify exposition, we consider predictions for a vector of outcomes, \mathbf{Y}\in\Re^{K}, with each entry corresponding to the outcome of an action. In this section, we will relate the quality of future predictions about \mathbf{Y} to agent performance on a Bernoulli bandit with correlated arms.

Recall that in a K-armed Bernoulli bandit, the environment \mathcal{E} is identified by parameters p=(p_{1},\ldots,p_{K}), where p_{k}\in[0,1] is the expected reward of the k^{\rm th} action. We make no assumptions about the prior \mathds{P}(p\in\cdot). In particular, there could be dependencies, where an observed outcome from trying one action informs our beliefs about other actions. We define the history by time t as H_{t}=\left(A_{0},Y_{1},\ldots,A_{t-1},Y_{t}\right).

We use \tilde{\mathbf{Y}}_{1:\tau} to denote a sequence of \tau vectors sampled from the environment \mathcal{E}. Specifically, these \tau vectors are conditionally independent given \mathcal{E}. Each vector has dimension K, and the k^{\rm th} component of each vector is conditionally independently sampled from \mathrm{Ber}(p_{k}). On the other hand, consider an agent that can also generate a sequence of K-dimensional binary vectors at each time t. Let \theta_{t} denote its parameters, and \hat{\mathbf{Y}}^{t}_{1:\tau} denote a sequence of \tau binary vectors sampled from it. In this subsection, we establish that if an agent can produce good \tau^{\rm th}-order predictive distributions in the sense that

\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}(\tilde{\mathbf{Y}}_{1:\tau}\in\cdot|H_{t})\,\big\|\,\mathds{P}(\hat{\mathbf{Y}}^{t}_{1:\tau}\in\cdot|\theta_{t})\big)\right]\leq\epsilon\quad\forall t

for some small \epsilon\geq 0, then based on this agent one can design bandit algorithms that achieve near-optimal performance.

Specifically, we consider an approximate version of the Thompson sampling algorithm, described in Algorithm 1. The algorithm proceeds as follows: at each time t, it first samples \tau binary vectors \hat{\mathbf{Y}}^{t}_{1:\tau} from the agent's predictive distribution; then, it samples a vector \hat{p}^{t}\in[0,1]^{K} from the conditional distribution \mathds{P}(p\in\cdot|\tilde{\mathbf{Y}}_{1:\tau}=\hat{\mathbf{Y}}^{t}_{1:\tau}); finally, it chooses an action A_{t} greedily with respect to \hat{p}^{t} and updates the agent parameters based on the new observation. Note that Algorithm 1 is general in the sense that it does not depend on the agent's uncertainty representation. Instead, it only requires that the agent can simulate hypothetical observations, sampled from a joint predictive distribution. Also note that Algorithm 1 reduces to the standard (exact) Thompson sampling algorithm when \mathds{P}(\hat{\mathbf{Y}}_{1:\tau}\in\cdot|\theta_{t})=\mathds{P}(\tilde{\mathbf{Y}}_{1:\tau}\in\cdot|H_{t}) and \tau\rightarrow\infty. We use this algorithm to establish that an agent that performs well under a particular loss function retains enough information to enable efficient exploration. (Note that \min\operatorname*{arg\,max}_{k}\hat{p}^{t}_{k} in Algorithm 1 is well defined; \operatorname*{arg\,max}_{k}\hat{p}^{t}_{k}\subseteq\{1,\ldots,K\} is a set.)

Algorithm 1 Approximate Thompson sampling
  Input: prior over environment parameters p
     agent architecture
     agent parameter initialization/update procedure
  Initialization: compute parameters \theta_{0} based on the prior
  for t=0,1,2,\ldots do
     sample \hat{\mathbf{Y}}^{t}_{1:\tau}\sim\mathds{P}(\hat{\mathbf{Y}}_{1:\tau}\in\cdot\,|\,\theta_{t})
     sample \hat{p}^{t} from \mathds{P}(p\in\cdot\,|\,\tilde{\mathbf{Y}}_{1:\tau}=\hat{\mathbf{Y}}^{t}_{1:\tau})
     choose A_{t}=\min\operatorname*{arg\,max}_{k}\hat{p}^{t}_{k}
     compute \theta_{t+1} based on \theta_{t} and (A_{t},Y_{t+1}).
  end for
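Below is a minimal sketch of Algorithm 1 (Python with NumPy) for the special case of an independent Beta-Bernoulli prior over p; this prior choice, the Beta(1,1) hyperparameters, and the use of exact per-arm posteriors as the agent (so that \epsilon=0) are our own simplifying assumptions rather than part of the algorithm's general statement, which only requires samples \hat{\mathbf{Y}}^{t}_{1:\tau} from some joint predictive distribution.

import numpy as np

rng = np.random.default_rng(0)
K, tau, T = 5, 20, 1000
a0, b0 = np.ones(K), np.ones(K)      # independent Beta(1, 1) prior on each p_k
p_true = rng.beta(a0, b0)            # environment sampled from the prior

a, b = a0.copy(), b0.copy()          # theta_t: per-arm Beta posterior parameters
rewards = []
for t in range(T):
    # Sample Y_hat_{1:tau}^t from the agent's joint predictive distribution:
    # first an imagined p, then tau conditionally independent Bernoulli vectors.
    p_imagined = rng.beta(a, b)
    y_hat = rng.random((tau, K)) < p_imagined     # tau x K binary matrix
    s = y_hat.sum(axis=0)                         # per-arm success counts

    # Sample p_hat^t from P(p | Y_tilde_{1:tau} = y_hat): a Beta posterior under the prior.
    p_hat = rng.beta(a0 + s, b0 + tau - s)
    # (The cheaper variant discussed in Section 5.3: p_hat = s / tau.)

    A = int(np.argmax(p_hat))                     # min argmax_k p_hat_k
    y = float(rng.random() < p_true[A])           # observe the outcome/reward
    a[A] += y
    b[A] += 1.0 - y                               # incremental update of theta_t
    rewards.append(y)

print("average reward:", np.mean(rewards), " best arm mean:", p_true.max())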

As is standard in the bandit literature, we measure the performance of Algorithm 1 using the (Bayes) cumulative regret, defined as

{\rm Regret}(T)=\textstyle\sum_{t=0}^{T-1}\mathbb{E}\big[p_{A^{*}}-r(Y_{t+1})\big], \qquad (7)

where A^{*}=\min\operatorname*{arg\,max}_{k}p_{k} is an optimal action. As before, the expectation is over random outcomes, algorithmic randomness, and the prior over \mathcal{E}. We can establish the following regret bound for Algorithm 1:

Theorem 5.1.

For any integer \tau\geq 1 and any \epsilon\in\Re_{+}, if at each time t, the agent with parameters \theta_{t} can generate samples \hat{\mathbf{Y}}^{t}_{1:\tau} such that

\mathbb{E}\big[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}(\tilde{\mathbf{Y}}_{1:\tau}\in\cdot|H_{t})\,\big\|\,\mathds{P}(\hat{\mathbf{Y}}_{1:\tau}^{t}\in\cdot|\theta_{t})\big)\big]\leq\epsilon,

then under Algorithm 1, we have

{\rm Regret}(T)\leq\sqrt{\tfrac{1}{2}KT\log K}+\left(\frac{K}{\sqrt{2\tau}}+\sqrt{2\epsilon}\right)T. \qquad (8)

Please refer to Section 5.4 for a proof sketch of Theorem 5.1; the detailed proof is provided in Appendix C. The proof uses an information-ratio analysis similar to previous papers (Russo & Van Roy, 2016; Lu et al., 2021), together with the notions of environment proxy and learning target developed in Lu et al. (2021) (see Section 3.3 of that paper). Specifically, we choose the environment proxy to be \tilde{\mathbf{Y}}_{1:\tau} and the learning target to be the action \tilde{A}=\min\operatorname*{arg\,max}_{k}\tilde{p}_{k}, where \tilde{p}=(\tilde{p}_{1},\ldots,\tilde{p}_{K}) is an independent sample drawn from \mathds{P}(p\in\cdot|\tilde{\mathbf{Y}}_{1:\tau}). Note that conditioned on the environment \mathcal{E}, \tilde{\mathbf{Y}}_{1:\tau} and \tilde{A} are independent of the history H_{t}.

We now briefly discuss the regret bound in Theorem 5.1. First, note that if an agent can make good predictions \tau\geq K/\epsilon steps into the future, then this regret bound reduces to O(\sqrt{KT\log(K)}+\sqrt{K\epsilon}\,T), which is sufficient to ensure efficient exploration. Second, notice that this regret bound consists of three terms. The linear regret term \sqrt{2\epsilon}\,T is due to the expected KL-divergence loss of the agent. Specifically, if the agent makes perfect predictions in the sense that \mathds{P}(\hat{\mathbf{Y}}_{1:\tau}^{t}\in\cdot|\theta_{t})=\mathds{P}(\tilde{\mathbf{Y}}_{1:\tau}\in\cdot|H_{t}) for all t, then this linear regret term vanishes. The other linear regret term, KT/\sqrt{2\tau}, is due to the fact that we choose \tilde{A} as the learning target, which can be a sub-optimal action. As \tau\rightarrow\infty, \tilde{A} converges to A^{*} and this term also vanishes. Finally, the sublinear regret term \sqrt{\frac{1}{2}KT\log K} is exactly the regret bound for the exact Thompson sampling algorithm (Russo & Van Roy, 2016). This is not surprising, since when \epsilon=0 (i.e., \mathds{P}(\hat{\mathbf{Y}}_{1:\tau}^{t}\in\cdot|\theta_{t})=\mathds{P}(\tilde{\mathbf{Y}}_{1:\tau}\in\cdot|H_{t})) and \tau\rightarrow\infty, Algorithm 1 reduces to the exact Thompson sampling algorithm.

Finally, it is worth mentioning that we conjecture that results similar to Theorem 5.1 can be developed for more practical algorithms and more general sequential decision problems. Note that in Algorithm 1, sampling \hat{p}^{t} from \mathds{P}(p\in\cdot|\tilde{\mathbf{Y}}_{1:\tau}=\hat{\mathbf{Y}}^{t}_{1:\tau}) can be computationally expensive. A computationally more efficient alternative is to choose \hat{p}^{t} as the sample mean of \hat{\mathbf{Y}}^{t}_{1:\tau}, i.e., \hat{p}^{t}=\frac{1}{\tau}\sum_{i=1}^{\tau}\hat{\mathbf{Y}}^{t}_{i}, where \hat{\mathbf{Y}}^{t}_{i} is the i^{\rm th} vector in \hat{\mathbf{Y}}^{t}_{1:\tau}. We conjecture that one can derive a similar regret bound with this modification, but leave the analysis to future work. Moreover, we believe that similar approximate Thompson sampling algorithms and regret bounds can be developed for more general sequential decision problems, such as multi-armed bandits with unbounded noise and episodic reinforcement learning, but we leave them to future work.

5.4 Proof Sketch for Theorem 5.1

We provide a proof sketch for Theorem 5.1 in this subsection. First, notice that the expected per-step regret at time t is \mathbb{E}[p_{A^{*}}-p_{A_{t}}], which can be decomposed as

\mathbb{E}[p_{A^{*}}-p_{A_{t}}]=\mathbb{E}[p_{A^{*}}-p_{\tilde{A}}]+\mathbb{E}[p_{\tilde{A}}-p_{A_{t}}]. \qquad (9)

Recall that the action \tilde{A} is the learning target. We bound the two terms on the right-hand side of equation (9) separately. First, based on the fact that p and \tilde{p} are conditionally i.i.d. given the environment proxy \tilde{\mathbf{Y}}_{1:\tau}, we can show that

\mathbb{E}[p_{A^{*}}-p_{\tilde{A}}]\leq K/\sqrt{2\tau}.

To bound the second term \mathbb{E}[p_{\tilde{A}}-p_{A_{t}}], we consider its conditional version \mathbb{E}_{t}[p_{\tilde{A}}-p_{A_{t}}], where the subscript t denotes conditioning on the history H_{t}. Using information-ratio analysis, we can prove that

\mathbb{E}_{t}[p_{\tilde{A}}-p_{A_{t}}]\leq\sqrt{\frac{K}{2}\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})}+\big\|\mathds{P}_{t}(\tilde{A}\in\cdot)-\mathds{P}_{t}(A_{t}\in\cdot)\big\|_{1}. \qquad (10)

Using Pinsker's inequality, the data processing inequality, and the assumption on the expected KL-divergence in Theorem 5.1, we can show that

\mathbb{E}\big[\big\|\mathds{P}_{t}(\tilde{A}\in\cdot)-\mathds{P}_{t}(A_{t}\in\cdot)\big\|_{1}\big]\leq\sqrt{2\epsilon}.

On the other hand, based on the Cauchy–Schwarz inequality and the chain rule for mutual information, we have

\sum_{t=0}^{T-1}\mathbb{E}\left[\sqrt{\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})}\right]\leq\sqrt{T\,\mathbb{I}(\tilde{A};H_{T})}.

Finally, note that \mathbb{I}(\tilde{A};H_{T})\leq\mathbb{H}(\tilde{A})\leq\log K. Combining the above inequalities completes the proof of Theorem 5.1.

6 Related work

There is extensive literature focusing on evaluating marginal distributions (e.g. Bishop (2006), Friedman et al. (2001), LeCun et al. (2015)), paired with evaluation on downstream tasks such as bandits (Riquelme et al., 2018). There is also a rich literature on developing agents for uncertainty estimation (Blundell et al., 2015; Gal & Ghahramani, 2016; Osband & Van Roy, 2015; Welling & Teh, 2011), and in most cases the developed agents are evaluated only based on marginal predictive distributions.

To the best of our knowledge, there is little literature that highlights the importance of joint predictive distributions. One exception is Wang et al. (2021). However, Wang et al. (2021) states that “joint log-likelihood scores [are] determined almost entirely by the marginal log-likelihood scores…in practice, they provide little new information beyond marginal likelihoods,” and proposes an alternative metric for joint distributions based on posterior predictive correlations (PPCs). Our paper, however, establishes that the KL-divergence for joint predictive distributions, which is equivalent to the joint log-likelihood score, ensures effective decisions, while marginal metrics do not. In particular, our results in Section 5 show that agents that attain small marginal KL-divergence might not retain enough information to enable efficient exploration, while agents that attain small joint KL-divergence do. Moreover, the PPC metric proposed by Wang et al. (2021) is designed for regression problems, while cross-entropy loss can be applied more broadly, for example, to classification as well as regression problems. There is also recent empirical work that compares agents based on their joint predictive distributions (Osband et al., 2021).

There is also extensive literature on combinatorial decision-making (Papadimitriou & Steiglitz, 1998), sequential prediction (Shalev-Shwartz et al., 2011), multi-armed bandits (Auer et al., 2002; Lattimore & Szepesvári, 2020), and Thompson sampling (Thompson, 1933; Chapelle & Li, 2011; Russo et al., 2018). There is also some recent work on approximate versions of Thompson sampling (Lu & Van Roy, 2017; Phan et al., 2019; Zhang et al., 2019; Yu et al., 2020). In this paper, we propose and analyze a novel variant of the Thompson sampling algorithm for multi-armed bandits. Our analysis uses an information-theoretic approach, similar to other papers in this line of research (Russo & Van Roy, 2016, 2018; Lu et al., 2021). In addition, our analysis makes use of the notions of environment proxy and learning target developed in Lu et al. (2021).

7 Conclusion

This paper aims to elucidate the importance of joint predictive distributions for a broad class of decision problems, including combinatorial and sequential decision problems. Specifically, we have shown that in a simple combinatorial decision problem (Section 3), a sequential prediction problem (Section 4), and a multi-armed bandit problem (Section 5), accurate marginal predictions are insufficient to drive effective decisions; instead, accurate joint predictive distributions are necessary for good performance. We have also shown that, in the multi-armed bandit problem, accurate joint predictive distributions are sufficient to enable near-optimal performance of a novel variant of Thompson sampling, by establishing a new kind of regret bound via information-theoretic analysis (Theorem 5.1). While we evaluate the accuracy of marginal and joint predictive distributions based on KL-divergence (Section 2), or equivalently, cross-entropy loss, our insights on marginal versus joint predictions should extend to other evaluation metrics.

References

  • Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002.
  • Bishop (2006) Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
  • Blundell et al. (2015) Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622. PMLR, 2015.
  • Chapelle & Li (2011) Chapelle, O. and Li, L. An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 24:2249–2257, 2011.
  • Cover (1999) Cover, T. M. Elements of information theory. John Wiley & Sons, 1999.
  • Friedman et al. (2001) Friedman, J., Hastie, T., Tibshirani, R., et al. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
  • Gal & Ghahramani (2016) Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.
  • Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT press, 2016.
  • Kingma & Ba (2015) Kingma, D. and Ba, J. Adam: A Method for Stochastic Optimization. Proceedings of the International Conference on Learning Representations, 2015.
  • Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020.
  • LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436, 2015.
  • Lu & Van Roy (2017) Lu, X. and Van Roy, B. Ensemble sampling. In Advances in Neural Information Processing Systems, pp. 3260–3268, 2017.
  • Lu et al. (2021) Lu, X., Van Roy, B., Dwaracherla, V., Ibrahimi, M., Osband, I., and Wen, Z. Reinforcement learning, bit by bit. arXiv preprint arXiv:2103.04047, 2021.
  • Mattner (2003) Mattner, L. Mean absolute deviations of sample means and minimally concentrated binomials. Annals of probability, pp.  914–925, 2003.
  • Osband & Van Roy (2015) Osband, I. and Van Roy, B. Bootstrapped Thompson sampling and deep exploration. arXiv preprint arXiv:1507.00300, 2015.
  • Osband et al. (2021) Osband, I., Wen, Z., Asghari, S. M., Dwaracherla, V., Hao, B., Ibrahimi, M., Lawson, D., Lu, X., O’Donoghue, B., and Van Roy, B. Evaluating predictive distributions: Does Bayesian deep learning work? arXiv preprint arXiv:2110.04629, 2021.
  • Papadimitriou & Steiglitz (1998) Papadimitriou, C. H. and Steiglitz, K. Combinatorial optimization: algorithms and complexity. Courier Corporation, 1998.
  • Phan et al. (2019) Phan, M., Abbasi-Yadkori, Y., and Domke, J. Thompson sampling with approximate inference. arXiv preprint arXiv:1908.04970, 2019.
  • Riquelme et al. (2018) Riquelme, C., Tucker, G., and Snoek, J. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127, 2018.
  • Russo & Van Roy (2016) Russo, D. and Van Roy, B. An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471, 2016.
  • Russo & Van Roy (2018) Russo, D. and Van Roy, B. Learning to optimize via information-directed sampling. Operations Research, 66(1):230–252, 2018.
  • Russo et al. (2018) Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., and Wen, Z. A tutorial on Thompson sampling. Found. Trends Mach. Learn., 11(1):1–96, July 2018. ISSN 1935-8237.
  • Shalev-Shwartz et al. (2011) Shalev-Shwartz, S. et al. Online learning and online convex optimization. Foundations and trends in Machine Learning, 4(2):107–194, 2011.
  • Thompson (1933) Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
  • Wang et al. (2021) Wang, C., Sun, S., and Grosse, R. Beyond marginal uncertainty: How accurately can Bayesian regression models estimate posterior predictive correlations? In International Conference on Artificial Intelligence and Statistics, pp.  2476–2484. PMLR, 2021.
  • Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pp.  681–688. Citeseer, 2011.
  • Yu et al. (2020) Yu, T., Kveton, B., Wen, Z., Zhang, R., and Mengshoel, O. J. Graphical models meet bandits: A variational Thompson sampling approach. In International Conference on Machine Learning, pp. 10902–10912. PMLR, 2020.
  • Zhang et al. (2019) Zhang, R., Wen, Z., Chen, C., and Carin, L. Scalable Thompson sampling via optimal transport. arXiv preprint arXiv:1902.07239, 2019.

Appendices

Appendix A Proof for Proposition 2.1

Proof.

From Pinsker’s inequality, we have

\delta\big(\overline{P}_{T+1:T+\tau},\hat{P}_{T+1:T+\tau}\big)\leq\sqrt{\tfrac{1}{2}\mathbf{d}_{\mathrm{KL}}\big(\overline{P}_{T+1:T+\tau}\,\big\|\,\hat{P}_{T+1:T+\tau}\big)},

where \delta is the total variation distance. Without loss of generality, we assume that the supports of \overline{P}_{T+1:T+\tau} and \hat{P}_{T+1:T+\tau} are countable. Then we have \delta\big(\overline{P}_{T+1:T+\tau},\hat{P}_{T+1:T+\tau}\big)=\frac{1}{2}\left\|\overline{P}_{T+1:T+\tau}-\hat{P}_{T+1:T+\tau}\right\|_{1}. Consequently, we have

\left\|\overline{P}_{T+1:T+\tau}-\hat{P}_{T+1:T+\tau}\right\|_{1}\leq\sqrt{2\,\mathbf{d}_{\mathrm{KL}}\big(\overline{P}_{T+1:T+\tau}\,\big\|\,\hat{P}_{T+1:T+\tau}\big)}.

Recall that \hat{a}\in\mathcal{A} maximizes

\sum_{y_{T+1:T+\tau}}\hat{P}_{T+1:T+\tau}(y_{T+1:T+\tau})\,r(a,y_{T+1:T+\tau})

over all a\in\mathcal{A}. Similarly, we assume a^{*}\in\mathcal{A} maximizes

\sum_{y_{T+1:T+\tau}}\overline{P}_{T+1:T+\tau}(y_{T+1:T+\tau})\,r(a,y_{T+1:T+\tau}),

over all a\in\mathcal{A}. Then we have

\left|\sum_{y_{T+1:T+\tau}}\overline{P}_{T+1:T+\tau}(y_{T+1:T+\tau})r(a^{*},y_{T+1:T+\tau})-\sum_{y_{T+1:T+\tau}}\hat{P}_{T+1:T+\tau}(y_{T+1:T+\tau})r(a^{*},y_{T+1:T+\tau})\right|
=\left|\sum_{y_{T+1:T+\tau}}\left(\overline{P}_{T+1:T+\tau}(y_{T+1:T+\tau})-\hat{P}_{T+1:T+\tau}(y_{T+1:T+\tau})\right)r(a^{*},y_{T+1:T+\tau})\right|
\leq\sum_{y_{T+1:T+\tau}}\left|\overline{P}_{T+1:T+\tau}(y_{T+1:T+\tau})-\hat{P}_{T+1:T+\tau}(y_{T+1:T+\tau})\right|r(a^{*},y_{T+1:T+\tau})
\stackrel{(a)}{\leq}\sum_{y_{T+1:T+\tau}}\left|\overline{P}_{T+1:T+\tau}(y_{T+1:T+\tau})-\hat{P}_{T+1:T+\tau}(y_{T+1:T+\tau})\right|
=\left\|\overline{P}_{T+1:T+\tau}-\hat{P}_{T+1:T+\tau}\right\|_{1}\leq\sqrt{2\,\mathbf{d}_{\mathrm{KL}}\big(\overline{P}_{T+1:T+\tau}\,\big\|\,\hat{P}_{T+1:T+\tau}\big)},

where (a) follows from r(a^{*},y_{T+1:T+\tau})\in[0,1]. Consequently, we have

\sum_{y_{T+1:T+\tau}}\overline{P}_{T+1:T+\tau}(y_{T+1:T+\tau})r(a^{*},y_{T+1:T+\tau})\leq\sum_{y_{T+1:T+\tau}}\hat{P}_{T+1:T+\tau}(y_{T+1:T+\tau})r(a^{*},y_{T+1:T+\tau})+\sqrt{2\,\mathbf{d}_{\mathrm{KL}}\big(\overline{P}_{T+1:T+\tau}\,\big\|\,\hat{P}_{T+1:T+\tau}\big)}
\leq\sum_{y_{T+1:T+\tau}}\hat{P}_{T+1:T+\tau}(y_{T+1:T+\tau})r(\hat{a},y_{T+1:T+\tau})+\sqrt{2\,\mathbf{d}_{\mathrm{KL}}\big(\overline{P}_{T+1:T+\tau}\,\big\|\,\hat{P}_{T+1:T+\tau}\big)},

where the second inequality follows from the fact that \hat{a} is optimal under \hat{P}_{T+1:T+\tau}. Taking expectations, we have

\mathbb{E}\left[r(\hat{a},Y_{T+1:T+\tau})\right]\geq\mathbb{E}\left[r(a^{*},Y_{T+1:T+\tau})\right]-\mathbb{E}\left[\sqrt{2\,\mathbf{d}_{\mathrm{KL}}\big(\overline{P}_{T+1:T+\tau}\,\big\|\,\hat{P}_{T+1:T+\tau}\big)}\right].

From Jensen’s inequality, we have

\mathbb{E}\left[r(a^{*},Y_{T+1:T+\tau})\right]\geq\max_{a\in\mathcal{A}}\mathbb{E}\left[r(a,Y_{T+1:T+\tau})\right]

and

𝔼[2𝐝KL(P¯T+1:T+τP^T+1:T+τ)]2𝔼[𝐝KL(P¯T+1:T+τP^T+1:T+τ)]=2𝐝KLτ.\mathbb{E}\left[\sqrt{2\mathbf{d}_{\mathrm{KL}}\big{(}\overline{P}_{T+1:T+\tau}\big{\|}\hat{P}_{T+1:T+\tau}\big{)}}\right]\leq\sqrt{2\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big{(}\overline{P}_{T+1:T+\tau}\big{\|}\hat{P}_{T+1:T+\tau}\big{)}\right]}=\sqrt{2\mathbf{d}_{\mathrm{KL}}^{\tau}}.

Consequently, we have

𝔼[r(a^,yT+1:T+τ)]maxa𝒜𝔼[r(a,yT+1:T+τ)]2𝐝KLτ.\mathbb{E}\left[r(\hat{a},y_{T+1:T+\tau})\right]\geq\max_{a\in\mathcal{A}}\mathbb{E}\left[r(a,y_{T+1:T+\tau})\right]-\sqrt{2\mathbf{d}_{\mathrm{KL}}^{\tau}}.

This concludes the proof. ∎
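
The key displayed inequality above, relating the expected reward of $a^{*}$ under $\overline{P}_{T+1:T+\tau}$ to that of $\hat{a}$ under $\hat{P}_{T+1:T+\tau}$, can also be verified numerically. The sketch below uses arbitrary illustrative rewards and distributions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setting: 3 actions, 6 joint outcomes, rewards r(a, y) in [0, 1],
# a conditional distribution P_bar and an approximation P_hat (all made up).
n_actions, n_outcomes = 3, 6
r = rng.uniform(0.0, 1.0, size=(n_actions, n_outcomes))
p_bar = rng.dirichlet(np.ones(n_outcomes))
p_hat = rng.dirichlet(np.ones(n_outcomes))

a_star = int(np.argmax(r @ p_bar))           # maximizer under P_bar
a_hat = int(np.argmax(r @ p_hat))            # maximizer under P_hat

kl = float(np.sum(p_bar * np.log(p_bar / p_hat)))
lhs = float(r[a_star] @ p_bar)               # sum_y P_bar(y) r(a*, y)
rhs = float(r[a_hat] @ p_hat) + np.sqrt(2 * kl)
assert lhs <= rhs + 1e-12
print(lhs, rhs)
```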

Appendix B Proof for Theorem 4.1

B.1 Proof for Theorem 4.1

Proof.

Notice that

\begin{align*}
&\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t+1:T}\in\cdot\,\middle|\,\mathcal{D}_{t},X_{t:T-1}\right)\,\big\|\,\mathds{P}(Y_{t+1:T}\in\cdot\,|\,\theta_{t},X_{t:T-1})\big)\right]\\
&\overset{(a)}{=}\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t+1}\in\cdot\,\middle|\,\mathcal{D}_{t},X_{t}\right)\,\big\|\,\mathds{P}(Y_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\big)\right]\\
&\qquad+\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t+2:T}\in\cdot\,\middle|\,\mathcal{D}_{t+1},X_{t+1:T-1}\right)\,\big\|\,\mathds{P}(Y_{t+2:T}\in\cdot\,|\,\theta_{t},X_{t},Y_{t+1},X_{t+1:T-1})\big)\right]\\
&\overset{(b)}{\leq}\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t+1}\in\cdot\,\middle|\,\mathcal{D}_{t},X_{t}\right)\,\big\|\,\mathds{P}(Y_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\big)\right]\\
&\qquad+\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t+2:T}\in\cdot\,\middle|\,\mathcal{D}_{t+1},X_{t+1:T-1}\right)\,\big\|\,\mathds{P}(Y_{t+2:T}\in\cdot\,|\,\theta_{t+1},X_{t+1:T-1})\big)\right]\\
&\leq\cdots\\
&\leq\mathbb{E}\left[\sum_{t^{\prime}=t}^{T-1}\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t^{\prime}+1}\in\cdot\,\middle|\,\mathcal{D}_{t^{\prime}},X_{t^{\prime}}\right)\,\big\|\,\mathds{P}(Y_{t^{\prime}+1}\in\cdot\,|\,\theta_{t^{\prime}},X_{t^{\prime}})\big)\right]\\
&\overset{(c)}{\leq}\mathbb{E}\left[\sum_{t^{\prime}=t}^{T-1}\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t^{\prime}+1}\in\cdot\,\middle|\,\mathcal{D}_{t^{\prime}},X_{t^{\prime}}\right)\,\big\|\,\mathds{P}(\hat{Y}_{t^{\prime}+1}\in\cdot\,|\,\theta_{t^{\prime}},X_{t^{\prime}})\big)\right]\\
&\overset{(d)}{=}\mathbb{E}\left[\sum_{t^{\prime}=t}^{T-1}\mathbf{d}_{\mathrm{KL}}\big(\bar{P}_{t^{\prime}+1}\,\big\|\,\hat{P}_{t^{\prime}+1}\big)\right]\leq\epsilon,
\end{align*}

where (a) follows from the chain rule of KL divergence, (b) and (c) follow from the fact that the conditional predictive distribution minimizes the KL divergence to the posterior (see Lemma B.1), and (d) follows from the definition. On the other hand, by definition, we have

\[\mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t}\,\middle|\,\theta_{t},X_{t:T-1}\right)=\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t+1:T}\in\cdot\,\middle|\,\mathcal{D}_{t},X_{t:T-1}\right)\,\big\|\,\mathds{P}(Y_{t+1:T}\in\cdot\,|\,\theta_{t},X_{t:T-1})\big)\right]\leq\epsilon.\]

Moreover, from the chain rule of mutual information, we have

\begin{align*}
\mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t},\theta_{t}\,\middle|\,X_{t:T-1}\right)&=\mathbb{I}\left(Y_{t+1:T};\theta_{t}\,\middle|\,X_{t:T-1}\right)+\underbrace{\mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t}\,\middle|\,\theta_{t},X_{t:T-1}\right)}_{\leq\epsilon}\\
&=\mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t}\,\middle|\,X_{t:T-1}\right)+\underbrace{\mathbb{I}\left(Y_{t+1:T};\theta_{t}\,\middle|\,\mathcal{D}_{t},X_{t:T-1}\right)}_{=0},
\end{align*}

where $\mathbb{I}\left(Y_{t+1:T};\theta_{t}\,\middle|\,\mathcal{D}_{t},X_{t:T-1}\right)=0$ since $Y_{t+1:T}$ and $\theta_{t}$ are conditionally independent given $\mathcal{D}_{t}$ and $X_{t:T-1}$. Hence we have

\[\mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t}\,\middle|\,X_{t:T-1}\right)\geq\mathbb{I}\left(Y_{t+1:T};\theta_{t}\,\middle|\,X_{t:T-1}\right)\geq\mathbb{I}\left(Y_{t+1:T};\mathcal{D}_{t}\,\middle|\,X_{t:T-1}\right)-\epsilon.\]

This concludes the proof. ∎
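
Step (a) above, the chain rule for KL divergence, can be illustrated on a toy two-step joint distribution. The probabilities below are arbitrary illustrative values, not tied to any model in the paper:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Toy joint distributions over a pair (Y1, Y2), each binary (illustrative values).
P = np.array([[0.30, 0.20],
              [0.25, 0.25]])
Q = np.array([[0.20, 0.30],
              [0.30, 0.20]])

# Chain rule: KL(P(Y1, Y2) || Q(Y1, Y2))
#   = KL(P(Y1) || Q(Y1)) + E_{Y1 ~ P}[ KL(P(Y2 | Y1) || Q(Y2 | Y1)) ].
joint = kl(P, Q)
marginal = kl(P.sum(axis=1), Q.sum(axis=1))
conditional = sum(P.sum(axis=1)[y1] * kl(P[y1] / P[y1].sum(), Q[y1] / Q[y1].sum())
                  for y1 in range(2))
assert np.isclose(joint, marginal + conditional)
```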

B.2 Lemma B.1

Lemma B.1.

Let the agent parameter $\theta_{t}$ be conditionally independent of $\mathcal{E}$ given $\mathcal{D}_{t}$, i.e., $\theta_{t}\perp\mathcal{E}\,|\,\mathcal{D}_{t}$. Then the conditional predictive distribution $\mathds{P}(Y_{t+1}\in\cdot\,|\,\theta_{t},X_{t})$ minimizes the expected KL divergence

\[\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\left(\mathds{P}(Y_{t+1}\in\cdot\,|\,\mathcal{D}_{t},X_{t})\,\big\|\,\mathds{P}(\hat{Y}_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\right)\right]\]

over all possible predictive distributions $\mathds{P}(\hat{Y}_{t+1}\in\cdot\,|\,\theta_{t},X_{t})$.

Proof.

We have

\begin{align*}
&\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\left(\mathds{P}(Y_{t+1}\in\cdot\,|\,\mathcal{D}_{t},X_{t})\,\big\|\,\mathds{P}(\hat{Y}_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\right)\right]\\
&=\mathbb{E}\left[\sum_{y}\mathds{P}(Y_{t+1}=y\,|\,\mathcal{D}_{t},X_{t})\log\left(\frac{\mathds{P}(Y_{t+1}=y\,|\,\mathcal{D}_{t},X_{t})}{\mathds{P}(Y_{t+1}=y\,|\,\theta_{t},X_{t})}\cdot\frac{\mathds{P}(Y_{t+1}=y\,|\,\theta_{t},X_{t})}{\mathds{P}(\hat{Y}_{t+1}=y\,|\,\theta_{t},X_{t})}\right)\right]\\
&=\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\left(\mathds{P}(Y_{t+1}\in\cdot\,|\,\mathcal{D}_{t},X_{t})\,\big\|\,\mathds{P}(Y_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\right)\right]\\
&\qquad+\mathbb{E}\left[\sum_{y}\mathds{P}(Y_{t+1}=y\,|\,\mathcal{D}_{t},X_{t})\log\frac{\mathds{P}(Y_{t+1}=y\,|\,\theta_{t},X_{t})}{\mathds{P}(\hat{Y}_{t+1}=y\,|\,\theta_{t},X_{t})}\right]\\
&=\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\left(\mathds{P}(Y_{t+1}\in\cdot\,|\,\mathcal{D}_{t},X_{t})\,\big\|\,\mathds{P}(Y_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\right)\right]\\
&\qquad+\mathbb{E}\left[\sum_{y}\mathds{P}(Y_{t+1}=y\,|\,\theta_{t},X_{t})\log\frac{\mathds{P}(Y_{t+1}=y\,|\,\theta_{t},X_{t})}{\mathds{P}(\hat{Y}_{t+1}=y\,|\,\theta_{t},X_{t})}\right]\\
&=\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\left(\mathds{P}(Y_{t+1}\in\cdot\,|\,\mathcal{D}_{t},X_{t})\,\big\|\,\mathds{P}(Y_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\right)\right]\\
&\qquad+\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\left(\mathds{P}(Y_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\,\big\|\,\mathds{P}(\hat{Y}_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\right)\right]\\
&\geq\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\left(\mathds{P}(Y_{t+1}\in\cdot\,|\,\mathcal{D}_{t},X_{t})\,\big\|\,\mathds{P}(Y_{t+1}\in\cdot\,|\,\theta_{t},X_{t})\right)\right],
\end{align*}

where the third equality uses the tower property of conditional expectation together with the fact that $Y_{t+1}$ and $\theta_{t}$ are conditionally independent given $(\mathcal{D}_{t},X_{t})$, and the final inequality follows from the non-negativity of the KL divergence.

Similarly, we can prove that

\begin{equation}
\begin{split}
&\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t+2:T}\in\cdot\,\middle|\,\mathcal{D}_{t+1},X_{t+1:T-1}\right)\,\big\|\,\mathds{P}(Y_{t+2:T}\in\cdot\,|\,\theta_{t},X_{t},Y_{t+1},X_{t+1:T-1})\big)\right]\\
&\qquad\leq\mathbb{E}\left[\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}\left(Y_{t+2:T}\in\cdot\,\middle|\,\mathcal{D}_{t+1},X_{t+1:T-1}\right)\,\big\|\,\mathds{P}(Y_{t+2:T}\in\cdot\,|\,\theta_{t+1},X_{t+1:T-1})\big)\right].
\end{split}
\tag{11}
\end{equation}

To see this, notice that $\theta_{t+1}\sim\mathds{P}(\theta_{t+1}\in\cdot\,|\,\theta_{t},X_{t},Y_{t+1},t)$, and hence $\mathds{P}(Y_{t+2:T}\in\cdot\,|\,\theta_{t+1},X_{t+1:T-1})$ is a particular choice of $\mathds{P}(\hat{Y}_{t+2:T}\in\cdot\,|\,\theta_{t},X_{t},Y_{t+1},X_{t+1:T-1})$. Consequently, an analysis similar to the proof of Lemma B.1 establishes inequality (11). ∎
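
The content of Lemma B.1, that among all predictive distributions depending only on $(\theta_{t},X_{t})$ the conditional predictive distribution attains the smallest expected KL divergence to the posterior predictive, can also be checked numerically. The following sketch assumes a toy discrete model in which $\theta_{t}$ is a deterministic summary of $\mathcal{D}_{t}$; all quantities are illustrative assumptions, not from the paper:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Toy model (illustrative): D_t takes 3 values; theta_t = f(D_t) is a coarse summary.
p_d = np.array([0.5, 0.3, 0.2])                  # distribution of D_t
p_y_given_d = np.array([[0.9, 0.1],              # posterior predictive P(Y_{t+1} | D_t = d)
                        [0.6, 0.4],
                        [0.2, 0.8]])
theta_of_d = np.array([0, 1, 0])                 # theta_t as a deterministic function of D_t

def expected_kl(q_given_theta):
    """E[ d_KL( P(Y_{t+1} | D_t) || q(Y_{t+1} | theta_t) ) ]."""
    return sum(p_d[d] * kl(p_y_given_d[d], q_given_theta[theta_of_d[d]]) for d in range(3))

# Conditional predictive distribution P(Y_{t+1} | theta_t): mixture of P(Y|D) over P(D|theta).
cond_pred = np.zeros((2, 2))
for th in range(2):
    w = p_d * (theta_of_d == th)
    cond_pred[th] = (w[:, None] * p_y_given_d).sum(axis=0) / w.sum()

# No other theta-measurable predictive distribution achieves a smaller expected KL.
rng = np.random.default_rng(1)
for _ in range(1000):
    q = rng.dirichlet(np.ones(2), size=2)        # an arbitrary alternative predictive
    assert expected_kl(cond_pred) <= expected_kl(q) + 1e-9
```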

Appendix C Proof for Theorem 5.1

Proof.

We will now analyze the approximate TS algorithm. For this purpose, we continue to use a proxy $\tilde{\mathcal{E}}=\tilde{\mathbf{Y}}_{1:\tau}$ and target $\tilde{A}=\min\operatorname*{arg\,max}_{k}\tilde{p}_{k}$. Further, we define $\tilde{\mathbf{Y}}^{t}_{1:\tau}\sim\mathds{P}(\tilde{\mathbf{Y}}_{1:\tau}\in\cdot\,|\,H_{t})$, $\tilde{p}^{t}\sim\mathds{P}(p\in\cdot\,|\,\tilde{\mathbf{Y}}_{1:\tau}=\tilde{\mathbf{Y}}^{t}_{1:\tau})$, and $\tilde{A}_{t}=\min\operatorname*{arg\,max}_{k}\tilde{p}^{t}_{k}$.
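
For concreteness, the sketch below shows one way these proxy quantities could be instantiated in a Bernoulli bandit with Beta beliefs; the Beta-Bernoulli model, the pseudo-counts, and the variable names are assumptions made only for illustration, not the paper's general setting:

```python
import numpy as np

rng = np.random.default_rng(0)
K, tau = 5, 100                        # number of arms, length of the imagined future

# Assumed Beta-Bernoulli instantiation: each p_k has a Beta prior, and (a_t, b_t)
# are pseudo-counts summarizing the history H_t.
a0 = np.ones(K)
b0 = np.ones(K)
a_t, b_t = a0 + 3.0, b0 + 2.0          # made-up posterior counts after some history

# Y_tilde^t_{1:tau} ~ P(Y_tilde_{1:tau} in . | H_t): imagine tau outcomes per arm by
# sampling p from the posterior and then sampling Bernoulli outcomes (a joint sample).
p_post = rng.beta(a_t, b_t)
y_tilde_t = rng.binomial(1, p_post, size=(tau, K))

# p_tilde^t ~ P(p in . | Y_tilde_{1:tau} = Y_tilde^t_{1:tau}): resample p given only
# the imagined outcomes, under the prior.
s = y_tilde_t.sum(axis=0)
p_tilde_t = rng.beta(a0 + s, b0 + tau - s)

# A_tilde_t = min argmax_k p_tilde^t_k (np.argmax breaks ties toward the smallest index).
a_tilde_t = int(np.argmax(p_tilde_t))
print(a_tilde_t, np.round(p_tilde_t, 2))
```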

Notice that

\begin{equation}
\mathbb{E}[p_{A^{*}}-p_{A_{t}}]=\mathbb{E}[p_{A^{*}}-p_{\tilde{A}}]+\mathbb{E}[p_{\tilde{A}}-p_{A_{t}}],
\tag{12}
\end{equation}

and we first show that $\mathbb{E}[p_{A^{*}}-p_{\tilde{A}}]\leq\frac{K}{\sqrt{2\tau}}$. Notice that

\begin{align*}
\mathbb{E}[p_{A^{*}}-p_{\tilde{A}}]&=\mathbb{E}\big[\mathbb{E}[p_{A^{*}}-p_{\tilde{A}}\,|\,\tilde{\mathcal{E}}]\big]=\mathbb{E}\big[\mathbb{E}[\tilde{p}_{\tilde{A}}-p_{\tilde{A}}\,|\,\tilde{\mathcal{E}}]\big]\\
&\leq\mathbb{E}\big[\mathbb{E}[\max\nolimits_{k}(\tilde{p}_{k}-p_{k})\,|\,\tilde{\mathcal{E}}]\big]\leq\sum_{k=1}^{K}\mathbb{E}\big[|\tilde{p}_{k}-p_{k}|\big]\stackrel{(a)}{\leq}\frac{K}{\sqrt{2\tau}},
\end{align*}

where the second equality holds since, conditioned on $\tilde{\mathcal{E}}$, the pairs $(p,A^{*})$ and $(\tilde{p},\tilde{A})$ are identically distributed.

Please refer to (Mattner, 2003) for the proof of inequality (a).
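
Inequality (a) can also be checked by simulation. The following Monte-Carlo sketch assumes a uniform prior over each $p_{k}$ and estimates $\mathbb{E}[|\tilde{p}_{k}-p_{k}|]$, which should not exceed $1/\sqrt{2\tau}$; this is an illustration, not a proof:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, n_mc = 50, 200_000

# p ~ Uniform(0, 1); Y_tilde_{1:tau} ~ Bernoulli(p); p_tilde ~ P(p | Y_tilde_{1:tau}).
p = rng.uniform(size=n_mc)
s = rng.binomial(tau, p)                    # imagined successes out of tau
p_tilde = rng.beta(1 + s, 1 + tau - s)      # independent resample of p given Y_tilde

estimate = np.abs(p_tilde - p).mean()       # Monte-Carlo estimate of E|p_tilde - p|
bound = 1 / np.sqrt(2 * tau)
assert estimate <= bound
print(estimate, bound)
```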

Now we are going to bound the second term $\mathbb{E}[p_{\tilde{A}}-p_{A_{t}}]$ on the RHS of Equation (12) by bounding its conditional version $\mathbb{E}_{t}[p_{\tilde{A}}-p_{A_{t}}]$ in terms of the information gain $\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})$, where the subscript $t$ in $\mathbb{E}_{t}$ and $\mathbb{I}_{t}$ indicates that the expectation and the mutual information are conditioned on the realized history $H_{t}$. In other words, $\mathbb{E}_{t}[p_{\tilde{A}}-p_{A_{t}}]=\mathbb{E}[p_{\tilde{A}}-p_{A_{t}}\,|\,H_{t}]$ and $\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})=\mathbb{I}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}}\,|\,H_{t}=H_{t})$. We first note that

\begin{align*}
\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})&=\mathbb{I}_{t}(\tilde{A};A_{t})+\mathbb{I}_{t}(\tilde{A};\mathbf{Y}_{A_{t}}\,|\,A_{t})\\
&=\mathbb{I}_{t}(\tilde{A};\mathbf{Y}_{A_{t}}\,|\,A_{t})\\
&=\sum_{a}\mathds{P}_{t}(A_{t}=a)\,\mathbb{I}_{t}(\tilde{A};\mathbf{Y}_{A_{t}}\,|\,A_{t}=a)\\
&=\sum_{a}\mathds{P}_{t}(A_{t}=a)\,\mathbb{I}_{t}(\tilde{A};\mathbf{Y}_{a})\\
&=\sum_{a}\mathds{P}_{t}(A_{t}=a)\left(\sum_{\tilde{a}}\mathds{P}_{t}(\tilde{A}=\tilde{a})\,\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}_{t}(\mathbf{Y}_{a}\mid\tilde{A}=\tilde{a})\,\big\|\,\mathds{P}_{t}(\mathbf{Y}_{a})\big)\right)\\
&=\sum_{a,\tilde{a}}\mathds{P}_{t}(A_{t}=a)\,\mathds{P}_{t}(\tilde{A}=\tilde{a})\,\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}_{t}(\mathbf{Y}_{a}\mid\tilde{A}=\tilde{a})\,\big\|\,\mathds{P}_{t}(\mathbf{Y}_{a})\big),
\end{align*}

where the first equality uses the chain rule of mutual information; the second equality follows from the fact that $A_{t}$ and $\tilde{A}$ are conditionally independent given $H_{t}$, so $\mathbb{I}_{t}(\tilde{A};A_{t})=0$; the fourth equality uses the fact that $A_{t}$ is conditionally jointly independent of $\tilde{A}$ and $\mathbf{Y}=(\mathbf{Y}_{1},\ldots,\mathbf{Y}_{K})$; and the fifth equality follows from the KL-divergence form of mutual information.

Now we are ready to bound $\mathbb{E}_{t}[p_{\tilde{A}}-p_{A_{t}}]$ in terms of $\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})$:

\begin{align*}
\mathbb{E}_{t}[p_{\tilde{A}}-p_{A_{t}}]&=\mathbb{E}_{t}[\mathbf{Y}_{\tilde{A}}-\mathbf{Y}_{A_{t}}]\\
&=\sum_{a}\mathds{P}_{t}(\tilde{A}=a)\,\mathbb{E}_{t}[\mathbf{Y}_{a}\,|\,\tilde{A}=a]-\sum_{a}\mathds{P}_{t}(A_{t}=a)\,\mathbb{E}_{t}[\mathbf{Y}_{a}\,|\,A_{t}=a]\\
&=\sum_{a}\mathds{P}_{t}(\tilde{A}=a)\,\mathbb{E}_{t}[\mathbf{Y}_{a}\,|\,\tilde{A}=a]-\sum_{a}\mathds{P}_{t}(A_{t}=a)\,\mathbb{E}_{t}[\mathbf{Y}_{a}]\\
&=\sum_{a}\sqrt{\mathds{P}_{t}(A_{t}=a)\,\mathds{P}_{t}(\tilde{A}=a)}\,\big(\mathbb{E}_{t}[\mathbf{Y}_{a}\,|\,\tilde{A}=a]-\mathbb{E}_{t}[\mathbf{Y}_{a}]\big)\\
&\qquad+\sum_{a}\Big(\sqrt{\mathds{P}_{t}(\tilde{A}=a)}-\sqrt{\mathds{P}_{t}(A_{t}=a)}\Big)\Big(\sqrt{\mathds{P}_{t}(\tilde{A}=a)}\,\mathbb{E}_{t}[\mathbf{Y}_{a}\,|\,\tilde{A}=a]+\sqrt{\mathds{P}_{t}(A_{t}=a)}\,\mathbb{E}_{t}[\mathbf{Y}_{a}]\Big)\\
&\leq\sqrt{K\sum_{a}\mathds{P}_{t}(A_{t}=a)\,\mathds{P}_{t}(\tilde{A}=a)\,\big(\mathbb{E}_{t}[\mathbf{Y}_{a}\,|\,\tilde{A}=a]-\mathbb{E}_{t}[\mathbf{Y}_{a}]\big)^{2}}\\
&\qquad+\sum_{a}\Big|\sqrt{\mathds{P}_{t}(\tilde{A}=a)}-\sqrt{\mathds{P}_{t}(A_{t}=a)}\Big|\Big(\sqrt{\mathds{P}_{t}(\tilde{A}=a)}+\sqrt{\mathds{P}_{t}(A_{t}=a)}\Big)\\
&\leq\sqrt{K\sum_{a,\tilde{a}}\mathds{P}_{t}(A_{t}=a)\,\mathds{P}_{t}(\tilde{A}=\tilde{a})\,\big(\mathbb{E}_{t}[\mathbf{Y}_{a}\,|\,\tilde{A}=\tilde{a}]-\mathbb{E}_{t}[\mathbf{Y}_{a}]\big)^{2}}+\sum_{a}\big|\mathds{P}_{t}(\tilde{A}=a)-\mathds{P}_{t}(A_{t}=a)\big|\\
&\leq\sqrt{\frac{K}{2}\sum_{a,\tilde{a}}\mathds{P}_{t}(A_{t}=a)\,\mathds{P}_{t}(\tilde{A}=\tilde{a})\,\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}_{t}(\mathbf{Y}_{a}\mid\tilde{A}=\tilde{a})\,\big\|\,\mathds{P}_{t}(\mathbf{Y}_{a})\big)}+\sum_{a}\big|\mathds{P}_{t}(\tilde{A}_{t}=a)-\mathds{P}_{t}(A_{t}=a)\big|\\
&=\sqrt{\frac{K}{2}\,\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})}+\sum_{a}\big|\mathds{P}_{t}(\tilde{A}_{t}=a)-\mathds{P}_{t}(A_{t}=a)\big|,
\end{align*}

where the third equality uses the fact that $A_{t}$ is conditionally independent of $\mathbf{Y}=(\mathbf{Y}_{1},\ldots,\mathbf{Y}_{K})$; the first inequality follows from the Cauchy–Schwarz inequality and $\mathbf{Y}_{a}\in\{0,1\}$; the second inequality holds because adding the off-diagonal terms $\tilde{a}\neq a$ only increases the first sum and $\big|\sqrt{x}-\sqrt{y}\big|\big(\sqrt{x}+\sqrt{y}\big)=|x-y|$; and the third inequality uses Pinsker's inequality and $\mathds{P}_{t}(\tilde{A}=a)=\mathds{P}_{t}(\tilde{A}_{t}=a)$. Hence,

\begin{align*}
\mathbb{E}[p_{\tilde{A}}-p_{A_{t}}]&=\mathbb{E}\big[\mathbb{E}_{t}[p_{\tilde{A}}-p_{A_{t}}]\big]\\
&\leq\mathbb{E}\left[\sqrt{\frac{K}{2}\,\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})}\right]+\mathbb{E}\left[\sum_{a}\big|\mathds{P}_{t}(\tilde{A}_{t}=a)-\mathds{P}_{t}(A_{t}=a)\big|\right]\\
&\leq\sqrt{\frac{K}{2}\,\mathbb{E}\left[\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})\right]}+\mathbb{E}\left[\sqrt{2\,\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}(\tilde{p}^{t}\in\cdot\,|\,H_{t})\,\big\|\,\mathds{P}(\hat{p}^{t}\in\cdot\,|\,H_{t})\big)}\right]\\
&\leq\sqrt{\frac{K}{2}\,\mathbb{E}\left[\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})\right]}+\mathbb{E}\left[\sqrt{2\,\mathbf{d}_{\mathrm{KL}}\big(\mathds{P}(\tilde{\mathbf{Y}}_{1:\tau}\in\cdot\,|\,H_{t})\,\big\|\,\mathds{P}(\hat{\mathbf{Y}}^{t}_{1:\tau}\in\cdot\,|\,H_{t})\big)}\right]\\
&\leq\sqrt{\frac{K}{2}\,\mathbb{E}\left[\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})\right]}+\sqrt{2\epsilon},
\end{align*}

where the second inequality follows from Jensen's inequality and Pinsker's inequality, and the third inequality uses the information processing inequality; thus

\[\mathbb{E}[p_{A^{*}}-p_{A_{t}}]=\mathbb{E}[p_{A^{*}}-p_{\tilde{A}}]+\mathbb{E}[p_{\tilde{A}}-p_{A_{t}}]\leq\frac{K}{\sqrt{2\tau}}+\sqrt{\frac{K}{2}\,\mathbb{E}\left[\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})\right]}+\sqrt{2\epsilon}.\]

Using an analysis similar to that of Lu et al. (2021), we obtain

\begin{align*}
{\rm Regret}(T)&=\sum_{t=0}^{T-1}\mathbb{E}[p_{A^{*}}-p_{A_{t}}]\\
&\leq\sum_{t=0}^{T-1}\left(\frac{K}{\sqrt{2\tau}}+\sqrt{\frac{K}{2}\,\mathbb{E}\left[\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})\right]}+\sqrt{2\epsilon}\right)\\
&\leq\sqrt{\frac{K}{2}T}\sqrt{\sum_{t=0}^{T-1}\mathbb{E}\left[\mathbb{I}_{t}(\tilde{A};A_{t},\mathbf{Y}_{A_{t}})\right]}+\left(\frac{K}{\sqrt{2\tau}}+\sqrt{2\epsilon}\right)T\\
&\leq\sqrt{\frac{K}{2}T}\sqrt{\mathbb{I}(\tilde{A};A_{0},\mathbf{Y}_{A_{0}},\ldots,A_{T-1},\mathbf{Y}_{A_{T-1}})}+\left(\frac{K}{\sqrt{2\tau}}+\sqrt{2\epsilon}\right)T\\
&=\sqrt{\frac{K}{2}T}\sqrt{\mathbb{H}(\tilde{A})-\mathbb{H}(\tilde{A}\mid A_{0},\mathbf{Y}_{A_{0}},\ldots,A_{T-1},\mathbf{Y}_{A_{T-1}})}+\left(\frac{K}{\sqrt{2\tau}}+\sqrt{2\epsilon}\right)T\\
&\leq\sqrt{\frac{K}{2}T\log(K)}+\left(\frac{K}{\sqrt{2\tau}}+\sqrt{2\epsilon}\right)T,
\end{align*}

where the second inequality uses the Cauchy–Schwarz inequality; the third inequality follows from the chain rule for mutual information; and the last inequality follows from $\mathbb{H}(\tilde{A})\leq\log(K)$ and the non-negativity of conditional entropy. This completes the proof. ∎
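
To get a sense of how the final bound scales, the short sketch below evaluates $\sqrt{\tfrac{K}{2}T\log(K)}+\big(\tfrac{K}{\sqrt{2\tau}}+\sqrt{2\epsilon}\big)T$ for a few illustrative parameter settings (arbitrary example values, not experiments from the paper):

```python
import numpy as np

def regret_bound(K, T, tau, eps):
    """sqrt(K*T*log(K)/2) + (K/sqrt(2*tau) + sqrt(2*eps)) * T, as in the last display."""
    return np.sqrt(0.5 * K * T * np.log(K)) + (K / np.sqrt(2 * tau) + np.sqrt(2 * eps)) * T

for K, T, tau, eps in [(10, 1_000, 10_000, 0.0), (10, 1_000, 10_000, 1e-4), (50, 10_000, 10**6, 1e-5)]:
    print(f"K={K}, T={T}, tau={tau}, eps={eps}: bound = {regret_bound(K, T, tau, eps):.1f}")
```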