Analysis and Design of Thompson Sampling
for Stochastic Partial Monitoring

Taira Tsuchiya
The University of Tokyo
RIKEN AIP
[email protected]
&Junya Honda
The University of Tokyo
RIKEN AIP
[email protected]
&Masashi Sugiyama
RIKEN AIP
The University of Tokyo
[email protected]

Abstract

We investigate finite stochastic partial monitoring, which is a general model for sequential learning with limited feedback. While Thompson sampling is one of the most promising algorithms on a variety of online decision-making problems, its properties for stochastic partial monitoring have not been theoretically investigated, and the existing algorithm relies on a heuristic approximation of the posterior distribution. To mitigate these problems, we present a novel Thompson-sampling-based algorithm, which enables us to exactly sample the target parameter from the posterior distribution. Besides, we prove that the new algorithm achieves the logarithmic problem-dependent expected pseudo-regret $\operatorname{O}(\log T)$ for a linearized variant of the problem with local observability. This result is the first regret bound of Thompson sampling for partial monitoring, which also becomes the first logarithmic regret bound of Thompson sampling for linear bandits.

1 Introduction

Partial monitoring (PM) is a general sequential decision-making problem with limited feedback (Rustichini, 1999; Piccolboni and Schindelhauer, 2001). PM is attracting broad interest because it includes a wide range of problems such as the multi-armed bandit problem (Lai and Robbins, 1985), a linear optimization problem with full or bandit feedback (Zinkevich, 2003; Dani et al., 2008), dynamic pricing (Kleinberg and Leighton, 2003), and label efficient prediction (Cesa-Bianchi et al., 2005).

A PM game can be seen as a sequential game that is played by two players: a learner and an opponent. At every round, the learner chooses an action, while the opponent chooses an outcome. Then, the learner suffers an unobserved loss and receives a feedback symbol, both of which are determined from the selected action and outcome. The main characteristic of this game is that the learner cannot directly observe the outcome and loss. The goal of the learner is to minimize his/her cumulative loss over all rounds. The performance of the learner is evaluated by the regret, which is defined as the difference between the cumulative losses of the learner and the optimal action (i.e., the action whose expected loss is the smallest).

There are mainly two types of PM games, which are the stochastic and adversarial settings (Piccolboni and Schindelhauer, 2001; Bartók et al., 2011). In the stochastic setting, the outcome at each round is determined from the opponent’s strategy, which is a probability vector over the opponent’s possible choices. On the other hand, in the adversarial setting, the outcomes are arbitrarily decided by the opponent. We refer to the PM game with finite actions and finite outcomes as a finite PM game. In this paper, we focus on the stochastic finite game.

One of the first algorithms for PM was considered by Piccolboni and Schindelhauer (2001). They proposed the FeedExp3 algorithm, the key idea of which is to use an unbiased estimator of the losses. They showed that the FeedExp3 algorithm attains $\operatorname{\tilde{O}}(T^{3/4})$ minimax regret for a certain class of PM games, and showed that any algorithm suffers linear minimax regret $\Omega(T)$ for the other class. Here $T$ is the time horizon and the notation $\operatorname{\tilde{O}}(\cdot)$ hides polylogarithmic factors. The upper bound $\operatorname{\tilde{O}}(T^{3/4})$ is later improved by Cesa-Bianchi et al. (2006) to $\operatorname{O}(T^{2/3})$ , and they also provided a game with a matching lower bound.

In the seminal paper by Bartók et al. (2011), they classified PM games into four classes based on their minimax regrets. To be more specific, they classified games into trivial, easy, hard, and hopeless games, where their minimax regrets are $0$ , $\operatorname{\tilde{\Theta}}(\sqrt{T})$ , $\Theta(T^{2/3})$ , and $\Theta(T)$ , respectively. Note that the easy game is also called a locally observable game. After their work, several algorithms have been proposed for the finite PM problem (Bartók et al., 2012; Vanchinathan et al., 2014; Komiyama et al., 2015). For the problem-dependent regret analysis, Komiyama et al. (2015) proposed an algorithm that achieves $\operatorname{O}(\log T)$ regret with the optimal constant factor. However, it requires to solve a time-consuming optimization problem with infinitely many constraints at each round. In addition, this algorithm relies on the forced exploration to achieve the optimality, which makes the empirical performance near-optimal only after prohibitively many rounds, say, $10^{5}$ or $10^{6}$ .

Thompson sampling (TS, Thompson, 1933) is one of the most promising algorithms on a variety of online decision-making problems such as the multi-armed bandit (Lai and Robbins, 1985) and the linear bandit (Agrawal and Goyal, 2013b), and the effectiveness of TS has been investigated both empirically (Chapelle and Li, 2011) and theoretically (Kaufmann et al., 2012; Agrawal and Goyal, 2013a; Honda and Takemura, 2014). In the literature on PM, Vanchinathan et al. (2014) proposed a TS-based algorithm called BPM-TS (Bayes-update for PM based on TS) for stochastic PM, which empirically achieved state-of-the-art performance. Their algorithm uses Gaussian approximation to handle the complicated posterior distribution of the opponent’s strategy. However, this approximation is somewhat heuristic and can degrade the empirical performance due to the discrepancy from the exact posterior distribution. Furthermore, no theoretical guarantee is provided for BPM-TS.

Our goals are to establish a new TS-based algorithm for stochastic PM, which allows us to sample the opponent’s strategy parameter from the exact posterior distribution, and investigate whether the TS-based algorithm can achive sub-linear regret in stochastic PM. Using the accept-reject sampling, we propose a new TS-based algorithm for PM (TSPM), which is equipped with a numerical scheme to obtain a posterior sample from the complicated posterior distribution. We derive a logarithmic regret upper bound $\operatorname{O}(\log{T})$ for the proposed algorithm on the locally observable game under a linearized variant of the problem. This is the first regret bound for TS on the locally observable game. Moreover, our setting includes the linear bandit problem and our result is also the first logarithmic expected regret bound of TS for the linear bandit, whereas a high-probability bound was provided, for example, in Agrawal and Goyal (2013b). Finally, we compare the performance of TSPM with existing algorithms in numerical experiments, and show that TSPM outperforms existing algorithms.

2 Preliminaries

This paper studies finite stochastic PM games (Bartók et al., 2011). A PM game with $N$ actions and $M$ outcomes is defined by a pair of a loss matrix $\mathbf{L}=(\ell_{i,j})\in\mathbb{R}^{N\times M}$ and feedback matrix $\mathbf{H}=(h_{i,j})\in[A]^{N\times M}$ , where $A$ is the number of feedback symbols and $[A]={\left\{1,2,\ldots,A\right\}}$ .

A PM game can be seen as a sequential game that is played by two players: the learner and the opponent. At each round $t=1,2,\dots,T$ , the learner selects action $i(t)\in[N]$ , and at the same time the opponent selects an outcome based on the opponent’s strategy $p^{*}\in\mathcal{P}_{M}$ , where $\mathcal{P}_{n}={\left\{p\in\mathbb{R}^{n}:p_{k}\geq 0,\sum_{k=1}^{n}p_{k}=1\right\}}$ is the $(n-1)$ -dimensional probability simplex. The outcome $j(t)$ of each round is an independent and identically distributed sample from $p^{*}$ , and then, the learner suffers loss $\ell_{i(t),j(t)}$ at time $t$ . The learner cannot directly observe the value of this loss, but instead observes the feedback symbol $y(t)=h_{i(t),j(t)}\in[A]$ . The setting explained above has been widely studied in the literature of stochastic PM (Bartók et al., 2011; Komiyama et al., 2015), and we call this the discrete setting. In Section 4, we also introduce a linear setting for theoretical analysis, which is a slightly different setting from the discrete one.

The learner aims to minimize the cumulative loss over $T$ rounds. The expected loss of action $i$ is given by $L_{i}^{\top}p^{*}$ , where $L_{i}$ is the $i$ -th column of $\mathbf{L}^{\top}$ . We say action $i$ is optimal under strategy $p^{*}$ if $(L_{i}-L_{j})^{\top}p^{*}\leq 0$ for any $j\neq i$ . We assume that the optimal action is unique, and without loss of generality that the optimal action is action $1$ . Let $\Delta_{i}=(L_{i}-L_{1})^{\top}p^{*}\geq 0$ for $i\in[N]$ and $N_{i}(t)$ be the number of times action $i$ is selected before the $t$ -th round. When the time step is clear from the context, we use $n_{i}$ instead of $N_{i}(t)$ . We adopt the pseudo-regret to measure the performance: $\mathrm{Reg}(T)=\sum_{t=1}^{T}\Delta_{i(t)}=\sum_{i\in[N]}\Delta_{i}N_{i}(T+1)$ . This is the relative performance of the algorithm against the oracle, which knows the optimal action $1$ before the game starts.

We introduce the following definitions to clarify the class of PM games, for which we develop an algorithm and derive a regret upper bound. The following cell decomposition is the concept to divide the simplex $\mathcal{P}_{M}$ based on the loss matrix to identify the optimal action, which depends on the opponent’s strategy $p^{*}$ .

Definition 1 (Cell decomposition and Pareto-optimality (Bartók et al., 2011)).

For every action $i\in[N]$ , cell $\mathcal{C}_{i}\coloneqq\{p\in\mathcal{P}_{M}:(L_{i}-L_{j})^{\top}p\leq 0,\,\forall j\neq i\}$ is the set of opponent’s strategies for which action $i$ is optimal. Action $i$ is Pareto-optimal if there exists an opponent’s strategy $p^{*}$ under which action $i$ is optimal.

Each cell is a convex closed polytope. Next, we define neighbors between two Pareto-optimal actions, which intuitively means that the two actions “touch” each other in their surfaces.

Definition 2 (Neighbors and neighborhood action (Bartók et al., 2011)).

Two Pareto-optimal actions $i$ and $j$ are neighbors if $\mathcal{C}_{i}\cap\mathcal{C}_{j}$ is an $(M-2)$ -dimensional polytope. For two neighboring actions $i,j\in[N]$ , the neighborhood action set is defined as $N_{{i},{j}}^{+}=\{k\in[N]:\mathcal{C}_{i}\cap\mathcal{C}_{j}\subseteq\mathcal{C}_{k}\}$ .

Note that the neighborhood action set $N_{{i},{j}}^{+}$ includes actions $i$ and $j$ from its definition. Next, we define the signal matrix, which encodes the information of the feedback matrix $\mathbf{H}$ so that we can utilize the feedback information.

Definition 3 (Signal matrix (Komiyama et al., 2015)).

The signal matrix $S_{i}\in{\left\{0,1\right\}}^{A\times M}$ of action $i$ is defined as $(S_{i})_{y,j}=\mathbbm{1}{\left[h_{i,j}=y\right]}$ , where $\mathbbm{1}{\left[X\right]}=1$ if the event $X$ is true and $0$ otherwise.

Note that if we define the signal matrix as above, $S_{i}p^{*}\in\mathbb{R}^{A}$ is a probability vector over feedback symbols of action $i$ . The following local observability condition separates easy and hard games, this condition intuitively means that the information obtained by taking actions in the neighborhood action set $N_{{i},{j}}^{+}$ is sufficient to distinguish the loss difference between actions $i$ and $j$ .

Definition 4 (Local observability (Bartók et al., 2011)).

A partial monitoring game is said to be locally observable if for all pairs $i,j$ of neighboring actions, $L_{i}-L_{j}\in\oplus_{k\in N_{{i},{j}}^{+}}\operatorname{Im}S_{k}^{\top}$ , where $\operatorname{Im}V$ is the image of the linear map $V$ , and $V\oplus W$ is the direct sum between the vector spaces $V$ and $W$ .

We also consider the concept of the strong local observability condition, which implies the above local observability condition.

Definition 5 (Strong local observability).

A partial monitoring game is said to be strongly locally observable if for all pairs $i,j\in[N]$ , $L_{i}-L_{j}\in\operatorname{Im}S_{i}^{\top}\oplus\operatorname{Im}S_{j}^{\top}$ .

This condition was assumed in the theoretical analysis in Vanchinathan et al. (2014), and we also assume this condition in theoretical analysis in Section 4. Note that the strong local observability means that, for any $j\neq k$ , there exists $z_{j,k}\neq 0\in\mathbb{R}^{2A}$ such that $L_{j}-L_{k}=(S_{j}^{\top},S_{k}^{\top})\,z_{j,k}$ .

Notation. Let $\|\cdot\|$ and $\|\cdot\|_{p}$ be the Euclidian norm and $p$ -norm, and let $\|x\|_{A}=\sqrt{x^{\top}Ax}$ be the norm induced by the positive semidefinite matrix $A\succeq 0$ . Let $\mathcal{D}_{\mathrm{KL}}\left({p}\middle\|{q}\right)=\sum_{a=1}^{A}p_{a}\log(p_{a}/q_{a})$ be the Kullback-Leibler divergence of $p$ from $q$ . The vector $e_{y}\in\mathbb{R}^{M}$ is the $y$ -th orthonormal basis of $\mathbb{R}^{M}$ , and $\mathbf{1}_{n}=[1,\ldots,1]^{\top}$ is the $n$ -dimensional all-one vector. Let $q_{i}^{(t)}$ be the empirical feedback distribution of action $i$ at time $t$ , i.e., $q_{i}^{(t)}=[n_{i1}/n_{i},\ldots,n_{iA}/n_{i}]^{\top}\in\mathcal{P}_{A}$ , where $n_{iy}=\sum_{s=1}^{t}\mathbbm{1}{\left[i(s)=i,y(s)=y\right]}$ and $n_{i}=\sum_{y=1}^{A}n_{iy}$ . The notation is summarized in Appendix A.

Methods for Sampling from Posterior Distribution. We briefly review the methods to draw a sample from the posterior distribution. While TS is one of the most promising algorithms, the posterior distribution can be in a quite complicated form, which makes obtaining a sample from it computationally hard. To overcome this issue, a variety of approximate posterior sampling methods have been considered, such as Gibbs sampling, Langevin Monte Carlo, Laplace approximation, and the bootstrap (Russo et al., 2018, Section 5). Recent work (Lu and Van Roy, 2017) proposed a flexible approximation method, which can even efficiently be applied to quite complex models such as neural networks. However, more recent work revealed that algorithms based on such an approximation procedure can suffer a linear regret (Phan et al., 2019), even if the approximation error in terms of the $\alpha$ -divergence is small enough.

Although BPM-TS is one of the best methods for stochastic PM, it approximates the posterior by a Gaussian distribution in a heuristic way, which can degrade the empirical performance due to the distributional discrepancy from the exact posterior distribution. Furthermore, no theoretical guarantee is provided for BPM-TS. In this paper, we mitigate these problems by providing a new algorithm for stochastic PM, which allows us to exactly draw samples from the posterior distribution. We also give theoretical analysis for the proposed algorithm.

3 Thompson-sampling-based Algorithm for Partial Monitoring

In this section, we present a new algorithm for stochastic PM games, where we name the algorithm TSPM (TS-based algorithm for PM). The algorithm is given in Algorithm 1, and we will explain the subroutines in the following.

Input: prior parameter

\lambda>0

1 Set

B_{0}\leftarrow\lambda I_{M},b_{0}\leftarrow 0

2 Take each action for

n\geq 1

times.

3 for

t=1,2,\ldots,T

4 Sample

\tilde{p}_{t}\sim\pi(p\mid{\left\{i(s),y(s)\right\}}_{s=1}^{t})

based on the accept-reject sampling (Algorithm 2).

5 Take action

i(t)=\operatorname*{arg\,max}_{i\in[N]}L_{i}^{\top}\tilde{p}_{t}

and observe feedback

y(t)

6 Update

B_{t}\leftarrow B_{t-1}+S_{i(t)}^{\top}S_{i(t)},\;b_{t}\leftarrow b_{t-1}+S_{i(t)}^{\top}e_{y(t)}

Algorithm 1 TSPM Algorithm

Input: constant

R\in[0,1]

1 while true do

2 Sample

\tilde{p}_{t}\sim g_{t}(p)

(Algorithm 3).

3 Sample

\tilde{u}\sim\mathcal{U}({[0,1]})

4 if

R\tilde{u}<{F_{t}(\tilde{p}_{t})}/{G_{t}(\tilde{p}_{t})}

then

5 return

\tilde{p}_{t}

Algorithm 2 Accept-Reject Sampling

1 Compute

\tilde{B}_{t},\tilde{b}_{t}

from

B_{t},b_{t}

2 repeat

3 Sample

{p^{(\alpha)}}\sim\mathcal{N}(\tilde{B}_{t}^{-1}\tilde{b}_{t},\tilde{B}_{t}^{-1})

4until

{p^{(\alpha)}}\in\mathcal{P}_{M-1}

;

5return

\tilde{p}=[{p^{(\alpha)}}^{\top},\,1-\sum_{i=1}^{M-1}({p^{(\alpha)}})_{i}]^{\top}

Algorithm 3 Sampling from

g_{t}(p)

3.1 Accept-Reject Sampling

We adopt the accept-reject sampling (Casella et al., 2004) to exactly draw samples from the posterior distribution. The accept-reject sampling is a technique to draw samples from a specific distribution $f$ , and a key feature is to use a proposal distribution $g$ , from which we can easily draw a sample and whose ratio to $f$ , that is $f/g$ , is bounded by a constant value $R$ . To obtain samples from $f$ , (i) we generate samples $X\sim g$ ; (ii) accept $X$ with probability $f(X)/Rg(X)$ . Note that $f$ and $g$ do not have to be normalized when the acceptance probability is calculated.

Let $\pi{\left(p\right)}$ be a prior distribution for $p$ . Then an unnormalized density of the posterior distribution for $p$ can be expressed as

\displaystyle F_{t}(p)

\displaystyle=\pi{\left(p\right)}\prod_{i=1}^{N}\exp\Big{\{}-n_{i}\mathcal{D}_{\mathrm{KL}}\left({q_{i}^{(t)}}\middle\|{S_{i}p}\right)\Big{\}}\,,

(1)

the detailed derivation of which is given in Appendix B. We use the proposal distribution with unnormalized density

\displaystyle G_{t}(p)

\displaystyle=\pi{\left(p\right)}\prod_{i=1}^{N}\exp\Big{\{}-\frac{1}{2}n_{i}\|q_{i}^{(t)}-S_{i}p\|^{2}\Big{\}}\,.

(2)

Based on these distributions, we use Algorithm 2 for exact sampling from the posterior distribution, where $\mathcal{U}({[0,1]})$ is the uniform distribution over $[0,1]$ and $g_{t}(p)$ is the distribution corresponding to the unnormalized density $G_{t}(p)$ in (2). The following proposition shows that setting $R=1$ realizes the exact sampling.

Proposition 1.

Let $f_{t}(p)$ be the distribution corresponding to the unnormalized density $F_{t}(p)$ in (1). Then, the output of Algorithm 2 with $R=1$ follows $f_{t}(p)$ .

This proposition can easily be proved by Pinsker’s inequality, which is detailed in Appendix B.

In practice, $R\in[0,1]$ is a parameter to balance the amount of over-exploration and the computational efficiency. As $R$ decreases from 1, the algorithm tends to accept a point $p$ far from the mode. The case $R=0$ corresponds the TSPM algorithm where the proposal distribution is used without the accept-reject sampling, which we call TSPM-Gaussian. As we will see in Section 4, TSPM-Gaussian corresponds to exact sampling of the posterior distribution when the feedback follows a Gaussian distribution rather than a multinomial distribution.

TSPM-Gaussian can be related to BPM-TS (Vanchinathan et al., 2014) in the sense that both of them use samples from Gaussian distributions. Nevertheless, they use different Gaussians and TSPM-Gaussian performs much better than BPM-TS as we will see in the experiments. Details on the relation between TSPM-Gaussian and BPM-TS are described in Appendix D.

In general, we can realize efficient sampling with a small number of rejections if the proposal distribution and the target distribution are close to each other. On the other hand, in our problem, the densities in (1) and (2) for each fixed point $p$ exponentially decay with the number of samples $n_{i}$ if the empirical feedback distribution $q_{i}^{(t)}$ converges. This means that $F_{t}(p)$ and $G_{t}(p)$ have an exponentially large relative gap in most rounds. Nevertheless, the number of rejections does not increase with $t$ as we will see in the experiments, which suggests that the proposal distribution approximates the target distribution well with high probability.

3.2 Sampling from Proposal Distribution

When we consider Gaussian density $\mathcal{N}(0,\lambda I_{M})$ truncated over $\mathcal{P}_{M}$ as a prior, the proposal distribution also has the Gaussian density $\mathcal{N}(B_{t}^{-1}b_{t},\,B_{t}^{-1})$ over $\mathcal{P}_{M}$ , where

\displaystyle B_{t}=\lambda I_{M}+\sum_{i=1}^{N}n_{i}S_{i}^{\top}S_{i}=B_{t-1}+S_{i(t)}^{\top}S_{i(t)}\,,\quad b_{t}=\sum_{i=1}^{N}n_{i}S_{i}^{\top}q_{i}^{(t)}=b_{t-1}+S_{i(t)}^{\top}e_{y(t)}\,.

(3)

Here note that the probability simplex $\mathcal{P}_{M}$ is in an $(M-1)$ -dimensional space and a sample from $\mathcal{N}(0,\lambda I_{M})$ is not contained in $\mathcal{P}_{M}$ with probability one. In the literature, e.g., Altmann et al. (2014), sampling methods for Gaussian distributions truncated on a simplex have been discussed. We use one of these procedures summarized in Algorithm 3, where we first sample $M-1$ elements of $p$ from another Gaussian distribution and determine the remaining element by the constraint $\sum_{i=1}^{M}p_{i}=1$ .

Proposition 2.

Sampling from $g_{t}(p)$ is equivalent to Algorithm 3 with

\displaystyle\tilde{B}_{t}=C_{t}-2D_{t}+f_{t}\mathbf{1}_{M-1}\mathbf{1}_{M-1}^{\top}\,,\quad\tilde{b}_{t}=f_{t}\mathbf{1}_{M-1}-d_{t}+b^{(\alpha)}_{t}-b^{(M)}\mathbf{1}_{M-1}\,,

where $B_{t}=\begin{bmatrix}C_{t}&d_{t}\\ d_{t}^{\top}&f_{t}\\ \end{bmatrix}$ for $C_{t}\in\mathbb{R}^{{M-1}\times{M-1}}$ , $d_{t}\in\mathbb{R}^{M-1}$ , $f_{t}\in\mathbb{R}$ , $b_{t}=[{b^{(\alpha)}_{t}}^{\top},b^{(M)}_{t}]^{\top}\in\mathbb{R}^{M-1}\times\mathbb{R}$ , and $D_{t}=\frac{1}{2}(d_{t}\mathbf{1}_{M-1}^{\top}+\mathbf{1}_{M-1}d_{t}^{\top})$ .

We give the proof of this proposition for self-containedness in Appendix C.

4 Theoretical Analysis

This section considers a regret upper bound of the TSPM algorithm.

In the theoretical analysis, we consider a linear setting of PM. In the linear PM, the learner suffers the expected loss $L_{i(t)}^{\top}p^{*}$ as in the discrete setting, and receives feedback vector $y(t)=S_{i}p^{*}+\epsilon_{t}$ for $\epsilon_{t}\sim\mathcal{N}(0,I_{M})$ whereas the one-hot representation of $y(t)$ is distributed by the probability vector $S_{i}p^{*}$ in the discrete setting. Therefore, if $\epsilon_{t}$ can be regarded as a sub-Gaussian random variable as in Kirschner et al. (2020) then the linear PM includes the discrete PM, though our theoretical analysis requires $\epsilon_{t}$ to be Gaussian. The relation between discrete and linear settings can also be seen from the observation that bandit problems with Bernoulli and Gaussian rewards can be expressed as discrete and linear PM, respectively. The linear PM also includes the linear bandit problem, where the feedback vector is expressed as $L_{i}^{\top}p^{*}+\epsilon_{t}$ .

In the linear PM, $G_{t}(p)$ in (2) becomes the exact posterior distribution rather than a proposal distribution. The definition of the cell decomposition for this setting is largely the same as that of discrete setting and detailed in Appendix F. Therefore, TS with exact posterior sampling in the linear PM corresponds to TSPM-Gaussian. In the linear PM, the unknown parameter $p^{*}$ is in $\mathbb{R}^{M}$ rather than in $\mathcal{P}_{M}$ , and therefore we consider the prior $\pi{\left(p\right)}=\mathcal{N}(0,\lambda I_{M})$ over $\mathbb{R}^{M}$ , where the posterior distribution becomes $\mathcal{N}(B_{t}^{-1}b_{t},\,B_{t}^{-1})$ .

There are a few works that analyze TS for the PM because of its difficulty. For example in Vanchinathan et al. (2014), an analysis of the TS-based algorithm (BPM-TS) is not given despite the fact that its performance is better than the algorithm based on a confidence ellipsoid (BPM-LEAST). Zimmert and Lattimore (2019) considered the theoretical aspect of a variant of TS for the linear PM in view of the Bayes regret, but this algorithm is based on the knowledge on the time horizon and different from the family of TS used in practice. More specifically, their algorithm considers the posterior distribution for regret (not pseudo-regret), and an action is chosen according to the posterior probability that each arm minimizes the cumulative regret. Thus, the time horizon also needs to be known.

Types of Regret Bounds. We focus on the (a) problem-dependent (b) expected pseudo-regret. (a) In the literature, a minimax (or problem-independent) regret bound has mainly been considered, for example, to classify difficulties of the PM problem (Bartók et al., 2010, 2011). On the other hand, a problem-dependent regret bound often reflects the empirical performance more clearly than the minimax regret (Bartók et al., 2012; Vanchinathan et al., 2014; Komiyama et al., 2015). For this reason, we consider this problem-dependent regret bound. (b) In complicated settings of bandit problems, a high-probability regret bound has mainly been considered (Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2013b), which bounds the pseudo-regret with high probability $1-\delta$ . Though such a bound can be transformed to an expected regret bound, this type of analysis often sacrifices the tightness since a linear regret might be suffered with small probability $\delta$ . This is why the analysis in Vanchinathan et al. (2014) for BPM-LEAST finally yielded an $\operatorname{\tilde{O}}(\sqrt{T})$ expected regret bound whereas their high-probability bound is $\operatorname{O}(\log T)$ .

4.1 Regret Upper Bound

In the following theorem, we show that logarithmic problem-dependent expected regret is achievable by the TSPM-Gaussian algorithm.

Theorem 3 (Regret upper bound).

Consider any finite stochastic linear partial monitoring game. Assume that the game is strongly locally observable and $\Delta_{i}=(L_{i}-L_{1})^{\top}p^{*}>0$ for any $i\neq 1$ . Then, the regret of TSPM-Gaussian satisfies for sufficiently large $T$ that

\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\mathrm{Reg}(T)\right]}=\operatorname{O}\bigg{(}\frac{AN^{2}M\max_{i\in[N]}\Delta_{i}}{\Lambda^{2}}\log T\bigg{)}\,,

(4)

where $\Lambda\coloneqq\min_{i\neq 1}\Lambda_{i}$ for $\Lambda_{i}=\Delta_{i}/{\|z_{1,i}\|}$ with $z_{1,i}$ defined after Definition 5.

Remark.

In the proof of Theorem 3, it is sufficient to assume that $L_{1}-L_{i}\in\operatorname{Im}S_{1}^{\top}\oplus\operatorname{Im}S_{i}^{\top}$ for $i\in[N]$ , which is weaker than the strong local observability, though it is still sometimes stronger than the local observability condition.

The proof of Theorem 3 is given in Appendix F. This result is the first problem-dependent bound of TS for PM, which also becomes the first logarithmic regret bound of TS for linear bandits.

The norm of $z_{j,k}$ in $\Lambda$ intuitively indicates the difficulty of the problem. Whereas we can estimate $(S_{j}p,S_{k}p)$ with noise through taking actions $j$ and $k$ , the actual interest is the gap of the losses $p^{\top}(L_{j}-L_{k})=(S_{j}p,S_{k}p)^{\top}z_{j,k}$ . Thus, if $\|z_{j,k}\|$ is large, the gap estimation becomes difficult since the noise is enhanced through $z_{j,k}$ .

Unfortunately, the derived bound in Theorem 3 has quadratic dependence on $N$ , which seems to be not tight. This quadratic dependence comes from the difficulty of the expected regret analysis. In general, we evaluate the regret before and after the convergence of the statistics separately. Whereas the latter one usually becomes dominant, the main difficulty comes from the analysis of the former one, which might become large with low probability (Agrawal and Goyal, 2012; Kaufmann et al., 2012; Agrawal and Goyal, 2013a).

In our analysis, we were not able to bound the former one within a non-dominant order, though it is still logarithmic in $T$ . In fact, our analysis shows that the regret after convergence is $\operatorname{O}(\sum_{i\neq 1}\Delta_{i}\frac{A}{\Lambda^{2}}\log T)$ as shown in Lemma 18 in Appendix F, which will become the regret with high probability. In particular, if we consider the classic bandit problem as a PM game, we can confirm that the derived bound after convergence becomes the best possible bound

\displaystyle\operatorname{O}\bigg{(}\sum_{i\neq 1}\frac{\log T}{\Delta_{i}}\bigg{)}

by considering $\Lambda_{i}$ depending on each suboptimal arm $i$ as the difficulty measure instead of $\Lambda$ . Still, deriving a regret bound for the term before convergence within an non-dominant order is an important future work.

4.2 Technical Difficulties of the Analysis

The main difficulty of this regret analysis is that PM requires to consider the statistics of all actions when the number of selections $N_{i}(t)$ of some action $i$ is evaluated. This is in stark contrast to the analysis of the classic bandit problems, where it becomes sufficient to evaluate statistics of action $i$ and the best action 1. This makes the analysis remarkably complicated in TS, where we need to separately consider the randomness caused by the feedback and TS.

To overcome this difficulty, we handle the effect of actions of no interest in two different novel ways depending on each decomposed regret. The first one is to evaluate the worst-case effect of these actions based on an argument (Lemma 10) related to the law of the iterated logarithm (LIL), which is sometimes used in the best-arm identification literature to improve the performance (Jamieson et al., 2014). The second one is to bound the action-selection probability of TS using an argument of (super-)martingale (Theorem 16), which is of independent interest. Whereas such a technique is often used for the construction of confidence bounds (Abbasi-Yadkori et al., 2011), we reveal that it is also useful for evaluation of the regret of TS.

We only focused on the Gaussian noise $\epsilon_{t}\sim\mathcal{N}(0,I_{M})$ , rather than the more general sub-Gaussian noise. This restriction to the Gaussian noise comes from the essential difficulty of the problem-dependent analysis of TS, where lower bounds for some probabilities are needed whereas the sub-Gaussian assumption is suited for obtaining upper bounds. To the best of our knowledge, the problem-dependent regret analysis for TS on the sub-Gaussian case has never been investigated even for the multi-armed bandit setting, which is quite simple compared to that of PM. In the literature of the problem-dependent regret analysis, the noise distribution is restricted to distributions with explicitly given forms, e.g., Bernoulli, Gaussian, or more generally a one-dimensional canonical exponential family (Kaufmann et al., 2012; Agrawal and Goyal, 2013a; Korda et al., 2013). Their analysis relies on the specific characteristic of the distribution to bound the problem-dependent regret.

5 Experiments

In this section, we numerically compare the performance of TSPM and TSPM-Gaussian against existing methods, which are RandomPM (the algorithm which selects action randomly), FeedExp3 (Piccolboni and Schindelhauer, 2001), and BPM-TS (Vanchinathan et al., 2014). Recently, Lattimore and Szepesvári (2019) considered the sampling-based algorithm called Mario sampling for easy games. Mario sampling coincides with TS (except for the difference between pseudo-regret and regret with known time horizon) mentioned in the last section when any pair of actions is a neighbor. As shown in Appendix G, this property is indeed satisfied for dp-easy games defined in the following. Therefore, the performance is essentially the same between TSPM with $R=1$ and Mario sampling. To compare the performance, we consider a dynamic pricing problem, which is a typical example of PM games. We conducted experiments on the discrete setting because the experiments for PM has been mainly focused on the discrete setting.

In the dynamic pricing game, the player corresponds to a seller, and the opponent corresponds to a buyer. At each round, the seller sells an item for a specific price $i(t)$ , and the buyer comes with an evaluation price $j(t)$ for the item, where the selling price and the evaluation price correspond to the action and outcome, respectively. The buyer buys the item if the selling price $i(t)$ is smaller than or equal to $j(t)$ and not otherwise. The seller can only know if the buyer bought the item (denoted as feedback $0$ ) or did not buy the item (denoted as $1$ ). The seller aims to minimize the cumulative “loss”, and there are two types of definitions for the loss, where each induced game falls into the easy and hard games. We call them dp-easy and dp-hard games, respectively.

In both cases, the seller incurs the constant loss $c>0$ when the item is not bought due to the loss of opportunity to sell the item. In contrast, when the item is not bought, the loss incurred to the seller is different between these settings. The seller in the dp-easy game does not take the buyer’s evaluation price into account. In other words, the seller gains the selling price $i(t)$ as a reward (equivalently incurs $-i(t)$ as a loss). Therefore, the loss for the selling price $i(t)$ and the evaluation $j(t)$ is

\displaystyle\ell_{i(t),j(t)}=-i(t)\mathbbm{1}{\left[i(t)\leq j(t)\right]}+c\mathbbm{1}{\left[i(t)>j(t)\right]}\,.

This setting can be regarded as a generalized version of the online posted price mechanism, which was addressed in, e.g., Blum et al. (2004) and Cesa-Bianchi et al. (2006), and an example of strongly locally observable games.

On the other hand, the seller in dp-hard game does take the buyer’s evaluation price into account when the item is bought. In other words, the seller incurs the difference between the opponent evaluation and the selling price $j(t)-i(t)$ as a loss because the seller could have made more profit if the seller had sold at the price $j(t)$ . Therefore, the loss incurred at time $t$ is

\displaystyle\ell_{i(t),j(t)}=(j(t)-i(t))\mathbbm{1}{\left[i(t)\leq j(t)\right]}+c\mathbbm{1}{\left[i(t)>j(t)\right]}\,.

This setting is also addressed in Cesa-Bianchi et al. (2006), and belongs to the class of hard games. Note that our algorithm can also be applied to a hard game, though there is no theoretical guarantee.

(a) dp-easy,

N=M=3

(b) dp-easy,

N=M=5

N=M=7

(d) dp-hard,

N=M=3

(e) dp-hard,

N=M=5

(f) dp-hard,

N=M=7

Figure 1: Regret-round plots of algorithms. The solid lines indicate the average over

100

independent trials. The thin fillings are the standard error.

(a) dp-easy,

N=M=3

(b) dp-easy,

N=M=5

N=M=7

Figure 2: The number of rejected times by the accept-reject sampling. The solid lines indicate the average over

100

independent trials after taking moving average with window size

100

Setup. In the both dp-easy and dp-hard games, we fixed $N=M\in{\left\{3,5,7\right\}}$ and $c=2$ . We fixed the time horizon $T$ to $10000$ and simulated $100$ times. For FeedExp3 and BPM-TS, the setup of hyperparameters follows their original papers. For TSPM, we set $\lambda=0.001$ , and $R$ was selected from ${\left\{0.01,1.0\right\}}$ . Here, recall that TSPM with $R=1$ and $R=0$ correspond to the exact sampling and TSPM-Gaussian, respectively, and a smaller value of $R$ gives the higher acceptance probability in the accept-reject sampling. Therefore, using small $R$ makes the algorithm time-efficient, although it can worsen the performance since it over-explores the tail of the posterior distributions. To stabilize sampling from the proposal distribution in Algorithm 3, we used an initialization that takes each action $n=10A$ times. The detailed settings of the experiments with more results are given in Appendix H.

Results. Figure 1 is the empirical comparison of the proposed algorithms against the benchmark methods. This result shows that, in all cases, the TSPM with exact sampling gives the best performance. TSPM-Gaussian also outperforms BPM-TS even though both of them use Gaussian distributions as posteriors. Besides, the experimental results suggest that our algorithm performs reasonably well even for a hard game. It can be observed that the proposed methods outperform BPM-TS more significantly for a larger number of outcomes. Further discussion for this observation is given in Appendix D.

Figure 2 shows the number of rejections at each time step in the accept-reject sampling. We counted the number of times that either Line 2 in Algorithm 2 or Line 3 in Algorithm 3 was not satisfied. In the accept-reject sampling, it is desirable that the frequency of rejection does not increase as the time-step $t$ and does not increase rapidly with the number of outcomes. We can see that the former one is indeed satisfied. For the latter property, the frequency of rejection becomes unfortunately large when exact sampling ( $R=1$ ) is conducted. Still, we can substantially improve this frequency by setting $R$ to be a small value or zero, which still keeps regret tremendously better than that of BPM with almost the same time-efficiency as BPM-TS.

6 Conclusion and Discussion

This paper investigated Thompson sampling (TS) for stochastic partial monitoring from the algorithmic and theoretical viewpoints. We provided a new algorithm that enables exact sampling from the posterior distribution, and numerically showed that the proposed algorithm outperforms existing methods. Besides, we provided an upper bound for the problem-dependent logarithmic expected pseudo-regret for the linearized version of the partial monitoring. To our knowledge, this bound is the first logarithmic problem-dependent expected pseudo-regret bound of a TS-based algorithm for linear bandit problems and strongly locally observable partial monitoring games.

There are several remaining questions. As mentioned in Section 4, Kirschner et al. (2020) considered linear partial monitoring with the feedback structure $y(t)=S_{i(t)}p^{*}+\epsilon_{t}$ , where $(\epsilon_{t})_{t=1}^{T}$ is a sequence of independent sub-Gaussian noise vector in $\mathbb{R}^{M}$ . This setting is the generalization of our linear setting, where $(\epsilon_{t})_{t=1}^{T}$ are i.i.d. Gaussian vectors. Therefore, a natural question that arises is whether we can extend our analysis on TSPM-Gaussian to the sub-Gaussian case, although we believe it would be not straightforward as discussed in Section 4. It is also an important open problem to derive a regret bound on TSPM using the exact posterior sampling for the discrete partial monitoring. Although we conjecture that the algorithm also achieves logarithmic regret for the setting, there still remain some difficulties in the analysis. In particular, we have to handle the KL divergence in $f_{t}(p)$ and consider the restriction of the support of the opponent’s strategy to $\mathcal{P}_{M}$ , which make the analysis much more complicated. Besides, it is worth noting that the theoretical analysis of TS for hard games has never been theoretically investigated. We believe that in general TS suffers linear regret in the minimax sense due to its greediness. However, we conjecture that TS can achieve the sub-linear regret for some specific instances of hard games in the sense of the problem-dependent regret, as empirically observed in the experiments. Finally, it is an important open problem to derive the minimax regret for anytime TS-based algorithms. This needs more detailed analysis on $\operatorname{o}(\log T)$ terms in the regret bound, which were dropped in our main result.

Broader Impact

Application. Partial monitoring (PM) includes various online decision-making problems such as multi-armed bandits, linear bandits, dynamic pricing, and label efficient prediction. Not only can PM handles them, the dueling bandits, combinatorial bandits, transductive bandits, and many other problems can be seen as a partial monitoring game, as discussed in Kirschner et al. (2020). Therefore, our analysis of Thompson sampling (TS) for PM games pushes the application of TS to a more wide range of online decision-making problems forward. Moreover, PM has the potential that novel online-decision making problems are newly discovered, where we have to handle the limited feedback in an online fashion.

Practical Use. The obvious advantage of using TS is that the users can easily apply the algorithm to their problems. They do not have to solve mathematical optimization problems, which are often required to solve when using non-sampling-based algorithms (Bartók et al., 2012; Komiyama et al., 2015). For the negative side, the theoretical analysis for the regret upper bound might make the users become overconfident when the users use their algorithms. For example, they might use the TSPM algorithm to the linear PM game with heavy-tailed noise, such as sub-exponential noise, without noticing it. Nevertheless, this is not an TS-specific problem, but one that can be found in many theoretical studies, and TS is still one of the most promising policies.

Acknowledgements

The authors would like to thank the meta-reviewer and reviewers for a lot of helpful comments. The authors would like to thank Kento Nozawa and Ikko Yamane for maintaining servers for our experiments, and Kenny Song for helpful discussion on the writing. TT was supported by Toyota-Dwango AI Scholarship, and RIKEN Junior Research Associate Program for the final part of the project. JH was supported by KAKENHI 18K17998, and MS was supported by KAKENHI 17H00757.

References

Abbasi-Yadkori et al. (2011) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pages 2312–2320, 2011.
Agrawal and Goyal (2012) Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In the 25th Annual Conference on Learning Theory, volume 23, pages 39.1–39.26, 2012.
Agrawal and Goyal (2013a) Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In the Sixteenth International Conference on Artificial Intelligence and Statistics, volume 31, pages 99–107, 2013a.
Agrawal and Goyal (2013b) Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In the 30th International Conference on Machine Learning, pages 127–135, 2013b.
Altmann et al. (2014) Yoann Altmann, Steve McLaughlin, and Nicolas Dobigeon. Sampling from a multivariate gaussian distribution truncated on a simplex: A review. In 2014 IEEE Workshop on Statistical Signal Processing (SSP), pages 113–116, 2014.
Bartók et al. (2010) Gábor Bartók, Dávid Pál, and Csaba Szepesvári. Toward a classification of finite partial-monitoring games. In Algorithmic Learning Theory, pages 224–238, 2010.
Bartók et al. (2012) Gábor Bartók, Navid Zolghadr, and Csaba Szepesvári. An adaptive algorithm for finite stochastic partial monitoring. In the 29th International Conference on Machine Learning, pages 1–20, 2012.
Bartók et al. (2011) Gábor Bartók, Dávid Pál, and Csaba Szepesvári. Minimax regret of finite partial-monitoring games in stochastic environments. In the 24th Annual Conference on Learning Theory, volume 19, pages 133–154, 2011.
Blum et al. (2004) Avrim Blum, Vijay Kumar, Atri Rudra, and Felix Wu. Online learning in online auctions. Theoretical Computer Science, 324(2):137–146, 2004.
Casella et al. (2004) George Casella, Christian P. Robert, and Martin T. Wells. Generalized accept-reject sampling schemes, volume 45 of Lecture Notes–Monograph Series, pages 342–347. Institute of Mathematical Statistics, 2004.
Cesa-Bianchi et al. (2005) Nicolò Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory, 51(6):2152–2162, 2005.
Cesa-Bianchi et al. (2006) Nicolò Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Regret minimization under partial monitoring. Mathematics of Operations Research, 31(3):562–580, 2006.
Chapelle and Li (2011) Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pages 2249–2257, 2011.
Dani et al. (2008) Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In 21st Annual Conference on Learning Theory, pages 355–366, 2008.
Honda and Takemura (2014) Junya Honda and Akimichi Takemura. Optimality of Thompson sampling for Gaussian bandits depends on priors. In the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33, pages 375–383, 2014.
Jamieson et al. (2014) Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil’ ucb : An optimal exploration algorithm for multi-armed bandits. In The 27th Conference on Learning Theory, pages 423–439, 2014.
Kaufmann et al. (2012) Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, pages 199–213, 2012.
Kirschner et al. (2020) Johannes Kirschner, Tor Lattimore, and Andreas Krause. Information directed sampling for linear partial monitoring. arXiv preprint arXiv:2002.11182, 2020.
Kleinberg and Leighton (2003) Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In the 44th Annual IEEE Symposium on Foundations of Computer Science, pages 594–605, 2003.
Komiyama et al. (2015) Junpei Komiyama, Junya Honda, and Hiroshi Nakagawa. Regret lower bound and optimal algorithm in finite stochastic partial monitoring. In Advances in Neural Information Processing Systems 28, pages 1792–1800, 2015.
Korda et al. (2013) Nathaniel Korda, Emilie Kaufmann, and Remi Munos. Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems 26, pages 1448–1456, 2013.
Lai and Robbins (1985) T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
Lattimore and Szepesvári (2019) Tor Lattimore and Csaba Szepesvári. An information-theoretic approach to minimax regret in partial monitoring. In the 32nd Annual Conference on Learning Theory, volume 99, pages 2111–2139, 2019.
Lu and Van Roy (2017) Xiuyuan Lu and Benjamin Van Roy. Ensemble sampling. In Advances in Neural Information Processing Systems 30, pages 3258–3266, 2017.
Phan et al. (2019) My Phan, Yasin Abbasi Yadkori, and Justin Domke. Thompson sampling and approximate inference. In Advances in Neural Information Processing Systems 32, pages 8804–8813, 2019.
Piccolboni and Schindelhauer (2001) Antonio Piccolboni and Christian Schindelhauer. Discrete prediction games with arbitrary feedback and loss. In COLT/EuroCOLT, pages 208–223, 2001.
Russo et al. (2018) Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.
Rustichini (1999) Aldo Rustichini. Minimizing regret: The general case. Games and Economic Behavior, 29(1):224–243, 1999.
Thompson (1933) William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 12 1933.
Vanchinathan et al. (2014) Hastagiri P Vanchinathan, Gábor Bartók, and Andreas Krause. Efficient partial monitoring with prior information. In Advances in Neural Information Processing Systems 27, pages 1691–1699, 2014.
Zimmert and Lattimore (2019) Julian Zimmert and Tor Lattimore. Connections between mirror descent, thompson sampling and the information ratio. In Advances in Neural Information Processing Systems 32, pages 11973–11982, 2019.
Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In the Twentieth International Conference on Machine Learning, pages 928–935. AAAI Press, 2003.

Appendix A Notation

Table 1 summarizes the symbols used in this paper.

Table 1: List of symbols used in this paper.

Symbol	Meaning
$\mathcal{P}_{n}$	$(n-1)$ -dimensional probability simplex
$\\|\cdot\\|$	Euclidian norm for vector and operator norm for matrix
$\\|\cdot\\|_{p}$	$p$ -norm
$\\|\cdot\\|_{A}$	norm induced by positive semidefinite matrix $A$
$\mathcal{D}_{\mathrm{KL}}\left({p}\middle\\|{q}\right)$	KL divergence from $q$ to $p$
$B_{r}^{n}(p)$	$n$ -dimensional Euclidian ball of radius $r$ at point $p\in\mathbb{R}^{N}$
$N,M\in\mathbb{N}$	the number of actions and outcomes
$\Sigma$	set of feedback symbols
$A$	the number of feedback symbols
$p^{*}\in\mathcal{P}_{M}$	opponent’s strategy
$T$	time horizon
$\mathbf{L}=(\ell_{i,j})\in\mathbb{R}^{N\times M}$	loss matrix
$\mathbf{H}=(h_{i,j})\in\Sigma^{N\times M}$	feedback matrix
$S_{i}\in{\left\{0,1\right\}}^{A\times M}\;(i=1,\dots,N)$	signal matrix
$i(t)$	action taken at time $t$
$N_{i}(t)$	the number of times the action $i$ is taken before time $t\in[T]$
$j(t)$	outcome taken by opponent at time $t$
$y(t)$	feedback observed at time $t$
$F_{t}(p)$	unnormalized posterior distribution in (1)
$f_{t}(p)$	probability density function corresponding to $F_{t}(p)$
$G_{t}(p)$	unnormalized proposal distribution for $F_{t}(p)$ in (2)
$g_{t}(p)$	probability density function corresponding to $G_{t}(p)$
$q_{i}^{(t)}\in\mathcal{P}_{M}$	empirical feedback distribution of action $i$ by time $t$
$q_{i,n}\in\mathcal{P}_{M}$	empirical feedback distribution of action $i$ after the action is taken $n$ times
$\mathcal{C}_{i}\subset\mathcal{P}_{M}$	cell of action $i$

Appendix B Posterior Distribution and Proposal Distribution in Section 3

In this appendix, we discuss representation of the posterior distribution and its relation with the proposal distribution.

Proposition 4.

$F_{t}(p)$ in (1) is proportional to the posterior distribution of the opponent’s strategy, and $F_{t}(p)\leq G_{t}(p)$ for all $p\in\mathcal{P}_{M}$ .

Proof.

The posterior distribution of the opponent’s strategy parameter $\pi{\left(p\mathrel{}\middle|\mathrel{}{\left\{i(s),y(s)\right\}}_{s=1}^{t}\right)}$ is rewritten as

$\displaystyle\pi{\left(p\mathrel{}\middle\|\mathrel{}{\left\{i(s),y(s)\right\}}_{s=1}^{t}\right)}$	$\displaystyle\propto\pi{\left(p,{\left\{i(s),y(s)\right\}}_{s=1}^{t}\right)}$
	$\displaystyle\propto\pi{\left(p\right)}\prod_{s=1}^{t}\mathbb{P}{\left\{y(s)\mathrel{}\middle\|\mathrel{}i(s),p\right\}}$
	$\displaystyle=\pi{\left(p\right)}\prod_{i=1}^{N}\prod_{y=1}^{A}(S_{i,y}p)^{n_{iy}}$
	$\displaystyle\propto\pi{\left(p\right)}\prod_{i=1}^{N}\exp\Big{\{}-n_{i}\mathcal{D}_{\mathrm{KL}}\left({q_{i}^{(t)}}\middle\\|{S_{i}p}\right)\Big{\}}\,,$	(5)

where $S_{i,y}$ is the $i$ -th row of the signal matrix $S_{i}$ , and note that $q_{i}^{(t)}$ is the empirical feedback distribution of action $i$ at time $t$ , that is, $q_{i}^{(t)}=[n_{i1}/n_{i},\ldots,n_{iA}/n_{i}]^{\top}\in\mathcal{P}_{A}$ for $n_{iy}=\sum_{s=1}^{t}\mathbbm{1}{\left[i(s)=i,y(s)=y\right]}$ and $n_{i}=\sum_{y=1}^{A}n_{iy}$ .

Next, we show that $F_{t}(p)\leq G_{t}(p)$ holds for all $p\in\mathcal{P}_{M}$ . Using the Pinsker’s inequality, the unnormalized posterior distribution $F_{t}(p)$ can be bounded from above as

$\displaystyle F_{t}(p)$	$\displaystyle=\pi{\left(p\right)}\prod_{i=1}^{N}\exp\Big{\{}-n_{i}\mathcal{D}_{\mathrm{KL}}\left({q_{i}^{(t)}}\middle\\|{S_{i}p}\right)\Big{\}}$
	$\displaystyle\leq\pi{\left(p\right)}\prod_{i=1}^{N}\exp\Big{\{}-\frac{1}{2}n_{i}\\|q_{i}^{(t)}-S_{i}p\\|_{1}^{2}\Big{\}}\quad\left(\text{by {Pinsker's ineq.}}\right)$
	$\displaystyle=\pi{\left(p\right)}\exp\Big{\{}-\frac{1}{2}\sum_{i=1}^{N}n_{i}\\|q_{i}^{(t)}-S_{i}p\\|_{1}^{2}\Big{\}}$
	$\displaystyle\leq\pi{\left(p\right)}\exp\Big{\{}-\frac{1}{2}\sum_{i=1}^{N}n_{i}\\|q_{i}^{(t)}-S_{i}p\\|^{2}\Big{\}}\quad\left(\text{by }{\\|q_{i}^{(t)}-S_{i}p\\|_{1}\geq\\|q_{i}^{(t)}-S_{i}p\\|}\right)$
	$\displaystyle=G_{t}(p)\,.$	(6)

∎

Remark.

The unnormalized density $G_{t}(p)$ is indeed Gaussian. Recalling that $B_{t}$ and $b_{t}$ are defined in (3) as

\displaystyle B_{t}=\sum_{i=1}^{N}n_{i}S_{i}^{\top}S_{i}=\sum_{s=1}^{t}S_{i(s)}^{\top}S_{i(s)}=B_{t-1}+S_{i(t)}^{\top}S_{i(t)},\quad b_{t}=\sum_{i=1}^{N}n_{i}S_{i}^{\top}q_{i}^{(t)}=b_{t-1}+S_{i(t)}^{\top}e_{y(t)}\,,

(7)

we have

$\displaystyle\sum_{i=1}^{N}n_{i}\\|q_{i}^{(t)}-S_{i}p\\|^{2}$	$\displaystyle=\sum_{i=1}^{N}n_{i}(q_{i}^{(t)}-S_{i}p)^{\top}(q_{i}^{(t)}-S_{i}p)$
	$\displaystyle=p^{\top}\underbrace{\Big{(}\sum_{i=1}^{N}n_{i}S_{i}^{\top}S_{i}\Big{)}}_{B_{t}}p,-2\underbrace{\Big{(}\sum_{i=1}^{N}n_{i}S_{i}^{\top}q_{i}^{(t)}\Big{)}^{\top}}_{b_{t}}p+\underbrace{\sum_{i=1}^{N}n_{i}\\|q_{i}^{(t)}\\|^{2}}_{c_{t}}$
	$\displaystyle=p^{\top}B_{t}p-2b_{t}^{\top}p+c_{t}$
	$\displaystyle=(p-B_{t}^{-1}b_{t})^{\top}B_{t}(p-B_{t}^{-1}b_{t})+c_{t}-b_{t}^{\top}B_{t}^{-1}b_{t}\,.$	(8)

Therefore, we have

\displaystyle\exp\Big{\{}-\frac{1}{2}\sum_{i=1}^{N}n_{i}\|q_{i}^{(t)}-S_{i}p\|^{2}\Big{\}}

\displaystyle\propto\exp\Big{\{}-\frac{1}{2}(p-B_{t}^{-1}b_{t})^{\top}B_{t}(p-B_{t}^{-1}b_{t})\Big{\}}\,.

(9)

Appendix C Proof of Proposition 2

We will see that the the procedure sampling $\tilde{p}_{t}$ from $g_{t}(p)$ and Algorithm 3 are equivalent. First, we derive the Gaussian density of $g_{t}(p)$ projected onto $\{p\in\mathbb{R}^{M}:\sum_{i=1}^{M}p_{i}=1\}$ .

For simplicity, we omit the subscript $t$ and write, e.g., $B$ instead of $B_{t}$ . We define $p=[{p^{(\alpha)}}^{\top},p_{M}]^{\top}\in\mathbb{R}^{M-1}\times\mathbb{R}$ . Let $h=B^{-1}b$ , and define $h=[{h^{(\alpha)}}^{\top},h_{M}]^{\top}\in\mathbb{R}^{M-1}\times\mathbb{R}$ . Let $B=\begin{bmatrix}C&d\\ d^{\top}&f\\ \end{bmatrix}$ , where $C\in\mathbb{R}^{{M-1}\times{M-1}},d\in\mathbb{R}^{M-1}$ , and $f\in\mathbb{R}$ . Also, let $b=[{b^{(\alpha)}}^{\top},b^{(M)}]^{\top}\in\mathbb{R}^{M-1}\times\mathbb{R}$ .

Using the decomposition

\displaystyle(p-B^{-1}b)^{\top}B(p-B^{-1}b)

\displaystyle=\underbrace{p^{\top}Bp}_{\text{(a)}}-2\underbrace{h^{\top}Bp}_{\text{(b)}}+h^{\top}Bh\,,

(10)

we rewrite each term by restricting the domain of $p$ so that it satisfies the condition $\sum_{i=1}^{M}p_{i}=1$ . Now the first term (a) is rewritten as

	(a)	$\displaystyle={p^{(\alpha)}}^{\top}C{p^{(\alpha)}}+2{p^{(\alpha)}}^{\top}dp_{M}+fp_{M}^{2}$
		$\displaystyle={p^{(\alpha)}}^{\top}C{p^{(\alpha)}}+2\underbrace{{p^{(\alpha)}}^{\top}d\Big{(}1-\sum_{i=1}^{M-1}p_{i}\Big{)}}_{\text{(a1)}}+f\underbrace{\Big{(}1-\sum_{i=1}^{M-1}p_{i}\Big{)}^{2}}_{\text{(a2)}}\,.$		(11)

The term (a1) is rewritten as

(a1)	$\displaystyle={p^{(\alpha)}}^{\top}d-{p^{(\alpha)}}^{\top}d\sum_{i=1}^{M-1}p_{i}$
	$\displaystyle={p^{(\alpha)}}^{\top}d-{p^{(\alpha)}}^{\top}d\mathbf{1}_{M-1}^{\top}{p^{(\alpha)}}$
	$\displaystyle={p^{(\alpha)}}^{\top}d-{p^{(\alpha)}}^{\top}D{p^{(\alpha)}}\quad\left(D=\frac{1}{2}\left(d\mathbf{1}_{M-1}^{\top}+\mathbf{1}_{M-1}d^{\top}\right)\right)\,,$	(12)

and the term (a2) is rewritten as

(a2)	$\displaystyle=\Big{(}1-\sum_{i=1}^{M-1}p_{i}\Big{)}^{2}$
	$\displaystyle=1-2\sum_{i=1}^{M-1}p_{i}+\Big{(}\sum_{i=1}^{M-1}p_{i}\Big{)}^{2}$
	$\displaystyle=1-2\mathbf{1}_{M-1}^{\top}{p^{(\alpha)}}+{p^{(\alpha)}}^{\top}\mathbf{1}_{M-1}\mathbf{1}_{M-1}^{\top}{p^{(\alpha)}}\,.$	(13)

Therefore,

(a)

\displaystyle={p^{(\alpha)}}^{\top}\underbrace{(C-2D+f\mathbf{1}_{M-1}\mathbf{1}_{M-1}^{\top})}_{\tilde{B}}{p^{(\alpha)}}-2(f\mathbf{1}_{M-1}-d)^{\top}{p^{(\alpha)}}+f\,.

(14)

With regard to the term (b), we have

(b)	$\displaystyle=b^{\top}p$
	$\displaystyle={b^{(\alpha)}}^{\top}{p^{(\alpha)}}^{\top}+b^{(M)}p_{M}$
	$\displaystyle=({b^{(\alpha)}}-b^{(M)}\mathbf{1}_{M-1})^{\top}{p^{(\alpha)}}+b^{(M)}\,.$	(15)

Therefore,

	$\displaystyle(p-B^{-1}b)^{\top}B(p-B^{-1}b)$
	$\displaystyle={p^{(\alpha)}}^{\top}\tilde{B}{p^{(\alpha)}}-2(\underbrace{f\mathbf{1}_{M-1}-d+{b^{(\alpha)}}-b^{(M)}\mathbf{1}_{M-1}}_{\tilde{b}})^{\top}{p^{(\alpha)}}+f-2b^{(M)}+h^{\top}Bh$
	$\displaystyle=({p^{(\alpha)}}-\tilde{B}^{-1}\tilde{b})^{\top}\tilde{B}({p^{(\alpha)}}-\tilde{B}^{-1}\tilde{b})+f-2b^{(M)}-\tilde{b}^{\top}\tilde{B}^{-1}\tilde{b}+b^{\top}B^{-1}b\quad\quad\left(\text{by }{h^{\top}Bh=b^{\top}B^{-1}b}\right)\,.$		(16)

From the above argument, the density $\mathcal{N}(\tilde{B}^{-1}b,\,\tilde{B}^{-1})$ is the Gaussian distribution of $g_{t}(p)$ on $\{p\in\mathbb{R}^{M}:\sum_{i=1}^{M}p_{i}=1\}$ . Therefore, the $p=[{p^{(\alpha)}}^{\top},\,1-\sum_{i=1}^{M-1}({p^{(\alpha)}})_{i}]^{\top}$ for ${p^{(\alpha)}}\sim\mathcal{N}(\tilde{B}^{-1}b,\,\tilde{B}^{-1})$ is supported over $\{p\in\mathbb{R}^{M}:\sum_{i=1}^{M}p_{i}=1\}$ .

If the sample ${p^{(\alpha)}}$ from $\mathcal{N}(\tilde{B}^{-1}b,\,\tilde{B}^{-1})$ is in $\mathcal{P}_{M-1}$ , then we can obtain the last element $p^{(M)}$ by $p^{(M)}=1-\sum_{i=1}^{M-1}({p^{(\alpha)}})_{i}$ . Otherwise, the probability that ${p^{(\alpha)}}$ is the first $M-1$ elements of the sample from $g_{t}(p)$ is zero, and hence, $[{p^{(\alpha)}}^{\top},p^{(M)}]^{\top}$ cannot be a sample from $g_{t}(p)$ . Therefore, sampling $\tilde{p}_{t}$ from $g_{t}(p)$ and Algorithm 3 are equivalent.

Appendix D Relation between TSPM-Gaussian and BPM-TS

In this appendix, we discuss the relation between TSPM-Gaussian and BPM-TS (Vanchinathan et al., 2014).

Underlying Feedback Structure. Here, we discuss the underlying feedback structure behind TSPM-Gaussian and BPM-TS.

We first consider the underlying feedback structure behind BPM-TS. In the following, we see that the feedback structure

\displaystyle y(t)=S_{i(t)}p+S_{i(t)}\epsilon\,,\;\epsilon\sim\mathcal{N}(0,I_{M})

(17)

induces the posterior distribution in BPM-TS. Under this feedback structure, we have $y(t)\sim\mathcal{N}(S_{i(t)}p,S_{i(t)}S_{i(t)}^{\top})$ .

When we take the prior distribution $\pi{\left(p\right)}$ as $\mathcal{N}(0,\sigma_{0}^{2}I_{M})$ , the posterior distribution for the opponent’s strategy parameter can be written as

	$\displaystyle\pi{\left(p\mathrel{}\middle\|\mathrel{}{\left\{i(s),y(s)\right\}}_{s=1}^{t}\right)}$
	$\displaystyle\propto\pi{\left(p\right)}\prod_{s=1}^{t}\pi{\left(y(s)\mathrel{}\middle\|\mathrel{}i(s),\,p\right)}$
	$\displaystyle=\pi{\left(p\right)}\prod_{s=1}^{t}\mathbb{P}_{y\sim\mathcal{N}(S_{i(s)}p,S_{i(s)}S_{i(s)}^{\top})}\left\{y=y(s)\right\}$
	$\displaystyle=\exp\Big{(}-\frac{p^{\top}p}{2\sigma_{0}^{2}}\Big{)}\prod_{s=1}^{t}\exp\Big{(}-\frac{1}{2}(y(s)-S_{i(s)}p)^{\top}(S_{i(s)}S_{i(s)}^{\top})^{-1}(y(s)-S_{i(s)}p)\Big{)}$
	$\displaystyle=\exp\Big{\{}-\frac{1}{2}\Big{(}p^{\top}\Big{(}\frac{1}{\sigma_{0}^{2}}I_{M}+\sum_{s=1}^{t}S_{i(s)}^{\top}(S_{i(s)}S_{i(s)}^{\top})^{-1}S_{i(s)}\Big{)}p\Big{)}$
	$\displaystyle\qquad\qquad-2\Big{(}\sum_{s=1}^{t}y(s)^{\top}(S_{i(s)}S_{i(s)}^{\top})^{-1}S_{i(s)}p\Big{)}+\text{(a term independent of $p$)}\Big{\}}$
	$\displaystyle\propto\exp\Big{\{}-\frac{1}{2}(p^{\top}{B_{t}^{\mathrm{BPM}}}p-2{b_{t}^{\mathrm{BPM}}}^{\top}p)\Big{\}}$
	$\displaystyle\propto\exp\Big{\{}-\frac{1}{2}(p-{{B_{t}^{\mathrm{BPM}}}}^{-1}{b_{t}^{\mathrm{BPM}}})^{\top}{B_{t}^{\mathrm{BPM}}}(p-{B_{t}^{\mathrm{BPM}}}^{-1}{b_{t}^{\mathrm{BPM}}})\Big{\}}\,,$		(18)

where

	$\displaystyle{B_{t}^{\mathrm{BPM}}}$	$\displaystyle=\frac{1}{\sigma_{0}^{2}}I_{M}+\sum_{s=1}^{t}S_{i(s)}^{\top}(S_{i(s)}S_{i(s)}^{\top})^{-1}S_{i(s)}=B_{t-1}^{\mathrm{BPM}}+S_{i(t)}^{\top}(S_{i(t)}S_{i(t)}^{\top})^{-1}S_{i(t)}\,,$		(19)
	$\displaystyle{b_{t}^{\mathrm{BPM}}}$	$\displaystyle=\sum_{s=1}^{t}S_{i(s)}^{\top}(S_{i(s)}S_{i(s)}^{\top})^{-1}y(s)=b_{t-1}^{\mathrm{BPM}}+S_{i(t)}^{\top}(S_{i(t)}S_{i(t)}^{\top})^{-1}y(t)\,.$		(20)

Therefore, the posterior distribution $\pi{\left(p\mid{\left\{i(s),y(s)\right\}}_{s=1}^{t}\right)}$ is

\displaystyle\frac{1}{\sqrt{(2\pi)^{M}|{B_{t}^{\mathrm{BPM}}}^{-1}|}}\exp\Big{\{}-\frac{1}{2}(p-{B_{t}^{\mathrm{BPM}}}^{-1}{b_{t}^{\mathrm{BPM}}})^{\top}{B_{t}^{\mathrm{BPM}}}(p-{B_{t}^{\mathrm{BPM}}}^{-1}{b_{t}^{\mathrm{BPM}}})\Big{\}}\,.

(21)

and this distribution indeed corresponds to the posterior distribution in BPM-TS (Vanchinathan et al., 2014) with ${B_{t}^{\mathrm{BPM}}}=\Sigma_{t}^{-1}$ .

Using the same argument, we can confirm that the feedback structure

\displaystyle y_{t}=S_{i}p+\epsilon\,,\;\epsilon\sim\mathcal{N}(0,I_{M})\,.

(22)

induces

\displaystyle\bar{g}_{t}(p)\coloneqq\frac{1}{\sqrt{(2\pi)^{M}|B_{t}^{-1}|}}\exp\Big{(}-\frac{1}{2}\left\|p-B_{t}^{-1}b_{t}\right\|_{B_{t}}^{2}\Big{)}\,,

(23)

which corresponds to the posterior distribution for TSPM in linear partial monitoring.

Covariances in TSPM-Gaussian and BPM-TS. In the linear partial monitoring, TSPM assumes noise with covariance $I_{M}$ , which is compatible with the fact that the discrete setting can be regarded as linear PM with $I_{M}$ -sub-Gaussian noise. On the other hand, BPM-TS assumes covariance $S_{i}S_{i}^{\top}$ , and in general $I_{M}\preceq S_{i}S_{i}^{\top}$ holds. Therefore, BPM-TS assumes unnecessarily larger covariance, which makes learning slow down.

Appendix E Preliminaries for Regret Analysis

In this appendix, we give some technical lemmas, which are used for the derivation of the regret bound in Appendix F. Here, we write $X\succeq Y$ to denote $X-Y\succeq 0$ . For $a,b\in\mathbb{R}$ , let $a\wedge b$ be $a$ if $a\leq b$ otherwise $b$ , and $a\vee b$ be $b$ if $a\leq b$ otherwise $a$ . We use $h(a)\coloneqq\mathbb{P}_{X\sim\chi^{2}_{M}}\left\{X\geq a\right\}$ to evaluate the behavior of the posterior samples, where $\chi^{2}_{M}$ is the chi-squared distribution with $M$ degree of freedom.

E.1 Basic Lemmas

Fact 5 (Moment generating function of squared-Gaussian distribution).

Let $X$ be the random variable following the standard normal distribution. Then, the moment generating function of $X^{2}$ is $\operatornamewithlimits{\mathbb{E}}{\left[\exp(\xi X^{2})\right]}=(1-2\xi)^{-1/2}$ for $\xi<1/2$ .

Lemma 6 (Chernoff bound for chi-squared random variable).

Let $X$ be the random variable following the chi-squared distribution with $k$ degree of freedom. Then, for any $a\geq 0$ and $0\leq\xi<1/2$ ,

\displaystyle\mathbb{P}{\left\{X\geq a\right\}}\leq\mathrm{e}^{-\xi a}(1-2\xi)^{-\frac{k}{2}}.

(24)

Proof.

By Markov’s inequality, the LHS can be bounded as

$\displaystyle\mathbb{P}{\left\{X\geq a\right\}}$	$\displaystyle=\mathbb{P}{\left\{\sum_{i=1}^{k}X_{i}^{2}\geq a\right\}}\;(X_{1},\ldots,X_{k}\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\mathcal{N}(0,1))$
	$\displaystyle=\mathbb{P}{\left\{\exp\Big{(}\xi\sum_{i=1}^{k}X_{i}^{2}\Big{)}\geq\exp(\xi a)\right\}}$
	$\displaystyle\leq\mathrm{e}^{-\xi a}\Big{(}\operatornamewithlimits{\mathbb{E}}{\left[\mathrm{e}^{\xi X_{1}^{2}}\right]}\Big{)}^{k}\quad\left(\text{by {Markov's ineq.}}\right)$
	$\displaystyle=\mathrm{e}^{-\xi a}(1-2\xi)^{-\frac{k}{2}}\quad\left(\text{by {Fact~{}\ref{fact:mgf_sqGauss}}}\right)\,,$	(25)

which completes the proof. ∎

E.2 Property of Strong Local Observability

Recall that $\Delta_{i}=(L_{i}-L_{1})^{\top}p^{*}>0$ for $i\in[N]$ , which is the difference of the expected loss of actions $i$ and $1$ . For this define

\displaystyle\epsilon\coloneqq\left(\frac{1}{2\sqrt{A}}\min_{i\neq 1}\frac{\Delta_{i}}{\|z_{1,i}\|}\right)\wedge\left(\min_{p\in\mathcal{C}_{1}^{c}}\frac{4}{3}\|p-p^{*}\|\right)\,,

(26)

which is used throughout the proof of this appendix and Appendix F. The following lemma provides the key property of the strong local observability condition.

Lemma 7.

For any partial monitoring game with strong local observability and $p\in\mathbb{R}^{M}$ , any of the conditions 1–3 in the following is not satisfied:

1.

$L_{1}^{\top}p>L_{k}^{\top}p$ (Worse action $k$ looks better under $p$ .)
2.

$\|S_{1}p-S_{1}p^{*}\|\leq\epsilon$
3.

$\|S_{k}p-S_{k}p^{*}\|\leq\epsilon$ .

Proof.

We prove by contradiction. Assume that there exists $p\in\mathbb{R}^{M}$ such that conditions 1–3 are simultaneously satisfied.

Now, by the conditions 2 and 3, we have

\displaystyle\begin{split}|S_{1}p-S_{1}p^{*}|&\preceq\epsilon\mathbf{1}_{A}\,,\\ |S_{k}p-S_{k}p^{*}|&\preceq\epsilon\mathbf{1}_{A}\,.\end{split}

(27)

Here, $|\cdot|$ is the element-wise absolute value, and $\preceq$ means that the inequality $\leq$ holds for each element. Therefore,

\displaystyle\left|\left(\begin{array}[]{c}S_{1}\\ S_{k}\end{array}\right)(p-p^{*})\right|\preceq\epsilon\mathbf{1}_{2A}\,.

(30)

On the other hand, by the strong local observability condition, for any $k\neq 1$ , there exists $z_{1,k}\neq 0\in\mathbb{R}^{2A}$ such that

\displaystyle(L_{1}-L_{k})^{\top}=z_{1,k}^{\top}\left(\begin{array}[]{c}S_{1}\\ S_{k}\end{array}\right)\,.

(33)

Now, we have

	$\displaystyle z_{1,k}^{\top}\left(\begin{array}[]{c}S_{1}\\ S_{k}\end{array}\right)(p-p^{*})$		(36)
	$\displaystyle\leq\\|z_{1,k}\\|\left\\|\left(\begin{array}[]{c}S_{1}\\ S_{k}\end{array}\right)(p-p^{*})\right\\|\quad\left(\text{by {Cauchy-Schwarz ineq.}}\right)$		(39)
	$\displaystyle\leq\sqrt{2A}\epsilon\\|z_{1,k}\\|\quad\left(\text{by {Eq.~{}\eqref{eq:element_wise_cond}}}\right)\,,$		(40)

and

	$\displaystyle z_{1,k}^{\top}\left(\begin{array}[]{c}S_{1}\\ S_{k}\end{array}\right)(p-p^{*})$		(43)
	$\displaystyle=(L_{1}-L_{k})^{\top}(p-p^{*})\quad\left(\text{by {Eq.~{}\eqref{eq:loc}}}\right)$
	$\displaystyle=(L_{1}-L_{k})^{\top}p+(L_{k}-L_{1})^{\top}p^{*}$
	$\displaystyle\geq\Delta_{k}\quad\left(\text{by {Condition 1 \& def. of $\Delta_{k}$}}\right)\,.$		(44)

Therefore, from (40) and (44), we have

\displaystyle\Delta_{k}\leq\sqrt{2A}\epsilon\|z_{1,k}\|\,.

(45)

This inequality does not hold for all $k\neq 1$ for the predefined value of $\epsilon$ , since we have

\displaystyle\epsilon\leq\frac{1}{2\sqrt{A}}\min_{k\neq 1}\frac{\Delta_{k}}{\|z_{1,k}\|}\,.

(46)

Therefore, the proof is completed by contradiction. ∎

Remark.

The similar result holds when the optimal action $1$ is replaced with action $j\neq k$ such that $\Delta_{j,k}\coloneqq(L_{j}-L_{k})^{\top}p^{*}>0$ by taking $\epsilon$ satisfying

\displaystyle\epsilon\leq\frac{1}{2\sqrt{A}}\min_{j\neq k:\Delta_{j,k}>0}\frac{\Delta_{j,k}}{\|z_{j,k}\|}\,.

(47)

From Lemma 7, we have the following corollary.

Corollary 8.

For any $p\in\mathbb{R}^{M}$ satisfying $p\in\mathcal{C}_{i}$ and $\|S_{1}p-S_{1}p^{*}\|\leq\epsilon$ , we have

\displaystyle\|S_{i}p-S_{i}p^{*}\|>\epsilon\,.

(48)

Proof.

Note that $p\in\mathcal{C}_{i}$ is equivalent to $(L_{1}-L_{i})^{\top}p^{*}>0$ for any $i\neq 1$ . Therefore, the result directly follows from Lemma 7. ∎

The next lemma is the property of Mahalanobis distance corresponding to $\bar{g}_{t}(p)$ .

Lemma 9.

Define $\mathcal{T}_{i}\coloneqq\{p\in\mathbb{R}^{M}:\|S_{i}p-S_{i}p^{*}\|>\epsilon\}$ . Assume that $N_{i}(t)\geq n_{i}$ , $\|S_{i}\hat{p}_{t}-S_{i}p^{*}\|\leq\epsilon/4$ . Then, for any $0\leq\xi<1/2$

\displaystyle h\bigg{(}\inf_{p\in\mathcal{T}_{i}}\|B_{t}^{1/2}(p-\hat{p}_{t})\|^{2}\bigg{)}\leq\exp\Big{(}-\frac{9}{16}\xi\epsilon^{2}n_{i}\Big{)}(1-2\xi)^{-M/2}\,.

(49)

Proof.

To bound the LHS of the above inequality, we bound $\|B_{t}^{1/2}(p-\hat{p}_{t})\|^{2}$ from below for $p\in\mathcal{T}_{i}$ . Using the triangle inequality and the assumptions, we have

	$\displaystyle\\|S_{i}(p-\hat{p}_{t})\\|$	$\displaystyle\geq\\|S_{i}p-S_{i}p^{}\\|-\\|S_{i}\hat{p}_{t}-S_{i}p^{}\\|$
		$\displaystyle>\epsilon-\epsilon/4>0\,.$		(50)

Therefore, we have

$\displaystyle\\|B_{t}^{1/2}(p-\hat{p}_{t})\\|^{2}$	$\displaystyle\geq\sum_{k\in[N]}N_{k}(t)\\|S_{k}(p-\hat{p}_{t})\\|^{2}\quad\left(\text{by {def. of $B_{t}$}}\right)$
	$\displaystyle\geq n_{i}\\|S_{i}(p-\hat{p}_{t})\\|^{2}\quad\left({N_{i}(t)\geq n_{i}}\right)$
	$\displaystyle>\frac{9}{16}\epsilon^{2}n_{i}\quad\left(\text{by {Eq.~{}\eqref{eq:diff_p_phat}}}\right)\,.$	(51)

By the Chernoff bound for a chi-squared random variable in Lemma 6, we now have

\displaystyle h(a)\leq\mathrm{e}^{-\xi a}(1-2\xi)^{-M/2}\,,

(52)

for any $a\geq 0$ and $0\leq\xi<1/2$ . Hence, using the fact that $\|B_{t}^{1/2}(p-\hat{p}_{t})\|^{2}$ follows the chi-squared distribution with $M$ degree of freedom, we have

	$\displaystyle h\bigg{(}\inf_{p\in\mathcal{T}_{i}}\\|B_{t}^{1/2}(p-\hat{p}_{t})\\|^{2}\bigg{)}$	$\displaystyle\leq h\Big{(}\frac{9}{16}\epsilon^{2}n_{i}\Big{)}$
		$\displaystyle\leq\exp\Big{(}-\frac{9}{16}\xi\epsilon^{2}n_{i}\Big{)}(1-2\xi)^{-M/2}\,,$		(53)

which completes the proof. ∎

E.3 Statistics of Uninterested Actions

For any $k\neq i$ and $n_{k}\in[T]$ , define

	$\displaystyle Z_{n_{k}}$	$\displaystyle\coloneqq n_{k}\\|q_{k,n_{k}}-S_{k}p^{*}\\|^{2}\,,$		(54)
	$\displaystyle Z_{\backslash i}$	$\displaystyle\coloneqq\sum_{k\neq i}\max_{n_{k}\in[T]}Z_{n_{k}}\,.$		(55)

In this section, we bound $\operatornamewithlimits{\mathbb{E}}{\left[Z_{\backslash i}\right]}$ from above. Note that $Z_{\backslash i}$ is independent of the randomness of Thompson sampling.

Lemma 10 (Upper bound for the expectation of $Z_{\backslash i}$ ).

\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[Z_{\backslash i}\right]}\leq 4N\Big{(}\log T+\frac{A}{2}\log 2+1\Big{)}\,.

(56)

Proof.

Recall that in linear partial monitoring, the feedback $y(t)\in\mathbb{R}^{A}$ for action $k$ is given as

\displaystyle y_{t}=S_{k}p^{*}+\epsilon\,,\;\epsilon\sim\mathcal{N}(0,I_{A})

(57)

at round $t\in[T]$ , Therefore, $y(t)-S_{k}p^{*}\sim\mathcal{N}(0,I_{A})$ . Since $q_{k,n_{k}}=\frac{1}{n_{k}}\sum_{s\in[T]:i(s)=k}y(s)$ for any $n_{k}\in[T]$ , we have

\displaystyle q_{k,n_{k}}-S_{k}p^{*}=\frac{1}{n_{k}}\sum_{s\in[T]:i(s)=k}(y(s)-S_{k}p^{*})\sim\mathcal{N}(0,I_{A}/n_{k})\,.

(58)

Therefore,

\displaystyle\sqrt{n_{k}}(q_{k,n_{k}}-S_{k}p^{*})\sim\mathcal{N}(0,I_{A})\,,

(59)

and thus

\displaystyle n_{k}\|q_{k,n_{k}}-S_{k}p^{*}\|^{2}=\|\sqrt{n_{k}}(q_{k,n_{k}}-S_{k}p^{*})\|^{2}\sim\chi^{2}_{A}\,.

(60)

Therefore, for any $0\leq\xi<1/2$ ,

$\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\max_{n_{k}\in[T]}Z_{n_{k}}\right]}$	$\displaystyle=\int_{0}^{\infty}\mathbb{P}{\left\{\max_{n_{k}\in[T]}Z_{n_{k}}\geq x\right\}}\mathrm{d}x$
	$\displaystyle\leq\int_{0}^{\infty}\left[1\wedge T\cdot\mathbb{P}{\left\{Z_{1}\geq x\right\}}\right]\mathrm{d}{x}\quad\left(\text{by {the union bound}}\right)$
	$\displaystyle\leq\int_{0}^{\infty}\left[1\wedge T\cdot\mathrm{e}^{-\xi x}(1-2\xi)^{-\frac{A}{2}}\right]\mathrm{d}{x}\quad\left(\text{by {$Z_{1}\sim\chi^{2}_{(}A)$ and Lemma~{}\ref{lem:chisq_chernoff}}}\right)$
	$\displaystyle=\int_{0}^{x^{}}\mathrm{d}x+\int_{x^{}}^{\infty}T\cdot\mathrm{e}^{-\xi x}(1-2\xi)^{-\frac{A}{2}}\mathrm{d}{x}$
	$\displaystyle\leq x^{}+T\cdot\int_{x^{}}^{\infty}\mathrm{e}^{-\xi x}(1-2\xi)^{-\frac{A}{2}}\mathrm{d}{x}$
	$\displaystyle=x^{}+T(1-2\xi)^{-\frac{A}{2}}\left[-\frac{1}{\xi}\mathrm{e}^{-\xi x}\right]_{x^{}}^{\infty}$
	$\displaystyle=\frac{1}{\xi}\Big{\{}\log T-\frac{A}{2}\log(1-2\xi)+1\Big{\}}\,,$	(61)

where $x^{*}\coloneqq\frac{1}{\xi}{\left\{\log T-\frac{A}{2}\log(1-2\xi)\right\}}$ . Therefore, taking $\xi=1/4$ , we have

$\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[Z_{\backslash i}\right]}$	$\displaystyle=\operatornamewithlimits{\mathbb{E}}{\left[\sum_{k\neq i}\max_{n_{k}\in[T]}Z_{n_{k}}\right]}$
	$\displaystyle\leq\sum_{k\neq i}\operatornamewithlimits{\mathbb{E}}{\left[\max_{n_{k}\in[T]}Z_{n_{k}}\right]}$
	$\displaystyle\leq(N-1)\frac{1}{\xi}\Big{\{}\log T-\frac{A}{2}\log(1-2\xi)+1\Big{\}}$
	$\displaystyle\leq 4N\Big{(}\log T+\frac{A}{2}\log 2+1\Big{)}\,,$	(62)

which completes the proof. ∎

E.4 Mahalanobis Distance Process

Discussions in this section are essentially very similar to Abbasi-Yadkori et al. (2011, Lemma 11), but their results are not directly applicable and we give the full derivation for self-containedness. To maximize the applicability here we only assume sub-Gaussian noise rather than a Gaussian one.

Let $\epsilon_{t}$ be zero-mean $1$ -sub-Gaussian random variable, which satisfies

\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\mathrm{e}^{\lambda^{\top}\epsilon_{t}}\right]}\leq\mathrm{e}^{-\frac{\left\|\lambda\right\|^{2}}{2}}

(63)

for any $\lambda\in\mathbb{R}^{M}$ .

Lemma 11.

For any vector $v\in\mathbb{R}^{M}$ and positive definite matrix $V\in\mathbb{R}^{M\times M}$ such that $V\succ I$ ,

\displaystyle\mathbb{E}_{\epsilon_{t}}{\left[\mathrm{e}^{\frac{\left\|\epsilon_{t}+v\right\|_{V^{-1}}^{2}}{2}}\right]}

\displaystyle\leq\frac{\sqrt{|V|}}{\sqrt{|V-I|}}\mathrm{e}^{\frac{1}{2}v^{\top}(V-I)^{-1}v}\,.

(64)

Proof.

For any $x\in\mathbb{R}^{M}$

\displaystyle\mathbb{E}_{\lambda\sim\mathcal{N}(0,V^{-1})}{\left[\mathrm{e}^{\lambda^{\top}x}\right]}

\displaystyle=\mathrm{e}^{\frac{\left\|x\right\|_{V^{-1}}^{2}}{2}}\,.

(65)

Therefore, by letting $x=\epsilon_{t}+v$ we see that

\displaystyle\mathrm{e}^{\frac{\left\|\epsilon_{t}+v\right\|_{V^{-1}}^{2}}{2}}

\displaystyle=\mathbb{E}_{\lambda\sim\mathcal{N}(0,V^{-1})}{\left[\mathrm{e}^{\lambda^{\top}(\epsilon_{t}+v)}\right]}\,.

(66)

As a result, by the definition of sub-Gaussian random variables, we have

$\displaystyle\mathbb{E}_{\epsilon_{t}}{\left[\mathrm{e}^{\frac{\left\\|\epsilon_{t}+v\right\\|_{V^{-1}}^{2}}{2}}\right]}$	$\displaystyle=\mathbb{E}_{\lambda\sim\mathcal{N}(0,V^{-1})}{\left[\mathbb{E}_{\epsilon_{t}}{\left[\mathrm{e}^{\lambda^{\top}(\epsilon_{t}+v)}\right]}\right]}$
	$\displaystyle=\mathbb{E}_{\lambda\sim\mathcal{N}(0,V^{-1})}{\left[\mathrm{e}^{\lambda^{\top}v}\mathbb{E}_{\epsilon_{t}}{\left[\mathrm{e}^{\lambda^{\top}\epsilon_{t}}\right]}\right]}$
	$\displaystyle\leq\mathbb{E}_{\lambda\sim\mathcal{N}(0,V^{-1})}{\left[\mathrm{e}^{\lambda^{\top}v}\mathrm{e}^{\left\\|\lambda\right\\|^{2}/2}\right]}$
	$\displaystyle=\frac{1}{(2\pi)^{d/2}\sqrt{\|V^{-1}\|}}\int\mathrm{e}^{\lambda^{\top}v}\mathrm{e}^{\left\\|\lambda\right\\|^{2}/2}\mathrm{e}^{-\left\\|\lambda\right\\|_{V}^{2}/2}\mathrm{d}\lambda$
	$\displaystyle=\frac{1}{(2\pi)^{d/2}\sqrt{\|V^{-1}\|}}\int\mathrm{e}^{-\frac{1}{2}\left(\lambda^{\top}(V-I)\lambda-2v^{\top}\lambda\right)}\mathrm{d}\lambda$
	$\displaystyle=\frac{\sqrt{\|V-I\|}}{(2\pi)^{d/2}\sqrt{\|V^{-1}\|\|V-I\|}}\int\mathrm{e}^{-\frac{1}{2}\left((\lambda-(V-I)^{-1}v)^{\top}(V-I)(\lambda-(V-I)^{-1}v)-v^{\top}(V-I)^{-1}v\right)}\mathrm{d}\lambda$
	$\displaystyle=\frac{\sqrt{\|V\|}}{\sqrt{\|V-I\|}}\mathrm{e}^{\frac{1}{2}v^{\top}(V-I)^{-1}v}\,.$	(67)

∎

Lemma 12.

\displaystyle\mathbb{E}\left[\exp\left(\frac{1}{2}\left(\left\|\hat{p}_{t}-p^{*}\right\|_{B_{t}}^{2}-\left\|\hat{p}_{t-1}-p^{*}\right\|_{B_{t-1}}^{2}\right)\right)\mathrel{}\middle|\mathrel{}\hat{p}_{t-1},\,B_{t-1},\,S_{i(t-1)}\right]

\displaystyle\leq\sqrt{\frac{|B_{t}|}{|B_{t-1}|}}\,.

(68)

Proof.

Let $Z_{t}\coloneqq-\lambda p^{*}+\sum_{s=1}^{t}S_{i(s)}^{\top}\epsilon_{s}$ , and we have

•

$B_{t}=\lambda I+\sum_{s=1}^{t}S_{i(s)}^{\top}S_{i(s)}$ ,
•

$b_{t}=\sum_{s=1}^{t}S_{i(s)}^{\top}y(s)=B_{t}p^{*}+Z_{t}$ ,
•

$\hat{p}_{t}=B_{t}^{-1}b_{t}=p^{*}+B_{t}^{-1}Z_{t}$ .

In the following we omit the conditioning on $(\hat{p}_{t-1},\,B_{t-1},\,S_{i(t-1)})$ for notational simplicity.

Let us define $C_{t}\coloneqq S_{i(t)}B_{t-1}S_{i(t)}^{\top}$ and $d_{t}\coloneqq S_{i(t)}B_{t-1}^{-1}Z_{t-1}=S_{i(t)}(\hat{p}_{t}-p^{*})$ . Then, using the Sherman-Morrison-Woodbury formula we have

$\displaystyle\left\\|\hat{p}_{t}-p^{}\right\\|_{B_{t}}^{2}-\left\\|\hat{p}_{t-1}-p^{}\right\\|_{B_{t-1}}^{2}$
	$\displaystyle=Z_{t}^{\top}B_{t}^{-1}Z_{t}-Z_{t-1}^{\top}B_{t-1}^{-1}Z_{t-1}$
	$\displaystyle=(Z_{t-1}^{\top}+\epsilon_{t}^{\top}S_{i(t)})(B_{t-1}^{-1}-B_{t-1}^{-1}S_{i(t)}^{\top}(I+S_{i(t)}B_{t-1}^{-1}S_{i(t)}^{\top})^{-1}S_{i(t)}B_{t-1}^{-1})(Z_{t-1}+S_{i(t)}^{\top}\epsilon_{t})-Z_{t-1}^{\top}B_{t-1}^{-1}Z_{t-1}$
	$\displaystyle=(Z_{t-1}^{\top}+\epsilon_{t}^{\top}S_{i(t)})B_{t-1}^{-1}(Z_{t-1}+S_{i(t)}^{\top}\epsilon_{t})-Z_{t-1}^{\top}B_{t}^{-1}Z_{t-1}$
	$\displaystyle\quad-(Z_{t-1}^{\top}+\epsilon_{t}^{\top}S_{i(t)})B_{t-1}^{-1}S_{i(t)}^{\top}(I+S_{i(t)}B_{t-1}^{-1}S_{i(t)}^{\top})^{-1}S_{i(t)}B_{t-1}^{-1}(Z_{t-1}+S_{i(t)}^{\top}\epsilon_{t})$
	$\displaystyle=\epsilon_{t}^{\top}S_{i(t)}B_{t-1}^{-1}S_{i(t)}^{\top}\epsilon_{t}+2Z_{t-1}^{\top}B_{t-1}^{-1}S_{i(t)}^{\top}\epsilon_{t}$
	$\displaystyle\quad-(Z_{t-1}^{\top}+\epsilon_{t}^{\top}S_{i(t)})B_{t-1}^{-1}S_{i(t)}^{\top}(I+S_{i(t)}B_{t-1}^{-1}S_{i(t)}^{\top})^{-1}S_{i(t)}B_{t-1}^{-1}(Z_{t-1}+S_{i(t)}^{\top}\epsilon_{t})$
	$\displaystyle=\epsilon_{t}^{\top}C_{t}\epsilon_{t}+2d_{t}^{\top}\epsilon_{t}-(d_{t}^{\top}+\epsilon_{t}^{\top}C_{t})(I+C_{t})^{-1}(d_{t}+C_{t}\epsilon_{t})$
	$\displaystyle=\epsilon_{t}^{\top}C_{t}(I-(I+C_{t})^{-1}C_{t})\epsilon_{t}+2d_{t}^{\top}(I-(I+C_{t})^{-1}C_{t})\epsilon_{t}-d_{t}^{\top}(I+C_{t})^{-1}d_{t}$
	$\displaystyle=\epsilon_{t}^{\top}C_{t}(I+C_{t})^{-1}\epsilon_{t}+2d_{t}^{\top}(I+C_{t})^{-1}\epsilon_{t}-d_{t}^{\top}(I+C_{t})^{-1}d_{t}$
	$\displaystyle=\left\\|\epsilon_{t}+C_{t}^{-1}d_{t}\right\\|_{C_{t}(I+C_{t})^{-1}}^{2}-d_{t}^{\top}(I+C_{t})^{-1}C_{t}^{-1}d_{t}-d_{t}^{\top}(I+C_{t})^{-1}d_{t}$
	$\displaystyle=\left\\|\epsilon_{t}+C_{t}^{-1}d_{t}\right\\|_{C_{t}(I+C_{t})^{-1}}^{2}-d_{t}^{\top}(I+C_{t})^{-1}(I+C_{t}^{-1})d_{t}\,.$	(69)

Therefore, Lemma 11 with $V\coloneqq\left(C_{t}(I+C_{t})^{-1}\right)^{-1}=(I+C_{t})C_{t}^{-1},\,v\coloneqq C_{t}^{-1}d_{t}$ yields

$\displaystyle\mathbb{E}\left[\exp\left(\frac{1}{2}\left(\left\\|\hat{p}_{t}-p^{}\right\\|_{B_{t}}^{2}-\left\\|\hat{p}_{t-1}-p^{}\right\\|_{B_{t-1}}^{2}\right)\right)\right]$
	$\displaystyle\leq\frac{\sqrt{\|(I+C_{t})C_{t}^{-1}\|}}{\sqrt{\|(I+C_{t})C_{t}^{-1}-I\|}}\mathrm{e}^{\frac{1}{2}d_{t}^{\top}C_{t}^{-1}((I+C_{t})C_{t}^{-1}-I)^{-1}C_{t}^{-1}d_{t}}\mathrm{e}^{-\frac{1}{2}d_{t}^{\top}(I+C_{t})^{-1}(I+C_{t}^{-1})d_{t}}$
	$\displaystyle\leq\frac{\sqrt{\|(I+C_{t})C_{t}^{-1}\|}}{\sqrt{\|C_{t}^{-1}\|}}\mathrm{e}^{\frac{1}{2}d_{t}^{\top}C_{t}^{-1}(C_{t}^{-1})^{-1}C_{t}^{-1}d_{t}}\mathrm{e}^{-\frac{1}{2}d_{t}^{\top}(I+C_{t})^{-1}(I+C_{t}^{-1})d_{t}}$
	$\displaystyle=\sqrt{\|(I+C_{t})\|}$
	$\displaystyle=\sqrt{\frac{\|B_{t}\|}{\|B_{t-1}\|}}\,,$	(70)

where see, e.g., Abbasi-Yadkori et al. (2011, Lemma 11) for the last equality. ∎

E.5 Norms under Perturbations

In the following two lemmas, we give some analysis of norms under perturbations.

Lemma 13.

Let $A$ be a positive definite matrix. Let $a\in\mathbb{R}^{d}$ and $\epsilon>0$ be such that $\epsilon<\left\|a\right\|/3$ . Then

\displaystyle\min_{x:\left\|x\right\|\leq 2\epsilon}\max_{x^{\prime}:\left\|x^{\prime}\right\|\leq\epsilon}\left\{(a+x+x^{\prime})^{\top}A(a+x+x^{\prime})\right\}

\displaystyle=\min_{x^{\prime\prime}:\left\|x^{\prime\prime}\right\|\leq\epsilon}\left\{(a+x^{\prime\prime})^{\top}A(a+x^{\prime\prime})\right\}\,.

(71)

Proof.

By considering the Lagrangian multiplier we see that any stationary point of the function $(a+x^{\prime\prime})^{\top}A(a+x^{\prime\prime})$ over $\{(x,x^{\prime}):\left\|x\right\|\leq 2\epsilon,\,\left\|x^{\prime}\right\|\leq\epsilon\}$ satisfies

	$\displaystyle A(a+x+x^{\prime})-\lambda_{1}x=0\,,$
	$\displaystyle A(a+x+x^{\prime})-\lambda_{2}x^{\prime}=0\,,$
	$\displaystyle x^{\top}x=4\epsilon^{2}\,,$
	$\displaystyle x^{\prime\top}x^{\prime}=\epsilon^{2}\,,$		(72)

and therefore $\lambda_{1}x=\lambda_{2}x^{\prime}$ . Considering the last two conditions of (72) we have $\lambda_{2}=\pm 2\lambda_{1}$ , implying that

\displaystyle x^{\prime}=-(3A-2\lambda_{1}I)Aa

(73)

\displaystyle x^{\prime}=(A-2\lambda_{1}I)Aa

(74)

for $\lambda_{1}$ satisfying $x^{\prime\top}x^{\prime}=\epsilon^{2}$ .

Note that it holds for any positive definite matrix $B$ that

\displaystyle\frac{\mathrm{d}{{}^{2}}}{\mathrm{d}{\lambda^{2}}}a(B+\lambda I)^{-2}a

\displaystyle=a(B+\lambda I)^{-4}a=\left\|(B+\lambda I)^{-2}a\right\|^{2}\,,

(75)

which is positive almost everywhere, meaning that $a(B+\lambda I)^{-2}a$ is strictly convex with respect to $\lambda\in\mathbb{R}$ . Therefore, there exists at most two $\lambda^{\prime}_{1}$ ’s satisfying (73) and $x^{\prime\top}x^{\prime}=\epsilon^{2}$ , and there exists at most two $\lambda^{\prime}_{1}$ ’s satisfying (74) and $x^{\prime\top}x^{\prime}=\epsilon^{2}$ . In summary, there at most four stationary points of $(a+x^{\prime\prime})^{\top}A(a+x^{\prime\prime})$ over $\{(x,x^{\prime}):\left\|x\right\|\leq 2\epsilon,\,\left\|x^{\prime}\right\|\leq\epsilon\}$ .

On the other hand, two optimization problems

\displaystyle\min_{x:\left\|x\right\|\leq 2\epsilon}\min_{x^{\prime}:\left\|x^{\prime}\right\|\leq\epsilon}\left\{(a+x+x^{\prime})^{\top}A(a+x+x^{\prime})\right\}

\displaystyle=\min_{x^{\prime\prime}:\left\|x^{\prime\prime}\right\|\leq 3\epsilon}\left\{(a+x^{\prime\prime})^{\top}A(a+x^{\prime\prime})\right\}

(76)

and

\displaystyle\max_{x:\left\|x\right\|\leq 2\epsilon}\max_{x^{\prime}:\left\|x^{\prime}\right\|\leq\epsilon}\left\{(a+x+x^{\prime})^{\top}A(a+x+x^{\prime})\right\}

\displaystyle=\max_{x^{\prime\prime}:\left\|x^{\prime\prime}\right\|\leq 3\epsilon}\left\{(a+x^{\prime\prime})^{\top}A(a+x^{\prime\prime})\right\}

(77)

can be easily solved by an elementary calculation and the optimal values are equal to those corresponding to (73).

Therefore, the optimal solutions of the two minimax problems

\displaystyle\max_{x:\left\|x\right\|\leq 2\epsilon}\min_{x^{\prime}:\left\|x^{\prime}\right\|\leq\epsilon}\left\{(a+x+x^{\prime})^{\top}A(a+x+x^{\prime})\right\}

(78)

and

\displaystyle\min_{x:\left\|x\right\|\leq 2\epsilon}\max_{x^{\prime}:\left\|x^{\prime}\right\|\leq\epsilon}\left\{(a+x+x^{\prime})^{\top}A(a+x+x^{\prime})\right\}

(79)

correspond to two points corresponding to (74).

We can see again from an elementary calculation that the optimal solutions for two optimization problems

	$\displaystyle\min_{x^{\prime\prime}:\left\\|x^{\prime\prime}\right\\|\leq\epsilon}\left\{(a+x^{\prime\prime})^{\top}A(a+x^{\prime\prime})\right\}$
	$\displaystyle\max_{x^{\prime\prime}:\left\\|x^{\prime\prime}\right\\|\leq\epsilon}\left\{(a+x^{\prime\prime})^{\top}A(a+x^{\prime\prime})\right\}$		(80)

have the same necessary and sufficient conditions as (74) and we complete the proof by noticing that (78) is less than (79). ∎

Lemma 14.

Let $A\succeq nS_{1}^{\top}S_{1}$ be a positive-definite matrix with minimum eigenvalue at least $\lambda>0$ . Then, for any $\hat{p}\in\mathbb{R}^{d}$ and $\epsilon>0$ satisfying $\epsilon<\left\|\hat{p}-p^{*}\right\|/3$ ,

\displaystyle\left\|\hat{p}-p^{*}\right\|_{A}^{2}-\inf_{p:\left\|p-p^{*}\right\|\leq 2\epsilon}\sup_{p^{\prime}:\left\|p^{\prime}-p\right\|\leq\epsilon}\left\|p^{\prime}-\hat{p}\right\|_{A}^{2}

\displaystyle\geq\epsilon\sqrt{n\lambda}\left\|S_{1}(\hat{p}-p^{*})\right\|\,.

(81)

Proof.

Let $a=\hat{p}-p^{*}$ . By Lemma 13, we have

$\displaystyle\inf_{p:\left\\|p-p^{*}\right\\|\leq 2\epsilon}\sup_{p^{\prime}:\left\\|p^{\prime}-p\right\\|\leq\epsilon}\left\\|p^{\prime}-\hat{p}\right\\|_{A}^{2}$
	$\displaystyle=\inf_{x:\left\\|x\right\\|\leq 2\epsilon}\sup_{x^{\prime}:\left\\|x^{\prime}\right\\|\leq\epsilon}\left\\|a+x+x^{\prime}\right\\|_{A}^{2}$
	$\displaystyle=\inf_{x:\left\\|x\right\\|\leq\epsilon}\left\\|a+x\right\\|_{A}^{2}\,.$	(82)

Now define $\mathcal{S}_{\epsilon^{\prime},A}=\{x:\left\|x\right\|_{A}\leq\epsilon^{\prime}\}$ . Then, we see that $\mathcal{S}_{\epsilon\sqrt{\lambda},A}\subset\{x:\left\|x\right\|\leq\epsilon\}$ . Therefore, an elementary calculation using the Lagrange multiplier technique shows

	$\displaystyle\inf_{x:\left\\|x\right\\|\leq\epsilon}\left\\|p^{\prime}-\hat{p}\right\\|_{A}^{2}$	$\displaystyle\leq\inf_{x\in\mathcal{S}_{\epsilon\sqrt{\lambda},A}}\left\\|p-\hat{p}\right\\|_{A}^{2}$
		$\displaystyle=\left(\left\\|a\right\\|_{A}-\epsilon\sqrt{\lambda}\right)^{2}\,.$		(83)

As a result, we see that

$\displaystyle\left\\|p^{}-\hat{p}\right\\|_{A}^{2}-\inf_{p:\left\\|p-p^{}\right\\|\leq 2\epsilon}\sup_{p^{\prime}:\left\\|p^{\prime}-p\right\\|\leq\epsilon}\left\\|p^{\prime}-\hat{p}\right\\|_{A}^{2}$	$\displaystyle\geq\left\\|a\right\\|_{A}^{2}-\left(\left\\|a\right\\|_{A}-\epsilon\sqrt{\lambda}\right)^{2}$
	$\displaystyle=\epsilon\sqrt{\lambda}\left(\left\\|a\right\\|_{A}+\left\\|a\right\\|_{A}-\epsilon\sqrt{\lambda}\right)$
	$\displaystyle\geq\epsilon\sqrt{\lambda}\left(\left\\|a\right\\|_{A}+\left\\|a\right\\|\sqrt{\lambda}-\epsilon\sqrt{\lambda}\right)$
	$\displaystyle=\epsilon\sqrt{\lambda}\left(\left\\|a\right\\|_{A}+\sqrt{\lambda}(\left\\|a\right\\|-\epsilon)\right)$
	$\displaystyle\geq\epsilon\sqrt{\lambda}\left\\|a\right\\|_{A}$
	$\displaystyle\geq\epsilon\sqrt{n\lambda}\left\\|S_{1}a\right\\|\,.$	(84)

∎

For the subsets of $\mathbb{R}^{n}$ , $\mathcal{X}$ and $\mathcal{Y}$ , let $\mathcal{X}+\mathcal{Y}\coloneqq{\left\{x+y:x\in\mathcal{X},\,y\in\mathcal{Y}\right\}}$ be the Minkowski sum, and let $B^{n}_{r}(p)$ be the $n$ -dimensional Euclidian ball of radius $r$ at point $p\in\mathbb{R}^{n}$ (the superscript $n$ can be omitted when it is clear from context). We also let $\epsilon^{\prime}$ be

\displaystyle\epsilon^{\prime}\coloneqq\frac{\epsilon}{\Big{(}16\max_{i\in[N]}\|S_{i}\|\Big{)}\vee\Big{(}\frac{1}{\sqrt{A}}\max_{i\in[N]}\frac{\|L_{i}-L_{1}\|}{\|z_{1,i}\|}\Big{)}}\,,

(85)

which is also used throughout the proof of this appendix and Appendix F as $\epsilon$ in (26).

Theorem 15.

Let $\epsilon^{\prime\prime}\in(0,\epsilon)$ be a constant for $\epsilon$ defined in (26). Let $\hat{p}\in\mathcal{C}_{k}+B_{\epsilon^{\prime}}^{d}(0)$ be satisfying $\left\|S_{k}(\hat{p}-p^{*})\right\|\leq\epsilon^{\prime\prime}$ . Then, there exists $\delta>0$ satisfying for any $n\geq 0$ and $A\succeq nS_{1}^{\top}S_{1}+\lambda I$ that

\displaystyle\left\|p^{*}-\hat{p}\right\|_{A}^{2}-\inf_{p:\left\|p-p^{*}\right\|\leq 2\epsilon}\sup_{p^{\prime}:\left\|p^{\prime}-p\right\|\leq\epsilon}\left\|p^{\prime}-\hat{p}\right\|_{A}^{2}

\displaystyle\geq\epsilon\delta\sqrt{\lambda n}\,.

(86)

Proof.

Recall that $\epsilon^{\prime\prime}<\epsilon\leq\min_{p\in\mathcal{C}_{1}^{c}}\left\|p-p^{*}\right\|/3$ . It is enough from Lemma 14 to prove that

\displaystyle\delta\coloneqq\min_{\hat{p}\in\{p\in\mathcal{C}_{k}+B^{d}_{\epsilon^{\prime}}(0):\left\|S_{k}(p-p^{*})\right\|\leq\epsilon^{\prime\prime}\}}\left\|S_{1}(\hat{p}-p^{*})\right\|

(87)

is positive.

We prove by contradiction and the proof is basically same as that of Lemma 7 but more general in the sense that the condition on $\hat{p}$ is not $\hat{p}\in\mathcal{C}_{k}$ but $\hat{p}\in\mathcal{C}_{k}+B^{d}_{\epsilon^{\prime}}(0)$ . Assume that $\delta=0$ , that is, there exists $\hat{p}\in\mathcal{C}_{k}+B^{d}_{\epsilon^{\prime}}(0)$ satisfying $\left\|S_{k}(p-p^{*})\right\|\leq\epsilon^{\prime\prime}\}$ and $\|S_{1}(\hat{p}-p^{*})\|=0$ . Note that $\|S_{1}(\hat{p}-p^{*})\|=0$ implies $\|S_{1}(\hat{p}-p^{*})\|\leq\epsilon^{\prime\prime}$ . Therefore, we now have following conditions on $\hat{p}$ :

•

$\hat{p}\in\mathcal{C}_{k}+B^{d}_{\epsilon^{\prime}}(0)$
•

$\|S_{1}(\hat{p}-p^{*})\|\leq\epsilon^{\prime\prime}$
•

$\|S_{k}(\hat{p}-p^{*})\|\leq\epsilon^{\prime\prime}$ .

Following the same argument as the proof of Lemma 7, we have

\displaystyle z_{1,k}^{\top}\left(\begin{array}[]{c}S_{1}\\ S_{k}\end{array}\right)(\hat{p}-p^{*})\leq\sqrt{2A}\epsilon^{\prime\prime}\|z_{1,k}\|\,.

(90)

On the other hand, since $\hat{p}\in\mathcal{C}_{k}+B^{d}_{\epsilon^{\prime}}(0)$ we can take $\bar{p}\in\mathcal{C}_{k}$ such that $\|\hat{p}-\bar{p}\|\leq\epsilon^{\prime}$ . Hence,

$\displaystyle z_{1,k}^{\top}\left(\begin{array}[]{c}S_{1}\\ S_{k}\end{array}\right)(\hat{p}-p^{*})$	$\displaystyle=(L_{1}-L_{k})^{\top}(\hat{p}-p^{*})$	(93)
	$\displaystyle=-(L_{k}-L_{1})^{\top}(\hat{p}-\bar{p})+(L_{1}-L_{k})^{\top}\bar{p}+(L_{k}-L_{1})^{\top}p^{*}$
	$\displaystyle\geq-(L_{k}-L_{1})^{\top}(\hat{p}-p^{*})+\Delta_{k}\,.\quad\left(\text{by {$\bar{p}\in\mathcal{C}_{k}$ and def. of $\Delta_{k}$}}\right)$	(94)

From (90) and (94), we have

\displaystyle\Delta_{k}-(L_{k}-L_{1})^{\top}(\hat{p}-p^{*})\leq\sqrt{2A}\epsilon^{\prime\prime}\|z_{1,k}\|\,.

(95)

Now, the left hand side of (95) is bounded from below as

$\displaystyle\Delta_{k}-(L_{k}-L_{1})^{\top}(\hat{p}-\bar{p})$	$\displaystyle\geq\Delta_{k}-\\|L_{k}-L_{1}\\|\\|\hat{p}-\bar{p}\\|$
	$\displaystyle\geq\Delta_{k}-\\|L_{k}-L_{1}\\|\epsilon^{\prime}$
	$\displaystyle=\Delta_{k}-\\|L_{k}-L_{1}\\|\frac{\epsilon}{\frac{1}{\sqrt{A}}\max_{i}\frac{\\|L_{1}-L_{i}\\|}{\\|z_{1,i}\\|}}$
	$\displaystyle=\Delta_{k}-\\|L_{k}-L_{1}\\|\frac{\frac{1}{2\sqrt{A}}\min_{i}\frac{\Delta_{i}}{\\|z_{1,i}\\|}}{\frac{1}{\sqrt{A}}\max_{i}\frac{\\|L_{1}-L_{i}\\|}{\\|z_{1,i}\\|}}$
	$\displaystyle\geq\Delta_{k}-\Delta_{k}/2\,.$	(96)

On the other hand, using the definition of $\epsilon^{\prime\prime}$ , the right hand side of (95) is bounded from above as

\displaystyle\sqrt{2A}\epsilon^{\prime\prime}\|z_{1,k}\|<\Delta_{k}/2\,.

(97)

Therefore, the proof is completed by contradiction. ∎

E.6 Exit Time Analysis

We next consider the exit time. Let $\mathcal{A}_{t}$ be an event deterministic given $\mathcal{F}_{t}$ , and $\mathcal{B}_{t}$ be a random event such that if $\mathcal{B}_{t}$ occurred then $\mathcal{A}_{t^{\prime}}$ never occurs for $t^{\prime}=t+1,t+2,\dots$ . Let $P_{t},\,t=1,2,\dots,T$ , be a stochastic process satisfying $P_{t}\leq\mathbb{P}{\left\{\mathcal{B}_{t}|\mathcal{F}_{t}\right\}}$ a.s. and $P_{t}^{-1}$ is a supermartingale with respect to the filtration induced by $\mathcal{F}_{t}$ .

Theorem 16.

Let $\tau$ be the stopping time defined as

\displaystyle\tau=\begin{cases}\min\{t\in[T]:\mathcal{A}_{t}\}&\mbox{\rm if $\mathcal{A}_{t}$ occurs for some $t\in[T]$.}\\ T+1&\mbox{\rm otherwise}.\end{cases}

(98)

Then we almost surely have

\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\mathcal{A}_{t}\right]}\mathrel{}\middle|\mathrel{}\mathcal{F}_{\tau}\right]\leq\begin{cases}P_{\tau}^{-1}&\tau\leq T,\\ 0&\tau=T+1.\end{cases}

(99)

We prove this theorem based on the following lemma.

Lemma 17.

Let $(Q_{i})_{i=1}^{\infty}\subset[0,1]$ be an arbitrary stochastic process such that $(Q_{i}^{-1})_{i=1}^{\infty}$ is a supermartingale with respect to a filtration $(\mathcal{G}_{i})_{i=1}^{\infty}$ . Then, for any $\mathcal{G}_{0}\subset\mathcal{G}_{1}$ ,

\displaystyle\mathbb{E}\left[\sum_{i=1}^{T}\prod_{j=1}^{i}(1-Q_{j})\mathrel{}\middle|\mathrel{}\mathcal{G}_{0}\right]\leq\mathbb{E}\left[Q_{1}^{-1}|\mathcal{G}_{0}\right]-1\quad\mathrm{a.s.}

(100)

Proof.

Let

	$\displaystyle{N}_{k}((Q_{i},\mathcal{G}_{i})_{i=1}^{\infty},\,\mathcal{G}_{0})$	$\displaystyle=\mathbb{E}\left[\sum_{i=1}^{k}\prod_{j=1}^{i}(1-Q_{j})\mathrel{}\middle\|\mathrel{}\mathcal{G}_{0}\right]$
	$\displaystyle\overline{N}_{k}((Q_{i},\mathcal{G}_{i})_{i=1}^{\infty},\,\mathcal{G}_{0})$	$\displaystyle=\mathbb{E}\left[\sum_{i=1}^{\infty}\prod_{j=1}^{i}(1-Q_{j})\mathrel{}\middle\|\mathrel{}\mathcal{G}_{0}\right]\quad\mbox{where $Q_{j}=Q_{k}$ for $j>k$.}$		(101)

We show $\overline{N}_{k}((Q_{i},\mathcal{G}_{i})_{i=1}^{\infty},\,\mathcal{G}_{0})\leq\mathbb{E}[Q_{1}^{-1}|\mathcal{G}_{0}]-1$ a.s. for any $(Q_{i},\,\mathcal{G}_{i})_{i=1}^{\infty}$ , $\mathcal{G}_{0}\subset\mathcal{G}_{1}$ and $k\in\mathbb{N}$ by induction. First, for $k=1$ the statement holds since

$\displaystyle\overline{N}_{1}((Q_{i},\mathcal{G}_{i})_{i=1}^{\infty},\,\mathcal{G}_{0})$	$\displaystyle=\mathbb{E}\left[\sum_{i=1}^{\infty}\prod_{j=1}^{i}(1-Q_{1})\mathrel{}\middle\|\mathrel{}\mathcal{G}_{0}\right]$
	$\displaystyle=\mathbb{E}\left[Q_{1}^{-1}-1\mathrel{}\middle\|\mathrel{}\mathcal{G}_{0}\right]$
	$\displaystyle=\mathbb{E}\left[Q_{1}^{-1}\mathrel{}\middle\|\mathrel{}\mathcal{G}_{0}\right]-1$	(102)

Next, assume that the statement holds for all $(Q_{i},\,\mathcal{G}_{i})_{i=1}^{k}$ , $\mathcal{G}_{0}\subset\mathcal{G}_{1}$ and $k\leq k_{0}$ . Then, we almost surely have

$\displaystyle\overline{N}_{k_{0}+1}((Q_{i},\mathcal{G}_{i})_{i=1}^{\infty},\,\mathcal{G}_{0})$	$\displaystyle=\mathbb{E}\left[(1-Q_{1})\mathbb{E}\left[1+\sum_{i=2}^{\infty}\prod_{j=2}^{i}(1-Q_{j})\mathrel{}\middle\|\mathrel{}\mathcal{G}_{1}\right]\mathrel{}\middle\|\mathrel{}\mathcal{G}_{0}\right]$
	$\displaystyle=\mathbb{E}\left[(1-Q_{1})(1+\overline{N}_{k_{0}}((Q_{i},\mathcal{G}_{i})_{i=2}^{\infty},\,\mathcal{G}_{1}))\mathrel{}\middle\|\mathrel{}\mathcal{G}_{0}\right]$
	$\displaystyle\leq\mathbb{E}\left[(1-Q_{1})\mathbb{E}[Q_{2}^{-1}\mathrel{}\middle\|\mathrel{}\mathcal{G}_{1}]\mathrel{}\middle\|\mathrel{}\mathcal{G}_{0}\right]\quad\left(\mbox{assumption of the induction}\right)$
	$\displaystyle\leq\mathbb{E}\left[Q_{1}^{-1}\mathrel{}\middle\|\mathrel{}\mathcal{G}_{0}\right]-1\quad\left(\mbox{$Q_{i}^{-1}$ is a supermartingale.}\right)$	(103)

We obtain the lemma from

\displaystyle\mathbb{E}\left[\sum_{i=1}^{k}\prod_{j=1}^{i}(1-Q_{j})\mathrel{}\middle|\mathrel{}\mathcal{G}_{0}\right]={N}_{k}((Q_{i},\mathcal{G}_{i})_{i=1}^{\infty},\,\mathcal{G}_{0})\leq\overline{N}_{k}((Q_{i},\mathcal{G}_{i})_{i=1}^{\infty},\,\mathcal{G}_{0})\quad\mbox{a.s.}

(104)

∎

Proof of Theorem 16.

The statement is obvious for the case $\tau=T+1$ and we consider the other case in the following.

Let $\tau_{i}$ be the time of the $i$ -th occurrence of $\mathcal{A}_{t}$ . More formally, we define $\tau_{i}$ as the stopping time $\tau_{1}=\tau$ and

\displaystyle\tau_{i+1}=\begin{cases}\min\left\{t\in[T]:\sum_{t^{\prime}=1}^{T}\mathbbm{1}{\left[\mathcal{A}_{t^{\prime}}\right]}=i+1\right\}&\sum_{t^{\prime}=1}^{T}\mathbbm{1}{\left[\mathcal{A}_{t^{\prime}}\right]}\geq i+1,\\ \tau_{i}+1&\mbox{otherwise.}\end{cases}

(105)

Then $(P^{\prime}_{i})=(P_{\tau_{i}})$ is a stochastic process measurable by the filtration induced by $(\mathcal{F}_{i}^{\prime})=(\mathcal{F}_{\tau_{i}})$ . By Lemma 17 we obtain

$\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\mathcal{A}_{t}\right]}\mathrel{}\middle\|\mathrel{}\mathcal{F}_{\tau}\right]$	$\displaystyle=\mathbb{E}\left[\sum_{n=1}^{T}\mathbbm{1}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\mathcal{A}_{t}\right]}\geq n\mathrel{}\middle\|\mathrel{}\mathcal{F}_{\tau}\right]}\right]$
	$\displaystyle\leq 1+\mathbb{E}\left[\sum_{n=2}^{T}\mathbbm{1}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\mathcal{A}_{t}\right]}\geq n\mathrel{}\middle\|\mathrel{}\mathcal{F}_{\tau}\right]}\right]$
	$\displaystyle\leq 1+\mathbb{E}\left[\sum_{i=1}^{T}\prod_{j=1}^{i}(1-P^{\prime}_{j})\mathrel{}\middle\|\mathrel{}\mathcal{F}_{1}^{\prime}\right]$
	$\displaystyle\leq 1+\mathbb{E}\left[(P_{1}^{\prime})^{-1}\|\mathcal{F}_{1}^{\prime}\right]-1$
	$\displaystyle=P_{\tau}^{-1}\,.$	(106)

∎

Appendix F Regret Analysis of TSPM Algorithm

In this appendix, we give the proof of Theorem 3. Note that the cells are defined for the decomposition of $\mathbb{R}^{M}$ , not $\mathcal{P}_{M}$ . In other words, the cell $\mathcal{C}_{i}$ is here defined as $\mathcal{C}_{i}={\left\{p\in\mathbb{R}^{M}:\text{action $i$ is optimal}\right\}}$ . For the linear setting, the empirical feedback distribution $q_{i}^{(t)}$ and $q_{i,n}$ are defined as

	$\displaystyle q_{i}^{(t)}$	$\displaystyle\coloneqq\frac{1}{N_{i}(t)}\sum_{s\in[t-1]:i(s)=i}y(s)\,,$		(107)
	$\displaystyle q_{i,n}$	$\displaystyle\coloneqq\text{the value of $q_{i}^{(t)}$ after taking action $i$ for $n$ times.}$		(108)

Recall that $\hat{p}_{t}=B_{t}^{-1}b_{t}$ , which is the mode of $\bar{g}_{t}(p)$ .

F.1 Regret Decomposition

Here, we break the regret into several terms. For any $i\in[N]$ , we define events

	$\displaystyle\mathcal{A}_{i}(t)$	$\displaystyle\coloneqq{\left\{\\|S_{i}\hat{p}_{t}-S_{i}p^{*}\\|\leq\frac{\epsilon}{4}\right\}}\,,$		(109)
	$\displaystyle\tilde{\mathcal{A}}_{i}(t)$	$\displaystyle\coloneqq{\left\{\\|S_{i}\tilde{p}_{t}-S_{i}p^{*}\\|\leq\epsilon\right\}}\,.$		(110)

We first decompose the regret as

$\displaystyle\mathrm{Reg}(T)$	$\displaystyle=\sum_{t=1}^{T}\Delta_{i(t)}$
	$\displaystyle\leq\sum_{t=1}^{T}\Big{(}\Delta_{i(t)}\mathbbm{1}{\left[\tilde{\mathcal{A}}_{1}(t)\right]}+\max_{j\in[N]}\Delta_{j}\mathbbm{1}{\left[\tilde{\mathcal{A}}_{1}^{c}(t)\right]}\Big{)}$
	$\displaystyle=\sum_{i\neq 1}\sum_{t=1}^{T}\Delta_{i}\mathbbm{1}{\left[i(t)=i,\,\tilde{\mathcal{A}}_{1}(t)\right]}+\max_{j\in[N]}\Delta_{j}\sum_{t=1}^{T}\mathbbm{1}{\left[\tilde{\mathcal{A}}_{1}^{c}(t)\right]}$
	$\displaystyle\leq\sum_{i\neq 1}\Delta_{i}\sum_{t=1}^{T}\bigg{(}\underbrace{\mathbbm{1}{\left[i(t)=i,\,\tilde{\mathcal{A}}_{1}(t),\,\mathcal{A}_{i}(t)\right]}}_{\text{(A)}}+\underbrace{\mathbbm{1}{\left[i(t)=i,\,\mathcal{A}_{i}^{c}(t)\right]}}_{\text{(B)}}\bigg{)}+\max_{j\in[N]}\Delta_{j}\sum_{t=1}^{T}\mathbbm{1}{\left[\tilde{\mathcal{A}}_{1}^{c}(t)\right]}\,.$	(111)

To decompose the last term, we define the following notation. We define for any $i\in[N]$

\displaystyle P_{i}(t)\coloneqq\mathbb{P}{\left\{\tilde{p}_{t}\in\mathcal{C}_{i}\mathrel{}\middle|\mathrel{}\mathcal{F}_{t}\right\}}\,.

(112)

We also define

\displaystyle\mathcal{C}_{i,t}\coloneqq\mathcal{C}_{i}\cap B_{\epsilon^{\prime}}(\hat{p}_{t})\,,

(113)

where $\epsilon^{\prime}$ is defined in (85), and

\displaystyle\bar{i}_{t}\coloneqq\operatorname*{arg\,max}_{i\in[N]}\mathbb{P}{\left\{\tilde{p}_{t}\in\mathcal{C}_{i,t}\mathrel{}\middle|\mathrel{}\mathcal{F}_{t}\right\}}\,.

(114)

We define $\bar{p}_{t}$ as an arbitrary point in $\mathcal{C}_{\bar{i}_{t},t}$ . Then, we define

\displaystyle\bar{\mathcal{A}}_{i}(t)\coloneqq{\left\{\left\|S_{i}\bar{p}_{t}-S_{i}p^{*}\right\|\leq\frac{\epsilon}{8}\right\}}\,.

(115)

Using these notations, the last term in (F.1) can be decomposed as

$\displaystyle\mathbbm{1}{\left[\tilde{\mathcal{A}}_{1}^{c}(t)\right]}$	$\displaystyle\leq\sum_{k=1}^{N}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{k},\,\tilde{\mathcal{A}}_{1}^{c}(t)\right]}$
	$\displaystyle=\sum_{k=1}^{N}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{k},\,\bar{\mathcal{A}}_{k}^{c}(t),\,\tilde{\mathcal{A}}_{1}^{c}(t)\right]}+\sum_{k=1}^{N}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{k},\,\bar{\mathcal{A}}_{k}(t),\,\tilde{\mathcal{A}}_{1}^{c}(t)\right]}$
	$\displaystyle\leq\underbrace{\sum_{k=1}^{N}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{k},\,\bar{\mathcal{A}}_{k}^{c}(t)\right]}}_{\text{(C)}}+\underbrace{\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{1},\,\bar{\mathcal{A}}_{1}(t),\,\tilde{\mathcal{A}}_{1}^{c}(t)\right]}}_{\text{(D)}}+\underbrace{\sum_{k=2}^{N}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{k},\,\bar{\mathcal{A}}_{k}(t)\right]}}_{\text{(E)}}\,.$	(116)

We will bound the expectation of each term in the following and complete the proof of Theorem 3 as

$\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\mathrm{Reg}(T)\right]}$	$\displaystyle=\sum_{i\neq 1}\Delta_{i}\left(\operatorname{O}\left(\frac{1}{\epsilon^{2}}\log T\right)+\operatorname{O}\left(\frac{N}{\epsilon^{2}}\log T\right)\right)$
	$\displaystyle\quad+\max_{j\in[N]}\Delta_{j}\left(\sum_{k=1}^{N}\operatorname{O}\left(\frac{NM}{\epsilon^{2}}\log T\right)+\operatorname{O}(1)+\sum_{k=2}^{N}\operatorname{O}(1)\right)$
	$\displaystyle=\operatorname{O}\left(\max\left\{\frac{N\sum_{i\in[N]}\Delta_{i}}{\epsilon^{2}},\frac{N^{2}M\max_{i\in[N]}\Delta_{i}}{\epsilon^{2}}\right\}\log T\right)$
	$\displaystyle=\operatorname{O}\left(\frac{AN^{2}M\max_{i\in[N]}\Delta_{i}}{\Lambda^{2}}\log T\right)\,,$	(117)

where the last transformation follows from the definition of $\epsilon$ in (26).

F.2 Analysis for Case (A)

Lemma 18.

For any $i\neq 1$ ,

\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[i(t)=i,\,\tilde{\mathcal{A}}_{1}(t),\,\mathcal{A}_{i}(t)\right]}\right]}\leq\frac{64}{9\epsilon^{2}}\log T+2^{M/2}\,.

(118)

To prove Lemma 18, we prove the following lemma using Corollary 8 and Lemma 9.

Lemma 19.

For any $0\leq\xi<1/2$ ,

\displaystyle\mathbb{P}{\left\{\tilde{p}_{t}\in\mathcal{V}_{i}\mathrel{}\middle|\mathrel{}\mathcal{A}_{i}(t),\,N_{i}(t)>n_{i}\right\}}

\displaystyle\leq\exp\Big{(}-\frac{9}{16}\xi\epsilon^{2}n_{i}\Big{)}(1-2\xi)^{-M/2}\,,

(119)

where $\mathcal{V}_{i}\coloneqq{\left\{p\in\mathcal{C}_{i}:\|S_{1}p-S_{1}p^{*}\|\leq\epsilon\right\}}$ .

Proof.

Since $\tilde{p}_{t}\sim\mathcal{N}(\hat{p}_{t},\,B_{t}^{-1})$ for $\hat{p}_{t}=B_{t}^{-1}b_{t}$ , the squared Mahalanobis distance $\|B_{t}^{1/2}(\tilde{p}_{t}-\hat{p}_{t})\|^{2}$ follows the chi-squared distribution with $M$ degree of freedom. Therefore, we have

\displaystyle\mathbb{P}{\left\{\tilde{p}_{t}\in\mathcal{V}_{i}\mathrel{}\middle|\mathrel{}\mathcal{A}_{i}(t),\,N_{i}(t)>n_{i}\right\}}\leq h\bigg{(}\inf_{p\in\mathcal{V}_{i}}\|B_{t}^{1/2}(p-\hat{p}_{t})\|^{2}\bigg{)}\,,

(120)

where $h(a)=\mathbb{P}_{X\sim\chi^{2}_{M}}\left\{X\geq a\right\}$ . To use Lemma 9, we check the condition of Lemma 9 is indeed satisfied. First, it is obvious that the assumptions $N_{i}(t)\geq n_{i}$ and $\|S_{i}\hat{p}_{t}-S_{i}p^{*}\|<\epsilon/4$ are satisfied. Besides, $p\in\mathcal{V}_{i}$ implies $p\in\mathcal{T}_{i}=\{p\in\mathbb{R}^{M}:\|S_{i}p-S_{i}p^{*}\|\geq\epsilon\}$ from Corollary 8. Thus, applying Lemma 9 concludes the proof. ∎

Proof of Lemma 18.

For any $n_{i}>0$ ,

	$\displaystyle\sum_{t=1}^{T}\mathbbm{1}{\left[i(t)=i,\,\tilde{\mathcal{A}}_{1}(t),\,\mathcal{A}_{i}(t)\right]}$
	$\displaystyle=\sum_{t=1}^{T}\mathbbm{1}{\left[i(t)=i,\,\tilde{\mathcal{A}}_{1}(t),\,\mathcal{A}_{i}(t),\,N_{i}(t)\leq n_{i}\right]}+\sum_{t=1}^{T}\mathbbm{1}{\left[i(t)=i,\,\tilde{\mathcal{A}}_{1}(t),\,\mathcal{A}_{i}(t),\,N_{i}(t)>n_{i}\right]}$
	$\displaystyle\leq n_{i}+\sum_{t=1}^{T}\mathbbm{1}{\left[i(t)=i,\,\tilde{\mathcal{A}}_{1}(t),\,\mathcal{A}_{i}(t),\,N_{i}(t)>n_{i}\right]}\,.$		(121)

The second term is bounded from above as

	$\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[i(t)=i,\,\tilde{\mathcal{A}}_{1}(t),\,\mathcal{A}_{i}(t),\,N_{i}(t)>n_{i}\right]}\right]}$
	$\displaystyle=\sum_{t=1}^{T}\mathbb{P}{\left\{i(t)=i,\,\tilde{\mathcal{A}}_{1}(t),\,\mathcal{A}_{i}(t),\,N_{i}(t)>n_{i}\right\}}$
	$\displaystyle\leq\sum_{t=1}^{T}\mathbb{P}{\left\{i(t)=i,\,\tilde{\mathcal{A}}_{1}(t)\mathrel{}\middle\|\mathrel{}\mathcal{A}_{i}(t),\,N_{i}(t)>n_{i}\right\}}$
	$\displaystyle=\sum_{t=1}^{T}\mathbb{P}{\left\{i(t)=i,\,\tilde{\mathcal{A}}_{1}(t),\,\tilde{p}_{t}\in\mathcal{C}_{i}\mathrel{}\middle\|\mathrel{}\mathcal{A}_{i}(t),\,N_{i}(t)>n_{i}\right\}}\quad\left({i(t)=i\text{ implies }\tilde{p}_{t}\in\mathcal{C}_{i}}\right)$
	$\displaystyle\leq\sum_{t=1}^{T}\mathbb{P}{\left\{\tilde{p}_{t}\in\mathcal{V}_{i}\mathrel{}\middle\|\mathrel{}\mathcal{A}_{i}(t),\,N_{i}(t)>n_{i}\right\}}\,.$		(122)

To obtain an upper bound for $\mathbb{P}{\left\{\tilde{p}_{t}\in\mathcal{V}_{i}\mathrel{}\middle|\mathrel{}\mathcal{A}_{i}(t),\,N_{i}(t)>n_{i}\right\}}$ , we use Lemma 19. By taking $n_{i}=\frac{16}{9}\frac{1}{\xi\epsilon^{2}}\log T$ with $\xi=1/4$ , we have

$\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[i(t)=i,\,\tilde{\mathcal{A}}_{1}(t),\,\mathcal{A}_{i}(t)\right]}\right]}$	$\displaystyle\leq n_{i}+\sum_{t=1}^{T}\mathbb{P}{\left\{\tilde{p}_{t}\in\mathcal{V}_{i}\mathrel{}\middle\|\mathrel{}\mathcal{A}_{i}(t),\,N_{i}(t)>n_{i}\right\}}$
	$\displaystyle\leq n_{i}+\sum_{t=1}^{T}\exp\Big{(}-\frac{9}{16}\xi\epsilon^{2}n_{i}\Big{)}(1-2\xi)^{-M/2}\quad\left(\text{by {Lemma~{}\ref{lem:caseA_2}}}\right)$
	$\displaystyle=\frac{16}{9}\frac{1}{\xi\epsilon^{2}}\log T+(1-2\xi)^{-M/2}$
	$\displaystyle=\frac{64}{9\epsilon^{2}}\log T+2^{M/2}\,,$	(123)

which completes the proof. ∎

F.3 Analysis for Case (B)

Lemma 20.

For any $i\neq 1$ ,

\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[i(t)=i,\,\mathcal{A}_{i}^{c}(t)\right]}\right]}\leq\frac{256N\left(\log T+\frac{A}{2}\log 2+1\right)}{\epsilon^{2}}+\frac{16A^{2}}{\epsilon^{2}}

(124)

The regret in this case can intuitively be bounded because as the round proceeds the event $i(t)=i$ makes $S_{i}\hat{p}_{t}$ close to $S_{i}p^{*}$ , which implies that the expected number of times the event $\mathcal{A}_{i}^{c}(t)$ occurs is not large.

Before going to the analysis of Lemma 20, we prove useful inequalities between $\|q_{i}^{(t)}-S_{i}p^{*}\|$ , $\|q_{i}^{(t)}-S_{i}\hat{p}_{t}\|$ , and $\|S_{i}\hat{p}_{t}-S_{i}p^{*}\|$ .

Lemma 21.

Assume $N_{i}(t)>0$ . Then,

\displaystyle\|q_{i}^{(t)}-S_{i}\hat{p}_{t}\|^{2}\leq\frac{Z_{\backslash i}}{N_{i}(t)}+\|q_{i}^{(t)}-S_{i}p^{*}\|^{2}\,.

(125)

Proof.

Recall that $\hat{p}_{t}$ is the maximizer of $\bar{g}_{t}(p)$ , and we have

\displaystyle\hat{p}_{t}=\operatorname*{arg\,max}_{p\in\mathbb{R}^{M}}\bar{g}_{t}(p)=\operatorname*{arg\,max}_{p\in\mathbb{R}^{M}}\prod_{i=1}^{N}\exp\Big{\{}-\frac{1}{2}N_{i}(t)\|q_{i}^{(t)}-S_{i}p\|^{2}\Big{\}}=\operatorname*{arg\,min}_{p\in\mathbb{R}^{M}}\sum_{i=1}^{N}N_{i}(t)\|q_{i}^{(t)}-S_{i}p\|^{2}\,.

(126)

Using this and the definition of $Z_{\backslash i}$ , we have

$\displaystyle N_{i}(t)\\|q_{i}^{(t)}-S_{i}\hat{p}_{t}\\|^{2}$	$\displaystyle\leq\sum_{k\in[N]}N_{k}(t)\\|q_{k}^{(t)}-S_{k}\hat{p}_{t}\\|^{2}$
	$\displaystyle\leq\sum_{k\in[N]}N_{k}(t)\\|q_{k}^{(t)}-S_{k}p^{*}\\|^{2}$
	$\displaystyle\leq Z_{\backslash i}+N_{i}(t)\\|q_{i}^{(t)}-S_{i}p^{*}\\|^{2}\,.$	(127)

Dividing by $N_{i}(t)$ on the both sides completes the proof. ∎

Lemma 22.

Assume that $\mathcal{A}_{i}^{c}(t)$ and $N_{i}(t)>0$ hold. Then,

\displaystyle\|q_{i}^{(t)}-S_{i}p^{*}\|>\frac{1}{2}\bigg{(}\frac{\epsilon}{4}-\sqrt{\frac{Z_{\backslash i}}{N_{i}(t)}}\bigg{)}\,.

(128)

Proof.

By the triangle inequality,

$\displaystyle\\|q_{i}^{(t)}-S_{i}p^{*}\\|$	$\displaystyle\geq\\|S_{i}\hat{p}_{t}-S_{i}p^{*}\\|-\\|q_{i}^{(t)}-S_{i}\hat{p}_{t}\\|$
	$\displaystyle>\frac{\epsilon}{4}-\sqrt{\frac{Z_{\backslash i}}{N_{i}(t)}+\\|q_{i}^{(t)}-S_{i}p^{*}\\|^{2}}\quad\left(\text{by {$\mathcal{A}_{i}^{c}(t)$ and Lemma~{}\ref{lem:useful_1}}}\right)$
	$\displaystyle\geq\frac{\epsilon}{4}-\sqrt{\frac{Z_{\backslash i}}{N_{i}(t)}}-\\|q_{i}^{(t)}-S_{i}p^{*}\\|\quad\left(\text{by {$\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ for $x,y\geq 0$}}\right)\,,$	(129)

which is equivalent to (128). ∎

Proof of Lemma 20.

We first bound the expectation conditioned on $Z_{\backslash i}$ , and then take the expectation for $Z_{\backslash i}$ . Now,

	$\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[i(t)=i,\,\mathcal{A}_{i}^{c}(t)\right]}\mathrel{}\middle\|\mathrel{}Z_{\backslash i}\right]}$
	$\displaystyle=\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[i(t)=i,\,\mathcal{A}_{i}^{c}(t),\,N_{i}(t)\leq\frac{64Z_{\backslash i}}{\epsilon^{2}}\right]}\mathrel{}\middle\|\mathrel{}Z_{\backslash i}\right]}$
	$\displaystyle\quad+\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[i(t)=i,\,\mathcal{A}_{i}^{c}(t),\,N_{i}(t)>\frac{64Z_{\backslash i}}{\epsilon^{2}}\right]}\mathrel{}\middle\|\mathrel{}Z_{\backslash i}\right]}$
	$\displaystyle\leq\frac{64Z_{\backslash i}}{\epsilon^{2}}+\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[i(t)=i,\,\mathcal{A}_{i}^{c}(t),\,N_{i}(t)>\frac{64Z_{\backslash i}}{\epsilon^{2}}\right]}\mathrel{}\middle\|\mathrel{}Z_{\backslash i}\right]}\quad\left(\text{{$i(t)=i$ for all $t\in[T]$}}\right)\,.$		(130)

The first term becomes ${256N\left(\log T+\frac{A}{2}\log 2+1\right)}/{\epsilon^{2}}$ by taking expectation over $Z_{\backslash i}$ using Lemma 10. Then, we bound the second term. From Lemma 22, $\mathcal{A}_{i}^{c}(t)$ and $N_{i}(t)>\frac{64Z_{\backslash i}}{\epsilon^{2}}$ imply $\|q_{i}^{(t)}-S_{i}p^{*}\|>\epsilon/16$ . Therefore,

	$\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[i(t)=i,\,\mathcal{A}_{i}^{c}(t),\,N_{i}(t)>\frac{64Z_{\backslash i}}{\epsilon^{2}}\right]}\mathrel{}\middle\|\mathrel{}Z_{\backslash i}\right]}$
	$\displaystyle\leq\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[i(t)=i,\,\\|q_{i}^{(t)}-S_{i}p^{*}\\|>\frac{\epsilon}{16}\right]}\right]}$
	$\displaystyle\leq\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[i(t)=i,\,\bigcup_{y\in[A]}\|(q_{i}^{(t)})_{y}-(S_{i})_{y}p^{*}\|>\frac{\epsilon}{16\sqrt{A}}\right]}\right]}$
	$\displaystyle\leq\operatornamewithlimits{\mathbb{E}}{\left[\sum_{y=1}^{A}\sum_{t=1}^{T}\mathbbm{1}{\left[i(t)=i,\,\|(q_{i}^{(t)})_{y}-(S_{i})_{y}p^{*}\|>\frac{\epsilon}{16\sqrt{A}}\right]}\right]}$
	$\displaystyle\leq\operatornamewithlimits{\mathbb{E}}{\left[\sum_{y=1}^{A}\sum_{n_{i}=1}^{T}\sum_{t=1}^{T}\mathbbm{1}{\left[i(t)=i,\,N_{i}(t)=n_{i},\,\lvert(q_{i}^{(t)})_{y}-(S_{i})_{y}p^{*}\rvert>\frac{\epsilon}{16\sqrt{A}}\right]}\right]}$
	$\displaystyle=\operatornamewithlimits{\mathbb{E}}{\left[\sum_{y=1}^{A}\sum_{n_{i}=1}^{T}\mathbbm{1}{\left[\bigcup_{t=1}^{T}{\left\{i(t)=i,\,N_{i}(t)=n_{i},\,\lvert(q_{i}^{(t)})_{y}-(S_{i})_{y}p^{*}\rvert>\frac{\epsilon}{16\sqrt{A}}\right\}}\right]}\right]}$
	$\displaystyle\quad\quad\left(\text{{The event ${\left\{i(t)=i,\,N_{i}(t)=n_{i}\right\}}$ occurs at most once for fixed $n_{i}$.}}\right)$
	$\displaystyle\leq\sum_{y=1}^{A}\sum_{n_{i}=1}^{T}\mathbb{P}{\left\{\lvert(q_{i,n_{i}})_{y}-(S_{i})_{y}p^{*}\rvert>\frac{\epsilon}{4\sqrt{A}}\right\}}$
	$\displaystyle\leq\sum_{y=1}^{A}\sum_{n_{i}=1}^{T}2\exp\bigg{(}-2n_{i}\left(\frac{\epsilon}{4\sqrt{A}}\right)^{2}\bigg{)}\quad\left(\text{by {Hoeffding's ineq.}}\right)$
	$\displaystyle\leq 2A\sum_{n_{i}=1}^{\infty}\exp\Big{(}-\frac{n_{i}\epsilon^{2}}{8A}\Big{)}$
	$\displaystyle=2A\frac{1}{\exp\left(\frac{\epsilon^{2}}{8A}\right)-1}$
	$\displaystyle\leq 2A\frac{1}{\frac{\epsilon^{2}}{8A}}\quad\left(\text{by }{\mathrm{e}^{x}\geq 1+x}\right)$
	$\displaystyle=\frac{16A^{2}}{\epsilon^{2}}\,.$		(131)

By summing up the above argument, the proof is completed. ∎

F.4 Analysis for Case (C)

Before going to the analysis of cases (C), (D), and (E), we recall some notations. Recall that

\displaystyle P_{i}(t)=\mathbb{P}{\left\{\tilde{p}_{t}\in\mathcal{C}_{i}\mathrel{}\middle|\mathrel{}\mathcal{F}_{t}\right\}}\,,

(132)

$\mathcal{C}_{i,t}=\mathcal{C}_{i}\cap B_{\epsilon^{\prime}}(\hat{p}_{t})$ , $\bar{i}_{t}=\operatorname*{arg\,max}_{i\in[N]}\mathbb{P}{\left\{\tilde{p}_{t}\in\mathcal{C}_{i,t}|\mathcal{F}_{t}\right\}}$ , and $\bar{p}_{t}$ is an arbitrary point in $\mathcal{C}_{\bar{i}_{t},t}$ . Also recall that

\displaystyle\bar{\mathcal{A}}_{i}(t)={\left\{\left\|S_{i}\bar{p}_{t}-S_{i}p^{*}\right\|\leq\frac{\epsilon}{8}\right\}}\,.

(133)

Lemma 23.

For any $i\in[N]$ ,

\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{i},\,\bar{\mathcal{A}}_{i}^{c}(t)\right]}\right]}\leq\frac{N}{p_{0}}\left(\frac{2^{5}M\log T}{\epsilon^{2}}+\mathrm{e}^{\lambda\|p^{*}\|^{2}/2}\left(\frac{1}{\lambda T}+\frac{L}{M\lambda}\right)^{M/2}\frac{1}{1-\mathrm{e}^{-\epsilon^{2}/2^{5}}}\right)\,.

(134)

Before proving the above lemma, we give two lemmas.

Lemma 24.

\displaystyle\mathbb{P}{\left\{\tilde{p}_{t}\in\mathcal{C}_{\bar{i}_{t}}\mathrel{}\middle|\mathrel{}\mathcal{F}_{t}\right\}}\geq\mathbb{P}{\left\{\tilde{p}_{t}\in\mathcal{C}_{\bar{i}_{t},t}\mathrel{}\middle|\mathrel{}\mathcal{F}_{t}\right\}}\geq p_{0}/N\,,

(135)

where $p_{0}\coloneqq 1-h((\lambda\epsilon^{\prime})^{2})$ .

Proof.

First, we prove

\displaystyle\mathbb{P}{\left\{\tilde{p}_{t}\in\bigcup_{i\in[N]}\mathcal{C}_{i,t}\mathrel{}\middle|\mathrel{}\mathcal{F}_{t}\right\}}\geq 1-h((\lambda\epsilon^{\prime})^{2})\,.

(136)

This follows from

$\displaystyle\mathbb{P}{\left\{\tilde{p}_{t}\not\in\bigcup_{i\in[N]}\mathcal{C}_{i,t}\mathrel{}\middle\|\mathrel{}\mathcal{F}_{t}\right\}}$	$\displaystyle=\mathbb{P}{\left\{\tilde{p}_{t}\in B_{\epsilon^{\prime}}(\hat{p}_{t})\mathrel{}\middle\|\mathrel{}\mathcal{F}_{t}\right\}}$
	$\displaystyle\leq h\left(\inf_{p\in{\left\{p:\\|p-\hat{p}_{t}\\|>\epsilon^{\prime}\right\}}}\\|B_{t}^{1/2}(p-\hat{p}_{t})\\|^{2}\right)$
	$\displaystyle\leq h\Big{(}\lambda\\|p-\hat{p}_{t}\\|^{2}\Big{)}$
	$\displaystyle\leq h((\lambda\epsilon^{\prime})^{2})\,.$	(137)

Using the definition of $\bar{i}_{t}$ completes the proof. ∎

Lemma 25.

For any $i\in[N]$ , the event $\bar{\mathcal{A}}_{i}^{c}(t)$ implies $\|S_{i}\hat{p}_{t}-S_{i}p^{*}\|\geq\epsilon/16$ .

Proof.

Using the triangle inequality, we have

$\displaystyle\\|S_{i}\hat{p}_{t}-S_{i}p^{*}\\|$	$\displaystyle\geq\\|S_{i}\bar{p}_{t}-S_{i}p^{*}\\|-\\|S_{i}\bar{p}_{t}-S_{i}\hat{p}_{t}\\|$
	$\displaystyle\geq\epsilon/8-\\|S_{i}\\|\\|\bar{p}_{t}-\hat{p}_{t}\\|$
	$\displaystyle\geq\epsilon/8-\\|S_{i}\\|\frac{\epsilon}{16\max_{i}\\|S_{i}\\|}$
	$\displaystyle\geq\epsilon/8-\epsilon/16=\epsilon/16\,.$	(138)

∎

Proof of Lemma 23.

For any $n_{0}$ , which is specified later, we have

	$\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{i},\,\bar{\mathcal{A}}_{i}^{c}(t)\right]}\right]}$
	$\displaystyle=\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{i},\,\bar{\mathcal{A}}_{i}^{c}(t),\,N_{i}(t)<n_{0}\right]}\right]}+\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{i},\,\bar{\mathcal{A}}_{i}^{c}(t),\,N_{i}(t)\geq n_{0}\right]}\right]}$		(139)

The first term can be bounded by $(p_{0}/N)^{-1}\cdot n_{0}$ from Lemma 24. The rigorous proof can be obtained by the almost same argument as the following analysis of the second term using Theorem 16.

Then, we will bound the second term. Specifically, we will prove that for $n_{0}=\frac{M\log T}{(\epsilon/16)^{2}}$ ,

\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{i},\,\bar{\mathcal{A}}_{i}^{c}(t),\,N_{i}(t)\geq n_{0}\right]}\right]}=\operatorname{O}(1)\,.

(140)

First we have

	$\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{i},\,\bar{\mathcal{A}}_{i}^{c}(t),\,N_{i}(t)\geq n_{0}\right]}\right]}$
		$\displaystyle\leq\sum_{m=n_{0}}^{\infty}\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{i},\,\bar{\mathcal{A}}_{i}^{c}(t),\,N_{i}(t)=m\right]}\right]}\,.$		(141)

Let

\displaystyle\tau=\min{\left\{t:\bar{p}_{t}\in\mathcal{C}_{i},\,\bar{\mathcal{A}}_{i}^{c}(t),\,N_{i}(t)=m\right\}}\wedge(T+1)

(142)

be the first time such that $\bar{p}_{t}\in\mathcal{C}_{i},\,\bar{\mathcal{A}}_{i}^{c}(t)$ and $N_{i}(t)=m$ occur. Letting $\mathcal{A}_{t}\coloneqq{\left\{\bar{p}_{t}\in\mathcal{C}_{i},\,\bar{\mathcal{A}}_{i}^{c}(t),\,N_{i}(t)=m\right\}}$ , $\mathcal{B}_{t}\coloneqq{\left\{i(t)=i\right\}}$ and $P_{t}\coloneqq p_{0}/N$ in Theorem 16, we have

\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{i},\,\bar{\mathcal{A}}_{i}^{c}(t),\,N_{i}(t)=m\right]}\right]}\leq\frac{N}{p_{0}}\mathbb{P}{\left\{\tau\leq T\right\}}\,.

(143)

Here $\tau\leq T$ implies that

$\displaystyle\\|\hat{p}_{\tau}-p^{*}\\|_{B_{\tau}}$	$\displaystyle=(\hat{p}_{\tau}-p^{})^{\top}\left(\lambda I+\sum_{j\in[N]}N_{j}(\tau)S_{j}^{\top}S_{j}\right)(\hat{p}_{\tau}-p^{})$
	$\displaystyle\geq m(\hat{p}_{\tau}-p^{})^{\top}\left(S_{i}^{\top}S_{i}\right)(\hat{p}_{\tau}-p^{})$
	$\displaystyle=m\left\\|S_{i}(\hat{p}_{\tau}-p^{*})\right\\|^{2}\geq m(\epsilon/16)^{2}\,,$	(144)

where the last inequality follows from Lemma 25. Therefore we have

	$\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\exp(\\|\hat{p}_{\tau}-p^{*}\\|_{B_{\tau}}^{2}/2)\right]}$	$\displaystyle\geq\operatornamewithlimits{\mathbb{E}}{\left[\mathbbm{1}{\left[\tau\leq T\right]}\exp(\\|\hat{p}_{\tau}-p^{*}\\|_{B_{\tau}}^{2}/2)\right]}$
		$\displaystyle\geq\exp(m(\epsilon/16)^{2}/2)\mathbb{P}{\left\{\tau\leq T\right\}}\,.$		(145)

Note that $|B_{\tau}|\leq|B_{T}|\leq(1+TL/M)^{M}$ for $L=\max_{i}\sqrt{\mathrm{trace}(S_{i}^{\top}S_{i})}=\max_{i}\left\|S_{i}\right\|_{\mathrm{F}}$ by Lemma 10 of Abbasi-Yadkori et al. (2011), where $\left\|\cdot\right\|_{\mathrm{F}}$ is the Frobenius norm. Therefore we have

$\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\exp(\\|\hat{p}_{\tau}-p^{*}\\|_{B_{\tau}}/2)\right]}$	$\displaystyle\leq\operatornamewithlimits{\mathbb{E}}{\left[\sqrt{\|B_{\tau}\|}\cdot\frac{\exp(\\|\hat{p}_{\tau}-p^{*}\\|_{B_{\tau}}^{2}/2)}{\sqrt{\|B_{\tau}\|}}\right]}$
	$\displaystyle\leq(1+TL/M)^{M/2}\operatornamewithlimits{\mathbb{E}}{\left[\frac{\exp(\\|\hat{p}_{\tau}-p^{*}\\|_{B_{\tau}}^{2}/2)}{\sqrt{\|B_{\tau}\|}}\right]}$
	$\displaystyle\leq(1+TL/M)^{M/2}\operatornamewithlimits{\mathbb{E}}{\left[\frac{\exp(\\|\hat{p}_{0}-p^{*}\\|_{B_{0}}^{2}/2)}{\sqrt{\|B_{0}\|}}\right]}$	(146)
	$\displaystyle=\left(1+\frac{TL}{M\lambda}\right)^{M/2}\mathrm{e}^{\lambda\\|p^{*}\\|^{2}/2}\,,$	(147)

where (146) holds since $\frac{\exp(\|\hat{p}_{t}-p^{*}\|_{B_{t}}^{2}/2)}{\sqrt{|B_{t}|}}$ is a supermartingale from Lemma 12. Combining (143), (145), and (147), we obtain

	$\displaystyle\sum_{m=n_{0}}^{\infty}\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{i},\,\bar{\mathcal{A}}_{i}^{c}(t),\,N_{i}(t)=m\right]}\right]}$	$\displaystyle\leq\frac{N}{p_{0}}\left(\frac{1}{\lambda}+\frac{TL}{M\lambda}\right)^{M/2}\mathrm{e}^{\lambda\\|p^{*}\\|^{2}/2}\sum_{m=n_{0}}^{\infty}\mathrm{e}^{-m(\epsilon/16)^{2}/2}$
		$\displaystyle\leq\frac{N}{p_{0}}\left(\frac{1}{\lambda}+\frac{TL}{M\lambda}\right)^{M/2}\mathrm{e}^{\lambda\\|p^{*}\\|^{2}/2}\frac{\mathrm{e}^{-n_{0}\epsilon^{2}/2}}{1-\mathrm{e}^{-(\epsilon/16)^{2}/2}}\,.$		(148)

By choosing $n_{0}=\frac{M\log T}{(\epsilon/16)^{2}}$ we obtain the lemma. ∎

F.5 Analysis for Case (D)

Lemma 26.

For any $i\in[N]$ ,

\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{i},\,\bar{\mathcal{A}}_{i}(t),\,\tilde{\mathcal{A}}_{i}^{c}(t)\right]}\right]}\leq\frac{48}{9}\frac{M+2}{\epsilon^{2}}\frac{N}{p_{0}}\,.

(149)

Remark.

To prove the regret upper bound, it is enough to prove Lemma 26 only for $i=1$ . However, for the sake of generality, we prove the lemma for any $i\in[N]$ .

Before proving Lemma 26, we give two following lemmas.

Lemma 27.

For any $i\in[N]$ , the event $\bar{\mathcal{A}}_{i}(t)$ implies $\mathcal{A}_{i}(t)$ .

Proof.

Using the triangle inequality, we have

	$\displaystyle\\|S_{i}p^{*}-S_{i}\hat{p}_{t}\\|$	$\displaystyle\leq\\|S_{i}p^{*}-S_{i}\bar{p}_{t}\\|+\\|S_{i}\bar{p}_{t}-S_{i}\hat{p}_{t}\\|$
		$\displaystyle\leq\epsilon/8+\\|S_{i}\\|\cdot\frac{\epsilon}{16\max_{i}\\|S_{i}\\|}<\epsilon/4\,,$		(150)

which completes the proof. ∎

Now, Lemma 26 can be intuitively proven because from Lemma 27, $\bar{\mathcal{A}}_{i}(t)$ implies $\mathcal{A}_{i}(t)$ , and the events $\mathcal{A}_{i}(t)$ and $\tilde{\mathcal{A}}_{i}^{c}(t)$ does not simultaneously occur many times.

Let $t=\sigma_{1},\dots,\sigma_{m}$ be the time of the first $m$ times that the event $\{\bar{p}_{t}\in\mathcal{C}_{i},\,\mathcal{A}_{i}(t),\,N_{i}(t)=n_{i}\}$ occurred (not $\{\bar{p}_{t}\in\mathcal{C}_{i},\,\bar{\mathcal{A}}_{i}(t),\,N_{i}(t)=n_{i}\}$ ). In other words, we define

•

$\sigma_{1}$ : the first time that $\bar{p}_{t}\in\mathcal{C}_{i},\,\mathcal{A}_{i}(t)$ and $N_{i}(t)=n_{i}$ occurred
•

$\sigma_{2}$ : the second time that $\bar{p}_{t}\in\mathcal{C}_{i},\,\mathcal{A}_{i}(t)$ and $N_{i}(t)=n_{i}$ occurred
•

… .

Now we prove the following lemma using Lemma 9.

Lemma 28.

For any $0\leq\xi<1/2$ ,

\displaystyle\mathbb{P}{\left\{\tilde{\mathcal{A}}_{i}^{c}(t)\mathrel{}\middle|\mathrel{}\mathcal{A}_{i}(t),\,\sigma_{k}=t\right\}}\leq\exp\left(-\frac{9}{16}\xi\epsilon^{2}n_{i}\right)(1-2\xi)^{-M/2}\,.

(151)

Proof.

Recall that $\mathcal{T}_{i}={\left\{p\in\mathbb{R}^{M}:\|S_{i}p-S_{i}p^{*}\|>\epsilon\right\}}$ . We follow a similar argument as the analysis for Lemma 19. Since $\tilde{p}_{t}\sim\mathcal{N}(B_{t}^{-1}b_{t},B_{t}^{-1})$ , the squared Mahalanobis distance $\|B_{t}^{1/2}(p-\hat{p}_{t})\|^{2}$ follows the chi-squared distribution with $M$ degree of freedom. Hence, for $h(a)=\mathbb{P}_{X\sim\chi^{2}_{M}}\left\{X\geq a\right\}$ , we have

\displaystyle\mathbb{P}{\left\{\tilde{\mathcal{A}}_{i}^{c}(t)\mathrel{}\middle|\mathrel{}\mathcal{A}_{i,n_{i}},\,\sigma_{k}=t\right\}}\leq h\left(\inf_{p\in\mathcal{T}_{i}}\|B_{t}^{1/2}(p-\hat{p}_{t})\|^{2}\right)\,.

(152)

Then, Eq. (151) directly follows from Lemma 9. ∎

Proof of Lemma 26.

From Lemma 27, the event $\bar{\mathcal{A}}_{i}(t)$ implies $\mathcal{A}_{i}(t)$ . Hence, it is enough to derive the upper bound for

\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{i},\,\mathcal{A}_{i}(t),\,\tilde{\mathcal{A}}_{i}^{c}(t)\right]}\right]}

(153)

instead of the bound for

\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{i},\,\bar{\mathcal{A}}_{i}(t),\,\tilde{\mathcal{A}}_{i}^{c}(t)\right]}\right]}\,.

(154)

Using Lemma 28, we can bound the term for case (D) from above as

	$\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{i},\,\mathcal{A}_{i}(t),\,\tilde{\mathcal{A}}_{i}^{c}(t)\right]}\right]}$
	$\displaystyle=\operatornamewithlimits{\mathbb{E}}{\left[\sum_{n_{i}=1}^{T}\sum_{t=1}^{T}\mathbbm{1}{\left[\mathcal{A}_{i}(t),\,\tilde{\mathcal{A}}_{i}^{c}(t),\,N_{i}(t)=n_{i}\right]}\right]}$
	$\displaystyle=\sum_{n_{i}=1}^{T}\sum_{t=1}^{T}\mathbb{P}{\left\{\mathcal{A}_{i}(t),\,\tilde{\mathcal{A}}_{i}^{c}(t),\,N_{i}(t)=n_{i}\right\}}$
	$\displaystyle=\sum_{n_{i}=1}^{T}\sum_{t=1}^{T}\sum_{k=1}^{T}\mathbb{P}{\left\{\mathcal{A}_{i}(t),\,\tilde{\mathcal{A}}_{i}^{c}(t),\,\sigma_{k}=t\right\}}\quad\left(\text{{the event ${\left\{\sigma_{k}=t\right\}}$ is exclusive for fixed $n_{i}$}}\right)$
	$\displaystyle=\sum_{n_{i}=1}^{T}\sum_{t=1}^{T}\sum_{k=1}^{T}\mathbb{P}{\left\{\mathcal{A}_{i}(t),\,\sigma_{k}=t\right\}}\mathbb{P}{\left\{\tilde{\mathcal{A}}_{i}^{c}(t)\mathrel{}\middle\|\mathrel{}\mathcal{A}_{i}(t),\,\sigma_{k}=t\right\}}$
	$\displaystyle\leq\sum_{n_{i}=1}^{T}\sum_{t=1}^{T}\sum_{k=1}^{T}\mathbb{P}{\left\{\mathcal{A}_{i}(t),\,\sigma_{k}=t\right\}}C\mathrm{e}^{-n_{i}\iota}\quad\left(\text{by {Lemma~{}\ref{lem:caseD_empin-sampleout}}}\right)$
	$\displaystyle=\sum_{n_{i}=1}^{T}C\mathrm{e}^{-n_{i}\iota}\sum_{t=1}^{T}\sum_{k=1}^{T}\mathbb{P}{\left\{\mathcal{A}_{i}(t),\,\sigma_{k}=t\right\}}$
	$\displaystyle\leq\sum_{n_{i}=1}^{T}C\mathrm{e}^{-n_{i}\iota}\sum_{t=1}^{T}\sum_{k=1}^{T}\mathbb{P}{\left\{\sigma_{k}=t\right\}}$
	$\displaystyle\leq\sum_{n_{i}=1}^{T}C\mathrm{e}^{-n_{i}\iota}\sum_{k=1}^{T}\mathbb{P}{\left\{\text{$\sigma_{k}$ exists}\right\}}$
	$\displaystyle\leq\sum_{n_{i}=1}^{T}C\mathrm{e}^{-n_{i}\iota}\sum_{k=1}^{T}\left(1-\frac{p_{0}}{N}\right)^{k-1}\quad\left(\text{by {$\tilde{p}_{\sigma_{s}}\not\in\mathcal{C}_{i}$ for $s=1,\dots,k-1$}}\right)$
	$\displaystyle\leq 3C\frac{1}{\mathrm{e}^{\iota}-1}\frac{N}{p_{0}}$
	$\displaystyle\leq\frac{48}{9}\frac{M+2}{\epsilon^{2}}\frac{N}{p_{0}}\,,$		(155)

where $\iota=\frac{9\xi\epsilon^{2}}{16},C=(1-2\xi)^{-\frac{M}{2}}$ , and in the last inequality we select the optimal $\xi$ and use $1+x\leq\mathrm{e}^{x}$ . ∎

F.6 Analysis for Case (E)

Lemma 29.

For any $i\neq 1$ ,

\displaystyle\operatornamewithlimits{\mathbb{E}}{\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{i},\,\bar{\mathcal{A}}_{i}(t)\right]}\right]}\leq\frac{2^{5M/2+7}\Gamma(M/2+1)\mathrm{e}^{\lambda^{2}\left\|p^{*}\right\|^{2}/2}}{\delta^{2}\epsilon^{M+2}\lambda^{M/2+1}}\,,

(156)

where $\epsilon$ is defined in (26) and satisfies $\epsilon\leq\min_{p\in\mathcal{C}_{1}^{c}}\left\|p-p^{*}\right\|/3$ , and

\displaystyle\delta\coloneqq\min_{\hat{p}:(L_{1}-L_{i})^{\top}\hat{p}\geq 0,\,\left\|S_{i}(\hat{p}-p^{*})\right\|\leq\epsilon/8}\left\|S_{1}(\hat{p}-p^{*})\right\|\,.

(157)

We prove Lemma 29 using Lemma 12 and Theorem 16.

Remark.

The upper bound in (156) goes to infinite when we set $\lambda=0$ , that is, a flat prior is used. However, this is not the essential effect of the prior but just comes from the minimum eigenvalue of $B_{1}$ . In fact, we can see from the proof that a similar bound can be obtained for $\lambda=0$ if we run some deterministic initialization until $B_{t}$ becomes positive definite.

Proof.

We evaluate each term in the summation using Theorem 16 with

	$\displaystyle\mathcal{A}_{t}$	$\displaystyle=\{\bar{p}_{t}\in\mathcal{C}_{i},\,\left\\|S_{i}(\bar{p}_{t}-p^{*})\right\\|\leq\epsilon/8,\,N_{1}(t)=n\}\,,$
	$\displaystyle\mathcal{B}_{t}$	$\displaystyle=\{\tilde{p}_{t}\in\mathcal{C}_{1}\}\,.$		(158)

for $n\in[T]$ . Recall that

\displaystyle\bar{g}_{t}(p)=\frac{1}{\sqrt{(2\pi)^{M}|B_{t}^{-1}|}}\exp\left(-\frac{1}{2}\left\|p-\hat{p}_{t}\right\|_{B_{t}}^{2}\right)

(159)

is the probability density function of $\hat{p}_{t}$ given $\mathcal{F}_{t}=\{B_{t},b_{t}\}$ . Using $\tau$ defined in (98), it holds for any $\tau\in[T]$ that

$\displaystyle\mathbb{P}{\left\{\mathcal{B}_{\tau}\|\mathcal{F}_{\tau}\right\}}$	$\displaystyle=\mathbb{P}{\left\{\tilde{p}_{\tau}\in\mathcal{C}_{1}\mathrel{}\middle\|\mathrel{}\mathcal{F}_{\tau}\right\}}$
	$\displaystyle=\int_{p\in\mathcal{C}_{1}}\bar{g}_{\tau}(p)\mathrm{d}p$
	$\displaystyle\geq\int_{p:\left\\|p-p^{*}\right\\|\leq 3\epsilon}\bar{g}_{\tau}(p)\mathrm{d}p$
	$\displaystyle\geq\sup_{p:\left\\|p-p^{*}\right\\|\leq 2\epsilon}\int_{p^{\prime}:\left\\|p^{\prime}-p\right\\|\leq\epsilon}\bar{g}_{\tau}(p^{\prime})\mathrm{d}p^{\prime}$	(160)
	$\displaystyle\geq\sup_{p:\left\\|p-p^{*}\right\\|\leq 2\epsilon}\inf_{p^{\prime}:\left\\|p^{\prime}-p\right\\|\leq\epsilon}\bar{g}_{\tau}(p^{\prime})\mathrm{Vol}(\{p^{\prime\prime}:\left\\|p^{\prime\prime}-p\right\\|\leq\epsilon\})$
	$\displaystyle=\frac{(\sqrt{\pi}\epsilon)^{M}}{\Gamma(M/2+1)}\sup_{p:\left\\|p-p^{*}\right\\|\leq 2\epsilon}\inf_{p^{\prime}:\left\\|p^{\prime}-p\right\\|\leq\epsilon}\bar{g}_{\tau}(p^{\prime})$
	$\displaystyle=\frac{(\epsilon/\sqrt{2})^{M}\sqrt{\|B_{\tau}\|}}{\Gamma(M/2+1)}\exp\left\{-\frac{1}{2}\left(\inf_{p:\left\\|p-p^{*}\right\\|\leq 2\epsilon}\sup_{p^{\prime}:\left\\|p^{\prime}-p\right\\|\leq\epsilon}\left\\|p^{\prime}-\hat{p}_{\tau}\right\\|_{B_{\tau}}^{2}\right)\right\}$
	$\displaystyle\geq\frac{(\epsilon/\sqrt{2})^{M}\sqrt{\|B_{\tau}\|}}{\Gamma(M/2+1)}\exp\left\{-\frac{\left\\|\hat{p}_{\tau}-p^{*}\right\\|_{B_{\tau}}^{2}-\epsilon\delta\sqrt{\lambda n}}{2}\right\}\,,$	(161)

where (160) follows since $\{p:\left\|p-p^{*}\right\|\leq 3\epsilon\}\supset\{p^{\prime}:\left\|p^{\prime}-p_{0}\right\|\leq\epsilon\}$ for any $p_{0}$ such that $\left\|p_{0}-p^{*}\right\|\leq 2\epsilon$ , and the last inequality follows from Theorem 15. To apply Theorem 15, we used Lemma 27.

Now we define a stochastic process corresponds to (161) as

\displaystyle P_{t}

\displaystyle=\frac{(\epsilon/\sqrt{2})^{M}\sqrt{|B_{t}|}}{\Gamma(M/2+1)}\exp\left\{-\frac{\left\|\hat{p}_{t}-p^{*}\right\|_{B_{t}}^{2}-\epsilon\delta\sqrt{\lambda n}}{2}\right\}\,.

(162)

Then, by Lemma 12,

$\displaystyle\mathbb{E}[P_{t+1}^{-1}\|\mathcal{F}_{t}]$	$\displaystyle\leq\frac{\Gamma(M/2+1)}{(\epsilon/\sqrt{2})^{M}}\mathrm{e}^{-\epsilon\delta\sqrt{\lambda n}/2}\mathbb{E}\left[\frac{1}{\sqrt{\|B_{t+1}\|}}\mathbb{E}\left[\exp\left(\frac{\left\\|\hat{p}_{t}-p^{*}\right\\|_{B_{t+1}}^{2}}{2}\right)\mathrel{}\middle\|\mathrel{}\mathcal{F}_{t},\,S_{i(t)}\right]\mathrel{}\middle\|\mathrel{}\mathcal{F}_{t}\right]$
	$\displaystyle\leq\frac{\Gamma(M/2+1)}{(\epsilon/\sqrt{2})^{M}}\mathrm{e}^{-\epsilon\delta\sqrt{\lambda n}/2}\mathbb{E}\left[\frac{1}{\sqrt{\|B_{t}\|}}\exp\left(\frac{\left\\|\hat{p}_{t}-p^{*}\right\\|_{B_{t}}^{2}}{2}\right)\mathrel{}\middle\|\mathrel{}\mathcal{F}_{t}\right]$
	$\displaystyle=P_{t}^{-1}\,,$	(163)

which means that $P_{t}^{-1}$ is a supermartingale. Therefore we can apply Theorem 16 and obtain

$\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\hat{p}_{t}\in\mathcal{C}_{i},\,\left\\|S_{i}(\bar{p}_{t}-p^{*})\right\\|\leq\epsilon/8,\,N_{1}(t)=n\right]}\right]$	$\displaystyle\leq\mathbb{E}\left[\mathbbm{1}{\left[\tau\leq T\right]}P_{\tau}^{-1}\right]$
	$\displaystyle\leq\mathbb{E}\left[P_{\tau}^{-1}\right]$
	$\displaystyle\leq\mathbb{E}\left[P_{1}^{-1}\right]$
	$\displaystyle=\frac{\Gamma(M/2+1)\mathrm{e}^{\lambda^{2}\left\\|p^{*}\right\\|^{2}/2}}{(\epsilon\sqrt{\lambda/2})^{M}}\mathrm{e}^{-\epsilon\delta\sqrt{\lambda n}/2}\,.$	(164)

Finally we have

$\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{i},\,\left\\|S_{i}(\bar{p}_{t}-p^{*})\right\\|\leq\epsilon/8\right]}\right]$
	$\displaystyle=\sum_{n=1}^{T}\mathbb{E}\left[\sum_{t=1}^{T}\mathbbm{1}{\left[\bar{p}_{t}\in\mathcal{C}_{i},\,\left\\|S_{i}(\bar{p}_{t}-p^{*})\right\\|\leq\epsilon/8,\,N_{1}(t)=n\right]}\right]$
	$\displaystyle\leq\frac{\Gamma(M/2+1)\mathrm{e}^{\lambda^{2}\left\\|p^{*}\right\\|^{2}/2}}{(\epsilon\sqrt{\lambda/2})^{M}}\sum_{n=1}^{\infty}\mathrm{e}^{-\epsilon\delta\sqrt{\lambda n}/2}$
	$\displaystyle\leq\frac{\Gamma(M/2+1)\mathrm{e}^{\lambda^{2}\left\\|p^{*}\right\\|^{2}/2}}{(\epsilon\sqrt{\lambda/2})^{M}}\int_{0}^{\infty}\mathrm{e}^{-\epsilon\delta\sqrt{\lambda x}/2}\mathrm{d}x$
	$\displaystyle=\frac{\Gamma(M/2+1)\mathrm{e}^{\lambda^{2}\left\\|p^{*}\right\\|^{2}/2}}{(\epsilon\sqrt{\lambda/2})^{M}}\frac{2}{(\epsilon\delta\sqrt{\lambda}/2)^{2}}\Gamma(2)$
	$\displaystyle=\frac{2^{M/2+3}\Gamma(M/2+1)\mathrm{e}^{\lambda^{2}\left\\|p^{*}\right\\|^{2}/2}}{\delta^{2}\epsilon^{M+2}\lambda^{M/2+1}}\,,$	(165)

which completes the proof. ∎

Appendix G Property of Dynamic Pricing Games

In this appendix, we will see a property of dp-easy games.

Proposition 30.

Consider any dp-easy games with $c>-1$ . Then, any two actions in the game are neighbors.

Remark.

In section 5, we considered dp-easy games with $c>0$ , but this can be relaxed to $c>-1$ to prove Proposition 30.

Proof.

Take any two different actions $j,k\in[N]$ such that $j<k$ . From the definition of the loss matrix in dp-easy games, we have $e_{j}\in\mathcal{C}_{j}$ and $e_{k}\in\mathcal{C}_{k}$ .

First, we will find $\alpha\in[0,1]$ such that

\displaystyle\alpha e_{j}+(1-\alpha)e_{k}\in\mathcal{C}_{j}\cap\mathcal{C}_{k}\,.

(166)

From the definition of the loss matrix, the $i$ -th element of $L(\alpha e_{j}+(1-\alpha)e_{k})\in\mathcal{P}_{M}$ is

\displaystyle\begin{cases}-i&(1\leq i\leq j)\\ \alpha c+(1-\alpha)\cdot(-i)&(j+1\leq i\leq k)\\ c&(k<i\leq N)\\ \end{cases}\,.

(167)

It is easy to see that the indices which give the minimum value in (167) is $j$ or $k$ . Thus, to achieve the condition (166), the following should be satisfied,

\displaystyle-j=\alpha c+(1-\alpha)\cdot(-k)\,,

(168)

which is equivalent to

\displaystyle\alpha=\frac{k-j}{c+k}(\eqqcolon\alpha^{*})\,.

(169)

Note that we have $0\leq\alpha\leq 1$ for any $c>-1$ .

Next, we introduce the following definitions.

$\displaystyle p^{(j,k)}$	$\displaystyle\coloneqq\alpha^{}e_{j}+(1-\alpha^{})e_{k}\in\mathcal{C}_{j}\cap\mathcal{C}_{k}\,,$	(170)
$\displaystyle\mathrm{Ball}_{\epsilon}^{(j,k)}$	$\displaystyle\coloneqq{\left\{p\in\mathcal{P}_{M}:\\|p-p^{(j,k)}\\|\leq\epsilon\right\}}\,,$	(171)
$\displaystyle L^{(x)}$	$\displaystyle\coloneqq L(p^{(j,k)}+x)\in\mathbb{R}^{N}\,.$	(172)

To prove the proposition, it is enough to prove the following: there exists $\epsilon>0$ , $\mathrm{Ball}_{\epsilon}^{(j,k)}\subset\mathcal{C}_{j}\cup\mathcal{C}_{k}$ .

To prove this, it is enough to prove that, there exists $\epsilon>0$ ,

\displaystyle\min_{x\in\mathbb{R}^{M}:\|x\|\leq\epsilon}\min_{i\in[N]\backslash\{j,k\}}\left((L^{(x)})_{i}-(L^{(x)})_{j}\right)\vee\left((L^{(x)})_{i}-(L^{(x)})_{k}\right)>0\,.

(173)

We will prove (173) in the following. Take any $i\in[N]\backslash{\left\{j,k\right\}}$ and

\displaystyle\epsilon\coloneqq\min_{i:1\leq i<j}\frac{1}{2}\frac{j-i}{\|L_{j}-L_{i}\|}\wedge\min_{i:j<i<k}\frac{1}{2}\frac{(1-\alpha^{*})(k-i)}{\|L_{i}-L_{k}\|}\wedge\min_{i:k<i\leq N}\frac{1}{2}\frac{c+j}{\|L_{j}-L_{i}\|}\,.

(174)

Note that the $\epsilon$ used here is different from the one used in the proof of the regret upper bounds.

Case (A): When $1\leq i<j$ , using Cauchy–Schwarz inequality, we have

$\displaystyle\left((L^{(x)})_{i}-(L^{(x)})_{j}\right)\vee\left((L^{(x)})_{i}-(L^{(x)})_{k}\right)$	$\displaystyle\geq(L^{(x)})_{i}-(L^{(x)})_{j}$
	$\displaystyle=(-i+L_{i}^{\top}x)-(-j+L_{j}^{\top}x)$
	$\displaystyle=(j-i)-(L_{j}-L_{i})^{\top}x$
	$\displaystyle\geq(j-i)-\\|L_{j}-L_{k}\\|\\|x\\|$
	$\displaystyle\geq(j-i)-\epsilon\\|L_{j}-L_{i}\\|$
	$\displaystyle\geq\frac{1}{2}(j-i)$
	$\displaystyle>0\,.$	(175)

The arguments for cases (B) and (C) follow in the similar manner as case (A).

Case (B): When $j<i<k$ , we have

$\displaystyle\left((L^{(x)})_{i}-(L^{(x)})_{j}\right)\vee\left((L^{(x)})_{i}-(L^{(x)})_{k}\right)$	$\displaystyle\geq(L^{(x)})_{i}-(L^{(x)})_{k}$
	$\displaystyle={\left\{\alpha^{}c+(1-\alpha^{})\cdot(-i)+L_{i}^{\top}x\right\}}-{\left\{\alpha^{}c+(1-\alpha^{})\cdot(-k)+L_{k}^{\top}x\right\}}$
	$\displaystyle=(1-\alpha^{*})(k-i)-(L_{i}-L_{k})^{\top}x$
	$\displaystyle\geq(1-\alpha^{*})(k-i)-\epsilon(L_{i}-L_{k})^{\top}$
	$\displaystyle\geq\frac{1}{2}(1-\alpha^{*})(k-i)$
	$\displaystyle>0\,.$	(176)

Case (C): When $k<i\leq N$ , we have

$\displaystyle\left((L^{(x)})_{i}-(L^{(x)})_{j}\right)\vee\left((L^{(x)})_{i}-(L^{(x)})_{k}\right)$	$\displaystyle\geq(L^{(x)})_{i}-(L^{(x)})_{j}$
	$\displaystyle=(c+L_{i}^{\top}x)-(-j+L_{j}^{\top}x)$
	$\displaystyle\geq c+j-\\|L_{j}-L_{i}\\|\\|x\\|$
	$\displaystyle\geq c+j-\epsilon\\|L_{j}-L_{i}\\|$
	$\displaystyle\geq\frac{1}{2}(c+j)$
	$\displaystyle>0\,.$	(177)

Summing up the argument for cases (A) to (C), the proof is completed. ∎

Appendix H Details and Additional Results of Experiments

Here we give the specific values of the opponent’s strategy used in Section 5 and show the extended experimental results for performance comparison. Table 2 summarizes the values of opponent’s strategy used in this appendix and Section 5. Figure 3 shows the empirical comparison of the proposed algorithms against the benchmark methods, and Figure 4 shows the number of the rejected times. We can see the same tendency as Section 5, that is, TSPM performs the best and the number of rejections does not increase with the time step $t$ .

Table 2: The values of the opponent’s strategy.

# of outcomes $M$	opponent’s strategy $p^{*}$
$2$	$[0.7,0.3]$
$3$	$[0.5,0.3,0.2]$
$4$	$[0.3,0.3,0.3,0.1]$
$5$	$[0.2,0.3,0.3,0.1,0.1]$
$6$	$[0.2,0.2,0.3,0.1,0.1,0.1]$
$7$	$[0.2,0.2,0.3,0.1,0.1,0.05,0.05]$

(a) dp-easy,

N=M=2

(b) dp-easy,

N=M=3

N=M=4

(d) dp-easy,

N=M=5

(e) dp-easy,

N=M=6

(f) dp-easy,

N=M=7

(g) dp-hard,

N=M=2

(h) dp-hard,

N=M=3

(i) dp-hard,

N=M=4

(j) dp-hard,

N=M=5

(k) dp-hard,

N=M=6

(l) dp-hard,

N=M=7

Figure 3: Regret-round plots of the algorithms. The solid lines indicate the average over

100

independent trials. The thin fillings are the standard error.

(a) dp-easy,

N=M=2

(b) dp-easy,

N=M=3

N=M=4

(d) dp-easy,

N=M=5

(e) dp-easy,

N=M=6

(f) dp-easy,

N=M=7

(g) dp-hard,

N=M=2

(h) dp-hard,

N=M=3

(i) dp-hard,

N=M=4

(j) dp-hard,

N=M=5

(k) dp-hard,

N=M=6

(l) dp-hard,

N=M=7

Figure 4: The number of rejected times by the accept-reject sampling. The solid lines indicate the average over

100

independent trials after taking moving average with window size

100

$\displaystyle F_{t}(p)$	$\displaystyle=\pi{\left(p\right)}\prod_{i=1}^{N}\exp\Big{\{}-n_{i}\mathcal{D}_{\mathrm{KL}}\left({q_{i}^{(t)}}\middle\\|{S_{i}p}\right)\Big{\}}$
	$\displaystyle\leq\pi{\left(p\right)}\prod_{i=1}^{N}\exp\Big{\{}-\frac{1}{2}n_{i}\\|q_{i}^{(t)}-S_{i}p\\|_{1}^{2}\Big{\}}\quad\left(\text{by {Pinsker's ineq.}}\right)$
	$\displaystyle=\pi{\left(p\right)}\exp\Big{\{}-\frac{1}{2}\sum_{i=1}^{N}n_{i}\\|q_{i}^{(t)}-S_{i}p\\|_{1}^{2}\Big{\}}$
	$\displaystyle\leq\pi{\left(p\right)}\exp\Big{\{}-\frac{1}{2}\sum_{i=1}^{N}n_{i}\\|q_{i}^{(t)}-S_{i}p\\|^{2}\Big{\}}\quad\left(\text{by }{\\|q_{i}^{(t)}-S_{i}p\\|_{1}\geq\\|q_{i}^{(t)}-S_{i}p\\|}\right)$
	$\displaystyle=G_{t}(p)\,.$	(6)

$\displaystyle\\|B_{t}^{1/2}(p-\hat{p}_{t})\\|^{2}$	$\displaystyle\geq\sum_{k\in[N]}N_{k}(t)\\|S_{k}(p-\hat{p}_{t})\\|^{2}\quad\left(\text{by {def. of $B_{t}$}}\right)$
	$\displaystyle\geq n_{i}\\|S_{i}(p-\hat{p}_{t})\\|^{2}\quad\left({N_{i}(t)\geq n_{i}}\right)$
	$\displaystyle>\frac{9}{16}\epsilon^{2}n_{i}\quad\left(\text{by {Eq.~{}\eqref{eq:diff_p_phat}}}\right)\,.$	(51)

$\displaystyle\mathbb{E}_{\epsilon_{t}}{\left[\mathrm{e}^{\frac{\left\\|\epsilon_{t}+v\right\\|_{V^{-1}}^{2}}{2}}\right]}$	$\displaystyle=\mathbb{E}_{\lambda\sim\mathcal{N}(0,V^{-1})}{\left[\mathbb{E}_{\epsilon_{t}}{\left[\mathrm{e}^{\lambda^{\top}(\epsilon_{t}+v)}\right]}\right]}$
	$\displaystyle=\mathbb{E}_{\lambda\sim\mathcal{N}(0,V^{-1})}{\left[\mathrm{e}^{\lambda^{\top}v}\mathbb{E}_{\epsilon_{t}}{\left[\mathrm{e}^{\lambda^{\top}\epsilon_{t}}\right]}\right]}$
	$\displaystyle\leq\mathbb{E}_{\lambda\sim\mathcal{N}(0,V^{-1})}{\left[\mathrm{e}^{\lambda^{\top}v}\mathrm{e}^{\left\\|\lambda\right\\|^{2}/2}\right]}$
	$\displaystyle=\frac{1}{(2\pi)^{d/2}\sqrt{\|V^{-1}\|}}\int\mathrm{e}^{\lambda^{\top}v}\mathrm{e}^{\left\\|\lambda\right\\|^{2}/2}\mathrm{e}^{-\left\\|\lambda\right\\|_{V}^{2}/2}\mathrm{d}\lambda$
	$\displaystyle=\frac{1}{(2\pi)^{d/2}\sqrt{\|V^{-1}\|}}\int\mathrm{e}^{-\frac{1}{2}\left(\lambda^{\top}(V-I)\lambda-2v^{\top}\lambda\right)}\mathrm{d}\lambda$
	$\displaystyle=\frac{\sqrt{\|V-I\|}}{(2\pi)^{d/2}\sqrt{\|V^{-1}\|\|V-I\|}}\int\mathrm{e}^{-\frac{1}{2}\left((\lambda-(V-I)^{-1}v)^{\top}(V-I)(\lambda-(V-I)^{-1}v)-v^{\top}(V-I)^{-1}v\right)}\mathrm{d}\lambda$
	$\displaystyle=\frac{\sqrt{\|V\|}}{\sqrt{\|V-I\|}}\mathrm{e}^{\frac{1}{2}v^{\top}(V-I)^{-1}v}\,.$	(67)

$\displaystyle\mathbb{E}\left[\exp\left(\frac{1}{2}\left(\left\\|\hat{p}_{t}-p^{}\right\\|_{B_{t}}^{2}-\left\\|\hat{p}_{t-1}-p^{}\right\\|_{B_{t-1}}^{2}\right)\right)\right]$
	$\displaystyle\leq\frac{\sqrt{\|(I+C_{t})C_{t}^{-1}\|}}{\sqrt{\|(I+C_{t})C_{t}^{-1}-I\|}}\mathrm{e}^{\frac{1}{2}d_{t}^{\top}C_{t}^{-1}((I+C_{t})C_{t}^{-1}-I)^{-1}C_{t}^{-1}d_{t}}\mathrm{e}^{-\frac{1}{2}d_{t}^{\top}(I+C_{t})^{-1}(I+C_{t}^{-1})d_{t}}$
	$\displaystyle\leq\frac{\sqrt{\|(I+C_{t})C_{t}^{-1}\|}}{\sqrt{\|C_{t}^{-1}\|}}\mathrm{e}^{\frac{1}{2}d_{t}^{\top}C_{t}^{-1}(C_{t}^{-1})^{-1}C_{t}^{-1}d_{t}}\mathrm{e}^{-\frac{1}{2}d_{t}^{\top}(I+C_{t})^{-1}(I+C_{t}^{-1})d_{t}}$
	$\displaystyle=\sqrt{\|(I+C_{t})\|}$
	$\displaystyle=\sqrt{\frac{\|B_{t}\|}{\|B_{t-1}\|}}\,,$	(70)

$\displaystyle\inf_{p:\left\\|p-p^{*}\right\\|\leq 2\epsilon}\sup_{p^{\prime}:\left\\|p^{\prime}-p\right\\|\leq\epsilon}\left\\|p^{\prime}-\hat{p}\right\\|_{A}^{2}$
	$\displaystyle=\inf_{x:\left\\|x\right\\|\leq 2\epsilon}\sup_{x^{\prime}:\left\\|x^{\prime}\right\\|\leq\epsilon}\left\\|a+x+x^{\prime}\right\\|_{A}^{2}$
	$\displaystyle=\inf_{x:\left\\|x\right\\|\leq\epsilon}\left\\|a+x\right\\|_{A}^{2}\,.$	(82)

Analysis and Design of Thompson Sampling for Stochastic Partial Monitoring

Abstract

1 Introduction

2 Preliminaries

Definition 1 (Cell decomposition and Pareto-optimality (Bartók et al., 2011)).

Definition 2 (Neighbors and neighborhood action (Bartók et al., 2011)).

Definition 3 (Signal matrix (Komiyama et al., 2015)).

Definition 4 (Local observability (Bartók et al., 2011)).

Definition 5 (Strong local observability).

3 Thompson-sampling-based Algorithm for Partial Monitoring

3.1 Accept-Reject Sampling

Proposition 1.

3.2 Sampling from Proposal Distribution

Proposition 2.

4 Theoretical Analysis

4.1 Regret Upper Bound

Theorem 3 (Regret upper bound).

Remark.

4.2 Technical Difficulties of the Analysis

5 Experiments

6 Conclusion and Discussion

Broader Impact

Acknowledgements

References

Appendix A Notation

Appendix B Posterior Distribution and Proposal Distribution in Section 3

Proposition 4.

Proof.

Remark.

Appendix C Proof of Proposition 2

Appendix D Relation between TSPM-Gaussian and BPM-TS

Appendix E Preliminaries for Regret Analysis

E.1 Basic Lemmas

Fact 5 (Moment generating function of squared-Gaussian distribution).

Lemma 6 (Chernoff bound for chi-squared random variable).

Proof.

E.2 Property of Strong Local Observability

Lemma 7.

Proof.

Remark.

Corollary 8.

Proof.

Lemma 9.

Proof.

E.3 Statistics of Uninterested Actions

Lemma 10 (Upper bound for the expectation of Z\iZ_{\backslash i}).

Proof.

E.4 Mahalanobis Distance Process

Lemma 11.

Proof.

Lemma 12.

Proof.

E.5 Norms under Perturbations

Lemma 13.

Proof.

Lemma 14.

Proof.

Theorem 15.

Proof.

E.6 Exit Time Analysis

Theorem 16.

Lemma 17.

Proof.

Proof of Theorem 16.

Appendix F Regret Analysis of TSPM Algorithm

F.1 Regret Decomposition

F.2 Analysis for Case (A)

Lemma 18.

Lemma 19.

Proof.

Proof of Lemma 18.

F.3 Analysis for Case (B)

Lemma 20.

Lemma 21.

Proof.

Lemma 22.

Proof.

Proof of Lemma 20.

F.4 Analysis for Case (C)

Lemma 23.

Analysis and Design of Thompson Sampling
for Stochastic Partial Monitoring

Lemma 10 (Upper bound for the expectation of $Z_{\backslash i}$ ).