
Reinforcement Learning with Delayed, Composite, and Partially Anonymous Reward

Washim Uddin Mondal [email protected]
School of IE and CE, Purdue University
Vaneet Aggarwal [email protected]
School of IE and ECE, Purdue University
Abstract

We investigate an infinite-horizon average-reward Markov Decision Process (MDP) with delayed, composite, and partially anonymous reward feedback. The delay and compositeness of rewards mean that rewards generated as a result of taking an action at a given state are fragmented into different components, and they are sequentially realized at delayed time instances. The partial anonymity attribute implies that a learner, for each state, only observes the aggregate of past reward components generated as a result of different actions taken at that state, but realized at the observation instance. We propose an algorithm named $\mathrm{DUCRL2}$ to obtain a near-optimal policy for this setting and show that it achieves a regret bound of $\tilde{\mathcal{O}}\left(DS\sqrt{AT}+d(SA)^{3}\right)$, where $S$ and $A$ are the sizes of the state and action spaces, respectively, $D$ is the diameter of the MDP, $d$ is a parameter upper bounded by the maximum reward delay, and $T$ denotes the time horizon. This demonstrates the optimality of the bound in the order of $T$, and an additive impact of the delay.

1 Introduction

Reinforcement learning (RL) enables an agent to learn a policy in an unknown environment by repeatedly interacting with it. The environment has a state that changes as a result of the actions executed by the agent. In this setup, a policy is a collection of rules that guides the agent to take action based on the observed state of the environment. Several algorithms exist in the literature that learn a policy with near-optimal regret (Jaksch et al., 2010; Azar et al., 2017; Jin et al., 2018; Fruit et al., 2018). However, all of the above frameworks assume that the reward is instantly fed back to the learner after executing an action. Unfortunately, this assumption of instantaneity may not hold in many practical scenarios. For instance, in the online advertisement industry, a customer may purchase a product several days after seeing the advertisement. In medical trials, the effect of a medicine may take several hours to manifest. In road networks, it might take a few minutes to notice the impact of a traffic light change. In many of these cases, the effect of an action is not entirely realized in a single instant, but rather fragmented into smaller components that are sequentially materialized over a long interval. Reward feedback that satisfies this property is referred to as delayed and composite reward. In several applications, the learner cannot directly observe each delayed component of a composite reward but only an aggregate of the components realized at the observation instance. For example, in the case of multiple advertisements, the advertiser can only see the total number of purchases at a given time but is completely unaware of which advertisement resulted in what fraction of the total purchase. Such feedback is referred to as anonymous reward.

Learning with delayed, composite, and anonymous rewards is gaining popularity in the RL community. While most of the theoretical analysis has been directed towards the multi-arm bandit (MAB) framework (Wang et al., 2021; Zhou et al., 2019; Vernade et al., 2020; Pike-Burke et al., 2018; Pedramfar and Aggarwal, 2023), recent studies have also analyzed Markov Decision Processes (MDPs) with delayed feedback (Howson et al., 2023; Jin et al., 2022; Lancewicki et al., 2022). However, none of these studies have incorporated both composite and anonymous rewards, which is the focus of this paper.

1.1 The Challenge and Our Contribution

Learning with delayed, composite, and anonymous rewards has gained popularity in the RL community. While there has been extensive theoretical analysis of these types of rewards in the multi-arm bandit (MAB) framework, extending these ideas to the full Markov decision process (MDP) setting poses significant challenges. For example, one of the main ideas used in the MAB framework can be stated as follows (Wang et al., 2021). The learning algorithm proceeds in multiple epochs with the sizes of the epochs increasing exponentially. At the beginning of each epoch, an action is chosen and it is applied at all instances within that epoch. Due to multiple occurrences of the chosen action, and exponentially increasing epoch lengths, the learner progressively obtains better estimates of the associated rewards despite the composite and anonymous nature of the feedback. In an MDP setting, however, it is impossible to ensure that any state-action pair $(s,a)$ appears contiguously over a given stretch of time. The most that one can hope to ensure is that each state, $s$, when it appears in an epoch (defined appropriately), is always paired with a unique action, $a$. In this way, if the state in consideration is visited sufficiently frequently in that epoch, the learner would obtain an accurate estimate of the reward associated with the pair $(s,a)$. Unfortunately, in general, there is no way to ensure a high enough visitation frequency for all states. Unlike the MAB setup, it is, therefore, unclear how to develop regret-optimal learning algorithms for MDPs with delayed, composite, and anonymous rewards.

Our article provides a partial resolution to this problem. In particular, we demonstrate that if the rewards are delayed, composite, and partially anonymous, then an algorithm can be designed to achieve near-optimal regret. In a fully anonymous setting, a learner observes the aggregate of the (delayed) reward components generated as a result of all state-action pairs visited in the past. In contrast, a partially anonymous setup allows a learner to observe the sum of (delayed) reward components generated as a result of all past visitations to any specified state. Our proposed algorithm, $\mathrm{DUCRL2}$, is built upon the $\mathrm{UCRL2}$ algorithm of (Jaksch et al., 2010) and works in multiple epochs. Unlike the bandit setting, however, the epoch lengths are not guaranteed to be exponentially increasing. Our primary innovation lies in demonstrating how an accurate reward function estimate can be obtained using the partially anonymous feedback. $\mathrm{DUCRL2}$ yields a regret bound of $\tilde{\mathcal{O}}(DS\sqrt{AT}+d(SA)^{3})$, where $S$ and $A$ denote the sizes of the state and action spaces respectively, $T$ is the time horizon, $D$ denotes the diameter of the MDP, and the parameter $d$ is bounded by the maximum delay in the reward generation process. The obtained result matches a well-known lower bound in $T$.

1.2 Relevant Literature

Below we describe in detail the relevant literature.

Regret Bounds in Non-delayed RL: The framework of regret minimization in RL with immediate feedback is well-investigated in the literature. In particular, this topic has been explored in the settings of both stochastic (Jaksch et al., 2010; Zanette et al., 2020; Agarwal and Aggarwal, 2023; Agarwal et al., 2022; 2023) and adversarial (Jin and Luo, 2020; Rosenberg and Mansour, 2019; Shani et al., 2020) MDPs. Our setup can be considered a generalization of stochastic MDPs with immediate feedback.

Delay in Bandit: Delayed feedback is a well-researched topic in the bandit literature, with numerous studies conducted in both stochastic (Vernade et al., 2020; Pike-Burke et al., 2018; Zhou et al., 2019; Gael et al., 2020; Lancewicki et al., 2021; Pedramfar and Aggarwal, 2023) and adversarial settings (Quanrud and Khashabi, 2015; Cesa-Bianchi et al., 2016; Thune et al., 2019; Zimmert and Seldin, 2020; Bistritz et al., 2019; Ito et al., 2020). However, as previously discussed, applying the insights of bandit learning to the MDP setting with composite and anonymous rewards is challenging.

Delay in MDP: A number of recent papers have explored the incorporation of delayed feedback into the MDP framework. For example, (Lancewicki et al., 2022; Jin et al., 2022) consider adversarial MDPs, while (Howson et al., 2023) analyzes a stochastic setting. However, all of these articles focus on episodic MDPs with non-composite and non-anonymous rewards, which is distinct from our work on infinite-horizon MDPs with delayed, composite, and partially anonymous rewards. It is worth noting that while delayed reward is a commonly studied topic in the literature, some works also consider delays in the state information (Agarwal and Aggarwal, 2021; Bouteiller et al., 2021). Additionally, the impact of delay has also been explored in the context of multi-agent learning to characterize coarse correlated equilibrium (Zhang et al., 2022).

2 Problem Setting

We consider an infinite-horizon average-reward Markov Decision Process (MDP) defined as $M\triangleq\{\mathcal{S},\mathcal{A},r,p\}$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces respectively, $r:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$ is the reward function, and $p:\mathcal{S}\times\mathcal{A}\rightarrow\Delta(\mathcal{S})$ indicates the state transition function. The function $\Delta(\cdot)$ defines the probability simplex over its argument set. The cardinalities of the sets $\mathcal{S}$ and $\mathcal{A}$ are denoted as $S$ and $A$ respectively. Both the reward function, $r$, and the transition function, $p$, are assumed to be unknown.

A learner/agent interacts with an environment governed by the MDP stated above as follows. The interaction proceeds in discrete time steps, $t\in\{1,2,\cdots\}$. At time $t$, the state occupied by the environment is denoted as $s_t\in\mathcal{S}$. The learner observes the state $s_t$ and chooses an action $a_t\in\mathcal{A}$ following a predefined protocol, $\mathbb{A}$. As a result, a sequence of rewards $\boldsymbol{r}_t(s_t,a_t)\triangleq\{r_{t,\tau}(s_t,a_t)\}_{\tau=0}^{\infty}$ is generated, where $r_{t,\tau}(s_t,a_t)$ is interpreted as the non-negative component of the vector $\boldsymbol{r}_t(s_t,a_t)$ that is realized at instant $t+\tau$. The following assumption is made regarding the reward generation process.

Assumption 1

It is assumed that $\forall t\in\{1,2,\cdots\}$ and $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$, the following holds.

$$(a)~||\boldsymbol{r}_{t}(s,a)||_{1}\sim D(s,a)\in\Delta[0,1],$$
$$(b)~\mathbb{E}||\boldsymbol{r}_{t}(s,a)||_{1}=r(s,a)\in[0,1],$$
$$(c)~\{\boldsymbol{r}_{t}(s,a)\}_{t\geq 1,(s,a)\in\mathcal{S}\times\mathcal{A}}\text{ are mutually independent},$$

where $||\cdot||_{1}$ denotes the $1$-norm.

Assumption 1(a) dictates that the reward sequences generated from $(s,a)$ are such that the sum of their components can be thought of as samples taken from a certain distribution, $D(s,a)$, over $[0,1]$. Note that the distribution $D(s,a)$ is independent of $t$. Assumption 1(b) explains that the expected value of the sum of the reward components generated from $(s,a)$ equals $r(s,a)\in[0,1]$. Finally, Assumption 1(c) clarifies that the reward sequences generated at different instances (either by the same or distinct state-action pairs) are presumed to be independent of each other.

At time instant $t$, the learner observes a reward vector $\boldsymbol{x}_t\in\mathbb{R}^{S}$ whose $s$-th element is expressed as follows.

$$\boldsymbol{x}_{t}(s)=\sum_{0<\tau\leq t}r_{\tau,t-\tau}(s,a_{\tau})\,\mathrm{1}(s_{\tau}=s)$$

where $\mathrm{1}(\cdot)$ denotes the indicator function. Note that the learner only has access to the lump-sum reward $\boldsymbol{x}_t(s)$, not its individual components contributed by past actions. This explains why the reward is termed partially anonymous. In a fully anonymous setting, the learner can only access $||\boldsymbol{x}_t||_1$, not its elements. Although full anonymity might be desirable for many practical scenarios, from a theoretical standpoint, it is notoriously difficult to analyze. We discuss this topic in detail in Section 4.2. The accumulated reward generated up to time $T$ can be computed as,

$$R(s_{1},\mathbb{A},M,T)=\sum_{t=1}^{T}||\boldsymbol{r}_{t}(s_{t},a_{t})||_{1} \qquad (1)$$

We would like to emphasize that $R(s_{1},\mathbb{A},M,T)$ is not the sum of the observed rewards up to time $T$. Rather, it equates to the sum of all the reward components that are generated as a consequence of the actions taken up to time $T$. We define the quantity expressed below

ฯโ€‹(s1,๐”ธ,M)โ‰œlimTโ†’โˆž1Tโ€‹๐”ผโ€‹[Rโ€‹(s1,๐”ธ,M,T)]\displaystyle\rho(s_{1},\mathbb{A},M)\triangleq\lim\limits_{T\rightarrow\infty}\dfrac{1}{T}\mathbb{E}\left[R(s_{1},\mathbb{A},M,T)\right] (2)

as the average reward of the MDP $M$ for a given protocol $\mathbb{A}$ and an initial state $s_1\in\mathcal{S}$. Here the expectation is taken over all possible $T$-length trajectories generated from the initial state $s_1$ following the protocol $\mathbb{A}$, and over the randomness associated with the reward generation process for any given state-action pair. It is well known (Puterman, 2014) that there exists a stationary deterministic policy $\pi^{*}:\mathcal{S}\rightarrow\mathcal{A}$ that maximizes the average reward $\forall s_1\in\mathcal{S}$ if $D(M)$, the diameter of $M$ (defined below), is finite. Also, in that case, $\rho(s_1,\pi^{*},M)$ becomes independent of $s_1$ and thus can be simply denoted as $\rho^{*}(M)$. The diameter $D(M)$ is defined as follows.

$$D(M)\triangleq\max_{s\neq s^{\prime}}\min_{\pi:\mathcal{S}\rightarrow\mathcal{A}}\mathbb{E}\left[T(s^{\prime}|s,\pi,M)\right]$$

where $T(s^{\prime}|s,\pi,M)$ denotes the time needed for the MDP $M$ to reach the state $s^{\prime}$ from the state $s$ following the stationary deterministic policy $\pi$. Mathematically, $\mathrm{Pr}(T(s^{\prime}|s,\pi,M)=t)\triangleq\mathrm{Pr}(s_{t}=s^{\prime}\,|\,s_{1}=s,\,s_{\tau}\neq s^{\prime},\,1<\tau<t,\,s_{\tau}\sim p(s_{\tau-1},a_{\tau-1}),\,a_{\tau}\sim\pi(s_{\tau}))$. In simple words, given two arbitrary distinct states, one can always find a stationary deterministic policy such that the MDP $M$, on average, takes at most $D(M)$ time steps to transition from one state to the other.
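To make the definition concrete, the following minimal sketch (ours, not part of the paper) computes $D(M)$ for a small MDP with a known transition tensor by solving, for every target state $s^{\prime}$, the hitting-time equations $h(s)=1+\min_{a}\sum_{s^{\prime\prime}}p(s^{\prime\prime}|s,a)h(s^{\prime\prime})$ with $h(s^{\prime})=0$ via value iteration; the toy transition probabilities are made up for illustration.

```python
import numpy as np

def diameter(p, iters=100_000, tol=1e-10):
    """Diameter D(M): max over ordered pairs (s, s') of the minimum expected
    time to reach s' from s, obtained by value iteration on the hitting-time
    equations h(s) = 1 + min_a sum_x p(x|s,a) h(x), with h(s') = 0."""
    S, A, _ = p.shape
    D = 0.0
    for target in range(S):
        h = np.zeros(S)
        for _ in range(iters):
            h_new = 1.0 + (p @ h).min(axis=1)   # Bellman update for hitting times
            h_new[target] = 0.0                 # the target state is absorbing
            if np.max(np.abs(h_new - h)) < tol:
                h = h_new
                break
            h = h_new
        D = max(D, h.max())
    return D

# Toy 2-state, 2-action MDP (numbers are illustrative only).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
print(diameter(p))  # ~1.43 here: about 1/0.7 expected steps from state 1 to state 0
```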

We define the performance of a protocol $\mathbb{A}$ by the regret it accumulates over a horizon $T$, which is mathematically expressed as,

$$\mathrm{Reg}(s_{1},\mathbb{A},M,T)=T\rho^{*}(M)-R(s_{1},\mathbb{A},M,T) \qquad (3)$$

where $\rho^{*}(M)$ is the maximum of the average reward given in (2), and the second term is defined in (1). We would like to mention that, in order to define regret, we use $R(s_{1},\mathbb{A},M,T)$ rather than the expected sum of the observed rewards up to time $T$. The rationale behind this definition is that all the components of the rewards that are generated as a consequence of the actions taken up to time $T$ would eventually be realized if we allow the system to evolve for a long enough time. Our goal in this article is to come up with an algorithm that achieves sublinear regret for the delayed, composite, and partially anonymous reward MDP described above.

Before concluding, we would like to provide an example of an MDP with partially anonymous rewards. Consider $T$ rounds of interaction of a potential consumer with a website that advertises $S$ categories of products. At round $t$, the state $s_t\in\{1,\cdots,S\}$ observed by the website is the category of product searched by the consumer. The $s$-th category, where $s\in\{1,\cdots,S\}$, has $N_s$ potential advertisements, and the total number of advertisements is $N=\sum_s N_s$. The website, in response to the observed state $s_t$, shows an ordered list of $K<N$ advertisements (denoted by $a_t$), some of which may not directly correspond to the searched category $s_t$. This may cause the consumer, with some probability, to switch to a new state $s_{t+1}$ in the next round. For example, if the consumer is searching for "Computers", then showing advertisements related to "Mouse" or "Keyboard" may incentivize the consumer to search for those categories of products. After $\tau$ rounds, $0<\tau\leq T$, the consumer ends up spending an amount $r_\tau(s)$ on the $s$-th category of product as a consequence of previous advertisements. Note that if the same state $s$ appears in two different rounds $t_1<t_2<\tau$, the website can potentially show two different ordered lists of advertisements, $a_{t_1},a_{t_2}$, to the consumer. However, it is impossible to segregate the portions of the reward $r_\tau(s)$ contributed by each of those actions. Hence, the system described above can be modeled by an MDP with delayed, composite, and partially anonymous rewards.
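To illustrate the feedback model of this section, here is a small simulation sketch (ours; the reward distributions, problem sizes, and the randomly drawn state trajectory are illustrative assumptions, not taken from the paper, and a real trajectory would follow the transition function $p$) showing how the observation $\boldsymbol{x}_t$ aggregates delayed components per state while hiding which past action generated them.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, T, D_MAX = 3, 2, 50, 5        # illustrative sizes, not from the paper

def reward_sequence(s, a):
    """One draw of the composite reward r_t(s, a): a total in [0, 1]
    (consistent with Assumption 1) fragmented over D_MAX delayed components."""
    total = rng.beta(2, 2 + s + a)              # ||r_t(s,a)||_1 lies in [0, 1]
    weights = rng.dirichlet(np.ones(D_MAX))     # how the total is fragmented
    return total * weights                      # component tau is realized at time t + tau

pending = np.zeros((T + D_MAX, S))   # pending[t, s]: reward realized at time t, tagged by state s
states = rng.integers(0, S, size=T)  # state trajectory (drawn at random purely for illustration)
x = np.zeros((T, S))

for t in range(T):
    s, a = states[t], rng.integers(0, A)        # some behaviour policy
    pending[t:t + D_MAX, s] += reward_sequence(s, a)
    x[t] = pending[t]   # the learner only sees this per-state aggregate x_t

# x[t, s] mixes delayed components of *different* past actions taken at state s:
# this is exactly the partially anonymous feedback the learner must work with.
print(np.round(x[:5], 3))
```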

3 DUCRL2 Algorithm

In this section, we develop the Delayed UCRL2 (DUCRL2) algorithm to achieve the overarching goal of our paper. It is inspired by the UCRL2 algorithm suggested by (Jaksch et al., 2010) for MDPs with immediate rewards (no delay). Before going into the details of the DUCRL2 algorithm, we would like to introduce the following assumption.

Assumption 2

There exists a finite positive number $d$ such that, $\forall t\in\{1,2,\cdots\}$,

$$\sum_{\tau_{1}\geq 0}\max_{(s,a)\in\mathcal{S}\times\mathcal{A}}\left[\sum_{\tau\geq\tau_{1}}r_{t,\tau}(s,a)\right]\leq d \qquad (4)$$

with probability $1$, where $\{r_{t,\tau}(s,a)\}_{\tau=0}^{\infty}$ is the reward sequence generated by $(s,a)$ at time $t$.

Algorithm 1 DUCRL2 Algorithm
1: Input: $\delta\in(0,1)$, $d$, $\mathcal{S}$, $\mathcal{A}$
2: Initialization: Observe the initial state $s_1\in\mathcal{S}$ and set $t\leftarrow 1$, $t_0\leftarrow 0$.
3: for episodes $k\in\{1,2,\cdots\}$ do ▷ Computing empirical estimates
4:  $t_k\leftarrow t$
5:  for $(s,a)\in\mathcal{S}\times\mathcal{A}$ do
6:   $\nu_k(s,a)\leftarrow 0$
7:   $\mathcal{E}_j(s,a)\leftarrow\mathrm{1}\big((s,a)\in\{(s_\tau,a_\tau)\,|\,t_{j-1}\leq\tau<t_j\}\big)$, $\forall j\in\{1,\cdots,k-1\}$
8:   $E_k(s,a)\leftarrow\sum_{0<j<k}\mathcal{E}_j(s,a)$
9:   $N_k(s,a)\leftarrow\sum_{0<\tau<t_k}\mathrm{1}(s_\tau=s,a_\tau=a)$
10:   $\hat{r}_k(s,a)\leftarrow\sum_{0<j<k}\sum_{t_j\leq\tau<t_{j+1}}\boldsymbol{x}_\tau(s)\mathcal{E}_j(s,a)/\max\{1,N_k(s,a)\}$
11:   for $s^{\prime}\in\mathcal{S}$ do
12:    $\hat{p}_k(s^{\prime}|s,a)\leftarrow\sum_{0<\tau<t_k}\mathrm{1}(s_\tau=s,a_\tau=a,s_{\tau+1}=s^{\prime})/\max\{1,N_k(s,a)\}$
13:   end for
14:  end for
15:  Let $\mathcal{M}_k$ be the set of MDPs with state space $\mathcal{S}$, action space $\mathcal{A}$, transition probability $\tilde{p}$, and reward function $\tilde{r}$ such that
$$|\tilde{r}(s,a)-\hat{r}_{k}(s,a)|\leq\sqrt{\frac{7\log(2SAt_{k}/\delta)}{2\max\{N_{k}(s,a),1\}}}+d\,\frac{E_{k}(s,a)}{\max\{N_{k}(s,a),1\}} \qquad (5)$$
$$||\tilde{p}(\cdot|s,a)-\hat{p}_{k}(\cdot|s,a)||_{1}\leq\sqrt{\frac{14S\log(2At_{k}/\delta)}{\max\{1,N_{k}(s,a)\}}} \qquad (6)$$
16:  Using extended value iteration (Algorithm 2), obtain a stationary deterministic policy $\tilde{\pi}_k$ and an MDP $\tilde{M}_k\in\mathcal{M}_k$ such that,
$$\tilde{\rho}_{k}\triangleq\min_{s\in\mathcal{S}}\rho(s,\tilde{\pi}_{k},\tilde{M}_{k})\geq\max_{M^{\prime}\in\mathcal{M}_{k},\pi,s^{\prime}\in\mathcal{S}}\rho(s^{\prime},\pi,M^{\prime})-\frac{1}{\sqrt{t_{k}}} \qquad (7)$$
17:  while $\nu_k(s_t,\tilde{\pi}_k(s_t))<\max\{1,N_k(s_t,\tilde{\pi}_k(s_t))\}$ do
18:   Execute $a_t=\tilde{\pi}_k(s_t)$
19:   Observe the reward vector $\boldsymbol{x}_t$ and the next state $s_{t+1}$
20:   $\nu_k(s_t,a_t)\leftarrow\nu_k(s_t,a_t)+1$
21:   $t\leftarrow t+1$
22:  end while
23: end for

Note that if the maximum delay is $d_{\max}$, then using Assumption 1(a), one can show that $d\leq d_{\max}$. Therefore, $d$ can be thought of as a proxy for the maximum delay. To better understand the intuition behind Assumption 2, consider an interval $\{1,\cdots,T_1\}$. Clearly, the reward sequence generated at $t_1=T_1-\tau_1$, $\tau_1\in\{0,\cdots,T_1-1\}$, is $\mathbf{r}_{t_1}(s_{t_1},a_{t_1})$. The portion of this reward that is realized after $T_1$ is expressed by the quantity $\sum_{\tau\geq\tau_1}r_{t_1,\tau}(s_{t_1},a_{t_1})$, which is upper bounded by $\max_{(s,a)}\sum_{\tau\geq\tau_1}r_{t_1,\tau}(s,a)$. Therefore, the total amount of reward that is generated in $\{1,\cdots,T_1\}$ but realized after $T_1$ is bounded by $\sum_{\tau_1\geq 0}\max_{(s,a)}\sum_{\tau\geq\tau_1}r_{t_1,\tau}(s,a)$. As the distribution of the reward sequence $\mathbf{r}_{t_1}$ is the same for all $t_1$ (Assumption 1(a)), one can replace the $\tau_1$-dependent term $t_1$ with a generic time index, $t$. Thus, Assumption 2 essentially states that the spillover of the rewards generated within any finite interval $\{1,\cdots,T_1\}$ can be bounded by the term $d$.

We would also like to emphasize that if $d_{\max}$ is infinite, then $d$ may or may not be infinite depending on the reward sequence. For example, consider a deterministic sequence whose components are $r_{t,\tau}(s,a)=2^{-1-\tau}$, $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$, $\forall t\in\{1,2,\cdots\}$, $\forall\tau\in\{0,1,\cdots\}$. It is easy to show that one can choose $d=2$ even though $d_{\max}$ is infinite. On the other hand, if $r_{t,\tau}(s,a)=\mathbf{1}(\tau=\tau^{\prime})$, $\forall(s,a)$, $\forall t,\forall\tau$, where $\tau^{\prime}$ is a random variable with the distribution $\mathrm{Pr}(\tau^{\prime}=k)=(1-p)p^{k}$, $\forall k\in\{0,1,\cdots\}$, $0<p<1$, then one can show that (4) is violated with probability at least $p^{d}$. In other words, Assumption 2 is not satisfied for any finite $d$.
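As a quick sanity check of the first example above (ours, truncated at a finite horizon), the double sum in (4) for $r_{t,\tau}(s,a)=2^{-1-\tau}$ evaluates to $\sum_{\tau_1\geq 0}2^{-\tau_1}=2$:

```python
# Truncated check of Assumption 2 for the example r_{t,tau}(s,a) = 2**(-1 - tau):
# the inner tail sum is sum_{tau >= tau1} 2**(-1 - tau) = 2**(-tau1), so the
# outer sum over tau1 >= 0 equals 2, i.e. d = 2 works although d_max is infinite.
H = 60  # truncation horizon for this illustration
spillover = sum(sum(2.0 ** (-1 - tau) for tau in range(tau1, H)) for tau1 in range(H))
print(spillover)  # ~2.0
```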

The $\mathrm{DUCRL2}$ algorithm (Algorithm 1) proceeds in multiple epochs. At the beginning of epoch $k$, i.e., at time instant $t=t_k$, we compute two measures for all state-action pairs. The first measure is indicated as $N_k(s,a)$, which counts the number of times the pair $(s,a)$ appears before the onset of the $k$th epoch. The second measure is $\mathcal{E}_j(s,a)$, which is a binary random variable that indicates whether the pair $(s,a)$ appears at least once in the $j$th epoch, $0<j<k$. Due to the very nature of our algorithm (elaborated later), in a given epoch $j$, if $\mathcal{E}_j(s,a)=1$, then $\mathcal{E}_j(s,a^{\prime})=0$, $\forall a^{\prime}\in\mathcal{A}\setminus\{a\}$. Taking a sum over $\{\mathcal{E}_j(s,a)\}_{0<j<k}$, we obtain $E_k(s,a)$, which counts the number of epochs where $(s,a)$ appears at least once before $t_k$, the start of the $k$th episode.

Next, we obtain the reward estimate $\hat{r}_k(s,a)$ by computing the sum of the $s$-th element of the observed reward over all epochs where $(s,a)$ appears at least once and dividing it by $N_k(s,a)$. Also, the transition probability estimate $\hat{p}_k(s^{\prime}|s,a)$ is calculated by taking the ratio of the number of times the transition $(s,a,s^{\prime})$ occurs before the $k$th episode and $N_k(s,a)$. It is to be clarified that, due to the delayed composite nature of the reward, the observed reward values that are used for computing $\hat{r}_k(s,a)$ may be contaminated by the components generated by previous actions, which could potentially be different from $a$. Consequently, it might appear that the empirical estimate $\hat{r}_k(s,a)$ may not serve as a good proxy for $r(s,a)$. However, the analysis of our algorithm exhibits that by judiciously steering the exploration, it is still possible to obtain a near-optimal regret.

Using the estimates $\hat{r}_k(s,a)$ and $\hat{p}_k(\cdot|s,a)$, we now define a confidence set $\mathcal{M}_k$ of MDPs that is characterized by (5) and (6). The confidence radius given in (5) is one of the main differences between our algorithm and the UCRL2 algorithm given by (Jaksch et al., 2010). Applying extended value iteration (Appendix A), we then derive a policy $\tilde{\pi}_k$ and an MDP $\tilde{M}_k\in\mathcal{M}_k$ that are near-optimal within the set $\mathcal{M}_k$ in the sense of (7). We keep executing the policy $\tilde{\pi}_k$ until, for at least one state-action pair $(s,a)$, its total number of occurrences within the current epoch, $\nu_k(s,a)$, becomes at least as large as $N_k(s,a)$, the number of its occurrences before the onset of the current epoch. When this criterion is achieved, a new epoch begins and the process described above starts all over again. Observe that, as the executed policies are deterministic, no two distinct pairs $(s,a),(s,a^{\prime})$ can appear in the same epoch for a given state $s$.
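For concreteness, the following sketch (ours; it omits extended value iteration and the epoch stopping rule, and the function and variable names are hypothetical) shows how the per-epoch bookkeeping of Algorithm 1, namely $N_k$, $E_k$, $\hat{r}_k$, and the reward confidence radius of (5), could be maintained from the observed vectors $\boldsymbol{x}_t$.

```python
import numpy as np

def epoch_statistics(history, S, A, d, delta):
    """history: list of epochs, each a list of (s, a, x) tuples where x is the
    observed vector x_t in R^S.  Returns the reward estimate r_hat and the
    reward confidence radius of (5) for every (s, a), as Algorithm 1 would
    compute them at the start of the next epoch."""
    N = np.zeros((S, A))           # visits to (s, a) over the recorded epochs
    E = np.zeros((S, A))           # number of epochs in which (s, a) appeared
    reward_sum = np.zeros((S, A))  # sum of x_t(s) over epochs where (s, a) appeared
    t = 1
    for epoch in history:
        seen = set()
        x_sum = np.zeros(S)        # per-state observed reward inside this epoch
        for (s, a, x) in epoch:
            N[s, a] += 1
            seen.add((s, a))
            x_sum += x
            t += 1
        for (s, a) in seen:        # each state pairs with a single action per epoch
            E[s, a] += 1
            reward_sum[s, a] += x_sum[s]
    denom = np.maximum(N, 1.0)
    r_hat = reward_sum / denom
    conf = np.sqrt(7.0 * np.log(2 * S * A * t / delta) / (2.0 * denom)) + d * E / denom
    return r_hat, conf
```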

We would like to conclude with the remark that all the quantities used in our algorithm can be computed in a recursive manner. Consequently, similar to UCRL2, the space complexity of our algorithm turns out to be $\mathcal{O}(S^{2}A)$, which is independent of $T$.

4 Regret Analysis

Below we state our main result.

Theorem 1

Let $D\triangleq D(M)$. With probability at least $1-\delta$, $\delta\in(0,1)$, for an arbitrary initial state $s$, the regret accumulated by the algorithm $\mathrm{DUCRL2}$ over $T>1$ steps can be bounded above as follows.

$$\mathrm{Reg}(s,\mathrm{DUCRL2},M,T)\leq 34DS\sqrt{AT\log\left(\frac{T}{\delta}\right)}+2d(SA)^{3}\left[\log_{2}\left(\frac{8T}{SA}\right)\right]^{2} \qquad (8)$$

Substituting $\delta=1/T$, we can therefore bound the expected regret as,

$$\mathbb{E}\left[\mathrm{Reg}(s,\mathrm{DUCRL2},M,T)\right]\leq 68DS\sqrt{AT\log\left(T\right)}+2d(SA)^{3}\left[\log_{2}\left(\frac{8T}{SA}\right)\right]^{2} \qquad (9)$$

Theorem 1 states that the regret accumulated by algorithm $\mathrm{DUCRL2}$ is $\tilde{\mathcal{O}}(DS\sqrt{AT}+d(SA)^{3})$, where $\tilde{\mathcal{O}}(\cdot)$ hides the logarithmic factors. (Jaksch et al., 2010) showed that the lower bound on the regret is $\Omega(\sqrt{DSAT})$. As our setup is a generalization of the setup considered in (Jaksch et al., 2010), the same lower bound must also apply to our model. Moreover, if we consider an MDP instance where the rewards arrive with a constant delay, $d$, then in the first $d$ steps, due to the lack of any feedback, every algorithm must obey the regret lower bound $\Omega(d)$. We conclude that the regret lower bound of our setup is $\Omega(\max\{\sqrt{DSAT},d\})=\Omega(\sqrt{DSAT}+d)$. Although this matches the orders of $T$ and $d$ in our regret upper bound, there is still room for improvement in the orders of $D$, $S$, and $A$.

4.1 Proof Sketch of Theorem 1

In this section, we provide a brief sketch of the proof of Theorem 1.

Step 1: The first step is to rewrite the total regret as the sum of regrets accumulated over various epochs. Particularly, we show that with probability at least $1-\delta/12T^{5/4}$, $\delta>0$, the following bound holds.

$$\mathrm{Reg}(s,\mathrm{DUCRL2},M,T)\leq\underbrace{\sum_{k=1}^{m}\mathrm{Reg}_{k}}_{\triangleq Q_{1}}+\underbrace{\sqrt{\frac{5T}{8}\log\left(\frac{8T}{\delta}\right)}}_{\triangleq Q_{2}}$$

The term $\mathrm{Reg}_k$ can be defined as the regret accumulated over epoch $k$ (a precise definition is given in the appendix), and $m$ is such that $T$ lies in the $(m-1)$th epoch. The additional term $Q_2$ appears due to the stochasticity of the observed reward instances. We now divide $Q_1$ into two parts as follows.

$$Q_{1}=\underbrace{\sum_{k=1}^{m}\mathrm{Reg}_{k}\mathrm{1}(M\notin\mathcal{M}_{k})}_{\triangleq Q_{11}}+\underbrace{\sum_{k=1}^{m}\mathrm{Reg}_{k}\mathrm{1}(M\in\mathcal{M}_{k})}_{\triangleq Q_{12}}$$

Step 2: In order to bound $Q_{11}$, it is important to have an estimate of $\mathrm{Pr}(M\notin\mathcal{M}_k)$, which we obtain in Lemma 1. We would like to elaborate that although the bound given in Lemma 1 is similar to that given in (Jaksch et al., 2010), the proof techniques are different. In particular, here we account for the fact that the reward estimates $\{\hat{r}_k(s,a)\}_{(s,a)\in\mathcal{S}\times\mathcal{A}}$ of the $k$th epoch are potentially corrupted by delayed effects of past actions. We resolve this problem by proving the following inequality (see (32)).

$$\left|\hat{r}_{k}(s,a)-\frac{1}{N_{k}(s,a)}\sum_{0<\tau<t_{k}}||\boldsymbol{r}_{\tau}(s,a)||_{1}\mathrm{1}(s_{\tau}=s,a_{\tau}=a)\right|\leq d\,\frac{E_{k}(s,a)}{\max\{N_{k}(s,a),1\}} \qquad (10)$$

Here the second term on the LHS denotes an estimate of $r(s,a)$ that the learner would have obtained had there been no delay in the observation of the reward instances. In other words, inequality (10) estimates the gap between an MDP with delayed observations and a hypothetical MDP without any delayed effects. Observe that the RHS of (10) also appears in the confidence radius (5). Therefore, the cost of incorporating delay is a looser confidence in the reward estimates.

Step 3: Using Lemma 1, $Q_{11}$ is bounded by $\sqrt{T}$ with high probability (Lemma 2).

Step 4: We now focus on bounding the other term, $Q_{12}$. Lemma 3 shows that $\mathrm{Reg}_k\leq J_k^1+J_k^2+J_k^3$, where the precise definitions of the terms $\{J_k^i\}_{i\in\{1,2,3\}}$ are given in Appendix B.2. Furthermore, it also proves that the following bound holds with high probability.

$$\sum_{k=1}^{m}J_{k}^{1}\mathrm{1}(M\in\mathcal{M}_{k})=\tilde{\mathcal{O}}\left(\sqrt{T}+\sum_{k=1}^{m}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\frac{\nu_{k}(s,a)}{\sqrt{\max\{N_{k}(s,a),1\}}}\right)$$

where ๐’ช~โ€‹(โ‹…)\tilde{\mathcal{O}}(\cdot) hides logarithmic factors and terms related to D,S,AD,S,A. The other notations are identical to that given in section 3.

Step 5: The second term, $J_k^2$, is bounded as follows.

$$J_{k}^{2}\triangleq\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\nu_{k}(s,a)(\tilde{r}_{k}(s,a)-r(s,a))$$
$$\leq\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\nu_{k}(s,a)|\tilde{r}_{k}(s,a)-\hat{r}_{k}(s,a)|+\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\nu_{k}(s,a)|\hat{r}_{k}(s,a)-r(s,a)|$$

Notice that the first term can be bounded by invoking (5). The same inequality can also be used to bound the second term provided that $M\in\mathcal{M}_k$, which, as we have stated before, is a high-probability event (Lemma 1). Using this logic, and some algebraic manipulations, we finally obtain the following high-probability bound.

$$\sum_{k=1}^{m}J_{k}^{2}\mathrm{1}(M\in\mathcal{M}_{k})=\tilde{\mathcal{O}}\left(\sum_{k=1}^{m}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\frac{\nu_{k}(s,a)}{\sqrt{\max\{N_{k}(s,a),1\}}}+2dSAm^{2}\right)$$

Step 6: Finally, we obtain the following inequality related to the third term.

$$\sum_{k=1}^{m}J_{k}^{3}\mathrm{1}(M\in\mathcal{M}_{k})\leq 2\sum_{k=1}^{m}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\frac{\nu_{k}(s,a)}{\sqrt{\max\{N_{k}(s,a),1\}}}$$

Step 7: Combining, we derive the bound stated below with high probability.

$$Q_{1}=Q_{11}+Q_{12}\leq\sqrt{T}+\sum_{i=1}^{3}\sum_{k=1}^{m}J_{k}^{i}\mathrm{1}\left(M\in\mathcal{M}_{k}\right)$$
$$=\tilde{\mathcal{O}}\left(\sqrt{T}+\sum_{k=1}^{m}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\frac{\nu_{k}(s,a)}{\sqrt{\max\{N_{k}(s,a),1\}}}+2dSAm^{2}\right)$$

Step 8: We conclude the proof using Lemma 4 which states that,

$$m\leq SA\log_{2}\left(\frac{8T}{SA}\right),\quad\text{and}\quad\sum_{k=1}^{m}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\frac{\nu_{k}(s,a)}{\sqrt{\max\{N_{k}(s,a),1\}}}\leq(\sqrt{2}+1)\sqrt{SAT}$$

4.2 Limitation

It is important to understand why our approach works with partial anonymity but not with full anonymity. Fix a state $s\in\mathcal{S}$ and an epoch $j$. Recall from Section 3 that no two distinct pairs $(s,a),(s,a^{\prime})$ can appear in the same epoch. Utilizing this property, we can write down the following relation for a pair $(s,a)$ that appears in the $j$th epoch.

$$\underbrace{\sum_{t_{j}\leq\tau<t_{j+1}}\boldsymbol{x}_{\tau}(s)}_{\triangleq R_{0}}=\underbrace{\sum_{0<\tau<t_{j}}\sum_{\tau_{1}\geq t_{j}-\tau}r_{\tau,\tau_{1}}(s_{\tau},a_{\tau})\mathrm{1}(s_{\tau}=s)}_{\triangleq R_{1}}+\underbrace{\sum_{t_{j}\leq\tau<t_{j+1}}||\boldsymbol{r}_{\tau}(s,a)||_{1}\mathrm{1}(s_{\tau}=s)}_{\triangleq R_{2}}-\underbrace{\sum_{0<\tau<t_{j+1}}\sum_{\tau_{1}\geq t_{j+1}-\tau}r_{\tau,\tau_{1}}(s_{\tau},a_{\tau})\mathrm{1}(s_{\tau}=s)}_{\triangleq R_{3}} \qquad (11)$$

The term $R_0$ denotes the sum of the $s$-th components of all observed reward vectors within epoch $j$. Some portion of $R_0$ is due to actions taken before the onset of the $j$th epoch. This past contribution is denoted by the term $R_1$. The rest of $R_0$ is entirely contributed by the actions taken within epoch $j$. The term $R_2$ denotes the sum of all the rewards generated as a result of actions taken within epoch $j$. However, due to the delayed nature of the reward, some portion of $R_2$ will be realized after the $j$th epoch. This spillover part is termed $R_3$. Using Assumption 2, we can write $0\leq R_1,R_3\leq d$ with high probability, which leads to the following relation.

$$-d\leq R_{0}-R_{2}\leq d \qquad (12)$$

Eq. (12) is the first step in establishing (10). We would like to emphasize the fact that the term $R_2$ is entirely contributed by the $(s,a)$ pairs appearing in epoch $j$ (i.e., there is no contamination from actions other than $a$). In other words, although our estimate $\hat{r}(s,a)$ is based on the contaminated observation $R_0$, we demonstrate that it is not far away from an uncontaminated estimate. This key feature makes $\mathrm{DUCRL2}$ successful despite having only partial anonymity. On the other hand, if rewards were fully anonymous, then (11) would change as follows.

$$\underbrace{\sum_{t_{j}\leq\tau<t_{j+1}}\sum_{s\in\mathcal{S}}\boldsymbol{x}_{\tau}(s)}_{\triangleq\tilde{R}_{0}}=\underbrace{\sum_{0<\tau<t_{j}}\sum_{\tau_{1}\geq t_{j}-\tau}r_{\tau,\tau_{1}}(s_{\tau},a_{\tau})}_{\triangleq\tilde{R}_{1}}+\underbrace{\sum_{t_{j}\leq\tau<t_{j+1}}||\boldsymbol{r}_{\tau}(s_{\tau},a_{\tau})||_{1}}_{\triangleq\tilde{R}_{2}}-\underbrace{\sum_{0<\tau<t_{j+1}}\sum_{\tau_{1}\geq t_{j+1}-\tau}r_{\tau,\tau_{1}}(s_{\tau},a_{\tau})}_{\triangleq\tilde{R}_{3}} \qquad (13)$$

Note that in (13), the term $\tilde{R}_2$ is a mixture of contributions from various state-action pairs, unlike $R_2$ in (11). This makes our algorithm ineffective in the presence of full anonymity.

Another limitation of our approach is that the delay parameter $d$ is used as an input to Algorithm 1. One can therefore ask how the regret bound changes if an incorrect estimate $\hat{d}$ of $d$ is used in the algorithm. One can easily prove that if $\hat{d}>d$, then the regret bound changes to $\mathcal{O}(DS\sqrt{AT}+\hat{d}(SA)^{3})$. However, if $\hat{d}<d$, then Lemma 1 no longer works, and consequently, the analysis does not yield any sub-linear regret.

5 Conclusion

In this work, we addressed the challenging problem of designing learning algorithms for infinite-horizon Markov Decision Processes with delayed, composite, and partially anonymous rewards. We proposed an algorithm that achieves near-optimal performance and derived a regret bound that matches the existing lower bound in the time horizon while demonstrating an additive impact of delay on the regret. Our work is the first to consider partially anonymous rewards in the MDP setting with delayed feedback.

Possible future work includes extending the analysis to the more general scenario of fully anonymous, delayed, and composite rewards, which has important applications in many domains. This extension poses theoretical challenges, and we believe it is an exciting direction for future research. Overall, we hope our work provides a useful contribution to the reinforcement learning community and inspires further research on this important topic.

References

  • Agarwal and Aggarwal [2021] Mridul Agarwal and Vaneet Aggarwal. Blind decision making: Reinforcement learning with delayed observations. Pattern Recognition Letters, 150:176–182, 2021.
  • Agarwal and Aggarwal [2023] Mridul Agarwal and Vaneet Aggarwal. Reinforcement learning for joint optimization of multiple rewards. J. Mach. Learn. Res., 24:49:1–49:41, 2023. URL http://jmlr.org/papers/v24/19-980.html.
  • Agarwal et al. [2022] Mridul Agarwal, Qinbo Bai, and Vaneet Aggarwal. Regret guarantees for model-based reinforcement learning with long-term average constraints. In Uncertainty in Artificial Intelligence, pages 22–31. PMLR, 2022.
  • Agarwal et al. [2023] Mridul Agarwal, Qinbo Bai, and Vaneet Aggarwal. Concave utility reinforcement learning with zero-constraint violations. Transactions on Machine Learning Research, 2023.
  • Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
  • Bistritz et al. [2019] Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, and Jose Blanchet. Online EXP3 learning in adversarial bandits with delayed feedback. Advances in Neural Information Processing Systems, 32, 2019.
  • Bouteiller et al. [2021] Yann Bouteiller, Simon Ramstedt, Giovanni Beltrame, Christopher Pal, and Jonathan Binas. Reinforcement learning with random delays. In International Conference on Learning Representations, 2021.
  • Cesa-Bianchi et al. [2016] Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour, and Alberto Minora. Delay and cooperation in nonstochastic bandits. In Conference on Learning Theory, pages 605–622. PMLR, 2016.
  • Fruit et al. [2018] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, pages 1578–1586. PMLR, 2018.
  • Gael et al. [2020] Manegueu Anne Gael, Claire Vernade, Alexandra Carpentier, and Michal Valko. Stochastic bandits with arm-dependent delays. In International Conference on Machine Learning, pages 3348–3356. PMLR, 2020.
  • Howson et al. [2023] Benjamin Howson, Ciara Pike-Burke, and Sarah Filippi. Optimism and delays in episodic reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 6061–6094. PMLR, 2023.
  • Ito et al. [2020] Shinji Ito, Daisuke Hatano, Hanna Sumita, Kei Takemura, Takuro Fukunaga, Naonori Kakimura, and Ken-Ichi Kawarabayashi. Delay and cooperation in nonstochastic linear bandits. Advances in Neural Information Processing Systems, 33:4872–4883, 2020.
  • Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
  • Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? Advances in Neural Information Processing Systems, 31, 2018.
  • Jin and Luo [2020] Tiancheng Jin and Haipeng Luo. Simultaneously learning stochastic and adversarial episodic MDPs with known transition. Advances in Neural Information Processing Systems, 33:16557–16566, 2020.
  • Jin et al. [2022] Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, and Aviv Rosenberg. Near-optimal regret for adversarial MDP with delayed bandit feedback. In Advances in Neural Information Processing Systems, 2022.
  • Lancewicki et al. [2021] Tal Lancewicki, Shahar Segal, Tomer Koren, and Yishay Mansour. Stochastic multi-armed bandits with unrestricted delay distributions. In International Conference on Machine Learning, pages 5969–5978. PMLR, 2021.
  • Lancewicki et al. [2022] Tal Lancewicki, Aviv Rosenberg, and Yishay Mansour. Learning adversarial Markov decision processes with delayed feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7281–7289, 2022.
  • Pedramfar and Aggarwal [2023] Mohammad Pedramfar and Vaneet Aggarwal. Stochastic submodular bandits with delayed composite anonymous bandit feedback. arXiv preprint arXiv:2303.13604, 2023.
  • Pike-Burke et al. [2018] Ciara Pike-Burke, Shipra Agrawal, Csaba Szepesvari, and Steffen Grunewalder. Bandits with delayed, aggregated anonymous feedback. In International Conference on Machine Learning, pages 4105–4113. PMLR, 2018.
  • Puterman [2014] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  • Quanrud and Khashabi [2015] Kent Quanrud and Daniel Khashabi. Online learning with adversarial delays. Advances in Neural Information Processing Systems, 28, 2015.
  • Rosenberg and Mansour [2019] Aviv Rosenberg and Yishay Mansour. Online convex optimization in adversarial Markov decision processes. In International Conference on Machine Learning, pages 5478–5486. PMLR, 2019.
  • Shani et al. [2020] Lior Shani, Yonathan Efroni, Aviv Rosenberg, and Shie Mannor. Optimistic policy optimization with bandit feedback. In International Conference on Machine Learning, pages 8604–8613. PMLR, 2020.
  • Thune et al. [2019] Tobias Sommer Thune, Nicolò Cesa-Bianchi, and Yevgeny Seldin. Nonstochastic multiarmed bandits with unrestricted delays. Advances in Neural Information Processing Systems, 32, 2019.
  • Vernade et al. [2020] Claire Vernade, Alexandra Carpentier, Tor Lattimore, Giovanni Zappella, Beyza Ermis, and Michael Brueckner. Linear bandits with stochastic delayed feedback. In International Conference on Machine Learning, pages 9712–9721. PMLR, 2020.
  • Wang et al. [2021] Siwei Wang, Haoyun Wang, and Longbo Huang. Adaptive algorithms for multi-armed bandit with composite and anonymous feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10210–10217, 2021.
  • Weissman et al. [2003] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep., 2003.
  • Zanette et al. [2020] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning, pages 10978–10989. PMLR, 2020.
  • Zhang et al. [2022] Yuyang Zhang, Runyu Zhang, Gen Li, Yuantao Gu, and Na Li. Multi-agent reinforcement learning with reward delays. arXiv preprint arXiv:2212.01441, 2022.
  • Zhou et al. [2019] Zhengyuan Zhou, Renyuan Xu, and Jose Blanchet. Learning in generalized linear contextual bandits with stochastic delays. Advances in Neural Information Processing Systems, 32, 2019.
  • Zimmert and Seldin [2020] Julian Zimmert and Yevgeny Seldin. An optimal algorithm for adversarial bandits with arbitrary delays. In International Conference on Artificial Intelligence and Statistics, pages 3285–3294. PMLR, 2020.

Appendix A Extended Value Iteration

Algorithm 2 Extended Value Iteration
1: Input: $\{d(s,a),\hat{r}(s,a),\mathcal{P}(s,a)\}_{(s,a)\in\mathcal{S}\times\mathcal{A}}$, $\epsilon>0$
2: Initialization: $\boldsymbol{u}_0\triangleq\{u_0(s)\}_{s\in\mathcal{S}}\leftarrow\mathbf{0}$, $i\leftarrow 0$, $\mathrm{error}\leftarrow 2\epsilon$
3: while $\mathrm{error}\geq\epsilon$ do
4:  for $s\in\mathcal{S}$ do
5:   $u_{i+1}(s)=\max_{a\in\mathcal{A}}\left\{\hat{r}(s,a)+d(s,a)+\max_{p(\cdot)\in\mathcal{P}(s,a)}\sum_{s^{\prime}\in\mathcal{S}}p(s^{\prime}|s,a)u_{i}(s^{\prime})\right\}$
6:  end for
7:  $\mathrm{error}\leftarrow\max_{s\in\mathcal{S}}\{u_{i+1}(s)-u_{i}(s)\}-\min_{s\in\mathcal{S}}\{u_{i+1}(s)-u_{i}(s)\}$
8:  $i\leftarrow i+1$
9: end while

Here $d(s,a)$ can be thought of as the confidence radius depicted in (5), whereas $\mathcal{P}(s,a)$ is the set of probability vectors that satisfy (6). Note that the stopping criterion for Algorithm 2 is the following.

$$\max_{s\in\mathcal{S}}\{u_{i+1}(s)-u_{i}(s)\}-\min_{s\in\mathcal{S}}\{u_{i+1}(s)-u_{i}(s)\}<\epsilon \qquad (14)$$

In the context of Algorithm 1, we can take $\epsilon=1/\sqrt{t_k}$. Theorem 7 of [Jaksch et al., 2010] guarantees that the greedy policy deduced from the terminal utility vector $\boldsymbol{u}_i$ of Algorithm 2 is $\epsilon$-optimal in the sense of (7), provided that the set of MDPs whose transition probability distribution $p(\cdot|s,a)$ lies in the confidence set $\mathcal{P}(s,a)$, and whose reward function $r(s,a)$ lies at most $d(s,a)$ away from the estimate $\hat{r}(s,a)$, comprises at least one MDP with a finite diameter. As a consequence of this result, we can write the following corollary.

Corollary 1

Let $\mathcal{M}$ be the collection of MDPs whose reward function $r(\cdot,\cdot)$ and transition function $p(\cdot|\cdot,\cdot)$ satisfy the following for given $\{\hat{r}(s,a),d(s,a),\mathcal{P}(s,a)\}_{(s,a)\in\mathcal{S}\times\mathcal{A}}$.

$$|r(s,a)-\hat{r}(s,a)|\leq d(s,a),$$
$$p(\cdot|s,a)\in\mathcal{P}(s,a)$$

If the true MDP $M$ lies in $\mathcal{M}$, then Algorithm 2 always converges. Moreover, if $\boldsymbol{u}_i$ indicates the terminal utility vector for a given $\epsilon>0$, and $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$,

$$(\tilde{\pi}(s),\tilde{p}(\cdot|s,a))\triangleq\underset{a\in\mathcal{A},~p(\cdot)\in\mathcal{P}(s,a)}{\arg\max}\left\{\hat{r}(s,a)+d(s,a)+\sum_{s^{\prime}\in\mathcal{S}}p(s^{\prime}|s,a)u_{i}(s^{\prime})\right\}, \qquad (15)$$

then the following inequality holds.

$$\min_{s\in\mathcal{S}}\rho(s,\tilde{\pi},\tilde{M})\geq\max_{M^{\prime}\in\mathcal{M},\pi,s^{\prime}\in\mathcal{S}}\rho(s^{\prime},\pi,M^{\prime})-\epsilon$$

where $\tilde{M}$ is an MDP with transition function $\tilde{p}(\cdot|\cdot,\cdot)$ defined by (15), and reward function $\tilde{r}$ that obeys $\tilde{r}(s,a)=\hat{r}(s,a)+d(s,a)$, $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$.

Corollary 1 is easily established by observing that the true MDP $M$ is assumed to have a finite diameter. It is also worthwhile to state that the complexity of updating the vector $\boldsymbol{u}_i$ is $\mathcal{O}(S^{2}A)$, as discussed in Section 3.1.2 of [Jaksch et al., 2010].
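A minimal sketch (ours) of one sweep of Algorithm 2, using the standard inner maximization over the $\ell_1$ confidence ball from Jaksch et al. [2010] (shift probability mass toward the next state with the largest $u_i$, removing it from states with the smallest $u_i$); the array shapes and names below are our assumptions.

```python
import numpy as np

def inner_max(p_hat, radius, u):
    """max_{p : ||p - p_hat||_1 <= radius} <p, u>: put as much mass as allowed on
    the state with the largest u, removing it from states with the smallest u."""
    p = p_hat.copy()
    best = int(np.argmax(u))
    p[best] = min(1.0, p_hat[best] + radius / 2.0)
    for s in np.argsort(u):              # states with small u lose mass first
        if p.sum() <= 1.0 + 1e-12:
            break
        p[s] = max(0.0, p[s] - (p.sum() - 1.0))
    return p

def evi_sweep(u, r_hat, conf_r, p_hat, conf_p):
    """One update u_{i+1}(s) = max_a { r_hat(s,a) + conf_r(s,a)
                                       + max_{p in P(s,a)} sum_{s'} p(s') u_i(s') }.
    Shapes: r_hat, conf_r, conf_p are (S, A); p_hat is (S, A, S)."""
    S, A = r_hat.shape
    u_next = np.empty(S)
    for s in range(S):
        u_next[s] = max(r_hat[s, a] + conf_r[s, a]
                        + inner_max(p_hat[s, a], conf_p[s, a], u) @ u
                        for a in range(A))
    return u_next
```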

Appendix B Proof of Theorem 1

Let $T$ be such that $t_{m-1}\leq T<t_{m}$. Clearly,

$$T<t_{m}=\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}N_{m}(s,a)=\sum_{k=1}^{m}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\nu_{k}(s,a)$$

Using the above relation, the regret given in (3) can be rewritten as,

$$\mathrm{Reg}(s,\mathrm{DUCRL2},M,T)=T\rho^{*}(M)-\sum_{t=1}^{T}||\boldsymbol{r}_{t}(s_{t},a_{t})||_{1}$$
$$<\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}N_{m}(s,a)\left[\rho^{*}(M)-r(s,a)\right]+\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}N_{m}(s,a)r(s,a)-\sum_{t=1}^{T}||\boldsymbol{r}_{t}(s_{t},a_{t})||_{1}$$
$$=\sum_{k=1}^{m}\underbrace{\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\nu_{k}(s,a)\left[\rho^{*}(M)-r(s,a)\right]}_{\triangleq\mathrm{Reg}_{k}}+\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}N_{m}(s,a)r(s,a)-\sum_{t=1}^{T}||\boldsymbol{r}_{t}(s_{t},a_{t})||_{1}$$

The term $\mathrm{Reg}_k$ can be interpreted as the regret accumulated over epoch $k$. Observe that, for a given history of state-action evolution $\mathcal{H}_T\triangleq\{(s_t,a_t)\}_{t=1}^{T}$ up to time $T$, the collection of random variables $\{||\boldsymbol{r}_t(s_t,a_t)||_1\}_{t=1}^{T}$ are mutually independent. Moreover,

$$\mathbb{E}\left[\sum_{t=1}^{T}||\boldsymbol{r}_{t}(s_{t},a_{t})||_{1}\,\Big|\,\mathcal{H}_{T}\right]=\sum_{t=1}^{T}r(s_{t},a_{t})=\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}r(s,a)\sum_{t=1}^{T}\mathrm{1}(s_{t}=s,a_{t}=a)=\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}N_{m}(s,a)r(s,a)$$

Using Hoeffding's inequality, we therefore obtain,

$$\mathrm{Pr}\left\{\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}N_{m}(s,a)r(s,a)-\sum_{t=1}^{T}||\boldsymbol{r}_{t}(s_{t},a_{t})||_{1}>\sqrt{\frac{5T}{8}\log\left(\frac{8T}{\delta}\right)}\,\Big|\,\mathcal{H}_{T}\right\}\leq\frac{\delta}{12T^{\frac{5}{4}}}$$

This implies that, with probability at least $1-\delta/12T^{\frac{5}{4}}$, the following holds.

$$\mathrm{Reg}(s,\mathrm{DUCRL2},M,T)\leq\sum_{k=1}^{m}\mathrm{Reg}_{k}+\sqrt{\frac{5T}{8}\log\left(\frac{8T}{\delta}\right)} \qquad (16)$$

B.1 Regret bound on episodes where $M$ lies outside the confidence set

Recall that $\mathcal{M}_k$ is defined to be a collection of MDPs that obey the confidence bounds (5) and (6). In this subsection, we calculate the regret contribution of the episodes where the true MDP $M$ does not satisfy these bounds. The following lemma provides an upper bound estimate of the probability that the true MDP $M$ does not lie in the confidence set $\mathcal{M}_k$.

Lemma 1

$$\mathrm{Pr}\left\{M\notin\mathcal{M}_{k}\right\}\leq\frac{\delta}{15t_{k}^{6}}$$

The proof of Lemma 1 is relegated to Appendix C. Although the final result in Lemma 1 is the same as in [Jaksch et al., 2010], the proof techniques are quite different. In particular, we need to account for the fact that reward estimates might be corrupted by contributions originating from various past actions. Using Lemma 1, the following bound can be obtained.

Lemma 2

[Jaksch et al., 2010] With probability at least $1-\delta/12T^{\frac{5}{4}}$,

$$\sum_{k=1}^{m}\mathrm{Reg}_{k}\mathrm{1}(M\notin\mathcal{M}_{k})\leq\sqrt{T} \qquad (17)$$

We would like to mention here that although the definition of $\mathcal{M}_k$ used in our article is different from that given in [Jaksch et al., 2010], the above result still holds. This is mainly because the only property of $\mathcal{M}_k$ that is invoked to prove Lemma 2 is provided in Lemma 1, which is the same as in [Jaksch et al., 2010].

B.2 Regret bound on episodes where $M$ lies inside the confidence set

Let $k$ be the index of an episode where $M\in\mathcal{M}_{k}$, and let $\boldsymbol{u}_{k}=\{u_{k}(s)\}_{s\in\mathcal{S}}$ be the terminal utility vector obtained via extended value iteration in the $k$th episode. Define $\boldsymbol{w}_{k}=\{w_{k}(s)\}_{s\in\mathcal{S}}$ as

\displaystyle w_{k}(s)\triangleq u_{k}(s)-\dfrac{\max_{s^{\prime}\in\mathcal{S}}u_{k}(s^{\prime})+\min_{s^{\prime}\in\mathcal{S}}u_{k}(s^{\prime})}{2}
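
As argued in [Jaksch et al., 2010], extended value iteration guarantees that the span of $\boldsymbol{u}_{k}$ is at most the diameter; this is the property that ultimately introduces the factor $D$ in (18) below. Concretely,

\max_{s\in\mathcal{S}}u_{k}(s)-\min_{s\in\mathcal{S}}u_{k}(s)\leq D\quad\Longrightarrow\quad|w_{k}(s)|\leq\frac{D}{2}\quad\forall s\in\mathcal{S}.
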
Lemma 3

[Jaksch et al., 2010] If $k$ is such that $M\in\mathcal{M}_{k}$, then

\displaystyle\mathrm{Reg}_{k}\leq\underbrace{\sum_{s\in\mathcal{S}}\nu_{k}(s,\tilde{\pi}_{k}(s))\left[\sum_{s^{\prime}\in\mathcal{S}}\tilde{p}_{k}(s^{\prime}|s,\tilde{\pi}_{k}(s))w_{k}(s^{\prime})-w_{k}(s)\right]}_{\triangleq J_{k}^{1}}
+\underbrace{\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\nu_{k}(s,a)(\tilde{r}_{k}(s,a)-r(s,a))}_{\triangleq J_{k}^{2}}+\underbrace{2\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\dfrac{\nu_{k}(s,a)}{\sqrt{t_{k}}}}_{\triangleq J_{k}^{3}}

where $\tilde{r}_{k},\tilde{p}_{k},\tilde{\pi}_{k}$ are, respectively, the reward function, the transition function, and the policy of the MDP $\tilde{M}_{k}$. Moreover,

\begin{split}\sum_{k=1}^{m}J_{k}^{1}\mathrm{1}(M\in\mathcal{M}_{k})\leq D&\sqrt{\frac{5}{2}T\log\left(\frac{8T}{\delta}\right)}+DSA\log_{2}\left(\frac{8T}{SA}\right)\\ +D&\sqrt{14S\log\left(\frac{2AT}{\delta}\right)}\sum_{k=1}^{m}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\dfrac{\nu_{k}(s,a)}{\sqrt{\max\{N_{k}(s,a),1\}}}\end{split} (18)

with probability at least $1-\delta/12T^{\frac{5}{4}}$.

The only properties of $\mathcal{M}_{k}$ that are used in the proof of Lemma 3 are (6) and (7), which are the same as those given in [Jaksch et al., 2010]. Note that the term $J_{k}^{2}$ defined in Lemma 3 can be bounded as follows,

\begin{split}J_{k}^{2}&\leq\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\nu_{k}(s,a)|\tilde{r}_{k}(s,a)-\hat{r}_{k}(s,a)|+\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\nu_{k}(s,a)|\hat{r}_{k}(s,a)-r(s,a)|\\ &\overset{(a)}{\leq}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}2\nu_{k}(s,a)\sqrt{\dfrac{7\log\left(\frac{2SAT}{\delta}\right)}{2\max\{N_{k}(s,a),1\}}}+\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\dfrac{2d\nu_{k}(s,a)}{\max\{N_{k}(s,a),1\}}E_{k}(s,a)\end{split} (19)

where $(a)$ applies the facts that $M,\tilde{M}_{k}\in\mathcal{M}_{k}$ and $t_{k}\leq T$. Note that our algorithm enforces $\nu_{k}(s,a)\leq\max\{N_{k}(s,a),1\}$. Therefore, $J_{k}^{2}$ can be further bounded as,

\displaystyle J_{k}^{2}\leq\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\dfrac{\nu_{k}(s,a)}{\sqrt{\max\{N_{k}(s,a),1\}}}\sqrt{14\log\left(\frac{2SAT}{\delta}\right)}+2d\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}E_{k}(s,a) (20)
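
The passage from (19) to (20) uses only the termwise bound $\nu_{k}(s,a)\leq\max\{N_{k}(s,a),1\}$ together with the identity $2\sqrt{7/2}=\sqrt{14}$; as a quick check,

2\nu_{k}(s,a)\sqrt{\dfrac{7\log\left(\frac{2SAT}{\delta}\right)}{2\max\{N_{k}(s,a),1\}}}=\dfrac{\nu_{k}(s,a)}{\sqrt{\max\{N_{k}(s,a),1\}}}\sqrt{14\log\left(\frac{2SAT}{\delta}\right)},\qquad\dfrac{2d\nu_{k}(s,a)E_{k}(s,a)}{\max\{N_{k}(s,a),1\}}\leq 2dE_{k}(s,a).
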

Observe that $E_{k}(s,a)\leq m$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$. Therefore,

\displaystyle\sum_{k=1}^{m}J_{k}^{2}\mathrm{1}(M\in\mathcal{M}_{k})\leq\sqrt{14\log\left(\frac{2SAT}{\delta}\right)}\sum_{k=1}^{m}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\dfrac{\nu_{k}(s,a)}{\sqrt{\max\{N_{k}(s,a),1\}}}+2dSAm^{2} (21)
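
The last term of (21) is the source of the additive delay penalty in the final regret bound: using the bound on $m$ from (23) in Lemma 4 below,

2dSAm^{2}\leq 2dSA\left(SA\log_{2}\left(\frac{8T}{SA}\right)\right)^{2}=2d(SA)^{3}\left[\log_{2}\left(\frac{8T}{SA}\right)\right]^{2},

which is exactly the delay-dependent term appearing in (25).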

Finally, applying the inequality $\max\{N_{k}(s,a),1\}\leq t_{k}$, we obtain the following bound,

\displaystyle\sum_{k=1}^{m}J_{k}^{3}\mathrm{1}(M\in\mathcal{M}_{k})\leq 2\sum_{k=1}^{m}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\dfrac{\nu_{k}(s,a)}{\sqrt{\max\{N_{k}(s,a),1\}}} (22)

We now use the following lemma to simplify the upper bounds.

Lemma 4

[Jaksch et al., 2010] The following inequalities hold true:

\displaystyle m\leq SA\log_{2}\left(\frac{8T}{SA}\right), (23)
\displaystyle\sum_{k=1}^{m}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\dfrac{\nu_{k}(s,a)}{\sqrt{\max\{N_{k}(s,a),1\}}}\leq(\sqrt{2}+1)\sqrt{SAT} (24)
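
A brief sketch of why (24) holds, writing $N(s,a)\triangleq\sum_{k=1}^{m}\nu_{k}(s,a)$ for the total visit count (notation introduced only for this sketch): a lemma in the appendix of [Jaksch et al., 2010] gives the per-pair bound below, and the Cauchy-Schwarz inequality together with $\sum_{(s,a)}N(s,a)\leq T$ gives the rest.

\sum_{k=1}^{m}\dfrac{\nu_{k}(s,a)}{\sqrt{\max\{N_{k}(s,a),1\}}}\leq(\sqrt{2}+1)\sqrt{N(s,a)},\qquad\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\sqrt{N(s,a)}\leq\sqrt{SA\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}N(s,a)}\leq\sqrt{SAT}.
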

Using Lemma 4 and combining (18), (21), and (22), we conclude that the following holds with probability at least $1-\delta/12T^{\frac{5}{4}}$.

\begin{split}\sum_{k=1}^{m}\mathrm{Reg}_{k}&\mathrm{1}(M\in\mathcal{M}_{k})\leq D\sqrt{\frac{5}{2}T\log\left(\frac{8T}{\delta}\right)}+DSA\log_{2}\left(\frac{8T}{SA}\right)\\ +&2(\sqrt{2}+1)\left[D\sqrt{14S\log\left(\frac{2AT}{\delta}\right)}+1\right]\sqrt{SAT}+2d(SA)^{3}\left[\log_{2}\left(\frac{8T}{SA}\right)\right]^{2}\end{split} (25)

B.3 Obtaining the Total Regret

Combining (16), (17), and (25), we can now establish that the following inequality is satisfied with probability at least $1-\delta/12T^{5/4}-\delta/12T^{5/4}-\delta/12T^{5/4}=1-\delta/4T^{\frac{5}{4}}$.

\displaystyle\mathrm{Reg}(s,\mathrm{DUCRL2},M,T)\leq 34DS\sqrt{AT\log\left(\frac{T}{\delta}\right)}+2d(SA)^{3}\left[\log_{2}\left(\frac{8T}{SA}\right)\right]^{2}

Taking a union bound over $T$ and noting that $\sum_{T=2}^{\infty}\delta/4T^{\frac{5}{4}}<\delta$, we conclude the theorem.

Appendix C Proof of Lemma 1

The probability that the $L_{1}$-deviation between the true and the empirical distributions of $l$ events over $n$ independent samples exceeds $\epsilon$ can be bounded as follows [Weissman et al., 2003]:

\displaystyle\mathrm{Pr}\left\{||\hat{p}(\cdot)-p(\cdot)||_{1}>\epsilon\right\}\leq(2^{l}-2)\exp\left(-\frac{n\epsilon^{2}}{2}\right) (26)

Assume, without loss of generality, that $N_{k}(s,a)\geq 1$. In our case, inequality (26) can be utilized to obtain the following bound for all $(s,a)\in\mathcal{S}\times\mathcal{A}$.

\begin{split}&\mathrm{Pr}\left\{||\hat{p}_{k}(\cdot|s,a)-p(\cdot|s,a)||_{1}>\sqrt{\dfrac{14S\log\left(2At_{k}/\delta\right)}{N_{k}(s,a)}}\right\}\\ &\overset{(a)}{\leq}\sum_{0<n<t_{k}}\mathrm{Pr}\left\{||\hat{p}_{k}(\cdot|s,a)-p(\cdot|s,a)||_{1}>\sqrt{\dfrac{14S\log\left(2At_{k}/\delta\right)}{n}}\right\}\\ &\leq\sum_{0<n<t_{k}}2^{S}\exp\left(-7S\log\left(2At_{k}/\delta\right)\right)\leq\sum_{0<n<t_{k}}\dfrac{\delta}{20SAt_{k}^{7}}=\dfrac{\delta}{20SAt_{k}^{6}}\end{split} (27)
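
For concreteness, here is a sketch of why each summand above is at most $\delta/(20SAt_{k}^{7})$; it assumes $\delta\in(0,1)$ and uses only $A,t_{k}\geq 1$ and $S\geq 1$:

2^{S}\exp\left(-7S\log\left(\frac{2At_{k}}{\delta}\right)\right)=2^{S}\left(\frac{\delta}{2At_{k}}\right)^{7S}\leq 2^{7-6S}\left(\frac{\delta}{2At_{k}}\right)^{7}=\frac{\delta^{7}}{2^{6S}A^{7}t_{k}^{7}}\leq\frac{\delta}{20SAt_{k}^{7}},

where the first inequality uses $\delta/(2At_{k})\leq 1/2$, and the last uses $\delta\leq 1$, $A\geq 1$, and $20S\leq 2^{6S}$.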

Inequality $(a)$ is an application of the union bound. Recall that $\mathcal{E}_{j}(s,a)$ indicates whether the pair $(s,a)$ appears in the $j$th episode. Using this notation, we deduce that if $\mathcal{E}_{j}(s,a)=1$ for some $(s,a)\in\mathcal{S}\times\mathcal{A}$, then

\begin{split}\sum_{t_{j}\leq\tau<t_{j+1}}\boldsymbol{x}_{\tau}(s)=\sum_{t_{j}\leq\tau<t_{j+1}}||\boldsymbol{r}_{\tau}(s,a)||_{1}\mathrm{1}&(s_{\tau}=s)+\sum_{0<\tau<t_{j}}\sum_{\tau_{1}\geq t_{j}-\tau}r_{\tau,\tau_{1}}(s_{\tau},a_{\tau})\mathrm{1}(s_{\tau}=s)\\ &-\sum_{0<\tau<t_{j+1}}\sum_{\tau_{1}\geq t_{j+1}-\tau}r_{\tau,\tau_{1}}(s_{\tau},a_{\tau})\mathrm{1}(s_{\tau}=s)\end{split} (28)

Using Assumption 2, we can now write the following for all $j$.

\displaystyle\sum_{0<\tau<t_{j}}\sum_{\tau_{1}\geq t_{j}-\tau}r_{\tau,\tau_{1}}(s_{\tau},a_{\tau})\mathrm{1}(s_{\tau}=s)\leq d~\text{ with probability }1 (29)

Combining (28) and (29), we can now write the following with probability $1$,

\displaystyle-d\leq\sum_{t_{j}\leq\tau<t_{j+1}}\boldsymbol{x}_{\tau}(s)-\sum_{t_{j}\leq\tau<t_{j+1}}||\boldsymbol{r}_{\tau}(s,a)||_{1}\mathrm{1}(s_{\tau}=s)\leq d (30)

Taking a sum over all episodes $j\in\{1,\cdots,k-1\}$ where $\mathcal{E}_{j}(s,a)=1$, and noting that episodes with $\mathcal{E}_{j}(s,a)=0$ contribute nothing to either sum inside the absolute value, we get the following inequality, which is satisfied with probability $1$.

\displaystyle\left|\sum_{0<j<k}\sum_{t_{j}\leq\tau<t_{j+1}}\boldsymbol{x}_{\tau}(s)\mathcal{E}_{j}(s,a)-\sum_{0<\tau<t_{k}}||\boldsymbol{r}_{\tau}(s,a)||_{1}\mathrm{1}(s_{\tau}=s,a_{\tau}=a)\right|\leq dE_{k}(s,a) (31)

Using the definition of $\hat{r}_{k}(s,a)$, the above inequality can be rephrased as,

\displaystyle\left|\hat{r}_{k}(s,a)-\dfrac{1}{N_{k}(s,a)}\sum_{0<\tau<t_{k}}||\boldsymbol{r}_{\tau}(s,a)||_{1}\mathrm{1}(s_{\tau}=s,a_{\tau}=a)\right|\leq d\dfrac{E_{k}(s,a)}{N_{k}(s,a)} (32)

Using Assumption 1(b), we can show that,

\displaystyle\mathbb{E}\left[\dfrac{1}{N_{k}(s,a)}\sum_{0<\tau<t_{k}}||\boldsymbol{r}_{\tau}(s,a)||_{1}\mathrm{1}(s_{\tau}=s,a_{\tau}=a)\right]=r(s,a) (33)

Therefore, the following sequence of inequalities can be derived.

\displaystyle\mathrm{Pr}\left\{\left|\hat{r}_{k}(s,a)-r(s,a)\right|>d\dfrac{E_{k}(s,a)}{N_{k}(s,a)}+\sqrt{\dfrac{7\log(2SAt_{k}/\delta)}{2N_{k}(s,a)}}\right\}
\overset{(a)}{\leq}\mathrm{Pr}\left\{\left|\dfrac{1}{N_{k}(s,a)}\sum_{0<\tau<t_{k}}||\boldsymbol{r}_{\tau}(s,a)||_{1}\mathrm{1}(s_{\tau}=s,a_{\tau}=a)-r(s,a)\right|>\sqrt{\dfrac{7\log(2SAt_{k}/\delta)}{2N_{k}(s,a)}}\right\}
\overset{(b)}{\leq}\sum_{0<n<t_{k}}\mathrm{Pr}\left\{\left|\dfrac{1}{n}\sum_{l=1}^{n}||\boldsymbol{r}_{\tau_{l}}(s,a)||_{1}-r(s,a)\right|>\sqrt{\dfrac{7\log(2SAt_{k}/\delta)}{2n}}\right\}
\overset{(c)}{\leq}\sum_{0<n<t_{k}}\dfrac{2\delta}{120SAt_{k}^{7}}=\dfrac{\delta}{60SAt_{k}^{6}}

Inequality $(a)$ is a consequence of (32), while $(b)$ follows from the union bound. Finally, $(c)$ utilizes Hoeffding's inequality together with Assumptions 1(a) and 1(c). Combining (27) with the above result, we establish the lemma.
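
To see where the constant in step $(c)$ comes from, here is a sketch assuming $\delta\in(0,1)$ and, consistent with Assumption 1, that each $||\boldsymbol{r}_{\tau_{l}}(s,a)||_{1}$ takes values in $[0,1]$, so that the two-sided Hoeffding bound $2\exp(-2n\epsilon^{2})$ applies with $\epsilon=\sqrt{7\log(2SAt_{k}/\delta)/(2n)}$:

2\exp\left(-2n\cdot\frac{7\log(2SAt_{k}/\delta)}{2n}\right)=2\left(\frac{\delta}{2SAt_{k}}\right)^{7}=\frac{\delta^{7}}{64S^{7}A^{7}t_{k}^{7}}\leq\frac{2\delta}{120SAt_{k}^{7}},

where the last inequality uses $\delta\leq 1$ and $S,A\geq 1$.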