Reinforcement Learning with Delayed, Composite, and Partially Anonymous Reward
Abstract
We investigate an infinite-horizon average-reward Markov Decision Process (MDP) with delayed, composite, and partially anonymous reward feedback. The delay and compositeness of rewards mean that rewards generated as a result of taking an action at a given state are fragmented into different components, and they are sequentially realized at delayed time instances. The partial anonymity attribute implies that a learner, for each state, only observes the aggregate of past reward components generated as a result of different actions taken at that state, but realized at the observation instance. We propose an algorithm named DUCRL2 to obtain a near-optimal policy for this setting and show that it achieves a regret bound of $\tilde{\mathcal{O}}\left(DS\sqrt{AT} + d(SA)^3\right)$, where $S$ and $A$ are the sizes of the state and action spaces, respectively, $D$ is the diameter of the MDP, $d$ is a parameter upper bounded by the maximum reward delay, and $T$ denotes the time horizon. This demonstrates the optimality of the bound in the order of $T$, and an additive impact of the delay.
1 Introduction
Reinforcement learning (RL) enables an agent to learn a policy in an unknown environment by repeatedly interacting with it. The environment has a state that changes as a result of the actions executed by the agent. In this setup, a policy is a collection of rules that guides the agent to take action based on the observed state of the environment. Several algorithms exist in the literature that learn a policy with near-optimal regret (Jaksch et al., 2010; Azar et al., 2017; Jin et al., 2018; Fruit et al., 2018). However, all of the above frameworks assume that the reward is instantly fed back to the learner after executing an action. Unfortunately, this assumption of instantaneity may not hold in many practical scenarios. For instance, in the online advertisement industry, a customer may purchase a product several days after seeing the advertisement. In medical trials, the effect of a medicine may take several hours to manifest. In road networks, it might take a few minutes to notice the impact of a traffic light change. In many of these cases, the effect of an action is not entirely realized in a single instant, but rather fragmented into smaller components that are sequentially materialized over a long interval. Reward feedback that satisfies this property is referred to as delayed and composite reward. In several applications, the learner cannot directly observe each delayed component of a composite reward but only an aggregate of the components realized at the observation instance. For example, in the case of multiple advertisements, the advertiser can only see the total number of purchases at a given time but is completely unaware of which advertisement resulted in what fraction of the total purchase. Such feedback is referred to as anonymous reward.
Learning with delayed, composite, and anonymous rewards is gaining popularity in the RL community. While most of the theoretical analysis has been directed towards the multi-arm bandit (MAB) framework (Wang et al., 2021; Zhou et al., 2019; Vernade et al., 2020; Pike-Burke et al., 2018; Pedramfar and Aggarwal, 2023), recent studies have also analyzed Markov Decision Processes (MDPs) with delayed feedback (Howson et al., 2023; Jin et al., 2022; Lancewicki et al., 2022). However, none of these studies have incorporated both composite and anonymous rewards, which is the focus of this paper.
1.1 The Challenge and Our Contribution
Learning with delayed, composite, and anonymous rewards has gained popularity in the RL community. While there has been extensive theoretical analysis of these types of rewards in the multi-arm bandit (MAB) framework, extending these ideas to the full Markov decision process (MDP) setting poses significant challenges. For example, one of the main ideas used in the MAB framework can be stated as follows (Wang et al., 2021). The learning algorithm proceeds in multiple epochs with the sizes of the epochs increasing exponentially. At the beginning of each epoch, an action is chosen and it is applied at all instances within that epoch. Due to the multiple occurrences of the chosen action and the exponentially increasing epoch lengths, the learner progressively obtains better estimates of the associated rewards despite the composite and anonymous nature of the feedback. In an MDP setting, however, it is impossible to ensure that any state-action pair appears contiguously over a given stretch of time. The most that one can hope to ensure is that each state, $s$, when it appears in an epoch (defined appropriately), is always paired with a unique action, $a$. In this way, if the state in consideration is visited sufficiently frequently in that epoch, the learner would obtain an accurate estimate of the reward associated with the pair $(s, a)$. Unfortunately, in general, there is no way to ensure a high enough visitation frequency for all states. Unlike the MAB setup, it is, therefore, unclear how to develop regret-optimal learning algorithms for MDPs with delayed, composite, and anonymous rewards.
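To make the bandit-side intuition concrete, the sketch below (Python; an illustrative toy, not the algorithm of Wang et al. (2021), with hypothetical names such as `epoch_doubling_bandit` and an assumed uniform delay profile) commits to a single arm over exponentially growing epochs and estimates its mean from the aggregate, anonymous feedback observed within the epoch.

```python
import random

def epoch_doubling_bandit(num_arms, num_epochs, true_means, max_delay=5):
    """Toy sketch: play one arm per epoch; epoch lengths double.

    Rewards are composite and anonymous: each pull spreads its reward over the
    next `max_delay` steps, and the learner only sees per-step aggregates.
    """
    pending = []                          # (arrival_time, amount) pairs not yet realized
    estimates = [0.0] * num_arms
    pulls = [0] * num_arms
    t = 0
    for epoch in range(num_epochs):
        length = 2 ** epoch               # exponentially growing epoch
        # crude optimism: prefer arms that were never pulled, else the best estimate
        arm = max(range(num_arms), key=lambda a: estimates[a] + 2 * (pulls[a] == 0))
        observed = 0.0
        for _ in range(length):
            total = float(random.random() < true_means[arm])      # total reward of this pull
            for delay in range(max_delay):                        # composite: spread over time
                pending.append((t + delay, total / max_delay))
            observed += sum(x for (tau, x) in pending if tau == t)  # anonymous aggregate
            pending = [(tau, x) for (tau, x) in pending if tau > t]
            t += 1
        # spillover into / out of the epoch is at most ~max_delay, which is negligible
        # compared to the epoch length 2**epoch, so the running average stays accurate
        pulls[arm] += length
        estimates[arm] = ((pulls[arm] - length) * estimates[arm] + observed) / pulls[arm]
    return estimates

print(epoch_doubling_bandit(num_arms=3, num_epochs=12, true_means=[0.2, 0.5, 0.8]))
```

Because the epoch length doubles while the contamination per epoch stays bounded by the delay, the per-arm estimates converge despite the anonymous feedback. The difficulty highlighted above is precisely that an MDP offers no way to force such long contiguous runs of a single state-action pair.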
Our article provides a partial resolution to this problem. In particular, we demonstrate that if the rewards are delayed, composite, and partially anonymous, then an algorithm can be designed to achieve near-optimal regret. In a fully anonymous setting, a learner observes the aggregate of the (delayed) reward components generated as a result of all state-action pairs visited in the past. In contrast, a partially anonymous setup allows a learner to observe the sum of (delayed) reward components generated as a result of all past visitations to any specified state. Our proposed algorithm, DUCRL2, is built upon the UCRL2 algorithm of (Jaksch et al., 2010) and works in multiple epochs. Unlike the bandit setting, however, the epoch lengths are not guaranteed to be exponentially increasing. Our primary innovation lies in demonstrating how an accurate reward function estimate can be obtained using the partially anonymous feedback. DUCRL2 yields a regret bound of $\tilde{\mathcal{O}}\left(DS\sqrt{AT} + d(SA)^3\right)$, where $S$, $A$ denote the sizes of the state and action spaces respectively, $T$ is the time horizon, $D$ denotes the diameter of the MDP, and the parameter $d$ is bounded by the maximum delay in the reward generation process. The obtained result matches a well-known lower bound in $T$.
1.2 Relevant Literature
Below we describe in detail the relevant literature.
Regret Bounds in Non-delayed RL: The framework of regret minimization in RL with immediate feedback is well-investigated in the literature. In particular, this topic has been explored in the settings of both stochastic (Jaksch et al., 2010; Zanette et al., 2020; Agarwal and Aggarwal, 2023; Agarwal et al., 2022; 2023) and adversarial (Jin and Luo, 2020; Rosenberg and Mansour, 2019; Shani et al., 2020) MDPs. Our setup can be considered to be the generalization of the stochastic MDPs with immediate feedback.
Delay in Bandit: Delayed feedback is a well-researched topic in the bandit literature, with numerous studies conducted in both stochastic (Vernade et al., 2020; Pike-Burke et al., 2018; Zhou et al., 2019; Gael et al., 2020; Lancewicki et al., 2021; Pedramfar and Aggarwal, 2023) and adversarial settings (Quanrud and Khashabi, 2015; Cesa-Bianchi et al., 2016; Thune et al., 2019; Zimmert and Seldin, 2020; Bistritz et al., 2019; Ito et al., 2020). However, as previously discussed, applying the insights of bandit learning to the MDP setting with composite and anonymous rewards is challenging.
Delay in MDP: A number of recent papers have explored the incorporation of delayed feedback into the MDP framework. For example, (Lancewicki et al., 2022; Jin et al., 2022) consider adversarial MDPs, while (Howson et al., 2023) analyzes a stochastic setting. However, all of these articles focus on episodic MDPs with non-composite and non-anonymous rewards, which is distinct from our work on infinite-horizon MDPs with delayed, composite, and partially anonymous rewards. It is worth noting that while delayed reward is a commonly studied topic in the literature, some works also consider delays in the state information (Agarwal and Aggarwal, 2021; Bouteiller et al., 2021). Additionally, the impact of delay has also been explored in the context of multi-agent learning to characterize coarse correlated equilibrium (Zhang et al., 2022).
2 Problem Setting
We consider an infinite-horizon average-reward Markov Decision Process (MDP) defined as $M \triangleq (\mathcal{S}, \mathcal{A}, r, p)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces respectively, $r: \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the reward function, and $p: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ indicates the state transition function. The function $\Delta(\cdot)$ defines the probability simplex over its argument set. The cardinalities of the sets $\mathcal{S}$ and $\mathcal{A}$ are denoted as $S$ and $A$ respectively. Both the reward function, $r$, and the transition function, $p$, are assumed to be unknown.
A learner/agent interacts with an environment governed by the MDP stated above as follows. The interaction proceeds in discrete time steps, $t \in \{1, 2, \ldots\}$. At time $t$, the state occupied by the environment is denoted as $s_t$. The learner observes the state, $s_t$, and chooses an action, $a_t$, following a predefined protocol, $\mathbb{A}$. As a result, a sequence of rewards, $R_t(s_t, a_t) \triangleq \{R_t^{\tau}(s_t, a_t)\}_{\tau \geq t}$, is generated, where $R_t^{\tau}(s_t, a_t)$ is interpreted as the non-negative component of the vector $R_t(s_t, a_t)$ that is realised at instant $\tau$. The following assumption is made regarding the reward generation process.
Assumption 1
It is assumed that, $\forall (s, a) \in \mathcal{S} \times \mathcal{A}$ and $\forall t \in \{1, 2, \ldots\}$, the following holds:
(a) $\left\lVert R_t(s, a) \right\rVert_1 \sim \mathcal{R}(s, a)$, where $\mathcal{R}(s, a)$ is a distribution over $[0, 1]$ that does not depend on $t$;
(b) $\mathbb{E}\left\lVert R_t(s, a) \right\rVert_1 = r(s, a)$;
(c) the reward sequences generated at different time instances are mutually independent;
where $\lVert \cdot \rVert_1$ denotes the $\ell_1$-norm.
Assumption 1(a) dictates that the reward sequences generated from $(s, a)$ are such that the sum of their components can be thought of as samples taken from a certain distribution, $\mathcal{R}(s, a)$, over $[0, 1]$. Note that the distribution, $\mathcal{R}(s, a)$, is independent of $t$. Assumption 1(b) explains that the expected value of the sum of the reward components generated from $(s, a)$ equals $r(s, a)$. Finally, Assumption 1(c) clarifies that the reward sequences generated at different instances (either by the same or distinct state-action pairs) are presumed to be independent of each other.
At the time instant $t$, the learner observes a reward vector $f_t \in \mathbb{R}^S$ whose $s$-th element is expressed as follows:
$$f_t(s) = \sum_{t' \leq t} \mathbb{1}(s_{t'} = s)\, R_{t'}^{t}(s, a_{t'}),$$
where $\mathbb{1}(\cdot)$ defines the indicator function. Note that the learner only has access to the lump-sum reward $f_t(s)$, not its individual components contributed by past actions. This explains why the reward is termed partially anonymous. In a fully anonymous setting, the learner can only access the sum $\sum_{s \in \mathcal{S}} f_t(s)$, not its individual elements. Although full anonymity might be desirable for many practical scenarios, from a theoretical standpoint, it is notoriously difficult to analyze. We discuss this topic in detail in Section 4.2. The expected accumulated reward generated up to time $T$ can be computed as,
$$R(T) \triangleq \mathbb{E}\left[\sum_{t=1}^{T} \left\lVert R_t(s_t, a_t) \right\rVert_1\right]. \tag{1}$$
We would like to emphasize that $R(T)$ is not the sum of the observed rewards up to time $T$. Rather, it equates to the sum of all the reward components that are generated as a consequence of the actions taken up to time $T$. We define the quantity expressed below
$$\rho^{\mathbb{A}}(s) \triangleq \lim_{T \to \infty} \frac{1}{T} R(T), \quad s_1 = s, \tag{2}$$
as the average reward of the MDP $M$ for a given protocol $\mathbb{A}$ and an initial state $s$. Here the expectation is obtained over all possible $T$-length trajectories generated from the initial state $s$ following the protocol $\mathbb{A}$, and over the randomness associated with the reward generation process for any given state-action pair. It is well known (Puterman, 2014) that there exists a stationary deterministic policy that maximizes the average reward if $D$, the diameter of $M$ (defined below), is finite. Also, in that case, the optimal average reward becomes independent of $s$ and thus can be simply denoted as $\rho^*$. The diameter is defined as follows:
$$D \triangleq \max_{s \neq s'} \; \min_{\pi: \mathcal{S} \to \mathcal{A}} \; \mathbb{E}\left[T(s' \mid M, \pi, s)\right],$$
where $T(s' \mid M, \pi, s)$ denotes the (random) time needed for the MDP $M$ to reach the state $s'$ from the state $s$ following the stationary deterministic policy, $\pi$. Mathematically, $T(s' \mid M, \pi, s) \triangleq \min\{t \geq 1 : s_t = s',\, s_1 = s\}$. In simple words, given two arbitrary distinct states, one can always find a stationary deterministic policy such that the MDP, $M$, on average, takes at most $D$ time steps to transition from one state to the other.
We define the performance of a protocol, $\mathbb{A}$, by the regret it accumulates over a horizon $T$, which is mathematically expressed as,
$$\mathrm{Reg}(T, \mathbb{A}, s) \triangleq T\rho^* - R(T), \tag{3}$$
where $\rho^*$ is the maximum of the average reward given in (2), and the second term, $R(T)$, is defined in (1). We would like to mention that, in order to define regret, we use $R(T)$ rather than the expected sum of the observed rewards up to time $T$. The rationale behind this definition is that all the components of the rewards that are generated as a consequence of the actions taken up to time $T$ would eventually be realized if we allow the system to evolve for a long enough time. Our goal in this article is to come up with an algorithm that achieves sublinear regret for the delayed, composite, and partially anonymous reward MDP described above.
Before concluding, we would like to provide an example of an MDP with partially anonymous rewards. Let us consider the round-by-round interaction of a potential consumer with a website that advertises $N$ categories of products. At round $t$, the state, $s_t$, observed by the website is the category of product searched by the consumer. The $i$-th category, where $i \in \{1, \ldots, N\}$, has $M_i$ potential advertisements, and the total number of advertisements is $M = M_1 + \cdots + M_N$. The website, in response to the observed state, $s_t$, shows an ordered list of advertisements (denoted by $a_t$), some of which may not directly correspond to the searched category, $s_t$. This may cause the consumer, with some probability, to switch to a new state, $s_{t+1}$, in the next round. For example, if the consumer is searching for "Computers", then showing advertisements related to "Mouse" or "Keyboard" may incentivize the consumer to search for those categories of products. After a few rounds, the consumer may end up spending some amount of money on a given category of products as a consequence of the previous advertisements. Note that if the same state appears in two different rounds, $t_1 \neq t_2$, the website can potentially show two different ordered lists of advertisements, $a_{t_1} \neq a_{t_2}$, to the consumer. However, it is impossible to segregate the portions of the reward contributed by each of those actions. Hence, the system described above can be modeled by an MDP with delayed, composite, and partially anonymous rewards.
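To fix ideas, here is a minimal simulation of the feedback model of this section (Python; the delay profile, the placeholder policy, and names such as `spread_reward` are illustrative assumptions, not part of the paper). Each visited state-action pair generates a reward sequence whose components arrive at later steps, and the learner only sees, at every step, one aggregate number per state.

```python
import random
from collections import defaultdict

def spread_reward(total, horizon=4):
    """Split a scalar reward into components realized over the next `horizon` steps."""
    weights = [random.random() for _ in range(horizon)]
    norm = sum(weights)
    return [total * w / norm for w in weights]

def simulate(T=20, num_states=3, num_actions=2, seed=0):
    random.seed(seed)
    # toy MDP: uniform transitions; the mean reward depends on the (state, action) pair
    mean_reward = {(s, a): random.random()
                   for s in range(num_states) for a in range(num_actions)}
    pending = defaultdict(float)          # (realization_time, state) -> reward mass
    state = 0
    for t in range(T):
        action = random.randrange(num_actions)                    # placeholder policy
        generated = min(random.uniform(0, 2 * mean_reward[(state, action)]), 1.0)
        for tau, component in enumerate(spread_reward(generated)):
            pending[(t + tau, state)] += component                # components tagged by state only
        # partially anonymous observation: one aggregate per state at time t;
        # a fully anonymous learner would only see sum(observation.values())
        observation = {s: pending.pop((t, s), 0.0) for s in range(num_states)}
        print(t, state, action, {s: round(v, 3) for s, v in observation.items()})
        state = random.randrange(num_states)                      # toy transition

simulate()
```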
3 DUCRL2 Algorithm
In this section, we develop the Delayed UCRL2 (DUCRL2) algorithm to achieve the overarching goal of our paper. It is inspired by the UCRL2 algorithm suggested by (Jaksch et al., 2010) for MDPs with immediate rewards (no delay). Before going into the details of the DUCRL2 algorithm, we would like to introduce the following assumption.
Assumption 2
There exists a finite positive number $d$ such that, $\forall\, 1 \leq t_1 \leq t_2$,
$$\sum_{t = t_1}^{t_2} \sum_{\tau > t_2} R_t^{\tau}(s_t, a_t) \leq d \tag{4}$$
with probability $1$, where $\{R_t^{\tau}(s_t, a_t)\}_{\tau \geq t}$ is the reward sequence generated by the pair $(s_t, a_t)$ visited at time $t$.
(5)
(6)
(7)
Note that if the maximum delay is $\tau_{\max}$, then using Assumption 1(a), one can show that $d \leq \tau_{\max}$. Therefore, $d$ can be thought of as a proxy for the maximum delay. To better understand the intuition behind Assumption 2, consider an interval $[t_1, t_2]$. Clearly, the reward sequence generated at $t \in [t_1, t_2]$ is $\{R_t^{\tau}(s_t, a_t)\}_{\tau \geq t}$. The portion of this reward that is realized after $t_2$ is expressed by the following quantity: $\sum_{\tau > t_2} R_t^{\tau}(s_t, a_t)$, which is upper bounded by $\mathbb{1}(t > t_2 - \tau_{\max})$. Therefore, the total amount of reward that is generated in $[t_1, t_2]$ but realized after $t_2$ is bounded by $\tau_{\max}$. As the distribution of the reward sequence is the same for all $t$ (Assumption 1(a)), one can replace the $\tau_{\max}$-dependent term with a generic quantity, $d$. Thus, Assumption 2 essentially states that the spillover of the rewards generated within any finite interval can be bounded by the term $d$.
We would also like to emphasize that if the maximum delay is infinite, then $d$ may or may not be infinite depending on the reward sequence. For example, consider a deterministic sequence whose components are $2^{-1}, 2^{-2}, 2^{-3}, \ldots$, realized at successive time instants. It is easy to show that one can choose $d = 1$ even though the maximum delay is infinite. On the other hand, if the entire reward generated at time $t$ is realized at a single random instant $t + G$, where $G$ is a random variable with a sufficiently heavy-tailed distribution, then one can show that (4) is violated with at least a constant probability for any choice of $d$. In other words, Assumption 2 is not satisfied for any finite $d$.
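As a quick numerical check of the first example (a sketch under the stated assumption that every pull generates the components $2^{-1}, 2^{-2}, 2^{-3}, \ldots$ at successive instants), the snippet below verifies that the reward generated inside any window $[t_1, t_2]$ but realized after $t_2$ stays below $1$, so $d = 1$ suffices even though the maximum delay is unbounded.

```python
def spillover(t1, t2, tail=200):
    """Reward generated in [t1, t2] whose components 2^{-1}, 2^{-2}, ... land after t2."""
    total = 0.0
    for t in range(t1, t2 + 1):
        # the component generated at t and realized at tau has size 2^{-(tau - t + 1)}
        total += sum(2.0 ** -(tau - t + 1) for tau in range(t2 + 1, t2 + 1 + tail))
    return total

# the spillover approaches 1 but never exceeds it, regardless of the window
for (t1, t2) in [(0, 0), (0, 5), (3, 40)]:
    print((t1, t2), round(spillover(t1, t2), 6))
```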
The algorithm (Algorithm 1) proceeds in multiple epochs. At the beginning of epoch $k$, i.e., at time instant $t_k$, we compute two measures for all state-action pairs $(s, a)$. The first measure is indicated as $N_k(s,a)$, which counts the number of times the pair $(s,a)$ appears before the onset of the $k$-th epoch. The second measure is $I_j(s,a)$, which is a binary random variable that indicates whether the pair $(s,a)$ appears at least once in the $j$-th epoch, $j < k$. Due to the very nature of our algorithm (elaborated later), in a given epoch $j$, if $I_j(s,a) = 1$ for some action $a$, then $I_j(s,a') = 0$, $\forall a' \neq a$. Taking a sum over $j < k$, we obtain $n_k(s,a) \triangleq \sum_{j < k} I_j(s,a)$, which counts the number of epochs where $(s,a)$ appears at least once before $t_k$, the start of the $k$-th epoch.
Next, we obtain the reward estimate $\hat{r}_k(s,a)$ by computing the sum of the $s$-th element of the observed reward over all epochs where $(s,a)$ appears at least once and dividing it by $\max\{1, N_k(s,a)\}$. Also, the transition probability estimate $\hat{p}_k(s' \mid s, a)$ is calculated by taking the ratio of the number of times the transition $(s, a) \to s'$ occurs before the $k$-th epoch and $\max\{1, N_k(s,a)\}$. It is to be clarified that, due to the delayed composite nature of the reward, the observed reward values that are used for computing $\hat{r}_k(s,a)$ may be contaminated by the components generated by previous actions, which could potentially be different from $a$. Consequently, it might appear that the empirical estimate $\hat{r}_k(s,a)$ may not serve as a good proxy for $r(s,a)$. However, the analysis of our algorithm exhibits that, by judiciously steering the exploration, it is still possible to obtain a near-optimal regret.
Using the estimates $\hat{r}_k$, $\hat{p}_k$, we now define a confidence set of MDPs, $\mathcal{M}_k$, that is characterized by confidence bounds on the reward and transition functions. The reward confidence radius is one of the main differences between our algorithm and the UCRL2 algorithm given by (Jaksch et al., 2010). Applying extended value iteration (Appendix A), we then derive a policy, $\tilde{\pi}_k$, and an MDP, $\tilde{M}_k$, that are near-optimal within the set $\mathcal{M}_k$ (in the sense made precise in Appendix A). We keep executing the policy $\tilde{\pi}_k$ until, for at least one state-action pair $(s,a)$, its total number of occurrences within the current epoch, $\nu_k(s,a)$, becomes at least as large as $N_k(s,a)$, the number of its occurrences before the onset of the current epoch. When this criterion is achieved, a new epoch begins and the process described above starts all over again. Observe that, as the executed policies are deterministic, no two distinct pairs $(s, a_1)$ and $(s, a_2)$ with $a_1 \neq a_2$ can appear in the same epoch for a given state, $s$.
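The epoch bookkeeping described above can be summarized with the following skeleton (Python; a minimal sketch, not a verbatim implementation of Algorithm 1 — the environment interface, the placeholder planner, and the exact form of the confidence radius are our assumptions). It tracks the visit counts $N_k(s,a)$, the within-epoch counts $\nu_k(s,a)$, and the per-epoch indicators $I_j(s,a)$ used to form the reward estimates from the per-state aggregates.

```python
import numpy as np

def optimistic_policy(r_hat, conf_r):
    """Placeholder planner: greedy w.r.t. optimistic one-step rewards.
    DUCRL2 instead runs extended value iteration over the confidence set (Appendix A)."""
    return np.argmax(r_hat + conf_r, axis=1)

def ducrl2_skeleton(env, S, A, T, delta, d):
    """Schematic epoch loop in the spirit of Algorithm 1.
    Assumed interface: env.reset() returns an initial state; env.step(a) returns
    (next_state, observed), where observed is the length-S per-state aggregate reward."""
    N = np.zeros((S, A))                  # visits to (s, a) before the current epoch
    n_epochs = np.zeros((S, A))           # past epochs in which (s, a) appeared
    reward_sum = np.zeros((S, A))         # per-state aggregates credited to (s, a)
    trans = np.zeros((S, A, S))           # transition counts
    r_hat, p_hat = np.zeros((S, A)), np.ones((S, A, S)) / S
    state, t = env.reset(), 0
    while t < T:
        # --- start of a new epoch: form estimates and an optimistic policy ---
        r_hat = reward_sum / np.maximum(1, N)
        p_hat = trans / np.maximum(1, N)[:, :, None]
        # assumed radius: a UCRL2-style term plus a delay-dependent correction
        conf_r = (np.sqrt(np.log(S * A * max(t, 2) / delta) / np.maximum(1, N))
                  + d * n_epochs / np.maximum(1, N))
        policy = optimistic_policy(r_hat, conf_r)
        nu = np.zeros((S, A))                         # visits within this epoch
        epoch_obs = np.zeros(S)                       # aggregate observed per state
        epoch_action = -np.ones(S, dtype=int)         # unique action paired with each state
        while t < T and np.all(nu < np.maximum(1, N)):
            a = int(policy[state])
            next_state, observed = env.step(a)
            nu[state, a] += 1
            trans[state, a, next_state] += 1
            epoch_action[state] = a                   # deterministic policy => unique per epoch
            epoch_obs += observed
            state, t = next_state, t + 1
        # --- close the epoch: credit each state's aggregate to its unique action ---
        for s in range(S):
            if epoch_action[s] >= 0:
                reward_sum[s, epoch_action[s]] += epoch_obs[s]
                n_epochs[s, epoch_action[s]] += 1
        N += nu
    return r_hat, p_hat
```

The radius shown here is only meant to convey that the delay enters additively through the per-epoch contamination; the radius actually used by DUCRL2 is the one defining the confidence set of Section 3.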
We would like to conclude with the remark that all the quantities used in our algorithm can be computed in a recursive manner. Consequently, similar to UCRL2, the space complexity of our algorithm turns out to be $\mathcal{O}(S^2 A)$, which is independent of $T$.
4 Regret Analysis
Below we state our main result.
Theorem 1
Let $\delta \in (0, 1)$. With probability at least $1 - \delta$, for any $T > 1$ and an arbitrary initial state $s$, the regret accumulated by the DUCRL2 algorithm over $T$ steps can be bounded above as follows.
$$\mathrm{Reg}(T, \mathrm{DUCRL2}, s) = \tilde{\mathcal{O}}\left(DS\sqrt{AT} + d(SA)^3\right), \tag{8}$$
where $\tilde{\mathcal{O}}(\cdot)$ hides factors that are polylogarithmic in $T$ and $1/\delta$. Substituting $\delta = 1/T$, we can therefore bound the expected regret as,
$$\mathbb{E}\left[\mathrm{Reg}(T, \mathrm{DUCRL2}, s)\right] = \tilde{\mathcal{O}}\left(DS\sqrt{AT} + d(SA)^3\right). \tag{9}$$
Theorem 1 states that the regret accumulated by the DUCRL2 algorithm is $\tilde{\mathcal{O}}\left(DS\sqrt{AT} + d(SA)^3\right)$, where $\tilde{\mathcal{O}}(\cdot)$ hides the logarithmic factors. (Jaksch et al., 2010) showed that the lower bound on the regret is $\Omega(\sqrt{DSAT})$. As our setup is a generalization of the setup considered in (Jaksch et al., 2010), the same lower bound must also apply to our model. Moreover, if we consider an MDP instance where the rewards arrive with a constant delay, $d$, then in the first $d$ steps, due to lack of any feedback, each algorithm must obey the regret lower bound $\Omega(d)$. We conclude that the regret lower bound of our setup is $\Omega(\sqrt{DSAT} + d)$. Although it matches the orders of $T$ and $A$ of our regret upper bound, there is still room for improvement in the orders of $D$, $S$, and $d$.
4.1 Proof Sketch of Theorem 1
In this section, we provide a brief sketch of the proof of Theorem 1.
Step 1: The first step is to rewrite the total regret as the sum of regrets accumulated over various epochs. Particularly, we show that, with high probability, the following bound holds.
The term $\Delta_k$ can be defined as the regret accumulated over epoch $k$ (a precise definition is given in the appendix), and $m$ is such that $T$ lies in the $m$-th epoch. The additional term appears due to the stochasticity of the observed reward instances. We now divide $\sum_{k=1}^{m} \Delta_k$ into two parts as follows.
Step 2: In order to bound the first part, it is important to have an estimate of the probability that the true MDP, $M$, falls outside the confidence set, $\mathcal{M}_k$, which we obtain in Lemma 1. We would like to elaborate that although the bound given in Lemma 1 is similar to that given in (Jaksch et al., 2010), the proof techniques are different. In particular, here we account for the fact that the reward estimates, $\hat{r}_k(s,a)$, of the $k$-th epoch are potentially corrupted by delayed effects of past actions. We resolve this problem by proving the following inequality (see Appendix C).
(10)
Here the second term in the LHS denotes an estimate of $r(s,a)$ that the learner would have obtained had there been no delay in the observation of the reward instances. In other words, inequality (10) estimates the gap between an MDP with delayed observations and a hypothetical MDP without any delayed effects. Observe that the RHS of (10) also appears in the confidence radius that defines our confidence set. Therefore, the cost of incorporating delay is a looser confidence in the reward estimates.
Step 4: We now focus on bounding the other term, i.e., the regret accumulated over the epochs where the true MDP lies inside the confidence set. Lemma 3 shows that this term can be decomposed into a few constituent quantities whose precise definitions are given in Appendix B.2. Furthermore, it also proves that the following bound holds with high probability.
where we hide logarithmic factors and lower-order terms. The other notations are identical to those given in Section 3.
Step 5: The second term is bounded as follows.
Notice that the first term can be bounded by invoking inequality (10). The same inequality can also be used to bound the second term provided that the true MDP lies within the confidence set, which, as we have stated before, is a high-probability event (Lemma 1). Using this logic, and some algebraic manipulations, we finally obtain the following high-probability bound.
Step 6: Finally, we obtain the following inequality related to the third term.
Step 7: Combining, we derive the bound stated below with high probability.
Step 8: We conclude the proof using Lemma 4 which states that,
4.2 Limitation
It is important to understand why our approach works with partial anonymity but not with full anonymity. Fix a state $s$ and an epoch $k$. Recall from Section 3 that no two distinct pairs $(s, a_1)$ and $(s, a_2)$ with $a_1 \neq a_2$ can appear in the same epoch. Utilizing this property, we can write down the following relation for a pair $(s, a)$ that appears in the $k$-th epoch.
(11)
The term on the left-hand side of (11) denotes the sum of the $s$-th components of all observed reward vectors within epoch $k$. Some portion of this sum is due to actions taken before the onset of the $k$-th epoch; we refer to this as the past contribution. The rest of the observed sum is entirely contributed by the actions taken within epoch $k$. The generated-reward term in (11) denotes the sum of all the rewards generated as a result of actions taken within epoch $k$. However, due to the delayed nature of the reward, some portion of this generated reward will be realized after the $k$-th epoch; we refer to this part as the spillover. Using Assumption 2, we can bound both the past contribution and the spillover by $d$ with high probability, which leads to the following relation.
(12)
Eq. (12) is the first step in establishing Lemma 1. We would like to emphasize the fact that the generated-reward term in (11) is entirely contributed by the pair $(s, a)$ appearing in epoch $k$ (i.e., there is no contamination from actions other than $a$). In other words, although our estimate, $\hat{r}_k(s,a)$, is based on the contaminated observations, we demonstrate that it is not far away from an uncontaminated estimate. This key feature makes DUCRL2 successful despite having partial anonymity. On the other hand, if rewards were fully anonymous, then (11) would have changed as follows.
(13)
Note that in (13), the generated-reward term is a mixture of contributions from various state-action pairs, unlike in (11). This makes our algorithm ineffective in the presence of full anonymity.
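In schematic form, and with illustrative notation that is ours rather than the paper's ($F_k(s)$: aggregate observed for state $s$ during epoch $k$; $\mathrm{Pre}_k$: spill-in from earlier epochs; $\mathrm{Gen}_k$: reward generated inside epoch $k$; $\mathrm{Post}_k$: spill-out realized after epoch $k$), the contrast between the two settings reads as follows.

```latex
% Partial anonymity: the state s is paired with a unique action a in epoch k,
% and Assumption 2 bounds both spillover terms by d (with high probability).
\begin{align}
  F_k(s) &= \mathrm{Pre}_k(s) + \mathrm{Gen}_k(s, a) - \mathrm{Post}_k(s, a), \\
  \big| F_k(s) - \mathrm{Gen}_k(s, a) \big| &\le d .
\end{align}
% Full anonymity: only the sum over states is observed, so the generated part
% mixes every pair visited in the epoch and cannot be attributed to a single (s, a).
\begin{align}
  \sum_{s} F_k(s) &= \mathrm{Pre}_k
    + \sum_{(s', a') \text{ visited in epoch } k} \mathrm{Gen}_k(s', a')
    - \mathrm{Post}_k .
\end{align}
```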
Another limitation of our approach is that the delay parameter $d$ is used as an input to Algorithm 1. One can therefore ask how the regret bound changes if an incorrect estimate, $d'$, of $d$ is used in the algorithm. One can easily prove that if $d' > d$, then the regret bound changes to $\tilde{\mathcal{O}}\left(DS\sqrt{AT} + d'(SA)^3\right)$. However, if $d' < d$, then Lemma 1 no longer works, and consequently, the analysis does not yield any sub-linear regret.
5 Conclusion
In this work, we addressed the challenging problem of designing learning algorithms for infinite-horizon Markov Decision Processes with delayed, composite, and partially anonymous rewards. We proposed an algorithm, DUCRL2, that achieves near-optimal performance and derived a regret bound that matches the existing lower bound in the time horizon while demonstrating an additive impact of delay on the regret. Our work is the first to consider partially anonymous rewards in the MDP setting with delayed feedback.
Possible future work includes extending the analysis to the more general scenario of fully anonymous, delayed, and composite rewards, which has important applications in many domains. This extension poses theoretical challenges, and we believe it is an exciting direction for future research. Overall, we hope our work provides a useful contribution to the reinforcement learning community and inspires further research on this important topic.
References
- Agarwal and Aggarwal [2021] Mridul Agarwal and Vaneet Aggarwal. Blind decision making: Reinforcement learning with delayed observations. Pattern Recognition Letters, 150:176–182, 2021.
- Agarwal and Aggarwal [2023] Mridul Agarwal and Vaneet Aggarwal. Reinforcement learning for joint optimization of multiple rewards. J. Mach. Learn. Res., 24:49:1–49:41, 2023. URL http://jmlr.org/papers/v24/19-980.html.
- Agarwal et al. [2022] Mridul Agarwal, Qinbo Bai, and Vaneet Aggarwal. Regret guarantees for model-based reinforcement learning with long-term average constraints. In Uncertainty in Artificial Intelligence, pages 22–31. PMLR, 2022.
- Agarwal et al. [2023] Mridul Agarwal, Qinbo Bai, and Vaneet Aggarwal. Concave utility reinforcement learning with zero-constraint violations. Transactions on Machine Learning Research, 2023.
- Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
- Bistritz et al. [2019] Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, and Jose Blanchet. Online EXP3 learning in adversarial bandits with delayed feedback. Advances in Neural Information Processing Systems, 32, 2019.
- Bouteiller et al. [2021] Yann Bouteiller, Simon Ramstedt, Giovanni Beltrame, Christopher Pal, and Jonathan Binas. Reinforcement learning with random delays. In International Conference on Learning Representations, 2021.
- Cesa-Bianchi et al. [2016] Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour, and Alberto Minora. Delay and cooperation in nonstochastic bandits. In Conference on Learning Theory, pages 605–622. PMLR, 2016.
- Fruit et al. [2018] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, pages 1578–1586. PMLR, 2018.
- Gael et al. [2020] Manegueu Anne Gael, Claire Vernade, Alexandra Carpentier, and Michal Valko. Stochastic bandits with arm-dependent delays. In International Conference on Machine Learning, pages 3348–3356. PMLR, 2020.
- Howson et al. [2023] Benjamin Howson, Ciara Pike-Burke, and Sarah Filippi. Optimism and delays in episodic reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 6061–6094. PMLR, 2023.
- Ito et al. [2020] Shinji Ito, Daisuke Hatano, Hanna Sumita, Kei Takemura, Takuro Fukunaga, Naonori Kakimura, and Ken-Ichi Kawarabayashi. Delay and cooperation in nonstochastic linear bandits. Advances in Neural Information Processing Systems, 33:4872–4883, 2020.
- Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
- Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? Advances in Neural Information Processing Systems, 31, 2018.
- Jin and Luo [2020] Tiancheng Jin and Haipeng Luo. Simultaneously learning stochastic and adversarial episodic MDPs with known transition. Advances in Neural Information Processing Systems, 33:16557–16566, 2020.
- Jin et al. [2022] Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, and Aviv Rosenberg. Near-optimal regret for adversarial MDP with delayed bandit feedback. In Advances in Neural Information Processing Systems, 2022.
- Lancewicki et al. [2021] Tal Lancewicki, Shahar Segal, Tomer Koren, and Yishay Mansour. Stochastic multi-armed bandits with unrestricted delay distributions. In International Conference on Machine Learning, pages 5969–5978. PMLR, 2021.
- Lancewicki et al. [2022] Tal Lancewicki, Aviv Rosenberg, and Yishay Mansour. Learning adversarial Markov decision processes with delayed feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7281–7289, 2022.
- Pedramfar and Aggarwal [2023] Mohammad Pedramfar and Vaneet Aggarwal. Stochastic submodular bandits with delayed composite anonymous bandit feedback. arXiv preprint arXiv:2303.13604, 2023.
- Pike-Burke et al. [2018] Ciara Pike-Burke, Shipra Agrawal, Csaba Szepesvari, and Steffen Grunewalder. Bandits with delayed, aggregated anonymous feedback. In International Conference on Machine Learning, pages 4105–4113. PMLR, 2018.
- Puterman [2014] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
- Quanrud and Khashabi [2015] Kent Quanrud and Daniel Khashabi. Online learning with adversarial delays. Advances in Neural Information Processing Systems, 28, 2015.
- Rosenberg and Mansour [2019] Aviv Rosenberg and Yishay Mansour. Online convex optimization in adversarial Markov decision processes. In International Conference on Machine Learning, pages 5478–5486. PMLR, 2019.
- Shani et al. [2020] Lior Shani, Yonathan Efroni, Aviv Rosenberg, and Shie Mannor. Optimistic policy optimization with bandit feedback. In International Conference on Machine Learning, pages 8604–8613. PMLR, 2020.
- Thune et al. [2019] Tobias Sommer Thune, Nicolò Cesa-Bianchi, and Yevgeny Seldin. Nonstochastic multiarmed bandits with unrestricted delays. Advances in Neural Information Processing Systems, 32, 2019.
- Vernade et al. [2020] Claire Vernade, Alexandra Carpentier, Tor Lattimore, Giovanni Zappella, Beyza Ermis, and Michael Brueckner. Linear bandits with stochastic delayed feedback. In International Conference on Machine Learning, pages 9712–9721. PMLR, 2020.
- Wang et al. [2021] Siwei Wang, Haoyun Wang, and Longbo Huang. Adaptive algorithms for multi-armed bandit with composite and anonymous feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10210–10217, 2021.
- Weissman et al. [2003] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep., 2003.
- Zanette et al. [2020] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning, pages 10978–10989. PMLR, 2020.
- Zhang et al. [2022] Yuyang Zhang, Runyu Zhang, Gen Li, Yuantao Gu, and Na Li. Multi-agent reinforcement learning with reward delays. arXiv preprint arXiv:2212.01441, 2022.
- Zhou et al. [2019] Zhengyuan Zhou, Renyuan Xu, and Jose Blanchet. Learning in generalized linear contextual bandits with stochastic delays. Advances in Neural Information Processing Systems, 32, 2019.
- Zimmert and Seldin [2020] Julian Zimmert and Yevgeny Seldin. An optimal algorithm for adversarial bandits with arbitrary delays. In International Conference on Artificial Intelligence and Statistics, pages 3285–3294. PMLR, 2020.
Appendix A Extended Value Iteration
Here, the first quantity can be thought of as the confidence radius used in defining the confidence set of Section 3, whereas the second is the set of probability vectors that satisfy the corresponding transition bound. Note that the stopping criterion for Algorithm 2 is the following.
$$\max_{s \in \mathcal{S}} \left\{ u_{i+1}(s) - u_i(s) \right\} - \min_{s \in \mathcal{S}} \left\{ u_{i+1}(s) - u_i(s) \right\} < \epsilon, \tag{14}$$
where $u_i$ denotes the utility vector at the $i$-th iteration of Algorithm 2.
In the context of Algorithm 1, we can take $\epsilon = 1/\sqrt{t_k}$. Theorem 7 of [Jaksch et al., 2010] guarantees that the greedy policy deduced from the terminal utility vector of Algorithm 2 is $\epsilon$-optimal if the set of MDPs, whose transition probability distributions lie in the confidence set and whose reward functions lie at most the confidence radius away from the estimate $\hat{r}_k$, comprises at least one MDP with a finite diameter. As a consequence of this result, we can write the following corollary.
Corollary 1
Let $\mathcal{M}$ be the collection of MDPs whose reward function and transition function satisfy the following conditions for given confidence radii.
If the true MDP, $M$, lies in $\mathcal{M}$, then Algorithm 2 always converges. Moreover, if $u$ indicates the terminal utility vector for a given $\epsilon$, and,
(15)
then the following inequality holds.
where $\tilde{M}$ is an MDP with transition function defined by the maximizing distributions computed in Algorithm 2, and reward function that obeys the corresponding reward confidence bound, $\forall (s,a) \in \mathcal{S} \times \mathcal{A}$.
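To make the inner maximization of extended value iteration concrete, here is a minimal sketch (Python, following the standard construction of Jaksch et al. (2010), with our own variable names and generic confidence radii): the optimistic reward adds the reward confidence radius to the estimate, and the optimistic transition vector moves as much probability mass as the $\ell_1$ ball allows toward the highest-utility states.

```python
import numpy as np

def inner_max_transition(p_hat_sa, u, beta):
    """Most optimistic transition vector within an L1 ball of radius beta around p_hat_sa.

    Standard construction: raise the probability of the best state (w.r.t. the current
    utility vector u) by beta/2, then remove mass from the worst states to renormalize.
    """
    p = p_hat_sa.copy()
    order = np.argsort(u)                      # states sorted by increasing utility
    best = order[-1]
    p[best] = min(1.0, p[best] + beta / 2.0)
    excess = p.sum() - 1.0
    for s in order:                            # drain mass from the lowest-utility states
        if excess <= 0:
            break
        removed = min(p[s], excess) if s != best else 0.0
        p[s] -= removed
        excess -= removed
    return p

def evi_step(u, r_hat, p_hat, conf_r, conf_p):
    """One extended value iteration update u -> u' over the confidence set."""
    S, A = r_hat.shape
    u_next = np.empty(S)
    for s in range(S):
        q = np.empty(A)
        for a in range(A):
            p_tilde = inner_max_transition(p_hat[s, a], u, conf_p[s, a])
            q[a] = (r_hat[s, a] + conf_r[s, a]) + p_tilde @ u
        u_next[s] = q.max()
    return u_next
```

Iterating `evi_step` until the stopping criterion (14) is met, and then acting greedily with respect to the terminal utility vector, yields the optimistic policy executed during the epoch.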
Appendix B Proof of Theorem 1
Let $m$ be such that $T$ lies in the $m$-th epoch, i.e., $t_m \leq T < t_{m+1}$. Clearly,
Using the above relation, the regret given in (3) can be rewritten as,
The term $\Delta_k$ can be interpreted as the regret accumulated over epoch $k$. Observe that, for a given history of state-action evolution up to time $T$, the collection of random variables $\{\lVert R_t(s_t, a_t) \rVert_1\}_{t=1}^{T}$ are mutually independent. Moreover,
Using Hoeffding's inequality, we therefore obtain,
This implies that, with probability at least , the following holds.
(16)
B.1 Regret bound on episodes where the true MDP lies outside the confidence set
Recall that $\mathcal{M}_k$ is defined to be a collection of MDPs that obey the confidence bounds on the reward and transition functions. In this subsection, we calculate the regret contribution of the episodes where the true MDP does not satisfy these bounds. The following lemma provides an upper bound estimate of the probability that the true MDP, $M$, does not lie in the confidence set, $\mathcal{M}_k$.
Lemma 1
The proof of Lemma 1 is relegated to Appendix C. Although the final result in Lemma 1 is the same as in [Jaksch et al., 2010], the proof techniques are quite different. In particular, we need to account for the fact that reward estimates might be corrupted by contributions originating from various past actions. Using Lemma 1, the following bound can be obtained.
Lemma 2
[Jaksch et al., 2010] With probability at least,
(17)
We would like to mention here that although the definition of $\mathcal{M}_k$ used in our article is different from that given in [Jaksch et al., 2010], the above result still holds. This is mainly because the only property of $\mathcal{M}_k$ that is invoked to prove Lemma 2 is provided in Lemma 1, which is the same as in [Jaksch et al., 2010].
B.2 Regret bound on episodes where the true MDP lies inside the confidence set
Let $k$ be the index of an episode where $M \in \mathcal{M}_k$, and let $u_k$ be the terminal utility vector obtained via extended value iteration at the $k$-th episode. Define, $\forall s \in \mathcal{S}$,
Lemma 3
[Jaksch et al., 2010] If $k$ is such that $M \in \mathcal{M}_k$, then,
where $\tilde{r}_k$, $\tilde{p}_k$, and $\tilde{\pi}_k$ are the reward function, transition function, and policy of the optimistic MDP, $\tilde{M}_k$. Moreover,
(18)
with probability at least .
The only properties of $\mathcal{M}_k$ that are used in the proof of Lemma 3 are the transition confidence bound and the guarantees of extended value iteration, which are the same as given in [Jaksch et al., 2010]. Note that the term defined in Lemma 3 can be bounded as follows,
(19)
where the inequality applies the facts stated above. Note that our algorithm enforces $\nu_k(s,a) \leq \max\{1, N_k(s,a)\}$. Therefore, the above can be further bounded as,
(20)
Observe that, , . Therefore,
(21)
Finally, injecting the above inequality, we obtain the following bound,
(22)
We now use the following lemma to simplify the upper bounds.
Lemma 4
[Jaksch et al., 2010] The following inequalities hold true,
(23)
(24) |
Using Lemma 4 and combining the preceding bounds, we conclude that the following is satisfied with high probability.
(25)
B.3 Obtaining the Total Regret
Combining (16), (17), and (25), we can now establish that the following inequality is satisfied with high probability.
Taking a union bound over $T$ and noting that the total failure probability is at most $\delta$, we conclude the theorem.
Appendix C Proof of Lemma 1
The probability that the $\ell_1$-deviation between the true and the empirical distributions of $S$ events over $n$ independent samples exceeds $\epsilon$ can be bounded as follows [Weissman et al., 2003]:
$$\Pr\left(\left\lVert \hat{p}(\cdot) - p(\cdot) \right\rVert_1 \geq \epsilon \right) \leq 2^{S} \exp\left(-\frac{n \epsilon^2}{2}\right). \tag{26}$$
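Inverting this tail bound gives the familiar shape of an $\ell_1$ confidence radius for the transition estimates; the short derivation below is a sketch with generic constants, not the exact radius used in the paper.

```latex
% Set the right-hand side of (26) equal to a target failure probability \delta'
% and solve for \epsilon:
\begin{align}
  2^{S} \exp\!\left(-\frac{n \epsilon^{2}}{2}\right) = \delta'
  \quad \Longleftrightarrow \quad
  \epsilon = \sqrt{\frac{2}{n}\left( S \log 2 + \log \frac{1}{\delta'} \right)} .
\end{align}
```

Choosing $\delta'$ to shrink polynomially in $t$ and taking a union bound over all state-action pairs and epochs then yields a radius of order $\sqrt{S \log(A t / \delta) / n}$, which matches the usual UCRL2-style transition confidence bound.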
Presume, without loss of generality, that $N_k(s,a) \geq 1$. In our case, inequality (26) can be utilised to obtain the following bound.
(27)
The inequality above is an application of the union bound. Recall that $I_j(s,a)$ indicates whether the pair $(s,a)$ appears in the $j$-th episode. Using this notation, we deduce that, if $I_j(s,a) = 1$ for some $j < k$, then,
(28)
Using Assumption 2, we can now write the following.
(29)
Combining (28) and (29), we can now write the following with high probability,
(30)
Taking a sum over all episodes where $I_j(s,a) = 1$, we get the following inequality that is satisfied with high probability.
(31)
Using the definition of $\hat{r}_k(s,a)$, the above inequality can be rephrased as,
(32)
Using Assumption 1(b), we can show that,
(33)
Therefore, the following sequence of inequalities can be derived.