
Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss

Shuang Qiu (University of Michigan, [email protected]), Xiaohan Wei (University of Southern California, [email protected]), Zhuoran Yang (Princeton University, [email protected]), Jieping Ye (University of Michigan, [email protected]), Zhaoran Wang (Northwestern University, [email protected])
Abstract

We consider online learning for episodic stochastically constrained Markov decision processes (CMDPs), which plays a central role in ensuring the safety of reinforcement learning. Here the loss function can vary arbitrarily across the episodes, and both the loss received and the budget consumption are revealed at the end of each episode. Previous works solve this problem under the restrictive assumption that the transition model of the Markov decision process (MDP) is known a priori and establish regret bounds that depend polynomially on the cardinalities of the state space \mathcal{S} and the action space \mathcal{A}. In this work, we propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model. In particular, we prove that the proposed algorithm achieves \widetilde{\mathcal{O}}(L|\mathcal{S}|\sqrt{|\mathcal{A}|T}) upper bounds on both the regret and the constraint violation, where L is the length of each episode. Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning, which demonstrates the power of “optimism in the face of uncertainty” in constrained online learning.

1 Introduction

Constrained Markov decision processes play an important role in control and planning. They aim at maximizing a reward or minimizing a penalty metric over the set of all available policies subject to constraints on other relevant metrics. The constraints enforce the fairness or safety of the policies so that, over time, the behaviors of the chosen policy remain under control. For example, in an edge cloud serving network (Urgaonkar et al., 2015; Wang et al., 2015), one would like to minimize the average cost of serving the moving targets subject to a constraint on the average serving delay. In an autonomous vehicle control problem (Le et al., 2019), one might be interested in minimizing the driving time subject to certain fuel efficiency or driving safety constraints.

Classical treatments of CMDPs date back to Fox (1966) and Altman (1999), which reformulate the problem as a linear program (LP) via stationary state-action occupancy measures. However, to formulate such an LP, one requires full knowledge of the transition model, reward, and constraint functions, and also assumes them to be fixed. Leveraging the episodic structure of a class of MDPs, Neely (2012) develops online renewal optimization, which potentially allows the loss and constraint functions to be stochastically varying and unknown, while still relying on the transition model to solve the subproblem within each episode.

More recently, policy-search type algorithms have received much attention, attaining state-of-the-art performance in various tasks without knowledge of the transition model, e.g. Williams (1992); Baxter and Bartlett (2000); Konda and Tsitsiklis (2000); Kakade (2002); Schulman et al. (2015); Lillicrap et al. (2015); Schulman et al. (2017); Sutton and Barto (2018); Fazel et al. (2018); Abbasi-Yadkori et al. (2019a, b); Bhandari and Russo (2019); Cai et al. (2019); Wang et al. (2019); Liu et al. (2019); Agarwal et al. (2019). While most of these algorithms focus on unconstrained reinforcement learning problems, there are efforts to develop policy-based methods for CMDPs with known constraints, albeit with limited theoretical guarantees. The work Chow et al. (2017) develops a primal-dual-type algorithm which is shown to converge to a constraint-satisfying policy. Achiam et al. (2017) develops a trust-region-type algorithm which requires solving an optimization problem with both trust-region and safety constraints during each update. Generalizing ideas from fitted-Q iteration, Le et al. (2019) develops a batch offline primal-dual-type algorithm which only guarantees convergence of the time-average primal-dual gap.

The goal of this paper is to solve constrained episodic MDPs in greater generality, where not only the transition model is unknown, but also the loss and constraint functions can change online. In particular, the losses can be arbitrarily time-varying and adversarial. Let T be the number of episodes (in the non-episodic, infinite-horizon setting, T instead denotes the total number of steps, which differs slightly from the T used for the episodic setting). When the transition model is known, Even-Dar et al. (2009) achieves \widetilde{\mathcal{O}}(\varrho^{2}\sqrt{T\log|\mathcal{A}|}) regret with \varrho being the mixing time of the MDP, and the work Yu et al. (2009) achieves \widetilde{\mathcal{O}}(T^{2/3}) regret. These two papers consider a continuing (non-episodic) setting, which differs slightly from the episodic setting considered in this paper. Zimin and Neu (2013) further studies the episodic MDP and achieves \widetilde{\mathcal{O}}(L\sqrt{T\log(|\mathcal{S}||\mathcal{A}|)}) regret. For the constrained case with known transitions, the work Wei et al. (2018) achieves \widetilde{\mathcal{O}}(\mathrm{poly}(|\mathcal{S}||\mathcal{A}|)\sqrt{T}) regret and constraint violation, and Zheng and Ratliff (2020) attains \widetilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}|T^{3/4}) for the non-episodic setting.

After we finished the first version of this work, several concurrent works appeared that also focus on CMDPs with unknown transitions and losses. The work Efroni et al. (2020a) studies episodic tabular MDPs with unknown but fixed loss and constraint functions under stochastic bandit feedback. Leveraging upper confidence bounds (UCB) on the reward, constraints, and transitions, they obtain an \mathcal{O}(\sqrt{T}) regret and constraint violation via linear programming as well as primal-dual optimization. In another work, Ding et al. (2020) studies constrained episodic MDPs with a linear structure and adversarial losses via a primal-dual-type policy optimization algorithm, achieving \widetilde{\mathcal{O}}(\sqrt{T}) regret and constraint violation. While the scenario in Ding et al. (2020) is more general than ours, their dependence on the length of the episode is worse when applied to the tabular case. Both of these works rely on a Slater condition, which is more restrictive than the assumption made in this work.

On the other hand, for unconstrained online MDPs, the idea of UCB has been shown to be effective and helps to achieve tight regret bounds without knowing the transition model, e.g. Jaksch et al. (2010); Azar et al. (2017); Rosenberg and Mansour (2019a, b); Jin et al. (2019). The main idea there is to sequentially refine a confidence set of the transition model and choose the model from the confidence set that performs best in optimizing the current value.

The main contribution of this paper is to show that UCB is also effective when incorporated into primal-dual-type approaches, achieving \widetilde{\mathcal{O}}(L|\mathcal{S}|\sqrt{|\mathcal{A}|T}) regret and constraint violation simultaneously in online CMDPs when the transition model is unknown, the loss is adversarial, and the constraints are stochastic. This almost matches the lower bound \Omega(\sqrt{L|\mathcal{S}||\mathcal{A}|T}) for the regret (Jaksch et al., 2010) up to an \widetilde{\mathcal{O}}(\sqrt{L|\mathcal{S}|}) factor. Under the hood is a new Lagrange multiplier condition together with a new drift analysis on the Lagrange multipliers, which leads to low constraint violation. Our setup is challenging compared to classical constrained optimization, in particular due to (1) the unknown loss and constraint functions in the online setup, and (2) the time-varying decision sets resulting from the moving confidence set estimation of UCB. These decision sets can potentially be much larger than, or even inconsistent with, the true decision set obtained when the model is known, resulting in a potentially large constraint violation. The main idea is to utilize a Lagrange multiplier condition as well as a confidence set of the model to construct a probabilistic bound on an online dual multiplier. We then explicitly take into account the lazy nature of the confidence set estimation in our algorithm to argue that the bound on the dual multiplier gives the \widetilde{\mathcal{O}}(\sqrt{T}) bound on constraint violation.

2 Related Work

In this paper, we are more interested in a class of online MDP problems where the loss functions are arbitrarily changing, or adversarial. With a known transition model, adversarial losses, and full-information feedback (as opposed to bandit feedback), Even-Dar et al. (2009) achieves \widetilde{\mathcal{O}}(\varrho^{2}\sqrt{T\log|\mathcal{A}|}) regret with \varrho being the mixing time of the MDP, and the work Yu et al. (2009) achieves \widetilde{\mathcal{O}}(T^{2/3}) regret. These two papers consider a non-episodic setting that is different from the episodic setting we consider in this paper. The work Zimin and Neu (2013) further studies the episodic MDP and achieves an \widetilde{\mathcal{O}}(L\sqrt{T\log(|\mathcal{S}||\mathcal{A}|)}) regret.

A more challenging setting is the one in which the transition model is unknown. Under such a setting, several works study online episodic MDP problems with adversarial losses and full-information feedback. Neu et al. (2012) obtains \widetilde{\mathcal{O}}(L|\mathcal{S}||\mathcal{A}|\sqrt{T}) regret by proposing a Follow the Perturbed Optimistic Policy (FPOP) algorithm. The recent work Rosenberg and Mansour (2019a) improves the regret to \widetilde{\mathcal{O}}(L|\mathcal{S}|\sqrt{|\mathcal{A}|T}) by proposing an online upper confidence mirror descent algorithm. This regret bound nearly matches the lower bound \Omega(\sqrt{L|\mathcal{S}||\mathcal{A}|T}) (Jaksch et al., 2010) up to \mathcal{O}(\sqrt{L|\mathcal{S}|}) and some logarithmic factors. Our work is along this line of research, and further considers the setup in which there exist stochastic constraints observed at each episode during the learning process.

Besides, a number of papers also investigate online episodic MDPs with bandit feedback. Assuming the transition model is known and the losses are adversarial, Neu et al. (2010) achieves \widetilde{\mathcal{O}}(L^{2}\sqrt{T|\mathcal{A}|}/\beta) regret, where \beta is the probability that all states are reachable under all policies. Under the same setting, Neu et al. (2010) achieves \widetilde{\mathcal{O}}(T^{2/3}) regret without the dependence on \beta, and Zimin and Neu (2013) obtains \widetilde{\mathcal{O}}(\sqrt{L|\mathcal{S}||\mathcal{A}|T}) regret. Furthermore, assuming the transition model is unknown and the losses are adversarial, Rosenberg and Mansour (2019b) obtains \widetilde{\mathcal{O}}(T^{3/4}) regret in general, and \widetilde{\mathcal{O}}(\sqrt{T}/\beta) regret when all states are reachable with probability \beta under any policy. Jin et al. (2019) further achieves \widetilde{\mathcal{O}}(L|\mathcal{S}|\sqrt{|\mathcal{A}|T}) regret under the same setting of an unknown transition model and adversarial losses. We remark that our algorithm can be extended to the setting of constrained episodic MDPs where both the loss and constraint functions are time-varying and only bandit feedback is received. We leave such an extension as future work.

On the other hand, instead of adversarial losses, extensive works have studied the setting where the feedback of the losses is stochastic with a fixed expectation, e.g. Jaksch et al. (2010); Azar et al. (2017); Ouyang et al. (2017); Jin et al. (2018); Fruit et al. (2018); Wei et al. (2019a); Zhang and Ji (2019); Dong et al. (2019). Assuming the transition model is known, Zheng and Ratliff (2020) studies online CMDPs under the non-episodic setting and attains an \widetilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}|T^{3/4}) regret, which is suboptimal in terms of T. The concurrent work Efroni et al. (2020a) studies episodic MDPs with unknown transitions and stochastic bandit feedback of the losses and the constraints, and obtains an \widetilde{\mathcal{O}}(\sqrt{T}) regret and constraint violation.

In addition to the aforementioned papers, there is also a line of policy-search type works, focusing on solving online MDP problems via directly optimizing policies without knowing the transition model, e.g., Williams (1992); Baxter and Bartlett (2000); Konda and Tsitsiklis (2000); Kakade (2002); Schulman et al. (2015); Lillicrap et al. (2015); Schulman et al. (2017); Sutton and Barto (2018); Fazel et al. (2018); Abbasi-Yadkori et al. (2019a, b); Bhandari and Russo (2019); Cai et al. (2019); Wang et al. (2019); Liu et al. (2019); Agarwal et al. (2019); Efroni et al. (2020b). Efforts have also been made in several works (Chow et al., 2017; Achiam et al., 2017; Le et al., 2019) to investigate CMDP problems via policy-based methods, but with known transition models. In another concurrent work, assuming the transition model is unknown, Ding et al. (2020) studies the episodic CMDPs with linear structures and proposes a primal-dual type policy optimization algorithm.

3 Problem Formulation

Consider an episodic loop-free MDP with a finite state space \mathcal{S} and a finite action space \mathcal{A} at each state over a finite horizon of T episodes. Each episode starts with a fixed initial state s_{0} and ends with a terminal state s_{L}. The transition probability is P:\mathcal{S}\times\mathcal{S}\times\mathcal{A}\rightarrow[0,1], where P(s^{\prime}|s,a) gives the probability of transitioning from s to s^{\prime} under an action a. This underlying transition model P is assumed to be unknown. The state space is loop-free, i.e., it is divided into layers \mathcal{S}:=\mathcal{S}_{0}\cup\mathcal{S}_{1}\cup\cdots\cup\mathcal{S}_{L} with a singleton initial layer \mathcal{S}_{0}=\{s_{0}\} and terminal layer \mathcal{S}_{L}=\{s_{L}\}. Furthermore, \mathcal{S}_{k}\cap\mathcal{S}_{\ell}=\varnothing for k\neq\ell, and transitions are only allowed between consecutive layers, that is, P(s^{\prime}|s,a)>0 only if s^{\prime}\in\mathcal{S}_{k+1}, s\in\mathcal{S}_{k}, and a\in\mathcal{A}, for all k\in\{0,1,2,\cdots,L-1\}. Note that such an assumption enforces that each path from the initial state to the terminal state has a fixed length L. This is not an excessively restrictive assumption, as any loop-free MDP with bounded varying path lengths can be transformed into one with a fixed path length (see György et al. (2007) for details).

The loss function for each episode is f^{t}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}, where f^{t}(s,a,s^{\prime}) denotes the loss received at episode t for any s\in\mathcal{S}_{k}, s^{\prime}\in\mathcal{S}_{k+1}, and a\in\mathcal{A}, for all k\in\{0,1,2,\ldots,L-1\}. We assume f^{t} can be arbitrarily varying with potentially no fixed probability distribution. There are I stochastic constraint (or budget consumption) functions g_{i}^{t}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}, for all i\in\{1,2,\ldots,I\}, where g_{i}^{t}(s,a,s^{\prime}) denotes the price to pay at episode t for any (s,a,s^{\prime}). Each stochastic function g_{i}^{t} at episode t is sampled according to a random variable \xi_{i}^{t}\sim\mathcal{D}_{i}, namely g_{i}^{t}(s,a,s^{\prime})=g_{i}(s,a,s^{\prime};\xi_{i}^{t}). Then, we define g_{i}(s,a,s^{\prime}):=\mathbb{E}[g_{i}^{t}(s,a,s^{\prime})]=\mathbb{E}_{\xi_{i}^{t}}[g_{i}(s,a,s^{\prime};\xi_{i}^{t})], where the expectation is taken over the randomness of \xi_{i}^{t}\sim\mathcal{D}_{i}. For abbreviation, we denote g_{i}=\mathbb{E}[g_{i}^{t}]. In addition, the functions f^{t} and g_{i}^{t}, for all i\in\{1,\ldots,I\}, are mutually independent and independent of the Markov transitions. Both the loss functions and the budget consumption functions are revealed at the end of each episode.

Remark 3.1.

It might be tempting to consider the more general scenario that both losses and constraints are arbitrarily time-varying. For such a setting, however, there exist counterexamples (Mannor et al., 2009) in the arguably simpler constrained online learning scenario that no algorithm can achieve sublinear regret and constraint violation simultaneously. Therefore, we seek to put extra assumptions on the problem so that obtaining sublinear regret and constraint violation is feasible, one of which is to assert constraints to be stochastic instead of arbitrarily varying.

For any episode t, a policy \pi_{t} is the conditional probability \pi_{t}(a|s) of choosing an action a\in\mathcal{A} at any given state s\in\mathcal{S}. Let (s_{k},a_{k},s_{k+1})\in\mathcal{S}_{k}\times\mathcal{A}\times\mathcal{S}_{k+1} denote a random tuple generated according to the transition model P and the policy \pi_{t}. The corresponding expected loss is \sum_{k=0}^{L-1}\mathbb{E}[f^{t}(s_{k},a_{k},s_{k+1})|\pi_{t},P], while the budget costs are \sum_{k=0}^{L-1}\mathbb{E}[g_{i}^{t}(s_{k},a_{k},s_{k+1})|\pi_{t},P], i\in\{1,\cdots,I\}, where the expectations are taken w.r.t. the randomness of the tuples (s_{k},a_{k},s_{k+1}).

In this paper, we adopt the occupancy measure \theta(s,a,s^{\prime}) for our analysis. In general, the occupancy measure \theta(s,a,s^{\prime}) is a joint distribution of the tuple (s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S} under a given policy and transition model. In particular, under the true transition P, we define the set \Delta:=\{\theta~:~\theta\text{ satisfies (a), (b), and (c)}\} (Altman, 1999), where

  1. (a)

    s𝒮ka𝒜s𝒮k+1θ(s,a,s)=1,k{0,,L1}, and θ(s,a,s)0\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}_{k+1}}\theta(s,a,s^{\prime})=1,\forall k\in\{0,\ldots,L-1\},\text{ and }~{}\theta(s,a,s^{\prime})\geq 0.

  2. (b)

    s𝒮ka𝒜θ(s,a,s)=a𝒜s′′𝒮k+2θ(s,a,s′′),k{0,,L2},s𝒮k+1\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\theta(s,a,s^{\prime})=\sum_{a\in\mathcal{A}}\sum_{s^{\prime\prime}\in\mathcal{S}_{k+2}}\theta(s^{\prime},a,s^{\prime\prime}),\forall k\in\{0,\ldots,L-2\},~{}s^{\prime}\in\mathcal{S}_{k+1}.

  3. (c)

    θ(s,a,s)/s′′𝒮k+1θ(s,a,s′′)=P(s|s,a),k{0,,L1},s𝒮k,a𝒜,s𝒮k+1\theta(s,a,s^{\prime})/\sum_{s^{\prime\prime}\in\mathcal{S}_{k+1}}\theta(s,a,s^{\prime\prime})=P(s^{\prime}|s,a),\forall k\in\{0,\ldots,L-1\},s\in\mathcal{S}_{k},a\in\mathcal{A},s^{\prime}\in\mathcal{S}_{k+1}.

We can further recover a policy \pi from \theta via \pi(a|s)=\sum_{s^{\prime}\in\mathcal{S}_{k+1}}\theta(s,a,s^{\prime})/\sum_{s^{\prime}\in\mathcal{S}_{k+1},a\in\mathcal{A}}\theta(s,a,s^{\prime}).
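To make the policy-recovery formula concrete, the following minimal Python sketch recovers \pi(a|s) for one layer from an occupancy measure; the numpy array layout and the function name are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def policy_from_occupancy(theta_k, eps=1e-12):
    """Recover pi(a|s) for states s in layer k from theta(s, a, s').

    theta_k: array of shape (|S_k|, |A|, |S_{k+1}|).
    Returns an array of shape (|S_k|, |A|) holding pi(a|s).
    """
    num = theta_k.sum(axis=2)             # sum over s' in S_{k+1}
    den = theta_k.sum(axis=(1, 2))        # sum over (a, s')
    return num / np.maximum(den[:, None], eps)

# Toy check: a uniform occupancy measure yields the uniform policy.
theta_k = np.full((3, 2, 4), 1.0 / (3 * 2 * 4))
print(policy_from_occupancy(theta_k))     # every row is [0.5, 0.5]
```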

We define \overline{\theta}^{t}(s,a,s^{\prime}), s\in\mathcal{S}, s^{\prime}\in\mathcal{S}, a\in\mathcal{A}, to be the occupancy measure at episode t w.r.t. the true transition P, resulting from the policy \pi_{t} at episode t. Given the definition of the occupancy measure, we can rewrite the expected loss and the budget cost as \sum_{k=0}^{L-1}\mathbb{E}[f^{t}(s_{k},a_{k},s_{k+1})|\pi_{t},P]=\langle f^{t},\overline{\theta}^{t}\rangle, where \langle f^{t},\overline{\theta}^{t}\rangle=\sum_{s,a,s^{\prime}}f^{t}(s,a,s^{\prime})\overline{\theta}^{t}(s,a,s^{\prime}), and \sum_{k=0}^{L-1}\mathbb{E}[g_{i}^{t}(s_{k},a_{k},s_{k+1})|\pi_{t},P]=\langle g_{i}^{t},\overline{\theta}^{t}\rangle, where \langle g_{i}^{t},\overline{\theta}^{t}\rangle=\sum_{s,a,s^{\prime}}g_{i}^{t}(s,a,s^{\prime})\overline{\theta}^{t}(s,a,s^{\prime}). We aim to solve the following constrained optimization problem, and let \overline{\theta}^{*} be one solution, which is further viewed as a reference point to define the regret:

minimizeθΔt=1Tft,θ,subject togi,θci,i[I],\displaystyle\begin{aligned} \mathop{\mathrm{minimize}}_{\theta\in\Delta}~{}~{}&\sum_{t=1}^{T}\langle f^{t},\theta\rangle,~{}~{}~{}\text{subject to}~{}~{}~{}\langle g_{i},\theta\rangle\leq c_{i},~{}~{}\forall i\in[I],\end{aligned} (1)

where t=1Tft,θ=t=1T𝔼[k=0L1ft(sk,ak,sk+1)|π,P]\sum_{t=1}^{T}\langle f^{t},\theta\rangle=\sum_{t=1}^{T}\mathbb{E}[\sum_{k=0}^{L-1}f^{t}(s_{k},a_{k},s_{k+1})|\pi,P] is the overall loss in TT episodes and constraints are enforced on the budget cost gi,θ=𝔼[k=0L1gi(sk,ak,sk+1)|π,P]\langle g_{i},\theta\rangle=\mathbb{E}[\sum_{k=0}^{L-1}g_{i}(s_{k},a_{k},s_{k+1})|\pi,P] based on the expected budget consumption functions gi,i[I]g_{i},\forall i\in[I]. To measure the regret and the constraint violation respectively for solving (1) in an online setting, we define the following two metrics:

Regret(T):=t=1Tft,θ¯tθ¯, and Violation(T):=[t=1T(𝐠t(θ¯t)𝐜)]+2,\displaystyle\mathrm{Regret}(T):=\sum_{t=1}^{T}\big{\langle}f^{t},\overline{\theta}^{t}-\overline{\theta}^{*}\big{\rangle},~{}~{}~{}\text{ and }~{}~{}\mathrm{Violation}(T):=\Bigg{\|}\Bigg{[}\sum_{t=1}^{T}(\mathbf{g}^{t}(\overline{\theta}^{t})-\mathbf{c})\Bigg{]}_{+}\Bigg{\|}_{2}, (2)

where the notation [𝐯]+[\mathbf{v}]_{+} denotes the entry-wise application of max{,0}\max\{\cdot,0\} for any vector 𝐯\mathbf{v}. For abbreviation, let 𝐠t(θ):=[g1t,θ,,gIt,θ]\mathbf{g}^{t}(\theta):=[\langle{g_{1}^{t}},{\theta}\rangle,~{}\cdots,~{}\langle{g_{I}^{t}},{\theta}\rangle]^{\top}, and 𝐜:=[c1,,cI]\mathbf{c}:=\left[c_{1},~{}\cdots,~{}c_{I}\right]^{\top}.
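As a concrete reading of (2), the sketch below evaluates the two metrics from per-episode quantities; the flattened inputs standing in for \langle f^{t},\overline{\theta}^{t}-\overline{\theta}^{*}\rangle and \mathbf{g}^{t}(\overline{\theta}^{t}) are hypothetical, introduced only for illustration.

```python
import numpy as np

def regret_and_violation(loss_gap, budget_cost, c):
    """Evaluate the two metrics in (2) from per-episode quantities.

    loss_gap:    length-T array of <f^t, theta_bar^t - theta_star>.
    budget_cost: (T, I) array whose row t is g^t(theta_bar^t).
    c:           length-I array of budget levels c_i.
    """
    regret = float(loss_gap.sum())
    excess = (budget_cost - c).sum(axis=0)        # accumulated budget excess per constraint
    violation = float(np.linalg.norm(np.maximum(excess, 0.0)))  # [.]_+ then l2 norm
    return regret, violation

# Toy usage with T = 5 episodes and I = 2 constraints.
rng = np.random.default_rng(0)
print(regret_and_violation(rng.normal(size=5), rng.normal(size=(5, 2)), np.zeros(2)))
```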

The goal is to attain a sublinear regret bound and constraint violation for this problem w.r.t. any fixed stationary policy \pi, which does not change over episodes. In other words, we compare to the best policy \pi^{*} in hindsight whose corresponding occupancy measure \overline{\theta}^{*}\in\Delta solves problem (1). We make the following assumption on the existence of a solution to (1).

Assumption 3.2.

There exists at least one fixed policy π\pi such that the corresponding probability θΔ\theta\in\Delta is feasible, i.e., gi,θci,i{1,2,,I}\langle g_{i},\theta\rangle\leq c_{i},i\in\{1,2,\cdots,I\}.

Then, for simplicity of notation and without loss of generality, we assume the following boundedness of the function values.

Assumption 3.3.

We assume the following quantities are bounded. For any t{1,2,,T}t\in\{1,2,\ldots,T\}, (1) sups,a,s|ft(s,a,s)|1\sup_{s,a,s^{\prime}}|f^{t}(s,a,s^{\prime})|\leq 1, (2) i=1Isups,a,s|git(s,a,s)|1\sum_{i=1}^{I}\sup_{s,a,s^{\prime}}|g_{i}^{t}(s,a,s^{\prime})|\leq 1, (3) i=1I|ci|L\sum_{i=1}^{I}|c_{i}|\leq L.

When the transition model P is known and Slater's condition holds (i.e., there exists a policy which satisfies all stochastic inequality constraints with a constant \varepsilon-slackness), this stochastically constrained online linear program can be solved via methods similar to those of Wei et al. (2018); Yu et al. (2017), with a regret bound that depends polynomially on the cardinalities of the state and action spaces, which is highly suboptimal especially when the state or action space is large. The main challenge we address in this paper is to solve this problem without knowing the model P, or the losses and constraints before making decisions, while tightening the dependency on both the state and action spaces in the resulting performance bound.

4 Proposed Algorithm

In this section, we introduce our proposed algorithm, namely the upper confidence primal-dual (UCPD) algorithm, as presented in Algorithm 1. It adopts a primal-dual mirror descent type algorithm for solving constrained problems, but with an important difference: we maintain a confidence set via past sample trajectories, which contains the true MDP model P with high probability, and choose the policy to minimize the proximal Lagrangian using the most optimistic model from the confidence set. Such an idea, known as optimism in the face of uncertainty, is reminiscent of the upper confidence bound (UCB) algorithm (Auer et al., 2002) for stochastic multi-armed bandits (MAB) and was first applied by Jaksch et al. (2010) to obtain a near-optimal regret for reinforcement learning problems.

In the algorithm, we introduce epochs, which are back-to-back time intervals that span several episodes. We use {1,2,}\ell\in\{1,2,\cdots\} to index the epochs and use (t)\ell(t) to denote a mapping from episode index tt to epoch index, indicating which epoch the tt-th episode lives in. Next, let N(s,a)N_{\ell}(s,a) and M(s,a,s)M_{\ell}(s,a,s^{\prime}) be two global counters which indicate the number of times the tuples (s,a)(s,a) and (s,a,s)(s,a,s^{\prime}) appear before the \ell-th epoch. Let n(s,a)n_{\ell}(s,a), m(s,a,s)m_{\ell}(s,a,s^{\prime}) be two local counters which indicate the number of times the tuples (s,a)(s,a) and (s,a,s)(s,a,s^{\prime}) appear in the \ell-th epoch. We start a new epoch whenever there exists (s,a)(s,a) such that n(t)(s,a)N(t)(s,a)n_{\ell(t)}(s,a)\geq N_{\ell(t)}(s,a). Otherwise, set (t+1)=(t)\ell(t+1)=\ell(t). Such an update rule follows from Jaksch et al. (2010). Then, we define the empirical transition model P^\widehat{P}_{\ell} at any epoch >0\ell>0 as

P^(s|s,a):=M(s,a,s)max{1,N(s,a)},s,s𝒮,a𝒜.\widehat{P}_{\ell}(s^{\prime}|s,a):=\frac{M_{\ell}(s,a,s^{\prime})}{\max\{1,N_{\ell}(s,a)\}},~{}~{}\forall s,s^{\prime}\in\mathcal{S},~{}a\in\mathcal{A}.

As shown in Remark 6.8, introducing the notion of ‘epoch’ is necessary to achieve an 𝒪~(T)\widetilde{\mathcal{O}}(\sqrt{T}) constraint violation.

The next lemma shows that, with high probability, the true transition model P is contained in a confidence interval around the empirical one regardless of the sequence of policies taken; it is adapted from Lemma 1 of Neu et al. (2012).

Lemma 4.1 (Lemma 1 of Neu et al. (2012)).

For any \zeta\in(0,1), we have that with probability at least 1-\zeta, for all epochs \ell\leq\ell(T+1) and any state-action pair (s,a)\in\mathcal{S}\times\mathcal{A},

P(|s,a)P^(|s,a)1εζ(s,a),\Big{\|}P(\cdot|s,a)-\widehat{P}_{\ell}(\cdot|s,a)\Big{\|}_{1}\leq\varepsilon_{\ell}^{\zeta}(s,a),

with the error εζ(s,a)\varepsilon_{\ell}^{\zeta}(s,a) being

εζ(s,a):=2|𝒮k(s)+1|log[(T+1)|𝒮||𝒜|/ζ]max{1,N(s,a)},\varepsilon_{\ell}^{\zeta}(s,a):=\sqrt{\frac{2|\mathcal{S}_{k(s)+1}|\log[(T+1)|\mathcal{S}||\mathcal{A}|/\zeta]}{\max\{1,N_{\ell}(s,a)\}}}, (3)

where k(s)k(s) is a mapping from state ss to the layer that the state ss belongs to.
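For concreteness, the following sketch maintains the global counters and evaluates the empirical transition model \widehat{P}_{\ell} together with the confidence width \varepsilon_{\ell}^{\zeta}(s,a) of (3); the array shapes, the flat state indexing, and the helper name are assumptions made only for illustration.

```python
import numpy as np

def empirical_model_and_width(M, N, next_layer_size, T, S, A, zeta):
    """Empirical transition P_hat_l and confidence width eps_l^zeta of (3).

    M: (|S|, |A|, |S|) counts of (s, a, s') accumulated before the current epoch.
    N: (|S|, |A|)      counts of (s, a)     accumulated before the current epoch.
    next_layer_size: length-|S| array holding |S_{k(s)+1}| for each state s.
    """
    denom = np.maximum(1, N)                     # max{1, N_l(s, a)}
    P_hat = M / denom[:, :, None]                # empirical transition model
    log_term = np.log((T + 1) * S * A / zeta)
    eps = np.sqrt(2.0 * next_layer_size[:, None] * log_term / denom)  # eq. (3)
    return P_hat, eps

# Toy usage on a 4-state, 2-action layered MDP before any observations.
S, A = 4, 2
P_hat, eps = empirical_model_and_width(np.zeros((S, A, S)), np.zeros((S, A), dtype=int),
                                       np.full(S, 2), T=100, S=S, A=A, zeta=0.1)
```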

Algorithm 1 Upper-Confidence Primal-Dual (UCPD) Mirror Descent
1:Input: Let V,\alpha>0 and \lambda\in[0,1) be some trade-off parameters. Fix \zeta\in(0,1).
2:Initialize: Qi(1)=0,i=1,,IQ_{i}(1)=0,~{}\forall i=1,\ldots,I. θ1(s,a,s)=1/(|𝒮k||𝒮k+1||𝒜|),(s,a,s)𝒮k×𝒜×𝒮k+1\theta^{1}(s,a,s^{\prime})=1/(|\mathcal{S}_{k}||\mathcal{S}_{k+1}||\mathcal{A}|),~{}\forall(s,a,s^{\prime})\in\mathcal{S}_{k}\times\mathcal{A}\times\mathcal{S}_{k+1}. (1)=1\ell(1)=1. n1(s,a)=0,N1(s,a)=0,(s,a)𝒮×𝒜n_{1}(s,a)=0,~{}N_{1}(s,a)=0,~{}\forall(s,a)\in\mathcal{S}\times\mathcal{A}. m1(s,a,s)=0,M1(s,a,s)=0,(s,a,s)𝒮×𝒜×𝒮m_{1}(s,a,s^{\prime})=0,~{}M_{1}(s,a,s^{\prime})=0,~{}\forall(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}.
3:for t=1,2,3,t=1,2,3,\ldots do
4:     Compute θt\theta^{t} via (6) and the corresponding policy πt\pi_{t} via (4).
5:     Sample a path (s0t,a0t,,sL1t,aL1t,sLt)(s_{0}^{t},a_{0}^{t},\cdots,s_{L-1}^{t},a_{L-1}^{t},s_{L}^{t}) following the policy πt\pi_{t}.
6:     Update each dual multiplier Qi(t)Q_{i}(t) via (5) and update the local counters:
n(t)(skt,akt)=n(t)(skt,akt)+1,\displaystyle n_{\ell(t)}(s_{k}^{t},a_{k}^{t})=n_{\ell(t)}(s_{k}^{t},a_{k}^{t})+1,
m(t)(skt,akt,sk+1t)=m(t)(skt,akt,sk+1t)+1.\displaystyle m_{\ell(t)}(s_{k}^{t},a_{k}^{t},s_{k+1}^{t})=m_{\ell(t)}(s_{k}^{t},a_{k}^{t},s_{k+1}^{t})+1.
7:     Observe the loss function ftf^{t} and constraint functions {git}i=1I\{g_{i}^{t}\}_{i=1}^{I}.
8:     if (s,a)𝒮×𝒜,n(t)(s,a)N(t)(s,a),\exists(s,a)\in\mathcal{S}\times\mathcal{A},~{}n_{\ell(t)}(s,a)\geq N_{\ell(t)}(s,a), then
9:         Start a new epoch:
10:         Set (t+1)=(t)+1\ell(t+1)=\ell(t)+1, and update the global counters for all s,s𝒮,a𝒜s,s^{\prime}\in\mathcal{S},~{}a\in\mathcal{A} by
N(t+1)(s,a)=N(t)(s,a)+n(t)(s,a),\displaystyle N_{\ell(t+1)}(s,a)=N_{\ell(t)}(s,a)+n_{\ell(t)}(s,a),
M(t+1)(s,a,s)=M(t)(s,a,s)+m(t)(s,a,s).\displaystyle M_{\ell(t+1)}(s,a,s^{\prime})=M_{\ell(t)}(s,a,s^{\prime})+m_{\ell(t)}(s,a,s^{\prime}).
11:         Construct the empirical transition
P^(t+1)(s|s,a):=M(t+1)(s,a,s)max{1,N(t+1)(s,a)},(s,a,s).\displaystyle\widehat{P}_{\ell(t+1)}(s^{\prime}|s,a):=\frac{M_{\ell(t+1)}(s,a,s^{\prime})}{\max\{1,N_{\ell(t+1)}(s,a)\}},\forall(s,a,s^{\prime}).
12:         Initialize n(t+1)(s,a)=0,m(t+1)(s,a,s)=0,(s,a,s)𝒮×𝒜×𝒮n_{\ell(t+1)}(s,a)=0,~{}m_{\ell(t+1)}(s,a,s^{\prime})=0,~{}\forall(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}.
13:     else
14:         Set (t+1)=(t)\ell(t+1)=\ell(t).
15:     end if
16:end for

4.1 Computing Optimistic Policies

Next, we show how to compute the policy at each episode. Formally, we introduce a new occupancy measure at episode t, namely \theta^{t}(s,a,s^{\prime}), s,s^{\prime}\in\mathcal{S}, a\in\mathcal{A}. It should be emphasized that this is different from the \overline{\theta}^{t}(s,a,s^{\prime}) defined in the previous section, as \theta^{t}(s,a,s^{\prime}) is chosen by the decision maker at episode t to construct the policy. In particular, \theta^{t}(s,a,s^{\prime}) does not have to satisfy the local balance equation (c). Once \theta^{t}(s,a,s^{\prime}) is obtained (which will be detailed below), we construct the policy as follows

πt(a|s)=sθt(s,a,s)s,aθt(s,a,s),a𝒜,s𝒮.\pi_{t}(a|s)=\frac{\sum_{s^{\prime}}\theta^{t}(s,a,s^{\prime})}{\sum_{s^{\prime},a}\theta^{t}(s,a,s^{\prime})},~{}\forall a\in\mathcal{A},~{}s\in\mathcal{S}. (4)

Next, we describe the proposed method for computing \theta^{t}(s,a,s^{\prime}). First, we introduce an online dual multiplier Q_{i}(t) for each constraint in (1), which is 0 when t=1 and is updated as follows for t\geq 2,

Qi(t)=max{Qi(t1)+git1,θtci,0}.Q_{i}(t)=\max\{Q_{i}(t-1)+\langle{g_{i}^{t-1}},{\theta^{t}}\rangle-c_{i},~{}0\}. (5)
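For concreteness, here is a minimal vectorized sketch of the update (5) for all I constraints at once; the flattened-array representation of g_{i}^{t-1} and \theta^{t} is an illustrative assumption rather than part of the paper.

```python
import numpy as np

def update_duals(Q_prev, g_prev, theta_t, c):
    """One step of (5): Q_i(t) = max{Q_i(t-1) + <g_i^{t-1}, theta^t> - c_i, 0}.

    Q_prev:  (I,)   dual multipliers Q_i(t-1).
    g_prev:  (I, d) flattened constraint functions g_i^{t-1}(s, a, s').
    theta_t: (d,)   flattened occupancy measure theta^t.
    c:       (I,)   budget levels c_i.
    """
    return np.maximum(Q_prev + g_prev @ theta_t - c, 0.0)
```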

At each episode, we compute the new occupancy measure \theta^{t}(s,a,s^{\prime}) by solving an optimistic regularized linear program (ORLP) with tuning parameters \lambda,~V,~\alpha>0. Specifically, we update \theta^{t} for all t\geq 2 by solving

θt=argminθΔ((t),ζ)Vft1+i=1IQi(t1)git1,θ+αD(θ,θ~t1).\displaystyle\begin{aligned} \theta^{t}=\mathop{\mathrm{argmin}}_{\theta\in\Delta(\ell(t),\zeta)}~{}~{}&\Big{\langle}Vf^{t-1}+\sum_{i=1}^{I}Q_{i}(t-1)g_{i}^{t-1},\theta\Big{\rangle}+\alpha D(\theta,\widetilde{\theta}^{t-1}).\end{aligned} (6)

For t=1, we let \theta^{1}(s,a,s^{\prime})=1/(|\mathcal{S}_{k}||\mathcal{S}_{k+1}||\mathcal{A}|) for all (s,a,s^{\prime})\in\mathcal{S}_{k}\times\mathcal{A}\times\mathcal{S}_{k+1}. The above updating rule introduces extra notation, namely \Delta(\ell(t),\zeta), \widetilde{\theta}^{t-1}, and D(\cdot,\cdot), which will be elaborated below. Specifically, we denote by D(\cdot,\cdot) the unnormalized Kullback-Leibler (KL) divergence between two occupancy measures \theta and \theta^{\prime}, which is defined as

D(θ,θ):=s,a,s[θ(s,a,s)logθ(s,a,s)θ(s,a,s)θ(s,a,s)+θ(s,a,s)],θ,θ.\displaystyle D(\theta,\theta^{\prime}):=\sum_{s,a,s^{\prime}}[\theta(s,a,s^{\prime})\log\frac{\theta(s,a,s^{\prime})}{\theta^{\prime}(s,a,s^{\prime})}-\theta(s,a,s^{\prime})+\theta^{\prime}(s,a,s^{\prime})],\quad\forall\theta,\theta^{\prime}. (7)

In addition, for all k\in\{0,\ldots,L-1\} and all s\in\mathcal{S}_{k}, a\in\mathcal{A}, s^{\prime}\in\mathcal{S}_{k+1}, we compute \widetilde{\theta}^{t-1} via

θ~t1(s,a,s)=(1λ)θt1(s,a,s)+λ|𝒮k||𝒮k+1||𝒜|,\widetilde{\theta}^{t-1}(s,a,s^{\prime})=(1-\lambda)\theta^{t-1}(s,a,s^{\prime})+\frac{\lambda}{|\mathcal{S}_{k}||\mathcal{S}_{k+1}||\mathcal{A}|},

where 0\leq\lambda\leq 1. This equation introduces a probability mixing, pushing the update away from the boundary and encouraging exploration.
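As an illustration, the following sketch evaluates the unnormalized KL divergence (7) and the mixing step above on flattened occupancy measures; the array layout and the small numerical floor are assumptions for illustration, not part of the paper.

```python
import numpy as np

def unnormalized_kl(theta, theta_prime, floor=1e-12):
    """Unnormalized KL divergence D(theta, theta') of (7)."""
    t = np.maximum(theta, floor)
    tp = np.maximum(theta_prime, floor)
    return float(np.sum(t * np.log(t / tp) - t + tp))

def mix_with_uniform(theta_prev, layer_card, lam):
    """theta_tilde^{t-1} = (1 - lam) * theta^{t-1} + lam / (|S_k||S_{k+1}||A|).

    layer_card: array matching theta_prev, holding |S_k||S_{k+1}||A| entry-wise.
    """
    return (1.0 - lam) * theta_prev + lam / layer_card
```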

Furthermore, since for any epoch >0\ell>0, we can compute the empirical transition model P^\widehat{P}_{\ell} with the confidence interval size εζ\varepsilon_{\ell}^{\zeta} as defined in (3), we let every θΔ(,ζ)\theta\in\Delta(\ell,\zeta) satisfy that

θ(s,a,)sθ(s,a,s)P^(|s,a)1εζ(s,a),s𝒮,a𝒜,\displaystyle\bigg{\|}\frac{\theta(s,a,\cdot)}{\sum_{s^{\prime}}\theta(s,a,s^{\prime})}-\widehat{P}_{\ell}(\cdot|s,a)\bigg{\|}_{1}\leq\varepsilon_{\ell}^{\zeta}(s,a),~{}~{}\forall s\in\mathcal{S},a\in\mathcal{A}, (8)

such that we can define the feasible set Δ(,ζ)\Delta(\ell,\zeta) for the optimization problem (6) as follows

Δ(,ζ):={θ:θ satisfies (a),(b), and (8)}.\displaystyle\Delta(\ell,\zeta):=\{\theta:\theta\text{ satisfies }\ref{item:cond1},\ref{item:cond2},\text{ and }\eqref{eq:approx_local_balance}\}. (9)

By this definition, we know that \theta^{t}\in\Delta(\ell(t),\zeta) at epoch \ell(t). On the other hand, according to Lemma 4.1, we have that with probability at least 1-\zeta, \Delta\subseteq\Delta(\ell,\zeta) holds for all epochs \ell. By Rosenberg and Mansour (2019a), problem (6) is essentially a linear program with a special structure that can be solved efficiently (see details in Section A of the supplementary material).
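Putting the pieces of Algorithm 1 and Section 4.1 together, here is a hedged sketch of one UCPD episode. The callables solve_orlp (a solver for the linear program (6) over \Delta(\ell(t),\zeta)), theta_to_policy (the policy extraction (4)), and sample_path (the environment rollout) are hypothetical interfaces introduced for this sketch, not code supplied by the paper.

```python
import numpy as np

def ucpd_episode(theta_prev, Q_prev, f_prev, g_prev, c, V, alpha, lam, layer_card,
                 solve_orlp, theta_to_policy, sample_path):
    """One episode t >= 2 of the upper confidence primal-dual method (sketch).

    theta_prev: (d,)   flattened occupancy measure theta^{t-1}.
    Q_prev:     (I,)   dual multipliers Q(t-1).
    f_prev:     (d,)   flattened loss f^{t-1}; g_prev: (I, d) constraints g_i^{t-1}.
    """
    # Mix last episode's measure towards uniform: theta_tilde^{t-1}.
    theta_tilde = (1.0 - lam) * theta_prev + lam / layer_card
    # Optimistic regularized LP (6): linear cost V f^{t-1} + sum_i Q_i(t-1) g_i^{t-1},
    # plus the KL proximal term alpha * D(theta, theta_tilde), over Delta(l(t), zeta).
    cost = V * f_prev + Q_prev @ g_prev
    theta = solve_orlp(cost, theta_tilde, alpha)
    # Dual update (5) with the new theta^t and last episode's constraint feedback.
    Q = np.maximum(Q_prev + g_prev @ theta - c, 0.0)
    # Recover the policy (4) and roll out one trajectory.
    pi = theta_to_policy(theta)
    path = sample_path(pi)
    return theta, Q, path
```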

5 Main Results

Before presenting our results, we first make an assumption on the existence of Lagrange multipliers. We define a partial average function starting from any time slot t as f^{(t,\tau)}:=\frac{1}{\tau}\sum_{j=0}^{\tau-1}f^{t+j}. Then, we consider the following static optimization problem (recalling g_{i}:=\mathbb{E}[g_{i}^{t}])

minimizeθΔf(t,τ),θs.t.gi,θci,i{1,,I}.\mathop{\mathrm{minimize}}_{\theta\in\Delta}~{}\langle f^{(t,\tau)},\theta\rangle~{}~{}~{}\text{s.t.}~{}~{}~{}\langle{g_{i}},{\theta}\rangle\leq c_{i},~{}\forall i\in\{1,\ldots,I\}. (10)

Denote the solution to this program as θt,τ\theta^{*}_{t,\tau}. Define the Lagrangian dual function of (10) as q(t,τ)(η):=minθΔf(t,τ)(θ)+i=1Iηi(gi(θ)ci)q^{(t,\tau)}(\eta):=\min_{\theta\in\Delta}~{}f^{(t,\tau)}(\theta)+\sum_{i=1}^{I}\eta_{i}(g_{i}(\theta)-c_{i}), where η=[η1,,ηI]I\eta=[\eta_{1},\ldots,\eta_{I}]^{\top}\in\mathbb{R}^{I} is a dual variable. We are ready to state our assumption:

Assumption 5.1.

For any time slot tt and any time period τ\tau, the set of primal optimal solution to (10) is non-empty. Furthermore, the set of Lagrange multipliers, which is 𝒱t,τ:=argmaxη+Iq(t,τ)(η)\mathcal{V}_{t,\tau}^{*}:=\text{argmax}_{\eta\in\mathbb{R}^{I}_{+}}q^{(t,\tau)}(\eta), is non-empty and bounded. Any vector in 𝒱t,τ\mathcal{V}^{*}_{t,\tau} is called a Lagrange multiplier associated with (10). Furthermore, let B>0B>0 be a constant such that for any t{1,,T}t\in\{1,\ldots,T\} and τ=T\tau=\sqrt{T}, the dual optimal set 𝒱t,τ\mathcal{V}_{t,\tau}^{*} defined above satisfies maxη𝒱t,τη2B\max_{\eta\in\mathcal{V}_{t,\tau}^{*}}\|\eta\|_{2}\leq B.

As discussed in Section B of the supplementary material, Assumption 5.1 imposes a weaker condition than the Slater condition commonly adopted in previous constrained online learning works. The following lemma further shows the relation between Assumption 5.1 and the dual function.

Lemma 5.2.

Suppose Assumption 5.1 holds. Then for any t\in\{1,\ldots,T\} and \tau=\sqrt{T}, there exist constants \vartheta,\sigma>0 such that for any \eta\in\mathbb{R}^{I} satisfying \mathrm{dist}(\eta,\mathcal{V}_{t,\tau}^{*})\geq\vartheta (where \mathrm{dist}(\eta,\mathcal{V}_{t,\tau}^{*}):=\min_{\eta^{\prime}\in\mathcal{V}_{t,\tau}^{*}}\|\eta-\eta^{\prime}\|_{2} denotes the Euclidean distance between the point \eta and the set \mathcal{V}_{t,\tau}^{*}), we have

q(t,τ)(ηt,τ)q(t,τ)(η)σdist(η,𝒱t,τ),ηt,τ𝒱t,τ.q^{(t,\tau)}(\eta^{*}_{t,\tau})-q^{(t,\tau)}(\eta)\geq\sigma\cdot\mathrm{dist}(\eta,\mathcal{V}_{t,\tau}^{*}),~{}~{}\forall~{}\eta^{*}_{t,\tau}\in\mathcal{V}_{t,\tau}^{*}.

Based on the above assumptions and lemmas, we present results of the regret and constraint violation.

Theorem 5.3.

Consider any fixed horizon T\geq|\mathcal{S}||\mathcal{A}| with |\mathcal{S}|,|\mathcal{A}|>1. Suppose Assumptions 3.2, 3.3, and 5.1 hold, and there exist absolute constants \overline{\sigma} and \overline{\vartheta} such that \sigma\geq\overline{\sigma} and \vartheta\leq\overline{\vartheta} for all \sigma,\vartheta in Lemma 5.2 over t\in\{1,\ldots,T\} and \tau=\sqrt{T}. Setting \alpha=LT, V=L\sqrt{T}, \lambda=1/T, and \zeta\in(0,1/(4+8L/\overline{\sigma})] in Algorithm 1, with probability at least 1-4\zeta, we have

Regret(T)\displaystyle\mathrm{Regret}(T) 𝒪~(L|𝒮|T|𝒜|),\displaystyle\leq\widetilde{\mathcal{O}}\Big{(}L|\mathcal{S}|\sqrt{T|\mathcal{A}|}\Big{)},
Violation(T)\displaystyle\mathrm{Violation}(T) 𝒪~(L|𝒮|T|𝒜|),\displaystyle\leq\widetilde{\mathcal{O}}\Big{(}L|\mathcal{S}|\sqrt{T|\mathcal{A}|}\Big{)},

where the notation \widetilde{\mathcal{O}}(\cdot) absorbs the factors \log^{3/2}(T/\zeta) and \log(T|\mathcal{S}||\mathcal{A}|/\zeta).

6 Theoretical Analysis

In this section, we provide lemmas and proof sketches for the regret bound and constraint violation bound in Theorem 5.3. The detailed proofs for Section 6.1 and Section 6.2 are in Section C and Section D of the supplementary material respectively.

6.1 Proof of Regret Bound

Lemma 6.1.

The updating rules in Algorithm 1 ensure that with probability at least 12ζ1-2\zeta,

t=1Tθtθ¯t1\displaystyle\sum_{t=1}^{T}\big{\|}\theta^{t}-\overline{\theta}^{t}\big{\|}_{1} (2+1)L|𝒮|2T|𝒜|log2T|𝒮||𝒜|ζ+2L22TlogLζ.\displaystyle\leq(\sqrt{2}+1)L|\mathcal{S}|\sqrt{2T|\mathcal{A}|\log\frac{2T|\mathcal{S}||\mathcal{A}|}{\zeta}}+2L^{2}\sqrt{2T\log\frac{L}{\zeta}}.
Lemma 6.2.

The updating rules in Algorithm 1 ensure that with probability at least 1ζ1-\zeta,

t=1Tft,θtθ¯\displaystyle\sum_{t=1}^{T}\big{\langle}f^{t},\theta^{t}-\overline{\theta}^{*}\big{\rangle}\leq 4L2T+(λT+1)αLlog|𝒮|2|𝒜|V+2λLT+LT2α+1Vt=1T𝐐(t),𝐠t(θ¯)𝐜.\displaystyle\frac{4L^{2}T+(\lambda T+1)\alpha L\log|\mathcal{S}|^{2}|\mathcal{A}|}{V}+2\lambda LT+\frac{LT}{2\alpha}+\frac{1}{V}\sum_{t=1}^{T}\left\langle\mathbf{Q}(t),\mathbf{g}^{t}(\overline{\theta}^{*})-\mathbf{c}\right\rangle.

Here we let \mathbf{Q}(t):=[Q_{1}(t),~Q_{2}(t),~\cdots,~Q_{I}(t)]^{\top}. Next, we present Lemma 6.3, which is one of the key lemmas in our proof. This lemma indicates that \|\mathbf{Q}(t)\|_{2} is bounded by \mathcal{O}(\sqrt{T}) with high probability when the parameters \tau,V,\alpha,\lambda are set as in Theorem 5.3. Thus, introducing stochastic constraints retains the \mathcal{O}(\sqrt{T}) regret. Moreover, this lemma will lead to a constraint violation at the level of \mathcal{O}(\sqrt{T}). Lemma 6.3 is proved by using Assumption 5.1 and Lemma 5.2.

Lemma 6.3.

Letting τ=T\tau=\sqrt{T} and ζ\zeta satisfy σ¯/4ζ(σ¯/2+2L)\overline{\sigma}/4\geq\zeta(\overline{\sigma}/2+2L), the updating rules in Algorithm 1 ensure that with probability at least 1Tδ1-T\delta, the following inequality holds for all t{1,,T+1}t\in\{1,\ldots,T+1\},

𝐐(t)2ω:=ψ+τ512L2σ¯log(1+128L2σ¯2eσ¯/(32L))+τ64L2σ¯log1δ+2τL,\displaystyle\|\mathbf{Q}(t)\|_{2}\leq\omega:=\psi+\tau\frac{512L^{2}}{\overline{\sigma}}\log\bigg{(}1+\frac{128L^{2}}{\overline{\sigma}^{2}}e^{\overline{\sigma}/(32L)}\bigg{)}+\tau\frac{64L^{2}}{\overline{\sigma}}\log\frac{1}{\delta}+2\tau L,

where we define ψ:=(2τL+CV,α,λ)/σ¯+2αLlog(|𝒮|2|𝒜|/λ)/(σ¯τ)+τσ¯/2\psi:=(2\tau L+C_{V,\alpha,\lambda})/\overline{\sigma}+2\alpha L\log(|\mathcal{S}|^{2}|\mathcal{A}|/\lambda)/(\overline{\sigma}\tau)+\tau\overline{\sigma}/2 and CV,α,λ:=2(σ¯B+σ¯ϑ¯)V+(6+4ϑ¯)VL+VL/α+4LλV+2αλLlog|𝒮|2|𝒜|+8L2C_{V,\alpha,\lambda}:=2(\overline{\sigma}B+\overline{\sigma}~{}\overline{\vartheta})V+(6+4\overline{\vartheta})VL+VL/\alpha+4L\lambda V+2\alpha\lambda L\log|\mathcal{S}|^{2}|\mathcal{A}|+8L^{2}.

The upper bound on \|\mathbf{Q}(t)\|_{2} in the above lemma in fact holds for any \tau, and it is a convex function of \tau; hence a tighter upper bound on \|\mathbf{Q}(t)\|_{2} can be obtained by choosing \tau as the minimizer of this bound. In this paper, we directly set \tau=\sqrt{T}, which suffices to give an \widetilde{\mathcal{O}}(\sqrt{T}) upper bound.

Remark 6.4.

We discuss the upper bound of the term log(1+128L2σ¯2eσ¯/(32L))\log\big{(}1+\frac{128L^{2}}{\overline{\sigma}^{2}}e^{\overline{\sigma}/(32L)}\big{)} in the following way: (1) if 128L2σ¯2eσ¯/(32L)1\frac{128L^{2}}{\overline{\sigma}^{2}}e^{\overline{\sigma}/(32L)}\geq 1, then this term is bounded by log(256L2σ¯2eσ¯/(32L))=σ¯32L+log256L2σ¯2\log\big{(}\frac{256L^{2}}{\overline{\sigma}^{2}}e^{\overline{\sigma}/(32L)}\big{)}=\frac{\overline{\sigma}}{32L}+\log\frac{256L^{2}}{\overline{\sigma}^{2}}; (2) if 128L2σ¯2eσ¯/(32L)<1\frac{128L^{2}}{\overline{\sigma}^{2}}e^{\overline{\sigma}/(32L)}<1, then the term is bounded by log2\log 2. Thus, combining the two cases, we have

log(1+128L2σ¯2eσ¯/(32L))log2+σ¯32L+log256L2σ¯2.\log\left(1+\frac{128L^{2}}{\overline{\sigma}^{2}}e^{\overline{\sigma}/(32L)}\right)\leq\log 2+\frac{\overline{\sigma}}{32L}+\log\frac{256L^{2}}{\overline{\sigma}^{2}}.

This discussion shows that the log\log term in the result of Lemma 6.3 will not introduce extra dependency on LL except a logL\log L term.

With the bound of 𝐐(t)2\|\mathbf{Q}(t)\|_{2} in Lemma 6.3, we further obtain the following lemma.

Lemma 6.5.

By Algorithm 1, if σ¯/4ζ(σ¯/2+2L)\overline{\sigma}/4\geq\zeta(\overline{\sigma}/2+2L), then with probability at least 12Tδ1-2T\delta,

t=1T𝐐(t),𝐠t(θ¯)𝐜2LωTlog1Tδ,\displaystyle\sum_{t=1}^{T}\langle\mathbf{Q}(t),\mathbf{g}^{t}(\overline{\theta}^{*})-\mathbf{c}\rangle\leq 2L\omega\sqrt{T\log\frac{1}{T\delta}},

with ω\omega defined as the same as in Lemma 6.3.

Proof of the Regret Bound in Theorem 5.3.

Recall that θt\theta^{t} is the probability vector chosen by the decision maker, and θ¯t\overline{\theta}^{t} is the true occupancy measure at time tt while θ¯\overline{\theta}^{*} is the solution to the problem (1). The main idea is to decompose the regret as follows:

t=1Tft,θ¯tθ¯=t=1T(ft,θ¯tθt+ft,θtθ¯)t=1Tθ¯tθt1Term (I)+t=1Tft,θtθ¯Term (II),\displaystyle\begin{aligned} \sum_{t=1}^{T}\big{\langle}f^{t},\overline{\theta}^{t}-\overline{\theta}^{*}\big{\rangle}&=\sum_{t=1}^{T}\Big{(}\big{\langle}f^{t},\overline{\theta}^{t}-\theta^{t}\big{\rangle}+\big{\langle}f^{t},\theta^{t}-\overline{\theta}^{*}\big{\rangle}\Big{)}\\ &\leq\underbrace{\sum_{t=1}^{T}\big{\|}\overline{\theta}^{t}-\theta^{t}\big{\|}_{1}}_{\text{Term (I)}}+\underbrace{\sum_{t=1}^{T}\big{\langle}f^{t},\theta^{t}-\overline{\theta}^{*}\big{\rangle}}_{\text{Term (II)}},\end{aligned} (11)

where we use Assumption 3.3 such that ft,θ¯tθtftθ¯tθt1θ¯tθt1\big{\langle}f^{t},\overline{\theta}^{t}-\theta^{t}\big{\rangle}\leq\|f^{t}\|_{\infty}\big{\|}\overline{\theta}^{t}-\theta^{t}\big{\|}_{1}\leq\big{\|}\overline{\theta}^{t}-\theta^{t}\big{\|}_{1}. Thus, it suffices to bound the Term (I) and Term (II).

We first show the bound for Term (I). According to Lemma 6.1, by the fact that L|𝒮|L\leq|\mathcal{S}| and |𝒮|,|𝒜|1|\mathcal{S}|,|\mathcal{A}|\geq 1, we have that with probability at least 12ζ1-2\zeta, the following holds

Term (I)𝒪(L|𝒮|T|𝒜|log12(T|𝒮||𝒜|/ζ)).\displaystyle\text{Term (I)}\leq\mathcal{O}\left(L|\mathcal{S}|\sqrt{T|\mathcal{A}|}\log^{\frac{1}{2}}(T|\mathcal{S}||\mathcal{A}|/\zeta)\right). (12)

For Term (II), setting V=LTV=L\sqrt{T}, α=LT\alpha=LT, τ=T\tau=\sqrt{T}, and λ=1/T\lambda=1/T, by Lemma 6.2, we obtain

Term (II)8LT|𝒮||𝒜|+1LTt=1T𝐐(t),𝐠t(θ¯)𝐜,\displaystyle\text{Term (II)}\leq 8L\sqrt{T|\mathcal{S}||\mathcal{A}|}+\frac{1}{L\sqrt{T}}\sum_{t=1}^{T}\left\langle\mathbf{Q}(t),\mathbf{g}^{t}(\overline{\theta}^{*})-\mathbf{c}\right\rangle,

where we use the inequality that log|𝒮||𝒜||𝒮||𝒜|\log|\mathcal{S}||\mathcal{A}|\leq\sqrt{|\mathcal{S}||\mathcal{A}|} due to xlogx\sqrt{x}\geq\log x. Thus, we further need to bound the last term of the above inequality. By Lemma 6.5 and Remark 6.4, with probability at least 12Tδ1-2T\delta for all t{1,,T}t\in\{1,\ldots,T\}, we have

1LTt=1T𝐐(t),𝐠t(θ¯)𝐜𝒪(L|𝒮|T|𝒜|log32(T/δ)),\frac{1}{L\sqrt{T}}\sum_{t=1}^{T}\left\langle\mathbf{Q}(t),\mathbf{g}^{t}(\overline{\theta}^{*})-\mathbf{c}\right\rangle\leq\mathcal{O}\left(L|\mathcal{S}|\sqrt{T|\mathcal{A}|}\log^{\frac{3}{2}}(T/\delta)\right),

by the facts that L|𝒮|L\leq|\mathcal{S}| , |𝒮|>1|\mathcal{S}|>1, |𝒜|>1|\mathcal{A}|>1, and the assumption T|𝒮||𝒜|T\geq|\mathcal{S}||\mathcal{A}|, as well as the computation of ψ\psi as

ψ=𝒪(L2T+Llog|𝒮||𝒜|+L2Tlog(T|𝒮||𝒜|)).\psi=\mathcal{O}\left(L^{2}\sqrt{T}+L\log|\mathcal{S}||\mathcal{A}|+L^{2}\sqrt{T}\log(T|\mathcal{S}||\mathcal{A}|)\right).

Therefore, with probability at least 12Tδ1-2T\delta, the following holds

Term (II)𝒪(L|𝒮|T|𝒜|log32(T/δ)).\displaystyle\text{Term (II)}\leq\mathcal{O}\left(L|\mathcal{S}|\sqrt{T|\mathcal{A}|}\log^{\frac{3}{2}}(T/\delta)\right). (13)

Combining (12) and (13) with (11), and letting δ=ζ/T\delta=\zeta/T, by union bound, we eventually obtain that with probability at least 14ζ1-4\zeta, the regret bound is

Regret(T)𝒪~(L|𝒮|T|𝒜|),\mathrm{Regret}(T)\leq\widetilde{\mathcal{O}}\left(L|\mathcal{S}|\sqrt{T|\mathcal{A}|}\right),

where the notation 𝒪~()\widetilde{\mathcal{O}}(\cdot) absorbs the logarithm factors. Further let ζ1/(4+8L/σ¯)<1/4\zeta\leq 1/(4+8L/\overline{\sigma})<1/4 (such that σ¯/4ζ(σ¯/2+2L)\overline{\sigma}/4\geq\zeta(\overline{\sigma}/2+2L) is guaranteed). This completes the proof. ∎

6.2 Proof of Constraint Violation Bound

Lemma 6.6.

The updating rules in Algorithm 1 ensure

\Bigg{\|}\Bigg{[}\sum_{t=1}^{T}\Big{(}\mathbf{g}^{t}(\theta^{t})-\mathbf{c}\Big{)}\Bigg{]}_{+}\Bigg{\|}_{2}\leq\|\mathbf{Q}(T+1)\|_{2}+\sum_{t=1}^{T}\big{\|}\theta^{t+1}-\theta^{t}\big{\|}_{1}.
Lemma 6.7.

The updating rules in Algorithm 1 ensure

t=1Tθt+1θt1\displaystyle\sum_{t=1}^{T}\big{\|}\theta^{t+1}-\theta^{t}\big{\|}_{1} 3LT|𝒮||𝒜|log8T|𝒮||𝒜|+2L(1λ)2αt=1T𝐐(t)2+2VLT(1λ)2α\displaystyle\leq 3L\sqrt{T|\mathcal{S}||\mathcal{A}|}\log\frac{8T}{|\mathcal{S}||\mathcal{A}|}+\frac{2L}{(1-\lambda)^{2}\alpha}\sum_{t=1}^{T}\|\mathbf{Q}(t)\|_{2}+\frac{2VLT}{(1-\lambda)^{2}\alpha}
+2λLT1λ+8λlog|𝒮|2|𝒜|1λLT.\displaystyle\quad+\frac{2\lambda LT}{1-\lambda}+\frac{\sqrt{8\lambda\log|\mathcal{S}|^{2}|\mathcal{A}|}}{1-\lambda}LT.
Remark 6.8.

The proof of Lemma 6.7 uses the fact that the confidence set of the transition model P changes only \sqrt{T|\mathcal{S}||\mathcal{A}|}\log(8T/(|\mathcal{S}||\mathcal{A}|)) times due to the doubling of epoch lengths in Algorithm 1. Within each epoch where the confidence set is unchanged, we further show that \|\theta^{t+1}-\theta^{t}\|_{1} is sufficiently small. Thus, the epoch length doubling eventually ensures that the constraint violation is at the level of \widetilde{\mathcal{O}}(\sqrt{T}).

Proof of the Constraint Violation Bound in Theorem 5.3.

We decompose Violation(T)\mathrm{Violation}(T) as

[t=1T(𝐠t(θ¯t)𝐜)]+2t=1T𝐠t(θt)𝐠t(θ¯t)2+[t=1T(𝐠t(θt)𝐜)]+2t=1Tθ¯tθt1Term (III)+[t=1T(𝐠t(θt)𝐜)]+2Term (IV),\displaystyle\begin{aligned} \Bigg{\|}\Bigg{[}\sum_{t=1}^{T}\Big{(}\mathbf{g}^{t}(\overline{\theta}^{t})-\mathbf{c}\Big{)}\Bigg{]}_{+}\Bigg{\|}_{2}&\leq\sum_{t=1}^{T}\big{\|}\mathbf{g}^{t}(\theta^{t})-\mathbf{g}^{t}(\overline{\theta}^{t})\big{\|}_{2}+\Bigg{\|}\Bigg{[}\sum_{t=1}^{T}\left(\mathbf{g}^{t}(\theta^{t})-\mathbf{c}\right)\Bigg{]}_{+}\Bigg{\|}_{2}\\ &\leq\underbrace{\sum_{t=1}^{T}\big{\|}\overline{\theta}^{t}-\theta^{t}\big{\|}_{1}}_{\text{Term (III)}}+\underbrace{\Bigg{\|}\Bigg{[}\sum_{t=1}^{T}\Big{(}\mathbf{g}^{t}(\theta^{t})-\mathbf{c}\Big{)}\Bigg{]}_{+}\Bigg{\|}_{2}}_{\text{Term (IV)}},\end{aligned} (14)

where the second inequality is due to Assumption 3.3 that 𝐠t(θt)𝐠t(θ¯t)2=(i=1I|git,θtθ¯t|2)12i=1Igitθtθ¯t1θtθ¯t1\|\mathbf{g}^{t}(\theta^{t})-\mathbf{g}^{t}(\overline{\theta}^{t})\|_{2}=\big{(}\sum_{i=1}^{I}|\langle g_{i}^{t},\theta^{t}-\overline{\theta}^{t}\rangle|^{2}\big{)}^{\frac{1}{2}}\leq\sum_{i=1}^{I}\|g_{i}^{t}\|_{\infty}\|\theta^{t}-\overline{\theta}^{t}\|_{1}\leq\|\theta^{t}-\overline{\theta}^{t}\|_{1}. Thus, it suffices to bound Terms (III) and (IV).

For Term (III), we already have its bound as (12). Then, we focus on proving the upper bound of Term (IV). Set V=LTV=L\sqrt{T}, α=LT\alpha=LT, τ=T\tau=\sqrt{T}, and λ=1/T\lambda=1/T as in the proof of the regret bound. By Lemma 6.6, we know that to bound Term (IV) requires bounding the terms 𝐐(T+1)2\|\mathbf{Q}(T+1)\|_{2} and t=1Tθt+1θt1\sum_{t=1}^{T}\|\theta^{t+1}-\theta^{t}\|_{1}. By Lemma 6.3, combining it with Remark 6.4 and ψ=𝒪(L2T+Llog|𝒮||𝒜|+L2log(T|𝒮||𝒜|)/T)\psi=\mathcal{O}\big{(}L^{2}\sqrt{T}+L\log|\mathcal{S}||\mathcal{A}|+L^{2}\log(T|\mathcal{S}||\mathcal{A}|)/\sqrt{T}\big{)} as shown in the proof of the regret bound, letting σ¯/4ζ(σ¯/2+2L)\overline{\sigma}/4\geq\zeta(\overline{\sigma}/2+2L), with probability 1Tδ1-T\delta, for all t{1,,T+1}t\in\{1,\ldots,T+1\}, the following inequality holds

𝐐(t)2𝒪(L2Tlog(L/δ)),\displaystyle\|\mathbf{Q}(t)\|_{2}\leq\mathcal{O}\left(L^{2}\sqrt{T}\log(L/\delta)\right), (15)

where we use logxx\log x\leq\sqrt{x}. This gives the bound of 𝐐(T+1)2\|\mathbf{Q}(T+1)\|_{2} that 𝐐(T+1)2𝒪(L2Tlog(L/δ))\|\mathbf{Q}(T+1)\|_{2}\leq\mathcal{O}\big{(}L^{2}\sqrt{T}\log(L/\delta)\big{)}.

Furthermore, by Lemma 6.7, we know that the key to bounding t=1Tθt+1θt1\sum_{t=1}^{T}\|\theta^{t+1}-\theta^{t}\|_{1} is also the drift bound for 𝐐(t)\mathbf{Q}(t). Therefore, by (15) and the settings of the parameters α,λ,V\alpha,\lambda,V, we have

t=1Tθt+1θt1𝒪(L|𝒮||𝒜|Tlog(T|𝒮||𝒜|/δ)),\displaystyle\sum_{t=1}^{T}\|\theta^{t+1}-\theta^{t}\|_{1}\leq\mathcal{O}\left(L|\mathcal{S}|\sqrt{|\mathcal{A}|T}\log(T|\mathcal{S}||\mathcal{A}|/\delta)\right), (16)

by the facts that L|𝒮|L\leq|\mathcal{S}| , |𝒮|>1|\mathcal{S}|>1, |𝒜|>1|\mathcal{A}|>1 and the condition |𝒮||𝒜|T|\mathcal{S}||\mathcal{A}|\leq T. Thus combining (15) and (16) with Lemma 6.6, and letting δ=ζ/T\delta=\zeta/T, then with probability at least 1ζ1-\zeta, we have

Term (IV)𝒪(L|𝒮||𝒜|Tlog(T|𝒮||𝒜|/δ)).\displaystyle\text{Term (IV)}\leq\mathcal{O}\left(L|\mathcal{S}|\sqrt{|\mathcal{A}|T}\log(T|\mathcal{S}||\mathcal{A}|/\delta)\right).

Combining results for Term (III) and Term (IV) with (14), by union bound, with probability at least 14ζ1-4\zeta, the constraint violation is bounded as Violation(T)𝒪~(L|𝒮|T|𝒜|).\mathrm{Violation}(T)\leq\widetilde{\mathcal{O}}(L|\mathcal{S}|\sqrt{T|\mathcal{A}|}). This finishes the proof. ∎

7 Conclusion

In this paper, we propose a new upper confidence primal-dual algorithm to solve online constrained episodic MDPs with arbitrarily varying losses and stochastically changing constraints. In particular, our algorithm does not require the transition model of the MDP and delivers \widetilde{\mathcal{O}}(L|\mathcal{S}|\sqrt{|\mathcal{A}|T}) upper bounds on both the regret and the constraint violation.

References

  • Abbasi-Yadkori et al. (2019a) Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvari, C. and Weisz, G. (2019a). Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning.
  • Abbasi-Yadkori et al. (2019b) Abbasi-Yadkori, Y., Lazic, N., Szepesvari, C. and Weisz, G. (2019b). Exploration-enhanced politex. arXiv preprint arXiv:1908.10479.
  • Achiam et al. (2017) Achiam, J., Held, D., Tamar, A. and Abbeel, P. (2017). Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org.
  • Agarwal et al. (2019) Agarwal, A., Kakade, S. M., Lee, J. D. and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in markov decision processes. arXiv preprint arXiv:1908.00261.
  • Altman (1999) Altman, E. (1999). Constrained Markov decision processes, vol. 7. CRC Press.
  • Auer et al. (2002) Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine learning, 47 235–256.
  • Azar et al. (2017) Azar, M. G., Osband, I. and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org.
  • Baxter and Bartlett (2000) Baxter, J. and Bartlett, P. L. (2000). Direct gradient-based reinforcement learning. In 2000 IEEE International Symposium on Circuits and Systems. Emerging Technologies for the 21st Century. Proceedings (IEEE Cat No. 00CH36353), vol. 3. IEEE.
  • Bhandari and Russo (2019) Bhandari, J. and Russo, D. (2019). Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786.
  • Cai et al. (2019) Cai, Q., Yang, Z., Jin, C. and Wang, Z. (2019). Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830.
  • Chow et al. (2017) Chow, Y., Ghavamzadeh, M., Janson, L. and Pavone, M. (2017). Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18 6070–6120.
  • Ding et al. (2020) Ding, D., Wei, X., Yang, Z., Wang, Z. and Jovanović, M. (2020). Provably efficient safe exploration via primal-dual policy optimization. Manuscript.
  • Dong et al. (2019) Dong, K., Wang, Y., Chen, X. and Wang, L. (2019). Q-learning with ucb exploration is sample efficient for infinite-horizon mdp. arXiv preprint arXiv:1901.09311.
  • Efroni et al. (2020a) Efroni, Y., Mannor, S. and Pirotta, M. (2020a). Exploration-exploitation in constrained mdps. arXiv preprint arXiv:2003.02189.
  • Efroni et al. (2020b) Efroni, Y., Shani, L., Rosenberg, A. and Mannor, S. (2020b). Optimistic policy optimization with bandit feedback. arXiv preprint arXiv:2002.08243.
  • Even-Dar et al. (2009) Even-Dar, E., Kakade, S. M. and Mansour, Y. (2009). Online markov decision processes. Mathematics of Operations Research, 34 726–736.
  • Fazel et al. (2018) Fazel, M., Ge, R., Kakade, S. M. and Mesbahi, M. (2018). Global convergence of policy gradient methods for the linear quadratic regulator. arXiv preprint arXiv:1801.05039.
  • Fox (1966) Fox, B. (1966). Markov renewal programming by linear fractional programming. SIAM Journal on Applied Mathematics, 14 1418–1432.
  • Fruit et al. (2018) Fruit, R., Pirotta, M., Lazaric, A. and Ortner, R. (2018). Efficient bias-span-constrained exploration-exploitation in reinforcement learning. arXiv preprint arXiv:1802.04020.
  • György et al. (2007) György, A., Linder, T., Lugosi, G. and Ottucsák, G. (2007). The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8 2369–2403.
  • Jaksch et al. (2010) Jaksch, T., Ortner, R. and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11 1563–1600.
  • Jin et al. (2018) Jin, C., Allen-Zhu, Z., Bubeck, S. and Jordan, M. I. (2018). Is q-learning provably efficient? In Advances in Neural Information Processing Systems.
  • Jin et al. (2019) Jin, C., Jin, T., Luo, H., Sra, S. and Yu, T. (2019). Learning adversarial mdps with bandit feedback and unknown transition. arXiv preprint arXiv:1912.01192.
  • Kakade (2002) Kakade, S. M. (2002). A natural policy gradient. In Advances in neural information processing systems.
  • Konda and Tsitsiklis (2000) Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in neural information processing systems.
  • Le et al. (2019) Le, H. M., Voloshin, C. and Yue, Y. (2019). Batch policy learning under constraints. arXiv preprint arXiv:1903.08738.
  • Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
  • Liu et al. (2019) Liu, B., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306.
  • Mannor et al. (2009) Mannor, S., Tsitsiklis, J. N. and Yu, J. Y. (2009). Online learning with sample path constraints. Journal of Machine Learning Research, 10 569–590.
  • Nedić and Ozdaglar (2009) Nedić, A. and Ozdaglar, A. (2009). Approximate primal solutions and rate analysis for dual subgradient methods. SIAM Journal on Optimization, 19 1757–1780.
  • Neely (2012) Neely, M. J. (2012). Dynamic optimization and learning for renewal systems. IEEE Transactions on Automatic Control, 58 32–46.
  • Neu et al. (2010) Neu, G., György, A. and Szepesvári, C. (2010). The online loop-free stochastic shortest-path problem. In COLT, vol. 2010. Citeseer.
  • Neu et al. (2012) Neu, G., Gyorgy, A. and Szepesvári, C. (2012). The adversarial stochastic shortest path problem with unknown transition probabilities. In Artificial Intelligence and Statistics.
  • Ouyang et al. (2017) Ouyang, Y., Gagrani, M., Nayyar, A. and Jain, R. (2017). Learning unknown Markov decision processes: A Thompson sampling approach. In Advances in Neural Information Processing Systems.
  • Rosenberg and Mansour (2019a) Rosenberg, A. and Mansour, Y. (2019a). Online convex optimization in adversarial Markov decision processes. arXiv preprint arXiv:1905.07773.
  • Rosenberg and Mansour (2019b) Rosenberg, A. and Mansour, Y. (2019b). Online stochastic shortest path with bandit feedback and unknown transition function. In Advances in Neural Information Processing Systems.
  • Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
  • Urgaonkar et al. (2015) Urgaonkar, R., Wang, S., He, T., Zafer, M., Chan, K. and Leung, K. K. (2015). Dynamic service migration and workload scheduling in edge-clouds. Performance Evaluation, 91 205–228.
  • Wang et al. (2019) Wang, L., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150.
  • Wang et al. (2015) Wang, S., Urgaonkar, R., Zafer, M., He, T., Chan, K. and Leung, K. K. (2015). Dynamic service migration in mobile edge-clouds. In 2015 IFIP Networking Conference (IFIP Networking). IEEE.
  • Wei et al. (2019a) Wei, C.-Y., Jafarnia-Jahromi, M., Luo, H., Sharma, H. and Jain, R. (2019a). Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes. arXiv preprint arXiv:1910.07072.
  • Wei et al. (2018) Wei, X., Yu, H. and Neely, M. J. (2018). Online learning in weakly coupled markov decision processes: A convergence time study. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2 12.
  • Wei et al. (2019b) Wei, X., Yu, H. and Neely, M. J. (2019b). Online primal-dual mirror descent under stochastic constraints. arXiv preprint arXiv:1908.00305.
  • Williams (1992) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8 229–256.
  • Yu et al. (2017) Yu, H., Neely, M. and Wei, X. (2017). Online convex optimization with stochastic constraints. In Advances in Neural Information Processing Systems.
  • Yu et al. (2009) Yu, J. Y., Mannor, S. and Shimkin, N. (2009). Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34 737–757.
  • Zhang and Ji (2019) Zhang, Z. and Ji, X. (2019). Regret minimization for reinforcement learning by evaluating the optimal bias function. In Advances in Neural Information Processing Systems.
  • Zheng and Ratliff (2020) Zheng, L. and Ratliff, L. J. (2020). Constrained upper confidence reinforcement learning. arXiv preprint arXiv:2001.09377.
  • Zimin and Neu (2013) Zimin, A. and Neu, G. (2013). Online learning in episodic Markovian decision processes by relative entropy policy search. In Advances in neural information processing systems.

Supplementary Material

Appendix A Efficient Solver for Subproblem (6)

In this section, we provide the details on how to efficiently solve the subproblem (6). We can further rewrite (6) into the following equivalent form

θt=argminθΔ((t),ζ)α1ψt1,θ+D(θ,θ~t1),\displaystyle\theta^{t}=\mathop{\mathrm{argmin}}_{\theta\in\Delta(\ell(t),\zeta)}~{}\alpha^{-1}\langle\psi^{t-1},\theta\rangle+D(\theta,\widetilde{\theta}^{t-1}),

where we let \psi^{t-1}:=Vf^{t-1}+\sum_{i=1}^{I}Q_{i}(t-1)g_{i}^{t-1}. According to Rosenberg and Mansour (2019a), solving the above problem can be decomposed into the following two steps

θ¯t\displaystyle\underline{\theta}^{t} =argminθα1ψt1,θ+D(θ,θ~t1),\displaystyle=\mathop{\mathrm{argmin}}_{\theta}~{}\alpha^{-1}\langle\psi^{t-1},\theta\rangle+D(\theta,\widetilde{\theta}^{t-1}), (17)
θt\displaystyle\theta^{t} =argminθΔ((t),ζ)D(θ,θ¯t).\displaystyle=\mathop{\mathrm{argmin}}_{\theta\in\Delta(\ell(t),\zeta)}~{}D(\theta,\underline{\theta}^{t}). (18)

Note that the first step, i.e., (17), is an unconstrained problem, which has a closed-form solution

θ¯t(s,a,s)=θ~t1(s,a,s)eψt1/α,(s,a,s)𝒮k×𝒜×𝒮k+1,k=0,,L1.\displaystyle\underline{\theta}^{t}(s,a,s^{\prime})=\widetilde{\theta}^{t-1}(s,a,s^{\prime})e^{-\psi^{t-1}/\alpha},~{}~{}~{}~{}\forall(s,a,s^{\prime})\in\mathcal{S}_{k}\times\mathcal{A}\times\mathcal{S}_{k+1},~{}~{}\forall k=0,\ldots,L-1. (19)

The second step, i.e., (18), can be viewed as a projection of \underline{\theta}^{t}(s,a,s^{\prime}) onto the feasible set \Delta(\ell(t),\zeta). With the feasible set defined as in (9), by Theorem 4.2 of Rosenberg and Mansour (2019a) and Lemma 7 of Jin et al. (2019), and plugging in \underline{\theta}^{t} computed as in (19), we have

θt(s,a,s)=θ~t1(s,a,s)Ztk(s)(μt,βt)eBtμt,βt(s,a,s),\displaystyle\theta^{t}(s,a,s^{\prime})=\frac{\widetilde{\theta}^{t-1}(s,a,s^{\prime})}{Z_{t}^{k(s)}(\mu^{t},\beta^{t})}e^{B_{t}^{\mu^{t},\beta^{t}}(s,a,s^{\prime})}, (20)

where k(s)k(s) is a mapping from state ss to its associated layer index, and Ztk(μ,β)Z_{t}^{k}(\mu,\beta) and Btμ,βB_{t}^{\mu,\beta} are defined as follows

Btμ,β(s,a,s)\displaystyle B_{t}^{\mu,\beta}(s,a,s^{\prime}) =μ(s,a,s)μ+(s,a,s)+(μ+(s,a,s)+μ(s,a,s))ε(t)ζ(s,a)+β(s)\displaystyle=\mu^{-}(s,a,s^{\prime})-\mu^{+}(s,a,s^{\prime})+(\mu^{+}(s,a,s^{\prime})+\mu^{-}(s,a,s^{\prime}))\varepsilon_{\ell(t)}^{\zeta}(s,a)+\beta(s^{\prime})
β(s)ψt1(s,a,s)/αs′′𝒮k(s)+1P^(t)(s′′|s,a)(μ(s,a,s′′)μ+(s,a,s′′)),\displaystyle\quad-\beta(s)-\psi^{t-1}(s,a,s^{\prime})/\alpha-\sum_{s^{\prime\prime}\in\mathcal{S}_{k(s)+1}}\widehat{P}_{\ell(t)}(s^{\prime\prime}|s,a)(\mu^{-}(s,a,s^{\prime\prime})-\mu^{+}(s,a,s^{\prime\prime})),
Ztk(μ,β)\displaystyle Z_{t}^{k}(\mu,\beta) =s𝒮ka𝒜s𝒮k+1θ~t1(s,a,s)eBtμ,β(s,a,s),\displaystyle=\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}_{k+1}}\widetilde{\theta}^{t-1}(s,a,s^{\prime})e^{B_{t}^{\mu,\beta}(s,a,s^{\prime})},

where β:𝒮\beta:\mathcal{S}\rightarrow\mathbb{R} and μ=(μ+,μ)\mu=(\mu^{+},\mu^{-}) with μ+,μ:𝒮×𝒜×𝒮0\mu^{+},\mu^{-}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}_{\geq 0}. Specifically, the variables μt,βt\mu^{t},\beta^{t} in (20) are obtained by solving a convex optimization with only non-negativity constraints, which is

μt,βt=argminμ,β0k=0L1logZtk(μ,β).\displaystyle\mu^{t},\beta^{t}=\mathop{\mathrm{argmin}}_{\mu,\beta\geq 0}\sum_{k=0}^{L-1}\log Z_{t}^{k}(\mu,\beta). (21)

Therefore, by solving (21), we can eventually compute \theta^{t} via (20). Since (21) is a convex optimization problem with only non-negativity constraints, it can be solved efficiently.
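For concreteness, below is a minimal numerical sketch of how the projection step can be carried out for a single layer: the dual problem (21) is passed to an off-the-shelf box-constrained solver and \theta^{t} is then recovered via (20). The toy dimensions, the random placeholder data (theta_tilde, psi, P_hat, eps), and the choice of SciPy's L-BFGS-B routine are illustrative assumptions rather than part of the algorithm specification; in the full problem the layers are coupled through \beta, so (21) is minimized jointly over all layers.

import numpy as np
from scipy.optimize import minimize

# Illustrative single-layer sketch of the projection step (18) via its dual (20)-(21).
# All sizes and the random data below are hypothetical placeholders.
nS, nA = 3, 2                                   # states per layer and actions (toy sizes)
rng = np.random.default_rng(0)

theta_tilde = rng.random((nS, nA, nS))          # \widetilde{\theta}^{t-1} restricted to one layer
theta_tilde /= theta_tilde.sum()                # the layer's entries sum to one
psi = rng.random((nS, nA, nS))                  # \psi^{t-1} = V f^{t-1} + \sum_i Q_i(t-1) g_i^{t-1}
P_hat = rng.random((nS, nA, nS))
P_hat /= P_hat.sum(axis=2, keepdims=True)       # empirical transition estimate \widehat{P}_{\ell(t)}
eps = 0.1 * np.ones((nS, nA))                   # confidence widths \varepsilon_{\ell(t)}^{\zeta}(s,a)
alpha = 10.0

def unpack(x):
    m = nS * nA * nS
    mu_p = x[:m].reshape(nS, nA, nS)            # \mu^{+}
    mu_m = x[m:2 * m].reshape(nS, nA, nS)       # \mu^{-}
    beta_in, beta_out = x[2 * m:2 * m + nS], x[2 * m + nS:]   # \beta on layers k and k+1
    return mu_p, mu_m, beta_in, beta_out

def B(x):
    # B_t^{mu,beta}(s, a, s') as defined in Appendix A, for a single layer.
    mu_p, mu_m, beta_in, beta_out = unpack(x)
    corr = np.einsum('saz,saz->sa', P_hat, mu_m - mu_p)   # sum_{s''} P_hat(s''|s,a)(mu^- - mu^+)
    return (mu_m - mu_p + (mu_p + mu_m) * eps[:, :, None]
            + beta_out[None, None, :] - beta_in[:, None, None]
            - psi / alpha - corr[:, :, None])

def dual_obj(x):
    # log Z_t^k(mu, beta), i.e., the objective of (21) for a single layer.
    return np.log(np.sum(theta_tilde * np.exp(B(x))))

dim = 2 * nS * nA * nS + 2 * nS
res = minimize(dual_obj, np.zeros(dim), method='L-BFGS-B',
               bounds=[(0.0, None)] * dim)      # non-negativity constraints only

theta_new = theta_tilde * np.exp(B(res.x))
theta_new /= theta_new.sum()                    # division by Z_t^k as in (20)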

Appendix B Structure of Optimization Problem Sequence

We have the following simple sufficient condition which is a direct corollary of Lemma 1 in Nedić and Ozdaglar (2009):

Lemma B.1.

Suppose that the problem (10) is feasible. Then, the set of Lagrange multipliers 𝒱t,τ\mathcal{V}_{t,\tau}^{*} defined in Assumption 5.1 is nonempty and bounded if the Slater condition holds, i.e., θΔ,ε>0\exists\theta\in\Delta,~{}\varepsilon>0 such that gi,θciε,i{1,,I}\langle{g_{i}},{\theta}\rangle\leq c_{i}-\varepsilon,~{}\forall i\in\{1,\ldots,I\}.

In fact, a certain constraint qualification condition more general than the Slater condition can be shown to imply the boundedness of the Lagrange multipliers (see, for example, Lemma 18 of Wei et al. (2019b)). Thus, Assumption 5.1 is weaker than the Slater condition commonly adopted in previous constrained online learning works. The motivation for such a Lagrange multiplier condition is that it is a sufficient condition for a key structural property of the dual function q^{(t,\tau)}(\eta), namely, the error bound condition. Formally, we have the following definition:

Definition B.2 (Error Bound Condition (EBC)).

Let h(\mathbf{x}) be a concave function over \mathbf{x}\in\mathcal{X}, where \mathcal{X} is closed and convex. Suppose \Lambda^{*}:=\mathop{\mathrm{argmax}}_{\mathbf{x}\in\mathcal{X}}h(\mathbf{x}) is non-empty, and write \mathrm{dist}(\mathbf{x},\Lambda^{*}):=\min_{\mathbf{x}^{\prime}\in\Lambda^{*}}\|\mathbf{x}-\mathbf{x}^{\prime}\|_{2} for the Euclidean distance between a point \mathbf{x} and the set \Lambda^{*}. The function h(\mathbf{x}) satisfies the EBC if there exist constants \vartheta,\sigma>0 such that for any \mathbf{x}\in\mathcal{X} satisfying \mathrm{dist}(\mathbf{x},\Lambda^{*})\geq\vartheta,

h(𝐱)h(𝐱)σdist(𝐱,Λ) with 𝐱Λ.h(\mathbf{x}^{*})-h(\mathbf{x})\geq\sigma\cdot\mathrm{dist}(\mathbf{x},\Lambda^{*})~{}~{}\text{ with }\mathbf{x}^{*}\in\Lambda^{*}.

Note that in Definition B.2, Λ\Lambda^{*} is a closed convex set, which follows from the fact that h(𝐱)h(\mathbf{x}) is a concave function and thus all superlevel sets are closed and convex. The following lemma, whose proof can be found in Lemma 5 of Wei et al. (2019b), shows the relation between the Lagrange multiplier condition and the dual function:
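For intuition (this example is only illustrative and is not used in the analysis), consider h(x)=-|x| over \mathcal{X}=[-1,1]. Then \Lambda^{*}=\{0\} and h(x^{*})-h(x)=|x|=\mathrm{dist}(x,\Lambda^{*}), so the EBC holds with \sigma=1 and any \vartheta>0. In contrast, for h(x)=-x^{2} we only have h(x^{*})-h(x)=\mathrm{dist}(x,\Lambda^{*})^{2}, which is why the condition is imposed only on points with \mathrm{dist}(x,\Lambda^{*})\geq\vartheta: on that region x^{2}\geq\vartheta\cdot\mathrm{dist}(x,\Lambda^{*}), so the EBC holds with \sigma=\vartheta.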

Lemma B.3.

Fix T\geq 1. Suppose Assumption 5.1 holds. Then, for any t\in\{1,\ldots,T\} and \tau=\sqrt{T}, the dual function q^{(t,\tau)}(\eta) satisfies the EBC with \sigma>0 and \vartheta>0.

This lemma is equivalent to Lemma 5.2 in the main text.

Appendix C Proofs of the Lemmas in Section 6.1

C.1 Proof of Lemma 6.1

We first provide Lemmas C.1 and C.2 below. Then, we give the proof of Lemma 6.1 based on these lemmas.

Lemma C.1 (Lemma 19 in Jaksch et al. (2010)).

For any sequence of numbers x1,,xnx_{1},\ldots,x_{n} with 0xkXk1:=max{1,i=1k1xi}0\leq x_{k}\leq X_{k-1}:=\max\big{\{}1,\sum_{i=1}^{k-1}x_{i}\big{\}} with 1kn1\leq k\leq n, the following inequality holds

k=1nxkXk1(2+1)Xn.\displaystyle\sum_{k=1}^{n}\frac{x_{k}}{\sqrt{X_{k-1}}}\leq(\sqrt{2}+1)\sqrt{X_{n}}.
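As a purely illustrative sanity check of Lemma C.1 (the sequence length and random seed below are arbitrary choices), the following short Python snippet verifies the inequality on a randomly generated admissible sequence.

import numpy as np

# Numerical sanity check of Lemma C.1 on a random admissible sequence (illustration only).
rng = np.random.default_rng(1)
n = 200
partial_sum, lhs = 0.0, 0.0
for k in range(n):
    X_prev = max(1.0, partial_sum)        # X_{k-1} = max{1, sum_{i<k} x_i}
    x_k = rng.uniform(0.0, X_prev)        # enforces 0 <= x_k <= X_{k-1}
    lhs += x_k / np.sqrt(X_prev)
    partial_sum += x_k
rhs = (np.sqrt(2.0) + 1.0) * np.sqrt(max(1.0, partial_sum))
assert lhs <= rhs                         # the bound of Lemma C.1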
Lemma C.2.

Let \widehat{d}_{t}(s) and d_{t}(s) be the state stationary distributions associated with \theta^{t} and \overline{\theta}^{t} respectively, and let \widehat{P}_{\ell(t)}(s^{\prime}|a,s) and P(s^{\prime}|a,s) be the corresponding transition distributions. Denote by \pi_{t}(a|s) the policy at episode t. We have \theta^{t}(s,a,s^{\prime})=\widehat{d}_{t}(s)\pi_{t}(a|s)\widehat{P}_{\ell(t)}(s^{\prime}|a,s) and \overline{\theta}^{t}(s,a,s^{\prime})=d_{t}(s)\pi_{t}(a|s)P(s^{\prime}|a,s). Conversely, \widehat{d}_{t}(s^{\prime})=\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\theta^{t}(s,a,s^{\prime}) and d_{t}(s^{\prime})=\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\overline{\theta}^{t}(s,a,s^{\prime}) for all s^{\prime}\in\mathcal{S}_{k+1}. Then, we have the following inequality

θtθ¯t1k=0L1j=0ks𝒮ja𝒜μt(s,a)P^(t)(|s,a)P(|s,a)1,\displaystyle\|\theta^{t}-\overline{\theta}^{t}\|_{1}\leq\sum_{k=0}^{L-1}\sum_{j=0}^{k}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}\mu_{t}(s,a)\|\widehat{P}_{\ell(t)}(\cdot|s,a)-P(\cdot|s,a)\|_{1},

where we let μt(s,a)=dt(s)πt(a|s)\mu_{t}(s,a)=d_{t}(s)\pi_{t}(a|s).

Proof of Lemma C.2.

By the definitions of d^t\widehat{d}_{t}, dtd_{t}, P^(t)\widehat{P}_{\ell(t)}, PP, and πt\pi_{t} shown in Lemma C.2, we have

θtθ¯t1\displaystyle\|\theta^{t}-\overline{\theta}^{t}\|_{1} =k=0L1s𝒮ka𝒜θt(a,s,)θ¯t(a,s,)1\displaystyle=\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\|\theta^{t}(a,s,\cdot)-\overline{\theta}^{t}(a,s,\cdot)\|_{1}
=k=0L1s𝒮ka𝒜πt(a|s)P^(t)(|a,s)d^t(s)P(|a,s)dt(s)1\displaystyle=\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\pi_{t}(a|s)\|\widehat{P}_{\ell(t)}(\cdot|a,s)\widehat{d}_{t}(s)-P(\cdot|a,s)d_{t}(s)\|_{1}
=k=0L1s𝒮ka𝒜πt(a|s)P^(t)(|a,s)d^t(s)P^(|a,s)dt(s)\displaystyle=\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\pi_{t}(a|s)\|\widehat{P}_{\ell(t)}(\cdot|a,s)\widehat{d}_{t}(s)-\widehat{P}(\cdot|a,s)d_{t}(s)
+P^(|a,s)dt(s)P(|a,s)dt(s)1.\displaystyle\qquad\qquad\qquad\qquad\qquad+\widehat{P}(\cdot|a,s)d_{t}(s)-P(\cdot|a,s)d_{t}(s)\|_{1}.

Thus, with the above equalities, and by triangle inequality for 1\|\cdot\|_{1}, we can bound the term θtθ¯t1\|\theta^{t}-\overline{\theta}^{t}\|_{1} in the following way

θtθ¯t1k=0L1s𝒮ka𝒜πt(a|s)[P^(t)(|a,s)d^t(s)P^(t)(|a,s)dt(s)1+P^(t)(|a,s)dt(s)P(|a,s)dt(s)1]k=0L1s𝒮ka𝒜πt(a|s)dt(s)P^(t)(|a,s)P(|a,s)1+k=0L1s𝒮ka𝒜πt(a|s)P^(t)(|a,s)1|d^t(s)dt(s)|.\displaystyle\begin{aligned} \|\theta^{t}-\overline{\theta}^{t}\|_{1}&\leq\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\pi_{t}(a|s)[\|\widehat{P}_{\ell(t)}(\cdot|a,s)\widehat{d}_{t}(s)-\widehat{P}_{\ell(t)}(\cdot|a,s)d_{t}(s)\|_{1}\\ &\qquad+\|\widehat{P}_{\ell(t)}(\cdot|a,s)d_{t}(s)-P(\cdot|a,s)d_{t}(s)\|_{1}]\\ &\leq\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\pi_{t}(a|s)d_{t}(s)\|\widehat{P}_{\ell(t)}(\cdot|a,s)-P(\cdot|a,s)\|_{1}\\ &\qquad+\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\pi_{t}(a|s)\|\widehat{P}_{\ell(t)}(\cdot|a,s)\|_{1}\cdot|\widehat{d}_{t}(s)-d_{t}(s)|.\end{aligned} (22)

We now bound the two terms on the RHS of (22) separately. For the first term, we have

k=0L1s𝒮ka𝒜πt(a|s)dt(s)P^(t)(|a,s)P(|a,s)1=k=0L1s𝒮ka𝒜μt(s,a)P^(t)(|a,s)P(|a,s)1,\displaystyle\begin{aligned} &\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\pi_{t}(a|s)d_{t}(s)\|\widehat{P}_{\ell(t)}(\cdot|a,s)-P(\cdot|a,s)\|_{1}\\ &\qquad=\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\mu_{t}(s,a)\|\widehat{P}_{\ell(t)}(\cdot|a,s)-P(\cdot|a,s)\|_{1},\end{aligned} (23)

since \mu_{t}(s,a)=\pi_{t}(a|s)d_{t}(s) is the joint probability of the pair (s,a).

Next, we bound the second term on the RHS of (22), which is

k=0L1s𝒮ka𝒜πt(a|s)P^(t)(|a,s)1|d^t(s)dt(s)|=k=0L1s𝒮ka𝒜πt(a|s)|d^t(s)dt(s)|,\displaystyle\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\pi_{t}(a|s)\|\widehat{P}_{\ell(t)}(\cdot|a,s)\|_{1}\cdot|\widehat{d}_{t}(s)-d_{t}(s)|=\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\pi_{t}(a|s)|\widehat{d}_{t}(s)-d_{t}(s)|,

since P^(t)(|a,s)1=s𝒮k+1P^(t)(s|a,s)=1\|\widehat{P}_{\ell(t)}(\cdot|a,s)\|_{1}=\sum_{s^{\prime}\in\mathcal{S}_{k+1}}\widehat{P}_{\ell(t)}(s^{\prime}|a,s)=1. Furthermore, we can bound the last term above as

k=0L1s𝒮ka𝒜πt(a|s)|d^t(s)dt(s)|\displaystyle\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\pi_{t}(a|s)|\widehat{d}_{t}(s)-d_{t}(s)|
=k=0L1s𝒮k|d^t(s)dt(s)|\displaystyle\qquad=\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}|\widehat{d}_{t}(s)-d_{t}(s)|
=k=1L1s𝒮k|d^t(s)dt(s)|\displaystyle\qquad=\sum_{k=1}^{L-1}\sum_{s\in\mathcal{S}_{k}}|\widehat{d}_{t}(s)-d_{t}(s)|
=k=1L1s𝒮k|s′′𝒮k1a𝒜θt(s′′,a,s)s′′𝒮k1a𝒜θ¯t(s′′,a,s)|,\displaystyle\qquad=\sum_{k=1}^{L-1}\sum_{s\in\mathcal{S}_{k}}\Big{|}\sum_{s^{\prime\prime}\in\mathcal{S}_{k-1}}\sum_{a\in\mathcal{A}}\theta^{t}(s^{\prime\prime},a,s)-\sum_{s^{\prime\prime}\in\mathcal{S}_{k-1}}\sum_{a\in\mathcal{A}}\overline{\theta}^{t}(s^{\prime\prime},a,s)\Big{|},

where the first equality is due to a𝒜πt(a|s)=1\sum_{a\in\mathcal{A}}\pi_{t}(a|s)=1, the second equality is due to d^t(s0)=dt(s0)=1\widehat{d}_{t}(s_{0})=d_{t}(s_{0})=1, and the third equality is by the relations d^t(s)=s′′𝒮k1a𝒜θt(s′′,a,s)\widehat{d}_{t}(s)=\sum_{s^{\prime\prime}\in\mathcal{S}_{k-1}}\sum_{a\in\mathcal{A}}\theta^{t}(s^{\prime\prime},a,s) and dt(s)=s′′𝒮k1a𝒜θ¯t(s′′,a,s)d_{t}(s)=\sum_{s^{\prime\prime}\in\mathcal{S}_{k-1}}\sum_{a\in\mathcal{A}}\overline{\theta}^{t}(s^{\prime\prime},a^{\prime},s), s𝒮k\forall s\in\mathcal{S}_{k}. Further bounding the last term of the above equation gives

k=1L1s𝒮k|s′′𝒮k1a𝒜θt(s′′,a,s)s′′𝒮k1a𝒜θ¯t(s′′,a,s)|\displaystyle\sum_{k=1}^{L-1}\sum_{s\in\mathcal{S}_{k}}\bigg{|}\sum_{s^{\prime\prime}\in\mathcal{S}_{k-1}}\sum_{a\in\mathcal{A}}\theta^{t}(s^{\prime\prime},a,s)-\sum_{s^{\prime\prime}\in\mathcal{S}_{k-1}}\sum_{a\in\mathcal{A}}\overline{\theta}^{t}(s^{\prime\prime},a,s)\bigg{|}
k=1L1s𝒮ks′′𝒮k1a𝒜|θt(s′′,a,s)θ¯t(s′′,a,s)|\displaystyle\qquad\leq\sum_{k=1}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{s^{\prime\prime}\in\mathcal{S}_{k-1}}\sum_{a\in\mathcal{A}}\bigg{|}\theta^{t}(s^{\prime\prime},a,s)-\overline{\theta}^{t}(s^{\prime\prime},a,s)\bigg{|}
=k=1L1s′′𝒮k1a𝒜θt(s′′,a,)θ¯t(s′′,a,)1\displaystyle\qquad=\sum_{k=1}^{L-1}\sum_{s^{\prime\prime}\in\mathcal{S}_{k-1}}\sum_{a\in\mathcal{A}}\big{\|}\theta^{t}(s^{\prime\prime},a,\cdot)-\overline{\theta}^{t}(s^{\prime\prime},a,\cdot)\big{\|}_{1}
=k=0L2s𝒮ka𝒜θt(s,a,)θ¯t(s,a,)1,\displaystyle\qquad=\sum_{k=0}^{L-2}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\big{\|}\theta^{t}(s,a,\cdot)-\overline{\theta}^{t}(s,a,\cdot)\big{\|}_{1},

which eventually implies that the last term on RHS of (22) can be bounded as

k=0L1s𝒮ka𝒜πt(a|s)P(|a,s)1|d^t(s)dt(s)|k=0L2s𝒮ka𝒜θt(s,a,)θ¯t(s,a,)1.\displaystyle\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\pi_{t}(a|s)\|P(\cdot|a,s)\|_{1}\cdot|\widehat{d}_{t}(s)-d_{t}(s)|\leq\sum_{k=0}^{L-2}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\big{\|}\theta^{t}(s,a,\cdot)-\overline{\theta}^{t}(s,a,\cdot)\big{\|}_{1}. (24)

Therefore, plugging the bounds (23) and (24) into (22), we have

θtθ¯t1=\displaystyle\|\theta^{t}-\bar{\theta}^{t}\|_{1}= k=0L1s𝒮ka𝒜θt(a,s,)θ¯t(a,s,)1\displaystyle\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\big{\|}\theta^{t}(a,s,\cdot)-\overline{\theta}^{t}(a,s,\cdot)\big{\|}_{1}
\displaystyle\leq k=0L1s𝒮ka𝒜μt(s,a)P^(t)(|a,s)P(|a,s)1\displaystyle\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\mu_{t}(s,a)\big{\|}\widehat{P}_{\ell(t)}(\cdot|a,s)-P(\cdot|a,s)\big{\|}_{1}
+k=0L2s𝒮ka𝒜θt(s,a,)θ¯t(s,a,)1.\displaystyle+\sum_{k=0}^{L-2}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\big{\|}\theta^{t}(s,a,\cdot)-\overline{\theta}^{t}(s,a,\cdot)\big{\|}_{1}.

Recursively applying the above inequality, we obtain

θtθ¯t1k=0L1j=0ks𝒮ja𝒜μt(s,a)P^(t)(|s,a)P(|s,a)1,\displaystyle\|\theta^{t}-\overline{\theta}^{t}\|_{1}\leq\sum_{k=0}^{L-1}\sum_{j=0}^{k}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}\mu_{t}(s,a)\Big{\|}\widehat{P}_{\ell(t)}(\cdot|s,a)-P(\cdot|s,a)\Big{\|}_{1},

which completes the proof. ∎

Now, we are in a position to give the proof of Lemma 6.1.

Proof of Lemma 6.1.

The proof of Lemma 6.1 adopts ideas similar to those in Neu et al. (2012) and Rosenberg and Mansour (2019a).

We already know P^(t)(s|s,a)=θt(s,a,s)sSk+1θt(s,a,s)\widehat{P}_{\ell(t)}(s^{\prime}|s,a)=\frac{\theta^{t}(s,a,s^{\prime})}{\sum_{s^{\prime}\in S_{k+1}}\theta^{t}(s,a,s^{\prime})} and μt(s,a)=s𝒮k+1θt(s,a,s)\mu_{t}(s,a)=\sum_{s^{\prime}\in\mathcal{S}_{k+1}}\theta^{t}(s,a,s^{\prime}), s𝒮k,a𝒜,s𝒮k+1\forall s\in\mathcal{S}_{k},a\in\mathcal{A},s^{\prime}\in\mathcal{S}_{k+1}, k{0,,L1}\forall k\in\{0,\ldots,L-1\}. By Lemma C.2, one can show that

θtθ¯t1\displaystyle\|\theta^{t}-\overline{\theta}^{t}\|_{1}\leq k=0L1j=0ks𝒮ja𝒜μt(s,a)P^(t)(|s,a)P(|s,a)1\displaystyle\sum_{k=0}^{L-1}\sum_{j=0}^{k}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}\mu_{t}(s,a)\big{\|}\widehat{P}_{\ell(t)}(\cdot|s,a)-P(\cdot|s,a)\big{\|}_{1}
=\displaystyle= k=0L1j=0ks𝒮ja𝒜[(μt(s,a)𝕀{sjt=s,ajt=a})P^(t)(|s,a)P(|s,a)1\displaystyle\sum_{k=0}^{L-1}\sum_{j=0}^{k}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}\big{[}(\mu_{t}(s,a)-\mathbb{I}\{s^{t}_{j}=s,~{}a^{t}_{j}=a\})\big{\|}\widehat{P}_{\ell(t)}(\cdot|s,a)-P(\cdot|s,a)\big{\|}_{1}
+𝕀{sjt=s,ajt=a}P^(t)(|s,a)P(|s,a)1],\displaystyle\qquad+\mathbb{I}\{s^{t}_{j}=s,~{}a^{t}_{j}=a\}\big{\|}\widehat{P}_{\ell(t)}(\cdot|s,a)-P(\cdot|s,a)\big{\|}_{1}\big{]},

where \mathbb{I}\{s^{t}_{j}=s,~{}a^{t}_{j}=a\} denotes the indicator random variable that equals 1 with probability \mu_{t}(s,a) for any s\in\mathcal{S}_{j},a\in\mathcal{A}, and 0 otherwise. Denote \xi^{t}(s,a)=\|\widehat{P}_{\ell(t)}(\cdot|s,a)-P(\cdot|s,a)\|_{1} for brevity. We can see that \xi^{t}(s,a)\leq\|\widehat{P}_{\ell(t)}(\cdot|s,a)\|_{1}+\|P(\cdot|s,a)\|_{1}=2. Summing both sides of the above inequality over the T episodes, we obtain

t=1Tθtθ¯t1t=1Tk=0L1j=0ks𝒮ja𝒜(μt(s,a)𝕀{sjt=s,ajt=a})ξt(s,a)+t=1Tk=0L1j=0ks𝒮ja𝒜[𝕀{sjt=s,ajt=a}ξt(s,a).\displaystyle\begin{aligned} \sum_{t=1}^{T}\|\theta^{t}-\overline{\theta}^{t}\|_{1}&\leq\sum_{t=1}^{T}\sum_{k=0}^{L-1}\sum_{j=0}^{k}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}(\mu_{t}(s,a)-\mathbb{I}\{s^{t}_{j}=s,~{}a^{t}_{j}=a\})\xi^{t}(s,a)\\ &\qquad+\sum_{t=1}^{T}\sum_{k=0}^{L-1}\sum_{j=0}^{k}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}[\mathbb{I}\{s^{t}_{j}=s,~{}a^{t}_{j}=a\}\xi^{t}(s,a).\end{aligned} (25)

Next, we bound the first term on the RHS of (25). Let \mathcal{F}^{t-1} be the system history up to the (t-1)-th episode. Then, by the definition of \mathbb{I}\{\cdot\}, we have

𝔼{s𝒮ja𝒜(μt(s,a)𝕀{sjt=s,ajt=a})ξt(s,a)|t1}=0,\mathbb{E}\Big{\{}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}(\mu_{t}(s,a)-\mathbb{I}\{s^{t}_{j}=s,~{}a^{t}_{j}=a\})\xi^{t}(s,a)~{}\Big{|}~{}\mathcal{F}^{t-1}\Big{\}}=0,

since \xi^{t} depends only on the system randomness up to episode t-1. Thus, the term \sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}(\mu_{t}(s,a)-\mathbb{I}\{s^{t}_{j}=s,~{}a^{t}_{j}=a\})\xi^{t}(s,a) is a martingale difference sequence with respect to \mathcal{F}^{t-1}. Furthermore, by \xi^{t}(s,a)\leq 2 and \sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}\mathbb{I}\{s^{t}_{j}=s,~{}a^{t}_{j}=a\}=1, we have

|s𝒮ja𝒜(μt(s,a)𝕀{sjt=s,ajt=a})ξt(s,a)|\displaystyle\Bigg{|}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}(\mu_{t}(s,a)-\mathbb{I}\{s^{t}_{j}=s,~{}a^{t}_{j}=a\})\xi^{t}(s,a)\Bigg{|}
|s𝒮ja𝒜𝕀{sjt=s,ajt=a}|ξt(s,a)+|s𝒮ja𝒜μt(s,a)|ξt(s,a)4.\displaystyle\qquad\leq\Bigg{|}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}\mathbb{I}\{s^{t}_{j}=s,~{}a^{t}_{j}=a\}\Bigg{|}\xi^{t}(s,a)+\Bigg{|}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}\mu_{t}(s,a)\Bigg{|}\xi^{t}(s,a)\leq 4.

Thus, by Azuma’s inequality, we obtain that with probability at least 1ζ/L1-\zeta/L,

t=1Ts𝒮ja𝒜(μt(s,a)𝕀{sjt=s,ajt=a})ξt(s,a)42TlogLζ.\sum_{t=1}^{T}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}(\mu_{t}(s,a)-\mathbb{I}\{s^{t}_{j}=s,~{}a^{t}_{j}=a\})\xi^{t}(s,a)\leq 4\sqrt{2T\log\frac{L}{\zeta}}.

By the union bound, we further have that with probability at least 1-\zeta, the above inequality holds for all j=0,\ldots,L-1. This implies that with probability at least 1-\zeta, the following inequality holds

t=1Tk=0L1j=0ks𝒮ja𝒜(μt(s,a)𝕀{sjt=s,ajt=a})ξt(s,a)2L22TlogLζ.\displaystyle\sum_{t=1}^{T}\sum_{k=0}^{L-1}\sum_{j=0}^{k}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}(\mu_{t}(s,a)-\mathbb{I}\{s^{t}_{j}=s,~{}a^{t}_{j}=a\})\xi^{t}(s,a)\leq 2L^{2}\sqrt{2T\log\frac{L}{\zeta}}. (26)

On the other hand, we adopt the same argument as in the first part of the proof of Lemma 5 in Neu et al. (2012) to bound the term \sum_{t=1}^{T}\sum_{k=0}^{L-1}\sum_{j=0}^{k}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}\mathbb{I}\{s^{t}_{j}=s,~{}a^{t}_{j}=a\}\xi^{t}(s,a) in (25). Recall that \ell(t) denotes the epoch to which the t-th episode belongs. By the definition of the state-action pair counters N_{\ell}(s,a) and n_{\ell}(s,a), we have

N(s,a)=q=01nq(s,a).\displaystyle N_{\ell}(s,a)=\sum_{q=0}^{\ell-1}n_{q}(s,a).

According to Lemma C.1, we have

q=1(t)nq(s,a)max{1,Nq(s,a)}(2+1)q=1(t)nq(s,a).\displaystyle\sum_{q=1}^{\ell(t)}\frac{n_{q}(s,a)}{\max\{1,\sqrt{N_{q}(s,a)}\}}\leq(\sqrt{2}+1)\sqrt{\sum_{q=1}^{\ell(t)}n_{q}(s,a)}. (27)

Since we can rewrite

t=1Tk=0L1j=0ks𝒮ja𝒜𝕀{sjt=s,ajt=a}ξt(s,a)\displaystyle\sum_{t=1}^{T}\sum_{k=0}^{L-1}\sum_{j=0}^{k}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}\mathbb{I}\{s^{t}_{j}=s,~{}a^{t}_{j}=a\}\xi^{t}(s,a)
=t=1Tk=0L1j=0kP^(t)(|sjt,ajt)P(|sjt,ajt)1,\displaystyle\qquad=\sum_{t=1}^{T}\sum_{k=0}^{L-1}\sum_{j=0}^{k}\|\widehat{P}_{\ell(t)}(\cdot|s^{t}_{j},a^{t}_{j})-P(\cdot|s^{t}_{j},a^{t}_{j})\|_{1},

then by Lemma 4.1 and T+1\leq 2T, the following holds with probability at least 1-\zeta,

t=1Tk=0L1j=0ks𝒮ja𝒜𝕀{sjt=s,ajt=a}ξt(s,a)\displaystyle\sum_{t=1}^{T}\sum_{k=0}^{L-1}\sum_{j=0}^{k}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}\mathbb{I}\{s^{t}_{j}=s,~{}a^{t}_{j}=a\}\xi^{t}(s,a)
k=0L1j=0kt=1T2|𝒮j+1|log(2T|𝒮||𝒜|/ζ)max{1,N(t)(sjt,ajt)}\displaystyle\qquad\leq\sum_{k=0}^{L-1}\sum_{j=0}^{k}\sum_{t=1}^{T}\sqrt{\frac{2|\mathcal{S}_{j+1}|\log(2T|\mathcal{S}||\mathcal{A}|/\zeta)}{\max\{1,N_{\ell(t)}(s^{t}_{j},a^{t}_{j})\}}}
k=0L1j=0kq=1(T)s𝒮ja𝒜nq(s,a)2|𝒮j+1|log(2T|𝒮||𝒜|/ζ)max{1,Nq(s,a)}\displaystyle\qquad\leq\sum_{k=0}^{L-1}\sum_{j=0}^{k}\sum_{q=1}^{\ell(T)}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}n_{q}(s,a)\sqrt{\frac{2|\mathcal{S}_{j+1}|\log(2T|\mathcal{S}||\mathcal{A}|/\zeta)}{\max\{1,N_{q}(s,a)\}}}
k=0L1j=0ks𝒮ja𝒜(2+1)2q=1(T)nq(s,a)|𝒮j+1|log2T|𝒮||𝒜|ζ,\displaystyle\qquad\leq\sum_{k=0}^{L-1}\sum_{j=0}^{k}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}(\sqrt{2}+1)\sqrt{2\sum_{q=1}^{\ell(T)}n_{q}(s,a)|\mathcal{S}_{j+1}|\log\frac{2T|\mathcal{S}||\mathcal{A}|}{\zeta}},

where the first inequality is due to Lemma 4.1, the second inequality is by the definition of the counters nq(s,a)n_{q}(s,a) and Nq(s,a)N_{q}(s,a), and the last inequality is by (27). Thus, further bounding the last term of the above inequality yields

k=0L1j=0ks𝒮ja𝒜(2+1)2[q=1(T)nq(s,a)]|𝒮j+1|log2T|𝒮||𝒜|ζ\displaystyle\sum_{k=0}^{L-1}\sum_{j=0}^{k}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}(\sqrt{2}+1)\sqrt{2\left[\sum_{q=1}^{\ell(T)}n_{q}(s,a)\right]|\mathcal{S}_{j+1}|\log\frac{2T|\mathcal{S}||\mathcal{A}|}{\zeta}}
k=0L1j=0k(2+1)2s𝒮ja𝒜[q=1(T)nq(s,a)]|𝒮j||𝒮j+1||𝒜|log2T|𝒮||𝒜|ζ\displaystyle\qquad\leq\sum_{k=0}^{L-1}\sum_{j=0}^{k}(\sqrt{2}+1)\sqrt{2\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}\left[\sum_{q=1}^{\ell(T)}n_{q}(s,a)\right]|\mathcal{S}_{j}||\mathcal{S}_{j+1}||\mathcal{A}|\log\frac{2T|\mathcal{S}||\mathcal{A}|}{\zeta}}
k=0L1j=0k(2+1)2T|𝒮j||𝒮j+1||𝒜|log2T|𝒮||𝒜|ζ\displaystyle\qquad\leq\sum_{k=0}^{L-1}\sum_{j=0}^{k}(\sqrt{2}+1)\sqrt{2T|\mathcal{S}_{j}||\mathcal{S}_{j+1}||\mathcal{A}|\log\frac{2T|\mathcal{S}||\mathcal{A}|}{\zeta}}
(2+1)L|𝒮|2T|𝒜|log2T|𝒮||𝒜|ζ,\displaystyle\qquad\leq(\sqrt{2}+1)L|\mathcal{S}|\sqrt{2T|\mathcal{A}|\log\frac{2T|\mathcal{S}||\mathcal{A}|}{\zeta}},

where the first inequality is due to Jensen’s inequality, the second inequality is due to s𝒮ja𝒜q=1(T)nq(s,a)T\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}\sum_{q=1}^{\ell(T)}\allowbreak n_{q}(s,a)\leq T, and the last inequality is by bounding the term k=0L1j=0k|𝒮j||𝒮j+1|k=0L1j=0k(|𝒮j|+|𝒮j+1|)/2L|𝒮|\sum_{k=0}^{L-1}\sum_{j=0}^{k}\sqrt{|\mathcal{S}_{j}||\mathcal{S}_{j+1}|}\leq\sum_{k=0}^{L-1}\sum_{j=0}^{k}\allowbreak(|\mathcal{S}_{j}|+|\mathcal{S}_{j+1}|)/2\leq L|\mathcal{S}|. The above results imply that with probability at least 1ζ1-\zeta,

t=1Tk=0L1j=0ks𝒮ja𝒜𝕀{sjt=s,ajt=a}ξt(s,a)(2+1)L|𝒮|2T|𝒜|log2T|𝒮||𝒜|ζ.\displaystyle\sum_{t=1}^{T}\sum_{k=0}^{L-1}\sum_{j=0}^{k}\sum_{s\in\mathcal{S}_{j}}\sum_{a\in\mathcal{A}}\mathbb{I}\{s^{t}_{j}=s,~{}a^{t}_{j}=a\}\xi^{t}(s,a)\leq(\sqrt{2}+1)L|\mathcal{S}|\sqrt{2T|\mathcal{A}|\log\frac{2T|\mathcal{S}||\mathcal{A}|}{\zeta}}. (28)

By the union bound, combining (25), (26) and (28), we obtain that with probability at least 1-2\zeta,

t=1Tθtθ¯t1(2+1)L|𝒮|2T|𝒜|log2T|𝒮||𝒜|ζ+2L22TlogLζ.\displaystyle\sum_{t=1}^{T}\|\theta^{t}-\overline{\theta}^{t}\|_{1}\leq(\sqrt{2}+1)L|\mathcal{S}|\sqrt{2T|\mathcal{A}|\log\frac{2T|\mathcal{S}||\mathcal{A}|}{\zeta}}+2L^{2}\sqrt{2T\log\frac{L}{\zeta}}.

This completes the proof. ∎

C.2 Proof of Lemma 6.2

We provide Lemmas C.3, C.4, and C.5 first. Based on them, we give the proof of Lemma 6.2.

Lemma C.3 (Lemma 14 in Wei et al. (2019b)).

Let M denote the probability simplex and M^{o} the probability simplex excluding its boundary. Assume \mathbf{y}\in M^{o} and let \mathcal{C}\subseteq M. Then the following inequality holds

h(𝐱opt)+αD(𝐱opt,𝐲)h(𝐳)+αD(𝐳,𝐲)αD(𝐳,𝐱opt),𝐳𝒞,\displaystyle h(\mathbf{x}^{opt})+\alpha D(\mathbf{x}^{opt},\mathbf{y})\leq h(\mathbf{z})+\alpha D(\mathbf{z},\mathbf{y})-\alpha D(\mathbf{z},\mathbf{x}^{opt}),~{}~{}\forall\mathbf{z}\in\mathcal{C},

where 𝐱optargmin𝐱𝒞h(𝐱)+αD(𝐱,𝐲)\mathbf{x}^{opt}\in\arg\min_{\mathbf{x}\in\mathcal{C}}h(\mathbf{x})+\alpha D(\mathbf{x},\mathbf{y}), h()h(\cdot) is a convex function, and D(,)D(\cdot,\cdot) is the unnormalized KL divergence for two distributions.

Lemma C.3 is an extension of Lemma 14 in Wei et al. (2019b), and its proof follows the same argument. Specifically, the unnormalized KL divergence is a special case of the Bregman divergence studied in Wei et al. (2019b).

Lemma C.4.

For any θ\theta and θ\theta^{\prime} satisfying s𝒮ka𝒜s𝒮k+1θ(s,a,s)=1, and θ(s,a,s)0,k{0,,L1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}_{k+1}}\theta(s,a,s^{\prime})=1,\text{ and }~{}\theta(s,a,s^{\prime})\geq 0,\forall k\in\{0,\ldots,L-1\} and s𝒮ka𝒜θ(s,a,s)=a𝒜s′′𝒮k+2θ(s,a,s′′),s𝒮k+1,k{0,,L2}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\theta(s,a,s^{\prime})=\sum_{a\in\mathcal{A}}\sum_{s^{\prime\prime}\in\mathcal{S}_{k+2}}\theta(s^{\prime},a,s^{\prime\prime}),\forall s^{\prime}\in\mathcal{S}_{k+1},\forall k\in\{0,\ldots,L-2\}, we let θk:=[θ(s,a,s)]s𝒮k,a𝒜,s𝒮k+1\theta_{k}:=[\theta(s,a,s^{\prime})]_{s\in\mathcal{S}_{k},a\in\mathcal{A},s^{\prime}\in\mathcal{S}_{k+1}} denote the vector formed by the elements θ(s,a,s)\theta(s,a,s^{\prime}) for all s𝒮k,a𝒜,s𝒮k+1s\in\mathcal{S}_{k},a\in\mathcal{A},s\in\mathcal{S}_{k+1}. We also let θk:=[θ(s,a,s)]s𝒮k,a𝒜,s𝒮k+1\theta^{\prime}_{k}:=[\theta^{\prime}(s,a,s^{\prime})]_{s\in\mathcal{S}_{k},a\in\mathcal{A},s^{\prime}\in\mathcal{S}_{k+1}} similarly denote a vector formed by θ(s,a,s)\theta^{\prime}(s,a,s^{\prime}). Then, we have

D(θ,θ)12k=0L1θkθk1212Lθθ12,\displaystyle D(\theta,\theta^{\prime})\geq\frac{1}{2}\sum_{k=0}^{L-1}\|\theta_{k}-\theta^{\prime}_{k}\|_{1}^{2}\geq\frac{1}{2L}\|\theta-\theta^{\prime}\|_{1}^{2},

where D(,)D(\cdot,\cdot) is defined as in (7).

Proof of Lemma C.4.

We prove the lemma via the following chain of relations:

\displaystyle D(\theta,\theta^{\prime}) =\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}_{k+1}}\Big{[}\theta(s,a,s^{\prime})\log\frac{\theta(s,a,s^{\prime})}{\theta^{\prime}(s,a,s^{\prime})}-\theta(s,a,s^{\prime})+\theta^{\prime}(s,a,s^{\prime})\Big{]}
\displaystyle=\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}_{k+1}}\theta(s,a,s^{\prime})\log\frac{\theta(s,a,s^{\prime})}{\theta^{\prime}(s,a,s^{\prime})}
\displaystyle\geq\frac{1}{2}\sum_{k=0}^{L-1}\|\theta_{k}-\theta^{\prime}_{k}\|_{1}^{2}\geq\frac{1}{2L}\bigg{(}\sum_{k=0}^{L-1}\|\theta_{k}-\theta^{\prime}_{k}\|_{1}\bigg{)}^{2}=\frac{1}{2L}\|\theta-\theta^{\prime}\|_{1}^{2},

where the second equality holds because \theta_{k} and \theta^{\prime}_{k} both sum to one within each layer, so the terms -\theta(s,a,s^{\prime})+\theta^{\prime}(s,a,s^{\prime}) cancel; the first inequality is Pinsker's inequality applied to the probability distributions \theta_{k} and \theta^{\prime}_{k} (which satisfy \|\theta_{k}\|_{1}=\|\theta^{\prime}_{k}\|_{1}=1); and the second inequality follows from the Cauchy–Schwarz inequality. This completes the proof. ∎
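For intuition, the following purely illustrative Python snippet checks Lemma C.4 on randomly generated layered measures with hypothetical toy sizes; only the per-layer normalization is enforced, since the flow-conservation constraints play no role in the inequality.

import numpy as np

# Numerical illustration of Lemma C.4 with toy layer sizes (illustration only).
rng = np.random.default_rng(2)
num_layers, block = 4, 6                  # L layers; |S_k||A||S_{k+1}| entries per layer (toy)

def random_layered():
    th = rng.random((num_layers, block))
    return th / th.sum(axis=1, keepdims=True)   # each layer sums to one

theta, theta_p = random_layered(), random_layered()
# Unnormalized KL divergence D(theta, theta'), cf. (7).
D = np.sum(theta * np.log(theta / theta_p) - theta + theta_p)
layer_bound = 0.5 * np.sum(np.abs(theta - theta_p).sum(axis=1) ** 2)
global_bound = np.abs(theta - theta_p).sum() ** 2 / (2 * num_layers)
assert D >= layer_bound >= global_bound   # the two inequalities of Lemma C.4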

Lemma C.5.

For any θ\theta and θ\theta^{\prime} satisfying s𝒮ka𝒜s𝒮k+1θ(s,a,s)=1, and θ(s,a,s)0,k{0,,L1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}_{k+1}}\theta(s,a,s^{\prime})=1,\text{ and }~{}\theta(s,a,s^{\prime})\geq 0,\forall k\in\{0,\ldots,L-1\} and s𝒮ka𝒜θ(s,a,s)=a𝒜s′′𝒮k+2θ(s,a,s′′),s𝒮k+1,k{0,,L2}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\theta(s,a,s^{\prime})=\sum_{a\in\mathcal{A}}\sum_{s^{\prime\prime}\in\mathcal{S}_{k+2}}\theta(s^{\prime},a,s^{\prime\prime}),\forall s^{\prime}\in\mathcal{S}_{k+1},\forall k\in\{0,\ldots,L-2\}, letting θ~(s,a,s)=(1λ)θ(s,a,s)+λ|𝒜||𝒮k||𝒮k+1|,(s,a,s)𝒮k×𝒜×𝒮k+1,k=1,,L1\widetilde{\theta}^{\prime}(s,a,s^{\prime})=(1-\lambda)\theta^{\prime}(s,a,s^{\prime})+\frac{\lambda}{|\mathcal{A}||\mathcal{S}_{k}||\mathcal{S}_{k+1}|},\forall(s,a,s^{\prime})\in\mathcal{S}_{k}\times\mathcal{A}\times\mathcal{S}_{k+1},\forall k=1,\ldots,L-1 with 0<λ10<\lambda\leq 1, then we have

D(θ,θ~)D(θ,θ)λLlog|𝒮|2|𝒜|,\displaystyle D(\theta,\widetilde{\theta}^{\prime})-D(\theta,\theta^{\prime})\leq\lambda L\log|\mathcal{S}|^{2}|\mathcal{A}|,
D(θ,θ~)Llog(|𝒮|2|𝒜|λ).\displaystyle D(\theta,\widetilde{\theta}^{\prime})\leq L\log\Big{(}\frac{|\mathcal{S}|^{2}|\mathcal{A}|}{\lambda}\Big{)}.
Proof of Lemma C.5.

We start our proof as follows

D(θ,θ~)D(θ,θ)\displaystyle D(\theta,\widetilde{\theta}^{\prime})-D(\theta,\theta^{\prime}) =k=0L1s𝒮ka𝒜s𝒮k+1θ(s,a,s)(logθ(s,a,s)θ~(s,a,s)logθ(s,a,s)θ(s,a,s))\displaystyle=\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s\in\mathcal{S}_{k+1}}\theta(s,a,s^{\prime})\Big{(}\log\frac{\theta(s,a,s^{\prime})}{\widetilde{\theta}^{\prime}(s,a,s^{\prime})}-\log\frac{\theta(s,a,s^{\prime})}{\theta^{\prime}(s,a,s^{\prime})}\Big{)}
+θ~(s,a,s)θ(s,a,s)\displaystyle\quad+\widetilde{\theta}^{\prime}(s,a,s^{\prime})-\theta^{\prime}(s,a,s^{\prime})
=k=0L1s𝒮ka𝒜s𝒮k+1θ(s,a,s)(logθ(s,a,s)logθ~(s,a,s))\displaystyle=\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s\in\mathcal{S}_{k+1}}\theta(s,a,s^{\prime})\big{(}\log\theta^{\prime}(s,a,s^{\prime})-\log\widetilde{\theta}^{\prime}(s,a,s^{\prime})\big{)}
=k=0L1s𝒮ka𝒜s𝒮k+1θ(s,a,s)(logθ(s,a,s)\displaystyle=\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s\in\mathcal{S}_{k+1}}\theta(s,a,s^{\prime})\Big{(}\log\theta^{\prime}(s,a,s^{\prime})
log[(1λ)θ(s,a,s)+λ/|𝒜||𝒮k||𝒮k+1|]),\displaystyle\quad-\log[(1-\lambda)\theta^{\prime}(s,a,s^{\prime})+\lambda/|\mathcal{A}||\mathcal{S}_{k}||\mathcal{S}_{k+1}|]\Big{)},

where the last equality is by substituting θ~(s,a,s)=(1λ)θ(s,a,s)+λ|𝒜||𝒮k||𝒮k+1|,(s,a,s)𝒮k×𝒜×𝒮k+1,k=1,,L1\widetilde{\theta}^{\prime}(s,a,s^{\prime})=(1-\lambda)\theta^{\prime}(s,a,s^{\prime})+\frac{\lambda}{|\mathcal{A}||\mathcal{S}_{k}||\mathcal{S}_{k+1}|},\forall(s,a,s^{\prime})\in\mathcal{S}_{k}\times\mathcal{A}\times\mathcal{S}_{k+1},\forall k=1,\ldots,L-1. Thus, by bounding the last term above, we further have

D(θ,θ~)D(θ,θ)\displaystyle D(\theta,\widetilde{\theta}^{\prime})-D(\theta,\theta^{\prime}) k=0L1s𝒮ka𝒜s𝒮k+1θ(s,a,s)(logθ(s,a,s)\displaystyle\leq\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s\in\mathcal{S}_{k+1}}\theta(s,a,s^{\prime})\bigg{(}\log\theta^{\prime}(s,a,s^{\prime})
(1λ)logθ(s,a,s)λlog1|𝒮k||𝒮k+1||𝒜|)\displaystyle\quad-(1-\lambda)\log\theta^{\prime}(s,a,s^{\prime})-\lambda\log\frac{1}{|\mathcal{S}_{k}||\mathcal{S}_{k+1}||\mathcal{A}|}\bigg{)}
=k=0L1s𝒮ka𝒜s𝒮k+1λθ(s,a,s)(logθ(s,a,s)+log(|𝒮k||𝒮k+1||𝒜|))\displaystyle=\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s\in\mathcal{S}_{k+1}}\lambda\theta(s,a,s^{\prime})\big{(}\log\theta^{\prime}(s,a,s^{\prime})+\log(|\mathcal{S}_{k}||\mathcal{S}_{k+1}||\mathcal{A}|)\big{)}
k=0L1s𝒮ka𝒜s𝒮k+1λθ(s,a,s)log(|𝒮k||𝒮k+1||𝒜|)λLlog|𝒮|2|𝒜|,\displaystyle\leq\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s\in\mathcal{S}_{k+1}}\lambda\theta(s,a,s^{\prime})\log(|\mathcal{S}_{k}||\mathcal{S}_{k+1}||\mathcal{A}|)\leq\lambda L\log|\mathcal{S}|^{2}|\mathcal{A}|,

where the first inequality is by Jensen's inequality, the second inequality is due to \log\theta^{\prime}(s,a,s^{\prime})\leq 0 since 0<\theta^{\prime}(s,a,s^{\prime})\leq 1, and the last inequality is due to Hölder's inequality, \langle\mathbf{x},\mathbf{y}\rangle\leq\|\mathbf{x}\|_{1}\|\mathbf{y}\|_{\infty}, together with |\mathcal{S}_{k}||\mathcal{S}_{k+1}|\leq|\mathcal{S}|^{2}.

Moreover, we have

\displaystyle D(\theta,\widetilde{\theta}^{\prime}) =\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}_{k+1}}\theta(s,a,s^{\prime})\log\frac{\theta(s,a,s^{\prime})}{\widetilde{\theta}^{\prime}(s,a,s^{\prime})}-\theta(s,a,s^{\prime})+\widetilde{\theta}^{\prime}(s,a,s^{\prime})
=k=0L1s𝒮ka𝒜s𝒮k+1θ(s,a,s)(logθ(s,a,s)logθ~(s,a,s))\displaystyle=\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s\in\mathcal{S}_{k+1}}\theta(s,a,s^{\prime})\big{(}\log\theta(s,a,s^{\prime})-\log\widetilde{\theta}^{\prime}(s,a,s^{\prime})\big{)}
=k=0L1s𝒮ka𝒜s𝒮k+1θ(s,a,s)(logθ(s,a,s)log[(1λ)θ(s,a,s)+λ/(|𝒮k||𝒮k+1||𝒜|)])\displaystyle=\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s\in\mathcal{S}_{k+1}}\theta(s,a,s^{\prime})\big{(}\log\theta(s,a,s^{\prime})-\log[(1-\lambda)\theta^{\prime}(s,a,s^{\prime})+\lambda/(|\mathcal{S}_{k}||\mathcal{S}_{k+1}||\mathcal{A}|)]\big{)}
k=0L1s𝒮ka𝒜s𝒮k+1θ(s,a,s)(log[(1λ)θ(s,a,s)+λ/(|𝒮k||𝒮k+1||𝒜|)])\displaystyle\leq-\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s\in\mathcal{S}_{k+1}}\theta(s,a,s^{\prime})\big{(}\log[(1-\lambda)\theta^{\prime}(s,a,s^{\prime})+\lambda/(|\mathcal{S}_{k}||\mathcal{S}_{k+1}||\mathcal{A}|)]\big{)}
k=0L1s𝒮ka𝒜s𝒮k+1θ(s,a,s)logλ|𝒮k||𝒮k+1||𝒜|Llog|𝒮|2|𝒜|λ,\displaystyle\leq-\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s\in\mathcal{S}_{k+1}}\theta(s,a,s^{\prime})\cdot\log\frac{\lambda}{|\mathcal{S}_{k}||\mathcal{S}_{k+1}||\mathcal{A}|}\leq L\log\frac{|\mathcal{S}|^{2}|\mathcal{A}|}{\lambda},

where the first inequality is due to \log\theta(s,a,s^{\prime})\leq 0, the second inequality is due to the monotonicity of the logarithm function, and the third inequality uses the fact that the entries of \theta sum to one within each layer as well as |\mathcal{S}_{k}||\mathcal{S}_{k+1}|\leq|\mathcal{S}|^{2}. This completes the proof. ∎

Now we are ready to provide the proof of Lemma 6.2.

Proof of Lemma 6.2.

First of all, by Lemma 4.1, we know that

P(|s,a)P^(|s,a)1εζ(s,a),\|P(\cdot|s,a)-\widehat{P}_{\ell}(\cdot|s,a)\|_{1}\leq\varepsilon_{\ell}^{\zeta}(s,a),

with probability at least 1ζ1-\zeta, for all epochs \ell and any state and action pair (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}. Thus, we have that for any epoch (T+1)\ell\leq\ell(T+1),

ΔΔ(,ζ)\displaystyle\Delta\subseteq\Delta(\ell,\zeta)

holds with probability at least 1ζ1-\zeta.

This can be proved as follows: for any \overline{\theta}\in\Delta, for all k\in\{0,\ldots,L-1\}, s\in\mathcal{S}_{k} and a\in\mathcal{A},

θ¯(s,a,)s𝒮k+1θ¯(s,a,s)=P(|s,a).\displaystyle\frac{\overline{\theta}(s,a,\cdot)}{\sum_{s^{\prime}\in\mathcal{S}_{k+1}}\overline{\theta}(s,a,s^{\prime})}=P(\cdot|s,a).

Then, we obtain with probability at least 1ζ1-\zeta,

θ¯(s,a,)s𝒮k+1θ¯(s,a,s)P^(|s,a)1\displaystyle\Big{\|}\frac{\overline{\theta}(s,a,\cdot)}{\sum_{s^{\prime}\in\mathcal{S}_{k+1}}\overline{\theta}(s,a,s^{\prime})}-\widehat{P}_{\ell}(\cdot|s,a)\Big{\|}_{1}
θ¯(s,a,)s𝒮k+1θ¯(s,a,s)P(|s,a)1+P(|s,a)P^(|s,a)10+εζ(s,a)εζ(s,a),\displaystyle\qquad\leq\Big{\|}\frac{\overline{\theta}(s,a,\cdot)}{\sum_{s^{\prime}\in\mathcal{S}_{k+1}}\overline{\theta}(s,a,s^{\prime})}-P(\cdot|s,a)\Big{\|}_{1}+\Big{\|}P(\cdot|s,a)-\widehat{P}_{\ell}(\cdot|s,a)\Big{\|}_{1}\leq 0+\varepsilon_{\ell}^{\zeta}(s,a)\leq\varepsilon_{\ell}^{\zeta}(s,a),

where the inequality is by Lemma 4.1. Therefore, we know that θ¯Δ(,ζ)\overline{\theta}\in\Delta(\ell,\zeta), which proves the above claim.

We define the event 𝒟T\mathcal{D}_{T} as follows:

Event 𝒟T:Δ=1(T+1)Δ(,ζ),\displaystyle\text{Event }\mathcal{D}_{T}:\Delta\subseteq\cap_{\ell=1}^{\ell(T+1)}\Delta(\ell,\zeta), (29)

by which we have

Pr(𝒟T)1ζ.\displaystyle\Pr(\mathcal{D}_{T})\geq 1-\zeta.

Thus, for any \overline{\theta}^{*} that is a solution to problem (1), we have \overline{\theta}^{*}\in\Delta. If the event \mathcal{D}_{T} happens, then \overline{\theta}^{*}\in\cap_{\ell=1}^{\ell(T+1)}\Delta(\ell,\zeta). Recall that the updating rule of \theta is \theta^{t}=\arg\min_{\theta\in\Delta(\ell(t),\zeta)}\big{\langle}Vf^{t-1}+\sum_{i=1}^{I}Q_{i}(t-1)g_{i}^{t-1},\theta\big{\rangle}+\alpha D(\theta,\widetilde{\theta}^{t-1}) as shown in (6), and that \overline{\theta}^{*}\in\cap_{\ell=1}^{\ell(T+1)}\Delta(\ell,\zeta) holds with probability at least 1-\zeta. According to Lemma C.3, letting \mathbf{x}^{opt}=\theta^{t}, \mathbf{z}=\overline{\theta}^{*}, \mathbf{y}=\widetilde{\theta}^{t-1} and h(\theta)=\big{\langle}Vf^{t-1}+\sum_{i=1}^{I}Q_{i}(t-1)g_{i}^{t-1},\theta\big{\rangle}, we have that with probability at least 1-\zeta, the following holds for all episodes t=2,\ldots,T+1

Vft1+i=1IQi(t1)git1,θt+αD(θt,θ~t1)Vft1+i=1IQi(t1)git1,θ¯+αD(θ¯,θ~t1)αD(θ¯,θt),\displaystyle\begin{aligned} &\Big{\langle}Vf^{t-1}+\sum_{i=1}^{I}Q_{i}(t-1)g_{i}^{t-1},\theta^{t}\Big{\rangle}+\alpha D(\theta^{t},\widetilde{\theta}^{t-1})\\ &\qquad\leq\Big{\langle}Vf^{t-1}+\sum_{i=1}^{I}Q_{i}(t-1)g_{i}^{t-1},\overline{\theta}^{*}\Big{\rangle}+\alpha D(\overline{\theta}^{*},\widetilde{\theta}^{t-1})-\alpha D(\overline{\theta}^{*},\theta^{t}),\end{aligned} (30)

which means that, given that the event \mathcal{D}_{T} happens, inequality (30) holds.

On the other hand, according to the updating rule of 𝐐()\mathbf{Q}(\cdot) in (5), which is Qi(t)=max{Qi(t1)+git1,θtci,0}Q_{i}(t)=\max\{Q_{i}(t-1)+\langle g_{i}^{t-1},\theta^{t}\rangle-c_{i},~{}0\}, we know that

Qi(t)2(max{Qi(t1)+git1,θtci,0})2(Qi(t1)+git1,θtci)2,\displaystyle Q_{i}(t)^{2}\leq\left(\max\{Q_{i}(t-1)+\langle{g_{i}^{t-1}},{\theta^{t}}\rangle-c_{i},~{}0\}\right)^{2}\leq\left(Q_{i}(t-1)+\langle{g_{i}^{t-1}},{\theta^{t}}\rangle-c_{i}\right)^{2},

which further leads to

Qi(t)2Qi(t1)2\displaystyle Q_{i}(t)^{2}-Q_{i}(t-1)^{2}\leq 2Qi(t1)(git1,θtci)+(git1,θtci)2.\displaystyle 2Q_{i}(t-1)\left(\langle{g_{i}^{t-1}},{\theta^{t}}\rangle-c_{i}\right)+\left(\langle{g_{i}^{t-1}},{\theta^{t}}\rangle-c_{i}\right)^{2}.

Taking summation on both sides of the above inequality from i=1i=1 to II, we have

12(𝐐(t)2𝐐(t1)2)i=1IQi(t1)git1,θti=1IQi(t1)ci+12i=1I(git1,θtci)2i=1IQi(t1)git1,θti=1IQi(t1)ci+2L2,\displaystyle\begin{aligned} &\frac{1}{2}\left(\|\mathbf{Q}(t)\|^{2}-\|\mathbf{Q}(t-1)\|^{2}\right)\\ &\qquad\leq\sum_{i=1}^{I}\langle{Q_{i}(t-1)g_{i}^{t-1}},{\theta^{t}}\rangle-\sum_{i=1}^{I}Q_{i}(t-1)c_{i}+\frac{1}{2}\sum_{i=1}^{I}\left(\langle{g_{i}^{t-1}},{\theta^{t}}\rangle-c_{i}\right)^{2}\\ &\qquad\leq\sum_{i=1}^{I}\langle{Q_{i}(t-1)g_{i}^{t-1}},{\theta^{t}}\rangle-\sum_{i=1}^{I}Q_{i}(t-1)c_{i}+2L^{2},\end{aligned} (31)

where we let 𝐐(t)2=i=1IQi2(t)\|\mathbf{Q}(t)\|^{2}=\sum_{i=1}^{I}Q_{i}^{2}(t) and 𝐐(t1)2=i=1IQi2(t1)\|\mathbf{Q}(t-1)\|^{2}=\sum_{i=1}^{I}Q_{i}^{2}(t-1), and the last inequality is due to

i=1I(git1,θtci)2\displaystyle\sum_{i=1}^{I}(\langle g_{i}^{t-1},\theta^{t}\rangle-c_{i})^{2}\leq 2i=1I[(git1,θt)2+ci2]\displaystyle 2\sum_{i=1}^{I}[(\langle g_{i}^{t-1},\theta^{t}\rangle)^{2}+c_{i}^{2}]
\displaystyle\leq 2i=1I[git12θt12+ci2]\displaystyle 2\sum_{i=1}^{I}[\|g_{i}^{t-1}\|_{\infty}^{2}\|\theta^{t}\|_{1}^{2}+c_{i}^{2}]
\displaystyle\leq 2i=1I[L2git12+ci2]\displaystyle 2\sum_{i=1}^{I}[L^{2}\|g_{i}^{t-1}\|_{\infty}^{2}+c_{i}^{2}]
\displaystyle\leq 2[L2(i=1Igit1)2+(i=1I|ci|)2]4L2\displaystyle 2[L^{2}(\sum_{i=1}^{I}\|g_{i}^{t-1}\|_{\infty})^{2}+(\sum_{i=1}^{I}|c_{i}|)^{2}]\leq 4L^{2}

by Assumption 3.3 and the facts that \sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}_{k+1}}\theta^{t}(s,a,s^{\prime})=1 and \theta^{t}(s,a,s^{\prime})\geq 0. Thus, summing up (30) and (31), and then subtracting V\langle f^{t-1},\theta^{t-1}\rangle from both sides, we have

Vft1,θtθt1+12(𝐐(t)2𝐐(t1)2)+αD(θt,θ~t1)\displaystyle V\big{\langle}f^{t-1},\theta^{t}-\theta^{t-1}\big{\rangle}+\frac{1}{2}\big{(}\|\mathbf{Q}(t)\|^{2}-\|\mathbf{Q}(t-1)\|^{2}\big{)}+\alpha D(\theta^{t},\widetilde{\theta}^{t-1})
Vft1,θ¯θt1+i=1IQi(t1)(git1,θ¯ci)+αD(θ¯,θ~t1)αD(θ¯,θt)+4L2.\displaystyle\quad\leq V\big{\langle}f^{t-1},\overline{\theta}^{*}-\theta^{t-1}\big{\rangle}+\sum_{i=1}^{I}Q_{i}(t-1)(\langle g_{i}^{t-1},\overline{\theta}^{*}\rangle-c_{i})+\alpha D(\overline{\theta}^{*},\widetilde{\theta}^{t-1})-\alpha D(\overline{\theta}^{*},\theta^{t})+4L^{2}.

We further need to show the lower bound of the term Vft1,θtθt1+αD(θt,θ~t1)V\big{\langle}f^{t-1},\theta^{t}-\theta^{t-1}\big{\rangle}+\alpha D(\theta^{t},\widetilde{\theta}^{t-1}) on LHS of the above inequality. Specifically, we have

Vft1,θtθt1+αD(θt,θ~t1)\displaystyle V\big{\langle}f^{t-1},\theta^{t}-\theta^{t-1}\big{\rangle}+\alpha D(\theta^{t},\widetilde{\theta}^{t-1})
=Vft1,θtθ~t1+Vft1,θ~t1θt1+αD(θt,θ~t1)\displaystyle\qquad=V\big{\langle}f^{t-1},\theta^{t}-\widetilde{\theta}^{t-1}\big{\rangle}+V\big{\langle}f^{t-1},\widetilde{\theta}^{t-1}-\theta^{t-1}\rangle+\alpha D(\theta^{t},\widetilde{\theta}^{t-1})
Vft1θtθ~t11Vft1θ~t1θt11+α2k=0L1θktθ~kt112\displaystyle\qquad\geq-V\|f^{t-1}\|_{\infty}\cdot\|\theta^{t}-\widetilde{\theta}^{t-1}\|_{1}-V\|f^{t-1}\|_{\infty}\cdot\|\widetilde{\theta}^{t-1}-\theta^{t-1}\|_{1}+\frac{\alpha}{2}\sum_{k=0}^{L-1}\|\theta_{k}^{t}-\widetilde{\theta}_{k}^{t-1}\|_{1}^{2}
Vk=0L1θktθ~kt112LλV+α2k=0L1θktθ~kt112LV2α2LλV,\displaystyle\qquad\geq-V\sum_{k=0}^{L-1}\|\theta_{k}^{t}-\widetilde{\theta}_{k}^{t-1}\|_{1}-2L\lambda V+\frac{\alpha}{2}\sum_{k=0}^{L-1}\|\theta_{k}^{t}-\widetilde{\theta}_{k}^{t-1}\|_{1}^{2}\geq-\frac{LV}{2\alpha}-2L\lambda V,

where the first inequality uses Hölder's inequality and Lemma C.4, namely D(\theta,\theta^{\prime})=\sum_{k=0}^{L-1}D(\theta_{k},\theta^{\prime}_{k})\geq\frac{1}{2}\sum_{k=0}^{L-1}\|\theta_{k}-\theta^{\prime}_{k}\|_{1}^{2} with \theta_{k}:=[\theta(s,a,s^{\prime})]_{s\in\mathcal{S}_{k},a\in\mathcal{A},s^{\prime}\in\mathcal{S}_{k+1}}; the second inequality is due to \widetilde{\theta}^{t-1}_{k}=(1-\lambda)\theta^{t-1}_{k}+\lambda\frac{\mathbf{1}}{|\mathcal{A}||\mathcal{S}_{k}||\mathcal{S}_{k+1}|}, which gives \|\widetilde{\theta}^{t-1}-\theta^{t-1}\|_{1}=\sum_{k=0}^{L-1}\|\widetilde{\theta}^{t-1}_{k}-\theta^{t-1}_{k}\|_{1}=\lambda\sum_{k=0}^{L-1}\big{\|}\theta^{t-1}_{k}-\frac{\mathbf{1}}{|\mathcal{A}||\mathcal{S}_{k}||\mathcal{S}_{k+1}|}\big{\|}_{1}\leq\lambda\sum_{k=0}^{L-1}\big{(}\big{\|}\theta^{t-1}_{k}\big{\|}_{1}+\big{\|}\frac{\mathbf{1}}{|\mathcal{A}||\mathcal{S}_{k}||\mathcal{S}_{k+1}|}\big{\|}_{1}\big{)}\leq 2\lambda L; and the third inequality is obtained by minimizing the quadratic function -Vx+\frac{\alpha}{2}x^{2} in x.

12(𝐐(t)2𝐐(t1)2)LV2α2LλV\displaystyle\frac{1}{2}\left(\|\mathbf{Q}(t)\|^{2}-\|\mathbf{Q}(t-1)\|^{2}\right)-\frac{LV}{2\alpha}-2L\lambda V (32)
Vft1,θ¯θt1+i=1IQi(t1)(git1,θ¯ci)+αD(θ¯,θ~t1)αD(θ¯,θt)+4L2.\displaystyle\qquad\leq V\big{\langle}f^{t-1},\overline{\theta}^{*}-\theta^{t-1}\big{\rangle}+\sum_{i=1}^{I}Q_{i}(t-1)(\langle g_{i}^{t-1},\overline{\theta}^{*}\rangle-c_{i})+\alpha D(\overline{\theta}^{*},\widetilde{\theta}^{t-1})-\alpha D(\overline{\theta}^{*},\theta^{t})+4L^{2}.

Note that according to Lemma C.5, we have

D(θ¯,θ~t1)D(θ¯,θt)\displaystyle D(\overline{\theta}^{*},\widetilde{\theta}^{t-1})-D(\overline{\theta}^{*},\theta^{t}) =D(θ¯,θ~t1)D(θ¯,θt1)+D(θ¯,θt1)D(θ¯,θt)\displaystyle=D(\overline{\theta}^{*},\widetilde{\theta}^{t-1})-D(\overline{\theta}^{*},\theta^{t-1})+D(\overline{\theta}^{*},\theta^{t-1})-D(\overline{\theta}^{*},\theta^{t})
λLlog|𝒮|2|𝒜|+D(θ¯,θt1)D(θ¯,θt).\displaystyle\leq\lambda L\log|\mathcal{S}|^{2}|\mathcal{A}|+D(\overline{\theta}^{*},\theta^{t-1})-D(\overline{\theta}^{*},\theta^{t}).

Therefore, plugging the above inequality into (32) and rearranging the terms, we further get

Vft1,θt1θ¯\displaystyle V\langle{f^{t-1}},{\theta^{t-1}-\overline{\theta}^{*}}\rangle 12(𝐐(t1)2𝐐(t)2)+i=1IQi(t1)(git1,θ¯ci)\displaystyle\leq\frac{1}{2}\left(\|\mathbf{Q}(t-1)\|^{2}-\|\mathbf{Q}(t)\|^{2}\right)+\sum_{i=1}^{I}Q_{i}(t-1)(\langle g_{i}^{t-1},\overline{\theta}^{*}\rangle-c_{i})
+αλLlog|𝒮|2|𝒜|+αD(θ¯,θt1)αD(θ¯,θt)+4L2+LV2α+2LλV.\displaystyle\quad+\alpha\lambda L\log|\mathcal{S}|^{2}|\mathcal{A}|+\alpha D(\overline{\theta}^{*},\theta^{t-1})-\alpha D(\overline{\theta}^{*},\theta^{t})+4L^{2}+\frac{LV}{2\alpha}+2L\lambda V.

Thus, taking the summation on both sides of the above inequality from t=2 to T+1 and assuming \mathbf{Q}(0)=0, we obtain that with probability at least 1-\zeta, the following inequality holds

\displaystyle\begin{aligned} \sum_{t=2}^{T+1}\langle{f^{t-1}},{\theta^{t-1}-\overline{\theta}^{*}}\rangle&\leq\frac{1}{V}\sum_{t=2}^{T+1}\sum_{i=1}^{I}Q_{i}(t-1)(\langle g_{i}^{t-1},\overline{\theta}^{*}\rangle-c_{i})+\frac{T\alpha\lambda L\log|\mathcal{S}|^{2}|\mathcal{A}|}{V}\\ &\qquad+\frac{\alpha D(\overline{\theta}^{*},\theta^{1})+4TL^{2}}{V}+\frac{LT}{2\alpha}+2L\lambda T.\end{aligned} (33)

It is not difficult to compute that D(\overline{\theta}^{*},\theta^{1})\leq L\log|\mathcal{S}|^{2}|\mathcal{A}| according to the initialization of \theta^{1} by the uniform distribution. Then, by rearranging the terms and shifting the index, we rewrite (33) as

t=1Tft,θtθ¯1Vt=1Ti=1IQi(t)(git,θ¯ci)+4TL2+(λT+1)αLlog|𝒮|2|𝒜|V+LT2α+2LλT.\displaystyle\sum_{t=1}^{T}\langle{f^{t}},{\theta^{t}-\overline{\theta}^{*}}\rangle\leq\frac{1}{V}\sum_{t=1}^{T}\sum_{i=1}^{I}Q_{i}(t)(\langle g_{i}^{t},\overline{\theta}^{*}\rangle-c_{i})+\frac{4TL^{2}+(\lambda T+1)\alpha L\log|\mathcal{S}|^{2}|\mathcal{A}|}{V}+\frac{LT}{2\alpha}+2L\lambda T.

This completes the proof. ∎
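For concreteness, the following toy Python snippet mimics the virtual-queue update (5), Q_{i}(t)=\max\{Q_{i}(t-1)+\langle g_{i}^{t-1},\theta^{t}\rangle-c_{i},0\}, and checks the per-episode increment bound |\|\mathbf{Q}(t)\|_{2}-\|\mathbf{Q}(t-1)\|_{2}|\leq 2L used in the next subsection (cf. (34)). The dimensions and the random constraint data are hypothetical placeholders; the scaling of g_{i} and c_{i} below is merely chosen so that \sum_{i}\|g_{i}\|_{\infty}\leq 1 and \sum_{i}|c_{i}|\leq 1, which suffices for the bound (the exact normalization is the one stated in Assumption 3.3 in the main text).

import numpy as np

# Toy illustration of the virtual-queue update (5) and the increment bound (34).
rng = np.random.default_rng(4)
num_con, L_len, dim = 3, 5, 40            # number of constraints, episode length, occupancy dimension
Q = np.zeros(num_con)
for t in range(100):
    theta = rng.random(dim)
    theta *= L_len / theta.sum()          # ||theta^t||_1 = L (one unit of mass per layer)
    g = rng.random((num_con, dim)) / (num_con * dim)   # keeps sum_i ||g_i||_inf <= 1
    c = rng.random(num_con) / num_con                  # keeps sum_i |c_i| <= 1
    Q_next = np.maximum(Q + g @ theta - c, 0.0)        # Q_i(t) = max{Q_i(t-1) + <g_i,theta^t> - c_i, 0}
    assert abs(np.linalg.norm(Q_next) - np.linalg.norm(Q)) <= 2 * L_len
    Q = Q_next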

C.3 Proof of Lemma 6.3

Lemma C.6 (Lemma 5 of Yu et al. (2017)).

Let \{Z(t),t\geq 0\} be a discrete-time stochastic process adapted to a filtration \{\mathcal{U}^{t},t\geq 0\} with Z(0)=0 and \mathcal{U}^{0}=\{\varnothing,\Omega\}. Suppose there exist an integer \tau>0 and real constants \psi>0, \rho_{\max}>0, and 0<\kappa\leq\rho_{\max} such that

|Z(t+1)Z(t)|\displaystyle|Z(t+1)-Z(t)| ρmax,\displaystyle\leq\rho_{\max},
𝔼[Z(t+τ)Z(t)|𝒰t]\displaystyle\mathbb{E}[Z(t+\tau)-Z(t)|\mathcal{U}^{t}] {τρmax, if Z(t)<ψτκ, if Z(t)ψ\displaystyle\leq\left\{\begin{matrix}\tau\rho_{\max},\ \text{ if }Z(t)<\psi\\ -\tau\kappa,\quad\text{ if }Z(t)\geq\psi\end{matrix}\right.

hold for all t{1,2,}t\in\{1,2,...\}. Then for any constant 0<δ<10<\delta<1, with probability at least 1δ1-\delta, we have

Z(t)ψ+τ4ρmax2κlog(1+8ρmax2κ2eκ/(4ρmax))+τ4ρmax2κlog1δ,t{1,2,}.\displaystyle Z(t)\leq\psi+\tau\frac{4\rho_{\max}^{2}}{\kappa}\log\bigg{(}1+\frac{8\rho_{\max}^{2}}{\kappa^{2}}e^{\kappa/(4\rho_{\max})}\bigg{)}+\tau\frac{4\rho_{\max}^{2}}{\kappa}\log\frac{1}{\delta},~{}~{}\forall t\in\{1,2,...\}.
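To build intuition for Lemma C.6, the following purely illustrative simulation (the constants \psi, \kappa, \rho_{\max}, \delta and the increment distribution are arbitrary choices, not quantities from the paper) runs a random walk whose increments are bounded by \rho_{\max} and have mean at most -\kappa whenever Z(t)\geq\psi, and compares its maximum against the high-probability bound of the lemma.

import numpy as np

# Toy simulation illustrating the drift condition of Lemma C.6 (illustration only).
rng = np.random.default_rng(3)
psi, kappa, rho_max, tau, delta = 5.0, 0.5, 1.0, 1, 0.01
bound = psi + tau * (4 * rho_max**2 / kappa) * (
    np.log(1 + 8 * rho_max**2 / kappa**2 * np.exp(kappa / (4 * rho_max))) + np.log(1 / delta))
max_Z = 0.0
for _ in range(200):                      # independent runs
    Z = 0.0
    for _ in range(2000):
        if Z < psi:
            inc = rng.uniform(-rho_max, rho_max)              # mean 0 <= tau * rho_max
        else:
            inc = rng.uniform(-rho_max, rho_max - 2 * kappa)  # mean -kappa, |inc| <= rho_max
        Z += inc
        max_Z = max(max_Z, Z)
print(max_Z, '<=', bound)                 # empirically, the walk stays far below the bound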

Now, we are in a position to give the proof of Lemma 6.3.

Proof of Lemma 6.3.

The proof of this lemma is based on applying Lemma C.6 to our problem. Thus, the proof mainly focuses on showing that \|\mathbf{Q}(t)\|_{2} satisfies the conditions of Lemma C.6. By the updating rule of Q_{i}(t), i.e., Q_{i}(t+1)=\max\{Q_{i}(t)+\langle g_{i}^{t},\theta^{t+1}\rangle-c_{i},0\}, we have

|𝐐(t+1)2𝐐(t)2|\displaystyle\left|\|\mathbf{Q}(t+1)\|_{2}-\|\mathbf{Q}(t)\|_{2}\right|\leq 𝐐(t+1)𝐐(t)2\displaystyle\|\mathbf{Q}(t+1)-\mathbf{Q}(t)\|_{2}
=\displaystyle= i=1I|Qi(t+1)Qi(t)|2\displaystyle\sqrt{\sum_{i=1}^{I}|Q_{i}(t+1)-Q_{i}(t)|^{2}}
\displaystyle\leq i=1I|git,θt+1ci|2,\displaystyle\sqrt{\sum_{i=1}^{I}|\langle g_{i}^{t},\theta^{t+1}\rangle-c_{i}|^{2}},

where the first inequality is due to the triangle inequality, and the second inequality is by the fact that |\max\{a+b,0\}-a|\leq|b| if a\geq 0. Then, by Assumption 3.3, we further have

i=1I|git,θt+1ci|2i=1I|git,θt+1ci|i=1I(gitθt+11+|ci|)2L,\displaystyle\sqrt{\sum_{i=1}^{I}|\langle g_{i}^{t},\theta^{t+1}\rangle-c_{i}|^{2}}\leq\sum_{i=1}^{I}|\langle g_{i}^{t},\theta^{t+1}\rangle-c_{i}|\leq\sum_{i=1}^{I}(\|g_{i}^{t}\|_{\infty}\|\theta^{t+1}\|_{1}+|c_{i}|)\leq 2L,

which therefore implies

|𝐐(t+1)2𝐐(t)2|2L.\displaystyle\left|\|\mathbf{Q}(t+1)\|_{2}-\|\mathbf{Q}(t)\|_{2}\right|\leq 2L. (34)

Thus, with the above inequality, we have

\displaystyle\begin{aligned} \|\mathbf{Q}(t+\tau)\|_{2}-\|\mathbf{Q}(t)\|_{2}&\leq|\|\mathbf{Q}(t+\tau)\|_{2}-\|\mathbf{Q}(t)\|_{2}|\\ &\leq\sum_{r=1}^{\tau}\left|\|\mathbf{Q}(t+r)\|_{2}-\|\mathbf{Q}(t+r-1)\|_{2}\right|\leq 2\tau L,\end{aligned} (35)

such that

𝔼[𝐐(t+τ)2𝐐(t)2|t1]2τL.\displaystyle\mathbb{E}[\|\mathbf{Q}(t+\tau)\|_{2}-\|\mathbf{Q}(t)\|_{2}|\mathcal{F}^{t-1}]\leq 2\tau L. (36)

where t1\mathcal{F}^{t-1} represents the system randomness up to the (t1)(t-1)-th episode and 𝐐(t)\mathbf{Q}(t) depends on t1\mathcal{F}^{t-1} according to its updating rule. Note that (35) in fact indicates that the random process 𝐐(t+τ)2𝐐(t)2\|\mathbf{Q}(t+\tau)\|_{2}-\|\mathbf{Q}(t)\|_{2} is bounded by the value 2τL2\tau L.

Next, we need to show that there exist \psi and \kappa such that \mathbb{E}[\|\mathbf{Q}(t+\tau)\|_{2}-\|\mathbf{Q}(t)\|_{2}\,|\,\mathcal{F}^{t-1}]\leq-\tau\kappa whenever \|\mathbf{Q}(t)\|_{2}\geq\psi. Recall the definition of the event \mathcal{D}_{T} in (29). With probability at least 1-\zeta the event \mathcal{D}_{T} happens, in which case for all t^{\prime}=2,\ldots,T+1 and any \theta\in\cap_{\ell=1}^{\ell(T+1)}\Delta(\ell,\zeta), the following holds

Vft1,θt1θ¯\displaystyle V\big{\langle}f^{t^{\prime}-1},\theta^{t^{\prime}-1}-\overline{\theta}^{*}\big{\rangle} 12(𝐐(t1)22𝐐(t)22)+i=1IQi(t1)(git1,θci)\displaystyle\leq\frac{1}{2}\left(\|\mathbf{Q}(t^{\prime}-1)\|_{2}^{2}-\|\mathbf{Q}(t^{\prime})\|_{2}^{2}\right)+\sum_{i=1}^{I}Q_{i}(t^{\prime}-1)(\langle g_{i}^{t^{\prime}-1},\theta\rangle-c_{i})
+αλLlog|𝒮|2|𝒜|+αD(θ,θ~t1)αD(θ,θt)+4L2+LV2α+2LλV,\displaystyle\quad+\alpha\lambda L\log|\mathcal{S}|^{2}|\mathcal{A}|+\alpha D(\theta,\widetilde{\theta}^{t^{\prime}-1})-\alpha D(\theta,\theta^{t^{\prime}})+4L^{2}+\frac{LV}{2\alpha}+2L\lambda V,

which follows from the same argument as (32). Rearranging the terms, the above inequality further leads to the following inequality

𝐐(t)22𝐐(t1)22\displaystyle\|\mathbf{Q}(t^{\prime})\|_{2}^{2}-\|\mathbf{Q}(t^{\prime}-1)\|_{2}^{2} 2Vft1,θt1θ+2i=1IQi(t1)(git1,θci)\displaystyle\leq-2V\langle{f^{t^{\prime}-1}},{\theta^{t^{\prime}-1}-\theta}\rangle+2\sum_{i=1}^{I}Q_{i}(t^{\prime}-1)(\langle g_{i}^{t^{\prime}-1},\theta\rangle-c_{i})
+2αλLlog|𝒮|2|𝒜|+2αD(θ,θ~t1)2αD(θ,θt)+8L2+LVα+4LλV.\displaystyle\quad+2\alpha\lambda L\log|\mathcal{S}|^{2}|\mathcal{A}|+2\alpha D(\theta,\widetilde{\theta}^{t^{\prime}-1})-2\alpha D(\theta,\theta^{t^{\prime}})+8L^{2}+\frac{LV}{\alpha}+4L\lambda V.

Taking the summation from t+1 to \tau+t on both sides of the above inequality, the following inequality holds with probability 1-\zeta for any \tau>0 and t satisfying 1\leq t\leq T+1-\tau

𝐐(τ+t)22𝐐(t)222Vt=t+1τ+tft1,θt1θ+2t=t+1τ+ti=1IQi(t1)(git1,θci)+2αD(θ,θ~t)2αD(θ,θ~τ+t)+t=t+1τ+t2α[D(θ,θ~t1)D(θ,θt1)]+8τL2+τLVα+4τLλV.\displaystyle\begin{aligned} &\|\mathbf{Q}(\tau+t)\|_{2}^{2}-\|\mathbf{Q}(t)\|_{2}^{2}\\ &\quad\leq-2V\sum_{t^{\prime}=t+1}^{\tau+t}\langle{f^{t^{\prime}-1}},{\theta^{t^{\prime}-1}-\theta}\rangle+2\sum_{t^{\prime}=t+1}^{\tau+t}\sum_{i=1}^{I}Q_{i}(t^{\prime}-1)(\langle g_{i}^{t^{\prime}-1},\theta\rangle-c_{i})+2\alpha D(\theta,\widetilde{\theta}^{t})\\ &\qquad-2\alpha D(\theta,\widetilde{\theta}^{\tau+t})+\sum_{t^{\prime}=t+1}^{\tau+t}2\alpha[D(\theta,\widetilde{\theta}^{t^{\prime}-1})-D(\theta,\theta^{t^{\prime}-1})]+8\tau L^{2}+\frac{\tau LV}{\alpha}+4\tau L\lambda V.\end{aligned} (37)

Particularly, in (37), the term -2\alpha D(\theta,\theta^{t^{\prime}-1})\leq 0 due to the non-negativity of unnormalized KL divergence. By Lemma C.5, we can bound

\displaystyle\sum_{t^{\prime}=t+1}^{\tau+t}2\alpha[D(\theta,\widetilde{\theta}^{t^{\prime}-1})-D(\theta,\theta^{t^{\prime}-1})]\leq 2\alpha\tau L\log|\mathcal{S}|^{2}|\mathcal{A}|.

For the term 2αD(θ,θ~t)2\alpha D(\theta,\widetilde{\theta}^{t}), by Lemma C.5, we can bound it as

2αD(θ,θ~t)2αLlog(|𝒮|2|𝒜|/λ).\displaystyle 2\alpha D(\theta,\widetilde{\theta}^{t})\leq 2\alpha L\log(|\mathcal{S}|^{2}|\mathcal{A}|/\lambda).

Moreover, we can decompose the term 2V\sum_{t^{\prime}=t+1}^{\tau+t}\langle f^{t^{\prime}-1},\theta-\theta^{t^{\prime}-1}\rangle+2\sum_{t^{\prime}=t+1}^{\tau+t}\sum_{i=1}^{I}Q_{i}(t^{\prime}-1)(\langle g_{i}^{t^{\prime}-1},\theta\rangle-c_{i}) in (37) as

2Vt=t+1τ+tft1,θθt1+2t=t+1τ+ti=1IQi(t1)(git1,θci)\displaystyle 2V\sum_{t^{\prime}=t+1}^{\tau+t}\langle{f^{t^{\prime}-1}},{\theta-\theta^{t^{\prime}-1}}\rangle+2\sum_{t^{\prime}=t+1}^{\tau+t}\sum_{i=1}^{I}Q_{i}(t^{\prime}-1)(\langle g_{i}^{t^{\prime}-1},\theta\rangle-c_{i})
=2Vt=t+1τ+tft1,θθt1+2i=1IQi(t)t=t+1τ+t(git1,θci)\displaystyle\qquad=2V\sum_{t^{\prime}=t+1}^{\tau+t}\langle{f^{t^{\prime}-1}},{\theta-\theta^{t^{\prime}-1}}\rangle+2\sum_{i=1}^{I}Q_{i}(t)\sum_{t^{\prime}=t+1}^{\tau+t}(\langle g_{i}^{t^{\prime}-1},\theta\rangle-c_{i})
+2t=t+2τ+ti=1I[Qi(t1)Qi(t)](git1,θci)\displaystyle\qquad\qquad\quad+2\sum_{t^{\prime}=t+2}^{\tau+t}\sum_{i=1}^{I}[Q_{i}(t^{\prime}-1)-Q_{i}(t)](\langle g_{i}^{t^{\prime}-1},\theta\rangle-c_{i})
2Vt=t+1τ+tft1,θ+2i=1IQi(t)t=t+1τ+t(git1,θci)+2Lτ2+2VLτ,\displaystyle\qquad\leq 2V\sum_{t^{\prime}=t+1}^{\tau+t}\langle{f^{t^{\prime}-1}},{\theta}\rangle+2\sum_{i=1}^{I}Q_{i}(t)\sum_{t^{\prime}=t+1}^{\tau+t}(\langle g_{i}^{t^{\prime}-1},\theta\rangle-c_{i})+2L\tau^{2}+2VL\tau,

where the last inequality is due to

2Vt=t+1τ+tft1,θt12Vt=t+1τ+tk=0L1s𝒮ka𝒜s𝒮k+1ft1(s,a,s)θt1(s,a,s)2VLτ,\displaystyle-2V\sum_{t^{\prime}=t+1}^{\tau+t}\langle{f^{t^{\prime}-1}},{\theta^{t^{\prime}-1}}\rangle\leq 2V\sum_{t^{\prime}=t+1}^{\tau+t}\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}_{k+1}}f^{t^{\prime}-1}(s,a,s^{\prime})\theta^{t^{\prime}-1}(s,a,s^{\prime})\leq 2VL\tau,

as well as

2t=t+2τ+ti=1I[Qi(t1)Qi(t)](git1,θci)\displaystyle 2\sum_{t^{\prime}=t+2}^{\tau+t}\sum_{i=1}^{I}[Q_{i}(t^{\prime}-1)-Q_{i}(t)](\langle g_{i}^{t^{\prime}-1},\theta\rangle-c_{i})
2t=t+2τ+ti=1Ir=tt2|gir,θr+1ci||git1,θci|\displaystyle\qquad\leq 2\sum_{t^{\prime}=t+2}^{\tau+t}\sum_{i=1}^{I}\sum_{r=t}^{t^{\prime}-2}|\langle g_{i}^{r},\theta^{r+1}\rangle-c_{i}|\cdot|\langle g_{i}^{t^{\prime}-1},\theta\rangle-c_{i}|
t=t+2τ+tr=tt2i=1I|gir,θr+1ci|2+t=t+2τ+tr=tt2i=1I|git1,θci|2\displaystyle\qquad\leq\sum_{t^{\prime}=t+2}^{\tau+t}\sum_{r=t}^{t^{\prime}-2}\sqrt{\sum_{i=1}^{I}|\langle g_{i}^{r},\theta^{r+1}\rangle-c_{i}|^{2}}+\sum_{t^{\prime}=t+2}^{\tau+t}\sum_{r=t}^{t^{\prime}-2}\sqrt{\sum_{i=1}^{I}|\langle g_{i}^{t^{\prime}-1},\theta\rangle-c_{i}|^{2}}
2Lτ2,\displaystyle\qquad\leq 2L\tau^{2},

by Qi(t+1)=max{Qi(t)+git,θt+1ci,0}Q_{i}(t+1)=\max\{Q_{i}(t)+\langle g_{i}^{t},\theta^{t+1}\rangle-c_{i},0\} and |max{a+b,0}a||b||\max\{a+b,0\}-a|\leq|b| if a0a\geq 0 for the first inequality, and Assumption 3.3 for the last inequality.
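For illustration only (not part of the analysis), the following minimal Python sketch simulates the virtual-queue update Q_{i}(t+1)=\max\{Q_{i}(t)+\langle g_{i}^{t},\theta^{t+1}\rangle-c_{i},0\} on toy random data with |\langle g_{i}^{t},\theta^{t+1}\rangle-c_{i}|\leq 2L as in Assumption 3.3, and numerically checks the two facts used above: the per-step change of each Q_{i} is at most |\langle g_{i}^{t},\theta^{t+1}\rangle-c_{i}|, and hence |Q_{i}(t^{\prime}-1)-Q_{i}(t)|\leq\sum_{r=t}^{t^{\prime}-2}|\langle g_{i}^{r},\theta^{r+1}\rangle-c_{i}|. All dimensions and distributions below are assumed toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
I_con, T, L = 3, 200, 5      # assumed toy numbers of constraints, episodes, episode length

# toy "constraint slacks" <g_i^t, theta^{t+1}> - c_i, kept in [-2L, 2L] as in Assumption 3.3
slack = rng.uniform(-2 * L, 2 * L, size=(T, I_con))

Q = np.zeros((T + 1, I_con))
for t in range(T):
    # virtual-queue update Q_i(t+1) = max{Q_i(t) + <g_i^t, theta^{t+1}> - c_i, 0}
    Q[t + 1] = np.maximum(Q[t] + slack[t], 0.0)
    # per-step change bound |max{a+b, 0} - a| <= |b| for a >= 0
    assert np.all(np.abs(Q[t + 1] - Q[t]) <= np.abs(slack[t]) + 1e-12)

# telescoping bound |Q_i(t'-1) - Q_i(t)| <= sum_{r=t}^{t'-2} |<g_i^r, theta^{r+1}> - c_i|
t0, t1 = 20, 60              # arbitrary pair with t0 < t1, playing the roles of t and t'
lhs = np.abs(Q[t1 - 1] - Q[t0])
rhs = np.abs(slack[t0:t1 - 1]).sum(axis=0)
print("telescoping bound holds:", bool(np.all(lhs <= rhs + 1e-9)))
```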

Therefore, taking conditional expectation on both sides of (37) and combining the above upper bounds for certain terms in (37), we can obtain for any \theta\in\cap_{\ell=1}^{\ell(T+1)}\Delta(\ell,\zeta),

𝔼[𝐐(τ+t)2𝐐(t)2|t1,𝒟T]2τ2L+2αLlog(|𝒮|2|𝒜|/λ)+2Vτ𝔼[1τt=t+1τ+tft1,θ+1τi=1IQi(t)Vt=t+1τ+t(git1,θci)|t1,𝒟T]+2αλτLlog|𝒮|2|𝒜|+8τL2+τLVα+4τLλV+2VLτ.\displaystyle\begin{aligned} &\mathbb{E}[\|\mathbf{Q}(\tau+t)\|^{2}-\|\mathbf{Q}(t)\|^{2}|\mathcal{F}^{t-1},\mathcal{D}_{T}]\\ &\qquad\leq 2\tau^{2}L+2\alpha L\log(|\mathcal{S}|^{2}|\mathcal{A}|/\lambda)\\ &\qquad\quad+2V\tau\mathbb{E}\bigg{[}\frac{1}{\tau}\sum_{t^{\prime}=t+1}^{\tau+t}\langle f^{t^{\prime}-1},\theta\rangle+\frac{1}{\tau}\sum_{i=1}^{I}\frac{Q_{i}(t)}{V}\sum_{t^{\prime}=t+1}^{\tau+t}(\langle g_{i}^{t^{\prime}-1},\theta\rangle-c_{i})\bigg{|}\mathcal{F}^{t-1},\mathcal{D}_{T}\bigg{]}\\ &\qquad\quad+2\alpha\lambda\tau L\log|\mathcal{S}|^{2}|\mathcal{A}|+8\tau L^{2}+\frac{\tau LV}{\alpha}+4\tau L\lambda V+2VL\tau.\end{aligned} (38)

Thus, it remains to bound the term \mathbb{E}[\frac{1}{\tau}\sum_{t^{\prime}=t+1}^{\tau+t}\langle f^{t^{\prime}-1},\theta\rangle+\frac{1}{\tau}\sum_{i=1}^{I}\frac{Q_{i}(t)}{V}\sum_{t^{\prime}=t+1}^{\tau+t}(\langle g_{i}^{t^{\prime}-1},\theta\rangle-c_{i})|\mathcal{F}^{t-1},\mathcal{D}_{T}] so as to give an upper bound of the right-hand side of (38). Given that the event \mathcal{D}_{T} happens, we have \Delta\subseteq\cap_{\ell=1}^{\ell(T+1)}\Delta(\ell,\zeta)\neq\varnothing; since \theta is an arbitrary vector in the set \cap_{\ell=1}^{\ell(T+1)}\Delta(\ell,\zeta), we can upper bound (38) by bounding the term q^{(t,\tau)}\Big{(}\frac{\mathbf{Q}(t)}{V}\Big{)}, since

minθ=1(T)Δ(,ζ)𝔼[1τt=t+1τ+tft1,θ+1τi=1IQi(t)Vt=t+1τ+t(git1,θci)|t,𝒟T]\displaystyle\min_{\theta\in\cap_{\ell=1}^{\ell(T)}\Delta(\ell,\zeta)}\mathbb{E}\Big{[}\frac{1}{\tau}\sum_{t^{\prime}=t+1}^{\tau+t}\big{\langle}f^{t^{\prime}-1},\theta\big{\rangle}+\frac{1}{\tau}\sum_{i=1}^{I}\frac{Q_{i}(t)}{V}\sum_{t^{\prime}=t+1}^{\tau+t}(\langle g_{i}^{t^{\prime}-1},\theta\rangle-c_{i})\Big{|}\mathcal{F}^{t},\mathcal{D}_{T}\Big{]}
=minθ=1(T)Δ(,ζ)f(t,τ),θ+i=1IQi(t)V(gi,θci)\displaystyle\quad=\min_{\theta\in\cap_{\ell=1}^{\ell(T)}\Delta(\ell,\zeta)}\big{\langle}f^{(t,\tau)},\theta\big{\rangle}+\sum_{i=1}^{I}\frac{Q_{i}(t)}{V}(\langle g_{i},\theta\rangle-c_{i})
minθΔf(t,τ),θ+i=1IQi(t)V(gi,θci)=q(t,τ)(𝐐(t)V),\displaystyle\quad\leq\min_{\theta\in\Delta}\big{\langle}f^{(t,\tau)},\theta\big{\rangle}+\sum_{i=1}^{I}\frac{Q_{i}(t)}{V}(\langle g_{i},\theta\rangle-c_{i})=q^{(t,\tau)}\Big{(}\frac{\mathbf{Q}(t)}{V}\Big{)},

where the inequality is due to Δ=1(T+1)Δ(,ζ)\Delta\subseteq\cap_{\ell=1}^{\ell(T+1)}\Delta(\ell,\zeta) given 𝒟T\mathcal{D}_{T} happens and the last equality is obtained according to the definition of the dual function qq in Section 5. We can bound q(t,τ)(𝐐(t)V)q^{(t,\tau)}\big{(}\frac{\mathbf{Q}(t)}{V}\big{)} in the following way.

According to Assumption 5.1, let \eta^{*}_{t,\tau}\in\mathcal{V}_{t,\tau}^{*} denote one dual solution. We let \overline{\vartheta} be the maximum of all \vartheta and \overline{\sigma} be the minimum of all \sigma. Thus, when \mathrm{dist}(\frac{\mathbf{Q}(t)}{V},\mathcal{V}_{t,\tau}^{*})\geq\overline{\vartheta}, we have

q(t,τ)(𝐐(t)V)=\displaystyle q^{(t,\tau)}\Big{(}\frac{\mathbf{Q}(t)}{V}\Big{)}= q(t,τ)(𝐐(t)V)q(t,τ)(ηt,τ)+q(t,τ)(ηt,τ)\displaystyle q^{(t,\tau)}\Big{(}\frac{\mathbf{Q}(t)}{V}\Big{)}-q^{(t,\tau)}(\eta^{*}_{t,\tau})+q^{(t,\tau)}(\eta^{*}_{t,\tau})
\displaystyle\leq σ¯ηt,τ𝐐(t)V2+f(t,τ),θt,τ\displaystyle-\overline{\sigma}\Big{\|}\eta^{*}_{t,\tau}-\frac{\mathbf{Q}(t)}{V}\Big{\|}_{2}+\big{\langle}f^{(t,\tau)},\theta^{*}_{t,\tau}\big{\rangle}
\displaystyle\leq σ¯𝐐(t)V2+σ¯ηt,τ2+k=0L1s𝒮ka𝒜s𝒮k+1f(t,τ)(s,a,s)θt,τ(s,a,s)\displaystyle-\overline{\sigma}\Big{\|}\frac{\mathbf{Q}(t)}{V}\Big{\|}_{2}+\overline{\sigma}\|\eta^{*}_{t,\tau}\|_{2}+\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}_{k+1}}f^{(t,\tau)}(s,a,s^{\prime})\theta^{*}_{t,\tau}(s,a,s^{\prime})
\displaystyle\leq σ¯𝐐(t)V2+σ¯B+L,\displaystyle-\overline{\sigma}\Big{\|}\frac{\mathbf{Q}(t)}{V}\Big{\|}_{2}+\overline{\sigma}B+L,

where the first inequality is due to the weak error bound in Lemma 5.2 and weak duality with θt,τ\theta^{*}_{t,\tau} being one primal solution, the second inequality is by triangle inequality, and the third inequality is by Assumption 3.3 and Assumption 5.1. On the other hand, when dist(𝐐(t)V,𝒱t,τ)ϑ¯\mathrm{dist}(\frac{\mathbf{Q}(t)}{V},\mathcal{V}_{t,\tau}^{*})\leq\overline{\vartheta}, we have

q(t,τ)(𝐐(t)V)=\displaystyle q^{(t,\tau)}\Big{(}\frac{\mathbf{Q}(t)}{V}\Big{)}= minθΔf(t,τ),θ+i=1IQi(t)V(gi,θci)\displaystyle\min_{\theta\in\Delta}\big{\langle}f^{(t,\tau)},\theta\big{\rangle}+\sum_{i=1}^{I}\frac{Q_{i}(t)}{V}(\langle g_{i},\theta\rangle-c_{i})
=\displaystyle= minθΔf(t,τ),θ+i=1I[ηt,τ]i(gi,θci)+i=1I(Qi(t)V[ηt,τ]i)(gi,θci)\displaystyle\min_{\theta\in\Delta}\big{\langle}f^{(t,\tau)},\theta\big{\rangle}+\sum_{i=1}^{I}[\eta^{*}_{t,\tau}]_{i}(\langle g_{i},\theta\rangle-c_{i})+\sum_{i=1}^{I}\Big{(}\frac{Q_{i}(t)}{V}-[\eta^{*}_{t,\tau}]_{i}\Big{)}(\langle g_{i},\theta\rangle-c_{i})
\displaystyle\leq q(t,τ)(ηt,τ)+𝐐(t)Vηt,τ2𝐠(θ)𝐜2\displaystyle q^{(t,\tau)}(\eta^{*}_{t,\tau})+\Big{\|}\frac{\mathbf{Q}(t)}{V}-\eta^{*}_{t,\tau}\Big{\|}_{2}\|\mathbf{g}(\theta)-\mathbf{c}\|_{2}
\displaystyle\leq L+2ϑ¯L,\displaystyle L+2\overline{\vartheta}L,

where the first inequality is by the definition of q(t,τ)(ηt,τ)q^{(t,\tau)}(\eta^{*}_{t,\tau}) and Cauchy-Schwarz inequality, and the second inequality is due to weak duality and Assumption 3.3 such that

q(t,τ)(ηt,τ)f(t,τ),θt,τf(t,τ)θt,τ1L,\displaystyle q^{(t,\tau)}(\eta^{*}_{t,\tau})\leq\big{\langle}f^{(t,\tau)},\theta^{*}_{t,\tau}\big{\rangle}\leq\big{\|}f^{(t,\tau)}\big{\|}_{\infty}\|\theta^{*}_{t,\tau}\|_{1}\leq L,
𝐐(t)Vηt,τ2𝐠(θ)𝐜2ϑ¯i=1I|gi,θci|2ϑ¯i=1I(giθ1+|ci|)2ϑ¯L.\displaystyle\Big{\|}\frac{\mathbf{Q}(t)}{V}-\eta^{*}_{t,\tau}\Big{\|}_{2}\|\mathbf{g}(\theta)-\mathbf{c}\|_{2}\leq\overline{\vartheta}\sqrt{\sum_{i=1}^{I}\Big{|}\langle g_{i},\theta\rangle-c_{i}\Big{|}^{2}}\leq\overline{\vartheta}\sum_{i=1}^{I}(\|g_{i}\|_{\infty}\|\theta\|_{1}+|c_{i}|)\leq 2\overline{\vartheta}L.

Now we can combine the two cases as follows

q(t,τ)(𝐐(t)V)σ¯𝐐(t)V2+σ¯B+2L+2ϑ¯L+σ¯ϑ¯.\displaystyle q^{(t,\tau)}\Big{(}\frac{\mathbf{Q}(t)}{V}\Big{)}\leq-\overline{\sigma}\Big{\|}\frac{\mathbf{Q}(t)}{V}\Big{\|}_{2}+\overline{\sigma}B+2L+2\overline{\vartheta}L+\overline{\sigma}\overline{\vartheta}. (39)

The bound in (39) follows from the case analysis below:

  • (1)

    When dist(𝐐(t)V,𝒱t,τ)ϑ¯\mathrm{dist}\big{(}\frac{\mathbf{Q}(t)}{V},\mathcal{V}^{*}_{t,\tau}\big{)}\geq\overline{\vartheta}, we have

    q(t,τ)(𝐐(t)V)σ¯𝐐(t)V2+σ¯B+Lσ¯𝐐(t)V2+σ¯B+2L+2ϑ¯L+σ¯ϑ¯.q^{(t,\tau)}\big{(}\frac{\mathbf{Q}(t)}{V}\big{)}\leq-\overline{\sigma}\big{\|}\frac{\mathbf{Q}(t)}{V}\big{\|}_{2}+\overline{\sigma}B+L\leq-\overline{\sigma}\big{\|}\frac{\mathbf{Q}(t)}{V}\big{\|}_{2}+\overline{\sigma}B+2L+2\overline{\vartheta}L+\overline{\sigma}\overline{\vartheta}.
  • (2)

    When dist(𝐐(t)V,𝒱t,τ)<ϑ¯\mathrm{dist}\big{(}\frac{\mathbf{Q}(t)}{V},\mathcal{V}^{*}_{t,\tau}\big{)}<\overline{\vartheta}, we have

    q(t,τ)(𝐐(t)V)L+2ϑ¯Lσ¯𝐐(t)V2+σ¯B+2L+2ϑ¯L+σ¯ϑ¯,q^{(t,\tau)}\big{(}\frac{\mathbf{Q}(t)}{V}\big{)}\leq L+2\overline{\vartheta}L\leq-\overline{\sigma}\big{\|}\frac{\mathbf{Q}(t)}{V}\big{\|}_{2}+\overline{\sigma}B+2L+2\overline{\vartheta}L+\overline{\sigma}\overline{\vartheta},

    since σ¯𝐐(t)V2+σ¯ϑ¯+σ¯Bσ¯dist(𝐐(t)V,𝒱t,τ)+σ¯ϑ¯+σ¯Bσ¯B=σ¯[dist(𝐐(t)V,𝒱t,τ)+ϑ¯]0-\overline{\sigma}\big{\|}\frac{\mathbf{Q}(t)}{V}\big{\|}_{2}+\overline{\sigma}\overline{\vartheta}+\overline{\sigma}B\geq-\overline{\sigma}\cdot\mathrm{dist}\big{(}\frac{\mathbf{Q}(t)}{V},\mathcal{V}^{*}_{t,\tau}\big{)}+\overline{\sigma}\overline{\vartheta}+\overline{\sigma}B-\overline{\sigma}B=\overline{\sigma}\big{[}-\mathrm{dist}\big{(}\frac{\mathbf{Q}(t)}{V},\mathcal{V}^{*}_{t,\tau}\big{)}+\overline{\vartheta}\big{]}\geq 0.

Therefore, plugging (39) into (38), we can obtain that given the event \mathcal{D}_{T} happens, the following holds

𝔼[𝐐(τ+t)22𝐐(t)22|t,𝒟T]2τ2L+τCV,α,λ+2αLlog(|𝒮|2|𝒜|/λ)2τσ¯𝐐(t)2,\displaystyle\begin{aligned} &\mathbb{E}[\|\mathbf{Q}(\tau+t)\|_{2}^{2}-\|\mathbf{Q}(t)\|_{2}^{2}|\mathcal{F}^{t},\mathcal{D}_{T}]\\ &\qquad\leq 2\tau^{2}L+\tau C_{V,\alpha,\lambda}+2\alpha L\log(|\mathcal{S}|^{2}|\mathcal{A}|/\lambda)-2\tau\overline{\sigma}\|\mathbf{Q}(t)\|_{2},\end{aligned} (40)

where we define

CV,α,λ:=2(σ¯B+σ¯ϑ¯)V+(6+4ϑ¯)VL+VLα+4LλV+2αλLlog|𝒮|2|𝒜|+8L2.\displaystyle C_{V,\alpha,\lambda}:=2(\overline{\sigma}B+\overline{\sigma}~{}\overline{\vartheta})V+(6+4\overline{\vartheta})VL+\frac{VL}{\alpha}+4L\lambda V+2\alpha\lambda L\log|\mathcal{S}|^{2}|\mathcal{A}|+8L^{2}.

We can see that if \|\mathbf{Q}(t)\|_{2}\geq(2\tau L+C_{V,\alpha,\lambda})/\overline{\sigma}+2\alpha L\log(|\mathcal{S}|^{2}|\mathcal{A}|/\lambda)/(\overline{\sigma}\tau)+\tau\overline{\sigma}/2, then according to (40), we have

𝔼[𝐐(τ+t)2|t1,𝒟T]\displaystyle\mathbb{E}[\|\mathbf{Q}(\tau+t)\|^{2}|\mathcal{F}^{t-1},\mathcal{D}_{T}]\leq 𝐐(t)2τσ¯𝐐(t)2σ¯2τ22\displaystyle\|\mathbf{Q}(t)\|^{2}-\tau\overline{\sigma}\|\mathbf{Q}(t)\|_{2}-\frac{\overline{\sigma}^{2}\tau^{2}}{2}
\displaystyle\leq 𝐐(t)22τσ¯𝐐(t)2+σ¯2τ24\displaystyle\|\mathbf{Q}(t)\|_{2}^{2}-\tau\overline{\sigma}\|\mathbf{Q}(t)\|_{2}+\frac{\overline{\sigma}^{2}\tau^{2}}{4}
\displaystyle\leq (𝐐(t)2τσ¯2)2.\displaystyle\Big{(}\|\mathbf{Q}(t)\|_{2}-\frac{\tau\overline{\sigma}}{2}\Big{)}^{2}.

Due to 𝐐(t)2τσ¯2\|\mathbf{Q}(t)\|_{2}\geq\frac{\tau\overline{\sigma}}{2} and by Jensen’s inequality, we have

𝔼[𝐐(τ+t)2|t1,𝒟T]𝔼[𝐐(τ+t)22|t1,𝒟T]𝐐(t)2τσ¯2.\displaystyle\mathbb{E}[\|\mathbf{Q}(\tau+t)\|_{2}|\mathcal{F}^{t-1},\mathcal{D}_{T}]\leq\sqrt{\mathbb{E}[\|\mathbf{Q}(\tau+t)\|_{2}^{2}|\mathcal{F}^{t-1},\mathcal{D}_{T}]}\leq\|\mathbf{Q}(t)\|_{2}-\frac{\tau\overline{\sigma}}{2}. (41)

Then we can compute the expectation \mathbb{E}[\|\mathbf{Q}(\tau+t)\|_{2}-\|\mathbf{Q}(t)\|_{2}|\mathcal{F}^{t-1}] according to the law of total expectation. With (35) and (41), we can obtain that

𝔼[𝐐(τ+t)2𝐐(t)2|t1]\displaystyle\mathbb{E}[\|\mathbf{Q}(\tau+t)\|_{2}-\|\mathbf{Q}(t)\|_{2}|\mathcal{F}^{t-1}]
=P(𝒟T)𝔼[𝐐(τ+t)2𝐐(t)2|t1,𝒟T]+P(𝒟¯T)𝔼[𝐐(τ+t)2𝐐(t)2|t1,𝒟¯T]\displaystyle\quad=P(\mathcal{D}_{T})\mathbb{E}[\|\mathbf{Q}(\tau+t)\|_{2}-\|\mathbf{Q}(t)\|_{2}|\mathcal{F}^{t-1},\mathcal{D}_{T}]+P(\overline{\mathcal{D}}_{T})\mathbb{E}[\|\mathbf{Q}(\tau+t)\|_{2}-\|\mathbf{Q}(t)\|_{2}|\mathcal{F}^{t-1},\overline{\mathcal{D}}_{T}]
τσ¯2(1ζ)+2ζτL=τ[σ¯2ζ(σ¯2+2L)]σ¯4τ,\displaystyle\quad\leq-\frac{\tau\overline{\sigma}}{2}(1-\zeta)+2\zeta\tau L=-\tau\Big{[}\frac{\overline{\sigma}}{2}-\zeta\Big{(}\frac{\overline{\sigma}}{2}+2L\Big{)}\Big{]}\leq-\frac{\overline{\sigma}}{4}\tau,

where we let σ¯/4ζ(σ¯/2+2L)\overline{\sigma}/4\geq\zeta(\overline{\sigma}/2+2L).

Summarizing the above results, we know that if σ¯/4ζ(σ¯/2+2L)\overline{\sigma}/4\geq\zeta(\overline{\sigma}/2+2L), then

|𝐐(t+1)2𝐐(t)2|\displaystyle\left|\|\mathbf{Q}(t+1)\|_{2}-\|\mathbf{Q}(t)\|_{2}\right| 2L,\displaystyle\leq 2L,
𝔼[𝐐(t+τ)2𝐐(t)2|t]\displaystyle\mathbb{E}[\|\mathbf{Q}(t+\tau)\|_{2}-\|\mathbf{Q}(t)\|_{2}|\mathcal{F}^{t}] {2τL, if 𝐐(t)2<ψσ¯4τ, if 𝐐(t)2ψ,\displaystyle\leq\left\{\begin{matrix}2\tau L,\qquad\text{ if }\|\mathbf{Q}(t)\|_{2}<\psi\\ -\frac{\overline{\sigma}}{4}\tau,~{}\quad\ \ \text{ if }\|\mathbf{Q}(t)\|_{2}\geq\psi\end{matrix}\right.,

where we let

ψ=2τL+CV,α,λσ¯+2αLlog(|𝒮|2|𝒜|/λ)σ¯τ+τσ¯2,\displaystyle\psi=\frac{2\tau L+C_{V,\alpha,\lambda}}{\overline{\sigma}}+\frac{2\alpha L\log(|\mathcal{S}|^{2}|\mathcal{A}|/\lambda)}{\overline{\sigma}\tau}+\frac{\tau\overline{\sigma}}{2},
CV,α,λ=2(σ¯B+σ¯ϑ¯)V+(6+4ϑ¯)VL+VLα+4LλV+2αλLlog|𝒮|2|𝒜|+8L2.\displaystyle C_{V,\alpha,\lambda}=2(\overline{\sigma}B+\overline{\sigma}~{}\overline{\vartheta})V+(6+4\overline{\vartheta})VL+\frac{VL}{\alpha}+4L\lambda V+2\alpha\lambda L\log|\mathcal{S}|^{2}|\mathcal{A}|+8L^{2}.

Directly by Lemma C.6, for each fixed t\in[T+1-\tau], the following inequality holds with probability at least 1-\delta,

𝐐(t)2\displaystyle\|\mathbf{Q}(t)\|_{2}\leq ψ+τ512L2σ¯log(1+128L2σ¯2eσ¯/(32L))+τ64L2σ¯log1δ.\displaystyle\psi+\tau\frac{512L^{2}}{\overline{\sigma}}\log\bigg{(}1+\frac{128L^{2}}{\overline{\sigma}^{2}}e^{\overline{\sigma}/(32L)}\bigg{)}+\tau\frac{64L^{2}}{\overline{\sigma}}\log\frac{1}{\delta}. (42)

Further applying a union bound, we have that with probability at least 1-(T+1-\tau)\delta\geq 1-T\delta, the inequality (42) holds simultaneously for all t\in[T+1-\tau]. Note that (42) only covers t\in[T+1-\tau]. For T+2-\tau\leq t\leq T+1, on the event that (42) holds for all t\in[T+1-\tau], combining (42) and (34), we have

\displaystyle\|\mathbf{Q}(t)\|_{2}\leq\psi+\tau\frac{512L^{2}}{\overline{\sigma}}\log\bigg{(}1+\frac{128L^{2}}{\overline{\sigma}^{2}}e^{\overline{\sigma}/(32L)}\bigg{)}+\tau\frac{64L^{2}}{\overline{\sigma}}\log\frac{1}{\delta}+2\tau L. (43)

Thus, with probability at least 1-T\delta, for any t satisfying 1\leq t\leq T+1, the inequality (43) holds. We can understand the upper bound of the term \log\big{(}1+\frac{128L^{2}}{\overline{\sigma}^{2}}e^{\overline{\sigma}/(32L)}\big{)} in the following way: (1) if \frac{128L^{2}}{\overline{\sigma}^{2}}e^{\overline{\sigma}/(32L)}\geq 1, then this term is bounded by \log\big{(}\frac{256L^{2}}{\overline{\sigma}^{2}}e^{\overline{\sigma}/(32L)}\big{)}=\frac{\overline{\sigma}}{32L}+\log\frac{256L^{2}}{\overline{\sigma}^{2}}; (2) if \frac{128L^{2}}{\overline{\sigma}^{2}}e^{\overline{\sigma}/(32L)}<1, then the term is bounded by \log 2. Thus, we have

log(1+128L2σ¯2eσ¯/(32L))log2+σ¯32L+log256L2σ¯2.\displaystyle\log\bigg{(}1+\frac{128L^{2}}{\overline{\sigma}^{2}}e^{\overline{\sigma}/(32L)}\bigg{)}\leq\log 2+\frac{\overline{\sigma}}{32L}+\log\frac{256L^{2}}{\overline{\sigma}^{2}}.

This discussion shows that the \log term in (42) does not introduce extra dependency on L beyond a \log L factor. This completes our proof. ∎
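For intuition about the drift conditions just established (a toy simulation, not part of the proof), the sketch below generates a nonnegative scalar process whose increments are bounded by 2L, whose conditional drift is at most 2L below a threshold \psi, and at most -\overline{\sigma}/4 above \psi (taking \tau=1). Its sample paths stay close to \psi with high probability, which is the qualitative content of the bound (42) obtained from Lemma C.6. The constants L, \overline{\sigma}, and \psi below are assumed toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
L, sigma_bar, psi = 2.0, 0.5, 30.0     # assumed toy values
T, n_runs, delta = 2000, 500, 0.01

peaks = np.empty(n_runs)
for run in range(n_runs):
    q, peak = 0.0, 0.0
    for _ in range(T):
        if q < psi:
            # below the threshold: only the bounded-increment condition, drift at most +2L
            step = rng.uniform(0.0, 2 * L)
        else:
            # above the threshold: mean drift -sigma_bar/4, increments still bounded by 2L
            step = rng.uniform(-sigma_bar / 4 - L, -sigma_bar / 4 + L)
        q += step
        peak = max(peak, q)
    peaks[run] = peak

# the running maximum of each run stays close to psi, illustrating the
# high-probability boundedness captured by (42)
print("psi:", psi)
print("median peak:", np.median(peaks))
print("(1 - delta)-quantile of peak:", np.quantile(peaks, 1 - delta))
```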

C.4 Proof of Lemma 6.5

Lemma C.7 (Lemma 9 of Yu et al. (2017)).

Let \{Z(t),t\geq 0\} be a supermartingale adapted to a filtration \{\mathcal{U}^{t},t\geq 0\} with Z(0)=0 and \mathcal{U}^{0}=\{\varnothing,\Omega\}, i.e., \mathbb{E}[Z(t+1)|\mathcal{U}^{t}]\leq Z(t), \forall t\geq 0. Suppose there exists a constant \varsigma>0 such that \{|Z(t+1)-Z(t)|>\varsigma\}\subset\{Y(t)>0\}, where Y(t) is a process adapted to \mathcal{U}^{t} for all t\geq 0. Then, for all z>0, we have

Pr(Z(t)z)ez2/(2tς2)+τ=0t1Pr(Y(τ)>0),t1.\displaystyle\Pr(Z(t)\geq z)\leq e^{-z^{2}/(2t\varsigma^{2})}+\sum_{\tau=0}^{t-1}\Pr(Y(\tau)>0),\forall t\geq 1.
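As a quick sanity check of how Lemma C.7 behaves (a toy illustration, not part of the proof), the sketch below builds a martingale Z(t) whose increments are bounded by \varsigma almost surely, so that the events \{|Z(t+1)-Z(t)|>\varsigma\} are empty and the second term of the lemma can be taken to vanish; the Monte Carlo estimate of the tail probability then falls below the Azuma-type term e^{-z^{2}/(2t\varsigma^{2})}. All constants below are assumed toy values.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n_runs, varsigma = 400, 20000, 1.0   # assumed toy values

# martingale Z(t) with increments uniform on [-varsigma, varsigma]: mean zero and
# |Z(t+1) - Z(t)| <= varsigma almost surely, so sum_tau Pr(Y(tau) > 0) can be taken as 0
increments = rng.uniform(-varsigma, varsigma, size=(n_runs, T))
Z_T = increments.sum(axis=1)

z = 3.0 * varsigma * np.sqrt(T)         # a test threshold of order varsigma * sqrt(T)
empirical_tail = np.mean(Z_T >= z)
lemma_bound = np.exp(-z ** 2 / (2 * T * varsigma ** 2))
print(f"empirical Pr(Z(T) >= z) = {empirical_tail:.5f} <= lemma bound = {lemma_bound:.5f}")
```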

We are now in a position to give the proof of Lemma 6.5.

Proof of Lemma 6.5.

Now we compute the upper bound of the term \sum_{t=1}^{T}\sum_{i=1}^{I}Q_{i}(t)(\langle g_{i}^{t},\overline{\theta}^{*}\rangle-c_{i}). Note that Z(t):=\sum_{\tau=1}^{t}\sum_{i=1}^{I}Q_{i}(\tau)(\langle g_{i}^{\tau},\overline{\theta}^{*}\rangle-c_{i}) is a supermartingale, which is verified by

𝔼[Z(t)|t1]=\displaystyle\mathbb{E}[Z(t)|\mathcal{F}^{t-1}]= 𝔼[τ=1ti=1IQi(τ)(giτ,θ¯ci)|t1]\displaystyle\mathbb{E}\Big{[}\sum_{\tau=1}^{t}\sum_{i=1}^{I}Q_{i}(\tau)(\langle g_{i}^{\tau},\overline{\theta}^{*}\rangle-c_{i})\Big{|}\mathcal{F}^{t-1}\Big{]}
=\displaystyle= i=1I𝔼[Qi(t)|t1](𝔼[git|t1],θ¯ci)+τ=1t1i=1IQi(τ)(giτ,θ¯ci)\displaystyle\sum_{i=1}^{I}\mathbb{E}[Q_{i}(t)|\mathcal{F}^{t-1}](\langle\mathbb{E}[g_{i}^{t}|\mathcal{F}^{t-1}],\overline{\theta}^{*}\rangle-c_{i})+\sum_{\tau=1}^{t-1}\sum_{i=1}^{I}Q_{i}(\tau)(\langle g_{i}^{\tau},\overline{\theta}^{*}\rangle-c_{i})
\displaystyle\leq\sum_{\tau=1}^{t-1}\sum_{i=1}^{I}Q_{i}(\tau)(\langle g_{i}^{\tau},\overline{\theta}^{*}\rangle-c_{i})=Z(t-1),

where Q_{i}(t) and g_{i}^{t} are independent, with Q_{i}(t)\geq 0 and \langle\mathbb{E}[g_{i}^{t}|\mathcal{F}^{t-1}],\overline{\theta}^{*}\rangle\leq c_{i}. On the other hand, the random process Z(t) has bounded increments, since

\displaystyle|Z(t+1)-Z(t)|=\bigg{|}\sum_{i=1}^{I}Q_{i}(t+1)(\langle g_{i}^{t+1},\overline{\theta}^{*}\rangle-c_{i})\bigg{|}
\displaystyle\leq 𝐐(t+1)2i=1I|git+1,θ¯ci|2\displaystyle\|\mathbf{Q}(t+1)\|_{2}\sqrt{\sum_{i=1}^{I}\big{|}\langle g_{i}^{t+1},\overline{\theta}^{*}\rangle-c_{i}\big{|}^{2}}
\displaystyle\leq 𝐐(t+1)2i=1I(git+1θ¯1+|ci|)\displaystyle\|\mathbf{Q}(t+1)\|_{2}\sum_{i=1}^{I}(\|g_{i}^{t+1}\|_{\infty}\|\overline{\theta}^{*}\|_{1}+|c_{i}|)
\displaystyle\leq 2L𝐐(t+1)2,\displaystyle 2L\|\mathbf{Q}(t+1)\|_{2},

where the first inequality is by Cauchy-Schwarz inequality, and the last inequality is by Assumption 3.3. This also implies that for an arbitrary ς\varsigma, we have {|Z(t+1)Z(t)|>ς}{Y(t):=𝐐(t+1)2ς/(2L)>0}\{|Z(t+1)-Z(t)|>\varsigma\}\subseteq\{Y(t):=\|\mathbf{Q}(t+1)\|_{2}-\varsigma/(2L)>0\} since |Z(t+1)Z(t)|>ς|Z(t+1)-Z(t)|>\varsigma implies 2L𝐐(t+1)2>ς2L\|\mathbf{Q}(t+1)\|_{2}>\varsigma according to the above inequality. Thus, by Lemma C.7, we have

Pr(t=1Ti=1IQi(t)(git,θ¯ci)z)ez2/(2Tς2)+t=0T1Pr(𝐐(t+1)2>ς2L)=ez2/(2Tς2)+t=1TPr(𝐐(t)2>ς2L),\displaystyle\begin{aligned} \Pr\bigg{(}\sum_{t=1}^{T}\sum_{i=1}^{I}Q_{i}(t)(\langle g_{i}^{t},\overline{\theta}^{*}\rangle-c_{i})\geq z\bigg{)}&\leq e^{-z^{2}/(2T\varsigma^{2})}+\sum_{t=0}^{T-1}\Pr\bigg{(}\|\mathbf{Q}(t+1)\|_{2}>\frac{\varsigma}{2L}\bigg{)}\\ &=e^{-z^{2}/(2T\varsigma^{2})}+\sum_{t=1}^{T}\Pr\bigg{(}\|\mathbf{Q}(t)\|_{2}>\frac{\varsigma}{2L}\bigg{)},\end{aligned} (44)

where we can see that bounding 𝐐(t)2\|\mathbf{Q}(t)\|_{2} is the key to obtaining the bound of t=1Ti=1IQi(t)(git,θ¯ci)\sum_{t=1}^{T}\sum_{i=1}^{I}Q_{i}(t)(\langle g_{i}^{t},\overline{\theta}^{*}\rangle-c_{i}).

Next, we will show the upper bound of the term 𝐐(t)2\|\mathbf{Q}(t)\|_{2}. According to the proof of Lemma 6.3, if σ¯/4ζ(σ¯/2+2L)\overline{\sigma}/4\geq\zeta(\overline{\sigma}/2+2L), setting

ψ=2τL+CV,α,λσ¯+2αLlog(|𝒮|2|𝒜|/λ)σ¯τ+τσ¯2,\displaystyle\psi=\frac{2\tau L+C_{V,\alpha,\lambda}}{\overline{\sigma}}+\frac{2\alpha L\log(|\mathcal{S}|^{2}|\mathcal{A}|/\lambda)}{\overline{\sigma}\tau}+\frac{\tau\overline{\sigma}}{2},
CV,α,λ:=2V(σ¯B+3L+2ϑ¯L+σ¯ϑ¯+L2α+2Lλ+αλLlog|𝒮|2|𝒜|+4L2V),\displaystyle C_{V,\alpha,\lambda}:=2V\bigg{(}\overline{\sigma}B+3L+2\overline{\vartheta}L+\overline{\sigma}\overline{\vartheta}+\frac{L}{2\alpha}+2L\lambda+\frac{\alpha\lambda L\log|\mathcal{S}|^{2}|\mathcal{A}|+4L^{2}}{V}\bigg{)},

we have that with probability at least 1-\delta, for each fixed t\in[T+1-\tau], the following inequality holds

𝐐(t)2ψ+τ512L2σ¯log[1+128L2σ¯2eσ¯/(32L)]+τ64L2σ¯log1δ+2τL.\displaystyle\|\mathbf{Q}(t)\|_{2}\leq\psi+\tau\frac{512L^{2}}{\overline{\sigma}}\log[1+\frac{128L^{2}}{\overline{\sigma}^{2}}e^{\overline{\sigma}/(32L)}]+\tau\frac{64L^{2}}{\overline{\sigma}}\log\frac{1}{\delta}+2\tau L.

Thus, combining (34) and the above inequality at t=T+1-\tau, with probability at least 1-\delta, the above inequality also holds for each t satisfying T+2-\tau\leq t\leq T+1. The above inequality is equivalent to

Pr(𝐐(t)2>ψ+τ512L2σ¯log[1+128L2σ¯2eσ¯/(32L)]+τ64L2σ¯log1δ+2τL)δ.\displaystyle\Pr\left(\|\mathbf{Q}(t)\|_{2}>\psi+\tau\frac{512L^{2}}{\overline{\sigma}}\log[1+\frac{128L^{2}}{\overline{\sigma}^{2}}e^{\overline{\sigma}/(32L)}]+\tau\frac{64L^{2}}{\overline{\sigma}}\log\frac{1}{\delta}+2\tau L\right)\leq\delta.

Setting \varsigma=2L\psi+\tau\frac{1024L^{3}}{\overline{\sigma}}\log\big{[}1+\frac{128L^{2}}{\overline{\sigma}^{2}}e^{\overline{\sigma}/(32L)}\big{]}+\tau\frac{128L^{3}}{\overline{\sigma}}\log\frac{1}{\delta}+4\tau L^{2} and z=\sqrt{2T\varsigma^{2}\log\frac{1}{T\delta}} in (44), the following inequality holds with probability at least 1-2T\delta:

t=1Ti=1IQi(t)(git,θ¯ci)\displaystyle\sum_{t=1}^{T}\sum_{i=1}^{I}Q_{i}(t)(\langle g_{i}^{t},\overline{\theta}^{*}\rangle-c_{i})
\displaystyle\qquad\leq\bigg{(}2L\psi+\tau\frac{1024L^{3}}{\overline{\sigma}}\log\Big{[}1+\frac{128L^{2}}{\overline{\sigma}^{2}}e^{\overline{\sigma}/(32L)}\Big{]}+\tau\frac{128L^{3}}{\overline{\sigma}}\log\frac{1}{\delta}+4\tau L^{2}\bigg{)}\sqrt{2T\log\frac{1}{T\delta}},

which completes the proof. ∎

Appendix D Proofs of the Lemmas in Section 6.2

D.1 Proof of Lemma 6.6

Proof of Lemma 6.6.

We start our proof with the updating rule of 𝐐()\mathbf{Q}(\cdot) as follows

Qi(t)=\displaystyle Q_{i}(t)= max{Qi(t1)+git1,θtci,0}Qi(t1)+git1,θtci\displaystyle\max\{Q_{i}(t-1)+\langle g_{i}^{t-1},\theta^{t}\rangle-c_{i},0\}\geq Q_{i}(t-1)+\langle g_{i}^{t-1},\theta^{t}\rangle-c_{i}
\displaystyle\geq Qi(t1)+git1,θt1ci+git1,θtθt1.\displaystyle Q_{i}(t-1)+\langle g_{i}^{t-1},\theta^{t-1}\rangle-c_{i}+\langle g_{i}^{t-1},\theta^{t}-\theta^{t-1}\rangle.

Rearranging the terms in the above inequality further leads to

git1,θt1ci\displaystyle\langle g_{i}^{t-1},\theta^{t-1}\rangle-c_{i}\leq Qi(t)Qi(t1)git1,θtθt1.\displaystyle Q_{i}(t)-Q_{i}(t-1)-\langle g_{i}^{t-1},\theta^{t}-\theta^{t-1}\rangle.

Thus, taking summation on both sides of the above inequality over t from 2 to T+1 and using Q_{i}(1)\geq 0 leads to

t=1T(git,θtci)Qi(T+1)t=1Tgit,θt+1θtQi(T+1)+t=1Tgitθt+1θt1,\displaystyle\sum_{t=1}^{T}\left(\langle g_{i}^{t},\theta^{t}\rangle-c_{i}\right)\leq Q_{i}(T+1)-\sum_{t=1}^{T}\langle g_{i}^{t},\theta^{t+1}-\theta^{t}\rangle\leq Q_{i}(T+1)+\sum_{t=1}^{T}\|g_{i}^{t}\|_{\infty}\|\theta^{t+1}-\theta^{t}\|_{1},

where the second inequality is due to Hölder’s inequality. Note that the right-hand side of the above inequality is no less than 0 since Qi(t)=max{Qi(t1)+git1,θtci,0}0Q_{i}(t)=\max\{Q_{i}(t-1)+\langle g_{i}^{t-1},\theta^{t}\rangle-c_{i},0\}\geq 0. Thus, we have

[t=1T(git,θtci)]+Qi(T+1)+t=1Tgitθt+1θt1,\displaystyle\Bigg{[}\sum_{t=1}^{T}\left(\langle g_{i}^{t},\theta^{t}\rangle-c_{i}\right)\Bigg{]}_{+}\leq Q_{i}(T+1)+\sum_{t=1}^{T}\|g_{i}^{t}\|_{\infty}\|\theta^{t+1}-\theta^{t}\|_{1},

where []+[~{}\cdot~{}]_{+} is an entry-wise application of the operation max{,0}\max\{\cdot,0\} for any vector.

Defining 𝐠t(θt):=[g1t,θt,,gIt,θt]\mathbf{g}^{t}(\theta^{t}):=[\langle g_{1}^{t},\theta^{t}\rangle,\cdots,\langle g_{I}^{t},\theta^{t}\rangle]^{\top} and 𝐜:=[c1,,cI]\mathbf{c}:=[c_{1},\cdots,c_{I}]^{\top}, we obtain

[t=1T(𝐠t(θt)𝐜)]+2\displaystyle\Bigg{\|}\Bigg{[}\sum_{t=1}^{T}\left(\mathbf{g}^{t}(\theta^{t})-\mathbf{c}\right)\Bigg{]}_{+}\Bigg{\|}_{2}\leq 𝐐(T+1)2+t=1Ti=1Igit2θt+1θt1\displaystyle\|\mathbf{Q}(T+1)\|_{2}+\sum_{t=1}^{T}\sqrt{\sum_{i=1}^{I}\|g_{i}^{t}\|^{2}_{\infty}}\|\theta^{t+1}-\theta^{t}\|_{1}
\displaystyle\leq 𝐐(T+1)2+t=1Ti=1Igitθt+1θt1\displaystyle\|\mathbf{Q}(T+1)\|_{2}+\sum_{t=1}^{T}\sum_{i=1}^{I}\|g_{i}^{t}\|_{\infty}\|\theta^{t+1}-\theta^{t}\|_{1}
\displaystyle\leq 𝐐(T+1)2+t=1Tθt+1θt1,\displaystyle\|\mathbf{Q}(T+1)\|_{2}+\sum_{t=1}^{T}\|\theta^{t+1}-\theta^{t}\|_{1},

where the third inequality is due to Assumption 3.3. This completes the proof. ∎
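To illustrate the inequality just proved, the following toy Python check (assuming a single constraint, random nonnegative vectors \theta^{t}, \|g^{t}\|_{\infty}\leq 1, and Q(1)=0, all toy choices) runs the virtual-queue recursion and verifies that [\sum_{t=1}^{T}(\langle g^{t},\theta^{t}\rangle-c)]_{+}\leq Q(T+1)+\sum_{t=1}^{T}\|g^{t}\|_{\infty}\|\theta^{t+1}-\theta^{t}\|_{1}.

```python
import numpy as np

rng = np.random.default_rng(3)
T, d, c = 100, 8, 0.3                        # assumed toy horizon, dimension, and budget

g = rng.uniform(0.0, 1.0, size=(T + 1, d))       # constraint vectors with ||g^t||_inf <= 1
theta = rng.uniform(0.0, 1.0, size=(T + 2, d))   # arbitrary nonnegative iterates theta^1..theta^{T+1}

Q = 0.0                                          # Q(1) = 0
for t in range(1, T + 1):
    # virtual-queue recursion Q(t+1) = max{Q(t) + <g^t, theta^{t+1}> - c, 0}
    Q = max(Q + g[t] @ theta[t + 1] - c, 0.0)

violation = max(sum(g[t] @ theta[t] - c for t in range(1, T + 1)), 0.0)
drift = sum(np.abs(g[t]).max() * np.abs(theta[t + 1] - theta[t]).sum() for t in range(1, T + 1))
print("clipped cumulative violation:", round(float(violation), 3))
print("queue + drift upper bound   :", round(float(Q + drift), 3))
assert violation <= Q + drift + 1e-9
```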

D.2 Proof of Lemma 6.7

Lemma D.1 (Proposition 18 of Jaksch et al. (2010)).

The number of epochs up to episode TT with T|𝒮||𝒜|T\geq|\mathcal{S}||\mathcal{A}| is upper bounded by

(T)|𝒮||𝒜|log2(8T|𝒮||𝒜|)T|𝒮||𝒜|log2(8T|𝒮||𝒜|),\displaystyle\ell(T)\leq|\mathcal{S}||\mathcal{A}|\log_{2}\left(\frac{8T}{|\mathcal{S}||\mathcal{A}|}\right)\leq\sqrt{T|\mathcal{S}||\mathcal{A}|}\log_{2}\left(\frac{8T}{|\mathcal{S}||\mathcal{A}|}\right),

where \ell(\cdot) maps an episode to the epoch containing it.
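For a rough sense of the scale of this bound, one can evaluate both expressions numerically under assumed toy values of |\mathcal{S}|, |\mathcal{A}|, and T:

```python
import numpy as np

S, A, T = 10, 5, 10_000     # assumed toy values with T >= |S||A|
log_term = np.log2(8 * T / (S * A))
print("|S||A| * log2(8T/(|S||A|))        =", round(S * A * log_term, 1))               # about 532
print("sqrt(T|S||A|) * log2(8T/(|S||A|)) =", round(np.sqrt(T * S * A) * log_term, 1))  # about 7526
```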

We are ready to give the proof of Lemma 6.7.

Proof of Lemma 6.7.

We need to discuss the upper bound of the term θt+1θt1\|\theta^{t+1}-\theta^{t}\|_{1} in the following two cases:

  • (1)

    (t+1)=(t)\ell(t+1)=\ell(t), i.e., episodes t+1t+1 and tt are in the same epoch;

  • (2)

    (t+1)>(t)\ell(t+1)>\ell(t), i.e., episodes t+1t+1 and tt are in two different epochs.

For the first case where (t+1)=(t)\ell(t+1)=\ell(t), according to Lemma C.3, letting 𝐱opt=θt\mathbf{x}^{opt}=\theta^{t}, 𝐲=θ~t1\mathbf{y}=\widetilde{\theta}^{t-1}, 𝐳=θt1\mathbf{z}=\theta^{t-1} and h(θ)=Vft1+i=1IQi(t1)git1,θh(\theta)=\big{\langle}Vf^{t-1}+\sum_{i=1}^{I}Q_{i}(t-1)g_{i}^{t-1},\theta\big{\rangle}, with t2t\geq 2 and (t)=(t1)\ell(t)=\ell(t-1), we have

Vft1+i=1IQi(t1)git1,θt+αD(θt,θ~t1)\displaystyle\Big{\langle}Vf^{t-1}+\sum_{i=1}^{I}Q_{i}(t-1)g_{i}^{t-1},\theta^{t}\Big{\rangle}+\alpha D(\theta^{t},\widetilde{\theta}^{t-1})
Vft1+i=1IQi(t1)git1,θ~t1+αD(θt1,θ~t1)αD(θt1,θt).\displaystyle\qquad\leq\Big{\langle}Vf^{t-1}+\sum_{i=1}^{I}Q_{i}(t-1)g_{i}^{t-1},\widetilde{\theta}^{t-1}\Big{\rangle}+\alpha D(\theta^{t-1},\widetilde{\theta}^{t-1})-\alpha D(\theta^{t-1},\theta^{t}).

Rearranging the terms and dropping the last term (due to D(θt1,θt)0D(\theta^{t-1},\theta^{t})\geq 0) yield

αD(θt,θ~t1)\displaystyle\alpha D(\theta^{t},\widetilde{\theta}^{t-1})\leq Vft1+i=1IQi(t1)git1,θt1θt+αD(θt1,θ~t1)\displaystyle\Big{\langle}Vf^{t-1}+\sum_{i=1}^{I}Q_{i}(t-1)g_{i}^{t-1},\theta^{t-1}-\theta^{t}\Big{\rangle}+\alpha D(\theta^{t-1},\widetilde{\theta}^{t-1})
\displaystyle\leq (Vft1+i=1IQi(t1)git1)θt1θt1+αD(θt1,θ~t1)\displaystyle\Big{(}V\|f^{t-1}\|_{\infty}+\sum_{i=1}^{I}Q_{i}(t-1)\|g_{i}^{t-1}\|_{\infty}\Big{)}\|\theta^{t-1}-\theta^{t}\|_{1}+\alpha D(\theta^{t-1},\widetilde{\theta}^{t-1})
\displaystyle\leq (V+𝐐(t1)2i=1Igit12)θt1θt1+αD(θt1,θ~t1)\displaystyle\left(V+\|\mathbf{Q}(t-1)\|_{2}\sqrt{\sum_{i=1}^{I}\|g_{i}^{t-1}\|^{2}_{\infty}}\right)\|\theta^{t-1}-\theta^{t}\|_{1}+\alpha D(\theta^{t-1},\widetilde{\theta}^{t-1})
\displaystyle\leq (V+𝐐(t1)2)θt1θt1+αλLlog|𝒮|2|𝒜|,\displaystyle(V+\|\mathbf{Q}(t-1)\|_{2})\|\theta^{t-1}-\theta^{t}\|_{1}+\alpha\lambda L\log|\mathcal{S}|^{2}|\mathcal{A}|,

where the second inequality is by Hölder’s inequality and triangle inequality, the third inequality is by Cauchy–Schwarz inequality and Assumption 3.3, and the last inequality is due to Assumption 3.3 and the first inequality in Lemma C.5 with setting θ=θ=θt1\theta=\theta^{\prime}=\theta^{t-1} and θ~=θ~t1\widetilde{\theta}^{\prime}=\widetilde{\theta}^{t-1}. Note that by Lemma C.4, there is

D(θt,θ~t1)12Lθtθ~t112.\displaystyle D(\theta^{t},\widetilde{\theta}^{t-1})\geq\frac{1}{2L}\|\theta^{t}-\widetilde{\theta}^{t-1}\|_{1}^{2}.

Thus, combining the previous two inequalities, we obtain

θtθ~t1122LV+2L𝐐(t1)2αθtθt11+2λL2log|𝒮|2|𝒜|,\displaystyle\|\theta^{t}-\widetilde{\theta}^{t-1}\|_{1}^{2}\leq\frac{2LV+2L\|\mathbf{Q}(t-1)\|_{2}}{\alpha}\|\theta^{t}-\theta^{t-1}\|_{1}+2\lambda L^{2}\log|\mathcal{S}|^{2}|\mathcal{A}|,

which further leads to

θtθ~t112LV+2L𝐐(t1)2αθt1θt1+2λL2log|𝒮|2|𝒜|.\displaystyle\|\theta^{t}-\widetilde{\theta}^{t-1}\|_{1}\leq\sqrt{\frac{2LV+2L\|\mathbf{Q}(t-1)\|_{2}}{\alpha}\|\theta^{t-1}-\theta^{t}\|_{1}}+\sqrt{2\lambda L^{2}\log|\mathcal{S}|^{2}|\mathcal{A}|}.

Since there is

θtθ~t11\displaystyle\|\theta^{t}-\widetilde{\theta}^{t-1}\|_{1} =k=0L1θkt(1λ)θkt1λ1|𝒮|2|𝒜|𝟏1\displaystyle=\sum_{k=0}^{L-1}\left\|\theta_{k}^{t}-(1-\lambda)\theta_{k}^{t-1}-\lambda\frac{1}{|\mathcal{S}|^{2}|\mathcal{A}|}\mathbf{1}\right\|_{1}
(1λ)θtθt11λL,\displaystyle\geq(1-\lambda)\|\theta^{t}-\theta^{t-1}\|_{1}-\lambda L,

where θk:=[θ(s,a,s)]s𝒮k,a𝒜,s𝒮k+1\theta_{k}:=[\theta(s,a,s^{\prime})]_{s\in\mathcal{S}_{k},a\in\mathcal{A},s^{\prime}\in\mathcal{S}_{k+1}}, combining it with the last inequality, we further have

θtθt11\displaystyle\|\theta^{t}-\theta^{t-1}\|_{1} 2LV+2L𝐐(t1)2α(1λ)2θt1θt1+2λL2log|𝒮|2|𝒜|(1λ)+λL1λ\displaystyle\leq\sqrt{\frac{2LV+2L\|\mathbf{Q}(t-1)\|_{2}}{\alpha(1-\lambda)^{2}}\|\theta^{t-1}-\theta^{t}\|_{1}}+\frac{\sqrt{2\lambda L^{2}\log|\mathcal{S}|^{2}|\mathcal{A}|}}{(1-\lambda)}+\frac{\lambda L}{1-\lambda}
2LV+2L𝐐(t1)22(1λ)2α+12θt1θt1+2λL2log|𝒮|2|𝒜|1λ+λL1λ,\displaystyle\leq\frac{2LV+2L\|\mathbf{Q}(t-1)\|_{2}}{2(1-\lambda)^{2}\alpha}+\frac{1}{2}\|\theta^{t-1}-\theta^{t}\|_{1}+\frac{\sqrt{2\lambda L^{2}\log|\mathcal{S}|^{2}|\mathcal{A}|}}{1-\lambda}+\frac{\lambda L}{1-\lambda},

where the last inequality is due to ab|a|/2+|b|/2\sqrt{ab}\leq|a|/2+|b|/2. Rearranging the terms in the above inequality gives for t2t\geq 2 with (t)=(t1)\ell(t)=\ell(t-1),

θtθt11\displaystyle\|\theta^{t}-\theta^{t-1}\|_{1} 2LV+2L𝐐(t1)2(1λ)2α+8λL2log|𝒮|2|𝒜|1λ+2λL1λ.\displaystyle\leq\frac{2LV+2L\|\mathbf{Q}(t-1)\|_{2}}{(1-\lambda)^{2}\alpha}+\frac{\sqrt{8\lambda L^{2}\log|\mathcal{S}|^{2}|\mathcal{A}|}}{1-\lambda}+\frac{2\lambda L}{1-\lambda}.

Shifting the index in the above inequality, we further have for t[T]t\in[T] with (t+1)=(t)\ell(t+1)=\ell(t),

θt+1θt1\displaystyle\|\theta^{t+1}-\theta^{t}\|_{1} 2LV+2L𝐐(t)2(1λ)2α+8λL2log|𝒮|2|𝒜|1λ+2λL1λ.\displaystyle\leq\frac{2LV+2L\|\mathbf{Q}(t)\|_{2}}{(1-\lambda)^{2}\alpha}+\frac{\sqrt{8\lambda L^{2}\log|\mathcal{S}|^{2}|\mathcal{A}|}}{1-\lambda}+\frac{2\lambda L}{1-\lambda}. (45)

For the second case where \ell(t+1)>\ell(t) with t\in[T], the two solutions \theta^{t+1} and \theta^{t} may lie in different feasible sets since \Delta(\ell(t+1),\zeta)\neq\Delta(\ell(t),\zeta) in general, so the above derivation does not apply. Instead, we bound the term \|\theta^{t+1}-\theta^{t}\|_{1} crudely as follows

\displaystyle\begin{aligned} \|\theta^{t+1}-\theta^{t}\|_{1}&\leq\|\theta^{t+1}\|_{1}+\|\theta^{t}\|_{1}\\ &=\sum_{k=0}^{L-1}\sum_{s\in\mathcal{S}_{k}}\sum_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}_{k+1}}[\theta^{t+1}(s,a,s^{\prime})+\theta^{t}(s,a,s^{\prime})]=2L.\end{aligned} (46)

However, we can observe that \ell(t+1)>\ell(t) only happens when episode t+1 is the starting episode of a new epoch. The number of such starting episodes within T+1 episodes is bounded by \ell(T), namely the total number of epochs in T episodes. According to Lemma D.1, the total number of epochs \ell(T) is bounded as \ell(T)\leq\sqrt{T|\mathcal{S}||\mathcal{A}|}\log_{2}[8T/(|\mathcal{S}||\mathcal{A}|)]\leq 1.5\sqrt{T|\mathcal{S}||\mathcal{A}|}\log[8T/(|\mathcal{S}||\mathcal{A}|)] (converting \log_{2} to the natural logarithm via \log_{2}x=\log x/\log 2\leq 1.5\log x), which only grows in the order of \sqrt{T}\log T.

Thus, we can decompose the term t=1Tθt+1θt1\sum_{t=1}^{T}\|\theta^{t+1}-\theta^{t}\|_{1} in the following way

t=1Tθt+1θt1=\displaystyle\sum_{t=1}^{T}\|\theta^{t+1}-\theta^{t}\|_{1}= t:tT,(t+1)>(t)θt+1θt1+t:tT,(t+1)=(t)θt+1θt1\displaystyle\sum_{\begin{subarray}{c}t:~{}t\leq T,\\ \ell(t+1)>\ell(t)\end{subarray}}\|\theta^{t+1}-\theta^{t}\|_{1}+\sum_{\begin{subarray}{c}t:~{}t\leq T,\\ \ell(t+1)=\ell(t)\end{subarray}}\|\theta^{t+1}-\theta^{t}\|_{1}
\displaystyle\leq 2L(T)+t:tT,(t+1)=(t)θt+1θt1,\displaystyle 2L\ell(T)+\sum_{\begin{subarray}{c}t:~{}t\leq T,\\ \ell(t+1)=\ell(t)\end{subarray}}\|\theta^{t+1}-\theta^{t}\|_{1},

where the inequality is due to (46) and the fact that t:tT,(t+1)>(t)1(T)\sum_{\begin{subarray}{c}t:~{}t\leq T,\\ \ell(t+1)>\ell(t)\end{subarray}}1\leq\ell(T). By (45), we can further bound the last term in the above inequality as

t:tT,(t+1)=(t)θt+1θt1\displaystyle\sum_{\begin{subarray}{c}t:~{}t\leq T,\\ \ell(t+1)=\ell(t)\end{subarray}}\|\theta^{t+1}-\theta^{t}\|_{1} 2TLV+2Lt=1T𝐐(t)2(1λ)2α+8λlog|𝒮|2|𝒜|1λTL+2λ1λTL,\displaystyle\leq\frac{2TLV+2L\sum_{t=1}^{T}\|\mathbf{Q}(t)\|_{2}}{(1-\lambda)^{2}\alpha}+\frac{\sqrt{8\lambda\log|\mathcal{S}|^{2}|\mathcal{A}|}}{1-\lambda}TL+\frac{2\lambda}{1-\lambda}TL,

where we relax the summation on the right-hand side to t=1T\sum_{t=1}^{T}. Thus, we eventually obtain

t=1Tθt+1θt1\displaystyle\sum_{t=1}^{T}\|\theta^{t+1}-\theta^{t}\|_{1} 2L(T)+t:tT,(t+1)=(t)θt+1θt1\displaystyle\leq 2L\ell(T)+\sum_{\begin{subarray}{c}t:~{}t\leq T,\\ \ell(t+1)=\ell(t)\end{subarray}}\|\theta^{t+1}-\theta^{t}\|_{1}
3LT|𝒮||𝒜|log8T|𝒮||𝒜|+2L(1λ)2αt=1T𝐐(t)2+2VLT(1λ)2α\displaystyle\leq 3L\sqrt{T|\mathcal{S}||\mathcal{A}|}\log\frac{8T}{|\mathcal{S}||\mathcal{A}|}+\frac{2L}{(1-\lambda)^{2}\alpha}\sum_{t=1}^{T}\|\mathbf{Q}(t)\|_{2}+\frac{2VLT}{(1-\lambda)^{2}\alpha}
+2λLT1λ+8λlog|𝒮|2|𝒜|1λLT,\displaystyle\quad+\frac{2\lambda LT}{1-\lambda}+\frac{\sqrt{8\lambda\log|\mathcal{S}|^{2}|\mathcal{A}|}}{1-\lambda}LT,

where we use the result in Lemma D.1 to bound the number of epochs \ell(T). This completes the proof. ∎