
Learning in Online MDPs:
Is there a Price for Handling the Communicating Case?

 Gautam Chandrasekaran
Department of Computer Science
University of Texas at Austin
Austin, TX 78705, USA
[email protected]
     Ambuj Tewari
Department of Statistics
University of Michigan
Ann Arbor, MI 48109, USA
[email protected]
Abstract

It is a remarkable fact that the same O(\sqrt{T}) regret rate can be achieved in both the Experts Problem and the Adversarial Multi-Armed Bandit problem, albeit with a worse dependence on the number of actions in the latter case. In contrast, it has been shown that handling online MDPs with communicating structure and bandit information incurs \Omega(T^{2/3}) regret even in the case of deterministic transitions. Is this the price we pay for handling communicating structure, or is it because we also have bandit feedback? In this paper we show that with full information, online MDPs can still be learned at an O(\sqrt{T}) rate even in the presence of communicating structure. We first show this by proposing an efficient follow-the-perturbed-leader (FPL) algorithm for the deterministic transition case. We then turn to stochastic transitions, where we first give an inefficient O(\sqrt{T})-regret algorithm (with a mild additional condition on the dynamics). We then show how to achieve an O\left(\sqrt{\frac{T}{\alpha}}\right) regret rate using an oracle-efficient algorithm, under the additional restriction that the starting state distribution has mass at least \alpha on each state.

1 Introduction

In this work, we study online learning in Markov Decision Processes. In this setting, we have an agent interacting with an adversarial environment. The agent observes the state of the environment and takes an action. The action incurs an associated loss and the environment moves to a new state. The state transition dynamics are assumed to be Markovian, i.e., the probability distribution of the new state is fully determined by the action and the old state. The transition dynamics are fixed and known to the learner in advance. However, the losses are chosen by the adversary. The adversary is assumed to be oblivious (the entire loss sequence is chosen before the interaction begins). We assume that the environment reveals full information about the losses at a given time step to the agent after the corresponding action is taken. The total loss incurred by the agent is the sum of losses incurred in each step of the interaction. We denote the set of states by S and the set of actions by A. The objective of the agent is to minimize its total loss.

This setting was first studied in the seminal work of Even-Dar et al. (2009). They studied the restricted class of ergodic MDPs, where every policy induces a Markov chain with a single recurrent class. They designed an efficient algorithm (running in time polynomial in the MDP parameters and the length of the interaction) that achieves O(\sqrt{T}) regret with respect to the best stationary policy in hindsight. They assumed full information of the losses and that the MDP dynamics were known beforehand. This work was extended to bandit feedback by Neu et al. (2014), with an additional assumption on the minimum stationary probability mass in any state. They also achieved a regret bound of O(\sqrt{T}). Bandit feedback is a harder model in which the learner only receives information about its own losses.

            | No state  | Ergodic                           | Communicating
Full Info   | \sqrt{T}  | \sqrt{T}                          | \sqrt{T}  (THIS PAPER)
Bandit Info | \sqrt{T}  | \sqrt{T} (additional assumption)  | T^{2/3}
Table 1: The dependence on time horizon T of the optimal regret, under full and bandit feedback, as the state transition dynamics become more complex.

In this paper we will look at the more general class of communicating MDPs, where, for any pair of states, there is a policy under which the expected time to reach the second state from the first is finite. In the case of bandit feedback with deterministic transitions, Dekel and Hazan (2013) designed an algorithm that achieves O\left(T^{2/3}\right) regret. This regret bound was proved to be tight by a matching lower bound in Dekel et al. (2013). The lower bound was proved by a reduction from the problem of adversarial multi-armed bandits with switching costs; in that setting, the agent incurs an additional cost every time it switches the arm it plays. Their lower bound definitively proves that in the case of bandit information, online learning over communicating MDPs is statistically harder than the adversarial multi-armed bandit problem, for which we have \tilde{O}(\sqrt{T}) regret algorithms (Auer et al. (1995)). Their result gives rise to a natural question: is the higher regret due to the communicating structure, the bandit feedback, or both? In the case of experts with switching costs, we know O(\sqrt{T}) regret algorithms such as FPL (Kalai and Vempala (2005)). Using this, we give an O(\sqrt{T}) regret algorithm for online learning in communicating MDPs with full information. Thus, we show that the communicating structure alone does not add any statistical price (see Table 1).

1.1 Our Contributions

In this paper, we show that online learning over MDPs with full information is not statistically harder (up to polynomial factors in the number of states and actions, and assuming the existence of a state with a “do nothing” action) than the problem of online learning with expert advice. In particular, we design an efficient algorithm that learns to act in communicating ADMDPs with O(\sqrt{T}) regret under full information feedback against the best deterministic policy in hindsight. This is the first O(\sqrt{T}) regret algorithm for this problem. We prove a matching (up to polynomial factors) regret lower bound in this setting. We also extend the techniques used in the previous algorithm to design an algorithm, running in time exponential in the MDP parameters, that achieves O(\sqrt{T}) regret for the general class of communicating MDPs (albeit with the mild “do nothing” assumption). Again, this is the first algorithm that achieves O(\sqrt{T}) regret against this large class of MDPs. Previously, O(\sqrt{T}) regret algorithms were only known for the case of ergodic MDPs. We also give an O\left(\sqrt{\frac{T}{\alpha}}\right) regret algorithm for communicating MDPs whose start state distribution has probability mass at least \alpha on each state; this algorithm is efficient when given access to an optimization oracle.

2 Related Work

The problem we study in this paper is commonly referred to in the literature as online learning in MDPs over an infinite horizon. This problem was first studied for MDPs with an ergodic transition structure. Even-Dar et al. (2009) and Neu et al. (2014) studied this problem under full information and partial information respectively. The former achieved O(\sqrt{T}) regret for all ergodic MDPs, whereas the latter achieved O(\sqrt{T}) regret for ergodic MDPs satisfying an additional assumption on the minimum stationary probability mass in any state. The problem of online learning in deterministic communicating MDPs (ADMDPs) was studied by Arora et al. (2012a) and Dekel and Hazan (2013). They consider bandit feedback and achieve O(T^{3/4}) and O(T^{2/3}) regret respectively.

As mentioned in the introduction, a closely related problem is that of online learning with switching costs. In the case of full information, algorithms like FPL (Kalai and Vempala (2005)) achieve O(\sqrt{T}) regret with switching cost. In the case of bandit feedback, Arora et al. (2012b) give an algorithm that achieves O(T^{2/3}) regret with switching cost. This was proved to be tight by Dekel et al. (2013), who proved a matching lower bound.

Subsequent to the release of an earlier version of this paper, Dai et al. (2022) gave an inefficient O(T^{2/3}) regret algorithm and an oracle-efficient O(T^{5/6}) regret algorithm for online learning over communicating MDPs with bandit information. Their algorithms use our Switch_Policy procedure from Algorithm 3 and thus require the same “do nothing” assumption as we do.

3 Preliminaries

Fix the finite state space S, finite action space A, and transition probability matrix P, where P(s,a,s^{\prime}) is the probability of moving from state s to s^{\prime} on action a.

In the case of an ADMDP, the transitions are deterministic and hence the ADMDP can also be represented by a directed graph G with vertices corresponding to the states S. The edges are labelled by the actions: an edge from s to s^{\prime} labelled by action a exists in the graph when the ADMDP moves from state s to state s^{\prime} on action a. This edge will be referred to as (s,a,s^{\prime}).

A (stationary) policy \pi is a mapping \pi:S\times A\to[0,1] where \pi(s,a) denotes the probability of taking action a when in state s. When the policy is deterministic, we overload the notation and define \pi(s) to be the action taken when the state is s. The interaction starts in an arbitrary start state s_{1}\in S. The adversary obliviously chooses a sequence of loss functions \ell_{1},\ldots,\ell_{T}, where each \ell_{t} is a map from S\times A to [0,1].

An algorithm \mathcal{A} that interacts with the online MDP chooses the action to be taken at each time step. It maintains a probability distribution over actions, denoted by \mathcal{A}(\cdot\mid s,\ell_{1},\ldots,\ell_{t-1}), which depends on the current state and the sequence of loss functions seen so far. The expected loss of the algorithm \mathcal{A} is

L(\mathcal{A})=\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t}(s_{t},a_{t})\right]

where a_{t}\sim\mathcal{A}\left(\cdot\mid s_{t},\ell_{1},\ldots,\ell_{t-1}\right) and s_{t+1}\sim P(s_{t},a_{t},\cdot). For a stationary policy \pi, the loss of the policy is

L^{\pi}=\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t}(s_{t},a_{t})\right]

where a_{t}\sim\pi(\cdot\mid s_{t}) and s_{t+1}\sim P(s_{t},a_{t},\cdot). The regret of the algorithm is defined as

R(\mathcal{A})=L(\mathcal{A})-\min_{\pi\in\Pi}L^{\pi}.

The total expected loss of the best policy in hindsight is denoted by L^{*}. Thus,

L^{*}=\min_{\pi\in\Pi}L^{\pi}.

For any stationary policy \pi, let T(s^{\prime}\mid M,\pi,s) be the random variable giving the first time step at which s^{\prime} is reached when we start at state s and follow policy \pi in MDP M. We define the diameter D(M) of the MDP as

D(M)=\max_{s\neq s^{\prime}}\min_{\pi}\mathbb{E}\left[T(s^{\prime}\mid M,\pi,s)\right].

A communicating MDP is an MDP with D(M)<\infty.

3.1 Preliminaries on ADMDPs

In this section, we use the graph G and the ADMDP interchangeably. A stationary deterministic policy \pi induces a subgraph G_{\pi} of G where (s,a,s^{\prime}) is an edge in G_{\pi} if and only if \pi(s)=a and action a takes state s to s^{\prime}.

A communicating ADMDP corresponds to a strongly connected graph. This is because the existence of a policy that takes state s to s^{\prime} also implies the existence of a path between the two vertices in the graph G.

The subgraph G_{\pi} induced by policy \pi in the communicating ADMDP is the set of transitions (s,a,s^{\prime}) that are possible under \pi. Each component of G_{\pi} is either a cycle or an initial path followed by a cycle: start a walk from any state s by following the policy \pi; since the set of states is finite, eventually a state must be repeated, and this forms the cycle.

Let N(s,a) be the next state after visiting state s and taking action a. Define I(s) as

I(s)=\{(s^{\prime},a)\mid N(s^{\prime},a)=s\}.

The period of a vertex v in G is the greatest common divisor of the lengths of all the cycles starting and ending at v. In a strongly connected graph, the periods of all vertices can be shown to be equal (Bremaud (2000), Chap. 2, Thm 4.2). Thus, the period of a strongly connected G is well defined. If the period of G is 1, we call G aperiodic.
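As an aside, the period of a strongly connected graph can be computed with a single breadth-first search: take BFS levels from an arbitrary root and accumulate the gcd of level(u)+1-level(v) over all edges (u,v). The following Python sketch (our own illustration; the adjacency-list input format is an assumption) shows this.

from math import gcd
from collections import deque

def graph_period(adj):
    # Period of a strongly connected directed graph given as an adjacency list
    # {vertex: list of successor vertices}. A sketch only: it assumes the graph
    # really is strongly connected, so BFS from any root reaches every vertex.
    root = next(iter(adj))
    level = {root: 0}
    queue = deque([root])
    g = 0
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
            # every edge (u, v) contributes |level(u) + 1 - level(v)| to the gcd
            g = gcd(g, abs(level[u] + 1 - level[v]))
    return g if g > 0 else 1

# e.g. a directed 3-cycle has period 3, and adding a self-loop makes it aperiodic:
# graph_period({0: [1], 1: [2], 2: [0]}) == 3
# graph_period({0: [1, 0], 1: [2], 2: [0]}) == 1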

Let \mathcal{C}_{(s,k)} be the set of all closed walks in G of length k whose start vertex is s. The elements of \mathcal{C}_{(s,k)} are represented by the sequence of edges in the walk.

Note that the cycles induced by any stationary deterministic policy \pi that are of length k and contain the vertex s will be in \mathcal{C}_{(s,k)}. However, \mathcal{C}_{(s,k)} can also contain cycles not induced by policies (it can contain cycles that are not simple). We use \mathcal{C} to denote \bigcup_{s\in S,k\in[|S|]}\mathcal{C}_{(s,k)}. We sometimes loosely refer to elements of \mathcal{C} as cycles. We define a_{t}(c) to be the action taken by c in the t-th step if we start following c from the beginning of the interaction. Similarly, s_{t}(c) is the state reached after following c for t-1 steps from the start of the interaction. We define k(c) as the length of the cycle c.

The vertices of a strongly connected graph G with period \gamma can be partitioned into \gamma non-empty cycle classes C_{1},\ldots,C_{\gamma}, where each edge goes from some C_{i} to C_{i+1} (indices taken modulo \gamma).

Theorem 3.1.

If G is strongly connected and aperiodic, there exists a critical length d such that for any \ell\geq d, there exists a path of length \ell in G between any pair of vertices. Moreover, d\leq n(n-1) where n is the number of vertices in the graph.

The above theorem is from Denardo (1977). It guarantees the existence of a d>0 such that there are paths of length d between any pair of vertices. The following generalization from Dekel and Hazan (2013) extends the result to periodic graphs.

Theorem 3.2 (Dekel and Hazan (2013)).

If G has period \gamma, there exists a critical value d such that for any integer \ell\geq d, there is a path in G of length \gamma\ell from any state v to any other state in the same cycle class.

Remark 3.3.

We can also efficiently find a path of length \ell\geq d from a given vertex s to any other vertex s^{\prime}. This can be done by constructing the path in the reverse direction: we look at the (\ell-1)-th power of the adjacency matrix to find all predecessors of s^{\prime} that are reachable from s by a path of length \ell-1, choose any of these as the penultimate vertex in the path, and recurse.
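A minimal sketch of this backward construction (our own, using boolean powers of a dense adjacency matrix) is the following.

import numpy as np

def path_of_length(adj, s, s_prime, ell):
    # Reconstruct a path of exactly `ell` edges from s to s_prime, working
    # backwards as in Remark 3.3. `adj` is a boolean adjacency matrix.
    # Returns a list of vertices, or None if no such path exists.
    # Sketch only; names and the dense-matrix representation are our own.
    n = adj.shape[0]
    # reach[m][u, v] == True iff there is a walk of length m from u to v
    reach = [np.eye(n, dtype=bool)]
    for _ in range(ell):
        reach.append((reach[-1].astype(int) @ adj.astype(int)) > 0)
    if not reach[ell][s, s_prime]:
        return None
    path = [s_prime]
    target = s_prime
    for m in range(ell - 1, -1, -1):
        # pick any predecessor u of `target` that is reachable from s in m steps
        candidates = np.where(adj[:, target] & reach[m][s, :])[0]
        target = int(candidates[0])
        path.append(target)
    path.reverse()
    return path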

4 Deterministic Transitions

We now present our algorithm for online learning in ADMDPs when we have full information about the losses. We use G to refer to the graph associated with the ADMDP.

We assume that the ADMDP dynamics are known to the agent. This assumption can be relaxed as shown in Ortner (2010), since the dynamics can be learned in poly(|S|,|A|) time when the transitions are deterministic. We want to minimize regret against the class of deterministic stationary policies.

4.1 Algorithm Sketch

We formulate the task of minimizing regret against the set of deterministic policies as a problem of prediction with expert advice. As observed earlier, each deterministic policy induces a subgraph whose components are cycles, possibly with initial paths attached. We keep an expert for each element of \mathcal{C}_{(s,k)} for all states s and all k\leq|S|. Note that we do not keep experts for policies which have an initial path before the cycle: the loss of such a policy differs by at most |S| from the loss of its cycle. Also, we only consider cycles whose start state is in the same cycle class as the start state of the environment; otherwise, our algorithm could never be in phase with the expert policy. Henceforth, we will refer to these experts as cycles.

The loss incurred by cycle c\in\mathcal{C}_{(s,k)} at time t is equal to \ell_{t}(s_{t},a_{t}), where s_{t} and a_{t} are the state-action pair traversed by the cycle c at time t if we had followed it from the start of the interaction.

We first present an efficient (running time polynomial in |S|, |A| and T) algorithm that achieves O(\sqrt{T}) regret and switching cost against this class of experts. For this we use a follow-the-perturbed-leader (FPL) style algorithm.

We then use this low-switching algorithm as a black box. Whenever the black box algorithm tells us to switch policies at time t, we compute the state s that we would have reached if we had followed the new policy from the start of the interaction for t+\gamma d steps. We then move to this state s in \gamma d steps; Theorem 3.2 guarantees the existence of a path of this length. We then start following the new policy.

Thus, our algorithm matches the moves of the expert policies except when there is a switch in the policies, so its regret differs from the regret of the black box algorithm by at most O(\gamma d\sqrt{T}).

4.2 FPL algorithm

We now describe the FPL style algorithm that competes with the set of cycles described earlier with O(\sqrt{T}) regret and switching cost.

Sample perturbation vectors \epsilon_{i}\in\mathbb{R}^{|S||A|} for 1\leq i\leq|S| from an exponential distribution with parameter \lambda;
Sample a perturbation vector \delta\in\mathbb{R}^{|S|^{2}} from the same distribution;
t=1;
while t\neq T+1 do
       C_{t}=\operatorname*{arg\,min}_{s\in S,\,k\in[|S|],\,c\in\mathcal{C}_{(s,k)}}\;\delta(s,k)+\sum_{i=1}^{t-1}\ell_{i}(s_{i}(c),a_{i}(c))+\sum_{i=1}^{k(c)}\epsilon_{i}(s_{i}(c),a_{i}(c));
       Adversary returns loss function \ell_{t};
       t=t+1;
Algorithm 1 FPL algorithm for Deterministic MDPs
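To make the selection rule concrete, here is a brute-force Python sketch (our own, not the paper's implementation) of the \operatorname*{arg\,min} step over an explicitly enumerated set of candidate cycles; the LP of the next subsection replaces this enumeration in the actual algorithm. A cycle is represented as a list of (state, action) edges, and the perturbation term sums \epsilon_{i} over the k(c) edges of the cycle, matching the display above.

import numpy as np

def perturbed_cost(cycle, start_state, losses, delta, eps):
    # Perturbed cumulative loss of one cycle, as in the argmin of Algorithm 1.
    # `cycle` is a list of (state, action) edges of length k(c); `losses` is the
    # list ell_1..ell_{t-1}, each indexable by (state, action).
    k = len(cycle)
    cost = delta[start_state, k]                       # delta(s, k)
    for i, ell in enumerate(losses):                   # past losses of the cycle
        s_i, a_i = cycle[i % k]
        cost += ell[s_i, a_i]
    for i, (s_i, a_i) in enumerate(cycle):             # edge-wise perturbations
        cost += eps[i][s_i, a_i]
    return cost

def choose_leader(cycles, losses, delta, eps):
    # Return the cycle minimizing the perturbed cumulative loss.
    # `cycles` maps each start state to its list of candidate cycles.
    best, best_cost = None, np.inf
    for s, cands in cycles.items():
        for c in cands:
            cost = perturbed_cost(c, s, losses, delta, eps)
            if cost < best_cost:
                best, best_cost = c, cost
    return best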

4.2.1 Finding the leader: Offline Optimization Algorithm

First, we design an offline algorithm that finds the cycle (including its start state) with lowest cumulative loss up to time t, given the sequence of losses \ell_{1},\ldots,\ell_{t-1}. This is the \operatorname*{arg\,min} step in Algorithm 1. Given (s,k), we find the best cycle among the cycles that start in state s and have length k. For this we use a method similar to that used in Arora et al. (2012a). We then take the minimum over all (s,k) pairs to find the best cycle. Note that we only consider start states s which are in the same cycle class as the start state of the game.

We find the best cycle in \mathcal{C}_{(s,k)} using linear programming. Let n=|S||A|k. The LP is in the space \mathbb{R}^{n}. Consider a cycle c\in\mathcal{C}_{(s,k)}. Let c_{i} denote the i-th state in c, and let a_{i} be the action taken at that state. We associate a vector x(c) with the cycle as follows.

x(c)_{s,a,i}=\begin{cases}1&\text{if }a=a_{i}\text{ and }s=c_{i}\\ 0&\text{otherwise}\end{cases}

We construct a loss vector l\in\mathbb{R}^{n} as follows.

l_{s,a,i}=\sum_{\begin{subarray}{c}1\leq j<t\\ (j-i)\equiv 0\mod k\end{subarray}}\ell_{j}(s,a)

Our decision set \mathcal{X}\subseteq\mathbb{R}^{n} is the convex hull of all x(c) with c\in\mathcal{C}_{(s,k)}. Our objective is to find x\in\mathcal{X} such that \langle x,l\rangle is minimized. The set \mathcal{X} can be captured by the following polynomial-sized set of linear constraints.

x\geq 0
\sum_{a\in A}x_{(s,a,1)}=1
\forall s^{\prime}\in S\setminus\{s\},a\in A,\;x_{(s^{\prime},a,1)}=0
\forall(s^{\prime},a^{\prime})\notin I(s),\;x_{(s^{\prime},a^{\prime},k)}=0
\forall s^{\prime}\in S,\;2\leq i\leq k,\;\sum_{(s^{\prime\prime},a^{\prime\prime})\in I(s^{\prime})}x_{(s^{\prime\prime},a^{\prime\prime},i-1)}=\sum_{a\in A}x_{(s^{\prime},a,i)}

Once we get an optimal x for the above LP, we can decompose the solution as a convex combination of at most n+1 cycles by Carathéodory's theorem. These cycles can also be recovered efficiently (Bazaraa et al. (2004)). Each of them has the same (optimal) loss, so we can choose any of them.

Once we have an optimal cycle for a given (s,k), we can minimize over all such pairs to get the overall optimal cycle. This gives us a polynomial time algorithm to find the optimal cycle.
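As an illustration, the LP above could be set up and solved with scipy.optimize.linprog roughly as in the following Python sketch. The data layout (a dictionary N of deterministic transitions, a dictionary loss_vec of the aggregated losses l_{s,a,i}, 0-based step indices) and all names are our own choices, and recovering an integral cycle from a fractional optimum still requires the decomposition discussed above.

import numpy as np
from scipy.optimize import linprog

def best_cycle_lp(num_states, num_actions, k, s, N, loss_vec):
    # Solve the LP of Section 4.2.1 for cycles of length k starting at state s.
    # N[(state, action)] is the deterministic next state; loss_vec[(state, action, i)]
    # is the aggregated loss for step i (0-based). Returns the optimal value and
    # the (possibly fractional) optimal point.
    idx = {(st, a, i): (i * num_states + st) * num_actions + a
           for st in range(num_states) for a in range(num_actions) for i in range(k)}
    n = num_states * num_actions * k
    c = np.zeros(n)
    for key, pos in idx.items():
        c[pos] = loss_vec.get(key, 0.0)

    A_eq, b_eq = [], []
    def row(entries):  # helper: dense row from {position: coefficient}
        r = np.zeros(n)
        for pos, coef in entries.items():
            r[pos] = coef
        return r

    # the walk starts at s: sum_a x_{(s,a,1)} = 1, and x_{(s',a,1)} = 0 for s' != s
    A_eq.append(row({idx[(s, a, 0)]: 1.0 for a in range(num_actions)})); b_eq.append(1.0)
    for sp in range(num_states):
        for a in range(num_actions):
            if sp != s:
                A_eq.append(row({idx[(sp, a, 0)]: 1.0})); b_eq.append(0.0)
            if N[(sp, a)] != s:  # the last edge must return to s
                A_eq.append(row({idx[(sp, a, k - 1)]: 1.0})); b_eq.append(0.0)
    # flow conservation: mass entering s' at step i-1 equals mass leaving s' at step i
    for sp in range(num_states):
        for i in range(1, k):
            entries = {idx[(sp, a, i)]: -1.0 for a in range(num_actions)}
            for (u, a), v in N.items():
                if v == sp:
                    entries[idx[(u, a, i - 1)]] = 1.0
            A_eq.append(row(entries)); b_eq.append(0.0)

    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * n, method="highs")
    return res.fun, res.x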

Remark 4.1.

If the new cycle chosen has the same perturbed loss as the old cycle, we will not switch. This is to prevent any unnecessary switches caused by the arbitrary choice of cycle in each optimization step (as we choose an arbitrary cycle with non-zero weight in the solution).

Remark 4.2.

Note that \mathcal{C}_{(s,k)} can also contain cycles that don't correspond to deterministic stationary policies. However, arguing a regret upper bound against this larger class is sufficient to prove regret bounds against the class of stationary deterministic policies.

4.2.2 Regret of the FPL algorithm

We now state and prove the bound on the regret and expected number of switches of Algorithm 1.

Theorem 4.3.

The regret and the expected number of switches of Algorithm 1 can be bounded by

O\left(|S|\sqrt{L^{*}\cdot\log|S||A|}\right)

where L^{*} is the cumulative loss of the best cycle in hindsight.

Before analysing the FPL algorithm described above, we first introduce some notation and definitions. We define the loss of a cycle c at time t as \ell_{t}(s_{t}(c),a_{t}(c)). For any cycle c with start state s, let L^{c} denote the total cumulative loss that we would have received if we followed the cycle c from the start to the end of the interaction. We use \tilde{L}^{c} to denote the total perturbed cumulative loss received by cycle c. Let the cycle with lowest total cumulative loss be c^{*}, and let the cycle with lowest perturbed cumulative loss be \tilde{c}^{*}. We use \tilde{L}^{c}_{t} to denote the total perturbed cumulative loss incurred by cycle c after t steps, and \tilde{c}^{*}_{t} to denote the cycle with lowest perturbed cumulative loss after t steps. Let C_{t} be the cycle chosen by the FPL algorithm at step t and l_{t} be its loss. Let the expected number of switches made by the algorithm during the interaction be N_{s}.

The analysis is similar in spirit to Section 2 of Kalai and Vempala (2005). We first state the following lemma that bounds the probability of switching the cycle at any step.

Lemma 4.4.

Pr[C_{t+1}\neq c\mid C_{t}=c]\leq(|S|+1)\cdot\lambda\cdot\ell_{t}(s_{t}(c),a_{t}(c)) for all cycles c and times t\leq T.

Proof of Theorem 4.3.

We first bound the total loss incurred by the FPL algorithm. Recall that N_{s} denotes the expected number of switches made by the algorithm during the interaction. If the algorithm doesn't switch cycles after time step t, then \tilde{L}^{C_{t}}_{t} must be equal to \tilde{L}^{\tilde{c}^{*}_{t}}_{t}. Thus, the loss incurred at time step t by C_{t} is at most \left(\tilde{L}^{\tilde{c}^{*}_{t}}_{t}-\tilde{L}^{\tilde{c}^{*}_{t-1}}_{t-1}\right). In the steps in which the algorithm switches cycles, the maximum loss incurred is 1. Thus, we have that

\mathbb{E}[\text{total loss of FPL}]\leq\tilde{L}^{\tilde{c}^{*}_{1}}_{1}+\sum_{t=2}^{T}\left(\tilde{L}^{\tilde{c}^{*}_{t}}_{t}-\tilde{L}^{\tilde{c}^{*}_{t-1}}_{t-1}\right)+N_{s}
\leq\tilde{L}^{\tilde{c}^{*}_{T}}_{T}+N_{s}
=\tilde{L}^{\tilde{c}^{*}}+N_{s} \quad (1)

We now bound N_{s}. From linearity of expectation, we have that

N_{s}=\sum_{t=1}^{T-1}Pr[C_{t+1}\neq C_{t}].

From Lemma 4.4, we have that Pr[C_{t+1}\neq C_{t}] is at most (|S|+1)\cdot\lambda\cdot\mathbb{E}[l_{t}]. This gives us the following bound on N_{s}.

N_{s}=\sum_{t=1}^{T-1}Pr[C_{t+1}\neq C_{t}]
\leq\sum_{t=1}^{T-1}(|S|+1)\cdot\lambda\cdot\mathbb{E}[l_{t}]
\leq(|S|+1)\cdot\lambda\cdot\sum_{t=1}^{T-1}\mathbb{E}[l_{t}]
\leq(|S|+1)\cdot\lambda\cdot\mathbb{E}[\text{total loss of FPL}]

Combining this with (1) gives us the following.

\mathbb{E}[\text{total loss of FPL}]\leq\tilde{L}^{\tilde{c}^{*}}+(|S|+1)\cdot\lambda\cdot\mathbb{E}[\text{total loss of FPL}] \quad (2)

Let p(c) denote the perturbed loss added to cycle c. Since the cycle with lowest perturbed cumulative loss at the end of the interaction is \tilde{c}^{*}, we have

\tilde{L}^{\tilde{c}^{*}}\leq L^{c^{*}}+p(\tilde{c}^{*}).

Also,

\mathbb{E}[p(\tilde{c}^{*})]\leq\sum_{i=1}^{|S|}\mathbb{E}\left[\max_{(s,a)}\epsilon_{i}(s,a)\right]+\mathbb{E}\left[\max_{(s^{\prime},k)}\delta(s^{\prime},k)\right]\leq|S|\cdot\frac{1+\log|S||A|}{\lambda}+\frac{1+\log|S|^{2}}{\lambda}.

The above inequality comes from the fact that the expectation of the maximum of k independent exponential random variables with parameter \lambda is at most \frac{1+\log k}{\lambda}. Plugging this inequality into (2) gives us

\mathbb{E}[\text{total loss of FPL}]\leq L^{*}+|S|\cdot\frac{1+\log|S||A|}{\lambda}+\frac{1+\log|S|^{2}}{\lambda}+(|S|+1)\cdot\lambda\cdot\mathbb{E}[\text{total loss of FPL}]. \quad (3)

Since the maximum cost is T, we have

\text{Regret}\leq|S|\cdot\frac{1+\log|S||A|}{\lambda}+\frac{1+\log|S|^{2}}{\lambda}+(|S|+1)\lambda T.

Setting \lambda=\sqrt{\frac{\log|S||A|}{T}} gives us a bound of O\left(|S|\sqrt{T\log|S||A|}\right) on the regret and the expected number of switches. We can also derive first order bounds. From (3), we have

\mathbb{E}[\text{total loss of FPL}]\leq L^{*}+|S|\cdot\frac{1+\log|S||A|}{\lambda}+\frac{1+\log|S|^{2}}{\lambda}+(|S|+1)\cdot\lambda\cdot\mathbb{E}[\text{total loss of FPL}]
\leq L^{*}+4|S|\cdot\frac{\log|S||A|}{\lambda}+2|S|\cdot\lambda\cdot\mathbb{E}[\text{total loss of FPL}].

On rearranging, we get

\mathbb{E}[\text{total loss of FPL}]\leq\frac{L^{*}}{1-2\lambda|S|}+4|S|\cdot\frac{\log|S||A|}{\lambda(1-2\lambda|S|)}
\leq L^{*}\left(1+2\lambda|S|+(2\lambda|S|)^{2}+\ldots\right)+4|S|\frac{\log|S||A|}{\lambda}\left(1+2\lambda|S|+(2\lambda|S|)^{2}+\ldots\right)
\leq L^{*}(1+4\lambda|S|)+8|S|\frac{\log|S||A|}{\lambda}.

The last two inequalities hold when 2\lambda|S|\leq\frac{1}{2}. Thus,

\mathbb{E}[\text{total loss of FPL}]-L^{*}\leq 4\lambda|S|L^{*}+8|S|\frac{\log|S||A|}{\lambda}.

Set \lambda=\min\left(\sqrt{\frac{\log|S||A|}{L^{*}}},\frac{1}{4|S|}\right). This forces 2\lambda|S| to be at most \frac{1}{2}, so the previous inequalities remain valid. On substituting the value of \lambda, we get that

\text{Regret}\leq O\left(|S|\sqrt{L^{*}\cdot\log|S||A|}\right)

when L^{*}\geq 16|S|^{2}\log|S||A|. Since the expected number of switches is at most 2\lambda|S|\cdot\mathbb{E}[\text{total loss of FPL}], it is also bounded by O\left(|S|\sqrt{L^{*}\cdot\log|S||A|}\right). ∎

Proof of Lemma 4.4.

Let c be a cycle in the set \mathcal{C}_{(s,k)}. Let l_{t} be shorthand for \ell_{t}(s_{t}(c),a_{t}(c)), the loss incurred by cycle c at step t. If C_{t+1} is not in \mathcal{C}_{(s,k)}, then the algorithm must have switched. Thus, we get the following equation.

Pr[C_{t+1}\neq c\mid C_{t}=c]=Pr[C_{t+1}\notin\mathcal{C}_{(s,k)}\mid C_{t}=c]+Pr[C_{t+1}\neq c\text{ and }C_{t+1}\in\mathcal{C}_{(s,k)}\mid C_{t}=c] \quad (4)

We now bound both terms on the right hand side of (4) separately.

First, we consider the first term. We will upper bound it by proving an appropriate lower bound on the probability of choosing C_{t+1} from \mathcal{C}_{(s,k)}. Since C_{t}=c, we know that \tilde{L}^{c}_{t-1}\leq\tilde{L}^{c^{\prime}}_{t-1} for all c^{\prime}\neq c. For all c^{\prime}\notin\mathcal{C}_{(s,k)}, the perturbation \delta(s,k) plays a role in the comparison of the perturbed cumulative losses; for c^{\prime}\in\mathcal{C}_{(s,k)}, \delta(s,k) appears on both sides of the comparison and thus cancels out. Thus, we have \delta(s,k)\geq w, where w depends only on the perturbations and losses received by c and by the cycles not in \mathcal{C}_{(s,k)}. Now, if \delta(s,k) were larger than w+l_{t}, then the perturbed cumulative loss of c would be less than that of the cycles not in \mathcal{C}_{(s,k)} even after receiving the losses of step t. In this case, C_{t+1} would also be chosen from \mathcal{C}_{(s,k)}. This gives us the required probability lower bound.

Pr[C_{t+1}\in\mathcal{C}_{(s,k)}\mid C_{t}=c]\geq Pr[\delta(s,k)\geq w+l_{t}\mid\delta(s,k)\geq w]
\geq e^{-\lambda\cdot l_{t}}
\geq 1-\lambda\cdot l_{t}

Thus, Pr[C_{t+1}\notin\mathcal{C}_{(s,k)}\mid C_{t}=c] is at most \lambda\cdot l_{t}.

We now bound the second term. For any two cycles c^{\prime}\neq c^{\prime\prime} in \mathcal{C}_{(s,k)}, there exists an index i\leq k such that the i-th edges of c^{\prime} and c^{\prime\prime} are different and all the smaller-indexed edges of the two cycles are the same. We denote this index by d(c^{\prime},c^{\prime\prime}). Define d(c^{\prime},c^{\prime\prime}) to be zero when c^{\prime} is from \mathcal{C}_{(s,k)} and either c^{\prime\prime}=c^{\prime} or c^{\prime\prime} is not from \mathcal{C}_{(s,k)}. Now, if C_{t+1} is in \mathcal{C}_{(s,k)} and not equal to c, then d(C_{t+1},c) is a number between one and k. Thus, we get the following equation.

Pr[C_{t+1}\neq c\text{ and }C_{t+1}\in\mathcal{C}_{(s,k)}\mid C_{t}=c]=\sum_{i=1}^{k}Pr[d(c,C_{t+1})=i\mid C_{t}=c] \quad (5)

We now bound Pr[d(c,C_{t+1})=i\mid C_{t}=c] for any i between 1 and k. Let (s_{i},a_{i}) be the i-th edge of c. We prove a lower bound on the probability of choosing C_{t+1} such that d(c,C_{t+1}) is not equal to i. Again, since C_{t}=c, we know that \tilde{L}^{c}_{t-1}\leq\tilde{L}^{c^{\prime}}_{t-1} for all c^{\prime}\neq c. Consider cycles c^{\prime} that don't contain the edge (s_{i},a_{i}) in the i-th position. The perturbation \epsilon_{i}(s_{i},a_{i}) plays a role in the comparison of the perturbed losses of all such c^{\prime} with c. Thus, we have \epsilon_{i}(s_{i},a_{i})\geq w, where w depends only on the perturbations and losses received by c and by the cycles c^{\prime} that don't have the (s_{i},a_{i}) edge in the i-th position. If \epsilon_{i}(s_{i},a_{i}) were greater than w+l_{t}, then the perturbed cumulative loss of c would still be less than that of all cycles c^{\prime} without the (s_{i},a_{i}) edge. In this case, C_{t+1} would be chosen such that it also has the (s_{i},a_{i}) edge, which implies that d(c,C_{t+1})\neq i. Thus, we get the following probability lower bound.

Pr[d(c,C_{t+1})\neq i\mid C_{t}=c]\geq Pr[\epsilon_{i}(s_{i},a_{i})\geq w+l_{t}\mid\epsilon_{i}(s_{i},a_{i})\geq w]
\geq e^{-\lambda\cdot l_{t}}
\geq 1-\lambda\cdot l_{t}

Thus, for all i between 1 and k, Pr[d(c,C_{t+1})=i\mid C_{t}=c] is at most \lambda\cdot l_{t}. This proves that the term in (5) is at most k\lambda\cdot l_{t}. Since k is at most |S|, the second term on the right hand side of (4) is bounded by |S|\cdot\lambda\cdot l_{t}. ∎

4.3 Putting it together

We have described the FPL style algorithm that achieves low regret and low switching. We now use Algorithm 1 as a sub-routine to design a low regret algorithm for the online ADMDP problem.

Recall that for a cycle c, s_{t}(c) is the state you would reach if you followed the cycle c from the start. This can be computed efficiently.
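Concretely, writing c_{1},\ldots,c_{k(c)} for the states visited by c in order (as in Section 4.2.1), the closed-walk structure gives

s_{t}(c)=c_{\,1+\left((t-1)\bmod k(c)\right)}.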

t=1;
s_{1} is the start state of the environment;
Let c_{1} be the cycle chosen by Algorithm 1 at t=1;
if s_{1}\neq s_{1}(c_{1}) then
       Spend \gamma d steps to move to state s_{1+\gamma d}(c_{1});
       c_{1+\gamma d}=c_{1};
       t=1+\gamma d;

while t\neq T+1 do
       Choose action a_{t}=a_{t}(c_{t});
       Adversary returns loss function \ell_{t} and next state s_{t+1};
       Feed \ell_{t} as the loss to Algorithm 1;
       if Algorithm 1 switches cycle to c_{t+1} then
              if s_{t+1}\neq s_{t+1}(c_{t+1}) then
                     Spend \gamma d steps to move to state s_{t+\gamma d}(c_{t+1});
                     c_{t+\gamma d}=c_{t+1};
                     t=t+\gamma d;
              else
                     t=t+1;
       else
              c_{t+1}=c_{t};
              t=t+1;

Algorithm 2 Low regret algorithm for communicating ADMDPs

We now state the regret bound of Algorithm 2.

Theorem 4.5.

Given a communicating ADMDP with state space S, action space A and period \gamma, the regret of Algorithm 2 is bounded by

\text{Regret}\leq O\left(|S|^{3}\cdot\gamma\sqrt{L^{*}\cdot\log|S||A|}\right)

where L^{*} is the total loss incurred by the best stationary deterministic policy in hindsight.

Proof.

We spend \gamma d steps whenever Algorithm 1 switches, where d is the critical length of the ADMDP. In all other steps, we receive the same loss as the cycle chosen by Algorithm 1. Thus, the total loss of our algorithm differs from that of Algorithm 1 by at most \gamma d\cdot N_{s}. From Theorem 4.3, both the regret and the expected number of switches N_{s} of Algorithm 1 are bounded by O\left(|S|\sqrt{L^{*}\log|S||A|}\right). Note that d is at most O(|S|^{2}). Thus, we get that

\text{Regret}\leq O\left(|S|^{3}\cdot\gamma\sqrt{L^{*}\cdot\log|S||A|}\right)

Remark 4.6.

To achieve the first order regret bound, we set \lambda in terms of L^{*}. We need prior knowledge of L^{*} to do this directly. This can be circumvented by using a doubling trick.
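For concreteness, here is a minimal sketch of the standard doubling trick (our own illustration; the make_fpl learner interface with step() and best_cumulative_loss() is hypothetical): run FPL with a guess B for L^{*}, and whenever the observed loss of the current best cycle exceeds the guess, double it and restart the learner with the recomputed \lambda.

import math

def lambda_for(budget, num_states, num_actions):
    # First-order FPL parameter for a guessed loss budget (cf. Theorem 4.3).
    return min(math.sqrt(math.log(num_states * num_actions) / max(budget, 1.0)),
               1.0 / (4.0 * num_states))

def run_with_doubling(make_fpl, horizon, num_states, num_actions):
    # Doubling trick wrapped around an FPL learner.  `make_fpl(lam)` is assumed
    # to return a fresh learner exposing step() (play one round) and
    # best_cumulative_loss() (loss of the best cycle so far); both hypothetical.
    budget = 1.0
    learner = make_fpl(lambda_for(budget, num_states, num_actions))
    for _ in range(horizon):
        learner.step()
        if learner.best_cumulative_loss() > budget:
            # the guess was too small: double it and restart with the new lambda
            budget *= 2.0
            learner = make_fpl(lambda_for(budget, num_states, num_actions))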

4.4 Regret Lower Bound for Deterministic MDPs

We now state a matching regret lower bound (up to polynomial factors).

Theorem 4.7.

For any algorithm \mathcal{A} and any |S|>3,|A|\geq 1, there exists an MDP M with |S| states and |A| actions and a sequence of losses \ell_{1},\ldots,\ell_{T} such that

R(\mathcal{A})\geq\Omega\left(\sqrt{|S|T\log|A|}\right)

where R(\mathcal{A}) is the regret incurred by \mathcal{A} on M with the given sequence of losses.

Proof.

Let M be an MDP with states labelled s_{0},s_{1},\ldots,s_{|S|-1}. Every action a takes state s_{i} to s_{i+1} (indices modulo |S|). In other words, the states are arranged in a cycle and every action moves from a state to the next state in the cycle. This is the required M.

Consider the problem of prediction with expert advice with n experts. We know that for any algorithm \mathcal{A}, there is a sequence of losses such that the regret of \mathcal{A} is \Omega(\sqrt{T\log n}) over T steps (see Cesa-Bianchi and Lugosi (2006)). In our case, every policy spends exactly \frac{T}{|S|} steps in each state. Thus, the interaction with M over T steps can be interpreted as a separate prediction-with-expert-advice problem at every state, where each such interaction lasts only \frac{T}{|S|} steps. We have the following decomposition of the regret.

R(\mathcal{A})=\sum_{i=0}^{|S|-1}\sum_{k=0}^{\frac{T}{|S|}-1}\left[\ell_{k|S|+i}\left(s_{i},a_{k|S|+i}\right)-\ell_{k|S|+i}\left(s_{i},\pi^{*}(s_{i})\right)\right] \quad (6)

In the above equation, a_{t} is the action taken by \mathcal{A} at step t, and \pi^{*} is the best stationary deterministic policy in hindsight.

From the regret lower bound for the experts problem, we know that there exists a sequence of losses such that for each i, the inner sum of (6) is at least \Omega\left(\sqrt{\frac{T}{|S|}\log|A|}\right). By combining these loss sequences, we get a sequence of losses such that

R(\mathcal{A})\geq\sum_{i=0}^{|S|-1}\Omega\left(\sqrt{\frac{T}{|S|}\log|A|}\right)\geq\Omega\left(\sqrt{|S|T\log|A|}\right).

This completes the proof. ∎

5 Stochastic Transitions

In the previous sections, we only considered deterministic transitions. We now present an algorithm that achieves low regret for the more general class of communicating MDPs (with an additional mild restriction). This algorithm achieves O(\sqrt{T}) regret but takes time exponential in |S| to run.

Assumption 5.1.

The MDP M has a state s^{*} and an action a such that

Pr(s_{t+1}=s^{*}\mid s_{t}=s^{*},a_{t}=a)=1

In other words, there is some state s^{*} with a deterministic action that allows us to stay in s^{*}. This can be interpreted as a state with a “do nothing” action where we can wait before taking the next action.

We now state a theorem that guarantees the existence of a number \ell^{*} such that every state can be reached from s^{*} in exactly \ell^{*} steps with reasonably high probability.

Theorem 5.2.

In an MDP satisfying Assumption 5.1, there exists \ell^{*}\leq 2D such that, for every target state s^{\prime}, there is a policy \pi_{s^{\prime}} with

p_{s^{\prime}}=Pr[T(s^{\prime}\mid M,\pi_{s^{\prime}},s^{*})=\ell^{*}]\geq\frac{1}{4D}

We first prove an intermediate lemma.

Lemma 5.3.

For any start state s and target s^{\prime}\neq s, there exist \ell_{s,s^{\prime}}\leq 2D and a policy \pi such that

Pr[T(s^{\prime}\mid M,\pi,s)=\ell_{s,s^{\prime}}]\geq\frac{1}{4D}
Proof.

From the definition of the diameter, we are guaranteed a policy \pi such that

\mathbb{E}\left[T(s^{\prime}\mid M,\pi,s)\right]\leq D

From Markov's inequality, we have

Pr\left[T(s^{\prime}\mid M,\pi,s)\leq 2D\right]\geq\frac{1}{2}

Since T(s^{\prime}\mid M,\pi,s) can take at most 2D distinct values on this event, there exists \ell_{s,s^{\prime}}\leq 2D such that

Pr[T(s^{\prime}\mid M,\pi,s)=\ell_{s,s^{\prime}}]\geq\frac{1}{2}\cdot\frac{1}{2D}=\frac{1}{4D}

We can now prove Theorem 5.2.

Proof of Theorem 5.2.

From Lemma 5.3, for each s^{\prime}\neq s^{*} we get \ell_{s^{\prime}}\leq 2D and a policy \pi_{s^{*},s^{\prime}} that hits the state s^{\prime} at time exactly \ell_{s^{\prime}} with probability at least \frac{1}{4D}. We take \ell^{*}=\max_{s^{\prime}\neq s^{*}}\ell_{s^{\prime}}. For target state s^{\prime}, the policy \pi_{s^{\prime}} loops at state s^{*} for \ell^{*}-\ell_{s^{\prime}} time steps (using the “do nothing” action) and then starts following policy \pi_{s^{*},s^{\prime}}. Clearly, this policy hits state s^{\prime} at time \ell^{*} with probability at least \frac{1}{4D}. ∎

Remark 5.4.

The policies guaranteed by Theorem 5.2 are not stationary.

Let p^{*}=\min_{s^{\prime}}p_{s^{\prime}}. Clearly, p^{*}\geq\frac{1}{4D}.

5.1 Algorithm

We extend the algorithm we used in the deterministic MDP case.

We use a low-switching algorithm (FPL) that treats each policy \pi\in\Pi as an expert. We know from Kalai and Vempala (2005) that FPL achieves O(\sqrt{T\log n}) regret as well as switching cost. At time t, we receive loss function \ell_{t} from the adversary. Using this, we construct \hat{\ell}_{t} as

\hat{\ell}_{t}(\pi)=\mathbb{E}\left[\ell_{t}(s_{t},a_{t})\right]

where s_{1}\sim d_{1}, a_{t}\sim\pi(s_{t},\cdot) and s_{t+1}\sim P(s_{t},a_{t},\cdot).

In other words, \hat{\ell}_{t}(\pi) is the expected loss at time t if we follow the policy \pi from the start of the game. Here d_{1} is the initial distribution over states.

We feed \hat{\ell}_{t} as the loss to FPL.
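Since the transition kernel is known, \hat{\ell}_{t}(\pi) can be computed exactly by propagating the state distribution of \pi forward. The following Python sketch (our own; the array layout is an assumption) computes \hat{\ell}_{t}(\pi) for all t for a deterministic policy.

import numpy as np

def surrogate_losses(P, policy, d1, losses):
    # Compute hat_ell_t(pi) for t = 1..T by propagating the state distribution
    # of a deterministic policy `policy` (array: state -> action) through the
    # known transition kernel P[s, a, s'].  `losses` has shape (T, S, A).
    T, S, _ = losses.shape
    dist = d1.copy()               # distribution of s_t under pi, starting from d_1
    hat = np.zeros(T)
    for t in range(T):
        # expected loss at time t if pi had been followed from the start
        hat[t] = sum(dist[s] * losses[t, s, policy[s]] for s in range(S))
        # push the distribution one step forward under pi
        dist = np.array([sum(dist[s] * P[s, policy[s], sp] for s in range(S))
                         for sp in range(S)])
    return hat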

We can now rewrite L^{\pi} as

L^{\pi}=\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t}(s_{t},a_{t})\right]=\sum_{t=1}^{T}\mathbb{E}\left[\ell_{t}(s_{t},a_{t})\right]=\sum_{t=1}^{T}\hat{\ell}_{t}(\pi)

where s_{1}\sim d_{1} and a_{t}\sim\pi(s_{t},\cdot). Let \pi_{t} be the policy chosen by FPL at time t. We know that

\mathbb{E}\left[\sum_{t=1}^{T}\hat{\ell}_{t}(\pi_{t})\right]-\sum_{t=1}^{T}\hat{\ell}_{t}(\pi)\leq O(\sqrt{T\log|\Pi|})

for any deterministic policy \pi.

We need our algorithm to receive loss close to the first term in the above expression. If this is possible, we have an O(\sqrt{T}) regret bound for online learning in the MDP. We now present an approach to do this.

5.1.1 Catching a policy

When FPL switches its policy, we cannot immediately start receiving the losses of the new policy; if we could, the regret of our algorithm would match that of FPL. When implementing the policy switch in our algorithm, we suffer a delay before starting to incur the losses of the new policy (in an expected sense). Our goal is to make this delay as small as possible. This, coupled with the fact that FPL makes a small number of switches, gives us good regret bounds. Note that this was easily done in the deterministic case using Theorem 3.2. Theorem 5.2 acts as a stochastic analogue of Theorem 3.2, and we use it to reduce the time taken to catch the new policy.

Function Switch_Policy(s, \pi, t_{0}):
       Done = 0
       t=t_{0}+1
        // t_{0} is the time at which FPL switched policy
       S_{t}=s
        // S_{t} stores the state at time t
       while Done\neq 1 do
             Move to state s^{*} using the policy that minimizes the expected hitting time of s^{*}
              // say this takes k steps
             t=t+k
             Sample T_{t+\ell^{*}} from d_{\pi}^{t+\ell^{*}}(\cdot)
             We set T_{t+\ell^{*}} as the target state
             Use the policy \pi_{T_{t+\ell^{*}}} guaranteed by Theorem 5.2 to move \ell^{*} steps from s^{*}
             t=t+\ell^{*}
             if S_{t}=T_{t} then
                   Sample a Bernoulli random variable I with Pr[I=1]=\frac{p^{*}}{p_{S_{t}}}
                   if I=1 then
                         Start following \pi and set Done to 1
                         Let the time at which this happens be T_{switch}
                   else
                         Continue
             else
                   Continue

Function Main:
       Let \pi_{1}^{\text{FPL}} be the expert chosen by FPL at time 1
       \pi_{1}=\pi_{1}^{\text{FPL}}
       Let S_{1} be the start state
       t=1
       while t\neq T+1 do
             Sample a_{t} from \pi_{t}(s_{t},\cdot)
             Adversary returns loss function \ell_{t} and next state s; set S_{t+1}=s
             Compute \hat{\ell}_{t} and feed it as the loss to FPL as discussed before
             if FPL switches policy then
                   Switch_Policy(s,\pi_{t+1}^{\text{FPL}},t+1)
                    // call Switch_Policy to catch the new policy
                   \pi_{t+k}=\pi_{t+1}^{\text{FPL}}
                    // k is the number of steps taken by Switch_Policy
                   t=t+k
             else
                   \pi_{t+1}=\pi_{t}
                   t=t+1

Algorithm 3 Low Regret Algorithm For Communicating MDPs
Remark 5.5.

In Algorithm 3, if FPL switches its policy in the middle of Switch_Policy's execution, we terminate the execution and call the routine again with the new target policy.
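One attempt of the catching loop can be pictured as the following rejection-sampling step. This is a sketch only: the env interface (run_policy), the dictionary policies_to_targets of the policies from Theorem 5.2, and the dictionary p of the probabilities p_{s} are our own hypothetical names.

def try_to_catch(env, policies_to_targets, d_pi, p, p_star, ell_star, rng):
    # One attempt of the Switch_Policy routine.  Returns True if the new policy
    # has been "caught", i.e. the current state is distributed as d_pi at the
    # arrival time.  d_pi is the target distribution d_pi^{t + ell*}.
    target = rng.choice(len(d_pi), p=d_pi)      # sample T_{t+ell*} ~ d_pi^{t+ell*}
    state = env.run_policy(policies_to_targets[target], ell_star)  # ell* steps from s*
    if state != target:
        return False                             # missed the target: try again
    # accept with probability p*/p_s so that every state is reached with the
    # same overall probability p*, independently of the target drawn
    return rng.random() < p_star / p[state]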

5.2 Analysis

The following lemma shows that the Switch_Policy routine works correctly. That is, after the execution of the routine, the state distribution is exactly the same as the state distribution of the new policy.

Lemma 5.6.

If Switch_Policy terminates at time t, we have that

Pr[S_{t}=s\mid T_{switch}=t]=d_{\pi}^{t}(s)

where d_{\pi}^{t}(s) is the probability of being in state s after following policy \pi from the start of the game.

Proof of Lemma 5.6.

We want to compute Pr[S_{t}=s\mid T_{switch}=t].

Pr[S_{t}=s\mid T_{switch}=t]=\frac{Pr[S_{t}=s,T_{switch}=t]}{Pr[T_{switch}=t]}
=\frac{Pr[S_{t}=T_{t}=s,T_{switch}=t]}{Pr[T_{switch}=t]}
=\frac{Pr[T_{t}=s,S_{t}=s,T_{switch}=t]}{Pr[T_{switch}=t]}

We first compute the denominator Pr[T_{switch}=t] as follows.

Pr[T_{switch}=t]=\sum_{s\in S}Pr[S_{t}=T_{t}=s,S_{t-\ell^{*}}=s^{*}]\cdot Pr[T_{switch}=t\mid S_{t}=T_{t}=s,S_{t-\ell^{*}}=s^{*}]
=\sum_{s\in S}Pr[S_{t}=s\mid T_{t}=s,S_{t-\ell^{*}}=s^{*}]\cdot Pr[T_{t}=s,S_{t-\ell^{*}}=s^{*}]\cdot Pr[T_{switch}=t\mid S_{t}=T_{t}=s,S_{t-\ell^{*}}=s^{*}]
=\sum_{s\in S}p_{s}\cdot Pr[T_{t}=s,S_{t-\ell^{*}}=s^{*}]\cdot\frac{p^{*}}{p_{s}}
=p^{*}\sum_{s\in S}Pr[T_{t}=s,S_{t-\ell^{*}}=s^{*}]
=p^{*}\cdot Pr[S_{t-\ell^{*}}=s^{*}]

Now we calculate the numerator.

Pr[T_{t}=s,S_{t}=s,T_{switch}=t]=Pr[T_{t}=s,S_{t}=s,S_{t-\ell^{*}}=s^{*},T_{switch}=t]
=Pr[S_{t}=s,T_{switch}=t\mid S_{t-\ell^{*}}=s^{*},T_{t}=s]\cdot Pr[S_{t-\ell^{*}}=s^{*},T_{t}=s]
=p^{*}\cdot Pr[S_{t-\ell^{*}}=s^{*}]\cdot Pr[T_{t}=s\mid S_{t-\ell^{*}}=s^{*}]
=p^{*}\cdot Pr[S_{t-\ell^{*}}=s^{*}]\cdot d_{\pi}^{t}(s)

Thus, we have

Pr[S_{t}=s\mid T_{switch}=t]=d_{\pi}^{t}(s)

We now bound the expected loss of the algorithm in the period during which FPL chooses policy \pi.

Lemma 5.7.

Let the policy chosen by FPL be \pi from time t_{1} to t_{2}. We have that

\mathbb{E}\left[\sum_{t=t_{1}}^{t_{2}}\ell_{t}(s_{t},a_{t})\right]\leq 48\cdot D^{2}+\sum_{t=t_{1}}^{t_{2}}\hat{\ell}_{t}(\pi)
Proof of Lemma 5.7.

We bound the expectation using the law of total expectation, conditioning on T_{switch}.

\mathbb{E}\left[\sum_{t=t_{1}}^{t_{2}}\ell_{t}(s_{t},a_{t})\right]=\mathbb{E}\left[\mathbb{E}\left[\sum_{t=t_{1}}^{t_{2}}\ell_{t}(s_{t},a_{t})\mid T_{switch}\right]\right]

We bound the conditional expectation.

\mathbb{E}\left[\sum_{t=t_{1}}^{t_{2}}\ell_{t}(s_{t},a_{t})\mid T_{switch}=t^{*}\right]\leq(t^{*}-t_{1})+\mathbb{E}\left[\sum_{t=t^{*}}^{t_{2}}\ell_{t}(s_{t},a_{t})\mid T_{switch}=t^{*}\right]

From Lemma 5.6, the second term is equal to \sum_{t=t^{*}}^{t_{2}}\hat{\ell}_{t}(\pi). Thus,

\mathbb{E}\left[\sum_{t=t_{1}}^{t_{2}}\ell_{t}(s_{t},a_{t})\right]\leq\mathbb{E}[T_{switch}-t_{1}]+\sum_{t=t_{1}}^{t_{2}}\hat{\ell}_{t}(\pi)

Every time we try to catch the policy from state s^{*}, we succeed with probability p^{*}\geq\frac{1}{4D}, so the expected number of attempts is at most 16\cdot D, and each attempt takes \ell^{*}\leq 2D steps. Between attempts, we move at most D steps in expectation to reach s^{*} again. Thus, in total, we have

\mathbb{E}[T_{switch}-t_{1}]\leq 16D^{2}+32D^{2}=48D^{2}

This completes the proof. ∎

We are now ready to bound the regret of Algorithm 3.

Theorem 5.8.

The regret of Algorithm 3 is at most O\left(D^{2}\sqrt{T\log|\Pi|}\right).

Proof.

We condition on the number of switches made by FPL. Let N_{s} be the random variable corresponding to the number of switches made by FPL. We refer to Algorithm 3 as \mathcal{A}.

L(\mathcal{A})=\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t}(s_{t},a_{t})\right]
=\mathbb{E}\left[\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t}(s_{t},a_{t})\mid N_{s}\right]\right]

After each switch, Lemma 5.7 tells us that the algorithm suffers at most 48\cdot D^{2} additional expected loss compared to the loss of FPL. Thus,

L(\mathcal{A})\leq\mathbb{E}\left[48\cdot D^{2}\cdot N_{s}+\sum_{t=1}^{T}\hat{\ell}_{t}(\pi_{t})\right]

Here \pi_{t} is the policy chosen by FPL at time t. Since FPL is a low-switching algorithm, we have \mathbb{E}[N_{s}]\leq O(\sqrt{T\log|\Pi|}). The second term in the expectation is at most L^{\pi}+O(\sqrt{T\log|\Pi|}) for any deterministic policy \pi, because FPL is a low regret algorithm. Thus, we have

L(\mathcal{A})-L^{\pi}\leq O(D^{2}\sqrt{T\log|\Pi|})

for all stationary \pi.

Thus, R(\mathcal{A})\leq O\left(D^{2}\sqrt{T\log|\Pi|}\right). ∎

When \Pi is the set of stationary deterministic policies, we have |\Pi|\leq|A|^{|S|}. Thus, we get the following theorem.

Theorem 5.9.

Given a communicating MDP satisfying Assumption 5.1 with |S| states, |A| actions and diameter D, the regret of Algorithm 3 can be bounded by

\text{Regret}\leq O\left(D^{2}\sqrt{T|S|\log|A|}\right)

In fact, since we are using FPL as the expert algorithm, we can get first-order bounds similar to Theorem 4.3. In a setting with n experts where m is the total loss of the best expert, the regret and the number of switches can both be bounded by O(\sqrt{m\cdot\log n}). Using this, we get the following first order regret bound for Algorithm 3.

Theorem 5.10.

Given a communicating MDP satisfying Assumption 5.1 with |S| states, |A| actions and diameter D, the regret of Algorithm 3 can be bounded by

\text{Regret}\leq O\left(D^{2}\sqrt{L^{*}\cdot|S|\log|A|}\right)

where L^{*} is the total expected loss incurred by the best stationary deterministic policy in hindsight.

5.3 Oracle-efficient algorithm assuming exploring starts

In this section, we assume that the initial distribution over states, d_{1}, has probability mass at least \alpha on every state; that is, Pr[S_{1}=s]\geq\alpha for all s\in S. We also assume that we have an oracle \mathcal{O} that can find the stationary deterministic policy with minimum cumulative loss at no computational cost. We use this oracle and the ideas from the previous sections to design a low regret FPL-style algorithm.

5.3.1 Oracle

The oracle \mathcal{O} takes in loss functions \ell_{1},\ldots,\ell_{T} and outputs the stationary deterministic policy with the lowest expected cumulative loss. That is, it returns the policy \pi=\operatorname*{arg\,min}_{\pi\in\Pi}L^{\pi}, where

L^{\pi}=\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t}(s_{t},a_{t})\right]

with s_{1}\sim d_{1} and a_{t}=\pi(s_{t}).

We say that an algorithm is oracle-efficient if it runs in polynomial time when given access to the oracle \mathcal{O}.

5.3.2 Algorithm Sketch

Sample a perturbation vector \epsilon\in\mathbb{R}^{|S||A|} from an exponential distribution with parameter \lambda
t=1
while t\neq T+1 do
       \pi_{t}=\operatorname*{arg\,min}_{\pi\in\Pi}\mathbb{E}\left[\epsilon(s_{1},a_{1})+\sum_{i=1}^{t-1}\ell_{i}(s_{i},a_{i})\right]
       with s_{1}\sim d_{1} and a_{i}=\pi(s_{i})
       Adversary returns loss function \ell_{t}
       t=t+1
Algorithm 4 Black Box FPL algorithm used for Communicating MDPs with exploring starts
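Because the perturbation in Algorithm 4 is charged only at the first state-action pair (s_{1},a_{1}), the \operatorname*{arg\,min} can be reduced to a single oracle call by folding \epsilon into the first observed loss function. The Python sketch below is our own illustration; the oracle callable is the abstraction from Section 5.3.1 (its implementation is not specified), and it is assumed to accept loss values outside [0,1].

import numpy as np

def perturbed_leader(oracle, losses_so_far, eps):
    # One argmin step of Algorithm 4 expressed through a single call to the
    # oracle O.  `losses_so_far` is the list ell_1..ell_{t-1}, each an (S, A)
    # array, and `eps` is the (S, A) perturbation.  Since both ell_1 and eps are
    # evaluated at (s_1, a_1), adding eps to ell_1 (or using eps alone when
    # t = 1) yields exactly the perturbed objective.
    if losses_so_far:
        perturbed = [losses_so_far[0] + eps] + list(losses_so_far[1:])
    else:
        perturbed = [eps]
    return oracle(perturbed)

# usage sketch: draw eps once before the interaction, i.i.d. Exp(lambda) per (s, a)
# rng = np.random.default_rng(0)
# eps = rng.exponential(scale=1.0 / lam, size=(num_states, num_actions))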

Now, we use Algorithm 4 as our black box experts algorithm. We prove that Algorithm 4 has low regret.

Theorem 5.11.

The regret and the expected number of switches of Algorithm 4 can be bounded by O\left(\sqrt{\frac{L^{*}|S|\log|S||A|}{\alpha}}\right).

Proof.

Let L^{\pi} denote the total cumulative loss if we followed policy \pi from the start of the interaction, and let \tilde{L}^{\pi} denote the corresponding total perturbed cumulative loss. Let \pi^{*} be the policy with the lowest total cumulative loss, and let \tilde{\pi}^{*} be the policy with the lowest perturbed cumulative loss. Let \tilde{L}^{\pi}_{t} be the total perturbed cumulative loss up to time t, and let \pi_{t} be the policy chosen by the FPL algorithm at step t.

Let N_{s} be the number of times the oracle switches the best policy. As before, we treat each policy as an expert and consider the online learning problem in which expert \pi receives loss \hat{\ell}_{t}(\pi)=\mathbb{E}\left[\ell_{t}(s_{t},a_{t})\right], where s_{1}\sim d_{1} and a_{t}=\pi(s_{t}).

Using the arguments from the proof of Theorem 4.3, we get

\mathbb{E}\left[\text{total loss of FPL}\right]\leq\tilde{L}^{\tilde{\pi}^{*}}+N_{s}.

Also, we have \tilde{L}^{\pi}=L^{\pi}+\sum_{s\in S}d_{1}(s)\cdot\epsilon\left(s,\pi(s)\right).

We know that N_{s}=\sum_{t=1}^{T-1}Pr[\pi_{t+1}\neq\pi_{t}]. We now bound Pr[\pi_{t+1}\neq\pi_{t}]. Let \pi_{t}=\pi. The algorithm chooses some \pi^{\prime}\neq\pi as \pi_{t+1} only if \tilde{L}^{\pi}_{t}\geq\tilde{L}^{\pi^{\prime}}_{t}. We now argue that the probability of this happening is low given that \pi_{t}=\pi. Since \pi^{\prime}\neq\pi, we have \pi^{\prime}(s)\neq\pi(s) for some s. Let the smallest state (under a fixed ordering of states) in which \pi and \pi^{\prime} differ be called d(\pi,\pi^{\prime}). Thus,

Pr[\pi_{t+1}\neq\pi\mid\pi_{t}=\pi]=\sum_{s\in S}Pr[d(\pi_{t+1},\pi)=s\mid\pi_{t}=\pi].

We bound Pr[d(\pi,\pi_{t+1})=s\mid\pi_{t}=\pi] for any state s. Consider any policy \pi^{\prime} that differs from \pi in state s. The perturbation \epsilon(s,\pi(s)) plays a role in the comparison of the perturbed losses of all such \pi^{\prime} with \pi. Since \pi_{t}=\pi, we have d_{1}(s)\cdot\epsilon(s,\pi(s))\geq w for some w that depends only on the perturbations and losses received by \pi and by the policies \pi^{\prime} that differ from \pi in state s. If d_{1}(s)\cdot\epsilon(s,\pi(s))\geq w+\hat{\ell}_{t}(\pi), then we would not switch to a policy \pi^{\prime} with \pi^{\prime}(s)\neq\pi(s). Thus,

Pr[d(\pi,\pi_{t+1})\neq s\mid\pi_{t}=\pi]\geq Pr[d_{1}(s)\cdot\epsilon(s,\pi(s))\geq w+\hat{\ell}_{t}(\pi)\mid d_{1}(s)\cdot\epsilon(s,\pi(s))\geq w]
\geq 1-\lambda\hat{\ell}_{t}(\pi)\cdot\frac{1}{d_{1}(s)}

Since d_{1}(s) is at least \alpha, we have that Pr[\pi_{t+1}\neq\pi\mid\pi_{t}=\pi] is at most \frac{|S|}{\alpha}\cdot\lambda\hat{\ell}_{t}(\pi). From this, we get

N_{s}\leq\frac{|S|}{\alpha}\cdot\lambda\cdot\mathbb{E}[\text{total loss of FPL}].

Using arguments similar to Section 4.2.2, we get

\mathbb{E}[\text{total loss of FPL}]\leq L^{\pi^{*}}+\frac{1+\log|S||A|}{\lambda}+\frac{|S|}{\alpha}\cdot\lambda\cdot\mathbb{E}[\text{total loss of FPL}] \quad (7)

Let L^{*}=L^{\pi^{*}}.

On rearranging and simplifying Equation 7 as in the proof of Theorem 4.3, we have

\mathbb{E}[\text{total loss of FPL}]\leq L^{*}\left(1+\frac{2}{\alpha}\lambda|S|\right)+4\frac{\log|S||A|}{\lambda}

The above inequality holds when \frac{\lambda}{\alpha}|S|\leq\frac{1}{2}. Thus, we have

\mathbb{E}[\text{total loss of FPL}]-L^{*}\leq 2\frac{\lambda}{\alpha}|S|L^{*}+4\frac{\log|S||A|}{\lambda}.

Set \lambda=\min\left(\sqrt{\frac{\alpha\log|S||A|}{|S|L^{*}}},\frac{\alpha}{2|S|}\right). On substituting \lambda into the above equation, we get that

\text{Regret}\leq O\left(\sqrt{\frac{L^{*}|S|\log|S||A|}{\alpha}}\right).

Since the expected number of switches is at most \frac{|S|}{\alpha}\cdot\lambda\cdot\mathbb{E}[\text{total loss of FPL}], it is also bounded by O\left(\sqrt{\frac{L^{*}|S|\log|S||A|}{\alpha}}\right). ∎

The rest of the algorithm is the same as Algorithm 3. The exploring start assumption allows us to get an efficient low-regret, low-switching algorithm assuming access to the oracle \mathcal{O}. We now state the regret bound for this algorithm.

Theorem 5.12.

Given a communicating MDP satisfying Assumption 5.1, with a start distribution placing probability at least \alpha on every state, and given access to the oracle \mathcal{O}, we have an efficient algorithm with

\text{Regret}\leq O\left(D^{2}\sqrt{\frac{L^{*}|S|\log|S||A|}{\alpha}}\right)
Proof.

The proof is exactly the same as that of Theorem 5.8 except that we use the switching cost bound from Theorem 5.11. ∎

6 Conclusion

We considered learning in a communicating MDP with adversarially chosen costs in the full information setting. We gave an efficient algorithm that achieves O(\sqrt{T}) regret when transitions are deterministic. We also presented an inefficient algorithm that achieves an O(\sqrt{T}) regret bound for the general stochastic case with an extra mild assumption. Our results show that in the full information setting there is no statistical price (as far as the time dependence is concerned) for the extension from the vanilla online learning with experts problem to the problem of online learning over communicating MDPs.

Several interesting questions still remain open. First, what are the best lower bounds in the general (i.e., not necessarily deterministic) communicating setting? In the deterministic setting, the diameter is bounded polynomially by the number of states; this is no longer true in the stochastic case. The best lower bound in terms of the diameter and other relevant quantities (|S|, |A| and T) still remains to be worked out. Second, is it possible to design an efficient algorithm beyond the deterministic case with fewer assumptions? The source of inefficiency in our algorithm is that we run FPL with each policy as an expert and perturb the losses of each policy independently. It is plausible that an FPL algorithm that perturbs losses (as in the deterministic case) can also be analyzed. However, there are challenges in its analysis as well as in proving that it is computationally efficient; for example, we are not aware of any efficient way to compute the best deterministic policy in hindsight for the general communicating case. This leads us to another open question: are there oracle-efficient O(\sqrt{T}) regret algorithms for online learning over communicating MDPs? Dai et al. (2022) give an oracle-efficient O(T^{5/6}) regret algorithm, but it works for bandits as well and does not use the additional information available in the full information case.

References

  • Arora et al. [2012a] R. Arora, O. Dekel, and A. Tewari. Deterministic mdps with adversarial rewards and bandit feedback. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI’12, page 93–101, Arlington, Virginia, USA, 2012a. AUAI Press. ISBN 9780974903989.
  • Arora et al. [2012b] R. Arora, O. Dekel, and A. Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. Proceedings of the 29th International Conference on Machine Learning, ICML 2012, 2, 06 2012b.
  • Auer et al. [1995] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331, 1995. doi: 10.1109/SFCS.1995.492488.
  • Bazaraa et al. [2004] M. S. Bazaraa, J. J. Jarvis, and H. D. Sherali. Linear Programming and Network Flows. Wiley-Interscience, USA, 2004. ISBN 0471485993.
  • Bremaud [2000] P. Bremaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer, 2000.
  • Cesa-Bianchi and Lugosi [2006] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006. ISBN 978-0-521-84108-5.
  • Dai et al. [2022] Y. Dai, H. Luo, and L. Chen. Follow-the-perturbed-leader for adversarial markov decision processes with bandit feedback. arXiv preprint arXiv:2205.13451, 2022.
  • Dekel and Hazan [2013] O. Dekel and E. Hazan. Better rates for any adversarial deterministic mdp. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML’13, page III–675–III–683. JMLR.org, 2013.
  • Dekel et al. [2013] O. Dekel, J. Ding, T. Koren, and Y. Peres. Bandits with switching costs: t^{2/3} regret. Proceedings of the Annual ACM Symposium on Theory of Computing, 10 2013. doi: 10.1145/2591796.2591868.
  • Denardo [1977] E. V. Denardo. Periods of connected networks and powers of nonnegative matrices. Mathematics of Operations Research, 2(1):20–24, 1977. ISSN 0364765X, 15265471. URL http://www.jstor.org/stable/3689120.
  • Even-Dar et al. [2009] E. Even-Dar, S. M. Kakade, and Y. Mansour. Online markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009. ISSN 0364765X, 15265471. URL http://www.jstor.org/stable/40538442.
  • Kalai and Vempala [2005] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005. ISSN 0022-0000. doi: https://doi.org/10.1016/j.jcss.2004.10.016. URL https://www.sciencedirect.com/science/article/pii/S0022000004001394. Learning Theory 2003.
  • Neu et al. [2014] G. Neu, A. György, C. Szepesvári, and A. Antos. Online markov decision processes under bandit feedback. IEEE Trans. Autom. Control., 59(3):676–691, 2014. doi: 10.1109/TAC.2013.2292137. URL https://doi.org/10.1109/TAC.2013.2292137.
  • Ortner [2010] R. Ortner. Online regret bounds for markov decision processes with deterministic transitions. Theoretical Computer Science, 411(29):2684–2695, 2010. ISSN 0304-3975. doi: https://doi.org/10.1016/j.tcs.2010.04.005. URL https://www.sciencedirect.com/science/article/pii/S0304397510002008. Algorithmic Learning Theory (ALT 2008).