Learning in Online MDPs:
Is there a Price for Handling the Communicating Case?
Abstract
It is a remarkable fact that the same $O(\sqrt{T})$ regret rate can be achieved in both the Experts Problem and the Adversarial Multi-Armed Bandit problem, albeit with a worse dependence on the number of actions in the latter case. In contrast, it has been shown that handling online MDPs with communicating structure and bandit information incurs $\widetilde{\Omega}(T^{2/3})$ regret even in the case of deterministic transitions. Is this the price we pay for handling communicating structure, or is it because we also have bandit feedback? In this paper we show that with full information, online MDPs can still be learned at an $O(\sqrt{T})$ rate even in the presence of communicating structure. We first show this by proposing an efficient follow the perturbed leader (FPL) algorithm for the deterministic transition case. We then extend our scope to consider stochastic transitions, where we first give an inefficient $O(\sqrt{T})$-regret algorithm (with a mild additional condition on the dynamics). Then we show how to achieve an $O(\sqrt{T})$ regret rate using an oracle-efficient algorithm, but with the additional restriction that the starting state distribution has probability mass bounded away from zero on each state.
1 Introduction
In this work, we study online learning in Markov Decision Processes (MDPs). In this setting, we have an agent interacting with an adversarial environment. The agent observes the state of the environment and takes an action. The action incurs an associated loss and the environment moves to a new state. The state transition dynamics are assumed to be Markovian, i.e., the probability distribution of the new state is fully determined by the action and the old state. The transition dynamics are fixed and known to the learner in advance. However, the losses are chosen by the adversary. The adversary is assumed to be oblivious (the entire loss sequence is chosen before the interaction begins). We assume that the environment reveals full information about the losses at a given time step to the agent after the corresponding action is taken. The total loss incurred by the agent is the sum of the losses incurred in each step of the interaction. We denote the set of states by $\mathcal{S}$ and the set of actions by $\mathcal{A}$. The objective of the agent is to minimize its total loss.
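To make the interaction protocol concrete, the following minimal Python sketch simulates it; the `learner` interface, helper names, and data layout are illustrative assumptions rather than anything prescribed by the paper.

```python
import numpy as np

def run_online_mdp(P, losses, learner, s0, rng):
    """Simulate the online MDP protocol with full-information feedback.

    P       : array of shape (S, A, S); P[s, a, s'] = transition probability.
    losses  : list of length T of (S, A) arrays, chosen obliviously by the adversary.
    learner : object with act(state) -> action and observe(loss_matrix).
    s0      : start state.  rng : numpy Generator, e.g. np.random.default_rng().
    """
    s, total = s0, 0.0
    for loss_t in losses:
        a = learner.act(s)                       # learner picks an action
        total += loss_t[s, a]                    # incurs the loss of the (state, action) pair
        learner.observe(loss_t)                  # full information: whole loss function revealed
        s = rng.choice(len(P[s, a]), p=P[s, a])  # environment moves to a new state
    return total
```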
This setting was first studied in the seminal work of Even-Dar et al. (2009). They studied the restricted class of ergodic MDPs, where every policy induces a Markov chain with a single recurrent class. They designed an efficient algorithm (running in time polynomial in the MDP parameters and the length of the interaction) that achieved $O(\sqrt{T})$ regret with respect to the best stationary policy in hindsight. They assumed full information about the losses and that the MDP dynamics were known beforehand. This work was extended to bandit feedback by Neu et al. (2014), under an additional assumption on the minimum stationary probability mass in any state. They also achieved a regret bound of $O(\sqrt{T})$. Bandit feedback is a harder model in which the learner only receives information about its own losses.
| | No state | Ergodic | Communicating |
|---|---|---|---|
| Full Info | $\Theta(\sqrt{T})$ | $O(\sqrt{T})$ | $O(\sqrt{T})$ (THIS PAPER) |
| Bandit Info | $\Theta(\sqrt{T})$ | $O(\sqrt{T})$ (with an extra assumption) | $\Theta(T^{2/3})$ |

Table 1: Achievable regret rates (dependence on $T$ only) for online learning with adversarial losses, with and without state, under full and bandit information.
In this paper we will look at the more general class of communicating MDPs where, for any pair of states, there is a policy such that the expected time to reach the second state from the first is finite. In the case of bandit feedback with deterministic transitions, Dekel and Hazan (2013) designed an algorithm that achieved $\widetilde{O}(T^{2/3})$ regret. This regret bound was proved to be tight by a matching lower bound in Dekel et al. (2013). This regret lower bound was proved by a reduction from the problem of adversarial multi-armed bandits with switching costs. In this setting, the agent incurs an additional cost every time it switches the arm it plays. Their lower bound definitively proves that in the case of bandit information, online learning over communicating MDPs is statistically harder than the adversarial multi-armed bandit problem, for which we have $O(\sqrt{T})$ regret algorithms (Auer et al. (1995)). Their result gives rise to the natural question: is the high regret due to the communicating structure or bandit feedback (or both)? In the case of experts with switching costs, we have algorithms such as FPL (Kalai and Vempala (2005)) with $O(\sqrt{T})$ regret and $O(\sqrt{T})$ switches. Using this, we give an $O(\sqrt{T})$ regret algorithm for online learning in communicating MDPs with full information. Thus, we show that having communicating structure alone does not add any statistical price (see Table 1).
1.1 Our Contributions
In this paper, we show that online learning over MDPs with full information is not statistically harder (up to polynomial factors in the number of states and actions, and assuming the existence of a state with a “do nothing” action) than the problem of online learning with expert advice. In particular, we design an efficient algorithm that learns to act in communicating ADMDPs (MDPs with deterministic transitions) with $O(\sqrt{T})$ regret under full information feedback against the best deterministic policy in hindsight. This is the first $O(\sqrt{T})$-regret algorithm for this problem. We prove a matching $\Omega(\sqrt{T})$ regret lower bound (up to polynomial factors in the number of states and actions) in this setting. We also extend the techniques used in the previous algorithm to design an algorithm that runs in time exponential in the MDP parameters and achieves $O(\sqrt{T})$ regret for the general class of communicating MDPs (albeit under the mild “do nothing” assumption). Again, this is the first algorithm that achieves $O(\sqrt{T})$ regret against this large class of MDPs; before this, $O(\sqrt{T})$ regret algorithms were only known for ergodic MDPs. We also give an $O(\sqrt{T})$ regret algorithm for communicating MDPs whose start state distribution has probability mass bounded away from zero on each state; this algorithm is efficient when given access to an optimization oracle.
2 Related Work
The problem we study in this paper is commonly referred to in the literature as online learning in MDPs over an infinite horizon. This problem was first studied for MDPs with an ergodic transition structure: Even-Dar et al. (2009) and Neu et al. (2014) studied it under full information and partial information respectively. The former achieved $O(\sqrt{T})$ regret for all ergodic MDPs, whereas the latter achieved $O(\sqrt{T})$ regret for ergodic MDPs satisfying an additional assumption on the minimum stationary probability mass in any state. The problem of online learning in deterministic communicating MDPs (ADMDPs) was studied by Arora et al. (2012a) and Dekel and Hazan (2013). They consider bandit feedback and achieve $\widetilde{O}(T^{3/4})$ and $\widetilde{O}(T^{2/3})$ regret respectively.
As mentioned in the introduction, a closely related problem is that of online learning with switching costs. In the case of full information, algorithms like FPL (Kalai and Vempala (2005)) achieve $O(\sqrt{T})$ regret together with $O(\sqrt{T})$ switching cost. In the case of bandit feedback, Arora et al. (2012b) give an algorithm that achieves $O(T^{2/3})$ regret with switching cost. This was proved to be tight by Dekel et al. (2013), who proved a matching $\widetilde{\Omega}(T^{2/3})$ lower bound.
Subsequent to the release of an earlier version of this paper, Dai et al. (2022) gave inefficient and oracle-efficient $\widetilde{O}(T^{2/3})$ regret algorithms for online learning over communicating MDPs with bandit information. Their algorithms use our Switch_Policy procedure from Algorithm 3 and thus require the same “do nothing” assumption as us.
3 Preliminaries
Fix the finite state space $\mathcal{S}$, the finite action space $\mathcal{A}$, and the transition probability matrix $P$, where $P(s' \mid s, a)$ is the probability of moving from state $s$ to state $s'$ on action $a$.
In the case of an ADMDP (adversarial deterministic MDP), the transitions are deterministic, and hence the ADMDP can also be represented by a directed graph $G$ with vertices corresponding to the states in $\mathcal{S}$. The edges are labelled by the actions: an edge from $u$ to $v$ labelled by action $a$ exists in the graph when the ADMDP moves from state $u$ to state $v$ on action $a$. This edge will be referred to as $(u, a, v)$.
A (stationary) policy is a mapping $\pi : \mathcal{S} \to \Delta(\mathcal{A})$, where $\pi(a \mid s)$ denotes the probability of taking action $a$ when in state $s$. When the policy is deterministic, we overload the notation and define $\pi(s)$ to be the action taken when the state is $s$. The interaction starts in an arbitrary start state $s_1$. The adversary obliviously chooses a sequence of loss functions $\ell_1, \ldots, \ell_T$, where each $\ell_t$ is a map from $\mathcal{S} \times \mathcal{A}$ to $[0, 1]$.
An algorithm that interacts with the online MDP chooses the action to be taken at each time step. It maintains a probability distribution over actions, denoted $p_t(\cdot \mid s_t)$, which depends on the current state $s_t$ and the sequence of loss functions seen so far. The expected loss of the algorithm is
$$\hat{L}_T = \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(s_t, a_t)\right],$$
where $a_t \sim p_t(\cdot \mid s_t)$ and the expectation is over the algorithm's randomization and the transition dynamics. For a stationary policy $\pi$, the loss of the policy is
$$L_T(\pi) = \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(s_t^{\pi}, a_t^{\pi})\right],$$
where $(s_t^{\pi}, a_t^{\pi})$ is the state-action sequence obtained by following $\pi$ from the start state. The regret of the algorithm is defined as
$$R_T = \hat{L}_T - \min_{\pi} L_T(\pi).$$
The total expected loss of the best policy in hindsight is denoted by $L_T^{*}$. Thus, $L_T^{*} = \min_{\pi} L_T(\pi)$.
For any stationary policy $\pi$ and states $s, s'$, let $T^{\pi}_{s \to s'}$ be the random variable for the first time step at which $s'$ is reached when we start at state $s$ and follow policy $\pi$ in the MDP. We define the diameter of the MDP as
$$D = \max_{s \neq s'} \min_{\pi} \mathbb{E}\left[T^{\pi}_{s \to s'}\right].$$
A communicating MDP is an MDP with finite diameter, i.e., $D < \infty$.
3.1 Preliminaries on ADMDPs
In this section, we use the graph $G$ and the ADMDP interchangeably. A stationary deterministic policy $\pi$ induces a subgraph $G_{\pi}$ of $G$, where $(u, a, v)$ is an edge in $G_{\pi}$ if and only if $a = \pi(u)$ and the action $a$ takes state $u$ to state $v$.
A communicating ADMDP corresponds to a strongly connected graph. This is because the existence of a policy that takes state $u$ to state $v$ also implies the existence of a path from $u$ to $v$ in the graph $G$.
The subgraph $G_{\pi}$ induced by policy $\pi$ in the communicating ADMDP is the set of transitions that are possible under $\pi$. Each component of $G_{\pi}$ is either a cycle or an initial path followed by a cycle: start a walk from any state by following the policy $\pi$; since the set of states is finite, eventually a state must be repeated, and this forms the cycle.
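As a quick illustration of this structure, the sketch below (a hypothetical helper, not from the paper) walks a deterministic policy forward from a start state and returns the initial path and the cycle it eventually enters; `next_state` stands for the deterministic transition function of the ADMDP.

```python
def path_and_cycle(policy, next_state, start):
    """Follow a deterministic policy until a state repeats.

    policy     : dict state -> action.
    next_state : dict (state, action) -> state (deterministic ADMDP transitions).
    Returns (initial_path, cycle) as lists of states.
    """
    order, seen = [], {}
    s = start
    while s not in seen:
        seen[s] = len(order)          # remember the first index at which s was visited
        order.append(s)
        s = next_state[(s, policy[s])]
    i = seen[s]                        # index where the repeated state first occurred
    return order[:i], order[i:]        # initial path, then the cycle
```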
For a state $u$ and action $a$, let $N(u, a)$ denote the next state after visiting state $u$ and taking action $a$.
The period of a vertex $v$ in $G$ is the greatest common divisor of the lengths of all the cycles starting and ending at $v$. In a strongly connected graph, the periods of all vertices can be shown to be equal (Bremaud (2000), Chap. 2, Thm 4.2). Thus, the period of a strongly connected graph is well defined. If the period of $G$ is 1, we call $G$ aperiodic.
Let $\mathcal{C}_{u,\ell}$ be the set of all closed walks in $G$ of length $\ell$ whose start vertex is $u$. The elements of $\mathcal{C}_{u,\ell}$ are represented by the sequences of edges in the walks.
Note that the cycles induced by any stationary deterministic policy that are of length $\ell$ and contain the vertex $u$ will be in $\mathcal{C}_{u,\ell}$. However, $\mathcal{C}_{u,\ell}$ can also contain cycles not induced by policies (it can contain cycles that are not simple). We use $\mathcal{C}$ to denote $\bigcup_{u,\, \ell \leq |\mathcal{S}|} \mathcal{C}_{u,\ell}$. We sometimes loosely refer to the elements of $\mathcal{C}$ as cycles. For a cycle $c$, we define $a_t(c)$ to be the action taken by $c$ in the $t$-th step if we start following $c$ from the beginning of the interaction. Similarly, $\sigma_t(c)$ is the state that is reached after following $c$ for $t$ steps from the start of the interaction. We define $|c|$ as the length of the cycle $c$.
The vertices of a strongly connected graph with period $k$ can be partitioned into $k$ non-empty cycle classes $V_0, \ldots, V_{k-1}$, where each edge from a vertex in $V_i$ goes to a vertex in $V_{(i+1) \bmod k}$.
Theorem 3.1.
If $G$ is strongly connected and aperiodic, there exists a critical length $N_0$ such that for any $m \geq N_0$, there exists a path of length exactly $m$ in $G$ between any pair of vertices. Also, $N_0 \leq n^{2}$, where $n$ is the number of vertices in the graph.
The above theorem is from Denardo (1977). It guarantees the existence of an $N_0$ such that there are paths of length exactly $m$ between any pair of vertices, for every $m \geq N_0$. The following generalization from Dekel and Hazan (2013) extends the result to periodic graphs.
Theorem 3.2 (Dekel and Hazan (2013)).
If $G$ has period $k$, there exists a critical value $N$ such that for any integer $m \geq N$, there is a path in $G$ of length $mk$ from any state to any other state in the same cycle class.
Remark 3.3.
We can also find the paths of length $m$ from a given vertex $u$ to any other vertex $v$ efficiently. This can be done by constructing the path in the reverse direction. We look at $A^{m-1}$, the $(m-1)$-th boolean power of the adjacency matrix, to see all the predecessors of $v$ that have paths of length $m-1$ from $u$. We choose any of these as the penultimate vertex in the path and recurse.
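A straightforward (not optimized) rendering of this idea is sketched below, under the assumption that the graph is given as a 0/1 adjacency matrix; the helper name is illustrative.

```python
import numpy as np

def exact_length_path(adj, u, v, m):
    """Return a walk u -> ... -> v of length exactly m (list of m+1 vertices),
    or None if no such walk exists.  adj is a 0/1 adjacency matrix (numpy array)."""
    n = adj.shape[0]
    A = (adj > 0).astype(int)
    # reach[j][x, y] == 1 iff there is a walk of length j from x to y.
    reach = [np.eye(n, dtype=int)]
    for _ in range(m):
        reach.append(((reach[-1] @ A) > 0).astype(int))
    if not reach[m][u, v]:
        return None
    # Build the walk backwards, as in Remark 3.3: at each step pick any
    # predecessor of the current vertex that is reachable from u in one fewer step.
    path, cur = [v], v
    for j in range(m, 0, -1):
        preds = np.where((A[:, cur] > 0) & (reach[j - 1][u, :] > 0))[0]
        cur = int(preds[0])
        path.append(cur)
    return path[::-1]
```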
4 Deterministic Transitions
We now present our algorithm for online learning in ADMDPs when we have full information about the losses. We use $G$ to refer to the graph associated with the ADMDP.
We assume that the ADMDP dynamics are known to the agent. This assumption can be relaxed, as shown in Ortner (2010), since we can recover the dynamics in poly($|\mathcal{S}|, |\mathcal{A}|$) time when the transitions are deterministic. We want to minimize regret against the class of deterministic stationary policies.
4.1 Algorithm Sketch
We formulate the task of minimizing regret against the set of deterministic policies as a problem of prediction with expert advice. As observed earlier, deterministic policies induce a subgraph which is isomorphic to a cycle with an initial path. We keep an expert for each element of $\mathcal{C}_{u,\ell}$, for all states $u$ and lengths $\ell \leq |\mathcal{S}|$. Note that we do not keep an expert for policies which have an initial path before the cycle. This is because the loss of such a policy differs by at most $|\mathcal{S}|$ from the loss of its cycle. Also, we make sure that the start state of the cycle is in the same cycle class as the start state of the environment. If this is not the case, our algorithm will never be in phase with the expert policy. Henceforth, we will refer to these experts as cycles.
The loss incurred by cycle $c$ at time $t$ is equal to $\ell_t(\sigma_t(c), a_t(c))$, where $\sigma_t(c)$ and $a_t(c)$ are the state-action pair traversed by the cycle at time $t$ if we had followed it from the start of the interaction.
We first present an efficient (running time polynomial in $|\mathcal{S}|$, $|\mathcal{A}|$, and $T$) algorithm that achieves $O(\sqrt{T})$ regret and $O(\sqrt{T})$ expected switching cost against this class of experts. For this we use a Follow the Perturbed Leader (FPL) style algorithm.
We then use this low switching algorithm as a black box. Whenever the black box algorithm tells us to switch policies at time $t$, we compute the state that we would have reached if we had followed the new policy from the start of the interaction for $t + Nk$ steps. We then move to this state in $Nk$ steps; Theorem 3.2 guarantees the existence of a path of this length. We then start following the new policy.
Thus, our algorithm matches the moves of the expert cycles except when there is a switch in the policies. As a result, the regret of our algorithm differs from the regret of the black box algorithm by at most $Nk$ times the expected number of switches. A schematic view of the resulting main loop is sketched below.
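The following pseudocode-level Python sketch shows how the pieces fit together; `env`, `fpl`, `state_on_cycle`, and `path_actions` are assumed interfaces standing in for the environment, the low-switching subroutine of Algorithm 1, $\sigma_t(\cdot)$, and the path construction of Remark 3.3 (returned as an action sequence).

```python
def run_admdp_learner(env, fpl, state_on_cycle, path_actions, N, k, T):
    """Schematic main loop: play the cycle recommended by the low-switching FPL
    subroutine, and spend N*k steps catching up after each switch.

    env : exposes state() and step(action) -> loss function (full information).
    fpl : exposes current_cycle() and update(loss_fn) -> possibly new cycle.
    """
    cycle, pending = fpl.current_cycle(), []
    for t in range(T):
        if pending:
            action = pending.pop(0)              # still travelling to the new cycle
        else:
            action = cycle.action_at(t)          # in phase with the current cycle
        loss_fn = env.step(action)               # full-information feedback
        new_cycle = fpl.update(loss_fn)
        if new_cycle is not cycle:               # FPL switched experts
            target = state_on_cycle(new_cycle, t + 1 + N * k)
            pending = path_actions(env.state(), target, N * k)  # N*k-step catch-up path
            cycle = new_cycle
    return
```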
4.2 FPL algorithm
We now describe the FPL style algorithm that competes with the set of cycles described earlier, with $O(\sqrt{T})$ regret and $O(\sqrt{T})$ expected switching cost.
4.2.1 Finding the leader: Offline Optimization Algorithm
First, we design an offline algorithm that finds the cycle (including its start state) with the lowest cumulative loss up to time $t$, given the sequence of losses $\ell_1, \ldots, \ell_t$. This is the optimization step in Algorithm 1. Given a pair $(u, \ell)$, we find the best cycle among the cycles that start in state $u$ and have length $\ell$. For this we use a method similar to that used in Arora et al. (2012a). We then take the minimum over all pairs $(u, \ell)$ to find the best cycle. Note that we only consider start states that are in the same cycle class as the start state of the game.
We find the best cycle in $\mathcal{C}_{u,\ell}$ using linear programming. The LP is over the space $\mathbb{R}^{\ell \times |\mathcal{S}| \times |\mathcal{A}|}$, with one coordinate per (position, state, action) triple. Consider a cycle $c \in \mathcal{C}_{u,\ell}$. Let $\sigma_j(c)$ denote the $j$-th state in $c$ and let $a_j(c)$ be the action taken at that state. We associate an indicator vector $x^{c}$ with the cycle, with $x^{c}_{j, s, a} = 1$ if and only if $\sigma_j(c) = s$ and $a_j(c) = a$.
We construct a loss vector $\lambda$ in the same space, where $\lambda_{j, s, a}$ is the cumulative loss of the pair $(s, a)$ summed over the time steps (up to $t$) at which position $j$ of a length-$\ell$ cycle started at the beginning of the interaction is played.
Our decision set $K$ is the convex hull of all $x^{c}$ with $c \in \mathcal{C}_{u,\ell}$. Our objective is to find $x \in K$ minimizing $\langle \lambda, x \rangle$. The set $K$ can be captured by a polynomial sized set of linear constraints: flow conservation between consecutive positions, with the walk starting and ending at $u$.
Once we get an optimal $x^{*}$ for the above LP, we can decompose the mixed solution as a convex combination of polynomially many cycles by Caratheodory's theorem. Also, these cycles can be recovered efficiently (Bazaraa et al. (2004)). Each of them will have the same loss, and hence we can choose any of them.
Once we have an optimal cycle for a given pair $(u, \ell)$, we can minimize over all such pairs to get the overall optimal cycle. This gives us a polynomial time algorithm for finding the leader.
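The LP itself is not reproduced here, but the same per-pair optimization can also be solved by a layered shortest-path dynamic program. The sketch below assumes that position $j$ of a cycle in $\mathcal{C}_{u,\ell}$ is played at the time steps $\tau$ with $(\tau - 1) \bmod \ell = j$; it is an alternative illustration, not the paper's exact procedure, and all names are illustrative.

```python
import numpy as np

def best_cycle(losses, next_state, u, ell):
    """Cheapest closed walk of length ell through u for the cumulative losses so far.

    losses     : list of (S, A) arrays, the loss functions seen up to the current time.
    next_state : (S, A) integer array with the deterministic transitions.
    Returns (total_loss, list_of_actions) describing the best cycle in C_{u, ell}.
    """
    S, A = next_state.shape
    # agg[j][s, a] = total loss of playing (s, a) at cycle position j so far.
    agg = np.zeros((ell, S, A))
    for tau, loss in enumerate(losses):          # tau = 0 corresponds to time step 1
        agg[tau % ell] += loss
    INF = float("inf")
    dp = np.full((ell + 1, S), INF)
    choice = np.zeros((ell, S), dtype=int) - 1   # encodes the best (s, a) entering each state
    dp[0][u] = 0.0
    for j in range(ell):
        for s in range(S):
            if dp[j][s] == INF:
                continue
            for a in range(A):
                s2 = next_state[s, a]
                cand = dp[j][s] + agg[j][s, a]
                if cand < dp[j + 1][s2]:
                    dp[j + 1][s2] = cand
                    choice[j][s2] = s * A + a
    if dp[ell][u] == INF:
        return INF, None                         # no closed walk of length ell through u
    # Backtrack from u at position ell to recover the action sequence.
    actions, cur = [], u
    for j in range(ell - 1, -1, -1):
        s, a = divmod(int(choice[j][cur]), A)
        actions.append(a)
        cur = s
    return dp[ell][u], actions[::-1]
```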
Remark 4.1.
If the new cycle chosen has the same perturbed loss as the old cycle, we will not switch. This is to prevent any unnecessary switches caused by the arbitrary choice of cycle in each optimization step (as we choose an arbitrary cycle with non-zero weight in the solution).
Remark 4.2.
Note that $\mathcal{C}$ can also contain cycles that don’t correspond to deterministic stationary policies. However, arguing a regret upper bound against this larger class is sufficient to prove regret bounds against the class of stationary deterministic policies.
4.2.2 Regret of the FPL algorithm
We now state and prove the bound on the regret and expected number of switches of Algorithm 1.
Theorem 4.3.
The regret and the expected number of switches of Algorithm 1 can be bounded by $O(\sqrt{L^{*}})$, up to polynomial factors in $|\mathcal{S}|$ and $|\mathcal{A}|$,
where $L^{*}$ is the cumulative loss of the best cycle in hindsight.
Before analysing the FPL algorithm described above, we first introduce some notation and definitions. We define the loss of a cycle $c$ at time $t$ as $\ell_t(c) = \ell_t(\sigma_t(c), a_t(c))$. For any cycle $c$, let $L_T(c)$ denote the total cumulative loss that we would have received if we followed the cycle from the start to the end of the interaction. We use $\tilde{L}_T(c)$ to denote the total perturbed cumulative loss received by cycle $c$. Let the cycle with the lowest total cumulative loss be $c^{*}$. Also, let the cycle with the lowest perturbed cumulative loss be $\tilde{c}$. We use $\tilde{L}_t(c)$ to denote the perturbed cumulative loss incurred by cycle $c$ after $t$ steps, and $\tilde{c}_t$ to denote the cycle with the lowest perturbed cumulative loss after $t$ steps. Let $c_t$ be the cycle chosen by the FPL algorithm at step $t$ and $\ell_t(c_t)$ be its loss. Let the expected number of switches made by the algorithm during the interaction be $M$.
The analysis is similar in spirit to Section 2 of Kalai and Vempala (2005). We first state the following lemma that bounds the probability of switching the cycle at any step.
Lemma 4.4.
for all cycles and times .
Proof of Theorem 4.3.
We first bound the total loss incurred by the FPL algorithm. Let the expected number of switches made by the algorithm during the interaction be . If the algorithm doesn’t switch cycles after time step , then must be equal to . Thus, the loss incurred at time step by is at most . In the steps in which the algorithm switches cycles, the maximum loss incurred is . Thus, we have that
(1)
We now bound . From linearity of expectation, we have that
From Lemma 4.4, we have is at most . This gives us the following bound for .
Combining this with (1) gives us the following.
(2)
Let denote the perturbed loss added to cycle . Since the cycle with lowest perturbed cumulative loss at the end of the interaction is , we have
Also,
The above inequality comes from the fact that the expectation of the maximum of $n$ independent exponential random variables with parameter $\epsilon$ is at most $(1 + \ln n)/\epsilon$. Plugging this inequality into (2) gives us
(3)
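For completeness, here is the standard calculation behind the max-of-exponentials bound used above, with $n$ denoting the number of experts (cycles) and $Z_1, \ldots, Z_n$ i.i.d. $\mathrm{Exp}(\epsilon)$:
$$\mathbb{E}\Big[\max_{i \le n} Z_i\Big] = \int_0^\infty \Pr\Big[\max_{i \le n} Z_i > z\Big]\, dz \;\le\; \int_0^{\frac{\ln n}{\epsilon}} 1\, dz + \int_{\frac{\ln n}{\epsilon}}^{\infty} n\, e^{-\epsilon z}\, dz \;=\; \frac{\ln n}{\epsilon} + \frac{1}{\epsilon} \;=\; \frac{1 + \ln n}{\epsilon}.$$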
Since the maximum cost is , we have
Setting $\epsilon$ of order $1/\sqrt{T}$ gives us a bound of $O(\sqrt{T})$ (up to polynomial factors in $|\mathcal{S}|$ and $|\mathcal{A}|$) on the regret and expected number of switches. We can also derive first order bounds. From (3), we have
On rearranging, we get
The last two inequalities work when . Thus,
Set . This forces to be less than and thus the previous inequalities are still valid. On substituting the value of , we get that
when Since the expected number of switches is at most , this is also bounded by . ∎
Proof of Lemma 4.4.
Let be a cycle in the set . Let be shorthand for the loss incurred by cycle at step . If is not in , then the algorithm must have switched. Thus, we get the following equation.
(4)
We now bound both the terms in the right hand side of (4) separately.
First, we study the first term. We will upper bound this term by proving an appropriate lower bound on the probability of choosing from . Since , we know that for all . For all , the perturbation will play a role in the comparison of the perturbed cumulative losses. For , appears on both sides of the comparison and thus gets cancelled out. Thus, we have , where depends only on the perturbations and losses received by and the cycles not in . Now, if was larger than , then the perturbed cumulative loss of will be less than that of cycles not in even after receiving the losses of step . In this case, will also be chosen from . This gives us the required probability lower bound.
Thus, is at most .
We now bound the second term. For any two cycles in , there exists an index such that the edges of and are different and all the smaller indexed edges of the two cycles are the same. We denote this index by . Define to be zero when is from and or is not from . Now, if is in and not equal to , then is a number between one and . Thus, we get the following equation.
(5)
We now bound for any between and . Let be the edge of . We prove a lower bound on the probability of choosing such that is not equal to . Again, since , we know that for all . Consider cycles that don’t contain the edge in the position. The perturbation will play a role in the comparison of perturbed losses of all such with . Thus, we have , where depends only on the perturbations and losses received by and cycles that don’t have the edge in the position. If was greater than , then the perturbed cumulative loss of will still be less than that of all cycles without the edge. In this case, will be chosen such that it also has the edge. This implies that . Thus, we get the following probability lower bound.
4.3 Putting it together
We have described the FPL style algorithm that achieves low regret and low switching. We now use Algorithm 1 as a sub-routine to design a low regret algorithm for the online ADMDP problem.
Recall that for a cycle $c$, $\sigma_t(c)$ is the state you would reach if you followed the cycle for $t$ steps from the start. This can be computed efficiently.
We now state the regret bound of Algorithm 2.
Theorem 4.5.
Given a communicating ADMDP with state space $\mathcal{S}$, action space $\mathcal{A}$, and period $k$, the regret of Algorithm 2 is bounded by $O(\sqrt{L^{*}})$, up to polynomial factors in $|\mathcal{S}|$ and $|\mathcal{A}|$,
where $L^{*}$ is the total loss incurred by the best stationary deterministic policy in hindsight.
Proof.
We spend $Nk$ steps catching up whenever Algorithm 1 switches. In all other steps, we receive the same loss as the cycle chosen by Algorithm 1. Thus, the regret differs from that of Algorithm 1 by at most $Nk$ times the expected number of switches. From Theorem 4.3, we get that the total regret of our algorithm in the deterministic case is $O(Nk\sqrt{L^{*}})$, up to polynomial factors in $|\mathcal{S}|$ and $|\mathcal{A}|$, where $N$ is the critical value of the ADMDP from Theorem 3.2. Note that $Nk$ is at most polynomial in $|\mathcal{S}|$. Thus, we get the claimed bound.
∎
Remark 4.6.
To achieve the first order regret bound, we set $\epsilon$ in terms of $L^{*}$. We need prior knowledge of $L^{*}$ to do this directly. This can be circumvented by using a doubling trick, as sketched below.
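One generic way to implement the doubling trick is sketched here; `run_fpl_phase` is an assumed subroutine (not from the paper) that runs Algorithm 1 with a fixed $\epsilon$ until the current guess on the best expert's loss is exceeded.

```python
import math

def fpl_with_doubling(run_fpl_phase, log_num_experts, T):
    """Guess-and-double wrapper for the unknown best-expert loss L*.

    run_fpl_phase(epsilon, loss_guess, steps_left) is assumed to run Algorithm 1
    with the given fixed epsilon until either the best expert's loss within the
    phase exceeds loss_guess or steps_left time steps elapse, and to return the
    number of time steps it consumed.
    """
    guess, used = 1.0, 0
    while used < T:
        eps = min(1.0, math.sqrt(log_num_experts / guess))  # tuned as if L* were guess
        used += run_fpl_phase(eps, guess, T - used)
        guess *= 2                                          # double the guess and restart
```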
4.4 Regret Lower Bound for Deterministic MDPs
We now state a matching $\Omega(\sqrt{T})$ regret lower bound (up to polynomial factors).
Theorem 4.7.
For any algorithm and any $T \geq |\mathcal{S}|$, there exists an MDP $M$ with $|\mathcal{S}|$ states and $|\mathcal{A}|$ actions and a sequence of losses such that
$$R_T = \Omega\Big(\sqrt{|\mathcal{S}|\, T \log |\mathcal{A}|}\Big),$$
where $R_T$ is the regret incurred by the algorithm on $M$ with the given sequence of losses.
Proof.
Let $M$ be an MDP with states labelled $0, 1, \ldots, |\mathcal{S}| - 1$. Any action takes state $i$ to state $i + 1$ (modulo $|\mathcal{S}|$). In other words, the states are arranged in a cycle and every action takes any state to the next state in the cycle. This is the required $M$.
Consider the problem of prediction with expert advice with $|\mathcal{A}|$ experts. We know that for any algorithm, there is a sequence of losses such that its regret is $\Omega(\sqrt{T' \log |\mathcal{A}|})$ over $T'$ steps (see Cesa-Bianchi and Lugosi (2006)). In our case, every policy spends exactly $T/|\mathcal{S}|$ steps in each state. Thus, the interaction with $M$ over $T$ steps can be interpreted as a problem of prediction with expert advice at every state, where each such interaction lasts only $T/|\mathcal{S}|$ steps. We have the following decomposition of the regret.
$$\mathbb{E}[R_T] = \sum_{i=0}^{|\mathcal{S}|-1} \mathbb{E}\left[\sum_{t \,:\, s_t = i} \ell_t(i, a_t) - \min_{a \in \mathcal{A}} \sum_{t \,:\, s_t = i} \ell_t(i, a)\right] \qquad (6)$$
In the above equation, $a_t$ is the action taken by the algorithm at step $t$. The best stationary deterministic policy in hindsight plays the minimizing action at each state $i$.
From the regret lower bound for the experts problem, we know that there exists a sequence of losses such that, for each $i$, the inner term of (6) is at least $\Omega(\sqrt{(T/|\mathcal{S}|) \log |\mathcal{A}|})$. By combining these loss sequences, we get a sequence of losses such that $\mathbb{E}[R_T] = \Omega(\sqrt{|\mathcal{S}|\, T \log |\mathcal{A}|})$.
This completes the proof. ∎
5 Stochastic Transitions
In the previous sections, we only considered deterministic transitions. We now present an algorithm that achieves low regret for the more general class of communicating MDPs (with an additional mild restriction). This algorithm achieves $O(\sqrt{T})$ regret but takes exponential time to run (exponential in $|\mathcal{S}|$).
Assumption 5.1.
The MDP has a state $s_0$ and an action $a_0$ such that $P(s_0 \mid s_0, a_0) = 1$.
In other words, there is some state $s_0$ in which we have an action that deterministically keeps us in the state $s_0$. This can be interpreted as a state with a “do nothing” action where we can wait before taking the next action.
We now state a theorem that guarantees the existence of a number $n^{*}$ such that all states can be reached from $s_0$ in exactly $n^{*}$ steps with reasonably high probability.
Theorem 5.2.
In MDPs satisfying Assumption 5.1, there exist $n^{*} \leq 2D$ and a state $s_0$ such that, for all target states $s$, we have (non-stationary) policies $\pi_s$ such that the probability of being at state $s$ at time $n^{*}$, when starting at $s_0$ and following $\pi_s$, is at least $\frac{1}{4D}$.
We first prove an intermediate lemma.
Lemma 5.3.
For any start state $s$ and target $s'$, there exist $n \leq 2D$ and a policy $\pi$ such that $\Pr\left[T^{\pi}_{s \to s'} = n\right] \geq \frac{1}{4D}$.
Proof.
From the definition of the diameter, we are guaranteed a policy $\pi$ such that
$$\mathbb{E}\left[T^{\pi}_{s \to s'}\right] \leq D.$$
From Markov's inequality, we have
$$\Pr\left[T^{\pi}_{s \to s'} \geq 2D\right] \leq \frac{1}{2}.$$
Since $T^{\pi}_{s \to s'}$ takes fewer than $2D$ discrete values below $2D$, there exists $n < 2D$ such that
$$\Pr\left[T^{\pi}_{s \to s'} = n\right] \geq \frac{1}{4D}.$$
∎
We can now prove Theorem 5.2.
Proof of Theorem 5.2.
From Lemma 5.3, for each target state $s$ we have $n_s \leq 2D$ and a policy $\pi_s$ that, starting from $s_0$, hits the state $s$ at time exactly $n_s$ with probability at least $\frac{1}{4D}$. We take $n^{*} = \max_s n_s \leq 2D$. For target state $s$, the policy loops at state $s_0$ (using the “do nothing” action) for $n^{*} - n_s$ time steps and then starts following policy $\pi_s$. Clearly, this policy is at state $s$ at time $n^{*}$ with probability at least $\frac{1}{4D}$. ∎
Remark 5.4.
The policies guaranteed by Theorem 5.2 are not stationary.
Let $p = \frac{1}{4D}$ denote the success probability guaranteed by Theorem 5.2. Clearly, $\frac{1}{p} \leq 4D$.
5.1 Algorithm
We extend the algorithm we used in the deterministic MDP case.
We use a low switching algorithm (FPL) that considers each stationary deterministic policy as an expert. We know from Kalai and Vempala (2005) that FPL achieves $O(\sqrt{T})$ regret as well as $O(\sqrt{T})$ expected switching cost. At time $t$, we receive the loss function $\ell_t$ from the adversary. Using this, we construct a surrogate loss $\hat{\ell}_t(\pi)$ for each policy $\pi$ as
$$\hat{\ell}_t(\pi) = \sum_{s \in \mathcal{S}} \mu^{\pi}_t(s) \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \ell_t(s, a),$$
where $\mu^{\pi}_t$ is the distribution of states at time $t$ when following $\pi$ from the start of the game, i.e., $\mu^{\pi}_t = \mu_0 (P^{\pi})^{t-1}$ with $P^{\pi}$ the transition matrix induced by $\pi$.
In other words, $\hat{\ell}_t(\pi)$ is the expected loss at time $t$ if we follow the policy $\pi$ from the start of the game, and $\mu_0$ is the initial distribution of states.
We feed $\hat{\ell}_t$ as the losses to FPL; a small sketch of how these surrogate losses can be computed follows.
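Concretely, the surrogate losses can be computed by propagating the start distribution through each policy's induced Markov chain. The sketch below uses illustrative helper names and data layouts, not the paper's notation.

```python
import numpy as np

def surrogate_losses(P, mu0, policy, losses):
    """hat_ell_t(pi) for t = 1..T: expected loss at time t when pi is run from the start.

    P      : (S, A, S) transition probabilities.
    mu0    : (S,) initial state distribution.
    policy : (S, A) matrix, policy[s, a] = pi(a | s).
    losses : list of (S, A) loss matrices chosen by the adversary.
    """
    # Transition matrix of the Markov chain induced by the policy.
    P_pi = np.einsum("sa,saz->sz", policy, P)   # P_pi[s, s'] = sum_a pi(a|s) P(s'|s,a)
    mu, out = mu0.copy(), []
    for loss in losses:
        out.append(float(np.sum(mu[:, None] * policy * loss)))  # E[loss_t(s_t, a_t)]
        mu = mu @ P_pi                                           # advance the state distribution
    return out
```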
We can now rewrite $\hat{\ell}_t(\pi)$ as $\hat{\ell}_t(\pi) = \langle \mu^{\pi}_t, \ell^{\pi}_t \rangle$, where $\ell^{\pi}_t(s) = \sum_{a} \pi(a \mid s)\, \ell_t(s, a)$. Let $\pi_t$ be the policy chosen by FPL at time $t$. We know that
$$\mathbb{E}\left[\sum_{t=1}^{T} \hat{\ell}_t(\pi_t)\right] - \sum_{t=1}^{T} \hat{\ell}_t(\pi) = O(\sqrt{T})$$
(up to polynomial factors in $|\mathcal{S}|$ and $|\mathcal{A}|$) for any deterministic policy $\pi$.
We need our algorithm's actual expected loss to be close to the first term in the above difference. If this is possible, we have an $O(\sqrt{T})$ regret bound for online learning in the MDP. We now present an approach to do this.
5.1.1 Catching a policy
When FPL switches its policy, we cannot immediately start receiving the losses of the new policy. If this were possible, then the regret of our algorithm would match that of FPL. When implementing the policy switch in our algorithm, we suffer a delay before starting to incur the losses of the new policy (in an expected sense). Our goal now is to make this delay as small as possible. This, coupled with the fact that FPL has a low number of switches, will give us good regret bounds. Note that this was easily done in the deterministic case using Theorem 3.2. Theorem 5.2 acts somewhat like a stochastic analogue of Theorem 3.2, and we use it to reduce the time taken to catch the policy.
Remark 5.5.
In Algorithm 3, if FPL switches the policy in the middle of the Switch_Policy’s execution, we terminate the execution and call the routine again with a new target policy.
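The Switch_Policy routine itself is not reproduced in this text, but one way to realize a routine with the property asserted in Lemma 5.6 below is the rejection-sampling sketch that follows: sample a target state from the new policy's state distribution at the time the attempt will end, try to be at exactly that state after $n^{*}$ steps using the policies from Theorem 5.2, and retry from $s_0$ on failure. This is our reading of the construction, offered as an illustration only; all helper names are assumptions and the paper's Algorithm 3 may differ in detail.

```python
import numpy as np

def switch_policy(env, new_policy_dist, reach, q, p, n_star, goto_s0, rng):
    """Sketch of a catch-up routine with the property asserted in Lemma 5.6.

    new_policy_dist(t) : state distribution mu_t of the new policy at absolute time t.
    reach[s]           : Theorem 5.2 policy that, from s0, is at s after n_star steps
                         with probability q[s] >= p (q computable from known dynamics).
    goto_s0(env)       : navigate back to s0 (expected <= D steps).
    Returns the absolute time at which the new policy has been 'caught'.
    """
    while True:
        goto_s0(env)                                   # start each attempt from s0
        t_end = env.time() + n_star                    # when this attempt will finish
        mu = new_policy_dist(t_end)
        target = rng.choice(len(mu), p=mu)             # sample a target state from mu
        for i in range(n_star):                        # run the (non-stationary) reaching policy
            env.step(reach[target].act(env.state(), i))
        hit = (env.state() == target)
        # Extra randomized acceptance equalizes the per-target success probability to
        # exactly p, so that conditioned on acceptance the state is distributed as mu.
        if hit and rng.random() < p / q[target]:
            return env.time()                          # now in sync with the new policy
```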
5.2 Analysis
The following lemma shows that the Switch_Policy routine works correctly. That is, after the execution of the routine, the state distribution is exactly the same as the state distribution of the new policy.
Lemma 5.6.
If Switch_Policy terminates at time $\tau$, we have that the conditional distribution of the state at time $\tau$ (given termination at $\tau$) is exactly $\mu^{\pi}_{\tau}$,
where $\mu^{\pi}_{\tau}$ is the distribution of states after following policy $\pi$ from the start of the game.
Proof of Lemma 5.6.
We want to compute .
We now compute the denominator as follows.
Now we calculate the numerator.
Thus, we have
∎
We now bound the expected loss of the algorithm in the period during which FPL chooses a given policy $\pi$.
Lemma 5.7.
Let the policy chosen by FPL be $\pi$ from time $t_1$ to $t_2$. We have that
$$\mathbb{E}\left[\sum_{t=t_1}^{t_2} \ell_t(s_t, a_t)\right] \leq \sum_{t=t_1}^{t_2} \hat{\ell}_t(\pi) + O(D^2).$$
Proof of Lemma 5.7.
We bound the expectation using law of total expectations and conditioning on .
We bound the conditional expectation.
From Lemma 5.6, the second term is equal to Thus,
Every time we try to catch the policy from state $s_0$, we succeed with probability at least $p = \frac{1}{4D}$. Thus, the expected number of attempts is at most $\frac{1}{p} \leq 4D$ and each attempt takes $n^{*} \leq 2D$ steps. Between each of these attempts, we move at most $D$ steps in expectation to reach $s_0$ again. Thus, in total, the expected length of the catch-up phase, and hence the extra expected loss, is $O(D^2)$.
This completes the proof. ∎
We are now ready to bound the regret of Algorithm 3
Theorem 5.8.
The regret of Algorithm 3 is at most $O(D^2 \sqrt{T})$, up to polynomial factors in $|\mathcal{S}|$ and $|\mathcal{A}|$.
Proof.
We condition on the number of switches made by FPL. Let $M$ be the random variable corresponding to the number of switches made by FPL. We refer to Algorithm 3 as $\mathcal{A}$.
After each switch, Lemma 5.7 tells us that the algorithm suffers at most $O(D^2)$ extra expected loss compared to the surrogate loss of FPL. Thus,
Here $\pi_t$ is the policy chosen by algorithm FPL at time $t$. Since FPL is a low switching algorithm, we have $\mathbb{E}[M] = O(\sqrt{T})$. The second term in the expectation is at most $\sum_{t} \hat{\ell}_t(\pi) + O(\sqrt{T})$ for any deterministic policy $\pi$, because FPL is a low regret algorithm. Thus, we have
for all stationary deterministic policies $\pi$.
Thus, ∎
When the expert set is the set of stationary deterministic policies, the number of experts is $|\mathcal{A}|^{|\mathcal{S}|}$. Thus, we get the following theorem.
Theorem 5.9.
The regret of Algorithm 3 against the class of stationary deterministic policies is $O(\sqrt{T})$, up to polynomial factors in $|\mathcal{S}|$, $|\mathcal{A}|$, and the diameter $D$.
In fact, since we are using FPL as the expert algorithm, we can get first-order bounds similar to Theorem 4.3. In a setting with $n$ experts, with $L^{*}$ being the total loss of the best expert, the regret and the number of switches can be bounded by $O(\sqrt{L^{*} \log n})$. Using this, we get corresponding first-order regret bounds for Algorithm 3.
5.3 Oracle-efficient algorithm assuming exploring starts
In this section, we assume that the initial distribution over states, $\mu_0$, has probability mass at least $\alpha > 0$ on every state; that is, $\mu_0(s) \geq \alpha$ for all $s \in \mathcal{S}$. We also assume that we have an oracle that can find the stationary deterministic policy with minimum cumulative loss at no computational cost. We use this oracle and the ideas from the previous sections to design an oracle-efficient FPL style low regret algorithm.
5.3.1 Oracle
The oracle $\mathcal{O}$ takes in loss functions $\ell_1, \ldots, \ell_t$ and outputs the stationary deterministic policy with the lowest expected cumulative loss. That is, it returns the policy
$$\arg\min_{\pi} \sum_{\tau=1}^{t} \hat{\ell}_{\tau}(\pi),$$
with the surrogate losses $\hat{\ell}_{\tau}$ defined as in Section 5.1.
We say that an algorithm is oracle-efficient if it runs in polynomial time when given access to the oracle $\mathcal{O}$.
5.3.2 Algorithm Sketch
Now, we use Algorithm 4 as our black box experts algorithm. We prove that Algorithm 4 has low regret.
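A generic oracle-based FPL loop of this kind can be sketched as follows; whether this matches the exact perturbation scheme of Algorithm 4 is an assumption on our part, and `oracle` stands for the optimization oracle $\mathcal{O}$ applied to (possibly perturbed) loss sequences.

```python
import numpy as np

def oracle_fpl(oracle, losses_stream, S, A, eta, rng):
    """Sketch of follow-the-perturbed-leader driven by an optimization oracle.

    oracle(loss_list) is assumed to return the stationary deterministic policy
    minimizing the cumulative surrogate loss of the given loss sequence (the
    oracle O of the text, applied to real-valued loss matrices).  A single random
    perturbation, encoded as a negative fake loss at time zero, is drawn once;
    keeping it fixed across rounds is what limits switching.
    """
    Z = rng.exponential(scale=1.0 / eta, size=(S, A))
    seen = [-Z]                        # fake round-0 loss: subtracts Z from every (s, a)
    chosen = []
    for loss_t in losses_stream:
        chosen.append(oracle(seen))    # leader on perturbed cumulative losses so far
        seen.append(loss_t)            # then append the true loss of this round
    return chosen
```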
Theorem 5.11.
The regret and the expected number of switches of Algorithm 4 can be bounded by $O(\sqrt{T})$, up to polynomial factors in $|\mathcal{S}|$, $|\mathcal{A}|$, and $1/\alpha$.
Proof.
Let $L(\pi)$ denote the total cumulative loss if we followed policy $\pi$ from the start of the interaction. We use $\tilde{L}(\pi)$ to denote the total perturbed cumulative loss if we followed policy $\pi$ from the start. Let $\pi^{*}$ be the policy with the lowest total cumulative loss. Similarly, let $\tilde{\pi}$ be the policy with the lowest perturbed cumulative loss. Let $\tilde{L}_t(\pi)$ be the total perturbed cumulative loss till time $t$. Let $\pi_t$ be the policy chosen by the FPL algorithm at step $t$.
Let be the number of times the oracle switches the best policy. As before, we treat each policy as an expert and consider the online learning problem where expert gets loss where and .
We know that . We now bound . Let . The algorithm chooses as if and only if . We now argue that the probability of this happening is low if . Since , we have for some . Let the smallest state in which and differ be called . Thus,
We bound for any state . Consider any policy that differs from in state . The perturbation will play a role in the comparison of perturbed losses of all such with . Since , we have for some that depends only on the perturbations and losses received by and policies that differ from in state . If , then we would not switch to a policy with . Thus,
Since is at least , we have is at most . From this, we get
Using arguments similar to Section 4.2.2, we get
(7)
Let .
On rearranging and simplifying Equation 7 similar to the proof of Theorem 4.3, we have
The above inequality works when , Thus, we have
Set . On substituting into the above equation, we get that
Since the expected number of switches is at most , this is also bounded by ∎
The rest of the algorithm is the same as Algorithm 3. The exploring starts assumption allows us to get an efficient low regret, low switching algorithm assuming access to the oracle $\mathcal{O}$. We now state the regret bound for this algorithm.
Theorem 5.12.
Given a communicating MDP satisfying Assumption 5.1 with a start distribution that has probability at least $\alpha$ on every state, and given access to the oracle $\mathcal{O}$, we have an oracle-efficient algorithm with regret $O(\sqrt{T})$, up to polynomial factors in $|\mathcal{S}|$, $|\mathcal{A}|$, $D$, and $1/\alpha$.
6 Conclusion
We considered learning in a communicating MDP with adversarially chosen costs in the full information setting. We gave an efficient algorithm that achieves $O(\sqrt{T})$ regret when transitions are deterministic. We also presented an inefficient algorithm that achieves an $O(\sqrt{T})$ regret bound for the general stochastic case under an extra mild assumption. Our results show that in the full information setting there is no statistical price (as far as the dependence on $T$ is concerned) for the extension from the vanilla online learning with experts problem to the problem of online learning over communicating MDPs.
Several interesting questions still remain open. First, what are the best lower bounds in the general (i.e., not necessarily deterministic) communicating setting? In the deterministic setting, the diameter is bounded polynomially by the size of the state space. This is no longer true in the stochastic case. The best lower bound in terms of the diameter and other relevant quantities ($|\mathcal{S}|$ and $|\mathcal{A}|$) still remains to be worked out. Second, is it possible to design an efficient algorithm beyond the deterministic case with fewer assumptions? The source of inefficiency in our algorithm is that we run FPL with each policy as an expert and perturb the losses of each policy independently. It is plausible that an FPL algorithm that perturbs the losses themselves (as in the deterministic case) can also be analyzed. However, there are challenges in its analysis as well as in proving that it is computationally efficient. For example, we are not aware of any efficient way to compute the best deterministic policy in hindsight for the general communicating case. This leads us to another open question: are there oracle-efficient $O(\sqrt{T})$-regret algorithms for online learning over communicating MDPs? Dai et al. (2022) give an oracle-efficient algorithm, but it works for bandits as well and does not use the additional information available in the full information case.
References
- Arora et al. [2012a] R. Arora, O. Dekel, and A. Tewari. Deterministic mdps with adversarial rewards and bandit feedback. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI’12, page 93–101, Arlington, Virginia, USA, 2012a. AUAI Press. ISBN 9780974903989.
- Arora et al. [2012b] R. Arora, O. Dekel, and A. Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. Proceedings of the 29th International Conference on Machine Learning, ICML 2012, 2, 06 2012b.
- Auer et al. [1995] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331, 1995. doi: 10.1109/SFCS.1995.492488.
- Bazaraa et al. [2004] M. S. Bazaraa, J. J. Jarvis, and H. D. Sherali. Linear Programming and Network Flows. Wiley-Interscience, USA, 2004. ISBN 0471485993.
- Bremaud [2000] P. Bremaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer, 2000.
- Cesa-Bianchi and Lugosi [2006] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006. ISBN 978-0-521-84108-5.
- Dai et al. [2022] Y. Dai, H. Luo, and L. Chen. Follow-the-perturbed-leader for adversarial markov decision processes with bandit feedback. arXiv preprint arXiv:2205.13451, 2022.
- Dekel and Hazan [2013] O. Dekel and E. Hazan. Better rates for any adversarial deterministic mdp. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML’13, page III–675–III–683. JMLR.org, 2013.
- Dekel et al. [2013] O. Dekel, J. Ding, T. Koren, and Y. Peres. Bandits with switching costs: $T^{2/3}$ regret. Proceedings of the Annual ACM Symposium on Theory of Computing, 10 2013. doi: 10.1145/2591796.2591868.
- Denardo [1977] E. V. Denardo. Periods of connected networks and powers of nonnegative matrices. Mathematics of Operations Research, 2(1):20–24, 1977. ISSN 0364765X, 15265471. URL http://www.jstor.org/stable/3689120.
- Even-Dar et al. [2009] E. Even-Dar, S. M. Kakade, and Y. Mansour. Online markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009. ISSN 0364765X, 15265471. URL http://www.jstor.org/stable/40538442.
- Kalai and Vempala [2005] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005. ISSN 0022-0000. doi: https://doi.org/10.1016/j.jcss.2004.10.016. URL https://www.sciencedirect.com/science/article/pii/S0022000004001394. Learning Theory 2003.
- Neu et al. [2014] G. Neu, A. György, C. Szepesvári, and A. Antos. Online markov decision processes under bandit feedback. IEEE Trans. Autom. Control., 59(3):676–691, 2014. doi: 10.1109/TAC.2013.2292137. URL https://doi.org/10.1109/TAC.2013.2292137.
- Ortner [2010] R. Ortner. Online regret bounds for markov decision processes with deterministic transitions. Theoretical Computer Science, 411(29):2684–2695, 2010. ISSN 0304-3975. doi: https://doi.org/10.1016/j.tcs.2010.04.005. URL https://www.sciencedirect.com/science/article/pii/S0304397510002008. Algorithmic Learning Theory (ALT 2008).