
Parameterized MDPs and Reinforcement Learning Problems - A Maximum Entropy Principle Based Framework

Amber Srivastava and Srinivasa M Salapaka. The paper has been accepted for publication in IEEE Transactions on Cybernetics. The Institute of Electrical and Electronics Engineers, Incorporated (the "IEEE") holds all rights under copyright. This work was supported by NSF grant ECCS (NRI) 18-30639, DOE award DE-EE0009125, and the Dynamic Research Enterprise for Multidisciplinary Engineering Sciences (DREMES), a collaboration between Zhejiang University and the University of Illinois at Urbana-Champaign. The authors are with the Mechanical Science and Engineering Department and Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, IL, 61801 USA. E-mail: {asrvstv6, salapaka}@illinois.edu.
Abstract

We present a framework to address a class of sequential decision making problems. Our framework features learning the optimal control policy with robustness to noisy data, determining the unknown state and action parameters, and performing sensitivity analysis with respect to problem parameters. We consider two broad categories of sequential decision making problems modeled as infinite horizon Markov Decision Processes (MDPs) with (and without) an absorbing state. The central idea underlying our framework is to quantify exploration in terms of the Shannon Entropy of the trajectories under the MDP and determine the stochastic policy that maximizes it while guaranteeing a low value of the expected cost along a trajectory. This resulting policy enhances the quality of exploration early on in the learning process, and consequently allows faster convergence rates and robust solutions even in the presence of noisy data as demonstrated in our comparisons to popular algorithms such as Q-learning, Double Q-learning and entropy regularized Soft Q-learning. The framework extends to the class of parameterized MDP and RL problems, where states and actions are parameter dependent, and the objective is to determine the optimal parameters along with the corresponding optimal policy. Here, the associated cost function can possibly be non-convex with multiple poor local minima. Simulation results applied to a 5G small cell network problem demonstrate successful determination of communication routes and the small cell locations. We also obtain sensitivity measures to problem parameters and robustness to noisy environment data.

I Introduction

Markov Decision Processes (MDPs) model sequential decision making problems which arise in many application areas such as robotics, sensor networks, economics, and manufacturing. These models are characterized by the state-evolution dynamics s_{t+1}=f(s_t,a_t), a control policy \mu(a_t|s_t) that allocates an action a_t from a control set to each state s_t, and a cost c(s_t,a_t,s_{t+1}) associated with the transition from s_t to s_{t+1}. The goal in these applications is to determine the optimal control policy that results in a path, a sequence of actions and states, with minimum cumulative cost. There are many variants of this problem [1], where the dynamics can be defined over finite or infinite horizons; where the state dynamics f can be stochastic; where the models for the state dynamics may be partially or completely unknown, and the cost function is not known a priori, albeit the cost at each step is revealed at the end of each transition. Some of the most common methodologies that address MDPs include dynamic programming, value and policy iterations [2], linear programming [3, 4], and Q-learning [5].

In this article, we view MDPs and their variants as combinatorial optimization problems, and develop a framework based on the maximum entropy principle (MEP) [6] to address them. MEP has proved successful in addressing a variety of combinatorial optimization problems such as facility location problems [7], combinatorial drug discovery [8], the traveling salesman problem and its variants [7], image processing [9], graph and Markov chain aggregation [10], and protein structure alignment [11]. MDPs, too, can be viewed as combinatorial optimization problems, owing to the combinatorially large number of paths (sequences of consecutive states and actions) that the process may take depending on the control policy and its inherent stochasticity. In our MEP framework, we determine a probability distribution defined on the space of paths [12], such that (a) it is the fairest distribution, the one with the maximum Shannon Entropy H, and (b) it satisfies the constraint that the expected cumulative cost J attains a prespecified feasible value J_0. The framework results in an iterative scheme, an annealing scheme, where probability distributions are improved upon by successively lowering the prespecified values J_0. In fact, the Lagrange multiplier \beta corresponding to the cost constraint (J=J_0) in the unconstrained Lagrangian is increased from small values (near 0) to large values to effect annealing. Higher values of the multiplier correspond to lower values of the expected cost. We show that as the multiplier value increases, the corresponding probability distributions become more localized, finally converging to a deterministic policy.

This framework is applicable to all the classes of MDPs and its variants described above. Our MEP based approach inherits the flexibility of algorithms such as deterministic annealing [7] developed in the context of combinatorial resource allocation, which include adding capacity, communication, and dynamic constraints. The added advantage of our approach is that we can draw close parallels to existing algorithms for MDPs and RL (e.g. Q-Learning) – thus enabling us to exploit their algorithmic insights. Below we highlight main contributions and advantages of our approach.

Exploration and Unbiased Policy: In the context of the model-free RL setting, the algorithms interact with the environment via agents and rely upon the instantaneous cost (or reward) generated by the environment to learn the optimal policy. Some of the popular algorithms include Q-learning [5], Double Q-learning [13], Soft Q-learning (entropy regularized Q-learning) [14, 15, 16, 17, 18, 19, 20] in discrete state and action spaces, and Trust Region Policy Optimization (TRPO) [21] and Soft Actor Critic (SAC) [22] in continuous spaces. It is commonly known that for the above algorithms to perform well, all relevant states and actions should be explored. In fact, under the assumption that each state-action pair is visited multiple times during the learning process, it is guaranteed that the above discrete space algorithms [5, 13, 14, 15] converge to the optimal policy. Thus, adequate exploration of the state and action spaces is essential to the success of these algorithms in determining the optimal policy. Often the instantaneous cost is noisy [14], which hinders the learning process and demands higher-quality exploration.

In our MEP-based approach, the Shannon Entropy of the probability distribution over the paths in the MDP explicitly characterizes the exploration. The framework results in a distribution over the paths that is as unbiased as possible under the given cost constraint. The corresponding stochastic policy is maximally noncommittal to any particular path in the MDP that achieves the constraint; this results in better (unbiased) exploration. The policy starts out entirely explorative, when the multiplier value is small (\beta\approx 0), and becomes increasingly exploitative as the multiplier value increases.

Parameterized MDPs and RL: This class of optimization problems is not even necessarily Markovian, which contributes significantly to its inherent complexity. However, we model these problems in a specific way that retains the Markov property without any loss of generality, thereby making them tractable. Scenarios such as self organizing networks [23], 5G small cell network design [24, 25], supply chain networks, and last mile delivery problems [26] pose a parameterized MDP with the two-fold objective of simultaneously determining (a) the optimal control policy for the underlying stochastic process, and (b) the unknown parameters that the state and action variables depend upon, such that the cumulative cost is minimized. The latter objective is akin to the facility location problem [27, 28, 29], which is shown to be NP-hard [27] and whose associated (non-convex) cost function is riddled with multiple poor local minima.

Figure 1: Illustrates the 5G Small Cell Network. The objective is to determine the small cell locations \{y_j\in\mathbb{R}^d\} and the communication routes from the Base Station \delta to each user \{n_i\} via the network of the small cells.

For instance, Figure 1 illustrates a 5G small cell network, where the objective is to simultaneously determine the locations \{y_j\} of the small cells \{f_j\} and design the communication paths (control policy) between the user nodes \{n_i\} and the base station \delta via the network of small cells. Here, the state space \mathcal{S} of the underlying MDP is parameterized by the locations \{y_j\} of the small cells \{f_j\}.

Algebraic structure and Sensitivity Analysis: In our framework, maximization of Shannon entropy of the distribution over the paths under constraint on the cost function value results in an unconstrained Lagrangian - the free-energy function. This function is a smooth approximation of the cumulative cost function of the MDP, which enables the use of calculus. We exploit this distinctive feature of our framework to determine the unknown state and action parameters in case of parameterized MDPs and perform sensitivity analysis for various problem parameters. Also, the framework easily accommodates stochastic models that describe uncertainties in the instantaneous cost and parameter values.

Algorithmic Guarantees and Innovation: For the classes of MDPs that we consider, our MEP-based framework results in non-trivial derivations of the recursive Bellman equation for the associated Lagrangian. We show that these Bellman operators are contraction maps and use several of their properties to guarantee convergence to the optimal policy, as well as to a local minimum in the case of parameterized MDPs.

In the context of model-free RL, we provide comparisons with the benchmark algorithms Q-learning, Double Q-learning, and entropy regularized G-learning [14] (also referred to as Soft Q-learning). Our algorithms converge at a faster rate (up to 1.5 times faster) than the benchmark algorithms across various values of the discount factor, even in the case of noisy environments. In the context of parameterized MDPs and RL, we address the small-cell network design problem in 5G communication. Here the parameters are the unknown locations of the small cells and the control policy determines the routing of the communication packets. Upon comparison with the sequential method of first determining the unknown parameters (small cell locations) and then the control policy (equivalently, the communication paths), we show that our algorithms result in costs that are as low as 65\% of the former. The efficacy of our algorithms can be assessed from the fact that the solutions in the model-based and model-free cases are nearly the same. In our simulations on parameterized MDPs and RL we also demonstrate sensitivity analysis, the benefits of annealing, and the benefits of considering the entropy of the distribution over the paths.

This paper is organized as follows. We briefly review the related work and MEP [6] in Section II. In Sections III and IV we develop the MEP-based framework for MDPs. Section V builds upon Section III to address the case of parameterized MDPs and RL problems. Simulations on a variety of scenarios are presented in Section VI. We discuss the generality of our framework, its capabilities, and future directions of the work in Section VII. For ease of reading, we provide a comprehensive list of symbols in Section F of the supplementary material.

II Preliminaries

Related Work in Entropy Regularization: Some of the previous works in the RL literature [14, 15, 16, 17, 18, 19, 20, 30, 31] either add entropy as a regularization term (-\log\mu(a_t|s_t)) [14, 15] to the instantaneous cost function c(s_t,a_t,s_{t+1}) or maximize the entropy (-\sum_{a}\mu(a|s)\log\mu(a|s)) [16, 17, 18] associated only with the stochastic policy under constraints on the cost J. This results in benefits such as better exploration, overcoming the effect of noise w_t in the instantaneous cost c_t, and faster convergence. However, the resulting stochastic policy and soft-max approximation of the value function J are not in compliance with the Maximum Entropy Principle applied to the distribution over the paths of the MDP. Thus, the resulting stochastic policy is biased in its exploration over the paths of the MDP. Our simulations demonstrate the benefit of unbiased exploration (in our framework) in terms of faster convergence and better performance in noisy environments in comparison to the entropy regularized benchmark algorithm.

Related Work in Parameterized MDPs and RL: The existing solution approaches [3, 2, 4] can be extended to parameterized MDPs by discretizing the parameter domain. However, the resulting problem is not necessarily an MDP, as every transition from one state to another depends on the path (and the parameter values) taken up to the current state. Other related approaches for parameterized MDPs are case specific; for instance, [32] presents action-based parameterization of the state space with application to service rate control in closed Jackson networks, and [33, 34, 35, 36, 37, 38] incorporate parameterized actions that are applicable in the domain of RoboCup soccer, where at each step the agent must select both the discrete action it wishes to execute as well as the continuously valued parameters required by that action. On the other hand, the class of parameterized MDPs that we address in this article predominantly originates in network-based applications that involve simultaneous routing and resource allocation, and poses the additional challenges of non-convexity and NP-hardness. We address these MDPs in both scenarios, where the underlying model is known as well as where it is unknown.

Maximum Entropy Principle: We briefly review the Maximum Entropy Principle (MEP) [6] since our framework relies heavily upon it. MEP states that, for a random variable \mathcal{X} with given prior information, the most unbiased probability distribution is the one that maximizes the Shannon entropy. More specifically, let the known prior information on the random variable \mathcal{X} be given as constraints on the expectations of the functions f_k:\mathcal{X}\rightarrow\mathbb{R}, 1\leq k\leq m. Then, the most unbiased probability distribution p_{\mathcal{X}}(\cdot) solves

\max_{\{p_{\mathcal{X}}(x_i)\}} \; H(\mathcal{X}) = -\sum_{i=1}^{n} p_{\mathcal{X}}(x_i)\ln p_{\mathcal{X}}(x_i) \quad \text{subject to} \quad \sum_{i=1}^{n} p_{\mathcal{X}}(x_i)\, f_k(x_i) = F_k \;\; \forall\; 1\leq k\leq m, \qquad (1)

where F_k, 1\leq k\leq m, are the known expected values of the functions f_k. The above optimization problem results in the Gibbs distribution [39] p_{\mathcal{X}}(x_i)=\frac{\exp\{-\sum_{k}\lambda_{k}f_{k}(x_{i})\}}{\sum_{j=1}^{n}\exp\{-\sum_{k}\lambda_{k}f_{k}(x_{j})\}}, where \lambda_k, 1\leq k\leq m, are the Lagrange multipliers corresponding to the equality constraints in (1).
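As a concrete illustration, the following minimal numpy sketch evaluates this Gibbs distribution on a small discrete support for given multipliers; the function name, the toy support, and the chosen multiplier value are ours for illustration, and in practice the multipliers \lambda_k would be tuned so that the expectation constraints in (1) are met.

```python
import numpy as np

def gibbs_distribution(f_values, lambdas):
    """Gibbs distribution p(x_i) proportional to exp(-sum_k lambda_k f_k(x_i)).

    f_values: (m, n) array with f_values[k, i] = f_k(x_i)
    lambdas : (m,) array of Lagrange multipliers
    """
    logits = -lambdas @ f_values          # shape (n,)
    logits -= logits.max()                # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Toy example: a single constraint E[f_1(X)] = F_1 on a 4-point support.
f = np.array([[1.0, 2.0, 3.0, 4.0]])
p = gibbs_distribution(f, lambdas=np.array([0.5]))
print(p, (p * f[0]).sum())                # the distribution and its achieved expectation
```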

III MDPs with Finite Shannon Entropy

III-A Problem Formulation

We consider an infinite horizon discounted MDP that comprises a cost-free termination state \delta. We formally define this MDP as a tuple \langle\mathcal{S},\mathcal{A},c,p,\gamma\rangle, where \mathcal{S}=\{s_1,\ldots,s_N=\delta\}, \mathcal{A}=\{a_1,\ldots,a_M\}, and c:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R} respectively denote the state space, action space, and cost function; p:\mathcal{S}\times\mathcal{S}\times\mathcal{A}\rightarrow[0,1] is the state transition probability function and 0<\gamma\leq 1 is the discounting factor. A control policy \mu:\mathcal{A}\times\mathcal{S}\rightarrow\{0,1\} determines the action taken at each state s\in\mathcal{S}, where \mu(a|s)=1 implies that action a\in\mathcal{A} is taken when the system is in the state s\in\mathcal{S} and \mu(a|s)=0 indicates otherwise. For every initial state x_0=s, the MDP induces a stochastic process, whose realization is a path \omega, an infinite sequence of actions and states, that is

\omega=(u_0,x_1,u_1,x_2,u_2,\ldots,x_T,u_T,x_{T+1},\ldots), \qquad (2)

where ut𝒜u_{t}\in\mathcal{A}, xt𝒮x_{t}\in\mathcal{S} for all t0t\in\mathbb{Z}_{\geq 0} and xt=δx_{t}=\delta for all tkt\geq k if and when the system reaches the termination state δ𝒮\delta\in\mathcal{S} in kk steps. The objective is to determine the optimal policy μ\mu^{*} that minimizes the state value function

J^{\mu}(s)=\mathbb{E}_{p_{\mu}}\Big[\sum_{t=0}^{\infty}\gamma^{t}c(x_t,u_t,x_{t+1})\,\big|\,x_0=s\Big], \quad \forall\; s\in\mathcal{S} \qquad (3)

where the expectation is with respect to the probability distribution pμ(|s):ω[0,1]p_{\mu}(\cdot|s):\omega\rightarrow[0,1] on the space of all possible paths ωΩ:={(ut,xt+1)t0:ut𝒜,xt𝒮}\omega\in\Omega:=\{(u_{t},x_{t+1})_{t\in\mathbb{Z}_{\geq 0}}:u_{t}\in\mathcal{A},x_{t}\in\mathcal{S}\}. In order to ensure that the system reaches the cost-free termination state in finite steps and the optimal state value function Jμ(s)J^{\mu}(s) is finite, we make the following assumption throughout this section.

Assumption 1.

There exists at least one deterministic proper policy \bar{\mu}(a|s)\in\{0,1\}\;\forall\;a\in\mathcal{A},s\in\mathcal{S} such that \min_{s\in\mathcal{S}}p_{\bar{\mu}}(x_{|\mathcal{S}|}=\delta|x_0=s)>0. In other words, under the policy \bar{\mu} there is a non-zero probability of reaching the cost-free termination state \delta when starting from any state s.

We consider the following set of stochastic policies μ\mu

\Gamma:=\{\pi: 0<\pi(a|s)<1\;\forall\;a\in\mathcal{A}, s\in\mathcal{S}\}, \qquad (4)

and the following lemma ensures that under the Assumption 1 all the policies μΓ\mu\in\Gamma are proper; that is,

Lemma 1.

For any policy μΓ\mu\in\Gamma as defined in (4), mins𝒮pμ(x|𝒮|=δ|x0=s)>0\min_{s\in\mathcal{S}}p_{\mu}(x_{|\mathcal{S}|}=\delta|x_{0}=s)>0, i.e., under each policy μΓ\mu\in\Gamma the probability to reach the termination state δ\delta in |𝒮|=N|\mathcal{S}|=N steps beginning from any s𝒮s\in\mathcal{S}, is strictly positive.

Proof. Please refer to the Appendix.

We use the Maximum Entropy Principle to determine the policy μΓ\mu\in\Gamma such that the Shannon Entropy of the corresponding distribution pμp_{\mu} is maximized and the state value function Jμ(s)J^{\mu}(s) attains a specified value J0J_{0}. More specifically, we pose the following optimization problem

\max_{\{p_{\mu}(\cdot|s)\}:\,\mu\in\Gamma}\; H^{\mu}(s) = -\sum_{\omega\in\Omega}p_{\mu}(\omega|s)\log p_{\mu}(\omega|s) \quad \text{subject to} \quad J^{\mu}(s)=J_0. \qquad (5)

Well posedness: For the class of proper policy μΓ\mu\in\Gamma the maximum entropy Hμ(s)s𝒮H^{\mu}(s)~{}\forall~{}s\in\mathcal{S} is finite as shown in [40, 41]. In short, the existence of a cost-free termination state δ\delta and a non-zero probability to reach it from any state s𝒮s\in\mathcal{S} ensures that the maximum entropy is finite. Please refer to the Theorem 1 in [40] or Proposition 2 in [41] for further details.

Remark 1.

Though the optimization problem in (III-A) considers the stochastic policies \mu\in\Gamma, our algorithms presented in the later sections are designed such that the resulting stochastic policy asymptotically converges to a deterministic policy.

III-B Problem Solution

The probability pμ(ω|s)p_{\mu}(\omega|s) of taking the path ω\omega in (2) can be determined from the underlying policy μ\mu by exploiting the Markov property that dissociates pμ(ω|s)p_{\mu}(\omega|s) in terms of the policy μ\mu and the state transition probability pp as

p_{\mu}(\omega|x_0)=\prod_{t=0}^{\infty}\mu(u_t|x_t)\,p(x_{t+1}|x_t,u_t). \qquad (6)

Thus, in our framework we prudently work with the policy \mu, which is defined over finite action and state spaces, as against the distribution p_{\mu}(\omega|s) defined over the infinitely many paths \omega\in\Omega. The Lagrangian corresponding to the above optimization problem in (III-A) is V^{\mu}_{\beta}(s)=J^{\mu}(s)-\frac{1}{\beta}H^{\mu}(s)=

\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}c_{x_t x_{t+1}}^{u_t}+\frac{1}{\beta}\big(\log\mu_{u_t|x_t}+\log p_{x_t x_{t+1}}^{u_t}\big)\,\Big|\,x_0=s\Big], \qquad (7)

where β\beta is the Lagrange parameter. Here we have not included the constant value J0J_{0} in the cost Lagrangian Vβμ(s)V_{\beta}^{\mu}(s) for simplicity. We refer to the above Lagrangian Vβμ(s)V^{\mu}_{\beta}(s) (7) as the free-energy function and 1β\frac{1}{\beta} as temperature due to their close analogies with statistical physics (where free energy is enthalpy (E) minus the temperature times entropy (TH)). To determine the optimal policy μβ\mu^{*}_{\beta} that minimizes the Lagrangian Vβμ(s)V_{\beta}^{\mu}(s) in (7), we first derive the Bellman equation for Vβμ(s)V_{\beta}^{\mu}(s).

Theorem 1.

The free-energy function Vβμ(s)V_{\beta}^{\mu}(s) in (7) satisfies the following recursive Bellman equation

V_{\beta}^{\mu}(s)=\sum_{a\in\mathcal{A},\,s'\in\mathcal{S}}\mu_{a|s}\,p_{ss'}^{a}\Big(\bar{c}_{ss'}^{a}+\frac{\gamma}{\beta}\log\mu_{a|s}+\gamma V_{\beta}^{\mu}(s')+c_0(s)\Big), \qquad (8)

where \mu_{a|s}=\mu(a|s), p_{ss'}^{a}=p(s'|s,a), and \bar{c}_{ss'}^{a}=c(s,a,s')+\frac{\gamma}{\beta}\log p(s'|s,a) for simplicity in notation, and the function c_0(s) does not depend on the policy \mu.

Proof. Please refer to the Appendix for details. It must be noted that this derivation shows and exploits the algebraic structure \sum_{s'}p_{ss'}^{a}H^{\mu}(s')=\sum_{s'}p_{ss'}^{a}\log p_{ss'}^{a}+\log\mu_{a|s}+\lambda_s+1, as detailed in Lemma 2 in the appendix.

Now the optimal policy satisfies \frac{\partial V_{\beta}^{\mu}(s)}{\partial\mu(a|s)}=0, which results in the Gibbs distribution

\mu^{*}_{\beta}(a|s)=\frac{\exp\big\{-(\beta/\gamma)\Lambda_{\beta}(s,a)\big\}}{\sum_{a'\in\mathcal{A}}\exp\big\{-(\beta/\gamma)\Lambda_{\beta}(s,a')\big\}}, \text{ where} \qquad (9)
\Lambda_{\beta}(s,a)=\sum_{s'\in\mathcal{S}}p_{ss'}^{a}\big(\bar{c}_{ss'}^{a}+\gamma V^{*}_{\beta}(s')\big), \qquad (10)

is the state-action value function, p_{ss'}^{a}=p(s'|s,a), c_{ss'}^{a}=c(s,a,s'), \bar{c}_{ss'}^{a}=c_{ss'}^{a}+\frac{\gamma}{\beta}\log p_{ss'}^{a}, and V^{*}_{\beta}(=V^{\mu^{*}_{\beta}}_{\beta}) is the value function corresponding to the policy \mu^{*}_{\beta}. To avoid notational clutter we use the above notations wherever clear from the context. Substituting the policy \mu^{*}_{\beta} in (9) back into the Bellman equation (8), we obtain the implicit equation

V^{*}_{\beta}(s)=-\frac{\gamma}{\beta}\log\Big(\sum_{a\in\mathcal{A}}\exp\Big\{-\frac{\beta}{\gamma}\Lambda_{\beta}(s,a)\Big\}\Big). \qquad (11)

Without loss of generality, we ignore the term c0(s)c_{0}(s) in (8) since it does not affect the policy as seen from equation (9). To solve for the state-action value function Λβ(s,a)\Lambda_{\beta}(s,a) and free-energy function Vβ(s)V_{\beta}^{*}(s) we substitute the expression of Vβ(s)V^{*}_{\beta}(s) in (11) into the expression of Λβ(s,a)\Lambda_{\beta}(s,a) in (10) to obtain the implicit equation Λβ(s,a)=:[TΛβ](s,a)\Lambda_{\beta}(s,a)=:[T\Lambda_{\beta}](s,a), where

[T\Lambda_{\beta}](s,a)=\sum_{s'\in\mathcal{S}}p_{ss'}^{a}\Big(c_{ss'}^{a}+\frac{\gamma}{\beta}\log p_{ss'}^{a}\Big)-\frac{\gamma^{2}}{\beta}\sum_{s'\in\mathcal{S}}p_{ss'}^{a}\log\sum_{a'\in\mathcal{A}}\exp\Big\{-\frac{\beta}{\gamma}\Lambda_{\beta}(s',a')\Big\}. \qquad (12)

To solve the above implicit equation, we show that the map T in (III-B) is a contraction map, and therefore \Lambda_{\beta} can be obtained using fixed point iterations, which are guaranteed to converge to the unique fixed point. Consequently, the global minimum V_{\beta}^{*} in (11) and the optimal policy \mu_{\beta}^{*} in (9) can be obtained.
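For concreteness, a minimal numpy sketch of this fixed-point iteration is given below; the array names c[s, a, s'] and p[s, a, s'] for the cost and transition tensors, and the tolerance and iteration limits, are our assumptions for illustration. The optimal policy of (9) then follows as a softmax of -(\beta/\gamma)\Lambda_{\beta}(s,\cdot).

```python
import numpy as np

def soft_bellman_fixed_point(c, p, beta, gamma, tol=1e-8, max_iter=10000):
    """Fixed-point iteration Lambda <- T(Lambda) for the map in (III-B).

    c[s, a, s1]: instantaneous cost, p[s, a, s1]: transition probabilities.
    Returns the state-action free energy Lambda_beta(s, a).
    """
    S, A, _ = c.shape
    Lam = np.zeros((S, A))
    # expected cost plus the (gamma/beta) * sum_s1 p log p term of (12)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    base = (p * c).sum(axis=2) + (gamma / beta) * plogp.sum(axis=2)
    for _ in range(max_iter):
        # V(s1) = -(gamma/beta) * log sum_a' exp(-(beta/gamma) Lambda(s1, a')), cf. (11)
        z = -(beta / gamma) * Lam
        m = z.max(axis=1)
        V = -(gamma / beta) * (m + np.log(np.exp(z - m[:, None]).sum(axis=1)))
        Lam_new = base + gamma * (p * V[None, None, :]).sum(axis=2)
        if np.abs(Lam_new - Lam).max() < tol:
            return Lam_new
        Lam = Lam_new
    return Lam
```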

Theorem 2.

The map [T\Lambda_{\beta}](s,a) as defined in (III-B) is a contraction mapping with respect to a weighted maximum norm, i.e., there exist a vector \xi=(\xi_s)\in\mathbb{R}^{|\mathcal{S}|} with \xi_s>0\;\forall\;s\in\mathcal{S} and a scalar \alpha<1 such that

\|T\Lambda_{\beta}-T\Lambda'_{\beta}\|_{\xi}\leq\alpha\|\Lambda_{\beta}-\Lambda'_{\beta}\|_{\xi} \qquad (13)

where \|\Lambda_{\beta}\|_{\xi}=\max_{s\in\mathcal{S},\,a\in\mathcal{A}}\frac{|\Lambda_{\beta}(s,a)|}{\xi_s}.

Proof. Please refer to Appendix -A for details.

Remark 2.

It is known from the sensitivity analysis [39] that the value of the Lagrange parameter β\beta in (7) is inversely proportional to the constant J0J_{0} in (III-A). Thus, at small values of β0\beta\approx 0 (equivalently large J0J_{0}), we are mainly maximizing the Shannon Entropy Hμ(s)H^{\mu}(s) and the resultant policy in (9) encourages exploration along the paths of the MDP. As β\beta increases (J0J_{0} decreases), more and more weight is given to the state value function Jμ(s)J^{\mu}(s) in (7) and the policy in (9) goes from being exploratory to being exploitative. As β\beta\rightarrow\infty the exploration is completely eliminated and we converge to a deterministic policy μ\rightarrow\mu^{*} that minimizes Jμ(s)J^{\mu}(s) in (3).

Remark 3.

We briefly draw readers’ attention to the value function Y(s)=𝔼[t=0γt(cxtxt+1ut+(1/β)logμut|xt)]Y(s)=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}(c_{x_{t}x_{t+1}}^{u_{t}}+(1/\beta)\log\mu_{u_{t}|x_{t}})] considered in the entropy regularized methods [14]. Note that in Y(s)Y(s) the discounting γt\gamma^{t} is multiplied to both the cost term cxtxt+1utc_{x_{t}x_{t+1}}^{u_{t}} as well as the entropy term (1/β)logμut|xt(1/\beta)\log\mu_{u_{t}|x_{t}}. However, in our MEP-based method, the Lagrangian Vβμ(s)V_{\beta}^{\mu}(s) in (7) comprises of discounting γt\gamma^{t} only over the cost term cxtxt+1utc_{x_{t}x_{t+1}}^{u_{t}} and not on the entropy terms (1/β)logμut|xt(1/\beta)\log\mu_{u_{t}|x_{t}} and (1/β)logpxtxt+1ut(1/\beta)\log p_{x_{t}x_{t+1}}^{u_{t}}. Therefore, the policy in [14] does not satisfy MEP applied over the distribution pμp_{\mu}; consequently their exploration along the paths is not as unbiased as our algorithm.

III-C Model-free Reinforcement Learning Problems

In these problems, the cost function c(s,a,s') and the state-transition probability p(s'|s,a) are not known a priori; however, at each discrete time instant t the agent takes an action u_t under a policy \mu, and the environment (the underlying stochastic process) returns an instantaneous cost c_{x_t x_{t+1}}^{u_t} and subsequently moves to the state x_{t+1}\sim p(\cdot|x_t,u_t). Motivated by the iterative updates of the Q-learning algorithm [2], we consider the following stochastic updates in our Algorithm 1 to learn the state-action value function in our methodology

\Psi_{t+1}(x_t,u_t)=(1-\nu_t(x_t,u_t))\,\Psi_t(x_t,u_t)+\nu_t(x_t,u_t)\Big[c_{x_t x_{t+1}}^{u_t}-\frac{\gamma^{2}}{\beta}\log\sum_{a'\in\mathcal{A}}\exp\Big\{\frac{-\beta}{\gamma}\Psi_t(x_{t+1},a')\Big\}\Big], \qquad (14)

with the stepsize parameter νt(xt,ut)(0,1]\nu_{t}(x_{t},u_{t})\in(0,1], and show that under appropriate conditions on νt\nu_{t} (as illustrated shortly), Ψt\Psi_{t} will converge to the fixed point Λ¯β\bar{\Lambda}_{\beta}^{*} of the implicit equation

\bar{\Lambda}_{\beta}(s,a)=\sum_{s'\in\mathcal{S}}p_{ss'}^{a}\Big(c_{ss'}^{a}-\frac{\gamma^{2}}{\beta}\log\sum_{a'}\exp\big(\tfrac{-\beta}{\gamma}\bar{\Lambda}_{\beta}(s',a')\big)\Big)=:[\bar{T}\bar{\Lambda}_{\beta}](s,a). \qquad (15)

The above equation differs slightly from the equation \Lambda_{\beta}(s,a)=[T\Lambda_{\beta}](s,a) in (III-B). The latter has an additional term \frac{\gamma}{\beta}\sum_{s'}p_{ss'}^{a}\log p_{ss'}^{a}, which makes it difficult to learn its fixed point \Lambda_{\beta}^{*} in the absence of the state transition probability p_{ss'}^{a} itself. Since in this work we do not attempt to determine (or learn) the distribution p_{ss'}^{a} (as in [42]) from the interactions of the agent with the environment, we work with the approximate state-action value function \bar{\Lambda}_{\beta} in (III-C), where \bar{\Lambda}_{\beta}\rightarrow\Lambda_{\beta} for large \beta values (since \frac{\gamma}{\beta}(\sum_{s'}p_{ss'}^{a}\log p_{ss'}^{a})\rightarrow 0 as \beta\rightarrow\infty). The following proposition elucidates the conditions under which the updates \Psi_t in (III-C) converge to the fixed point \bar{\Lambda}_{\beta}^{*}.

Proposition 1.

Consider the class of MDPs illustrated in Section III-A. Given that

\sum_{t=0}^{\infty}\nu_t(s,a)=\infty,\quad \sum_{t=0}^{\infty}\nu_t^{2}(s,a)<\infty\;\;\forall\;s\in\mathcal{S}, a\in\mathcal{A},

the update \Psi_t(s,a) in (III-C) converges to the fixed point \bar{\Lambda}_{\beta}^{*} of the map \bar{T}:\bar{\Lambda}_{\beta}\rightarrow\bar{\Lambda}_{\beta} in (III-C) with probability 1.

Proof. Please refer to the Appendix 4.

Input: N, \nu_t(\cdot,\cdot), \sigma; Output: \mu^{*}, \bar{\Lambda}^{*}
Initialize: t=0, \Psi_0=0, \mu_0(a|s)=1/|\mathcal{A}|.
for episode = 1 to N do
       \beta=\sigma\times episode; reset environment at state x_t
       while True do
             sample u_t\sim\mu_t(\cdot|x_t); obtain cost c_t and x_{t+1}
             update \Psi_t(x_t,u_t), \mu_{t+1}(u_t|x_t) in (III-C) and (9)
             break if x_{t+1}=\delta; t\leftarrow t+1
      
Algorithm 1 Model-free Reinforcement Learning
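A minimal Python sketch of the inner update (III-C) used by Algorithm 1 is given below. The Gym-style interface env.reset() and env.step(u) returning (next_state, cost, done), the integer state/action encoding, and the constant stepsize nu are our assumptions for illustration; Proposition 1 requires a decaying stepsize \nu_t(s,a) satisfying the stated conditions.

```python
import numpy as np

def mep_model_free(env, S, A, episodes, sigma, gamma, nu=0.1, seed=0):
    """Sketch of Algorithm 1: learn Psi via the stochastic update (III-C)."""
    rng = np.random.default_rng(seed)
    Psi = np.zeros((S, A))
    for ep in range(1, episodes + 1):
        beta = sigma * ep                          # annealing schedule beta = sigma * episode
        x = env.reset()
        done = False
        while not done:
            # policy (9): softmax of -(beta/gamma) * Psi(x, .)
            z = -(beta / gamma) * Psi[x]
            z -= z.max()
            mu = np.exp(z) / np.exp(z).sum()
            u = rng.choice(A, p=mu)
            x_next, cost, done = env.step(u)
            # soft-min target: -(gamma^2/beta) * log sum_a' exp(-(beta/gamma) Psi(x', a'))
            z2 = -(beta / gamma) * Psi[x_next]
            m = z2.max()
            soft_min = -(gamma ** 2 / beta) * (m + np.log(np.exp(z2 - m).sum()))
            # update (III-C); a constant stepsize is used here purely for brevity
            Psi[x, u] = (1 - nu) * Psi[x, u] + nu * (cost + soft_min)
            x = x_next
    return Psi
```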
Remark 4.

Note that the stochasticity of the optimal policy \mu_{\beta}^{*}(a|s) in (9) depends on the value of \gamma, which allows it to account for the effect of the discount factor on its exploration strategy. More precisely, for large discount factors the time window T over which the instantaneous costs \gamma^{t}c(s_t,a_t,s_{t+1}) are considerable (i.e., \gamma^{t}c_{s_t s_{t+1}}^{a_t}>\epsilon\;\forall\;t\leq T) is large, and thus the stochastic policy (9) performs higher exploration along the paths. On the other hand, for small discount factors this time window T is relatively smaller, and thus the stochastic policy (9) inherently performs less exploration. As illustrated in the simulations, this characteristic of the policy in (9) results in even faster convergence rates in comparison to the benchmark algorithms as the discount factor \gamma decreases.

IV MDPs with Infinite Shannon Entropy

Here we consider MDPs where the Shannon Entropy H^{\mu}(s) of the distribution \{p_{\mu}(\omega|s)\} over the paths \omega\in\Omega is not necessarily finite (for instance, due to the absence of an absorbing state). To ensure the finiteness of the objective in (III-A) we consider the discounted Shannon Entropy [43, 44]

H^{\mu}_{d}(s)=-\mathbb{E}\Big[\sum_{t=0}^{\infty}\alpha^{t}\big(\log\mu_{u_t|x_t}+\log p_{x_t x_{t+1}}^{u_t}\big)\,\big|\,x_0=s\Big] \qquad (16)

with a discount factor \alpha\in(0,1), which we choose to be independent of the discount factor \gamma in the value function J^{\mu}(s). The free-energy function (or the Lagrangian) resulting from the optimization problem in (III-A) with the alternate objective function H^{\mu}_{d}(s) in (16) is given by

V_{\beta,I}^{\mu}(s)=\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}\hat{c}_{x_t x_{t+1}}^{u_t}+\frac{\alpha^{t}}{\beta}\log\mu(u_t|x_t)\,\big|\,x_0=s\Big], \qquad (17)

where \hat{c}_{x_t x_{t+1}}^{u_t}=c_{x_t x_{t+1}}^{u_t}+\frac{\gamma^{t}}{\beta\alpha^{t}}\log p_{x_t x_{t+1}}^{u_t}, and the subscript I stands for the 'infinite entropy' case. Note that the free-energy functions (7) and (17) differ only with regard to the discount factor \alpha, and thus our solution methodology in this section is similar to the one in Section III-B.

Theorem 3.

The free-energy function Vβ,Iμ(s)V^{\mu}_{\beta,I}(s) in (17) satisfies the recursive Bellman equation

V_{\beta,I}^{\mu}(s)=\sum_{a,s'}\mu_{a|s}\,p_{ss'}^{a}\Big(\check{c}_{ss'}^{a}+\frac{\gamma}{\alpha\beta}\log\mu_{a|s}+\gamma V_{\beta,I}^{\mu}(s')\Big) \qquad (18)

where \check{c}_{ss'}^{a}=c_{ss'}^{a}+\frac{\gamma}{\alpha\beta}\log p_{ss'}^{a}.

Proof. Please see Appendix 3. The above derivation shows and exploits the algebraic structure \alpha\sum_{s'}p_{ss'}^{a}H^{\mu}_{d}(s')=\sum_{s'}p_{ss'}^{a}\log p_{ss'}^{a}+\log\alpha\mu(a|s)+\lambda_s (Lemma 4).

The optimal policy satisfies \frac{\partial V_{\beta,I}^{\mu}(s)}{\partial\mu(a|s)}=0, which results in the Gibbs distribution

\mu^{*}_{\beta,I}(a|s)=\frac{\exp\big\{-\frac{\beta\alpha}{\gamma}\Phi_{\beta}(s,a)\big\}}{\sum_{a'\in\mathcal{A}}\exp\big\{-\frac{\beta\alpha}{\gamma}\Phi_{\beta}(s,a')\big\}}, \text{ where} \qquad (19)
\Phi_{\beta}(s,a)=\sum_{s'\in\mathcal{S}}p_{ss'}^{a}\big(\check{c}_{ss'}^{a}+\gamma V^{*}_{\beta,I}(s')\big), \qquad (20)

is the corresponding state-action value function. Substituting \mu^{*}_{\beta,I} in (19) into the Bellman equation (18) results in the following optimal free-energy function V_{\beta,I}^{*}(s)(:=V_{\beta,I}^{\mu^{*}_{\beta,I}}(s)).

V_{\beta,I}^{*}(s)=-\frac{\gamma}{\alpha\beta}\log\sum_{a'\in\mathcal{A}}\exp\Big(\frac{-\alpha\beta}{\gamma}\Phi_{\beta}(s,a')\Big) \qquad (21)
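Relative to (9)-(11), the only change is the effective inverse temperature \alpha\beta/\gamma. A minimal sketch, assuming the state-action values \Phi_{\beta}(s,a) are available as an array, of how the policy (19) and free energy (21) would be evaluated:

```python
import numpy as np

def policy_and_value_infinite_entropy(Phi, beta, gamma, alpha):
    """Policy (19) and free energy (21) from the state-action values Phi[s, a]."""
    kappa = alpha * beta / gamma                    # effective inverse temperature
    z = -kappa * Phi
    m = z.max(axis=1, keepdims=True)
    expz = np.exp(z - m)                            # shifted for numerical stability
    mu = expz / expz.sum(axis=1, keepdims=True)     # Gibbs policy (19)
    V = -(1.0 / kappa) * (m[:, 0] + np.log(expz.sum(axis=1)))   # free energy (21)
    return mu, V
```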
Remark 5.

The subsequent steps to learn the optimal policy \mu^{*}_{\beta,I} in (19) are similar to the steps demonstrated in Section III-C. We forgo a similar analysis here.

Remark 6.

When \alpha=\gamma, the policy \mu^{*}_{\beta,I} in (19), the state-action value function \Phi_{\beta} in (20), and the free-energy function V^{*}_{\beta,I} in (21) correspond to the analogous expressions obtained in the entropy regularized methods [14]. However, in this paper we do not require that \alpha=\gamma. On the contrary, we propose that \alpha should take large values. In fact, our simulations in Section VI demonstrate the better convergence rates obtained when \gamma<\alpha=(1-\epsilon) as compared to when \gamma=\alpha.

V Parameterized MDPs

V-A Problem Formulation

As stated in Section I, many application areas such as small cell networks (Figure 1) pose a parameterized MDP that requires simultaneously determining (a) the optimal policy \mu^{*}, and (b) the unknown state and action parameters \zeta=\{\zeta_s\} and \eta=\{\eta_a\}, such that the state value function

J^{\mu}_{\zeta\eta}(s)=\mathbb{E}_{p_{\mu}}\Big[\sum_{t=0}^{\infty}\gamma^{t}c\big(x_t(\zeta),u_t(\eta),x_{t+1}(\zeta)\big)\,\Big|\,x_0=s\Big] \qquad (22)

is minimized for all s\in\mathcal{S}, where x_t(\zeta) denotes the state x_t\in\mathcal{S} with the associated parameter \zeta_{x_t} and u_t(\eta) denotes the action u_t\in\mathcal{A} with the associated action parameter value \eta_{u_t}. As in Section III-A, we assume that the parameterized MDPs exhibit at least one deterministic proper policy (Assumption 1) to ensure the finiteness of the value function J^{\mu}_{\zeta\eta}(s) and the Shannon Entropy H^{\mu}(s) of the MDP for all \mu\in\Gamma. We further assume that the state-transition probability \{p_{ss'}^{a}\} is independent of the state and action parameters \zeta,\eta.

V-B Problem Solution

This problem was solved in Section III-B, where the states and actions were not parameterized, or equivalently can be viewed as if the parameters \zeta, \eta were known and fixed. For the parameterized case, we apply the same solution methodology, which results in the same optimal policy \mu^{*}_{\beta,\zeta\eta} as in (9) as well as the corresponding free-energy function V_{\beta,\zeta\eta}^{*}(s) in (11), except that they are now characterized by the parameters \zeta, \eta. To determine the optimal (local) parameters \zeta, \eta, we set \sum_{s'\in\mathcal{S}}\frac{\partial V_{\beta,\zeta\eta}^{*}(s')}{\partial\zeta_s}=0 for all s, and \sum_{s'\in\mathcal{S}}\frac{\partial V_{\beta,\zeta\eta}^{*}(s')}{\partial\eta_a}=0 for all a, which we implement by using the gradient descent steps

\zeta_s^{+}=\zeta_s-\eta\sum_{s'\in\mathcal{S}}G_{\zeta_s}^{\beta}(s'), \qquad \eta_a^{+}=\eta_a-\bar{\eta}\sum_{s'\in\mathcal{S}}G_{\eta_a}^{\beta}(s'). \qquad (23)

Here G_{\zeta_s}^{\beta}(s'):=\frac{\partial V_{\beta,\zeta\eta}^{*}(s')}{\partial\zeta_s} and G_{\eta_a}^{\beta}(s'):=\frac{\partial V_{\beta,\zeta\eta}^{*}(s')}{\partial\eta_a}. The derivatives G_{\zeta_s}^{\beta} and G_{\eta_a}^{\beta} are assumed to be bounded (see Proposition 2). We compute these derivatives as G_{\zeta_s}^{\beta}(s')=\sum_{a'}\mu_{a'|s'}K_{\zeta_s}^{\beta}(s',a') and G_{\eta_a}^{\beta}(s')=\sum_{a'}\mu_{a'|s'}L_{\eta_a}^{\beta}(s',a') for all s'\in\mathcal{S}, where K_{\zeta_s}^{\beta}(s',a') and L_{\eta_a}^{\beta}(s',a') are the fixed points of the corresponding Bellman equations K_{\zeta_s}^{\beta}(s',a')=[T_1 K_{\zeta_s}^{\beta}](s',a') and L_{\eta_a}^{\beta}(s',a')=[T_2 L_{\eta_a}^{\beta}](s',a'), where

[T_1 K_{\zeta_s}^{\beta}](s',a')=\sum_{s''}p_{s's''}^{a'}\Big[\frac{\partial c_{s's''}^{a'}}{\partial\zeta_s}+\gamma G_{\zeta_s}^{\beta}(s'')\Big], \qquad [T_2 L_{\eta_a}^{\beta}](s',a')=\sum_{s''}p_{s's''}^{a'}\Big[\frac{\partial c_{s's''}^{a'}}{\partial\eta_a}+\gamma G_{\eta_a}^{\beta}(s'')\Big]. \qquad (24)

Note that in the above equations we have suppressed the dependence of the instantaneous cost function c_{s's''}^{a'} on the parameters \zeta and \eta to avoid notational clutter.
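A minimal numpy sketch of one such gradient computation is given below for a single scalar parameter; the array names dc_dtheta[s', a', s''] (the cost derivative with respect to that parameter), p, and mu, as well as the tolerance settings, are our assumptions for illustration. The returned G would then be used in the gradient descent step (23).

```python
import numpy as np

def parameter_gradient(dc_dtheta, p, mu, gamma, tol=1e-8, max_iter=10000):
    """Fixed point K of T1 in (24) for one scalar parameter (a single zeta_s or eta_a),
    and the gradient G(s') = sum_a' mu(a'|s') K(s', a')."""
    S, A, _ = dc_dtheta.shape
    K = np.zeros((S, A))
    base = (p * dc_dtheta).sum(axis=2)            # sum_s'' p(s''|s',a') dc/dtheta
    for _ in range(max_iter):
        G = (mu * K).sum(axis=1)                  # G(s'') = sum_a'' mu(a''|s'') K(s'', a'')
        K_new = base + gamma * (p * G[None, None, :]).sum(axis=2)
        if np.abs(K_new - K).max() < tol:
            K = K_new
            break
        K = K_new
    return (mu * K).sum(axis=1)

# One gradient-descent step of (23) for a state parameter zeta_s with stepsize eta_lr:
#   zeta_s -= eta_lr * parameter_gradient(dc_dzeta_s, p, mu, gamma).sum()
```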

Theorem 4.

The operators [T1Kζsβ](s,a)[T_{1}K_{\zeta_{s}}^{\beta}](s^{\prime},a^{\prime}) and [T2Lηaβ](s,a)[T_{2}L_{\eta_{a}}^{\beta}](s^{\prime},a^{\prime}) defined in (24) are contraction maps with respect to a weighted maximum norm ξ\|\cdot\|_{\xi}, where Xξ=maxs,aX(s,a)ξs\|X\|_{\xi}=\max_{s^{\prime},a^{\prime}}\frac{X(s^{\prime},a^{\prime})}{\xi_{s^{\prime}}} and ξ|𝒮|\xi\in\mathbb{R}^{|\mathcal{S}|} is a vector of positive components ξs\xi_{s}.

Proof. Please refer to the Appendix 4 for details.

As previously stated in Section I, the state value function J^{\mu}_{\zeta\eta}(\cdot) in (22) is generally a non-convex function of the parameters \zeta, \eta and is riddled with multiple poor local minima, with the resulting optimization problem being possibly NP-hard [27]. In our algorithm for parameterized MDPs we anneal \beta from \beta_{\min} to \beta_{\max}, similar to our approach for non-parameterized MDPs in Section III-B, where the solution from the current \beta iteration is used to initialize the subsequent \beta iteration. However, in addition to facilitating a steady transition from an exploratory policy to an exploitative policy, annealing facilitates a gradual homotopy from the convex negative Shannon entropy function to the non-convex state value function J^{\mu}_{\zeta\eta}, which prevents our algorithm from getting stuck in a poor local minimum. The underlying idea of our heuristic is to track the optimum as the initial convex function deforms to the actual non-convex cost. Also, minimizing the Lagrangian V_{\beta}^{*}(s) at \beta=\beta_{\min}\approx 0 determines the global minimum, thereby making our algorithm insensitive to initialization. Algorithm 2 illustrates the steps to determine the policy and parameters for a parameterized MDP; a skeleton of the annealing loop follows the listing.

Input: \beta_{\min}, \beta_{\max}, \tau; Output: \mu^{*}, \zeta and \eta.
Initialize: \beta=\beta_{\min}, \mu_{a|s}=\frac{1}{|\mathcal{A}|}, and \zeta,\eta to 0
while \beta\leq\beta_{\max} do
       while True do
             repeat until convergence
                   update \Lambda_{\beta}, \mu_{\beta}, G_{\zeta_s}^{\beta}, G_{\eta_a}^{\beta} in (10), (9) and (24)
             update \zeta,\eta in (23); if \|G_{\zeta_s}\|,\|G_{\eta_a}\|<\epsilon, break
       \beta\leftarrow\tau\beta
      
Algorithm 2 Parameterized Markov Decision Process
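The sketch below outlines the outer annealing loop of Algorithm 2 in Python; solve_policy_and_gradients and gradient_step are hypothetical placeholders for the inner fixed-point computations of (10) and (24) and the update (23), and the schedule constants are ours for illustration.

```python
import numpy as np

def annealed_parameterized_mdp(solve_policy_and_gradients, gradient_step, zeta0, eta0,
                               beta_min=1e-3, beta_max=1e3, tau=1.5, eps=1e-4):
    """Skeleton of Algorithm 2: anneal beta geometrically and warm-start each beta
    iteration with the previous (policy, parameter) solution (homotopy)."""
    zeta, eta, beta = zeta0, eta0, beta_min
    mu = None
    while beta <= beta_max:
        while True:
            mu, G_zeta, G_eta = solve_policy_and_gradients(beta, zeta, eta)
            if max(np.abs(G_zeta).max(), np.abs(G_eta).max()) < eps:
                break                               # parameters converged at this beta
            zeta, eta = gradient_step(zeta, eta, G_zeta, G_eta)   # step (23)
        beta *= tau                                 # proceed to a lower-temperature problem
    return mu, zeta, eta
```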

V-C Parameterized Reinforcement Learning

In many applications formulated as parameterized MDPs, explicit knowledge of the cost function c_{ss'}^{a}, its dependence on the parameters \zeta, \eta, and the state-transition probabilities \{p_{ss'}^{a}\} is not available. However, for each action a the environment returns an instantaneous cost based on its current state x_t, the next state x_{t+1}, and the parameter values \zeta, \eta, which can subsequently be used to simultaneously learn the policy \mu^{*}_{\beta,\zeta\eta} and the unknown state and action parameters \zeta, \eta via stochastic iterative updates. At each \beta iteration in our learning algorithm, we employ the stochastic iterative updates in (III-C) to determine the optimal policy \mu^{*}_{\beta,\zeta\eta} for given \zeta, \eta values, and subsequently employ the stochastic iterative updates

K_{\zeta_s}^{t+1}(x_t,u_t)=(1-\nu_t(x_t,u_t))\,K_{\zeta_s}^{t}(x_t,u_t)+\nu_t(x_t,u_t)\Big[\frac{\partial c_{x_t x_{t+1}}^{u_t}}{\partial\zeta_s}+\gamma G_{\zeta_s}^{t}(x_{t+1})\Big], \qquad (25)

where G_{\zeta_s}^{t}(x_{t+1})=\sum_{a}\mu_{a|x_{t+1}}K_{\zeta_s}^{t}(x_{t+1},a), to learn the derivative G_{\zeta_s}^{\beta*}(\cdot). Similar updates are used to learn G_{\eta_a}^{\beta*}(\cdot). The parameter values \zeta, \eta are then updated using the gradient descent step in (23). The following proposition formalizes the convergence of the updates in (V-C) to the fixed point G_{\zeta_s}^{\beta*}.

Proposition 2.

For the class of parameterized MDPs considered in Section V-A given that

1. \sum_{t=0}^{\infty}\nu_t(s,a)=\infty and \sum_{t=0}^{\infty}\nu_t^{2}(s,a)<\infty\;\;\forall\;s\in\mathcal{S}, a\in\mathcal{A},

2. \exists\; B>0 such that \big|\frac{\partial c(s',a',s'')}{\partial\zeta_s}\big|\leq B\;\;\forall\;s,s',a',s'',

3. \exists\; C>0 such that \big|\frac{\partial c(s',a',s'')}{\partial\eta_a}\big|\leq C\;\;\forall\;a,s',a',s'',

the updates in (V-C) converge to the unique fixed point G_{\zeta_s}^{\beta*}(s') of the map T_1: G_{\zeta_s}\rightarrow G_{\zeta_s} in (24).

Proof. Please refer to the Appendix 4 for details.

Input: \beta_{\min}, \beta_{\max}, \tau, T, \nu_t; Output: \mu^{*}, \zeta, \eta
Initialize: \beta=\beta_{\min}, \mu_t=\frac{1}{|\mathcal{A}|}, and \zeta,\eta,G_{\zeta}^{\beta},G_{\eta}^{\beta},K_{\zeta}^{\beta},L_{\eta}^{\beta},\bar{\Lambda}_{\beta} to 0.
while \beta\leq\beta_{\max} do
       Use Algorithm 1 to obtain \mu^{*}_{\beta,\zeta\eta} at given \zeta, \eta, \beta.
       Consider env1(\zeta,\eta), env2(\zeta',\eta'); set \zeta'=\zeta, \eta'=\eta
       while \{\zeta_s\}, \{\eta_a\} have not converged do
             for each s\in\mathcal{S} do
                   for episode = 1 to T do
                         reset env1, env2 at state x_t
                         while True do
                               sample action u_t\sim\mu^{*}(\cdot|x_t).
                               env1: obtain c_t, x_{t+1}.
                               env2: set \zeta_s'=\zeta_s+\Delta\zeta_s, get c_t', x_{t+1}.
                               find G_{\zeta_s}^{t+1}(x_t) with \frac{\partial c_{x_t x_{t+1}}^{u_t}}{\partial\zeta_s}\approx\frac{c_t'-c_t}{\Delta\zeta_s}.
                               break if x_{t+1}=\delta; t\leftarrow t+1.
             Similarly learn G_{\eta_a}^{\beta}. Update \{\zeta_s\}, \{\eta_a\} in (23).
       \beta\leftarrow\tau\beta
      
Algorithm 3 Parameterized Reinforcement Learning

Algorithmic Details: Please refer to Algorithm 3 for a complete implementation. Unlike the scenario in Section III-C, where the agent acts upon the environment by taking an action u_t\in\mathcal{A} and learns only the policy \mu^{*}, here the agent interacts with the environment by (a) taking an action u_t\in\mathcal{A} and also (b) providing estimated parameter values \zeta, \eta to the environment; subsequently, the environment returns an instantaneous cost and the next state. In our Algorithm 3, we first learn the policy \mu_{\beta}^{*} at a given value of the parameters \zeta, \eta using the iterations (III-C), and then learn the fixed points G_{\zeta_s}^{\beta*}, G_{\eta_a}^{\beta*} using the iterations in (V-C) to update the parameters \zeta, \eta using (23). Note that the iterations (V-C) require the derivatives \partial c(s',a',s'')/\partial\zeta_s and \partial c(s',a',s'')/\partial\eta_a, which we determine using the instantaneous costs resulting from two \epsilon-distinct environments and a finite difference method, as sketched below. Here the \epsilon-distinct environments represent the same underlying MDP but differ only in one of the parameter values. However, if two \epsilon-distinct environments are not feasible, one can work with a single environment where the algorithm stores the instantaneous costs and the corresponding parameter values upon each interaction with the environment.
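A minimal Python sketch of this finite-difference learning step for one state parameter \zeta_s is given below. The environment interface (reset, reset_to, set_param, and step returning (next_state, cost, done)), the synchronized transitions of the two environments, and the constant stepsize nu are our assumptions for illustration.

```python
import numpy as np

def learn_G_zeta_s(env1, env2, mu, zeta, s, S, A, episodes, gamma,
                   nu=0.1, delta=1e-2, seed=0):
    """Learn K_{zeta_s} via the update (25), estimating dc/dzeta_s by finite differences
    between two epsilon-distinct environments (env2 has zeta_s perturbed by delta)."""
    rng = np.random.default_rng(seed)
    K = np.zeros((S, A))
    env2.set_param(s, zeta[s] + delta)              # env2 differs from env1 only in zeta_s
    for _ in range(episodes):
        x = env1.reset()
        env2.reset_to(x)                            # both environments start at the same state
        done = False
        while not done:
            u = rng.choice(A, p=mu[x])
            x_next, c1, done = env1.step(u)
            _, c2, _ = env2.step(u)                 # assumed to follow the same realized transition
            dc_dzeta = (c2 - c1) / delta            # finite-difference derivative of the cost
            G_next = mu[x_next] @ K[x_next]         # G_{zeta_s}(x_{t+1})
            K[x, u] = (1 - nu) * K[x, u] + nu * (dc_dzeta + gamma * G_next)
            x = x_next
    return (mu * K).sum(axis=1)                     # G_{zeta_s}(.) for the gradient step (23)
```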

Remark 7.

Parameterized MDPs with infinite Shannon entropy H^{\mu} can be addressed analogously using the above methodology.

Remark 8.

The MDPs addressed in Sections III, IV, and V consider different variants of the discounted infinite horizon problem. The MDPs in Section III address the class of sequential problems that have a non-zero probability of reaching a cost-free termination state (i.e., a finite Shannon entropy value). The MDPs considered in Section IV need not reach a termination state (possibly infinite Shannon entropy), and the underlying sequential decision problem continues over the horizon determined by the discounting factor \gamma. The parameterized MDPs in Section V can have finite or infinite Shannon entropy, but they comprise states and actions that have an unknown parameter associated with them.

VI Simulations

We broadly classify our simulations into two categories. Firstly, in the model-free RL setting we demonstrate our Algorithm 1 in determining the control policy \mu^{*} for the finite and infinite Shannon entropy variants of the Gridworld environment in Figure 2.

Figure 2: Gridworld environment.

Each cell in the Gridworld denotes a state. The cells colored black are invalid states. An agent can choose to move vertically, horizontally, diagonally, or stay at the current cell. Each action is followed by a probability of slipping into neighbouring states (probability 0.05 of slipping in each of the vertical and horizontal directions, and probability 0.025 of slipping in each of the diagonal directions, for a cumulative p(\text{slip})\approx 0.3). For the finite entropy case, each step incurs a unit cost, and the process terminates when the agent reaches the terminal state \mathbf{T}. For the infinite entropy case, each step incurs a unit reward. Secondly, in the parameterized MDPs and RL setting we demonstrate our Algorithms 2 and 3 in designing a 5G small cell network. This involves simultaneously determining the locations of the small cells in the network as well as the optimal routing path of the communication packets from the base station to the users.
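One plausible reading of this slip model is sketched below in Python; the offset encoding of moves, the omitted clipping to valid cells, and the function name are our assumptions for illustration (in particular, a slip in the intended direction simply reinforces the intended move).

```python
import numpy as np

# Slip model of the Gridworld: probability 0.05 for each of the four straight
# neighbours and 0.025 for each of the four diagonal neighbours (cumulative ~0.3),
# with the intended move executed otherwise (probability 0.7).
STRAIGHT = [(-1, 0), (1, 0), (0, -1), (0, 1)]
DIAGONAL = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def slip_step(cell, intended, rng):
    """Return the next cell given the intended move, an offset such as (0, 1); the
    'stay' action corresponds to the offset (0, 0). Clipping to valid cells is omitted."""
    moves = [intended] + STRAIGHT + DIAGONAL
    probs = [0.7] + [0.05] * 4 + [0.025] * 4
    dr, dc = moves[rng.choice(len(moves), p=probs)]
    return (cell[0] + dr, cell[1] + dc)

rng = np.random.default_rng(0)
print(slip_step((3, 3), (0, 1), rng))   # e.g. intended move: one cell to the right
```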

We compare our MEP-based Algorithm 1 with the benchmark algorithms Entropy Regularized (ER) G-learning (also referred to as Soft Q-learning) [14], Q-learning [5], and Double Q-learning [13]. Note that our choice of the benchmark algorithm G-learning (or entropy regularized Soft Q-learning) presented in [14] is based on its commonality with our framework, as discussed in Section III-B, and the choice of Q-learning and Double Q-learning is based on their widespread utility in the literature. Also, note that the work in [14] already establishes the efficacy of the G-learning algorithm over the following algorithms in the literature: Q-learning, Double Q-learning, \Psi-learning [45], Speedy Q-learning [46], and the consistent Bellman operator \mathcal{T}_{C} of [47]. Below we highlight features and advantages of our MEP-based Algorithm 1.

Figure 3: Performance of MEP-based algorithm: Illustrations on Gridworld Environment in Figure 2. (a1)-(a3) Finite Entropy variant: Illustrates faster convergence of Algorithm 1 (MEP) at different γ\gamma values. (a4) Demonstrates faster rates of convergence of Algorithm 1 (MEP) for γ\gamma values ranging from 0.650.65 to 0.950.95. (b1)-(b3) Infinite Entropy Variant : Demonstrates faster convergence of Algorithm 1 (MEP) to JJ^{*}. (b4) Illustrates the consistent faster convergence rates of MEP with γ\gamma ranging from 0.650.65 to 0.950.95. (c1)-(c4) Finite Entropy Version (with added Gaussian noise) : Similar observations as in (a1)-(a4) with significantly higher instability in learning with Double Q algorithm.

Faster Convergence to Optimal JJ^{*}: Figures 3(a1)-(a3) (finite entropy variant of Gridworld) and Figures 3 (b1)-(b3) (infinite entropy variant of Gridworld) illustrate the faster convergence of our MEP-based Algorithm 1 for different discount factor γ\gamma values. Here, at each episode the percentage error ΔV/J\Delta V/J^{*} between the value function VβμV_{\beta}^{\mu} corresponding to learned policy μ=μ(ep)\mu=\mu(ep) in the episode epep, and the optimal value function JJ^{*} is given by

\frac{\Delta V(ep)}{J^{*}}=\frac{1}{N}\sum_{i=1}^{N}\sum_{s\in\mathcal{S}}\frac{\big|V_{\beta,i}^{\mu(ep)}(s)-J^{*}(s)\big|}{J^{*}(s)}, \qquad (26)

where NN denotes the total experimental runs and ii indexes the value function Vβ,iμV_{\beta,i}^{\mu} for each run. As observed in Figures 3(a1)-(a3), and Figures 3(b1)-(b3), our Algorithm 1 converges even faster as the discount factor γ\gamma decreases. We characterize the faster convergence rates also in terms of the convergence time - more precisely the percentage E¯pr\bar{E}_{pr} of total episodes taken for the learning error ΔV/J\Delta V/J^{*} to reach within 5%5\% of the best (see Figures 3(a4) and 3(b4)). As is observed in the figures, the performance of our (MEP-based) algorithm in comparison to entropy regularized G learning is better across all values (0.650.65 to 0.950.95) of discount factor γ\gamma. Note that the performance of Algorithm 1 gets even better with decreasing γ\gamma values where the smaller discount factor values occur in instances such as the context of recommendation systems [48], and teaching RL-agents using human-generated rewards [49].

Robustness to noise in data: Figures 3(c1)-(c4) demonstrate robustness to noisy environments; here the instantaneous cost c(s,a,s') in the finite entropy variant of Gridworld is noisy. For the purpose of simulations, we add Gaussian noise \mathcal{N}(0,\sigma^{2}) with \sigma=1 for vertical and horizontal actions, and \sigma=0.5 for diagonal movements. Here, at each episode we compare the percentage error \Delta V/J^{*} in the learned value functions V_{\beta} (corresponding to the state-action value estimate in (III-C)) of the respective algorithms. Similar to our observations and conclusions in Figures 3(a1)-(a3) and Figures 3(b1)-(b3), we see faster convergence of our MEP-based algorithm over the benchmark algorithms in Figures 3(c1)-(c3) in the case of a noisy environment. Also, Figure 3(c4) demonstrates that across all discount factor values (0.65 to 0.95), Algorithm 1 converges faster than the entropy regularized Soft Q-learning.

Simultaneously determining the unknown parameters and policy in Parameterized MDPs: We design the 5G Small Cell Network (see Figure 1) both when the underlying model (c_{ss'}^{a} and p_{ss'}^{a}) is known (using Algorithm 2) and when it is unknown (using Algorithm 3). In our simulations we randomly distribute 46 user nodes \{n_i\} at locations \{x_i\} and the base station \delta at z in the domain \Omega\subset\mathbb{R}^{2}, as shown in Figure 4(a). The objective is to determine the locations \{y_j\}_{j=1}^{5} (parameters) of the small cells \{f_j\}_{j=1}^{5} and the corresponding communication routes (policy). Here, the state space of the underlying MDP is \mathcal{S}=\{n_1,\ldots,n_{46},f_1,\ldots,f_5\}, where the locations y_1,\ldots,y_5 of the small cells are the unknown parameters \{\zeta_s\} of the MDP, the action space is \mathcal{A}=\{f_1,\ldots,f_5\}, and the cost function is c(s,a,s')=\|\rho(s)-\rho(s')\|_2^{2}, where \rho(\cdot) denotes the spatial location of the respective state. We consider two cases where (a) p_{ss'}^{a} is deterministic, i.e., an action a at the state s results in s'=a with probability 1, and (b) p_{ss'}^{a} is probabilistic, such that an action a at the state s results in s'=a with probability 0.9 or in the state s'=f_1 with probability 0.1. Additionally, due to the absence of prior work in the literature on network design problems modeled as parameterized MDPs, we compare our results only with the solution resulting from a straightforward sequential methodology (as shown in Figure 4(a)), where we first partition the user nodes into 5 distinct clusters to allocate a small cell in each cluster, and then determine the optimal routes in the network.
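A minimal sketch of this stage cost and transition model in Python is given below; the integer indexing of states (users first, then small cells) and the function names are our assumptions for illustration.

```python
import numpy as np

NUM_USERS, NUM_CELLS = 46, 5          # states: n_1..n_46 followed by f_1..f_5

def stage_cost(rho_s, rho_s_next):
    """c(s, a, s') = ||rho(s) - rho(s')||_2^2, the squared distance between the
    spatial locations of consecutive states."""
    return float(np.sum((np.asarray(rho_s) - np.asarray(rho_s_next)) ** 2))

def transition(a_idx, rng, probabilistic=False):
    """Next-state index under action a = f_a: the packet moves to small cell a with
    probability 1 (deterministic case) or 0.9, and to f_1 with probability 0.1
    (probabilistic case)."""
    if probabilistic and rng.random() < 0.1:
        return NUM_USERS                  # index of small cell f_1
    return NUM_USERS + a_idx              # index of small cell f_a

rng = np.random.default_rng(0)
print(stage_cost((0.0, 0.0), (1.0, 2.0)), transition(2, rng, probabilistic=True))
```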

Figure 4: Parameterized MDPs and RL - Design of 5G small cell network. State space 𝒮={{ni},{fj},δ}\mathcal{S}=\{\{n_{i}\},\{f_{j}\},\delta\} comprises the user nodes {ni}\{n_{i}\}, small cells {fj}\{f_{j}\}, and base station δ\delta. The unknown parameters ζs\zeta_{s} denote the locations {yj}\{y_{j}\} of the small cells. The action space comprises the small cells, 𝒜={fj}\mathcal{A}=\{f_{j}\}. Based on our modelling of the network there are no unknown action parameters {ηa}\{\eta_{a}\}. (a) Small cell locations {yj}\{y_{j}\} and communication routes determined using a straightforward sequential methodology. (b)-(c) Small cells at {yj}\{y_{j}\} and communication routes (illustrated by arrows) resulting from the policy obtained from Algorithm 2 (model-based) and Algorithm 3 (model-free), respectively, when pssap_{ss^{\prime}}^{a} is deterministic. (d)-(e) Solutions obtained using Algorithm 2 and Algorithm 3, respectively, when pssap_{ss^{\prime}}^{a} is probabilistic. (f) Sensitivity analysis of the solutions with respect to the user node locations {xi}\{x_{i}\}. (g)-(h) Network designs obtained when considering the entropy of the distribution over the control actions and over the paths of the MDP, respectively. (i) Network design obtained without annealing in Algorithm 2. (j) Simulation on a larger dataset (user base increased by more than 1010 times).

Deterministic p(s,a,s)p(s,a,s^{\prime}): Figure 4(b) illustrates the allocation of small cells and the corresponding communication routes (resulting from the optimal policy μ\mu^{*}) as determined by Algorithm 2. Here, the network is designed to minimize the cumulative cost of communication from each user node and small cell. As denoted in the figure, the route δy3y4y1ni\delta\rightarrow y_{3}\rightarrow y_{4}\rightarrow y_{1}\rightarrow n_{i} carries the communication packet from the base station δ\delta to the respective user nodes nin_{i}, as indicated by the grey arrow from y1y_{1}. The cost incurred here is approximately 180%180\% lower than that in Figure 4(a), clearly indicating the advantage of simultaneously determining the parameters and policy over a sequential methodology. In the model-free RL setting, where the functions c(s,a)c(s,a), pssap_{ss^{\prime}}^{a}, and the locations {xi}\{x_{i}\} of the user nodes {ni}\{n_{i}\} are not known, we employ our Algorithm 3 to determine the small cell locations {yj}j=15\{y_{j}\}_{j=1}^{5} as well as the optimal policy {μ(a|s)}\{\mu^{*}(a|s)\}, as demonstrated in Figure 4(c). It is evident from Figures 4(b) and (c) that the solutions obtained when the model is completely known and when it is unknown are approximately the same. In fact, the solutions differ only by 1.9%1.9\% in terms of the total cost s𝒮Jζημ(s)\sum_{s\in\mathcal{S}}J^{\mu}_{\zeta\eta}(s) (22) incurred, clearly indicating the efficacy of our model-free learning Algorithm 3.

Probabilistic p(s,a,s)p(s,a,s^{\prime}): Figure 4(d) illustrates the solution obtained by our Algorithm 2 when the underlying model (c(s,a)c(s,a), pssap_{ss^{\prime}}^{a}, and {xi}\{x_{i}\}) is known. As before, the network is designed to minimize the cumulative cost of communication from each user node and small cell. The cost associated with the network design is approximately 127%127\% lower than in Figure 4(a). Figure 4(e) illustrates the solution obtained by Algorithm 3 for the model-free case (c(s,a)c(s,a), pssap_{ss^{\prime}}^{a}, and {xi}\{x_{i}\} are unknown). Similar to the above scenario, the solutions obtained using Algorithms 2 and 3 are approximately the same and differ only by 0.3%0.3\% in terms of the total cost sJζημ(s)\sum_{s}J^{\mu}_{\zeta\eta}(s) incurred, thereby substantiating the efficacy of our proposed model-free learning Algorithm 3.

Sensitivity Analysis: Our algorithms enable categorizing the user nodes {ni}\{n_{i}\} in Figure 4(b) into (i) low, (ii) medium, and (iii) high sensitivity categories, such that the final solution is least susceptible to the user nodes in (i) and most susceptible to the nodes in (iii). Note that the above sensitivity analysis requires computing the derivative sVβμ(s)/ζs\sum_{s^{\prime}}\partial V_{\beta}^{\mu}(s^{\prime})/\partial\zeta_{s}, which we determine by solving for the fixed point of the Bellman equation in (24). The derivative sVβμ(s)/ζs\sum_{s^{\prime}}{\partial V_{\beta}^{\mu}(s^{\prime})}/{\partial\zeta_{s}} computed at β\beta\rightarrow\infty is a measure of the sensitivity of the solution to the cost function sJζημ(s)\sum_{s}J^{\mu}_{\zeta\eta}(s) in (22), since VβμV_{\beta}^{\mu} in (11) is a smooth approximation of Jζημ(s)J^{\mu}_{\zeta\eta}(s) in (22) and VβμJζημ(s)V_{\beta}^{\mu}\rightarrow J^{\mu}_{\zeta\eta}(s) as β\beta\rightarrow\infty. A similar analysis for Figures 4(c)-(e) can be done if the locations {xi}\{x_{i}\} of the user nodes {ni}\{n_{i}\} are known to the agent. The sensitivity of the final solution to the locations {yj}\{y_{j}\}, zz of the small cells and the base station can also be determined in a similar manner.

Entropy over paths versus entropy of the policy: We demonstrate the benefit of maximizing the entropy of the distribution {pμ(ω|s)}\{p_{\mu}(\omega|s)\} over the paths of an MDP as compared to the distribution {μ(a|s)}\{\mu(a|s)\} over the control actions. Figure 4(g) demonstrates the 5G network obtained by considering the distribution over the control actions, and Figure 4(h) illustrates the network obtained by considering the distribution over the entire paths. The network cost incurred in Figure 4(h) is 5%5\% less than the cost incurred in Figure 4(g). Here, we consider the probabilistic pssap_{ss^{\prime}}^{a} scenario described above and minimize the cumulative communication cost incurred only from the user nodes.

Avoiding poor local minima and large scale setups: As noted in Section V-B, annealing β\beta from a small value βmin(0)\beta_{\min}(\approx 0) to a large value βmax\beta_{\max} prevents the algorithm from getting stuck at a poor local minimum. Figure 4(i) demonstrates the network design obtained when Algorithm 2 does not anneal β\beta and instead iteratively solves the optimization problem at β=βmax\beta=\beta_{\max}. The resulting network incurs an 11%11\% higher cost in comparison to the network obtained in Figure 4(h), where Algorithm 2 anneals β\beta from a small to a large value. Figure 4(j) demonstrates the 5G network design obtained using Algorithm 2 when the number of user nodes is increased by around 1212 times (to 610610) and the number of allocated small cells is doubled to 1010.

VII Analysis and Discussion

VII-1 Mutual Information Minimization

The optimization problem (III-A) maximizes the Shannon entropy Hμ(s)H^{\mu}(s) under a given constraint on the value function JμJ^{\mu}. We can similarly pose and solve a mutual information minimization problem, which requires determining the distribution pμ(𝒫|s)p_{\mu^{*}}(\mathcal{P}|s) (with control policy μ\mu^{*}) over the paths of the MDP that is close to some given prior distribution q(𝒫|s)q(\mathcal{P}|s) [16, 15]. Here, the objective is to minimize the KL-divergence DKL(pμq)D_{KL}(p_{\mu}\|q) under the constraint J=J0J=J_{0} (as in (III-A)).

VII-2 Non-dependence on choice of J0J_{0} in (III-A)

In our framework we do not explicitly determine or work with the value of J0J_{0}. Instead, we work with the Lagrange parameter β\beta in the Lagrangian Vβμ(s)V^{\mu}_{\beta}(s) in (7) corresponding to the optimization problem (III-A). It is known from sensitivity analysis [6] that small values of β\beta correspond to large values of J0J_{0}, and large values of β\beta correspond to small values of J0J_{0}. Thus, in our algorithms we solve the optimization problem (III-A) beginning at a small value β=βmin0\beta=\beta_{\min}\approx 0 (corresponding to some feasible large J0J_{0}), and anneal it to a large value βmax\beta_{\max} (corresponding to a small J0J_{0} value) at which the stochastic policy μ\mu in (9) converges to either 0 or 11. Also, at β0\beta\approx 0 the stochastic policy μβ\mu_{\beta}^{*} in (9) follows a uniform distribution, which implicitly fixes the value of J0J_{0}. Therefore, the initial value of J0J_{0} in the proposed algorithms is fixed implicitly and is not required to be pre-specified.
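The following toy sketch illustrates this point numerically: the Gibbs-form policy of (9) is essentially uniform at small β and concentrates on the smallest state-action value as β grows. The helper name gibbs_policy and the example numbers are ours.

```python
import numpy as np

def gibbs_policy(state_action_values, beta, gamma=1.0):
    """Policy of the form (9): mu(a|s) proportional to exp(-(beta/gamma) * Lambda(s,a)).

    A max-shift is used for numerical stability; `state_action_values` is the
    vector of Lambda(s, a) over actions for a fixed state s.
    """
    logits = -(beta / gamma) * np.asarray(state_action_values, dtype=float)
    logits -= logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()

Lambda = [2.0, 1.0, 1.5]
print(gibbs_policy(Lambda, beta=1e-3))   # nearly uniform: maximal exploration, large implicit J0
print(gibbs_policy(Lambda, beta=50.0))   # nearly one-hot on the cheapest action: small implicit J0
```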

VII-3 Computational complexity

Our MEP-based Algorithm 1 performs exactly the same number of computations as the Soft Q-learning algorithm [14] for each epoch (or iteration) within an episode. In comparison to the Q and Double Q learning algorithms, our proposed algorithm exhibits a similar number of computational steps, apart from the additional minor computation of explicitly determining μ\mu^{*} in (9).

VII-4 Scheduling β\beta and Phase Transition

In our Algorithm 1, we follow a linear schedule βk=σk\beta_{k}=\sigma k (σ>0\sigma>0), as suggested in the benchmark algorithm [14], to anneal the parameter β\beta. In the case of parameterized MDPs (Algorithms 2 and 3) we geometrically anneal β\beta (i.e., βk+1=τβk\beta_{k+1}=\tau\beta_{k}, τ>1\tau>1) from a small value βmin\beta_{\min} to a large value βmax\beta_{\max} at which the control policy μβ\mu_{\beta}^{*} converges to either 0 or 11. Several other MEP-based algorithms that address problems akin to parameterized MDPs, such as Deterministic Annealing [7], also incorporate geometric annealing of β\beta. The underlying idea in [7] is that the solution undergoes significant changes only at certain critical values βcr\beta_{cr} (phase transitions) and changes insignificantly between two consecutive critical βcr\beta_{cr}’s. Thus, for all practical purposes geometric annealing of β\beta works well. Similar to [7], our Algorithms 2 and 3 also undergo phase transitions, and we are working on an analytical expression for them.
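A minimal sketch of the two schedules is given below; the function names and the particular constants are illustrative assumptions.

```python
def linear_schedule(sigma, num_epochs):
    """beta_k = sigma * k (sigma > 0), the schedule used with Algorithm 1."""
    return [sigma * k for k in range(1, num_epochs + 1)]

def geometric_schedule(beta_min, beta_max, tau):
    """beta_{k+1} = tau * beta_k (tau > 1), annealed from beta_min (~0) up to beta_max."""
    betas, beta = [], beta_min
    while beta < beta_max:
        betas.append(beta)
        beta *= tau
    betas.append(beta_max)
    return betas

print(linear_schedule(sigma=0.1, num_epochs=5))
print(geometric_schedule(beta_min=1e-3, beta_max=10.0, tau=1.02)[:5])
```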

VII-5 Capacity and Exclusion Constraints

Certain parameterized MDPs may pose capacity or dynamical constraints on their parameters. For instance, each small cell fjf_{j} allocated in Figure 4 can be constrained in capacity to cater to at most a fraction cjc_{j} of the user nodes in the network. Our framework allows modeling such a constraint as qμ(fj):=a,niμ(a|ni)p(fj|a,ni)cjq_{\mu}(f_{j}):=\sum_{a,n_{i}}\mu(a|n_{i})p(f_{j}|a,n_{i})\leq c_{j}, where qμ(fj)q_{\mu}(f_{j}) measures the fraction of user nodes {ni}\{n_{i}\} that connect to fjf_{j}. In another scenario, the locations {xi}\{x_{i}\} of the user nodes could be dynamically varying as x˙i=f(x,t)\dot{x}_{i}=f(x,t); the resulting policy μβ\mu^{*}_{\beta} and small cell locations {yj}\{y_{j}\} will then also be time varying. We treat the free-energy function VβμV^{\mu}_{\beta} in (11) as a control-Lyapunov function and determine time varying μβ\mu^{*}_{\beta} and {yj}\{y_{j}\} such that V˙βμ0\dot{V}^{\mu}_{\beta}\leq 0.
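A small sketch of how the capacity constraint can be checked for a candidate policy is shown below; the array layout and the normalization by the number of user nodes (so that the load reads as a fraction) are our own assumptions.

```python
import numpy as np

def cell_loads(mu, p):
    """q_mu(f_j) = sum over user nodes n_i and actions a of mu(a | n_i) * p(f_j | a, n_i).

    mu: array (num_users, num_actions), the policy restricted to user-node states
    p:  array (num_users, num_actions, num_cells), the probabilities p(f_j | a, n_i)
    The result is normalized by the number of user nodes so that it reads as a fraction.
    """
    num_users = mu.shape[0]
    return np.einsum('ia,iaj->j', mu, p) / num_users

def satisfies_capacity(q, c):
    """True if every small cell f_j stays within its capacity fraction c_j."""
    return bool(np.all(np.asarray(q) <= np.asarray(c)))
```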

VII-6 Uncertainty in Parameters

Many application areas comprise states and actions whose associated parameters are uncertain, with a known distribution over the set of their possible values. For instance, a user node nin_{i} in Figure 4 may have an associated uncertainty in its location xix_{i} owing to measurement errors. Our proposed framework easily incorporates such uncertainties in parameter values. For example, the above uncertainty results in replacing c(ni,s,a)c(n_{i},s^{\prime},a) with c(ni,s,a)=xiXip(xi|ni)c(ni,s,a)c^{\prime}(n_{i},s^{\prime},a)=\sum_{x_{i}\in X_{i}}p(x_{i}|n_{i})c(n_{i},s^{\prime},a), where p(xi|ni)p(x_{i}|n_{i}) is the distribution over the set XiX_{i} of possible locations xix_{i}. The subsequent solution approach remains the same as in Section V-B.
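A minimal sketch of this expected-cost replacement, assuming the squared-distance cost used earlier and a finite candidate set X_i, is shown below; the function name expected_cost is ours.

```python
import numpy as np

def expected_cost(candidate_locations, probs, target_location):
    """c'(n_i, s', a) = sum_{x_i in X_i} p(x_i | n_i) * c(n_i, s', a), where each
    per-candidate cost is the squared distance between x_i and the location of s'.

    candidate_locations: array (K, 2) listing the set X_i
    probs:               array (K,) with p(x_i | n_i), summing to one
    """
    squared_distances = np.sum((candidate_locations - target_location) ** 2, axis=1)
    return float(np.dot(probs, squared_distances))

# Example: two equally likely candidate positions for a user node
X_i = np.array([[0.0, 0.0], [1.0, 0.0]])
p_i = np.array([0.5, 0.5])
print(expected_cost(X_i, p_i, np.array([2.0, 0.0])))   # 0.5*4 + 0.5*1 = 2.5
```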

APPENDICES

Appendix A. Proof of Lemma 1: Let x¯0=s\bar{x}_{0}=s. By Assumption 1 there exists a path ω=(u¯0,x¯1,,x¯N=δ)\omega=(\bar{u}_{0},\bar{x}_{1},\ldots,\bar{x}_{N}=\delta) such that pμ¯(ω|x0=s)>0p_{\bar{\mu}}(\omega|x_{0}=s)>0, which by (6) implies p(xk+1=x¯k+1|xk=x¯k,uk=u¯k)>0p(x_{k+1}=\bar{x}_{k+1}|x_{k}=\bar{x}_{k},u_{k}=\bar{u}_{k})>0. Then, the probability pμ(ω|x0=s)p_{\mu}(\omega|x_{0}=s) of taking the path ω\omega under the stochastic policy μΓ\mu\in\Gamma in (4) is also positive.
Proof of Theorem 1: The following lemma is needed.

Lemma 2.

The Shannon Entropy Hμ()H^{\mu}(\cdot) corresponding to the MDP illustrated in Section III-A satisfies the algebraic expression spssaHμ(s)=spssalogpssa+logμa|s+λs+1\sum_{s^{\prime}}p_{ss^{\prime}}^{a}H^{\mu}(s^{\prime})=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\log p_{ss^{\prime}}^{a}+\log\mu_{a|s}+\lambda_{s}+1.

Proof.

Hμ()H^{\mu}(\cdot) in (III-A) satisfies the recursive Bellman equation

Hμ(s)=as′′μa|spss′′a[logpss′′alogμa|s+Hμ(s′′)]H^{\mu}(s^{\prime})=\sum_{a^{\prime}s^{\prime\prime}}\mu_{a^{\prime}|s^{\prime}}p_{s^{\prime}s^{\prime\prime}}^{a^{\prime}}\big{[}-\log p_{s^{\prime}s^{\prime\prime}}^{a^{\prime}}-\log\mu_{a^{\prime}|s^{\prime}}+H^{\mu}(s^{\prime\prime})\big{]}

On the right side of the above Bellman equation, we subtract a zero term λs(aμa|s1)\lambda_{s^{\prime}}(\sum_{a}\mu_{a|s^{\prime}}-1) that accounts for the normalization constraint aμa|s=1\sum_{a}\mu_{a|s^{\prime}}=1, where λs\lambda_{s^{\prime}} is some constant. Taking the derivative of the resulting expression with respect to μa|s\mu_{a|s} we obtain

Hμ(s)μa|s=ρ(s,a)δss+a,s′′μa|spss′′aHμ(s′′)μa|sλsδss,\displaystyle\text{\small$\frac{\partial H^{\mu}(s^{\prime})}{\partial\mu_{a|s}}=\rho(s,a)\delta_{ss^{\prime}}+\sum_{a^{\prime},s^{\prime\prime}}\mu_{a^{\prime}|s^{\prime}}p_{s^{\prime}s^{\prime\prime}}^{a^{\prime}}\frac{\partial H^{\mu}(s^{\prime\prime})}{\partial\mu_{a|s}}$}-\lambda_{s^{\prime}}\delta_{ss^{\prime}}, (27)

where ρ(s,a)=s′′pss′′a(logpss′′aHμ(s′′))logμa|s1\rho(s,a)=-\sum_{s^{\prime\prime}}p_{ss^{\prime\prime}}^{a}(\log p_{ss^{\prime\prime}}^{a}-H^{\mu}(s^{\prime\prime}))-\log\mu_{a|s}-1. The subsequent steps in the proof involve algebraic manipulations and make use of the quantity pμ(s):=spμ(s|s)p_{\mu}(s^{\prime}):=\sum_{s}p_{\mu}(s^{\prime}|s), where pμ(s|s)=apssaμa|sp_{\mu}(s^{\prime}|s)=\sum_{a}p_{ss^{\prime}}^{a}\mu_{a|s}. Under the mild assumption that for each state ss^{\prime} there exists a state-action pair (s,a)(s,a) such that the probability of the system entering the state ss^{\prime} upon taking action aa in the state ss is non-zero (i.e., pssa>0p_{ss^{\prime}}^{a}>0), we have pμ(s)>0p_{\mu}(s^{\prime})>0. Now, we multiply equation (27) by pμ(s)p_{\mu}(s^{\prime}) and sum over all s𝒮s^{\prime}\in\mathcal{S} to obtain

\sum_{s^{\prime}}p_{\mu}(s^{\prime})\frac{\partial H^{\mu}(s^{\prime})}{\partial\mu_{a|s}}=p_{\mu}(s)\rho(s,a)+\sum_{s^{\prime\prime}}p_{\mu}(s^{\prime\prime})\frac{\partial H^{\mu}(s^{\prime\prime})}{\partial\mu_{a|s}}-p_{\mu}(s)\lambda_{s},

where pμ(s′′)=spμ(s)pμ(s′′|s)p_{\mu}(s^{\prime\prime})=\sum_{s^{\prime}}p_{\mu}(s^{\prime})p_{\mu}(s^{\prime\prime}|s^{\prime}). The derivative terms on both sides cancel to give p_{\mu}(s)\rho(s,a)-p_{\mu}(s)\lambda_{s}=0, and since p_{\mu}(s)>0 this implies spssaHμ(s)=spssalogpssa+logμa|s+λs+1\sum_{s^{\prime}}p_{ss^{\prime}}^{a}H^{\mu}(s^{\prime})=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\log p_{ss^{\prime}}^{a}+\log\mu_{a|s}+\lambda_{s}+1. ∎

Now consider the free energy function Vβμ(s)V_{\beta}^{\mu}(s) in (7) and separate out the t=0t=0 term in its infinite summation to obtain

V^{\mu}_{\beta}(s)=\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}c_{x_{t}x_{t+1}}^{u_{t}}+\frac{1}{\beta}\log p_{x_{t}x_{t+1}}^{u_{t}}+\frac{1}{\beta}\log\mu(u_{t}|x_{t})\Big|x_{0}=s\Big]

=\sum_{s^{\prime},a}\mu(a|s)p_{ss^{\prime}}^{a}\big(c_{ss^{\prime}}^{a}+\frac{1}{\beta}\log p_{ss^{\prime}}^{a}+\frac{1}{\beta}\log\mu(a|s)\big)+\mathbb{E}\Big[\sum_{t=1}^{\infty}\gamma^{t}c_{x_{t}x_{t+1}}^{u_{t}}+\frac{1}{\beta}\log p_{x_{t}x_{t+1}}^{u_{t}}+\frac{1}{\beta}\log\mu(u_{t}|x_{t})\Big|x_{0}=s\Big].

Re-indexing the second term with t=t^{\prime}+1, u^{\prime}_{t^{\prime}}=u_{t^{\prime}+1}, x^{\prime}_{t^{\prime}}=x_{t^{\prime}+1}, and factoring out \gamma gives

V_{\beta}^{\mu}(s)=\sum_{s^{\prime},a}\mu(a|s)p_{ss^{\prime}}^{a}\big(c_{ss^{\prime}}^{a}+\frac{1}{\beta}\log p_{ss^{\prime}}^{a}+\frac{1}{\beta}\log\mu(a|s)\big)+\gamma\,\mathbb{E}\Big[\sum_{t^{\prime}=0}^{\infty}\gamma^{t^{\prime}}c_{x^{\prime}_{t^{\prime}}x^{\prime}_{t^{\prime}+1}}^{u^{\prime}_{t^{\prime}}}+\frac{1}{\gamma\beta}\log p_{x^{\prime}_{t^{\prime}}x^{\prime}_{t^{\prime}+1}}^{u^{\prime}_{t^{\prime}}}+\frac{1}{\gamma\beta}\log\mu(u^{\prime}_{t^{\prime}}|x^{\prime}_{t^{\prime}})\Big|x_{0}=s\Big].

Conditioning the inner expectation on x^{\prime}_{0}=s^{\prime} (tower property) yields

V_{\beta}^{\mu}(s)=\sum_{s^{\prime},a}\mu(a|s)p_{ss^{\prime}}^{a}\big(c_{ss^{\prime}}^{a}+\frac{1}{\beta}\log p_{ss^{\prime}}^{a}+\frac{1}{\beta}\log\mu(a|s)\big)+\gamma\sum_{a,s^{\prime}}\mu(a|s)p_{ss^{\prime}}^{a}V^{\mu}_{\gamma\beta}(s^{\prime})

\Rightarrow V_{\beta}^{\mu}(s)=\sum_{a,s^{\prime}}\mu_{a|s}p_{ss^{\prime}}^{a}\Big[c_{ss^{\prime}}^{a}+\frac{1}{\beta}\log p_{ss^{\prime}}^{a}+\frac{1}{\beta}\log\mu_{a|s}+\gamma V_{\gamma\beta}^{\mu}(s^{\prime})\Big], (28)

where Vγβμ(s):=Jμ(s)1γβHμ(s)V_{\gamma\beta}^{\mu}(s^{\prime}):=J^{\mu}(s^{\prime})-\frac{1}{\gamma\beta}H^{\mu}(s^{\prime}). Now we relate Vγβμ(s)V_{\gamma\beta}^{\mu}(s^{\prime}) to Vβμ(s)V_{\beta}^{\mu}(s^{\prime}) by adding and subtracting 1βHμ(s)\frac{1}{\beta}H^{\mu}(s^{\prime}), which gives Vγβμ(s)=Vβμ(s)1γγβH(s)V_{\gamma\beta}^{\mu}(s^{\prime})=V_{\beta}^{\mu}(s^{\prime})-\frac{1-\gamma}{\gamma\beta}H(s^{\prime}). Substituting Vγβ(s)V_{\gamma\beta}(s^{\prime}) and the algebraic expression obtained in Lemma 2 into equation (28) above, we obtain

V^{\mu}_{\beta}(s)=\sum_{s^{\prime},a}\mu(a|s)p_{ss^{\prime}}^{a}\big(c_{ss^{\prime}}^{a}+\frac{1}{\beta}\log p_{ss^{\prime}}^{a}+\frac{1}{\beta}\log\mu(a|s)\big)+\gamma\sum_{a,s^{\prime}}\mu(a|s)p_{ss^{\prime}}^{a}V_{\beta}^{\mu}(s^{\prime})-\frac{1-\gamma}{\beta}\sum_{a}\mu(a|s)\sum_{s^{\prime}}p_{ss^{\prime}}^{a}H^{\mu}(s^{\prime})

=\sum_{s^{\prime},a}\mu(a|s)p_{ss^{\prime}}^{a}\big(c_{ss^{\prime}}^{a}+\frac{1}{\beta}\log p_{ss^{\prime}}^{a}+\frac{1}{\beta}\log\mu(a|s)\big)+\gamma\sum_{a,s^{\prime}}\mu(a|s)p_{ss^{\prime}}^{a}V_{\beta}^{\mu}(s^{\prime})-\frac{1-\gamma}{\beta}\sum_{a}\mu(a|s)\Big(\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\log p_{ss^{\prime}}^{a}+\log\mu(a|s)+\lambda_{s}+1\Big)

\Rightarrow V_{\beta}^{\mu}(s)=\sum_{a,s^{\prime}}\mu(a|s)p_{ss^{\prime}}^{a}\big(c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log\mu(a|s)\big)+\gamma\sum_{a,s^{\prime}}\mu(a|s)p_{ss^{\prime}}^{a}V_{\beta}^{\mu}(s^{\prime})+c_{0}(s) (29)

where c0(s)=1γβ(λs+1)c_{0}(s)=-\frac{1-\gamma}{\beta}(\lambda_{s}+1) does not depend on the policy μ\mu. Therefore, since the control policy μ(a|s)\mu^{*}(a|s) is determined by taking critical points of Vβμ(s)V_{\beta}^{\mu}(s), it is independent of c0(s)c_{0}(s) as shown in the following subsection.

-A Independence of policy on c0c_{0}

The Bellman equation is given by

Vβμ(s)\displaystyle V_{\beta}^{\mu}(s) =a,sμ(a|s)pssa(cssa+γβlogpssa+γβlogμ(a|s))\displaystyle=\sum_{a,s^{\prime}}\mu(a|s)p_{ss^{\prime}}^{a}\big{(}c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log\mu(a|s)\big{)}
+γa,sμ(a|s)pssaVβμ(s)+c0(s)\displaystyle+\gamma\sum_{a,s^{\prime}}\mu(a|s)p_{ss^{\prime}}^{a}V_{\beta}^{\mu}(s^{\prime})+c_{0}(s) (30)

The optimal control policy μβ(a|s)\mu_{\beta}^{*}(a|s) obtained upon taking the derivative of the recursive Bellman equation (and also accounting for aμ(a|s)=1\sum_{a}\mu(a|s)=1) is given by

μβ(a|s)\displaystyle\mu_{\beta}^{*}(a|s) =exp{βγΛ¯β(s,a)}aexp{βγΛ¯β(s,a)},where\displaystyle=\frac{\exp\{-\frac{\beta}{\gamma}\bar{\Lambda}^{*}_{\beta}(s,a)\}}{\sum_{a^{\prime}}\exp\{-\frac{\beta}{\gamma}\bar{\Lambda}^{*}_{\beta}(s,a^{\prime})\}},\quad\text{where} (31)
Λ¯β(s,a)\displaystyle\bar{\Lambda}_{\beta}^{*}(s,a) =spssa(cssa+γβlogpssa+γVβ(s)),\displaystyle=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\big{(}c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}+\gamma V_{\beta}^{*}(s^{\prime})\big{)}, (32)

Substituting the optimal policy μβ(a|s)\mu_{\beta}^{*}(a|s) (31) into the recursive Bellman equation (30) we obtain

Vβ(s)=γβlogaexp{βγΛ¯β(s,a)}+c0(s).\displaystyle V_{\beta}^{*}(s)=-\frac{\gamma}{\beta}\log\sum_{a}\exp\Big{\{}-\frac{\beta}{\gamma}\bar{\Lambda}_{\beta}^{*}(s,a)\Big{\}}+c_{0}(s). (33)

Substituting the above equation (33) into the state-action value function in (32) we obtain the following map

Λ¯β(s,a)=spssa[(cssa+γβlogpssa)\displaystyle\bar{\Lambda}_{\beta}^{*}(s,a)=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\Big{[}\big{(}c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}\big{)}
γ2βlogaexp{βγΛ¯β(s,a)}]+c0(s)=:[T1Λ¯β](s,a),\displaystyle-\frac{\gamma^{2}}{\beta}\log\sum_{a}\exp\Big{\{}-\frac{\beta}{\gamma}\bar{\Lambda}_{\beta}^{*}(s,a)\Big{\}}\Big{]}+c_{0}(s)=:[T_{1}\bar{\Lambda}_{\beta}](s,a), (34)

where the proof that the map T1:Λ¯βΛ¯βT_{1}:\bar{\Lambda}_{\beta}\rightarrow\bar{\Lambda}_{\beta} is a contraction map is analogous to the proof of Theorem 2.

The state-action value Λβ(s,a)\Lambda_{\beta}^{*}(s,a) obtained by disregarding c0(s)c_{0}(s) in (33) satisfies the recursive equation

Λβ(s,a)=spssa[(cssa+γβlogpssa)\displaystyle\Lambda_{\beta}^{*}(s,a)=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\Big{[}\big{(}c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}\big{)}
γ2βlogaexp{βγΛβ(s,a)}]=:[TΛβ](s,a),\displaystyle-\frac{\gamma^{2}}{\beta}\log\sum_{a}\exp\Big{\{}-\frac{\beta}{\gamma}\Lambda_{\beta}^{*}(s,a)\Big{\}}\Big{]}=:[T\Lambda_{\beta}](s,a), (35)

where the map T:ΛβΛβT:\Lambda_{\beta}\rightarrow\Lambda_{\beta} has been shown to be a contraction map in Theorem 2.
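For concreteness, the snippet below is a minimal sketch of the fixed-point iteration on the map T in (35) for a tabular model, assuming the transition probabilities and costs are stored as dense arrays; the function names are ours and this is not the exact implementation used in our experiments.

```python
import numpy as np
from scipy.special import logsumexp

def soft_bellman_operator(Lam, p, c, beta, gamma):
    """One application of the map T in (35) for a tabular MDP.

    Lam: array (S, A) of state-action values; p, c: arrays (S, A, S) holding
    p[s, a, s'] and c[s, a, s']. Zero-probability transitions contribute nothing.
    """
    # -(gamma/beta) * log sum_a' exp(-(beta/gamma) * Lam[s', a'])  (a soft minimum over actions)
    softmin = -(gamma / beta) * logsumexp(-(beta / gamma) * Lam, axis=1)     # shape (S,)
    safe_p = np.where(p > 0, p, 1.0)
    log_p = np.where(p > 0, np.log(safe_p), 0.0)
    inner = c + (gamma / beta) * log_p + gamma * softmin[None, None, :]
    return np.sum(p * inner, axis=2)

def solve_state_action_values(p, c, beta, gamma, iterations=500):
    """Iterate T from zero; the iterates converge since T is a contraction (Theorem 2)."""
    Lam = np.zeros(p.shape[:2])
    for _ in range(iterations):
        Lam = soft_bellman_operator(Lam, p, c, beta, gamma)
    return Lam
```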

Remark: Note that [T1x](s,a)=[Tx](s,a)+c0(s)[T_{1}x](s,a)=[Tx](s,a)+c_{0}(s). Therefore, T1T_{1} is a contraction since TT is a contraction, and hence T1T_{1} has a unique fixed point.

Claim: The optimal control policy μβ(a|s)\mu_{\beta}^{*}(a|s) in (31) is the same whether computed using Λ¯β(s,a)\bar{\Lambda}_{\beta}^{*}(s,a) or Λβ(s,a)\Lambda_{\beta}^{*}(s,a). This is because the fixed points Λ¯β(s,a)\bar{\Lambda}_{\beta}^{*}(s,a) and Λβ(s,a)\Lambda_{\beta}^{*}(s,a) of the contraction maps T1:Λ¯βΛ¯βT_{1}:\bar{\Lambda}_{\beta}\rightarrow\bar{\Lambda}_{\beta} and T:ΛβΛβT:\Lambda_{\beta}\rightarrow\Lambda_{\beta}, respectively, differ by 11γc0(s)\frac{1}{1-\gamma}c_{0}(s), i.e., Λ¯β(s,a)=Λβ(s,a)+11γc0(s)\bar{\Lambda}_{\beta}^{*}(s,a)=\Lambda_{\beta}^{*}(s,a)+\frac{1}{1-\gamma}c_{0}(s). Plugging this relation into the control policy in (31) we obtain

μβ(a|s)\displaystyle\mu_{\beta}^{*}(a|s) =exp{βγΛβ(s,a)β/γ1γc0(s)}aexp{βγΛβ(s,a)β/γ1γc0(s)}\displaystyle=\frac{\exp\{-\frac{\beta}{\gamma}{\Lambda}^{*}_{\beta}(s,a)-\frac{\beta/\gamma}{1-\gamma}c_{0}(s)\}}{\sum_{a^{\prime}}\exp\{-\frac{\beta}{\gamma}{\Lambda}^{*}_{\beta}(s,a^{\prime})-\frac{\beta/\gamma}{1-\gamma}c_{0}(s)\}}
μβ(a|s)\displaystyle\Rightarrow\mu_{\beta}^{*}(a|s) =exp{βγΛβ(s,a)}aexp{βγΛβ(s,a)}\displaystyle=\frac{\exp\{-\frac{\beta}{\gamma}{\Lambda}^{*}_{\beta}(s,a)\}}{\sum_{a^{\prime}}\exp\{-\frac{\beta}{\gamma}{\Lambda}^{*}_{\beta}(s,a^{\prime})\}}

This indicates that we do not need to take c0(s)c_{0}(s) into account when computing the optimal policy.

Proof that Λ¯β(s,a)=Λβ(s,a)+11γc0(s)\bar{\Lambda}_{\beta}^{*}(s,a)=\Lambda_{\beta}^{*}(s,a)+\frac{1}{1-\gamma}c_{0}(s): This can be checked by substituting this expression into the definition of the map [T1Λ¯β][T_{1}\bar{\Lambda}_{\beta}] in (34); we get

T1Λ¯β(s,a)=spssa[(cssa+γβlogpssa)\displaystyle T_{1}\bar{\Lambda}_{\beta}^{*}(s,a)=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\Big{[}\big{(}c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}\big{)}
γ2βlogaexp{βγΛβ(s,a)β/γ1γc0(s)}]+c0(s)\displaystyle-\frac{\gamma^{2}}{\beta}\log\sum_{a}\exp\Big{\{}-\frac{\beta}{\gamma}\Lambda_{\beta}^{*}(s,a)-\frac{\beta/\gamma}{1-\gamma}c_{0}(s)\Big{\}}\Big{]}+c_{0}(s)
T1Λ¯β(s,a)=spssa[(cssa+γβlogpssa)\displaystyle T_{1}\bar{\Lambda}_{\beta}^{*}(s,a)=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\Big{[}\big{(}c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}\big{)}
γ2βlogaexp{βγΛβ(s,a)}]+γ1γc0(s)+c0(s)\displaystyle-\frac{\gamma^{2}}{\beta}\log\sum_{a}\exp\Big{\{}-\frac{\beta}{\gamma}\Lambda_{\beta}^{*}(s,a)\Big{\}}\Big{]}+\frac{\gamma}{1-\gamma}c_{0}(s)+c_{0}(s)
T1Λ¯β(s,a)=spssa[(cssa+γβlogpssa)\displaystyle T_{1}\bar{\Lambda}_{\beta}^{*}(s,a)=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\Big{[}\big{(}c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}\big{)}
\displaystyle- γ2βlogaexp{βγΛβ(s,a)}]+11γc0(s)\displaystyle\frac{\gamma^{2}}{\beta}\log\sum_{a}\exp\Big{\{}-\frac{\beta}{\gamma}\Lambda_{\beta}^{*}(s,a)\Big{\}}\Big{]}+\frac{1}{1-\gamma}c_{0}(s)
T1Λ¯β(s,a)=TΛβ(s,a)+11γc0(s)\displaystyle T_{1}\bar{\Lambda}_{\beta}^{*}(s,a)=T{\Lambda}_{\beta}^{*}(s,a)+\frac{1}{1-\gamma}c_{0}(s)
Λ¯β(s,a)=Λβ(s,a)+11γc0(s)\displaystyle\bar{\Lambda}_{\beta}^{*}(s,a)=\Lambda_{\beta}^{*}(s,a)+\frac{1}{1-\gamma}c_{0}(s)

Hence proved.

Appendix B. Proof of Theorem 2: The following lemma is used.

Lemma 3.

For every policy μΓ\mu\in\Gamma defined in (4) there exists a vector ξ=(ξs)+|𝒮|\xi=(\xi_{s})\in\mathbb{R}_{+}^{|\mathcal{S}|} with positive components and a scalar λ<1\lambda<1 such that spssaξsλξs\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\xi_{s^{\prime}}\leq\lambda\xi_{s} for all s𝒮s\in\mathcal{S} and a𝒜a\in\mathcal{A}.

Proof.

Consider a new MDP with state transition probabilities identical to those of the original MDP and with transition costs cssa=11βlog(|𝒜||𝒮|)c_{ss^{\prime}}^{a}=-1-\frac{1}{\beta}\log(|\mathcal{A}||\mathcal{S}|) except when s=δs=\delta. Thus, the free-energy function Vβμ(s)V_{\beta}^{\mu}(s) in (7) for the new MDP is less than or equal to 1-1. We define ξsVβ(s)-\xi_{s}\triangleq V_{\beta}^{*}(s) (as given in (11)) and use the LogSumExp inequality [50] to obtain ξsminaΛβ(s,a)Λβ(s,a)a𝒜-\xi_{s}\leq\min_{a}\Lambda_{\beta}(s,a)\leq\Lambda_{\beta}(s,a)~{}\forall~{}a\in\mathcal{A}, where Λβ(s,a)\Lambda_{\beta}(s,a) is the state-action value function in (III-B). Thus, ξsspssa(cssa+γβlogpssaγξs)-\xi_{s}\leq\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\big{(}c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}-\gamma\xi_{s^{\prime}}\big{)}, and upon substituting cssac_{ss^{\prime}}^{a} we obtain ξs1γspssaξs1spssaξs-\xi_{s}\leq-1-\gamma\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\xi_{s^{\prime}}\leq-1-\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\xi_{s^{\prime}}.

s𝒮pssaξsξs1[maxsξs1ξs]ξs=:λξs.\displaystyle\text{\small$\Rightarrow\sum_{s^{\prime}\in\mathcal{S}}p_{ss^{\prime}}^{a}\xi_{s^{\prime}}\leq\xi_{s}-1\leq\Big{[}\max_{s}\frac{\xi_{s}-1}{\xi_{s}}\Big{]}\xi_{s}=:\lambda\xi_{s}$}.

Since Vβ(s)1V_{\beta}^{*}(s)\leq-1 \Rightarrow ξs10\xi_{s}-1\geq 0 and thus λ<1\lambda<1. ∎

Next we show that T:ΛβΛβT:\Lambda_{\beta}\rightarrow\Lambda_{\beta} in (III-B) is a contraction map. For any Λ^β\hat{\Lambda}_{\beta} and Λˇβ\check{\Lambda}_{\beta} we have that [TΛ^βTΛˇβ](s,a)[T\hat{\Lambda}_{\beta}-T\check{\Lambda}_{\beta}](s,a)

=γ2βs𝒮pssalogaexp(βγΛ^β(s,a))aexp(βγΛˇβ(s,a))=-\frac{\gamma^{2}}{\beta}\sum_{s^{\prime}\in\mathcal{S}}p_{ss^{\prime}}^{a}\log\frac{\sum_{a}\exp{\big{(}-\frac{\beta}{\gamma}\hat{\Lambda}_{\beta}(s^{\prime},a)\big{)}}}{\sum_{a^{\prime}}\exp{\big{(}-\frac{\beta}{\gamma}\check{\Lambda}_{\beta}(s^{\prime},a^{\prime})\big{)}}}
γs,apssaμ^a|s(Λ^β(s,a)Λˇβ(s,a))=:γΔμ^,\displaystyle\text{\small$\geq\gamma\sum_{s^{\prime},a^{\prime}}p_{ss^{\prime}}^{a}\hat{\mu}_{a^{\prime}|s^{\prime}}(\hat{\Lambda}_{\beta}(s^{\prime},a^{\prime})-\check{\Lambda}_{\beta}(s^{\prime},a^{\prime}))=:\gamma\Delta_{\hat{\mu}}$}, (36)

where we use the log-sum inequality to obtain (36), and μ^a|s\hat{\mu}_{a|s} is the stochastic policy in (9) corresponding to Λ^β(s,a)\hat{\Lambda}_{\beta}(s,a). Similarly, we obtain [TΛˇβTΛ^β](s,a)γs,apssaμˇa|s(Λ^β(s,a)Λˇβ(s,a))=:γΔμˇ[T\check{\Lambda}_{\beta}-T\hat{\Lambda}_{\beta}](s,a)\geq-\gamma\sum_{s^{\prime},a^{\prime}}p_{ss^{\prime}}^{a}\check{\mu}_{a^{\prime}|s^{\prime}}(\hat{\Lambda}_{\beta}(s^{\prime},a^{\prime})-\check{\Lambda}_{\beta}(s^{\prime},a^{\prime}))=:-\gamma\Delta_{\check{\mu}}, where μˇa|s\check{\mu}_{a|s} is the policy in (9) corresponding to Λˇβ(s,a)\check{\Lambda}_{\beta}(s,a). Now from γΔμ^[TΛ^βTΛˇβ](s,a)γΔμˇ\gamma\Delta_{\hat{\mu}}\leq[T\hat{\Lambda}_{\beta}-T\check{\Lambda}_{\beta}](s,a)\leq\gamma\Delta_{\check{\mu}} we conclude that |[TΛ^βTΛˇβ](s,a)|γΔμ¯(s,a)|[T\hat{\Lambda}_{\beta}-T\check{\Lambda}_{\beta}](s,a)|\leq\gamma\Delta_{\bar{\mu}}(s,a) where Δμ¯(s,a)=max{|Δμ^(s,a)|,|Δμˇ(s,a)|}\Delta_{\bar{\mu}}(s,a)=\max\{|\Delta_{\hat{\mu}}(s,a)|,|\Delta_{\check{\mu}}(s,a)|\} and we have |[TΛ^βTΛˇβ](s,a)||[T\hat{\Lambda}_{\beta}-T\check{\Lambda}_{\beta}](s,a)|

γs,apssaμ¯a|s|Λ^β(s,a)Λˇβ(s,a)|\leq\gamma\sum_{s^{\prime},a^{\prime}}p_{ss^{\prime}}^{a}\bar{\mu}_{a^{\prime}|s^{\prime}}|\hat{\Lambda}_{\beta}(s^{\prime},a^{\prime})-\check{\Lambda}_{\beta}(s^{\prime},a^{\prime})| (37)
γs,apssaξsμ¯a|sΛ^βΛˇβξ\leq\gamma\sum_{s^{\prime},a^{\prime}}p_{ss^{\prime}}^{a}\xi_{s^{\prime}}\bar{\mu}_{a^{\prime}|s^{\prime}}\|\hat{\Lambda}_{\beta}-\check{\Lambda}_{\beta}\|_{\xi} (38)

where Λβξ=maxs,aΛβ(s,a)ξs\|\Lambda_{\beta}\|_{\xi}=\max_{s,a}\frac{\Lambda_{\beta}(s,a)}{\xi_{s}} and ξ𝒮\xi\in\mathbb{R}^{\mathcal{S}} is as given in Lemma 3. Further, from the same Lemma we obtain

|[TΛ^βTΛˇβ](s,a)|γλξsa𝒜μ¯a|sΛ^βΛˇβξ|[T\hat{\Lambda}_{\beta}-T\check{\Lambda}_{\beta}](s,a)|\leq\gamma\lambda\xi_{s}\sum_{a^{\prime}\in\mathcal{A}}\bar{\mu}_{a^{\prime}|s^{\prime}}\|\hat{\Lambda}_{\beta}-\check{\Lambda}_{\beta}\|_{\xi} (39)
TΛ^βTΛˇβξγλΛ^βΛˇβξ with γλ<1\Rightarrow\|T\hat{\Lambda}_{\beta}-T\check{\Lambda}_{\beta}\|_{\xi}\leq\gamma\lambda\|\hat{\Lambda}_{\beta}-\check{\Lambda}_{\beta}\|_{\xi}\text{ with }\gamma\lambda<1 (40)

Appendix C. Proof of Theorem 3: The proof follows a similar idea as the proof of Theorem 1 in Appendix A, and thus we do not explain it in detail, except for the following lemma, which illustrates the algebraic structure of the discounted Shannon entropy Hdμ()H_{d}^{\mu}(\cdot) in (16); this structure is different from that in Lemma 2 and is required in our proof of the theorem.

Lemma 4.

The discounted Shannon entropy Hdμ()H_{d}^{\mu}(\cdot) corresponding to the MDP in Section IV satisfies the algebraic expression αspssaHdμ(s)=spssalogpssa+logαμ(a|s)+λs+1\alpha\sum_{s^{\prime}}p_{ss^{\prime}}^{a}H^{\mu}_{d}(s^{\prime})=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\log p_{ss^{\prime}}^{a}+\log\alpha\mu(a|s)+\lambda_{s}+1.

Proof.

Define a new MDP that augments the action and state spaces (𝒜,𝒮\mathcal{A},\mathcal{S}) of the original MDP with an additional action aea_{e} and state ses_{e}, respectively, and that derives its state-transition probability {qssa}\{q_{ss^{\prime}}^{a}\} and policy {ζa|s}\{\zeta_{a|s}\} from the original MDP as

qssa={pssas,s𝒮,a𝒜1if s,a=se,ae1if s=s=se0otherwiseζa|s={αμa|s(s,a)(𝒮,𝒜)1αif a=ae,s𝒮0if a𝒜,s=se1if a=ae,s=se\displaystyle q_{ss^{\prime}}^{a}=\begin{cases}p_{ss^{\prime}}^{a}&\forall s,s^{\prime}\in\mathcal{S},a\in\mathcal{A}\\ 1&\text{if }s^{\prime},a=s_{e},a_{e}\\ 1&\text{if }s^{\prime}=s=s_{e}\\ 0&\text{otherwise}\end{cases}\zeta_{a|s}=\begin{cases}\alpha\mu_{a|s}&\forall(s,a)\in(\mathcal{S},\mathcal{A})\\ 1-\alpha&\text{if }a=a_{e},s\in\mathcal{S}\\ 0&\text{if }a\in\mathcal{A},s=s_{e}\\ 1&\text{if }a=a_{e},s=s_{e}\end{cases}

Next, we define Tμ:=αHdμT^{\mu}:=\alpha H^{\mu}_{d}, which satisfies Tμ(s)=as′′ηa|spss′′a[logpss′′alogηa|s+Tμ(s′′)]T^{\mu}(s^{\prime})=\sum_{a^{\prime}s^{\prime\prime}}\eta_{a^{\prime}|s^{\prime}}p_{s^{\prime}s^{\prime\prime}}^{a^{\prime}}\big{[}-\log p_{s^{\prime}s^{\prime\prime}}^{a}-\log\eta_{a^{\prime}|s^{\prime}}+T^{\mu}(s^{\prime\prime})\big{]}, derived using (16), where ηa|s=αμa|s\eta_{a^{\prime}|s^{\prime}}=\alpha\mu_{a^{\prime}|s^{\prime}}. The subsequent steps of the proof are the same as in the proof of Lemma 2. ∎

Appendix D. Proof of Proposition 1: The proof in this section is analogous to the proof of Proposition 5.5 in [2]. Let T¯\bar{T} be the map in (III-C). The stochastic iterative updates in (III-C) can be re-written as Ψ¯t+1(xt,ut)=(1νt(xt,ut))Ψ¯t(xt,ut)+νt(xt,ut)([T¯Ψ¯t](xt,ut)+wt(xt,ut))\scriptstyle\bar{\Psi}_{t+1}(x_{t},u_{t})=(1-\nu_{t}(x_{t},u_{t}))\bar{\Psi}_{t}(x_{t},u_{t})+\nu_{t}(x_{t},u_{t})\big{(}[\bar{T}\bar{\Psi}_{t}](x_{t},u_{t})+w_{t}(x_{t},u_{t})\big{)} where wt(xt,ut)=cxtxt+1utγ2βlogaexp(βγΨ¯t(st+1,a))T¯Ψ¯t(xt,ut)\scriptstyle w_{t}(x_{t},u_{t})=c_{x_{t}x_{t+1}}^{u_{t}}-\frac{\gamma^{2}}{\beta}\log\sum_{a}\exp(-\frac{\beta}{\gamma}\bar{\Psi}_{t}(s_{t+1},a))-\bar{T}\bar{\Psi}_{t}(x_{t},u_{t}). Let t\mathcal{F}_{t} represent the history of the stochastic updates, i.e., t={Ψ¯0,,Ψ¯t,w0,,wt1,ν0,,νt},\scriptstyle\mathcal{F}_{t}=\{\bar{\Psi}_{0},\ldots,\bar{\Psi}_{t},w_{0},\ldots,w_{t-1},\nu_{0},\ldots,\nu_{t}\}, then 𝔼[wt(xt,ut)|t]=0\scriptstyle\mathbb{E}[w_{t}(x_{t},u_{t})|\mathcal{F}_{t}]=0 and 𝔼[wt2(xt,ut)|t]K(1+maxs,aΨ¯t2(s,a))\scriptstyle\mathbb{E}[w_{t}^{2}(x_{t},u_{t})|\mathcal{F}_{t}]\leq K(1+\max_{s,a}\bar{\Psi}_{t}^{2}(s,a)), where KK is a constant. These expressions satisfy the conditions on the expected value and the variance of wt(xt,ut)w_{t}(x_{t},u_{t}) which, along with the contraction property of T¯\bar{T}, guarantee the convergence of the stochastic updates (III-C), as illustrated in Proposition 4.4 in [2].
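As a minimal sketch (under the assumption of a tabular state-action estimate and a scalar step size), one such stochastic update can be written as follows; the function name is ours.

```python
import numpy as np
from scipy.special import logsumexp

def soft_q_update(Psi, s, a, cost, s_next, nu, beta, gamma):
    """One stochastic update of the state-action estimate, following
    Psi_{t+1}(x_t, u_t) = (1 - nu_t) Psi_t(x_t, u_t) + nu_t * (sampled target),
    where the sampled target is
    c - (gamma^2 / beta) * log sum_a' exp(-(beta / gamma) * Psi[s_next, a']).

    Psi: array (S, A) of current estimates (modified in place); nu: step size nu_t.
    """
    target = cost - (gamma ** 2 / beta) * logsumexp(-(beta / gamma) * Psi[s_next])
    Psi[s, a] = (1.0 - nu) * Psi[s, a] + nu * target
    return Psi
```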

Proof of Theorem 4: We show that the map T1T_{1} in (24) is a contraction map. For any KζsβK_{\zeta_{s}}^{\beta} and K¯ζsβ\bar{K}_{\zeta_{s}}^{\beta} we obtain that |[T1KζsβT1K¯ζsβ](s)|γa,s′′pss′′aμa|s|Kζsβ(s′′,a)K¯ζsβ(s′′,a)||[T_{1}K_{\zeta_{s}}^{\beta}-T_{1}\bar{K}_{\zeta_{s}}^{\beta}](s^{\prime})|\leq\gamma\sum_{a,s^{\prime\prime}}p_{s^{\prime}s^{\prime\prime}}^{a}\mu_{a|s^{\prime}}|K_{\zeta_{s}}^{\beta}(s^{\prime\prime},a)-\bar{K}_{\zeta_{s}}^{\beta}(s^{\prime\prime},a)|. Note that this inequality is similar to the one in (37); thus, we follow the exact same steps from (37) to (40) to show that T1KζsβT1K¯ζsβξγλKζsβK¯ζsβξ\|T_{1}K_{\zeta_{s}}^{\beta}-T_{1}\bar{K}_{\zeta_{s}}^{\beta}\|_{\xi}\leq\gamma\lambda\|K_{\zeta_{s}}^{\beta}-\bar{K}_{\zeta_{s}}^{\beta}\|_{\xi} and γλ<1\gamma\lambda<1.

Proof of Proposition 2: The proof in this section is similar to the proof of Proposition 1 in Appendix D. Additional conditions on the boundedness of the derivatives |cssaζl|\big{|}\frac{\partial c_{ss^{\prime}}^{a}}{\partial\zeta_{l}}\big{|} and |cssaηk|\big{|}\frac{\partial c_{ss^{\prime}}^{a}}{\partial\eta_{k}}\big{|} are required to bound the variance 𝔼[wt2|t]\mathbb{E}[w_{t}^{2}|\mathcal{F}_{t}].

References

  • [1] E. A. Feinberg and A. Shwartz, Handbook of Markov decision processes: methods and applications.   Springer Science & Business Media, 2012, vol. 40.
  • [2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-dynamic programming.   Athena Scientific Belmont, MA, 1996, vol. 5.
  • [3] A. Hordijk and L. Kallenberg, “Linear programming and markov decision chains,” Management Science, vol. 25, no. 4, pp. 352–362, 1979.
  • [4] Y. Abbasi-Yadkori, P. L. Bartlett, and A. Malek, “Linear programming for large-scale markov decision problems,” in JMLR Workshop and Conference Proceedings, no. 32.   MIT Press, 2014, pp. 496–504.
  • [5] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
  • [6] E. T. Jaynes, “Information theory and statistical mechanics,” Physical review, vol. 106, no. 4, p. 620, 1957.
  • [7] K. Rose, “Deterministic annealing, clustering, and optimization,” Ph.D. dissertation, California Institute of Technology, 1991.
  • [8] P. Sharma, S. Salapaka, and C. Beck, “A scalable approach to combinatorial library design for drug discovery,” Journal of chemical information and modeling, vol. 48, no. 1, pp. 27–41, 2008.
  • [9] J.-G. Yu, J. Zhao, J. Tian, and Y. Tan, “Maximal entropy random walk for region-based visual saliency,” IEEE transactions on cybernetics, vol. 44, no. 9, pp. 1661–1672, 2013.
  • [10] Y. Xu, S. M. Salapaka, and C. L. Beck, “Aggregation of graph models and markov chains by deterministic annealing,” IEEE Transactions on Automatic Control, vol. 59, no. 10, pp. 2807–2812, 2014.
  • [11] L. Chen, T. Zhou, and Y. Tang, “Protein structure alignment by deterministic annealing,” Bioinformatics, vol. 21, no. 1, pp. 51–62, 2005.
  • [12] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning.” in Aaai, vol. 8.   Chicago, IL, USA, 2008, pp. 1433–1438.
  • [13] H. V. Hasselt, “Double q-learning,” in Advances in Neural Information Processing Systems, 2010, pp. 2613–2621.
  • [14] R. Fox, A. Pakman, and N. Tishby, “Taming the noise in reinforcement learning via soft updates,” arXiv preprint arXiv:1512.08562, 2015.
  • [15] J. Grau-Moya, F. Leibfried, and P. Vrancx, “Soft q-learning with mutual-information regularization,” 2018.
  • [16] J. Peters, K. Mulling, and Y. Altun, “Relative entropy policy search,” in Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
  • [17] G. Neu, A. Jonsson, and V. Gómez, “A unified view of entropy-regularized markov decision processes,” arXiv preprint arXiv:1705.07798, 2017.
  • [18] K. Asadi and M. L. Littman, “An alternative softmax operator for reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70.   JMLR. org, 2017, pp. 243–252.
  • [19] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, “Bridging the gap between value and policy based reinforcement learning,” in Advances in Neural Information Processing Systems, 2017, pp. 2775–2785.
  • [20] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song, “Sbeed: Convergent reinforcement learning with nonlinear function approximation,” arXiv preprint arXiv:1712.10285, 2017.
  • [21] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning, 2015, pp. 1889–1897.
  • [22] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.
  • [23] A. Aguilar-Garcia, S. Fortes, M. Molina-García, J. Calle-Sánchez, J. I. Alonso, A. Garrido, A. Fernández-Durán, and R. Barco, “Location-aware self-organizing methods in femtocell networks,” Computer Networks, vol. 93, pp. 125–140, 2015.
  • [24] U. Siddique, H. Tabassum, E. Hossain, and D. I. Kim, “Wireless backhauling of 5g small cells: Challenges and solution approaches,” IEEE Wireless Communications, vol. 22, no. 5, pp. 22–31, 2015.
  • [25] G. Manganini, M. Pirotta, M. Restelli, L. Piroddi, and M. Prandini, “Policy search for the optimal control of markov decision processes: A novel particle-based iterative scheme,” IEEE transactions on cybernetics, vol. 46, no. 11, pp. 2643–2655, 2015.
  • [26] A. Srivastava and S. M. Salapaka, “Simultaneous facility location and path optimization in static and dynamic networks,” IEEE Transactions on Control of Network Systems, pp. 1–1, 2020.
  • [27] M. Mahajan, P. Nimbhorkar, and K. Varadarajan, “The planar k-means problem is np-hard,” in International Workshop on Algorithms and Computation.   Springer, 2009, pp. 274–285.
  • [28] A. A. Abin, “Querying beneficial constraints before clustering using facility location analysis,” IEEE transactions on cybernetics, vol. 48, no. 1, pp. 312–323, 2016.
  • [29] D. Huang, C.-D. Wang, and J.-H. Lai, “Locally weighted ensemble clustering,” IEEE transactions on cybernetics, vol. 48, no. 5, pp. 1460–1473, 2017.
  • [30] W. Shi, S. Song, and C. Wu, “Soft policy gradient method for maximum entropy deep reinforcement learning,” arXiv preprint arXiv:1909.03198, 2019.
  • [31] G. Xiang and J. Su, “Task-oriented deep reinforcement learning for robotic skill acquisition and control,” IEEE transactions on cybernetics, 2019.
  • [32] L. Xia and Q.-S. Jia, “Parameterized markov decision process and its application to service rate control,” Automatica, vol. 54, pp. 29–35, 2015.
  • [33] M. Hausknecht and P. Stone, “Deep reinforcement learning in parameterized action space,” arXiv preprint arXiv:1511.04143, 2015.
  • [34] W. Masson, P. Ranchod, and G. Konidaris, “Reinforcement learning with parameterized actions,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [35] E. Wei, D. Wicke, and S. Luke, “Hierarchical approaches for reinforcement learning in parameterized action space,” in 2018 AAAI Spring Symposium Series, 2018.
  • [36] J. Xiong, Q. Wang, Z. Yang, P. Sun, L. Han, Y. Zheng, H. Fu, T. Zhang, J. Liu, and H. Liu, “Parametrized deep q-networks learning: Reinforcement learning with discrete-continuous hybrid action space,” arXiv preprint arXiv:1810.06394, 2018.
  • [37] V. Narayanan and S. Jagannathan, “Event-triggered distributed control of nonlinear interconnected systems using online reinforcement learning with exploration,” IEEE transactions on cybernetics, vol. 48, no. 9, pp. 2510–2519, 2017.
  • [38] E. Çilden and F. Polat, “Toward generalization of automated temporal abstraction to partially observable reinforcement learning,” IEEE transactions on cybernetics, vol. 45, no. 8, pp. 1414–1425, 2014.
  • [39] E. T. Jaynes, Probability theory: The logic of science.   Cambridge university press, 2003.
  • [40] F. Biondi, A. Legay, B. F. Nielsen, and A. Wkasowski, “Maximizing entropy over markov processes,” Journal of Logical and Algebraic Methods in Programming, vol. 83, no. 5-6, pp. 384–399, 2014.
  • [41] Y. Savas, M. Ornik, M. Cubuktepe, and U. Topcu, “Entropy maximization for constrained markov decision processes,” in 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton).   IEEE, 2018, pp. 911–918.
  • [42] S. Ross, J. Pineau, B. Chaib-draa, and P. Kreitmann, “A bayesian approach for learning and planning in partially observable markov decision processes,” Journal of Machine Learning Research, vol. 12, no. May, pp. 1729–1770, 2011.
  • [43] L. P. Hansen, T. J. Sargent, G. Turmuhambetova, and N. Williams, “Robust control and model misspecification,” Journal of Economic Theory, vol. 128, no. 1, pp. 45–90, 2006.
  • [44] Z. Zhou, M. Bloem, and N. Bambos, “Infinite time horizon maximum causal entropy inverse reinforcement learning,” IEEE Transactions on Automatic Control, vol. 63, no. 9, pp. 2787–2802, 2017.
  • [45] K. Rawlik, M. Toussaint, and S. Vijayakumar, “Approximate inference and stochastic optimal control,” arXiv preprint arXiv:1009.3958, 2010.
  • [46] M. Ghavamzadeh, H. J. Kappen, M. G. Azar, and R. Munos, “Speedy q-learning,” in Advances in neural information processing systems, 2011, pp. 2411–2419.
  • [47] M. G. Bellemare, G. Ostrovski, A. Guez, P. S. Thomas, and R. Munos, “Increasing the action gap: New operators for reinforcement learning,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [48] G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and Z. Li, “Drn: A deep reinforcement learning framework for news recommendation,” in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 167–176.
  • [49] W. B. Knox and P. Stone, “Reinforcement learning from human reward: Discounting in episodic tasks,” in 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication.   IEEE, 2012, pp. 878–885.
  • [50] K. Kobayashi et al., Mathematics of information and coding.   American Mathematical Soc., 2007, vol. 203.
  • [51] P. Sharma, S. M. Salapaka, and C. L. Beck, “Entropy-based framework for dynamic coverage and clustering problems,” IEEE Transactions on Automatic Control, vol. 57, no. 1, pp. 135–150, 2011.
Amber Srivastava received the B.Tech degree in Mechanical Engineering from the Indian Institute of Technology, Kanpur, in 2014, after which he was employed with an FMCG company in India for a year (2014-2015) as an Assistant Manager. He obtained his Masters in Mathematics from the University of Illinois at Urbana-Champaign (UIUC) in 2020. Currently he is a PhD student in the Mechanical Science and Engineering Department of UIUC. His areas of interest are optimization, learning, and controls.
Srinivasa M. Salapaka received the B.Tech degree from IIT Madras in 1995 and the MS and PhD degrees in Mechanical Engineering from the University of California, Santa Barbara, in 1997 and 2002, respectively. From 2002 to 2004, he was a postdoctoral associate at the Massachusetts Institute of Technology. Since 2004, he has been a faculty member in the Mechanical Science and Engineering Department, University of Illinois at Urbana-Champaign. His areas of current research are controls for nanotechnology, combinatorial optimization, and power electronics.

Supplementary Material for CYB-E-2020-05-1237

Appendix A Link to Source Code

Please find the sample code for the model-free RL setting and for the parameterized MDP and RL setting at either of the links below:

  1. https://drive.google.com/drive/folders/1fVA3nM1Oh6drQpi3H50ZPTDppK8Tv5bX?usp=sharing

  2. https://uofi.box.com/s/twbo129bmst0e24ykgr1rhwwc8kyhe1o

Appendix B Simulations on the Double Chain Environment

In this section we apply our Algorithm 1 to determine the control policy μ\mu^{*} for the finite and infinite Shannon entropy variants of the double chain (DC) environment in Figure 5.

Figure 5: Finite and infinite entropy variants (respectively, cost and reward variants) of the Double Chain (DC) environment. State space 𝒮={0,1,,8}\mathcal{S}=\{0,1,\ldots,8\} and action space 𝒜={0,1}\mathcal{A}=\{0,1\}. Cost c(s,a)c(s,a) for version (a) is denoted in black and reward r(s,a)r(s,a) for version (b) is in red. The agent slips to the state s=0s^{\prime}=0 with probability 0.20.2 every time the action a=0a=0 is taken at a state s𝒮{4}s\in\mathcal{S}\setminus\{4\} in the finite entropy variant (and s𝒮s\in\mathcal{S} for the infinite entropy version (b)). The state s=4s=4 is the cost-free termination state for version (a).

Faster Convergence to Optimal JJ^{*}: Figures 6(a1)-(a4) (finite entropy variant of DC) and Figures 6(d1)-(d4) (infinite entropy variant of DC) illustrate the faster convergence of our MEP-based Algorithm 1 for different discount factor γ\gamma values. Here the learning error ΔV\Delta V at each episode is the gap between the value function VβμV_{\beta}^{\mu} corresponding to the policy μ=μ(ep)\mu=\mu(ep) learned in episode epep and the optimal value function JJ^{*}; that is,

ΔV(ep)=1Ni=1Ns𝒮|Vβ,iμ(ep)(s)J(s)|,\displaystyle\Delta V(ep)=\frac{1}{N}\sum_{i=1}^{N}\sum_{s\in\mathcal{S}}\big{|}V_{\beta,i}^{\mu(ep)}(s)-J^{*}(s)\big{|}, (41)

where NN denotes the total number of experimental runs and ii indexes the value function Vβ,iμV_{\beta,i}^{\mu} of each run. As observed in Figures 6(a1) and 6(a3), and Figures 6(d1) and 6(d3), our Algorithm 1 converges even faster as the discount factor γ\gamma decreases. We also characterize the faster convergence in terms of the convergence time, more precisely the critical episode E¯pr\bar{E}_{pr} beyond which the learning error ΔV\Delta V is within 5%5\% of the optimal JJ^{*} (see Figures 6(a5) and 6(d5)). As observed in the figures, the performance of our (MEP-based) algorithm improves further with decreasing γ\gamma values, since the stochastic policy μβ\mu_{\beta}^{*} adapts the amount of exploration required along the paths to the γ\gamma value, whereas the other algorithms deteriorate in performance as γ\gamma decreases.
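For reference, the learning error in (41) can be computed as in the short sketch below; the array layout is an assumption.

```python
import numpy as np

def learning_error(V_runs, J_star):
    """Delta V(ep) in (41): the average over N runs of the summed absolute gap
    between the learned value function and the optimal J*.

    V_runs: array (N, S) of value functions V_{beta,i}^{mu(ep)} at episode ep
    J_star: array (S,) holding the optimal value function J*
    """
    return float(np.mean(np.sum(np.abs(np.asarray(V_runs) - np.asarray(J_star)), axis=1)))
```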

Figure 6: Performance of the MEP-based algorithm: illustrations on the Double Chain environment in Figure 5. (a1)-(a4) Finite entropy variant: illustrate the faster convergence of Algorithm 1 (MEP) at different γ\gamma values. (a5) Demonstrates faster and increasing rates of convergence of Algorithm 1 (MEP) with decreasing γ\gamma values. (b1)-(b3) Illustrate the higher variance of Soft Q in comparison to MEP. (b4)-(b5) MEP exhibits larger Shannon entropy (equivalently, more exploration) during early episodes and smaller Shannon entropy (equivalently, more exploitation) in later episodes; together these are responsible for the faster convergence of MEP illustrated in Figs. 6(a1)-(a5). (c1)-(c5) Cost version (with added Gaussian noise): similar observations as in (a1)-(a5), with significantly higher instability in learning with the Double Q algorithm. (d1)-(d5) Infinite entropy variant: demonstrates the faster convergence of Algorithm 1 (MEP) to JJ^{*} without noise in (d1),(d3) and with added noise in (d2),(d4). (d5) Illustrates the consistently faster convergence rates of MEP across different γ\gamma values.

Smaller Variance: Figures 6(b1)-(b3) compare the variance observed in the learning process during the last 4040 episodes of the NN experimental runs. As demonstrated, the Soft Q algorithm (Fig. 6(b2)) exhibits higher variance when compared to the MEP-based algorithm (Fig. 6(b1)). Figure 6(b3) plots the average variances of all the algorithms over the last 4040 episodes of the learning process for different values of γ\gamma. As illustrated, Soft Q and Q learning exhibit higher variances across all γ\gamma values in comparison to our algorithm and Double Q learning. Between ours and Double Q, the variance of our algorithm is smaller at lower values of γ\gamma and vice versa.

Inherently better exploration and exploitation: The Shannon entropy HμH^{\mu} corresponding to our algorithm is higher during the initial episodes, indicating better exploration under the learned policy μ\mu when compared to the other algorithms, as seen in Figures 6(b4)-(b5). Additionally, it exhibits smaller HμH^{\mu} in the later episodes, indicating a more exploitative learned policy μ\mu. The unbiased policies resulting from our algorithm, along with this enhanced exploration-exploitation trade-off, result in the faster convergence rates and smaller variance discussed above.

Robustness to noise in data: Figures 6(c1)-(c5) demonstrate robustness to noisy environments; here the instantaneous cost c(s,a)c(s,a) in the finite entropy variant of DC is noisy. For the purpose of simulations, we add Gaussian noise 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}) with σ=5\sigma=5 when a=1,s𝒮a=1,s\in\mathcal{S} or a=0,s=8a=0,s=8, and σ=10\sigma=10 otherwise. Similar to our observations and conclusions in Figures 6(a1)-(a4), Figures 6(d1)-(d2) (γ=0.8\gamma=0.8) and Figures 6(d3)-(d4) (γ=0.6\gamma=0.6) illustrate the faster convergence of our MEP-based algorithm. Also, similar to Figure 6(a5), Figure 6(d5) exhibits higher convergence rates (i.e., lower E¯pr\bar{E}_{pr}) for all γ\gamma values in comparison to the benchmark algorithms. In the infinite entropy variant of DC, results of a similar nature are obtained upon adding the noise 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}), as illustrated in Figures 6(d2) and 6(d4).

Appendix C Comparison with MIRL [15] and REPS [16]

In this section we demonstrate the benefit of considering distributions over the paths of an MDP rather than distributions over the actions. Figures 7(a1)-(a4) compare the mutual information regularization (MIRL) algorithm presented in [15] (where the mutual information regularizer is discounted with the same discount factor as the instantaneous cost) with the mutual information of the distribution over the paths of the MDP (our MEP-based idea, which results in no discount factor on the regularizer term). Note the faster convergence in the ΔVJ\frac{\Delta V}{J^{*}} versus epep plots in Figures 7(a1)-(a3). In Figure 7(a4) we demonstrate the faster convergence rates (E¯pr\bar{E}_{pr} versus γ\gamma) obtained when considering the distribution over the paths of the MDP.

Figure 7: (b1)-(b2) Effect of a noisy environment on the Relative Entropy Policy Search (REPS) method in [16]. For the same number of interactions with the noisy Gridworld environment and the same amount of greedy exploration, the REPS method diverges towards the final episodes of the learning process, whereas the MEP-based algorithm does not diverge. Similar results are seen when comparing REPS with entropy regularized (Soft Q) learning.

In Figures 7(b1)-(b2) we compare the performance of the relative entropy policy search (REPS) algorithm [16] to our MEP-based algorithm in the case of a noisy environment. For the purpose of simulation we consider the infinite entropy variant of the double chain environment in Figure 5. We add Gaussian noise 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}) with σ=10\sigma=10 when s=4s=4, a=0a=0, σ=5\sigma=5 when s=8s=8, a=0a=0, σ=2\sigma=2 when s0s\neq 0, a=1a=1, and σ=1\sigma=1 otherwise. We ensure fairness in the comparison by allowing an equal number of interactions with the environment for both algorithms. As illustrated in Figure 7(b1), the algorithm in [16] diverges during the last stages of learning owing to the noise in the environment, whereas our MEP-based algorithm performs better in the same environment.

Appendix D Design of Self Organizing Networks

Here, we consider the scenario of a self-organizing network responsible for providing good WiFi access at all the relevant locations in a hotel lobby, where the number of routers and the location of the modem are pre-specified. Figure 8(a) illustrates the floor plan of a hotel lobby with areas such as the reception, the manager's office, and the cyber space clearly marked. We distribute user nodes over the provided floor plan based on the number of WiFi users in each of these areas, and assume the modem to be located near the reception area. Figure 8(b) shows the user node and modem locations on a 2D graph. The objective is to allocate 66 routers and design communication routes from the modem to each user via the routers that maximize the cumulative signal strength at the user nodes; we design this network both in the model-based scenario and in the model-free setting. For the purpose of simulation we assume that the signal strength is inversely proportional to the square of the Euclidean distance (Qin, Qiaofeng, et al., "SDN controller placement with delay-overhead balancing in wireless edge networks," IEEE Transactions on Network and Service Management, vol. 15, no. 4, pp. 1446-1459, 2018). Figure 8(c) shows the allocated routers and the communication routes in the network for the model-based scenario, and Figure 8(d) shows those for the model-free setting.
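A minimal sketch of the assumed signal-strength model is shown below; the function name and the small constant added to avoid division by zero are ours.

```python
import numpy as np

def signal_strength(p1, p2, eps=1e-6):
    """Assumed model: signal strength inversely proportional to the squared
    Euclidean distance between transmitter and receiver."""
    return 1.0 / (float(np.sum((np.asarray(p1) - np.asarray(p2)) ** 2)) + eps)

# Example: signal received at a user 3 m away from a router located at the origin
print(signal_strength([0.0, 0.0], [3.0, 0.0]))   # approximately 1/9
```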

Figure 8: Design of Self Organizing Networks. (a) Floor plan of a hotel lobby; areas requiring internet access are marked with appropriate symbols. (b) 2D schematic of the floor plan with the user and modem locations marked. (c) Network design with router locations and routing; the cost function is the squared Euclidean distance. (d) Network design with router locations (parameters) and routing (control policy) in the model-free RL setting.

Appendix E Choice of annealing parameters σ\sigma, τ\tau

Algorithm 1: We follow a linear schedule \beta_{k}=\sigma k (\sigma>0) in our simulations in the model-free RL setting, as suggested for the entropy regularized (soft Q) benchmark algorithm in [14]. The idea is to anneal the parameter \beta from a small value (at which the policy \mu_{\beta}^{*} in (9) is explorative) to a large value (at which the policy becomes exploitative). As suggested in [14], the parameter \sigma can be obtained by performing initial runs (for a small number of iterations) with different values of \sigma and picking the one that results in the lowest value function corresponding to the learned policy. The value of \sigma identified for a particular domain can then be re-used in similar domains without the need to perform any initial runs.
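A minimal sketch of this selection procedure is given below; train_policy and evaluate_value are hypothetical placeholders standing in for the learning and evaluation routines of Algorithm 1.

def beta_schedule(sigma, num_steps):
    # Linear annealing schedule beta_k = sigma * k, k = 1, ..., num_steps.
    return [sigma * k for k in range(1, num_steps + 1)]

def pick_sigma(candidate_sigmas, pilot_steps, train_policy, evaluate_value):
    # Run short pilot trainings with each candidate sigma and keep the one
    # whose learned policy attains the lowest value function (as in [14]).
    best_sigma, best_value = None, float("inf")
    for sigma in candidate_sigmas:
        policy = train_policy(beta_schedule(sigma, pilot_steps))
        value = evaluate_value(policy)
        if value < best_value:
            best_sigma, best_value = sigma, value
    return best_sigma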

Algorithms 2 and 3: The parameter \tau is referred to as the annealing rate and is chosen to be greater than 1. The resulting schedule \beta_{k+1}=\tau\beta_{k} geometrically anneals the Lagrange parameter \beta from a small value \beta_{\min}\approx 0 to a large value \beta_{\max} at which the control policy \mu_{\beta}^{*} in (9) converges to 0 or 1. In the case of parameterized MDPs, this facilitates a homotopy from the convex function H^{\mu}(s) to the (possibly) non-convex J^{\mu}_{\zeta\eta}(s) in (22), and prevents the algorithm from getting stuck in poor local minima. For the simulations in Figure 4 we use \tau\in(1.01,1.04).

A practical method to determine an appropriate \tau is as follows. Start with a large estimate, for instance \tau\approx 1.5. If the parameter values obtained in the initial iterations of the algorithm oscillate considerably, decrease \tau; choose the value of \tau at which the observed parameter values stop oscillating over the initial iterations. This practical method is rooted in the phenomenon of phase transition that Algorithms 2 and 3 undergo, which we illustrate further below.
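The sketch below illustrates this heuristic. Here run_initial_iterations is a hypothetical placeholder for a short run of Algorithm 2 or 3 that returns the sequence of parameter estimates, and the oscillation measure (dispersion of successive changes) is one possible choice, not the paper's prescription.

import numpy as np

def oscillation(param_history):
    # Crude oscillation measure: dispersion of successive parameter changes.
    diffs = np.diff(np.asarray(param_history), axis=0)
    return float(np.linalg.norm(np.std(diffs, axis=0)))

def pick_tau(run_initial_iterations, tau=1.5, shrink=0.9, tol=1e-2, tau_min=1.001):
    # Start with a large tau and reduce it until the parameter iterates from a
    # short initial run no longer oscillate beyond the tolerance.
    while tau > tau_min:
        history = run_initial_iterations(tau)   # e.g., a few dozen iterations
        if oscillation(history) < tol:
            return tau
        tau = 1.0 + shrink * (tau - 1.0)        # anneal slower, keeping tau > 1
    return tau_min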

Algorithms 2 and 3 address the class of parameterized MDPs, in which an unknown parameter and the control policy are determined simultaneously. These problems are akin to the Facility Location Problem (FLP) addressed in [7]. In particular, [7] develops a Maximum Entropy Principle framework to determine the location of the facilities (the parameter) and to associate each user node to the closest facility (the control policy). In the resulting algorithm, popularly known as Deterministic Annealing (DA), it is observed that the solution changes significantly only at certain critical values \beta=\beta_{cr} that correspond to instances of phase transition; at other values of \beta, the solution does not change much [51]. It has been observed that a geometric law \beta_{k+1}=\tau\beta_{k} with \tau=1+\epsilon>1 to anneal \beta suffices to capture the changes in the solution that occur during the phase transitions. Thus, with reference to the above method for choosing \tau, if the initial iterations exhibit high variation in the parameter values, then some phase transitions are possibly being skipped, and the user needs to anneal more slowly (i.e., reduce \tau) to capture all the phase transitions.
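As a worked example of how \tau sets the length of the annealing (the numerical values below are illustrative, not taken from the paper's simulations), the number of geometric annealing steps from \beta_{\min} to \beta_{\max} is

K = \lceil \log(\beta_{\max}/\beta_{\min}) / \log\tau \rceil,

so, for instance, \beta_{\min}=10^{-2}, \beta_{\max}=10^{2} and \tau=1.02 give K = \lceil \log 10^{4}/\log 1.02 \rceil = 466 iterations; reducing \tau to anneal more slowly increases K correspondingly.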

Appendix F List of Symbols

x_{t} : state at time t
u_{t} : action at time t
\mathcal{S} : state space
\mathcal{A} : action space
\mu : control policy
p(s'|s,a) : state transition probability
\delta : cost-free termination state
\gamma : discount factor
J^{\mu}(s) : value function under policy \mu
\mu^{*} : optimal control policy
\omega : path of the MDP
p_{\mu}(\omega|s) : distribution over the paths
\Gamma : set of proper policies
H^{\mu}(s) : Shannon entropy of the distribution \{p_{\mu}(\omega|s)\}
\beta : annealing (Lagrange) parameter
J_{0} : constant value
V_{\beta}^{\mu}(s) : Lagrangian or free energy
c_{x_{t}x_{t+1}}^{u_{t}} : short for c(x_{t},u_{t},x_{t+1})
\mu_{u_{t}|x_{t}} : short for \mu(u_{t}|x_{t})
p_{x_{t}x_{t+1}}^{u_{t}} : short for p(x_{t+1}|x_{t},u_{t})
\bar{c}_{ss'}^{a} : short for c(s,a,s') + \frac{\gamma}{\beta}\log p(s'|s,a)
\lambda_{s} : Lagrange parameter in Lemma 2
\mu_{\beta}^{*}(a|s) : optimal policy under MEP
\Lambda_{\beta}(s,a) : state-action value function
V_{\beta}^{*} : optimal value function
\xi : defines weighted norm for contraction mapping
\|\cdot\|_{\xi} : weighted norm
\alpha : contraction constant
\Psi_{t+1} : estimate of \Lambda_{\beta} at time t
\nu_{t}(x_{t},u_{t}) : learning rate at time t
\beta_{\min} : minimum value of \beta
\beta_{\max} : maximum value of \beta
N : number of episodes
\tau : annealing rate for \beta
H_{d}^{\mu}(s) : discounted Shannon entropy
\alpha^{t} : discount at time t for H_{d}^{\mu}
V_{\beta,I}^{\mu}(s) : free energy for the infinite entropy case of the MDP
\hat{c}_{x_{t}x_{t+1}}^{u_{t}} : short for c_{x_{t}x_{t+1}}^{u_{t}} + \frac{\gamma^{t}}{\beta\alpha^{t}}\log p_{x_{t}x_{t+1}}^{u_{t}}
\mu_{\beta,I}^{*} : optimal policy for infinite entropy MDP
\Phi_{\beta}(s,a) : state-action value function
V_{\beta,I}^{*} : optimal free-energy function
\zeta=\{\zeta_{s}\} : set of unknown state parameters
\eta=\{\eta_{a}\} : set of unknown action parameters
J_{\zeta\eta}^{\mu} : value function for parameterized MDPs
x_{t}(\zeta) : state at time t with parameter \zeta_{x_{t}}
u_{t}(\eta) : action at time t with parameter \eta_{u_{t}}
\mu_{\beta,\zeta\eta}^{*} : optimal control policy
V_{\beta,\zeta\eta}^{*} : optimal value function
G_{\zeta_{s}}^{\beta}(s') : derivative \partial V_{\beta,\zeta\eta}^{*}(s')/\partial\zeta_{s}
G_{\eta_{a}}^{\beta}(s') : derivative \partial V_{\beta,\zeta\eta}^{*}(s')/\partial\eta_{a}
K_{\zeta_{s}}^{\beta}(s',a') : derivatives such that G_{\zeta_{s}}^{\beta}(s') = \sum_{a'}\mu_{a'|s'}K_{\zeta_{s}}^{\beta}(s',a')
L_{\eta_{a}}^{\beta}(s',a') : derivatives such that G_{\eta_{a}}^{\beta}(s') = \sum_{a'}\mu_{a'|s'}L_{\eta_{a}}^{\beta}(s',a')
K_{\zeta_{s}}^{t+1}(x_{t},u_{t}) : learned estimate of K_{\zeta_{s}} at time t
B, C : finite constants