
Parameterized MDPs and Reinforcement Learning Problems - A Maximum Entropy Principle Based Framework

Amber Srivastava and Srinivasa M Salapaka. The paper has been accepted for publication in IEEE Transactions on Cybernetics. The Institute of Electrical and Electronics Engineers, Incorporated (the "IEEE") holds all rights under copyright. This work was supported by NSF grant ECCS (NRI) 18-30639, DOE award DE-EE0009125, and the Dynamic Research Enterprise for Multidisciplinary Engineering Sciences (DREMES), a collaboration between Zhejiang University and the University of Illinois at Urbana-Champaign. The authors are with the Mechanical Science and Engineering Department and Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, IL, 61801 USA. E-mail: {asrvstv6, salapaka}@illinois.edu.
Abstract

We present a framework to address a class of sequential decision making problems. Our framework features learning the optimal control policy with robustness to noisy data, determining the unknown state and action parameters, and performing sensitivity analysis with respect to problem parameters. We consider two broad categories of sequential decision making problems modeled as infinite horizon Markov Decision Processes (MDPs) with (and without) an absorbing state. The central idea underlying our framework is to quantify exploration in terms of the Shannon Entropy of the trajectories under the MDP and determine the stochastic policy that maximizes it while guaranteeing a low value of the expected cost along a trajectory. This resulting policy enhances the quality of exploration early on in the learning process, and consequently allows faster convergence rates and robust solutions even in the presence of noisy data as demonstrated in our comparisons to popular algorithms such as Q-learning, Double Q-learning and entropy regularized Soft Q-learning. The framework extends to the class of parameterized MDP and RL problems, where states and actions are parameter dependent, and the objective is to determine the optimal parameters along with the corresponding optimal policy. Here, the associated cost function can possibly be non-convex with multiple poor local minima. Simulation results applied to a 5G small cell network problem demonstrate successful determination of communication routes and the small cell locations. We also obtain sensitivity measures to problem parameters and robustness to noisy environment data.

I Introduction

Markov Decision Processes (MDPs) model sequential decision making problems which arise in many application areas such as robotics, sensor networks, economics, and manufacturing. These models are characterized by the state-evolution dynamics s_{t+1}=f(s_t,a_t), a control policy \mu(a_t|s_t) that allocates an action a_t from a control set to each state s_t, and a cost c(s_t,a_t,s_{t+1}) associated with the transition from s_t to s_{t+1}. The goal in these applications is to determine the optimal control policy that results in a path, a sequence of actions and states, with minimum cumulative cost. There are many variants of this problem [1], where the dynamics can be defined over finite or infinite horizons; where the state dynamics f can be stochastic; where the models for the state dynamics may be partially or completely unknown, and the cost function is not known a priori, albeit the cost at each step is revealed at the end of each transition. Some of the most common methodologies that address MDPs include dynamic programming, value and policy iterations [2], linear programming [3, 4], and Q-learning [5].

In this article, we view MDPs and their variants as combinatorial optimization problems, and develop a framework based on the maximum entropy principle (MEP) [6] to address them. MEP has proved successful in addressing a variety of combinatorial optimization problems such as facility location problems [7], combinatorial drug discovery [8], the traveling salesman problem and its variants [7], image processing [9], graph and Markov chain aggregation [10], and protein structure alignment [11]. MDPs, too, can be viewed as combinatorial optimization problems, owing to the combinatorially large number of paths (sequences of consecutive states and actions) that the process may take depending on the control policy and its inherent stochasticity. In our MEP framework, we determine a probability distribution defined on the space of paths [12], such that (a) it is the fairest distribution, the one with the maximum Shannon Entropy H, and (b) it satisfies the constraint that the expected cumulative cost J attains a prespecified feasible value J_0. The framework results in an iterative scheme, an annealing scheme, where probability distributions are improved upon by successively lowering the prespecified values J_0. In fact, the Lagrange multiplier \beta corresponding to the cost constraint (J=J_0) in the unconstrained Lagrangian is increased from small values (near 0) to large values to effect annealing. Higher values of the multiplier correspond to lower values of the expected cost. We show that as the multiplier value increases, the corresponding probability distributions become more localized, finally converging to a deterministic policy.

This framework is applicable to all the classes of MDPs and its variants described above. Our MEP based approach inherits the flexibility of algorithms such as deterministic annealing [7] developed in the context of combinatorial resource allocation, which include adding capacity, communication, and dynamic constraints. The added advantage of our approach is that we can draw close parallels to existing algorithms for MDPs and RL (e.g. Q-Learning) – thus enabling us to exploit their algorithmic insights. Below we highlight main contributions and advantages of our approach.

Exploration and Unbiased Policy: In the context of the model-free RL setting, the algorithms interact with the environment via agents and rely upon the instantaneous cost (or reward) generated by the environment to learn the optimal policy. Some of the popular algorithms include Q-learning [5], Double Q-learning [13], Soft Q-learning (entropy regularized Q-learning) [14, 15, 16, 17, 18, 19, 20] in discrete state and action spaces, and Trust Region Policy Optimization (TRPO) [21] and Soft Actor Critic (SAC) [22] in continuous spaces. It is commonly known that for the above algorithms to perform well, all relevant states and actions should be explored. In fact, under the assumption that each state-action pair is visited multiple times during the learning process, it is guaranteed that the above discrete space algorithms [5, 13, 14, 15] converge to the optimal policy. Thus, adequate exploration of the state and action spaces is essential to the success of these algorithms in determining the optimal policy. Often the instantaneous cost is noisy [14], which hinders the learning process and demands higher-quality exploration.

In our MEP-based approach, the Shannon Entropy of the probability distribution over the paths in the MDP explicitly characterizes the exploration. The framework results in a distribution over the paths that is as unbiased as possible under the given cost constraint. The corresponding stochastic policy is maximally noncommittal to any particular path in the MDP that achieves the constraint; this results in better (unbiased) exploration. The policy starts out entirely explorative, when the multiplier value is small (\beta\approx 0), and becomes increasingly exploitative as the multiplier value increases.

Parameterized MDPs and RL: This class of optimization problems is not even necessarily Markovian, which contributes significantly to its inherent complexity. However, we model these problems in a specific way that retains the Markov property without any loss of generality, thereby making them tractable. Scenarios such as self organizing networks [23], 5G small cell network design [24, 25], supply chain networks, and last mile delivery problems [26] pose a parameterized MDP with the two-fold objective of simultaneously determining (a) the optimal control policy for the underlying stochastic process, and (b) the unknown parameters that the state and action variables depend upon, such that the cumulative cost is minimized. The latter objective is akin to the facility location problem [27, 28, 29], which is shown to be NP-hard [27] and whose associated (non-convex) cost function is riddled with multiple poor local minima.

Figure 1: Illustrates the 5G Small Cell Network. The objective is to determine the small cell locations \{y_j\in\mathbb{R}^d\} and the communication routes from the Base Station \delta to each user \{n_i\} via the network of the small cells.

For instance, Figure 1 illustrates a 5G small cell network, where the objective is to simultaneously determine the locations \{y_j\} of the small cells \{f_j\} and design the communication paths (control policy) between the user nodes \{n_i\} and the base station \delta via the network of small cells. Here, the state space \mathcal{S} of the underlying MDP is parameterized by the locations \{y_j\} of the small cells \{f_j\}.

Algebraic structure and Sensitivity Analysis: In our framework, maximization of Shannon entropy of the distribution over the paths under constraint on the cost function value results in an unconstrained Lagrangian - the free-energy function. This function is a smooth approximation of the cumulative cost function of the MDP, which enables the use of calculus. We exploit this distinctive feature of our framework to determine the unknown state and action parameters in case of parameterized MDPs and perform sensitivity analysis for various problem parameters. Also, the framework easily accommodates stochastic models that describe uncertainties in the instantaneous cost and parameter values.

Algorithmic Guarantees and Innovation: For the classes of MDPs that we consider, our MEP-based framework results in non-trivial derivations of the recursive Bellman equation for the associated Lagrangian. We show that these Bellman operators are contraction maps and use several of their properties to guarantee convergence to the optimal policy, as well as to a local minimum in the case of parameterized MDPs.

In the context of model-free RL, we provide comparisons with the benchmark algorithms Q-learning, Double Q-learning, and entropy regularized G-learning [14] (also referred to as Soft Q-learning). Our algorithms converge at a faster rate (up to 1.5 times faster) than the benchmark algorithms across various values of the discount factor, even in the case of noisy environments. In the context of parameterized MDPs and RL, we address the small-cell network design problem in 5G communication. Here the parameters are the unknown locations of the small cells and the control policy determines the routing of the communication packets. Upon comparison with the sequential method of first determining the unknown parameters (small cell locations) and then the control policy (equivalently, the communication paths), we show that our algorithms result in costs that are as low as 65\% of the former. The efficacy of our algorithms can be assessed from the fact that the solutions in the model-based and model-free cases are nearly the same. In our simulations on parameterized MDPs and RL we also demonstrate sensitivity analysis, the benefits of annealing, and the benefits of considering the entropy of the distribution over the paths.

This paper is organized as follows. We briefly review the related work and MEP [6] in Section II. In Sections III and IV we develop the MEP-based framework for MDPs. Section V builds upon Section III to address the case of parameterized MDPs and RL problems. Simulations on a variety of scenarios are presented in Section VI. We discuss the generality of our framework, its capabilities, and future directions of the work in Section VII. For ease of reading, we provide a comprehensive list of symbols in Section F of the supplementary material.

II Preliminaries

Related Work in Entropy Regularization: Some of the previous works in the RL literature [14, 15, 16, 17, 18, 19, 20, 30, 31] either add entropy as a regularization term (-\log\mu(a_t|s_t)) [14, 15] to the instantaneous cost function c(s_t,a_t,s_{t+1}) or maximize the entropy (-\sum_{a}\mu(a|s)\log\mu(a|s)) [16, 17, 18] associated only with the stochastic policy under constraints on the cost J. This results in benefits such as better exploration, overcoming the effect of noise w_t in the instantaneous cost c_t, and faster convergence. However, the resulting stochastic policy and soft-max approximation of the value function J are not in compliance with the Maximum Entropy Principle applied to the distribution over the paths of the MDP. Thus, the resulting stochastic policy is biased in its exploration over the paths of the MDP. Our simulations demonstrate the benefit of unbiased exploration (in our framework) in terms of faster convergence and better performance in noisy environments in comparison to the entropy regularized benchmark algorithm.

Related Work in Parameterized MDPs and RL: The existing solution approaches [3, 2, 4] can be extended to parameterized MDPs by discretizing the parameter domain. However, the resulting problem is not necessarily an MDP, as every transition from one state to another depends on the path (and the parameter values) taken up to the current state. Other related approaches for parameterized MDPs are case specific; for instance, [32] presents action-based parameterization of the state space with application to service rate control in closed Jackson networks, and [33, 34, 35, 36, 37, 38] incorporate parameterized actions that are applicable in the domain of RoboCup soccer, where at each step the agent must select both the discrete action it wishes to execute as well as the continuously valued parameters required by that action. On the other hand, the class of parameterized MDPs that we address in this article predominantly originates in network-based applications that involve simultaneous routing and resource allocation, and poses the additional challenges of non-convexity and NP-hardness. We address these MDPs in both scenarios, where the underlying model is known as well as where it is unknown.

Maximum Entropy Principle: We briefly review the Maximum Entropy Principle (MEP) [6] since our framework relies heavily upon it. MEP states that, for a random variable \mathcal{X} with given prior information, the most unbiased probability distribution is the one that maximizes the Shannon entropy. More specifically, let the known prior information on the random variable \mathcal{X} be given as constraints on the expectations of the functions f_k:\mathcal{X}\rightarrow\mathbb{R}, 1\leq k\leq m. Then, the most unbiased probability distribution p_{\mathcal{X}}(\cdot) solves

\max_{\{p_{\mathcal{X}}(x_i)\}} \; H(\mathcal{X}) = -\sum_{i=1}^{n} p_{\mathcal{X}}(x_i)\ln p_{\mathcal{X}}(x_i) \quad \text{subject to} \quad \sum_{i=1}^{n} p_{\mathcal{X}}(x_i)\, f_k(x_i) = F_k \;\; \forall\; 1\leq k\leq m, \qquad (1)

where F_k, 1\leq k\leq m, are the known expected values of the functions f_k. The above optimization problem results in the Gibbs distribution [39] p_{\mathcal{X}}(x_i)=\frac{\exp\{-\sum_{k}\lambda_{k}f_{k}(x_{i})\}}{\sum_{j=1}^{n}\exp\{-\sum_{k}\lambda_{k}f_{k}(x_{j})\}}, where \lambda_k, 1\leq k\leq m, are the Lagrange multipliers corresponding to the equality constraints in (1).
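As a concrete illustration, the following minimal numpy sketch evaluates this Gibbs distribution on a small discrete support for given multipliers; the function name, the toy support, and the chosen multiplier value are ours for illustration, and in practice the multipliers \lambda_k would be tuned so that the expectation constraints in (1) are met.

```python
import numpy as np

def gibbs_distribution(f_values, lambdas):
    """Gibbs distribution p(x_i) proportional to exp(-sum_k lambda_k f_k(x_i)).

    f_values: (m, n) array with f_values[k, i] = f_k(x_i)
    lambdas : (m,) array of Lagrange multipliers
    """
    logits = -lambdas @ f_values          # shape (n,)
    logits -= logits.max()                # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Toy example: a single constraint E[f_1(X)] = F_1 on a 4-point support.
f = np.array([[1.0, 2.0, 3.0, 4.0]])
p = gibbs_distribution(f, lambdas=np.array([0.5]))
print(p, (p * f[0]).sum())                # the distribution and its achieved expectation
```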

III MDPs with Finite Shannon Entropy

III-A Problem Formulation

We consider an infinite horizon discounted MDP that comprises a cost-free termination state \delta. We formally define this MDP as a tuple \langle\mathcal{S},\mathcal{A},c,p,\gamma\rangle, where \mathcal{S}=\{s_1,\ldots,s_N=\delta\}, \mathcal{A}=\{a_1,\ldots,a_M\}, and c:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R} respectively denote the state space, action space, and cost function; p:\mathcal{S}\times\mathcal{S}\times\mathcal{A}\rightarrow[0,1] is the state transition probability function and 0<\gamma\leq 1 is the discounting factor. A control policy \mu:\mathcal{A}\times\mathcal{S}\rightarrow\{0,1\} determines the action taken at each state s\in\mathcal{S}, where \mu(a|s)=1 implies that action a\in\mathcal{A} is taken when the system is in the state s\in\mathcal{S} and \mu(a|s)=0 indicates otherwise. For every initial state x_0=s, the MDP induces a stochastic process, whose realization is a path \omega, an infinite sequence of actions and states, that is

\omega=(u_0,x_1,u_1,x_2,u_2,\ldots,x_T,u_T,x_{T+1},\ldots), \qquad (2)

where ut𝒜u_{t}\in\mathcal{A}, xt𝒮x_{t}\in\mathcal{S} for all t0t\in\mathbb{Z}_{\geq 0} and xt=δx_{t}=\delta for all tkt\geq k if and when the system reaches the termination state δ𝒮\delta\in\mathcal{S} in kk steps. The objective is to determine the optimal policy μ\mu^{*} that minimizes the state value function

J^{\mu}(s)=\mathbb{E}_{p_{\mu}}\Big[\sum_{t=0}^{\infty}\gamma^{t}c(x_t,u_t,x_{t+1})\,\big|\,x_0=s\Big], \quad \forall\; s\in\mathcal{S} \qquad (3)

where the expectation is with respect to the probability distribution pμ(|s):ω[0,1]p_{\mu}(\cdot|s):\omega\rightarrow[0,1] on the space of all possible paths ωΩ:={(ut,xt+1)t0:ut𝒜,xt𝒮}\omega\in\Omega:=\{(u_{t},x_{t+1})_{t\in\mathbb{Z}_{\geq 0}}:u_{t}\in\mathcal{A},x_{t}\in\mathcal{S}\}. In order to ensure that the system reaches the cost-free termination state in finite steps and the optimal state value function Jμ(s)J^{\mu}(s) is finite, we make the following assumption throughout this section.

Assumption 1.

There exists at least one deterministic proper policy \bar{\mu}(a|s)\in\{0,1\}\;\forall\;a\in\mathcal{A},s\in\mathcal{S} such that \min_{s\in\mathcal{S}}p_{\bar{\mu}}(x_{|\mathcal{S}|}=\delta|x_0=s)>0. In other words, under the policy \bar{\mu} there is a non-zero probability of reaching the cost-free termination state \delta when starting from any state s.

We consider the following set of stochastic policies μ\mu

\Gamma:=\{\pi: 0<\pi(a|s)<1\;\forall\;a\in\mathcal{A}, s\in\mathcal{S}\}, \qquad (4)

and the following lemma ensures that under the Assumption 1 all the policies μΓ\mu\in\Gamma are proper; that is,

Lemma 1.

For any policy μΓ\mu\in\Gamma as defined in (4), mins𝒮pμ(x|𝒮|=δ|x0=s)>0\min_{s\in\mathcal{S}}p_{\mu}(x_{|\mathcal{S}|}=\delta|x_{0}=s)>0, i.e., under each policy μΓ\mu\in\Gamma the probability to reach the termination state δ\delta in |𝒮|=N|\mathcal{S}|=N steps beginning from any s𝒮s\in\mathcal{S}, is strictly positive.

Proof. Please refer to the Appendix.

We use the Maximum Entropy Principle to determine the policy μΓ\mu\in\Gamma such that the Shannon Entropy of the corresponding distribution pμp_{\mu} is maximized and the state value function Jμ(s)J^{\mu}(s) attains a specified value J0J_{0}. More specifically, we pose the following optimization problem

\max_{\{p_{\mu}(\cdot|s)\}:\,\mu\in\Gamma}\; H^{\mu}(s) = -\sum_{\omega\in\Omega}p_{\mu}(\omega|s)\log p_{\mu}(\omega|s) \quad \text{subject to} \quad J^{\mu}(s)=J_0. \qquad (5)

Well posedness: For the class of proper policy μΓ\mu\in\Gamma the maximum entropy Hμ(s)s𝒮H^{\mu}(s)~{}\forall~{}s\in\mathcal{S} is finite as shown in [40, 41]. In short, the existence of a cost-free termination state δ\delta and a non-zero probability to reach it from any state s𝒮s\in\mathcal{S} ensures that the maximum entropy is finite. Please refer to the Theorem 1 in [40] or Proposition 2 in [41] for further details.

Remark 1.

Though the optimization problem in (III-A) considers the stochastic policies \mu\in\Gamma, our algorithms presented in the later sections are designed such that the resulting stochastic policy asymptotically converges to a deterministic policy.

III-B Problem Solution

The probability pμ(ω|s)p_{\mu}(\omega|s) of taking the path ω\omega in (2) can be determined from the underlying policy μ\mu by exploiting the Markov property that dissociates pμ(ω|s)p_{\mu}(\omega|s) in terms of the policy μ\mu and the state transition probability pp as

p_{\mu}(\omega|x_0)=\prod_{t=0}^{\infty}\mu(u_t|x_t)\,p(x_{t+1}|x_t,u_t). \qquad (6)

Thus, in our framework we prudently work with the policy \mu, which is defined over finite action and state spaces, as against the distribution p_{\mu}(\omega|s) defined over the infinitely many paths \omega\in\Omega. The Lagrangian corresponding to the above optimization problem in (III-A) is V^{\mu}_{\beta}(s)=J^{\mu}(s)-\frac{1}{\beta}H^{\mu}(s)=

\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}c_{x_t x_{t+1}}^{u_t}+\frac{1}{\beta}\big(\log\mu_{u_t|x_t}+\log p_{x_t x_{t+1}}^{u_t}\big)\,\Big|\,x_0=s\Big], \qquad (7)

where β\beta is the Lagrange parameter. Here we have not included the constant value J0J_{0} in the cost Lagrangian Vβμ(s)V_{\beta}^{\mu}(s) for simplicity. We refer to the above Lagrangian Vβμ(s)V^{\mu}_{\beta}(s) (7) as the free-energy function and 1β\frac{1}{\beta} as temperature due to their close analogies with statistical physics (where free energy is enthalpy (E) minus the temperature times entropy (TH)). To determine the optimal policy μβ\mu^{*}_{\beta} that minimizes the Lagrangian Vβμ(s)V_{\beta}^{\mu}(s) in (7), we first derive the Bellman equation for Vβμ(s)V_{\beta}^{\mu}(s).

Theorem 1.

The free-energy function Vβμ(s)V_{\beta}^{\mu}(s) in (7) satisfies the following recursive Bellman equation

V_{\beta}^{\mu}(s)=\sum_{a\in\mathcal{A},\,s'\in\mathcal{S}}\mu_{a|s}\,p_{ss'}^{a}\Big(\bar{c}_{ss'}^{a}+\frac{\gamma}{\beta}\log\mu_{a|s}+\gamma V_{\beta}^{\mu}(s')+c_0(s)\Big), \qquad (8)

where \mu_{a|s}=\mu(a|s), p_{ss'}^{a}=p(s'|s,a), and \bar{c}_{ss'}^{a}=c(s,a,s')+\frac{\gamma}{\beta}\log p(s'|s,a) for simplicity in notation, and the function c_0(s) does not depend on the policy \mu.

Proof. Please refer to the Appendix for details. It must be noted that this derivation shows and exploits the algebraic structure \sum_{s'}p_{ss'}^{a}H^{\mu}(s')=\sum_{s'}p_{ss'}^{a}\log p_{ss'}^{a}+\log\mu_{a|s}+\lambda_s+1, as detailed in Lemma 2 in the appendix.

Now the optimal policy satisfies \frac{\partial V_{\beta}^{\mu}(s)}{\partial\mu(a|s)}=0, which results in the Gibbs distribution

\mu^{*}_{\beta}(a|s)=\frac{\exp\big\{-(\beta/\gamma)\Lambda_{\beta}(s,a)\big\}}{\sum_{a'\in\mathcal{A}}\exp\big\{-(\beta/\gamma)\Lambda_{\beta}(s,a')\big\}}, \text{ where} \qquad (9)
\Lambda_{\beta}(s,a)=\sum_{s'\in\mathcal{S}}p_{ss'}^{a}\big(\bar{c}_{ss'}^{a}+\gamma V^{*}_{\beta}(s')\big), \qquad (10)

is the state-action value function, p_{ss'}^{a}=p(s'|s,a), c_{ss'}^{a}=c(s,a,s'), \bar{c}_{ss'}^{a}=c_{ss'}^{a}+\frac{\gamma}{\beta}\log p_{ss'}^{a}, and V^{*}_{\beta}(=V^{\mu^{*}_{\beta}}_{\beta}) is the value function corresponding to the policy \mu^{*}_{\beta}. To avoid notational clutter we use the above notations wherever clear from the context. Substituting the policy \mu^{*}_{\beta} in (9) back into the Bellman equation (8), we obtain the implicit equation

V^{*}_{\beta}(s)=-\frac{\gamma}{\beta}\log\Big(\sum_{a\in\mathcal{A}}\exp\Big\{-\frac{\beta}{\gamma}\Lambda_{\beta}(s,a)\Big\}\Big). \qquad (11)

Without loss of generality, we ignore the term c0(s)c_{0}(s) in (8) since it does not affect the policy as seen from equation (9). To solve for the state-action value function Λβ(s,a)\Lambda_{\beta}(s,a) and free-energy function Vβ(s)V_{\beta}^{*}(s) we substitute the expression of Vβ(s)V^{*}_{\beta}(s) in (11) into the expression of Λβ(s,a)\Lambda_{\beta}(s,a) in (10) to obtain the implicit equation Λβ(s,a)=:[TΛβ](s,a)\Lambda_{\beta}(s,a)=:[T\Lambda_{\beta}](s,a), where

[T\Lambda_{\beta}](s,a)=\sum_{s'\in\mathcal{S}}p_{ss'}^{a}\Big(c_{ss'}^{a}+\frac{\gamma}{\beta}\log p_{ss'}^{a}\Big)-\frac{\gamma^{2}}{\beta}\sum_{s'\in\mathcal{S}}p_{ss'}^{a}\log\sum_{a'\in\mathcal{A}}\exp\Big\{-\frac{\beta}{\gamma}\Lambda_{\beta}(s',a')\Big\}. \qquad (12)

To solve the above implicit equation, we show that the map T in (III-B) is a contraction map, and therefore \Lambda_{\beta} can be obtained using fixed point iterations, which are guaranteed to converge to the unique fixed point. Consequently, the global minimum V_{\beta}^{*} in (11) and the optimal policy \mu_{\beta}^{*} in (9) can be obtained.
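For concreteness, a minimal numpy sketch of this fixed-point iteration is given below; the array names c[s, a, s'] and p[s, a, s'] for the cost and transition tensors, and the tolerance and iteration limits, are our assumptions for illustration. The optimal policy of (9) then follows as a softmax of -(\beta/\gamma)\Lambda_{\beta}(s,\cdot).

```python
import numpy as np

def soft_bellman_fixed_point(c, p, beta, gamma, tol=1e-8, max_iter=10000):
    """Fixed-point iteration Lambda <- T(Lambda) for the map in (III-B).

    c[s, a, s1]: instantaneous cost, p[s, a, s1]: transition probabilities.
    Returns the state-action free energy Lambda_beta(s, a).
    """
    S, A, _ = c.shape
    Lam = np.zeros((S, A))
    # expected cost plus the (gamma/beta) * sum_s1 p log p term of (12)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    base = (p * c).sum(axis=2) + (gamma / beta) * plogp.sum(axis=2)
    for _ in range(max_iter):
        # V(s1) = -(gamma/beta) * log sum_a' exp(-(beta/gamma) Lambda(s1, a')), cf. (11)
        z = -(beta / gamma) * Lam
        m = z.max(axis=1)
        V = -(gamma / beta) * (m + np.log(np.exp(z - m[:, None]).sum(axis=1)))
        Lam_new = base + gamma * (p * V[None, None, :]).sum(axis=2)
        if np.abs(Lam_new - Lam).max() < tol:
            return Lam_new
        Lam = Lam_new
    return Lam
```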

Theorem 2.

The map [T\Lambda_{\beta}](s,a) as defined in (III-B) is a contraction mapping with respect to a weighted maximum norm, i.e., there exist a vector \xi=(\xi_s)\in\mathbb{R}^{|\mathcal{S}|} with \xi_s>0\;\forall\;s\in\mathcal{S} and a scalar \alpha<1 such that

\|T\Lambda_{\beta}-T\Lambda'_{\beta}\|_{\xi}\leq\alpha\|\Lambda_{\beta}-\Lambda'_{\beta}\|_{\xi} \qquad (13)

where \|\Lambda_{\beta}\|_{\xi}=\max_{s\in\mathcal{S},\,a\in\mathcal{A}}\frac{|\Lambda_{\beta}(s,a)|}{\xi_s}.

Proof. Please refer to Appendix -A for details.

Remark 2.

It is known from the sensitivity analysis [39] that the value of the Lagrange parameter β\beta in (7) is inversely proportional to the constant J0J_{0} in (III-A). Thus, at small values of β0\beta\approx 0 (equivalently large J0J_{0}), we are mainly maximizing the Shannon Entropy Hμ(s)H^{\mu}(s) and the resultant policy in (9) encourages exploration along the paths of the MDP. As β\beta increases (J0J_{0} decreases), more and more weight is given to the state value function Jμ(s)J^{\mu}(s) in (7) and the policy in (9) goes from being exploratory to being exploitative. As β\beta\rightarrow\infty the exploration is completely eliminated and we converge to a deterministic policy μ\rightarrow\mu^{*} that minimizes Jμ(s)J^{\mu}(s) in (3).

Remark 3.

We briefly draw readers’ attention to the value function Y(s)=𝔼[t=0γt(cxtxt+1ut+(1/β)logμut|xt)]Y(s)=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}(c_{x_{t}x_{t+1}}^{u_{t}}+(1/\beta)\log\mu_{u_{t}|x_{t}})] considered in the entropy regularized methods [14]. Note that in Y(s)Y(s) the discounting γt\gamma^{t} is multiplied to both the cost term cxtxt+1utc_{x_{t}x_{t+1}}^{u_{t}} as well as the entropy term (1/β)logμut|xt(1/\beta)\log\mu_{u_{t}|x_{t}}. However, in our MEP-based method, the Lagrangian Vβμ(s)V_{\beta}^{\mu}(s) in (7) comprises of discounting γt\gamma^{t} only over the cost term cxtxt+1utc_{x_{t}x_{t+1}}^{u_{t}} and not on the entropy terms (1/β)logμut|xt(1/\beta)\log\mu_{u_{t}|x_{t}} and (1/β)logpxtxt+1ut(1/\beta)\log p_{x_{t}x_{t+1}}^{u_{t}}. Therefore, the policy in [14] does not satisfy MEP applied over the distribution pμp_{\mu}; consequently their exploration along the paths is not as unbiased as our algorithm.

III-C Model-free Reinforcement Learning Problems

In these problems, the cost function c(s,a,s') and the state-transition probability p(s'|s,a) are not known a priori; however, at each discrete time instant t the agent takes an action u_t under a policy \mu, and the environment (the underlying stochastic process) returns an instantaneous cost c_{x_t x_{t+1}}^{u_t} and subsequently moves to the state x_{t+1}\sim p(\cdot|x_t,u_t). Motivated by the iterative updates of the Q-learning algorithm [2], we consider the following stochastic updates in our Algorithm 1 to learn the state-action value function in our methodology

\Psi_{t+1}(x_t,u_t)=(1-\nu_t(x_t,u_t))\,\Psi_t(x_t,u_t)+\nu_t(x_t,u_t)\Big[c_{x_t x_{t+1}}^{u_t}-\frac{\gamma^{2}}{\beta}\log\sum_{a'\in\mathcal{A}}\exp\Big\{\frac{-\beta}{\gamma}\Psi_t(x_{t+1},a')\Big\}\Big], \qquad (14)

with the stepsize parameter νt(xt,ut)(0,1]\nu_{t}(x_{t},u_{t})\in(0,1], and show that under appropriate conditions on νt\nu_{t} (as illustrated shortly), Ψt\Psi_{t} will converge to the fixed point Λ¯β\bar{\Lambda}_{\beta}^{*} of the implicit equation

\bar{\Lambda}_{\beta}(s,a)=\sum_{s'\in\mathcal{S}}p_{ss'}^{a}\Big(c_{ss'}^{a}-\frac{\gamma^{2}}{\beta}\log\sum_{a'}\exp\big(\tfrac{-\beta}{\gamma}\bar{\Lambda}_{\beta}(s',a')\big)\Big)=:[\bar{T}\bar{\Lambda}_{\beta}](s,a). \qquad (15)

The above equation differs slightly from the equation \Lambda_{\beta}(s,a)=[T\Lambda_{\beta}](s,a) in (III-B). The latter has an additional term \frac{\gamma}{\beta}\sum_{s'}p_{ss'}^{a}\log p_{ss'}^{a}, which makes it difficult to learn its fixed point \Lambda_{\beta}^{*} in the absence of the state transition probability p_{ss'}^{a} itself. Since in this work we do not attempt to determine (or learn) the distribution p_{ss'}^{a} (as in [42]) from the interactions of the agent with the environment, we work with the approximate state-action value function \bar{\Lambda}_{\beta} in (III-C), where \bar{\Lambda}_{\beta}\rightarrow\Lambda_{\beta} for large \beta values (since \frac{\gamma}{\beta}(\sum_{s'}p_{ss'}^{a}\log p_{ss'}^{a})\rightarrow 0 as \beta\rightarrow\infty). The following proposition elucidates the conditions under which the updates \Psi_t in (III-C) converge to the fixed point \bar{\Lambda}_{\beta}^{*}.

Proposition 1.

Consider the class of MDPs illustrated in Section III-A. Given that

\sum_{t=0}^{\infty}\nu_t(s,a)=\infty,\quad \sum_{t=0}^{\infty}\nu_t^{2}(s,a)<\infty\;\;\forall\;s\in\mathcal{S}, a\in\mathcal{A},

the update \Psi_t(s,a) in (III-C) converges to the fixed point \bar{\Lambda}_{\beta}^{*} of the map \bar{T}:\bar{\Lambda}_{\beta}\rightarrow\bar{\Lambda}_{\beta} in (III-C) with probability 1.

Proof. Please refer to the Appendix 4.

Input: N, \nu_t(\cdot,\cdot), \sigma; Output: \mu^{*}, \bar{\Lambda}^{*}
Initialize: t=0, \Psi_0=0, \mu_0(a|s)=1/|\mathcal{A}|.
for episode = 1 to N do
       \beta=\sigma\times episode; reset environment at state x_t
       while True do
             sample u_t\sim\mu_t(\cdot|x_t); obtain cost c_t and x_{t+1}
             update \Psi_t(x_t,u_t), \mu_{t+1}(u_t|x_t) in (III-C) and (9)
             break if x_{t+1}=\delta; t\leftarrow t+1
      
Algorithm 1 Model-free Reinforcement Learning
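A minimal Python sketch of the inner update (III-C) used by Algorithm 1 is given below. The Gym-style interface env.reset() and env.step(u) returning (next_state, cost, done), the integer state/action encoding, and the constant stepsize nu are our assumptions for illustration; Proposition 1 requires a decaying stepsize \nu_t(s,a) satisfying the stated conditions.

```python
import numpy as np

def mep_model_free(env, S, A, episodes, sigma, gamma, nu=0.1, seed=0):
    """Sketch of Algorithm 1: learn Psi via the stochastic update (III-C)."""
    rng = np.random.default_rng(seed)
    Psi = np.zeros((S, A))
    for ep in range(1, episodes + 1):
        beta = sigma * ep                          # annealing schedule beta = sigma * episode
        x = env.reset()
        done = False
        while not done:
            # policy (9): softmax of -(beta/gamma) * Psi(x, .)
            z = -(beta / gamma) * Psi[x]
            z -= z.max()
            mu = np.exp(z) / np.exp(z).sum()
            u = rng.choice(A, p=mu)
            x_next, cost, done = env.step(u)
            # soft-min target: -(gamma^2/beta) * log sum_a' exp(-(beta/gamma) Psi(x', a'))
            z2 = -(beta / gamma) * Psi[x_next]
            m = z2.max()
            soft_min = -(gamma ** 2 / beta) * (m + np.log(np.exp(z2 - m).sum()))
            # update (III-C); a constant stepsize is used here purely for brevity
            Psi[x, u] = (1 - nu) * Psi[x, u] + nu * (cost + soft_min)
            x = x_next
    return Psi
```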
Remark 4.

Note that the stochasticity of the optimal policy \mu_{\beta}^{*}(a|s) in (9) depends on the value of \gamma, which allows it to account for the effect of the discount factor on its exploration strategy. More precisely, for large discount factors the time window T over which the instantaneous costs \gamma^{t}c(s_t,a_t,s_{t+1}) are considerable (i.e., \gamma^{t}c_{s_t s_{t+1}}^{a_t}>\epsilon\;\forall\;t\leq T) is large, and thus the stochastic policy (9) performs higher exploration along the paths. On the other hand, for small discount factors this time window T is relatively smaller, and thus the stochastic policy (9) inherently performs less exploration. As illustrated in the simulations, this characteristic of the policy in (9) results in even faster convergence rates in comparison to the benchmark algorithms as the discount factor \gamma decreases.

IV MDPs with Infinite Shannon Entropy

Here we consider MDPs where the Shannon Entropy H^{\mu}(s) of the distribution \{p_{\mu}(\omega|s)\} over the paths \omega\in\Omega is not necessarily finite (for instance, due to the absence of an absorbing state). To ensure the finiteness of the objective in (III-A) we consider the discounted Shannon Entropy [43, 44]

H^{\mu}_{d}(s)=-\mathbb{E}\Big[\sum_{t=0}^{\infty}\alpha^{t}\big(\log\mu_{u_t|x_t}+\log p_{x_t x_{t+1}}^{u_t}\big)\,\big|\,x_0=s\Big] \qquad (16)

with a discount factor \alpha\in(0,1), which we choose to be independent of the discount factor \gamma in the value function J^{\mu}(s). The free-energy function (or the Lagrangian) resulting from the optimization problem in (III-A) with the alternate objective function H^{\mu}_{d}(s) in (16) is given by

V_{\beta,I}^{\mu}(s)=\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}\hat{c}_{x_t x_{t+1}}^{u_t}+\frac{\alpha^{t}}{\beta}\log\mu(u_t|x_t)\,\big|\,x_0=s\Big], \qquad (17)

where \hat{c}_{x_t x_{t+1}}^{u_t}=c_{x_t x_{t+1}}^{u_t}+\frac{\gamma^{t}}{\beta\alpha^{t}}\log p_{x_t x_{t+1}}^{u_t}, and the subscript I stands for the 'infinite entropy' case. Note that the free-energy functions (7) and (17) differ only with regard to the discount factor \alpha, and thus our solution methodology in this section is similar to the one in Section III-B.

Theorem 3.

The free-energy function Vβ,Iμ(s)V^{\mu}_{\beta,I}(s) in (17) satisfies the recursive Bellman equation

V_{\beta,I}^{\mu}(s)=\sum_{a,s'}\mu_{a|s}\,p_{ss'}^{a}\Big(\check{c}_{ss'}^{a}+\frac{\gamma}{\alpha\beta}\log\mu_{a|s}+\gamma V_{\beta,I}^{\mu}(s')\Big) \qquad (18)

where \check{c}_{ss'}^{a}=c_{ss'}^{a}+\frac{\gamma}{\alpha\beta}\log p_{ss'}^{a}.

Proof. Please see Appendix 3. The above derivation shows and exploits the algebraic structure \alpha\sum_{s'}p_{ss'}^{a}H^{\mu}_{d}(s')=\sum_{s'}p_{ss'}^{a}\log p_{ss'}^{a}+\log\alpha\mu(a|s)+\lambda_s (Lemma 4).

The optimal policy satisfies \frac{\partial V_{\beta,I}^{\mu}(s)}{\partial\mu(a|s)}=0, which results in the Gibbs distribution

\mu^{*}_{\beta,I}(a|s)=\frac{\exp\big\{-\frac{\beta\alpha}{\gamma}\Phi_{\beta}(s,a)\big\}}{\sum_{a'\in\mathcal{A}}\exp\big\{-\frac{\beta\alpha}{\gamma}\Phi_{\beta}(s,a')\big\}}, \text{ where} \qquad (19)
\Phi_{\beta}(s,a)=\sum_{s'\in\mathcal{S}}p_{ss'}^{a}\big(\check{c}_{ss'}^{a}+\gamma V^{*}_{\beta,I}(s')\big), \qquad (20)

is the corresponding state-action value function. Substituting \mu^{*}_{\beta,I} in (19) into the Bellman equation (18) results in the following optimal free-energy function V_{\beta,I}^{*}(s)(:=V_{\beta,I}^{\mu^{*}_{\beta,I}}(s)).

V_{\beta,I}^{*}(s)=-\frac{\gamma}{\alpha\beta}\log\sum_{a'\in\mathcal{A}}\exp\Big(\frac{-\alpha\beta}{\gamma}\Phi_{\beta}(s,a')\Big) \qquad (21)
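Relative to (9)-(11), the only change is the effective inverse temperature \alpha\beta/\gamma. A minimal sketch, assuming the state-action values \Phi_{\beta}(s,a) are available as an array, of how the policy (19) and free energy (21) would be evaluated:

```python
import numpy as np

def policy_and_value_infinite_entropy(Phi, beta, gamma, alpha):
    """Policy (19) and free energy (21) from the state-action values Phi[s, a]."""
    kappa = alpha * beta / gamma                    # effective inverse temperature
    z = -kappa * Phi
    m = z.max(axis=1, keepdims=True)
    expz = np.exp(z - m)                            # shifted for numerical stability
    mu = expz / expz.sum(axis=1, keepdims=True)     # Gibbs policy (19)
    V = -(1.0 / kappa) * (m[:, 0] + np.log(expz.sum(axis=1)))   # free energy (21)
    return mu, V
```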
Remark 5.

The subsequent steps to learn the optimal policy \mu^{*}_{\beta,I} in (19) are similar to the steps demonstrated in Section III-C. We forgo a similar analysis here.

Remark 6.

When \alpha=\gamma, the policy \mu^{*}_{\beta,I} in (19), the state-action value function \Phi_{\beta} in (20), and the free-energy function V^{*}_{\beta,I} in (21) correspond to the analogous expressions obtained in the entropy regularized methods [14]. However, in this paper we do not require that \alpha=\gamma. On the contrary, we propose that \alpha should take large values. In fact, our simulations in Section VI demonstrate the better convergence rates obtained when \gamma<\alpha=(1-\epsilon) as compared to when \gamma=\alpha.

V Parameterized MDPs

V-A Problem Formulation

As stated in Section I, many application areas such as small cell networks (Figure 1) pose a parameterized MDP that requires simultaneously determining (a) the optimal policy \mu^{*}, and (b) the unknown state and action parameters \zeta=\{\zeta_s\} and \eta=\{\eta_a\}, such that the state value function

J^{\mu}_{\zeta\eta}(s)=\mathbb{E}_{p_{\mu}}\Big[\sum_{t=0}^{\infty}\gamma^{t}c\big(x_t(\zeta),u_t(\eta),x_{t+1}(\zeta)\big)\,\Big|\,x_0=s\Big] \qquad (22)

is minimized for all s\in\mathcal{S}, where x_t(\zeta) denotes the state x_t\in\mathcal{S} with the associated parameter \zeta_{x_t} and u_t(\eta) denotes the action u_t\in\mathcal{A} with the associated action parameter value \eta_{u_t}. As in Section III-A, we assume that the parameterized MDPs exhibit at least one deterministic proper policy (Assumption 1) to ensure the finiteness of the value function J^{\mu}_{\zeta\eta}(s) and the Shannon Entropy H^{\mu}(s) of the MDP for all \mu\in\Gamma. We further assume that the state-transition probability \{p_{ss'}^{a}\} is independent of the state and action parameters \zeta,\eta.

V-B Problem Solution

This problem was solved in Section III-B, where the states and actions were not parameterized, or equivalently can be viewed as if the parameters \zeta, \eta were known and fixed. For the parameterized case, we apply the same solution methodology, which results in the same optimal policy \mu^{*}_{\beta,\zeta\eta} as in (9) as well as the corresponding free-energy function V_{\beta,\zeta\eta}^{*}(s) in (11), except that they are now characterized by the parameters \zeta, \eta. To determine the optimal (local) parameters \zeta, \eta, we set \sum_{s'\in\mathcal{S}}\frac{\partial V_{\beta,\zeta\eta}^{*}(s')}{\partial\zeta_s}=0 for all s, and \sum_{s'\in\mathcal{S}}\frac{\partial V_{\beta,\zeta\eta}^{*}(s')}{\partial\eta_a}=0 for all a, which we implement by using the gradient descent steps

\zeta_s^{+}=\zeta_s-\eta\sum_{s'\in\mathcal{S}}G_{\zeta_s}^{\beta}(s'), \qquad \eta_a^{+}=\eta_a-\bar{\eta}\sum_{s'\in\mathcal{S}}G_{\eta_a}^{\beta}(s'). \qquad (23)

Here G_{\zeta_s}^{\beta}(s'):=\frac{\partial V_{\beta,\zeta\eta}^{*}(s')}{\partial\zeta_s} and G_{\eta_a}^{\beta}(s'):=\frac{\partial V_{\beta,\zeta\eta}^{*}(s')}{\partial\eta_a}. The derivatives G_{\zeta_s}^{\beta} and G_{\eta_a}^{\beta} are assumed to be bounded (see Proposition 2). We compute these derivatives as G_{\zeta_s}^{\beta}(s')=\sum_{a'}\mu_{a'|s'}K_{\zeta_s}^{\beta}(s',a') and G_{\eta_a}^{\beta}(s')=\sum_{a'}\mu_{a'|s'}L_{\eta_a}^{\beta}(s',a') for all s'\in\mathcal{S}, where K_{\zeta_s}^{\beta}(s',a') and L_{\eta_a}^{\beta}(s',a') are the fixed points of the corresponding Bellman equations K_{\zeta_s}^{\beta}(s',a')=[T_1 K_{\zeta_s}^{\beta}](s',a') and L_{\eta_a}^{\beta}(s',a')=[T_2 L_{\eta_a}^{\beta}](s',a'), where

[T_1 K_{\zeta_s}^{\beta}](s',a')=\sum_{s''}p_{s's''}^{a'}\Big[\frac{\partial c_{s's''}^{a'}}{\partial\zeta_s}+\gamma G_{\zeta_s}^{\beta}(s'')\Big], \qquad [T_2 L_{\eta_a}^{\beta}](s',a')=\sum_{s''}p_{s's''}^{a'}\Big[\frac{\partial c_{s's''}^{a'}}{\partial\eta_a}+\gamma G_{\eta_a}^{\beta}(s'')\Big]. \qquad (24)

Note that in the above equations we have suppressed the dependence of the instantaneous cost function c_{s's''}^{a'} on the parameters \zeta and \eta to avoid notational clutter.
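A minimal numpy sketch of one such gradient computation is given below for a single scalar parameter; the array names dc_dtheta[s', a', s''] (the cost derivative with respect to that parameter), p, and mu, as well as the tolerance settings, are our assumptions for illustration. The returned G would then be used in the gradient descent step (23).

```python
import numpy as np

def parameter_gradient(dc_dtheta, p, mu, gamma, tol=1e-8, max_iter=10000):
    """Fixed point K of T1 in (24) for one scalar parameter (a single zeta_s or eta_a),
    and the gradient G(s') = sum_a' mu(a'|s') K(s', a')."""
    S, A, _ = dc_dtheta.shape
    K = np.zeros((S, A))
    base = (p * dc_dtheta).sum(axis=2)            # sum_s'' p(s''|s',a') dc/dtheta
    for _ in range(max_iter):
        G = (mu * K).sum(axis=1)                  # G(s'') = sum_a'' mu(a''|s'') K(s'', a'')
        K_new = base + gamma * (p * G[None, None, :]).sum(axis=2)
        if np.abs(K_new - K).max() < tol:
            K = K_new
            break
        K = K_new
    return (mu * K).sum(axis=1)

# One gradient-descent step of (23) for a state parameter zeta_s with stepsize eta_lr:
#   zeta_s -= eta_lr * parameter_gradient(dc_dzeta_s, p, mu, gamma).sum()
```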

Theorem 4.

The operators [T1Kζsβ](s,a)[T_{1}K_{\zeta_{s}}^{\beta}](s^{\prime},a^{\prime}) and [T2Lηaβ](s,a)[T_{2}L_{\eta_{a}}^{\beta}](s^{\prime},a^{\prime}) defined in (24) are contraction maps with respect to a weighted maximum norm ξ\|\cdot\|_{\xi}, where Xξ=maxs,aX(s,a)ξs\|X\|_{\xi}=\max_{s^{\prime},a^{\prime}}\frac{X(s^{\prime},a^{\prime})}{\xi_{s^{\prime}}} and ξ|𝒮|\xi\in\mathbb{R}^{|\mathcal{S}|} is a vector of positive components ξs\xi_{s}.

Proof. Please refer to the Appendix 4 for details.

As previously stated in Section I, the state value function J^{\mu}_{\zeta\eta}(\cdot) in (22) is generally a non-convex function of the parameters \zeta, \eta and is riddled with multiple poor local minima, with the resulting optimization problem being possibly NP-hard [27]. In our algorithm for parameterized MDPs we anneal \beta from \beta_{\min} to \beta_{\max}, similar to our approach for non-parameterized MDPs in Section III-B, where the solution from the current \beta iteration is used to initialize the subsequent \beta iteration. However, in addition to facilitating a steady transition from an exploratory policy to an exploitative policy, annealing facilitates a gradual homotopy from the convex negative Shannon entropy function to the non-convex state value function J^{\mu}_{\zeta\eta}, which prevents our algorithm from getting stuck in a poor local minimum. The underlying idea of our heuristic is to track the optimum as the initial convex function deforms to the actual non-convex cost. Also, minimizing the Lagrangian V_{\beta}^{*}(s) at \beta=\beta_{\min}\approx 0 determines the global minimum, thereby making our algorithm insensitive to initialization. Algorithm 2 illustrates the steps to determine the policy and parameters for a parameterized MDP; a skeleton of the annealing loop follows the listing.

Input: \beta_{\min}, \beta_{\max}, \tau; Output: \mu^{*}, \zeta and \eta.
Initialize: \beta=\beta_{\min}, \mu_{a|s}=\frac{1}{|\mathcal{A}|}, and \zeta,\eta to 0
while \beta\leq\beta_{\max} do
       while True do
             repeat until convergence
                   update \Lambda_{\beta}, \mu_{\beta}, G_{\zeta_s}^{\beta}, G_{\eta_a}^{\beta} in (10), (9) and (24)
             update \zeta,\eta in (23); if \|G_{\zeta_s}\|,\|G_{\eta_a}\|<\epsilon, break
       \beta\leftarrow\tau\beta
      
Algorithm 2 Parameterized Markov Decision Process
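The sketch below outlines the outer annealing loop of Algorithm 2 in Python; solve_policy_and_gradients and gradient_step are hypothetical placeholders for the inner fixed-point computations of (10) and (24) and the update (23), and the schedule constants are ours for illustration.

```python
import numpy as np

def annealed_parameterized_mdp(solve_policy_and_gradients, gradient_step, zeta0, eta0,
                               beta_min=1e-3, beta_max=1e3, tau=1.5, eps=1e-4):
    """Skeleton of Algorithm 2: anneal beta geometrically and warm-start each beta
    iteration with the previous (policy, parameter) solution (homotopy)."""
    zeta, eta, beta = zeta0, eta0, beta_min
    mu = None
    while beta <= beta_max:
        while True:
            mu, G_zeta, G_eta = solve_policy_and_gradients(beta, zeta, eta)
            if max(np.abs(G_zeta).max(), np.abs(G_eta).max()) < eps:
                break                               # parameters converged at this beta
            zeta, eta = gradient_step(zeta, eta, G_zeta, G_eta)   # step (23)
        beta *= tau                                 # proceed to a lower-temperature problem
    return mu, zeta, eta
```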

V-C Parameterized Reinforcement Learning

In many applications formulated as parameterized MDPs, explicit knowledge of the cost function c_{ss'}^{a}, its dependence on the parameters \zeta, \eta, and the state-transition probabilities \{p_{ss'}^{a}\} is not available. However, for each action a the environment returns an instantaneous cost based on its current state x_t, the next state x_{t+1}, and the parameter values \zeta, \eta, which can subsequently be used to simultaneously learn the policy \mu^{*}_{\beta,\zeta\eta} and the unknown state and action parameters \zeta, \eta via stochastic iterative updates. At each \beta iteration in our learning algorithm, we employ the stochastic iterative updates in (III-C) to determine the optimal policy \mu^{*}_{\beta,\zeta\eta} for given \zeta, \eta values, and subsequently employ the stochastic iterative updates

K_{\zeta_s}^{t+1}(x_t,u_t)=(1-\nu_t(x_t,u_t))\,K_{\zeta_s}^{t}(x_t,u_t)+\nu_t(x_t,u_t)\Big[\frac{\partial c_{x_t x_{t+1}}^{u_t}}{\partial\zeta_s}+\gamma G_{\zeta_s}^{t}(x_{t+1})\Big], \qquad (25)

where G_{\zeta_s}^{t}(x_{t+1})=\sum_{a}\mu_{a|x_{t+1}}K_{\zeta_s}^{t}(x_{t+1},a), to learn the derivative G_{\zeta_s}^{\beta*}(\cdot). Similar updates are used to learn G_{\eta_a}^{\beta*}(\cdot). The parameter values \zeta, \eta are then updated using the gradient descent step in (23). The following proposition formalizes the convergence of the updates in (V-C) to the fixed point G_{\zeta_s}^{\beta*}.

Proposition 2.

For the class of parameterized MDPs considered in Section V-A given that

1. \sum_{t=0}^{\infty}\nu_t(s,a)=\infty and \sum_{t=0}^{\infty}\nu_t^{2}(s,a)<\infty\;\;\forall\;s\in\mathcal{S}, a\in\mathcal{A},

2. \exists\; B>0 such that \big|\frac{\partial c(s',a',s'')}{\partial\zeta_s}\big|\leq B\;\;\forall\;s,s',a',s'',

3. \exists\; C>0 such that \big|\frac{\partial c(s',a',s'')}{\partial\eta_a}\big|\leq C\;\;\forall\;a,s',a',s'',

the updates in (V-C) converge to the unique fixed point G_{\zeta_s}^{\beta*}(s') of the map T_1: G_{\zeta_s}\rightarrow G_{\zeta_s} in (24).

Proof. Please refer to the Appendix 4 for details.

Input: \beta_{\min}, \beta_{\max}, \tau, T, \nu_t; Output: \mu^{*}, \zeta, \eta
Initialize: \beta=\beta_{\min}, \mu_t=\frac{1}{|\mathcal{A}|}, and \zeta,\eta,G_{\zeta}^{\beta},G_{\eta}^{\beta},K_{\zeta}^{\beta},L_{\eta}^{\beta},\bar{\Lambda}_{\beta} to 0.
while \beta\leq\beta_{\max} do
       Use Algorithm 1 to obtain \mu^{*}_{\beta,\zeta\eta} at given \zeta, \eta, \beta.
       Consider env1(\zeta,\eta), env2(\zeta',\eta'); set \zeta'=\zeta, \eta'=\eta
       while \{\zeta_s\}, \{\eta_a\} have not converged do
             for each s\in\mathcal{S} do
                   for episode = 1 to T do
                         reset env1, env2 at state x_t
                         while True do
                               sample action u_t\sim\mu^{*}(\cdot|x_t).
                               env1: obtain c_t, x_{t+1}.
                               env2: set \zeta_s'=\zeta_s+\Delta\zeta_s, get c_t', x_{t+1}.
                               find G_{\zeta_s}^{t+1}(x_t) with \frac{\partial c_{x_t x_{t+1}}^{u_t}}{\partial\zeta_s}\approx\frac{c_t'-c_t}{\Delta\zeta_s}.
                               break if x_{t+1}=\delta; t\leftarrow t+1.
             Similarly learn G_{\eta_a}^{\beta}. Update \{\zeta_s\}, \{\eta_a\} in (23).
       \beta\leftarrow\tau\beta
      
Algorithm 3 Parameterized Reinforcement Learning

Algorithmic Details: Please refer to Algorithm 3 for a complete implementation. Unlike the scenario in Section III-C, where the agent acts upon the environment by taking an action u_t\in\mathcal{A} and learns only the policy \mu^{*}, here the agent interacts with the environment by (a) taking an action u_t\in\mathcal{A} and also (b) providing estimated parameter values \zeta, \eta to the environment; subsequently, the environment returns an instantaneous cost and the next state. In our Algorithm 3, we first learn the policy \mu_{\beta}^{*} at a given value of the parameters \zeta, \eta using the iterations (III-C), and then learn the fixed points G_{\zeta_s}^{\beta*}, G_{\eta_a}^{\beta*} using the iterations in (V-C) to update the parameters \zeta, \eta using (23). Note that the iterations (V-C) require the derivatives \partial c(s',a',s'')/\partial\zeta_s and \partial c(s',a',s'')/\partial\eta_a, which we determine using the instantaneous costs resulting from two \epsilon-distinct environments and a finite difference method, as sketched below. Here the \epsilon-distinct environments represent the same underlying MDP but differ only in one of the parameter values. However, if two \epsilon-distinct environments are not feasible, one can work with a single environment where the algorithm stores the instantaneous costs and the corresponding parameter values upon each interaction with the environment.
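A minimal Python sketch of this finite-difference learning step for one state parameter \zeta_s is given below. The environment interface (reset, reset_to, set_param, and step returning (next_state, cost, done)), the synchronized transitions of the two environments, and the constant stepsize nu are our assumptions for illustration.

```python
import numpy as np

def learn_G_zeta_s(env1, env2, mu, zeta, s, S, A, episodes, gamma,
                   nu=0.1, delta=1e-2, seed=0):
    """Learn K_{zeta_s} via the update (25), estimating dc/dzeta_s by finite differences
    between two epsilon-distinct environments (env2 has zeta_s perturbed by delta)."""
    rng = np.random.default_rng(seed)
    K = np.zeros((S, A))
    env2.set_param(s, zeta[s] + delta)              # env2 differs from env1 only in zeta_s
    for _ in range(episodes):
        x = env1.reset()
        env2.reset_to(x)                            # both environments start at the same state
        done = False
        while not done:
            u = rng.choice(A, p=mu[x])
            x_next, c1, done = env1.step(u)
            _, c2, _ = env2.step(u)                 # assumed to follow the same realized transition
            dc_dzeta = (c2 - c1) / delta            # finite-difference derivative of the cost
            G_next = mu[x_next] @ K[x_next]         # G_{zeta_s}(x_{t+1})
            K[x, u] = (1 - nu) * K[x, u] + nu * (dc_dzeta + gamma * G_next)
            x = x_next
    return (mu * K).sum(axis=1)                     # G_{zeta_s}(.) for the gradient step (23)
```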

Remark 7.

Parameterized MDPs with infinite Shannon entropy H^{\mu} can be addressed analogously using the above methodology.

Remark 8.

The MDPs addressed in Sections III, IV, and V consider different variants of the discounted infinite horizon problem. The MDPs in Section III address the class of sequential problems that have a non-zero probability of reaching a cost-free termination state (i.e., a finite Shannon entropy value). The MDPs considered in Section IV need not reach a termination state (possibly infinite Shannon entropy), and the underlying sequential decision problem continues over the horizon determined by the discounting factor \gamma. The parameterized MDPs in Section V can have finite or infinite Shannon entropy, but they comprise states and actions that have an unknown parameter associated with them.

VI Simulations

We broadly classify our simulations into two categories. Firstly, in the model-free RL setting we demonstrate our Algorithm 1 in determining the control policy \mu^{*} for the finite and infinite Shannon entropy variants of the Gridworld environment in Figure 2.

Figure 2: Gridworld environment.

Each cell in the Gridworld denotes a state. The cells colored black are invalid states. An agent can choose to move vertically, horizontally, diagonally, or stay at the current cell. Each action is followed by a probability of slipping into neighbouring states (probability 0.05 of slipping in each of the vertical and horizontal directions, and probability 0.025 of slipping in each of the diagonal directions, for a cumulative p(\text{slip})\approx 0.3). For the finite entropy case, each step incurs a unit cost, and the process terminates when the agent reaches the terminal state \mathbf{T}. For the infinite entropy case, each step incurs a unit reward. Secondly, in the parameterized MDPs and RL setting we demonstrate our Algorithms 2 and 3 in designing a 5G small cell network. This involves simultaneously determining the locations of the small cells in the network as well as the optimal routing path of the communication packets from the base station to the users.
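One plausible reading of this slip model is sketched below in Python; the offset encoding of moves, the omitted clipping to valid cells, and the function name are our assumptions for illustration (in particular, a slip in the intended direction simply reinforces the intended move).

```python
import numpy as np

# Slip model of the Gridworld: probability 0.05 for each of the four straight
# neighbours and 0.025 for each of the four diagonal neighbours (cumulative ~0.3),
# with the intended move executed otherwise (probability 0.7).
STRAIGHT = [(-1, 0), (1, 0), (0, -1), (0, 1)]
DIAGONAL = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def slip_step(cell, intended, rng):
    """Return the next cell given the intended move, an offset such as (0, 1); the
    'stay' action corresponds to the offset (0, 0). Clipping to valid cells is omitted."""
    moves = [intended] + STRAIGHT + DIAGONAL
    probs = [0.7] + [0.05] * 4 + [0.025] * 4
    dr, dc = moves[rng.choice(len(moves), p=probs)]
    return (cell[0] + dr, cell[1] + dc)

rng = np.random.default_rng(0)
print(slip_step((3, 3), (0, 1), rng))   # e.g. intended move: one cell to the right
```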

We compare our MEP-based Algorithm 1 with the benchmark algorithms Entropy Regularized (ER) G-learning (also referred to as Soft Q-learning) [14], Q-learning [5], and Double Q-learning [13]. Note that our choice of the benchmark algorithm G-learning (or entropy regularized Soft Q-learning) presented in [14] is based on its commonality with our framework, as discussed in Section III-B, and the choice of Q-learning and Double Q-learning is based on their widespread utility in the literature. Also, note that the work in [14] already establishes the efficacy of the G-learning algorithm over the following algorithms in the literature: Q-learning, Double Q-learning, \Psi-learning [45], Speedy Q-learning [46], and the consistent Bellman operator \mathcal{T}_{C} of [47]. Below we highlight features and advantages of our MEP-based Algorithm 1.

Figure 3: Performance of MEP-based algorithm: Illustrations on Gridworld Environment in Figure 2. (a1)-(a3) Finite Entropy variant: Illustrates faster convergence of Algorithm 1 (MEP) at different γ\gamma values. (a4) Demonstrates faster rates of convergence of Algorithm 1 (MEP) for γ\gamma values ranging from 0.650.65 to 0.950.95. (b1)-(b3) Infinite Entropy Variant : Demonstrates faster convergence of Algorithm 1 (MEP) to JJ^{*}. (b4) Illustrates the consistent faster convergence rates of MEP with γ\gamma ranging from 0.650.65 to 0.950.95. (c1)-(c4) Finite Entropy Version (with added Gaussian noise) : Similar observations as in (a1)-(a4) with significantly higher instability in learning with Double Q algorithm.

Faster Convergence to Optimal JJ^{*}: Figures 3(a1)-(a3) (finite entropy variant of Gridworld) and Figures 3 (b1)-(b3) (infinite entropy variant of Gridworld) illustrate the faster convergence of our MEP-based Algorithm 1 for different discount factor γ\gamma values. Here, at each episode the percentage error ΔV/J\Delta V/J^{*} between the value function VβμV_{\beta}^{\mu} corresponding to learned policy μ=μ(ep)\mu=\mu(ep) in the episode epep, and the optimal value function JJ^{*} is given by

\frac{\Delta V(ep)}{J^{*}}=\frac{1}{N}\sum_{i=1}^{N}\sum_{s\in\mathcal{S}}\frac{\big|V_{\beta,i}^{\mu(ep)}(s)-J^{*}(s)\big|}{J^{*}(s)}, \qquad (26)

where NN denotes the total experimental runs and ii indexes the value function Vβ,iμV_{\beta,i}^{\mu} for each run. As observed in Figures 3(a1)-(a3), and Figures 3(b1)-(b3), our Algorithm 1 converges even faster as the discount factor γ\gamma decreases. We characterize the faster convergence rates also in terms of the convergence time - more precisely the percentage E¯pr\bar{E}_{pr} of total episodes taken for the learning error ΔV/J\Delta V/J^{*} to reach within 5%5\% of the best (see Figures 3(a4) and 3(b4)). As is observed in the figures, the performance of our (MEP-based) algorithm in comparison to entropy regularized G learning is better across all values (0.650.65 to 0.950.95) of discount factor γ\gamma. Note that the performance of Algorithm 1 gets even better with decreasing γ\gamma values where the smaller discount factor values occur in instances such as the context of recommendation systems [48], and teaching RL-agents using human-generated rewards [49].

Robustness to noise in data: Figures 3(c1)-(c4) demonstrate robustness to noisy environments; here the instantaneous cost c(s,a,s') in the finite entropy variant of Gridworld is noisy. For the purpose of simulations, we add Gaussian noise \mathcal{N}(0,\sigma^{2}) with \sigma=1 for vertical and horizontal actions, and \sigma=0.5 for diagonal movements. Here, at each episode we compare the percentage error \Delta V/J^{*} in the learned value functions V_{\beta} (corresponding to the state-action value estimate in (III-C)) of the respective algorithms. Similar to our observations and conclusions in Figures 3(a1)-(a3) and Figures 3(b1)-(b3), we see faster convergence of our MEP-based algorithm over the benchmark algorithms in Figures 3(c1)-(c3) in the case of a noisy environment. Also, Figure 3(c4) demonstrates that across all discount factor values (0.65 to 0.95), Algorithm 1 converges faster than the entropy regularized Soft Q-learning.

Simultaneously determining the unknown parameters and policy in Parameterized MDPs: We design the 5G Small Cell Network (see Figure 1) both when the underlying model (c_{ss'}^{a} and p_{ss'}^{a}) is known (using Algorithm 2) and when it is unknown (using Algorithm 3). In our simulations we randomly distribute 46 user nodes \{n_i\} at locations \{x_i\} and the base station \delta at z in the domain \Omega\subset\mathbb{R}^{2}, as shown in Figure 4(a). The objective is to determine the locations \{y_j\}_{j=1}^{5} (parameters) of the small cells \{f_j\}_{j=1}^{5} and the corresponding communication routes (policy). Here, the state space of the underlying MDP is \mathcal{S}=\{n_1,\ldots,n_{46},f_1,\ldots,f_5\}, where the locations y_1,\ldots,y_5 of the small cells are the unknown parameters \{\zeta_s\} of the MDP, the action space is \mathcal{A}=\{f_1,\ldots,f_5\}, and the cost function is c(s,a,s')=\|\rho(s)-\rho(s')\|_2^{2}, where \rho(\cdot) denotes the spatial location of the respective state. We consider two cases where (a) p_{ss'}^{a} is deterministic, i.e., an action a at the state s results in s'=a with probability 1, and (b) p_{ss'}^{a} is probabilistic, such that an action a at the state s results in s'=a with probability 0.9 or in the state s'=f_1 with probability 0.1. Additionally, due to the absence of prior work in the literature on network design problems modeled as parameterized MDPs, we compare our results only with the solution resulting from a straightforward sequential methodology (as shown in Figure 4(a)), where we first partition the user nodes into 5 distinct clusters to allocate a small cell in each cluster, and then determine the optimal routes in the network.
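A minimal sketch of this stage cost and transition model in Python is given below; the integer indexing of states (users first, then small cells) and the function names are our assumptions for illustration.

```python
import numpy as np

NUM_USERS, NUM_CELLS = 46, 5          # states: n_1..n_46 followed by f_1..f_5

def stage_cost(rho_s, rho_s_next):
    """c(s, a, s') = ||rho(s) - rho(s')||_2^2, the squared distance between the
    spatial locations of consecutive states."""
    return float(np.sum((np.asarray(rho_s) - np.asarray(rho_s_next)) ** 2))

def transition(a_idx, rng, probabilistic=False):
    """Next-state index under action a = f_a: the packet moves to small cell a with
    probability 1 (deterministic case) or 0.9, and to f_1 with probability 0.1
    (probabilistic case)."""
    if probabilistic and rng.random() < 0.1:
        return NUM_USERS                  # index of small cell f_1
    return NUM_USERS + a_idx              # index of small cell f_a

rng = np.random.default_rng(0)
print(stage_cost((0.0, 0.0), (1.0, 2.0)), transition(2, rng, probabilistic=True))
```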

Figure 4: Parameterized MDPs and RL - Design of 5G small cell network. State space 𝒮={{ni},{fj},δ}\mathcal{S}=\{\{n_{i}\},\{f_{j}\},\delta\} comprises the user nodes {ni}\{n_{i}\}, small cells {fj}\{f_{j}\}, and base station δ\delta. The unknown parameters ζs\zeta_{s} denote the locations {yj}\{y_{j}\} of the small cells. The action space comprises the small cells, 𝒜={fj}\mathcal{A}=\{f_{j}\}. Based on our modelling of the network there are no unknown action parameters {ηa}\{\eta_{a}\}. (a) Small cell locations {yj}\{y_{j}\} and communication routes determined using a straightforward sequential methodology. (b)-(c) Small cells at {yj}\{y_{j}\} and communication routes (illustrated by arrows) resulting from the policy obtained from Algorithm 2 (model-based) and Algorithm 3 (model-free), respectively, when pssap_{ss^{\prime}}^{a} is deterministic. (d)-(e) Solutions obtained using Algorithm 2 and Algorithm 3, respectively, when pssap_{ss^{\prime}}^{a} is probabilistic. (f) Sensitivity analysis of the solutions with respect to the user node locations {xi}\{x_{i}\}. (g)-(h) Network designs obtained when considering the entropy of the distribution over the control actions and over the paths of the MDP, respectively. (i) Network design obtained without annealing in Algorithm 2. (j) Simulation on a larger dataset (user base increased by more than 1010 times).

Deterministic p(s,a,s)p(s,a,s^{\prime}): Figure 4(b) illustrates the allocation of small cells and the corresponding communication routes (resulting from the optimal policy μ\mu^{*}) as determined by Algorithm 2. Here, the network is designed to minimize the cumulative cost of communication from each user node and small cell. As denoted in the figure, the route δy3y4y1ni\delta\rightarrow y_{3}\rightarrow y_{4}\rightarrow y_{1}\rightarrow n_{i} carries the communication packet from the base station δ\delta to the respective user nodes nin_{i}, as indicated by the grey arrow from y1y_{1}. The cost incurred here is approximately 180%180\% lower than that in Figure 4(a), clearly indicating the advantage of simultaneously determining the parameters and policy over a sequential methodology. In the model-free RL setting, where the functions c(s,a)c(s,a), pssap_{ss^{\prime}}^{a}, and the locations {xi}\{x_{i}\} of the user nodes {ni}\{n_{i}\} are not known, we employ our Algorithm 3 to determine the small cell locations {yj}j=15\{y_{j}\}_{j=1}^{5} as well as the optimal policy {μ(a|s)}\{\mu^{*}(a|s)\}, as demonstrated in Figure 4(c). It is evident from Figures 4(b) and (c) that the solutions obtained when the model is completely known and when it is unknown are approximately the same. In fact, the solutions differ only by 1.9%1.9\% in terms of the total cost s𝒮Jζημ(s)\sum_{s\in\mathcal{S}}J^{\mu}_{\zeta\eta}(s) (22) incurred, clearly indicating the efficacy of our model-free learning Algorithm 3.

Probabilistic p(s,a,s)p(s,a,s^{\prime}): Figure 4(d) illustrates the solution obtained by our Algorithm 2 when the underlying model (c(s,a)c(s,a), pssap_{ss^{\prime}}^{a}, and {xi}\{x_{i}\}) is known. As before, the network is designed to minimize the cumulative cost of communication from each user node and small cell. The cost associated with the network design is approximately 127%127\% lower than in Figure 4(a). Figure 4(e) illustrates the solution obtained by Algorithm 3 for the model-free case (c(s,a)c(s,a), pssap_{ss^{\prime}}^{a}, and {xi}\{x_{i}\} are unknown). Similar to the above scenario, the solutions obtained using Algorithms 2 and 3 are approximately the same and differ only by 0.3%0.3\% in terms of the total cost sJζημ(s)\sum_{s}J^{\mu}_{\zeta\eta}(s) incurred, thereby substantiating the efficacy of our proposed model-free learning Algorithm 3.

Sensitivity Analysis: Our algorithms enable categorizing the user nodes {ni}\{n_{i}\} in Figure 4(b) into (i) low, (ii) medium, and (iii) high sensitivity categories, such that the final solution is least susceptible to the user nodes in (i) and most susceptible to the nodes in (iii). Note that the above sensitivity analysis requires computing the derivative sVβμ(s)/ζs\sum_{s^{\prime}}\partial V_{\beta}^{\mu}(s^{\prime})/\partial\zeta_{s}, which we determine by solving for the fixed point of the Bellman equation in (24). The derivative sVβμ(s)/ζs\sum_{s^{\prime}}{\partial V_{\beta}^{\mu}(s^{\prime})}/{\partial\zeta_{s}} computed at β\beta\rightarrow\infty is a measure of the sensitivity of the solution to the cost function sJζημ(s)\sum_{s}J^{\mu}_{\zeta\eta}(s) in (22), since VβμV_{\beta}^{\mu} in (11) is a smooth approximation of Jζημ(s)J^{\mu}_{\zeta\eta}(s) in (22) and VβμJζημ(s)V_{\beta}^{\mu}\rightarrow J^{\mu}_{\zeta\eta}(s) as β\beta\rightarrow\infty. A similar analysis for Figures 4(c)-(e) can be done if the locations {xi}\{x_{i}\} of the user nodes {ni}\{n_{i}\} are known to the agent. The sensitivity of the final solution to the locations {yj}\{y_{j}\}, zz of the small cells and the base station can also be determined in a similar manner.

Entropy over paths versus entropy of the policy: We demonstrate the benefit of maximizing the entropy of the distribution {pμ(ω|s)}\{p_{\mu}(\omega|s)\} over the paths of an MDP as compared to the distribution {μ(a|s)}\{\mu(a|s)\} over the control actions. Figure 4(g) demonstrates the 5G network obtained by considering the distribution over the control actions, and Figure 4(h) illustrates the network obtained by considering the distribution over the entire paths. The network cost incurred in Figure 4(h) is 5%5\% less than the cost incurred in Figure 4(g). Here, we consider the probabilistic pssap_{ss^{\prime}}^{a} scenario described above and minimize the cumulative communication cost incurred only from the user nodes.

Avoiding poor local minima and large scale setups: As noted in Section V-B, annealing β\beta from a small value βmin(0)\beta_{\min}(\approx 0) to a large value βmax\beta_{\max} prevents the algorithm from getting stuck at a poor local minimum. Figure 4(i) demonstrates the network design obtained when Algorithm 2 does not anneal β\beta and instead iteratively solves the optimization problem at β=βmax\beta=\beta_{\max}. The resulting network incurs an 11%11\% higher cost in comparison to the network obtained in Figure 4(h), where Algorithm 2 anneals β\beta from a small to a large value. Figure 4(j) demonstrates the 5G network design obtained using Algorithm 2 when the number of user nodes is increased by around 1212 times (to 610610) and the number of allocated small cells is doubled to 1010.

VII Analysis and Discussion

VII-1 Mutual Information Minimization

The optimization problem (III-A) maximizes the Shannon entropy Hμ(s)H^{\mu}(s) under a given constraint on the value function JμJ^{\mu}. We can similarly pose and solve a mutual information minimization problem, which requires determining the distribution pμ(𝒫|s)p_{\mu^{*}}(\mathcal{P}|s) (with control policy μ\mu^{*}) over the paths of the MDP that is close to some given prior distribution q(𝒫|s)q(\mathcal{P}|s) [16, 15]. Here, the objective is to minimize the KL-divergence DKL(pμq)D_{KL}(p_{\mu}\|q) under the constraint J=J0J=J_{0} (as in (III-A)).

VII-2 Non-dependence on choice of J0J_{0} in (III-A)

In our framework we do not explicitly determine or work with the value of J0J_{0}. Instead, we work with the Lagrange parameter β\beta in the Lagrangian Vβμ(s)V^{\mu}_{\beta}(s) in (7) corresponding to the optimization problem (III-A). It is known from sensitivity analysis [6] that small values of β\beta correspond to large values of J0J_{0}, and large values of β\beta correspond to small values of J0J_{0}. Thus, in our algorithms we solve the optimization problem (III-A) beginning at a small value β=βmin0\beta=\beta_{\min}\approx 0 (corresponding to some feasible large J0J_{0}), and anneal it to a large value βmax\beta_{\max} (corresponding to a small J0J_{0} value) at which the stochastic policy μ\mu in (9) converges to either 0 or 11. Also, at β0\beta\approx 0 the stochastic policy μβ\mu_{\beta}^{*} in (9) follows a uniform distribution, which implicitly fixes the value of J0J_{0}. Therefore, the initial value of J0J_{0} in the proposed algorithms is fixed implicitly and is not required to be pre-specified.
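The following toy sketch illustrates this point numerically: the Gibbs-form policy of (9) is essentially uniform at small β and concentrates on the smallest state-action value as β grows. The helper name gibbs_policy and the example numbers are ours.

```python
import numpy as np

def gibbs_policy(state_action_values, beta, gamma=1.0):
    """Policy of the form (9): mu(a|s) proportional to exp(-(beta/gamma) * Lambda(s,a)).

    A max-shift is used for numerical stability; `state_action_values` is the
    vector of Lambda(s, a) over actions for a fixed state s.
    """
    logits = -(beta / gamma) * np.asarray(state_action_values, dtype=float)
    logits -= logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()

Lambda = [2.0, 1.0, 1.5]
print(gibbs_policy(Lambda, beta=1e-3))   # nearly uniform: maximal exploration, large implicit J0
print(gibbs_policy(Lambda, beta=50.0))   # nearly one-hot on the cheapest action: small implicit J0
```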

VII-3 Computational complexity

Our MEP-based Algorithm 1 performs exactly the same number of computations as the Soft Q-learning algorithm [14] for each epoch (or iteration) within an episode. In comparison to the Q and Double Q learning algorithms, our proposed algorithm exhibits a similar number of computational steps, apart from the additional minor computation of explicitly determining μ\mu^{*} in (9).

VII-4 Scheduling β\beta and Phase Transition

In our Algorithm 1, we follow a linear schedule βk=σk\beta_{k}=\sigma k (σ>0\sigma>0), as suggested in the benchmark algorithm [14], to anneal the parameter β\beta. In the case of parameterized MDPs (Algorithms 2 and 3) we geometrically anneal β\beta (i.e., βk+1=τβk\beta_{k+1}=\tau\beta_{k}, τ>1\tau>1) from a small value βmin\beta_{\min} to a large value βmax\beta_{\max} at which the control policy μβ\mu_{\beta}^{*} converges to either 0 or 11. Several other MEP-based algorithms that address problems akin to parameterized MDPs, such as Deterministic Annealing [7], also incorporate geometric annealing of β\beta. The underlying idea in [7] is that the solution undergoes significant changes only at certain critical values βcr\beta_{cr} (phase transitions) and changes insignificantly between two consecutive critical βcr\beta_{cr}’s. Thus, for all practical purposes geometric annealing of β\beta works well. Similar to [7], our Algorithms 2 and 3 also undergo phase transitions, and we are working on an analytical expression for them.
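A minimal sketch of the two schedules is given below; the function names and the particular constants are illustrative assumptions.

```python
def linear_schedule(sigma, num_epochs):
    """beta_k = sigma * k (sigma > 0), the schedule used with Algorithm 1."""
    return [sigma * k for k in range(1, num_epochs + 1)]

def geometric_schedule(beta_min, beta_max, tau):
    """beta_{k+1} = tau * beta_k (tau > 1), annealed from beta_min (~0) up to beta_max."""
    betas, beta = [], beta_min
    while beta < beta_max:
        betas.append(beta)
        beta *= tau
    betas.append(beta_max)
    return betas

print(linear_schedule(sigma=0.1, num_epochs=5))
print(geometric_schedule(beta_min=1e-3, beta_max=10.0, tau=1.02)[:5])
```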

VII-5 Capacity and Exclusion Constraints

Certain parameterized MDPs may pose capacity or dynamical constraints on their parameters. For instance, each small cell fjf_{j} allocated in Figure 4 can be constrained in capacity to cater to at most a fraction cjc_{j} of the user nodes in the network. Our framework allows modeling such a constraint as qμ(fj):=a,niμ(a|ni)p(fj|a,ni)cjq_{\mu}(f_{j}):=\sum_{a,n_{i}}\mu(a|n_{i})p(f_{j}|a,n_{i})\leq c_{j}, where qμ(fj)q_{\mu}(f_{j}) measures the fraction of user nodes {ni}\{n_{i}\} that connect to fjf_{j}. In another scenario, the locations {xi}\{x_{i}\} of the user nodes could be dynamically varying as x˙i=f(x,t)\dot{x}_{i}=f(x,t); the resulting policy μβ\mu^{*}_{\beta} and small cell locations {yj}\{y_{j}\} will then also be time varying. We treat the free-energy function VβμV^{\mu}_{\beta} in (11) as a control-Lyapunov function and determine time varying μβ\mu^{*}_{\beta} and {yj}\{y_{j}\} such that V˙βμ0\dot{V}^{\mu}_{\beta}\leq 0.
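A small sketch of how the capacity constraint can be checked for a candidate policy is shown below; the array layout and the normalization by the number of user nodes (so that the load reads as a fraction) are our own assumptions.

```python
import numpy as np

def cell_loads(mu, p):
    """q_mu(f_j) = sum over user nodes n_i and actions a of mu(a | n_i) * p(f_j | a, n_i).

    mu: array (num_users, num_actions), the policy restricted to user-node states
    p:  array (num_users, num_actions, num_cells), the probabilities p(f_j | a, n_i)
    The result is normalized by the number of user nodes so that it reads as a fraction.
    """
    num_users = mu.shape[0]
    return np.einsum('ia,iaj->j', mu, p) / num_users

def satisfies_capacity(q, c):
    """True if every small cell f_j stays within its capacity fraction c_j."""
    return bool(np.all(np.asarray(q) <= np.asarray(c)))
```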

VII-6 Uncertainty in Parameters

Many application areas comprise states and actions whose associated parameters are uncertain, with a known distribution over the set of their possible values. For instance, a user node nin_{i} in Figure 4 may have an associated uncertainty in its location xix_{i} owing to measurement errors. Our proposed framework easily incorporates such uncertainties in parameter values. For example, the above uncertainty results in replacing c(ni,s,a)c(n_{i},s^{\prime},a) with c(ni,s,a)=xiXip(xi|ni)c(ni,s,a)c^{\prime}(n_{i},s^{\prime},a)=\sum_{x_{i}\in X_{i}}p(x_{i}|n_{i})c(n_{i},s^{\prime},a), where p(xi|ni)p(x_{i}|n_{i}) is the distribution over the set XiX_{i} of possible locations xix_{i}. The subsequent solution approach remains the same as in Section V-B.
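A minimal sketch of this expected-cost replacement, assuming the squared-distance cost used earlier and a finite candidate set X_i, is shown below; the function name expected_cost is ours.

```python
import numpy as np

def expected_cost(candidate_locations, probs, target_location):
    """c'(n_i, s', a) = sum_{x_i in X_i} p(x_i | n_i) * c(n_i, s', a), where each
    per-candidate cost is the squared distance between x_i and the location of s'.

    candidate_locations: array (K, 2) listing the set X_i
    probs:               array (K,) with p(x_i | n_i), summing to one
    """
    squared_distances = np.sum((candidate_locations - target_location) ** 2, axis=1)
    return float(np.dot(probs, squared_distances))

# Example: two equally likely candidate positions for a user node
X_i = np.array([[0.0, 0.0], [1.0, 0.0]])
p_i = np.array([0.5, 0.5])
print(expected_cost(X_i, p_i, np.array([2.0, 0.0])))   # 0.5*4 + 0.5*1 = 2.5
```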

APPENDICES

Appendix A. Proof of Lemma 1: Let x¯0=s\bar{x}_{0}=s. By Assumption 1 there exists a path ω=(u¯0,x¯1,,x¯N=δ)\omega=(\bar{u}_{0},\bar{x}_{1},\ldots,\bar{x}_{N}=\delta) such that pμ¯(ω|x0=s)>0p_{\bar{\mu}}(\omega|x_{0}=s)>0, which by (6) implies p(xk+1=x¯k+1|xk=x¯k,uk=u¯k)>0p(x_{k+1}=\bar{x}_{k+1}|x_{k}=\bar{x}_{k},u_{k}=\bar{u}_{k})>0. Then, the probability pμ(ω|x0=s)p_{\mu}(\omega|x_{0}=s) of taking the path ω\omega under the stochastic policy μΓ\mu\in\Gamma in (4) is also positive.
Proof of Theorem 1: The following lemma is needed.

Lemma 2.

The Shannon Entropy Hμ()H^{\mu}(\cdot) corresponding to the MDP illustrated in Section III-A satisfies the algebraic expression spssaHμ(s)=spssalogpssa+logμa|s+λs+1\sum_{s^{\prime}}p_{ss^{\prime}}^{a}H^{\mu}(s^{\prime})=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\log p_{ss^{\prime}}^{a}+\log\mu_{a|s}+\lambda_{s}+1.

Proof.

Hμ()H^{\mu}(\cdot) in (III-A) satisfies the recursive Bellman equation

Hμ(s)=as′′μa|spss′′a[logpss′′alogμa|s+Hμ(s′′)]H^{\mu}(s^{\prime})=\sum_{a^{\prime}s^{\prime\prime}}\mu_{a^{\prime}|s^{\prime}}p_{s^{\prime}s^{\prime\prime}}^{a^{\prime}}\big{[}-\log p_{s^{\prime}s^{\prime\prime}}^{a^{\prime}}-\log\mu_{a^{\prime}|s^{\prime}}+H^{\mu}(s^{\prime\prime})\big{]}

On the right side of the above Bellman equation, we subtract a zero term λs(aμa|s1)\lambda_{s^{\prime}}(\sum_{a}\mu_{a|s^{\prime}}-1) that accounts for the normalization constraint aμa|s=1\sum_{a}\mu_{a|s^{\prime}}=1, where λs\lambda_{s^{\prime}} is some constant. Taking the derivative of the resulting expression with respect to μa|s\mu_{a|s} we obtain

Hμ(s)μa|s=ρ(s,a)δss+a,s′′μa|spss′′aHμ(s′′)μa|sλsδss,\displaystyle\text{\small$\frac{\partial H^{\mu}(s^{\prime})}{\partial\mu_{a|s}}=\rho(s,a)\delta_{ss^{\prime}}+\sum_{a^{\prime},s^{\prime\prime}}\mu_{a^{\prime}|s^{\prime}}p_{s^{\prime}s^{\prime\prime}}^{a^{\prime}}\frac{\partial H^{\mu}(s^{\prime\prime})}{\partial\mu_{a|s}}$}-\lambda_{s^{\prime}}\delta_{ss^{\prime}}, (27)

where ρ(s,a)=s′′pss′′a(logpss′′aHμ(s′′))logμa|s1\rho(s,a)=-\sum_{s^{\prime\prime}}p_{ss^{\prime\prime}}^{a}(\log p_{ss^{\prime\prime}}^{a}-H^{\mu}(s^{\prime\prime}))-\log\mu_{a|s}-1. The subsequent steps in the proof involve algebraic manipulations and make use of the quantity pμ(s):=spμ(s|s)p_{\mu}(s^{\prime}):=\sum_{s}p_{\mu}(s^{\prime}|s), where pμ(s|s)=apssaμa|sp_{\mu}(s^{\prime}|s)=\sum_{a}p_{ss^{\prime}}^{a}\mu_{a|s}. Under the mild assumption that for each state ss^{\prime} there exists a state-action pair (s,a)(s,a) such that the probability of the system entering the state ss^{\prime} upon taking action aa in the state ss is non-zero (i.e., pssa>0p_{ss^{\prime}}^{a}>0), we have pμ(s)>0p_{\mu}(s^{\prime})>0. Now, we multiply equation (27) by pμ(s)p_{\mu}(s^{\prime}) and sum over all s𝒮s^{\prime}\in\mathcal{S} to obtain

\sum_{s^{\prime}}p_{\mu}(s^{\prime})\frac{\partial H^{\mu}(s^{\prime})}{\partial\mu_{a|s}}=p_{\mu}(s)\rho(s,a)+\sum_{s^{\prime\prime}}p_{\mu}(s^{\prime\prime})\frac{\partial H^{\mu}(s^{\prime\prime})}{\partial\mu_{a|s}}-p_{\mu}(s)\lambda_{s},

where pμ(s′′)=spμ(s)pμ(s′′|s)p_{\mu}(s^{\prime\prime})=\sum_{s^{\prime}}p_{\mu}(s^{\prime})p_{\mu}(s^{\prime\prime}|s^{\prime}). The derivative terms on both sides cancel to give p_{\mu}(s)\rho(s,a)-p_{\mu}(s)\lambda_{s}=0, and since p_{\mu}(s)>0 this implies spssaHμ(s)=spssalogpssa+logμa|s+λs+1\sum_{s^{\prime}}p_{ss^{\prime}}^{a}H^{\mu}(s^{\prime})=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\log p_{ss^{\prime}}^{a}+\log\mu_{a|s}+\lambda_{s}+1. ∎

Now consider the free energy function Vβμ(s)V_{\beta}^{\mu}(s) in (7) and separate out the t=0t=0 term in its infinite summation to obtain

V^{\mu}_{\beta}(s)=\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}c_{x_{t}x_{t+1}}^{u_{t}}+\frac{1}{\beta}\log p_{x_{t}x_{t+1}}^{u_{t}}+\frac{1}{\beta}\log\mu(u_{t}|x_{t})\Big|x_{0}=s\Big]

=\sum_{s^{\prime},a}\mu(a|s)p_{ss^{\prime}}^{a}\big(c_{ss^{\prime}}^{a}+\frac{1}{\beta}\log p_{ss^{\prime}}^{a}+\frac{1}{\beta}\log\mu(a|s)\big)+\mathbb{E}\Big[\sum_{t=1}^{\infty}\gamma^{t}c_{x_{t}x_{t+1}}^{u_{t}}+\frac{1}{\beta}\log p_{x_{t}x_{t+1}}^{u_{t}}+\frac{1}{\beta}\log\mu(u_{t}|x_{t})\Big|x_{0}=s\Big].

Re-indexing the second term with t=t^{\prime}+1, u^{\prime}_{t^{\prime}}=u_{t^{\prime}+1}, x^{\prime}_{t^{\prime}}=x_{t^{\prime}+1}, and factoring out \gamma gives

V_{\beta}^{\mu}(s)=\sum_{s^{\prime},a}\mu(a|s)p_{ss^{\prime}}^{a}\big(c_{ss^{\prime}}^{a}+\frac{1}{\beta}\log p_{ss^{\prime}}^{a}+\frac{1}{\beta}\log\mu(a|s)\big)+\gamma\,\mathbb{E}\Big[\sum_{t^{\prime}=0}^{\infty}\gamma^{t^{\prime}}c_{x^{\prime}_{t^{\prime}}x^{\prime}_{t^{\prime}+1}}^{u^{\prime}_{t^{\prime}}}+\frac{1}{\gamma\beta}\log p_{x^{\prime}_{t^{\prime}}x^{\prime}_{t^{\prime}+1}}^{u^{\prime}_{t^{\prime}}}+\frac{1}{\gamma\beta}\log\mu(u^{\prime}_{t^{\prime}}|x^{\prime}_{t^{\prime}})\Big|x_{0}=s\Big].

Conditioning the inner expectation on x^{\prime}_{0}=s^{\prime} (tower property) yields

V_{\beta}^{\mu}(s)=\sum_{s^{\prime},a}\mu(a|s)p_{ss^{\prime}}^{a}\big(c_{ss^{\prime}}^{a}+\frac{1}{\beta}\log p_{ss^{\prime}}^{a}+\frac{1}{\beta}\log\mu(a|s)\big)+\gamma\sum_{a,s^{\prime}}\mu(a|s)p_{ss^{\prime}}^{a}V^{\mu}_{\gamma\beta}(s^{\prime})

\Rightarrow V_{\beta}^{\mu}(s)=\sum_{a,s^{\prime}}\mu_{a|s}p_{ss^{\prime}}^{a}\Big[c_{ss^{\prime}}^{a}+\frac{1}{\beta}\log p_{ss^{\prime}}^{a}+\frac{1}{\beta}\log\mu_{a|s}+\gamma V_{\gamma\beta}^{\mu}(s^{\prime})\Big], (28)

where Vγβμ(s):=Jμ(s)1γβHμ(s)V_{\gamma\beta}^{\mu}(s^{\prime}):=J^{\mu}(s^{\prime})-\frac{1}{\gamma\beta}H^{\mu}(s^{\prime}). Now we relate Vγβμ(s)V_{\gamma\beta}^{\mu}(s^{\prime}) to Vβμ(s)V_{\beta}^{\mu}(s^{\prime}) by adding and subtracting 1βHμ(s)\frac{1}{\beta}H^{\mu}(s^{\prime}), which gives Vγβμ(s)=Vβμ(s)1γγβH(s)V_{\gamma\beta}^{\mu}(s^{\prime})=V_{\beta}^{\mu}(s^{\prime})-\frac{1-\gamma}{\gamma\beta}H(s^{\prime}). Substituting Vγβ(s)V_{\gamma\beta}(s^{\prime}) and the algebraic expression obtained in Lemma 2 into equation (28) above, we obtain

V^{\mu}_{\beta}(s)=\sum_{s^{\prime},a}\mu(a|s)p_{ss^{\prime}}^{a}\big(c_{ss^{\prime}}^{a}+\frac{1}{\beta}\log p_{ss^{\prime}}^{a}+\frac{1}{\beta}\log\mu(a|s)\big)+\gamma\sum_{a,s^{\prime}}\mu(a|s)p_{ss^{\prime}}^{a}V_{\beta}^{\mu}(s^{\prime})-\frac{1-\gamma}{\beta}\sum_{a}\mu(a|s)\sum_{s^{\prime}}p_{ss^{\prime}}^{a}H^{\mu}(s^{\prime})

=\sum_{s^{\prime},a}\mu(a|s)p_{ss^{\prime}}^{a}\big(c_{ss^{\prime}}^{a}+\frac{1}{\beta}\log p_{ss^{\prime}}^{a}+\frac{1}{\beta}\log\mu(a|s)\big)+\gamma\sum_{a,s^{\prime}}\mu(a|s)p_{ss^{\prime}}^{a}V_{\beta}^{\mu}(s^{\prime})-\frac{1-\gamma}{\beta}\sum_{a}\mu(a|s)\Big(\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\log p_{ss^{\prime}}^{a}+\log\mu(a|s)+\lambda_{s}+1\Big)

\Rightarrow V_{\beta}^{\mu}(s)=\sum_{a,s^{\prime}}\mu(a|s)p_{ss^{\prime}}^{a}\big(c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log\mu(a|s)\big)+\gamma\sum_{a,s^{\prime}}\mu(a|s)p_{ss^{\prime}}^{a}V_{\beta}^{\mu}(s^{\prime})+c_{0}(s) (29)

where c0(s)=1γβ(λs+1)c_{0}(s)=-\frac{1-\gamma}{\beta}(\lambda_{s}+1) does not depend on the policy μ\mu. Therefore, since the control policy μ(a|s)\mu^{*}(a|s) is determined by taking critical points of Vβμ(s)V_{\beta}^{\mu}(s), it is independent of c0(s)c_{0}(s) as shown in the following subsection.

-A Independence of policy on c0c_{0}

The Bellman equation is given by

Vβμ(s)\displaystyle V_{\beta}^{\mu}(s) =a,sμ(a|s)pssa(cssa+γβlogpssa+γβlogμ(a|s))\displaystyle=\sum_{a,s^{\prime}}\mu(a|s)p_{ss^{\prime}}^{a}\big{(}c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log\mu(a|s)\big{)}
+γa,sμ(a|s)pssaVβμ(s)+c0(s)\displaystyle+\gamma\sum_{a,s^{\prime}}\mu(a|s)p_{ss^{\prime}}^{a}V_{\beta}^{\mu}(s^{\prime})+c_{0}(s) (30)

The optimal control policy μβ(a|s)\mu_{\beta}^{*}(a|s) obtained upon taking the derivative of the recursive Bellman equation (and also accounting for aμ(a|s)=1\sum_{a}\mu(a|s)=1) is given by

μβ(a|s)\displaystyle\mu_{\beta}^{*}(a|s) =exp{βγΛ¯β(s,a)}aexp{βγΛ¯β(s,a)},where\displaystyle=\frac{\exp\{-\frac{\beta}{\gamma}\bar{\Lambda}^{*}_{\beta}(s,a)\}}{\sum_{a^{\prime}}\exp\{-\frac{\beta}{\gamma}\bar{\Lambda}^{*}_{\beta}(s,a^{\prime})\}},\quad\text{where} (31)
Λ¯β(s,a)\displaystyle\bar{\Lambda}_{\beta}^{*}(s,a) =spssa(cssa+γβlogpssa+γVβ(s)),\displaystyle=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\big{(}c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}+\gamma V_{\beta}^{*}(s^{\prime})\big{)}, (32)

Substituting the optimal policy μβ(a|s)\mu_{\beta}^{*}(a|s) (31) into the recursive Bellman equation (30) we obtain

Vβ(s)=γβlogaexp{βγΛ¯β(s,a)}+c0(s).\displaystyle V_{\beta}^{*}(s)=-\frac{\gamma}{\beta}\log\sum_{a}\exp\Big{\{}-\frac{\beta}{\gamma}\bar{\Lambda}_{\beta}^{*}(s,a)\Big{\}}+c_{0}(s). (33)

Substituting the above equation (33) into the state-action value function in (32) we obtain the following map

Λ¯β(s,a)=spssa[(cssa+γβlogpssa)\displaystyle\bar{\Lambda}_{\beta}^{*}(s,a)=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\Big{[}\big{(}c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}\big{)}
γ2βlogaexp{βγΛ¯β(s,a)}]+c0(s)=:[T1Λ¯β](s,a),\displaystyle-\frac{\gamma^{2}}{\beta}\log\sum_{a}\exp\Big{\{}-\frac{\beta}{\gamma}\bar{\Lambda}_{\beta}^{*}(s,a)\Big{\}}\Big{]}+c_{0}(s)=:[T_{1}\bar{\Lambda}_{\beta}](s,a), (34)

where the proof that the map T1:Λ¯βΛ¯βT_{1}:\bar{\Lambda}_{\beta}\rightarrow\bar{\Lambda}_{\beta} is a contraction map is analogous to the proof of Theorem 2.

The state-action value Λβ(s,a)\Lambda_{\beta}^{*}(s,a) obtained by disregarding c0(s)c_{0}(s) in (33) satisfies the recursive equation

Λβ(s,a)=spssa[(cssa+γβlogpssa)\displaystyle\Lambda_{\beta}^{*}(s,a)=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\Big{[}\big{(}c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}\big{)}
γ2βlogaexp{βγΛβ(s,a)}]=:[TΛβ](s,a),\displaystyle-\frac{\gamma^{2}}{\beta}\log\sum_{a}\exp\Big{\{}-\frac{\beta}{\gamma}\Lambda_{\beta}^{*}(s,a)\Big{\}}\Big{]}=:[T\Lambda_{\beta}](s,a), (35)

where the map T:ΛβΛβT:\Lambda_{\beta}\rightarrow\Lambda_{\beta} has been shown to be a contraction map in Theorem 2.
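For concreteness, the snippet below is a minimal sketch of the fixed-point iteration on the map T in (35) for a tabular model, assuming the transition probabilities and costs are stored as dense arrays; the function names are ours and this is not the exact implementation used in our experiments.

```python
import numpy as np
from scipy.special import logsumexp

def soft_bellman_operator(Lam, p, c, beta, gamma):
    """One application of the map T in (35) for a tabular MDP.

    Lam: array (S, A) of state-action values; p, c: arrays (S, A, S) holding
    p[s, a, s'] and c[s, a, s']. Zero-probability transitions contribute nothing.
    """
    # -(gamma/beta) * log sum_a' exp(-(beta/gamma) * Lam[s', a'])  (a soft minimum over actions)
    softmin = -(gamma / beta) * logsumexp(-(beta / gamma) * Lam, axis=1)     # shape (S,)
    safe_p = np.where(p > 0, p, 1.0)
    log_p = np.where(p > 0, np.log(safe_p), 0.0)
    inner = c + (gamma / beta) * log_p + gamma * softmin[None, None, :]
    return np.sum(p * inner, axis=2)

def solve_state_action_values(p, c, beta, gamma, iterations=500):
    """Iterate T from zero; the iterates converge since T is a contraction (Theorem 2)."""
    Lam = np.zeros(p.shape[:2])
    for _ in range(iterations):
        Lam = soft_bellman_operator(Lam, p, c, beta, gamma)
    return Lam
```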

Remark: Note that [T1x](s,a)=[Tx](s,a)+c0(s)[T_{1}x](s,a)=[Tx](s,a)+c_{0}(s). Therefore, T1T_{1} is a contraction since TT is a contraction, and hence T1T_{1} has a unique fixed point.

Claim: The optimal control policy μβ(a|s)\mu_{\beta}^{*}(a|s) in (31) is the same whether computed using Λ¯β(s,a)\bar{\Lambda}_{\beta}^{*}(s,a) or Λβ(s,a)\Lambda_{\beta}^{*}(s,a). This is because the fixed points Λ¯β(s,a)\bar{\Lambda}_{\beta}^{*}(s,a) and Λβ(s,a)\Lambda_{\beta}^{*}(s,a) of the contraction maps T1:Λ¯βΛ¯βT_{1}:\bar{\Lambda}_{\beta}\rightarrow\bar{\Lambda}_{\beta} and T:ΛβΛβT:\Lambda_{\beta}\rightarrow\Lambda_{\beta}, respectively, differ by 11γc0(s)\frac{1}{1-\gamma}c_{0}(s), i.e., Λ¯β(s,a)=Λβ(s,a)+11γc0(s)\bar{\Lambda}_{\beta}^{*}(s,a)=\Lambda_{\beta}^{*}(s,a)+\frac{1}{1-\gamma}c_{0}(s). Plugging this relation into the control policy in (31) we obtain

μβ(a|s)\displaystyle\mu_{\beta}^{*}(a|s) =exp{βγΛβ(s,a)β/γ1γc0(s)}aexp{βγΛβ(s,a)β/γ1γc0(s)}\displaystyle=\frac{\exp\{-\frac{\beta}{\gamma}{\Lambda}^{*}_{\beta}(s,a)-\frac{\beta/\gamma}{1-\gamma}c_{0}(s)\}}{\sum_{a^{\prime}}\exp\{-\frac{\beta}{\gamma}{\Lambda}^{*}_{\beta}(s,a^{\prime})-\frac{\beta/\gamma}{1-\gamma}c_{0}(s)\}}
μβ(a|s)\displaystyle\Rightarrow\mu_{\beta}^{*}(a|s) =exp{βγΛβ(s,a)}aexp{βγΛβ(s,a)}\displaystyle=\frac{\exp\{-\frac{\beta}{\gamma}{\Lambda}^{*}_{\beta}(s,a)\}}{\sum_{a^{\prime}}\exp\{-\frac{\beta}{\gamma}{\Lambda}^{*}_{\beta}(s,a^{\prime})\}}

This indicates that we do not need to take c0(s)c_{0}(s) into account when computing the optimal policy.

Proof that Λ¯β(s,a)=Λβ(s,a)+11γc0(s)\bar{\Lambda}_{\beta}^{*}(s,a)=\Lambda_{\beta}^{*}(s,a)+\frac{1}{1-\gamma}c_{0}(s): This can be checked by substituting this expression into the definition of the map [T1Λ¯β][T_{1}\bar{\Lambda}_{\beta}] in (34); we get

T1Λ¯β(s,a)=spssa[(cssa+γβlogpssa)\displaystyle T_{1}\bar{\Lambda}_{\beta}^{*}(s,a)=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\Big{[}\big{(}c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}\big{)}
γ2βlogaexp{βγΛβ(s,a)β/γ1γc0(s)}]+c0(s)\displaystyle-\frac{\gamma^{2}}{\beta}\log\sum_{a}\exp\Big{\{}-\frac{\beta}{\gamma}\Lambda_{\beta}^{*}(s,a)-\frac{\beta/\gamma}{1-\gamma}c_{0}(s)\Big{\}}\Big{]}+c_{0}(s)
T1Λ¯β(s,a)=spssa[(cssa+γβlogpssa)\displaystyle T_{1}\bar{\Lambda}_{\beta}^{*}(s,a)=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\Big{[}\big{(}c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}\big{)}
γ2βlogaexp{βγΛβ(s,a)}]+γ1γc0(s)+c0(s)\displaystyle-\frac{\gamma^{2}}{\beta}\log\sum_{a}\exp\Big{\{}-\frac{\beta}{\gamma}\Lambda_{\beta}^{*}(s,a)\Big{\}}\Big{]}+\frac{\gamma}{1-\gamma}c_{0}(s)+c_{0}(s)
T1Λ¯β(s,a)=spssa[(cssa+γβlogpssa)\displaystyle T_{1}\bar{\Lambda}_{\beta}^{*}(s,a)=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\Big{[}\big{(}c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}\big{)}
\displaystyle- γ2βlogaexp{βγΛβ(s,a)}]+11γc0(s)\displaystyle\frac{\gamma^{2}}{\beta}\log\sum_{a}\exp\Big{\{}-\frac{\beta}{\gamma}\Lambda_{\beta}^{*}(s,a)\Big{\}}\Big{]}+\frac{1}{1-\gamma}c_{0}(s)
T1Λ¯β(s,a)=TΛβ(s,a)+11γc0(s)\displaystyle T_{1}\bar{\Lambda}_{\beta}^{*}(s,a)=T{\Lambda}_{\beta}^{*}(s,a)+\frac{1}{1-\gamma}c_{0}(s)
Λ¯β(s,a)=Λβ(s,a)+11γc0(s)\displaystyle\bar{\Lambda}_{\beta}^{*}(s,a)=\Lambda_{\beta}^{*}(s,a)+\frac{1}{1-\gamma}c_{0}(s)

Hence proved.

Appendix B. Proof of Theorem 2: The following lemma is used.

Lemma 3.

For every policy μΓ\mu\in\Gamma defined in (4) there exists a vector ξ=(ξs)+|𝒮|\xi=(\xi_{s})\in\mathbb{R}_{+}^{|\mathcal{S}|} with positive components and a scalar λ<1\lambda<1 such that spssaξsλξs\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\xi_{s^{\prime}}\leq\lambda\xi_{s} for all s𝒮s\in\mathcal{S} and a𝒜a\in\mathcal{A}.

Proof.

Consider a new MDP with state transition probabilities identical to those of the original MDP and with transition costs cssa=11βlog(|𝒜||𝒮|)c_{ss^{\prime}}^{a}=-1-\frac{1}{\beta}\log(|\mathcal{A}||\mathcal{S}|) except when s=δs=\delta. Thus, the free-energy function Vβμ(s)V_{\beta}^{\mu}(s) in (7) for the new MDP is less than or equal to 1-1. We define ξsVβ(s)-\xi_{s}\triangleq V_{\beta}^{*}(s) (as given in (11)) and use the LogSumExp inequality [50] to obtain ξsminaΛβ(s,a)Λβ(s,a)a𝒜-\xi_{s}\leq\min_{a}\Lambda_{\beta}(s,a)\leq\Lambda_{\beta}(s,a)~{}\forall~{}a\in\mathcal{A}, where Λβ(s,a)\Lambda_{\beta}(s,a) is the state-action value function in (III-B). Thus, ξsspssa(cssa+γβlogpssaγξs)-\xi_{s}\leq\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\big{(}c_{ss^{\prime}}^{a}+\frac{\gamma}{\beta}\log p_{ss^{\prime}}^{a}-\gamma\xi_{s^{\prime}}\big{)}, and upon substituting cssac_{ss^{\prime}}^{a} we obtain ξs1γspssaξs1spssaξs-\xi_{s}\leq-1-\gamma\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\xi_{s^{\prime}}\leq-1-\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\xi_{s^{\prime}}.

s𝒮pssaξsξs1[maxsξs1ξs]ξs=:λξs.\displaystyle\text{\small$\Rightarrow\sum_{s^{\prime}\in\mathcal{S}}p_{ss^{\prime}}^{a}\xi_{s^{\prime}}\leq\xi_{s}-1\leq\Big{[}\max_{s}\frac{\xi_{s}-1}{\xi_{s}}\Big{]}\xi_{s}=:\lambda\xi_{s}$}.

Since Vβ(s)1V_{\beta}^{*}(s)\leq-1 \Rightarrow ξs10\xi_{s}-1\geq 0 and thus λ<1\lambda<1. ∎

Next we show that T:ΛβΛβT:\Lambda_{\beta}\rightarrow\Lambda_{\beta} in (III-B) is a contraction map. For any Λ^β\hat{\Lambda}_{\beta} and Λˇβ\check{\Lambda}_{\beta} we have that [TΛ^βTΛˇβ](s,a)[T\hat{\Lambda}_{\beta}-T\check{\Lambda}_{\beta}](s,a)

=γ2βs𝒮pssalogaexp(βγΛ^β(s,a))aexp(βγΛˇβ(s,a))=-\frac{\gamma^{2}}{\beta}\sum_{s^{\prime}\in\mathcal{S}}p_{ss^{\prime}}^{a}\log\frac{\sum_{a}\exp{\big{(}-\frac{\beta}{\gamma}\hat{\Lambda}_{\beta}(s^{\prime},a)\big{)}}}{\sum_{a^{\prime}}\exp{\big{(}-\frac{\beta}{\gamma}\check{\Lambda}_{\beta}(s^{\prime},a^{\prime})\big{)}}}
γs,apssaμ^a|s(Λ^β(s,a)Λˇβ(s,a))=:γΔμ^,\displaystyle\text{\small$\geq\gamma\sum_{s^{\prime},a^{\prime}}p_{ss^{\prime}}^{a}\hat{\mu}_{a^{\prime}|s^{\prime}}(\hat{\Lambda}_{\beta}(s^{\prime},a^{\prime})-\check{\Lambda}_{\beta}(s^{\prime},a^{\prime}))=:\gamma\Delta_{\hat{\mu}}$}, (36)

where we use the log-sum inequality to obtain (36), and μ^a|s\hat{\mu}_{a|s} is the stochastic policy in (9) corresponding to Λ^β(s,a)\hat{\Lambda}_{\beta}(s,a). Similarly, we obtain [TΛˇβTΛ^β](s,a)γs,apssaμˇa|s(Λ^β(s,a)Λˇβ(s,a))=:γΔμˇ[T\check{\Lambda}_{\beta}-T\hat{\Lambda}_{\beta}](s,a)\geq-\gamma\sum_{s^{\prime},a^{\prime}}p_{ss^{\prime}}^{a}\check{\mu}_{a^{\prime}|s^{\prime}}(\hat{\Lambda}_{\beta}(s^{\prime},a^{\prime})-\check{\Lambda}_{\beta}(s^{\prime},a^{\prime}))=:-\gamma\Delta_{\check{\mu}}, where μˇa|s\check{\mu}_{a|s} is the policy in (9) corresponding to Λˇβ(s,a)\check{\Lambda}_{\beta}(s,a). Now from γΔμ^[TΛ^βTΛˇβ](s,a)γΔμˇ\gamma\Delta_{\hat{\mu}}\leq[T\hat{\Lambda}_{\beta}-T\check{\Lambda}_{\beta}](s,a)\leq\gamma\Delta_{\check{\mu}} we conclude that |[TΛ^βTΛˇβ](s,a)|γΔμ¯(s,a)|[T\hat{\Lambda}_{\beta}-T\check{\Lambda}_{\beta}](s,a)|\leq\gamma\Delta_{\bar{\mu}}(s,a) where Δμ¯(s,a)=max{|Δμ^(s,a)|,|Δμˇ(s,a)|}\Delta_{\bar{\mu}}(s,a)=\max\{|\Delta_{\hat{\mu}}(s,a)|,|\Delta_{\check{\mu}}(s,a)|\} and we have |[TΛ^βTΛˇβ](s,a)||[T\hat{\Lambda}_{\beta}-T\check{\Lambda}_{\beta}](s,a)|

γs,apssaμ¯a|s|Λ^β(s,a)Λˇβ(s,a)|\leq\gamma\sum_{s^{\prime},a^{\prime}}p_{ss^{\prime}}^{a}\bar{\mu}_{a^{\prime}|s^{\prime}}|\hat{\Lambda}_{\beta}(s^{\prime},a^{\prime})-\check{\Lambda}_{\beta}(s^{\prime},a^{\prime})| (37)
γs,apssaξsμ¯a|sΛ^βΛˇβξ\leq\gamma\sum_{s^{\prime},a^{\prime}}p_{ss^{\prime}}^{a}\xi_{s^{\prime}}\bar{\mu}_{a^{\prime}|s^{\prime}}\|\hat{\Lambda}_{\beta}-\check{\Lambda}_{\beta}\|_{\xi} (38)

where Λβξ=maxs,aΛβ(s,a)ξs\|\Lambda_{\beta}\|_{\xi}=\max_{s,a}\frac{\Lambda_{\beta}(s,a)}{\xi_{s}} and ξ𝒮\xi\in\mathbb{R}^{\mathcal{S}} is as given in Lemma 3. Further, from the same Lemma we obtain

|[TΛ^βTΛˇβ](s,a)|γλξsa𝒜μ¯a|sΛ^βΛˇβξ|[T\hat{\Lambda}_{\beta}-T\check{\Lambda}_{\beta}](s,a)|\leq\gamma\lambda\xi_{s}\sum_{a^{\prime}\in\mathcal{A}}\bar{\mu}_{a^{\prime}|s^{\prime}}\|\hat{\Lambda}_{\beta}-\check{\Lambda}_{\beta}\|_{\xi} (39)
TΛ^βTΛˇβξγλΛ^βΛˇβξ with γλ<1\Rightarrow\|T\hat{\Lambda}_{\beta}-T\check{\Lambda}_{\beta}\|_{\xi}\leq\gamma\lambda\|\hat{\Lambda}_{\beta}-\check{\Lambda}_{\beta}\|_{\xi}\text{ with }\gamma\lambda<1 (40)

Appendix C. Proof of Theorem 3: The proof follows a similar idea as the proof of Theorem 1 in Appendix A, and thus we do not explain it in detail, except for the following lemma, which illustrates the algebraic structure of the discounted Shannon entropy Hdμ()H_{d}^{\mu}(\cdot) in (16); this structure is different from that in Lemma 2 and is required in our proof of the theorem.

Lemma 4.

The discounted Shannon entropy Hdμ()H_{d}^{\mu}(\cdot) corresponding to the MDP in Section IV satisfies the algebraic expression αspssaHdμ(s)=spssalogpssa+logαμ(a|s)+λs+1\alpha\sum_{s^{\prime}}p_{ss^{\prime}}^{a}H^{\mu}_{d}(s^{\prime})=\sum_{s^{\prime}}p_{ss^{\prime}}^{a}\log p_{ss^{\prime}}^{a}+\log\alpha\mu(a|s)+\lambda_{s}+1.

Proof.

Define a new MDP that augments the action and state spaces (𝒜,𝒮\mathcal{A},\mathcal{S}) of the original MDP with an additional action aea_{e} and state ses_{e}, respectively, and that derives its state-transition probability {qssa}\{q_{ss^{\prime}}^{a}\} and policy {ζa|s}\{\zeta_{a|s}\} from the original MDP as

qssa={pssas,s𝒮,a𝒜1if s,a=se,ae1if s=s=se0otherwiseζa|s={αμa|s(s,a)(𝒮,𝒜)1αif a=ae,s𝒮0if a𝒜,s=se1if a=ae,s=se\displaystyle q_{ss^{\prime}}^{a}=\begin{cases}p_{ss^{\prime}}^{a}&\forall s,s^{\prime}\in\mathcal{S},a\in\mathcal{A}\\ 1&\text{if }s^{\prime},a=s_{e},a_{e}\\ 1&\text{if }s^{\prime}=s=s_{e}\\ 0&\text{otherwise}\end{cases}\zeta_{a|s}=\begin{cases}\alpha\mu_{a|s}&\forall(s,a)\in(\mathcal{S},\mathcal{A})\\ 1-\alpha&\text{if }a=a_{e},s\in\mathcal{S}\\ 0&\text{if }a\in\mathcal{A},s=s_{e}\\ 1&\text{if }a=a_{e},s=s_{e}\end{cases}

Next, we define Tμ:=αHdμT^{\mu}:=\alpha H^{\mu}_{d}, which satisfies Tμ(s)=as′′ηa|spss′′a[logpss′′alogηa|s+Tμ(s′′)]T^{\mu}(s^{\prime})=\sum_{a^{\prime}s^{\prime\prime}}\eta_{a^{\prime}|s^{\prime}}p_{s^{\prime}s^{\prime\prime}}^{a^{\prime}}\big{[}-\log p_{s^{\prime}s^{\prime\prime}}^{a}-\log\eta_{a^{\prime}|s^{\prime}}+T^{\mu}(s^{\prime\prime})\big{]}, derived using (16), where ηa|s=αμa|s\eta_{a^{\prime}|s^{\prime}}=\alpha\mu_{a^{\prime}|s^{\prime}}. The subsequent steps of the proof are the same as in the proof of Lemma 2. ∎

Appendix D. Proof of Proposition 1: The proof in this section is analogous to the proof of Proposition 5.5 in [2]. Let T¯\bar{T} be the map in (III-C). The stochastic iterative updates in (III-C) can be re-written as Ψ¯t+1(xt,ut)=(1νt(xt,ut))Ψ¯t(xt,ut)+νt(xt,ut)([T¯Ψ¯t](xt,ut)+wt(xt,ut))\scriptstyle\bar{\Psi}_{t+1}(x_{t},u_{t})=(1-\nu_{t}(x_{t},u_{t}))\bar{\Psi}_{t}(x_{t},u_{t})+\nu_{t}(x_{t},u_{t})\big{(}[\bar{T}\bar{\Psi}_{t}](x_{t},u_{t})+w_{t}(x_{t},u_{t})\big{)} where wt(xt,ut)=cxtxt+1utγ2βlogaexp(βγΨ¯t(st+1,a))T¯Ψ¯t(xt,ut)\scriptstyle w_{t}(x_{t},u_{t})=c_{x_{t}x_{t+1}}^{u_{t}}-\frac{\gamma^{2}}{\beta}\log\sum_{a}\exp(-\frac{\beta}{\gamma}\bar{\Psi}_{t}(s_{t+1},a))-\bar{T}\bar{\Psi}_{t}(x_{t},u_{t}). Let t\mathcal{F}_{t} represent the history of the stochastic updates, i.e., t={Ψ¯0,,Ψ¯t,w0,,wt1,ν0,,νt},\scriptstyle\mathcal{F}_{t}=\{\bar{\Psi}_{0},\ldots,\bar{\Psi}_{t},w_{0},\ldots,w_{t-1},\nu_{0},\ldots,\nu_{t}\}, then 𝔼[wt(xt,ut)|t]=0\scriptstyle\mathbb{E}[w_{t}(x_{t},u_{t})|\mathcal{F}_{t}]=0 and 𝔼[wt2(xt,ut)|t]K(1+maxs,aΨ¯t2(s,a))\scriptstyle\mathbb{E}[w_{t}^{2}(x_{t},u_{t})|\mathcal{F}_{t}]\leq K(1+\max_{s,a}\bar{\Psi}_{t}^{2}(s,a)), where KK is a constant. These expressions satisfy the conditions on the expected value and the variance of wt(xt,ut)w_{t}(x_{t},u_{t}) which, along with the contraction property of T¯\bar{T}, guarantee the convergence of the stochastic updates (III-C), as illustrated in Proposition 4.4 in [2].
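As a minimal sketch (under the assumption of a tabular state-action estimate and a scalar step size), one such stochastic update can be written as follows; the function name is ours.

```python
import numpy as np
from scipy.special import logsumexp

def soft_q_update(Psi, s, a, cost, s_next, nu, beta, gamma):
    """One stochastic update of the state-action estimate, following
    Psi_{t+1}(x_t, u_t) = (1 - nu_t) Psi_t(x_t, u_t) + nu_t * (sampled target),
    where the sampled target is
    c - (gamma^2 / beta) * log sum_a' exp(-(beta / gamma) * Psi[s_next, a']).

    Psi: array (S, A) of current estimates (modified in place); nu: step size nu_t.
    """
    target = cost - (gamma ** 2 / beta) * logsumexp(-(beta / gamma) * Psi[s_next])
    Psi[s, a] = (1.0 - nu) * Psi[s, a] + nu * target
    return Psi
```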

Proof of Theorem 4: We show that the map T1T_{1} in (24) is a contraction map. For any KζsβK_{\zeta_{s}}^{\beta} and K¯ζsβ\bar{K}_{\zeta_{s}}^{\beta} we obtain that |[T1KζsβT1K¯ζsβ](s)|γa,s′′pss′′aμa|s|Kζsβ(s′′,a)K¯ζsβ(s′′,a)||[T_{1}K_{\zeta_{s}}^{\beta}-T_{1}\bar{K}_{\zeta_{s}}^{\beta}](s^{\prime})|\leq\gamma\sum_{a,s^{\prime\prime}}p_{s^{\prime}s^{\prime\prime}}^{a}\mu_{a|s^{\prime}}|K_{\zeta_{s}}^{\beta}(s^{\prime\prime},a)-\bar{K}_{\zeta_{s}}^{\beta}(s^{\prime\prime},a)|. Note that this inequality is similar to the one in (37); thus, we follow the exact same steps from (37) to (40) to show that T1KζsβT1K¯ζsβξγλKζsβK¯ζsβξ\|T_{1}K_{\zeta_{s}}^{\beta}-T_{1}\bar{K}_{\zeta_{s}}^{\beta}\|_{\xi}\leq\gamma\lambda\|K_{\zeta_{s}}^{\beta}-\bar{K}_{\zeta_{s}}^{\beta}\|_{\xi} and γλ<1\gamma\lambda<1.

Proof of Proposition 2: The proof in this section is similar to the proof of Proposition 1 in Appendix D. Additional conditions on the boundedness of the derivatives |cssaζl|\big{|}\frac{\partial c_{ss^{\prime}}^{a}}{\partial\zeta_{l}}\big{|} and |cssaηk|\big{|}\frac{\partial c_{ss^{\prime}}^{a}}{\partial\eta_{k}}\big{|} are required to bound the variance 𝔼[wt2|t]\mathbb{E}[w_{t}^{2}|\mathcal{F}_{t}].

References

  • [1] E. A. Feinberg and A. Shwartz, Handbook of Markov decision processes: methods and applications.   Springer Science & Business Media, 2012, vol. 40.
  • [2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-dynamic programming.   Athena Scientific Belmont, MA, 1996, vol. 5.
  • [3] A. Hordijk and L. Kallenberg, “Linear programming and markov decision chains,” Management Science, vol. 25, no. 4, pp. 352–362, 1979.
  • [4] Y. Abbasi-Yadkori, P. L. Bartlett, and A. Malek, “Linear programming for large-scale markov decision problems,” in JMLR Workshop and Conference Proceedings, no. 32.   MIT Press, 2014, pp. 496–504.
  • [5] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
  • [6] E. T. Jaynes, “Information theory and statistical mechanics,” Physical review, vol. 106, no. 4, p. 620, 1957.
  • [7] K. Rose, “Deterministic annealing, clustering, and optimization,” Ph.D. dissertation, California Institute of Technology, 1991.
  • [8] P. Sharma, S. Salapaka, and C. Beck, “A scalable approach to combinatorial library design for drug discovery,” Journal of chemical information and modeling, vol. 48, no. 1, pp. 27–41, 2008.
  • [9] J.-G. Yu, J. Zhao, J. Tian, and Y. Tan, “Maximal entropy random walk for region-based visual saliency,” IEEE transactions on cybernetics, vol. 44, no. 9, pp. 1661–1672, 2013.
  • [10] Y. Xu, S. M. Salapaka, and C. L. Beck, “Aggregation of graph models and markov chains by deterministic annealing,” IEEE Transactions on Automatic Control, vol. 59, no. 10, pp. 2807–2812, 2014.
  • [11] L. Chen, T. Zhou, and Y. Tang, “Protein structure alignment by deterministic annealing,” Bioinformatics, vol. 21, no. 1, pp. 51–62, 2005.
  • [12] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning.” in Aaai, vol. 8.   Chicago, IL, USA, 2008, pp. 1433–1438.
  • [13] H. V. Hasselt, “Double q-learning,” in Advances in Neural Information Processing Systems, 2010, pp. 2613–2621.
  • [14] R. Fox, A. Pakman, and N. Tishby, “Taming the noise in reinforcement learning via soft updates,” arXiv preprint arXiv:1512.08562, 2015.
  • [15] J. Grau-Moya, F. Leibfried, and P. Vrancx, “Soft q-learning with mutual-information regularization,” 2018.
  • [16] J. Peters, K. Mulling, and Y. Altun, “Relative entropy policy search,” in Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
  • [17] G. Neu, A. Jonsson, and V. Gómez, “A unified view of entropy-regularized markov decision processes,” arXiv preprint arXiv:1705.07798, 2017.
  • [18] K. Asadi and M. L. Littman, “An alternative softmax operator for reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70.   JMLR. org, 2017, pp. 243–252.
  • [19] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, “Bridging the gap between value and policy based reinforcement learning,” in Advances in Neural Information Processing Systems, 2017, pp. 2775–2785.
  • [20] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song, “Sbeed: Convergent reinforcement learning with nonlinear function approximation,” arXiv preprint arXiv:1712.10285, 2017.
  • [21] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning, 2015, pp. 1889–1897.
  • [22] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.
  • [23] A. Aguilar-Garcia, S. Fortes, M. Molina-García, J. Calle-Sánchez, J. I. Alonso, A. Garrido, A. Fernández-Durán, and R. Barco, “Location-aware self-organizing methods in femtocell networks,” Computer Networks, vol. 93, pp. 125–140, 2015.
  • [24] U. Siddique, H. Tabassum, E. Hossain, and D. I. Kim, “Wireless backhauling of 5g small cells: Challenges and solution approaches,” IEEE Wireless Communications, vol. 22, no. 5, pp. 22–31, 2015.
  • [25] G. Manganini, M. Pirotta, M. Restelli, L. Piroddi, and M. Prandini, “Policy search for the optimal control of markov decision processes: A novel particle-based iterative scheme,” IEEE transactions on cybernetics, vol. 46, no. 11, pp. 2643–2655, 2015.
  • [26] A. Srivastava and S. M. Salapaka, “Simultaneous facility location and path optimization in static and dynamic networks,” IEEE Transactions on Control of Network Systems, pp. 1–1, 2020.
  • [27] M. Mahajan, P. Nimbhorkar, and K. Varadarajan, “The planar k-means problem is np-hard,” in International Workshop on Algorithms and Computation.   Springer, 2009, pp. 274–285.
  • [28] A. A. Abin, “Querying beneficial constraints before clustering using facility location analysis,” IEEE transactions on cybernetics, vol. 48, no. 1, pp. 312–323, 2016.
  • [29] D. Huang, C.-D. Wang, and J.-H. Lai, “Locally weighted ensemble clustering,” IEEE transactions on cybernetics, vol. 48, no. 5, pp. 1460–1473, 2017.
  • [30] W. Shi, S. Song, and C. Wu, “Soft policy gradient method for maximum entropy deep reinforcement learning,” arXiv preprint arXiv:1909.03198, 2019.
  • [31] G. Xiang and J. Su, “Task-oriented deep reinforcement learning for robotic skill acquisition and control,” IEEE transactions on cybernetics, 2019.
  • [32] L. Xia and Q.-S. Jia, “Parameterized markov decision process and its application to service rate control,” Automatica, vol. 54, pp. 29–35, 2015.
  • [33] M. Hausknecht and P. Stone, “Deep reinforcement learning in parameterized action space,” arXiv preprint arXiv:1511.04143, 2015.
  • [34] W. Masson, P. Ranchod, and G. Konidaris, “Reinforcement learning with parameterized actions,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [35] E. Wei, D. Wicke, and S. Luke, “Hierarchical approaches for reinforcement learning in parameterized action space,” in 2018 AAAI Spring Symposium Series, 2018.
  • [36] J. Xiong, Q. Wang, Z. Yang, P. Sun, L. Han, Y. Zheng, H. Fu, T. Zhang, J. Liu, and H. Liu, “Parametrized deep q-networks learning: Reinforcement learning with discrete-continuous hybrid action space,” arXiv preprint arXiv:1810.06394, 2018.
  • [37] V. Narayanan and S. Jagannathan, “Event-triggered distributed control of nonlinear interconnected systems using online reinforcement learning with exploration,” IEEE transactions on cybernetics, vol. 48, no. 9, pp. 2510–2519, 2017.
  • [38] E. Çilden and F. Polat, “Toward generalization of automated temporal abstraction to partially observable reinforcement learning,” IEEE transactions on cybernetics, vol. 45, no. 8, pp. 1414–1425, 2014.
  • [39] E. T. Jaynes, Probability theory: The logic of science.   Cambridge university press, 2003.
  • [40] F. Biondi, A. Legay, B. F. Nielsen, and A. Wkasowski, “Maximizing entropy over markov processes,” Journal of Logical and Algebraic Methods in Programming, vol. 83, no. 5-6, pp. 384–399, 2014.
  • [41] Y. Savas, M. Ornik, M. Cubuktepe, and U. Topcu, “Entropy maximization for constrained markov decision processes,” in 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton).   IEEE, 2018, pp. 911–918.
  • [42] S. Ross, J. Pineau, B. Chaib-draa, and P. Kreitmann, “A bayesian approach for learning and planning in partially observable markov decision processes,” Journal of Machine Learning Research, vol. 12, no. May, pp. 1729–1770, 2011.
  • [43] L. P. Hansen, T. J. Sargent, G. Turmuhambetova, and N. Williams, “Robust control and model misspecification,” Journal of Economic Theory, vol. 128, no. 1, pp. 45–90, 2006.
  • [44] Z. Zhou, M. Bloem, and N. Bambos, “Infinite time horizon maximum causal entropy inverse reinforcement learning,” IEEE Transactions on Automatic Control, vol. 63, no. 9, pp. 2787–2802, 2017.
  • [45] K. Rawlik, M. Toussaint, and S. Vijayakumar, “Approximate inference and stochastic optimal control,” arXiv preprint arXiv:1009.3958, 2010.
  • [46] M. Ghavamzadeh, H. J. Kappen, M. G. Azar, and R. Munos, “Speedy q-learning,” in Advances in neural information processing systems, 2011, pp. 2411–2419.
  • [47] M. G. Bellemare, G. Ostrovski, A. Guez, P. S. Thomas, and R. Munos, “Increasing the action gap: New operators for reinforcement learning,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [48] G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and Z. Li, “Drn: A deep reinforcement learning framework for news recommendation,” in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 167–176.
  • [49] W. B. Knox and P. Stone, “Reinforcement learning from human reward: Discounting in episodic tasks,” in 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication.   IEEE, 2012, pp. 878–885.
  • [50] K. Kobayashi et al., Mathematics of information and coding.   American Mathematical Soc., 2007, vol. 203.
  • [51] P. Sharma, S. M. Salapaka, and C. L. Beck, “Entropy-based framework for dynamic coverage and clustering problems,” IEEE Transactions on Automatic Control, vol. 57, no. 1, pp. 135–150, 2011.
Amber Srivastava received the B.Tech degree in Mechanical Engineering from the Indian Institute of Technology, Kanpur, in 2014, after which he was employed with an FMCG company in India for a year (2014-2015) as an Assistant Manager. He obtained his Masters in Mathematics from the University of Illinois at Urbana-Champaign (UIUC) in 2020. Currently he is a PhD student in the Mechanical Science and Engineering Department of UIUC. His areas of interest are optimization, learning, and controls.
Srinivasa M. Salapaka received the B.Tech degree from IIT Madras in 1995 and the MS and PhD degrees in Mechanical Engineering from the University of California, Santa Barbara, in 1997 and 2002, respectively. From 2002 to 2004, he was a postdoctoral associate at the Massachusetts Institute of Technology. Since 2004, he has been a faculty member in the Mechanical Science and Engineering Department, University of Illinois at Urbana-Champaign. His areas of current research are controls for nanotechnology, combinatorial optimization, and power electronics.

Supplementary Material for CYB-E-2020-05-1237

Appendix A Link to Source Code

Please find the sample code for the model-free RL setting and for the parameterized MDP and RL setting at either of the links below:

  1. https://drive.google.com/drive/folders/1fVA3nM1Oh6drQpi3H50ZPTDppK8Tv5bX?usp=sharing

  2. https://uofi.box.com/s/twbo129bmst0e24ykgr1rhwwc8kyhe1o

Appendix B Simulations on the Double Chain Environment

In this section we apply our Algorithm 1 to determine the control policy μ\mu^{*} for the finite and infinite Shannon entropy variants of the double chain (DC) environment in Figure 5.

Figure 5: Finite and infinite entropy variants (respectively, cost and reward variants) of the Double Chain (DC) environment. State space 𝒮={0,1,,8}\mathcal{S}=\{0,1,\ldots,8\} and action space 𝒜={0,1}\mathcal{A}=\{0,1\}. Cost c(s,a)c(s,a) for version (a) is denoted in black and reward r(s,a)r(s,a) for version (b) is in red. The agent slips to the state s=0s^{\prime}=0 with probability 0.20.2 every time the action a=0a=0 is taken at a state s𝒮{4}s\in\mathcal{S}\setminus\{4\} in the finite entropy variant (and s𝒮s\in\mathcal{S} for the infinite entropy version (b)). The state s=4s=4 is the cost-free termination state for version (a).

Faster Convergence to Optimal JJ^{*}: Figures 6(a1)-(a4) (finite entropy variant of DC) and Figures 6(d1)-(d4) (infinite entropy variant of DC) illustrate the faster convergence of our MEP-based Algorithm 1 for different discount factor γ\gamma values. Here the learning error ΔV\Delta V at each episode is the gap between the value function VβμV_{\beta}^{\mu} corresponding to the policy μ=μ(ep)\mu=\mu(ep) learned in episode epep and the optimal value function JJ^{*}; that is,

ΔV(ep)=1Ni=1Ns𝒮|Vβ,iμ(ep)(s)J(s)|,\displaystyle\Delta V(ep)=\frac{1}{N}\sum_{i=1}^{N}\sum_{s\in\mathcal{S}}\big{|}V_{\beta,i}^{\mu(ep)}(s)-J^{*}(s)\big{|}, (41)

where NN denotes the total number of experimental runs and ii indexes the value function Vβ,iμV_{\beta,i}^{\mu} of each run. As observed in Figures 6(a1) and 6(a3), and Figures 6(d1) and 6(d3), our Algorithm 1 converges even faster as the discount factor γ\gamma decreases. We also characterize the faster convergence in terms of the convergence time, more precisely the critical episode E¯pr\bar{E}_{pr} beyond which the learning error ΔV\Delta V is within 5%5\% of the optimal JJ^{*} (see Figures 6(a5) and 6(d5)). As observed in the figures, the performance of our (MEP-based) algorithm improves further with decreasing γ\gamma values, since the stochastic policy μβ\mu_{\beta}^{*} adapts the amount of exploration required along the paths to the γ\gamma value, whereas the other algorithms deteriorate in performance as γ\gamma decreases.
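For reference, the learning error in (41) can be computed as in the short sketch below; the array layout is an assumption.

```python
import numpy as np

def learning_error(V_runs, J_star):
    """Delta V(ep) in (41): the average over N runs of the summed absolute gap
    between the learned value function and the optimal J*.

    V_runs: array (N, S) of value functions V_{beta,i}^{mu(ep)} at episode ep
    J_star: array (S,) holding the optimal value function J*
    """
    return float(np.mean(np.sum(np.abs(np.asarray(V_runs) - np.asarray(J_star)), axis=1)))
```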

Figure 6: Performance of the MEP-based algorithm: illustrations on the Double Chain environment in Figure 5. (a1)-(a4) Finite entropy variant: illustrate the faster convergence of Algorithm 1 (MEP) at different γ\gamma values. (a5) Demonstrates faster and increasing rates of convergence of Algorithm 1 (MEP) with decreasing γ\gamma values. (b1)-(b3) Illustrate the higher variance of Soft Q in comparison to MEP. (b4)-(b5) MEP exhibits larger Shannon entropy (equivalently, more exploration) during early episodes and smaller Shannon entropy (equivalently, more exploitation) in later episodes; together these are responsible for the faster convergence of MEP illustrated in Figs. 6(a1)-(a5). (c1)-(c5) Cost version (with added Gaussian noise): similar observations as in (a1)-(a5), with significantly higher instability in learning with the Double Q algorithm. (d1)-(d5) Infinite entropy variant: demonstrates the faster convergence of Algorithm 1 (MEP) to JJ^{*} without noise in (d1),(d3) and with added noise in (d2),(d4). (d5) Illustrates the consistently faster convergence rates of MEP across different γ\gamma values.

Smaller Variance: Figures 6(b1)-(b3) compare the variance observed in the learning process during the last 4040 episodes of the NN experimental runs. As demonstrated, the Soft Q algorithm (Fig. 6(b2)) exhibits higher variance when compared to the MEP-based algorithm (Fig. 6(b1)). Figure 6(b3) plots the average variances of all the algorithms over the last 4040 episodes of the learning process for different values of γ\gamma. As illustrated, Soft Q and Q learning exhibit higher variances across all γ\gamma values in comparison to our algorithm and Double Q learning. Between ours and Double Q, the variance of our algorithm is smaller at lower values of γ\gamma and vice versa.

Inherently better exploration and exploitation: The Shannon entropy HμH^{\mu} corresponding to our algorithm is higher during the initial episodes, indicating better exploration under the learned policy μ\mu when compared to the other algorithms, as seen in Figures 6(b4)-(b5). Additionally, it exhibits smaller HμH^{\mu} in the later episodes, indicating a more exploitative learned policy μ\mu. The unbiased policies resulting from our algorithm, along with this enhanced exploration-exploitation trade-off, result in the faster convergence rates and smaller variance discussed above.

Robustness to noise in data: Figures 6(c1)-(c5) demonstrate robustness to noisy environments; here the instantaneous cost c(s,a)c(s,a) in the finite entropy variant of DC is noisy. For the purpose of simulations, we add Gaussian noise 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}) with σ=5\sigma=5 when a=1,s𝒮a=1,s\in\mathcal{S} or a=0,s=8a=0,s=8, and σ=10\sigma=10 otherwise. Similar to our observations and conclusions in Figures 6(a1)-(a4), Figures 6(d1)-(d2) (γ=0.8\gamma=0.8) and Figures 6(d3)-(d4) (γ=0.6\gamma=0.6) illustrate the faster convergence of our MEP-based algorithm. Also, similar to Figure 6(a5), Figure 6(d5) exhibits higher convergence rates (i.e., lower E¯pr\bar{E}_{pr}) for all γ\gamma values in comparison to the benchmark algorithms. In the infinite entropy variant of DC, results of a similar nature are obtained upon adding the noise 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}), as illustrated in Figures 6(d2) and 6(d4).

Appendix C Comparison with MIRL [15] and REPS [16]

In this section we demonstrate the benefit of considering distributions over the paths of an MDP rather than distributions over the actions. Figures 7(a1)-(a4) compare the mutual information regularization (MIRL) algorithm presented in [15] (where the mutual information regularizer is discounted with the same discount factor as the instantaneous cost) with the mutual information of the distribution over the paths of the MDP (our MEP-based idea, which results in no discount factor on the regularizer term). Note the faster convergence in the ΔVJ\frac{\Delta V}{J^{*}} versus epep plots in Figures 7(a1)-(a3). In Figure 7(a4) we demonstrate the faster convergence rates (E¯pr\bar{E}_{pr} versus γ\gamma) obtained when considering the distribution over the paths of the MDP.

Figure 7: (b1)-(b2) Effect of a noisy environment on the Relative Entropy Policy Search (REPS) method in [16]. For the same number of interactions with the noisy Gridworld environment and the same amount of greedy exploration, the REPS method diverges towards the final episodes of the learning process, whereas the MEP-based algorithm does not diverge. Similar results are seen when comparing REPS with entropy regularized (Soft Q) learning.

In Figures 7(b1)-(b2) we compare the performance of the relative entropy policy search (REPS) algorithm [16] to our MEP-based algorithm in the case of a noisy environment. For the purpose of simulation we consider the infinite entropy variant of the double chain environment in Figure 5. We add Gaussian noise 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}) with σ=10\sigma=10 when s=4s=4, a=0a=0, σ=5\sigma=5 when s=8s=8, a=0a=0, σ=2\sigma=2 when s0s\neq 0, a=1a=1, and σ=1\sigma=1 otherwise. We ensure fairness in the comparison by allowing an equal number of interactions with the environment for both algorithms. As illustrated in Figure 7(b1), the algorithm in [16] diverges during the last stages of learning owing to the noise in the environment, whereas our MEP-based algorithm performs better in the same environment.

Appendix D Design of Self Organizing Networks

Here, we consider the scenario of a self-organizing network responsible for providing good WiFi access at all the relevant locations in a hotel lobby, where the number of routers and the location of the modem are pre-specified. Figure 8(a) illustrates the floor plan of a hotel lobby with areas such as the reception, the manager's office, and the cyber space clearly marked. We distribute user nodes over the provided floor plan based on the number of WiFi users in each of these areas, and assume the modem to be located near the reception area. Figure 8(b) shows the user node and modem locations on a 2D graph. The objective is to allocate 66 routers and design communication routes from the modem to each user via the routers that maximize the cumulative signal strength at the user nodes; we design this network both in the model-based scenario and in the model-free setting. For the purpose of simulation we assume that the signal strength is inversely proportional to the square of the Euclidean distance (Qin, Qiaofeng, et al., "SDN controller placement with delay-overhead balancing in wireless edge networks," IEEE Transactions on Network and Service Management, vol. 15, no. 4, pp. 1446-1459, 2018). Figure 8(c) shows the allocated routers and the communication routes in the network for the model-based scenario, and Figure 8(d) shows those for the model-free setting.
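A minimal sketch of the assumed signal-strength model is shown below; the function name and the small constant added to avoid division by zero are ours.

```python
import numpy as np

def signal_strength(p1, p2, eps=1e-6):
    """Assumed model: signal strength inversely proportional to the squared
    Euclidean distance between transmitter and receiver."""
    return 1.0 / (float(np.sum((np.asarray(p1) - np.asarray(p2)) ** 2)) + eps)

# Example: signal received at a user 3 m away from a router located at the origin
print(signal_strength([0.0, 0.0], [3.0, 0.0]))   # approximately 1/9
```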

Figure 8: Design of Self Organizing Networks. (a) Floor plan of a hotel lobby; areas requiring internet access are marked with appropriate symbols. (b) 2D schematic of the floor plan with the user and modem locations marked. (c) Network design with router locations and routing; the cost function is the squared Euclidean distance. (d) Network design with router locations (parameters) and routing (control policy) in the model-free RL setting.

Appendix E Choice of annealing parameters σ\sigma, τ\tau

Algorithm 1: We follow a linear schedule \beta_{k}=\sigma k (\sigma>0) in our simulations in the model-free RL setting, as suggested for the entropy regularized (soft Q) benchmark algorithm in [14]. The idea is to anneal the parameter \beta from a small value (at which the policy \mu_{\beta}^{*} in (9) is explorative) to a large value (at which the policy becomes exploitative). As suggested in [14], the parameter \sigma can be obtained by performing initial runs (for a small number of iterations) with different values of \sigma and picking the one that results in the lowest value function corresponding to the learned policy. The value of \sigma identified for a particular domain can then be re-used in similar domains without the need to perform any initial runs.
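A minimal sketch of this selection procedure is given below; train_policy and evaluate_value are hypothetical placeholders standing in for the learning and evaluation routines of Algorithm 1.

def beta_schedule(sigma, num_steps):
    # Linear annealing schedule beta_k = sigma * k, k = 1, ..., num_steps.
    return [sigma * k for k in range(1, num_steps + 1)]

def pick_sigma(candidate_sigmas, pilot_steps, train_policy, evaluate_value):
    # Run short pilot trainings with each candidate sigma and keep the one
    # whose learned policy attains the lowest value function (as in [14]).
    best_sigma, best_value = None, float("inf")
    for sigma in candidate_sigmas:
        policy = train_policy(beta_schedule(sigma, pilot_steps))
        value = evaluate_value(policy)
        if value < best_value:
            best_sigma, best_value = sigma, value
    return best_sigma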

Algorithms 2 and 3: The parameter \tau is referred to as the annealing rate and is chosen to be greater than 1. The resulting schedule \beta_{k+1}=\tau\beta_{k} geometrically anneals the Lagrange parameter \beta from a small value \beta_{\min}\approx 0 to a large value \beta_{\max} at which the control policy \mu_{\beta}^{*} in (9) converges to 0 or 1. In the case of parameterized MDPs, this facilitates a homotopy from the convex function H^{\mu}(s) to the (possibly) non-convex J^{\mu}_{\zeta\eta}(s) in (22), and prevents the algorithm from getting stuck in poor local minima. For the simulations in Figure 4 we use \tau\in(1.01,1.04).

A practical method to determine an appropriate \tau is as follows. Start with a large estimate, for instance \tau\approx 1.5. If the parameter values obtained in the initial iterations of the algorithm oscillate considerably, decrease \tau; choose the value of \tau at which the observed parameter values stop oscillating over the initial iterations. This practical method is rooted in the phenomenon of phase transition that Algorithms 2 and 3 undergo, which we illustrate further below.
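The sketch below illustrates this heuristic. Here run_initial_iterations is a hypothetical placeholder for a short run of Algorithm 2 or 3 that returns the sequence of parameter estimates, and the oscillation measure (dispersion of successive changes) is one possible choice, not the paper's prescription.

import numpy as np

def oscillation(param_history):
    # Crude oscillation measure: dispersion of successive parameter changes.
    diffs = np.diff(np.asarray(param_history), axis=0)
    return float(np.linalg.norm(np.std(diffs, axis=0)))

def pick_tau(run_initial_iterations, tau=1.5, shrink=0.9, tol=1e-2, tau_min=1.001):
    # Start with a large tau and reduce it until the parameter iterates from a
    # short initial run no longer oscillate beyond the tolerance.
    while tau > tau_min:
        history = run_initial_iterations(tau)   # e.g., a few dozen iterations
        if oscillation(history) < tol:
            return tau
        tau = 1.0 + shrink * (tau - 1.0)        # anneal slower, keeping tau > 1
    return tau_min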

Algorithms 2 and 3 address the class of parameterized MDPs, in which an unknown parameter and the control policy are determined simultaneously. These problems are akin to the Facility Location Problem (FLP) addressed in [7]. In particular, [7] develops a Maximum Entropy Principle framework to determine the location of the facilities (the parameter) and to associate each user node to the closest facility (the control policy). In the resulting algorithm, popularly known as Deterministic Annealing (DA), it is observed that the solution changes significantly only at certain critical values \beta=\beta_{cr} that correspond to instances of phase transition; at other values of \beta, the solution does not change much [51]. It has been observed that a geometric law \beta_{k+1}=\tau\beta_{k} with \tau=1+\epsilon>1 to anneal \beta suffices to capture the changes in the solution that occur during the phase transitions. Thus, with reference to the above method for choosing \tau, if the initial iterations exhibit high variation in the parameter values, then some phase transitions are possibly being skipped, and the user needs to anneal more slowly (i.e., reduce \tau) to capture all the phase transitions.
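As a worked example of how \tau sets the length of the annealing (the numerical values below are illustrative, not taken from the paper's simulations), the number of geometric annealing steps from \beta_{\min} to \beta_{\max} is

K = \lceil \log(\beta_{\max}/\beta_{\min}) / \log\tau \rceil,

so, for instance, \beta_{\min}=10^{-2}, \beta_{\max}=10^{2} and \tau=1.02 give K = \lceil \log 10^{4}/\log 1.02 \rceil = 466 iterations; reducing \tau to anneal more slowly increases K correspondingly.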

Appendix F List of Symbols

x_{t} : state at time t
u_{t} : action at time t
\mathcal{S} : state space
\mathcal{A} : action space
\mu : control policy
p(s'|s,a) : state transition probability
\delta : cost-free termination state
\gamma : discount factor
J^{\mu}(s) : value function under policy \mu
\mu^{*} : optimal control policy
\omega : path of the MDP
p_{\mu}(\omega|s) : distribution over the paths
\Gamma : set of proper policies
H^{\mu}(s) : Shannon entropy of the distribution \{p_{\mu}(\omega|s)\}
\beta : annealing (Lagrange) parameter
J_{0} : constant value
V_{\beta}^{\mu}(s) : Lagrangian or free energy
c_{x_{t}x_{t+1}}^{u_{t}} : short for c(x_{t},u_{t},x_{t+1})
\mu_{u_{t}|x_{t}} : short for \mu(u_{t}|x_{t})
p_{x_{t}x_{t+1}}^{u_{t}} : short for p(x_{t+1}|x_{t},u_{t})
\bar{c}_{ss'}^{a} : short for c(s,a,s') + \frac{\gamma}{\beta}\log p(s'|s,a)
\lambda_{s} : Lagrange parameter in Lemma 2
\mu_{\beta}^{*}(a|s) : optimal policy under MEP
\Lambda_{\beta}(s,a) : state-action value function
V_{\beta}^{*} : optimal value function
\xi : defines weighted norm for contraction mapping
\|\cdot\|_{\xi} : weighted norm
\alpha : contraction constant
\Psi_{t+1} : estimate of \Lambda_{\beta} at time t
\nu_{t}(x_{t},u_{t}) : learning rate at time t
\beta_{\min} : minimum value of \beta
\beta_{\max} : maximum value of \beta
N : number of episodes
\tau : annealing rate for \beta
H_{d}^{\mu}(s) : discounted Shannon entropy
\alpha^{t} : discount at time t for H_{d}^{\mu}
V_{\beta,I}^{\mu}(s) : free energy for the infinite entropy case of the MDP
\hat{c}_{x_{t}x_{t+1}}^{u_{t}} : short for c_{x_{t}x_{t+1}}^{u_{t}} + \frac{\gamma^{t}}{\beta\alpha^{t}}\log p_{x_{t}x_{t+1}}^{u_{t}}
\mu_{\beta,I}^{*} : optimal policy for infinite entropy MDP
\Phi_{\beta}(s,a) : state-action value function
V_{\beta,I}^{*} : optimal free-energy function
\zeta=\{\zeta_{s}\} : set of unknown state parameters
\eta=\{\eta_{a}\} : set of unknown action parameters
J_{\zeta\eta}^{\mu} : value function for parameterized MDPs
x_{t}(\zeta) : state at time t with parameter \zeta_{x_{t}}
u_{t}(\eta) : action at time t with parameter \eta_{u_{t}}
\mu_{\beta,\zeta\eta}^{*} : optimal control policy
V_{\beta,\zeta\eta}^{*} : optimal value function
G_{\zeta_{s}}^{\beta}(s') : derivative \partial V_{\beta,\zeta\eta}^{*}(s')/\partial\zeta_{s}
G_{\eta_{a}}^{\beta}(s') : derivative \partial V_{\beta,\zeta\eta}^{*}(s')/\partial\eta_{a}
K_{\zeta_{s}}^{\beta}(s',a') : derivatives such that G_{\zeta_{s}}^{\beta}(s') = \sum_{a'}\mu_{a'|s'}K_{\zeta_{s}}^{\beta}(s',a')
L_{\eta_{a}}^{\beta}(s',a') : derivatives such that G_{\eta_{a}}^{\beta}(s') = \sum_{a'}\mu_{a'|s'}L_{\eta_{a}}^{\beta}(s',a')
K_{\zeta_{s}}^{t+1}(x_{t},u_{t}) : learned estimate of K_{\zeta_{s}} at time t
B, C : finite constants