Parameterized MDPs and Reinforcement Learning Problems - A Maximum Entropy Principle Based Framework
Abstract
We present a framework to address a class of sequential decision making problems. Our framework features learning the optimal control policy with robustness to noisy data, determining the unknown state and action parameters, and performing sensitivity analysis with respect to problem parameters. We consider two broad categories of sequential decision making problems modeled as infinite horizon Markov Decision Processes (MDPs) with (and without) an absorbing state. The central idea underlying our framework is to quantify exploration in terms of the Shannon Entropy of the trajectories under the MDP and to determine the stochastic policy that maximizes it while guaranteeing a low value of the expected cost along a trajectory. The resulting policy enhances the quality of exploration early on in the learning process, and consequently allows faster convergence rates and robust solutions even in the presence of noisy data, as demonstrated in our comparisons to popular algorithms such as Q-learning, Double Q-learning, and entropy regularized Soft Q-learning. The framework extends to the class of parameterized MDP and RL problems, where states and actions are parameter dependent, and the objective is to determine the optimal parameters along with the corresponding optimal policy. Here, the associated cost function can possibly be non-convex with multiple poor local minima. Simulation results applied to a 5G small cell network problem demonstrate successful determination of communication routes and the small cell locations. We also obtain sensitivity measures to problem parameters and demonstrate robustness to noisy environment data.
I Introduction
Markov Decision Processes (MDPs) model sequential decision making problems which arise in many application areas such as robotics, sensor networks, economics, and manufacturing. These models are characterized by the state-evolution dynamics, a control policy that allocates an action from a control set to each state, and a cost associated with each state transition. The goal in these applications is to determine the optimal control policy that results in a path, a sequence of actions and states, with minimum cumulative cost. There are many variants of this problem [1]: the dynamics can be defined over finite or infinite horizons; the state dynamics can be stochastic; the models for the state dynamics may be partially or completely unknown; and the cost function may not be known a priori, albeit the cost at each step is revealed at the end of each transition. Some of the most common methodologies that address MDPs include dynamic programming, value and policy iteration [2], linear programming [3, 4], and Q-learning [5].
In this article, we view MDPs and their variants as combinatorial optimization problems, and develop a framework based on the maximum entropy principle (MEP) [6] to address them. MEP has proved successful in addressing a variety of combinatorial optimization problems such as facility location problems [7], combinatorial drug discovery [8], the traveling salesman problem and its variants [7], image processing [9], graph and Markov chain aggregation [10], and protein structure alignment [11]. MDPs, too, can be viewed as combinatorial optimization problems, owing to the combinatorially large number of paths (sequences of consecutive states and actions) that the system may take under the control policy and its inherent stochasticity. In our MEP framework, we determine a probability distribution defined on the space of paths [12], such that (a) it is the fairest distribution, i.e., the one with the maximum Shannon Entropy, and (b) it satisfies the constraint that the expected cumulative cost attains a prespecified feasible value. The framework results in an iterative scheme, an annealing scheme, in which probability distributions are improved upon by successively lowering the prespecified cost values. In fact, the Lagrange multiplier corresponding to the cost constraint in the unconstrained Lagrangian is increased from small values (near 0) to large values to effect annealing. Higher values of the multiplier correspond to lower values of the expected cost. We show that as the multiplier value increases, the corresponding probability distributions become more localized, finally converging to a deterministic policy.
This framework is applicable to all the classes of MDPs and its variants described above. Our MEP based approach inherits the flexibility of algorithms such as deterministic annealing [7] developed in the context of combinatorial resource allocation, which include adding capacity, communication, and dynamic constraints. The added advantage of our approach is that we can draw close parallels to existing algorithms for MDPs and RL (e.g. Q-Learning) – thus enabling us to exploit their algorithmic insights. Below we highlight main contributions and advantages of our approach.
Exploration and Unbiased Policy: In the model-free RL setting, the algorithms interact with the environment via agents and rely upon the instantaneous cost (or reward) generated by the environment to learn the optimal policy. Some of the popular algorithms include Q-learning [5], Double Q-learning [13], and Soft Q-learning (entropy regularized Q-learning) [14, 15, 16, 17, 18, 19, 20] in discrete state and action spaces, and Trust Region Policy Optimization (TRPO) [21] and Soft Actor Critic (SAC) [22] in continuous spaces. It is commonly known that for the above algorithms to perform well, all relevant states and actions should be explored. In fact, under the assumption that each state-action pair is visited multiple times during the learning process, the above discrete space algorithms [5, 13, 14, 15] are guaranteed to converge to the optimal policy. Thus, adequate exploration of the state and action spaces is essential to the success of these algorithms in determining the optimal policy. Often the instantaneous cost is noisy [14], which hinders the learning process and demands enhanced quality of exploration.
In our MEP based approach, the Shannon Entropy of the probability distribution over the paths of the MDP explicitly characterizes the exploration. The framework results in a distribution over the paths that is as unbiased as possible under the given cost constraint. The corresponding stochastic policy is maximally noncommittal to any particular path of the MDP that achieves the constraint; this results in better (unbiased) exploration. The policy starts out entirely explorative, when the multiplier value is small (near zero), and becomes increasingly exploitative as the multiplier value increases.
Parameterized MDPs and RL: This class of optimization problems is not even necessarily Markovian, which contributes significantly to its inherent complexity. However, we model these problems in a specific way to retain the Markov property without any loss of generality, thereby making them tractable. Scenarios such as self organizing networks [23], 5G small cell network design [24, 25], supply chain networks, and last mile delivery problems [26] pose a parameterized MDP with a two-fold objective of simultaneously determining (a) the optimal control policy for the underlying stochastic process, and (b) the unknown parameters that the state and action variables depend upon, such that the cumulative cost is minimized. The latter objective is akin to the facility location problem [27, 28, 29], which is shown to be NP-hard [27], and whose associated (non-convex) cost function is riddled with multiple poor local minima.

For instance, Figure 1 illustrates a 5G small cell network, where the objective is to simultaneously determine the locations of the small cells and design the communication paths (control policy) between the user nodes and the base station via the network of small cells. Here, the state space of the underlying MDP is parameterized by the locations of the small cells.
Algebraic structure and Sensitivity Analysis: In our framework, maximization of the Shannon entropy of the distribution over paths under a constraint on the cost function value results in an unconstrained Lagrangian, the free-energy function. This function is a smooth approximation of the cumulative cost function of the MDP, which enables the use of calculus. We exploit this distinctive feature of our framework to determine the unknown state and action parameters in the case of parameterized MDPs and to perform sensitivity analysis with respect to various problem parameters. Also, the framework easily accommodates stochastic models that describe uncertainties in the instantaneous cost and parameter values.
Algorithmic Guarantees and Innovation: For the classes of MDPs that we consider, our MEP-based framework results in non-trivial derivations of the recursive Bellman equation for the associated Lagrangian. We show that these Bellman operators are contraction maps and use several of their properties to guarantee convergence to the optimal policy, as well as to a local minimum in the case of parameterized MDPs.
In the context of model-free RL, we provide comparisons with the benchmark algorithms Q-learning, Double Q-learning, and entropy regularized G-learning [14] (also referred to as Soft Q-learning). Our algorithms converge at a faster rate than the benchmark algorithms across various values of the discount factor, and even in the case of noisy environments. In the context of parameterized MDPs and RL, we address the small-cell network design problem in 5G communication. Here the parameters are the unknown locations of the small cells and the control policy determines the routing of the communication packets. Upon comparison with the sequential method of first determining the unknown parameters (small cell locations) and then the control policy (equivalently, the communication paths), we show that our algorithms result in substantially lower costs. The efficacy of our algorithms can be assessed from the fact that the solutions in the model-based and model-free cases are nearly the same. We also demonstrate sensitivity analysis, the benefits of annealing, and the benefits of considering the entropy of the distribution over paths in our simulations on parameterized MDPs and RL.
This paper is organized as follows. We briefly review the related work and MEP [6] in Section II. In Sections III and IV we develop the MEP-based framework for MDPs. Section V builds upon Section III to address the case of parameterized MDPs and RL problems. Simulations on a variety of scenarios are presented in Section VI. We discuss the generality of our framework, its capabilities, and future directions of this work in Section VII. For ease of reading, we provide a comprehensive list of symbols in Section F of the supplementary material.
II Preliminaries
Related Work in Entropy Regularization: Some of the previous works in the RL literature [14, 15, 16, 17, 18, 19, 20, 30, 31] either use the entropy of the policy as a regularization term added to the instantaneous cost function [14, 15] or maximize the entropy associated only with the stochastic policy under constraints on the cost [16, 17, 18]. This results in benefits such as better exploration, overcoming the effect of noise in the instantaneous cost, and faster convergence. However, the resulting stochastic policy and the soft-max approximation of the value function are not in compliance with the Maximum Entropy Principle applied to the distribution over the paths of the MDP. Thus, the resulting stochastic policy is biased in its exploration over the paths of the MDP. Our simulations demonstrate the benefit of unbiased exploration (in our framework) in terms of faster convergence and better performance in noisy environments in comparison to the entropy regularized benchmark algorithm.
Related Work in Parameterized MDPs and RL: The existing solution approaches [3, 2, 4] can be extended to parameterized MDPs by discretizing the parameter domain. However, the resulting problem is not necessarily an MDP, since every transition from one state to another depends on the path (and the parameter values) taken up to the current state. Other related approaches for parameterized MDPs are case specific; for instance, [32] presents an action-based parameterization of the state space with application to service rate control in closed Jackson networks, and [33, 34, 35, 36, 37, 38] incorporate parameterized actions that are applicable in the domain of RoboCup soccer, where at each step the agent must select both the discrete action it wishes to execute and the continuously valued parameters required by that action. On the other hand, the class of parameterized MDPs that we address in this article predominantly originates in network based applications that involve simultaneous routing and resource allocation and pose additional challenges of non-convexity and NP-hardness. We address these MDPs both when the underlying model is known and when it is unknown.
Maximum Entropy Principle: We briefly review the Maximum Entropy Principle (MEP) [6], since our framework relies heavily upon it. MEP states that, given prior information on a random variable, the most unbiased probability distribution is the one that maximizes the Shannon entropy. More specifically, let the known prior information on a random variable $X$, taking values in $\{x_1, \ldots, x_N\}$, be given as constraints on the expectations of the functions $f_k(X)$, $1 \le k \le m$. Then, the most unbiased probability distribution $\{p_i\}$, with $p_i := \Pr(X = x_i)$, solves
$$\max_{\{p_i\}} \; H(p) := -\sum_{i=1}^{N} p_i \ln p_i \quad \text{subject to} \quad \sum_{i=1}^{N} p_i f_k(x_i) = F_k, \;\; 1 \le k \le m, \quad \sum_{i=1}^{N} p_i = 1, \qquad (1)$$
where $F_k$, $1 \le k \le m$, are the known expected values of the functions $f_k$. The above optimization problem results in the Gibbs distribution [39] $p_i \propto \exp\{-\sum_{k=1}^{m} \lambda_k f_k(x_i)\}$, where $\lambda_k$, $1 \le k \le m$, are the Lagrange multipliers corresponding to the constraints in (1).
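To make the principle concrete, the short Python sketch below (our own illustration, not code from the paper) evaluates the Gibbs distribution on a small discrete support for a few values of a single Lagrange multiplier and reports the corresponding expected value of the constraint function; the function name `gibbs` and the numerical values are ours.

```python
import numpy as np

def gibbs(f_values, lam):
    """Gibbs distribution p_i proportional to exp(-lam * f(x_i)) on a discrete support."""
    w = np.exp(-lam * np.asarray(f_values, dtype=float))
    return w / w.sum()

f = np.array([1.0, 2.0, 3.0, 4.0])   # constraint function f(x_i) on a 4-point support
for lam in [0.0, 0.5, 2.0]:
    p = gibbs(f, lam)
    # lam = 0 recovers the uniform (maximum entropy) distribution;
    # larger lam lowers the achieved expected value E_p[f].
    print(lam, p.round(3), float(p @ f))
```

In practice the multiplier is adjusted (for example, by bisection) until the achieved expected value matches the prescribed constant $F_k$.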
III MDPs with Finite Shannon Entropy
III-A Problem Formulation
We consider an infinite horizon discounted MDP that comprises a cost-free termination state $\delta$. We formally define this MDP as a tuple $(\mathcal{S}, \mathcal{A}, c, p, \gamma)$, where $\mathcal{S}$, $\mathcal{A}$, and $c : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ respectively denote the state space, action space, and cost function; $p(s'|s,a)$ is the state transition probability function and $\gamma \in (0,1)$ is the discounting factor. A control policy $\pi = \{\pi(a|s)\}$ determines the action taken at each state $s$, where $\pi(a|s) = 1$ implies that action $a$ is taken when the system is in state $s$ and $\pi(a|s) = 0$ indicates otherwise. For every initial state $s_0$, the MDP induces a stochastic process whose realization is a path, an infinite sequence of actions and states, that is
$$\omega = \{s_0, a_0, s_1, a_1, s_2, a_2, \ldots\}, \qquad (2)$$
where $a_t \in \mathcal{A}$ and $s_{t+1} \sim p(\cdot|s_t, a_t)$ for all $t \ge 0$, and $s_t = \delta$ with $c(s_t, a_t, s_{t+1}) = 0$ for all $t \ge \tau$ if and when the system reaches the termination state $\delta$ in $\tau$ steps. The objective is to determine the optimal policy that minimizes the state value function
$$V^{\pi}(s_0) := \mathbb{E}_{\mathbb{P}^{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t, s_{t+1})\Big], \qquad (3)$$
where the expectation is with respect to the probability distribution $\mathbb{P}^{\pi}$ on the space of all possible paths $\omega$. In order to ensure that the system reaches the cost-free termination state in a finite number of steps and that the optimal state value function is finite, we make the following assumption throughout this section.
Assumption 1.
There exists at least one deterministic proper policy; that is, under this policy there is a non-zero probability of reaching the cost-free termination state $\delta$ in a finite number of steps, when starting from any state $s \in \mathcal{S}$.
We consider the following set of stochastic policies
$$\Delta := \Big\{\pi : \pi(a|s) > 0 \;\; \forall\, a \in \mathcal{A},\, s \in \mathcal{S}, \;\; \text{and} \;\; \sum_{a \in \mathcal{A}} \pi(a|s) = 1 \;\; \forall\, s \in \mathcal{S}\Big\}, \qquad (4)$$
and the following lemma ensures that, under Assumption 1, all policies $\pi \in \Delta$ are proper; that is,
Lemma 1.
For any policy $\pi \in \Delta$ as defined in (4), the probability of reaching the termination state $\delta$ in a finite number of steps, beginning from any state $s \in \mathcal{S}$, is strictly positive.
Proof. Please refer to Appendix A.
We use the Maximum Entropy Principle to determine the policy $\pi$ such that the Shannon Entropy of the corresponding path distribution $\mathbb{P}^{\pi}$ is maximized and the state value function attains a specified value $V_0$. More specifically, we pose the following optimization problem
$$\max_{\pi \in \Delta} \;\; H^{\pi}(s_0) := -\sum_{\omega} \mathbb{P}^{\pi}(\omega) \ln \mathbb{P}^{\pi}(\omega) \quad \text{subject to} \quad V^{\pi}(s_0) = V_0. \qquad (5)$$
Well posedness: For the class of proper policies, the maximum entropy is finite, as shown in [40, 41]. In short, the existence of a cost-free termination state and a non-zero probability of reaching it from any state ensure that the maximum entropy is finite. Please refer to Theorem 1 in [40] or Proposition 2 in [41] for further details.
Remark 1.
Though the optimization problem in (5) considers stochastic policies $\pi \in \Delta$, our algorithms presented in the later sections are designed such that the resulting stochastic policy asymptotically converges to a deterministic policy.
III-B Problem Solution
The probability $\mathbb{P}^{\pi}(\omega)$ of taking the path $\omega$ in (2) can be determined from the underlying policy by exploiting the Markov property, which dissociates $\mathbb{P}^{\pi}(\omega)$ into the policy and the state transition probability as
$$\mathbb{P}^{\pi}(\omega) = \prod_{t=0}^{\infty} \pi(a_t|s_t)\, p(s_{t+1}|s_t, a_t). \qquad (6)$$
Thus, in our framework we prudently work with the policy $\pi$, which is defined over finite action and state spaces, as against the distribution $\mathbb{P}^{\pi}$ defined over infinitely many paths $\omega$. The Lagrangian corresponding to the optimization problem in (5) is
$$F_{\beta}^{\pi}(s_0) := V^{\pi}(s_0) - \frac{1}{\beta}\, H^{\pi}(s_0), \qquad (7)$$
where $\beta$ is the Lagrange parameter. Here we have not included the constant value $V_0$ in the Lagrangian for simplicity. We refer to the Lagrangian in (7) as the free-energy function and to $T := 1/\beta$ as the temperature, due to their close analogies with statistical physics (where the free energy is the enthalpy $E$ minus the temperature $T$ times the entropy $H$). To determine the optimal policy that minimizes the Lagrangian in (7), we first derive the Bellman equation for $F_{\beta}^{\pi}$.
Theorem 1.
The free-energy function in (7) satisfies the following recursive Bellman equation
(8)
where we use shorthand notation for brevity, and where the additional function on the right-hand side does not depend on the policy $\pi$.
Proof. Please refer to Appendix A for details. It must be noted that this derivation reveals and exploits the algebraic structure detailed in Lemma 2 in the appendix.
Now the optimal policy satisfies the stationarity condition $\partial F_{\beta}^{\pi}/\partial \pi(a|s) = 0$ (subject to normalization), which results in the Gibbs distribution
$$\pi_{\beta}(a|s) = \frac{\exp\{-\beta\, \Lambda_{\beta}(s,a)\}}{\sum_{a' \in \mathcal{A}} \exp\{-\beta\, \Lambda_{\beta}(s,a')\}}, \qquad (9)$$
(10)
where $\Lambda_{\beta}(s,a)$ in (10) is the state-action value function and $F_{\beta}$ is the free-energy (value) function corresponding to the policy $\pi_{\beta}$. To avoid notational clutter we use these shorthands wherever they are clear from the context. Substituting the policy in (9) back into the Bellman equation (8), we obtain the implicit equation
$$F_{\beta}(s) = -\frac{1}{\beta} \ln \sum_{a \in \mathcal{A}} \exp\{-\beta\, \Lambda_{\beta}(s,a)\}. \qquad (11)$$
Without loss of generality, we ignore the policy-independent term in (8), since it does not affect the policy, as seen from equation (9). To solve for the state-action value function $\Lambda_{\beta}$ and the free-energy function $F_{\beta}$, we substitute the expression of $F_{\beta}$ in (11) into the expression of $\Lambda_{\beta}$ in (10) to obtain the implicit equation $\Lambda_{\beta} = \mathcal{T}[\Lambda_{\beta}]$, where the map $\mathcal{T}$ is given by
(12)
To solve the above implicit equation, we show that the map $\mathcal{T}$ in (12) is a contraction map, and therefore $\Lambda_{\beta}$ can be obtained using fixed point iterations, which are guaranteed to converge to the unique fixed point. Consequently, the global minimum $F_{\beta}$ in (11) and the optimal policy in (9) can be obtained.
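As an illustration of how such a fixed-point iteration can be carried out numerically, the sketch below repeatedly applies a soft-min (log-sum-exp) Bellman-type operator on a small randomly generated MDP. The operator form used here is an assumption on our part: it is consistent with the Gibbs policy in (9) and the log-sum-exp relation in (11), but it omits the additional transition-probability term discussed in Section III-C, so it is not the exact map in (12).

```python
import numpy as np

def soft_min_value(Lam, beta):
    # F(s) = -(1/beta) * log sum_a exp(-beta * Lam(s, a)), cf. (11), computed stably
    m = Lam.min(axis=1)
    return m - np.log(np.exp(-beta * (Lam - m[:, None])).sum(axis=1)) / beta

def bellman_map(Lam, c, p, gamma, beta):
    # assumed operator: [T Lam](s, a) = sum_s' p(s'|s,a) * (c(s,a,s') + gamma * F(s'))
    F = soft_min_value(Lam, beta)
    return np.einsum('sat,sat->sa', p, c) + gamma * np.einsum('sat,t->sa', p, F)

rng = np.random.default_rng(0)
S, A, gamma, beta = 5, 3, 0.9, 5.0
c = rng.uniform(0.0, 1.0, size=(S, A, S))       # costs c(s, a, s')
p = rng.dirichlet(np.ones(S), size=(S, A))      # transition probabilities p(s'|s, a)

Lam = np.zeros((S, A))
for _ in range(1000):                            # fixed-point iteration
    Lam_next = bellman_map(Lam, c, p, gamma, beta)
    if np.max(np.abs(Lam_next - Lam)) < 1e-10:
        break
    Lam = Lam_next

policy = np.exp(-beta * (Lam - Lam.min(axis=1, keepdims=True)))
policy /= policy.sum(axis=1, keepdims=True)      # Gibbs policy, cf. (9)
```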
Theorem 2.
The map $\mathcal{T}$ as defined in (12) is a contraction mapping with respect to a weighted maximum norm; i.e., there exist a vector $\xi$ with positive components and a scalar $\zeta \in [0,1)$ such that
$$\|\mathcal{T}\Lambda_1 - \mathcal{T}\Lambda_2\|_{\xi} \le \zeta\, \|\Lambda_1 - \Lambda_2\|_{\xi} \quad \text{for all } \Lambda_1, \Lambda_2, \qquad (13)$$
where $\|\cdot\|_{\xi}$ denotes the maximum norm weighted by $\xi$.
Proof. Please refer to Appendix B for details.
Remark 2.
It is known from sensitivity analysis [39] that the value of the Lagrange parameter $\beta$ in (7) is inversely related to the constant $V_0$ in (5). Thus, at small values of $\beta$ (equivalently, large $V_0$) we are mainly maximizing the Shannon Entropy, and the resultant policy in (9) encourages exploration along the paths of the MDP. As $\beta$ increases ($V_0$ decreases), more and more weight is given to the state value function in (7), and the policy in (9) goes from being exploratory to being exploitative. As $\beta \to \infty$ the exploration is completely eliminated and we converge to a deterministic policy that minimizes the state value function in (3).
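The transition described in this remark is easy to visualize; the snippet below (with arbitrary state-action values of our own choosing) shows how the Gibbs policy in (9) moves from near-uniform to near-greedy as $\beta$ grows.

```python
import numpy as np

Lam = np.array([1.0, 1.5, 3.0])      # state-action values for one state (arbitrary)
for beta in [0.01, 1.0, 10.0, 100.0]:
    pi = np.exp(-beta * (Lam - Lam.min()))
    pi /= pi.sum()
    print(beta, pi.round(4))          # near-uniform at small beta, near-greedy at large beta
```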
Remark 3.
We briefly draw the reader's attention to the value function considered in the entropy regularized methods [14]. Note that there the discounting multiplies both the cost term and the entropy term. However, in our MEP-based method, the Lagrangian in (7) applies discounting only to the cost term and not to the entropy terms associated with the policy and the state transitions. Therefore, the policy in [14] does not satisfy the MEP applied to the distribution $\mathbb{P}^{\pi}$ over paths; consequently, its exploration along the paths is not as unbiased as that of our algorithm.
III-C Model-free Reinforcement Learning Problems
In these problems, the cost function and the state-transition probabilities are not known a priori; however, at each discrete time instant the agent at state $s_t$ takes an action $a_t$ under a policy $\pi$, and the environment (the underlying stochastic process) returns an instantaneous cost and subsequently moves to the state $s_{t+1}$. Motivated by the iterative updates of the Q-learning algorithm [2], we consider the following stochastic updates in our Algorithm 1 to learn the state-action value function in our methodology
(14)
with a stepsize parameter, and show that under appropriate conditions on the stepsize (as illustrated shortly), the iterates converge to the fixed point of the implicit equation
(15)
The above equation differs only slightly from the equation in (12). The latter has an additional term that makes it difficult to learn its fixed point in the absence of the state transition probability itself. Since in this work we do not attempt to determine (or learn) the transition probability distribution from the interactions of the agent with the environment (as is done in [42]), we work with the approximate state-action value function in (14), where the additional term becomes negligible for large $\beta$ values (since it vanishes as $\beta \to \infty$). The following proposition elucidates the conditions under which the updates in (14) converge to the fixed point of (15).
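For readers who want a concrete picture of such stochastic updates, the sketch below shows a tabular learning loop in the spirit of Algorithm 1. It is our own illustrative reconstruction under explicit assumptions: we assume the update drives the state-action value toward the instantaneous cost plus a discounted soft-min, consistent with (9), (11), and (15); we assume a generic environment object `env` whose `reset()` returns a state and whose `step(a)` returns `(next_state, cost, done)`; and the stepsize and $\beta$ schedules are placeholders rather than the schedules specified in the paper's Algorithm 1.

```python
import numpy as np

def soft_min(lam_row, beta):
    # -(1/beta) * log sum_a exp(-beta * Lam(s', a)), cf. (11), computed stably
    m = lam_row.min()
    return m - np.log(np.exp(-beta * (lam_row - m)).sum()) / beta

def mep_learning(env, n_states, n_actions, episodes=500,
                 gamma=0.9, beta0=0.1, beta_growth=1.01, alpha=0.1, seed=0):
    Lam = np.zeros((n_states, n_actions))
    beta = beta0
    rng = np.random.default_rng(seed)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            pi = np.exp(-beta * (Lam[s] - Lam[s].min()))   # Gibbs policy, cf. (9)
            pi /= pi.sum()
            a = rng.choice(n_actions, p=pi)
            s_next, cost, done = env.step(a)               # assumed env interface
            target = cost + gamma * soft_min(Lam[s_next], beta)
            Lam[s, a] += alpha * (target - Lam[s, a])      # stochastic update toward (15)
            s = s_next
        beta *= beta_growth                                # anneal beta upward
    return Lam
```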
Proposition 1.
Proof. Please refer to Appendix D.
Remark 4.
Note that the stochasticity of the optimal policy (9) depends on the state-action value function, which allows it to account for the effect of the discount factor $\gamma$ on its exploration strategy. More precisely, for large discount factors the time window over which the discounted instantaneous costs are considerable is large, and thus the stochastic policy (9) performs more exploration along the paths. On the other hand, for small discount factors this time window is relatively short, and thus the stochastic policy (9) inherently performs less exploration. As illustrated in the simulations, this characteristic of the policy in (9) results in even faster convergence rates in comparison to the benchmark algorithms as the discount factor decreases.
IV MDPs with Infinite Shannon Entropy
Here we consider MDPs in which the Shannon Entropy of the distribution over paths is not necessarily finite (for instance, due to the absence of an absorbing state). To ensure finiteness of the objective in (5), we consider the discounted Shannon Entropy [43, 44]
(16)
with a discount factor that we choose to be independent of the discount factor $\gamma$ in the value function $V^{\pi}$. The free-energy function (or Lagrangian) resulting from the optimization problem in (5) with the alternate objective function in (16) is given by
(17)
where the subscript stands for the 'infinite entropy' case. Note that the free-energy functions in (7) and (17) differ only with regard to the discount factor on the entropy term, and thus our solution methodology in this section is similar to the one in Section III-B.
Theorem 3.
The free-energy function in (17) satisfies a recursive Bellman equation, given in (18), analogous to that in (8).
Proof. Please see Appendix C. The derivation reveals and exploits the algebraic structure in Lemma 4.
The optimal policy satisfies the corresponding stationarity condition, which results in the Gibbs distribution
(19)
(20)
where (20) defines the corresponding state-action value function. Substituting the policy in (19) into the Bellman equation (18) results in the following optimal free-energy function.
(21)
Remark 5.
Remark 6.
When the discount factor on the entropy term equals the discount factor $\gamma$ on the cost, the policy in (19), the state-action value function in (20), and the free-energy function in (21) correspond to the analogous expressions obtained in the entropy regularized methods [14]. However, in this paper we do not require the two discount factors to be equal; on the contrary, we propose that the entropy discount factor take larger values. In fact, our simulations in Section VI demonstrate the better convergence rates obtained in this case as compared to when the two discount factors coincide.
V Parameterized MDPs
V-A Problem Formulation
As stated in Section I, many application areas such as small cell networks (Figure 1) pose a parameterized MDP that requires simultaneously determining (a) the optimal policy $\pi$, and (b) the unknown state parameters $y$ and action parameters $z$, such that the state value function
(22)
is minimized, where each state $s$ has an associated parameter $y_s$ and each action $a$ has an associated parameter value $z_a$. As in Section III-A, we assume that the parameterized MDP admits at least one deterministic proper policy (Assumption 1), to ensure the finiteness of the value function and of the Shannon Entropy of the MDP for all parameter values. We further assume that the state-transition probability $p$ is independent of the state and action parameters $y$, $z$.
V-B Problem Solution
This problem was solved in Section III-B for the case where the states and actions were not parameterized, or equivalently, where the parameters $y$, $z$ can be viewed as known and fixed. For the parameterized case we apply the same solution methodology, which results in the same optimal policy as in (9), as well as the corresponding free-energy function in (11), except that now they are characterized by the parameters $y$, $z$. To determine the (locally) optimal parameters $y$, $z$, we set $\partial F_{\beta}/\partial y = 0$ and $\partial F_{\beta}/\partial z = 0$, which we implement using the gradient descent steps
$$y \leftarrow y - \eta\, \frac{\partial F_{\beta}}{\partial y}, \qquad z \leftarrow z - \eta\, \frac{\partial F_{\beta}}{\partial z}. \qquad (23)$$
Here $\eta > 0$ denotes the step size. The derivatives $\partial F_{\beta}/\partial y$ and $\partial F_{\beta}/\partial z$ are assumed to be bounded (see Proposition 2). We compute these derivatives in terms of the fixed points of their corresponding Bellman equations, whose operators are given by
(24)
Note that in the above equations we have suppressed the dependence of the instantaneous cost function on the parameters $y$ and $z$ to avoid notational clutter.
Theorem 4.
The operators defined in (24) are contraction maps with respect to a weighted maximum norm $\|\cdot\|_{\xi}$, where $\xi$ is a vector with positive components.
Proof. Please refer to the Appendix for details.
As previously stated in Section I, the state value function in (22) is generally a non-convex function of the parameters $y$, $z$, riddled with multiple poor local minima, and the resulting optimization problem is possibly NP-hard [27]. In our algorithm for parameterized MDPs we anneal $\beta$ from a small value to a large value, similar to our approach for non-parameterized MDPs in Section III-B, where the solution from the current iteration is used to initialize the subsequent iteration. However, in addition to facilitating a steady transition from an exploratory policy to an exploitative policy, annealing facilitates a gradual homotopy from the convex negative Shannon entropy function to the non-convex state value function, which prevents our algorithm from getting stuck in a poor local minimum. The underlying idea of our heuristic is to track the optimal solution as the initial convex function deforms to the actual non-convex cost. Also, minimizing the Lagrangian at small $\beta$ determines the global minimum, thereby making our algorithm insensitive to initialization. Algorithm 2 illustrates the steps to determine the policy and parameters for a parameterized MDP.
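A high-level sketch of such an annealed scheme is given below. This is our own schematic rendering rather than the paper's Algorithm 2: we assume access to two hypothetical routines, `solve_policy(params, beta)`, which returns the converged state-action values for fixed parameters (for example, by the fixed-point iteration of Section III-B), and `grad_free_energy(values, params, beta)`, which returns the derivative of the free energy with respect to the parameters as in (23)-(24); the step size, annealing ratio, and iteration counts are placeholders.

```python
import numpy as np

def annealed_parameter_search(params0, solve_policy, grad_free_energy,
                              beta_min=1e-3, beta_max=1e3, kappa=1.5,
                              eta=0.05, inner_steps=50):
    """Anneal beta geometrically; at each beta, descend the free energy w.r.t. the parameters."""
    params, beta = np.asarray(params0, dtype=float).copy(), beta_min
    values = None
    while beta <= beta_max:
        for _ in range(inner_steps):
            values = solve_policy(params, beta)                       # policy / state-action values
            params -= eta * grad_free_energy(values, params, beta)    # gradient step, cf. (23)
        beta *= kappa                                                 # geometric annealing of beta
    return params, values
```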
V-C Parameterized Reinforcement Learning
In many applications formulated as parameterized MDPs, explicit knowledge of the cost function, of its dependence on the parameters $y$, $z$, and of the state-transition probabilities is not available. However, for each action the environment returns an instantaneous cost based on the current state, the next state, and the parameter values $y$, $z$, which can subsequently be used to simultaneously learn the policy and the unknown state and action parameters via stochastic iterative updates. At each iteration of our learning algorithm, we employ the stochastic iterative updates in (14) to determine the optimal policy for the given $y$, $z$ values, and subsequently employ the stochastic iterative updates
(25)
to learn the derivative of the free energy with respect to the state parameters $y$; similar updates are used to learn the derivative with respect to the action parameters $z$. The parameter values $y$, $z$ are then updated using the gradient descent step in (23). The following proposition formalizes the convergence of the updates in (25) to the corresponding fixed point.
Proposition 2.
Proof. Please refer to the Appendix for details.
Algorithmic Details: Please refer to Algorithm 3 for a complete implementation. Unlike the scenario in Section III-C, where the agent acts upon the environment by taking an action and learns only the policy, here the agent interacts with the environment by (a) taking an action and also (b) providing the estimated parameter values $y$, $z$ to the environment; subsequently, the environment returns an instantaneous cost and the next state. In our Algorithm 3, we first learn the policy at a given value of the parameters $y$, $z$ using the iterations in (14), and then learn the fixed points using the iterations in (25) in order to update the parameters $y$, $z$ via (23). Note that the iterations in (25) require derivatives of the instantaneous cost, which we determine using the instantaneous costs resulting from two parameter-distinct environments and a finite difference method. Here the parameter-distinct environments represent the same underlying MDP but differ in only one of the parameter values. However, if two parameter-distinct environments are not feasible, one can work with a single environment, where the algorithm stores the instantaneous costs and the corresponding parameter values upon each interaction with the environment.
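The finite-difference step described above can be pictured with the following hypothetical helper of our own (the environment objects, their `sample_cost` interface, and the perturbation size are assumptions); it only illustrates how the derivative of the instantaneous cost with respect to one scalar parameter component could be estimated from two parameter-distinct environments.

```python
def estimate_cost_derivative(env_nominal, env_perturbed, state, action, delta, n_samples=100):
    """Finite-difference estimate of d(cost)/d(parameter) for one (state, action) pair.

    env_nominal and env_perturbed are assumed to represent the same MDP, differing only
    in one scalar parameter component by the amount `delta`; each call to sample_cost is
    assumed to return a (possibly noisy) instantaneous cost for the given state and action.
    """
    total = 0.0
    for _ in range(n_samples):
        c0 = env_nominal.sample_cost(state, action)
        c1 = env_perturbed.sample_cost(state, action)
        total += (c1 - c0) / delta
    return total / n_samples
```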
Remark 7.
Parameterized MDPs with infinite Shannon entropy can be addressed analogously using the above methodology.
Remark 8.
The MDPs addressed in Sections III, IV, and V consider different variants of discounted infinite horizon problems. The MDPs in Section III address the class of sequential problems that have a non-zero probability of reaching a cost-free termination state (i.e., a finite Shannon entropy value). The MDPs considered in Section IV need not reach a termination state (so the Shannon entropy is possibly infinite), and the underlying sequential decision problem continues over an effective horizon determined by the discounting factor. The parameterized MDPs in Section V can have finite or infinite Shannon entropy, but they comprise states and actions that have unknown parameters associated with them.
VI Simulations
We broadly classify our simulations into two categories. First, in the model-free RL setting, we demonstrate our Algorithm 1 in determining the control policy for the finite and infinite Shannon entropy variants of the Gridworld environment in Figure 2.
[Figure 2: Gridworld environment.]
Each cell in the Gridworld denotes a state. The cells colored black are invalid states. An agent can choose to move vertically, horizontally, diagonally, or stay at the current cell. Each action is followed by a probability of slipping into neighbouring states (probability 0.05 of slipping in each of the vertical and horizontal directions, and probability 0.025 of slipping in each of the diagonal directions, for a cumulative slip probability of 0.3); a sketch of this slip model is given below. For the finite entropy case, each step incurs a unit cost and the process terminates when the agent reaches the terminal state. For the infinite entropy case, each step incurs a unit reward. Second, in the parameterized MDPs and RL setting, we demonstrate our Algorithms 2 and 3 in designing a 5G small cell network. This involves simultaneously determining the locations of the small cells in the network as well as the optimal routing paths of the communication packets from the base station to the users.
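To make the slip model concrete, the snippet below builds the next-cell distribution for one intended move on an open grid, under one plausible reading of the description above (the remaining probability mass goes to the intended cell); handling of walls, invalid cells, and the terminal state in the actual environment is omitted, and the helper name is ours.

```python
# Slip probabilities as described above (0.05 per vertical/horizontal neighbour,
# 0.025 per diagonal neighbour); one plausible reading of the slip model.
SLIPS = {(-1, 0): 0.05, (1, 0): 0.05, (0, -1): 0.05, (0, 1): 0.05,
         (-1, -1): 0.025, (-1, 1): 0.025, (1, -1): 0.025, (1, 1): 0.025}

def transition_distribution(cell, move, grid_shape):
    """Distribution over next cells for an intended `move` from `cell` on an open grid."""
    rows, cols = grid_shape
    clip = lambda r, c: (min(max(r, 0), rows - 1), min(max(c, 0), cols - 1))
    dist = {}
    intended = clip(cell[0] + move[0], cell[1] + move[1])
    dist[intended] = 1.0 - sum(SLIPS.values())       # intended cell reached w.p. 0.7
    for (dr, dc), prob in SLIPS.items():             # slip into a neighbouring cell
        nxt = clip(cell[0] + dr, cell[1] + dc)
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist

print(transition_distribution((2, 2), (0, 1), (5, 5)))   # intended move: one cell right
```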
We compare our MEP-based Algorithm 1 with the benchmark algorithms Entropy Regularized (ER) G-learning (also referred to as Soft Q-learning) [14], Q-learning [5], and Double Q-learning [13]. Our choice of the benchmark algorithm G-learning (or entropy regularized Soft Q-learning) presented in [14] is based on its commonality with our framework, as discussed in Section III-B, and the choice of Q-learning and Double Q-learning is based on their widespread use in the literature. Also, note that the work in [14] already establishes the efficacy of the G-learning algorithm over the following algorithms in the literature: Q-learning, Double Q-learning, the algorithm of [45], Speedy Q-learning [46], and the consistent Bellman operator of [47]. Below we highlight the features and advantages of our MEP-based Algorithm 1.

Faster Convergence to the Optimal Policy: Figures 3(a1)-(a3) (finite entropy variant of Gridworld) and Figures 3(b1)-(b3) (infinite entropy variant of Gridworld) illustrate the faster convergence of our MEP-based Algorithm 1 for different discount factor values. Here, at each episode $n$ the percentage error between the value function corresponding to the policy learned up to episode $n$ and the optimal value function $V^{*}$ is given by
$$\text{error}_n = \frac{100}{K} \sum_{k=1}^{K} \frac{\| V^{\pi_n}_{k} - V^{*} \|}{\| V^{*} \|}, \qquad (26)$$
where $K$ denotes the total number of experimental runs and $k$ indexes the value function for each run. As observed in Figures 3(a1)-(a3) and Figures 3(b1)-(b3), our Algorithm 1 converges even faster as the discount factor decreases. We also characterize the faster convergence rates in terms of the convergence time, more precisely the percentage of total episodes taken for the learning error to reach within a small margin of the best value (see Figures 3(a4) and 3(b4)). As observed in the figures, the performance of our (MEP-based) algorithm in comparison to entropy regularized G-learning is better across all considered values of the discount factor $\gamma$. Note that the performance of Algorithm 1 gets even better with decreasing $\gamma$ values; smaller discount factors arise in instances such as recommendation systems [48] and teaching RL agents using human-generated rewards [49].
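Assuming the per-run learned value functions and the optimal value function are available as arrays, the error metric of (26) can be computed along the following lines (the norm choice and the example numbers are ours).

```python
import numpy as np

def percentage_error(V_runs, V_opt):
    """Mean percentage error across K experimental runs, in the spirit of (26)."""
    V_runs = np.atleast_2d(np.asarray(V_runs, dtype=float))
    V_opt = np.asarray(V_opt, dtype=float)
    per_run = np.linalg.norm(V_runs - V_opt, axis=1) / np.linalg.norm(V_opt)
    return 100.0 * per_run.mean()

# Example with K = 3 runs over 4 states (arbitrary numbers).
V_opt = np.array([1.0, 2.0, 3.0, 4.0])
V_runs = V_opt + np.array([[0.1, 0.0, -0.1, 0.2],
                           [0.0, 0.1,  0.0, 0.0],
                           [0.2, -0.2, 0.1, 0.1]])
print(percentage_error(V_runs, V_opt))
```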
Robustness to noise in data: Figures 3(c1)-(c4) demonstrate robustness to noisy environments; here the instantaneous cost in the finite entropy variant of Gridworld is noisy. For the purpose of simulations, we add Gaussian noise to the instantaneous costs, with one noise level for the vertical and horizontal actions and another for the diagonal movements. Here, at each episode we compare the percentage error in the learned value functions (corresponding to the state-action value estimates in (14)) of the respective algorithms. Similar to our observations and conclusions from Figures 3(a1)-(a3) and Figures 3(b1)-(b3), we see faster convergence of our MEP-based algorithm over the benchmark algorithms in Figures 3(c1)-(c3) in the case of a noisy environment. Also, Figure 3(c4) demonstrates that across all considered discount factor values, Algorithm 1 converges faster than entropy regularized Soft Q-learning.
Simultaneously determining the unknown parameters and policy in Parameterized MDPs: We design the 5G small cell network (see Figure 1) both when the underlying model (cost function and state-transition probabilities) is known (using Algorithm 2) and when it is unknown (using Algorithm 3). In our simulations we randomly distribute the user nodes and place the base station in a planar domain, as shown in Figure 4(a). The objective is to determine the locations (parameters) of the small cells and the corresponding communication routes (policy). Here, the locations of the small cells are the unknown parameters of the underlying MDP, whose state and action spaces are defined over the network nodes and whose cost function depends on the spatial locations of the respective states. The objective is to simultaneously determine the parameters (unknown small cell locations) and the control policy (communication routes in the 5G network). We consider two cases: (a) the state transition is deterministic, i.e., an action at a state results in the intended next state with probability one, and (b) the state transition is probabilistic, such that an action at a state results in the intended next state with high probability and in a different state otherwise. Additionally, due to the absence of prior work in the literature on network design problems modeled as parameterized MDPs, we compare our results only with the solution resulting from a straightforward sequential methodology (as shown in Figure 4(a)), where we first partition the user nodes into distinct clusters to allocate a small cell in each cluster, and then determine the optimal routes in the network.

Deterministic state transitions: Figure 4(b) illustrates the allocation of small cells and the corresponding communication routes (resulting from the optimal policy) as determined by Algorithm 2. Here, the network is designed to minimize the cumulative cost of communication from each user node and small cell. As denoted in the figure, the designed routes carry the communication packets from the base station to the respective user nodes. The cost incurred here is lower than that in Figure 4(a), clearly indicating the advantage obtained from simultaneously determining the parameters and policy over a sequential methodology. In the model-free RL setting, where the cost function, the state-transition probabilities, and the locations of the user nodes are not known, we employ our Algorithm 3 to determine the small cell locations as well as the optimal policy, as demonstrated in Figure 4(c). It is evident from Figures 4(b) and (c) that the solutions obtained when the model is completely known and when it is unknown are approximately the same. In fact, the solutions differ only marginally in terms of the total cost (22) incurred, clearly indicating the efficacy of our model-free learning Algorithm 3.
Probabilistic state transitions: Figure 4(d) illustrates the solution obtained by our Algorithm 2 when the underlying model (cost function and state-transition probabilities) is known. As before, the network is designed to minimize the cumulative cost of communication from each user node and small cell. The cost associated with this network design is lower than in Figure 4(a). Figure 4(e) illustrates the solution obtained by Algorithm 3 for the model-free case (where the cost function, the state-transition probabilities, and the user node locations are unknown). Similar to the previous scenario, the solutions obtained using Algorithms 2 and 3 are approximately the same and differ only marginally in terms of the total cost incurred, thereby substantiating the efficacy of our proposed model-free learning Algorithm 3.
Sensitivity Analysis: Our algorithms enable categorizing the user nodes in Figure 4(b) into the categories of (i) low, (ii) medium, and (iii) high sensitivity, such that the final solution is least susceptible to the user nodes in (i) and most susceptible to those in (iii). Note that the above sensitivity analysis requires computing the derivative of the free energy with respect to the user node locations, which we determine by solving for the fixed point of the Bellman equation in (24). The derivative computed at large $\beta$ is a measure of the sensitivity of the solution to the cost function in (22), since the free energy in (11) is a smooth approximation of the value function in (22) and converges to it as $\beta \to \infty$. A similar analysis for Figures 4(c)-(e) can be done if the locations of the user nodes are known to the agent. The sensitivity of the final solution to the locations of the small cells and the base station can also be determined in a similar manner.
Entropy over paths versus entropy of the policy: We demonstrate the benefit of maximizing the entropy of the distribution over the paths of an MDP as compared to the distribution over the control actions. Figure 4(g) shows the 5G network obtained by considering only the distribution over the control actions, and Figure 4(h) illustrates the network obtained by considering the distribution over entire paths. The network cost incurred in Figure 4(h) is lower than the cost incurred in Figure 4(g). Here, we have considered the probabilistic scenario demonstrated above and minimized the cumulative communication cost incurred only from the user nodes.
Avoiding poor local minima and large scale setups: As noted in Section V-B, annealing $\beta$ from a small value to a large value prevents the algorithm from getting stuck at a poor local minimum. Figure 4(i) demonstrates the network design obtained when Algorithm 2 does not anneal $\beta$ and instead iteratively solves the optimization problem directly at a large value of $\beta$. The resulting network incurs a higher cost in comparison to the network obtained in Figure 4(h), where Algorithm 2 anneals $\beta$ from a small to a large value. Figure 4(j) demonstrates the 5G network design obtained using Algorithm 2 when the number of user nodes is increased severalfold and the number of allocated small cells is doubled.
VII Analysis and Discussion
VII-1 Mutual Information Minimization
The optimization problem (5) maximizes the Shannon entropy under a given constraint on the value function. We can similarly pose and solve a mutual information minimization problem, which requires determining the distribution (with its control policy) over the paths of the MDP that is closest to some given prior distribution [16, 15]. Here, the objective is to minimize the KL-divergence between the path distribution and the prior, under the same constraint on the value function (as in (5)).
VII-2 Non-dependence on the choice of $V_0$ in (5)
In our framework we do not explicitly determine or work with the value of $V_0$. Instead, we work with the Lagrange parameter $\beta$ in the Lagrangian (7) corresponding to the optimization problem (5). It is known from sensitivity analysis [6] that small values of $\beta$ correspond to large values of $V_0$, and large values of $\beta$ correspond to small values of $V_0$. Thus, in our algorithms we solve the optimization problem (5) beginning at small values of $\beta$ (corresponding to some feasible large $V_0$), and anneal it to a large value (corresponding to a small $V_0$) at which the stochastic policy in (9) converges to a deterministic policy, with each probability converging to either 0 or 1. Also, at $\beta$ near zero the stochastic policy in (9) follows a uniform distribution, which implicitly fixes the value of $V_0$. Therefore, the initial value of $V_0$ in the proposed algorithms is fixed implicitly and is not required to be pre-specified.
VII-3 Computational complexity
Our MEP-based Algorithm 1 performs exactly the same number of computations as the Soft Q-learning algorithm [14] in each epoch (or iteration) within an episode. In comparison to the Q-learning and Double Q-learning algorithms, our proposed algorithm, apart from the additional minor computation of explicitly determining the stochastic policy in (9), exhibits a similar number of computational steps.
VII-4 Scheduling and Phase Transition
In our Algorithm 1, we follow a linear schedule, as suggested for the benchmark algorithm [14], to anneal the parameter $\beta$. In the case of parameterized MDPs (Algorithms 2 and 3), we anneal $\beta$ geometrically (i.e., multiplying $\beta$ by a fixed ratio greater than one at each iteration) from a small value to a large value at which the control policy converges to a deterministic policy. Several other MEP-based algorithms that address problems akin to parameterized MDPs, such as Deterministic Annealing [7], also incorporate geometric annealing of $\beta$. The underlying idea in [7] is that the solution undergoes significant changes only at certain critical (phase transition) values of $\beta$ and shows insignificant changes between two consecutive critical values. Thus, for all practical purposes, geometric annealing of $\beta$ works well. Similar to [7], our Algorithms 2 and 3 also undergo such phase transitions, and we are working on an analytical expression for them.
VII-5 Capacity and Exclusion Constraints
Certain parameterized MDPs may pose capacity or dynamical constraints on their parameters. For instance, each small cell allocated in Figure 4 can be constrained in capacity to cater to at most a given fraction of the user nodes in the network. Our framework allows modeling such a constraint in terms of the fraction of user nodes that connect to each small cell. In another scenario, the locations of the user nodes could be varying dynamically in time; the resulting policy and small cell locations will then also be time varying. In this case we treat the free-energy function in (11) as a control-Lyapunov function and determine time-varying policies and parameters such that the free energy decreases along the resulting trajectories.
VII-6 Uncertainty in Parameters
Many application areas comprise states and actions whose associated parameters are uncertain, with a known distribution over the set of their possible values. For instance, a user node in Figure 4 may have an associated uncertainty in its location owing to measurement errors. Our proposed framework easily incorporates such uncertainties in parameter values. For example, the above uncertainty results in replacing the instantaneous cost by its expected value under the distribution over the set of possible locations. The subsequent solution approach remains the same as in Section V-B.
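As a small illustration of this replacement, the snippet below averages a cost that depends on an uncertain node location over a discrete set of candidate locations with known probabilities; the cost function, the candidate set, and the probabilities are placeholders of our own.

```python
import numpy as np

def expected_cost(cost_fn, candidate_locations, probabilities):
    """Replace c(location) by its expectation under the given location distribution."""
    return sum(p * cost_fn(loc) for loc, p in zip(candidate_locations, probabilities))

# Placeholder example: squared distance from a fixed small-cell position.
small_cell = np.array([0.5, 0.5])
cost_fn = lambda loc: float(np.sum((np.asarray(loc) - small_cell) ** 2))
locations = [(0.4, 0.5), (0.5, 0.6), (0.6, 0.5)]   # possible user-node locations
probs = [0.5, 0.3, 0.2]                             # assumed measurement-error model
print(expected_cost(cost_fn, locations, probs))
```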
APPENDICES
Appendix A. Proof of Lemma 1: Let $s \in \mathcal{S}$. By Assumption 1, there exists a path from $s$ to the termination state that has positive probability under the deterministic proper policy, which by (6) implies that each of its transitions has positive probability. Since every policy in (4) assigns strictly positive probability to every action, the probability of taking the same path under any stochastic policy in (4) is also positive.
Proof of Theorem 1: The following Lemma is needed
Lemma 2.
The Shannon Entropy corresponding to the MDP illustrated in Section III-A satisfies the algebraic expression .
Proof.
The Shannon entropy in (5) satisfies a recursive Bellman equation.
On the right side of this Bellman equation, we subtract a zero-valued term that accounts for the normalization constraint on the policy. Taking the derivative of the resulting expression, we obtain
(27)
The subsequent steps in the proof involve algebraic manipulations. Under the mild assumption that for each state there exists a state-action pair such that the probability of the system entering that state upon taking the corresponding action is non-zero, we multiply equation (27) by an appropriate weighting and sum over all states to obtain
The derivative terms on both sides cancel, which yields the claimed algebraic expression. ∎
Now consider the free energy function in (7) and separate out the first term of its infinite summation to obtain
(28)
Now we relate this expression to the free energy at the subsequent state by adding and subtracting appropriate terms. Substituting this relation and the algebraic expression obtained in Lemma 2 into the above equation (28), we obtain
(29)
where the additional term does not depend on the policy $\pi$. Therefore, since the control policy is determined by taking critical points of the free energy, it is independent of this term, as shown in the following subsection.
-A Independence of the optimal policy from the policy-independent term
The Bellman equation is given by
(30)
The optimal control policy, obtained upon taking the derivative of the recursive Bellman equation (and also accounting for the normalization constraint), is given by
(31)
(32)
Substituting the optimal policy (31) into the recursive Bellman equation (30), we obtain
(33)
Substituting the above equation (33) into the state-action value function in (32) we obtain the following map
(34)
where the proof that the map is a contraction map is analogous to the proof of Theorem 2.
The state-action value function obtained by disregarding the policy-independent term in (33) satisfies the recursive equation
(35)
where the map has been shown to be a contraction map in Theorem 2.
Remark: Note that the maps in (34) and (35) differ by an additive term that does not depend on their argument. Therefore, the map in (34) is a contraction since the map in (35) is a contraction, and it has a unique fixed point.
Claim: The optimal control policy in (31) is the same whether computed using the fixed point of (34) or that of (35). This is because the fixed points of the two contraction maps differ by the disregarded term, and this difference cancels in the normalization of the control policy in (31). Thus, we do not need to take the disregarded term into account while computing the optimal policy.
Proof of the relation between the fixed points: This can be checked by substituting the claimed relation into the definition of the corresponding map above; the resulting equality verifies the relation. Hence proved.
Appendix B. Proof of Theorem 2: The following lemma is used.
Lemma 3.
For every policy defined in (4), there exist a vector with positive components and a scalar strictly less than one such that the stated bound holds for all states and actions.
Proof.
Consider a new MDP with state transition probabilities identical to those of the original MDP and with modified transition costs. The free-energy function in (7) for the new MDP is then less than or equal to that of the original MDP. We define the optimal free energy (as given in (11)) and use the LogSumExp inequality [50] to bound it in terms of the state-action value function in (12); substituting back gives the claimed vector with positive components and the scalar strictly less than one. ∎
Next we show that the map in (12) is a contraction map. For any two state-action value functions we have that
(36)
where we use the log-sum inequality to obtain (36), with the stochastic policy in (9) corresponding to the first argument. Similarly, we obtain the analogous bound with the policy in (9) corresponding to the second argument. Combining the two bounds, we conclude that
(37)
(38)
where the vector is as given in Lemma 3. Further, from the same lemma we obtain
(39)
(40)
Appendix C. Proof of Theorem 3: The proof follows a similar idea as the proof of Theorem 1 in Appendix A, and thus we do not explain it in detail, except for the following lemma, which illustrates the algebraic structure of the discounted Shannon entropy in (16); this structure is different from that in Lemma 2 and is also required in our proof of the said theorem.
Lemma 4.
The discounted Shannon entropy corresponding to the MDP in Section IV satisfies an algebraic expression analogous to that in Lemma 2.
Proof.
Define a new MDP that augments the action and state spaces of the original MDP with an additional action and an additional state, respectively, and derives its state-transition probability and policy from the original MDP as follows.
Next, we define an auxiliary entropy function, derived using (16), that satisfies the required recursion. The subsequent steps of the proof are the same as in the proof of Lemma 2. ∎
Appendix D. Proof of Proposition 1: The proof is analogous to the proof of Proposition 5.5 in [2]. Consider the map underlying the updates in (14). The stochastic iterative updates in (14) can be re-written as fixed-point iterations with an additive noise term. Conditioned on the history of the stochastic updates, the noise term has zero conditional expectation and its conditional variance is bounded in terms of a constant. These expressions satisfy the conditions on the expected value and the variance of the noise that, along with the contraction property of the map, guarantee the convergence of the stochastic updates in (14), as illustrated in Proposition 4.4 in [2].
References
- [1] E. A. Feinberg and A. Shwartz, Handbook of Markov decision processes: methods and applications. Springer Science & Business Media, 2012, vol. 40.
- [2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-dynamic programming. Athena Scientific Belmont, MA, 1996, vol. 5.
- [3] A. Hordijk and L. Kallenberg, “Linear programming and Markov decision chains,” Management Science, vol. 25, no. 4, pp. 352–362, 1979.
- [4] Y. Abbasi-Yadkori, P. L. Bartlett, and A. Malek, “Linear programming for large-scale markov decision problems,” in JMLR Workshop and Conference Proceedings, no. 32. MIT Press, 2014, pp. 496–504.
- [5] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
- [6] E. T. Jaynes, “Information theory and statistical mechanics,” Physical review, vol. 106, no. 4, p. 620, 1957.
- [7] K. Rose, “Deterministic annealing, clustering, and optimization,” Ph.D. dissertation, California Institute of Technology, 1991.
- [8] P. Sharma, S. Salapaka, and C. Beck, “A scalable approach to combinatorial library design for drug discovery,” Journal of chemical information and modeling, vol. 48, no. 1, pp. 27–41, 2008.
- [9] J.-G. Yu, J. Zhao, J. Tian, and Y. Tan, “Maximal entropy random walk for region-based visual saliency,” IEEE transactions on cybernetics, vol. 44, no. 9, pp. 1661–1672, 2013.
- [10] Y. Xu, S. M. Salapaka, and C. L. Beck, “Aggregation of graph models and markov chains by deterministic annealing,” IEEE Transactions on Automatic Control, vol. 59, no. 10, pp. 2807–2812, 2014.
- [11] L. Chen, T. Zhou, and Y. Tang, “Protein structure alignment by deterministic annealing,” Bioinformatics, vol. 21, no. 1, pp. 51–62, 2005.
- [12] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning.” in Aaai, vol. 8. Chicago, IL, USA, 2008, pp. 1433–1438.
- [13] H. V. Hasselt, “Double Q-learning,” in Advances in Neural Information Processing Systems, 2010, pp. 2613–2621.
- [14] R. Fox, A. Pakman, and N. Tishby, “Taming the noise in reinforcement learning via soft updates,” arXiv preprint arXiv:1512.08562, 2015.
- [15] J. Grau-Moya, F. Leibfried, and P. Vrancx, “Soft q-learning with mutual-information regularization,” 2018.
- [16] J. Peters, K. Mulling, and Y. Altun, “Relative entropy policy search,” in Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
- [17] G. Neu, A. Jonsson, and V. Gómez, “A unified view of entropy-regularized markov decision processes,” arXiv preprint arXiv:1705.07798, 2017.
- [18] K. Asadi and M. L. Littman, “An alternative softmax operator for reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 243–252.
- [19] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, “Bridging the gap between value and policy based reinforcement learning,” in Advances in Neural Information Processing Systems, 2017, pp. 2775–2785.
- [20] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song, “Sbeed: Convergent reinforcement learning with nonlinear function approximation,” arXiv preprint arXiv:1712.10285, 2017.
- [21] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning, 2015, pp. 1889–1897.
- [22] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.
- [23] A. Aguilar-Garcia, S. Fortes, M. Molina-García, J. Calle-Sánchez, J. I. Alonso, A. Garrido, A. Fernández-Durán, and R. Barco, “Location-aware self-organizing methods in femtocell networks,” Computer Networks, vol. 93, pp. 125–140, 2015.
- [24] U. Siddique, H. Tabassum, E. Hossain, and D. I. Kim, “Wireless backhauling of 5g small cells: Challenges and solution approaches,” IEEE Wireless Communications, vol. 22, no. 5, pp. 22–31, 2015.
- [25] G. Manganini, M. Pirotta, M. Restelli, L. Piroddi, and M. Prandini, “Policy search for the optimal control of markov decision processes: A novel particle-based iterative scheme,” IEEE transactions on cybernetics, vol. 46, no. 11, pp. 2643–2655, 2015.
- [26] A. Srivastava and S. M. Salapaka, “Simultaneous facility location and path optimization in static and dynamic networks,” IEEE Transactions on Control of Network Systems, pp. 1–1, 2020.
- [27] M. Mahajan, P. Nimbhorkar, and K. Varadarajan, “The planar k-means problem is np-hard,” in International Workshop on Algorithms and Computation. Springer, 2009, pp. 274–285.
- [28] A. A. Abin, “Querying beneficial constraints before clustering using facility location analysis,” IEEE transactions on cybernetics, vol. 48, no. 1, pp. 312–323, 2016.
- [29] D. Huang, C.-D. Wang, and J.-H. Lai, “Locally weighted ensemble clustering,” IEEE transactions on cybernetics, vol. 48, no. 5, pp. 1460–1473, 2017.
- [30] W. Shi, S. Song, and C. Wu, “Soft policy gradient method for maximum entropy deep reinforcement learning,” arXiv preprint arXiv:1909.03198, 2019.
- [31] G. Xiang and J. Su, “Task-oriented deep reinforcement learning for robotic skill acquisition and control,” IEEE transactions on cybernetics, 2019.
- [32] L. Xia and Q.-S. Jia, “Parameterized markov decision process and its application to service rate control,” Automatica, vol. 54, pp. 29–35, 2015.
- [33] M. Hausknecht and P. Stone, “Deep reinforcement learning in parameterized action space,” arXiv preprint arXiv:1511.04143, 2015.
- [34] W. Masson, P. Ranchod, and G. Konidaris, “Reinforcement learning with parameterized actions,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- [35] E. Wei, D. Wicke, and S. Luke, “Hierarchical approaches for reinforcement learning in parameterized action space,” in 2018 AAAI Spring Symposium Series, 2018.
- [36] J. Xiong, Q. Wang, Z. Yang, P. Sun, L. Han, Y. Zheng, H. Fu, T. Zhang, J. Liu, and H. Liu, “Parametrized deep q-networks learning: Reinforcement learning with discrete-continuous hybrid action space,” arXiv preprint arXiv:1810.06394, 2018.
- [37] V. Narayanan and S. Jagannathan, “Event-triggered distributed control of nonlinear interconnected systems using online reinforcement learning with exploration,” IEEE transactions on cybernetics, vol. 48, no. 9, pp. 2510–2519, 2017.
- [38] E. Çilden and F. Polat, “Toward generalization of automated temporal abstraction to partially observable reinforcement learning,” IEEE transactions on cybernetics, vol. 45, no. 8, pp. 1414–1425, 2014.
- [39] E. T. Jaynes, Probability theory: The logic of science. Cambridge university press, 2003.
- [40] F. Biondi, A. Legay, B. F. Nielsen, and A. Wąsowski, “Maximizing entropy over Markov processes,” Journal of Logical and Algebraic Methods in Programming, vol. 83, no. 5-6, pp. 384–399, 2014.
- [41] Y. Savas, M. Ornik, M. Cubuktepe, and U. Topcu, “Entropy maximization for constrained markov decision processes,” in 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2018, pp. 911–918.
- [42] S. Ross, J. Pineau, B. Chaib-draa, and P. Kreitmann, “A bayesian approach for learning and planning in partially observable markov decision processes,” Journal of Machine Learning Research, vol. 12, no. May, pp. 1729–1770, 2011.
- [43] L. P. Hansen, T. J. Sargent, G. Turmuhambetova, and N. Williams, “Robust control and model misspecification,” Journal of Economic Theory, vol. 128, no. 1, pp. 45–90, 2006.
- [44] Z. Zhou, M. Bloem, and N. Bambos, “Infinite time horizon maximum causal entropy inverse reinforcement learning,” IEEE Transactions on Automatic Control, vol. 63, no. 9, pp. 2787–2802, 2017.
- [45] K. Rawlik, M. Toussaint, and S. Vijayakumar, “Approximate inference and stochastic optimal control,” arXiv preprint arXiv:1009.3958, 2010.
- [46] M. Ghavamzadeh, H. J. Kappen, M. G. Azar, and R. Munos, “Speedy q-learning,” in Advances in neural information processing systems, 2011, pp. 2411–2419.
- [47] M. G. Bellemare, G. Ostrovski, A. Guez, P. S. Thomas, and R. Munos, “Increasing the action gap: New operators for reinforcement learning,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- [48] G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and Z. Li, “DRN: A deep reinforcement learning framework for news recommendation,” in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 167–176.
- [49] W. B. Knox and P. Stone, “Reinforcement learning from human reward: Discounting in episodic tasks,” in 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication. IEEE, 2012, pp. 878–885.
- [50] K. Kobayashi et al., Mathematics of information and coding. American Mathematical Soc., 2007, vol. 203.
- [51] P. Sharma, S. M. Salapaka, and C. L. Beck, “Entropy-based framework for dynamic coverage and clustering problems,” IEEE Transactions on Automatic Control, vol. 57, no. 1, pp. 135–150, 2011.
Amber Srivastava received the B.Tech degree in Mechanical Engineering from the Indian Institute of Technology, Kanpur, in 2014, after which he worked for a year (2014-2015) as an Assistant Manager at an FMCG company in India. He received the M.S. degree in Mathematics from the University of Illinois at Urbana-Champaign (UIUC) in 2020. He is currently a Ph.D. student in the Department of Mechanical Science and Engineering at UIUC. His areas of interest are optimization, learning, and control.
Srinivasa M. Salapaka received the B.Tech degree from the Indian Institute of Technology Madras in 1995, and the M.S. and Ph.D. degrees in Mechanical Engineering from the University of California, Santa Barbara, in 1997 and 2002, respectively. From 2002 to 2004, he was a postdoctoral associate at the Massachusetts Institute of Technology. Since 2004, he has been a faculty member in the Department of Mechanical Science and Engineering, University of Illinois at Urbana-Champaign. His current research interests include control for nanotechnology, combinatorial optimization, and power electronics.
Supplementary Material for CYB-E-2020-05-1237
Appendix A Link to Source Code
Please find the sample code for the model-free RL setting and the parameterized MDP and RL setting at either of the links below:
1. https://drive.google.com/drive/folders/1fVA3nM1Oh6drQpi3H50ZPTDppK8Tv5bX?usp=sharing
2. https://uofi.box.com/s/twbo129bmst0e24ykgr1rhwwc8kyhe1o
Appendix B Simulations on the Double Chain Environment
In this section we apply our Algorithm 1 to determine the control policy for the finite and infinite Shannon entropy variants of the double chain (DC) environment shown in Figure 5.
[Figure 5: The finite and infinite Shannon entropy variants of the double chain (DC) environment.]
Faster Convergence to the Optimal: Figures 6(a1)-(a4) (finite entropy variant of DC) and Figures 6(d1)-(d4) (infinite entropy variant of DC) illustrate the faster convergence of our MEP-based Algorithm 1 for different values of the discount factor. Here the learning error at episode $k$ measures the distance between the value function $V^{\pi_k}$ corresponding to the policy $\pi_k$ learned in that episode and the optimal value function $V^{*}$; that is,

$$\text{err}(k) \;=\; \frac{1}{M}\sum_{m=1}^{M}\big\|V_m^{\pi_k} - V^{*}\big\|, \qquad (41)$$

where $M$ denotes the total number of experimental runs and $m$ indexes the value function obtained in each run. As observed in Figures 6(a1) and 6(a3), and Figures 6(d1) and 6(d3), our Algorithm 1 converges even faster as the discount factor decreases. We also characterize the faster convergence in terms of the convergence time, namely the critical episode beyond which the learning error stays within a fixed tolerance of the optimal (see Figures 6(a5) and 6(d5)). As observed in the figures, the performance of our MEP-based algorithm improves further as the discount factor decreases, since the stochastic policy adapts the amount of exploration along the paths to the value of the discount factor, whereas the other algorithms deteriorate as the discount factor decreases.
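For concreteness, the sketch below shows one way to compute the learning error in (41) and the corresponding convergence episode from logged value-function estimates. The array layout, the use of the sup-norm, and the tolerance argument `tol` are our assumptions and are not taken from the released code.

```python
import numpy as np

def learning_error(V_runs, V_star):
    """Learning error of (41): for each episode, the distance between the
    value function under the learned policy and the optimal value function,
    averaged over experimental runs.

    V_runs : array of shape (num_runs, num_episodes, num_states)
    V_star : array of shape (num_states,)
    """
    # sup-norm error per (run, episode), then average over runs
    per_run = np.abs(V_runs - V_star[None, None, :]).max(axis=2)
    return per_run.mean(axis=0)  # shape: (num_episodes,)

def convergence_episode(err, tol):
    """Critical episode beyond which the learning error stays within tol."""
    above = np.flatnonzero(err > tol)
    return 0 if above.size == 0 else int(above[-1]) + 1
```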
[Figure 6: Simulation results on the double chain environment: convergence of the learning error, variance in the later episodes, Shannon entropy of the learned policy, and robustness to noisy costs.]
Smaller Variance: Figures 6(b1)-(b3) compare the variance observed in the learning process during the last episodes of the experimental runs. As demonstrated, the Soft Q algorithm (Fig. 6(b2)) exhibits higher variance than the MEP-based algorithm (Fig. 6(b1)). Figure 6(b3) plots the average variance of all the algorithms over the last episodes of the learning process for different values of the discount factor. As illustrated, Soft Q-learning and Q-learning exhibit higher variances across all discount factor values than our algorithm and Double Q-learning. Between our algorithm and Double Q-learning, the variance of our algorithm is smaller at lower discount factor values and vice versa.
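One plausible way to reproduce this variance measure is sketched below: the value-function estimates are compared across runs over the last few episodes. The window length and the choice of taking the variance across runs are our assumptions.

```python
import numpy as np

def late_stage_variance(V_runs, window=50):
    """Average variance of the value-function estimates over the last
    `window` episodes, where the variance is taken across experimental runs.

    V_runs : array of shape (num_runs, num_episodes, num_states)
    """
    tail = V_runs[:, -window:, :]          # keep only the last `window` episodes
    return float(tail.var(axis=0).mean())  # variance across runs, then averaged
```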
Inherently better exploration and exploitation: The Shannon entropy corresponding to our algorithm is higher during the initial episodes, indicating better exploration under the learned policy than under the other algorithms, as seen in Figures 6(b4)-(b5). Additionally, the entropy is smaller in the later episodes, indicating a more exploitative learned policy. The unbiased policies resulting from our algorithm, together with this enhanced exploration-exploitation trade-off, result in the faster convergence rates and smaller variance discussed above.
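As a simple proxy for the entropy traces in Figures 6(b4)-(b5), one can track the per-state action entropy of the learned stochastic policy over episodes. Note that the quantity analyzed in the paper is the entropy of the distribution over paths, so the sketch below is only an illustrative surrogate.

```python
import numpy as np

def mean_policy_entropy(policy):
    """Average (over states) Shannon entropy of a stochastic policy.

    policy : array of shape (num_states, num_actions), rows summing to one.
    """
    p = np.clip(policy, 1e-12, 1.0)            # avoid log(0)
    return float(-(p * np.log(p)).sum(axis=1).mean())
```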
Robustness to noise in data: Figures 6(c1)-(c5) demonstrate robustness to noisy environments; here the instantaneous cost in the finite entropy variant of DC is noisy. For the purpose of simulation, we add zero-mean Gaussian noise to the instantaneous cost, with the noise variance depending on the state-action pair. Similar to our observations and conclusions in Figures 6(a1)-(a4), Figures 6(c1)-(c2) and Figures 6(c3)-(c4) illustrate the faster convergence of our MEP-based algorithm. Also, similar to Figure 6(a5), Figure 6(c5) exhibits higher convergence rates (i.e., lower convergence times) in comparison to the benchmark algorithms. In the infinite entropy variant of DC, results of a similar nature are obtained upon adding the noise, as illustrated in Figures 6(d2) and 6(d4).
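The following sketch illustrates how state-action dependent Gaussian noise can be injected into the instantaneous cost during learning. The `noise_std` mapping and its `"default"` key are placeholders; the exact noise parameters used in the experiments are not reproduced here.

```python
import numpy as np

def noisy_cost(cost, state, action, noise_std, rng=np.random.default_rng()):
    """Return the instantaneous cost corrupted by zero-mean Gaussian noise,
    with a standard deviation that depends on the (state, action) pair.

    noise_std : dict mapping (state, action) -> sigma, with an optional
                "default" key covering all remaining pairs (placeholders).
    """
    sigma = noise_std.get((state, action), noise_std.get("default", 0.0))
    return cost + rng.normal(0.0, sigma)
```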
Appendix C Comparison with MIRL [15] and REPS [16]
In this section we demonstrate the benefit of considering distributions over the paths of an MDP rather than distributions over the actions. Figures 7(a1)-(a4) compare the mutual information regularization (MIRL) algorithm presented in [15] (where the mutual information regularizer is discounted with the same discount factor as the instantaneous cost) with the regularization based on the distribution over the paths of the MDP (our MEP-based idea, which results in no discount factor on the regularizer term). Note the faster convergence of the path-based approach in Figures 7(a1)-(a3). Figure 7(a4) demonstrates the faster convergence rate obtained when considering the distribution over the paths of the MDP.
[Figure 7: Comparison of our MEP-based algorithm with MIRL [15] and REPS [16] on the double chain environment.]
In Figures 7(b1)-(b2) we compare the performance of the relative entropy policy search (REPS) algorithm [16] with our MEP-based algorithm in a noisy environment. For the purpose of simulation we consider the infinite entropy variant of the double chain environment in Figure 5, and add zero-mean Gaussian noise to the instantaneous cost, with the noise variance depending on the state-action pair. We ensure a fair comparison by allowing both algorithms an equal number of interactions with the environment. As illustrated in Figure 7(b1), the algorithm in [16] diverges during the last stages of learning owing to the noise in the environment, whereas our MEP-based algorithm performs better in the same environment.
Appendix D Design of Self-Organizing Networks
Here, we consider the scenario of a self-organizing network that is responsible for providing good WiFi access at all the relevant locations in a hotel lobby, where the number of routers and the location of the modem are pre-specified. Figure 8(a) illustrates the floor plan of a hotel lobby with areas such as the reception, the manager’s office, and the cyber space clearly marked. We distribute user nodes over the provided floor plan based on the number of WiFi users in each of these areas, and assume that the modem is located near the reception area. Figure 8(b) shows the user node and modem locations on a 2D graph. The objective is to allocate routers and design communication routes from the modem to each user via the routers so as to maximize the cumulative signal strength at each user node; we design this network in both the model-based and the model-free settings. For the purpose of simulation we assume that the signal strength is inversely proportional to the square of the Euclidean distance (Q. Qin et al., “SDN controller placement with delay-overhead balancing in wireless edge networks,” IEEE Transactions on Network and Service Management, vol. 15, no. 4, pp. 1446–1459, 2018). Figure 8(c) shows the allocated routers and the communication routes in the model-based scenario, and Figure 8(d) shows them for the model-free setting.
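To make the cost model concrete, the sketch below scores a candidate communication route under the assumed inverse-square signal model. The scaling constant `k` and the function names are illustrative, not taken from our implementation.

```python
import numpy as np

def signal_strength(p, q, k=1.0):
    """Signal strength between two planar points, modeled as inversely
    proportional to the squared Euclidean distance (k is an illustrative
    scaling constant)."""
    d2 = float(np.sum((np.asarray(p, dtype=float) - np.asarray(q, dtype=float)) ** 2))
    return k / d2

def route_strength(route, positions):
    """Cumulative signal strength along a modem -> routers -> user route.

    route     : list of node ids, starting at the modem and ending at a user
    positions : dict mapping node id -> (x, y) coordinates
    """
    return sum(signal_strength(positions[a], positions[b])
               for a, b in zip(route[:-1], route[1:]))
```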
[Figure 8: Hotel lobby floor plan, user node and modem locations, and the router allocations and communication routes obtained in the model-based and model-free settings.]
Appendix E Choice of Annealing Parameters
Algorithm 1: We follow a linear annealing schedule in our simulations on the model-free RL setting, as suggested for the entropy regularized (Soft Q) benchmark algorithm in [14]. The idea is to anneal the parameter from a small value (at which the policy in (9) is explorative) to a large value (at which the policy becomes exploitative). As suggested in [14], the schedule parameters can be obtained by performing initial runs (for a small number of iterations) with different candidate values, and picking the one that results in the lowest value function corresponding to the learned policy. The value identified for a particular domain can then be re-used in similar domains without the need to perform any initial runs.
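A minimal sketch of such a linear schedule is given below; the start value `beta_min` and increment `delta_beta` are the quantities that would be tuned through the short initial runs described above, and their names are ours.

```python
def linear_beta_schedule(beta_min, delta_beta, num_episodes):
    """Linear annealing schedule for the Lagrange parameter: it grows from a
    small value (explorative policy) to a large value (exploitative policy)."""
    return [beta_min + k * delta_beta for k in range(num_episodes)]
```

Short initial runs with a few candidate `(beta_min, delta_beta)` pairs can then be compared through the value functions of the resulting learned policies, as described above.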
Algorithms 2 and 3: The annealing rate is chosen to be greater than one. The resulting schedule geometrically anneals the Lagrange parameter from a small value to a large value, at which the control policy in (9) converges to a deterministic policy (action probabilities of 0 or 1). In the case of parameterized MDPs, this facilitates a homotopy from a convex function to the (possibly) non-convex cost in (22), and prevents the algorithm from getting stuck in poor local minima. For the simulations in Figure 4 we use a fixed value of the annealing rate.
A practical method to determine an appropriate annealing rate is as follows. Start with a large initial estimate of the rate. If the parameter values obtained in the initial iterations of the algorithm oscillate significantly, decrease the rate. Choose the value at which the observed parameter values stop oscillating during the initial iterations. This practical method is rooted in the phenomenon of phase transition that our Algorithms 2 and 3 undergo, which we illustrate further below.
Algorithms 2 and 3 address the class of parameterized MDPs, simultaneously solving for an unknown parameter and the control policy. These problems are akin to the facility location problem (FLP) addressed in [7]. In particular, [7] develops a Maximum Entropy Principle framework to determine the locations of the facilities (the parameter) and to associate each user node with the closest facility (the control policy). In the resulting algorithm (popularly known as Deterministic Annealing (DA)), it is observed that the solution changes significantly only at certain critical values of the annealing parameter, which correspond to the instances of phase transition; at other values the solution does not change much [51]. It has been observed that a geometric annealing law suffices to capture the changes in the solution that occur during the phase transitions. Thus, with reference to the above method for choosing the annealing rate, if the initial iterations result in high variation of the parameter values, then some phase transitions are possibly being skipped, and the user needs to anneal more slowly (i.e., reduce the annealing rate) to capture all the phase transitions appropriately.
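The sketch below illustrates the geometric schedule together with the oscillation check of the practical method above; the oscillation measure, the shrink factor, and the threshold are our assumptions.

```python
import numpy as np

def geometric_beta_schedule(beta_min, beta_max, rate):
    """Geometric annealing schedule beta <- rate * beta, with rate > 1,
    from beta_min up to beta_max (where the policy becomes deterministic)."""
    betas, beta = [], beta_min
    while beta <= beta_max:
        betas.append(beta)
        beta *= rate
    return betas

def reduce_rate_if_oscillating(param_trace, rate, shrink=0.9, threshold=0.1):
    """Heuristic for the practical method above: if the parameter values
    observed in the initial iterations oscillate strongly, anneal more
    slowly by shrinking the annealing rate toward one."""
    oscillation = float(np.std(np.diff(np.asarray(param_trace, dtype=float), axis=0)))
    return 1.0 + (rate - 1.0) * shrink if oscillation > threshold else rate
```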
Appendix F List of Symbols
state at time $t$ | action at time $t$
state space | action space
control policy | state transition probability
cost-free termination state | discount factor
value function under a policy | optimal control policy
path of the MDP | distribution over the paths
set of proper policies | Shannon entropy of the distribution
annealing (Lagrange) parameter | prespecified constant value
Lagrangian (free energy) | shorthand notation
shorthand notation | shorthand notation
shorthand notation | Lagrange parameter in Lemma 2
optimal policy under MEP | state-action value function
optimal value function | weight defining the weighted norm for the contraction mapping
weighted norm | contraction constant
estimate at time $t$ | learning rate at time $t$
minimum value | maximum value
number of episodes | annealing rate
discounted Shannon entropy | discount factor for the Shannon entropy
free energy for the infinite entropy case of the MDP | shorthand notation
optimal policy for the infinite entropy MDP | state-action value function
optimal free-energy function | set of unknown state parameters
set of unknown action parameters | value function for parameterized MDPs
state at time $t$ with parameter | action at time $t$ with parameter
optimal control policy | optimal value function
derivative | derivative
derivatives | derivatives
learned estimate at time $t$ | finite constants