
Optimal Sensor and Actuator Selection for Factored Markov Decision Processes: Complexity, Approximability and Algorithms

Jayanth Bhargav, Mahsa Ghasemi and Shreyas Sundaram
Abstract

Factored Markov Decision Processes (fMDPs) are a class of Markov Decision Processes (MDPs) in which the states (and actions) can be factored into a set of state (and action) variables. The state space, action space and reward function of an fMDP can be encoded compactly using a factored representation. In this paper, we consider the setting where we have a set of potential sensors to select for the fMDP (at design-time), where each sensor measures a certain state variable and has a selection cost. We formulate the problem of selecting an optimal set of sensors for fMDPs (subject to certain budget constraints) to maximize the expected infinite-horizon discounted return provided by the optimal control policy. We show the fundamental result that it is NP-hard to approximate this optimization problem to within any non-trivial factor. We then study the dual problem of budgeted actuator selection (at design-time) to maximize the expected return under the optimal policy. Again, we show that it is NP-hard to approximate this optimization problem to within any non-trivial factor. Furthermore, with explicit examples, we show the failure of greedy algorithms for both the sensor and actuator selection problems and provide insights into the factors that make these problems challenging. Despite the inapproximability results, through extensive simulations, we show that the greedy algorithm may provide near-optimal performance for actuator and sensor selection in many real-world and randomly generated fMDP instances. (This material is based upon work supported by the Office of Naval Research (ONR) via Contract No. N00014-23-C-1016 and under subcontract to Saab, Inc. as part of the TSUNOMI project. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of ONR, the U.S. Government, or Saab, Inc.)
The authors are with the Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette IN 47907 USA. Email addresses: {jbhargav, mahsa, sundara2}@purdue.edu

Keywords: Computational complexity, Greedy algorithms, Markov Decision Processes, Optimization, State estimation, Sensor and Actuator selection.

1 Introduction

Markov Decision Processes (MDPs) have been widely used to model systems in decision-making problems such as autonomous driving, multi-agent robotic task execution, large data center operation and machine maintenance [1, 2]. In the MDP framework, the states of the system evolve stochastically according to a probabilistic state transition function. In many real-world sequential decision-making problems, the state space is quite large, growing exponentially with the number of state variables. However, large MDPs often admit significant internal structure, which can be exploited to model and represent them compactly. The idea of compactly representing a large structured MDP using a factored model was first proposed in [3]. In this framework, the state of a large MDP is factored into a set of state variables, each taking values from its respective domain. A dynamic Bayesian network (DBN) can then be used to represent the transition model. Assuming that the transition of a state variable depends only on a small subset of all the state variables, a DBN can capture this dependence very compactly. Furthermore, the reward function can also be factored into a sum of rewards that each depend on an individual variable or a small subset of variables.

Factored MDPs exploit both additive and context-specific structure in large-scale systems. However, even with a factored representation, exactly solving such large fMDPs may remain intractable. A significant amount of work has focused on computing the optimal policy for fMDPs [4, 5] and their variants, such as Partially Observable MDPs (POMDPs) [6, 7] and Mixed-Observable MDPs (MOMDPs) [8]. However, these algorithms primarily focus on reducing the computational time required to solve MDPs (and their variants) with large state spaces; they do not study the problem of sensor and actuator set selection for such systems in order to achieve optimal performance. While the problem of sensor and actuator selection has been well studied for other classes of systems (e.g., linear systems) [9, 10, 11], there has been no prior work on optimal sensor/actuator selection for fMDPs.

1.1 Motivation

Many applications in large-scale networks, such as congestion control [12, 13], load balancing [14, 15] and energy optimization in large data centers [1, 2], suffer from limited sensing and actuation resources. In many autonomous systems, the number of sensors or actuators that can be installed is limited by a budget and by system design constraints [16]. In robotics applications, system designers often face the challenge of achieving design objectives such as maximizing the observability and performance of the robot [16, 17, 18] under limited sensing or actuation resources.

Scenario 1: Consider a task where a team of mobile robots must simultaneously localize themselves (i.e., estimate their states) and perform tasks in an environment (see Figure 1). The robots can communicate with sensors deployed in the environment to localize themselves. Each robot has an underlying MDP and receives rewards for successfully completing tasks. Due to a limited budget, a central designer can deploy only a limited number of sensors in the environment. The designer has to select an optimal subset of sensors to place in order to localize (i.e., estimate the states of) the robots, so as to maximize the overall reward obtained by the team of robots.

Figure 1: An environment where a team of mobile robots have to collectively localize themselves and perform various tasks.

Scenario 2: Consider a complex electric distribution network consisting of interconnected micro-grid systems, each containing generation stations, substations, transmission lines and electric loads, as shown in Figure 2. Each node in the network is prone to electric faults (e.g., generator faults, short-circuit faults in the load). Faults in a node can potentially cascade and cause a catastrophic power blackout in the entire network. Due to a limited budget, a grid safety designer can select only a subset of nodes at which sensors and actuators are installed, so as to minimize fault percolation in the network by identifying and isolating (i.e., islanding) certain critical nodes.

The multi-agent systems described above can be modeled as MDPs, and in practice they often have exponentially large state and action spaces (e.g., due to a large number of nodes in the network). In such cases, one can leverage the internal structure of these systems and model them as fMDPs. If one can only measure a subset of the state variables using a limited number of sensors, or can only influence the transitions of a subset of the state variables using a limited number of actuators, one is faced with the problem of selecting the set of sensors (or actuators) that yields the best performance of the resulting fMDP, a problem that has not been well studied in the literature.

In this paper, we focus on two problems related to fMDPs: (i) selecting the best set of sensors at design-time (under budget constraints) for an fMDP so as to maximize the expected infinite-horizon return under the resulting optimal policy, and (ii) selecting the best set of actuators at design-time (under budget constraints) for an fMDP so as to maximize the expected infinite-horizon return under the resulting optimal policy.

Figure 2: A distributed micro-grid network with generation stations, transmission systems, and electric loads [19].

1.2 Related Work

Sequential sensor placement/selection has been studied in the context of MDPs and their variants, such as POMDPs. In [20], the authors consider active perception under a limited budget for POMDPs in order to selectively gather information at runtime. In contrast, we consider design-time sensor/actuator selection for fMDPs, where the sensor/actuator set is not allowed to change dynamically at runtime. A body of literature considers the problem of sensor scheduling for sequential decision-making tasks and models the task of sensor placement itself as a POMDP [21, 22]. However, we consider the problem of selecting the optimal set of sensors (or actuators) a priori (at design-time) for an fMDP, not sequentially.

The problem of selecting an optimal subset of sensors has also been well studied for linear systems. In [9] and [10], the authors study the sensor selection and sensor attack problems for Kalman filtering of linear dynamical systems, where the objective is to reduce the trace of the steady-state error covariance of the filter. The authors of [10] show that these problems are NP-hard and that there is no polynomial-time constant-factor approximation algorithm for this class of problems. In [23], the authors propose balanced model reduction and greedy optimization techniques to efficiently determine sensor and actuator selections that optimize observability and controllability for linear systems. In [24], the authors show that the mapping from subsets of possible actuator/sensor placements to any linear function of the associated controllability or observability Gramian is a modular set function; this strong property allows efficient global optimization.

Many works have considered the problem of actuator selection for controlling and stabilizing large-scale power grids. The authors of [25] consider the problem of optimal placement of High Voltage Direct Current (HVDC) links in a power system; HVDC links are actuators used to damp oscillations in a power grid. They propose a performance measure, computed using Linear Matrix Inequalities (LMIs), that can be used to rank candidate actuators according to their behavior after a disturbance. In [26], the authors propose a Semi-Definite Programming (SDP) algorithm for the placement of HVDC links, which optimizes the LMI-based performance measure proposed in [25]. The authors of [26] state that this technique can be applied to other actuator selection problems, such as Flexible AC Transmission System (FACTS) controllers and Power System Stabilizers (PSS). However, these techniques use a linearized state-space model of the power system around the current operating point, whereas in this paper we consider fMDP models, which need not have a linear structure.

For combinatorially hard sensor or actuator selection problems, various approximation algorithms have been shown to produce near-optimal solutions [27, 28]. In [29], the authors leverage the adaptive submodularity of a sensor area-coverage metric to learn an adaptive greedy policy for sequential sensor placement with provable performance guarantees; in contrast, we consider the design-time sensor selection problem for fMDPs. The authors of [30] exploit the weak submodularity of the objective function and provide near-optimal greedy algorithms for sensor selection. In contrast to these results, we present worst-case inapproximability results, demonstrate how greedy algorithms for both sensor and actuator selection in fMDPs can perform arbitrarily poorly, and show that the value function of an fMDP is not, in general, submodular in the set of sensors (or actuators) selected.

1.3 Contributions

Our contributions are as follows. First, we show that the problem of selecting an optimal subset of sensors at design-time for a general class of factored MDPs is NP-hard, and that there is no polynomial-time algorithm that can approximate this problem to within a factor of $n^{1-\epsilon}$ of the optimal solution, for any $\epsilon>0$, even when one has access to an oracle that can compute the optimal policy for any given instance (a task that is itself known to be PSPACE-hard [31]). Second, we show that the same inapproximability results hold for the problem of selecting an optimal subset of actuators at design-time for a general class of factored MDPs. Our inapproximability results directly imply that greedy algorithms cannot provide constant-factor guarantees for these problems, and that the value function of the fMDP is not, in general, submodular in the set of sensors (resp., actuators) selected. Our third contribution is to explicitly show how greedy algorithms can provide arbitrarily poor performance for the fMDP-SS (resp., fMDP-AS) problem. Finally, through empirical results, we show that the greedy algorithm may provide near-optimal solutions for many instances in practice, despite the inapproximability results.

We considered the problem of optimal sensor selection for Mixed-Observable MDPs (MOMDPs) in the conference paper [32]. We proved its NP-hardness and provided insights into the lack of submodularity of the value function. However, in this paper, we present stronger inapproximability results for both sensor and actuator selection for a general class of factored MDPs, of which MOMDPs are a special case.

2 Problem Formulation

In this section, we formally state the sensor and actuator selection problems we consider in this paper. A general factored MDP is defined by the tuple $\mathcal{M}:=(\mathcal{S}=\mathcal{S}_{1}\times\cdots\times\mathcal{S}_{n},\ \mathcal{A}=\mathcal{A}_{1}\times\cdots\times\mathcal{A}_{n},\ \mathcal{T},\mathcal{R},\gamma)$, where $\mathcal{S}$ is the state space decomposed into finite sub-spaces $\mathcal{S}_{i}$, each corresponding to a state variable $s_{i}$ (which takes values from $\mathcal{S}_{i}$); $\mathcal{A}$ is the action space decomposed into finite sub-spaces $\mathcal{A}_{i}$, each corresponding to an action $a_{i}$ (which takes values from $\mathcal{A}_{i}$); $\mathcal{T}$ is the probabilistic state transition model; $\mathcal{R}$ is the reward function; and $\gamma\in(0,1)$ is the discount factor. The state transition model $\mathcal{T}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]$ is usually represented using a DBN, which consists of a directed acyclic transition graph $G_{\mathcal{T}}$ that captures the dependence between the state transitions of the variables $s_{i}$. The reward function $\mathcal{R}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ is a scalar function of the factored state variables $s_{i}$ and actions $a_{i}$. In many applications, the overall reward function $\mathcal{R}$ is the sum of local rewards $\mathcal{R}_{i}$, each of which depends on the local state-action pair $(s_{i},a_{i})$.
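To make the factored representation concrete, the following minimal Python sketch (illustrative only; the class and attribute names are assumptions, not code from the paper) stores one transition matrix per state variable and action, and an additive reward built from local terms $\mathcal{R}_{i}(s_{i},a_{i})$:

```python
import numpy as np

# Minimal sketch of a factored MDP in which each state variable s_i evolves according
# to its own action a_i, and the reward is a sum of local terms R_i(s_i, a_i).
class FactoredMDP:
    def __init__(self, transitions, local_rewards, gamma):
        # transitions[i][a_i] is an |S_i| x |S_i| matrix with entries P(s_i' | s_i, a_i)
        self.transitions = transitions
        # local_rewards[i](s_i, a_i) -> float, the local reward term R_i
        self.local_rewards = local_rewards
        self.gamma = gamma

    def step(self, state, action, rng):
        """Sample the next joint state and the additive reward; rng = np.random.default_rng()."""
        next_state = []
        for i, (s_i, a_i) in enumerate(zip(state, action)):
            p = self.transitions[i][a_i][s_i]          # one DBN slice per variable
            next_state.append(int(rng.choice(len(p), p=p)))
        reward = sum(r(s_i, a_i) for r, s_i, a_i in zip(self.local_rewards, state, action))
        return tuple(next_state), reward
```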

2.1 The Sensor Selection Problem

Consider the scenario where there are no sensors installed initially, and the agent does not have observability of the state variables $s_{i}$ of the fMDP. Instead, the agent has to select a subset of sensors to install. Define $\Omega=\{\omega_{i}\mid i=1,2,\ldots,n\}$ to be a collection of sensors, where sensor $\omega_{i}$ measures exactly the state variable $s_{i}$. Let $c_{i}\in\mathbb{R}_{\geq 0}$ be the cost we pay to measure state $s_{i}$ by placing the sensor $\omega_{i}$, and let $C\in\mathbb{R}_{>0}$ denote the total budget for the sensor placement. Let $\Gamma\subseteq\Omega$ be the subset of sensors selected (at design-time), which generates observations $y_{\Gamma}(t)=\{s_{i}(t)\mid\omega_{i}\in\Gamma\}$. At time $t$, the agent has the following information: observations $Y^{\Gamma}_{t}=\{y_{\Gamma}(0),y_{\Gamma}(1),\ldots,y_{\Gamma}(t)\}$, joint actions $A_{t}=\{a(0),a(1),\ldots,a(t-1)\}$ and rewards $R_{t}=\{r(0),r(1),\ldots,r(t-1)\}$. Since the agent does not have complete observability, it maintains a belief over the states of the fMDP, $b\in\mathcal{B}$, where $\mathcal{B}$ is the belief space, i.e., the set of probability distributions over the states in $\mathcal{S}$.

In this case, the expected reward is belief-based, denoted $\rho(b,a)$, and is given by $\rho(b,a)=\sum_{s}b(s)r(s,a)$, where $b(s)$ is the belief over the state $s$, $a$ is the action and $r(s,a)$ is the reward obtained for taking action $a$ in state $s$. For any given choice of sensors, the agent seeks to maximize the expected infinite-horizon reward, given the initial belief $b_{0}$, by finding an optimal policy satisfying $\pi^{*}=\arg\max_{\pi\in\Pi}V^{\pi}(b_{0})$ with $V^{\pi}(b_{0})=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}\rho_{t}\mid b_{0},\pi\right]$, where $\rho_{t}$ is the reward obtained at time $t$. The policy $\pi$ belongs to a class of policies $\Pi$ that map the history of observations, actions, and rewards to the next action. The value function $V^{\pi}_{n}(b)$ under the policy $\pi$ can be computed using the value iteration algorithm with $V^{\pi}_{0}(b)=0$, and $V^{\pi}(b)=\lim_{n\rightarrow\infty}V^{\pi}_{n}(b)$ (as in [8]).
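As an illustration of the belief-based quantities above, the sketch below (assumed helper functions, not from the paper) computes $\rho(b,a)=\sum_{s}b(s)r(s,a)$ and performs a Bayesian belief update in the noiseless-sensing setting used here, where the selected sensors reveal their state variables exactly:

```python
import numpy as np

# Illustrative helpers for the belief-based reward rho(b, a) and a Bayesian belief update.
def belief_reward(belief, reward_for_action):
    # belief: probability vector over joint states; reward_for_action[s] = r(s, a) for a fixed a
    return float(np.dot(belief, reward_for_action))

def belief_update(belief, T_a, observation_mask):
    # Prediction: b'(s') = sum_s T(s' | s, a) b(s), with T_a[s, s'] = T(s' | s, a).
    predicted = belief @ T_a
    # Correction: keep only joint states consistent with the measured variables y_Gamma(t)
    # (observation_mask[s'] = 1 if s' agrees with the observation, 0 otherwise), then renormalize.
    corrected = predicted * observation_mask
    return corrected / corrected.sum()
```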

Let $\mathcal{H}_{t}=\{Y^{\Gamma}_{t},A_{t},R_{t}\}$ denote the set containing all the information the agent has up to time $t$. Define $\Pi_{\Gamma}=\{\pi_{\Gamma}\mid\pi_{\Gamma}:\mathcal{H}_{t}\rightarrow\mathcal{A}\}$ to be the class of history-dependent policies that map the information available to the agent up to time $t$ to the action $a_{t}$ the agent takes at time $t$. The goal is to find an optimal subset of sensors $\Gamma^{*}\subseteq\Omega$, under the budget constraint, that maximizes the expected value $V^{*}_{\Gamma}$ of the infinite-horizon discounted return obtained by the agent under the optimal policy for the resulting POMDP. Specifically, we aim to solve the optimization problem:

\[
\max_{\Gamma\subseteq\Omega}\; V_{\Gamma}^{*} \quad \text{s.t.} \quad \sum_{\omega_{i}\in\Gamma}c_{i}\leq C.
\]

We now define the decision version for the above optimization problem as the Factored Markov Decision Process Sensor Selection Problem (fMDP-SS Problem).

Problem 1 (fMDP-SS Problem).

Consider an fMDP $\mathcal{M}$ and a set of $n$ sensors $\Omega$, where each sensor $\omega_{i}\in\Omega$ is associated with a cost $c_{i}\in\mathbb{R}_{\geq 0}$. For a value $V\in\mathbb{R}$ and sensor budget $C\in\mathbb{R}_{>0}$, is there a subset of sensors $\Gamma\subseteq\Omega$ such that the expected return $V_{\Gamma}^{*}$ under the optimal policy in $\Pi_{\Gamma}$ satisfies $V_{\Gamma}^{*}\geq V$ and the total cost of the sensors satisfies $\sum_{\omega_{i}\in\Gamma}c_{i}\leq C$?

2.2 The Actuator Selection Problem

Consider the scenario where there are no actuators installed initially, and the agent cannot influence the transitions of the state variables $s_{i}$ of the fMDP. However, the agent has complete observability of the state variables $s_{i}$. In this case, the agent has to select a subset of actuators to install. Define $\Phi=\{\phi_{i}\mid i=1,2,\ldots,n\}$ to be a collection of actuators, where the actuator $\phi_{i}$ provides a set of actions $\mathcal{A}_{i}$, and an action $a_{i}\in\mathcal{A}_{i}$ can influence the state transition of $s_{i}$. Let $k_{i}\in\mathbb{R}_{\geq 0}$ be the cost we pay to install the actuator $\phi_{i}$, and let $K\in\mathbb{R}_{>0}$ denote the total budget for the actuator placement. Let $\Upsilon\subseteq\Phi$ be the subset of actuators selected (at design-time).

Let $\mathcal{A}_{\Upsilon}=\prod_{\phi_{i}\in\Upsilon}\mathcal{A}_{i}$ denote the joint action space available to the agent when the actuator set $\Upsilon$ is selected. If an actuator is not selected, then the agent has a default action $a_{d}$, and by taking this action, the agent stays in its current state with probability 1. Since the agent has complete observability of the state, the rewards depend on the joint state-action pairs, i.e., $r(s,\bar{a})$ is the reward obtained by the agent for taking the joint action $\bar{a}$ in state $s$. Define $\Pi_{\Upsilon}=\{\pi_{\Upsilon}\mid\pi_{\Upsilon}:\mathcal{H}_{t}\rightarrow\mathcal{A}_{\Upsilon}\}$ to be the class of history-dependent policies that map the information $\mathcal{H}_{t}$ available to the agent up to time $t$ to the joint action $\bar{a}_{t}\in\mathcal{A}_{\Upsilon}$ the agent takes at time $t$.

The goal of the agent is to choose the best actuator set $\Upsilon^{*}\subseteq\Phi$ that maximizes the expected infinite-horizon discounted return under the optimal policy of the resulting MDP. The expected infinite-horizon discounted return is computed for the optimal policy satisfying $\pi^{*}=\arg\max_{\pi\in\Pi_{\Upsilon}}V^{\pi}$ with $V^{\pi}=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\mid\pi\right]$, where $r_{t}$ is the reward obtained at time $t$. Let $V_{\Upsilon}^{*}$ denote the expected infinite-horizon discounted return under the optimal policy for the actuator set $\Upsilon$. We aim to solve the optimization problem:

\[
\max_{\Upsilon\subseteq\Phi}\; V_{\Upsilon}^{*} \quad \text{s.t.} \quad \sum_{\phi_{i}\in\Upsilon}k_{i}\leq K.
\]

We now define the decision version of the above optimization problem as the Factored Markov Decision Process Actuator Selection Problem (fMDP-AS Problem).

Problem 2 (fMDP-AS Problem).

Consider an fMDP $\mathcal{M}$ and a set of $n$ actuators $\Phi$, where each actuator $\phi_{i}\in\Phi$ is associated with a cost $k_{i}\in\mathbb{R}_{\geq 0}$. For a value $V\in\mathbb{R}$ and actuator budget $K\in\mathbb{R}_{>0}$, is there a subset of actuators $\Upsilon\subseteq\Phi$ such that the expected return $V_{\Upsilon}^{*}$ under the optimal policy in $\Pi_{\Upsilon}$ satisfies $V_{\Upsilon}^{*}\geq V$ and the total cost of the actuators selected satisfies $\sum_{\phi_{i}\in\Upsilon}k_{i}\leq K$?

3 Complexity and Approximability Analysis

In this section, we analyze the approximability of the fMDP-SS and fMDP-AS problems. We will start with a brief overview of some relevant concepts from the field of computational complexity, and then provide some preliminary lemmas that we will use in proving our results. That will lead to our characterizations of the complexity of fMDP-SS and fMDP-AS.

3.1 Review of Complexity Theory

We first review the following fundamental concepts from complexity theory [33].

Definition 1.

A polynomial-time algorithm for a problem is an algorithm that returns a solution to the problem in a polynomial (in the size of the problem) number of computations.

Definition 2.

A decision problem is a problem whose answer is “yes” or “no”. The set P contains those decision problems that can be solved by a polynomial-time algorithm. The set NP contains those decision problems whose “yes” answer can be verified using a polynomial-time algorithm.

Definition 3.

An optimization problem is a problem whose objective is to maximize or minimize a certain quantity, possibly subject to constraints.

Definition 4.

A problem $\mathcal{P}_{1}$ is NP-complete if (a) $\mathcal{P}_{1}\in\mathrm{NP}$ and (b) for any problem $\mathcal{P}_{2}$ in NP, there exists a polynomial-time algorithm that converts (or “reduces”) any instance of $\mathcal{P}_{2}$ to an instance of $\mathcal{P}_{1}$ such that the answer to the constructed instance of $\mathcal{P}_{1}$ provides the answer to the instance of $\mathcal{P}_{2}$. $\mathcal{P}_{1}$ is NP-hard if it satisfies (b), but not necessarily (a).

The above definition indicates that if one had a polynomial-time algorithm for an NP-complete (or NP-hard) problem, then one could solve every problem in NP in polynomial time. Specifically, suppose we had a polynomial-time algorithm to solve an NP-hard problem $\mathcal{P}_{1}$. Then, given any problem $\mathcal{P}_{2}$ in $\mathrm{NP}$, one could first reduce any instance of $\mathcal{P}_{2}$ to an instance of $\mathcal{P}_{1}$ in polynomial time (such that the answer to the constructed instance of $\mathcal{P}_{1}$ provides the answer to the given instance of $\mathcal{P}_{2}$), and then use the polynomial-time algorithm for $\mathcal{P}_{1}$ to obtain the answer to $\mathcal{P}_{2}$.

The above discussion also reveals that to show that a given problem $\mathcal{P}_{1}$ is NP-hard, one simply needs to show that any instance of some other NP-hard (or NP-complete) problem $\mathcal{P}_{2}$ can be reduced to an instance of $\mathcal{P}_{1}$ in polynomial time.

For NP-hard optimization problems, polynomial-time approximation algorithms are of particular interest. A constant factor approximation algorithm is defined as follows.

Definition 5.

A constant-factor approximation algorithm for an optimization problem is an algorithm that always returns a solution within a constant (system-independent) factor of the optimal solution.

In [32], we showed that the problem of selecting sensors for MOMDPs (which are a special class of POMDPs and fMDPs) is NP-hard. In this paper, we show the stronger result that there is no polynomial-time algorithm that can approximate the fMDP-SS (resp., fMDP-AS) Problem to within a factor of $n^{1-\epsilon}$ for any $\epsilon>0$, even when all the sensors (resp., actuators) have the same selection cost. Specifically, we consider a well-known NP-complete problem and show how to reduce it to certain instances of fMDP-SS (resp., fMDP-AS) in polynomial time, such that hypothetical polynomial-time approximation algorithms for the latter problems could be used to solve the known NP-complete problem. In particular, inspired by the reductions from the Set Cover problem to influence maximization problems in social networks presented in [34] and [35], we use the Set Cover problem and provide polynomial-time reductions to the fMDP-SS and fMDP-AS problems to establish our inapproximability results.

3.2 Set Cover Problem

The Set Cover Problem is a classical question in combinatorics and complexity theory. Given a set of $n$ elements $\mathcal{U}=\{u_{1},u_{2},\ldots,u_{n}\}$, called the universe, and a collection of $m$ subsets $\mathbb{S}=\{S_{1},S_{2},\ldots,S_{m}\mid S_{i}\subseteq\mathcal{U}\}$, where each subset $S_{i}$ is associated with a cost $c(S_{i})\in\mathbb{R}_{\geq 0}$ and the union of these subsets equals the universe, $\mathcal{U}=S_{1}\cup S_{2}\cup\cdots\cup S_{m}$, the Set Cover problem is to identify the minimum-cost collection of subsets in $\mathbb{S}$ whose union contains all the elements of the universe. Let $\mathbb{S}_{c}$ denote the collection of subsets selected. We wish to solve the following optimization problem:

\[
\min_{\mathbb{S}_{c}\subseteq\mathbb{S}}\; \sum_{S_{i}\in\mathbb{S}_{c}}c(S_{i}) \quad \text{s.t.} \quad \bigcup_{S_{i}\in\mathbb{S}_{c}}S_{i}=\mathcal{U}.
\]

We will now define the decision version of this problem, under uniform set selection costs, as the SetCover Problem.

Problem 3 (SetCover Problem).

Consider a universal set of $n$ elements $\mathcal{U}:=\{u_{1},u_{2},\ldots,u_{n}\}$ and a collection of subsets $\mathbb{S}:=\{S_{1},S_{2},\ldots,S_{m}\}$. For a positive integer $k$, is there a collection $\mathbb{S}_{k}$ of at most $k$ subsets $S_{i}$ in $\mathbb{S}$ such that $\bigcup_{S_{i}\in\mathbb{S}_{k}}S_{i}=\mathcal{U}$?

The SetCover Problem is NP-Complete [36].

3.3 Inapproximability of Sensor Selection Problem

In this section, we present the inapproximability results for fMDP sensor selection by reducing an instance of the SetCover problem to the fMDP-SS problem. We first present a preliminary lemma, which we will use to characterize the complexity of fMDP-SS.

Example 1: Consider an MDP $\bar{\mathcal{M}}:=\{\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R},\gamma,b_{0}\}$ with state space $\mathcal{S}=\{A,B,C,D\}$, actions $\mathcal{A}=\{0,1,2\}$, and transition matrices

\[
\mathcal{T}_{0}=\begin{bmatrix}0.5&0.5&0&0\\ 0.5&0.5&0&0\\ 0&0&1&0\\ 0&0&0&1\end{bmatrix},\quad
\mathcal{T}_{1}=\begin{bmatrix}0&0&1&0\\ 0&0&0&1\\ 0&0&1&0\\ 0&0&0&1\end{bmatrix},\quad
\mathcal{T}_{2}=\begin{bmatrix}0&0&0&1\\ 0&0&1&0\\ 0&0&1&0\\ 0&0&0&1\end{bmatrix}
\]

for $a=0$, $a=1$ and $a=2$, respectively. The reward function $\mathcal{R}(s,a)$ is given by $r(A,0)=0$, $r(B,0)=0$, $r(A,1)=R$, $r(B,1)=-(1+\delta)R$, $r(A,2)=-(1+\delta)R$, $r(B,2)=R$, $r(C,\cdot)=R$, $r(D,\cdot)=-(1+\delta)R$, with $R,\delta>0$; the discount factor is $\gamma\in(0,1)$ and the initial distribution is $b_{0}=[0.5,0.5,0,0]$. Fig. 3 shows the state-action transitions along with their probabilities. The state space of this MDP corresponds to a single state variable $s$, which the agent can measure by selecting a noiseless sensor $\omega=s$. The initial state of the MDP at $t=0$ is either $s(0)=A$ or $s(0)=B$, with equal probability.

Figure 3: State transition diagram of $\bar{\mathcal{M}}$.
Lemma 1.

For the MDP $\bar{\mathcal{M}}$ defined in Example 1, the following holds for any $\gamma\in(0,1)$:

  • (i) If the state of $\bar{\mathcal{M}}$ is measured (i.e., observable) using the sensor $\omega$, the infinite-horizon expected return under the optimal policy is $V^{*}(s)=\frac{R}{1-\gamma}$, and the optimal policy ensures that the state of the MDP is $s_{t}=C$ for all $t>0$.

  • (ii) If the state of $\bar{\mathcal{M}}$ is not measured, i.e., there is no sensor installed and the agent only has access to the sequence of actions and rewards but not the current state $s$, then the infinite-horizon expected return from the initial belief $b_{0}$ under the optimal policy is $V^{*}(b)=0$, and the optimal policy ensures that the state of the MDP is $s_{t}\notin\{C,D\}$ for all $t>0$.

Proof.

We prove (i) and (ii) as follows.
Case (i): Consider the case when the state of the MDP $\bar{\mathcal{M}}$ is measured using the sensor $\omega$. Based on the specified reward function, the agent can obtain the maximum reward $R$ at each time-step by taking action 1 if $s=A$, action 2 if $s=B$, and any action if $s=C$. This yields $V^{*}(s)=\max_{\pi_{\Gamma}}V^{\pi_{\Gamma}}(s)=\sum_{t=0}^{\infty}\gamma^{t}R=\frac{R}{1-\gamma}$.
Case (ii): Consider the case when the state of the MDP $\bar{\mathcal{M}}$ is not measured (i.e., the sensor $\omega$ is not selected and, as a result, the agent does not know the current state but only has access to the sequence of actions and rewards).

Due to the uncertainty in the state, the agent maintains a belief $b$ and performs a Bayesian update of this belief at each time step using the information it has (i.e., the history of actions and observations) [6]. Consider the initial belief $b_{0}=[0.5,0.5,0,0]$. One can verify the following claim: since the agent has an equal probability of being in either state A or state B, it is not optimal for the agent to take either action $a=1$ or $a=2$, since either may lead to a large negative reward of $-(1+\delta)R$ by reaching the absorbing state D. The optimal policy is therefore to always take action $a=0$. Thus, the state of the MDP will always be either A or B with equal probability, and as a result, the belief will remain $b_{t}=[0.5,0.5,0,0]$ for all $t>0$. Since taking action 0 in both states A and B gives a reward of 0, the expected infinite-horizon reward under the optimal policy is $V^{*}(b)=0$. ∎
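As a quick numerical check of Lemma 1(i), the following sketch (illustrative only, with arbitrarily chosen values $R=1$, $\delta=0.5$, $\gamma=0.9$) runs standard value iteration on the fully observed MDP of Example 1 and confirms that $V^{*}(A)=V^{*}(B)=V^{*}(C)=R/(1-\gamma)$:

```python
import numpy as np

R, delta, gamma = 1.0, 0.5, 0.9
# Transition matrices T[a][s, s'] for actions 0, 1, 2 over states (A, B, C, D).
T = np.array([
    [[0.5, 0.5, 0, 0], [0.5, 0.5, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
    [[0, 0, 1, 0],     [0, 0, 0, 1],     [0, 0, 1, 0], [0, 0, 0, 1]],
    [[0, 0, 0, 1],     [0, 0, 1, 0],     [0, 0, 1, 0], [0, 0, 0, 1]],
])
# Rewards r[s, a] as specified in Example 1.
r = np.array([
    [0.0,  R,               -(1 + delta) * R],   # state A
    [0.0, -(1 + delta) * R,  R],                 # state B
    [R,    R,                R],                 # state C (absorbing, rewarding)
    [-(1 + delta) * R] * 3,                      # state D (absorbing, penalized)
])

V = np.zeros(4)
for _ in range(2000):                       # Bellman optimality iteration
    V = np.max(r + gamma * np.einsum('ast,t->sa', T, V), axis=1)

print(V[:3], R / (1 - gamma))               # V*(A), V*(B), V*(C) all approach 10.0
```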

We are now in a position to state the following result characterizing the complexity of the fMDP-SS problem.

Theorem 1.

Unless $P=NP$, there is no polynomial-time algorithm that can approximate the fMDP-SS Problem, in general, to within a factor of $n^{1-\epsilon}$ of the optimal solution, for any $\epsilon>0$.

Proof.

We consider an instance of the SetCover Problem with a collection of $m$ sets over a universe of $n$ elements and reduce it to an instance of the fMDP-SS Problem, similar to the reductions presented in [34] and [35]. We construct an fMDP consisting of $N$ identical MDPs, $\mathcal{M}^{*}:=\{\mathcal{M}_{1},\mathcal{M}_{2},\ldots,\mathcal{M}_{N}\}$, where $N=m+n+n^{c}$ for some large $c>0$ (see Figure 4). Each MDP $\mathcal{M}_{i}$ is similar to the MDP defined in Example 1, except for the reward function, which we define below. The state of the fMDP $\mathcal{M}^{*}$ has $N$ state variables and the joint action consists of $N$ actions, each corresponding to one MDP $\mathcal{M}_{i}$; the fMDP can be written as the tuple $\mathcal{M}^{*}:=(\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R},\gamma)$. The states of the individual MDPs evolve independently of each other; however, the reward function of each MDP may depend on its own state as well as the states of the other MDPs. We now explicitly define the state space, action space, transition function, reward function and discount factor.

State Space $\mathcal{S}$: The state space $\mathcal{S}_{i}$ corresponds to the MDP $\mathcal{M}_{i}$ as defined in Example 1. We define the state space $\mathcal{S}$ of the $N$-variable fMDP $\mathcal{M}^{*}$ as $\mathcal{S}:=\mathcal{S}_{1}\times\mathcal{S}_{2}\times\cdots\times\mathcal{S}_{N}$. The state variable $s_{i}$ corresponding to the MDP $\mathcal{M}_{i}$ is one-hot encoded in 4 bits, i.e., $A=1000$, $B=0100$, $C=0010$ and $D=0001$. The complete state of the fMDP $\mathcal{M}^{*}$ can thus be represented in $4N$ bits, where $N=m+n+n^{c}$.

Action Space $\mathcal{A}$: The action space $\mathcal{A}_{i}$ corresponds to the MDP $\mathcal{M}_{i}$ as defined in Example 1. We define the action space $\mathcal{A}$ of the fMDP $\mathcal{M}^{*}$ as $\mathcal{A}:=\mathcal{A}_{1}\times\mathcal{A}_{2}\times\cdots\times\mathcal{A}_{N}$.

Transition Function $\mathcal{T}$: The overall transition function of the fMDP $\mathcal{M}^{*}$ is defined by a collection of $N$ transition functions, $\mathcal{T}:=\{\mathcal{T}_{1},\ldots,\mathcal{T}_{N}\}$, where $\mathcal{T}_{i}$ is the transition function of MDP $\mathcal{M}_{i}$ as described in Example 1. The state transition probability of the fMDP can be computed as

\[
\mathcal{T}(\mathbf{s}^{\prime}\mid\mathbf{s},\mathbf{a})=\prod_{i=1}^{N}\mathcal{T}_{i}(s_{i}^{\prime}\mid s_{i},a_{i}),
\]

where $\mathbf{s}=[s_{1},\ldots,s_{N}]$ is the joint state, $\mathbf{a}=[a_{1},\ldots,a_{N}]$ is the joint action, and $\mathcal{T}_{i}(s_{i}^{\prime}\mid s_{i},a_{i})$ is the transition probability of the $i$'th state variable under the $i$'th action variable.

Discount Factor $\gamma$: Let the discount factors $\gamma_{i}$ of all the MDPs $\mathcal{M}_{i}$ be equal to each other:

\[
\gamma_{1}=\gamma_{2}=\cdots=\gamma_{N}=\gamma. \tag{1}
\]

Reward Function $\mathcal{R}$: We first define the structure of the fMDP, which captures the influence that the states of individual MDPs have on the reward functions of the other MDPs.
The reward functions of the MDPs $\{\mathcal{M}_{1},\ldots,\mathcal{M}_{m}\}$ are independent of each other and are as defined in Example 1.
For MDP $\mathcal{M}_{i}$, where $i=m+1,\ldots,m+n$, the reward function $\mathcal{R}_{i}$ is a function of the states of the first $m$ MDPs, i.e., of $\hat{s}=[s_{1},\ldots,s_{m}]$. Define $\tilde{s}_{i}=(\vee_{j:u_{i}\in S_{j}}s_{j})\wedge(0010)$ for $m+1\leq i\leq m+n$, where $\vee$ and $\wedge$ denote bit-wise Boolean OR and AND operations over the (one-hot encoded) states. (The bit-wise OR of two $n$-bit Boolean strings X and Y is the $n$-bit string Z whose $i$'th bit is the OR of the $i$'th bits of X and Y; the bit-wise AND is defined analogously.) The reward function $\mathcal{R}_{i}$ for $i=m+1,\ldots,m+n$ is given by

\[
\mathcal{R}_{i}(\hat{s})=\begin{cases}R&\text{if }\tilde{s}_{i}=0010\\ 0&\text{otherwise}\end{cases}. \tag{2}
\]

The above reward function means that the reward of an MDP $\mathcal{M}_{i}$, where $i=m+1,\ldots,m+n$, is $R$ if the state $s_{j}$ of MDP $\mathcal{M}_{j}$ is $C$ (i.e., $0010$) for some $j$ such that $u_{i}\in S_{j}$ in the SetCover instance. The reward function $\mathcal{R}_{i}$ for $m+n+1\leq i\leq m+n+n^{c}$ is given by

\[
\mathcal{R}_{i}(\hat{s})=\begin{cases}R&\text{if }\tilde{s}_{m+1}\wedge\cdots\wedge\tilde{s}_{m+n}=0010\\ 0&\text{otherwise}\end{cases}. \tag{3}
\]

According to the above reward function, the reward of an MDP $\mathcal{M}_{i}$ for $m+n+1\leq i\leq m+n+n^{c}$ is $R$ only if all of the $\tilde{s}_{i}$'s are equal to $0010$.
The overall reward function of the fMDP $\mathcal{M}^{*}$ is given by

\[
\mathcal{R}:=\sum_{i=1}^{N}\mathcal{R}_{i}. \tag{4}
\]
Figure 4: Reduction from SetCover to fMDP-SS/fMDP-AS. The reward of an MDP in Layer 1 depends only on its own state and action. The rewards of the MDPs in Layers 2 and 3 depend on the states of all the MDPs in Layer 1.

Figure 4 shows the dependence of the rewards of each MDP in the fMDP on the states of the MDPs constructed from the SetCover instance. The rewards of the MDPs in Layer 1 are independent of each other. The rewards of the MDPs in Layers 2 and 3 depend on the states of all the MDPs in Layer 1.

Note that the reward function takes as input the joint state $\mathbf{s}$ of the fMDP and computes the reward using bit-wise Boolean operations over at most $4N$ bits, the complexity of which is polynomial in $m$ and $n$.
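For concreteness, the following sketch (our own illustration of Equations (2)–(3), not the authors' code; the dictionary-based data structures and function name are assumptions) computes the Layer-2 and Layer-3 rewards from the one-hot state codes of the Layer-1 MDPs using the bit-wise OR/AND operations described above:

```python
# States are 4-bit one-hot strings encoded as ints: A=0b1000, B=0b0100, C=0b0010, D=0b0001.
C_CODE = 0b0010

def layer_rewards(layer1_states, covering_sets, n, n_c, R):
    """layer1_states[j]: 4-bit code of Layer-1 MDP j (j = 1..m);
    covering_sets[i]: indices j with element u_i in set S_j (i = 1..n)."""
    # s_tilde_i = (OR of s_j over the sets covering u_i) AND 0010  -> used in Eq. (2)
    s_tilde = []
    for i in range(1, n + 1):
        acc = 0
        for j in covering_sets[i]:
            acc |= layer1_states[j]
        s_tilde.append(acc & C_CODE)
    layer2 = [R if st == C_CODE else 0 for st in s_tilde]            # Eq. (2)
    all_covered = all(st == C_CODE for st in s_tilde)                 # condition in Eq. (3)
    layer3 = [R if all_covered else 0 for _ in range(n_c)]            # Eq. (3)
    return sum(layer2) + sum(layer3)
```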

Let the sensor budget be $C=k$, where $k$ is the maximum number of sets one can select in the SetCover Problem. Let $\Omega$ denote the set of sensors, where each sensor $\omega_{i}\in\Omega$ has a selection cost $c_{i}$ and corresponds to the MDP $\mathcal{M}_{i}$, for $1\leq i\leq N$. Let $c_{i}=1$ for $1\leq i\leq m$ and $c_{i}=k+1$ for $m+1\leq i\leq N$. Let the value function threshold for the fMDP be $V=(k+\gamma n+\gamma n^{c})R/(1-\gamma)$.

Note that for the specified sensor budget and sensor costs, only a subset of the sensors corresponding to the MDPs $\{\mathcal{M}_{1},\ldots,\mathcal{M}_{m}\}$ can be selected.

We now have an instance of the fMDP-SS Problem, obtained by reducing an instance of the SetCover Problem.

If the answer to the SetCover Problem is True, there is a full set cover $\mathbb{S}_{k}$ of size $k$ that satisfies $\bigcup_{S_{i}\in\mathbb{S}_{k}}S_{i}=\mathcal{U}$. By deploying sensors on the $k$ MDPs $\mathcal{M}_{i}$ corresponding to the sets $S_{i}$ in $\mathbb{S}_{k}$ (where $1\leq i\leq m$), we have from Lemma 1 that these MDPs will be in state $C$ for all $t\geq 1$ and will have an infinite-horizon expected return of $R/(1-\gamma)$ under the optimal policy. Consequently, for $t\geq 1$, each $\tilde{s}_{i}$ evaluates to $0010$. Thus, according to the specified reward function, each MDP $\mathcal{M}_{i}$ for $m+1\leq i\leq m+n$ has an infinite-horizon expected return of $\gamma R/(1-\gamma)$. It also follows that each MDP $\mathcal{M}_{i}$ for $m+n+1\leq i\leq N$ has an infinite-horizon expected return of $\gamma R/(1-\gamma)$. The total infinite-horizon expected return of the fMDP is $V_{\Gamma}^{*(1)}=(k+\gamma n+\gamma n^{c})R/(1-\gamma)$. Thus, the answer to the fMDP-SS instance is also True.

Conversely, if the answer to the fMDP-SS Problem is True, there is a sensor set $\Gamma$ with total cost $\sum_{\omega_{i}\in\Gamma}c_{i}\leq k$, installed on at most $k$ MDPs $\mathcal{M}_{i}$ with $1\leq i\leq m$, such that $V_{\Gamma}^{*}\geq(k+\gamma n+\gamma n^{c})R/(1-\gamma)$. It follows from the specified reward function and the infinite-horizon return obtained in the fMDP that the collection of subsets $S_{i}\in\mathbb{S}$ corresponding to the MDPs $\mathcal{M}_{i}$ on which sensors are installed collectively covers all the elements of the universe $\mathcal{U}$ in the SetCover instance. Thus, the answer to the SetCover instance is True.

If the answer to the SetCover Problem is False, then there is no set cover of size $k$ that covers all the elements of the universe $\mathcal{U}$. In this case, the maximum number of elements that can be covered by any $k$ sets is $n-1$. This means that, by deploying sensors on the corresponding $k$ MDPs, at most $k$ MDPs in Layer 1 can have an expected infinite-horizon return of $R/(1-\gamma)$ and at most $n-1$ MDPs in Layer 2 can have an expected infinite-horizon return of $\gamma R/(1-\gamma)$. The maximum value of the expected return is therefore $V_{\Gamma}^{*(2)}=(k+\gamma(n-1))R/(1-\gamma)$. Define the ratio $r_{approx}$ as follows:

\[
r_{approx}=\frac{V_{\Gamma}^{*(2)}}{V_{\Gamma}^{*(1)}}=\frac{k+\gamma(n-1)}{k+\gamma n+\gamma n^{c}}. \tag{5}
\]

For sufficiently large $c>0$, the ratio $r_{approx}$ is close to $n^{1-c}$ and hence arbitrarily small. Thus, if an algorithm could approximate the problem to within a factor of $n^{1-c}$ for any $c>0$, then it could distinguish between the case in which $k+n+n^{c}$ MDPs have an infinite-horizon return of at least $\gamma R/(1-\gamma)$ and the case in which fewer than $k+n$ MDPs have an infinite-horizon return of at least $\gamma R/(1-\gamma)$ in fMDP-SS. However, this would solve the underlying instance of the SetCover problem, which is impossible unless $P=NP$. Therefore, the fMDP-SS problem is not only NP-hard, but there is no polynomial-time algorithm that can approximate it to within any non-trivial factor (specifically, to within a factor of $n^{1-c}$ of the optimal solution, for any $c>0$). ∎
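To see why the ratio behaves like $n^{1-c}$, note that (assuming, for illustration, $k\leq n$, which is without loss of generality for Set Cover, since no more than $n$ sets are ever needed)
\[
r_{approx}=\frac{k+\gamma(n-1)}{k+\gamma n+\gamma n^{c}}\;\leq\;\frac{(1+\gamma)n}{\gamma n^{c}}\;=\;\frac{1+\gamma}{\gamma}\,n^{1-c},
\]
which is on the order of $n^{1-c}$ and vanishes as $c$ grows.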

3.4 Inapproximability of Actuator Selection Problem

In this section, we present the inapproximability results for fMDP actuator selection by reducing an instance of the SetCover problem to the fMDP-AS problem. We first present a preliminary lemma, which we will use to characterize the complexity of fMDP-AS.

Example 2: Consider an MDP given by $\tilde{\mathcal{M}}:=\{\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R},\gamma\}$. The state space is $\mathcal{S}=\{A,B,C,D\}$. If there is no actuator, the agent can only take the default action $a=0$, i.e., the action space is $\mathcal{A}=\{0\}$; if there is an actuator installed, the agent can take one of the three actions 0, 1, or 2, i.e., the action space is $\mathcal{A}=\{0,1,2\}$. The transition function and reward function are as defined in Example 1. Note that this MDP is the same as the MDP defined in Example 1, except for its action space.

Lemma 2.

For the MDP $\tilde{\mathcal{M}}$ defined in Example 2, the following holds for any $\gamma\in(0,1)$:

  • (i) If an actuator is installed for $\tilde{\mathcal{M}}$, the infinite-horizon expected return under the optimal policy is $V^{*}(s)=\frac{R}{1-\gamma}$, and the optimal policy ensures that the state of the MDP is $s_{t}=C$ for all $t>0$.

  • (ii) If no actuator is installed for $\tilde{\mathcal{M}}$, the infinite-horizon expected return under the optimal policy is $V^{*}(s)=0$, and the optimal policy ensures that the state of the MDP is $s_{t}\notin\{C,D\}$ for all $t>0$.

Proof.

We prove (i) and (ii) as follows.
Case (i): Consider the case when there is an actuator installed on $\tilde{\mathcal{M}}$. Based on the specified reward function, the agent can obtain the maximum reward $R$ at each time-step by taking action 1 if $s=A$, action 2 if $s=B$, and any action if $s=C$. This yields $V^{*}(s)=\max_{\pi_{\Upsilon}}V^{\pi_{\Upsilon}}(s)=\sum_{t=0}^{\infty}\gamma^{t}R=\frac{R}{1-\gamma}$.
Case (ii): Consider the case when there is no actuator installed on $\tilde{\mathcal{M}}$. The optimal policy is to always take the only available action $a=0$ for all $t\geq 0$. Since taking action 0 in both states A and B gives a reward of 0, the expected infinite-horizon reward under the optimal policy is $V^{*}(s)=0$. ∎

We have the following result characterizing the complexity of the fMDP-AS problem.

Theorem 2.

Unless $P=NP$, there is no polynomial-time algorithm that can approximate the fMDP-AS Problem to within a factor of $n^{1-\epsilon}$ of the optimal solution, for any $\epsilon>0$.

Proof.

We construct an instance of the fMDP-AS problem from an instance of the SetCover Problem, similar to the reduction presented in the proof of Theorem 1, but in this case each MDP $\mathcal{M}_{i}$ in the fMDP is defined as in Example 2, except for the reward functions. The effect of selecting a sensor for the $i$'th MDP $\mathcal{M}_{i}$ as in Example 1 (see Lemma 1) is the same as that of selecting an actuator for the $i$'th MDP $\mathcal{M}_{i}$ as in Example 2 (see Lemma 2), for $1\leq i\leq m$. The reward functions of the individual MDPs and of the overall fMDP are defined as in the proof of Theorem 1.

Similar to the arguments made in the proof of Theorem 1, if there is a full set cover $\mathbb{S}_{k}$ of size $k$ which satisfies $\bigcup_{S_{i}\in\mathbb{S}_{k}}S_{i}=\mathcal{U}$, then deploying actuators on the $k$ MDPs $\mathcal{M}_{i}$ corresponding to the sets $S_{i}$, where $1\leq i\leq m$, ensures that the infinite-horizon expected return of every MDP in $\{\mathcal{M}_{m+1},\ldots,\mathcal{M}_{N}\}$ is $\gamma R/(1-\gamma)$. The total infinite-horizon expected return in this case is $V_{\Upsilon}^{*(1)}=(k+\gamma n+\gamma n^{c})R/(1-\gamma)$. Conversely, if there is no set cover of size $k$ that covers all the elements of the universe $\mathcal{U}$, there are at most $k$ MDPs in $\{\mathcal{M}_{1},\ldots,\mathcal{M}_{m}\}$ with an infinite-horizon return of $R/(1-\gamma)$ and fewer than $n$ MDPs in $\{\mathcal{M}_{m+1},\ldots,\mathcal{M}_{m+n}\}$ with an expected infinite-horizon return of $\gamma R/(1-\gamma)$. The maximum value of the total expected return is $V_{\Upsilon}^{*(2)}=(k+\gamma(n-1))R/(1-\gamma)$. The approximation ratio $r_{approx}$ is thus the same as the one obtained in the proof of Theorem 1, i.e., as in Equation (5). Therefore, it is NP-hard to approximate the fMDP-AS problem to within a factor of $n^{1-c}$ for any $c>0$. ∎

4 Greedy Algorithms

Greedy algorithms, which iteratively and myopically choose items that provide the largest immediate benefit, provide computationally tractable and near-optimal solutions to many combinatorial optimization problems [28, 37, 38]. In this section, we present greedy algorithms for the fMDP-SS and fMDP-AS problems with uniform sensor/actuator costs to output a subset of sensors/actuators to be selected in order to maximize the infinite-horizon reward. According to the results presented in the previous section, greedy algorithms are not expected to perform well for all possible instances of the fMDP-SS and fMDP-AS problems. With explicit examples, we show how greedy algorithms can perform arbitrarily poorly for these problems and provide insights into the factors which could lead to such poor performance.

Algorithm 1 is a greedy algorithm that takes an instance of the fMDP-SS (or fMDP-AS) problem and returns a sensor (or actuator) set satisfying the specified budget constraints.

Data: A factored MDP $\mathcal{M}$, a set of candidate sensors (or actuators) $X$ with uniform sensor (or actuator) costs, and a sensor (or actuator) budget $M$
Result: A set of selected sensors (or actuators) $Y$
$k\leftarrow 0$, $Y\leftarrow\emptyset$
while $k<M$ do
       for $i\in(X\setminus Y)$ do
              Calculate the optimal expected discounted return for the sensor (or actuator) set $Y\cup\{i\}$, denoted $V^{*}(Y\cup\{i\})$
       end for
       $j=\arg\max_{i\in X\setminus Y}V^{*}(Y\cup\{i\})$
       $Y\leftarrow Y\cup\{j\}$, $k\leftarrow k+1$
end while
Algorithm 1: Greedy algorithm for fMDP-SS (or fMDP-AS)
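The following is a minimal runnable Python sketch of the greedy loop in Algorithm 1 (our illustration, not the authors' implementation); `value_oracle` is an assumed callback that returns the optimal expected discounted return $V^{*}$ for a given sensor (or actuator) set, e.g., obtained by solving the resulting (PO)MDP. The toy oracle in the usage example is chosen so that the myopic choice $\{1,2\}$ misses the better pair $\{2,3\}$, mirroring the failure mode analyzed next.

```python
def greedy_select(candidates, budget, value_oracle):
    selected = set()
    for _ in range(budget):
        remaining = [i for i in candidates if i not in selected]
        if not remaining:
            break
        # Myopically add the item giving the largest immediate gain in V*.
        best = max(remaining, key=lambda i: value_oracle(selected | {i}))
        selected.add(best)
    return selected

# Example usage with a toy oracle (illustrative values only).
toy_values = {frozenset(): 0, frozenset({1}): 3, frozenset({2}): 2, frozenset({3}): 1,
              frozenset({1, 2}): 4, frozenset({1, 3}): 3.5, frozenset({2, 3}): 9}
print(greedy_select([1, 2, 3], 2, lambda s: toy_values[frozenset(s)]))  # {1, 2}, not the optimal {2, 3}
```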

4.1 Failure of Greedy Algorithm for fMDP-SS

Consider the following instance of the fMDP-SS problem.

Example 3: An fMDP $\mathcal{M}=\{\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R},\gamma,b_{0}\}$ consists of 4 MDPs $\{\mathcal{M}_{1},\mathcal{M}_{2},\mathcal{M}_{3},\mathcal{M}_{4}\}$. Each of these is the single-state-variable MDP defined in Example 1, except for its reward function. We now explicitly define the elements of the tuple $\mathcal{M}$.
State Space $\mathcal{S}$: The state space of the fMDP $\mathcal{M}$ is the product of the state spaces of the MDPs $\mathcal{M}_{i}$, i.e., $\mathcal{S}=\mathcal{S}_{1}\times\mathcal{S}_{2}\times\mathcal{S}_{3}\times\mathcal{S}_{4}$. The state of the fMDP consists of 4 independently evolving state variables and is given by $s=[s_{1},s_{2},s_{3},s_{4}]$, where each state variable $s_{i}$ is as defined in Example 1. We encode the states into a binary representation over 4 bits as follows: $A=1000$, $B=0100$, $C=0010$, $D=0001$;
Action Space $\mathcal{A}$: The action space of the fMDP $\mathcal{M}$ is the product of the action spaces of the MDPs $\mathcal{M}_{i}$, i.e., $\mathcal{A}=\mathcal{A}_{1}\times\mathcal{A}_{2}\times\mathcal{A}_{3}\times\mathcal{A}_{4}$;
Transition Function $\mathcal{T}$: The probabilistic transition function $\mathcal{T}$ of the fMDP $\mathcal{M}$ is the collection of the individual transition functions $\{\mathcal{T}_{1},\mathcal{T}_{2},\mathcal{T}_{3},\mathcal{T}_{4}\}$ for $s_{1},s_{2},s_{3},s_{4}$, respectively, where each $\mathcal{T}_{i}$ is as defined in Example 1;
Reward Function $\mathcal{R}$: The reward functions of $\{\mathcal{M}_{1},\mathcal{M}_{2},\mathcal{M}_{3}\}$ are independent of each other and are as defined in Example 1, with $R=R_{1}$ for $\mathcal{M}_{1}$, $R=R_{2}$ for $\mathcal{M}_{2}$ and $R=R_{3}$ for $\mathcal{M}_{3}$, respectively. The reward function of $\mathcal{M}_{4}$ depends on the states $s_{2}$ and $s_{3}$ and is defined as:

\[
\mathcal{R}_{4}(s_{2},s_{3})=\begin{cases}R_{4}&\text{if }s_{2}\wedge s_{3}=0010\\ 0&\text{otherwise}\end{cases}, \tag{6}
\]

Let $R_{4}>R_{1}>R_{2}>R_{3}>0$ and $R_{1}=R_{2}+\epsilon$ for an arbitrarily small $\epsilon>0$. Let the value of $\delta$ in the individual reward functions of each MDP be such that $\delta>R_{4}/R_{3}-1$. The overall reward function of the fMDP $\mathcal{M}$ is given by

\[
\mathcal{R}:=\mathcal{R}_{1}(s_{1},a_{1})+\mathcal{R}_{2}(s_{2},a_{2})+\mathcal{R}_{4}(s_{2},s_{3}); \tag{7}
\]

Discount Factor $\gamma$: Let the discount factors of all the $\mathcal{M}_{i}$ be equal to each other and to that of $\mathcal{M}$, i.e., $\gamma_{1}=\gamma_{2}=\gamma_{3}=\gamma_{4}=\gamma$.

Let $\Omega=\{\omega_{1},\omega_{2},\omega_{3},\omega_{4}\}$ be the set of sensors that measure the states $s_{1},s_{2},s_{3},s_{4}$, respectively. Let the costs of the sensors be $\mathcal{C}=(c_{1}=1,c_{2}=1,c_{3}=1,c_{4}=3)$ and the sensor budget be $C=2$. Assume uniform initial beliefs ($b_{0}$) for all the MDPs $\mathcal{M}_{i}$ and for $\mathcal{M}$. Since $c_{4}>C$, we apply the greedy algorithm described in Algorithm 1 to this instance of fMDP-SS with $X=\Omega\setminus\{\omega_{4}\}$ and $M=C$. Let the output set be $Y=\Gamma$. For any such instance of fMDP-SS, define $r_{gre}(\Gamma)=\frac{V_{\Gamma}^{gre}}{V_{\Gamma}^{opt}}$, where $V_{\Gamma}^{gre}$ and $V_{\Gamma}^{opt}$ are the infinite-horizon expected return obtained by the greedy algorithm and the optimal infinite-horizon expected return, respectively. Define $h=R_{4}/R_{2}$.

Proposition 1.

For the instance of the fMDP-SS problem described in Example 3, the ratio $r_{gre}(\Gamma)$ satisfies $\lim_{h\rightarrow\infty,\,\epsilon\rightarrow 0}r_{gre}(\Gamma)=0$.

Proof.

We first note that the specified reward function ensures that each MDP $\mathcal{M}_{i}$ must follow the optimal policy of Lemma 1 in order to obtain the optimal expected return for the fMDP $\mathcal{M}$. The overall reward of the fMDP $\mathcal{M}$ depends on the individual reward terms $\mathcal{R}_{1}$, $\mathcal{R}_{2}$ and $\mathcal{R}_{4}$. In the first iteration, the greedy algorithm picks $\omega_{1}$, because $R_{1}$ exceeds $R_{2}$ by $\epsilon$. In the second iteration, greedy picks $\omega_{2}$ (because $R_{2}>R_{3}$) and terminates due to the budget constraint. Therefore, the sensor subset selected by the greedy algorithm is $\Gamma=\{\omega_{1},\omega_{2}\}$. By Lemma 1, it follows that the infinite-horizon expected return obtained by the greedy algorithm is

\[
V_{\Gamma}^{gre}=\frac{R_{1}}{1-\gamma}+\frac{R_{2}}{1-\gamma}=\frac{2R_{2}+\epsilon}{1-\gamma}. \tag{8}
\]

Consider the sensor selection $\{\omega_{2},\omega_{3}\}$ for this fMDP-SS instance. By selecting sensors $\omega_{2}$ and $\omega_{3}$, the states $s_{2}$ and $s_{3}$ can be measured. By Lemma 1, it follows that the states $s_{2}$ and $s_{3}$ are both in $C$ (i.e., $0010$) for all $t>0$, and thus the optimal infinite-horizon expected return is

\[
V_{\Gamma}^{opt}=\frac{R_{2}}{1-\gamma}+\frac{R_{4}}{1-\gamma}=\frac{R_{2}+R_{4}}{1-\gamma}. \tag{9}
\]

Thus, we have

\[
\lim_{h\rightarrow\infty,\,\epsilon\rightarrow 0} r_{gre}(\Gamma)=\lim_{h\rightarrow\infty,\,\epsilon\rightarrow 0}\frac{2R_{2}+\epsilon}{R_{2}+R_{4}}=\lim_{h\rightarrow\infty}\frac{2}{1+h}=0. \tag{10}
\]
∎

Remark 1: Proposition 1 means that if we make $R_{4}$ arbitrarily larger than $R_{2}$, and $R_{1}$ only slightly larger than $R_{2}$, the expected return obtained by the greedy algorithm can be arbitrarily small compared to the expected return obtained by the optimal selection of sensors. This is because greedy picks sensors $\omega_{1}$ and $\omega_{2}$ due to its myopic behavior. It does not account for the fact that, despite $R_{3}$ being the smallest reward, selecting $\omega_{3}$ in combination with $\omega_{2}$ would eventually lead to the highest reward $R_{4}$. A direct consequence of the arbitrarily poor performance of the greedy algorithm is that the optimal value function of the fMDP is not, in general, submodular in the set of sensors selected.
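A quick numeric illustration of Proposition 1, using the closed forms in Equations (8) and (9) (the specific reward values below are arbitrary choices, not from the paper):

```python
gamma, eps = 0.9, 1e-3
for R4 in [10.0, 1e3, 1e6]:                       # make R4 >> R2, i.e., h = R4/R2 -> infinity
    R2 = 1.0
    R1, R3 = R2 + eps, 0.5                        # satisfies R4 > R1 > R2 > R3 > 0
    V_greedy = (R1 + R2) / (1 - gamma)            # greedy picks {w1, w2}, Eq. (8)
    V_opt = (R2 + R4) / (1 - gamma)               # optimal picks {w2, w3}, Eq. (9)
    print(f"h = {R4 / R2:.0e}, r_gre = {V_greedy / V_opt:.2e}")
# r_gre shrinks toward 0 as h grows, matching lim_{h->inf, eps->0} r_gre = 0.
```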

4.2 Failure of Greedy Algorithm for fMDP-AS

Consider the following instance of fMDP-AS.

Example 4: An fMDP $\mathcal{M}=\{\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R},\gamma\}$ consists of 4 MDPs $\{\mathcal{M}_{1},\mathcal{M}_{2},\mathcal{M}_{3},\mathcal{M}_{4}\}$. Each of these is the single-state-variable MDP defined in Example 2, except for its reward function. We now explicitly define the elements of the tuple $\mathcal{M}$.
State Space $\mathcal{S}$: The state space of the fMDP $\mathcal{M}$ is the same as in Example 3;
Action Space $\mathcal{A}$: The action space of the fMDP $\mathcal{M}$ is the product of the action spaces of the MDPs $\mathcal{M}_{i}$, i.e., $\mathcal{A}=\mathcal{A}_{1}\times\mathcal{A}_{2}\times\mathcal{A}_{3}\times\mathcal{A}_{4}$. The action spaces $\mathcal{A}_{i}$ depend on the placement of actuators, as detailed in Example 2;
Transition Function $\mathcal{T}$: The probabilistic transition function $\mathcal{T}$ of the fMDP $\mathcal{M}$ is the collection of the individual transition functions $\{\mathcal{T}_{1},\mathcal{T}_{2},\mathcal{T}_{3},\mathcal{T}_{4}\}$ for $s_{1},s_{2},s_{3},s_{4}$, respectively, where each $\mathcal{T}_{i}$ is as defined in Example 2;
Reward Function $\mathcal{R}$: The reward function of the fMDP $\mathcal{M}$ is the same as in Example 3 (see Equation (7));

Discount Factor $\gamma$: Let the discount factors of the $\mathcal{M}_{i}$ be equal to the discount factor of $\mathcal{M}$, i.e., $\gamma_{1}=\gamma_{2}=\gamma_{3}=\gamma_{4}=\gamma$.

Let $\Phi=\{\phi_1,\phi_2,\phi_3,\phi_4\}$ be the set of actuators, with costs $\mathcal{K}=(k_1=1,k_2=1,k_3=1,k_4=3)$, that can influence the transitions of the states $s_1,s_2,s_3,s_4$, respectively. Let the actuator selection budget be $K=2$. Since $k_4>K$, we apply the greedy algorithm described in Algorithm 1 to this instance of fMDP-AS with $X=\Phi\setminus\{\phi_4\}$ and $M=K$, and denote the output set by $Y=\Upsilon$. For any such instance of fMDP-AS, define $r_{gre}(\Upsilon)=\frac{V_{\Upsilon}^{gre}}{V_{\Upsilon}^{opt}}$, where $V_{\Upsilon}^{gre}$ and $V_{\Upsilon}^{opt}$ are the infinite-horizon expected return obtained by the greedy algorithm and the optimal infinite-horizon expected return, respectively. Define $h=R_4/R_2$.

Proposition 2.

For the instance of the fMDP-AS problem described in Example 4, the ratio $r_{gre}(\Upsilon)$ satisfies $\lim_{h\rightarrow\infty,\,\epsilon\rightarrow 0}r_{gre}(\Upsilon)=0$.

Proof.

The construction and arguments are similar to those in the proof of Proposition 1. ∎

Remark 2: Proposition 2 shows that if we make $R_4$ arbitrarily larger than $R_2$, and $R_1$ almost equal to $R_2$, the expected return obtained by the greedy algorithm becomes arbitrarily small compared to the expected return obtained by the optimal selection of actuators. This is because greedy picks actuators $\phi_1$ and $\phi_2$ due to its myopic behavior: it does not account for the fact that, although $R_3$ is the smallest reward, selecting $\phi_3$ would eventually yield the largest reward $R_4$. An immediate consequence of the arbitrarily poor performance of the greedy algorithm is that the optimal value function of the fMDP is not necessarily submodular in the set of actuators selected.

5 Empirical Evaluation of Greedy Algorithm

Previously, we showed that the greedy algorithm for fMDP-SS and fMDP-AS can perform arbitrarily poorly. However, this arbitrarily poor performance arises on specifically constructed instances of those problems; in general, greedy need not perform poorly on all instances. In this section, we provide experimental results for greedy sensor and actuator selection and discuss its empirical performance.
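For reference, the cost-aware greedy selection evaluated in this section has the generic structure sketched below. This is a minimal Python sketch rather than the exact implementation of Algorithm 1: value_of is a placeholder for an oracle that returns the optimal expected return of the fMDP for a given selected set (e.g., obtained from a POMDP or MDP solver), and the candidate elements may be either sensors or actuators.

def greedy_select(candidates, costs, budget, value_of):
    """Budgeted greedy selection sketch: repeatedly add the affordable element
    with the largest marginal gain in expected return (value_of is an oracle)."""
    selected, spent = set(), 0.0
    while True:
        base = value_of(selected)
        best, best_gain = None, 0.0
        for x in set(candidates) - selected:
            if spent + costs[x] > budget:
                continue  # not affordable under the remaining budget
            gain = value_of(selected | {x}) - base
            if gain > best_gain:
                best, best_gain = x, gain
        if best is None:
            return selected  # no affordable element yields a positive gain
        selected.add(best)
        spent += costs[best]

A per-unit-cost variant (ranking elements by gain divided by cost) is a common alternative; in Example 4 above, the actuators within budget have unit costs, so the two orderings coincide there.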

5.1 Actuator Selection in Electric Networks

We consider the problem of selecting an optimal subset of actuators to minimize fault percolation in an electric network, which we refer to as the Actuator Selection in Electric Networks (ASEN) problem. Consider a distributed micro-grid network (e.g., see Figure 2) represented by a graph $G=(V,E)$, where each node $i\in V$ represents a micro-grid system and each edge $e_{ij}\in E$ is a link between nodes (or micro-grids) $i$ and $j$ in the electric network. The state of each micro-grid system (or node) $i\in\{1,\ldots,|V|\}$ is represented by the variable $s_i\in\{\texttt{healthy},\texttt{faulty}\}$. If a particular node $i$ is faulty, a neighboring node $j\in\mathcal{N}_i$, where $\mathcal{N}_i$ denotes the non-inclusive neighborhood of node $i$, becomes faulty with a certain probability $p_{ji}$. This is known as the independent-cascade diffusion process in networks and has been well studied in the network science literature. Given an initial set of faulty nodes $S\subset V$, more nodes in the network become faulty as the diffusion/spread process unfolds. In large-scale distributed micro-grid networks, a central grid controller can perform active islanding of micro-grids (or nodes) using islanding switches (or actuators) in order to prevent percolation of faults through the network. Given that only a limited number of islanding switches can be installed, the goal is to determine the optimal placement locations (nodes) in the distributed micro-grid that minimize fault percolation. This problem is similar to the problem of minimizing the influence of rumors in social networks studied in [39]. Similar to [39], we wish to identify a blocker set, i.e., a set of nodes that can be disconnected (islanded) from the network so as to minimize the total number of faulty nodes.
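As an illustration of the fault-propagation dynamics, the following Python sketch implements one round of the independent-cascade model on a networkx graph, with islanded nodes acting as blockers. It is a minimal sketch under simplifying assumptions: a uniform influence probability $p_{ji}=p$, and islanded nodes that neither become faulty nor spread faults.

import random
import networkx as nx

def cascade_step(G, frontier, faulty, islanded, p=0.3, rng=random):
    """One round of the independent-cascade model: each node that became faulty
    in the previous round gets a single chance to make each healthy,
    non-islanded neighbor faulty with probability p."""
    activated = set()
    for j in frontier:
        if j in islanded:
            continue  # an islanded node is disconnected and cannot spread faults
        for i in G.neighbors(j):
            if i not in faulty and i not in islanded and rng.random() < p:
                activated.add(i)
    return activated  # nodes that become faulty in this round

Starting from the initial faulty set $S$, the process is iterated, with the frontier set to the output of cascade_step and the faulty set updated by union, until no new node becomes faulty.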

Given a budget $K$, the problem of selecting the optimal subset of $K$ nodes on which actuators (i.e., islanding switches) are installed, so as to maximize the number of healthy nodes in the network (i.e., minimize the spread of faults), is an instance of the fMDP-AS problem presented in this paper. Each node $i\in\{1,\ldots,|V|\}$ corresponds to an MDP with state $s_i\in\{\texttt{healthy},\texttt{faulty}\}$. The state transition of a node depends on the state transitions of its neighboring nodes and is governed by the influence probabilities $p_{ji}$ for each pair of nodes $(j,i)$. Thus, the overall state of the fMDP can be factored into $n=|V|$ state variables. If a node is in the healthy state, it receives a reward of $+1$, and if it is in the faulty state, it receives a reward of $0$. The overall reward of the fMDP at any time step is the sum of the rewards of the individual nodes. If an actuator is installed on a particular node $i\in V$, the node can perform active islanding to disconnect from the network, which corresponds to the factored action space $\mathcal{A}_i=\{\texttt{island},\texttt{na}\}$. If there is no actuator on a node $i\in V$, it has only a default do-nothing action, corresponding to the factored action space $\mathcal{A}_i=\{\texttt{na}\}$. The action space of the fMDP is the product of the action spaces of the MDPs corresponding to the individual nodes.
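The factored reward at each step is simply the number of healthy nodes, so the discounted return for a fixed islanding choice can be estimated by Monte-Carlo rollouts of the cascade. The sketch below reuses the cascade_step helper above and truncates the horizon; counting islanded nodes as healthy and applying islanding from the first time step are assumptions made here for illustration.

def discounted_return(G, initial_faulty, islanded, gamma=0.95, p=0.3,
                      horizon=200, runs=200, rng=random):
    """Monte-Carlo estimate of the discounted return of the network fMDP:
    the per-step factored reward is +1 per healthy node, summed over nodes."""
    n, total = G.number_of_nodes(), 0.0
    for _ in range(runs):
        faulty, frontier, ret = set(initial_faulty), set(initial_faulty), 0.0
        for t in range(horizon):                      # gamma^t makes truncation accurate
            ret += (gamma ** t) * (n - len(faulty))   # sum of per-node rewards
            frontier = cascade_step(G, frontier, faulty, islanded, p, rng)
            faulty |= frontier
        total += ret
    return total / runs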

We set the discount factor $\gamma=0.95$ and $p_{ji}=0.3$ for all node pairs $(j,i)\in V\times V$, and compute the infinite-horizon discounted return for the fMDP (electric network) as the fault percolation process unfolds over the network according to the independent-cascade model. We use a greedy influence maximization (IM) algorithm under the independent-cascade model to identify an initial set $S$ of faulty nodes with maximum influence for fault propagation. We consider two random graph models to generate instances of the electric network: (i) the Erdős-Rényi (ER) random graph, denoted by $G(n,p)$, where $n$ is the number of nodes and $p$ is the edge probability, and (ii) the Barabási-Albert (BA) random graph, denoted by $G(n)$, where $n$ is the number of nodes. We perform the following experiments to evaluate the greedy algorithm and, in Figure 5, plot the performance of greedy relative to the optimal (computed using brute-force search), along with the performance of random selection relative to the optimal.
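The random network instances and the initial faulty set can be generated as in the Python sketch below, using the standard networkx constructors. The Monte-Carlo greedy influence-maximization routine shown here is a simple stand-in for the IM procedure we use, and the BA attachment parameter (m = 2) is an assumption, since the BA model is parameterized only by $n$ in our description.

import random
import networkx as nx

def estimated_spread(G, seeds, p=0.3, runs=100, rng=random):
    """Monte-Carlo estimate of the expected number of nodes reached by an
    independent cascade started from `seeds` (no islanding during seeding)."""
    total = 0
    for _ in range(runs):
        reached, frontier = set(seeds), set(seeds)
        while frontier:
            frontier = {i for j in frontier for i in G.neighbors(j)
                        if i not in reached and rng.random() < p}
            reached |= frontier
        total += len(reached)
    return total / runs

def greedy_im_seeds(G, k, p=0.3):
    """Greedy influence maximization: repeatedly add the node with the
    largest estimated marginal spread."""
    seeds = set()
    for _ in range(k):
        best = max((v for v in G if v not in seeds),
                   key=lambda v: estimated_spread(G, seeds | {v}, p))
        seeds.add(best)
    return seeds

er_graph = nx.gnp_random_graph(30, 0.3, seed=0)     # ER instance G(30, 0.3)
ba_graph = nx.barabasi_albert_graph(30, 2, seed=0)  # BA instance (m = 2 assumed)
S = greedy_im_seeds(er_graph, k=5)                  # initial faulty set, |S| = 5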

Figure 5: Empirical evaluation of greedy sensor and actuator selection for fMDPs. Panels: (a) empirical performance of greedy w.r.t. optimal for random instances of ASEN; (b) empirical performance of greedy w.r.t. optimal for random instances of fMDP-SS; (c) empirical performance of greedy w.r.t. optimal for random instances of fMDP-AS; (d)-(f) comparison of greedy vs. random w.r.t. optimal for ER networks with $p=0.1$, varying the budget, network size, and number of faulty nodes; (g)-(i) the same comparison for ER networks with $p=0.3$; (j)-(l) the same comparison of actuator selection for BA networks, varying the selection budget, network size, and number of faulty nodes.

Random Instances: We generate random ER networks $G(30,0.3)$, set the initial number of faulty nodes to $|S|=5$ and the selection budget to $K=5$, and run the greedy algorithm. As seen in Fig. 5(a), greedy achieves near-optimal performance for many instances, attaining on average $96.24\%$ of the optimal.

Varying Actuator Budget: We generate random ER networks ($G(30,0.1)$ and $G(30,0.3)$) and BA networks ($G(30)$). We set the initial number of faulty nodes to $|S|=5$ and run greedy for varying budgets $K\in\{1,2,3,4,5,6,7\}$ for ER networks and $K\in\{1,2,3,4,5\}$ for BA networks. The comparison of greedy and random selection with respect to the optimal as the selection budget increases is shown in Figures 5(d), 5(g) and 5(j). It can be observed that greedy clearly outperforms random selection for both ER and BA networks, with near-optimal performance.

Varying Network Size: We generate random ER networks ($G(|V|,0.1)$ and $G(|V|,0.3)$) and BA networks ($G(|V|)$) with varying network sizes, i.e., $|V|\in\{10,15,20,25\}$. We set the initial number of faulty nodes to $|S|=5$ and the actuator selection budget to $K=5$. The comparison of greedy and random selection with respect to the optimal as the network size increases is shown in Figures 5(e), 5(h) and 5(k). It can be observed that greedy provides near-optimal performance for both ER and BA networks as the network size increases, while random selection performs increasingly poorly compared to greedy.

Varying Number of Faulty Nodes: We generate random ER networks ($G(30,0.1)$ and $G(30,0.3)$) and BA networks ($G(30)$). We vary the initial number of faulty nodes $|S|\in\{3,5,7,10\}$ and set the selection budget to $K=5$. The comparison of greedy and random selection with respect to the optimal as the number of initially faulty nodes increases is shown in Figures 5(f), 5(i) and 5(l). It can be observed that greedy provides near-optimal performance and outperforms random selection across network types and sizes.

In all of the above cases, the greedy algorithm achieves over $90\%$ of the optimal. We conclude that, although the fMDP-AS problem cannot be approximated to within any non-trivial factor in the worst case, the greedy algorithm can provide near-optimal solutions for many instances in practice.

5.2 Evaluation on Random Instances

First, we evaluate the greedy algorithm on several randomly generated instances of fMDP-SS. Note that any instance of the fMDP-SS problem can be represented as a Partially Observable MDP (POMDP) in order to solve for the optimal policy. Hence, we use the SolvePOMDP software package [40], a Java program that solves partially observable Markov decision processes optimally using incremental pruning [41] combined with state-of-the-art vector pruning methods [42]. We run the exact algorithm in this package to compute the optimal solution for the infinite-horizon case, using a value function tolerance of $\eta=1\times10^{-6}$ as the stopping criterion. We generate 20 instances of fMDP-SS, each with $|\mathcal{S}|=16$ states (4 binary state variables) and $|\mathcal{A}|=16$ actions. The transition function $\mathcal{T}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]$ is generated by sampling, for each starting state $s\in\mathcal{S}$ and action $a\in\mathcal{A}$, a distribution over ending states $s'\in\mathcal{S}$ at random from the probability simplex $\Delta_{|\mathcal{S}|}$. The rewards $\mathcal{R}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}_{>0}$ for each state-action pair $(s,a)$ are sampled from $\texttt{abs}(\mathcal{N}(0,\sigma))$, with $\sigma\sim\texttt{uniform random}(0,10)$. We consider a set of 4 noiseless sensors $\Omega=\{\omega_1,\omega_2,\omega_3,\omega_4\}$, which measure the states $(s_1,s_2,s_3,s_4)$, respectively. We consider uniform sensor costs $c_1=c_2=c_3=c_4=1$ and a sensor budget of $C=2$. We apply a brute-force technique, generating all possible sensor subsets $\Gamma\subset\Omega$ of size $|\Gamma|=2$, to compute the optimal set of sensors $\Gamma^{*}$ and the optimal return $V_{\Gamma}^{opt}$ using the solver. We then run Algorithm 1 for each of these instances to compute the return $V_{\Gamma}^{gre}$. As seen in Fig. 5(b), greedy achieves near-optimal performance, attaining on average $90.19\%$ of the optimal.
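For concreteness, the sketch below shows how a random fMDP-SS instance of this size can be generated and how the brute-force baseline enumerates the feasible sensor subsets. The Dirichlet sampler in NumPy draws each transition row from the probability simplex, and solve_pomdp_value is a hypothetical placeholder for the call to the external SolvePOMDP solver, which we do not reproduce here.

import itertools
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions = 16, 16

# Transition kernel: each row T[s, a, :] is a point on the simplex Delta_{|S|}.
T = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))
# Rewards |N(0, sigma)| with sigma ~ Uniform(0, 10).
sigma = rng.uniform(0, 10)
R = np.abs(rng.normal(0.0, sigma, size=(num_states, num_actions)))

sensors = ("w1", "w2", "w3", "w4")  # noiseless sensors for s1..s4, unit costs
budget = 2

# Brute-force baseline over all sensor subsets of size 2.
# solve_pomdp_value is a placeholder for the external POMDP solver call.
best_value, best_subset = -np.inf, None
for subset in itertools.combinations(sensors, budget):
    value = solve_pomdp_value(T, R, subset, gamma=0.95)
    if value > best_value:
        best_value, best_subset = value, subset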

Next, we evaluate the performance of the greedy algorithm on randomly generated instances of the fMDP-AS problem. Note that any instance of the fMDP-AS problem can be solved for the optimal policy using an MDP solver. We apply the policy iteration algorithm to find the optimal policy and the corresponding infinite-horizon expected return using the MDPToolbox package in Python [43]. We generate 20 instances of fMDP-AS, each with $|\mathcal{S}|=20$ states. We consider a set of 10 actuators $\Phi=\{\phi_1,\ldots,\phi_{10}\}$ with uniform selection costs $k_1=\ldots=k_{10}=1$ and a budget of $K=5$. The action space is given by $\mathcal{A}=\{0,1\}\cup\mathcal{A}_s$, where $\mathcal{A}_s$ is the joint action space corresponding to the selected actuator set. By default, the agent has two available actions, i.e., $\{0,1\}$. Each newly selected actuator adds a new action variable to the joint action $a$. The transition function $\mathcal{T}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]$ is generated by sampling, for each starting state $s\in\mathcal{S}$ and action $a\in\mathcal{A}$, a distribution over ending states $s'\in\mathcal{S}$ at random from the probability simplex $\Delta_{|\mathcal{S}|}$. The rewards $\mathcal{R}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}_{>0}$ for each pair $(s,a)$ are sampled from $\texttt{abs}(\mathcal{N}(0,\sigma))$, with $\sigma\sim\texttt{uniform random}(0,10)$. We apply a brute-force technique to obtain the optimal set of actuators $\Upsilon^{*}$ and compute the optimal return $V_{\Upsilon}^{opt}$. We then run Algorithm 1 for these instances to compute the return $V_{\Upsilon}^{gre}$. As seen in Fig. 5(c), greedy achieves near-optimal performance, attaining on average $85.03\%$ of the optimal.
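The sketch below shows how a single candidate actuator set can be scored with the pymdptoolbox policy iteration solver [43]. It is a simplified sketch: the construction of the action space from the selected actuators is abstracted into a list of actions, the arrays follow the toolbox convention (transitions of shape (A, S, S), rewards of shape (S, A)), and averaging the optimal values over a uniform initial state is an assumption made here for illustration.

import numpy as np
import mdptoolbox.mdp

def evaluate_actuator_set(num_states, actions, gamma=0.95, seed=0):
    """Score one candidate actuator set: build a random instance restricted to
    the induced action set and return the optimal expected return from
    policy iteration (averaged over a uniform initial state)."""
    rng = np.random.default_rng(seed)
    A, S = len(actions), num_states
    # Transitions: each row drawn from the probability simplex, shape (A, S, S).
    P = rng.dirichlet(np.ones(S), size=(A, S))
    # Rewards |N(0, sigma)| with sigma ~ Uniform(0, 10), shape (S, A).
    sigma = rng.uniform(0, 10)
    R = np.abs(rng.normal(0.0, sigma, size=(S, A)))
    pi = mdptoolbox.mdp.PolicyIteration(P, R, gamma)
    pi.run()
    return float(np.mean(pi.V))

# Example: the two default actions plus actions induced by two selected actuators.
print(evaluate_actuator_set(num_states=20, actions=[0, 1, "phi1", "phi2"]))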

6 Conclusions

In this paper, we studied the budgeted (design-time) sensor and actuator selection problems for fMDPs and proved that they are NP-hard in general, and that there is no polynomial-time algorithm that can approximate them to within any non-trivial factor of the optimal. Our inapproximability results directly imply that greedy algorithms cannot provide constant-factor guarantees for these problems, and that the value function of the fMDP is not submodular in the set of sensors (or actuators) selected. We explicitly showed how greedy algorithms can provide arbitrarily poor performance even for very small instances of the fMDP-SS (and fMDP-AS) problems. With these results, we conclude that these problems are more difficult than other variants of sensor and actuator selection problems that have submodular objectives. Finally, we demonstrated the empirical performance of the greedy algorithm for actuator selection in electric networks and on several randomly generated fMDP-SS and fMDP-AS instances, and concluded that although greedy can perform arbitrarily poorly on some instances, it provides near-optimal solutions in practice. Extending the results to fMDPs over finite time horizons and identifying classes of systems that admit near-optimal approximation guarantees are interesting directions for future work.

References

  • [1] Chen Tessler, Yuval Shpigelman, Gal Dalal, Amit Mandelbaum, Doron Haritan Kazakov, Benjamin Fuhrer, Gal Chechik, and Shie Mannor. Reinforcement learning for datacenter congestion control. ACM SIGMETRICS Performance Evaluation Review, 49(2):43–46, 2022.
  • [2] Qinghui Tang, Sandeep Kumar S Gupta, and Georgios Varsamopoulos. Energy-efficient thermal-aware task scheduling for homogeneous high-performance computing data centers: A cyber-physical approach. IEEE Transactions on Parallel and Distributed Systems, 19(11):1458–1472, 2008.
  • [3] Craig Boutilier, Richard Dearden, Moisés Goldszmidt, et al. Exploiting structure in policy construction. In IJCAI, volume 14, pages 1104–1113, 1995.
  • [4] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19:399–468, 2003.
  • [5] Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent planning with factored MDPs. Advances in neural information processing systems, 14, 2001.
  • [6] Joelle Pineau, Geoff Gordon, and Sebastian Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In IJCAI, volume 3, 2003.
  • [7] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and systems, 2008.
  • [8] Mauricio Araya-López, Vincent Thomas, Olivier Buffet, and François Charpillet. A closer look at MOMDPs. In 22nd International Conference on Tools with Artificial Intelligence, volume 2, pages 197–204. IEEE, 2010.
  • [9] Haotian Zhang, Raid Ayoub, and Shreyas Sundaram. Sensor selection for Kalman filtering of linear dynamical systems: Complexity, limitations and greedy algorithms. Automatica, 78:202–210, 2017.
  • [10] Lintao Ye, Nathaniel Woodford, Sandip Roy, and Shreyas Sundaram. On the complexity and approximability of optimal sensor selection and attack for Kalman filtering. IEEE Transactions on Automatic Control, 66(5):2146–2161, 2020.
  • [11] Vasileios Tzoumas, Mohammad Amin Rahimian, George J Pappas, and Ali Jadbabaie. Minimal actuator placement with bounds on control effort. IEEE Transactions on Control of Network Systems, 3(1):67–78, 2015.
  • [12] Wei Li, Fan Zhou, Kaushik Roy Chowdhury, and Waleed Meleis. Qtcp: Adaptive congestion control with reinforcement learning. IEEE Transactions on Network Science and Engineering, 6(3):445–458, 2018.
  • [13] Hosam Rowaihy, Matthew P Johnson, Sharanya Eswaran, Diego Pizzocaro, Amotz Bar-Noy, Thomas La Porta, Archan Misra, and Alun Preece. Utility-based joint sensor selection and congestion control for task-oriented wsns. In 2008 42nd Asilomar Conference on Signals, Systems and Computers, pages 1165–1169. IEEE, 2008.
  • [14] Martin Duggan, Kieran Flesk, Jim Duggan, Enda Howley, and Enda Barrett. A reinforcement learning approach for dynamic selection of virtual machines in cloud data centres. In International Conference on Innovative Computing Technology (INTECH), pages 92–97. IEEE, 2016.
  • [15] Farshid Hassani Bijarbooneh, Wei Du, Edith C-H Ngai, Xiaoming Fu, and Jiangchuan Liu. Cloud-assisted data fusion and sensor selection for internet of things. IEEE Internet of Things Journal, 3(3):257–268, 2015.
  • [16] Chase St Laurent and Raghvendra V Cowlagi. Coupled sensor configuration and path-planning in unknown static environments. In American Control Conference (ACC), pages 1535–1540. IEEE, 2021.
  • [17] Arthur W Mahoney, Trevor L Bruns, Philip J Swaney, and Robert J Webster. On the inseparable nature of sensor selection, sensor placement, and state estimation for continuum robots or “where to put your sensors and how to use them”. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 4472–4478, 2016.
  • [18] Han-Pang Chiu, Xun S Zhou, Luca Carlone, Frank Dellaert, Supun Samarasekera, and Rakesh Kumar. Constrained optimal selection for multi-sensor robot navigation using plug-and-play factor graphs. In IEEE International Conference on Robotics and Automation (ICRA), pages 663–670, 2014.
  • [19] Circuit Globe: Electric Grid. https://circuitglobe.com/electrical-grid.html.
  • [20] Mahsa Ghasemi and Ufuk Topcu. Online active perception for partially observable Markov decision processes with limited budget. In Conference on Decision and Control (CDC), pages 6169–6174. IEEE, 2019.
  • [21] Vikram Krishnamurthy and Dejan V Djonin. Structured threshold policies for dynamic sensor scheduling—a partially observed Markov decision process approach. IEEE Transactions on Signal Processing, 55(10):4938–4957, 2007.
  • [22] Shihao Ji, Ronald Parr, and Lawrence Carin. Nonmyopic multiaspect sensing with partially observable Markov decision processes. IEEE Transactions on Signal Processing, 55(6):2720–2730, 2007.
  • [23] Krithika Manohar, J Nathan Kutz, and Steven L Brunton. Optimal sensor and actuator selection using balanced model reduction. IEEE Transactions on Automatic Control, 67(4):2108–2115, 2021.
  • [24] Tyler H Summers and John Lygeros. Optimal sensor and actuator placement in complex dynamical networks. IFAC Proceedings Volumes, 47(3):3784–3789, 2014.
  • [25] Alexander Fuchs and Manfred Morari. Actuator performance evaluation using lmis for optimal hvdc placement. In 2013 European Control Conference (ECC), pages 1529–1534. IEEE, 2013.
  • [26] A. Fuchs and M. Morari. Placement of HVDC links for power grid stabilization during transients. In 2013 IEEE Grenoble Conference, pages 1–6. IEEE, 2013.
  • [27] Abolfazl Hashemi, Haris Vikalo, and Gustavo de Veciana. On the benefits of progressively increasing sampling sizes in stochastic greedy weak submodular maximization. IEEE Transactions on Signal Processing, 70:3978–3992, 2022.
  • [28] Abolfazl Hashemi, Mahsa Ghasemi, Haris Vikalo, and Ufuk Topcu. Randomized greedy sensor selection: Leveraging weak submodularity. IEEE Transactions on Automatic Control, 66(1):199–212, 2020.
  • [29] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.
  • [30] Rajiv Khanna, Ethan Elenberg, Alex Dimakis, Sahand Negahban, and Joydeep Ghosh. Scalable greedy feature selection via weak submodularity. In Artificial Intelligence and Statistics, pages 1560–1568. PMLR, 2017.
  • [31] Christos H Papadimitriou and John N Tsitsiklis. The complexity of Markov decision processes. Mathematics of operations research, 12(3):441–450, 1987.
  • [32] Jayanth Bhargav, Mahsa Ghasemi, and Shreyas Sundaram. On the Complexity and Approximability of Optimal Sensor Selection for Mixed-Observable Markov Decision Processes. In 2023 American Control Conference (ACC), pages 3332–3337. IEEE, 2023.
  • [33] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
  • [34] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137–146, 2003.
  • [35] Grant Schoenebeck and Biaoshuai Tao. Beyond worst-case (in)approximability of nonsubmodular influence maximization. ACM Transactions on Computation Theory (TOCT), 11(3):1–56, 2019.
  • [36] Richard M Karp. Reducibility among combinatorial problems. Springer, 2010.
  • [37] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1):265–294, 1978.
  • [38] Andreas Krause and Carlos Guestrin. Near-optimal observation selection using submodular functions. In AAAI, volume 7, pages 1650–1654, 2007.
  • [39] Ruidong Yan, Deying Li, Weili Wu, Ding-Zhu Du, and Yongcai Wang. Minimizing influence of rumors by blockers on social networks: algorithms and analysis. IEEE Transactions on Network Science and Engineering, 7(3):1067–1078, 2019.
  • [40] SolvePOMDP-a java program that solves Partially Observable Markov Decision Processes. https://www.erwinwalraven.nl/solvepomdp/.
  • [41] Anthony R Cassandra, Michael L Littman, and Nevin Lianwen Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. arXiv preprint arXiv:1302.1525, 2013.
  • [42] Erwin Walraven and Matthijs Spaan. Accelerated vector pruning for optimal POMDP solvers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
  • [43] Markov Decision Process (MDP) Toolbox - a toolbox with classes and functions for the resolution of discrete-time Markov Decision Processes. https://pymdptoolbox.readthedocs.io/en/latest/api/mdptoolbox.html.