
Minimum-Cost State-Flipped Control
for Reachability of Boolean Control Networks
using Reinforcement Learning

Jingjie Ni, Yang Tang, Fangfei Li

This work is supported by the National Natural Science Foundation of China under Grants 62173142 and 62233005, the Programme of Introducing Talents of Discipline to Universities (the 111 Project) under Grant B17017, and the Fundamental Research Funds for the Central Universities under Grant JKM01231838.

Jingjie Ni is with the School of Mathematics, East China University of Science and Technology, Shanghai, 200237, P.R. China (email: [email protected]).

Yang Tang is with the Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai, 200237, P.R. China (email: [email protected]).

Fangfei Li is with the School of Mathematics and the Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai, 200237, P.R. China (email: [email protected], [email protected]).
Abstract

This paper proposes model-free reinforcement learning methods for minimum-cost state-flipped control in Boolean control networks (BCNs). We tackle two questions: 1) finding the flipping kernel, namely the flip set with the smallest cardinality ensuring reachability, and 2) deriving optimal policies to minimize the number of flipping actions for reachability based on the obtained flipping kernel. For question 1), Q-learning's capability in determining reachability is demonstrated. To expedite convergence, we incorporate two improvements: i) demonstrating that previously reachable states remain reachable after adding elements to the flip set, followed by employing transfer learning, and ii) initiating each episode with special initial states whose reachability to the target state set is currently unknown. Question 2) requires optimal control with terminal constraints, while Q-learning only handles unconstrained problems. To bridge the gap, we propose a BCN-characteristics-based reward scheme and prove its optimality. Questions 1) and 2) with large-scale BCNs are addressed by employing small memory Q-learning, which reduces memory usage by only recording visited action-values. An upper bound on memory usage is provided to assess the algorithm's feasibility. To expedite convergence for question 2) in large-scale BCNs, we introduce adaptive variable rewards based on the known maximum number of steps needed to reach the target state set without cycles. Finally, the effectiveness of the proposed methods is validated on both small- and large-scale BCNs.

Index Terms:
Boolean control networks, model-free, reachability, reinforcement learning, state-flipped control.

I Introduction

Gene regulatory networks are a crucial focus in systems biology. The concept of Boolean networks was first introduced in 1969 by Kauffman [1] as the earliest model for gene regulatory networks. Boolean networks consist of several Boolean variables, each of which takes a value of ``0” or ``1” to signify whether the gene is transcribed or not. As gene regulatory networks often involve external inputs, Boolean networks were further developed into Boolean control networks (BCNs), which incorporate control inputs to better describe network behaviors.

State-flipped control is an innovative method used to change the state of specific nodes in BCNs, flipping them from ``1" to ``0" or from ``0" to ``1". In the context of systems biology, state-flipped control can be achieved through gene regulation techniques such as transcription activator-like effector repression and clustered regularly interspaced short palindromic repeats activation [2]. State-flipped control is a powerful control method that minimally disrupts the system structure. From a control effectiveness standpoint, state-flipped control enables BCNs to achieve any desired state with an in-degree greater than 0 [3, 4]. In contrast, state feedback control can only steer BCNs towards the original reachable state set, which is a subset of states with an in-degree greater than 0. Another highly effective control method is pinning control, but it has the drawback of causing damage to the network structure, unlike state-flipped control. State-flipped control was first proposed by Rafimanzelat to study the controllability of attractors in BNs [5]. Subsequently, to achieve stabilization, authors in [6] and [7] respectively considered flipping a subset of nodes after the BN had passed its transient period and flipping a subset of nodes in BCNs at the initial step. Zhang et al. further extended the concept of state-flipped control from BCNs to switched BCNs [8]. Considering that previous research has primarily focused on one-time flipping [5, 6, 8, 7], which may hinder the achievement of certain control objectives, a more comprehensive form of state-flipped control that permits multiple flipping actions was introduced to study stabilization in BNs and BCNs [3, 4]. Furthermore, to minimize control costs, the concept of flipping kernel was proposed [4, 3], representing the smallest set of nodes that can accomplish the control objectives. In this paper, we extend the existing studies by investigating problems including finding flipping kernels, under joint control proposed by [3]. These joint controls involve the combination of state-flipped controls and control inputs as defined in [3].

Closely tied to stabilization and controllability, reachability is a prominent area that requires extensive exploration. It involves determining whether a system can transition from an initial subset of states to a desired subset of states. This concept holds significant importance in domains such as genetic reprogramming and targeted drug development, where the ability to transform networks from unfavorable states to desirable ones is pivotal [6]. Previous studies [9, 10, 11, 12] have proposed various necessary and sufficient conditions for reachability in BCNs and their extensions. Additionally, an algorithm has been developed to identify an optimal control policy that achieves reachability in the shortest possible time [10]. In the context of BCNs under state-flipped control, the analysis of reachability between two states can be conducted using the semi-tensor product [3].

Despite extensive studies on the reachability of BCNs, there are still unresolved issues in this field. These problems can be categorized into three aspects.

  1. Firstly, the existing literature [9, 10, 11, 12] focuses on determining whether reachability can be achieved under certain control policies, without optimizing control costs, which is significant in both practical applications and theoretical research. For instance, in the development of targeted drugs, an increasing number of target points raises the difficulty level. Additionally, it is desirable to minimize the frequency of drug usage by patients to achieve expense savings and reduce drug side effects. If we formulate this problem as an optimal control problem, our objective is to find the flipping kernel and minimize the number of flipping actions. Emphasizing the identification of flipping kernels is more critical than minimizing flipping actions, as it significantly reduces the dimensionality of the joint control, thereby exponentially alleviating the computational complexity when minimizing flipping actions. It is worth mentioning that achieving reachability while optimizing control costs poses an optimization problem with a terminal constraint. To the best of our knowledge, conventional optimal control techniques such as the path integral approach and policy iteration [13, 14, 15, 16], which rely on predetermined cost functions, are unsuitable for our study. This is primarily because the objective of simultaneously minimizing control costs and achieving the terminal goal, namely ``achieving reachability," is difficult to express solely through a cost function.

  2. Secondly, the model-free scenario requires special consideration, especially given the increasing complexity of accurately modeling systems with numerous BCN nodes. While model-free approaches exist in the field of PBCNs [17, 18, 4, 3, 19, 20], they address different challenges from ours. Additionally, there are ongoing investigations into cost-reduction problems [9, 10, 11, 12, 21, 22, 23]. However, these studies are conducted under the assumption of known or partially known models.

  3. Thirdly, the existing methods in the field of BCN reachability [9, 10, 11, 12, 21] are limited in their applicability to small-scale BCNs. Several interesting approaches have been proposed for handling large-scale BCNs, with an emphasis on controllability and stability [24, 25, 26]. However, these references concentrate on optimization goals slightly different from ours. They aim to minimize the set of controlled nodes, namely, the dimension of the controller. In contrast, our objective is not only to reduce the controller dimension but also to further minimize flipping actions, namely, the frequency of control implementation.

Considering the aforementioned issues, we aim to design minimum-cost state-flipped control for achieving reachability in BCNs, in the absence of a known system model. Specifically, we address the following questions:

  1) How can we find the flipping kernel for reachability?

  2) Building on question 1), how can we determine the control policies that minimize the number of flipping actions required to achieve reachability?

Questions 1) and 2) are the same as 1) and 2) mentioned in the abstract. They are reiterated here for clarity purposes.

To tackle these questions, we turn to reinforcement learning-based methods. Compared with the semi-tensor product approach, reinforcement learning-based methods are suitable for model-free scenarios, which aligns with our considered setting. Moreover, reinforcement learning eliminates the need for matrix calculations and offers the potential to handle large-scale BCNs. In particular, we consider the extensively applied Q-learning (QL) method, which has been successful in solving control problems in BCNs and their extensions [17, 18, 4, 3, 19]. For reachability problems of large-scale BCNs, an enhanced QL method has been proposed in [20], providing a reliable foundation for this study. When compared to the deep Q-learning method [27], which is also suitable for large-scale BCNs, the method proposed in [20] is favored due to its theoretical guarantees of convergence [28].

Taken together, we propose improved QL methods to address the three issues above. Our work's main contributions are as follows:

  1) Regarding question 1), unlike the semi-tensor product method proposed in [4, 3], our approach is model-free. We demonstrate that reachability can be determined using QL and propose an improved QL with faster convergence than the traditional one [29]. Firstly, transfer learning (TL) [30] is integrated under the premise that the transferred knowledge is proven to be effective. Then, special initial states are incorporated into QL, allowing each episode to begin from a state whose reachability is still unknown, thereby eliminating unnecessary computations.

  2) Question 2) involves optimal control with terminal constraints, while QL only handles unconstrained problems. To bridge this gap, a BCN-characteristics-based reward scheme is proposed. Compared to other model-free approaches [17, 18, 4, 3, 19, 20], we are the first to tackle this more complex problem while providing rigorous theoretical guarantees.

  3) Compared to matrix-based methods [4, 3] and deep reinforcement learning techniques [17, 19, 18], our algorithms are suitable for large-scale BCNs with convergence guarantees. In contrast to [20], we clarify that the memory usage for the presented problem is not directly related to the scale of the BCN, and we provide an upper bound to evaluate the algorithm's suitability. Further, to accelerate convergence in question 2), a modified reward scheme is introduced.

The paper progresses as follows: Section II provides an overview of the system model and defines the problem statement. Section III introduces the reinforcement learning method. In Section IV, fast iterative QL and small memory iterative QL for large-scale BCNs are presented to identify the flipping kernel for reachability. In Section V, to minimize the number of flipping actions, QL with a BCN-characteristics-based reward setting and QL with dynamic rewards for large-scale BCNs are proposed. Algorithm complexity is detailed in Section VI. Section VII validates the proposed methods. Finally, conclusions are drawn in Section VIII.

We use the following notations throughout this paper:

  • \mathbb{Z}^{+}, \mathbb{R}^{+}, \mathbb{R}, and \mathbb{R}^{m\times n} denote the sets of non-negative integers, non-negative real numbers, real numbers, and m\times n real matrices, respectively.

  • \mathbb{E}[\cdot] denotes the expected value operator and \operatorname{var}[\cdot] denotes the variance. \Pr\{E_{1}\mid E_{2}\} denotes the probability of event E_{1} occurring given that event E_{2} has occurred.

  • For a set S, |S| denotes the cardinality of S.

  • For M_{m\times n}\subset\mathbb{R}^{m\times n}, |M_{m\times n}|_{\infty} denotes the maximal value of the elements in M_{m\times n}.

  • The relative complement of set S_{2} in set S_{1} is denoted as S_{1}\setminus S_{2}.

  • There are four basic operations on Boolean variables, namely ``not", ``and", ``or", and ``exclusive or", expressed as \neg, \wedge, \vee, and \oplus, respectively.

  • \mathcal{B}:=\{0,1\}, and \mathcal{B}^{n}:=\underbrace{\mathcal{B}\times\ldots\times\mathcal{B}}_{n}.

II Problem Formulation

This section introduces the system model, with a specific focus on BCNs under state-flipped control. Subsequently, we outline the problems under investigation.

II-A System Model

II-A1 BCNs

A BCN with n nodes and m control inputs is defined as

x_{i}(t+1)=f_{i}\big(x_{1}(t),\ldots,x_{n}(t),u_{1}(t),\ldots,u_{m}(t)\big),\quad t\in\mathbb{Z}^{+},\ i=1,\ldots,n, (1)

where f_{i}:\mathcal{B}^{n+m}\rightarrow\mathcal{B},\ i\in\{1,\ldots,n\} is a logical function, x_{i}(t)\in\mathcal{B},\ i\in\{1,\ldots,n\} represents the i^{th} node at time step t, and u_{j}(t)\in\mathcal{B},\ j\in\{1,\ldots,m\} represents the j^{th} control input at time step t. All nodes at time step t are grouped together in the vector x(t)=(x_{1}(t),\ldots,x_{n}(t))\in\mathcal{B}^{n}. Similarly, all control inputs at time step t are represented by u(t)=(u_{1}(t),\ldots,u_{m}(t))\in\mathcal{B}^{m}.

II-A2 State-flipped Control

Considering that not all nodes in a BCN can be flipped, we refer to the set of all flippable nodes as a flip set A\subseteq\{1,2,\ldots,n\}. At each time step t, a flipping action A(t)=\{i_{1},i_{2},\ldots,i_{k}\}\subseteq A is selected. According to the flipping action A(t), the flip function is defined as

\eta_{A(t)}\big(x(t)\big)=\big(x_{1}(t),\ldots,\neg x_{i_{1}}(t),\ldots,\neg x_{i_{2}}(t),\ldots,\neg x_{i_{k}}(t),\ldots,x_{n}(t)\big). (2)

Note that to achieve a specific control objective, it is not necessary to flip all the nodes in the set A. The flipping kernel, denoted as B^{*}\subset A, is defined as the flip set with the minimum cardinality required to achieve reachability.

II-A3 BCNs under State-flipped Control

Based on the definitions of BCNs and state-flipped control, a BCN with n nodes and a flip set A is defined as

x_{i}(t+1)=f_{i}\big(\eta_{A(t)}(x(t)),u(t)\big),\ i=1,\ldots,n. (3)
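To make the dynamics concrete, the following Python sketch simulates one transition of system (3) for a hypothetical 3-node BCN with a single control input. The logical functions f_1, f_2, f_3 and the node indexing are illustrative assumptions, not a model taken from this paper.

def f(x, u):
    # Logical update (1) of a hypothetical 3-node BCN; placeholder functions.
    x1, x2, x3 = x
    return (int(x2 and u[0]),      # f_1
            int(x1 or x3),         # f_2
            int(x1 != x2))         # f_3 ("exclusive or")

def flip(x, flip_action):
    # Flip function eta_{A(t)} in (2): negate the nodes listed in flip_action.
    return tuple(1 - xi if (i + 1) in flip_action else xi
                 for i, xi in enumerate(x))

def step(x, u, flip_action):
    # One transition of the BCN under state-flipped control, eq. (3).
    return f(flip(x, flip_action), u)

# Example: from state (0, 1, 0), apply u = (1,) while flipping node 3.
print(step((0, 1, 0), (1,), {3}))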

II-B Problems of Interest

II-B1 Problem 1. Flipping Kernel for Reachability

To better illustrate Problem 1, we first define reachability.

\mathbf{Definition\ 1} [31]. For system (3), let \mathcal{M}_{0}\subset\mathcal{B}^{n} be the initial subset and \mathcal{M}_{d}\subset\mathcal{B}^{n} be the target subset. \mathcal{M}_{d} is said to be reachable from \mathcal{M}_{0} if and only if, for any initial state x_{0}\in\mathcal{M}_{0}, there exists a sequence of joint control pairs \big\{\big(u(t),\eta_{A(t)}\big),t=0,1,\ldots,T\big\} such that x_{0} reaches a target state x_{d}\in\mathcal{M}_{d}.

Based on Definition 1, we define the specific problem to be considered. In some cases, it is not necessary to flip all nodes in the set A for system (3) to achieve reachability. To reduce the control cost, we want as few nodes as possible to be flipped. This problem can be formulated as finding the flipping kernel B^{*} for reachability, which satisfies

\min_{B}|B|\quad\text{s.t. }B\subset A\text{ and system (3) from any state in }\mathcal{M}_{0}\text{ is reachable to state set }\mathcal{M}_{d}. (4)

It is worth noting that the flipping kernel may not be unique, as there can be multiple flip sets of the minimum cardinality |B^{*}| that achieve reachability.

II-B2 Problem 2. Minimum Flipping Actions for Reachability

Based on the flipping kernel B^{*} obtained from Problem 1 (4), we aim to determine the optimal policy under which reachability can be achieved with the minimum number of flipping actions. This problem can be formulated as finding the policy \pi^{*}:x(t)\rightarrow\big(u(t),\eta_{B^{*}(t)}\big),\ \forall t\in\mathbb{Z}^{+} that satisfies

\min_{\pi}\sum_{t=0}^{T}n_{t}\quad\text{s.t. system (3) from any state in }\mathcal{M}_{0}\text{ is reachable to state set }\mathcal{M}_{d}, (5)

where n_{t} denotes the number of nodes flipped at time step t, and T represents the terminal time step, namely the time step at which a terminal state emerges. The terminal and initial states are any x_{d}\in\mathcal{M}_{d} and any x_{0}\in\mathcal{M}_{0}, respectively. Note that the optimal policy \pi^{*} may not be unique, as multiple policies can achieve reachability with the minimum number of flipping actions.

III Preliminaries

In this section, we introduce the reinforcement learning method, specifically focusing on the Markov decision process and the Q-learning algorithm.

III-A Markov Decision Process

A Markov decision process provides the framework for reinforcement learning and is represented as a quintuple (\mathbf{X},\mathbf{A},\gamma,\mathbf{P},\mathbf{R}). \mathbf{X}=\{x_{t},t\in\mathbb{Z}^{+}\} and \mathbf{A}=\{a_{t},t\in\mathbb{Z}^{+}\} denote the state space and action space, respectively. The discount factor \gamma\in[0,1] weighs the importance of future rewards. The state-transition probability \mathbf{P}_{x_{t}}^{x_{t+1}}(a_{t})=\Pr\{x_{t+1}\mid x_{t},a_{t}\} represents the probability of transitioning from state x_{t} to x_{t+1} under action a_{t}. \mathbf{R}_{x_{t}}^{x_{t+1}}(a_{t})=\mathbb{E}[r_{t+1}\mid x_{t},a_{t},x_{t+1}] denotes the expected reward obtained by taking action a_{t} at state x_{t} and transitioning to x_{t+1}. At each time step t\in\mathbb{Z}^{+}, an agent interacts with the environment to determine an optimal policy. Specifically, the agent observes x_{t} and selects a_{t} according to the policy \pi:x_{t}\rightarrow a_{t},\ \forall t\in\mathbb{Z}^{+}. Then, the environment returns r_{t+1} and x_{t+1}. The agent evaluates the reward r_{t+1} received for taking action a_{t} at state x_{t} and then updates its policy \pi. Define G_{t}=\sum_{i=t+1}^{T}\gamma^{i-t-1}r_{i} as the return. The goal of the agent is to learn the optimal policy \pi^{*}:x_{t}\rightarrow a_{t},\ \forall t\in\mathbb{Z}^{+}, which satisfies

\pi^{*}=\arg\max_{\pi\in\Pi}\mathbb{E}_{\pi}[G_{t}],\ \forall t\in\mathbb{Z}^{+}, (6)

where \mathbb{E}_{\pi} is the expected value operator following policy \pi, and \Pi is the set of all admissible policies. The state-value and the action-value are v_{\pi}(x_{t})=\mathbb{E}_{\pi}[G_{t}|x_{t}] and q_{\pi}(x_{t},a_{t})=\mathbb{E}_{\pi}[G_{t}|x_{t},a_{t}], respectively. The Bellman equations reveal the recursion of v_{\pi}(x_{t}) and q_{\pi}(x_{t},a_{t}), which are given as follows:

v_{\pi}(x_{t})=\sum_{x_{t+1}\in\mathbf{X}}\mathbf{P}_{x_{t}}^{x_{t+1}}\big(\pi(x_{t})\big)\big[\mathbf{R}_{x_{t}}^{x_{t+1}}\big(\pi(x_{t})\big)+\gamma v_{\pi}(x_{t+1})\big], (7)
q_{\pi}(x_{t},a_{t})=\sum_{x_{t+1}\in\mathbf{X}}\mathbf{P}_{x_{t}}^{x_{t+1}}(a_{t})\big[\mathbf{R}_{x_{t}}^{x_{t+1}}(a_{t})+\gamma v_{\pi}(x_{t+1})\big].

The optimal state-value and action-value are defined as

v^{*}(x_{t})=\max_{\pi\in\Pi}v_{\pi}(x_{t}),\ \forall x_{t}\in\mathbf{X}, (8)
q^{*}(x_{t},a_{t})=\max_{\pi\in\Pi}q_{\pi}(x_{t},a_{t}),\ \forall x_{t}\in\mathbf{X},\ \forall a_{t}\in\mathbf{A}.

For problems under the framework of a Markov decision process, the optimal policy \pi^{*} can be obtained through QL. In the following, we introduce QL.

III-B Q-Learning

QL is a classical algorithm in reinforcement learning. The purpose of QL is to enable the agent to find an optimal policy \pi^{*} through its interactions with the environment. Q_{t}(x_{t},a_{t}), namely the estimate of q^{*}(x_{t},a_{t}), is recorded and updated at every time step as follows:

Q_{t+1}(x,a)=\begin{cases}Q_{t}(x,a)+\alpha_{t}TDE_{t},&\text{if }(x,a)=(x_{t},a_{t}),\\ Q_{t}(x,a),&\text{else},\end{cases} (9)

where \alpha_{t}\in(0,1],\ t\in\mathbb{Z}^{+} is the learning rate, and TDE_{t}=r_{t+1}+\gamma\max_{a\in\mathbf{A}}Q_{t}(x_{t+1},a)-Q_{t}(x_{t},a_{t}) is the temporal-difference error.

In terms of action selection, the \epsilon-greedy method is used:

a_{t}=\begin{cases}\arg\max_{a\in\mathbf{A}}Q_{t}(x_{t},a),&P=1-\epsilon,\\ \operatorname{rand}(\mathbf{A}),&P=\epsilon,\end{cases} (10)

where P is the probability of selecting the corresponding action, 0\leq\epsilon\leq 1, and \operatorname{rand}(\mathbf{A}) is an action selected uniformly at random from \mathbf{A}.

\mathbf{Lemma\ 1} [28]. Q_{t}(x_{t},a_{t}) converges to the fixed point q^{*}(x_{t},a_{t}) with probability one under the following conditions:

  1) \sum_{t=0}^{\infty}\alpha_{t}=\infty and \sum_{t=0}^{\infty}\alpha_{t}^{2}<\infty.

  2) \operatorname{var}[r_{t}] is finite.

  3) |\mathbf{X}| and |\mathbf{A}| are finite.

  4) If \gamma=1, all policies lead to a cost-free terminal state.

Once QL converges, the optimal policy is obtained:

\pi^{*}(x_{t})=\arg\max_{a\in\mathbf{A}}Q_{t}(x_{t},a),\ \forall x_{t}\in\mathbf{X}. (11)
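As an illustration of update rule (9), the \epsilon-greedy selection (10), and policy extraction (11), the following Python sketch implements tabular QL against a generic simulator. The interface env_step(x, a) -> (x_next, r) and all parameter values are assumptions introduced for illustration, not part of the paper.

import random
from collections import defaultdict

def q_learning(states, actions, env_step, initial_states,
               episodes=1000, t_max=50, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)                        # Q[(x, a)], initialized to 0
    for _ in range(episodes):
        x = random.choice(initial_states)
        for _ in range(t_max):
            if random.random() < eps:             # explore, probability eps
                a = random.choice(actions)
            else:                                 # exploit, probability 1 - eps
                a = max(actions, key=lambda b: Q[(x, b)])
            x_next, r = env_step(x, a)
            td_error = r + gamma * max(Q[(x_next, b)] for b in actions) - Q[(x, a)]
            Q[(x, a)] += alpha * td_error         # update rule (9)
            x = x_next
    # Policy extraction (11) once the action-values have converged.
    return {x: max(actions, key=lambda a: Q[(x, a)]) for x in states}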

IV Flipping Kernel for Reachability

To solve Problem 1 (4), we propose three algorithms: iterative QL, fast iterative QL, and small memory iterative QL. Fast iterative QL enhances convergence efficiency over iterative QL. Meanwhile, small memory iterative QL is designed specifically for large-scale BCNs (3).

IV-A Markov Decision Process for Finding Flipping Kernel

The premise for using QL is to structure the problem within the framework of a Markov decision process. We represent the Markov decision process by the quintuple (\mathcal{B}^{n},\mathcal{B}^{m+|B|},\mathbf{P},\mathbf{R},\gamma), where \mathcal{B}^{n} and \mathcal{B}^{m+|B|} are the state space and action space, respectively. The state and the action are defined as x_{t}=x(t) and a_{t}=\big(u(t),\eta_{A(t)}\big), respectively. The reward is given as follows:

r_{t}(x_{t},a_{t})=\begin{cases}100,&x_{t}\in\mathcal{M}_{d},\\ 0,&\text{else}.\end{cases} (12)

To incentivize the agent to reach x_{t}\in\mathcal{M}_{d} as quickly as possible, we set the discount factor \gamma\in(0,1).

We assume the agent is aware of the dimensions of \mathcal{B}^{n} and \mathcal{B}^{m+|B|} but lacks knowledge of \mathbf{P} and \mathbf{R}, meaning the agent does not know the system dynamics (3). Through interactions with the environment, the agent implicitly learns about \mathbf{P} and \mathbf{R}, refines its estimate of q^{*}(x_{t},a_{t}), and thereby improves its policy toward \pi^{*}.

IV-B Iterative Q-Learning for Finding Flipping Kernel

Under the reward setting (12), it is worthwhile to investigate how to determine the reachability of the BCN defined by equation (3). To this end, we present Theorem 1.

\mathbf{Theorem\ 1}. For system (3), assume that Q_{0}(x_{t},a_{t})=0,\ \forall x_{t},\forall a_{t}, and that equation (9) is utilized for updating Q_{t}(x_{t},a_{t}). System (3) from any state in \mathcal{M}_{0} is reachable to state set \mathcal{M}_{d} if and only if there exists a t^{*} such that \max_{a\in\mathbf{A}}Q_{t^{*}}(x_{0},a)>0,\ \forall x_{0}\in\mathcal{M}_{0}.

\mathbf{Proof}. (Necessity) Assume that system (3) from any state in \mathcal{M}_{0} is reachable to state set \mathcal{M}_{d}. Without loss of generality, suppose that there is only one x_{0}\in\mathcal{M}_{0} and one x_{d}\in\mathcal{M}_{d}. Then, there exists a joint control pair sequence \big\{a_{t}=\big(u(t),\eta_{A(t)}\big),t=0,1,\ldots,T\big\} such that x_{0} reaches the target state x_{d}. Applying this sequence to x_{0}, we obtain a trajectory x_{0}\stackrel{a_{0}}{\longrightarrow}x_{1}\stackrel{a_{1}}{\longrightarrow}\ldots x_{T}\stackrel{a_{T}}{\longrightarrow}x_{T+1}, where x_{T+1}=x_{d}. Here, x_{t}\stackrel{a_{t}}{\longrightarrow}x_{t+1} means that x_{t} transitions to x_{t+1} when a_{t} is taken.

As QL iterates, all state-action-state pairs (x_{t},a_{t},x_{t+1}) are constantly visited [28]. Therefore, there can be a case where the state-action-state pairs (x_{T},a_{T},x_{T+1}), (x_{T-1},a_{T-1},x_{T}), (x_{T-2},a_{T-2},x_{T-1}), \ldots, (x_{0},a_{0},x_{1}) are visited one after another, possibly with other state-action-state pairs visited in between. This case causes the values in the Q-table to transition from all zeros to some being positive. Assume that this case first occurs at time step t^{\prime}. Then, the corresponding change of an action-value from zero to positive is given by Q_{t^{\prime}}(x_{T},a_{T})\leftarrow\alpha_{t^{\prime}-1}\big(r_{t^{\prime}}+\gamma\max_{a\in\mathbf{A}}Q_{t^{\prime}-1}(x_{T+1},a)\big)+(1-\alpha_{t^{\prime}-1})Q_{t^{\prime}-1}(x_{T},a_{T}), where r_{t^{\prime}}=100 since x_{T+1}\in\mathcal{M}_{d}. Additionally, due to the absence of negative rewards and the fact that the initial action-values are all 0, it follows that Q_{t^{\prime}-1}(x_{T+1},a)\geq 0 for all a\in\mathbf{A}. Thus, after the update, we have Q_{t^{\prime}}(x_{T},a_{T})>0. According to (9), at t^{\prime}+1, it follows that Q_{t^{\prime}+1}(x_{T},a_{T})>0. Similarly, for any t^{\prime\prime}>t^{\prime}, it follows that Q_{t^{\prime\prime}}(x_{T},a_{T})>0.

Next, we prove by mathematical induction that there exists t^{*} such that Q_{t^{*}}(x_{\overline{t}},a_{\overline{t}})>0, where \overline{t}=T,T-1,\ldots,0. The proof is divided into two parts. First, there exists t^{\prime} such that Q_{t^{\prime}}(x_{\overline{t}},a_{\overline{t}})>0 for \overline{t}=T, according to the previous paragraph. Second, we prove that if (x_{\overline{t}},a_{\overline{t}},x_{\overline{t}+1}) is visited at time step t^{\prime\prime} and Q_{t^{\prime\prime}}(x_{\overline{t}+1},a_{\overline{t}+1})>0, then Q_{t^{\prime\prime}+1}(x_{\overline{t}},a_{\overline{t}})>0. Since (x_{\overline{t}},a_{\overline{t}},x_{\overline{t}+1}) is visited at time step t^{\prime\prime}, it follows that Q_{t^{\prime\prime}+1}(x_{\overline{t}},a_{\overline{t}})\leftarrow\alpha_{t^{\prime\prime}}\big(r_{t^{\prime\prime}+1}+\gamma\max_{a\in\mathbf{A}}Q_{t^{\prime\prime}}(x_{\overline{t}+1},a)\big)+(1-\alpha_{t^{\prime\prime}})Q_{t^{\prime\prime}}(x_{\overline{t}},a_{\overline{t}}). In this update, we have r_{t^{\prime\prime}+1}\geq 0 and Q_{t^{\prime\prime}}(x_{\overline{t}},a_{\overline{t}})\geq 0. Furthermore, since Q_{t^{\prime\prime}}(x_{\overline{t}+1},a_{\overline{t}+1})>0, it implies that \max_{a\in\mathbf{A}}Q_{t^{\prime\prime}}(x_{\overline{t}+1},a)>0. Consequently, we have Q_{t^{\prime\prime}+1}(x_{\overline{t}},a_{\overline{t}})>0. By mathematical induction, there exists t^{*} such that Q_{t^{*}}(x_{0},a_{0})>0. Since a_{0}\in\mathbf{A}, we have \max_{a\in\mathbf{A}}Q_{t^{*}}(x_{0},a)>0.

(Sufficiency) Suppose that there exists t^{*} such that \max_{a\in\mathbf{A}}Q_{t^{*}}(x_{0},a)>0,\ \forall x_{0}\in\mathcal{M}_{0}. Without loss of generality, we assume that there is only one x_{0}\in\mathcal{M}_{0} as well as one x_{d}\in\mathcal{M}_{d}, that \arg\max_{a\in\mathbf{A}}Q_{t^{*}}(x_{0},a)=a_{t}, and that \max_{a\in\mathbf{A}}Q_{t^{*}-1}(x_{0},a)=0. Then, according to the update formula Q_{t^{*}}(x_{0},a_{t})\leftarrow\alpha_{t^{*}-1}\big(r_{t^{*}}+\gamma\max_{a\in\mathbf{A}}Q_{t^{*}-1}(x_{0\text{next}},a)\big)+(1-\alpha_{t^{*}-1})Q_{t^{*}-1}(x_{0},a_{t}), where x_{0\text{next}} is the successor state of x_{0}, at least one of the following two cases holds:

  • Case 1. There exist a_{t} and x_{0\text{next}} such that r_{t^{*}}(x_{0\text{next}},a_{t})>0.

  • Case 2. x_{0\text{next}} satisfies \max_{a\in\mathbf{A}}Q_{t^{*}-1}(x_{0\text{next}},a)>0.

In Case 1, we have x_{0\text{next}}\in\mathcal{M}_{d}. This implies that system (3) from state x_{0} is reachable to state set \mathcal{M}_{d} in one step.

In terms of Case 2, let us consider the condition under which \max_{a\in\mathbf{A}}Q_{t^{*}-1}(x_{0\text{next}},a)>0, meaning that there exists an action a\in\mathbf{A} such that Q_{t^{*}-1}(x_{0\text{next}},a)>0. We define the i^{th} action-value that changes from 0 to a positive number as Q_{t_{i}}(x_{i},a_{i}). Here, t_{i} denotes the time step of the change of the i^{th} action-value, and (x_{i},a_{i}) corresponds to the state-action pair. Since the initial Q-values are all 0, it follows that Q_{t_{1}}(x_{1},a_{1})>0 if and only if x_{1\text{next}}\in\mathcal{M}_{d}. This implies that system (3) from state x_{1} is reachable to state set \mathcal{M}_{d} within one step, leading to r_{t_{1}}>0. Next, for any i>1, Q_{t_{i}}(x_{i},a_{i})>0 occurs if and only if x_{i\text{next}}\in\mathcal{M}_{d}, or x_{i\text{next}}=x_{j},\ j<i, where x_{j} is a state that has satisfied Q_{t_{j}}(x_{j},a_{j})>0 for t_{j}<t_{i} based on the previously defined conditions. This means that for Q_{t_{i}}(x_{i},a_{i})>0,\ \forall i, at least one of the following two events has taken place: either \mathcal{M}_{d} is reachable from x_{i}, or a state x_{j},\ j<i, which can reach \mathcal{M}_{d}, is reachable from x_{i}. In both cases, \mathcal{M}_{d} is reachable from x_{i}. Thus, if Q_{t^{*}-1}(x_{0\text{next}},a)>0, it follows that \mathcal{M}_{d} is reachable from x_{0\text{next}}. This, in turn, implies that \mathcal{M}_{d} is also reachable from x_{0}. Based on the analysis of Cases 1 and 2, if \max_{a\in\mathbf{A}}Q_{t^{*}}(x_{0},a)>0 holds for all x_{0}\in\mathcal{M}_{0}, then \mathcal{M}_{d} is reachable from \mathcal{M}_{0}. \blacksquare

\mathbf{Remark\ 1}. The key for QL to judge the reachability of system (3) lies in the exploration rule (10) and the update rule (9).

\mathbf{Lemma\ 2}. For system (3), \mathcal{M}_{d} is reachable from \mathcal{M}_{0} if and only if there exists l\leqslant 2^{n}-|\mathcal{M}_{d}| such that \mathcal{M}_{d} is l-step reachable from any x_{0}\in\mathcal{M}_{0}.

\mathbf{Proof}. Without loss of generality, assume that there is a trajectory from x_{0}\in\mathcal{M}_{0} to \mathcal{M}_{d} whose length is greater than 2^{n}-|\mathcal{M}_{d}|. Then, there must be at least one x_{t}\in\mathcal{B}^{n}\setminus\mathcal{M}_{d} that is visited repeatedly. If we remove the cycle from x_{t} to x_{t}, the trajectory length will be less than 2^{n}-|\mathcal{M}_{d}|. \blacksquare

Based on Theorem 1 and Lemma 2, we present Algorithm 1 for finding flipping kernels. First, we provide the notation used in Algorithm 1. C_{|A|}^{k} denotes the number of flip sets with k given nodes in A. B_{k_{i}} represents the i^{th} flip set with k given nodes. ep denotes the number of episodes, where an ``episode" is a term from reinforcement learning referring to a period of interaction lasting T_{max} time steps from an initial state x_{0}\in\mathcal{M}_{0}. T_{max} is the maximum time step, which should exceed the length of the trajectory from \mathcal{M}_{0} to \mathcal{M}_{d}; one can refer to Lemma 2 for the setting of T_{max}.

The core idea of Algorithm 1 is to incrementally increase the flip set size from |B| to |B|+1 in step 18 whenever reachability is not achieved with flip sets of size |B|. Specifically, step 4 fixes the flip set B, and steps 5-16 assess reachability under that flip set, guided by Theorem 1. Upon identifying the first flipping kernel (when the condition in step 12 is met), the cardinality of the kernel is determined. This leads to setting K=|B| in step 13, which, together with the condition in step 2, prevents ineffective examination of larger flip sets. After evaluating reachability under all flip sets with K nodes, steps 20-24 present the algorithm's results.

\mathbf{Algorithm\ 1} Finding flipping kernels for reachability of BCNs under state-flipped control using iterative QL
Input: \mathcal{M}_{0}, \mathcal{M}_{d}, A, \alpha_{t}\in(0,1], \gamma\in(0,1), \epsilon\in[0,1], N, T_{max}
Output: Flipping kernels
1:  K=|A|, k=0, n=0
2:  while k\leq K do
3:     for i=1,2,\dots,C_{|A|}^{k} do
4:        B=B_{k_{i}}
5:        Initialize Q(x_{t},a_{t})\leftarrow 0,\ \forall x_{t}\in\mathcal{B}^{n},\ \forall a_{t}\in\mathcal{B}^{m+|B|}
6:        for ep=0,1,\dots,N-1 do
7:           x_{0}\leftarrow\operatorname{rand}(\mathcal{M}_{0})
8:           for t=0,1,\dots,T_{max}-1 and x_{t}\notin\mathcal{M}_{d} do
9:              a_{t}\leftarrow\arg\max_{a\in\mathcal{B}^{m+|B|}}Q(x_{t},a) with probability P=1-\epsilon; a_{t}\leftarrow\operatorname{rand}(\mathcal{B}^{m+|B|}) with probability P=\epsilon
10:              Q(x_{t},a_{t})\leftarrow\alpha_{t}\big(r_{t+1}+\gamma\max_{a_{t+1}\in\mathcal{B}^{m+|B|}}Q(x_{t+1},a_{t+1})\big)+(1-\alpha_{t})Q(x_{t},a_{t})
11:           end for
12:           if \max_{a_{t}\in\mathcal{B}^{m+|B|}}Q(x_{t},a_{t})>0,\ \forall x_{t}\in\mathcal{M}_{0} then
13:              B_{n}^{*}=B, K=|B|, n=n+1
14:              Break
15:           end if
16:        end for
17:     end for
18:     k=k+1
19:  end while
20:  if n=0 then
21:     return ``System can't achieve reachability."
22:  else
23:     return B^{*}_{1},\dots,B^{*}_{n}
24:  end if
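For concreteness, the following Python sketch implements the inner reachability test of Algorithm 1 (steps 5-16) for a fixed flip set B, using reward (12) and the criterion of Theorem 1. It assumes a black-box simulator step(x, u, flip_action) of system (3), such as the one sketched in Section II, and illustrative hyperparameter values.

import itertools
import random
from collections import defaultdict

def reachable_with_flip_set(step, M0, Md, B, m,
                            episodes=2000, t_max=50,
                            alpha=0.1, gamma=0.95, eps=0.3):
    # Joint action space: every control input in B^m paired with every
    # subset of the flip set B.
    controls = list(itertools.product((0, 1), repeat=m))
    flips = [frozenset(c) for r in range(len(B) + 1)
             for c in itertools.combinations(sorted(B), r)]
    actions = [(u, fa) for u in controls for fa in flips]
    Q = defaultdict(float)
    for _ in range(episodes):
        x = random.choice(list(M0))
        for _ in range(t_max):
            if x in Md:
                break
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda b: Q[(x, b)]))
            x_next = step(x, a[0], a[1])
            r = 100 if x_next in Md else 0             # reward (12)
            target = r + gamma * max(Q[(x_next, b)] for b in actions)
            Q[(x, a)] += alpha * (target - Q[(x, a)])  # update rule (9)
            x = x_next
    # Theorem 1: reachability holds iff max_a Q(x0, a) > 0 for all x0 in M0.
    return all(max(Q[(x0, a)] for a in actions) > 0 for x0 in M0)

Algorithm 1 then amounts to calling such a test for every flip set B of size k=0,1,2,\dots and recording, at the first size for which the test succeeds, all successful flip sets as flipping kernels.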

Although Algorithm 1 is efficient, there is room for improvement. The next section outlines techniques to enhance its convergence efficiency.

IV-C Fast Iterative Q-learning for Finding Flipping Kernel

In this subsection, we propose Algorithm 2, known as fast iterative QL for finding the flipping kernels, which improves the convergence efficiency. The main differences between Algorithm 2 and Algorithm 1 fall into two aspects. 1) Special initial states: Algorithm 2 selects initial states strategically instead of randomly. 2) TL: Algorithm 2 utilizes the knowledge gained from achieving reachability with smaller flip sets when searching for flipping kernels among larger flip sets.

\mathbf{Algorithm\ 2} Finding flipping kernels for reachability of BCNs under state-flipped control using fast iterative QL
Input: \mathcal{M}_{0}, \mathcal{M}_{d}, A, \alpha_{t}\in(0,1], \gamma\in(0,1), \epsilon\in[0,1], N, T_{max}
Output: Flipping kernels
1:  K=|A|, k=0, n=0
2:  while k\leq K do
3:     for i=1,2,\dots,C_{|A|}^{k} do
4:        B=B_{k_{i}}
5:        Initialize Q(x_{t},a_{t}),\ \forall x_{t}\in\mathcal{B}^{n},\ \forall a_{t}\in\mathcal{B}^{m+|B|} using equation (14)
6:        Examine the reachability under flip set B following modified steps 6-16 of Algorithm 1 with equation (13)
7:        Record the Q-table for B
8:     end for
9:     Drop all the Q-tables, k=k+1
10:  end while
11:  Present the result following steps 20-24 of Algorithm 1

IV-C1 Special Initial States

The main concept behind selecting special initial states is to avoid visiting x_{0}\in\mathcal{M}_{0} once we determine that \mathcal{M}_{d} is reachable from it. Instead, we focus on states in \mathcal{M}_{0} that have not been identified as reachable. With this approach in mind, we modify step 7 in Algorithm 1 into

x_{0}\leftarrow\operatorname{rand}\Big(\big\{x_{0}\in\mathcal{M}_{0}\mid\max_{a_{t}\in\mathcal{B}^{m+|B|}}Q(x_{0},a_{t})=0\big\}\Big). (13)
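A minimal sketch of selection rule (13), assuming the dictionary-style Q-table and action list from a QL loop such as the one sketched above:

import random

def special_initial_state(M0, Q, actions):
    # Restart from a state in M0 whose reachability is still unknown,
    # i.e., whose action-values are all still zero.
    unknown = [x0 for x0 in M0 if max(Q[(x0, a)] for a in actions) == 0]
    # If every state in M0 is already known to reach Md, the search is done.
    return random.choice(unknown) if unknown else None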

\mathbf{Remark\ 2}. Theorem 1 remains valid when special initial states are added to iterative QL, as the trajectories from the states in \mathcal{M}_{0} that can reach \mathcal{M}_{d} can still be visited.

IV-C2 Transfer-learning

Let Q^{B}(x_{t},a_{t}) and Q^{b}(x_{t},a_{t}) represent the Q-tables with flip sets B and b, respectively, where b\subset B. The relationship between Q^{B}(x_{t},a_{t}) and Q^{b}(x_{t},a_{t}) is explained in Theorem 2 below.

\mathbf{Theorem\ 2}. For any state-action pair (x_{t},a_{t}), if Q^{b}(x_{t},a_{t})>0 with flip set b\subset B, then Q^{B}(x_{t},a_{t})>0 holds for flip set B.

\mathbf{Proof}. It is evident that the action space for flip set b\subset B is a subset of the action space for flip set B, namely, \mathbf{A}_{b}\subset\mathbf{A}_{B}. Therefore, if system (3) from state x_{1} is reachable to state x_{d}\in\mathcal{M}_{d} with flip set b\subset B, namely, there exists a trajectory x_{1}\stackrel{a_{1}}{\longrightarrow}x_{2}\ldots\stackrel{a_{T}}{\longrightarrow}x_{d},\ a_{t}\in\mathbf{A}_{b}, the reachability still holds with flip set B. This is because the trajectory x_{1}\stackrel{a_{1}}{\longrightarrow}x_{2}\ldots\stackrel{a_{T}}{\longrightarrow}x_{d} exists for a_{t}\in\mathbf{A}_{b}\subset\mathbf{A}_{B}. Thus, if \mathcal{M}_{d} is reachable from x_{1} with flip set b, then the reachability holds with flip set B. According to Theorem 1, for all (x_{t},a_{t}), if Q^{b}(x_{t},a_{t})>0 with flip set b, then Q^{B}(x_{t},a_{t})>0 holds for flip set B. \blacksquare

Motivated by the above results, we consider TL [30]. TL enhances the learning process in a new task by transferring knowledge from a related task that has already been learned. In this context, we employ TL by initializing the Q-table for flip set B with the knowledge gained from the Q-tables for flip sets b\subset B, namely

Q_{0}(x_{t},a_{t})=\begin{cases}0,&\text{if there is no subset }b\subset B\text{ such that }a_{t}\in\mathbf{A}_{b},\\ \max_{b\subset B}Q_{t}^{b}(x_{t},a_{t}),&\text{else},\end{cases} (14)

where Q_{t}^{b}(x_{t},a_{t}) represents the Q-table associated with flip set b. Based on this idea, we refine step 5 and include steps 7 and 9 in Algorithm 2 compared to Algorithm 1.
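The initialization (14) can be sketched as follows, assuming each flip set is represented as a frozenset and each recorded Q-table as a dictionary keyed by state-action pairs; these data structures are illustrative choices rather than ones prescribed by the paper.

from collections import defaultdict

def init_q_table(B, q_tables):
    # Transfer-learning initialization (14): start the Q-table for flip set B
    # from the best action-values recorded for its already examined subsets b.
    Q = defaultdict(float)
    for b, Qb in q_tables.items():
        if b < B:                                   # proper subset b of B
            for (x, a), value in Qb.items():
                # a's flipping part only uses nodes in b, so a is also an
                # admissible action under flip set B; keep the best estimate.
                Q[(x, a)] = max(Q[(x, a)], value)
    return Q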

Algorithm 2 improves the convergence efficiency compared to Algorithm 1. However, Algorithm 2 is not suitable for large-scale BCNs. Specifically, the operation of iterative QL is based on a Q-table that has |\mathbf{X}|\times|\mathbf{A}| values. The number of values in the table grows exponentially with n and m+|B|, and can be expressed as 2^{n+m+|B|}. When n and m+|B| are large, iterative QL becomes impractical because the Q-table is too large to store on a computer. Therefore, we propose an algorithm in the following subsection, which can be applied to large-scale BCNs.

IV-D Small Memory Iterative Q-Learning for Finding Flipping Kernel of Large-scale BCNs

In this subsection, we present Algorithm 3, named small memory iterative QL, which is inspired by the work of [20]. This algorithm serves as a solution for identifying flipping kernels in large-scale BCNs. The core idea behind Algorithm 3 is to store action-values only for visited states, instead of all x_{t}\in\mathcal{B}^{n}, to reduce memory consumption. To implement this idea, we make adjustments to step 5 of Algorithm 1 and introduce additional steps 10-12 in Algorithm 3. Subsequently, we offer detailed insights into the effectiveness of Algorithm 3 and its applicability.

\mathbf{Algorithm\ 3} Finding flipping kernels for reachability of system (3) using small memory iterative QL
Input: \mathcal{M}_{0}, \mathcal{M}_{d}, A, \alpha_{t}\in(0,1], \gamma\in(0,1), \epsilon\in[0,1], N, T_{max}
Output: Flipping kernels
1:  K=|A|, k=0, n=0
2:  while k\leq K do
3:     for i=1,2,\dots,C_{|A|}^{k} do
4:        B=B_{k_{i}}
5:        Initialize Q(x_{t},a_{t})\leftarrow 0,\ \forall x_{t}\in\mathcal{M}_{0},\ \forall a_{t}\in\mathcal{B}^{m+|B|}
6:        for ep=0,1,\dots,N-1 do
7:           x_{0}\leftarrow\operatorname{rand}(\mathcal{M}_{0})
8:           for t=0,1,\dots,T_{max}-1 and x_{t}\notin\mathcal{M}_{d} do
9:              a_{t}\leftarrow\arg\max_{a\in\mathcal{B}^{m+|B|}}Q(x_{t},a) with probability P=1-\epsilon; a_{t}\leftarrow\operatorname{rand}(\mathcal{B}^{m+|B|}) with probability P=\epsilon
10:              if Q(x_{t+1},\cdot) is not in the Q-table then
11:                 Add Q(x_{t+1},\cdot)=0 to the Q-table
12:              end if
13:              Q(x_{t},a_{t})\leftarrow\alpha_{t}\big(r_{t+1}+\gamma\max_{a_{t+1}\in\mathcal{B}^{m+|B|}}Q(x_{t+1},a_{t+1})\big)+(1-\alpha_{t})Q(x_{t},a_{t})
14:           end for
15:           if \max_{a_{t}\in\mathcal{B}^{m+|B|}}Q(x_{t},a_{t})>0,\ \forall x_{t}\in\mathcal{M}_{0} then
16:              B_{n}^{*}=B, K=|B|, n=n+1
17:              Break
18:           end if
19:        end for
20:     end for
21:     k=k+1
22:  end while
23:  Present the result following steps 20-24 of Algorithm 1
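A minimal sketch of the lazily grown Q-table behind steps 10-12 of Algorithm 3; the class and method names are illustrative assumptions, not part of the paper.

class SmallMemoryQTable:
    # A row of zeros is added only when a state is visited for the first
    # time, so the table holds at most |V ∪ M0| × 2^(m+|B|) entries rather
    # than 2^(n+m+|B|).
    def __init__(self, actions):
        self.actions = actions
        self.rows = {}                       # state -> {action: value}

    def row(self, x):
        if x not in self.rows:               # steps 10-11: add Q(x, .) = 0
            self.rows[x] = {a: 0.0 for a in self.actions}
        return self.rows[x]

    def get(self, x, a):
        return self.row(x)[a]

    def set(self, x, a, value):
        self.row(x)[a] = value

    def memory_entries(self):
        return len(self.rows) * len(self.actions)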

Since Algorithm 3 only stores action-values for visited states, it reduces the number of required action-values. Specifically, define V=\{x\mid x is reachable from \mathcal{M}_{0}, x\in\mathcal{B}^{n}\}. Then, |V\cup\mathcal{M}_{0}| is the total number of states that can be visited, where \cup represents the union of sets. Therefore, the number of action-values required by Algorithm 3 is |V\cup\mathcal{M}_{0}|\times 2^{m+|B|}, which does not exceed 2^{n+m+|B|}. Next, we present Theorem 3 to provide an estimation of |V|. To better understand the theorem, we first introduce Definition 2.

\mathbf{Definition\ 2} [4]. The in-degree of x\in\mathcal{B}^{n} is defined as the number of states which can reach x in one step under \overline{a_{t}}=\big(u(t),\eta_{A(t)}\big) with |\eta_{A(t)}|_{\infty}=0.

\mathbf{Theorem\ 3}. Define I=\{x\mid the in-degree of x is greater than 0, x\in\mathcal{B}^{n}\}. Then, it follows that |V|\leq|I|.

\mathbf{Proof}. The state transition process can be divided into two steps. For any x^{\prime}\in\mathcal{B}^{n}, according to \eta_{A(t)}, it is first flipped into \eta_{A(t)}(x^{\prime})\in\mathcal{B}^{n}. Second, based on u(t), namely \overline{a_{t}}, \eta_{A(t)}(x^{\prime}) transfers to x\in\mathcal{B}^{n}. According to the second step of the state transition process and the definition of I, we have x\in I. Due to the arbitrariness of x^{\prime}, it follows that \{x\mid x is reachable from \mathcal{B}^{n} in one step, x\in\mathcal{B}^{n}\}\subset I. Since \mathcal{B}^{n} contains all states, the condition ``in one step" can be removed. Define N=\{x\mid x is reachable from \mathcal{B}^{n}, x\in\mathcal{B}^{n}\}. So, we have N\subset I. Notice that the only difference between the sets N and V is the domain of the initial states. Since \mathcal{M}_{0}\subset\mathcal{B}^{n}, it follows that V\subset N\subset I. Therefore, it can be concluded that |V|\leq|I|. \blacksquare

Algorithm 3 reduces the number of entries stored in the Q-table from 2^{n+m+|B|} to |V\cup\mathcal{M}_{0}|\times 2^{m+|B|}, where |V|\leq|I|. However, we acknowledge that when |V\cup\mathcal{M}_{0}|\times 2^{m+|B|} becomes excessively large, QL still faces limitations. It is important to note that Algorithm 3 offers a potential solution for addressing problems in large-scale BCNs, as opposed to the conventional QL approach, which is often deemed inapplicable. In fact, the value of |V\cup\mathcal{M}_{0}| is not directly related to n. Specifically, large-scale BCNs with low connectivity will result in a small |V\cup\mathcal{M}_{0}|. Traditional QL preallocates 2^{n+m+|B|} values in the Q-table, which not only restricts the ability to handle large-scale systems but also lacks practicality. In contrast, Algorithm 3 allows the agent to explore and determine the significant information that needs to be recorded. This approach provides an opportunity to solve the reachability problems of large-scale BCNs.

\mathbf{Remark\ 3}. The ideas of utilizing special initial states and TL in Algorithm 2 can also be incorporated into Algorithm 3. The resulting hybrid algorithm has two major advantages: improved convergence efficiency and applicability to large-scale BCNs.

V Minimum Flipping Actions for Reachability

In this section, we aim to solve Problem 2 (5). First, we propose a reward setting in which the highest return is achieved only when the policy satisfies the goal. Then, we propose two algorithms for obtaining the optimal policies: QL for small-scale BCNs, and small memory QL for large-scale ones.

V-A Markov Decision Process for Minimizing Flipping Actions

Since QL will be used to find \pi^{*}, we first cast the problem into the framework of a Markov decision process. Let the Markov decision process be the quintuple (\mathcal{B}^{n},\mathcal{B}^{m+|B^{*}|},\gamma,\mathbf{P},\mathbf{R}). To guide the agent in achieving the goal with minimal flipping actions, we define r_{t} as follows:

r_{t}(x_{t},a_{t})=\begin{cases}-w\times n_{t},&x_{t}\in\mathcal{M}_{d},\\ -w\times n_{t}-1,&\text{else},\end{cases} (15)

where w>0 represents the weight. The reward r_{t} consists of two parts. The first component is ``-1", which is assigned when the agent has not yet reached \mathcal{M}_{d}. This negative feedback encourages the agent to reach the goal as soon as possible. The second component is ``-w\times n_{t}", which incentivizes the agent to use as few flipping actions as possible to reach the goal. Since we aim to minimize the cumulative number of flipping actions, we set \gamma=1, indicating that future rewards are not discounted in importance.
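A minimal sketch of reward (15), assuming the flipping action is represented as the set of flipped nodes so that n_t is its cardinality; the function name and signature are illustrative.

def reward(x_t, flip_action, Md, w):
    # Reward (15): n_t is the number of flipped nodes; an extra -1 penalty
    # is incurred whenever the target set Md has not yet been reached.
    n_t = len(flip_action)
    return -w * n_t if x_t in Md else -w * n_t - 1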

\mathbf{Remark\ 4}. The following analysis is based on the assumption that system (3) from any state in \mathcal{M}_{0} is reachable to state set \mathcal{M}_{d}. The validity of this assumption can be verified using Theorem 1.

The selection of w plays a crucial role in effectively conveying the goal (5) to the agent. If w is too small, the objective of achieving reachability as soon as possible will overwhelm the objective of minimizing flipping actions. To illustrate this issue more clearly, let us consider an example.

\mathbf{Example\ 1}. Without loss of generality, we assume that there is only one x_{0}\in\mathcal{M}_{0} and one x_{d}\in\mathcal{M}_{d}. Suppose that there exist two policies \pi_{1} and \pi_{2}. When we start from x_{0} and follow \pi_{1}, we obtain the trajectory x_{0}\rightarrow x_{1}\rightarrow x_{2}\rightarrow x_{3}\rightarrow x_{d} without any flipping action. If we follow \pi_{2}, the trajectory x_{0}\rightarrow x_{d} with 2 flipping actions is obtained. It can be calculated that v_{\pi_{1}}(x_{0})=-4, since the agent takes 4 steps to achieve reachability without flipping actions. Meanwhile, v_{\pi_{2}}(x_{0})=-1-2w, since the agent takes 1 step to achieve reachability with 2 flipping actions. If we set w=1, then it follows that v_{\pi_{1}}(x_{0})<v_{\pi_{2}}(x_{0}). Therefore, the obtained optimal policy \pi_{2} requires more flipping actions to achieve reachability, which does not contribute to finding \pi^{*} for equation (5).

Furthermore, although a high value of the parameter w assists in reducing flipping actions for reachability, it can present challenges in reinforcement learning. Convergence of these algorithms relies on a finite reward variance, as specified in the second condition of Lemma 1. As w approaches infinity, this criterion cannot be satisfied. Additionally, an excessively large value of w may increase reward variance, resulting in greater variability in estimated state values and subsequently slowing algorithm convergence [32]. Hence, establishing a suitable lower bound for w becomes crucial to facilitate algorithm convergence while ensuring goal attainment.

So, what is an appropriate value of w to achieve the goal (5)? To address this question, sufficient conditions that ensure the optimality of the policy are proposed in Theorem 4 and Corollary 1. The main idea is to select a value of w such that v_{\pi^{*}}(x_{0})>v_{\pi}(x_{0}) for all \pi\in\Pi with \pi\neq\pi^{*}, where \pi^{*} satisfies the goal (5).

\mathbf{Theorem\ 4}. For system (3), if w>l, where l is the maximum number of steps required from any initial state x_{0}\in\mathcal{M}_{0} to reach \mathcal{M}_{d} without cycles, then the optimal policy obtained based on (15) satisfies the goal (5).

\mathbf{Proof}. Without loss of generality, it is assumed that there is only one x_{0}\in\mathcal{M}_{0}. We classify all \pi\in\Pi and prove that, under the condition w>l, the policy with the highest state-value satisfies the goal (5). All \pi\in\Pi can be divided into two cases:

  • Case 1. Following \pi_{1}, system (3) from state x_{0} is not reachable to state set \mathcal{M}_{d}.

  • Case 2. Following \pi_{2}, system (3) from state x_{0} is reachable to state set \mathcal{M}_{d}.

For Case 1, v_{\pi_{1}}(x_{0})=\sum_{t=0}^{\infty}(-1)=-\infty. The value of v_{\pi_{1}}(x_{0}) is insufficient for \pi_{1} to be optimal. For Case 2, we divide it into two sub-cases:

  • Case 2.1. Following \pi_{2.1}, system (3) from state x_{0} is reachable to state set \mathcal{M}_{d} with fewer flipping actions.

  • Case 2.2. Following \pi_{2.2}, system (3) from state x_{0} is reachable to state set \mathcal{M}_{d} with more flipping actions.

Next, we compare the magnitudes of v_{\pi_{2.1}} and v_{\pi_{2.2}}. Assume that it takes T_{\pi_{2.1}} steps with \sum n_{\pi_{2.1}} flipping actions, and T_{\pi_{2.2}} steps with \sum n_{\pi_{2.2}} flipping actions, for system (3) from state x_{0} to reach state set \mathcal{M}_{d} under \pi_{2.1} and \pi_{2.2}, respectively. Then, we can obtain v_{\pi_{2.1}}=-w\sum n_{\pi_{2.1}}-T_{\pi_{2.1}} and v_{\pi_{2.2}}=-w\sum n_{\pi_{2.2}}-T_{\pi_{2.2}}. It is calculated that

v_{\pi_{2.1}}-v_{\pi_{2.2}}=w\sum n_{\pi_{2.2}}+T_{\pi_{2.2}}-w\sum n_{\pi_{2.1}}-T_{\pi_{2.1}}. (16)

Since more flipping actions are taken under \pi_{2.2} than under \pi_{2.1} to achieve reachability, one has \sum n_{\pi_{2.2}}-\sum n_{\pi_{2.1}}>0.

Regarding the relationship between Tπ2.1T_{\pi_{2.1}} and Tπ2.2T_{\pi_{2.2}}, there are two categories:

  • Category 1. Tπ2.1Tπ2.2T_{\pi_{2.1}}\leq T_{\pi_{2.2}}.

  • Category 2. Tπ2.1>Tπ2.2T_{\pi_{2.1}}>T_{\pi_{2.2}}.

For Category 1, it's obvious that vπ2.1>vπ2.2v_{\pi_{2.1}}>v_{\pi_{2.2}}. For Category 2, we divide π2.1\pi_{2.1} into two cases.

  • Case 2.1.1. Following π2.1.1\pi_{2.1.1}, system (3) from state x0x_{0} can reach state set d\mathcal{M}_{d} in Tπ2.1.1lT_{\pi_{2.1.1}}\leq l steps with nπ2.1.1<nπ2.2\sum n_{\pi_{2.1.1}}<\sum n_{\pi_{2.2}} flipping actions, where ll is the maximum length of the trajectory from x0x_{0} to d\mathcal{M}_{d} without cycle.

  • Case 2.1.2. Following π2.1.2\pi_{2.1.2}, system (3) from state x0x_{0} can reach state set d\mathcal{M}_{d} in Tπ2.1.2>lT_{\pi_{2.1.2}}>l steps with nπ2.1.2<nπ2.2\sum n_{\pi_{2.1.2}}<\sum n_{\pi_{2.2}} flipping actions.

Next, we prove vπ2.1.1>vπ2.2v_{\pi_{2.1.1}}>v_{\pi_{2.2}}. For Case 2.1.1, we have the following inequalities

{Tπ2.2Tπ2.1.1Tminl1l,nπ2.2nπ2.1.11,\left\{\begin{aligned} &T_{\pi_{2.2}}-T_{\pi_{2.1.1}}\geq T_{min}-l\geq 1-l,\\ &\sum n_{\pi_{2.2}}-\sum n_{\pi_{2.1.1}}\geq 1,\\ \end{aligned}\right. (17)

where $T_{\min}$ is the minimum length of a trajectory from $x_{0}$ to $\mathcal{M}_{d}$. Substituting (17) into (16), we have $v_{\pi_{2.1.1}}-v_{\pi_{2.2}}\geq 1-l+w$. Since $w>l$, it follows that $v_{\pi_{2.1.1}}-v_{\pi_{2.2}}>0$.

In the following, we discuss Case 2.1.2. Note that $T_{\pi_{2.1.2}}>l$ implies that at least one $x_{t}\in\mathcal{B}^{n}\backslash\mathcal{M}_{d}$ is visited repeatedly, according to Lemma 2. If we eliminate the cycle from the trajectory, its length becomes no greater than $l$ and it contains no more flipping actions. This indicates that there exists $\pi_{2.1.2}^{*}$ such that $T_{\pi_{2.1.2}^{*}}\leq l$ and $\sum n_{\pi_{2.1.2}^{*}}\leq\sum n_{\pi_{2.1.2}}<\sum n_{\pi_{2.2}}$. Thus, $\pi_{2.1.2}^{*}$ is consistent with Case 2.1.1. Based on the proof in the previous paragraph, we have $v_{\pi_{2.1.2}^{*}}(x_{0})>v_{\pi_{2.2}}(x_{0})$. Also, it is easy to see that $v_{\pi_{2.1.2}^{*}}(x_{0})>v_{\pi_{2.1.2}}(x_{0})$. Therefore, $\pi_{2.1.2}^{*}$ yields a higher return than both $\pi_{2.2}$ and $\pi_{2.1.2}$.

After analyzing the various cases, we deduce that when $w>l$ holds, one of the following two conditions must be satisfied: 1) $v_{\pi_{2.1}}(x_{0})>v_{\pi_{2.2}}(x_{0})$; 2) there exists an alternative policy $\pi^{*}_{2.1.2}$ that achieves reachability with fewer flipping actions compared to $\pi_{2.1}$, satisfying $v_{\pi^{*}_{2.1.2}}(x_{0})>v_{\pi_{2.2}}(x_{0})$ and $v_{\pi^{*}_{2.1.2}}(x_{0})>v_{\pi_{2.1}}(x_{0})$. Namely, the $\pi$ yielding the highest $v_{\pi}(x_{0})$ satisfies the goal (5). $\blacksquare$

In Theorem 4, knowledge of $l$ is necessary. If $l$ is unknown, the bound $l\leq 2^{n}-|\mathcal{M}_{d}|$ from Lemma 2 yields the following corollary.

$\mathbf{Corollary\ 1}$. If $w>2^{n}-|\mathcal{M}_{d}|$, then the optimal policy obtained according to (15) satisfies the goal (5).

$\mathbf{Remark\ 5}$. Two points should be clarified for Theorem 4 and Corollary 1. First, following Theorem 4 or Corollary 1, we do not obtain all $\pi^{*}$ that satisfy the goal (5), but rather the $\pi^{*}$ that satisfies the goal (5) with the shortest time to achieve reachability. Second, Theorem 4 and Corollary 1 provide sufficient but not necessary conditions for the optimality of the obtained policy, since (17) relies on inequality scaling.
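To make the role of $w$ concrete, the following Python sketch (with hypothetical trajectory data; the per-step reward is taken as $-1-w\,n_{t}$, consistent with the state-values used in the proof of Theorem 4) compares the returns of two candidate policies. With $w>l$, the policy using fewer flipping actions attains the larger state-value even if its cycle-free trajectory is longer.

```python
# Minimal sketch: compare returns of two hypothetical policies under the
# reward scheme analyzed in Theorem 4 (per step: -1, plus -w per flipped node).

def state_value(total_flips, steps, w):
    """Undiscounted return v_pi(x0) = -w * total_flips - steps."""
    return -w * total_flips - steps

l = 7          # hypothetical maximum cycle-free trajectory length from x0 to M_d
w = l + 1      # Theorem 4 requires w > l

# Policy A: fewer flipping actions, but a longer (cycle-free) trajectory.
v_A = state_value(total_flips=1, steps=l, w=w)
# Policy B: more flipping actions, shortest possible trajectory.
v_B = state_value(total_flips=2, steps=1, w=w)

print(v_A, v_B)   # -15 and -17: the policy with fewer flips has the larger value
assert v_A > v_B
```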

Now, we can achieve the goal (5) under the reward setting (15). Next, we present $Q$L for finding the optimal policy $\pi^{*}$.

V-B Q-Learning for Finding Minimum Flipping Actions

In this subsection, we propose Algorithm 4 for finding minimum flipping actions for the reachability of BCNs under state-flipped control.

$\mathbf{Algorithm\ 4}$ Finding minimum flipping actions for reachability of BCNs under state-flipped control using $Q$L
Require: $\mathcal{M}_{0}$, $\mathcal{M}_{d}$, $B^{*}$, $\alpha_{t}\in(0,1]$, $\gamma=1$, $\epsilon\in[0,1)$, $N$, $T_{max}$
Ensure: $\pi^{*}$
1: Initialize $Q(x_{t},a_{t})\leftarrow 0$, $\forall x_{t}\in\mathcal{B}^{n}$, $\forall a_{t}\in\mathcal{B}^{m+|B^{*}|}$
2: for $ep=0,1,\dots,N-1$ do
3:   $x_{0}\leftarrow\operatorname{rand}(\mathcal{M}_{0})$
4:   for $t=0,1,\dots,T_{max}-1$ and $x_{t}\notin\mathcal{M}_{d}$ do
5:     $a_{t}\leftarrow\left\{\begin{aligned}&\arg\max_{a}Q(x_{t},a),&&P=1-\epsilon\\&\operatorname{rand}(\mathcal{B}^{m+|B^{*}|}),&&P=\epsilon\end{aligned}\right.$
6:     $Q(x_{t},a_{t})\leftarrow\alpha_{t}\big(r_{t+1}+\gamma\max_{a}Q(x_{t+1},a)\big)+(1-\alpha_{t})Q(x_{t},a_{t})$
7:   end for
8: end for
9: return $\pi^{*}(x_{t})\leftarrow\arg\max_{a}Q(x_{t},a)$, $\forall x_{t}\in\mathbf{X}$

Next, we discuss the requirements on the parameters. In terms of $T_{max}$, it should be larger than the maximum length of a trajectory from $x_{0}$ to $\mathcal{M}_{d}$ without any cycle; otherwise, the agent may not be able to reach a terminal state in $\mathcal{M}_{d}$ before the episode ends, which violates condition 4) of Lemma 1. Furthermore, $\epsilon>0$ should hold to ensure exploration by the agent. Hence, the policies during learning are non-deterministic, so every policy can lead to a state $x\in\mathcal{M}_{d}$, which satisfies condition 4) of Lemma 1.
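For illustration, a minimal Python sketch of Algorithm 4 is given below. The interface is hypothetical: `env_step(x, a)` stands for one model-free interaction with system (3) and is assumed to return the next state and the reward of (15); the learning-rate and $\epsilon$-greedy schedules are simplified to constants, and the $Q$-table is stored as a dictionary for brevity.

```python
import random
from collections import defaultdict

def q_learning_min_flips(M0, Md, actions, env_step, N, T_max,
                         alpha=0.1, gamma=1.0, epsilon=0.1):
    """Sketch of Algorithm 4: tabular Q-learning with reward (15) and gamma = 1."""
    Q = defaultdict(float)                       # Q[(x, a)], initialized to 0
    for _ in range(N):                           # episodes
        x = random.choice(list(M0))              # x0 <- rand(M0)
        for _ in range(T_max):
            if x in Md:                          # stop once the target set is reached
                break
            if random.random() < epsilon:        # epsilon-greedy action selection
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(x, b)])
            x_next, r = env_step(x, a)           # model-free transition and reward
            best_next = max(Q[(x_next, b)] for b in actions)
            Q[(x, a)] += alpha * (r + gamma * best_next - Q[(x, a)])
            x = x_next
    # greedy policy extracted from the learned action-values
    return {x: max(actions, key=lambda b: Q[(x, b)]) for (x, _) in Q}
```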

Algorithm 4 is capable of finding $\pi^{*}$ for small-scale BCNs under state-flipped control. However, it becomes inapplicable when $n$ and $m+|B^{*}|$ are too large, since the $Q$-table is then too large to be stored in computer memory. This motivates us to utilize small memory $Q$L in Section IV-D.

V-C Fast Small Memory Q-Learning for Finding Minimum Flipping Actions of Large-scale BCNs

In this subsection, we propose Algorithm 5, called fast small memory $Q$L, for finding the minimum flipping actions required to achieve reachability in large-scale BCNs under state-flipped control. Algorithm 5 differs from Algorithm 4 in two aspects.

$\mathbf{Algorithm\ 5}$ Finding minimum flipping actions for reachability of large-scale BCNs under state-flipped control using fast small memory $Q$L
Require: $\mathcal{M}_{0}$, $\mathcal{M}_{d}$, $B^{*}$, $\alpha_{t}\in(0,1]$, $\gamma=1$, $\epsilon\in[0,1)$, $N$, $T_{max}$, $w$, $\Delta w>0$
Ensure: $\pi^{*}$
1: Initialize $Q(x_{t},a_{t})\leftarrow 0$, $\forall x_{t}\in\mathcal{M}_{0}$, $\forall a_{t}\in\mathcal{B}^{m+|B^{*}|}$
2: for $ep=0,1,\dots,N-1$ do
3:   $x_{0}\leftarrow\operatorname{rand}(\mathcal{M}_{0})$
4:   if $w\leq\big|\{x_{t}\,|\,Q(x_{t},\cdot)\ \text{is in the}\ Q\text{-table}\}\big|$ then
5:     $w\leftarrow w+\Delta w$
6:   end if
7:   for $t=0,1,\dots,T_{max}-1$ and $x_{t}\notin\mathcal{M}_{d}$ do
8:     $a_{t}\leftarrow\left\{\begin{aligned}&\arg\max_{a}Q(x_{t},a),&&P=1-\epsilon\\&\operatorname{rand}(\mathcal{B}^{m+|B^{*}|}),&&P=\epsilon\end{aligned}\right.$
9:     if $Q(x_{t+1},\cdot)$ is not in the $Q$-table then
10:      Add $Q(x_{t+1},\cdot)=0$ to the $Q$-table
11:    end if
12:    $Q(x_{t},a_{t})\leftarrow\alpha_{t}\big(r_{t+1}+\gamma\max_{a}Q(x_{t+1},a)\big)+(1-\alpha_{t})Q(x_{t},a_{t})$
13:  end for
14: end for
15: return $\pi^{*}(x_{t})\leftarrow\arg\max_{a}Q(x_{t},a)$, $\forall x_{t}\in\mathbf{X}$

Firstly, Algorithm 5 only records the action-values of visited states, as reflected in steps 1 and 9-11. This approach is similar to that employed in Algorithm 3. Secondly, in Algorithm 5, the parameter $w$, which affects $r_{t}$, changes from a constant to a variable that is dynamically adjusted according to the number of recorded states $|Q|=\big|\{x_{t}\,|\,Q(x_{t},\cdot)\ \text{is in the}\ Q\text{-table}\}\big|$, as shown in steps 4-6. As the agent explores more states, $|Q|$ increases and finally equals $|V\cup\mathcal{M}_{0}|$. Here, we do not simply set $w=2^{n}+1-|\mathcal{M}_{d}|$ based on Corollary 1, because for large-scale BCNs the value $2^{n}+1-|\mathcal{M}_{d}|$ is so large that the penalty for flipping actions outweighs the incentive to achieve reachability as soon as possible. This can confuse the agent. Specifically, at the beginning of the learning process, the agent receives a large negative reward when taking flipping actions, but a smaller negative reward when it fails to achieve reachability. The agent only realizes that taking flipping actions is worthwhile for achieving reachability once the accumulated negative feedback ``$-1$'' for not achieving reachability exceeds the penalty ``$-w\times n_{t}$'' for flipping $n_{t}$ nodes. The larger $w$ is, the longer this process takes. To address this issue, we introduce the adaptive scaling of $w$ to accelerate learning. The key idea behind this scaling is to leverage the knowledge acquired by the agent while ensuring that the conditions of Theorem 4 are met, thereby guaranteeing the optimality of the policy. We provide further justification of this approach in Theorem 5.
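The two modifications can be sketched in Python as follows (same hypothetical interface as before, except that `env_step(x, a)` is assumed to return the next state together with the number of flipped nodes $n_{t}$, so that the reward $-1-w\,n_{t}$ can be recomputed with the current value of $w$).

```python
import random

def small_memory_q_learning(M0, Md, actions, env_step, N, T_max,
                            w0, delta_w, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Sketch of Algorithm 5: Q-values only for visited states, adaptive w."""
    Q = {}                                   # Q[x] is a dict of action-values
    w = w0
    for _ in range(N):
        x = random.choice(list(M0))
        Q.setdefault(x, {a: 0.0 for a in actions})
        if w <= len(Q):                      # steps 4-6: keep w > |Q|
            w += delta_w
        for _ in range(T_max):
            if x in Md:
                break
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(Q[x], key=Q[x].get)
            # hypothetical interface: next state and number of flipped nodes
            x_next, n_flips = env_step(x, a)
            Q.setdefault(x_next, {b: 0.0 for b in actions})   # steps 9-11
            r = -1.0 - w * n_flips           # per-step reward, cf. Theorem 4
            best_next = max(Q[x_next].values())
            Q[x][a] += alpha * (r + gamma * best_next - Q[x][a])
            x = x_next
    return {x: max(Q[x], key=Q[x].get) for x in Q}
```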

$\mathbf{Theorem\ 5}$. For a BCN under state-flipped control (3), if $w>|Q|$ holds, then the optimal policy obtained according to (15) satisfies the goal (5).

$\mathbf{Proof}$. First, we prove that $\mathcal{M}_{d}$ is reachable from all $x_{0}\in\mathcal{M}_{0}$ only if there exists $l\leq|V|$ such that $\mathcal{M}_{d}$ is $l$-step reachable from all $x_{0}\in\mathcal{M}_{0}$. Assume that $\mathcal{M}_{d}$ is reachable from $\mathcal{M}_{0}$. According to Definition 1, for any initial state $x_{0}\in\mathcal{M}_{0}$, there exists a trajectory $x_{0}\rightarrow x_{1}\rightarrow\dots\rightarrow x_{d}$, where $x_{d}\in\mathcal{M}_{d}$. According to the definition of $V$, each state in this trajectory is in $V$. Similar to the proof of Lemma 2, for any initial state $x_{0}\in\mathcal{M}_{0}$, there exists a trajectory from $x_{0}$ to $\mathcal{M}_{d}$ whose length $l$ satisfies $l\leq|V|$.

Next, we prove that $w>l$ holds after sufficient exploration by the agent. Since $|Q|=|V\cup\mathcal{M}_{0}|$ and $w>|Q|$, we have $w>|V\cup\mathcal{M}_{0}|$. From the previous paragraph we know that $|V|\geq l$; therefore, $w>l$. Applying Theorem 4, the optimality of the policy is guaranteed. $\blacksquare$

Thus, Algorithm 5, with the introduction of the variable $w$, still satisfies the goal defined by (5); furthermore, it speeds up the learning process. Regarding the parameter settings of Algorithm 5, the initial $w$ can be set according to $|Q|=\big|\{x_{t}\,|\,Q(x_{t},\cdot)\ \text{is in the}\ Q\text{-table}\}\big|$, where the $Q$-table is the one obtained from Algorithm 3.

In summary, Algorithm 5 reduces the number of values to be stored from $2^{n+m+|B^{*}|}$ to $|V\cup\mathcal{M}_{0}|\times 2^{m+|B^{*}|}$ by recording only visited states. This approach is especially effective for large-scale BCN problems. Meanwhile, to accelerate the learning process, $w$ is set as a variable that converges to a value larger than $|V\cup\mathcal{M}_{0}|$.
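As a rough numerical illustration of this reduction, with hypothetical values of $n$, $m$, $|B^{*}|$, and $|V\cup\mathcal{M}_{0}|$:

```python
# Hypothetical sizes for illustration only: n nodes, m control inputs,
# |B*| flip-kernel nodes, and an assumed size of the explored set V ∪ M0.
n, m, kernel_size = 27, 1, 3
explored_states = 5_000                                    # assumed |V ∪ M0|

full_table = 2 ** (n + m + kernel_size)                    # Algorithm 4 storage
small_table = explored_states * 2 ** (m + kernel_size)     # Algorithm 5 storage

print(full_table, small_table)   # 2_147_483_648 vs 80_000 action-values
```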

VI Computational Complexity

The time complexities of Algorithms 1-5 are dominated by selecting the optimal value from a pool of up to $2^{m+|B^{*}|}$ action-values per step, which costs $O(2^{m+|B^{*}|})$. For Algorithms 1 and 3, and for Algorithm 2, the total numbers of steps are $\mathrm{iter}^{1,3}=\sum_{k=0}^{|B^{*}|-1}C_{|A|}^{k}NT_{max}+\sum_{k=0}^{C_{|A|}^{|B^{*}|}}N_{k}^{1,3}T_{max}$ and $\mathrm{iter}^{2}=\sum_{k=0}^{|B^{*}|-1}C_{|A|}^{k}NT_{max}+\sum_{k=0}^{C_{|A|}^{|B^{*}|}}N_{k}^{2}T_{max}$, respectively, which reflect the iterations spent before and after the flipping kernel is found. Early termination based on the reachability conditions in Algorithms 1-3 reduces the number of episodes for the $k^{\text{th}}$ flip set, so that $N_{k}^{1,3}\leq N$ and $N_{k}^{2}\leq N$, while the higher efficiency of Algorithm 2 ensures $N_{k}^{2}\leq N_{k}^{1,3}$. Algorithms 4 and 5 require $NT_{max}$ steps in total. Refer to Table I for a comprehensive overview of the time complexities of each algorithm.

Concerning space complexity, Algorithms 1 and 4 store all action-values in a $Q$-table, while Algorithm 2 additionally maintains tables for subsets of the current flip set. In contrast, Algorithms 3 and 5 retain action-values only for visited states. Details on the space complexities of each algorithm can be found in Table I.

TABLE I: Time and space complexity of Algorithms 1 to 5
Algorithm | Time complexity | Space complexity
1 | $O(2^{m+|B^{*}|}\,\mathrm{iter}^{1,3})$ | $O(2^{n+m+|B^{*}|})$
2 | $O(2^{m+|B^{*}|}\,\mathrm{iter}^{2})$ | $O(2^{n+m+|B^{*}|}(C_{|A|}^{|B^{*}|-1}+1))$
3 | $O(2^{m+|B^{*}|}\,\mathrm{iter}^{1,3})$ | $O(2^{m+|B^{*}|}\,|V\cup\mathcal{M}_{0}|)$
4 | $O(2^{m+|B^{*}|}\,NT_{max})$ | $O(2^{n+m+|B^{*}|})$
5 | $O(2^{m+|B^{*}|}\,NT_{max})$ | $O(2^{m+|B^{*}|}\,|V\cup\mathcal{M}_{0}|)$

VII Simulation

In this section, the performance of the proposed algorithms is demonstrated. Two examples are given: a small-scale BCN with 3 nodes and a large-scale one with 27 nodes. For each example, we show the convergence efficiency of the different algorithms for checking reachability, as well as the policy $\pi^{*}$ with minimal flipping actions.

VII-A A Small-scale BCN

$\mathbf{Example\ 2}$. We consider a small-scale BCN with the combinational flip set $A=\{1,2,3\}$, the dynamics of which are given as follows:

$\left\{\begin{aligned}
x_{1}(t+1)=&\ x_{1}(t)\wedge(x_{2}(t)\vee x_{3}(t))\vee\neg x_{1}(t)\wedge(x_{2}(t)\oplus x_{3}(t)),\\
x_{2}(t+1)=&\ x_{1}(t)\vee\neg x_{1}(t)\wedge(x_{2}(t)\vee x_{3}(t)),\\
x_{3}(t+1)=&\ \neg(x_{1}(t)\wedge x_{2}(t)\wedge x_{3}(t)\wedge u(t))\wedge\big(x_{3}(t)\vee(x_{1}(t)\vee\neg(x_{2}(t)\wedge u(t)))\wedge(\neg x_{1}(t)\vee(x_{1}(t)\oplus x_{2}(t))\vee u(t))\big).
\end{aligned}\right.$ (18)

In terms of reachability, we set $\mathcal{M}_{d}=\{(0,0,1)\}$ and $\mathcal{M}_{0}=\mathcal{B}^{n}\setminus\mathcal{M}_{d}$.
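For reference, a direct Python transcription of the one-step update (18) might look as follows; flipping actions, which toggle selected components of the state, are omitted from this sketch.

```python
def step_18(x, u):
    """One step of system (18); x = (x1, x2, x3) and u are Booleans (0/1)."""
    x1, x2, x3 = x
    y1 = x1 and (x2 or x3) or (not x1) and (x2 != x3)       # x2 XOR x3
    y2 = x1 or (not x1) and (x2 or x3)
    y3 = (not (x1 and x2 and x3 and u)) and (
        x3 or (x1 or not (x2 and u)) and ((not x1) or (x1 != x2) or u)
    )
    return (int(y1), int(y2), int(y3))

# e.g. step_18((1, 1, 1), 1) == (1, 1, 0)
```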

To find the flipping kernels of system (18), Algorithms 1 and 2 are utilized. As mentioned in Section IV, Algorithm 2 has higher convergence efficiency than Algorithm 1, which is verified in our simulation. To evaluate the convergence efficiency, an index $r_{ep}$, named the reachable rate, is defined as follows:

$r_{ep}=\dfrac{nr_{ep}}{|\mathcal{M}_{0}|},$ (19)

where $nr_{ep}$ represents the number of states $x_{t}\in\mathcal{M}_{0}$ that have been found to be reachable to $\mathcal{M}_{d}$ by the end of the $ep^{\text{th}}$ episode. For Algorithms 1 and 2, the flipping kernels obtained are $\{1,2\}$ and $\{2,3\}$, respectively. The effectiveness of Algorithms 1 and 2 is shown in Fig. 1, where the reachable rates are averaged over 100 independent experiments. For $Q$L with TL, the initial reachable rate is higher since the agent utilizes prior knowledge. For $Q$L with special initial states, the reachable rate increases faster than with conventional $Q$L since the agent strategically selects the initial states. The best performance is achieved when both TL and special initial states are incorporated into $Q$L.
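The reachable rate (19) can be computed directly from the set of initial states that have been verified as reachable so far (a sketch; `found_reachable` is a hypothetical bookkeeping set maintained during training).

```python
def reachable_rate(found_reachable, M0):
    """Reachable rate (19): fraction of states in M0 verified to reach M_d."""
    return len(found_reachable & M0) / len(M0)
```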

Next, Algorithm 4 is employed to determine the minimum number of flipping actions required for reachability. The optimal policy obtained is shown in Fig. 2.

Figure 1: The convergence efficiency of Algorithms 1 and 2 for finding the flipping kernels of the small-scale system (18).

Figure 2: The optimal policy using the minimal flipping actions for reachability of system (18), obtained by Algorithm 4.

$\mathbf{Remark\ 6}$. We emphasize that the system model in this paper is given solely for illustrative purposes; the agent has no prior knowledge of the model.

$\mathbf{Remark\ 7}$. The flipping kernels obtained through Algorithms 1 and 2 have been compared with those obtained using the model-based method proposed in [3]. Additionally, the policies depicted in Figs. 2 and 4 have been compared with other policies derived through enumeration. These comparisons verify that the obtained flipping kernels and policies are indeed optimal.

$\mathbf{Remark\ 8}$. We utilize the method proposed in [3] solely to validate the optimality of the flipping kernel, without directly comparing computational complexity or convergence speed. This is because the semi-tensor product method introduced in [3] requires prior knowledge of the system model, whereas our approach is model-free; a direct comparison between the two methods would therefore be unfair.

VII-B A Large-scale BCN

$\mathbf{Example\ 3}$. We consider a large-scale BCN with the combinational flip set $A=\{1,2,3,4,5,6\}$, the dynamics of which are given as follows:

$\left\{\begin{aligned}
x_{1}(t+1)=&\ x_{1}(t)\wedge(x_{2}(t)\vee x_{3}(t))\vee\neg x_{1}(t)\wedge(x_{2}(t)\oplus x_{3}(t)),\\
x_{2}(t+1)=&\ x_{1}(t)\vee\neg x_{1}(t)\wedge(x_{2}(t)\vee x_{3}(t)),\\
x_{3}(t+1)=&\ \neg(x_{1}(t)\wedge x_{2}(t)\wedge x_{3}(t)\wedge u(t))\wedge\big(x_{3}(t)\vee(x_{1}(t)\vee\neg(x_{2}(t)\wedge u(t)))\wedge(\neg x_{1}(t)\vee(x_{1}(t)\oplus x_{2}(t))\vee u(t))\big),\\
x_{3i-2}(t+1)=&\ x_{3i-2}(t)\wedge(x_{3i-1}(t)\vee x_{3i}(t))\vee\neg x_{3i-2}(t)\wedge(x_{3i-1}(t)\oplus x_{3i}(t)),\\
x_{3i-1}(t+1)=&\ x_{3i-2}(t)\vee\neg x_{3i-2}(t)\wedge(x_{3i-1}(t)\vee x_{3i}(t)),\\
x_{3i}(t+1)=&\ x_{3i}(t)\vee(\neg x_{3i-2}(t)\vee(x_{3i-2}(t)\oplus x_{3i-1}(t))),
\end{aligned}\right.$ (20)

where $i=2,3,\dots,9$. In terms of reachability, we set $\mathcal{M}_{d}=\{(0,0,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)\}$ and $\mathcal{M}_{0}=\{(a,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1)\ |\ a\in\{(0,0,0),(0,1,0),(0,1,1),(1,0,0),(1,0,1),(1,1,0),(1,1,1)\}\}$.

Figure 3: The convergence efficiency of Algorithm 3, and of Algorithm 3 with TL and special initial states, for finding the flipping kernels of the large-scale system (20).

$\mathbf{Remark\ 9}$. System (20) consists of 9 small-scale BCNs. In Fig. 3, it can be observed that the number of flipping actions and time steps required for system (20) to reach the state set $\mathcal{M}_{d}$ is relatively small. These two phenomena are due to our intention of selecting an example whose optimal policy is easily verifiable by the enumeration method. However, it should be noted that the proposed algorithms are applicable to all large-scale BCNs, since the agent has no prior knowledge of the system model.
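Because of this modular structure, the one-step map of (20) can be assembled from the controlled block (18) and eight copies of the autonomous 3-node block, e.g. in Python (a sketch reusing `step_18` from the earlier sketch for Example 2):

```python
def step_block(x):
    """One step of the autonomous 3-node block used for i = 2, ..., 9 in (20)."""
    a, b, c = x
    y1 = a and (b or c) or (not a) and (b != c)
    y2 = a or (not a) and (b or c)
    y3 = c or ((not a) or (a != b))
    return (int(y1), int(y2), int(y3))

def step_20(x, u):
    """One step of the 27-node system (20); requires step_18 from Example 2."""
    head = step_18(x[:3], u)                 # nodes x1-x3, driven by input u
    tail = []
    for i in range(1, 9):                    # blocks i = 2, ..., 9
        tail.extend(step_block(x[3 * i:3 * i + 3]))
    return head + tuple(tail)
```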

Figure 4: The optimal policy using minimal flipping actions to achieve reachability for the large-scale system (20), obtained by Algorithm 5.

To obtain the flipping kernels of the large-scale system (20), Algorithm 3 is employed along with the integration of TL and special initial states. The convergence efficiency of the algorithms is demonstrated in Fig. 3. According to Fig. 3, the combination of small memory $Q$L with both TL and special initial states yields the best performance.

Subsequently, based on the flipping kernels $\{1,2,6\}$ and $\{2,3,6\}$ obtained from Algorithm 3, the optimal policy taking the minimal flipping actions to achieve reachability is obtained using Algorithm 5. The resulting policy is displayed in Fig. 4. Notably, in terms of the reward setting, if we set $w=2^{27}$ instead of using a variable adjusted according to $|Q|=\big|\{x_{t}\,|\,Q(x_{t},\cdot)\ \text{is in the}\ Q\text{-table}\}\big|$, the policy does not converge to the optimal one even with 10 times as many episodes. This demonstrates the effectiveness of Algorithm 5. In Fig. 4, states are represented in decimal form, where the state at time step $t$ is labelled $\sum_{i=1}^{27}2^{27-i}x_{i}(t)+1$.
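The decimal labels in Fig. 4 can be reproduced with a one-line conversion (sketch):

```python
def state_index(x):
    """Decimal label used in Fig. 4: sum_{i=1}^{27} 2^(27-i) * x_i(t) + 1."""
    return sum(2 ** (27 - i) * xi for i, xi in enumerate(x, start=1)) + 1

# e.g. the all-zero state maps to 1 and the all-one state to 2**27
```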

VII-C Details

The parameters are listed as follows.

  • In all algorithms, we set the learning rate $\alpha_{t}=\min\{1,\frac{1}{(\beta\,ep)^{\omega}}\}$, which satisfies condition 1) in Lemma 1 (a short sketch of these schedules follows Table II).

  • The greedy rate in all algorithms is specified as $1-\frac{0.99\,ep}{N}$, gradually decreasing from 1 to 0.01 as $ep$ increases.

  • We set $w=8$ for Algorithm 4 in Example 2 and $w=18$ with $\Delta w=20$ for Algorithm 5 in Example 3. These values of $w$ ensure the optimality of the obtained policy according to Corollary 1 and Theorem 4.

  • For Algorithms 1, 2, and 3, we set $\gamma=0.99$.

$N$, $T_{max}$, $\beta$, and $\omega$ are selected based on the complexity of the different examples and algorithms, as detailed in Table II.

TABLE II: Parameter settings
Example | Algorithm | $N$ | $T_{max}$ | $\beta$ | $\omega$
2 | 1, 2 | 100 | 10 | 1 | 0.6
3 | 3 | $10^{4}$ | $2^{27}$ | 1 | 0.6
2 | 4 | $3\times 10^{4}$ | 100 | 0.01 | 0.85
3 | 5 | $2\times 10^{5}$ | $2^{27}$ | 0.01 | 0.85
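As referenced above, the learning-rate and greedy-rate schedules can be implemented as simple functions of the episode index (a sketch; the guard for $ep=0$ is an implementation assumption to avoid division by zero):

```python
def learning_rate(ep, beta, omega):
    """Learning rate min{1, 1/(beta*ep)^omega}, cf. condition 1) of Lemma 1."""
    return 1.0 if ep == 0 else min(1.0, 1.0 / (beta * ep) ** omega)

def greedy_rate(ep, N):
    """Greedy rate 1 - 0.99*ep/N, decreasing from 1 towards 0.01."""
    return 1.0 - 0.99 * ep / N
```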

VIII Conclusion

This paper presents model-free reinforcement learning-based methods to obtain minimal state-flipped control for achieving reachability in BCNs, including large-scale ones. Two problems are addressed: 1) finding the flipping kernel, and 2) minimizing the number of flipping actions based on the obtained kernel.

For problem 1) with small-scale BCNs, fast iterative $Q$L is proposed. Reachability is determined using $Q$-values, while convergence is expedited through transfer learning and special initial states. Transfer learning migrates the $Q$-table based on the proven result that reachability is preserved when elements are added to the flip set. Special initial states designate states whose reachability is still unknown, avoiding redundant evaluation of known reachable states. For problem 1) with large-scale BCNs, we utilize small memory iterative $Q$L, which reduces memory usage by recording only visited action-values. The suitability of the algorithm is estimated via an upper bound on memory usage.

For problem 2) with small-scale BCNs, $Q$L with BCN-characteristics-based rewards is presented. The rewards are designed based on the maximum length of reachable trajectories without cycles (an upper bound is given). This allows the minimization of flipping actions under terminal reachability constraints to be simplified into an unconstrained optimal control problem. For problem 2) with large-scale BCNs, fast small memory $Q$L with variable rewards is proposed. In this approach, the rewards are dynamically adjusted based on the maximum length of the explored reachable trajectories without cycles to enhance convergence efficiency, and their optimality is proven.

Considering the critical role of reachability in controllability, our upcoming research will investigate minimum-cost state-flipped control for the controllability of BCNs using reinforcement learning methods. Specifically, we will address two key challenges: 1) identifying the flipping kernel for controllability of BCNs, and 2) minimizing the flipping actions required to achieve controllability. While our approach can effectively address challenges 1) and 2) for small-scale BCNs with minor adjustments, adapting it to large-scale BCNs with simple modifications is presently impractical. This will be the primary focus of our future research.

References

  • [1] S. A. Kauffman, ``Metabolic stability and epigenesis in randomly constructed genetic nets,'' Journal of Theoretical Biology, vol. 22, no. 3, pp. 437–467, 1969.
  • [2] M. Boettcher and M. McManus, ``Choosing the Right Tool for the Job: RNAi, TALEN, or CRISPR,'' Molecular Cell, vol. 58, no. 4, pp. 575–585, 2015.
  • [3] Y. Liu, Z. Liu, and J. Lu, ``State-flipped control and Q-learning algorithm for the stabilization of Boolean control networks,'' Control Theory & Applications, vol. 38, pp. 1743–1753, 2021.
  • [4] Z. Liu, J. Zhong, Y. Liu, and W. Gui, ``Weak stabilization of Boolean networks under state-flipped control,'' IEEE Transactions on Neural Networks and Learning Systems, vol. 34, pp. 2693–2700, 2021.
  • [5] M. Rafimanzelat and F. Bahrami, ``Attractor controllability of Boolean networks by flipping a subset of their nodes,'' Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 28, p. 043120, 2018.
  • [6] M. R. Rafimanzelat and F. Bahrami, ``Attractor stabilizability of Boolean networks with application to biomolecular regulatory networks,'' IEEE Transactions on Control of Network Systems, vol. 6, no. 1, pp. 72–81, 2018.
  • [7] B. Chen, X. Yang, Y. Liu, and J. Qiu, ``Controllability and stabilization of Boolean control networks by the auxiliary function of flipping,'' International Journal of Robust and Nonlinear Control, vol. 30, no. 14, pp. 5529–5541, 2020.
  • [8] Q. Zhang, J.-e. Feng, Y. Zhao, and J. Zhao, ``Stabilization and set stabilization of switched Boolean control networks via flipping mechanism,'' Nonlinear Analysis: Hybrid Systems, vol. 41, p. 101055, 2021.
  • [9] R. Zhou, Y. Guo, Y. Wu, and W. Gui, ``Asymptotical feedback set stabilization of probabilistic Boolean control networks,'' IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 11, pp. 4524–4537, 2019.
  • [10] H. Li and Y. Wang, ``On reachability and controllability of switched Boolean control networks,'' Automatica, vol. 48, no. 11, pp. 2917–2922, 2012.
  • [11] Y. Liu, H. Chen, J. Lu, and B. Wu, ``Controllability of probabilistic Boolean control networks based on transition probability matrices,'' Automatica, vol. 52, pp. 340–345, 2015.
  • [12] J. Liang, H. Chen, and J. Lam, ``An improved criterion for controllability of Boolean control networks,'' IEEE Transactions on Automatic Control, vol. 62, no. 11, pp. 6012–6018, 2017.
  • [13] L. Wang and Z.-G. Wu, ``Optimal asynchronous stabilization for Boolean control networks with lebesgue sampling,'' IEEE Transactions on Cybernetics, vol. 52, no. 5, pp. 2811–2820, 2022.
  • [14] Y. Wu, X.-M. Sun, X. Zhao, and T. Shen, ``Optimal control of Boolean control networks with average cost: A policy iteration approach,'' Automatica, vol. 100, pp. 378–387, 2019.
  • [15] S. Kharade, S. Sutavani, S. Wagh, A. Yerudkar, C. Del Vecchio, and N. Singh, ``Optimal control of probabilistic Boolean control networks: A scalable infinite horizon approach,'' International Journal of Robust and Nonlinear Control, vol. 33, no. 9, p. 4945–4966, 2023.
  • [16] Q. Zhu, Y. Liu, J. Lu, and J. Cao, ``On the optimal control of Boolean control networks,'' SIAM Journal on Control and Optimization, vol. 56, no. 2, pp. 1321–1341, 2018.
  • [17] A. Acernese, A. Yerudkar, L. Glielmo, and C. D. Vecchio, ``Reinforcement learning approach to feedback stabilization problem of probabilistic Boolean control networks,'' IEEE Control Systems Letters, vol. 5, no. 1, pp. 337–342, 2021.
  • [18] P. Bajaria, A. Yerudkar, and C. D. Vecchio, ``Random forest Q-Learning for feedback stabilization of probabilistic Boolean control networks,'' in 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2021, pp. 1539–1544.
  • [19] Z. Zhou, Y. Liu, J. Lu, and L. Glielmo, ``Cluster synchronization of Boolean networks under state-flipped control with reinforcement learning,'' IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 12, pp. 5044–5048, 2022.
  • [20] X. Peng, Y. Tang, F. Li, and Y. Liu, ``Q-learning based optimal false data injection attack on probabilistic Boolean control networks,'' 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:265498378
  • [21] Z. Liu, Y. Liu, Q. Ruan, and W. Gui, ``Robust flipping stabilization of Boolean networks: A Q-learning approach,'' Systems & Control Letters, vol. 176, p. 105527, 2023.
  • [22] J. Lu, R. Liu, J. Lou, and Y. Liu, ``Pinning stabilization of Boolean control networks via a minimum number of controllers,'' IEEE Transactions on Cybernetics, vol. 51, no. 1, pp. 373–381, 2021.
  • [23] Y. Liu, L. Wang, J. Lu, and L. Yu, ``Pinning stabilization of stochastic networks with finite states via controlling minimal nodes,'' IEEE Transactions on Cybernetics, vol. 52, no. 4, pp. 2361–2369, 2022.
  • [24] S. Zhu, J. Lu, D. W. Ho, and J. Cao, ``Minimal control nodes for strong structural observability of discrete-time iteration systems: Explicit formulas and polynomial-time algorithms,'' IEEE Transactions on Automatic Control, 2023. [Online]. doi: 10.1109/TAC.2023.3330263.
  • [25] S. Zhu, J. Cao, L. Lin, J. Lam, and S.-i. Azuma, ``Towards stabilizable large-scale Boolean networks by controlling the minimal set of nodes,'' IEEE Transactions on Automatic Control, vol. 69, no. 1, pp. 174–188, 2024.
  • [26] S. Zhu, J. Lu, S.-i. Azuma, and W. X. Zheng, ``Strong structural controllability of Boolean networks: Polynomial-time criteria, minimal node control, and distributed pinning strategies,'' IEEE Transactions on Automatic Control, vol. 68, no. 9, pp. 5461–5476, 2023.
  • [27] H. V. Hasselt, A. Guez, and D. Silver, ``Deep reinforcement learning with double Q-learning,'' Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, pp. 1–13, 2016.
  • [28] T. Jaakkola, M. Jordan, and S. Singh, ``Convergence of stochastic iterative dynamic programming algorithms,'' Advances in Neural Information Processing Systems, vol. 6, pp. 703–710, 1993.
  • [29] C. J. Watkins and P. Dayan, ``Q-learning,'' Machine Learning, vol. 8, no. 3, pp. 279–292, 1992.
  • [30] L. Torrey and J. Shavlik, ``Transfer learning,'' in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI Global, 2010, pp. 242–264.
  • [31] R. Zhou, Y. Guo, and W. Gui, ``Set reachability and observability of probabilistic Boolean networks,'' Automatica, vol. 106, pp. 230–241, 2019.
  • [32] E. Greensmith, P. L. Bartlett, and J. Baxter, ``Variance reduction techniques for gradient estimates in reinforcement learning,'' Journal of Machine Learning Research, vol. 5, no. 9, p. 1471–1530, 2004.