
Minimum-Cost State-Flipped Control
for Reachability of Boolean Control Networks
using Reinforcement Learning

Jingjie Ni, Yang Tang, Fangfei Li

This work is supported by the National Natural Science Foundation of China under Grants 62173142 and 62233005, the Programme of Introducing Talents of Discipline to Universities (the 111 Project) under Grant B17017, and the Fundamental Research Funds for the Central Universities under Grant JKM01231838.

Jingjie Ni is with the School of Mathematics, East China University of Science and Technology, Shanghai, 200237, P.R. China (email: [email protected]).

Yang Tang is with the Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai, 200237, P.R. China (email: [email protected]).

Fangfei Li is with the School of Mathematics and the Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai, 200237, P.R. China (email: [email protected], [email protected]).
Abstract

This paper proposes model-free reinforcement learning methods for minimum-cost state-flipped control in Boolean control networks (BCNs). We tackle two questions: 1) finding the flipping kernel, namely the flip set with the smallest cardinality ensuring reachability, and 2) deriving optimal policies to minimize the number of flipping actions for reachability based on the obtained flipping kernel. For question 1), Q-learning's capability in determining reachability is demonstrated. To expedite convergence, we incorporate two improvements: i) demonstrating that previously reachable states remain reachable after adding elements to the flip set, followed by employing transfer learning, and ii) initiating each episode with special initial states whose reachability to the target state set is currently unknown. Question 2) requires optimal control with terminal constraints, while Q-learning only handles unconstrained problems. To bridge the gap, we propose a BCN-characteristics-based reward scheme and prove its optimality. Questions 1) and 2) with large-scale BCNs are addressed by employing small memory Q-learning, which reduces memory usage by only recording visited action-values. An upper bound on memory usage is provided to assess the algorithm's feasibility. To expedite convergence for question 2) in large-scale BCNs, we introduce adaptive variable rewards based on the known maximum number of steps needed to reach the target state set without cycles. Finally, the effectiveness of the proposed methods is validated on both small- and large-scale BCNs.

Index Terms:
Boolean control networks, model-free, reachability, reinforcement learning, state-flipped control.

I Introduction

Gene regulatory networks are a crucial focus in systems biology. The concept of Boolean networks was first introduced in 1969 by Kauffman [1] as the earliest model for gene regulatory networks. Boolean networks consist of several Boolean variables, each of which takes a value of ``0” or ``1” to signify whether the gene is transcribed or not. As gene regulatory networks often involve external inputs, Boolean networks were further developed into Boolean control networks (BCNs), which incorporate control inputs to better describe network behaviors.

State-flipped control is an innovative method used to change the state of specific nodes in BCNs, flipping them from ``1" to ``0" or from ``0" to ``1". In the context of systems biology, state-flipped control can be achieved through gene regulation techniques such as transcription activator-like effector repression and clustered regularly interspaced short palindromic repeats activation [2]. State-flipped control is a powerful control method that minimally disrupts the system structure. From a control effectiveness standpoint, state-flipped control enables BCNs to achieve any desired state with an in-degree greater than 0 [3, 4]. In contrast, state feedback control can only steer BCNs towards the original reachable state set, which is a subset of states with an in-degree greater than 0. Another highly effective control method is pinning control, but it has the drawback of causing damage to the network structure, unlike state-flipped control. State-flipped control was first proposed by Rafimanzelat to study the controllability of attractors in BNs [5]. Subsequently, to achieve stabilization, authors in [6] and [7] respectively considered flipping a subset of nodes after the BN had passed its transient period and flipping a subset of nodes in BCNs at the initial step. Zhang et al. further extended the concept of state-flipped control from BCNs to switched BCNs [8]. Considering that previous research has primarily focused on one-time flipping [5, 6, 8, 7], which may hinder the achievement of certain control objectives, a more comprehensive form of state-flipped control that permits multiple flipping actions was introduced to study stabilization in BNs and BCNs [3, 4]. Furthermore, to minimize control costs, the concept of flipping kernel was proposed [4, 3], representing the smallest set of nodes that can accomplish the control objectives. In this paper, we extend the existing studies by investigating problems including finding flipping kernels, under joint control proposed by [3]. These joint controls involve the combination of state-flipped controls and control inputs as defined in [3].

Closely tied to stabilization and controllability, reachability is a prominent area that requires extensive exploration. It involves determining whether a system can transition from an initial subset of states to a desired subset of states. This concept holds significant importance in domains such as genetic reprogramming and targeted drug development, where the ability to transform networks from unfavorable states to desirable ones is pivotal [6]. Previous studies [9, 10, 11, 12] have proposed various necessary and sufficient conditions for reachability in BCNs and their extensions. Additionally, an algorithm has been developed to identify an optimal control policy that achieves reachability in the shortest possible time [10]. In the context of BCNs under state-flipped control, the analysis of reachability between two states can be conducted using the semi-tensor product [3].

Despite extensive studies on the reachability of BCNs, there are still unresolved issues in this field. These problems can be categorized into three aspects.

  1. Firstly, the existing literature [9, 10, 11, 12] focuses on determining whether reachability can be achieved under certain control policies, without optimizing control costs, which is significant in both practical applications and theoretical research. For instance, in the development of targeted drugs, an increasing number of target points raises the difficulty level. Additionally, it is desirable to minimize the frequency of drug usage by patients to achieve expense savings and reduce drug side effects. If we formulate this problem as an optimal control problem, our objective is to find the flipping kernel and minimize the number of flipping actions. Emphasizing the identification of flipping kernels is more critical than minimizing flipping actions, as it significantly reduces the dimensionality of the joint control, thereby exponentially alleviating the computational complexity when minimizing flipping actions. It is worth mentioning that achieving reachability while optimizing control costs poses an optimization problem with a terminal constraint. To the best of our knowledge, conventional optimal control techniques such as the path integral approach and policy iteration [13, 14, 15, 16], which rely on predetermined cost functions, are unsuitable for our study. This is primarily because the objective of simultaneously minimizing control costs and achieving the terminal goal, namely ``achieving reachability," is difficult to express solely through a cost function.

  2. Secondly, the model-free scenario requires special consideration, especially given the increasing complexity of accurately modeling systems with numerous BCN nodes. While model-free approaches exist in the field of PBCNs [17, 18, 4, 3, 19, 20], they address different challenges from ours. Additionally, there are ongoing investigations into cost-reduction problems [9, 10, 11, 12, 21, 22, 23]. However, these studies are conducted under the assumption of known or partially known models.

  3. Thirdly, the existing methods in the field of BCN reachability [9, 10, 11, 12, 21] are limited in their applicability to small-scale BCNs. Several interesting approaches have been proposed for handling large-scale BCNs, with an emphasis on controllability and stability [24, 25, 26]. However, these references concentrate on optimization goals slightly different from ours. They aim to minimize the set of controlled nodes, namely, the dimension of the controller. In contrast, our objective is not only to reduce the controller dimension but also to further minimize flipping actions, namely, the frequency of control implementation.

Considering the aforementioned issues, we aim to design minimum-cost state-flipped control for achieving reachability in BCNs, in the absence of a known system model. Specifically, we address the following questions:

  1) How can we find the flipping kernel for reachability?

  2) Building on question 1), how can we determine the control policies that minimize the number of flipping actions required to achieve reachability?

Questions 1) and 2) are the same as 1) and 2) mentioned in the abstract. They are reiterated here for clarity purposes.

To tackle these questions, we turn to reinforcement learning-based methods. Compared with the semi-tensor product approach, reinforcement learning-based methods are suitable for model-free scenarios, which aligns with our considered setting. Moreover, reinforcement learning eliminates the need for matrix calculations and offers the potential to handle large-scale BCNs. In particular, we consider the extensively applied Q-learning (QL) method, which has been successful in solving control problems in BCNs and their extensions [17, 18, 4, 3, 19]. For reachability problems of large-scale BCNs, an enhanced QL method has been proposed in [20], providing a reliable foundation for this study. When compared to the deep Q-learning method [27], which is also suitable for large-scale BCNs, the method proposed in [20] is favored due to its theoretical guarantees of convergence [28].

Taken together, we propose improved QL methods to address the three issues above. Our work's main contributions are as follows:

  1) Regarding question 1), unlike the semi-tensor product method proposed in [4, 3], our approach is model-free. We demonstrate that reachability can be determined using QL and propose an improved QL with faster convergence than the traditional one [29]. Firstly, transfer learning (TL) [30] is integrated under the premise that the transferred knowledge is proven to be effective. Then, special initial states are incorporated into QL, allowing each episode to begin from a state whose reachability is still unknown, thereby eliminating unnecessary computations.

  2) Question 2) involves optimal control with terminal constraints, while QL only handles unconstrained problems. To bridge this gap, a BCN-characteristics-based reward scheme is proposed. Compared to other model-free approaches [17, 18, 4, 3, 19, 20], we are the first to tackle this more complex problem while providing rigorous theoretical guarantees.

  3) Compared to matrix-based methods [4, 3] and deep reinforcement learning techniques [17, 19, 18], our algorithms are suitable for large-scale BCNs with convergence guarantees. In contrast to [20], we clarify that the memory usage for the presented problem is not directly related to the scale of the BCN, and we provide an upper bound to evaluate the algorithm's suitability. Further, to accelerate convergence in question 2), a modified reward scheme is introduced.

The paper progresses as follows: Section II provides an overview of the system model and defines the problem statement. Section III introduces the reinforcement learning method. In Section IV, fast iterative QL and small memory iterative QL for large-scale BCNs are presented to identify the flipping kernel for reachability. In Section V, to minimize the number of flipping actions, QL with a BCN-characteristics-based reward setting and QL with dynamic rewards for large-scale BCNs are proposed. Algorithm complexity is detailed in Section VI. Section VII validates the proposed methods. Finally, conclusions are drawn in Section VIII.

We use the following notations throughout this paper:

  • \mathbb{Z}^{+}, \mathbb{R}^{+}, \mathbb{R}, and \mathbb{R}^{m\times n} denote the sets of non-negative integers, non-negative real numbers, real numbers, and m\times n real matrices, respectively.

  • \mathbb{E}[\cdot] denotes the expected value operator and \operatorname{var}[\cdot] denotes the variance. \Pr\{E_{1}\mid E_{2}\} denotes the probability of event E_{1} occurring given that event E_{2} has occurred.

  • For a set S, |S| denotes the cardinality of S.

  • For M_{m\times n}\subset\mathbb{R}^{m\times n}, |M_{m\times n}|_{\infty} denotes the maximal value of the elements in M_{m\times n}.

  • The relative complement of set S_{2} in set S_{1} is denoted as S_{1}\setminus S_{2}.

  • There are four basic operations on Boolean variables, namely ``not", ``and", ``or", and ``exclusive or", expressed as \neg, \wedge, \vee, and \oplus, respectively.

  • \mathcal{B}:=\{0,1\}, and \mathcal{B}^{n}:=\underbrace{\mathcal{B}\times\ldots\times\mathcal{B}}_{n}.

II Problem Formulation

This section introduces the system model, with a specific focus on BCNs under state-flipped control. Subsequently, we outline the problems under investigation.

II-A System Model

II-A1 BCNs

A BCN with n nodes and m control inputs is defined as

x_{i}(t+1)=f_{i}\big(x_{1}(t),\ldots,x_{n}(t),u_{1}(t),\ldots,u_{m}(t)\big),\quad t\in\mathbb{Z}^{+},\ i=1,\ldots,n, (1)

where f_{i}:\mathcal{B}^{n+m}\rightarrow\mathcal{B},\ i\in\{1,\ldots,n\} is a logical function, x_{i}(t)\in\mathcal{B},\ i\in\{1,\ldots,n\} represents the i^{th} node at time step t, and u_{j}(t)\in\mathcal{B},\ j\in\{1,\ldots,m\} represents the j^{th} control input at time step t. All nodes at time step t are grouped together in the vector x(t)=(x_{1}(t),\ldots,x_{n}(t))\in\mathcal{B}^{n}. Similarly, all control inputs at time step t are represented by u(t)=(u_{1}(t),\ldots,u_{m}(t))\in\mathcal{B}^{m}.

II-A2 State-flipped Control

Considering that not all nodes in a BCN can be flipped, we refer to the set of all flippable nodes as a flip set A\subseteq\{1,2,\ldots,n\}. At each time step t, a flipping action A(t)=\{i_{1},i_{2},\ldots,i_{k}\}\subseteq A is selected. According to the flipping action A(t), the flip function is defined as

\eta_{A(t)}\big(x(t)\big)=\big(x_{1}(t),\ldots,\neg x_{i_{1}}(t),\ldots,\neg x_{i_{2}}(t),\ldots,\neg x_{i_{k}}(t),\ldots,x_{n}(t)\big). (2)

Note that to achieve a specific control objective, it is not necessary to flip all the nodes in the set A. The flipping kernel, denoted as B^{*}\subset A, is defined as the flip set with the minimum cardinality required to achieve reachability.

II-A3 BCNs under State-flipped Control

Based on the definitions of BCNs and state-flipped control, a BCN with n nodes and a flip set A is defined as

x_{i}(t+1)=f_{i}\big(\eta_{A(t)}(x(t)),u(t)\big),\ i=1,\ldots,n. (3)
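To make the dynamics concrete, the following Python sketch simulates one transition of system (3) for a hypothetical 3-node BCN with a single control input. The logical functions f_1, f_2, f_3 and the node indexing are illustrative assumptions, not a model taken from this paper.

def f(x, u):
    # Logical update (1) of a hypothetical 3-node BCN; placeholder functions.
    x1, x2, x3 = x
    return (int(x2 and u[0]),      # f_1
            int(x1 or x3),         # f_2
            int(x1 != x2))         # f_3 ("exclusive or")

def flip(x, flip_action):
    # Flip function eta_{A(t)} in (2): negate the nodes listed in flip_action.
    return tuple(1 - xi if (i + 1) in flip_action else xi
                 for i, xi in enumerate(x))

def step(x, u, flip_action):
    # One transition of the BCN under state-flipped control, eq. (3).
    return f(flip(x, flip_action), u)

# Example: from state (0, 1, 0), apply u = (1,) while flipping node 3.
print(step((0, 1, 0), (1,), {3}))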

II-B Problems of Interest

II-B1 Problem 1. Flipping Kernel for Reachability

To better illustrate Problem 1, we first define reachability.

\mathbf{Definition\ 1} [31]. For system (3), let \mathcal{M}_{0}\subset\mathcal{B}^{n} be the initial subset and \mathcal{M}_{d}\subset\mathcal{B}^{n} be the target subset. \mathcal{M}_{d} is said to be reachable from \mathcal{M}_{0} if and only if, for any initial state x_{0}\in\mathcal{M}_{0}, there exists a sequence of joint control pairs \big\{\big(u(t),\eta_{A(t)}\big),t=0,1,\ldots,T\big\} such that x_{0} reaches a target state x_{d}\in\mathcal{M}_{d}.

Based on Definition 1, we define the specific problem to be considered. In some cases, it is not necessary to flip all nodes in the set A for system (3) to achieve reachability. To reduce the control cost, we want as few nodes as possible to be flipped. This problem can be formulated as finding the flipping kernel B^{*} for reachability, which satisfies

\min_{B}|B|\quad\text{s.t. }B\subset A\text{ and system (3) from any state in }\mathcal{M}_{0}\text{ is reachable to state set }\mathcal{M}_{d}. (4)

It is worth noting that the flipping kernel may not be unique, as there can be multiple flip sets of the minimum cardinality |B^{*}| that achieve reachability.

II-B2 Problem 2. Minimum Flipping Actions for Reachability

Based on the flipping kernel B^{*} obtained from Problem 1 (4), we aim to determine the optimal policy under which reachability can be achieved with the minimum number of flipping actions. This problem can be formulated as finding the policy \pi^{*}:x(t)\rightarrow\big(u(t),\eta_{B^{*}(t)}\big),\ \forall t\in\mathbb{Z}^{+} that satisfies

\min_{\pi}\sum_{t=0}^{T}n_{t}\quad\text{s.t. system (3) from any state in }\mathcal{M}_{0}\text{ is reachable to state set }\mathcal{M}_{d}, (5)

where n_{t} denotes the number of nodes flipped at time step t, and T represents the terminal time step, namely the time step at which a terminal state emerges. The terminal and initial states are any x_{d}\in\mathcal{M}_{d} and any x_{0}\in\mathcal{M}_{0}, respectively. Note that the optimal policy \pi^{*} may not be unique, as multiple policies can achieve reachability with the minimum number of flipping actions.

III Preliminaries

In this section, we introduce the reinforcement learning method, specifically focusing on the Markov decision process and the Q-learning algorithm.

III-A Markov Decision Process

A Markov decision process provides the framework for reinforcement learning and is represented as a quintuple (\mathbf{X},\mathbf{A},\gamma,\mathbf{P},\mathbf{R}). \mathbf{X}=\{x_{t},t\in\mathbb{Z}^{+}\} and \mathbf{A}=\{a_{t},t\in\mathbb{Z}^{+}\} denote the state space and action space, respectively. The discount factor \gamma\in[0,1] weighs the importance of future rewards. The state-transition probability \mathbf{P}_{x_{t}}^{x_{t+1}}(a_{t})=\Pr\{x_{t+1}\mid x_{t},a_{t}\} represents the probability of transitioning from state x_{t} to x_{t+1} under action a_{t}. \mathbf{R}_{x_{t}}^{x_{t+1}}(a_{t})=\mathbb{E}[r_{t+1}\mid x_{t},a_{t},x_{t+1}] denotes the expected reward obtained by taking action a_{t} at state x_{t} and transitioning to x_{t+1}. At each time step t\in\mathbb{Z}^{+}, an agent interacts with the environment to determine an optimal policy. Specifically, the agent observes x_{t} and selects a_{t} according to the policy \pi:x_{t}\rightarrow a_{t},\ \forall t\in\mathbb{Z}^{+}. Then, the environment returns r_{t+1} and x_{t+1}. The agent evaluates the reward r_{t+1} received for taking action a_{t} at state x_{t} and then updates its policy \pi. Define G_{t}=\sum_{i=t+1}^{T}\gamma^{i-t-1}r_{i} as the return. The goal of the agent is to learn the optimal policy \pi^{*}:x_{t}\rightarrow a_{t},\ \forall t\in\mathbb{Z}^{+}, which satisfies

\pi^{*}=\arg\max_{\pi\in\Pi}\mathbb{E}_{\pi}[G_{t}],\ \forall t\in\mathbb{Z}^{+}, (6)

where \mathbb{E}_{\pi} is the expected value operator following policy \pi, and \Pi is the set of all admissible policies. The state-value and the action-value are v_{\pi}(x_{t})=\mathbb{E}_{\pi}[G_{t}|x_{t}] and q_{\pi}(x_{t},a_{t})=\mathbb{E}_{\pi}[G_{t}|x_{t},a_{t}], respectively. The Bellman equations reveal the recursion of v_{\pi}(x_{t}) and q_{\pi}(x_{t},a_{t}), which are given as follows:

v_{\pi}(x_{t})=\sum_{x_{t+1}\in\mathbf{X}}\mathbf{P}_{x_{t}}^{x_{t+1}}\big(\pi(x_{t})\big)\big[\mathbf{R}_{x_{t}}^{x_{t+1}}\big(\pi(x_{t})\big)+\gamma v_{\pi}(x_{t+1})\big], (7)
q_{\pi}(x_{t},a_{t})=\sum_{x_{t+1}\in\mathbf{X}}\mathbf{P}_{x_{t}}^{x_{t+1}}(a_{t})\big[\mathbf{R}_{x_{t}}^{x_{t+1}}(a_{t})+\gamma v_{\pi}(x_{t+1})\big].

The optimal state-value and action-value are defined as

v^{*}(x_{t})=\max_{\pi\in\Pi}v_{\pi}(x_{t}),\ \forall x_{t}\in\mathbf{X}, (8)
q^{*}(x_{t},a_{t})=\max_{\pi\in\Pi}q_{\pi}(x_{t},a_{t}),\ \forall x_{t}\in\mathbf{X},\ \forall a_{t}\in\mathbf{A}.

For problems under the framework of a Markov decision process, the optimal policy \pi^{*} can be obtained through QL. In the following, we introduce QL.

III-B Q-Learning

QL is a classical algorithm in reinforcement learning. The purpose of QL is to enable the agent to find an optimal policy \pi^{*} through its interactions with the environment. Q_{t}(x_{t},a_{t}), namely the estimate of q^{*}(x_{t},a_{t}), is recorded and updated at every time step as follows:

Q_{t+1}(x,a)=\begin{cases}Q_{t}(x,a)+\alpha_{t}TDE_{t},&\text{if }(x,a)=(x_{t},a_{t}),\\ Q_{t}(x,a),&\text{else},\end{cases} (9)

where \alpha_{t}\in(0,1],\ t\in\mathbb{Z}^{+} is the learning rate, and TDE_{t}=r_{t+1}+\gamma\max_{a\in\mathbf{A}}Q_{t}(x_{t+1},a)-Q_{t}(x_{t},a_{t}) is the temporal-difference error.

In terms of action selection, the \epsilon-greedy method is used:

a_{t}=\begin{cases}\arg\max_{a\in\mathbf{A}}Q_{t}(x_{t},a),&P=1-\epsilon,\\ \operatorname{rand}(\mathbf{A}),&P=\epsilon,\end{cases} (10)

where P is the probability of selecting the corresponding action, 0\leq\epsilon\leq 1, and \operatorname{rand}(\mathbf{A}) is an action selected uniformly at random from \mathbf{A}.

\mathbf{Lemma\ 1} [28]. Q_{t}(x_{t},a_{t}) converges to the fixed point q^{*}(x_{t},a_{t}) with probability one under the following conditions:

  1) \sum_{t=0}^{\infty}\alpha_{t}=\infty and \sum_{t=0}^{\infty}\alpha_{t}^{2}<\infty.

  2) \operatorname{var}[r_{t}] is finite.

  3) |\mathbf{X}| and |\mathbf{A}| are finite.

  4) If \gamma=1, all policies lead to a cost-free terminal state.

Once QL converges, the optimal policy is obtained:

\pi^{*}(x_{t})=\arg\max_{a\in\mathbf{A}}Q_{t}(x_{t},a),\ \forall x_{t}\in\mathbf{X}. (11)
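As an illustration of update rule (9), the \epsilon-greedy selection (10), and policy extraction (11), the following Python sketch implements tabular QL against a generic simulator. The interface env_step(x, a) -> (x_next, r) and all parameter values are assumptions introduced for illustration, not part of the paper.

import random
from collections import defaultdict

def q_learning(states, actions, env_step, initial_states,
               episodes=1000, t_max=50, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)                        # Q[(x, a)], initialized to 0
    for _ in range(episodes):
        x = random.choice(initial_states)
        for _ in range(t_max):
            if random.random() < eps:             # explore, probability eps
                a = random.choice(actions)
            else:                                 # exploit, probability 1 - eps
                a = max(actions, key=lambda b: Q[(x, b)])
            x_next, r = env_step(x, a)
            td_error = r + gamma * max(Q[(x_next, b)] for b in actions) - Q[(x, a)]
            Q[(x, a)] += alpha * td_error         # update rule (9)
            x = x_next
    # Policy extraction (11) once the action-values have converged.
    return {x: max(actions, key=lambda a: Q[(x, a)]) for x in states}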

IV Flipping Kernel for Reachability

To solve Problem 1 (4), we propose three algorithms: iterative QL, fast iterative QL, and small memory iterative QL. Fast iterative QL enhances convergence efficiency over iterative QL. Meanwhile, small memory iterative QL is designed specifically for large-scale BCNs (3).

IV-A Markov Decision Process for Finding Flipping Kernel

The premise for using QL is to structure the problem within the framework of a Markov decision process. We represent the Markov decision process by the quintuple (\mathcal{B}^{n},\mathcal{B}^{m+|B|},\mathbf{P},\mathbf{R},\gamma), where \mathcal{B}^{n} and \mathcal{B}^{m+|B|} are the state space and action space, respectively. The state and the action are defined as x_{t}=x(t) and a_{t}=\big(u(t),\eta_{A(t)}\big), respectively. The reward is given as follows:

r_{t}(x_{t},a_{t})=\begin{cases}100,&x_{t}\in\mathcal{M}_{d},\\ 0,&\text{else}.\end{cases} (12)

To incentivize the agent to reach x_{t}\in\mathcal{M}_{d} as quickly as possible, we set the discount factor \gamma\in(0,1).

We assume the agent is aware of the dimensions of \mathcal{B}^{n} and \mathcal{B}^{m+|B|} but lacks knowledge of \mathbf{P} and \mathbf{R}, meaning the agent does not know the system dynamics (3). Through interactions with the environment, the agent implicitly learns about \mathbf{P} and \mathbf{R}, refines its estimate of q^{*}(x_{t},a_{t}), and thereby improves its policy toward \pi^{*}.

IV-B Iterative Q-Learning for Finding Flipping Kernel

Under the reward setting (12), it is worthwhile to investigate how to determine the reachability of the BCN defined by equation (3). To this end, we present Theorem 1.

\mathbf{Theorem\ 1}. For system (3), assume that Q_{0}(x_{t},a_{t})=0,\ \forall x_{t},\forall a_{t}, and that equation (9) is utilized for updating Q_{t}(x_{t},a_{t}). System (3) from any state in \mathcal{M}_{0} is reachable to state set \mathcal{M}_{d} if and only if there exists a t^{*} such that \max_{a\in\mathbf{A}}Q_{t^{*}}(x_{0},a)>0,\ \forall x_{0}\in\mathcal{M}_{0}.

\mathbf{Proof}. (Necessity) Assume that system (3) from any state in \mathcal{M}_{0} is reachable to state set \mathcal{M}_{d}. Without loss of generality, suppose that there is only one x_{0}\in\mathcal{M}_{0} and one x_{d}\in\mathcal{M}_{d}. Then, there exists a joint control pair sequence \big\{a_{t}=\big(u(t),\eta_{A(t)}\big),t=0,1,\ldots,T\big\} such that x_{0} reaches the target state x_{d}. Applying this sequence to x_{0}, we obtain a trajectory x_{0}\stackrel{a_{0}}{\longrightarrow}x_{1}\stackrel{a_{1}}{\longrightarrow}\ldots x_{T}\stackrel{a_{T}}{\longrightarrow}x_{T+1}, where x_{T+1}=x_{d}. Here, x_{t}\stackrel{a_{t}}{\longrightarrow}x_{t+1} means that x_{t} transitions to x_{t+1} when a_{t} is taken.

As QL iterates, all state-action-state pairs (x_{t},a_{t},x_{t+1}) are constantly visited [28]. Therefore, there can be a case where the state-action-state pairs (x_{T},a_{T},x_{T+1}), (x_{T-1},a_{T-1},x_{T}), (x_{T-2},a_{T-2},x_{T-1}), \ldots, (x_{0},a_{0},x_{1}) are visited one after another, possibly with other state-action-state pairs visited in between. This case causes the values in the Q-table to transition from all zeros to some being positive. Assume that this case first occurs at time step t^{\prime}. Then, the corresponding change of an action-value from zero to positive is given by Q_{t^{\prime}}(x_{T},a_{T})\leftarrow\alpha_{t^{\prime}-1}\big(r_{t^{\prime}}+\gamma\max_{a\in\mathbf{A}}Q_{t^{\prime}-1}(x_{T+1},a)\big)+(1-\alpha_{t^{\prime}-1})Q_{t^{\prime}-1}(x_{T},a_{T}), where r_{t^{\prime}}=100 since x_{T+1}\in\mathcal{M}_{d}. Additionally, due to the absence of negative rewards and the fact that the initial action-values are all 0, it follows that Q_{t^{\prime}-1}(x_{T+1},a)\geq 0 for all a\in\mathbf{A}. Thus, after the update, we have Q_{t^{\prime}}(x_{T},a_{T})>0. According to (9), at t^{\prime}+1, it follows that Q_{t^{\prime}+1}(x_{T},a_{T})>0. Similarly, for any t^{\prime\prime}>t^{\prime}, it follows that Q_{t^{\prime\prime}}(x_{T},a_{T})>0.

Next, we prove by mathematical induction that there exists t^{*} such that Q_{t^{*}}(x_{\overline{t}},a_{\overline{t}})>0, where \overline{t}=T,T-1,\ldots,0. The proof is divided into two parts. First, there exists t^{\prime} such that Q_{t^{\prime}}(x_{\overline{t}},a_{\overline{t}})>0 for \overline{t}=T, according to the previous paragraph. Second, we prove that if (x_{\overline{t}},a_{\overline{t}},x_{\overline{t}+1}) is visited at time step t^{\prime\prime} and Q_{t^{\prime\prime}}(x_{\overline{t}+1},a_{\overline{t}+1})>0, then Q_{t^{\prime\prime}+1}(x_{\overline{t}},a_{\overline{t}})>0. Since (x_{\overline{t}},a_{\overline{t}},x_{\overline{t}+1}) is visited at time step t^{\prime\prime}, it follows that Q_{t^{\prime\prime}+1}(x_{\overline{t}},a_{\overline{t}})\leftarrow\alpha_{t^{\prime\prime}}\big(r_{t^{\prime\prime}+1}+\gamma\max_{a\in\mathbf{A}}Q_{t^{\prime\prime}}(x_{\overline{t}+1},a)\big)+(1-\alpha_{t^{\prime\prime}})Q_{t^{\prime\prime}}(x_{\overline{t}},a_{\overline{t}}). In this update, we have r_{t^{\prime\prime}+1}\geq 0 and Q_{t^{\prime\prime}}(x_{\overline{t}},a_{\overline{t}})\geq 0. Furthermore, since Q_{t^{\prime\prime}}(x_{\overline{t}+1},a_{\overline{t}+1})>0, it implies that \max_{a\in\mathbf{A}}Q_{t^{\prime\prime}}(x_{\overline{t}+1},a)>0. Consequently, we have Q_{t^{\prime\prime}+1}(x_{\overline{t}},a_{\overline{t}})>0. By mathematical induction, there exists t^{*} such that Q_{t^{*}}(x_{0},a_{0})>0. Since a_{0}\in\mathbf{A}, we have \max_{a\in\mathbf{A}}Q_{t^{*}}(x_{0},a)>0.

(Sufficiency) Suppose that there exists t^{*} such that \max_{a\in\mathbf{A}}Q_{t^{*}}(x_{0},a)>0,\ \forall x_{0}\in\mathcal{M}_{0}. Without loss of generality, we assume that there is only one x_{0}\in\mathcal{M}_{0} as well as one x_{d}\in\mathcal{M}_{d}, that \arg\max_{a\in\mathbf{A}}Q_{t^{*}}(x_{0},a)=a_{t}, and that \max_{a\in\mathbf{A}}Q_{t^{*}-1}(x_{0},a)=0. Then, according to the update formula Q_{t^{*}}(x_{0},a_{t})\leftarrow\alpha_{t^{*}-1}\big(r_{t^{*}}+\gamma\max_{a\in\mathbf{A}}Q_{t^{*}-1}(x_{0\text{next}},a)\big)+(1-\alpha_{t^{*}-1})Q_{t^{*}-1}(x_{0},a_{t}), where x_{0\text{next}} is the successor state of x_{0}, at least one of the following two cases holds:

  • Case 1. There exist a_{t} and x_{0\text{next}} such that r_{t^{*}}(x_{0\text{next}},a_{t})>0.

  • Case 2. x_{0\text{next}} satisfies \max_{a\in\mathbf{A}}Q_{t^{*}-1}(x_{0\text{next}},a)>0.

In Case 1, we have x_{0\text{next}}\in\mathcal{M}_{d}. This implies that system (3) from state x_{0} is reachable to state set \mathcal{M}_{d} in one step.

In terms of Case 2, let us consider the condition under which \max_{a\in\mathbf{A}}Q_{t^{*}-1}(x_{0\text{next}},a)>0, meaning that there exists an action a\in\mathbf{A} such that Q_{t^{*}-1}(x_{0\text{next}},a)>0. We define the i^{th} action-value that changes from 0 to a positive number as Q_{t_{i}}(x_{i},a_{i}). Here, t_{i} denotes the time step of the change of the i^{th} action-value, and (x_{i},a_{i}) corresponds to the state-action pair. Since the initial Q-values are all 0, it follows that Q_{t_{1}}(x_{1},a_{1})>0 if and only if x_{1\text{next}}\in\mathcal{M}_{d}. This implies that system (3) from state x_{1} is reachable to state set \mathcal{M}_{d} within one step, leading to r_{t_{1}}>0. Next, for any i>1, Q_{t_{i}}(x_{i},a_{i})>0 occurs if and only if x_{i\text{next}}\in\mathcal{M}_{d}, or x_{i\text{next}}=x_{j},\ j<i, where x_{j} is a state that has satisfied Q_{t_{j}}(x_{j},a_{j})>0 for t_{j}<t_{i} based on the previously defined conditions. This means that for Q_{t_{i}}(x_{i},a_{i})>0,\ \forall i, at least one of the following two events has taken place: either \mathcal{M}_{d} is reachable from x_{i}, or a state x_{j},\ j<i, which can reach \mathcal{M}_{d}, is reachable from x_{i}. In both cases, \mathcal{M}_{d} is reachable from x_{i}. Thus, if Q_{t^{*}-1}(x_{0\text{next}},a)>0, it follows that \mathcal{M}_{d} is reachable from x_{0\text{next}}. This, in turn, implies that \mathcal{M}_{d} is also reachable from x_{0}. Based on the analysis of Cases 1 and 2, if \max_{a\in\mathbf{A}}Q_{t^{*}}(x_{0},a)>0 holds for all x_{0}\in\mathcal{M}_{0}, then \mathcal{M}_{d} is reachable from \mathcal{M}_{0}. \blacksquare

\mathbf{Remark\ 1}. The key for QL to judge the reachability of system (3) lies in the exploration rule (10) and the update rule (9).

\mathbf{Lemma\ 2}. For system (3), \mathcal{M}_{d} is reachable from \mathcal{M}_{0} if and only if there exists l\leqslant 2^{n}-|\mathcal{M}_{d}| such that \mathcal{M}_{d} is l-step reachable from any x_{0}\in\mathcal{M}_{0}.

\mathbf{Proof}. Without loss of generality, assume that there is a trajectory from x_{0}\in\mathcal{M}_{0} to \mathcal{M}_{d} whose length is greater than 2^{n}-|\mathcal{M}_{d}|. Then, there must be at least one x_{t}\in\mathcal{B}^{n}\setminus\mathcal{M}_{d} that is visited repeatedly. If we remove the cycle from x_{t} to x_{t}, the trajectory length will be less than 2^{n}-|\mathcal{M}_{d}|. \blacksquare

Based on Theorem 1 and Lemma 2, we present Algorithm 1 for finding flipping kernels. First, we provide the notation used in Algorithm 1. C_{|A|}^{k} denotes the number of flip sets with k given nodes in A. B_{k_{i}} represents the i^{th} flip set with k given nodes. ep denotes the number of episodes, where an ``episode" is a term from reinforcement learning referring to a period of interaction lasting T_{max} time steps from an initial state x_{0}\in\mathcal{M}_{0}. T_{max} is the maximum time step, which should exceed the length of the trajectory from \mathcal{M}_{0} to \mathcal{M}_{d}; one can refer to Lemma 2 for the setting of T_{max}.

The core idea of Algorithm 1 is to incrementally increase the flip set size from |B| to |B|+1 in step 18 whenever reachability is not achieved with flip sets of size |B|. Specifically, step 4 fixes the flip set B, and steps 5-16 assess reachability under that flip set, guided by Theorem 1. Upon identifying the first flipping kernel (when the condition in step 12 is met), the cardinality of the kernel is determined. This leads to setting K=|B| in step 13, which, together with the condition in step 2, prevents ineffective examination of larger flip sets. After evaluating reachability under all flip sets with K nodes, steps 20-24 present the algorithm's results.

\mathbf{Algorithm\ 1} Finding flipping kernels for reachability of BCNs under state-flipped control using iterative QL
Input: \mathcal{M}_{0}, \mathcal{M}_{d}, A, \alpha_{t}\in(0,1], \gamma\in(0,1), \epsilon\in[0,1], N, T_{max}
Output: Flipping kernels
1:  K=|A|, k=0, n=0
2:  while k\leq K do
3:     for i=1,2,\dots,C_{|A|}^{k} do
4:        B=B_{k_{i}}
5:        Initialize Q(x_{t},a_{t})\leftarrow 0,\ \forall x_{t}\in\mathcal{B}^{n},\ \forall a_{t}\in\mathcal{B}^{m+|B|}
6:        for ep=0,1,\dots,N-1 do
7:           x_{0}\leftarrow\operatorname{rand}(\mathcal{M}_{0})
8:           for t=0,1,\dots,T_{max}-1 and x_{t}\notin\mathcal{M}_{d} do
9:              a_{t}\leftarrow\arg\max_{a\in\mathcal{B}^{m+|B|}}Q(x_{t},a) with probability P=1-\epsilon; a_{t}\leftarrow\operatorname{rand}(\mathcal{B}^{m+|B|}) with probability P=\epsilon
10:              Q(x_{t},a_{t})\leftarrow\alpha_{t}\big(r_{t+1}+\gamma\max_{a_{t+1}\in\mathcal{B}^{m+|B|}}Q(x_{t+1},a_{t+1})\big)+(1-\alpha_{t})Q(x_{t},a_{t})
11:           end for
12:           if \max_{a_{t}\in\mathcal{B}^{m+|B|}}Q(x_{t},a_{t})>0,\ \forall x_{t}\in\mathcal{M}_{0} then
13:              B_{n}^{*}=B, K=|B|, n=n+1
14:              Break
15:           end if
16:        end for
17:     end for
18:     k=k+1
19:  end while
20:  if n=0 then
21:     return ``System can't achieve reachability."
22:  else
23:     return B^{*}_{1},\dots,B^{*}_{n}
24:  end if
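For concreteness, the following Python sketch implements the inner reachability test of Algorithm 1 (steps 5-16) for a fixed flip set B, using reward (12) and the criterion of Theorem 1. It assumes a black-box simulator step(x, u, flip_action) of system (3), such as the one sketched in Section II, and illustrative hyperparameter values.

import itertools
import random
from collections import defaultdict

def reachable_with_flip_set(step, M0, Md, B, m,
                            episodes=2000, t_max=50,
                            alpha=0.1, gamma=0.95, eps=0.3):
    # Joint action space: every control input in B^m paired with every
    # subset of the flip set B.
    controls = list(itertools.product((0, 1), repeat=m))
    flips = [frozenset(c) for r in range(len(B) + 1)
             for c in itertools.combinations(sorted(B), r)]
    actions = [(u, fa) for u in controls for fa in flips]
    Q = defaultdict(float)
    for _ in range(episodes):
        x = random.choice(list(M0))
        for _ in range(t_max):
            if x in Md:
                break
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda b: Q[(x, b)]))
            x_next = step(x, a[0], a[1])
            r = 100 if x_next in Md else 0             # reward (12)
            target = r + gamma * max(Q[(x_next, b)] for b in actions)
            Q[(x, a)] += alpha * (target - Q[(x, a)])  # update rule (9)
            x = x_next
    # Theorem 1: reachability holds iff max_a Q(x0, a) > 0 for all x0 in M0.
    return all(max(Q[(x0, a)] for a in actions) > 0 for x0 in M0)

Algorithm 1 then amounts to calling such a test for every flip set B of size k=0,1,2,\dots and recording, at the first size for which the test succeeds, all successful flip sets as flipping kernels.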

Although Algorithm 1 is efficient, there is room for improvement. The next section outlines techniques to enhance its convergence efficiency.

IV-C Fast Iterative Q-learning for Finding Flipping Kernel

In this subsection, we propose Algorithm 2, known as fast iterative QL for finding the flipping kernels, which improves the convergence efficiency. The main differences between Algorithm 2 and Algorithm 1 fall into two aspects. 1) Special initial states: Algorithm 2 selects initial states strategically instead of randomly. 2) TL: Algorithm 2 utilizes the knowledge gained from achieving reachability with smaller flip sets when searching for flipping kernels among larger flip sets.

\mathbf{Algorithm\ 2} Finding flipping kernels for reachability of BCNs under state-flipped control using fast iterative QL
Input: \mathcal{M}_{0}, \mathcal{M}_{d}, A, \alpha_{t}\in(0,1], \gamma\in(0,1), \epsilon\in[0,1], N, T_{max}
Output: Flipping kernels
1:  K=|A|, k=0, n=0
2:  while k\leq K do
3:     for i=1,2,\dots,C_{|A|}^{k} do
4:        B=B_{k_{i}}
5:        Initialize Q(x_{t},a_{t}),\ \forall x_{t}\in\mathcal{B}^{n},\ \forall a_{t}\in\mathcal{B}^{m+|B|} using equation (14)
6:        Examine the reachability under flip set B following modified steps 6-16 of Algorithm 1 with equation (13)
7:        Record the Q-table for B
8:     end for
9:     Drop all the Q-tables, k=k+1
10:  end while
11:  Present the result following steps 20-24 of Algorithm 1

IV-C1 Special Initial States

The main concept behind selecting special initial states is to avoid visiting x_{0}\in\mathcal{M}_{0} once we determine that \mathcal{M}_{d} is reachable from it. Instead, we focus on states in \mathcal{M}_{0} that have not been identified as reachable. With this approach in mind, we modify step 7 in Algorithm 1 into

x_{0}\leftarrow\operatorname{rand}\Big(\big\{x_{0}\in\mathcal{M}_{0}\mid\max_{a_{t}\in\mathcal{B}^{m+|B|}}Q(x_{0},a_{t})=0\big\}\Big). (13)
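A minimal sketch of selection rule (13), assuming the dictionary-style Q-table and action list from a QL loop such as the one sketched above:

import random

def special_initial_state(M0, Q, actions):
    # Restart from a state in M0 whose reachability is still unknown,
    # i.e., whose action-values are all still zero.
    unknown = [x0 for x0 in M0 if max(Q[(x0, a)] for a in actions) == 0]
    # If every state in M0 is already known to reach Md, the search is done.
    return random.choice(unknown) if unknown else None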

\mathbf{Remark\ 2}. Theorem 1 remains valid when special initial states are added to iterative QL, as the trajectories from the states in \mathcal{M}_{0} that can reach \mathcal{M}_{d} can still be visited.

IV-C2 Transfer-learning

Let Q^{B}(x_{t},a_{t}) and Q^{b}(x_{t},a_{t}) represent the Q-tables with flip sets B and b, respectively, where b\subset B. The relationship between Q^{B}(x_{t},a_{t}) and Q^{b}(x_{t},a_{t}) is explained in Theorem 2 below.

\mathbf{Theorem\ 2}. For any state-action pair (x_{t},a_{t}), if Q^{b}(x_{t},a_{t})>0 with flip set b\subset B, then Q^{B}(x_{t},a_{t})>0 holds for flip set B.

\mathbf{Proof}. It is evident that the action space for flip set b\subset B is a subset of the action space for flip set B, namely, \mathbf{A}_{b}\subset\mathbf{A}_{B}. Therefore, if system (3) from state x_{1} is reachable to state x_{d}\in\mathcal{M}_{d} with flip set b\subset B, namely, there exists a trajectory x_{1}\stackrel{a_{1}}{\longrightarrow}x_{2}\ldots\stackrel{a_{T}}{\longrightarrow}x_{d},\ a_{t}\in\mathbf{A}_{b}, the reachability still holds with flip set B. This is because the trajectory x_{1}\stackrel{a_{1}}{\longrightarrow}x_{2}\ldots\stackrel{a_{T}}{\longrightarrow}x_{d} exists for a_{t}\in\mathbf{A}_{b}\subset\mathbf{A}_{B}. Thus, if \mathcal{M}_{d} is reachable from x_{1} with flip set b, then the reachability holds with flip set B. According to Theorem 1, for all (x_{t},a_{t}), if Q^{b}(x_{t},a_{t})>0 with flip set b, then Q^{B}(x_{t},a_{t})>0 holds for flip set B. \blacksquare

Motivated by the above results, we consider TL [30]. TL enhances the learning process in a new task by transferring knowledge from a related task that has already been learned. In this context, we employ TL by initializing the Q-table for flip set B with the knowledge gained from the Q-tables for flip sets b\subset B, namely

Q_{0}(x_{t},a_{t})=\begin{cases}0,&\text{if there is no subset }b\subset B\text{ such that }a_{t}\in\mathbf{A}_{b},\\ \max_{b\subset B}Q_{t}^{b}(x_{t},a_{t}),&\text{else},\end{cases} (14)

where Q_{t}^{b}(x_{t},a_{t}) represents the Q-table associated with flip set b. Based on this idea, we refine step 5 and include steps 7 and 9 in Algorithm 2 compared to Algorithm 1.
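The initialization (14) can be sketched as follows, assuming each flip set is represented as a frozenset and each recorded Q-table as a dictionary keyed by state-action pairs; these data structures are illustrative choices rather than ones prescribed by the paper.

from collections import defaultdict

def init_q_table(B, q_tables):
    # Transfer-learning initialization (14): start the Q-table for flip set B
    # from the best action-values recorded for its already examined subsets b.
    Q = defaultdict(float)
    for b, Qb in q_tables.items():
        if b < B:                                   # proper subset b of B
            for (x, a), value in Qb.items():
                # a's flipping part only uses nodes in b, so a is also an
                # admissible action under flip set B; keep the best estimate.
                Q[(x, a)] = max(Q[(x, a)], value)
    return Q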

Algorithm 2 improves the convergence efficiency compared to Algorithm 1. However, Algorithm 2 is not suitable for large-scale BCNs. Specifically, the operation of iterative QL is based on a Q-table that has |\mathbf{X}|\times|\mathbf{A}| values. The number of values in the table grows exponentially with n and m+|B|, and can be expressed as 2^{n+m+|B|}. When n and m+|B| are large, iterative QL becomes impractical because the Q-table is too large to store on a computer. Therefore, we propose an algorithm in the following subsection, which can be applied to large-scale BCNs.

IV-D Small Memory Iterative Q-Learning for Finding Flipping Kernel of Large-scale BCNs

In this subsection, we present Algorithm 3, named small memory iterative QL, which is inspired by the work of [20]. This algorithm serves as a solution for identifying flipping kernels in large-scale BCNs. The core idea behind Algorithm 3 is to store action-values only for visited states, instead of all x_{t}\in\mathcal{B}^{n}, to reduce memory consumption. To implement this idea, we make adjustments to step 5 of Algorithm 1 and introduce additional steps 10-12 in Algorithm 3. Subsequently, we offer detailed insights into the effectiveness of Algorithm 3 and its applicability.

\mathbf{Algorithm\ 3} Finding flipping kernels for reachability of system (3) using small memory iterative QL
Input: \mathcal{M}_{0}, \mathcal{M}_{d}, A, \alpha_{t}\in(0,1], \gamma\in(0,1), \epsilon\in[0,1], N, T_{max}
Output: Flipping kernels
1:  K=|A|, k=0, n=0
2:  while k\leq K do
3:     for i=1,2,\dots,C_{|A|}^{k} do
4:        B=B_{k_{i}}
5:        Initialize Q(x_{t},a_{t})\leftarrow 0,\ \forall x_{t}\in\mathcal{M}_{0},\ \forall a_{t}\in\mathcal{B}^{m+|B|}
6:        for ep=0,1,\dots,N-1 do
7:           x_{0}\leftarrow\operatorname{rand}(\mathcal{M}_{0})
8:           for t=0,1,\dots,T_{max}-1 and x_{t}\notin\mathcal{M}_{d} do
9:              a_{t}\leftarrow\arg\max_{a\in\mathcal{B}^{m+|B|}}Q(x_{t},a) with probability P=1-\epsilon; a_{t}\leftarrow\operatorname{rand}(\mathcal{B}^{m+|B|}) with probability P=\epsilon
10:              if Q(x_{t+1},\cdot) is not in the Q-table then
11:                 Add Q(x_{t+1},\cdot)=0 to the Q-table
12:              end if
13:              Q(x_{t},a_{t})\leftarrow\alpha_{t}\big(r_{t+1}+\gamma\max_{a_{t+1}\in\mathcal{B}^{m+|B|}}Q(x_{t+1},a_{t+1})\big)+(1-\alpha_{t})Q(x_{t},a_{t})
14:           end for
15:           if \max_{a_{t}\in\mathcal{B}^{m+|B|}}Q(x_{t},a_{t})>0,\ \forall x_{t}\in\mathcal{M}_{0} then
16:              B_{n}^{*}=B, K=|B|, n=n+1
17:              Break
18:           end if
19:        end for
20:     end for
21:     k=k+1
22:  end while
23:  Present the result following steps 20-24 of Algorithm 1
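A minimal sketch of the lazily grown Q-table behind steps 10-12 of Algorithm 3; the class and method names are illustrative assumptions, not part of the paper.

class SmallMemoryQTable:
    # A row of zeros is added only when a state is visited for the first
    # time, so the table holds at most |V ∪ M0| × 2^(m+|B|) entries rather
    # than 2^(n+m+|B|).
    def __init__(self, actions):
        self.actions = actions
        self.rows = {}                       # state -> {action: value}

    def row(self, x):
        if x not in self.rows:               # steps 10-11: add Q(x, .) = 0
            self.rows[x] = {a: 0.0 for a in self.actions}
        return self.rows[x]

    def get(self, x, a):
        return self.row(x)[a]

    def set(self, x, a, value):
        self.row(x)[a] = value

    def memory_entries(self):
        return len(self.rows) * len(self.actions)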

Since Algorithm 3 only stores action-values for visited states, it reduces the number of required action-values. Specifically, define V=\{x\mid x is reachable from \mathcal{M}_{0}, x\in\mathcal{B}^{n}\}. Then, |V\cup\mathcal{M}_{0}| is the total number of states that can be visited, where \cup represents the union of sets. Therefore, the number of action-values required by Algorithm 3 is |V\cup\mathcal{M}_{0}|\times 2^{m+|B|}, which does not exceed 2^{n+m+|B|}. Next, we present Theorem 3 to provide an estimation of |V|. To better understand the theorem, we first introduce Definition 2.

\mathbf{Definition\ 2} [4]. The in-degree of x\in\mathcal{B}^{n} is defined as the number of states which can reach x in one step under \overline{a_{t}}=\big(u(t),\eta_{A(t)}\big) with |\eta_{A(t)}|_{\infty}=0.

\mathbf{Theorem\ 3}. Define I=\{x\mid the in-degree of x is greater than 0, x\in\mathcal{B}^{n}\}. Then, it follows that |V|\leq|I|.

\mathbf{Proof}. The state transition process can be divided into two steps. For any x^{\prime}\in\mathcal{B}^{n}, according to \eta_{A(t)}, it is first flipped into \eta_{A(t)}(x^{\prime})\in\mathcal{B}^{n}. Second, based on u(t), namely \overline{a_{t}}, \eta_{A(t)}(x^{\prime}) transfers to x\in\mathcal{B}^{n}. According to the second step of the state transition process and the definition of I, we have x\in I. Due to the arbitrariness of x^{\prime}, it follows that \{x\mid x is reachable from \mathcal{B}^{n} in one step, x\in\mathcal{B}^{n}\}\subset I. Since \mathcal{B}^{n} contains all states, the condition ``in one step" can be removed. Define N=\{x\mid x is reachable from \mathcal{B}^{n}, x\in\mathcal{B}^{n}\}. So, we have N\subset I. Notice that the only difference between the sets N and V is the domain of the initial states. Since \mathcal{M}_{0}\subset\mathcal{B}^{n}, it follows that V\subset N\subset I. Therefore, it can be concluded that |V|\leq|I|. \blacksquare

Algorithm 3 reduces the number of entries stored in the Q-table from 2^{n+m+|B|} to |V\cup\mathcal{M}_{0}|\times 2^{m+|B|}, where |V|\leq|I|. However, we acknowledge that when |V\cup\mathcal{M}_{0}|\times 2^{m+|B|} becomes excessively large, QL still faces limitations. It is important to note that Algorithm 3 offers a potential solution for addressing problems in large-scale BCNs, as opposed to the conventional QL approach, which is often deemed inapplicable. In fact, the value of |V\cup\mathcal{M}_{0}| is not directly related to n. Specifically, large-scale BCNs with low connectivity will result in a small |V\cup\mathcal{M}_{0}|. Traditional QL preallocates 2^{n+m+|B|} values in the Q-table, which not only restricts the ability to handle large-scale systems but also lacks practicality. In contrast, Algorithm 3 allows the agent to explore and determine the significant information that needs to be recorded. This approach provides an opportunity to solve the reachability problems of large-scale BCNs.

\mathbf{Remark\ 3}. The ideas of utilizing special initial states and TL in Algorithm 2 can also be incorporated into Algorithm 3. The resulting hybrid algorithm has two major advantages: improved convergence efficiency and applicability to large-scale BCNs.

V Minimum Flipping Actions for Reachability

In this section, we aim to solve Problem 2 (5). First, we propose a reward setting in which the highest return is achieved only when the policy satisfies the goal. Then, we propose two algorithms for obtaining the optimal policies: QL for small-scale BCNs, and small memory QL for large-scale ones.

V-A Markov Decision Process for Minimizing Flipping Actions

Since QL will be used to find \pi^{*}, we first cast the problem into the framework of a Markov decision process. Let the Markov decision process be the quintuple (\mathcal{B}^{n},\mathcal{B}^{m+|B^{*}|},\gamma,\mathbf{P},\mathbf{R}). To guide the agent in achieving the goal with minimal flipping actions, we define r_{t} as follows:

r_{t}(x_{t},a_{t})=\begin{cases}-w\times n_{t},&x_{t}\in\mathcal{M}_{d},\\ -w\times n_{t}-1,&\text{else},\end{cases} (15)

where w>0 represents the weight. The reward r_{t} consists of two parts. The first component is ``-1", which is assigned when the agent has not yet reached \mathcal{M}_{d}. This negative feedback encourages the agent to reach the goal as soon as possible. The second component is ``-w\times n_{t}", which incentivizes the agent to use as few flipping actions as possible to reach the goal. Since we aim to minimize the cumulative number of flipping actions, we set \gamma=1, indicating that future rewards are not discounted in importance.
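A minimal sketch of reward (15), assuming the flipping action is represented as the set of flipped nodes so that n_t is its cardinality; the function name and signature are illustrative.

def reward(x_t, flip_action, Md, w):
    # Reward (15): n_t is the number of flipped nodes; an extra -1 penalty
    # is incurred whenever the target set Md has not yet been reached.
    n_t = len(flip_action)
    return -w * n_t if x_t in Md else -w * n_t - 1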

\mathbf{Remark\ 4}. The following analysis is based on the assumption that system (3) from any state in \mathcal{M}_{0} is reachable to state set \mathcal{M}_{d}. The validity of this assumption can be verified using Theorem 1.

The selection of w plays a crucial role in effectively conveying the goal (5) to the agent. If w is too small, the objective of achieving reachability as soon as possible will overwhelm the objective of minimizing flipping actions. To illustrate this issue more clearly, let us consider an example.

\mathbf{Example\ 1}. Without loss of generality, we assume that there is only one x_{0}\in\mathcal{M}_{0} and one x_{d}\in\mathcal{M}_{d}. Suppose that there exist two policies \pi_{1} and \pi_{2}. When we start from x_{0} and follow \pi_{1}, we obtain the trajectory x_{0}\rightarrow x_{1}\rightarrow x_{2}\rightarrow x_{3}\rightarrow x_{d} without any flipping action. If we follow \pi_{2}, the trajectory x_{0}\rightarrow x_{d} with 2 flipping actions is obtained. It can be calculated that v_{\pi_{1}}(x_{0})=-4, since the agent takes 4 steps to achieve reachability without flipping actions. Meanwhile, v_{\pi_{2}}(x_{0})=-1-2w, since the agent takes 1 step to achieve reachability with 2 flipping actions. If we set w=1, then it follows that v_{\pi_{1}}(x_{0})<v_{\pi_{2}}(x_{0}). Therefore, the obtained optimal policy \pi_{2} requires more flipping actions to achieve reachability, which does not contribute to finding \pi^{*} for equation (5).

Furthermore, although a high value of the parameter w assists in reducing flipping actions for reachability, it can present challenges in reinforcement learning. Convergence of these algorithms relies on a finite reward variance, as specified in the second condition of Lemma 1. As w approaches infinity, this criterion cannot be satisfied. Additionally, an excessively large value of w may increase reward variance, resulting in greater variability in estimated state values and subsequently slowing algorithm convergence [32]. Hence, establishing a suitable lower bound for w becomes crucial to facilitate algorithm convergence while ensuring goal attainment.

So, what is an appropriate value of w to achieve the goal (5)? To address this question, sufficient conditions that ensure the optimality of the policy are proposed in Theorem 4 and Corollary 1. The main idea is to select a value of w such that v_{\pi^{*}}(x_{0})>v_{\pi}(x_{0}) for all \pi\in\Pi with \pi\neq\pi^{*}, where \pi^{*} satisfies the goal (5).

\mathbf{Theorem\ 4}. For system (3), if w>l, where l is the maximum number of steps required from any initial state x_{0}\in\mathcal{M}_{0} to reach \mathcal{M}_{d} without cycles, then the optimal policy obtained based on (15) satisfies the goal (5).

\mathbf{Proof}. Without loss of generality, it is assumed that there is only one x_{0}\in\mathcal{M}_{0}. We classify all \pi\in\Pi and prove that, under the condition w>l, the policy with the highest state-value satisfies the goal (5). All \pi\in\Pi can be divided into two cases:

  • Case 1. Following \pi_{1}, system (3) from state x_{0} is not reachable to state set \mathcal{M}_{d}.

  • Case 2. Following \pi_{2}, system (3) from state x_{0} is reachable to state set \mathcal{M}_{d}.

For Case 1, v_{\pi_{1}}(x_{0})=\sum_{t=0}^{\infty}(-1)=-\infty. The value of v_{\pi_{1}}(x_{0}) is insufficient for \pi_{1} to be optimal. For Case 2, we divide it into two sub-cases:

  • Case 2.1. Following \pi_{2.1}, system (3) from state x_{0} is reachable to state set \mathcal{M}_{d} with fewer flipping actions.

  • Case 2.2. Following \pi_{2.2}, system (3) from state x_{0} is reachable to state set \mathcal{M}_{d} with more flipping actions.

Next, we compare the magnitudes of v_{\pi_{2.1}} and v_{\pi_{2.2}}. Assume that it takes T_{\pi_{2.1}} steps with \sum n_{\pi_{2.1}} flipping actions, and T_{\pi_{2.2}} steps with \sum n_{\pi_{2.2}} flipping actions, for system (3) from state x_{0} to reach state set \mathcal{M}_{d} under \pi_{2.1} and \pi_{2.2}, respectively. Then, we can obtain v_{\pi_{2.1}}=-w\sum n_{\pi_{2.1}}-T_{\pi_{2.1}} and v_{\pi_{2.2}}=-w\sum n_{\pi_{2.2}}-T_{\pi_{2.2}}. It is calculated that

v_{\pi_{2.1}}-v_{\pi_{2.2}}=w\sum n_{\pi_{2.2}}+T_{\pi_{2.2}}-w\sum n_{\pi_{2.1}}-T_{\pi_{2.1}}. (16)

Since more flipping actions are taken under \pi_{2.2} than under \pi_{2.1} to achieve reachability, one has \sum n_{\pi_{2.2}}-\sum n_{\pi_{2.1}}>0.

Regarding the relationship between Tπ2.1T_{\pi_{2.1}} and Tπ2.2T_{\pi_{2.2}}, there are two categories:

  • Category 1. Tπ2.1Tπ2.2T_{\pi_{2.1}}\leq T_{\pi_{2.2}}.

  • Category 2. Tπ2.1>Tπ2.2T_{\pi_{2.1}}>T_{\pi_{2.2}}.

For Category 1, it's obvious that vπ2.1>vπ2.2v_{\pi_{2.1}}>v_{\pi_{2.2}}. For Category 2, we divide π2.1\pi_{2.1} into two cases.

  • Case 2.1.1. Following π2.1.1\pi_{2.1.1}, system (3) from state x0x_{0} can reach state set d\mathcal{M}_{d} in Tπ2.1.1lT_{\pi_{2.1.1}}\leq l steps with nπ2.1.1<nπ2.2\sum n_{\pi_{2.1.1}}<\sum n_{\pi_{2.2}} flipping actions, where ll is the maximum length of the trajectory from x0x_{0} to d\mathcal{M}_{d} without cycle.

  • Case 2.1.2. Following π2.1.2\pi_{2.1.2}, system (3) from state x0x_{0} can reach state set d\mathcal{M}_{d} in Tπ2.1.2>lT_{\pi_{2.1.2}}>l steps with nπ2.1.2<nπ2.2\sum n_{\pi_{2.1.2}}<\sum n_{\pi_{2.2}} flipping actions.

Next, we prove vπ2.1.1>vπ2.2v_{\pi_{2.1.1}}>v_{\pi_{2.2}}. For Case 2.1.1, we have the following inequalities

{Tπ2.2Tπ2.1.1Tminl1l,nπ2.2nπ2.1.11,\left\{\begin{aligned} &T_{\pi_{2.2}}-T_{\pi_{2.1.1}}\geq T_{min}-l\geq 1-l,\\ &\sum n_{\pi_{2.2}}-\sum n_{\pi_{2.1.1}}\geq 1,\\ \end{aligned}\right. (17)

where $T_{\min}$ is the minimum length of a trajectory from $x_{0}$ to $\mathcal{M}_{d}$. Substituting (17) into (16), we have $v_{\pi_{2.1.1}}-v_{\pi_{2.2}}\geq 1-l+w$. Since $w>l$, it follows that $v_{\pi_{2.1.1}}-v_{\pi_{2.2}}>0$.

In the following, we discuss Case 2.1.2. Note that $T_{\pi_{2.1.2}}>l$ implies that at least one $x_{t}\in\mathcal{B}^{n}\backslash\mathcal{M}_{d}$ is visited repeatedly, according to Lemma 2. If we eliminate the cycle from the trajectory, its length becomes no greater than $l$ and it contains no more flipping actions. This indicates that there exists $\pi_{2.1.2}^{*}$ such that $T_{\pi_{2.1.2}^{*}}\leq l$ and $\sum n_{\pi_{2.1.2}^{*}}\leq\sum n_{\pi_{2.1.2}}<\sum n_{\pi_{2.2}}$. Thus, $\pi_{2.1.2}^{*}$ is consistent with Case 2.1.1. Based on the proof in the previous paragraph, we have $v_{\pi_{2.1.2}^{*}}(x_{0})>v_{\pi_{2.2}}(x_{0})$. Also, it is easy to see that $v_{\pi_{2.1.2}^{*}}(x_{0})>v_{\pi_{2.1.2}}(x_{0})$. Therefore, $\pi_{2.1.2}^{*}$ yields a higher return than both $\pi_{2.2}$ and $\pi_{2.1.2}$.

After analyzing the various cases, we deduce that when $w>l$ holds, one of the following two conditions must be satisfied: 1) $v_{\pi_{2.1}}(x_{0})>v_{\pi_{2.2}}(x_{0})$; 2) there exists an alternative policy $\pi^{*}_{2.1.2}$ that achieves reachability with fewer flipping actions compared to $\pi_{2.1}$, satisfying $v_{\pi^{*}_{2.1.2}}(x_{0})>v_{\pi_{2.2}}(x_{0})$ and $v_{\pi^{*}_{2.1.2}}(x_{0})>v_{\pi_{2.1}}(x_{0})$. Namely, the $\pi$ yielding the highest $v_{\pi}(x_{0})$ satisfies the goal (5). $\blacksquare$

In Theorem 4, knowledge of $l$ is necessary. If $l$ is unknown, the bound $l\leq 2^{n}-|\mathcal{M}_{d}|$ from Lemma 2 yields the following corollary.

$\mathbf{Corollary\ 1}$. If $w>2^{n}-|\mathcal{M}_{d}|$, then the optimal policy obtained according to (15) satisfies the goal (5).

$\mathbf{Remark\ 5}$. Two points should be clarified for Theorem 4 and Corollary 1. First, following Theorem 4 or Corollary 1, we do not obtain all $\pi^{*}$ that satisfy the goal (5), but rather the $\pi^{*}$ that satisfies the goal (5) with the shortest time to achieve reachability. Second, Theorem 4 and Corollary 1 provide sufficient but not necessary conditions for the optimality of the obtained policy, since (17) relies on inequality scaling.
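To make the role of $w$ concrete, the following Python sketch (with hypothetical trajectory data; the per-step reward is taken as $-1-w\,n_{t}$, consistent with the state-values used in the proof of Theorem 4) compares the returns of two candidate policies. With $w>l$, the policy using fewer flipping actions attains the larger state-value even if its cycle-free trajectory is longer.

```python
# Minimal sketch: compare returns of two hypothetical policies under the
# reward scheme analyzed in Theorem 4 (per step: -1, plus -w per flipped node).

def state_value(total_flips, steps, w):
    """Undiscounted return v_pi(x0) = -w * total_flips - steps."""
    return -w * total_flips - steps

l = 7          # hypothetical maximum cycle-free trajectory length from x0 to M_d
w = l + 1      # Theorem 4 requires w > l

# Policy A: fewer flipping actions, but a longer (cycle-free) trajectory.
v_A = state_value(total_flips=1, steps=l, w=w)
# Policy B: more flipping actions, shortest possible trajectory.
v_B = state_value(total_flips=2, steps=1, w=w)

print(v_A, v_B)   # -15 and -17: the policy with fewer flips has the larger value
assert v_A > v_B
```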

Now, we can achieve the goal (5) under the reward setting (15). Next, we present $Q$L for finding the optimal policy $\pi^{*}$.

V-B Q-Learning for Finding Minimum Flipping Actions

In this subsection, we propose Algorithm 4 for finding minimum flipping actions for the reachability of BCNs under state-flipped control.

$\mathbf{Algorithm\ 4}$ Finding minimum flipping actions for reachability of BCNs under state-flipped control using $Q$L
Require: $\mathcal{M}_{0}$, $\mathcal{M}_{d}$, $B^{*}$, $\alpha_{t}\in(0,1]$, $\gamma=1$, $\epsilon\in[0,1)$, $N$, $T_{max}$
Ensure: $\pi^{*}$
1: Initialize $Q(x_{t},a_{t})\leftarrow 0$, $\forall x_{t}\in\mathcal{B}^{n}$, $\forall a_{t}\in\mathcal{B}^{m+|B^{*}|}$
2: for $ep=0,1,\dots,N-1$ do
3:   $x_{0}\leftarrow\operatorname{rand}(\mathcal{M}_{0})$
4:   for $t=0,1,\dots,T_{max}-1$ and $x_{t}\notin\mathcal{M}_{d}$ do
5:     $a_{t}\leftarrow\left\{\begin{aligned}&\arg\max_{a}Q(x_{t},a),&&P=1-\epsilon\\&\operatorname{rand}(\mathcal{B}^{m+|B^{*}|}),&&P=\epsilon\end{aligned}\right.$
6:     $Q(x_{t},a_{t})\leftarrow\alpha_{t}\big(r_{t+1}+\gamma\max_{a}Q(x_{t+1},a)\big)+(1-\alpha_{t})Q(x_{t},a_{t})$
7:   end for
8: end for
9: return $\pi^{*}(x_{t})\leftarrow\arg\max_{a}Q(x_{t},a)$, $\forall x_{t}\in\mathbf{X}$

Next, we discuss the requirements on the parameters. In terms of $T_{max}$, it should be larger than the maximum length of a trajectory from $x_{0}$ to $\mathcal{M}_{d}$ without any cycle; otherwise, the agent may not be able to reach a terminal state in $\mathcal{M}_{d}$ before the episode ends, which violates condition 4) of Lemma 1. Furthermore, $\epsilon>0$ should hold to ensure exploration by the agent. Hence, the policies during learning are non-deterministic, so every policy can lead to a state $x\in\mathcal{M}_{d}$, which satisfies condition 4) of Lemma 1.
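For illustration, a minimal Python sketch of Algorithm 4 is given below. The interface is hypothetical: `env_step(x, a)` stands for one model-free interaction with system (3) and is assumed to return the next state and the reward of (15); the learning-rate and $\epsilon$-greedy schedules are simplified to constants, and the $Q$-table is stored as a dictionary for brevity.

```python
import random
from collections import defaultdict

def q_learning_min_flips(M0, Md, actions, env_step, N, T_max,
                         alpha=0.1, gamma=1.0, epsilon=0.1):
    """Sketch of Algorithm 4: tabular Q-learning with reward (15) and gamma = 1."""
    Q = defaultdict(float)                       # Q[(x, a)], initialized to 0
    for _ in range(N):                           # episodes
        x = random.choice(list(M0))              # x0 <- rand(M0)
        for _ in range(T_max):
            if x in Md:                          # stop once the target set is reached
                break
            if random.random() < epsilon:        # epsilon-greedy action selection
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(x, b)])
            x_next, r = env_step(x, a)           # model-free transition and reward
            best_next = max(Q[(x_next, b)] for b in actions)
            Q[(x, a)] += alpha * (r + gamma * best_next - Q[(x, a)])
            x = x_next
    # greedy policy extracted from the learned action-values
    return {x: max(actions, key=lambda b: Q[(x, b)]) for (x, _) in Q}
```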

Algorithm 4 is capable of finding $\pi^{*}$ for small-scale BCNs under state-flipped control. However, it becomes inapplicable when $n$ and $m+|B^{*}|$ are too large, since the $Q$-table is then too large to be stored in computer memory. This motivates us to utilize small memory $Q$L in Section IV-D.

V-C Fast Small Memory Q-Learning for Finding Minimum Flipping Actions of Large-scale BCNs

In this subsection, we propose Algorithm 5, called fast small memory $Q$L, for finding the minimum flipping actions required to achieve reachability in large-scale BCNs under state-flipped control. Algorithm 5 differs from Algorithm 4 in two aspects.

$\mathbf{Algorithm\ 5}$ Finding minimum flipping actions for reachability of large-scale BCNs under state-flipped control using fast small memory $Q$L
Require: $\mathcal{M}_{0}$, $\mathcal{M}_{d}$, $B^{*}$, $\alpha_{t}\in(0,1]$, $\gamma=1$, $\epsilon\in[0,1)$, $N$, $T_{max}$, $w$, $\Delta w>0$
Ensure: $\pi^{*}$
1: Initialize $Q(x_{t},a_{t})\leftarrow 0$, $\forall x_{t}\in\mathcal{M}_{0}$, $\forall a_{t}\in\mathcal{B}^{m+|B^{*}|}$
2: for $ep=0,1,\dots,N-1$ do
3:   $x_{0}\leftarrow\operatorname{rand}(\mathcal{M}_{0})$
4:   if $w\leq\big|\{x_{t}\,|\,Q(x_{t},\cdot)\ \text{is in the}\ Q\text{-table}\}\big|$ then
5:     $w\leftarrow w+\Delta w$
6:   end if
7:   for $t=0,1,\dots,T_{max}-1$ and $x_{t}\notin\mathcal{M}_{d}$ do
8:     $a_{t}\leftarrow\left\{\begin{aligned}&\arg\max_{a}Q(x_{t},a),&&P=1-\epsilon\\&\operatorname{rand}(\mathcal{B}^{m+|B^{*}|}),&&P=\epsilon\end{aligned}\right.$
9:     if $Q(x_{t+1},\cdot)$ is not in the $Q$-table then
10:      Add $Q(x_{t+1},\cdot)=0$ to the $Q$-table
11:    end if
12:    $Q(x_{t},a_{t})\leftarrow\alpha_{t}\big(r_{t+1}+\gamma\max_{a}Q(x_{t+1},a)\big)+(1-\alpha_{t})Q(x_{t},a_{t})$
13:  end for
14: end for
15: return $\pi^{*}(x_{t})\leftarrow\arg\max_{a}Q(x_{t},a)$, $\forall x_{t}\in\mathbf{X}$

Firstly, Algorithm 5 only records the action-values of visited states, as reflected in steps 1 and 9-11. This approach is similar to that employed in Algorithm 3. Secondly, in Algorithm 5, the parameter $w$, which affects $r_{t}$, changes from a constant to a variable that is dynamically adjusted according to the number of recorded states $|Q|=\big|\{x_{t}\,|\,Q(x_{t},\cdot)\ \text{is in the}\ Q\text{-table}\}\big|$, as shown in steps 4-6. As the agent explores more states, $|Q|$ increases and finally equals $|V\cup\mathcal{M}_{0}|$. Here, we do not simply set $w=2^{n}+1-|\mathcal{M}_{d}|$ based on Corollary 1, because for large-scale BCNs the value $2^{n}+1-|\mathcal{M}_{d}|$ is so large that the penalty for flipping actions outweighs the incentive to achieve reachability as soon as possible. This can confuse the agent. Specifically, at the beginning of the learning process, the agent receives a large negative reward when taking flipping actions, but a smaller negative reward when it fails to achieve reachability. The agent only realizes that taking flipping actions is worthwhile for achieving reachability once the accumulated negative feedback ``$-1$'' for not achieving reachability exceeds the penalty ``$-w\times n_{t}$'' for flipping $n_{t}$ nodes. The larger $w$ is, the longer this process takes. To address this issue, we introduce the adaptive scaling of $w$ to accelerate learning. The key idea behind this scaling is to leverage the knowledge acquired by the agent while ensuring that the conditions of Theorem 4 are met, thereby guaranteeing the optimality of the policy. We provide further justification of this approach in Theorem 5.
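The two modifications can be sketched in Python as follows (same hypothetical interface as before, except that `env_step(x, a)` is assumed to return the next state together with the number of flipped nodes $n_{t}$, so that the reward $-1-w\,n_{t}$ can be recomputed with the current value of $w$).

```python
import random

def small_memory_q_learning(M0, Md, actions, env_step, N, T_max,
                            w0, delta_w, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Sketch of Algorithm 5: Q-values only for visited states, adaptive w."""
    Q = {}                                   # Q[x] is a dict of action-values
    w = w0
    for _ in range(N):
        x = random.choice(list(M0))
        Q.setdefault(x, {a: 0.0 for a in actions})
        if w <= len(Q):                      # steps 4-6: keep w > |Q|
            w += delta_w
        for _ in range(T_max):
            if x in Md:
                break
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(Q[x], key=Q[x].get)
            # hypothetical interface: next state and number of flipped nodes
            x_next, n_flips = env_step(x, a)
            Q.setdefault(x_next, {b: 0.0 for b in actions})   # steps 9-11
            r = -1.0 - w * n_flips           # per-step reward, cf. Theorem 4
            best_next = max(Q[x_next].values())
            Q[x][a] += alpha * (r + gamma * best_next - Q[x][a])
            x = x_next
    return {x: max(Q[x], key=Q[x].get) for x in Q}
```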

$\mathbf{Theorem\ 5}$. For a BCN under state-flipped control (3), if $w>|Q|$ holds, then the optimal policy obtained according to (15) satisfies the goal (5).

$\mathbf{Proof}$. First, we prove that $\mathcal{M}_{d}$ is reachable from all $x_{0}\in\mathcal{M}_{0}$ only if there exists $l\leq|V|$ such that $\mathcal{M}_{d}$ is $l$-step reachable from all $x_{0}\in\mathcal{M}_{0}$. Assume that $\mathcal{M}_{d}$ is reachable from $\mathcal{M}_{0}$. According to Definition 1, for any initial state $x_{0}\in\mathcal{M}_{0}$, there exists a trajectory $x_{0}\rightarrow x_{1}\rightarrow\dots\rightarrow x_{d}$, where $x_{d}\in\mathcal{M}_{d}$. According to the definition of $V$, each state in this trajectory is in $V$. Similar to the proof of Lemma 2, for any initial state $x_{0}\in\mathcal{M}_{0}$, there exists a trajectory from $x_{0}$ to $\mathcal{M}_{d}$ whose length $l$ satisfies $l\leq|V|$.

Next, we prove that $w>l$ holds after sufficient exploration by the agent. Since $|Q|=|V\cup\mathcal{M}_{0}|$ and $w>|Q|$, we have $w>|V\cup\mathcal{M}_{0}|$. From the previous paragraph we know that $|V|\geq l$; therefore, $w>l$. Applying Theorem 4, the optimality of the policy is guaranteed. $\blacksquare$

Thus, Algorithm 5, with the introduction of the variable $w$, still satisfies the goal defined by (5); furthermore, it speeds up the learning process. Regarding the parameter settings of Algorithm 5, the initial $w$ can be set according to $|Q|=\big|\{x_{t}\,|\,Q(x_{t},\cdot)\ \text{is in the}\ Q\text{-table}\}\big|$, where the $Q$-table is the one obtained from Algorithm 3.

In summary, Algorithm 5 reduces the number of values to be stored from $2^{n+m+|B^{*}|}$ to $|V\cup\mathcal{M}_{0}|\times 2^{m+|B^{*}|}$ by recording only visited states. This approach is especially effective for large-scale BCN problems. Meanwhile, to accelerate the learning process, $w$ is set as a variable that converges to a value larger than $|V\cup\mathcal{M}_{0}|$.
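As a rough numerical illustration of this reduction, with hypothetical values of $n$, $m$, $|B^{*}|$, and $|V\cup\mathcal{M}_{0}|$:

```python
# Hypothetical sizes for illustration only: n nodes, m control inputs,
# |B*| flip-kernel nodes, and an assumed size of the explored set V ∪ M0.
n, m, kernel_size = 27, 1, 3
explored_states = 5_000                                    # assumed |V ∪ M0|

full_table = 2 ** (n + m + kernel_size)                    # Algorithm 4 storage
small_table = explored_states * 2 ** (m + kernel_size)     # Algorithm 5 storage

print(full_table, small_table)   # 2_147_483_648 vs 80_000 action-values
```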

VI Computational Complexity

The time complexities of Algorithms 1-5 are dominated by selecting the optimal value from a pool of up to $2^{m+|B^{*}|}$ action-values per step, which costs $O(2^{m+|B^{*}|})$. For Algorithms 1 and 3, and for Algorithm 2, the total numbers of steps are $\mathrm{iter}^{1,3}=\sum_{k=0}^{|B^{*}|-1}C_{|A|}^{k}NT_{max}+\sum_{k=0}^{C_{|A|}^{|B^{*}|}}N_{k}^{1,3}T_{max}$ and $\mathrm{iter}^{2}=\sum_{k=0}^{|B^{*}|-1}C_{|A|}^{k}NT_{max}+\sum_{k=0}^{C_{|A|}^{|B^{*}|}}N_{k}^{2}T_{max}$, respectively, which reflect the iterations spent before and after the flipping kernel is found. Early termination based on the reachability conditions in Algorithms 1-3 reduces the number of episodes for the $k^{\text{th}}$ flip set, so that $N_{k}^{1,3}\leq N$ and $N_{k}^{2}\leq N$, while the higher efficiency of Algorithm 2 ensures $N_{k}^{2}\leq N_{k}^{1,3}$. Algorithms 4 and 5 require $NT_{max}$ steps in total. Refer to Table I for a comprehensive overview of the time complexities of each algorithm.

Concerning space complexity, Algorithms 1 and 4 store all action-values in a $Q$-table, while Algorithm 2 additionally maintains tables for subsets of the current flip set. In contrast, Algorithms 3 and 5 retain action-values only for visited states. Details on the space complexities of each algorithm can be found in Table I.

TABLE I: Time and space complexity of Algorithms 1 to 5
Algorithm | Time complexity | Space complexity
1 | $O(2^{m+|B^{*}|}\,\mathrm{iter}^{1,3})$ | $O(2^{n+m+|B^{*}|})$
2 | $O(2^{m+|B^{*}|}\,\mathrm{iter}^{2})$ | $O(2^{n+m+|B^{*}|}(C_{|A|}^{|B^{*}|-1}+1))$
3 | $O(2^{m+|B^{*}|}\,\mathrm{iter}^{1,3})$ | $O(2^{m+|B^{*}|}\,|V\cup\mathcal{M}_{0}|)$
4 | $O(2^{m+|B^{*}|}\,NT_{max})$ | $O(2^{n+m+|B^{*}|})$
5 | $O(2^{m+|B^{*}|}\,NT_{max})$ | $O(2^{m+|B^{*}|}\,|V\cup\mathcal{M}_{0}|)$

VII Simulation

In this section, the performance of the proposed algorithms is demonstrated. Two examples are given: a small-scale BCN with 3 nodes and a large-scale one with 27 nodes. For each example, we show the convergence efficiency of the different algorithms for checking reachability, as well as the policy $\pi^{*}$ with minimal flipping actions.

VII-A A Small-scale BCN

$\mathbf{Example\ 2}$. We consider a small-scale BCN with the combinational flip set $A=\{1,2,3\}$, the dynamics of which are given as follows:

$\left\{\begin{aligned}
x_{1}(t+1)=&\ x_{1}(t)\wedge(x_{2}(t)\vee x_{3}(t))\vee\neg x_{1}(t)\wedge(x_{2}(t)\oplus x_{3}(t)),\\
x_{2}(t+1)=&\ x_{1}(t)\vee\neg x_{1}(t)\wedge(x_{2}(t)\vee x_{3}(t)),\\
x_{3}(t+1)=&\ \neg(x_{1}(t)\wedge x_{2}(t)\wedge x_{3}(t)\wedge u(t))\wedge\big(x_{3}(t)\vee(x_{1}(t)\vee\neg(x_{2}(t)\wedge u(t)))\wedge(\neg x_{1}(t)\vee(x_{1}(t)\oplus x_{2}(t))\vee u(t))\big).
\end{aligned}\right.$ (18)

In terms of reachability, we set $\mathcal{M}_{d}=\{(0,0,1)\}$ and $\mathcal{M}_{0}=\mathcal{B}^{n}\setminus\mathcal{M}_{d}$.
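For reference, a direct Python transcription of the one-step update (18) might look as follows; flipping actions, which toggle selected components of the state, are omitted from this sketch.

```python
def step_18(x, u):
    """One step of system (18); x = (x1, x2, x3) and u are Booleans (0/1)."""
    x1, x2, x3 = x
    y1 = x1 and (x2 or x3) or (not x1) and (x2 != x3)       # x2 XOR x3
    y2 = x1 or (not x1) and (x2 or x3)
    y3 = (not (x1 and x2 and x3 and u)) and (
        x3 or (x1 or not (x2 and u)) and ((not x1) or (x1 != x2) or u)
    )
    return (int(y1), int(y2), int(y3))

# e.g. step_18((1, 1, 1), 1) == (1, 1, 0)
```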

To find the flipping kernels of system (18), Algorithms 1 and 2 are utilized. As mentioned in Section IV, Algorithm 2 has higher convergence efficiency than Algorithm 1, which is verified in our simulation. To evaluate the convergence efficiency, an index $r_{ep}$, named the reachable rate, is defined as follows:

$r_{ep}=\dfrac{nr_{ep}}{|\mathcal{M}_{0}|},$ (19)

where $nr_{ep}$ represents the number of states $x_{t}\in\mathcal{M}_{0}$ that have been found to be reachable to $\mathcal{M}_{d}$ by the end of the $ep^{\text{th}}$ episode. For Algorithms 1 and 2, the flipping kernels obtained are $\{1,2\}$ and $\{2,3\}$, respectively. The effectiveness of Algorithms 1 and 2 is shown in Fig. 1, where the reachable rates are averaged over 100 independent experiments. For $Q$L with TL, the initial reachable rate is higher since the agent utilizes prior knowledge. For $Q$L with special initial states, the reachable rate increases faster than with conventional $Q$L since the agent strategically selects the initial states. The best performance is achieved when both TL and special initial states are incorporated into $Q$L.
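The reachable rate (19) can be computed directly from the set of initial states that have been verified as reachable so far (a sketch; `found_reachable` is a hypothetical bookkeeping set maintained during training).

```python
def reachable_rate(found_reachable, M0):
    """Reachable rate (19): fraction of states in M0 verified to reach M_d."""
    return len(found_reachable & M0) / len(M0)
```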

Next, Algorithm 4 is employed to determine the minimum number of flipping actions required for reachability. The optimal policy obtained is shown in Fig. 2.

Figure 1: The convergence efficiency of Algorithms 1 and 2 for finding the flipping kernels of the small-scale system (18).

Figure 2: The optimal policy using the minimal flipping actions for reachability of system (18), obtained by Algorithm 4.

$\mathbf{Remark\ 6}$. We emphasize that the system model in this paper is given solely for illustrative purposes; the agent has no prior knowledge of the model.

$\mathbf{Remark\ 7}$. The flipping kernels obtained through Algorithms 1 and 2 have been compared with those obtained using the model-based method proposed in [3]. Additionally, the policies depicted in Figs. 2 and 4 have been compared with other policies derived through enumeration. These comparisons verify that the obtained flipping kernels and policies are indeed optimal.

$\mathbf{Remark\ 8}$. We utilize the method proposed in [3] solely to validate the optimality of the flipping kernel, without directly comparing computational complexity or convergence speed. This is because the semi-tensor product method introduced in [3] requires prior knowledge of the system model, whereas our approach is model-free; a direct comparison between the two methods would therefore be unfair.

VII-B A Large-scale BCN

$\mathbf{Example\ 3}$. We consider a large-scale BCN with the combinational flip set $A=\{1,2,3,4,5,6\}$, the dynamics of which are given as follows:

$\left\{\begin{aligned}
x_{1}(t+1)=&\ x_{1}(t)\wedge(x_{2}(t)\vee x_{3}(t))\vee\neg x_{1}(t)\wedge(x_{2}(t)\oplus x_{3}(t)),\\
x_{2}(t+1)=&\ x_{1}(t)\vee\neg x_{1}(t)\wedge(x_{2}(t)\vee x_{3}(t)),\\
x_{3}(t+1)=&\ \neg(x_{1}(t)\wedge x_{2}(t)\wedge x_{3}(t)\wedge u(t))\wedge\big(x_{3}(t)\vee(x_{1}(t)\vee\neg(x_{2}(t)\wedge u(t)))\wedge(\neg x_{1}(t)\vee(x_{1}(t)\oplus x_{2}(t))\vee u(t))\big),\\
x_{3i-2}(t+1)=&\ x_{3i-2}(t)\wedge(x_{3i-1}(t)\vee x_{3i}(t))\vee\neg x_{3i-2}(t)\wedge(x_{3i-1}(t)\oplus x_{3i}(t)),\\
x_{3i-1}(t+1)=&\ x_{3i-2}(t)\vee\neg x_{3i-2}(t)\wedge(x_{3i-1}(t)\vee x_{3i}(t)),\\
x_{3i}(t+1)=&\ x_{3i}(t)\vee(\neg x_{3i-2}(t)\vee(x_{3i-2}(t)\oplus x_{3i-1}(t))),
\end{aligned}\right.$ (20)

where $i=2,3,\dots,9$. In terms of reachability, we set $\mathcal{M}_{d}=\{(0,0,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)\}$ and $\mathcal{M}_{0}=\{(a,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1)\ |\ a\in\{(0,0,0),(0,1,0),(0,1,1),(1,0,0),(1,0,1),(1,1,0),(1,1,1)\}\}$.

Figure 3: The convergence efficiency of Algorithm 3, and of Algorithm 3 with TL and special initial states, for finding the flipping kernels of the large-scale system (20).

$\mathbf{Remark\ 9}$. System (20) consists of 9 small-scale BCNs. In Fig. 3, it can be observed that the number of flipping actions and time steps required for system (20) to reach the state set $\mathcal{M}_{d}$ is relatively small. These two phenomena are due to our intention of selecting an example whose optimal policy is easily verifiable by the enumeration method. However, it should be noted that the proposed algorithms are applicable to all large-scale BCNs, since the agent has no prior knowledge of the system model.
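Because of this modular structure, the one-step map of (20) can be assembled from the controlled block (18) and eight copies of the autonomous 3-node block, e.g. in Python (a sketch reusing `step_18` from the earlier sketch for Example 2):

```python
def step_block(x):
    """One step of the autonomous 3-node block used for i = 2, ..., 9 in (20)."""
    a, b, c = x
    y1 = a and (b or c) or (not a) and (b != c)
    y2 = a or (not a) and (b or c)
    y3 = c or ((not a) or (a != b))
    return (int(y1), int(y2), int(y3))

def step_20(x, u):
    """One step of the 27-node system (20); requires step_18 from Example 2."""
    head = step_18(x[:3], u)                 # nodes x1-x3, driven by input u
    tail = []
    for i in range(1, 9):                    # blocks i = 2, ..., 9
        tail.extend(step_block(x[3 * i:3 * i + 3]))
    return head + tuple(tail)
```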

Figure 4: The optimal policy using minimal flipping actions to achieve reachability for the large-scale system (20), obtained by Algorithm 5.

To obtain the flipping kernels of the large-scale system (20), Algorithm 3 is employed along with the integration of TL and special initial states. The convergence efficiency of the algorithms is demonstrated in Fig. 3. According to Fig. 3, the combination of small memory $Q$L with both TL and special initial states yields the best performance.

Subsequently, based on the flipping kernels $\{1,2,6\}$ and $\{2,3,6\}$ obtained from Algorithm 3, the optimal policy taking the minimal flipping actions to achieve reachability is obtained using Algorithm 5. The resulting policy is displayed in Fig. 4. Notably, in terms of the reward setting, if we set $w=2^{27}$ instead of using a variable adjusted according to $|Q|=\big|\{x_{t}\,|\,Q(x_{t},\cdot)\ \text{is in the}\ Q\text{-table}\}\big|$, the policy does not converge to the optimal one even with 10 times as many episodes. This demonstrates the effectiveness of Algorithm 5. In Fig. 4, states are represented in decimal form, where the state at time step $t$ is labelled $\sum_{i=1}^{27}2^{27-i}x_{i}(t)+1$.
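The decimal labels in Fig. 4 can be reproduced with a one-line conversion (sketch):

```python
def state_index(x):
    """Decimal label used in Fig. 4: sum_{i=1}^{27} 2^(27-i) * x_i(t) + 1."""
    return sum(2 ** (27 - i) * xi for i, xi in enumerate(x, start=1)) + 1

# e.g. the all-zero state maps to 1 and the all-one state to 2**27
```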

VII-C Details

The parameters are listed as follows.

  • In all algorithms, we set the learning rate $\alpha_{t}=\min\{1,\frac{1}{(\beta\,ep)^{\omega}}\}$, which satisfies condition 1) in Lemma 1 (a short sketch of these schedules follows Table II).

  • The greedy rate in all algorithms is specified as $1-\frac{0.99\,ep}{N}$, gradually decreasing from 1 to 0.01 as $ep$ increases.

  • We set $w=8$ for Algorithm 4 in Example 2 and $w=18$ with $\Delta w=20$ for Algorithm 5 in Example 3. These values of $w$ ensure the optimality of the obtained policy according to Corollary 1 and Theorem 4.

  • For Algorithms 1, 2, and 3, we set $\gamma=0.99$.

$N$, $T_{max}$, $\beta$, and $\omega$ are selected based on the complexity of the different examples and algorithms, as detailed in Table II.

TABLE II: Parameter settings
Example | Algorithm | $N$ | $T_{max}$ | $\beta$ | $\omega$
2 | 1, 2 | 100 | 10 | 1 | 0.6
3 | 3 | $10^{4}$ | $2^{27}$ | 1 | 0.6
2 | 4 | $3\times 10^{4}$ | 100 | 0.01 | 0.85
3 | 5 | $2\times 10^{5}$ | $2^{27}$ | 0.01 | 0.85
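As referenced above, the learning-rate and greedy-rate schedules can be implemented as simple functions of the episode index (a sketch; the guard for $ep=0$ is an implementation assumption to avoid division by zero):

```python
def learning_rate(ep, beta, omega):
    """Learning rate min{1, 1/(beta*ep)^omega}, cf. condition 1) of Lemma 1."""
    return 1.0 if ep == 0 else min(1.0, 1.0 / (beta * ep) ** omega)

def greedy_rate(ep, N):
    """Greedy rate 1 - 0.99*ep/N, decreasing from 1 towards 0.01."""
    return 1.0 - 0.99 * ep / N
```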

VIII Conclusion

This paper presents model-free reinforcement learning-based methods to obtain minimal state-flipped control for achieving reachability in BCNs, including large-scale ones. Two problems are addressed: 1) finding the flipping kernel, and 2) minimizing the number of flipping actions based on the obtained kernel.

For problem 1) with small-scale BCNs, fast iterative $Q$L is proposed. Reachability is determined using $Q$-values, while convergence is expedited through transfer learning and special initial states. Transfer learning migrates the $Q$-table based on the proven result that reachability is preserved when elements are added to the flip set. Special initial states designate states whose reachability is still unknown, avoiding redundant evaluation of known reachable states. For problem 1) with large-scale BCNs, we utilize small memory iterative $Q$L, which reduces memory usage by recording only visited action-values. The suitability of the algorithm is estimated via an upper bound on memory usage.

For problem 2) with small-scale BCNs, $Q$L with BCN-characteristics-based rewards is presented. The rewards are designed based on the maximum length of reachable trajectories without cycles (an upper bound is given). This allows the minimization of flipping actions under terminal reachability constraints to be simplified into an unconstrained optimal control problem. For problem 2) with large-scale BCNs, fast small memory $Q$L with variable rewards is proposed. In this approach, the rewards are dynamically adjusted based on the maximum length of the explored reachable trajectories without cycles to enhance convergence efficiency, and their optimality is proven.

Considering the critical role of reachability in controllability, our upcoming research will investigate minimum-cost state-flipped control for the controllability of BCNs using reinforcement learning methods. Specifically, we will address two key challenges: 1) identifying the flipping kernel for controllability of BCNs, and 2) minimizing the flipping actions required to achieve controllability. While our approach can effectively address challenges 1) and 2) for small-scale BCNs with minor adjustments, adapting it to large-scale BCNs with simple modifications is presently impractical. This will be the primary focus of our future research.

References

  • [1] S. A. Kauffman, ``Metabolic stability and epigenesis in randomly constructed genetic nets,'' Journal of Theoretical Biology, vol. 22, no. 3, pp. 437–467, 1969.
  • [2] M. Boettcher and M. McManus, ``Choosing the Right Tool for the Job: RNAi, TALEN, or CRISPR,'' Molecular Cell, vol. 58, no. 4, pp. 575–585, 2015.
  • [3] Y. Liu, Z. Liu, and J. Lu, ``State-flipped control and Q-learning algorithm for the stabilization of Boolean control networks,'' Control Theory & Applications, vol. 38, pp. 1743–1753, 2021.
  • [4] Z. Liu, J. Zhong, Y. Liu, and W. Gui, ``Weak stabilization of Boolean networks under state-flipped control,'' IEEE Transactions on Neural Networks and Learning Systems, vol. 34, pp. 2693–2700, 2021.
  • [5] M. Rafimanzelat and F. Bahrami, ``Attractor controllability of Boolean networks by flipping a subset of their nodes,'' Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 28, p. 043120, 2018.
  • [6] M. R. Rafimanzelat and F. Bahrami, ``Attractor stabilizability of Boolean networks with application to biomolecular regulatory networks,'' IEEE Transactions on Control of Network Systems, vol. 6, no. 1, pp. 72–81, 2018.
  • [7] B. Chen, X. Yang, Y. Liu, and J. Qiu, ``Controllability and stabilization of Boolean control networks by the auxiliary function of flipping,'' International Journal of Robust and Nonlinear Control, vol. 30, no. 14, pp. 5529–5541, 2020.
  • [8] Q. Zhang, J.-e. Feng, Y. Zhao, and J. Zhao, ``Stabilization and set stabilization of switched Boolean control networks via flipping mechanism,'' Nonlinear Analysis: Hybrid Systems, vol. 41, p. 101055, 2021.
  • [9] R. Zhou, Y. Guo, Y. Wu, and W. Gui, ``Asymptotical feedback set stabilization of probabilistic Boolean control networks,'' IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 11, pp. 4524–4537, 2019.
  • [10] H. Li and Y. Wang, ``On reachability and controllability of switched Boolean control networks,'' Automatica, vol. 48, no. 11, pp. 2917–2922, 2012.
  • [11] Y. Liu, H. Chen, J. Lu, and B. Wu, ``Controllability of probabilistic Boolean control networks based on transition probability matrices,'' Automatica, vol. 52, pp. 340–345, 2015.
  • [12] J. Liang, H. Chen, and J. Lam, ``An improved criterion for controllability of Boolean control networks,'' IEEE Transactions on Automatic Control, vol. 62, no. 11, pp. 6012–6018, 2017.
  • [13] L. Wang and Z.-G. Wu, ``Optimal asynchronous stabilization for Boolean control networks with lebesgue sampling,'' IEEE Transactions on Cybernetics, vol. 52, no. 5, pp. 2811–2820, 2022.
  • [14] Y. Wu, X.-M. Sun, X. Zhao, and T. Shen, ``Optimal control of Boolean control networks with average cost: A policy iteration approach,'' Automatica, vol. 100, pp. 378–387, 2019.
  • [15] S. Kharade, S. Sutavani, S. Wagh, A. Yerudkar, C. Del Vecchio, and N. Singh, ``Optimal control of probabilistic Boolean control networks: A scalable infinite horizon approach,'' International Journal of Robust and Nonlinear Control, vol. 33, no. 9, p. 4945–4966, 2023.
  • [16] Q. Zhu, Y. Liu, J. Lu, and J. Cao, ``On the optimal control of Boolean control networks,'' SIAM Journal on Control and Optimization, vol. 56, no. 2, pp. 1321–1341, 2018.
  • [17] A. Acernese, A. Yerudkar, L. Glielmo, and C. D. Vecchio, ``Reinforcement learning approach to feedback stabilization problem of probabilistic Boolean control networks,'' IEEE Control Systems Letters, vol. 5, no. 1, pp. 337–342, 2021.
  • [18] P. Bajaria, A. Yerudkar, and C. D. Vecchio, ``Random forest Q-Learning for feedback stabilization of probabilistic Boolean control networks,'' in 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2021, pp. 1539–1544.
  • [19] Z. Zhou, Y. Liu, J. Lu, and L. Glielmo, ``Cluster synchronization of Boolean networks under state-flipped control with reinforcement learning,'' IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 12, pp. 5044–5048, 2022.
  • [20] X. Peng, Y. Tang, F. Li, and Y. Liu, ``Q-learning based optimal false data injection attack on probabilistic Boolean control networks,'' 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:265498378
  • [21] Z. Liu, Y. Liu, Q. Ruan, and W. Gui, ``Robust flipping stabilization of Boolean networks: A Q-learning approach,'' Systems & Control Letters, vol. 176, p. 105527, 2023.
  • [22] J. Lu, R. Liu, J. Lou, and Y. Liu, ``Pinning stabilization of Boolean control networks via a minimum number of controllers,'' IEEE Transactions on Cybernetics, vol. 51, no. 1, pp. 373–381, 2021.
  • [23] Y. Liu, L. Wang, J. Lu, and L. Yu, ``Pinning stabilization of stochastic networks with finite states via controlling minimal nodes,'' IEEE Transactions on Cybernetics, vol. 52, no. 4, pp. 2361–2369, 2022.
  • [24] S. Zhu, J. Lu, D. W. Ho, and J. Cao, ``Minimal control nodes for strong structural observability of discrete-time iteration systems: Explicit formulas and polynomial-time algorithms,'' IEEE Transactions on Automatic Control, 2023. [Online]. doi: 10.1109/TAC.2023.3330263.
  • [25] S. Zhu, J. Cao, L. Lin, J. Lam, and S.-i. Azuma, ``Towards stabilizable large-scale Boolean networks by controlling the minimal set of nodes,'' IEEE Transactions on Automatic Control, vol. 69, no. 1, pp. 174–188, 2024.
  • [26] S. Zhu, J. Lu, S.-i. Azuma, and W. X. Zheng, ``Strong structural controllability of Boolean networks: Polynomial-time criteria, minimal node control, and distributed pinning strategies,'' IEEE Transactions on Automatic Control, vol. 68, no. 9, pp. 5461–5476, 2023.
  • [27] H. V. Hasselt, A. Guez, and D. Silver, ``Deep reinforcement learning with double Q-learning,'' Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, pp. 1–13, 2016.
  • [28] T. Jaakkola, M. Jordan, and S. Singh, ``Convergence of stochastic iterative dynamic programming algorithms,'' Advances in Neural Information Processing Systems, vol. 6, pp. 703–710, 1993.
  • [29] C. J. Watkins and P. Dayan, ``Q-learning,'' Machine Learning, vol. 8, no. 3, pp. 279–292, 1992.
  • [30] L. Torrey and J. Shavlik, ``Transfer learning,'' in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI Global, 2010, pp. 242–264.
  • [31] R. Zhou, Y. Guo, and W. Gui, ``Set reachability and observability of probabilistic Boolean networks,'' Automatica, vol. 106, pp. 230–241, 2019.
  • [32] E. Greensmith, P. L. Bartlett, and J. Baxter, ``Variance reduction techniques for gradient estimates in reinforcement learning,'' Journal of Machine Learning Research, vol. 5, no. 9, p. 1471–1530, 2004.