On the Sample Complexity of Decentralized Linear Quadratic Regulator with Partially Nested Information Structure
Abstract
We study the problem of control policy design for decentralized state-feedback linear quadratic control with a partially nested information structure, when the system model is unknown. We propose a model-based learning solution, which consists of two steps. First, we estimate the unknown system model from a single system trajectory of finite length, using least squares estimation. Next, based on the estimated system model, we design a decentralized control policy that satisfies the desired information structure. We show that the suboptimality gap between our control policy and the optimal decentralized control policy (designed using accurate knowledge of the system model) scales linearly with the estimation error of the system model. Using this result, we provide an end-to-end sample complexity result for learning decentralized controllers for a linear quadratic control problem with a partially nested information structure.
1 Introduction
In large-scale control systems, the control policy is often required to be decentralized, in the sense that each controller may use only partial state information when designing its local control policy. For example, a given controller may only receive a subset of the global state measurements (e.g., [35]), and there may be a delay in receiving the measurements (e.g., [25]). In general, finding a globally optimal control policy under information constraints is NP-hard, even if the system model is known at the controllers [39, 31, 6]. This has led to a large literature on identifying tractable subclasses of the problem. For instance, if the information structure describing the decentralized control problem is partially nested [21], the optimal solution to the state-feedback linear quadratic control problem can be found efficiently using dynamic programming [26]. Other conditions, such as quadratic invariance [32, 33], have also been identified as yielding tractable subclasses of the problem.
However, the classical work in this field assumes knowledge of the system model at the controllers. In this work, we are interested in the situation where the system model is not known a priori [23]. In such a case, the existing algorithms do not apply. Moreover, it is not clear whether subclasses such as problems with partially nested information structures, or problems satisfying quadratic invariance, remain any more tractable than the general decentralized control problem in this case.
In this paper, we consider a decentralized infinite-horizon state-feedback Linear Quadratic Regulator (LQR) control problem with a partially nested information structure [35, 26] and assume that the controllers do not have access to the system model. We use a model-based learning approach, where we first identify the system model, and then use it to design a decentralized control policy that satisfies the prescribed information constraints.
Related Work
Solving optimal control problems without prior knowledge of the system model has received much attention recently. One of the most studied problems is the centralized LQR problem. For this problem, two broad classes of methods have been studied, namely, model-based learning [1, 29, 12] and model-free learning [16, 41, 28, 20]. In the model-based learning approach, a system model is first estimated from observed system trajectories using some system identification method. A control policy can then be obtained based on the estimated system model. In the model-free learning approach, the objective function in the LQR problem is viewed directly as a function of the control policies. Based on zeroth-order optimization methods (e.g., [19, 30]), the optimal solution can then be obtained using gradient descent, where the gradient of the objective function is estimated from data samples of system trajectories. Moreover, the model-based learning approach has also been studied for the centralized linear quadratic Gaussian control problem [43]. In general, compared to model-free learning, model-based learning tends to require fewer data samples to achieve a policy of equivalent performance [38].
Most of the previous works on model-based learning for centralized LQR build on recent advances in non-asymptotic analyses for system identification of linear dynamical systems with full state observations (e.g., [13, 36, 34]). Such non-asymptotic analyses (i.e., sample complexity results) relate the estimation error of the system matrices to the number of samples used for system identification. In particular, it was shown in [36] that when using a single system trajectory, the least squares approach for system identification achieves the optimal sample complexity up to logarithmic factors. In this paper, we utilize a similar least squares approach for estimating the system matrices from a single system trajectory. Although the system matrices in our problem are structured, as dictated by the interconnections among the subsystems, we leverage the results in [2, 10] to provide a non-asymptotic analysis of the resulting estimation error.
There are few results on solving decentralized linear quadratic control problems with information constraints, when the system model is unknown. In [18], the authors studied a decentralized output-feedback linear quadratic control problem, under the assumption that the quadratic invariance condition is satisfied. The authors proposed a model-free approach and provided a sample complexity analysis. They focused on a finite-horizon setting, since gradient-based optimization methods may not converge to the optimal controller for infinite-horizon decentralized linear quadratic control problems with information constraints, even when the system model is known [17, 7]. In [27], the authors proposed a consensus-based model-free learning algorithm for multi-agent decentralized LQR over an infinite horizon, where each agent (i.e., controller) has access to a subset of the global state without delay. They showed that their algorithm converges to a control policy that is a stationary point of the objective function in the LQR problem. In [15], the authors studied model-based learning for LQR with subspace constraints on the closed-loop responses. However, those constraints may not lead to controllers that satisfy the information constraints that we consider in this paper (e.g., [42]).
There is also a line of research on online adaptive control for centralized LQR with unknown system models, using either model-based learning [1, 11, 10], or model-free learning [3, 8]. The goal there is to adaptively design a control policy in an online manner when new data samples from the system trajectory become available, and bound the corresponding regret.
Contributions
We propose a two-step model-based approach to solving the problem of learning decentralized LQR with a partially nested information structure. Here, we summarize our contributions and technical challenges in the paper.
•
In Section 3, we provide a sample complexity result for estimating the system model from a single system trajectory using a least squares approach. Despite the existence of a sparsity pattern in the system model considered in our problem, we adapt the analyses in [10, 9] for least squares estimation of general linear system models (without any sparsity pattern) to our setting, and show that such a system identification method for general system models suffices for our ensuing analyses.
•
In Section 4, based on the estimated system model, we design a novel decentralized control policy that satisfies the given information structure. Our control policy is inspired by [26], which developed the optimal controller for the decentralized LQR problem with a partially nested information structure and known system model. The optimal controller therein depends on some internal states, each of which evolves according to an auxiliary linear system (characterized by the actual model of the original system with a disturbance term from the original system) and correlates with other internal states. Accordingly, this complicated form of the internal states makes it challenging to extend the design in [26] to the case when the system model is unknown. To tackle this, we capitalize on the observation that the optimal controller proposed in [26] can be viewed as a disturbance-feedback control policy that maps the history of past disturbances (affecting the original system) to the current control input. Thanks to this viewpoint, we put forth a control policy that uses the aforementioned estimated system model and maps the estimates of past disturbances to the current control input via some estimated internal states. In particular, the estimates of the disturbances are obtained using the estimated system model and the state information of the original system, and each of the estimated internal states evolves according to a linear system characterized by the estimated system model and the estimated disturbances. More importantly, we show that the proposed control policy can be implemented in a decentralized manner that satisfies the prescribed information structure, which requires a careful investigation of the structure of our problem.
•
In Section 5.2, we characterize the performance guarantee (i.e., suboptimality) of the control policy proposed in Section 4. As we discussed above, our control policy requires obtaining estimates of the past disturbances and maintaining the estimated internal states. When we compare the performance of our control policy to that of the optimal decentralized control policy in [26], both the estimates of the past disturbances and the estimated internal states contribute to the suboptimality of our control policy, which creates the major technical challenge in our analyses. We overcome this challenge by carefully investigating the structure of the proposed control policy, and we show that the suboptimality gap between our control policy and the optimal decentralized control policy (designed based on accurate knowledge of the system model) provided in [26] can be decomposed into two terms, both of which scale linearly with the estimation error of the system model.
•
In Section 5.3, we combine the above results and provide an end-to-end sample complexity result for learning decentralized LQR with a partially nested information structure. Surprisingly, despite the presence of the information constraints and the fact that the optimal controller is a linear dynamic controller, our sample complexity result matches that of learning centralized LQR without any information constraints [12].
2 Preliminaries and Problem Formulation
2.1 Notation and Terminology
The sets of integers and real numbers are denoted as $\mathbb{Z}$ and $\mathbb{R}$, respectively. The set of integers (resp., real numbers) that are greater than or equal to $a$ is denoted as $\mathbb{Z}_{\geq a}$ (resp., $\mathbb{R}_{\geq a}$). For a real number $a$, let $\lceil a \rceil$ be the smallest integer that is greater than or equal to $a$. The space of $n$-dimensional real vectors is denoted by $\mathbb{R}^n$, and the space of $m \times n$ real matrices is denoted by $\mathbb{R}^{m \times n}$. For a matrix $P \in \mathbb{R}^{n \times n}$, let $P^\top$, $\mathrm{tr}(P)$, and $\{\sigma_i(P)\}$ be its transpose, trace, and set of singular values, respectively. Without loss of generality, let the singular values of $P$ be ordered as $\sigma_1(P) \geq \cdots \geq \sigma_n(P)$. Let $\|\cdot\|$ denote the spectral norm, i.e., $\|P\| = \sigma_1(P)$ for a matrix $P$, and the Euclidean norm for a vector. Let $\|P\|_F$ denote the Frobenius norm of $P$. A positive semidefinite matrix $P$ is denoted by $P \succeq 0$, and $P \succeq Q$ if and only if $P - Q \succeq 0$. Let $\mathbb{S}^n_{+}$ (resp., $\mathbb{S}^n_{++}$) denote the set of positive semidefinite (resp., positive definite) matrices. Let $I$ denote an identity matrix whose dimension can be inferred from the context. Given any integer $n \geq 1$, we define $[n] = \{1, \dots, n\}$. The cardinality of a finite set $\mathcal{A}$ is denoted by $|\mathcal{A}|$. Let $\mathcal{N}(\mu, \Sigma)$ denote a Gaussian distribution with mean $\mu$ and covariance $\Sigma$.
2.2 Solution to Decentralized LQR with Sparsity and Delay Constraints
In this section, we sketch the method developed in [26, 35], which presents the optimal solution to a decentralized LQR problem with a partially nested information structure [21], when the system model is known a priori. First, let us consider a networked system that consists of $p$ interconnected linear-time-invariant (LTI) subsystems. Letting the state, input and disturbance of the subsystem corresponding to node $i$ be $x_i(t) \in \mathbb{R}^{n_i}$, $u_i(t) \in \mathbb{R}^{m_i}$, and $w_i(t) \in \mathbb{R}^{n_i}$, respectively, the subsystem corresponding to node $i$ is given by
$$x_i(t+1) = \sum_{j \in \mathcal{N}_i} \big( A_{ij}\, x_j(t) + B_{ij}\, u_j(t) \big) + w_i(t) \qquad (1)$$
where $\mathcal{N}_i \subseteq \mathcal{V}$ is the set of subsystems whose states and inputs directly affect the state of subsystem $i$, $A_{ij} \in \mathbb{R}^{n_i \times n_j}$, $B_{ij} \in \mathbb{R}^{n_i \times m_j}$, and $w_i(\cdot)$ is a white Gaussian noise process with $w_i(t) \sim \mathcal{N}(0, W_i)$ for all $t \in \mathbb{Z}_{\geq 0}$, where $W_i \succ 0$.¹ (¹The analysis can be extended to the case when $w_i(\cdot)$ is assumed to be a zero-mean white Gaussian noise process with general covariance $W_i$; in that case, our analysis will depend on $\sigma_1(W_i)$ and $\sigma_{n_i}(W_i)$.) For simplicity, we assume throughout this paper that $W_i = I$ for all $i \in \mathcal{V}$. We can also write Eq. (1) as
$$x_i(t+1) = A_{i,\mathcal{N}_i}\, x_{\mathcal{N}_i}(t) + B_{i,\mathcal{N}_i}\, u_{\mathcal{N}_i}(t) + w_i(t) \qquad (2)$$
where $x_{\mathcal{N}_i}(t)$ (resp., $u_{\mathcal{N}_i}(t)$) is the vector that stacks $x_j(t)$ (resp., $u_j(t)$) for all $j \in \mathcal{N}_i$, and $A_{i,\mathcal{N}_i}$ (resp., $B_{i,\mathcal{N}_i}$) collects the corresponding blocks $A_{ij}$ (resp., $B_{ij}$) with $j \in \mathcal{N}_i$. Further letting $x(t) = [x_1(t)^\top \cdots x_p(t)^\top]^\top$ and $u(t) = [u_1(t)^\top \cdots u_p(t)^\top]^\top$, and defining $w(t)$ similarly, with $n = \sum_{i \in \mathcal{V}} n_i$ and $m = \sum_{i \in \mathcal{V}} m_i$, we can compactly write Eq. (1) into the following matrix form:
$$x(t+1) = A\, x(t) + B\, u(t) + w(t) \qquad (3)$$
where the $(i,j)$th block of $A \in \mathbb{R}^{n \times n}$ (resp., $B \in \mathbb{R}^{n \times m}$), i.e., $A_{ij}$ (resp., $B_{ij}$), satisfies $A_{ij} = 0$ (resp., $B_{ij} = 0$) if $j \notin \mathcal{N}_i$. We assume that $w_i(t)$ and $w_j(t')$ are independent for all $i, j \in \mathcal{V}$ with $i \neq j$ and for all $t, t' \in \mathbb{Z}_{\geq 0}$. In other words, $w(\cdot)$ is a white Gaussian noise process with $w(t) \sim \mathcal{N}(0, I)$ for all $t \in \mathbb{Z}_{\geq 0}$. For simplicity, we assume that $x(0) = 0$ throughout this paper.² (²The analysis can be extended to the case when $x(0)$ is given by a zero-mean Gaussian distribution, as one may view $x(0)$ as an additional noise term $w(-1)$.)
Next, we use a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $\mathcal{V} = [p]$ to characterize the information flow among the subsystems in $\mathcal{V}$ due to communication constraints on the subsystems. Each node in $\mathcal{V}$ represents a subsystem, and we assume that $\mathcal{G}$ does not have self loops. We associate any edge $(j, i) \in \mathcal{E}$ with a delay of either $0$ or $1$.³ (³The framework described in this paper can also be used to handle edges with larger delays; see [26] for a detailed discussion.) Then, we define the delay matrix corresponding to $\mathcal{G}$ as $D$ such that: (i) If $i \neq j$ and there is a directed path from $j$ to $i$ in $\mathcal{G}$, then $D_{ij}$ is equal to the sum of delays along the directed path from node $j$ to node $i$ with the smallest accumulated delay; (ii) If $i \neq j$ and there is no directed path from $j$ to $i$ in $\mathcal{G}$, then $D_{ij} = +\infty$; (iii) $D_{ii} = 0$ for all $i \in \mathcal{V}$. Here, we consider the scenario where the information (e.g., state information) corresponding to subsystem $j$ can propagate to subsystem $i$ with a delay of $D_{ij}$ (in time), if and only if there exists a directed path from $j$ to $i$ with an accumulated delay of $D_{ij}$. Note that, as argued in [26], we assume that there is no directed cycle with zero accumulated delay; otherwise, one can first collapse all the nodes in such a directed cycle into a single node, and equivalently consider the resulting directed graph in the framework described above.
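As an illustration, the delay matrix can be computed from the edge delays via an all-pairs shortest-path computation; the following is a minimal Python sketch, where the function name, the edge-list format, and the 0-based node indices are illustrative choices rather than part of the framework above.

```python
import numpy as np

def delay_matrix(p, edges):
    """Compute the delay matrix D of a directed graph with 0/1 edge delays.

    D[i, j] is the smallest accumulated delay over directed paths from
    node j to node i (np.inf if i is unreachable from j, 0 on the diagonal).
    `edges` is a list of triples (j, i, d) meaning an edge j -> i with delay d.
    """
    D = np.full((p, p), np.inf)
    np.fill_diagonal(D, 0.0)
    for (j, i, d) in edges:
        D[i, j] = min(D[i, j], d)
    # Floyd-Warshall over intermediate nodes: a path j -> k -> i.
    for k in range(p):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D

# Example (0-indexed): an edge 0 -> 1 with delay 0, and an edge 1 -> 2 with delay 1.
D = delay_matrix(3, [(0, 1, 0), (1, 2, 1)])
```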
To proceed, we consider designing the control input for the LTI system in Eq. (3). We focus on state-feedback control, i.e., we can view $u(t)$ as a policy that maps the states of the LTI system to a control input. Moreover, we require that $u(t)$ satisfy the information structure given by the directed graph $\mathcal{G}$ and the delay matrix $D$ described above. Specifically, considering any $i \in \mathcal{V}$ and any $t \in \mathbb{Z}_{\geq 0}$, and noting that the controller corresponding to subsystem $i$ provides the control input $u_i(t)$, the state information that is available to the controller corresponding to subsystem $i$ is given by
$$\mathcal{I}_i(t) = \big\{ x_j(k) : j \in \mathcal{V},\ 0 \le k \le t - D_{ij} \big\} \qquad (4)$$
In other words, the control policy $u_i(t)$ maps the states contained in $\mathcal{I}_i(t)$ to a control input. In the sequel, we also call $\mathcal{I}_i(t)$ the information set of controller $i$ at time $t$. Note that $\mathcal{I}_i(t)$ contains the states corresponding to the subsystems in $\mathcal{V}$ that have had enough time to reach subsystem $i$ at time $t$, due to the sparsity and delay constraints described above. Now, based on the information set $\mathcal{I}_i(t)$, we further define $\mathcal{M}_i$ to be the set that consists of all the policies that map the states in $\mathcal{I}_i(t)$ to a control input at node $i$. The goal is then to solve the following constrained optimization problem:
$$\min_{u_i \in \mathcal{M}_i,\ \forall i \in \mathcal{V}}\ \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[ \sum_{t=0}^{T-1} \big( x(t)^\top Q\, x(t) + u(t)^\top R\, u(t) \big) \right] \qquad (5)$$
where $Q$ and $R$ are the cost matrices, and the expectation is taken with respect to the noise $w(t)$ for all $t \in \mathbb{Z}_{\geq 0}$. Throughout the paper, we assume that the following assumption on the information propagation pattern among the subsystems in $\mathcal{V}$ holds (e.g., [26, 40]).
Assumption 1.
For all $i, j \in \mathcal{V}$, it holds that $j \in \mathcal{N}_i$ if and only if $D_{ij} \le 1$, where $\mathcal{N}_i$ is given in Eq. (1).
Assumption 1 says that the state of subsystem $i$ is affected by the state and input of subsystem $j$ if and only if there is a communication link with a delay of at most $1$ from subsystem $j$ to subsystem $i$ in $\mathcal{G}$. As shown in [26], Assumption 1 ensures that the information structure associated with the system given in Eq. (1) is partially nested [21]. Assumption 1 is frequently used in decentralized control problems (e.g., [26, 35] and the references therein), and one can see that the assumption is satisfied in networked systems where information propagates at least as fast as dynamics. To illustrate the arguments above, we introduce Example 1.
Example 1.
Consider the directed graph $\mathcal{G}$ given in Fig. 1, where each directed edge is associated with a delay of $0$ or $1$. The corresponding LTI system is then given by
(6)
Figure 1: The directed graph $\mathcal{G}$ considered in Example 1, with each edge labeled by its delay.
Now, in order to present the solution to (5) given in, e.g., [26], we need to construct an information graph $\mathcal{P} = (\mathcal{U}, \mathcal{H})$ (see [26] for more details). Considering any directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $\mathcal{V} = [p]$, and the delay matrix $D$ as described above, let us first define $s_{j,k}$ to be the set of nodes in $\mathcal{V}$ that are reachable from node $j$ within $k$ time steps, i.e., $s_{j,k} = \{ i \in \mathcal{V} : D_{ij} \le k \}$. The information graph $\mathcal{P} = (\mathcal{U}, \mathcal{H})$ is then constructed as
$$\mathcal{U} = \big\{ s_{j,k} : j \in \mathcal{V},\ k \in \mathbb{Z}_{\geq 0} \big\}, \qquad \mathcal{H} = \big\{ (s_{j,k},\, s_{j,k+1}) : j \in \mathcal{V},\ k \in \mathbb{Z}_{\geq 0} \big\} \qquad (7)$$
Thus, we see from (7) that each node $s \in \mathcal{U}$ corresponds to a set of nodes from $\mathcal{V}$ in the original directed graph $\mathcal{G}$. Using a similar notation to that for the graph $\mathcal{G}$, if there is an edge from $r$ to $s$ in $\mathcal{P}$, we denote the edge as $r \to s$. Additionally, considering any $j \in \mathcal{V}$, we write $w_j \to s$ to indicate the fact that the noise $w_j(t)$ is injected to node $s = s_{j,0}$ at time $t$.⁴ (⁴Note that we have assumed that there is no directed cycle in $\mathcal{G}$ with zero accumulated delay. Hence, one can show that for any such $s \in \mathcal{U}$, $w_j$ is the only noise term such that $w_j \to s$.) From the above construction of the information graph $\mathcal{P}$, one can show that the following properties hold.
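As an illustration, the information graph can be built directly from the delay matrix by enumerating the reachability sets in (7); the following is a minimal Python sketch, where the function name and the set representation are illustrative choices.

```python
import numpy as np

def information_graph(D, k_max=None):
    """Construct the information graph from the delay matrix D, cf. (7).

    Nodes are the distinct reachability sets s_{j,k} = {i : D[i, j] <= k},
    and there is an edge s_{j,k} -> s_{j,k+1} for every j and k.
    """
    p = D.shape[0]
    k_max = p if k_max is None else k_max  # the sets stop growing after ~p steps
    nodes, edges = set(), set()
    for j in range(p):
        prev = None
        for k in range(k_max + 1):
            s = frozenset(np.flatnonzero(D[:, j] <= k).tolist())
            nodes.add(s)
            if prev is not None:
                edges.add((prev, s))  # becomes a self loop once s stabilizes
            prev = s
    return nodes, edges
```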
Lemma 1.
Remark 1.
One can see from the construction of $\mathcal{P}$ and Lemma 1 that $\mathcal{P}$ is a forest, i.e., a set of disconnected directed trees, where each directed tree in the forest is oriented toward a node with a self loop in $\mathcal{P}$. Specifically, the nodes $s_{j,0}$ for all $j \in \mathcal{V}$ are the leaf nodes in $\mathcal{P}$, and the nodes with self loops are the root nodes in $\mathcal{P}$.
To illustrate the construction steps and the properties of the information graph discussed above, we again use Example 1; the resulting information graph $\mathcal{P}$ is given in Fig. 2. Note that the information graph in Fig. 2 contains two disconnected directed trees, one of which is an isolated node with a self loop. In fact, one can check that the results in Lemma 1 hold for the information graph $\mathcal{P}$ in Fig. 2.
Figure 2: The information graph $\mathcal{P}$ corresponding to Example 1.
Throughout this paper, we assume that the elements in $\mathcal{V}$ are ordered in an increasing manner, and that the elements in each $s \in \mathcal{U}$ are also ordered in an increasing manner. Now, for any $r, s \in \mathcal{U}$, we use $A_{s,r}$ (or $B_{s,r}$) to denote the submatrix of $A$ (resp., $B$) that corresponds to the nodes of the directed graph $\mathcal{G}$ contained in $s$ and $r$. In the sequel, we will also use similar notations to denote submatrices of $Q$, $R$, and the identity matrix $I$. We will make the following standard assumption (see, e.g., [26]).
Assumption 2.
For any $s \in \mathcal{U}$ that has a self loop, the pair $(A_{s,s}, B_{s,s})$ is stabilizable and the pair $(A_{s,s}, C_s)$ is detectable, where $C_s^\top C_s = Q_{s,s}$.
Leveraging the partial nestedness of (5), the authors in [26] obtained the optimal solution to (5), which we summarize in the following lemma.
Lemma 2.
[26, Corollary 4] Consider the problem given in (5), and let $\mathcal{P} = (\mathcal{U}, \mathcal{H})$ be the associated information graph. Suppose Assumption 2 holds. For all $s \in \mathcal{U}$, define the matrices $K_s$ and $P_s$ recursively as
(8)
(9)
where for each $s \in \mathcal{U}$, $s^+$ is the unique node such that $s \to s^+$. In particular, for any $s \in \mathcal{U}$ that has a self loop, the matrix $P_s$ is the unique positive semidefinite solution to the Riccati equation given by Eq. (9), and the matrix $A_{s,s} + B_{s,s} K_s$ is stable. The optimal solution to (5) is then given by
(10)
and
(11)
for all $i \in \mathcal{V}$, where $\zeta_s(t)$ is an internal state initialized with $\zeta_s(0) = 0$ for all $s \in \mathcal{U}$. The corresponding optimal cost of (5), denoted as $J^\star$, is given by
(12)
Let us use Example 1 to illustrate the results in Lemma 2. First, considering node in the information graph given in Fig. 2, we have from Eq. (10) that
Next, considering node in the directed graph given in Fig. 1, we see from Eq. (11) and Fig. 2 that
where is given by Eq. (8).
Remark 2.
Obtaining the optimal policy $u_i^\star(t)$, for any $i \in \mathcal{V}$, given by Lemma 2 requires global knowledge of the system matrices $A$ and $B$, the cost matrices $Q$ and $R$, and the directed graph $\mathcal{G}$ with the associated delay matrix $D$. Moreover, the policy given in Lemma 2 is not a static state-feedback controller, but a linear dynamic controller based on the internal states $\zeta_s(t)$ for all $s \in \mathcal{U}$. For any controller $i \in \mathcal{V}$ and any $t \in \mathbb{Z}_{\geq 0}$, the authors in [26] proposed an algorithm to determine $\zeta_s(t)$ for all $s$ such that $i \in s$, and thus $u_i^\star(t)$, using only the memory maintained by the algorithm, the state information contained in the information set $\mathcal{I}_i(t)$ defined in Eq. (4), and the global information described above.
2.3 Problem Formulation and Summary of Results
We now formally introduce the problem that we will study in this paper. We consider the scenario where the system matrices $A$ and $B$ are unknown. However, we assume that the directed graph $\mathcal{G}$ and the associated delay matrix $D$ are known. Similarly to, e.g., [12, 43], we consider the scenario where we can first conduct experiments in order to estimate the unknown system matrices. Specifically, starting from the initial state $x(0) = 0$, we evolve the system given in Eq. (3) for $T$ time steps using a given control input sequence $\{u(t)\}_{t=0}^{T-1}$, and collect the resulting state sequence $\{x(t)\}_{t=0}^{T}$. Based on these sequences, we use a least squares approach to obtain estimates of the system matrices $A$ and $B$, denoted as $\hat{A}$ and $\hat{B}$, respectively. Using the obtained $\hat{A}$ and $\hat{B}$, the goal is still to solve (5). Since the true system matrices are unknown, it may no longer be possible to solve (5) optimally using the methods introduced in Section 2.2. Thus, we aim to provide a solution to (5) using $\hat{A}$ and $\hat{B}$, and characterize its performance (i.e., suboptimality) guarantees.
In the rest of this paper, we first analyze the estimation error of $\hat{A}$ and $\hat{B}$ obtained from the procedure described above. In particular, we show in Section 3 that the estimation errors $\|\hat{A} - A\|$ and $\|\hat{B} - B\|$ scale as $\tilde{O}(1/\sqrt{T})$ with high probability.⁵ (⁵Throughout this paper, we let $\tilde{O}(\cdot)$ hide logarithmic factors in $T$.) Next, in Section 4, we design a control policy, based on $\hat{A}$ and $\hat{B}$, which satisfies the information constraints given in (5). Supposing $\|\hat{A} - A\| \le \varepsilon$ and $\|\hat{B} - B\| \le \varepsilon$, where $\varepsilon > 0$, and denoting the cost of (5) corresponding to the proposed policy as $\hat{J}$, we show in Section 5.2 that
$$\hat{J} - J^\star \le C\, \varepsilon$$
as long as $\varepsilon \le \bar{\varepsilon}$, where $J^\star$ is the optimal cost of (5) given by (12), and $C$ and $\bar{\varepsilon}$ are constants that explicitly depend on the problem parameters of (5). Finally, combining the above results, we show in Section 5.3 that with high probability and for $T$ large enough, the following end-to-end sample complexity of learning decentralized LQR with the partially nested information structure holds:
$$\hat{J} - J^\star \le \tilde{O}\big(1/\sqrt{T}\big).$$
3 System Identification Using Least Squares
As we described in Section 2.3, we use a least squares approach to estimate the system matrices $A$ and $B$, based on a single system trajectory consisting of the control input sequence $\{u(t)\}_{t=0}^{T-1}$ and the system state sequence $\{x(t)\}_{t=0}^{T}$, where $x(0) = 0$. Here, we draw the inputs independently from a Gaussian distribution, i.e., we let $u(t) \sim \mathcal{N}(0, \sigma_u^2 I)$ for all $t \in \{0, \dots, T-1\}$, where $\sigma_u > 0$. Moreover, we assume that the input $u(t)$ and the disturbance $w(t')$ are independent for all $t, t'$. Note that we consider the scenario where the estimation of $A$ and $B$ is performed in a centralized manner using a least squares approach (detailed in Algorithm 1). However, we remark that Algorithm 1 can be carried out without violating the information constraints given by Eq. (4), since $u(t)$ is not a function of the states in the information set defined in Eq. (4) for any $t$. In the following, we present the least squares approach to estimate $A$ and $B$, and characterize the corresponding estimation error.
3.1 Least Squares Estimation of System Matrices
Let us denote
$$z(t) = \begin{bmatrix} x(t) \\ u(t) \end{bmatrix}, \qquad \Theta = \begin{bmatrix} A & B \end{bmatrix} \qquad (13)$$
where $z(t) \in \mathbb{R}^{n+m}$ and $\Theta \in \mathbb{R}^{n \times (n+m)}$, so that Eq. (3) can be written as $x(t+1) = \Theta z(t) + w(t)$. Given the sequences $\{x(t)\}_{t=0}^{T}$ and $\{u(t)\}_{t=0}^{T-1}$, we use regularized least squares to obtain an estimate of $\Theta$, denoted as $\hat{\Theta}$, i.e.,
$$\hat{\Theta} = \operatorname*{arg\,min}_{\Theta' \in \mathbb{R}^{n \times (n+m)}}\ \sum_{t=0}^{T-1} \big\| x(t+1) - \Theta' z(t) \big\|^2 + \lambda \|\Theta'\|_F^2 \qquad (14)$$
where $\lambda > 0$ is the regularization parameter. We summarize the above least squares approach in Algorithm 1.
Input: regularization parameter $\lambda$ and time horizon length $T$
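As an illustration of Algorithm 1, the following minimal Python sketch excites the system with the Gaussian inputs described above and computes the closed-form solution of the regularized least squares problem (14); the simulator, the input variance, and the numerical values are illustrative assumptions.

```python
import numpy as np

def estimate_model(A, B, T, sigma_u=1.0, lam=1.0, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 1: excite the system with Gaussian inputs and
    estimate Theta = [A B] by regularized least squares, cf. Eqs. (13)-(14)."""
    n, m = B.shape
    x = np.zeros(n)                      # x(0) = 0
    Z, Y = [], []                        # regressors z(t) and targets x(t+1)
    for _ in range(T):
        u = sigma_u * rng.standard_normal(m)
        x_next = A @ x + B @ u + rng.standard_normal(n)   # w(t) ~ N(0, I)
        Z.append(np.concatenate([x, u]))
        Y.append(x_next)
        x = x_next
    Z, Y = np.array(Z), np.array(Y)      # shapes (T, n+m) and (T, n)
    # Closed-form solution of (14): Theta_hat = Y^T Z (Z^T Z + lam I)^{-1}.
    Theta_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(n + m), Z.T @ Y).T
    return Theta_hat[:, :n], Theta_hat[:, n:]   # A_hat, B_hat
```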
3.2 Least Squares Estimation Error
In order to characterize the estimation error of $\hat{\Theta}$ given by (14), we will use the following result from [10], which is a consequence of [2, Theorem 1].
Lemma 3.
For any $\delta \in (0, 1)$, we now introduce the following probabilistic events that will be useful in our analysis later:
(15)
where . Denoting
(16)
we have the following result; the proof is included in Appendix A.
Lemma 4.
For any and for any , it holds that .
For the analysis in the sequel, we will make the following assumption, which is also made in related literature (see e.g., [24, 37, 43]).
Assumption 3.
The system matrix $A$ is stable, and $\|A^k\| \le \kappa \rho^k$ for all $k \in \mathbb{Z}_{\geq 0}$, where $\kappa \ge 1$ and $\rho \in (0, 1)$.
Note that for any stable matrix $A$, we have from the Gelfand formula (e.g., [22]) that there always exist $\kappa \ge 1$ and $\rho$ with $0 < \rho < 1$ such that $\|A^k\| \le \kappa \rho^k$ for all $k \in \mathbb{Z}_{\geq 0}$. We then have the following results; the proofs are included in Appendix A.
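As an illustration, a valid pair of such constants for a given stable matrix can also be found numerically; in the following minimal sketch, the choice of the decay rate between the spectral radius and $1$, as well as the finite search horizon, are heuristic assumptions.

```python
import numpy as np

def power_bound_constants(A, margin=0.99, k_max=200):
    """Numerically find (kappa, rho) with ||A^k|| <= kappa * rho^k.

    Any rho strictly between the spectral radius of A and 1 works; kappa is
    then the largest observed ratio ||A^k|| / rho^k over a finite horizon
    (valid since the ratio tends to zero for rho above the spectral radius).
    """
    spec_rad = max(abs(np.linalg.eigvals(A)))
    assert spec_rad < 1.0, "A must be stable"
    rho = spec_rad + margin * (1.0 - spec_rad)   # spec_rad < rho < 1
    kappa, Ak = 1.0, np.eye(A.shape[0])
    for k in range(1, k_max + 1):
        Ak = Ak @ A
        kappa = max(kappa, np.linalg.norm(Ak, 2) / rho**k)
    return kappa, rho
```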
Lemma 5.
Proposition 1.
Several remarks pertaining to Algorithm 1 and the result in Proposition 1 are now in order. First, note that for the problem of learning centralized LQR without any information constraints, the authors in [12] proposed to obtain $\hat{A}$ and $\hat{B}$ from multiple system trajectories using least squares, where each trajectory starts from $x(0) = 0$. They showed that $\|\hat{A} - A\|$ and $\|\hat{B} - B\|$ scale as $O(1/\sqrt{N})$, where $N$ is the number of system trajectories. In contrast, we estimate $A$ and $B$ from a single system trajectory, and achieve $\|\hat{A} - A\| \le \tilde{O}(1/\sqrt{T})$ and $\|\hat{B} - B\| \le \tilde{O}(1/\sqrt{T})$.
Second, note that we use regularized least squares in Algorithm 1 to obtain the estimates $\hat{A}$ and $\hat{B}$. Although least squares without regularization can also be used to obtain estimates from a single system trajectory with the same finite sample guarantee (e.g., [36]), we choose the regularized least squares approach considered in, e.g., [2, 10, 9]. The reason is that introducing the regularization makes the finite sample analysis more tractable (e.g., [10, 9]), which facilitates the adaptation of the analyses in [10, 9] to our setting described in this section. Moreover, note that the lower bound on required in Proposition 1 is merely used to guarantee that the denominator of the right-hand side of Eq. (18) contains the factor ; choosing an arbitrary leads to a factor . In general, one can show that choosing any leads to the same finite sample guarantee.
Third, note that we do not leverage the block structure (i.e., sparsity pattern) of and described in Section 2.2, when we obtain and using Algorithm 1. Therefore, the sparsity pattern of and may potentially be inconsistent with that of and . Nonetheless, such a potential inconsistency does not play any role in our analysis later. The reason is that the control policy to be proposed in Section 4 does not depend on the sparsity pattern of and . Moreover, when analyzing the suboptimality of the proposed control policy later in Section 5, we only leverage the fact that the estimation error corresponding to submatrices in (resp., ) will be upper bounded by (resp., ). Specifically, considering any nodes in the information graph given by (7), one can show that , where recall that (resp., ) is a submatrix of (resp., ) that corresponds to the nodes of the directed graph contained in and .
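Indeed, writing each submatrix via binary selection matrices (introduced here only for this justification), we have, e.g.,
$$\big\| \hat{A}_{s,r} - A_{s,r} \big\| = \big\| E_s^\top (\hat{A} - A)\, E_r \big\| \le \|E_s\|\, \big\|\hat{A} - A\big\|\, \|E_r\| = \big\|\hat{A} - A\big\| \le \varepsilon,$$
since selection matrices have unit spectral norm; the same argument applies to the submatrices of $\hat{B} - B$.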
Finally, we remark that one may also use system identification schemes and the associated sample complexity analyses dedicated to sparse system matrices (e.g., [14]). Under some extra assumptions on $A$ and $B$ (e.g., [14]), one may then obtain estimates $\hat{A}$ and $\hat{B}$ that have the same sparsity pattern as $A$ and $B$, and remove the logarithmic factor in $T$ from the error bound in Proposition 1. However, the assumptions on $A$ and $B$ made in, e.g., [14] can be restrictive and hard to check in practice.
4 Control Policy Design
While the estimation of $A$ and $B$ is performed in a centralized manner as we discussed in Section 3.1, we assume that each controller receives the estimates $\hat{A}$ and $\hat{B}$ after we conduct the system identification step described in Algorithm 1. Given the matrices $\hat{A}$, $\hat{B}$, $Q$, and $R$, and the directed graph $\mathcal{G}$ with the delay matrix $D$, in this section we design a control policy that can be implemented in a decentralized manner, while satisfying the information constraints described in Section 2.2. To this end, we leverage the structure of the optimal policy given in Lemma 2 (for known $A$ and $B$). Note that the optimal policy cannot be applied to our scenario, since only $\hat{A}$ and $\hat{B}$ are available.
First, given the directed graph $\mathcal{G}$ and the delay matrix $D$, we construct the information graph $\mathcal{P} = (\mathcal{U}, \mathcal{H})$ given by (7). Recall from Remark 1 that $\mathcal{P}$ is a forest that contains a set of disconnected directed trees. We then let $\mathcal{L}$ denote the set of all the leaf nodes in $\mathcal{P}$, i.e.,
$$\mathcal{L} = \big\{ s_{j,0} : j \in \mathcal{V} \big\} \qquad (19)$$
Moreover, for any $s \in \mathcal{U}$, we denote
$$\mathcal{L}_s = \big\{ r \in \mathcal{L} : r \rightsquigarrow s \big\} \qquad (20)$$
where we write $r \rightsquigarrow s$ if and only if there is a unique directed path from node $r$ to node $s$ in $\mathcal{P}$. In other words, $\mathcal{L}_s$ is the set of leaf nodes in $\mathcal{P}$ that can reach $s$. Moreover, for any $r, s$ such that $r \rightsquigarrow s$, we let $l_{r,s}$ denote the length of the unique directed path from $r$ to $s$ in $\mathcal{P}$; we let $l_{s,s} = 0$. For example, in the information graph (associated with Example 1) given in Fig. 2, we have , , and .
Next, in order to leverage the structure of the optimal policy given in Eqs. (8)-(11), we substitute (submatrices of) $\hat{A}$ and $\hat{B}$ into the right-hand sides of Eqs. (8)-(9), and obtain $\hat{K}_s$ and $\hat{P}_s$ for all $s \in \mathcal{U}$. Specifically, for all $s \in \mathcal{U}$, we obtain $\hat{K}_s$ and $\hat{P}_s$ recursively as
(21)
(22)
where for each $s \in \mathcal{U}$, we let $s^+$ be the unique node such that $s \to s^+$, and $\hat{A}_{s,r}$ (resp., $\hat{B}_{s,r}$) is a submatrix of $\hat{A}$ (resp., $\hat{B}$) obtained in the same manner as $A_{s,r}$ (resp., $B_{s,r}$) described before. Similarly to Eq. (10), we then use $\hat{K}_s$ for all $s \in \mathcal{U}$ together with $\hat{A}$ and $\hat{B}$ to maintain an (estimated) internal state $\hat{\zeta}_s(t)$ (to be defined later) for all $s \in \mathcal{U}$ and for all $t \in \mathbb{Z}_{\geq 0}$, which, via a similar form to Eq. (11), will lead to our control policy, denoted as $\hat{u}_i(t)$, for all $i \in \mathcal{V}$ and for all $t \in \mathbb{Z}_{\geq 0}$. Specifically, for all $i \in \mathcal{V}$ in parallel, we propose Algorithm 2 to compute the control policy
(23)
Input: estimates $\hat{A}$ and $\hat{B}$, cost matrices $Q$ and $R$, directed graph $\mathcal{G}$ with delay matrix $D$, and time horizon length
We now describe the notation used in Algorithm 2 and hereafter. Let us consider any $i \in \mathcal{V}$. In Algorithm 2, we let $\mathcal{P}_i$ denote the set of disconnected directed trees in $\mathcal{P}$ such that the root node of any tree in $\mathcal{P}_i$ contains $i$. Slightly abusing the notation, we also let $\mathcal{P}_i$ denote the set of nodes of all the trees in $\mathcal{P}_i$. Moreover, we denote
(24)
where $\mathcal{L}$ is defined in Eq. (19), i.e., the set in Eq. (24) consists of the leaf nodes of all the trees in $\mathcal{P}_i$. Letting $\mathcal{R}_i$ be the set of root nodes in $\mathcal{P}_i$, we denote
(25)
where we recall from Lemma 1 that any root node in has a self loop. We then see from the information graph given in Fig. 2 that
Note that if any node is a leaf node with a self loop (i.e., an isolated node in $\mathcal{P}$) such as the node in Fig. 2, we only include it in (i.e., but ).
Furthermore, we denote
(26)
where we write if and only if there is a directed path from node to node in , and recall that is the sum of delays along the directed path from to with the smallest accumulative delay. Finally, the memory of Algorithm 2 is initialized as with
(27)
where we initialize for all .
Remark 3.
For any , let be such that and . In Algorithm 2, we assume that the elements in are already ordered such that if , then comes before in . We then let the for loop in lines 5-9 in Algorithm 2 iterate over the elements in according to the above order. Considering the node in the directed graph in Example 1, we see from Fig. 1 and Fig. 2 that , where and . Since and , we assume that the elements in are ordered such that .
Remark 4.
For any $s \in \mathcal{U}$, the dynamics of the internal state $\hat{\zeta}_s(t)$ is given by
(28)
where $\hat{w}_i(t)$ is an estimate of the disturbance $w_i(t)$ in Eq. (2) obtained as
$$\hat{w}_i(t) = x_i(t+1) - \hat{A}_{i,\mathcal{N}_i}\, x_{\mathcal{N}_i}(t) - \hat{B}_{i,\mathcal{N}_i}\, u_{\mathcal{N}_i}(t) \qquad (29)$$
where we replace $A_{i,\mathcal{N}_i}$ and $B_{i,\mathcal{N}_i}$ in Eq. (2) with the estimates $\hat{A}_{i,\mathcal{N}_i}$ and $\hat{B}_{i,\mathcal{N}_i}$, respectively, and $x_{\mathcal{N}_i}(t)$ is the vector that collects $x_j(t)$ for all $j \in \mathcal{N}_i$, with $\mathcal{N}_i$ given in Assumption 1. We note from Eqs. (28)-(29) that , where as we assumed previously. We emphasize that Eqs. (28)-(29) are key to our control policy design, and they also enable our analyses in Section 5, where we provide a suboptimality guarantee for our control policy. As we mentioned in Section 1, the motivation for the control policy given by Eqs. (23), (28)-(29) is that the optimal control policy given in Lemma 2 can be viewed as a disturbance-feedback controller. Since the system matrices $A$ and $B$ are unknown, the control policy constructed in Eqs. (23), (28)-(29) maps the estimates of the past disturbances given by Eq. (29) to the current control input via the estimated internal states given by Eq. (28).
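As an illustration of Eq. (29), the following minimal sketch computes the disturbance estimate in its global (stacked) form and verifies that its error inherits the model estimation error; all numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 2
A = 0.5 * rng.standard_normal((n, n))            # true model (illustrative)
B = rng.standard_normal((n, m))
A_hat = A + 1e-2 * rng.standard_normal((n, n))   # estimated model
B_hat = B + 1e-2 * rng.standard_normal((n, m))

x = rng.standard_normal(n)
u = rng.standard_normal(m)
w = rng.standard_normal(n)
x_next = A @ x + B @ u + w                       # true dynamics, Eq. (3)

w_hat = x_next - A_hat @ x - B_hat @ u           # Eq. (29), stacked form
# The disturbance estimation error inherits the model estimation error:
assert np.allclose(w_hat - w, (A - A_hat) @ x + (B - B_hat) @ u)
```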
Observation 1.
We will show that in each iteration of the for loop in lines 4-14 of Algorithm 2, the internal states for all such that (i.e., for all ) can be determined, via Eq. (28), based on the current memory of the algorithm and the state information contained in (a subset of) the information set defined in Eq. (4). As we will see, Algorithm 2 maintains, in its current memory , the internal states (with potential time delays) for a certain subset of nodes in , via the recursion in Eq. (28). Given those internal states, for all can be determined using Eq. (28). Moreover, the memory of Algorithm 2 is recursively updated in the for loop in lines 4-14 of the algorithm. Formally, we have the following result for Algorithm 2; the proof can be found in Appendix B.
Proposition 2.
Suppose that any controller at any time step has access to the states in defined as
(30)
where , and is defined in Eq. (4). Then, the following properties hold for Algorithm 2:
(a) The memory of Algorithm 2 can be recursively updated such that at the beginning of any iteration of the for loop in lines 4-14 of the algorithm,
(31)
(b) The control input in line 13 can be determined using Eq. (28) and the states in the memory after line 12 (and before line 14) in any iteration of the for loop in lines 4-14 of Algorithm 2.
Since the proof of Proposition 2 is rather involved and requires careful consideration of the structure of the directed graph and the information graph described in Section 2.2, we again use Example 1 to illustrate the steps of Algorithm 2 and the results and proof ideas of Proposition 2.
First, we note from Fig. 1 and Eq. (26) that . Now, let us consider Algorithm 2 with respect to node in the directed graph given in Fig. 1. We see that , which implies via Eq. (30) that for all . One can check that the initial memory of Algorithm 2 given by Eq. (27) satisfies Eq. (31) for , which implies that the memory satisfies Eq. (31) at the beginning of iteration of the for loop in lines 4-14 of the algorithm.
To proceed, let us consider iteration of the for loop in lines 4-14 of the algorithm. Noting that from Remark 3, Algorithm 2 first considers in the for loop in lines 5-9, which implies that in line 7. We then see from Eq. (29) that in order to obtain , we need to know , , , and , where , and are given by Eq. (23). One can then check that the internal states that are needed to determine and are available in the current memory of Algorithm 2 or become available via further applications of Eq. (28). After is obtained, we see from Eq. (28) that can also be obtained. Algorithm 2 then updates its current memory in line 9 and finishes the iteration with respect to in the for loop in lines 5-9. Next, Algorithm 2 considers in the for loop in lines 5-9, which implies that in line 7. Following similar arguments to those above for and noting that the current memory of Algorithm 2 has been updated, one can show that can be obtained from Eq. (28), based on the current memory of the algorithm. Algorithm 2 again updates its current memory in line 9 and finishes the iteration with respect to in the for loop in lines 5-9.
Now, recalling that from Fig. 2, we see that Algorithm 2 considers in line 10. One can also check that can be obtained from Eq. (28), based on the current memory of the algorithm. Finally, based on the current memory of Algorithm 2 after line 12, one can check that the control input can be determined from Eq. (23). Note that Algorithm 2 also removes certain internal states from its current memory in line 14 that will no longer be used. One can check that after the removal, the current memory of Algorithm 2 will satisfy Eq. (31) at the beginning of iteration of the for loop in lines 4-14 of the algorithm, where . One can then repeat the above arguments for iteration of the for loop in lines 4-14 of the algorithm and so on.
Several remarks pertaining to Algorithm 2 are now in order. First, since and , one can show via the definition of Algorithm 2 that the number of states in the memory of Algorithm 2 is always upper bounded by , where we note that defined in Eq. (26) satisfies , and is the number of nodes in the directed graph . Moreover, one can check that Algorithm 2 can be implemented in polynomial time.
Second, it is worth noting that the control policy $\hat{u}_i(t)$ for all $i \in \mathcal{V}$ that we proposed in Eq. (23) is related to the certainty equivalent approach (e.g., [4]) that has been used for learning centralized LQR without any information constraints on the controllers (e.g., [12, 29, 9]). It is known that the optimal solution to the classic centralized LQR problem (i.e., problem (5) without the information constraints) is given by a static state-feedback controller $u(t) = K x(t)$, where $K$ can be obtained from the solution to the Riccati equation corresponding to $A$, $B$, $Q$, and $R$ (e.g., [5]). The corresponding certainty equivalent controller simply takes the form $u(t) = \hat{K} x(t)$, where $\hat{K}$ is obtained from the solution to the Riccati equation corresponding to $\hat{A}$, $\hat{B}$, $Q$, and $R$, with $\hat{A}$ and $\hat{B}$ being the estimates of $A$ and $B$, respectively (a minimal code sketch of this comparison is given at the end of this section). While we also leverage the structure of the optimal control policy given in Eq. (11), we cannot simply replace $K_s$ with $\hat{K}_s$ for all $s \in \mathcal{U}$ in Eq. (11), where $\hat{K}_s$ is given by the Riccati equations in Eqs. (21)-(22). As we argued in Remark 2, this is because the optimal policy is not a static state-feedback controller, but a linear dynamic controller based on the internal states $\zeta_s(t)$ for all $s \in \mathcal{U}$, where the dynamics of $\zeta_s(t)$ given by Eq. (10) also depend on $A$ and $B$. Thus, the control policy that we proposed in Eq. (23) is a linear dynamic controller based on $\hat{K}_s$ and the estimated internal states $\hat{\zeta}_s(t)$ for all $s \in \mathcal{U}$, where the dynamics of $\hat{\zeta}_s(t)$ given by Eq. (28) depend on $\hat{A}$ and $\hat{B}$. Such a more complicated form of the control policy also creates several challenges when we analyze the corresponding suboptimality guarantees in the next section.
Third, for any and any , Proposition 2 only requires controller to have access to a subset of the state information contained in the information set .
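To make the comparison with the certainty equivalent approach in the second remark above concrete, the following minimal sketch computes the centralized certainty equivalent gain from the estimated model via the discrete Riccati equation (using SciPy); in contrast, the decentralized policy in Eq. (23) must additionally maintain the estimated internal states in Eq. (28).

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def certainty_equivalent_gain(A_hat, B_hat, Q, R):
    """Centralized certainty-equivalent LQR: solve the discrete-time
    algebraic Riccati equation for the estimated model and return the
    static gain K_hat, so that u(t) = K_hat @ x(t)."""
    P = solve_discrete_are(A_hat, B_hat, Q, R)
    K = -np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)
    return K, P
```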
5 Suboptimality Guarantees
In this section, we characterize the suboptimality guarantees of the control policy proposed in Section 4. To begin with, in order to explicitly distinguish the states of the system in Eq. (3) corresponding to the control policies given by Eqs. (11) and (23), respectively, we let $\hat{x}(t)$ denote the state of the system in Eq. (3) corresponding to the control policy $\hat{u}(t)$ given by Eq. (23), for $t \in \mathbb{Z}_{\geq 0}$, i.e.,
$$\hat{x}(t+1) = A\, \hat{x}(t) + B\, \hat{u}(t) + w(t) \qquad (32)$$
where we note from Eq. (23) that $\hat{u}(t)$ is determined by $\hat{K}_s$ and $\hat{\zeta}_s(t)$ given by Eqs. (21) and (28), respectively, for all $s \in \mathcal{U}$. We let $x^\star(t)$ denote the state of the system in Eq. (3) corresponding to the optimal control policy $u^\star(t)$ given by Eq. (11), for $t \in \mathbb{Z}_{\geq 0}$, i.e.,
$$x^\star(t+1) = A\, x^\star(t) + B\, u^\star(t) + w(t) \qquad (33)$$
where $u^\star(t)$ is determined by $K_s$ and $\zeta_s(t)$ given by Eqs. (8) and (10), respectively, for all $s \in \mathcal{U}$. In Eqs. (32)-(33), we set $\hat{x}(0) = x^\star(0) = 0$.
Moreover, for our analysis in the sequel, we introduce another (auxiliary) control policy $\tilde{u}(t)$ given by
(34)
for $t \in \mathbb{Z}_{\geq 0}$, where for any $s \in \mathcal{U}$, $\hat{K}_s$ is given by Eq. (21), and $\tilde{\zeta}_s(t)$ is given by
(35)
with $\tilde{\zeta}_s(0) = 0$ for all $s \in \mathcal{U}$. We then let $\tilde{x}(t)$ denote the state of the system in Eq. (3) corresponding to $\tilde{u}(t)$, for $t \in \mathbb{Z}_{\geq 0}$, i.e.,
$$\tilde{x}(t+1) = A\, \tilde{x}(t) + B\, \tilde{u}(t) + w(t) \qquad (36)$$
where $\tilde{u}(t)$ is given by Eq. (34), and we set $\tilde{x}(0) = 0$. Roughly speaking, the auxiliary control policy $\tilde{u}(t)$ and the corresponding internal states $\tilde{\zeta}_s(t)$ introduced above allow us to decompose the suboptimality gap of the control policy $\hat{u}(t)$ into two terms that are due to $\hat{K}_s$ and $\hat{\zeta}_s(t)$, respectively, for all $s \in \mathcal{U}$. We then have the following result; the proof follows directly from [26, Lemma 14] and is thus omitted. Note that Lemma 6 is a consequence of the partially nested information structure and the structure of the information graph described in Section 2.2.
Lemma 6.
For any , the following hold: (a) , for all ; (b) ; (c) and are independent for all with .
Using the above notation, the cost of the optimization problem in (5) corresponding to the control policy $\hat{u}(t)$ (i.e., Eq. (23)) can be written as
$$\hat{J} = \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[ \sum_{t=0}^{T-1} \big( \hat{x}(t)^\top Q\, \hat{x}(t) + \hat{u}(t)^\top R\, \hat{u}(t) \big) \right] \qquad (37)$$
where we use $\limsup$ instead of $\lim$ since the limit may not exist. Furthermore, we let $\tilde{J}$ denote the cost of the optimization problem in (5) corresponding to the control policy $\tilde{u}(t)$ given in Eq. (34) (we will show in Proposition 3 that the limit in Eq. (38) exists):
$$\tilde{J} = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[ \sum_{t=0}^{T-1} \big( \tilde{x}(t)^\top Q\, \tilde{x}(t) + \tilde{u}(t)^\top R\, \tilde{u}(t) \big) \right] \qquad (38)$$
Supposing that the estimates $\hat{A}$ and $\hat{B}$ satisfy $\|\hat{A} - A\| \le \varepsilon$ and $\|\hat{B} - B\| \le \varepsilon$ with $\varepsilon > 0$, our ultimate goal in this section is to provide an upper bound on the suboptimality gap $\hat{J} - J^\star$, where $J^\star$ is the optimal cost given by Eq. (12). To this end, we first decompose $\hat{J} - J^\star$ into $\tilde{J} - J^\star$ and $\hat{J} - \tilde{J}$, and then upper bound the two terms separately. Such a decomposition is enabled by the structure of the control policy described in Section 4. Specifically, one may view $\tilde{J} - J^\star$ as the suboptimality due to $\hat{K}_s$ given by Eq. (21) for all $s \in \mathcal{U}$, and view $\hat{J} - \tilde{J}$ as the suboptimality due to $\hat{\zeta}_s(t)$ given by Eq. (28) for all $s \in \mathcal{U}$ and for all $t \in \mathbb{Z}_{\geq 0}$. Moreover, the suboptimality introduced by the estimated internal states is due to the fact that their dynamics in Eq. (28) are characterized by $\hat{A}$, $\hat{B}$ and $\hat{K}_s$, and driven by $\hat{w}(t)$ given by Eq. (29), which is an estimate of the disturbance $w(t)$ in Eq. (3).
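In the above notation, the decomposition just described reads
$$\hat{J} - J^\star = \underbrace{\big( \tilde{J} - J^\star \big)}_{\text{due to the estimated gains, Eq. (21)}} + \underbrace{\big( \hat{J} - \tilde{J} \big)}_{\text{due to the estimated internal states, Eqs. (28)-(29)}}.$$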
To proceed, we recall from Lemma 2 that for any $s \in \mathcal{U}$ that has a self loop, the matrix $A_{s,s} + B_{s,s} K_s$ is stable, where $K_s$ is given by Eq. (8). We then have from the Gelfand formula that for any such $s$, there exist $\kappa_s \ge 1$ and $\rho_s$ with $0 < \rho_s < 1$ such that $\|(A_{s,s} + B_{s,s} K_s)^k\| \le \kappa_s \rho_s^k$ for all $k \in \mathbb{Z}_{\geq 0}$. For notational simplicity, let us denote
(39)
where denotes the set of root nodes in , and and with are given in Assumption 3. Thus, we see from Assumption 3 and our above arguments that for all and for all , and for all , where and . Moreover, we denote
(40)
For our analysis in this section, we will make the following assumption; similar assumptions can be found in, e.g., [10, 29, 12].
Assumption 4.
The cost matrices $Q$ and $R$ in (5) satisfy $Q \succeq I$ and $R \succeq I$.
Note that the above assumption is not more restrictive than assuming that $Q$ and $R$ are positive definite. Specifically, supposing $Q \succ 0$ and $R \succ 0$, one can assume without loss of generality that $Q \succeq I$ and $R \succeq I$. This is because one can check that scaling the objective function in (5) by a positive constant does not change the optimal policy $u_i^\star$ in the optimal solution to (5) provided in Lemma 2, for any $i \in \mathcal{V}$.
5.1 Perturbation Bounds on Solutions to Riccati Equations
Supposing $\|\hat{A} - A\| \le \varepsilon$ and $\|\hat{B} - B\| \le \varepsilon$ with $\varepsilon > 0$, in this subsection we aim to provide upper bounds on the perturbations $\|\hat{P}_s - P_s\|$ and $\|\hat{K}_s - K_s\|$ for all $s \in \mathcal{U}$, where $P_s$ (resp., $\hat{P}_s$) is given by Eq. (9) (resp., Eq. (22)), and $K_s$ (resp., $\hat{K}_s$) is given by Eq. (8) (resp., Eq. (21)). We note from Lemma 2 that for any $s \in \mathcal{U}$ that has a self loop, Eq. (9) (resp., Eq. (22)) reduces to a discrete Riccati equation in $P_s$ (resp., $\hat{P}_s$). The following results characterize the bounds on $\|\hat{P}_s - P_s\|$ and $\|\hat{K}_s - K_s\|$, for all $s \in \mathcal{U}$; the proofs can be found in Appendix C.
Lemma 7.
Suppose Assumptions 2 and 4 hold, and $\|\hat{A} - A\| \le \varepsilon$ and $\|\hat{B} - B\| \le \varepsilon$, where $\varepsilon > 0$. Then, for any $s \in \mathcal{U}$ that has a self loop, the following hold:
(41)
(42)
and
(43)
under the assumption that
(44)
where (resp., ) is given by Eq. (9) (resp., Eq. (22)), (resp., ) is given by Eq. (8) (resp., (21)), and are defined in (39), and is defined in (40).
Lemma 8.
Suppose Assumptions 2 and 4 hold, and $\|\hat{A} - A\| \le \varepsilon$ and $\|\hat{B} - B\| \le \varepsilon$, where $\varepsilon > 0$. Then, for any $s \in \mathcal{U}$ that does not have a self loop, the following hold:
(45)
and
(46)
under the assumption that
(47)
where $K_s$ (resp., $\hat{K}_s$) is given by Eq. (8) (resp., Eq. (21)), $P_s$ (resp., $\hat{P}_s$) is given by Eq. (9) (resp., Eq. (22)), is defined in (40), and are defined in (39), is the length of the unique directed path in $\mathcal{P}$ from node $s$ to the unique root node that is reachable from $s$, and is defined in Eq. (26).
Consider any $s \in \mathcal{U}$ with a self loop and suppose Eq. (44) holds. One can show via Eq. (43) and [29, Lemma 12] that $\hat{K}_s$ given by Eq. (21) is also stabilizing for the pair , i.e., is stable (see our arguments for (102) in Appendix D for more details). Moreover, it is well known (e.g., [5]) that a stabilizing solution to the Riccati equation in Eq. (22) exists if and only if is stabilizable and (with ) is detectable (a solution to the Riccati equation in Eq. (22) is stabilizing if and only if the corresponding closed-loop matrix, with the gain given by Eq. (21), is stable). The above arguments together also imply that is stabilizable and (with ) is detectable for all , under the assumption on $\varepsilon$ given by Eq. (44).
5.2 Perturbation Bounds on Costs
Suppose $\|\hat{A} - A\| \le \varepsilon$ and $\|\hat{B} - B\| \le \varepsilon$, where $\varepsilon > 0$. In this subsection, we aim to provide an upper bound on $\hat{J} - J^\star$ that scales linearly with $\varepsilon$, where $J^\star$ and $\hat{J}$ are given by Eqs. (12) and (37), respectively.
Lemma 9.
For our analysis in the sequel, we further define recursively, for all , as
(49)
where is given by Eq. (21), and is the unique node such that . We then have the following result, which gives an upper bound on .
Proposition 3.
Next, we aim to provide an upper bound on . We first prove the following result.
Lemma 10.
For notational simplicity in the sequel, let us denote
(54)
We then have the following results.
Lemma 11.
Suppose Assumptions 2-4 hold, and $\|\hat{A} - A\| \le \varepsilon$ and $\|\hat{B} - B\| \le \varepsilon$, where $\varepsilon \le \bar{\varepsilon}$ and $\bar{\varepsilon}$ is defined in (54). Then, for all $t \in \mathbb{Z}_{\geq 0}$,
(55)
and
(56)
where (resp., ) is given by Eq. (23) (resp., Eq. (34)), (resp., ) is given by Eq. (32) (resp., Eq. (36)), and are defined in (40), and are defined in (39), and , is defined in Eq. (26), and is defined in (54).
Corollary 1.
Proof.
Proposition 4.
5.3 Sample Complexity Result
We are now in a position to present the sample complexity result for learning decentralized LQR with the partially nested information structure described in Section 2.2.
Theorem 1.
Suppose Assumptions 2-4 hold, and Algorithm 1 is used to obtain and . Moreover, suppose and , where , and , where is defined in Eq. (26) and is a universal constant. Consider any . Let the input parameters to Algorithm 1 satisfy and , where
and
where and are given in Assumption 3, , , and is defined in (54). Then, with probability at least ,
(60)
where and are given in Eqs. (37) and (12), respectively, is a universal constant, and are defined in (39), and and are defined in (40), , and .
Proof.
Note that the results in Propositions 3-4 hold, if and with given in (54). Thus, letting and , one can first check that , and then obtain from Proposition 1 that with probability at least , and returned by Algorithm 1 satisfy that and . Now, noting that , where is a universal constant, and setting in Proposition 3, one can then show via Propositions 3-4 that (60) holds with probability at least . ∎
Thus, we have shown an end-to-end sample complexity result for learning decentralized LQR with the partially nested information structure. In other words, we relate the number of data samples used for estimating the system model to the performance of the control policy proposed in Section 4. Note that our result in Theorem 1 matches the sample complexity result (up to logarithmic factors in $T$) provided in [12] for learning centralized LQR without any information constraints. Also note that the sample complexity for learning centralized LQR has been improved to $\tilde{O}(1/T)$ in [29]. Specifically, the authors in [29] showed that the gap between the cost corresponding to the control policy they proposed and the optimal cost is upper bounded by $O(\varepsilon^2)$, where $\|\hat{A} - A\| \le \varepsilon$ and $\|\hat{B} - B\| \le \varepsilon$, if $\varepsilon$ is sufficiently small. Due to the additional challenges introduced by the information constraints on the controllers (see our discussions at the end of Section 4), we leave investigating the possibility of improving our sample complexity result in Theorem 1 for future work.
6 Numerical Results
In this section, we illustrate the sample complexity result provided in Theorem 1 with numerical experiments based on Example 1. Specifically, we consider the LTI system given by Eq. (6), with the corresponding directed graph and information graph given by Fig. 1 and Fig. 2, respectively. Under the sparsity pattern of $A$ and $B$ specified in Eq. (6), we generate the nonzero entries in $A$ and $B$ independently from a Gaussian distribution while ensuring that Assumption 3 is satisfied. We set the covariance of the zero-mean white Gaussian noise process to be , and set the cost matrices to be and . Moreover, we set the input sequence used in the system identification algorithm (Algorithm 1) to be for all . In order to approximate the value of $\hat{J}$ defined in Eq. (37), we simulate the system using Algorithm 2 for and obtain . Fixing the randomly generated matrices $A$ and $B$ described above, the numerical results presented in this section are obtained by averaging over independent experiments.
Figure 3: (a) Estimation error of Algorithm 1 versus the number of data samples; (b) cost suboptimality versus the number of data samples.
In Fig. 3(a), we plot the estimation error corresponding to Algorithm 1 as we range the number of data samples used in Algorithm 1 from to . Similarly, in Fig. 3(b) we plot the curve corresponding to the cost suboptimality , where is obtained from the closed-form expression given in Eq. (12). From Fig. 3, we observe that the estimation error and the cost suboptimality exhibit a similar dependence on the number of data samples. This aligns with the results shown in Proposition 1 and Theorem 1 that both the estimation error and the cost suboptimality scale as , which is a consequence of the results shown in Propositions 3-4 that the cost suboptimality scales linearly with the estimation error. The results presented in Fig. 3 thus also suggest that our suboptimality bounds provided in Propositions 3-4 can be tight for certain instances of the problem. Finally, we observe from the shaded regions in Fig. 3 that the cost suboptimality is more sensitive to the randomness introduced by the random input for and the noise for , when we run the experiments described above. This is potentially due to the fact that we approximated the cost suboptimality as with .
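The cost corresponding to a given policy is approximated by finite-horizon rollouts, as described above; the following minimal sketch illustrates such an approximation, where the policy is passed as a callable (e.g., a wrapper around Algorithm 2) and the horizon and number of runs are illustrative choices.

```python
import numpy as np

def empirical_cost(step_policy, A, B, Q, R, T=5000, n_runs=10,
                   rng=np.random.default_rng(2)):
    """Approximate the infinite-horizon average LQR cost in Eq. (37) by a
    finite rollout of length T, averaged over independent noise realizations."""
    n = A.shape[0]
    costs = []
    for _ in range(n_runs):
        x, total = np.zeros(n), 0.0
        for t in range(T):
            u = step_policy(x, t)                 # e.g., wraps Algorithm 2
            total += x @ Q @ x + u @ R @ u
            x = A @ x + B @ u + rng.standard_normal(n)   # w(t) ~ N(0, I)
        costs.append(total / T)
    return np.mean(costs)
```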
7 Conclusion
We considered the problem of control policy design for decentralized state-feedback linear quadratic control with a partially nested information structure, when the system model is unknown. We took a model-based learning approach consisting of two steps. First, we estimated the unknown system model from a single system trajectory of finite length, using least squares estimation. Next, we designed a control policy based on the estimated system model, which satisfies the desired information constraints. We showed that the suboptimality gap between our control policy and the optimal decentralized control policy (designed using accurate knowledge of the system model) scales linearly with the estimation error of the system model. Combining the above results, we provided an end-to-end sample complexity result for learning decentralized controllers for state-feedback linear quadratic control with a partially nested information structure.
References
- Abbasi-Yadkori and Szepesvári [2011] Y. Abbasi-Yadkori and C. Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proc. Conference on Learning Theory, pages 1–26, 2011.
- Abbasi-Yadkori et al. [2011] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Proc. Advances in neural information processing systems, volume 24, pages 2312–2320, 2011.
- Abbasi-Yadkori et al. [2019] Y. Abbasi-Yadkori, P. Bartlett, K. Bhatia, N. Lazic, C. Szepesvari, and G. Weisz. Politex: Regret bounds for policy iteration using expert prediction. In Proc. International Conference on Machine Learning, pages 3692–3702, 2019.
- Åström and Wittenmark [2008] K. J. Åström and B. Wittenmark. Adaptive Control. Courier Corporation, 2008.
- Bertsekas [2017] D. P. Bertsekas. Dynamic programming and optimal control: Vol. 2 4th Edition. Athena Scientific, 2017.
- Blondel and Tsitsiklis [2000] V. D. Blondel and J. N. Tsitsiklis. A survey of computational complexity results in systems and control. Automatica, 36(9):1249–1274, 2000.
- Bu et al. [2019] J. Bu, A. Mesbahi, M. Fazel, and M. Mesbahi. LQR through the lens of first order methods: Discrete-time case. arXiv preprint arXiv:1907.08921, 2019.
- Cassel and Koren [2021] A. Cassel and T. Koren. Online policy gradient for model free learning of linear quadratic regulators with regret. arXiv preprint arXiv:2102.12608, 2021.
- Cassel et al. [2020] A. Cassel, A. Cohen, and T. Koren. Logarithmic regret for learning linear quadratic regulators efficiently. In Proc. International Conference on Machine Learning, pages 1328–1337, 2020.
- Cohen et al. [2019] A. Cohen, T. Koren, and Y. Mansour. Learning linear-quadratic regulators efficiently with only $\sqrt{T}$ regret. In Proc. International Conference on Machine Learning, pages 1300–1309, 2019.
- Dean et al. [2018] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Proc. International Conference on Neural Information Processing Systems, pages 4192–4201, 2018.
- Dean et al. [2020] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, 20(4):633–679, 2020.
- Faradonbeh et al. [2018] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis. Finite time identification in unstable linear systems. Automatica, 96:342–353, 2018.
- Fattahi et al. [2019] S. Fattahi, N. Matni, and S. Sojoudi. Learning sparse dynamical systems from a single sample trajectory. In Proc. IEEE Conference on Decision and Control, pages 2682–2689, 2019.
- Fattahi et al. [2020] S. Fattahi, N. Matni, and S. Sojoudi. Efficient learning of distributed linear-quadratic control policies. SIAM Journal on Control and Optimization, 58(5):2927–2951, 2020.
- Fazel et al. [2018] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In Proc. International Conference on Machine Learning, pages 1467–1476, 2018.
- Feng and Lavaei [2019] H. Feng and J. Lavaei. On the exponential number of connected components for the feasible set of optimal decentralized control problems. In Proc. American Control Conference, pages 1430–1437, 2019.
- Furieri et al. [2020] L. Furieri, Y. Zheng, and M. Kamgarpour. Learning the globally optimal distributed LQ regulator. In Proc. Learning for Dynamics and Control Conference, pages 287–297, 2020.
- Ghadimi and Lan [2013] S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- Gravell et al. [2020] B. Gravell, P. M. Esfahani, and T. H. Summers. Learning optimal controllers for linear systems with multiplicative noise via policy gradient. IEEE Transactions on Automatic Control, 2020.
- Ho et al. [1972] Y.-C. Ho et al. Team decision theory and information structures in optimal control problems–Part I. IEEE Transactions on Automatic Control, 17(1):15–22, 1972.
- Horn and Johnson [2012] R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge university press, 2012.
- Hou and Wang [2013] Z.-S. Hou and Z. Wang. From model-based control to data-driven control: Survey, classification and perspective. Information Sciences, 235:3–35, 2013.
- Lale et al. [2020] S. Lale, K. Azizzadenesheli, B. Hassibi, and A. Anandkumar. Logarithmic regret bound in partially observable linear dynamical systems. arXiv preprint arXiv:2003.11227, 2020.
- Lamperski and Doyle [2012] A. Lamperski and J. C. Doyle. Dynamic programming solutions for decentralized state-feedback LQG problems with communication delays. In Proc. American Control Conference, pages 6322–6327, 2012.
- Lamperski and Lessard [2015] A. Lamperski and L. Lessard. Optimal decentralized state-feedback control with sparsity and delays. Automatica, 58:143–151, 2015.
- Li et al. [2019] Y. Li, Y. Tang, R. Zhang, and N. Li. Distributed reinforcement learning for decentralized linear quadratic control: A derivative-free policy optimization approach. arXiv preprint arXiv:1912.09135, 2019.
- Malik et al. [2020] D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. L. Bartlett, and M. J. Wainwright. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. Journal of Machine Learning Research, 21(21):1–51, 2020.
- Mania et al. [2019] H. Mania, S. Tu, and B. Recht. Certainty equivalence is efficient for linear quadratic control. In Proc. International Conference on Neural Information Processing Systems, pages 10154–10164, 2019.
- Nesterov and Spokoiny [2017] Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
- Papadimitriou and Tsitsiklis [1986] C. H. Papadimitriou and J. Tsitsiklis. Intractable problems in control theory. SIAM Journal on Control and Optimization, 24(4):639–654, 1986.
- Rotkowitz and Lall [2005] M. Rotkowitz and S. Lall. A characterization of convex problems in decentralized control. IEEE Transactions on Automatic Control, 50(12):1984–1996, 2005.
- Rotkowitz and Martins [2011] M. C. Rotkowitz and N. C. Martins. On the nearest quadratically invariant information constraint. IEEE Transactions on Automatic Control, 57(5):1314–1319, 2011.
- Sarkar and Rakhlin [2019] T. Sarkar and A. Rakhlin. Near optimal finite time identification of arbitrary linear dynamical systems. In Proc. International Conference on Machine Learning, pages 5610–5618, 2019.
- Shah and Parrilo [2013] P. Shah and P. A. Parrilo. $\mathcal{H}_2$-optimal decentralized control over posets: A state-space solution for state-feedback. IEEE Transactions on Automatic Control, 58(12):3084–3096, 2013.
- Simchowitz et al. [2018] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht. Learning without mixing: Towards a sharp analysis of linear system identification. In Proc. Conference On Learning Theory, pages 439–473, 2018.
- Simchowitz et al. [2020] M. Simchowitz, K. Singh, and E. Hazan. Improper learning for non-stochastic control. In Proc. Conference on Learning Theory, pages 3320–3436, 2020.
- Tu and Recht [2019] S. Tu and B. Recht. The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. In Proc. Conference on Learning Theory, pages 3036–3083, 2019.
- Witsenhausen [1968] H. S. Witsenhausen. A counterexample in stochastic optimum control. SIAM Journal on Control, 6(1):131–147, 1968.
- Yu et al. [2022] J. Yu, D. Ho, and A. Wierman. Online stabilization of unknown networked systems with communication constraints. arXiv preprint arXiv:2203.02630, 2022.
- Zhang et al. [2020] K. Zhang, B. Hu, and T. Basar. Policy optimization for linear control with robustness guarantee: Implicit regularization and global convergence. In Proc. Learning for Dynamics and Control Conference, pages 179–190, 2020.
- Zheng et al. [2020] Y. Zheng, L. Furieri, A. Papachristodoulou, N. Li, and M. Kamgarpour. On the equivalence of Youla, system-level, and input–output parameterizations. IEEE Transactions on Automatic Control, 66(1):413–420, 2020.
- Zheng et al. [2021] Y. Zheng, L. Furieri, M. Kamgarpour, and N. Li. Sample complexity of linear quadratic Gaussian (LQG) control for output feedback systems. In Proc. Learning for Dynamics and Control Conference, pages 559–570, 2021.
Appendix
Appendix A Proofs for Least Squares Estimation of System Matrices
A.1 Proof of Lemma 4
First, let us consider . We can apply Lemma 12 with and obtain that . Similarly, recalling from Algorithm 1 that for all , we have from Lemma 12 that . Next, let us consider . Applying Lemma 3 with , we obtain .
Finally, let us consider . Consider the sequence of random vectors and the filtration , where is defined in (13), and for all . For any , we note from Eq. (3) that is conditionally Gaussian on and , with
Note again from Algorithm 1 that for all , and that is assumed to be independent of for all . We then see that is also conditionally Gaussian on and , with
Now, we can apply Lemma 13 with and described above, and let . Since , we have from Lemma 13 that holds with probability at least . Combining the above arguments together and applying a union bound over the events , , and , we complete the proof of the lemma. ∎
A.2 Proof of Lemma 5
First, considering any , we denote . We see from (15) that
Next, for any , we see from Eq. (3) that
where recall that we assumed previously that . Since is stable from Assumption 3, we know that for all , where and . It now follows from the above arguments that
for all . Noting that , we then have
for all . ∎
A.3 Proof of Proposition 1
First, we see from Eq. (15) that on the event defined in Eq. (16), the following holds:
(61)
where recall that and , and where and are defined in (13), and is given by (14). To obtain (61), we also use the fact that , which implies via that . Moreover, we have from Eq. (15) that on the event , the following holds:
where the second inequality follows from the choice of . Combining the above arguments together, one can show that
Noting from Lemma 5 that for all , one can use similar arguments to those for [9, Lemma 37], and show that
Noting that , we have . It then follows that
Noting that Algorithm 1 extracts and from , i.e., , one can show that and . Finally, since we know from Lemma 4 that , we conclude that and hold with probability at least . ∎
Appendix B Proof for Algorithm 2
B.1 Proof of Proposition 2
To prove part (a), we use an induction on . For the base case , we see directly from line 4 in Algorithm 2 and Eq. (27) that satisfies Eq. (31) (with ) at the beginning of iteration of the for loop in lines 4-14 of the algorithm. For the induction step, consider any and suppose the memory satisfies Eq. (31) at the beginning of iteration of the for loop in lines 4-14 of the algorithm.
To proceed, let us consider any with and in the for loop in lines 5-9 of Algorithm 2, where is defined in Eq. (24). We will show that in line 7, and thus in line 8, can be determined using Eq. (28) and the current memory of Algorithm 2. As suggested by the first two cases in Eq. (29), we can focus on the case when (otherwise, can be directly determined). We then note from the third case in Eq. (29) that in order to determine , we need to know , and and for all , where is given in Assumption 1. Also note that for all , which implies that . Thus, we have that , and for all , where is defined in Eq. (30). Now, considering any , we note from Eq. (23) that . Thus, in order to determine , it suffices to determine for all such that (i.e., for all ). Note that , i.e., there exists a directed path from node to node in . Moreover, noting the definition of given by (7) with its properties discussed in Lemma 1 and Remark 1, and noting the way we defined the set , one can show that for any , it holds that . Next, considering any , we recall from Eq. (20) that denotes the set of leaf nodes that can reach in , i.e., , where the second equality again follows from the properties of and the definition of in Eq. (24). Now, we split our arguments into two cases: is a root node in (i.e., has a self loop), and does not have a self loop.
First, suppose has a self loop. For the case when is a leaf node in (i.e., is an isolated node in and ), we see that with given by Eq. (31), for all , where and . Noting from the definition of that , we have from the construction of in Eq. (7) that in with . It follows that with given by Eq. (31). Thus, we focus on the case when is not a leaf node in , i.e., . We now see from Eq. (28) that given , and for all such that (with ) in , the state can be determined. Let us consider any such that (with ) in , and denote , where we note that . For any , one can recursively apply Eq. (28) to show that given for all , the state can be determined, where is the length of the (unique) directed path from to in . Further considering any , and noting that , we have that for all with and , where is given by Eq. (31). Recalling again the definition of in (7), and noting that , , , and in , one can show that
(62)
We further split our arguments into and . First, supposing , we have
(63)
Second, suppose . Recall from Remark 3 that we let the for loop in lines 5-9 of Algorithm 2 iterate over the elements in according to a certain order of the elements in . We then see from the inequality that (with and ) has already been considered by the for loop in lines 5-9 in Algorithm 2, i.e., the states for all are in the current memory of Algorithm 2, denoted as , when we consider the with and in the for loop in lines 5-9 of the algorithm. Moreover, we have from the above arguments that in , i.e., there is a directed path from node to node that goes through node in . It then follows that
(64)
where . Combining (62) and (64), we obtain
(65)
It then follows from (63) and (65) and our arguments above that the states for all are in the current memory described above, for all . Combining the above arguments together, we have that can be determined from Eq. (28) and the current memory , for all and for all such that (with ) in . Moreover, recalling that as we argued above, we see from Eq. (31) that . One can now apply Eq. (28) multiple times to show that can be determined from the current memory described above, for all .
Next, suppose does not have a self loop. Similarly to our arguments above, we first consider the case when is a leaf node in , i.e., . We see that , where with , and is defined in Eq. (31). Since , we have from the construction of in (7) that in with . Noting that in as we argued above, we then have the following:
(66)
where . Now, supposing , we see directly from Eq. (31) that holds. Supposing , we see from (66) that . Using similar arguments to those above for the case when has a self loop (particularly, the order of the elements in over which the for loop in lines 5-9 of Algorithm 2 iterates), one can show that the states for all have been added to the current memory of Algorithm 2, denoted as , when we consider the with and in the for loop in lines 5-9 of the algorithm. It follows that . Then, we consider the case when is not a leaf node in . We see from Eq. (28) that given for all such that (with ) in , the state can be determined. The remaining arguments then follow directly from those above for the case when has a self loop.
In summary, we have shown that can be determined from Eq. (28) and the current memory of Algorithm 2, for all , for all , and for all with and . It then follows from our arguments above that in line 7 of Algorithm 2, and thus in line 8 of Algorithm 2, can be determined using Eq. (28) and the current memory of Algorithm 2, for all with and . In other words, we have shown that can be added to the memory of Algorithm 2 in line 9, for all with and .
Now, let us consider any with defined in Eq. (25). We will show that can be determined using Eq. (28) and the states in given by Eq. (31), for all such that in . Note from our definition of in Eq. (25) that is not a leaf node. Following similar arguments to those above, let us consider any such that (with ) in , and denote . Further considering any , and noting that , we have that for all , where is given by Eq. (31), and with . Similarly to Eq. (62), we have that , which implies that . Therefore, we see that with given by Eq. (31), for all . Using similar arguments to those above, one can now recursively apply Eq. (28) to show that can be determined from given by Eq. (31), for all such that (with ). Since , we see from Eq. (28) that can be determined from given by Eq. (31). Thus, we have shown that can be added to the memory of Algorithm 2 in line 12, for all .
Combining all the above arguments together and noting line 14 in Algorithm 2, we see that at the beginning of iteration of the for loop in lines 4-14 of Algorithm 2, the memory of the algorithm satisfies
(67)
This completes the induction step for the proof of Eq. (31), and thus completes the proof of part (a).
We then prove part (b). Consider any . In order to prove part (b), it suffices for us to show that for all can be determined using Eq. (28) and the memory after line 12 (and before line 14) in iteration of the for loop in lines 4-14 of Algorithm 2, which is given by
(68)
Considering any , one can show via the definition of in (7) and the definition of that . Again, we split our arguments into two cases: has a self loop, and does not have a self loop.
First, suppose has a self loop. For the case when (i.e., is an isolated node in ), we see that with given by Eq. (68), where with . Noting that , we see from the definitions of in (7) that in with . It follows that with given by Eq. (68). Thus, we focus on the case when is not a leaf node, i.e., . We see from Eq. (28) that given , and for all (with ) in , the state can be determined. Again, let us consider any such that in , and denote . Further considering any , and noting that , we have that with given by Eq. (68), for all , where and . Similarly to (62), we have that , which implies that
(69)
It then follows from (69) that with given by Eq. (68), for all , and for all . Using similar arguments to those before, one can recursively use Eq. (28) to show that can be determined from given by Eq. (68), for all . Moreover, recalling that as we argued above, we see that with given by Eq. (68). One can then apply Eq. (28) multiple times and show that can be determined from given by Eq. (68). Next, suppose does not have a self loop. Using similar arguments to those above for the case when has a self loop, one can show that can be determined using Eq. (28) and the current memory given in Eq. (68). ∎
Appendix C Proofs for Perturbation Bounds on Solutions to Riccati Equations
C.1 Proof of Lemma 7
Consider any that has a self loop. To show that (41) holds under the assumption on given in (44), we use [29, Proposition 2]. Specifically, since and hold, one can show that and hold, for all . Similarly, noting that and hold, for all , one can show that . Moreover, note that and from Assumption 4, where and . Recalling the definitions of , and , the proof of (41) under the assumption on given in (44) now follows from [29, Proposition 2].
Next, let denote
and note that . Moreover, we see from Eq. (21) that
where and . We have
which implies that
Noting that and recalling the definition of given in (40), we have
Similarly, one can show that
Now, following similar arguments to those in the proof of [29, Lemma 2], one can show that
where the second inequality follows from the assumption on given in (44). ∎
C.2 Proof of Lemma 8
First, let us consider any such that , i.e., . Since is the unique root node that is reachable from , we see from Lemma 1 and Remark 1 that has a self loop. Noting that from Assumption 4, and that , we see that any satisfying (47) also satisfies (44). Thus, we have from (41) in Lemma 7 that
where
and we note that . Moreover, we see from Eq. (21) that
where and (since and ). Now, using similar arguments to those in the proof of Lemma 7, one can also show that
and
Using similar arguments to those in the proof of [29, Lemma 2], one can now show that
(70)
where the second inequality follows from the assumption on given in (47), and we note that . Hence, we have shown that (45) holds for
To prove Eq. (46) for , we first recall the expressions for and given in Eqs. (9) and (22), respectively. Using similar arguments to those above, we have
(71)
where the first inequality in (71) follows from (70) and the definition of given in (40). To obtain the second inequality in (71), we first note from Eq. (9) that , where the last inequality follows from Assumption 4. We then have from (40) that , which further implies that . It follows that the second inequality in (71) holds. To proceed, denoting and , we have from Eqs. (9) and (22) that
(72)
where we note that . From the above arguments, we have the following:
and
Noting that , one can now obtain from (72) that
(73)
where the second inequality again follows from the assumption on given in (47).
Next, let us consider any such that (where is the unique root node that is reachable from ), and denote the unique directed path from to in as . We see that satisfies (70) and (73). Repeating the above arguments for obtaining (70) and (73) one more time, one can show that
and
where we use again the assumption on given by (47). Further repeating the above arguments, and noting from Eq. (26) and the definition of in (7) that for all , one can show that (45)-(46) hold, for all without a self loop, under the assumption on given in (47). ∎
Appendix D Proofs for Perturbation Bounds on Costs
D.1 Proof of Lemma 9
First, let us consider any that has a self loop. Noting the construction of the information graph given in (7), one can show that Eq. (35) can be rewritten as
(74)
where is the set of leaf nodes in that can reach , is the length of the (unique) directed path from node to node in with if , and
with if , where is given by Eq. (8) for all , and are the nodes along the directed path from to in . Recalling from Eq. (35) that , in Eq. (74) we set if . Now, under the assumption on given in Eq. (47), we see from (45) in Lemma 8 that , which implies that , for all with . Noting that from the construction of , we have , for all . Considering any and denoting
(75)
we have
where we use the fact from that and are independent for all with , and the fact that for any with , is the only noise term such that (see Footnote 2). Moreover, we see that and are independent for all with , and that is independent of for all . Now, considering any such that for all , and noting that for all , we have
(76)
Let us denote the right-hand side of Eq. (76) as , and denote
Fixing any such that for all , and considering any , one can then unroll Eq. (74) and show that
(77)
Under the assumption on given in (47), one can obtain from Lemma 7 that for all , where , which implies that is stable. It follows that
Noting that from the definition of given in (7), and that for any with , is the only noise term such that , as we argued above, we have from Eq. (76) that
where the second inequality follows from the fact that as we argued above. It then follows that (48) holds. ∎
D.2 Proof of Proposition 3
First, since satisfies (47) (and thus (44)), we have from (43) in Lemma 8 that is stable for any that has a self loop. Now, using similar arguments to those for the proofs of Theorem 2 and Corollary 4 in [26], and leveraging Lemma 6 and Eqs. (34)-(35), (21) and (49), one can show that Eq. (50) holds. To proceed, for any and for any , we set , and define
(78)
where is the optimal control policy given by Eq. (11), for all and for all , and where and are given by Eqs. (8) and (10), respectively, for all . Moreover, on the right-hand side of Eq. (78), we set , where given by Eq. (36) is the state after applying the control policy in Eq. (34) for . Noting that as we discussed at the beginning of Section 5, and that , we see that
where we use the fact that .
Recalling the definition of in Eq. (38), we denote . We then have the following:
Using similar arguments to those for the proof of [26, Theorem 2], one can show that
(79)
where for . To obtain for all and for all in Eq. (79), we use the following recursion:
(80)
initialized with for all , where , and for each , we let be the unique node such that , and is given by Eq. (8) for all . Combining the above arguments together, we obtain the following:
(81)
(82)
(83)
where for each in Eq. (83), we let be the unique node such that . To obtain Eq. (81), we note from Lemma 6 that for all , where for all , and and are independent for all with . Moreover, we note from Eq. (34) that for all , where is given by Eq. (21). Combining the above arguments together, and recalling that , we obtain Eq. (81). To obtain Eq. (82), we first apply Eq. (35) and notice . Next, we use the facts that the information graph defined in (7) is a tree (see Lemma 1 and Remark 1), and that and are independent for all with and for all , as we argued above. To obtain Eq. (83), we leverage again the tree structure of .
Now, leveraging the recursion in Eq. (80), and using similar arguments to those for the proof of [16, Lemma 12], one can show via (83) that
Recall from Lemma 2 that is stable for any that has a self loop. Using similar arguments to those for the proof of [26, Corollary 4], one can show via Eq. (80) that as , for all and for all , where is given by Eq. (9). It then follows that
(84)
where the first equality follows from Eq. (8), and the second equality follows from the fact that the limit exists, for all , as we argued in the proof of Lemma 9.
Finally, substituting (48) in Lemma 9 into the right-hand side of (84), we obtain
(85)
where the third inequality follows from the fact that for all , with and . To obtain (85), we first note that is assumed to satisfy (47) (and thus (44)). Recalling , and for all as we assumed previously, we then obtain (85) from Lemmas 7-8, where we also use the fact that for all . ∎
D.3 Proof of Lemma 10
First, considering any and any , we have
Following similar arguments to those in the proof of Lemma 9 (particularly Eq. (77)), one can show that
where , and . Since satisfies (47) and thus (44), we see from Lemma 7 that for all . It now follows that
Combining the above arguments together, we obtain (51). ∎
D.4 Proof of Lemma 11
For notational simplicity in this proof, we denote
and
We first prove (55). Based on the above notation, we can show that
where the first inequality follows from the facts that , and that since (see our arguments in the proof of Lemma 8). We then have
(86)
Thus, in order to show that (55) holds for all , it suffices to show that holds for all . To this end, we prove via an induction on . For any , we recall from Eqs. (23) and (34) that and , respectively, for all , where and are given by Eqs. (28) and (35), respectively, and is given by Eq. (21), for all . As we argued before, in Eqs. (28) and (35) we have for all . Hence, we have , which implies that (55) holds for , completing the proof of the base step of the induction.
For the induction step, suppose holds for all . Now, considering any , we can unroll the expressions of and given by Eqs. (32) and (36), respectively, and obtain
and
where we note that . It then follows that
(87)
where the first inequality follows from Lemma 14. To obtain the first inequality in (87), we use the induction hypothesis. To obtain the second inequality in (87), we use the fact that (with ), for all , from Assumption 3. Recalling from our arguments in Section 4 (particularly, Eq. (29)), one can show that
where is an estimate of in Eq. (3). From Eq. (32), we see that
It follows that
Recall from Lemma 10 that and , for all . We then obtain
where the first inequality follows again from Lemma 14, and the second inequality uses (87). Similarly, we have
It then follows that
where the first inequality again follows from Lemma 14, and the second inequality follows from and . Denoting
(88)
we have
(89)
Moreover, note that
(90)
To proceed, let us consider any that has a self loop. Recalling the arguments in the proof of Lemma 9, we can rewrite Eq. (35) as
(91)
with
(92)
where is the set of leaf nodes in that can reach , is the length of the (unique) directed path from node to node in with if , and
with if , where are the nodes along the directed path from to in . We also recall from the arguments in the proof of Lemma 9 that for all . We then see from (76) in the proof of Lemma 9 and the definition of in (54) that
(93)
Similarly, one can rewrite Eq. (28) as
(94)
where
(95)
where
with if . Note that for any and for any in Eqs. (92) and (95), we set if . One can check that satisfies (44) and (47). We then have from Lemmas 7-8 that for all . It follows that for all with . Noting from the construction of in (7) that for all , one can now show that
(96)
which also implies that
(97)
for all . For any , we then have from the above arguments that
(98)
where the first inequality follows from Lemma 14. To obtain (98), we first note (89)-(90) and (96)-(97). We then use the fact that from the definition of the information graph given by (7), and the fact that for any with , is the only noise term such that (see Footnote 2). From (93) and (98), we also obtain
(99)
where the first inequality follows from Lemma 14.
Now, let us denote and . Recalling that , where as we assumed before, one can unroll Eqs. (91) and (94), and show that
(100)
Since and , where satisfies (44), as we argued above, we have from Lemma 7 that
(101)
where and , with . Moreover, since , we have from Lemma 15 that
(102)
where (102) follows from the choice of in (54). Now, considering any term in the summation on the right-hand side of (100), we have
(103)
where the first inequality follows from Lemma 14, and the second inequality uses the upper bounds provided in (98)-(99) and (101)-(102). Moreover, one can show that defined in (54) satisfies that and , which, via algebraic manipulations, yield (103). We then see from (100) that
(104)
where the first inequality follows from Lemma 14, and (104) follows from standard formulas for series. Now, substituting Eq. (88) into the right-hand side of (104), one can show that
(105)
where we note that and by their definitions.
Next, considering any that does not have a self loop, we have from the arguments in the proof of Lemma 9 that Eq. (35) can be rewritten as , where is defined in Eq. (92). Similarly, Eq. (28) can be rewritten as , where is defined in Eq. (95). Using similar arguments to those above, one can then show that (105) also holds.
Further recalling Eqs. (23) and (34), we know that and , which imply that
where the first inequality follows from Lemma 14, the second inequality follows from for all , as we argued above, and the last inequality follows from (105) and the fact that . Moreover, we can show that
where the first inequality follows from the fact that , and the second inequality follows from the fact that as we argued above. One can then show that given in (54) satisfies that . Thus, we obtain the following:
It follows that
which completes the induction step, i.e., we have shown that holds for all . ∎
D.5 Proof of Proposition 4
For notational simplicity in this proof, we denote
(106)
For all , we then see from Lemma 11 that
and
where (resp., ) is given by Eq. (23) (resp., Eq. (34)), (resp., ) is given by Eq. (32) (resp., Eq. (36)), and is defined in Eq. (54). Similarly, we see from Corollary 1 that
and
for all .
To proceed, we have the following:
(107)
Now, considering any term in the summation on the right-hand side of Eq. (107), and dropping the dependency on for notational simplicity, we have the following:
(108)
where the first two inequalities follow from the Cauchy-Schwarz inequality, and the third inequality follows from the upper bounds on , , and given above and in Lemma 10. Similarly, we have
(109)
where the second inequality follows from the upper bounds on , , and given above and in Lemma 10. Combining (108) and (109) together, we obtain from Eq. (107) that
(110)
where the second inequality follows from the fact that . To obtain (110), one can show that . Finally, substituting the expressions for and given in (54) and (106), respectively, we obtain from (110) that (59) holds. ∎
Appendix E Auxiliary Lemmas
Lemma 12.
[9, Lemma 34] Let be a Gaussian random vector with distribution , for all , where . Then for any and for any , the following holds with probability at least :
Lemma 13.
[9, Lemma 36] Let be a sequence of random vectors that is adapted to a filtration , where for all . Suppose is conditionally Gaussian on with , for all , where . Then, for any and for any , the following holds with probability at least :
Lemma 14.
Let be a sequence of random vectors, where . Then,
Proof.
We have the following:
where the first and second inequalities follow from the Cauchy-Schwarz inequality. ∎
Lemma 15.
[29, Lemma 5] Consider any matrix and any matrix . Let and be such that , and for all . Then, for all ,