A modified Thompson sampling-based learning algorithm for unknown linear systems
Abstract
We revisit the Thompson sampling-based learning algorithm for controlling an unknown linear system with quadratic cost proposed in [1]. This algorithm operates in episodes of dynamic length and is shown to have a regret bound of $\tilde{\mathcal{O}}(\sqrt{T})$, where $T$ is the time-horizon. The regret bound of this algorithm is obtained under a technical assumption on the induced norm of the closed loop system. We propose a variation of this algorithm that enforces a lower bound $T_{\min}$ on the episode length. We show that a careful choice of $T_{\min}$ (that depends on the uncertainty about the system model) allows us to recover the $\tilde{\mathcal{O}}(\sqrt{T})$ regret bound under a milder technical condition on the closed loop system.
I Introduction
The problem of learning an optimal policy for a system with linear dynamics and quadratic cost with unknown parameters has received considerable attention in the literature. Historically, the focus has been on developing algorithms which asymptotically learn optimal policies using techniques from adaptive control and reinforcement learning [2, 3, 4, 5]. In recent years, the emphasis has shifted towards developing algorithms with finite-time regret guarantees.
Broadly speaking, three classes of learning algorithms have been considered in the literature: optimism in the face of uncertainty [6, 7, 8, 9], certainty equivalence [10, 11, 12, 13, 14], and Thompson sampling [15, 16, 14, 17]. These algorithms provide two kinds of regret guarantees: frequentist and Bayesian. In the frequentist setting, it is established that the regret for the unknown system is bounded with high probability (with respect to the distribution of the process noise and the randomness introduced by the algorithm). In the Bayesian setting, it is assumed that there is a prior on the unknown system parameters and it is established that the expected regret is bounded (where the expectation is with respect to the prior, the distribution of the process noise, and the randomness introduced by the algorithm). These two notions of regret are different and, since the per-step cost is not bounded, one form of the regret does not imply the other.
In this paper, we revisit a recently proposed algorithm with a Bayesian regret guarantee, called Thompson sampling with dynamic episodes (TSDE) [1]. The main result of [1] is to show that the Bayesian regret of TSDE accumulated up to time $T$ is bounded by $\tilde{\mathcal{O}}(\sqrt{T})$, where the $\tilde{\mathcal{O}}(\cdot)$ notation hides constants and poly-logarithmic factors. This result was derived under a technical assumption on the induced norm of the closed loop system. In this paper, we present a variation of the TSDE algorithm and obtain an $\tilde{\mathcal{O}}(\sqrt{T})$ bound on the Bayesian regret under a much milder technical assumption.
II Model and problem formulation
We consider the same model as [1]. For the sake of completeness, we present the model below.
Consider a linear system with state $x_t \in \mathbb{R}^n$, control input $u_t \in \mathbb{R}^m$, and disturbance $w_t \in \mathbb{R}^n$. For the ease of exposition, we assume that the system starts from an initial state $x_1 = 0$. The state evolves over time according to

$x_{t+1} = A\, x_t + B\, u_t + w_t,$  (1)

where $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n \times m}$ are the system dynamics matrices. The noise $\{w_t\}_{t \ge 1}$ is an independent and identically distributed Gaussian process with $w_t \sim \mathcal{N}(0, \sigma^2 I)$.

In [1], it was assumed that $\sigma^2 = 1$. Using a general $\sigma^2$ does not fundamentally change any of the results or the proof arguments.
At each time $t$, the system incurs a per-step cost given by

$c_t = x_t^\top Q\, x_t + u_t^\top R\, u_t,$  (2)

where $Q$ and $R$ are positive definite matrices.
Let $\theta^\top = [A,\ B]$ denote the parameters of the system. Thus, $\theta \in \mathbb{R}^{d \times n}$, where $d = n + m$. The performance of any policy $\pi$ is measured by the long-term average cost given by

$J(\pi; \theta) = \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi}\Big[ \sum_{t=1}^{T} c_t \Big].$  (3)
Let $J(\theta)$ denote the minimum of $J(\pi; \theta)$ over all policies $\pi$. It is well known [18] that if the pair $(A, B)$ is stabilizable, then $J(\theta)$ is given by

$J(\theta) = \sigma^2\, \operatorname{Tr}\big(S(\theta)\big),$

where $S(\theta)$ is the unique positive semi-definite solution of the following Riccati equation:

$S(\theta) = Q + A^\top S(\theta) A - A^\top S(\theta) B \big( R + B^\top S(\theta) B \big)^{-1} B^\top S(\theta) A.$  (4)
Furthermore, the optimal control policy is given by

$u_t = G(\theta)\, x_t,$  (5)

where the gain matrix $G(\theta)$ is given by

$G(\theta) = -\big( R + B^\top S(\theta) B \big)^{-1} B^\top S(\theta) A.$  (6)
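To make the planning step concrete, the following minimal sketch (not part of the paper; the function names, the fixed-point iteration, and the tolerance are our own choices) computes $S(\theta)$ by iterating the Riccati recursion (4) and then forms the gain $G(\theta)$ from (6).

```python
import numpy as np

def riccati_solution(A, B, Q, R, tol=1e-10, max_iter=10000):
    """Iterate the Riccati recursion (4) until it (approximately) reaches a fixed point."""
    S = np.copy(Q)
    for _ in range(max_iter):
        gain_term = np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)
        S_new = Q + A.T @ S @ A - A.T @ S @ B @ gain_term
        if np.max(np.abs(S_new - S)) < tol:
            return S_new
        S = S_new
    return S

def optimal_gain(A, B, Q, R, S=None):
    """Gain matrix G(theta) from (6); the optimal input is u_t = G @ x_t."""
    if S is None:
        S = riccati_solution(A, B, Q, R)
    return -np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)

# Example: a two-dimensional system with a scalar input.
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
G = optimal_gain(A, B, Q, R)
print("closed-loop spectral radius:", max(abs(np.linalg.eigvals(A + B @ G))))
```

In practice one could instead call a dedicated discrete algebraic Riccati solver (e.g., scipy.linalg.solve_discrete_are); the fixed-point iteration is used here only to keep the sketch self-contained.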
As in [1], we are interested in the setting where the system parameters are unknown. We denote the unknown parameters by a random variable $\theta$ and assume that there is a prior distribution on $\theta$. The Bayesian regret of a policy $\pi$ operating for a horizon $T$ is defined by

$R(T; \pi) = \mathbb{E}^{\pi}\Big[ \sum_{t=1}^{T} c_t - T\, J(\theta) \Big],$  (7)

where the expectation is with respect to the prior on $\theta$, the noise processes, the initial conditions, and the potential randomizations done by the policy $\pi$.
III Thompson sampling-based learning algorithm
As in [1], we assume that the unknown model parameters $\theta$ lie in a compact subset $\Omega$ of $\mathbb{R}^{d \times n}$. We use $p|_{\Omega}$ to denote the restriction of a probability distribution $p$ to the set $\Omega$. We assume that there is a prior $\mu_1$ on $\theta$ which satisfies the following assumption.

Assumption 1.
There exist $\hat\theta_1(i) \in \mathbb{R}^{d}$ for $i \in \{1, \dots, n\}$ and a positive definite matrix $\Sigma_1 \in \mathbb{R}^{d \times d}$ such that $\mu_1 = \bar\mu_1|_{\Omega}$, where

$\bar\mu_1(\theta) = \prod_{i=1}^{n} \mathcal{N}\big(\hat\theta_1(i), \Sigma_1\big),$

i.e., under $\bar\mu_1$ the columns $\theta(i)$ of $\theta$ are independent Gaussian vectors with mean $\hat\theta_1(i)$ and covariance $\Sigma_1$.
We maintain a posterior distribution $\mu_t$ on $\theta$ based on the history of the observations until time $t$. From standard results in linear Gaussian regression [19], we know that the posterior is a truncated Gaussian distribution

$\mu_t = \Big( \prod_{i=1}^{n} \mathcal{N}\big(\hat\theta_t(i), \Sigma_t\big) \Big)\Big|_{\Omega},$

where the mean $\hat\theta_t(i)$ and the covariance $\Sigma_t$ can be updated recursively as follows:

$\hat\theta_{t+1}(i) = \hat\theta_t(i) + \frac{\Sigma_t z_t \big( x_{t+1}(i) - \hat\theta_t(i)^\top z_t \big)}{\sigma^2 + z_t^\top \Sigma_t z_t},$  (8)

$\Sigma_{t+1}^{-1} = \Sigma_t^{-1} + \frac{1}{\sigma^2}\, z_t z_t^\top,$  (9)

where $z_t = \big[ x_t^\top,\ u_t^\top \big]^\top$.
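As an illustration, the rank-one recursions (8) and (9) can be implemented as follows. This is a minimal sketch that assumes the convention $x_{t+1} = \theta^\top z_t + w_t$ with $z_t = [x_t^\top, u_t^\top]^\top$, and it ignores the truncation to $\Omega$ (which would typically be handled at the sampling step, e.g., by rejection sampling); the function name and argument layout are our own.

```python
import numpy as np

def posterior_update(theta_hat, Sigma, x, u, x_next, sigma2):
    """One step of the recursions (8)-(9).

    theta_hat : (n+m, n) posterior mean (column i corresponds to state coordinate i)
    Sigma     : (n+m, n+m) posterior covariance, shared across columns
    x, u      : current state and input; x_next : observed next state
    sigma2    : noise variance sigma^2
    """
    z = np.concatenate([x, u])                    # z_t = [x_t; u_t]
    Sz = Sigma @ z
    denom = sigma2 + z @ Sz                       # sigma^2 + z^T Sigma z
    residual = x_next - theta_hat.T @ z           # innovation for each state coordinate
    theta_hat_new = theta_hat + np.outer(Sz, residual) / denom   # Eq. (8)
    Sigma_new = Sigma - np.outer(Sz, Sz) / denom  # Sherman-Morrison form of Eq. (9)
    return theta_hat_new, Sigma_new
```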
III-A Thompson sampling with dynamic episodes algorithm
We now present a variation of the Thompson sampling with dynamic episodes (TSDE) algorithm of [1]. As the name suggests, the algorithm operates in episodes of dynamic length. The key difference from [1] is that we enforce that each episode has a minimum length of $T_{\min}$. The choice of $T_{\min}$ will be explained later.
Let $t_k$ and $T_k$ denote the start time and the length of episode $k$, respectively. Episode $k$ has a minimum length of $T_{\min}$ and ends when the length of the episode is strictly larger than the length of the previous episode (i.e., $t - t_k > T_{k-1}$) or at the first time after $t_k + T_{\min}$ when the determinant of the covariance falls below half of its value at time $t_k$, i.e., $\det \Sigma_t < \tfrac{1}{2} \det \Sigma_{t_k}$. Thus,

$t_{k+1} = \min\Big\{ t > t_k + T_{\min} \;:\; t - t_k > T_{k-1} \ \text{ or } \ \det \Sigma_t < \tfrac{1}{2} \det \Sigma_{t_k} \Big\}.$  (10)

Note that the stopping condition (10) implies that

$T_{\min} \le T_k \le \max\{T_{\min}, T_{k-1}\} + 1.$  (11)
If we select $T_{\min} = 0$ in the above algorithm, we recover the stopping condition of [1].
The TSDE algorithm works as follows. At the beginning of episode $k$, a parameter $\bar\theta_k$ is sampled from the posterior distribution $\mu_{t_k}$. During the episode, the control inputs are generated using the sampled parameter $\bar\theta_k$, i.e.,

$u_t = G(\bar\theta_k)\, x_t.$  (12)
The complete algorithm is presented in Algorithm 1.
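For concreteness, the sketch below assembles the pieces into a single simulation loop: posterior updates as in (8)-(9), an episode-termination rule in the spirit of (10) with minimum length $T_{\min}$, and the control law (12). It is a simplified rendering rather than a faithful transcription of Algorithm 1: the sampled parameter is drawn from the untruncated Gaussian posterior, and the helpers `optimal_gain` and `posterior_update` are the hypothetical sketches given earlier.

```python
import numpy as np

def tsde(A_true, B_true, Q, R, T, T_min, sigma2, theta_hat, Sigma, rng):
    """Simplified TSDE simulation: dynamic episodes with a minimum length T_min."""
    n, m = B_true.shape
    x = np.zeros(n)
    T_prev, T_cur = 0, 0              # length of previous / current episode
    det_at_start = np.linalg.det(Sigma)
    G = np.zeros((m, n))
    total_cost = 0.0
    for t in range(T):
        # Episode-termination check: after T_min steps, end the episode if its
        # length exceeds the previous episode's length or the determinant of the
        # covariance has dropped below half of its value at the episode start.
        if t == 0 or (
            T_cur > T_min
            and (T_cur > T_prev or np.linalg.det(Sigma) < 0.5 * det_at_start)
        ):
            T_prev, T_cur = T_cur, 0
            det_at_start = np.linalg.det(Sigma)
            # Sample theta_k columnwise from N(theta_hat, Sigma); the truncation
            # to the uncertainty set Omega is omitted in this sketch.
            theta_s = theta_hat + np.linalg.cholesky(Sigma) @ rng.standard_normal(theta_hat.shape)
            A_s, B_s = theta_s[:n].T, theta_s[n:].T
            G = optimal_gain(A_s, B_s, Q, R)           # control law (12)
        u = G @ x                                       # apply sampled-optimal gain
        w = np.sqrt(sigma2) * rng.standard_normal(n)
        x_next = A_true @ x + B_true @ u + w            # system dynamics (1)
        total_cost += x @ Q @ x + u @ R @ u             # per-step cost (2)
        theta_hat, Sigma = posterior_update(theta_hat, Sigma, x, u, x_next, sigma2)
        x, T_cur = x_next, T_cur + 1
    return total_cost

# Example call (using A, B, Q, R from the earlier sketch, n = 2, m = 1):
# cost = tsde(A, B, Q, R, T=5000, T_min=10, sigma2=1.0,
#             theta_hat=np.zeros((3, 2)), Sigma=np.eye(3),
#             rng=np.random.default_rng(0))
```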
III-B A technical assumption and the choice of minimum episode length
For each $\theta \in \Omega$, we define a 4-dimensional row-vector $M_\theta$ as follows:

$M_\theta = \big[\, \|A_\theta\|,\ \|B_\theta\|,\ \|S(\theta)\|,\ \|G(\theta)\|\,\big],$  (13)

where $S(\theta)$ is the solution of the Riccati equation in (4) and $G(\theta)$ is the optimal gain matrix defined in (6).

Definition.
Let $M_A$, $M_B$, $M_S$, $M_G$, $\alpha$, $\delta$ be positive constants such that $\alpha \ge 1$ and $\delta < 1$. We say that the uncertainty set $\Omega$ is of Type $(M_A, M_B, M_S, M_G, \alpha, \delta)$ if the following conditions hold:

1. For all $\theta \in \Omega$,
$M_\theta \le [\, M_A,\ M_B,\ M_S,\ M_G\,],$  (14)
where the inequality is component-wise.

2. For any $\theta, \phi \in \Omega$ with $H_{\theta,\phi} = A_\theta + B_\theta G(\phi)$ and for any integer $k \ge 1$,
$\|H_{\theta,\phi}^{\,k}\| \le \alpha\, \delta^{k}.$
Assumption 2.
We assume that the uncertainty set $\Omega$ is of Type $(M_A, M_B, M_S, M_G, \alpha, \delta)$, where $M_A$, $M_B$, $M_S$, $M_G$, $\alpha$, and $\delta$ are positive constants such that $\alpha \ge 1$ and $\delta < 1$.
The following simple observation plays a critical role in analyzing the regret of TSDE.

Lemma.
Suppose Assumption 2 is true. Define

$\delta^\star = \alpha\, \delta^{T_{\min}}.$  (15)

Then, for $\theta, \phi \in \Omega$ with $H_{\theta,\phi} = A_\theta + B_\theta G(\phi)$, we have

$\|H_{\theta,\phi}^{\,k}\| \le \delta^\star \quad \text{for all } k \ge T_{\min}.$  (16)
Proof.
The proof follows immediately from Assumption 2 and the definition of $\delta^\star$ in (15).
Before presenting our regret analysis under Assumption 2, we present two special cases of this assumption. The first case is identical to the assumption made in [1] about the uncertainty set.
Assumption 3.
Let $M_A$, $M_B$, $M_S$, $M_G$, $\alpha$, $\delta$ be positive constants such that $\alpha \ge 1$ and $\delta < 1$. Assume that the uncertainty set $\Omega$ satisfies the following conditions:

1. Equation (14) holds for all $\theta \in \Omega$.

2. For any $\theta, \phi \in \Omega$ with $H_{\theta,\phi} = A_\theta + B_\theta G(\phi)$,
$\|H_{\theta,\phi}\| \le \delta.$
In [1], part 2) of Assumption 3 was stated explicitly. In addition, the uncertainty set $\Omega$ was assumed to be compact, which ensures part 1) of Assumption 3.

An uncertainty set that satisfies Assumption 3 also satisfies Assumption 2 with $\alpha = 1$. This is because if $\|H_{\theta,\phi}\| \le \delta$, then for any integer $k \ge 1$,

$\|H_{\theta,\phi}^{\,k}\| \le \|H_{\theta,\phi}\|^{k} \le \delta^{k}.$  (17)
For a square matrix $M$, let $\rho(M)$ denote the spectral radius of $M$. The next assumption can also be viewed as a special case of Assumption 2.
Assumption 4.
Let $M_A$, $M_B$, $M_S$, $M_G$, $\alpha$, $\delta$ be positive constants such that $\alpha \ge 1$ and $\delta < 1$. Assume that the uncertainty set $\Omega$ satisfies the following conditions:

1. Equation (14) holds for all $\theta \in \Omega$.

2. For any $\theta, \phi \in \Omega$ with $H_{\theta,\phi} = A_\theta + B_\theta G(\phi)$,
$\rho(H_{\theta,\phi}) \le \delta.$
We note that Assumption 4 is weaker than Assumption 3 (since $\rho(M) \le \|M\|$ for any square matrix $M$). Consider, for example, a family of matrices $H_N = \begin{bmatrix} \delta & N \\ 0 & \delta \end{bmatrix}$, where $\delta < 1$ and $N \ge 1$. For each $N$, the spectral radius of $H_N$ is $\delta$ while its norm is at least $N$. Thus, each $H_N$ satisfies Assumption 4 but not Assumption 3.
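A quick numerical check of this example (our own illustration, with $\delta = 0.5$ and $N = 10$): the matrix below has spectral radius $\delta$ but induced 2-norm larger than $N$, so a bound on the spectral radius says nothing about the norm of a single power of the matrix.

```python
import numpy as np

delta, N = 0.5, 10.0
H = np.array([[delta, N],
              [0.0, delta]])
print(max(abs(np.linalg.eigvals(H))))   # spectral radius = 0.5
print(np.linalg.norm(H, 2))             # induced 2-norm, larger than 10
```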
The following lemma shows that an uncertainty set that satisfies Assumption 4 also satisfies Assumption 2 for some constants $\alpha$ and $\delta'$.

Lemma.
Suppose Assumption 4 is true. Then, there exist $\alpha \ge 1$ and $\delta' \in (\delta, 1)$ such that for any $\theta, \phi \in \Omega$ with $H_{\theta,\phi} = A_\theta + B_\theta G(\phi)$ and for any integer $k \ge 1$,

$\|H_{\theta,\phi}^{\,k}\| \le \alpha\, (\delta')^{k}.$
Proof.
Define $\mathcal{H} = \{A_\theta + B_\theta G(\phi) : \theta, \phi \in \Omega\}$ and fix a $\delta' \in (\delta, 1)$. Since $\Omega$ is compact, so is $\mathcal{H}$. Now, for any $H \in \mathcal{H}$, since $\rho(H) \le \delta < \delta'$, there exists a matrix norm (call it $\|\cdot\|_H$) such that $\|H\|_H < \delta'$.

Since norms are continuous, there is an open ball centered at $H$ (call it $\mathcal{B}_H$) such that for any $H' \in \mathcal{B}_H$, we have $\|H'\|_H \le \delta'$. Consider the collection of open balls $\{\mathcal{B}_H\}_{H \in \mathcal{H}}$. This is an open cover of the compact set $\mathcal{H}$, so there is a finite sub-cover; denote it by $\{\mathcal{B}_{H_1}, \dots, \mathcal{B}_{H_\ell}\}$. By the equivalence of norms, for each $i$ there is a finite constant $\alpha_i$ such that, for any matrix $M$, $\|M\| \le \alpha_i \|M\|_{H_i}$. Let $\alpha = \max_i \alpha_i$.

Now consider an arbitrary $H \in \mathcal{H}$. It belongs to $\mathcal{B}_{H_i}$ for some $i$. Therefore, $\|H\|_{H_i} \le \delta'$. Hence, for any integer $k \ge 1$, the above inequalities and the submultiplicativity of matrix norms give that $\|H^k\| \le \alpha\, \|H^k\|_{H_i} \le \alpha\, \|H\|_{H_i}^{k} \le \alpha\, (\delta')^{k}$.
Condition 2 in Definition III-B states that the closed-loop matrix $H_{\theta,\phi} = A_\theta + B_\theta G(\phi)$ is uniformly exponentially stable for all $\theta, \phi \in \Omega$. Condition 2 of Assumption 4 states that $H_{\theta,\phi}$ is uniformly asymptotically stable for all $\theta, \phi \in \Omega$. For linear systems, asymptotic stability implies exponential stability. In Lemma III-B, we are effectively showing that when the uncertainty set $\Omega$ is compact, uniform asymptotic stability implies uniform exponential stability.
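The content of this observation can also be seen numerically. For the matrix $H_N$ of the earlier example, $\|H_N^k\|$ first grows with $k$ but is eventually dominated by $\alpha (\delta')^k$ for any $\delta' \in (\rho(H_N), 1)$; the snippet below (our own illustration) computes the smallest such $\alpha$ over a range of powers.

```python
import numpy as np

delta, N = 0.5, 10.0
H = np.array([[delta, N], [0.0, delta]])   # spectral radius 0.5, norm > 10
delta_prime = 0.75                          # any value in (rho(H), 1)

Hk = np.eye(2)
ratios = []
for k in range(1, 200):
    Hk = Hk @ H
    ratios.append(np.linalg.norm(Hk, 2) / delta_prime**k)

alpha = max(ratios)
print(f"||H^k|| <= {alpha:.2f} * {delta_prime}^k for k = 1,...,199")
```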
III-C Regret bounds
The following result provides an upper bound on the regret of the proposed algorithm.
Theorem.
Under Assumptions 1 and 2 and with $T_{\min}$ chosen such that $\alpha\, \delta^{T_{\min}} < 1$, the regret of TSDE is upper bounded by

$R(T;\, \text{TSDE}) \le \tilde{\mathcal{O}}\big( \sigma^2 \sqrt{(n+m)\, n\, T}\, \big).$  (18)
The proof is presented in the next section.

Remark.
The constants hidden in the $\tilde{\mathcal{O}}(\cdot)$ notation in Theorem III-C depend only on the type of the uncertainty set $\Omega$. In particular, these constants do not depend on the noise variance $\sigma^2$, the dimensions $n$ and $m$, or the horizon $T$.
IV Regret analysis
For the ease of notation, we use $R(T)$ instead of $R(T;\, \text{TSDE})$ in this section. Let $K_T$ denote the number of episodes until horizon $T$. Following the exact same steps as [1], we can show that

$R(T) = R_0(T) + R_1(T) + R_2(T),$  (19)

where

$R_0(T) = \mathbb{E}\Big[ \sum_{k=1}^{K_T} T_k\, J(\bar\theta_k) \Big] - T\, \mathbb{E}\big[ J(\theta) \big],$  (20)

$R_1(T) = \mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big( x_t^\top S(\bar\theta_k)\, x_t - x_{t+1}^\top S(\bar\theta_k)\, x_{t+1} \big) \Big],$  (21)

$R_2(T) = \mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big( (\theta^\top z_t)^\top S(\bar\theta_k)\, (\theta^\top z_t) - (\bar\theta_k^\top z_t)^\top S(\bar\theta_k)\, (\bar\theta_k^\top z_t) \big) \Big].$  (22)
We establish the bound on $R(T)$ by individually bounding $R_0(T)$, $R_1(T)$, and $R_2(T)$.
Lemma.
The terms in (19) are bounded as follows:

1. $R_0(T) \le \tilde{\mathcal{O}}\big( \sigma^2 \sqrt{(n+m)\, n\, T}\, \big)$.

2. $R_1(T) \le \tilde{\mathcal{O}}\big( \sigma^2 \sqrt{(n+m)\, n\, T}\, \big)$.

3. $R_2(T) \le \tilde{\mathcal{O}}\big( \sigma^2 \sqrt{(n+m)\, n\, T}\, \big)$.
Combining Lemma IV with equation (19) establishes Theorem III-C. Before presenting the proof of Lemma IV, we establish some preliminary results.
IV-A Preliminary results
Let $X_T = \sigma + \max_{1 \le t \le T} \|x_t\|$ denote the maximum of the norm of the state plus the noise standard deviation.
The statements of Lemmas IV-A and IV-A are the same as those of the corresponding lemmas in [1]. The proof of Lemma IV-A in [1] relied on Assumption 3. Since we impose a weaker assumption, our proof is more involved. The proof of Lemma IV-A is similar to the proof of [1, Lemma 3]. However, since our TSDE algorithm is different from that in [1], some of the details of the proof are different.
IV-B Proof of Lemma IV
We now prove each part of Lemma IV separately.
IV-B1 Proof of bound on $R_0(T)$
IV-B2 Proof of bound on $R_1(T)$
IV-B3 Proof of bound on $R_2(T)$
As in [1], we can bound the inner summand in $R_2(T)$ as
Therefore,
which is the same as [1, Eq. (45)]. Now, by simplifying the term inside using the Cauchy-Schwarz inequality, we get
(27)
Note that (27) is slightly different from the simplification of [1, Eq. (45)] using the Cauchy-Schwarz inequality presented in [1, Eq. (46)], which used in each term on the right hand side instead of .
V Discussion and Conclusion
In this paper, we present a variation of the TSDE algorithm of [1] and show that its Bayesian regret up to time $T$ is bounded by $\tilde{\mathcal{O}}(\sqrt{T})$ under a milder technical assumption than [1]. The result in [1] was derived under the assumption that there exists a $\delta < 1$ such that for any $\theta, \phi \in \Omega$, $\|A_\theta + B_\theta G(\phi)\| \le \delta$. For our analysis, we impose a different assumption on the closed-loop matrix $A_\theta + B_\theta G(\phi)$ that arises when the system dynamics are given by $\theta$ and the controller is chosen according to $G(\phi)$. We show that the assumption of [1] implies our assumption. Our assumption is also implied by the condition $\rho(A_\theta + B_\theta G(\phi)) \le \delta$ for all $\theta, \phi \in \Omega$.
The key technical result in [1] as well as in our paper is Lemma IV-A, which shows that, for every $j \ge 1$, the moment $\mathbb{E}[X_T^j]$ grows at most poly-logarithmically in $T$. The proof argument in both [1] and our paper is to show that there is some constant $\alpha^\circ$ such that $X_T \le \alpha^\circ \big( \sigma + \max_{t \le T} \|w_t\| \big)$. Under the stronger assumption in [1], one can show that for all $t$, $\|x_{t+1}\| \le \delta \|x_t\| + \|w_t\|$, which directly implies such a bound with $\alpha^\circ = 1/(1-\delta)$. Under the weaker assumption in this paper, the argument is more subtle. The basic intuition is that in each episode, the closed-loop system is asymptotically stable and, being a linear system, also exponentially stable (in the sense of Lemma III-B). So, if the episode length is sufficiently long, then we can ensure that the closed-loop matrix $H_k$ of episode $k$ satisfies $\|H_k^{T_{\min}}\| \le \delta^\star$, where $\delta^\star < 1$ is a constant. This is sufficient to ensure that $X_T \le \alpha^\circ \big( \sigma + \max_{t \le T} \|w_t\| \big)$ for an appropriately defined $\alpha^\circ$.
The fact that each episode must be of length at least $T_{\min}$ implies that the second triggering condition is not triggered for the first $T_{\min}$ steps in an episode. Therefore, in this interval, the determinant of the covariance can be smaller than half of its value at the beginning of the episode. Consequently, we cannot use the same proof argument as [1] to bound $R_2(T)$ because that proof relied on the fact that for any $t$ in episode $k$, $\det \Sigma_t \ge \tfrac{1}{2} \det \Sigma_{t_k}$. So, we provide a variation of that proof argument, where we use a coarser bound on $\det \Sigma_{t_k} / \det \Sigma_t$ given by Lemma F.
We conclude by observing that the milder technical assumption imposed in this paper may not be necessary. Numerical experiments indicate that the regret of the TSDE algorithm shows $\tilde{\mathcal{O}}(\sqrt{T})$ behavior even when the uncertainty set does not satisfy Assumption 4 (as was also reported in [1]). This suggests that it might be possible to further relax Assumption 4 and still establish an $\tilde{\mathcal{O}}(\sqrt{T})$ regret bound.
Acknowledgement
We would like to thank Borna Sayedana for pointing out a mistake in Lemma 9 in a previous draft of this paper.
Appendix A An auxiliary result
Lemma.
For any $j \ge 1$, we have

$\mathbb{E}\Big[ \max_{1 \le t \le T} \|w_t\|^{j} \Big] \le \mathcal{O}\big( \sigma^{j} (\log T)^{j/2} \big).$  (28)
Proof.
For ease of notation, define random variables and . Note that has a $\chi^2$-distribution with $n$ degrees of freedom. Therefore, has a moment generating function for . Pick a . By the Chernoff bound, we have
where . Therefore, the complementary CDF of is bounded by
Now, pick and consider an i.i.d. process , where has a CDF . We let denote and use a similar notation for . Note that has a Weibull distribution with shape . Therefore, (see e.g., [20, Eq. (3)])
(29)
Now we present a bound on in terms of . Since , there exists a such that for all , . Thus, is stochastically dominated by in the weak stochastic order, as defined in [20]. (A random variable is said to be dominated by another in the weak stochastic order if, for all increasing functions supported sufficiently away from the origin, the expectation under the first is no larger than under the second.) Therefore, by [20, Theorem 2.1], there exists a constant such that is stochastically dominated by in the convex order. (A random variable is said to be dominated by another in the convex order if, for all increasing and convex functions, the expectation under the first is no larger than under the second.) Consequently, by [20, Theorem 3.1(1)] (or [21, Theorem 2.2(1)]), we have that . Substituting (29) establishes the result of the lemma.
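The flavor of the bound established in this appendix, namely that the expected maximum of the noise norms over a horizon grows only poly-logarithmically in the horizon, can be checked empirically. The following Monte Carlo sketch (our own illustration, not part of the proof) estimates $\mathbb{E}[\max_{t \le T} \|w_t\|]$ for i.i.d. Gaussian noise and compares it with a $\sigma(\sqrt{n} + \sqrt{2 \log T})$ reference scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 3, 1.0
reps = 200  # Monte Carlo repetitions per horizon

for T in [10**2, 10**3, 10**4, 10**5]:
    # Estimate E[ max_{t <= T} ||w_t|| ] by averaging over independent runs.
    maxima = [
        np.max(np.linalg.norm(sigma * rng.standard_normal((T, n)), axis=1))
        for _ in range(reps)
    ]
    reference = sigma * (np.sqrt(n) + np.sqrt(2 * np.log(T)))
    print(f"T={T:>6}  estimate={np.mean(maxima):.2f}  sqrt-log reference={reference:.2f}")
```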
Appendix B Proof of Lemma IV-A
For the ease of notation, let , and . In addition, define , , , and where and are the true parameters.
From the system dynamics under the TSDE algorithm, we know that for any time $t$, we have
Thus, from the triangle inequality and Assumption 2, we get
(30)
Now at time , we have
(31)
where the second inequality follows from (11), which implies . From Lemma III-B, . Recursively expanding (31), we get
(32)
Appendix C Proof of Lemma IV-A
Appendix D Proof of Lemma IV-A
The high-level idea of the proof is the same as that of [1, Lemma 3]. Define macro episodes with start times , , where and for ,
Thus, a new macro-episode starts whenever an episode ends due to the second stopping criterion. Let denote the number of macro-episodes until time and define . Let denote the length of the -th macro-episode. Within a macro-episode, all but the last episode must be triggered by the first stopping criterion. Thus, for ,
where the last equality follows from (11). Hence, by following exactly the same argument as [1], we have
and therefore following [1, Eq. (40)], we have
(36)
which is the same as [1, Eq. (41)].
Now, observe that
(37)
where follows because is a non-decreasing sequence (because ), and subsequent inequalities follow from the definition of the macro episode and the second triggering condition.
Appendix E Proof of Lemma IV-B3
Observe that the summand is constant for each episode. Therefore,
(39)
where follows from (11), follows from the fact that is measurable, and hold because conditioned on each column of is the difference of two i.i.d. vectors .
Eq. (39) proves the first part of the Lemma. The second part follows from the fact that .
Appendix F Proof of Lemma IV-B3
For any $t$, Eq. (9) implies that $\Sigma_t^{-1} \succeq \Sigma_1^{-1}$ and consequently $\Sigma_t^{-1}$ is positive definite. Therefore, we have the following:

Lemma.
Let $\lambda_{\min}$ be the smallest eigenvalue of $\Sigma_1^{-1}$. Then, each eigenvalue of $\Sigma_t^{-1}$ is no less than $\lambda_{\min}$. Therefore, each eigenvalue of $\Sigma_t$ is no more than $1/\lambda_{\min}$.

An immediate implication of Lemma F is the following: For any $t$ and any vector $z$ of appropriate dimension,

$z^\top \Sigma_t\, z \le \frac{1}{\lambda_{\min}}\, \|z\|^2,$  (40)

where $\|\cdot\|$ denotes the Euclidean norm.
For any , implies that . Therefore, from [7, Lemma 11], we get that for any (of appropriate dimensions),
(41)
Eq. (41) implies that for any , we have
(42)
For the ease of notation, let . Then we have the following bound on .

Lemma.
The following inequalities hold:

1. For , we have

2. For , we have

Consequently, for all , we have

(43)
Proof.
The second relationship follows from the second stopping criterion. We now prove the first relationship. Eq. (9) implies that
Therefore,
(44)
where the last inequality follows from (40). Thus, for any , we have
(45)
where the first inequality follows by repeatedly applying (44) as a telescopic product.
Let . Then, (43) follows by observing that and .
Using Lemma F and (42), we get
(46)
where the first inequality follows from (42) and the second inequality follows from Lemma F. Therefore,
(47)
From (40) for , we get that
(48)
Hence
(49)
Using (9) and the intermediate step of the proof of [22, Lemma 6], we have
(50)
Now, from (9), we get that
(51)
where the last inequality uses the fact that . Combining (49) with (50) and (51), we get
(52)
Therefore, we can bound the expectation of the right hand side of (47) as
(53)
where the first inequality follows from (52) with , and the last inequality follows from Lemma IV-A by noting that is a polynomial of multiplied by a poly-log term.
References
- [1] Y. Ouyang, M. Gagrani, and R. Jain, “Posterior sampling-based reinforcement learning for control of unknown linear systems,” IEEE Trans. Autom. Control, vol. 65, no. 8, pp. 3600–3607, 2019.
- [2] K. J. Astrom and B. Wittenmark, Adaptive Control. Addison-Wesley Longman Publishing Co., Inc., 1994.
- [3] P. E. Caines, Linear Stochastic Systems. John Wiley, 1988. Republished by Society of Industrial Applied Mathematics, 2018.
- [4] S. J. Bradtke, “Reinforcement learning applied to linear quadratic regulation,” in Neural Information Processing Systems, pp. 295–302, 1993.
- [5] S. J. Bradtke, B. E. Ydstie, and A. G. Barto, “Adaptive linear quadratic control using policy iteration,” in Proceedings of American Control Conference, vol. 3, pp. 3475–3479, IEEE, 1994.
- [6] M. C. Campi and P. Kumar, “Adaptive linear quadratic Gaussian control: the cost-biased approach revisited,” SIAM Journal on Control and Optimization, vol. 36, no. 6, pp. 1890–1907, 1998.
- [7] Y. Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” in Annual Conference on Learning Theory, pp. 1–26, 2011.
- [8] A. Cohen, T. Koren, and Y. Mansour, “Learning linear-quadratic regulators efficiently with only $\sqrt{T}$ regret,” in International Conference on Machine Learning, pp. 1300–1309, PMLR, 2019.
- [9] M. Abeille and A. Lazaric, “Efficient optimistic exploration in linear-quadratic regulators via Lagrangian relaxation,” in International Conference on Machine Learning, pp. 23–31, PMLR, 2020.
- [10] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “Regret bounds for robust adaptive control of the linear quadratic regulator,” in Neural Information Processing Systems, pp. 4192–4201, 2018.
- [11] H. Mania, S. Tu, and B. Recht, “Certainty equivalent control of LQR is efficient.” arXiv:1902.07826, 2019.
- [12] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “Input perturbations for adaptive control and learning,” Automatica, vol. 117, p. 108950, 2020.
- [13] M. Simchowitz and D. Foster, “Naive exploration is optimal for online LQR,” in International Conference on Machine Learning, pp. 8937–8948, PMLR, 2020.
- [14] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “On adaptive linear–quadratic regulators,” Automatica, vol. 117, p. 108982, July 2020.
- [15] Y. Ouyang, M. Gagrani, and R. Jain, “Control of unknown linear systems with Thompson sampling,” in Allerton Conference on Communication, Control, and Computing, pp. 1198–1205, 2017.
- [16] M. Abeille and A. Lazaric, “Improved regret bounds for Thompson sampling in linear quadratic control problems,” in International Conference on Machine Learning, pp. 1–9, 2018.
- [17] M. Gagrani, S. Sudhakara, A. Mahajan, A. Nayyar, and Y. Ouyang, “Thompson sampling for linear quadratic mean-field teams,” in 2021 60th IEEE Conference on Decision and Control (CDC), pp. 720–727, IEEE, 2021.
- [18] K. J. Astrom, Introduction to stochastic control theory. Academic Press New York, 1970.
- [19] J. Sternby, “On consistency for the method of least squares using martingale theory,” IEEE Trans. Autom. Control, vol. 22, no. 3, pp. 346–352, 1977.
- [20] P. J. Downey and R. S. Maier, “Stochastic orderings and the growth of expected extremes,” tech. rep., Tech. Report 90-9, Department of Computer Science, University of Arizona, 1990.
- [21] P. J. Downey and R. S. Maier, “Orderings arising from expected extremes, with an application,” in Institute of Mathematical Statistics Lecture Notes - Monograph Series, pp. 66–75, Hayward, CA: Institute of Mathematical Statistics, 1992.
- [22] Y. Abbasi-Yadkori and C. Szepesvari, “Bayesian optimal control of smoothly parameterized systems: The lazy posterior sampling algorithm.” arXiv preprint arXiv:1406.3926, 2014.