Online Control of Unknown Time-Varying Dynamical Systems
Abstract
We study online control of time-varying linear systems with unknown dynamics in the nonstochastic control model. At a high level, we demonstrate that this setting is qualitatively harder than that of either unknown time-invariant or known time-varying dynamics, and complement our negative results with algorithmic upper bounds in regimes where sublinear regret is possible. More specifically, we study regret bounds with respect to common classes of policies: Disturbance Action (SLS), Disturbance Response (Youla), and linear feedback policies. While these three classes are essentially equivalent for LTI systems, we demonstrate that these equivalences break down for time-varying systems.
We prove a lower bound that no algorithm can obtain sublinear regret with respect to the first two classes unless a certain measure of system variability also scales sublinearly in the horizon. Furthermore, we show that offline planning over the state linear feedback policies is NP-hard, suggesting hardness of the online learning problem.
On the positive side, we give an efficient algorithm that attains a sublinear regret bound against the class of Disturbance Response policies up to the aforementioned system variability term. In fact, our algorithm enjoys sublinear adaptive regret bounds, which is a strictly stronger metric than standard regret and is more appropriate for time-varying systems. We sketch extensions to Disturbance Action policies and partial observation, and propose an inefficient algorithm for regret against linear state feedback policies.
1 Introduction
The control of linear time-invariant (LTI) dynamical systems is well-studied and understood. This includes classical methods from optimal control such as LQR and LQG, as well as robust control. Recent advances study regret minimization and statistical complexity for online linear control, in both stochastic and adversarial perturbation models. Despite this progress, rigorous mathematical guarantees for nonlinear control remain elusive: nonlinear control is both statistically and computationally intractable in general.
In the face of these limitations, recent research has begun to study the rich continuum of settings which lie between LTI systems and generic nonlinear ones. The hope is to provide efficient and robust algorithms to solve the most general control problems that are tractable, and at the same time, to characterize precisely at which degree of nonlinearity no further progress can be made.
This paper studies the control of linear, time-varying (LTV) dynamical systems as one such point along this continuum. This is because the first-order Taylor approximation to the dynamics of any smooth nonlinear system about a given trajectory is an LTV system. These approximations are widely popular because they allow for efficient planning, as demonstrated by the success of iLQR and iLQG methods for nonlinear receding horizon control. We study online control of discrete-time LTV systems, with dynamics and time-varying costs
(1.1)  $x_{t+1} = A_t x_t + B_t u_t + w_t$, with per-round cost $c_t(x_t, u_t)$.
Above, $x_t$ is the state of the system, $u_t$ the control input, $w_t$ the disturbance, and $(A_t, B_t)$ the system matrices. Our results extend naturally to partial-state observation, where the controller observes only linear projections $y_t = C_t x_t$ of the state. We focus on the challenges introduced when the system matrices and perturbations are not known to the learner in advance and can only be determined through live interaction with the changing system.
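To make the interaction protocol concrete, the following minimal Python sketch simulates one trajectory of Eq. 1.1 under an arbitrary control rule; the random system matrices, disturbances, and quadratic cost are illustrative stand-ins, not the constructions used in our results.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_u, T = 3, 2, 100

def quad_cost(x, u):
    # a fixed quadratic cost c_t(x, u) = ||x||^2 + ||u||^2 (hypothetical choice)
    return float(x @ x + u @ u)

x = np.zeros(d_x)                 # initial state taken to be zero
total_cost = 0.0
for t in range(T):
    A_t = 0.5 * rng.standard_normal((d_x, d_x)) / np.sqrt(d_x)   # unknown LTV matrices
    B_t = rng.standard_normal((d_x, d_u)) / np.sqrt(d_x)
    w_t = 0.1 * rng.standard_normal(d_x)                         # bounded disturbance
    u_t = np.zeros(d_u)           # placeholder control; the learner sees only x_{t+1} and c_t
    total_cost += quad_cost(x, u_t)
    x = A_t @ x + B_t @ u_t + w_t                                # dynamics (1.1)
print(total_cost)
```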
In this setting, we find that the overall change in system dynamics across time characterizes the difficulty of controlling the unknown LTV system. We define a measure, called system variability, which quantifies this change. We show both statistical and computational lower bounds as well as algorithmic upper bounds in terms of the system variability. Surprisingly, system variability does not affect the difficulty of control when the dynamics are known [16].
1.1 Contributions
We consider the recently popularized nonstochastic model of online control, and study regret bounds with respect to common classes of policies: Disturbance Action (Dac/SLS [44]), Disturbance Response (Drc/Youla [46]), and linear feedback policies. Planning over the third class of feedback policies in LTI systems admits efficient convex relaxations via the first two parametrizations, Dac and Drc. This insight has been the cornerstone of both robust [49, 44] and online [3, 39] control.
Separation of parametrizations. For linear time-varying systems, however, we find that equivalences between linear feedback, Dac and Drc fail to hold: we show that there are cases where any one of the three parametrizations exhibits strictly better control performance than the other two.
Regret against convex parametrizations. Our first set of results pertains to the Dac and Drc parametrizations, which are convex and admit efficient optimization. We demonstrate that no algorithm can obtain sublinear regret with respect to these classes when faced with unknown LTV dynamics, unless a certain measure of system variability also scales sublinearly in the horizon. This is true even under full observation, controllable dynamics, and a fixed control cost. This finding is in direct contrast to recent work showing that sublinear regret is attainable over LTV dynamics when they are known [16].
We give an efficient algorithm that attains sublinear regret against these policy classes up to an additive penalty for the aforementioned system variability term found in our lower bound. When the system variability is sufficiently small, our algorithm recovers state-of-the-art results for unknown LTI system dynamics up to logarithmic factors.
In fact, our algorithm enjoys sublinear adaptive regret [21], a strictly stronger metric than standard regret which is more appropriate for time-varying systems. We also show that the stronger notion of adaptivity called strongly adaptive regret [11] is out of reach in the partial information setting.
Regret against state feedback. Finally, we consider the class of state feedback policies, which are linear feedback policies with memory length one. We show that full-information optimization over state feedback policies is computationally hard. This suggests that obtaining sublinear regret relative to these policies may be computationally prohibitive, though it does not entirely rule out the possibility of improper learning. However, improper learning cannot proceed via the Drc or Dac relaxations in light of our policy-class separation results. Nevertheless, we include an inefficient algorithm which attains sublinear (albeit nonparametric-rate) regret against state feedback control policies.
Paper Structure
Discussion of relevant literature and its relation to our work can be found in Section 1.2. In Section 2, we formally introduce the setting of LTV nonstochastic control, the policy classes we study, and our key result regarding their non-equivalence in the LTV setting (Theorem 2.1). Motivated by this non-equivalence, the remainder of the paper is split into the study of convex policies (Section 3) and of state feedback policies (Section 4). In Section 3, we show that regret against the Dac and Drc classes cannot be sublinear unless the system variability (Definition 3.1) itself is sublinear (Theorem 3.1), and we propose Algorithm 2, whose adaptive regret scales at the rate of our lower bound plus an additional sublinear term (Theorem 3.4). On the other hand, in Section 4 we show that sublinear regret against state feedback policies is technically possible (Theorem 4.1) with a computationally inefficient algorithm, but we also provide a computational lower bound (Theorem 4.2) for planning which reveals significant difficulties imposed by LTV dynamics in this scenario as well. Finally, in Section 5 we pose several future directions, concerning both open questions in LTV control and the extension to nonlinear control.
1.2 Related Work
Our study of LTV systems is motivated by the widespread practical popularity of iterative linearization for nonlinear receding horizon control; e.g., the iLQR [40], iLC [29], and iLQG [41] algorithms. Recent research has further demonstrated that near-optimal solutions to LTV approximations of dynamics confer stability guarantees onto the original nonlinear system of interest [45].
Low-Regret Control: We study algorithms which enjoy sublinear regret for online control of LTV systems; that is, whose performance tracks a given benchmark of policies up to a term which is vanishing relative to the problem horizon. [1] initiated the study of online control under the regret benchmark by introducing the online LQR problem, where a learner faces an unknown LTI system, fixed costs, and i.i.d. Gaussian disturbances, and must attain performance comparable to the LQR-optimal policy. Bounds for this setting were later improved and refined in [12, 26, 10, 38], and extended to partial-state observation in [25, 24]. Our work instead adopts the nonstochastic control setting [3], where adversarially chosen (i.e. non-Gaussian) noise models the drift terms that arise in linearizations of nonlinear dynamics, and where costs may vary with time. [3] consider known system dynamics; this was later extended to unknown systems under both full-state [20] and partial-state observation [39, 37]. The study of nonstochastic control of known LTV dynamics was taken up in [16], with parallel work by [32] considering known LTV dynamics under stochastic noise.
Unknown LTV dynamics: Our work is the first to consider online (low-regret) control of unknown LTV systems in any model. There is, however, a rich body of classical work on adaptive control of LTV systems [28, 42]. These guarantees focus more heavily on error sensitivity and stability; they only permit dynamical recovery up to error that scales linearly in system noise, and thus guarantee only (vacuous) linear-in-horizon regret. More recent work has studied identification (but not online control) of an important LTV class called switching systems [31, 35].
Online Convex Optimization: We make extensive use of techniques from the field of online convex optimization [9, 18]. Most relevant to our work is the literature on adapting to changing environments in online learning, which starts with the works of [22, 6]. The notion of adaptive regret was introduced in [21] and has been studied extensively since as a metric for adaptive learning in OCO [2, 47]. [11] proposed a strengthening of adaptive regret, and this stronger metric has been shown to imply dynamic regret bounds [48].
Recent nonlinear control literature: Recent research has also studied provable guarantees in various complementary (but incomparable) models: planning regret in nonlinear control [4], adaptive nonlinear control under linearly-parameterized uncertainty [5], online model-based control with access to non-convex planning oracles [23], and control with nonlinear observation models [27, 13].
2 Problem Setting
We study control of a linear time-varying (LTV) system (Eq. 1.1) with state $x_t$, control input $u_t$ chosen by the learner, and external disturbance $w_t$ chosen by Nature. The system is characterized by the time-varying matrices $(A_t, B_t)$. For simplicity, the initial state is zero. At each time $t$, an oblivious adversary (one that chooses the matrices, costs, and perturbations prior to the control trajectory) picks the system matrices $(A_t, B_t)$, the disturbance $w_t$, and the cost function $c_t$. The dynamics are unknown to the learner: one observes only the next state $x_{t+1}$ and the current cost $c_t(x_t, u_t)$ after playing the control $u_t$.
Adaptive Regret. The goal of the learner is to minimize regret w.r.t. a policy class $\Pi$, i.e. the difference between the cumulative cost of the learner and that of the best policy in hindsight. Formally, the regret of an algorithm $\mathcal{A}$ with control inputs $u_t$ and corresponding states $x_t$, over an interval $I \subseteq [T]$, is defined as
(2.1)  $\mathrm{Regret}_I(\mathcal{A}; \Pi) = \sum_{t \in I} c_t(x_t, u_t) - \min_{\pi \in \Pi} \sum_{t \in I} c_t(x_t^{\pi}, u_t^{\pi}).$
Here $u_t^{\pi}, x_t^{\pi}$ indicate the control input and the corresponding state when following policy $\pi$. For a randomized algorithm $\mathcal{A}$, we consider the expected regret. In this work, we focus on designing control algorithms that minimize adaptive regret, i.e. that guarantee low regret relative to the best-in-hindsight policy on any interval $I \subseteq [T]$. This performance metric is more suitable for control over LTV dynamics given its agility to compete against different locally optimal policies at different times [16]. To illustrate this point, we describe the implications of standard vs. adaptive regret for $k$-switching LQR.
Example 2.1 ($k$-switching LQR).
Consider the problem of $k$-switching LQR, in which the system evolves according to fixed dynamics over each of $k$ consecutive time intervals $I_1, \dots, I_k$. An adaptive regret guarantee ensures good performance against the locally optimal LQR policy on every interval $I_i$, in contrast to standard regret, which only ensures good performance against a single joint comparator. Clearly, over every interval $I_i$ the locally optimal policy is a suitable comparator, while the joint comparator is not.
Key objects. A central object in our study is the sequence of Nature's x's $x^{\mathrm{nat}}_t$: the states that arise from playing zero control input $u_t = 0$ at each $t$ [39]. This object allows us to split any state into a component independent of the algorithm's actions and a component that is the direct effect of the chosen actions. To capture this intuition in equation form, we define for all $t \ge s$ the transition operators
$\Phi_{t,s} = A_{t-1} A_{t-2} \cdots A_s$, with $\Phi_{t,t} = I$,
where the matrix product is taken in the indicated (decreasing-index) order, and the Markov operator at time $t$, $G_t = (G^{[i]}_t)_{i \ge 1}$ with $G^{[i]}_t = \Phi_{t, t-i+1} B_{t-i}$. The following identities give an alternative representation of the Nature's x's and of the state under control inputs $u_{1:t-1}$ in terms of the Markov operator:
$x^{\mathrm{nat}}_t = \sum_{s=1}^{t-1} \Phi_{t, s+1} w_s, \qquad x_t = x^{\mathrm{nat}}_t + \sum_{i=1}^{t-1} G^{[i]}_t u_{t-i}.$
These operators and the alternative representation capture the dynamics by decoupling the effects of the disturbances and of the control actions: the transition operators $\Phi_{t,s}$ capture the contribution of the perturbations to the state, and the Markov operators $G^{[i]}_t$ that of the controls.
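As a sanity check on this decomposition, the sketch below (with hypothetical random matrices) rolls out the same disturbance sequence once with zero input to obtain the Nature's x's, and verifies that the controlled state equals the Nature's x plus the Markov-operator convolution of the past inputs.

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_u, T = 3, 2, 30
A = [0.4 * rng.standard_normal((d_x, d_x)) / np.sqrt(d_x) for _ in range(T)]
B = [rng.standard_normal((d_x, d_u)) for _ in range(T)]
W = [0.1 * rng.standard_normal(d_x) for _ in range(T)]
U = [rng.standard_normal(d_u) for _ in range(T)]

def rollout(inputs):
    # x_{t+1} = A_t x_t + B_t u_t + w_t starting from x_0 = 0; returns x_0, ..., x_T
    x, traj = np.zeros(d_x), []
    for t in range(T):
        traj.append(x)
        x = A[t] @ x + B[t] @ inputs[t] + W[t]
    traj.append(x)
    return traj

x_nat = rollout([np.zeros(d_u)] * T)   # Nature's x's: the zero-input trajectory
x_ctl = rollout(U)                     # trajectory under the inputs U

def markov(t, i):
    # G_t^[i]: effect of input u_{t-i} on state x_t (A-product taken in decreasing index order)
    M = B[t - i]
    for s in range(t - i + 1, t):
        M = A[s] @ M
    return M

t = T
controlled_part = sum(markov(t, i) @ U[t - i] for i in range(1, t + 1))
assert np.allclose(x_ctl[t], x_nat[t] + controlled_part)   # x_t = x_t^nat + sum_i G_t^[i] u_{t-i}
```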
Assumptions. We make three basic assumptions: (i) the disturbances do not blow up the system under zero control input, (ii) the system's response decays over time, and (iii) the costs are well-behaved and admit efficient optimization. Formally, these assumptions are:
Assumption 1.
For all $t$, assume that the Nature's x's are bounded: $\|x^{\mathrm{nat}}_t\| \le R_{\mathrm{nat}}$ for some constant $R_{\mathrm{nat}}$.
Assumption 2.
Assume there exist $R_G \ge 1$ and $\rho \in (0,1)$ s.t. $\|G^{[i]}_t\|_{\mathrm{op}} \le R_G\, \rho^{i}$ for any $t$ and for all $i \ge 1$.
Assumption 3.
Assume the costs $c_t(x, u)$ are general convex functions that satisfy the conditions $|c_t(x, u)| \le L \max\{1, \|x\|^2 + \|u\|^2\}$ and $\|\nabla c_t(x, u)\| \le L \max\{1, \|x\| + \|u\|\}$ for some constant $L > 0$, where $\nabla$ denotes any subgradient [7].
The conditions in Assumption 3 allow for functions whose values and gradients grow as quickly as quadratics (e.g. the costs in LQR), and the $\max\{1, \cdot\}$ term ensures the inclusion of standard bounded and Lipschitz functions as well. Assumptions 1 and 2 arise from the assumption that our LTV system is open-loop stable; Section A.2 extends to the case where a nominal stabilizing controller is known, as in prior work [3, 39]. While these two assumptions may seem unnatural at first, they can be derived from the basic conditions of a disturbance norm bound and sequential stability.
Lemma 2.1.
Suppose that there exist constants such that the disturbances are uniformly bounded, $\|w_t\| \le R_w$ for all $t$, that the matrices $B_t$ are uniformly bounded in operator norm, and that the system is sequentially stable, i.e. products of consecutive transition matrices decay geometrically in their length. Then, Assumption 1 holds with a constant determined by $R_w$ and the stability parameters, and Assumption 2 holds with constants determined by the bound on $B_t$ and the stability parameters.
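To illustrate the reduction, here is the one-line calculation behind Assumption 1 under hypothetical placeholder constants (ours, not necessarily those of the formal lemma): suppose $\|w_t\| \le R_w$ for all $t$ and that sequential stability holds in the form $\|A_t A_{t-1} \cdots A_{s+1}\|_{\mathrm{op}} \le C \rho^{\,t-s}$ for all $t \ge s$, with $C \ge 1$ and $\rho \in (0,1)$. Unrolling the zero-input dynamics and applying the triangle inequality,
$$\|x^{\mathrm{nat}}_{t+1}\| \;=\; \Big\| \sum_{s=1}^{t} A_t \cdots A_{s+1}\, w_s \Big\| \;\le\; R_w \sum_{j \ge 0} C \rho^{\,j} \;=\; \frac{C R_w}{1-\rho},$$
so the Nature's x's are uniformly bounded; the geometric decay of the Markov operators $G^{[i]}_t$ required by Assumption 2 follows similarly once the $B_t$ are bounded in operator norm.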
Note that Assumption 2 implies that the Markov operators are uniformly summable: $\sum_{i \ge 1} \|G^{[i]}_t\|_{\mathrm{op}} \le R_G/(1-\rho)$ for all $t$. It also suggests that, for a sufficiently large memory $m$, the effect of inputs played more than $m$ rounds in the past is negligible at round $t$. This prompts introducing a truncated Markov operator: denote by $G^{[:m]}_t = (G^{[i]}_t)_{1 \le i \le m}$ the $m$-truncation of the true Markov operator $G_t$. It follows that their difference is negligible in operator norm for a sufficiently large $m$. Define the bounded set $\mathcal{G}$ of $m$-truncated Markov operators to be those whose components satisfy the decay bound of Assumption 2 for all $i \in [m]$.
2.1 Benchmarks and Policy Classes
The performance of an algorithm, measured by Eq. 2.1, directly depends on the policy class $\Pi$ that is chosen as the benchmark to compete against. In this work, we consider the following three policy classes: Drc, Dac, and linear feedback. Drc parameterizes control inputs in terms of the Nature's x's $x^{\mathrm{nat}}_t$, Dac does so in terms of the disturbances $w_t$, and linear feedback in terms of the states $x_t$. We express all three in terms of a length-$m$ parameter $M = (M^{[1]}, \dots, M^{[m]})$ lying in a bounded ball of radius $R$:
Definition 2.1 (Drc policy class).
A Drc control policy $\pi_M$ of length $m$ is given by $u_t = \sum_{i=1}^{m} M^{[i]} x^{\mathrm{nat}}_{t-i+1}$, where $M = (M^{[1]}, \dots, M^{[m]})$ is the parameter of the policy. Define the bounded Drc policy class $\Pi_{\mathrm{drc}}(m, R)$ as the set of such policies with $M$ in the ball of radius $R$.
Definition 2.2 (Dac policy class).
A Dac control policy $\pi_M$ of length $m$ is given by $u_t = \sum_{i=1}^{m} M^{[i]} w_{t-i}$, where $M = (M^{[1]}, \dots, M^{[m]})$ is the parameter of the policy. Define the bounded Dac policy class $\Pi_{\mathrm{dac}}(m, R)$ analogously.
Definition 2.3 (Feedback policy class).
A feedback control policy $\pi_M$ of length $m$ is given by $u_t = \sum_{i=1}^{m} M^{[i]} x_{t-i+1}$, where $M = (M^{[1]}, \dots, M^{[m]})$ is the parameter of the policy. Define the bounded feedback policy class $\Pi_{\mathrm{lf}}(m, R)$ analogously. In the special case of memory $m = 1$, i.e. policies $u_t = K x_t$, denote the state feedback policy class as $\mathcal{K}(R)$.
Convexity. Both the Drc and Dac policy classes are convex parametrizations: a policy outputs controls that are linear in the policy-independent sequences $x^{\mathrm{nat}}_t$ and $w_t$, respectively, and thus the mapping from the parameter $M$ to the resulting states and inputs (resp. costs) is affine (resp. convex). Hence, we refer to these as the convex classes. In contrast, feedback policies select inputs based on policy-dependent states, and are therefore non-convex [15].
We drop the arguments $(m, R)$ when they are clear from context. The state feedback policies encompass the classical optimal control laws under full observation, such as the $\mathcal{H}_2$ (LQR) and $\mathcal{H}_\infty$ controllers. For LTI systems, Drc and Dac are equivalent [46, 44] and approximate all linear feedback policies to arbitrarily high precision [3, 39]. However, we show that these relationships between the classes break down for LTV systems: there exist scenarios where any one of the three classes strictly outperforms the other two.
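To make the convexity point tangible, the following sketch evaluates a Drc policy; the function name, the ordering convention of the history argument, and the zero-padding are illustrative assumptions rather than the formal definitions above.

```python
import numpy as np

def drc_control(M, x_nat_hist):
    """u_t = sum_{i=1}^m M^[i] x^nat_{t-i+1}: linear in the policy-independent Nature's x's.

    M          : list of m matrices (the policy parameter); M[0] plays the role of M^[1].
    x_nat_hist : Nature's x's ordered most recent first; shorter histories are zero-padded.
    """
    d_u = M[0].shape[0]
    u = np.zeros(d_u)
    for i, M_i in enumerate(M):
        if i < len(x_nat_hist):
            u += M_i @ x_nat_hist[i]
    return u

# A Dac policy has the same form with past disturbances w_{t-1}, ..., w_{t-m} in place of
# the Nature's x's; a linear feedback policy instead uses the (policy-dependent) states.
```

Because the history argument does not depend on $M$, the map $M \mapsto u_t$ (and hence $M \mapsto (x_t, u_t)$) is affine, which is exactly what makes the Drc and Dac classes amenable to online convex optimization.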
Theorem 2.1 (Informal).
For each class among Drc, Dac, and state feedback, there exists a sequence of well-behaved instances $(A_t, B_t, w_t, c_t)$ such that some policy in that class suffers negligible cumulative cost, while every constituent policy of each of the other two classes suffers cost growing linearly in the horizon $T$.
The formal theorem, including the definition of a well-behaved instance sequence and the dependence of the final statement on the problem parameters, along with its proof, can be found in Section F.1.
Notation.
The norm $\|\cdot\|$ refers to the Euclidean norm unless otherwise stated; $[n]$ is used as shorthand for $\{1, \dots, n\}$, and $a{:}b$ is used as a subscript shorthand for the index range $a, \dots, b$. The asymptotic notation $O(\cdot)$ suppresses all terms independent of the horizon $T$, and $\tilde O(\cdot)$ additionally suppresses terms logarithmic in $T$. We use $\lesssim$ to suppress absolute constants, polynomials in the problem parameters, and logarithms in $T$.
3 Online Control over Convex Policies
This section considers online control of unknown LTV systems so as to compete with the convex Drc and Dac policy classes. The fundamental quantity which appears throughout our results is the system variability, which measures the variation of the time-varying Markov operators over intervals $I \subseteq [T]$.
Definition 3.1.
Define the system variability of an LTV dynamical system with Markov operators $G_{1:T}$ over a contiguous interval $I \subseteq [T]$ to be
$\mathrm{Var}_I \;=\; \frac{1}{|I|} \sum_{t \in I} \big\| G_t - \bar G_I \big\|_{\ell_2}^2,$
where $\|\cdot\|_{\ell_2}$ indicates the norm of the fully vectorized operator and $\bar G_I = \frac{1}{|I|}\sum_{t \in I} G_t$ is the empirical average of the operators that correspond to $I$. Recall that $G_t$ denotes the (true) Markov operator of the system at time $t$.
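As a concrete reference point, the sketch below computes the system variability of a finite sequence of Markov operators over an interval in the average-squared-deviation form above; the function name and array layout are illustrative assumptions.

```python
import numpy as np

def system_variability(G_seq, interval):
    """Average squared deviation of the vectorized Markov operators on `interval`
    from their empirical mean over that interval."""
    vecs = np.stack([np.asarray(G_seq[t]).ravel() for t in interval])
    mean = vecs.mean(axis=0)
    return float(np.mean(np.sum((vecs - mean) ** 2, axis=1)))

# Example: three operators, the middle one different -- nonzero variability.
G_seq = [np.eye(2), 2 * np.eye(2), np.eye(2)]
print(system_variability(G_seq, range(3)))
```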
Our results in this section for both upper and lower bounds focus on expected regret: high probability results are possible as well with more technical effort using standard techniques.
3.1 A Linear Regret Lower Bound
Our first contribution is a negative one: the regret against either the Dac or the Drc policy class cannot, in general, scale sublinearly in the time horizon. Informally, our result shows that the regret against these classes must grow with the system variability of the underlying system.
More precisely, for any prescribed level of system variability, we construct a distribution $\mathcal{D}$ over instance sequences $(A_t, B_t, w_t, c_t)$, formally specified in Section F.2. We list its essential properties here: (i) the disturbances are benign, (ii) the cost is fixed across time and satisfies Assumption 3, (iii) the system matrices are drawn i.i.d. with almost-surely bounded norm, and (iv) the construction makes the Nature's x's coincide with the disturbances, so that the classes Drc and Dac are equivalent and the lower bound holds for both. These conditions imply that Assumptions 1 and 2 hold for $\mathcal{D}$. Moreover, by Jensen's inequality, the construction guarantees that the expected system variability is at least the prescribed level. For the described construction, we show the following lower bound:
Theorem 3.1.
Let $c > 0$ be a universal constant. For any variability level and any online control algorithm $\mathcal{A}$, there exists a Drc policy such that the expected regret incurred by $\mathcal{A}$ under the distribution $\mathcal{D}$ and the fixed cost is at least a constant multiple of the system variability accumulated over the horizon.
A full construction and proof of Theorem 3.1 are given in Section F.2. In particular, when the system variability is of constant order, no algorithm can attain sublinear expected regret; a stark distinction from either unknown LTI [39, 20] or known LTV [16] systems.
3.2 Estimation of Time-Varying Vector Sequences
To devise an algorithmic upper bound that complements the result in Theorem 3.1, we first consider the setting of online prediction under a partial information model. This setting captures the system identification phase of LTV system control and is used to derive the final control guarantees. Formally, consider the following repeated game between a learner and an oblivious adversary: at each round $t$, the adversary picks a target vector $z_t$ from a convex decision set $\mathcal{Z}$ contained in a $0$-centered ball of radius $R$; simultaneously, the learner selects an estimate $\hat z_t \in \mathcal{Z}$ and suffers the quadratic loss $\|\hat z_t - z_t\|^2$. The only feedback the learner has access to is via the following noisy and costly oracle.
Oracle 1 (Noisy Costly Oracle).
At each time $t$, the learner selects a decision $q_t \in \{0, 1\}$ indicating whether a query is sent to the oracle. If $q_t = 1$, the learner receives as response an unbiased estimate $y_t$ of $z_t$, i.e. $\mathbb{E}[y_t \mid \mathcal{F}_{t-1}] = z_t$, with conditionally bounded second moment. The filtration $\mathcal{F}_t$ is the sigma algebra generated by $(q_s, y_s)_{s \le t}$ and the choices of the oblivious adversary. A completed query results in a unit cost for the learner.
The performance metric of an online prediction algorithm is the expected quadratic-loss regret together with the cumulative oracle query cost. It is defined over each interval $I \subseteq [T]$ as
(3.1)  $\mathrm{Regret}_I = \mathbb{E}\Big[ \sum_{t \in I} \|\hat z_t - z_t\|^2 - \min_{z} \sum_{t \in I} \|z - z_t\|^2 + \sum_{t \in I} q_t \Big].$
To attain adaptive regret, i.e. to bound Eq. 3.1 for each interval $I \subseteq [T]$, we propose Algorithm 1, constructed as follows. First, suppose we only wanted a non-adaptive guarantee (i.e. just for $I = [T]$). In this special case, we propose to sample $q_t \sim \mathrm{Bernoulli}(p)$ for an appropriate exploration parameter $p$, and perform a gradient descent update on the importance-weighted square loss $\frac{q_t}{p}\|\hat z_t - y_t\|^2$. To extend this method to enjoy adaptive regret guarantees, we adopt the approach of [21]: the core idea is to initiate an instance of the base method at each round and use a weighted average of the instance predictions as the final prediction (Line 4). The instance weights are multiplicatively updated according to their performance (Line 13). To ensure computational efficiency, the algorithm only updates instances from a working dictionary (Line 11). These dictionaries are pruned each round (Line 15) so that their size stays logarithmic in the horizon (see Section C.1 for details).
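For intuition, here is a minimal sketch of the base (non-adaptive) method just described: query with probability p and take a projected gradient step on the importance-weighted square loss. The oracle interface, the 1/t stepsize schedule, and the clipping-based projection are illustrative assumptions; the full Algorithm 1 additionally runs the [21]-style weighted aggregation over restarted instances.

```python
import numpy as np

def base_predict(oracle, T, d, p, radius, seed=0):
    """Importance-weighted online gradient descent with Bernoulli(p) oracle queries."""
    rng = np.random.default_rng(seed)
    z_hat = np.zeros(d)
    preds = []
    for t in range(1, T + 1):
        preds.append(z_hat.copy())           # prediction made before any feedback this round
        if rng.random() < p:                 # q_t = 1: pay the unit query cost
            y = oracle(t)                    # unbiased noisy estimate of the target z_t
            grad = (2.0 / p) * (z_hat - y)   # importance-weighted gradient; unbiased over query and noise
            z_hat = z_hat - (1.0 / t) * grad
            norm = np.linalg.norm(z_hat)
            if norm > radius:                # project back onto the radius ball by clipping
                z_hat *= radius / norm
    return preds
```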
Theorem 3.2.
Given access to queries from Oracle 1 and with appropriately chosen stepsizes, Algorithm 1 enjoys the following adaptive regret guarantee: for all intervals $I \subseteq [T]$,
(3.2) |
When $I = [T]$, the optimal choice of the exploration parameter $p$ yields regret scaling sublinearly in $T$. Unfortunately, this tuning gives the same regret scaling for all interval sizes: the choice of $p$ that is optimal for a short interval $I$ degrades the guarantee over the full horizon, while the horizon-optimal choice is considerably worse than the best attainable rate on small intervals. One may ask if there exists a strongly adaptive algorithm which adapts the exploration rate as well, so as to enjoy regret polynomial in $|I|$ for all intervals simultaneously [11]. The following result shows this is not possible:
Theorem 3.3 (Informal).
For any choice of the problem parameters and oracle cost, there exists no online algorithm with feedback access to Oracle 1 that enjoys a strongly adaptive regret guarantee, i.e. regret polynomial in $|I|$ over all intervals $I$ simultaneously.
Hence, in a sense, Algorithm 1 is as adaptive as one could hope for: it ensures a regret bound for all intervals $I \subseteq [T]$, but not a strongly adaptive one. The lower bound construction, formal statement, and proof of Theorem 3.3 are given in Section C.2.
3.3 Adaptive Regret for Control of Unknown Time-Varying Dynamics
We now apply our adaptive estimation algorithm (Algorithm 1) to the online control problem. Our proposed algorithm, Algorithm 2, takes in two sub-routines: a prediction algorithm which enjoys low prediction regret in the sense of the previous section, and a control algorithm which has low regret for control of known systems. Our master algorithm trades off between the two methods in epochs whose length is tied to the truncation length of the Markov operator: each epoch corresponds to one step of the prediction algorithm, indexed by $k$.
At each epoch, the algorithm receives Markov operator estimates from the prediction algorithm (Line 3) and makes a binary exploration decision. If it explores, it plays i.i.d. Rademacher inputs (Line 7) and sends the resulting estimator to the prediction algorithm (Line 14); this corresponds to one query from Oracle 1. Otherwise, it selects inputs according to the control algorithm (Line 9) and does not issue a query (Line 16). Regardless of the exploration decision, the algorithm feeds the costs, the current estimates of the Markov operator, and the Nature's x's reconstructed from those estimates to the control algorithm (Lines 10-12), which uses them to select inputs and update its internal parameter.
The prediction algorithm is taken to be Ada-Pred with the set of truncated Markov operators as its decision set; the projection onto this set and onto the norm ball is done by clipping whenever the norm of the argument exceeds the indicated bound. The control algorithm is taken to be Drc-Ogd [39] for known systems. The core technique behind Drc-Ogd is running online gradient descent over the Drc parametrization (Definition 2.1). In Appendix B we spell out the algorithm and extend previous analyses to both LTV systems and adaptive regret guarantees. The final result guarantees low adaptive regret as long as the system variability is sublinear.
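Schematically, the epoch structure of Algorithm 2 can be sketched as follows. The `step`, `predictor`, and `controller` interfaces are simplified assumptions standing in for the environment, Ada-Pred, and Drc-Ogd respectively; they are not the exact interfaces used in the appendix.

```python
import numpy as np

def control_unknown_ltv(step, predictor, controller, T, m, d_u, p, seed=0):
    """Explore/exploit skeleton of Algorithm 2 (simplified, assumed interfaces).

    step(u) -> (x_next, cost)    : one interaction with the unknown LTV system.
    predictor.estimate()         : current Markov-operator estimate (Ada-Pred stand-in).
    predictor.update(epoch_data) : counts as one query to the costly oracle.
    controller.act / .observe    : Drc-Ogd stand-in for control under the estimated system.
    """
    rng = np.random.default_rng(seed)
    for k in range(T // m):                  # epochs matching the truncation length
        G_hat = predictor.estimate()
        explore = rng.random() < p           # binary exploration decision for the epoch
        epoch_data = []
        for _ in range(m):
            if explore:
                u = rng.integers(0, 2, size=d_u) * 2.0 - 1.0   # i.i.d. Rademacher inputs
            else:
                u = controller.act(G_hat)
            x_next, cost = step(u)
            epoch_data.append((u, x_next, cost))
        if explore:
            predictor.update(epoch_data)     # exploration epoch -> one oracle query
        controller.observe(epoch_data, G_hat)  # costs + estimates fed back in either case
```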
Theorem 3.4.
For appropriate choices of the exploration rate, truncation length, and stepsizes, on any contiguous interval $I \subseteq [T]$, Algorithm 2 enjoys an adaptive regret guarantee scaling as the system variability of the interval plus a term sublinear in $T$.
Proof Sketch.
The analysis proceeds by reducing the regret incurred to that over a known system, accounting for: 1) the additional exploration penalty, 2) the error induced by system misspecification, and 3) truncation errors. Via straightforward computations, the system misspecification error can be expressed in terms of the result in Theorem 3.2, ultimately leading to the stated error contribution. The analysis is finalized by noting that the chosen exploration rate balances the exploration and misspecification terms, and that the chosen truncation length ensures that the truncation error is negligible. The full proof can be found in Appendix D. ∎
The adaptive regret bound in Theorem 3.4 has two notable terms. The first, variability-dependent term matches the regret lower bound in Theorem 3.1, and our algorithm is adaptive in this term over all intervals $I$. On the other hand, for unknown LTI systems, whose system variability is zero, the algorithm recovers the state-of-the-art bound of [20] up to logarithmic factors. However, the second term is not adaptive to the interval, consistent with the lower bound against strongly adaptive algorithms in Theorem 3.3.
4 Online Control over State Feedback
Given the impossibility of sublinear regret against Drc/Dac without further restrictions on system variability, this section studies whether sublinear regret is possible against the class of linear feedback policies. For simplicity, we focus on state feedback policies, that is, linear feedback policies with memory $m = 1$ (Definition 2.3). We note that state feedback policies were the class which motivated the relaxation to Dac policies in the first study of nonstochastic control [3].
We present two results, rather qualitative in nature. First, we show that obtaining sublinear regret is, in the most literal sense, possible. The following result considers regret relative to a class $\mathcal{K}$ of static feedback controllers which satisfy the restrictive assumption that each $K \in \mathcal{K}$ stabilizes the time-varying dynamics; see Appendix E for the formal algorithm, assumptions, and guarantees. We measure the regret against this class as
$\mathrm{Regret}_T(\mathcal{K}) = \sum_{t=1}^{T} c_t(x_t, u_t) - \min_{K \in \mathcal{K}} \sum_{t=1}^{T} c_t(x^K_t, u^K_t),$
where $x^K_t, u^K_t$ are the iterates arising under the control law $u_t = K x_t$.
Theorem 4.1 (Sublinear regret against state-feedback).
Under a suitable stabilization assumption, there exists a computationally inefficient control algorithm which attains sublinear expected regret against $\mathcal{K}$.
The bound suppresses a universal constant and exponent base, both of which are made explicit in the formal theorem statement in Appendix E. It follows by running the Exp3 bandit algorithm on a discretization of the set $\mathcal{K}$ (high-probability regret can be obtained by instead using Exp3.P [8]). The guarantee in Theorem 4.1 is neither practical nor sharp; its sole purpose is to confirm the possibility of sublinear regret. Due to the bandit reduction and the exponential size of the cover of $\mathcal{K}$, the algorithm is computationally inefficient and suffers a nonparametric rate of regret [33], with the required cover size (and hence runtime) growing exponentially in the dimension.
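A minimal sketch of the reduction is below: discretize the controller ball into a finite cover, treat each cover element as a bandit arm, and run Exp3. The cover construction, the per-round cost normalization to [0, 1], and all identifiers are assumptions for illustration; the formal algorithm and its tuning appear in Appendix E.

```python
import numpy as np

def exp3_over_controllers(cover, play_round, T, seed=0):
    """Exp3 over a finite cover of state-feedback matrices.

    cover      : list of K candidate feedback matrices (an eps-net of the controller ball).
    play_round : K_matrix -> cost in [0, 1] incurred by playing u_t = K x_t this round.
    """
    rng = np.random.default_rng(seed)
    K = len(cover)
    gamma = min(1.0, np.sqrt(K * np.log(K) / ((np.e - 1) * T)))  # standard Exp3 mixing rate
    w = np.ones(K)
    for _ in range(T):
        probs = (1 - gamma) * w / w.sum() + gamma / K
        arm = rng.choice(K, p=probs)
        cost = play_round(cover[arm])               # bandit feedback: only this arm's cost
        reward_est = (1.0 - cost) / probs[arm]      # importance-weighted reward estimate
        w[arm] *= np.exp(gamma * reward_est / K)
    return w
```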
One may wonder whether one can do much better than this naive bandit reduction; for example, is there structure that can be leveraged? For LTV systems, we provide strong evidence that, at least from a computational standpoint, attaining polynomial regret (e.g. $T^{1-\alpha}$ for some $\alpha > 0$ independent of the dimension) is prohibitive.
Theorem 4.2.
There exists a reduction from Max-3Sat on formulas with a given number of clauses and literals to the problem of finding a state-feedback controller which is within a small constant factor of optimal for the cumulative cost on a sequence of sequentially stable LTV systems and convex costs with no disturbances ($w_t \equiv 0$), with state dimension, input dimension, and horizon polynomial in the formula size. Therefore, unless P = NP, the latter problem cannot be solved in polynomial time [17].
A more precise statement, construction, and proof are given in Section F.4. Theorem 4.2 demonstrates that solving the offline optimization problem over state feedback controllers to within constant precision is NP-hard. In particular, this means that any sublinear-regret algorithm which is proper and convergent, in the sense that its sequence of controllers converges to a limit as $T \to \infty$, must be computationally inefficient. This is true even if the costs and dynamics are known in advance. Our result suggests that it is computationally hard to obtain sublinear regret, but it does not rigorously imply it. For example, there may be more clever convex relaxations (other than Drc and Dac, which provably cannot work) that yield efficient algorithms with sublinear regret. Secondly, this lower bound does not rule out the possibility of a computationally inefficient algorithm which nevertheless attains polynomial regret.
5 Discussion and Future Work
This paper provided guarantees for and studied the limitations of sublinear additive regret in online control of an unknown, linear time-varying (LTV) dynamical system.
Our setting was motivated by the fact that the first-order Taylor approximation (Jacobian linearization) of a smooth, nonlinear system about any smooth trajectory is LTV. One would therefore hope that low-regret guarantees against LTV systems may imply convergence to first-order stationary points of general nonlinear control objectives [34], which in turn may enjoy stability properties [45]. Making this connection rigorous poses several challenges. Among them, one would need to extend our low-regret guarantees from oblivious adversaries to adaptive adversaries, the latter modeling how nonlinear system dynamics evolve in response to the learner's control inputs. This may require departing from our current analysis, which leverages the independence between exploratory inputs and changes in system dynamics.
Because we show that linear-in-$T$ regret is unavoidable for changing systems with large system variability, at least for the main convex policy parametrizations, it would be interesting to study our online setting under other measures of performance. In particular, the competitive ratio, i.e. the ratio of total algorithm cost to the optimal cost in hindsight (as opposed to the difference between the two measured by regret), may yield a complementary set of tradeoffs, or lead to new and exciting principles for adaptive controller design. Does system variability play the same deciding role in competitive analysis as it does in regret? And, in either competitive or regret analyses, what is the correct measure of system variability (e.g. variability in which norm/geometry, or of which system parameters) that best captures the sensitivity of online cost to system changes?
Acknowledgments
Elad Hazan and Edgar Minasyan have been supported in part by NSF grant #1704860. This work was done in part when Paula Gradu was at Google AI Princeton and Princeton University. Max Simchowitz is generously supported by an Open Philanthropy AI fellowship.
References
- [1] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
- [2] Dmitry Adamskiy, Wouter M Koolen, Alexey Chernov, and Vladimir Vovk. A closer look at adaptive regret. The Journal of Machine Learning Research, 17(1):706–726, 2016.
- [3] Naman Agarwal, Brian Bullins, Elad Hazan, Sham Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning, pages 111–119, 2019.
- [4] Naman Agarwal, Elad Hazan, Anirudha Majumdar, and Karan Singh. A regret minimization approach to iterative learning control. arXiv preprint arXiv:2102.13478, 2021.
- [5] Nicholas M. Boffi, Stephen Tu, and Jean-Jacques E. Slotine. Regret bounds for adaptive nonlinear control, 2020.
- [6] Olivier Bousquet and Manfred K Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3(Nov):363–396, 2002.
- [7] Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
- [8] Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.
- [9] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
- [10] Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently with only regret. In International Conference on Machine Learning, pages 1300–1309, 2019.
- [11] Amit Daniely, Alon Gonen, and Shai Shalev-Shwartz. Strongly adaptive online learning. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1405–1411, Lille, France, 07–09 Jul 2015. PMLR.
- [12] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
- [13] Sarah Dean and Benjamin Recht. Certainty equivalent perception-based control. arXiv preprint arXiv:2008.12332, 2020.
- [14] Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Experts in a markov decision process. In Advances in neural information processing systems, pages 401–408, 2005.
- [15] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467–1476, 2018.
- [16] Paula Gradu, Elad Hazan, and Edgar Minasyan. Adaptive regret for control of time-varying dynamics. arXiv preprint arXiv:2007.04393, 2020.
- [17] Johan Håstad. Some optimal inapproximability results. Journal of the ACM (JACM), 48(4):798–859, 2001.
- [18] Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
- [19] Elad Hazan. Introduction to online convex optimization, 2019.
- [20] Elad Hazan, Sham Kakade, and Karan Singh. The nonstochastic control problem. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, pages 408–421. PMLR, 2020.
- [21] Elad Hazan and Comandur Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of the 26th annual international conference on machine learning, pages 393–400. ACM, 2009.
- [22] Mark Herbster and Manfred K Warmuth. Tracking the best expert. Machine learning, 32(2):151–178, 1998.
- [23] Sham Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, and Wen Sun. Information theoretic regret bounds for online nonlinear control. arXiv preprint arXiv:2006.12466, 2020.
- [24] Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Logarithmic regret bound in partially observable linear dynamical systems, 2020.
- [25] Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Regret minimization in partially observable linear quadratic control, 2020.
- [26] Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, pages 10154–10164, 2019.
- [27] Zakaria Mhammedi, Dylan J Foster, Max Simchowitz, Dipendra Misra, Wen Sun, Akshay Krishnamurthy, Alexander Rakhlin, and John Langford. Learning the linear quadratic regulator from nonlinear observations. arXiv preprint arXiv:2010.03799, 2020.
- [28] Richard H Middleton and Graham C Goodwin. Adaptive control of time-varying linear systems. IEEE Transactions on Automatic Control, 33(2):150–155, 1988.
- [29] Kevin L Moore. Iterative learning control for deterministic systems. Springer Science & Business Media, 2012.
- [30] Samet Oymak and Necmiye Ozay. Non-asymptotic identification of lti systems from a single trajectory. In 2019 American control conference (ACC), pages 5655–5661. IEEE, 2019.
- [31] Necmiye Ozay, Constantino Lagoa, and Mario Sznaier. Set membership identification of switched linear systems with known number of subsystems. Automatica, 51:180–191, 2015.
- [32] Guannan Qu, Yuanyuan Shi, Sahin Lale, Anima Anandkumar, and Adam Wierman. Stable online control of linear time-varying systems. arXiv preprint arXiv:2104.14134, 2021.
- [33] Alexander Rakhlin and Karthik Sridharan. Online non-parametric regression. In Conference on Learning Theory, pages 1232–1264. PMLR, 2014.
- [34] Vincent Roulet, Siddhartha Srinivasa, Dmitriy Drusvyatskiy, and Zaid Harchaoui. Iterative linearized control: stable algorithms and complexity guarantees. In International Conference on Machine Learning, pages 5518–5527. PMLR, 2019.
- [35] Tuhin Sarkar, Alexander Rakhlin, and Munther Dahleh. Nonparametric system identification of stochastic switched linear systems. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 3623–3628. IEEE, 2019.
- [36] Tuhin Sarkar, Alexander Rakhlin, and Munther A Dahleh. Finite-time system identification for partially observed lti systems of unknown order. arXiv preprint arXiv:1902.01848, 2019.
- [37] Max Simchowitz. Making non-stochastic control (almost) as easy as stochastic, 2020.
- [38] Max Simchowitz and Dylan Foster. Naive exploration is optimal for online lqr. In International Conference on Machine Learning, pages 8937–8948. PMLR, 2020.
- [39] Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control, 2020.
- [40] Y. Tassa, T. Erez, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4906–4913, 2012.
- [41] Emanuel Todorov and Weiwei Li. A generalized iterative lqg method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the 2005, American Control Conference, 2005., pages 300–306. IEEE, 2005.
- [42] Kostas S Tsakalis and Petros A Ioannou. Linear time-varying systems: control and adaptation. Prentice-Hall, Inc., 1993.
- [43] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
- [44] Yuh-Shyang Wang, Nikolai Matni, and John C Doyle. A system-level approach to controller synthesis. IEEE Transactions on Automatic Control, 64(10):4079–4093, 2019.
- [45] Tyler Westenbroek, Max Simchowitz, Michael I Jordan, and S Shankar Sastry. On the stability of nonlinear receding horizon control: a geometric perspective. arXiv preprint arXiv:2103.15010, 2021.
- [46] Dante Youla, Hamid Jabr, and Jr Bongiorno. Modern wiener-hopf design of optimal controllers–part ii: The multivariable case. IEEE Transactions on Automatic Control, 21(3):319–338, 1976.
- [47] Lijun Zhang, Tie-Yan Liu, and Zhi-Hua Zhou. Adaptive regret of convex and smooth functions. arXiv preprint arXiv:1904.11681, 2019.
- [48] Lijun Zhang, Tianbao Yang, Rong Jin, and Zhi-Hua Zhou. Dynamic regret of strongly adaptive methods. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5882–5891. PMLR, 10–15 Jul 2018.
- [49] Kemin Zhou, John Comstock Doyle, Keith Glover, et al. Robust and optimal control, volume 40. Prentice hall New Jersey, 1996.
Appendix A Extensions
A.1 Affine Offsets
For many systems, performance improves dramatically for controllers with constant affine terms, that is, policies of the form $u_t = \sum_{i=1}^{m} M^{[i]} x^{\mathrm{nat}}_{t-i+1} + u^{\mathrm{off}}$,
where $u^{\mathrm{off}}$ is a constant affine term encoded by an additional parameter. All our arguments apply more generally to control policies of this form. Moreover, we can even allow linear combinations of time-varying terms, $u_t = \sum_{i=1}^{m} M^{[i]} x^{\mathrm{nat}}_{t-i+1} + \sum_{j} M^{[j]}_{\mathrm{off}} \phi_j(t)$,
where the $\phi_j(t)$ are fixed, possibly time-varying basis functions (which do not depend on the learner's parameter). The case of constant affine terms corresponds to a single basis function that is constant in $t$.
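Concretely, a Drc policy with offsets might be evaluated as in the sketch below; the identifiers and the convention that a constant offset is a single constant basis vector are illustrative assumptions.

```python
import numpy as np

def drc_with_offsets(M, M_off, x_nat_hist, basis_t):
    """u_t = sum_i M^[i] x^nat_{t-i+1} + sum_j M_off^[j] phi_j(t).

    basis_t : list of fixed (possibly time-varying) basis vectors phi_j(t); a constant
              affine term corresponds to a single basis vector that is constant in t.
    """
    d_u = M[0].shape[0]
    u = np.zeros(d_u)
    for i, M_i in enumerate(M):
        if i < len(x_nat_hist):
            u += M_i @ x_nat_hist[i]
    for j, M_j in enumerate(M_off):
        u += M_j @ basis_t[j]
    return u
```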
A.2 Changing Stabilizing Controllers
Our results extend naturally to the following setting: for each time , the algorithm has access to a static feedback control policy such that the closed loop matrices are sequentially stable, that is
has geometric decay. We let denote the iterates produced by the updates
We compute the stabilized policies of the form
To facilitate the extension, we define the stabilized Markov operator
This Markov operator satisfies
With similar techniques, we obtain estimates , back out estimates of the Nature’s sequence via
for a truncation radius suitably chosen. Recall that we apply this clipping to ensure that the estimates of the Nature's sequence do not grow unbounded and exert feedback. We then select inputs
estimate the stabilized Markov operator with low adaptive regret, and apply our Oco-with-memory algorithm to the induced losses
A.3 Partial Observation
Our results further extend to partially observed systems. We explain this extension for sequentially stable systems; the extension to systems sequentially stabilized by time-varying linear dynamic controllers follows from the exposition in [39], Appendix C.
For partially observed systems, we have the same state transition dynamics , but, for a time-varying observation matrix and process noise , we observe outputs
Costs are suffered on inputs and outputs. As in the full-observation case, the Nature's x's and Nature's y's correspond to the states and outputs which arise under identically zero input. The Drc parametrization selects linear combinations of the Nature's y's:
Recalling
the relevant Markov operators are the ones mapping inputs to outputs:
With similar techniques, we obtain estimates , back out estimates of the Nature’s sequence via
for a truncation radius suitably chosen, we select inputs
update parameters with the Oco-with-memory losses
A.4 The Dac parametrization
Here we sketch an algorithm to compete with Dac-parametrized control policies [3]. For simplicity, we focus on sequentially stable systems, though the discussion extends to systems sequentially stabilized by sequences of controllers. Note that Dac does not apply under partial observation.
Recall that, in the Dac parametrization, the inputs are selected as linear combinations of past disturbances:
To implement Dac, we therefore need empirical estimates of the disturbances. As per [20], it suffices to construct estimates of the system matrices and choose
again clipped at a suitable radius to block compounding feedback. Given these estimates, our algorithm extends to Dac control in the expected way.
How does one obtain these estimates? First, observe that the leading block of the Markov operator is precisely the corresponding input matrix; hence its estimate can be read off directly from our Markov operator estimates.
The estimate of the transition matrix is more involved. For linear time-invariant systems, it can be recovered from the Markov operator via the Ho-Kalman procedure, as is done in [20] (see also [30, 36]). For time-varying systems, this becomes more challenging. Omitting details in the interest of brevity, one can use the robustness properties of Ho-Kalman to argue that, if the system matrices are slowly moving (an assumption required for low regret), the Markov operator is close to a stationarized analogue given by
Hence, we can view any estimate of as an estimate of , and apply Ho-Kalman to the latter.
Appendix B Adaptive Regret for Time-Varying DRC-OGD
We first extend the DRC-OGD algorithm from [39] to the setting of known linear time-varying dynamics. We spell out Algorithm 3 and prove it attains adaptive regret over general convex costs under fully adversarial noise (Theorem B.1) with respect to the DRC policy class. The main technique, using OGD over the DRC parametrization, remains unchanged from the original paper and we show it generalizes naturally to LTV systems.
Theorem B.1.
B.1 Adaptive Regret of OGD for functions with memory
We first prove that OGD with a fixed stepsize attains adaptive regret for functions with memory.
Theorem B.2.
Let be a sequence of L coordinate-wise Lipschitz loss functions with memory such that (Line 6) is convex. Then, on any interval , Algorithm 4 enjoys the following adaptive policy regret guarantee:
where .
First we state and prove the following well-known fact about vanilla projected OGD over (memory-less) loss functions:
Fact B.1.
Let be a sequence of convex loss functions with . Then, on any interval , projected OGD enjoys the following guarantee:
where .
Proof.
Consider an arbitrary interval $I$, and fix the shorthand notation for the comparator and gradients for simplicity. By convexity we have
(B.1) |
By the Pythagorean theorem
(B.2) |
Hence we can bound the interval regret as:
which yields the desired adaptive regret bound. ∎
Using this simple fact and Lipschitzness we are able to easily prove the desired guarantee for Algorithm 4.
Proof of Theorem B.2.
First note that Algorithm 4 is just doing gradient descent on the proxy convex loss functions. Hence, as long as we identify the gradient bound, we can apply Fact B.1 to get a bound on the regret of the proxy losses. Observe that
so the proxy loss is Lipschitz with the indicated constant and hence admits the corresponding gradient bound. So we can apply Fact B.1 to get
(B.3) |
We can use this to bound the adaptive policy regret. First note that
and by the triangle inequality
(B.4) |
Using Eq. B.4 and Lipschitzness we have:
(B.5) |
∎
Combining everything we get
B.2 Proof of Theorem B.1
We first prove that the constructed loss function satisfies key properties for efficient optimization.
Lemma B.2 (Convexity).
The loss functions constructed in 7 of Algorithm 3 are convex in .
Proof.
By definition we have that:
(B.6) |
which is affine in . Even more simply, we have .
Since the state and input are affine (respectively, linear) functions of the parameter, and composition with a convex cost preserves convexity, we get the desired property. ∎
Lemma B.3 (Lipschitzness).
The loss functions constructed in 7 of Algorithm 3 are coordinate-wise Lipschitz for .
Proof.
Observe that by Eq. B.6 we have . Straightforwardly, as well.
For an arbitrary , denoting , and using the sub-quadratic lipschitzness of the costs we have:
so the function is coordinate-wise Lipschitz with constant . ∎
Lemma B.4 (Euclidean Diameter).
The euclidean diameter of is at most .
Proof.
Note that for an arbitrary element of the parameter set its Euclidean norm is bounded as above, and the Euclidean diameter is at most twice the maximal Euclidean norm, concluding the statement. ∎
The three lemmas above will allow us to use the results in Section B.1 to obtain adaptive regret guarantees in terms of proxy losses which truncate the effect on the state of actions taken further than $m$ steps in the past. To convert guarantees for the truncated losses into guarantees for the true losses, we prove that the effect of the distant past is minimal:
Lemma B.5 (Truncation Error).
For a changing DRC policy that acts according to up to time we have that:
Proof.
By the sub-quadratic Lipschitzness (and noting , , and ) we have:
∎
Having proven all these preliminary results the proof of the main theorem is immediate:
Proof of Theorem B.1.
By the definition of the proxy loss in Line 7 of Algorithm 3, we can expand the regret of Algorithm 3 over interval as:
Regret | |||
The first truncation error is bounded directly by Lemma B.5. For the second truncation error, let . Clearly we have
and hence we can apply Lemma B.5 to bound:
truncation error II | |||
Finally, due to Lemma B.2, Lemma B.3 and Lemma B.4 we can apply Theorem B.2 to get
Summing everything up and plugging in the Lipschitz and diameter constants, we have:
Regret |
Setting , we get
Regret | |||
where we denote and . ∎
Appendix C Estimation of Time-Varying Vector Sequences
In this section we segue into the setting of online prediction under a partial information model. The goal is to estimate a sequence of vectors under limited noisy feedback, where feedback access is softly restricted via an additional cost. As shown in the following section, this setting captures the system identification phase of controlling an unknown time-varying dynamical system. We first study this simplified setting extensively, and afterwards transfer our findings into meaningful results in control.
Formally, consider the following repeated game between a learner and an oblivious adversary: at each round $t$, the adversary picks a target vector $z_t$ from a convex decision set $\mathcal{Z}$ contained in a $0$-centered ball of radius $R$; simultaneously, the learner selects an estimate $\hat z_t \in \mathcal{Z}$ and suffers the quadratic loss $\|\hat z_t - z_t\|^2$. The only feedback the learner has access to is via the following noisy and costly oracle.
Oracle 2 (Noisy Costly Oracle).
At each time $t$, the learner selects a decision $q_t \in \{0,1\}$ indicating whether a query is sent to the oracle. If $q_t = 1$, the learner receives as response an unbiased estimate $y_t$ of $z_t$ with conditionally bounded second moment, where the filtration sigma algebra is generated by the entire target sequence and the past queries and responses. A completed query results in a unit cost for the learner, which by abuse of notation we also denote by $q_t$.
The idea behind this setting is to model a general estimation framework for a time-varying system which focuses only on exploration. Committing to exploration, however, cannot realistically be free, hence the additional cost for the number of calls to Oracle 2. Our goal is to design an algorithm that minimizes the quadratic loss regret along with the extra oracle cost, defined over each interval $I \subseteq [T]$ as
(C.1)  $\mathrm{Regret}_I = \mathbb{E}\Big[ \sum_{t \in I} \|\hat z_t - z_t\|^2 - \min_{z} \sum_{t \in I} \|z - z_t\|^2 + \lambda \sum_{t \in I} q_t \Big],$
where $\lambda > 0$ is a scaling constant independent of the horizon $T$. For the entire interval $[T]$ we use $T$ as a subscript instead of $[T]$. The expectation above is taken over both the (potential) randomness of the algorithm and the stochasticity of the oracle responses; it is taken in round order, conditioning at each round on the past iterations.
In terms of estimation itself, the relevant metric over an interval $I$ is the cumulative expected quadratic error, which ignores the oracle call costs. Furthermore, we observe that the best-in-hindsight term in (C.1) is in fact a fundamental quantity of the vector sequence, defined below. This formulation will be used, and is more appropriate, when transferring our findings to the setting of control.
Definition C.1.
Define the variability of a time-varying vector sequence $z_{1:T}$ over an interval $I$ to be
$\mathrm{Var}_I(z_{1:T}) \;=\; \sum_{t \in I} \|z_t - \bar z_I\|^2,$
where $\bar z_I = \frac{1}{|I|} \sum_{t \in I} z_t$ is the empirical average of the members of the sequence that correspond to $I$.
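The connection to the best-in-hindsight term in (C.1) is the standard bias-variance identity for the square loss: the fixed comparator minimizing the cumulative quadratic loss is the empirical mean, so the optimal comparator loss equals the variability,
$$\min_{v} \sum_{t \in I} \|v - z_t\|^2 \;=\; \sum_{t \in I} \| z_t - \bar z_I \|^2, \qquad \bar z_I := \frac{1}{|I|} \sum_{t \in I} z_t .$$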
This definition concludes the setup of our abstraction for general estimation of vector sequences. Regarding algorithmic results, we first present a base method that achieves logarithmic regret over the entire trajectory $[T]$. The idea is for the learner to query Oracle 2 uniformly at random with probability $p$: once an estimate is received, construct a stochastic gradient whose expectation equals the true gradient, and perform a gradient update. The algorithm is described in detail in Algorithm 5, and its guarantee is given in the theorem below.
Theorem C.1.
Given access to queries from Oracle 2, and with appropriately chosen stepsizes, Algorithm 5 enjoys the following regret guarantee:
(C.2) |
Proof of Theorem C.1.
To prove the bound in the theorem, we construct the following proxy loss functions: if denote , otherwise for denote . The stochastic gradients of these functions at the current iterate can be written as and are used by the algorithm in the update rule. The idealized gradients are which we would use given access to the true targets . Recall that denotes the sigma-algebra generated by the true target sequence , as well as randomness of the past rounds and . Note then that is measurable. We characterize two essential properties of the stochastic gradients:
Lemma C.1.
Let be the minimizer of , i.e. empirical average of . Then,
Moreover, .
Proof.
The rest of the theorem proof mirrors that of Theorem 3.3 in [19] but accounting for the stochastic gradient. We can view Algorithm 5 as running online stochastic gradient descent over strongly convex functions on losses with true gradient and stochastic gradient at the iterate . Since the losses are -strongly convex, and using the claim from Lemma C.1 we get,
The update rule is given as , so from the Pythagorean theorem for the projection
Combining the above bounds results in
The telescoping sum inside the parentheses is equal to , the gradient term is bounded according to Lemma C.1 and the stepsize sum is bounded by , yielding the final result
∎
C.1 Adaptive Regret Bound
The guarantee in Theorem C.1 ensures that the predicted sequence performs comparably to the empirical mean of the entire target sequence $z_{1:T}$. However, this does not imply much about the performance of Algorithm 5 on a given local interval $I$, since the local empirical mean $\bar z_I$ can be very different from the global one. Hence, we would like to extend our results to hold for any interval $I \subseteq [T]$, i.e. derive adaptive regret results as introduced in [21]. To do so, we use the approach of [21] with Algorithm 5 as a subroutine. The resulting algorithm, presented in Algorithm 6, incurs only a logarithmic computational overhead over Algorithm 5; its performance guarantee is stated in the theorem below.
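A simplified sketch of the aggregation step is given below: one base instance is started at every round, the prediction is a weighted average of instance predictions, and weights are updated multiplicatively by the observed instance losses. The interfaces, the single exponential-weights parameter, and the omission of the working-set pruning and of the exact weight initialization of [21] are all simplifying assumptions.

```python
import numpy as np

def aggregate_over_restarts(make_base, round_loss, T, eta=0.5):
    """Follow-the-leading-history style aggregation (simplified).

    make_base(t)        -> object with .predict() and .update(t).
    round_loss(t, pred) -> loss suffered by prediction `pred` at round t.
    """
    instances, weights, out = [], [], []
    for t in range(T):
        instances.append(make_base(t))       # a fresh base instance starts at round t
        weights.append(1.0)
        preds = [inst.predict() for inst in instances]
        w = np.array(weights)
        w = w / w.sum()
        out.append(sum(wi * p for wi, p in zip(w, preds)))   # weighted-average prediction
        for i, inst in enumerate(instances):
            weights[i] *= np.exp(-eta * round_loss(t, preds[i]))  # multiplicative update
            inst.update(t)                   # each instance performs its own (queried) update
    return out
```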
Theorem C.2.
Taking the base estimation algorithm to be Algorithm 5 and given access to queries from Oracle 2, Algorithm 6 enjoys the following guarantee:
(C.3) |
Corollary C.1.
The estimation error over each interval is bounded as follows,
Proof of Theorem C.2.
First observe that is -exp concave with . This is evident given its construction: with since and according to Oracle 2. The rest of the algorithm uses the approach of [21], in particular Algorithm 1, over exp concave functions to derive the guarantee in the theorem statement.
We note that Claim 3.1 in [21] holds identically in our case, i.e. for any the regret of Algorithm 6 with respect to is bounded by if stays in the working set. We combine this fact with the bound given in Theorem C.1 to get that Algorithm 6 enjoys regret over if stays in the working set throughout . Finally, an induction argument along with the working set properties detailed in Section C.1.1 identical to that of Lemma 3.2 in [21] yields the desired result for . Notice that this is our desired result in expectation,
Observation C.2.
We have the following identity for any and :
Proof of C.2.
We can expand:
By the linearity of expectation, the fact that are completely determined given , and the law of total expectation we have that
Plugging this in above, adding and subtracting the appropriate term, and rearranging, we have:
as desired. ∎
Combining C.2 with the fact that are -exp concave for we conclude the final statement of Theorem C.2. ∎
C.1.1 Working Set Construction
Our Algorithm 6 makes use of the working sets along with its properties in Claim C.3. In this section, we show the explicit construction of these working sets as in [21] and prove the claim.
Claim C.3.
The following properties hold for the working sets for all : (i) ; (ii) for any ; (iii) ; (iv) .
For any $t$, write $t = r 2^k$ with $r$ odd and $k$ a nonnegative integer. Then $t \in W_s$ if and only if $s$ lies in a window starting at $t$ whose length is proportional to $2^k$. This fully describes the construction of the working sets $W_s$, and we proceed to prove their properties.
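The following sketch reproduces the spirit of this construction; the lifetime multiplier (here 4) is an assumption standing in for the exact constant of the formal construction, and the brute-force loop is for illustration only.

```python
def working_set(s, lifetime_mult=4):
    """Return W_s: times t = r * 2^k (r odd) that are still 'alive' at round s,
    where t stays alive for roughly lifetime_mult * 2^k rounds (assumed constant)."""
    W = []
    for t in range(1, s + 1):
        k = 0
        while t % (2 ** (k + 1)) == 0:
            k += 1                      # largest k such that 2^k divides t
        if t <= s <= t + lifetime_mult * (2 ** k):
            W.append(t)
    return W

# |W_s| grows only logarithmically in s, matching property (i) of Claim C.3.
print([len(working_set(s)) for s in (10, 100, 1000)])
```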
Proof of Claim C.3.
For all we show the following properties of the working sets .
(i) : if then which implies that . For each fixed in this range, if then by construction. Since is an interval of length , it can include at most numbers of the form with odd. Thus, there is at most numbers for each which means that .
(ii) for all : this trivially holds for . Let be the largest such exponent of . Since the size of the interval is , then there exists that divides . This means that the corresponding for is large enough so that , and consequently, .
(iii) : let and , which is equivalent to and . Clearly, satisfies these conditions and is the only such number.
(iv) : suppose there exist two . This implies that which in turn means . Since both are odd, then , and consequently, resulting in . Thus, there can not exist two different members of which concludes that . ∎
C.2 No Strong Adaptivity
Notice that even though the guarantee of Theorem C.2 applies to all intervals $I \subseteq [T]$, it does not entail meaningful guarantees for all of them. The reason is the choice of the exploration parameter: optimizing it for the full horizon yields a meaningless bound for short intervals, while optimizing the bound for small intervals leads to large bounds for the entire horizon. One might then ask whether there exist methods with strongly adaptive guarantees, and we answer this question in the negative.
Theorem C.3.
For any parameter choice and any oracle cost, there exists no online algorithm with feedback access to Oracle 2 that enjoys a strongly adaptive regret guarantee, i.e. regret polynomial in $|I|$ over all intervals $I$ simultaneously.
Proof.
The proof of this impossibility result follows a simple construction: the idea behind it is that strongly adaptive guarantees imply both large and small amounts of exploration. Let us suppose such an algorithm exists and arrive at a contradiction: the algorithm has a regret bound over any oblivious sequence, where depends on problem parameters, and .
Construct the following oblivious sequence: let and be consecutive disjoint intervals such that , for all , and for all (w.l.o.g. we assume divides ). Now for each interval , , sample a fresh and let for all .
According to the assumed guarantee, the overall regret is bounded as which by definition implies that where the last inequality is true for sufficiently large horizon . Since there are consecutive disjoint intervals and less than overall calls to Oracle 2, there exists an interval such that .
On the other hand, the assumed guarantee for implies that the interval of size enjoys sublinear regret, i.e. . We show that this is a contradiction given that . As there were no oracle calls for the interval , the predictions of over are independent of the Rademacher sample of the interval : this is true since the samples for each interval in are independent. Therefore, for all , which means that , since the best loss in hindsight over is equal to as for all ,
Hence, for the interval , the regret of cannot be sublinear, which contradicts the assumption that exhibits strongly adaptive guarantees. We conclude that no strongly adaptive online algorithm exists in the described partial-information model. ∎
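For illustration, a hypothetical instantiation of the hard oblivious sequence used in this proof (equal-length intervals, one fresh Rademacher sign per interval, all signs drawn independently of the learner) could look as follows; the interval lengths and the precise loss construction in the actual proof are left implicit above.

```python
import numpy as np

def oblivious_hard_sequence(H, k, rng=np.random.default_rng(0)):
    """Split the horizon H into k consecutive equal-length intervals and draw a
    fresh Rademacher sign for each; the per-step targets are fixed in advance,
    hence oblivious to the learner's actions."""
    assert H % k == 0
    block = H // k
    signs = rng.choice([-1, 1], size=k)
    return np.repeat(signs, block)  # length-H vector of per-step targets
```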
Appendix D Adaptive Regret for Control of Changing Unknown Dynamics
In this section we give our full control algorithm which attains sublinear regret with respect to up to an additive system variability term. A key component is the system estimation, for which we will use Algorithm 6 and its guarantees from Appendix C. More specifically, our algorithm is based on the canonical explore-exploit approach: it explores with some probability by inputting random controls into the system, and otherwise outputs a control according to DRC-OGD (Algorithm 3). Note that due to the long-term consequences which appear in control, we need to explore for consecutive steps in order to get an estimate for the -truncation of the Markov operator. Hence, our algorithm will determine whether it explores or exploits in blocks of length . Furthermore, we will define the set of Markov operators of length and -norm bounded by as :
Remark D.1.
Note that the radius of is bounded by where .
Proof of Remark D.1.
, we have:
and denoting yields the result. ∎
Remark D.2.
By abuse of notation, we will consider for .
Remark D.3.
For simplicity, we assume divisible by (this is w.l.o.g. up to an extra cost which for us is negligible).
We spell out the full procedure in Algorithm 7 and give its guarantee below in Theorem D.1.
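Before stating the guarantee, the following simplified Python sketch illustrates the block-wise explore/exploit structure just described; it is not Algorithm 7 itself, and the `controller`/`system` interfaces are hypothetical stand-ins.

```python
import numpy as np

def run_blocks(T, m, p, controller, system, rng=np.random.default_rng(0)):
    """Blocks of length m: with probability p a block is exploratory, in which
    case m consecutive random +/-1 controls are played (enough to estimate the
    m-truncated Markov operator); otherwise the DRC-OGD controller acts."""
    t = 0
    while t < T:
        explore = rng.random() < p
        block_len = min(m, T - t)
        for i in range(block_len):
            if explore:
                u = rng.choice([-1.0, 1.0], size=system.control_dim)
            else:
                u = controller.act(t + i)  # exploitation via DRC-OGD
            system.step(u)
        if explore:
            controller.feed_block(system.last_states(block_len))  # estimator update
        t += block_len
```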
Theorem D.1.
For , and , on any contiguous interval , Algorithm 7 enjoys the following adaptive regret guarantee (for precise constants, see Eq. D.2):
The proof of this theorem will proceed in terms of a quantity which we call total system variability which captures the total (rather than average) deviation from the mean operator for each interval. More precisely,
Definition D.1.
Define the total system variability of an LTV dynamical system with Markov operators over a contiguous interval to be
where indicates the norm of the fully vectorized operator and is the empirical average of the operators that correspond to .
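A plausible instantiation of this quantity, assuming squared Frobenius deviations from the empirical mean $\bar G_I$ of the Markov operators $G_t$ on $I$ (both symbols are stand-ins for the quantities described above), is:

\[
\mathrm{Var}_{\mathrm{tot}}(I) \;=\; \sum_{t \in I} \big\lVert G_t - \bar G_I \big\rVert_F^2,
\qquad
\bar G_I \;=\; \frac{1}{\lvert I \rvert}\sum_{s \in I} G_s .
\]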
D.1 Estimation of the Markov Operator
Note that the estimation component of Algorithm 7 directly operates in the setting of Appendix C and effectively solves the problem of adaptively estimating the sequence of the -truncations of the true Markov operators . To formally be able to apply Theorem C.2, we first show that the estimates sent to Algorithm 6 satisfy the properties of Oracle 2.
Claim D.1.
The estimators satisfy the properties of Oracle 2 with .
Proof.
There are only two things to prove:
-
1.
Boundedness. Because we clip nature's x estimates to , and when the DAC policy lies in , we have that if , . If is an exploratory action then by design. By the state evolution equation, we have that
(D.1) and by Cauchy–Schwarz we get
and hence
-
2.
Unbiasedness. Plugging in , we get exactly that . Since this holds for the selected truncation , we have for as defined.
∎
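As an illustration of the unbiasedness argument, here is a minimal sketch of a single-block estimator under the assumption of i.i.d. Rademacher exploration inputs (so that the inputs have identity second moment and zero mean); the indexing and names are hypothetical, and the block is assumed to contain at least m+1 past inputs.

```python
import numpy as np

def block_markov_estimate(xs, us, m):
    """xs[t] is the state reached after input us[t-1]; with i.i.d. Rademacher
    inputs, the outer product xs[t] us[t-i]^T is an unbiased estimate of the
    i-th Markov parameter, for i = 1, ..., m."""
    t = len(xs) - 1                      # estimate from the last state of the block
    d_x, d_u = xs[t].shape[0], us[t - 1].shape[0]
    G_hat = np.zeros((m, d_x, d_u))
    for i in range(1, m + 1):
        G_hat[i - 1] = np.outer(xs[t], us[t - i])
    return G_hat
```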
Hence we can simply apply the guarantees of Appendix C to obtain the following guarantee:
Corollary D.1.
On any interval , we have that:
where we use to denote the set .
However, to properly analyze the additional regret introduced in the control framework by our estimation error, we need to convert Corollary D.1 into a guarantee in terms of the norm which holds for any contiguous interval . This step is rather straightforward and only relies on a few basic properties/observations which we collect in Observation D.2 below.
Observation D.2.
We will make the following observations:
-
1.
for any matrix ,
-
2.
for any set of indices , and any sequence ,
-
3.
,
-
4.
,
-
5.
For any , .
Proof.
Properties - follow from the definitions of the relevant quantities, or are general well-known facts. For , by the triangle inequality, we have that, for any ,
summing the above from to and taking to be the sample mean over yields the desired result. The second inequality simply follows by the fact that truncation can only decrease variance. ∎
As a first step, we bound the Frobenius norm error of the truncated operators over an arbitrary contiguous interval .
Lemma D.3.
On any interval with divisible by (this is w.l.o.g. and only assumed for simplicity of presentation), we can bound the Frobenius estimation error of the truncated operators as:
Proof.
By 3, we can write
For the first term, we will simply apply the above corollary after taking expectation. We will therefore focus on bounding the movement term.
(C.S.)
This implies that
So finally we have that
Finally, taking expectation, plugging in Corollary D.1 and noting that for , (D.2 (2)) and , we get:
∎
Lemma D.4.
On any interval with divisible by , we can bound the squared estimation error of the truncated operators as:
Proof.
We have that
Taking expectation and plugging in the bound in Lemma D.3 yields the promised result. ∎
Finally we can use Lemma D.4 and Cauchy-Schwarz to get a result in terms of the linear (rather than squared) error accumulated over an interval:
Proposition D.5.
On any interval with divisible by , we can bound the estimation error (now linear, rather than squared) of the truncated operators as:
Proof.
By Cauchy-Schwarz and Jensen (since is concave) we have:
∎
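For concreteness, the two elementary facts invoked in this step, stated for a generic nonnegative error sequence $(\epsilon_t)_{t\in I}$ and a nonnegative random variable $X$, are:

\[
\sum_{t \in I} \epsilon_t \;\le\; \sqrt{\lvert I \rvert \sum_{t \in I} \epsilon_t^2}
\quad \text{(Cauchy--Schwarz)},
\qquad
\mathbb{E}\big[\sqrt{X}\big] \;\le\; \sqrt{\mathbb{E}[X]}
\quad \text{(Jensen, by concavity of } \sqrt{\cdot}\,\text{)}.
\]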
D.2 Error Sensitivity
We now analyze concretely how the estimation errors induce additional regret over the case of known systems. We can decompose the expected regret over an interval as:
First let us bound the realized iterate error which bounds the difference between what actually happened and what would have happened in the fictive system (without exploration).
Lemma D.6.
We can bound the difference between the true and the extracted as:
Proof.
Since and because (as argued earlier) , we have:
(Pythagoras)
∎
Lemma D.7 (Realized Iterate Error).
For with divisible by , we can bound the realized iterate error as:
where we denote .
Proof.
Consider the cost difference on a -block indexed by , and distinguish the following three cases:
-
1.
: in this case we are exploring and cannot give a better guarantee than
-
2.
: while we are not exploring during the current round, and hence for , we have explored in the previous round and therefore and may be arbitrarily far apart for . This can cause and to be quite far apart, especially as we approach . As such, in this event, we will also simply bound
-
3.
: finally, in this case we have that for . We can expand:
and
By the observation above, for , we have
Hence, by the sub-quadratic Lipschitzness of the cost and Eq. D.1 we have
So for any , we have
summing over yields the desired result.
∎
Lemma D.8 (Comparator Error).
We can bound the comparator error as:
Proof.
Let . We have that:
(comparator error)
We have that
With a bit more computation, we can also bound the difference in the states. First, let us expand the expression of the states:
The only new thing we need to bound is:
Plugging this into our expressions for , and using previous bounds we have
Hence we conclude that:
∎
D.3 Proof of Theorem D.1
Proof of Theorem D.1.
We have that
Taking expectation and plugging in Proposition D.5 we get:
(D.2)
where , using , and that for the chosen we have . ∎
Appendix E Sublinear Regret for State Feedback
We demonstrate that it is (information-theoretically) possible to achieve sublinear (though large) regret against a benchmark of stabilizing static feedback control policies.
We suppose there is a subset of feedback policies , and our goal is to obtain regret compared to the best :
where are the iterates arising under the control law .
For this setting, we propose an algorithm that runs the classic Exp3 exponential-weights algorithm (see, e.g., Chapter 3 of [8]) on an -cover of in the operator norm. We maintain a constant controller on intervals of length , and feed the losses on those intervals to the Exp3 algorithm. Pseudocode is given in Algorithm 8.
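The following Python sketch shows the bandit reduction at the heart of this scheme (not the full Algorithm 8): Exp3 over a finite cover, with each arm pull corresponding to holding a controller fixed for one interval and feeding back that interval's loss. The `play_block` callback and the rescaling of losses to $[0,1]$ are assumptions of the sketch.

```python
import numpy as np

def exp3_over_cover(cover, num_blocks, eta, play_block, rng=np.random.default_rng(0)):
    """`cover`: list of candidate feedback gains K; `play_block(K)`: run K for one
    interval of length tau and return its loss, rescaled to [0, 1]."""
    n = len(cover)
    weights = np.ones(n)
    for _ in range(num_blocks):
        probs = weights / weights.sum()
        j = rng.choice(n, p=probs)              # sample a controller for this interval
        loss = play_block(cover[j])             # bandit feedback: only this arm's loss
        est = loss / probs[j]                   # importance-weighted loss estimate
        weights[j] *= np.exp(-eta * est)        # multiplicative-weights update
    return weights / weights.sum()
```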
We state our regret bound under the (quite restrictive) assumption that all policies are sequentially stabilizing. Formally, given a sequence of controllers , we define
We assume that exhibits geometric decay uniformly over all times for any fixed :
Assumption 4.
There exists and such that for any indices and any fixed , . We define the constant
Theorem E.1.
Suppose Assumptions 4 and 3 hold, and for some and , , and . In addition, suppose is large enough that . Then, Algorithm 8 with horizon , an appropriate step size , and a minimal -covering of enjoys the following regret bound:
where .
The theorem is established by a reduction to online multi-arm bandits in Section E.1 below.
Remark E.1 (Extensions of Theorem E.1).
The following analysis extends to policies of the form , where is a fixed sequence of control policies determined a priori, is a feedback parameter, and is a bounded affine term. Letting denote the iterates produced by such a policy, our notion of regret is
The only assumptions we require in general are that is bounded, and that , combined with , are sequentially stabilizing in the sense that, for any , the fixed sequence, and any , it holds that the products
exhibit geometric decay. ∎
E.1 Proof of Theorem E.1
In what follows, assume that evenly divides . For every index , define . To avoid confusion, we denote the iterates produced by the algorithm. We define the sequence which begins at state at time , and rolls forward under controller for future times:
Observe that, since we select a new controller just before each time , we have
Therefore, defining the losses,
we have
Therefore, we may decompose the regret as
Here, is the simple regret on the sequence, is the extent to which the sequence approximates regret against controller in the covering, and finally bounds the regret of the covering against the full set . Here, expectations are over the randomness in the algorithm, and due to the obliviousness of the adversary, we may assume that are deterministic and chosen in advance. We bound each of the three terms in sequence. Before proceeding, we use the following estimates:
Lemma E.1 (Key Term Bounds).
Suppose that is sufficiently large that . Moreover, let . Then,
-
(a)
For all and ,
-
(b)
For any and ,
-
(c)
and .
Proof.
Part a: Unfolding the dynamics, and bounding and via Assumption 4,
Next, we bound for some ,
If is sufficiently large that , then the above is just
yielding the bound for all . Part b: Next, let us bound for some . We have
Using , , and , the above simplifies to .
Part c: This follows from the fact that, for any and , . ∎
Bounding
The term corresponds to the simple regret on the sequence of losses over the discrete enumeration of controllers . Examining Algorithm 8, we simply run the Exp3 algorithm on these losses. By appealing to a standard regret bound for this algorithm with appropriate step size , we ensure that
provided that, for all and , . To find the appropriate bound , we note that from the growth condition on the costs, Assumption 3, we have
where the last inequality uses Lemma E.1. Hence,
Bounding :
To bound , it suffices to find a probability-one upper bound on
Using the Lipschitz conditions on , the bounds from Lemma E.1, and the bound ,
Finally, we can compute that the difference depends only on the response to the state difference at time . Hence, using Assumption 4 and Lemma E.1, the above is at most
Concluding, we find
Bounding .
We now turn to bounding , which captures the error of approximating with . We require the following technical lemma:
Lemma E.2.
Let satisfy . Then, for all ,
Hence,
Since is an -covering of in the operator norm, for any we can find a for which . Hence, from the above lemma,
Since , we conclude
Concluding the proof
In sum, we found
where in the last line, we use and . Setting ,
We bound the cardinality of . It suffices to ensure is an -covering in the larger Frobenius norm, which is just the Euclidean norm on . Since is a bounded subset of this space, with radius at most , we can find a covering such that (see, e.g. Chapter 4.2 in [43]). This yields
where we use . Hence, we can bound
Setting gives
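For reference, the covering-number estimate used above is the standard volumetric bound for an $\epsilon$-net of a Euclidean ball of radius $R$ in $d_u d_x$ dimensions (cf. Chapter 4.2 of [43]); the symbols here stand in for the quantities left implicit in the display:

\[
\lvert \mathcal{K}_\epsilon \rvert \;\le\; \Big(1 + \frac{2R}{\epsilon}\Big)^{d_u d_x}.
\]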
E.2 Omitted Proofs
Proof of Lemma E.1.
Part a: All such iterates can be realized by dynamics of the form , and for any appropriate sequence of elements of . For such dynamics, we find
Using and the assumption from Assumption 4, we find
Part b: Since the closed-loop dynamics for and coincide for and are given by , we can compute
Bounding from part (a) and from Assumption 4 yields . Summing over yields . Finally, using and gives
∎
Proof of Lemma E.2.
Introducing the shorthand and , expanding the dynamics, and further introducing the shorthand
Using an elementary matrix telescoping identity,
Thus, invoking the stability assumption (Assumption 4) and setting ,
Thus, we find
where in the last step, we use . Thus, applying Assumption 3 and Lemma E.1,
Continuing, we bound
Finally, we use the bound derived above and bound in view of Lemma E.1. Hence,
where above we use that by assumption. ∎
Appendix F Lower Bounds and Separations
F.1 Separation between policy classes
Let denote sequences over costs, disturbances, and dynamics. We let denote the cost of policy on the sequence . Our lower bounds hold even against sequences which enjoy the following regularity condition.
Definition F.1.
We say that is regular if, for all , satisfies Assumption 3 with , and that for all , , and .
We define the policy classes
That is, are all length Drc policies of unbounded norm allowing affine offsets; are all length Dac policies of unbounded norm allowing affine offsets; are all static feedback policies of unbounded norm allowing affine offsets; and are feedback policies of bounded norm and horizon.
The following theorem demonstrates that the Dac, Drc, and feedback parametrizations are fundamentally incommensurate.
Theorem F.1.
Let denote a universal constant. There exist three regular sequences in which separate Dac, Drc, and feedback controllers, in the following sense:
-
(a)
Under , the static feedback policy selecting satisfies , but
-
(b)
Under , the Dac policy selecting satisfies , but
-
(c)
Under , the Drc policy selecting satisfies , but
Moreover, for any , and , we have
Proof.
We establish the separations for each part with different constant factors. One can choose to be the minimum of all constants which arise.
Proof of part a.
We set to be the sequence with for all , for all , , and
This sequence is clearly regular, and it is clear that the policy has . On the other hand, let , . Since , so . Moreover, since for all , any has for some fixed for all . Then, for all , . Thus, . Using the definition of ,
The bound follows.
Proof of part b.
Set . Then the Dac policy has zero cost. Further, set , thus, , so and are equivalent on this system. Finally, let , and set , and for , set
Then, one can verify via induction that for all . Hence, for all , any has a constant input . However, , which is greater than a universal constant. Hence, the regret must scale as .
Proof of part c.
Fix the policy to select . Denote the sequences that arise from this policy as . We set
By construction has zero cost on . Now, set for all , and
Hence, a similar argument as in part (b), using the fact that is constant but is periodic, shows that any suffers cost .
Let us now analyze the performance of policies . First, observe that for all .
Next, observe that for , and ,
Defining , we have
Hence, for integers ,
(F.1)
We now argue a dichotomy on the size of . First, we show that if the epsilons are large, the costs incurred on a past window of must be as well. This is because Eq. F.1 would necessitate that differs from over the previous window.
Claim F.1.
Suppose . Then, .
Proof.
By Eq. F.1, we have that if , then . Since , the bound follows by upper bounding the maximum with the sum. ∎
On the other hand, we show that if the -terms are small, then the costs on are at least a small constant. This is because the inputs selected by , , are close to constant, and therefore cannot fit the periodic values of .
Claim F.2.
Suppose . Then, .
Proof.
We expand
In particular, suppose . Then unless , the above is at least . On the other hand, if , then . ∎
Combining both cases, we find that, for all such that ,
In particular, we find that for provided (say) . ∎
F.2 Linear Regret Against Drc and Dac
In this section, we demonstrate linear regret against Drc and Dac. We consider a distribution over instances of the following form
(F.2)
Define the constant . For a given , let denote the distribution over induced by drawing
Note that these instances are (a) controllable, (b) stable, (c) have variance scaling like , and (d) the Drc and Dac parametrizations coincide. Letting denote the -th coordinate of vectors , we consider costs of the form
(F.3)
for either or . Note that both choices of ensure that satisfies Assumption 3, and the latter choice ensures that is second order smooth.
Theorem F.2.
Let be any online learning algorithm, and let be as in (F.3). Then, for any , there exists a Drc policy (equivalently, in , since ) such that the expected regret incurred by under the distribution and cost is at least
where above, are universal, positive constants.
Proof.
In what follows, denotes expectation under the instance from , and any algorithmic randomness.
We expand and into their coordinates and of . For any policy , we can decompose its cost as
Note that in the above. The following lemma characterizes the conditional expectation of the
Lemma F.3.
There exists a constant such that
where .
The proof of Lemma F.3 is deferred to the end of the section.
Bounding the cost of
We select to be the Dac (or equivalently, Drc, since ) policy given by
We find that
Since the noise is uniformly bounded and independent for all , we conclude
(F.4)
where we use , , and , and to achieve the bound in Eq. F.4.
Bounding the cost of adaptive policies
Fix any online learning algorithm ; we lower bound its performance. Because and are drawn from a fixed probability distribution, and are therefore oblivious to the learner’s actions, we may assume without loss of generality that is deterministic.
The first step is to argue that any algorithm with small cost must select inputs where are bounded away from zero. Specifically, define the event . Using Lemma F.3 together with ,
(F.5)
We now lower bound , again deferring the proof to the end of the section.
Lemma F.4.
For any executable policy , provided .
F.2.1 Omitted proofs
Proof of Lemma F.3.
Let denote the filtration induced by setting to be the sigma-algebra generated by . We have
Set at . Defining , we then have
(F.6)
∎
Proof of Lemma F.4.
Let us introduce a second event, , defined as
Then, since is non-negative,
Let us first lower bound . Writing , occurs as soon as
Using that and rearranging the above, it is enough that
Now, if holds, then . Furthermore, by construction . Therefore, if holds, then holds as long as
In particular, if , then
(F.7)
Next, we lower bound . To do so, we observe that . Moreover, since is deterministic (see the discussion above), is a deterministic function of , and are determined by . Hence,
Thus, it suffices to characterize the distribution of whenever the event holds. Indeed, we claim that is uniform on whenever holds.
Claim F.5.
Let and . Then (with probability one) if holds.
Proof.
If holds, then there exist exactly two values and in such that
Even conditioned on , is uniformly distributed on , and since and have the same probability mass under the uniform distribution of , it follows that when holds. ∎
Hence,
For or , the minimum is attained at , with values and , respectively, both equal to . Thus, combining with Eq. F.7, we conclude
∎
F.3 Lower Bound without Stability
Theorem F.3.
Consider a scalar LTV system with , and drawn independently and uniformly at random from . Suppose that , and for all . Finally, let be fixed costs. Then,
-
(a)
Then the Drc policy , Dac policy , and static-feedback policy (all chosen with foreknowledge of the sequence) all enjoy:
-
(b)
Any online learning algorithm without foreknowledge of must suffer expected cost
Proof.
In part (a), all policies choose , so that . Since is an equilibrium point and for all , the system remains at zero. Hence, the only cost incurred is at times and , which are costs of zero and respectively. In part (b), we use the unbiasedness of to recurse
yielding for all . Hence, the expected cost of the policy is
Considering the cases where and , we find the above is . ∎
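To spell out the recursion in part (b), read the construction as scalar dynamics $x_{t+1} = a_t x_t + u_t$ with $a_t$ uniform on $\{\pm 1\}$ and independent of the past (this reading is an assumption of the sketch); then, for any causal input sequence,

\[
\mathbb{E}\big[x_{t+1}^2\big]
= \mathbb{E}\big[(a_t x_t + u_t)^2\big]
= \mathbb{E}\big[x_t^2\big] + 2\,\mathbb{E}[a_t]\,\mathbb{E}[x_t u_t] + \mathbb{E}\big[u_t^2\big]
\;\ge\; \mathbb{E}\big[x_t^2\big].
\]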
F.4 Hardness of Computing Best State Feedback Controller
Consider a time-varying linear dynamical system with no noise:
subject to changing convex costs . We show that, even in the no-noise setting, properly learning the optimal state feedback policy is computationally hard. This statement holds even when the control agent has full prior knowledge of the dynamics . It relies on a reduction from the MAX-3SAT problem, which is -hard. Our lower bound is inspired by the analogous one for discrete MDPs by [14].
Theorem F.4.
There exists a reduction from Max-3Sat on -clauses and -literals to the problem of finding the state feedback policy optimal for the cost , over sequentially stable dynamics given by : a solution to Max-3Sat with value implies an optimal cost of at most , and a solution to the control problem with value implies an optimal value of Max-3Sat of at least , for any known .
Let us first describe the construction of the dynamics that reduce the MAX-3SAT problem to the optimal control problem. Consider a 3-CNF formula with clauses and literals . The state space is of dimensionality and the action space is of dimensionality . The control problem is given as a sequence of episodes corresponding to the clauses of the formula .
For a single clause with , let the dynamics be an episode of length constructed as follows. The initial state is . The state transitions are independent of the clause itself and are given by the following :
-
•
for , , , for all other .
-
•
for , it becomes and for all other .
-
•
for take and for all other ; for take to ensure sequential stability.
The action matrices along with the costs , on the other hand, depend on the content of the clause itself. In particular, let be the set of indices of the literals that are in clause . We define the regularity cost to be , where and are the distance functions to the simplex sets of corresponding dimensionality. The action matrices and costs are given as follows:
-
•
for and , and .
-
•
for and , if : then , and for all the other ; the cost is , rewarding the action which corresponds to assigning the literal the value .
-
•
for and , if : then , and for all the other ; the cost is , rewarding the action which corresponds to assigning the literal the value .
-
•
for , and for , and for all other ; for both , the costs are .
Note that the last two rounds for a clause ensure sequential stability and identical starting state .
Lemma F.6.
The described system is sequentially stable and the costs are convex in .
Proof.
By the construction of the state matrices , we know that for any , the operator , implying sequential stability of the system. To show convexity, note that the distance function for any convex and compact set is a convex function. More specifically, for it is given by . This is straightforward to show: take any , and let be the closest points to , respectively, i.e. and . For any , given the convexity of the set , we know that , which concludes the convexity proof for :
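Explicitly, with $S$ the convex set, $p_x, p_y$ the closest points chosen above, and $\lambda \in [0,1]$:

\[
\mathrm{dist}\big(\lambda x + (1-\lambda) y,\, S\big)
\;\le\; \big\lVert \lambda x + (1-\lambda)y - \big(\lambda p_x + (1-\lambda) p_y\big)\big\rVert
\;\le\; \lambda \lVert x - p_x\rVert + (1-\lambda)\lVert y - p_y\rVert
\;=\; \lambda\,\mathrm{dist}(x,S) + (1-\lambda)\,\mathrm{dist}(y,S).
\]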
This means that both and are convex functions since the simplex is convex in any dimension. The construction of the costs is based on these two functions as well as linear components in , hence all costs are convex in . ∎
Lemma F.7.
If there exists an assignment of literals s.t. the formula has satisfied clauses, then there is a corresponding linear policy that suffers the exact cost of .
Proof.
This should be evident from the construction itself. Let the assignment correspond to and to . Then consider the linear policy . Denote by the basis vectors of the space. Note that according to the defined , if is a basis vector of , then is a basis vector of . It is straightforward to check by our construction that being a basis vector of implies is a basis vector of . Since , the state-action pairs when following policy are both basis vectors, and satisfy the regularity conditions of . This means that the policy plays if and plays if . Hence, if the clause is satisfied by the assignment , then the cost of over the episode is exactly , i.e. once the clause is satisfied, is accrued and the state moves to the sink . If the clause is not satisfied, then the cost is since is throughout. This means that the constructed linear policy suffers cost over the whole control sequence. ∎
Lemma F.8.
If there exists a state feedback policy s.t. following the actions results in cost at most for any and any , then there is a literal assignment s.t. the formula has at least satisfied clauses.
Proof.
Let the linear policy matrix be given as . The proof consists of two main components: (i) we argue that the policy with for and is at least as good as in terms of cost up to an approximation factor ; (ii) we show that for that satisfies the constraints, the randomized policy that has w.p. and w.p. (as well as ) suffers expected cost at most that of itself.
Suppose these two claims are true; then the described randomized linear policy has expected cost at most , which means that there exists a deterministic linear policy with first columns as basis vectors, i.e. or , that suffers cost at most . It follows that the corresponding assignment of literals given by the first columns of the linear policy satisfies at least out of the clauses, so has at least satisfied clauses.
To prove (i), first notice that the policy suffers non-positive cost over the entire horizon since it satisfies the necessary constraints given by the regularity cost . Note also that said can be scaled by any constant . Now suppose that under the condition the cost difference of and is bounded by : the choice of can depend on any problem parameters, and since the construction is over overall rounds (finite), such a choice is always possible. Hence, for all such we automatically infer that it suffers cost at most and satisfies the necessary constraints. On the other hand, if the condition does not hold, then the distance of from is bounded from below by , meaning that for a sufficiently large choice of (given knowledge of and all other parameters), the overall cost suffered by will be positive due to , i.e. it will have a higher cost than . Therefore, any state feedback policy can be approximately replaced by that satisfies the constraints ensured by .
To show (ii), for a policy that does satisfy these constraints, i.e. for and , we show that its randomized version is at least as good in terms of expected cost. Proving this claim is a matter of unrolling the dynamics for a single clause . The order, the indices, and the negation (or not) of the literals in do not affect the cost, so w.l.o.g. assume we have . The cost of a general policy over the first iterations is given by
The alternative randomized linear policy instead suffers an expected cost given by
which is readily shown to be no larger than the original cost. ∎
Proof of Theorem F.4.
Lemma F.6 above shows that the given LTV system construction, along with the costs, satisfies the theorem conditions. Lemmas F.7 and F.8 show that the Max-3Sat problem reduces to LTV control, in particular to proper learning of state feedback policies in this setting. Since Max-3Sat is -hard even in its decision form, this implies the computational hardness of offline optimization over LTV state feedback policies. ∎