Contraction-Based Methods for Stable Identification and
Robust Machine Learning: a Tutorial

Ian R. Manchester, Max Revay, Ruigang Wang

This work was supported by the Australian Research Council. The authors are with the Australian Centre for Field Robotics and Sydney Institute for Robotics and Intelligent Systems, The University of Sydney, Sydney, NSW 2006, Australia (e-mail: [email protected]).

Abstract

This tutorial paper provides an introduction to recently developed tools for machine learning, especially learning dynamical systems (system identification), with stability and robustness constraints. The main ideas are drawn from contraction analysis and robust control, but adapted to problems in which large-scale models can be learnt with behavioural guarantees. We illustrate the methods with applications in robust image recognition and system identification.

I Introduction

The purpose of this tutorial is to provide a gentle introduction to some recently developed tools for robust system identification and machine learning, building upon ideas from robust control [1] and contraction analysis [2]. In this tutorial we gradually build from the most basic case of linear time-invariant systems to the latest results on complex nonlinear models incorporating equilibrium neural networks.

The main aim of these methods is to construct model parameterizations that guarantee behavioural properties of the model such as stability and robustness. Such models can then be fit to data in various ways. One hopes that the data are representative and an accurate model is the outcome, but necessarily incomplete data mean that black-box models can often exhibit undesirable or non-physical behaviours. Introducing behavioural constraints on the model helps to ensure that, whatever the data, the resulting model is well-behaved, and also serves as a form of model regularization.

The key ideas were developed over a series of papers including [3, 4, 5, 6, 7, 8, 9, 10, 11]. However, model structures, assumptions, and notation varied somewhat across these papers, so our aim in this tutorial is to provide a simple and consistent explanation of the central ideas.

I-A Related Literature

System identification, i.e. learning dynamical systems that can reproduce observed input/output data, is a common problem in the sciences and engineering, and a wide range of model classes have been developed, including finite impulse response models [12] and models that contain feedback, e.g. nonlinear state-space models [13], autoregressive models [14] and recurrent neural networks [15].

When learning models with feedback it is not uncommon for the model to be unstable even if the data-generating system is stable, and this has led to a large volume of research on guaranteeing model stability. Even in the case of linear models the problem is complicated by the fact that the set of stable matrices is non-convex, and various methods have been proposed to guarantee stability via regularization and constrained optimization [16, 17, 18, 19, 20, 7, 21].

For nonlinear models, there has also been a substantial volume of research on stability guarantees, e.g. for polynomial models [3, 4, 5], Gaussian mixture models [22], and recurrent neural networks [23, 8, 10]. However, the problem is substantially more complex than the linear case, as there are many different definitions of nonlinear stability and even verification of stability of a given model is challenging. Contraction is a strong form of nonlinear stability [2], which is particularly well-suited to problems in learning and system identification since it guarantees stability of all solutions of the model, irrespective of inputs or initial conditions. This is important in learning since the purpose of a model is to simulate responses to previously unseen inputs. The first method for learning contracting nonlinear models was given in [3], and contraction constraints were also used in [4, 6, 5, 24, 23, 8, 10].

More generally, a variety of behavioural constraints have been investigated in the literature: in [25] network models with contraction and monotonicity properties are identified, while [26] finds transverse contracting [27] models that can exhibit limit cycles, and [28] finds stabilizable models via control contraction metrics [29].

An important behavioural constraint is model robustness, characterised in terms of sensitivity to small perturbations in the input. It has recently been shown that recurrent neural network models can be extremely fragile [30], i.e. small changes to the input produce dramatic changes in the output.

Formally, sensitivity and robustness can be quantified via Lipschitz bounds on the input-output mapping defined by the model, e.g. incremental $\ell_{2}$ gain bounds. In machine learning, Lipschitz constants are used in the proofs of generalization bounds [31] and guarantees of robustness to adversarial attacks [32, 33]. There is also ample empirical evidence to suggest that Lipschitz regularity (and model stability) improves generalization in machine learning [34], system identification [10, 11] and reinforcement learning [35].

Unfortunately, even calculation of the Lipschitz constant of a feedforward (static) neural network is NP-hard [36] and instead approximate bounds must be used. The tightest bound known to date is found by using quadratic constraints to construct a behavioural description of the neural network activation functions [37]. Extending this approach to training new neural networks with a prescribed Lipschitz bound is complicated by the fact that the resulting constraint is not jointly convex in the model parameters and the IQC (integral quadratic constraint) multipliers. In [38], Lipschitz bounded feedforward models were trained using the Alternating Direction Method of Multipliers, although this added significant computational cost compared to unconstrained training. In [9] a parameterisation was given allowing equivalent constraints to be satisfied via unconstrained optimization.

In this paper, we give an introduction to a family of tricks and techniques that have proven useful for generating convex sets of linear and nonlinear dynamical models that incorporate various stability and robustness constraints.

II Stability and Contraction

Given a dataset $\tilde{z}$ , we consider the problem of learning a nonlinear state-space dynamical model of the form

x_{t+1}=f(x_{t},u_{t},\theta),\quad y_{t}=g(x_{t},u_{t},\theta)

(1)

that minimizes some loss or cost function depending (in part) on the data, i.e. to solve a problem of the form

\min_{\theta\in\Theta}\;\mathcal{L}(\tilde{z},\theta).

(2)

In the above, $x_{t}\in\mathbb{R}^{n},u_{t}\in\mathbb{R}^{m},y_{t}\in\mathbb{R}^{p},\theta\in\Theta\subseteq\mathbb{R}^{N}$ are the model state, input, output and parameters, respectively. Here $f:\mathbb{R}^{n}\times\mathbb{R}^{m}\times\Theta\rightarrow\mathbb{R}^{n}$ and $g:\mathbb{R}^{n}\times\mathbb{R}^{m}\times\Theta\rightarrow\mathbb{R}^{p}$ are piecewise continuously differentiable functions.

In the context of system identification we may have $\tilde{z}=(\tilde{y},\tilde{u})$ consisting of finite sequences of input-output measurements, and aim to minimize simulation error:

\mathcal{L}(\tilde{z},\theta)=\|y-\tilde{y}\|_{T}^{2}

(3)

where $y=\mathfrak{R}_{a}(\tilde{u})$ is the output sequence generated by the nonlinear dynamical model (1) with initial condition $x_{0}=a$ and inputs $u_{t}=\tilde{u}_{t}$ . Here the initial condition $a$ may be part of the data $\tilde{z}$ , or considered a learnable parameter in $\theta$ .
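As a minimal sketch of the simulation-error loss (3), the following rolls out a model of the form (1) and sums the squared output errors. The scalar toy model, the function names `simulate` and `simulation_error`, and the noise-free "data" are illustrative assumptions, not part of the methods cited above.

```python
import numpy as np

def simulate(f, g, theta, a, u_seq):
    """Roll out model (1) from x_0 = a and return the output sequence y."""
    x = np.asarray(a, dtype=float)
    ys = []
    for u in u_seq:
        ys.append(g(x, u, theta))
        x = f(x, u, theta)
    return np.array(ys)

def simulation_error(f, g, theta, a, u_seq, y_meas):
    """Loss (3): squared norm of the gap between simulated and measured outputs."""
    y = simulate(f, g, theta, a, u_seq)
    return float(np.sum((y - y_meas) ** 2))

# Hypothetical toy model: x+ = theta[0] * x + theta[1] * u, y = x.
f = lambda x, u, th: th[0] * x + th[1] * u
g = lambda x, u, th: x

u_tilde = np.ones(20)
theta_true = np.array([0.8, 0.5])
y_tilde = simulate(f, g, theta_true, 0.0, u_tilde)  # noise-free stand-in for data

loss_true = simulation_error(f, g, theta_true, 0.0, u_tilde, y_tilde)
loss_wrong = simulation_error(f, g, np.array([0.5, 0.5]), 0.0, u_tilde, y_tilde)
```

Here the loss vanishes at the parameters that generated the data and is strictly positive at the perturbed parameters; in practice the minimization in (2) would be carried out over $\theta\in\Theta$ with a gradient-based or convex solver.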

In this paper, we are primarily concerned with constructing model parameterizations that have favourable stability and robustness properties. The particular form of nonlinear stability we use is the following:

Definition 1

A model (1) is said to be contracting if for any two initial conditions $a,b\in\mathbb{R}^{n}$ , given the same input sequence $u$ , the state sequences $x^{a}$ and $x^{b}$ satisfy $|x_{t}^{a}-x_{t}^{b}|\leq K\alpha^{t}|a-b|$ for some $K>0$ and $\alpha\in[0,1)$ .

Figure 1: Contracting systems forget initial conditions and converge to the same response to arbitrary inputs.

The main idea is that contracting models forget their initial conditions exponentially, as illustrated in Fig. 1. This is useful for system identification and learning since typically initial conditions are not known, and the purpose of the model is to compute responses to previously-unseen inputs, so stability must be guaranteed for arbitrary inputs.
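To make Definition 1 concrete, the sketch below simulates a hypothetical two-state model $x_{t+1}=0.5\tanh(Wx_{t})+u_{t}$ with orthogonal $W$ from two different initial conditions under the same random input. This particular model is an assumption chosen for illustration: because $\tanh$ is 1-Lipschitz and $\|W\|_{2}=1$, the map is 0.5-Lipschitz in $x$, so Definition 1 holds with $K=1$ and $\alpha=0.5$.

```python
import numpy as np

W = np.array([[0.6, -0.8], [0.8, 0.6]])  # rotation matrix, spectral norm 1

def f(x, u):
    # tanh is 1-Lipschitz elementwise, so f is 0.5-Lipschitz in x for every u
    return 0.5 * np.tanh(W @ x) + u

rng = np.random.default_rng(0)
u_seq = rng.normal(size=(30, 2))          # arbitrary input, shared by both runs
xa, xb = np.array([5.0, -3.0]), np.array([-4.0, 2.0])
d0 = np.linalg.norm(xa - xb)

dists = []
for u in u_seq:
    dists.append(np.linalg.norm(xa - xb))
    xa, xb = f(xa, u), f(xb, u)

# Definition 1 with K = 1 and alpha = 0.5 (small tolerance for round-off)
bound_ok = all(d <= 0.5 ** t * d0 + 1e-12 for t, d in enumerate(dists))
```

The two state sequences differ only through the initial condition, and their distance is bounded by the geometric envelope $0.5^{t}|a-b|$ regardless of the input sequence.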

In general, contraction can be established by finding a contraction metric [2]. Formally, this is a Riemannian metric imbuing tangent vectors to the state-space (roughly speaking, infinitesimal displacements) with a smoothly-varying notion of length. However, for the purposes of this tutorial we will use constant (state-independent) metrics and hence can equivalently consider finite displacements between solutions. We note that [4, 5, 7] extend the methods in this tutorial to state-dependent Riemannian metrics.

III The Basic Idea

To introduce some key concepts, we set aside contraction for a moment and consider ensuring asymptotic stability of a model of the form:

x_{+}=f(x)

(4)

where for notational simplicity $x_{+}$ is shorthand for $x({t+1})$ and $x$ denotes $x(t)$ .

The aim is to learn a model and prove that the origin is the unique globally-stable equilibrium. I.e. $f(0)=0$ and $x(t)\rightarrow 0$ as $t\rightarrow\infty$ from all initial conditions $x(0)\in\mathbb{R}^{n}$ .

There are several ways to define stability of such a system, but one useful definition is $\ell^{2}$ stability, i.e.

\sum_{t=0}^{\infty}|x(t)|^{2}<\infty

(5)

for solutions of (4) from any initial condition $x(0)$. Note that for this sum to be finite, it must be the case that $|x(t)|\rightarrow 0$ quite fast, and in fact this type of stability can be related to other strong forms such as exponential stability [39].

When the function $f$ defining the model is fixed and known, the standard way to verify stability is to find a Lyapunov function:

Condition 1

Suppose there exist a constant $\epsilon>0$ and a function $V(x)$ which is non-negative: $V(x)\geq 0$ for all $x$, for which

\displaystyle V(x)-V(f(x))\geq\epsilon|x|^{2}\quad\forall x.

(6)

Then the system (4) is $\ell^{2}$ stable.

This follows by summing the inequality along a solution over $t=0,\ldots,T$ for arbitrary $T\geq 0$. The left-hand side telescopes, giving

\displaystyle V(x(0))-V(x(T+1))\geq\sum_{t=0}^{T}\epsilon|x(t)|^{2}

(7)

and since $V(x(T+1))\geq 0$ regardless of $x(T+1)$, and $T\geq 0$ was arbitrary, we have $\sum_{t=0}^{\infty}|x(t)|^{2}\leq\frac{1}{\epsilon}V(x(0))$, so (5) holds and the origin is globally stable.
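As a small worked instance of Condition 1, consider the scalar system $x_{+}=0.5x$ with $V(x)=x^{2}$ (an illustrative choice, not taken from the references). Then $V(x)-V(f(x))=0.75x^{2}$, so (6) holds with $\epsilon=0.75$, and the bound $\frac{1}{\epsilon}V(x(0))$ can be compared with the accumulated sum:

```python
a, x0 = 0.5, 3.0
V = lambda x: x ** 2
eps = 1 - a ** 2                 # V(x) - V(a x) = (1 - a^2) x^2

x, total, decrease_ok = x0, 0.0, True
for _ in range(200):
    # inequality (6), checked along the trajectory with a round-off tolerance
    decrease_ok = decrease_ok and (V(x) - V(a * x) >= eps * x ** 2 - 1e-12)
    total += x ** 2              # partial sum of |x(t)|^2
    x = a * x

bound = V(x0) / eps              # right-hand side of the l2 bound
```

For this example the bound is tight: the geometric series $\sum_{t}0.25^{t}x(0)^{2}$ equals $x(0)^{2}/0.75$, so `total` approaches `bound` from below.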

While verifying (6) for a known $f$ can be challenging in many cases, from the computational perspective it has the advantage that it is a convex (in fact linear) condition on the function $V$. For example, if candidate functions $V$ are linearly-parameterized then this implies a convex constraint on the parameters.

On the other hand, if the objective is to search for $f$ , as it is in system identification, then we strike a problem: the term $V(f(x))$ is not jointly-convex in $V$ and $f$ . A key focus of this tutorial is on how to disentangle these functions in the stability condition.

Now, an equivalent statement to (6) is the following:

Condition 2

For all pairs $x,x_{+}\in\mathbb{R}^{n}$ such that $x_{+}=f(x)$ , the following inequality holds:

V(x)-V(x_{+})\geq\epsilon|x|^{2}.

(8)

However, such conditional statements are hard to verify due to the complex interdependence of the functions $V$ and $f$ and the pairs $x,x_{+}$ on which the condition holds.

A common strategy is to replace such a conditional statement with an unconditional statement via the introduction of multipliers:

Condition 3

There exists a multiplier vector $\lambda(x,x_{+})$ such that, for all pairs $x,x_{+}\in\mathbb{R}^{n}$,

V(x)-V(x_{+})+\lambda(x,x_{+})^{T}(x_{+}-f(x))\geq\epsilon|x|^{2}.

(9)

In robust control this kind of approach is usually called the S-Procedure, and is closely related to Positivstellensatz constructions in semialgebraic geometry [40].

Here we have made some progress, in that the above condition is now linear in the unknowns $V$ and $f$ . However, there are still two issues:

• It is well known that conditions such as (9) are generally very conservative if the multiplier $\lambda$ is fixed, but condition (9) is not jointly convex in $\lambda$ and $f$, since their product $\lambda f$ appears.

• Even if one could search for $\lambda$, this may still be conservative. It is clear that if $(x_{+}-f(x))=0$ then $(x_{+}-f(x))^{q}=0$ for any positive integer $q$, so such terms could be added with additional multipliers. But these conditions are nonlinear in the function $f$.

These issues lead us to the final step in our construction. Instead of an explicit model defined by a function $f$ , we search for an implicit equation

m(x,x_{+})=0

(10)

with the property that for every $x$ there is a unique $x_{+}$ satisfying $m(x,x_{+})=0$ . Roughly speaking, since this representation is non-unique it allows some of the required flexibility of the multiplier to be “absorbed” into the model equations.

This representation admits standard-form difference equations by $m(x,x_{+})=x_{+}-f(x)$ , but is more flexible: e.g. for any invertible (bijective) function $h$ , the model functions $m(x,x_{+})=x_{+}-f(x)$ , $m(x,x_{+})=h(x_{+}-f(x))$ , and $m(x,x_{+})=h(x_{+})-h(f(x))$ all define the same dynamical system. Hence we call $m$ a redundant representation.

This brings us to our final condition:

Condition 4

Suppose there exists a non-negative function $V$ such that

\displaystyle V(x)-V(x_{+})+\lambda(x,x_{+})^{T}m(x,x_{+})\geq\epsilon|x|^{2}.

(11)

Then all trajectories satisfying $m(x,x_{+})=0$ satisfy (8).

Note that this condition is jointly convex in $V$ and $m$ . It is still not jointly convex in $m$ and $\lambda$ , but as we will see below the additional flexibility in $m$ ameliorates this problem quite significantly.

IV Learning Stable Linear Systems

Linear dynamical systems are by far the most comprehensively studied when it comes to questions of stability, and this extends to the issue of model stability in system identification. There are several possible approaches to guaranteeing stability of identified linear models, including state-space and transfer-function representations, but the approach we describe here has the advantage of extending naturally to the nonlinear model structures in the following sections.

A linear state-space model is defined by four matrix variables $A,B,C,D$ as below:

x_{+}=Ax+Bu,

(12)

y=Cx+Du.

(13)

This system is asymptotically – and indeed exponentially – stable if and only if the spectral radius (magnitude of the largest eigenvalue) of $A$ is strictly less than one.

Despite the simplicity of linear system stability, there is one complicating factor when trying to identify the dynamics: the set of stable $A$ matrices is not convex. This fact is illustrated by the following simple example:

Example 1

Consider two matrices $A_{a}$ and $A_{b}$ below and their average $A_{c}$ :

A_{a}=\begin{bmatrix}0.5&2\\ 0&0\end{bmatrix},\quad A_{b}=\begin{bmatrix}0&0\\ 2&0.5\end{bmatrix}

(14)

A_{c}=\frac{1}{2}(A_{a}+A_{b})=\begin{bmatrix}0.25&1\\ 1&0.25\end{bmatrix}.

(15)

Both $A_{a}$ and $A_{b}$ have spectral radius 0.5, so they are stable, but their average $A_{c}$ has spectral radius 1.25, so it is unstable.
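The spectral radii quoted in Example 1 can be reproduced with a few lines of NumPy; the helper name `spectral_radius` is ours.

```python
import numpy as np

def spectral_radius(A):
    """Largest eigenvalue magnitude of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))

A_a = np.array([[0.5, 2.0], [0.0, 0.0]])
A_b = np.array([[0.0, 0.0], [2.0, 0.5]])
A_c = 0.5 * (A_a + A_b)          # convex combination of two stable matrices

rho_a, rho_b, rho_c = (spectral_radius(A) for A in (A_a, A_b, A_c))
```

The midpoint of two stable matrices lies outside the stable set, which is exactly the non-convexity that the implicit representation introduced later in this section is designed to remove.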

Stability of a linear system can also be established via a quadratic Lyapunov function $V(x)=x^{T}Px$ , leading to the following well-known condition:

Condition 5

The system (12) is stable if and only if there exists a $P=P^{T}\succ 0$ such that

A^{T}PA-P\prec 0.

It is somewhat obvious but worth noting that differences between solutions of (12) obey the linear dynamics

\Delta_{+}=A\Delta_{x}

(16)

where $\Delta_{x}=x^{a}_{t}-x^{b}_{t}$ and $\Delta_{+}=x^{a}_{t+1}-x^{b}_{t+1}$ , and $x^{a},x^{b}$ are two solutions of (12) with different initial conditions but the same input $u$ . Hence for linear systems, stability of the origin of the unforced system and incremental stability of all solutions (with inputs) are equivalent. This is not true for nonlinear systems.

Now, we have seen that the set of stable models parameterized via $A$ is non-convex, but following from the discussion in the previous section we can guess it may be beneficial to introduce an implicit representation. We introduce the new structure:

Ex_{+}=Fx+Ku,

(17)

y=Cx+Du,

(18)

which, so long as $E$ is invertible, is equivalent to (12) via $A=E^{-1}F,B=E^{-1}K$. Hence this representation does not add any fundamental expressivity to the model set, but we will see below that it does allow us to convexify the set of stable models. In this representation, differences between solutions obey the dynamics:

E\Delta_{+}=F\Delta_{x}

(19)

Now, following the main idea from the previous section, we consider the contraction metric $\Delta_{x}^{T}P\Delta_{x}$ , and then combine the stability condition with the implicit model equations:

\displaystyle\underbrace{\Delta_{+}^{T}P\Delta_{+}-\Delta_{x}^{T}P\Delta_{x}}_{\textrm{Contraction metric decrease}}-\underbrace{2\Delta_{+}^{T}(E\Delta_{+}-F\Delta_{x})}_{=0\textrm{ by definition of model}}\leq-\epsilon|\Delta_{x}|^{2}.

(20)

Since the term $E\Delta_{+}-F\Delta_{x}$ vanishes along model solutions, we have

\displaystyle\Delta_{+}^{T}P\Delta_{+}\leq\Delta_{x}^{T}P\Delta_{x}-\epsilon|\Delta_{x}|^{2}

(21)

and summing over time, we obtain for all pairs of solutions

\sum_{t=0}^{\infty}|x^{a}_{t}-x^{b}_{t}|^{2}\leq\frac{1}{\epsilon}(x^{a}_{0}-x^{b}_{0})^{T}P(x^{a}_{0}-x^{b}_{0})

(22)

I.e. the system is stable.

The key benefit of this formulation is that condition (20) is convex in the model parameters $E,F,P$ , and can be rewritten as the linear matrix inequality (LMI):

\begin{bmatrix}(E+E^{T}-P)&-F\\ -F^{T}&P\end{bmatrix}\succ 0.

(23)

Furthermore, for every stable linear system (12), there exists a representation in the implicit form (17) such that (23) holds [3, 4]. Thus we have a convex parameterization of all stable linear models.

We also note that the lower-right block of (23) ensures $P\succ 0$ and the upper-left block ensures that $E+E^{T}\succ P\succ 0$, which implies that all eigenvalues of $E$ have positive real parts. Hence $E$ is invertible, so the model is well-posed.
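The existence claim can be illustrated numerically. The sketch below takes the stable matrix $A_{a}$ from Example 1, solves the Lyapunov equation $P-A^{T}PA=I$, and uses the choice $E=P$, $F=PA$. This specific choice is our illustration rather than the construction of [3, 4]: with it, the Schur complement of (23) is $P-A^{T}PA=I\succ 0$, so the LMI holds. The helper `solve_discrete_lyapunov` is a hypothetical name for a small dense solver.

```python
import numpy as np

def solve_discrete_lyapunov(A, Q):
    """Solve P - A^T P A = Q by vectorisation (adequate for small n)."""
    n = A.shape[0]
    K = np.eye(n * n) - np.kron(A.T, A.T)
    return np.linalg.solve(K, Q.reshape(-1)).reshape(n, n)

A = np.array([[0.5, 2.0], [0.0, 0.0]])        # A_a from Example 1
P = solve_discrete_lyapunov(A, np.eye(2))
P = 0.5 * (P + P.T)                           # symmetrise against round-off
E, F = P, P @ A                               # one admissible implicit form

lmi = np.block([[E + E.T - P, -F], [-F.T, P]])          # left-hand side of (23)
min_eig = float(np.min(np.linalg.eigvalsh(lmi)))
A_rec = np.linalg.solve(E, F)                 # recovers A = E^{-1} F
```

A positive `min_eig` confirms that this implicit representation of $A_{a}$ lies inside the convex set defined by (23), and `A_rec` returns the original explicit model.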

Remark 1

In [6] this formulation was used to learn maximum likelihood models from input-output data via an expectation maximization algorithm.

Remark 2

Condition (23) takes the form of an LMI, but can be used to define a direct (unconstrained) parameterization of stable linear models. Indeed, take $V\in\mathbb{R}^{2n\times 2n}$, where $n$ is the dimension of the system. Then let $H=VV^{T}+\epsilon I$ for small $\epsilon>0$, so $H$ is positive definite. Then partition $H$ into $n\times n$ blocks, and we can extract $P=H_{22},F=-H_{12},E=\frac{1}{2}(H_{11}+H_{22}+S)$ where $S$ is any skew-symmetric matrix, and the resulting model satisfies (23). This can be useful since there are many algorithms for unconstrained optimization that are well-suited to machine learning with large data sets, e.g. stochastic gradient descent and Adam [41].
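The direct parameterization of Remark 2 can be sketched as follows. The function name `stable_linear_from_free` and the variable names `V_free`, `S_free` are our own labels for the unconstrained parameters, and the skew-symmetric matrix is generated as `S_free - S_free.T`, which is one convenient assumption among many.

```python
import numpy as np

def stable_linear_from_free(V_free, S_free, eps=1e-3):
    """Map unconstrained matrices to (E, F, P) satisfying the LMI (23)."""
    n = V_free.shape[0] // 2
    H = V_free @ V_free.T + eps * np.eye(2 * n)   # positive definite by construction
    H11, H12, H22 = H[:n, :n], H[:n, n:], H[n:, n:]
    S = S_free - S_free.T                          # any skew-symmetric matrix
    P = H22
    F = -H12
    E = 0.5 * (H11 + H22 + S)
    return E, F, P

rng = np.random.default_rng(1)
n = 4
E, F, P = stable_linear_from_free(rng.normal(size=(2 * n, 2 * n)),
                                  rng.normal(size=(n, n)))

A = np.linalg.solve(E, F)                          # explicit model A = E^{-1} F
rho = float(np.max(np.abs(np.linalg.eigvals(A))))
lmi = np.block([[E + E.T - P, -F], [-F.T, P]])
min_eig = float(np.min(np.linalg.eigvalsh(lmi)))
```

Every draw of the free parameters yields a model with spectral radius below one, so a gradient-based optimizer can update `V_free` and `S_free` without any projection or constraint handling.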

V Learning Contracting Recurrent Neural Networks

Next we adapt the main ideas to learning a class of nonlinear models incorporating neural networks. For nonlinear models, Lyapunov stability and contraction can be very different, e.g. a double pendulum has energy as a (non-strict) Lyapunov function, and has bounded solutions, but exhibits chaotic response to large initial conditions. Similarly, if one takes a standard recurrent neural network (RNN) [15] with sigmoid activation functions, then all solutions are bounded since the activation maps to a bounded interval, but within this interval solutions can be chaotic, i.e. nearby trajectories rapidly diverge.

Contraction [2] ensures that trajectories with the same input but different initial conditions always converge, and rules out chaotic behaviour, and is hence useful in learning predictive models since initial conditions are rarely known and it is the input-output response that is typically being modelled.

RNNs come in various flavours but they are generally characterized by interconnections of “learnable” linear mappings (weight matrices) and fixed nonlinear mappings (“activation functions”). In this tutorial, we first consider the class introduced in [10]:

\begin{aligned}x_{+}&=Ax+B_{1}w+B_{2}u,\quad v=C_{1}x+D_{12}u\\ y&=C_{2}x+D_{21}w+D_{22}u,\quad w=\sigma(v)\end{aligned}

where $\sigma$ is a scalar nonlinearity applied elementwise. We assume that it is slope-restricted in $[0,1]$ , i.e.

0\leq\frac{\sigma(v^{a})-\sigma(v^{b})}{v^{a}-v^{b}}\leq 1,\quad\forall v^{a},v^{b}\in\mathbb{R},\;v^{a}\neq v^{b}.

(24)

All widely used activation functions (ReLU, leaky ReLU, sigmoid, tanh) satisfy this restriction, possibly after rescaling.

This model class includes many commonly-used model structures, including classical RNNs [15], linear dynamical systems, and single-hidden-layer feedforward neural networks, see [10] for details.

To establish that a model of this form is contracting we will need to deal with the nonlinearity $\sigma$ . We will do this by introducing some quadratic bounds for the signal. With $w=\sigma(v)$ the slope restriction can be rewritten as

(v_{a}-v_{b})(w_{a}-w_{b})\geq(w_{a}-w_{b})^{2}

Since this holds for all activation channels (the elements of $v$ ), we can also take positive combinations of these:

\sum_{i}\lambda_{i}\left[(v_{a}^{i}-v_{b}^{i})(w_{a}^{i}-w_{b}^{i})-(w_{a}^{i}-w_{b}^{i})^{2}\right]=(C_{1}\Delta_{x}-\Delta_{w})^{T}\Lambda\Delta_{w}\geq 0

(25)

where $\Lambda$ is a diagonal matrix, with diagonal elements $\lambda_{i}>0$ .

This quadratic constraint is however not jointly convex in the model parameter $C_{1}$ and the multiplier matrix $\Lambda$. The trick is to absorb the multiplier into the model by defining $\tilde{C}=\Lambda C_{1}$, obtaining the equivalent constraint:

(\tilde{C}\Delta_{x}-\Lambda\Delta_{w})^{T}\Delta_{w}\geq 0.

And we note that since $\Lambda\succ 0$ , $C_{1}$ is recoverable from $\tilde{C}$ .
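The incremental sector bound and the multiplier-absorption step can be checked numerically. In the sketch below, the dimensions, the random weights and the sampling ranges are arbitrary assumptions; the two quadratic forms compared are the one in (25) and its counterpart written with $\tilde{C}=\Lambda C_{1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda v: np.maximum(v, 0.0)

lam = rng.uniform(0.1, 2.0, size=5)       # diagonal of Lambda, all positive
C1 = rng.normal(size=(5, 3))
C_tilde = lam[:, None] * C1               # C_tilde = Lambda @ C1

worst, max_gap = np.inf, 0.0
for sigma in (relu, np.tanh):             # both are slope-restricted in [0, 1]
    for _ in range(500):
        xa, xb = rng.normal(size=3), rng.normal(size=3)
        va, vb = C1 @ xa, C1 @ xb
        dv, dw = va - vb, sigma(va) - sigma(vb)
        original = (dv - dw) @ (lam * dw)                    # form used in (25)
        absorbed = (C_tilde @ (xa - xb) - lam * dw) @ dw     # multiplier absorbed
        worst = min(worst, float(original))
        max_gap = max(max_gap, abs(float(original - absorbed)))

C1_recovered = C_tilde / lam[:, None]     # Lambda is invertible, so C1 is recoverable
```

The smallest sampled value of the quadratic form stays non-negative up to round-off, and the two forms agree, which is what allows $\tilde{C}$ and $\Lambda$ to be treated as independent decision variables.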

Then, by following much the same reasoning as in the previous section, we construct the following constraint:

\begin{aligned}&\underbrace{\Delta_{+}^{T}P\Delta_{+}-\Delta_{x}^{T}P\Delta_{x}}_{\textrm{Lyapunov function decrease}}-\underbrace{2\Delta_{+}^{T}(E\Delta_{+}-F\Delta_{x}-B_{1}\Delta_{w})}_{=0\textrm{ due to linear block}}\\ &\quad+\underbrace{2(\tilde{C}\Delta_{x}-\Lambda\Delta_{w})^{T}\Delta_{w}}_{\geq 0\textrm{ due to sector condition}}\leq-\epsilon|\Delta_{x}|^{2}\end{aligned}

(26)

As depicted, the term incorporating the linear dynamics vanishes, while the term incorporating the sector condition is always non-negative, and hence can be dropped without affecting the truth of the inequality. So for all pairs of model trajectories we have

\displaystyle\Delta_{+}^{T}P\Delta_{+}\leq\Delta_{x}^{T}P\Delta_{x}-\epsilon|\Delta_{x}|^{2}

(27)

and hence the system is contracting.

Now, as in the previous section, (26) is convex in the model parameters and can be written as an LMI:

\begin{bmatrix}(E+E^{T}-P)&-F&-B_{1}\\ -F^{T}&P&-\tilde{C}^{T}\\ -B_{1}^{T}&-\tilde{C}&2\Lambda\end{bmatrix}\succ 0.

(28)

Hence we have constructed a convex parameterization of contracting recurrent neural networks. Furthermore, by examining the upper-left two-thirds and comparing to (23), it is clear that if $B_{1}=\tilde{C}=0$ these are equivalent, so the set of contracting RNNs contains all stable linear models as a subset. This is a natural requirement, but it is not satisfied by previously-proposed sets of contracting RNNs, e.g. [23, 8].
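As a hedged numerical illustration, the sketch below builds the matrix in (28) for a small hand-picked model (all numerical values are assumptions chosen so that the matrix is strictly diagonally dominant, hence positive definite), recovers the explicit RNN, and tracks the metric $\Delta_{x}^{T}P\Delta_{x}$ between two trajectories driven by the same input. The input matrices $B_{2}$ and $D_{12}$ do not enter (28) and are set to the identity for simplicity.

```python
import numpy as np

n, q, m = 2, 2, 2
E, P, Lam = np.eye(n), np.eye(n), np.eye(q)
F = np.array([[0.3, 0.2], [-0.2, 0.3]])
B1 = np.array([[0.2, 0.1], [0.1, 0.2]])
C_tilde = np.array([[0.2, -0.1], [0.1, 0.2]])
B2, D12 = np.eye(n), np.eye(q)            # input terms do not affect contraction

lmi = np.block([[E + E.T - P, -F, -B1],
                [-F.T, P, -C_tilde.T],
                [-B1.T, -C_tilde, 2 * Lam]])
min_eig = float(np.min(np.linalg.eigvalsh(lmi)))

# Recover the explicit model from the implicit parameters.
A = np.linalg.solve(E, F)
B1x, B2x = np.linalg.solve(E, B1), np.linalg.solve(E, B2)
C1 = np.linalg.solve(Lam, C_tilde)

def step(x, u):
    w = np.tanh(C1 @ x + D12 @ u)         # slope-restricted activation
    return A @ x + B1x @ w + B2x @ u

rng = np.random.default_rng(6)
xa, xb = np.array([4.0, -2.0]), np.array([-3.0, 5.0])
metric = []
for u in rng.normal(size=(15, m)):        # same input for both trajectories
    d = xa - xb
    metric.append(float(d @ P @ d))
    xa, xb = step(xa, u), step(xb, u)

decreasing = all(b <= a for a, b in zip(metric, metric[1:]))
```

Because the LMI holds with a margin, the metric decreases at every step for any input sequence, in line with (27).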

VI Robust Recurrent Neural Networks

Beyond contraction, it can also be important to ensure a model is robust, i.e. not highly sensitive to small changes in the input. One measure for this is the $\ell_{2}$ Lipschitz constant of the input-output map, a.k.a. the incremental $\ell^{2}$ gain.

A model satisfies a Lipschitz bound of $\gamma$ if, for all pairs of inputs $u^{a},u^{b}$ , corresponding solutions (from identical initial conditions) satisfy

\sum_{t=0}^{T}|y^{a}_{t}-y^{b}_{t}|^{2}\leq\gamma^{2}\sum_{t=0}^{T}|u^{a}_{t}-u^{b}_{t}|^{2}

(29)

for all $T>0$. To obtain such bounds, we use the incremental form of the model:

\begin{aligned}\Delta_{+}&=A\Delta_{x}+B_{1}\Delta_{w}+B_{2}\Delta_{u},\\ \Delta_{v}&=C_{1}\Delta_{x}+D_{12}\Delta_{u},\\ \Delta_{y}&=C_{2}\Delta_{x}+D_{21}\Delta_{w}+D_{22}\Delta_{u}.\end{aligned}

Now, following the same procedure as above we can verify the Lipschitz bound via an incremental dissipation inequality:

\begin{aligned}&\underbrace{\Delta_{+}^{T}P\Delta_{+}-\Delta_{x}^{T}P\Delta_{x}}_{\textrm{Storage function decrease}}-\underbrace{2\Delta_{+}^{T}(E\Delta_{+}-F\Delta_{x}-B_{1}\Delta_{w}-B_{2}\Delta_{u})}_{=0\textrm{ due to linear block}}\\ &\quad+\underbrace{2(\tilde{C}\Delta_{x}+\tilde{D}\Delta_{u}-\Lambda\Delta_{w})^{T}\Delta_{w}}_{\geq 0\textrm{ due to sector condition}}\leq\underbrace{-|\Delta_{y}|^{2}+\gamma^{2}|\Delta_{u}|^{2}}_{\ell^{2}\textrm{ gain}}\end{aligned}

(30)

with $\tilde{D}=\Lambda D_{12}$, so that $\tilde{C}\Delta_{x}+\tilde{D}\Delta_{u}=\Lambda\Delta_{v}$. This is jointly convex in $P,E,F,B_{1},B_{2},\tilde{C},\tilde{D},\Lambda$ and, summing over $t$, we have

\sum_{t=0}^{T}[-|\Delta_{y}|^{2}+\gamma^{2}|\Delta_{u}|^{2}]\geq\Delta_{T+1}^{T}P\Delta_{T+1}-\Delta_{0}^{T}P\Delta_{0}.

With identical initial conditions, $\Delta_{0}=0$, and since $P\succ 0$, $\Delta_{T+1}^{T}P\Delta_{T+1}\geq 0$. So we have

\sum_{t=0}^{T}[-|\Delta_{y}|^{2}+\gamma^{2}|\Delta_{u}|^{2}]\geq 0

which can be rearranged to obtain (29). Once again, (30) is convex in the model parameters, and can be written as an LMI (see [10]).

VII Equilibrium Networks

The RNN structure in the previous two sections incorporated single-hidden-layer neural networks, but much of the recent interest in machine learning has focused on deep (multi-layer) neural networks. Variations with “implicit depth”, such as equilibrium networks [42, 9], a.k.a. implicit deep models [43], have also recently attracted interest.

In this section, we consider nonlinear static maps defined by equilibrium networks, but we will keep the notation consistent with the previous sections. We can modify the nonlinearity by incorporating non-zero $D_{11}$ below:

w=\sigma(D_{11}w+D_{12}u+b_{w}),\quad y=D_{21}w+b_{y}

(31)

These have been called equilibrium networks because they can be seen as computing equilibria of difference or differential equations: $w_{t+1}=\sigma(D_{11}w_{t}+D_{12}u+b_{w})$, or $\dot{w}=-w+\sigma(D_{11}w+D_{12}u+b_{w})$. Equilibrium networks are quite flexible, e.g. consider an $L$-layer feedforward network, described by the following recursive equation

\begin{split}z^{0}=u,\;z^{\ell+1}=\sigma(W^{\ell}z^{\ell}+b^{\ell}),\;y=W^{L}z^{L}+b^{L}\end{split}

(32)

with $\ell=0,\ldots,L-1$. This can be rewritten as an equilibrium network (31) with hidden units $w=\mathrm{col}(z^{1},\ldots,z^{L})$, biases $b_{w}=\mathrm{col}(b^{0},\ldots,b^{L-1})$ and $b_{y}=b^{L}$, and weights $D_{21}=\begin{bmatrix}0&\cdots&0&W^{L}\end{bmatrix}$,

D_{11}=\begin{bmatrix}0&&&\\ W^{1}&\ddots&&\\ \vdots&\ddots&0&\\ 0&\cdots&W^{L-1}&0\end{bmatrix},\;D_{12}=\begin{bmatrix}W^{0}\\ 0\\ \vdots\\ 0\end{bmatrix}.

(33)
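The embedding (33) can be verified numerically. The sketch below assumes a hypothetical network with three hidden layers, random weights and $\tanh$ activations, assembles $D_{11}$, $D_{12}$, $D_{21}$ as block matrices, and compares the equilibrium-network output with an ordinary forward pass. Since $D_{11}$ is strictly block lower triangular, $L$ sweeps of the fixed-point iteration starting from zero reach the equilibrium exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
sizes = [3, 4, 5, 4, 2]                   # input, L = 3 hidden layers, output
L = len(sizes) - 2
W = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(L + 1)]
b = [rng.normal(size=sizes[l + 1]) for l in range(L + 1)]
sigma = np.tanh

def feedforward(u):
    z = u
    for l in range(L):                    # recursion (32)
        z = sigma(W[l] @ z + b[l])
    return W[L] @ z + b[L]

# Assemble the block matrices of (33).
off = np.concatenate(([0], np.cumsum(sizes[1:L + 1])))   # block offsets in w
q = int(off[-1])
D11 = np.zeros((q, q))
D12 = np.zeros((q, sizes[0]))
D21 = np.zeros((sizes[-1], q))
D12[off[0]:off[1], :] = W[0]
for k in range(1, L):
    D11[off[k]:off[k + 1], off[k - 1]:off[k]] = W[k]
D21[:, off[L - 1]:off[L]] = W[L]
b_w, b_y = np.concatenate(b[:L]), b[L]

def equilibrium(u):
    w = np.zeros(q)
    for _ in range(L):                    # D11 is nilpotent: L sweeps suffice
        w = sigma(D11 @ w + D12 @ u + b_w)
    return D21 @ w + b_y, w

u = rng.normal(size=sizes[0])
y_ff = feedforward(u)
y_eq, w_eq = equilibrium(u)
residual = float(np.linalg.norm(w_eq - sigma(D11 @ w_eq + D12 @ u + b_w)))
```

The two outputs coincide and the residual of the implicit equation (31) is zero up to round-off, so the feedforward network is one particular equilibrium network.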

An equilibrium network can be thought of as an algebraic interconnection of linear and nonlinear components:

v=D_{11}w+D_{12}u+b_{w},\quad y=D_{21}w+b_{y},\quad w=\sigma(v)

where the linear part satisfies the incremental relations:

\Delta_{v}=D_{11}\Delta_{w}+D_{12}\Delta_{u},\quad\Delta_{y}=D_{21}\Delta_{w}

and the nonlinear part satisfies the incremental sector condition $(\Delta_{v}-\Delta_{w})^{T}\Lambda\Delta_{w}\geq 0$ for any positive diagonal $\Lambda$ , as per Section V.

Since an equilibrium network is defined by an implicit equation, it may not have a solution or it may have many. We make the following definition:

Definition 2

An equilibrium network is well-posed if a unique solution $w$ exists for all inputs $u$ and biases.

Well-posedness depends on the particular weight matrix $D_{11}$ . In [9] it was shown that if there exists a positive diagonal matrix $\Lambda$ such that

2\Lambda-\Lambda D_{11}-D_{11}^{T}\Lambda\succ 0

(34)

then the equilibrium network is well-posed.

Indeed, supposing a solution exists, its uniqueness follows from similar reasoning to the previous sections. Suppose two solutions $w_{1}$ and $w_{2}$ exist with the same $u$, and let $\Delta_{w}=w_{1}-w_{2}$. Then (34) states that there exists a scalar $\epsilon>0$ such that

\underbrace{2(\Lambda D_{11}\Delta_{w}-\Lambda\Delta_{w})^{T}\Delta_{w}}_{\geq 0\textrm{ due to sector condition}}\leq-\epsilon|\Delta_{w}|^{2}

(35)

for all $\Delta_{w}$. But then the sector condition, as indicated, implies the left-hand side is non-negative, while the right-hand side is non-positive, which is only possible if $\Delta_{w}=0$. Existence of solutions can also be shown via a contraction argument [9].
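As a small numerical check of condition (34), the sketch below uses random weights with $D_{11}$ rescaled to spectral norm 0.9 and $\Lambda=I$, both of which are assumptions made for illustration. With this scaling the symmetric part of $D_{11}$ is below the identity, so (34) holds. The plain fixed-point iteration used to solve the network converges here because $\|D_{11}\|_{2}<1$, which is a stronger property than (34); it is not the solver used in [9].

```python
import numpy as np

rng = np.random.default_rng(4)
q, m = 6, 3
M = rng.normal(size=(q, q))
D11 = 0.9 * M / np.linalg.norm(M, 2)      # spectral norm 0.9
D12 = rng.normal(size=(q, m))
b_w = rng.normal(size=q)
Lam = np.eye(q)                           # one admissible multiplier

cond34 = 2 * Lam - Lam @ D11 - D11.T @ Lam
min_eig = float(np.min(np.linalg.eigvalsh(cond34)))

def solve_equilibrium(u, w0, iters=500):
    """Fixed-point iteration; converges here since ||D11||_2 < 1."""
    w = w0.copy()
    for _ in range(iters):
        w = np.tanh(D11 @ w + D12 @ u + b_w)
    return w

u = rng.normal(size=m)
w_a = solve_equilibrium(u, np.zeros(q))
w_b = solve_equilibrium(u, rng.normal(size=q))   # different starting guess
residual = float(np.linalg.norm(w_a - np.tanh(D11 @ w_a + D12 @ u + b_w)))
```

Both starting guesses converge to the same $w$, consistent with the uniqueness argument above.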

Moving beyond well-posedness, we can impose a Lipschitz bound $|\Delta_{y}|\leq\gamma|\Delta_{u}|$ in much the same way, by changing the right-hand-side in the inequality above:

\underbrace{2(\Lambda\Delta_{v}-\Lambda\Delta_{w})^{T}\Delta_{w}}_{\geq 0\textrm{ due to sector condition}}\leq\gamma|\Delta_{u}|_{2}^{2}-\frac{1}{\gamma}|\Delta_{y}|_{2}^{2}

(36)

where $\Delta_{v}=D_{11}\Delta_{w}+D_{12}\Delta_{u}$ and $\Delta_{y}=D_{21}\Delta_{w}$ . This can be rewritten as the constraint:

2\Lambda-\Lambda D_{11}-D_{11}^{T}\Lambda-\frac{1}{\gamma}D_{21}^{T}D_{21}-\frac{1}{\gamma}\Lambda D_{12}D_{12}^{T}\Lambda\succ 0.

(37)

which is convex in the parameters $\tilde{D}_{11}=\Lambda D_{11}$ and $\tilde{D}_{12}=\Lambda D_{12}$, from which $D_{11}$ and $D_{12}$ can easily be recovered since $\Lambda$ is invertible. Although we don’t go into detail here, one can also construct direct parameterisations enabling unconstrained optimization, see [9].
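To illustrate how (37) certifies a Lipschitz bound, the sketch below fixes random weights and $\Lambda=I$ (both assumptions for illustration), computes a value of $\gamma$ for which (37) holds, and compares it with sampled ratios $|\Delta_{y}|/|\Delta_{u}|$. Writing (37) as $M_{0}-\frac{1}{\gamma}N\succ 0$ with $M_{0}\succ 0$, it holds whenever $\gamma$ exceeds the largest eigenvalue of $M_{0}^{-1}N$. The fixed-point solver relies on $\|D_{11}\|_{2}<1$ and is not the method of [9].

```python
import numpy as np

rng = np.random.default_rng(5)
q, m, p = 6, 3, 2
M = rng.normal(size=(q, q))
D11 = 0.5 * M / np.linalg.norm(M, 2)      # spectral norm 0.5
D12, D21 = rng.normal(size=(q, m)), rng.normal(size=(p, q))
b_w, b_y = rng.normal(size=q), rng.normal(size=p)
Lam = np.eye(q)

M0 = 2 * Lam - Lam @ D11 - D11.T @ Lam
N = D21.T @ D21 + Lam @ D12 @ D12.T @ Lam
# (37) reads M0 - N / gamma > 0; take gamma slightly above lambda_max(M0^{-1} N)
gamma = 1.01 * float(np.max(np.linalg.eigvals(np.linalg.solve(M0, N)).real))
min_eig = float(np.min(np.linalg.eigvalsh(M0 - N / gamma)))

def net(u, iters=200):
    w = np.zeros(q)
    for _ in range(iters):                # converges since ||D11||_2 < 1
        w = np.tanh(D11 @ w + D12 @ u + b_w)
    return D21 @ w + b_y

worst_ratio = 0.0
for _ in range(200):
    ua, ub = rng.normal(size=m), rng.normal(size=m)
    ratio = np.linalg.norm(net(ua) - net(ub)) / np.linalg.norm(ua - ub)
    worst_ratio = max(worst_ratio, float(ratio))
```

The sampled ratios give only a lower bound on the true Lipschitz constant, while $\gamma$ is a certified upper bound for this choice of $\Lambda$; searching over diagonal $\Lambda$ can tighten it.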

VII-A Robust Image Recognition

In this subsection we present some results from [9] applying such Lipschitz bounded equilibrium networks (LBEN) to the task of image recognition on the classic MNIST character recognition data set. We compare to the MON network of [42] and the LMT network of [44].

Figure 2 compares different networks in terms of test error without any perturbation (vertical axis) and a lower bound on the Lipschitz constant (horizontal axis) determined by local search for a worst-case input perturbation. The vertical lines indicate the theoretical upper bounds on the Lipschitz constant for different LBENs. It can be seen that the LBEN structures allow trading off between sensitivity to perturbation and nominal test error, and that the results are better (further to the lower left) than competing methods. Furthermore, the theoretical upper bounds on the Lipschitz constant are quite close to the observed lower bounds. Note also that introducing a Lipschitz bound actually decreases nominal test error compared to the LBEN with $\gamma<\infty$, which is only constrained to be well-posed. I.e. we see a regularising effect.

In Figure 3 we compare classification error of different networks as the size of the adversarial input perturbation increases. It can be seen that LBEN structures allow for a much slower degradation in test error with only a minor effect on nominal (perturbation=0) performance. E.g. with a perturbation of 5, LBEN models can achieve an error of around 15% while all other structures are at around 50-60% error.

VIII Recurrent Equilibrium Networks

In this section we discuss the recurrent equilibrium network (REN), recently introduced in [11]. Roughly speaking, it is a combination of the recurrent networks of Sections V and VI with the equilibrium networks of Section VII.

The REN model structure is

$$\begin{aligned} x_{+}&=Ax+B_{1}w+B_{2}u, &\quad v&=C_{1}x+D_{11}w+D_{12}u,\\ y&=C_{2}x+D_{21}w+D_{22}u, &\quad w&=\sigma(v). \end{aligned}$$

Compared to Sections V and VI, the difference is the presence of $D_{11}$, which means that the model incorporates an equilibrium network for $w$.
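The role of $D_{11}$ can be seen in a minimal simulation sketch of one REN time step. Because $w=\sigma(C_{1}x+D_{11}w+D_{12}u)$ is implicit in $w$, it must be solved at every step. The sketch below makes assumptions that are not part of the model definition: it takes $\sigma=\tanh$, and it solves the equilibrium by plain fixed-point iteration, which is only guaranteed to converge under the stronger condition $\|D_{11}\|_{2}<1$ rather than the more general well-posedness condition of Section VII. The function name `ren_step` and the random parameters are hypothetical.

```python
import numpy as np

def ren_step(x, u, params, n_iter=200, tol=1e-10):
    """One step of the REN: solve w = sigma(v) implicitly, then update x, y.

    Assumptions for this sketch: sigma = tanh, and ||D11||_2 < 1 so that
    simple fixed-point iteration is a contraction and converges.
    """
    A, B1, B2, C1, D11, D12, C2, D21, D22 = params
    b = C1 @ x + D12 @ u          # part of v that does not depend on w
    w = np.zeros(D11.shape[0])
    for _ in range(n_iter):
        w_new = np.tanh(D11 @ w + b)
        converged = np.linalg.norm(w_new - w) < tol
        w = w_new
        if converged:
            break
    x_next = A @ x + B1 @ w + B2 @ u
    y = C2 @ x + D21 @ w + D22 @ u
    return x_next, y, w

# Hypothetical dimensions and random parameters, for illustration only.
n, q, m, p = 3, 4, 2, 1           # state, neurons, inputs, outputs
rng = np.random.default_rng(0)
A = 0.5 * rng.standard_normal((n, n))
B1 = rng.standard_normal((n, q))
B2 = rng.standard_normal((n, m))
C1 = rng.standard_normal((q, n))
D11 = rng.standard_normal((q, q))
D11 *= 0.5 / np.linalg.norm(D11, 2)   # enforce ||D11||_2 = 0.5 < 1
D12 = rng.standard_normal((q, m))
C2 = rng.standard_normal((p, n))
D21 = rng.standard_normal((p, q))
D22 = rng.standard_normal((p, m))
params = (A, B1, B2, C1, D11, D12, C2, D21, D22)

x0 = rng.standard_normal(n)
u0 = rng.standard_normal(m)
x1, y0, w0 = ren_step(x0, u0, params)

# The returned w should satisfy the equilibrium equation w = tanh(v).
v0 = C1 @ x0 + D11 @ w0 + D12 @ u0
print("equilibrium residual:", np.linalg.norm(w0 - np.tanh(v0)))
```

Setting `D11` to zero reduces the inner loop to a single evaluation, which corresponds to the models of Sections V and VI.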

We omit details here, but following the same procedure as previous sections, we obtain the following LMI for contracting RENs:

$$\begin{bmatrix}E+E^{T}-P&-F&-B_{1}\\ -F^{T}&P&-\tilde{C}^{T}\\ -B_{1}^{T}&-\tilde{C}&2\Lambda-\tilde{D}_{11}-\tilde{D}_{11}^{T}\end{bmatrix}\succ 0,$$

where $\tilde{D}_{11}=\Lambda D_{11}$. We briefly make some remarks about this model structure:

• The lower-right block is the same as (34), and hence well-posedness of the equilibrium network is automatically satisfied via the contraction constraint.

• Compared to the structure in Section V, the lower-right block is no longer constrained to be diagonal. Hence it is now straightforward to construct a direct parameterization of contracting RENs, enabling learning via unconstrained optimization methods such as stochastic gradient descent. We refer the reader to [11] for details.

• We have also extended the approach of Section VI to RENs and, in fact, achieve a direct parameterization of robust (Lipschitz bounded) RENs. See [11] for details.
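As a small numerical illustration of the contraction LMI for RENs, the following sketch assembles the block matrix for given $E$, $F$, $P$, $B_{1}$, $\tilde{C}$, $\Lambda$ and $\tilde{D}_{11}$, and tests positive definiteness via its smallest eigenvalue. This only verifies a candidate set of parameters; it is not the direct parameterization of [11]. The function names and the numerical values are hypothetical and chosen so that the outcome can be checked by hand (the first example is strictly diagonally dominant).

```python
import numpy as np

def ren_contraction_lmi(E, F, P, B1, C_til, Lam, D11_til):
    """Assemble the block matrix of the REN contraction LMI."""
    top = np.hstack([E + E.T - P, -F, -B1])
    mid = np.hstack([-F.T, P, -C_til.T])
    bot = np.hstack([-B1.T, -C_til, 2 * Lam - D11_til - D11_til.T])
    return np.vstack([top, mid, bot])

def is_positive_definite(M, tol=1e-9):
    """Check M > 0 via the smallest eigenvalue of its symmetric part."""
    return bool(np.linalg.eigvalsh(0.5 * (M + M.T)).min() > tol)

# Hypothetical example with 2 states and 2 neurons.
E = np.eye(2)
P = np.eye(2)
F = 0.2 * np.eye(2)
B1 = 0.1 * np.ones((2, 2))
C_til = 0.1 * np.ones((2, 2))
Lam = np.eye(2)                               # diagonal multiplier
D11_til = np.array([[0.0, 0.0], [0.3, 0.0]])  # D11_til = Lam @ D11

M_good = ren_contraction_lmi(E, F, P, B1, C_til, Lam, D11_til)
print("LMI satisfied:", is_positive_definite(M_good))

# Recover D11 from D11_til, using that Lam is invertible.
D11 = np.linalg.solve(Lam, D11_til)

# A candidate that violates the lower-right (well-posedness) block:
# 2*Lam - D11_til - D11_til.T = -2*I is negative definite.
M_bad = ren_contraction_lmi(E, F, P, B1, C_til, Lam, 2.0 * np.eye(2))
print("LMI satisfied:", is_positive_definite(M_bad))
```

The second example shows the point of the first remark above: if the lower-right block fails to be positive definite, the whole LMI fails, so any feasible point is automatically well-posed.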

VIII-A Application to Robust System Identification

We have applied the robust REN structure to the F16 ground vibration data set from [45]. This benchmark problem exhibits a challenging combination of high dimension, highly resonant responses, and significant nonlinearity.

In Figure 4, we see a plot of normalised root-mean-square test error without perturbations (vertical axis) vs sensitivity of models to perturbations. It can be seen that the REN structure provides a significantly improved combination of nominal fit and robustness to perturbations compared to previous structures such as the Robust RNN from Section VI, the widely-used long short-term memory (LSTM) [46], and the standard RNN structure [15]. In Figure 5 we see a zoomed-in view of the relative size of the output perturbations of different models in response to a fixed input perturbation. It can be seen that the REN structures are much more robust to perturbation than the LSTM and standard RNN.

IX Conclusions

This tutorial paper has given an introduction to recently developed tools for machine learning, especially learning dynamical systems (system identification), with stability and robustness constraints.

The main procedure is the following: (i) formulate stability and robustness via contraction and incremental $\ell^{2}$ gain conditions; (ii) add linear model constraints to the stability/robustness conditions via Lagrange multipliers; (iii) add nonlinear components via incremental quadratic constraints and associated multipliers; (iv) make the resulting inequality convex via implicit model structures that merge the model parameters with the multipliers.

The methods have been illustrated with applications in robust image recognition and system identification, where it was shown that significantly improved (and more robust) models can be obtained compared to existing methods. Our current work is in applying this approach to a broader set of problems, including observer design [47, 48] and reinforcement learning [35, 49], and investigating additional behavioural constraints, e.g. incremental passivity.

References

[1] K. Zhou, J. C. Doyle, K. Glover et al., Robust and Optimal Control. Prentice Hall, New Jersey, 1996, vol. 40.
[2] W. Lohmiller and J.-J. E. Slotine, “On contraction analysis for non-linear systems,” Automatica, vol. 34, pp. 683–696, 1998.
[3] M. M. Tobenkin, I. R. Manchester, J. Wang, A. Megretski, and R. Tedrake, “Convex optimization in identification of stable non-linear state space models,” in 49th IEEE Conference on Decision and Control (CDC). IEEE, 2010.
[4] M. M. Tobenkin, I. R. Manchester, and A. Megretski, “Convex Parameterizations and Fidelity Bounds for Nonlinear Identification and Reduced-Order Modelling,” IEEE Transactions on Automatic Control, vol. 62, no. 7, pp. 3679–3686, Jul. 2017.
[5] J. Umenberger and I. R. Manchester, “Specialized Interior-Point Algorithm for Stable Nonlinear System Identification,” IEEE Transactions on Automatic Control, vol. 64, no. 6, pp. 2442–2456, 2018.
[6] J. Umenberger, J. Wagberg, I. R. Manchester, and T. B. Schön, “Maximum likelihood identification of stable linear dynamical systems,” Automatica, vol. 96, pp. 280–292, 2018.
[7] J. Umenberger and I. R. Manchester, “Convex Bounds for Equation Error in Stable Nonlinear Identification,” IEEE Control Systems Letters, vol. 3, no. 1, pp. 73–78, Jan. 2019.
[8] M. Revay and I. Manchester, “Contracting implicit recurrent neural networks: Stable models with improved trainability,” in Learning for Dynamics and Control. PMLR, 2020, pp. 393–403.
[9] M. Revay, R. Wang, and I. R. Manchester, “Lipschitz bounded equilibrium networks,” arXiv:2010.01732, 2020.
[10] M. Revay, R. Wang, and I. R. Manchester, “A convex parameterization of robust recurrent neural networks,” IEEE Control Systems Letters, vol. 5, no. 4, pp. 1363–1368, 2021.
[11] M. Revay, R. Wang, and I. R. Manchester, “Recurrent equilibrium networks: Unconstrained learning of stable and robust dynamical models,” 60th IEEE Conference on Decision and Control, 2021.
[12] M. Schetzen, The Volterra and Wiener Theories of Nonlinear Systems. USA: Krieger Publishing Co., Inc., 2006.
[13] T. B. Schön, A. Wills, and B. Ninness, “System identification of nonlinear state-space models,” Automatica, vol. 47, no. 1, pp. 39–49, 2011.
[14] S. A. Billings, Nonlinear system identification: NARMAX methods in the time, frequency, and spatio-temporal domains. John Wiley & Sons, 2013.
[15] D. Mandic and J. Chambers, Recurrent neural networks for prediction: learning algorithms, architectures and stability. Wiley, 2001.
[16] J. M. Maciejowski, “Guaranteed stability with subspace methods,” Systems & Control Letters, vol. 26, no. 2, pp. 153–156, Sep. 1995.
[17] T. Van Gestel, J. A. Suykens, P. Van Dooren, and B. De Moor, “Identification of stable models in subspace identification by using regularization,” IEEE Transactions on Automatic Control, vol. 46, no. 9, pp. 1416–1420, 2001.
[18] S. L. Lacy and D. S. Bernstein, “Subspace identification with guaranteed stability using constrained optimization,” IEEE Transactions on automatic control, vol. 48, no. 7, pp. 1259–1263, 2003.
[19] U. Nallasivam, B. Srinivasan, V. Kuppuraj, M. N. Karim, and R. Rengaswamy, “Computationally Efficient Identification of Global ARX Parameters With Guaranteed Stability,” IEEE Transactions on Automatic Control, vol. 56, no. 6, pp. 1406–1411, Jun. 2011.
[20] D. N. Miller and R. A. De Callafon, “Subspace identification with eigenvalue constraints,” Automatica, vol. 49, no. 8, pp. 2468–2473, 2013.
[21] G. Y. Mamakoukas, O. Xherija, and T. Murphey, “Memory-Efficient Learning of Stable Linear Dynamical Systems for Prediction and Control,” Advances in Neural Information Processing Systems, vol. 33, pp. 13527–13538, 2020.
[22] S. M. Khansari-Zadeh and A. Billard, “Learning Stable Nonlinear Dynamical Systems With Gaussian Mixture Models,” IEEE Transactions on Robotics, vol. 27, no. 5, pp. 943–957, Oct. 2011.
[23] J. Miller and M. Hardt, “Stable recurrent models,” in International Conference on Learning Representations, 2019.
[24] V. Sindhwani, S. Tu, and M. Khansari, “Learning contracting vector fields for stable imitation learning,” arXiv preprint arXiv:1804.04878, 2018.
[25] M. Revay, J. Umenberger, and I. R. Manchester, “Distributed identification of contracting and/or monotone network dynamics,” IEEE Transactions on Automatic Control, 2021.
[26] I. R. Manchester, M. M. Tobenkin, and J. Wang, “Identification of nonlinear systems with stable oscillations,” in 2011 50th IEEE Conference on Decision and Control and European Control Conference. IEEE, 2011, pp. 5792–5797.
[27] I. R. Manchester and J.-J. E. Slotine, “Transverse contraction criteria for existence, stability, and robustness of a limit cycle,” Systems & Control Letters, vol. 63, pp. 32–38, 2014.
[28] S. Singh, S. M. Richards, V. Sindhwani, J.-J. E. Slotine, and M. Pavone, “Learning stabilizable nonlinear dynamics with contraction-based regularization,” The International Journal of Robotics Research, 2020.
[29] I. R. Manchester and J.-J. E. Slotine, “Control contraction metrics: Convex and intrinsic criteria for nonlinear feedback design,” IEEE Trans. Autom. Control, vol. 62, no. 6, pp. 3046–3053, Jun. 2017.
[30] M. Cheng, J. Yi, P.-Y. Chen, H. Zhang, and C.-J. Hsieh, “Seq2sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples,” in Association for the Advancement of Artificial Intelligence, 2020, pp. 3601–3608.
[31] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky, “Spectrally-normalized margin bounds for neural networks,” in Advances in Neural Information Processing Systems, 2017, pp. 6240–6249.
[32] T. Huster, C.-Y. J. Chiang, and R. Chadha, “Limitations of the lipschitz constant as a defense against adversarial examples,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2018, pp. 16–29.
[33] H. Qian and M. N. Wegman, “L2-nonexpansive neural networks,” International Conference on Learning Representations (ICLR), 2019.
[34] H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree, “Regularisation of neural networks by enforcing lipschitz continuity,” Machine Learning, vol. 110, no. 2, pp. 393–416, 2021.
[35] A. Russo and A. Proutiere, “Optimal attacks on reinforcement learning policies,” arXiv preprint arXiv:1907.13548, 2019.
[36] A. Virmaux and K. Scaman, “Lipschitz regularity of deep neural networks: analysis and efficient estimation,” in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Curran Associates, Inc., 2018.
[37] M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. Pappas, “Efficient and accurate estimation of lipschitz constants for deep neural networks,” in Advances in Neural Information Processing Systems, 2019, pp. 11423–11434.
[38] P. Pauli, A. Koch, J. Berberich, P. Kohler, and F. Allgöwer, “Training robust neural networks using lipschitz bounds,” IEEE Control Systems Letters, 2021.
[39] A. Megretski and A. Rantzer, “System analysis via integral quadratic constraints,” IEEE Trans. Autom. Control, vol. 42, no. 6, pp. 819–830, Jun. 1997.
[40] P. A. Parrilo, “Semidefinite programming relaxations for semialgebraic problems,” Math. Program., vol. 96, no. 2, pp. 293–320, 2003.
[41] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” International Conference for Learning Representations (ICLR), Jan. 2017.
[42] E. Winston and J. Z. Kolter, “Monotone operator equilibrium networks,” arXiv:2006.08591, 2020.
[43] L. El Ghaoui, F. Gu, B. Travacca, A. Askari, and A. Y. Tsai, “Implicit deep learning,” arXiv:1908.06315, 2019.
[44] Y. Tsuzuku, I. Sato, and M. Sugiyama, “Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks,” arXiv preprint arXiv:1802.04034, 2018.
[45] J.-P. Noël and M. Schoukens, “F-16 aircraft benchmark based on ground vibration test data,” in 2017 Workshop on Nonlinear System Identification Benchmarks, 2017, pp. 19–23.
[46] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural computation, vol. 9, pp. 1735–1780, 1997.
[47] I. R. Manchester, “Contracting nonlinear observers: Convex optimization and learning from data,” in 2018 Annual American Control Conference (ACC). IEEE, 2018, pp. 1873–1880.
[48] B. Yi, R. Wang, and I. R. Manchester, “Reduced-order nonlinear observers via contraction analysis and convex optimization,” IEEE Transactions on Automatic Control, 2021.
[49] J. W. Roberts, I. R. Manchester, and R. Tedrake, “Feedback controller parameterizations for reinforcement learning,” in IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2011.
