
A Convex Parameterization of Robust Recurrent Neural Networks

Max Revay, Ruigang Wang, Ian R. Manchester This work was supported by the Australian Research Council.The authors are with the Australian Centre for Field Robotics, The University of Sydney, Sydney, NSW 2006, Australia (e-mail: [email protected]).
Abstract

Recurrent neural networks (RNNs) are a class of nonlinear dynamical systems often used to model sequence-to-sequence maps. RNNs have excellent expressive power but lack the stability or robustness guarantees that are necessary for many applications. In this paper, we formulate convex sets of RNNs with stability and robustness guarantees. The guarantees are derived using incremental quadratic constraints and can ensure global exponential stability of all solutions, and bounds on incremental $\ell_2$ gain (the Lipschitz constant of the learned sequence-to-sequence mapping). Using an implicit model structure, we construct a parametrization of RNNs that is jointly convex in the model parameters and stability certificate. We prove that this model structure includes all previously-proposed convex sets of stable RNNs as special cases, and also includes all stable linear dynamical systems. We illustrate the utility of the proposed model class in the context of non-linear system identification.

I INTRODUCTION

RNNs are state-space models incorporating neural networks that are frequently used in system identification and machine learning to model dynamical systems and other sequence-to-sequence mappings. It has long been observed that RNNs can be difficult to train in part due to model instability, referred to as the exploding gradients problem [1], and recent work shows that these models are often not robust to input perturbations [2]. These issues are related to long-standing concerns in control theory, i.e. stability and Lipschitz continuity of solutions of dynamical systems [3].

There are many types of stability for nonlinear systems (e.g., RNNs). When learning dynamical systems with inputs, Lyapunov approaches are inappropriate as they require the construction of a Lyapunov function about a known stable solution. In machine learning and system identification, however, the goal is to simulate the learned model with new inputs, generating new solutions. Incremental stability [4] and contraction analysis [5] avoid this issue by showing stability for all inputs and trajectories.

Even if a model is stable, it is usually problematic if its output is very sensitive to small changes in the input. This sensitivity can be quantified by the model’s incremental $\ell_2$ gain. Finite incremental $\ell_2$ gain implies both boundedness and continuity of the input-output map [4]. Furthermore, the incremental $\ell_2$ gain bound is also a bound on the Lipschitz constant of the sequence-to-sequence mapping. In machine learning, the Lipschitz constant is used in proofs of generalization bounds [6], analysis of expressiveness [7] and guarantees of robustness to adversarial attacks [8, 9].

The problem of training models with stability or robustness guaranteed a-priori has seen significant attention for both linear [10, 11] and nonlinear [12, 13, 14] models. The main difficulty comes from the non-convexity of most model structures and their stability certificates. Some methods deal with this difficulty by simply fixing the stability certificate and optimizing over the model parameters, at significant cost to model expressiveness [15]. It has recently been shown that implicit parametrizations allow joint convexity of the model and a stability certificate for linear [16], polynomial [12] and RNN [17] models. It has also been observed that stability constraints serve as an effective regularizer and can improve generalization performance [18, 17].

When a system is expressed in terms of a neural network, even the problem of analyzing the stability of known dynamics is difficult. A number of approaches have formulated LMI conditions [19, 20, 21] guaranteeing Lyapunov stability of a particular equilibrium. Recently, incremental quadratic constraints [3] have been applied to (non-recurrent) neural networks to develop the tightest bounds on the Lipschitz constant known to date [22].

Contributions: In this letter we propose a new convex parameterization of RNNs satisfying stability and robustness conditions. By treating RNNs as linear systems in feedback with a slope-restricted, element-wise nonlinearity, we can apply methods from robust control to develop stability conditions that are less conservative than prior methods. The proposed model set contains all previously published sets of stable RNNs, and all stable linear time-invariant (LTI) systems. Using implicit parameterizations with incremental quadratic constraints, we construct a set of models that is jointly convex in the model parameters, stability certificate and the multipliers required by the incremental quadratic constraint approach. Joint convexity in all parameters simplifies the training of stable models as constraints can be easily dealt with using penalty, barrier or projected gradient methods.

Notation. We use $\mathbb{N},\mathbb{R}$ to denote the sets of natural and real numbers, respectively. The set of all one-sided sequences $x:\mathbb{N}\rightarrow\mathbb{R}^{n}$ is denoted by $\ell_{2e}^{n}$. The superscript $n$ is omitted when it is clear from the context. For $x\in\ell_{2e}^{n}$, $x_{t}\in\mathbb{R}^{n}$ is the value of the sequence $x$ at time $t\in\mathbb{N}$. The notation $|\cdot|:\mathbb{R}^{n}\rightarrow\mathbb{R}$ denotes the standard 2-norm. The subset $\ell_{2}\subset\ell_{2e}$ consists of all square-summable sequences, i.e., $x\in\ell_{2}$ if and only if the $\ell_{2}$ norm $\|x\|:=\sqrt{\sum_{t=0}^{\infty}|x_{t}|^{2}}$ is finite. Given a sequence $x\in\ell_{2e}$, the $\ell_{2}$ norm of its truncation over $[0,T]$ with $T\in\mathbb{N}$ is written as $\|x\|_{T}:=\sqrt{\sum_{t=0}^{T}|x_{t}|^{2}}$. For a matrix $A$, we write $A\succ 0$ and $A\succeq 0$ to mean that $A$ is positive definite or positive semi-definite, respectively, and $A\succ B$ and $A\succeq B$ to mean $A-B\succ 0$ and $A-B\succeq 0$, respectively. The set of diagonal, positive definite matrices is denoted $\mathbb{D}_{+}$.

II Problem Formulation

We are interested in learning nonlinear state space models:

x_{t+1}=f_{\theta}(x_{t},u_{t}), \quad (1)
y_{t}=g_{\theta}(x_{t},u_{t}), \quad (2)

where $x_{t}\in\mathbb{R}^{n}$ is the state, $u_{t}\in\mathbb{R}^{m}$ is a known input and $y_{t}\in\mathbb{R}^{p}$ is the output. The functions $f_{\theta},g_{\theta}$ are parametrized by $\theta\in\Theta\subseteq\mathbb{R}^{N}$ and will be defined later. Given the initial condition $x_{0}=a$, the dynamical system (1), (2) provides a sequence-to-sequence mapping $S_{a}:\ell_{2e}^{m}\mapsto\ell_{2e}^{p}$.

Definition 1.

The system (1), (2) is termed incrementally $\ell_2$ stable if for any two initial conditions $a$ and $b$, given the same input sequence $u$, the corresponding output trajectories $y^{a}$ and $y^{b}$ satisfy $y^{a}-y^{b}\in\ell_{2}^{p}$.

This definition implies that initial conditions are forgotten; however, the outputs can still be sensitive to small perturbations in the input. In such cases, it is natural to measure system robustness in terms of the incremental $\ell_2$-gain.

Definition 2.

The system (1), (2) is said to have an incremental $\ell_2$-gain bound of $\gamma$ if for all pairs of solutions with initial conditions $a,b\in\mathbb{R}^{n}$ and input sequences $u^{a},u^{b}\in\ell_{2e}^{m}$, the output sequences $y^{a},y^{b}\in\ell_{2e}^{p}$ satisfy

\left\|y^{a}-y^{b}\right\|_{T}^{2}\leq\gamma^{2}\left\|u^{a}-u^{b}\right\|_{T}^{2}+d(a,b),\quad\forall T\in\mathbb{N}, (3)

for some $d(a,b)\geq 0$ with $d(a,a)=0$.

Note that the above definition implies incremental $\ell_2$ stability since $\|y^{a}-y^{b}\|_{T}^{2}\leq d(a,b)$ for all $T\in\mathbb{N}$ when $u^{a}=u^{b}$. It also shows that all operators defined by (1) and (2) are Lipschitz continuous with Lipschitz constant $\gamma$, i.e. for any $a\in\mathbb{R}^{n}$ and all $T\in\mathbb{N}$

\|S_{a}(u)-S_{a}(v)\|_{T}\leq\gamma\|u-v\|_{T},\quad\forall u,v\in\ell_{2e}^{m}. (4)

The goal of this work is to construct a rich parametrization of the functions $f_{\theta}$ and $g_{\theta}$ in (1), (2), with robustness guarantees. We focus on two robustness guarantees in this work:

  1. A model set parametrized by $\Theta_{*}\subset\mathbb{R}^{N}$ is robust if for all $\theta\in\Theta_{*}$ the system has finite incremental $\ell_2$-gain.

  2. A model set parameterized by $\Theta_{\gamma}\subset\mathbb{R}^{N}$ is $\gamma$-robust if for all $\theta\in\Theta_{\gamma}$ the system has an incremental $\ell_2$-gain bound of $\gamma$.

III Robust RNNs

III-A Model Structure

Figure 1: Feedback interconnection for RNNs.

We parameterize the functions (1), (2) as a feedback interconnection between a linear system $G$ and a static, memoryless nonlinear operator $\Phi$:

G\;\begin{cases}x_{t+1}=\bar{F}x_{t}+\bar{B}_{1}w_{t}+\bar{B}_{2}u_{t}\\ y_{t}=C_{1}x_{t}+D_{11}w_{t}+D_{12}u_{t}\\ v_{t}=\bar{C}_{2}x_{t}+\bar{b}+\bar{D}_{22}u_{t}\end{cases}, (5)
w=\Phi(v), (6)

where $\Phi(v)=[\phi(v_{1})\ \cdots\ \phi(v_{q})]^{\top}$ with $v_{i}$ the $i$th component of $v\in\ell_{2e}^{q}$. This feedback interconnection is shown in Fig. 1. We assume that the slope of $\phi$ is restricted to the interval $[0,\beta]$:

0\leq\frac{\phi(y)-\phi(x)}{y-x}\leq\beta,\quad\forall x,y\in\mathbb{R},\ x\neq y. (7)

In the neural network literature, such functions are referred to as “activation functions”, and common choices (e.g. tanh, ReLU, sigmoid) are slope restricted [23].
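As a simple numerical illustration (ours, not from the paper), the slope restriction (7) can be checked empirically for common activations; for tanh and ReLU the slope bound is $\beta=1$, while for the sigmoid it is $\beta=1/4$.

# Illustrative check (not from the paper): empirically estimate the maximum
# slope of common activation functions over random point pairs, cf. (7).
import numpy as np

def max_slope(phi, n_pairs=100000, scale=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x, y = rng.normal(0.0, scale, (2, n_pairs))
    mask = np.abs(y - x) > 1e-9               # avoid division by zero
    return np.max((phi(y[mask]) - phi(x[mask])) / (y[mask] - x[mask]))

relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

print("tanh    beta ~", max_slope(np.tanh))   # close to 1
print("ReLU    beta ~", max_slope(relu))      # equal to 1
print("sigmoid beta ~", max_slope(sigmoid))   # close to 0.25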

The proposed model structure is highly expressive and contains many commonly used model structures. For instance, LTI systems are obtained when $\bar{B}_{1}=0$ and $D_{11}=0$. RNNs of the form [24]:

x_{t+1}=\mathcal{B}_{1}\Phi(\mathcal{A}x_{t}+\mathcal{B}u_{t}+b), (8)
y_{t}=\mathcal{C}x_{t}+\mathcal{D}u_{t}, (9)

are obtained with the choice $\bar{F}=0$, $\bar{B}_{1}=\mathcal{B}_{1}$, $\bar{B}_{2}=0$, $C_{1}=\mathcal{C}$, $D_{11}=0$, $D_{12}=\mathcal{D}$, $\bar{C}_{2}=\mathcal{A}$, $\bar{D}_{22}=\mathcal{B}$ and $\bar{b}=b$. This implies that (5), (6) is a universal approximator for dynamical systems over bounded domains as $q\rightarrow\infty$ [25].
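To make the explicit model concrete, the following sketch simulates (5), (6) for given matrices; the parameter dictionary layout and function name are our own illustration, not the authors' implementation.

# Minimal sketch (assumed layout, not the released code) of simulating the
# explicit model (5), (6): a linear system G in feedback with Phi.
import numpy as np

def simulate_explicit(params, u_seq, x0, phi=np.tanh):
    Fb, B1b, B2b = params["Fbar"], params["B1bar"], params["B2bar"]
    C1, D11, D12 = params["C1"], params["D11"], params["D12"]
    C2b, D22b, bb = params["C2bar"], params["D22bar"], params["bbar"]
    x, ys = x0, []
    for u in u_seq:
        v = C2b @ x + bb + D22b @ u            # input to the nonlinearity
        w = phi(v)                             # w_t = Phi(v_t), element-wise
        ys.append(C1 @ x + D11 @ w + D12 @ u)  # output equation in (5)
        x = Fb @ x + B1b @ w + B2b @ u         # state update in (5)
    return np.array(ys)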

Even for linear systems, the set of robust or $\gamma$-robust models is non-convex. Constructing a set of parameters for which (5), (6) is robust or $\gamma$-robust is further complicated by the presence of the nonlinear activation function in $\Phi$. We will simplify the analysis by replacing $\Phi$ with incremental quadratic constraints.

III-B Description of $\Phi$ by Incremental Quadratic Constraints

Multiplying (7) through by $(y-x)^{2}$, and combining the two inequalities, we get:

\left[\begin{matrix}y-x\\ \phi(y)-\phi(x)\end{matrix}\right]^{\top}\left[\begin{matrix}0&\beta\\ \beta&-2\end{matrix}\right]\left[\begin{matrix}y-x\\ \phi(y)-\phi(x)\end{matrix}\right]\geq 0. (10)

For $v^{a},v^{b}\in\mathbb{R}^{q}$ and $w^{a}=\Phi(v^{a})$, $w^{b}=\Phi(v^{b})$, (10) holds for each element with $y=v^{a}_{i}$ and $x=v^{b}_{i}$. Taking a conic combination of these constraints with multipliers $\lambda_{i}>0$, we get:

\begin{bmatrix}v^{a}_{t}-v^{b}_{t}\\ w^{a}_{t}-w^{b}_{t}\end{bmatrix}^{\top}\underbrace{\begin{bmatrix}0&\beta\Lambda\\ \beta\Lambda&-2\Lambda\end{bmatrix}}_{M(\Lambda)}\begin{bmatrix}v^{a}_{t}-v^{b}_{t}\\ w^{a}_{t}-w^{b}_{t}\end{bmatrix}\geq 0, (11)

where $\Lambda=\mathrm{diag}(\lambda_{1},\dots,\lambda_{q})$.
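As a sanity check (our own, with $\beta=1$ assumed for the ReLU), the incremental quadratic constraint (11) can be verified numerically on random inputs and multipliers:

# Numerical sanity check of (11) for a ReLU nonlinearity (beta = 1 assumed)
# with a random diagonal multiplier Lambda in D_+.
import numpy as np

q, beta = 5, 1.0
rng = np.random.default_rng(0)
Lam = np.diag(rng.uniform(0.1, 1.0, q))                # Lambda in D_+
M = np.block([[np.zeros((q, q)), beta * Lam],
              [beta * Lam,      -2.0 * Lam]])          # M(Lambda) in (11)

Phi = lambda v: np.maximum(v, 0.0)                     # element-wise ReLU
va, vb = rng.normal(size=q), rng.normal(size=q)
z = np.concatenate([va - vb, Phi(va) - Phi(vb)])
print(z @ M @ z >= -1e-12)                             # prints True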

III-C Convex Parametrization of Robust RNNs

Corresponding to the linear system (5), we introduce the following implicit, redundant parametrization:

G\;\begin{cases}Ex_{t+1}=Fx_{t}+B_{1}w_{t}+B_{2}u_{t}\\ y_{t}=C_{1}x_{t}+D_{11}w_{t}+D_{12}u_{t}\\ \Lambda v_{t}=C_{2}x_{t}+b+D_{22}u_{t}\end{cases} (12)

where $\theta=(E,F,B_{1},B_{2},C_{1},D_{11},D_{12},\Lambda,C_{2},b,D_{22})$ are the model parameters, with $E$ invertible and $\Lambda\in\mathbb{D}_{+}$ the incremental quadratic constraint multiplier from (11). The explicit system (5) can be easily constructed from (12) by inverting $E$ and $\Lambda$. While the parameters $E$ and $\Lambda$ do not improve model expressiveness, the extra degrees of freedom will allow us to formulate sets of robust models that are jointly convex in the model parameters, stability certificate and multipliers.
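The conversion from the implicit parameterization (12) to the explicit model (5) only requires solving with $E$ and $\Lambda$; a sketch (array layout assumed) is:

# Sketch (assumed array layout): recover the explicit parameters in (5) from
# the implicit parameterization (12) by inverting E and Lambda.
import numpy as np

def implicit_to_explicit(E, F, B1, B2, C1, D11, D12, Lam, C2, b, D22):
    Einv = np.linalg.inv(E)        # E is invertible whenever E + E^T > 0
    Laminv = np.linalg.inv(Lam)    # Lambda is diagonal and positive definite
    return dict(Fbar=Einv @ F, B1bar=Einv @ B1, B2bar=Einv @ B2,
                C1=C1, D11=D11, D12=D12,
                C2bar=Laminv @ C2, bbar=Laminv @ b, D22bar=Laminv @ D22)

The returned dictionary matches the layout assumed in the earlier simulation sketch.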

To construct the set of robust models, we introduce the following convex constraint:

\begin{bmatrix}E+E^{\top}-P&-\beta{C}^{\top}_{2}\\ -\beta{C}_{2}&2\Lambda\end{bmatrix}-\begin{bmatrix}{F}^{\top}\\ {B}_{1}^{\top}\end{bmatrix}P^{-1}\begin{bmatrix}{F}^{\top}\\ {B}_{1}^{\top}\end{bmatrix}^{\top}\succ 0. (13)

The set of Robust RNNs is then given by:

\Theta_{*}:=\{\theta:\exists P\succ 0,\,\Lambda\in\mathbb{D}_{+}\ \mathrm{s.t.}\ (13)\}.

Since $P\succ 0$, (13) and (14) imply that $E+E^{\top}\succ 0$, which means that $E$ is invertible.

To construct a set of $\gamma$-robust models, we propose the following convex constraint:

\begin{bmatrix}E+E^{\top}-P&-\beta{C}_{2}^{\top}&0\\ -\beta{C}_{2}&2\Lambda&-\beta{D}_{22}\\ 0&-\beta{D}_{22}^{\top}&\gamma I\end{bmatrix}-\begin{bmatrix}F^{\top}\\ B_{1}^{\top}\\ B_{2}^{\top}\end{bmatrix}P^{-1}\begin{bmatrix}F^{\top}\\ B_{1}^{\top}\\ B_{2}^{\top}\end{bmatrix}^{\top}-\frac{1}{\gamma}\begin{bmatrix}C_{1}^{\top}\\ D_{11}^{\top}\\ D_{12}^{\top}\end{bmatrix}\begin{bmatrix}C_{1}^{\top}\\ D_{11}^{\top}\\ D_{12}^{\top}\end{bmatrix}^{\top}\succ 0. (14)

The set of $\gamma$-robust RNNs is then given by:

\Theta_{\gamma}:=\{\theta:\exists P\succ 0,\,\Lambda\in\mathbb{D}_{+}\ \mathrm{s.t.}\ (14)\}.

Note that (13) and (14) are jointly convex in the model parameters, stability certificate, multipliers $\Lambda$ and the incremental $\ell_2$ gain bound $\gamma$.
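To illustrate how membership in $\Theta_{*}$ or $\Theta_{\gamma}$ might be checked for a candidate $\theta$, the sketch below assembles the left-hand sides of (13) and (14) and tests positive definiteness; it is our own illustration, not the training code, which instead enforces the constraints during optimization.

# Illustration (not the authors' code): assemble the matrices in (13) and (14)
# for given parameters and certificate P, then test positive definiteness.
import numpy as np

def lmi_13(E, P, C2, Lam, F, B1, beta=1.0):
    H = np.block([[E + E.T - P, -beta * C2.T],
                  [-beta * C2,   2.0 * Lam]])
    G = np.vstack([F.T, B1.T])
    return H - G @ np.linalg.inv(P) @ G.T

def lmi_14(E, P, C1, C2, D11, D12, D22, Lam, F, B1, B2, gamma, beta=1.0):
    n, q, m = E.shape[0], Lam.shape[0], B2.shape[1]
    H = np.block([[E + E.T - P,      -beta * C2.T,    np.zeros((n, m))],
                  [-beta * C2,        2.0 * Lam,     -beta * D22],
                  [np.zeros((m, n)), -beta * D22.T,   gamma * np.eye(m)]])
    G = np.vstack([F.T, B1.T, B2.T])
    L = np.vstack([C1.T, D11.T, D12.T])
    return H - G @ np.linalg.inv(P) @ G.T - (1.0 / gamma) * L @ L.T

def is_positive_definite(X):
    return bool(np.all(np.linalg.eigvalsh(0.5 * (X + X.T)) > 0))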

Theorem 1.

Suppose that $\theta\in\Theta_{\gamma}$; then the Robust RNN (5), (6) has an incremental $\ell_2$-gain bound of $\gamma$.

Proof.

Consider two solutions $x^{a},x^{b}\in\ell_{2e}^{n}$ and outputs $y^{a},y^{b}\in\ell_{2e}^{p}$ to the system (1), (2) with initial conditions $a,b\in\mathbb{R}^{n}$ and inputs $u^{a},u^{b}\in\ell_{2e}^{m}$. Let $\Delta u=u^{a}-u^{b}$, $\Delta x=x^{a}-x^{b}$, $\Delta v=v^{a}-v^{b}$, $\Delta w=w^{a}-w^{b}$ and $\Delta y=y^{a}-y^{b}$.

To establish the incremental $\ell_2$-gain bound, we first left and right multiply (14) by the vectors $[\Delta x_{t}^{\top},\Delta w_{t}^{\top},\Delta u_{t}^{\top}]$ and $[\Delta x_{t}^{\top},\Delta w_{t}^{\top},\Delta u_{t}^{\top}]^{\top}$. Applying the bound $-E^{\top}P^{-1}E\preceq P-E-E^{\top}$ [12] and introducing the storage function $V_{t}=\Delta x_{t}^{\top}E^{\top}P^{-1}E\Delta x_{t}$ gives

V_{t+1}-V_{t}<\gamma|\Delta u_{t}|^{2}-\frac{1}{\gamma}|\Delta y_{t}|^{2}-\begin{bmatrix}\Delta v_{t}\\ \Delta w_{t}\end{bmatrix}^{\top}M(\Lambda)\begin{bmatrix}\Delta v_{t}\\ \Delta w_{t}\end{bmatrix}

for $\Delta x_{t}\neq 0$. Using (11) and summing over $[0,T]$ gives

V_{T}-V_{0}<\gamma\|\Delta{u}\|_{T}^{2}-\frac{1}{\gamma}\|\Delta{y}\|_{T}^{2},

for $\Delta x\neq 0$, so the incremental $\ell_2$-gain condition (3) follows with $d(a,b)=\gamma V_{0}$. ∎

Theorem 2.

Suppose that $\theta\in\Theta_{*}$; then the Robust RNN (5), (6) has a finite incremental $\ell_2$-gain.

Proof.

Note that if the LMI condition (13) is satisfied, there exists a sufficiently large $\gamma$ such that (14) holds for any choice of $B_{2}$, $C_{1}$, $D_{11}$, $D_{12}$ and $D_{22}$. Since (13) implies (14) for some sufficiently large $\gamma$, it follows from Theorem 1 that the Robust RNN (5), (6) has a finite incremental $\ell_2$-gain bound of $\gamma$. ∎

Remark 1.

Theorem 1 and Theorem 2 actually imply a stronger form of stability. For $\Delta u=0$, it is straightforward to show from the strict matrix inequalities that $V_{t+1}\leq\alpha V_{t}$ for some $\alpha\in(0,1)$, which implies that the dynamics are contracting [5].

III-D Expressivity of the Model Set

To be able to learn models for a wide class of systems, it is beneficial to have as expressive a model set as possible. The main result regarding expressivity is that the Robust RNN set $\Theta_{*}$ contains all contracting implicit RNNs (ci-RNNs) [17] and stable LTI models.

Theorem 3.

The Robust RNN set $\Theta_{*}$ contains all stable LTI models of the form

x_{t+1}=\mathcal{A}x_{t}+\mathcal{B}u_{t},\quad y_{t}=\mathcal{C}x_{t}+\mathcal{D}u_{t}. (15)
Proof.

A necessary and sufficient condition for stability of (15) is the existence of some $\mathcal{P}\succ 0$ such that:

\mathcal{P}-\mathcal{A}^{\top}\mathcal{P}\mathcal{A}\succ 0. (16)

For any stable LTI system, the implicit RNN with $\theta$ such that $E=P=\mathcal{P}$, $F=\mathcal{P}\mathcal{A}$, $B_{1}=0$, $B_{2}=\mathcal{P}\mathcal{B}$, $C_{1}=\mathcal{C}$, $D_{12}=\mathcal{D}$, $C_{2}=0$ and $D_{22}=0$ has the same dynamics and output. To see that $\theta\in\Theta_{*}$,

(16)\Rightarrow E+E^{\top}-P-F^{\top}P^{-1}PP^{-1}F\succ 0\Rightarrow\begin{bmatrix}E+E^{\top}-P-F^{\top}P^{-1}F&0\\ 0&2\Lambda\end{bmatrix}\succ 0\Rightarrow(13)

for any $\Lambda\in\mathbb{D}_{+}$. ∎
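As a numerical companion to the proof (our own sketch, with $\beta=1$), a randomly generated stable LTI system can be embedded using the construction above and (13) checked directly:

# Sketch: embed a random stable LTI system (15) into the implicit model as in
# the proof of Theorem 3 (E = P = Pcal, F = Pcal A, B1 = 0, C2 = 0), check (13).
import numpy as np

rng = np.random.default_rng(0)
n, m, q = 4, 2, 3
A = rng.normal(size=(n, n))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))       # spectral radius 0.9
B = rng.normal(size=(n, m))

# Solve the discrete Lyapunov equation Pcal - A^T Pcal A = I (vectorized form).
Pcal = np.linalg.solve(np.eye(n * n) - np.kron(A.T, A.T),
                       np.eye(n).ravel()).reshape(n, n)

E = P = Pcal
F, B1 = Pcal @ A, np.zeros((n, q))
C2, Lam = np.zeros((q, n)), np.eye(q)

H = np.block([[E + E.T - P, -C2.T], [-C2, 2.0 * Lam]])
G = np.vstack([F.T, B1.T])
print(np.all(np.linalg.eigvalsh(H - G @ np.linalg.inv(P) @ G.T) > 0))  # True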

Remark 2.

Essentially the same proof technique, but with the strict Bounded Real Lemma, can be used to show that $\Theta_{\gamma}$ contains all LTI models with an $H_{\infty}$ norm of $\gamma$.

A ci-RNN [17] is an implicit model of the form:

\mathcal{E}z_{t+1}=\Phi(\mathcal{F}z_{t}+\mathcal{B}u_{t}+\mathfrak{b}),\quad y_{t}=\mathcal{C}z_{t}+\mathcal{D}u_{t} (17)

such that the following contraction condition holds

\begin{bmatrix}\mathcal{E}+\mathcal{E}^{\top}-\mathcal{P}&\mathcal{F}^{\top}\\ \mathcal{F}&\mathcal{P}\end{bmatrix}\succ 0 (18)

where $\mathcal{P}\in\mathbb{D}_{+}$. The stable RNN (s-RNN) proposed in [15] is contained within the set of ci-RNNs when $\mathcal{E}=I$.

Theorem 4.

The Robust RNN set $\Theta_{*}$ contains all ci-RNNs.

Proof.

For any ci-RNN, there is an implicit RNN with the same dynamics and output with $\theta$ such that $F=0$, $E=\mathcal{E}$, $B_{1}=I$, $B_{2}=0$, $C_{1}=\mathcal{C}$, $D_{11}=0$, $D_{12}=\mathcal{D}$, $\Lambda^{-1}C_{2}=\mathcal{F}$, $\Lambda^{-1}D_{22}=\mathcal{B}$, $\mathfrak{b}=\Lambda^{-1}b$, $\Lambda=P^{-1}$ and $P=\mathcal{P}$. By substituting $\theta$ into (5) and (6), we recover the dynamics and output of the ci-RNN in (17).

For this parameter choice, $\theta\in\Theta_{*}$. To see this:

(18)\implies E+E^{\top}-P-C_{2}^{\top}\Lambda^{-1}P^{-1}\Lambda^{-1}C_{2}\succ 0\implies\begin{bmatrix}E+E^{\top}-P&C_{2}^{\top}\\ C_{2}&2\Lambda-P^{-1}\end{bmatrix}\succ 0\implies(13).

The remaining conditions $P\succ 0$, $\Lambda\in\mathbb{D}_{+}$ and $E+E^{\top}\succ 0$ follow by definition. ∎

IV Numerical Example

We will compare the proposed Robust RNN with the (Elman) RNN [24] described by (8), (9) with $\mathcal{B}_{1}=I$, and with the Long Short-Term Memory (LSTM) [26], a widely-used model class that was originally proposed to resolve issues related to stability. In addition, we compare to two previously-published stable model sets, the contracting implicit RNN (ci-RNN) [17] and the stable RNN (s-RNN) [15]. All models have a state dimension of 10, and all models except for the LSTM use a ReLU activation function. The LSTM is described by the following equations:

\mathrm{LSTM}\begin{cases}i_{t+1}=\sigma(W_{xi}x_{t}+W_{ii}u_{t+1}+b_{i}),\\ f_{t+1}=\sigma(W_{xf}x_{t}+W_{if}u_{t+1}+b_{f}),\\ g_{t+1}=\sigma(W_{xg}x_{t}+W_{ig}u_{t+1}+b_{g}),\\ o_{t+1}=\sigma(W_{xo}x_{t}+W_{io}u_{t+1}+b_{o}),\\ c_{t+1}=f_{t+1}\odot c_{t}+i_{t+1}\odot g_{t+1},\\ x_{t+1}=o_{t+1}\odot\tanh(c_{t+1}),\end{cases} (19)

where $c_{t},x_{t}\in\mathbb{R}^{n}$ are the cell state and hidden state, $u_{t}\in\mathbb{R}^{m}$ is the input, $\odot$ is the Hadamard product and $\sigma$ is the sigmoid function. The output is a linear function of the hidden state.

To generate data, we use a simulation of four coupled mass spring dampers. The goal is to identify a mapping from the force on the initial mass to the position of the final mass. Nonlinearity is introduced through the springs’ piecewise linear force profile

F_{i}(d)=k_{i}\Gamma(d),\quad\Gamma(d)=\begin{cases}d+0.75,&d\leq-1,\\ 0.25d,&-1<d<1,\\ d-0.75,&d\geq 1,\end{cases} (20)

where $k_{i}$ is the spring constant for the $i$th spring and $d$ is the displacement between the carts. A schematic is shown in Fig. 2. The masses are $[m_{1},\dots,m_{4}]=[1/4,1/3,5/12,1/2]$, the linear damping coefficients are $[c_{1},\dots,c_{4}]=[1/4,1/3,5/12,1/2]$ and the spring constants used in (20) are $[k_{1},\dots,k_{4}]=[1,5/6,2/3,1/2]$.
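A direct transcription of the piecewise-linear spring force (20) (our own code, used only to illustrate the nonlinearity):

# Direct transcription of the piecewise-linear spring force profile (20).
import numpy as np

def gamma_profile(d):
    d = np.asarray(d, dtype=float)
    return np.where(d <= -1.0, d + 0.75,
                    np.where(d >= 1.0, d - 0.75, 0.25 * d))

def spring_force(k_i, d):
    return k_i * gamma_profile(d)

print(gamma_profile([-1.0, 1.0]))   # [-0.25, 0.25]: continuous at the breakpoints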

Figure 2: Nonlinear mass spring damper schematic.

We excite the system with a piecewise-constant input signal that changes value after an interval distributed uniformly in $[0,\tau]$ and takes values that are normally distributed with standard deviation $\sigma_{u}$. The measurements have Gaussian noise of approximately $30\,\mathrm{dB}$ added. To generate data we simulate the system for $T/5$ seconds and sample the system at 5 Hz to generate $T$ data points, with an input signal characterized by $\tau=20\,\mathrm{s}$ and $\sigma_{u}=3\,\mathrm{N}$. The training data consists of 100 batches of length 1000. We also generate a validation set with $\tau=20\,\mathrm{s}$, $\sigma_{u}=3\,\mathrm{N}$ and length 5000 that is used for early stopping. To test model performance, we generate test sets of length 1000 with $\tau=20\,\mathrm{s}$ and varying $\sigma_{u}$.
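A sketch of the piecewise-constant excitation described above (the sampling rate and hold-time convention are assumptions consistent with the text):

# Sketch (assumptions: 5 Hz sampling, hold times uniform on [0, tau], amplitudes
# N(0, sigma_u^2)) of the piecewise-constant excitation signal.
import numpy as np

def excitation(T, tau=20.0, sigma_u=3.0, fs=5.0, seed=0):
    rng = np.random.default_rng(seed)
    u, t_next, level = np.zeros(T), 0.0, 0.0
    for k in range(T):
        t = k / fs
        if t >= t_next:                          # start a new constant segment
            level = rng.normal(0.0, sigma_u)     # random amplitude
            t_next = t + rng.uniform(0.0, tau)   # random hold time
        u[k] = level
    return u

u_train = excitation(T=1000)                     # one batch of input data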

IV-A Training Procedure

We fit Robust RNNs by optimizing simulation error using stochastic gradient descent, with logarithmic barrier functions to ensure strict feasibility of the robustness constraints. We use the ADAM optimizer [27] with an initial learning rate of $1\times 10^{-3}$ to optimize the following objective function:

J=\|\tilde{y}^{k}-S(\tilde{u}^{k})\|^{2}-\sum_{i}\alpha\log\det(M_{i})-\sum_{j}\alpha\log\lambda_{j},

where $M_{i}$ are the LMIs to be satisfied, $\lambda_{j}$ are the incremental quadratic constraint multipliers, and $\tilde{u}^{k}$ and $\tilde{y}^{k}$ are the input and output for the $k$th batch. A backtracking line search ensures strict feasibility throughout optimization. After 10 epochs without an improvement in validation performance, we scale the learning rate by a factor of 0.25 and decrease $\alpha$ by a factor of 10. When $\alpha$ reaches a final value of $1\times 10^{-7}$, we finish training. All code is written using PyTorch 1.6.0 and run on a standard desktop CPU. The code is available at the following link: https://github.com/imanchester/RobustRNN/.
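A schematic version of the barrier objective above is sketched below; model.simulate, model.lmis and model.multipliers are hypothetical placeholders, not the interface of the released code.

# Schematic sketch of the barrier objective (not the released implementation):
# simulation error plus logarithmic barriers on the LMIs M_i and multipliers.
import torch

def barrier_objective(model, u_batch, y_batch, alpha):
    y_sim = model.simulate(u_batch)                # hypothetical model rollout
    loss = torch.sum((y_batch - y_sim) ** 2)       # simulation error
    for M in model.lmis():                         # hypothetical: LMIs, e.g. (13)
        loss = loss - alpha * torch.logdet(M)      # barrier: requires M > 0
    loss = loss - alpha * torch.sum(torch.log(model.multipliers()))
    return loss

During training, a backtracking line search would reject any step for which a barrier term becomes undefined, keeping the iterates strictly feasible.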

IV-B Model Evaluation

Model quality of fit is measured using normalized simulation error:

\text{NSE}=\frac{\|\tilde{y}-y\|}{\|\tilde{y}\|} (21)

where $y,\tilde{y}\in\ell_{2}^{p}$ are the simulated and measured system outputs, respectively. Model robustness is measured by approximately solving:

\hat{\gamma}=\max_{u,v}\frac{\|S(u)-S(v)\|}{\|u-v\|},\quad u\neq v, (22)

using gradient ascent. The value of $\hat{\gamma}$ is a lower bound on the true Lipschitz constant of the model.
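A sketch of the gradient-ascent estimate of (22), assuming the learned model is available as a differentiable PyTorch map from an input sequence of shape (T, m) to an output sequence:

# Sketch (assumption: model is a differentiable map from a (T, m) input
# sequence to an output sequence) of estimating the lower bound (22).
import torch

def lipschitz_lower_bound(model, T, m, steps=500, lr=1e-2, seed=0):
    torch.manual_seed(seed)
    u = torch.randn(T, m, requires_grad=True)
    v = torch.randn(T, m, requires_grad=True)
    opt = torch.optim.Adam([u, v], lr=lr)
    ratio = torch.tensor(0.0)
    for _ in range(steps):
        opt.zero_grad()
        ratio = torch.norm(model(u) - model(v)) / (torch.norm(u - v) + 1e-9)
        (-ratio).backward()           # gradient ascent on the gain ratio
        opt.step()
    return float(ratio)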

IV-C Results

The validation performance versus the number of epochs is shown in Fig. 3. Note that an epoch occurs after one complete pass through the training data; in this case, this corresponds to 100 batches and gradient descent steps. Each epoch of training the Robust RNN takes roughly twice as long as the LSTM due to the evaluation of the logarithmic barrier functions and the backtracking line search; however, we will see that the model offers both stability/robustness guarantees and superior generalizability.

Figure 4 presents boxplots and a comparison of the medians of the performance of each model for a number of realizations of the input signal with varying $\sigma_{u}$. In each plot, there is a trough around $\sigma_{u}=3$ corresponding to the training data distribution. For the LSTM and RNN, the model performance quickly degrades as $\sigma_{u}$ varies. On the other hand, the stable models exhibit a much slower decline in performance. This supports the claim that model stability constraints can improve model generalization. The Robust RNN set $\Theta_{*}$ uniformly outperforms all other models.

We have also plotted the worst-case observed sensitivity versus the median nominal test performance ($\sigma_{u}=3$) in Fig. 5. The Robust RNNs show the best trade-off between nominal performance and robustness, signified by the fact that they lie furthest into the lower left corner. For instance, if we compare the LSTM with the Robust RNN ($\Theta_{*}$), we observe similar nominal performance; however, the Robust RNN has a much smaller Lipschitz constant. Varying the incremental $\ell_2$ gain bound allows us to trade off between model performance and robustness. We can also observe in the figure that the guaranteed upper bounds are quite tight to the observed lower bounds on the Lipschitz constant, especially for the set $\Theta_{3}$.

In Fig. 6, we show the model predictions for the RNN, LSTM and Robust RNN for a typical input with $\sigma_{u}=10$. We can see that even with the larger inputs, the Robust RNN continues to accurately track the measured data. The predictions of the LSTM and RNN, however, deviate significantly from the measured data for extended periods.

Figure 3: Validation performance versus epochs. The RNN, LSTM and Robust RNN take an average of 10.1, 18.7 and 37.6 seconds per epoch, respectively.
Figure 4: Boxplots showing model performance for 300 input realizations for varying $\sigma_{u}$: (a) RNN, (b) LSTM, (c) s-RNN, (d) ci-RNN, (e) Robust RNN ($\Theta_{*}$), (f) comparison of median model performance.
Figure 5: Test performance ($\sigma_{u}=3$) on the simulated mass-spring-damper system versus observed worst-case sensitivity. The vertical lines represent the theoretical upper bound on the Lipschitz constant.
Figure 6: Simulation of models when $\sigma_{u}=10$; the Robust RNN greatly outperforms the other models.

References

  • [1] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE transactions on neural networks, vol. 5, no. 2, pp. 157–166, 1994.
  • [2] M. Cheng, J. Yi, P.-Y. Chen, H. Zhang, and C.-J. Hsieh, “Seq2sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples.” in Association for the Advancement of Artificial Intelligence, 2020, pp. 3601–3608.
  • [3] G. Zames, “On the input-output stability of time-varying nonlinear feedback systems part one: Conditions derived using concepts of loop gain, conicity, and positivity,” IEEE Transactions on Automatic Control, vol. 11, no. 2, pp. 228–238, 1966.
  • [4] C. A. Desoer and M. Vidyasagar, Feedback systems: input-output properties.   SIAM, 1975, vol. 55.
  • [5] W. Lohmiller and J.-J. E. Slotine, “On contraction analysis for non-linear systems,” Automatica, vol. 34, pp. 683–696, 1998.
  • [6] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky, “Spectrally-normalized margin bounds for neural networks,” in Advances in Neural Information Processing Systems, 2017, pp. 6240–6249.
  • [7] S. Zhou and A. P. Schoellig, “An analysis of the expressiveness of deep neural network architectures based on their lipschitz constants,” arXiv preprint arXiv:1912.11511, 2019.
  • [8] T. Huster, C.-Y. J. Chiang, and R. Chadha, “Limitations of the lipschitz constant as a defense against adversarial examples,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2018, pp. 16–29.
  • [9] H. Qian and M. N. Wegman, “L2-nonexpansive neural networks,” International Conference on Learning Representations (ICLR), 2019.
  • [10] S. L. Lacy and D. S. Bernstein, “Subspace identification with guaranteed stability using constrained optimization,” IEEE Transactions on automatic control, vol. 48, no. 7, pp. 1259–1263, 2003.
  • [11] D. N. Miller and R. A. De Callafon, “Subspace identification with eigenvalue constraints,” Automatica, vol. 49, no. 8, pp. 2468–2473, 2013.
  • [12] M. M. Tobenkin, I. R. Manchester, and A. Megretski, “Convex parameterizations and fidelity bounds for nonlinear identification and reduced-order modelling,” IEEE Transactions on Automatic Control, vol. 62, no. 7, pp. 3679–3686, 2017.
  • [13] J. Umlauft, A. Lederer, and S. Hirche, “Learning stable gaussian process state space models,” in 2017 American Control Conference (ACC).   IEEE, 2017, pp. 1499–1504.
  • [14] J. Z. Kolter and G. Manek, “Learning stable deep dynamics models,” in Advances in Neural Information Processing Systems 32.   Curran Associates, Inc., 2019, pp. 11 128–11 136.
  • [15] J. Miller and M. Hardt, “Stable recurrent models,” In Proceedings of International Conference on Learning Representations 2019, 2019.
  • [16] J. Umenberger, J. Wågberg, I. R. Manchester, and T. B. Schön, “Maximum likelihood identification of stable linear dynamical systems,” Automatica, vol. 96, pp. 280–292, 2018.
  • [17] M. Revay and I. R. Manchester, “Contracting implicit recurrent neural networks: Stable models with improved trainability,” Learning for Dynamics and Control (L4DC), 2020.
  • [18] J. Umenberger and I. R. Manchester, “Specialized interior-point algorithm for stable nonlinear system identification,” IEEE Transactions on Automatic Control, vol. 64, no. 6, pp. 2442–2456, 2018.
  • [19] E. Kaszkurewicz and A. Bhaya, Matrix Diagonal Stability in Systems and Computation.   Birkhäuser Basel, 2000.
  • [20] N. E. Barabanov and D. V. Prokhorov, “Stability analysis of discrete-time recurrent neural networks,” IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 292–303, Mar. 2002.
  • [21] Y.-C. Chu and K. Glover, “Bounds of the induced norm and model reduction errors for systems with repeated scalar nonlinearities,” IEEE Transactions on Automatic Control, vol. 44, pp. 471–483, Mar. 1999.
  • [22] M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. Pappas, “Efficient and accurate estimation of lipschitz constants for deep neural networks,” in Advances in Neural Information Processing Systems, 2019, pp. 11 423–11 434.
  • [23] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning.   MIT press, 2016.
  • [24] J. L. Elman, “Finding structure in time,” Cognitive science, vol. 14, no. 2, pp. 179–211, 1990.
  • [25] A. Pinkus, “Approximation theory of the mlp model in neural networks,” Acta numerica, vol. 8, no. 1, pp. 143–195, 1999.
  • [26] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural computation, vol. 9, pp. 1735–1780, 1997.
  • [27] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” International Conference for Learning Representations (ICLR), Jan. 2017.