
Linear RNNs Provably Learn Linear Dynamic Systems

Lifu Wang, Tianyu Wang*, Shengwei Yi, Bo Shen, Bo Hu, Xing Cao. Corresponding author: Tianyu Wang ([email protected]). Lifu Wang, Tianyu Wang, and Shengwei Yi are with the China Information Technology Security Evaluation Center (CNITSEC), Beijing, China. Bo Shen, Bo Hu, and Xing Cao are with Beijing Jiaotong University, Beijing, China.
Abstract

We study the learning ability of linear recurrent neural networks trained with gradient descent. We prove the first theoretical guarantee that linear RNNs can learn any stable linear dynamic system using a large class of loss functions. For an arbitrary stable linear system with a parameter ρC\rho_{C} related to the transition matrix CC, we show that, despite the non-convexity of the optimization loss, if the width of the RNN is large enough (and the required width of the hidden layer does not rely on the length of the input sequence), a linear RNN can provably learn any stable linear dynamic system with sample and time complexity polynomial in 11ρC\frac{1}{1-\rho_{C}}. Our results provide the first theoretical guarantee for learning a linear RNN and demonstrate how the recurrent structure can help to learn a dynamic system.

I Introduction

Recurrent neural networks (RNNs) are a very important structure in machine learning for dealing with sequence data. It is believed that, using the recurrent structure, RNNs can learn complicated transformations of data over extended periods. Non-linear RNNs have been proved to be Turing-complete [1] and can thus simulate arbitrary procedures. However, training an RNN requires optimizing a highly non-convex function, which is very hard to solve. On the other hand, it is widely believed [2] that deep linear networks capture some important aspects of optimization in deep learning, and a series of recent papers study the properties of deep linear networks [3, 4, 5]. Meanwhile, learning linear RNNs is itself not only an important problem in system identification but is also useful for language modeling in natural language processing [6]. In this paper, we study the non-convex optimization problem of learning linear RNNs.

Suppose there is a dpd_{p}-order, dd-dimensional linear system of the following form:

ht=𝑪ht1+𝑫xt,\displaystyle h_{t}=\bm{C}h_{t-1}+\bm{D}x_{t}, (1)
y~t=𝑮ht(x),\displaystyle\widetilde{y}_{t}=\bm{G}h_{t}(x),

where 𝑪dp×dp,𝑫dp×d\bm{C}\in\mathbb{R}^{d_{p}\times d_{p}},\bm{D}\in\mathbb{R}^{d_{p}\times d} and 𝑮dy×dp\bm{G}\in\mathbb{R}^{d_{y}\times d_{p}} are unknown system parameters.

At time tt, this system outputs y~t\widetilde{y}_{t}. It is natural to consider the system identification problem of learning the unknown system parameters from the outputs. We consider a new linear (student) RNN of the form:

ht=𝑾ht1+𝑨xt,\displaystyle h^{\prime}_{t}=\bm{W}h^{\prime}_{t-1}+\bm{A}x_{t}, (2)
yt=𝑩ht(x),\displaystyle y^{\prime}_{t}=\bm{B}h^{\prime}_{t}(x),

with 𝑾m×m,𝑨m×d,𝑩dy×m\bm{W}\in\mathbb{R}^{m\times m},\bm{A}\in\mathbb{R}^{m\times d},\bm{B}\in\mathbb{R}^{d_{y}\times m} and train the parameters 𝑾,𝑨,𝑩\bm{W},\bm{A},\bm{B} to fit the output y~t\widetilde{y}_{t} of (1).

As is commonly done in machine learning, one may collect data {xti,y~ti}\{x^{i}_{t},\widetilde{y}^{i}_{t}\} and then optimize the empirical loss:

L(𝑾,𝑨,𝑩)\displaystyle L(\bm{W},\bm{A},\bm{B}) =1nTint=1TL(y~ti,yti),\displaystyle=\frac{1}{n\cdot T}\sum_{i}^{n}\sum_{t=1}^{T}L(\widetilde{y}^{i}_{t},{y^{\prime}_{t}}^{i}), (3)
=1nTint=1TL(𝑮ht(xi),𝑩ht(xi)),\displaystyle=\frac{1}{n\cdot T}\sum_{i}^{n}\sum_{t=1}^{T}L(\bm{G}h_{t}(x^{i}),\bm{B}h^{\prime}_{t}(x^{i})),

with the gradient descent algorithm, where LL is a convex loss function.
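To make the setup concrete, the following is a minimal numerical sketch (our own illustration, not part of the analysis) of the teacher system (1), the student linear RNN (2), and the empirical loss (3). The square loss, the Gaussian inputs, and the helper names teacher_outputs and student_outputs are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_p, d, d_y, m, T = 4, 3, 2, 64, 20

# Teacher parameters; C is rescaled so that its spectral radius is 0.8 < 1.
C = rng.normal(size=(d_p, d_p))
C *= 0.8 / max(abs(np.linalg.eigvals(C)))
D = rng.normal(size=(d_p, d))
G = rng.normal(size=(d_y, d_p))

# Student parameters (untrained here; training is discussed in Algorithm 1).
W = rng.normal(scale=1 / np.sqrt(m), size=(m, m))
A = rng.normal(scale=1 / np.sqrt(m), size=(m, d))
B = rng.normal(scale=1 / np.sqrt(d_y), size=(d_y, m))

def teacher_outputs(x):
    """Run the teacher system (1) on an input sequence x of shape (T, d)."""
    h = np.zeros(d_p); ys = []
    for x_t in x:
        h = C @ h + D @ x_t
        ys.append(G @ h)
    return np.array(ys)

def student_outputs(x):
    """Run the student RNN (2) on an input sequence x of shape (T, d)."""
    h = np.zeros(m); ys = []
    for x_t in x:
        h = W @ h + A @ x_t
        ys.append(B @ h)
    return np.array(ys)

x = rng.normal(size=(T, d))
# Empirical loss (3) for a single sequence, with the square loss as L.
loss = np.mean(np.sum((teacher_outputs(x) - student_outputs(x)) ** 2, axis=1))
print(f"average per-step loss of the untrained student: {loss:.3f}")
```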

Therefore the following questions arise naturally:

  • Can gradient descent learn the target RNN in polynomial time and samples?

  • What kind of random initialization (for example, how large must the width mm of 𝐖\bm{W} be?) do we need to learn the target RNN?

These questions may look easy, since they are basic and important for the system identification problem. However, the loss 1nin1Tt=1TL(y~ti,yti)\frac{1}{n}\sum_{i}^{n}\frac{1}{T}\sum_{t=1}^{T}L(\widetilde{y}^{i}_{t},{y^{\prime}_{t}}^{i}) is non-convex, and in fact even for SISO (single-input single-output, meaning xt,yt1x_{t},y_{t}\in\mathbb{R}^{1}) systems this question is far from trivial; only after the work in [7] was the SISO case solved. In fact, as pointed out in [8], although the most widely used method in system identification is the EM algorithm [9], it is inherently non-convex and easily gets stuck in bad local minima.

One naive method is to only optimize the loss L(y~1,y1)L(\widetilde{y}_{1},y^{\prime}_{1}), which is a convex function. However, when y~t\widetilde{y}_{t} is not accurately observed, for example when we can only observe yt=y~t+nty_{t}=\widetilde{y}_{t}+n_{t} and L(x,y)=xy2L(x,y)=||x-y||^{2} where ntn_{t} is white noise with variance σ\sigma at time tt, the result of optimizing L(y1,y1)L(y_{1},y^{\prime}_{1}) may not be optimal for the entire sequence loss 1Tt=1TL(yt,yt)\frac{1}{T}\sum_{t=1}^{T}L(y_{t},y^{\prime}_{t}). In fact, a naive estimate from L(y1,y1)L(y_{1},y^{\prime}_{1}) will output an estimation with error σ2\sigma^{2}, but minimizing 1Tt=1TL(yt,yt)\frac{1}{T}\sum_{t=1}^{T}L(y_{t},y^{\prime}_{t}) may output an estimation with error 𝒪(σ2/T)\mathcal{O}(\sigma^{2}/\sqrt{T}). It is shown in [7] that, under certain independence conditions on the inputs, SGD (stochastic gradient descent) converges to the global minimum of the maximum likelihood objective of an unknown linear time-invariant dynamical system from a sequence of noisy observations generated by the system, and that over-parameterization is helpful. However, their method relies heavily on the SISO property (x,y1x,y\in\mathbb{R}^{1}) of the system and on the condition that the inputs xtx_{t} at different times tt are i.i.d. from a Gaussian distribution. Their method cannot be generalized to systems with xdx\in\mathbb{R}^{d} and d>1d>1. It remains open under which conditions SGD can be guaranteed to find the global minimum of the linear RNN loss.

In this paper, we propose a new NTK-style method inspired by the work [10] and the authors’ previous work [11] on non-linear RNNs. This method is completely different from that in [7], so we avoid the limitation that the method in [7] can only be used in the SISO case.

We show that if the width mm of the linear RNN (2) is large enough (polynomially large), SGD can provably learn any stable linear system with sample and time complexity only polynomial in 11ρC\frac{1}{1-\rho_{C}} and independent of the input length TT, where ρC\rho_{C} is roughly the spectral radius (see Section III-A) of the transition matrix C\bm{C}. Learning linear RNNs is a very important problem in system identification, and since gradient descent with random initialization is the most commonly used method in machine learning, we try to understand this problem in a “machine-learning style”. We believe this can provide some insights into the recurrent structure in deep learning.

II Problem Formulation

We consider the target linear system with the form:

pti=𝑪pt1i+𝑫xti,\displaystyle p^{i}_{t}=\bm{C}p^{i}_{t-1}+\bm{D}x^{i}_{t}, (4)
yti~=𝑮pti,\displaystyle\widetilde{y^{i}_{t}}=\bm{G}p^{i}_{t},

which is a stable linear system with 𝑪k𝑫cρρCk||\bm{C}^{k}\bm{D}||\leq c_{\rho}\cdot\rho_{C}^{k} for all kk\in\mathbb{N} and ρC<1\rho_{C}<1, where cρ>0c_{\rho}>0, 𝑮,𝑫=Θ(1)||\bm{G}||,||\bm{D}||=\Theta(1). For a given convex loss function LL, we set the loss function

L(𝑪,𝑫,𝑮)=𝔼x,y𝒟1Tt=1TL(yt,y~t).L(\bm{C},\bm{D},\bm{G})=\mathbb{E}_{x,y\sim{\cal D}}\frac{1}{T}\sum_{t=1}^{T}L(y_{t},\widetilde{y}_{t}). (5)

We define the global minimum OPTρCOPT_{\rho_{C}} as

OPTρC=inf𝑪,𝑫,𝑮𝔼x,y𝒟1Tt=1TL(yt,y~t).OPT_{\rho_{C}}=\inf_{\bm{C},\bm{D},\bm{G}}\mathbb{E}_{x,y\sim{\cal D}}\frac{1}{T}\sum_{t=1}^{T}L(y_{t},\widetilde{y}_{t}).

with 𝑪k𝑫cρρCk,k||\bm{C}^{k}\bm{D}||\leq c_{\rho}\rho_{C}^{k},k\in\mathbb{N}, where cρ>0c_{\rho}>0 is an absolute constant.

Let the sequences {(xi=(x1i,x2i,,xTi))i=1K,(yi=(y1i,y2i,,yTi))i=1K}\{(x^{i}=(x^{i}_{1},x^{i}_{2},...,x^{i}_{T}))_{i=1}^{K},\ (y^{i}=(y^{i}_{1},y^{i}_{2},...,y^{i}_{T}))_{i=1}^{K}\} be KK samples i.i.d. drawn from

𝒟={(x1,x2,,xT)×(y1,y2,,yT)d×T×dy×T}.{\cal D}=\{(x_{1},x_{2},...,x_{T})\times(y_{1},y_{2},...,y_{T})\in\mathbb{R}^{d\times T}\times\mathbb{R}^{d_{y}\times T}\}.

We consider a “student” linear system (RNN) to learn the target one. Let ft(𝑾~,𝑨,xi)f_{t}(\widetilde{\bm{W}},\bm{A},x^{i}) be the tt-time output of a linear RNN with input xix^{i} and parameters 𝑾~m×m,𝑨m×d\widetilde{\bm{W}}\in\mathbb{R}^{m\times m},\bm{A}\in\mathbb{R}^{m\times d}:

ht(x)=𝑾~ht1+𝑨xt,\displaystyle h_{t}(x)=\widetilde{\bm{W}}h_{t-1}+\bm{A}x_{t}, (6)
f~t(𝑾~,𝑨,x)=𝑩ht(x).\displaystyle\widetilde{f}_{t}(\widetilde{\bm{W}},\bm{A},x)=\bm{B}h_{t}(x).

Our goal is to use ft(𝑾~,𝑨,xi)f_{t}(\widetilde{\bm{W}},\bm{A},x^{i}) and the KK samples to fit the empirical loss function while keeping the generalization error small.

III Our Result

The main result is formulated in the PAC-learning setting as follows:

Theorem 1

(Informal) Under the conditions in the last section, suppose the entries of 𝐖\bm{W} in the student RNN are i.i.d. randomly initialized from N(0,ρCm)N(0,\frac{\rho_{C}}{m}). We use the SGD algorithm to optimize the loss, and 𝐖~k,𝐀k\widetilde{\bm{W}}_{k},\bm{A}_{k} denote the kk-th step outputs of the SGD algorithm.

For any ϵ,δ>0\epsilon,\ \delta>0, and 0<ρC<10<\rho_{C}<1, there exist parameters m=poly(11ρC,ϵ1,δ1,cρ)m^{*}=poly(\frac{1}{1-\rho_{C}},\epsilon^{-1},\delta^{-1},c_{\rho}) and K=poly(11ρC,ϵ1,δ1,cρ)K=poly(\frac{1}{1-\rho_{C}},\epsilon^{-1},\delta^{-1},c_{\rho}) such that if m>mm>m^{*}, with probability at least 1δ1-\delta, SGD can reach

𝔼x,y𝒟1Kk=1K1Tt=1TL(yt,f~t(𝑾~k,𝑨k,x))\displaystyle\mathbb{E}_{x,y\sim{\cal D}}\frac{1}{K}\sum_{k=1}^{K}\frac{1}{T}\sum_{t=1}^{T}L(y_{t},\widetilde{f}_{t}(\widetilde{\bm{W}}_{k},\bm{A}_{k},x)) (7)
OPTρC+ϵ,\displaystyle\leq OPT_{\rho_{C}}+\epsilon,

in KK steps.

Remark III.1

Theorem 1 implies that gradient descent with linear RNNs can provably learn any stable linear system with iteration and sample complexity polynomial in 11ρC\frac{1}{1-\rho_{C}}. This result is consistent with the previous gradient-descent-based method [7] for learning SISO linear systems.

In our theorem, none of the parameters depend on the length TT. Note that if, at different times, the xtix^{i}_{t} are i.i.d. drawn from a distribution 𝒟{\cal D}^{\prime}, ntn_{t} is white noise and TT is large enough,

𝔼nt,xt𝒟1Tt=1T||y~t+ntf~t(𝑾~,𝑨,x))||2\displaystyle\mathbb{E}_{n_{t},x_{t}\sim{\cal D}^{\prime}}\frac{1}{T}\sum_{t=1}^{T}||\widetilde{y}_{t}+n_{t}-\widetilde{f}_{t}(\widetilde{\bm{W}},\bm{A},x))||^{2}
limT𝔼nt,xt𝒟1Tt=1T||y~t+ntf~t(𝑾~,𝑨,x))||2+ϵ.\displaystyle\leq\lim_{T^{\prime}\to\infty}\mathbb{E}_{n_{t},x_{t}\sim{\cal D}^{\prime}}\frac{1}{T^{\prime}}\sum_{t=1}^{T^{\prime}}||\widetilde{y}_{t}+n_{t}-\widetilde{f}_{t}(\widetilde{\bm{W}},\bm{A},x))||^{2}+\epsilon.

Thus optimizing the large-TT loss is enough to predict the complete dynamic behavior of the target system.

III-A Scale of cρc_{\rho} and the Comparison with Previous Results

In our main theorem 1, the parameters are polynomial in cρc_{\rho}. We should note that although we assume the target system is stable, this only means

ρ(𝑪)=limk𝑪k1/k<1.\rho(\bm{C})=\lim_{k\to\infty}||\bm{C}^{k}||^{1/k}<1.

In fact, suppose 𝑪N×N\bm{C}\in\mathbb{R}^{N\times N}, generally we have (see e.g. corollary 3.15 in [12]):

𝑪kNj=0N1(N1j)(kj)𝑪jρ(𝑪)kj.||\bm{C}^{k}||\leq\sqrt{N}\sum_{j=0}^{N-1}\binom{N-1}{j}\binom{k}{j}\cdot||\bm{C}||_{\infty}^{j}\cdot\rho(\bm{C})^{k-j}.

When we set ρC=ρ(𝑪)\rho_{C}=\rho(\bm{C}), cρc_{\rho} can be very large, and generally we should set ρC>ρ(𝑪)+ϵ\rho_{C}>\rho(\bm{C})+\epsilon to make cρc_{\rho} polynomial in NN.
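As a concrete illustration (ours, with arbitrary sizes) of this point, the following sketch computes 𝑪k||\bm{C}^{k}|| for a single Jordan block: the norm grows polynomially in kk before the geometric factor takes over, and the ratio 𝑪k/ρ(𝑪)k||\bm{C}^{k}||/\rho(\bm{C})^{k} is unbounded, so no finite cρc_{\rho} works at ρC=ρ(𝑪)\rho_{C}=\rho(\bm{C}).

```python
import numpy as np

# N x N Jordan block with eigenvalue 0.9 (spectral radius 0.9, but far from normal).
N, lam = 20, 0.9
C = np.diag(np.full(N, lam)) + np.diag(np.ones(N - 1), k=1)

norms = [np.linalg.norm(np.linalg.matrix_power(C, k), 2) for k in range(1, 1000)]
k_peak = int(np.argmax(norms)) + 1
print(f"rho(C) = {lam}, max_k ||C^k|| = {max(norms):.2e}, attained at k = {k_peak}")
# The ratio ||C^k|| / rho(C)^k keeps increasing with k, so any bound
# c * rho(C)^k fails; taking rho_C slightly above rho(C) makes the ratio bounded.
print("||C^k|| / rho(C)^k at k = 100, 500, 999:",
      [f"{norms[k - 1] / lam ** k:.2e}" for k in (100, 500, 999)])
```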

On the other hand, the scale of cρc_{\rho} is closely related to the so-called “acquiescent” systems.

Definition 1

Let zz\in\mathbb{C}. An NN-order SISO linear system with transfer function s(z)p(z)\frac{s(z)}{p(z)} is called α\alpha-acquiescent if

{p(z)/zN:|z|=α}S,\{p(z)/z^{N}:|z|=\alpha\}\subseteq S,

where

S={z:Re(z)(1+τ0)|Im(z)|}{z:τ1Re(z)τ2}S=\{z:Re(z)\geq(1+\tau_{0})|\ Im(z)|\}\cap\{z:\tau_{1}\leq Re(z)\leq\tau_{2}\}

And we have

Lemma 2

(Lemma 4.4 in [7]) Suppose the target linear system is SISO and α\alpha-acquiescent. For any kk\in\mathbb{N},

𝑪k𝑫2πNα2N/τ1αk.||\bm{C}^{k}\bm{D}||\leq 2\pi N\alpha^{-2N}/\tau_{1}\cdot\alpha^{k}. (8)

Thus in the SISO case, under the “acquiescent” conditions, our Theorem 1 reduces to the main result in [7].

Corollary 3

(Corresponding to Theorem 5.1 in [7]) Suppose a SISO linear system is ρC\rho_{C}-acquiescent. Then it is learnable in polynomial time with polynomially many samples.

For MIMO systems, our condition 𝑪k𝑫cρρCk,k||\bm{C}^{k}\bm{D}||\leq c_{\rho}\rho_{C}^{k},k\in\mathbb{N} is thus a natural generalization of the acquiescence condition.

III-B Our Techniques

Our proof technique is closely related to recent works on deep linear networks [3], non-linear networks with the neural tangent kernel [13, 14, 15], and non-linear RNNs [10, 16]. Similar to [13], we carefully upper and lower bound the eigenvalues of the Gram matrix induced by the network throughout the optimization process, using perturbation analysis. At the initialization point, we exploit the spectral properties of Gaussian random matrices. Using the linearization method, we can show that these properties hold throughout the trajectory of gradient descent. Then we only need to construct a solution near the random initialization, and the distance from this solution to the initialization can be bounded using the stability of the system.

Notations. For two matrices 𝑨,𝑩m×n\bm{A},\bm{B}\in\mathbb{R}^{m\times n}, we define 𝑨,𝑩=Tr(ATB)\langle\bm{A},\bm{B}\rangle=\text{Tr}(A^{T}B). We define the asymptotic notations 𝒪(),Ω(),Θ(),poly()\mathcal{O}(\cdot),\Omega(\cdot),\Theta(\cdot),poly(\cdot) as follows. Let an,bna_{n},b_{n} be two sequences. Then an=𝒪(bn)a_{n}=\mathcal{O}(b_{n}) if limsupn|an/bn|<\lim\sup_{n\to\infty}|a_{n}/b_{n}|<\infty, an=Ω(bn)a_{n}=\Omega(b_{n}) if liminfn|an/bn|>0\lim\inf_{n\to\infty}|a_{n}/b_{n}|>0, an=Θ(bn)a_{n}=\Theta(b_{n}) if an=Ω(bn)a_{n}=\Omega(b_{n}) and an=𝒪(bn)a_{n}=\mathcal{O}(b_{n}), and an=poly(bn)a_{n}=poly(b_{n}) if there is kk\in\mathbb{N} such that an=O((bn)k)a_{n}=O((b_{n})^{k}). 𝒪~(),Ω~(),Θ~(),poly~()\widetilde{\mathcal{O}}(\cdot),\widetilde{\Omega}(\cdot),\widetilde{\Theta}(\cdot),\widetilde{poly}(\cdot) are the corresponding notations that hide logarithmic factors. ||||||\cdot|| and ||||2||\cdot||_{2} denote the spectral (2-)norm of matrices, ||||1||\cdot||_{1} denotes the 1-norm, and ||||F||\cdot||_{F} is the Frobenius norm.

IV Related Works

Deep Linear Networks. The provable properties of the loss surface of deep linear networks were first shown in [5]. In [3], it is shown that if the width of an LL-layer deep linear network is large enough (depending only on the output dimension, the rank rr, and the condition number κ\kappa of the input data), randomly initialized gradient descent optimizes deep linear networks in time polynomial in LL, rr and κ\kappa. Moreover, in [4], linear ResNets are studied, and it is shown that gradient descent provably optimizes wide enough deep linear ResNets with a width that does not depend on the number of layers.

Over-Parameterization. Non-linear networks with one hidden node are studied in [17] and [18]. In these works, it is shown that, for a single-hidden-node ReLU network, under a very mild assumption on the input distribution, the loss is one-point convex in a very large area. However, for networks with multiple hidden nodes, the authors in [19] pointed out that spurious local minima are common and indicated that an over-parameterization assumption (the number of hidden nodes should be large) is necessary. Similarly, [7] showed that over-parameterization can help in the training process of a linear dynamic system. Another important line of progress is the theory of the neural tangent kernel (NTK). Techniques of the NTK for finite-width networks are studied in [15], [13], [10], [14] and [16].

Learning Linear Systems. Prediction problems of time series for linear dynamical systems can be traced back to Kalman [20]. In the case where the system is unknown, the first polynomial running time and sample complexity guarantees for learning single-input single-output (SISO) systems with gradient descent were provided in [7]. For MIMO systems, it was shown in [21, 8] that the spectral filtering method can be provably learned with polynomial guarantees on running time and sample complexity.

V Problem Setup and Main Results

In this section, we introduce the basic problem setup and our main results.

Consider sequences {xi=(x1i,x2i,,xTi)}\{x^{i}=(x^{i}_{1},x^{i}_{2},...,x^{i}_{T})\} and labels {yi=(y1i,y2i,,yTi)}\{y^{i}=(y^{i}_{1},y^{i}_{2},...,y^{i}_{T})\} in the data set, with xtid,ytidyx^{i}_{t}\in\mathbb{R}^{d},y^{i}_{t}\in\mathbb{R}^{d_{y}} and xti,yti𝒪(1)||x^{i}_{t}||,||y^{i}_{t}||\leq\mathcal{O}(1). We assume dyd=𝒪(1)d_{y}\leq d=\mathcal{O}(1) and omit them in the asymptotic symbols. We study the linear RNN with 𝑾~m×m,𝑨m×d,𝑩dy×m\widetilde{\bm{W}}\in\mathbb{R}^{m\times m},\bm{A}\in\mathbb{R}^{m\times d},\bm{B}\in\mathbb{R}^{d_{y}\times m}:

h0(x)=𝟎,ht(x)=𝑾~ht1+𝑨xt,\displaystyle h_{0}(x)=\bm{0},h_{t}(x)=\widetilde{\bm{W}}h_{t-1}+\bm{A}x_{t}, (9)
f~t(𝑾~,𝑨,x)=𝑩ht(x)dy.\displaystyle\widetilde{f}_{t}(\widetilde{\bm{W}},\bm{A},x)=\bm{B}h_{t}(x)\in\mathbb{R}^{d_{y}}.

Assume L(x)=L(yt,x)L^{*}(x)=L(y_{t},x) is a convex and locally Lipschitz function: for any xx, when xC,ytC||x||\leq C,||y_{t}||\leq C,

xL(x)l0(1+C).||\nabla_{x}L^{*}(x)||\leq l_{0}(1+C). (10)
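For concreteness, here is a short check (ours, not from the paper) that two standard losses satisfy condition (10) on a bounded domain:

```latex
% Square loss: L^*(x) = \|x - y_t\|^2. When \|x\| \le C and \|y_t\| \le C,
%   \|\nabla_x L^*(x)\| = 2\|x - y_t\| \le 2(\|x\| + \|y_t\|) \le 4C \le 4(1 + C),
% so (10) holds with l_0 = 4.
% l^1 loss: L^*(x) = \|x - y_t\|_1. Every entry of a subgradient lies in [-1, 1],
% hence \|\nabla_x L^*(x)\| \le \sqrt{d_y} \le \sqrt{d_y}\,(1 + C), i.e. l_0 = \sqrt{d_y}.
```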

Then we perform Algorithm 1.

Algorithm 1 Learning Stable Linear System with SGD
  Input:Sequences of data {x,y}\{x,y\}, learning rate η\eta, initialization parameter 0<ρ<10<\rho<1.
  Initialization: The entries of 𝑾~0\widetilde{\bm{W}}_{0} and 𝑨0\bm{A}_{0} are i.i.d. generated from N(0,ρm)N(0,\frac{\rho}{m}) and N(0,1m)N(0,\frac{1}{m}). The entries of 𝑩\bm{B} are i.i.d. generated from N(0,1dy)N(0,\frac{1}{d_{y}}).
  for k=0,1,2,3K1k=0,1,2,3...K-1 do
     Randomly sample a sequence xix^{i} and the label yiy^{i}.
     𝑾~k+1=𝑾~kηTt𝑾k~L(yti,f~t(𝑾k~,𝑨k,xi))\widetilde{\bm{W}}_{k+1}=\widetilde{\bm{W}}_{k}-\frac{\eta}{T}\sum_{t}\nabla_{\widetilde{\bm{W}_{k}}}L(y^{i}_{t},\widetilde{f}_{t}(\widetilde{\bm{W}_{k}},\bm{A}_{k},x^{i})),
     𝑨k+1=𝑨kηTt𝑨kL(yti,f~t(𝑾~k,𝑨k,xi))\bm{A}_{k+1}=\bm{A}_{k}-\frac{\eta}{T}\sum_{t}\nabla_{\bm{A}_{k}}L(y^{i}_{t},\widetilde{f}_{t}(\widetilde{\bm{W}}_{k},\bm{A}_{k},x^{i})).
  end for
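For readers who prefer code, below is a minimal sketch of Algorithm 1 with the square loss and back-propagation through time. The labels come from the hypothetical teacher_outputs helper of the earlier sketch, and the width, learning rate, and number of steps are illustrative rather than the values required by Theorem 4.

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_y, m, T, K, eta, rho = 3, 2, 256, 20, 500, 1e-4, 0.8

# Initialization as in Algorithm 1; B is drawn once and kept fixed.
W = rng.normal(scale=np.sqrt(rho / m), size=(m, m))
A = rng.normal(scale=np.sqrt(1.0 / m), size=(m, d))
B = rng.normal(scale=np.sqrt(1.0 / d_y), size=(d_y, m))

def forward(W, A, x):
    """Hidden states h_1..h_T and outputs f_1..f_T of the student RNN (9)."""
    h = np.zeros(m); hs, fs = [], []
    for x_t in x:
        h = W @ h + A @ x_t
        hs.append(h); fs.append(B @ h)
    return np.array(hs), np.array(fs)

for k in range(K):
    # Randomly sample a sequence x^i and its label y^i.
    x = rng.normal(size=(T, d))
    y = teacher_outputs(x)                    # labels from the (hypothetical) teacher
    hs, fs = forward(W, A, x)
    # Back-propagation through time for (1/T) * sum_t ||f_t - y_t||^2.
    gW, gA, gh = np.zeros_like(W), np.zeros_like(A), np.zeros(m)
    for t in reversed(range(T)):
        gh = gh + (2.0 / T) * (B.T @ (fs[t] - y[t]))   # dLoss/dh_t
        h_prev = hs[t - 1] if t > 0 else np.zeros(m)
        gW += np.outer(gh, h_prev)
        gA += np.outer(gh, x[t])
        gh = W.T @ gh                                   # carry to step t-1
    W -= eta * gW
    A -= eta * gA
```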

In fact, we have

Theorem 4

Assume δ(0,e1]\delta\in(0,e^{-1}]. Let ρ1=11+10log2mm\rho_{1}=\frac{1}{1+10\cdot\frac{\log^{2}m}{\sqrt{m}}}, 0<ρ0<10<\rho_{0}<1, and set the initialization parameter ρ=ρ1ρ02\rho=\rho_{1}\cdot\rho^{2}_{0}. Given an unknown distribution 𝒟{\cal D} of sequences {x,y}\{x,y\}, let 𝐖~k\widetilde{\bm{W}}_{k}, 𝐀k\bm{A}_{k} be the outputs of Algorithm 1.

For any small ϵ>0\epsilon>0, there are parameters (to simplify symbols, we omit dd and dyd_{y} in these asymptotic bounds of the parameters; one can easily show that the parameters are polynomial in dd and dyd_{y})

Tmax=Θ(1log(1ρ0)){2log(cρ1ρ0)+log(1ϵ)\displaystyle T_{max}=\Theta(\frac{1}{\log(\frac{1}{\rho_{0}})})\cdot\{2\log(\frac{c_{\rho}}{1-\rho_{0}})+\log(\frac{1}{\epsilon}) (11)
+loglog(Tmax/δ)+12log(m)})\displaystyle+\log\sqrt{\log(T_{max}/\delta)}+\frac{1}{2}\log(m)\})
b=log(Tmax/δ),η=Θ(νϵmb2),\displaystyle b=\sqrt{\log(T_{max}/\delta)},\eta=\Theta(\frac{\nu\epsilon}{mb^{2}}),
K=Θ(Tmax4b4νϵ2),\displaystyle K=\Theta(\frac{T^{4}_{max}b^{4}}{\nu\epsilon^{2}}),
m=Θ(cρ2K4(1ρ0)8ϵ2b6)+Θ(1δ),\displaystyle m^{*}=\Theta(\frac{c_{\rho}^{2}K^{4}(1-\rho_{0})^{8}\epsilon^{2}}{b^{6}})+\Theta(\frac{1}{\delta}),
ν=Θ(ϵ2(1ρ0)12Tmax4l06(1+2b)6),\displaystyle\nu=\Theta(\frac{\epsilon^{2}(1-\rho_{0})^{12}}{T_{max}^{4}\cdot l_{0}^{6}(1+2b)^{6}}),

such that with probability at least 1δ1-\delta, if m>mm>m^{*}, for any 𝐂,𝐃,𝐆\bm{C},\bm{D},\bm{G}, the algorithm outputs satisfy:

1Kk=0K1\displaystyle\frac{1}{K}\sum_{k=0}^{K-1} 1Tt=1T{L(ytk,f~t(𝑾~k,𝑨k,xk))\frac{1}{T}\sum_{t=1}^{T}\{L(y^{k}_{t},\widetilde{f}_{t}(\widetilde{\bm{W}}_{k},\bm{A}_{k},x^{k})) (12)
L(ytk,y~tk(𝑪,𝑫,𝑮))}𝒪(ϵ),\displaystyle-L(y^{k}_{t},\widetilde{y}^{k}_{t}(\bm{C},\bm{D},\bm{G}))\}\leq\mathcal{O}(\epsilon),

where y~tk(𝐂,𝐃,𝐆)\widetilde{y}_{t}^{k}(\bm{C},\bm{D},\bm{G}) is the output from the linear system:

pti=𝑪pt1i+𝑫xti,\displaystyle p^{i}_{t}=\bm{C}p^{i}_{t-1}+\bm{D}x^{i}_{t}, (13)
yti~=𝑮pti.\displaystyle\widetilde{y^{i}_{t}}=\bm{G}p^{i}_{t}.

with 𝐂k𝐃cρρCk||\bm{C}^{k}\bm{D}||\leq c_{\rho}\rho_{C}^{k} for all kk\in\mathbb{N}, 𝐃dp×d\bm{D}\in\mathbb{R}^{d_{p}\times d}, ptdpp_{t}\in\mathbb{R}^{d_{p}}, 𝐂dp×dp\bm{C}\in\mathbb{R}^{d_{p}\times d_{p}}, and 𝐆dy×dp\bm{G}\in\mathbb{R}^{d_{y}\times d_{p}}.

Lemma 5

For the parameters of the last theorem, when m>mm>m^{*} and ϵ\epsilon is small enough, we have the following results:

  1. Tmax4b2Kmη=Θ(ϵ)\frac{T_{max}^{4}b^{2}}{Km\eta}=\Theta(\epsilon)

  2. l0(1+2b)l02(1+2b)2ηmK(1ρ0)6𝒪(ϵ)l_{0}(1+2b)\cdot l_{0}^{2}(1+2b)^{2}\cdot\frac{\eta m\sqrt{K}}{(1-\rho_{0})^{6}}\leq\mathcal{O}(\epsilon)

  3. K2η2mm(1ρ0)11l02(1+2b)2l0(1+2b)𝒪(ϵ)K^{2}\eta^{2}\cdot\frac{m\sqrt{m}}{(1-\rho_{0})^{11}}\cdot l_{0}^{2}(1+2b)^{2}\cdot l_{0}(1+2b)\leq\mathcal{O}(\epsilon)

  4. ηml02(1+2b)2(1ρ0)6𝒪(ϵ)\frac{\eta ml^{2}_{0}(1+2b)^{2}}{(1-\rho_{0})^{6}}\leq\mathcal{O}(\epsilon)

  5. bd2cρTmax3logmm1/2𝒪(ϵl0(1+2b))\frac{b\cdot d^{2}c_{\rho}T^{3}_{max}\log m}{m^{1/2}}\leq\mathcal{O}(\frac{\epsilon}{l_{0}(1+2b)})

As for the population loss, we have:

Theorem 6

(Rademacher complexity for RNN) Under the condition in Theorem 4, with probability at least 1δ1-\delta,

|𝔼x,y𝒟1Tt=1T{L(yt,f~t(𝑾~k,𝑨k,x))L(yt,y~t)}\displaystyle|\mathbb{E}_{x,y\sim{\cal D}}\frac{1}{T}\sum_{t=1}^{T}\{L(y_{t},\widetilde{f}_{t}(\widetilde{\bm{W}}_{k},\bm{A}_{k},x))-L(y_{t},\widetilde{y}_{t})\}
1Kk=0K11Tt=1T{L(ytk,f~t(𝑾~k,𝑨k,xk))L(ytk,y~tk)}|\displaystyle-\frac{1}{K}\sum_{k=0}^{K-1}\frac{1}{T}\sum_{t=1}^{T}\{L(y^{k}_{t},\widetilde{f}_{t}(\widetilde{\bm{W}}_{k},\bm{A}_{k},x^{k}))-L(y^{k}_{t},\widetilde{y}^{k}_{t})\}|
𝒪(ϵ).\displaystyle\leq\mathcal{O}(\epsilon).

Since ρ11\rho_{1}\to 1 as mm\to\infty, for m>m=poly(11ρ0)m>m^{*}=poly(\frac{1}{1-\rho_{0}}) large enough we have ρ1>ρ0\rho_{1}>\rho_{0} and thus ρ>ρ03\rho>\rho_{0}^{3}. Therefore, from the above two theorems we have the following corollary:

Corollary 7

Let the initialization parameter be ρC<1\rho_{C}<1. For any small ϵ>0,δ>0\epsilon>0,\delta>0, there are parameters m=poly(1δ,1ϵ,11ρC)m^{*}=poly(\frac{1}{\delta},\frac{1}{\epsilon},\frac{1}{1-\rho_{C}}) and K=poly(11ρC,1ϵ,1δ)K=poly(\frac{1}{1-\rho_{C}},\frac{1}{\epsilon},\frac{1}{\delta}) such that with probability at least 1δ1-\delta, if m>mm>m^{*}, the algorithm outputs satisfy:

𝔼x,y𝒟\displaystyle\mathbb{E}_{x,y\sim{\cal D}} 1Tt=1TL(yt,f~t(𝑾~k,𝑨k,x))OPTρC+𝒪(ϵ).\displaystyle\frac{1}{T}\sum_{t=1}^{T}L(y_{t},\widetilde{f}_{t}(\widetilde{\bm{W}}_{k},\bm{A}_{k},x))\leq OPT_{\rho_{C}}+\mathcal{O}(\epsilon).
Remark V.1

In this paper, our results only assume for some constant CC:

xL(x)𝒪(1+C).||\nabla_{x}L^{*}(x)||\leq\mathcal{O}(1+C). (14)

This is a very mild condition on the loss function L(x)L(x). Thus we do not assume the form of the noise in advance (for example, optimizing the square loss amounts to optimizing the maximum likelihood objective under Gaussian noise, and the l1l^{1} loss that under Laplace noise), and our result applies not only to regression problems but also to classification problems. In this respect, this result improves upon the previous methods for learning linear dynamical systems in [7] and [8], which rely heavily on the form of the square loss.

VI Preliminary Properties

Before proving Theorems 4 and 6, we need some properties of Gaussian random matrices and linear RNNs. The proofs are in the Supplementary Materials.

To simplify symbols, in the latter part of the paper we set 𝑾k=𝑾k~/ρ\bm{W}_{k}=\widetilde{\bm{W}_{k}}/\rho and (we omit xx in ft(𝑾,𝑨,x)f_{t}(\bm{W},\bm{A},x) when it does not lead to misunderstanding)

ft(𝑾,𝑨)=f~t(𝑾~,𝑨,x)=t0=0t1ρt0𝑩(τ=1t0𝑾)𝑨Xtt0.\displaystyle f_{t}(\bm{W},\bm{A})=\widetilde{f}_{t}(\widetilde{\bm{W}},\bm{A},x)=\sum_{t_{0}=0}^{t-1}\rho^{t_{0}}\bm{B}(\prod_{\tau=1}^{t_{0}}\bm{W})\bm{A}X_{t-t_{0}}.

Then

𝑾k+1=\displaystyle\bm{W}_{k+1}= 𝑾kηTρt=1T𝑾k~L(yti,f~t(𝑾~k,𝑨k,xi)),\displaystyle\bm{W}_{k}-\frac{\eta}{T\rho}\sum_{t=1}^{T}\nabla_{\widetilde{\bm{W}_{k}}}L(y^{i}_{t},\widetilde{f}_{t}(\widetilde{\bm{W}}_{k},\bm{A}_{k},x^{i})), (15)
=\displaystyle= 𝑾kηTt=1T𝑾kL(yti,ft(𝑾k,𝑨k,xi)).\displaystyle\bm{W}_{k}-\frac{\eta}{T}\sum_{t=1}^{T}\nabla_{\bm{W}_{k}}L(y^{i}_{t},f_{t}(\bm{W}_{k},\bm{A}_{k},x^{i})).

Then we only need to consider 𝑾k\bm{W}_{k}. The entries of 𝑾0\bm{W}_{0} are i.i.d. generated from N(0,1m)N(0,\frac{1}{m}). Let B(𝑾0,ω)={𝑾|𝑾𝑾0Fω}B(\bm{W}_{0},\omega)=\{\bm{W}|\ ||\bm{W}-\bm{W}_{0}||_{F}\leq\omega\} and B(𝑨0,ω)={𝑨|𝑨𝑨0Fω}B(\bm{A}_{0},\omega)=\{\bm{A}|\ ||\bm{A}-\bm{A}_{0}||_{F}\leq\omega\}.

VI-A Properties of Random Matrix

One of the key points in this paper is that with high probability, the spectral radius of matrix 𝑾0\bm{W}_{0} will be less than ρ11\rho_{1}^{-1} and when mm\to\infty, ρ11\rho_{1}\to 1. In fact, we have:

Lemma 8

With probability at least 1exp(Ω(log2m)))1-exp(-\Omega(\log^{2}m))) (it is larger than 1δ1-\delta when m>mm>m^{*}), there exists L=c0mlogmL=c_{0}\cdot\frac{\sqrt{m}}{\log m}\in\mathbb{N} such that for all kLk\geq L,

𝑾0kρ1k,||\bm{W}_{0}^{k}||\leq\rho_{1}^{-k},

and for all k<Lk<L,

𝑾0kρ1L,||\bm{W}_{0}^{k}||\leq\rho_{1}^{-L},

where c0>1c_{0}>1 is an absolutely constant. Meanwhile, for all k2Lk\leq 2L, with probability at least 1exp(Ω(log2m)))1-exp(-\Omega(\log^{2}m))),

W0k2k.||W_{0}^{k}||\leq 2\sqrt{k}.

In the rest of this paper, all the probabilities are considered under the condition in Lemma 8.
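As a sanity check (ours, for a single random draw and an illustrative size), the behavior behind Lemma 8 is easy to observe numerically: for 𝑾0\bm{W}_{0} with i.i.d. N(0,1m)N(0,\frac{1}{m}) entries, 𝑾0k||\bm{W}_{0}^{k}|| stays of order k\sqrt{k} for moderate kk, in contrast with the trivial bound 𝑾0k||\bm{W}_{0}||^{k}.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 1000
W0 = rng.normal(scale=1 / np.sqrt(m), size=(m, m))
norm1 = np.linalg.norm(W0, 2)          # spectral norm, roughly 2

Wk = np.eye(m)
for k in range(1, 31):
    Wk = Wk @ W0
    if k % 5 == 0:
        print(f"k={k:2d}  ||W0^k|| = {np.linalg.norm(Wk, 2):6.3f}   "
              f"2*sqrt(k) = {2 * np.sqrt(k):5.2f}   ||W0||^k = {norm1 ** k:.1e}")
```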

As a corollary, let k2Lk\geq 2L, ω0=1ρ01\omega_{0}=\frac{1}{\rho_{0}}-1 and when 𝑾𝑾0Fωω0||\bm{W}-\bm{W}_{0}||_{F}\leq\omega\leq\omega_{0}, we have

(ρ1ρ02𝑾)k(ρ1ρ02)ki=0kCki𝑾0i𝑾𝑾0ki\displaystyle||(\rho_{1}\cdot\rho_{0}^{2}\cdot\bm{W})^{k}||\leq(\rho_{1}\cdot\rho_{0}^{2})^{k}\sum_{i=0}^{k}C_{k}^{i}\cdot||\bm{W}^{i}_{0}||\cdot||\bm{W}-\bm{W}_{0}||^{k-i} (16)
i=0kCki𝑾0i(ρ1ρ02)k(1ρ01)ki\displaystyle\leq\sum_{i=0}^{k}C_{k}^{i}\cdot||\bm{W}^{i}_{0}||\cdot(\rho_{1}\cdot\rho_{0}^{2})^{k}\cdot(\frac{1}{\rho_{0}}-1)^{k-i}
i=LkCki(ρ1ρ02)i[ρ1(ρ0ρ02)]kiρ1k\displaystyle\leq\sum_{i=L}^{k}C_{k}^{i}(\rho_{1}\rho_{0}^{2})^{i}[\rho_{1}\cdot(\rho_{0}-\rho_{0}^{2})]^{k-i}\cdot\rho_{1}^{-k}
+i=0L1Cki(ρ1ρ02)iW0iρ1ki(ρ0ρ02)kiρ0k.\displaystyle+\sum_{i=0}^{L-1}C_{k}^{i}(\rho_{1}\cdot\rho_{0}^{2})^{i}\cdot||W_{0}^{i}||\cdot\rho_{1}^{k-i}\cdot(\rho_{0}-\rho_{0}^{2})^{k-i}\leq\rho_{0}^{k}.

For k2Lk\leq 2L,

(ρ1ρ02𝑾)kρ1ki=0kCkiρ02𝑾0iρ02(𝑾𝑾0)ki\displaystyle||(\rho_{1}\cdot\rho_{0}^{2}\cdot\bm{W})^{k}||\leq\rho_{1}^{k}\sum_{i=0}^{k}C_{k}^{i}\cdot||\rho_{0}^{2}\bm{W}^{i}_{0}||\cdot||\rho_{0}^{2}(\bm{W}-\bm{W}_{0})||^{k-i} (17)
ρ1ki=0kCki2iρ02i(ρ0ρ02)ki\displaystyle\leq\rho_{1}^{k}\sum_{i=0}^{k}C_{k}^{i}2\sqrt{i}\rho_{0}^{2i}\cdot(\rho_{0}-\rho_{0}^{2})^{k-i}
ρ1kρ0k2k.\displaystyle\leq\rho_{1}^{k}\rho_{0}^{k}2\sqrt{k}.

Combining all these results, we can show that the norm of ρtτ=1t𝑾\rho^{t}\prod_{\tau=1}^{t}\bm{W} is bounded: for all tt\in\mathbb{N},

ρt𝑾t2tρ0t.||\rho^{t}\bm{W}^{t}||\leq 2\sqrt{t}\rho_{0}^{t}. (18)

This is crucial to make the width independent of the length of the input sequences. Then we can show:

Lemma 9

Let ρ=ρ1ρ02\rho=\rho_{1}\cdot\rho_{0}^{2}. For any τ\tau\in\mathbb{N}, any ZtdZ_{t}\in\mathbb{R}^{d} with Zt=1||Z_{t}||=1 and 𝐐m×d,𝐐2m×m\bm{Q}\in\mathbb{R}^{m\times d},\bm{Q}_{2}\in\mathbb{R}^{m\times m}, with probability at least 1exp(Ω(log2m)))1-exp(-\Omega(\log^{2}m))), for any 𝐖B(𝐖0,ω)\bm{W}\in B(\bm{W}_{0},\omega), 𝐀B(𝐀0,ω)\bm{A}\in B(\bm{A}_{0},\omega) with ωω0\omega\leq\omega_{0},

t=τρt𝑩(τ=1t𝑾)𝑸Zt4mτ(ρ0)τ(1ρ0)2𝑸,||\sum_{t=\tau}^{\infty}\rho^{t}\cdot\bm{B}(\prod_{\tau=1}^{t}\bm{W})\bm{Q}Z_{t}||\leq 4\frac{\sqrt{m}\tau(\rho_{0})^{\tau}}{(1-\rho_{0})^{2}}||\bm{Q}||, (19)
t0=τt1+t2=t0ρt0𝑩(τ=1t11𝑾)𝑸2(τ=1t21𝑾)𝑨Zt0\displaystyle||\sum_{t_{0}=\tau}^{\infty}\sum_{t_{1}+t_{2}=t_{0}}\rho^{t_{0}}\bm{B}(\prod_{\tau=1}^{t_{1}-1}\bm{W})\bm{Q}_{2}(\prod_{\tau=1}^{t_{2}-1}\bm{W})\bm{A}Z_{t_{0}}||
32mτ2(ρ0)τ(1ρ0)3𝑸2.\displaystyle\leq 32\frac{\sqrt{m}\tau^{2}(\rho_{0})^{\tau}}{(1-\rho_{0})^{3}}||\bm{Q}_{2}||.

Based on these results, we only need to consider the properties of ρt0𝑩(τ=1t0𝑾)𝑨Xtt0\rho^{t_{0}}\bm{B}(\prod_{\tau^{\prime}=1}^{t_{0}}\bm{W})\bm{A}X_{t-t_{0}} for bounded t0t_{0}. We have the following results:

Lemma 10

For any vector v1dy,v2dv_{1}\in\mathbb{R}^{d_{y}},v_{2}\in\mathbb{R}^{d} with v1=v2=1||v_{1}||=||v_{2}||=1, if m>𝒪(τ3d)m>\mathcal{O}(\tau^{3}\cdot d), with probability at least 1exp(Ω(m/τ2))1-exp(-\Omega(m/\tau^{2})):

0.9{τ=1t𝑾0}𝑨0v21.1,\displaystyle 0.9\leq||\{\prod_{\tau^{\prime}=1}^{t}\bm{W}_{0}\}\bm{A}_{0}v_{2}||\leq 1.1, (20)
0.91m{τ=1t𝑾0}T𝑩Tv11.1,\displaystyle 0.9\leq||\frac{1}{\sqrt{m}}\{\prod_{\tau^{\prime}=1}^{t}\bm{W}_{0}\}^{T}\bm{B}^{T}v_{1}||\leq 1.1,

for any 0tτ10\leq t\leq\tau-1 and τ>0\tau>0.

Lemma 11

If m>Ω(log(τd))m>\Omega(\log(\tau\cdot d)), with probability at least 1δ1-\delta:

||𝑩(τ=1t𝑾)𝑨||dlog(τd/δ)).||\bm{B}(\prod_{\tau^{\prime}=1}^{t}\bm{W})\bm{A}||\leq\sqrt{d\log(\tau\cdot d/\delta)}). (21)

for any 0tτ10\leq t\leq\tau-1 and τ>0\tau>0.

Lemma 12

For any vector v1,u1dy,v2,u2dv_{1},u_{1}\in\mathbb{R}^{d_{y}},v_{2},u_{2}\in\mathbb{R}^{d} with v1=u1=v2=u2=1||v_{1}||=||u_{1}||=||v_{2}||=||u_{2}||=1, if m1/2>Ω(τ3d)m^{1/2}>\Omega(\tau^{3}\cdot d), with probability at least 1exp(Ω(log2m))1-exp(-\Omega(\log^{2}m)), for any 0t,tτ10\leq t,t^{\prime}\leq\tau-1, ttt\neq t^{\prime}:

|(u1)T𝑩(k=1t𝑾0)\displaystyle|(u_{1})^{T}\bm{B}(\prod_{k=1}^{t}\bm{W}_{0}) {(k=1t𝑾0)}T𝑩Tv1|\displaystyle\cdot\{(\prod_{k=1}^{t^{\prime}}\bm{W}_{0})\}^{T}\bm{B}^{T}v_{1}| (22)
24τd2logm,\displaystyle\leq 24\tau d^{2}\log m,

and

|(u2)T𝑨0T(k=1t𝑾0)T{(k=1t𝑾0)}𝑨0v2|24τd2logmm.|(u_{2})^{T}\bm{A}^{T}_{0}(\prod_{k=1}^{t}\bm{W}_{0})^{T}\cdot\{(\prod_{k=1}^{t^{\prime}}\bm{W}_{0})\}\bm{A}_{0}v_{2}|\leq 24\tau\frac{d^{2}\log m}{\sqrt{m}}.

VI-B Properties of Linear RNN

For any t,τt,\tau\in\mathbb{N}, ftf_{t} can be written as (when tt0<1t-t_{0}<1, we set Xtt0=0X_{t-t_{0}}=0):

ft(𝑾,𝑨)=t0=0t1ρt0𝑩(τ=1t0𝑾)𝑨Xtt0.\displaystyle f_{t}(\bm{W},\bm{A})=\sum_{t_{0}=0}^{t-1}\rho^{t_{0}}\bm{B}(\prod_{\tau^{\prime}=1}^{t_{0}}\bm{W})\bm{A}X_{t-t_{0}}. (23)

From Lemma 9, we consider a truncation of ftf_{t} as

ftτ(𝑾,𝑨)=t0=0τρt0𝑩(τ=1t0𝑾)𝑨Xtt0.\displaystyle f^{\tau}_{t}(\bm{W},\bm{A})=\sum_{t_{0}=0}^{\tau}\rho^{t_{0}}\bm{B}(\prod_{\tau^{\prime}=1}^{t_{0}}\bm{W})\bm{A}X_{t-t_{0}}. (24)
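The geometric decay that justifies this truncation can be seen directly. The following sketch (ours, with arbitrary sizes and ρ=0.8\rho=0.8) compares the truncated output (24) with the full output (23) at a random initialization.

```python
import numpy as np

def f_t(W, A, B, x, rho, tau=None):
    """Compute sum_{t0=0}^{cap} rho^t0 * B W^t0 A x_{t-t0}; cap = t-1 (full) or tau."""
    t = len(x)
    cap = t - 1 if tau is None else min(tau, t - 1)
    M = A.copy()                           # M = W^{t0} A
    out = np.zeros(B.shape[0])
    for t0 in range(cap + 1):
        out += (rho ** t0) * (B @ (M @ x[t - 1 - t0]))
        M = W @ M
    return out

rng = np.random.default_rng(3)
d, d_y, m, T, rho = 3, 2, 512, 60, 0.8
W = rng.normal(scale=1 / np.sqrt(m), size=(m, m))
A = rng.normal(scale=1 / np.sqrt(m), size=(m, d))
B = rng.normal(scale=1 / np.sqrt(d_y), size=(d_y, m))
x = rng.normal(size=(T, d))

full = f_t(W, A, B, x, rho)
for tau in (5, 10, 20, 40):
    err = np.linalg.norm(f_t(W, A, B, x, rho, tau) - full)
    print(f"tau = {tau:2d}   truncation error = {err:.2e}")
```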

ft(𝑾,𝑨)f_{t}(\bm{W},\bm{A}) is also an almost linear function of 𝑾\bm{W} and 𝑨\bm{A}. In fact, we have:

Lemma 13

Under the condition in Theorem 4, with probability at least 1δ1-\delta, for all t[T]t\in[T] and 𝐖,𝐖B(𝐖0,ω)\bm{W},\bm{W}^{\prime}\in B(\bm{W}_{0},\omega), 𝐀,𝐀B(𝐀0,ω)\bm{A},\bm{A}^{\prime}\in B(\bm{A}_{0},\omega) with ωω0\omega\leq\omega_{0},

||ft(𝑾,𝑨)ft(𝑾,𝑨)Wft(𝑾,𝑨)[𝑾𝑾]\displaystyle||f_{t}(\bm{W}^{\prime},\bm{A}^{\prime})-f_{t}(\bm{W},\bm{A})-\nabla_{W}f_{t}(\bm{W},\bm{A})\cdot[\bm{W}^{\prime}-\bm{W}]
Aft(𝑾,𝑨)[𝑨𝑨]||\displaystyle-\nabla_{A}f_{t}(\bm{W},\bm{A})\cdot[\bm{A}^{\prime}-\bm{A}]||
Θ(mω2(1ρ0)5).\displaystyle\leq\Theta(\frac{\sqrt{m}\omega^{2}}{(1-\rho_{0})^{5}}).

Combining the above results, let

ftlin(𝑾,𝑨)\displaystyle f_{t}^{lin}(\bm{W},\bm{A}) =ft(𝑾0,𝑨0)+Wft(𝑾0,𝑨0)[𝑾𝑾0]\displaystyle=f_{t}(\bm{W}_{0},\bm{A}_{0})+\nabla_{W}f_{t}(\bm{W}_{0},\bm{A}_{0})\cdot[\bm{W}-\bm{W}_{0}]
+Aft(𝑾0,𝑨0)[𝑨𝑨0],\displaystyle+\nabla_{A}f_{t}(\bm{W}_{0},\bm{A}_{0})\cdot[\bm{A}-\bm{A}_{0}],
ftlin,τ(𝑾,𝑨)\displaystyle f_{t}^{lin,\tau}(\bm{W},\bm{A}) =ftτ(𝑾0,𝑨0)+Wftτ(𝑾0,𝑨0)[𝑾𝑾0]\displaystyle=f^{\tau}_{t}(\bm{W}_{0},\bm{A}_{0})+\nabla_{W}f^{\tau}_{t}(\bm{W}_{0},\bm{A}_{0})\cdot[\bm{W}-\bm{W}_{0}]
+Aftτ(𝑾0,𝑨0)[𝑨𝑨0].\displaystyle+\nabla_{A}f^{\tau}_{t}(\bm{W}_{0},\bm{A}_{0})\cdot[\bm{A}-\bm{A}_{0}].

We have:

Lemma 14

Under the condition in Theorem 4, with probability at least 1δ1-\delta,

ftlin(𝑾,𝑨)ftlin,τ(𝑾,𝑨)Θ(mτρ0τ(1ρ0)3).||f^{lin}_{t}(\bm{W},\bm{A})-f^{lin,\tau}_{t}(\bm{W},\bm{A})||\leq\Theta(\frac{\sqrt{m}\tau\rho_{0}^{\tau}}{(1-\rho_{0})^{3}}). (25)

Then since we set

Tmax>\displaystyle T_{max}> Θ(1log(1ρ0)){3log(11ρ0)+logb\displaystyle\Theta(\frac{1}{\log(\frac{1}{\rho_{0}})})\cdot\{3\log(\frac{1}{1-\rho_{0}})+\log b
+log(1ϵ)+12log(m)+logTmax},\displaystyle+\log(\frac{1}{\epsilon})+\frac{1}{2}\log(m)+\log T_{max}\},

we have

ftlin(𝑾,𝑨)ftlin,Tmax(𝑾,𝑨)ϵ/b.||f^{lin}_{t}(\bm{W},\bm{A})-f^{lin,T_{max}}_{t}(\bm{W},\bm{A})||\leq\epsilon/b. (26)

Therefore ftlin,Tmax(𝑾,𝑨)f^{lin,T_{max}}_{t}(\bm{W},\bm{A}) is a good approximation for ft(𝑾,𝑨)f_{t}(\bm{W},\bm{A}).

Note that from the above arguments and Lemma 11, we have

ft(𝑾,𝑨)2b,fL2l0(1+2b),||f_{t}(\bm{W},\bm{A})||\leq 2b,||\nabla_{f}L||\leq 2l_{0}\cdot(1+2b),

and

L(yt,ft(𝑾,𝑨))4l0(b+2b2).L(y_{t},f_{t}(\bm{W},\bm{A}))\leq 4l_{0}\cdot(b+2b^{2}). (27)

VII Proof of Theorem 4

Theorem 4 is a direct corollary of Theorems 15 and 16 below.

Theorem 15

Under the condition in Theorem 4, for any 𝐀,𝐖\bm{A}^{*},\bm{W}^{*} with 𝐀𝐀0F,𝐖𝐖0FR/mbTmax2/m||\bm{A}^{*}-\bm{A}_{0}||_{F},||\bm{W}^{*}-\bm{W}_{0}||_{F}\leq R/\sqrt{m}\leq b\cdot T^{2}_{max}/\sqrt{m}, let

Lk(𝑾k,𝑨k)=1Tt=1TL(ytk,ft(𝑾k,𝑨k)),\displaystyle L^{k}(\bm{W}_{k},\bm{A}_{k})=\frac{1}{T}\sum_{t=1}^{T}L(y^{k}_{t},f_{t}(\bm{W}_{k},\bm{A}_{k})),
Ltk(𝑾k,𝑨k)=L(ytk,ft(𝑾k,𝑨k)).\displaystyle L^{k}_{t}(\bm{W}_{k},\bm{A}_{k})=L(y^{k}_{t},f_{t}(\bm{W}_{k},\bm{A}_{k})).

with probability at least 1δ1-\delta, the outputs of Algorithm 1 satisfy:

1Kk=0K1Lk(𝑾k,𝑨k)Lk(𝑾,𝑨)𝒪(ϵ).\frac{1}{K}\sum_{k=0}^{K-1}L^{k}(\bm{W}_{k},\bm{A}_{k})-L^{k}(\bm{W}^{*},\bm{A}^{*})\leq\mathcal{O}(\epsilon). (28)

If ftf_{t} were an exactly linear function of the parameters, one could easily see that Theorem 15 follows from the convexity of the loss. In our case, the proof of Theorem 15 relies on the linearization Lemma 13, which says that when 𝑾𝑾0F,𝑨𝑨0F||\bm{W}-\bm{W}_{0}||_{F},||\bm{A}-\bm{A}_{0}||_{F} are small enough,

ft(𝑾,𝑨)=t0=0t1ρt0𝑩(τ=1t0𝑾)𝑨Xtt0.\displaystyle f_{t}(\bm{W},\bm{A})=\sum_{t_{0}=0}^{t-1}\rho^{t_{0}}\bm{B}(\prod_{\tau^{\prime}=1}^{t_{0}}\bm{W})\bm{A}X_{t-t_{0}}. (29)

will be nearly a linear function of 𝑾\bm{W} and 𝑨\bm{A}. This is the main step in the proof of Theorem 15.

Proof of Theorem 15:

From Lemma 9, for any tt,

WftF,AftF32m(1ρ0)3||\nabla_{W}f_{t}||_{F},||\nabla_{A}f_{t}||_{F}\leq 32\frac{\sqrt{m}}{(1-\rho_{0})^{3}}

when 𝑾B(𝑾0,ω)\bm{W}\in B(\bm{W}_{0},\omega), 𝑨B(𝑨0,ω)\bm{A}\in B(\bm{A}_{0},\omega). Meanwhile,

𝑾k+1𝑾k=ηWLk(𝑾k,𝑨k),\displaystyle\bm{W}_{k+1}-\bm{W}_{k}=-\eta\nabla_{W}L^{k}(\bm{W}_{k},\bm{A}_{k}), (30)
𝑨k+1𝑨k=ηALk(𝑾k,𝑨k).\displaystyle\bm{A}_{k+1}-\bm{A}_{k}=-\eta\nabla_{A}L^{k}(\bm{W}_{k},\bm{A}_{k}).

Thus for any kKk\leq K,

𝑾k𝑾0FηKWfFfL,\displaystyle||\bm{W}_{k}-\bm{W}_{0}||_{F}\leq\eta K\cdot||\nabla_{W}f||_{F}\cdot||\nabla_{f}L||, (31)
𝑨k𝑨0FηKAfFfL.\displaystyle||\bm{A}_{k}-\bm{A}_{0}||_{F}\leq\eta K\cdot||\nabla_{A}f||_{F}\cdot||\nabla_{f}L||.

In our case, from Eq (10) and (VI-B), we have

fLl0(1+2b),||\nabla_{f}L||\leq l_{0}(1+2b),
WLF,ALF32m(1ρ0)3l0(1+2b).||\nabla_{W}L||_{F},||\nabla_{A}L||_{F}\leq 32\frac{\sqrt{m}}{(1-\rho_{0})^{3}}\cdot l_{0}(1+2b).

Thus 𝑾k𝑾0F,𝑨k𝑨0FKη32m(1ρ0)3l0(1+2b)ω.||\bm{W}_{k}-\bm{W}_{0}||_{F},||\bm{A}_{k}-\bm{A}_{0}||_{F}\leq K\eta\cdot 32\frac{\sqrt{m}}{(1-\rho_{0})^{3}}\cdot l_{0}(1+2b)\triangleq\omega.

From the convexity of LL and Lemma 13, we have

Lk(𝑾k,𝑨k)Lk(𝑾,𝑨)\displaystyle L^{k}(\bm{W}_{k},\bm{A}_{k})-L^{k}(\bm{W}^{*},\bm{A}^{*})
1Tt=1TftLtk(𝑾k,𝑨k)[ft(𝑾k,𝑨k)ft(𝑾,𝑨)],\displaystyle\leq\frac{1}{T}\sum_{t=1}^{T}\nabla_{f_{t}}L^{k}_{t}(\bm{W}_{k},\bm{A}_{k})\cdot[f_{t}(\bm{W}_{k},\bm{A}_{k})-f_{t}(\bm{W}^{*},\bm{A}^{*})],
1Tt=1TftLtk(𝑾k,𝑨k)[Wft(𝑾k,𝑨k)[𝑾k𝑾]\displaystyle\leq\frac{1}{T}\sum_{t=1}^{T}\nabla_{f_{t}}L^{k}_{t}(\bm{W}_{k},\bm{A}_{k})\cdot[\nabla_{W}f_{t}(\bm{W}_{k},\bm{A}_{k})\cdot[\bm{W}_{k}-\bm{W}^{*}]
+Aft(𝑾k,𝑨k)[𝑨k𝑨]]+Θ(mω2(1ρ0)5)l0(1+2b),\displaystyle+\nabla_{A}f_{t}(\bm{W}_{k},\bm{A}_{k})\cdot[\bm{A}_{k}-\bm{A}^{*}]]+\Theta(\frac{\sqrt{m}\omega^{2}}{(1-\rho_{0})^{5}})\cdot l_{0}(1+2b),
WLk(𝑾k,𝑨k)[𝑾k𝑾]\displaystyle\leq\nabla_{W}L^{k}(\bm{W}_{k},\bm{A}_{k})\cdot[\bm{W}_{k}-\bm{W}^{*}]
+ALk(𝑾k,𝑨k)[𝑨k𝑨]+Θ(mω2(1ρ0)5)l0(1+2b).\displaystyle+\nabla_{A}L^{k}(\bm{W}_{k},\bm{A}_{k})\cdot[\bm{A}_{k}-\bm{A}^{*}]+\Theta(\frac{\sqrt{m}\omega^{2}}{(1-\rho_{0})^{5}})\cdot l_{0}(1+2b).

Since

𝑾k+1𝑾k=ηWLk(𝑾k𝑨k),\bm{W}_{k+1}-\bm{W}_{k}=-\eta\nabla_{W}L^{k}(\bm{W}_{k}\bm{A}_{k}),
𝑨k+1𝑨k=ηALk(𝑾k,𝑨k),\bm{A}_{k+1}-\bm{A}_{k}=-\eta\nabla_{A}L^{k}(\bm{W}_{k},\bm{A}_{k}),

we have

1K\displaystyle\frac{1}{K} k=0K1{Lk(𝑾k,𝑨k)Lk(𝑾,𝑨)}\displaystyle\sum_{k=0}^{K-1}\{L^{k}(\bm{W}_{k},\bm{A}_{k})-L^{k}(\bm{W}^{*},\bm{A}^{*})\}
\displaystyle\leq 1Kηk=0K1{𝑾k𝑾k+1,𝑾k𝑾\displaystyle\frac{1}{K\eta}\sum_{k=0}^{K-1}\{\langle\bm{W}_{k}-\bm{W}_{k+1},\bm{W}_{k}-\bm{W}^{*}\rangle
+𝑨k𝑨k+1,𝑨k𝑨}+Θ(mω2(1ρ0)5)l0(1+2b),\displaystyle+\langle\bm{A}_{k}-\bm{A}_{k+1},\bm{A}_{k}-\bm{A}^{*}\rangle\}+\Theta(\frac{\sqrt{m}\omega^{2}}{(1-\rho_{0})^{5}})\cdot l_{0}(1+2b),
\displaystyle\leq 12Kηk=0K1{||𝑾k𝑾||F2||𝑾k+1𝑾||F2\displaystyle\frac{1}{2K\eta}\sum_{k=0}^{K-1}\{||\bm{W}_{k}-\bm{W}^{*}||^{2}_{F}-||\bm{W}_{k+1}-\bm{W}^{*}||^{2}_{F}
+𝑾k+1𝑾kF2+𝑨k𝑨F2𝑨k+1𝑨F2\displaystyle+||\bm{W}_{k+1}-\bm{W}_{k}||^{2}_{F}+||\bm{A}_{k}-\bm{A}^{*}||_{F}^{2}-||\bm{A}_{k+1}-\bm{A}^{*}||_{F}^{2}
+||𝑨k+1𝑨k||F2}+Θ(mω2(1ρ0)5)l0(1+2b),\displaystyle+||\bm{A}_{k+1}-\bm{A}_{k}||_{F}^{2}\}+\Theta(\frac{\sqrt{m}\omega^{2}}{(1-\rho_{0})^{5}})\cdot l_{0}(1+2b),
\displaystyle\leq 12Kη{𝑾0𝑾F2+𝑨0𝑨F2}\displaystyle\frac{1}{2K\eta}\{||\bm{W}_{0}-\bm{W}^{*}||^{2}_{F}+||\bm{A}_{0}-\bm{A}^{*}||_{F}^{2}\}
+η[32m(1ρ0)3l0(1+2b)]2+Θ(mω2(1ρ0)5)l0(1+2b),\displaystyle+\eta[32\frac{\sqrt{m}}{(1-\rho_{0})^{3}}\cdot l_{0}(1+2b)]^{2}+\Theta(\frac{\sqrt{m}\omega^{2}}{(1-\rho_{0})^{5}})\cdot l_{0}(1+2b),
\displaystyle\leq R2Kmη+η[32m(1ρ0)3l0(1+2b)]2\displaystyle\frac{R^{2}}{Km\eta}+\eta[32\frac{\sqrt{m}}{(1-\rho_{0})^{3}}\cdot l_{0}(1+2b)]^{2} (32)
+Θ(mω2(1ρ0)5)l0(1+2b),\displaystyle+\Theta(\frac{\sqrt{m}\omega^{2}}{(1-\rho_{0})^{5}})\cdot l_{0}(1+2b),
\displaystyle\leq 𝒪(ϵ)+Θ(ηml02(1+2b)2(1ρ0)6)\mathcal{O}(\epsilon)+\Theta(\frac{\eta ml^{2}_{0}(1+2b)^{2}}{(1-\rho_{0})^{6}}) (33)
+Θ([Kηm(1ρ0)3l0(1+2b)]2m(1ρ0)5)l0(1+2b),\displaystyle+\Theta([K\eta\cdot\frac{\sqrt{m}}{(1-\rho_{0})^{3}}\cdot l_{0}(1+2b)]^{2}\cdot\frac{\sqrt{m}}{(1-\rho_{0})^{5}})\cdot l_{0}(1+2b),
\displaystyle\leq 𝒪(ϵ)+𝒪(ϵ)\displaystyle\mathcal{O}(\epsilon)+\mathcal{O}(\epsilon)
+Θ([K2η2mm(1ρ0)11)l02(1+2b)2l0(1+2b)].\displaystyle+\Theta([K^{2}\eta^{2}\cdot\frac{m\sqrt{m}}{(1-\rho_{0})^{11}})\cdot l_{0}^{2}(1+2b)^{2}\cdot l_{0}(1+2b)].
𝒪(ϵ).\displaystyle\leq\mathcal{O}(\epsilon).

Meanwhile Rmωω01ρ01.\frac{R}{\sqrt{m}}\leq\omega\leq\omega_{0}\leq\frac{1}{\rho_{0}}-1. Thus

1Kk=1KLk(𝑾k,𝑨k)Lk(𝑾,𝑨)𝒪(ϵ).\frac{1}{K}\sum_{k=1}^{K}L^{k}(\bm{W}_{k},\bm{A}_{k})-L^{k}(\bm{W}^{*},\bm{A}^{*})\leq\mathcal{O}(\epsilon). (34)

Then Theorem 15 follows. \blacksquare

Based on Theorem 15, to prove the main result we need to show that there exists (𝑾,𝑨)(\bm{W}^{*},\bm{A}^{*}) with 𝑨𝑨0F,𝑾𝑾0F𝒪(bTmax2/m)||\bm{A}^{*}-\bm{A}_{0}||_{F},||\bm{W}^{*}-\bm{W}_{0}||_{F}\leq\mathcal{O}(b\cdot T^{2}_{max}/\sqrt{m}) and ft(𝑾,𝑨)y~t||f_{t}(\bm{W}^{*},\bm{A}^{*})-\widetilde{y}_{t}|| small, so that the target lies in the linearization domain. In fact, we have:

Theorem 16

Under the condition in Theorem 4, with probability at least 1δ1-\delta, there exist 𝐖,𝐀\bm{W}^{*},\bm{A}^{*} with

𝑾𝑾0F,𝑨𝑨0F𝒪(bTmax2/m).||\bm{W}^{*}-\bm{W}_{0}||_{F},||\bm{A}^{*}-\bm{A}_{0}||_{F}\leq\mathcal{O}(b\cdot T^{2}_{max}/\sqrt{m}). (35)

and

1Kk=0K11Tt=1T{L(ytk,ft(𝑾~,𝑨,xk))L(ytk,y~tk)}𝒪(ϵ).\frac{1}{K}\sum_{k=0}^{K-1}\frac{1}{T}\sum_{t=1}^{T}\{L(y^{k}_{t},f_{t}(\widetilde{\bm{W}}^{*},\bm{A}^{*},x^{k}))-L(y^{k}_{t},\widetilde{y}^{k}_{t})\}\leq\mathcal{O}(\epsilon).

To prove Theorem 16, assume the target stable system is

pti=𝑪pt1i+𝑫xti,\displaystyle p^{i}_{t}=\bm{C}p^{i}_{t-1}+\bm{D}x^{i}_{t}, (36)
yti~=𝑮pti,\displaystyle\widetilde{y^{i}_{t}}=\bm{G}p^{i}_{t},

We construct

𝑾𝑾0\displaystyle\bm{W}^{*}-\bm{W}_{0}
=\displaystyle= tm=1Tmax(ρ1)tm1t1+t2=tm1{(τ=1t11𝑾0)}T𝑩T𝑷t11{𝑮𝑪tm1𝑫\displaystyle\sum_{t_{m}=1}^{T_{max}}(\rho^{-1})^{t_{m}-1}\sum_{t_{1}+t_{2}=t_{m}-1}\{(\prod_{\tau=1}^{t_{1}-1}\bm{W}_{0})\}^{T}\bm{B}^{T}\bm{P}^{1}_{t_{1}}\{\bm{G}\bm{C}^{t_{m}-1}\bm{D}
(ρ2)tm1/(tm1)𝑩{τ=1tm𝑾0}𝑨0}𝑷2t21𝑨T0(τ=1t2𝑾0),\displaystyle-(\rho^{2})^{t_{m}-1}/(t_{m}-1)\cdot\bm{B}\{\prod_{\tau=1}^{t_{m}}\bm{W}_{0}\}\bm{A}_{0}\}\bm{P}^{2}_{t_{2}-1}\bm{A}^{T}_{0}(\prod_{\tau=1}^{t_{2}}\bm{W}_{0}),
𝑨𝑨0={(τ=1tm1𝑾0)}T𝑩T𝑷tm11[𝑮𝑫𝑩𝑨0],\bm{A}^{*}-\bm{A}_{0}=\{(\prod_{\tau=1}^{t_{m}-1}\bm{W}_{0})\}^{T}\bm{B}^{T}\bm{P}^{1}_{t_{m}-1}[\bm{G}\bm{D}-\bm{B}\bm{A}_{0}],

where

𝑷t11={𝑩(τ=1t11𝑾0)(τ=1t11𝑾0)T𝑩T}1,\displaystyle\bm{P}^{1}_{t_{1}}=\{\bm{B}(\prod_{\tau=1}^{t_{1}-1}\bm{W}_{0})(\prod_{\tau=1}^{t_{1}-1}\bm{W}_{0})^{T}\bm{B}^{T}\}^{-1}, (37)
𝑷t22={𝑨0T(τ=1t21𝑾0)T(τ=1t21𝑾0)𝑨0}1.\displaystyle\bm{P}^{2}_{t_{2}}=\{\bm{A}_{0}^{T}(\prod_{\tau=1}^{t_{2}-1}\bm{W}_{0})^{T}(\prod_{\tau=1}^{t_{2}-1}\bm{W}_{0})\bm{A}_{0}\}^{-1}.

We can show that our 𝑾\bm{W}^{*}, 𝑨\bm{A}^{*} satisfy

||ft(𝑾,𝑨))t0=0t1𝑮(τ=1t0𝑪)𝑫Xtt0||F0,||f_{t}(\bm{W}^{*},\bm{A}^{*}))-\sum_{t_{0}=0}^{t-1}\bm{G}(\prod_{\tau=1}^{t_{0}}\bm{C})\bm{D}X_{t-t_{0}}||_{F}\approx 0,

using Lemma 12. To show 𝑾𝑾0F,𝑨𝑨0F𝒪(bTmax2/m)||\bm{W}^{*}-\bm{W}_{0}||_{F},||\bm{A}^{*}-\bm{A}_{0}||_{F}\leq\mathcal{O}(b\cdot T^{2}_{max}/\sqrt{m}), we need Lemma 11. The detailed proof is in the Supplementary Materials.

Combining the above two theorems, we have

1Kk=0K11Tt=1T{L(ytk,ft(𝑾k,𝑨k,xk))L(ytk,y~tk)}𝒪(ϵ).\frac{1}{K}\sum_{k=0}^{K-1}\frac{1}{T}\sum_{t=1}^{T}\{L(y^{k}_{t},f_{t}(\bm{W}_{k},\bm{A}_{k},x^{k}))-L(y^{k}_{t},\widetilde{y}^{k}_{t})\}\leq\mathcal{O}(\epsilon).

Therefore Theorem 4 follows.

VIII Proof of Theorem 6

In order to prove the bound on the population risk, we need the following results on Rademacher complexity. These results can be found in Proposition A.12 of [22] and Section 3.8 of [23].

Theorem 17

Let 1,2,dy\mathcal{F}_{1},\mathcal{F}_{2},...\mathcal{F}_{d_{y}} be dyd_{y} classes of functions d\mathbb{R}^{d}\to\mathbb{R}, and let LL be a bounded, l0(1+2b)l_{0}(1+2b)-Lipschitz function with |L()|4l0(b+2b2)|L(\cdot)|\leq 4l_{0}\cdot(b+2b^{2}). For NN samples xix_{i} i.i.d. drawn from a distribution 𝒟{\cal D}, with probability at least 1δ1-\delta, we have

supf11,fdydy\displaystyle\sup_{f_{1}\in\mathcal{F}_{1},...f_{d_{y}}\in\mathcal{F}_{d_{y}}} |𝔼x𝒟[L(f1(x),fdy(x))]\displaystyle|\mathbb{E}_{x\sim{\cal D}}[L(f_{1}(x),...f_{d_{y}}(x))] (38)
1Ni=1NL(f1(xi),fdy(xi))|\displaystyle-\frac{1}{N}\sum_{i=1}^{N}L(f_{1}(x_{i}),...f_{d_{y}}(x_{i}))|
Θ(l0(1+2b)r=1dy(r))\displaystyle\leq\Theta(l_{0}(1+2b)\cdot\sum_{r=1}^{d_{y}}\mathscr{R}(\mathcal{F}_{r}))
+44l0(b+2b2)log(1/δ)N.\displaystyle+4\frac{4l_{0}\cdot(b+2b^{2})\sqrt{\log(1/\delta)}}{\sqrt{N}}.

where ()\mathscr{R}(\mathcal{F}) is defined as:

()=𝔼ξ{±1}N[supf1NiNξif(xi)].\mathscr{R}(\mathcal{F})=\mathbb{E}_{\xi\sim\{\pm 1\}^{N}}[\sup_{f\in\mathcal{F}}\frac{1}{N}\sum_{i}^{N}\xi_{i}f(x_{i})]. (39)

Meanwhile, for linear functions, the Rademacher complexity is easy to calculate:

Theorem 18

(Proposition A.12 in [22])

Let \mathcal{F} be the function class {xf0(x)+w,x}\{x\to f_{0}(x)+\langle w,x\rangle\in\mathbb{R}\} with f0f_{0} a fixed function and wB||w||\leq B. Then

()BN.\mathscr{R}(\mathcal{F})\leq\frac{B}{\sqrt{N}}. (40)
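For intuition, the standard one-line computation behind Theorem 18 is as follows (ours; the fixed part f0f_{0} contributes zero in expectation since 𝔼ξi=0\mathbb{E}\xi_{i}=0, and we additionally use xi1||x_{i}||\leq 1, which holds here up to constants because xti𝒪(1)||x^{i}_{t}||\leq\mathcal{O}(1)):

```latex
\[
\mathscr{R}(\mathcal{F})
  = \mathbb{E}_{\xi}\Big[\sup_{\|w\|\le B}\frac{1}{N}\sum_{i=1}^{N}\xi_{i}\langle w, x_{i}\rangle\Big]
  = \frac{B}{N}\,\mathbb{E}_{\xi}\Big\|\sum_{i=1}^{N}\xi_{i} x_{i}\Big\|
  \le \frac{B}{N}\Big(\mathbb{E}_{\xi}\Big\|\sum_{i=1}^{N}\xi_{i} x_{i}\Big\|^{2}\Big)^{1/2}
  = \frac{B}{N}\Big(\sum_{i=1}^{N}\|x_{i}\|^{2}\Big)^{1/2}
  \le \frac{B}{\sqrt{N}}.
\]
```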

Now we will estimate the gap between the empirical and population loss:

|𝔼x,y𝒟1Tt=1T{L(yt,ft(𝑾k,𝑨k,x))L(yt,y~t)}\displaystyle|\mathbb{E}_{x,y\sim{\cal D}}\frac{1}{T}\sum_{t=1}^{T}\{L(y_{t},f_{t}(\bm{W}_{k},\bm{A}_{k},x))-L(y_{t},\widetilde{y}_{t})\}
1Kk=0K11Tt=1T{L(ytk,ft(𝑾k,𝑨k),xk)L(ytk,y~tk)}|.\displaystyle-\frac{1}{K}\sum_{k=0}^{K-1}\frac{1}{T}\sum_{t=1}^{T}\{L(y^{k}_{t},f_{t}(\bm{W}_{k},\bm{A}_{k}),x^{k})-L(y^{k}_{t},\widetilde{y}^{k}_{t})\}|.

Firstly, from Eq (10) and (26),

|L(yt,ft(𝑾k,𝑨k))\displaystyle|L(y_{t},f_{t}(\bm{W}_{k},\bm{A}_{k})) L(yt,ftlin,Tmax(𝑾k,𝑨k))|𝒪(ϵ),\displaystyle-L(y_{t},f^{lin,T_{max}}_{t}(\bm{W}_{k},\bm{A}_{k}))|\leq\mathcal{O}(\epsilon),

and for all kKk\leq K,

𝑾k𝑾0F,𝑨k𝑨0Fω,||\bm{W}_{k}-\bm{W}_{0}||_{F},||\bm{A}_{k}-\bm{A}_{0}||_{F}\leq\omega,

where ω=𝒪(Kηm(1ρ0)3l0(1+2b))\omega=\mathcal{O}(\frac{K\eta\sqrt{m}}{(1-\rho_{0})^{3}}\cdot l_{0}(1+2b)). Let eje_{j} be the jj-th standard basis vector of dy\mathbb{R}^{d_{y}}. We only need to consider the Rademacher complexity of the function class:

{\displaystyle\{ xejTftlin,Tmax(𝑾0+Δ𝑾,𝑨0+Δ𝑨,x)|\displaystyle x\to e_{j}^{T}f^{lin,T_{max}}_{t}(\bm{W}_{0}+\Delta\bm{W},\bm{A}_{0}+\Delta\bm{A},x)| (41)
||Δ𝑾||F,||Δ𝑨||Fω}.\displaystyle\ ||\Delta\bm{W}||_{F},||\Delta\bm{A}||_{F}\leq\omega\}.

This is a class of linear functions, which we can write as xF0,x+F,xx\to\langle F_{0},x\rangle+\langle F,x\rangle with F0F_{0} fixed. Applying Lemma 9, we have

F232m(1ρ0)3ω\displaystyle||F||_{2}\leq 32\frac{\sqrt{m}}{(1-\rho_{0})^{3}}\omega (42)
[32m(1ρ0)3][Kη32m(1ρ0)3l0(1+2b)].\displaystyle\leq[32\frac{\sqrt{m}}{(1-\rho_{0})^{3}}]\cdot[K\eta\cdot 32\frac{\sqrt{m}}{(1-\rho_{0})^{3}}\cdot l_{0}(1+2b)].

Then, combining all the above results with Theorem 17 and Theorem 18, we have

|𝔼x,y𝒟1Tt=1T{L(yt,ft(𝑾k,𝑨k,x))L(yt,y~t)}\displaystyle|\mathbb{E}_{x,y\sim{\cal D}}\frac{1}{T}\sum_{t=1}^{T}\{L(y_{t},f_{t}(\bm{W}_{k},\bm{A}_{k},x))-L(y_{t},\widetilde{y}_{t})\} (43)
1Kk=0K11Tt=1T{L(ytk,ft(𝑾k,𝑨k),xk)L(ytk,y~tk)}|\displaystyle-\frac{1}{K}\sum_{k=0}^{K-1}\frac{1}{T}\sum_{t=1}^{T}\{L(y^{k}_{t},f_{t}(\bm{W}_{k},\bm{A}_{k}),x^{k})-L(y^{k}_{t},\widetilde{y}^{k}_{t})\}|
Θ(l0(1+2b)[m(1ρ0)3])\displaystyle\leq\Theta(l_{0}(1+2b)[\frac{\sqrt{m}}{(1-\rho_{0})^{3}}])
Θ([Kηm(1ρ0)3l0(1+2b)]/K)\displaystyle\cdot\Theta([K\eta\cdot\frac{\sqrt{m}}{(1-\rho_{0})^{3}}\cdot l_{0}(1+2b)]/\sqrt{K})
+𝒪(l0(b+2b2)log(1/δ)K)\displaystyle+\mathcal{O}(\frac{l_{0}\cdot(b+2b^{2})\sqrt{\log(1/\delta)}}{\sqrt{K}})
𝒪(l02(1+2b)2ηmK(1ρ0)6)\displaystyle\leq\mathcal{O}(l_{0}^{2}(1+2b)^{2}\frac{\eta m\sqrt{K}}{(1-\rho_{0})^{6}})
+𝒪(l0(b+2b2)log(1/δ)K)\displaystyle+\mathcal{O}(\frac{l_{0}\cdot(b+2b^{2})\sqrt{\log(1/\delta)}}{\sqrt{K}})
𝒪(ϵ).\displaystyle\leq\mathcal{O}(\epsilon).

with probability at least 1δ1-\delta. Our claim follows.

Acknowledgement

This research was funded by the National Natural Science Foundation of China (U1936215), project “Methodologies and Key Technologies of Intelligent Detection and Tracing of APT Attacks”.

IX Conclusion

We provided the first theoretical guarantee on learning linear RNNs with gradient descent. The required width of the hidden layer does not rely on the length of the input sequence and only depends on the transition matrix parameter ρC\rho_{C}. Under this condition, we showed that SGD can provably learn any stable linear system with transition matrix 𝑪\bm{C} satisfying 𝑪k𝒪(ρCk),k||\bm{C}^{k}||\leq\mathcal{O}(\rho_{C}^{k}),k\in\mathbb{N}, using poly(11ρC,ϵ1)poly(\frac{1}{1-\rho_{C}},\epsilon^{-1}) iterations and poly(11ρC,ϵ1)poly(\frac{1}{1-\rho_{C}},\epsilon^{-1}) samples. In this work we found a suitable random initialization under which gradient descent provably succeeds. This solves an open problem in system identification and helps explain why SGD can optimize RNNs in practice. We hope this result can provide some insights for learning stable nonlinear dynamic systems with recurrent neural networks in deep learning.

References

  • [1] H. T. Siegelmann and E. D. Sontag, “On the computational power of neural nets - sciencedirect,” Journal of Computer and System Sciences, vol. 50, no. 1, pp. 132–150, 1995.
  • [2] A. M. Saxe, J. L. Mcclelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” Computer Science, 2014.
  • [3] S. S. Du and W. Hu, “Width provably matters in optimization for deep linear neural networks,” International conference on machine learning, 2019.
  • [4] D. Zou, P. M. Long, and Q. Gu, “On the global convergence of training deep linear resnets,” International conference on learning representations, 2020.
  • [5] K. Kawaguchi, “Deep learning without poor local minima,” 2016.
  • [6] D. Belanger and S. M. Kakade, “A linear dynamical system model for text,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, ser. JMLR Workshop and Conference Proceedings, F. R. Bach and D. M. Blei, Eds., vol. 37.   JMLR.org, 2015, pp. 833–842. [Online]. Available: http://proceedings.mlr.press/v37/belanger15.html
  • [7] M. Hardt, T. Ma, and B. Recht, “Gradient descent learns linear dynamical systems,” Journal of Machine Learning Research, vol. 19, 2016.
  • [8] E. Hazan, H. Lee, K. Singh, C. Zhang, and Y. Zhang, “Spectral filtering for general linear dynamical systems,” in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31.   Curran Associates, Inc., 2018.
  • [9] S. Roweis and Z. Ghahramani, “A unifying review of linear gaussian models,” Neural Computation, 1999.
  • [10] Z. Allen-Zhu and Y. Li, “Can sgd learn recurrent neural networks with provable generalization?” in Advances in Neural Information Processing Systems, vol. 32.   Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper/2019/file/67fe0f66449e31fdafdc3505c37d6acb-Paper.pdf
  • [11] L. Wang, B. Shen, B. Hu, and X. Cao, “On the provable generalization of recurrent neural networks,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan, Eds., 2021, pp. 20 258–20 269. [Online]. Available: https://proceedings.neurips.cc/paper/2021/hash/a928731e103dfc64c0027fa84709689e-Abstract.html
  • [12] D. A. Dowler, “Bounding the norm of matrix powers,” 2013.
  • [13] S. S. Du, X. Zhai, B. Poczos, and A. Singh, “Gradient descent provably optimizes over-parameterized neural networks,” in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.   OpenReview.net, 2019. [Online]. Available: https://openreview.net/forum?id=S1eK3i09YQ
  • [14] Z. Allen-Zhu, Y. Li, and Z. Song, “A convergence theory for deep learning via over-parameterization,” in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97.   PMLR, 2019, pp. 242–252. [Online]. Available: http://proceedings.mlr.press/v97/allen-zhu19a.html
  • [15] Y. Cao and Q. Gu, “Generalization bounds of stochastic gradient descent for wide and deep neural networks,” in Advances in Neural Information Processing Systems, vol. 32.   Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper/2019/file/cf9dc5e4e194fc21f397b4cac9cc3ae9-Paper.pdf
  • [16] Z. Allen-Zhu, Y. Li, and Z. Song, “On the convergence rate of training recurrent neural networks,” in Advances in Neural Information Processing Systems, vol. 32.   Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper/2019/file/0ee8b85a85a49346fdff9665312a5cc4-Paper.pdf
  • [17] Y. Tian, “Symmetry-breaking convergence analysis of certain two-layered neural networks with relu nonlinearity,” International conference on learning representations, 2017.
  • [18] S. S. Du, J. D. Lee, and Y. Tian, “When is a convolutional filter easy to learn,” International conference on machine learning, 2018.
  • [19] I. Safran and O. Shamir, “Spurious local minima are common in two-layer relu neural networks,” International conference on machine learning, pp. 4430–4438, 2018.
  • [20] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of Basic Engineering, 1960.
  • [21] E. Hazan, K. Singh, and C. Zhang, “Learning linear dynamical systems via spectral filtering,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30.   Curran Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper/2017/file/165a59f7cf3b5c4396ba65953d679f17-Paper.pdf
  • [22] Z. Allen-Zhu, Y. Li, and Y. Liang, “Learning and generalization in overparameterized neural networks, going beyond two layers,” in Advances in Neural Information Processing Systems, vol. 32.   Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper/2019/file/62dad6e273d32235ae02b7d321578ee8-Paper.pdf
  • [23] P. Liang., “Statistical learning theory (winter 2016),” 2016. [Online]. Available: https://web.stanford.edu/class/cs229t/notes.pdf
  • [24] R. Vershynin, “Introduction to the non-asymptotic analysis of random matrices,” 2011.

Lemma

The following concentration inequality is a standard result for subexponential distributions and will be used many times.

Lemma 19

Standard Concentration inequality for Chi-square distribution:

Let vv\in\mathbb{N} and let Aiχ2(v)A_{i}\sim\chi^{2}(v), i=1,2,,mi=1,2,...,m, be mm i.i.d. chi-square random variables. Then with probability at least 1exp(Ω(mϵ))1-exp(-\Omega(m\epsilon)):

|i=1m1mAi𝔼χ2(v)|ϵ.|\sum_{i=1}^{m}\frac{1}{m}A_{i}-\mathbb{E}\chi^{2}(v)|\leq\epsilon.

As a corollary, let 𝑨m×n\bm{A}\in\mathbb{R}^{m\times n} be a Gaussian random matrix with entries Ai,j𝒩(0,1m)A_{i,j}\sim\mathcal{N}(0,\frac{1}{m}). Then for a fixed vector vv with v=1||v||=1, with probability at least 1exp(Ω(mϵ2))1-exp(-\Omega(m\epsilon^{2})),

|𝑨v1|ϵ.|\ ||\bm{A}v||-1|\leq\epsilon.
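A quick numerical illustration (ours, with arbitrary sizes) of this corollary: for a fixed unit vector vv, 𝑨v||\bm{A}v|| concentrates around 11 with fluctuations of order 1/m1/\sqrt{m}.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, trials = 4096, 32, 200
v = rng.normal(size=n)
v /= np.linalg.norm(v)

# Redraw A each trial; entries are i.i.d. N(0, 1/m), so E||Av||^2 = 1.
norms = [np.linalg.norm(rng.normal(scale=1 / np.sqrt(m), size=(m, n)) @ v)
         for _ in range(trials)]
print(f"mean ||Av|| = {np.mean(norms):.4f}, "
      f"max deviation from 1 = {max(abs(t - 1.0) for t in norms):.4f}")
```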

Proof of Lemma 9

Firstly,

t=τρt𝑩(τ=1t𝑾)𝑸Ztt=τρt𝑩(τ=1t𝑾)𝑸Zt.\displaystyle||\sum_{t=\tau}^{\infty}\rho^{t}\cdot\bm{B}(\prod_{\tau=1}^{t}\bm{W})\bm{Q}Z_{t}||\leq\sum_{t=\tau}^{\infty}\rho^{t}||\bm{B}(\prod_{\tau=1}^{t}\bm{W})\bm{Q}Z_{t}||. (44)

Meanwhile, thanks to (18), for all kk in \mathbb{N}, ρk𝑾k2kρ0k||\rho^{k}\bm{W}^{k}||\leq 2\sqrt{k}\rho_{0}^{k},

ρk𝑩(τ=1k𝑾)𝑸Zt4mkρ0k𝑸Zt.\rho^{k}||\bm{B}(\prod_{\tau=1}^{k}\bm{W})\bm{Q}Z_{t}||\leq 4\sqrt{m}\sqrt{k}\cdot\rho_{0}^{k}||\bm{Q}||\cdot||Z_{t}||.

Thus

t=τρt𝑩(τ=1t𝑾)𝑸Zt\displaystyle||\sum_{t=\tau}^{\infty}\rho^{t}\cdot\bm{B}(\prod_{\tau=1}^{t}\bm{W})\bm{Q}Z_{t}|| (45)
4τm(ρ0)τ(1ρ0)2𝑸\displaystyle\leq 4\frac{\tau\sqrt{m}(\rho_{0})^{\tau}}{(1-\rho_{0})^{2}}||\bm{Q}||

The first part of Lemma 9 follows.

Now we consider

t0=τt1+t2=t0ρt0𝑩(τ=1t11𝑾)𝑸(τ=1t21𝑾)𝑨Zt0.||\sum_{t_{0}=\tau}^{\infty}\sum_{t_{1}+t_{2}=t_{0}}\rho^{t_{0}}\bm{B}(\prod_{\tau=1}^{t_{1}-1}\bm{W})\bm{Q}(\prod_{\tau=1}^{t_{2}-1}\bm{W})\bm{A}Z_{t_{0}}||. (46)

Note that

ρt0𝑩(τ=1t11𝑾)𝑸(τ=1t21𝑾)𝑨Zt0\displaystyle||\rho^{t_{0}}\bm{B}(\prod_{\tau=1}^{t_{1}-1}\bm{W})\bm{Q}(\prod_{\tau=1}^{t_{2}-1}\bm{W})\bm{A}Z_{t_{0}}|| (47)
8mρ0t0t0𝑸.\displaystyle\leq 8\sqrt{m}\rho_{0}^{t_{0}}t_{0}||\bm{Q}||.

Thus

||t0=τt1+t2=t0\displaystyle||\sum_{t_{0}=\tau}^{\infty}\sum_{t_{1}+t_{2}=t_{0}} ρt0𝑩(τ=1t11𝑾)𝑸(τ=1t21𝑾)𝑨Zt0||\displaystyle\rho^{t_{0}}\bm{B}(\prod_{\tau=1}^{t_{1}-1}\bm{W})\bm{Q}(\prod_{\tau=1}^{t_{2}-1}\bm{W})\bm{A}Z_{t_{0}}|| (48)
32m(ρ0)ττ2(1ρ0)3𝑸.\displaystyle\leq 32\frac{\sqrt{m}(\rho_{0})^{\tau}\tau^{2}}{(1-\rho_{0})^{3}}||\bm{Q}||.

Our results follow.

\blacksquare

Proof of Lemma 10

For a fixed v2v_{2}, with probability at least 1exp(Ω(mϵ2))1-exp(-\Omega(m\epsilon^{2})), 1ϵ𝑨0v21+ϵ1-\epsilon\leq||\bm{A}_{0}v_{2}||\leq 1+\epsilon.

Taking an ϵnet\epsilon-net in d\mathbb{R}^{d}, there is a set 𝒩\mathcal{N} with |𝒩|(3/ϵ)d|\mathcal{N}|\leq(3/\epsilon)^{d} such that for any vector vv in d\mathbb{R}^{d} with v=1||v||=1, there is a vector v𝒩v^{\prime}\in\mathcal{N} satisfying vvϵ||v-v^{\prime}||\leq\epsilon. Thus with probability at least 1(3/ϵ)dexp(Ω(mϵ2))=1exp(Ω(mϵ2))1-(3/\epsilon)^{d}exp(-\Omega(m\epsilon^{2}))=1-exp(-\Omega(m\epsilon^{2})), for any vector v2v_{2} in d\mathbb{R}^{d} with v2=1,||v_{2}||=1,

1ϵ(1+ϵ)ϵ𝑨0v2(1+ϵ)2.1-\epsilon-(1+\epsilon)\epsilon\leq||\bm{A}_{0}v_{2}||\leq(1+\epsilon)^{2}. (49)

Thus

12ϵϵ2𝑨0v21+2ϵ+ϵ2.1-2\epsilon-\epsilon^{2}\leq||\bm{A}_{0}v_{2}||\leq 1+2\epsilon+\epsilon^{2}. (50)

This is the same for 𝑾0𝑨0v2||\bm{W}_{0}\bm{A}_{0}v_{2}||. With probability at least 1exp(Ω(mϵ2))1-exp(-\Omega(m\epsilon^{2}))

1ϵ𝑾0𝑨0v21+ϵ.1-\epsilon\leq||\bm{W}_{0}\bm{A}_{0}v_{2}||\leq 1+\epsilon. (51)

Let ht=t0=1t𝑾0𝑨0v2h_{t}=\prod_{t_{0}=1}^{t}\bm{W}_{0}\bm{A}_{0}v_{2}. When t>1t>1, we consider the Gram-Schmidt orthogonalization.

Let 𝑼t\bm{U}_{t} be the Gram-Schmidt orthogonalization matrix as:

𝑼t=GS(h0,h1,ht).\bm{U}_{t}=GS(h_{0},h_{1},...h_{t}). (52)

We have

ht=\displaystyle h_{t}= 𝑾0ht1=𝑾0𝑼t2𝑼t2Tht1\displaystyle\bm{W}_{0}h_{t-1}=\bm{W}_{0}\bm{U}_{t-2}\bm{U}_{t-2}^{T}h_{t-1} (53)
+𝑾0[𝑰𝑼t2𝑼t2T]ht1,\displaystyle+\bm{W}_{0}[\bm{I}-\bm{U}_{t-2}\bm{U}_{t-2}^{T}]h_{t-1},
=\displaystyle= [𝑾0𝑼t2,𝑾0[𝑰𝑼t2𝑼t2T]ht1[𝑰𝑼t2𝑼t2T]ht1]\displaystyle\begin{bmatrix}\bm{W}_{0}\bm{U}_{t-2},&\frac{\bm{W}_{0}[\bm{I}-\bm{U}_{t-2}\bm{U}_{t-2}^{T}]h_{t-1}}{||[\bm{I}-\bm{U}_{t-2}\bm{U}_{t-2}^{T}]h_{t-1}||}\end{bmatrix}
[𝑼t2Tht1[𝑰𝑼t2𝑼t2T]ht1],\displaystyle\cdot\begin{bmatrix}\bm{U}_{t-2}^{T}h_{t-1}\\ ||[\bm{I}-\bm{U}_{t-2}\bm{U}_{t-2}^{T}]h_{t-1}||\end{bmatrix},
=\displaystyle= [M1,M2][z1z2].\displaystyle\begin{bmatrix}M_{1},&M_{2}\end{bmatrix}\cdot\begin{bmatrix}z_{1}\\ z_{2}\end{bmatrix}.

Using this expression, the entries of M1=𝑾0𝑼t2M_{1}=\bm{W}_{0}\bm{U}_{t-2} and M2=𝑾0[𝑰𝑼t2𝑼t2T]ht1[𝑰𝑼t2𝑼t2T]ht1M_{2}=\frac{\bm{W}_{0}[\bm{I}-\bm{U}_{t-2}\bm{U}_{t-2}^{T}]h_{t-1}}{||[\bm{I}-\bm{U}_{t-2}\bm{U}_{t-2}^{T}]h_{t-1}||} are i.i.d from 𝒩(0,1m)\mathcal{N}(0,\frac{1}{m}). Therefore for fixed z1,z2z_{1},z_{2}, with probability at least 1exp(Ω(mϵ2))1-exp(-\Omega(m\epsilon^{2})),

(1-\epsilon)\sqrt{||z_{1}||^{2}+z_{2}^{2}}\leq||h_{t}||\leq(1+\epsilon)\sqrt{||z_{1}||^{2}+z_{2}^{2}}.

Take \epsilon-nets for z_{1} and z_{2}. The sizes of the \epsilon-nets for z_{1} and z_{2} are at most (3/\epsilon)^{d\cdot(t+1)} and (3/\epsilon)^{1}. Thus with probability at least 1-(3/\epsilon)^{d(t+1)+1}exp(-\Omega(m\epsilon^{2}))=1-exp(-\Omega(m\epsilon^{2})),

(1-\epsilon)\sqrt{||z_{1}||^{2}+z_{2}^{2}}\leq||h_{t}||\leq(1+\epsilon)\sqrt{||z_{1}||^{2}+z_{2}^{2}}

for any z1,z2z_{1},z_{2}.

Therefore we have

(1ϵ)tt0=1t𝑾0𝑨0v2(1+ϵ)t.(1-\epsilon)^{t}\leq||\prod_{t_{0}=1}^{t}\bm{W}_{0}\bm{A}_{0}v_{2}||\leq(1+\epsilon)^{t}. (54)

for any 0\leq t\leq\tau. Set \epsilon=\frac{0.01}{\tau}. The statement for ||\bm{W}^{t}_{0}\bm{A}_{0}v_{2}|| follows. For ||\frac{\sqrt{d_{y}}}{\sqrt{m}}(\bm{W}^{T}_{0})^{t}\bm{B}^{T}v_{1}||, the proof is the same.

This proves Lemma 10.
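As an illustration (not part of the proof), the following minimal Python sketch checks the conclusion of Lemma 10 numerically: with the entries of \bm{W}_{0} and \bm{A}_{0} drawn i.i.d. from N(0, 1/m), the norms ||\bm{W}_{0}^{t}\bm{A}_{0}v|| remain close to 1 for all t\leq\tau once m is much larger than \tau^{2}. The values of m, d, \tau, and the seed are arbitrary choices for this sketch.

import numpy as np

# Numerical sketch (illustration only) of Lemma 10: with W0 in R^{m x m} and
# A0 in R^{m x d} having i.i.d. N(0, 1/m) entries, the norms ||W0^t A0 v||
# stay close to 1 for all t <= tau once m >> tau^2.  Values are arbitrary.
rng = np.random.default_rng(1)
m, d, tau = 2000, 5, 8
W0 = rng.standard_normal((m, m)) / np.sqrt(m)
A0 = rng.standard_normal((m, d)) / np.sqrt(m)
v = rng.standard_normal(d)
v /= np.linalg.norm(v)

h = A0 @ v                                   # h_0 = A0 v, norm close to 1
for t in range(1, tau + 1):
    h = W0 @ h                               # h_t = W0^t A0 v
    print(t, np.linalg.norm(h))              # remains within a few percent of 1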

Proof of Lemma 8

Now we will prove Lemma 8. Similar to (54), for fixed vmv\in\mathbb{R}^{m}, let L=Θ(mlogm)L=\Theta(\frac{\sqrt{m}}{\log m})\in\mathbb{N} . With probability at least 1Lexp(Ω(m/L2))1-Lexp(-\Omega(m/L^{2})), for all 0<tL0<t\leq L, we have

(11L)tvt0=1t𝑾0v(1+1L)tv.(1-\frac{1}{L})^{t}||v||\leq||\prod_{t_{0}=1}^{t}\bm{W}_{0}v||\leq(1+\frac{1}{L})^{t}||v||. (55)

Then for a fixed orthonormal basis v_{i}, i=1,2,\dots,m, with probability at least 1-Lm\cdot exp(-\Omega(m/L^{2})), for all 0<t\leq L and i=1,2,\dots,m,

𝑾0tvi(1+1/L)tvi.||\bm{W}^{t}_{0}v_{i}||\leq(1+1/L)^{t}||v_{i}||. (56)

Any v\in\mathbb{R}^{m} can be written as v=\sum_{i=1}^{m}a_{i}v_{i}. When ||v||=1, we have \sum_{i}|a_{i}|\leq\sqrt{m} by the Cauchy-Schwarz inequality.

Therefore with probability at least 1-Lm\cdot exp(-\Omega(m/L^{2}))=1-exp(-\Omega(m/L^{2})), for all 0<t\leq L and any v\in\mathbb{R}^{m},

𝑾0tvm(1+1/L)tv.||\bm{W}^{t}_{0}v||\leq\sqrt{m}(1+1/L)^{t}||v||. (57)

Since L=Θ(mlogm)L=\Theta(\frac{\sqrt{m}}{\log m}), L2>mL^{2}>\sqrt{m}, we have

𝑾0tL2(1+1/L)t.||\bm{W}^{t}_{0}||\leq L^{2}(1+1/L)^{t}. (58)

Note that 1/\rho_{1}=1+10\cdot\frac{\log^{2}m}{\sqrt{m}}>1+\log L\cdot\frac{\log m}{\sqrt{m}}, so we can set L=\Theta(\frac{\sqrt{m}}{\log m})\in\mathbb{N} larger than an absolute constant such that

𝑾0L\displaystyle||\bm{W}^{L}_{0}|| L2(1+1/L)Le2logLe\displaystyle\leq L^{2}(1+1/L)^{L}\leq e^{2\log L}\cdot e (59)
=e1+2logL(1+10log2mm)(mlog2m)100logL\displaystyle=e^{1+2\log L}\leq(1+10\cdot\frac{\log^{2}m}{\sqrt{m}})^{(\frac{\sqrt{m}}{\log^{2}m})\cdot 100\cdot\log L}
(1/ρ1)(mlogm)100logLlogm\displaystyle\leq(1/\rho_{1})^{(\frac{\sqrt{m}}{\log m})\cdot 100\cdot\frac{\log L}{\log m}}
(1/ρ1)L/2.\displaystyle\leq(1/\rho_{1})^{L/2}.

And for kLk\leq L

𝑾0k\displaystyle||\bm{W}^{k}_{0}|| L2(1+1/L)L(1/ρ1)L/2.\displaystyle\leq L^{2}(1+1/L)^{L}\leq(1/\rho_{1})^{L/2}. (60)

If s>Ls>L, we can write it as s=kL+rs=k\cdot L+r and k,rk,r\in\mathbb{N}, rLr\leq L. Then

𝑾0s𝑾0Lk𝑾0rρ1Lk/2ρ1L/2ρ1s.||\bm{W}^{s}_{0}||\leq||\bm{W}^{L}_{0}||^{k}\cdot||\bm{W}^{r}_{0}||\leq\rho_{1}^{-Lk/2}\cdot\rho_{1}^{-L/2}\leq\rho_{1}^{-s}. (61)

Thus we have

𝑾0sρ1s.||\bm{W}^{s}_{0}||\leq\rho_{1}^{-s}. (62)

For k\leq 2L, note that any unit vector v\in\mathbb{R}^{m} can be written as v=\sum_{i=1}^{k}u_{i}, where each u_{i} has at most 2m/k non-zero coordinates and the supports of u_{i} and u_{j} are disjoint when i\neq j. In the space \mathbb{R}^{2m/k}, an \epsilon-net argument shows that with probability at least 1-exp(-\Omega(m/k)), for any v\in\mathbb{R}^{2m/k},

W0kv2v.||W_{0}^{k}v||\leq 2||v||.

Thus with probability at least 1-4L^{2}exp(-\Omega(m/L)), for all k\leq 2L and any unit vector v\in\mathbb{R}^{m}, ||W_{0}^{k}v||\leq\sum_{i=1}^{k}||W_{0}^{k}u_{i}||\leq 2\sum_{i=1}^{k}||u_{i}||\leq 2\sqrt{k}, and hence ||W_{0}^{k}||\leq 2\sqrt{k}.
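As an illustration (not part of the proof), the following minimal Python sketch observes how the operator norms of powers of a random \bm{W}_{0} with i.i.d. N(0, 1/m) entries grow with k, alongside the reference rate 2\sqrt{k} from Lemma 8. The matrix size and the range of k are arbitrary choices for this sketch.

import numpy as np

# Numerical sketch (illustration only) of Lemma 8: for W0 with i.i.d.
# N(0, 1/m) entries, the operator norms ||W0^k|| grow roughly like sqrt(k);
# they are printed next to the reference value 2*sqrt(k).  Values are arbitrary.
rng = np.random.default_rng(2)
m = 800
W0 = rng.standard_normal((m, m)) / np.sqrt(m)

P = np.eye(m)
for k in range(1, 11):
    P = P @ W0                                           # P = W0^k
    print(k, np.linalg.norm(P, 2), 2 * np.sqrt(k))       # observed norm alongside 2*sqrt(k)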

Proof of Lemma 11

Note that for a fixed v\in\mathbb{R}^{m}, each entry of \bm{B}v is a Gaussian random variable with variance ||v||^{2}. Meanwhile, let v_{i}, i=1,\dots,d, be an orthonormal basis of \mathbb{R}^{d}. We only need to bound the \tau\cdot d vectors \bm{B}\bm{W}^{t}_{0}\bm{A}_{0}v_{i}.

Thus for any tτt\leq\tau, and any viv_{i}, with probability at least 1(τd)exp(Ω(mϵ2))1-(\tau\cdot d)\cdot exp(-\Omega(m\epsilon^{2})), we have

t0=1t𝑾0𝑨0vi(1+ϵ)t.||\prod_{t_{0}=1}^{t}\bm{W}_{0}\bm{A}_{0}v_{i}||\leq(1+\epsilon)^{t}. (64)

Thus if m>\Omega(\log(\tau\cdot d/\delta)), with probability at least 1-\delta,

||\bm{B}\bm{W}^{t}_{0}\bm{A}_{0}||\leq\sqrt{d\log(\tau\cdot d/\delta)}. (65)

Lemma 11 follows.

Proof of Lemma 12

Consider

|u^{T}\bm{A}^{T}\{(\prod_{\tau=1}^{t}\bm{W}_{0})^{T}\}\{(\prod_{\tau=1}^{t^{\prime}}\bm{W}_{0})\}\bm{A}v|, (66)

with ||u||=||v||=1. Before we study the properties of this quantity, we need some very useful results from random matrix theory.

Lemma 20

(Lemma 5.36 in [24]) Consider a matrix 𝐁\bm{B} and δ>0\delta>0. Let smin(𝐁),smax(𝐁)s_{min}(\bm{B}),s_{max}(\bm{B}) be the smallest and the largest singular values of BB which satisfy

1δsmin(𝑩)smax(𝑩)1+δ.1-\delta\leq s_{min}(\bm{B})\leq s_{max}(\bm{B})\leq 1+\delta. (67)

Then

𝑩T𝑩𝑰3max(δ,δ2).||\bm{B}^{T}\bm{B}-\bm{I}||\leq 3\max(\delta,\delta^{2}). (68)
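Since Lemma 20 is a deterministic statement about singular values, it can be checked directly. The following minimal Python sketch (an illustration, not part of the proof) builds a matrix with prescribed singular values in [1-\delta,1+\delta] and verifies ||\bm{B}^{T}\bm{B}-\bm{I}||\leq 3\max(\delta,\delta^{2}); the dimensions and \delta are arbitrary choices for this sketch.

import numpy as np

# Numerical sketch (illustration only) of Lemma 20: construct B with singular
# values inside [1 - delta, 1 + delta] and check ||B^T B - I|| <= 3*max(delta, delta^2).
# Dimensions, delta, and the seed are arbitrary choices for this sketch.
rng = np.random.default_rng(3)
n, p, delta = 50, 20, 0.1
U, _ = np.linalg.qr(rng.standard_normal((n, p)))       # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((p, p)))       # orthogonal matrix
s = rng.uniform(1 - delta, 1 + delta, size=p)          # singular values in [1-delta, 1+delta]
B = U @ np.diag(s) @ V.T

lhs = np.linalg.norm(B.T @ B - np.eye(p), 2)
print(lhs, 3 * max(delta, delta ** 2))                  # lhs stays below 3*max(delta, delta^2)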

Combining this lemma with (51), we have the following result.

Lemma 21

For any tτt\leq\tau, let m1/2>τm^{1/2}>\tau, 𝐅=𝐖t𝐀\bm{F}=\bm{W}^{t}\bm{A}. With probability at least 1exp(Ω(log2m))1-\exp(-\Omega(\log^{2}{m})), we have

𝑭T𝑭𝑰logm/m.||\bm{F}^{T}\bm{F}-\bm{I}||\leq\log m/\sqrt{m}. (69)

Meanwhile, let xdx\in\mathbb{R}^{d}, y=𝑨xmy=\bm{A}x\in\mathbb{R}^{m}. We have:

yT𝑾0y=𝒘T𝑨x,y^{T}\bm{W}_{0}y=\bm{w^{\prime}}^{T}\bm{A}x, (70)

where w^{\prime}=\bm{W}_{0}^{T}y\in\mathbb{R}^{m} is a random vector whose entries are i.i.d. drawn from N(0,\frac{||y||^{2}}{m}). Thus with probability at least 1-exp(-\Omega(\log^{2}m)), we have

|yT𝑾0y|logm/my2.|y^{T}\bm{W}_{0}y|\leq\log m/\sqrt{m}\cdot||y||^{2}. (71)
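As an illustration (not part of the proof), the following minimal Python sketch checks (71) numerically: for \bm{W}_{0} with i.i.d. N(0, 1/m) entries and a fixed vector y, the quadratic form |y^{T}\bm{W}_{0}y| is of order ||y||^{2}/\sqrt{m}, comfortably below the bound (\log m/\sqrt{m})||y||^{2}. The values of m, the number of trials, and the seed are arbitrary choices for this sketch.

import numpy as np

# Numerical sketch (illustration only) of (71): for W0 with i.i.d. N(0, 1/m)
# entries and a fixed y, |y^T W0 y| is of order ||y||^2 / sqrt(m), which is
# below (log m / sqrt(m)) * ||y||^2.  Values below are arbitrary choices.
rng = np.random.default_rng(4)
m = 1000
y = rng.standard_normal(m)

worst = 0.0
for _ in range(100):
    W0 = rng.standard_normal((m, m)) / np.sqrt(m)
    worst = max(worst, abs(y @ W0 @ y) / (y @ y))
print(worst, np.log(m) / np.sqrt(m))          # observed maximum vs. the bound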

Now let i,j\in[d] be two indices and let e_{1},e_{2},\dots,e_{d} be an orthonormal basis of \mathbb{R}^{d}. We can write (66) as a linear combination of the terms below.

|eiT𝑨T{(τ=1t𝑾0)T}{(τ=1t𝑾0)}𝑨ej|,|e_{i}^{T}\bm{A}^{T}\{(\prod_{\tau=1}^{t}\bm{W}_{0})^{T}\}\{(\prod_{\tau=1}^{t^{\prime}}\bm{W}_{0})\}\bm{A}e_{j}|, (72)

For any iji\neq j, we set

zi,0=𝑨ei,zj,0=𝑨ej.z_{i,0}=\bm{A}e_{i},z_{j,0}=\bm{A}e_{j}. (73)

And

zi,t={τ=1t𝑾0}𝑨ei.z_{i,t}=\{\prod_{\tau=1}^{t}\bm{W}_{0}\}\bm{A}e_{i}. (74)

As shown in the last section, when t\leq\tau, with probability at least 1-\tau\cdot exp(-\Omega(m/\tau^{2})),

zi,t(1+1/100τ)t.||z_{i,t}||\leq(1+1/100\tau)^{t}. (75)

Now we consider the Gram-Schmidt orthonormalization matrix

Zj,t=GS(zi,0,zj,0,zi,1,zj,1,zi,t,zj,t).Z_{j,t}=GS(z_{i,0},z_{j,0},z_{i,1},z_{j,1},...z_{i,t},z_{j,t}). (76)

It has at most 2(t+1)2(t+1) columns.

In order to prove the lemma, we bound ||Z_{j,t}^{T}z_{i,t}|| using induction. When t=0 and t=1, from Lemma 21 and (71), we know that with probability at least 1-exp(-\Omega(\log^{2}m)), ||Z_{j,t}^{T}z_{i,t}||\leq 2\log m/\sqrt{m}.

For t+1t+1,

Z_{j,t+1}^{T}z_{i,t+1}=Z_{j,t+1}^{T}\bm{W}_{0}(I-Z_{j,t}Z_{j,t}^{T})z_{i,t} (77)
+Z_{j,t+1}^{T}\bm{W}_{0}Z_{j,t}Z_{j,t}^{T}z_{i,t}.

Note that thanks to the Gram-Schmidt orthogonalization, the elements in matrix 𝑾0(IZj,tZj,tT)\bm{W}_{0}(I-Z_{j,t}Z_{j,t}^{T}) are independent of Zj,t+1Z_{j,t+1}. Then with probability at least 1exp(Ω(log2m))1-exp(-\Omega(\log^{2}m)),

Zj,t+1T𝑾0(IZj,tZj,tT)zi,t\displaystyle||Z_{j,t+1}^{T}\bm{W}_{0}(I-Z_{j,t}Z_{j,t}^{T})z_{i,t}|| (78)
Zj,t+1T𝑾0(IZj,tZj,tT)zi,t(IZj,tZj,tT)zi,t(IZj,tZj,tT)zi,t,\displaystyle\leq||Z_{j,t+1}^{T}||\cdot||\frac{\bm{W}_{0}(I-Z_{j,t}Z_{j,t}^{T})z_{i,t}}{||(I-Z_{j,t}Z_{j,t}^{T})z_{i,t}||}||\cdot||(I-Z_{j,t}Z_{j,t}^{T})z_{i,t}||,
Zj,t+1T2logm/m(IZj,tZj,tT)zi,t,\displaystyle\leq||Z_{j,t+1}^{T}||\cdot 2\log m/\sqrt{m}\cdot||(I-Z_{j,t}Z_{j,t}^{T})z_{i,t}||,
2logm/m(IZj,tZj,tT)zi,t\displaystyle\leq 2\log m/\sqrt{m}\cdot||(I-Z_{j,t}Z_{j,t}^{T})z_{i,t}||
2logm/mzi,t\displaystyle\leq 2\log m/\sqrt{m}\cdot||z_{i,t}||
2logm/mlogm.\displaystyle\leq 2\log m/\sqrt{m}\cdot\log m.

As for Z_{j,t+1}^{T}\bm{W}_{0}Z_{j,t}Z_{j,t}^{T}z_{i,t}, firstly note that the entries of the matrix \bm{W}_{0}Z_{j,t} are i.i.d. from N(0,\frac{1}{m}). When Z_{j,t}^{T}z_{i,t} is fixed, with probability at least 1-exp(-\Omega(m/\tau^{2})),

Zj,t+1T𝑾0Zj,tZj,tTzi,t𝑾0Zj,tZj,tTzi,t\displaystyle||Z_{j,t+1}^{T}\bm{W}_{0}Z_{j,t}Z_{j,t}^{T}z_{i,t}||\leq||\bm{W}_{0}Z_{j,t}Z_{j,t}^{T}z_{i,t}|| (79)
(1+1100τ)Zj,tTzi,t.\displaystyle\leq(1+\frac{1}{100\tau})||Z_{j,t}^{T}z_{i,t}||.

Meanwhile, there are at most exp(𝒪(t+1))exp(\mathcal{O}(t+1)) fixed vectors forming an ϵ\epsilon-net for Zj,tTzi,tZ_{j,t}^{T}z_{i,t}. Let m1/2>τm^{1/2}>\tau. With probability at least 1exp(Ω(m/τ2))1-exp(-\Omega(m/\tau^{2})),

||Z_{j,t+1}^{T}\bm{W}_{0}Z_{j,t}Z_{j,t}^{T}z_{i,t}||\leq(1+\frac{1}{50\tau})||Z_{j,t}^{T}z_{i,t}||. (80)

Therefore

Zj,t+1Tzi,t+1\displaystyle||Z_{j,t+1}^{T}z_{i,t+1}|| =Zj,t+1T𝑾0zi,t\displaystyle=||Z_{j,t+1}^{T}\bm{W}_{0}z_{i,t}|| (81)
(1+150τ+2log2mm)t+1(2logmm)\displaystyle\leq(1+\frac{1}{50\tau}+2\frac{\log^{2}m}{\sqrt{m}})^{t+1}\cdot(2\frac{\log m}{\sqrt{m}})

When mm is large enough such that m2log2m>50τ\frac{\sqrt{m}}{2\log^{2}m}>50\tau, we have

(1+150τ+2log2mm)t+13,(1+\frac{1}{50\tau}+2\frac{\log^{2}m}{\sqrt{m}})^{t+1}\leq 3,

Thus

Zj,t+1Tzi,t+16logmm.\displaystyle||Z_{j,t+1}^{T}z_{i,t+1}||\leq 6\frac{\log m}{\sqrt{m}}. (82)

As a corollary, this shows that if ||u||=||v||=1 and u^{T}v=0, then for all k\leq\tau,

|u^{T}\bm{W}_{0}^{k}v|\leq 6\frac{\log m}{\sqrt{m}}.

Without loss of generality, we assume t>tt>t^{\prime}.

|ejT𝑨T{(τ=1t𝑾0)T}{(τ=1t𝑾0)}𝑨ei|\displaystyle|e_{j}^{T}\bm{A}^{T}\{(\prod_{\tau=1}^{t}\bm{W}_{0})^{T}\}\{(\prod_{\tau=1}^{t^{\prime}}\bm{W}_{0})\}\bm{A}e_{i}| (83)
=[zj,t]T𝑾0ttzi,t\displaystyle=[z_{j,t^{\prime}}]^{T}\bm{W}_{0}^{t-t^{\prime}}z_{i,t^{\prime}}
=[zj,t]T𝑾0tt(IZj,tZj,tT)zi,t\displaystyle=[z_{j,t^{\prime}}]^{T}\bm{W}_{0}^{t-t^{\prime}}(I-Z_{j,t^{\prime}}Z_{j,t^{\prime}}^{T})z_{i,t^{\prime}}
+[zj,t]T𝑾0ttZj,tZj,tTzi,t\displaystyle+[z_{j,t^{\prime}}]^{T}\bm{W}_{0}^{t-t^{\prime}}Z_{j,t^{\prime}}Z_{j,t^{\prime}}^{T}z_{i,t^{\prime}}
zj,t𝑾0tt6logmm\displaystyle\leq||z_{j,t^{\prime}}||\cdot||\bm{W}_{0}^{t-t^{\prime}}||\cdot 6\frac{\log m}{\sqrt{m}}
+zj,t6logmmzi,t\displaystyle+||z_{j,t^{\prime}}||\cdot 6\frac{\log m}{\sqrt{m}}\cdot||z_{i,t^{\prime}}||
24τlogmm.\displaystyle\leq 24\tau\frac{\log m}{\sqrt{m}}.

Now we consider the case i=ji=j,

|eiT𝑨T{(τ=1t𝑾0)T}{(τ=1t𝑾0)}𝑨ei|\displaystyle|e_{i}^{T}\bm{A}^{T}\{(\prod_{\tau=1}^{t}\bm{W}_{0})^{T}\}\{(\prod_{\tau=1}^{t^{\prime}}\bm{W}_{0})\}\bm{A}e_{i}| (84)
=[zi,t]Tzi,t.\displaystyle=[z_{i,t}]^{T}z_{i,t^{\prime}}.

Since

Zj,t+1Tzi,t+16logmm,\displaystyle||Z_{j,t+1}^{T}z_{i,t+1}||\leq 6\frac{\log m}{\sqrt{m}}, (85)

and

Zj,t=GS(zi,0,zj,0,zi,1,zj,1,zi,t,zj,t),Z_{j,t}=GS(z_{i,0},z_{j,0},z_{i,1},z_{j,1},...z_{i,t},z_{j,t}), (86)

we have

|[zi,t]Tzi,t|6logmm.\displaystyle|[z_{i,t}]^{T}z_{i,t^{\prime}}|\leq 6\frac{\log m}{\sqrt{m}}. (87)

Thus for any unit vectors u,v\in\mathbb{R}^{d},

|uT𝑨T{(τ=1t𝑾0)T}{(τ=1t𝑾0)}𝑨v|\displaystyle|u^{T}\bm{A}^{T}\{(\prod_{\tau=1}^{t}\bm{W}_{0})^{T}\}\{(\prod_{\tau=1}^{t^{\prime}}\bm{W}_{0})\}\bm{A}v| (88)
24τd2logmm.\displaystyle\leq 24\tau\frac{d^{2}\log m}{\sqrt{m}}.

There is a similar argument for

𝑩T(τ=1t𝑾0)(τ=1t𝑾0)T𝑩.\bm{B}^{T}(\prod_{\tau=1}^{t^{\prime}}\bm{W}_{0})(\prod_{\tau=1}^{t}\bm{W}_{0})^{T}\bm{B}.

The theorem follows.

\blacksquare
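As an illustration of the corollary |u^{T}\bm{W}_{0}^{k}v|\leq 6\log m/\sqrt{m} obtained in the proof above (not part of the proof itself), the following minimal Python sketch tracks this inner product for a pair of orthonormal vectors and k\leq\tau. The values of m, \tau, and the seed are arbitrary choices for this sketch.

import numpy as np

# Numerical sketch (illustration only): for orthonormal u, v and W0 with
# i.i.d. N(0, 1/m) entries, |u^T W0^k v| stays of order log(m)/sqrt(m)
# for all k <= tau.  m, tau, and the seed are arbitrary choices.
rng = np.random.default_rng(5)
m, tau = 1000, 10
W0 = rng.standard_normal((m, m)) / np.sqrt(m)
u = rng.standard_normal(m)
u /= np.linalg.norm(u)
v = rng.standard_normal(m)
v -= (u @ v) * u                              # make v orthogonal to u
v /= np.linalg.norm(v)

w = v.copy()
for k in range(1, tau + 1):
    w = W0 @ w                                # w = W0^k v
    print(k, abs(u @ w), 6 * np.log(m) / np.sqrt(m))   # observed value vs. the bound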

Proof of Lemmas 13 and 14

Firstly, note that

ft(𝑾,𝑨)\displaystyle f_{t}(\bm{W},\bm{A}) =𝑩𝑨Xt+ρ𝑩𝑾𝑨Xt1+\displaystyle=\bm{B}\bm{A}X_{t}+\rho\cdot\bm{B}\bm{W}\bm{A}X_{t-1}+... (89)
+ρt1𝑩(τ=1t1𝑾)𝑨X1.\displaystyle+\rho^{t-1}\cdot\bm{B}(\prod_{\tau=1}^{t-1}\bm{W})\bm{A}X_{1}.

We have

Wft(𝑾,𝑨)𝒁=𝑩𝒁𝑨Xt1+\displaystyle\nabla_{W}f_{t}(\bm{W},\bm{A})\cdot\bm{Z}=\bm{B}\bm{Z}\bm{A}X_{t-1}+... (90)
+ρt1t1+t2=t1,t1,t2>0𝑩(τ=1t11𝑾)𝒁(τ=1t21𝑾)𝑨X1,\displaystyle+\rho^{t-1}\cdot\sum_{t_{1}+t_{2}=t-1,t_{1},t_{2}>0}\bm{B}(\prod_{\tau=1}^{t_{1}-1}\bm{W})\bm{Z}(\prod_{\tau=1}^{t_{2}-1}\bm{W})\bm{A}X_{1},
=t1+t2+t0=t,t0,t1,t2>0ρtt0\displaystyle=\sum_{t_{1}+t_{2}+t_{0}=t,t_{0},t_{1},t_{2}>0}\rho^{t-t_{0}}
𝑩(τ=1t11𝑾)𝒁(τ=1t21𝑾)𝑨Xt0.\displaystyle\cdot\bm{B}(\prod_{\tau=1}^{t_{1}-1}\bm{W})\bm{Z}(\prod_{\tau=1}^{t_{2}-1}\bm{W})\bm{A}X_{t_{0}}.

Meanwhile,

Aft(𝑾,𝑨)𝒁=\displaystyle\nabla_{A}f_{t}(\bm{W},\bm{A})\cdot\bm{Z}= 𝑩ZXt+ρ𝑩𝑾𝒁Xt1+\displaystyle\bm{B}ZX_{t}+\rho\cdot\bm{B}\bm{W}\bm{Z}X_{t-1}+... (91)
+ρt1𝑩(τ=1t1𝑾)𝒁X1.\displaystyle+\rho^{t-1}\cdot\bm{B}(\prod_{\tau=1}^{t-1}\bm{W})\bm{Z}X_{1}.

Let

Rt0(𝑾,𝑨)=ρtt0𝑩(τ=1tt0𝑾)𝑨Xt0.R_{t_{0}}(\bm{W},\bm{A})=\rho^{t-t_{0}}\cdot\bm{B}(\prod_{\tau=1}^{t-t_{0}}\bm{W})\bm{A}X_{t_{0}}. (92)

Note that from (18), ρt𝑾t2tρ0t||\rho^{t}\bm{W}^{t}||\leq 2\sqrt{t}\rho_{0}^{t} for all tt\in\mathbb{N}. We have

||R_{t_{0}}(\bm{W}^{\prime},\bm{A}^{\prime})-R_{t_{0}}(\bm{W},\bm{A}) (93)
-\nabla_{W}R_{t_{0}}(\bm{W},\bm{A})\cdot[\bm{W}^{\prime}-\bm{W}]
-\nabla_{A}R_{t_{0}}(\bm{W},\bm{A})\cdot[\bm{A}^{\prime}-\bm{A}]||_{F}
i+j=tt01𝑩ρtt0𝑾i𝑾j\displaystyle\leq\sum_{i+j=t-t_{0}-1}||\bm{B}||\cdot\rho^{t-t_{0}}\cdot||\bm{W}^{i}||\cdot||\bm{W}^{j}||
𝑾𝑾𝑨𝑨\displaystyle\cdot||\bm{W}^{\prime}-\bm{W}||\cdot||\bm{A^{\prime}}-\bm{A}||
+i+j+k=tt02𝑩ρtt0𝑾i𝑾j𝑾k\displaystyle+\sum_{i+j+k=t-t_{0}-2}||\bm{B}||\cdot\rho^{t-t_{0}}\cdot||\bm{W}^{i}||\cdot||\bm{W}^{j}||\cdot||\bm{W}^{k}||
𝑾𝑾2𝑨\displaystyle\cdot||\bm{W}^{\prime}-\bm{W}||^{2}\cdot||\bm{A}||
2m(tt0)ρ0tt04(tt0)ω2\displaystyle\leq 2\sqrt{m}(t-t_{0})\rho_{0}^{t-t_{0}}\cdot 4(t-t_{0})\cdot\omega^{2}
+2m(tt0)2ρ0tt08(tt0)2ω22\displaystyle+2\sqrt{m}(t-t_{0})^{2}\rho_{0}^{t-t_{0}}\cdot 8(t-t_{0})^{2}\cdot\omega^{2}\cdot 2
32m(tt0)4ω2ρ0tt0.\displaystyle\leq 32\sqrt{m}(t-t_{0})^{4}\omega^{2}\rho_{0}^{t-t_{0}}.

Thus

||f_{t}(\bm{W}^{\prime},\bm{A}^{\prime})-f_{t}(\bm{W},\bm{A})-\langle\nabla_{W}f_{t}(\bm{W},\bm{A}),\bm{W}^{\prime}-\bm{W}\rangle (94)
-\langle\nabla_{A}f_{t}(\bm{W},\bm{A}),\bm{A}^{\prime}-\bm{A}\rangle||_{F}
\leq\sum_{t_{0}}32\sqrt{m}(t-t_{0})^{4}\omega^{2}\rho_{0}^{t-t_{0}}
\leq 768\frac{\sqrt{m}\omega^{2}}{(1-\rho_{0})^{5}}.

Lemma 13 follows. \blacksquare
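As an illustration (not part of the proof), the following minimal Python sketch checks the quadratic dependence on the perturbation size asserted in Lemma 13 for the map f_{t}(\bm{W},\bm{A})=\sum_{s}\rho^{s}\bm{B}\bm{W}^{s}\bm{A}X_{t-s}: the remainder after subtracting the first-order expansion around (\bm{W}_{0},\bm{A}_{0}) shrinks like \omega^{2}. All dimensions, \rho, the perturbation directions, and the seed below are arbitrary choices for this sketch.

import numpy as np

# Numerical sketch (illustration only) of Lemma 13: for
# f_t(W, A) = sum_{s=0}^{t-1} rho^s * B @ W^s @ A @ X_{t-s},
# the remainder after removing the first-order expansion around (W0, A0)
# shrinks quadratically in the perturbation size omega.  Values are arbitrary.
rng = np.random.default_rng(6)
m, d, dy, t, rho = 200, 4, 3, 8, 0.8
W0 = rng.standard_normal((m, m)) / np.sqrt(m)
A0 = rng.standard_normal((m, d)) / np.sqrt(m)
B = rng.standard_normal((dy, m))
X = rng.standard_normal((t, d))                      # X[s] stands for the input X_{s+1}

def f(W, A):
    out = np.zeros(dy)
    Wp = np.eye(m)
    for s in range(t):                               # term rho^s * B W^s A X_{t-s}
        out += (rho ** s) * (B @ Wp @ A @ X[t - 1 - s])
        Wp = Wp @ W
    return out

def first_order(W, A, dW, dA):
    # exact directional derivative of f at (W, A) in direction (dW, dA)
    powers = [np.eye(m)]
    for _ in range(t):
        powers.append(powers[-1] @ W)
    out = np.zeros(dy)
    for s in range(t):
        out += (rho ** s) * (B @ powers[s] @ dA @ X[t - 1 - s])
        for i in range(s):                           # d(W^s) = sum_{i+j=s-1} W^i dW W^j
            out += (rho ** s) * (B @ powers[i] @ dW @ powers[s - 1 - i] @ A @ X[t - 1 - s])
    return out

dW = rng.standard_normal((m, m))
dW /= np.linalg.norm(dW)
dA = rng.standard_normal((m, d))
dA /= np.linalg.norm(dA)
for omega in (1e-1, 1e-2, 1e-3):
    rem = f(W0 + omega * dW, A0 + omega * dA) - f(W0, A0) - omega * first_order(W0, A0, dW, dA)
    print(omega, np.linalg.norm(rem))                # remainder scales roughly like omega^2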

Proof of Lemma 14

From the definition we have

ftlin(𝑾,𝑨)=\displaystyle f_{t}^{lin}(\bm{W},\bm{A})= ft(𝑾0,𝑨0)+Wft(𝑾0,𝑨0)[𝑾𝑾0]\displaystyle f_{t}(\bm{W}_{0},\bm{A}_{0})+\nabla_{W}f_{t}(\bm{W}_{0},\bm{A}_{0})\cdot[\bm{W}-\bm{W}_{0}] (95)
+Aft(𝑾0,𝑨0)[𝑨𝑨0],\displaystyle+\nabla_{A}f_{t}(\bm{W}_{0},\bm{A}_{0})\cdot[\bm{A}-\bm{A}_{0}],
ftlin,τ(𝑾,𝑨)=\displaystyle f_{t}^{lin,\tau}(\bm{W},\bm{A})= ftτ(𝑾0,𝑨0)+Wftτ(𝑾0,𝑨0)[𝑾𝑾0]\displaystyle f^{\tau}_{t}(\bm{W}_{0},\bm{A}_{0})+\nabla_{W}f^{\tau}_{t}(\bm{W}_{0},\bm{A}_{0})\cdot[\bm{W}-\bm{W}_{0}]
+Aftτ(𝑾0,𝑨0)[𝑨𝑨0],\displaystyle+\nabla_{A}f^{\tau}_{t}(\bm{W}_{0},\bm{A}_{0})\cdot[\bm{A}-\bm{A}_{0}],

and

Wft(𝑾,𝑨)𝒁=𝑩𝒁𝑨Xt1+\displaystyle\nabla_{W}f_{t}(\bm{W},\bm{A})\cdot\bm{Z}=\bm{B}\bm{Z}\bm{A}X_{t-1}+... (96)
+ρt1t1+t2=t1,t1,t2>0𝑩(τ=1t11𝑾)𝒁(τ=1t21𝑾)𝑨X1\displaystyle+\rho^{t-1}\cdot\sum_{t_{1}+t_{2}=t-1,t_{1},t_{2}>0}\bm{B}(\prod_{\tau=1}^{t_{1}-1}\bm{W})\bm{Z}(\prod_{\tau=1}^{t_{2}-1}\bm{W})\bm{A}X_{1}
=t1+t2+t0=t,t0,t1,t2>0ρtt0𝑩(τ=1t11𝑾)𝒁(τ=1t21𝑾)𝑨Xt0.\displaystyle=\sum_{t_{1}+t_{2}+t_{0}=t,t_{0},t_{1},t_{2}>0}\rho^{t-t_{0}}\cdot\bm{B}(\prod_{\tau=1}^{t_{1}-1}\bm{W})\bm{Z}(\prod_{\tau=1}^{t_{2}-1}\bm{W})\bm{A}X_{t_{0}}.

Meanwhile

Aft(𝑾,𝑨)𝒁=\displaystyle\nabla_{A}f_{t}(\bm{W},\bm{A})\cdot\bm{Z}= 𝑩ZXt+ρ𝑩𝑾𝒁Xt1+\displaystyle\bm{B}ZX_{t}+\rho\cdot\bm{B}\bm{W}\bm{Z}X_{t-1}+... (97)
+ρt1𝑩(τ=1t1𝑾)𝒁X1.\displaystyle+\rho^{t-1}\cdot\bm{B}(\prod_{\tau=1}^{t-1}\bm{W})\bm{Z}X_{1}.

In the same way as above, we have

ftlin(𝑾,𝑨)ftlin,τ(𝑾,𝑨)F\displaystyle||f^{lin}_{t}(\bm{W},\bm{A})-f^{lin,\tau}_{t}(\bm{W},\bm{A})||_{F} (98)
\leq\sum_{k=\tau}^{\infty}\sum_{i+j=k-1}||\bm{B}||\cdot\rho^{k}\cdot||\bm{W}_{0}^{i}||\cdot||\bm{W}_{0}^{j}||\cdot||\bm{W}||\cdot||\bm{A}_{0}||
+\sum_{k=\tau}^{\infty}||\bm{B}||\cdot\rho^{k}\cdot||\bm{W}_{0}^{k}||\cdot||\bm{A}||
τρ0τ8m(1ρ0)3\displaystyle\leq\tau\rho_{0}^{\tau}\cdot 8\frac{\sqrt{m}}{(1-\rho_{0})^{3}}

from Lemma 9.

\blacksquare

Existence: Proof of Theorem 16

Firstly, we briefly outline the main steps of the proof.

1) From Lemma 13, for all t[T]t\in[T] and 𝑾,𝑾B(𝑾0,ω)\bm{W},\bm{W}^{\prime}\in B(\bm{W}_{0},\omega), 𝑨,𝑨B(𝑨0,ω)\bm{A},\bm{A}^{\prime}\in B(\bm{A}_{0},\omega) with ωω0\omega\leq\omega_{0},

||ft(𝑾,𝑨)ft(𝑾0,𝑨0)Wft(𝑾0,𝑨0)[𝑾𝑾0]\displaystyle||f_{t}(\bm{W},\bm{A})-f_{t}(\bm{W}_{0},\bm{A}_{0})-\nabla_{W}f_{t}(\bm{W}_{0},\bm{A}_{0})\cdot[\bm{W}-\bm{W}_{0}]
Aft(𝑾0,𝑨0)[𝑨𝑨0]||768mω2(1ρ0)5.\displaystyle-\nabla_{A}f_{t}(\bm{W}_{0},\bm{A}_{0})\cdot[\bm{A}-\bm{A}_{0}]||\leq 768\frac{\sqrt{m}\omega^{2}}{(1-\rho_{0})^{5}}.

Therefore we consider the linearization function

ftlin(𝑾,𝑨)=\displaystyle f_{t}^{lin}(\bm{W},\bm{A})= ft(𝑾0,𝑨0)+Wft(𝑾0,𝑨0)[𝑾𝑾0]\displaystyle f_{t}(\bm{W}_{0},\bm{A}_{0})+\nabla_{W}f_{t}(\bm{W}_{0},\bm{A}_{0})\cdot[\bm{W}-\bm{W}_{0}]
+Aft(𝑾0,𝑨0)[𝑨𝑨0].\displaystyle+\nabla_{A}f_{t}(\bm{W}_{0},\bm{A}_{0})\cdot[\bm{A}-\bm{A}_{0}].

We have

ft(𝑾,𝑨)ftlin(𝑾,𝑨)𝒪(mω2(1ρ0)5).\displaystyle||f_{t}(\bm{W},\bm{A})-f_{t}^{lin}(\bm{W},\bm{A})||\leq\mathcal{O}(\frac{\sqrt{m}\omega^{2}}{(1-\rho_{0})^{5}}).

2) From Lemma 14,

ftlin(𝑾,𝑨)ftlin,Tmax(𝑾,𝑨)𝒪(mτρ0τ(1ρ0)3).||f^{lin}_{t}(\bm{W},\bm{A})-f^{lin,T_{max}}_{t}(\bm{W},\bm{A})||\leq\mathcal{O}(\frac{\sqrt{m}\tau\rho_{0}^{\tau}}{(1-\rho_{0})^{3}}).

Therefore we only need to consider

ftlin,Tmax(𝑾,𝑨)=t0=0Tmaxρt0𝑩(τ=1t0𝑾0)𝑨0Xtt0\displaystyle f^{lin,T_{max}}_{t}(\bm{W},\bm{A})=\sum_{t_{0}=0}^{T_{max}}\rho^{t_{0}}\bm{B}(\prod_{\tau=1}^{t_{0}}\bm{W}_{0})\bm{A}_{0}X_{t-t_{0}}
+t0=0Tmaxt1+t2=t0ρt0𝑩(τ=1t11𝑾0)[𝑾𝑾0](τ=1t21𝑾0)𝑨0Xtt0\displaystyle+\sum_{t_{0}=0}^{T_{max}}\sum_{t_{1}+t_{2}=t_{0}}\rho^{t_{0}}\bm{B}(\prod_{\tau=1}^{t_{1}-1}\bm{W}_{0})[\bm{W}-\bm{W}_{0}](\prod_{\tau=1}^{t_{2}-1}\bm{W}_{0})\bm{A}_{0}X_{t-t_{0}}
+t0=0Tmaxρt0𝑩(τ=1t01𝑾0)[𝑨𝑨0]Xtt0.\displaystyle+\sum_{t_{0}=0}^{T_{max}}\rho^{t_{0}}\bm{B}(\prod_{\tau=1}^{t_{0}-1}\bm{W}_{0})[\bm{A}-\bm{A}_{0}]X_{t-t_{0}}.

3) Finally we set

𝑾𝑾0\displaystyle\bm{W}^{*}-\bm{W}_{0}
=\displaystyle= tm=1Tmax(ρ1)tm1t1+t2=tm1{(τ=1t11𝑾0)}T𝑩T𝑷t11{𝑮𝑪tm1𝑫\displaystyle\sum_{t_{m}=1}^{T_{max}}(\rho^{-1})^{t_{m}-1}\sum_{t_{1}+t_{2}=t_{m}-1}\{(\prod_{\tau=1}^{t_{1}-1}\bm{W}_{0})\}^{T}\bm{B}^{T}\bm{P}^{1}_{t_{1}}\{\bm{G}\bm{C}^{t_{m}-1}\bm{D}
(ρ2)tm1/(tm1)𝑩{τ=1tm𝑾0}𝑨0}𝑷2t21𝑨T0(τ=1t2𝑾0),\displaystyle-(\rho^{2})^{t_{m}-1}/(t_{m}-1)\cdot\bm{B}\{\prod_{\tau=1}^{t_{m}}\bm{W}_{0}\}\bm{A}_{0}\}\bm{P}^{2}_{t_{2}-1}\bm{A}^{T}_{0}(\prod_{\tau=1}^{t_{2}}\bm{W}_{0}),
𝑨𝑨0={(τ=1tm1𝑾0)}T𝑩T𝑷tm11[𝑮𝑫𝑩𝑨0],\bm{A}^{*}-\bm{A}_{0}=\{(\prod_{\tau=1}^{t_{m}-1}\bm{W}_{0})\}^{T}\bm{B}^{T}\bm{P}^{1}_{t_{m}-1}[\bm{G}\bm{D}-\bm{B}\bm{A}_{0}],

where

𝑷t11={𝑩(τ=1t11𝑾0)(τ=1t11𝑾0)T𝑩T}1,\displaystyle\bm{P}^{1}_{t_{1}}=\{\bm{B}(\prod_{\tau=1}^{t_{1}-1}\bm{W}_{0})(\prod_{\tau=1}^{t_{1}-1}\bm{W}_{0})^{T}\bm{B}^{T}\}^{-1}, (99)
𝑷t22={𝑨0T(τ=1t21𝑾0)T(τ=1t21𝑾0)𝑨0}1.\displaystyle\bm{P}^{2}_{t_{2}}=\{\bm{A}_{0}^{T}(\prod_{\tau=1}^{t_{2}-1}\bm{W}_{0})^{T}(\prod_{\tau=1}^{t_{2}-1}\bm{W}_{0})\bm{A}_{0}\}^{-1}.

Then from Lemmas 10, 11 and 12,

ftlin,Tmax(𝑾,𝑨)t0=0Tmax𝑮(τ=1t0𝑪)𝑫Xtt0.f^{lin,T_{max}}_{t}(\bm{W}^{*},\bm{A}^{*})\approx\sum_{t_{0}=0}^{T_{max}}\bm{G}(\prod_{\tau^{\prime}=1}^{t_{0}}\bm{C})\bm{D}X_{t-t_{0}}. (100)

With probability at least 1exp(Ω(log2m))1-exp(-\Omega(\log^{2}m)),

||f^{lin,T_{max}}_{t}(\bm{W}^{*},\bm{A}^{*})-\widetilde{y}_{t}||\leq\mathcal{O}(\frac{b\cdot d^{2}c_{\rho}T^{3}_{max}\log m}{m^{1/2}})+\frac{c_{\rho}\rho^{T_{max}}}{1-\rho},

and we can show 𝑾𝑾0F,𝑨𝑨0F𝒪(bTmax2/m)||\bm{W}^{*}-\bm{W}_{0}||_{F},||\bm{A}^{*}-\bm{A}_{0}||_{F}\leq\mathcal{O}(b\cdot T^{2}_{max}/\sqrt{m}) using Lemma 11. The theorem follows.

Theorem 22

Consider a linear system:

pt=𝑪pt1+𝑫xt,\displaystyle p_{t}=\bm{C}p_{t-1}+\bm{D}x_{t}, (101)
y~t=𝑮pt.\displaystyle\widetilde{y}_{t}=\bm{G}p_{t}.

with 𝐂k𝐃<cρρk||\bm{C}^{k}\bm{D}||<c_{\rho}\cdot\rho^{k} for any kk\in\mathbb{N}, ρ<1\rho<1, 𝐃dp×d\bm{D}\in\mathbb{R}^{d_{p}\times d}, ptdpp_{t}\in\mathbb{R}^{d_{p}}, 𝐂dp×dp\bm{C}\in\mathbb{R}^{d_{p}\times d_{p}} 𝐆dy×dp\bm{G}\in\mathbb{R}^{d_{y}\times d_{p}}.

For given data {xt,yt}\{x_{t},y_{t}\}, and any 0<ρ<10<\rho<1, if m>mm>m^{*}, with probability at least 1exp(Ω(log2m))1-exp(-\Omega(\log^{2}m)), there exist 𝐖,𝐀\bm{W}^{*},\bm{A}^{*} satisfying that for all tt,

||f_{t}(\bm{W}^{*},\bm{A}^{*})-\widetilde{y}_{t}||_{F}\leq\mathcal{O}(\frac{\sqrt{m}(\rho_{0})^{T_{max}}}{(1-\rho_{0})^{3}}) (102)
+\mathcal{O}(\frac{b\cdot d^{2}c_{\rho}T^{3}_{max}\log m}{m^{1/2}})+\frac{c_{\rho}\rho^{T_{max}}}{1-\rho},

and

𝑾𝑾0F,𝑨𝑨0F2cρbTmax2/m.||\bm{W}^{*}-\bm{W}_{0}||_{F},||\bm{A}^{*}-\bm{A}_{0}||_{F}\leq 2c_{\rho}b\cdot T^{2}_{max}/\sqrt{m}. (103)

Proof:

Firstly, note that from Lemma 13,

||ft(𝑾,𝑨)ft(𝑾0,𝑨0)Wft(𝑾0,𝑨0)[𝑾𝑾0]\displaystyle||f_{t}(\bm{W},\bm{A})-f_{t}(\bm{W}_{0},\bm{A}_{0})-\nabla_{W}f_{t}(\bm{W}_{0},\bm{A}_{0})\cdot[\bm{W}-\bm{W}_{0}] (104)
Aft(𝑾0,𝑨0)[𝑨𝑨0]||\displaystyle-\nabla_{A}f_{t}(\bm{W}_{0},\bm{A}_{0})\cdot[\bm{A}-\bm{A}_{0}]||
𝒪(mω2(1ρ0)5).\displaystyle\leq\mathcal{O}(\frac{\sqrt{m}\omega^{2}}{(1-\rho_{0})^{5}}).

Therefore let

ftlin(𝑾,𝑨)=\displaystyle f_{t}^{lin}(\bm{W},\bm{A})= ft(𝑾0,𝑨0)+Wft(𝑾0,𝑨0)[𝑾𝑾0]\displaystyle f_{t}(\bm{W}_{0},\bm{A}_{0})+\nabla_{W}f_{t}(\bm{W}_{0},\bm{A}_{0})\cdot[\bm{W}-\bm{W}_{0}] (105)
+Aft(𝑾0,𝑨0)[𝑨𝑨0].\displaystyle+\nabla_{A}f_{t}(\bm{W}_{0},\bm{A}_{0})\cdot[\bm{A}-\bm{A}_{0}].

We have

ft(𝑾,𝑨)ftlin(𝑾,𝑨)𝒪(mω2(1ρ0)5).||f_{t}(\bm{W},\bm{A})-f_{t}^{lin}(\bm{W},\bm{A})||\leq\mathcal{O}(\frac{\sqrt{m}\omega^{2}}{(1-\rho_{0})^{5}}).

Note that

𝑾𝑾0F,𝑨𝑨0F𝒪(bTmax2/m).||\bm{W}^{*}-\bm{W}_{0}||_{F},||\bm{A}^{*}-\bm{A}_{0}||_{F}\leq\mathcal{O}(b\cdot T^{2}_{max}/\sqrt{m}). (106)

We can show

ft(𝑾,𝑨)ftlin(𝑾,𝑨)𝒪(ϵ/b).||f_{t}(\bm{W}^{*},\bm{A}^{*})-f_{t}^{lin}(\bm{W}^{*},\bm{A}^{*})||\leq\mathcal{O}(\epsilon/b).

Now, from Lemma 14,

ftlin(𝑾,𝑨)ftlin,Tmax(𝑾,𝑨)𝒪(mτρ0τ(1ρ0)3).||f^{lin}_{t}(\bm{W},\bm{A})-f^{lin,T_{max}}_{t}(\bm{W},\bm{A})||\leq\mathcal{O}(\frac{\sqrt{m}\tau\rho_{0}^{\tau}}{(1-\rho_{0})^{3}}). (107)

We only need to consider

ftlin,Tmax(𝑾,𝑨)=t0=0Tmaxρt0𝑩(τ=1t0𝑾0)𝑨0Xtt0\displaystyle f^{lin,T_{max}}_{t}(\bm{W},\bm{A})=\sum_{t_{0}=0}^{T_{max}}\rho^{t_{0}}\bm{B}(\prod_{\tau=1}^{t_{0}}\bm{W}_{0})\bm{A}_{0}X_{t-t_{0}}
+t0=0Tmaxt1+t2=t0ρt0𝑩(τ=1t11𝑾0)[𝑾𝑾0]\displaystyle+\sum_{t_{0}=0}^{T_{max}}\sum_{t_{1}+t_{2}=t_{0}}\rho^{t_{0}}\bm{B}(\prod_{\tau=1}^{t_{1}-1}\bm{W}_{0})[\bm{W}-\bm{W}_{0}]
(τ=1t21𝑾0)𝑨0Xtt0\displaystyle\cdot(\prod_{\tau=1}^{t_{2}-1}\bm{W}_{0})\bm{A}_{0}X_{t-t_{0}}
+t0=0Tmaxρt0𝑩(τ=1t01𝑾0)[𝑨𝑨0]Xtt0.\displaystyle+\sum_{t_{0}=0}^{T_{max}}\rho^{t_{0}}\bm{B}(\prod_{\tau=1}^{t_{0}-1}\bm{W}_{0})[\bm{A}-\bm{A}_{0}]X_{t-t_{0}}.

Finally we set

𝑾𝑾0\displaystyle\bm{W}^{*}-\bm{W}_{0}
=tm=1Tmax(ρ1)tm1t1+t2=tm1{(τ=1t11𝑾0)}T𝑩T𝑷t11\displaystyle=\sum_{t_{m}=1}^{T_{max}}(\rho^{-1})^{t_{m}-1}\sum_{t_{1}+t_{2}=t_{m}-1}\{(\prod_{\tau=1}^{t_{1}-1}\bm{W}_{0})\}^{T}\bm{B}^{T}\bm{P}^{1}_{t_{1}}
{𝑮𝑪tm1𝑫(ρ2)tm1/(tm1)𝑩{τ=1tm𝑾0}𝑨0}\displaystyle\cdot\{\bm{G}\bm{C}^{t_{m}-1}\bm{D}-(\rho^{2})^{t_{m}-1}/(t_{m}-1)\cdot\bm{B}\{\prod_{\tau=1}^{t_{m}}\bm{W}_{0}\}\bm{A}_{0}\}
𝑷t212𝑨0T(τ=1t2𝑾0),\displaystyle\cdot\bm{P}^{2}_{t_{2}-1}\bm{A}^{T}_{0}(\prod_{\tau=1}^{t_{2}}\bm{W}_{0}),
𝑨𝑨0={(τ=1tm1𝑾0)}T𝑩T𝑷tm11[𝑮𝑫𝑩𝑨0],\bm{A}^{*}-\bm{A}_{0}=\{(\prod_{\tau=1}^{t_{m}-1}\bm{W}_{0})\}^{T}\bm{B}^{T}\bm{P}^{1}_{t_{m}-1}[\bm{G}\bm{D}-\bm{B}\bm{A}_{0}], (108)

where

\bm{P}^{1}_{t_{1}}=\{\bm{B}(\prod_{\tau=1}^{t_{1}-1}\bm{W}_{0})(\prod_{\tau=1}^{t_{1}-1}\bm{W}_{0})^{T}\bm{B}^{T}\}^{-1}, (109)
\bm{P}^{2}_{t_{2}}=\{\bm{A}_{0}^{T}(\prod_{\tau=1}^{t_{2}-1}\bm{W}_{0})^{T}(\prod_{\tau=1}^{t_{2}-1}\bm{W}_{0})\bm{A}_{0}\}^{-1}.

Firstly we need to bound 𝑾𝑾0F||\bm{W}^{*}-\bm{W}_{0}||_{F} and 𝑨𝑨0F||\bm{A}^{*}-\bm{A}_{0}||_{F}.

When t_{1},t_{2}\leq T_{max}, note that \bm{P}^{1}_{t_{1}} and \bm{P}^{2}_{t_{2}} are symmetric matrices. From Lemma 10,

𝑷t112m,\displaystyle||\bm{P}^{1}_{t_{1}}||\leq\frac{2}{m}, (110)
𝑷t222.\displaystyle||\bm{P}^{2}_{t_{2}}||\leq 2.

Meanwhile (ρ1)tm1𝑮𝑪tm1𝑫cρ(\rho^{-1})^{t_{m}-1}||\bm{G}\bm{C}^{t_{m}-1}\bm{D}||\leq c_{\rho} since ρ(𝑪)ρ\rho(\bm{C})\leq\rho. Moreover, from Lemma 11,

𝑩(τ=1t0𝑾)𝑨b,||\bm{B}(\prod_{\tau^{\prime}=1}^{t_{0}}\bm{W})\bm{A}||\leq b, (111)

for all t0Tmaxt_{0}\leq T_{max}. Therefore

𝑾𝑾0F,𝑨𝑨0F2cρbTmax2/m.||\bm{W}^{*}-\bm{W}_{0}||_{F},||\bm{A}^{*}-\bm{A}_{0}||_{F}\leq 2c_{\rho}b\cdot T^{2}_{max}/\sqrt{m}. (112)

From Lemma 12, when t\neq t^{\prime}, t,t^{\prime}\leq\tau,

|(u1)T𝑩(τ=1t𝑾0)\displaystyle|(u_{1})^{T}\bm{B}(\prod_{\tau=1}^{t}\bm{W}_{0})\cdot {(τ=1t𝑾0)}T𝑩Tv1|\displaystyle\{(\prod_{\tau=1}^{t^{\prime}}\bm{W}_{0})\}^{T}\bm{B}^{T}v_{1}| (113)
24τd2mlogm,\displaystyle\leq 24\tau d^{2}\sqrt{m}\log m,

and

|(u2)T𝑨0T(τ=1t𝑾0)T{(τ=1t𝑾0)}𝑨0v2|24τd2logmm.|(u_{2})^{T}\bm{A}^{T}_{0}(\prod_{\tau=1}^{t}\bm{W}_{0})^{T}\cdot\{(\prod_{\tau=1}^{t^{\prime}}\bm{W}_{0})\}\bm{A}_{0}v_{2}|\leq 24\tau\frac{d^{2}\log m}{\sqrt{m}}. (114)

This shows

ftlin,Tmax(𝑾,𝑨)\displaystyle f^{lin,T_{max}}_{t}(\bm{W}^{*},\bm{A}^{*})
=\displaystyle= t0=0Tmaxρt0𝑩(τ=1t0𝑾0)𝑨0Xtt0\displaystyle\sum_{t_{0}=0}^{T_{max}}\rho^{t_{0}}\bm{B}(\prod_{\tau=1}^{t_{0}}\bm{W}_{0})\bm{A}_{0}X_{t-t_{0}}
+t0=0Tmaxt1+t2=t0ρt0𝑩(τ=1t11𝑾0)\displaystyle+\sum_{t_{0}=0}^{T_{max}}\sum_{t_{1}+t_{2}=t_{0}}\rho^{t_{0}}\bm{B}(\prod_{\tau=1}^{t_{1}-1}\bm{W}_{0})
tm=1Tmax(ρ1)tm1t1+t2=tm1\displaystyle\cdot\sum_{t_{m}=1}^{T_{max}}(\rho^{-1})^{t_{m}-1}\sum_{t_{1}+t_{2}=t_{m}-1}
{(τ=1t11𝑾0)}T𝑩T𝑷t11{𝑮𝑪tm1𝑫\displaystyle\cdot\{(\prod_{\tau=1}^{t_{1}-1}\bm{W}_{0})\}^{T}\bm{B}^{T}\bm{P}^{1}_{t_{1}}\{\bm{G}\bm{C}^{t_{m}-1}\bm{D}
(ρ2)tm1/(tm1)𝑩{τ=1tm𝑾0}𝑨0}𝑷2t21𝑨T0(τ=1t2𝑾0)\displaystyle-(\rho^{2})^{t_{m}-1}/(t_{m}-1)\cdot\bm{B}\{\prod_{\tau=1}^{t_{m}}\bm{W}_{0}\}\bm{A}_{0}\}\bm{P}^{2}_{t_{2}-1}\bm{A}^{T}_{0}(\prod_{\tau=1}^{t_{2}}\bm{W}_{0})
(τ=1t21𝑾0)𝑨0Xtt0\displaystyle\cdot(\prod_{\tau=1}^{t_{2}-1}\bm{W}_{0})\bm{A}_{0}X_{t-t_{0}}
+t0=0Tmaxρt0𝑩(τ=1t01𝑾0){(τ=1tm1𝑾0)}T𝑩T\displaystyle+\sum_{t_{0}=0}^{T_{max}}\rho^{t_{0}}\bm{B}(\prod_{\tau=1}^{t_{0}-1}\bm{W}_{0})\{(\prod_{\tau=1}^{t_{m}-1}\bm{W}_{0})\}^{T}\bm{B}^{T}
𝑷tm11[𝑮𝑫𝑩𝑨0]Xtt0.\displaystyle\cdot\bm{P}^{1}_{t_{m}-1}[\bm{G}\bm{D}-\bm{B}\bm{A}_{0}]X_{t-t_{0}}.

Combining all the above results, Lemma 11 and

yt~=t0=0t1𝑮(τ=1t0𝑪)𝑫Xtt0,\widetilde{y_{t}}=\sum_{t_{0}=0}^{t-1}\bm{G}(\prod_{\tau=1}^{t_{0}}\bm{C})\bm{D}X_{t-t_{0}},

we have

ftlin,Tmax(𝑾,𝑨)t0=0Tmax1𝑮(τ=1t0𝑪)𝑫Xtt0\displaystyle||f^{lin,T_{max}}_{t}(\bm{W}^{*},\bm{A}^{*})-\sum_{t_{0}=0}^{T_{max}-1}\bm{G}(\prod_{\tau=1}^{t_{0}}\bm{C})\bm{D}X_{t-t_{0}}|| (115)
𝒪(Tmax22bcρTmaxd2logmm),\displaystyle\leq\mathcal{O}(T_{max}^{2}\cdot 2b\cdot c_{\rho}\cdot T_{max}\frac{d^{2}\log m}{\sqrt{m}}),
||f^{lin,T_{max}}_{t}(\bm{W}^{*},\bm{A}^{*})-\widetilde{y_{t}}|| (116)
\leq\mathcal{O}(\frac{b\cdot d^{2}c_{\rho}T^{3}_{max}\log m}{m^{1/2}})+\frac{c_{\rho}\rho^{T_{max}}}{1-\rho}.

The theorem follows.

\blacksquare
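As a quick sanity check (an illustration, not part of the proof) of the identity \widetilde{y_{t}}=\sum_{t_{0}=0}^{t-1}\bm{G}\bm{C}^{t_{0}}\bm{D}X_{t-t_{0}} used above, the following minimal Python sketch simulates a small stable system both by the recursion and by the convolution sum and confirms that the two outputs agree; the dimensions and the random system are arbitrary choices for this sketch.

import numpy as np

# Numerical sketch (illustration only): the teacher output from the recursion
# p_t = C p_{t-1} + D x_t, y_t = G p_t (with p_0 = 0) agrees with the
# convolution form y_t = sum_{t0=0}^{t-1} G C^{t0} D x_{t-t0}.
# Dimensions and the random system below are arbitrary choices for this sketch.
rng = np.random.default_rng(7)
dp, d, dy, T = 6, 3, 2, 15
C = 0.5 * rng.standard_normal((dp, dp)) / np.sqrt(dp)   # scaled so the example system is stable
D = rng.standard_normal((dp, d))
G = rng.standard_normal((dy, dp))
X = rng.standard_normal((T + 1, d))                      # X[t] is the input at time t; X[0] is unused

p = np.zeros(dp)                                         # recursion
y_rec = []
for t in range(1, T + 1):
    p = C @ p + D @ X[t]
    y_rec.append(G @ p)

y_sum = []                                               # convolution sum
for t in range(1, T + 1):
    acc = np.zeros(dy)
    Ck = np.eye(dp)
    for t0 in range(t):                                  # t0 = 0, ..., t-1
        acc += G @ Ck @ D @ X[t - t0]
        Ck = Ck @ C
    y_sum.append(acc)

print(max(np.linalg.norm(a - b) for a, b in zip(y_rec, y_sum)))   # agreement up to rounding error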

Using Theorem 22, from the definition of TmaxT_{max},

m>m,m>m^{*},

and ρ<ρ0,\rho<\rho_{0},

T_{max}>\Theta(\frac{1}{\log(\frac{1}{\rho})})\cdot\{\log(\frac{1}{1-\rho})+\log b+\log(\frac{1}{\epsilon})\}, (117)

we have

||f_{t}(\bm{W}^{*},\bm{A}^{*})-\widetilde{y_{t}}||\leq\mathcal{O}(\frac{\epsilon}{l_{0}\cdot(1+2b)}). (118)

Thus L(y_{t},f_{t}(\bm{W}^{*},\bm{A}^{*}))-L(y_{t},\widetilde{y}_{t})\leq\mathcal{O}(\epsilon). Theorem 16 follows.