Lifu Wang, Tianyu Wang*, Shengwei Yi, Bo Shen, Bo Hu, Xing Cao
Corresponding author: Tianyu Wang: [email protected]. Lifu Wang, Tianyu Wang, and Shengwei Yi are with the China Information Technology Security Evaluation Center (CNITSEC), Beijing, China.
Bo Shen, Bo Hu, Xing Cao are with Beijing Jiaotong University, Beijing, China.
Abstract
We study the learning ability of linear recurrent neural networks trained with gradient descent. We prove the first theoretical guarantee that linear RNNs can learn any stable linear dynamic system using a large class of loss functions. For an arbitrary stable linear system with a parameter related to the transition matrix , we show that, despite the non-convexity of the parameter optimization loss, if the width of the RNN is large enough (and the required width of the hidden layer does not depend on the length of the input sequence), a linear RNN can provably learn any stable linear dynamic system with sample and time complexity polynomial in . Our results provide the first theoretical guarantee for learning a linear RNN and demonstrate how the recurrent structure can help to learn a dynamic system.
I Introduction
The recurrent neural network (RNN) is a very important structure in machine learning for dealing with sequence data. It is believed that, using the recurrent structure, RNNs can learn complicated transformations of data over extended periods. Non-linear RNNs have been proved to be Turing-complete [1] and can therefore simulate arbitrary procedures. However, training an RNN requires optimizing a highly non-convex function, which is very hard to solve. On the other hand, it is widely believed [2] that deep linear networks capture some important aspects of optimization in deep learning, and a series of recent papers study the properties of deep linear networks [3, 4, 5]. Meanwhile, learning a linear RNN is not only an important problem in system identification but also useful for language modeling in natural language processing [6]. In this paper, we study the non-convex optimization problem of learning linear RNNs.
Suppose there is a -order and -dimension linear system with the following form:
(1)
where and are unknown system parameters.
At time , this system outputs . It is natural to consider the system identification problem of learning the unknown system parameters from the system's outputs. We consider a new linear (student) RNN of the form:
(2)
with and train the parameters to fit the output of (1).
As is commonly done in machine learning, one may collect data and then optimize the empirical loss:
(3)
with the gradient descent algorithm, where is a convex loss function.
Therefore the following questions arise naturally:
•
Can gradient descent learn the target RNN in polynomial time and samples?
•
What kind of random initialization (for example, how large must the width be?) do we need to learn the target RNN?
These questions may look easy, since they are basic and important for the system identification problem. However, the loss is non-convex, and in fact even for SISO (single-input single-output, which means ) systems, the question is far from trivial. Only after the work in [7] was the SISO case solved. In fact, as pointed out in [8], although the widely used method in system identification is the EM algorithm [9], it is inherently non-convex and the EM method easily gets stuck in bad local minima.
One naive method is to optimize only the loss , which is convex. However, when is not accurately observed, for example when we can only observe and , where is white noise with variance at time , the result of optimizing may not be optimal for the entire sequence loss . In fact, a naive estimate from yields an estimate with error , but may yield an estimate with error . It is shown in [7] that, under some independence conditions on the inputs, SGD (stochastic gradient descent) converges to the global minimum of the maximum likelihood objective of an unknown linear time-invariant dynamical system from a sequence of noisy observations generated by the system, and that over-parameterization helps. However, their method relies heavily on the SISO property () of the system and on the condition that for different are i.i.d. Gaussian, and it cannot be generalized to systems with and . It remains open under which conditions SGD is guaranteed to find the global minimum of the linear RNN loss.
In this paper, we propose a new NTK-based method inspired by the work [10] and the authors' previous work [11] on non-linear RNNs. This method is completely different from that in [7], so we avoid the limitation that the method in [7] can only be used in the SISO case.
We show that if the width of the linear RNN (2) is large enough (polynomially large), SGD can provably learn any stable linear system with sample and time complexity polynomial in and independent of the input length , where is roughly the spectral radius (see Section III-A) of the transition matrix . Learning a linear RNN is a very important problem in system identification, and since gradient descent with random initialization is the most commonly used method in machine learning, we aim to understand this problem in a "machine-learning style". We believe this can provide some insight into the role of the recurrent structure in deep learning.
II Problem Formulation
We consider the target linear system with the form:
(4)
which is a stable linear system with for all and , where , . For a given convex loss function , we set the loss function
(5)
We define the global minimum as
with , is an absolute constant.
Let the sequences be samples i.i.d. drawn from
We consider a “student” linear system (RNN) to learn the target one. Let be the -time output of a linear RNN with input and parameters :
(6)
Our goal is to use and the samples to fit the empirical loss function while keeping the generalization error small.
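To make the setup concrete, the following is a minimal numerical sketch in assumed notation (the exact symbols of (4) and (6) are not reproduced above): a stable target system with state-transition matrix A, input matrix B, and output matrix C generates the training sequences, and a student linear RNN with parameters (W, U, V) is scored by a convex per-step loss summed along each sequence. All names and the squared loss are illustrative choices, not the paper's fixed notation.

import numpy as np

rng = np.random.default_rng(0)
d, d_in, d_out, T, N = 4, 3, 2, 30, 100

# Stable target system h_{t+1} = A h_t + B x_t, y_t = C h_t:
# rescale a random A so that its spectral radius is 0.8 < 1.
A = rng.normal(size=(d, d))
A *= 0.8 / np.max(np.abs(np.linalg.eigvals(A)))
B = rng.normal(size=(d, d_in))
C = rng.normal(size=(d_out, d))

def run_linear_system(W, U, V, xs):
    """Outputs of a linear system/RNN with parameters (W, U, V) on inputs xs."""
    h, ys = np.zeros(W.shape[0]), []
    for x in xs:
        h = W @ h + U @ x
        ys.append(V @ h)
    return np.array(ys)

# Training set: i.i.d. input sequences and the target system's outputs.
data = []
for _ in range(N):
    xs = rng.normal(size=(T, d_in))
    data.append((xs, run_linear_system(A, B, C, xs)))

def empirical_loss(W, U, V, data):
    """Empirical sequence loss with the (convex) squared per-step loss."""
    return np.mean([0.5 * np.sum((run_linear_system(W, U, V, xs) - ys) ** 2)
                    for xs, ys in data])

print(empirical_loss(A, B, C, data))  # zero when evaluated at the target parameters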
III Our Result
The main result is formulated in the PAC-learning setting as follows:
Theorem 1
(Informal)
Under the conditions in the last section, suppose the entries of in the student RNN are randomly initialized i.i.d. from . We use the SGD algorithm to optimize; are the -th step outputs of the SGD algorithm.
For any , and , there exist parameters and such that if , with probability at least , SGD can reach
(7)
in steps.
Remark III.1
Theorem 1 implies that gradient descent with linear RNNs can provably learn any stable linear system with iteration and sample complexity polynomial in . This result is consistent with the previous gradient-descent-based method [7] for learning SISO linear systems.
In our theorem, none of the parameters depend on the length . Note that if, at different times, are drawn i.i.d. from a distribution , is white noise, and is large enough,
Thus optimizing the loss for large is enough to predict the complete dynamic behavior of the target system.
III-A Scale of and the Comparison with Previous Results
In our main Theorem 1, the parameters are polynomial in . We should note that although we assume the target system is stable, this only means
In fact, supposing , in general we have (see, e.g., Corollary 3.15 in [12]):
When we set , can be very large, and in general we should set to make polynomial in .
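A small numerical illustration of this point (our own example, not taken from [12]): for a non-normal matrix A with spectral radius below one, the norms of the powers can transiently grow far above the corresponding power of the spectral radius before geometric decay takes over, which is why the bounds must be stated in terms of a quantity slightly larger than the spectral radius.

import numpy as np

# Non-normal 2x2 matrix: spectral radius 0.9, but large transient growth.
A = np.array([[0.9, 100.0],
              [0.0,   0.9]])
rho = np.max(np.abs(np.linalg.eigvals(A)))   # 0.9

for k in [1, 5, 10, 50, 100, 200]:
    Ak = np.linalg.matrix_power(A, k)
    print(k, np.linalg.norm(Ak, 2), rho ** k)
# ||A^k|| is roughly 100 * k * 0.9^(k-1), which dwarfs 0.9^k for moderate k,
# even though both quantities eventually decay geometrically.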
On the other hand, the scale of is closely related to the so-called “acquiescent” systems.
Definition 1
Let . A SISO -order linear system with the transfer function is called -acquiescent if
where
And we have
Lemma 2
(Lemma 4.4 in [7]) Suppose the target linear system is SISO and -acquiescent. For any ,
(8)
Thus in the SISO case, under the “acquiescent” conditions, our Theorem 1 reduces to the main result in [7].
Corollary 3
(Corresponding to Theorem 5.1 in [7])
Suppose a SISO linear system is -acquiescent. Then it is learnable in polynomial time with polynomially many samples.
And our condition is a natural generalization to MIMO systems.
III-B Our Techniques
Our proof technique is closely related to recent works on deep linear networks [3], non-linear networks with the neural tangent kernel [13, 14, 15], and non-linear RNNs [10, 16]. Similar to [13], we carefully upper and lower bound the eigenvalues of the Gram matrix throughout the optimization process using perturbation analysis. At the initialization point, we exploit the spectral properties of Gaussian random matrices. Using a linearization argument, we show that these properties hold throughout the trajectory of gradient descent. Then we only need to construct a solution near the random initialization, and the distance from this solution to the initialization can be bounded using the stability of the system.
Notation. For two matrices , we define . We define the asymptotic notations as follows. Let be two sequences. We write if , if , if and , and if there exists such that . The notations hide logarithmic factors in . and denote the spectral (2-)norm of matrices, denotes the 1-norm, and is the Frobenius norm.
IV Related Works
Deep Linear Networks. Provable properties of the loss surface of deep linear networks were first shown in [5]. In [3], it is shown that if the width of an -layer deep linear network is large enough (depending only on the output dimension, the rank, and the condition number of the input data), randomly initialized gradient descent optimizes deep linear networks in time polynomial in , and .
Moreover, in [4], linear ResNets are studied, and it is shown that gradient descent provably optimizes wide enough deep linear ResNets, where the required width does not depend on the number of layers.
Over-Parameterization. Non-linear networks with one hidden node are studied in [17] and [18]. These works show that, for a single-hidden-node ReLU network, under a very mild assumption on the input distribution, the loss is one-point convex in a very large region. However, for networks with multiple hidden nodes, the authors of [19] pointed out that spurious local minima are common, indicating that an over-parameterization assumption (the number of hidden nodes should be large) is necessary. Similarly, [7] showed that over-parameterization helps the training process of a linear dynamic system. Another important line of progress is the theory of the neural tangent kernel (NTK). NTK techniques for finite-width networks are studied in [15], [13], [10], [14], and [16].
Learning Linear System.
Prediction problems of time series for linear dynamical systems can be traced back to Kalman [20]. In the case where the system is unknown, the first polynomial running-time and sample-complexity guarantees for learning single-input single-output (SISO) systems with gradient descent were provided in [7]. For MIMO systems, it was shown in [21, 8] that the spectral filtering method can provably learn with polynomial running-time and sample-complexity guarantees.
V Problem Setup and Main Results
In this section, we introduce the basic problem setup and our main results.
Consider sequences and the corresponding labels in the data set, with and . We assume and omit them in the asymptotic notation. We study the linear RNN:
(9)
Assume is a convex and locally Lipschitz function: for any , when ,
Algorithm 1 Learning a Stable Linear System with SGD
Input: Sequences of data , learning rate , initialization parameter .
Initialization: The entries of and are i.i.d. generated from and . The entries of are i.i.d. generated from .
for do
,
Randomly sample a sequence and its label .
.
end for
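A minimal runnable sketch of Algorithm 1, under assumed shapes and the squared per-step loss (the exact parameterization, initialization scalings, and loss of the algorithm are not fully reproduced above, so the names W, U, V, the 1/sqrt(m) scalings, and the loss choice are assumptions): it performs SGD over randomly sampled sequences, with gradients computed by backpropagation through time.

import numpy as np

def init_params(m, d_in, d_out, rho, rng):
    """Random initialization; the rho/sqrt(m) and 1/sqrt(m) scalings are
    NTK-style assumptions, not necessarily the paper's exact choice."""
    W = rho * rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, m))   # recurrent weights
    U = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, d_in))      # input weights
    V = rng.normal(0.0, 1.0 / np.sqrt(m), size=(d_out, m))     # output weights
    return W, U, V

def sgd_step(params, xs, ys, eta):
    """One SGD step on the squared sequence loss; gradients via BPTT."""
    W, U, V = params
    T, m = xs.shape[0], W.shape[0]
    # Forward pass: h_0 = 0, h_{t+1} = W h_t + U x_t, prediction V h_{t+1}.
    hs = np.zeros((T + 1, m))
    for t in range(T):
        hs[t + 1] = W @ hs[t] + U @ xs[t]
    errs = hs[1:] @ V.T - ys                      # gradient of 0.5*||.||^2 per step
    gV = errs.T @ hs[1:]
    # Backward pass through time.
    gW, gU = np.zeros_like(W), np.zeros_like(U)
    grad_h = np.zeros(m)
    for t in reversed(range(T)):
        grad_h = V.T @ errs[t] + W.T @ grad_h     # dLoss/dh_{t+1}
        gW += np.outer(grad_h, hs[t])
        gU += np.outer(grad_h, xs[t])
    return W - eta * gW, U - eta * gU, V - eta * gV

def train(data, m, rho, eta, steps, seed=0):
    """Run SGD for `steps` iterations over randomly sampled (xs, ys) sequences."""
    rng = np.random.default_rng(seed)
    params = init_params(m, data[0][0].shape[1], data[0][1].shape[1], rho, rng)
    for _ in range(steps):
        xs, ys = data[rng.integers(len(data))]
        params = sgd_step(params, xs, ys, eta)
    return params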
In fact, we have
Theorem 4
Assume there is . Let , . Set the initialization parameter . Given an unknown distribution of sequences of , let , be the output of Algorithm 1.
For any small , there are parameters (to simplify symbols, we omit and in these asymptotic bounds on the parameters; one can show that the parameters are ultimately polynomial in and )
(11)
such that with probability at least , if , for any , the algorithm outputs satisfy:
(12)
where is the output from the linear system:
(13)
with for all , , , ,
Lemma 5
For the parameters of the last theorem, when and is small enough, we have the following results:
1.
2.
3.
4.
5.
As for the population loss, we have:
Theorem 6
(Rademacher complexity for RNN) Under the condition in Theorem 4, with probability at least ,
Since when is large enough, we have and . Therefore, from the above two theorems, we obtain the following corollary:
Corollary 7
Let the initialization parameter be . For any small , there is a parameter such that with probability at least , if , the algorithm outputs satisfy:
Remark V.1
In this paper, our results only assume for some constant :
(14)
This is a very mild condition on the loss function ; thus we do not assume the form of the noise in advance (for example, optimizing the square loss corresponds to the maximum likelihood objective under Gaussian noise, and the loss corresponds to that under Laplace noise), and our result applies not only to regression problems but also to classification problems. In this respect, this result improves upon the previous methods for learning linear dynamical systems in [7] and [8], which rely heavily on the form of the square loss.
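For concreteness, a few convex, locally Lipschitz per-step losses satisfying an assumption of this type (our own illustrative examples, not an exhaustive list from the paper): the squared loss corresponds to Gaussian-noise maximum likelihood, the l1 loss to Laplace noise, and the logistic loss covers binary classification.

import numpy as np

def square_loss(pred, target):        # Gaussian-noise maximum likelihood
    return 0.5 * np.sum((pred - target) ** 2)

def l1_loss(pred, target):            # Laplace-noise maximum likelihood
    return np.sum(np.abs(pred - target))

def logistic_loss(pred, label):       # binary classification, labels in {-1, +1}
    return np.sum(np.logaddexp(0.0, -label * pred))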
VI Preliminary Properties
Before proving Theorems 4 and 6, we need some properties of Gaussian random matrices and linear RNNs. The proofs are given in the Supplementary Materials.
To simplify symbols, in the remainder of the paper we set and (we omit in when it does not lead to misunderstanding).
Then
(15)
Then we only need to consider . The entries of are i.i.d. generated from . Let and .
VI-A Properties of Random Matrix
One of the key points in this paper is that, with high probability, the spectral radius of the matrix is less than , and when , . In fact, we have:
Lemma 8
With probability at least (it is larger than when ), there exists such that for all ,
and for all ,
where is an absolute constant.
Meanwhile, for all , with probability at least ,
In the rest of this paper, all the probabilities are considered under the condition in Lemma 8.
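A quick numerical check of the kind of random-matrix fact underlying Lemma 8 (an illustration of the standard concentration behavior, not the lemma itself): an m-by-m matrix with i.i.d. N(0, 1/m) entries has spectral norm concentrating near 2 as m grows, so a suitable rescaling of the recurrent weights keeps their powers under control.

import numpy as np

rng = np.random.default_rng(0)
for m in [100, 400, 1600]:
    M = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, m))
    print(m, np.linalg.norm(M, 2))   # spectral norm approaches 2 as m grows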
As a corollary, let , and when , we have
(16)
For ,
(17)
Combining all these results, we can show that the norm of is bounded:
For all
(18)
This is crucial for making the width independent of the length of the input sequences. Then we can show:
Lemma 9
Let .
For any , any with and , with probability at least , for any , with ,
(19)
Based on these results, we only need to consider the properties of for bounded . We have the following results:
Lemma 10
For any vector with , if , with probability at least :
(20)
for any and .
Lemma 11
If , with probability at least :
(21)
for any and .
Lemma 12
For any vector with , if , with probability at least , for any , :
Theorem 4 is a direct corollary of Theorems 15 and 16 below.
Theorem 15
Under the condition in Theorem 4,
for any with , let
with probability at least , the outputs of Algorithm 1 satisfy:
(28)
When the loss function is convex, one can easily see that Theorem 15 follows. In our case, the proof of Theorem 15 relies on the linearization Lemma 13. Lemma 13 says that when are small enough,
(29)
is nearly a linear function of and . This is the main step in the proof of Theorem 15.
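The shape of this linearization argument, written in generic notation (the precise statement and constants of Lemma 13 are not reproduced here), is a first-order expansion around the random initialization:
\[
f_{W_0+W',\,A_0+A'}(x) \;=\; f_{W_0,A_0}(x) \;+\; \big\langle \nabla f_{W_0,A_0}(x),\,(W',A')\big\rangle \;+\; \varepsilon(W',A'),
\]
where the remainder \(\varepsilon\) is negligible as long as the perturbations \(W',A'\) stay small relative to the width. Composing a convex loss with a nearly linear map keeps the objective nearly convex in \((W',A')\), so SGD can be analyzed almost as if it were solving a convex problem.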
Based on Theorem 15, to prove the main result we need to show that there exists a with and small, so that the target lies in the linearization domain. In fact, we have:
Theorem 16
Under the condition in Theorem 4, with probability at least , there exist with
(35)
and
To prove Theorem 16, assume the target stable system is
(36)
We construct
where
(37)
We can show that our , satisfy
using Lemma 12.
To show
,
we need Lemma 11. The detailed proof is in the Supplementary Materials.
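As a sanity check on representability (a simple block-embedding observation of our own, not the paper's construction, which additionally has to stay close to the random initialization): a width-m linear RNN can reproduce a d-dimensional target system exactly whenever m >= d.

import numpy as np

def embed_target(A, B, C, m):
    """Embed a d-dimensional target system (A, B, C) into a width-m linear RNN
    by placing it in the top-left block and zero-padding the rest."""
    d, d_in = B.shape
    d_out = C.shape[0]
    W = np.zeros((m, m)); W[:d, :d] = A
    U = np.zeros((m, d_in)); U[:d, :] = B
    V = np.zeros((d_out, m)); V[:, :d] = C
    return W, U, V   # the embedded RNN's outputs match the target exactly

The point of Theorem 16 is the much stronger statement that a suitable solution also exists within a small neighborhood of the random initialization.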
In order to prove the bound on the population risk, we need the following results on Rademacher complexity. These results can be found in Proposition A.12 of [22] and Section 3.8 of [23].
Theorem 17
Let be classes of functions and be a bounded, Lipschitz function. For samples i.i.d. drawn from a distribution , with probability at least , we have
(38)
where is defined as:
(39)
Meanwhile, for linear functions, the Rademacher complexity is easy to calculate:
where . Let be the -th orthogonal basis vector of . We only need to consider the Rademacher complexity of the function class:
(41)
This is a class of linear functions. We can write it as with fixed. Applying Lemma 9, we have
(42)
Then, combining all the above results with Theorem 17 and Theorem 18, we have
(43)
with probability at least . Our claim follows.
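For reference, the standard bound for a linear class that drives this computation, stated in generic notation rather than the paper's, is
\[
\mathcal{R}_S\big(\{x \mapsto \langle w, x\rangle : \|w\|_2 \le B\}\big)
= \mathbb{E}_{\sigma}\Big[\sup_{\|w\|_2\le B}\frac{1}{N}\sum_{i=1}^N \sigma_i \langle w, x_i\rangle\Big]
= \frac{B}{N}\,\mathbb{E}_{\sigma}\Big\|\sum_{i=1}^N \sigma_i x_i\Big\|_2
\le \frac{B\max_i\|x_i\|_2}{\sqrt{N}},
\]
where the \(\sigma_i\) are i.i.d. Rademacher signs; combined with the composition property in Theorem 17, bounds of this type yield the stated population-loss guarantee.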
Acknowledgement
This research was funded by the National Natural Science Foundation of China (U1936215), under the project "Methodologies and Key Technologies of Intelligent Detection and Tracing of APT Attacks".
IX Conclusion
We provided the first theoretical guarantee on learning linear RNNs with gradient descent. The required width of the hidden layer does not depend on the length of the input sequence and only depends on the transition matrix parameter . Under this condition, we showed that SGD can provably learn any stable linear system with a transition matrix satisfying , using many iterations and many samples. In this work, we found a suitable random initialization from which optimization with gradient descent provably succeeds. This solves an open problem in system identification and helps explain why SGD can optimize RNNs in practice. We hope this result can provide some insight into learning stable nonlinear dynamic systems with recurrent neural networks in deep learning.
References
[1]
H. T. Siegelmann and E. D. Sontag, “On the computational power of neural nets,” Journal of Computer and System Sciences, vol. 50, no. 1, pp. 132–150, 1995.
[2]
A. M. Saxe, J. L. Mcclelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” Computer Science, 2014.
[3]
S. S. Du and W. Hu, “Width provably matters in optimization for deep linear neural networks,” International conference on machine learning, 2019.
[4]
D. Zou, P. M. Long, and Q. Gu, “On the global convergence of training deep linear resnets,” International conference on learning representations, 2020.
[5]
K. Kawaguchi, “Deep learning without poor local minima,” in Advances in Neural Information Processing Systems, 2016.
[6]
D. Belanger and S. M. Kakade, “A linear dynamical system model for text,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, ser. JMLR Workshop and Conference Proceedings, F. R. Bach and D. M. Blei, Eds., vol. 37. JMLR.org, 2015, pp. 833–842. [Online]. Available: http://proceedings.mlr.press/v37/belanger15.html
[7]
M. Hardt, T. Ma, and B. Recht, “Gradient descent learns linear dynamical systems,” Journal of Machine Learning Research, vol. 19, 2016.
[8]
E. Hazan, H. Lee, K. Singh, C. Zhang, and Y. Zhang, “Spectral filtering for general linear dynamical systems,” in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Curran Associates, Inc., 2018.
[9]
S. Roweis and Z. Ghahramani, “A unifying review of linear gaussian models,” Neural Computation, 1999.
[11]
L. Wang, B. Shen, B. Hu, and X. Cao, “On the provable generalization of recurrent neural networks,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan, Eds., 2021, pp. 20 258–20 269. [Online]. Available: https://proceedings.neurips.cc/paper/2021/hash/a928731e103dfc64c0027fa84709689e-Abstract.html
[12]
D. A. Dowler, “Bounding the norm of matrix powers,” 2013.
[13]
S. S. Du, X. Zhai, B. Poczos, and A. Singh, “Gradient descent provably optimizes over-parameterized neural networks,” in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. [Online]. Available: https://openreview.net/forum?id=S1eK3i09YQ
[14]
Z. Allen-Zhu, Y. Li, and Z. Song, “A convergence theory for deep learning via over-parameterization,” in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 242–252. [Online]. Available: http://proceedings.mlr.press/v97/allen-zhu19a.html
[17]
Y. Tian, “Symmetry-breaking convergence analysis of certain two-layered neural networks with relu nonlinearity,” International conference on learning representations, 2017.
[18]
S. S. Du, J. D. Lee, and Y. Tian, “When is a convolutional filter easy to learn,” International conference on machine learning, 2018.
[19]
I. Safran and O. Shamir, “Spurious local minima are common in two-layer relu neural networks,” International conference on machine learning, pp. 4430–4438, 2018.
[20]
R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of Basic Engineering, 1960.
[21]
E. Hazan, K. Singh, and C. Zhang, “Learning linear dynamical systems via spectral filtering,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper/2017/file/165a59f7cf3b5c4396ba65953d679f17-Paper.pdf
Now we will prove Lemma 8. Similar to (54), for fixed , let . With probability at least , for all , we have
(55)
Then for fixed normalized orthogonal basis , , with probability at least , for all , ,
(56)
Any can be written as . When , we have .
Therefore with probability at least , for all , any
(57)
Since , , we have
(58)
Note that , so we can set larger than an absolute constant such that
(59)
And for
(60)
If , we can write it as and , . Then
(61)
Thus we have
(62)
For , note that for any unit vector , we can write , where each has at most non-zero coordinates and the sets of non-zero coordinates of and are disjoint when . In the space , we can use the -net argument, which says that with probability at least , for any ,