Lifu Wang, Tianyu Wang*, Shengwei Yi, Bo Shen, Bo Hu, Xing Cao
Corresponding author: Tianyu Wang: [email protected]. Lifu Wang, Tianyu Wang, and Shengwei Yi are with the China Information Technology Security Evaluation Center (CNITSEC), Beijing, China.
Bo Shen, Bo Hu, Xing Cao are with Beijing Jiaotong University, Beijing, China.
Abstract
We study the learning ability of linear recurrent neural networks trained with gradient descent. We prove the first theoretical guarantee that linear RNNs can learn any stable linear dynamic system using a large class of loss functions. For an arbitrary stable linear system with a parameter related to the transition matrix , we show that, despite the non-convexity of the parameter optimization loss, if the width of the RNN is large enough (and the required width of the hidden layer does not depend on the length of the input sequence), a linear RNN can provably learn any stable linear dynamic system with sample and time complexity polynomial in . Our results provide the first theoretical guarantee for learning a linear RNN and demonstrate how the recurrent structure can help to learn a dynamic system.
I Introduction
The recurrent neural network (RNN) is a very important structure in machine learning for dealing with sequence data. It is believed that, using the recurrent structure, RNNs can learn complicated transformations of data over extended periods. Non-linear RNNs have been proved to be Turing-complete [1] and can therefore simulate arbitrary procedures. However, training an RNN requires optimizing a highly non-convex function, which is very hard to solve. On the other hand, it is widely believed [2] that deep linear networks capture some important aspects of optimization in deep learning, and a series of recent papers study the properties of deep linear networks [3, 4, 5]. Meanwhile, learning a linear RNN is not only an important problem in system identification but also useful for language modeling in natural language processing [6]. In this paper, we study the non-convex optimization problem of learning linear RNNs.
Suppose there is a -order and -dimension linear system with the following form:
(1)
where and are unknown system parameters.
At time , this system outputs . It is natural to consider the system identification problem of learning the unknown system parameters from the system's outputs. We consider a new linear (student) RNN of the form:
(2)
with and train the parameters to fit the output of (1).
As is commonly done in machine learning, one may collect data and then optimize the empirical loss:
(3)
with the gradient descent algorithm, where is a convex loss function.
Therefore the following questions arise naturally:
•
Can gradient descent learn the target RNN in polynomial time and samples?
•
What kind of random initialization (for example, how large must the width be?) do we need to learn the target RNN?
These questions may look easy, since they are basic and important for the system identification problem. However, the loss is non-convex, and in fact even for SISO (single-input single-output, which means ) systems, the question is far from trivial. Only after the work in [7] was the SISO case solved. In fact, as pointed out in [8], although the widely used method in system identification is the EM algorithm [9], it is inherently non-convex and the EM method easily gets stuck in bad local minima.
One naive method is to optimize only the loss , which is convex. However, when is not accurately observed, for example when we can only observe and , where is white noise with variance at time , the result of optimizing may not be optimal for the entire sequence loss . In fact, a naive estimate from yields an estimate with error , but may yield an estimate with error . It is shown in [7] that, under some independence conditions on the inputs, SGD (stochastic gradient descent) converges to the global minimum of the maximum likelihood objective of an unknown linear time-invariant dynamical system from a sequence of noisy observations generated by the system, and that over-parameterization helps. However, their method relies heavily on the SISO property () of the system and on the condition that for different are i.i.d. Gaussian, and it cannot be generalized to systems with and . It remains open under which conditions SGD is guaranteed to find the global minimum of the linear RNN loss.
In this paper, we propose a new NTK-based method inspired by the work [10] and the authors' previous work [11] on non-linear RNNs. This method is completely different from that in [7], so we avoid the limitation that the method in [7] can only be used in the SISO case.
We show that if the width of the linear RNN (2) is large enough (polynomially large), SGD can provably learn any stable linear system with sample and time complexity polynomial in and independent of the input length , where is roughly the spectral radius (see Section III-A) of the transition matrix . Learning a linear RNN is a very important problem in system identification, and since gradient descent with random initialization is the most commonly used method in machine learning, we aim to understand this problem in a "machine-learning style". We believe this can provide some insight into the role of the recurrent structure in deep learning.
II Problem Formulation
We consider the target linear system with the form:
(4)
which is a stable linear system with for all and , where , . For a given convex loss function , we set the loss function
(5)
We define the global minimum as
with , is an absolute constant.
Let the sequences be samples i.i.d. drawn from
We consider a “student” linear system (RNN) to learn the target one. Let be the -time output of a linear RNN with input and parameters :
(6)
Our goal is to use and the samples to fit the empirical loss function while keeping the generalization error small.
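To make the setup concrete, the following is a minimal numerical sketch in assumed notation (the exact symbols of (4) and (6) are not reproduced above): a stable target system with state-transition matrix A, input matrix B, and output matrix C generates the training sequences, and a student linear RNN with parameters (W, U, V) is scored by a convex per-step loss summed along each sequence. All names and the squared loss are illustrative choices, not the paper's fixed notation.

import numpy as np

rng = np.random.default_rng(0)
d, d_in, d_out, T, N = 4, 3, 2, 30, 100

# Stable target system h_{t+1} = A h_t + B x_t, y_t = C h_t:
# rescale a random A so that its spectral radius is 0.8 < 1.
A = rng.normal(size=(d, d))
A *= 0.8 / np.max(np.abs(np.linalg.eigvals(A)))
B = rng.normal(size=(d, d_in))
C = rng.normal(size=(d_out, d))

def run_linear_system(W, U, V, xs):
    """Outputs of a linear system/RNN with parameters (W, U, V) on inputs xs."""
    h, ys = np.zeros(W.shape[0]), []
    for x in xs:
        h = W @ h + U @ x
        ys.append(V @ h)
    return np.array(ys)

# Training set: i.i.d. input sequences and the target system's outputs.
data = []
for _ in range(N):
    xs = rng.normal(size=(T, d_in))
    data.append((xs, run_linear_system(A, B, C, xs)))

def empirical_loss(W, U, V, data):
    """Empirical sequence loss with the (convex) squared per-step loss."""
    return np.mean([0.5 * np.sum((run_linear_system(W, U, V, xs) - ys) ** 2)
                    for xs, ys in data])

print(empirical_loss(A, B, C, data))  # zero when evaluated at the target parameters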
III Our Result
The main result is formulated in the PAC-learning setting as follows:
Theorem 1
(Informal)
Under the conditions in the last section, suppose the entries of in the student RNN are randomly initialized i.i.d. from . We use the SGD algorithm to optimize; are the -th step outputs of the SGD algorithm.
For any , and , there exist parameters and such that if , with probability at least , SGD can reach
(7)
in steps.
Remark III.1
Theorem 1 implies that gradient descent with linear RNNs can provably learn any stable linear system with iteration and sample complexity polynomial in . This result is consistent with the previous gradient-descent-based method [7] for learning SISO linear systems.
In our theorem, none of the parameters depend on the length . Note that if, at different times, are drawn i.i.d. from a distribution , is white noise, and is large enough,
Thus optimizing the loss for large is enough to predict the complete dynamic behavior of the target system.
III-A Scale of and the Comparison with Previous Results
In our main Theorem 1, the parameters are polynomial in . We should note that although we assume the target system is stable, this only means
In fact, supposing , in general we have (see, e.g., Corollary 3.15 in [12]):
When we set , can be very large, and in general we should set to make polynomial in .
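A small numerical illustration of this point (our own example, not taken from [12]): for a non-normal matrix A with spectral radius below one, the norms of the powers can transiently grow far above the corresponding power of the spectral radius before geometric decay takes over, which is why the bounds must be stated in terms of a quantity slightly larger than the spectral radius.

import numpy as np

# Non-normal 2x2 matrix: spectral radius 0.9, but large transient growth.
A = np.array([[0.9, 100.0],
              [0.0,   0.9]])
rho = np.max(np.abs(np.linalg.eigvals(A)))   # 0.9

for k in [1, 5, 10, 50, 100, 200]:
    Ak = np.linalg.matrix_power(A, k)
    print(k, np.linalg.norm(Ak, 2), rho ** k)
# ||A^k|| is roughly 100 * k * 0.9^(k-1), which dwarfs 0.9^k for moderate k,
# even though both quantities eventually decay geometrically.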
On the other hand, the scale of is closely related to the so-called “acquiescent” systems.
Definition 1
Let . A SISO -order linear system with the transfer function is called -acquiescent if
where
And we have
Lemma 2
(Lemma 4.4 in [7]) Suppose the target linear system is SISO and -acquiescent. For any ,
(8)
Thus in the SISO case, under the “acquiescent” conditions, our Theorem 1 reduces to the main result in [7].
Corollary 3
(Corresponding to Theorem 5.1 in [7])
Suppose a SISO linear system is -acquiescent. Then it is learnable in polynomial time with polynomially many samples.
And our condition is a natural generalization to MIMO systems.
III-B Our Techniques
Our proof technique is closely related to recent works on deep linear networks [3], non-linear networks with the neural tangent kernel [13, 14, 15], and non-linear RNNs [10, 16]. Similar to [13], we carefully upper and lower bound the eigenvalues of the Gram matrix throughout the optimization process using perturbation analysis. At the initialization point, we exploit the spectral properties of Gaussian random matrices. Using a linearization argument, we show that these properties hold throughout the trajectory of gradient descent. Then we only need to construct a solution near the random initialization, and the distance from this solution to the initialization can be bounded using the stability of the system.
Notation. For two matrices , we define . We define the asymptotic notations as follows. Let be two sequences. We write if , if , if and , and if there exists such that . The notations hide logarithmic factors in . and denote the spectral (2-)norm of matrices, denotes the 1-norm, and is the Frobenius norm.
IV Related Works
Deep Linear Networks. Provable properties of the loss surface of deep linear networks were first shown in [5]. In [3], it is shown that if the width of an -layer deep linear network is large enough (depending only on the output dimension, the rank, and the condition number of the input data), randomly initialized gradient descent optimizes deep linear networks in time polynomial in , and .
Moreover, in [4], linear ResNets are studied, and it is shown that gradient descent provably optimizes wide enough deep linear ResNets, where the required width does not depend on the number of layers.
Over-Parameterization. Non-linear networks with one hidden node are studied in [17] and [18]. These works show that, for a single-hidden-node ReLU network, under a very mild assumption on the input distribution, the loss is one-point convex in a very large region. However, for networks with multiple hidden nodes, the authors of [19] pointed out that spurious local minima are common, indicating that an over-parameterization assumption (the number of hidden nodes should be large) is necessary. Similarly, [7] showed that over-parameterization helps the training process of a linear dynamic system. Another important line of progress is the theory of the neural tangent kernel (NTK). NTK techniques for finite-width networks are studied in [15], [13], [10], [14], and [16].
Learning Linear System.
Prediction problems of time series for linear dynamical systems can be traced back to Kalman [20]. In the case where the system is unknown, the first polynomial running-time and sample-complexity guarantees for learning single-input single-output (SISO) systems with gradient descent were provided in [7]. For MIMO systems, it was shown in [21, 8] that the spectral filtering method can provably learn with polynomial running-time and sample-complexity guarantees.
V Problem Setup and Main Results
In this section, we introduce the basic problem setup and our main results.
Consider sequences and the corresponding labels in the data set, with and . We assume and omit them in the asymptotic notation. We study the linear RNN:
(9)
Assume is a convex and locally Lipschitz function: for any , when ,
Algorithm 1 Learning a Stable Linear System with SGD
Input: Sequences of data , learning rate , initialization parameter .
Initialization: The entries of and are i.i.d. generated from and . The entries of are i.i.d. generated from .
for do
,
Randomly sample a sequence and its label .
.
end for
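A minimal runnable sketch of Algorithm 1, under assumed shapes and the squared per-step loss (the exact parameterization, initialization scalings, and loss of the algorithm are not fully reproduced above, so the names W, U, V, the 1/sqrt(m) scalings, and the loss choice are assumptions): it performs SGD over randomly sampled sequences, with gradients computed by backpropagation through time.

import numpy as np

def init_params(m, d_in, d_out, rho, rng):
    """Random initialization; the rho/sqrt(m) and 1/sqrt(m) scalings are
    NTK-style assumptions, not necessarily the paper's exact choice."""
    W = rho * rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, m))   # recurrent weights
    U = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, d_in))      # input weights
    V = rng.normal(0.0, 1.0 / np.sqrt(m), size=(d_out, m))     # output weights
    return W, U, V

def sgd_step(params, xs, ys, eta):
    """One SGD step on the squared sequence loss; gradients via BPTT."""
    W, U, V = params
    T, m = xs.shape[0], W.shape[0]
    # Forward pass: h_0 = 0, h_{t+1} = W h_t + U x_t, prediction V h_{t+1}.
    hs = np.zeros((T + 1, m))
    for t in range(T):
        hs[t + 1] = W @ hs[t] + U @ xs[t]
    errs = hs[1:] @ V.T - ys                      # gradient of 0.5*||.||^2 per step
    gV = errs.T @ hs[1:]
    # Backward pass through time.
    gW, gU = np.zeros_like(W), np.zeros_like(U)
    grad_h = np.zeros(m)
    for t in reversed(range(T)):
        grad_h = V.T @ errs[t] + W.T @ grad_h     # dLoss/dh_{t+1}
        gW += np.outer(grad_h, hs[t])
        gU += np.outer(grad_h, xs[t])
    return W - eta * gW, U - eta * gU, V - eta * gV

def train(data, m, rho, eta, steps, seed=0):
    """Run SGD for `steps` iterations over randomly sampled (xs, ys) sequences."""
    rng = np.random.default_rng(seed)
    params = init_params(m, data[0][0].shape[1], data[0][1].shape[1], rho, rng)
    for _ in range(steps):
        xs, ys = data[rng.integers(len(data))]
        params = sgd_step(params, xs, ys, eta)
    return params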
In fact, we have
Theorem 4
Assume there is . Let , . Set the initialization parameter . Given an unknown distribution of sequences of , let , be the output of Algorithm 1.
For any small , there are parameters (to simplify symbols, we omit and in these asymptotic bounds on the parameters; one can show that the parameters are ultimately polynomial in and )
(11)
such that with probability at least , if , for any , the algorithm outputs satisfy:
(12)
where is the output from the linear system:
(13)
with for all , , , ,
Lemma 5
For the parameters of the last theorem, when and is small enough, we have the following results:
1.
2.
3.
4.
5.
As for the population loss, we have:
Theorem 6
(Rademacher complexity for RNN) Under the condition in Theorem 4, with probability at least ,
Since when is large enough, we have and . Therefore, from the above two theorems, we obtain the following corollary:
Corollary 7
Let the initialization parameter be . For any small , there is a parameter such that with probability at least , if , the algorithm outputs satisfy:
Remark V.1
In this paper, our results only assume for some constant :
(14)
This is a very mild condition on the loss function ; thus we do not assume the form of the noise in advance (for example, optimizing the square loss corresponds to the maximum likelihood objective under Gaussian noise, and the loss corresponds to that under Laplace noise), and our result applies not only to regression problems but also to classification problems. In this respect, this result improves upon the previous methods for learning linear dynamical systems in [7] and [8], which rely heavily on the form of the square loss.
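For concreteness, a few convex, locally Lipschitz per-step losses satisfying an assumption of this type (our own illustrative examples, not an exhaustive list from the paper): the squared loss corresponds to Gaussian-noise maximum likelihood, the l1 loss to Laplace noise, and the logistic loss covers binary classification.

import numpy as np

def square_loss(pred, target):        # Gaussian-noise maximum likelihood
    return 0.5 * np.sum((pred - target) ** 2)

def l1_loss(pred, target):            # Laplace-noise maximum likelihood
    return np.sum(np.abs(pred - target))

def logistic_loss(pred, label):       # binary classification, labels in {-1, +1}
    return np.sum(np.logaddexp(0.0, -label * pred))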
VI Preliminary Properties
Before proving Theorems 4 and 6, we need some properties of Gaussian random matrices and linear RNNs. The proofs are given in the Supplementary Materials.
To simplify symbols, in the remainder of the paper we set and (we omit in when it does not lead to misunderstanding).
Then
(15)
Then we only need to consider . The entries of are i.i.d. generated from . Let and .
VI-A Properties of Random Matrix
One of the key points in this paper is that, with high probability, the spectral radius of the matrix is less than , and when , . In fact, we have:
Lemma 8
With probability at least (it is larger than when ), there exists such that for all ,
and for all ,
where is an absolute constant.
Meanwhile, for all , with probability at least ,
In the rest of this paper, all the probabilities are considered under the condition in Lemma 8.
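A quick numerical check of the kind of random-matrix fact underlying Lemma 8 (an illustration of the standard concentration behavior, not the lemma itself): an m-by-m matrix with i.i.d. N(0, 1/m) entries has spectral norm concentrating near 2 as m grows, so a suitable rescaling of the recurrent weights keeps their powers under control.

import numpy as np

rng = np.random.default_rng(0)
for m in [100, 400, 1600]:
    M = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, m))
    print(m, np.linalg.norm(M, 2))   # spectral norm approaches 2 as m grows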
As a corollary, let , and when , we have
(16)
For ,
(17)
Combining all these results, we can show that the norm of is bounded:
For all
(18)
This is crucial for making the width independent of the length of the input sequences. Then we can show:
Lemma 9
Let .
For any , any with and , with probability at least , for any , with ,
(19)
Based on these results, we only need to consider the properties of for bounded . We have the following results:
Lemma 10
For any vector with , if , with probability at least :
(20)
for any and .
Lemma 11
If , with probability at least :
(21)
for any and .
Lemma 12
For any vector with , if , with probability at least , for any , :
Theorem 4 is a direct corollary of Theorems 15 and 16 below.
Theorem 15
Under the condition in Theorem 4,
for any with , let
with probability at least , the outputs of Algorithm 1 satisfy:
(28)
When the loss function is convex, one can easily see that Theorem 15 follows. In our case, the proof of Theorem 15 relies on the linearization Lemma 13. Lemma 13 says that when are small enough,
(29)
is nearly a linear function of and . This is the main step in the proof of Theorem 15.
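The shape of this linearization argument, written in generic notation (the precise statement and constants of Lemma 13 are not reproduced here), is a first-order expansion around the random initialization:
\[
f_{W_0+W',\,A_0+A'}(x) \;=\; f_{W_0,A_0}(x) \;+\; \big\langle \nabla f_{W_0,A_0}(x),\,(W',A')\big\rangle \;+\; \varepsilon(W',A'),
\]
where the remainder \(\varepsilon\) is negligible as long as the perturbations \(W',A'\) stay small relative to the width. Composing a convex loss with a nearly linear map keeps the objective nearly convex in \((W',A')\), so SGD can be analyzed almost as if it were solving a convex problem.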
Based on Theorem 15, to prove the main result we need to show that there exists a with and small, so that the target lies in the linearization domain. In fact, we have:
Theorem 16
Under the condition in Theorem 4, with probability at least , there exist with
(35)
and
To prove Theorem 16, assume the target stable system is
(36)
We construct
where
(37)
We can show that our , satisfy
using Lemma 12.
To show
,
we need Lemma 11. The detailed proof is in the Supplementary Materials.
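As a sanity check on representability (a simple block-embedding observation of our own, not the paper's construction, which additionally has to stay close to the random initialization): a width-m linear RNN can reproduce a d-dimensional target system exactly whenever m >= d.

import numpy as np

def embed_target(A, B, C, m):
    """Embed a d-dimensional target system (A, B, C) into a width-m linear RNN
    by placing it in the top-left block and zero-padding the rest."""
    d, d_in = B.shape
    d_out = C.shape[0]
    W = np.zeros((m, m)); W[:d, :d] = A
    U = np.zeros((m, d_in)); U[:d, :] = B
    V = np.zeros((d_out, m)); V[:, :d] = C
    return W, U, V   # the embedded RNN's outputs match the target exactly

The point of Theorem 16 is the much stronger statement that a suitable solution also exists within a small neighborhood of the random initialization.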
In order to prove the bound on the population risk, we need the following results on Rademacher complexity. These results can be found in Proposition A.12 of [22] and Section 3.8 of [23].
Theorem 17
Let be classes of functions and be a bounded, Lipschitz function. For samples i.i.d. drawn from a distribution , with probability at least , we have
(38)
where is defined as:
(39)
Meanwhile, for linear functions, the Rademacher complexity is easy to calculate:
where . Let be the -th orthogonal basis vector of . We only need to consider the Rademacher complexity of the function class:
(41)
This is a class of linear functions. We can write it as with fixed. Applying Lemma 9, we have
(42)
Then, combining all the above results with Theorem 17 and Theorem 18, we have
(43)
with probability at least . Our claim follows.
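For reference, the standard bound for a linear class that drives this computation, stated in generic notation rather than the paper's, is
\[
\mathcal{R}_S\big(\{x \mapsto \langle w, x\rangle : \|w\|_2 \le B\}\big)
= \mathbb{E}_{\sigma}\Big[\sup_{\|w\|_2\le B}\frac{1}{N}\sum_{i=1}^N \sigma_i \langle w, x_i\rangle\Big]
= \frac{B}{N}\,\mathbb{E}_{\sigma}\Big\|\sum_{i=1}^N \sigma_i x_i\Big\|_2
\le \frac{B\max_i\|x_i\|_2}{\sqrt{N}},
\]
where the \(\sigma_i\) are i.i.d. Rademacher signs; combined with the composition property in Theorem 17, bounds of this type yield the stated population-loss guarantee.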
Acknowledgement
This research was funded by the National Natural Science Foundation of China (U1936215), under the project "Methodologies and Key Technologies of Intelligent Detection and Tracing of APT Attacks".
IX Conclusion
We provided the first theoretical guarantee on learning linear RNNs with gradient descent. The required width of the hidden layer does not depend on the length of the input sequence and only depends on the transition matrix parameter . Under this condition, we showed that SGD can provably learn any stable linear system with a transition matrix satisfying , using many iterations and many samples. In this work, we found a suitable random initialization from which optimization with gradient descent provably succeeds. This solves an open problem in system identification and helps explain why SGD can optimize RNNs in practice. We hope this result can provide some insight into learning stable nonlinear dynamic systems with recurrent neural networks in deep learning.
References
[1]
H. T. Siegelmann and E. D. Sontag, “On the computational power of neural nets,” Journal of Computer and System Sciences, vol. 50, no. 1, pp. 132–150, 1995.
[2]
A. M. Saxe, J. L. Mcclelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” Computer Science, 2014.
[3]
S. S. Du and W. Hu, “Width provably matters in optimization for deep linear neural networks,” International conference on machine learning, 2019.
[4]
D. Zou, P. M. Long, and Q. Gu, “On the global convergence of training deep linear resnets,” International conference on learning representations, 2020.
[5]
K. Kawaguchi, “Deep learning without poor local minima,” in Advances in Neural Information Processing Systems, 2016.
[6]
D. Belanger and S. M. Kakade, “A linear dynamical system model for text,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, ser. JMLR Workshop and Conference Proceedings, F. R. Bach and D. M. Blei, Eds., vol. 37. JMLR.org, 2015, pp. 833–842. [Online]. Available: http://proceedings.mlr.press/v37/belanger15.html
[7]
M. Hardt, T. Ma, and B. Recht, “Gradient descent learns linear dynamical systems,” Journal of Machine Learning Research, vol. 19, 2016.
[8]
E. Hazan, H. Lee, K. Singh, C. Zhang, and Y. Zhang, “Spectral filtering for general linear dynamical systems,” in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Curran Associates, Inc., 2018.
[9]
S. Roweis and Z. Ghahramani, “A unifying review of linear gaussian models,” Neural Computation, 1999.
[11]
L. Wang, B. Shen, B. Hu, and X. Cao, “On the provable generalization of recurrent neural networks,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan, Eds., 2021, pp. 20 258–20 269. [Online]. Available: https://proceedings.neurips.cc/paper/2021/hash/a928731e103dfc64c0027fa84709689e-Abstract.html
[12]
D. A. Dowler, “Bounding the norm of matrix powers,” 2013.
[13]
S. S. Du, X. Zhai, B. Poczos, and A. Singh, “Gradient descent provably optimizes over-parameterized neural networks,” in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. [Online]. Available: https://openreview.net/forum?id=S1eK3i09YQ
[14]
Z. Allen-Zhu, Y. Li, and Z. Song, “A convergence theory for deep learning via over-parameterization,” in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 242–252. [Online]. Available: http://proceedings.mlr.press/v97/allen-zhu19a.html
[17]
Y. Tian, “Symmetry-breaking convergence analysis of certain two-layered neural networks with relu nonlinearity,” International conference on learning representations, 2017.
[18]
S. S. Du, J. D. Lee, and Y. Tian, “When is a convolutional filter easy to learn,” International conference on machine learning, 2018.
[19]
I. Safran and O. Shamir, “Spurious local minima are common in two-layer relu neural networks,” International conference on machine learning, pp. 4430–4438, 2018.
[20]
R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of Basic Engineering, 1960.
[21]
E. Hazan, K. Singh, and C. Zhang, “Learning linear dynamical systems via spectral filtering,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper/2017/file/165a59f7cf3b5c4396ba65953d679f17-Paper.pdf
Now we will prove Lemma 8. Similar to (54), for fixed , let . With probability at least , for all , we have
(55)
Then for fixed normalized orthogonal basis , , with probability at least , for all , ,
(56)
Any can be written as . When , we have .
Therefore with probability at least , for all , any
(57)
Since , , we have
(58)
Note that , so we can set larger than an absolute constant such that
(59)
And for
(60)
If , we can write it as and , . Then
(61)
Thus we have
(62)
For , note that for any unit vector , we can write , where each has at most non-zero coordinates and the sets of non-zero coordinates of and are disjoint when . In the space , we can use the -net argument, which says that with probability at least , for any ,