
Model Compression Method for S4 with Diagonal State Space Layers using Balanced Truncation

Haruka Ezoe and Kazuhiro Sato H. Ezoe and K. Sato are with the Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-8656, Japan, email: [email protected] (H. Ezoe), [email protected] (K. Sato)
Abstract

To implement deep learning models on edge devices, model compression methods are widely recognized as useful. However, it remains unclear which model compression methods are effective for Structured State Space Sequence (S4) models incorporating Diagonal State Space (DSS) layers, which are tailored for processing long-sequence data. In this paper, we propose a novel model compression method that applies balanced truncation, a prevalent model reduction technique in control theory, to the DSS layers of pre-trained S4 models. Moreover, we propose using the reduced model parameters obtained by balanced truncation as initial parameters of S4 models with DSS layers during the main training process. Numerical experiments demonstrate that models trained with our method surpass conventionally trained models with Skew-HiPPO initialization in accuracy, even with fewer parameters. Furthermore, we observe a positive correlation: higher accuracy of the original model consistently leads to higher accuracy of models trained with our compression method, suggesting that our approach effectively leverages the strengths of the original model.

Index Terms:
Balanced truncation, Deep learning, Diagonal state space model, Model compression

I Introduction

In recent years, deep learning models have garnered substantial attention due to their versatility across a range of applications, including sequence prediction, natural language translation, speech recognition, and audio generation [1, 2, 3]. These models’ ability to understand and predict sequential data underpins their success in these domains. A critical aspect of these models’ effectiveness is their capacity to capture dependencies between sequential data points, a fundamental requirement for achieving high levels of performance in tasks involving time series or sequential input. For instance, the Transformer [4] is effective in capturing short-range dependencies in sequential data, and achieved a state-of-the-art BLEU score of 41.0 on the WMT 2014 English-to-French translation task. However, the Transformer’s ability to capture long-range dependencies in time series data is limited, leading to the loss of temporal information due to its permutation-invariant self-attention mechanism [5].

Contrary to the limitations observed in the Transformer model, the Structured State Space Sequence (S4) model, introduced in [6], demonstrates exceptional capability in capturing long-range dependencies within sequential data. This effectiveness is largely attributed to the innovative use of HiPPO initialization [7], a technique specifically designed to enhance model performance by leveraging the principles of the state space model (SSM) from control theory. Notably, the S4 model has been shown to surpass conventional models, including the Transformer, on Long Range Arena (LRA) tasks [8], signifying a substantial advancement in handling sequential data. Further refinement of the S4 architecture led to the introduction of the Diagonal State Space (DSS) layer [9], a simplified yet effective version of the original S4 model that maintains its high performance with a more streamlined architecture. In addition to the original and simplified S4 models, several deep learning models related to the SSM, such as H3 [10], Hyena [11], S4D [12], S4ND [13], S5 [14], SSSD [15], and Mamba [16], have been proposed. Generally, in tasks requiring long-range dependency modeling, these deep learning models tend to perform better with a larger number of parameters.

However, deep learning models, which have a vast number of parameters, demand considerable computational resources for inference, thereby limiting their practical and sustainable use. For example, in Edge Intelligence (EI) [17, 18], data from individual devices are processed both in the cloud and locally on each device (Fig. 1). EI devices, such as sensors in factories, have limited computational resources and power consumption constraints. This limitation poses a challenge for performing inference using deep learning models with numerous parameters. Therefore, it is crucial to achieve optimal performance using models with fewer parameters and reduced computational costs in EI applications.

Refer to caption
Figure 1: Edge Intelligence (EI). In EI, data gathered from various devices is not processed entirely in the cloud but rather locally on each device. These devices, like sensors in industrial settings, face limitations that make it difficult to deploy large-scale deep learning models typically trained in the cloud, due to constraints such as computing resources and power consumption.

When considering the application of deep learning models in EI, the implementation of model compression techniques is essential [19, 20, 21, 22]. For example, the techniques include:

  • Pruning, which reduces the number of parameters in deep learning models [23, 24].

  • Quantization, which reduces the number of bits used to represent the weights and activations in deep learning models [25].

  • Knowledge distillation, involving training a smaller student model to replicate a larger teacher model [26, 27, 28, 29].

However, the effectiveness of these model compression techniques in deep models that incorporate SSMs remains unclear.

Thus, our goal is to provide a novel and effective model compression method tailored to S4 models with DSS layers, especially for deployment in EI scenarios. To achieve this, we exploit the fact that each DSS layer contains an SSM, which allows us to apply various well-established model reduction methods [30, 31, 32, 33, 34, 35, 36]. In this study, we employ the balanced truncation method [33], a widely used approach in control theory. To train the original model, we use Skew-HiPPO initialization [12, 9], which consistently outperforms random initialization.

The contributions of this paper are summarized as follows. To reduce computational costs during inference, we introduce a novel model compression method that applies the balanced truncation technique to DSS layers in pre-trained S4 models. Moreover, we propose using the reduced model parameters obtained by the balanced truncation as initial parameters of S4 models with DSS layers during the main training process. As demonstrated in Section VI, our trained models combined with the balanced truncation achieved superior accuracy on LRA tasks compared to conventionally trained models using Skew-HiPPO initialization as described in [12, 9], even with fewer parameters. While [37] reports that dimension reduction has minimal impact on the performance of the MultiHyena variant of the Hyena model, our findings with the S4 model reveal a critical distinction: reduction combined with retraining yields a significant performance improvement.

The paper is organized as follows. In Section II, we introduce the balanced truncation method for the reduction of state space models and explain the HiPPO matrix used in Skew-HiPPO initialization. In Section III, we present a deep learning model with DSS layers. Section IV describes existing training methods for this model, and discusses the model’s computational cost during inference. To address the issue of computational cost, in Section V, we propose a model compression method using the balanced truncation method for SSMs. In Section VI, the results of numerical experiments are presented. Finally, in Section VII, we discuss the effectiveness of the proposed method based on the results of the numerical experiments, clarify the limitations of our work, and outline future work.

Notation

  • \mathbb{R} and \mathbb{C} denote the sets of real and complex numbers, respectively.

  • For a\in\mathbb{C}, |a| denotes the absolute value of a.

  • For v=(v_{i})\in\mathbb{C}^{N}, \|v\| denotes the Euclidean norm of v, i.e., \|v\|:=\sqrt{|v_{1}|^{2}+\cdots+|v_{N}|^{2}}.

  • A^{\top} and A^{*} denote the transpose and complex conjugate transpose of a matrix A\in\mathbb{C}^{n\times n}, respectively.

  • f(x)\mid_{x\leq a} denotes the restriction of a function f(x), defined on x\geq 0, to the domain [0,a].

  • \mathrm{deg}(f) represents the degree of a polynomial f(x).

  • \mathrm{i} denotes the imaginary unit.

  • \mathrm{diag}(\lambda_{1},\cdots,\lambda_{N})\in\mathbb{C}^{N\times N} is the diagonal matrix with \lambda_{1},\cdots,\lambda_{N}\in\mathbb{C} as diagonal elements.

  • \mathrm{Re}(\alpha) and \mathrm{Im}(\alpha) are the real and imaginary parts of \alpha\in\mathbb{C}, respectively.

  • a*b represents the Hadamard (elementwise) product of vectors a and b.

  • \mathbb{I}_{[a,b]}(x)=\begin{cases}1&\text{if }a\leq x\leq b\\ 0&\text{otherwise}\end{cases} is the indicator function of [a,b].

  • \mathcal{U}(a,b) denotes the uniform distribution on (a,b).

  • \mathcal{N}(\mu,\sigma) is the normal distribution with mean \mu and variance \sigma.

II Preliminaries

In this section, we introduce a state space model (SSM), an important component of the DSS layer. We also present the balanced truncation method [33], a reduction method for SSMs employed in our training method. Furthermore, we explain the HiPPO matrix [6] utilized in Skew-HiPPO initialization [12, 9], which enhances the performance of trained models.

II-A State space model (SSM)

A crucial component of a deep learning model discussed in our study, as outlined in Section III, is the hidden layer known as the DSS layer. This layer is defined by using the SSM

\begin{cases}\frac{\mathrm{d}x}{\mathrm{d}t}(t)=Ax(t)+Bu(t)\\ y(t)=Cx(t),\end{cases} (1)

where u(t)\in\mathbb{R}, y(t)\in\mathbb{C}, and x(t)\in\mathbb{C}^{N} denote the input, output, and state, respectively, and (A,B,C)\in\mathbb{C}^{N\times N}\times\mathbb{C}^{N\times 1}\times\mathbb{C}^{1\times N}. The matrix A is the state transition matrix, which describes the internal dynamics governing the time evolution of the state x. The matrix B is the input matrix, which describes how the external input u affects the state x. The matrix C is the output matrix, which describes how the state x is mapped to the observable output y.

Notably, SSM (1) is a single-input and single-output system, and the matrices A, B, C, the state x(t), and the output y(t) are all complex-valued. Moreover, the state dimension N is relatively large, unlike the standard setting in control theory [38].

II-B Balanced truncation method

To address computational issues arising from the large state dimension of SSM (1), we consider using the balanced truncation method [33], a reduction technique. This method exploits the controllability and observability of SSM (1) to derive another SSM with dimension r\,(\leq N) that gives almost the same output as (1):

\begin{cases}\frac{\mathrm{d}x_{r}}{\mathrm{d}t}(t)=A_{r}x_{r}(t)+B_{r}u(t)\\ y_{r}(t)=C_{r}x_{r}(t),\end{cases} (2)

where A_{r}\in\mathbb{C}^{r\times r}, B_{r}\in\mathbb{C}^{r\times 1}, C_{r}\in\mathbb{C}^{1\times r}. Below is a brief explanation of the balanced truncation method. For more detailed information, refer to Appendix A.

For SSM (1) that is asymptotically stable, i.e., all eigenvalues of the matrix A have negative real parts, the controllability Gramian P and observability Gramian Q are defined as

P\coloneqq\int_{0}^{\infty}\exp(At)BB^{*}\exp(A^{*}t)\mathrm{d}t, (3)
Q\coloneqq\int_{0}^{\infty}\exp(A^{*}t)C^{*}C\exp(At)\mathrm{d}t. (4)

These are the unique solutions to the Lyapunov equations

AP+PA^{*}+BB^{*}=0, (5)
A^{*}Q+QA+C^{*}C=0, (6)

as shown in [38, Theorem 4.1].
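For the diagonal SSMs used in the DSS layers (see Section III), the Lyapunov equations (5) and (6) admit simple elementwise closed-form solutions. The following sketch, written in Python with NumPy purely for illustration (the function name and variable names are ours, not from the original implementation), computes P and Q for a diagonal A:

import numpy as np

def gramians_diagonal(lam, B, C):
    # Controllability and observability Gramians of a diagonal SSM.
    # lam : (N,) eigenvalues of A = diag(lam), assumed Re(lam) < 0
    # B   : (N,) input vector, C : (N,) output vector
    # Solves A P + P A^* + B B^* = 0 and A^* Q + Q A + C^* C = 0 elementwise:
    #   P_ij = -B_i conj(B_j) / (lam_i + conj(lam_j))
    #   Q_ij = -conj(C_i) C_j / (conj(lam_i) + lam_j)
    denom = lam[:, None] + lam[None, :].conj()
    P = -np.outer(B, B.conj()) / denom
    Q = -np.outer(C.conj(), C) / denom.conj()
    return P, Q

For a non-diagonal A, the same Gramians could instead be obtained with a general Lyapunov solver such as scipy.linalg.solve_continuous_lyapunov.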

In the balanced truncation method, the subspace spanned by eigenvectors corresponding to small eigenvalues of the controllability Gramian P and the observability Gramian Q is discarded. The minimum input energy required to achieve \lim_{t\to\infty}x(t)=x_{f} from the initial condition x(0)=0 is expressed using the controllability Gramian P as

\int_{0}^{\infty}\|u(t)\|^{2}\mathrm{d}t=x_{f}^{*}P^{-1}x_{f}. (7)

Thus, eigenvectors corresponding to smaller eigenvalues of P correspond to directions in the state space that are less influenced by the input u. On the other hand, when x(0)=x_{0} and u(t)=0 for t\geq 0, the output energy can be expressed using the observability Gramian Q as

\int_{0}^{\infty}\|y(t)\|^{2}\mathrm{d}t=x_{0}^{*}Qx_{0}. (8)

Thus, eigenvectors corresponding to smaller eigenvalues of Q correspond to directions in the state space that have less impact on the output y.

A coordinate transformation \bar{x}(t)=Tx(t) is applied to SSM (1) to obtain another SSM whose controllability and observability Gramians coincide and equal a diagonal matrix \Sigma:

\begin{cases}\frac{\mathrm{d}\bar{x}}{\mathrm{d}t}(t)=TAT^{-1}\bar{x}(t)+TBu(t)\\ y(t)=CT^{-1}\bar{x}(t).\end{cases} (9)

SSM (9) is referred to as the balanced realization of SSM (1). Here, the diagonal elements of \Sigma are denoted as \sigma_{1},\cdots,\sigma_{N}, which are called the Hankel singular values and satisfy \sigma_{1}\geq\cdots\geq\sigma_{N}>0 under the assumption that SSM (1) is controllable and observable. By partitioning the matrices TAT^{-1}, TB, CT^{-1} in SSM (9) as

TAT^{-1}=\begin{pmatrix}A_{11}&A_{12}\\ A_{21}&A_{22}\end{pmatrix}, (10)
TB=\begin{pmatrix}B_{1}\\ B_{2}\end{pmatrix}, (11)
CT^{-1}=\begin{pmatrix}C_{1}&C_{2}\end{pmatrix}, (12)

we define the parameters (A_{r},B_{r},C_{r}) of reduced model (2) as

(A_{r},B_{r},C_{r})=(A_{11},B_{1},C_{1}). (13)

The resulting system (2) with (13) can be interpreted as a reduced SSM of SSM (1), obtained by truncating the part of the state space associated with the smaller Hankel singular values \sigma_{r+1},\cdots,\sigma_{N}, i.e., the subspace spanned by eigenvectors that are less influenced by the input or have less influence on the output. Moreover, if SSM (1) is asymptotically stable, the reduced SSM (2) with (13) is also asymptotically stable, as shown in [38, Proposition 4.15].

II-C HiPPO matrix

For the model explained in Section III, the parameters of the matrices (A,B,C) in SSM (1) are trained using a suitable optimization algorithm. The initialization of the matrix A significantly influences the performance of trained models, as it sets the starting point of the optimization process. The High-order Polynomial Projection Operators (HiPPO) matrix [7] is derived from a method for online compression of continuous signals using projections onto subspaces spanned by polynomial bases. It is well established that the HiPPO matrix is an effective choice for the initial A [6, 12, 9].

The derivation of the HiPPO matrix is explained below. With respect to a measure \mu on [0,\infty), let

L_{2}(\mu)\coloneqq\Bigl\{f:[0,\infty)\rightarrow\mathbb{C}\Bigm{|}f~\mathrm{is~measurable},~\int_{0}^{\infty}|f(\tau)|^{2}\mathrm{d}\mu(\tau)<\infty\Bigr\}. (14)

The inner product and norm on L_{2}(\mu) are defined as

\langle f_{1},f_{2}\rangle_{\mu}\coloneqq\int_{0}^{\infty}f_{1}^{*}(\tau)f_{2}(\tau)\mathrm{d}\mu(\tau), (15)
\lVert f\rVert_{L_{2}(\mu)}\coloneqq\langle f,f\rangle_{\mu}^{1/2}, (16)

respectively.

For an input signal u(t) defined on t\geq 0, the history u_{\leq t}\coloneqq u(\tau)\mid_{\tau\leq t} at each time t>0 is approximated by projecting it onto a subspace spanned by polynomial bases, and the corresponding coefficient vector x(t) represents the history of the input signal. This compression is useful because storing u_{\leq t} directly requires a significant amount of memory. Thus, at each time t, x(t) contains sufficient information to reconstruct u_{\leq t}, while requiring less memory than storing u_{\leq t} directly.

The vector x(t) is expressed as the optimal solution to a convex optimization problem, which is defined by a measure \mu^{(t)} on (-\infty,t] and an orthogonal polynomial basis \{g_{n}^{(t)}\}_{1\leq n\leq N} of a subspace of L_{2}(\mu^{(t)}) (i.e., \mathrm{deg}(g_{i}^{(t)})=i~(i=1,\cdots,N) and \langle g_{i}^{(t)},g_{j}^{(t)}\rangle_{\mu^{(t)}}=0~(i\neq j)). The optimization problem is

\begin{split}&\min_{x(t)\in\mathbb{R}^{N}}\quad\lVert u_{\leq t}-g\rVert_{L_{2}(\mu^{(t)})}^{2}\\ &\textrm{subject\,\,to}\quad g(\tau)=\sum_{k=1}^{N}x_{k}(t)g^{(t)}_{k}(\tau).\end{split} (17)

If \{g_{n}^{(t)}\}_{1\leq n\leq N} is a normalized orthogonal basis, i.e., \lVert g_{i}^{(t)}\rVert_{L_{2}(\mu^{(t)})}=1~(i=1,\cdots,N), the optimal solution is given by

x_{k}(t)=\langle u_{\leq t},g^{(t)}_{k}\rangle_{\mu^{(t)}}. (18)

The vector x(t) defines the approximation g=\sum_{k=1}^{N}x_{k}(t)g^{(t)}_{k} of u_{\leq t}; thus it retains the information necessary for reconstructing the history u_{\leq t} of the input u(t) at time t. This property of memorizing the input history in the state vector is useful for modeling long sequential data, since capturing dependencies in sequential data requires referencing information from previous inputs to compute each output at every time step. Moreover, by Equation (18), the measure \mu^{(t)} represents the importance of each time step when compressing the history u_{\leq t}. This x(t) satisfies the differential equation

\frac{\mathrm{d}x}{\mathrm{d}t}(t)=A(t)x(t)+B(t)u(t), (19)

where A(t)\in\mathbb{R}^{N\times N} and B(t)\in\mathbb{R}^{N\times 1} depend on the polynomial basis \{g^{(t)}_{n}\}_{1\leq n\leq N} and the measure \mu^{(t)}. Unlike SSM (1) introduced in Subsection II-A, A(t) and B(t) are time-dependent.

For HiPPO-LegS, a variant of HiPPO [7], the measure \mu^{(t)} is defined as the scaled Legendre (LegS) measure

\mu^{(t)}=\frac{1}{t}\mathbb{I}_{[0,t]}. (20)

This assigns uniform importance to the entire history [0,t] at each time t. Furthermore, the polynomial basis g^{(t)}_{n}(\tau) is the normalized orthogonal basis

g^{(t)}_{n}(\tau)=g_{n}(t,\tau)=(2n+1)^{1/2}P_{n}\left(\frac{2\tau}{t}-1\right), (21)

where P_{n}(\alpha) is the Legendre polynomial

P_{n}(\alpha)=\frac{1}{2^{n}\cdot n!}\frac{\mathrm{d}^{n}}{\mathrm{d}\alpha^{n}}(\alpha^{2}-1)^{n}. (22)

In this case, as shown in [7, Appendix D.3], x(t) satisfies

\frac{\mathrm{d}x}{\mathrm{d}t}(t)=\frac{1}{t}Ax(t)+\frac{1}{t}Bu(t) (23)
A_{nk}=-\begin{cases}(2n+1)^{1/2}(2k+1)^{1/2}&\text{if}\quad n>k\\ n+1&\text{if}\quad n=k\\ 0&\text{if}\quad n<k\end{cases} (24)
B_{n}=(2n+1)^{1/2}. (25)

The matrix A of Equation (24) is called the HiPPO matrix. For an SSM incorporating the HiPPO matrix, the state vector x(t) retains information about the history u_{\leq t} of the input u at each time t [7].
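As a concrete illustration of Equations (24) and (25), the following Python/NumPy sketch (ours, assuming the indices n, k in (24)-(25) start at 0) builds the HiPPO matrix A and the vector B for a given state dimension N:

import numpy as np

def hippo_legs(N):
    # HiPPO-LegS matrix A (Eq. (24)) and vector B (Eq. (25)), 0-based indexing assumed.
    n = np.arange(N)
    sq = np.sqrt(2 * n + 1)
    A = -np.tril(np.outer(sq, sq), k=-1) - np.diag(n + 1)   # lower part and diagonal of (24)
    B = sq.copy()                                           # B_n = (2n+1)^{1/2}
    return A, B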

III Deep learning model employed in this study

The deep learning model employed in this study, proposed in [9], has the structure illustrated in Fig. 2. In this paper, the model is referred to as S4 with DSS layers, despite being named DSS by the authors of [9]. The input layer receives sequential data and outputs H features as 1-dimensional sequential data. This conversion adapts data from various formats to the DSS layer's input format, as detailed in Subsection III-A. The term H denotes the hidden size, representing the number of features processed by the DSS layer. Finally, the output layer converts the H features of 1-dimensional sequential data into the model's final output format. For further details, refer to Appendix B.

Refer to caption
Figure 2: Deep learning model with DSS layers. This represents the overall architecture of the deep learning model used in this study, with the intermediate DSS layer being the most critical component.

III-A Diagonal State Space layer

The most important component of the deep learning model illustrated in Fig. 2 is the DSS layer. As shown in Fig. 3, the DSS layer consists of:

  • H independent DSS models.

  • Nonlinear connection blocks.

  • Linear combination block.

The details of each of these components are explained below.

By restricting the matrix A\in\mathbb{C}^{N\times N} in (1) to be diagonal, and assuming that its diagonal elements do not lie on the imaginary axis, the DSS model is defined by the following discretization with a sample time \Delta\in\mathbb{R}_{>0}:

\begin{cases}x_{k+1}=\bar{A}x_{k}+\bar{B}u_{k}\\ y_{k}=\bar{C}x_{k},\end{cases} (26)

where \bar{A}=e^{A\Delta}, \bar{B}=(\bar{A}-I)A^{-1}B, and \bar{C}=C. The diagonal elements of the matrix \bar{A} do not lie on the unit circle in the complex plane, due to the assumption on the matrix A.
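Since A is diagonal, the discretization in (26) can be computed elementwise. The sketch below (Python/NumPy, ours; names are hypothetical) forms \bar{A}, \bar{B}, \bar{C} from (A,B,C,\Delta) and runs the recurrence one step at a time:

import numpy as np

def discretize_dss(lam, B, C, dt):
    # Discretization of a diagonal SSM as in Eq. (26).
    Abar = np.exp(lam * dt)              # diagonal of e^{A*Delta}
    Bbar = (Abar - 1.0) / lam * B        # (e^{A*Delta} - I) A^{-1} B, elementwise
    Cbar = C
    return Abar, Bbar, Cbar

def dss_step(x, u, Abar, Bbar, Cbar):
    # One recurrent step of (26): y_k = Cbar x_k, then x_{k+1} = Abar x_k + Bbar u_k.
    y = np.dot(Cbar, x)
    x_next = Abar * x + Bbar * u
    return x_next, y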

The nonlinear connection block receives the input u_{k} and output y_{k} of the DSS model and outputs the 1-dimensional sequential data

y^{\prime}_{k}=\mathrm{GELU}(\mathrm{Re}(y_{k})+Du_{k}), (27)

where D\in\mathbb{R}, and GELU [39] is a nonlinear activation function expressed as

\mathrm{GELU}(\alpha)=\alpha\Phi(\alpha). (28)

Here, \Phi(\alpha) is the cumulative distribution function of the standard normal distribution. This nonlinearity is expected to enhance the performance of the model. Note that \mathrm{Re}(y_{k}) is used in Equation (27) since y_{k} can be a complex number.

Finally, the H 1-dimensional sequential data (y_{k}^{\prime(h)})_{1\leq h\leq H} output by the nonlinear connection blocks are mixed to obtain the final output of the DSS layer, which consists of H 1-dimensional sequential data (u_{k}^{\prime(h)})_{1\leq h\leq H}. With a weight W_{out}\in\mathbb{R}^{H\times H} and a bias b\in\mathbb{R}^{H} as parameters, the output is expressed as

u_{k}^{\prime(h)}=\sum_{h^{\prime}=1}^{H}W_{out,hh^{\prime}}y_{k}^{\prime(h^{\prime})}+b_{h}\cdot\bm{1}, (29)

where \bm{1} is a 1-dimensional sequence of the same length as y_{k}^{\prime(h)} with all elements equal to 1.

Refer to caption
Figure 3: DSS layer, which consists of H DSS models, nonlinear connection blocks, and a linear combination block.
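The nonlinear connection block (27) and the linear combination block (29) amount to a few elementwise and matrix operations. The following sketch (ours, Python/NumPy with SciPy for the normal CDF; a PyTorch version would be analogous) illustrates them for H features of length L:

import numpy as np
from scipy.stats import norm

def gelu(a):
    # GELU(a) = a * Phi(a), with Phi the standard normal CDF (Eq. (28)).
    return a * norm.cdf(a)

def dss_layer_mix(y, u, D, W_out, b):
    # y : (H, L) complex outputs of the H DSS models
    # u : (H, L) real inputs of the H DSS models
    # D : (H,) skip weights, W_out : (H, H), b : (H,)
    y_prime = gelu(y.real + D[:, None] * u)     # nonlinear connection, Eq. (27)
    u_prime = W_out @ y_prime + b[:, None]      # linear combination across features, Eq. (29)
    return u_prime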

III-B DSS_{\text{EXP}} and DSS_{\text{SOFTMAX}}

The output of DSS (26) can be calculated as

y_{k}=\sum_{i=0}^{k}h_{i}\cdot u_{k-i} (30)

with h_{i}\coloneqq\bar{C}\bar{A}^{i}\bar{B}, which is referred to as the impulse response of DSS (26). Given a sample time \Delta, h_{i} is determined by (A,B,C), and different sets of parameters (A,B,C) may result in the same sequence of impulse responses. In fact, the kernel

\bar{K}_{\Delta,L}(A,B,C)\coloneqq(h_{0},h_{1},\cdots,h_{L-1}) (31)

can be identical for different (A,B,C), as described below [9].

Proposition 1.

Suppose that the parameters A=\mathrm{diag}(\lambda_{1},\cdots,\lambda_{N}), B, C, \Delta of DSS (26) are given, and define K:=\bar{K}_{\Delta,L}(A,B,C)\in\mathbb{C}^{L}. Then, there exist w,\tilde{w}\in\mathbb{C}^{1\times N} satisfying the following equations:

  (a) K=\bar{K}_{\Delta,L}(A,(1)_{1\leq i\leq N},\tilde{w})
      =\tilde{w}\cdot A^{-1}(e^{A\Delta}-I)\cdot\mathrm{elementwise\text{-}exp}(P)

  (b) K=\bar{K}_{\Delta,L}(A,((e^{L\lambda_{i}\Delta}-1)^{-1})_{1\leq i\leq N},w)
      =w\cdot A^{-1}\cdot\mathrm{row\text{-}softmax}(P)

where

\mathrm{elementwise\text{-}exp}(P)=\left(\exp(P_{ik})\right)_{1\leq i\leq N,\,0\leq k<L}, (32)
\mathrm{row\text{-}softmax}(P)=\left(\frac{\exp(P_{ik})}{\Sigma_{r=0}^{L-1}\exp(P_{ir})}\right)_{1\leq i\leq N,\,0\leq k<L}, (33)
P_{i,k}=\lambda_{i}k\Delta. (34)

Proposition 1 implies that, under weak assumptions, the impulse responses h_{0},h_{1},\cdots,h_{L-1} of DSS (26) can be realized with a special structure of (A,B,C). DSS (26) with (B,C)=((1)_{1\leq i\leq N},w), as stated in Proposition 1(a), is referred to as DSS_{\text{EXP}}, and DSS (26) with (B,C)=(((e^{L\lambda_{i}\Delta}-1)^{-1})_{1\leq i\leq N},w), as stated in Proposition 1(b), is referred to as DSS_{\text{SOFTMAX}} [9]. DSS_{\text{EXP}} and DSS_{\text{SOFTMAX}} offer different approaches to modeling the impulse responses, with potential implications for the performance and interpretability of the DSS model. In the following sections, we utilize DSS_{\text{EXP}} or DSS_{\text{SOFTMAX}} as DSS (26).
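For concreteness, the kernel K in Proposition 1(a) can be evaluated directly from (32) and (34). The sketch below (Python/NumPy, ours; the function name is hypothetical) computes the DSS_{\text{EXP}} kernel; the DSS_{\text{SOFTMAX}} case only differs in using the row-softmax of P and the modified B of Proposition 1(b).

import numpy as np

def dss_exp_kernel(lam, w_tilde, dt, L):
    # Kernel of DSS_EXP per Proposition 1(a):
    #   K = w_tilde . A^{-1}(e^{A dt} - I) . elementwise-exp(P),  P_{ik} = lam_i * k * dt.
    # lam, w_tilde : (N,) complex arrays; dt : sample time; L : kernel length.
    P = lam[:, None] * dt * np.arange(L)[None, :]      # (N, L), Eq. (34)
    scale = (np.exp(lam * dt) - 1.0) / lam             # A^{-1}(e^{A dt} - I), elementwise
    return (w_tilde * scale) @ np.exp(P)               # (L,), Eq. (32)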

IV Existing training methods and limitations

In the training of deep learning models, the goal is to minimize the loss function E(W) with respect to the training dataset \{(\chi_{i},d_{i})\}. Here, (\chi_{i},d_{i}) represents a pair of an input \chi_{i} and its desired output d_{i}, and W denotes the parameters of the model. The model's output for an input \chi with parameters W is denoted as \zeta(\chi;W). For each input \chi_{i}, a loss function E_{i}(W) is defined to measure the difference between the desired output d_{i} and the model's output \zeta(\chi_{i};W). The loss function for the entire training dataset is expressed as E(W)=\Sigma_{i}E_{i}(W), where the summation is over all training examples. As an algorithm for minimizing the loss function E(W), we can consider using AdamW [40].

IV-A Training parameters within the Diagonal State Space layer

Among the parameters W trained in our deep learning model, those in the DSS layer include:

  • (A,B,C,\Delta) for each of the H DSS models, as defined in (26).

  • D for each of the H nonlinear connection blocks, as defined in (27).

  • The weight W_{out}\in\mathbb{R}^{H\times H} and bias b\in\mathbb{R}^{H} of the linear combination block, as defined in (29).

For DSS_{\text{EXP}} defined in Subsection III-B, the parameters (A,B,C) are defined as

\begin{split}&A=\mathrm{diag}(\lambda_{1},\cdots,\lambda_{N}),\\ &B=(1)_{1\leq i\leq N},\\ &C=w,\end{split} (35)

where

\lambda_{i}=-\exp(\Lambda_{re,i})+\mathrm{i}\cdot\Lambda_{im,i}. (36)

For DSS_{\text{EXP}}, the parameters \Lambda_{re},\Lambda_{im}\in\mathbb{R}^{N} and w\in\mathbb{C}^{N} are trained to determine (A,B,C).

For DSS_{\text{SOFTMAX}} defined in Subsection III-B, the parameters (A,B,C) are defined as

\begin{split}&A=\mathrm{diag}(\lambda_{1},\cdots,\lambda_{N}),\\ &B=((e^{L\lambda_{i}\Delta}-1)^{-1})_{1\leq i\leq N},\\ &C=w,\end{split} (37)

where

\lambda_{i}=\Lambda_{re,i}+\mathrm{i}\cdot\Lambda_{im,i}. (38)

Similarly to DSS_{\text{EXP}}, the parameters \Lambda_{re},\Lambda_{im}\in\mathbb{R}^{N} and w\in\mathbb{C}^{N} are trained to determine (A,B,C) for DSS_{\text{SOFTMAX}}, with different expressions for A and B.

IV-B Initialization of the DSS Layer

The performance of S4 with DSS layers is sensitive to the initialization of the state matrix A. To obtain an effective initial value for A, the HiPPO matrix is decomposed into a normal matrix and a low-rank matrix. This decomposition allows for a more structured and interpretable initialization of the state matrix A, which can improve the performance of the model. The eigenvalues of the normal matrix are used to initialize the diagonal elements of A.

In more detail, the HiPPO matrix \mathcal{H}\in\mathbb{R}^{N^{\prime}\times N^{\prime}} is defined as explained in Subsection II-C:

\mathcal{H}_{nk}=-\begin{cases}(2n+1)^{1/2}(2k+1)^{1/2}&\text{if}\quad n>k\\ n+1&\text{if}\quad n=k\\ 0&\text{if}\quad n<k.\end{cases} (39)

For SSM (1) incorporating this HiPPO matrix, the state x(t) retains information about the history of the input u(t) [7]. The HiPPO matrix \mathcal{H} can be decomposed into a normal matrix and a low-rank matrix as

\mathcal{H}=\mathcal{H}^{\prime}-\frac{1}{2}PQ^{\top}, (40)

where \mathcal{H}^{\prime}\in\mathbb{R}^{N^{\prime}\times N^{\prime}}, P\in\mathbb{R}^{N^{\prime}}, Q\in\mathbb{R}^{N^{\prime}} are defined as

\mathcal{H}^{\prime}_{nk}=-\begin{cases}(2n+1)^{1/2}(2k+1)^{1/2}/2&\text{if}\quad n>k\\ 1/2&\text{if}\quad n=k\\ -(2n+1)^{1/2}(2k+1)^{1/2}/2&\text{if}\quad n<k\end{cases} (41)
P_{n}=(2n+1)^{1/2},\quad Q_{k}=(2k+1)^{1/2}. (42)

This \mathcal{H}^{\prime} is a normal matrix. Under the assumption that N^{\prime}=2N and that \mu_{1},\ldots,\mu_{N} are the eigenvalues of \mathcal{H}^{\prime}\in\mathbb{R}^{2N\times 2N} with positive imaginary parts, \Lambda_{re},\Lambda_{im}\in\mathbb{R}^{N} are defined as follows:

  • For DSS_{\text{EXP}},

    \Lambda_{re,i}=\log(-\mathrm{Re}(\mu_{i})),\quad\Lambda_{im,i}=\mathrm{Im}(\mu_{i}). (43)

  • For DSS_{\text{SOFTMAX}},

    \Lambda_{re,i}=\mathrm{Re}(\mu_{i}),\quad\Lambda_{im,i}=\mathrm{Im}(\mu_{i}). (44)

Using \Lambda_{re} and \Lambda_{im}, the matrix A is initialized as A\coloneqq\mathrm{diag}(\lambda_{1},\ldots,\lambda_{N}), where each \lambda_{i} is given by Equation (36) for DSS_{\text{EXP}} or Equation (38) for DSS_{\text{SOFTMAX}}. This procedure is known as Skew-HiPPO initialization [12, 9]. The other parameters within the DSS are randomly sampled, as detailed in Section VI. According to [9], models using Skew-HiPPO initialization achieve higher prediction accuracy than those initialized with values drawn from \mathcal{N}(0,1).
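A sketch of the Skew-HiPPO initialization described above (Python/NumPy, ours): build \mathcal{H}^{\prime} of Eq. (41) for N^{\prime}=2N, take the eigenvalues with positive imaginary parts, and convert them to (\Lambda_{re},\Lambda_{im}) via (43) or (44). The 0-based indexing and the ordering of the eigenvalues are our assumptions, not prescribed by the paper.

import numpy as np

def skew_hippo_init(N, variant="exp"):
    # Initial (Lambda_re, Lambda_im) from the normal part of the HiPPO matrix.
    Np = 2 * N
    n = np.arange(Np)
    sq = np.sqrt(2 * n + 1)
    S = np.outer(sq, sq) / 2.0
    Hn = -(np.tril(S, -1) - np.triu(S, 1)) - 0.5 * np.eye(Np)   # Eq. (41)
    mu = np.linalg.eigvals(Hn)
    mu = mu[mu.imag > 0]                  # assumed to leave exactly N eigenvalues
    mu = mu[np.argsort(mu.imag)]          # deterministic ordering (a modeling choice)
    if variant == "exp":                  # Eq. (43); note Re(mu) < 0 here
        return np.log(-mu.real), mu.imag
    return mu.real, mu.imag               # Eq. (44)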

IV-C Computational cost of the DSS layer output

Whether the input is processed one element at a time or all at once, reducing H (the hidden size) and N (the state dimension) reduces the computational cost of the DSS layer output. In fact, the time and space complexities of the DSS layer output are as shown in Table I and Table II, respectively, as discussed below. Here, the length of the 1-dimensional input sequence is denoted as L.

In the case where the input u_{k} at each time step is entered one by one into DSS (26), the output y_{k} can be computed from the previous state vector x_{k-1} according to (26). The time and space complexities per step are both O(N). The time complexity for processing the entire input of length L is O(NL), and the space complexity involves overwriting at each step, thus remaining O(N). For the nonlinear connection block, both the time and space complexities per step are O(1). The time complexity for processing the entire input of length L is O(L), and the space complexity involves overwriting at each step, thus remaining O(1). Regarding the following linear combination block, the time complexity per step is O(H^{2}), and the space complexity is O(H). The time complexity for processing the entire input of length L is O(H^{2}L), and the space complexity involves overwriting at each step, thus remaining O(H). Therefore, adding up the H DSS models, H nonlinear connection blocks, and one linear combination block, the time complexity of the DSS layer output when the input is entered one by one is O(HNL)+O(H^{2}L), and the space complexity is O(HN) (Table I).

In the case where the whole input (u_{k})_{0\leq k<L} is entered all at once, the output

y_{k}=\sum_{i=0}^{L-1}h_{i}\cdot u_{k-i} (45)

can be computed efficiently. In fact, leveraging the fast Fourier transform [41], the time complexity is O(L\log(L)) and the space complexity is O(L). Furthermore, the computation is parallelizable. Moreover, the impulse response h_{i} in Equation (45) can be computed easily. In fact, for a diagonal matrix A\in\mathbb{C}^{N\times N} with diagonal elements \lambda_{i}, h_{i} can be calculated as

\begin{split}h_{i}&=\bar{C}\bar{A}^{i}\bar{B}\\ &=Ce^{A\cdot i\Delta}(e^{A\Delta}-I)A^{-1}B\\ &=\sum_{j=1}^{N}C_{j}\cdot e^{\lambda_{j}\cdot i\Delta}(e^{\lambda_{j}\Delta}-1)\lambda_{j}^{-1}\cdot B_{j}.\end{split} (46)

The computation time for the sequence of impulse responses (h_{0},h_{1},\cdots,h_{L-1}) is O(NL). As for the nonlinear connection block, the time and space complexities are both O(L). Regarding the following linear combination block, the time complexity is O(H^{2}L), and the space complexity is O(HL). Therefore, adding up the H DSS models, H nonlinear connection blocks, and one linear combination block, the time complexity of the DSS layer output when the input is entered all at once is O(HL\log(L))+O(H^{2}L), and the space complexity is O(HL) (Table II).

Additionally, Equation (45) is an approximation that holds when DSS (26) is asymptotically stable and L is sufficiently large. The exact output of DSS (26) is given by (30). However, when (26) is asymptotically stable and L is sufficiently large, h_{i}\approx 0 for i\geq L, making the approximation in (45) valid.
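A sketch of the all-at-once computation (Python/NumPy, ours; function names are hypothetical): the kernel (h_{0},\ldots,h_{L-1}) follows from Eq. (46), and the convolution (45) is evaluated with zero-padded FFTs so that it remains causal.

import numpy as np

def dss_kernel(lam, B, C, dt, L):
    # Impulse responses h_0, ..., h_{L-1} of DSS (26) via Eq. (46).
    k = np.arange(L)
    weights = C * (np.exp(lam * dt) - 1.0) / lam * B            # C_j (e^{lam_j dt}-1) lam_j^{-1} B_j
    return weights @ np.exp(lam[:, None] * dt * k[None, :])     # sum over j, shape (L,)

def dss_convolve(h, u):
    # Causal convolution y_k = sum_i h_i u_{k-i} (Eq. (45)) via FFT in O(L log L).
    L = len(u)
    n = 2 * L                                                   # zero-padding avoids circular wrap-around
    return np.fft.ifft(np.fft.fft(h, n) * np.fft.fft(u, n))[:L]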

TABLE I: Computational complexities of the DSS layer output when an input of length L is entered one by one
 Time Space
DSS (1 unit) O(NL) O(N)
Nonlinear connection (1 unit) O(L) O(1)
Linear combination O(H^{2}L) O(H)
Total O(HNL)+O(H^{2}L) O(HN)
TABLE II: Computational complexities of the DSS layer output when an input of length L is entered all at once
 Time Space
DSS (1 unit) O(L\log(L)) O(L)
Nonlinear connection (1 unit) O(L) O(L)
Linear combination O(H^{2}L) O(HL)
Total O(HL\log(L))+O(H^{2}L) O(HL)

IV-D Issues for practical applications

Let us consider the issues that hinder the application of S4 with DSS layers to EI [17, 18], as explained in Section I. These issues include memory constraints, computational complexity, and the trade-offs between model performance and resource efficiency.

The following can be concluded from the arguments of Subsection IV-C:

  • When processing an entire input of large length L at once, the application of S4 with DSS layers to EI is challenging, even if H and N of the trained model are sufficiently small. This is because the space complexity is O(HL) (Table II), making it difficult to conduct inference on devices with small-capacity memory (e.g., sensors installed in factories).

  • When processing the input one by one, which corresponds to L=1, S4 with DSS layers can be applied to EI if H and N of the trained model are sufficiently small. Specifically, the time and space complexities are those shown in Table I. That is, when L=1, the time complexity is O(HN)+O(H^{2}), and the space complexity is O(HN). This implies that even for very long input sequences, inference can be conducted on devices with small-capacity memory.

In summary, to apply S4 with DSS layers to EI, the input needs to be processed one by one, and it is desirable to keep the values of H and N as small as possible. However, excessively small values of H and N may limit the model's capacity to capture complex patterns in the data, leading to a deterioration in performance.

V Proposed Model Compression Method

In this section, to address the issues discussed in Subsection IV-D, we propose an effective model compression method for S4 with DSS layers, aiming to reduce the computational cost of inference with one-by-one processing. Specifically, the method yields parameter values that achieve higher accuracy than existing methods when training models with DSS layers of the same H and N.

The following procedure is our proposed model compression method.

  1. Apply the balanced truncation method, as explained in Subsection II-B, to a large-scale DSS that is part of a trained model.

  2. Retrain the model, using the reduced DSS obtained in step 1) as an improved initialization.

Refer to caption
Figure 4: Proposed method, which consists of Pre-Training, DSS Reduction, Parameter Extraction, and Main Training. In the DSS Reduction step, we use the balanced truncation method.

In more detail, our proposed method consists of the following Pre-Training, DSS Reduction, Parameter Extraction, and Main Training, illustrated in Fig. 4.

Pre-Training

The parameters (A,B,C) and \Delta of DSS (26) are determined by training the model with the Skew-HiPPO initialization [9], as detailed in Section IV-B. For DSS_{\text{EXP}} and DSS_{\text{SOFTMAX}}, the calculations use (35) and (37), respectively.

DSS Reduction

The DSS model (26), determined by the parameters (A,B,C) and \Delta obtained through Pre-Training, is reduced using the balanced truncation method described in Subsection II-B and Appendix A. The state dimension is reduced to r\,(\leq N), resulting in the reduced SSM (2). Here, the sample time \Delta remains unchanged.

Parameter Extraction

Assuming A_{r} is diagonalizable, we can transform the reduced SSM (2) into the DSS form (26), as detailed below. There exist a diagonal matrix M=\mathrm{diag}(\mu_{1},\cdots,\mu_{r}) and an invertible matrix V\in\mathbb{C}^{r\times r} satisfying A_{r}=VMV^{-1}. Using the coordinate transformation \hat{x}=V^{-1}x_{r}, we obtain the new SSM

\begin{cases}\frac{\mathrm{d}\hat{x}}{\mathrm{d}t}(t)=M\hat{x}(t)+V^{-1}B_{r}u(t)\\ y_{r}(t)=C_{r}V\hat{x}(t),\end{cases} (47)

which is equivalent to the reduced SSM (2). Consequently, the transformation allows expressing the reduced SSM (2) in the explicitly diagonal form of DSS, as shown in (47).

From Proposition 1, the impulse response of DSS (47) with state dimension r can also be realized by DSS_{\text{EXP}} or DSS_{\text{SOFTMAX}}.

  • For DSS_{\text{EXP}}, the parameters are determined as

    M_{re,i}=\log(-\mathrm{Re}(\mu_{i})), (48)
    M_{im,i}=\mathrm{Im}(\mu_{i}), (49)
    v=(C_{r}V)^{\top}*(V^{-1}B_{r}). (50)

  • For DSS_{\text{SOFTMAX}}, the parameters are determined as

    M_{re,i}=\mathrm{Re}(\mu_{i}), (51)
    M_{im,i}=\mathrm{Im}(\mu_{i}), (52)
    v=(C_{r}V)^{\top}*(V^{-1}B_{r})*(e^{L\mu_{i}\Delta}-1)_{1\leq i\leq r}. (53)

For the derivation of the vectors in (50) and (53), refer to the proof of Proposition 1 in [9, Appendix A.1].
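A sketch of the Parameter Extraction step for DSS_{\text{EXP}} (Python/NumPy, ours; the function name is hypothetical): diagonalize A_{r}, apply the coordinate transformation (47), and form (M_{re},M_{im},v) via (48)-(50). The DSS_{\text{SOFTMAX}} case additionally multiplies v elementwise by (e^{L\mu_{i}\Delta}-1) as in (53).

import numpy as np

def extract_dss_exp_params(Ar, Br, Cr):
    # Map a reduced SSM (A_r, B_r, C_r) to DSS_EXP parameters (Eqs. (48)-(50)).
    # Assumes A_r is diagonalizable and its eigenvalues have negative real parts.
    mu, V = np.linalg.eig(Ar)                      # A_r = V diag(mu) V^{-1}
    Bhat = np.linalg.solve(V, Br.reshape(-1))      # V^{-1} B_r
    Chat = (Cr.reshape(1, -1) @ V).ravel()         # C_r V
    M_re = np.log(-mu.real)                        # Eq. (48)
    M_im = mu.imag                                 # Eq. (49)
    v = Chat * Bhat                                # Eq. (50): Hadamard product (C_r V)^T * (V^{-1} B_r)
    return M_re, M_im, v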

Main Training

As detailed in Subsection IV-A, \Lambda_{re},\Lambda_{im}\in\mathbb{R}^{r} and w\in\mathbb{C}^{r} are training parameters within DSS_{\text{EXP}} and DSS_{\text{SOFTMAX}}. Here, (\Lambda_{re},\Lambda_{im}) and w are initialized with (M_{re},M_{im}) and v obtained by Parameter Extraction, respectively. Note that the state dimension, previously denoted as N, is now r in this initialization. All other parameters keep the values obtained in the Pre-Training phase, ensuring consistency in the model's initialization.

VI Numerical Experiments

To evaluate the proposed method, we employed tasks of LRA [8], which is available at https://github.com/google-research/long-range-arena. The benchmark includes sequence data ranging from 1,000 to 16,000 in length and evaluates the model’s ability to capture long-range dependencies required for learning long sequences. In the text classification task, we classify movie reviews in the Internet Movie Database (IMDb) review dataset [42] as negative or positive. Tables III and IV summarize the statistics of the text classification dataset, including the counts and lengths of the raw data sequences. These sequences are truncated or padded as necessary to ensure consistent input lengths.

The experiments were conducted on a machine running Windows 10, equipped with 64 GB of memory and an 11th Gen Intel Core i9-11980HK CPU. The model training and evaluation code was implemented in Python using PyTorch 1.11.0 and TensorFlow 2.12.0, and executed with the NVIDIA RTX A3000 Laptop GPU.

TABLE III: Text classification (negative)
Train Test
Number of examples 12,500 12,500
Max text length 8,969 6,385
Min text length 52 32
Avg text length 1302.97904 1285.14968
TABLE IV: Text classification (positive)
Train Test
Number of examples 12,500 12,500
Max text length 13,704 12,988
Min text length 70 65
Avg text length 1347.16024 1302.43512

VI-A Comparison with existing training methods

Table V and Table VI show the accuracy of models obtained through various training methods, using DSS_{\text{EXP}} and DSS_{\text{SOFTMAX}} respectively, where N denotes the dimension of the state vector of the DSS. The number of DSS layers is 4 and the hidden size H is 16. The columns labeled “before” and “after” denote the accuracy of the model with the initial parameter values and the accuracy of the model after training from that initial state, respectively.

In the “HiPPO” column on the left, the Skew-HiPPO initialization explained in Subsection IV-B was used to initialize the state matrix A of each DSS. In the middle column “Random”, the initial values of A were randomly sampled. For DSS_{\text{EXP}}, the real and imaginary parts of the diagonal elements of the matrix A were sampled from \mathcal{U}(-1,0) and \mathcal{N}(0,1), respectively. For DSS_{\text{SOFTMAX}}, the real and imaginary parts of the diagonal elements were sampled from \mathcal{N}(0,1). The other parameters within the DSS were randomly sampled for both “HiPPO” and “Random”. The real and imaginary parts of each element of w\in\mathbb{C}^{N} were sampled from \mathcal{N}(0,1).

TABLE V: Accuracy of models through various training methods (Text classification, DSS_{\text{EXP}}, H=16)
N HiPPO Random Proposed Method
128 before 0.5000 0.4998
after 0.7997 0.7731
64 before 0.4967 0.4982
after 0.8012 0.7968
32 before 0.5008 0.4978
after 0.7990 0.8100
16 before 0.4936 0.4970 0.8310
after 0.8310 0.8071 0.8396
8 before 0.4996 0.5028 0.5011
after 0.8182 0.8115 0.8376
4 before 0.4960 0.4964 0.5002
after 0.8042 0.8155 0.8418
TABLE VI: Accuracy of models through various training methods (Text classification, DSS_{\text{SOFTMAX}}, H=16)
N HiPPO Random Proposed Method
128 before 0.4984 0.5000 0.8216
after 0.8216 0.7540 0.8359
64 before 0.4993 0.5000 0.5013
after 0.8158 0.7544 0.8382
32 before 0.4994 0.4999 0.5000
after 0.8026 0.7496 0.8370
16 before 0.5011 0.4994 0.5005
after 0.8190 0.7461 0.8402
8 before 0.5000 0.5072 0.5036
after 0.8071 0.7722 0.8412
4 before 0.4940 0.5000 0.5012
after 0.7948 0.7819 0.8377

For DSS_{\text{SOFTMAX}}, it has been reported in [9] that “HiPPO”, which uses the Skew-HiPPO initialization, achieves higher accuracy after training than “Random”, which uses randomly sampled initial values. The results in Table VI are consistent with this: “HiPPO” achieves higher accuracy after training than “Random” for each N, while the accuracies before training are similar. For DSS_{\text{EXP}}, the same trend was observed for almost all N, as shown in Table V.

The “Proposed Method” column on the right describes our approach, which uses a reduced SSM obtained by the balanced truncation method from the Pre-Trained models with N=16 (DSS_{\text{EXP}}) and N=128 (DSS_{\text{SOFTMAX}}) to initialize Main Training. In Table V, the “Proposed Method” entries for N=32, N=64, and N=128 are blank, because the balanced truncation does not permit expanding the state dimension beyond the original size of N=16.

Before Main Training, the accuracy of models using the “Proposed Method” is comparable to that of “Random” and “HiPPO” for each N, excluding N=16 for DSS_{\text{EXP}} and N=128 for DSS_{\text{SOFTMAX}}. However, after Main Training, the accuracy of the “Proposed Method” exceeded that of “HiPPO” for each N. This result is noteworthy because the “Proposed Method” tends to outperform “HiPPO” after Main Training despite starting from similar accuracies before training.

The following points are particularly noteworthy.

  • For DSS_{\text{EXP}}, shown in Table V, the highest accuracy after Main Training with the “Proposed Method” was 0.8418 at N=4. Notably, this exceeded the accuracy after training with “HiPPO” at N=16, which was 0.8310, despite the smaller N and the same hidden size H.

  • For DSS_{\text{SOFTMAX}}, shown in Table VI, the highest accuracy after training with the “Proposed Method” was 0.8412 at N=8. Notably, this exceeded the accuracy after training with “HiPPO” at N=128, which was 0.8216, despite the smaller N and the same hidden size H.

In summary, the initial parameters obtained by reducing Pre-Trained DSS models with (H,N)=(16,16) for DSS_{\text{EXP}} and (H,N)=(16,128) for DSS_{\text{SOFTMAX}} appear to be effective in enhancing the accuracy of the trained model compared to the initial parameters given by the Skew-HiPPO initialization. Similar trends are observed in the ListOps task and the text retrieval task of LRA, where our method enhanced the accuracy of the trained model. For detailed results, refer to Appendix C.

VI-B Relationship between accuracy of Pre-Trained models and models after Main Training

Table VII shows the accuracy of models after Main Training when initialized with different Pre-Trained models. We obtained Pre-Trained models with DSS of N=128 and N=80, and utilized the reduced models for improved initialization of Main Training. The accuracy of the models after Main Training is given in the columns “Proposed Method (N=128)” and “Proposed Method (N=80)”. Both “Proposed Method (N=128)” and “Proposed Method (N=80)” followed the trend observed in Subsection VI-A, where the “Proposed Method” achieved higher accuracy than “HiPPO” for each N.

The accuracy of the Pre-Trained model with N=128 is 0.8216, which is higher than the 0.8098 obtained at N=80. As for the models after Main Training, the accuracy of “Proposed Method (N=128)” surpasses that of “Proposed Method (N=80)” for each N. This suggests that higher accuracy of the Pre-Trained model leads to higher accuracy of the model obtained through Main Training.

TABLE VII: Accuracy of models initialized using different Pre-Trained models (DSS_{\text{SOFTMAX}}, H=16)
N HiPPO Proposed Method (N=128) Proposed Method (N=80)
128 0.8216 0.8359
80 0.8098 0.8390 0.8343
64 0.8158 0.8382 0.8224
32 0.8026 0.8370 0.8255
16 0.8190 0.8402 0.8294
8 0.8071 0.8412 0.8334
4 0.7948 0.8377 0.8317

VI-C Non-triviality of the obtained results

The Hankel singular values illustrated in Fig. 5 highlight the non-triviality of the results presented in Tables V and VI from a system-theoretic perspective. These values were derived from the SSM parameters (A,B,C) of the Pre-Trained model using DSS_{\text{SOFTMAX}} with N=128. Specifically, the Hankel singular values were computed for each SSM in every DSS layer. The detailed computational method is described in Appendix A.

As explained in Section II-B and Appendix A, the Hankel singular values reveal the important directions in the state space from the controllability and observability perspective. That is, if the Hankel singular values are relatively large, the corresponding directions are relatively important. Notably, Fig. 5 shows that almost all directions in the N=128 dimensional state space are important, because there are few significantly small Hankel singular values. Therefore, reducing the dimensionality of the Pre-Trained model from N=128 would be expected to significantly deteriorate its performance. This expectation is consistent with the results shown in Table VI. In fact, the accuracy of the Pre-Trained model with N=128 was 0.8216, but after reducing N to 8, the accuracy dropped to 0.5036. Nevertheless, after Main Training, the accuracy of the reduced model improved to 0.8412. This improvement in accuracy is not predicted by the theoretical analysis of balanced truncation introduced in Appendix A and is a non-trivial result.

Refer to caption
Figure 5: Hankel singular values obtained from each SSM for DSS_{\text{SOFTMAX}} with N=128. These SSMs were part of the Pre-Trained model initialized using Skew-HiPPO with N=128, as shown in Table VI. Although the hidden size was set to 16, we only present the cases H=1, 2, 3, and 4 because the results for H=5,6,\ldots,16 were almost identical.

VII Conclusion

We developed a new model compression method specifically for S4 models with DSS layers, using the balanced truncation method [33]. This approach not only reduces the number of parameters but also enhances model performance. We proposed using the reduced model parameters obtained by the balanced truncation as initial parameters for the main training process. Our experiments demonstrated that the proposed method achieves superior accuracy on Long Range Arena (LRA) tasks compared to conventionally trained models using the Skew-HiPPO initialization, even with fewer parameters. Moreover, we observed a positive correlation between the accuracy of Pre-Trained models and their accuracy after Main Training.

The primary limitation of this study lies in the scope of tasks and datasets used for evaluation. While the LRA tasks provide a robust benchmark for long-range dependency modeling, further validation on diverse datasets and real-world applications is necessary. Additionally, the underlying principles of the proposed method remain unclear, which limits the understanding of why this approach is effective.

The following are interesting future directions:

  • Future research should investigate the underlying principles of the proposed method, aiming to enhance the development of more effective training methods for deep learning models with DSS layers.

  • Reference [23] has shown that combining various model compression methods can yield better results. Investigating whether combining our proposed model compression method based on the balanced truncation with other compression techniques can improve performance is an interesting and promising direction for future work.

  • Expanding the scope of evaluation to include real-time deployment scenarios in EI applications will provide more comprehensive insights into the method’s practical viability. This can help demonstrate how the reduced models can be effectively used in resource-constrained environments.

  • As an extension of our study, applying the proposed compression techniques to Physics-Informed Neural Networks (PINNs) can be explored [43, 44]. This could help to improve the efficiency and performance of PINNs in modeling physical systems with limited computational resources.

Acknowledgment

This work was supported by the Japan Society for the Promotion of Science KAKENHI under Grant 23H03680.

Appendix A Details of the balanced truncation method

As mentioned in Section II-B, the eigenvectors of the controllability and observability Gramians P and Q of SSM (1) indicate important directions in the state space \mathbb{C}^{N} from the perspectives of controllability and observability. Thus, we can adopt an approach that removes the directions that are not significant. However, the eigenvectors of P and Q do not coincide in general. This means that, in general, it is impossible to uniquely determine the directions to be discarded based solely on the original controllability Gramian P and observability Gramian Q.

To overcome this problem, we apply a coordinate transformation \bar{x}(t)=Tx(t) to SSM (1) to obtain the new SSM (9). The corresponding controllability and observability Gramians of SSM (9) are TPT^{*} and (T^{-1})^{*}QT^{-1}, respectively. Thus, if we can find T\in\mathbb{C}^{N\times N} satisfying

TPT^{*}=(T^{-1})^{*}QT^{-1}, (54)

the controllability and observability Gramians of the transformed SSM (9) will coincide, even if the original Gramians of SSM (1) do not.

To find T satisfying (54), we perform the eigenvalue decomposition of the Hermitian positive definite matrix P^{1/2}QP^{1/2} to obtain

P^{1/2}QP^{1/2}=U\Lambda U^{*}, (55)

where P^{1/2} is the square root matrix of P, U is a unitary matrix, and \Lambda=\mathrm{diag}(\lambda_{1},\ldots,\lambda_{N}) is a diagonal matrix with positive diagonal elements satisfying \lambda_{1}\geq\cdots\geq\lambda_{N}. Defining

T:=\Lambda^{1/4}U^{*}P^{-1/2}, (56)

we get

TPT^{*}=(T^{-1})^{*}QT^{-1}=\Lambda^{1/2}=\mathrm{diag}(\sigma_{1},\ldots,\sigma_{N}). (57)

We call \sigma_{i}=\sqrt{\lambda_{i}}~(i=1,\ldots,N) the Hankel singular values of SSM (1).

Thus, the balanced truncation method consists of the following procedure:

  1. Compute the controllability Gramian P and the observability Gramian Q of SSM (1) by solving the Lyapunov equations (5) and (6), respectively.

  2. Compute the square root matrix P^{1/2}.

  3. Perform the eigenvalue decomposition (55).

  4. Determine the transformation matrix T using (56).

  5. Compute (10), (11), and (12) using T, and define (A_{r},B_{r},C_{r}) of reduced SSM (2) as in (13).

The reduced SSM (2) is preferable when it closely approximates the original large-scale SSM (1) in terms of the HH^{\infty} norm of the difference in their transfer functions. In fact, the transfer functions GG and GrG_{r} of original SSM (1) and reduced SSM (2) are defined by

G(s):=C(sINA)1B,Gr(s):=Cr(sIrAr)1Br,\displaystyle G(s):=C(sI_{N}-A)^{-1}B,\quad G_{r}(s):=C_{r}(sI_{r}-A_{r})^{-1}B_{r}, (58)

respectively. The energy of the difference between original SSM (1) output yy and reduced SSM (2) output yry_{r} can be evaluated by

0y(t)yr(t)2dtGGrH20u(t)2dt,\displaystyle\int_{0}^{\infty}\|y(t)-y_{r}(t)\|^{2}{\rm d}t\leq\|G-G_{r}\|_{H^{\infty}}^{2}\int_{0}^{\infty}\|u(t)\|^{2}{\rm d}t, (59)

where H\|\cdot\|_{H^{\infty}} denotes the HH^{\infty} norm. That is, if the input energy 0u(t)2dt\int_{0}^{\infty}\|u(t)\|^{2}{\rm d}t and GGrH\|G-G_{r}\|_{H^{\infty}} are sufficiently small, the output error energy 0y(t)yr(t)2dt\int_{0}^{\infty}\|y(t)-y_{r}(t)\|^{2}{\rm d}t is also sufficiently small. Moreover, GGrH2\|G-G_{r}\|_{H^{\infty}}^{2} when using the balanced truncation method is bounded by

$\sigma_{r+1} \leq \|G-G_{r}\|_{H^{\infty}} \leq 2(\sigma_{r+1}+\cdots+\sigma_{N})$,    (60)

assuming that $\sigma_{r+1}>\cdots>\sigma_{N}$. Thus, if the Hankel singular values $\sigma_{r+1},\ldots,\sigma_{N}$ are small, then $\|G-G_{r}\|_{H^{\infty}}$ will also be small. The proofs for the above claims can be found in [38, Chapter 4].
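
As an illustrative sanity check (not part of the paper's experiments), the bound (60) can be inspected numerically for a small random stable system by comparing the tail of the Hankel singular values with the $H^{\infty}$ norm of $G-G_{r}$; here the norm is approximated by the largest singular value of $G(j\omega)-G_{r}(j\omega)$ over a frequency grid, and balanced_truncation refers to the sketch given earlier.

```python
import numpy as np


def hinf_gap(A, B, C, Ar, Br, Cr, omegas):
    """Grid approximation of ||G - G_r||_{H^inf} via the frequency response."""
    worst = 0.0
    for w in omegas:
        s = 1j * w
        G = C @ np.linalg.solve(s * np.eye(A.shape[0]) - A, B)
        Gr = Cr @ np.linalg.solve(s * np.eye(Ar.shape[0]) - Ar, Br)
        worst = max(worst, np.linalg.svd(G - Gr, compute_uv=False)[0])
    return worst


# Small random stable SISO system: shift the spectrum into the open left half-plane.
rng = np.random.default_rng(0)
N, r = 12, 4
A = rng.standard_normal((N, N))
A = A - (np.max(np.linalg.eigvals(A).real) + 1.0) * np.eye(N)
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))

Ar, Br, Cr, sigma = balanced_truncation(A, B, C, r)
gap = hinf_gap(A, B, C, Ar, Br, Cr, np.logspace(-3, 3, 2000))
# Inequality (60): sigma_{r+1} <= ||G - G_r||_{H^inf} <= 2 (sigma_{r+1} + ... + sigma_N).
print(sigma[r], gap, 2 * sigma[r:].sum())
```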

Appendix B Details of the deep learning model

In the deep learning model employed in this study, residual connections [45] and normalization layers are placed before and after the DSS layer. A residual connection creates a path that bypasses one or more layers, as illustrated in Fig. 6 and Fig. 7, and the output of the bypassed layers is added to this path. The normalization layer can be placed either before the DSS layer (prenorm, Fig. 6) or after the residual connection (postnorm, Fig. 7). In the prenorm case, a normalization layer is also placed before the output layer. Normalization layers such as batch normalization [46] or layer normalization [47] contribute to the stability and acceleration of training, while residual connections mitigate the vanishing and exploding gradient problems.

Figure 6: Deep learning model with DSS layers (prenorm). The normalization layer is placed before the DSS layer and the output layer.
Figure 7: Deep learning model with DSS layers (postnorm). The normalization layer is placed after the residual connection.
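
As a rough illustration of the two placements (not the authors' code), the block structure can be sketched in PyTorch as follows, where dss_layer stands in for a DSS layer with hidden size $H$ and nn.LayerNorm plays the role of the normalization layer:

```python
import torch.nn as nn


class ResidualDSSBlock(nn.Module):
    """One block: normalization + DSS layer + residual connection."""

    def __init__(self, dss_layer: nn.Module, hidden_size: int, prenorm: bool = True):
        super().__init__()
        self.dss = dss_layer                  # placeholder for a DSS layer
        self.norm = nn.LayerNorm(hidden_size)
        self.prenorm = prenorm

    def forward(self, x):
        if self.prenorm:
            # Prenorm (Fig. 6): normalize before the DSS layer, then add the bypass path.
            return x + self.dss(self.norm(x))
        # Postnorm (Fig. 7): add the bypass path first, then normalize.
        return self.norm(x + self.dss(x))
```

In the prenorm configuration, an additional normalization layer would also be placed before the output layer, as described above.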

Appendix C Other results

In addition to the text classification task described in Section VI, we confirmed that our proposed method improves performance on the ListOps task and the text retrieval task of LRA [8], as shown in Table VIII and Table IX, respectively. Here, the number of DSS layers is 6 and the hidden size $H$ is 16. In the ListOps task, the input is a numerical expression built from the operators MAX, MEAN, MEDIAN, and SUM_MOD together with parentheses, and the output is its value. For instance,

INPUT:  [ MAX 1 2 MIN [ 3 4 ] MEDIAN [ 1 5 9 ] ]
OUTPUT: 5    (61)

The maximum input length is 2000, and the output values range from 0 to 9. In the text retrieval task, we estimate the similarity between two papers and determine whether there is a citation link. The length of each paper is 4000, and the total input length is 8000.
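
To make the ListOps input format concrete, the following is a small illustrative evaluator for expressions such as (61). The operator semantics (MIN is included so that the example above runs, SUM_MOD is taken as the sum modulo 10, and MEAN and MEDIAN are truncated to a single digit) and the handling of both "[ OP" and "OP [" token orders are assumptions made for this sketch, not specifications of the benchmark.

```python
import statistics

# Assumed operator semantics for this sketch.
OPS = {
    "MAX": max,
    "MIN": min,
    "MEAN": lambda xs: sum(xs) // len(xs),
    "MEDIAN": lambda xs: int(statistics.median(xs)),
    "SUM_MOD": lambda xs: sum(xs) % 10,
}


def evaluate(tokens):
    """Evaluate a tokenized ListOps expression and return a digit in 0-9."""

    def parse(i):
        if tokens[i] == "[":          # form "[ OP arg ... ]"
            op, i = OPS[tokens[i + 1]], i + 2
        elif tokens[i] in OPS:        # form "OP [ arg ... ]"
            op, i = OPS[tokens[i]], i + 2
        else:                         # a single digit
            return int(tokens[i]), i + 1
        args = []
        while tokens[i] != "]":
            value, i = parse(i)
            args.append(value)
        return op(args), i + 1        # skip the closing "]"

    value, _ = parse(0)
    return value


tokens = "[ MAX 1 2 MIN [ 3 4 ] MEDIAN [ 1 5 9 ] ]".split()
print(evaluate(tokens))               # -> 5, matching (61)
```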

TABLE VIII: Accuracy of models through various training methods (ListOps, DSS_EXP, H=16)

N    Stage    HiPPO    Random   Proposed Method
64   before   0.0715   0.0810   0.5180
     after    0.5180   0.4020   0.5300
32   before   0.1705   0.1195   0.3610
     after    0.4920   0.4035   0.5205
16   before   0.0760   0.1780   0.1290
     after    0.4745   0.4001   0.5390
8    before   0.0715   0.0960   0.1825
     after    0.4025   0.4300   0.5250
4    before   0.0825   0.1210   0.1725
     after    0.4250   0.4440   0.5175
TABLE IX: Accuracy of models through various training methods (Text retrieval, DSS_EXP, H=16)

N    Stage    HiPPO    Random   Proposed Method
64   before   0.4943   0.5068   0.8189
     after    0.8189   0.7830   0.8345
32   before   0.4937   0.4976   0.5807
     after    0.8217   0.7812   0.8302
16   before   0.5055   0.4932   0.5399
     after    0.8071   0.7907   0.8251
8    before   0.4939   0.4939   0.5064
     after    0.8212   0.7944   0.8270
4    before   0.4939   0.4939   0.5062
     after    0.8136   0.7893   0.8313

For DSS_SOFTMAX, the "HiPPO" column on the left, which uses the Skew-HiPPO initialization, achieved higher accuracy after training than the "Random" column in the middle, which uses randomly sampled initial values. For DSS_EXP, the same trend was observed for almost all $N$, as shown in Table VIII and Table IX.

In the "Proposed Method" column on the right, Main Training was initialized with a reduced model obtained from Pre-Trained models of $(H,N)=(16,64)$. Before Main Training, the accuracy of "Proposed Method" was comparable to that of "Random" and "HiPPO" for each $N$ except $N=64$. After training, however, the accuracy of "Proposed Method" exceeded that of "HiPPO" for every $N$.

The following points are particularly noteworthy.

  • For the ListOps task, the highest accuracy after Main Training with "Proposed Method" was 0.5390 at $N=16$. Notably, this exceeded the accuracy after training with "HiPPO" at $N=64$, which was 0.5180, despite the smaller $N$ and the same hidden size $H$.

  • For the text retrieval task, the accuracy after Main Training with "Proposed Method" was 0.8313 at $N=4$, which exceeded the accuracy after training with "HiPPO" at $N=64$ and $N=32$, despite the smaller $N$ and the same hidden size $H$.

Consequently, the initial parameters obtained by reducing the Pre-Trained DSS of $(H,N)=(16,64)$ appear to be more effective in enhancing the accuracy of the trained model than the initial parameters given by the Skew-HiPPO initialization.

References

  • [1] B. Lim and S. Zohren, “Time-series forecasting with deep learning: a survey,” Philosophical Transactions of the Royal Society A, vol. 379, no. 2194, p. 20200209, 2021.
  • [2] A. Mehrish, N. Majumder, R. Bharadwaj, R. Mihalcea, and S. Poria, “A review of deep learning techniques for speech processing,” Information Fusion, p. 101869, 2023.
  • [3] A. K. Pandey and S. S. Roy, “Natural language generation using sequential models: A survey,” Neural Processing Letters, vol. 55, no. 6, pp. 7709–7742, 2023.
  • [4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of Advances in neural information processing systems, 2017.
  • [5] A. Zeng, M. Chen, L. Zhang, and Q. Xu, “Are transformers effective for time series forecasting?” in Proceedings of the AAAI conference on artificial intelligence, vol. 37, no. 9, 2023, pp. 11 121–11 128.
  • [6] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” in International Conference on Learning Representations, 2022.
  • [7] A. Gu, T. Dao, S. Ermon, A. Rudra, and C. Ré, “Hippo: Recurrent memory with optimal polynomial projections,” in Proceedings of Advances in neural information processing systems, 2020, pp. 1474–1487.
  • [8] Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, and D. Metzler, “Long range arena: A benchmark for efficient transformers,” in International Conference on Learning Representations, 2021.
  • [9] A. Gupta, A. Gu, and J. Berant, “Diagonal state spaces are as effective as structured state spaces,” in Proceedings of Advances in Neural Information Processing Systems, 2022, pp. 22 982–22 994.
  • [10] D. Y. Fu, T. Dao, K. K. Saab, A. W. Thomas, A. Rudra, and C. Re, “Hungry hungry hippos: Towards language modeling with state space models,” in The Eleventh International Conference on Learning Representations, 2023.
  • [11] M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao, S. Baccus, Y. Bengio, S. Ermon, and C. Ré, “Hyena hierarchy: Towards larger convolutional language models,” arXiv preprint arXiv:2302.10866, 2023.
  • [12] A. Gu, K. Goel, A. Gupta, and C. Ré, “On the parameterization and initialization of diagonal state space models,” in Advances in Neural Information Processing Systems, 2022, pp. 35 971–35 983.
  • [13] E. Nguyen, K. Goel, A. Gu, G. Downs, P. Shah, T. Dao, S. Baccus, and C. Ré, “S4nd: Modeling images and videos as multidimensional signals with state spaces,” in Proceedings of Advances in neural information processing systems, 2022, pp. 2846–2861.
  • [14] J. T. Smith, A. Warrington, and S. Linderman, “Simplified state space layers for sequence modeling,” in The Eleventh International Conference on Learning Representations, 2023.
  • [15] J. M. Lopez Alcaraz and N. Strodthoff, “Diffusion-based time series imputation and forecasting with structured state space models,” Transactions on machine learning research, pp. 1–36, 2023.
  • [16] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
  • [17] K. Cao, Y. Liu, G. Meng, and Q. Sun, “An overview on edge computing research,” IEEE Access, vol. 8, pp. 85 714–85 728, 2020.
  • [18] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, “Edge intelligence: Paving the last mile of artificial intelligence with edge computing,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1738–1762, 2019.
  • [19] T. Choudhary, V. Mishra, A. Goswami, and J. Sarangapani, “A comprehensive survey on model compression and acceleration,” Artificial Intelligence Review, vol. 53, pp. 5113–5155, 2020.
  • [20] H. Djigal, J. Xu, L. Liu, and Y. Zhang, “Machine and deep learning for resource allocation in multi-access edge computing: A survey,” IEEE Communications Surveys & Tutorials, vol. 24, no. 4, pp. 2449–2494, 2022.
  • [21] M. S. Murshed, C. Murphy, D. Hou, N. Khan, G. Ananthanarayanan, and F. Hussain, “Machine learning at the network edge: A survey,” ACM Computing Surveys, vol. 54, no. 8, pp. 1–37, 2021.
  • [22] N. Tekin, A. Aris, A. Acar, S. Uluagac, and V. C. Gungor, “A review of on-device machine learning for iot: An energy perspective,” Ad Hoc Networks, p. 103348, 2023.
  • [23] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding,” in 4th International Conference on Learning Representations, 2016.
  • [24] Y. LeCun, J. Denker, and S. Solla, “Optimal brain damage,” Advances in neural information processing systems, vol. 2, 1989.
  • [25] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, “A survey of quantization methods for efficient neural network inference,” in Low-Power Computer Vision.   Chapman and Hall/CRC, 2022, pp. 291–326.
  • [26] J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, no. 6, pp. 1789–1819, 2021.
  • [27] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [28] Z. Hao, J. Guo, K. Han, H. Hu, C. Xu, and Y. Wang, “Revisit the power of vanilla knowledge distillation: from small scale to large scale,” Advances in Neural Information Processing Systems, vol. 36, 2023.
  • [29] T. Huang, Y. Zhang, M. Zheng, S. You, F. Wang, C. Qian, and C. Xu, “Knowledge diffusion for distillation,” Advances in Neural Information Processing Systems, vol. 36, 2023.
  • [30] A. C. Antoulas, Approximation of large-scale dynamical systems.   SIAM, 2005.
  • [31] A. Astolfi, “Model reduction by moment matching for linear and nonlinear systems,” IEEE Transactions on Automatic Control, vol. 55, no. 10, pp. 2321–2336, 2010.
  • [32] S. Gugercin, A. C. Antoulas, and C. Beattie, “$H^{2}$ model reduction for large-scale linear dynamical systems,” SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 2, pp. 609–638, 2008.
  • [33] B. Moore, “Principal component analysis in linear systems: Controllability, observability, and model reduction,” IEEE transactions on automatic control, vol. 26, no. 1, pp. 17–32, 1981.
  • [34] K. Sato, “Riemannian optimal model reduction of linear port-hamiltonian systems,” Automatica, vol. 93, pp. 428–434, 2018.
  • [35] K. Sato, “Riemannian optimal model reduction of stable linear systems,” IEEE Access, vol. 7, pp. 14 689–14 698, 2019.
  • [36] K. Sato, “Reduced model reconstruction method for stable positive network systems,” IEEE Transactions on Automatic Control, vol. 68, no. 9, pp. 5616–5623, 2023.
  • [37] S. Massaroli, M. Poli, D. Fu, H. Kumbong, R. Parnichkun, D. Romero, A. Timalsina, Q. McIntyre, B. Chen, A. Rudra et al., “Laughing hyena distillery: Extracting compact recurrences from convolutions,” in Advances in Neural Information Processing Systems, 2023.
  • [38] G. E. Dullerud and F. Paganini, A course in robust control theory: a convex approach.   Springer Science & Business Media, 2000.
  • [39] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016.
  • [40] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
  • [41] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms, third edition.   MIT press, 2009.
  • [42] A. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis,” in Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, 2011, pp. 142–150.
  • [43] J. Donnelly, A. Daneshkhah, and S. Abolfathi, “Physics-informed neural networks as surrogate models of hydrodynamic simulators,” Science of the Total Environment, vol. 912, p. 168814, 2024.
  • [44] M. Raissi, P. Perdikaris, and G. E. Karniadakis, “Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations,” Journal of Computational physics, vol. 378, pp. 686–707, 2019.
  • [45] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [46] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning, 2015, pp. 448–456.
  • [47] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.