
Universal Approximation Properties for an ODENet and a ResNet: Mathematical Analysis and Numerical Experiments

Yuto Aizawa Division of Mathematical and Physical Sciences, Kanazawa University ([email protected])    Masato Kimura Faculty of Mathematics and Physics, Kanazawa University ([email protected])    Kazunori Matsui Faculty of Science and Technology, Seikei University ([email protected])
Abstract

We prove a universal approximation property (UAP) for a class of ODENet and a class of ResNet, which are simplified mathematical models for deep learning systems with skip connections. The UAP can be stated as follows. Let $n$ and $m$ be the dimensions of the input and output data, and assume $m\leq n$. Then we show that an ODENet of width $n+m$ with any non-polynomial continuous activation function can approximate any continuous function on a compact subset of $\mathbb{R}^{n}$. We also show that a ResNet has the same property as the depth tends to infinity. Furthermore, we derive the gradient of a loss function explicitly with respect to a certain tuning variable and use it to construct a learning algorithm for ODENet. To demonstrate the usefulness of this algorithm, we apply it to a regression problem, a binary classification, and a multinomial classification in MNIST.

Keywords: Deep neural network, ODENet, ResNet, Universal approximation property

1 Introduction

Recent advances in neural networks have proven immensely successful for regression analysis, image classification, time series modeling, and so on [20]. Neural networks are models of the human brain and vision [18, 8]. A neural network performs regression analysis, image classification, and time series modeling through a series of sequential operations, known as layers. Each layer is composed of neurons that are connected to neurons of other (typically adjacent) layers. We consider a neural network with $L+1$ layers, where the input layer is layer $0$, the output layer is layer $L$, and the number of nodes in layer $l~(l=0,1,\ldots,L)$ is $n_{l}\in\mathbb{N}$. Let $f^{(l)}:\mathbb{R}^{n_{l}}\to\mathbb{R}^{n_{l+1}}$ be the function of each layer, so that the output of layer $l$ is a vector in $\mathbb{R}^{n_{l+1}}$. If the input data is $\xi\in\mathbb{R}^{n_{0}}$, then, at each layer, we have

\left\{\begin{aligned} x^{(l+1)}&=f^{(l)}(x^{(l)}),&l=0,1,\ldots,L-1,\\ x^{(0)}&=\xi.&\end{aligned}\right.

The final output of the network is then $x^{(L)}$, and the network is represented by $H=[\xi\mapsto x^{(L)}]$.

A neural network approaches the regression and classification problem in two steps. First, a priori observed and classified data is used to train the network. Then, the trained network is used to predict the rest of the data. Let $D\subset\mathbb{R}^{n_{0}}$ be the set of input data, and $F:D\to\mathbb{R}^{n_{L}}$ be the target function. In the training step, the training data $\{(\xi^{(k)},F(\xi^{(k)}))\}_{k=1}^{K}$ are available, where $\{\xi^{(k)}\}_{k=1}^{K}\subset D$ are the inputs and $\{F(\xi^{(k)})\}_{k=1}^{K}\subset\mathbb{R}^{n_{L}}$ are the outputs. The goal is to train the neural network so that $H(\xi)$ approximates $F(\xi)$. This is achieved by minimizing a loss function that measures the distance between the two quantities. In this paper, we consider the mean square error loss function

\frac{1}{K}\sum_{k=1}^{K}\left|H(\xi^{(k)})-F(\xi^{(k)})\right|^{2}.

Finding the optimal functions $f^{(l)}:\mathbb{R}^{n_{l}}\to\mathbb{R}^{n_{l+1}}$ among all possible such functions is challenging, and it also carries a risk of overfitting because of the large number of available degrees of freedom. We therefore restrict the functions to the following form:

f^{(l)}(x)=a^{(l)}\odot\mbox{\boldmath$\sigma$}(W^{(l)}x+b^{(l)}), (1.1)

where $W^{(l)}\in\mathbb{R}^{n_{l+1}\times n_{l}}$ is a weight matrix, $b^{(l)}\in\mathbb{R}^{n_{l+1}}$ is a bias vector, and $a^{(l)}\in\mathbb{R}^{n_{l+1}}$ is a weight vector for the output of each layer. The operator $\odot$ denotes the Hadamard product (element-wise product) of two vectors defined by (2.2). The function $\mbox{\boldmath$\sigma$}:\mathbb{R}^{n_{l+1}}\to\mathbb{R}^{n_{l+1}}$ is defined by $\mbox{\boldmath$\sigma$}(x)=(\sigma(x_{1}),\sigma(x_{2}),\ldots,\sigma(x_{n_{l+1}}))^{\top}$, where $\sigma:\mathbb{R}\to\mathbb{R}$ is called an activation function. For a scalar $x\in\mathbb{R}$, typical activation functions include the sigmoid function $\sigma(x)=(1+e^{-x})^{-1}$, the hyperbolic tangent function $\sigma(x)=\tanh(x)$, the rectified linear unit (ReLU) function $\sigma(x)=\max(0,x)$, and the linear function $\sigma(x)=x$.
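For illustration, the following Python sketch (not part of the paper; the names W, b, a and the choice of tanh are only examples) evaluates one layer of the form (1.1).

import numpy as np

def layer(x, W, b, a, sigma=np.tanh):
    # One layer of the form (1.1): elementwise a * sigma(W x + b).
    return a * sigma(W @ x + b)

# Example: a layer mapping R^3 -> R^2 with tanh activation.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))
b = rng.standard_normal(2)
a = rng.standard_normal(2)
x = np.array([1.0, -0.5, 2.0])
print(layer(x, W, b, a))   # output in R^2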

If we restrict the functions to the form (1.1), the goal of the training step is to learn $W^{(l)},b^{(l)},a^{(l)}$ so that $H(\xi)$ approximates $F(\xi)$. The gradient method is used for training. Let $G_{W^{(l)}},G_{b^{(l)}}$, and $G_{a^{(l)}}$ be the gradients of the loss function with respect to $W^{(l)},b^{(l)}$, and $a^{(l)}$, respectively, and let $\tau>0$ be the learning rate. Using the gradient method, the weights and biases are updated as follows:

W^{(l)}\leftarrow W^{(l)}-\tau G_{W^{(l)}},\quad b^{(l)}\leftarrow b^{(l)}-\tau G_{b^{(l)}},\quad a^{(l)}\leftarrow a^{(l)}-\tau G_{a^{(l)}}.

Note that the stochastic gradient method [4] is widely used in practice, and the gradients are computed by error backpropagation [19].

It is known that deep (convolutional) neural networks are of great importance in image recognition [21, 23]. In [12], it was found through controlled experiments that increasing the depth of a network actually improves its performance and accuracy, in exchange, of course, for additional time complexity. However, when the depth is increased too much, the accuracy may stagnate or even degrade [12]. In addition, deeper networks may be harder to train because of vanishing or exploding gradients [3, 10]. To address this issue, the authors of [13] proposed residual learning to facilitate the training of networks that are considerably deeper than those used previously. Such a network is referred to as a residual network, or ResNet. Let $n$ and $m$ be the dimensions of the input and output data, and let $N$ be the number of nodes in each layer. A ResNet can be represented as

\left\{\begin{aligned} x^{(l+1)}&=x^{(l)}+f^{(l)}(x^{(l)}),&l=0,1,\ldots,L-1,\\ x^{(0)}&=Q\xi.&\end{aligned}\right. (1.2)

The final output of the network is then $H(\xi):=Px^{(L)}$, where $P\in\mathbb{R}^{m\times N}$ and $Q\in\mathbb{R}^{N\times n}$. The function $f^{(l)}$ is learned from training data.

Rewriting (1.2) as

\left\{\begin{aligned} x^{(l+1)}&=x^{(l)}+hf^{(l)}(x^{(l)}),&l=0,1,\ldots,L-1,\\ x^{(0)}&=Q\xi,&\end{aligned}\right. (1.3)

where $h$ is the step size between layers, leads to the update rule of the Euler method, a method for computing numerical solutions of initial value problems for ordinary differential equations. Indeed, putting $x(t):=x^{(l)}$ and $f(t,x):=f^{(l)}(x)$, where $t=hl$, $T=hL$, and $f:[0,T]\times\mathbb{R}^{N}\rightarrow\mathbb{R}^{N}$, the limit of (1.3) as $h$ approaches zero yields the following initial value problem for an ordinary differential equation:

\left\{\begin{aligned} x^{\prime}(t)&=f(t,x(t)),&t\in(0,T],\\ x(0)&=Q\xi.&\end{aligned}\right. (1.4)

We call the function $H=[D\ni\xi\mapsto Px(T)]$ an ODENet [6] associated with the system of ordinary differential equations (1.4).
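As a sketch of this correspondence (not the authors' code; f, Q, P, and the step count are illustrative), integrating (1.4) with the forward Euler method reproduces exactly the skip-connection recursion (1.3):

import numpy as np

def euler_odenet(f, Q, P, xi, T=1.0, L=100):
    # Integrate x' = f(t, x), x(0) = Q xi, with L Euler steps and return P x(T).
    h = T / L
    x = Q @ xi
    for l in range(L):
        x = x + h * f(l * h, x)   # precisely the ResNet-type update (1.3)
    return P @ x

# Toy example: n = m = N = 2, f(t, x) = tanh(x), Q = P = identity.
f = lambda t, x: np.tanh(x)
I = np.eye(2)
print(euler_odenet(f, I, I, np.array([0.3, -0.7])))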

Remark 1.1.

In a real deep learning system, the vector field $f(t,x)$ should be chosen from a family of vector fields $f\in\{f_{\omega}\}_{\omega}$, where $\omega$ is a parameter to be optimized. In this paper, we consider an ODENet associated with (2.3), which we call an $(\alpha,\beta,\gamma)$-type ODENet. Instead of the variable $x(t)\in\mathbb{R}^{N}$ in (1.4), we consider $(x(t),y(t))\in\mathbb{R}^{n}\times\mathbb{R}^{m}$ in (2.3), where $N=m+n$ (see Appendix D for details). Implementing an ODENet requires a (forward) Euler discretization, which yields a ResNet. A ResNet with (2.5) corresponds to the discretized version of the $(\alpha,\beta,\gamma)$-type ODENet, and we call it an $(\alpha,\beta,\gamma)$-type ResNet.

A neural network of arbitrary width and bounded depth has a universal approximation property (UAP). The classical UAP states that continuous functions on a compact subset of $\mathbb{R}^{n}$ can be approximated by a linear combination of activation functions. It has been shown that the UAP for neural networks holds when the activation function is a sigmoidal function [7, 14, 5, 9], any bounded function that is not a polynomial [16, 1], or any function in the Lizorkin space, which includes the ReLU function [22]. The UAP for neural networks and the proof technique for each activation function are presented in Table 1.

Table 1: Activation functions and the classical universal approximation property of neural networks
References Activation function How to prove
Cybenko [7] Continuous sigmoidal Hahn-Banach theorem
Hornik et al. [14] Monotonic sigmoidal Stone-Weierstrass theorem
Carroll, Dickinson [5] Continuous sigmoidal Radon transform
Funahashi [9] Monotonic sigmoidal Fourier transform
Leshno et al. [16] Non-polynomial Weierstrass theorem
Attali, Pagès [1] Non-polynomial Taylor expansion
Sonoda, Murata [22] Lizorkin distribution Ridgelet transform

Recently, some positive results have established the UAP for particular deep narrow networks. Hanin and Sellke [11] have shown that deep narrow networks with the ReLU activation function have the UAP and require width only $n+m$. Lin and Jegelka [17] have shown that a ResNet with the ReLU activation function, arbitrary input dimension, width $1$, and output dimension $1$ has the UAP. For activation functions other than ReLU, Kidger and Lyons [15] have shown that deep narrow networks with any non-polynomial continuous activation function have the UAP and require width only $n+m+1$. A comparison of these UAPs is shown in Table 2. It is proved in [15, 17] and Theorem 2.7 that the UAP holds as $L\rightarrow\infty$. Theorem 2.3 shows that the UAP holds in the continuous setting.

Table 2: A comparison of universal approximation properties
Shallow wide NN / Deep narrow NN / ResNet
References: [16, 22] / [15] / [17]
Input dimension $n$, output dimension $m$: $n,m$ any / $n,m$ any / $n$ any, $m=1$
Activation function: Non-polynomial / Non-polynomial / ReLU
Depth $L$: $L=3$ / $L\to\infty$ / $L\to\infty$
Width $N$: $N\to\infty$ / $N=n+m+1$ / $N=1$

$(\alpha,\beta,\gamma)$-type ResNet / $(\alpha,\beta,\gamma)$-type ODENet
References: Theorem 2.7 / Theorem 2.3
Input dimension $n$, output dimension $m$: $n\geq m$ / $n\geq m$
Activation function: Non-polynomial / Non-polynomial
Depth $L$: $L\to\infty$ / continuous setting ($L=\infty$)
Width $N$: $N=n+m$ / $N=n+m$

In this paper, we propose the $(\alpha,\beta,\gamma)$-type ODENet associated with (2.3), whose width is $n+m$, and give conditions under which the UAP holds for the $(\alpha,\beta,\gamma)$-type ODENet and for the $(\alpha,\beta,\gamma)$-type ResNet with (2.5). In Section 2, we show that the UAP holds for the $(\alpha,\beta,\gamma)$-type ODENet associated with (2.3) and for the $(\alpha,\beta,\gamma)$-type ResNet with (2.5). In Section 3, we derive the gradient of the loss function and a learning algorithm for the $(\alpha,\beta,\gamma)$-type ODENet under consideration, followed by some numerical experiments in Section 4. Finally, we end the paper with a conclusion in Section 5.

2 Universal Approximation Theorem for $(\alpha,\beta,\gamma)$-type ODENet and $(\alpha,\beta,\gamma)$-type ResNet

2.1 Definition of an activation function with universal approximation property

Let $m$ and $n$ be natural numbers. Our main results, Theorem 2.3 and Theorem 2.7, show that any continuous function on a compact subset of $\mathbb{R}^{n}$ can be approximated using the $(\alpha,\beta,\gamma)$-type ODENet and the $(\alpha,\beta,\gamma)$-type ResNet.

In this paper, the following notations are used:

|x|:=\left(\sum_{i=1}^{n}|x_{i}|^{2}\right)^{\frac{1}{2}},\quad\|A\|:=\left(\sum_{i=1}^{m}\sum_{j=1}^{n}|a_{ij}|^{2}\right)^{\frac{1}{2}},

for any $x=(x_{1},x_{2},\ldots,x_{n})^{\top}\in\mathbb{R}^{n}$ and $A=(a_{ij})_{\begin{subarray}{c}i=1,\ldots,m\\ j=1,\ldots,n\end{subarray}}\in\mathbb{R}^{m\times n}$. Also, we define

\nabla_{x}^{\top}f:=\left(\frac{\partial f_{i}}{\partial x_{j}}\right)_{\begin{subarray}{c}i=1,\ldots,m\\ j=1,\ldots,n\end{subarray}},\quad\nabla_{x}f^{\top}:=\left(\nabla_{x}^{\top}f\right)^{\top}

for any $f\in C^{1}(\mathbb{R}^{n};\mathbb{R}^{m})$. For a function $\sigma:\mathbb{R}\to\mathbb{R}$, we define $\mbox{\boldmath$\sigma$}:\mathbb{R}^{m}\to\mathbb{R}^{m}$ by

\mbox{\boldmath$\sigma$}(x):=\left(\begin{array}[]{c}\sigma(x_{1})\\ \sigma(x_{2})\\ \vdots\\ \sigma(x_{m})\end{array}\right) (2.1)

for $x=(x_{1},x_{2},\ldots,x_{m})^{\top}\in\mathbb{R}^{m}$. For $a=(a_{1},a_{2},\ldots,a_{m})^{\top},b=(b_{1},b_{2},\ldots,b_{m})^{\top}\in\mathbb{R}^{m}$, their Hadamard product is defined by

a\odot b:=\left(\begin{array}[]{c}a_{1}b_{1}\\ a_{2}b_{2}\\ \vdots\\ a_{m}b_{m}\end{array}\right)\in\mathbb{R}^{m}. (2.2)
Definition 2.1 (Universal approximation property for the activation function $\sigma$).

Let $\sigma$ be a real-valued function on $\mathbb{R}$ and $D$ be a compact subset of $\mathbb{R}^{n}$. Also, consider the set

S:=\left\{G:D\to\mathbb{R}\left|G(\xi)=\sum_{l=1}^{L}\alpha_{l}\sigma(\mbox{\boldmath$c$}_{l}\cdot\xi+d_{l}),L\in\mathbb{N},\alpha_{l},d_{l}\in\mathbb{R},\mbox{\boldmath$c$}_{l}\in\mathbb{R}^{n}\right.\right\}.

Suppose that $S$ is dense in $C(D)$. In other words, given $F\in C(D)$ and $\eta>0$, there exists a function $G\in S$ such that

|G(\xi)-F(\xi)|<\eta

for any $\xi\in D$. Then, we say that $\sigma$ has a universal approximation property (UAP) on $D$.
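For illustration, the following Python sketch (assumed, with illustrative parameter values) evaluates one element $G$ of the set $S$ in Definition 2.1, i.e. a shallow network $G(\xi)=\sum_{l=1}^{L}\alpha_{l}\sigma(\mbox{\boldmath$c$}_{l}\cdot\xi+d_{l})$.

import numpy as np

def G(xi, alpha, C, d, sigma=np.tanh):
    # alpha: (L,), C: (L, n), d: (L,); returns sum_l alpha_l * sigma(c_l . xi + d_l).
    return float(alpha @ sigma(C @ xi + d))

rng = np.random.default_rng(1)
L, n = 5, 3
alpha, C, d = rng.standard_normal(L), rng.standard_normal((L, n)), rng.standard_normal(L)
print(G(np.array([0.1, 0.2, 0.3]), alpha, C, d))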

Some activation functions with the universal approximation property are presented in Table 3.

Table 3: Examples of activation functions with the universal approximation property
Activation function $\sigma(x)$
Unbounded functions
Truncated power function $x_{+}^{k}:=\left\{\begin{array}[]{ll}x^{k}&x>0\\ 0&x\leq 0\end{array}\right.$, $k\in\mathbb{N}\cup\{0\}$
ReLU function $x_{+}$
Softplus function $\log(1+e^{x})$
Bounded but not integrable functions
Unit step function $x_{+}^{0}$
(Standard) sigmoidal function $(1+e^{-x})^{-1}$
Hyperbolic tangent function $\tanh(x)$
Bump functions
(Gaussian) Radial basis function $\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^{2}}{2}\right)$
Dirac's $\delta$ function $\delta(x)$

A non-polynomial activation function in a neural network with three layers has the universal approximation property. This result was shown by Leshno et al. [16] using functional analysis, and later by Sonoda and Murata [22] using the ridgelet transform.

2.2 Main Theorem for $(\alpha,\beta,\gamma)$-type ODENet

In this subsection, we show the universal approximation property for the $(\alpha,\beta,\gamma)$-type ODENet associated with the ODE system (2.3). Since the first (resp. second) equation consists of $n$ (resp. $m$) equations, the width of the $(\alpha,\beta,\gamma)$-type ODENet is $n+m$.

Definition 2.2 ($(\alpha,\beta,\gamma)$-type ODENet).

Suppose that an $m\times n$ real matrix $A$ and a function $\sigma:\mathbb{R}\to\mathbb{R}$ are given. We consider the system of ODEs

\left\{\begin{aligned} x^{\prime}(t)&=\beta(t)x(t)+\gamma(t),&t\in(0,T],\\ y^{\prime}(t)&=\alpha(t)\odot\mbox{\boldmath$\sigma$}(Ax(t)),&t\in(0,T],\\ x(0)&=\xi,&\\ y(0)&=0,&\end{aligned}\right. (2.3)

where $x$ and $y$ are functions from $[0,T]$ to $\mathbb{R}^{n}$ and $\mathbb{R}^{m}$, respectively; $x(0)=\xi\in\mathbb{R}^{n}$ is the input data and $y(T)\in\mathbb{R}^{m}$ is the final output. The functions $\alpha:[0,T]\to\mathbb{R}^{m}$, $\beta:[0,T]\to\mathbb{R}^{n\times n}$, and $\gamma:[0,T]\to\mathbb{R}^{n}$ are design parameters. The function $\mbox{\boldmath$\sigma$}:\mathbb{R}^{m}\to\mathbb{R}^{m}$ is defined by (2.1), and the operator $\odot$ denotes the Hadamard product defined by (2.2). We call $H=[\xi\mapsto y(T)]:\mathbb{R}^{n}\to\mathbb{R}^{m}$ an $(\alpha,\beta,\gamma)$-type ODENet associated with the ODE system (2.3).
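The following is a minimal numerical sketch of Definition 2.2, assuming a forward Euler discretization with step $h=T/L$ (as used later in Section 3.2) and $\sigma=\tanh$; the parameter functions and their values are illustrative, not taken from the paper.

import numpy as np

def odenet_forward(xi, A, alpha, beta, gamma, sigma=np.tanh, T=1.0, L=100):
    # Approximate H(xi) = y(T) for the system (2.3): x' = beta x + gamma, y' = alpha (.) sigma(A x).
    h = T / L
    x = np.asarray(xi, dtype=float)       # state in R^n
    y = np.zeros(A.shape[0])              # output state in R^m, y(0) = 0
    for l in range(L):
        t = l * h
        x_new = x + h * (beta(t) @ x + gamma(t))
        y = y + h * alpha(t) * sigma(A @ x)
        x = x_new
    return y

# Toy example with n = 2, m = 1 and constant design parameters.
A = np.array([[1.0, 0.0]])
H = odenet_forward(np.array([0.5, -0.2]), A,
                   alpha=lambda t: np.array([1.0]),
                   beta=lambda t: np.zeros((2, 2)),
                   gamma=lambda t: np.zeros(2))
print(H)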

For a compact subset $D\subset\mathbb{R}^{n}$, we define

S(D):=\{[\xi\mapsto y(T)]\in C(D;\mathbb{R}^{m})\,|\,\alpha\in C^{\infty}([0,T];\mathbb{R}^{m}),\beta\in C^{\infty}([0,T];\mathbb{R}^{n\times n}),\gamma\in C^{\infty}([0,T];\mathbb{R}^{n})\}.

We will assume that the activation function is locally Lipschitz continuous; in other words,

\forall R>0,~\exists L_{R}>0~\mathrm{s.t.}\quad|\sigma(s_{1})-\sigma(s_{2})|\leq L_{R}|s_{1}-s_{2}|\quad\mathrm{for}~s_{1},s_{2}\in[-R,R]. (2.4)
Theorem 2.3 (UAP for $(\alpha,\beta,\gamma)$-type ODENet).

Suppose that $m\leq n$ and $\mathrm{rank}(A)=m$. If $\sigma:\mathbb{R}\to\mathbb{R}$ satisfies (2.4) and has the UAP on a compact subset $D\subset\mathbb{R}^{n}$, then $S(D)$ is dense in $C(D;\mathbb{R}^{m})$. In other words, given $F\in C(D;\mathbb{R}^{m})$ and $\eta>0$, there exists a function $H\in S(D)$ such that

|H(\xi)-F(\xi)|<\eta,

for any $\xi\in D$.

Corollary 2.4.

Let $1\leq p<\infty$. Then, $S(D)$ is dense in $L^{p}(D;\mathbb{R}^{m})$. In other words, given $F\in L^{p}(D;\mathbb{R}^{m})$ and $\eta>0$, there exists a function $H\in S(D)$ such that

\|H-F\|_{L^{p}(D;\mathbb{R}^{m})}<\eta.
Remark 2.5.

The assumption on $\sigma$ in Theorem 2.3 holds, e.g., if $\sigma$ is a non-polynomial continuous function [16].

2.3 Main Theorem for $(\alpha,\beta,\gamma)$-type ResNet

In this subsection, we show that a universal approximation property also holds for the $(\alpha,\beta,\gamma)$-type ResNet with the system of difference equations (2.5).

Definition 2.6 ($(\alpha,\beta,\gamma)$-type ResNet).

Suppose that an $m\times n$ real matrix $A$ and a function $\sigma:\mathbb{R}\to\mathbb{R}$ are given. We consider the system of difference equations

\left\{\begin{aligned} x^{(l)}&=x^{(l-1)}+\beta^{(l)}x^{(l-1)}+\gamma^{(l)},&l=1,2,\ldots,L\\ y^{(l)}&=y^{(l-1)}+\alpha^{(l)}\odot\mbox{\boldmath$\sigma$}(Ax^{(l)}),&l=1,2,\ldots,L\\ x^{(0)}&=\xi,&\\ y^{(0)}&=0,&\end{aligned}\right. (2.5)

where $x^{(l)}$ and $y^{(l)}$ are $n$- and $m$-dimensional real vectors for all $l=0,1,\ldots,L$, respectively. Also, $\xi\in\mathbb{R}^{n}$ denotes the input data, while $y^{(L)}\in\mathbb{R}^{m}$ represents the final output. Moreover, $\alpha^{(l)}\in\mathbb{R}^{m}$, $\beta^{(l)}\in\mathbb{R}^{n\times n}$, and $\gamma^{(l)}\in\mathbb{R}^{n}~(l=1,2,\ldots,L)$ are design parameters. The function $\mbox{\boldmath$\sigma$}:\mathbb{R}^{m}\to\mathbb{R}^{m}$ is defined by (2.1), and the operator $\odot$ denotes the Hadamard product defined by (2.2). We call the function $H=[\xi\mapsto y^{(L)}]:D\to\mathbb{R}^{m}$ an $(\alpha,\beta,\gamma)$-type ResNet with the system of difference equations (2.5).
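The forward pass of Definition 2.6 can be sketched as follows (illustrative parameters; not the authors' implementation). Note that the $y$-update uses the already-updated state $x^{(l)}$, in contrast to the Euler scheme in Section 3.2.

import numpy as np

def resnet_forward(xi, A, alphas, betas, gammas, sigma=np.tanh):
    # alphas: list of (m,), betas: list of (n,n), gammas: list of (n,); returns y^(L).
    x = np.asarray(xi, dtype=float)
    y = np.zeros(A.shape[0])
    for al, be, ga in zip(alphas, betas, gammas):
        x = x + be @ x + ga                 # x^(l) = x^(l-1) + beta^(l) x^(l-1) + gamma^(l)
        y = y + al * sigma(A @ x)           # y^(l) = y^(l-1) + alpha^(l) (.) sigma(A x^(l))
    return y

# Toy example: n = 2, m = 1, depth L = 3.
A = np.array([[1.0, -1.0]])
L = 3
print(resnet_forward(np.array([0.2, 0.4]), A,
                     alphas=[np.array([0.5])] * L,
                     betas=[0.1 * np.eye(2)] * L,
                     gammas=[np.zeros(2)] * L))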

For a compact subset $D\subset\mathbb{R}^{n}$, we define

S_{\mathrm{res}}(D):=\{[\xi\mapsto y^{(L)}]\in C(D;\mathbb{R}^{m})\,|\,L\in\mathbb{N},\alpha^{(l)}\in\mathbb{R}^{m},\beta^{(l)}\in\mathbb{R}^{n\times n},\gamma^{(l)}\in\mathbb{R}^{n}~(l=1,2,\ldots,L)\}.
Theorem 2.7 (UAP for $(\alpha,\beta,\gamma)$-type ResNet).

Suppose that $m\leq n$ and $\mathrm{rank}(A)=m$. If $\sigma:\mathbb{R}\to\mathbb{R}$ satisfies (2.4) and has the UAP on a compact subset $D\subset\mathbb{R}^{n}$, then $S_{\mathrm{res}}(D)$ is dense in $C(D;\mathbb{R}^{m})$.

Remark 2.8.

The assumption on $\sigma$ in Theorem 2.7 holds, e.g., if $\sigma$ is a non-polynomial continuous function [16].

2.4 Some lemmas

We describe some lemmas used to prove Theorems 2.3 and 2.7.

Lemma 2.9.

Suppose that $m\leq n$. Let $\mbox{\boldmath$\sigma$}$ be the function from $\mathbb{R}^{m}$ to $\mathbb{R}^{m}$ defined by (2.1). For any $\alpha,d\in\mathbb{R}^{m}$ and any $C=(\mbox{\boldmath$c$}_{1},\mbox{\boldmath$c$}_{2},\ldots,\mbox{\boldmath$c$}_{m})^{\top}\in\mathbb{R}^{m\times n}$ with no zero rows (i.e. $\mbox{\boldmath$c$}_{l}\neq 0$ for $l=1,2,\ldots,m$), there exist $\tilde{\alpha}^{(l)},\tilde{d}^{(l)}\in\mathbb{R}^{m}$ and $\tilde{C}^{(l)}\in\mathbb{R}^{m\times n}~(l=1,2,\ldots,m)$ such that

\alpha\odot\mbox{\boldmath$\sigma$}(C\xi+d)=\sum_{l=1}^{m}\tilde{\alpha}^{(l)}\odot\mbox{\boldmath$\sigma$}(\tilde{C}^{(l)}\xi+\tilde{d}^{(l)}),

for any $\xi\in\mathbb{R}^{n}$, and $\mathrm{rank}(\tilde{C}^{(l)})=m$ for all $l=1,2,\ldots,m$. Moreover, if $m=n$, we can choose $\tilde{C}^{(l)}\in\mathbb{R}^{n\times n}$ such that $\det\tilde{C}^{(l)}>0$ for all $l=1,2,\ldots,n$.

Proof.

Let mnm\leq n. For all l=1,2,,ml=1,2,\ldots,m, there exists C~(l)=(𝒄~1(l),𝒄~2(l),,𝒄~m(l))m×n\tilde{C}^{(l)}=(\tilde{\mbox{\boldmath$c$}}_{1}^{(l)},\tilde{\mbox{\boldmath$c$}}_{2}^{(l)},\ldots,\tilde{\mbox{\boldmath$c$}}_{m}^{(l)})^{\top}\in\mathbb{R}^{m\times n} such that 𝒄~l(l)=𝒄l\tilde{\mbox{\boldmath$c$}}_{l}^{(l)}=\mbox{\boldmath$c$}_{l}, rank(C~(l))=m\mathrm{rank}(\tilde{C}^{(l)})=m. Then, we put

α~k(l):={αk,ifl=k,0,iflk,d~k(l):={dk,ifl=k,0,iflk.\tilde{\alpha}_{k}^{(l)}:=\left\{\begin{array}[]{ll}\alpha_{k},&\mathrm{if}~{}l=k,\\ 0,&\mathrm{if}~{}l\neq k,\end{array}\right.\quad\tilde{d}_{k}^{(l)}:=\left\{\begin{array}[]{ll}d_{k},&\mathrm{if}~{}l=k,\\ 0,&\mathrm{if}~{}l\neq k.\end{array}\right.

Looking at the kk-th component, we see that for any ξn\xi\in\mathbb{R}^{n}, we have

l=1mα~k(l)σ(𝒄~k(l)ξ+d~k(l))=α~k(k)σ(𝒄~k(k)ξ+d~k(k))=αkσ(𝒄kξ+dk).\sum_{l=1}^{m}\tilde{\alpha}_{k}^{(l)}\sigma(\tilde{\mbox{\boldmath$c$}}_{k}^{(l)}\cdot\xi+\tilde{d}_{k}^{(l)})=\tilde{\alpha}_{k}^{(k)}\sigma(\tilde{\mbox{\boldmath$c$}}_{k}^{(k)}\cdot\xi+\tilde{d}_{k}^{(k)})=\alpha_{k}\sigma(\mbox{\boldmath$c$}_{k}\cdot\xi+d_{k}).

Therefore,

l=1mα~(l)𝝈(C~(l)ξ+d~(l))=α𝝈(Cξ+d).\sum_{l=1}^{m}\tilde{\alpha}^{(l)}\odot\mbox{\boldmath$\sigma$}(\tilde{C}^{(l)}\xi+\tilde{d}^{(l)})=\alpha\odot\mbox{\boldmath$\sigma$}(C\xi+d).

Now, if m=nm=n, then rank(C~(l))=n\mathrm{rank}(\tilde{C}^{(l)})=n, and so det(C~(l))0\det(\tilde{C}^{(l)})\neq 0. In particular, we can choose C~(l)\tilde{C}^{(l)} such that det(C~(l))>0\det(\tilde{C}^{(l)})>0. ∎

Lemma 2.10.

Suppose that $m\leq n$. Let $\mbox{\boldmath$\sigma$}$ be a function from $\mathbb{R}^{m}$ to $\mathbb{R}^{m}$. For any $L\in\mathbb{N}$, $\alpha^{(l)},d^{(l)}\in\mathbb{R}^{m}$, and $C^{(l)}\in\mathbb{R}^{m\times n}~(l=1,2,\ldots,L)$, there exist $L^{\prime}\in\mathbb{N}$, $\tilde{\alpha}^{(l)},\tilde{d}^{(l)}\in\mathbb{R}^{m}$, and $\tilde{C}^{(l)}\in\mathbb{R}^{m\times n}~(l=1,2,\ldots,L^{\prime})$ such that

\frac{1}{L}\sum_{l=1}^{L}\alpha^{(l)}\odot\mbox{\boldmath$\sigma$}(C^{(l)}\xi+d^{(l)})=\frac{1}{L^{\prime}}\sum_{l=1}^{L^{\prime}}\tilde{\alpha}^{(l)}\odot\mbox{\boldmath$\sigma$}(\tilde{C}^{(l)}\xi+\tilde{d}^{(l)})

for any $\xi\in\mathbb{R}^{n}$, and $\mathrm{rank}(\tilde{C}^{(l)})=m$ for all $l=1,2,\ldots,L^{\prime}$. Moreover, if $m=n$, we can choose $\tilde{C}^{(l)}\in\mathbb{R}^{m\times n}$ such that $\det\tilde{C}^{(l)}>0$ for all $l=1,2,\ldots,L^{\prime}$.

Proof.

This follows from Lemma 2.9. ∎

Lemma 2.11.

Suppose that $m<n$. Let $A$ be an $m\times n$ real matrix satisfying $\mathrm{rank}(A)=m$. Then, for any $C\in\mathbb{R}^{m\times n}$ satisfying $\mathrm{rank}(C)=m$, there exists $P\in\mathbb{R}^{n\times n}$ such that

C=AP,\quad\det P>0. (2.6)

In addition, if $m=n$ and $\mathrm{sgn}(\det C)=\mathrm{sgn}(\det A)$, there exists $P\in\mathbb{R}^{n\times n}$ such that (2.6) holds.

Proof.
  1. (i)

    Suppose that m<nm<n. From rank(A)=rank(C)=m\mathrm{rank}(A)=\mathrm{rank}(C)=m, there exists A¯,C¯(nm)×n\bar{A},\bar{C}\in\mathbb{R}^{(n-m)\times n} such that

    detA~>0,A~=(AA¯),detC~>0,C~=(CC¯).\det\tilde{A}>0,\quad\tilde{A}=\left(\begin{array}[]{c}A\\ \bar{A}\end{array}\right),\quad\det\tilde{C}>0,\quad\tilde{C}=\left(\begin{array}[]{c}C\\ \bar{C}\end{array}\right).

    If we put P:=A~1C~P:=\tilde{A}^{-1}\tilde{C}, we get detP>0\det P>0, C=APC=AP.

  2. (ii)

    Suppose that m=nm=n. We put P:=A1CP:=A^{-1}C. Because sgn(detC)=sgn(detA)\mathrm{sgn}(\det C)=\mathrm{sgn}(\det A), we have detP>0\det P>0, and so C=APC=AP.

Lemma 2.12.

Let $p\in[1,\infty)$. Suppose that

P(t)=P^{(l)}\in\mathbb{R}^{n\times n},\quad\det P^{(l)}>0,

for $t_{l-1}\leq t<t_{l}$ and for all $l=1,2,\ldots,L$, where $t_{0}=0$ and $t_{L}=T$. Then, there exists a real number $C>0$ such that, for any $\varepsilon>0$, there exists $P^{\varepsilon}\in C^{\infty}([0,T];\mathbb{R}^{n\times n})$ such that

\|P^{\varepsilon}-P\|_{L^{p}(0,T;\mathbb{R}^{n\times n})}<\varepsilon,\quad\det P^{\varepsilon}(t)>0,\quad\mathrm{and}\quad\|P^{\varepsilon}(t)\|\leq C,

for any $t\in[0,T]$.

Proof.

We define GL+(n,):={An×n|detA>0}\mathrm{GL}^{+}(n,\mathbb{R}):=\{A\in\mathbb{R}^{n\times n}|\det A>0\}. From [2, Chapter 9, p.239], GL+(n,)\mathrm{GL}^{+}(n,\mathbb{R}) is path-connected. For all l=1,2,,Ll=1,2,\ldots,L, there exists Q(l)C([0,1];n×n)Q^{(l)}\in C([0,1];\mathbb{R}^{n\times n}) such that

Q(l)(0)=P(l),Q(l)(1)=P(l+1),anddetQ(l)(s)>0,Q^{(l)}(0)=P^{(l)},\quad Q^{(l)}(1)=P^{(l+1)},\quad\mathrm{and}\quad\det Q^{(l)}(s)>0,

for any s[0,1]s\in[0,1]. For δ>0\delta>0, we put

Qδ(t):={P(1),<t<t1,Q(l)(ttlδ),tlt<tl+δ,(l=1,2,,L1),P(l)tl1+δt<tl,(l=2,3,,L2),P(L)tL1+δt<.Q^{\delta}(t):=\left\{\begin{array}[]{lll}P^{(1)},&-\infty<t<t_{1},&\\ \displaystyle{Q^{(l)}\left(\frac{t-t_{l}}{\delta}\right)},&t_{l}\leq t<t_{l}+\delta,&(l=1,2,\ldots,L-1),\\ P^{(l)}&t_{l-1}+\delta\leq t<t_{l},&(l=2,3,\ldots,L-2),\\ P^{(L)}&t_{L-1}+\delta\leq t<\infty.&\end{array}\right.

Then, QδQ^{\delta} is a continuous function from \mathbb{R} to n×n\mathbb{R}^{n\times n}. There exists a C0>0C_{0}>0 such that detQδ(t)C0\det Q^{\delta}(t)\geq C_{0}, for any tt\in\mathbb{R}. Let {φε}ε>0\{\varphi_{\varepsilon}\}_{\varepsilon>0} be a sequence of Friedrichs’ mollifiers in \mathbb{R}. We put

Pε(t):=(φεQδ)(t).P^{\varepsilon}(t):=(\varphi_{\varepsilon}*Q^{\delta})(t).

Then, PεC(;n×n)P^{\varepsilon}\in C^{\infty}(\mathbb{R};\mathbb{R}^{n\times n}). Since

limε0PεQδC([0,T];n×n)=0,\lim_{\varepsilon\to 0}\|P^{\varepsilon}-Q^{\delta}\|_{C([0,T];\mathbb{R}^{n\times n})}=0,

there exists a number ε0>0\varepsilon_{0}>0 such that, for any εε0\varepsilon\leq\varepsilon_{0},

detPε(t)C02\det P^{\varepsilon}(t)\geq\frac{C_{0}}{2}

for all t[0,T]t\in[0,T]. Because QδQ^{\delta} is bounded, there exists a number C>0C>0 such that Pε(t)C\|P^{\varepsilon}(t)\|\leq C, for any t[0,T]t\in[0,T]. Now, we note that

PεPLp(0,T;n×n)PεQδLp(0,T;n×n)+QδPLp(0,T;n×n).\|P^{\varepsilon}-P\|_{L^{p}(0,T;\mathbb{R}^{n\times n})}\leq\|P^{\varepsilon}-Q^{\delta}\|_{L^{p}(0,T;\mathbb{R}^{n\times n})}+\|Q^{\delta}-P\|_{L^{p}(0,T;\mathbb{R}^{n\times n})}.

The last summand is calculated as follows

QδPLp(0,T;n×n)p\displaystyle\|Q^{\delta}-P\|_{L^{p}(0,T;\mathbb{R}^{n\times n})}^{p} =0TQδ(t)P(t)p𝑑t,\displaystyle=\int_{0}^{T}\|Q^{\delta}(t)-P(t)\|^{p}dt,
=l=1L1tltl+δQ(l)(ttlδ)P(l+1)p𝑑t,\displaystyle=\sum_{l=1}^{L-1}\int_{t_{l}}^{t_{l}+\delta}\left\|Q^{(l)}\left(\frac{t-t_{l}}{\delta}\right)-P^{(l+1)}\right\|^{p}dt,
=δl=1L101Qδ(s)P(l+1)p𝑑s.\displaystyle=\delta\sum_{l=1}^{L-1}\int_{0}^{1}\|Q^{\delta}(s)-P^{(l+1)}\|^{p}ds.

Hence, if δ0\delta\to 0, then QδPLp(0,T;n×n)0\|Q^{\delta}-P\|_{L^{p}(0,T;\mathbb{R}^{n\times n})}\to 0. Therefore,

PεPLp(0,T;n×n)<ε,\|P^{\varepsilon}-P\|_{L^{p}(0,T;\mathbb{R}^{n\times n})}<\varepsilon,

for any ε>0\varepsilon>0. ∎

Remark 2.13.

Lemma 2.12 does not hold when $p=\infty$ (except when $P$ is a constant function) because the uniform limit of continuous functions is continuous.

2.5 Proofs

In this subsection, we provide the proofs of Theorem 2.3 and Theorem 2.7.

2.5.1 Proof of Theorem 2.3

Proof.

Since 𝝈C(m;m)\mbox{\boldmath$\sigma$}\in C(\mathbb{R}^{m};\mathbb{R}^{m}) is defined by (2.1), where σC()\sigma\in C(\mathbb{R}) satisfies a UAP, then given FC(D;m)F\in C(D;\mathbb{R}^{m}) and η>0\eta>0, there exist a positive integer LL, m\mathbb{R}^{m}-valued vectors α(l)\alpha^{(l)} and d(l)d^{(l)}, and matrices C(l)m×nC^{(l)}\in\mathbb{R}^{m\times n}, for all l=1,2,,Ll=1,2,\ldots,L, such that

G(ξ)=TLl=1Lα(l)𝝈(C(l)ξ+d(l)),G(\xi)=\frac{T}{L}\sum_{l=1}^{L}\alpha^{(l)}\odot\mbox{\boldmath$\sigma$}(C^{(l)}\xi+d^{(l)}),
|G(ξ)F(ξ)|<η2,|G(\xi)-F(\xi)|<\frac{\eta}{2}, (2.7)

for any ξD\xi\in D. From Lemma 2.10, we know that rank(C(l))=m\mathrm{rank}(C^{(l)})=m, for l=1,2,,Ll=1,2,\ldots,L. In addition, when m=nm=n, we have sgn(detA)=sgn(detC(l))\mathrm{sgn}(\det A)=\mathrm{sgn}(\det C^{(l)}). In view of Lemma 2.11, there exists a matrix P(l)n×nP^{(l)}\in\mathbb{R}^{n\times n} such that detP(l)>0\det P^{(l)}>0 and C(l)=AP(l)C^{(l)}=AP^{(l)}, for each l=1,2,,Ll=1,2,\ldots,L. We put q(l):=A(AA)1d(l)q^{(l)}:=A^{\top}(AA^{\top})^{-1}d^{(l)} so that d(l)=Aq(l)d^{(l)}=Aq^{(l)}. In addition, we let

α(t):=α(l),P(t):=P(l),q(t):=q(l),l1LTt<lLT.\alpha(t):=\alpha^{(l)},\quad P(t):=P^{(l)},\quad q(t):=q^{(l)},\quad\frac{l-1}{L}T\leq t<\frac{l}{L}T.

Then, detP(t)>0\det P(t)>0 for any t[0,T]t\in[0,T] and

G(ξ)=TLl=1Lα(l)𝝈(AP(l)ξ+Aq(l))=0Tα(t)𝝈(A(P(t)ξ+q(t)))𝑑t.G(\xi)=\frac{T}{L}\sum_{l=1}^{L}\alpha^{(l)}\odot\mbox{\boldmath$\sigma$}(AP^{(l)}\xi+Aq^{(l)})=\int_{0}^{T}\alpha(t)\odot\mbox{\boldmath$\sigma$}(A(P(t)\xi+q(t)))dt.

Let {φε}ε>0\{\varphi_{\varepsilon}\}_{\varepsilon>0} be a sequence of Friedrichs’ mollifiers. We put αε(t):=(φεα)(t)\alpha^{\varepsilon}(t):=(\varphi_{\varepsilon}*\alpha)(t) and qε(t):=(φεq)(t)q^{\varepsilon}(t):=(\varphi_{\varepsilon}*q)(t). Then, αεC([0,T];m)\alpha^{\varepsilon}\in C^{\infty}([0,T];\mathbb{R}^{m}) and qεC([0,T];n)q^{\varepsilon}\in C^{\infty}([0,T];\mathbb{R}^{n}). From Lemma 2.12, there exists a real number C>0C>0 such that, given η>0\eta>0, there exists PεC([0,T];n×n)P^{\varepsilon}\in C^{\infty}([0,T];\mathbb{R}^{n\times n}) from which we have

PεPL1(0,T;n×n)<η,detPε(t)>0,Pε(t)C,\|P^{\varepsilon}-P\|_{L^{1}(0,T;\mathbb{R}^{n\times n})}<\eta,\quad\det P^{\varepsilon}(t)>0,\quad\|P^{\varepsilon}(t)\|\leq C,

for any t[0,T]t\in[0,T]. If we put

xε(t;ξ):=Pε(t)ξ+qε(t),x^{\varepsilon}(t;\xi):=P^{\varepsilon}(t)\xi+q^{\varepsilon}(t), (2.8)
yε(t;ξ):=0Tαε(s)𝝈(Axε(s;ξ))𝑑s,y^{\varepsilon}(t;\xi):=\int_{0}^{T}\alpha^{\varepsilon}(s)\odot\mbox{\boldmath$\sigma$}(Ax^{\varepsilon}(s;\xi))ds, (2.9)

then

yε(T;ξ)=0Tαε(t)𝝈(A(Pε(t)ξ+qε(t)))𝑑t.y^{\varepsilon}(T;\xi)=\int_{0}^{T}\alpha^{\varepsilon}(t)\odot\mbox{\boldmath$\sigma$}(A(P^{\varepsilon}(t)\xi+q^{\varepsilon}(t)))dt.

Hence, we have

|yε(T;ξ)G(ξ)|\displaystyle|y^{\varepsilon}(T;\xi)-G(\xi)| \displaystyle\leq 0T|αε(t)𝝈(A(Pε(t)ξ+qε(t)))α(t)𝝈(A(P(t)ξ+q(t)))|𝑑t,\displaystyle\int_{0}^{T}\left|\alpha^{\varepsilon}(t)\odot\mbox{\boldmath$\sigma$}(A(P^{\varepsilon}(t)\xi+q^{\varepsilon}(t)))-\alpha(t)\odot\mbox{\boldmath$\sigma$}(A(P(t)\xi+q(t)))\right|dt,
\displaystyle\leq 0T|αε(t)α(t)||𝝈(A(P(t)ξ+q(t)))|𝑑t,\displaystyle\int_{0}^{T}|\alpha^{\varepsilon}(t)-\alpha(t)||\mbox{\boldmath$\sigma$}(A(P(t)\xi+q(t)))|dt,
+0T|αε(t)||𝝈(A(Pε(t)ξ+qε(t)))𝝈(A(P(t)ξ+q(t)))|𝑑t.\displaystyle+\int_{0}^{T}|\alpha^{\varepsilon}(t)||\mbox{\boldmath$\sigma$}(A(P^{\varepsilon}(t)\xi+q^{\varepsilon}(t)))-\mbox{\boldmath$\sigma$}(A(P(t)\xi+q(t)))|dt.

Because $P$ and $q$ are piecewise constant functions, they are bounded. Since $\mbox{\boldmath$\sigma$}\in C(\mathbb{R}^{m};\mathbb{R}^{m})$, there exists $M>0$ such that $|\mbox{\boldmath$\sigma$}(A(P(t)\xi+q(t)))|\leq M$ for any $t\in[0,T]$. On the other hand, we have the estimate

|αε(t)|φε(ts)|α(s)|𝑑sαL(0,T;m)φε(τ)𝑑τ=αL(0,T;m).|\alpha^{\varepsilon}(t)|\leq\int_{\mathbb{R}}\varphi_{\varepsilon}(t-s)|\alpha(s)|ds\leq\|\alpha\|_{L^{\infty}(0,T;\mathbb{R}^{m})}\int_{\mathbb{R}}\varphi_{\varepsilon}(\tau)d\tau=\|\alpha\|_{L^{\infty}(0,T;\mathbb{R}^{m})}.

Similarly, because $\|q^{\varepsilon}\|_{L^{\infty}(0,T;\mathbb{R}^{n})}\leq\|q\|_{L^{\infty}(0,T;\mathbb{R}^{n})}$, $q^{\varepsilon}$ is bounded. Hence we can take $R>0$ such that $A(P^{\varepsilon}(t)\xi+q^{\varepsilon}(t))$, $A(P(t)\xi+q(t))\in[-R,R]^{m}$ for any $t\in[0,T]$ and $\xi\in D$. Then, by (2.4),

|𝝈(A(Pε(t)ξ+qε(t)))𝝈(A(P(t)ξ+q(t)))|\displaystyle|\mbox{\boldmath$\sigma$}(A(P^{\varepsilon}(t)\xi+q^{\varepsilon}(t)))-\mbox{\boldmath$\sigma$}(A(P(t)\xi+q(t)))|
LRA(Pε(t)P(t)(maxξD|ξ|)+|qε(t)q(t)|).\displaystyle\leq L_{R}\|A\|\left(\|P^{\varepsilon}(t)-P(t)\|(\max_{\xi\in D}|\xi|)+|q^{\varepsilon}(t)-q(t)|\right).

Therefore,

|yε(T;ξ)G(ξ)|MαεαL1(0,T;m)\displaystyle|y^{\varepsilon}(T;\xi)-G(\xi)|\leq M\|\alpha^{\varepsilon}-\alpha\|_{L^{1}(0,T;\mathbb{R}^{m})}
+LRAαL(0,T;m)(PεPL1(0,T;n×n)(maxξD|ξ|)+qεqL1(0,T;n)).\displaystyle+L_{R}\|A\|\|\alpha\|_{L^{\infty}(0,T;\mathbb{R}^{m})}\left(\|P^{\varepsilon}-P\|_{L^{1}(0,T;\mathbb{R}^{n\times n})}(\max_{\xi\in D}|\xi|)+\|q^{\varepsilon}-q\|_{L^{1}(0,T;\mathbb{R}^{n})}\right).

Hence, there exists a number $\varepsilon>0$ such that

|yε(T;ξ)G(ξ)|<η2,|y^{\varepsilon}(T;\xi)-G(\xi)|<\frac{\eta}{2}, (2.10)

for any ξD\xi\in D. Thus, from (2.7) and (2.10),

|yε(T;ξ)F(ξ)||yε(T;ξ)G(ξ)|+|G(ξ)F(ξ)|<η,|y^{\varepsilon}(T;\xi)-F(\xi)|\leq|y^{\varepsilon}(T;\xi)-G(\xi)|+|G(\xi)-F(\xi)|<\eta,

for any ξD\xi\in D. For all t[0,T]t\in[0,T], we know that detPε(t)>0\det P^{\varepsilon}(t)>0, so Pε(t)P^{\varepsilon}(t) is invertible. This allows us to define

β(t):=(ddtPε(t))(Pε(t))1,γ(t):=ddtqε(t)β(t)qε(t).\beta(t):=\left(\frac{d}{dt}P^{\varepsilon}(t)\right)\left(P^{\varepsilon}(t)\right)^{-1},\quad\gamma(t):=\frac{d}{dt}q^{\varepsilon}(t)-\beta(t)q^{\varepsilon}(t).

This gives us

ddtPε(t)=β(t)Pε(t),ddtqε(t)=β(t)qε(t)+γ(t).\frac{d}{dt}P^{\varepsilon}(t)=\beta(t)P^{\varepsilon}(t),\quad\frac{d}{dt}q^{\varepsilon}(t)=\beta(t)q^{\varepsilon}(t)+\gamma(t).

In view of (2.8) and (2.9),

ddtxε(t;ξ)=ddtPε(t)ξ+ddtqε(t)=β(t)Pε(t)ξ+β(t)qε(t)+γ(t)=β(t)xε(t;ξ)+γ(t),\frac{d}{dt}x^{\varepsilon}(t;\xi)=\frac{d}{dt}P^{\varepsilon}(t)\xi+\frac{d}{dt}q^{\varepsilon}(t)=\beta(t)P^{\varepsilon}(t)\xi+\beta(t)q^{\varepsilon}(t)+\gamma(t)=\beta(t)x^{\varepsilon}(t;\xi)+\gamma(t),
ddtyε(t;ξ)=αε(t)𝝈(Axε(t;ξ)).\frac{d}{dt}y^{\varepsilon}(t;\xi)=\alpha^{\varepsilon}(t)\odot\mbox{\boldmath$\sigma$}(Ax^{\varepsilon}(t;\xi)).

Hence, yε(T,)S(D)y^{\varepsilon}(T,\cdot)\in S(D). Therefore, given FC(D;m)F\in C(D;\mathbb{R}^{m}) and η>0\eta>0, there exist some functions αC([0,T];m)\alpha\in C^{\infty}([0,T];\mathbb{R}^{m}), βC([0,T];n×n)\beta\in C^{\infty}([0,T];\mathbb{R}^{n\times n}), and γC([0,T];n)\gamma\in C^{\infty}([0,T];\mathbb{R}^{n}) such that

|y(T;ξ)F(ξ)|<η,|y(T;\xi)-F(\xi)|<\eta,

for any ξD\xi\in D. ∎

2.5.2 Proof of Theorem 2.7

Proof.

Again, we start with the fact that 𝝈C(m;m)\mbox{\boldmath$\sigma$}\in C(\mathbb{R}^{m};\mathbb{R}^{m}) is defined by (2.1), where σC()\sigma\in C(\mathbb{R}) satisfies a UAP; that is, given FC(D;m)F\in C(D;\mathbb{R}^{m}) and η>0\eta>0, there exist a positive integer LL, m\mathbb{R}^{m}-valued vectors α(l)\alpha^{(l)} and d(l)d^{(l)}, and matrices C(l)m×nC^{(l)}\in\mathbb{R}^{m\times n}, for all l=1,2,,Ll=1,2,\ldots,L, such that

G(ξ)=l=1Lα(l)𝝈(C(l)ξ+d(l)),G(\xi)=\sum_{l=1}^{L}\alpha^{(l)}\odot\mbox{\boldmath$\sigma$}(C^{(l)}\xi+d^{(l)}),
|G(ξ)F(ξ)|<η,|G(\xi)-F(\xi)|<\eta,

for any ξD\xi\in D. By virtue of Lemma 2.10, we know that rank(C(l))=m\mathrm{rank}(C^{(l)})=m, for all l=1,2,,Ll=1,2,\ldots,L. Moreover, if m=nm=n, we have sgn(detA)=sgn(detC(l))\mathrm{sgn}(\det A)=\mathrm{sgn}(\det C^{(l)}). On the other hand, from Lemma 2.11, there exists P(l)n×nP^{(l)}\in\mathbb{R}^{n\times n} such that detP(l)>0\det P^{(l)}>0 and C(l)=AP(l)C^{(l)}=AP^{(l)}, for each l=1,2,,Ll=1,2,\ldots,L. Putting q(l):=A(AA)1d(l)q^{(l)}:=A^{\top}(AA^{\top})^{-1}d^{(l)}, we get d(l)=Aq(l)d^{(l)}=Aq^{(l)}, from which we obtain

G(ξ)=l=1Lα(l)𝝈(A(P(l)ξ+q(l))).G(\xi)=\sum_{l=1}^{L}\alpha^{(l)}\odot\mbox{\boldmath$\sigma$}(A(P^{(l)}\xi+q^{(l)})).

Next, we define

x(l):=P(l)ξ+q(l),y(l):=i=1lα(i)𝝈(Ax(i)),x^{(l)}:=P^{(l)}\xi+q^{(l)},\quad y^{(l)}:=\sum_{i=1}^{l}\alpha^{(i)}\odot\mbox{\boldmath$\sigma$}(Ax^{(i)}),
β(l):=(P(l)P(l1))(P(l1))1,γ(l):=q(l)q(l1)β(l)q(l1),\beta^{(l)}:=(P^{(l)}-P^{(l-1)})(P^{(l-1)})^{-1},\quad\gamma^{(l)}:=q^{(l)}-q^{(l-1)}-\beta^{(l)}q^{(l-1)},

for all l=1,2,,Ll=1,2,\ldots,L, and set P(0):=InP^{(0)}:=I_{n}, q(0)=0q^{(0)}=0. Because P(l)P(l1)=β(l)P(l1)P^{(l)}-P^{(l-1)}=\beta^{(l)}P^{(l-1)} and q(l)q(l1)=β(l)q(l1)+γ(l)q^{(l)}-q^{(l-1)}=\beta^{(l)}q^{(l-1)}+\gamma^{(l)} hold true, then

x(l)x(l1)=(P(l)P(l1))ξ+(q(l)q(l1))=β(l)x(l1)+γ(l),x^{(l)}-x^{(l-1)}=(P^{(l)}-P^{(l-1)})\xi+(q^{(l)}-q^{(l-1)})=\beta^{(l)}x^{(l-1)}+\gamma^{(l)},
y(L)=l=1Lα(l)𝝈(A(P(l)ξ+q(l)))=G(ξ).y^{(L)}=\sum_{l=1}^{L}\alpha^{(l)}\odot\mbox{\boldmath$\sigma$}(A(P^{(l)}\xi+q^{(l)}))=G(\xi).

Hence, [ξy(L)]Sres(D)[\xi\mapsto y^{(L)}]\in S_{\mathrm{res}}(D). Therefore, given FC(D;m)F\in C(D;\mathbb{R}^{m}) and η>0\eta>0, there exists L,α(l)m,β(l)n×n,γ(l)n(l=1,2,,L)L\in\mathbb{N},\alpha^{(l)}\in\mathbb{R}^{m},\beta^{(l)}\in\mathbb{R}^{n\times n},\gamma^{(l)}\in\mathbb{R}^{n}~{}(l=1,2,\ldots,L) such that

|y(L)F(ξ)|<η,|y^{(L)}-F(\xi)|<\eta,

for any ξD\xi\in D. ∎

3 The gradient and learning algorithm

3.1 The gradient of the loss function with respect to the design parameter

We consider the $(\alpha,\beta,\gamma)$-type ODENet associated with the ODE system (2.3) and the approximation of $F\in C(D;\mathbb{R}^{m})$. Let $K\in\mathbb{N}$ be the number of training data and $\{(\xi^{(k)},F(\xi^{(k)}))\}_{k=1}^{K}\subset D\times\mathbb{R}^{m}$ be the training data. We divide the labels of the training data into the following disjoint sets:

\{1,2,\ldots,K\}=I_{1}\cup I_{2}\cup\cdots\cup I_{M}~(\mathrm{disjoint})\quad(1\leq M\leq K).

Let $x^{(k)}(t)$ and $y^{(k)}(t)$ be the solution to (2.3) with the initial value $\xi^{(k)}$. For each $\mu=1,2,\ldots,M$, let $\mbox{\boldmath$x$}=(x^{(k)})_{k\in I_{\mu}}$ and $\mbox{\boldmath$y$}=(y^{(k)})_{k\in I_{\mu}}$. We define the loss functions as follows:

e_{\mu}[\mbox{\boldmath$x$},\mbox{\boldmath$y$}]=\frac{1}{|I_{\mu}|}\sum_{k\in I_{\mu}}\left|y^{(k)}(T)-F(\xi^{(k)})\right|^{2}, (3.1)
E=\frac{1}{K}\sum_{k=1}^{K}\left|y^{(k)}(T)-F(\xi^{(k)})\right|^{2}. (3.2)

We consider the learning for each label set using the gradient method. Fix $\mu=1,2,\ldots,M$, and let $\lambda^{(k)}:[0,T]\to\mathbb{R}^{n}$ be the adjoint variable satisfying the following adjoint equation for each $k\in I_{\mu}$:

\left\{\begin{aligned} \frac{d}{dt}\lambda^{(k)}(t)&=-(\beta(t))^{\top}\lambda^{(k)}(t)-\frac{1}{|I_{\mu}|}A^{\top}\left(\left(y^{(k)}(T)-F(\xi^{(k)})\right)\odot\alpha(t)\odot\mbox{\boldmath$\sigma$}^{\prime}(Ax^{(k)}(t))\right),\\ \lambda^{(k)}(T)&=0.\end{aligned}\right. (3.3)

Then, the gradients $G[\alpha]^{(\mu)}\in C([0,T];\mathbb{R}^{m})$, $G[\beta]^{(\mu)}\in C([0,T];\mathbb{R}^{n\times n})$, and $G[\gamma]^{(\mu)}\in C([0,T];\mathbb{R}^{n})$ of the loss function (3.1) at $\alpha\in C([0,T];\mathbb{R}^{m})$, $\beta\in C([0,T];\mathbb{R}^{n\times n})$, and $\gamma\in C([0,T];\mathbb{R}^{n})$ with respect to $L^{2}(0,T;\mathbb{R}^{m})$, $L^{2}(0,T;\mathbb{R}^{n\times n})$, and $L^{2}(0,T;\mathbb{R}^{n})$ can be represented as

G[\alpha]^{(\mu)}(t)=\frac{1}{|I_{\mu}|}\sum_{k\in I_{\mu}}\left(y^{(k)}(T)-F(\xi^{(k)})\right)\odot\mbox{\boldmath$\sigma$}(Ax^{(k)}(t)),
G[\beta]^{(\mu)}(t)=\sum_{k\in I_{\mu}}\lambda^{(k)}(t)\left(x^{(k)}(t)\right)^{\top},\quad G[\gamma]^{(\mu)}(t)=\sum_{k\in I_{\mu}}\lambda^{(k)}(t),

respectively.

3.2 Learning algorithm

In this section, we describe the learning algorithm of the $(\alpha,\beta,\gamma)$-type ODENet associated with the ODE system (2.3). The initial value problems for the ordinary differential equations (2.3) and (3.3) are computed using the explicit Euler method. Let $h$ be the size of the time step and define $L:=\lfloor T/h\rfloor$. By discretizing the ordinary differential equations (2.3), we obtain

\left\{\begin{aligned} \frac{x_{l+1}^{(k)}-x_{l}^{(k)}}{h}&=\beta_{l}x_{l}^{(k)}+\gamma_{l},&l=0,1,\ldots,L-1,\\ \frac{y_{l+1}^{(k)}-y_{l}^{(k)}}{h}&=\alpha_{l}\odot\mbox{\boldmath$\sigma$}(Ax_{l}^{(k)}),&l=0,1,\ldots,L-1,\\ x_{0}^{(k)}&=\xi^{(k)},&\\ y_{0}^{(k)}&=0,&\end{aligned}\right.

for any $k\in I_{\mu}$. Furthermore, by discretizing the adjoint equation (3.3), we obtain

\left\{\begin{aligned} \frac{\lambda_{l}^{(k)}-\lambda_{l-1}^{(k)}}{h}&=-\beta_{l}^{\top}\lambda_{l}^{(k)}-\frac{1}{|I_{\mu}|}A^{\top}\left(\left(y_{L}^{(k)}-F(\xi^{(k)})\right)\odot\alpha_{l}\odot\mbox{\boldmath$\sigma$}^{\prime}(Ax_{l}^{(k)})\right),\\ \lambda_{L}^{(k)}&=0,\end{aligned}\right.

with $l=L,L-1,\ldots,1$ for any $k\in I_{\mu}$. Here we put

\alpha_{l}=\alpha(lh),\quad\beta_{l}=\beta(lh),\quad\gamma_{l}=\gamma(lh),

for all $l=0,1,\ldots,L$.
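The two discretized systems above can be sketched in Python as follows (a minimal sketch, not the authors' code; it treats a single sample $k$, assumes $\sigma=\tanh$ so that $\sigma^{\prime}(s)=1-\tanh^{2}(s)$, and uses illustrative function names).

import numpy as np

def forward_pass(xi, A, alphas, betas, gammas, h, sigma=np.tanh):
    # Return the trajectories (x_0..x_L, y_0..y_L) of the discretized system (2.3).
    L = len(alphas) - 1
    xs, ys = [np.asarray(xi, float)], [np.zeros(A.shape[0])]
    for l in range(L):
        xs.append(xs[l] + h * (betas[l] @ xs[l] + gammas[l]))
        ys.append(ys[l] + h * alphas[l] * sigma(A @ xs[l]))
    return xs, ys

def adjoint_pass(xs, yL, target, A, alphas, betas, h, batch_size=1,
                 dsigma=lambda s: 1.0 - np.tanh(s) ** 2):
    # Solve the discretized adjoint equation backwards; returns lambda_0..lambda_L.
    L = len(alphas) - 1
    n = xs[0].shape[0]
    lams = [np.zeros(n) for _ in range(L + 1)]          # lambda_L = 0
    for l in range(L, 0, -1):
        src = A.T @ ((yL - target) * alphas[l] * dsigma(A @ xs[l])) / batch_size
        lams[l - 1] = lams[l] + h * (betas[l].T @ lams[l] + src)
    return lams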

We perform the optimization of the loss function (3.2) using stochastic gradient descent (SGD). We show the learning algorithm in Algorithm 1.

Algorithm 1 Stochastic gradient descent method for $(\alpha,\beta,\gamma)$-type ODENet
1:  Choose $\eta>0$ and $\tau>0$
2:  Set $\nu=0$ and choose $\alpha_{(0)}\in\prod_{l=0}^{L}\mathbb{R}^{m}$, $\beta_{(0)}\in\prod_{l=0}^{L}\mathbb{R}^{n\times n}$, $\gamma_{(0)}\in\prod_{l=0}^{L}\mathbb{R}^{n}$ and (fixed) $A$
3:  repeat
4:     Divide the labels of the training data $\{(\xi^{(k)},F(\xi^{(k)}))\}_{k=1}^{K}$ into the following disjoint sets
\{1,2,\ldots,K\}=I_{1}\cup I_{2}\cup\cdots\cup I_{M}~(\mathrm{disjoint}),\quad(1\leq M\leq K)
5:     Set $\alpha^{(1)}=\alpha_{(\nu)}$, $\beta^{(1)}=\beta_{(\nu)}$ and $\gamma^{(1)}=\gamma_{(\nu)}$
6:     for $\mu=1,2,\ldots,M$ do
7:        Solve
\left\{\begin{aligned} \frac{x_{l+1}^{(k)}-x_{l}^{(k)}}{h}&=\beta_{l}x_{l}^{(k)}+\gamma_{l},&l=0,1,\ldots,L-1,\\ \frac{y_{l+1}^{(k)}-y_{l}^{(k)}}{h}&=\alpha_{l}\odot\mbox{\boldmath$\sigma$}(Ax_{l}^{(k)}),&l=0,1,\ldots,L-1,\\ x_{0}^{(k)}&=\xi^{(k)},&\\ y_{0}^{(k)}&=0,&\end{aligned}\right.
for any $k\in I_{\mu}$
8:        Solve
\left\{\begin{aligned} \frac{\lambda_{l}^{(k)}-\lambda_{l-1}^{(k)}}{h}&=-\beta_{l}^{\top}\lambda_{l}^{(k)}-\frac{1}{|I_{\mu}|}A^{\top}\left(\left(y_{L}^{(k)}-F(\xi^{(k)})\right)\odot\alpha_{l}\odot\mbox{\boldmath$\sigma$}^{\prime}(Ax_{l}^{(k)})\right),\\ \lambda_{L}^{(k)}&=0,\end{aligned}\right.
with $l=L,L-1,\ldots,1$ for any $k\in I_{\mu}$
9:        Compute the gradients
G[\alpha]_{l}^{(\mu)}=\frac{1}{|I_{\mu}|}\sum_{k\in I_{\mu}}\left(y_{L}^{(k)}-F(\xi^{(k)})\right)\odot\mbox{\boldmath$\sigma$}(Ax_{l}^{(k)}),
G[\beta]_{l}^{(\mu)}=\sum_{k\in I_{\mu}}\lambda_{l}^{(k)}(x_{l}^{(k)})^{\top},\quad G[\gamma]_{l}^{(\mu)}=\sum_{k\in I_{\mu}}\lambda_{l}^{(k)}
10:        Set
\alpha_{l}^{(\mu+1)}=\alpha_{l}^{(\mu)}-\tau G[\alpha]_{l}^{(\mu)},\quad\beta_{l}^{(\mu+1)}=\beta_{l}^{(\mu)}-\tau G[\beta]_{l}^{(\mu)},
\gamma_{l}^{(\mu+1)}=\gamma_{l}^{(\mu)}-\tau G[\gamma]_{l}^{(\mu)}
11:     end for
12:     Set $\alpha_{(\nu+1)}=(\alpha_{l}^{(M)})_{l=0}^{L}$, $\beta_{(\nu+1)}=(\beta_{l}^{(M)})_{l=0}^{L}$ and $\gamma_{(\nu+1)}=(\gamma_{l}^{(M)})_{l=0}^{L}$
13:     Shuffle the training data $\{(\xi^{(k)},F(\xi^{(k)}))\}_{k=1}^{K}$ randomly and set $\nu=\nu+1$
14:  until $\max(\|\alpha_{(\nu)}-\alpha_{(\nu-1)}\|,\|\beta_{(\nu)}-\beta_{(\nu-1)}\|,\|\gamma_{(\nu)}-\gamma_{(\nu-1)}\|)<\eta$
Remark.

In step 10 of Algorithm 1, we may instead use momentum SGD [19], in which the update is replaced by the following expressions:

\alpha_{l}^{(\mu+1)}:=\alpha_{l}^{(\mu)}-\tau G[\alpha]_{l}^{(\mu)}+\tau_{1}(\alpha_{l}^{(\mu)}-\alpha_{l}^{(\mu-1)})
\beta_{l}^{(\mu+1)}:=\beta_{l}^{(\mu)}-\tau G[\beta]_{l}^{(\mu)}+\tau_{1}(\beta_{l}^{(\mu)}-\beta_{l}^{(\mu-1)})
\gamma_{l}^{(\mu+1)}:=\gamma_{l}^{(\mu)}-\tau G[\gamma]_{l}^{(\mu)}+\tau_{1}(\gamma_{l}^{(\mu)}-\gamma_{l}^{(\mu-1)})

where $\tau$ is the learning rate and $\tau_{1}$ is the momentum rate.
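For illustration, one mini-batch update of Algorithm 1 (steps 7 to 10, plain SGD; the momentum variant of the Remark is indicated only in a comment) could be sketched as follows. This sketch reuses the hypothetical forward_pass and adjoint_pass functions from the sketch after the discretized equations above; it is an assumption, not the authors' implementation.

import numpy as np

def sgd_step(batch, A, alphas, betas, gammas, h, tau, sigma=np.tanh):
    # batch: list of (xi, F_xi) pairs; updates the design parameters in place.
    L = len(alphas) - 1
    m, n = A.shape
    g_alpha = [np.zeros(m) for _ in range(L + 1)]
    g_beta = [np.zeros((n, n)) for _ in range(L + 1)]
    g_gamma = [np.zeros(n) for _ in range(L + 1)]
    for xi, F_xi in batch:
        xs, ys = forward_pass(xi, A, alphas, betas, gammas, h, sigma)
        lams = adjoint_pass(xs, ys[-1], F_xi, A, alphas, betas, h, len(batch))
        for l in range(L + 1):
            g_alpha[l] += (ys[-1] - F_xi) * sigma(A @ xs[l]) / len(batch)
            g_beta[l] += np.outer(lams[l], xs[l])
            g_gamma[l] += lams[l]
    for l in range(L + 1):
        alphas[l] -= tau * g_alpha[l]   # momentum SGD would add tau_1 * (previous increment)
        betas[l] -= tau * g_beta[l]
        gammas[l] -= tau * g_gamma[l]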

4 Numerical results

4.1 Sinusoidal Curve

We performed a numerical experiment for the regression problem of a one-dimensional signal $F(\xi)=\sin 4\pi\xi$ defined on $\xi\in[0,1]$. Let the number of training data be $K_{1}=1000$, and let the training data be

\left\{\left(\frac{k-1}{K_{1}},F\left(\frac{k-1}{K_{1}}\right)\right)\right\}_{k=1}^{K_{1}}\subset[0,1]\times\mathbb{R},\quad D_{1}:=\left\{\frac{k-1}{K_{1}}\right\}_{k=1}^{K_{1}}.

We run Algorithm 1 until $\nu=10000$. We set the learning rate to $\tau=0.01$, $A$ to the matrix with ones on the diagonal and zeros elsewhere, and

\alpha_{(0)}\equiv 0,\quad\beta_{(0)}\equiv 0,\quad\gamma_{(0)}\equiv 0.

Let the number of validation data be $K_{2}=3333$. The signal sampled with $\Delta\xi=1/K_{2}$ was used as the validation data. Let $D_{2}$ be the set of input data used for the validation data. Fig. 1 shows the training data, which is $F(\xi)=\sin 4\pi\xi$ sampled from $[0,1]$ with $\Delta\xi=1/K_{1}$. Fig. 2 shows the result predicted using the validation data when $\nu=10000$; the validation data is shown as a blue line, and the prediction is shown as an orange line. Fig. 3 shows the initial values of the parameters $\alpha,\beta$, and $\gamma$. Fig. 4 shows the learning results of each design parameter at $\nu=10000$. Fig. 5 shows the change in the loss function during learning for each of the training data and validation data.
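The training data of this experiment could be generated as in the following sketch ($K_{1}=1000$ equally spaced samples of $F(\xi)=\sin 4\pi\xi$ on $[0,1]$; the variable names are illustrative).

import numpy as np

K1 = 1000
xi = np.arange(K1) / K1                   # xi_k = (k - 1)/K1, k = 1, ..., K1
targets = np.sin(4.0 * np.pi * xi)        # F(xi_k)
training_data = list(zip(xi.reshape(-1, 1), targets.reshape(-1, 1)))
print(training_data[0])                   # (array([0.]), array([0.]))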

Fig. 5 shows that the loss function can be decreased using Algorithm 1. Fig. 2 suggests that the prediction is good. In addition, the learned parameters $\alpha,\beta$, and $\gamma$ are continuous functions.

Fig. 1: The training data, which is $F(\xi)=\sin 4\pi\xi$ sampled from $[0,1]$ with $\Delta\xi=1/K_{1}$.
Fig. 2: The result predicted using validation data when $\nu=10000$.
Fig. 3: The initial values of the design parameters $\alpha,\beta$ and $\gamma$ at each $t=hl$.
Fig. 4: The learning results of the design parameters $\alpha,\beta$ and $\gamma$ at each $t=hl$ when $\nu=10000$.
Fig. 5: The change in the loss function during learning.

4.2 Binary classification

We performed numerical experiments on a binary classification problem with two-dimensional input. We set $n=2$ and $m=1$. Let the number of training data be $K_{1}=10000$, and let $D_{1}=\{\xi^{(k)}\,|\,k=1,2,\ldots,K_{1}\}\subset[0,1]^{2}$ be a set of randomly generated points. Let

\left\{\left(\xi^{(k)},F(\xi^{(k)})\right)\right\}_{k=1}^{K_{1}}\subset[0,1]^{2}\times\mathbb{R},
F(\xi)=\left\{\begin{array}[]{ll}0,&\mathrm{if}~|\xi-(0.5,0.5)|<0.3,\\ 1,&\mathrm{if}~|\xi-(0.5,0.5)|\geq 0.3,\end{array}\right. (4.1)

be the training data. We run Algorithm 1 until $\nu=10000$. We set the learning rate to $\tau=0.01$, $A$ to the matrix with ones on the diagonal and zeros elsewhere, and

\alpha_{(0)}\equiv 0,\quad\beta_{(0)}\equiv 0,\quad\gamma_{(0)}\equiv 0.

Let the number of validation data be $K_{2}=2500$. A set of points $\xi$ randomly generated on $[0,1]^{2}$ together with $F(\xi)$ is used as the validation data. Fig. 6 shows the training data, in which the randomly generated $\xi\in D_{1}$ are classified according to (4.1). Fig. 7 shows the prediction result on the validation data at $\nu=10000$; correctly predicted points are shown in dark red and dark blue, and incorrectly predicted points are shown in light red and light blue. Fig. 8 shows the result of predicting the validation data using the $k$-nearest neighbor algorithm with $k=3$. Fig. 9 shows the result of predicting the validation data using a multi-layer perceptron with $5000$ nodes. Fig. 10 shows the initial values of the parameters $\alpha,\beta$, and $\gamma$. Figs. 11, 12, and 13 show the learning results of each parameter at $\nu=10000$. Fig. 14 shows the change of the loss function during learning for each of the training data and validation data. Fig. 15 shows the change of accuracy during learning. The accuracy is defined as

\mathrm{Accuracy}=\frac{\#\{\xi\in D_{i}\,|\,F(\xi)=\bar{y}(\xi)\}}{K_{i}},\quad(i=1,2),
\bar{y}(\xi):=\left\{\begin{array}[]{ll}0,&\mathrm{if}~y(T;\xi)<0.5,\\ 1,&\mathrm{if}~y(T;\xi)\geq 0.5.\end{array}\right.
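The thresholding and accuracy computation above can be sketched as follows (a hedged sketch; y_pred stands for the network outputs $y(T;\xi)$ on a data set and labels for the corresponding values $F(\xi)$).

import numpy as np

def accuracy(y_pred, labels, threshold=0.5):
    # Fraction of samples whose thresholded output matches the true label in {0, 1}.
    y_bar = (np.asarray(y_pred) >= threshold).astype(int)
    return float(np.mean(y_bar == np.asarray(labels)))

print(accuracy([0.1, 0.7, 0.45, 0.9], [0, 1, 1, 1]))   # 0.75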

Table 4 shows the value of the loss function and the accuracy of the prediction for each method.

From Figs. 14 and 15, we observe that the loss function can be decreased and the accuracy can be increased using Algorithm 1. Fig. 7 shows that some points in the neighborhood of $|\xi-(0.5,0.5)|=0.3$ are wrongly predicted; however, most points are predicted well. The results are similar when compared with Figs. 8 and 9. In addition, the learned parameters $\alpha,\beta$, and $\gamma$ are continuous functions. From Table 4, the $k$-nearest neighbor algorithm attains the smallest value of the loss function among the three methods. We consider that this is because the output of the ODENet is $y(T;\xi)\in[0,1]$, while the output of the $k$-nearest neighbor algorithm is in $\{0,1\}$. Compared to K-NN and MLP, the $(\alpha,\beta,\gamma)$-type ODENet gives a slightly worse result (Table 4). However, binary classification is not a suitable problem for testing the potential of ODENet, because the output values of ODENet are continuous. It should also be noted that the proposed model is not heavily tuned to increase accuracy. Considering these facts, this result also shows the potential of ODENet.

Fig. 6: The training data defined by (4.1).
Fig. 7: The result predicted using validation data when $\nu=10000$.
Fig. 8: The result of predicting the validation data using the $k$-nearest neighbor algorithm with $k=3$.
Fig. 9: The result of predicting the validation data using a multi-layer perceptron with 5000 nodes.
Fig. 10: The initial values of the design parameters $\alpha,\beta$ and $\gamma$ at each $t=hl$.
Fig. 11: The learning result of the design parameter $\alpha$ at each $t=hl$ when $\nu=10000$.
Fig. 12: The learning result of the design parameter $\beta$ at each $t=hl$ when $\nu=10000$.
Fig. 13: The learning result of the design parameter $\gamma$ at each $t=hl$ when $\nu=10000$.
Fig. 14: The change of the loss function during learning.
Fig. 15: The change of accuracy during learning.
Table 4: The prediction result of each method.
Method Loss Accuracy
This paper ($(\alpha,\beta,\gamma)$-type ODENet) 0.02629 0.9592
$K$-nearest neighbor algorithm (K-NN) 0.006000 0.9879
Multilayer perceptron (MLP) 0.006273 0.9883

4.3 Multinomial classification in MNIST

We performed a numerical experiment on a classification problem using MNIST, a dataset of handwritten digits. The input is a $28\times 28$ image and the output is a one-hot vector of the labels attached to the MNIST dataset. We set $n=784$ and $m=10$. Let the number of training data be $K_{1}=43200$ and let the batch size be $|I_{\mu}|=128$. We run Algorithm 1 until $\nu=1000$, using momentum SGD to update the design parameters. We set the learning rate to $\tau=0.01$, the momentum rate to $0.9$, $A$ to the matrix with ones on the diagonal and zeros elsewhere, and

\alpha_{(0)}\equiv 10^{-8},\quad\beta_{(0)}\equiv 10^{-8},\quad\gamma_{(0)}\equiv 10^{-8}.

Let the number of validation data be $K_{2}=10800$. Fig. 16 shows the change of the loss function during learning for each of the training data and validation data. Fig. 17 shows the change of accuracy during learning. Using the test data, the values of the loss function and accuracy at $\nu=1000$ are

E=0.06432,\quad\mathrm{Accuracy}=0.9521,

respectively.

Figs. 16 and 17 suggest that the loss function can be decreased and the accuracy can be increased using Algorithm 1 (with momentum SGD).

Fig. 16: The change of the loss function during learning.
Fig. 17: The change of accuracy during learning.

5 Conclusion

In this paper, we proposed the $(\alpha,\beta,\gamma)$-type ODENet and the $(\alpha,\beta,\gamma)$-type ResNet and showed that they uniformly approximate an arbitrary continuous function on a compact set. This result shows that the $(\alpha,\beta,\gamma)$-type ODENet and the $(\alpha,\beta,\gamma)$-type ResNet can represent a variety of data. In addition, we showed the existence and continuity of the gradient of the loss function for a general ODENet. We performed numerical experiments on several data sets and confirmed that the gradient method reduces the loss function and represents the data.

Our future work is to show that the design parameters converge to a global minimizer of the loss function using a continuous gradient. We also plan to show that ODENet with other forms, such as convolution, can represent arbitrary data.

6 Acknowledgement

This work is partially supported by JSPS KAKENHI Grant Number JP20KK0058 and JST CREST Grant Number JPMJCR2014, Japan.

References

  • [1] J.-G. Attali and G. Pagès. Approximations of functions by a multilayer perceptron: a new approach. Neural Networks, 10(6):1069–1081, 1997.
  • [2] A. Baker. Matrix groups: An Introduction to Lie Group Theory. Springer Science & Business Media, 2003.
  • [3] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
  • [4] L. Bottou. Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press, 1998.
  • [5] S.M. Carroll and B.W. Dickinson. Construction of neural nets using the radon transform. In International 1989 Joint Conference on Neural Networks, volume 1, pages 607–611. IEEE, 1989.
  • [6] R.T.Q. Chen, Y. Rubanova, J. Bettencourt, and D.K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, volume 31, pages 6571–6583, 2018.
  • [7] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
  • [8] K. Fukushima and S. Miyake. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets, pages 267–285. Springer, 1982.
  • [9] K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192, 1989.
  • [10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, volume 9, pages 249–256. PMLR, 2010.
  • [11] B. Hanin and M. Sellke. Approximating continuous functions by ReLU nets of minimal width. arXiv preprint arXiv:1710.11278, 2017.
  • [12] K. He and J. Sun. Convolutional neural networks at constrained time cost. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5353–5360. IEEE, 2015.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE, 2016.
  • [14] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
  • [15] P. Kidger and T. Lyons. Universal Approximation with Deep Narrow Networks. In Proceedings of 33rd Conference on Learning Theory, pages 2306–2327. PMLR, 2020.
  • [16] M. Leshno, V.Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.
  • [17] H. Lin and S. Jegelka. ResNet with one-neuron hidden layers is a universal approximator. In Advances in Neural Information Processing Systems, volume 31, pages 6169–6178. Curran Associates, Inc., 2018.
  • [18] W.S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5:115–133, 1943.
  • [19] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
  • [20] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
  • [21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
  • [22] S. Sonoda and N. Murata. Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis, 43(2):233–268, 2017.
  • [23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.

Appendix A Differentiability with respect to parameters of ODE

We discuss the differentiability of solutions of ordinary differential equations with respect to the design parameters.

Theorem A.1.

Let NN and rr be natural numbers, and TT be a positive real number. We define X:=C1([0,T];N)X:=C^{1}([0,T];\mathbb{R}^{N}) and Ω:=C([0,T];r)\Omega:=C([0,T];\mathbb{R}^{r}). We consider the initial value problem for the ordinary differential equation:

{x(t)=f(t,x(t),ω(t)),t(0,T],x(0)=ξ,\left\{\begin{aligned} x^{\prime}(t)&=f(t,x(t),\omega(t)),&t\in(0,T],\\ x(0)&=\xi,&\end{aligned}\right. (A.1)

where xx is a function from [0,T][0,T] to N\mathbb{R}^{N}; ξN\xi\in\mathbb{R}^{N} is the initial value; ωΩ\omega\in\Omega is the design parameter; ff is a continuously differentiable function from [0,T]×N×r[0,T]\times\mathbb{R}^{N}\times\mathbb{R}^{r} to N\mathbb{R}^{N}; and there exists L>0L>0 such that

|f(t,x1,ω(t))f(t,x2,ω(t))|L|x1x2||f(t,x_{1},\omega(t))-f(t,x_{2},\omega(t))|\leq L|x_{1}-x_{2}|

for any t[0,T],x1,x2Nt\in[0,T],x_{1},x_{2}\in\mathbb{R}^{N}, and ωΩ\omega\in\Omega. Then, the solution to (A.1) satisfies [Ωωx[ω]]C1(Ω;X)[\Omega\ni\omega\mapsto x[\omega]]\in C^{1}(\Omega;X). Furthermore, if we define

y(t):=(ωx[ω]η)(t)y(t):=(\partial_{\omega}x[\omega]\eta)(t)

for any ηΩ\eta\in\Omega, the following relations

{y(t)xf(t,x[ω](t),ω(t))y(t)=ωf(t,x[ω](t),ω(t))η(t),t(0,T],y(0)=0,\left\{\begin{array}[]{ll}y^{\prime}(t)-\nabla_{x}^{\top}f(t,x[\omega](t),\omega(t))y(t)=\nabla_{\omega}^{\top}f(t,x[\omega](t),\omega(t))\eta(t),&t\in(0,T],\\ y(0)=0,&\end{array}\right.

are satisfied.

Proof.

Let X0X_{0} be the set of continuous functions from [0,T][0,T] to N\mathbb{R}^{N}. Because f(t,,ω(t))f(t,\cdot,\omega(t)) is Lipschitz continuous for any t[0,T]t\in[0,T] and ωΩ\omega\in\Omega, there exists a unique solution x[ω]Xx[\omega]\in X to (A.1). We define the map J:X×ΩXJ:X\times\Omega\to X as

J(x,ω)(t):=x(t)ξ0tf(s,x(s),ω(s))𝑑s.J(x,\omega)(t):=x(t)-\xi-\int_{0}^{t}f(s,x(s),\omega(s))ds.

The map JJ satisfies

J(x,ω)(t)=x(t)f(t,x(t),ω(t)).J(x,\omega)^{\prime}(t)=x^{\prime}(t)-f(t,x(t),\omega(t)).

Since fC1([0,T]×N×r;N)f\in C^{1}([0,T]\times\mathbb{R}^{N}\times\mathbb{R}^{r};\mathbb{R}^{N}), JC(X×Ω;X)J\in C(X\times\Omega;X).

Take an arbitrary ωΩ\omega\in\Omega. For any xXx\in X, let

fx(t):=f(t,x(t),ω(t)),xfx(t):=xf(t,x(t),ω(t)).f\circ x(t):=f(t,x(t),\omega(t)),\quad\nabla_{x}^{\top}f\circ x(t):=\nabla_{x}^{\top}f(t,x(t),\omega(t)).

We define the map A(x):XXA(x):X\to X as

(A(x)y)(t):=y(t)0t(xfx(s))y(s)𝑑s.(A(x)y)(t):=y(t)-\int_{0}^{t}(\nabla_{x}^{\top}f\circ x(s))y(s)ds.

The map A(x)A(x) satisfies

(A(x)y)(t)=y(t)(xfx(t))y(t).(A(x)y)^{\prime}(t)=y^{\prime}(t)-(\nabla_{x}^{\top}f\circ x(t))y(t).

xx and ω\omega are bounded because they are continuous functions on a compact interval. Because fC1([0,T]×N×r;N)f\in C^{1}([0,T]\times\mathbb{R}^{N}\times\mathbb{R}^{r};\mathbb{R}^{N}), there exists C>0C>0 such that |xfx(t)|C|\nabla_{x}^{\top}f\circ x(t)|\leq C for any t[0,T]t\in[0,T]. From

|(A(x)y)(t)|yX0+CTyX0,|(A(x)y)(t)|yX0+CyX0,|(A(x)y)(t)|\leq\|y\|_{X_{0}}+CT\|y\|_{X_{0}},\quad|(A(x)y)^{\prime}(t)|\leq\|y^{\prime}\|_{X_{0}}+C\|y\|_{X_{0}},
A(x)yXyX+C(T+1)yX=(1+C(T+1))yX\|A(x)y\|_{X}\leq\|y\|_{X}+C(T+1)\|y\|_{X}=(1+C(T+1))\|y\|_{X}

is satisfied. Hence, A(x)B(X,X)A(x)\in B(X,X). Let us fix x0Xx_{0}\in X and let xXx\in X tend to x0x_{0}. Then

|(A(x)y)(t)(A(x0)y)(t)|\displaystyle|(A(x)y)(t)-(A(x_{0})y)(t)| 0t|xfx(s)xfx0(s)||y(s)|𝑑s\displaystyle\leq\int_{0}^{t}|\nabla_{x}^{\top}f\circ x(s)-\nabla_{x}^{\top}f\circ x_{0}(s)||y(s)|ds
TyXxfxxfx0C([0,T];N×N)\displaystyle\leq T\|y\|_{X}\|\nabla_{x}^{\top}f\circ x-\nabla_{x}^{\top}f\circ x_{0}\|_{C([0,T];\mathbb{R}^{N\times N})}
|(A(x)y)(t)(A(x0)y)(t)|\displaystyle|(A(x)y)^{\prime}(t)-(A(x_{0})y)^{\prime}(t)| |xfx(t)xfx0(t)||y(t)|\displaystyle\leq|\nabla_{x}^{\top}f\circ x(t)-\nabla_{x}^{\top}f\circ x_{0}(t)||y(t)|
yXxfxxfx0C([0,T];N×N)\displaystyle\leq\|y\|_{X}\|\nabla_{x}^{\top}f\circ x-\nabla_{x}^{\top}f\circ x_{0}\|_{C([0,T];\mathbb{R}^{N\times N})}
A(x)yA(x0)yX(T+1)yXxfxxfx0C([0,T];N×N)\|A(x)y-A(x_{0})y\|_{X}\leq(T+1)\|y\|_{X}\|\nabla_{x}^{\top}f\circ x-\nabla_{x}^{\top}f\circ x_{0}\|_{C([0,T];\mathbb{R}^{N\times N})}
A(x)A(x0)B(X,X)(T+1)xfxxfx0C([0,T];N×N)\|A(x)-A(x_{0})\|_{B(X,X)}\leq(T+1)\|\nabla_{x}^{\top}f\circ x-\nabla_{x}^{\top}f\circ x_{0}\|_{C([0,T];\mathbb{R}^{N\times N})}

Hence, AC(X;B(X,X))A\in C(X;B(X,X)).

J(x+y,ω)(t)J(x,ω)(t)(A(x)y)(t)\displaystyle J(x+y,\omega)(t)-J(x,\omega)(t)-(A(x)y)(t)
=0t(f(x+y)(s)fx(s)(xfx(s))y(s))𝑑s\displaystyle=-\int_{0}^{t}(f\circ(x+y)(s)-f\circ x(s)-(\nabla_{x}^{\top}f\circ x(s))y(s))ds
J(x+y,ω)J(x,ω)A(x)yX(T+1)f(x+y)fx(xfx)yX0\|J(x+y,\omega)-J(x,\omega)-A(x)y\|_{X}\leq(T+1)\|f\circ(x+y)-f\circ x-(\nabla_{x}^{\top}f\circ x)y\|_{X_{0}}

From the Taylor expansion of ff, we obtain

f(t,x(t)+y(t),ω(t))=f(t,x(t),ω(t))+01xf(t,x(t)+ζy(t),ω(t))y(t)𝑑ζf(t,x(t)+y(t),\omega(t))=f(t,x(t),\omega(t))+\int_{0}^{1}\nabla_{x}^{\top}f(t,x(t)+\zeta y(t),\omega(t))y(t)d\zeta

for any t[0,T],x,yXt\in[0,T],x,y\in X and ωΩ\omega\in\Omega. We obtain

|f(x+y)(t)fx(t)(xfx(t))y(t)|01|xf(x+ζy)(t)xfx(t)||y(t)|𝑑ζ.|f\circ(x+y)(t)-f\circ x(t)-(\nabla_{x}^{\top}f\circ x(t))y(t)|\leq\int_{0}^{1}|\nabla_{x}^{\top}f\circ(x+\zeta y)(t)-\nabla_{x}^{\top}f\circ x(t)||y(t)|d\zeta.

Since xf\nabla_{x}^{\top}f is uniformly continuous on compact sets, for any ϵ>0\epsilon>0, there exists δ>0\delta>0 such that

yX0<δ,ζ[0,1]|xf(x+ζy)(t)xfx(t)|<ϵ.\|y\|_{X_{0}}<\delta,\zeta\in[0,1]~{}\Rightarrow~{}|\nabla_{x}^{\top}f\circ(x+\zeta y)(t)-\nabla_{x}^{\top}f\circ x(t)|<\epsilon.

We obtain

|f(x+y)(t)fx(t)(xfx(t))y(t)|ϵyX0,|f\circ(x+y)(t)-f\circ x(t)-(\nabla_{x}^{\top}f\circ x(t))y(t)|\leq\epsilon\|y\|_{X_{0}},
J(x+y,ω)J(x,ω)A(x)yXϵ(T+1)yX.\|J(x+y,\omega)-J(x,\omega)-A(x)y\|_{X}\leq\epsilon(T+1)\|y\|_{X}.

Hence,

xJ(x,ω)y=A(x)y.\partial_{x}J(x,\omega)y=A(x)y.

From xJ(,ω)C(X;B(X,X))\partial_{x}J(\cdot,\omega)\in C(X;B(X,X)), J(,ω)C1(X;X)J(\cdot,\omega)\in C^{1}(X;X).

Fix ω0Ω\omega_{0}\in\Omega. Then there exists a solution x0Xx_{0}\in X of (A.1) such that

x0(t)=ξ+0tf(s,x0(s),ω0(s))𝑑s.x_{0}(t)=\xi+\int_{0}^{t}f(s,x_{0}(s),\omega_{0}(s))ds.

That is,

J(x0,ω0)(t)=x0(t)ξ0tf(s,x0(s),ω0(s))𝑑s=x0(t)x0(t)=0J(x_{0},\omega_{0})(t)=x_{0}(t)-\xi-\int_{0}^{t}f(s,x_{0}(s),\omega_{0}(s))ds=x_{0}(t)-x_{0}(t)=0

is satisfied. For any gXg\in X, if yXy\in X satisfies (xJ(x0,ω0)y)(t)=g(t)(\partial_{x}J(x_{0},\omega_{0})y)(t)=g(t), then

{y(t)xf(t,x0(t),ω0(t))y(t)=g(t),t(0,T],y(0)=g(0).\left\{\begin{array}[]{ll}y^{\prime}(t)-\nabla_{x}^{\top}f(t,x_{0}(t),\omega_{0}(t))y(t)=g^{\prime}(t),&t\in(0,T],\\ y(0)=g(0).&\end{array}\right.

Because the solution to this ordinary differential equation exists uniquely, there exists an inverse map (xJ(x0,ω0))1(\partial_{x}J(x_{0},\omega_{0}))^{-1} such that (xJ(x0,ω0))1B(X,X)(\partial_{x}J(x_{0},\omega_{0}))^{-1}\in B(X,X).

From the implicit function theorem, for any ωΩ\omega\in\Omega, there exists x[ω]Xx[\omega]\in X such that J(x[ω],ω)=0J(x[\omega],\omega)=0. From JC1(X×Ω;X)J\in C^{1}(X\times\Omega;X), we obtain [ωx[ω]]C1(Ω;X)[\omega\mapsto x[\omega]]\in C^{1}(\Omega;X). We put

y(t):=(ωx[ω]η)(t)y(t):=(\partial_{\omega}x[\omega]\eta)(t)

for any ηΩ\eta\in\Omega. From J(x[ω],ω)=0J(x[\omega],\omega)=0,

(xJ(x[ω],ω)y)(t)+(ωJ(x[ω],ω)η)(t)=0,(\partial_{x}J(x[\omega],\omega)y)(t)+(\partial_{\omega}J(x[\omega],\omega)\eta)(t)=0,
y(t)0txf(s,x[ω](s),ω(s))y(s)𝑑s0tωf(s,x[ω](s),ω(s))η(s)𝑑s=0.y(t)-\int_{0}^{t}\nabla_{x}^{\top}f(s,x[\omega](s),\omega(s))y(s)ds-\int_{0}^{t}\nabla_{\omega}^{\top}f(s,x[\omega](s),\omega(s))\eta(s)ds=0.

Therefore, we obtain

{y(t)xf(t,x[ω](t),ω(t))y(t)=ωf(t,x[ω](t),ω(t))η(t),t(0,T],y(0)=0.\left\{\begin{array}[]{ll}y^{\prime}(t)-\nabla_{x}^{\top}f(t,x[\omega](t),\omega(t))y(t)=\nabla_{\omega}^{\top}f(t,x[\omega](t),\omega(t))\eta(t),&t\in(0,T],\\ y(0)=0.&\end{array}\right.
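
The sensitivity equation above can be checked numerically. The following Python sketch compares, for a simple scalar example, the solution y of the variational equation with a finite-difference derivative of x[ω]; the concrete choices of f, ω, and η are assumptions made only for illustration.

import numpy as np
from scipy.integrate import solve_ivp

# Scalar illustration of Theorem A.1 (not part of the proof): f(t, x, w) = w * tanh(x).
T, xi = 1.0, 0.5
omega = lambda t: 1.0 + 0.5 * t          # design parameter omega(t) (assumed)
eta = lambda t: np.sin(np.pi * t)        # perturbation direction eta(t) (assumed)
f = lambda t, x, w: w * np.tanh(x)
df_dx = lambda t, x, w: w / np.cosh(x) ** 2
df_dw = lambda t, x, w: np.tanh(x)

def solve_x(w):
    # Solve x' = f(t, x, w(t)), x(0) = xi on [0, T].
    return solve_ivp(lambda t, x: f(t, x, w(t)), (0.0, T), [xi],
                     dense_output=True, rtol=1e-10, atol=1e-12)

sol_x = solve_x(omega)

def rhs_y(t, y):
    # Variational equation: y' = (df/dx) y + (df/dw) eta, y(0) = 0.
    x = sol_x.sol(t)[0]
    return df_dx(t, x, omega(t)) * y + df_dw(t, x, omega(t)) * eta(t)

y_T = solve_ivp(rhs_y, (0.0, T), [0.0], rtol=1e-10, atol=1e-12).y[0, -1]

eps = 1e-6
x_T_plus = solve_x(lambda t: omega(t) + eps * eta(t)).y[0, -1]
fd = (x_T_plus - sol_x.y[0, -1]) / eps
print(y_T, fd)   # the two values should agree up to O(eps)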

Appendix B General ODENet

In this section, we describe the general ODENet and the existence and continuity of the gradient of a loss function with respect to the design parameter. Let NN and rr be natural numbers and TT be a positive real number. Let the input data DnD\subset\mathbb{R}^{n} be a compact set. We define X:=C1([0,T];N)X:=C^{1}([0,T];\mathbb{R}^{N}) and Ω:=C([0,T];r)\Omega:=C([0,T];\mathbb{R}^{r}). We consider the ODENet with the following system of ordinary differential equations.

{x(t)=f(t,x(t),ω(t)),t(0,T],x(0)=Qξ,\left\{\begin{aligned} x^{\prime}(t)&=f(t,x(t),\omega(t)),&t\in(0,T],\\ x(0)&=Q\xi,&\end{aligned}\right. (B.1)

where xx is a function from [0,T][0,T] to N\mathbb{R}^{N}; ξD\xi\in D is the input data; Px(T)Px(T) is the final output; ωΩ\omega\in\Omega is the design parameter; PP and QQ are m×Nm\times N and N×nN\times n real matrices; ff is a continuously differentiable function from [0,T]×N×r[0,T]\times\mathbb{R}^{N}\times\mathbb{R}^{r} to N\mathbb{R}^{N}, and f(t,,ω(t))f(t,\cdot,\omega(t)) is Lipschitz continuous for any t[0,T]t\in[0,T] and ωΩ\omega\in\Omega. For an input data ξD\xi\in D, we denote the output data as Px(T;ξ)Px(T;\xi). We consider an approximation of FC(D;m)F\in C(D;\mathbb{R}^{m}) using ODENet with a system of ordinary differential equations (B.1). We define the loss function as

e[x]=12|Px(T;ξ)F(ξ)|2.e[x]=\frac{1}{2}\left|Px(T;\xi)-F(\xi)\right|^{2}.

We define the gradient of the loss function with respect to the design parameter as follows:

Definition B.1.

Let Ω\Omega be a real Banach space and assume that an inner product ,\left<\cdot,\cdot\right> is defined on Ω\Omega. Suppose that the functional Φ:Ω\Phi:\Omega\to\mathbb{R} is Fréchet differentiable at ωΩ\omega\in\Omega, and denote its Fréchet derivative by Φ[ω]Ω\partial\Phi[\omega]\in\Omega^{*}. If there exists G[ω]ΩG[\omega]\in\Omega such that

Φ[ω]η=G[ω],η,\partial\Phi[\omega]\eta=\left<G[\omega],\eta\right>,

for any ηΩ\eta\in\Omega, we call G[ω]G[\omega] the gradient of Φ\Phi at ωΩ\omega\in\Omega with respect to the inner product ,\left<\cdot,\cdot\right>.

Remark.

If there exists a gradient G[ω]G[\omega] of the functional Φ\Phi at ωΩ\omega\in\Omega with respect to the inner product ,\left<\cdot,\cdot\right>, the algorithm to find the minimum value of Φ\Phi by

ω(ν)=ω(ν1)τG[ω(ν1)]\omega_{(\nu)}=\omega_{(\nu-1)}-\tau G[\omega_{(\nu-1)}]

is called the steepest descent method.

Theorem B.2.

Given the design parameter ωΩ\omega\in\Omega, let x[ω](t;ξ)x[\omega](t;\xi) be the solution to (B.1) with the input data ξD\xi\in D. Let λ:[0,T]N\lambda:[0,T]\to\mathbb{R}^{N} be the adjoint state satisfying the following adjoint equation:

{λ(t)=xf(t,x[ω](t;ξ),ω(t))λ(t),t[0,T),λ(T)=P(Px[ω](T;ξ)F(ξ)).\left\{\begin{aligned} \lambda^{\prime}(t)&=-\nabla_{x}f^{\top}\left(t,x[\omega](t;\xi),\omega(t)\right)\lambda(t),&t\in[0,T),\\ \lambda(T)&=P^{\top}\left(Px[\omega](T;\xi)-F(\xi)\right).&\end{aligned}\right.

We define the functional Φ:Ω\Phi:\Omega\to\mathbb{R} as Φ[ω]=e[x[ω]]\Phi[\omega]=e[x[\omega]]. Then, there exists a gradient G[ω]ΩG[\omega]\in\Omega of Φ\Phi at ωΩ\omega\in\Omega with respect to the L2(0,T;r)L^{2}(0,T;\mathbb{R}^{r}) inner product such that

Φ[ω]η=0TG[ω](t)η(t)𝑑t,G[ω](t)=ωf(t,x[ω](t;ξ),ω(t))λ(t),\partial\Phi[\omega]\eta=\int_{0}^{T}G[\omega](t)\cdot\eta(t)dt,\quad G[\omega](t)=\nabla_{\omega}f^{\top}\left(t,x[\omega](t;\xi),\omega(t)\right)\lambda(t),

for any ηΩ\eta\in\Omega.

Proof.

ee is a continuously differentiable function from XX to \mathbb{R}, and the solution of (B.1) satisfies [ωx[ω]]C1(Ω;X)[\omega\mapsto x[\omega]]\in C^{1}(\Omega;X) by Theorem A.1. Hence, ΦC1(Ω)\Phi\in C^{1}(\Omega). For any ηΩ\eta\in\Omega,

Φ[ω]η\displaystyle\partial\Phi[\omega]\eta =(Px[ω](T;ξ)F(ξ))P(ωx[ω]η)(T),\displaystyle=(Px[\omega](T;\xi)-F(\xi))\cdot P(\partial_{\omega}x[\omega]\eta)(T),
=P(Px[ω](T;ξ)F(ξ))(ωx[ω]η)(T).\displaystyle=P^{\top}(Px[\omega](T;\xi)-F(\xi))\cdot(\partial_{\omega}x[\omega]\eta)(T).

We put y(t):=(ωx[ω]η)(t)y(t):=(\partial_{\omega}x[\omega]\eta)(t). From Theorem A.1, we obtain

{y(t)xf(t,x[ω](t,ξ),ω(t))y(t)=ωf(t,x[ω](t;ξ),ω(t))η(t),t(0,T],y(0)=0.\left\{\begin{array}[]{ll}y^{\prime}(t)-\nabla_{x}^{\top}f\left(t,x[\omega](t,\xi),\omega(t)\right)y(t)=\nabla_{\omega}^{\top}f\left(t,x[\omega](t;\xi),\omega(t)\right)\eta(t),&t\in(0,T],\\ y(0)=0.\end{array}\right.

By assumption, the adjoint equation

{λ(t)=xf(t,x[ω](t;ξ),ω(t))λ(t),t[0,T),λ(T)=P(Px[ω](T;ξ)F(ξ)).\left\{\begin{aligned} \lambda^{\prime}(t)&=-\nabla_{x}f^{\top}\left(t,x[\omega](t;\xi),\omega(t)\right)\lambda(t),&t\in[0,T),\\ \lambda(T)&=P^{\top}\left(Px[\omega](T;\xi)-F(\xi)\right).&\end{aligned}\right.

is satisfied. We define

g(t):=ωf(t,x[ω](t;ξ),ω(t))λ(t).g(t):=\nabla_{\omega}f^{\top}\left(t,x[\omega](t;\xi),\omega(t)\right)\lambda(t).

Then, gΩg\in\Omega is satisfied. We calculate the L2(0,T;r)L^{2}(0,T;\mathbb{R}^{r}) inner product of gg and η\eta,

g,η\displaystyle\left<g,\eta\right> =0T(ωf(t,x[ω](t;ξ),ω(t))λ(t))η(t)𝑑t,\displaystyle=\int_{0}^{T}(\nabla_{\omega}f^{\top}(t,x[\omega](t;\xi),\omega(t))\lambda(t))\cdot\eta(t)dt,
=0Tλ(t)(ωf(t,x[ω](t;ξ),ω(t))η(t))𝑑t,\displaystyle=\int_{0}^{T}\lambda(t)\cdot(\nabla_{\omega}^{\top}f(t,x[\omega](t;\xi),\omega(t))\eta(t))dt,
=0Tλ(t)(y(t)xf(t,x[ω](t;ξ),ω(t))y(t))𝑑t,\displaystyle=\int_{0}^{T}\lambda(t)\cdot(y^{\prime}(t)-\nabla_{x}^{\top}f(t,x[\omega](t;\xi),\omega(t))y(t))dt,
=λ(T)y(T)λ(0)y(0)0T(λ(t)+xf(t,x[ω](t;ξ),ω(t))λ(t))y(t)𝑑t,\displaystyle=\lambda(T)\cdot y(T)-\lambda(0)\cdot y(0)-\int_{0}^{T}(\lambda^{\prime}(t)+\nabla_{x}f^{\top}(t,x[\omega](t;\xi),\omega(t))\lambda(t))\cdot y(t)dt,
=P(Px[ω](T;ξ)F(ξ))y(T),\displaystyle=P^{\top}(Px[\omega](T;\xi)-F(\xi))\cdot y(T),
=Φ[ω]η.\displaystyle=\partial\Phi[\omega]\eta.

Therefore, there exists a gradient G[ω]ΩG[\omega]\in\Omega of Φ\Phi at ωΩ\omega\in\Omega with respect to the L2(0,T;r)L^{2}(0,T;\mathbb{R}^{r}) inner product such that

G[ω](t)=ωf(t,x[ω](t;ξ),ω(t))λ(t).G[\omega](t)=\nabla_{\omega}f^{\top}\left(t,x[\omega](t;\xi),\omega(t)\right)\lambda(t).
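
As an illustration of how Theorem B.2 could be used in practice, the following Python sketch evaluates the gradient G[ω] on a uniform time grid by solving the forward problem (B.1) and the adjoint equation with scipy. The callables f, grad_x_f, grad_w_f (the Jacobians of f with respect to x and ω), the matrices P, Q, and the target value F(ξ) are assumed to be supplied by the user; the piecewise-linear representation of ω and all names are assumptions of the sketch.

import numpy as np
from scipy.integrate import solve_ivp

def loss_gradient(f, grad_x_f, grad_w_f, P, Q, xi, F_xi, t_grid, omega_grid):
    # t_grid: increasing grid on [0, T] with t_grid[0] = 0;
    # omega_grid: array of shape (len(t_grid), r) holding omega on the grid.
    T = t_grid[-1]
    w = lambda t: np.array([np.interp(t, t_grid, omega_grid[:, j])
                            for j in range(omega_grid.shape[1])])
    # Forward problem (B.1): x' = f(t, x, omega(t)), x(0) = Q xi.
    sol_x = solve_ivp(lambda t, x: f(t, x, w(t)), (0.0, T), Q @ xi,
                      dense_output=True, rtol=1e-8, atol=1e-10)
    xT = sol_x.y[:, -1]
    # Adjoint problem, integrated backward in time:
    # lambda' = -(df/dx)^T lambda, lambda(T) = P^T (P x(T) - F(xi)).
    lam_T = P.T @ (P @ xT - F_xi)
    sol_lam = solve_ivp(lambda t, lam: -grad_x_f(t, sol_x.sol(t), w(t)).T @ lam,
                        (T, 0.0), lam_T, dense_output=True, rtol=1e-8, atol=1e-10)
    # Gradient G[omega](t) = (df/domega)^T(t, x(t), omega(t)) lambda(t) on the grid.
    G = np.array([grad_w_f(t, sol_x.sol(t), w(t)).T @ sol_lam.sol(t)
                  for t in t_grid])
    return G   # shape (len(t_grid), r)

A steepest descent step in the sense of the Remark above is then obtained gridpoint-wise by replacing omega_grid with omega_grid - tau * G.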

Appendix C General ResNet

In this section, we describe the general ResNet and error backpropagation. We consider a ResNet with the following system of difference equations

{x(l+1)=x(l)+f(l)(x(l),ω(l)),l=0,1,,L1,x(0)=Qξ,\left\{\begin{aligned} x^{(l+1)}&=x^{(l)}+f^{(l)}(x^{(l)},\omega^{(l)}),&l=0,1,\ldots,L-1,\\ x^{(0)}&=Q\xi,&\end{aligned}\right. (C.1)

where x(l)x^{(l)} is an NN-dimensional real vector for all l=0,1,,Ll=0,1,\ldots,L; ξD\xi\in D is the input data; Px(L)Px^{(L)} is the final output; ω(l)rl(l=0,1,,L1)\omega^{(l)}\in\mathbb{R}^{r_{l}}~{}(l=0,1,\ldots,L-1) are the design parameters; PP and QQ are m×Nm\times N and N×nN\times n real matrices; f(l)f^{(l)} is a continuously differentiable function from N×rl\mathbb{R}^{N}\times\mathbb{R}^{r_{l}} to N\mathbb{R}^{N} for all l=0,1,,L1l=0,1,\ldots,L-1. We consider an approximation of FC(D;m)F\in C(D;\mathbb{R}^{m}) using ResNet with a system of difference equations (C.1). Let KK\in\mathbb{N} be the number of training data and {(ξ(k),F(ξ(k)))}k=1KD×m\{(\xi^{(k)},F(\xi^{(k)}))\}_{k=1}^{K}\subset D\times\mathbb{R}^{m} be the training data. We divide the label of the training data into the following disjoint sets.

{1,2,,K}=I1I2IM(disjoint),(1MK).\{1,2,\ldots,K\}=I_{1}\cup I_{2}\cup\cdots\cup I_{M}~{}(\mathrm{disjoint}),\quad(1\leq M\leq K).

Let Px(L,k)Px^{(L,k)} denote the final output for a given input data ξ(k)D\xi^{(k)}\in D. We set 𝝎=(ω(0),ω(1),,ω(L1))\mbox{\boldmath$\omega$}=(\omega^{(0)},\omega^{(1)},\ldots,\omega^{(L-1)}). We define the loss function for all μ=1,2,,M\mu=1,2,\ldots,M as follows:

eμ(𝝎)=12|Iμ|kIμ|Px(L,k)F(ξ(k))|2,e_{\mu}(\mbox{\boldmath$\omega$})=\frac{1}{2|I_{\mu}|}\sum_{k\in I_{\mu}}\left|Px^{(L,k)}-F(\xi^{(k)})\right|^{2}, (C.2)
E=12Kk=1K|Px(L,k)F(ξ(k))|2.E=\frac{1}{2K}\sum_{k=1}^{K}\left|Px^{(L,k)}-F(\xi^{(k)})\right|^{2}.

We consider the learning for each label set using the gradient method. We find the gradient of the loss function (C.2) with respect to the design parameter ω(l)rl\omega^{(l)}\in\mathbb{R}^{r_{l}} for all l=0,1,,L1l=0,1,\ldots,L-1 using error backpropagation. Using the chain rule, we obtain

ω(l)eμ(𝝎)=kIμω(l)x(l+1,k)x(l+1,k)eμ(𝝎)\nabla_{\omega^{(l)}}e_{\mu}(\mbox{\boldmath$\omega$})=\sum_{k\in I_{\mu}}\nabla_{\omega^{(l)}}{x^{(l+1,k)}}^{\top}\nabla_{x^{(l+1,k)}}e_{\mu}(\mbox{\boldmath$\omega$})

for all l=0,1,,L1l=0,1,\ldots,L-1. From (C.1),

ω(l)x(l+1,k)=ω(l)f(l)(x(l,k),ω(l)).\nabla_{\omega^{(l)}}{x^{(l+1,k)}}^{\top}=\nabla_{\omega^{(l)}}{f^{(l)}}^{\top}(x^{(l,k)},\omega^{(l)}).

We define λ(l,k):=x(l,k)eμ(𝝎)\lambda^{(l,k)}:=\nabla_{x^{(l,k)}}e_{\mu}(\mbox{\boldmath$\omega$}) for all l=0,1,,Ll=0,1,\ldots,L and kIμk\in I_{\mu}. We obtain

λ(l,k)=x(l,k)x(l+1,k)x(l+1,k)eμ(𝝎)=λ(l+1,k)+x(l,k)f(l)(x(l,k),ω(l))λ(l+1,k).\lambda^{(l,k)}=\nabla_{x^{(l,k)}}{x^{(l+1,k)}}^{\top}\nabla_{x^{(l+1,k)}}e_{\mu}(\mbox{\boldmath$\omega$})=\lambda^{(l+1,k)}+\nabla_{x^{(l,k)}}{f^{(l)}}^{\top}(x^{(l,k)},\omega^{(l)})\lambda^{(l+1,k)}.

Also,

λ(L,k)=x(L,k)eμ(𝝎)=1|Iμ|P(Px(L,k)F(ξ(k))).\lambda^{(L,k)}=\nabla_{x^{(L,k)}}e_{\mu}(\mbox{\boldmath$\omega$})=\frac{1}{|I_{\mu}|}P^{\top}\left(Px^{(L,k)}-F(\xi^{(k)})\right).

Therefore, we can find the gradient ω(l)eμ(𝝎)\nabla_{\omega^{(l)}}e_{\mu}(\mbox{\boldmath$\omega$}) of the loss function (C.2) with respect to the design parameters ω(l)rl\omega^{(l)}\in\mathbb{R}^{r_{l}} by using the following equations

{ω(l)eμ(𝝎)=kIμω(l)f(l)(x(l,k),ω(l))λ(l+1,k),l=0,1,,L1,λ(l,k)=λ(l+1,k)+x(l,k)f(l)(x(l,k),ω(l))λ(l+1,k),l=0,1,,L1,kIμ,λ(L,k)=1|Iμ|P(Px(L,k)F(ξ(k))),kIμ.\left\{\begin{array}[]{lll}\displaystyle{\nabla_{\omega^{(l)}}e_{\mu}(\mbox{\boldmath$\omega$})=\sum_{k\in I_{\mu}}\nabla_{\omega^{(l)}}{f^{(l)}}^{\top}(x^{(l,k)},\omega^{(l)})\lambda^{(l+1,k)}},&l=0,1,\ldots,L-1,&\\ \displaystyle{\lambda^{(l,k)}=\lambda^{(l+1,k)}+\nabla_{x^{(l,k)}}{f^{(l)}}^{\top}(x^{(l,k)},\omega^{(l)})\lambda^{(l+1,k)}},&l=0,1,\ldots,L-1,&k\in I_{\mu},\\ \displaystyle{\lambda^{(L,k)}=\frac{1}{|I_{\mu}|}P^{\top}\left(Px^{(L,k)}-F(\xi^{(k)})\right)},&&k\in I_{\mu}.\end{array}\right.
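
The recursion above can be written compactly in code. The following Python sketch assumes that the layer maps f^(l) and their Jacobians with respect to x^(l) and ω^(l) are available as callables; the names and calling conventions are hypothetical.

import numpy as np

def resnet_gradients(f_list, jac_x, jac_w, P, Q, omegas, batch_xi, batch_F):
    # f_list[l](x, w): f^(l)(x, w); jac_x[l](x, w), jac_w[l](x, w): Jacobians of
    # f^(l) with respect to x (N x N) and omega^(l) (N x r_l).
    L = len(f_list)
    grads = [np.zeros_like(w) for w in omegas]
    B = len(batch_xi)                              # |I_mu|
    for xi, F_xi in zip(batch_xi, batch_F):
        xs = [Q @ xi]                              # forward pass through (C.1)
        for l in range(L):
            xs.append(xs[l] + f_list[l](xs[l], omegas[l]))
        lam = P.T @ (P @ xs[L] - F_xi) / B         # lambda^(L, k)
        for l in reversed(range(L)):               # backward recursion
            grads[l] += jac_w[l](xs[l], omegas[l]).T @ lam
            lam = lam + jac_x[l](xs[l], omegas[l]).T @ lam
    return grads   # grads[l] is the gradient of e_mu with respect to omega^(l)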

Appendix D General ODENet and (α,β,γ)(\alpha,\beta,\gamma)-type ODENet

In this section, we show that (B.1) is a generalization of (2.3). Let natural numbers nn and mm satisfy nmn\geq m. In (B.1) with N=m+nN=m+n and r=m+n(n+1)r=m+n(n+1), if we put

x(t)=(x~(t)y~(t))n×m,ω(t)=(α(t),β(t),γ(t))m×n×n×n,P=(Om×nIm),Q=(InOm×n),f(t,x(t),ω(t))=(β(t)x~(t)+γ(t)α(t)𝝈(Ax~(t))),\displaystyle\begin{array}[]{c}{\displaystyle x(t)=\left(\begin{array}[]{c}\tilde{x}(t)\\ \tilde{y}(t)\end{array}\right)\in\mathbb{R}^{n}\times\mathbb{R}^{m},\quad\omega(t)=\left(\alpha(t),\beta(t),\gamma(t)\right)\in\mathbb{R}^{m}\times\mathbb{R}^{n\times n}\times\mathbb{R}^{n},}\\[16.0pt] {\displaystyle P=\left(O_{m\times n}~{}I_{m}\right),\quad Q=\left(\begin{array}[]{c}I_{n}\\ O_{m\times n}\end{array}\right),\quad f(t,x(t),\omega(t))=\left(\begin{array}[]{c}\beta(t)\tilde{x}(t)+\gamma(t)\\ \alpha(t)\odot\mbox{\boldmath$\sigma$}(A\tilde{x}(t))\end{array}\right),}\end{array}

where AA is an m×nm\times n real matrix, InI_{n} is the identity matrix of size nn, and Om×nO_{m\times n} is the m×nm\times n zero matrix, then x~(t)n\tilde{x}(t)\in\mathbb{R}^{n} and y~(t)m\tilde{y}(t)\in\mathbb{R}^{m} satisfy (2.3) with the initial value

(x~(0)y~(0))=Qξ=(ξ0)\left(\begin{array}[]{c}\tilde{x}(0)\\ \tilde{y}(0)\end{array}\right)=Q\xi=\left(\begin{array}[]{c}\xi\\ 0\end{array}\right)

and y~(T)=Px(T)\tilde{y}(T)=Px(T) is the final output.
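
As a concrete illustration of this embedding, the following Python sketch builds P, Q, and the right-hand side f from given n, m, A, and a componentwise activation σ. The packing of ω = (α, β, γ) into a single vector of length r = m + n(n+1), with β stored row-major, is an assumption of the sketch.

import numpy as np

def make_general_form(n, m, A, sigma):
    # P = (O_{m x n}  I_m) extracts y_tilde; Q = (I_n ; O_{m x n}) embeds xi.
    P = np.hstack([np.zeros((m, n)), np.eye(m)])
    Q = np.vstack([np.eye(n), np.zeros((m, n))])
    def f(t, x, omega):
        # omega is packed as (alpha in R^m, beta in R^{n x n}, gamma in R^n),
        # so that r = m + n(n + 1); sigma is applied componentwise (e.g. np.tanh).
        alpha = omega[:m]
        beta = omega[m:m + n * n].reshape(n, n)
        gamma = omega[m + n * n:]
        x_tilde = x[:n]
        return np.concatenate([beta @ x_tilde + gamma,
                               alpha * sigma(A @ x_tilde)])
    return P, Q, f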