
Universal Approximation Properties for an ODENet and a ResNet: Mathematical Analysis and Numerical Experiments

Yuto Aizawa Division of Mathematical and Physical Sciences, Kanazawa University ([email protected])    Masato Kimura Faculty of Mathematics and Physics, Kanazawa University ([email protected])    Kazunori Matsui Faculty of Science and Technology, Seikei University ([email protected])
Abstract

We prove a universal approximation property (UAP) for a class of ODENet and a class of ResNet, which are simplified mathematical models for deep learning systems with skip connections. The UAP can be stated as follows. Let $n$ and $m$ be the dimensions of the input and output data, and assume $m\leq n$. Then we show that an ODENet of width $n+m$ with any non-polynomial continuous activation function can approximate any continuous function on a compact subset of $\mathbb{R}^{n}$. We also show that a ResNet has the same property as the depth tends to infinity. Furthermore, we derive the gradient of a loss function explicitly with respect to a certain tuning variable and use it to construct a learning algorithm for ODENet. To demonstrate the usefulness of this algorithm, we apply it to a regression problem, a binary classification, and a multinomial classification in MNIST.

Keywords: Deep neural network, ODENet, ResNet, Universal approximation property

1 Introduction

Recent advances in neural networks have proven immensely successful for regression analysis, image classification, time series modeling, and so on [20]. Neural networks are models of the human brain and vision [18, 8]. A neural network performs regression analysis, image classification, and time series modeling through a series of sequential operations, known as layers. Each layer is composed of neurons that are connected to neurons of other (typically adjacent) layers. We consider a neural network with $L+1$ layers, where the input layer is layer $0$, the output layer is layer $L$, and the number of nodes in layer $l~(l=0,1,\ldots,L)$ is $n_{l}\in\mathbb{N}$. Let $f^{(l)}:\mathbb{R}^{n_{l}}\to\mathbb{R}^{n_{l+1}}$ be the function of each layer, so that the output of layer $l$ is a vector in $\mathbb{R}^{n_{l+1}}$. If the input data is $\xi\in\mathbb{R}^{n_{0}}$, then, at each layer, we have

\left\{\begin{aligned} x^{(l+1)}&=f^{(l)}(x^{(l)}),&l=0,1,\ldots,L-1,\\ x^{(0)}&=\xi.&\end{aligned}\right.

The final output of the network is then $x^{(L)}$, and the network is represented by $H=[\xi\mapsto x^{(L)}]$.

A neural network approaches the regression and classification problem in two steps. First, a priori observed and classified data is used to train the network. Then, the trained network is used to predict the rest of the data. Let $D\subset\mathbb{R}^{n_{0}}$ be the set of input data, and $F:D\to\mathbb{R}^{n_{L}}$ be the target function. In the training step, the training data $\{(\xi^{(k)},F(\xi^{(k)}))\}_{k=1}^{K}$ are available, where $\{\xi^{(k)}\}_{k=1}^{K}\subset D$ are the inputs and $\{F(\xi^{(k)})\}_{k=1}^{K}\subset\mathbb{R}^{n_{L}}$ are the outputs. The goal is to train the neural network so that $H(\xi)$ approximates $F(\xi)$. This is achieved by minimizing a loss function that measures the distance between the two quantities. In this paper, we consider the mean square error loss function

\frac{1}{K}\sum_{k=1}^{K}\left|H(\xi^{(k)})-F(\xi^{(k)})\right|^{2}.

Finding the optimal functions $f^{(l)}:\mathbb{R}^{n_{l}}\to\mathbb{R}^{n_{l+1}}$ among all possible such functions is challenging, and it also carries a risk of overfitting because of the large number of available degrees of freedom. We therefore restrict the functions to the following form:

f^{(l)}(x)=a^{(l)}\odot\mbox{\boldmath$\sigma$}(W^{(l)}x+b^{(l)}), (1.1)

where $W^{(l)}\in\mathbb{R}^{n_{l+1}\times n_{l}}$ is a weight matrix, $b^{(l)}\in\mathbb{R}^{n_{l+1}}$ is a bias vector, and $a^{(l)}\in\mathbb{R}^{n_{l+1}}$ is a weight vector for the output of each layer. The operator $\odot$ denotes the Hadamard product (element-wise product) of two vectors defined by (2.2). The function $\mbox{\boldmath$\sigma$}:\mathbb{R}^{n_{l+1}}\to\mathbb{R}^{n_{l+1}}$ is defined by $\mbox{\boldmath$\sigma$}(x)=(\sigma(x_{1}),\sigma(x_{2}),\ldots,\sigma(x_{n_{l+1}}))^{\top}$, where $\sigma:\mathbb{R}\to\mathbb{R}$ is called an activation function. For a scalar $x\in\mathbb{R}$, typical activation functions include the sigmoid function $\sigma(x)=(1+e^{-x})^{-1}$, the hyperbolic tangent function $\sigma(x)=\tanh(x)$, the rectified linear unit (ReLU) function $\sigma(x)=\max(0,x)$, and the linear function $\sigma(x)=x$.
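For illustration, the following Python sketch (not part of the paper; the names W, b, a and the choice of tanh are only examples) evaluates one layer of the form (1.1).

import numpy as np

def layer(x, W, b, a, sigma=np.tanh):
    # One layer of the form (1.1): elementwise a * sigma(W x + b).
    return a * sigma(W @ x + b)

# Example: a layer mapping R^3 -> R^2 with tanh activation.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))
b = rng.standard_normal(2)
a = rng.standard_normal(2)
x = np.array([1.0, -0.5, 2.0])
print(layer(x, W, b, a))   # output in R^2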

If we restrict the functions to the form (1.1), the goal of the training step is to learn $W^{(l)},b^{(l)},a^{(l)}$ so that $H(\xi)$ approximates $F(\xi)$. The gradient method is used for training. Let $G_{W^{(l)}},G_{b^{(l)}}$, and $G_{a^{(l)}}$ be the gradients of the loss function with respect to $W^{(l)},b^{(l)}$, and $a^{(l)}$, respectively, and let $\tau>0$ be the learning rate. Using the gradient method, the weights and biases are updated as follows:

W^{(l)}\leftarrow W^{(l)}-\tau G_{W^{(l)}},\quad b^{(l)}\leftarrow b^{(l)}-\tau G_{b^{(l)}},\quad a^{(l)}\leftarrow a^{(l)}-\tau G_{a^{(l)}}.

Note that the stochastic gradient method [4] is widely used in practice, and the gradients are computed by error backpropagation [19].

It is known that deep (convolutional) neural networks are of great importance in image recognition [21, 23]. In [12], it was found through controlled experiments that increasing the depth of a network actually improves its performance and accuracy, in exchange, of course, for additional time complexity. However, when the depth is increased too much, the accuracy may stagnate or even degrade [12]. In addition, deeper networks may be harder to train because of vanishing or exploding gradients [3, 10]. To address this issue, the authors of [13] proposed residual learning to facilitate the training of networks that are considerably deeper than those used previously. Such a network is referred to as a residual network, or ResNet. Let $n$ and $m$ be the dimensions of the input and output data, and let $N$ be the number of nodes in each layer. A ResNet can be represented as

\left\{\begin{aligned} x^{(l+1)}&=x^{(l)}+f^{(l)}(x^{(l)}),&l=0,1,\ldots,L-1,\\ x^{(0)}&=Q\xi.&\end{aligned}\right. (1.2)

The final output of the network is then $H(\xi):=Px^{(L)}$, where $P\in\mathbb{R}^{m\times N}$ and $Q\in\mathbb{R}^{N\times n}$. The function $f^{(l)}$ is learned from training data.

Rewriting (1.2) as

\left\{\begin{aligned} x^{(l+1)}&=x^{(l)}+hf^{(l)}(x^{(l)}),&l=0,1,\ldots,L-1,\\ x^{(0)}&=Q\xi,&\end{aligned}\right. (1.3)

where $h$ is the step size between layers, leads to the update rule of the Euler method, a method for computing numerical solutions of initial value problems for ordinary differential equations. Indeed, putting $x(t):=x^{(l)}$ and $f(t,x):=f^{(l)}(x)$, where $t=hl$, $T=hL$, and $f:[0,T]\times\mathbb{R}^{N}\rightarrow\mathbb{R}^{N}$, the limit of (1.3) as $h$ approaches zero yields the following initial value problem for an ordinary differential equation:

\left\{\begin{aligned} x^{\prime}(t)&=f(t,x(t)),&t\in(0,T],\\ x(0)&=Q\xi.&\end{aligned}\right. (1.4)

We call the function $H=[D\ni\xi\mapsto Px(T)]$ an ODENet [6] associated with the system of ordinary differential equations (1.4).
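As a sketch of this correspondence (not the authors' code; f, Q, P, and the step count are illustrative), integrating (1.4) with the forward Euler method reproduces exactly the skip-connection recursion (1.3):

import numpy as np

def euler_odenet(f, Q, P, xi, T=1.0, L=100):
    # Integrate x' = f(t, x), x(0) = Q xi, with L Euler steps and return P x(T).
    h = T / L
    x = Q @ xi
    for l in range(L):
        x = x + h * f(l * h, x)   # precisely the ResNet-type update (1.3)
    return P @ x

# Toy example: n = m = N = 2, f(t, x) = tanh(x), Q = P = identity.
f = lambda t, x: np.tanh(x)
I = np.eye(2)
print(euler_odenet(f, I, I, np.array([0.3, -0.7])))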

Remark 1.1.

In a real deep learning system, the vector field $f(t,x)$ should be chosen from a family of vector fields $f\in\{f_{\omega}\}_{\omega}$, where $\omega$ is a parameter to be optimized. In this paper, we consider an ODENet associated with (2.3), which we call an $(\alpha,\beta,\gamma)$-type ODENet. Instead of the variable $x(t)\in\mathbb{R}^{N}$ in (1.4), we consider $(x(t),y(t))\in\mathbb{R}^{n}\times\mathbb{R}^{m}$ in (2.3), where $N=m+n$ (see Appendix D for details). Implementing an ODENet requires a (forward) Euler discretization, which yields a ResNet. A ResNet with (2.5) corresponds to the discretized version of the $(\alpha,\beta,\gamma)$-type ODENet, and we call it an $(\alpha,\beta,\gamma)$-type ResNet.

A neural network of arbitrary width and bounded depth has a universal approximation property (UAP). The classical UAP states that continuous functions on a compact subset of $\mathbb{R}^{n}$ can be approximated by a linear combination of activation functions. It has been shown that the UAP for neural networks holds when the activation function is a sigmoidal function [7, 14, 5, 9], any bounded function that is not a polynomial [16, 1], or any function in the Lizorkin space, which includes the ReLU function [22]. The UAP for neural networks and the proof technique for each activation function are presented in Table 1.

Table 1: Activation functions and the classical universal approximation property of neural networks
References Activation function How to prove
Cybenko [7] Continuous sigmoidal Hahn-Banach theorem
Hornik et al. [14] Monotonic sigmoidal Stone-Weierstrass theorem
Carroll, Dickinson [5] Continuous sigmoidal Radon transform
Funahashi [9] Monotonic sigmoidal Fourier transform
Leshno et al. [16] Non-polynomial Weierstrass theorem
Attali, Pagès [1] Non-polynomial Taylor expansion
Sonoda, Murata [22] Lizorkin distribution Ridgelet transform

Recently, some positive results have established the UAP for particular deep narrow networks. Hanin and Sellke [11] have shown that deep narrow networks with the ReLU activation function have the UAP and require width only $n+m$. Lin and Jegelka [17] have shown that a ResNet with the ReLU activation function, arbitrary input dimension, width $1$, and output dimension $1$ has the UAP. For activation functions other than ReLU, Kidger and Lyons [15] have shown that deep narrow networks with any non-polynomial continuous activation function have the UAP and require width only $n+m+1$. A comparison of these UAPs is shown in Table 2. It is proved in [15, 17] and Theorem 2.7 that the UAP holds as $L\rightarrow\infty$. Theorem 2.3 shows that the UAP holds in the continuous setting.

Table 2: A comparison of universal approximation properties
Shallow wide NN / Deep narrow NN / ResNet
References: [16, 22] / [15] / [17]
Input dimension $n$, output dimension $m$: $n,m$ any / $n,m$ any / $n$ any, $m=1$
Activation function: Non-polynomial / Non-polynomial / ReLU
Depth $L$: $L=3$ / $L\to\infty$ / $L\to\infty$
Width $N$: $N\to\infty$ / $N=n+m+1$ / $N=1$

$(\alpha,\beta,\gamma)$-type ResNet / $(\alpha,\beta,\gamma)$-type ODENet
References: Theorem 2.7 / Theorem 2.3
Input dimension $n$, output dimension $m$: $n\geq m$ / $n\geq m$
Activation function: Non-polynomial / Non-polynomial
Depth $L$: $L\to\infty$ / continuous setting ($L=\infty$)
Width $N$: $N=n+m$ / $N=n+m$

In this paper, we propose the $(\alpha,\beta,\gamma)$-type ODENet associated with (2.3), whose width is $n+m$, and give conditions under which the UAP holds for the $(\alpha,\beta,\gamma)$-type ODENet and for the $(\alpha,\beta,\gamma)$-type ResNet with (2.5). In Section 2, we show that the UAP holds for the $(\alpha,\beta,\gamma)$-type ODENet associated with (2.3) and for the $(\alpha,\beta,\gamma)$-type ResNet with (2.5). In Section 3, we derive the gradient of the loss function and a learning algorithm for the $(\alpha,\beta,\gamma)$-type ODENet under consideration, followed by some numerical experiments in Section 4. Finally, we end the paper with a conclusion in Section 5.

2 Universal Approximation Theorem for $(\alpha,\beta,\gamma)$-type ODENet and $(\alpha,\beta,\gamma)$-type ResNet

2.1 Definition of an activation function with universal approximation property

Let $m$ and $n$ be natural numbers. Our main results, Theorem 2.3 and Theorem 2.7, show that any continuous function on a compact subset of $\mathbb{R}^{n}$ can be approximated using the $(\alpha,\beta,\gamma)$-type ODENet and the $(\alpha,\beta,\gamma)$-type ResNet.

In this paper, the following notations are used:

|x|:=\left(\sum_{i=1}^{n}|x_{i}|^{2}\right)^{\frac{1}{2}},\quad\|A\|:=\left(\sum_{i=1}^{m}\sum_{j=1}^{n}|a_{ij}|^{2}\right)^{\frac{1}{2}},

for any $x=(x_{1},x_{2},\ldots,x_{n})^{\top}\in\mathbb{R}^{n}$ and $A=(a_{ij})_{\begin{subarray}{c}i=1,\ldots,m\\ j=1,\ldots,n\end{subarray}}\in\mathbb{R}^{m\times n}$. Also, we define

\nabla_{x}^{\top}f:=\left(\frac{\partial f_{i}}{\partial x_{j}}\right)_{\begin{subarray}{c}i=1,\ldots,m\\ j=1,\ldots,n\end{subarray}},\quad\nabla_{x}f^{\top}:=\left(\nabla_{x}^{\top}f\right)^{\top}

for any $f\in C^{1}(\mathbb{R}^{n};\mathbb{R}^{m})$. For a function $\sigma:\mathbb{R}\to\mathbb{R}$, we define $\mbox{\boldmath$\sigma$}:\mathbb{R}^{m}\to\mathbb{R}^{m}$ by

\mbox{\boldmath$\sigma$}(x):=\left(\begin{array}[]{c}\sigma(x_{1})\\ \sigma(x_{2})\\ \vdots\\ \sigma(x_{m})\end{array}\right) (2.1)

for $x=(x_{1},x_{2},\ldots,x_{m})^{\top}\in\mathbb{R}^{m}$. For $a=(a_{1},a_{2},\ldots,a_{m})^{\top},b=(b_{1},b_{2},\ldots,b_{m})^{\top}\in\mathbb{R}^{m}$, their Hadamard product is defined by

a\odot b:=\left(\begin{array}[]{c}a_{1}b_{1}\\ a_{2}b_{2}\\ \vdots\\ a_{m}b_{m}\end{array}\right)\in\mathbb{R}^{m}. (2.2)
Definition 2.1 (Universal approximation property for the activation function $\sigma$).

Let $\sigma$ be a real-valued function on $\mathbb{R}$ and $D$ be a compact subset of $\mathbb{R}^{n}$. Also, consider the set

S:=\left\{G:D\to\mathbb{R}\left|G(\xi)=\sum_{l=1}^{L}\alpha_{l}\sigma(\mbox{\boldmath$c$}_{l}\cdot\xi+d_{l}),L\in\mathbb{N},\alpha_{l},d_{l}\in\mathbb{R},\mbox{\boldmath$c$}_{l}\in\mathbb{R}^{n}\right.\right\}.

Suppose that $S$ is dense in $C(D)$. In other words, given $F\in C(D)$ and $\eta>0$, there exists a function $G\in S$ such that

|G(\xi)-F(\xi)|<\eta

for any $\xi\in D$. Then, we say that $\sigma$ has a universal approximation property (UAP) on $D$.
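For illustration, the following Python sketch (assumed, with illustrative parameter values) evaluates one element $G$ of the set $S$ in Definition 2.1, i.e. a shallow network $G(\xi)=\sum_{l=1}^{L}\alpha_{l}\sigma(\mbox{\boldmath$c$}_{l}\cdot\xi+d_{l})$.

import numpy as np

def G(xi, alpha, C, d, sigma=np.tanh):
    # alpha: (L,), C: (L, n), d: (L,); returns sum_l alpha_l * sigma(c_l . xi + d_l).
    return float(alpha @ sigma(C @ xi + d))

rng = np.random.default_rng(1)
L, n = 5, 3
alpha, C, d = rng.standard_normal(L), rng.standard_normal((L, n)), rng.standard_normal(L)
print(G(np.array([0.1, 0.2, 0.3]), alpha, C, d))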

Some activation functions with the universal approximation property are presented in Table 3.

Table 3: Examples of activation functions with the universal approximation property
Activation function $\sigma(x)$
Unbounded functions
Truncated power function $x_{+}^{k}:=\left\{\begin{array}[]{ll}x^{k}&x>0\\ 0&x\leq 0\end{array}\right.$, $k\in\mathbb{N}\cup\{0\}$
ReLU function $x_{+}$
Softplus function $\log(1+e^{x})$
Bounded but not integrable functions
Unit step function $x_{+}^{0}$
(Standard) sigmoidal function $(1+e^{-x})^{-1}$
Hyperbolic tangent function $\tanh(x)$
Bump functions
(Gaussian) Radial basis function $\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^{2}}{2}\right)$
Dirac's $\delta$ function $\delta(x)$

A non-polynomial activation function in a neural network with three layers has the universal approximation property. This result was shown by Leshno et al. [16] using functional analysis, and later by Sonoda and Murata [22] using the ridgelet transform.

2.2 Main Theorem for $(\alpha,\beta,\gamma)$-type ODENet

In this subsection, we show the universal approximation property for the $(\alpha,\beta,\gamma)$-type ODENet associated with the ODE system (2.3). Since the first (resp. second) equation consists of $n$ (resp. $m$) equations, the width of the $(\alpha,\beta,\gamma)$-type ODENet is $n+m$.

Definition 2.2 ($(\alpha,\beta,\gamma)$-type ODENet).

Suppose that an $m\times n$ real matrix $A$ and a function $\sigma:\mathbb{R}\to\mathbb{R}$ are given. We consider the system of ODEs

\left\{\begin{aligned} x^{\prime}(t)&=\beta(t)x(t)+\gamma(t),&t\in(0,T],\\ y^{\prime}(t)&=\alpha(t)\odot\mbox{\boldmath$\sigma$}(Ax(t)),&t\in(0,T],\\ x(0)&=\xi,&\\ y(0)&=0,&\end{aligned}\right. (2.3)

where $x$ and $y$ are functions from $[0,T]$ to $\mathbb{R}^{n}$ and $\mathbb{R}^{m}$, respectively; $x(0)=\xi\in\mathbb{R}^{n}$ is the input data and $y(T)\in\mathbb{R}^{m}$ is the final output. The functions $\alpha:[0,T]\to\mathbb{R}^{m}$, $\beta:[0,T]\to\mathbb{R}^{n\times n}$, and $\gamma:[0,T]\to\mathbb{R}^{n}$ are design parameters. The function $\mbox{\boldmath$\sigma$}:\mathbb{R}^{m}\to\mathbb{R}^{m}$ is defined by (2.1), and the operator $\odot$ denotes the Hadamard product defined by (2.2). We call $H=[\xi\mapsto y(T)]:\mathbb{R}^{n}\to\mathbb{R}^{m}$ an $(\alpha,\beta,\gamma)$-type ODENet associated with the ODE system (2.3).
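The following is a minimal numerical sketch of Definition 2.2, assuming a forward Euler discretization with step $h=T/L$ (as used later in Section 3.2) and $\sigma=\tanh$; the parameter functions and their values are illustrative, not taken from the paper.

import numpy as np

def odenet_forward(xi, A, alpha, beta, gamma, sigma=np.tanh, T=1.0, L=100):
    # Approximate H(xi) = y(T) for the system (2.3): x' = beta x + gamma, y' = alpha (.) sigma(A x).
    h = T / L
    x = np.asarray(xi, dtype=float)       # state in R^n
    y = np.zeros(A.shape[0])              # output state in R^m, y(0) = 0
    for l in range(L):
        t = l * h
        x_new = x + h * (beta(t) @ x + gamma(t))
        y = y + h * alpha(t) * sigma(A @ x)
        x = x_new
    return y

# Toy example with n = 2, m = 1 and constant design parameters.
A = np.array([[1.0, 0.0]])
H = odenet_forward(np.array([0.5, -0.2]), A,
                   alpha=lambda t: np.array([1.0]),
                   beta=lambda t: np.zeros((2, 2)),
                   gamma=lambda t: np.zeros(2))
print(H)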

For a compact subset $D\subset\mathbb{R}^{n}$, we define

S(D):=\{[\xi\mapsto y(T)]\in C(D;\mathbb{R}^{m})\,|\,\alpha\in C^{\infty}([0,T];\mathbb{R}^{m}),\beta\in C^{\infty}([0,T];\mathbb{R}^{n\times n}),\gamma\in C^{\infty}([0,T];\mathbb{R}^{n})\}.

We will assume that the activation function is locally Lipschitz continuous; in other words,

\forall R>0,~\exists L_{R}>0~\mathrm{s.t.}\quad|\sigma(s_{1})-\sigma(s_{2})|\leq L_{R}|s_{1}-s_{2}|\quad\mathrm{for}~s_{1},s_{2}\in[-R,R]. (2.4)
Theorem 2.3 (UAP for $(\alpha,\beta,\gamma)$-type ODENet).

Suppose that $m\leq n$ and $\mathrm{rank}(A)=m$. If $\sigma:\mathbb{R}\to\mathbb{R}$ satisfies (2.4) and has the UAP on a compact subset $D\subset\mathbb{R}^{n}$, then $S(D)$ is dense in $C(D;\mathbb{R}^{m})$. In other words, given $F\in C(D;\mathbb{R}^{m})$ and $\eta>0$, there exists a function $H\in S(D)$ such that

|H(\xi)-F(\xi)|<\eta,

for any $\xi\in D$.

Corollary 2.4.

Let $1\leq p<\infty$. Then, $S(D)$ is dense in $L^{p}(D;\mathbb{R}^{m})$. In other words, given $F\in L^{p}(D;\mathbb{R}^{m})$ and $\eta>0$, there exists a function $H\in S(D)$ such that

\|H-F\|_{L^{p}(D;\mathbb{R}^{m})}<\eta.
Remark 2.5.

The assumption on $\sigma$ in Theorem 2.3 holds, e.g., if $\sigma$ is a non-polynomial continuous function [16].

2.3 Main Theorem for $(\alpha,\beta,\gamma)$-type ResNet

In this subsection, we show that a universal approximation property also holds for the $(\alpha,\beta,\gamma)$-type ResNet with the system of difference equations (2.5).

Definition 2.6 ($(\alpha,\beta,\gamma)$-type ResNet).

Suppose that an $m\times n$ real matrix $A$ and a function $\sigma:\mathbb{R}\to\mathbb{R}$ are given. We consider the system of difference equations

\left\{\begin{aligned} x^{(l)}&=x^{(l-1)}+\beta^{(l)}x^{(l-1)}+\gamma^{(l)},&l=1,2,\ldots,L\\ y^{(l)}&=y^{(l-1)}+\alpha^{(l)}\odot\mbox{\boldmath$\sigma$}(Ax^{(l)}),&l=1,2,\ldots,L\\ x^{(0)}&=\xi,&\\ y^{(0)}&=0,&\end{aligned}\right. (2.5)

where $x^{(l)}$ and $y^{(l)}$ are $n$- and $m$-dimensional real vectors for all $l=0,1,\ldots,L$, respectively. Also, $\xi\in\mathbb{R}^{n}$ denotes the input data, while $y^{(L)}\in\mathbb{R}^{m}$ represents the final output. Moreover, $\alpha^{(l)}\in\mathbb{R}^{m}$, $\beta^{(l)}\in\mathbb{R}^{n\times n}$, and $\gamma^{(l)}\in\mathbb{R}^{n}~(l=1,2,\ldots,L)$ are design parameters. The function $\mbox{\boldmath$\sigma$}:\mathbb{R}^{m}\to\mathbb{R}^{m}$ is defined by (2.1), and the operator $\odot$ denotes the Hadamard product defined by (2.2). We call the function $H=[\xi\mapsto y^{(L)}]:D\to\mathbb{R}^{m}$ an $(\alpha,\beta,\gamma)$-type ResNet with the system of difference equations (2.5).
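The forward pass of Definition 2.6 can be sketched as follows (illustrative parameters; not the authors' implementation). Note that the $y$-update uses the already-updated state $x^{(l)}$, in contrast to the Euler scheme in Section 3.2.

import numpy as np

def resnet_forward(xi, A, alphas, betas, gammas, sigma=np.tanh):
    # alphas: list of (m,), betas: list of (n,n), gammas: list of (n,); returns y^(L).
    x = np.asarray(xi, dtype=float)
    y = np.zeros(A.shape[0])
    for al, be, ga in zip(alphas, betas, gammas):
        x = x + be @ x + ga                 # x^(l) = x^(l-1) + beta^(l) x^(l-1) + gamma^(l)
        y = y + al * sigma(A @ x)           # y^(l) = y^(l-1) + alpha^(l) (.) sigma(A x^(l))
    return y

# Toy example: n = 2, m = 1, depth L = 3.
A = np.array([[1.0, -1.0]])
L = 3
print(resnet_forward(np.array([0.2, 0.4]), A,
                     alphas=[np.array([0.5])] * L,
                     betas=[0.1 * np.eye(2)] * L,
                     gammas=[np.zeros(2)] * L))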

For a compact subset $D\subset\mathbb{R}^{n}$, we define

S_{\mathrm{res}}(D):=\{[\xi\mapsto y^{(L)}]\in C(D;\mathbb{R}^{m})\,|\,L\in\mathbb{N},\alpha^{(l)}\in\mathbb{R}^{m},\beta^{(l)}\in\mathbb{R}^{n\times n},\gamma^{(l)}\in\mathbb{R}^{n}~(l=1,2,\ldots,L)\}.
Theorem 2.7 (UAP for $(\alpha,\beta,\gamma)$-type ResNet).

Suppose that $m\leq n$ and $\mathrm{rank}(A)=m$. If $\sigma:\mathbb{R}\to\mathbb{R}$ satisfies (2.4) and has the UAP on a compact subset $D\subset\mathbb{R}^{n}$, then $S_{\mathrm{res}}(D)$ is dense in $C(D;\mathbb{R}^{m})$.

Remark 2.8.

The assumption on $\sigma$ in Theorem 2.7 holds, e.g., if $\sigma$ is a non-polynomial continuous function [16].

2.4 Some lemmas

We describe some lemmas used to prove Theorems 2.3 and 2.7.

Lemma 2.9.

Suppose that $m\leq n$. Let $\mbox{\boldmath$\sigma$}$ be the function from $\mathbb{R}^{m}$ to $\mathbb{R}^{m}$ defined by (2.1). For any $\alpha,d\in\mathbb{R}^{m}$ and any $C=(\mbox{\boldmath$c$}_{1},\mbox{\boldmath$c$}_{2},\ldots,\mbox{\boldmath$c$}_{m})^{\top}\in\mathbb{R}^{m\times n}$ with no zero rows (i.e. $\mbox{\boldmath$c$}_{l}\neq 0$ for $l=1,2,\ldots,m$), there exist $\tilde{\alpha}^{(l)},\tilde{d}^{(l)}\in\mathbb{R}^{m}$ and $\tilde{C}^{(l)}\in\mathbb{R}^{m\times n}~(l=1,2,\ldots,m)$ such that

\alpha\odot\mbox{\boldmath$\sigma$}(C\xi+d)=\sum_{l=1}^{m}\tilde{\alpha}^{(l)}\odot\mbox{\boldmath$\sigma$}(\tilde{C}^{(l)}\xi+\tilde{d}^{(l)}),

for any $\xi\in\mathbb{R}^{n}$, and $\mathrm{rank}(\tilde{C}^{(l)})=m$ for all $l=1,2,\ldots,m$. Moreover, if $m=n$, we can choose $\tilde{C}^{(l)}\in\mathbb{R}^{n\times n}$ such that $\det\tilde{C}^{(l)}>0$ for all $l=1,2,\ldots,n$.

Proof.

Let mnm\leq n. For all l=1,2,,ml=1,2,\ldots,m, there exists C~(l)=(𝒄~1(l),𝒄~2(l),,𝒄~m(l))m×n\tilde{C}^{(l)}=(\tilde{\mbox{\boldmath$c$}}_{1}^{(l)},\tilde{\mbox{\boldmath$c$}}_{2}^{(l)},\ldots,\tilde{\mbox{\boldmath$c$}}_{m}^{(l)})^{\top}\in\mathbb{R}^{m\times n} such that 𝒄~l(l)=𝒄l\tilde{\mbox{\boldmath$c$}}_{l}^{(l)}=\mbox{\boldmath$c$}_{l}, rank(C~(l))=m\mathrm{rank}(\tilde{C}^{(l)})=m. Then, we put

α~k(l):={αk,ifl=k,0,iflk,d~k(l):={dk,ifl=k,0,iflk.\tilde{\alpha}_{k}^{(l)}:=\left\{\begin{array}[]{ll}\alpha_{k},&\mathrm{if}~{}l=k,\\ 0,&\mathrm{if}~{}l\neq k,\end{array}\right.\quad\tilde{d}_{k}^{(l)}:=\left\{\begin{array}[]{ll}d_{k},&\mathrm{if}~{}l=k,\\ 0,&\mathrm{if}~{}l\neq k.\end{array}\right.

Looking at the kk-th component, we see that for any ξn\xi\in\mathbb{R}^{n}, we have

l=1mα~k(l)σ(𝒄~k(l)ξ+d~k(l))=α~k(k)σ(𝒄~k(k)ξ+d~k(k))=αkσ(𝒄kξ+dk).\sum_{l=1}^{m}\tilde{\alpha}_{k}^{(l)}\sigma(\tilde{\mbox{\boldmath$c$}}_{k}^{(l)}\cdot\xi+\tilde{d}_{k}^{(l)})=\tilde{\alpha}_{k}^{(k)}\sigma(\tilde{\mbox{\boldmath$c$}}_{k}^{(k)}\cdot\xi+\tilde{d}_{k}^{(k)})=\alpha_{k}\sigma(\mbox{\boldmath$c$}_{k}\cdot\xi+d_{k}).

Therefore,

l=1mα~(l)𝝈(C~(l)ξ+d~(l))=α𝝈(Cξ+d).\sum_{l=1}^{m}\tilde{\alpha}^{(l)}\odot\mbox{\boldmath$\sigma$}(\tilde{C}^{(l)}\xi+\tilde{d}^{(l)})=\alpha\odot\mbox{\boldmath$\sigma$}(C\xi+d).

Now, if m=nm=n, then rank(C~(l))=n\mathrm{rank}(\tilde{C}^{(l)})=n, and so det(C~(l))0\det(\tilde{C}^{(l)})\neq 0. In particular, we can choose C~(l)\tilde{C}^{(l)} such that det(C~(l))>0\det(\tilde{C}^{(l)})>0. ∎

Lemma 2.10.

Suppose that $m\leq n$. Let $\mbox{\boldmath$\sigma$}$ be a function from $\mathbb{R}^{m}$ to $\mathbb{R}^{m}$. For any $L\in\mathbb{N}$, $\alpha^{(l)},d^{(l)}\in\mathbb{R}^{m}$, and $C^{(l)}\in\mathbb{R}^{m\times n}~(l=1,2,\ldots,L)$, there exist $L^{\prime}\in\mathbb{N}$, $\tilde{\alpha}^{(l)},\tilde{d}^{(l)}\in\mathbb{R}^{m}$, and $\tilde{C}^{(l)}\in\mathbb{R}^{m\times n}~(l=1,2,\ldots,L^{\prime})$ such that

\frac{1}{L}\sum_{l=1}^{L}\alpha^{(l)}\odot\mbox{\boldmath$\sigma$}(C^{(l)}\xi+d^{(l)})=\frac{1}{L^{\prime}}\sum_{l=1}^{L^{\prime}}\tilde{\alpha}^{(l)}\odot\mbox{\boldmath$\sigma$}(\tilde{C}^{(l)}\xi+\tilde{d}^{(l)})

for any $\xi\in\mathbb{R}^{n}$, and $\mathrm{rank}(\tilde{C}^{(l)})=m$ for all $l=1,2,\ldots,L^{\prime}$. Moreover, if $m=n$, we can choose $\tilde{C}^{(l)}\in\mathbb{R}^{m\times n}$ such that $\det\tilde{C}^{(l)}>0$ for all $l=1,2,\ldots,L^{\prime}$.

Proof.

This follows from Lemma 2.9. ∎

Lemma 2.11.

Suppose that $m<n$. Let $A$ be an $m\times n$ real matrix satisfying $\mathrm{rank}(A)=m$. Then, for any $C\in\mathbb{R}^{m\times n}$ satisfying $\mathrm{rank}(C)=m$, there exists $P\in\mathbb{R}^{n\times n}$ such that

C=AP,\quad\det P>0. (2.6)

In addition, if $m=n$ and $\mathrm{sgn}(\det C)=\mathrm{sgn}(\det A)$, there exists $P\in\mathbb{R}^{n\times n}$ such that (2.6) holds.

Proof.
  1. (i)

    Suppose that m<nm<n. From rank(A)=rank(C)=m\mathrm{rank}(A)=\mathrm{rank}(C)=m, there exists A¯,C¯(nm)×n\bar{A},\bar{C}\in\mathbb{R}^{(n-m)\times n} such that

    detA~>0,A~=(AA¯),detC~>0,C~=(CC¯).\det\tilde{A}>0,\quad\tilde{A}=\left(\begin{array}[]{c}A\\ \bar{A}\end{array}\right),\quad\det\tilde{C}>0,\quad\tilde{C}=\left(\begin{array}[]{c}C\\ \bar{C}\end{array}\right).

    If we put P:=A~1C~P:=\tilde{A}^{-1}\tilde{C}, we get detP>0\det P>0, C=APC=AP.

  2. (ii)

    Suppose that m=nm=n. We put P:=A1CP:=A^{-1}C. Because sgn(detC)=sgn(detA)\mathrm{sgn}(\det C)=\mathrm{sgn}(\det A), we have detP>0\det P>0, and so C=APC=AP.

Lemma 2.12.

Let $p\in[1,\infty)$. Suppose that

P(t)=P^{(l)}\in\mathbb{R}^{n\times n},\quad\det P^{(l)}>0,

for $t_{l-1}\leq t<t_{l}$ and for all $l=1,2,\ldots,L$, where $t_{0}=0$ and $t_{L}=T$. Then, there exists a real number $C>0$ such that, for any $\varepsilon>0$, there exists $P^{\varepsilon}\in C^{\infty}([0,T];\mathbb{R}^{n\times n})$ such that

\|P^{\varepsilon}-P\|_{L^{p}(0,T;\mathbb{R}^{n\times n})}<\varepsilon,\quad\det P^{\varepsilon}(t)>0,\quad\mathrm{and}\quad\|P^{\varepsilon}(t)\|\leq C,

for any $t\in[0,T]$.

Proof.

We define GL+(n,):={An×n|detA>0}\mathrm{GL}^{+}(n,\mathbb{R}):=\{A\in\mathbb{R}^{n\times n}|\det A>0\}. From [2, Chapter 9, p.239], GL+(n,)\mathrm{GL}^{+}(n,\mathbb{R}) is path-connected. For all l=1,2,,Ll=1,2,\ldots,L, there exists Q(l)C([0,1];n×n)Q^{(l)}\in C([0,1];\mathbb{R}^{n\times n}) such that

Q(l)(0)=P(l),Q(l)(1)=P(l+1),anddetQ(l)(s)>0,Q^{(l)}(0)=P^{(l)},\quad Q^{(l)}(1)=P^{(l+1)},\quad\mathrm{and}\quad\det Q^{(l)}(s)>0,

for any s[0,1]s\in[0,1]. For δ>0\delta>0, we put

Qδ(t):={P(1),<t<t1,Q(l)(ttlδ),tlt<tl+δ,(l=1,2,,L1),P(l)tl1+δt<tl,(l=2,3,,L2),P(L)tL1+δt<.Q^{\delta}(t):=\left\{\begin{array}[]{lll}P^{(1)},&-\infty<t<t_{1},&\\ \displaystyle{Q^{(l)}\left(\frac{t-t_{l}}{\delta}\right)},&t_{l}\leq t<t_{l}+\delta,&(l=1,2,\ldots,L-1),\\ P^{(l)}&t_{l-1}+\delta\leq t<t_{l},&(l=2,3,\ldots,L-2),\\ P^{(L)}&t_{L-1}+\delta\leq t<\infty.&\end{array}\right.

Then, QδQ^{\delta} is a continuous function from \mathbb{R} to n×n\mathbb{R}^{n\times n}. There exists a C0>0C_{0}>0 such that detQδ(t)C0\det Q^{\delta}(t)\geq C_{0}, for any tt\in\mathbb{R}. Let {φε}ε>0\{\varphi_{\varepsilon}\}_{\varepsilon>0} be a sequence of Friedrichs’ mollifiers in \mathbb{R}. We put

Pε(t):=(φεQδ)(t).P^{\varepsilon}(t):=(\varphi_{\varepsilon}*Q^{\delta})(t).

Then, PεC(;n×n)P^{\varepsilon}\in C^{\infty}(\mathbb{R};\mathbb{R}^{n\times n}). Since

limε0PεQδC([0,T];n×n)=0,\lim_{\varepsilon\to 0}\|P^{\varepsilon}-Q^{\delta}\|_{C([0,T];\mathbb{R}^{n\times n})}=0,

there exists a number ε0>0\varepsilon_{0}>0 such that, for any εε0\varepsilon\leq\varepsilon_{0},

detPε(t)C02\det P^{\varepsilon}(t)\geq\frac{C_{0}}{2}

for all t[0,T]t\in[0,T]. Because QδQ^{\delta} is bounded, there exists a number C>0C>0 such that Pε(t)C\|P^{\varepsilon}(t)\|\leq C, for any t[0,T]t\in[0,T]. Now, we note that

PεPLp(0,T;n×n)PεQδLp(0,T;n×n)+QδPLp(0,T;n×n).\|P^{\varepsilon}-P\|_{L^{p}(0,T;\mathbb{R}^{n\times n})}\leq\|P^{\varepsilon}-Q^{\delta}\|_{L^{p}(0,T;\mathbb{R}^{n\times n})}+\|Q^{\delta}-P\|_{L^{p}(0,T;\mathbb{R}^{n\times n})}.

The last summand is calculated as follows

QδPLp(0,T;n×n)p\displaystyle\|Q^{\delta}-P\|_{L^{p}(0,T;\mathbb{R}^{n\times n})}^{p} =0TQδ(t)P(t)p𝑑t,\displaystyle=\int_{0}^{T}\|Q^{\delta}(t)-P(t)\|^{p}dt,
=l=1L1tltl+δQ(l)(ttlδ)P(l+1)p𝑑t,\displaystyle=\sum_{l=1}^{L-1}\int_{t_{l}}^{t_{l}+\delta}\left\|Q^{(l)}\left(\frac{t-t_{l}}{\delta}\right)-P^{(l+1)}\right\|^{p}dt,
=δl=1L101Qδ(s)P(l+1)p𝑑s.\displaystyle=\delta\sum_{l=1}^{L-1}\int_{0}^{1}\|Q^{\delta}(s)-P^{(l+1)}\|^{p}ds.

Hence, if δ0\delta\to 0, then QδPLp(0,T;n×n)0\|Q^{\delta}-P\|_{L^{p}(0,T;\mathbb{R}^{n\times n})}\to 0. Therefore,

PεPLp(0,T;n×n)<ε,\|P^{\varepsilon}-P\|_{L^{p}(0,T;\mathbb{R}^{n\times n})}<\varepsilon,

for any ε>0\varepsilon>0. ∎

Remark 2.13.

Lemma 2.12 does not hold when $p=\infty$ (except when $P$ is a constant function) because the uniform limit of continuous functions is continuous.

2.5 Proofs

In this subsection, we provide the proofs of Theorem 2.3 and Theorem 2.7.

2.5.1 Proof of Theorem 2.3

Proof.

Since 𝝈C(m;m)\mbox{\boldmath$\sigma$}\in C(\mathbb{R}^{m};\mathbb{R}^{m}) is defined by (2.1), where σC()\sigma\in C(\mathbb{R}) satisfies a UAP, then given FC(D;m)F\in C(D;\mathbb{R}^{m}) and η>0\eta>0, there exist a positive integer LL, m\mathbb{R}^{m}-valued vectors α(l)\alpha^{(l)} and d(l)d^{(l)}, and matrices C(l)m×nC^{(l)}\in\mathbb{R}^{m\times n}, for all l=1,2,,Ll=1,2,\ldots,L, such that

G(ξ)=TLl=1Lα(l)𝝈(C(l)ξ+d(l)),G(\xi)=\frac{T}{L}\sum_{l=1}^{L}\alpha^{(l)}\odot\mbox{\boldmath$\sigma$}(C^{(l)}\xi+d^{(l)}),
|G(ξ)F(ξ)|<η2,|G(\xi)-F(\xi)|<\frac{\eta}{2}, (2.7)

for any ξD\xi\in D. From Lemma 2.10, we know that rank(C(l))=m\mathrm{rank}(C^{(l)})=m, for l=1,2,,Ll=1,2,\ldots,L. In addition, when m=nm=n, we have sgn(detA)=sgn(detC(l))\mathrm{sgn}(\det A)=\mathrm{sgn}(\det C^{(l)}). In view of Lemma 2.11, there exists a matrix P(l)n×nP^{(l)}\in\mathbb{R}^{n\times n} such that detP(l)>0\det P^{(l)}>0 and C(l)=AP(l)C^{(l)}=AP^{(l)}, for each l=1,2,,Ll=1,2,\ldots,L. We put q(l):=A(AA)1d(l)q^{(l)}:=A^{\top}(AA^{\top})^{-1}d^{(l)} so that d(l)=Aq(l)d^{(l)}=Aq^{(l)}. In addition, we let

α(t):=α(l),P(t):=P(l),q(t):=q(l),l1LTt<lLT.\alpha(t):=\alpha^{(l)},\quad P(t):=P^{(l)},\quad q(t):=q^{(l)},\quad\frac{l-1}{L}T\leq t<\frac{l}{L}T.

Then, detP(t)>0\det P(t)>0 for any t[0,T]t\in[0,T] and

G(ξ)=TLl=1Lα(l)𝝈(AP(l)ξ+Aq(l))=0Tα(t)𝝈(A(P(t)ξ+q(t)))𝑑t.G(\xi)=\frac{T}{L}\sum_{l=1}^{L}\alpha^{(l)}\odot\mbox{\boldmath$\sigma$}(AP^{(l)}\xi+Aq^{(l)})=\int_{0}^{T}\alpha(t)\odot\mbox{\boldmath$\sigma$}(A(P(t)\xi+q(t)))dt.

Let {φε}ε>0\{\varphi_{\varepsilon}\}_{\varepsilon>0} be a sequence of Friedrichs’ mollifiers. We put αε(t):=(φεα)(t)\alpha^{\varepsilon}(t):=(\varphi_{\varepsilon}*\alpha)(t) and qε(t):=(φεq)(t)q^{\varepsilon}(t):=(\varphi_{\varepsilon}*q)(t). Then, αεC([0,T];m)\alpha^{\varepsilon}\in C^{\infty}([0,T];\mathbb{R}^{m}) and qεC([0,T];n)q^{\varepsilon}\in C^{\infty}([0,T];\mathbb{R}^{n}). From Lemma 2.12, there exists a real number C>0C>0 such that, given η>0\eta>0, there exists PεC([0,T];n×n)P^{\varepsilon}\in C^{\infty}([0,T];\mathbb{R}^{n\times n}) from which we have

PεPL1(0,T;n×n)<η,detPε(t)>0,Pε(t)C,\|P^{\varepsilon}-P\|_{L^{1}(0,T;\mathbb{R}^{n\times n})}<\eta,\quad\det P^{\varepsilon}(t)>0,\quad\|P^{\varepsilon}(t)\|\leq C,

for any t[0,T]t\in[0,T]. If we put

xε(t;ξ):=Pε(t)ξ+qε(t),x^{\varepsilon}(t;\xi):=P^{\varepsilon}(t)\xi+q^{\varepsilon}(t), (2.8)
yε(t;ξ):=0Tαε(s)𝝈(Axε(s;ξ))𝑑s,y^{\varepsilon}(t;\xi):=\int_{0}^{T}\alpha^{\varepsilon}(s)\odot\mbox{\boldmath$\sigma$}(Ax^{\varepsilon}(s;\xi))ds, (2.9)

then

yε(T;ξ)=0Tαε(t)𝝈(A(Pε(t)ξ+qε(t)))𝑑t.y^{\varepsilon}(T;\xi)=\int_{0}^{T}\alpha^{\varepsilon}(t)\odot\mbox{\boldmath$\sigma$}(A(P^{\varepsilon}(t)\xi+q^{\varepsilon}(t)))dt.

Hence, we have

|yε(T;ξ)G(ξ)|\displaystyle|y^{\varepsilon}(T;\xi)-G(\xi)| \displaystyle\leq 0T|αε(t)𝝈(A(Pε(t)ξ+qε(t)))α(t)𝝈(A(P(t)ξ+q(t)))|𝑑t,\displaystyle\int_{0}^{T}\left|\alpha^{\varepsilon}(t)\odot\mbox{\boldmath$\sigma$}(A(P^{\varepsilon}(t)\xi+q^{\varepsilon}(t)))-\alpha(t)\odot\mbox{\boldmath$\sigma$}(A(P(t)\xi+q(t)))\right|dt,
\displaystyle\leq 0T|αε(t)α(t)||𝝈(A(P(t)ξ+q(t)))|𝑑t,\displaystyle\int_{0}^{T}|\alpha^{\varepsilon}(t)-\alpha(t)||\mbox{\boldmath$\sigma$}(A(P(t)\xi+q(t)))|dt,
+0T|αε(t)||𝝈(A(Pε(t)ξ+qε(t)))𝝈(A(P(t)ξ+q(t)))|𝑑t.\displaystyle+\int_{0}^{T}|\alpha^{\varepsilon}(t)||\mbox{\boldmath$\sigma$}(A(P^{\varepsilon}(t)\xi+q^{\varepsilon}(t)))-\mbox{\boldmath$\sigma$}(A(P(t)\xi+q(t)))|dt.

Because $P$ and $q$ are piecewise constant functions, they are bounded. Since $\mbox{\boldmath$\sigma$}\in C(\mathbb{R}^{m};\mathbb{R}^{m})$, there exists $M>0$ such that $|\mbox{\boldmath$\sigma$}(A(P(t)\xi+q(t)))|\leq M$ for any $t\in[0,T]$. On the other hand, we have the estimate

|αε(t)|φε(ts)|α(s)|𝑑sαL(0,T;m)φε(τ)𝑑τ=αL(0,T;m).|\alpha^{\varepsilon}(t)|\leq\int_{\mathbb{R}}\varphi_{\varepsilon}(t-s)|\alpha(s)|ds\leq\|\alpha\|_{L^{\infty}(0,T;\mathbb{R}^{m})}\int_{\mathbb{R}}\varphi_{\varepsilon}(\tau)d\tau=\|\alpha\|_{L^{\infty}(0,T;\mathbb{R}^{m})}.

Similarly, because $\|q^{\varepsilon}\|_{L^{\infty}(0,T;\mathbb{R}^{n})}\leq\|q\|_{L^{\infty}(0,T;\mathbb{R}^{n})}$, $q^{\varepsilon}$ is bounded. Hence we can take $R>0$ such that $A(P^{\varepsilon}(t)\xi+q^{\varepsilon}(t))$, $A(P(t)\xi+q(t))\in[-R,R]^{m}$ for any $t\in[0,T]$ and $\xi\in D$. Then, by (2.4),

|𝝈(A(Pε(t)ξ+qε(t)))𝝈(A(P(t)ξ+q(t)))|\displaystyle|\mbox{\boldmath$\sigma$}(A(P^{\varepsilon}(t)\xi+q^{\varepsilon}(t)))-\mbox{\boldmath$\sigma$}(A(P(t)\xi+q(t)))|
LRA(Pε(t)P(t)(maxξD|ξ|)+|qε(t)q(t)|).\displaystyle\leq L_{R}\|A\|\left(\|P^{\varepsilon}(t)-P(t)\|(\max_{\xi\in D}|\xi|)+|q^{\varepsilon}(t)-q(t)|\right).

Therefore,

|yε(T;ξ)G(ξ)|MαεαL1(0,T;m)\displaystyle|y^{\varepsilon}(T;\xi)-G(\xi)|\leq M\|\alpha^{\varepsilon}-\alpha\|_{L^{1}(0,T;\mathbb{R}^{m})}
+LRAαL(0,T;m)(PεPL1(0,T;n×n)(maxξD|ξ|)+qεqL1(0,T;n)).\displaystyle+L_{R}\|A\|\|\alpha\|_{L^{\infty}(0,T;\mathbb{R}^{m})}\left(\|P^{\varepsilon}-P\|_{L^{1}(0,T;\mathbb{R}^{n\times n})}(\max_{\xi\in D}|\xi|)+\|q^{\varepsilon}-q\|_{L^{1}(0,T;\mathbb{R}^{n})}\right).

Hence, there exists a number $\varepsilon>0$ such that

|yε(T;ξ)G(ξ)|<η2,|y^{\varepsilon}(T;\xi)-G(\xi)|<\frac{\eta}{2}, (2.10)

for any ξD\xi\in D. Thus, from (2.7) and (2.10),

|yε(T;ξ)F(ξ)||yε(T;ξ)G(ξ)|+|G(ξ)F(ξ)|<η,|y^{\varepsilon}(T;\xi)-F(\xi)|\leq|y^{\varepsilon}(T;\xi)-G(\xi)|+|G(\xi)-F(\xi)|<\eta,

for any ξD\xi\in D. For all t[0,T]t\in[0,T], we know that detPε(t)>0\det P^{\varepsilon}(t)>0, so Pε(t)P^{\varepsilon}(t) is invertible. This allows us to define

β(t):=(ddtPε(t))(Pε(t))1,γ(t):=ddtqε(t)β(t)qε(t).\beta(t):=\left(\frac{d}{dt}P^{\varepsilon}(t)\right)\left(P^{\varepsilon}(t)\right)^{-1},\quad\gamma(t):=\frac{d}{dt}q^{\varepsilon}(t)-\beta(t)q^{\varepsilon}(t).

This gives us

ddtPε(t)=β(t)Pε(t),ddtqε(t)=β(t)qε(t)+γ(t).\frac{d}{dt}P^{\varepsilon}(t)=\beta(t)P^{\varepsilon}(t),\quad\frac{d}{dt}q^{\varepsilon}(t)=\beta(t)q^{\varepsilon}(t)+\gamma(t).

In view of (2.8) and (2.9),

ddtxε(t;ξ)=ddtPε(t)ξ+ddtqε(t)=β(t)Pε(t)ξ+β(t)qε(t)+γ(t)=β(t)xε(t;ξ)+γ(t),\frac{d}{dt}x^{\varepsilon}(t;\xi)=\frac{d}{dt}P^{\varepsilon}(t)\xi+\frac{d}{dt}q^{\varepsilon}(t)=\beta(t)P^{\varepsilon}(t)\xi+\beta(t)q^{\varepsilon}(t)+\gamma(t)=\beta(t)x^{\varepsilon}(t;\xi)+\gamma(t),
ddtyε(t;ξ)=αε(t)𝝈(Axε(t;ξ)).\frac{d}{dt}y^{\varepsilon}(t;\xi)=\alpha^{\varepsilon}(t)\odot\mbox{\boldmath$\sigma$}(Ax^{\varepsilon}(t;\xi)).

Hence, yε(T,)S(D)y^{\varepsilon}(T,\cdot)\in S(D). Therefore, given FC(D;m)F\in C(D;\mathbb{R}^{m}) and η>0\eta>0, there exist some functions αC([0,T];m)\alpha\in C^{\infty}([0,T];\mathbb{R}^{m}), βC([0,T];n×n)\beta\in C^{\infty}([0,T];\mathbb{R}^{n\times n}), and γC([0,T];n)\gamma\in C^{\infty}([0,T];\mathbb{R}^{n}) such that

|y(T;ξ)F(ξ)|<η,|y(T;\xi)-F(\xi)|<\eta,

for any ξD\xi\in D. ∎

2.5.2 Proof of Theorem 2.7

Proof.

Again, we start with the fact that 𝝈C(m;m)\mbox{\boldmath$\sigma$}\in C(\mathbb{R}^{m};\mathbb{R}^{m}) is defined by (2.1), where σC()\sigma\in C(\mathbb{R}) satisfies a UAP; that is, given FC(D;m)F\in C(D;\mathbb{R}^{m}) and η>0\eta>0, there exist a positive integer LL, m\mathbb{R}^{m}-valued vectors α(l)\alpha^{(l)} and d(l)d^{(l)}, and matrices C(l)m×nC^{(l)}\in\mathbb{R}^{m\times n}, for all l=1,2,,Ll=1,2,\ldots,L, such that

G(ξ)=l=1Lα(l)𝝈(C(l)ξ+d(l)),G(\xi)=\sum_{l=1}^{L}\alpha^{(l)}\odot\mbox{\boldmath$\sigma$}(C^{(l)}\xi+d^{(l)}),
|G(ξ)F(ξ)|<η,|G(\xi)-F(\xi)|<\eta,

for any ξD\xi\in D. By virtue of Lemma 2.10, we know that rank(C(l))=m\mathrm{rank}(C^{(l)})=m, for all l=1,2,,Ll=1,2,\ldots,L. Moreover, if m=nm=n, we have sgn(detA)=sgn(detC(l))\mathrm{sgn}(\det A)=\mathrm{sgn}(\det C^{(l)}). On the other hand, from Lemma 2.11, there exists P(l)n×nP^{(l)}\in\mathbb{R}^{n\times n} such that detP(l)>0\det P^{(l)}>0 and C(l)=AP(l)C^{(l)}=AP^{(l)}, for each l=1,2,,Ll=1,2,\ldots,L. Putting q(l):=A(AA)1d(l)q^{(l)}:=A^{\top}(AA^{\top})^{-1}d^{(l)}, we get d(l)=Aq(l)d^{(l)}=Aq^{(l)}, from which we obtain

G(ξ)=l=1Lα(l)𝝈(A(P(l)ξ+q(l))).G(\xi)=\sum_{l=1}^{L}\alpha^{(l)}\odot\mbox{\boldmath$\sigma$}(A(P^{(l)}\xi+q^{(l)})).

Next, we define

x(l):=P(l)ξ+q(l),y(l):=i=1lα(i)𝝈(Ax(i)),x^{(l)}:=P^{(l)}\xi+q^{(l)},\quad y^{(l)}:=\sum_{i=1}^{l}\alpha^{(i)}\odot\mbox{\boldmath$\sigma$}(Ax^{(i)}),
β(l):=(P(l)P(l1))(P(l1))1,γ(l):=q(l)q(l1)β(l)q(l1),\beta^{(l)}:=(P^{(l)}-P^{(l-1)})(P^{(l-1)})^{-1},\quad\gamma^{(l)}:=q^{(l)}-q^{(l-1)}-\beta^{(l)}q^{(l-1)},

for all l=1,2,,Ll=1,2,\ldots,L, and set P(0):=InP^{(0)}:=I_{n}, q(0)=0q^{(0)}=0. Because P(l)P(l1)=β(l)P(l1)P^{(l)}-P^{(l-1)}=\beta^{(l)}P^{(l-1)} and q(l)q(l1)=β(l)q(l1)+γ(l)q^{(l)}-q^{(l-1)}=\beta^{(l)}q^{(l-1)}+\gamma^{(l)} hold true, then

x(l)x(l1)=(P(l)P(l1))ξ+(q(l)q(l1))=β(l)x(l1)+γ(l),x^{(l)}-x^{(l-1)}=(P^{(l)}-P^{(l-1)})\xi+(q^{(l)}-q^{(l-1)})=\beta^{(l)}x^{(l-1)}+\gamma^{(l)},
y(L)=l=1Lα(l)𝝈(A(P(l)ξ+q(l)))=G(ξ).y^{(L)}=\sum_{l=1}^{L}\alpha^{(l)}\odot\mbox{\boldmath$\sigma$}(A(P^{(l)}\xi+q^{(l)}))=G(\xi).

Hence, [ξy(L)]Sres(D)[\xi\mapsto y^{(L)}]\in S_{\mathrm{res}}(D). Therefore, given FC(D;m)F\in C(D;\mathbb{R}^{m}) and η>0\eta>0, there exists L,α(l)m,β(l)n×n,γ(l)n(l=1,2,,L)L\in\mathbb{N},\alpha^{(l)}\in\mathbb{R}^{m},\beta^{(l)}\in\mathbb{R}^{n\times n},\gamma^{(l)}\in\mathbb{R}^{n}~{}(l=1,2,\ldots,L) such that

|y(L)F(ξ)|<η,|y^{(L)}-F(\xi)|<\eta,

for any ξD\xi\in D. ∎

3 The gradient and learning algorithm

3.1 The gradient of the loss function with respect to the design parameter

We consider the $(\alpha,\beta,\gamma)$-type ODENet associated with the ODE system (2.3) and the approximation of $F\in C(D;\mathbb{R}^{m})$. Let $K\in\mathbb{N}$ be the number of training data and $\{(\xi^{(k)},F(\xi^{(k)}))\}_{k=1}^{K}\subset D\times\mathbb{R}^{m}$ be the training data. We divide the labels of the training data into the following disjoint sets:

\{1,2,\ldots,K\}=I_{1}\cup I_{2}\cup\cdots\cup I_{M}~(\mathrm{disjoint})\quad(1\leq M\leq K).

Let $x^{(k)}(t)$ and $y^{(k)}(t)$ be the solution to (2.3) with the initial value $\xi^{(k)}$. For each $\mu=1,2,\ldots,M$, let $\mbox{\boldmath$x$}=(x^{(k)})_{k\in I_{\mu}}$ and $\mbox{\boldmath$y$}=(y^{(k)})_{k\in I_{\mu}}$. We define the loss functions as follows:

e_{\mu}[\mbox{\boldmath$x$},\mbox{\boldmath$y$}]=\frac{1}{|I_{\mu}|}\sum_{k\in I_{\mu}}\left|y^{(k)}(T)-F(\xi^{(k)})\right|^{2}, (3.1)
E=\frac{1}{K}\sum_{k=1}^{K}\left|y^{(k)}(T)-F(\xi^{(k)})\right|^{2}. (3.2)

We consider the learning for each label set using the gradient method. Fix $\mu=1,2,\ldots,M$, and let $\lambda^{(k)}:[0,T]\to\mathbb{R}^{n}$ be the adjoint variable satisfying the following adjoint equation for each $k\in I_{\mu}$:

\left\{\begin{aligned} \frac{d}{dt}\lambda^{(k)}(t)&=-(\beta(t))^{\top}\lambda^{(k)}(t)-\frac{1}{|I_{\mu}|}A^{\top}\left(\left(y^{(k)}(T)-F(\xi^{(k)})\right)\odot\alpha(t)\odot\mbox{\boldmath$\sigma$}^{\prime}(Ax^{(k)}(t))\right),\\ \lambda^{(k)}(T)&=0.\end{aligned}\right. (3.3)

Then, the gradients $G[\alpha]^{(\mu)}\in C([0,T];\mathbb{R}^{m})$, $G[\beta]^{(\mu)}\in C([0,T];\mathbb{R}^{n\times n})$, and $G[\gamma]^{(\mu)}\in C([0,T];\mathbb{R}^{n})$ of the loss function (3.1) at $\alpha\in C([0,T];\mathbb{R}^{m})$, $\beta\in C([0,T];\mathbb{R}^{n\times n})$, and $\gamma\in C([0,T];\mathbb{R}^{n})$ with respect to $L^{2}(0,T;\mathbb{R}^{m})$, $L^{2}(0,T;\mathbb{R}^{n\times n})$, and $L^{2}(0,T;\mathbb{R}^{n})$ can be represented as

G[\alpha]^{(\mu)}(t)=\frac{1}{|I_{\mu}|}\sum_{k\in I_{\mu}}\left(y^{(k)}(T)-F(\xi^{(k)})\right)\odot\mbox{\boldmath$\sigma$}(Ax^{(k)}(t)),
G[\beta]^{(\mu)}(t)=\sum_{k\in I_{\mu}}\lambda^{(k)}(t)\left(x^{(k)}(t)\right)^{\top},\quad G[\gamma]^{(\mu)}(t)=\sum_{k\in I_{\mu}}\lambda^{(k)}(t),

respectively.

3.2 Learning algorithm

In this section, we describe the learning algorithm of the $(\alpha,\beta,\gamma)$-type ODENet associated with the ODE system (2.3). The initial value problems for the ordinary differential equations (2.3) and (3.3) are computed using the explicit Euler method. Let $h$ be the size of the time step and define $L:=\lfloor T/h\rfloor$. By discretizing the ordinary differential equations (2.3), we obtain

\left\{\begin{aligned} \frac{x_{l+1}^{(k)}-x_{l}^{(k)}}{h}&=\beta_{l}x_{l}^{(k)}+\gamma_{l},&l=0,1,\ldots,L-1,\\ \frac{y_{l+1}^{(k)}-y_{l}^{(k)}}{h}&=\alpha_{l}\odot\mbox{\boldmath$\sigma$}(Ax_{l}^{(k)}),&l=0,1,\ldots,L-1,\\ x_{0}^{(k)}&=\xi^{(k)},&\\ y_{0}^{(k)}&=0,&\end{aligned}\right.

for any $k\in I_{\mu}$. Furthermore, by discretizing the adjoint equation (3.3), we obtain

\left\{\begin{aligned} \frac{\lambda_{l}^{(k)}-\lambda_{l-1}^{(k)}}{h}&=-\beta_{l}^{\top}\lambda_{l}^{(k)}-\frac{1}{|I_{\mu}|}A^{\top}\left(\left(y_{L}^{(k)}-F(\xi^{(k)})\right)\odot\alpha_{l}\odot\mbox{\boldmath$\sigma$}^{\prime}(Ax_{l}^{(k)})\right),\\ \lambda_{L}^{(k)}&=0,\end{aligned}\right.

with $l=L,L-1,\ldots,1$ for any $k\in I_{\mu}$. Here we put

\alpha_{l}=\alpha(lh),\quad\beta_{l}=\beta(lh),\quad\gamma_{l}=\gamma(lh),

for all $l=0,1,\ldots,L$.
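The two discretized systems above can be sketched in Python as follows (a minimal sketch, not the authors' code; it treats a single sample $k$, assumes $\sigma=\tanh$ so that $\sigma^{\prime}(s)=1-\tanh^{2}(s)$, and uses illustrative function names).

import numpy as np

def forward_pass(xi, A, alphas, betas, gammas, h, sigma=np.tanh):
    # Return the trajectories (x_0..x_L, y_0..y_L) of the discretized system (2.3).
    L = len(alphas) - 1
    xs, ys = [np.asarray(xi, float)], [np.zeros(A.shape[0])]
    for l in range(L):
        xs.append(xs[l] + h * (betas[l] @ xs[l] + gammas[l]))
        ys.append(ys[l] + h * alphas[l] * sigma(A @ xs[l]))
    return xs, ys

def adjoint_pass(xs, yL, target, A, alphas, betas, h, batch_size=1,
                 dsigma=lambda s: 1.0 - np.tanh(s) ** 2):
    # Solve the discretized adjoint equation backwards; returns lambda_0..lambda_L.
    L = len(alphas) - 1
    n = xs[0].shape[0]
    lams = [np.zeros(n) for _ in range(L + 1)]          # lambda_L = 0
    for l in range(L, 0, -1):
        src = A.T @ ((yL - target) * alphas[l] * dsigma(A @ xs[l])) / batch_size
        lams[l - 1] = lams[l] + h * (betas[l].T @ lams[l] + src)
    return lams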

We perform the optimization of the loss function (3.2) using stochastic gradient descent (SGD). We show the learning algorithm in Algorithm 1.

Algorithm 1 Stochastic gradient descent method for $(\alpha,\beta,\gamma)$-type ODENet
1:  Choose $\eta>0$ and $\tau>0$
2:  Set $\nu=0$ and choose $\alpha_{(0)}\in\prod_{l=0}^{L}\mathbb{R}^{m}$, $\beta_{(0)}\in\prod_{l=0}^{L}\mathbb{R}^{n\times n}$, $\gamma_{(0)}\in\prod_{l=0}^{L}\mathbb{R}^{n}$ and (fixed) $A$
3:  repeat
4:     Divide the labels of the training data $\{(\xi^{(k)},F(\xi^{(k)}))\}_{k=1}^{K}$ into the following disjoint sets
\{1,2,\ldots,K\}=I_{1}\cup I_{2}\cup\cdots\cup I_{M}~(\mathrm{disjoint}),\quad(1\leq M\leq K)
5:     Set $\alpha^{(1)}=\alpha_{(\nu)}$, $\beta^{(1)}=\beta_{(\nu)}$ and $\gamma^{(1)}=\gamma_{(\nu)}$
6:     for $\mu=1,2,\ldots,M$ do
7:        Solve
\left\{\begin{aligned} \frac{x_{l+1}^{(k)}-x_{l}^{(k)}}{h}&=\beta_{l}x_{l}^{(k)}+\gamma_{l},&l=0,1,\ldots,L-1,\\ \frac{y_{l+1}^{(k)}-y_{l}^{(k)}}{h}&=\alpha_{l}\odot\mbox{\boldmath$\sigma$}(Ax_{l}^{(k)}),&l=0,1,\ldots,L-1,\\ x_{0}^{(k)}&=\xi^{(k)},&\\ y_{0}^{(k)}&=0,&\end{aligned}\right.
for any $k\in I_{\mu}$
8:        Solve
\left\{\begin{aligned} \frac{\lambda_{l}^{(k)}-\lambda_{l-1}^{(k)}}{h}&=-\beta_{l}^{\top}\lambda_{l}^{(k)}-\frac{1}{|I_{\mu}|}A^{\top}\left(\left(y_{L}^{(k)}-F(\xi^{(k)})\right)\odot\alpha_{l}\odot\mbox{\boldmath$\sigma$}^{\prime}(Ax_{l}^{(k)})\right),\\ \lambda_{L}^{(k)}&=0,\end{aligned}\right.
with $l=L,L-1,\ldots,1$ for any $k\in I_{\mu}$
9:        Compute the gradients
G[\alpha]_{l}^{(\mu)}=\frac{1}{|I_{\mu}|}\sum_{k\in I_{\mu}}\left(y_{L}^{(k)}-F(\xi^{(k)})\right)\odot\mbox{\boldmath$\sigma$}(Ax_{l}^{(k)}),
G[\beta]_{l}^{(\mu)}=\sum_{k\in I_{\mu}}\lambda_{l}^{(k)}(x_{l}^{(k)})^{\top},\quad G[\gamma]_{l}^{(\mu)}=\sum_{k\in I_{\mu}}\lambda_{l}^{(k)}
10:        Set
\alpha_{l}^{(\mu+1)}=\alpha_{l}^{(\mu)}-\tau G[\alpha]_{l}^{(\mu)},\quad\beta_{l}^{(\mu+1)}=\beta_{l}^{(\mu)}-\tau G[\beta]_{l}^{(\mu)},
\gamma_{l}^{(\mu+1)}=\gamma_{l}^{(\mu)}-\tau G[\gamma]_{l}^{(\mu)}
11:     end for
12:     Set $\alpha_{(\nu+1)}=(\alpha_{l}^{(M)})_{l=0}^{L}$, $\beta_{(\nu+1)}=(\beta_{l}^{(M)})_{l=0}^{L}$ and $\gamma_{(\nu+1)}=(\gamma_{l}^{(M)})_{l=0}^{L}$
13:     Shuffle the training data $\{(\xi^{(k)},F(\xi^{(k)}))\}_{k=1}^{K}$ randomly and set $\nu=\nu+1$
14:  until $\max(\|\alpha_{(\nu)}-\alpha_{(\nu-1)}\|,\|\beta_{(\nu)}-\beta_{(\nu-1)}\|,\|\gamma_{(\nu)}-\gamma_{(\nu-1)}\|)<\eta$
Remark.

In step 10 of Algorithm 1, we may instead use momentum SGD [19], in which the update is replaced by the following expressions:

\alpha_{l}^{(\mu+1)}:=\alpha_{l}^{(\mu)}-\tau G[\alpha]_{l}^{(\mu)}+\tau_{1}(\alpha_{l}^{(\mu)}-\alpha_{l}^{(\mu-1)})
\beta_{l}^{(\mu+1)}:=\beta_{l}^{(\mu)}-\tau G[\beta]_{l}^{(\mu)}+\tau_{1}(\beta_{l}^{(\mu)}-\beta_{l}^{(\mu-1)})
\gamma_{l}^{(\mu+1)}:=\gamma_{l}^{(\mu)}-\tau G[\gamma]_{l}^{(\mu)}+\tau_{1}(\gamma_{l}^{(\mu)}-\gamma_{l}^{(\mu-1)})

where $\tau$ is the learning rate and $\tau_{1}$ is the momentum rate.
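For illustration, one mini-batch update of Algorithm 1 (steps 7 to 10, plain SGD; the momentum variant of the Remark is indicated only in a comment) could be sketched as follows. This sketch reuses the hypothetical forward_pass and adjoint_pass functions from the sketch after the discretized equations above; it is an assumption, not the authors' implementation.

import numpy as np

def sgd_step(batch, A, alphas, betas, gammas, h, tau, sigma=np.tanh):
    # batch: list of (xi, F_xi) pairs; updates the design parameters in place.
    L = len(alphas) - 1
    m, n = A.shape
    g_alpha = [np.zeros(m) for _ in range(L + 1)]
    g_beta = [np.zeros((n, n)) for _ in range(L + 1)]
    g_gamma = [np.zeros(n) for _ in range(L + 1)]
    for xi, F_xi in batch:
        xs, ys = forward_pass(xi, A, alphas, betas, gammas, h, sigma)
        lams = adjoint_pass(xs, ys[-1], F_xi, A, alphas, betas, h, len(batch))
        for l in range(L + 1):
            g_alpha[l] += (ys[-1] - F_xi) * sigma(A @ xs[l]) / len(batch)
            g_beta[l] += np.outer(lams[l], xs[l])
            g_gamma[l] += lams[l]
    for l in range(L + 1):
        alphas[l] -= tau * g_alpha[l]   # momentum SGD would add tau_1 * (previous increment)
        betas[l] -= tau * g_beta[l]
        gammas[l] -= tau * g_gamma[l]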

4 Numerical results

4.1 Sinusoidal Curve

We performed a numerical experiment for the regression problem of a one-dimensional signal $F(\xi)=\sin 4\pi\xi$ defined on $\xi\in[0,1]$. Let the number of training data be $K_{1}=1000$, and let the training data be

\left\{\left(\frac{k-1}{K_{1}},F\left(\frac{k-1}{K_{1}}\right)\right)\right\}_{k=1}^{K_{1}}\subset[0,1]\times\mathbb{R},\quad D_{1}:=\left\{\frac{k-1}{K_{1}}\right\}_{k=1}^{K_{1}}.

We run Algorithm 1 until $\nu=10000$. We set the learning rate to $\tau=0.01$, $A$ to the matrix with ones on the diagonal and zeros elsewhere, and

\alpha_{(0)}\equiv 0,\quad\beta_{(0)}\equiv 0,\quad\gamma_{(0)}\equiv 0.

Let the number of validation data be $K_{2}=3333$. The signal sampled with $\Delta\xi=1/K_{2}$ was used as the validation data. Let $D_{2}$ be the set of input data used for the validation data. Fig. 1 shows the training data, which is $F(\xi)=\sin 4\pi\xi$ sampled from $[0,1]$ with $\Delta\xi=1/K_{1}$. Fig. 2 shows the result predicted using the validation data when $\nu=10000$; the validation data is shown as a blue line, and the prediction is shown as an orange line. Fig. 3 shows the initial values of the parameters $\alpha,\beta$, and $\gamma$. Fig. 4 shows the learning results of each design parameter at $\nu=10000$. Fig. 5 shows the change in the loss function during learning for each of the training data and validation data.
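The training data of this experiment could be generated as in the following sketch ($K_{1}=1000$ equally spaced samples of $F(\xi)=\sin 4\pi\xi$ on $[0,1]$; the variable names are illustrative).

import numpy as np

K1 = 1000
xi = np.arange(K1) / K1                   # xi_k = (k - 1)/K1, k = 1, ..., K1
targets = np.sin(4.0 * np.pi * xi)        # F(xi_k)
training_data = list(zip(xi.reshape(-1, 1), targets.reshape(-1, 1)))
print(training_data[0])                   # (array([0.]), array([0.]))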

Fig. 5 shows that the loss function can be decreased using Algorithm 1. Fig. 2 suggests that the prediction is good. In addition, the learned parameters $\alpha,\beta$, and $\gamma$ are continuous functions.

Fig. 1: The training data, which is $F(\xi)=\sin 4\pi\xi$ sampled from $[0,1]$ with $\Delta\xi=1/K_{1}$.
Fig. 2: The result predicted using validation data when $\nu=10000$.
Fig. 3: The initial values of the design parameters $\alpha,\beta$ and $\gamma$ at each $t=hl$.
Fig. 4: The learning results of the design parameters $\alpha,\beta$ and $\gamma$ at each $t=hl$ when $\nu=10000$.
Fig. 5: The change in the loss function during learning.

4.2 Binary classification

We performed numerical experiments on a binary classification problem with two-dimensional input. We set $n=2$ and $m=1$. Let the number of training data be $K_{1}=10000$, and let $D_{1}=\{\xi^{(k)}\,|\,k=1,2,\ldots,K_{1}\}\subset[0,1]^{2}$ be a set of randomly generated points. Let

\left\{\left(\xi^{(k)},F(\xi^{(k)})\right)\right\}_{k=1}^{K_{1}}\subset[0,1]^{2}\times\mathbb{R},
F(\xi)=\left\{\begin{array}[]{ll}0,&\mathrm{if}~|\xi-(0.5,0.5)|<0.3,\\ 1,&\mathrm{if}~|\xi-(0.5,0.5)|\geq 0.3,\end{array}\right. (4.1)

be the training data. We run Algorithm 1 until $\nu=10000$. We set the learning rate to $\tau=0.01$, $A$ to the matrix with ones on the diagonal and zeros elsewhere, and

\alpha_{(0)}\equiv 0,\quad\beta_{(0)}\equiv 0,\quad\gamma_{(0)}\equiv 0.

Let the number of validation data be $K_{2}=2500$. A set of points $\xi$ randomly generated on $[0,1]^{2}$ together with $F(\xi)$ is used as the validation data. Fig. 6 shows the training data, in which the randomly generated $\xi\in D_{1}$ are classified according to (4.1). Fig. 7 shows the prediction result on the validation data at $\nu=10000$; correctly predicted points are shown in dark red and dark blue, and incorrectly predicted points are shown in light red and light blue. Fig. 8 shows the result of predicting the validation data using the $k$-nearest neighbor algorithm with $k=3$. Fig. 9 shows the result of predicting the validation data using a multi-layer perceptron with $5000$ nodes. Fig. 10 shows the initial values of the parameters $\alpha,\beta$, and $\gamma$. Figs. 11, 12, and 13 show the learning results of each parameter at $\nu=10000$. Fig. 14 shows the change of the loss function during learning for each of the training data and validation data. Fig. 15 shows the change of accuracy during learning. The accuracy is defined as

\mathrm{Accuracy}=\frac{\#\{\xi\in D_{i}\,|\,F(\xi)=\bar{y}(\xi)\}}{K_{i}},\quad(i=1,2),
\bar{y}(\xi):=\left\{\begin{array}[]{ll}0,&\mathrm{if}~y(T;\xi)<0.5,\\ 1,&\mathrm{if}~y(T;\xi)\geq 0.5.\end{array}\right.
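The thresholding and accuracy computation above can be sketched as follows (a hedged sketch; y_pred stands for the network outputs $y(T;\xi)$ on a data set and labels for the corresponding values $F(\xi)$).

import numpy as np

def accuracy(y_pred, labels, threshold=0.5):
    # Fraction of samples whose thresholded output matches the true label in {0, 1}.
    y_bar = (np.asarray(y_pred) >= threshold).astype(int)
    return float(np.mean(y_bar == np.asarray(labels)))

print(accuracy([0.1, 0.7, 0.45, 0.9], [0, 1, 1, 1]))   # 0.75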

Table 4 shows the value of the loss function and the accuracy of the prediction for each method.

From Figs. 14 and 15, we observe that the loss function can be decreased and the accuracy can be increased using Algorithm 1. Fig. 7 shows that some points in the neighborhood of $|\xi-(0.5,0.5)|=0.3$ are wrongly predicted; however, most points are predicted well. The results are similar when compared with Figs. 8 and 9. In addition, the learned parameters $\alpha,\beta$, and $\gamma$ are continuous functions. From Table 4, the $k$-nearest neighbor algorithm attains the smallest value of the loss function among the three methods. We consider that this is because the output of the ODENet is $y(T;\xi)\in[0,1]$, while the output of the $k$-nearest neighbor algorithm is in $\{0,1\}$. Compared to K-NN and MLP, the $(\alpha,\beta,\gamma)$-type ODENet gives a slightly worse result (Table 4). However, binary classification is not a suitable problem for testing the potential of ODENet, because the output values of ODENet are continuous. It should also be noted that the proposed model is not heavily tuned to increase accuracy. Considering these facts, this result also shows the potential of ODENet.

Fig. 6: The training data defined by (4.1).
Fig. 7: The result predicted using validation data when $\nu=10000$.
Fig. 8: The result of predicting the validation data using the $k$-nearest neighbor algorithm with $k=3$.
Fig. 9: The result of predicting the validation data using a multi-layer perceptron with 5000 nodes.
Fig. 10: The initial values of the design parameters $\alpha,\beta$ and $\gamma$ at each $t=hl$.
Fig. 11: The learning result of the design parameter $\alpha$ at each $t=hl$ when $\nu=10000$.
Fig. 12: The learning result of the design parameter $\beta$ at each $t=hl$ when $\nu=10000$.
Fig. 13: The learning result of the design parameter $\gamma$ at each $t=hl$ when $\nu=10000$.
Fig. 14: The change of the loss function during learning.
Fig. 15: The change of accuracy during learning.
Table 4: The prediction result of each method.
Method Loss Accuracy
This paper ($(\alpha,\beta,\gamma)$-type ODENet) 0.02629 0.9592
$K$-nearest neighbor algorithm (K-NN) 0.006000 0.9879
Multilayer perceptron (MLP) 0.006273 0.9883

4.3 Multinomial classification in MNIST

We performed a numerical experiment on a classification problem using MNIST, a dataset of handwritten digits. The input is a $28\times 28$ image and the output is a one-hot vector of the labels attached to the MNIST dataset. We set $n=784$ and $m=10$. Let the number of training data be $K_{1}=43200$ and let the batch size be $|I_{\mu}|=128$. We run Algorithm 1 until $\nu=1000$, using momentum SGD to update the design parameters. We set the learning rate to $\tau=0.01$, the momentum rate to $0.9$, $A$ to the matrix with ones on the diagonal and zeros elsewhere, and

\alpha_{(0)}\equiv 10^{-8},\quad\beta_{(0)}\equiv 10^{-8},\quad\gamma_{(0)}\equiv 10^{-8}.

Let the number of validation data be $K_{2}=10800$. Fig. 16 shows the change of the loss function during learning for each of the training data and validation data. Fig. 17 shows the change of accuracy during learning. Using the test data, the values of the loss function and accuracy at $\nu=1000$ are

E=0.06432,\quad\mathrm{Accuracy}=0.9521,

respectively.

Figs. 16 and 17 suggest that the loss function can be decreased and the accuracy can be increased using Algorithm 1 (with momentum SGD).

Fig. 16: The change of the loss function during learning.
Fig. 17: The change of accuracy during learning.

5 Conclusion

In this paper, we proposed the $(\alpha,\beta,\gamma)$-type ODENet and the $(\alpha,\beta,\gamma)$-type ResNet and showed that they uniformly approximate an arbitrary continuous function on a compact set. This result shows that the $(\alpha,\beta,\gamma)$-type ODENet and the $(\alpha,\beta,\gamma)$-type ResNet can represent a variety of data. In addition, we showed the existence and continuity of the gradient of the loss function for a general ODENet. We performed numerical experiments on several data sets and confirmed that the gradient method reduces the loss function and represents the data.

Our future work is to show that the design parameters converge to a global minimizer of the loss function using a continuous gradient. We also plan to show that ODENet with other forms, such as convolution, can represent arbitrary data.

6 Acknowledgement

This work is partially supported by JSPS KAKENHI Grant Number JP20KK0058 and JST CREST Grant Number JPMJCR2014, Japan.

References

  • [1] J.-G. Attali and G. Pagès. Approximations of functions by a multilayer perceptron: a new approach. Neural Networks, 10(6):1069–1081, 1997.
  • [2] A. Baker. Matrix groups: An Introduction to Lie Group Theory. Springer Science & Business Media, 2003.
  • [3] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
  • [4] L. Bottou. Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press, 1998.
  • [5] S.M. Carroll and B.W. Dickinson. Construction of neural nets using the radon transform. In International 1989 Joint Conference on Neural Networks, volume 1, pages 607–611. IEEE, 1989.
  • [6] R.T.Q. Chen, Y. Rubanova, J. Bettencourt, and D.K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, volume 31, pages 6571–6583, 2018.
  • [7] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
  • [8] K. Fukushima and S. Miyake. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets, pages 267–285. Springer, 1982.
  • [9] K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192, 1989.
  • [10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, volume 9, pages 249–256. PMLR, 2010.
  • [11] B. Hanin and M. Sellke. Approximating continuous functions by ReLU nets of minimal width. arXiv preprint arXiv:1710.11278, 2017.
  • [12] K. He and J. Sun. Convolutional neural networks at constrained time cost. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5353–5360. IEEE, 2015.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE, 2016.
  • [14] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
  • [15] P. Kidger and T. Lyons. Universal Approximation with Deep Narrow Networks. In Proceedings of 33rd Conference on Learning Theory, pages 2306–2327. PMLR, 2020.
  • [16] M. Leshno, V.Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.
  • [17] H. Lin and S. Jegelka. ResNet with one-neuron hidden layers is a universal approximator. In Advances in Neural Information Processing Systems, volume 31, pages 6169–6178. Curran Associates, Inc., 2018.
  • [18] W.S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5:115–133, 1943.
  • [19] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
  • [20] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
  • [21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
  • [22] S. Sonoda and N. Murata. Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis, 43(2):233–268, 2017.
  • [23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.

Appendix A Differentiability with respect to parameters of ODE

We discuss the differentiability of solutions of ordinary differential equations with respect to the design parameters.

Theorem A.1.

Let NN and rr be natural numbers, and TT be a positive real number. We define X:=C1([0,T];N)X:=C^{1}([0,T];\mathbb{R}^{N}) and Ω:=C([0,T];r)\Omega:=C([0,T];\mathbb{R}^{r}). We consider the initial value problem for the ordinary differential equation:

{x(t)=f(t,x(t),ω(t)),t(0,T],x(0)=ξ,\left\{\begin{aligned} x^{\prime}(t)&=f(t,x(t),\omega(t)),&t\in(0,T],\\ x(0)&=\xi,&\end{aligned}\right. (A.1)

where xx is a function from [0,T][0,T] to N\mathbb{R}^{N}; ξN\xi\in\mathbb{R}^{N} is the initial value; ωΩ\omega\in\Omega is the design parameter; ff is a continuously differentiable function from [0,T]×N×r[0,T]\times\mathbb{R}^{N}\times\mathbb{R}^{r} to N\mathbb{R}^{N}; and there exists L>0L>0 such that

|f(t,x1,ω(t))f(t,x2,ω(t))|L|x1x2||f(t,x_{1},\omega(t))-f(t,x_{2},\omega(t))|\leq L|x_{1}-x_{2}|

for any t[0,T],x1,x2Nt\in[0,T],x_{1},x_{2}\in\mathbb{R}^{N}, and ωΩ\omega\in\Omega. Then, the solution to (A.1) satisfies [Ωωx[ω]]C1(Ω;X)[\Omega\ni\omega\mapsto x[\omega]]\in C^{1}(\Omega;X). Furthermore, if we define

y(t):=(ωx[ω]η)(t)y(t):=(\partial_{\omega}x[\omega]\eta)(t)

for any ηΩ\eta\in\Omega, the following relations

{y(t)xf(t,x[ω](t),ω(t))y(t)=ωf(t,x[ω](t),ω(t))η(t),t(0,T],y(0)=0,\left\{\begin{array}[]{ll}y^{\prime}(t)-\nabla_{x}^{\top}f(t,x[\omega](t),\omega(t))y(t)=\nabla_{\omega}^{\top}f(t,x[\omega](t),\omega(t))\eta(t),&t\in(0,T],\\ y(0)=0,&\end{array}\right.

are satisfied.

Proof.

Let X0X_{0} be the set of continuous functions from [0,T][0,T] to N\mathbb{R}^{N}. Because f(t,,ω(t))f(t,\cdot,\omega(t)) is Lipschitz continuous for any t[0,T]t\in[0,T] and ωΩ\omega\in\Omega, there exists a unique solution x[ω]Xx[\omega]\in X to (A.1). We define the map J:X×ΩXJ:X\times\Omega\to X as

J(x,ω)(t):=x(t)ξ0tf(s,x(s),ω(s))𝑑s.J(x,\omega)(t):=x(t)-\xi-\int_{0}^{t}f(s,x(s),\omega(s))ds.

The map JJ satisfies

J(x,ω)(t)=x(t)f(t,x(t),ω(t)).J(x,\omega)^{\prime}(t)=x^{\prime}(t)-f(t,x(t),\omega(t)).

Since fC1([0,T]×N×r;N)f\in C^{1}([0,T]\times\mathbb{R}^{N}\times\mathbb{R}^{r};\mathbb{R}^{N}), JC(X×Ω;X)J\in C(X\times\Omega;X).

Take an arbitrary ωΩ\omega\in\Omega. For any xXx\in X, let

fx(t):=f(t,x(t),ω(t)),xfx(t):=xf(t,x(t),ω(t)).f\circ x(t):=f(t,x(t),\omega(t)),\quad\nabla_{x}^{\top}f\circ x(t):=\nabla_{x}^{\top}f(t,x(t),\omega(t)).

We define the map A(x):XXA(x):X\to X as

(A(x)y)(t):=y(t)0t(xfx(s))y(s)𝑑s.(A(x)y)(t):=y(t)-\int_{0}^{t}(\nabla_{x}^{\top}f\circ x(s))y(s)ds.

The map A(x)A(x) satisfies

(A(x)y)(t)=y(t)(xfx(t))y(t).(A(x)y)^{\prime}(t)=y^{\prime}(t)-(\nabla_{x}^{\top}f\circ x(t))y(t).

xx and ω\omega are bounded because they are continuous functions on a compact interval. Because fC1([0,T]×N×r;N)f\in C^{1}([0,T]\times\mathbb{R}^{N}\times\mathbb{R}^{r};\mathbb{R}^{N}), there exists C>0C>0 such that |xfx(t)|C|\nabla_{x}^{\top}f\circ x(t)|\leq C for any t[0,T]t\in[0,T]. From

|(A(x)y)(t)|yX0+CTyX0,|(A(x)y)(t)|yX0+CyX0,|(A(x)y)(t)|\leq\|y\|_{X_{0}}+CT\|y\|_{X_{0}},\quad|(A(x)y)^{\prime}(t)|\leq\|y^{\prime}\|_{X_{0}}+C\|y\|_{X_{0}},
A(x)yXyX+C(T+1)yX=(1+C(T+1))yX\|A(x)y\|_{X}\leq\|y\|_{X}+C(T+1)\|y\|_{X}=(1+C(T+1))\|y\|_{X}

is satisfied. Hence, A(x)B(X,X)A(x)\in B(X,X). Let us fix x0Xx_{0}\in X and let xXx\in X tend to x0x_{0}. Then

|(A(x)y)(t)(A(x0)y)(t)|\displaystyle|(A(x)y)(t)-(A(x_{0})y)(t)| 0t|xfx(s)xfx0(s)||y(s)|𝑑s\displaystyle\leq\int_{0}^{t}|\nabla_{x}^{\top}f\circ x(s)-\nabla_{x}^{\top}f\circ x_{0}(s)||y(s)|ds
TyXxfxxfx0C([0,T];N×N)\displaystyle\leq T\|y\|_{X}\|\nabla_{x}^{\top}f\circ x-\nabla_{x}^{\top}f\circ x_{0}\|_{C([0,T];\mathbb{R}^{N\times N})}
|(A(x)y)(t)(A(x0)y)(t)|\displaystyle|(A(x)y)^{\prime}(t)-(A(x_{0})y)^{\prime}(t)| |xfx(t)xfx0(t)||y(t)|\displaystyle\leq|\nabla_{x}^{\top}f\circ x(t)-\nabla_{x}^{\top}f\circ x_{0}(t)||y(t)|
yXxfxxfx0C([0,T];N×N)\displaystyle\leq\|y\|_{X}\|\nabla_{x}^{\top}f\circ x-\nabla_{x}^{\top}f\circ x_{0}\|_{C([0,T];\mathbb{R}^{N\times N})}
A(x)yA(x0)yX(T+1)yXxfxxfx0C([0,T];N×N)\|A(x)y-A(x_{0})y\|_{X}\leq(T+1)\|y\|_{X}\|\nabla_{x}^{\top}f\circ x-\nabla_{x}^{\top}f\circ x_{0}\|_{C([0,T];\mathbb{R}^{N\times N})}
A(x)A(x0)B(X,X)(T+1)xfxxfx0C([0,T];N×N)\|A(x)-A(x_{0})\|_{B(X,X)}\leq(T+1)\|\nabla_{x}^{\top}f\circ x-\nabla_{x}^{\top}f\circ x_{0}\|_{C([0,T];\mathbb{R}^{N\times N})}

Hence, AC(X;B(X,X))A\in C(X;B(X,X)).

J(x+y,ω)(t)J(x,ω)(t)(A(x)y)(t)\displaystyle J(x+y,\omega)(t)-J(x,\omega)(t)-(A(x)y)(t)
=0t(f(x+y)(s)fx(s)(xfx(s))y(s))𝑑s\displaystyle=-\int_{0}^{t}(f\circ(x+y)(s)-f\circ x(s)-(\nabla_{x}^{\top}f\circ x(s))y(s))ds
J(x+y,ω)J(x,ω)A(x)yX(T+1)f(x+y)fx(xfx)yX0\|J(x+y,\omega)-J(x,\omega)-A(x)y\|_{X}\leq(T+1)\|f\circ(x+y)-f\circ x-(\nabla_{x}^{\top}f\circ x)y\|_{X_{0}}

From the Taylor expansion of ff, we obtain

f(t,x(t)+y(t),ω(t))=f(t,x(t),ω(t))+01xf(t,x(t)+ζy(t),ω(t))y(t)𝑑ζf(t,x(t)+y(t),\omega(t))=f(t,x(t),\omega(t))+\int_{0}^{1}\nabla_{x}^{\top}f(t,x(t)+\zeta y(t),\omega(t))y(t)d\zeta

for any t[0,T],x,yXt\in[0,T],x,y\in X and ωΩ\omega\in\Omega. We obtain

|f(x+y)(t)fx(t)(xfx(t))y(t)|01|xf(x+ζy)(t)xfx(t)||y(t)|𝑑ζ.|f\circ(x+y)(t)-f\circ x(t)-(\nabla_{x}^{\top}f\circ x(t))y(t)|\leq\int_{0}^{1}|\nabla_{x}^{\top}f\circ(x+\zeta y)(t)-\nabla_{x}^{\top}f\circ x(t)||y(t)|d\zeta.

Since xf\nabla_{x}^{\top}f is uniformly continuous on compact sets, for any ϵ>0\epsilon>0, there exists δ>0\delta>0 such that

yX0<δ,ζ[0,1]|xf(x+ζy)(t)xfx(t)|<ϵ.\|y\|_{X_{0}}<\delta,\zeta\in[0,1]~{}\Rightarrow~{}|\nabla_{x}^{\top}f\circ(x+\zeta y)(t)-\nabla_{x}^{\top}f\circ x(t)|<\epsilon.

We obtain

|f(x+y)(t)fx(t)(xfx(t))y(t)|ϵyX0,|f\circ(x+y)(t)-f\circ x(t)-(\nabla_{x}^{\top}f\circ x(t))y(t)|\leq\epsilon\|y\|_{X_{0}},
J(x+y,ω)J(x,ω)A(x)yXϵ(T+1)yX.\|J(x+y,\omega)-J(x,\omega)-A(x)y\|_{X}\leq\epsilon(T+1)\|y\|_{X}.

Hence,

xJ(x,ω)y=A(x)y.\partial_{x}J(x,\omega)y=A(x)y.

From xJ(,ω)C(X;B(X,X))\partial_{x}J(\cdot,\omega)\in C(X;B(X,X)), J(,ω)C1(X;X)J(\cdot,\omega)\in C^{1}(X;X).

Fix ω0Ω\omega_{0}\in\Omega. Then there exists a solution x0Xx_{0}\in X of (A.1) such that

x0(t)=ξ+0tf(s,x0(s),ω0(s))𝑑s.x_{0}(t)=\xi+\int_{0}^{t}f(s,x_{0}(s),\omega_{0}(s))ds.

That is,

J(x0,ω0)(t)=x0(t)ξ0tf(s,x0(s),ω0(s))𝑑s=x0(t)x0(t)=0J(x_{0},\omega_{0})(t)=x_{0}(t)-\xi-\int_{0}^{t}f(s,x_{0}(s),\omega_{0}(s))ds=x_{0}(t)-x_{0}(t)=0

is satisfied. For any gXg\in X, if yXy\in X satisfies (xJ(x0,ω0)y)(t)=g(t)(\partial_{x}J(x_{0},\omega_{0})y)(t)=g(t), then

{y(t)xf(t,x0(t),ω0(t))y(t)=g(t),t(0,T],y(0)=g(0).\left\{\begin{array}[]{ll}y^{\prime}(t)-\nabla_{x}^{\top}f(t,x_{0}(t),\omega_{0}(t))y(t)=g^{\prime}(t),&t\in(0,T],\\ y(0)=g(0).&\end{array}\right.

Because the solution to this ordinary differential equation exists uniquely, there exists an inverse map (xJ(x0,ω0))1(\partial_{x}J(x_{0},\omega_{0}))^{-1} such that (xJ(x0,ω0))1B(X,X)(\partial_{x}J(x_{0},\omega_{0}))^{-1}\in B(X,X).

From the implicit function theorem, for any ωΩ\omega\in\Omega, there exists x[ω]Xx[\omega]\in X such that J(x[ω],ω)=0J(x[\omega],\omega)=0. From JC1(X×Ω;X)J\in C^{1}(X\times\Omega;X), we obtain [ωx[ω]]C1(Ω;X)[\omega\mapsto x[\omega]]\in C^{1}(\Omega;X). We put

y(t):=(ωx[ω]η)(t)y(t):=(\partial_{\omega}x[\omega]\eta)(t)

for any ηΩ\eta\in\Omega. From J(x[ω],ω)=0J(x[\omega],\omega)=0,

(xJ(x[ω],ω)y)(t)+(ωJ(x[ω],ω)η)(t)=0,(\partial_{x}J(x[\omega],\omega)y)(t)+(\partial_{\omega}J(x[\omega],\omega)\eta)(t)=0,
y(t)0txf(s,x[ω](s),ω(s))y(s)𝑑s0tωf(s,x[ω](s),ω(s))η(s)𝑑s=0.y(t)-\int_{0}^{t}\nabla_{x}^{\top}f(s,x[\omega](s),\omega(s))y(s)ds-\int_{0}^{t}\nabla_{\omega}^{\top}f(s,x[\omega](s),\omega(s))\eta(s)ds=0.

Therefore, we obtain

{y(t)xf(t,x[ω](t),ω(t))y(t)=ωf(t,x[ω](t),ω(t))η(t),t(0,T],y(0)=0.\left\{\begin{array}[]{ll}y^{\prime}(t)-\nabla_{x}^{\top}f(t,x[\omega](t),\omega(t))y(t)=\nabla_{\omega}^{\top}f(t,x[\omega](t),\omega(t))\eta(t),&t\in(0,T],\\ y(0)=0.&\end{array}\right.
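
The sensitivity equation above can be checked numerically. The following Python sketch compares, for a simple scalar example, the solution y of the variational equation with a finite-difference derivative of x[ω]; the concrete choices of f, ω, and η are assumptions made only for illustration.

import numpy as np
from scipy.integrate import solve_ivp

# Scalar illustration of Theorem A.1 (not part of the proof): f(t, x, w) = w * tanh(x).
T, xi = 1.0, 0.5
omega = lambda t: 1.0 + 0.5 * t          # design parameter omega(t) (assumed)
eta = lambda t: np.sin(np.pi * t)        # perturbation direction eta(t) (assumed)
f = lambda t, x, w: w * np.tanh(x)
df_dx = lambda t, x, w: w / np.cosh(x) ** 2
df_dw = lambda t, x, w: np.tanh(x)

def solve_x(w):
    # Solve x' = f(t, x, w(t)), x(0) = xi on [0, T].
    return solve_ivp(lambda t, x: f(t, x, w(t)), (0.0, T), [xi],
                     dense_output=True, rtol=1e-10, atol=1e-12)

sol_x = solve_x(omega)

def rhs_y(t, y):
    # Variational equation: y' = (df/dx) y + (df/dw) eta, y(0) = 0.
    x = sol_x.sol(t)[0]
    return df_dx(t, x, omega(t)) * y + df_dw(t, x, omega(t)) * eta(t)

y_T = solve_ivp(rhs_y, (0.0, T), [0.0], rtol=1e-10, atol=1e-12).y[0, -1]

eps = 1e-6
x_T_plus = solve_x(lambda t: omega(t) + eps * eta(t)).y[0, -1]
fd = (x_T_plus - sol_x.y[0, -1]) / eps
print(y_T, fd)   # the two values should agree up to O(eps)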

Appendix B General ODENet

In this section, we describe the general ODENet and the existence and continuity of the gradient of a loss function with respect to the design parameter. Let NN and rr be natural numbers and TT be a positive real number. Let the input data DnD\subset\mathbb{R}^{n} be a compact set. We define X:=C1([0,T];N)X:=C^{1}([0,T];\mathbb{R}^{N}) and Ω:=C([0,T];r)\Omega:=C([0,T];\mathbb{R}^{r}). We consider the ODENet with the following system of ordinary differential equations.

{x(t)=f(t,x(t),ω(t)),t(0,T],x(0)=Qξ,\left\{\begin{aligned} x^{\prime}(t)&=f(t,x(t),\omega(t)),&t\in(0,T],\\ x(0)&=Q\xi,&\end{aligned}\right. (B.1)

where xx is a function from [0,T][0,T] to N\mathbb{R}^{N}; ξD\xi\in D is the input data; Px(T)Px(T) is the final output; ωΩ\omega\in\Omega is the design parameter; PP and QQ are m×Nm\times N and N×nN\times n real matrices; ff is a continuously differentiable function from [0,T]×N×r[0,T]\times\mathbb{R}^{N}\times\mathbb{R}^{r} to N\mathbb{R}^{N}, and f(t,,ω(t))f(t,\cdot,\omega(t)) is Lipschitz continuous for any t[0,T]t\in[0,T] and ωΩ\omega\in\Omega. For an input data ξD\xi\in D, we denote the output data as Px(T;ξ)Px(T;\xi). We consider an approximation of FC(D;m)F\in C(D;\mathbb{R}^{m}) using ODENet with a system of ordinary differential equations (B.1). We define the loss function as

e[x]=12|Px(T;ξ)F(ξ)|2.e[x]=\frac{1}{2}\left|Px(T;\xi)-F(\xi)\right|^{2}.

We define the gradient of the loss function with respect to the design parameter as follows:

Definition B.1.

Let Ω\Omega be a real Banach space and assume that an inner product ,\left<\cdot,\cdot\right> is defined on Ω\Omega. Suppose that the functional Φ:Ω\Phi:\Omega\to\mathbb{R} is Fréchet differentiable at ωΩ\omega\in\Omega, and denote its Fréchet derivative by Φ[ω]Ω\partial\Phi[\omega]\in\Omega^{*}. If there exists G[ω]ΩG[\omega]\in\Omega such that

Φ[ω]η=G[ω],η,\partial\Phi[\omega]\eta=\left<G[\omega],\eta\right>,

for any ηΩ\eta\in\Omega, we call G[ω]G[\omega] the gradient of Φ\Phi at ωΩ\omega\in\Omega with respect to the inner product ,\left<\cdot,\cdot\right>.

Remark.

If there exists a gradient G[ω]G[\omega] of the functional Φ\Phi at ωΩ\omega\in\Omega with respect to the inner product ,\left<\cdot,\cdot\right>, the algorithm to find the minimum value of Φ\Phi by

ω(ν)=ω(ν1)τG[ω(ν1)]\omega_{(\nu)}=\omega_{(\nu-1)}-\tau G[\omega_{(\nu-1)}]

is called the steepest descent method.

Theorem B.2.

Given the design parameter ωΩ\omega\in\Omega, let x[ω](t;ξ)x[\omega](t;\xi) be the solution to (B.1) with the input data ξD\xi\in D. Let λ:[0,T]N\lambda:[0,T]\to\mathbb{R}^{N} be the adjoint state satisfying the following adjoint equation:

{λ(t)=xf(t,x[ω](t;ξ),ω(t))λ(t),t[0,T),λ(T)=P(Px[ω](T;ξ)F(ξ)).\left\{\begin{aligned} \lambda^{\prime}(t)&=-\nabla_{x}f^{\top}\left(t,x[\omega](t;\xi),\omega(t)\right)\lambda(t),&t\in[0,T),\\ \lambda(T)&=P^{\top}\left(Px[\omega](T;\xi)-F(\xi)\right).&\end{aligned}\right.

We define the functional Φ:Ω\Phi:\Omega\to\mathbb{R} as Φ[ω]=e[x[ω]]\Phi[\omega]=e[x[\omega]]. Then, there exists a gradient G[ω]ΩG[\omega]\in\Omega of Φ\Phi at ωΩ\omega\in\Omega with respect to the L2(0,T;r)L^{2}(0,T;\mathbb{R}^{r}) inner product such that

Φ[ω]η=0TG[ω](t)η(t)𝑑t,G[ω](t)=ωf(t,x[ω](t;ξ),ω(t))λ(t),\partial\Phi[\omega]\eta=\int_{0}^{T}G[\omega](t)\cdot\eta(t)dt,\quad G[\omega](t)=\nabla_{\omega}f^{\top}\left(t,x[\omega](t;\xi),\omega(t)\right)\lambda(t),

for any ηΩ\eta\in\Omega.

Proof.

ee is a continuously differentiable function from XX to \mathbb{R}, and the solution of (B.1) satisfies [ωx[ω]]C1(Ω;X)[\omega\mapsto x[\omega]]\in C^{1}(\Omega;X) by Theorem A.1. Hence, ΦC1(Ω)\Phi\in C^{1}(\Omega). For any ηΩ\eta\in\Omega,

Φ[ω]η\displaystyle\partial\Phi[\omega]\eta =(Px[ω](T;ξ)F(ξ))P(ωx[ω]η)(T),\displaystyle=(Px[\omega](T;\xi)-F(\xi))\cdot P(\partial_{\omega}x[\omega]\eta)(T),
=P(Px[ω](T;ξ)F(ξ))(ωx[ω]η)(T).\displaystyle=P^{\top}(Px[\omega](T;\xi)-F(\xi))\cdot(\partial_{\omega}x[\omega]\eta)(T).

We put y(t):=(ωx[ω]η)(t)y(t):=(\partial_{\omega}x[\omega]\eta)(t). From Theorem A.1, we obtain

{y(t)xf(t,x[ω](t,ξ),ω(t))y(t)=ωf(t,x[ω](t;ξ),ω(t))η(t),t(0,T],y(0)=0.\left\{\begin{array}[]{ll}y^{\prime}(t)-\nabla_{x}^{\top}f\left(t,x[\omega](t,\xi),\omega(t)\right)y(t)=\nabla_{\omega}^{\top}f\left(t,x[\omega](t;\xi),\omega(t)\right)\eta(t),&t\in(0,T],\\ y(0)=0.\end{array}\right.

By assumption, the adjoint equation

{λ(t)=xf(t,x[ω](t;ξ),ω(t))λ(t),t[0,T),λ(T)=P(Px[ω](T;ξ)F(ξ)).\left\{\begin{aligned} \lambda^{\prime}(t)&=-\nabla_{x}f^{\top}\left(t,x[\omega](t;\xi),\omega(t)\right)\lambda(t),&t\in[0,T),\\ \lambda(T)&=P^{\top}\left(Px[\omega](T;\xi)-F(\xi)\right).&\end{aligned}\right.

is satisfied. We define

g(t):=ωf(t,x[ω](t;ξ),ω(t))λ(t).g(t):=\nabla_{\omega}f^{\top}\left(t,x[\omega](t;\xi),\omega(t)\right)\lambda(t).

Then, gΩg\in\Omega is satisfied. We calculate the L2(0,T;r)L^{2}(0,T;\mathbb{R}^{r}) inner product of gg and η\eta,

g,η\displaystyle\left<g,\eta\right> =0T(ωf(t,x[ω](t;ξ),ω(t))λ(t))η(t)𝑑t,\displaystyle=\int_{0}^{T}(\nabla_{\omega}f^{\top}(t,x[\omega](t;\xi),\omega(t))\lambda(t))\cdot\eta(t)dt,
=0Tλ(t)(ωf(t,x[ω](t;ξ),ω(t))η(t))𝑑t,\displaystyle=\int_{0}^{T}\lambda(t)\cdot(\nabla_{\omega}^{\top}f(t,x[\omega](t;\xi),\omega(t))\eta(t))dt,
=0Tλ(t)(y(t)xf(t,x[ω](t;ξ),ω(t))y(t))𝑑t,\displaystyle=\int_{0}^{T}\lambda(t)\cdot(y^{\prime}(t)-\nabla_{x}^{\top}f(t,x[\omega](t;\xi),\omega(t))y(t))dt,
=λ(T)y(T)λ(0)y(0)0T(λ(t)+xf(t,x[ω](t;ξ),ω(t))λ(t))y(t)𝑑t,\displaystyle=\lambda(T)\cdot y(T)-\lambda(0)\cdot y(0)-\int_{0}^{T}(\lambda^{\prime}(t)+\nabla_{x}f^{\top}(t,x[\omega](t;\xi),\omega(t))\lambda(t))\cdot y(t)dt,
=P(Px[ω](T;ξ)F(ξ))y(T),\displaystyle=P^{\top}(Px[\omega](T;\xi)-F(\xi))\cdot y(T),
=Φ[ω]η.\displaystyle=\partial\Phi[\omega]\eta.

Therefore, there exists a gradient G[ω]ΩG[\omega]\in\Omega of Φ\Phi at ωΩ\omega\in\Omega with respect to the L2(0,T;r)L^{2}(0,T;\mathbb{R}^{r}) inner product such that

G[ω](t)=ωf(t,x[ω](t;ξ),ω(t))λ(t).G[\omega](t)=\nabla_{\omega}f^{\top}\left(t,x[\omega](t;\xi),\omega(t)\right)\lambda(t).
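
As an illustration of how Theorem B.2 could be used in practice, the following Python sketch evaluates the gradient G[ω] on a uniform time grid by solving the forward problem (B.1) and the adjoint equation with scipy. The callables f, grad_x_f, grad_w_f (the Jacobians of f with respect to x and ω), the matrices P, Q, and the target value F(ξ) are assumed to be supplied by the user; the piecewise-linear representation of ω and all names are assumptions of the sketch.

import numpy as np
from scipy.integrate import solve_ivp

def loss_gradient(f, grad_x_f, grad_w_f, P, Q, xi, F_xi, t_grid, omega_grid):
    # t_grid: increasing grid on [0, T] with t_grid[0] = 0;
    # omega_grid: array of shape (len(t_grid), r) holding omega on the grid.
    T = t_grid[-1]
    w = lambda t: np.array([np.interp(t, t_grid, omega_grid[:, j])
                            for j in range(omega_grid.shape[1])])
    # Forward problem (B.1): x' = f(t, x, omega(t)), x(0) = Q xi.
    sol_x = solve_ivp(lambda t, x: f(t, x, w(t)), (0.0, T), Q @ xi,
                      dense_output=True, rtol=1e-8, atol=1e-10)
    xT = sol_x.y[:, -1]
    # Adjoint problem, integrated backward in time:
    # lambda' = -(df/dx)^T lambda, lambda(T) = P^T (P x(T) - F(xi)).
    lam_T = P.T @ (P @ xT - F_xi)
    sol_lam = solve_ivp(lambda t, lam: -grad_x_f(t, sol_x.sol(t), w(t)).T @ lam,
                        (T, 0.0), lam_T, dense_output=True, rtol=1e-8, atol=1e-10)
    # Gradient G[omega](t) = (df/domega)^T(t, x(t), omega(t)) lambda(t) on the grid.
    G = np.array([grad_w_f(t, sol_x.sol(t), w(t)).T @ sol_lam.sol(t)
                  for t in t_grid])
    return G   # shape (len(t_grid), r)

A steepest descent step in the sense of the Remark above is then obtained gridpoint-wise by replacing omega_grid with omega_grid - tau * G.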

Appendix C General ResNet

In this section, we describe the general ResNet and error backpropagation. We consider a ResNet with the following system of difference equations

{x(l+1)=x(l)+f(l)(x(l),ω(l)),l=0,1,,L1,x(0)=Qξ,\left\{\begin{aligned} x^{(l+1)}&=x^{(l)}+f^{(l)}(x^{(l)},\omega^{(l)}),&l=0,1,\ldots,L-1,\\ x^{(0)}&=Q\xi,&\end{aligned}\right. (C.1)

where x(l)x^{(l)} is an NN-dimensional real vector for all l=0,1,,Ll=0,1,\ldots,L; ξD\xi\in D is the input data; Px(L)Px^{(L)} is the final output; ω(l)rl(l=0,1,,L1)\omega^{(l)}\in\mathbb{R}^{r_{l}}~{}(l=0,1,\ldots,L-1) are the design parameters; PP and QQ are m×Nm\times N and N×nN\times n real matrices; f(l)f^{(l)} is a continuously differentiable function from N×rl\mathbb{R}^{N}\times\mathbb{R}^{r_{l}} to N\mathbb{R}^{N} for all l=0,1,,L1l=0,1,\ldots,L-1. We consider an approximation of FC(D;m)F\in C(D;\mathbb{R}^{m}) using ResNet with a system of difference equations (C.1). Let KK\in\mathbb{N} be the number of training data and {(ξ(k),F(ξ(k)))}k=1KD×m\{(\xi^{(k)},F(\xi^{(k)}))\}_{k=1}^{K}\subset D\times\mathbb{R}^{m} be the training data. We divide the label of the training data into the following disjoint sets.

{1,2,,K}=I1I2IM(disjoint),(1MK).\{1,2,\ldots,K\}=I_{1}\cup I_{2}\cup\cdots\cup I_{M}~{}(\mathrm{disjoint}),\quad(1\leq M\leq K).

Let Px(L,k)Px^{(L,k)} denote the final output for a given input data ξ(k)D\xi^{(k)}\in D. We set 𝝎=(ω(0),ω(1),,ω(L1))\mbox{\boldmath$\omega$}=(\omega^{(0)},\omega^{(1)},\ldots,\omega^{(L-1)}). We define the loss function for all μ=1,2,,M\mu=1,2,\ldots,M as follows:

eμ(𝝎)=12|Iμ|kIμ|Px(L,k)F(ξ(k))|2,e_{\mu}(\mbox{\boldmath$\omega$})=\frac{1}{2|I_{\mu}|}\sum_{k\in I_{\mu}}\left|Px^{(L,k)}-F(\xi^{(k)})\right|^{2}, (C.2)
E=12Kk=1K|Px(L,k)F(ξ(k))|2.E=\frac{1}{2K}\sum_{k=1}^{K}\left|Px^{(L,k)}-F(\xi^{(k)})\right|^{2}.

We consider the learning for each label set using the gradient method. We find the gradient of the loss function (C.2) with respect to the design parameter ω(l)rl\omega^{(l)}\in\mathbb{R}^{r_{l}} for all l=0,1,,L1l=0,1,\ldots,L-1 using error backpropagation. Using the chain rule, we obtain

ω(l)eμ(𝝎)=kIμω(l)x(l+1,k)x(l+1,k)eμ(𝝎)\nabla_{\omega^{(l)}}e_{\mu}(\mbox{\boldmath$\omega$})=\sum_{k\in I_{\mu}}\nabla_{\omega^{(l)}}{x^{(l+1,k)}}^{\top}\nabla_{x^{(l+1,k)}}e_{\mu}(\mbox{\boldmath$\omega$})

for all l=0,1,,L1l=0,1,\ldots,L-1. From (C.1),

ω(l)x(l+1,k)=ω(l)f(l)(x(l,k),ω(l)).\nabla_{\omega^{(l)}}{x^{(l+1,k)}}^{\top}=\nabla_{\omega^{(l)}}{f^{(l)}}^{\top}(x^{(l,k)},\omega^{(l)}).

We define λ(l,k):=x(l,k)eμ(𝝎)\lambda^{(l,k)}:=\nabla_{x^{(l,k)}}e_{\mu}(\mbox{\boldmath$\omega$}) for all l=0,1,,Ll=0,1,\ldots,L and kIμk\in I_{\mu}. We obtain

λ(l,k)=x(l,k)x(l+1,k)x(l+1,k)eμ(𝝎)=λ(l+1,k)+x(l,k)f(l)(x(l,k),ω(l))λ(l+1,k).\lambda^{(l,k)}=\nabla_{x^{(l,k)}}{x^{(l+1,k)}}^{\top}\nabla_{x^{(l+1,k)}}e_{\mu}(\mbox{\boldmath$\omega$})=\lambda^{(l+1,k)}+\nabla_{x^{(l,k)}}{f^{(l)}}^{\top}(x^{(l,k)},\omega^{(l)})\lambda^{(l+1,k)}.

Also,

λ(L,k)=x(L,k)eμ(𝝎)=1|Iμ|P(Px(L,k)F(ξ(k))).\lambda^{(L,k)}=\nabla_{x^{(L,k)}}e_{\mu}(\mbox{\boldmath$\omega$})=\frac{1}{|I_{\mu}|}P^{\top}\left(Px^{(L,k)}-F(\xi^{(k)})\right).

Therefore, we can find the gradient ω(l)eμ(𝝎)\nabla_{\omega^{(l)}}e_{\mu}(\mbox{\boldmath$\omega$}) of the loss function (C.2) with respect to the design parameters ω(l)rl\omega^{(l)}\in\mathbb{R}^{r_{l}} by using the following equations

{ω(l)eμ(𝝎)=kIμω(l)f(l)(x(l,k),ω(l))λ(l+1,k),l=0,1,,L1,λ(l,k)=λ(l+1,k)+x(l,k)f(l)(x(l,k),ω(l))λ(l+1,k),l=0,1,,L1,kIμ,λ(L,k)=1|Iμ|P(Px(L,k)F(ξ(k))),kIμ.\left\{\begin{array}[]{lll}\displaystyle{\nabla_{\omega^{(l)}}e_{\mu}(\mbox{\boldmath$\omega$})=\sum_{k\in I_{\mu}}\nabla_{\omega^{(l)}}{f^{(l)}}^{\top}(x^{(l,k)},\omega^{(l)})\lambda^{(l+1,k)}},&l=0,1,\ldots,L-1,&\\ \displaystyle{\lambda^{(l,k)}=\lambda^{(l+1,k)}+\nabla_{x^{(l,k)}}{f^{(l)}}^{\top}(x^{(l,k)},\omega^{(l)})\lambda^{(l+1,k)}},&l=0,1,\ldots,L-1,&k\in I_{\mu},\\ \displaystyle{\lambda^{(L,k)}=\frac{1}{|I_{\mu}|}P^{\top}\left(Px^{(L,k)}-F(\xi^{(k)})\right)},&&k\in I_{\mu}.\end{array}\right.
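
The recursion above can be written compactly in code. The following Python sketch assumes that the layer maps f^(l) and their Jacobians with respect to x^(l) and ω^(l) are available as callables; the names and calling conventions are hypothetical.

import numpy as np

def resnet_gradients(f_list, jac_x, jac_w, P, Q, omegas, batch_xi, batch_F):
    # f_list[l](x, w): f^(l)(x, w); jac_x[l](x, w), jac_w[l](x, w): Jacobians of
    # f^(l) with respect to x (N x N) and omega^(l) (N x r_l).
    L = len(f_list)
    grads = [np.zeros_like(w) for w in omegas]
    B = len(batch_xi)                              # |I_mu|
    for xi, F_xi in zip(batch_xi, batch_F):
        xs = [Q @ xi]                              # forward pass through (C.1)
        for l in range(L):
            xs.append(xs[l] + f_list[l](xs[l], omegas[l]))
        lam = P.T @ (P @ xs[L] - F_xi) / B         # lambda^(L, k)
        for l in reversed(range(L)):               # backward recursion
            grads[l] += jac_w[l](xs[l], omegas[l]).T @ lam
            lam = lam + jac_x[l](xs[l], omegas[l]).T @ lam
    return grads   # grads[l] is the gradient of e_mu with respect to omega^(l)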

Appendix D General ODENet and (α,β,γ)(\alpha,\beta,\gamma)-type ODENet

In this section, we show that (B.1) is a generalization of (2.3). Let natural numbers nn and mm satisfy nmn\geq m. In (B.1) with N=m+nN=m+n and r=m+n(n+1)r=m+n(n+1), if we put

x(t)=(x~(t)y~(t))n×m,ω(t)=(α(t),β(t),γ(t))m×n×n×n,P=(Om×nIm),Q=(InOm×n),f(t,x(t),ω(t))=(β(t)x~(t)+γ(t)α(t)𝝈(Ax~(t))),\displaystyle\begin{array}[]{c}{\displaystyle x(t)=\left(\begin{array}[]{c}\tilde{x}(t)\\ \tilde{y}(t)\end{array}\right)\in\mathbb{R}^{n}\times\mathbb{R}^{m},\quad\omega(t)=\left(\alpha(t),\beta(t),\gamma(t)\right)\in\mathbb{R}^{m}\times\mathbb{R}^{n\times n}\times\mathbb{R}^{n},}\\[16.0pt] {\displaystyle P=\left(O_{m\times n}~{}I_{m}\right),\quad Q=\left(\begin{array}[]{c}I_{n}\\ O_{m\times n}\end{array}\right),\quad f(t,x(t),\omega(t))=\left(\begin{array}[]{c}\beta(t)\tilde{x}(t)+\gamma(t)\\ \alpha(t)\odot\mbox{\boldmath$\sigma$}(A\tilde{x}(t))\end{array}\right),}\end{array}

where AA is an m×nm\times n real matrix, InI_{n} is the identity matrix of size nn, and Om×nO_{m\times n} is the m×nm\times n zero matrix, then x~(t)n\tilde{x}(t)\in\mathbb{R}^{n} and y~(t)m\tilde{y}(t)\in\mathbb{R}^{m} satisfy (2.3) with the initial value

(x~(0)y~(0))=Qξ=(ξ0)\left(\begin{array}[]{c}\tilde{x}(0)\\ \tilde{y}(0)\end{array}\right)=Q\xi=\left(\begin{array}[]{c}\xi\\ 0\end{array}\right)

and y~(T)=Px(T)\tilde{y}(T)=Px(T) is the final output.
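
As a concrete illustration of this embedding, the following Python sketch builds P, Q, and the right-hand side f from given n, m, A, and a componentwise activation σ. The packing of ω = (α, β, γ) into a single vector of length r = m + n(n+1), with β stored row-major, is an assumption of the sketch.

import numpy as np

def make_general_form(n, m, A, sigma):
    # P = (O_{m x n}  I_m) extracts y_tilde; Q = (I_n ; O_{m x n}) embeds xi.
    P = np.hstack([np.zeros((m, n)), np.eye(m)])
    Q = np.vstack([np.eye(n), np.zeros((m, n))])
    def f(t, x, omega):
        # omega is packed as (alpha in R^m, beta in R^{n x n}, gamma in R^n),
        # so that r = m + n(n + 1); sigma is applied componentwise (e.g. np.tanh).
        alpha = omega[:m]
        beta = omega[m:m + n * n].reshape(n, n)
        gamma = omega[m + n * n:]
        x_tilde = x[:n]
        return np.concatenate([beta @ x_tilde + gamma,
                               alpha * sigma(A @ x_tilde)])
    return P, Q, f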