
Neural Network Approximations of Compositional Functions With Applications to Dynamical Systems

This work was supported in part by the U.S. Naval Research Laboratory, Monterey, CA.

Wei Kang, Department of Applied Mathematics, Naval Postgraduate School, Monterey, CA, USA; [email protected]
Qi Gong, Department of Applied Mathematics, University of California, Santa Cruz, Santa Cruz, CA, USA; [email protected]
Abstract

As demonstrated in many areas of real-life applications, neural networks have the capability of dealing with high dimensional data. In the fields of optimal control and dynamical systems, the same capability has been studied and verified in many recent publications. Toward the goal of revealing the underlying reason why neural networks are capable of solving some high dimensional problems, we develop an algebraic framework and an approximation theory for compositional functions and their neural network approximations. The theoretical foundation is developed in a way that supports the error analysis not only of functions as input-output relations, but also of numerical algorithms. This capability is critical because it enables the analysis of approximation errors for problems for which analytic solutions are not available, such as differential equations and optimal control. We identify a set of key features of compositional functions and the relationship between these features and the complexity of neural networks. In addition to function approximations, we prove several formulae of error upper bounds for neural networks that approximate the solutions to differential equations, optimization, and optimal control.

1 Introduction

In recent years, research at the intersection of the fields of optimal control, dynamical systems and machine learning has attracted rapidly increasing attention. Progress has been made at a fast pace, for instance [7, 8, 10, 11, 12, 14, 21, 22, 25]. In these papers, machine learning based on artificial neural networks is used as an effective tool for finding approximate solutions to high dimensional problems of optimal control and partial differential equations. Many of these problems are not solvable using other numerical methods due to the curse of dimensionality, i.e., the solution's complexity grows exponentially when the dimension of the problem increases. Why do neural networks work well? Despite years of research effort, the fundamental reasons are far from clear. Reviews of some related topics and results can be found in [6, 13]. It is proved in [2] that the neural network approximation of a function whose gradient has an integrable Fourier transform can achieve an error upper bound O(n^{-1/2}), where n>0 is the number of neurons, i.e., the complexity of the neural network. This rate has a constant exponent that is independent of the problem's dimension, d. Therefore, the curse of dimensionality is mitigated. However, it is pointed out in [2] that the constant C in O(n^{-1/2}) can indirectly depend on d. If C is exponentially large in d, then an exponentially large value of the neural network's complexity would be required for the approximation to be bounded by a given error tolerance. Similar issues are also brought up in [13] in a review of various neural network approximation methods. If the L^{\infty} norm is used to measure errors, which is the norm adopted in this paper, the complexity of deep neural networks for the approximation of compositional functions is bounded above by O(\epsilon^{-r}) for some constant r>0, where \epsilon is the error tolerance [19, 20, 23]. This theory is based on the assumption that complicated functions can be represented by compositions of relatively simple functions. The upper bound of complexity, O(\epsilon^{-r}), has an unknown constant which can grow as the function's input dimension increases. The growth rate depends on various factors or features associated with the function; the relationship is not fully understood. What makes the problem even more challenging is that some functions studied in this paper are associated with dynamical systems, such as the trajectory of differential equations or the feedback law of optimal control. For these problems, most existing results, including those mentioned above, are not directly applicable to prove or guarantee error upper bounds.

In this paper, we aim to reveal the relationship between the complexity of neural networks and the internal structure of the function to be approximated. A theoretical foundation is developed that supports the error analysis of neural networks not only for functions as input-output relations, but also for functions generated from the trajectories of differential equations and the feedback law of optimal control. By the structure of a function, we mean its compositional structure. In engineering applications, it is quite common that a complicated function with a high input dimension is the result of a sequence of compositions of simple functions that have low input dimensions. These simple functions, as well as the network structure, determine the complexity that a neural network requires to accurately approximate the compositional function. The main contributions of the paper include: 1) We develop an algebraic framework and an approximation theory for compositional functions. These results form the theoretical foundation for the proofs of several theorems on the error upper bound and neural network complexity in the approximation of compositional functions. The theoretical foundation is developed in a way that supports the error analysis of numerical algorithms. This capability is critical because it enables the analysis of approximation errors for problems for which analytic solutions are not available, such as differential equations and optimal control. 2) We identify a set of key features of compositional functions. These key features are quantitative. Their values determine an upper bound of the complexity of neural networks when approximating compositional functions. These key features serve as sufficient conditions for overcoming the curse of dimensionality. 3) We prove several formulae of error upper bounds for neural networks that approximate the solutions to differential equations, optimization, and optimal control.

2 Composition vs. projection: what is the difference?

A widely used idea for representing functions is to project a function onto a sequence of finite dimensional subspaces. For instance, a large family of functions can be represented by the Fourier series

f(x)=\displaystyle\frac{a_{0}}{2}+\displaystyle\sum_{n=1}^{\infty}\left(a_{n}\cos(2n\pi x)+b_{n}\sin(2n\pi x)\right),\quad x\in\mathbb{R}. (1)

In this representation, the n-th terms, a_{n}\cos(2n\pi x)+b_{n}\sin(2n\pi x), are considered to be the elementary units of the function. Because the projection operator is linear, this representation is closed under linear combination in the following sense: if both f and g are Fourier series, then so is h=af+bg for any constants a and b; and the n-th term of h is determined by the n-th terms of f and g, i.e., the elementary units of f and g are the elementary units of h as well. However, the Fourier series representation is not closed under some other important algebraic operations such as function composition. It is obvious that h=f\circ g, or h(x)=f(g(x)), is not in the form of a Fourier series. Although one can find a Fourier series representation for h, its n-th term is not determined by the n-th terms of f and g, i.e., the n-th terms of f and g are not elementary units in the Fourier series of h.
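As a concrete illustration, take f(x)=g(x)=\sin(2\pi x), so each function consists of a single Fourier term. By the Jacobi-Anger expansion, the composition is

h(x)=\sin\left(2\pi\sin(2\pi x)\right)=2\displaystyle\sum_{k=0}^{\infty}J_{2k+1}(2\pi)\sin\left(2(2k+1)\pi x\right),

where J_{2k+1} denotes the Bessel function of the first kind. Every odd harmonic of h has a nonzero coefficient, and none of these coefficients can be read off from the single Fourier terms of f and g.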

Let’s consider functions generated through a network of multiple function compositions. A formal definition of compositional functions is given in the next section. Algebraic expressions of compositional functions can be cumbersome due to the complexity of the function. Instead, we use directed graphs. As an example, consider the following two vector valued functions

{\bf f}(x_{1},x_{2},x_{3})=\left[\begin{array}[]{c}\sin(x_{1}x_{2})\\ \cos(x_{2}x_{3})\\ x_{1}x_{3}\end{array}\right],\quad {\bf g}(x_{1},x_{2},x_{3})=\left[\begin{array}[]{c}x_{1}^{2}+x_{2}^{2}\\ e^{-x^{2}_{1}}\\ e^{-x^{2}_{3}}\end{array}\right]. (2)

In this paper, we use bold letters, such as {\bf x} and {\bf f}, to represent a vector or a vector valued function, and regular font to represent a scalar or a scalar valued function. The functions {\bf f} and {\bf g} can be represented by a directed graph in which each node is a function with a relatively low input dimension (Figure 1). These nodes are the elementary units of the function.

Figure 1: Directed graphs representing the functions in (2).

Different from representations based on projection, the family of compositional functions is closed under both linear combinations and compositions. For example, {\bf h}={\bf f}-{\bf g} is a compositional function; and every elementary unit (or node) of {\bf f} and {\bf g} is an elementary unit of {\bf h}. The graph of {\bf h} and its nodes are shown in Figure 2.

Figure 2: Directed graph representing {\bf f}-{\bf g}, where {\bf f} and {\bf g} are defined in (2).

If {\bf h}={\bf f}\circ{\bf g}, then {\bf h} is a compositional function; the nodes of {\bf f} and {\bf g} are also nodes of {\bf h} (Figure 3). In addition to linear combinations and compositions, we will prove in the next section that the family of compositional functions is also closed under some other algebraic operations, including inner product, division, and substitution.

Figure 3: Directed graph representing {\bf f}\circ{\bf g}, where {\bf f} and {\bf g} are defined in (2).
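To make the closure claims concrete, the following short Python sketch (illustrative only) builds {\bf f} and {\bf g} from (2) out of low dimensional node functions and then forms {\bf f}-{\bf g} and {\bf f}\circ{\bf g}; every node of {\bf f} and {\bf g} reappears unchanged in the two new functions:

import math

# Elementary (general) nodes: each is a scalar function of few variables.
def prod(a, b):
    return a * b

def f(x1, x2, x3):
    return (math.sin(prod(x1, x2)), math.cos(prod(x2, x3)), prod(x1, x3))

def g(x1, x2, x3):
    return (x1**2 + x2**2, math.exp(-x1**2), math.exp(-x3**2))

# Closure under linear combination: f - g reuses every node of f and g.
def f_minus_g(x1, x2, x3):
    return tuple(a - b for a, b in zip(f(x1, x2, x3), g(x1, x2, x3)))

# Closure under composition: f o g feeds the outputs of g into f.
def f_of_g(x1, x2, x3):
    return f(*g(x1, x2, x3))

print(f_minus_g(0.1, 0.2, 0.3))
print(f_of_g(0.1, 0.2, 0.3))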

Why is this interesting? In real-world applications, mathematical models of complicated problems oftentimes have a compositional structure in which the model can be represented by the composition of multiple functions that are relatively simple. The compositional structure reflects the interactive relations among basic components and sub-systems determined by physical laws or engineering design, and thus contains valuable insight into the problem. In addition to functions as input-output relations, many computational algorithms are compositional. For instance, an iterative algorithm is a process in which a function is repeatedly applied to the result from the previous step until the process converges to a small neighborhood of the target point. The iterative process is equivalent to a finite sequence of function compositions. For discrete dynamical systems, trajectories are compositional functions. Obviously, neural networks are also compositional functions. In fact, the family of neural networks is closed under the operations of composition and linear combination. Given the wide spectrum of subjects covered by compositional functions, we believe that an algebraic framework for compositional functions can provide a unified theoretical foundation for the study of neural networks in the approximation of not only functions as input-output relations, but also other subjects that have effective iterative algorithms, such as differential equations and optimal control. Because the family of compositional functions is closed under linear combinations and compositions, tracking error propagation inside complicated compositional functions becomes simpler.

The mathematical study of compositional functions has a rich history. In Hilbert's 13th problem it was conjectured that the roots of some 7th order polynomials cannot be represented as compositions of functions of two variables. Hilbert's conjecture was disproved by Arnold and Kolmogorov. In [15] Kolmogorov showed that any continuous function defined on a d-dimensional cube can be exactly represented by a composition of a set of continuous one-dimensional functions as

f(x1,,xd)=q=12d+1ϕq(p=1dψpq(xp)),f(x_{1},\cdots,x_{d})=\sum_{q=1}^{2d+1}\phi_{q}\left(\sum_{p=1}^{d}\psi_{pq}(x_{p})\right),

where \phi_{q} and \psi_{pq} are continuous univariate functions. This astonishing result, Kolmogorov's representation theorem, reveals that compositional structure can be viewed as an inherent property of any high-dimensional continuous function. By relaxing Kolmogorov's exact representation to approximation and leveraging techniques developed by Kolmogorov, Ref. [16] gives a direct proof of a universal approximation theorem of multilayer neural networks for continuous functions.

In deep learning, compositional functions have been addressed by many authors [3, 5, 6, 19, 20, 23]. In [3], which is a study of statistical risk with a given sample size, the approach takes advantage of the fact that deep networks with ramp activation functions can be represented as a tree rooted at the output node. In [5, 6], tree-like function spaces are introduced. They are considered the natural function spaces for the purposes of approximation theory for neural networks. Compositional functions in [19, 20, 23] are defined using directed acyclic graphs. Our definition of compositional functions is also based on directed acyclic graphs. However, we place emphasis on the graph's layering, in addition to its nodes and edges.

Before we introduce the main theorems, it is important to develop a theoretical foundation for error estimation. In Section 3.1, some basic concepts about compositional functions are defined. In Sections 3.2 and 3.3, some algebraic properties of compositional functions are proved. These concepts and properties form a set of pillars that supports the approximation theory for compositional functions. In Section 4, neural networks are introduced as a family of special compositional functions. The family of neural networks is closed under several algebraic operations including linear combination, composition, and substitution. This property enables one to compose complicated deep neural networks from simple networks and algebraic operations. Also in Section 4, a set of key features of compositional functions is identified. For neural network approximations, we prove an error upper bound that is determined by these key features. Because iterative computational algorithms can be considered as compositional functions, the algebraic framework and the approximation theory are applicable to the solutions to differential equations, optimization, and optimal control. These results are proved in Sections 5 and 6.

3 Algebraic properties of compositional functions

Compositional functions are functions consisting of multiple layers of relatively simple functions defined in low dimensional spaces. These simple functions are integrated together through algebraic operations such as function composition and linear combination. If compositional functions are estimated by neural networks, errors are propagated through layers of functions in a complicated way. Analyzing the estimation error requires a good understanding of the interplay between the error propagation and the internal structure of the compositional function. In this paper, some proofs of theorems on neural network solutions to differential equations and optimal control are impossible without first developing an algebraic foundation and approximation theory for the family of compositional functions. In the following, we first define a set of fundamental concepts and then prove some basic algebraic properties of compositional functions. In Table 1, we summarize a list of repeatedly used notations in the following sections.

{\cal G}^{\bf f}: a DAG associated with {\bf f}
{\cal L}^{\bf f}(\cdot): a mapping, node \rightarrow layer number
{\cal V}^{\bf f}: the set of nodes in {\cal G}^{\bf f}
{\cal E}^{\bf f}: the set of edges in {\cal G}^{\bf f}
{\cal V}^{\bf f}_{L}: the set of linear nodes in {\cal G}^{\bf f}
{\cal V}^{\bf f}_{G}: the set of general nodes in {\cal G}^{\bf f}
{\cal V}^{\bf f}_{I}: the set of input nodes in {\cal G}^{\bf f}
{\cal V}^{\bf f}_{O}: the set of output nodes in {\cal G}^{\bf f}
f_{i,j}: the j-th node in the i-th layer
l_{max}^{\bf f}: the layer number of output nodes
d_{i,j} or d_{i,j}^{\bf f}: the input dimension of f_{i,j}
m_{i,j} or m^{\bf f}_{i,j}: the smoothness of f_{i,j}
R_{i,j} or R^{\bf f}_{i,j}: [-R_{i,j},R_{i,j}]^{d_{i,j}} is the domain of f_{i,j}
L_{i,j} or L^{\bf f}_{i,j}: a Lipschitz constant associated with f_{i,j}
\Lambda^{\bf f}: the largest \max\{(R_{i,j})^{m_{i,j}},1\}\left\lVert f_{i,j}\right\rVert_{W^{\infty}_{m_{i,j},d_{i,j}}} over all f_{i,j}\in{\cal V}^{\bf f}_{G}
L_{max}^{\bf f}: the largest L_{i,j} for f_{i,j}\in{\cal V}^{\bf f}_{G}
{\bf f}, {\bf g}, {\bf x}: bold letters represent a vector or a vector valued function
f_{i,j}, x_{i}: regular font represents a scalar or a scalar valued function
Table 1: Some frequently used notations.

3.1 Definitions and notations

A directed acyclic graph (DAG) is a directed graph without directed cycles. A DAG consists of a set of nodes and a set of directed edges. If each node is a function with input (inward edges) and output (outward edges), then the DAG represents a compositional function. Figure 4 is an example of a DAG associated with a function {\bf f}:\mathbb{R}^{4}\rightarrow\mathbb{R}^{2}. The nodes that have no inward edges are called input nodes; those that have no outward edges are called output nodes.

Figure 4: An example of a compositional function. The DAG has four input nodes, representing x_{1},x_{2},x_{3},x_{4}, and two output nodes, f_{4,1},f_{4,2}.

DAGs allow layerings that agree with the direction of edges. More specifically, suppose (h_{i},h_{j}) is a directed edge in a DAG in which h_{i} is a node in the i-th layer and h_{j} is in the j-th layer; then j>i. The compositional function in Figure 4 has a total of five layers. In this paper, the set of input nodes is always the first layer, labeled as layer 0; two layers next to each other are labeled using adjacent integers; the set of output nodes forms the terminal layer, which has the highest layer number. We would like to point out that the layering of a DAG is not unique. Moreover, a function may have more than one associated DAG. For example, consider the following function

f(x_{1},x_{2},x_{3})=\sin(x_{1}x_{2})+\cos(x_{2}x_{3})+x_{1}x_{3}. (3)

In Figure 5, three DAGs are shown. As functions, they all represent f. Figure 5(a) has three layers. Both (b) and (c) have four layers; however, the sets of nodes in two of the layers are different. In Figure 5, some nodes are colored and some are white. The reason will be explained following Definition 1.

Figure 5: Three DAGs associated with the same function in (3).
Definition 1

(Compositional function) A compositional function is a triplet, ({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}), consisting of a function, {\bf f}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{q}, a layered DAG, {\cal G}^{\bf f}, and a mapping, {\cal L}^{\bf f}. If h is a node in {\cal G}^{\bf f}, then {\cal L}^{\bf f}(h) equals an integer that is the layer number of h. We also use subscripts to represent the layer number, i.e., f_{i,j} is the j-th node in the i-th layer. Every node in {\cal G}^{\bf f} that has at least one inward edge is a function, f_{i,j}:\mathbb{R}^{d_{i,j}}\rightarrow\mathbb{R}, where d_{i,j} is the number of inward edges of the node. A node that has no inward edge is called an input node. It can take any value in the domain of {\bf f}. All input nodes in {\cal G}^{\bf f} are labeled as layer 0, called the input layer. A node that has no outward edge is called an output node. All output nodes are located in the final layer, called the output layer. All layers between the input and output layers are called hidden layers. We always assume that the ranges and domains of all nodes are compatible for composition, i.e., if (f_{i,j},f_{l,k}) is an edge in {\cal G}^{\bf f}, then the range of f_{i,j} is contained in the interior of the domain of f_{l,k}.

For each node in a compositional function, the number of inward edges, d_{i,j}, defines the number of free variables of the node. Each node takes its value in \mathbb{R}, i.e., all nodes are scalar valued functions. The sets of nodes and edges in {\cal G}^{\bf f} are denoted by {\cal V}^{\bf f} and {\cal E}^{\bf f}, respectively. Given a node, f_{i,j}, if it is a 1st order polynomial, then it is called a linear node. Otherwise, it is called a general node. All input nodes are treated as linear nodes. In the figure of {\cal G}^{\bf f}, linear nodes are shown in white. The set of linear nodes is denoted by {\cal V}^{\bf f}_{L} and the set of general nodes is denoted by {\cal V}^{\bf f}_{G}. The set of input nodes, i.e., the first layer of {\cal G}^{\bf f}, is denoted by {\cal V}_{I}^{\bf f}. Similarly, the layer formed by output nodes is {\cal V}_{O}^{\bf f}. Therefore, {\cal L}^{\bf f}({\cal V}_{O}^{\bf f}) is the largest layer number in the graph; it is denoted by l_{max}^{\bf f}. Given an edge (f_{i_{1},j_{1}},f_{i_{2},j_{2}}), if i_{2}\geq i_{1}+2, then we say that this edge skips layers.

Example 1

A complicated function defined in a high dimensional space can be a compositional function consisting of simple nodes. In a model of power systems [1, 24], the electric air-gap torque, {\bf P}_{e}, is a vector valued function that can be approximated by

({\bf P}_{e})_{i}=E_{i}^{2}G_{ii}+\displaystyle\sum_{j=1,j\neq i}^{m}E_{i}E_{j}(G_{ij}\cos(\delta_{i}-\delta_{j})+B_{ij}\sin(\delta_{i}-\delta_{j})),\quad 1\leq i\leq m, (4)

where m is the number of generators and \delta_{i}, i=1,2,\cdots,m, are the rotor angles of the generators. The other notations represent constant parameters. This function is used to determine the equilibrium point of the power system. It is also a part of the torques applied to a rotor in a generator. If the system has m=20 generators, then the model is a function {\bf P}_{e}:\mathbb{R}^{20}\rightarrow\mathbb{R}^{20}. This function can be represented as a compositional function with simple nodes. Shown in Figure 6 is ({\bf P}_{e})_{i} in a DAG with four layers. Except for the layer of \sin(z) and \cos(z), all nodes are linear. It will be shown later that linear nodes in a compositional function do not increase the complexity of the neural network that approximates the function. All general nodes have domains of low dimension, d_{i,j}=1. Theorems in this paper help to reveal that a low input dimension of individual nodes implies low complexity when the function is approximated by neural networks. The complexity does not directly depend on the overall dimension, in this case d=20.

Figure 6: DAG of ({\bf P}_{e})_{i} defined in (4).
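As a sanity check of the structure in (4), the following Python sketch evaluates ({\bf P}_{e})_{i}; the parameter values E_{i}, G_{ij}, B_{ij} used below are placeholders chosen only for illustration and are not from the cited power system model:

import math

def air_gap_torque(i, delta, E, G, B):
    """Evaluate (P_e)_i from equation (4); indices are 0-based here."""
    m = len(delta)
    total = E[i] ** 2 * G[i][i]
    for j in range(m):
        if j != i:
            total += E[i] * E[j] * (G[i][j] * math.cos(delta[i] - delta[j])
                                    + B[i][j] * math.sin(delta[i] - delta[j]))
    return total

# Illustrative data for m = 3 generators.
m = 3
delta = [0.1, 0.2, -0.05]
E = [1.0, 1.05, 0.98]
G = [[0.2 if i == j else 0.05 for j in range(m)] for i in range(m)]
B = [[0.0 if i == j else 0.4 for j in range(m)] for i in range(m)]
print([air_gap_torque(i, delta, E, G, B) for i in range(m)])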
Example 2

Many numerical algorithms are compositional functions. For instance, numerical algorithms for ODEs such as the Euler method and all explicit Runge-Kutta methods are compositional functions. Consider

\dot{\bf x}={\bf f}({\bf x}),\quad {\bf x}\in\mathbb{R}^{d},\,{\bf f}({\bf x})\in\mathbb{R}^{d},\,t\in[0,T]. (5)

For any initial state {\bf x}_{0}, the following is a numerical solution based on the forward Euler method:

{\bf x}_{k}={\bf x}_{k-1}+h{\bf f}({\bf x}_{k-1}),\quad 1\leq k\leq K, (6)

where h=T/K is the step size. This algorithm is, in fact, a compositional function. The structure of its DAG is shown in Figure 7. Each box represents a layer that may have multiple nodes if d>1. In fact, the box of {\bf f} is itself a DAG if {\bf f} is also a compositional function.

Figure 7: DAG structure of the Euler method. If {\bf f} is a compositional function, then each box represents a layer that may have multiple nodes. The white box in the upper left represents the input layer of {\cal G}^{\bf f}. The box marked as "{\bf f}({\bf x})" contains all hidden layers and the output layer of {\cal G}^{\bf f}.
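The claim that (6) is a finite sequence of compositions can be seen directly in code; a minimal Python sketch (with an assumed scalar right-hand side f(x)=-x, chosen only for illustration) is:

def euler_step(f, h):
    """Return the map x -> x + h*f(x), one step block of the DAG in Figure 7."""
    return lambda x: x + h * f(x)

def euler_solution(f, x0, T, K):
    """x_K from (6), obtained by composing the same step map K times."""
    step = euler_step(f, T / K)
    x = x0
    for _ in range(K):
        x = step(x)          # one more composition with the previous steps
    return x

# Illustrative example: dx/dt = -x, so the exact value is x0 * exp(-T).
print(euler_solution(lambda x: -x, 1.0, 1.0, 100))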
Example 3

Neural networks are compositional functions. Let \sigma:\mathbb{R}\rightarrow\mathbb{R} be an activation function. A function

f({\bf x})=\displaystyle\sum_{j=1}^{n}a_{j}\sigma\left({\bf w}_{j}^{T}{\bf x}+b_{j}\right),\quad {\bf x}\in\mathbb{R}^{d}, (7)

is a shallow neural network with a single hidden layer (Figure 8), where {\bf w}_{j}, a_{j} and b_{j} are parameters. Similarly, deep neural networks are also compositional functions.

Figure 8: DAG of a typical shallow neural network. In the hidden layer, f_{1,j}({\bf x})=\sigma\left({\bf w}_{j}^{T}{\bf x}+b_{j}\right).
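For completeness, a direct Python sketch of the shallow network (7) follows; the hyperbolic tangent activation and the randomly drawn parameters are assumptions made purely for illustration:

import math, random

def shallow_net(x, W, a, b, sigma=math.tanh):
    """Evaluate f(x) = sum_j a_j * sigma(w_j^T x + b_j), as in (7)."""
    return sum(a_j * sigma(sum(w_ji * x_i for w_ji, x_i in zip(w_j, x)) + b_j)
               for w_j, a_j, b_j in zip(W, a, b))

# Illustrative parameters: d = 3 inputs, n = 5 neurons.
random.seed(0)
d, n = 3, 5
W = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
a = [random.uniform(-1, 1) for _ in range(n)]
b = [random.uniform(-1, 1) for _ in range(n)]
print(shallow_net([0.1, 0.2, 0.3], W, a, b))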

In this paper, the algebraic foundation and approximation theory on compositional functions are developed as a unified framework for various applications of neural networks. Before we introduce these applications, we must define some basic concepts about algebraic operations between compositional functions, such as linear combination, multiplication, division, substitution and composition. Many complicated functions and numerical algorithms, such as the solutions of differential equations or optimal control, are the results of a sequence of these algebraic operations of compositional functions.

Definition 2

(Linear combination, multiplication and division of compositional functions) Given compositional functions ({\bf f}_{1},{\cal G}^{{\bf f}_{1}},{\cal L}^{{\bf f}_{1}}) and ({\bf f}_{2},{\cal G}^{{\bf f}_{2}},{\cal L}^{{\bf f}_{2}}) in which both {\bf f}_{1} and {\bf f}_{2} are from \mathbb{R}^{d} into \mathbb{R}^{q}, let a,b\in\mathbb{R} be constant numbers. Define

{\bf g}=a{\bf f}_{1}+b{\bf f}_{2}. (8)

Then {\bf g} is a compositional function associated with a layered DAG that is induced by ({\bf f}_{1},{\cal G}^{{\bf f}_{1}},{\cal L}^{{\bf f}_{1}}) and ({\bf f}_{2},{\cal G}^{{\bf f}_{2}},{\cal L}^{{\bf f}_{2}}). The induced DAG and its layering are defined as follows (see Figure 9 for an illustration).

1. Input layer: Let {\cal V}^{{\bf f}_{1}}_{I}={\cal V}^{{\bf f}_{2}}_{I}={\cal V}^{\bf g}_{I}, i.e., the input layer of {\cal G}^{\bf g} is formed by overlapping the input nodes of {\bf f}_{1} and {\bf f}_{2}. The outward edges of the input nodes in {\cal G}^{{\bf f}_{1}} and {\cal G}^{{\bf f}_{2}} are combined.

2. Hidden layers: all nodes in \left({\cal V}^{{\bf f}_{1}}\setminus{\cal V}^{{\bf f}_{1}}_{I}\right)\cup\left({\cal V}^{{\bf f}_{2}}\setminus{\cal V}^{{\bf f}_{2}}_{I}\right) are nodes of {\cal G}^{\bf g}. The layer number of each node is kept the same as that in its original DAG. The edges pointing to or from these nodes are kept unchanged.

3. Output layer: Create q output nodes, each of which has two inward edges. For the k-th node, where k=1,2,\cdots,q, one inward edge starts from f_{l^{{\bf f}_{1}}_{max},k} and the other one from f_{l^{{\bf f}_{2}}_{max},k}. Each output node is a function

h=az_{1}+bz_{2}. (9)

The layer number of output nodes in {\cal G}^{\bf g} is

\max\left\{l_{max}^{{\bf f}_{1}},l_{max}^{{\bf f}_{2}}\right\}+1. (10)

One can define the induced DAGs for the compositional functions {\bf f}_{1}^{T}{\bf f}_{2} and f_{1}/f_{2} (if q=1, f_{1} and f_{2} are scalar valued functions and f_{2}\neq 0) by modifying the output layer accordingly.

Figure 9: Induced DAG of a{\bf f}_{1}+b{\bf f}_{2}. In the output layer, the k-th node is af^{{\bf f}_{1}}_{l^{{\bf f}_{1}}_{max},k}+bf^{{\bf f}_{2}}_{l^{{\bf f}_{2}}_{max},k}, where f^{{\bf f}_{i}}_{l^{{\bf f}_{i}}_{max},k} is the k-th output node of {\bf f}_{i}, i=1,2. For {\bf f}_{1}^{T}{\bf f}_{2}, the DAG is the same except that the output layer has only one node, which is \sum_{k=1}^{q}f^{{\bf f}_{1}}_{l^{{\bf f}_{1}}_{max},k}f^{{\bf f}_{2}}_{l^{{\bf f}_{2}}_{max},k}. For q=1, the DAG of f_{1}/f_{2} has an output layer with one node, f^{f_{1}}_{l^{f_{1}}_{max},1}/f^{f_{2}}_{l^{f_{2}}_{max},1}.

This definition implies that the family of compositional functions is closed under inner product and division, in addition to linear combination and composition. Unless otherwise stated, in this paper we always use the induced DAG for functions that are generated through algebraic operations of compositional functions. In the following, we define the induced DAG for a few more types of algebraic operations involving compositional functions. In Figure 10(a)-(b), two compositional functions, f and g, are shown. We would like to substitute g for the node f_{2,1}, a function that has the same number of inputs as g. Figure 10(c) shows a layered DAG of \tilde{f}, the function resulting from the substitution. The blue nodes and edges are those from {\cal G}^{f}. The red nodes and edges are those from {\cal G}^{g}. In the substitution, all input nodes of {\cal G}^{g} are removed; their outward edges are attached to the corresponding nodes in {\cal G}^{f}; f_{2,1} is replaced by the output node of {\cal G}^{g}. The layering is changed depending on the depth of {\cal G}^{g}.

Definition 3

(Node substitution) Given a compositional function ({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}), let f_{i,j}:\mathbb{R}^{d_{i,j}}\rightarrow\mathbb{R} be a node in the i-th layer, where i\geq 1. Let (g,{\cal G}^{g},{\cal L}^{g}), g:\mathbb{R}^{d_{i,j}}\rightarrow\mathbb{R}, be a compositional function. Substituting g for f_{i,j} results in a new compositional function, \tilde{\bf f}. The induced DAG and its layering are defined as follows.

1. {\cal V}^{\tilde{\bf f}}=\left({\cal V}^{\bf f}\setminus\{f_{i,j}\}\right)\cup\left({\cal V}^{g}\setminus{\cal V}^{g}_{I}\right).

2. If \{v_{1},\cdots,v_{d_{i,j}}\} is the ordered set of inward edges of f_{i,j}, then each v_{i} is replaced by the set of outward edges of g_{0,i}. For the outward edges of f_{i,j}, their new starting node in {\cal G}^{\tilde{\bf f}} is the output node of g. All other edges in {\cal G}^{\bf f} and {\cal G}^{g} are kept unchanged.

3. Let {\it\Delta}i be a positive integer so that i-{\it\Delta}i is the highest layer number of layers in {\cal G}^{\bf f} from which at least one node has an outward edge pointing to f_{i,j}. The layering in {\cal G}^{\tilde{\bf f}} is defined as follows.

\mbox{For }f_{l,k}\in{\cal V}^{\bf f}\setminus\{f_{i,j}\},\quad {\cal L}^{\tilde{\bf f}}(f_{l,k})=\left\{\begin{array}[]{ll}l,&l\leq i-1\\ l+\max\{0,l_{max}^{g}-{\it\Delta}i\},&l\geq i\end{array}\right.,\\
\mbox{For }g_{l,k}\in{\cal V}^{g}\setminus\left({\cal V}^{g}_{I}\cup{\cal V}^{g}_{O}\right),\quad {\cal L}^{\tilde{\bf f}}(g_{l,k})=l+i-{\it\Delta}i,\\
\mbox{For }g_{l,k}\in{\cal V}^{g}_{O},\quad {\cal L}^{\tilde{\bf f}}(g_{l,k})=\max\{i,i-{\it\Delta}i+l^{g}_{max}\}. (11)
Figure 10: Node substitution. The compositional function in (c) is the result of substituting g in (b) for the node f_{2,1} in (a).

Many numerical algorithms are iterative, i.e., the same function is repeatedly applied to the result from the previous iteration until the process converges to a small neighborhood of the target point. If each iteration is a compositional function, then the numerical process can be considered as a sequence of compositions of compositional functions.

Definition 4

(Composition of compositional functions) Given positive integers d, q and s, consider two compositional functions ({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}) and ({\bf g},{\cal G}^{\bf g},{\cal L}^{\bf g}), {\bf f}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{q}, {\bf g}:\mathbb{R}^{q}\rightarrow\mathbb{R}^{s}. Let {\bf h}={\bf g}\circ{\bf f}:{\bf x}\in\mathbb{R}^{d}\rightarrow{\bf g}({\bf f}({\bf x}))\in\mathbb{R}^{s}. Then {\bf h} is a compositional function. Stacking {\cal G}^{\bf g} horizontally to the right of {\cal G}^{\bf f} forms the induced DAG of {\bf h}, as illustrated in Figure 11. More specifically,

1. {\cal V}^{\bf h}={\cal V}^{\bf f}\cup\left({\cal V}^{\bf g}\setminus{\cal V}^{\bf g}_{I}\right).

2. The outward edges from the input nodes in {\cal G}^{\bf g} are attached to the corresponding output nodes in {\cal G}^{\bf f} as new starting nodes.

3. The order of layers is kept unchanged, i.e.,

{\cal L}^{\bf h}(f_{i,j})={\cal L}^{\bf f}(f_{i,j}),\quad\mbox{for }f_{i,j}\in{\cal V}^{\bf f},\\
{\cal L}^{\bf h}(g_{i,j})=l^{\bf f}_{max}+{\cal L}^{\bf g}(g_{i,j}),\quad\mbox{for }g_{i,j}\in{\cal V}^{\bf g}\setminus{\cal V}^{\bf g}_{I}. (12)
Figure 11: Composition of compositional functions. The function in (c) is the composition of the functions in (a) and (b).

In this study, we find that the layering of a DAG plays an important role in the propagation of approximation errors through a network of nodes. For example, given a node in the i-th layer of a compositional function, the variation of this node impacts the value of nodes in the layers l\geq i+1, but not those in the layers l\leq i. In the following, we introduce the concept of compositional truncation along a layer. The concept leads us to the definition of the Lipschitz constant associated with nodes. Consider a compositional function with five layers shown in Figure 12(a). Its truncation along the layer l=1 is a function, \bar{\bf f}, that is shown in Figure 12(b). In the truncation process, we first remove all edges that end at nodes in layers l\leq 1. Subsequently, f_{0,1} and f_{0,2} become isolated nodes, i.e., they have neither outward nor inward edges; we remove them from the graph. Align all nodes in layers l\leq 1, namely f_{1,1}, f_{1,2}, f_{0,3} and f_{0,4}, to form the input layer of the truncated function. In the new input layer, the nodes from layer l=1 in {\cal G}^{\bf f} (in this example, f_{1,1} and f_{1,2}) are kept in the same order; the nodes moved up from layer l=0 (in this case f_{0,3} and f_{0,4}) are located at the lower half of the layer. Their order is not important. A set of dummy variables, (z_{1},z_{2},z_{3},z_{4}), is introduced as the input of \bar{\bf f}, the truncated function. It is important to clarify that the domain of z_{k} is the intersection of the domains of all nodes that are directly connected with z_{k} in \bar{\bf f}. For example, z_{2} in Figure 12 can take any value that is in the domains of f_{2,1}, f_{2,2} and f_{2,3} simultaneously. The value is not necessarily in the range of the original node f_{1,2} in {\cal G}^{\bf f}.

Definition 5

(Compositional truncation) Given a compositional function ({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}), let i be an integer, 1\leq i\leq l_{max}^{\bf f}-1. The compositional truncation of {\bf f} along layer i is another compositional function, \bar{\bf f}. Its layered DAG is obtained by the following process: (i) removing all edges in {\cal G}^{\bf f} whose ending nodes are located in layers l\leq i; (ii) removing all isolated nodes whose inward and outward edges are all removed in (i); (iii) forming the input layer of \bar{\bf f} by aligning all nodes in layers l\leq i in {\cal G}^{\bf f} that are not removed in (ii). We find it convenient in discussions if the order of the f_{i,j}'s in {\cal G}^{\bf f} remains the same in the new input layer and they are located at the top portion of the layer. All other nodes in the new input layer are located at the bottom of the layer, organized in any order one may choose. Given any f_{l,k}\in{\cal G}^{\bf f}, l\geq i, its layer number in {\cal G}^{\bar{\bf f}} is l-i. The variables associated with the input layer of {\cal G}^{\bar{\bf f}} are called dummy inputs. Each dummy input can take any value that is in the common domain of all nodes in {\cal G}^{\bar{\bf f}} that have inward edges starting from this input node.

Figure 12: Compositional truncation. The function in (b) is the compositional truncation of the function in (a) along layer l=1.

The concept of the Lipschitz constant is essential in several theorems proved in this paper. In the following, we define the Lipschitz constant associated with nodes based on a given norm. For a vector {\bf x}\in\mathbb{R}^{d}, its p-norm is

\left\lVert{\bf x}\right\rVert_{p}=\left\{\begin{array}[]{ll}\left(\left|x_{1}\right|^{p}+\left|x_{2}\right|^{p}+\cdots+\left|x_{d}\right|^{p}\right)^{1/p},&1\leq p<\infty\\ \max\{|x_{1}|,|x_{2}|,\cdots,|x_{d}|\},&p=\infty\end{array}\right. (13)
Definition 6

(The Lipschitz constant associated with a node) Given a compositional function ({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}) and a constant 1\leq p\leq\infty, consider a node, f_{i,j}, in the i-th layer, where 0\leq i\leq l^{\bf f}_{max}-1. Let \bar{\bf f}(z_{1},\cdots,z_{j},\cdots) be the truncation of {\bf f} along the i-th layer. If L_{i,j}>0 is a constant satisfying

\left\lVert\bar{\bf f}(z_{1},\cdots,z_{j}+{\it\Delta}z,\cdots)-\bar{\bf f}(z_{1},\cdots,z_{j},\cdots)\right\rVert_{p}\leq L_{i,j}\left|{\it\Delta}z\right| (14)

for all (z_{1},\cdots,z_{j},\cdots) and (z_{1},\cdots,z_{j}+{\it\Delta}z,\cdots) within the domain of \bar{\bf f}, then L_{i,j} is called a Lipschitz constant (under the p-norm) associated with f_{i,j}. In some discussions, we denote the constant by L^{\bf f}_{i,j} to differentiate Lipschitz constants associated with different functions.

We should point out that a Lipschitz constant associated with a node, f_{i,j}, is different from the Lipschitz constant of f_{i,j}. The latter is about the rate of change of f_{i,j}({\bf z}) when {\bf z} is changed; the former is about the rate of change of \bar{\bf f}, the compositional truncation, when the value of f_{i,j} is changed, where the node itself is treated as a free variable. The role of the Lipschitz constant of individual nodes in a DAG is studied in [19, 20], where a good error propagation phenomenon is illustrated. Definition 6 provides a different way of tracking error propagation. It leads to a feature (to be introduced in Section 4) that has a direct impact on an error upper bound of function approximations using neural networks.
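To make Definition 6 concrete, the following Python sketch gives a crude sampling estimate (not a method from the paper) of the Lipschitz constants associated with the hidden nodes of the function in (3), assuming the three-layer DAG in which \sin(x_{1}x_{2}), \cos(x_{2}x_{3}) and x_{1}x_{3} form the hidden layer; the truncation along that layer is \bar{f}(z_{1},z_{2},z_{3})=z_{1}+z_{2}+z_{3}, so each estimate should be close to 1:

import itertools

def f_bar(z):
    """Assumed truncation of (3) along its hidden layer: the output node is the plain sum."""
    return z[0] + z[1] + z[2]

def node_lipschitz(f_bar, j, samples, dz=1e-3):
    """Crude sampling estimate of L_{i,j}: max |f_bar(z + dz*e_j) - f_bar(z)| / dz."""
    best = 0.0
    for z in samples:
        zp = list(z)
        zp[j] += dz
        best = max(best, abs(f_bar(zp) - f_bar(z)) / dz)
    return best

grid = list(itertools.product([-1.0, 0.0, 1.0], repeat=3))
print([round(node_lipschitz(f_bar, j, grid), 6) for j in range(3)])   # roughly [1.0, 1.0, 1.0]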

3.2 Some properties of Lipschitz constants

Algebraic operations may change the Lipschitz constants associated with nodes. The propositions in this section are about the rules that Lipschitz constants follow under some algebraic operations of compositional functions.

Proposition 1

Given a compositional function, ({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}), and a node, f_{i,j}, in the i-th layer. Suppose (g,{\cal G}^{g},{\cal L}^{g}) is another compositional function in which g is a scalar valued function that has the same domain as f_{i,j}. Let (\tilde{\bf f},{\cal G}^{\tilde{\bf f}},{\cal L}^{\tilde{\bf f}}) be the compositional function resulting from substituting g for f_{i,j}, assuming that the range of g is compatible for the substitution. For all nodes in {\cal G}^{\bf f} in layers l\geq i, their associated Lipschitz constants can be carried over to {\cal G}^{\tilde{\bf f}}. Or equivalently, one can assign

L^{\tilde{\bf f}}_{l+\max\{0,l^{g}_{max}-{\it\Delta}i\},k}=L^{\bf f}_{l,k},\quad l\geq i, (15)

where i-{\it\Delta}i is the highest layer number of layers in {\cal G}^{\bf f} from which at least one edge points to f_{i,j}.

Proof. From Definition 3, substituting g for f_{i,j} does not change the nodes and edges above the i-th layer. The number of nodes in the i-th and higher layers in {\cal G}^{\bf f} is the same as that in the layers l\geq(i+\max\{0,l^{g}_{max}-{\it\Delta}i\}) in {\cal G}^{\tilde{\bf f}}. All edges in {\cal G}^{\bf f} that end at nodes in the layers l\geq(i+1) have a one-to-one correspondence with the edges in {\cal G}^{\tilde{\bf f}} that end at nodes in the layers l\geq(i+\max\{0,l^{g}_{max}-{\it\Delta}i\})+1. Therefore, the compositional truncation of {\bf f} along the l-th layer (l\geq i) is identical to the compositional truncation of \tilde{\bf f} along the (l+\max\{0,l^{g}_{max}-{\it\Delta}i\})-th layer. Because Lipschitz constants are defined based on compositional truncations, (15) holds true. \blacklozenge

The following result is about the change of Lipschitz constants after the composition of two compositional functions.

Proposition 2

Given positive integers d, q and s, consider two compositional functions ({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}), {\bf f}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{q}, and ({\bf g},{\cal G}^{\bf g},{\cal L}^{\bf g}), {\bf g}:\mathbb{R}^{q}\rightarrow\mathbb{R}^{s}. Let {\bf h}={\bf g}\circ{\bf f}:{\bf x}\in\mathbb{R}^{d}\rightarrow{\bf g}({\bf f}({\bf x}))\in\mathbb{R}^{s}. Let L^{\bf f}, L^{\bf g} and L^{\bf h} be Lipschitz constants of {\bf f}, {\bf g} and {\bf h} under a p-norm.

1. Given any node f_{i,j} in {\cal G}^{\bf f}. As a node in {\cal G}^{\bf h}, a Lipschitz constant associated with this node is L^{\bf f}_{i,j}L^{\bf g}.

2. A Lipschitz constant of {\bf h} is L^{\bf h}=L^{\bf f}L^{\bf g}.

Proof. Part (2) is well known. For part (1), let \bar{\bf f}(z_{1},\cdots,z_{j},\cdots) be the truncation of {\bf f} along the i-th layer in which f_{i,j} is located. For {\bf h}, the truncation along the layer of the same node is {\bf g}\circ\bar{\bf f}. For any {\it\Delta}z\in\mathbb{R},

\left\lVert{\bf g}(\bar{\bf f}(z_{1},\cdots,z_{j}+{\it\Delta}z,\cdots))-{\bf g}(\bar{\bf f}(z_{1},\cdots,z_{j},\cdots))\right\rVert_{p}\leq L^{\bf g}\left\lVert\bar{\bf f}(z_{1},\cdots,z_{j}+{\it\Delta}z,\cdots)-\bar{\bf f}(z_{1},\cdots,z_{j},\cdots)\right\rVert_{p}\leq L^{\bf g}L^{\bf f}_{i,j}\left|{\it\Delta}z\right|. (16)

\blacklozenge

3.3 Some properties of error propagation

For compositional functions in which all nodes have low input dimensions, one can approximate each node using a shallow neural network. Substituting these neural networks for the corresponding nodes that they approximate results in a deep neural network. The error of using this deep neural network as an approximation of the function depends on all "local errors", i.e., the estimation error of each node. These local errors propagate through the DAG network interactively. In this section, we study the relationship between the Lipschitz constants associated with nodes and the propagation of local errors. To measure errors, we need to introduce an appropriate norm. Given a function, f:D^{f}\subset\mathbb{R}^{d}\rightarrow\mathbb{R}, where D^{f} is a compact set, its L^{\infty}-norm is the essential supremum

\left\lVert f\right\rVert_{L^{\infty}}=\inf\left\{M\geq 0;\;\left|f({\bf x})\right|\leq M\mbox{ almost everywhere in }D^{f}\right\}. (17)

Suppose \tilde{f} is an approximation of f. We say that the approximation error is bounded by \epsilon>0 if

\left\lVert f-\tilde{f}\right\rVert_{L^{\infty}}\leq\epsilon. (18)

If both functions are continuous, this is equivalent to

\left|f({\bf x})-\tilde{f}({\bf x})\right|\leq\epsilon,\quad\mbox{ for all }{\bf x}\in D^{f}. (19)

In this paper, all approximations are measured using a uniformly bounded error like the one shown in (19). If {\bf f} is a vector valued function, a p-norm is used to measure the estimation error

\left\lVert{\bf f}({\bf x})-\tilde{\bf f}({\bf x})\right\rVert_{p}\leq\epsilon,\quad\mbox{ for all }{\bf x}\in D^{\bf f}. (20)

When deep neural networks are used to approximate compositional functions, two types of algebraic operations are frequently used. One is node substitution and the other is composition. The following propositions are about the relationship between the approximation errors before and after an algebraic operation.

Proposition 3

Let ({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}) be a compositional function. Let \{h_{1},h_{2},\cdots,h_{K}\}\subseteq{\cal V}^{\bf f}\setminus{\cal V}^{\bf f}_{I} be a set of nodes. Under a p-norm, let L_{j}^{\bf f}>0 be a Lipschitz constant associated with h_{j}, 1\leq j\leq K. Suppose

\left\{(\tilde{h}_{j},{\cal G}^{{\tilde{h}}_{j}},{\cal L}^{{\tilde{h}}_{j}});1\leq j\leq K\right\} (21)

is a set of compositional functions in which \tilde{h}_{j} has the same domain as h_{j} and a compatible range for substitution. Assume that

\left|{\tilde{h}}_{j}({\bf w})-h_{j}({\bf w})\right|\leq\epsilon_{j},\quad\mbox{for all }{\bf w}\mbox{ in the domain of }h_{j}, (22)

where \epsilon_{j}, j=1,2,\cdots,K, are some positive numbers. Let \tilde{\bf f} be the function obtained by substituting \tilde{h}_{j} for h_{j}, j=1,2,\cdots,K. Then

\left\lVert\tilde{\bf f}({\bf x})-{\bf f}({\bf x})\right\rVert_{p}\leq\displaystyle\sum_{j=1}^{K}L_{j}^{\bf f}\epsilon_{j},\quad\mbox{for all }{\bf x}\mbox{ in the domain of }{\bf f}. (23)

Proof. We prove the cases K=1 and K=2. The proof can be extended to arbitrary K through mathematical induction. Let K=1. Let \bar{\bf f} and \bar{\tilde{\bf f}} be the compositional truncations of {\bf f} and \tilde{\bf f} along the layer in which h_{1} is located. From the proof of Proposition 1, we know that \bar{\bf f}=\bar{\tilde{\bf f}}. The following fact follows from the definition of compositional truncations,

{\bf f}({\bf x})=\bar{\bf f}({\bf z}({\bf x})),\quad {\tilde{\bf f}}({\bf x})=\bar{\tilde{\bf f}}(\tilde{\bf z}({\bf x})), (24)

where {\bf z}({\bf x}), the dummy input of the truncated function, takes the value of the corresponding nodes in {\cal G}^{\bf f} before the truncation. Therefore, it depends on {\bf x}. The value of {\tilde{\bf z}}({\bf x}) is similarly defined. Without loss of generality, assume that h_{1} is the first node in its layer. Then z_{1} takes the value of h_{1} in the evaluation of {\bf f}({\bf x}), and {\tilde{z}}_{1} takes the value of {\tilde{h}}_{1}. The value of z_{1} is different from that of \tilde{z}_{1} because h_{1} is replaced by \tilde{h}_{1}. All other components of {\bf z} and \tilde{\bf z} are equal because h_{1} is the only node that is substituted. We have

\left\lVert{\bf f}({\bf x})-\tilde{\bf f}({\bf x})\right\rVert_{p}=\left\lVert\bar{\bf f}({\bf z})-\bar{\tilde{\bf f}}(\tilde{\bf z})\right\rVert_{p}=\left\lVert\bar{\bf f}({\bf z})-\bar{{\bf f}}(\tilde{\bf z})\right\rVert_{p}\leq L^{\bf f}_{1}\left|z_{1}-\tilde{z}_{1}\right|=L^{\bf f}_{1}\left|h_{1}({\bf w})-\tilde{h}_{1}({\bf w})\right|\leq L^{\bf f}_{1}\epsilon_{1}, (25)

where {\bf w} takes the value of the nodes that have edges pointing to h_{1}. If K=2, suppose \{h_{1},h_{2}\} is ordered consistently with the layer numbers of the nodes, i.e., {\cal L}^{\bf f}(h_{1})\leq{\cal L}^{\bf f}(h_{2}). Let {\bf g} be the function obtained by substituting \tilde{h}_{1} for h_{1} in {\cal G}^{\bf f}; and let \tilde{\bf f} be the function obtained by substituting both \tilde{h}_{1} and \tilde{h}_{2} for h_{1} and h_{2}, respectively. When the node h_{2}\in{\cal V}^{\bf f} is considered as a node in {\cal G}^{\bf g}, its associated Lipschitz constant as a node in {\cal G}^{\bf f} can be carried over to {\cal G}^{\bf g} because {\cal L}^{\bf f}(h_{1})\leq{\cal L}^{\bf f}(h_{2}) (Proposition 1). Therefore, L_{2}^{\bf f} is a Lipschitz constant associated with h_{2} as a node in {\cal G}^{\bf g}. Applying the proved result for K=1 to both {\bf f}({\bf x})-{\bf g}({\bf x}) and {\bf g}({\bf x})-\tilde{\bf f}({\bf x}), we have

\left\lVert{\bf f}({\bf x})-\tilde{\bf f}({\bf x})\right\rVert_{p}\leq\left\lVert{\bf f}({\bf x})-{\bf g}({\bf x})\right\rVert_{p}+\left\lVert{\bf g}({\bf x})-\tilde{\bf f}({\bf x})\right\rVert_{p}\leq L^{\bf f}_{1}\epsilon_{1}+L^{\bf f}_{2}\epsilon_{2} (26)

for all {\bf x} in the domain of {\bf f}. \blacklozenge
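A quick numerical check of the bound (23) is given below, using the function in (3) with its hidden nodes \sin(x_{1}x_{2}) and \cos(x_{2}x_{3}) replaced by Taylor polynomials; this example is constructed here only for illustration, and since the output node is the plain sum of the hidden nodes, each associated Lipschitz constant can be taken as 1:

import math, itertools

def f(x1, x2, x3):
    return math.sin(x1 * x2) + math.cos(x2 * x3) + x1 * x3

def f_tilde(x1, x2, x3):
    # substitute Taylor approximations for the two non-polynomial nodes
    s = lambda u: u - u**3 / 6.0
    c = lambda u: 1.0 - u**2 / 2.0
    return s(x1 * x2) + c(x2 * x3) + x1 * x3

grid = [i / 10.0 for i in range(-10, 11)]            # domain [-1, 1] for each variable
pairs = [a * b for a, b in itertools.product(grid, grid)]
eps1 = max(abs(math.sin(u) - (u - u**3 / 6.0)) for u in pairs)   # local error of node 1
eps2 = max(abs(math.cos(u) - (1.0 - u**2 / 2.0)) for u in pairs) # local error of node 2
err = max(abs(f(a, b, c) - f_tilde(a, b, c))
          for a, b, c in itertools.product(grid, repeat=3))
print(err, "<=", 1.0 * eps1 + 1.0 * eps2)            # bound of Proposition 3 with L1 = L2 = 1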

Given a function {\bf f}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}, a sequence of repeated self-compositions is denoted by

({\bf f}(\cdot))^{k}({\bf x}):=\overbrace{{\bf f}\circ{\bf f}\circ\cdots\circ{\bf f}}^{k}({\bf x}). (27)
Proposition 4

Consider compositional functions ({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}), {\bf f}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}, ({\bf g},{\cal G}^{\bf g},{\cal L}^{\bf g}), {\bf g}:\mathbb{R}^{q}\rightarrow\mathbb{R}^{d}, and ({\bf h},{\cal G}^{\bf h},{\cal L}^{\bf h}), {\bf h}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{q}. Suppose that their domains and ranges are compatible for compositions. Let L^{\bf f} and L^{\bf h} be Lipschitz constants of {\bf f} and {\bf h} under a p-norm. Suppose \tilde{\bf f}, \tilde{\bf g} and \tilde{\bf h} are functions satisfying

\left\lVert{\bf f}({\bf x})-\tilde{\bf f}({\bf x})\right\rVert_{p}\leq e_{1},\quad{\bf x}\mbox{ in the domain},\\
\left\lVert{\bf g}({\bf x})-\tilde{\bf g}({\bf x})\right\rVert_{p}\leq e_{2},\quad{\bf x}\mbox{ in the domain},\\
\left\lVert{\bf h}({\bf x})-\tilde{\bf h}({\bf x})\right\rVert_{p}\leq e_{3},\quad{\bf x}\mbox{ in the domain}, (28)

for some e_{1}>0, e_{2}>0 and e_{3}>0. Given any integer K>0, we have

\left\lVert({\bf f}(\cdot))^{K}\circ{\bf g}({\bf x})-(\tilde{\bf f}(\cdot))^{K}\circ{\tilde{\bf g}}({\bf x})\right\rVert_{p}\leq\displaystyle\frac{(L^{\bf f})^{K}-1}{L^{\bf f}-1}e_{1}+(L^{\bf f})^{K}e_{2} (29)

and

\left\lVert{\bf h}\circ({\bf f}(\cdot))^{K}({\bf x})-\tilde{\bf h}\circ(\tilde{\bf f}(\cdot))^{K}({\bf x})\right\rVert_{p}\leq L^{\bf h}\displaystyle\frac{(L^{\bf f})^{K}-1}{L^{\bf f}-1}e_{1}+e_{3}. (30)

Proof. The inequality (29) is true if K=1 because

\left\lVert{\bf f}\circ{\bf g}({\bf x})-\tilde{\bf f}\circ{\tilde{\bf g}}({\bf x})\right\rVert_{p}\leq\left\lVert{\bf f}\circ{\bf g}({\bf x})-{\bf f}\circ{\tilde{\bf g}}({\bf x})\right\rVert_{p}+\left\lVert{\bf f}\circ{\tilde{\bf g}}({\bf x})-\tilde{\bf f}\circ{\tilde{\bf g}}({\bf x})\right\rVert_{p}\leq L^{\bf f}e_{2}+e_{1}. (31)

Suppose the inequality holds if the power of {\bf f}(\cdot) is K-1. Then,

\begin{array}[]{ll}\left\lVert({\bf f}(\cdot))^{K}\circ{\bf g}({\bf x})-(\tilde{\bf f}(\cdot))^{K}\circ\tilde{\bf g}({\bf x})\right\rVert_{p}\\ \leq\left\lVert{\bf f}\circ({\bf f}(\cdot))^{K-1}\circ{\bf g}({\bf x})-{\bf f}\circ(\tilde{\bf f}(\cdot))^{K-1}\circ\tilde{\bf g}({\bf x})\right\rVert_{p}+\left\lVert{\bf f}\circ(\tilde{\bf f}(\cdot))^{K-1}\circ\tilde{\bf g}({\bf x})-\tilde{\bf f}\circ(\tilde{\bf f}(\cdot))^{K-1}\circ\tilde{\bf g}({\bf x})\right\rVert_{p}\\ \leq L^{\bf f}\left\lVert({\bf f}(\cdot))^{K-1}\circ{\bf g}({\bf x})-(\tilde{\bf f}(\cdot))^{K-1}\circ\tilde{\bf g}({\bf x})\right\rVert_{p}+e_{1}\\ \leq L^{\bf f}\left(\displaystyle\frac{(L^{\bf f})^{K-1}-1}{L^{\bf f}-1}e_{1}+(L^{\bf f})^{K-1}e_{2}\right)+e_{1}\\ =\displaystyle\frac{(L^{\bf f})^{K}-1}{L^{\bf f}-1}e_{1}+(L^{\bf f})^{K}e_{2}.\end{array} (32)

The inequality (30) is implied by (29) and the following inequality:

\left\lVert{\bf h}\circ({\bf f}(\cdot))^{K}({\bf x})-\tilde{\bf h}\circ(\tilde{\bf f}(\cdot))^{K}({\bf x})\right\rVert_{p}\leq\left\lVert{\bf h}\circ({\bf f}(\cdot))^{K}({\bf x})-{\bf h}\circ(\tilde{\bf f}(\cdot))^{K}({\bf x})\right\rVert_{p}+\left\lVert{\bf h}\circ(\tilde{\bf f}(\cdot))^{K}({\bf x})-\tilde{\bf h}\circ(\tilde{\bf f}(\cdot))^{K}({\bf x})\right\rVert_{p}. (33)

\blacklozenge
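The bound (29) is easy to test numerically; the scalar example below (maps and tolerances chosen here only for illustration, with L^{\bf f}=0.5) composes the exact and perturbed maps K times and compares the resulting error with the right-hand side of (29):

import math

Lf, e1, e2, K = 0.5, 1e-3, 2e-3, 10

f  = lambda x: Lf * x + 1.0
ft = lambda x: Lf * x + 1.0 + e1 * math.sin(x)     # |f - ft| <= e1
g  = lambda x: x
gt = lambda x: x + e2                              # |g - gt| <= e2

def iterate(step, x, k):
    for _ in range(k):
        x = step(x)
    return x

xs = [i / 10.0 for i in range(-50, 51)]
err = max(abs(iterate(f, g(x), K) - iterate(ft, gt(x), K)) for x in xs)
bound = (Lf**K - 1.0) / (Lf - 1.0) * e1 + Lf**K * e2
print(err, "<=", bound)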

The following proposition is simply a version of the triangle inequality. It is useful when a sequence of approximations is applied to a function.

Proposition 5

Given a sequence of functions {\bf f}_{j}:D\subset\mathbb{R}^{d}\rightarrow\mathbb{R}^{q}, 1\leq j\leq K, suppose \{e_{j}\}_{j=1}^{K-1} is a sequence of positive numbers satisfying

\left\lVert{\bf f}_{j+1}({\bf x})-{\bf f}_{j}({\bf x})\right\rVert_{p}\leq e_{j},\quad\mbox{ for }{\bf x}\in D,\;1\leq j\leq K-1. (34)

Then,

\left\lVert{\bf f}_{1}({\bf x})-{\bf f}_{K}({\bf x})\right\rVert_{p}\leq\displaystyle\sum_{j=1}^{K-1}e_{j}. (35)

4 Neural networks as compositional functions

Let W^{\infty}_{m,d} be the space of scalar valued functions whose derivatives up to m-th order exist almost everywhere in D^{f}\subset\mathbb{R}^{d} and are all essentially bounded. The Sobolev norm of f\in W^{\infty}_{m,d} is defined by

\left\lVert f\right\rVert_{W^{\infty}_{m,d}}=\displaystyle\sum_{0\leq{\bf k},\left|{\bf k}\right|\leq m}\left\lVert\displaystyle\frac{\partial^{\left|{\bf k}\right|}f}{\partial{\bf x}^{{\bf k}}}\right\rVert_{L^{\infty}}, (36)

where {\bf k}=\left[k_{1},k_{2},\cdots,k_{d}\right] represents a vector of integers and \left|{\bf k}\right|=\left|k_{1}\right|+\cdots+\left|k_{d}\right|. Most theorems proved in this paper are based on similar assumptions. They are stated as follows.

Assumption A1 (Compositional functions) Given a compositional function (𝐟,𝒢𝐟,𝐟)({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}). We assume that 𝐟{\bf f} is a function from [R,R]d[-R,R]^{d} to q\mathbb{R}^{q} for some R>0R>0 and positive integers dd and qq. Assume that the nodes, {fi,j}\{f_{i,j}\}, are functions in Wmi,j,di,jW^{\infty}_{m_{i,j},d_{i,j}} for some integers mi,j1m_{i,j}\geq 1 and di,j1d_{i,j}\geq 1. The domain of fi,jf_{i,j} is a hypercube with edge length Ri,j>0R_{i,j}>0. The ranges and domains of all nodes are compatible for composition, i.e., if (fi,j,fl,k)(f_{i,j},f_{l,k}) is an edge in 𝒢𝐟{\cal G}^{\bf f}, then the range of fi,jf_{i,j} is contained in the interior of the domain of fl,kf_{l,k}.
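When a node's partial derivatives are available in closed form, the Sobolev norm (36) appearing in A1 can be estimated numerically. The following is a minimal sketch, assuming an illustrative node g(x1,x2)=sin(x1)x2 of our own choosing; grid-based maxima only under-estimate the true essential suprema.

```python
# Hedged sketch: numerically estimate the Sobolev norm (36) of a single node on
# its hypercube domain by summing grid suprema of all partial derivatives of
# order <= m.  The node g below is an illustrative choice (not taken from the
# paper), and grid maxima only under-estimate the true essential suprema.
import itertools
import numpy as np
import sympy as sp

x1, x2 = sp.symbols("x1 x2")
g = sp.sin(x1) * x2              # hypothetical node with d = 2
R, m, N = 1.5, 2, 201            # half edge length, order m, grid resolution

grid = np.linspace(-R, R, N)
X1, X2 = np.meshgrid(grid, grid)

norm = 0.0
for k1, k2 in itertools.product(range(m + 1), repeat=2):
    if k1 + k2 > m:
        continue                 # keep only multi-indices with |k| <= m
    dg = g
    for sym, k in ((x1, k1), (x2, k2)):
        if k:
            dg = sp.diff(dg, sym, k)
    dg_fn = sp.lambdify((x1, x2), dg, "numpy")
    vals = np.broadcast_to(np.asarray(dg_fn(X1, X2), dtype=float), X1.shape)
    norm += float(np.max(np.abs(vals)))

print("estimated Sobolev norm of the node:", norm)
```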

In the following, we introduce four key features of compositional functions. These features play a critical role in the estimation of the complexity of deep neural networks when approximating compositional functions.

Definition 7

(Features of compositional functions) Given a compositional function (𝐟,𝒢𝐟,𝐟)({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}), 𝐟:[R,R]dq{\bf f}:[-R,R]^{d}\rightarrow\mathbb{R}^{q}, that satisfies A1. The notations rmax𝐟r^{\bf f}_{max}, Λ𝐟\Lambda^{\bf f} and Lmax𝐟L^{\bf f}_{max} represent positive numbers satisfying

di,j/mi,jrmax𝐟,max{(Ri,j)mi,j,1}fi,jWmi,j,di,jΛ𝐟,|Li,j|Lmax𝐟,\begin{array}[]{lllllllllll}\begin{aligned} d_{i,j}/m_{i,j}&\leq r^{\bf f}_{max},\\ \max\{(R_{i,j})^{m_{i,j}},1\}\left\lVert f_{i,j}\right\rVert_{W^{\infty}_{m_{i,j},d_{i,j}}}&\leq\Lambda^{\bf f},\\ \left|L_{i,j}\right|&\leq L^{\bf f}_{max},\end{aligned}\end{array} (37)

for all nodes in 𝒱G𝐟{\cal V}^{\bf f}_{G}, where Li,jL_{i,j} is defined under a fixed pp-norm. These upper bounds, rmax𝐟r^{\bf f}_{max}, Λ𝐟\Lambda^{\bf f} and Lmax𝐟L^{\bf f}_{max}, together with |𝒱G𝐟|\left|{\cal V}^{\bf f}_{G}\right| are called features of the compositional function.

Example 4

The function 𝐟:[R,R]ddd{\bf f}:[-R,R]^{d}\subset\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} in (44) is the vector field that defines the Lorenz-96 model [17],

𝐟\displaystyle{\bf f} =\displaystyle= [x0(x2x1)x1+Fx1(x3x0)x2+Fxi1(xi+1xi2)xi+Fxd1(xd+1xd2)xd+F],\displaystyle\left[\begin{array}[]{c}x_{0}(x_{2}-x_{-1})-x_{1}+F\\ x_{1}(x_{3}-x_{0})-x_{2}+F\\ \vdots\\ x_{i-1}(x_{i+1}-x_{i-2})-x_{i}+F\\ \vdots\\ x_{d-1}(x_{d+1}-x_{d-2})-x_{d}+F\end{array}\right], (44)

where x1=xd1x_{-1}=x_{d-1}, x0=xdx_{0}=x_{d}, xd+1=x1x_{d+1}=x_{1} and FF is a constant. Let’s treat 𝐟{\bf f} as a compositional function, (𝐟,𝒢𝐟,𝐟)({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}), with a DAG shown in Figure 13.

Figure 13: DAG structure of the function (44). For a clear illustration, edges pointing to the first, the second, and the third layer are shown in red, blue and green respectively.

All general nodes in (𝐟,𝒢𝐟,𝐟)({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}) are located in the second layer. They are defined by

f2,j(xj1,zj)=xj1zj,j=1,,d.\begin{array}[]{lllllllllll}f_{2,j}(x_{j-1},z_{j})=x_{j-1}z_{j},\ j=1,\cdots,d.\end{array} (45)

The formulae of linear nodes are shown in Figure 13. All general nodes have the dimension d2,j=2d_{2,j}=2 and the domain [2R,2R]2[-2R,2R]^{2}. To compute the Lipschitz constant associated with the general node, f2,jf_{2,j}, we construct the truncation of (𝐟,𝒢𝐟,𝐟)({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}) along the second layer, which is given by

𝐟¯(z1,,z2d)=[z1zd+1z2zd+2zdz2d],\begin{array}[]{lllllllllll}\bar{\bf f}(z_{1},\cdots,z_{2d})&=&\left[\begin{array}[]{ccc}z_{1}-z_{d+1}\\ z_{2}-z_{d+2}\\ \vdots\\ z_{d}-z_{2d}\end{array}\right],\end{array} (46)

where the dummy inputs zi[2R,2R]\ z_{i}\in[-2R,2R] for all i=1,,2di=1,\cdots,2d. Clearly L2,j=1L_{2,j}=1, for all j=1,,dj=1,\cdots,d. Since f2,jf_{2,j} is smooth, we can set m2,j=mm_{2,j}=m for any integer m2m\geq 2. Thus,

f2,jWm,2=1+2R+4R2.\|f_{2,j}\|_{W^{\infty}_{m,2}}=1+2R+4R^{2}.

Now it is straightforward to compute the features of the compositional function (𝐟,𝒢𝐟,𝐟)({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}):

rmax𝐟=2/m,Λ𝐟=max{(2R)m,1}(1+2R+4R2),Lmax𝐟=1,|𝒱G𝐟|=d.\begin{array}[]{lllllllllll}r_{max}^{\bf f}&=&2/m,\\ \Lambda^{\bf f}&=&\mbox{max}\{(2R)^{m},1\}(1+2R+4R^{2}),\\ L^{\bf f}_{max}&=&1,\\ \left|{\cal V}^{\bf f}_{G}\right|&=&d.\end{array} (47)
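The decomposition used in this example is easy to check numerically: evaluating the vector field through its layers (linear nodes zj=xj+1xj2z_{j}=x_{j+1}-x_{j-2}, general nodes (45), linear output nodes) reproduces (44). A minimal sketch is given below, using 0-based cyclic indexing equivalent to the convention x1=xd1x_{-1}=x_{d-1}, x0=xdx_{0}=x_{d}, xd+1=x1x_{d+1}=x_{1}; the feature values are those of (47).

```python
# Hedged sketch: evaluate the Lorenz-96 vector field (44) through the layered
# DAG of Example 4 and check it against a direct evaluation.  Indices are
# 0-based and cyclic (mod d), equivalent to the paper's convention.
import numpy as np

def lorenz96_direct(x, F=8.0):
    d = x.size
    return np.array([x[(j - 1) % d] * (x[(j + 1) % d] - x[(j - 2) % d]) - x[j] + F
                     for j in range(d)])

def lorenz96_dag(x, F=8.0):
    d = x.size
    z = np.array([x[(j + 1) % d] - x[(j - 2) % d] for j in range(d)])  # linear layer 1
    g = np.array([x[(j - 1) % d] * z[j] for j in range(d)])            # general nodes (45)
    return g - x + F                                                   # linear output nodes

x = np.random.default_rng(0).uniform(-1.0, 1.0, size=8)
assert np.allclose(lorenz96_direct(x), lorenz96_dag(x))

# the features (47) for a given domain size R and smoothness order m >= 2
R, m, d = 2.0, 2, 8
features = {"r_max": 2 / m,
            "Lambda": max((2 * R) ** m, 1.0) * (1 + 2 * R + 4 * R ** 2),
            "L_max": 1.0,
            "num_general_nodes": d}
print(features)
```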

Shallow and deep neural networks are defined as follows. They are compositional functions in which all general nodes are based on a single function.

Definition 8

(Neural network) A neural network is a compositional function of a special type. Let σ:\sigma:\mathbb{R}\rightarrow\mathbb{R} be a given function, called the activation function.

  1.

    A shallow NN is a function

    f(𝐱)=j=1najσ(𝐰jT𝐱+bj)\begin{array}[]{lllllllllll}f({\bf x})=\displaystyle\sum_{j=1}^{n}a_{j}\sigma\left({\bf w}_{j}^{T}{\bf x}+b_{j}\right)\end{array} (48)

    where aj,bja_{j},b_{j}\in\mathbb{R} and 𝐰jd{\bf w}_{j}\in\mathbb{R}^{d}, 1jn1\leq j\leq n, are parameters. Its DAG is shown in Figure 14(a), which has one hidden layer and a linear output node. Each node in the hidden layer, f1,jf_{1,j}, is a function of the form 𝐱σ(𝐰T𝐱+b){\bf x}\rightarrow\sigma({\bf w}^{T}{\bf x}+b). In this paper, unless otherwise stated, a shallow neural network is a scalar-valued function; a minimal evaluation sketch of (48) is given after Figure 14.

  2.

    A deep neural network is a compositional function (𝐟,𝒢𝐟,𝐟)({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}), 𝐟:dq{\bf f}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{q}, that has multiple hidden layers (Figure 14(b)). The output layer consists of linear nodes. Every node in hidden layers is a function of the form 𝐳σ(𝐰T𝐳+b){\bf z}\rightarrow\sigma({\bf w}^{T}{\bf z}+b).

  3.

    A node in a hidden layer is called a neuron. The complexity of a neural network or deep neural network is the total number of neurons in it.

Figure 14: (a): DAG of typical shallow neural networks. (b): DAG structure of deep neural networks.
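As referenced in Definition 8, the formula (48) is straightforward to evaluate once its parameters are given. The following minimal sketch uses tanh as an illustrative activation and randomly drawn placeholder parameters; Theorem 1 below only asserts that good parameters exist, not how to compute them.

```python
# Hedged sketch: evaluate a shallow network f(x) = sum_j a_j * sigma(w_j^T x + b_j)
# as in (48).  The activation and the randomly drawn parameters are illustrative
# placeholders, not the approximating coefficients whose existence Theorem 1 asserts.
import numpy as np

def shallow_nn(x, W, b, a, sigma=np.tanh):
    # W: (n, d) inner weights, b: (n,) biases, a: (n,) outer coefficients
    return a @ sigma(W @ x + b)

rng = np.random.default_rng(0)
d, n = 4, 16
W, b, a = rng.normal(size=(n, d)), rng.normal(size=n), rng.normal(size=n)
x = rng.uniform(-1.0, 1.0, size=d)
print(shallow_nn(x, W, b, a))
```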
Remark 1

In many applications of neural networks, edges connecting neurons do not skip layers. This is different from the deep neural network defined in Definition 8, where edges can skip layers. As addressed later in this section, one can construct neural networks without edges that skip layers by adding identity functions to the DAG. We are not the first to study DAG-based neural networks; for instance, error analysis and potential computational advantages of DAG-based neural networks were addressed in [4, 19, 20, 23] and the references therein. In this paper, all neural networks are DAG based. For simplicity, we do not always specify the layered DAG, 𝒢𝐟NN{\cal G}^{{\bf f}^{NN}} and 𝐟NN{\cal L}^{{\bf f}^{NN}}, when introducing a neural network.

Remark 2

In Definition 8, there are no linear nodes in the hidden layers. This requirement is not essential because one can easily modify the DAG of a neural network to remove all linear nodes by merging them into neurons. Figure 15(a) shows a portion of a DAG in which fi,jf_{i,j} is a linear node. It evaluates the summation of its three inputs. This linear node is directly connected with a neuron σl,k\sigma_{l,k}. We can remove the linear node and connect its inward edges directly to the neuron. After adjusting the neuron’s parameters accordingly, we obtain the simplified DAG shown in Figure 15(b). Following the same process, one can remove all linear nodes in hidden layers without increasing the number of neurons. The compositional function resulting from this process is a deep neural network as defined in Definition 8. In the rest of the paper, we call a function a deep neural network even if it has linear nodes in addition to neurons, with the understanding that the function is equivalent to a deep neural network as defined in Definition 8 after merging all linear nodes into neurons.

Figure 15: An example of merging linear nodes into neurons. The linear node fi,jf_{i,j} in (a) can be removed by merging it into the neuron σl,k\sigma_{l,k}. The new DAG is shown in (b).
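The merge illustrated in Figure 15 is a re-parameterization of the neuron: if a linear node computes s = c^T z and feeds a neuron σ(w1 s + w2 y + b), the neuron σ((w1 c)^T z + w2 y + b) produces the same value without the linear node. A minimal numeric check of this identity, with arbitrary made-up weights, follows.

```python
# Hedged sketch: merging a linear node into a neuron (Remark 2, Figure 15).
# The linear node computes s = c^T z; the downstream neuron computes
# sigma(w1*s + w2*y + b).  Folding the linear node's coefficients into the
# neuron's weights gives sigma((w1*c)^T z + w2*y + b) -- the same value,
# with one fewer node and no extra neurons.  All weights here are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
sigma = np.tanh
z, y = rng.normal(size=3), rng.normal()       # inputs to the linear node / neuron
c = np.ones(3)                                # linear node: s = z1 + z2 + z3
w1, w2, b = 0.7, -1.3, 0.2                    # neuron parameters

before = sigma(w1 * (c @ z) + w2 * y + b)     # with the linear node
after  = sigma((w1 * c) @ z + w2 * y + b)     # linear node merged into the neuron
assert np.isclose(before, after)
```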

Based on Remark 2, it is obvious that the family of neural networks is closed under several algebraic operations, including linear combination, substitution, and composition.

Proposition 6

Given two neural networks, 𝐟NN{\bf f}^{NN} and 𝐠NN{\bf g}^{NN}. Suppose that their domains and ranges are compatible for the operations discussed below. Then the following statements hold true.

  1.

    The linear combination, a𝐟NN+b𝐠NNa{\bf f}^{NN}+b{\bf g}^{NN} for any constants aa and bb, is a neural network.

  2.

    The compositional function, 𝐠NN𝐟NN{\bf g}^{NN}\circ{\bf f}^{NN}, is a neural network.

  3.

    If a neuron in 𝐟NN{\bf f}^{NN} is substituted by gNNg^{NN}, the resulting compositional function is still a neural network.

Based on this proposition, one can compose complicated deep neural networks using simple and shallow neural networks through sequences of algebraic operations. When this approach is applied to approximate functions, the approximation error of a complicated neural network depends on the accuracy of individual shallow neural networks that are used as building blocks in the composition. The following theorem on the approximation error of shallow neural networks, which is proved in [18], is fundamental in the proof of several theorems about deep neural network approximations of various subjects, including the solutions of differential equations, optimization problems and optimal control.

Theorem 1

(Shallow neural network approximation [18]) Let m,d,n1m,d,n\geq 1 be integers. Let σ:\sigma:\mathbb{R}\rightarrow\mathbb{R} be an infinitely differentiable function in some open interval in \mathbb{R}. Assume that there is a point, bb\in\mathbb{R}, at which σ\sigma and all its derivatives do not equal zero. For any positive integer, nn, there exist vectors {𝐰j}j=1nd\{{\bf w}_{j}\}_{j=1}^{n}\subset\mathbb{R}^{d} and a constant CC satisfying the following property. For any function, f:[1,1]df:[-1,1]^{d}\rightarrow\mathbb{R} in Wm,dW^{\infty}_{m,d}, there exist coefficients aj(f)a_{j}(f) such that

|f(𝐱)j=1naj(f)σ(𝐰jT𝐱+b)|Cnm/dfWm,d, for all 𝐱[1,1]d.\begin{array}[]{lllllllllll}\left|f({\bf x})-\displaystyle\sum_{j=1}^{n}a_{j}(f)\sigma\left({\bf w}_{j}^{T}{\bf x}+b\right)\right|\leq Cn^{-m/d}\left\lVert f\right\rVert_{W^{\infty}_{m,d}},&\mbox{ for all }{\bf x}\in[-1,1]^{d}.\end{array} (49)

The constant CC is determined by dd and mm but is independent of ff and nn.

This theorem is based on the assumption that the domain of ff is a hypercube whose edge length is 2. However, the domain of a node in a compositional function does not necessarily satisfy this assumption. The following is an error upper bound for ff defined in a hypercube that has an arbitrary edge length. The result is proved for domains centered at the origin. By applying translation mappings, the result can be extended to hypercubes centered at any point.

Corollary 1

Make the same assumptions as in Theorem 1, except that the domain of ff is [R,R]d[-R,R]^{d}. Then there exist vectors {𝐰j}j=1nd\{{\bf w}_{j}\}_{j=1}^{n}\subset\mathbb{R}^{d}, bb\in\mathbb{R}, a constant CC, and coefficients aj(f)a_{j}(f) such that

|f(𝐱)j=1naj(f)σ(𝐰jT𝐱+b)|Cnm/dmax{Rm,1}fWm,d,𝐱[R,R]d.\begin{array}[]{lllllllllll}\left|f({\bf x})-\displaystyle\sum_{j=1}^{n}a_{j}(f)\sigma\left({\bf w}_{j}^{T}{\bf x}+b\right)\right|\leq Cn^{-m/d}\max\{R^{m},1\}\left\lVert f\right\rVert_{W^{\infty}_{m,d}},&{\bf x}\in[-R,R]^{d}.\end{array} (50)

The constant CC is determined by dd and mm but independent of ff and nn.

Proof. Define a mapping 𝐱=R𝐱~{\bf x}=R\tilde{\bf x} and a new function f~(𝐱~)=f(R𝐱~)\tilde{f}(\tilde{\bf x})=f(R\tilde{\bf x}). The domain of the new function is [1,1]d[-1,1]^{d}. From Theorem 1, we have

|f~(𝐱)j=1naj(f~)σ(𝐰jT𝐱+b)|Cnm/df~Wm,d\begin{array}[]{lllllllllll}\left|\tilde{f}({\bf x})-\displaystyle\sum_{j=1}^{n}a_{j}({\tilde{f}})\sigma\left({\bf w}_{j}^{T}{\bf x}+b\right)\right|\leq Cn^{-m/d}\left\lVert\tilde{f}\right\rVert_{W^{\infty}_{m,d}}\end{array} (51)

for all 𝐱{\bf x} in [1,1]d[-1,1]^{d}. It is obvious that

|𝐤|f~𝐱~𝐤L=|𝐤|fx𝐤LR|𝐤|.\begin{array}[]{lllllllllll}\left\lVert\displaystyle\frac{\partial^{\left|{\bf k}\right|}{\tilde{f}}}{\partial{\tilde{\bf x}}^{{\bf k}}}\right\rVert_{L^{\infty}}=\left\lVert\displaystyle\frac{\partial^{\left|{\bf k}\right|}f}{\partial x^{\bf k}}\right\rVert_{L^{\infty}}R^{\left|{\bf k}\right|}.\end{array} (52)

Therefore,

f~Wm,d=0𝐤,|𝐤|m|𝐤|f~𝐱~𝐤L=j=0mRj(|𝐤|=j|𝐤|f𝐱𝐤L)j=0mmax{Rj,1}(|𝐤|=j|𝐤|f𝐱𝐤L)max{Rm,1}(j=0m|𝐤|=j|𝐤|f𝐱𝐤L)=max{Rm,1}fWm,d.\begin{array}[]{lllllllllll}\begin{aligned} \left\lVert\tilde{f}\right\rVert_{W^{\infty}_{m,d}}&=\displaystyle\sum_{0\leq{\bf k},\left|\bf k\right|\leq m}\left\lVert\displaystyle\frac{\partial^{\left|\bf k\right|}\tilde{f}}{\partial\tilde{\bf x}^{\bf k}}\right\rVert_{L^{\infty}}\\ &=\displaystyle\sum_{j=0}^{m}R^{j}\left(\sum_{\left|\bf k\right|=j}\left\lVert\displaystyle\frac{\partial^{\left|\bf k\right|}f}{\partial{\bf x}^{\bf k}}\right\rVert_{L^{\infty}}\right)\\ &\leq\displaystyle\sum_{j=0}^{m}\max\{R^{j},1\}\left(\sum_{\left|\bf k\right|=j}\left\lVert\displaystyle\frac{\partial^{\left|\bf k\right|}f}{\partial{\bf x}^{\bf k}}\right\rVert_{L^{\infty}}\right)\\ &\leq\max\{R^{m},1\}\left(\displaystyle\sum_{j=0}^{m}\sum_{\left|\bf k\right|=j}\left\lVert\displaystyle\frac{\partial^{\left|\bf k\right|}f}{\partial{\bf x}^{\bf k}}\right\rVert_{L^{\infty}}\right)\\ &=\max\{R^{m},1\}\left\lVert f\right\rVert_{W^{\infty}_{m,d}}.\end{aligned}\end{array} (53)

Then, (51) and (53) imply (50). \blacklozenge

Consider a compositional function (𝐟,𝒢𝐟,𝐟)({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}) in which the nodes, {fi,j}\{f_{i,j}\}, are functions in Wmi,j,di,jW^{\infty}_{m_{i,j},d_{i,j}} for some positive integers mi,j,di,j1m_{i,j},d_{i,j}\geq 1. From Corollary 1, there exists a set of constants Ci,jC_{i,j} and a set of shallow neural networks, {fi,jNN}\{f_{i,j}^{NN}\}, such that

|fi,j(𝐱)fi,jNN(𝐱)|Ci,jnmi,j/di,jmax{Ri,jmi,j,1}fi,jWmi,j,di,j, for all 𝐱 in the domain,\begin{array}[]{lllllllllll}\left|f_{i,j}({\bf x})-f_{i,j}^{NN}({\bf x})\right|\leq C_{i,j}n^{-m_{i,j}/d_{i,j}}\max\{R_{i,j}^{m_{i,j}},1\}\left\lVert f_{i,j}\right\rVert_{W^{\infty}_{m_{i,j},d_{i,j}}},\mbox{ for all }{\bf x}\mbox{ in the domain,}\end{array} (54)

where nn is the complexity, i.e. the number of neurons in fi,jNNf_{i,j}^{NN}. Substituting the shallow neural networks for all general nodes in 𝒢𝐟{\cal G}^{\bf f} results in a new compositional function consisting of neurons and linear nodes, i.e. a deep neural network. It is an approximation of 𝐟{\bf f}. Its error upper bound is proved in the following theorem.
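The substitution just described is mechanical once the DAG is stored as a topologically ordered list of nodes: each general node's callable is swapped for its shallow-network surrogate, while linear nodes are left unchanged. A minimal sketch on a toy compositional function of our own is given below; the "surrogate" is a placeholder standing in for a trained shallow network of the form (48).

```python
# Hedged sketch: build f^NN by substituting a surrogate (e.g. a trained shallow
# network) for every general node of a compositional function, keeping linear
# nodes as they are.  Nodes are stored in topological order; negative input
# indices refer to the input variables x, non-negative ones to earlier nodes.
import numpy as np

def eval_dag(nodes, x):
    values = []
    for kind, fn, in_idx in nodes:
        args = [x[-i - 1] if i < 0 else values[i] for i in in_idx]
        values.append(fn(*args))
    return values[-1]

def substitute(nodes, surrogates):
    # swap the callable of selected (general) nodes for surrogate callables
    return [(k, surrogates.get(i, f), idx) for i, (k, f, idx) in enumerate(nodes)]

# toy compositional function h(x) = exp(x0*x1) + x2: one general node, one linear node
nodes = [
    ("general", lambda u, v: np.exp(u * v), [-1, -2]),
    ("linear",  lambda g, t: g + t,         [0, -3]),
]
# placeholder surrogate for node 0 -- imagine a trained shallow network (48) here
surrogates = {0: lambda u, v: 1.0 + u * v + 0.5 * (u * v) ** 2}

x = np.array([0.1, 0.2, 0.3])
print(abs(eval_dag(nodes, x) - eval_dag(substitute(nodes, surrogates), x)))
```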

Theorem 2

Consider a compositional function (𝐟,𝒢𝐟,𝐟)({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}), 𝐟:[R,R]dq{\bf f}:[-R,R]^{d}\rightarrow\mathbb{R}^{q}, and a set of its features (Definition 7). Suppose that the function satisfies A1.

  1.

    For any integer nwidth>0n_{width}>0, there exists a set of shallow neural networks of width nwidthn_{width} that approximates the nodes in 𝒱G𝐟{\cal V}^{\bf f}_{G}. The deep neural network, 𝐟NN{\bf f}^{NN}, resulting from substituting the shallow neural networks for all general nodes in 𝒢𝐟{\cal G}^{\bf f} has the following error upper bound,

    𝐟(𝐱)𝐟NN(𝐱)pC(nwidth)1/rmax𝐟,for all 𝐱[R,R]d,\begin{array}[]{lllllllllll}\left\lVert{\bf f}({\bf x})-{\bf f}^{NN}({\bf x})\right\rVert_{p}\leq C(n_{width})^{-1/r^{\bf f}_{max}},&\mbox{for all }{\bf x}\in[-R,R]^{d},\end{array} (55)

    where CC is a constant.

  2.

    The constant, CC, in (55) depends on the features of (𝐟,𝒢𝐟,𝐟)({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}),

    C=C1Lmax𝐟Λ𝐟|𝒱G𝐟|,\begin{array}[]{lllllllllll}C=C_{1}L_{max}^{\bf f}\Lambda^{\bf f}\left|{\cal V}^{\bf f}_{G}\right|,\end{array} (56)

    where C1C_{1} is a constant determined by {di,j,mi,j;fi,j𝒱G𝐟}\{d_{i,j},m_{i,j};\;\;f_{i,j}\in{\cal V}^{\bf f}_{G}\}.

  3.

    The complexity of fNNf^{NN} is |𝒱G𝐟|nwidth\left|{\cal V}^{\bf f}_{G}\right|n_{width}.

Remark 3

Theorem 2 implies that approximations using neural networks can avoid the curse of dimensionality, provided that the features of the compositional function do not grow exponentially with the dimension. More specifically, for a compositional function defined in a domain in d\mathbb{R}^{d}, the complexity of its neural network approximation depends directly on the features of the function, but not directly on the dimension dd. Consider a family of compositional functions parameterized by d>0d>0. If the values of their features do not increase exponentially as dd\rightarrow\infty, then neural network approximations of these compositional functions do not suffer from the curse of dimensionality. We would like to point out that the inequality (55) can be found in [23]. In contrast to [23], we rigorously prove the inequality as well as the relationship between CC and the features of compositional functions.

Proof of Theorem 2. For every node fi,j𝒱G𝐟f_{i,j}\in{\cal V}^{\bf f}_{G}, there exists a shallow neural network, fi,jNNf_{i,j}^{NN}, that approximates the node (Theorem 1 and Corollary 1). If the number of neurons in fi,jNNf_{i,j}^{NN} is nwidthn_{width}, we can find one that satisfies

|fi,j(𝐳)fi,jNN(𝐳)|Ci,j(nwidth)mi,j/di,jmax{Ri,jm,1}fi,jWmi,j,di,j,\begin{array}[]{lllllllllll}\left|f_{i,j}({\bf z})-f_{i,j}^{NN}({\bf z})\right|\leq C_{i,j}(n_{width})^{-m_{i,j}/d_{i,j}}\max\{R^{m}_{i,j},1\}\left\lVert f_{i,j}\right\rVert_{W^{\infty}_{m_{i,j},d_{i,j}}},\end{array} (57)

for all 𝐳{\bf z} in the domain of fi,jf_{i,j}, where Ci,jC_{i,j} depends on di,jd_{i,j} and mi,jm_{i,j} but not fi,jf_{i,j}. Suppose C1C_{1} is the largest Ci,jC_{i,j}. From Proposition 3, 𝐟NN{\bf f}^{NN} satisfies

𝐟(𝐱)𝐟NN(𝐱)pfi,j𝒱G𝐟Li,j𝐟Ci,jmax{Ri,jm,1}fi,jWmi,j,di,j(nwidth)mi,j/di,jfi,j𝒱G𝐟Lmax𝐟C1Λ𝐟(nwidth)1/rmax𝐟=C1Lmax𝐟Λ𝐟|𝒱G𝐟|(nwidth)1/rmax𝐟.\begin{array}[]{lllllllllll}\left\lVert{\bf f}({\bf x})-{\bf f}^{NN}({\bf x})\right\rVert_{p}\\ \leq\displaystyle\sum_{f_{i,j}\in{\cal V}^{\bf f}_{G}}L_{i,j}^{\bf f}C_{i,j}\max\{R^{m}_{i,j},1\}\left\lVert f_{i,j}\right\rVert_{W^{\infty}_{m_{i,j},d_{i,j}}}(n_{width})^{-m_{i,j}/d_{i,j}}\\ \leq\displaystyle\sum_{f_{i,j}\in{\cal V}^{\bf f}_{G}}L_{max}^{\bf f}C_{1}\Lambda^{\bf f}(n_{width})^{-1/r^{\bf f}_{max}}\\ =C_{1}L_{max}^{\bf f}\Lambda^{\bf f}\left|{\cal V}^{\bf f}_{G}\right|(n_{width})^{-1/r^{\bf f}_{max}}.\end{array} (58)

\blacklozenge

Remark 4

The expression of CC in (56) is conservative. From the proof, an alternative error upper bound is in (58). More specifically,

𝐟(𝐱)𝐟NN(𝐱)pfi,j𝒱G𝐟Li,j𝐟Ci,jmax{Ri,jm,1}fi,jWmi,j,di,j(nwidth)mi,j/di,j\begin{array}[]{lllllllllll}\left\lVert{\bf f}({\bf x})-{\bf f}^{NN}({\bf x})\right\rVert_{p}\leq\displaystyle\sum_{f_{i,j}\in{\cal V}^{\bf f}_{G}}L_{i,j}^{\bf f}C_{i,j}\max\{R^{m}_{i,j},1\}\left\lVert f_{i,j}\right\rVert_{W^{\infty}_{m_{i,j},d_{i,j}}}(n_{width})^{-m_{i,j}/d_{i,j}}\end{array} (59)

Moreover, the proof suggests that the number of hidden layers in the neural network (after all linear nodes in hidden layers are merged into neurons) is lmax𝐟l^{\bf f}_{max}. The output layer consists of all linear nodes. The width of layer 1llmax𝐟1\leq l\leq l^{\bf f}_{max} in the neural network is bounded by nwidth×nln_{width}\times n_{l}, where nln_{l} is the number of general nodes in the ll-th layer in 𝒢𝐟{\cal G}^{\bf f}.

Remark 5

The edges in 𝒢𝐟NN{\cal G}^{{\bf f}^{NN}} in Theorem 2 may skip layers, which is not in the traditional format of deep neural networks. Any compositional function, in fact, allows a DAG in which no edges skip layers. Adding identity functions as nodes is a way to achieve this. Shown in Figure 16(a) is the DAG of a compositional function in which two edges skip layers, (f0,3,f2,3)(f_{0,3},f_{2,3}) and (f0,4,f3,3)(f_{0,4},f_{3,3}). In Figure 16(b), three identity functions are added as nodes to the DAG at each location where an edge skips a layer. Under the new DAG, no edges skip layers. In Theorem 2, if 𝐟{\bf f} has no edges that skip layers and if we substitute shallow neural networks for all nodes in 𝒱𝐟𝒱I𝐟{\cal V}^{\bf f}\setminus{\cal V}^{\bf f}_{I} (including identity and linear nodes), then the DAG of 𝐟NN{\bf f}^{NN} in Theorem 2 has no edges that skip layers. After merging linear nodes in 𝒢𝐟NN{\cal G}^{{\bf f}^{NN}} into neurons, one can achieve a traditional deep neural network in which edges do not skip layers.

Figure 16: Functions in (a) and (b), as input-output relations, are equal to each other. Adding identity nodes, the DAG in (b) has no edge that skips layers.
Corollary 2

Consider a compositional function (𝐟,𝒢𝐟,𝐟)({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}) that satisfies the same assumptions made in Theorem 2. Then there is an ϵ0>0\epsilon_{0}>0 such that, for any 0<ϵϵ00<\epsilon\leq\epsilon_{0}, there exists a deep neural network, 𝐟NN{\bf f}^{NN}, satisfying

𝐟(𝐱)𝐟NN(𝐱)pϵ,𝐱[R,R]d.\begin{array}[]{lllllllllll}\left\lVert{\bf f}({\bf x})-{\bf f}^{NN}({\bf x})\right\rVert_{p}\leq\epsilon,&{\bf x}\in[-R,R]^{d}.\end{array} (60)

The complexity of 𝐟NN{\bf f}^{NN} is

nC(Lmax𝐟Λ𝐟)rmax𝐟|𝒱G𝐟|rmax𝐟+1ϵrmax𝐟,\begin{array}[]{lllllllllll}n\leq C\left(L_{max}^{\bf f}\Lambda^{\bf f}\right)^{r^{\bf f}_{max}}\left|{\cal V}^{\bf f}_{G}\right|^{r^{\bf f}_{max}+1}\epsilon^{-r^{\bf f}_{max}},\end{array} (61)

where CC is a constant determined by {di,j,mi,j;fi,j𝒱G𝐟}\{d_{i,j},m_{i,j};\;\;f_{i,j}\in{\cal V}^{\bf f}_{G}\}.

Proof. From Theorem 2, there exists a deep neural network, denoted by 𝐟NN{\bf f}^{NN}, that satisfies (55)-(56). Let nwidthn_{width} be the smallest integer such that

CLmax𝐟Λ𝐟|𝒱G𝐟|(nwidth)1/rmax𝐟ϵ.\begin{array}[]{lllllllllll}CL_{max}^{\bf f}\Lambda^{\bf f}\left|{\cal V}^{\bf f}_{G}\right|(n_{width})^{-1/r^{\bf f}_{max}}\leq\epsilon.\end{array} (62)

Then (60) holds. As the smallest integer satisfying (62), nwidthn_{width} satisfies

nwidth(CLmax𝐟Λ𝐟|𝒱G𝐟|)rmax𝐟ϵrmax𝐟+1.\begin{array}[]{lllllllllll}n_{width}\leq\left(CL_{max}^{\bf f}\Lambda^{\bf f}\left|{\cal V}^{\bf f}_{G}\right|\right)^{r^{\bf f}_{max}}\epsilon^{-r^{\bf f}_{max}}+1.\end{array} (63)

Because Crmax𝐟C^{r^{\bf f}_{max}} is a constant depending on {di,j,mi,j;fi,j𝒱G𝐟}\{d_{i,j},m_{i,j};\;\;f_{i,j}\in{\cal V}^{\bf f}_{G}\} and n=|𝒱G𝐟|nwidthn=\left|{\cal V}^{\bf f}_{G}\right|n_{width}, (63) implies

nC(Lmax𝐟Λ𝐟)rmax𝐟|𝒱G𝐟|rmax𝐟+1ϵrmax𝐟+|𝒱G𝐟|.\begin{array}[]{lllllllllll}n\leq C\left(L_{max}^{\bf f}\Lambda^{\bf f}\right)^{r^{\bf f}_{max}}\left|{\cal V}^{\bf f}_{G}\right|^{r^{\bf f}_{max}+1}\epsilon^{-r^{\bf f}_{max}}+\left|{\cal V}^{\bf f}_{G}\right|.\end{array} (64)

If ϵ\epsilon is small enough, then we can double the constant CC and remove the last term, |𝒱G𝐟|\left|{\cal V}^{\bf f}_{G}\right|, i.e. (61) holds. \blacklozenge

Example 5

Consider the compositional function (44) in Example 4, whose features are given in (47). By Theorem 2, for any integer nwidth>0n_{width}>0, there exists a deep neural network, 𝐟NN{\bf f}^{NN}, such that

𝐟(𝐱)𝐟NN(𝐱)pC1d(1+2R+4R2)max{(2R)m,1}(nwidth)m/2,\begin{array}[]{lllllllllll}\left\lVert{\bf f}({\bf x})-{\bf f}^{NN}({\bf x})\right\rVert_{p}\leq C_{1}d\left(1+2R+4R^{2}\right)\mbox{max}\{(2R)^{m},1\}(n_{width})^{-m/2},\end{array}

for all 𝐱[R,R]d{\bf x}\in[-R,R]^{d} and all integers m2m\geq 2, where C1C_{1} is a constant determined by mm. The complexity of fNNf^{NN} is dnwidthdn_{width}. The approximation error and the complexity depend linearly on the dimension, dd. If mm and the domain size RR are fixed constants, the neural network approximation is free of the curse of dimensionality.
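Every quantity in the bound above is explicit except C1C_{1}; treating C1C_{1} as a user-supplied placeholder, the bound and the complexity dnwidthdn_{width} can be tabulated directly, as in the following sketch.

```python
# Hedged sketch: evaluate the error bound of Example 5 and the complexity
# d * n_width for the Lorenz-96 vector field.  C1 is unknown (Theorem 2 only
# guarantees its existence), so it is left as a user-chosen placeholder.
def example5_bound(d, R, m, n_width, C1=1.0):
    err = C1 * d * (1 + 2 * R + 4 * R ** 2) * max((2 * R) ** m, 1.0) * n_width ** (-m / 2)
    complexity = d * n_width
    return err, complexity

for d in (10, 100, 1000):
    err, n = example5_bound(d=d, R=2.0, m=2, n_width=256)
    print(f"d={d:5d}  bound={err:9.3e}  neurons={n}")
```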

Given a function f:Ddf:D\subset\mathbb{R}^{d}\rightarrow\mathbb{R}, the radius of its range is defined by

(max𝐱D{f(𝐱)}min𝐱D{f(𝐱)})/2.\begin{array}[]{lllllllllll}\left(\displaystyle\max_{{\bf x}\in D}\{f({\bf x})\}-\min_{{\bf x}\in D}\{f({\bf x})\}\right)/2.\end{array} (65)
Theorem 3

Consider two compositional functions (𝐟,𝒢𝐟,𝐟)({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}), 𝐟:[R,R]dq{\bf f}:[-R,R]^{d}\rightarrow\mathbb{R}^{q}, and (𝐠,𝒢𝐠,𝐠)({\bf g},{\cal G}^{\bf g},{\cal L}^{\bf g}), 𝐠:[R,R]dq{\bf g}:[-R,R]^{d}\rightarrow\mathbb{R}^{q}. Suppose they both satisfy A1. Let 𝐟NN{\bf f}^{NN} and 𝐠NN{\bf g}^{NN} be two neural networks with complexity n1n_{1} and n2n_{2}, respectively. Suppose

𝐟(𝐱)𝐟NN(𝐱)pe1,𝐠(𝐱)𝐠NN(𝐱)pe2,\begin{array}[]{lllllllllll}\left\lVert{\bf f}({\bf x})-{\bf f}^{NN}({\bf x})\right\rVert_{p}\leq e_{1},\\ \left\lVert{\bf g}({\bf x})-{\bf g}^{NN}({\bf x})\right\rVert_{p}\leq e_{2},\\ \end{array} (66)

for some positive numbers e1e_{1}, e2e_{2} and for all 𝐱{\bf x} in the domain.

  1.

    Consider 𝐡=a𝐟+b𝐠{\bf h}=a{\bf f}+b{\bf g}, where a,ba,b\in\mathbb{R} are constants. Then 𝐡NN=a𝐟NN+b𝐠NN{\bf h}^{NN}=a{\bf f}^{NN}+b{\bf g}^{NN} is a neural network satisfying

    𝐡(𝐱)𝐡NN(𝐱)p|a|e1+|b|e2\begin{array}[]{lllllllllll}\left\lVert{\bf h}({\bf x})-{\bf h}^{NN}({\bf x})\right\rVert_{p}\leq\left|a\right|e_{1}+\left|b\right|e_{2}\end{array} (67)

    for all 𝐱{\bf x} in the domain. The complexity of 𝐡NN{\bf h}^{NN} is n1+n2n_{1}+n_{2}.

  2.

    Consider h=𝐟T𝐠h={\bf f}^{T}{\bf g}. Let p=2p=2 in (66). Then, for any integers nwidth>0n_{width}>0 and m>0m>0, there is a deep neural network, hNNh^{NN}, satisfying

    |h(𝐱)hNN(𝐱)|max𝐱[R,R]d{𝐟(𝐱)2}e2+max𝐱[R,R]d{𝐠(𝐱)2}e1+e1e2+CΛq(nwidth)m/2,\begin{array}[]{lllllllllll}\left|h({\bf x})-h^{NN}({\bf x})\right|\\ \leq\displaystyle\max_{{\bf x}\in[-R,R]^{d}}\{\left\lVert{\bf f}({\bf x})\right\rVert_{2}\}e_{2}+\displaystyle\max_{{\bf x}\in[-R,R]^{d}}\{\left\lVert{\bf g}({\bf x})\right\rVert_{2}\}e_{1}+e_{1}e_{2}+C\Lambda q(n_{width})^{-m/2},\end{array} (68)

    where CC is a constant depending on mm and qq. The parameter, Λ\Lambda, is defined as follows based on fjf_{j} and gjg_{j} (the jj-th component of 𝐟{\bf f} and 𝐠{\bf g}, respectively)

    Λ=max1jq{max{(Rj)m,1}(AjBj+Aj+Bj+2)},Aj=maxx{|fj(𝐱)|},Bj=maxx{|gj(𝐱)|},Rj=max{radii of the ranges of fj,gj}.\begin{array}[]{lllllllllll}\Lambda=\displaystyle\max_{1\leq j\leq q}\left\{\max\{(R_{j})^{m},1\}(A_{j}B_{j}+A_{j}+B_{j}+2)\right\},\\ A_{j}=\displaystyle\max_{x}\{\left|f_{j}({\bf x})\right|\},B_{j}=\displaystyle\max_{x}\{\left|g_{j}({\bf x})\right|\},\\ R_{j}=\max\{\mbox{radii of the ranges of }f_{j},g_{j}\}.\end{array} (69)

    The complexity of 𝐡NN{\bf h}^{NN} is n1+n2+qnwidthn_{1}+n_{2}+qn_{width}.

  3.

    Suppose ff and gg are scalar-valued compositional functions and g(𝐱)0g({\bf x})\neq 0 in the domain. Consider h=f/gh=f/g. Suppose

    e212min𝐱[R,R]d{|g(𝐱)|}.\begin{array}[]{lllllllllll}e_{2}\leq\displaystyle\frac{1}{2}\displaystyle\min_{{\bf x}\in[-R,R]^{d}}\{\left|g({\bf x})\right|\}.\end{array} (70)

    Then, for any integers nwidth>0n_{width}>0 and m>0m>0, there is a deep neural network, hNNh^{NN}, satisfying

    |h(𝐱)hNN(𝐱)|2AB2e2+2Be1+CΛ(nwidth)m/2,\begin{array}[]{lllllllllll}\left|h({\bf x})-h^{NN}({\bf x})\right|\leq\displaystyle\frac{2A}{B^{2}}e_{2}+\displaystyle\frac{2}{B}e_{1}+C\Lambda(n_{width})^{-m/2},\end{array} (71)

    where CC is a constant depending on mm,

    Λ=max{R1m,1}(m!(A+B)1(1/B)m+1B1),A=max𝐱[R,R]d{|f(𝐱)|},B=min𝐱[R,R]d{|g(𝐱)|},R1=max{radii of the ranges of f,g}.\begin{array}[]{lllllllllll}\Lambda=\max\{R^{m}_{1},1\}\left(m!(A+B)\displaystyle\frac{1-(1/B)^{m+1}}{B-1}\right),\\ A=\displaystyle\max_{{\bf x}\in[-R,R]^{d}}\{\left|f({\bf x})\right|\},\;B=\displaystyle\min_{{\bf x}\in[-R,R]^{d}}\{\left|g({\bf x})\right|\},\\ R_{1}=\max\{\mbox{radii of the ranges of }f,g\}.\end{array} (72)

    The complexity of hNNh^{NN} is n1+n2+nwidthn_{1}+n_{2}+n_{width}.

Proof. The inequality (67) is obvious. From Definition 2, the last layer of the induced DAG for a𝐟NN+b𝐠NNa{\bf f}^{NN}+b{\bf g}^{NN} consists of all linear nodes. It does not change the complexity of the deep neural network.

To prove (2), we construct a compositional function ψ:(𝐮,𝐯)q×q𝐮T𝐯\psi:({\bf u},{\bf v})\in\mathbb{R}^{q}\times\mathbb{R}^{q}\rightarrow{\bf u}^{T}{\bf v}\in\mathbb{R} (Figure 17(a)). Then

h(𝐱)=ψ(𝐟(𝐱),𝐠(𝐱)).\begin{array}[]{lllllllllll}h({\bf x})=\psi({\bf f}({\bf x}),{\bf g}({\bf x})).\end{array} (73)
Figure 17: (a): DAG of inner product, 𝐮T𝐯{\bf u}^{T}{\bf v}, as a compositional function. (b) DAG of division, u/vu/v, as a compositional function.

The Lipschitz constant associated with each general node in 𝒢ψ{\cal G}^{\psi} (all of which are located in layer l=1l=1) is L1,jψ=1L_{1,j}^{\psi}=1. It is straightforward to show that

ψ1,jWm,2(AjBj+Aj+Bj+2),\begin{array}[]{lllllllllll}\left\lVert\psi_{1,j}\right\rVert_{W^{\infty}_{m,2}}\leq(A_{j}B_{j}+A_{j}+B_{j}+2),\end{array} (74)

for any m1m\geq 1. From Theorem 2, there exists a deep neural network, ψNN\psi^{NN}, so that

|ψ(u,v)ψNN(u,v)|CΛq(nwidth)m/2,\begin{array}[]{lllllllllll}\left|\psi(u,v)-\psi^{NN}(u,v)\right|\leq C\Lambda q(n_{width})^{-m/2},\end{array} (75)

where C is a constant depending on mm (here di,j=2d_{i,j}=2) and Λ\Lambda is defined in (69). Define hNN=ψNN(fNN,gNN)h^{NN}=\psi^{NN}\circ(f^{NN},g^{NN}). It is a deep neural network. We have

\begin{array}[]{lllllllllll}\left|h({\bf x})-h^{NN}({\bf x})\right|\\ =\left|\psi({\bf f}({\bf x}),{\bf g}({\bf x}))-\psi^{NN}({\bf f}^{NN}({\bf x}),{\bf g}^{NN}({\bf x}))\right|\\ \leq\left|\psi({\bf f}({\bf x}),{\bf g}({\bf x}))-\psi({\bf f}^{NN}({\bf x}),{\bf g}^{NN}({\bf x}))\right|+\left|\psi({\bf f}^{NN}({\bf x}),{\bf g}^{NN}({\bf x}))-\psi^{NN}({\bf f}^{NN}({\bf x}),{\bf g}^{NN}({\bf x}))\right|\\ \leq\left|{\bf f}({\bf x})^{T}{\bf g}({\bf x})-({\bf f}^{NN}({\bf x}))^{T}{\bf g}^{NN}({\bf x})\right|+C\Lambda q(n_{width})^{-m/2}\\ \leq\left|{\bf f}({\bf x})^{T}{\bf g}({\bf x})-{\bf f}({\bf x})^{T}{\bf g}^{NN}({\bf x})\right|+\left|{\bf f}({\bf x})^{T}{\bf g}^{NN}({\bf x})-({\bf f}^{NN}({\bf x}))^{T}{\bf g}^{NN}({\bf x})\right|+C\Lambda q(n_{width})^{-m/2}\\ \leq\left\lVert{\bf f}({\bf x})\right\rVert_{2}\left\lVert{\bf g}({\bf x})-{\bf g}^{NN}({\bf x})\right\rVert_{2}+\left\lVert{\bf f}({\bf x})-{\bf f}^{NN}({\bf x})\right\rVert_{2}\left\lVert{\bf g}^{NN}({\bf x})\right\rVert_{2}+C\Lambda q(n_{width})^{-m/2}\\ \leq\displaystyle\max_{\bf x}\{\left\lVert{\bf f}({\bf x})\right\rVert_{2}\}e_{2}+(\displaystyle\max_{\bf x}\{\left\lVert{\bf g}({\bf x})\right\rVert_{2}\}+e_{2})e_{1}+C\Lambda q(n_{width})^{-m/2}.\end{array} (76)

The proof of (3) is similar to the idea shown above but for q=1q=1. We construct a compositional function ψ:(u,v)×u/v\psi:(u,v)\in\mathbb{R}\times\mathbb{R}\rightarrow u/v\in\mathbb{R}. This is a compositional function with a single general node and two input nodes (Figure 17(b)). Then

h(𝐱)=ψ(f(𝐱),g(𝐱)).\begin{array}[]{lllllllllll}h({\bf x})=\psi(f({\bf x}),g({\bf x})).\end{array} (77)

A Lipschitz constant associated with the single general node in 𝒢ψ{\cal G}^{\psi} is L1,1=1L_{1,1}=1. Skipping the proof, which is based on standard algebraic derivations, we claim that the norm of ψ\psi satisfies

A=max𝐱{|f(𝐱)|},B=min𝐱{|g(𝐱)|},ψWm,2(m!(A+B)1(1/B)m+1B1),\begin{array}[]{lllllllllll}A=\displaystyle\max_{\bf x}\{\left|f({\bf x})\right|\},B=\displaystyle\min_{\bf x}\{\left|g({\bf x})\right|\},\\ \left\lVert\psi\right\rVert_{W^{\infty}_{m,2}}\leq\left(m!(A+B)\displaystyle\frac{1-(1/B)^{m+1}}{B-1}\right),\\ \end{array} (78)

for any m1m\geq 1. From Theorem 2, there exists a deep neural network, ψNN\psi^{NN}, so that

|ψ(u,v)ψNN(u,v)|CΛ(nwidth)m/2,\begin{array}[]{lllllllllll}\left|\psi(u,v)-\psi^{NN}(u,v)\right|\leq C\Lambda(n_{width})^{-m/2},\end{array} (79)

where C is a constant depending on mm and Λ\Lambda is defined in (72). Define hNN=ψNN(fNN,gNN)h^{NN}=\psi^{NN}\circ(f^{NN},g^{NN}). We have

\begin{array}[]{lllllllllll}\left|h({\bf x})-h^{NN}({\bf x})\right|\\ =\left|\psi(f({\bf x}),g({\bf x}))-\psi^{NN}(f^{NN}({\bf x}),g^{NN}({\bf x}))\right|\\ \leq\left|\psi(f({\bf x}),g({\bf x}))-\psi(f^{NN}({\bf x}),g^{NN}({\bf x}))\right|+\left|\psi(f^{NN}({\bf x}),g^{NN}({\bf x}))-\psi^{NN}(f^{NN}({\bf x}),g^{NN}({\bf x}))\right|\\ \leq\left|f({\bf x})/g({\bf x})-f^{NN}({\bf x})/g^{NN}({\bf x})\right|+C\Lambda(n_{width})^{-m/2}\\ \leq\left|f({\bf x})/g({\bf x})-f({\bf x})/g^{NN}({\bf x})\right|+\left|f({\bf x})/g^{NN}({\bf x})-f^{NN}({\bf x})/g^{NN}({\bf x})\right|+C\Lambda(n_{width})^{-m/2}\\ \leq\left|f({\bf x})\right|\displaystyle\frac{\left|g({\bf x})-g^{NN}({\bf x})\right|}{\left|g({\bf x})g^{NN}({\bf x})\right|}+\displaystyle\frac{\left|f({\bf x})-f^{NN}({\bf x})\right|}{\left|g^{NN}({\bf x})\right|}+C\Lambda(n_{width})^{-m/2}.\end{array} (80)

Because of (70), we know

|gNN(𝐱)|B2.\begin{array}[]{lllllllllll}\left|g^{NN}({\bf x})\right|\geq\displaystyle\frac{B}{2}.\end{array} (81)

Therefore,

|h(𝐱)hNN(𝐱)|2AB2e2+2Be1+CΛ(nwidth)m/2.\begin{array}[]{lllllllllll}\left|h({\bf x})-h^{NN}({\bf x})\right|\leq\displaystyle\frac{2A}{B^{2}}e_{2}+\displaystyle\frac{2}{B}e_{1}+C\Lambda(n_{width})^{-m/2}.\end{array} (82)

\blacklozenge

5 Ordinary differential equations

In the following, we say that a function is CkC^{k} if all derivatives of order less than or equal to kk are continuous in the domain. Consider a system of ordinary differential equations (ODEs)

𝐱˙=𝐟(𝐱),𝐱d,𝐟(𝐱)d,t[0,T].\begin{array}[]{lllllllllll}\dot{\bf x}={\bf f}({\bf x}),&{\bf x}\in\mathbb{R}^{d},{\bf f}({\bf x})\in\mathbb{R}^{d},t\in[0,T].\end{array} (83)

Without loss of generality, 𝐟{\bf f} in (83) does not explicitly depend on tt. If the right-hand side of the ODE is a function of (t,x)(t,x), one can add a new state variable to convert the equation to a time-invariant one. Let ϕ(t;𝐱0){\bm{\phi}}(t;{\bf x}_{0}) represent the solution of (83) satisfying 𝐱(0)=𝐱0{\bf x}(0)={\bf x}_{0}. In this section, we study the complexity of deep neural networks in the approximation of the function ϕ(T;):𝐱ϕ(T;𝐱){\bm{\phi}}(T;\cdot):{\bf x}\rightarrow{\bm{\phi}}(T;{\bf x}). We assume that 𝐟{\bf f}, the right-hand side of (83), is a compositional function that satisfies A1. After integration, the compositional structure of ϕ(t;𝐱){\bm{\phi}}(t;{\bf x}) becomes unknown.

The solution of an ODE can be approximated using the Euler method. Let ELR:ddELR:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} represent the operator of the Euler method, i.e.,

ELR(𝐱)=𝐱+h𝐟(𝐱).\begin{array}[]{lllllllllll}ELR({\bf x})={\bf x}+h{\bf f}({\bf x}).\end{array} (84)

Let K>0K>0 be a positive integer and h=T/Kh=T/K. Then

ϕ(T;𝐱)ELRELRELRK(𝐱)=(ELR())K(𝐱).\begin{array}[]{lllllllllll}{\bm{\phi}}(T;{\bf x})\\ \approx\overbrace{ELR\circ\cdots ELR\circ ELR}^{K}({\bf x})\\ =(ELR(\cdot))^{K}({\bf x}).\end{array} (85)

As a compositional function, ELRELR has a DAG, whose framework is shown in Figure 18. The first box on the left represents the layer of input nodes. The blue box in the middle contains all nodes in 𝒱𝐟𝒱I𝐟{\cal V}^{\bf f}\setminus{\cal V}^{\bf f}_{I} and their edges. The box on the right consists of output nodes, which are all linear nodes. Applying Definition 4, (ELR())K(ELR(\cdot))^{K} is a compositional function with an induced DAG.

Figure 18: DAG structure of an Euler iteration
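The composed Euler map (85) is a loop in code; replacing the exact vector field with a surrogate 𝐟NN{\bf f}^{NN} inside the loop corresponds to the ϕNN{\bm{\phi}}^{NN} constructed in the proof of Theorem 4 below. A minimal sketch for the Lorenz-96 field follows; the surrogate slot is filled by the exact vector field only to keep the sketch self-contained.

```python
# Hedged sketch: the composed Euler map (85), phi(T; x) ~ (ELR(.))^K (x) with
# h = T/K.  Passing a surrogate f_nn instead of f gives the network phi^NN used
# in Theorem 4; here the exact vector field stands in for the surrogate.
import numpy as np

def lorenz96(x, F=8.0):
    d = x.size
    return np.array([x[(i - 1) % d] * (x[(i + 1) % d] - x[(i - 2) % d])
                     - x[i] + F for i in range(d)])

def euler_composition(f, x0, T, K):
    x, h = np.array(x0, dtype=float), T / K
    for _ in range(K):          # (ELR(.))^K: K-fold composition of x -> x + h f(x)
        x = x + h * f(x)
    return x

x0 = np.full(8, 1.0)
x0[0] += 0.01
coarse = euler_composition(lorenz96, x0, T=0.5, K=50)
fine   = euler_composition(lorenz96, x0, T=0.5, K=5000)
print("Euler discretization error ~", np.max(np.abs(coarse - fine)))  # O(T/K), cf. (93)
```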
Theorem 4

Consider the ODE (83) in which (𝐟,𝒢𝐟,𝐟)({\bf f},{\cal G}^{\bf f},{\cal L}^{\bf f}) is a compositional function satisfying A1. Its features are defined in Definition 7. Assume that all nodes in 𝒢𝐟{\cal G}^{\bf f} are C1C^{1}. Suppose D𝐟D^{\bf f} is a closed set such that ϕ(t;𝐱)[R,R]d{\bm{\phi}}(t;{\bf x})\in[-R,R]^{d} whenever 𝐱D𝐟{\bf x}\in D^{\bf f} and t[0,T]t\in[0,T]. Let 1p1\leq p\leq\infty be a real number.

  1.

    There is a constant CC so that, for any integer nwidth>0n_{width}>0, there exists a deep neural network, ϕNN:dd{\bm{\phi}}^{NN}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}, satisfying

    ϕ(T;𝐱)ϕNN(𝐱)pC(nwidth)1/rmax𝐟,𝐱D𝐟.\begin{array}[]{lllllllllll}\left\lVert{\bm{\phi}}(T;{\bf x})-{\bm{\phi}}^{NN}({\bf x})\right\rVert_{p}\leq C(n_{width})^{-1/r^{\bf f}_{max}},&{\bf x}\in D^{\bf f}.\end{array} (86)
  2.

    Let α\alpha\in\mathbb{R} be the constant so that a Lipschitz constant of ELRELR has the form (1+hα)(1+h\alpha). The constant, CC, in (86) satisfies

    C=C1(max{eαT,1})TLmax𝐟Λ𝐟|𝒱G𝐟|+AeBTT,A=max𝐱[R,R]d{𝐟(𝐱)p},B=max𝐱[R,R]d{𝐟𝐱(𝐱)p},\begin{array}[]{lllllllllll}C=C_{1}\left(\max\{e^{\alpha T},1\}\right)TL_{max}^{\bf f}\Lambda^{\bf f}\left|{\cal V}^{\bf f}_{G}\right|+Ae^{BT}T,\\ A=\displaystyle\max_{{\bf x}\in[-R,R]^{d}}\{\left\lVert{\bf f}({\bf x})\right\rVert_{p}\},\\ B=\displaystyle\max_{{\bf x}\in[-R,R]^{d}}\left\{\left\lVert\frac{\partial{\bf f}}{\partial{\bf x}}({\bf x})\right\rVert_{p}\right\},\end{array} (87)

    where C1C_{1} is a constant determined by {di,j,mi,j;fi,j𝒱G𝐟}\{d_{i,j},m_{i,j};\;f_{i,j}\in{\cal V}^{\bf f}_{G}\}.

  3.

    The complexity of ϕNN{\bm{\phi}}^{NN} is bounded by

    n((nwidth)1/rmax𝐟+1)nwidth|𝒱G𝐟|.\begin{array}[]{lllllllllll}n\leq\left((n_{width})^{1/r^{\bf f}_{max}}+1\right)n_{width}\left|{\cal V}^{\bf f}_{G}\right|.\end{array} (88)
Remark 6

The error upper bound in (86)-(87) increases at a polynomial rate with respect to the features of 𝐟{\bf f}. It also depends on the terms eαTe^{\alpha T} and eBTe^{BT}. If the Euler method is stable [9], then α<0\alpha<0. The value of eαTe^{\alpha T} is always bounded. For a stable system, the error estimation using eBTe^{BT} is conservative. In this case the error does not increase exponentially with TT if a stable Euler method is applied.

Proof of Theorem 4. For any integer nwidth>0n_{width}>0, from Theorem 2, there exists a deep neural network approximation of 𝐟{\bf f} satisfying

𝐟(𝐱)𝐟NN(𝐱)pC1Lmax𝐟Λ𝐟|𝒱G𝐟|(nwidth)1/rmax𝐟,\begin{array}[]{lllllllllll}\left\lVert{\bf f}({\bf x})-{\bf f}^{NN}({\bf x})\right\rVert_{p}\leq C_{1}L^{\bf f}_{max}\Lambda^{\bf f}\left|{\cal V}^{\bf f}_{G}\right|(n_{width})^{-1/r^{\bf f}_{max}},\end{array} (89)

for some constant C1C_{1} determined by di,jd_{i,j} and mi,jm_{i,j}. The complexity of the neural network is bounded by nwidth|𝒱G𝐟|n_{width}\left|{\cal V}^{\bf f}_{G}\right|. Let ELRNNELR^{NN} be the neural network obtained by substituting 𝐟NN{\bf f}^{NN} for ff. Then,

ELR(𝐱)ELRNN(𝐱)p,=h𝐟(𝐱)𝐟NN(𝐱)p,hC1Lmax𝐟Λ𝐟|𝒱G𝐟|(nwidth)1/rmax𝐟.\begin{array}[]{lllllllllll}\left\lVert ELR({\bf x})-ELR^{NN}({\bf x})\right\rVert_{p},\\ =h\left\lVert{\bf f}({\bf x})-{\bf f}^{NN}({\bf x})\right\rVert_{p},\\ \leq hC_{1}L^{\bf f}_{max}\Lambda^{\bf f}\left|{\cal V}^{\bf f}_{G}\right|(n_{width})^{-1/r^{\bf f}_{max}}.\end{array} (90)

Define

ϕNN=(ELRNN())K.\begin{array}[]{lllllllllll}{\bm{\phi}}^{NN}=(ELR^{NN}(\cdot))^{K}.\end{array} (91)

Applying Proposition 4, we have

(ELR())K(𝐱)ϕNN(𝐱)p(1+hα)K1(1+hα)1hC1Lmax𝐟Λ𝐟|𝒱G𝐟|(nwidth)1/rmax𝐟((1+hα)K1+(1+hα)K2++1)hC1Lmax𝐟Λ𝐟|𝒱G𝐟|(nwidth)1/rmax𝐟K(max{(1+TKα)K1,1})TKC1Lmax𝐟Λ𝐟|𝒱G𝐟|(nwidth)1/rmax𝐟C1(max{eαT,1})TLmax𝐟Λ𝐟|𝒱G𝐟|(nwidth)1/rmax𝐟.\begin{array}[]{lllllllllll}\left\lVert(ELR(\cdot))^{K}({\bf x})-\phi^{NN}({\bf x})\right\rVert_{p}\\ \leq\displaystyle\frac{(1+h\alpha)^{K}-1}{(1+h\alpha)-1}hC_{1}L^{\bf f}_{max}\Lambda^{\bf f}\left|{\cal V}^{\bf f}_{G}\right|(n_{width})^{-1/r^{\bf f}_{max}}\\ \leq\left((1+h\alpha)^{K-1}+(1+h\alpha)^{K-2}+\cdots+1\right)hC_{1}L^{\bf f}_{max}\Lambda^{\bf f}\left|{\cal V}^{\bf f}_{G}\right|(n_{width})^{-1/r^{\bf f}_{max}}\\ \leq K\left(\max\{(1+\frac{T}{K}\alpha)^{K-1},1\}\right)\displaystyle\frac{T}{K}C_{1}L^{\bf f}_{max}\Lambda^{\bf f}\left|{\cal V}^{\bf f}_{G}\right|(n_{width})^{-1/r^{\bf f}_{max}}\\ \leq C_{1}\left(\max\{e^{\alpha T},1\}\right)TL_{max}^{\bf f}\Lambda^{\bf f}\left|{\cal V}^{\bf f}_{G}\right|(n_{width})^{-1/r^{\bf f}_{max}}.\end{array} (92)

On the other hand, it is well known (see, for instance, [9]) that the global error of the Euler method satisfies

ϕ(T;𝐱)ELRK(𝐱)pAeBTTK,𝐱D𝐟.\begin{array}[]{lllllllllll}\left\lVert{\bm{\phi}}(T;{\bf x})-ELR^{K}({\bf x})\right\rVert_{p}\leq Ae^{BT}\displaystyle\frac{T}{K},&{\bf x}\in D^{\bf f}.\end{array} (93)

From Proposition 5 and the inequalities (92) and (93), we have

ϕ(T;𝐱)ϕNN(𝐱)pC1(max{eαT,1})TLmax𝐟Λ𝐟|𝒱G𝐟|(nwidth)1/rmax𝐟+AeBTTK1,𝐱D𝐟.\begin{array}[]{lllllllllll}\left\lVert{\bm{\phi}}(T;{\bf x})-{\bm{\phi}}^{NN}({\bf x})\right\rVert_{p}\leq C_{1}\left(\max\{e^{\alpha T},1\}\right)TL_{max}^{\bf f}\Lambda^{\bf f}\left|{\cal V}^{\bf f}_{G}\right|(n_{width})^{-1/r^{\bf f}_{max}}+Ae^{BT}TK^{-1},&{\bf x}\in D^{\bf f}.\end{array} (94)

Choosing K=ceiling((nwidth)1/rmax𝐟)K={\rm ceiling}((n_{width})^{1/r^{\bf f}_{max}}), where ceiling(z){\rm ceiling}(z) is the smallest integer that is greater than or equal to zz, we obtain the inequalities in (86)-(87) because

zceiling(z)z+1,for all z0.\begin{array}[]{lllllllllll}z\leq{\rm ceiling}(z)\leq z+1,&\mbox{for all }z\geq 0.\end{array} (95)

The calculation of the complexity is straightforward. There are a total of KK iterations in the Euler approximation. In each 𝒢ELR{\cal G}^{ELR}, there is one copy of 𝐟{\bf f}, which is substituted by 𝐟NN{\bf f}^{NN}, a neural network whose complexity is nwidth|𝒱G𝐟|n_{width}\left|{\cal V}^{\bf f}_{G}\right|. Thus, the total number of neurons in ϕNN{\bm{\phi}}^{NN} is Knwidth|𝒱G𝐟|Kn_{width}\left|{\cal V}^{\bf f}_{G}\right|, which implies (88) because K(nwidth)1/rmax𝐟+1K\leq(n_{width})^{1/r^{\bf f}_{max}}+1. \blacklozenge

Example 6

Consider the Lorenz-96 system [17],

𝐱˙\displaystyle\dot{\bf x} =\displaystyle= 𝐟(𝐱),\displaystyle{\bf f}({\bf x}), (96)

where the vector field 𝐟{\bf f} is the compositional function (44) in Example 4. For a suitable choice of the constant FF, system (96) is chaotic; thus trajectories are bounded. Choose a sufficiently large RR such that 𝐱(t)[R,R]d{\bf x}(t)\in[-R,R]^{d} for all t[0,T]t\in[0,T]. By Theorem 4 and the features of 𝐟{\bf f} given in (47), for any integers nwidth>0n_{width}>0 and m2m\geq 2, there exists a deep neural network, ϕNN:dd{\bm{\phi}}^{NN}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}, satisfying

ϕ(T;𝐱)ϕNN(𝐱)C(nwidth)m/2,\begin{array}[]{lllllllllll}\left\lVert{\bm{\phi}}(T;{\bf x})-{\bm{\phi}}^{NN}({\bf x})\right\rVert_{\infty}\leq C(n_{width})^{-m/2},\end{array} (97)

where

C=C1Td(1+2R+4R2)max{eαT,1}max{(2R)m,1}+AeBTT,A=max𝐱[R,R]d{𝐟(𝐱)}2R2+R+F,B=max𝐱[R,R]d{𝐟𝐱(𝐱)}1+4R,αB,C1:a constant determined bym.\begin{array}[]{lllllllllll}C=C_{1}Td(1+2R+4R^{2})\max\{e^{\alpha T},1\}\mbox{max}\{(2R)^{m},1\}+Ae^{BT}T,\\ A=\displaystyle\max_{{\bf x}\in[-R,R]^{d}}\{\left\lVert{\bf f}({\bf x})\right\rVert_{\infty}\}\leq 2R^{2}+R+F,\\ B=\displaystyle\max_{{\bf x}\in[-R,R]^{d}}\left\{\left\lVert\frac{\partial{\bf f}}{\partial{\bf x}}({\bf x})\right\rVert_{\infty}\right\}\leq 1+4R,\\ \alpha\leq B,\\ C_{1}:\mbox{a constant determined by}\ $m$.\end{array} (98)

Moreover, the complexity of ϕNN{\bm{\phi}}^{NN} is bounded by

nd((nwidth)m/2+1)nwidth.\begin{array}[]{lllllllllll}n\leq d\left((n_{width})^{m/2}+1\right)n_{width}.\end{array} (99)

6 Optimal control

Let DD be a bounded and closed set in d\mathbb{R}^{d}. Let V(𝐱,𝐮):D×qV({\bf x},{\bf u}):D\times\mathbb{R}^{q}\rightarrow\mathbb{R} be a cost function, where 𝐱{\bf x} is the state and 𝐮{\bf u} is the control input. Suppose that VV is C2C^{2}. Let H(𝐱,𝐮)H({\bf x},{\bf u}) represent the Hessian of VV with respect to 𝐮{\bf u}, i.e.

H(𝐱,𝐮)=2V𝐮2.\begin{array}[]{lllllllllll}H({\bf x},{\bf u})=\displaystyle\frac{\partial^{2}V}{\partial{\bf u}^{2}}.\end{array} (100)

If H(𝐱,𝐮)H({\bf x},{\bf u}) is positive definite, then VV is a convex function for every fixed 𝐱{\bf x}. Consider the problem of optimization

min𝐮V(𝐱,𝐮).\begin{array}[]{lllllllllll}\displaystyle\min_{\bf u}V({\bf x},{\bf u}).\end{array} (101)

Given any 𝐱D{\bf x}\in D, (101) has a unique solution. The optimal control is denoted by 𝐮(𝐱){\bf u}^{\ast}({\bf x}). The minimum cost is called the value function,

V(𝐱)=V(𝐱,𝐮(𝐱)).\begin{array}[]{lllllllllll}V^{\ast}({\bf x})=V({\bf x},{\bf u}^{\ast}({\bf x})).\end{array} (102)

Because of the convexity and the smoothness of VV, it can be proved that 𝐮(𝐱){\bf u}^{\ast}({\bf x}) is continuous and the set of control inputs,

{𝐮(𝐱);𝐱D},\begin{array}[]{lllllllllll}\{{\bf u}^{\ast}({\bf x});\;{\bf x}\in D\},\end{array} (103)

is bounded. Its diameter is denoted by

\begin{array}[]{lllllllllll}\gamma=\displaystyle\max_{{\bf x},{\bf y}\in D}\left\lVert{\bf u}^{\ast}({\bf x})-{\bf u}^{\ast}({\bf y})\right\rVert_{2}.\end{array} (104)

Let 𝐮0q{\bf u}_{0}\in\mathbb{R}^{q} be any fixed point satisfying

𝐮0𝐮(𝐱)2γ,𝐱D.\begin{array}[]{lllllllllll}\left\lVert{\bf u}_{0}-{\bf u}^{\ast}({\bf x})\right\rVert_{2}\leq\gamma,&{\bf x}\in D.\end{array} (105)

In the following, we apply an iterative algorithm starting from 𝐮0{\bf u}_{0} in the search for 𝐮(𝐱){\bf u}^{\ast}({\bf x}). It is a fact, which will become clear later, that the following set is invariant under the search process:

𝒰={𝐮q;𝐮𝐮022γ}.\begin{array}[]{lllllllllll}{\cal U}=\{{\bf u}\in\mathbb{R}^{q};\;\left\lVert{\bf u}-{\bf u}_{0}\right\rVert_{2}\leq 2\gamma\}.\end{array} (106)

In the following, we prove an error upper bound for the approximation of uu^{\ast} using neural networks.

Theorem 5

Suppose V(𝐱,𝐮):D×𝒰d×qV({\bf x},{\bf u}):D\times{\cal U}\subset\mathbb{R}^{d}\times\mathbb{R}^{q}\rightarrow\mathbb{R} is a C2C^{2} function. Assume that H(𝐱,𝐮)H({\bf x},{\bf u}) is positive definite in D×𝒰D\times{\cal U}. Let λmin>0\lambda_{min}>0 and λmax>0\lambda_{max}>0 be the lower and upper bounds of the eigenvalues of H(𝐱,𝐮)H({\bf x},{\bf u}) in D×𝒰D\times{\cal U}. Suppose VNNV^{NN} is a neural network satisfying

|V(𝐱,𝐮)VNN(𝐱,𝐮)|e1,(𝐱,𝐮)D×𝒰,\begin{array}[]{lllllllllll}\left|V({\bf x},{\bf u})-V^{NN}({\bf x},{\bf u})\right|\leq e_{1},&({\bf x},{\bf u})\in D\times{\cal U},\end{array} (107)

for some e1>0e_{1}>0.

  1.

    Given any integer K>0K>0 and real number h>0h>0, there exists a deep neural network, 𝐮NN:dq{\bf u}^{\ast NN}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{q}, that approximates 𝐮{\bf u}^{\ast} with the following error upper bound

    𝐮NN(𝐱)𝐮(𝐱)2γLK+C1qKh+2qKh1e1,𝐱D,\begin{array}[]{lllllllllll}\left\lVert{\bf u}^{\ast NN}({\bf x})-{\bf u}^{\ast}({\bf x})\right\rVert_{2}\leq\gamma L^{K}+C_{1}\sqrt{q}Kh+2\sqrt{q}Kh^{-1}e_{1},&{\bf x}\in D,\end{array} (108)

    where γ\gamma is the diameter of the set (103), L=1λmin/max{1,2λmax}L=1-\lambda_{min}/\max\{1,2\lambda_{max}\}, and

    C1=max𝐱D,𝐮𝒰{2Vuj2(𝐱,𝐮)},1jq.\begin{array}[]{lllllllllll}C_{1}=\displaystyle\max_{{\bf x}\in D,{\bf u}\in{\cal U}}\left\{\displaystyle\frac{\partial^{2}V}{\partial u_{j}^{2}}({\bf x},{\bf u})\right\},&1\leq j\leq q.\end{array} (109)
  2.

    Let n1n_{1} be the complexity of VNNV^{NN}. The complexity of 𝐮NN{\bf u}^{\ast NN} satisfies

    n2n1qK.\begin{array}[]{lllllllllll}n\leq 2n_{1}qK.\end{array} (110)

Proof. Define a function

𝝍(𝐱,𝐮)=𝐮βV𝐮(𝐱,𝐮),\begin{array}[]{lllllllllll}{\bm{\psi}}({\bf x},{\bf u})={\bf u}-\beta\displaystyle\frac{\partial V}{\partial{\bf u}}({\bf x},{\bf u}),\end{array} (111)

where β>0\beta>0 is chosen so that the matrix norm satisfies

Iβ2V𝐮22L<1,𝐱D,𝐮𝒰,\begin{array}[]{lllllllllll}\left\lVert I-\beta\displaystyle\frac{\partial^{2}V}{\partial{\bf u}^{2}}\right\rVert_{2}\leq L<1,&{\bf x}\in D,{\bf u}\in{\cal U},\end{array} (112)

for some constant LL. This is always possible because 𝒰{\cal U} is bounded and H(𝐱,𝐮)H({\bf x},{\bf u}) is positive definite and continuous. For instance, one can choose

β=1max{1,2λmax},\begin{array}[]{lllllllllll}\beta=\displaystyle\frac{1}{\max\{1,2\lambda_{max}\}},\end{array} (113)

In this case, we can prove that

L=1λminmax{1,2λmax}\begin{array}[]{lllllllllll}L=1-\displaystyle\frac{\lambda_{min}}{\max\{1,2\lambda_{max}\}}\end{array} (114)

satisfies (112). By the contraction mapping theorem, for every 𝐱D{\bf x}\in D, there is a fixed point 𝐮(𝐱){\bf u}^{\ast}({\bf x}) such that

𝝍(𝐱,𝐮(𝐱))=𝐮(𝐱).\begin{array}[]{lllllllllll}{\bm{\psi}}({\bf x},{\bf u}^{\ast}({\bf x}))={\bf u}^{\ast}({\bf x}).\end{array} (115)

By the definition of 𝝍{\bm{\psi}} in (111), this implies that

V𝐮(𝐱,𝐮(𝐱))=0,\begin{array}[]{lllllllllll}\displaystyle\frac{\partial V}{\partial{\bf u}}({\bf x},{\bf u}^{\ast}({\bf x}))=0,\end{array} (116)

i.e., the fixed point is the optimal control. From Proposition 2, for any integer K>0K>0, we have

|(𝝍(𝐱,))K(𝐮0)𝐮(𝐱)|=|(𝝍(𝐱,))K(𝐮0)(𝝍(𝐱,))K(𝐮(𝐱))|LK𝐮0𝐮(𝐱)2γLK.\begin{array}[]{lllllllllll}\left|\left({\bm{\psi}}({\bf x},\cdot)\right)^{K}({\bf u}_{0})-{\bf u}^{\ast}({\bf x})\right|\\ =\left|\left({\bm{\psi}}({\bf x},\cdot)\right)^{K}({\bf u}_{0})-\left({\bm{\psi}}({\bf x},\cdot)\right)^{K}({\bf u}^{\ast}({\bf x}))\right|\\ \leq L^{K}\left\lVert{\bf u}_{0}-{\bf u}^{\ast}({\bf x})\right\rVert_{2}\\ \leq\gamma L^{K}.\end{array} (117)

This implies that (𝝍(𝐱,))K(𝐮0)𝒰\left({\bm{\psi}}({\bf x},\cdot)\right)^{K}({\bf u}_{0})\in{\cal U}. Let 𝝍NN(𝐱,𝐮){\bm{\psi}}^{NN}({\bf x},{\bf u}) be a deep neural network approximation (to be constructed) of 𝝍{\bm{\psi}} with an error bound

𝝍(𝐱,𝐮)𝝍NN(𝐱,𝐮)2e2.\begin{array}[]{lllllllllll}\left\lVert{\bm{\psi}}({\bf x},{\bf u})-{\bm{\psi}}^{NN}({\bf x},{\bf u})\right\rVert_{2}\leq e_{2}.\end{array} (118)

From Proposition 4,

(𝝍(𝐱,))K(𝐮0)(𝝍NN(𝐱,))K(𝐮0)2LK1L1e2.\begin{array}[]{lllllllllll}\left\lVert\left({\bm{\psi}}({\bf x},\cdot)\right)^{K}({\bf u}_{0})-\left({\bm{\psi}}^{NN}({\bf x},\cdot)\right)^{K}({\bf u}_{0})\right\rVert_{2}\\ \leq\displaystyle\frac{L^{K}-1}{L-1}e_{2}.\end{array} (119)

The inequalities (117) and (119) and Proposition 5 imply

|(𝝍NN(𝐱,))K(𝐮0)𝐮(𝐱)|γLK+LK1L1e2.\begin{array}[]{lllllllllll}\left|\left({\bm{\psi}}^{NN}({\bf x},\cdot)\right)^{K}({\bf u}_{0})-{\bf u}^{\ast}({\bf x})\right|\\ \leq\gamma L^{K}+\displaystyle\frac{L^{K}-1}{L-1}e_{2}.\end{array} (120)

Next, we construct 𝝍NN{\bm{\psi}}^{NN} and derive e2e_{2}. Let h>0h>0 be a step size for finite differences. Then

|Vuj1h(V(uj+h)V(uj))|C1h,1jq,\begin{array}[]{lllllllllll}\left|V_{u_{j}}-\displaystyle\frac{1}{h}\left(V(\cdots u_{j}+h\cdots)-V(\cdots u_{j}\cdots)\right)\right|\leq C_{1}h,&1\leq j\leq q,\end{array} (121)

where

C1=maxxD,u𝒰{2Vuj2(𝐱,𝐮)},1jq.\begin{array}[]{lllllllllll}C_{1}=\displaystyle\max_{x\in D,u\in{\cal U}}\left\{\displaystyle\frac{\partial^{2}V}{\partial u_{j}^{2}}({\bf x},{\bf u})\right\},&1\leq j\leq q.\end{array} (122)

Define

\begin{array}[]{lllllllllll}\tilde{\bm{\psi}}({\bf x},{\bf u})={\bf u}-\beta\left[\begin{array}[]{ccc}\displaystyle\frac{1}{h}\left(V(\cdots u_{1}+h\cdots)-V(\cdots u_{1}\cdots)\right)\\ \vdots\\ \displaystyle\frac{1}{h}\left(V(\cdots u_{j}+h\cdots)-V(\cdots u_{j}\cdots)\right)\\ \vdots\\ \displaystyle\frac{1}{h}\left(V(\cdots u_{q}+h\cdots)-V(\cdots u_{q}\cdots)\right)\end{array}\right].\end{array} (123)

Then

𝝍(𝐱,𝐮)𝝍~(𝐱,𝐮)2βC1qh.\begin{array}[]{lllllllllll}\left\lVert{\bm{\psi}}({\bf x},{\bf u})-\tilde{\bm{\psi}}({\bf x},{\bf u})\right\rVert_{2}\leq\beta C_{1}\sqrt{q}h.\end{array} (124)

By assumption, VNNV^{NN} is a deep neural network that approximates VV satisfying

|V(𝐱,𝐮)VNN(𝐱,𝐮)|e1.\begin{array}[]{lllllllllll}\left|V({\bf x},{\bf u})-V^{NN}({\bf x},{\bf u})\right|\leq e_{1}.\end{array} (125)

We define 𝝍NN{\bm{\psi}}^{NN} by substituting VNNV^{NN} for VV in 𝝍~\tilde{\bm{\psi}}. Then

𝝍~(𝐱,𝐮)𝝍NN(𝐱,𝐮)22βhqe1.\begin{array}[]{lllllllllll}\left\lVert\tilde{\bm{\psi}}({\bf x},{\bf u})-{\bm{\psi}}^{NN}({\bf x},{\bf u})\right\rVert_{2}\leq 2\displaystyle\frac{\beta}{h}\sqrt{q}e_{1}.\end{array} (126)

From Proposition 5, (124) and (126) imply

𝝍(𝐱,𝐮)𝝍NN(𝐱,𝐮)2βC1qh+2βhqe1.\begin{array}[]{lllllllllll}\left\lVert{\bm{\psi}}({\bf x},{\bf u})-{\bm{\psi}}^{NN}({\bf x},{\bf u})\right\rVert_{2}\\ \leq\beta C_{1}\sqrt{q}h+2\displaystyle\frac{\beta}{h}\sqrt{q}e_{1}.\end{array} (127)

Using the right side of the above inequality as e2e_{2}, (120) implies

(𝝍NN(𝐱,))K(𝐮0)𝐮(𝐱)pγLK+LK1L1(βC1qh+2βhqe1)γLK+K(βC1qh+2βhqe1),(because L<1)γLK+C1qKh+2qKh1e1,\begin{array}[]{lllllllllll}\left\lVert\left({\bm{\psi}}^{NN}({\bf x},\cdot)\right)^{K}({\bf u}_{0})-{\bf u}^{\ast}({\bf x})\right\rVert_{p}\\ \leq\gamma L^{K}+\displaystyle\frac{L^{K}-1}{L-1}\left(\beta C_{1}\sqrt{q}h+2\displaystyle\frac{\beta}{h}\sqrt{q}e_{1}\right)\\ \leq\gamma L^{K}+K\left(\beta C_{1}\sqrt{q}h+2\displaystyle\frac{\beta}{h}\sqrt{q}e_{1}\right),&(\mbox{because }L<1)\\ \leq\gamma L^{K}+C_{1}\sqrt{q}Kh+2\sqrt{q}Kh^{-1}e_{1},\\ \end{array} (128)

The parameter β\beta is removed from the inequality because β1\beta\leq 1. If we define

𝐮NN(𝐱):=(𝝍NN(𝐱,))K,\begin{array}[]{lllllllllll}{\bf u}^{\ast NN}({\bf x}):=\left({\bm{\psi}}^{NN}({\bf x},\cdot)\right)^{K},\end{array} (129)

then (128) is equivalent to (108). From the structure of 𝝍~\tilde{\bm{\psi}}, we know that the complexity of 𝐮NN{\bf u}^{\ast NN} is bounded by

2n1qK.\begin{array}[]{lllllllllll}2n_{1}qK.\end{array} (130)

\blacklozenge
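The construction in the proof above is itself an algorithm: K steps of the damped iteration u ← u − β D_h V(x,u), where D_h is the forward-difference gradient (123) applied to a surrogate of V. A minimal sketch follows, with an exact convex toy cost standing in for the surrogate V^NN (so e1 = 0 and the remaining error comes from the contraction and the finite-difference step).

```python
# Hedged sketch: the iteration behind Theorem 5 -- K steps of
# u <- u - beta * D_h V(x, u), where D_h is the forward-difference gradient
# (123) with step h, applied to a surrogate of V.  Here the "surrogate" is an
# exact convex toy cost V(x, u) = ||u - x||^2 / 2, so the fixed point is u*(x) = x.
import numpy as np

def fd_gradient(V, x, u, h):
    # forward differences in each control coordinate, as in (123)
    g = np.empty_like(u)
    for j in range(u.size):
        e = np.zeros_like(u)
        e[j] = h
        g[j] = (V(x, u + e) - V(x, u)) / h
    return g

def iterate_control(V, x, u0, beta, h, K):
    u = np.array(u0, dtype=float)
    for _ in range(K):                       # contraction (psi(x, .))^K(u0), cf. (117)
        u = u - beta * fd_gradient(V, x, u, h)
    return u

V = lambda x, u: 0.5 * np.sum((u - x) ** 2)  # toy surrogate; lambda_min = lambda_max = 1
x = np.array([0.3, -0.8])
u = iterate_control(V, x, u0=np.zeros(2), beta=0.5, h=1e-4, K=60)  # beta as in (113)
print("||u - u*(x)|| =", np.linalg.norm(u - x))   # small, consistent with (108)
```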

Consider a control system

𝐱˙=𝐟(𝐱,𝐮),𝐱Dd,𝐮q,t[0,T],\begin{array}[]{lllllllllll}\dot{\bf x}={\bf f}({\bf x},{\bf u}),&{\bf x}\in D\subset\mathbb{R}^{d},{\bf u}\in\mathbb{R}^{q},t\in[0,T],\end{array} (131)

where DdD\subset\mathbb{R}^{d} is a bounded closed set. We assume a zero-order hold (ZOH) control input function. Specifically, given any positive integer Nt>0N_{t}>0. Let Δt=T/Nt{\it\Delta}t=T/N_{t} and tk=kΔtt_{k}=k{\it\Delta}t for 0kNt0\leq k\leq N_{t}. Then the control input 𝐮(t){\bf u}(t) is a step function that takes a constant value in each time interval [tk1,tk][t_{k-1},t_{k}]. Let ϕ(t;𝐮(),𝐱0){\bm{\phi}}(t;{\bf u}(\cdot),{\bf x}_{0}) be the trajectory of the control system (131) with the initial condition 𝐱(0)=𝐱0{\bf x}(0)={\bf x}_{0} and a step function 𝐮(t){\bf u}(t) as the control input. For a fixed Δt{\it\Delta}t, we can treat 𝐮(t){\bf u}(t) as a long vector UqNtU\in\mathbb{R}^{qN_{t}}, where U=[𝐮1T𝐮2T𝐮NtT]TU=\left[\begin{array}[]{rrrrrrrrrrrrrrrrrrrr}{\bf u}_{1}^{T}&{\bf u}_{2}^{T}&\cdots&{\bf u}_{N_{t}}^{T}\end{array}\right]^{T} and 𝐮k{\bf u}_{k} is the value of 𝐮(t){\bf u}(t), t[tk1,tk]t\in[t_{k-1},t_{k}]. Define

Φ(𝐱,U)=ϕ(T;𝐮(),𝐱)=ϕ(Δt;𝐮Nt,)ϕ(Δt;𝐮Nt1,)ϕ(Δt;𝐮1,𝐱),𝐱D.\begin{array}[]{lllllllllll}\begin{aligned} \Phi({\bf x},U)&={\bm{\phi}}(T;{\bf u}(\cdot),{\bf x})\\ &={\bm{\phi}}({\it\Delta}t;{\bf u}_{N_{t}},\cdot)\circ{\bm{\phi}}({\it\Delta}t;{\bf u}_{N_{t}-1},\cdot)\circ\cdots\circ{\bm{\phi}}({\it\Delta}t;{\bf u}_{1},{\bf x}),&{\bf x}\in D.\end{aligned}\end{array} (132)

If f(𝐱,𝐮)f({\bf x},{\bf u}) is C2C^{2} in (𝐱,𝐮)({\bf x},{\bf u}), then Φ(𝐱,U)\Phi({\bf x},U) is a C2C^{2} function of (𝐱,U)({\bf x},U). Let Ψ:d\Psi:\mathbb{R}^{d}\rightarrow\mathbb{R} be a C2C^{2} function. Define the cost function as follows:

J(𝐱,U)=ΨΦ(𝐱,U).\begin{array}[]{lllllllllll}J({\bf x},U)=\Psi\circ\Phi({\bf x},U).\end{array} (133)

The problem of ZOH optimal control is

minUqNtJ(𝐱,U).\begin{array}[]{lllllllllll}\displaystyle\min_{U\in\mathbb{R}^{qN_{t}}}J({\bf x},U).\end{array} (134)

Given any 𝐱D{\bf x}\in D, let U(𝐱)U^{\ast}({\bf x}) represent the optimal control. Let 𝒰qNt{\cal U}\subset\mathbb{R}^{qN_{t}} be the set defined in (106) where 𝐮{\bf u} is replaced by UU. Suppose 𝐟(𝐱,𝐮){\bf f}({\bf x},{\bf u}) in (131) is defined for all 𝐱[R,R]d{\bf x}\in[-R,R]^{d} where RR is large enough so that

ϕ(Δt;𝐮k,)ϕ(Δt;𝐮k1,)ϕ(Δt;𝐮1,𝐱)[R,R]d,1kNt,\begin{array}[]{lllllllllll}{\bm{\phi}}({\it\Delta}t;{\bf u}_{k},\cdot)\circ{\bm{\phi}}({\it\Delta}t;{\bf u}_{k-1},\cdot)\circ\cdots\circ{\bm{\phi}}({\it\Delta}t;{\bf u}_{1},{\bf x})\in[-R,R]^{d},&1\leq k\leq N_{t},\end{array} (135)

for all U𝒰U\in{\cal U} and 𝐱D{\bf x}\in D. For the simplicity of derivation in the following theorem, we assume Δt2{\it\Delta}t\leq 2.
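The map Φ in (132) is a composition of Nt interval flow maps with the control held constant on each interval, and J in (133) is the terminal cost Ψ evaluated at the endpoint. A minimal sketch is given below, with the exact flow ϕ(Δt;𝐮,){\bm{\phi}}({\it\Delta}t;{\bf u},\cdot) replaced by a few Euler substeps and with an illustrative system f and cost Ψ of our own.

```python
# Hedged sketch: the ZOH rollout Phi(x, U) of (132) and the cost
# J(x, U) = Psi(Phi(x, U)) of (133).  The exact flow phi(Delta t; u_k, .) is
# replaced by Euler substeps purely to keep the sketch self-contained; the
# system f and the terminal cost Psi below are illustrative choices.
import numpy as np

def flow(f, x, u, dt, substeps=20):
    # crude stand-in for phi(dt; u, x): Euler substeps with the control held constant
    h = dt / substeps
    for _ in range(substeps):
        x = x + h * f(x, u)
    return x

def Phi(f, x, U, T):
    Nt = U.shape[0]                    # U stacks the ZOH values u_1, ..., u_Nt row-wise
    dt = T / Nt
    for k in range(Nt):                # composition of the Nt interval flow maps, cf. (132)
        x = flow(f, x, U[k], dt)
    return x

f   = lambda x, u: -x + u              # illustrative linear control system
Psi = lambda z: 0.5 * np.sum(z ** 2)   # illustrative terminal cost

x0 = np.array([1.0, -2.0])
U  = np.zeros((10, 2))                 # Nt = 10 intervals, q = 2 controls per interval
print("J(x0, U) =", Psi(Phi(f, x0, U, T=1.0)))
```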

Theorem 6

Suppose that 𝐟{\bf f} in the control system (131) and Ψ\Psi in the cost function (133) are both compositional functions that satisfy A1. Assume that all nodes in 𝒢𝐟{\cal G}^{\bf f} and 𝒢Ψ{\cal G}^{\Psi} are C2C^{2}. Assume that the Hessian of J(𝐱,U)J({\bf x},U) with respect to UU is positive definite for all (𝐱,U)D×𝒰({\bf x},U)\in D\times{\cal U}. Let rmax=max{rmax𝐟,rmaxΨ}r_{max}=\max\{r^{\bf f}_{max},r^{\Psi}_{max}\} and γ\gamma be the diameter of 𝒰{\cal U}. Then, for any ϵ>0\epsilon>0, there exists a deep neural network, UNNU^{\ast NN}, that approximates UU^{\ast}. The estimation error is bounded by

UNN(𝐱)U(𝐱)23ϵ,𝐱D.\begin{array}[]{lllllllllll}\left\lVert U^{\ast NN}({\bf x})-U^{\ast}({\bf x})\right\rVert_{2}\leq 3\epsilon,&{\bf x}\in D.\end{array} (136)

The complexity of UNNU^{\ast NN} is bounded by

nC(q(|𝒱G𝐟|+|𝒱GΨ|))(rmax+1+rmax/rmax𝐟)ϵ(4rmax+1+4rmax/rmax𝐟)\begin{array}[]{lllllllllll}n\leq C\left(q(\left|{\cal V}^{\bf f}_{G}\right|+\left|{\cal V}^{\Psi}_{G}\right|)\right)^{(r_{max}+1+r_{max}/r^{\bf f}_{max})}\epsilon^{-(4r_{max}+1+4r_{max}/r^{\bf f}_{max})}\end{array} (137)

for some constant C>0C>0.

Remark 7

The inequality (137) is a simplified version that contains only three explicitly presented parameters, qq, 𝒱G𝐟{\cal V}^{\bf f}_{G} and 𝒱GΨ{\cal V}^{\Psi}_{G}. For fixed Δt2{\it\Delta}t\leq 2, the constant CC in (137) depends on NtN_{t}, di,j𝐟d_{i,j}^{\bf f}, di,jΨd_{i,j}^{\Psi}, mi,j𝐟m_{i,j}^{\bf f}, mi,jΨm_{i,j}^{\Psi}, Lmax𝐟L^{\bf f}_{max}, LmaxΨL^{\Psi}_{max}, Λmax𝐟\Lambda^{\bf f}_{max}, ΛmaxΨ\Lambda^{\Psi}_{max}, the Lipschitz constants of 𝐟{\bf f}, Ψ\Psi and ϕ(Δt;,){\bm{\phi}}({\it\Delta}t;\cdot,\cdot) with respect to 𝐱{\bf x} for (𝐱,𝐮)D×𝒰({\bf x},{\bf u})\in D\times{\cal U}, the norms of 𝐟{\bf f} and its Jacobian, the eigenvalues and diagonal entries of the Hessian matrix of J(𝐱,U)J({\bf x},U), and the diameter of 𝒰{\cal U}. We avoid showing an explicit expression of CC as a function of these parameters because such a formula would be too long. However, following the proof, an algebraic expression of the complexity upper bound in terms of all these parameters can be derived in detail if needed.

Proof. Suppose, for the moment, that there is a deep neural network $J^{NN}({\bf x},U)$ (constructed later in the proof) that approximates $J({\bf x},U)$ with an error upper bound $e_{1}$, i.e.,

\left|J^{NN}({\bf x},U)-J({\bf x},U)\right|\leq e_{1},\qquad {\bf x}\in D,\ U\in{\cal U}. \qquad (138)

From Theorem 5, there is a deep neural network $U^{\ast NN}$ that approximates the optimal control $U^{\ast}$ with an error upper bound

\left\lVert U^{\ast NN}({\bf x})-U^{\ast}({\bf x})\right\rVert_{2}\leq\gamma L^{K}+C_{1}\sqrt{qN_{t}}Kh+2\sqrt{qN_{t}}Kh^{-1}e_{1},\qquad {\bf x}\in D, \qquad (139)

where $0<L<1$ is a constant determined by the eigenvalues of the Hessian of $J({\bf x},U)$, $\gamma$ is the diameter of the set $\{U^{\ast}({\bf x}),\ {\bf x}\in D\}$, and $C_{1}$ depends on the diagonal of the Hessian of $J({\bf x},U)$. Given any $\epsilon>0$, for the first term in (139) to be bounded by $\epsilon$ we need $\gamma L^{K}\leq\epsilon$. If $\epsilon\geq\gamma$, this holds for any $K$ because $L<1$. If $\epsilon<\gamma$, setting $\gamma L^{K}=\epsilon$ yields

K=\displaystyle\frac{\ln\displaystyle\frac{\gamma}{\epsilon}}{\ln\displaystyle\frac{1}{L}}\leq\left(\ln\displaystyle\frac{1}{L}\right)^{-1}\gamma\,\displaystyle\frac{1}{\epsilon}, \qquad (140)

where the last inequality follows from $\ln t\leq t$ for all $t>0$.

Accordingly, we can choose

K=\bar{C}_{1}\epsilon^{-1} \qquad (141)

for some constant $\bar{C}_{1}>0$ determined by $\gamma$ and the eigenvalues of the Hessian of $J({\bf x},U)$. For the second term in (139) to be bounded by $\epsilon$, we require

C_{1}\sqrt{qN_{t}}Kh=C_{1}\bar{C}_{1}\sqrt{qN_{t}}\epsilon^{-1}h\leq\epsilon. \qquad (142)

We choose

h=\displaystyle\frac{\bar{C}_{2}}{\sqrt{qN_{t}}}\epsilon^{2} \qquad (143)

with $\bar{C}_{2}=\displaystyle\frac{1}{C_{1}\bar{C}_{1}}$. For the third term in (139) to be less than or equal to $\epsilon$, we require

2\sqrt{qN_{t}}Kh^{-1}e_{1}=2\sqrt{qN_{t}}\bar{C}_{1}\epsilon^{-1}\displaystyle\frac{\sqrt{qN_{t}}}{\bar{C}_{2}}\epsilon^{-2}e_{1}\leq\epsilon, \qquad (144)

or, equivalently,

e_{1}\leq\displaystyle\frac{\bar{C}_{3}}{qN_{t}}\epsilon^{4}, \qquad (145)

where $\bar{C}_{3}=\displaystyle\frac{1}{2C_{1}\bar{C}_{1}^{2}}$. To summarize, we have proved that the existence of a $J^{NN}$ satisfying (138) with $e_{1}$ as in (145) implies

\left\lVert U^{\ast NN}({\bf x})-U^{\ast}({\bf x})\right\rVert_{2}\leq 3\epsilon,\qquad {\bf x}\in D. \qquad (146)
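The bookkeeping behind the choices (141), (143) and (145) can be checked numerically. In the sketch below, the values of $\gamma$, $L$, $C_{1}$ and $qN_{t}$ are illustrative placeholders; the script computes $K$, $h$ and $e_{1}$ for a given $\epsilon$ and verifies that each of the three terms in (139) is at most $\epsilon$.

\begin{verbatim}
import numpy as np

# Illustrative constants only; gamma, L, C1 and q*Nt are placeholders,
# not values derived from the paper.
gamma, L, C1, qNt = 2.0, 0.5, 1.5, 40.0
eps = 1e-2

C1bar = gamma / np.log(1.0 / L)          # constant in the choice (141), cf. (140)
K = C1bar / eps                          # choice (141)
C2bar = 1.0 / (C1 * C1bar)
h = C2bar * eps**2 / np.sqrt(qNt)        # choice (143)
C3bar = 1.0 / (2.0 * C1 * C1bar**2)
e1 = C3bar * eps**4 / qNt                # accuracy (145) required of J^NN

term1 = gamma * L**K                     # first term of (139)
term2 = C1 * np.sqrt(qNt) * K * h        # second term of (139)
term3 = 2.0 * np.sqrt(qNt) * K * e1 / h  # third term of (139)
for name, t in [("term1", term1), ("term2", term2), ("term3", term3)]:
    print(name, t, t <= eps * (1 + 1e-12))   # each term is bounded by eps
\end{verbatim}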

Now we need to prove the existence of a $J^{NN}$ whose estimation error $e_{1}$ satisfies (145). Suppose we have deep neural networks $\Psi^{NN}(\cdot)$ and ${\bm{\phi}}^{NN}({\it\Delta}t;\cdot,\cdot)$ that approximate $\Psi(\cdot)$ and ${\bm{\phi}}({\it\Delta}t;\cdot,\cdot)$, respectively. Let $J^{NN}$ be the deep neural network obtained by substituting $\Psi^{NN}(\cdot)$ and ${\bm{\phi}}^{NN}({\it\Delta}t;\cdot,\cdot)$ for the corresponding functions in (132)-(133). From Proposition 4, the estimation error $e_{1}$ of $J^{NN}$ is bounded by

e_{1}\leq\bar{C}_{4}\displaystyle\max_{{\bf x}\in[-R,R]^{d},\,{\bf u}\in{\cal U}}\left\lVert{\bm{\phi}}({\it\Delta}t;{\bf u},{\bf x})-{\bm{\phi}}^{NN}({\it\Delta}t;{\bf u},{\bf x})\right\rVert_{2}+\displaystyle\max_{{\bf z}\in[-R,R]^{d}}\left\lVert\Psi({\bf z})-\Psi^{NN}({\bf z})\right\rVert_{2}, \qquad (147)

where $\bar{C}_{4}$ depends on $N_{t}$ and the Lipschitz constants of ${\bm{\phi}}({\it\Delta}t;\cdot,\cdot)$ and $\Psi$ with respect to ${\bf x}$. From Theorem 2 and Theorem 4, we can find $\Psi^{NN}(\cdot)$ and ${\bm{\phi}}^{NN}({\it\Delta}t;\cdot,\cdot)$ so that

e_{1}\leq\bar{C}_{5}\left|{\cal V}^{\Psi}_{G}\right|(n_{width})^{-1/r^{\Psi}_{max}}+\bar{C}_{6}\left|{\cal V}^{\bf f}_{G}\right|{\it\Delta}t\,(n_{width})^{-1/r^{\bf f}_{max}}\leq\bar{C}_{7}\left(\left|{\cal V}^{\Psi}_{G}\right|+\left|{\cal V}^{\bf f}_{G}\right|{\it\Delta}t\right)(n_{width})^{-1/r_{max}}, \qquad (148)

where $\bar{C}_{7}$ depends on $\bar{C}_{4}$, $d_{i,j}^{\bf f}$, $d_{i,j}^{\Psi}$, $m_{i,j}^{\bf f}$, $m_{i,j}^{\Psi}$, $L^{\bf f}_{max}$, $L^{\Psi}_{max}$, $\Lambda^{\bf f}_{max}$, $\Lambda^{\Psi}_{max}$, and the norms of ${\bf f}$ and its Jacobian. It also depends on $e^{\alpha{\it\Delta}t}$ (see (87)); however, for a fixed ${\it\Delta}t$, this factor has an upper bound determined by the norm of the Jacobian of ${\bf f}$. To satisfy (145), by (148) it suffices to choose

n_{width}=\bar{C}_{8}q^{r_{max}}\left(\left|{\cal V}^{\Psi}_{G}\right|+\left|{\cal V}^{\bf f}_{G}\right|{\it\Delta}t\right)^{r_{max}}\epsilon^{-4r_{max}}, \qquad (149)

where $\bar{C}_{8}=(N_{t}\bar{C}_{7}/\bar{C}_{3})^{r_{max}}$. From Theorem 2, Theorem 4 and (149), the complexity of $J^{NN}$, denoted by $n_{1}$, satisfies

\begin{aligned}
n_{1}&\leq N_{t}\left((n_{width})^{1/r^{\bf f}_{max}}+1\right)n_{width}\left|{\cal V}^{\bf f}_{G}\right|+n_{width}\left|{\cal V}^{\Psi}_{G}\right|\\
&\leq(n_{width})^{1+1/r^{\bf f}_{max}}\left(2N_{t}\left|{\cal V}^{\bf f}_{G}\right|+\left|{\cal V}^{\Psi}_{G}\right|\right)\\
&\leq\bar{C}_{9}q^{r_{max}(1+1/r^{\bf f}_{max})}\left(\left|{\cal V}^{\Psi}_{G}\right|+\left|{\cal V}^{\bf f}_{G}\right|{\it\Delta}t\right)^{r_{max}(1+1/r^{\bf f}_{max})}\left(2N_{t}\left|{\cal V}^{\bf f}_{G}\right|+\left|{\cal V}^{\Psi}_{G}\right|\right)\epsilon^{-4r_{max}(1+1/r^{\bf f}_{max})}\\
&\leq\displaystyle\frac{\bar{C}_{9}}{2N_{t}}q^{r_{max}(1+1/r^{\bf f}_{max})}\left(\left|{\cal V}^{\Psi}_{G}\right|+\left|{\cal V}^{\bf f}_{G}\right|\right)^{r_{max}(1+1/r^{\bf f}_{max})+1}\epsilon^{-4r_{max}(1+1/r^{\bf f}_{max})},
\end{aligned} \qquad (150)

where $\bar{C}_{9}=\bar{C}_{8}^{1+1/r^{\bf f}_{max}}$. From (110) in Theorem 5, the complexity of $U^{\ast NN}$ is bounded by

\begin{aligned}
n&\leq 2n_{1}qN_{t}K\\
&\leq 2\,\displaystyle\frac{\bar{C}_{9}}{2N_{t}}q^{r_{max}(1+1/r^{\bf f}_{max})}\left(\left|{\cal V}^{\bf f}_{G}\right|+\left|{\cal V}^{\Psi}_{G}\right|\right)^{r_{max}(1+1/r^{\bf f}_{max})+1}\epsilon^{-4r_{max}(1+1/r^{\bf f}_{max})}\,qN_{t}\bar{C}_{1}\epsilon^{-1}\\
&\leq Cq^{(r_{max}+1+r_{max}/r^{\bf f}_{max})}\left(\left|{\cal V}^{\bf f}_{G}\right|+\left|{\cal V}^{\Psi}_{G}\right|\right)^{(r_{max}+1+r_{max}/r^{\bf f}_{max})}\epsilon^{-(4r_{max}+1+4r_{max}/r^{\bf f}_{max})},
\end{aligned} \qquad (151)

where $C=\bar{C}_{1}\bar{C}_{9}$. This completes the proof.
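The construction of $J^{NN}$ in the proof composes surrogates of the one-step flow map and the terminal cost. The sketch below imitates this construction on a toy system and measures the resulting cost error, in the spirit of (147); the dynamics, the terminal cost, the perturbations standing in for ${\bm{\phi}}^{NN}$ and $\Psi^{NN}$, and the error level $\delta$ are all hypothetical.

\begin{verbatim}
# A minimal sketch of the J^NN construction: the one-step flow and the
# terminal cost are replaced by perturbed surrogates standing in for
# phi^NN and Psi^NN, and the resulting cost error is compared with the
# per-component error, in the spirit of (147). All functions are toy examples.
import numpy as np

d, q, Nt, Dt = 2, 1, 10, 0.1
rng = np.random.default_rng(1)

def f(x, u):
    return np.array([x[1], -x[0] + u[0]])

def phi(u, x):                       # Euler step standing in for the exact flow
    return x + Dt * f(x, u)

def Psi(z):
    return float(z @ z)

delta = 1e-3                         # placeholder surrogate accuracy
def phi_NN(u, x):                    # perturbed flow (error of size ~delta)
    return phi(u, x) + delta * np.sin(x)

def Psi_NN(z):                       # perturbed terminal cost
    return Psi(z) + delta * np.cos(z[0])

def cost(x, U, step, terminal):      # J built by composing the given maps
    for k in range(Nt):
        x = step(U[k * q:(k + 1) * q], x)
    return terminal(x)

errs = []
for _ in range(200):
    x = rng.uniform(-1, 1, size=d)
    U = rng.uniform(-1, 1, size=q * Nt)
    errs.append(abs(cost(x, U, phi, Psi) - cost(x, U, phi_NN, Psi_NN)))
print("max |J - J^NN| over samples:", max(errs))
\end{verbatim}

Consistent with (147), the observed cost error is of the same order as the surrogate error $\delta$, amplified only by constants of the kind collected in $\bar{C}_{4}$.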

7 Conclusions

In this paper, we introduce new concepts and develop an algebraic framework for compositional functions. In addition to functions defined on $\mathbb{R}^{d}$, iterative numerical algorithms and dynamical systems also fall into the category of compositional functions. The algebraic framework forms a unified theoretical foundation that supports the study of neural networks with applications to a variety of problems including differential equations, optimization, and optimal control. It is proved that the error upper bounds of neural network approximations are determined by a set of key features of compositional functions. In the approximation of functions, differential equations, and optimal control, the complexity of neural networks is bounded by a polynomial function of the key features and the error tolerance. The results also shed light on why neural network approximations may help to avoid the curse of dimensionality: as the examples show, increasing the dimension of a system does not necessarily increase the values of its key features.

The study in this paper raises more questions than it answers. All neural networks used in the proofs of the theorems have layered DAGs that are similar to the DAG of the function being approximated. What is the underlying connection between the shallow neural networks that approximate individual nodes and the neural network that approximates the overall function? How can this connection help to generate initial guesses of parameters for neural network training? To what extent can the ideas in this paper be applied beyond $C^{1}$ functions to a larger family that includes $C^{0}$ or even discontinuous functions? This is important because optimal control problems with constraints are likely to have feedback laws that are merely $C^{0}$ or discontinuous. The DAG associated with a given function is not unique. What types of DAGs are advantageous for more accurate approximation with lower complexity? In general, many problems of nonlinear control, such as finding Lyapunov functions, designing output regulators, $H_{\infty}$ control, and differential games, do not have tractable numerical solutions when the dimension of the state space is high. Finding compositional function representations of these problems may lead to innovative and practical solutions based on deep learning.

References

  • [1] P. M. Anderson and A. A. Fouad. Power Systems Control and Stability, IEEE Press, 1994.
  • [2] A. R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Trans. on Information Theory, Vol. 39 (3), pp. 930-945, 1993.
  • [3] A. R. Barron and J. M. Klusowski, Approximation and estimation for high-dimensional deep learning networks. arXiv:1809.03090, 2018.
  • [4] C. Chiu and J. Zhan, An evolutionary approach to compact DAG neural network optimization, IEEE Access, Vol. 7, pp. 178331-178341, 2019.
  • [5] W. E, C. Ma, S. Wojtowytsch and L. Wu, Towards a mathematical understanding of neural network-based machine learning: what we know and what we don’t, arXiv:2009.10713v3, 2020.
  • [6] W. E and S. Wojtowytsch, On the Banach spaces associated with multi-layer ReLU networks: function representation, approximation theory and gradient descent dynamics, arXiv:2007.15623v1, 2020.
  • [7] P. Giesl, B. Hamzi, M. Rasmussen and K. Webster. Approximation of Lyapunov functions from noisy data, Journal of Computational Dynamics, Vol. 7(1), pp 57-81, 2020.
  • [8] L. Grüne, Overcoming the curse of dimensionality for approximating Lyapunov functions with deep neural networks under a small-gain condition, arXiv:2001.08423v3, 2020.
  • [9] E. Hairer, S. P. Nørsett, G. Wanner, Solving Ordinary Differential Equations I: Nonstiff Problems, second edition, Springer, 1993.
  • [10] B. Hamzi and H. Owhadi. Learning dynamical systems from data: a simple cross-validation perspective, arXiv.org/abs/2007.05074, 2020.
  • [11] J. Han, A. Jentzen and W. E. Solving high-dimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
  • [12] J. Han and W. E. Deep learning approximation for stochastic control problems, arXiv:1611.07422v1, 2016.
  • [13] P. C. Kainen, V. Kůrková, M. Sanguineti, Approximating multivariable functions by feedforward neural nets. In: M. Bianchini, M. Maggini, L. Jain (eds), Handbook on Neural Information Processing, Intelligent Systems Reference Library, Vol. 49, pp. 143-181, Springer, 2013.
  • [14] W. Kang, Q. Gong and T. Nakamura-Zimmerer, Algorithms of data development for deep learning and feedback design, arXiv:1912.00492v2, 2020.
  • [15] A. N. Kolmogorov, On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition, In Doklady Akademii Nauk, Vol. 114, No. 5, pp. 953-956, Russian Academy of Sciences, 1957.
  • [16] V. Kůrková, Kolmogorov’s theorem and multilayer neural networks, Neural networks, Vol. 5, No. 3, pp. 501-506, 1992.
  • [17] E. Lorenz, Predictability – A problem partly solved, Seminar on Predictability, Vol. I, ECMWF, 1996.
  • [18] H. N. Mhaskar, Neural networks for optimal approximation of smooth and analytic functions, Neural Computation, Vol. 8, No. 1, pp. 164-177, 1996.
  • [19] H. N. Mhaskar and T. Poggio, Deep vs. shallow networks: an approximation theory perspective, arXiv: 1608.03287v1, 2016.
  • [20] H. N. Mhaskar and T. Poggio, Function approximation by deep networks, arXiv:1905.12882v2, 2019.
  • [21] T. Nakamura-Zimmerer, Q. Gong and W. Kang, Adaptive deep learning for high dimensional Hamilton-Jacobi-Bellman equations, arXiv:1907.05317, 2019.
  • [22] T. Nakamura-Zimmerer, Q. Gong and W. Kang, QRnet: optimal regulator design with LQR-augmented neural networks, IEEE Control Systems Letters, Vol. 5, No. 4, pp.1303-1308, 2020.
  • [23] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, Q. Liao, Why and when can deep - but not shallow - networks avoid the curse of dimensionality: a review, arXiv:1611.00740v5, 2017.
  • [24] J. Qi, K. Sun, and W. Kang, Optimal PMU placement for power system dynamic state estimation by using empirical observability Gramian, IEEE Trans. Power Systems, Vol. 30, No. 4, pp. 2041-2054, 2015.
  • [25] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. of Computational Physics, 378:686–707, 2019.