
Differentiable Neural Networks with RePU Activation: with Applications to Score Estimation and Isotonic Regression

Guohao Shen ([email protected])
Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong SAR, China

Yuling Jiao ([email protected])
School of Mathematics and Statistics and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan, China, 430072

Yuanyuan Lin ([email protected])
Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China

Jian Huang ([email protected])
Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong SAR, China
Abstract

We study the properties of differentiable neural networks activated by rectified power unit (RePU) functions. We show that the partial derivatives of RePU neural networks can be represented by mixed-RePUs activated networks, and we derive upper bounds for the complexity of the function class of derivatives of RePU networks. We establish error bounds for simultaneously approximating $C^s$ smooth functions and their derivatives using RePU-activated deep neural networks. Furthermore, we derive improved approximation error bounds when the data have approximate low-dimensional support, demonstrating the ability of RePU networks to mitigate the curse of dimensionality. To illustrate the usefulness of our results, we consider a deep score matching estimator (DSME) and propose a penalized deep isotonic regression (PDIR) using RePU networks. We establish non-asymptotic excess risk bounds for DSME and PDIR under the assumption that the target functions belong to a class of $C^s$ smooth functions. We also show that PDIR achieves the minimax optimal convergence rate and has a robustness property in the sense that it is consistent with vanishing penalty parameters even when the monotonicity assumption is not satisfied. Furthermore, if the data distribution is supported on an approximate low-dimensional manifold, we show that DSME and PDIR can mitigate the curse of dimensionality.

Keywords: Approximation error, curse of dimensionality, differentiable neural networks, isotonic regression, score matching

1 Introduction

In many statistical problems, it is important to estimate the derivatives of a target function, in addition to estimating a target function itself. An example is the score matching method for distribution learning through score function estimation (Hyvärinen and Dayan, 2005). In this method, the objective function involves the partial derivatives of the score function. Another example is a newly proposed penalized approach for isotonic regression described below, in which the partial derivatives are used to form a penalty function to encourage the estimated regression function to be monotonic. Motivated by these problems, we consider Rectified Power Unit (RePU) activated deep neural networks for estimating differentiable target functions. A RePU activation function has continuous derivatives, which makes RePU networks differentiable and suitable for derivative estimation. We study the properties of RePU networks along with their derivatives, establish error bounds for using RePU networks to approximate smooth functions and their derivatives, and apply them to the problems of score estimation and isotonic regression.

1.1 Score matching

Score estimation is an important approach to distribution learning, and the score function plays a central role in diffusion-based generative learning (Song et al., 2021; Block et al., 2020; Ho et al., 2020; Lee et al., 2022). Let $X\sim p_0$, where $p_0$ is a probability density function supported on $\mathbb{R}^d$. The $d$-dimensional score function of $p_0$ is defined as $s_0(x)=\nabla_x\log p_0(x)$, where $\nabla_x$ is the vector differential operator with respect to the input $x$.

A score matching estimator (Hyvärinen and Dayan, 2005) is obtained by solving the minimization problem

\min_{s\in\mathcal{F}}\frac{1}{2}\mathbb{E}_{X}\|s(X)-s_{0}(X)\|^{2}_{2},   (1)

where $\|\cdot\|_2$ denotes the Euclidean norm and $\mathcal{F}$ is a prespecified class of functions, often referred to as a hypothesis space. However, this objective function is computationally infeasible because $s_0$ is unknown. Under the mild conditions given in Assumption 9 in Section 4, it can be shown (Hyvärinen and Dayan, 2005) that

\frac{1}{2}\mathbb{E}_{X}\|s(X)-s_{0}(X)\|^{2}_{2}=J(s)+\frac{1}{2}\mathbb{E}_{X}\|s_{0}(X)\|_{2}^{2},   (2)

with

J(s):=\mathbb{E}_{X}\left[{\rm tr}(\nabla_{x}s(X))+\frac{1}{2}\|s(X)\|_{2}^{2}\right],   (3)

where $\nabla_x s(x)$ denotes the Jacobian matrix of $s(x)$ and ${\rm tr}(\cdot)$ the trace operator. Since the second term on the right side of (2), $\mathbb{E}_X\|s_0(X)\|_2^2/2$, does not involve $s$, it can be treated as a constant. Therefore, we can use $J$ in (3) as an objective function for estimating the score function $s_0$. When a random sample is available, we use a sample version of $J$ as the empirical objective function. Since $J$ involves the partial derivatives of $s(x)$, we need to compute the derivatives of the functions in $\mathcal{F}$ during estimation, and we need to analyze the properties of the functions in $\mathcal{F}$ and their derivatives to develop the learning theory. In particular, if we take $\mathcal{F}$ to be a class of deep neural network functions, we need to study the properties of their derivatives in terms of estimation and approximation.
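Briefly, (2) follows by expanding the squared distance and integrating the cross term by parts; the boundary term vanishes under the conditions in Assumption 9 (Hyvärinen and Dayan, 2005):

\mathbb{E}_{X}\big[s(X)^{\top}s_{0}(X)\big]=\int p_{0}(x)\,s(x)^{\top}\nabla_{x}\log p_{0}(x)\,dx=\int s(x)^{\top}\nabla_{x}p_{0}(x)\,dx=-\mathbb{E}_{X}\big[{\rm tr}(\nabla_{x}s(X))\big],

so that

\frac{1}{2}\mathbb{E}_{X}\|s(X)-s_{0}(X)\|_{2}^{2}=\mathbb{E}_{X}\Big[{\rm tr}(\nabla_{x}s(X))+\frac{1}{2}\|s(X)\|_{2}^{2}\Big]+\frac{1}{2}\mathbb{E}_{X}\|s_{0}(X)\|_{2}^{2}=J(s)+\frac{1}{2}\mathbb{E}_{X}\|s_{0}(X)\|_{2}^{2}.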

1.2 Isotonic regression

Isotonic regression is a technique that fits a regression model to observations such that the fitted regression function is non-decreasing (or non-increasing). It is a basic form of shape-constrained estimation and has applications in many areas, such as epidemiology (Morton-Jones et al., 2000), medicine (Diggle et al., 1999; Jiang et al., 2011), econometrics (Horowitz and Lee, 2017), and biostatistics (Rueda et al., 2009; Luss et al., 2012; Qin et al., 2014).

Consider a regression model

Y=f_{0}(X)+\epsilon,   (4)

where $X\in\mathcal{X}\subseteq\mathbb{R}^d$, $Y\in\mathbb{R}$, and $\epsilon$ is an independent noise variable with $\mathbb{E}(\epsilon)=0$ and ${\rm Var}(\epsilon)\leq\sigma^2$. In (4), $f_0$ is the underlying regression function, which is usually assumed to belong to a class of smooth functions.

In isotonic regression, $f_0$ is assumed to satisfy the following monotonicity property. Let $\preceq$ denote the binary relation "less than" in the partially ordered space $\mathbb{R}^d$, i.e., $u\preceq v$ if $u_j\leq v_j$ for all $j=1,\ldots,d$, where $u=(u_1,\ldots,u_d)^\top, v=(v_1,\ldots,v_d)^\top\in\mathcal{X}\subseteq\mathbb{R}^d$. The target regression function $f_0$ is assumed to be coordinate-wise non-decreasing on $\mathcal{X}$, i.e., $f_0(u)\leq f_0(v)$ if $u\preceq v$. The class of isotonic regression functions on $\mathcal{X}\subseteq\mathbb{R}^d$ is the set of coordinate-wise non-decreasing functions

\mathcal{F}_{0}:=\{f:\mathcal{X}\to\mathbb{R},\ f(u)\leq f(v)\ {\rm if}\ u\preceq v,\text{ where }\mathcal{X}\subset\mathbb{R}^{d}\}.

The goal is to estimate the target regression function $f_0$ under the constraint that $f_0\in\mathcal{F}_0$, based on an observed sample $S:=\{(X_i,Y_i)\}_{i=1}^n$. For a possibly random function $f:\mathcal{X}\to\mathbb{R}$, let the population risk be

\mathcal{R}(f)=\mathbb{E}|Y-f(X)|^{2},   (5)

where $(X,Y)$ follows the same distribution as $(X_i,Y_i)$ and is independent of $f$. Then the target function $f_0$ is the minimizer of the risk $\mathcal{R}(f)$ over $\mathcal{F}_0$, i.e.,

f_{0}\in\arg\min_{f\in\mathcal{F}_{0}}\mathcal{R}(f).   (6)

The empirical version of (6) is a constrained minimization problem, which is generally difficult to solve directly. In light of this, we propose a penalized approach for estimating $f_0$, based on the fact that a smooth $f_0$ is non-decreasing in its $j$th argument $x_j$ if and only if its partial derivative with respect to $x_j$ is nonnegative. Let $\dot{f}_j(x)=\partial f(x)/\partial x_j$ denote the partial derivative of $f$ with respect to $x_j$, $j=1,\ldots,d$. We propose the following penalized objective function

\mathcal{R}^{\lambda}(f)=\mathbb{E}|Y-f(X)|^{2}+\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\mathbb{E}\{\rho(\dot{f}_{j}(X))\},   (7)

where $\lambda=\{\lambda_j\}_{j=1}^d$ with $\lambda_j\geq 0$, $j=1,\ldots,d$, are tuning parameters, and $\rho:\mathbb{R}\to[0,\infty)$ is a penalty function satisfying $\rho(x)\geq 0$ for all $x$ and $\rho(x)=0$ if $x\geq 0$. Feasible choices include $\rho(x)=\max\{-x,0\}$, $\rho(x)=[\max\{-x,0\}]^2$, and more generally $\rho(x)=h(\max\{-x,0\})$ for a Lipschitz function $h$ with $h(0)=0$.
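As a concrete illustration, the following is a minimal numpy sketch of the empirical counterpart of (7) with the choice $\rho(x)=[\max\{-x,0\}]^2$; the helper names and the toy linear fit are ours, for illustration only.

```python
import numpy as np

def rho(t):
    # penalty rho(t) = [max{-t, 0}]^2: zero whenever t >= 0
    return np.maximum(-t, 0.0) ** 2

def penalized_empirical_risk(f, grad_f, X, Y, lam):
    """Empirical counterpart of (7): mean squared error plus the averaged
    monotonicity penalty (1/d) * sum_j lam_j * mean_i rho(df/dx_j(X_i))."""
    n, d = X.shape
    mse = np.mean((Y - f(X)) ** 2)
    partials = grad_f(X)                                  # shape (n, d)
    penalty = np.mean(rho(partials) * lam.reshape(1, d))  # mean over i and j supplies the 1/d factor
    return mse + penalty

# toy illustration: a linear fit whose second coordinate is decreasing,
# so only that coordinate incurs a penalty
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
beta = np.array([1.0, -0.5, 2.0])
Y = X @ beta + 0.1 * rng.standard_normal(200)
f = lambda Z: Z @ beta
grad_f = lambda Z: np.tile(beta, (Z.shape[0], 1))
print(penalized_empirical_risk(f, grad_f, X, Y, lam=np.ones(3)))
```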

The objective function (7) turns the constrained isotonic regression problem into a penalized regression problem with penalties on the partial derivatives of the regression function. Therefore, if we analyze the learning theory of estimators in (7) using neural network functions, we need to study the partial derivatives of the neural network functions in terms of their generalization and approximation properties.

It is worth mentioning that an advantage of our penalized formulation over the hard-constrained isotonic regressions is that the resulting estimator remains consistent with proper tuning when the underlying regression function is not monotonic. Therefore, our proposed method has a robustness property against model misspecification. We will discuss this point in detail in Section 5.2.

1.3 Differentiable neural networks

A commonality between the two quite different problems described above is that both involve the derivatives of the target function, in addition to the target function itself. When deep neural networks are used to parameterize the hypothesis space, the derivatives of deep neural networks must be considered. Studying the statistical learning theory for these deep neural methods requires knowledge of the complexity and approximation properties of deep neural networks along with their derivatives.

Complexities of deep neural networks with ReLU and piecewise-polynomial activation functions have been studied by Anthony and Bartlett (1999) and Bartlett et al. (2019). Generalization bounds in terms of the operator norm of neural networks have also been obtained by several authors (Neyshabur et al., 2015; Bartlett et al., 2017; Nagarajan and Kolter, 2019; Wei and Ma, 2019). These generalization results are based on various complexity measures, such as Rademacher complexity, VC-dimension, pseudo-dimension, and norms of parameters. These studies shed light on the complexity and generalization properties of neural networks themselves; however, the complexities of their derivatives remain unclear.

The approximation power of deep neural networks with smooth activation functions has been considered in the literature. The universality of sigmoidal deep neural networks was established by Mhaskar (1993) and Chui et al. (1994). In addition, the approximation properties of shallow RePU-activated networks were analyzed by Klusowski and Barron (2018) and Siegel and Xu (2022). The approximation rates of deep RePU neural networks for several types of target functions have also been investigated. For instance, Li et al. (2019, 2020) and Ali and Nouy (2021) studied the approximation rates for functions in Sobolev and Besov spaces in terms of the $L_p$ norm; Duan et al. (2021) and Abdeljawad and Grohs (2022) studied the approximation rates for functions in Sobolev spaces in terms of the Sobolev norm; and Belomestny et al. (2022) studied the approximation rates for functions in Hölder spaces in terms of the Hölder norm. Several recent papers have also studied the approximation of derivatives of smooth functions (Duan et al., 2021; Gühring and Raslan, 2021; Belomestny et al., 2022). We give a more detailed discussion of related works in Section 6.

Table 1 summarizes the comparison between our work and existing results, in terms of the number of non-zero network parameters needed to achieve approximation accuracy $\epsilon$ for a function with smoothness index $\beta$. We also indicate whether the neural network approximator has an explicit architecture, whether the approximation accuracy holds simultaneously for the target function and its derivatives, and whether the approximation results were shown to adapt to a low-dimensional structure of the target function.

Table 1: Comparison of approximation results of RePU neural networks for a function with smoothness order $\beta>0$, within accuracy $\epsilon$. ReQU $\sigma_2$ and ReCU $\sigma_3$ are special instances of RePU $\sigma_p$ with $p\geq 2$. The Sobolev norm $W^{s,p}$ of a function $f$ refers to the mean value of the $L_p$ norms of all partial derivatives of $f$ up to order $s$, and $W^{s,\infty}$ refers to the maximum value of the $L_\infty$ norms of all partial derivatives of $f$ up to order $s$. The Hölder norm $\mathcal{H}^s$ coincides with $W^{s,\infty}$ when $s$ is a non-negative integer. The $C^s$ norm of a function $f$ refers to the mean value of the $L_\infty$ norms of all partial derivatives of $f$ up to order $s$ when $s$ is a positive integer.
	Norm	Activation	Non-zero parameters	Explicit architecture	Simultaneous approximation	Low-dim result
Li et al. (2019, 2020)	$L_2$	RePU	$\mathcal{O}(\epsilon^{-d/\beta})$
Duan et al. (2021)	$W^{1,2}$	ReQU	$\mathcal{O}(\epsilon^{-2d/(\beta-1)})$
Abdeljawad and Grohs (2022)	$W^{s,p}$	ReCU	$\mathcal{O}(\epsilon^{-d/(\beta-s)})$
Belomestny et al. (2022)	$\mathcal{H}^s$	ReQU	$\mathcal{O}(\epsilon^{-d/(\beta-s)})$
This work	$C^s$	RePU	$\mathcal{O}(\epsilon^{-d/(\beta-s)})$

1.4 Our contributions

In this paper, motivated by the aforementioned estimation problems involving derivatives, we investigate the properties of RePU networks and their derivatives. We show that the partial derivatives of RePU neural networks can be represented by mixed-RePUs activated networks. We derive upper bounds for the complexity of the function class of the derivatives of RePU networks. This is a new result for the complexity of derivatives of RePU networks and is crucial to establish generalization error bounds for a variety of estimation problems involving derivatives, including the score matching estimation and our proposed penalized approach for isotonic regression considered in the present work.

We also derive approximation results for RePU networks that hold simultaneously for smooth functions and their derivatives. Our approximation results for RePU networks are based on their representational power on polynomials. We construct RePU networks with an explicit architecture, which differs from those in the existing literature. The number of hidden layers of our constructed RePU networks depends only on the degree of the target polynomial and is independent of the input dimension. This construction is new for studying the approximation properties of RePU networks.

We summarize the main contributions of this work as follows.

  1. We study the basic properties of RePU neural networks and their derivatives. First, we show that partial derivatives of RePU networks can be represented by mixed-RePUs activated networks, and we derive upper bounds for the complexity of the function class of partial derivatives of RePU networks in terms of pseudo-dimension. Second, we derive novel approximation results for simultaneously approximating $C^s$ smooth functions and their derivatives using RePU networks, based on a new and efficient construction technique. We show that the approximation rates improve when the data or the target function has a low-dimensional structure, which implies that RePU networks can mitigate the curse of dimensionality.

  2. We study the statistical learning theory of the deep score matching estimator (DSME) using RePU networks. We establish non-asymptotic prediction error bounds for DSME under the assumption that the target score is $C^s$ smooth, and we show that DSME can mitigate the curse of dimensionality when the data have low-dimensional support.

  3. We propose a penalized deep isotonic regression (PDIR) approach using RePU networks, which encourages the partial derivatives of the estimated regression function to be nonnegative. We establish non-asymptotic excess risk bounds for PDIR under the assumption that the target regression function $f_0$ is $C^s$ smooth. Moreover, we show that PDIR achieves the minimax optimal rate of convergence for nonparametric regression and can mitigate the curse of dimensionality when the data concentrate near a low-dimensional manifold. Furthermore, we show that with tuning parameters tending to zero, PDIR is consistent even when the target function is not isotonic.

The rest of the paper is organized as follows. In Section 2 we study basic properties of RePU neural networks. In Section 3 we establish novel approximation error bounds for approximating $C^s$ smooth functions and their derivatives using RePU networks. In Section 4 we derive non-asymptotic error bounds for DSME. In Section 5 we propose PDIR and establish non-asymptotic bounds for PDIR. In Section 6 we discuss related works. Concluding remarks are given in Section 7. Results from simulation studies, proofs, and technical details are given in the Supplementary Material.

2 Basic properties of RePU neural networks

In this section, we establish some basic properties of RePU networks. We show that the partial derivatives of RePU networks can be represented by mixed-RePUs activated networks whose width, depth, number of neurons, and size are of the same order as those of the original RePU networks. In addition, we derive upper bounds on the complexity of the class of mixed-RePUs activated networks in terms of pseudo-dimension, which leads to an upper bound for the class of partial derivatives of RePU networks.

2.1 RePU activated neural networks

Neural networks with nonlinear activation functions have proven to be a powerful approach for approximating multi-dimensional functions. One of the most commonly used activation functions is the rectified linear unit (ReLU), defined as $\sigma_1(x)=\max\{x,0\}$, $x\in\mathbb{R}$, due to its attractive computational and optimization properties. However, since partial derivatives are involved in our objective functions (3) and (7), it is not sensible to use networks with piecewise linear activation functions, such as ReLU and Leaky ReLU. Neural networks activated by Sigmoid and Tanh are smooth and differentiable, but they have fallen out of favor because of the vanishing gradient problem in optimization. In light of this, we are particularly interested in neural networks activated by RePU, which are non-saturating and differentiable.

In Table 2 below we compare RePU with ReLU and Sigmoid networks in several important aspects. The ReLU and RePU activation functions are continuous and non-saturating (an activation function $\sigma$ is saturating if $\lim_{|x|\to\infty}|\nabla\sigma(x)|=0$), so they do not suffer from the "vanishing gradient" problem of sigmoidal activations (e.g., Sigmoid, Tanh) during training. RePU and Sigmoid are differentiable and can be used to approximate the gradient of a target function, but ReLU cannot, especially in estimation problems involving high-order derivatives of a target function.

Table 2: Comparison among ReLU, Sigmoid, and RePU activation functions.
	Activation	Continuous	Non-saturating	Differentiable	Gradient estimation
	ReLU	Yes	Yes	No	No
	Sigmoid	Yes	No	Yes	Yes
	RePU	Yes	Yes	Yes	Yes

We consider the $p$th order Rectified Power Unit (RePU) activation function for a positive integer $p$. The RePU activation function, denoted by $\sigma_p$, is simply the $p$th power of the ReLU:

\sigma_{p}(x)=\begin{cases}x^{p},& x\geq 0,\\ 0,& x<0.\end{cases}

Note that when $p=0$, $\sigma_0$ is the Heaviside step function; when $p=1$, $\sigma_1$ is the familiar rectified linear unit (ReLU); and when $p=2,3$, the activation functions $\sigma_2,\sigma_3$ are called the rectified quadratic unit (ReQU) and the rectified cubic unit (ReCU), respectively. In this work, we focus on the case $p\geq 2$, for which the RePU activation function has continuous derivatives up to order $p-1$.
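For concreteness, here is a small numpy sketch of $\sigma_p$ and its derivative $p\,\sigma_{p-1}$ (the function names are ours, for illustration only).

```python
import numpy as np

def repu(x, p=2):
    # sigma_p(x) = max(x, 0)^p, applied componentwise
    return np.maximum(x, 0.0) ** p

def repu_prime(x, p=2):
    # d/dx sigma_p(x) = p * sigma_{p-1}(x), which is continuous for p >= 2
    return p * np.maximum(x, 0.0) ** (p - 1)

x = np.linspace(-1.0, 1.0, 5)
print(repu(x, p=2))        # ReQU values
print(repu_prime(x, p=2))  # its derivative, 2 * ReLU(x)
```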

With a RePU activation function, the network will be smooth and differentiable. The architecture of a RePU activated multilayer perceptron can be expressed as a composition of a series of functions

f(x)=\mathcal{L}_{\mathcal{D}}\circ\sigma_{p}\circ\mathcal{L}_{\mathcal{D}-1}\circ\sigma_{p}\circ\cdots\circ\sigma_{p}\circ\mathcal{L}_{1}\circ\sigma_{p}\circ\mathcal{L}_{0}(x),\quad x\in\mathbb{R}^{d_{0}},

where $d_0$ is the dimension of the input data, $\sigma_p(x)=\{\max(0,x)\}^p$, $p\geq 2$, is the RePU activation function (applied to each component of $x$ if $x$ is a vector), and $\mathcal{L}_i(x)=W_ix+b_i$, $i=0,1,\ldots,\mathcal{D}$. Here $d_i$ is the width (the number of neurons or computational units) of the $i$th layer, $W_i\in\mathbb{R}^{d_{i+1}\times d_i}$ is a weight matrix, and $b_i\in\mathbb{R}^{d_{i+1}}$ is the bias vector in the $i$th linear transformation $\mathcal{L}_i$. The input data $x$ forms the first layer of the neural network and the output is the last layer. Such a network $f$ has $\mathcal{D}$ hidden layers and $(\mathcal{D}+2)$ layers in total. We use a $(\mathcal{D}+2)$-vector $(d_0,d_1,\ldots,d_{\mathcal{D}},d_{\mathcal{D}+1})^\top$ to describe the width of each layer. The width $\mathcal{W}$ is defined as the maximum width of the hidden layers, i.e., $\mathcal{W}=\max\{d_1,\ldots,d_{\mathcal{D}}\}$; the size $\mathcal{S}$ is defined as the total number of parameters in the network $f$, i.e., $\mathcal{S}=\sum_{i=0}^{\mathcal{D}}\{d_{i+1}\times(d_i+1)\}$; the number of neurons $\mathcal{U}$ is defined as the number of computational units in the hidden layers, i.e., $\mathcal{U}=\sum_{i=1}^{\mathcal{D}}d_i$. Note that the neurons in consecutive layers are connected to each other via the weight matrices $W_i$, $i=0,1,\ldots,\mathcal{D}$.

We use the notation $\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}'}$ to denote a class of RePU activated multilayer perceptrons $f:\mathbb{R}^{d_0}\to\mathbb{R}$ with depth $\mathcal{D}$, width $\mathcal{W}$, number of neurons $\mathcal{U}$, size $\mathcal{S}$, and $f$ satisfying $\|f\|_\infty\leq\mathcal{B}$ and $\max_{j=1,\ldots,d}\|\frac{\partial}{\partial x_j}f\|_\infty\leq\mathcal{B}'$ for some $0<\mathcal{B},\mathcal{B}'<\infty$, where $\|f\|_\infty$ is the sup-norm of a function $f$.
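As a minimal illustration, the following numpy sketch evaluates the forward pass of such a network, assuming each affine map $\mathcal{L}_i$ is given by a weight/bias pair (helper names are ours).

```python
import numpy as np

def repu(x, p=2):
    # RePU activation sigma_p(x) = max(x, 0)^p, applied componentwise
    return np.maximum(x, 0.0) ** p

def repu_mlp(x, weights, biases, p=2):
    """Forward pass of f = L_D o sigma_p o L_{D-1} o ... o sigma_p o L_0,
    where L_i(h) = weights[i] @ h + biases[i]."""
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):   # hidden layers
        h = repu(W @ h + b, p)
    return weights[-1] @ h + biases[-1]           # output layer: affine only
```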

2.2 Derivatives of RePU networks

An advantage of RePU networks over networks with piecewise linear activations (e.g., ReLU networks) is that RePU networks are differentiable, which makes them useful in many estimation problems involving derivatives. To establish the learning theory for these problems, we need to study the properties of the derivatives of RePU networks.

Recall that a network with $\mathcal{D}$ hidden layers activated by the $p$th order RePU can be expressed as

f(x)=\mathcal{L}_{\mathcal{D}}\circ\sigma_{p}\circ\mathcal{L}_{\mathcal{D}-1}\circ\sigma_{p}\circ\cdots\circ\sigma_{p}\circ\mathcal{L}_{1}\circ\sigma_{p}\circ\mathcal{L}_{0}(x),\quad x\in\mathbb{R}^{d_{0}}.

Let $f_i:=\sigma_p\circ\mathcal{L}_i$ denote the $i$th linear transformation composed with the RePU activation for $i=0,1,\ldots,\mathcal{D}-1$, and let $f_{\mathcal{D}}=\mathcal{L}_{\mathcal{D}}$ denote the linear transformation in the last layer. Then, by the chain rule, the gradient of the network can be computed as

\nabla f=\left(\prod_{k=0}^{\mathcal{D}-1}\big[\nabla f_{\mathcal{D}-k}\circ f_{\mathcal{D}-k-1}\circ\cdots\circ f_{0}\big]\right)\nabla f_{0},   (8)

where $\nabla$ denotes the gradient operator used in vector calculus. With a differentiable RePU activation $\sigma_p$, the gradients $\nabla f_i$ in (8) can be computed exactly by $\sigma_{p-1}$-activated layers, since $\nabla f_i(x)=\nabla[\sigma_p\circ\mathcal{L}_i(x)]=\nabla\sigma_p(W_ix+b_i)=pW_i^\top\sigma_{p-1}(W_ix+b_i)$. In addition, the $f_i$, $i=0,\ldots,\mathcal{D}$, are themselves RePU-activated layers. Then, the network gradient $\nabla f$ can be represented by a network activated by $\sigma_p$ and $\sigma_{p-1}$ (and possibly $\sigma_m$ for $1\leq m\leq p-2$) according to (8), with a proper architecture. Below, we refer to neural networks activated by $\{\sigma_t:1\leq t\leq p\}$ as Mixed RePUs activated neural networks; that is, the activation functions in a Mixed RePUs network can be $\sigma_t$ for any $1\leq t\leq p$, and different neurons may use different activation functions.
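The following numpy sketch illustrates this chain-rule computation for a scalar-output network and checks it against finite differences; it is an illustration of the identity, not the Mixed RePUs network construction used in the proofs.

```python
import numpy as np

def repu(x, p=2):
    return np.maximum(x, 0.0) ** p

def repu_mlp_grad(x, weights, biases, p=2):
    """Gradient of a scalar-output RePU network via the chain rule (8):
    each hidden layer contributes the Jacobian diag(p * sigma_{p-1}(W_i h + b_i)) W_i,
    so only lower-order RePUs (sigma_{p-1}) are needed."""
    h, jac = np.asarray(x, dtype=float), np.eye(len(x))
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ h + b
        jac = (p * np.maximum(z, 0.0) ** (p - 1))[:, None] * (W @ jac)
        h = repu(z, p)
    return (weights[-1] @ jac).ravel()

# check against central finite differences on a small random ReQU network
rng = np.random.default_rng(1)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((4, 4)), rng.standard_normal((1, 4))]
bs = [rng.standard_normal(4), rng.standard_normal(4), rng.standard_normal(1)]
f = lambda x: (Ws[2] @ repu(Ws[1] @ repu(Ws[0] @ x + bs[0]) + bs[1]) + bs[2]).item()
x0, eps = rng.standard_normal(3), 1e-6
fd = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.allclose(repu_mlp_grad(x0, Ws, bs), fd, atol=1e-4))
```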

The following theorem shows that the partial derivatives and the gradient $\nabla f$ of a RePU neural network $f$ can indeed be represented by a Mixed RePUs network with activation functions $\{\sigma_t:1\leq t\leq p\}$.

Theorem 1 (Neural networks for partial derivatives)

Let $\mathcal{F}:=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}'}$ be a class of RePU ($\sigma_p$) activated neural networks $f:\mathcal{X}\to\mathbb{R}$ with depth (number of hidden layers) $\mathcal{D}$, width (maximum width of hidden layers) $\mathcal{W}$, number of neurons $\mathcal{U}$, number of parameters (weights and biases) $\mathcal{S}$, and $f$ satisfying $\|f\|_\infty\leq\mathcal{B}$ and $\max_{j=1,\ldots,d}\|\frac{\partial}{\partial x_j}f\|_\infty\leq\mathcal{B}'$. Then for any $f\in\mathcal{F}$ and any $j\in\{1,\ldots,d\}$, the partial derivative $\frac{\partial}{\partial x_j}f$ can be implemented by a Mixed RePUs activated multilayer perceptron with depth $3\mathcal{D}+3$, width $6\mathcal{W}$, number of neurons $13\mathcal{U}$, number of parameters $23\mathcal{S}$, and bound $\mathcal{B}'$.

Theorem 1 shows that for each $j\in\{1,\ldots,d\}$, the partial derivative with respect to the $j$th argument of any $f\in\mathcal{F}$ can be exactly computed by a Mixed RePUs network. In addition, by paralleling the networks computing $\frac{\partial}{\partial x_j}f$, $j=1,\ldots,d$, the whole vector of partial derivatives $\nabla f=(\frac{\partial}{\partial x_1}f,\ldots,\frac{\partial}{\partial x_d}f)$ can be computed by a Mixed RePUs network with depth $3\mathcal{D}+3$, width $6d\mathcal{W}$, number of neurons $13d\mathcal{U}$, and number of parameters $23d\mathcal{S}$.

Let $\mathcal{F}_j':=\{\partial f/\partial x_j:f\in\mathcal{F}\}$ be the class of partial derivatives of the functions in $\mathcal{F}$ with respect to the $j$th argument, and let $\widetilde{\mathcal{F}}$ denote the class of Mixed RePUs networks in Theorem 1. Then Theorem 1 implies that the class of partial derivatives is contained in a class of Mixed RePUs networks, i.e., $\mathcal{F}_j'\subseteq\widetilde{\mathcal{F}}$ for $j=1,\ldots,d$. This further implies that the complexity of $\mathcal{F}_j'$ can be bounded by that of the class of Mixed RePUs networks $\widetilde{\mathcal{F}}$.

The complexity of a function class is a key quantity in the analysis of generalization properties. Lower complexity in general implies a smaller generalization gap. The complexity of a function class can be measured in several ways, including Rademacher complexity, covering number, VC dimension, and Pseudo dimension. These measures depict the complexity of a function class differently but are closely related to each other.

In the following, we develop complexity upper bounds for the class of Mixed RePUs network functions. In particular, these bounds lead to upper bounds on the pseudo-dimension of the function class $\widetilde{\mathcal{F}}$ and of $\mathcal{F}'$.

Lemma 2 (Pseudo dimension of Mixed RePUs multilayer perceptrons)

Let $\widetilde{\mathcal{F}}$ be a function class implemented by Mixed RePUs activated multilayer perceptrons with depth $\tilde{\mathcal{D}}$, number of neurons (nodes) $\tilde{\mathcal{U}}$, and size or number of parameters (weights and biases) $\tilde{\mathcal{S}}$. Then the pseudo-dimension of $\widetilde{\mathcal{F}}$ satisfies

{\rm Pdim}(\widetilde{\mathcal{F}})\leq 3p\tilde{\mathcal{D}}\tilde{\mathcal{S}}(\tilde{\mathcal{D}}+\log_{2}\tilde{\mathcal{U}}),

where ${\rm Pdim}(\mathcal{F})$ denotes the pseudo-dimension of a function class $\mathcal{F}$.

With Theorem 1 and Lemma 2, we can now obtain an upper bound for the complexity of the class of derivatives of RePU neural networks. This facilitates establishing learning theories for statistical methods involving derivatives.

Due to the symmetry among the arguments of the input of networks in $\mathcal{F}$, the complexities of $\mathcal{F}_1',\ldots,\mathcal{F}_d'$ are generally the same. For notational simplicity, we use

\mathcal{F}^{\prime}:=\Big\{\frac{\partial}{\partial x_{1}}f:f\in\mathcal{F}\Big\}

in the main text when referring to complexity quantities such as the pseudo-dimension; e.g., we write ${\rm Pdim}(\mathcal{F}')$ instead of ${\rm Pdim}(\mathcal{F}_j')$ for $j=1,\ldots,d$.

3 Approximation power of RePU neural networks

In this section, we establish error bounds for using RePU networks to simultaneously approximate smooth functions and their derivatives.

We show that RePU neural networks, with an appropriate architecture, can represent multivariate polynomials with no error and thus can simultaneously approximate multivariate differentiable functions and their derivatives. Moreover, we show that the RePU neural network can mitigate the “curse of dimensionality” when the domain of the target function concentrates in a neighborhood of a low-dimensional manifold.

In studies of the approximation properties of ReLU networks (Yarotsky, 2017, 2018; Shen et al., 2020; Schmidt-Hieber, 2020), the analyses rely on two key facts. First, the ReLU activation function can be used to construct continuous, piecewise linear bump functions with compact support, which form a partition of unity of the domain. Second, deep ReLU networks can approximate the square function $x^2$ to any error tolerance, provided the network is large enough. Based on these facts, ReLU networks can compute Taylor expansions to approximate smooth functions. However, due to the piecewise linear nature of ReLU, the approximation is restricted to the target function itself rather than its derivatives; in other words, the approximation error of ReLU networks is quantified in the $L_p$ norm with $p\geq 1$ or $p=\infty$, whereas norms such as Sobolev or Hölder norms capture the approximation of derivatives. Gühring and Raslan (2021) extended these results by showing that networks activated by a general smooth function can approximate a partition of unity and polynomial functions, and obtained approximation rates for smooth functions in the Sobolev norm, which implies approximation of the target function and its derivatives. RePU-activated networks have been shown to represent splines (Duan et al., 2021; Belomestny et al., 2022), and thus they can approximate smooth functions and their derivatives based on the approximation power of splines.

RePU networks can also represent polynomials efficiently and exactly. This fact motivates us to derive our approximation results for RePU networks from their representational power on polynomials. To construct RePU networks representing polynomials, the basic idea is to express basic operations as one-hidden-layer RePU networks and then compute polynomials by combining and composing these building blocks. For a univariate input $x$, the identity map $x$, a linear transformation $ax+b$, and the square map $x^2$ can all be represented by one-hidden-layer RePU networks with only a few nodes. The multiplication operation $xy=\{(x+y)^2-(x-y)^2\}/4$ can also be realized by a one-hidden-layer RePU network. Then a univariate polynomial $\sum_{i=0}^N a_ix^i$ of degree $N\geq 0$ can be computed by a RePU network with a suitable composite construction based on Horner's method (also known as Qin Jiushao's algorithm) (Horner, 1819). Further, a multivariate polynomial can be viewed as a product of univariate polynomials, so a RePU network with a suitable architecture can represent multivariate polynomials. Alternatively, as noted in Mhaskar (1993) and Chui and Li (1993), any polynomial in $d$ variables with total degree not exceeding $N$ can be written as a linear combination of $\binom{N+d}{d}$ terms of the form $(w^\top x+b)^N$, where $\binom{N+d}{d}=(N+d)!/(N!\,d!)$ denotes the binomial coefficient and $w^\top x+b$ denotes a linear combination of the $d$ components of $x$. Given this fact, RePU networks can also be shown to represent polynomials via a suitable construction.
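The following numpy sketch illustrates these building blocks for the ReQU case ($p=2$): the square and multiplication identities and Horner's evaluation of a univariate polynomial. It illustrates the identities only, not the exact network construction used in the proof of Theorem 3.

```python
import numpy as np

def requ(x):
    # sigma_2(x) = max(x, 0)^2
    return np.maximum(x, 0.0) ** 2

def square(x):
    # x^2 = sigma_2(x) + sigma_2(-x): a one-hidden-layer ReQU representation
    return requ(x) + requ(-x)

def multiply(x, y):
    # xy = {(x + y)^2 - (x - y)^2} / 4, realized with the ReQU square above
    return (square(x + y) - square(x - y)) / 4.0

def horner(coeffs, x):
    # evaluate a_0 + a_1 x + ... + a_N x^N using only the ReQU
    # multiplication block and additions (Horner's method)
    val = coeffs[-1]
    for a in reversed(coeffs[:-1]):
        val = multiply(val, x) + a
    return val

# 1 - 2x + 3x^3 at x = 0.5, compared with direct evaluation
print(horner([1.0, -2.0, 0.0, 3.0], 0.5), np.polyval([3.0, 0.0, -2.0, 1.0], 0.5))
```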

Theorem 3 (Representation of Polynomials by RePU networks)

For any non-negative integer $N\in\mathbb{N}_0$ and positive integer $d\in\mathbb{N}^+$, if $f:\mathbb{R}^d\to\mathbb{R}$ is a polynomial of $d$ variables with total degree $N$, then $f$ can be exactly computed, with no error, by a RePU activated neural network with

  • (1)

    $2N-1$ hidden layers, $(6p+2)(2N^d-N^{d-1}-N)+2p(2N^d-N^{d-1}-N)/(N-1)=\mathcal{O}(pN^d)$ neurons, $(30p+2)(2N^d-N^{d-1}-N)+(2p+1)(2N^d-N^{d-1}-N)/(N-1)=\mathcal{O}(pN^d)$ parameters, and width $12pN^{d-1}+6p(N^{d-1}-N)/(N-1)=\mathcal{O}(pN^{d-1})$;

  • (2)

    $\lceil\log_p(N)\rceil$ hidden layers, $2\lceil\log_p(N)\rceil(N+d)!/(N!\,d!)=\mathcal{O}(\log_p(N)N^d)$ neurons, $2(\lceil\log_p(N)\rceil+d+1)(N+d)!/(N!\,d!)=\mathcal{O}((\log_p(N)+d)N^d)$ parameters, and width $2(N+d)!/(N!\,d!)=\mathcal{O}(N^d)$,

where $\lceil a\rceil$ denotes the smallest integer no less than $a\in\mathbb{R}$ and $a!$ denotes the factorial of the integer $a$.

Theorem 3 establishes the capability of RePU networks to exactly represent multivariate polynomials of degree $N$ through appropriate architectural designs. The theorem introduces two distinct network architectures that achieve this representation, based on Horner's method (Horner, 1819) and Mhaskar's method (Mhaskar, 1993), respectively. The RePU network constructed using Horner's method exhibits a larger number of hidden layers but fewer neurons and parameters compared to the network constructed using Mhaskar's method. Neither construction is universally superior; the better choice depends on the relationship between the dimension $d$ and the degree $N$, allowing for potential efficiency gains in specific scenarios.

Importantly, RePU neural networks offer advantages over ReLU networks in approximating polynomials. For any positive integers $W$ and $L$, ReLU networks with width of order $\mathcal{O}(WN^d)$ and depth of order $\mathcal{O}(LN^2)$ can only approximate, but not exactly represent, $d$-variate polynomials of degree $N$, with accuracy of order $9N(W+1)^{-7NL}=\mathcal{O}(NW^{-LN})$ (Shen et al., 2020; Hon and Yang, 2022). Furthermore, the approximation capabilities of ReLU networks for polynomials are generally limited to bounded regions, whereas RePU networks can compute polynomials exactly over the entire space $\mathbb{R}^d$.

Theorem 3 introduces novel findings that distinguish it from existing research (Li et al., 2019, 2020) in several aspects. First, it provides an explicit formulation of the RePU network's depth, width, number of neurons, and number of parameters in terms of the degree $N$ of the target polynomial and the input dimension $d$, thereby facilitating practical implementation. Second, the theorem presents Architecture (2), which improves on previous studies in that it requires fewer hidden layers for polynomial representation. Prior works, such as Li et al. (2019, 2020), required RePU networks with $d(\lceil\log_p(N)\rceil+1)$ hidden layers, along with $\mathcal{O}(pN^d)$ neurons and parameters, to represent $d$-variate polynomials of degree $N$ on $\mathbb{R}^d$. Architecture (2) in Theorem 3 achieves a comparable number of neurons and parameters with only $\lceil\log_p(N)\rceil$ hidden layers. Importantly, the number of hidden layers $\lceil\log_p(N)\rceil$ depends only on the degree $N$ of the polynomial and is independent of the input dimension $d$. This improvement is particularly significant for the high-dimensional input spaces commonly encountered in machine-learning tasks. Lastly, Architecture (1) in Theorem 3 contributes an additional RePU network construction based on Horner's method (Horner, 1819), complementing existing results based solely on Mhaskar's method (Mhaskar, 1993) and providing an alternative choice for polynomial representation.
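To make the counts in Architecture (2) concrete, the following short script evaluates the formulas stated in Theorem 3 for a fixed degree and several input dimensions; it is pure arithmetic on the stated expressions, and the helper names are ours.

```python
from math import comb

def ceil_log(N, p):
    # smallest integer k with p**k >= N, i.e. ceil(log_p N) for N >= 1
    k = 0
    while p ** k < N:
        k += 1
    return k

def architecture2(N, d, p):
    """Depth, width, number of neurons and parameters of Architecture (2) in Theorem 3."""
    depth = ceil_log(N, p)
    width = 2 * comb(N + d, d)            # 2 (N+d)! / (N! d!)
    neurons = depth * width               # 2 ceil(log_p N) (N+d)! / (N! d!)
    params = 2 * (depth + d + 1) * comb(N + d, d)
    return depth, width, neurons, params

# degree N = 4 polynomials, ReQU (p = 2): the depth stays ceil(log_2 4) = 2
# for every input dimension d, while the width and size grow with d
for d in (2, 5, 10):
    print(d, architecture2(4, d, 2))
```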

By leveraging the approximation power of multivariate polynomials, we can derive error bounds for approximating general multivariate smooth functions using RePU neural networks. Previously, approximation properties of RePU networks have been studied for target functions in different spaces, e.g., Sobolev space (Li et al., 2020, 2019; Gühring and Raslan, 2021), spectral Barron space (Siegel and Xu, 2022), Besov space (Ali and Nouy, 2021), and Hölder space (Belomestny et al., 2022). Here we focus on the simultaneous approximation of multivariate smooth functions and their derivatives in the space $C^s$ for $s\in\mathbb{N}^+$, defined in Definition 4.

Definition 4 (Multivariate differentiable class $C^s$)

A function $f:\mathbb{B}\subset\mathbb{R}^d\to\mathbb{R}$ defined on a subset $\mathbb{B}$ of $\mathbb{R}^d$ is said to be in the class $C^s(\mathbb{B})$ on $\mathbb{B}$ for a positive integer $s$ if all partial derivatives

D^{\alpha}f:=\frac{\partial^{\alpha}}{\partial x_{1}^{\alpha_{1}}\partial x_{2}^{\alpha_{2}}\cdots\partial x_{d}^{\alpha_{d}}}f

exist and are continuous on $\mathbb{B}$ for all non-negative integers $\alpha_1,\alpha_2,\ldots,\alpha_d$ such that $\alpha_1+\alpha_2+\cdots+\alpha_d\leq s$. In addition, we define the norm of $f$ over $\mathbb{B}$ by

\|f\|_{C^{s}}:=\sum_{|\alpha|_{1}\leq s}\sup_{x\in\mathbb{B}}|D^{\alpha}f(x)|,

where $|\alpha|_1:=\sum_{i=1}^d\alpha_i$ for any vector $\alpha=(\alpha_1,\alpha_2,\ldots,\alpha_d)\in\mathbb{R}^d$.
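For example, for $f(x_1,x_2)=x_1x_2$ on $\mathbb{B}=[0,1]^2$ and $s=1$, the definition gives $\|f\|_{C^1}=\sup_{\mathbb{B}}|x_1x_2|+\sup_{\mathbb{B}}|x_2|+\sup_{\mathbb{B}}|x_1|=1+1+1=3$.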

Theorem 5

Let $f$ be a real-valued function defined on a compact set $\mathcal{X}\subset\mathbb{R}^d$ belonging to the class $C^s$ for $0\leq s<\infty$. For any $N\in\mathbb{N}^+$, there exists a RePU activated neural network $\phi_N$ with depth $\mathcal{D}$, width $\mathcal{W}$, number of neurons $\mathcal{U}$, and size $\mathcal{S}$ specified by one of the following architectures:

  • (1)
    \mathcal{D}=2N-1,\quad \mathcal{W}=12pN^{d-1}+6p(N^{d-1}-N)/(N-1),
    \mathcal{U}=(6p+2)(2N^{d}-N^{d-1}-N)+2p(2N^{d}-N^{d-1}-N)/(N-1),
    \mathcal{S}=(30p+2)(2N^{d}-N^{d-1}-N)+(2p+1)(2N^{d}-N^{d-1}-N)/(N-1);
  • (2)
    \mathcal{D}=\lceil\log_{p}(N)\rceil,\quad \mathcal{W}=2(N+d)!/(N!\,d!),
    \mathcal{U}=2\lceil\log_{p}(N)\rceil(N+d)!/(N!\,d!),
    \mathcal{S}=2(\lceil\log_{p}(N)\rceil+d+1)(N+d)!/(N!\,d!),

such that for each multi-index $\alpha\in\mathbb{N}_0^d$ satisfying $|\alpha|_1\leq\min\{s,N\}$, we have

\sup_{\mathcal{X}}|D^{\alpha}(f-\phi_{N})|\leq C_{p,s,d,\mathcal{X}}\,\|f\|_{C^{|\alpha|_{1}}}\,N^{-(s-|\alpha|_{1})},

where $C_{p,s,d,\mathcal{X}}$ is a positive constant depending only on $p,d,s$, and the diameter of $\mathcal{X}$.

Theorem 5 gives a simultaneous approximation result for RePU networks, since the error is measured in the $\|\cdot\|_{C^s}$ norm. It improves over existing results focusing on $L_p$ norms, which cannot guarantee the approximation of derivatives of the target function (Li et al., 2019, 2020). It is known that shallow neural networks with smooth activations can simultaneously approximate a smooth function and its derivatives (Xu and Cao, 2005). However, simultaneous approximation by RePU neural networks with respect to norms involving derivatives is still an active research area (Gühring and Raslan, 2021; Duan et al., 2021; Belomestny et al., 2022). For solving partial differential equations in a Sobolev space with smoothness order 2, Duan et al. (2021) showed that ReQU neural networks can simultaneously approximate the target function and its derivative in the Sobolev norm $W^{1,2}$; to achieve accuracy $\epsilon$, the ReQU networks require $\mathcal{O}(\log_2 d)$ layers and $\mathcal{O}(4d\epsilon^{-d})$ neurons. Later, Belomestny et al. (2022) proved that $\beta$-Hölder smooth functions ($\beta>2$) and their derivatives up to order $l$ can be simultaneously approximated with accuracy $\epsilon$ in the Hölder norm by a ReQU network with width $\mathcal{O}(\epsilon^{-d/(\beta-l)})$, $\mathcal{O}(\log_2 d)$ layers, and $\mathcal{O}(\epsilon^{-d/(\beta-l)})$ nonzero parameters. Gühring and Raslan (2021) derived simultaneous approximation results for neural networks with general smooth activation functions; based on their results, a RePU neural network with a constant number of layers and $\mathcal{O}(\epsilon^{-d/(\beta-l)})$ nonzero parameters can achieve approximation accuracy $\epsilon$ in the Sobolev norm up to $l$th order derivatives for a $d$-dimensional Sobolev function with smoothness $\beta$.

To achieve approximation accuracy $\epsilon$, our Theorem 5 shows that a RePU network requires a comparable number of neurons, namely $\mathcal{O}(\epsilon^{-d/(s-l)})$, to simultaneously approximate the target function and its derivatives up to order $l$. Our result differs from existing studies in several ways. First, in contrast to Li et al. (2019, 2020), Theorem 5 establishes simultaneous approximation results for RePU networks. Second, Theorem 5 holds for general RePU networks ($p\geq 2$), including the ReQU network ($p=2$) studied in Duan et al. (2021) and Belomestny et al. (2022). Third, Theorem 5 explicitly specifies the network architecture, which facilitates network design in practice, whereas existing studies determine network architectures only up to orders (Li et al., 2019, 2020; Gühring and Raslan, 2021). In addition, as discussed in the next subsection, Theorem 5 can be further improved and adapted to data with low-dimensional structure, which highlights the capability of RePU networks to mitigate the curse of dimensionality in estimation problems. We again refer to Table 1 for a summary comparison of our work with existing results.

Remark 6

Theorem 3 is based on the representational power of RePU networks on polynomials, as in Li et al. (2019, 2020) and Ali and Nouy (2021). Other existing works derived approximation results based on the representation of B-splines or tensor-product splines by ReQU neural networks (Duan et al., 2021; Siegel and Xu, 2022; Belomestny et al., 2022).

3.1 Circumventing the curse of dimensionality

By Theorem 5, to achieve an approximation error $\epsilon$, the RePU neural network needs $O(\epsilon^{-d/s})$ parameters. The number of parameters grows polynomially in the desired approximation accuracy $\epsilon$ with an exponent $-d/s$ depending on the dimension $d$. In statistical and machine learning tasks, such an approximation result can make the estimation suffer from the curse of dimensionality: when the dimension $d$ of the input data is large, the convergence rate becomes extremely slow. Fortunately, high-dimensional data often have approximate low-dimensional latent structures in many applications, such as computer vision and natural language processing (Belkin and Niyogi, 2003; Hoffmann et al., 2009; Fefferman et al., 2016). It has been shown that these low-dimensional structures can help mitigate the curse of dimensionality (i.e., improve the convergence rate) for ReLU networks (Schmidt-Hieber, 2019; Shen et al., 2020; Jiao et al., 2023; Chen et al., 2022). We consider an assumption of approximate low-dimensional support of the data distribution (Jiao et al., 2023) and show that RePU networks can also mitigate the curse of dimensionality under this assumption.

Assumption 7

The predictor $X$ is supported on a $\rho$-neighborhood $\mathcal{M}_\rho$ of a compact $d_{\mathcal{M}}$-dimensional Riemannian submanifold $\mathcal{M}\subset\mathbb{R}^d$, where

\mathcal{M}_{\rho}=\{x\in\mathbb{R}^{d}:\inf\{\|x-y\|_{2}:y\in\mathcal{M}\}\leq\rho\},\quad\rho\in(0,1),

and the Riemannian submanifold $\mathcal{M}$ has condition number $1/\tau$, volume $V$, and geodesic covering regularity $R$.

We assume that the high-dimensional data $X$ concentrate in a $\rho$-neighborhood of a low-dimensional manifold. This assumption is a relaxation of the stringent requirements imposed by exact manifold assumptions (Chen et al., 2019; Schmidt-Hieber, 2019).

With a well-conditioned manifold $\mathcal{M}$, we show that RePU networks can adaptively embed the data into a lower-dimensional space while approximately preserving distances. The dimension of the embedded representation, as well as the quality of the embedding in terms of distance preservation, depends on the properties of the approximate manifold, including its radius $\rho$, condition number $1/\tau$, volume $V$, and geodesic covering regularity $R$. For in-depth definitions of these properties, we refer the interested reader to Baraniuk and Wakin (2009).

Theorem 8 (Improved approximation results)

Suppose that Assumption 7 holds. Let $f$ be a real-valued function defined on $\mathbb{R}^d$ belonging to the class $C^s$ for $0\leq s<\infty$. Let $d_\delta=c\cdot d_{\mathcal{M}}\log(d\cdot VR\tau^{-1}/\delta)/\delta^2$ be an integer satisfying $d_\delta\leq d$ for some $\delta\in(0,1)$ and a universal constant $c>0$. Then for any $N\in\mathbb{N}^+$, there exists a RePU activated neural network $\phi_N$ with depth $\mathcal{D}$, width $\mathcal{W}$, number of neurons $\mathcal{U}$, and size $\mathcal{S}$ given by one of the following architectures:

  • (1)
    \mathcal{D}=2N-1,\quad \mathcal{W}=12pN^{d_{\delta}-1}+6p(N^{d_{\delta}-1}-N)/(N-1),
    \mathcal{U}=(6p+2)(2N^{d_{\delta}}-N^{d_{\delta}-1}-N)+2p(2N^{d_{\delta}}-N^{d_{\delta}-1}-N)/(N-1),
    \mathcal{S}=(30p+2)(2N^{d_{\delta}}-N^{d_{\delta}-1}-N)+(2p+1)(2N^{d_{\delta}}-N^{d_{\delta}-1}-N)/(N-1);
  • (2)
    \mathcal{D}=\lceil\log_{p}(N)\rceil,\quad \mathcal{W}=2(N+d_{\delta})!/(N!\,d_{\delta}!),
    \mathcal{U}=2\lceil\log_{p}(N)\rceil(N+d_{\delta})!/(N!\,d_{\delta}!),
    \mathcal{S}=2(\lceil\log_{p}(N)\rceil+d_{\delta}+1)(N+d_{\delta})!/(N!\,d_{\delta}!),

such that for each multi-index $\alpha\in\mathbb{N}_0^d$ satisfying $|\alpha|_1\leq 1$,

\mathbb{E}_{X}|D^{\alpha}(f(X)-\phi_{N}(X))|\leq C_{p,s,d_{\delta},\mathcal{M}_{\rho}}\cdot(1-\delta)^{-2}\|f\|_{C^{|\alpha|_{1}}}\,N^{-(s-|\alpha|_{1})},

for $\rho\leq C_1N^{-(s-|\alpha|_1)}$ with a universal constant $C_1>0$, where $C_{p,s,d_\delta,\mathcal{M}_\rho}$ is a positive constant depending only on $d_\delta,s,p$, and $\mathcal{M}_\rho$.

When the data have a low-dimensional structure, Theorem 8 indicates that the RePU network can approximate a $C^s$ smooth function and its derivatives up to order $l$ with accuracy $\epsilon$ using $O(\epsilon^{-d_\delta/(s-l)})$ neurons. Here the effective dimension $d_\delta$ scales linearly in the intrinsic manifold dimension $d_{\mathcal{M}}$ and logarithmically in the ambient dimension $d$ and in the features $1/\tau,V,R$ of the manifold. Compared to Theorem 5, the effective dimension in Theorem 8 is $d_\delta$ instead of $d$, which can be a significant improvement, especially when the ambient dimension $d$ of the data is large but the intrinsic dimension $d_{\mathcal{M}}$ is small.

Theorem 8 shows that RePU neural networks are an effective tool for analyzing data that lie in a neighborhood of a low-dimensional manifold, indicating their potential to mitigate the curse of dimensionality. In particular, this property makes them well suited to scenarios where the ambient dimension of the data is high but the intrinsic dimension is low. To the best of our knowledge, Theorem 8 is the first result on the ability of RePU networks to mitigate the curse of dimensionality. A comparison between our result and the recent results of Li et al. (2019), Li et al. (2020), Duan et al. (2021), Abdeljawad and Grohs (2022), and Belomestny et al. (2022) is given in Table 1.

4 Deep score estimation

Deep neural networks have revolutionized many areas of statistics and machine learning, and one of the important applications is score function estimation using the score matching method (Hyvärinen and Dayan, 2005). Score-based generative models (Song et al., 2021), which learn to generate samples by estimating the gradient of the log-density function, can benefit significantly from deep neural networks. Using a deep neural network allows for more expressive and flexible models, which can capture complex patterns and dependencies in the data. This is especially important for high-dimensional data, where traditional methods may struggle to capture all of the relevant features. By leveraging the power of deep neural networks, score-based generative models can achieve state-of-the-art results on a wide range of tasks, from image generation to natural language processing. The use of deep neural networks in score function estimation represents a major advance in the field of generative modeling, with the potential to unlock new levels of creativity and innovation. We apply our developed theories of RePU networks to explore the statistical learning theories of deep score matching estimation (DSME).

Let $p_0(x)$ be a probability density function supported on $\mathbb{R}^d$ and let $s_0(x)=\nabla_x\log p_0(x)$ be its score function, where $\nabla_x$ is the vector differential operator with respect to the input $x$. The goal of deep score estimation is to model and estimate $s_0$ by a function $s:\mathbb{R}^d\to\mathbb{R}^d$, based on samples $\{X_i\}_{i=1}^n$ from $p_0$, such that $s(x)\approx s_0(x)$. Here $s$ belongs to a class of deep neural networks.

It is worth noting that the neural network $s:\mathbb{R}^d\to\mathbb{R}^d$ used in deep score estimation is a vector-valued function. For a $d$-dimensional input $x=(x_1,\ldots,x_d)^\top\in\mathbb{R}^d$, the output $s(x)=(s_1(x),\ldots,s_d(x))^\top\in\mathbb{R}^d$ is also $d$-dimensional. We let $\nabla_x s$ denote the $d\times d$ Jacobian matrix of $s$ with $(i,j)$ entry $\partial s_i/\partial x_j$. With a slight abuse of notation, we let $\mathcal{F}_n:=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}'}$ denote a class of RePU activated multilayer perceptrons $s:\mathbb{R}^d\to\mathbb{R}^d$ with parameter $\theta$, depth $\mathcal{D}$, width $\mathcal{W}$, size $\mathcal{S}$, number of neurons $\mathcal{U}$, and $s$ satisfying: (i) $\|s\|_\infty\leq\mathcal{B}$ for some $0<\mathcal{B}<\infty$, where $\|s\|_\infty:=\sup_{x\in\mathcal{X}}\|s(x)\|_\infty$ is the sup-norm of the vector-valued function $s$ over its domain $\mathcal{X}$; (ii) $\|(\nabla_x s)_{ii}\|_\infty\leq\mathcal{B}'$, $i=1,\ldots,d$, for some $0<\mathcal{B}'<\infty$, where $(\nabla_x s)_{ii}$ is the $i$th diagonal entry (in the $i$th row and $i$th column) of $\nabla_x s$. Here the parameters $\mathcal{D},\mathcal{W},\mathcal{U}$, and $\mathcal{S}$ of $\mathcal{F}_n$ may depend on the sample size $n$, but we omit this dependence in the notation. In addition, we extend the definition of smooth multivariate functions: we say a vector-valued function $s=(s_1,\ldots,s_d)$ belongs to $C^m$ if $s_j$ belongs to $C^m$ for each $j=1,\ldots,d$, and correspondingly we define $\|s\|_{C^m}:=\max_{j=1,\ldots,d}\|s_j\|_{C^m}$.

4.1 Non-asymptotic error bounds for DSME

The development of theory for the performance of score estimators based on deep neural networks has become an important research area. Theoretical upper bounds on prediction errors are increasingly important for understanding the limitations and potential of these models.

We are interested in establishing non-asymptotic error bounds for DSME, which is obtained by minimizing the expected squared distance $\mathbb{E}_X\|s(X)-s_0(X)\|_2^2$ over the class of functions $\mathcal{F}$. However, this objective is computationally infeasible because the explicit form of $s_0$ is unknown. Under proper conditions, the objective function has an equivalent formulation that is computationally feasible.

Assumption 9

The density $p_0$ of the data $X$ is differentiable. The expectations $\mathbb{E}_X\|s_0(X)\|_2^2$ and $\mathbb{E}_X\|s(X)\|_2^2$ are finite for any $s\in\mathcal{F}$, and $s_0(x)s(x)\to 0$ for any $s\in\mathcal{F}$ as $\|x\|\to\infty$.

Under Assumption 9, the population objective of score matching is equivalent to $J$ given in (3). With a finite sample $S_n=\{X_i\}_{i=1}^n$, the empirical version of $J$ is

J_{n}(s)=\frac{1}{n}\sum_{i=1}^{n}\left\{{\rm tr}(\nabla_{x}s(X_{i}))+\frac{1}{2}\|s(X_{i})\|_{2}^{2}\right\}.

Then, DSME is defined by

\hat{s}_{n}:=\arg\min_{s\in\mathcal{F}_{n}}J_{n}(s),   (9)

which is the empirical risk minimizer over the class of RePU neural networks $\mathcal{F}_n$.
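As an illustration of the empirical objective, the following numpy sketch computes $J_n$ and runs plain gradient descent for a toy linear score model $s(x)=Wx+b$; this is our simplified illustration of the loss in (9), not the RePU-network estimator analyzed below.

```python
import numpy as np

def score_matching_loss(W, b, X):
    """Empirical objective J_n for a toy linear score model s(x) = W x + b.
    For this model the Jacobian of s is W, so tr(grad_x s(x)) = trace(W) for every sample."""
    S = X @ W.T + b                                   # rows are s(X_i)
    return np.trace(W) + 0.5 * np.mean(np.sum(S ** 2, axis=1))

# standard normal data: the true score is s_0(x) = -x, so the best linear
# model should approach W = -I and b = 0
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 2))
W, b = np.zeros((2, 2)), np.zeros(2)
for _ in range(500):                                  # plain gradient descent on J_n
    S = X @ W.T + b
    W -= 0.1 * (np.eye(2) + S.T @ X / len(X))         # gradient of J_n in W
    b -= 0.1 * S.mean(axis=0)                         # gradient of J_n in b
print(score_matching_loss(W, b, X), np.round(W, 2), np.round(b, 2))
```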

Our goal is to derive upper bounds on the excess risk of $\hat{s}_n$, which is defined as

J(\hat{s}_{n})-J(s_{0})=\frac{1}{2}\mathbb{E}_{X}\|\hat{s}_{n}(X)-s_{0}(X)\|_{2}^{2}.

To obtain an upper bound on $J(\hat{s}_n)-J(s_0)$, we decompose it into two parts, the stochastic error and the approximation error, and then derive upper bounds for each. Let $s_n=\arg\min_{s\in\mathcal{F}_n}J(s)$; then

J(\hat{s}_{n})-J(s_{0})
=\{J(\hat{s}_{n})-J_{n}(\hat{s}_{n})\}+\{J_{n}(\hat{s}_{n})-J_{n}(s_{n})\}+\{J_{n}(s_{n})-J(s_{n})\}+\{J(s_{n})-J(s_{0})\}
\leq\{J(\hat{s}_{n})-J_{n}(\hat{s}_{n})\}+\{J_{n}(s_{n})-J(s_{n})\}+\{J(s_{n})-J(s_{0})\}
\leq 2\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)|+\inf_{s\in\mathcal{F}_{n}}\{J(s)-J(s_{0})\},

where we call 2\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)| the stochastic error and \inf_{s\in\mathcal{F}_{n}}\{J(s)-J(s_{0})\} the approximation error.

It is important to highlight that the analysis of the stochastic and approximation errors for DSME is unconventional. On one hand, since J(s)-J(s_{0})=\frac{1}{2}\mathbb{E}_{X}\|s(X)-s_{0}(X)\|_{2}^{2} holds for any s, the approximation error can be controlled by the L_{2} approximation of s_{0}. Thus, Theorem 5 provides a bound for the approximation error \inf_{s\in\mathcal{F}_{n}}[J(s)-J(s_{0})]. On the other hand, the empirical squared distance loss \sum_{i=1}^{n}\|s(X_{i})-s_{0}(X_{i})\|^{2}_{2}/(2n) is not equivalent to the surrogate loss J_{n}. In other words, the minimizer \hat{s}_{n} of J_{n} may differ from the minimizer of the empirical squared distance \sum_{i=1}^{n}\|s(X_{i})-s_{0}(X_{i})\|^{2}_{2}/(2n) over s\in\mathcal{F}_{n}. Consequently, the stochastic error can only be analyzed through the formulation of J rather than the squared loss, and it therefore depends on the complexities of the RePU network class \mathcal{F}_{n} as well as the class of its derivatives \mathcal{F}_{n}^{\prime}. Based on Theorem 1, Lemma 2 and empirical process theory, the stochastic error is expected to be bounded by \mathcal{O}(({\rm Pdim}(\mathcal{F}_{n})+{\rm Pdim}(\mathcal{F}_{n}^{\prime}))^{1/2}n^{-1/2}). Finally, combining these two error bounds yields the following bound on the mean squared error of the empirical risk minimizer \hat{s}_{n} defined in (9).

Lemma 10

Suppose that Assumption 9 holds and the target score function s_{0} belongs to C^{m}(\mathcal{X}) for some m\in\mathbb{N}^{+}. For any N\in\mathbb{N}^{+}, let \mathcal{F}_{n}:=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}} be the class of RePU-activated neural networks f:\mathcal{X}\to\mathbb{R}^{d} with depth \mathcal{D}=\lceil\log_{p}(N)\rceil, width \mathcal{W}=2(N+d)!/(N!d!), number of neurons \mathcal{U}=2\lceil\log_{p}(N)\rceil(N+d)!/(N!d!) and size \mathcal{S}=2(\lceil\log_{p}(N)\rceil+d+1)(N+d)!/(N!d!), and suppose that \mathcal{B}\geq\|s_{0}\|_{C^{0}} and \mathcal{B}^{\prime}\geq\max_{i=1,\ldots,d}\|(\nabla_{x}s_{0})_{ii}\|_{C^{1}}. Then the empirical risk minimizer \hat{s}_{n} defined in (9) satisfies

\mathbb{E}\|\hat{s}_{n}(X)-s_{0}(X)\|^{2}_{2}\leq\mathcal{E}_{sto}+\mathcal{E}_{app}, \quad (10)

with

\mathcal{E}_{sto}=C_{1}pd^{3}(\mathcal{B}^{2}+2\mathcal{B}^{\prime})(\log n)^{1/2}{n}^{-1/2}(\log_{p}N)^{2}N^{d/2},
\mathcal{E}_{app}={C}_{2}N^{-2m}\|s_{0}\|_{C^{0}}^{2},

where the expectation \mathbb{E} is taken with respect to X and \hat{s}_{n}, C_{1}>0 is a universal constant, and C_{2}>0 is a constant depending only on p,d,m and the diameter of \mathcal{X}.

Remark 11

Lemma 10 establishes a bound on the mean squared error of the empirical risk minimizer: the error is bounded by the sum of the stochastic error \mathcal{E}_{sto} and the approximation error \mathcal{E}_{app}. The stochastic error \mathcal{E}_{sto} decreases in the sample size n but increases in the network size as determined by N, whereas the approximation error \mathcal{E}_{app} decreases in the network size as determined by N. To attain a fast convergence rate with respect to n, it is therefore necessary to balance these two errors by selecting an appropriate N for a given sample size n.
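To make this trade-off explicit, note that, ignoring constants and logarithmic factors, \mathcal{E}_{sto}\asymp n^{-1/2}N^{d/2} and \mathcal{E}_{app}\asymp N^{-2m}, so equating the two terms gives

n^{-1/2}N^{d/2}\asymp N^{-2m}\quad\Longleftrightarrow\quad N\asymp n^{\frac{1}{d+4m}},

under which both terms are of order n^{-2m/(d+4m)}; this is precisely the choice of N used in Theorem 13 below.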

Remark 12

In Lemma 10, the error bounds are stated in terms of the integer N. They can also be expressed in terms of the number of neurons \mathcal{U} and the size \mathcal{S}, since we have specified \mathcal{U}=2\lceil\log_{p}(N)\rceil(N+d)!/(N!d!) and \mathcal{S}=2(\lceil\log_{p}(N)\rceil+d+1)(N+d)!/(N!d!), which relate the number of neurons and the size of the network to N and the dimension of X.

Lemma 10 leads to the following error bound for the score-matching estimator.

Theorem 13 (Non-asymptotic excess risk bounds)

Under the conditions of Lemma 10, we set N=\lfloor n^{1/(d+4m)}\rfloor. Then by (10), the empirical risk minimizer \hat{s}_{n} defined in (9) satisfies

\mathbb{E}\|\hat{s}_{n}(X)-s_{0}(X)\|_{2}^{2}\leq C(\log n)n^{-\frac{2m}{d+4m}},

where C is a constant depending only on p,\mathcal{B},\mathcal{B}^{\prime},m,d,\mathcal{X} and \|s_{0}\|_{C^{1}}.

In Theorem 13, the convergence rate in the error bound is n^{-\frac{2m}{d+4m}} up to a logarithmic factor. While this rate is slightly slower than the minimax optimal rate n^{-\frac{2m}{d+2m}} for nonparametric regression (Stone, 1982), it remains reasonable given the nature of score matching estimation: the objective involves derivatives and the values of the target score function are not directly observed, unlike the traditional nonparametric regression setting of Stone (1982), where both predictors and responses are observed and no derivatives are involved. Nevertheless, the rate n^{-\frac{2m}{d+4m}} can be extremely slow for large d, reflecting the curse of dimensionality. To address this issue, we derive error bounds under the approximate low-dimensional support assumption stated in Assumption 7.

Lemma 14

Suppose that Assumptions 7 and 9 hold and the target score function s_{0} belongs to C^{m}(\mathcal{X}) for some m\in\mathbb{N}^{+}. Let d_{\delta}=c\cdot d_{\mathcal{M}}\log(d\cdot VR\tau^{-1}/\delta)/\delta^{2} be an integer with d_{\delta}\leq d for some \delta\in(0,1) and universal constant c>0. For any N\in\mathbb{N}^{+}, let \mathcal{F}_{n}:=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}} be the class of RePU-activated neural networks f:\mathcal{X}\to\mathbb{R}^{d} with depth \mathcal{D}=\lceil\log_{p}(N)\rceil, width \mathcal{W}=2(N+d_{\delta})!/(N!d_{\delta}!), number of neurons \mathcal{U}=2\lceil\log_{p}(N)\rceil(N+d_{\delta})!/(N!d_{\delta}!) and size \mathcal{S}=2(\lceil\log_{p}(N)\rceil+d_{\delta}+1)(N+d_{\delta})!/(N!d_{\delta}!). Suppose that \mathcal{B}\geq\|s_{0}\|_{C^{0}} and \mathcal{B}^{\prime}\geq\max_{i=1,\ldots,d}\|(\nabla_{x}s_{0})_{ii}\|_{C^{1}}. Then the empirical risk minimizer \hat{s}_{n} defined in (9) satisfies

\mathbb{E}\|\hat{s}_{n}(X)-s_{0}(X)\|^{2}_{2}\leq\mathcal{E}_{sto}+\tilde{\mathcal{E}}_{app}, \quad (11)

with

\mathcal{E}_{sto}=C_{1}pd^{2}d_{\delta}(\mathcal{B}^{2}+2\mathcal{B}^{\prime})(\log n)^{1/2}{n}^{-1/2}N^{d_{\delta}/2},
\tilde{\mathcal{E}}_{app}={C}_{2}(1-\delta)^{-2}\|s_{0}\|_{C^{0}}^{2}N^{-2m},

for \rho\leq C_{\rho}N^{-2m}, where C_{\rho},C_{1}>0 are universal constants and C_{2}>0 is a constant depending only on p,d,d_{\delta},m and \mathcal{M}_{\rho}.

Under the approximate low-dimensional support assumption, Lemma 14 implies that a faster convergence rate can be achieved for the deep score estimator.

Theorem 15 (Improved non-asymptotic excess risk bounds)

Under the conditions of Lemma 14, we can set N=\lfloor n^{1/(d_{\delta}+4m)}\rfloor in (11); then the empirical risk minimizer \hat{s}_{n} defined in (9) satisfies

\mathbb{E}\|\hat{s}_{n}(X)-s_{0}(X)\|_{2}^{2}\leq Cn^{-\frac{2m}{d_{\delta}+4m}},

where C is a constant depending only on p,\mathcal{B},\mathcal{B}^{\prime},m,d,d_{\delta},\mathcal{X} and \|s_{0}\|_{C^{1}}.

5 Deep isotonic regression

As another application of our results on RePU-activated networks, we propose PDIR, a penalized deep isotonic regression approach that uses RePU networks together with a penalty based on the derivatives of the networks to enforce monotonicity. We also establish error bounds for PDIR.

Suppose we have a random sample S:=\{(X_{i},Y_{i})\}_{i=1}^{n} from model (4). Recall that \mathcal{R}^{\lambda} is the proposed population objective function for isotonic regression defined in (7). We consider the empirical counterpart of \mathcal{R}^{\lambda}:

\mathcal{R}^{\lambda}_{n}(f)=\frac{1}{n}\sum_{i=1}^{n}\Big\{|Y_{i}-f(X_{i})|^{2}+\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\dot{f}_{j}(X_{i}))\Big\}. \quad (12)

A simple choice of the penalty function \rho is \rho(x)=\max\{-x,0\}. In general, we can take \rho(x)=h(\max\{-x,0\}) for a function h with h(0)=0. We focus on Lipschitz penalty functions as defined below.

Assumption 16 (Lipschitz penalty function)

The penalty function \rho(\cdot):\mathbb{R}\to[0,\infty) satisfies \rho(x)=0 if x\geq 0. In addition, \rho is \kappa-Lipschitz, i.e., |\rho(x_{1})-\rho(x_{2})|\leq\kappa|x_{1}-x_{2}| for any x_{1},x_{2}\in\mathbb{R}.
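As an illustration, with the simple choice \rho(x)=\max\{-x,0\} (which satisfies Assumption 16 with \kappa=1), the empirical objective (12) can be implemented with automatic differentiation in the same way as the score matching loss in Section 4; the function name pdir_loss and the interface below are ours and serve only as a sketch.

```python
import torch


def pdir_loss(f, X, Y, lam):
    """Empirical PDIR objective (12) with rho(x) = max(-x, 0):

    (1/n) sum_i [ |Y_i - f(X_i)|^2 + (1/d) sum_j lam_j * rho(df/dx_j (X_i)) ].
    """
    X = X.requires_grad_(True)
    pred = f(X).squeeze(-1)                      # f maps (n, d) to (n,) or (n, 1)
    sq_loss = (Y - pred) ** 2
    # partial derivatives df/dx_j at each sample point, via one backward pass
    grads = torch.autograd.grad(pred.sum(), X, create_graph=True)[0]   # shape (n, d)
    penalty = (torch.as_tensor(lam) * torch.clamp(-grads, min=0.0)).mean(dim=1)
    return (sq_loss + penalty).mean()
```

The penalty term vanishes wherever the fitted partial derivatives are nonnegative, so minimizing this loss encourages coordinate-wise monotonicity without imposing it as a hard constraint.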

Let the empirical risk minimizer of deep isotonic regression be denoted by

\hat{f}_{n}^{\lambda}:\in\arg\min_{f\in\mathcal{F}_{n}}\mathcal{R}^{\lambda}_{n}(f), \quad (13)

where \mathcal{F}_{n} is a class of functions computed by deep neural networks whose architecture may depend on the sample size n. We refer to \hat{f}^{\lambda}_{n} as a penalized deep isotonic regression (PDIR) estimator.

Figure 1: Examples of PDIR estimates. In all figures, the data points are depicted as grey dots, the underlying regression functions are plotted as solid black curves, and PDIR estimates with different levels of the penalty parameter \lambda are plotted as colored curves. In the top two figures, data are generated from models with monotonic regression functions. In the bottom left figure, the target function is a constant. In the bottom right figure, the model is misspecified: the underlying regression function is not monotonic. Small values of \lambda can lead to non-monotonic estimated functions.

An illustration of PDIR is presented in Figure 1. In the top two panels, data are generated from models with monotonic regression functions; in the bottom left panel, the target function is a constant; and in the bottom right panel, the model is misspecified, with a non-monotonic underlying regression function. In the misspecified case, small values of \lambda lead to non-monotonic but reasonable estimates, suggesting that PDIR is robust against model misspecification. We have conducted further numerical experiments to evaluate the performance of PDIR, which indicate that PDIR tends to perform better than the existing isotonic regression methods considered in the comparison. The results are given in Appendix A.

5.1 Non-asymptotic error bounds for PDIR

In this section, we state our main results on the excess risk bounds for the PDIR estimator defined in (13). Recall the definition of \mathcal{R}^{\lambda} in (7). For notational simplicity, we write

\mathcal{R}(f)=\mathcal{R}^{0}(f)=\mathbb{E}|Y-f(X)|^{2}. \quad (14)

The target function f_{0} is the minimizer of the risk \mathcal{R}(f) over measurable functions, i.e., f_{0}\in\arg\min_{f}\mathcal{R}(f). In isotonic regression, we assume that f_{0}\in\mathcal{F}_{0}. In addition, for any function f, under the regression model (4), we have

\mathcal{R}(f)-\mathcal{R}(f_{0})=\mathbb{E}|f(X)-f_{0}(X)|^{2}.

We first state the conditions needed for establishing the excess risk bounds.

Assumption 17

(i) The target regression function f_{0}:\mathcal{X}\to\mathbb{R} defined in (4) is coordinate-wise nondecreasing on \mathcal{X}, i.e., f_{0}(x)\leq f_{0}(y) if x\preceq y for x,y\in\mathcal{X}\subseteq\mathbb{R}^{d}. (ii) The errors \epsilon_{i}, i=1,\ldots,n, are independent and identically distributed with \mathbb{E}(\epsilon_{i})=0 and {\rm Var}(\epsilon_{i})\leq\sigma^{2}, and are independent of \{X_{i}\}_{i=1}^{n}.

Assumption 17 includes basic model assumptions on the errors and the monotonic target function f_{0}. In addition, we assume that the target function f_{0} belongs to the class C^{s}.

Next, we state the following basic lemma for bounding the excess risk.

Lemma 18 (Excess risk decomposition)

For the empirical risk minimizer \hat{f}^{\lambda}_{n} defined in (13), its excess risk can be upper bounded by

\mathbb{E}|\hat{f}^{\lambda}_{n}(X)-f_{0}(X)|^{2}=\mathbb{E}\Big\{\mathcal{R}(\hat{f}_{n}^{\lambda})-\mathcal{R}(f_{0})\Big\}\leq\mathbb{E}\Big\{\mathcal{R}^{\lambda}(\hat{f}_{n}^{\lambda})-\mathcal{R}^{\lambda}(f_{0})\Big\}
\leq\mathbb{E}\Big\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-2\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}(f_{0})\Big\}+2\inf_{f\in\mathcal{F}_{n}}\Big[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})\Big].

The upper bound on the excess risk decomposes into two components: the stochastic error, given by the expected value of \mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-2\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}(f_{0}), and the approximation error, given by \inf_{f\in\mathcal{F}_{n}}[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]. To bound the stochastic error, it is necessary to consider the complexities of both RePU networks and their derivatives, which are investigated in Theorem 1 and Lemma 2. To bound the approximation error \inf_{f\in\mathcal{F}_{n}}[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})], we rely on the simultaneous approximation results in Theorem 5.

Remark 19

The error decomposition in Lemma 18 differs from the canonical decomposition used for score estimation in Section 4.1, particularly in the stochastic error component. Utilizing the decomposition in Lemma 18 enables us to derive a sharper stochastic error bound by leveraging the properties of the PDIR loss function. A similar decomposition for the least squares loss without penalization can be found in Jiao et al. (2023).

Lemma 20

Suppose that Assumptions 16 and 17 hold and the target function f_{0} defined in (4) belongs to C^{s} for some s\in\mathbb{N}^{+}. For any N\in\mathbb{N}^{+}, let \mathcal{F}_{n}:=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}} be the class of RePU-activated neural networks f:\mathcal{X}\to\mathbb{R} with depth \mathcal{D}=\lceil\log_{p}(N)\rceil, width \mathcal{W}=2(N+d)!/(N!d!), number of neurons \mathcal{U}=2\lceil\log_{p}(N)\rceil(N+d)!/(N!d!) and size \mathcal{S}=2(\lceil\log_{p}(N)\rceil+d+1)(N+d)!/(N!d!). Suppose that \mathcal{B}\geq\|f_{0}\|_{C^{0}} and \mathcal{B}^{\prime}\geq\|f_{0}\|_{C^{1}}. Then for n\geq\max\{{\rm Pdim}(\mathcal{F}_{n}),{\rm Pdim}(\mathcal{F}^{\prime}_{n})\}, the excess risk of PDIR \hat{f}^{\lambda}_{n} defined in (13) satisfies

\mathbb{E}|\hat{f}^{\lambda}_{n}(X)-f_{0}(X)|^{2}\leq\mathcal{E}_{sto}+\mathcal{E}_{app}, \quad (15)
\mathbb{E}\big[\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))\big]\leq\mathcal{E}_{sto}+\mathcal{E}_{app}, \quad (16)

with

\mathcal{E}_{sto}=C_{1}d^{3}\big\{p^{3}\mathcal{B}^{3}+(\kappa\bar{\lambda}\mathcal{B}^{\prime})^{2}\big\}(\log n)^{2}n^{-1}(\log_{p}N)^{3}N^{d},
\mathcal{E}_{app}=C_{2}\|f_{0}\|^{2}_{C^{1}}(N^{-2s}+\kappa\bar{\lambda}N^{-(s-1)}),

where the expectation \mathbb{E} is taken with respect to X and \hat{f}^{\lambda}_{n}, \bar{\lambda}=\sum_{j=1}^{d}\lambda_{j}/d is the mean of the tuning parameters, C_{1}>0 is a universal constant and C_{2}>0 is a positive constant depending only on d,s and the diameter of the support \mathcal{X}.

Lemma 20 establishes two error bounds for the PDIR estimator \hat{f}^{\lambda}_{n}: (15) bounds the mean squared error between \hat{f}^{\lambda}_{n} and the target f_{0}, and (16) controls the non-monotonicity of \hat{f}^{\lambda}_{n} through its partial derivatives \frac{\partial}{\partial x_{j}}\hat{f}_{n}, j=1,\ldots,d, with respect to a measure defined in terms of \rho. Both bounds encompass a stochastic error and an approximation error. In particular, the stochastic error is of order \mathcal{O}(N^{d}/n), which improves on the canonical error bound of \mathcal{O}([N^{d}/n]^{1/2}), up to logarithmic factors in n. This improvement is due to the decomposition in Lemma 18 and the properties of the PDIR loss function, and differs from traditional decomposition techniques.

Remark 21

In (16), the estimator \hat{f}^{\lambda}_{n} is encouraged to be monotonic, as the expected monotonicity penalty \mathbb{E}[\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))] is bounded. Notably, when \mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))]=0, the estimator \hat{f}^{\lambda}_{n} is almost surely monotonic in its j-th argument with respect to the probability measure of X. Based on (16), guarantees of the estimator's monotonicity with respect to a single argument can also be obtained: for those j with \lambda_{j}\not=0, we have \mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))]\leq d(\mathcal{E}_{sto}+\mathcal{E}_{app})/\lambda_{j}, which provides a guarantee of monotonicity with respect to the j-th argument. Moreover, larger values of \lambda_{j} lead to smaller bounds, which is consistent with the intuition that larger \lambda_{j} better promotes the monotonicity of \hat{f}^{\lambda}_{n} in its j-th argument.

Theorem 22 (Non-asymptotic excess risk bounds)

Under the conditions of Lemma 20, to achieve the smallest error bound in (15), we set N=\lfloor n^{1/(d+2s)}\rfloor and \lambda_{j}=n^{-(s+1)/(d+2s)} for j=1,\ldots,d. Then we have

\mathbb{E}|\hat{f}^{\lambda}_{n}(X)-f_{0}(X)|^{2}\leq C(\log n)^{5}n^{-\frac{2s}{d+2s}},

and

\mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))]\leq C(\log n)^{5}n^{-\frac{s-1}{d+2s}},

for j=1,\ldots,d, where C>0 is a constant depending only on \mathcal{B},\mathcal{B}^{\prime},s,d,\mathcal{X},\|f_{0}\|_{C^{s}} and \kappa.

By Theorem 22, the proposed PDIR estimator with a properly chosen network architecture and tuning parameters achieves the minimax optimal rate \mathcal{O}(n^{-\frac{2s}{d+2s}}), up to logarithmic factors, for nonparametric regression (Stone, 1982). Meanwhile, the PDIR estimator \hat{f}^{\lambda}_{n} is guaranteed to be approximately monotonic, as measured by \mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))], at a rate of \mathcal{O}(n^{-(s-1)/(d+2s)}) up to a logarithmic factor.

Remark 23

In Theorem 22, we choose \lambda_{j}=n^{-(s+1)/(d+2s)}, j=1,\ldots,d, to attain the optimal rate of the expected mean squared error of \hat{f}^{\lambda}_{n} up to a logarithmic factor. Additionally, the estimator \hat{f}^{\lambda}_{n} is guaranteed to be approximately monotonic, as measured by \mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))], at a rate of n^{-(s-1)/(d+2s)} up to a logarithmic factor. The choice of \lambda_{j} ensuring the consistency of \hat{f}^{\lambda}_{n} is not unique: any choice with \bar{\lambda}=o((\log n)^{-2}n^{(s-1)/(d+2s)}) results in a consistent \hat{f}^{\lambda}_{n}. Larger values of \bar{\lambda} lead to a slower convergence rate of the expected mean squared error, but a better guarantee of the monotonicity of \hat{f}^{\lambda}_{n}.

The smoothness s of the target function f_{0} is unknown in practice, and determining the smoothness of an unknown function is an important but nontrivial problem. Note that the convergence rate (\log n)^{5}n^{-2s/(d+2s)} suffers from the curse of dimensionality, since it can be extremely slow when d is large.

High-dimensional data have low-dimensional latent structures in many applications. Below we show that PDIR can mitigate the curse of dimensionality if the data distribution is supported on an approximate low-dimensional manifold.

Lemma 24

Suppose that Assumptions 7, 16 and 17 hold and the target function f_{0} defined in (4) belongs to C^{s} for some s\in\mathbb{N}^{+}. Let d_{\delta}=c\cdot d_{\mathcal{M}}\log(d\cdot VR\tau^{-1}/\delta)/\delta^{2} be an integer with d_{\delta}\leq d for some \delta\in(0,1) and universal constant c>0. For any N\in\mathbb{N}^{+}, let \mathcal{F}_{n}:=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}} be the class of RePU-activated neural networks f:\mathcal{X}\to\mathbb{R} with depth \mathcal{D}=\lceil\log_{p}(N)\rceil, width \mathcal{W}=2(N+d_{\delta})!/(N!d_{\delta}!), number of neurons \mathcal{U}=2\lceil\log_{p}(N)\rceil(N+d_{\delta})!/(N!d_{\delta}!) and size \mathcal{S}=2(\lceil\log_{p}(N)\rceil+d_{\delta}+1)(N+d_{\delta})!/(N!d_{\delta}!). Suppose that \mathcal{B}\geq\|f_{0}\|_{C^{0}} and \mathcal{B}^{\prime}\geq\|f_{0}\|_{C^{1}}. Then for n\geq\max\{{\rm Pdim}(\mathcal{F}_{n}),{\rm Pdim}(\mathcal{F}^{\prime}_{n})\}, the excess risk of the PDIR estimator \hat{f}^{\lambda}_{n} defined in (13) satisfies

\mathbb{E}|\hat{f}^{\lambda}_{n}(X)-f_{0}(X)|^{2}\leq\mathcal{E}_{sto}+\tilde{\mathcal{E}}_{app}, \quad (17)
\mathbb{E}\big[\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))\big]\leq\mathcal{E}_{sto}+\tilde{\mathcal{E}}_{app}, \quad (18)

with

\mathcal{E}_{sto}=C_{1}d^{2}d_{\delta}\big\{p^{3}\mathcal{B}^{3}+(\kappa\bar{\lambda}\mathcal{B}^{\prime})^{2}\big\}(\log n)^{2}n^{-1}N^{d_{\delta}},
\tilde{\mathcal{E}}_{app}=C_{2}(1-\delta)^{2}\|f_{0}\|^{2}_{C^{s}}(N^{-2s}+\kappa\bar{\lambda}N^{-(s-1)}),

for \rho\leq C_{\rho}N^{-(s-|\alpha|_{1})}, where C_{\rho},C_{1}>0 are universal constants and C_{2}>0 is a constant depending only on d_{\delta},s and the diameter of the support \mathcal{M}_{\rho}.

Based on Lemma 24, we obtain the following result.

Theorem 25 (Improved non-asymptotic excess risk bounds)

Under the conditions of Lemma 24, to achieve the smallest error bound in (17), we set N=\lfloor n^{1/(d_{\delta}+2s)}\rfloor and \lambda_{j}=n^{-(s+1)/(d_{\delta}+2s)} for j=1,\ldots,d. Then we have

\mathbb{E}|\hat{f}^{\lambda}_{n}(X)-f_{0}(X)|^{2}\leq C(\log n)^{5}n^{-\frac{2s}{d_{\delta}+2s}},

and for j=1,\ldots,d,

\mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))]\leq C(\log n)^{5}n^{-\frac{s-1}{d_{\delta}+2s}},

where C>0 is a constant depending only on \mathcal{B},\mathcal{B}^{\prime},s,d,d_{\delta},\mathcal{M}_{\rho},\|f_{0}\|_{C^{s}} and \kappa.

In Theorem 25, the effective dimension is d_{\delta} rather than the ambient dimension d. Therefore, the rate of convergence improves on the result in Theorem 22 when the intrinsic dimension d_{\delta} is smaller than d.

5.2 PDIR under model misspecification

In this subsection, we investigate PDIR under model misspecification when Assumption 17 (i) is not satisfied, meaning that the underlying regression function f_{0} may not be monotonic.

Let S:=\{(X_{i},Y_{i})\}_{i=1}^{n} be a random sample from model (4). Recall that the penalized risk of deep isotonic regression is given by

\mathcal{R}^{\lambda}(f)=\mathbb{E}|Y-f(X)|^{2}+\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\mathbb{E}\rho(\dot{f}_{j}(X)).

If f_{0} is not monotonic, the penalty \sum_{j=1}^{d}\lambda_{j}\mathbb{E}[\rho(\dot{f}_{j}(X))]/d is non-zero, and consequently f_{0} is not a minimizer of the risk \mathcal{R}^{\lambda} when \lambda_{j}\not=0 for all j. Intuitively, the deep isotonic regression estimator will then be biased relative to the target f_{0} due to the additional penalty terms in the risk. However, it is reasonable to expect that the estimator \hat{f}^{\lambda}_{n} has a smaller bias if \lambda_{j}, j=1,\ldots,d, are small. In the following lemma, we establish a non-asymptotic upper bound for the proposed deep isotonic regression estimator under model misspecification.

Lemma 26

Suppose that Assumptions 16 and 17 (ii) hold and the target function f_{0} defined in (4) belongs to C^{s} for some s\in\mathbb{N}^{+}. For any N\in\mathbb{N}^{+}, let \mathcal{F}_{n}:=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}} be the class of RePU-activated neural networks f:\mathcal{X}\to\mathbb{R} with depth \mathcal{D}=\lceil\log_{p}(N)\rceil, width \mathcal{W}=2(N+d)!/(N!d!), number of neurons \mathcal{U}=2\lceil\log_{p}(N)\rceil(N+d)!/(N!d!) and size \mathcal{S}=2(\lceil\log_{p}(N)\rceil+d+1)(N+d)!/(N!d!). Suppose that \mathcal{B}\geq\|f_{0}\|_{C^{0}} and \mathcal{B}^{\prime}\geq\|f_{0}\|_{C^{1}}. Then for n\geq\max\{{\rm Pdim}(\mathcal{F}_{n}),{\rm Pdim}(\mathcal{F}^{\prime}_{n})\}, the excess risk of the PDIR estimator \hat{f}^{\lambda}_{n} defined in (13) satisfies

\mathbb{E}|\hat{f}^{\lambda}_{n}(X)-f_{0}(X)|^{2}\leq\mathcal{E}_{sto}+\mathcal{E}_{app}+\mathcal{E}_{mis}, \quad (19)
\mathbb{E}\big[\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))\big]\leq\mathcal{E}_{sto}+\mathcal{E}_{app}+\mathcal{E}_{mis}, \quad (20)

with

\mathcal{E}_{sto}=C_{1}p^{2}d^{3}(\mathcal{B}^{2}+\kappa\bar{\lambda}\mathcal{B}^{\prime})(\log n)^{1/2}n^{-1/2}(\log_{p}N)^{3/2}N^{d/2},
\mathcal{E}_{app}=C_{2}\|f_{0}\|^{2}_{C^{1}}(N^{-2s}+\kappa\bar{\lambda}N^{-(s-1)}),
\mathcal{E}_{mis}=\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}f_{0}(X))],

where the expectation \mathbb{E} is taken with respect to X and \hat{f}^{\lambda}_{n}, \bar{\lambda}=\sum_{j=1}^{d}\lambda_{j}/d is the mean of the tuning parameters, C_{1}>0 is a universal constant and C_{2}>0 is a positive constant depending only on d,s and the diameter of the support \mathcal{X}.

Lemma 26 is a generalized version of Lemma 20 for PDIR, as it holds regardless of whether the target function is isotonic. In Lemma 26, the expected mean squared error of the PDIR estimator \hat{f}^{\lambda}_{n} is bounded by the sum of three terms: the stochastic error \mathcal{E}_{sto}, the approximation error \mathcal{E}_{app}, and the misspecification error \mathcal{E}_{mis}, without the monotonicity assumption. Compared with Lemma 20, which requires the monotonicity assumption, the approximation error is identical, the stochastic error is worse in order, and the misspecification error appears as an extra term. With an appropriate choice of N relative to the sample size n, the stochastic and approximation errors converge to zero, albeit at a slower rate than in Theorem 22. However, the misspecification error remains constant for fixed tuning parameters \lambda_{j}. Thus, we can let the tuning parameters \lambda_{j} converge to zero to achieve consistency.

Remark 27

It is worth noting that if the target function is isotonic, then the misspecification error vanishes, recovering the isotonic regression setting. However, the convergence rate based on Lemma 26 is slower than that in Lemma 20, because Lemma 26 is general and holds without prior knowledge of the monotonicity of the target function. If it is known that the target function f_{0} is not monotonic in its j-th argument, setting the corresponding \lambda_{j}=0 removes its contribution to the misspecification error and improves the upper bound.

Theorem 28 (Non-asymptotic excess risk bounds)

Under the conditions of Lemma 26, to achieve the fastest convergence rate in (19), we set N=\lfloor n^{1/(d+4s)}\rfloor and \lambda_{j}=n^{-2s/(d+4s)} for j=1,\ldots,d. Then we have

\mathbb{E}|\hat{f}^{\lambda}_{n}(X)-f_{0}(X)|^{2}\leq C(\log n)n^{-\frac{2s}{d+4s}},

where C>0 is a constant depending only on \mathcal{B},\mathcal{B}^{\prime},s,d,\mathcal{X},\|f_{0}\|_{C^{s}} and \kappa.

According to Lemma 26, under model misspecification, the prediction error bound for PDIR is minimized when \lambda_{j}=0 for j=1,\ldots,d, in which case the misspecification error \mathcal{E}_{mis} vanishes. Consequently, the optimal convergence rate with respect to n can be achieved by setting N=\mathcal{O}(\lfloor n^{1/(d+4s)}\rfloor) and \lambda_{j}=0 for j=1,\ldots,d. It is worth noting that the prediction error of PDIR achieves this rate as long as \bar{\lambda}=\mathcal{O}(n^{-2s/(d+4s)}).

Remark 29

According to Theorem 28, the choice of \lambda_{j} ensuring the consistency of PDIR is not unique: consistency is guaranteed, even under a misspecified model, whenever \lambda_{j}, j=1,\ldots,d, tend to zero as n\to\infty. Additionally, selecting a smaller value of \bar{\lambda} yields a better upper bound in (19), and an optimal rate up to logarithmic factors in n can be achieved with a sufficiently small \bar{\lambda}=O(n^{-2s/(d+4s)}). An example demonstrating the effect of the tuning parameters is visualized in the last subfigure of Figure 1.

6 Related works

In this section, we briefly review the papers in the existing literature that are most related to the present work.

6.1 ReLU and RePU networks

Deep learning has achieved impressive success in a wide range of applications. A fundamental reason for these successes is the ability of deep neural networks to approximate high-dimensional functions and extract effective data representations. There has been much effort devoted to studying the approximation properties of deep neural networks in recent years. Many interesting results have been obtained concerning the approximation power of deep neural networks for multivariate functions. Examples include Chen et al. (2019), Schmidt-Hieber (2020), Jiao et al. (2023). These works focused on the power of ReLU-activated neural networks for approximating various types of smooth functions.

For the approximation of the square function by ReLU networks, Yarotsky (2017) first used "sawtooth" functions, achieving an error rate of \mathcal{O}(2^{-L}) with width 6 and depth \mathcal{O}(L) for any positive integer L\in\mathbb{N}^{+}. A general construction of ReLU networks for approximating the square function achieves an error of N^{-L} with width 3N and depth L for any positive integers N,L\in\mathbb{N}^{+} (Lu et al., 2021b). Building on this basic fact, ReLU networks approximating multiplication and polynomials can be constructed accordingly. However, the network size (depth and width) required for a ReLU network to achieve a precise approximation can be large compared to that of a RePU network, since a RePU network can compute polynomials exactly with fewer layers and neurons.

The approximation results for RePU networks are generally obtained by converting splines or polynomials into RePU networks and making use of the approximation properties of splines and polynomials. The universality of sigmoidal deep neural networks was studied in the pioneering works of Mhaskar (1993) and Chui et al. (1994). In addition, the approximation properties of shallow Rectified Power Unit (RePU) activated networks were studied in Klusowski and Barron (2018) and Siegel and Xu (2022). The approximation rates of deep RePU neural networks for target functions in different spaces have also been explored, including Besov spaces (Ali and Nouy, 2021), Sobolev spaces (Li et al., 2019, 2020; Duan et al., 2021; Abdeljawad and Grohs, 2022), and Hölder spaces (Belomestny et al., 2022). Most existing results on the expressiveness of neural networks measure the quality of approximation with respect to the L_{p} norm with p\geq 1. Fewer papers have studied the approximation of the derivatives of smooth functions (Duan et al., 2021; Gühring and Raslan, 2021; Belomestny et al., 2022).

6.2 Related works on score estimation

Learning a probability distribution from data is a fundamental task in statistics and machine learning for efficient generation of new samples from the learned distribution. Likelihood-based models approach this problem by directly learning the probability density function, but they have several limitations, such as an intractable normalizing constant and approximate maximum likelihood training.

One alternative approach to circumvent these limitations is to model the score function (Liu et al., 2016), which is the gradient of the logarithm of the probability density function. Score-based models can be learned using a variety of methods, including parametric score matching methods (Hyvärinen and Dayan, 2005; Sasaki et al., 2014), autoencoders as its denoising variants (Vincent, 2011), sliced score matching (Song et al., 2020), nonparametric score matching (Sriperumbudur et al., 2017; Sutherland et al., 2018), and kernel estimators based on Stein’s methods (Li and Turner, 2017; Shi et al., 2018). These score estimators have been applied in many research problems, such as gradient flow and optimal transport methods (Gao et al., 2019, 2022), gradient-free adaptive MCMC (Strathmann et al., 2015), learning implicit models (Warde-Farley and Bengio, 2016), inverse problems (Jalal et al., 2021). Score-based generative learning models, especially those using deep neural networks, have achieved state-of-the-art performance in many downstream tasks and applications, including image generation (Song and Ermon, 2019, 2020; Song et al., 2021; Ho et al., 2020; Dhariwal and Nichol, 2021; Ho et al., 2022), music generation (Mittal et al., 2021), and audio synthesis (Chen et al., 2020; Kong et al., 2020; Popov et al., 2021).

However, there is a lack of theoretical understanding of nonparametric score estimation using deep neural networks. Existing studies mainly considered kernel-based methods. Zhou et al. (2020) studied regularized nonparametric score estimators using vector-valued reproducing kernel Hilbert spaces, which connect the kernel exponential family estimator (Sriperumbudur et al., 2017) with score estimators based on Stein's method (Li and Turner, 2017; Shi et al., 2018). Consistency and convergence rates of these kernel-based score estimators were also established under the correctly-specified model assumption in Zhou et al. (2020). For denoising autoencoders, Block et al. (2020) obtained generalization bounds for general nonparametric estimators, also under the correctly-specified model assumption.

For score-based learning using deep neural networks, the main difficulty in establishing a theoretical foundation is the limited understanding of differentiable neural networks, since the derivatives of the networks are involved in the estimation of the score function. Previously, the non-differentiable Rectified Linear Unit (ReLU) activated deep neural network received much attention due to its attractive computational and optimization properties, and has been extensively studied in terms of its complexity (Bartlett et al., 1998; Anthony and Bartlett, 1999; Bartlett et al., 2019) and approximation power (Yarotsky, 2017; Petersen and Voigtlaender, 2018; Shen et al., 2020; Lu et al., 2021a; Jiao et al., 2023), based on which statistical learning theories for deep nonparametric estimation were established (Bauer and Kohler, 2019; Schmidt-Hieber, 2020; Jiao et al., 2023). For deep neural networks with differentiable activation functions, such as ReQU and RePU, the simultaneous approximation of a smooth function and its derivatives has been studied recently (Ali and Nouy, 2021; Belomestny et al., 2022; Siegel and Xu, 2022; Hon and Yang, 2022), but the statistical properties of differentiable networks are still largely unknown. To the best of our knowledge, statistical learning theory has only been investigated for ReQU networks in Shen et al. (2022), where a network representation of the derivatives of ReQU networks was developed and its complexity studied.

6.3 Related works on isotonic regression

There is a rich and extensive literature on univariate isotonic regression, which is too vast to be adequately summarized here. So we refer to the books Barlow et al. (1972) and Robertson et al. (1988) for a systematic treatment of this topic and review of earlier works. For more recent developments on the error analysis of nonparametric isotonic regression, we refer to Durot (2002); Zhang (2002); Durot (2007, 2008); Groeneboom and Jongbloed (2014); Chatterjee et al. (2015), and Yang and Barber (2019), among others.

Least squares isotonic regression estimators under fixed designs have been extensively studied. With a fixed design at points x_{1},\ldots,x_{n}, the L_{p} risk of the least squares estimator is defined by \mathcal{R}_{n,p}(\hat{f}_{0})=\mathbb{E}(n^{-1}\sum_{i=1}^{n}|\hat{f}_{0}(x_{i})-f_{0}(x_{i})|^{p})^{1/p}, where the least squares estimator \hat{f}_{0} is defined by

\hat{f}_{0}=\arg\min_{f\in\mathcal{F}_{0}}\frac{1}{n}\sum_{i=1}^{n}\{Y_{i}-f(x_{i})\}^{2}. \quad (21)

The problem can be restated in terms of isotonic vector estimation on directed acyclic graphs. Specifically, the design points \{x_{1},\ldots,x_{n}\} induce a directed acyclic graph G_{x}(V(G_{x}),E(G_{x})) with vertices V(G_{x})=\{1,\ldots,n\} and edges E(G_{x})=\{(i,j):x_{i}\preceq x_{j}\}. The class of isotonic vectors on G_{x} is defined by

\mathcal{M}(G_{x}):=\{\theta\in\mathbb{R}^{V(G_{x})}:\theta_{i}\leq\theta_{j}\ \text{for}\ x_{i}\preceq x_{j}\}.

Then the least squares estimation in (21) becomes the problem of estimating the target vector \theta_{0}=\{(\theta_{0})_{i}\}_{i=1}^{n}:=\{f_{0}(x_{i})\}_{i=1}^{n}\in\mathcal{M}(G_{x}). The least squares estimator \hat{\theta}_{0}=\{(\hat{\theta}_{0})_{i}\}_{i=1}^{n}:=\{\hat{f}_{0}(x_{i})\}_{i=1}^{n} is the projection of \{Y_{i}\}_{i=1}^{n} onto the polyhedral convex cone \mathcal{M}(G_{x}) (Han et al., 2019).
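In the univariate case, this projection can be computed exactly by the pool-adjacent-violators algorithm; the short sketch below uses scikit-learn for illustration (the package choice and the simulated data are ours), while the multivariate case requires the graph formulation above.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, size=64))
y = 2.0 * x + rng.normal(scale=0.5, size=64)   # noisy monotone data

# Least squares projection of y onto the monotone cone (pool-adjacent-violators)
theta_hat = IsotonicRegression(increasing=True).fit_transform(x, y)
```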

For univariate isotonic least squares regression with a target function f_{0} of bounded total variation, Zhang (2002) obtained sharp upper bounds on the \mathcal{R}_{n,p} risk of the least squares estimator \hat{\theta}_{0} for 1\leq p<3. Shape-constrained estimators have also been considered in settings where an automatic rate-adaptation phenomenon occurs (Chatterjee et al., 2015; Gao et al., 2017; Bellec, 2018). We also refer to Kim et al. (2018) and Chatterjee and Lafferty (2019) for other examples of adaptation in univariate shape-constrained problems.

Error analysis for the least squares estimator in multivariate isotonic regression is more difficult. For two-dimensional isotonic regression, where X\in\mathbb{R}^{d} with d=2 and Gaussian noise, Chatterjee et al. (2018) considered the fixed lattice design case and obtained sharp error bounds. Han et al. (2019) extended the results of Chatterjee et al. (2018) to the case with d\geq 3, both from a worst-case perspective and an adaptation point of view. They also proved parallel results for random designs, assuming the density of the covariate X is bounded away from zero and infinity on the support.

Deng and Zhang (2020) considered a class of block estimators for multivariate isotonic regression in \mathbb{R}^{d} involving rectangular upper and lower sets, defined as any estimator lying between the max-min and min-max estimators. Under a q-th moment condition on the noise, they developed L_{q} risk bounds for such estimators for isotonic regression on graphs. Furthermore, the block estimator possesses an oracle property in variable selection: when f_{0} depends on only an unknown set of s variables, the L_{2} risk of the block estimator automatically achieves the minimax rate, up to a logarithmic factor, based on the knowledge of the set of the s variables.

Our proposed method and theoretical results differ from those in the aforementioned papers in several aspects. First, the resulting estimates from our method are smooth rather than piecewise constant as in the existing methods. Second, our method can mitigate the curse of dimensionality under an approximate low-dimensional manifold support assumption, which is weaker than the exact low-dimensional space assumption in the existing work. Finally, our method possesses a robustness property against model misspecification in the sense that it still yields consistent estimators when the monotonicity assumption is not strictly satisfied, whereas the properties of the existing isotonic regression methods under model misspecification are unclear.

7 Conclusions

In this work, motivated by the problems of score estimation and isotonic regression, we have studied the properties of RePU-activated neural networks, including a novel generalization result for the derivatives of RePU networks and improved approximation error bounds for RePU networks with approximate low-dimensional structures. We have established non-asymptotic excess risk bounds for DSME, a deep score matching estimator; and PDIR, our proposed penalized deep isotonic regression method.

Our findings highlight the potential of RePU-activated neural networks in addressing challenging problems in machine learning and statistics. The ability to accurately represent the partial derivatives of RePU networks with RePUs mixed-activated networks is a valuable tool in many applications that require the use of neural network derivatives. Moreover, the improved approximation error bounds for RePU networks with low-dimensional structures demonstrate their potential to mitigate the curse of dimensionality in high-dimensional settings.

Future work can investigate further the properties of RePU networks, such as their stability, robustness, and interpretability. It would also be interesting to explore the use of RePU-activated neural networks in other applications, such as nonparametric variable selection and more general shape-constrained estimation problems. Additionally, our work can be extended to other smooth activation functions beyond RePUs, such as Gaussian error linear unit and scaled exponential linear unit, and study their derivatives and approximation properties.

Appendix

This appendix contains results from simulation studies to evaluate the performance of PDIR and proofs and supporting lemmas for the theoretical results stated in the paper.

Appendix A Numerical studies

In this section, we conduct simulation studies to evaluate the performance of PDIR and compare it with existing isotonic regression methods. The methods considered in the simulation are as follows.

  • The isotonic least squares estimator, denoted by Isotonic LSE, is defined as the minimizer of the mean squared error on the training data subject to the monotonicity constraint. As the squared loss only involves the values at the n design points, the isotonic LSE (with no more than \binom{n}{2} linear constraints) can be computed with quadratic programming or convex optimization algorithms (Dykstra, 1983; Kyng et al., 2015; Stout, 2015). Algorithmically, the problem can be mapped to a network flow problem (Picard, 1976; Spouge et al., 2003). In our implementation, we compute the Isotonic LSE via the Python package multiisotonic (https://github.com/alexfields/multiisotonic).

  • The block estimator (Deng and Zhang, 2020), denoted by Block estimator, is defined as any estimator between the block min-max and max-min estimators (Fokianos et al., 2020). In the simulation, we take the Block estimator to be the mean of the max-min and min-max estimators, as suggested in Deng and Zhang (2020). The Isotonic LSE is known to have an explicit min-max representation at the design points for isotonic regression on general graphs (Robertson et al., 1988). As in Deng and Zhang (2020), we use brute force, which exhaustively calculates means over all blocks and finds the max-min value for each point x; the computational cost of this approach is of order n^{3}.

  • The deep isotonic regression estimator described in Section 5, denoted by PDIR. Here we focus on the RePU \sigma_{p} activated network with p=2. We implement it in Python via PyTorch and use Adam (Kingma and Ba, 2014) as the optimization algorithm with the default learning rate 0.01 and default \beta=(0.9,0.99) (the coefficients used for computing running averages of gradients and their squares). The tuning parameters are set to \lambda_{j}=\log(n) for j=1,\ldots,d. A minimal training sketch is given after this list.

  • The deep nonparametric regression estimator, denoted by DNR, which is PDIR without the penalty. The implementation is the same as that of PDIR, but with tuning parameters \lambda_{j}=0 for j=1,\ldots,d.
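The following is a minimal training sketch consistent with the description of PDIR above (Adam with learning rate 0.01 and \beta=(0.9,0.99), and \lambda_{j}=\log(n)); it reuses the hypothetical RePU activation and pdir_loss helpers sketched earlier in Sections 4 and 5, with a scalar network output, and is an illustration rather than the exact code used in our experiments.

```python
import math
import torch
import torch.nn as nn

# assumes the RePU module and pdir_loss function from the earlier sketches


def train_pdir(X, Y, width=64, depth=3, p=2, epochs=2000):
    n, d = X.shape
    lam = torch.full((d,), math.log(n))          # lambda_j = log(n) for all j
    layers, in_dim = [], d
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), RePU(p)]
        in_dim = width
    layers.append(nn.Linear(in_dim, 1))          # scalar regression output
    f = nn.Sequential(*layers)
    opt = torch.optim.Adam(f.parameters(), lr=0.01, betas=(0.9, 0.99))
    for _ in range(epochs):
        opt.zero_grad()
        loss = pdir_loss(f, X, Y, lam)
        loss.backward()
        opt.step()
    return f
```

Setting lam to a vector of zeros recovers the DNR estimator described above.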

A.1 Estimation and evaluation

For the proposed PDIR estimator, we set the tuning parameter \lambda_{j}=\log(n) for j=1,\ldots,d across the simulations. For each target function f_{0}, according to model (4) we generate the training data S_{\rm train}=(X_{i}^{\rm train},Y_{i}^{\rm train})_{i=1}^{n} with sample size n and train the Isotonic LSE, Block estimator, PDIR and DNR estimators on S_{\rm train}. We note that the Block estimator is not defined when the input x is "outside" the domain of the training data, i.e., when there exist no i,j\in\{1,\ldots,n\} such that X^{\rm train}_{i}\preceq x\preceq X^{\rm train}_{j}. In view of this, in our simulations we use a lattice design for the covariates (X^{\rm train}_{i})_{i=1}^{n} for ease of presentation of the Block estimator. For the PDIR and DNR estimators, such a fixed lattice design is not necessary, and the obtained estimators extend smoothly to larger domains covering the training samples.

For each f_{0}, we also generate the testing data S_{\rm test}=(X_{t}^{\rm test},Y_{t}^{\rm test})_{t=1}^{T} with sample size T from the same distribution as the training data. For each obtained estimator \hat{f}_{n}, we calculate the mean squared error (MSE) on the testing data S_{\rm test}. We also calculate the L_{1} distance between the estimator \hat{f}_{n} and the corresponding target function f_{0} on the testing data by

\|\hat{f}_{n}-f_{0}\|_{L^{1}(\nu)}=\frac{1}{T}\sum_{t=1}^{T}\Big|\hat{f}_{n}(X_{t}^{\rm test})-f_{0}(X_{t}^{\rm test})\Big|,

and we also calculate the L_{2} distance between the estimator \hat{f}_{n} and the target function f_{0}, i.e.,

\|\hat{f}_{n}-f_{0}\|_{L^{2}(\nu)}=\sqrt{\frac{1}{T}\sum_{t=1}^{T}\Big|\hat{f}_{n}(X_{t}^{\rm test})-f_{0}(X_{t}^{\rm test})\Big|^{2}}.
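For completeness, these evaluation quantities amount to the following elementary computations on the testing data (a NumPy sketch; the array and function names are ours):

```python
import numpy as np


def evaluate(f_hat_test, f0_test, y_test):
    """MSE on the testing data and empirical L1/L2 distances to the target f0."""
    mse = np.mean((y_test - f_hat_test) ** 2)
    l1 = np.mean(np.abs(f_hat_test - f0_test))
    l2 = np.sqrt(np.mean((f_hat_test - f0_test) ** 2))
    return mse, l1, l2
```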

In the simulation studies, for each data generation model we generate T=100^{d} testing points on a lattice (100 evenly spaced points in each dimension of the input), where d is the dimension of the input. We report the mean squared error, the L_{1} and L_{2} distances to the target function defined above, and their standard deviations over R=100 replications under different scenarios. The specific forms of f_{0} are given in the data generation models below.

A.2 Univariate models

We consider five basic univariate models, "Linear", "Exp", "Step", "Constant" and "Wave", which correspond to different specifications of the target function f_{0}. The formulae are given below.

  • (a) Linear: Y=f_{0}(x)+\epsilon=2x+\epsilon,

  • (b) Exp: Y=f_{0}(X)+\epsilon=\exp(2X)+\epsilon,

  • (c) Step: Y=f_{0}(X)+\epsilon=\sum h_{i}I(X\geq t_{i})+\epsilon,

  • (d) Constant: Y=f_{0}(X)+\epsilon=\exp(2X)+\epsilon,

  • (e) Wave: Y=f_{0}(X)+\epsilon=4X+2X\sin(4\pi X)+\epsilon,

where (h_{i})=(1,2,2), (t_{i})=(0.2,0.6,1) and \epsilon\sim N(0,\frac{1}{4}) follows a normal distribution. We use the linear model as a baseline in our simulations and expect all methods to perform well under it. The "Step" model is monotonic but neither smooth nor continuous. The "Constant" model is monotonic but not strictly monotonic. The "Wave" model is nonlinear and smooth but non-monotonic. These models are chosen so that we can evaluate the performance of the Isotonic LSE, Block estimator, PDIR and DNR under different types of models, including the conventional and misspecified cases.

For these models, we use the lattice design for ease of presentation of the Block estimator, where (X_{i}^{\rm train})_{i=1}^{n} are lattice points evenly distributed on the interval [0,1]. Figure S2 shows all these univariate data generation models.
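For instance, data from the "Wave" model with the lattice design can be generated as follows (a sketch; the function name is ours, and the noise standard deviation 0.5 corresponds to the variance 1/4 specified above):

```python
import numpy as np


def generate_wave(n, sigma=0.5, seed=0):
    """Training data from the univariate 'Wave' model: Y = 4X + 2X sin(4*pi*X) + eps."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)                 # lattice design on [0, 1]
    y = 4.0 * x + 2.0 * x * np.sin(4.0 * np.pi * x) + rng.normal(scale=sigma, size=n)
    return x, y
```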

Figure S3 shows an instance of the estimated curves for the "Linear", "Exp", "Step" and "Constant" models when the sample size is n=64. In these plots, the training data are depicted as grey dots, the target functions are depicted as dashed black curves, and the estimated functions are represented by solid curves in different colors. The summary statistics are presented in Table S3. Compared with the piecewise constant estimates of the Isotonic LSE and Block estimator, the PDIR estimator is smooth, and it works reasonably well under the univariate models, especially those with smooth target functions.

Figure S2: Univariate data generation models. The target functions are depicted by solid curves in blue and instance samples of size n=64 are depicted as black dots.
Figure S3: An instance of the estimated curves for the "Linear", "Exp", "Step" and "Constant" models when the sample size is n=64. The training data are depicted as grey dots. The target functions are depicted as dashed curves in black, and the estimated functions are represented by solid curves with different colors.
Table S3: Summary statistics for the simulation results under different univariate models (d=1). The averaged mean squared error of the estimates on the testing data and the L_{1}, L_{2} distances to the target function are calculated over 100 replications. The standard deviations are reported in parentheses.
Model | Method | n=64: MSE, L_{1}, L_{2} | n=256: MSE, L_{1}, L_{2}
Linear DNR 0.266 (0.011) 0.101 (0.035) 0.122 (0.040) 0.253 (0.011) 0.055 (0.020) 0.068 (0.023)
PDIR 0.265 (0.012) 0.098 (0.037) 0.118 (0.041) 0.254 (0.012) 0.058 (0.024) 0.070 (0.027)
Isotonic LSE 0.282 (0.013) 0.140 (0.027) 0.177 (0.035) 0.262 (0.012) 0.088 (0.012) 0.113 (0.017)
Block 0.330 (0.137) 0.165 (0.060) 0.243 (0.155) 0.277 (0.033) 0.106 (0.021) 0.153 (0.060)
Exp DNR 0.268 (0.014) 0.103 (0.043) 0.124 (0.049) 0.256 (0.012) 0.055 (0.024) 0.068 (0.027)
PDIR 0.268 (0.017) 0.102 (0.049) 0.124 (0.056) 0.255 (0.012) 0.055 (0.022) 0.068 (0.026)
Isotonic LSE 0.312 (0.018) 0.195 (0.028) 0.246 (0.034) 0.274 (0.014) 0.120 (0.014) 0.153 (0.018)
Block 0.302 (0.021) 0.177 (0.028) 0.223 (0.034) 0.272 (0.012) 0.115 (0.015) 0.146 (0.017)
Step DNR 0.375 (0.045) 0.259 (0.059) 0.347 (0.061) 0.315 (0.017) 0.169 (0.022) 0.253 (0.018)
PDIR 0.366 (0.042) 0.245 (0.058) 0.335 (0.057) 0.311 (0.018) 0.153 (0.025) 0.245 (0.018)
Isotonic LSE 0.304 (0.020) 0.151 (0.041) 0.228 (0.039) 0.275 (0.014) 0.081 (0.022) 0.155 (0.020)
Block 0.382 (0.217) 0.208 (0.082) 0.327 (0.160) 0.295 (0.046) 0.108 (0.035) 0.197 (0.086)
Constant DNR 0.266 (0.012) 0.102 (0.038) 0.122 (0.042) 0.258 (0.013) 0.057 (0.021) 0.069 (0.023)
PDIR 0.260 (0.011) 0.080 (0.045) 0.092 (0.049) 0.257 (0.012) 0.051 (0.025) 0.060 (0.028)
Isotonic LSE 0.265 (0.013) 0.087 (0.044) 0.114 (0.052) 0.258 (0.012) 0.044 (0.020) 0.068 (0.025)
Block 0.264 (0.012) 0.085 (0.044) 0.108 (0.049) 0.258 (0.012) 0.044 (0.020) 0.066 (0.025)
Wave DNR 0.289 (0.023) 0.156 (0.039) 0.192 (0.044) 0.262 (0.014) 0.089 (0.025) 0.110 (0.029)
PDIR 0.530 (0.030) 0.398 (0.026) 0.528 (0.018) 0.511 (0.022) 0.368 (0.014) 0.510 (0.009)
Isotonic LSE 0.525 (0.027) 0.399 (0.022) 0.524 (0.015) 0.495 (0.020) 0.353 (0.009) 0.494 (0.004)
Block 0.516 (0.024) 0.391 (0.022) 0.519 (0.017) 0.497 (0.023) 0.358 (0.012) 0.500 (0.013)

A.3 Bivariate models

We consider several basic bivariate models, including a polynomial model ("Polynomial"), a concave model ("Concave"), a step model ("Step"), a partial model ("Partial"), a constant model ("Constant") and a wave model ("Wave"), which correspond to different specifications of the target function f_{0}. The formulae are given below.

  • (a) Polynomial: Y=f_{0}(X)+\epsilon=\frac{10}{2^{3/4}}(x_{1}+x_{2})^{3/4}+\epsilon,

  • (b) Concave: Y=f_{0}(X)+\epsilon=1+3x_{1}(1-\exp(-3x_{2}))+\epsilon,

  • (c) Step: Y=f_{0}(X)+\epsilon=\sum h_{i}I(x_{1}+x_{2}\geq t_{i})+\epsilon,

  • (d) Partial: Y=f_{0}(X)+\epsilon=10x_{2}^{8/3}+\epsilon,

  • (e) Constant: Y=f_{0}(X)+\epsilon=3+\epsilon,

  • (f) Wave: Y=f_{0}(X)+\epsilon=5(x_{1}+x_{2})+3(x_{1}+x_{2})\sin(\pi(x_{1}+x_{2}))+\epsilon,

where X=(x1,x2)X=(x_{1},x_{2}), (hi)=(1,2,2,1.5,0.5,1)(h_{i})=(1,2,2,1.5,0.5,1), (ti)=(0.2,0.6,1.0,1.3,1.7,1.9)(t_{i})=(0.2,0.6,1.0,1.3,1.7,1.9) and ϵN(0,14)\epsilon\sim N(0,\frac{1}{4}) follows a normal distribution. The “Polynomial” and “Concave” models are monotonic models. The “Step” model is monotonic but not smooth, and not even continuous. In the “Partial” model, the response is related to only one covariate. The “Constant” model is monotonic but not strictly monotonic, and the “Wave” model is nonlinear and smooth but not monotonic. We use the lattice design for ease of presentation of the Block estimator, where (Xitrain)i=1n(X_{i}^{\rm train})_{i=1}^{n} are the lattice points evenly distributed on the square [0,1]2[0,1]^{2}. Simulation results over 100 replications are summarized in Table S4. For each model, we take an instance from the replications to present the heatmaps and the 3D surfaces of the predictions of these estimates; see Figures S4-S15. In the heatmaps, we show the observed data (linearly interpolated), the true target function f0f_{0} and the estimates of the different methods. We can see that, compared with the piece-wise constant estimates of the Isotonic LSE and the Block estimator, the PDIR estimator is smooth and works reasonably well under the bivariate models, especially for models with smooth target functions.
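For concreteness, the lattice design and the data generation for the bivariate models above can be sketched in a few lines of Python (a minimal NumPy sketch for illustration only; the function names are ours and only models (a) and (c) are shown):

```python
import numpy as np

def lattice_design(n):
    # n lattice points evenly spread over [0, 1]^2 (n is assumed to be a perfect square)
    m = int(round(np.sqrt(n)))
    grid = (np.arange(m) + 0.5) / m
    x1, x2 = np.meshgrid(grid, grid)
    return np.column_stack([x1.ravel(), x2.ravel()])

def f0_polynomial(x):                      # model (a)
    return 10.0 / 2 ** 0.75 * (x[:, 0] + x[:, 1]) ** 0.75

def f0_step(x):                            # model (c)
    h = np.array([1.0, 2.0, 2.0, 1.5, 0.5, 1.0])
    t = np.array([0.2, 0.6, 1.0, 1.3, 1.7, 1.9])
    s = x[:, 0] + x[:, 1]
    return ((s[:, None] >= t[None, :]) * h[None, :]).sum(axis=1)

def generate(f0, n, sigma=0.5, seed=0):    # eps ~ N(0, 1/4), i.e. sd = 0.5
    rng = np.random.default_rng(seed)
    X = lattice_design(n)
    Y = f0(X) + sigma * rng.standard_normal(len(X))
    return X, Y

X_train, Y_train = generate(f0_polynomial, n=64)
```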

Table S4: Summary statistics for the simulation results under different bivariate models (d=2d=2). The averaged mean squared error of the estimates on testing data and L1,L2L_{1},L_{2} distance to the target function are calculated over 100 replications. The standard deviations are reported in parenthesis.
Model Method n=64n=64 n=256n=256
MSE L1L_{1} L2L_{2} MSE L1L_{1} L2L_{2}
Polynomial DNR 4.735 (0.344) 0.138 (0.041) 0.172 (0.046) 4.655 (0.178) 0.078 (0.022) 0.098 (0.025)
PDIR 4.724 (0.366) 0.140 (0.045) 0.171 (0.049) 4.688 (0.181) 0.077 (0.023) 0.096 (0.026)
Isotonic LSE 8.309 (0.405) 0.755 (0.061) 0.884 (0.061) 6.052 (0.153) 0.364 (0.019) 0.444 (0.021)
Block 4.780 (0.284) 0.319 (0.020) 0.397 (0.026) 4.747 (0.129) 0.210 (0.011) 0.264 (0.017)
Concave DNR 0.282 (0.016) 0.142 (0.038) 0.176 (0.042) 0.261 (0.007) 0.083 (0.022) 0.103 (0.024)
PDIR 0.276 (0.015) 0.129 (0.038) 0.158 (0.043) 0.260 (0.007) 0.077 (0.024) 0.096 (0.028)
Isotonic LSE 0.393 (0.042) 0.308 (0.051) 0.375 (0.055) 0.294 (0.010) 0.163 (0.020) 0.207 (0.022)
Block 0.303 (0.015) 0.183 (0.025) 0.229 (0.030) 0.275 (0.006) 0.125 (0.012) 0.157 (0.014)
Step DNR 0.561 (0.030) 0.462 (0.020) 0.557 (0.025) 0.519 (0.011) 0.432 (0.010) 0.519 (0.010)
PDIR 0.561 (0.030) 0.461 (0.019) 0.557 (0.024) 0.519 (0.011) 0.431 (0.010) 0.520 (0.010)
Isotonic LSE 1.462 (0.104) 0.852 (0.046) 1.100 (0.047) 0.700 (0.028) 0.430 (0.019) 0.672 (0.021)
Block 0.657 (0.031) 0.503 (0.022) 0.638 (0.023) 0.461 (0.016) 0.321 (0.014) 0.457 (0.014)
Partial DNR 12.67 (0.530) 0.129 (0.033) 0.161 (0.039) 12.77 (0.338) 0.079 (0.023) 0.099 (0.027)
PDIR 12.70 (0.548) 0.112 (0.037) 0.136 (0.042) 12.72 (0.285) 0.063 (0.021) 0.080 (0.026)
Isotonic LSE 17.78 (0.532) 0.739 (0.052) 1.062 (0.055) 15.00 (0.262) 0.378 (0.024) 0.528 (0.028)
Block 12.31 (0.571) 0.435 (0.041) 0.578 (0.043) 12.50 (0.313) 0.237 (0.028) 0.313 (0.049)
Constant DNR 0.278 (0.016) 0.131 (0.038) 0.160 (0.042) 0.260 (0.005) 0.079 (0.020) 0.097 (0.022)
PDIR 0.266 (0.013) 0.094 (0.047) 0.111 (0.050) 0.255 (0.005) 0.052 (0.022) 0.063 (0.025)
Isotonic LSE 0.280 (0.021) 0.121 (0.047) 0.161 (0.056) 0.262 (0.006) 0.076 (0.025) 0.108 (0.026)
Block 0.265 (0.012) 0.089 (0.040) 0.110 (0.046) 0.256 (0.005) 0.059 (0.022) 0.075 (0.024)
Wave DNR 0.306 (0.020) 0.189 (0.036) 0.233 (0.042) 0.269 (0.009) 0.108 (0.025) 0.135 (0.029)
PDIR 0.459 (0.058) 0.390 (0.056) 0.454 (0.063) 0.581 (0.039) 0.493 (0.028) 0.574 (0.033)
Isotonic LSE 1.380 (0.085) 0.918 (0.032) 1.063 (0.040) 0.989 (0.024) 0.760 (0.014) 0.860 (0.012)
Block 0.978 (0.022) 0.750 (0.014) 0.854 (0.012) 0.892 (0.021) 0.693 (0.009) 0.802 (0.008)
Figure S4: Heatmaps of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (a) when d=2d=2 and n=64n=64.
Figure S5: 3D surface plots of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (a) when d=2d=2 and n=64n=64.
Figure S6: Heatmaps of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (b) when d=2d=2 and n=64n=64.
Figure S7: 3D surface plots of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (b) when d=2d=2 and n=64n=64.
Figure S8: Heatmaps of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (c) when d=2d=2 and n=64n=64.
Figure S9: 3D surface plots of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (c) when d=2d=2 and n=64n=64.
Figure S10: Heatmaps of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (d) when d=2d=2 and n=64n=64.
Figure S11: 3D surface plots of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (d) when d=2d=2 and n=64n=64.
Figure S12: Heatmaps of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (e) when d=2d=2 and n=64n=64.
Figure S13: 3D surface plots of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (e) when d=2d=2 and n=64n=64.
Figure S14: Heatmaps of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (f) when d=2d=2 and n=64n=64.
Figure S15: 3D surface plots of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (f) when d=2d=2 and n=64n=64.

A.4 Tuning parameter

In this subsection, we investigate the numerical performance of PDIR with respect to different choices of the tuning parameters under different models.

For univariate models, we calculate the testing statistics L1L_{1} and L2L_{2} for the tuning parameter λ\lambda on 20 lattice points in the interval [0,3log(n)][0,3\log(n)]. For each λ\lambda, we run 20 replications and report the average L1L_{1} and L2L_{2} statistics together with their 90% empirical bands. For each replication, we train the PDIR using n=256n=256 training samples and evaluate on T=1,000T=1,000 testing samples. In our simulation, four univariate models from Section A.2 are considered, namely “Exp”, “Constant”, “Step” and the misspecified model “Wave”, and the results are reported in Figure S16. We can see that for the isotonic models “Exp” and “Step”, the estimate is not sensitive to the choice of the tuning parameter λ\lambda in [0,3log(n)][0,3\log(n)]; all choices lead to reasonable estimates. For the “Constant” model, which is isotonic but not strictly isotonic, the errors increase slightly as the tuning parameter λ\lambda increases over [0,3log(n)][0,3\log(n)]. Overall, the choice λ=log(n)\lambda=\log(n) leads to reasonably good estimates for correctly specified models. For the misspecified model “Wave”, the estimates deteriorate quickly as the tuning parameter λ\lambda increases from 0, and after that the additional negative effect of increasing λ\lambda is slight.
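The tuning experiment described above is a simple grid search over λ\lambda; a minimal sketch is given below, where train_pdir and sample are hypothetical stand-ins for the PDIR training routine and the data-generating mechanism (they are not part of our actual implementation):

```python
import numpy as np

def lambda_grid_experiment(train_pdir, f0, sample, n=256, n_test=1000,
                           n_rep=20, n_grid=20):
    """Grid search over the tuning parameter lambda on [0, 3*log(n)].
    train_pdir(X, Y, lam) returns a callable estimate; sample(n, seed)
    generates (X, Y); f0 is the target function."""
    lambdas = np.linspace(0.0, 3.0 * np.log(n), n_grid)
    L1 = np.zeros((n_grid, n_rep))
    L2 = np.zeros((n_grid, n_rep))
    for r in range(n_rep):
        X, Y = sample(n, seed=r)
        X_test, _ = sample(n_test, seed=10_000 + r)
        for k, lam in enumerate(lambdas):
            f_hat = train_pdir(X, Y, lam)
            diff = f_hat(X_test) - f0(X_test)
            L1[k, r] = np.abs(diff).mean()
            L2[k, r] = np.sqrt((diff ** 2).mean())
    bands = lambda M: np.quantile(M, [0.05, 0.95], axis=1)  # 90% empirical band
    return lambdas, L1.mean(axis=1), bands(L1), L2.mean(axis=1), bands(L2)
```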

Figure S16: L1L_{1} and L2L_{2} distances between estimates and the targets with different tuning parameters under univariate models with size n=256n=256. For each value of tuning parameter λ\lambda, the mean L1L_{1} and L2L_{2} distances (solid blue and red curves) and their 90% empirical band (blue and red ranges) are calculated over 20 replications. A vertical dashed line is presented at λ=log(n)\lambda=\log(n).

For bivariate models, our first simulation study focuses on the case where the tuning parameters take the same value, i.e., λ1=λ2\lambda_{1}=\lambda_{2}. We calculate the testing statistics L1L_{1} and L2L_{2} for tuning parameters λ1=λ2\lambda_{1}=\lambda_{2} on 20 lattice points in the interval [0,3log(n)][0,3\log(n)]. For each λ\lambda, we run 20 replications and report the average L1L_{1} and L2L_{2} statistics together with their 90% empirical bands. For each replication, we train the PDIR using n=256n=256 training samples and evaluate on T=10,000T=10,000 testing samples. In our simulation, four bivariate models from Section A.3 are considered, namely “Partial”, “Constant”, “Concave” and the misspecified model “Wave”, and the results are reported in Figure S17. The observations are similar to those for the univariate models: the estimates are not sensitive to the choices of the tuning parameters over [0,3log(n)][0,3\log(n)] for correctly specified (i.e., isotonic) models. For the misspecified model “Wave”, the estimates deteriorate quickly as the tuning parameter λ\lambda increases from 0, and after that increasing λ\lambda only slightly degrades the estimates.

Figure S17: L1L_{1} and L2L_{2} distances between estimates and the targets with different tuning parameters under bivariate models with size n=256n=256. For each value of tuning parameter (λ1,λ2)(\lambda_{1},\lambda_{2}) with λ1=λ2\lambda_{1}=\lambda_{2}, the mean L1L_{1} and L2L_{2} distances (solid blue and red curves) and their 90% empirical band (blue and red ranges) are calculated over 20 replications. A vertical dashed line is presented at λ1=λ2=log(n)\lambda_{1}=\lambda_{2}=\log(n).

In the second simulation study of bivariate models, we can choose different values for different components of the tuning parameter λ=(λ1,λ2)\lambda=(\lambda_{1},\lambda_{2}), i.e., λ1\lambda_{1} can be different from λ2\lambda_{2}. We investigate this by considering the following bivariate model, where the target function f0f_{0} is monotonic in its second argument and non-monotonic in its first one.

  • (g)

    Model (g):

    Y=f0(X)+ϵ=2sin(2πx1)+4(x2)4/3+ϵ,Y=f_{0}(X)+\epsilon=2\sin(2\pi x_{1})+4(x_{2})^{4/3}+\epsilon,

where X=(x1,x2)X=(x_{1},x_{2}) and ϵN(0,14)\epsilon\sim N(0,\frac{1}{4}) follows a normal distribution. Heatmaps of the observed training data and the target function f0f_{0}, together with 3D surface plots of the target function f0f_{0}, under model (g) when d=2d=2 and n=256n=256 are presented in Figure S18.

For model (g), we calculate the mean testing statistics L1L_{1} and L2L_{2} for the tuning parameter λ=(λ1,λ2)\lambda=(\lambda_{1},\lambda_{2}) on 400 grid points over the region [0,3log(n)]×[0,3log(n)][0,3\log(n)]\times[0,3\log(n)]. For each λ=(λ1,λ2)\lambda=(\lambda_{1},\lambda_{2}), we run 5 replications and report the average L1L_{1} and L2L_{2} statistics. For each replication, we train the PDIR using n=256n=256 training samples and evaluate on T=10,000T=10,000 testing samples. The mean L1L_{1} and L2L_{2} distances between the estimates and the target function on the testing data under different λ\lambda are depicted in Figure S19. Since the target function f0f_{0} is increasing in its second argument, the estimates are insensitive to the tuning parameter λ2\lambda_{2}; since f0f_{0} is non-monotonic in its first argument, the estimates deteriorate as λ1\lambda_{1} gets larger. The simulation results suggest that we should penalize the partial derivatives only with respect to the monotonic arguments, but not the non-monotonic ones. The estimates are sensitive to the tuning parameter under misspecification, especially when the tuning parameter increases from 0.
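To make the role of the coordinate-wise tuning parameters concrete, the following is a minimal PyTorch sketch of one plausible form of the penalized objective, in which each λj\lambda_{j} penalizes negative partial derivatives in the jj-th coordinate; the names are ours and the exact penalty used in our implementation may differ in detail.

```python
import torch

def pdir_loss(model, X, Y, lams):
    """One plausible form of the penalized objective: squared error plus a
    per-coordinate penalty on negative partial derivatives, weighted by
    lams = (lambda_1, ..., lambda_d). A sketch only."""
    X = X.clone().requires_grad_(True)
    pred = model(X).squeeze(-1)                    # model: (n, d) -> (n, 1)
    mse = ((pred - Y) ** 2).mean()
    # per-sample gradient of the scalar output w.r.t. each input coordinate
    grads = torch.autograd.grad(pred.sum(), X, create_graph=True)[0]   # (n, d)
    penalty = sum(lam * torch.relu(-grads[:, j]).mean()
                  for j, lam in enumerate(lams))
    return mse + penalty
```

For model (g), the results above suggest taking λ1=0\lambda_{1}=0 and a positive λ2\lambda_{2}, i.e., penalizing only the coordinate in which monotonicity is assumed.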

Figure S18: Heatmaps for the observed training data, the target function f0f_{0}, and 3D surface plots for the target function f0f_{0} under model (g) when d=2d=2 and n=256n=256.
Figure S19: The mean L1L_{1} and L2L_{2} distances between the estimates and the target function on the testing data under different tuning parameters (λ1,λ2)(\lambda_{1},\lambda_{2}) for model (g) when d=2d=2 and n=256n=256.

Appendix B Proofs

Proof of Theorem 1

For an integer p2p\geq 2, let σp1(x)=max{0,x}p1\sigma_{p-1}(x)=\max\{0,x\}^{p-1} and σp(x)=max{0,x}p\sigma_{p}(x)=\max\{0,x\}^{p} denote the corresponding RePU activation functions. Let (d0,d1,,d𝒟+1)(d_{0},d_{1},\ldots,d_{\mathcal{D}+1}) be the vector of widths (numbers of neurons) of the layers in the original RePU network, where d0=dd_{0}=d and d𝒟+1=1d_{\mathcal{D}+1}=1 in our problem. We let fu(i)f^{(i)}_{u} be the function (a subnetwork of the RePU network) from 𝒳d\mathcal{X}\subset\mathbb{R}^{d} to \mathbb{R} which takes X=(x1,,xd)X=(x_{1},\ldots,x_{d}) as input and outputs the uu-th neuron of the ii-th layer, for u=1,,diu=1,\ldots,d_{i} and i=1,,𝒟+1i=1,\ldots,\mathcal{D}+1.

We construct Mixed RePUs activated subnetworks to compute (xjf1(i),,xjfdi(i))(\frac{\partial}{\partial x_{j}}f^{(i)}_{1},\ldots,\frac{\partial}{\partial x_{j}}f^{(i)}_{d_{i}}) iteratively for i=1,,𝒟+1i=1,\ldots,\mathcal{D}+1, i.e., we construct the partial derivatives of the original RePU subnetworks (up to the ii-th layer) step by step. Without loss of generality, we take j=d0j=d_{0}; the construction is the same for any other j{1,,d}j\in\{1,\ldots,d\}. We illustrate the details of the construction of the Mixed RePUs subnetworks for the first two layers (i=1,2i=1,2) and the last layer (i=𝒟+1)(i=\mathcal{D}+1), and apply induction for layers i=3,,𝒟i=3,\ldots,\mathcal{D}. Note that the derivative of the RePU activation function σp\sigma_{p} is σp(x)=pσp1(x)\sigma^{\prime}_{p}(x)=p\sigma_{p-1}(x); then for i=1i=1 and any u=1,,d1u=1,\ldots,d_{1},

xjfu(1)=xjσp(i=1d0wui(1)xi+bu(1))=pσp1(i=1d0wui(1)xi+bu(1))wu,d0(1),\displaystyle\frac{\partial}{\partial x_{j}}f^{(1)}_{u}=\frac{\partial}{\partial x_{j}}\sigma_{p}\Big{(}\sum_{i=1}^{d_{0}}w^{(1)}_{ui}x_{i}+b_{u}^{(1)}\Big{)}=p\sigma_{p-1}\Big{(}\sum_{i=1}^{d_{0}}w^{(1)}_{ui}x_{i}+b_{u}^{(1)}\Big{)}\cdot w_{u,d_{0}}^{(1)}, (B.1)

where we denote by wui(1)w^{(1)}_{ui} and bu(1)b_{u}^{(1)} the corresponding weights and biases in the first layer of the original RePU network. Now we intend to construct a 4-layer (2 hidden layers) Mixed RePUs network with width (d0,3d1,6d1,2d1)(d_{0},3d_{1},6d_{1},2d_{1}) which takes X=(x1,,xd0)X=(x_{1},\ldots,x_{d_{0}}) as input and outputs

(f1(1),,fd1(1),xjf1(1),,xjfd1(1))2d1.(f^{(1)}_{1},\ldots,f^{(1)}_{d_{1}},\frac{\partial}{\partial x_{j}}f^{(1)}_{1},\ldots,\frac{\partial}{\partial x_{j}}f^{(1)}_{d_{1}})\in\mathbb{R}^{2d_{1}}.

Note that the output of such a network contains all the quantities needed to calculate (xjf1(2),,xjfd2(2))(\frac{\partial}{\partial x_{j}}f^{(2)}_{1},\ldots,\frac{\partial}{\partial x_{j}}f^{(2)}_{d_{2}}), so the process of construction can be continued iteratively and the induction proceeds. In the first hidden layer, we can obtain the 3d13d_{1} neurons

(f1(1),,fd1(1),p|w1,d0(1)|,,p|wd1,d0(1)|,σp1(i=1d0w1i(1)xi+b1(1)),,σp1(i=1d0wd1i(1)xi+bd1(1))),(f^{(1)}_{1},\ldots,f^{(1)}_{d_{1}},p|w^{(1)}_{1,d_{0}}|,\ldots,p|w^{(1)}_{d_{1},d_{0}}|,\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{1i}x_{i}+b^{(1)}_{1}),\ldots,\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{d_{1}i}x_{i}+b^{(1)}_{d_{1}})),

with weight matrix A1(1)A^{(1)}_{1} having 2d0d12d_{0}d_{1} parameters, bias vector B1(1)B^{(1)}_{1} and activation function vector Σ1\Sigma_{1} being

A1(1)=[w1,1(1)w1,2(1)w1,d0(1)w2,1(1)w2,2(1)w2,d0(1)wd1,1(1)wd1,2(1)wd1,d0(1)0000000000w1,1(1)w1,2(1)w1,d0(1)w2,1(1)w2,2(1)w2,d0(1)wd1,1(1)wd1,2(1)wd1,d0(1)]3d1×d0,\displaystyle A^{(1)}_{1}=\left[\begin{array}[]{ccccc}w^{(1)}_{1,1}&w^{(1)}_{1,2}&\cdots&\cdots&w^{(1)}_{1,d_{0}}\\ w^{(1)}_{2,1}&w^{(1)}_{2,2}&\cdots&\cdots&w^{(1)}_{2,d_{0}}\\ \ldots&\ldots&\ldots&\ldots&\ldots\\ w^{(1)}_{d_{1},1}&w^{(1)}_{d_{1},2}&\cdots&\cdots&w^{(1)}_{d_{1},d_{0}}\\ 0&0&0&0&0\\ \ldots&\ldots&\ldots&\ldots&\ldots\\ 0&0&0&0&0\\ w^{(1)}_{1,1}&w^{(1)}_{1,2}&\cdots&\cdots&w^{(1)}_{1,d_{0}}\\ w^{(1)}_{2,1}&w^{(1)}_{2,2}&\cdots&\cdots&w^{(1)}_{2,d_{0}}\\ \ldots&\ldots&\ldots&\ldots&\ldots\\ w^{(1)}_{d_{1},1}&w^{(1)}_{d_{1},2}&\cdots&\cdots&w^{(1)}_{d_{1},d_{0}}\\ \end{array}\right]\in\mathbb{R}^{3d_{1}\times d_{0}},
B1(1)=[b1(1)b2(1)bd1(1)p|w1,d0(1)|p|w2,d0(1)|p|wd1,d0(1)|b1(1)b2(1)bd1(1)]3d1,Σ1(1)=[σpσpσ1σ1σp1σp1],\displaystyle B^{(1)}_{1}=\left[\begin{array}[]{c}b^{(1)}_{1}\\ b^{(1)}_{2}\\ \ldots\\ b^{(1)}_{d_{1}}\\ p|w^{(1)}_{1,d_{0}}|\\ p|w^{(1)}_{2,d_{0}}|\\ \ldots\\ p|w^{(1)}_{d_{1},d_{0}}|\\ b^{(1)}_{1}\\ b^{(1)}_{2}\\ \ldots\\ b^{(1)}_{d_{1}}\\ \end{array}\right]\in\mathbb{R}^{3d_{1}},\quad\Sigma^{(1)}_{1}=\left[\begin{array}[]{c}\sigma_{p}\\ \ldots\\ \sigma_{p}\\ \sigma_{1}\\ \ldots\\ \sigma_{1}\\ \sigma_{p-1}\\ \ldots\\ \sigma_{p-1}\\ \end{array}\right],

where the first d1d_{1} activation functions of Σ1\Sigma_{1} are chosen to be σp\sigma_{p}, the last d1d_{1} activation functions are chosen to be σp1\sigma_{p-1} and the rest σ1\sigma_{1}. In the second hidden layer, we can obtain 6d16d_{1} neurons. The first 2d12d_{1} neurons of the second hidden layer (or the third layer) are

(σ1(f1(1)),σ1(f1(1)),,σ1(fd1(1)),σ1(fd1(1))),(\sigma_{1}(f^{(1)}_{1}),\sigma_{1}(-f^{(1)}_{1}),\ldots,\sigma_{1}(f^{(1)}_{d_{1}}),\sigma_{1}(-f^{(1)}_{d_{1}})),

which implements the identity map so that (f1(1),,fd1(1))(f^{(1)}_{1},\ldots,f^{(1)}_{d_{1}}) can be kept and output in the next layer, since the identity map can be realized as x=σ1(x)σ1(x)x=\sigma_{1}(x)-\sigma_{1}(-x). The remaining 4d14d_{1} neurons of the second hidden layer (the third layer) are

[σ2(pw1,d0(1)+σp1(i=1d0w1i(1)xi+b1(1)))σ2(pw1,d0(1)σp1(i=1d0w1i(1)xi+b1(1)))σ2(pw1,d0(1)+σp1(i=1d0w1i(1)xi+b1(1))σ2(pw1,d0(1)σp1(i=1d0w1i(1)xi+b1(1)))σ2(pwd1,d0(1)+σp1(i=1d0wd1i(1)xi+bd1(1)))σ2(pwd1,d0(1)σp1(i=1d0wd1i(1)xi+bd1(1)))σ2(pwd1,d0(1)+σp1(i=1d0wd1i(1)xi+bd1(1))σ2(pwd1,d0(1)σp1(i=1d0wd1i(1)xi+bd1(1)))]4d1,\displaystyle\left[\begin{array}[]{c}\sigma_{2}(p\cdot w^{(1)}_{1,d_{0}}+\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{1i}x_{i}+b^{(1)}_{1}))\\ \sigma_{2}(p\cdot w^{(1)}_{1,d_{0}}-\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{1i}x_{i}+b^{(1)}_{1}))\\ \sigma_{2}(-p\cdot w^{(1)}_{1,d_{0}}+\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{1i}x_{i}+b^{(1)}_{1})\\ \sigma_{2}(-p\cdot w^{(1)}_{1,d_{0}}-\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{1i}x_{i}+b^{(1)}_{1}))\\ \ldots\\ \sigma_{2}(p\cdot w^{(1)}_{d_{1},d_{0}}+\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{d_{1}i}x_{i}+b^{(1)}_{d_{1}}))\\ \sigma_{2}(p\cdot w^{(1)}_{d_{1},d_{0}}-\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{d_{1}i}x_{i}+b^{(1)}_{d_{1}}))\\ \sigma_{2}(-p\cdot w^{(1)}_{d_{1},d_{0}}+\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{d_{1}i}x_{i}+b^{(1)}_{d_{1}})\\ \sigma_{2}(-p\cdot w^{(1)}_{d_{1},d_{0}}-\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{d_{1}i}x_{i}+b^{(1)}_{d_{1}}))\\ \end{array}\right]\in\mathbb{R}^{4d_{1}},

which is ready for implementing the multiplications in (B.1) to obtain (xjf1(1),,xjfd1(1))d1(\frac{\partial}{\partial x_{j}}f^{(1)}_{1},\ldots,\frac{\partial}{\partial x_{j}}f^{(1)}_{d_{1}})\in\mathbb{R}^{d_{1}} since

xy=14{(x+y)2(xy)2}=14{σ2(x+y)+σ2(xy)σ2(xy)σ2(x+y)}.\displaystyle x\cdot y=\frac{1}{4}\{(x+y)^{2}-(x-y)^{2}\}=\frac{1}{4}\{\sigma_{2}(x+y)+\sigma_{2}(-x-y)-\sigma_{2}(x-y)-\sigma_{2}(-x+y)\}.

In the second hidden layer (the third layer), the bias vector is zero B2(1)=(0,,0)6d1B^{(1)}_{2}=(0,\ldots,0)\in\mathbb{R}^{6d_{1}}, activation functions vector

Σ2(1)=(σ1,,σ12d1times,σ2,,σ24d1times),\Sigma^{(1)}_{2}=(\underbrace{\sigma_{1},\ldots,\sigma_{1}}_{2d_{1}\ {\rm times}},\underbrace{\sigma_{2},\ldots,\sigma_{2}}_{4d_{1}\ {\rm times}}),

and the corresponding weight matrix A2(1)A^{(1)}_{2} can be formulated accordingly without difficulty; it contains 2d1+8d1=10d12d_{1}+8d_{1}=10d_{1} non-zero parameters. Then in the last layer, using the identity maps and multiplication operations with weight matrix A3(1)A^{(1)}_{3} having 2d1+4d1=6d12d_{1}+4d_{1}=6d_{1} parameters and bias vector B3(1)B^{(1)}_{3} being zero, we obtain

(f1(1),,fd1(1),xjf1(1),,xjfd1(1))2d1.(f^{(1)}_{1},\ldots,f^{(1)}_{d_{1}},\frac{\partial}{\partial x_{j}}f^{(1)}_{1},\ldots,\frac{\partial}{\partial x_{j}}f^{(1)}_{d_{1}})\in\mathbb{R}^{2d_{1}}.

Such a Mixed RePUs neural network has 2 hidden layers (4 layers), 11d111d_{1} neurons, 2d0d1+3d1+10d1+6d1=2d0d1+19d12d_{0}d_{1}+3d_{1}+10d_{1}+6d_{1}=2d_{0}d_{1}+19d_{1} parameters and its width is (d0,3d1,6d1,2d1)(d_{0},3d_{1},6d_{1},2d_{1}). It is worth noting that the RePU activation functions are not applied to the last layer, since the construction here is for a single network. When we combine two consecutive subnetworks into one deep neural network, the RePU activation functions should be applied to the last layer of the first subnetwork. Hence, in the construction of the whole network, the last layer of the subnetwork here should output the 4d14d_{1} neurons

(σ1(f1(1)),σ1(f1(1)),σ1(fd1(1)),σ1(fd1(1)),\displaystyle(\sigma_{1}(f^{(1)}_{1}),\sigma_{1}(-f^{(1)}_{1})\ldots,\sigma_{1}(f^{(1)}_{d_{1}}),\sigma_{1}(-f^{(1)}_{d_{1}}),
σ1(xjf1(1)),σ1(xjf1(1)),σ1(xjfd1(1)),σ1(xjfd1(1)))4d1,\displaystyle\qquad\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(1)}_{1}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(1)}_{1})\ldots,\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(1)}_{d_{1}}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(1)}_{d_{1}}))\in\mathbb{R}^{4d_{1}},

to keep (f1(1),,fd1(1),xjf1(1),,xjfd1(1))(f^{(1)}_{1},\ldots,f^{(1)}_{d_{1}},\frac{\partial}{\partial x_{j}}f^{(1)}_{1},\ldots,\frac{\partial}{\partial x_{j}}f^{(1)}_{d_{1}}) available for the next subnetwork. Then for this Mixed RePUs neural network, the weight matrix A3(1)A^{(1)}_{3} has 2d1+8d1=10d12d_{1}+8d_{1}=10d_{1} parameters, the bias vector B3(1)B^{(1)}_{3} is zero and the activation functions vector Σ3(1)\Sigma^{(1)}_{3} has all elements equal to σ1\sigma_{1}. Such a Mixed RePUs neural network has 2 hidden layers (4 layers), 13d113d_{1} neurons, 2d0d1+3d1+10d1+10d1=2d0d1+23d12d_{0}d_{1}+3d_{1}+10d_{1}+10d_{1}=2d_{0}d_{1}+23d_{1} parameters and its width is (d0,3d1,6d1,4d1)(d_{0},3d_{1},6d_{1},4d_{1}).

Now we consider the second step, for any u=1,,d2u=1,\ldots,d_{2},

xjfu(2)=xjσp(i=1d1wui(2)fi(1)+bu(2))=pσp1(i=1d1wui(2)fi(1)+bu(2))i=1d1wu,i(2)xjfi(1),\displaystyle\frac{\partial}{\partial x_{j}}f^{(2)}_{u}=\frac{\partial}{\partial x_{j}}\sigma_{p}\Big{(}\sum_{i=1}^{d_{1}}w^{(2)}_{ui}f^{(1)}_{i}+b_{u}^{(2)}\Big{)}=p\sigma_{p-1}\Big{(}\sum_{i=1}^{d_{1}}w^{(2)}_{ui}f^{(1)}_{i}+b_{u}^{(2)}\Big{)}\cdot\sum_{i=1}^{d_{1}}w_{u,i}^{(2)}\frac{\partial}{\partial x_{j}}f^{(1)}_{i}, (B.2)

where wui(2)w^{(2)}_{ui} and bu(2)b_{u}^{(2)} are the weights and biases in the second layer of the original RePU network. Using the previously constructed subnetwork, we can start with its outputs

(σ1(f1(1)),σ1(f1(1)),σ1(fd1(1)),σ1(fd1(1)),\displaystyle(\sigma_{1}(f^{(1)}_{1}),\sigma_{1}(-f^{(1)}_{1})\ldots,\sigma_{1}(f^{(1)}_{d_{1}}),\sigma_{1}(-f^{(1)}_{d_{1}}),
σ1(xjf1(1)),σ1(xjf1(1)),σ1(xjfd1(1)),σ1(xjfd1(1)))4d1,\displaystyle\qquad\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(1)}_{1}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(1)}_{1})\ldots,\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(1)}_{d_{1}}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(1)}_{d_{1}}))\in\mathbb{R}^{4d_{1}},

as the inputs of the second subnetwork we are going to build. In the first hidden layer of the second subnetwork, we can obtain 3d23d_{2} neurons

(f1(2),,fd2(2),|i=1d1w1,i(2)xjfi(1)|,,|i=1d1wd2,i(2)xjfi(1)|,\displaystyle\Big{(}f^{(2)}_{1},\ldots,f^{(2)}_{d_{2}},|\sum_{i=1}^{d_{1}}w_{1,i}^{(2)}\frac{\partial}{\partial x_{j}}f^{(1)}_{i}|,\ldots,|\sum_{i=1}^{d_{1}}w_{d_{2},i}^{(2)}\frac{\partial}{\partial x_{j}}f^{(1)}_{i}|,
σp1(i=1d1w1i(2)fi(1)+b1(2)),,σp1(i=1d1wd2i(2)fi(1)+bd2(2))),\displaystyle\qquad\qquad\sigma_{p-1}(\sum_{i=1}^{d_{1}}w^{(2)}_{1i}f^{(1)}_{i}+b^{(2)}_{1}),\ldots,\sigma_{p-1}(\sum_{i=1}^{d_{1}}w^{(2)}_{d_{2}i}f^{(1)}_{i}+b^{(2)}_{d_{2}})\Big{)},

with weight matrix A1(2)3d2×4d1A^{(2)}_{1}\in\mathbb{R}^{3d_{2}\times 4d_{1}} having 6d1d26d_{1}d_{2} non-zero parameters, bias vector B1(2)3d2B^{(2)}_{1}\in\mathbb{R}^{3d_{2}} and activation functions vector Σ1(2)=Σ1(1)\Sigma^{(2)}_{1}=\Sigma^{(1)}_{1}. Similarly, the second hidden layer can be constructed to have 6d26d_{2} neurons with weight matrix A2(2)6d2×3d2A^{(2)}_{2}\in\mathbb{R}^{6d_{2}\times 3d_{2}} having 2d2+8d2=10d22d_{2}+8d_{2}=10d_{2} non-zero parameters, zero bias vector B2(2)6d2B^{(2)}_{2}\in\mathbb{R}^{6d_{2}} and activation functions vector Σ2(2)=Σ2(1)\Sigma^{(2)}_{2}=\Sigma^{(1)}_{2}. The second hidden layer here serves exactly the same purpose as in the first subnetwork: it implements the identity map for

(f1(2),,fd2(2)),(f^{(2)}_{1},\ldots,f^{(2)}_{d_{2}}),

and implement the multiplication in (B.2). Similarly, the last layer can also be constructed as that in the first subnetwork, which outputs

(σ1(f1(2)),σ1(f1(2)),σ1(fd2(2)),σ1(fd2(2)),\displaystyle(\sigma_{1}(f^{(2)}_{1}),\sigma_{1}(-f^{(2)}_{1})\ldots,\sigma_{1}(f^{(2)}_{d_{2}}),\sigma_{1}(-f^{(2)}_{d_{2}}),
σ1(xjf1(2)),σ1(xjf1(2)),σ1(xjfd2(2)),σ1(xjfd2(2)))4d2,\displaystyle\qquad\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(2)}_{1}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(2)}_{1})\ldots,\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(2)}_{d_{2}}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(2)}_{d_{2}}))\in\mathbb{R}^{4d_{2}},

with the weight matrix A3(2)A^{(2)}_{3} having 2d2+8d2=10d22d_{2}+8d_{2}=10d_{2} parameters, the bias vector B3(2)B^{(2)}_{3} being zero and the activation functions vector Σ3(2)\Sigma^{(2)}_{3} with all elements being σ1\sigma_{1}. Then the second Mixed RePUs subnetwork has 2 hidden layers (4 layers), 13d213d_{2} neurons, 6d1d2+3d2+10d2+10d2=6d1d2+23d26d_{1}d_{2}+3d_{2}+10d_{2}+10d_{2}=6d_{1}d_{2}+23d_{2} parameters and its width is (4d1,3d2,6d2,4d2)(4d_{1},3d_{2},6d_{2},4d_{2}).

We can then continue this process of construction. For integers k=3,,𝒟k=3,\ldots,\mathcal{D} and for any u=1,,dku=1,\ldots,d_{k},

xjfu(k)\displaystyle\frac{\partial}{\partial x_{j}}f^{(k)}_{u} =xjσp(i=1dk1wui(k)fi(k1)+bu(k))\displaystyle=\frac{\partial}{\partial x_{j}}\sigma_{p}\Big{(}\sum_{i=1}^{d_{k-1}}w^{(k)}_{ui}f^{(k-1)}_{i}+b_{u}^{(k)}\Big{)}
=pσp1(i=1dk1wui(k)fi(k1)+bu(k))i=1dk1wu,i(k)xjfi(k1),\displaystyle=p\sigma_{p-1}\Big{(}\sum_{i=1}^{d_{k-1}}w^{(k)}_{ui}f^{(k-1)}_{i}+b_{u}^{(k)}\Big{)}\cdot\sum_{i=1}^{d_{k-1}}w_{u,i}^{(k)}\frac{\partial}{\partial x_{j}}f^{(k-1)}_{i},

where wui(k)w^{(k)}_{ui} and bu(k)b_{u}^{(k)} are the weights and biases in the kk-th layer of the original RePU network. We can construct a Mixed RePUs network taking

(σ1(f1(k1)),σ1(f1(k1)),σ1(fdk1(k1)),σ1(fdk1(k1)),\displaystyle(\sigma_{1}(f^{(k-1)}_{1}),\sigma_{1}(-f^{(k-1)}_{1})\ldots,\sigma_{1}(f^{(k-1)}_{d_{k-1}}),\sigma_{1}(-f^{(k-1)}_{d_{k-1}}),
σ1(xjf1(k1)),σ1(xjf1(k1)),σ1(xjfdk1(k1)),σ1(xjfdk1(k1)))4dk1,\displaystyle\qquad\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(k-1)}_{1}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(k-1)}_{1})\ldots,\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(k-1)}_{d_{k-1}}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(k-1)}_{d_{k-1}}))\in\mathbb{R}^{4d_{k-1}},

as input, and it outputs

(σ1(f1(k)),σ1(f1(k)),σ1(fdk(k)),σ1(fdk(k)),\displaystyle(\sigma_{1}(f^{(k)}_{1}),\sigma_{1}(-f^{(k)}_{1})\ldots,\sigma_{1}(f^{(k)}_{d_{k}}),\sigma_{1}(-f^{(k)}_{d_{k}}),
σ1(xjf1(k)),σ1(xjf1(k)),σ1(xjfdk(k)),σ1(xjfdk(k)))4dk,\displaystyle\qquad\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(k)}_{1}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(k)}_{1})\ldots,\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(k)}_{d_{k}}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(k)}_{d_{k}}))\in\mathbb{R}^{4d_{k}},

with 2 hidden layers, 13dk13d_{k} hidden neurons, 6dk1dk+23dk6d_{k-1}d_{k}+23d_{k} parameters and its width is (4dk1,3dk,6dk,4dk)(4d_{k-1},3d_{k},6d_{k},4d_{k}).

Iterate this process until the k=𝒟+1k=\mathcal{D}+1 step, where the last layer of the original RePU network has only 1 neuron. For the RePU activated neural network fn=𝒟,𝒲,𝒰,𝒮,f\in\mathcal{F}_{n}=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B}}, the output of the network f:𝒳f:\mathcal{X}\to\mathbb{R} is a scalar and its partial derivative with respect to xjx_{j} is

xjf=xji=1d𝒟wi(𝒟)fi(𝒟)+b(𝒟)=i=1d𝒟wi(𝒟)xjfi(𝒟),\displaystyle\frac{\partial}{\partial x_{j}}f=\frac{\partial}{\partial x_{j}}\sum_{i=1}^{d_{\mathcal{D}}}w^{(\mathcal{D})}_{i}f^{(\mathcal{D})}_{i}+b^{(\mathcal{D})}=\sum_{i=1}^{d_{\mathcal{D}}}w^{(\mathcal{D})}_{i}\frac{\partial}{\partial x_{j}}f^{(\mathcal{D})}_{i},

where wi(𝒟)w^{(\mathcal{D})}_{i} and b(𝒟)b^{(\mathcal{D})} are the weights and the bias parameter in the last layer of the original RePU network. The constructed (𝒟+1)(\mathcal{D}+1)-th subnetwork takes

(σ1(f1(𝒟)),σ1(f1(𝒟)),σ1(fd𝒟(𝒟)),σ1(fd𝒟(𝒟)),\displaystyle(\sigma_{1}(f^{(\mathcal{D})}_{1}),\sigma_{1}(-f^{(\mathcal{D})}_{1})\ldots,\sigma_{1}(f^{(\mathcal{D})}_{d_{\mathcal{D}}}),\sigma_{1}(-f^{(\mathcal{D})}_{d_{\mathcal{D}}}),
σ1(xjf1(𝒟)),σ1(xjf1(𝒟)),σ1(xjfd𝒟(𝒟)),σ1(xjfd𝒟(𝒟)))4d𝒟,\displaystyle\qquad\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(\mathcal{D})}_{1}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(\mathcal{D})}_{1})\ldots,\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(\mathcal{D})}_{d_{\mathcal{D}}}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(\mathcal{D})}_{d_{\mathcal{D}}}))\in\mathbb{R}^{4d_{\mathcal{D}}},

as input and outputs xjf(𝒟+1)=xjf\frac{\partial}{\partial x_{j}}f^{(\mathcal{D}+1)}=\frac{\partial}{\partial x_{j}}f, which is the partial derivative of the whole RePU network with respect to its jj-th argument xjx_{j}. This subnetwork has 2 hidden layers, width (4d𝒟,2,8,1)(4d_{\mathcal{D}},2,8,1), 11 neurons and 4d𝒟+2+16=4d𝒟+184d_{\mathcal{D}}+2+16=4d_{\mathcal{D}}+18 non-zero parameters.

Lastly, we combine all the 𝒟+1\mathcal{D}+1 subnetworks in order to form a big Mixed RePUs network which takes X=(x1,,xd)dX=(x_{1},\ldots,x_{d})\in\mathbb{R}^{d} as input and outputs xjf\frac{\partial}{\partial x_{j}}f for fn=𝒟,𝒲,𝒰,𝒮,,f\in\mathcal{F}_{n}=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}}. Recall that here 𝒟,𝒲,𝒰,𝒮\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S} are the depth, width, number of neurons and number of parameters of the original RePU network respectively, and we have 𝒰=i=0𝒟+1di\mathcal{U}=\sum_{i=0}^{\mathcal{D}+1}d_{i} and 𝒮=i=0𝒟(didi+1+di+1).\mathcal{S}=\sum_{i=0}^{\mathcal{D}}(d_{i}d_{i+1}+d_{i+1}). Then the big Mixed RePUs network has 3𝒟+33\mathcal{D}+3 hidden layers (3𝒟+53\mathcal{D}+5 layers in total), d0+i=1𝒟13di+1113𝒰d_{0}+\sum_{i=1}^{\mathcal{D}}13d_{i}+11\leq 13\mathcal{U} neurons, 2d0d1+23d1+i=1𝒟(6didi+1+23di+1)+4d𝒟+1823𝒮2d_{0}d_{1}+23d_{1}+\sum_{i=1}^{\mathcal{D}}(6d_{i}d_{i+1}+23d_{i+1})+4d_{\mathcal{D}}+18\leq 23\mathcal{S} parameters, and its width is 6max{d1,,d𝒟}=6𝒲6\max\{d_{1},\ldots,d_{\mathcal{D}}\}=6\mathcal{W}. This completes the proof. \hfill\Box
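The construction above amounts to propagating, layer by layer, both the neuron values and their partial derivatives with respect to xjx_{j} via (B.1)-(B.2). As a purely numerical illustration of the quantity the combined network computes (a sketch of the recursion only, not of the Mixed RePUs wiring), one may write:

```python
import numpy as np

def repu(z, p):                       # sigma_p(z) = max(0, z)^p
    return np.maximum(0.0, z) ** p

def repu_forward_with_partial(x, weights, biases, p, j):
    """Forward pass of a RePU network together with the partial derivative of
    every neuron with respect to x_j, following the recursions (B.1)-(B.2);
    weights[i], biases[i] parameterize layer i, and the last layer is linear."""
    h = np.asarray(x, dtype=float)
    dh = np.zeros_like(h)
    dh[j] = 1.0
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ h + b
        h, dh = repu(z, p), p * repu(z, p - 1) * (W @ dh)   # chain rule
    W, b = weights[-1], biases[-1]
    return (W @ h + b).item(), (W @ dh).item()              # f(x), (d/dx_j) f(x)

# a tiny example with one hidden layer (widths 3 -> 5 -> 1)
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((5, 3)), rng.standard_normal((1, 5))]
bs = [rng.standard_normal(5), rng.standard_normal(1)]
f_val, df_dx0 = repu_forward_with_partial([0.3, -0.2, 0.5], Ws, bs, p=2, j=0)
```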

Proof of Lemma 2

We follow the idea of the proof of Theorem 6 in Bartlett et al. (2019) to prove a somewhat stronger result, in which we give an upper bound on the Pseudo dimension of the class \mathcal{F} of Mixed RePUs networks in terms of the depth, size and number of neurons of the network. Our Pseudo dimension bound is stronger than the bound on the VC dimension of sign(){\rm sign}(\mathcal{F}) given in Bartlett et al. (2019), since VCdim(sign())Pdim(){\rm VCdim}({\rm sign}(\mathcal{F}))\leq{\rm Pdim}(\mathcal{F}).

Let 𝒵\mathcal{Z} denote the domain of the functions ff\in\mathcal{F} and let tt\in\mathbb{R}; we consider a new class of functions

~:={f~(z,t)=sign(f(z)t):f}.\tilde{\mathcal{F}}:=\{\tilde{f}(z,t)={\rm sign}(f(z)-t):f\in\mathcal{F}\}.

Then it is clear that Pdim()VCdim(~){\rm Pdim}(\mathcal{F})\leq{\rm VCdim}(\tilde{\mathcal{F}}), and we next bound the VC dimension of ~\tilde{\mathcal{F}}. Recall that the total number of parameters (weights and biases) in the neural network implementing functions in \mathcal{F} is 𝒮\mathcal{S}; we let θ𝒮\theta\in\mathbb{R}^{\mathcal{S}} denote the parameter vector of the network f(,θ):𝒵f(\cdot,\theta):\mathcal{Z}\to\mathbb{R} implemented in \mathcal{F}. Here we intend to derive a bound for

K(m):=|{(sign(f(z1,θ)t1),,sign(f(zm,θ)tm)):θ𝒮}|K(m):=\Big{|}\{({\rm sign}(f(z_{1},\theta)-t_{1}),\ldots,{\rm sign}(f(z_{m},\theta)-t_{m})):\theta\in\mathbb{R}^{\mathcal{S}}\}\Big{|}

which holds uniformly over all choices of {zi}i=1m\{z_{i}\}_{i=1}^{m} and {ti}i=1m\{t_{i}\}_{i=1}^{m}. Note that the maximum of K(m)K(m) over all choices of {zi}i=1m\{z_{i}\}_{i=1}^{m} and {ti}i=1m\{t_{i}\}_{i=1}^{m} is exactly the growth function of ~\tilde{\mathcal{F}}. To give a uniform bound on K(m)K(m), we use Theorem 8.3 in Anthony and Bartlett (1999) as the main tool in the analysis.

Lemma 30 (Theorem 8.3 in Anthony and Bartlett (1999))

Let p1,,pmp_{1},\ldots,p_{m} be polynomials in nn variables of degree at most dd. If nmn\leq m, define

K:=|{(sign(p1(x)),,sign(pm(x))):xn}|,K:=|\{({\rm sign}(p_{1}(x)),\ldots,{\rm sign}(p_{m}(x))):x\in\mathbb{R}^{n}\}|,

i.e. KK is the number of possible sign vectors given by the polynomials. Then K2(2emd/n)nK\leq 2(2emd/n)^{n}.

Now if we can find a partition 𝒫={P1,,PN}\mathcal{P}=\{P_{1},\ldots,P_{N}\} of the parameter domain 𝒮\mathbb{R}^{\mathcal{S}} such that within each region PiP_{i}, the functions f(zj,)f(z_{j},\cdot) are all fixed polynomials of bounded degree, then K(m)K(m) can be bounded via the following sum

K(m)i=1N|{(sign(f(z1,θ)t1),,sign(f(zm,θ)tm)):θPi}|,\displaystyle K(m)\leq\sum_{i=1}^{N}\Big{|}\{({\rm sign}(f(z_{1},\theta)-t_{1}),\ldots,{\rm sign}(f(z_{m},\theta)-t_{m})):\theta\in P_{i}\}\Big{|}, (B.3)

and each term in this sum can be bounded via Lemma 30. Next, we construct the partition in the same way as in Bartlett et al. (2019), iteratively layer by layer. We define a sequence of successive refinements 𝒫1,,𝒫𝒟\mathcal{P}_{1},\ldots,\mathcal{P}_{\mathcal{D}} satisfying the following properties:

  • 1.

    The cardinality |𝒫1|=1|\mathcal{P}_{1}|=1 and for each n{1,,𝒟}n\in\{1,\ldots,\mathcal{D}\},

    |𝒫n+1||𝒫n|2(2emkn(1+(n1)pn1)𝒮n)𝒮n,\frac{|\mathcal{P}_{n+1}|}{|\mathcal{P}_{n}|}\leq 2\Big{(}\frac{2emk_{n}(1+(n-1)p^{n-1})}{\mathcal{S}_{n}}\Big{)}^{\mathcal{S}_{n}},

    where knk_{n} denotes the number of neurons in the nn-th layer and 𝒮n\mathcal{S}_{n} denotes the total number of parameters (weights and biases) at the inputs to units in all the layers up to layer nn.

  • 2.

For each n{1,,𝒟}n\in\{1,\ldots,\mathcal{D}\}, each element PP of 𝒫n\mathcal{P}_{n}, each j{1,,m}j\in\{1,\ldots,m\}, and each unit uu in the nn-th layer, when θ\theta varies in PP, the net input to uu is a fixed polynomial function in 𝒮n\mathcal{S}_{n} variables of θ\theta, of total degree no more than 1+(n1)pn11+(n-1)p^{n-1}, where for each layer the activation functions are σ1,,σp\sigma_{1},\ldots,\sigma_{p} for some integer p2p\geq 2 (this polynomial may depend on P,jP,j and uu).

One can define 𝒫1=𝒮\mathcal{P}_{1}=\mathbb{R}^{\mathcal{S}}, and it can be verified that 𝒫1\mathcal{P}_{1} satisfies property 2 above. Note that in our case, for fixed zjz_{j} and tjt_{j} and any subset P𝒮P\subset\mathbb{R}^{\mathcal{S}}, f(zj,θ)tjf(z_{j},\theta)-t_{j} is a polynomial with respect to θ\theta whose degree is the same as that of f(zj,θ)f(z_{j},\theta), which is no more than 1+(𝒟1)p𝒟11+(\mathcal{D}-1)p^{\mathcal{D}-1}. The construction of 𝒫1,,𝒫𝒟\mathcal{P}_{1},\ldots,\mathcal{P}_{\mathcal{D}} and the verification of properties 1 and 2 then proceed in the same way as in Bartlett et al. (2019). Finally we obtain a partition 𝒫𝒟\mathcal{P}_{\mathcal{D}} of 𝒮\mathbb{R}^{\mathcal{S}} such that for each P𝒫𝒟P\in\mathcal{P}_{\mathcal{D}}, the network output in response to any zjz_{j} is a fixed polynomial of θP\theta\in P of degree no more than 1+(𝒟1)p𝒟11+(\mathcal{D}-1)p^{\mathcal{D}-1} (since the last node simply outputs its input). Then by Lemma 30,

|{(sign(f(z1,θ)t1),,sign(f(zm,θ)tm)):θP}|2(2em(1+(𝒟1)p𝒟1)𝒮𝒟)𝒮𝒟.\Big{|}\{({\rm sign}(f(z_{1},\theta)-t_{1}),\ldots,{\rm sign}(f(z_{m},\theta)-t_{m})):\theta\in P\}\Big{|}\leq 2\Big{(}\frac{2em(1+(\mathcal{D}-1)p^{\mathcal{D}-1})}{\mathcal{S}_{\mathcal{D}}}\Big{)}^{\mathcal{S}_{\mathcal{D}}}.

Besides, by property 1 we have

|𝒫𝒟|\displaystyle|\mathcal{P}_{\mathcal{D}}| Πi=1𝒟12(2emki(1+(i1)pi1)𝒮i)𝒮i.\displaystyle\leq\Pi_{i=1}^{\mathcal{D-1}}2\Big{(}\frac{2emk_{i}(1+(i-1)p^{i-1})}{\mathcal{S}_{i}}\Big{)}^{\mathcal{S}_{i}}.

Then using (B.3), and since the sample points z1,,zmz_{1},\ldots,z_{m} are arbitrary, we have

K(m)\displaystyle K(m) Πi=1𝒟2(2emki(1+(i1)pi1)𝒮i)𝒮i\displaystyle\leq\Pi_{i=1}^{\mathcal{D}}2\Big{(}\frac{2emk_{i}(1+(i-1)p^{i-1})}{\mathcal{S}_{i}}\Big{)}^{\mathcal{S}_{i}}
2𝒟(2emki(1+(i1)pi1)𝒮i)𝒮i\displaystyle\leq 2^{\mathcal{D}}\Big{(}\frac{2em\sum k_{i}(1+(i-1)p^{i-1})}{\sum\mathcal{S}_{i}}\Big{)}^{\sum\mathcal{S}_{i}}
(4em(1+(𝒟1)p𝒟1)ki𝒮i)𝒮i\displaystyle\leq\Big{(}\frac{4em(1+(\mathcal{D}-1)p^{\mathcal{D}-1})\sum k_{i}}{\sum\mathcal{S}_{i}}\Big{)}^{\sum\mathcal{S}_{i}}
(4em(1+(𝒟1)p𝒟1))𝒮i,\displaystyle\leq\Big{(}4em(1+(\mathcal{D}-1)p^{\mathcal{D}-1})\Big{)}^{\sum\mathcal{S}_{i}},

where the second inequality follows from the weighted arithmetic-geometric mean inequality, the third holds since 𝒟𝒮i\mathcal{D}\leq\sum\mathcal{S}_{i} and the last holds since ki𝒮i\sum k_{i}\leq\sum\mathcal{S}_{i}. Since the growth function of ~\tilde{\mathcal{F}} is bounded by K(m)K(m), we have

2Pdim()2VCdim(~)K(VCdim(~))2𝒟(2eRVCdim(~)𝒮i)𝒮i\displaystyle 2^{{\rm Pdim}(\mathcal{F})}\leq 2^{{\rm VCdim}(\tilde{\mathcal{F}})}\leq K({\rm VCdim}(\tilde{\mathcal{F}}))\leq 2^{\mathcal{D}}\Big{(}\frac{2eR\cdot{\rm VCdim}(\tilde{\mathcal{F}})}{\sum\mathcal{S}_{i}}\Big{)}^{\sum\mathcal{S}_{i}}

where R:=i=1𝒟ki(1+(i1)pi1)𝒰+𝒰(𝒟1)p𝒟1.R:=\sum_{i=1}^{\mathcal{D}}k_{i}(1+(i-1)p^{i-1})\leq\mathcal{U}+\mathcal{U}(\mathcal{D}-1)p^{\mathcal{D}-1}. Since 𝒰>0\mathcal{U}>0 and 2eR162eR\geq 16, then by Lemma 16 in Bartlett et al. (2019) we have

Pdim()𝒟+(i=1𝒟𝒮i)log2(4eRlog2(2eR)).{\rm Pdim}(\mathcal{F})\leq\mathcal{D}+(\sum_{i=1}^{\mathcal{D}}\mathcal{S}_{i})\log_{2}(4eR\log_{2}(2eR)).

Note that i=1𝒟𝒮i𝒟𝒮\sum_{i=1}^{\mathcal{D}}\mathcal{S}_{i}\leq\mathcal{D}\mathcal{S} and log2(R)log2(𝒰{1+(𝒟1)p𝒟1})log2(𝒰)+p𝒟\log_{2}(R)\leq\log_{2}(\mathcal{U}\{1+(\mathcal{D}-1)p^{\mathcal{D}-1}\})\leq\log_{2}(\mathcal{U})+p\mathcal{D}, then we have

Pdim()𝒟+𝒟𝒮(2p𝒟+2log2𝒰+6)3p𝒟𝒮(𝒟+log2𝒰){\rm Pdim}(\mathcal{F})\leq\mathcal{D}+\mathcal{D}\mathcal{S}(2p\mathcal{D}+2\log_{2}\mathcal{U}+6)\leq 3p\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})

This completes the proof. \hfill\Box

Proof of Theorem 3

We begin the proof by considering the simple case of constructing a RePU network that represents a univariate polynomial with no error. We can leverage Horner's method (also known as Qin Jiushao's algorithm) to construct such networks. Suppose f(x)=a0+a1x++aNxNf(x)=a_{0}+a_{1}x+\cdots+a_{N}x^{N} is a univariate polynomial of degree NN; then it can be written as

f(x)=a0+x(a1+x(a2+x(a3++x(aN1+xaN)))).f(x)=a_{0}+x(a_{1}+x(a_{2}+x(a_{3}+\cdots+x(a_{N-1}+xa_{N})))).

We can iteratively calculate a sequence of intermediate variables b1,,bNb_{1},\ldots,b_{N} by

bk={aN1+xaN,k=1,aNk+xbk1,k=2,,N.\displaystyle b_{k}=\Big{\{}\begin{array}[]{lr}a_{N-1}+xa_{N},\qquad k=1,\\ a_{N-k}+xb_{k-1},\ \ \ k=2,\ldots,N.\\ \end{array}\Big{.}

Then we obtain bN=f(x)b_{N}=f(x). By (iii) in Lemma 40, we know that a RePU network with one hidden layer and no more than 2p2p nodes can represent any polynomial of the input with order no more than pp. Obviously, for input xx, the identity map xx, the linear transformation ax+bax+b and the square map x2x^{2} are all polynomials of xx with order no more than pp. In addition, it is not hard to see that the multiplication operator xy={(x+y)2(xy)2}/4xy=\{(x+y)^{2}-(x-y)^{2}\}/4 can be represented by a RePU network with one hidden layer and 4p4p nodes. Then calculating b1b_{1} needs a RePU network with 1 hidden layer and 2p2p hidden neurons, and calculating b2b_{2} needs a RePU network with 3 hidden layers and 2×2p+1×4p+22\times 2p+1\times 4p+2 hidden neurons. By induction, calculating bN=f(x)b_{N}=f(x) for N1N\geq 1 needs a RePU network with 2N12N-1 hidden layers, N×2p+(N1)×4p+(N1)×2=(6p+2)(N1)+2pN\times 2p+(N-1)\times 4p+(N-1)\times 2=(6p+2)(N-1)+2p hidden neurons, (N1)(30p+2)+2p+1(N-1)(30p+2)+2p+1 parameters (weights and biases), and width equal to 6p6p.
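The scheme above is simply Horner evaluation of the polynomial; for illustration, a minimal Python sketch of the recursion b1,,bNb_{1},\ldots,b_{N} (ours, not part of the network construction) is:

```python
def horner(coeffs, x):
    """Evaluate a0 + a1*x + ... + aN*x^N by Horner's method;
    coeffs = [a0, a1, ..., aN]. This is exactly the recursion b_1, ..., b_N."""
    b = coeffs[-1]                        # start from a_N
    for a in reversed(coeffs[:-1]):       # b_k = a_{N-k} + x * b_{k-1}
        b = a + x * b
    return b

# example: f(x) = 1 + 2x + 3x^2 at x = 2 gives 17
assert horner([1, 2, 3], 2) == 17
```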

Apart from the construction based on Horner's method, another construction is given in Theorem 2 of Li et al. (2019), where the constructed RePU network has logpN+2\lceil\log_{p}N\rceil+2 hidden layers, O(N)O(N) neurons and O(pN)O(pN) parameters (weights and biases).

Now we consider constructing RePU networks to compute a multivariate polynomial ff with total degree NN on d\mathbb{R}^{d}. For any d+d\in\mathbb{N}^{+} and N0N\in\mathbb{N}_{0}, let

fNd(x1,,xd)=i1++id=0Nai1,i2,,idx1i1x2i2xdid,f^{d}_{N}(x_{1},\ldots,x_{d})=\sum_{i_{1}+\cdots+i_{d}=0}^{N}a_{i_{1},i_{2},\ldots,i_{d}}x_{1}^{i_{1}}x_{2}^{i_{2}}\cdots x_{d}^{i_{d}},

denote a polynomial of dd variables with total degree NN, where i1,i2,,idi_{1},i_{2},\ldots,i_{d} are non-negative integers and {ai1,i2,,id:i1++idN}\{a_{i_{1},i_{2},\ldots,i_{d}}:i_{1}+\cdots+i_{d}\leq N\} are coefficients in \mathbb{R}. Note that the multivariate polynomial fNdf^{d}_{N} can be written as

fNd(x1,,xd)=i1=0N(i2++id=0Ni1ai1,i2,idx2i2xdid)x1i1,\displaystyle f^{d}_{N}(x_{1},\ldots,x_{d})=\sum_{i_{1}=0}^{N}\Big{(}\sum_{i_{2}+\cdots+i_{d}=0}^{N-i_{1}}a_{i_{1},i_{2}\ldots,i_{d}}x_{2}^{i_{2}}\cdots x_{d}^{i_{d}}\Big{)}x_{1}^{i_{1}},

and we can view fNdf^{d}_{N} as a univariate polynomial of x1x_{1} with degree NN when x2,,xdx_{2},\ldots,x_{d} are given, provided that for each i1{0,,N}i_{1}\in\{0,\ldots,N\} the (d1)(d-1)-variate polynomial i2++id=0Ni1ai1,i2,idx2i2xdid\sum_{i_{2}+\cdots+i_{d}=0}^{N-i_{1}}a_{i_{1},i_{2}\ldots,i_{d}}x_{2}^{i_{2}}\cdots x_{d}^{i_{d}} with degree no more than NN can be computed by a proper RePU network. This suggests that the RePU network for fNdf^{d}_{N} can be constructed recursively, via composition of the networks for fN1,fN2,,fNdf^{1}_{N},f^{2}_{N},\ldots,f^{d}_{N}, by induction.
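The evaluation order that this recursive construction mirrors can be written as a small Python sketch (ours; it evaluates the polynomial, it does not build the network):

```python
def eval_poly(coeffs, x, N=None):
    """Evaluate a d-variate polynomial of total degree N by the recursion used
    in the proof: treat it as a univariate polynomial in x[0] whose coefficients
    are (d-1)-variate polynomials, zero-padded to degree N.
    coeffs maps multi-indices (i1, ..., id) with i1+...+id <= N to coefficients."""
    if N is None:
        N = max(sum(k) for k in coeffs)
    if len(x) == 1:                                   # univariate base case
        c = [coeffs.get((i,), 0.0) for i in range(N + 1)]
    else:                                             # coefficients of x[0]^i1
        c = [eval_poly({k[1:]: v for k, v in coeffs.items() if k[0] == i1},
                       x[1:], N) for i1 in range(N + 1)]
    b = c[-1]                                         # Horner's method in x[0]
    for a in reversed(c[:-1]):
        b = a + x[0] * b
    return b

# f(x1, x2) = 1 + 2*x1*x2 + x2^2  (total degree N = 2)
assert eval_poly({(0, 0): 1.0, (1, 1): 2.0, (0, 2): 1.0}, [2.0, 3.0]) == 22.0
```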

By Horner’s method we have constructed a RePU network with 2N12N-1 hidden layers, (6p+2)(N1)+2p(6p+2)(N-1)+2p hidden neurons and (N1)(30p+2)+2p+1(N-1)(30p+2)+2p+1 parameters to exactly compute fN1f^{1}_{N}. Now we start to show fN2f^{2}_{N} can be computed by RePU networks. We can write fN2f^{2}_{N} as

fN2(x1,x2)=i+j=0Naijx1ix2j=i=0N(j=0Niaijx2j)x1i.f^{2}_{N}(x_{1},x_{2})=\sum_{i+j=0}^{N}a_{ij}x_{1}^{i}x_{2}^{j}=\sum_{i=0}^{N}\Big{(}\sum_{j=0}^{N-i}a_{ij}x_{2}^{j}\Big{)}x_{1}^{i}.

Note that for i{0,,N}i\in\{0,\ldots,N\}, the degree of the polynomial j=0Niaijx2j\sum_{j=0}^{N-i}a_{ij}x_{2}^{j} is NiN-i, which is no more than NN. We can still view it as a polynomial of degree NN by padding (adding zero terms) such that j=0Niaijx2j=j=0Naijx2j\sum_{j=0}^{N-i}a_{ij}x_{2}^{j}=\sum_{j=0}^{N}a^{*}_{ij}x_{2}^{j} where aij=aija^{*}_{ij}=a_{ij} if i+jNi+j\leq N and aij=0a^{*}_{ij}=0 if i+j>Ni+j>N. In such a way, for each i{0,,N}i\in\{0,\ldots,N\} the polynomial j=0Niaijx2j\sum_{j=0}^{N-i}a_{ij}x_{2}^{j} can be computed by a RePU network with 2N12N-1 hidden layers, (6p+2)(N1)+2p(6p+2)(N-1)+2p hidden neurons, (N1)(30p+2)+2p+1(N-1)(30p+2)+2p+1 parameters and width equal to 6p6p. Besides, for each i{0,,N}i\in\{0,\ldots,N\}, the monomial x1ix_{1}^{i} can also be computed by a RePU network with 2N12N-1 hidden layers, (6p+2)(N1)+2p(6p+2)(N-1)+2p hidden neurons, (N1)(30p+2)+2p+1(N-1)(30p+2)+2p+1 parameters and width equal to 6p6p, in whose implementation the identity maps are used after the (2i1)(2i-1)-th hidden layer. Now we parallel these two subnetworks to get a RePU network which takes x1x_{1} and x2x_{2} as input and outputs (j=0Niaijx2j)x1i(\sum_{j=0}^{N-i}a_{ij}x_{2}^{j})x_{1}^{i}, with width 12p12p, 2N12N-1 hidden layers, 2×[(6p+2)(N1)+2p]2\times[(6p+2)(N-1)+2p] neurons and size 2×[(N1)(30p+2)+2p+1]2\times[(N-1)(30p+2)+2p+1]. Since such a paralleled RePU network can be constructed for each i{0,,N}i\in\{0,\ldots,N\}, with straightforward paralleling of NN such RePU networks we obtain a RePU network that exactly computes fN2f^{2}_{N} with width 12pN12pN, 2N12N-1 hidden layers, 2×[(6p+2)(N1)+2p]×N14pN22\times[(6p+2)(N-1)+2p]\times N\leq 14pN^{2} neurons and 2×[(N1)(30p+2)+2p+1]×N62pN22\times[(N-1)(30p+2)+2p+1]\times N\leq 62pN^{2} parameters.

Similarly, for the polynomial fN3f^{3}_{N} of 3 variables, we can write fN3f^{3}_{N} as

fN3(x1,x2,x3)=i+j+k=0Naijkx1ix2jx3k=i=0N(j+k=0Niaijkx2jx3k)x1i.f^{3}_{N}(x_{1},x_{2},x_{3})=\sum_{i+j+k=0}^{N}a_{ijk}x_{1}^{i}x_{2}^{j}x_{3}^{k}=\sum_{i=0}^{N}\Big{(}\sum_{j+k=0}^{N-i}a_{ijk}x_{2}^{j}x_{3}^{k}\Big{)}x_{1}^{i}.

By our previous argument, for each i{0,,N}i\in\{0,\ldots,N\}, there exists a RePU network which takes (x1,x2,x3)(x_{1},x_{2},x_{3}) as input and outputs (j+k=0Niaijkx2jx3k)x1i\Big{(}\sum_{j+k=0}^{N-i}a_{ijk}x_{2}^{j}x_{3}^{k}\Big{)}x_{1}^{i} with width 12pN+6p12pN+6p, hidden layers 2N12N-1, number of neurons 2N×[(6p+2)(N1)+2p]+[(6p+2)(N1)+2p]2N\times[(6p+2)(N-1)+2p]+[(6p+2)(N-1)+2p] and parameters 2N×[(N1)(30p+2)+2p+1]+[(N1)(30p+2)+2p+1]2N\times[(N-1)(30p+2)+2p+1]+[(N-1)(30p+2)+2p+1]. And by paralleling NN such subnetworks, we obtain a RePU network that exactly computes fN3f^{3}_{N} with width (12pN+6p)×N=12pN2+6pN(12pN+6p)\times N=12pN^{2}+6pN, hidden layers 2N12N-1, number of neurons 2N2×[(6p+2)(N1)+2p]+N×[(6p+2)(N1)+2p]2N^{2}\times[(6p+2)(N-1)+2p]+N\times[(6p+2)(N-1)+2p] and number of parameters 2N2×[(N1)(30p+2)+2p+1]+N×[(N1)(30p+2)+2p+1]2N^{2}\times[(N-1)(30p+2)+2p+1]+N\times[(N-1)(30p+2)+2p+1].

Continuing this process, we can construct RePU networks that exactly compute polynomials of any dd variables with total degree NN. With a slight abuse of notation, we let 𝒲k\mathcal{W}_{k}, 𝒟k\mathcal{D}_{k}, 𝒰k\mathcal{U}_{k} and 𝒮k\mathcal{S}_{k} denote the width, number of hidden layers, number of neurons and number of parameters (weights and biases), respectively, of the RePU network computing fNkf^{k}_{N} for k=1,2,3,k=1,2,3,\ldots. We have shown that

𝒟1=2N1,𝒲1=6p,𝒰1=(6p+2)(N1)+2p,𝒮1=(N1)(30p+2)+2p+1.\displaystyle\mathcal{D}_{1}=2N-1,\quad\mathcal{W}_{1}=6p,\quad\mathcal{U}_{1}=(6p+2)(N-1)+2p,\quad\mathcal{S}_{1}=(N-1)(30p+2)+2p+1.

Besides, based on the iterative procedure of the network construction, by induction we can see that for k=2,3,4,k=2,3,4,\ldots the following equations hold:

𝒟k=\displaystyle\mathcal{D}_{k}= 2N1,\displaystyle 2N-1,
𝒲k=\displaystyle\mathcal{W}_{k}= N×(𝒲k1+𝒲1),\displaystyle N\times(\mathcal{W}_{k-1}+\mathcal{W}_{1}),
𝒰k=\displaystyle\mathcal{U}_{k}= N×(𝒰k1+𝒰1),\displaystyle N\times(\mathcal{U}_{k-1}+\mathcal{U}_{1}),
𝒮k=\displaystyle\mathcal{S}_{k}= N×(𝒮k1+𝒮1).\displaystyle N\times(\mathcal{S}_{k-1}+\mathcal{S}_{1}).

Then based on the values of 𝒟1,𝒲1,𝒰1,𝒮1\mathcal{D}_{1},\mathcal{W}_{1},\mathcal{U}_{1},\mathcal{S}_{1} and the recursion formula, we have for k=2,3,4,k=2,3,4,\ldots

𝒟k=2N1,\displaystyle\mathcal{D}_{k}=2N-1,
𝒲k=12pNk1+6pNk1NN1,\displaystyle\mathcal{W}_{k}=12pN^{k-1}+6p\frac{N^{k-1}-N}{N-1},
𝒰k=\displaystyle\mathcal{U}_{k}= N×(𝒰k1+𝒰1)=2𝒰1Nk1+𝒰1Nk1NN1\displaystyle N\times(\mathcal{U}_{k-1}+\mathcal{U}_{1})=2\mathcal{U}_{1}N^{k-1}+\mathcal{U}_{1}\frac{N^{k-1}-N}{N-1}
=\displaystyle= (6p+2)(2NkNk1N)+2p(2NkNk1NN1),\displaystyle(6p+2)(2N^{k}-N^{k-1}-N)+2p(\frac{2N^{k}-N^{k-1}-N}{N-1}),
𝒮k=\displaystyle\mathcal{S}_{k}= N×(𝒮k1+𝒮1)=2𝒮1Nk1+𝒮1Nk1NN1\displaystyle N\times(\mathcal{S}_{k-1}+\mathcal{S}_{1})=2\mathcal{S}_{1}N^{k-1}+\mathcal{S}_{1}\frac{N^{k-1}-N}{N-1}
=\displaystyle= (30p+2)(2NkNk1N)+(2p+1)(2NkNk1NN1).\displaystyle(30p+2)(2N^{k}-N^{k-1}-N)+(2p+1)(\frac{2N^{k}-N^{k-1}-N}{N-1}).

This completes our proof. \hfill\Box
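As a quick sanity check, the closed-form expressions above agree with the recursion; a small Python sketch (ours) verifying this numerically:

```python
def repu_poly_network_sizes(p, N, d):
    """Width, number of neurons and number of parameters of the RePU network
    computing a d-variate polynomial of total degree N, via the recursion
    W_k = N (W_{k-1} + W_1), U_k = N (U_{k-1} + U_1), S_k = N (S_{k-1} + S_1)."""
    W1 = 6 * p
    U1 = (6 * p + 2) * (N - 1) + 2 * p
    S1 = (N - 1) * (30 * p + 2) + 2 * p + 1
    W, U, S = W1, U1, S1
    for _ in range(2, d + 1):
        W, U, S = N * (W + W1), N * (U + U1), N * (S + S1)
    return W, U, S

p, N, d = 2, 5, 3
W, U, S = repu_poly_network_sizes(p, N, d)
q = (2 * N ** d - N ** (d - 1) - N) // (N - 1)       # (2N^d - N^{d-1} - N)/(N-1)
assert W == 12 * p * N ** (d - 1) + 6 * p * (N ** (d - 1) - N) // (N - 1)
assert U == (6 * p + 2) * (2 * N ** d - N ** (d - 1) - N) + 2 * p * q
assert S == (30 * p + 2) * (2 * N ** d - N ** (d - 1) - N) + (2 * p + 1) * q
```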

Proof of Theorem 5

The proof is straightforward, leveraging the approximation power of multivariate polynomials together with Theorem 3. The theory of polynomial approximation has been extensively studied on various spaces of smooth functions. We rely on Bagby et al. (2002) for the polynomial approximation of smooth functions in our proof.

Lemma 31 (Theorem 2 in Bagby et al. (2002))

Let ff be a function of compact support on d\mathbb{R}^{d} of class CsC^{s} where s+s\in\mathbb{N}^{+} and let KK be a compact subset of d\mathbb{R}^{d} which contains the support of ff. Then for each nonnegative integer NN there is a polynomial pNp_{N} of degree at most NN on d\mathbb{R}^{d} with the following property: for each multi-index α\alpha with |α|1min{s,N}|\alpha|_{1}\leq\min\{s,N\} we have

supK|Dα(fpN)|CNs|α|1|α|1ssupK|Dαf|,\sup_{K}|D^{\alpha}(f-p_{N})|\leq\frac{C}{N^{s-|\alpha|_{1}}}\sum_{|\alpha|_{1}\leq s}\sup_{K}|D^{\alpha}f|,

where CC is a positive constant depending only on d,sd,s and KK.

The proof of Lemma 31 can be found in Bagby et al. (2002), based on the Whitney extension theorem (Theorem 2.3.6 in Hörmander (2015)); by examining the proof of Theorem 1 in Bagby et al. (2002), the dependence of the constant CC in Lemma 31 on d,sd,s and KK can be made explicit.

To use Lemma 31, we need to find a RePU network to compute the pNp_{N} for each N+N\in\mathbb{N}^{+}. By Theorem 3, we know that any pNp_{N} of dd variables can be exactly computed by a RePU network ϕN\phi_{N} with 2N12N-1 hidden layers, (6p+2)(2NdNd1N)+2p(2NdNd1N)/(N1)(6p+2)(2N^{d}-N^{d-1}-N)+2p(2N^{d}-N^{d-1}-N)/(N-1) number of neurons, (30p+2)(2NdNd1N)+(2p+1)(2NdNd1N)/(N1)(30p+2)(2N^{d}-N^{d-1}-N)+(2p+1)(2N^{d}-N^{d-1}-N)/(N-1) number of parameters (weights and bias) and network width 12pNd1+6p(Nd1N)/(N1)12pN^{d-1}+6p(N^{d-1}-N)/(N-1). Then we have

supK|Dα(fϕN)|Cs,d,KN(s|α|1)fCs,\sup_{K}|D^{\alpha}(f-\phi_{N})|\leq C_{s,d,K}N^{-(s-|\alpha|_{1})}\|f\|_{C^{s}},

where Cs,d,KC_{s,d,K} is a positive constant depending only on d,sd,s and KK. Note that the number of neurons 𝒰=𝒪(18pNd)\mathcal{U}=\mathcal{O}(18pN^{d}), which implies (𝒰/18p)1/dN(\mathcal{U}/18p)^{1/d}\leq N. Then we also have

supK|Dα(fϕN)|Cp,s,d,K𝒰(s|α|1)/dfCs,\sup_{K}|D^{\alpha}(f-\phi_{N})|\leq C_{p,s,d,K}\mathcal{U}^{-(s-|\alpha|_{1})/d}\|f\|_{C^{s}},

where Cp,s,d,KC_{p,s,d,K} is a positive constant depending only on p,d,sp,d,s and KK. This completes the proof. \hfill\Box

Proof of Theorem 8

The idea of our proof is to project the data to a low-dimensional space and then use a deep RePU neural network to approximate the resulting low-dimensional function.

Given any integer dδ=O(dlog(d/δ)/δ2)d_{\delta}=O(d_{\mathcal{M}}{\log(d/\delta)}/{\delta^{2}}) satisfying dδdd_{\delta}\leq d, by Theorem 3.1 in Baraniuk and Wakin (2009) there exists a linear projector Adδ×dA\in\mathbb{R}^{d_{\delta}\times d} that maps a low-dimensional manifold in a high-dimensional space to a low-dimensional space while nearly preserving distances. Specifically, there exists a matrix Adδ×dA\in\mathbb{R}^{d_{\delta}\times d} such that AAT=(d/dδ)IdδAA^{T}=(d/d_{\delta})I_{d_{\delta}}, where IdδI_{d_{\delta}} is the identity matrix of size dδ×dδd_{\delta}\times d_{\delta}, and

(1δ)x1x22Ax1Ax22(1+δ)x1x22,(1-\delta)\|x_{1}-x_{2}\|_{2}\leq\|Ax_{1}-Ax_{2}\|_{2}\leq(1+\delta)\|x_{1}-x_{2}\|_{2},

for any x1,x2.x_{1},x_{2}\in\mathcal{M}.
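The theorem only asserts the existence of such an AA; in practice a suitably scaled random projection typically has these properties with high probability. The following small NumPy sketch (ours, purely illustrative and not part of the proof) constructs a matrix satisfying AAT=(d/dδ)IdδAA^{T}=(d/d_{\delta})I_{d_{\delta}} and checks the distance distortion empirically on points of a curve in d\mathbb{R}^{d}:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_delta = 50, 12

# A with orthonormal rows, scaled so that A A^T = (d / d_delta) * I_{d_delta}
Q, _ = np.linalg.qr(rng.standard_normal((d, d_delta)))   # Q: (d, d_delta)
A = np.sqrt(d / d_delta) * Q.T                            # A: (d_delta, d)
assert np.allclose(A @ A.T, (d / d_delta) * np.eye(d_delta))

# points on a one-dimensional manifold (a curve) embedded in R^d
t = rng.uniform(0.0, 1.0, size=(200, 1))
M = np.hstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), t,
               np.zeros((200, d - 3))])

# empirical distortion of pairwise distances under x -> A x
i, j = rng.integers(0, 200, 500), rng.integers(0, 200, 500)
keep = i != j
num = np.linalg.norm((M[i[keep]] - M[j[keep]]) @ A.T, axis=1)
den = np.linalg.norm(M[i[keep]] - M[j[keep]], axis=1)
print((num / den).min(), (num / den).max())   # ratios are close to 1
```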

Note that for any zA()z\in A(\mathcal{M}), there exists a unique xx\in\mathcal{M} such that Ax=zAx=z. Then for any zA()z\in A(\mathcal{M}), define xz=𝒮({x:Ax=z})x_{z}=\mathcal{SL}(\{x\in\mathcal{M}:Ax=z\}), where 𝒮()\mathcal{SL}(\cdot) is a set function which returns the unique element of a set. If Ax=zAx=z where xx\in\mathcal{M} and zA()z\in A(\mathcal{M}), then x=xzx=x_{z} by our argument, since {x:Ax=z}\{x\in\mathcal{M}:Ax=z\} is a set with only one element when zA()z\in A(\mathcal{M}). We can see that 𝒮:A()\mathcal{SL}:A(\mathcal{M})\to\mathcal{M} is a differentiable function with the norm of its derivative lying in [1/(1+δ),1/(1δ)][1/(1+\delta),1/(1-\delta)], since

11+δz1z22xz1xz2211δz1z22,\frac{1}{1+\delta}\|z_{1}-z_{2}\|_{2}\leq\|x_{z_{1}}-x_{z_{2}}\|_{2}\leq\frac{1}{1-\delta}\|z_{1}-z_{2}\|_{2},

for any z1,z2A()z_{1},z_{2}\in A(\mathcal{M}). For the high-dimensional function f0:𝒳1f_{0}:\mathcal{X}\to\mathbb{R}^{1}, we define its low-dimensional representation f~0:dδ1\tilde{f}_{0}:\mathbb{R}^{d_{\delta}}\to\mathbb{R}^{1} by

f~0(z)=f0(xz),foranyzA()dδ.\tilde{f}_{0}(z)=f_{0}(x_{z}),\quad{\rm for\ any}\ z\in A(\mathcal{M})\subseteq\mathbb{R}^{d_{\delta}}.

Recall that f0Cs(𝒳)f_{0}\in C^{s}(\mathcal{X}); then f~0Cs(A())\tilde{f}_{0}\in C^{s}(A(\mathcal{M})). Note that \mathcal{M} is a compact manifold and AA is a linear mapping; then by the extended version of Whitney's extension theorem in Fefferman (2006), there exists a function F~0Cs(A(ρ))\tilde{F}_{0}\in C^{s}(A(\mathcal{M}_{\rho})) such that F~0(z)=f~0(z)\tilde{F}_{0}(z)=\tilde{f}_{0}(z) for any zA()z\in A(\mathcal{M}) and F~0C1(1+δ)f0C1\|\tilde{F}_{0}\|_{C^{1}}\leq(1+\delta)\|f_{0}\|_{C^{1}}. By Theorem 5, for any N+N\in\mathbb{N}^{+}, there exists a function f~n:dδ1\tilde{f}_{n}:\mathbb{R}^{d_{\delta}}\to\mathbb{R}^{1} implemented by a RePU network with its depth 𝒟\mathcal{D}, width 𝒲\mathcal{W}, number of neurons 𝒰\mathcal{U} and size 𝒮\mathcal{S} specified as

𝒟=2N1,𝒲=12pNdδ1+6p(Ndδ1N)/(N1)\displaystyle\mathcal{D}=2N-1,\qquad\mathcal{W}=12pN^{d_{\delta}-1}+6p(N^{d_{\delta}-1}-N)/(N-1)
𝒰=(6p+2)(2NdδNdδ1N)+2p(2NdδNdδ1N)/(N1),\displaystyle\mathcal{U}=(6p+2)(2N^{d_{\delta}}-N^{d_{\delta}-1}-N)+2p(2N^{d_{\delta}}-N^{d_{\delta}-1}-N)/(N-1),
𝒮=(30p+2)(2NdδNdδ1N)+(2p+1)(2NdδNdδ1N)/(N1),\displaystyle\mathcal{S}=(30p+2)(2N^{d_{\delta}}-N^{d_{\delta}-1}-N)+(2p+1)(2N^{d_{\delta}}-N^{d_{\delta}-1}-N)/(N-1),

such that for each multi-index α0d\alpha\in\mathbb{N}^{d}_{0} with |α|11|\alpha|_{1}\leq 1, we have

|Dα(f~n(z)F~0(z))|Cs,dδ,A(ρ)N(s|α|1)F~0C1,|D^{\alpha}(\tilde{f}_{n}(z)-\tilde{F}_{0}(z))|\leq C_{s,d_{\delta},A(\mathcal{M}_{\rho})}N^{-(s-|\alpha|_{1})}\|\tilde{F}_{0}\|_{C^{1}},

for all zA(ρ)z\in A(\mathcal{M}_{\rho}) where Cs,dδ,A(ρ)>0C_{s,d_{\delta},A(\mathcal{M}_{\rho})}>0 is a constant depending only on s,dδ,A(ρ)s,d_{\delta},A(\mathcal{M}_{\rho}).

By Theorem 3, the linear projection AA can be computed by a RePU network with 1 hidden layer and width no more than 18p. If we define fn=f~nAf^{*}_{n}=\tilde{f}_{n}\circ A, i.e., fn(x)=f~n(Ax)f^{*}_{n}(x)=\tilde{f}_{n}(Ax) for any x𝒳x\in\mathcal{X}, then fn𝒟,𝒲,𝒰,𝒮,f^{*}_{n}\in\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B}} is also a RePU network, with one more layer than f~n\tilde{f}_{n}. For any xρx\in\mathcal{M}_{\rho}, there exists an x~\tilde{x}\in\mathcal{M} such that xx~2ρ\|x-\tilde{x}\|_{2}\leq\rho. Then, for each multi-index α0d\alpha\in\mathbb{N}^{d}_{0} with |α|11|\alpha|_{1}\leq 1, we have

|Dα(fn(x)f0(x))|=|Dα(f~n(Ax)F~0(Ax)+F~0(Ax)F~0(Ax~)+F~0(Ax~)f0(x))|\displaystyle|D^{\alpha}(f^{*}_{n}(x)-f_{0}(x))|=|D^{\alpha}(\tilde{f}_{n}(Ax)-\tilde{F}_{0}(Ax)+\tilde{F}_{0}(Ax)-\tilde{F}_{0}(A\tilde{x})+\tilde{F}_{0}(A\tilde{x})-f_{0}(x))|
|Dα(f~n(Ax)F~0(Ax))|+|Dα(F~0(Ax)F~0(Ax~))|+|Dα(F~0(Ax~)f0(x))|\displaystyle\leq|D^{\alpha}(\tilde{f}_{n}(Ax)-\tilde{F}_{0}(Ax))|+|D^{\alpha}(\tilde{F}_{0}(Ax)-\tilde{F}_{0}(A\tilde{x}))|+|D^{\alpha}(\tilde{F}_{0}(A\tilde{x})-f_{0}(x))|
Cs,dδ,A(ρ)N(s|α|1)F~0C|α|1+(1+δ)ρF~0C|α|1+|Dα(f0(x~)f0(x))|\displaystyle\leq C_{s,d_{\delta},A(\mathcal{M}_{\rho})}N^{-(s-|\alpha|_{1})}\|\tilde{F}_{0}\|_{C^{|\alpha|_{1}}}+(1+\delta)\rho\|\tilde{F}_{0}\|_{C^{|\alpha|_{1}}}+|D^{\alpha}(f_{0}(\tilde{x})-f_{0}(x))|
[Cs,dδ,A(ρ)N(s|α|1)+(1+δ)ρ]F~0C|α|1+ρf0C|α|1\displaystyle\leq\big{[}C_{s,d_{\delta},A(\mathcal{M}_{\rho})}N^{-(s-|\alpha|_{1})}+(1+\delta)\rho\big{]}\|\tilde{F}_{0}\|_{C^{|\alpha|_{1}}}+\rho\|f_{0}\|_{C^{|\alpha|_{1}}}
Cs,dδ,A(ρ)(1+δ)f0C|α|1N(s|α|1)+2(1+δ)2ρf0C|α|1\displaystyle\leq C_{s,d_{\delta},A(\mathcal{M}_{\rho})}(1+\delta)\|f_{0}\|_{C^{|\alpha|_{1}}}N^{-(s-|\alpha|_{1})}+2(1+\delta)^{2}\rho\|f_{0}\|_{C^{|\alpha|_{1}}}
C~s,dδ,A(ρ)(1+δ)f0C|α|1N(s|α|1),\displaystyle\leq\tilde{C}_{s,d_{\delta},A(\mathcal{M}_{\rho})}(1+\delta)\|f_{0}\|_{C^{|\alpha|_{1}}}N^{-(s-|\alpha|_{1})},
Cp,s,dδ,A(ρ)(1+δ)f0C|α|1𝒰(s|α|1)/dδ,\displaystyle\leq{C}_{p,s,d_{\delta},A(\mathcal{M}_{\rho})}(1+\delta)\|f_{0}\|_{C^{|\alpha|_{1}}}\mathcal{U}^{-(s-|\alpha|_{1})/d_{\delta}},

where ${C}_{p,s,d_{\delta},A(\mathcal{M}_{\rho})}$ is a constant depending only on $p,s,d_{\delta},A(\mathcal{M}_{\rho})$. The second-to-last inequality follows from $\rho\leq C_{1}N^{-(s-1)}(1+\delta)^{-1}$. Since the number of neurons satisfies $\mathcal{U}=\mathcal{O}(18pN^{d_{\delta}})$ and $(\mathcal{U}/18p)^{1/d_{\delta}}\leq N$, the last inequality follows. This completes the proof.

\hfill\Box
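As an aside, the composition $f^{*}_{n}=\tilde{f}_{n}\circ A$ constructed in the proof above is straightforward to realize in code. The following is a minimal sketch, assuming PyTorch; the class and attribute names are ours and purely illustrative.

```python
import torch

class ProjectedRePUNet(torch.nn.Module):
    """Realizes f*_n(x) = f_tilde(A x): a fixed linear projection followed by a
    network acting on the low-dimensional representation (e.g., a RePU MLP)."""

    def __init__(self, A, low_dim_net):
        super().__init__()
        self.register_buffer("A", A)      # (d_delta, d) projection matrix, not trained
        self.low_dim_net = low_dim_net    # plays the role of f_tilde_n

    def forward(self, x):                 # x has shape (n, d)
        z = x @ self.A.T                  # project to shape (n, d_delta)
        return self.low_dim_net(z)        # apply f_tilde_n on the projected inputs
```

Since $A$ is linear, the projection adds only one extra layer in front of $\tilde{f}_{n}$, matching the layer count used in the proof.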

Proof of Lemma 10

For the empirical risk minimizer s^n\hat{s}_{n} based on the sample S={Xi}i=1nS=\{X_{i}\}_{i=1}^{n}, we consider its excess risk 𝔼{J(s^n)J(s0)}\mathbb{E}\{J(\hat{s}_{n})-J(s_{0})\}.

For any sns\in\mathcal{F}_{n}, we have

J(s^n)J(s0)\displaystyle J(\hat{s}_{n})-J(s_{0}) =J(s^n)Jn(s^n)+Jn(s^n)Jn(s)+Jn(s)J(s)+J(s)J(s0)\displaystyle=J(\hat{s}_{n})-J_{n}(\hat{s}_{n})+J_{n}(\hat{s}_{n})-J_{n}(s)+J_{n}(s)-J(s)+J(s)-J(s_{0})
J(s^n)Jn(s^n)+Jn(s)J(s)+J(s)J(s0)\displaystyle\leq J(\hat{s}_{n})-J_{n}(\hat{s}_{n})+J_{n}(s)-J(s)+J(s)-J(s_{0})
2supsn|J(s)Jn(s)|+J(s)J(s0),\displaystyle\leq 2\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)|+J(s)-J(s_{0}),

where the first inequality follows from the definition of the empirical risk minimizer $\hat{s}_{n}$, and the second inequality holds since $\hat{s}_{n},s\in\mathcal{F}_{n}$. Since the above inequality holds for any $s\in\mathcal{F}_{n}$, we obtain

\displaystyle J(\hat{s}_{n})-J(s_{0})\leq 2\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)|+\inf_{s\in\mathcal{F}_{n}}[J(s)-J(s_{0})],

where we call 2supsn|J(s)Jn(s)|2\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)| the stochastic error and infsn[J(s)J(s0)]\inf_{s\in\mathcal{F}_{n}}[J(s)-J(s_{0})] the approximation error.

Bounding the stochastic error

Recall that $s,s_{0}$ are vector-valued functions. We write $s=(s_{1},\ldots,s_{d})^{\top}$ and $s_{0}=(s_{01},\ldots,s_{0d})^{\top}$, and let $\frac{\partial}{\partial x_{j}}s_{j}$ denote the $j$-th diagonal entry of $\nabla_{x}s$ and $\frac{\partial}{\partial x_{j}}s_{0j}$ the $j$-th diagonal entry of $\nabla_{x}s_{0}$. Then

J(s)\displaystyle J(s) =𝔼[tr(xs(X))+12s(X)22]=𝔼[j=1dxjsj(X)+12j=1d|sj(X)|2].\displaystyle=\mathbb{E}\Big{[}tr(\nabla_{x}s(X))+\frac{1}{2}\|s(X)\|^{2}_{2}\Big{]}=\mathbb{E}\Big{[}\sum_{j=1}^{d}\frac{\partial}{\partial x_{j}}s_{j}(X)+\frac{1}{2}\sum_{j=1}^{d}|s_{j}(X)|^{2}\Big{]}.
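For intuition, the empirical counterpart $J_{n}(s)$ of this objective can be computed with automatic differentiation. Below is a minimal sketch, assuming PyTorch and a differentiable module `score_net` mapping a batch in $\mathbb{R}^{n\times d}$ to $\mathbb{R}^{n\times d}$ (for instance a RePU MLP); it is illustrative only and not part of the proof.

```python
import torch

def empirical_score_matching_loss(score_net, x):
    # J_n(s) = (1/n) * sum_i [ tr(grad_x s(X_i)) + 0.5 * ||s(X_i)||_2^2 ]
    x = x.detach().requires_grad_(True)
    s = score_net(x)                                   # shape (n, d)
    trace = torch.zeros(x.shape[0], device=x.device)
    for j in range(x.shape[1]):
        # j-th diagonal entry of the Jacobian: (d/dx_j) s_j evaluated at each X_i
        grad_j = torch.autograd.grad(s[:, j].sum(), x, create_graph=True)[0][:, j]
        trace = trace + grad_j
    return (trace + 0.5 * (s ** 2).sum(dim=1)).mean()
```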

Define

\displaystyle J^{1,j}(s)=\mathbb{E}\Big[\frac{\partial}{\partial x_{j}}s_{j}(X)\Big]\qquad J^{1,j}_{n}(s)=\frac{1}{n}\sum_{i=1}^{n}\Big[\frac{\partial}{\partial x_{j}}s_{j}(X_{i})\Big]
\displaystyle J^{2,j}(s)=\mathbb{E}\Big[\frac{1}{2}|s_{j}(X)|^{2}\Big]\qquad J^{2,j}_{n}(s)=\frac{1}{n}\sum_{i=1}^{n}\Big[\frac{1}{2}|s_{j}(X_{i})|^{2}\Big]

for j=1,,dj=1,\ldots,d and sns\in\mathcal{F}_{n}. Then,

supsn|J(s)Jn(s)|\displaystyle\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)| supsnj=1d[|J1,j(s)Jn1,j(s)|+|J2,j(s)Jn2,j(s)|]\displaystyle\leq\sup_{s\in\mathcal{F}_{n}}\sum_{j=1}^{d}\Big{[}|J^{1,j}(s)-J^{1,j}_{n}(s)|+|J^{2,j}(s)-J^{2,j}_{n}(s)|\Big{]}
j=1dsupsn|J1,j(s)Jn1,j(s)|+j=1dsupsn|J2,j(s)Jn2,j(s)|.\displaystyle\leq\sum_{j=1}^{d}\sup_{s\in\mathcal{F}_{n}}|J^{1,j}(s)-J^{1,j}_{n}(s)|+\sum_{j=1}^{d}\sup_{s\in\mathcal{F}_{n}}|J^{2,j}(s)-J^{2,j}_{n}(s)|. (B.4)

Recall that for any $s\in\mathcal{F}_{n}$, the outputs satisfy $\|s_{j}\|_{\infty}\leq\mathcal{B}$ and the partial derivatives satisfy $\|\frac{\partial}{\partial x_{j}}s_{j}\|_{\infty}\leq\mathcal{B}^{\prime}$ for $j=1,\ldots,d$. Then by Theorem 11.8 in Mohri et al. (2018), for any $\delta>0$, with probability at least $1-\delta$ over the choice of the $n$ i.i.d. sample $S$,

supsn|J1,j(s)Jn1,j(s)|22Pdim(jn)log(en)n+2log(1/δ)2n\displaystyle\sup_{s\in\mathcal{F}_{n}}|J^{1,j}(s)-J^{1,j}_{n}(s)|\leq 2\mathcal{B}^{\prime}\sqrt{\frac{2{\rm Pdim}(\mathcal{F}^{\prime}_{jn})\log(en)}{n}}+2\mathcal{B}^{\prime}\sqrt{\frac{\log(1/\delta)}{2n}} (B.5)

for $j=1,\ldots,d$, where ${\rm Pdim}(\mathcal{F}^{\prime}_{jn})$ is the pseudo dimension of $\mathcal{F}^{\prime}_{jn}$ and $\mathcal{F}^{\prime}_{jn}=\{\frac{\partial}{\partial x_{j}}s_{j}:s\in\mathcal{F}_{n}\}$. Similarly, with probability at least $1-\delta$ over the choice of the $n$ i.i.d. sample $S$,

supsn|J2,j(s)Jn2,j(s)|22Pdim(jn)log(en)n+2log(1/δ)2n\displaystyle\sup_{s\in\mathcal{F}_{n}}|J^{2,j}(s)-J^{2,j}_{n}(s)|\leq\mathcal{B}^{2}\sqrt{\frac{2{\rm Pdim}(\mathcal{F}_{jn})\log(en)}{n}}+\mathcal{B}^{2}\sqrt{\frac{\log(1/\delta)}{2n}} (B.6)

for j=1,,dj=1,\ldots,d where jn={sj:sn}\mathcal{F}_{jn}=\{s_{j}:s\in\mathcal{F}_{n}\}. Combining (B.4), (B.5) and (B.6), we have proved that for any δ>0\delta>0, with probability at least 12dδ1-2d\delta,

\displaystyle\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)|\leq\sqrt{\frac{2\log(en)}{n}}\sum_{j=1}^{d}\Big[\mathcal{B}^{2}\sqrt{{\rm Pdim}(\mathcal{F}_{jn})}+2\mathcal{B}^{\prime}\sqrt{{\rm Pdim}(\mathcal{F}^{\prime}_{jn})}\Big]
+d[2+2]log(1/δ)2n.\displaystyle\qquad\qquad+d\Big{[}\mathcal{B}^{2}+2\mathcal{B}^{\prime}\Big{]}\sqrt{\frac{\log(1/\delta)}{2n}}.

Note that s=(s1,,sd)s=(s_{1},\ldots,s_{d})^{\top} for sns\in\mathcal{F}_{n} where n=𝒟,𝒲,𝒰,𝒮,,\mathcal{F}_{n}=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}} is a class of RePU neural networks with depth 𝒟\mathcal{D}, width 𝒲\mathcal{W}, size 𝒮\mathcal{S} and number of neurons 𝒰\mathcal{U}. Then for each j=1,,dj=1,\ldots,d, the function class jn={sj:sn}\mathcal{F}_{jn}=\{s_{j}:s\in\mathcal{F}_{n}\} consists of RePU neural networks with depth 𝒟\mathcal{D}, width 𝒲\mathcal{W}, number of neurons 𝒰(d1)\mathcal{U}-(d-1) and size no more than 𝒮\mathcal{S}. By Lemma 2, we have Pdim(1n)=Pdim(2n)==Pdim(dn)3p𝒟𝒮(𝒟+log2𝒰){\rm Pdim}(\mathcal{F}_{1n})={\rm Pdim}(\mathcal{F}_{2n})=\ldots={\rm Pdim}(\mathcal{F}_{dn})\leq 3p\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U}). Similarly, by Theorem 1 and Lemma 2, we have Pdim(1n)=Pdim(2n)==Pdim(dn)2484p𝒟𝒮(𝒟+log2𝒰){\rm Pdim}(\mathcal{F}^{\prime}_{1n})={\rm Pdim}(\mathcal{F}^{\prime}_{2n})=\ldots={\rm Pdim}(\mathcal{F}^{\prime}_{dn})\leq 2484p\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U}). Then, for any δ>0\delta>0, with probability at least 1δ1-\delta

supsn|J(s)Jn(s)|\displaystyle\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)| 50×d(2+2)(502plog(en)𝒟𝒮(𝒟+log2𝒰)n+log(2d/δ)2n).\displaystyle\leq 50\times d(\mathcal{B}^{2}+2\mathcal{B}^{\prime})\Bigg{(}50\sqrt{\frac{2p\log(en)\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})}{n}}+\sqrt{\frac{\log(2d/\delta)}{2n}}\Bigg{)}.

If we let $t=50d(\mathcal{B}^{2}+2\mathcal{B}^{\prime})\big(50\sqrt{{2p\log(en)\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})}/{n}}\big)$, then the above inequality implies

(supsn|J(s)Jn(s)|\displaystyle\mathbb{P}\Bigg{(}\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)| ϵ)2dexp(2n(ϵt)2[50d(2+2)]2),\displaystyle\geq\epsilon\Bigg{)}\leq 2d\exp\left(\frac{-2n(\epsilon-t)^{2}}{[50d(\mathcal{B}^{2}+2\mathcal{B}^{\prime})]^{2}}\right),

for ϵt\epsilon\geq t. And

𝔼[supsn|J(s)Jn(s)|]\displaystyle\mathbb{E}\left[\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)|\right]
=0(supsn|J(s)Jn(s)|u)𝑑u\displaystyle=\int_{0}^{\infty}\mathbb{P}\Bigg{(}\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)|\geq u\Bigg{)}du
=0t(supsn|J(s)Jn(s)|u)𝑑u+t(supsn|J(s)Jn(s)|u)𝑑u\displaystyle=\int_{0}^{t}\mathbb{P}\Bigg{(}\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)|\geq u\Bigg{)}du+\int_{t}^{\infty}\mathbb{P}\Bigg{(}\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)|\geq u\Bigg{)}du
0t1𝑑u+t2dexp(2n(ut)2[50d(2+2)]2)𝑑u\displaystyle\leq\int_{0}^{t}1du+\int_{t}^{\infty}2d\exp\left(\frac{-2n(u-t)^{2}}{[50d(\mathcal{B}^{2}+2\mathcal{B}^{\prime})]^{2}}\right)du
=t+252πd2(2+2)1n\displaystyle=t+25\sqrt{2\pi}d^{2}(\mathcal{B}^{2}+2\mathcal{B}^{\prime})\frac{1}{\sqrt{n}}
2575d2(2+2)2plog(en)𝒟𝒮(𝒟+log2𝒰)/n.\displaystyle\leq 2575d^{2}(\mathcal{B}^{2}+2\mathcal{B}^{\prime})\sqrt{{2p\log(en)\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})}/{n}}.
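For completeness, the Gaussian-integral step behind the second-to-last line above is, writing $c=50d(\mathcal{B}^{2}+2\mathcal{B}^{\prime})$ and $v=u-t$,

\int_{t}^{\infty}2d\exp\Big(\frac{-2n(u-t)^{2}}{c^{2}}\Big)du=2d\int_{0}^{\infty}\exp\Big(\frac{-2nv^{2}}{c^{2}}\Big)dv=dc\sqrt{\frac{\pi}{2n}}=25\sqrt{2\pi}\,d^{2}(\mathcal{B}^{2}+2\mathcal{B}^{\prime})\frac{1}{\sqrt{n}}.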

Bounding the approximation error

Recall that for sns\in\mathcal{F}_{n},

J(s)\displaystyle J(s) =𝔼[tr(xs(X))+12s(X)22]=𝔼[j=1dxjsj(X)+12j=1d|sj(X)|2],\displaystyle=\mathbb{E}\Big{[}tr(\nabla_{x}s(X))+\frac{1}{2}\|s(X)\|^{2}_{2}\Big{]}=\mathbb{E}\Big{[}\sum_{j=1}^{d}\frac{\partial}{\partial x_{j}}s_{j}(X)+\frac{1}{2}\sum_{j=1}^{d}|s_{j}(X)|^{2}\Big{]},

and the excess risk

J(s)J(s0)\displaystyle J(s)-J(s_{0})
=𝔼[j=1dxjsj(X)+12j=1d|sj(X)|2]𝔼[j=1dxjs0j(X)+12j=1d|s0j(X)|2]\displaystyle=\mathbb{E}\Big{[}\sum_{j=1}^{d}\frac{\partial}{\partial x_{j}}s_{j}(X)+\frac{1}{2}\sum_{j=1}^{d}|s_{j}(X)|^{2}\Big{]}-\mathbb{E}\Big{[}\sum_{j=1}^{d}\frac{\partial}{\partial x_{j}}s_{0j}(X)+\frac{1}{2}\sum_{j=1}^{d}|s_{0j}(X)|^{2}\Big{]}
=j=1d𝔼[xjsj(X)xjs0j(X)+12|sj(X)|212|s0j(X)|2].\displaystyle=\sum_{j=1}^{d}\mathbb{E}\Big{[}\frac{\partial}{\partial x_{j}}s_{j}(X)-\frac{\partial}{\partial x_{j}}s_{0j}(X)+\frac{1}{2}|s_{j}(X)|^{2}-\frac{1}{2}|s_{0j}(X)|^{2}\Big{]}.

Recall that $s=(s_{1},\ldots,s_{d})^{\top}$ and $s_{0}=(s_{01},\ldots,s_{0d})^{\top}$ are vector-valued functions. For each $j=1,\ldots,d$, we let $\mathcal{F}_{jn}$ be a class of RePU neural networks with depth $\mathcal{D}$, width $\mathcal{W}$, size $\mathcal{S}$ and number of neurons $\mathcal{U}$. Define $\tilde{\mathcal{F}}_{n}=\{s=(s_{1},\ldots,s_{d})^{\top}:s_{j}\in\mathcal{F}_{jn},j=1,\ldots,d\}$. The neural networks in $\tilde{\mathcal{F}}_{n}$ have depth $\mathcal{D}$, width $d\mathcal{W}$, size $d\mathcal{S}$ and number of neurons $d\mathcal{U}$; they can be viewed as built by placing in parallel $d$ subnetworks from $\mathcal{F}_{jn}$, $j=1,\ldots,d$. Let $\mathcal{F}_{n}$ be the class of all RePU neural networks with depth $\mathcal{D}$, width $d\mathcal{W}$, size $d\mathcal{S}$ and number of neurons $d\mathcal{U}$. Then $\tilde{\mathcal{F}}_{n}\subset\mathcal{F}_{n}$ and

infsn[J(s)J(s0)]\displaystyle\inf_{s\in\mathcal{F}_{n}}\Big{[}J(s)-J(s_{0})\Big{]}
infs~n[J(s)J(s0)]\displaystyle\leq\inf_{s\in\tilde{\mathcal{F}}_{n}}\Big{[}J(s)-J(s_{0})\Big{]}
=infs=(s1,,sd),sjjn,j=1,,d[J(s)J(s0)]\displaystyle=\inf_{s=(s_{1},\ldots,s_{d})^{\top},s_{j}\in\mathcal{F}_{jn},j=1,\ldots,d}\Big{[}J(s)-J(s_{0})\Big{]}
=j=1dinfsjjn𝔼[xjsj(X)xjs0j(X)+12|sj(X)|212|s0j(X)|2].\displaystyle=\sum_{j=1}^{d}\inf_{s_{j}\in\mathcal{F}_{jn}}\mathbb{E}\Big{[}\frac{\partial}{\partial x_{j}}s_{j}(X)-\frac{\partial}{\partial x_{j}}s_{0j}(X)+\frac{1}{2}|s_{j}(X)|^{2}-\frac{1}{2}|s_{0j}(X)|^{2}\Big{]}.

We now focus on deriving an upper bound for

infsjjn𝔼[xjsj(X)xjs0j(X)+12|sj(X)|212|s0j(X)|2]\inf_{s_{j}\in\mathcal{F}_{jn}}\mathbb{E}\Big{[}\frac{\partial}{\partial x_{j}}s_{j}(X)-\frac{\partial}{\partial x_{j}}s_{0j}(X)+\frac{1}{2}|s_{j}(X)|^{2}-\frac{1}{2}|s_{0j}(X)|^{2}\Big{]}

for each $j=1,\ldots,d$. By assumption, $\|\frac{\partial}{\partial x_{j}}s_{0j}\|_{\infty},\|\frac{\partial}{\partial x_{j}}s_{j}\|_{\infty}\leq\mathcal{B}^{\prime}$ and $\|s_{0j}\|_{\infty},\|s_{j}\|_{\infty}\leq\mathcal{B}$ for any $s_{j}\in\mathcal{F}_{jn}$. For each $j=1,\ldots,d$, the target $s_{0j}$ defined on the domain $\mathcal{X}$ is a real-valued function belonging to the class $C^{m}$ for some $1\leq m<\infty$. By Theorem 5, for any $N\in\mathbb{N}^{+}$, there exists a RePU activated neural network $s_{Nj}$ with $2N-1$ hidden layers, $(6p+2)(2N^{d}-N^{d-1}-N)+2p(2N^{d}-N^{d-1}-N)/(N-1)$ neurons, $(30p+2)(2N^{d}-N^{d-1}-N)+(2p+1)(2N^{d}-N^{d-1}-N)/(N-1)$ parameters and width $12pN^{d-1}+6p(N^{d-1}-N)/(N-1)$ such that for each multi-index $\alpha\in\mathbb{N}^{d}_{0}$ with $|\alpha|_{1}\leq 1$,

sup𝒳|Dα(s0jsNj)|Cm,d,𝒳N(m|α|1)s0jC|α|1,\sup_{\mathcal{X}}|D^{\alpha}(s_{0j}-s_{Nj})|\leq C_{m,d,\mathcal{X}}N^{-(m-|\alpha|_{1})}\|s_{0j}\|_{C^{|\alpha|_{1}}},

where Cm,d,𝒳C_{m,d,\mathcal{X}} is a positive constant depending only on d,md,m and the diameter of 𝒳\mathcal{X}. Then

infsjjn𝔼[xjsj(X)xjs0j(X)+12|sj(X)|212|s0j(X)|2]\displaystyle\inf_{s_{j}\in\mathcal{F}_{jn}}\mathbb{E}\Big{[}\frac{\partial}{\partial x_{j}}s_{j}(X)-\frac{\partial}{\partial x_{j}}s_{0j}(X)+\frac{1}{2}|s_{j}(X)|^{2}-\frac{1}{2}|s_{0j}(X)|^{2}\Big{]}
𝔼[xjsNj(X)xjs0j(X)+12|sNj(X)|212|s0j(X)|2]\displaystyle\leq\mathbb{E}\Big{[}\frac{\partial}{\partial x_{j}}s_{Nj}(X)-\frac{\partial}{\partial x_{j}}s_{0j}(X)+\frac{1}{2}|s_{Nj}(X)|^{2}-\frac{1}{2}|s_{0j}(X)|^{2}\Big{]}
Cm,d,𝒳N(m1)s0jC1+Cm,d,𝒳Nms0jC0\displaystyle\leq C_{m,d,\mathcal{X}}N^{-(m-1)}\|s_{0j}\|_{C^{1}}+C_{m,d,\mathcal{X}}\mathcal{B}N^{-m}\|s_{0j}\|_{C^{0}}
Cm,d,𝒳(1+)N(m1)s0jC1\displaystyle\leq{C}_{m,d,\mathcal{X}}(1+\mathcal{B})N^{-(m-1)}\|s_{0j}\|_{C^{1}}

holds for each $j=1,\ldots,d$. Summing the above inequalities over $j$, we obtain

\inf_{s\in\mathcal{F}_{n}}\Big[J(s)-J(s_{0})\Big]\leq{C}_{m,d,\mathcal{X}}(1+\mathcal{B})N^{-(m-1)}\|s_{0}\|_{C^{1}},

Non-asymptotic error bound

Based on the obtained stochastic error bound and approximation error bound, we can conclude that with probability at least 1δ1-\delta, the empirical risk minimizer s^n\hat{s}_{n} defined in (9) satisfies

J(s^n)J(s0)\displaystyle J(\hat{s}_{n})-J(s_{0}) 100×d(2+2)(502plog(en)𝒟𝒮(𝒟+log2𝒰)n+log(2d/δ)2n)\displaystyle\leq 100\times d(\mathcal{B}^{2}+2\mathcal{B}^{\prime})\Bigg{(}50\sqrt{\frac{2p\log(en)\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})}{n}}+\sqrt{\frac{\log(2d/\delta)}{2n}}\Bigg{)}
+Cm,d,𝒳(1+)N(m1)s0C1,\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad+{C}_{m,d,\mathcal{X}}(1+\mathcal{B})N^{-(m-1)}\|s_{0}\|_{C^{1}},

and

𝔼{J(s^n)J(s0)}\displaystyle\mathbb{E}\{J(\hat{s}_{n})-J(s_{0})\} 2575d2(2+2)2plog(en)𝒟𝒮(𝒟+log2𝒰)/n\displaystyle\leq 2575d^{2}(\mathcal{B}^{2}+2\mathcal{B}^{\prime})\sqrt{{2p\log(en)\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})}/{n}}
+Cm,d,𝒳(1+)N(m1)s0C1,\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad+{C}_{m,d,\mathcal{X}}(1+\mathcal{B})N^{-(m-1)}\|s_{0}\|_{C^{1}},

where Cm,d,𝒳C_{m,d,\mathcal{X}} is a positive constant depending only on d,md,m and the diameter of 𝒳\mathcal{X}.

Note that the network depth $\mathcal{D}=2N-1$ is a positive odd number. We therefore take $\mathcal{D}$ to be a positive odd number and let the class of neural networks be specified by depth $\mathcal{D}$, width $\mathcal{W}=18pd[(\mathcal{D}+1)/2]^{d-1}$, number of neurons $\mathcal{U}=18pd[(\mathcal{D}+1)/2]^{d}$ and size $\mathcal{S}=67pd[(\mathcal{D}+1)/2]^{d}$. Then the error bound can be further expressed in terms of $\mathcal{U}$:

𝔼{J(s^n)J(s0)}\displaystyle\mathbb{E}\{J(\hat{s}_{n})-J(s_{0})\} Cp2d3(2+2)𝒰(d+2)/2d(logn)1/2n1/2\displaystyle\leq Cp^{2}d^{3}(\mathcal{B}^{2}+2\mathcal{B}^{\prime})\mathcal{U}^{(d+2)/{2d}}(\log n)^{1/2}{n}^{-1/2}
+Cm,d,𝒳(1+)s0C1𝒰(m1)/d,\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad+{C}_{m,d,\mathcal{X}}(1+\mathcal{B})\|s_{0}\|_{C^{1}}\mathcal{U}^{-(m-1)/d},

where CC is a universal positive constant and Cm,d,𝒳C_{m,d,\mathcal{X}} is a positive constant depending only on d,md,m and the diameter of 𝒳\mathcal{X}. \hfill\Box

Proof of Lemma 14

Lemma 14 can be proved by applying Theorem 8 and following the same arguments as in the proof of Lemma 10.

Proof of Lemma 18

Recall that f^nλ\hat{f}^{\lambda}_{n} is the empirical risk minimizer. Then, for any fnf\in\mathcal{F}_{n} we have

nλ(f^nλ)nλ(f).\displaystyle\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})\leq\mathcal{R}^{\lambda}_{n}(f).

For any fnf\in\mathcal{F}_{n}, let

ρλ(f):=1dj=1dλj𝔼{ρ(xjf(X))}\rho^{\lambda}(f):=\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\mathbb{E}\{\rho(\frac{\partial}{\partial x_{j}}f(X))\}

and

\rho^{\lambda}_{n}(f):=\frac{1}{n\times d}\sum_{i=1}^{n}\sum_{j=1}^{d}\lambda_{j}\rho\Big(\frac{\partial}{\partial x_{j}}f(X_{i})\Big).
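These penalty terms, and the full penalized empirical risk $\mathcal{R}^{\lambda}_{n}(f)$, are easy to compute in practice. Below is a minimal sketch, assuming PyTorch and, purely for illustration, the penalty $\rho(t)=\max(-t,0)$, which is nonnegative, $1$-Lipschitz, and vanishes whenever the partial derivative is nonnegative; the function names are ours.

```python
import torch

def penalized_isotonic_loss(f_net, x, y, lam):
    # R_n^lambda(f) = (1/n) sum_i |Y_i - f(X_i)|^2
    #                 + (1/(n*d)) sum_i sum_j lam_j * rho( (d/dx_j) f(X_i) ),
    # with the illustrative choice rho(t) = max(-t, 0); lam is a tensor of shape (d,).
    x = x.detach().requires_grad_(True)
    f = f_net(x).squeeze(-1)                                       # shape (n,)
    grad = torch.autograd.grad(f.sum(), x, create_graph=True)[0]   # shape (n, d)
    mse = ((y - f) ** 2).mean()
    penalty = (lam * torch.relu(-grad)).sum(dim=1).mean() / x.shape[1]
    return mse + penalty
```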

Then for any $f\in\mathcal{F}_{n}$, we have $\rho^{\lambda}(f)\geq 0$ and $\rho^{\lambda}_{n}(f)\geq 0$, since the penalty $\rho$ is nonnegative and the $\lambda_{j}$'s are nonnegative numbers. Note that $\rho^{\lambda}(f_{0})=\rho^{\lambda}_{n}(f_{0})=0$ by the assumption that $f_{0}$ is coordinate-wise nondecreasing. Then,

(f^nλ)(f0)\displaystyle\mathcal{R}(\hat{f}^{\lambda}_{n})-\mathcal{R}(f_{0})\leq (f^nλ)(f0)+ρλ(f^nλ)ρλ(f0)=λ(f^nλ)λ(f0).\displaystyle\mathcal{R}(\hat{f}^{\lambda}_{n})-\mathcal{R}(f_{0})+\rho^{\lambda}(\hat{f}^{\lambda}_{n})-\rho^{\lambda}(f_{0})=\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0}).

We can then give upper bounds for the excess risk (f^nλ)(f0)\mathcal{R}(\hat{f}^{\lambda}_{n})-\mathcal{R}(f_{0}). For any fnf\in\mathcal{F}_{n},

𝔼{(f^nλ)(f0)}\displaystyle\mathbb{E}\{\mathcal{R}(\hat{f}^{\lambda}_{n})-\mathcal{R}(f_{0})\}
𝔼{λ(f^nλ)λ(f0)}\displaystyle\leq\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0})\}
𝔼{λ(f^nλ)λ(f0)}+2𝔼{nλ(f)nλ(f^nλ)}\displaystyle\leq\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0})\}+2\mathbb{E}\{\mathcal{R}_{n}^{\lambda}(f)-\mathcal{R}_{n}^{\lambda}(\hat{f}^{\lambda}_{n})\}
=𝔼{λ(f^nλ)λ(f0)}+2𝔼[{nλ(f)nλ(f0)}{nλ(f^nλ)nλ(f0)}]\displaystyle=\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0})\}+2\mathbb{E}[\{\mathcal{R}_{n}^{\lambda}(f)-\mathcal{R}_{n}^{\lambda}(f_{0})\}-\{\mathcal{R}_{n}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}_{n}^{\lambda}(f_{0})\}]
=𝔼{λ(f^nλ)2nλ(f^nλ)+λ(f0)}+2𝔼{nλ(f)nλ(f0)}\displaystyle=\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-2\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}(f_{0})\}+2\mathbb{E}\{\mathcal{R}_{n}^{\lambda}(f)-\mathcal{R}_{n}^{\lambda}(f_{0})\}

where the second inequality holds by the fact that $\hat{f}^{\lambda}_{n}$ satisfies $\mathcal{R}_{n}^{\lambda}(f)\geq\mathcal{R}_{n}^{\lambda}(\hat{f}^{\lambda}_{n})$ for any $f\in\mathcal{F}_{n}$. Since the inequality holds for any $f\in\mathcal{F}_{n}$, we have

𝔼{(f^nλ)(f0)}\displaystyle\mathbb{E}\{\mathcal{R}(\hat{f}^{\lambda}_{n})-\mathcal{R}(f_{0})\} 𝔼{λ(f^nλ)2nλ(f^nλ)+λ(f0)}+2inffn{λ(f)λ(f0)}.\displaystyle\leq\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-2\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}(f_{0})\}+2\inf_{f\in\mathcal{F}_{n}}\{\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})\}.

This completes the proof. \hfill\Box

Proof of Lemma 20

Lemma 20 can be proved by combining Lemma 18, Lemma 37 and Lemma 38.

Proof of Lemma 24

Lemma 24 can be proved by combining Lemma 18, Lemma 37, Lemma 38 and Theorem 8.

Proof of Lemma 26

Under the misspecified model, the target function $f_{0}$ may not be monotonic, and the quantity $\sum_{j=1}^{d}\lambda_{j}[\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X_{i}))-\rho(\frac{\partial}{\partial x_{j}}{f}_{0}(X_{i}))]$ is not guaranteed to be nonnegative, which prevents us from using the decomposition technique in the proof of Lemma 37 to obtain a fast rate. Instead, we use the canonical decomposition of the excess risk. Let $S=\{Z_{i}=(X_{i},Y_{i})\}_{i=1}^{n}$ be the sample, and let $S_{X}=\{X_{i}\}_{i=1}^{n}$ and $S_{Y}=\{Y_{i}\}_{i=1}^{n}$. We note that

𝔼[(f^nλ)(f0)]\displaystyle\mathbb{E}[\mathcal{R}(\hat{f}^{\lambda}_{n})-\mathcal{R}(f_{0})] 𝔼[(f^nλ)(f0)+j=1dλj𝔼[ρ(xjf^nλ(X))]]\displaystyle\leq\mathbb{E}\Big{[}\mathcal{R}(\hat{f}^{\lambda}_{n})-\mathcal{R}(f_{0})+\sum_{j=1}^{d}\lambda_{j}\mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))]\Big{]}
=𝔼[λ(f^nλ)λ(f0)]+j=1dλj𝔼[ρ(xjf0(X))],\displaystyle=\mathbb{E}\Big{[}\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0})\Big{]}+\sum_{j=1}^{d}\lambda_{j}\mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}f_{0}(X))],

and

𝔼[λ(f^nλ)λ(f0)]\displaystyle\mathbb{E}\Big{[}\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0})\Big{]}
=𝔼[λ(f^nλ)nλ(f^nλ)+nλ(f^nλ)λ(fn)+λ(fn)λ(f0)]\displaystyle=\mathbb{E}\Big{[}\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f^{*}_{n})+\mathcal{R}^{\lambda}(f^{*}_{n})-\mathcal{R}^{\lambda}(f_{0})\Big{]}
𝔼[λ(f^nλ)nλ(f^nλ)+nλ(fn)λ(fn)+λ(fn)λ(f0)]\displaystyle\leq\mathbb{E}\Big{[}\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}_{n}(f^{*}_{n})-\mathcal{R}^{\lambda}(f^{*}_{n})+\mathcal{R}^{\lambda}(f^{*}_{n})-\mathcal{R}^{\lambda}(f_{0})\Big{]}
=𝔼[[λ(f^nλ)λ(f0)][nλ(f^nλ)nλ(f0)]\displaystyle=\mathbb{E}\Big{[}[\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0})]-[\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}_{n}(f_{0})]
+[nλ(fn)nλ(f0)][λ(fn)λ(f0)]+λ(fn)λ(f0)]\displaystyle+[\mathcal{R}^{\lambda}_{n}(f^{*}_{n})-\mathcal{R}^{\lambda}_{n}(f_{0})]-[\mathcal{R}^{\lambda}(f^{*}_{n})-\mathcal{R}^{\lambda}(f_{0})]+\mathcal{R}^{\lambda}(f^{*}_{n})-\mathcal{R}^{\lambda}(f_{0})\Big{]}
𝔼[2supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|]+inffn[λ(f)λ(f0)],\displaystyle\leq\mathbb{E}\Big{[}2\sup_{f\in\mathcal{F}_{n}}\Big{|}[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\Big{|}\Big{]}+\inf_{f\in\mathcal{F}_{n}}[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})],

where λ(fn)=inffnλ(f)\mathcal{R}^{\lambda}(f_{n}^{*})=\inf_{f\in\mathcal{F}_{n}}\mathcal{R}^{\lambda}(f), 𝔼\mathbb{E} denotes the expectation taken with respect to SS, and 𝔼[SX]\mathbb{E}[\cdot\mid S_{X}] denotes the conditional expectation given SXS_{X}. Then we have

𝔼[λ(f^nλ)λ(f0)]\displaystyle\mathbb{E}\Big{[}\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0})\Big{]} 𝔼[2supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|]\displaystyle\leq\mathbb{E}\Big{[}2\sup_{f\in\mathcal{F}_{n}}\Big{|}[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\Big{|}\Big{]}
+inffn[λ(f)λ(f0)]+j=1dλj𝔼[ρ(xjf0(X))],\displaystyle+\inf_{f\in\mathcal{F}_{n}}[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]+\sum_{j=1}^{d}\lambda_{j}\mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}f_{0}(X))],

where the first term is the stochastic error, the second term is the approximation error, and the third term is the misspecification error with respect to the penalty. Compared with the decomposition in Lemma 18, the approximation error is the same and can be bounded using Lemma 38. However, the stochastic error is different, and there is an additional misspecification error. We will leave the misspecification error untouched and include it in the final upper bound. Next, we focus on deriving the upper bound for the stochastic error.

For $f\in\mathcal{F}_{n}$, each $Z_{i}=(X_{i},Y_{i})$ and $j=1,\ldots,d$, let

\displaystyle g_{1}(f,X_{i})=\mathbb{E}\Big[|Y_{i}-f(X_{i})|^{2}-|Y_{i}-f_{0}(X_{i})|^{2}\mid X_{i}\Big]=|f(X_{i})-f_{0}(X_{i})|^{2}
\displaystyle g^{j}_{2}(f,X_{i})=\rho\Big(\frac{\partial}{\partial x_{j}}f(X_{i})\Big)-\rho\Big(\frac{\partial}{\partial x_{j}}f_{0}(X_{i})\Big).

Then we have

supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|\displaystyle\sup_{f\in\mathcal{F}_{n}}\left|[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\right|
supfn|𝔼[g1(f,X)]1ni=1ng1(f,Xi)+1dj=1dλj[𝔼[g2j(f,X)]1ni=1ng2j(f,Xi)]|\displaystyle\leq\sup_{f\in\mathcal{F}_{n}}\Big{|}\mathbb{E}[g_{1}(f,X)]-\frac{1}{n}\sum_{i=1}^{n}g_{1}(f,X_{i})+\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\Big{[}\mathbb{E}[g^{j}_{2}(f,X)]-\frac{1}{n}\sum_{i=1}^{n}g^{j}_{2}(f,X_{i})\Big{]}\Big{|}
supfn|𝔼[g1(f,X)]1ni=1ng1(f,Xi)|+1dj=1dλjsupfn|𝔼[g2j(f,X)]1ni=1ng2j(f,Xi)|.\displaystyle\leq\sup_{f\in\mathcal{F}_{n}}\Big{|}\mathbb{E}[g_{1}(f,X)]-\frac{1}{n}\sum_{i=1}^{n}g_{1}(f,X_{i})\Big{|}+\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\sup_{f\in\mathcal{F}_{n}}\Big{|}\mathbb{E}[g^{j}_{2}(f,X)]-\frac{1}{n}\sum_{i=1}^{n}g^{j}_{2}(f,X_{i})\Big{|}.

Recall that for any $f\in\mathcal{F}_{n}$ we have $\|f\|_{\infty}\leq\mathcal{B}$ and $\|\frac{\partial}{\partial x_{j}}f\|_{\infty}\leq\mathcal{B}^{\prime}$, and by assumption $\|f_{0}\|_{\infty}\leq\mathcal{B}$ and $\|\frac{\partial}{\partial x_{j}}f_{0}\|_{\infty}\leq\mathcal{B}^{\prime}$ for $j=1,\ldots,d$. By applying Theorem 11.8 in Mohri et al. (2018), for any $\delta>0$, with probability at least $1-\delta$ over the choice of the $n$ i.i.d. sample $S$,

supfn|𝔼[g1(f,X)]1ni=1ng1(f,Xi)|422Pdim(n)log(en)n+42log(1/δ)2n,\displaystyle\sup_{f\in\mathcal{F}_{n}}\Big{|}\mathbb{E}[g_{1}(f,X)]-\frac{1}{n}\sum_{i=1}^{n}g_{1}(f,X_{i})\Big{|}\leq 4\mathcal{B}^{2}\sqrt{\frac{2{\rm Pdim}(\mathcal{F}_{n})\log(en)}{n}}+4\mathcal{B}^{2}\sqrt{\frac{\log(1/\delta)}{2n}},

and

supfn|𝔼[g2j(f,X)]1ni=1ng2j(f,Xi)|2κ2Pdim(jn)log(en)n+2κlog(1/δ)2n\displaystyle\sup_{f\in\mathcal{F}_{n}}|\mathbb{E}[g^{j}_{2}(f,X)]-\frac{1}{n}\sum_{i=1}^{n}g^{j}_{2}(f,X_{i})|\leq 2\kappa\mathcal{B}^{\prime}\sqrt{\frac{2{\rm Pdim}(\mathcal{F}^{\prime}_{jn})\log(en)}{n}}+2\kappa\mathcal{B}^{\prime}\sqrt{\frac{\log(1/\delta)}{2n}}

for $j=1,\ldots,d$, where $\mathcal{F}^{\prime}_{jn}=\{\frac{\partial}{\partial x_{j}}f:f\in\mathcal{F}_{n}\}$. Combining the above bounds, we conclude that for any $\delta>0$, with probability at least $1-(d+1)\delta$,

supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|\displaystyle\sup_{f\in\mathcal{F}_{n}}\left|[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\right|
42log(en)n[2Pdim(n)+λ¯κPdim(jn)]+4[2+λ¯κ]log(1/δ)2n.\displaystyle\leq 4\sqrt{\frac{2\log(en)}{n}}\Big{[}\mathcal{B}^{2}\sqrt{{\rm Pdim}(\mathcal{F}_{n})}+\bar{\lambda}\kappa\mathcal{B}^{\prime}\sqrt{{\rm Pdim}(\mathcal{F}^{\prime}_{jn})}\Big{]}+4\Big{[}\mathcal{B}^{2}+\bar{\lambda}\kappa\mathcal{B}^{\prime}\Big{]}\sqrt{\frac{\log(1/\delta)}{2n}}.

Recall that $\mathcal{F}_{n}=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}}$ is a class of RePU neural networks with depth $\mathcal{D}$, width $\mathcal{W}$, size $\mathcal{S}$ and number of neurons $\mathcal{U}$. By Lemma 2, ${\rm Pdim}(\mathcal{F}_{n})\leq 3p\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})$. For each $j=1,\ldots,d$, the function class $\mathcal{F}^{\prime}_{jn}=\{\frac{\partial}{\partial x_{j}}f:f\in\mathcal{F}_{n}\}$ consists of RePU neural networks with depth $3\mathcal{D}+3$, width $6\mathcal{W}$, number of neurons $13\mathcal{U}$ and size no more than $23\mathcal{S}$. By Theorem 1 and Lemma 2, we have ${\rm Pdim}(\mathcal{F}^{\prime}_{1n})={\rm Pdim}(\mathcal{F}^{\prime}_{2n})=\ldots={\rm Pdim}(\mathcal{F}^{\prime}_{dn})\leq 2484p\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})$. Then, for any $\delta>0$, with probability at least $1-\delta$,

supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|\displaystyle\sup_{f\in\mathcal{F}_{n}}\left|[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\right|
200d(2+λ¯κ)(2plog(en)𝒟𝒮(𝒟+log2𝒰)n+log((d+1)/δ)2n).\displaystyle\leq 200d(\mathcal{B}^{2}+\bar{\lambda}\kappa\mathcal{B}^{\prime})\Bigg{(}\sqrt{\frac{2p\log(en)\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})}{n}}+\sqrt{\frac{\log((d+1)/\delta)}{2n}}\Bigg{)}.

If we let $t=200d(\mathcal{B}^{2}+\kappa\bar{\lambda}\mathcal{B}^{\prime})\sqrt{{2p\log(en)\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})}/{n}}$, then the above inequality implies

(supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|ϵ)\displaystyle\mathbb{P}\Bigg{(}\sup_{f\in\mathcal{F}_{n}}\left|[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\right|\geq\epsilon\Bigg{)}
(d+1)exp(n(ϵt)2[100d(2+κλ¯)]2),\displaystyle\leq(d+1)\exp\left(\frac{-n(\epsilon-t)^{2}}{[100d(\mathcal{B}^{2}+\kappa\bar{\lambda}\mathcal{B}^{\prime})]^{2}}\right),

for ϵt\epsilon\geq t. And

𝔼[supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|]\displaystyle\mathbb{E}\left[\sup_{f\in\mathcal{F}_{n}}\left|[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\right|\right]
=0(supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|u)du\displaystyle=\int_{0}^{\infty}\mathbb{P}\Bigg{(}\sup_{f\in\mathcal{F}_{n}}\left|[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\right|\geq u\Bigg{)}du
=0t(supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|u)du\displaystyle=\int_{0}^{t}\mathbb{P}\Bigg{(}\sup_{f\in\mathcal{F}_{n}}\left|[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\right|\geq u\Bigg{)}du
+t(supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|u)du\displaystyle+\int_{t}^{\infty}\mathbb{P}\Bigg{(}\sup_{f\in\mathcal{F}_{n}}\left|[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\right|\geq u\Bigg{)}du
\displaystyle\leq\int_{0}^{t}1du+\int_{t}^{\infty}(d+1)\exp\left(\frac{-n(u-t)^{2}}{[100d(\mathcal{B}^{2}+\kappa\bar{\lambda}\mathcal{B}^{\prime})]^{2}}\right)du
=t+2002πd(2+κλ¯)/n\displaystyle=t+200\sqrt{2\pi}d(\mathcal{B}^{2}+\kappa\bar{\lambda}\mathcal{B}^{\prime})/{\sqrt{n}}
800d2(2+κλ¯)2plog(en)𝒟𝒮(𝒟+log2𝒰)/n.\displaystyle\leq 800d^{2}(\mathcal{B}^{2}+\kappa\bar{\lambda}\mathcal{B}^{\prime})\sqrt{{2p\log(en)\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})}/{n}}.

Note that the network depth $\mathcal{D}$, number of neurons $\mathcal{U}$ and number of parameters $\mathcal{S}$ satisfy $\mathcal{U}=18pd((\mathcal{D}+1)/2)^{d}$ and $\mathcal{S}=67pd((\mathcal{D}+1)/2)^{d}$. Combining the error decomposition with Lemma 38, we have

𝔼[λ(f^nλ)λ(f0)]\displaystyle\mathbb{E}\Big{[}\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0})\Big{]} C1p2d3(2+κλ¯)(logn)1/2n1/2𝒰(d+2)/2d\displaystyle\leq C_{1}p^{2}d^{3}(\mathcal{B}^{2}+\kappa\bar{\lambda}\mathcal{B}^{\prime})(\log n)^{1/2}n^{-1/2}\mathcal{U}^{(d+2)/2d}
+C2(1+κλ¯)f0Cs2𝒰(s1)/d+j=1dλj𝔼[ρ(xjf0(X))],\displaystyle+C_{2}(1+\kappa\bar{\lambda})\|f_{0}\|^{2}_{C^{s}}\mathcal{U}^{-(s-1)/d}+\sum_{j=1}^{d}\lambda_{j}\mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}f_{0}(X))],

where C1>0C_{1}>0 is a universal constant, C2>0C_{2}>0 is a constant depending only on d,sd,s and the diameter of the support 𝒳\mathcal{X}. This completes the proof. \hfill\Box

Proof of Lemma 37

Let $S=\{Z_{i}=(X_{i},Y_{i})\}_{i=1}^{n}$ be the sample from the distribution of $Z=(X,Y)$ used to estimate $\hat{f}^{\lambda}_{n}$, and let $S^{\prime}=\{Z^{\prime}_{i}=(X^{\prime}_{i},Y^{\prime}_{i})\}_{i=1}^{n}$ be another sample independent of $S$. Define

g1(f,Xi)=𝔼{|Yif(Xi)|2|Yif0(Xi)|2Xi}=𝔼{|f(Xi)f0(Xi)|2Xi}\displaystyle g_{1}(f,X_{i})=\mathbb{E}\big{\{}|Y_{i}-f(X_{i})|^{2}-|Y_{i}-f_{0}(X_{i})|^{2}\mid X_{i}\big{\}}=\mathbb{E}\big{\{}|f(X_{i})-f_{0}(X_{i})|^{2}\mid X_{i}\big{\}}
g2(f,Xi)=𝔼[1dj=1dλjρ(xjf(Xi))1dj=1dλjρ(xjf0(Xi))Xi]\displaystyle g_{2}(f,X_{i})=\mathbb{E}\big{[}\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\frac{\partial}{\partial x_{j}}f(X_{i}))-\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\frac{\partial}{\partial x_{j}}f_{0}(X_{i}))\mid X_{i}\big{]}
g(f,Xi)=g1(f,Xi)+g2(f,Xi)\displaystyle g(f,X_{i})=g_{1}(f,X_{i})+g_{2}(f,X_{i})

for any (random) $f$ and sample point $X_{i}$. It is worth noting that for any $x$ and $f\in\mathcal{F}_{n}$,

0g1(f,x)=𝔼{|f(Xi)f0(Xi)|2Xi=x}42,0\leq g_{1}(f,x)=\mathbb{E}\big{\{}|f(X_{i})-f_{0}(X_{i})|^{2}\mid X_{i}=x\big{\}}\leq 4\mathcal{B}^{2},

since f\|f\|_{\infty}\leq\mathcal{B} and f0\|f_{0}\|_{\infty}\leq\mathcal{B} for fnf\in\mathcal{F}_{n} by assumption. For any xx and fnf\in\mathcal{F}_{n},

0g2(f,x)=𝔼[1dj=1dλjρ(xjf(Xi))1dj=1dλjρ(xjf0(Xi))Xi=x]2κλ¯,0\leq g_{2}(f,x)=\mathbb{E}\big{[}\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\frac{\partial}{\partial x_{j}}f(X_{i}))-\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\frac{\partial}{\partial x_{j}}f_{0}(X_{i}))\mid X_{i}=x\big{]}\leq 2\mathcal{B}^{\prime}\kappa\bar{\lambda},

since $\rho(\cdot)$ is a nonnegative $\kappa$-Lipschitz function with $\rho(\frac{\partial}{\partial x_{j}}f_{0}(X_{i}))=0$ under the monotonicity assumption on $f_{0}$, and $\|\frac{\partial}{\partial x_{j}}f\|_{\infty}\leq\mathcal{B}^{\prime}$ and $\|\frac{\partial}{\partial x_{j}}f_{0}\|_{\infty}\leq\mathcal{B}^{\prime}$ for $j=1,\ldots,d$ and any $f\in\mathcal{F}_{n}$ by assumption.

Recall that the empirical risk minimizer $\hat{f}^{\lambda}_{n}$ depends on the sample $S$, and the stochastic error is

𝔼{λ(f^nλ)2nλ(f^nλ)+λ(f0)}\displaystyle\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-2\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}(f_{0})\} =𝔼S(1ni=1n[𝔼S{g(f^nλ,Xi)}2g(f^nλ,Xi)])\displaystyle=\mathbb{E}_{S}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big{\}}-2g(\hat{f}^{\lambda}_{n},X_{i})\bigg{]}\Big{)}
=𝔼S(1ni=1n[𝔼S{g1(f^nλ,Xi)}2g1(f^nλ,Xi)])\displaystyle=\mathbb{E}_{S}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g_{1}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big{\}}-2g_{1}(\hat{f}^{\lambda}_{n},X_{i})\bigg{]}\Big{)} (B.7)
+𝔼S(1ni=1n[𝔼S{g2(f^nλ,Xi)}2g2(f^nλ,Xi)]).\displaystyle+\mathbb{E}_{S}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g_{2}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big{\}}-2g_{2}(\hat{f}^{\lambda}_{n},X_{i})\bigg{]}\Big{)}. (B.8)

In the following, we derive upper bounds of (B.7) and (B.8) respectively. For any random variable ξ\xi, it is clear that 𝔼[ξ]𝔼[max{ξ,0}]=0(ξ>t)𝑑t\mathbb{E}[\xi]\leq\mathbb{E}[\max\{\xi,0\}]=\int_{0}^{\infty}\mathbb{P}(\xi>t)dt. In light of this, we aim at giving upper bounds for the tail probabilities

(1ni=1n[𝔼S{gk(f^nλ,Xi)}2gk(f^nλ,Xi)]>t),k=1,2\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g_{k}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big{\}}-2g_{k}(\hat{f}^{\lambda}_{n},X_{i})\bigg{]}>t\right),\qquad k=1,2

for t>0t>0. Given f^nλn\hat{f}^{\lambda}_{n}\in\mathcal{F}_{n}, for k=1,2k=1,2, we have

\displaystyle\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}\bigg[\mathbb{E}_{S^{\prime}}\big\{g_{k}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big\}-2g_{k}(\hat{f}^{\lambda}_{n},X_{i})\bigg]>t\right)
\displaystyle\leq (fn:1ni=1n[𝔼S{gk(f,Xi)}2gk(f,Xi)]>t)\displaystyle\mathbb{P}\left(\exists f\in\mathcal{F}_{n}:\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g_{k}(f,X^{\prime}_{i})\big{\}}-2g_{k}(f,X_{i})\bigg{]}>t\right)
=\displaystyle= (fn:𝔼{gk(f,X)}1ni=1n[gk(f,Xi)]>12(t+𝔼{gk(f,X)})).\displaystyle\mathbb{P}\left(\exists f\in\mathcal{F}_{n}:\mathbb{E}\big{\{}g_{k}(f,X)\big{\}}-\frac{1}{n}\sum_{i=1}^{n}\big{[}g_{k}(f,X_{i})\big{]}>\frac{1}{2}\bigg{(}t+\mathbb{E}\big{\{}g_{k}(f,X)\big{\}}\bigg{)}\right). (B.9)

To bound the probability in (B.9), we apply Lemma 24 in Shen et al. (2022); note that the event in (B.9) is simply a rearrangement of $\mathbb{E}\{g_{k}(f,X)\}-\frac{2}{n}\sum_{i=1}^{n}g_{k}(f,X_{i})>t$. For completeness, we state the lemma below.

Lemma 32 (Lemma 24 in Shen et al. (2022))

Let \mathcal{H} be a set of functions h:d[0,B]h:\mathbb{R}^{d}\to[0,B] with B1B\geq 1. Let Z,Z1,,ZnZ,Z_{1},\ldots,Z_{n} be i.i.d. d\mathbb{R}^{d}-valued random variables. Then for each n1n\geq 1 and any s>0s>0 and 0<ϵ<10<\epsilon<1,

(suph𝔼{h(Z)}1ni=1n[h(Zi)]s+𝔼{h(Z)}+1ni=1n[h(Zi)]>ϵ)4𝒩n(sϵ16,,)exp(ϵ2sn15B),\displaystyle\mathbb{P}\left(\sup_{h\in\mathcal{H}}\ \frac{\mathbb{E}\big{\{}h(Z)\big{\}}-\frac{1}{n}\sum_{i=1}^{n}\big{[}h(Z_{i})\big{]}}{s+\mathbb{E}\big{\{}h(Z)\big{\}}+\frac{1}{n}\sum_{i=1}^{n}\big{[}h(Z_{i})\big{]}}>\epsilon\right)\leq 4\mathcal{N}_{n}\Big{(}\frac{s\epsilon}{16},\mathcal{H},\|\cdot\|_{\infty}\Big{)}\exp\Big{(}-\frac{\epsilon^{2}sn}{15B}\Big{)},

where 𝒩n(sϵ16,,)\mathcal{N}_{n}(\frac{s\epsilon}{16},\mathcal{H},\|\cdot\|_{\infty}) is the covering number of \mathcal{H} with radius sϵ/16s\epsilon/16 under the norm \|\cdot\|_{\infty}. The definition of the covering number can be found in Appendix C.

We apply Lemma 32 with ϵ=1/3,s=2t\epsilon=1/3,s=2t to the class of functions 𝒢k:={gk(f,):fn}\mathcal{G}_{k}:=\{g_{k}(f,\cdot):f\in\mathcal{F}_{n}\} for k=1,2k=1,2 to get

(fn:𝔼{g1(f,X)}1ni=1n[g1(f,Xi)]>12(t+𝔼{g1(f,X)}))\displaystyle\mathbb{P}\Big{(}\exists f\in\mathcal{F}_{n}:\mathbb{E}\big{\{}g_{1}(f,X)\big{\}}-\frac{1}{n}\sum_{i=1}^{n}\big{[}g_{1}(f,X_{i})\big{]}>\frac{1}{2}\bigg{(}t+\mathbb{E}\big{\{}g_{1}(f,X)\big{\}}\bigg{)}\Big{)}
4𝒩n(t24,𝒢1,)exp(tn2702),\displaystyle\leq 4\mathcal{N}_{n}\Big{(}\frac{t}{24},\mathcal{G}_{1},\|\cdot\|_{\infty}\Big{)}\exp\Big{(}-\frac{tn}{270\mathcal{B}^{2}}\Big{)}, (B.10)

and

(fn:𝔼{g2(f,X)}1ni=1n[g2(f,Xi)]>12(t+𝔼{g2(f,X)}))\displaystyle\mathbb{P}\Big{(}\exists f\in\mathcal{F}_{n}:\mathbb{E}\big{\{}g_{2}(f,X)\big{\}}-\frac{1}{n}\sum_{i=1}^{n}\big{[}g_{2}(f,X_{i})\big{]}>\frac{1}{2}\bigg{(}t+\mathbb{E}\big{\{}g_{2}(f,X)\big{\}}\bigg{)}\Big{)}
4𝒩n(t24,𝒢2,)exp(tn135λ¯κ).\displaystyle\leq 4\mathcal{N}_{n}\Big{(}\frac{t}{24},\mathcal{G}_{2},\|\cdot\|_{\infty}\Big{)}\exp\Big{(}-\frac{tn}{135\bar{\lambda}\kappa\mathcal{B}^{\prime}}\Big{)}. (B.11)

Combining (B.9) and (B.10), for an>1/na_{n}>1/n, we have

𝔼S(1ni=1n[𝔼S{g1(f^nλ,Xi)}2g1(f^nλ,Xi)])\displaystyle\mathbb{E}_{S}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g_{1}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big{\}}-2g_{1}(\hat{f}^{\lambda}_{n},X_{i})\bigg{]}\Big{)}
0(1ni=1n[𝔼S{g1(f^nλ,Xi)}2g1(f^nλ,Xi)]>t)𝑑t\displaystyle\leq\int_{0}^{\infty}\mathbb{P}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g_{1}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big{\}}-2g_{1}(\hat{f}^{\lambda}_{n},X_{i})\bigg{]}>t\Big{)}dt
0an1dt+an4𝒩n(t24,𝒢1,)exp(tn2702)dt\displaystyle\leq\int_{0}^{a_{n}}1dt+\int_{a_{n}}^{\infty}4\mathcal{N}_{n}\Big{(}\frac{t}{24},\mathcal{G}_{1},\|\cdot\|_{\infty}\Big{)}\exp\Big{(}-\frac{tn}{270\mathcal{B}^{2}}\Big{)}dt
an+4𝒩n(124n,𝒢1,)anexp(tn2702)dt\displaystyle\leq a_{n}+4\mathcal{N}_{n}\Big{(}\frac{1}{24n},\mathcal{G}_{1},\|\cdot\|_{\infty}\Big{)}\int_{a_{n}}^{\infty}\exp\Big{(}-\frac{tn}{270\mathcal{B}^{2}}\Big{)}dt
=an+4𝒩n(124n,𝒢1,)exp(ann2702)2702n.\displaystyle=a_{n}+4\mathcal{N}_{n}\Big{(}\frac{1}{24n},\mathcal{G}_{1},\|\cdot\|_{\infty}\Big{)}\exp\Big{(}-\frac{a_{n}n}{270\mathcal{B}^{2}}\Big{)}\frac{270\mathcal{B}^{2}}{n}.

Choosing $a_{n}=\log\{4\mathcal{N}_{n}(1/(24n),\mathcal{G}_{1},\|\cdot\|_{\infty})\}\cdot 270\mathcal{B}^{2}/n$, which makes the second term equal to $270\mathcal{B}^{2}/n$, we get

𝔼S(1ni=1n[𝔼S{g1(f^nλ,Xi)}2g1(f^nλ,Xi)])270log[4e𝒩n(1/(24n),𝒢1,)]2n.\displaystyle\mathbb{E}_{S}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g_{1}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big{\}}-2g_{1}(\hat{f}^{\lambda}_{n},X_{i})\bigg{]}\Big{)}\leq\frac{270\log[4e\mathcal{N}_{n}(1/(24n),\mathcal{G}_{1},\|\cdot\|_{\infty})]\mathcal{B}^{2}}{n}.

For any f1,f2nf_{1},f_{2}\in\mathcal{F}_{n}, by the definition of g1g_{1}, it is easy to show g1(f1,)g1(f2,)4f1f2\|g_{1}(f_{1},\cdot)-g_{1}(f_{2},\cdot)\|_{\infty}\leq 4\mathcal{B}\|f_{1}-f_{2}\|_{\infty}. Then 𝒩n(1/(24n),𝒢1,)𝒩n(1/(96n),n,)\mathcal{N}_{n}(1/(24n),\mathcal{G}_{1},\|\cdot\|_{\infty})\leq\mathcal{N}_{n}(1/(96\mathcal{B}n),\mathcal{F}_{n},\|\cdot\|_{\infty}), which leads to

\displaystyle\mathbb{E}_{S}\Big(\frac{1}{n}\sum_{i=1}^{n}\bigg[\mathbb{E}_{S^{\prime}}\big\{g_{1}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big\}-2g_{1}(\hat{f}^{\lambda}_{n},X_{i})\bigg]\Big)
270log[4e𝒩n(1/(96n),n,)]2n.\displaystyle\leq\frac{270\log[4e\mathcal{N}_{n}(1/(96\mathcal{B}n),\mathcal{F}_{n},\|\cdot\|_{\infty})]\mathcal{B}^{2}}{n}. (B.12)

Similarly, combining (B.9) and (B.11), we can obtain

𝔼S(1ni=1n[𝔼S{g2(f^nλ,Xi)}2g2(f^nλ,Xi)])135λ¯κlog[4e𝒩n(1/(24n),𝒢2,)]n.\displaystyle\mathbb{E}_{S}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g_{2}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big{\}}-2g_{2}(\hat{f}^{\lambda}_{n},X_{i})\bigg{]}\Big{)}\leq\frac{135\bar{\lambda}\kappa\log[4e\mathcal{N}_{n}(1/(24n),\mathcal{G}_{2},\|\cdot\|_{\infty})]\mathcal{B}^{\prime}}{n}.

For any $f_{1},f_{2}\in\mathcal{F}_{n}$, by the definition of $g_{2}$, it can be shown that $\|g_{2}(f_{1},\cdot)-g_{2}(f_{2},\cdot)\|_{\infty}\leq\frac{\kappa}{d}\sum_{j=1}^{d}\lambda_{j}\|\frac{\partial}{\partial x_{j}}f_{1}-\frac{\partial}{\partial x_{j}}f_{2}\|_{\infty}$. Recall that $\mathcal{F}_{nj}^{\prime}=\{\frac{\partial}{\partial x_{j}}f:f\in\mathcal{F}_{n}\}$ for $j=1,\ldots,d$. Then $\mathcal{N}_{n}(1/(24n),\mathcal{G}_{2},\|\cdot\|_{\infty})\leq\Pi_{j=1}^{d}\mathcal{N}_{n}(1/(24\kappa\lambda_{j}n),\mathcal{F}_{nj}^{\prime},\|\cdot\|_{\infty})$, where the radius $1/(24\kappa\lambda_{j}n)$ is interpreted as $\infty$ if $\lambda_{j}=0$. This leads to

\displaystyle\mathbb{E}_{S}\Big(\frac{1}{n}\sum_{i=1}^{n}\bigg[\mathbb{E}_{S^{\prime}}\big\{g_{2}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big\}-2g_{2}(\hat{f}^{\lambda}_{n},X_{i})\bigg]\Big)
135λ¯κlog[4eΠj=1d𝒩n(1/(24κλjn),nj,)]n.\displaystyle\leq\frac{135\bar{\lambda}\kappa\log[4e\Pi_{j=1}^{d}\mathcal{N}_{n}(1/(24\kappa\lambda_{j}n),\mathcal{F}^{\prime}_{nj},\|\cdot\|_{\infty})]\mathcal{B}^{\prime}}{n}. (B.13)

Then by Lemma 39 in Appendix C, we can further bound the covering numbers in terms of the pseudo dimension. More precisely, for $n\geq{\rm Pdim}(\mathcal{F}_{n})$ and any $\delta>0$, we have

log(𝒩n(δ,n,))Pdim(n)log(enδPdim(n)),\displaystyle\log(\mathcal{N}_{n}(\delta,\mathcal{F}_{n},\|\cdot\|_{\infty}))\leq{\rm Pdim}(\mathcal{F}_{n})\log\Big{(}\frac{en\mathcal{B}}{\delta{\rm Pdim}(\mathcal{F}_{n})}\Big{)},

and for $n\geq{\rm Pdim}(\mathcal{F}^{\prime}_{nj})$, $j=1,\ldots,d$, and any $\delta>0$, we have

log(𝒩n(δ,nj,))Pdim(nj)log(enδPdim(nj)).\displaystyle\log(\mathcal{N}_{n}(\delta,\mathcal{F}^{\prime}_{nj},\|\cdot\|_{\infty}))\leq{\rm Pdim}(\mathcal{F}^{\prime}_{nj})\log\Big{(}\frac{en\mathcal{B}^{\prime}}{\delta{\rm Pdim}(\mathcal{F}^{\prime}_{nj})}\Big{)}.

By Theorem 1 we know Pdim(nj)=Pdim(n){\rm Pdim}(\mathcal{F}^{\prime}_{nj})={\rm Pdim}(\mathcal{F}^{\prime}_{n}) for j=1,,dj=1,\ldots,d. Combining the upper bounds of the covering numbers, we have

𝔼{λ(f^nλ)2nλ(f^nλ)+λ(f0)}c0[3Pdim(n)+d(κλ¯)2Pdim(n)]log(n)n,\displaystyle\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-2\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}(f_{0})\}\leq c_{0}\frac{\big{[}\mathcal{B}^{3}{\rm Pdim}(\mathcal{F}_{n})+d(\kappa\bar{\lambda}\mathcal{B}^{\prime})^{2}{\rm Pdim}(\mathcal{F}_{n}^{\prime})\big{]}\log(n)}{n},

for nmax{Pdim(n),Pdim(n)}n\geq\max\{{\rm Pdim}(\mathcal{F}_{n}),{\rm Pdim}(\mathcal{F}^{\prime}_{n})\} and some universal constant c0>0c_{0}>0 where λ¯=j=1dλj/d\bar{\lambda}=\sum_{j=1}^{d}\lambda_{j}/d. By Lemma 2, for the function class n\mathcal{F}_{n} implemented by Mixed RePU activated multilayer perceptrons with depth no more than 𝒟\mathcal{D}, width no more than 𝒲\mathcal{W}, number of neurons (nodes) no more than 𝒰\mathcal{U} and size or number of parameters (weights and bias) no more than 𝒮\mathcal{S}, we have

Pdim(n)3p𝒟𝒮(𝒟+log2𝒰),\displaystyle{\rm Pdim}(\mathcal{F}_{n})\leq 3p\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U}),

and by Theorem 1 together with Lemma 2, for any function $f\in\mathcal{F}_{n}$, its partial derivative $\frac{\partial}{\partial x_{j}}f$ can be implemented by a Mixed RePU activated multilayer perceptron with depth $3\mathcal{D}+3$, width $6\mathcal{W}$, number of neurons $13\mathcal{U}$, number of parameters $23\mathcal{S}$ and bound $\mathcal{B}^{\prime}$. Then

Pdim(n)2484p𝒟𝒮(𝒟+log2𝒰).\displaystyle{\rm Pdim}(\mathcal{F}^{\prime}_{n})\leq 2484p\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U}).

It follows that

\displaystyle\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-2\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}(f_{0})\}\leq c_{1}p\big(\mathcal{B}^{3}+d(\kappa\bar{\lambda}\mathcal{B}^{\prime})^{2}\big)\frac{\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})\log(n)}{n},

for nmax{Pdim(n),Pdim(n)}n\geq\max\{{\rm Pdim}(\mathcal{F}_{n}),{\rm Pdim}(\mathcal{F}^{\prime}_{n})\} and some universal constant c1>0c_{1}>0. This completes the proof. \hfill\Box

Proof of Lemma 38

Recall that

inffn[λ(f)λ(f0)]\displaystyle\inf_{f\in\mathcal{F}_{n}}\Big{[}\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})\Big{]}
\displaystyle=\inf_{f\in\mathcal{F}_{n}}\mathbb{E}\Bigg[|f(X)-f_{0}(X)|^{2}+\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\Big\{\rho\Big(\frac{\partial}{\partial x_{j}}f(X)\Big)-\rho\Big(\frac{\partial}{\partial x_{j}}f_{0}(X)\Big)\Big\}\Bigg]
\displaystyle\leq\inf_{f\in\mathcal{F}_{n}}\mathbb{E}\Bigg[|f(X)-f_{0}(X)|^{2}+\frac{\kappa}{d}\sum_{j=1}^{d}\lambda_{j}\Big|\frac{\partial}{\partial x_{j}}f(X)-\frac{\partial}{\partial x_{j}}f_{0}(X)\Big|\Bigg].

By Theorem 5, for each $N\in\mathbb{N}^{+}$, there exists a RePU network $\phi_{N}\in\mathcal{F}_{n}$ with $2N-1$ hidden layers, no more than $15N^{d}$ neurons, no more than $24N^{d}$ parameters and width no more than $12N^{d-1}$ such that for each multi-index $\alpha\in\mathbb{N}^{d}_{0}$ with $|\alpha|_{1}\leq\min\{s,N\}$, we have

\sup_{\mathcal{X}}|D^{\alpha}(f_{0}-\phi_{N})|\leq C(s,d,\mathcal{X})\times N^{-(s-|\alpha|_{1})}\|f_{0}\|_{C^{|\alpha|_{1}}},

where C(s,d,𝒳)C(s,d,\mathcal{X}) is a positive constant depending only on d,sd,s and the diameter of 𝒳\mathcal{X}. This implies

\sup_{\mathcal{X}}|f_{0}-\phi_{N}|\leq C(s,d,\mathcal{X})\times N^{-s}\|f_{0}\|_{C^{0}},

and for j=1,,dj=1,\ldots,d

\sup_{\mathcal{X}}\Big|\frac{\partial}{\partial x_{j}}(f_{0}-\phi_{N})\Big|\leq C(s,d,\mathcal{X})\times N^{-(s-1)}\|f_{0}\|_{C^{1}}.

Combining the above two uniform bounds, we have

inffn[λ(f)λ(f0)]\displaystyle\inf_{f\in\mathcal{F}_{n}}\Big{[}\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})\Big{]}
\displaystyle\leq\mathbb{E}_{X}\Big\{|\phi_{N}(X)-f_{0}(X)|^{2}+\frac{\kappa}{d}\sum_{j=1}^{d}\lambda_{j}\Big|\frac{\partial}{\partial x_{j}}\phi_{N}(X)-\frac{\partial}{\partial x_{j}}f_{0}(X)\Big|\Big\}
\displaystyle\leq C(s,d,\mathcal{X})^{2}\times N^{-2s}\|f_{0}\|^{2}_{C^{0}}+\kappa\bar{\lambda}C(s,d,\mathcal{X})\times N^{-(s-1)}\|f_{0}\|_{C^{1}}
\displaystyle\leq C_{1}(s,d,\mathcal{X})(1+\kappa\bar{\lambda})N^{-(s-1)}\|f_{0}\|^{2}_{C^{1}},

where $C_{1}(s,d,\mathcal{X})=\max\{[C(s,d,\mathcal{X})]^{2},C(s,d,\mathcal{X})\}$ is also a constant depending only on $s,d$ and $\mathcal{X}$. By taking the network depth $\mathcal{D}$ to be a positive odd number and expressing the network width $\mathcal{W}$, number of neurons $\mathcal{U}$ and size $\mathcal{S}$ in terms of $\mathcal{D}$, one can obtain the approximation error bound in terms of $\mathcal{U}$. This completes the proof. \hfill\Box

Appendix C Definitions and Supporting Lemmas

C.1 Definitions

The following definitions are used in the proofs.

Definition 33 (Covering number)

Let $\mathcal{F}$ be a class of functions from $\mathcal{X}$ to $\mathbb{R}$. For a given sequence $x=(x_{1},\ldots,x_{n})\in\mathcal{X}^{n}$, let $\mathcal{F}_{n}|_{x}=\{(f(x_{1}),\ldots,f(x_{n})):f\in\mathcal{F}_{n}\}$ be the corresponding subset of $\mathbb{R}^{n}$. For a positive number $\delta$, let $\mathcal{N}(\delta,\mathcal{F}_{n}|_{x},\|\cdot\|_{\infty})$ be the covering number of $\mathcal{F}_{n}|_{x}$ under the norm $\|\cdot\|_{\infty}$ with radius $\delta$. Define the uniform covering number $\mathcal{N}_{n}(\delta,\mathcal{F}_{n},\|\cdot\|_{\infty})$ to be the maximum over all $x\in\mathcal{X}^{n}$ of the covering number $\mathcal{N}(\delta,\mathcal{F}_{n}|_{x},\|\cdot\|_{\infty})$, i.e.,

𝒩n(δ,n,)=max{𝒩(δ,n|x,):x𝒳n}.\mathcal{N}_{n}(\delta,\mathcal{F}_{n},\|\cdot\|_{\infty})=\max\{\mathcal{N}(\delta,\mathcal{F}_{n}|_{x},\|\cdot\|_{\infty}):x\in\mathcal{X}^{n}\}. (C.1)
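For a concrete (finite) illustration of this definition, the following sketch, assuming NumPy, computes the size of a greedy $\delta$-cover in the sup norm for a finite collection of functions evaluated at a fixed $x$; the greedy cover is valid but need not be minimal, so the returned size is only an upper bound on the covering number of that finite collection.

```python
import numpy as np

def greedy_sup_norm_cover_size(values, delta):
    # values: array of shape (m, n); row k is (f_k(x_1), ..., f_k(x_n)) for m functions in F.
    centers = []
    for v in values:
        # add v as a new center only if it is not already within delta of an existing center
        if not any(np.max(np.abs(v - c)) <= delta for c in centers):
            centers.append(v)
    return len(centers)
```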
Definition 34 (Shattering)

Let $\mathcal{F}$ be a family of functions from a set $\mathcal{Z}$ to $\mathbb{R}$. A set $\{z_{1},\ldots,z_{n}\}\subset\mathcal{Z}$ is said to be shattered by $\mathcal{F}$ if there exist $t_{1},\ldots,t_{n}\in\mathbb{R}$ such that

|{[sgn(f(z1)t1)sgn(f(zn)tn)]:f}|=2n,\displaystyle\Big{|}\Big{\{}\Big{[}\begin{array}[]{lr}{\rm sgn}(f(z_{1})-t_{1})\\ \ldots\\ {\rm sgn}(f(z_{n})-t_{n})\\ \end{array}\Big{]}:f\in\mathcal{F}\Big{\}}\Big{|}=2^{n},

where ${\rm sgn}$ is the sign function, which returns $+1$ or $-1$, and $|\cdot|$ denotes the cardinality of a set. When they exist, the threshold values $t_{1},\ldots,t_{n}$ are said to witness the shattering.

Definition 35 (Pseudo dimension)

Let \mathcal{F} be a family of functions mapping from 𝒵\mathcal{Z} to \mathbb{R}. Then, the pseudo dimension of \mathcal{F}, denoted by Pdim(){\rm Pdim}(\mathcal{F}), is the size of the largest set shattered by \mathcal{F}.

Definition 36 (VC dimension)

Let $\mathcal{F}$ be a family of functions mapping from $\mathcal{Z}$ to $\mathbb{R}$. Then, the Vapnik–Chervonenkis (VC) dimension of $\mathcal{F}$, denoted by ${\rm VCdim}(\mathcal{F})$, is the size of the largest set shattered by $\mathcal{F}$ with all threshold values being zero, i.e., $t_{1}=\cdots=t_{n}=0$.

C.2 Supporting Lemmas

Lemma 37 (Stochastic error bound)

Suppose Assumptions 16 and 17 hold. Let $\mathcal{F}_{n}=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}}$ be a class of RePU $\sigma_{p}$ activated multilayer perceptrons and let $\mathcal{F}^{\prime}_{n}=\{\frac{\partial}{\partial x_{1}}f:f\in\mathcal{F}_{n}\}$ denote the class of partial derivatives of $f\in\mathcal{F}_{n}$ with respect to the first argument. Then for $n\geq\max\{{\rm Pdim}(\mathcal{F}_{n}),{\rm Pdim}(\mathcal{F}^{\prime}_{n})\}$, the stochastic error satisfies

𝔼{λ(f^nλ)2nλ(f^nλ)+λ(f0)}c1p{3+d(κλ¯)2}𝒟𝒮(𝒟+log2𝒰)log(n)n,\displaystyle\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-2\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}(f_{0})\}\leq c_{1}p\big{\{}\mathcal{B}^{3}+d(\kappa\bar{\lambda}\mathcal{B}^{\prime})^{2}\big{\}}\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})\frac{\log(n)}{n},

for some universal constant c1>0,c_{1}>0, where λ¯:=j=1dλj/d\bar{\lambda}:=\sum_{j=1}^{d}\lambda_{j}/d.

Lemma 38 (Approximation error bound)

Suppose that the target function f0f_{0} defined in (4) belongs to CsC^{s} for some s+s\in\mathbb{N}^{+}. For any positive odd number 𝒟\mathcal{D}, let n:=𝒟,𝒲,𝒰,𝒮,,\mathcal{F}_{n}:=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}} be the class of RePU activated neural networks f:𝒳df:\mathcal{X}\to\mathbb{R}^{d} with depth 𝒟\mathcal{D}, width 𝒲=18pd[(𝒟+1)/2]d1\mathcal{W}=18pd[(\mathcal{D}+1)/2]^{d-1}, number of neurons 𝒰=18pd[(𝒟+1)/2]d\mathcal{U}=18pd[(\mathcal{D}+1)/2]^{d} and size 𝒮=67pd[(𝒟+1)/2]d\mathcal{S}=67pd[(\mathcal{D}+1)/2]^{d}, satisfying f0C0\mathcal{B}\geq\|f_{0}\|_{C^{0}} and f0C1\mathcal{B}^{\prime}\geq\|f_{0}\|_{C^{1}}. Then the approximation error given in Lemma 18 satisfies

inffn[λ(f)λ(f0)]C(1+κλ¯)𝒰(s1)/df0C12,\inf_{f\in\mathcal{F}_{n}}\Big{[}\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})\Big{]}\leq C(1+\kappa\bar{\lambda})\mathcal{U}^{-(s-1)/d}\|f_{0}\|^{2}_{C^{1}},

where $\bar{\lambda}:=\sum_{j=1}^{d}\lambda_{j}/d$, $\kappa$ is the Lipschitz constant of the penalty function $\rho$, and $C>0$ is a constant depending only on $d,s$ and the diameter of the support $\mathcal{X}$.

The following lemma gives an upper bound for the covering number in terms of the pseudo-dimension.

Lemma 39 (Theorem 12.2 in Anthony and Bartlett (1999))

Let $\mathcal{F}$ be a set of real-valued functions from a domain $\mathcal{Z}$ to the bounded interval $[0,B]$. Let $\delta>0$ and suppose that $\mathcal{F}$ has finite pseudo-dimension ${\rm Pdim}(\mathcal{F})$. Then

𝒩n(δ,,)i=1Pdim()(ni)(Bδ)i,\displaystyle\mathcal{N}_{n}(\delta,\mathcal{F},\|\cdot\|_{\infty})\leq\sum_{i=1}^{{\rm Pdim}(\mathcal{F})}\binom{n}{i}\Big{(}\frac{B}{\delta}\Big{)}^{i},

which is less than {enB/(δPdim())}Pdim()\{enB/(\delta{\rm Pdim}(\mathcal{F}))\}^{{\rm Pdim}(\mathcal{F})} for nPdim()n\geq{\rm Pdim}(\mathcal{F}).

The following lemma presents basic approximation properties of RePU network on monomials.

Lemma 40 (Lemma 1 in Li et al. (2019))

The monomials $x^{N}$, $0\leq N\leq p$, can be exactly represented by a RePU ($\sigma_{p}$, $p\geq 2$) activated neural network with one hidden layer and no more than $2p$ nodes. More precisely,

  • (i)

    If $N=0$, the monomial $x^{N}$ can be computed by a RePU $\sigma_{p}$ activated network with one hidden layer and one node as

    1=x0=σp(0x+1).1=x^{0}=\sigma_{p}(0\cdot x+1).
  • (ii)

    If $N=p$, the monomial $x^{N}$ can be computed by a RePU $\sigma_{p}$ activated network with one hidden layer and 2 nodes (a numerical check of this identity is sketched after the list) as

    xN=W1σp(W0x),W1=[1(1)p],W0=[11].x^{N}=W_{1}\sigma_{p}(W_{0}x),\qquad W_{1}=\left[\begin{array}[]{c}1\\ (-1)^{p}\end{array}\right],W_{0}=\left[\begin{array}[]{c}1\\ -1\end{array}\right].
  • (iii)

    If 1Np1\leq N\leq p, the monomial xNx^{N} can be computed by a RePU σp\sigma_{p} activated network with one hidden layer and no more than 2p2p nodes. More generally, a polynomial of degree no more than pp, i.e. k=0pakxk\sum_{k=0}^{p}a_{k}x^{k}, can also be computed by a RePU σp\sigma_{p} activated network with one hidden layer and no more than 2p2p nodes as

    \sum_{k=0}^{p}a_{k}x^{k}=W_{1}^{\top}\sigma_{p}(W_{0}x+b_{0})+u_{0},

    where

    W0=[1111]2p×1,b0=[t1t1tptp]2p×1,W1=[u1(1)pu1up(1)pup]2p×1.W_{0}=\left[\begin{array}[]{c}1\\ -1\\ \vdots\\ 1\\ -1\end{array}\right]\in\mathbb{R}^{2p\times 1},\ \ b_{0}=\left[\begin{array}[]{c}t_{1}\\ -t_{1}\\ \vdots\\ t_{p}\\ -t_{p}\end{array}\right]\in\mathbb{R}^{2p\times 1},\ \ W_{1}=\left[\begin{array}[]{c}u_{1}\\ (-1)^{p}u_{1}\\ \vdots\\ u_{p}\\ (-1)^{p}u_{p}\end{array}\right]\in\mathbb{R}^{2p\times 1}.

    Here t1,,tpt_{1},\ldots,t_{p} are distinct values in \mathbb{R} and values of u0,,upu_{0},\ldots,u_{p} satisfy the linear system

    [1110t1pit2pitppi0t1p1t2p1tpp10t1pt2ptpp1][u1uiupu0]=[ap(Cpp)1ai(Cpi)1a1(Cp1)1a0(Cp0)1],\left[\begin{array}[]{ccccc}1&1&\cdots&1&0\\ \vdots&\vdots&&\vdots&\vdots\\ t_{1}^{p-i}&t_{2}^{p-i}&\cdots&t_{p}^{p-i}&0\\ \vdots&\vdots&&\vdots&\vdots\\ t_{1}^{p-1}&t_{2}^{p-1}&\cdots&t_{p}^{p-1}&0\\ t_{1}^{p}&t_{2}^{p}&\cdots&t_{p}^{p}&1\end{array}\right]\left[\begin{array}[]{c}u_{1}\\ \vdots\\ u_{i}\\ \vdots\\ u_{p}\\ u_{0}\end{array}\right]=\left[\begin{array}[]{c}a_{p}(C^{p}_{p})^{-1}\\ \vdots\\ a_{i}(C^{i}_{p})^{-1}\\ \vdots\\ a_{1}(C^{1}_{p})^{-1}\\ a_{0}(C^{0}_{p})^{-1}\end{array}\right],

    where Cpi,i=0,,pC^{i}_{p},i=0,\ldots,p are binomial coefficients. Note that the top-left p×pp\times p sub-matrix of the (p+1)×(p+1)(p+1)\times(p+1) matrix above is a Vandermonde matrix, which is invertible as long as t1,,tpt_{1},\ldots,t_{p} are distinct.
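The identity in part (ii) is easy to verify numerically. A minimal sketch, assuming NumPy; the helper name `repu` is ours.

```python
import numpy as np

def repu(x, p):
    # RePU activation: sigma_p(x) = max(0, x)^p
    return np.maximum(x, 0.0) ** p

# Part (ii): with W_0 = [1, -1]^T and W_1 = [1, (-1)^p]^T,
#   x^p = sigma_p(x) + (-1)^p * sigma_p(-x)   for all real x.
p = 3
x = np.linspace(-2.0, 2.0, 9)
print(np.allclose(x ** p, repu(x, p) + (-1) ** p * repu(-x, p)))   # prints True
```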

References

  • Abdeljawad and Grohs (2022) Ahmed Abdeljawad and Philipp Grohs. Approximations with deep neural networks in Sobolev time-space. Analysis and Applications, 20(03):499–541, 2022.
  • Ali and Nouy (2021) Mazen Ali and Anthony Nouy. Approximation of smoothness classes by deep rectifier networks. SIAM Journal on Numerical Analysis, 59(6):3032–3051, 2021.
  • Anthony and Bartlett (1999) Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999. ISBN 0-521-57353-X. doi: 10.1017/CBO9780511624216. URL https://doi.org/10.1017/CBO9780511624216.
  • Bagby et al. (2002) Thomas Bagby, Len Bos, and Norman Levenberg. Multivariate simultaneous approximation. Constructive approximation, 18(4):569–577, 2002.
  • Baraniuk and Wakin (2009) Richard G. Baraniuk and Michael B. Wakin. Random projections of smooth manifolds. Found. Comput. Math., 9(1):51–77, 2009. ISSN 1615-3375. doi: 10.1007/s10208-007-9011-z. URL https://doi.org/10.1007/s10208-007-9011-z.
  • Barlow et al. (1972) R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical Inference under Order Restrictions; the Theory and Application of Isotonic Regression. New York: Wiley, 1972.
  • Bartlett et al. (1998) Peter Bartlett, Vitaly Maiorov, and Ron Meir. Almost linear VC dimension bounds for piecewise polynomial networks. Advances in neural information processing systems, 11, 1998.
  • Bartlett et al. (2017) Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/b22b257ad0519d4500539da3c8bcf4dd-Paper.pdf.
  • Bartlett et al. (2019) Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20:Paper No. 63, 17, 2019. ISSN 1532-4435.
  • Bauer and Kohler (2019) Benedikt Bauer and Michael Kohler. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Ann. Statist., 47(4):2261–2285, 2019. ISSN 0090-5364. doi: 10.1214/18-AOS1747. URL https://doi.org/10.1214/18-AOS1747.
  • Belkin and Niyogi (2003) Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373–1396, 2003.
  • Bellec (2018) Pierre C Bellec. Sharp oracle inequalities for least squares estimators in shape restricted regression. The Annals of Statistics, 46(2):745–780, 2018.
  • Belomestny et al. (2022) Denis Belomestny, Alexey Naumov, Nikita Puchkin, and Sergey Samsonov. Simultaneous approximation of a smooth function and its derivatives by deep neural networks with piecewise-polynomial activations. arXiv:2206.09527, 2022.
  • Block et al. (2020) Adam Block, Youssef Mroueh, and Alexander Rakhlin. Generative modeling with denoising auto-encoders and langevin sampling. arXiv:2002.00107, 2020.
  • Chatterjee and Lafferty (2019) Sabyasachi Chatterjee and John Lafferty. Adaptive risk bounds in unimodal regression. Bernoulli, 25(1):1–25, 2019.
  • Chatterjee et al. (2015) Sabyasachi Chatterjee, Adityanand Guntuboyina, and Bodhisattva Sen. On risk bounds in isotonic and other shape restricted regression problems. The Annals of Statistics, 43(4):1774–1800, 2015.
  • Chatterjee et al. (2018) Sabyasachi Chatterjee, Adityanand Guntuboyina, and Bodhisattva Sen. On matrix estimation under monotonicity constraints. Bernoulli, 24(2):1072–1100, 2018.
  • Chen et al. (2019) Minshuo Chen, Haoming Jiang, and Tuo Zhao. Efficient approximation of deep ReLU networks for functions on low dimensional manifolds. Advances in Neural Information Processing Systems, 2019.
  • Chen et al. (2022) Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao. Nonparametric regression on low-dimensional manifolds using deep ReLU networks: Function approximation and statistical recovery. Information and Inference: A Journal of the IMA, 11(4):1203–1253, 2022.
  • Chen et al. (2020) Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. arXiv:2009.00713, 2020.
  • Chui and Li (1993) Charles K Chui and Xin Li. Realization of neural networks with one hidden layer. In Multivariate approximation: From CAGD to wavelets, pages 77–89. World Scientific, 1993.
  • Chui et al. (1994) Charles K Chui, Xin Li, and Hrushikesh Narhar Mhaskar. Neural networks for localized approximation. Mathematics of Computation, 63(208):607–623, 1994.
  • Deng and Zhang (2020) Hang Deng and Cun-Hui Zhang. Isotonic regression in multi-dimensional spaces and graphs. The Annals of Statistics, 48(6):3672–3698, 2020.
  • Dhariwal and Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  • Diggle et al. (1999) Peter Diggle, Sara Morris, and Tony Morton-Jones. Case-control isotonic regression for investigation of elevation in risk around a point source. Statistics in medicine, 18(13):1605–1613, 1999.
  • Duan et al. (2021) Chenguang Duan, Yuling Jiao, Yanming Lai, Xiliang Lu, and Zhijian Yang. Convergence rate analysis for deep ritz method. arXiv preprint arXiv:2103.13330, 2021.
  • Durot (2002) Cécile Durot. Sharp asymptotics for isotonic regression. Probability theory and related fields, 122(2):222–240, 2002.
  • Durot (2007) Cécile Durot. On the $l_p$-error of monotonicity constrained estimators. The Annals of Statistics, 35(3):1080–1104, 2007.
  • Durot (2008) Cécile Durot. Monotone nonparametric regression with random design. Mathematical methods of statistics, 17(4):327–341, 2008.
  • Dykstra (1983) Richard L Dykstra. An algorithm for restricted least squares regression. Journal of the American Statistical Association, 78(384):837–842, 1983.
  • Fefferman (2006) Charles Fefferman. Whitney’s extension problem for $C^m$. Annals of Mathematics, 164(1):313–359, 2006. ISSN 0003486X. URL http://www.jstor.org/stable/20159991.
  • Fefferman et al. (2016) Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.
  • Fokianos et al. (2020) Konstantinos Fokianos, Anne Leucht, and Michael H Neumann. On integrated $l_1$ convergence rate of an isotonic regression estimator for multivariate observations. IEEE Transactions on Information Theory, 66(10):6389–6402, 2020.
  • Gao et al. (2017) Chao Gao, Fang Han, and Cun-Hui Zhang. Minimax risk bounds for piecewise constant models. arXiv preprint arXiv:1705.06386, 2017.
  • Gao et al. (2019) Yuan Gao, Yuling Jiao, Yang Wang, Yao Wang, Can Yang, and Shunkang Zhang. Deep generative learning via variational gradient flow. In International Conference on Machine Learning, pages 2093–2101. PMLR, 2019.
  • Gao et al. (2022) Yuan Gao, Jian Huang, Yuling Jiao, Jin Liu, Xiliang Lu, and Zhijian Yang. Deep generative learning via euler particle transport. In Mathematical and Scientific Machine Learning, pages 336–368. PMLR, 2022.
  • Groeneboom and Jongbloed (2014) Piet Groeneboom and Geurt Jongbloed. Nonparametric estimation under shape constraints, volume 38. Cambridge University Press, 2014.
  • Gühring and Raslan (2021) Ingo Gühring and Mones Raslan. Approximation rates for neural networks with encodable weights in smoothness spaces. Neural Networks, 134:107–130, 2021.
  • Han et al. (2019) Qiyang Han, Tengyao Wang, Sabyasachi Chatterjee, and Richard J Samworth. Isotonic regression in general dimensions. The Annals of Statistics, 47(5):2440–2471, 2019.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Ho et al. (2022) Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23:47–1, 2022.
  • Hoffmann et al. (2009) Heiko Hoffmann, Stefan Schaal, and Sethu Vijayakumar. Local dimensionality reduction for non-parametric regression. Neural Processing Letters, 29(2):109, 2009.
  • Hon and Yang (2022) Sean Hon and Haizhao Yang. Simultaneous neural network approximation for smooth functions. Neural Networks, 154:152–164, 2022.
  • Hörmander (2015) Lars Hörmander. The analysis of linear partial differential operators I: Distribution theory and Fourier analysis. Springer, 2015.
  • Horner (1819) William George Horner. A new method of solving numerical equations of all orders, by continuous approximation. Philosophical Transactions of the Royal Society of London, (109):308–335, 1819.
  • Horowitz and Lee (2017) Joel L Horowitz and Sokbae Lee. Nonparametric estimation and inference under shape restrictions. Journal of Econometrics, 201(1):108–126, 2017.
  • Hyvärinen and Dayan (2005) Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
  • Jalal et al. (2021) Ajil Jalal, Marius Arvinte, Giannis Daras, Eric Price, Alexandros G Dimakis, and Jon Tamir. Robust compressed sensing MRI with deep generative priors. Advances in Neural Information Processing Systems, 34:14938–14954, 2021.
  • Jiang et al. (2011) Xiaoqian Jiang, Melanie Osl, Jihoon Kim, and Lucila Ohno-Machado. Smooth isotonic regression: A new method to calibrate predictive models. AMIA Summits on Translational Science Proceedings, 2011:16, 2011.
  • Jiao et al. (2023) Yuling Jiao, Guohao Shen, Yuanyuan Lin, and Jian Huang. Deep nonparametric regression on approximate manifolds: Nonasymptotic error bounds with polynomial prefactors. The Annals of Statistics, 51(2):691–716, 2023.
  • Kim et al. (2018) Arlene KH Kim, Adityanand Guntuboyina, and Richard J Samworth. Adaptation in log-concave density estimation. The Annals of Statistics, 46(5):2279–2306, 2018.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Klusowski and Barron (2018) Jason M Klusowski and Andrew R Barron. Approximation by combinations of ReLU and squared ReLU ridge functions with $\ell^1$ and $\ell^0$ controls. IEEE Transactions on Information Theory, 64(12):7649–7656, 2018.
  • Kong et al. (2020) Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. arXiv:2009.09761, 2020.
  • Kyng et al. (2015) Rasmus Kyng, Anup Rao, and Sushant Sachdeva. Fast, provable algorithms for isotonic regression in all $\ell_p$-norms. Advances in neural information processing systems, 28, 2015.
  • Lee et al. (2022) Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence for score-based generative modeling with polynomial complexity. arXiv:2206.06227, 2022.
  • Li et al. (2019) Bo Li, Shanshan Tang, and Haijun Yu. Better approximations of high dimensional smooth functions by deep neural networks with rectified power units. arXiv preprint arXiv:1903.05858, 2019.
  • Li et al. (2020) Bo Li, Shanshan Tang, and Haijun Yu. Powernet: Efficient representations of polynomials and smooth functions by deep neural networks with rectified power units. J. Math. Study, 53(2):159–191, 2020.
  • Li and Turner (2017) Yingzhen Li and Richard E Turner. Gradient estimators for implicit models. arXiv:1705.07107, 2017.
  • Liu et al. (2016) Qiang Liu, Jason Lee, and Michael Jordan. A kernelized stein discrepancy for goodness-of-fit tests. In International conference on machine learning, pages 276–284. PMLR, 2016.
  • Lu et al. (2021) Jianfeng Lu, Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021.
  • Luss et al. (2012) Ronny Luss, Saharon Rosset, and Moni Shahar. Efficient regularized isotonic regression with application to gene–gene interaction search. The Annals of Applied Statistics, 6(1):253–283, 2012.
  • Mhaskar (1993) Hrushikesh Narhar Mhaskar. Approximation properties of a multilayered feedforward artificial neural network. Advances in Computational Mathematics, 1(1):61–80, 1993.
  • Mittal et al. (2021) Gautam Mittal, Jesse Engel, Curtis Hawthorne, and Ian Simon. Symbolic music generation with diffusion models. arXiv:2103.16091, 2021.
  • Mohri et al. (2018) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018.
  • Morton-Jones et al. (2000) Tony Morton-Jones, Peter Diggle, Louise Parker, Heather O Dickinson, and Keith Binks. Additive isotonic regression models in epidemiology. Statistics in medicine, 19(6):849–859, 2000.
  • Nagarajan and Kolter (2019) Vaishnavh Nagarajan and J Zico Kolter. Deterministic pac-bayesian generalization bounds for deep networks via generalizing noise-resilience. arXiv preprint arXiv:1905.13344, 2019.
  • Neyshabur et al. (2015) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on learning theory, pages 1376–1401. PMLR, 2015.
  • Petersen and Voigtlaender (2018) Philipp Petersen and Felix Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
  • Picard (1976) Jean-Claude Picard. Maximal closure of a graph and applications to combinatorial problems. Management science, 22(11):1268–1272, 1976.
  • Popov et al. (2021) Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. Grad-tts: A diffusion probabilistic model for text-to-speech. In International Conference on Machine Learning, pages 8599–8608. PMLR, 2021.
  • Qin et al. (2014) Jing Qin, Tanya P Garcia, Yanyuan Ma, Ming-Xin Tang, Karen Marder, and Yuanjia Wang. Combining isotonic regression and em algorithm to predict genetic risk under monotonicity constraint. The annals of applied statistics, 8(2):1182, 2014.
  • Robertson et al. (1988) T. Robertson, F. T. Wright, and R. L. Dykstra. Order Restricted Statistical Inference. New York: Wiley, 1988.
  • Rueda et al. (2009) Cristina Rueda, Miguel A Fernández, and Shyamal Das Peddada. Estimation of parameters subject to order restrictions on a circle with application to estimation of phase angles of cell cycle genes. Journal of the American Statistical Association, 104(485):338–347, 2009.
  • Sasaki et al. (2014) Hiroaki Sasaki, Aapo Hyvärinen, and Masashi Sugiyama. Clustering via mode seeking by direct estimation of the gradient of a log-density. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 19–34. Springer, 2014.
  • Schmidt-Hieber (2019) Johannes Schmidt-Hieber. Deep ReLU network approximation of functions on a manifold. arXiv:1908.00695, 2019.
  • Schmidt-Hieber (2020) Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. Annals of Statistics, 48(4):1875–1897, 2020.
  • Shen et al. (2022) Guohao Shen, Yuling Jiao, Yuanyuan Lin, Joel L Horowitz, and Jian Huang. Estimation of non-crossing quantile regression process with deep ReQU neural networks. arXiv:2207.10442, 2022.
  • Shen et al. (2020) Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons. Commun. Comput. Phys., 28(5):1768–1811, 2020. ISSN 1815-2406. doi: 10.4208/cicp.oa-2020-0149. URL https://doi.org/10.4208/cicp.oa-2020-0149.
  • Shi et al. (2018) Jiaxin Shi, Shengyang Sun, and Jun Zhu. A spectral approach to gradient estimation for implicit distributions. In International Conference on Machine Learning, pages 4644–4653. PMLR, 2018.
  • Siegel and Xu (2022) Jonathan W Siegel and Jinchao Xu. High-order approximation rates for shallow neural networks with cosine and ReLU$^k$ activation functions. Applied and Computational Harmonic Analysis, 58:1–26, 2022.
  • Song and Ermon (2019) Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
  • Song and Ermon (2020) Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020.
  • Song et al. (2020) Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pages 574–584. PMLR, 2020.
  • Song et al. (2021) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS.
  • Spouge et al. (2003) J Spouge, H Wan, and WJ Wilbur. Least squares isotonic regression in two dimensions. Journal of Optimization Theory and Applications, 117(3):585–605, 2003.
  • Sriperumbudur et al. (2017) Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, and Revant Kumar. Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research, 2017.
  • Stone (1982) Charles J Stone. Optimal global rates of convergence for nonparametric regression. The annals of statistics, pages 1040–1053, 1982.
  • Stout (2015) Quentin F Stout. Isotonic regression for multiple independent variables. Algorithmica, 71(2):450–470, 2015.
  • Strathmann et al. (2015) Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, and Arthur Gretton. Gradient-free hamiltonian monte carlo with efficient kernel exponential families. Advances in Neural Information Processing Systems, 28, 2015.
  • Sutherland et al. (2018) Danica J Sutherland, Heiko Strathmann, Michael Arbel, and Arthur Gretton. Efficient and principled score estimation with nyström kernel exponential families. In International Conference on Artificial Intelligence and Statistics, pages 652–660. PMLR, 2018.
  • Vincent (2011) Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
  • Warde-Farley and Bengio (2016) David Warde-Farley and Yoshua Bengio. Improving generative adversarial networks with denoising feature matching. 2016.
  • Wei and Ma (2019) Colin Wei and Tengyu Ma. Data-dependent sample complexity of deep neural networks via lipschitz augmentation. Advances in Neural Information Processing Systems, 32, 2019.
  • Xu and Cao (2005) Zong-Ben Xu and Fei-Long Cao. Simultaneous lp-approximation order for neural networks. Neural Networks, 18(7):914–923, 2005.
  • Yang and Barber (2019) Fan Yang and Rina Foygel Barber. Contraction and uniform convergence of isotonic regression. Electronic Journal of Statistics, 13(1):646–677, 2019.
  • Yarotsky (2017) Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
  • Yarotsky (2018) Dmitry Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. In Conference on Learning Theory, pages 639–649. PMLR, 2018.
  • Zhang (2002) Cun-Hui Zhang. Risk bounds in isotonic regression. The Annals of Statistics, 30(2):528–555, 2002.
  • Zhou et al. (2020) Yuhao Zhou, Jiaxin Shi, and Jun Zhu. Nonparametric score estimators. In International Conference on Machine Learning, pages 11513–11522. PMLR, 2020.