
Differentiable Neural Networks with RePU Activation: with Applications to Score Estimation and Isotonic Regression

Guohao Shen ([email protected])
Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong SAR, China

Yuling Jiao ([email protected])
School of Mathematics and Statistics and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan, China, 430072

Yuanyuan Lin ([email protected])
Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China

Jian Huang ([email protected])
Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong SAR, China
Abstract

We study the properties of differentiable neural networks activated by rectified power unit (RePU) functions. We show that the partial derivatives of RePU neural networks can be represented by mixed-RePUs activated networks, and we derive upper bounds for the complexity of the function class of derivatives of RePU networks. We establish error bounds for simultaneously approximating $C^s$ smooth functions and their derivatives using RePU-activated deep neural networks. Furthermore, we derive improved approximation error bounds when the data have approximate low-dimensional support, demonstrating the ability of RePU networks to mitigate the curse of dimensionality. To illustrate the usefulness of our results, we consider a deep score matching estimator (DSME) and propose a penalized deep isotonic regression (PDIR) using RePU networks. We establish non-asymptotic excess risk bounds for DSME and PDIR under the assumption that the target functions belong to a class of $C^s$ smooth functions. We also show that PDIR achieves the minimax optimal convergence rate and has a robustness property in the sense that it is consistent with vanishing penalty parameters even when the monotonicity assumption is not satisfied. Furthermore, if the data distribution is supported on an approximate low-dimensional manifold, we show that DSME and PDIR can mitigate the curse of dimensionality.

Keywords: Approximation error, curse of dimensionality, differentiable neural networks, isotonic regression, score matching

1 Introduction

In many statistical problems, it is important to estimate the derivatives of a target function, in addition to estimating a target function itself. An example is the score matching method for distribution learning through score function estimation (Hyvärinen and Dayan, 2005). In this method, the objective function involves the partial derivatives of the score function. Another example is a newly proposed penalized approach for isotonic regression described below, in which the partial derivatives are used to form a penalty function to encourage the estimated regression function to be monotonic. Motivated by these problems, we consider Rectified Power Unit (RePU) activated deep neural networks for estimating differentiable target functions. A RePU activation function has continuous derivatives, which makes RePU networks differentiable and suitable for derivative estimation. We study the properties of RePU networks along with their derivatives, establish error bounds for using RePU networks to approximate smooth functions and their derivatives, and apply them to the problems of score estimation and isotonic regression.

1.1 Score matching

Score estimation is an important approach to distribution learning, and the score function plays a central role in diffusion-based generative learning (Song et al., 2021; Block et al., 2020; Ho et al., 2020; Lee et al., 2022). Let $X\sim p_0$, where $p_0$ is a probability density function supported on $\mathbb{R}^d$. The $d$-dimensional score function of $p_0$ is defined as $s_0(x)=\nabla_x\log p_0(x)$, where $\nabla_x$ is the vector differential operator with respect to the input $x$.

A score matching estimator (Hyvärinen and Dayan, 2005) is obtained by solving the minimization problem

\min_{s\in\mathcal{F}}\frac{1}{2}\mathbb{E}_{X}\|s(X)-s_{0}(X)\|^{2}_{2},   (1)

where $\|\cdot\|_2$ denotes the Euclidean norm and $\mathcal{F}$ is a prespecified class of functions, often referred to as a hypothesis space. However, this objective function is computationally infeasible because $s_0$ is unknown. Under the mild conditions given in Assumption 9 in Section 4, it can be shown (Hyvärinen and Dayan, 2005) that

\frac{1}{2}\mathbb{E}_{X}\|s(X)-s_{0}(X)\|^{2}_{2}=J(s)+\frac{1}{2}\mathbb{E}_{X}\|s_{0}(X)\|_{2}^{2},   (2)

with

J(s):=\mathbb{E}_{X}\left[{\rm tr}(\nabla_{x}s(X))+\frac{1}{2}\|s(X)\|_{2}^{2}\right],   (3)

where $\nabla_x s(x)$ denotes the Jacobian matrix of $s(x)$ and ${\rm tr}(\cdot)$ the trace operator. Since the second term on the right side of (2), $\mathbb{E}_X\|s_0(X)\|_2^2/2$, does not involve $s$, it can be treated as a constant. Therefore, we can use $J$ in (3) as an objective function for estimating the score function $s_0$. When a random sample is available, we use a sample version of $J$ as the empirical objective function. Since $J$ involves the partial derivatives of $s(x)$, we need to compute the derivatives of the functions in $\mathcal{F}$ during estimation, and we need to analyze the properties of the functions in $\mathcal{F}$ and their derivatives to develop the learning theory. In particular, if we take $\mathcal{F}$ to be a class of deep neural network functions, we need to study the properties of their derivatives in terms of estimation and approximation.
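Briefly, (2) follows by expanding the squared distance and integrating the cross term by parts; the boundary term vanishes under the conditions in Assumption 9 (Hyvärinen and Dayan, 2005):

\mathbb{E}_{X}\big[s(X)^{\top}s_{0}(X)\big]=\int p_{0}(x)\,s(x)^{\top}\nabla_{x}\log p_{0}(x)\,dx=\int s(x)^{\top}\nabla_{x}p_{0}(x)\,dx=-\mathbb{E}_{X}\big[{\rm tr}(\nabla_{x}s(X))\big],

so that

\frac{1}{2}\mathbb{E}_{X}\|s(X)-s_{0}(X)\|_{2}^{2}=\mathbb{E}_{X}\Big[{\rm tr}(\nabla_{x}s(X))+\frac{1}{2}\|s(X)\|_{2}^{2}\Big]+\frac{1}{2}\mathbb{E}_{X}\|s_{0}(X)\|_{2}^{2}=J(s)+\frac{1}{2}\mathbb{E}_{X}\|s_{0}(X)\|_{2}^{2}.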

1.2 Isotonic regression

Isotonic regression is a technique that fits a regression model to observations such that the fitted regression function is non-decreasing (or non-increasing). It is a basic form of shape-constrained estimation and has applications in many areas, such as epidemiology (Morton-Jones et al., 2000), medicine (Diggle et al., 1999; Jiang et al., 2011), econometrics (Horowitz and Lee, 2017), and biostatistics (Rueda et al., 2009; Luss et al., 2012; Qin et al., 2014).

Consider a regression model

Y=f_{0}(X)+\epsilon,   (4)

where $X\in\mathcal{X}\subseteq\mathbb{R}^d$, $Y\in\mathbb{R}$, and $\epsilon$ is an independent noise variable with $\mathbb{E}(\epsilon)=0$ and ${\rm Var}(\epsilon)\leq\sigma^2$. In (4), $f_0$ is the underlying regression function, which is usually assumed to belong to a class of smooth functions.

In isotonic regression, $f_0$ is assumed to satisfy the following monotonicity property. Let $\preceq$ denote the binary relation "less than" in the partially ordered space $\mathbb{R}^d$, i.e., $u\preceq v$ if $u_j\leq v_j$ for all $j=1,\ldots,d$, where $u=(u_1,\ldots,u_d)^\top, v=(v_1,\ldots,v_d)^\top\in\mathcal{X}\subseteq\mathbb{R}^d$. The target regression function $f_0$ is assumed to be coordinate-wise non-decreasing on $\mathcal{X}$, i.e., $f_0(u)\leq f_0(v)$ if $u\preceq v$. The class of isotonic regression functions on $\mathcal{X}\subseteq\mathbb{R}^d$ is the set of coordinate-wise non-decreasing functions

\mathcal{F}_{0}:=\{f:\mathcal{X}\to\mathbb{R},\ f(u)\leq f(v)\ {\rm if}\ u\preceq v,\text{ where }\mathcal{X}\subset\mathbb{R}^{d}\}.

The goal is to estimate the target regression function $f_0$ under the constraint that $f_0\in\mathcal{F}_0$, based on an observed sample $S:=\{(X_i,Y_i)\}_{i=1}^n$. For a possibly random function $f:\mathcal{X}\to\mathbb{R}$, let the population risk be

\mathcal{R}(f)=\mathbb{E}|Y-f(X)|^{2},   (5)

where $(X,Y)$ follows the same distribution as $(X_i,Y_i)$ and is independent of $f$. Then the target function $f_0$ is the minimizer of the risk $\mathcal{R}(f)$ over $\mathcal{F}_0$, i.e.,

f_{0}\in\arg\min_{f\in\mathcal{F}_{0}}\mathcal{R}(f).   (6)

The empirical version of (6) is a constrained minimization problem, which is generally difficult to solve directly. In light of this, we propose a penalized approach for estimating $f_0$, based on the fact that a smooth $f_0$ is non-decreasing in its $j$th argument $x_j$ if and only if its partial derivative with respect to $x_j$ is nonnegative. Let $\dot{f}_j(x)=\partial f(x)/\partial x_j$ denote the partial derivative of $f$ with respect to $x_j$, $j=1,\ldots,d$. We propose the following penalized objective function

\mathcal{R}^{\lambda}(f)=\mathbb{E}|Y-f(X)|^{2}+\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\mathbb{E}\{\rho(\dot{f}_{j}(X))\},   (7)

where $\lambda=\{\lambda_j\}_{j=1}^d$ with $\lambda_j\geq 0$, $j=1,\ldots,d$, are tuning parameters, and $\rho:\mathbb{R}\to[0,\infty)$ is a penalty function satisfying $\rho(x)\geq 0$ for all $x$ and $\rho(x)=0$ if $x\geq 0$. Feasible choices include $\rho(x)=\max\{-x,0\}$, $\rho(x)=[\max\{-x,0\}]^2$, and more generally $\rho(x)=h(\max\{-x,0\})$ for a Lipschitz function $h$ with $h(0)=0$.
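As a concrete illustration, the following is a minimal numpy sketch of the empirical counterpart of (7) with the choice $\rho(x)=[\max\{-x,0\}]^2$; the helper names and the toy linear fit are ours, for illustration only.

```python
import numpy as np

def rho(t):
    # penalty rho(t) = [max{-t, 0}]^2: zero whenever t >= 0
    return np.maximum(-t, 0.0) ** 2

def penalized_empirical_risk(f, grad_f, X, Y, lam):
    """Empirical counterpart of (7): mean squared error plus the averaged
    monotonicity penalty (1/d) * sum_j lam_j * mean_i rho(df/dx_j(X_i))."""
    n, d = X.shape
    mse = np.mean((Y - f(X)) ** 2)
    partials = grad_f(X)                                  # shape (n, d)
    penalty = np.mean(rho(partials) * lam.reshape(1, d))  # mean over i and j supplies the 1/d factor
    return mse + penalty

# toy illustration: a linear fit whose second coordinate is decreasing,
# so only that coordinate incurs a penalty
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
beta = np.array([1.0, -0.5, 2.0])
Y = X @ beta + 0.1 * rng.standard_normal(200)
f = lambda Z: Z @ beta
grad_f = lambda Z: np.tile(beta, (Z.shape[0], 1))
print(penalized_empirical_risk(f, grad_f, X, Y, lam=np.ones(3)))
```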

The objective function (7) turns the constrained isotonic regression problem into a penalized regression problem with penalties on the partial derivatives of the regression function. Therefore, if we analyze the learning theory of estimators in (7) using neural network functions, we need to study the partial derivatives of the neural network functions in terms of their generalization and approximation properties.

It is worth mentioning that an advantage of our penalized formulation over the hard-constrained isotonic regressions is that the resulting estimator remains consistent with proper tuning when the underlying regression function is not monotonic. Therefore, our proposed method has a robustness property against model misspecification. We will discuss this point in detail in Section 5.2.

1.3 Differentiable neural networks

A commonality between the two quite different problems described above is that both involve the derivatives of the target function, in addition to the target function itself. When deep neural networks are used to parameterize the hypothesis space, the derivatives of deep neural networks must be considered. Studying the statistical learning theory for these deep neural methods requires knowledge of the complexity and approximation properties of deep neural networks along with their derivatives.

Complexities of deep neural networks with ReLU and piecewise-polynomial activation functions have been studied by Anthony and Bartlett (1999) and Bartlett et al. (2019). Generalization bounds in terms of the operator norm of neural networks have also been obtained by several authors (Neyshabur et al., 2015; Bartlett et al., 2017; Nagarajan and Kolter, 2019; Wei and Ma, 2019). These generalization results are based on various complexity measures, such as Rademacher complexity, VC-dimension, pseudo-dimension, and norms of parameters. These studies shed light on the complexity and generalization properties of neural networks themselves; however, the complexities of their derivatives remain unclear.

The approximation power of deep neural networks with smooth activation functions has been considered in the literature. The universality of sigmoidal deep neural networks was established by Mhaskar (1993) and Chui et al. (1994). In addition, the approximation properties of shallow RePU-activated networks were analyzed by Klusowski and Barron (2018) and Siegel and Xu (2022). The approximation rates of deep RePU neural networks for several types of target functions have also been investigated. For instance, Li et al. (2019, 2020) and Ali and Nouy (2021) studied the approximation rates for functions in Sobolev and Besov spaces in terms of the $L_p$ norm; Duan et al. (2021) and Abdeljawad and Grohs (2022) studied the approximation rates for functions in Sobolev spaces in terms of the Sobolev norm; and Belomestny et al. (2022) studied the approximation rates for functions in Hölder spaces in terms of the Hölder norm. Several recent papers have also studied the approximation of derivatives of smooth functions (Duan et al., 2021; Gühring and Raslan, 2021; Belomestny et al., 2022). We give a more detailed discussion of related works in Section 6.

Table 1 summarizes the comparison between our work and existing results, in terms of the number of non-zero network parameters needed to achieve approximation accuracy $\epsilon$ for a function with smoothness index $\beta$. We also indicate whether the neural network approximator has an explicit architecture, whether the approximation accuracy holds simultaneously for the target function and its derivatives, and whether the approximation results were shown to adapt to a low-dimensional structure of the target function.

Table 1: Comparison of approximation results of RePU neural networks for a function with smoothness order $\beta>0$, within accuracy $\epsilon$. ReQU $\sigma_2$ and ReCU $\sigma_3$ are special instances of RePU $\sigma_p$ with $p\geq 2$. The Sobolev norm $W^{s,p}$ of a function $f$ refers to the mean value of the $L_p$ norms of all partial derivatives of $f$ up to order $s$, and $W^{s,\infty}$ refers to the maximum value of the $L_\infty$ norms of all partial derivatives of $f$ up to order $s$. The Hölder norm $\mathcal{H}^s$ coincides with $W^{s,\infty}$ when $s$ is a non-negative integer. The $C^s$ norm of a function $f$ refers to the mean value of the $L_\infty$ norms of all partial derivatives of $f$ up to order $s$ when $s$ is a positive integer.
	Norm	Activation	Non-zero parameters	Explicit architecture	Simultaneous approximation	Low-dim result
Li et al. (2019, 2020)	$L_2$	RePU	$\mathcal{O}(\epsilon^{-d/\beta})$
Duan et al. (2021)	$W^{1,2}$	ReQU	$\mathcal{O}(\epsilon^{-2d/(\beta-1)})$
Abdeljawad and Grohs (2022)	$W^{s,p}$	ReCU	$\mathcal{O}(\epsilon^{-d/(\beta-s)})$
Belomestny et al. (2022)	$\mathcal{H}^s$	ReQU	$\mathcal{O}(\epsilon^{-d/(\beta-s)})$
This work	$C^s$	RePU	$\mathcal{O}(\epsilon^{-d/(\beta-s)})$

1.4 Our contributions

In this paper, motivated by the aforementioned estimation problems involving derivatives, we investigate the properties of RePU networks and their derivatives. We show that the partial derivatives of RePU neural networks can be represented by mixed-RePUs activated networks. We derive upper bounds for the complexity of the function class of the derivatives of RePU networks. This is a new result for the complexity of derivatives of RePU networks and is crucial to establish generalization error bounds for a variety of estimation problems involving derivatives, including the score matching estimation and our proposed penalized approach for isotonic regression considered in the present work.

We also derive approximation results for RePU networks that hold simultaneously for smooth functions and their derivatives. Our approximation results for RePU networks are based on their representational power on polynomials. We construct RePU networks with an explicit architecture, which differs from those in the existing literature. The number of hidden layers of our constructed RePU networks depends only on the degree of the target polynomial and is independent of the input dimension. This construction is new for studying the approximation properties of RePU networks.

We summarize the main contributions of this work as follows.

  1. We study the basic properties of RePU neural networks and their derivatives. First, we show that partial derivatives of RePU networks can be represented by mixed-RePUs activated networks, and we derive upper bounds for the complexity of the function class of partial derivatives of RePU networks in terms of pseudo-dimension. Second, we derive novel approximation results for simultaneously approximating $C^s$ smooth functions and their derivatives using RePU networks, based on a new and efficient construction technique. We show that the approximation rates improve when the data or the target function has a low-dimensional structure, which implies that RePU networks can mitigate the curse of dimensionality.

  2. We study the statistical learning theory of the deep score matching estimator (DSME) using RePU networks. We establish non-asymptotic prediction error bounds for DSME under the assumption that the target score is $C^s$ smooth, and we show that DSME can mitigate the curse of dimensionality when the data have low-dimensional support.

  3. We propose a penalized deep isotonic regression (PDIR) approach using RePU networks, which encourages the partial derivatives of the estimated regression function to be nonnegative. We establish non-asymptotic excess risk bounds for PDIR under the assumption that the target regression function $f_0$ is $C^s$ smooth. Moreover, we show that PDIR achieves the minimax optimal rate of convergence for nonparametric regression and can mitigate the curse of dimensionality when the data concentrate near a low-dimensional manifold. Furthermore, we show that with tuning parameters tending to zero, PDIR is consistent even when the target function is not isotonic.

The rest of the paper is organized as follows. In Section 2 we study basic properties of RePU neural networks. In Section 3 we establish novel approximation error bounds for approximating $C^s$ smooth functions and their derivatives using RePU networks. In Section 4 we derive non-asymptotic error bounds for DSME. In Section 5 we propose PDIR and establish non-asymptotic bounds for PDIR. In Section 6 we discuss related works. Concluding remarks are given in Section 7. Results from simulation studies, proofs, and technical details are given in the Supplementary Material.

2 Basic properties of RePU neural networks

In this section, we establish some basic properties of RePU networks. We show that the partial derivatives of RePU networks can be represented by mixed-RePUs activated networks whose width, depth, number of neurons, and size are of the same order as those of the original RePU networks. In addition, we derive upper bounds on the complexity of the class of mixed-RePUs activated networks in terms of pseudo-dimension, which leads to an upper bound for the class of partial derivatives of RePU networks.

2.1 RePU activated neural networks

Neural networks with nonlinear activation functions have proven to be a powerful approach for approximating multi-dimensional functions. One of the most commonly used activation functions is the rectified linear unit (ReLU), defined as $\sigma_1(x)=\max\{x,0\}$, $x\in\mathbb{R}$, due to its attractive computational and optimization properties. However, since partial derivatives are involved in our objective functions (3) and (7), it is not sensible to use networks with piecewise linear activation functions, such as ReLU and Leaky ReLU. Neural networks activated by Sigmoid and Tanh are smooth and differentiable, but they have fallen out of favor because of the vanishing gradient problem in optimization. In light of this, we are particularly interested in neural networks activated by RePU, which are non-saturating and differentiable.

In Table 2 below we compare RePU with ReLU and Sigmoid networks in several important aspects. The ReLU and RePU activation functions are continuous and non-saturating (an activation function $\sigma$ is saturating if $\lim_{|x|\to\infty}|\nabla\sigma(x)|=0$), so they do not suffer from the "vanishing gradient" problem of sigmoidal activations (e.g., Sigmoid, Tanh) during training. RePU and Sigmoid are differentiable and can be used to approximate the gradient of a target function, but ReLU cannot, especially in estimation problems involving high-order derivatives of a target function.

Table 2: Comparison among ReLU, Sigmoid, and RePU activation functions.
	Activation	Continuous	Non-saturating	Differentiable	Gradient estimation
	ReLU	Yes	Yes	No	No
	Sigmoid	Yes	No	Yes	Yes
	RePU	Yes	Yes	Yes	Yes

We consider the $p$th order Rectified Power Unit (RePU) activation function for a positive integer $p$. The RePU activation function, denoted by $\sigma_p$, is simply the $p$th power of the ReLU:

\sigma_{p}(x)=\begin{cases}x^{p},& x\geq 0,\\ 0,& x<0.\end{cases}

Note that when $p=0$, $\sigma_0$ is the Heaviside step function; when $p=1$, $\sigma_1$ is the familiar rectified linear unit (ReLU); and when $p=2,3$, the activation functions $\sigma_2,\sigma_3$ are called the rectified quadratic unit (ReQU) and the rectified cubic unit (ReCU), respectively. In this work, we focus on the case $p\geq 2$, for which the RePU activation function has continuous derivatives up to order $p-1$.
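For concreteness, here is a small numpy sketch of $\sigma_p$ and its derivative $p\,\sigma_{p-1}$ (the function names are ours, for illustration only).

```python
import numpy as np

def repu(x, p=2):
    # sigma_p(x) = max(x, 0)^p, applied componentwise
    return np.maximum(x, 0.0) ** p

def repu_prime(x, p=2):
    # d/dx sigma_p(x) = p * sigma_{p-1}(x), which is continuous for p >= 2
    return p * np.maximum(x, 0.0) ** (p - 1)

x = np.linspace(-1.0, 1.0, 5)
print(repu(x, p=2))        # ReQU values
print(repu_prime(x, p=2))  # its derivative, 2 * ReLU(x)
```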

With a RePU activation function, the network will be smooth and differentiable. The architecture of a RePU activated multilayer perceptron can be expressed as a composition of a series of functions

f(x)=\mathcal{L}_{\mathcal{D}}\circ\sigma_{p}\circ\mathcal{L}_{\mathcal{D}-1}\circ\sigma_{p}\circ\cdots\circ\sigma_{p}\circ\mathcal{L}_{1}\circ\sigma_{p}\circ\mathcal{L}_{0}(x),\quad x\in\mathbb{R}^{d_{0}},

where $d_0$ is the dimension of the input data, $\sigma_p(x)=\{\max(0,x)\}^p$, $p\geq 2$, is the RePU activation function (applied to each component of $x$ if $x$ is a vector), and $\mathcal{L}_i(x)=W_ix+b_i$, $i=0,1,\ldots,\mathcal{D}$. Here $d_i$ is the width (the number of neurons or computational units) of the $i$th layer, $W_i\in\mathbb{R}^{d_{i+1}\times d_i}$ is a weight matrix, and $b_i\in\mathbb{R}^{d_{i+1}}$ is the bias vector in the $i$th linear transformation $\mathcal{L}_i$. The input data $x$ forms the first layer of the neural network and the output is the last layer. Such a network $f$ has $\mathcal{D}$ hidden layers and $(\mathcal{D}+2)$ layers in total. We use a $(\mathcal{D}+2)$-vector $(d_0,d_1,\ldots,d_{\mathcal{D}},d_{\mathcal{D}+1})^\top$ to describe the width of each layer. The width $\mathcal{W}$ is defined as the maximum width of the hidden layers, i.e., $\mathcal{W}=\max\{d_1,\ldots,d_{\mathcal{D}}\}$; the size $\mathcal{S}$ is defined as the total number of parameters in the network $f$, i.e., $\mathcal{S}=\sum_{i=0}^{\mathcal{D}}\{d_{i+1}\times(d_i+1)\}$; the number of neurons $\mathcal{U}$ is defined as the number of computational units in the hidden layers, i.e., $\mathcal{U}=\sum_{i=1}^{\mathcal{D}}d_i$. Note that the neurons in consecutive layers are connected to each other via the weight matrices $W_i$, $i=0,1,\ldots,\mathcal{D}$.

We use the notation $\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}'}$ to denote a class of RePU activated multilayer perceptrons $f:\mathbb{R}^{d_0}\to\mathbb{R}$ with depth $\mathcal{D}$, width $\mathcal{W}$, number of neurons $\mathcal{U}$, size $\mathcal{S}$, and $f$ satisfying $\|f\|_\infty\leq\mathcal{B}$ and $\max_{j=1,\ldots,d}\|\frac{\partial}{\partial x_j}f\|_\infty\leq\mathcal{B}'$ for some $0<\mathcal{B},\mathcal{B}'<\infty$, where $\|f\|_\infty$ is the sup-norm of a function $f$.
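As a minimal illustration, the following numpy sketch evaluates the forward pass of such a network, assuming each affine map $\mathcal{L}_i$ is given by a weight/bias pair (helper names are ours).

```python
import numpy as np

def repu(x, p=2):
    # RePU activation sigma_p(x) = max(x, 0)^p, applied componentwise
    return np.maximum(x, 0.0) ** p

def repu_mlp(x, weights, biases, p=2):
    """Forward pass of f = L_D o sigma_p o L_{D-1} o ... o sigma_p o L_0,
    where L_i(h) = weights[i] @ h + biases[i]."""
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):   # hidden layers
        h = repu(W @ h + b, p)
    return weights[-1] @ h + biases[-1]           # output layer: affine only
```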

2.2 Derivatives of RePU networks

An advantage of RePU networks over networks with piecewise linear activations (e.g., ReLU networks) is that RePU networks are differentiable, which makes them useful in many estimation problems involving derivatives. To establish the learning theory for these problems, we need to study the properties of the derivatives of RePU networks.

Recall that a network with $\mathcal{D}$ hidden layers activated by the $p$th order RePU can be expressed as

f(x)=\mathcal{L}_{\mathcal{D}}\circ\sigma_{p}\circ\mathcal{L}_{\mathcal{D}-1}\circ\sigma_{p}\circ\cdots\circ\sigma_{p}\circ\mathcal{L}_{1}\circ\sigma_{p}\circ\mathcal{L}_{0}(x),\quad x\in\mathbb{R}^{d_{0}}.

Let $f_i:=\sigma_p\circ\mathcal{L}_i$ denote the $i$th linear transformation composed with the RePU activation for $i=0,1,\ldots,\mathcal{D}-1$, and let $f_{\mathcal{D}}=\mathcal{L}_{\mathcal{D}}$ denote the linear transformation in the last layer. Then, by the chain rule, the gradient of the network can be computed as

\nabla f=\left(\prod_{k=0}^{\mathcal{D}-1}\big[\nabla f_{\mathcal{D}-k}\circ f_{\mathcal{D}-k-1}\circ\cdots\circ f_{0}\big]\right)\nabla f_{0},   (8)

where $\nabla$ denotes the gradient operator used in vector calculus. With a differentiable RePU activation $\sigma_p$, the gradients $\nabla f_i$ in (8) can be computed exactly by $\sigma_{p-1}$-activated layers, since $\nabla f_i(x)=\nabla[\sigma_p\circ\mathcal{L}_i(x)]=\nabla\sigma_p(W_ix+b_i)=pW_i^\top\sigma_{p-1}(W_ix+b_i)$. In addition, the $f_i$, $i=0,\ldots,\mathcal{D}$, are themselves RePU-activated layers. Then, the network gradient $\nabla f$ can be represented by a network activated by $\sigma_p$ and $\sigma_{p-1}$ (and possibly $\sigma_m$ for $1\leq m\leq p-2$) according to (8), with a proper architecture. Below, we refer to neural networks activated by $\{\sigma_t:1\leq t\leq p\}$ as Mixed RePUs activated neural networks; that is, the activation functions in a Mixed RePUs network can be $\sigma_t$ for any $1\leq t\leq p$, and different neurons may use different activation functions.
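The following numpy sketch illustrates this chain-rule computation for a scalar-output network and checks it against finite differences; it is an illustration of the identity, not the Mixed RePUs network construction used in the proofs.

```python
import numpy as np

def repu(x, p=2):
    return np.maximum(x, 0.0) ** p

def repu_mlp_grad(x, weights, biases, p=2):
    """Gradient of a scalar-output RePU network via the chain rule (8):
    each hidden layer contributes the Jacobian diag(p * sigma_{p-1}(W_i h + b_i)) W_i,
    so only lower-order RePUs (sigma_{p-1}) are needed."""
    h, jac = np.asarray(x, dtype=float), np.eye(len(x))
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ h + b
        jac = (p * np.maximum(z, 0.0) ** (p - 1))[:, None] * (W @ jac)
        h = repu(z, p)
    return (weights[-1] @ jac).ravel()

# check against central finite differences on a small random ReQU network
rng = np.random.default_rng(1)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((4, 4)), rng.standard_normal((1, 4))]
bs = [rng.standard_normal(4), rng.standard_normal(4), rng.standard_normal(1)]
f = lambda x: (Ws[2] @ repu(Ws[1] @ repu(Ws[0] @ x + bs[0]) + bs[1]) + bs[2]).item()
x0, eps = rng.standard_normal(3), 1e-6
fd = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.allclose(repu_mlp_grad(x0, Ws, bs), fd, atol=1e-4))
```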

The following theorem shows that the partial derivatives and the gradient $\nabla f$ of a RePU neural network $f$ can indeed be represented by a Mixed RePUs network with activation functions $\{\sigma_t:1\leq t\leq p\}$.

Theorem 1 (Neural networks for partial derivatives)

Let $\mathcal{F}:=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}'}$ be a class of RePU ($\sigma_p$) activated neural networks $f:\mathcal{X}\to\mathbb{R}$ with depth (number of hidden layers) $\mathcal{D}$, width (maximum width of hidden layers) $\mathcal{W}$, number of neurons $\mathcal{U}$, number of parameters (weights and biases) $\mathcal{S}$, and $f$ satisfying $\|f\|_\infty\leq\mathcal{B}$ and $\max_{j=1,\ldots,d}\|\frac{\partial}{\partial x_j}f\|_\infty\leq\mathcal{B}'$. Then for any $f\in\mathcal{F}$ and any $j\in\{1,\ldots,d\}$, the partial derivative $\frac{\partial}{\partial x_j}f$ can be implemented by a Mixed RePUs activated multilayer perceptron with depth $3\mathcal{D}+3$, width $6\mathcal{W}$, number of neurons $13\mathcal{U}$, number of parameters $23\mathcal{S}$, and bound $\mathcal{B}'$.

Theorem 1 shows that for each $j\in\{1,\ldots,d\}$, the partial derivative with respect to the $j$th argument of any $f\in\mathcal{F}$ can be exactly computed by a Mixed RePUs network. In addition, by paralleling the networks computing $\frac{\partial}{\partial x_j}f$, $j=1,\ldots,d$, the whole vector of partial derivatives $\nabla f=(\frac{\partial}{\partial x_1}f,\ldots,\frac{\partial}{\partial x_d}f)$ can be computed by a Mixed RePUs network with depth $3\mathcal{D}+3$, width $6d\mathcal{W}$, number of neurons $13d\mathcal{U}$, and number of parameters $23d\mathcal{S}$.

Let $\mathcal{F}_j':=\{\partial f/\partial x_j:f\in\mathcal{F}\}$ be the class of partial derivatives of the functions in $\mathcal{F}$ with respect to the $j$th argument, and let $\widetilde{\mathcal{F}}$ denote the class of Mixed RePUs networks in Theorem 1. Then Theorem 1 implies that the class of partial derivatives is contained in a class of Mixed RePUs networks, i.e., $\mathcal{F}_j'\subseteq\widetilde{\mathcal{F}}$ for $j=1,\ldots,d$. This further implies that the complexity of $\mathcal{F}_j'$ can be bounded by that of the class of Mixed RePUs networks $\widetilde{\mathcal{F}}$.

The complexity of a function class is a key quantity in the analysis of generalization properties. Lower complexity in general implies a smaller generalization gap. The complexity of a function class can be measured in several ways, including Rademacher complexity, covering number, VC dimension, and Pseudo dimension. These measures depict the complexity of a function class differently but are closely related to each other.

In the following, we develop complexity upper bounds for the class of Mixed RePUs network functions. In particular, these bounds lead to upper bounds on the pseudo-dimension of the function class $\widetilde{\mathcal{F}}$ and of $\mathcal{F}'$.

Lemma 2 (Pseudo dimension of Mixed RePUs multilayer perceptrons)

Let $\widetilde{\mathcal{F}}$ be a function class implemented by Mixed RePUs activated multilayer perceptrons with depth $\tilde{\mathcal{D}}$, number of neurons (nodes) $\tilde{\mathcal{U}}$, and size or number of parameters (weights and biases) $\tilde{\mathcal{S}}$. Then the pseudo-dimension of $\widetilde{\mathcal{F}}$ satisfies

{\rm Pdim}(\widetilde{\mathcal{F}})\leq 3p\tilde{\mathcal{D}}\tilde{\mathcal{S}}(\tilde{\mathcal{D}}+\log_{2}\tilde{\mathcal{U}}),

where ${\rm Pdim}(\mathcal{F})$ denotes the pseudo-dimension of a function class $\mathcal{F}$.

With Theorem 1 and Lemma 2, we can now obtain an upper bound for the complexity of the class of derivatives of RePU neural networks. This facilitates establishing learning theories for statistical methods involving derivatives.

Due to the symmetry among the arguments of the input of networks in $\mathcal{F}$, the complexities of $\mathcal{F}_1',\ldots,\mathcal{F}_d'$ are generally the same. For notational simplicity, we use

\mathcal{F}^{\prime}:=\Big\{\frac{\partial}{\partial x_{1}}f:f\in\mathcal{F}\Big\}

in the main text when referring to complexity quantities such as the pseudo-dimension; e.g., we write ${\rm Pdim}(\mathcal{F}')$ instead of ${\rm Pdim}(\mathcal{F}_j')$ for $j=1,\ldots,d$.

3 Approximation power of RePU neural networks

In this section, we establish error bounds for using RePU networks to simultaneously approximate smooth functions and their derivatives.

We show that RePU neural networks, with an appropriate architecture, can represent multivariate polynomials with no error and thus can simultaneously approximate multivariate differentiable functions and their derivatives. Moreover, we show that the RePU neural network can mitigate the “curse of dimensionality” when the domain of the target function concentrates in a neighborhood of a low-dimensional manifold.

In studies of the approximation properties of ReLU networks (Yarotsky, 2017, 2018; Shen et al., 2020; Schmidt-Hieber, 2020), the analyses rely on two key facts. First, the ReLU activation function can be used to construct continuous, piecewise linear bump functions with compact support, which form a partition of unity of the domain. Second, deep ReLU networks can approximate the square function $x^2$ to any error tolerance, provided the network is large enough. Based on these facts, ReLU networks can compute Taylor expansions to approximate smooth functions. However, due to the piecewise linear nature of ReLU, the approximation is restricted to the target function itself rather than its derivatives; in other words, the approximation error of ReLU networks is quantified in the $L_p$ norm with $p\geq 1$ or $p=\infty$, whereas norms such as Sobolev or Hölder norms capture the approximation of derivatives. Gühring and Raslan (2021) extended these results by showing that networks activated by a general smooth function can approximate a partition of unity and polynomial functions, and obtained approximation rates for smooth functions in the Sobolev norm, which implies approximation of the target function and its derivatives. RePU-activated networks have been shown to represent splines (Duan et al., 2021; Belomestny et al., 2022), and thus they can approximate smooth functions and their derivatives based on the approximation power of splines.

RePU networks can also represent polynomials efficiently and exactly. This fact motivates us to derive our approximation results for RePU networks from their representational power on polynomials. To construct RePU networks representing polynomials, the basic idea is to express basic operations as one-hidden-layer RePU networks and then compute polynomials by combining and composing these building blocks. For a univariate input $x$, the identity map $x$, a linear transformation $ax+b$, and the square map $x^2$ can all be represented by one-hidden-layer RePU networks with only a few nodes. The multiplication operation $xy=\{(x+y)^2-(x-y)^2\}/4$ can also be realized by a one-hidden-layer RePU network. Then a univariate polynomial $\sum_{i=0}^N a_ix^i$ of degree $N\geq 0$ can be computed by a RePU network with a suitable composite construction based on Horner's method (also known as Qin Jiushao's algorithm) (Horner, 1819). Further, a multivariate polynomial can be viewed as a product of univariate polynomials, so a RePU network with a suitable architecture can represent multivariate polynomials. Alternatively, as noted in Mhaskar (1993) and Chui and Li (1993), any polynomial in $d$ variables with total degree not exceeding $N$ can be written as a linear combination of $\binom{N+d}{d}$ terms of the form $(w^\top x+b)^N$, where $\binom{N+d}{d}=(N+d)!/(N!\,d!)$ denotes the binomial coefficient and $w^\top x+b$ denotes a linear combination of the $d$ components of $x$. Given this fact, RePU networks can also be shown to represent polynomials via a suitable construction.
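The following numpy sketch illustrates these building blocks for the ReQU case ($p=2$): the square and multiplication identities and Horner's evaluation of a univariate polynomial. It illustrates the identities only, not the exact network construction used in the proof of Theorem 3.

```python
import numpy as np

def requ(x):
    # sigma_2(x) = max(x, 0)^2
    return np.maximum(x, 0.0) ** 2

def square(x):
    # x^2 = sigma_2(x) + sigma_2(-x): a one-hidden-layer ReQU representation
    return requ(x) + requ(-x)

def multiply(x, y):
    # xy = {(x + y)^2 - (x - y)^2} / 4, realized with the ReQU square above
    return (square(x + y) - square(x - y)) / 4.0

def horner(coeffs, x):
    # evaluate a_0 + a_1 x + ... + a_N x^N using only the ReQU
    # multiplication block and additions (Horner's method)
    val = coeffs[-1]
    for a in reversed(coeffs[:-1]):
        val = multiply(val, x) + a
    return val

# 1 - 2x + 3x^3 at x = 0.5, compared with direct evaluation
print(horner([1.0, -2.0, 0.0, 3.0], 0.5), np.polyval([3.0, 0.0, -2.0, 1.0], 0.5))
```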

Theorem 3 (Representation of Polynomials by RePU networks)

For any non-negative integer $N\in\mathbb{N}_0$ and positive integer $d\in\mathbb{N}^+$, if $f:\mathbb{R}^d\to\mathbb{R}$ is a polynomial of $d$ variables with total degree $N$, then $f$ can be exactly computed, with no error, by a RePU activated neural network with

  • (1)

    $2N-1$ hidden layers, $(6p+2)(2N^d-N^{d-1}-N)+2p(2N^d-N^{d-1}-N)/(N-1)=\mathcal{O}(pN^d)$ neurons, $(30p+2)(2N^d-N^{d-1}-N)+(2p+1)(2N^d-N^{d-1}-N)/(N-1)=\mathcal{O}(pN^d)$ parameters, and width $12pN^{d-1}+6p(N^{d-1}-N)/(N-1)=\mathcal{O}(pN^{d-1})$;

  • (2)

    $\lceil\log_p(N)\rceil$ hidden layers, $2\lceil\log_p(N)\rceil(N+d)!/(N!\,d!)=\mathcal{O}(\log_p(N)N^d)$ neurons, $2(\lceil\log_p(N)\rceil+d+1)(N+d)!/(N!\,d!)=\mathcal{O}((\log_p(N)+d)N^d)$ parameters, and width $2(N+d)!/(N!\,d!)=\mathcal{O}(N^d)$,

where $\lceil a\rceil$ denotes the smallest integer no less than $a\in\mathbb{R}$ and $a!$ denotes the factorial of the integer $a$.

Theorem 3 establishes the capability of RePU networks to exactly represent multivariate polynomials of degree $N$ through appropriate architectural designs. The theorem introduces two distinct network architectures that achieve this representation, based on Horner's method (Horner, 1819) and Mhaskar's method (Mhaskar, 1993), respectively. The RePU network constructed using Horner's method exhibits a larger number of hidden layers but fewer neurons and parameters compared to the network constructed using Mhaskar's method. Neither construction is universally superior; the better choice depends on the relationship between the dimension $d$ and the degree $N$, allowing for potential efficiency gains in specific scenarios.

Importantly, RePU neural networks offer advantages over ReLU networks in approximating polynomials. For any positive integers $W$ and $L$, ReLU networks with width of order $\mathcal{O}(WN^d)$ and depth of order $\mathcal{O}(LN^2)$ can only approximate, but not exactly represent, $d$-variate polynomials of degree $N$, with accuracy of order $9N(W+1)^{-7NL}=\mathcal{O}(NW^{-LN})$ (Shen et al., 2020; Hon and Yang, 2022). Furthermore, the approximation capabilities of ReLU networks for polynomials are generally limited to bounded regions, whereas RePU networks can compute polynomials exactly over the entire space $\mathbb{R}^d$.

Theorem 3 introduces novel findings that distinguish it from existing research (Li et al., 2019, 2020) in several aspects. First, it provides an explicit formulation of the RePU network's depth, width, number of neurons, and number of parameters in terms of the degree $N$ of the target polynomial and the input dimension $d$, thereby facilitating practical implementation. Second, the theorem presents Architecture (2), which improves on previous studies in that it requires fewer hidden layers for polynomial representation. Prior works, such as Li et al. (2019, 2020), required RePU networks with $d(\lceil\log_p(N)\rceil+1)$ hidden layers, along with $\mathcal{O}(pN^d)$ neurons and parameters, to represent $d$-variate polynomials of degree $N$ on $\mathbb{R}^d$. Architecture (2) in Theorem 3 achieves a comparable number of neurons and parameters with only $\lceil\log_p(N)\rceil$ hidden layers. Importantly, the number of hidden layers $\lceil\log_p(N)\rceil$ depends only on the degree $N$ of the polynomial and is independent of the input dimension $d$. This improvement is particularly significant for the high-dimensional input spaces commonly encountered in machine-learning tasks. Lastly, Architecture (1) in Theorem 3 contributes an additional RePU network construction based on Horner's method (Horner, 1819), complementing existing results based solely on Mhaskar's method (Mhaskar, 1993) and providing an alternative choice for polynomial representation.
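To make the counts in Architecture (2) concrete, the following short script evaluates the formulas stated in Theorem 3 for a fixed degree and several input dimensions; it is pure arithmetic on the stated expressions, and the helper names are ours.

```python
from math import comb

def ceil_log(N, p):
    # smallest integer k with p**k >= N, i.e. ceil(log_p N) for N >= 1
    k = 0
    while p ** k < N:
        k += 1
    return k

def architecture2(N, d, p):
    """Depth, width, number of neurons and parameters of Architecture (2) in Theorem 3."""
    depth = ceil_log(N, p)
    width = 2 * comb(N + d, d)            # 2 (N+d)! / (N! d!)
    neurons = depth * width               # 2 ceil(log_p N) (N+d)! / (N! d!)
    params = 2 * (depth + d + 1) * comb(N + d, d)
    return depth, width, neurons, params

# degree N = 4 polynomials, ReQU (p = 2): the depth stays ceil(log_2 4) = 2
# for every input dimension d, while the width and size grow with d
for d in (2, 5, 10):
    print(d, architecture2(4, d, 2))
```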

By leveraging the approximation power of multivariate polynomials, we can derive error bounds for approximating general multivariate smooth functions using RePU neural networks. Previously, approximation properties of RePU networks have been studied for target functions in different spaces, e.g., Sobolev space (Li et al., 2020, 2019; Gühring and Raslan, 2021), spectral Barron space (Siegel and Xu, 2022), Besov space (Ali and Nouy, 2021), and Hölder space (Belomestny et al., 2022). Here we focus on the simultaneous approximation of multivariate smooth functions and their derivatives in the space $C^s$ for $s\in\mathbb{N}^+$, defined in Definition 4.

Definition 4 (Multivariate differentiable class $C^s$)

A function $f:\mathbb{B}\subset\mathbb{R}^d\to\mathbb{R}$ defined on a subset $\mathbb{B}$ of $\mathbb{R}^d$ is said to be in the class $C^s(\mathbb{B})$ on $\mathbb{B}$ for a positive integer $s$ if all partial derivatives

D^{\alpha}f:=\frac{\partial^{\alpha}}{\partial x_{1}^{\alpha_{1}}\partial x_{2}^{\alpha_{2}}\cdots\partial x_{d}^{\alpha_{d}}}f

exist and are continuous on $\mathbb{B}$ for all non-negative integers $\alpha_1,\alpha_2,\ldots,\alpha_d$ such that $\alpha_1+\alpha_2+\cdots+\alpha_d\leq s$. In addition, we define the norm of $f$ over $\mathbb{B}$ by

\|f\|_{C^{s}}:=\sum_{|\alpha|_{1}\leq s}\sup_{x\in\mathbb{B}}|D^{\alpha}f(x)|,

where $|\alpha|_1:=\sum_{i=1}^d\alpha_i$ for any vector $\alpha=(\alpha_1,\alpha_2,\ldots,\alpha_d)\in\mathbb{R}^d$.
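For example, for $f(x_1,x_2)=x_1x_2$ on $\mathbb{B}=[0,1]^2$ and $s=1$, the definition gives $\|f\|_{C^1}=\sup_{\mathbb{B}}|x_1x_2|+\sup_{\mathbb{B}}|x_2|+\sup_{\mathbb{B}}|x_1|=1+1+1=3$.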

Theorem 5

Let $f$ be a real-valued function defined on a compact set $\mathcal{X}\subset\mathbb{R}^d$ belonging to the class $C^s$ for $0\leq s<\infty$. For any $N\in\mathbb{N}^+$, there exists a RePU activated neural network $\phi_N$ with depth $\mathcal{D}$, width $\mathcal{W}$, number of neurons $\mathcal{U}$, and size $\mathcal{S}$ specified by one of the following architectures:

  • (1)
    \mathcal{D}=2N-1,\quad \mathcal{W}=12pN^{d-1}+6p(N^{d-1}-N)/(N-1),
    \mathcal{U}=(6p+2)(2N^{d}-N^{d-1}-N)+2p(2N^{d}-N^{d-1}-N)/(N-1),
    \mathcal{S}=(30p+2)(2N^{d}-N^{d-1}-N)+(2p+1)(2N^{d}-N^{d-1}-N)/(N-1);
  • (2)
    \mathcal{D}=\lceil\log_{p}(N)\rceil,\quad \mathcal{W}=2(N+d)!/(N!\,d!),
    \mathcal{U}=2\lceil\log_{p}(N)\rceil(N+d)!/(N!\,d!),
    \mathcal{S}=2(\lceil\log_{p}(N)\rceil+d+1)(N+d)!/(N!\,d!),

such that for each multi-index $\alpha\in\mathbb{N}_0^d$ satisfying $|\alpha|_1\leq\min\{s,N\}$, we have

\sup_{\mathcal{X}}|D^{\alpha}(f-\phi_{N})|\leq C_{p,s,d,\mathcal{X}}\,\|f\|_{C^{|\alpha|_{1}}}\,N^{-(s-|\alpha|_{1})},

where $C_{p,s,d,\mathcal{X}}$ is a positive constant depending only on $p,d,s$, and the diameter of $\mathcal{X}$.

Theorem 5 gives a simultaneous approximation result for RePU networks, since the error is measured in the $\|\cdot\|_{C^s}$ norm. It improves over existing results focusing on $L_p$ norms, which cannot guarantee the approximation of derivatives of the target function (Li et al., 2019, 2020). It is known that shallow neural networks with smooth activations can simultaneously approximate a smooth function and its derivatives (Xu and Cao, 2005). However, simultaneous approximation by RePU neural networks with respect to norms involving derivatives is still an active research area (Gühring and Raslan, 2021; Duan et al., 2021; Belomestny et al., 2022). For solving partial differential equations in a Sobolev space with smoothness order 2, Duan et al. (2021) showed that ReQU neural networks can simultaneously approximate the target function and its derivative in the Sobolev norm $W^{1,2}$; to achieve accuracy $\epsilon$, the ReQU networks require $\mathcal{O}(\log_2 d)$ layers and $\mathcal{O}(4d\epsilon^{-d})$ neurons. Later, Belomestny et al. (2022) proved that $\beta$-Hölder smooth functions ($\beta>2$) and their derivatives up to order $l$ can be simultaneously approximated with accuracy $\epsilon$ in the Hölder norm by a ReQU network with width $\mathcal{O}(\epsilon^{-d/(\beta-l)})$, $\mathcal{O}(\log_2 d)$ layers, and $\mathcal{O}(\epsilon^{-d/(\beta-l)})$ nonzero parameters. Gühring and Raslan (2021) derived simultaneous approximation results for neural networks with general smooth activation functions; based on their results, a RePU neural network with a constant number of layers and $\mathcal{O}(\epsilon^{-d/(\beta-l)})$ nonzero parameters can achieve approximation accuracy $\epsilon$ in the Sobolev norm up to $l$th order derivatives for a $d$-dimensional Sobolev function with smoothness $\beta$.

To achieve approximation accuracy $\epsilon$, our Theorem 5 shows that a RePU network requires a comparable number of neurons, namely $\mathcal{O}(\epsilon^{-d/(s-l)})$, to simultaneously approximate the target function and its derivatives up to order $l$. Our result differs from existing studies in several ways. First, in contrast to Li et al. (2019, 2020), Theorem 5 establishes simultaneous approximation results for RePU networks. Second, Theorem 5 holds for general RePU networks ($p\geq 2$), including the ReQU network ($p=2$) studied in Duan et al. (2021) and Belomestny et al. (2022). Third, Theorem 5 explicitly specifies the network architecture, which facilitates network design in practice, whereas existing studies determine network architectures only up to orders (Li et al., 2019, 2020; Gühring and Raslan, 2021). In addition, as discussed in the next subsection, Theorem 5 can be further improved and adapted to data with low-dimensional structure, which highlights the capability of RePU networks to mitigate the curse of dimensionality in estimation problems. We again refer to Table 1 for a summary comparison of our work with existing results.

Remark 6

Theorem 3 is based on the representational power of RePU networks on polynomials, as in Li et al. (2019, 2020) and Ali and Nouy (2021). Other existing works derived approximation results based on the representation of B-splines or tensor-product splines by ReQU neural networks (Duan et al., 2021; Siegel and Xu, 2022; Belomestny et al., 2022).

3.1 Circumventing the curse of dimensionality

By Theorem 5, to achieve an approximation error $\epsilon$, the RePU neural network needs $O(\epsilon^{-d/s})$ parameters. The number of parameters grows polynomially in the desired approximation accuracy $\epsilon$ with an exponent $-d/s$ depending on the dimension $d$. In statistical and machine learning tasks, such an approximation result can make the estimation suffer from the curse of dimensionality: when the dimension $d$ of the input data is large, the convergence rate becomes extremely slow. Fortunately, high-dimensional data often have approximate low-dimensional latent structures in many applications, such as computer vision and natural language processing (Belkin and Niyogi, 2003; Hoffmann et al., 2009; Fefferman et al., 2016). It has been shown that these low-dimensional structures can help mitigate the curse of dimensionality (i.e., improve the convergence rate) for ReLU networks (Schmidt-Hieber, 2019; Shen et al., 2020; Jiao et al., 2023; Chen et al., 2022). We consider an assumption of approximate low-dimensional support of the data distribution (Jiao et al., 2023) and show that RePU networks can also mitigate the curse of dimensionality under this assumption.

Assumption 7

The predictor $X$ is supported on a $\rho$-neighborhood $\mathcal{M}_\rho$ of a compact $d_{\mathcal{M}}$-dimensional Riemannian submanifold $\mathcal{M}\subset\mathbb{R}^d$, where

\mathcal{M}_{\rho}=\{x\in\mathbb{R}^{d}:\inf\{\|x-y\|_{2}:y\in\mathcal{M}\}\leq\rho\},\quad\rho\in(0,1),

and the Riemannian submanifold $\mathcal{M}$ has condition number $1/\tau$, volume $V$, and geodesic covering regularity $R$.

We assume that the high-dimensional data $X$ concentrate in a $\rho$-neighborhood of a low-dimensional manifold. This assumption is a relaxation of the stringent requirements imposed by exact manifold assumptions (Chen et al., 2019; Schmidt-Hieber, 2019).

With a well-conditioned manifold $\mathcal{M}$, we show that RePU networks can adaptively embed the data into a lower-dimensional space while approximately preserving distances. The dimension of the embedded representation, as well as the quality of the embedding in terms of distance preservation, depends on the properties of the approximate manifold, including its radius $\rho$, condition number $1/\tau$, volume $V$, and geodesic covering regularity $R$. For in-depth definitions of these properties, we refer the interested reader to Baraniuk and Wakin (2009).

Theorem 8 (Improved approximation results)

Suppose that Assumption 7 holds. Let $f$ be a real-valued function defined on $\mathbb{R}^d$ belonging to the class $C^s$ for $0\leq s<\infty$. Let $d_\delta=c\cdot d_{\mathcal{M}}\log(d\cdot VR\tau^{-1}/\delta)/\delta^2$ be an integer satisfying $d_\delta\leq d$ for some $\delta\in(0,1)$ and a universal constant $c>0$. Then for any $N\in\mathbb{N}^+$, there exists a RePU activated neural network $\phi_N$ with depth $\mathcal{D}$, width $\mathcal{W}$, number of neurons $\mathcal{U}$, and size $\mathcal{S}$ given by one of the following architectures:

  • (1)
    \mathcal{D}=2N-1,\quad \mathcal{W}=12pN^{d_{\delta}-1}+6p(N^{d_{\delta}-1}-N)/(N-1),
    \mathcal{U}=(6p+2)(2N^{d_{\delta}}-N^{d_{\delta}-1}-N)+2p(2N^{d_{\delta}}-N^{d_{\delta}-1}-N)/(N-1),
    \mathcal{S}=(30p+2)(2N^{d_{\delta}}-N^{d_{\delta}-1}-N)+(2p+1)(2N^{d_{\delta}}-N^{d_{\delta}-1}-N)/(N-1);
  • (2)
    \mathcal{D}=\lceil\log_{p}(N)\rceil,\quad \mathcal{W}=2(N+d_{\delta})!/(N!\,d_{\delta}!),
    \mathcal{U}=2\lceil\log_{p}(N)\rceil(N+d_{\delta})!/(N!\,d_{\delta}!),
    \mathcal{S}=2(\lceil\log_{p}(N)\rceil+d_{\delta}+1)(N+d_{\delta})!/(N!\,d_{\delta}!),

such that for each multi-index $\alpha\in\mathbb{N}_0^d$ satisfying $|\alpha|_1\leq 1$,

\mathbb{E}_{X}|D^{\alpha}(f(X)-\phi_{N}(X))|\leq C_{p,s,d_{\delta},\mathcal{M}_{\rho}}\cdot(1-\delta)^{-2}\|f\|_{C^{|\alpha|_{1}}}\,N^{-(s-|\alpha|_{1})},

for $\rho\leq C_1N^{-(s-|\alpha|_1)}$ with a universal constant $C_1>0$, where $C_{p,s,d_\delta,\mathcal{M}_\rho}$ is a positive constant depending only on $d_\delta,s,p$, and $\mathcal{M}_\rho$.

When the data have a low-dimensional structure, Theorem 8 indicates that the RePU network can approximate a $C^s$ smooth function and its derivatives up to order $l$ with accuracy $\epsilon$ using $O(\epsilon^{-d_\delta/(s-l)})$ neurons. Here the effective dimension $d_\delta$ scales linearly in the intrinsic manifold dimension $d_{\mathcal{M}}$ and logarithmically in the ambient dimension $d$ and in the features $1/\tau,V,R$ of the manifold. Compared to Theorem 5, the effective dimension in Theorem 8 is $d_\delta$ instead of $d$, which can be a significant improvement, especially when the ambient dimension $d$ of the data is large but the intrinsic dimension $d_{\mathcal{M}}$ is small.

Theorem 8 shows that RePU neural networks are an effective tool for analyzing data that lie in a neighborhood of a low-dimensional manifold, indicating their potential to mitigate the curse of dimensionality. In particular, this property makes them well suited to scenarios where the ambient dimension of the data is high but the intrinsic dimension is low. To the best of our knowledge, Theorem 8 is the first result on the ability of RePU networks to mitigate the curse of dimensionality. A comparison between our result and the recent results of Li et al. (2019), Li et al. (2020), Duan et al. (2021), Abdeljawad and Grohs (2022), and Belomestny et al. (2022) is given in Table 1.

4 Deep score estimation

Deep neural networks have revolutionized many areas of statistics and machine learning, and one of the important applications is score function estimation using the score matching method (Hyvärinen and Dayan, 2005). Score-based generative models (Song et al., 2021), which learn to generate samples by estimating the gradient of the log-density function, can benefit significantly from deep neural networks. Using a deep neural network allows for more expressive and flexible models, which can capture complex patterns and dependencies in the data. This is especially important for high-dimensional data, where traditional methods may struggle to capture all of the relevant features. By leveraging the power of deep neural networks, score-based generative models can achieve state-of-the-art results on a wide range of tasks, from image generation to natural language processing. The use of deep neural networks in score function estimation represents a major advance in the field of generative modeling, with the potential to unlock new levels of creativity and innovation. We apply our developed theories of RePU networks to explore the statistical learning theories of deep score matching estimation (DSME).

Let $p_0(x)$ be a probability density function supported on $\mathbb{R}^d$ and let $s_0(x)=\nabla_x\log p_0(x)$ be its score function, where $\nabla_x$ is the vector differential operator with respect to the input $x$. The goal of deep score estimation is to model and estimate $s_0$ by a function $s:\mathbb{R}^d\to\mathbb{R}^d$, based on samples $\{X_i\}_{i=1}^n$ from $p_0$, such that $s(x)\approx s_0(x)$. Here $s$ belongs to a class of deep neural networks.

It is worth noting that the neural network $s:\mathbb{R}^d\to\mathbb{R}^d$ used in deep score estimation is a vector-valued function. For a $d$-dimensional input $x=(x_1,\ldots,x_d)^\top\in\mathbb{R}^d$, the output $s(x)=(s_1(x),\ldots,s_d(x))^\top\in\mathbb{R}^d$ is also $d$-dimensional. We let $\nabla_x s$ denote the $d\times d$ Jacobian matrix of $s$ with $(i,j)$ entry $\partial s_i/\partial x_j$. With a slight abuse of notation, we let $\mathcal{F}_n:=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}'}$ denote a class of RePU activated multilayer perceptrons $s:\mathbb{R}^d\to\mathbb{R}^d$ with parameter $\theta$, depth $\mathcal{D}$, width $\mathcal{W}$, size $\mathcal{S}$, number of neurons $\mathcal{U}$, and $s$ satisfying: (i) $\|s\|_\infty\leq\mathcal{B}$ for some $0<\mathcal{B}<\infty$, where $\|s\|_\infty:=\sup_{x\in\mathcal{X}}\|s(x)\|_\infty$ is the sup-norm of the vector-valued function $s$ over its domain $\mathcal{X}$; (ii) $\|(\nabla_x s)_{ii}\|_\infty\leq\mathcal{B}'$, $i=1,\ldots,d$, for some $0<\mathcal{B}'<\infty$, where $(\nabla_x s)_{ii}$ is the $i$th diagonal entry (in the $i$th row and $i$th column) of $\nabla_x s$. Here the parameters $\mathcal{D},\mathcal{W},\mathcal{U}$, and $\mathcal{S}$ of $\mathcal{F}_n$ may depend on the sample size $n$, but we omit this dependence in the notation. In addition, we extend the definition of smooth multivariate functions: we say a vector-valued function $s=(s_1,\ldots,s_d)$ belongs to $C^m$ if $s_j$ belongs to $C^m$ for each $j=1,\ldots,d$, and correspondingly we define $\|s\|_{C^m}:=\max_{j=1,\ldots,d}\|s_j\|_{C^m}$.

4.1 Non-asymptotic error bounds for DSME

The development of theory for the performance of score estimators based on deep neural networks has become an important research area. Theoretical upper bounds on prediction errors are increasingly important for understanding the limitations and potential of these models.

We are interested in establishing non-asymptotic error bounds for DSME, which is obtained by minimizing the expected squared distance $\mathbb{E}_X\|s(X)-s_0(X)\|_2^2$ over the class of functions $\mathcal{F}$. However, this objective is computationally infeasible because the explicit form of $s_0$ is unknown. Under proper conditions, the objective function has an equivalent formulation that is computationally feasible.

Assumption 9

The density $p_0$ of the data $X$ is differentiable. The expectations $\mathbb{E}_X\|s_0(X)\|_2^2$ and $\mathbb{E}_X\|s(X)\|_2^2$ are finite for any $s\in\mathcal{F}$, and $s_0(x)s(x)\to 0$ for any $s\in\mathcal{F}$ as $\|x\|\to\infty$.

Under Assumption 9, the population objective of score matching is equivalent to $J$ given in (3). With a finite sample $S_n=\{X_i\}_{i=1}^n$, the empirical version of $J$ is

J_{n}(s)=\frac{1}{n}\sum_{i=1}^{n}\left\{{\rm tr}(\nabla_{x}s(X_{i}))+\frac{1}{2}\|s(X_{i})\|_{2}^{2}\right\}.

Then, DSME is defined by

\hat{s}_{n}:=\arg\min_{s\in\mathcal{F}_{n}}J_{n}(s),   (9)

which is the empirical risk minimizer over the class of RePU neural networks $\mathcal{F}_n$.
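As an illustration of the empirical objective, the following numpy sketch computes $J_n$ and runs plain gradient descent for a toy linear score model $s(x)=Wx+b$; this is our simplified illustration of the loss in (9), not the RePU-network estimator analyzed below.

```python
import numpy as np

def score_matching_loss(W, b, X):
    """Empirical objective J_n for a toy linear score model s(x) = W x + b.
    For this model the Jacobian of s is W, so tr(grad_x s(x)) = trace(W) for every sample."""
    S = X @ W.T + b                                   # rows are s(X_i)
    return np.trace(W) + 0.5 * np.mean(np.sum(S ** 2, axis=1))

# standard normal data: the true score is s_0(x) = -x, so the best linear
# model should approach W = -I and b = 0
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 2))
W, b = np.zeros((2, 2)), np.zeros(2)
for _ in range(500):                                  # plain gradient descent on J_n
    S = X @ W.T + b
    W -= 0.1 * (np.eye(2) + S.T @ X / len(X))         # gradient of J_n in W
    b -= 0.1 * S.mean(axis=0)                         # gradient of J_n in b
print(score_matching_loss(W, b, X), np.round(W, 2), np.round(b, 2))
```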

Our goal is to derive upper bounds on the excess risk of $\hat{s}_n$, which is defined as

J(\hat{s}_{n})-J(s_{0})=\frac{1}{2}\mathbb{E}_{X}\|\hat{s}_{n}(X)-s_{0}(X)\|_{2}^{2}.

To obtain an upper bound on $J(\hat{s}_n)-J(s_0)$, we decompose it into two parts, the stochastic error and the approximation error, and then derive upper bounds for each. Let $s_n=\arg\min_{s\in\mathcal{F}_n}J(s)$; then

J(\hat{s}_{n})-J(s_{0})
=\{J(\hat{s}_{n})-J_{n}(\hat{s}_{n})\}+\{J_{n}(\hat{s}_{n})-J_{n}(s_{n})\}+\{J_{n}(s_{n})-J(s_{n})\}+\{J(s_{n})-J(s_{0})\}
\leq\{J(\hat{s}_{n})-J_{n}(\hat{s}_{n})\}+\{J_{n}(s_{n})-J(s_{n})\}+\{J(s_{n})-J(s_{0})\}
\leq 2\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)|+\inf_{s\in\mathcal{F}_{n}}\{J(s)-J(s_{0})\},

where we call 2\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)| the stochastic error and \inf_{s\in\mathcal{F}_{n}}\{J(s)-J(s_{0})\} the approximation error.

It is important to highlight that the analysis of the stochastic and approximation errors for DSME is unconventional. On one hand, since J(s)-J(s_{0})=\frac{1}{2}\mathbb{E}_{X}\|s(X)-s_{0}(X)\|_{2}^{2} holds for any s, the approximation error can be controlled by the L_{2} approximation of s_{0}. Thus, Theorem 5 provides a bound for the approximation error \inf_{s\in\mathcal{F}_{n}}[J(s)-J(s_{0})]. On the other hand, the empirical squared distance loss \sum_{i=1}^{n}\|s(X_{i})-s_{0}(X_{i})\|^{2}_{2}/(2n) is not equivalent to the surrogate loss J_{n}. In other words, the minimizer \hat{s}_{n} of J_{n} may differ from the minimizer of the empirical squared distance \sum_{i=1}^{n}\|s(X_{i})-s_{0}(X_{i})\|^{2}_{2}/(2n) over s\in\mathcal{F}_{n}. Consequently, the stochastic error can only be analyzed through the formulation of J rather than the squared loss, and it therefore depends on the complexities of the RePU network class \mathcal{F}_{n} as well as the class of its derivatives \mathcal{F}_{n}^{\prime}. Based on Theorem 1, Lemma 2 and empirical process theory, the stochastic error is expected to be bounded by \mathcal{O}(({\rm Pdim}(\mathcal{F}_{n})+{\rm Pdim}(\mathcal{F}_{n}^{\prime}))^{1/2}n^{-1/2}). Finally, combining these two error bounds yields the following bound on the mean squared error of the empirical risk minimizer \hat{s}_{n} defined in (9).

Lemma 10

Suppose that Assumption 9 holds and the target score function s_{0} belongs to C^{m}(\mathcal{X}) for some m\in\mathbb{N}^{+}. For any N\in\mathbb{N}^{+}, let \mathcal{F}_{n}:=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}} be the class of RePU-activated neural networks f:\mathcal{X}\to\mathbb{R}^{d} with depth \mathcal{D}=\lceil\log_{p}(N)\rceil, width \mathcal{W}=2(N+d)!/(N!d!), number of neurons \mathcal{U}=2\lceil\log_{p}(N)\rceil(N+d)!/(N!d!) and size \mathcal{S}=2(\lceil\log_{p}(N)\rceil+d+1)(N+d)!/(N!d!), and suppose that \mathcal{B}\geq\|s_{0}\|_{C^{0}} and \mathcal{B}^{\prime}\geq\max_{i=1,\ldots,d}\|(\nabla_{x}s_{0})_{ii}\|_{C^{1}}. Then the empirical risk minimizer \hat{s}_{n} defined in (9) satisfies

\mathbb{E}\|\hat{s}_{n}(X)-s_{0}(X)\|^{2}_{2}\leq\mathcal{E}_{sto}+\mathcal{E}_{app}, \quad (10)

with

\mathcal{E}_{sto}=C_{1}pd^{3}(\mathcal{B}^{2}+2\mathcal{B}^{\prime})(\log n)^{1/2}{n}^{-1/2}(\log_{p}N)^{2}N^{d/2},
\mathcal{E}_{app}={C}_{2}N^{-2m}\|s_{0}\|_{C^{0}}^{2},

where the expectation \mathbb{E} is taken with respect to X and \hat{s}_{n}, C_{1}>0 is a universal constant, and C_{2}>0 is a constant depending only on p,d,m and the diameter of \mathcal{X}.

Remark 11

Lemma 10 establishes a bound on the mean squared error of the empirical risk minimizer: the error is bounded by the sum of the stochastic error \mathcal{E}_{sto} and the approximation error \mathcal{E}_{app}. The stochastic error \mathcal{E}_{sto} decreases in the sample size n but increases in the network size as determined by N, whereas the approximation error \mathcal{E}_{app} decreases in the network size as determined by N. To attain a fast convergence rate with respect to n, it is therefore necessary to balance these two errors by selecting an appropriate N for a given sample size n.
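To make this trade-off explicit, note that, ignoring constants and logarithmic factors, \mathcal{E}_{sto}\asymp n^{-1/2}N^{d/2} and \mathcal{E}_{app}\asymp N^{-2m}, so equating the two terms gives

n^{-1/2}N^{d/2}\asymp N^{-2m}\quad\Longleftrightarrow\quad N\asymp n^{\frac{1}{d+4m}},

under which both terms are of order n^{-2m/(d+4m)}; this is precisely the choice of N used in Theorem 13 below.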

Remark 12

In Lemma 10, the error bounds are stated in terms of the integer N. They can also be expressed in terms of the number of neurons \mathcal{U} and the size \mathcal{S}, since we have specified \mathcal{U}=2\lceil\log_{p}(N)\rceil(N+d)!/(N!d!) and \mathcal{S}=2(\lceil\log_{p}(N)\rceil+d+1)(N+d)!/(N!d!), which relate the number of neurons and the size of the network to N and the dimension of X.

Lemma 10 leads to the following error bound for the score-matching estimator.

Theorem 13 (Non-asymptotic excess risk bounds)

Under the conditions of Lemma 10, we set N=\lfloor n^{1/(d+4m)}\rfloor. Then by (10), the empirical risk minimizer \hat{s}_{n} defined in (9) satisfies

\mathbb{E}\|\hat{s}_{n}(X)-s_{0}(X)\|_{2}^{2}\leq C(\log n)n^{-\frac{2m}{d+4m}},

where C is a constant depending only on p,\mathcal{B},\mathcal{B}^{\prime},m,d,\mathcal{X} and \|s_{0}\|_{C^{1}}.

In Theorem 13, the convergence rate in the error bound is n^{-\frac{2m}{d+4m}} up to a logarithmic factor. While this rate is slightly slower than the minimax optimal rate n^{-\frac{2m}{d+2m}} for nonparametric regression (Stone, 1982), it remains reasonable given the nature of score matching estimation: the objective involves derivatives and the values of the target score function are not directly observed, unlike the traditional nonparametric regression setting of Stone (1982), where both predictors and responses are observed and no derivatives are involved. Nevertheless, the rate n^{-\frac{2m}{d+4m}} can be extremely slow for large d, reflecting the curse of dimensionality. To address this issue, we derive error bounds under the approximate low-dimensional support assumption stated in Assumption 7.

Lemma 14

Suppose that Assumptions 7 and 9 hold and the target score function s_{0} belongs to C^{m}(\mathcal{X}) for some m\in\mathbb{N}^{+}. Let d_{\delta}=c\cdot d_{\mathcal{M}}\log(d\cdot VR\tau^{-1}/\delta)/\delta^{2} be an integer with d_{\delta}\leq d for some \delta\in(0,1) and universal constant c>0. For any N\in\mathbb{N}^{+}, let \mathcal{F}_{n}:=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}} be the class of RePU-activated neural networks f:\mathcal{X}\to\mathbb{R}^{d} with depth \mathcal{D}=\lceil\log_{p}(N)\rceil, width \mathcal{W}=2(N+d_{\delta})!/(N!d_{\delta}!), number of neurons \mathcal{U}=2\lceil\log_{p}(N)\rceil(N+d_{\delta})!/(N!d_{\delta}!) and size \mathcal{S}=2(\lceil\log_{p}(N)\rceil+d_{\delta}+1)(N+d_{\delta})!/(N!d_{\delta}!). Suppose that \mathcal{B}\geq\|s_{0}\|_{C^{0}} and \mathcal{B}^{\prime}\geq\max_{i=1,\ldots,d}\|(\nabla_{x}s_{0})_{ii}\|_{C^{1}}. Then the empirical risk minimizer \hat{s}_{n} defined in (9) satisfies

\mathbb{E}\|\hat{s}_{n}(X)-s_{0}(X)\|^{2}_{2}\leq\mathcal{E}_{sto}+\tilde{\mathcal{E}}_{app}, \quad (11)

with

\mathcal{E}_{sto}=C_{1}pd^{2}d_{\delta}(\mathcal{B}^{2}+2\mathcal{B}^{\prime})(\log n)^{1/2}{n}^{-1/2}N^{d_{\delta}/2},
\tilde{\mathcal{E}}_{app}={C}_{2}(1-\delta)^{-2}\|s_{0}\|_{C^{0}}^{2}N^{-2m},

for \rho\leq C_{\rho}N^{-2m}, where C_{\rho},C_{1}>0 are universal constants and C_{2}>0 is a constant depending only on p,d,d_{\delta},m and \mathcal{M}_{\rho}.

Under the approximate low-dimensional support assumption, Lemma 14 implies that a faster convergence rate can be achieved for the deep score estimator.

Theorem 15 (Improved non-asymptotic excess risk bounds)

Under the conditions of Lemma 14, we can set N=\lfloor n^{1/(d_{\delta}+4m)}\rfloor in (11); then the empirical risk minimizer \hat{s}_{n} defined in (9) satisfies

\mathbb{E}\|\hat{s}_{n}(X)-s_{0}(X)\|_{2}^{2}\leq Cn^{-\frac{2m}{d_{\delta}+4m}},

where C is a constant depending only on p,\mathcal{B},\mathcal{B}^{\prime},m,d,d_{\delta},\mathcal{X} and \|s_{0}\|_{C^{1}}.

5 Deep isotonic regression

As another application of our results on RePU-activated networks, we propose PDIR, a penalized deep isotonic regression approach that uses RePU networks together with a penalty based on the derivatives of the networks to enforce monotonicity. We also establish error bounds for PDIR.

Suppose we have a random sample S:=\{(X_{i},Y_{i})\}_{i=1}^{n} from model (4). Recall that \mathcal{R}^{\lambda} is the proposed population objective function for isotonic regression defined in (7). We consider the empirical counterpart of \mathcal{R}^{\lambda}:

\mathcal{R}^{\lambda}_{n}(f)=\frac{1}{n}\sum_{i=1}^{n}\Big\{|Y_{i}-f(X_{i})|^{2}+\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\dot{f}_{j}(X_{i}))\Big\}. \quad (12)

A simple choice of the penalty function \rho is \rho(x)=\max\{-x,0\}. In general, we can take \rho(x)=h(\max\{-x,0\}) for a function h with h(0)=0. We focus on Lipschitz penalty functions as defined below.

Assumption 16 (Lipschitz penalty function)

The penalty function \rho(\cdot):\mathbb{R}\to[0,\infty) satisfies \rho(x)=0 if x\geq 0. In addition, \rho is \kappa-Lipschitz, i.e., |\rho(x_{1})-\rho(x_{2})|\leq\kappa|x_{1}-x_{2}| for any x_{1},x_{2}\in\mathbb{R}.
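As an illustration, with the simple choice \rho(x)=\max\{-x,0\} (which satisfies Assumption 16 with \kappa=1), the empirical objective (12) can be implemented with automatic differentiation in the same way as the score matching loss in Section 4; the function name pdir_loss and the interface below are ours and serve only as a sketch.

```python
import torch


def pdir_loss(f, X, Y, lam):
    """Empirical PDIR objective (12) with rho(x) = max(-x, 0):

    (1/n) sum_i [ |Y_i - f(X_i)|^2 + (1/d) sum_j lam_j * rho(df/dx_j (X_i)) ].
    """
    X = X.requires_grad_(True)
    pred = f(X).squeeze(-1)                      # f maps (n, d) to (n,) or (n, 1)
    sq_loss = (Y - pred) ** 2
    # partial derivatives df/dx_j at each sample point, via one backward pass
    grads = torch.autograd.grad(pred.sum(), X, create_graph=True)[0]   # shape (n, d)
    penalty = (torch.as_tensor(lam) * torch.clamp(-grads, min=0.0)).mean(dim=1)
    return (sq_loss + penalty).mean()
```

The penalty term vanishes wherever the fitted partial derivatives are nonnegative, so minimizing this loss encourages coordinate-wise monotonicity without imposing it as a hard constraint.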

Let the empirical risk minimizer of deep isotonic regression be denoted by

\hat{f}_{n}^{\lambda}:\in\arg\min_{f\in\mathcal{F}_{n}}\mathcal{R}^{\lambda}_{n}(f), \quad (13)

where \mathcal{F}_{n} is a class of functions computed by deep neural networks whose architecture may depend on the sample size n. We refer to \hat{f}^{\lambda}_{n} as a penalized deep isotonic regression (PDIR) estimator.

Figure 1: Examples of PDIR estimates. In all figures, the data points are depicted as grey dots, the underlying regression functions are plotted as solid black curves, and PDIR estimates with different levels of the penalty parameter \lambda are plotted as colored curves. In the top two figures, data are generated from models with monotonic regression functions. In the bottom left figure, the target function is a constant. In the bottom right figure, the model is misspecified: the underlying regression function is not monotonic. Small values of \lambda can lead to non-monotonic estimated functions.

An illustration of PDIR is presented in Figure 1. In the top two panels, data are generated from models with monotonic regression functions; in the bottom left panel, the target function is a constant; and in the bottom right panel, the model is misspecified, with a non-monotonic underlying regression function. In the misspecified case, small values of \lambda lead to non-monotonic but reasonable estimates, suggesting that PDIR is robust against model misspecification. We have conducted further numerical experiments to evaluate the performance of PDIR, which indicate that PDIR tends to perform better than the existing isotonic regression methods considered in the comparison. The results are given in Appendix A.

5.1 Non-asymptotic error bounds for PDIR

In this section, we state our main results on the excess risk bounds for the PDIR estimator defined in (13). Recall the definition of \mathcal{R}^{\lambda} in (7). For notational simplicity, we write

\mathcal{R}(f)=\mathcal{R}^{0}(f)=\mathbb{E}|Y-f(X)|^{2}. \quad (14)

The target function f_{0} is the minimizer of the risk \mathcal{R}(f) over measurable functions, i.e., f_{0}\in\arg\min_{f}\mathcal{R}(f). In isotonic regression, we assume that f_{0}\in\mathcal{F}_{0}. In addition, for any function f, under the regression model (4), we have

\mathcal{R}(f)-\mathcal{R}(f_{0})=\mathbb{E}|f(X)-f_{0}(X)|^{2}.

We first state the conditions needed for establishing the excess risk bounds.

Assumption 17

(i) The target regression function f_{0}:\mathcal{X}\to\mathbb{R} defined in (4) is coordinate-wise nondecreasing on \mathcal{X}, i.e., f_{0}(x)\leq f_{0}(y) if x\preceq y for x,y\in\mathcal{X}\subseteq\mathbb{R}^{d}. (ii) The errors \epsilon_{i}, i=1,\ldots,n, are independent and identically distributed with \mathbb{E}(\epsilon_{i})=0 and {\rm Var}(\epsilon_{i})\leq\sigma^{2}, and are independent of \{X_{i}\}_{i=1}^{n}.

Assumption 17 includes basic model assumptions on the errors and the monotonic target function f_{0}. In addition, we assume that the target function f_{0} belongs to the class C^{s}.

Next, we state the following basic lemma for bounding the excess risk.

Lemma 18 (Excess risk decomposition)

For the empirical risk minimizer \hat{f}^{\lambda}_{n} defined in (13), its excess risk can be upper bounded by

\mathbb{E}|\hat{f}^{\lambda}_{n}(X)-f_{0}(X)|^{2}=\mathbb{E}\Big\{\mathcal{R}(\hat{f}_{n}^{\lambda})-\mathcal{R}(f_{0})\Big\}\leq\mathbb{E}\Big\{\mathcal{R}^{\lambda}(\hat{f}_{n}^{\lambda})-\mathcal{R}^{\lambda}(f_{0})\Big\}
\leq\mathbb{E}\Big\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-2\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}(f_{0})\Big\}+2\inf_{f\in\mathcal{F}_{n}}\Big[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})\Big].

The upper bound on the excess risk decomposes into two components: the stochastic error, given by the expected value of \mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-2\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}(f_{0}), and the approximation error, given by \inf_{f\in\mathcal{F}_{n}}[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]. To bound the stochastic error, it is necessary to consider the complexities of both RePU networks and their derivatives, which are investigated in Theorem 1 and Lemma 2. To bound the approximation error \inf_{f\in\mathcal{F}_{n}}[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})], we rely on the simultaneous approximation results in Theorem 5.

Remark 19

The error decomposition in Lemma 18 differs from the canonical decomposition used for score estimation in Section 4.1, particularly in the stochastic error component. Utilizing the decomposition in Lemma 18 enables us to derive a sharper stochastic error bound by leveraging the properties of the PDIR loss function. A similar decomposition for the least squares loss without penalization can be found in Jiao et al. (2023).

Lemma 20

Suppose that Assumptions 16 and 17 hold and the target function f_{0} defined in (4) belongs to C^{s} for some s\in\mathbb{N}^{+}. For any N\in\mathbb{N}^{+}, let \mathcal{F}_{n}:=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}} be the class of RePU-activated neural networks f:\mathcal{X}\to\mathbb{R} with depth \mathcal{D}=\lceil\log_{p}(N)\rceil, width \mathcal{W}=2(N+d)!/(N!d!), number of neurons \mathcal{U}=2\lceil\log_{p}(N)\rceil(N+d)!/(N!d!) and size \mathcal{S}=2(\lceil\log_{p}(N)\rceil+d+1)(N+d)!/(N!d!). Suppose that \mathcal{B}\geq\|f_{0}\|_{C^{0}} and \mathcal{B}^{\prime}\geq\|f_{0}\|_{C^{1}}. Then for n\geq\max\{{\rm Pdim}(\mathcal{F}_{n}),{\rm Pdim}(\mathcal{F}^{\prime}_{n})\}, the excess risk of PDIR \hat{f}^{\lambda}_{n} defined in (13) satisfies

\mathbb{E}|\hat{f}^{\lambda}_{n}(X)-f_{0}(X)|^{2}\leq\mathcal{E}_{sto}+\mathcal{E}_{app}, \quad (15)
\mathbb{E}\big[\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))\big]\leq\mathcal{E}_{sto}+\mathcal{E}_{app}, \quad (16)

with

\mathcal{E}_{sto}=C_{1}d^{3}\big\{p^{3}\mathcal{B}^{3}+(\kappa\bar{\lambda}\mathcal{B}^{\prime})^{2}\big\}(\log n)^{2}n^{-1}(\log_{p}N)^{3}N^{d},
\mathcal{E}_{app}=C_{2}\|f_{0}\|^{2}_{C^{1}}(N^{-2s}+\kappa\bar{\lambda}N^{-(s-1)}),

where the expectation \mathbb{E} is taken with respect to X and \hat{f}^{\lambda}_{n}, \bar{\lambda}=\sum_{j=1}^{d}\lambda_{j}/d is the mean of the tuning parameters, C_{1}>0 is a universal constant and C_{2}>0 is a positive constant depending only on d,s and the diameter of the support \mathcal{X}.

Lemma 20 establishes two error bounds for the PDIR estimator \hat{f}^{\lambda}_{n}: (15) bounds the mean squared error between \hat{f}^{\lambda}_{n} and the target f_{0}, and (16) controls the non-monotonicity of \hat{f}^{\lambda}_{n} through its partial derivatives \frac{\partial}{\partial x_{j}}\hat{f}_{n}, j=1,\ldots,d, with respect to a measure defined in terms of \rho. Both bounds encompass a stochastic error and an approximation error. In particular, the stochastic error is of order \mathcal{O}(N^{d}/n), which improves on the canonical error bound of \mathcal{O}([N^{d}/n]^{1/2}), up to logarithmic factors in n. This improvement is due to the decomposition in Lemma 18 and the properties of the PDIR loss function, and differs from traditional decomposition techniques.

Remark 21

In (16), the estimator \hat{f}^{\lambda}_{n} is encouraged to be monotonic, as the expected monotonicity penalty \mathbb{E}[\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))] is bounded. Notably, when \mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))]=0, the estimator \hat{f}^{\lambda}_{n} is almost surely monotonic in its j-th argument with respect to the probability measure of X. Based on (16), guarantees of the estimator's monotonicity with respect to a single argument can also be obtained: for those j with \lambda_{j}\not=0, we have \mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))]\leq d(\mathcal{E}_{sto}+\mathcal{E}_{app})/\lambda_{j}, which provides a guarantee of monotonicity with respect to the j-th argument. Moreover, larger values of \lambda_{j} lead to smaller bounds, which is consistent with the intuition that larger \lambda_{j} better promotes the monotonicity of \hat{f}^{\lambda}_{n} in its j-th argument.

Theorem 22 (Non-asymptotic excess risk bounds)

Under the conditions of Lemma 20, to achieve the smallest error bound in (15), we set N=\lfloor n^{1/(d+2s)}\rfloor and \lambda_{j}=n^{-(s+1)/(d+2s)} for j=1,\ldots,d. Then we have

\mathbb{E}|\hat{f}^{\lambda}_{n}(X)-f_{0}(X)|^{2}\leq C(\log n)^{5}n^{-\frac{2s}{d+2s}},

and

\mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))]\leq C(\log n)^{5}n^{-\frac{s-1}{d+2s}},

for j=1,\ldots,d, where C>0 is a constant depending only on \mathcal{B},\mathcal{B}^{\prime},s,d,\mathcal{X},\|f_{0}\|_{C^{s}} and \kappa.

By Theorem 22, the proposed PDIR estimator with a properly chosen network architecture and tuning parameters achieves the minimax optimal rate \mathcal{O}(n^{-\frac{2s}{d+2s}}), up to logarithmic factors, for nonparametric regression (Stone, 1982). Meanwhile, the PDIR estimator \hat{f}^{\lambda}_{n} is guaranteed to be approximately monotonic, as measured by \mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))], at a rate of \mathcal{O}(n^{-(s-1)/(d+2s)}) up to a logarithmic factor.

Remark 23

In Theorem 22, we choose \lambda_{j}=n^{-(s+1)/(d+2s)}, j=1,\ldots,d, to attain the optimal rate of the expected mean squared error of \hat{f}^{\lambda}_{n} up to a logarithmic factor. Additionally, the estimator \hat{f}^{\lambda}_{n} is guaranteed to be approximately monotonic, as measured by \mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))], at a rate of n^{-(s-1)/(d+2s)} up to a logarithmic factor. The choice of \lambda_{j} ensuring the consistency of \hat{f}^{\lambda}_{n} is not unique: any choice with \bar{\lambda}=o((\log n)^{-2}n^{(s-1)/(d+2s)}) results in a consistent \hat{f}^{\lambda}_{n}. Larger values of \bar{\lambda} lead to a slower convergence rate of the expected mean squared error, but a better guarantee of the monotonicity of \hat{f}^{\lambda}_{n}.

The smoothness s of the target function f_{0} is unknown in practice, and determining the smoothness of an unknown function is an important but nontrivial problem. Note that the convergence rate (\log n)^{5}n^{-2s/(d+2s)} suffers from the curse of dimensionality, since it can be extremely slow when d is large.

High-dimensional data have low-dimensional latent structures in many applications. Below we show that PDIR can mitigate the curse of dimensionality if the data distribution is supported on an approximate low-dimensional manifold.

Lemma 24

Suppose that Assumptions 7, 16 and 17 hold and the target function f_{0} defined in (4) belongs to C^{s} for some s\in\mathbb{N}^{+}. Let d_{\delta}=c\cdot d_{\mathcal{M}}\log(d\cdot VR\tau^{-1}/\delta)/\delta^{2} be an integer with d_{\delta}\leq d for some \delta\in(0,1) and universal constant c>0. For any N\in\mathbb{N}^{+}, let \mathcal{F}_{n}:=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}} be the class of RePU-activated neural networks f:\mathcal{X}\to\mathbb{R} with depth \mathcal{D}=\lceil\log_{p}(N)\rceil, width \mathcal{W}=2(N+d_{\delta})!/(N!d_{\delta}!), number of neurons \mathcal{U}=2\lceil\log_{p}(N)\rceil(N+d_{\delta})!/(N!d_{\delta}!) and size \mathcal{S}=2(\lceil\log_{p}(N)\rceil+d_{\delta}+1)(N+d_{\delta})!/(N!d_{\delta}!). Suppose that \mathcal{B}\geq\|f_{0}\|_{C^{0}} and \mathcal{B}^{\prime}\geq\|f_{0}\|_{C^{1}}. Then for n\geq\max\{{\rm Pdim}(\mathcal{F}_{n}),{\rm Pdim}(\mathcal{F}^{\prime}_{n})\}, the excess risk of the PDIR estimator \hat{f}^{\lambda}_{n} defined in (13) satisfies

\mathbb{E}|\hat{f}^{\lambda}_{n}(X)-f_{0}(X)|^{2}\leq\mathcal{E}_{sto}+\tilde{\mathcal{E}}_{app}, \quad (17)
\mathbb{E}\big[\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))\big]\leq\mathcal{E}_{sto}+\tilde{\mathcal{E}}_{app}, \quad (18)

with

\mathcal{E}_{sto}=C_{1}d^{2}d_{\delta}\big\{p^{3}\mathcal{B}^{3}+(\kappa\bar{\lambda}\mathcal{B}^{\prime})^{2}\big\}(\log n)^{2}n^{-1}N^{d_{\delta}},
\tilde{\mathcal{E}}_{app}=C_{2}(1-\delta)^{2}\|f_{0}\|^{2}_{C^{s}}(N^{-2s}+\kappa\bar{\lambda}N^{-(s-1)}),

for \rho\leq C_{\rho}N^{-(s-|\alpha|_{1})}, where C_{\rho},C_{1}>0 are universal constants and C_{2}>0 is a constant depending only on d_{\delta},s and the diameter of the support \mathcal{M}_{\rho}.

Based on Lemma 24, we obtain the following result.

Theorem 25 (Improved non-asymptotic excess risk bounds)

Under the conditions of Lemma 24, to achieve the smallest error bound in (17), we set N=\lfloor n^{1/(d_{\delta}+2s)}\rfloor and \lambda_{j}=n^{-(s+1)/(d_{\delta}+2s)} for j=1,\ldots,d. Then we have

\mathbb{E}|\hat{f}^{\lambda}_{n}(X)-f_{0}(X)|^{2}\leq C(\log n)^{5}n^{-\frac{2s}{d_{\delta}+2s}},

and for j=1,\ldots,d,

\mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))]\leq C(\log n)^{5}n^{-\frac{s-1}{d_{\delta}+2s}},

where C>0 is a constant depending only on \mathcal{B},\mathcal{B}^{\prime},s,d,d_{\delta},\mathcal{M}_{\rho},\|f_{0}\|_{C^{s}} and \kappa.

In Theorem 25, the effective dimension is d_{\delta} rather than the ambient dimension d. Therefore, the rate of convergence improves on the result in Theorem 22 when the intrinsic dimension d_{\delta} is smaller than d.

5.2 PDIR under model misspecification

In this subsection, we investigate PDIR under model misspecification when Assumption 17 (i) is not satisfied, meaning that the underlying regression function f_{0} may not be monotonic.

Let S:=\{(X_{i},Y_{i})\}_{i=1}^{n} be a random sample from model (4). Recall that the penalized risk of deep isotonic regression is given by

\mathcal{R}^{\lambda}(f)=\mathbb{E}|Y-f(X)|^{2}+\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\mathbb{E}\rho(\dot{f}_{j}(X)).

If f_{0} is not monotonic, the penalty \sum_{j=1}^{d}\lambda_{j}\mathbb{E}[\rho(\dot{f}_{j}(X))]/d is non-zero, and consequently f_{0} is not a minimizer of the risk \mathcal{R}^{\lambda} when \lambda_{j}\not=0 for all j. Intuitively, the deep isotonic regression estimator will then be biased relative to the target f_{0} due to the additional penalty terms in the risk. However, it is reasonable to expect that the estimator \hat{f}^{\lambda}_{n} has a smaller bias if \lambda_{j}, j=1,\ldots,d, are small. In the following lemma, we establish a non-asymptotic upper bound for the proposed deep isotonic regression estimator under model misspecification.

Lemma 26

Suppose that Assumptions 16 and 17 (ii) hold and the target function f_{0} defined in (4) belongs to C^{s} for some s\in\mathbb{N}^{+}. For any N\in\mathbb{N}^{+}, let \mathcal{F}_{n}:=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}} be the class of RePU-activated neural networks f:\mathcal{X}\to\mathbb{R} with depth \mathcal{D}=\lceil\log_{p}(N)\rceil, width \mathcal{W}=2(N+d)!/(N!d!), number of neurons \mathcal{U}=2\lceil\log_{p}(N)\rceil(N+d)!/(N!d!) and size \mathcal{S}=2(\lceil\log_{p}(N)\rceil+d+1)(N+d)!/(N!d!). Suppose that \mathcal{B}\geq\|f_{0}\|_{C^{0}} and \mathcal{B}^{\prime}\geq\|f_{0}\|_{C^{1}}. Then for n\geq\max\{{\rm Pdim}(\mathcal{F}_{n}),{\rm Pdim}(\mathcal{F}^{\prime}_{n})\}, the excess risk of the PDIR estimator \hat{f}^{\lambda}_{n} defined in (13) satisfies

\mathbb{E}|\hat{f}^{\lambda}_{n}(X)-f_{0}(X)|^{2}\leq\mathcal{E}_{sto}+\mathcal{E}_{app}+\mathcal{E}_{mis}, \quad (19)
\mathbb{E}\big[\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))\big]\leq\mathcal{E}_{sto}+\mathcal{E}_{app}+\mathcal{E}_{mis}, \quad (20)

with

\mathcal{E}_{sto}=C_{1}p^{2}d^{3}(\mathcal{B}^{2}+\kappa\bar{\lambda}\mathcal{B}^{\prime})(\log n)^{1/2}n^{-1/2}(\log_{p}N)^{3/2}N^{d/2},
\mathcal{E}_{app}=C_{2}\|f_{0}\|^{2}_{C^{1}}(N^{-2s}+\kappa\bar{\lambda}N^{-(s-1)}),
\mathcal{E}_{mis}=\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}f_{0}(X))],

where the expectation \mathbb{E} is taken with respect to X and \hat{f}^{\lambda}_{n}, \bar{\lambda}=\sum_{j=1}^{d}\lambda_{j}/d is the mean of the tuning parameters, C_{1}>0 is a universal constant and C_{2}>0 is a positive constant depending only on d,s and the diameter of the support \mathcal{X}.

Lemma 26 is a generalized version of Lemma 20 for PDIR, as it holds regardless of whether the target function is isotonic. In Lemma 26, the expected mean squared error of the PDIR estimator \hat{f}^{\lambda}_{n} is bounded by the sum of three terms: the stochastic error \mathcal{E}_{sto}, the approximation error \mathcal{E}_{app}, and the misspecification error \mathcal{E}_{mis}, without the monotonicity assumption. Compared with Lemma 20, which requires the monotonicity assumption, the approximation error is identical, the stochastic error is worse in order, and the misspecification error appears as an extra term. With an appropriate choice of N relative to the sample size n, the stochastic and approximation errors converge to zero, albeit at a slower rate than in Theorem 22. However, the misspecification error remains constant for fixed tuning parameters \lambda_{j}. Thus, we can let the tuning parameters \lambda_{j} converge to zero to achieve consistency.

Remark 27

It is worth noting that if the target function is isotonic, then the misspecification error vanishes, recovering the isotonic regression setting. However, the convergence rate based on Lemma 26 is slower than that in Lemma 20, because Lemma 26 is general and holds without prior knowledge of the monotonicity of the target function. If it is known that the target function f_{0} is not monotonic in its j-th argument, setting the corresponding \lambda_{j}=0 removes its contribution to the misspecification error and improves the upper bound.

Theorem 28 (Non-asymptotic excess risk bounds)

Under the conditions of Lemma 26, to achieve the fastest convergence rate in (19), we set N=\lfloor n^{1/(d+4s)}\rfloor and \lambda_{j}=n^{-2s/(d+4s)} for j=1,\ldots,d. Then we have

\mathbb{E}|\hat{f}^{\lambda}_{n}(X)-f_{0}(X)|^{2}\leq C(\log n)n^{-\frac{2s}{d+4s}},

where C>0 is a constant depending only on \mathcal{B},\mathcal{B}^{\prime},s,d,\mathcal{X},\|f_{0}\|_{C^{s}} and \kappa.

According to Lemma 26, under model misspecification, the prediction error bound for PDIR is minimized when \lambda_{j}=0 for j=1,\ldots,d, in which case the misspecification error \mathcal{E}_{mis} vanishes. Consequently, the optimal convergence rate with respect to n can be achieved by setting N=\mathcal{O}(\lfloor n^{1/(d+4s)}\rfloor) and \lambda_{j}=0 for j=1,\ldots,d. It is worth noting that the prediction error of PDIR achieves this rate as long as \bar{\lambda}=\mathcal{O}(n^{-2s/(d+4s)}).

Remark 29

According to Theorem 28, the choice of \lambda_{j} ensuring the consistency of PDIR is not unique: consistency is guaranteed, even under a misspecified model, whenever \lambda_{j}, j=1,\ldots,d, tend to zero as n\to\infty. Additionally, selecting a smaller value of \bar{\lambda} yields a better upper bound in (19), and an optimal rate up to logarithmic factors in n can be achieved with a sufficiently small \bar{\lambda}=O(n^{-2s/(d+4s)}). An example demonstrating the effect of the tuning parameters is visualized in the last subfigure of Figure 1.

6 Related works

In this section, we briefly review the papers in the existing literature that are most related to the present work.

6.1 ReLU and RePU networks

Deep learning has achieved impressive success in a wide range of applications. A fundamental reason for these successes is the ability of deep neural networks to approximate high-dimensional functions and extract effective data representations. There has been much effort devoted to studying the approximation properties of deep neural networks in recent years. Many interesting results have been obtained concerning the approximation power of deep neural networks for multivariate functions. Examples include Chen et al. (2019), Schmidt-Hieber (2020), Jiao et al. (2023). These works focused on the power of ReLU-activated neural networks for approximating various types of smooth functions.

For the approximation of the square function by ReLU networks, Yarotsky (2017) first used "sawtooth" functions, achieving an error rate of \mathcal{O}(2^{-L}) with width 6 and depth \mathcal{O}(L) for any positive integer L\in\mathbb{N}^{+}. A general construction of ReLU networks for approximating the square function achieves an error of N^{-L} with width 3N and depth L for any positive integers N,L\in\mathbb{N}^{+} (Lu et al., 2021b). Building on this basic fact, ReLU networks approximating multiplication and polynomials can be constructed accordingly. However, the network size (depth and width) required for a ReLU network to achieve a precise approximation can be large compared to that of a RePU network, since a RePU network can compute polynomials exactly with fewer layers and neurons.

The approximation results for RePU networks are generally obtained by converting splines or polynomials into RePU networks and making use of the approximation properties of splines and polynomials. The universality of sigmoidal deep neural networks was studied in the pioneering works of Mhaskar (1993) and Chui et al. (1994). In addition, the approximation properties of shallow Rectified Power Unit (RePU) activated networks were studied in Klusowski and Barron (2018) and Siegel and Xu (2022). The approximation rates of deep RePU neural networks for target functions in different spaces have also been explored, including Besov spaces (Ali and Nouy, 2021), Sobolev spaces (Li et al., 2019, 2020; Duan et al., 2021; Abdeljawad and Grohs, 2022), and Hölder spaces (Belomestny et al., 2022). Most existing results on the expressiveness of neural networks measure the quality of approximation with respect to the L_{p} norm with p\geq 1. Fewer papers have studied the approximation of the derivatives of smooth functions (Duan et al., 2021; Gühring and Raslan, 2021; Belomestny et al., 2022).

6.2 Related works on score estimation

Learning a probability distribution from data is a fundamental task in statistics and machine learning for efficient generation of new samples from the learned distribution. Likelihood-based models approach this problem by directly learning the probability density function, but they have several limitations, such as an intractable normalizing constant and approximate maximum likelihood training.

One alternative approach to circumvent these limitations is to model the score function (Liu et al., 2016), which is the gradient of the logarithm of the probability density function. Score-based models can be learned using a variety of methods, including parametric score matching methods (Hyvärinen and Dayan, 2005; Sasaki et al., 2014), autoencoders as its denoising variants (Vincent, 2011), sliced score matching (Song et al., 2020), nonparametric score matching (Sriperumbudur et al., 2017; Sutherland et al., 2018), and kernel estimators based on Stein’s methods (Li and Turner, 2017; Shi et al., 2018). These score estimators have been applied in many research problems, such as gradient flow and optimal transport methods (Gao et al., 2019, 2022), gradient-free adaptive MCMC (Strathmann et al., 2015), learning implicit models (Warde-Farley and Bengio, 2016), inverse problems (Jalal et al., 2021). Score-based generative learning models, especially those using deep neural networks, have achieved state-of-the-art performance in many downstream tasks and applications, including image generation (Song and Ermon, 2019, 2020; Song et al., 2021; Ho et al., 2020; Dhariwal and Nichol, 2021; Ho et al., 2022), music generation (Mittal et al., 2021), and audio synthesis (Chen et al., 2020; Kong et al., 2020; Popov et al., 2021).

However, there is a lack of theoretical understanding of nonparametric score estimation using deep neural networks. Existing studies mainly considered kernel-based methods. Zhou et al. (2020) studied regularized nonparametric score estimators using vector-valued reproducing kernel Hilbert spaces, which connect the kernel exponential family estimator (Sriperumbudur et al., 2017) with score estimators based on Stein's method (Li and Turner, 2017; Shi et al., 2018). Consistency and convergence rates of these kernel-based score estimators were also established under the correctly-specified model assumption in Zhou et al. (2020). For denoising autoencoders, Block et al. (2020) obtained generalization bounds for general nonparametric estimators, also under the correctly-specified model assumption.

For score-based learning using deep neural networks, the main difficulty in establishing a theoretical foundation is the limited understanding of differentiable neural networks, since the derivatives of the networks are involved in the estimation of the score function. Previously, the non-differentiable Rectified Linear Unit (ReLU) activated deep neural network received much attention due to its attractive computational and optimization properties, and has been extensively studied in terms of its complexity (Bartlett et al., 1998; Anthony and Bartlett, 1999; Bartlett et al., 2019) and approximation power (Yarotsky, 2017; Petersen and Voigtlaender, 2018; Shen et al., 2020; Lu et al., 2021a; Jiao et al., 2023), based on which statistical learning theories for deep nonparametric estimation were established (Bauer and Kohler, 2019; Schmidt-Hieber, 2020; Jiao et al., 2023). For deep neural networks with differentiable activation functions, such as ReQU and RePU, the simultaneous approximation of a smooth function and its derivatives has been studied recently (Ali and Nouy, 2021; Belomestny et al., 2022; Siegel and Xu, 2022; Hon and Yang, 2022), but the statistical properties of differentiable networks are still largely unknown. To the best of our knowledge, statistical learning theory has only been investigated for ReQU networks in Shen et al. (2022), where a network representation of the derivatives of ReQU networks was developed and its complexity studied.

6.3 Related works on isotonic regression

There is a rich and extensive literature on univariate isotonic regression, which is too vast to be adequately summarized here. So we refer to the books Barlow et al. (1972) and Robertson et al. (1988) for a systematic treatment of this topic and review of earlier works. For more recent developments on the error analysis of nonparametric isotonic regression, we refer to Durot (2002); Zhang (2002); Durot (2007, 2008); Groeneboom and Jongbloed (2014); Chatterjee et al. (2015), and Yang and Barber (2019), among others.

Least squares isotonic regression estimators under fixed designs have been extensively studied. With a fixed design at points x_{1},\ldots,x_{n}, the L_{p} risk of the least squares estimator is defined by \mathcal{R}_{n,p}(\hat{f}_{0})=\mathbb{E}(n^{-1}\sum_{i=1}^{n}|\hat{f}_{0}(x_{i})-f_{0}(x_{i})|^{p})^{1/p}, where the least squares estimator \hat{f}_{0} is defined by

\hat{f}_{0}=\arg\min_{f\in\mathcal{F}_{0}}\frac{1}{n}\sum_{i=1}^{n}\{Y_{i}-f(x_{i})\}^{2}. \quad (21)

The problem can be restated in terms of isotonic vector estimation on directed acyclic graphs. Specifically, the design points \{x_{1},\ldots,x_{n}\} induce a directed acyclic graph G_{x}(V(G_{x}),E(G_{x})) with vertices V(G_{x})=\{1,\ldots,n\} and edges E(G_{x})=\{(i,j):x_{i}\preceq x_{j}\}. The class of isotonic vectors on G_{x} is defined by

\mathcal{M}(G_{x}):=\{\theta\in\mathbb{R}^{V(G_{x})}:\theta_{i}\leq\theta_{j}\ \text{for}\ x_{i}\preceq x_{j}\}.

Then the least squares estimation in (21) becomes the problem of estimating the target vector \theta_{0}=\{(\theta_{0})_{i}\}_{i=1}^{n}:=\{f_{0}(x_{i})\}_{i=1}^{n}\in\mathcal{M}(G_{x}). The least squares estimator \hat{\theta}_{0}=\{(\hat{\theta}_{0})_{i}\}_{i=1}^{n}:=\{\hat{f}_{0}(x_{i})\}_{i=1}^{n} is the projection of \{Y_{i}\}_{i=1}^{n} onto the polyhedral convex cone \mathcal{M}(G_{x}) (Han et al., 2019).
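In the univariate case, this projection can be computed exactly by the pool-adjacent-violators algorithm; the short sketch below uses scikit-learn for illustration (the package choice and the simulated data are ours), while the multivariate case requires the graph formulation above.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, size=64))
y = 2.0 * x + rng.normal(scale=0.5, size=64)   # noisy monotone data

# Least squares projection of y onto the monotone cone (pool-adjacent-violators)
theta_hat = IsotonicRegression(increasing=True).fit_transform(x, y)
```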

For univariate isotonic least squares regression with a target function f_{0} of bounded total variation, Zhang (2002) obtained sharp upper bounds on the \mathcal{R}_{n,p} risk of the least squares estimator \hat{\theta}_{0} for 1\leq p<3. Shape-constrained estimators have also been considered in settings where an automatic rate-adaptation phenomenon occurs (Chatterjee et al., 2015; Gao et al., 2017; Bellec, 2018). We also refer to Kim et al. (2018) and Chatterjee and Lafferty (2019) for other examples of adaptation in univariate shape-constrained problems.

Error analysis for the least squares estimator in multivariate isotonic regression is more difficult. For two-dimensional isotonic regression, where X\in\mathbb{R}^{d} with d=2 and Gaussian noise, Chatterjee et al. (2018) considered the fixed lattice design case and obtained sharp error bounds. Han et al. (2019) extended the results of Chatterjee et al. (2018) to the case with d\geq 3, both from a worst-case perspective and an adaptation point of view. They also proved parallel results for random designs, assuming the density of the covariate X is bounded away from zero and infinity on the support.

Deng and Zhang (2020) considered a class of block estimators for multivariate isotonic regression in \mathbb{R}^{d} involving rectangular upper and lower sets, defined as any estimator lying between the max-min and min-max estimators. Under a q-th moment condition on the noise, they developed L_{q} risk bounds for such estimators for isotonic regression on graphs. Furthermore, the block estimator possesses an oracle property in variable selection: when f_{0} depends on only an unknown set of s variables, the L_{2} risk of the block estimator automatically achieves the minimax rate, up to a logarithmic factor, based on the knowledge of the set of the s variables.

Our proposed method and theoretical results differ from those in the aforementioned papers in several aspects. First, the resulting estimates from our method are smooth rather than piecewise constant as in the existing methods. Second, our method can mitigate the curse of dimensionality under an approximate low-dimensional manifold support assumption, which is weaker than the exact low-dimensional space assumption in the existing work. Finally, our method possesses a robustness property against model misspecification in the sense that it still yields consistent estimators when the monotonicity assumption is not strictly satisfied, whereas the properties of the existing isotonic regression methods under model misspecification are unclear.

7 Conclusions

In this work, motivated by the problems of score estimation and isotonic regression, we have studied the properties of RePU-activated neural networks, including a novel generalization result for the derivatives of RePU networks and improved approximation error bounds for RePU networks with approximate low-dimensional structures. We have established non-asymptotic excess risk bounds for DSME, a deep score matching estimator; and PDIR, our proposed penalized deep isotonic regression method.

Our findings highlight the potential of RePU-activated neural networks in addressing challenging problems in machine learning and statistics. The ability to accurately represent the partial derivatives of RePU networks with RePUs mixed-activated networks is a valuable tool in many applications that require the use of neural network derivatives. Moreover, the improved approximation error bounds for RePU networks with low-dimensional structures demonstrate their potential to mitigate the curse of dimensionality in high-dimensional settings.

Future work can investigate further the properties of RePU networks, such as their stability, robustness, and interpretability. It would also be interesting to explore the use of RePU-activated neural networks in other applications, such as nonparametric variable selection and more general shape-constrained estimation problems. Additionally, our work can be extended to other smooth activation functions beyond RePUs, such as Gaussian error linear unit and scaled exponential linear unit, and study their derivatives and approximation properties.

Appendix

This appendix contains results from simulation studies to evaluate the performance of PDIR and proofs and supporting lemmas for the theoretical results stated in the paper.

Appendix A Numerical studies

In this section, we conduct simulation studies to evaluate the performance of PDIR and compare it with existing isotonic regression methods. The methods considered in the simulation are as follows.

  • The isotonic least squares estimator, denoted by Isotonic LSE, is defined as the minimizer of the mean squared error on the training data subject to the monotonicity constraint. As the squared loss only involves the values at the n design points, the isotonic LSE (with no more than \binom{n}{2} linear constraints) can be computed with quadratic programming or convex optimization algorithms (Dykstra, 1983; Kyng et al., 2015; Stout, 2015). Algorithmically, the problem can be mapped to a network flow problem (Picard, 1976; Spouge et al., 2003). In our implementation, we compute the Isotonic LSE via the Python package multiisotonic (https://github.com/alexfields/multiisotonic).

  • The block estimator (Deng and Zhang, 2020), denoted by Block estimator, is defined as any estimator between the block min-max and max-min estimators (Fokianos et al., 2020). In the simulation, we take the Block estimator to be the mean of the max-min and min-max estimators, as suggested in Deng and Zhang (2020). The Isotonic LSE is known to have an explicit min-max representation at the design points for isotonic regression on general graphs (Robertson et al., 1988). As in Deng and Zhang (2020), we use brute force, which exhaustively calculates means over all blocks and finds the max-min value for each point x; the computational cost of this approach is of order n^{3}.

  • The deep isotonic regression estimator described in Section 5, denoted by PDIR. Here we focus on the RePU \sigma_{p} activated network with p=2. We implement it in Python via PyTorch and use Adam (Kingma and Ba, 2014) as the optimization algorithm with the default learning rate 0.01 and default \beta=(0.9,0.99) (the coefficients used for computing running averages of gradients and their squares). The tuning parameters are set to \lambda_{j}=\log(n) for j=1,\ldots,d. A minimal training sketch is given after this list.

  • The deep nonparametric regression estimator, denoted by DNR, which is PDIR without the penalty. The implementation is the same as that of PDIR, but with tuning parameters \lambda_{j}=0 for j=1,\ldots,d.
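The following is a minimal training sketch consistent with the description of PDIR above (Adam with learning rate 0.01 and \beta=(0.9,0.99), and \lambda_{j}=\log(n)); it reuses the hypothetical RePU activation and pdir_loss helpers sketched earlier in Sections 4 and 5, with a scalar network output, and is an illustration rather than the exact code used in our experiments.

```python
import math
import torch
import torch.nn as nn

# assumes the RePU module and pdir_loss function from the earlier sketches


def train_pdir(X, Y, width=64, depth=3, p=2, epochs=2000):
    n, d = X.shape
    lam = torch.full((d,), math.log(n))          # lambda_j = log(n) for all j
    layers, in_dim = [], d
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), RePU(p)]
        in_dim = width
    layers.append(nn.Linear(in_dim, 1))          # scalar regression output
    f = nn.Sequential(*layers)
    opt = torch.optim.Adam(f.parameters(), lr=0.01, betas=(0.9, 0.99))
    for _ in range(epochs):
        opt.zero_grad()
        loss = pdir_loss(f, X, Y, lam)
        loss.backward()
        opt.step()
    return f
```

Setting lam to a vector of zeros recovers the DNR estimator described above.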

A.1 Estimation and evaluation

For the proposed PDIR estimator, we set the tuning parameter \lambda_{j}=\log(n) for j=1,\ldots,d across the simulations. For each target function f_{0}, according to model (4) we generate the training data S_{\rm train}=(X_{i}^{\rm train},Y_{i}^{\rm train})_{i=1}^{n} with sample size n and train the Isotonic LSE, Block estimator, PDIR and DNR estimators on S_{\rm train}. We note that the Block estimator is not defined when the input x is "outside" the domain of the training data, i.e., when there exist no i,j\in\{1,\ldots,n\} such that X^{\rm train}_{i}\preceq x\preceq X^{\rm train}_{j}. In view of this, in our simulations we use a lattice design for the covariates (X^{\rm train}_{i})_{i=1}^{n} for ease of presentation of the Block estimator. For the PDIR and DNR estimators, such a fixed lattice design is not necessary, and the obtained estimators extend smoothly to larger domains covering the training samples.

For each f_{0}, we also generate the testing data S_{\rm test}=(X_{t}^{\rm test},Y_{t}^{\rm test})_{t=1}^{T} with sample size T from the same distribution as the training data. For each obtained estimator \hat{f}_{n}, we calculate the mean squared error (MSE) on the testing data S_{\rm test}. We also calculate the L_{1} distance between the estimator \hat{f}_{n} and the corresponding target function f_{0} on the testing data by

\|\hat{f}_{n}-f_{0}\|_{L^{1}(\nu)}=\frac{1}{T}\sum_{t=1}^{T}\Big|\hat{f}_{n}(X_{t}^{\rm test})-f_{0}(X_{t}^{\rm test})\Big|,

and we also calculate the L_{2} distance between the estimator \hat{f}_{n} and the target function f_{0}, i.e.,

\|\hat{f}_{n}-f_{0}\|_{L^{2}(\nu)}=\sqrt{\frac{1}{T}\sum_{t=1}^{T}\Big|\hat{f}_{n}(X_{t}^{\rm test})-f_{0}(X_{t}^{\rm test})\Big|^{2}}.
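For completeness, these evaluation quantities amount to the following elementary computations on the testing data (a NumPy sketch; the array and function names are ours):

```python
import numpy as np


def evaluate(f_hat_test, f0_test, y_test):
    """MSE on the testing data and empirical L1/L2 distances to the target f0."""
    mse = np.mean((y_test - f_hat_test) ** 2)
    l1 = np.mean(np.abs(f_hat_test - f0_test))
    l2 = np.sqrt(np.mean((f_hat_test - f0_test) ** 2))
    return mse, l1, l2
```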

In the simulation studies, for each data generation model we generate T=100^{d} testing points on a lattice (100 evenly spaced points in each dimension of the input), where d is the dimension of the input. We report the mean squared error, the L_{1} and L_{2} distances to the target function defined above, and their standard deviations over R=100 replications under different scenarios. The specific forms of f_{0} are given in the data generation models below.

A.2 Univariate models

We consider five basic univariate models, "Linear", "Exp", "Step", "Constant" and "Wave", which correspond to different specifications of the target function f_{0}. The formulae are given below.

  • (a) Linear: Y=f_{0}(x)+\epsilon=2x+\epsilon,

  • (b) Exp: Y=f_{0}(X)+\epsilon=\exp(2X)+\epsilon,

  • (c) Step: Y=f_{0}(X)+\epsilon=\sum h_{i}I(X\geq t_{i})+\epsilon,

  • (d) Constant: Y=f_{0}(X)+\epsilon=\exp(2X)+\epsilon,

  • (e) Wave: Y=f_{0}(X)+\epsilon=4X+2X\sin(4\pi X)+\epsilon,

where (h_{i})=(1,2,2), (t_{i})=(0.2,0.6,1) and \epsilon\sim N(0,\frac{1}{4}) follows a normal distribution. We use the linear model as a baseline in our simulations and expect all methods to perform well under it. The "Step" model is monotonic but neither smooth nor continuous. The "Constant" model is monotonic but not strictly monotonic. The "Wave" model is nonlinear and smooth but non-monotonic. These models are chosen so that we can evaluate the performance of the Isotonic LSE, Block estimator, PDIR and DNR under different types of models, including the conventional and misspecified cases.

For these models, we use the lattice design for ease of presentation of the Block estimator, where (X_{i}^{\rm train})_{i=1}^{n} are lattice points evenly distributed on the interval [0,1]. Figure S2 shows all these univariate data generation models.
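For instance, data from the "Wave" model with the lattice design can be generated as follows (a sketch; the function name is ours, and the noise standard deviation 0.5 corresponds to the variance 1/4 specified above):

```python
import numpy as np


def generate_wave(n, sigma=0.5, seed=0):
    """Training data from the univariate 'Wave' model: Y = 4X + 2X sin(4*pi*X) + eps."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)                 # lattice design on [0, 1]
    y = 4.0 * x + 2.0 * x * np.sin(4.0 * np.pi * x) + rng.normal(scale=sigma, size=n)
    return x, y
```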

Figure S3 shows an instance of the estimated curves for the "Linear", "Exp", "Step" and "Constant" models when the sample size is n=64. In these plots, the training data are depicted as grey dots, the target functions are depicted as dashed black curves, and the estimated functions are represented by solid curves in different colors. The summary statistics are presented in Table S3. Compared with the piecewise constant estimates of the Isotonic LSE and Block estimator, the PDIR estimator is smooth, and it works reasonably well under the univariate models, especially those with smooth target functions.

Figure S2: Univariate data generation models. The target functions are depicted by solid curves in blue and instance samples of size n=64 are depicted as black dots.
Figure S3: An instance of the estimated curves for the "Linear", "Exp", "Step" and "Constant" models when the sample size is n=64. The training data are depicted as grey dots. The target functions are depicted as dashed curves in black, and the estimated functions are represented by solid curves with different colors.
Table S3: Summary statistics for the simulation results under different univariate models (d=1). The averaged mean squared error of the estimates on the testing data and the L_{1}, L_{2} distances to the target function are calculated over 100 replications. The standard deviations are reported in parentheses.
Model | Method | n=64: MSE, L_{1}, L_{2} | n=256: MSE, L_{1}, L_{2}
Linear DNR 0.266 (0.011) 0.101 (0.035) 0.122 (0.040) 0.253 (0.011) 0.055 (0.020) 0.068 (0.023)
PDIR 0.265 (0.012) 0.098 (0.037) 0.118 (0.041) 0.254 (0.012) 0.058 (0.024) 0.070 (0.027)
Isotonic LSE 0.282 (0.013) 0.140 (0.027) 0.177 (0.035) 0.262 (0.012) 0.088 (0.012) 0.113 (0.017)
Block 0.330 (0.137) 0.165 (0.060) 0.243 (0.155) 0.277 (0.033) 0.106 (0.021) 0.153 (0.060)
Exp DNR 0.268 (0.014) 0.103 (0.043) 0.124 (0.049) 0.256 (0.012) 0.055 (0.024) 0.068 (0.027)
PDIR 0.268 (0.017) 0.102 (0.049) 0.124 (0.056) 0.255 (0.012) 0.055 (0.022) 0.068 (0.026)
Isotonic LSE 0.312 (0.018) 0.195 (0.028) 0.246 (0.034) 0.274 (0.014) 0.120 (0.014) 0.153 (0.018)
Block 0.302 (0.021) 0.177 (0.028) 0.223 (0.034) 0.272 (0.012) 0.115 (0.015) 0.146 (0.017)
Step DNR 0.375 (0.045) 0.259 (0.059) 0.347 (0.061) 0.315 (0.017) 0.169 (0.022) 0.253 (0.018)
PDIR 0.366 (0.042) 0.245 (0.058) 0.335 (0.057) 0.311 (0.018) 0.153 (0.025) 0.245 (0.018)
Isotonic LSE 0.304 (0.020) 0.151 (0.041) 0.228 (0.039) 0.275 (0.014) 0.081 (0.022) 0.155 (0.020)
Block 0.382 (0.217) 0.208 (0.082) 0.327 (0.160) 0.295 (0.046) 0.108 (0.035) 0.197 (0.086)
Constant DNR 0.266 (0.012) 0.102 (0.038) 0.122 (0.042) 0.258 (0.013) 0.057 (0.021) 0.069 (0.023)
PDIR 0.260 (0.011) 0.080 (0.045) 0.092 (0.049) 0.257 (0.012) 0.051 (0.025) 0.060 (0.028)
Isotonic LSE 0.265 (0.013) 0.087 (0.044) 0.114 (0.052) 0.258 (0.012) 0.044 (0.020) 0.068 (0.025)
Block 0.264 (0.012) 0.085 (0.044) 0.108 (0.049) 0.258 (0.012) 0.044 (0.020) 0.066 (0.025)
Wave DNR 0.289 (0.023) 0.156 (0.039) 0.192 (0.044) 0.262 (0.014) 0.089 (0.025) 0.110 (0.029)
PDIR 0.530 (0.030) 0.398 (0.026) 0.528 (0.018) 0.511 (0.022) 0.368 (0.014) 0.510 (0.009)
Isotonic LSE 0.525 (0.027) 0.399 (0.022) 0.524 (0.015) 0.495 (0.020) 0.353 (0.009) 0.494 (0.004)
Block 0.516 (0.024) 0.391 (0.022) 0.519 (0.017) 0.497 (0.023) 0.358 (0.012) 0.500 (0.013)

A.3 Bivariate models

We consider several basic bivariate models, including a polynomial model ("Polynomial"), a concave model ("Concave"), a step model ("Step"), a partial model ("Partial"), a constant model ("Constant") and a wave model ("Wave"), which correspond to different specifications of the target function f_{0}. The formulae are given below.

  • (a) Polynomial: Y=f_{0}(X)+\epsilon=\frac{10}{2^{3/4}}(x_{1}+x_{2})^{3/4}+\epsilon,

  • (b) Concave: Y=f_{0}(X)+\epsilon=1+3x_{1}(1-\exp(-3x_{2}))+\epsilon,

  • (c) Step: Y=f_{0}(X)+\epsilon=\sum h_{i}I(x_{1}+x_{2}\geq t_{i})+\epsilon,

  • (d) Partial: Y=f_{0}(X)+\epsilon=10x_{2}^{8/3}+\epsilon,

  • (e) Constant: Y=f_{0}(X)+\epsilon=3+\epsilon,

  • (f) Wave: Y=f_{0}(X)+\epsilon=5(x_{1}+x_{2})+3(x_{1}+x_{2})\sin(\pi(x_{1}+x_{2}))+\epsilon,

where X=(x1,x2)X=(x_{1},x_{2}), (hi)=(1,2,2,1.5,0.5,1)(h_{i})=(1,2,2,1.5,0.5,1), (ti)=(0.2,0.6,1.0,1.3,1.7,1.9)(t_{i})=(0.2,0.6,1.0,1.3,1.7,1.9) and ϵN(0,14)\epsilon\sim N(0,\frac{1}{4}) follows a normal distribution. The “Polynomial” and “Concave” models are monotonic models. The “Step” model is monotonic but not smooth, and not even continuous. In the “Partial” model, the response is related to only one covariate. The “Constant” model is monotonic but not strictly monotonic, and the “Wave” model is nonlinear and smooth but not monotonic. We use the lattice design for ease of presentation of the Block estimator, where (Xitrain)i=1n(X_{i}^{\rm train})_{i=1}^{n} are the lattice points evenly distributed on the square [0,1]2[0,1]^{2}. Simulation results over 100 replications are summarized in Table S4. For each model, we take an instance from the replications to present the heatmaps and the 3D surfaces of the predictions of these estimates; see Figures S4-S15. In the heatmaps, we show the observed data (linearly interpolated), the true target function f0f_{0} and the estimates of the different methods. We can see that, compared with the piece-wise constant estimates of the Isotonic LSE and the Block estimator, the PDIR estimator is smooth and works reasonably well under the bivariate models, especially for models with smooth target functions.
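For concreteness, the lattice design and the data generation for the bivariate models above can be sketched in a few lines of Python (a minimal NumPy sketch for illustration only; the function names are ours and only models (a) and (c) are shown):

```python
import numpy as np

def lattice_design(n):
    # n lattice points evenly spread over [0, 1]^2 (n is assumed to be a perfect square)
    m = int(round(np.sqrt(n)))
    grid = (np.arange(m) + 0.5) / m
    x1, x2 = np.meshgrid(grid, grid)
    return np.column_stack([x1.ravel(), x2.ravel()])

def f0_polynomial(x):                      # model (a)
    return 10.0 / 2 ** 0.75 * (x[:, 0] + x[:, 1]) ** 0.75

def f0_step(x):                            # model (c)
    h = np.array([1.0, 2.0, 2.0, 1.5, 0.5, 1.0])
    t = np.array([0.2, 0.6, 1.0, 1.3, 1.7, 1.9])
    s = x[:, 0] + x[:, 1]
    return ((s[:, None] >= t[None, :]) * h[None, :]).sum(axis=1)

def generate(f0, n, sigma=0.5, seed=0):    # eps ~ N(0, 1/4), i.e. sd = 0.5
    rng = np.random.default_rng(seed)
    X = lattice_design(n)
    Y = f0(X) + sigma * rng.standard_normal(len(X))
    return X, Y

X_train, Y_train = generate(f0_polynomial, n=64)
```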

Table S4: Summary statistics for the simulation results under different bivariate models (d=2d=2). The averaged mean squared error of the estimates on testing data and L1,L2L_{1},L_{2} distance to the target function are calculated over 100 replications. The standard deviations are reported in parenthesis.
Model Method n=64n=64 n=256n=256
MSE L1L_{1} L2L_{2} MSE L1L_{1} L2L_{2}
Polynomial DNR 4.735 (0.344) 0.138 (0.041) 0.172 (0.046) 4.655 (0.178) 0.078 (0.022) 0.098 (0.025)
PDIR 4.724 (0.366) 0.140 (0.045) 0.171 (0.049) 4.688 (0.181) 0.077 (0.023) 0.096 (0.026)
Isotonic LSE 8.309 (0.405) 0.755 (0.061) 0.884 (0.061) 6.052 (0.153) 0.364 (0.019) 0.444 (0.021)
Block 4.780 (0.284) 0.319 (0.020) 0.397 (0.026) 4.747 (0.129) 0.210 (0.011) 0.264 (0.017)
Concave DNR 0.282 (0.016) 0.142 (0.038) 0.176 (0.042) 0.261 (0.007) 0.083 (0.022) 0.103 (0.024)
PDIR 0.276 (0.015) 0.129 (0.038) 0.158 (0.043) 0.260 (0.007) 0.077 (0.024) 0.096 (0.028)
Isotonic LSE 0.393 (0.042) 0.308 (0.051) 0.375 (0.055) 0.294 (0.010) 0.163 (0.020) 0.207 (0.022)
Block 0.303 (0.015) 0.183 (0.025) 0.229 (0.030) 0.275 (0.006) 0.125 (0.012) 0.157 (0.014)
Step DNR 0.561 (0.030) 0.462 (0.020) 0.557 (0.025) 0.519 (0.011) 0.432 (0.010) 0.519 (0.010)
PDIR 0.561 (0.030) 0.461 (0.019) 0.557 (0.024) 0.519 (0.011) 0.431 (0.010) 0.520 (0.010)
Isotonic LSE 1.462 (0.104) 0.852 (0.046) 1.100 (0.047) 0.700 (0.028) 0.430 (0.019) 0.672 (0.021)
Block 0.657 (0.031) 0.503 (0.022) 0.638 (0.023) 0.461 (0.016) 0.321 (0.014) 0.457 (0.014)
Partial DNR 12.67 (0.530) 0.129 (0.033) 0.161 (0.039) 12.77 (0.338) 0.079 (0.023) 0.099 (0.027)
PDIR 12.70 (0.548) 0.112 (0.037) 0.136 (0.042) 12.72 (0.285) 0.063 (0.021) 0.080 (0.026)
Isotonic LSE 17.78 (0.532) 0.739 (0.052) 1.062 (0.055) 15.00 (0.262) 0.378 (0.024) 0.528 (0.028)
Block 12.31 (0.571) 0.435 (0.041) 0.578 (0.043) 12.50 (0.313) 0.237 (0.028) 0.313 (0.049)
Constant DNR 0.278 (0.016) 0.131 (0.038) 0.160 (0.042) 0.260 (0.005) 0.079 (0.020) 0.097 (0.022)
PDIR 0.266 (0.013) 0.094 (0.047) 0.111 (0.050) 0.255 (0.005) 0.052 (0.022) 0.063 (0.025)
Isotonic LSE 0.280 (0.021) 0.121 (0.047) 0.161 (0.056) 0.262 (0.006) 0.076 (0.025) 0.108 (0.026)
Block 0.265 (0.012) 0.089 (0.040) 0.110 (0.046) 0.256 (0.005) 0.059 (0.022) 0.075 (0.024)
Wave DNR 0.306 (0.020) 0.189 (0.036) 0.233 (0.042) 0.269 (0.009) 0.108 (0.025) 0.135 (0.029)
PDIR 0.459 (0.058) 0.390 (0.056) 0.454 (0.063) 0.581 (0.039) 0.493 (0.028) 0.574 (0.033)
Isotonic LSE 1.380 (0.085) 0.918 (0.032) 1.063 (0.040) 0.989 (0.024) 0.760 (0.014) 0.860 (0.012)
Block 0.978 (0.022) 0.750 (0.014) 0.854 (0.012) 0.892 (0.021) 0.693 (0.009) 0.802 (0.008)
Figure S4: Heatmaps of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (a) when d=2d=2 and n=64n=64.
Figure S5: 3D surface plots of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (a) when d=2d=2 and n=64n=64.
Figure S6: Heatmaps of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (b) when d=2d=2 and n=64n=64.
Figure S7: 3D surface plots of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (b) when d=2d=2 and n=64n=64.
Figure S8: Heatmaps of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (c) when d=2d=2 and n=64n=64.
Figure S9: 3D surface plots of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (c) when d=2d=2 and n=64n=64.
Figure S10: Heatmaps of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (d) when d=2d=2 and n=64n=64.
Figure S11: 3D surface plots of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (d) when d=2d=2 and n=64n=64.
Figure S12: Heatmaps of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (e) when d=2d=2 and n=64n=64.
Figure S13: 3D surface plots of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (e) when d=2d=2 and n=64n=64.
Figure S14: Heatmaps of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (f) when d=2d=2 and n=64n=64.
Figure S15: 3D surface plots of the target function f0f_{0}, the observed training data, the deep isotonic regression estimate, and the isotonic least squares estimate (isotonic LSE) under model (f) when d=2d=2 and n=64n=64.

A.4 Tuning parameter

In this subsection, we investigate the numerical performance of PDIR with respect to different choices of the tuning parameters under different models.

For univariate models, we calculate the testing statistics L1L_{1} and L2L_{2} for the tuning parameter λ\lambda on 20 lattice points in the interval [0,3log(n)][0,3\log(n)]. For each λ\lambda, we run 20 replications and report the average L1L_{1} and L2L_{2} statistics together with their 90% empirical bands. For each replication, we train the PDIR using n=256n=256 training samples and evaluate on T=1,000T=1,000 testing samples. In our simulation, four univariate models from Section A.2 are considered, namely “Exp”, “Constant”, “Step” and the misspecified model “Wave”, and the results are reported in Figure S16. We can see that for the isotonic models “Exp” and “Step”, the estimate is not sensitive to the choice of the tuning parameter λ\lambda in [0,3log(n)][0,3\log(n)]; all choices lead to reasonable estimates. For the “Constant” model, which is isotonic but not strictly isotonic, the errors increase slightly as the tuning parameter λ\lambda increases over [0,3log(n)][0,3\log(n)]. Overall, the choice λ=log(n)\lambda=\log(n) leads to reasonably good estimates for correctly specified models. For the misspecified model “Wave”, the estimates deteriorate quickly as the tuning parameter λ\lambda increases from 0, and after that the additional negative effect of increasing λ\lambda is slight.
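The tuning experiment described above is a simple grid search over λ\lambda; a minimal sketch is given below, where train_pdir and sample are hypothetical stand-ins for the PDIR training routine and the data-generating mechanism (they are not part of our actual implementation):

```python
import numpy as np

def lambda_grid_experiment(train_pdir, f0, sample, n=256, n_test=1000,
                           n_rep=20, n_grid=20):
    """Grid search over the tuning parameter lambda on [0, 3*log(n)].
    train_pdir(X, Y, lam) returns a callable estimate; sample(n, seed)
    generates (X, Y); f0 is the target function."""
    lambdas = np.linspace(0.0, 3.0 * np.log(n), n_grid)
    L1 = np.zeros((n_grid, n_rep))
    L2 = np.zeros((n_grid, n_rep))
    for r in range(n_rep):
        X, Y = sample(n, seed=r)
        X_test, _ = sample(n_test, seed=10_000 + r)
        for k, lam in enumerate(lambdas):
            f_hat = train_pdir(X, Y, lam)
            diff = f_hat(X_test) - f0(X_test)
            L1[k, r] = np.abs(diff).mean()
            L2[k, r] = np.sqrt((diff ** 2).mean())
    bands = lambda M: np.quantile(M, [0.05, 0.95], axis=1)  # 90% empirical band
    return lambdas, L1.mean(axis=1), bands(L1), L2.mean(axis=1), bands(L2)
```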

Figure S16: L1L_{1} and L2L_{2} distances between estimates and the targets with different tuning parameters under univariate models with size n=256n=256. For each value of tuning parameter λ\lambda, the mean L1L_{1} and L2L_{2} distances (solid blue and red curves) and their 90% empirical band (blue and red ranges) are calculated over 20 replications. A vertical dashed line is presented at λ=log(n)\lambda=\log(n).

For bivariate models, our first simulation study focuses on the case where the tuning parameters take the same value, i.e., λ1=λ2\lambda_{1}=\lambda_{2}. We calculate the testing statistics L1L_{1} and L2L_{2} for tuning parameters λ1=λ2\lambda_{1}=\lambda_{2} on 20 lattice points in the interval [0,3log(n)][0,3\log(n)]. For each λ\lambda, we run 20 replications and report the average L1L_{1} and L2L_{2} statistics together with their 90% empirical bands. For each replication, we train the PDIR using n=256n=256 training samples and evaluate on T=10,000T=10,000 testing samples. In our simulation, four bivariate models from Section A.3 are considered, namely “Partial”, “Constant”, “Concave” and the misspecified model “Wave”, and the results are reported in Figure S17. The observations are similar to those for the univariate models: the estimates are not sensitive to the choices of the tuning parameters over [0,3log(n)][0,3\log(n)] for correctly specified (i.e., isotonic) models. For the misspecified model “Wave”, the estimates deteriorate quickly as the tuning parameter λ\lambda increases from 0, and after that increasing λ\lambda only slightly degrades the estimates.

Figure S17: L1L_{1} and L2L_{2} distances between estimates and the targets with different tuning parameters under bivariate models with size n=256n=256. For each value of tuning parameter (λ1,λ2)(\lambda_{1},\lambda_{2}) with λ1=λ2\lambda_{1}=\lambda_{2}, the mean L1L_{1} and L2L_{2} distances (solid blue and red curves) and their 90% empirical band (blue and red ranges) are calculated over 20 replications. A vertical dashed line is presented at λ1=λ2=log(n)\lambda_{1}=\lambda_{2}=\log(n).

In the second simulation study of bivariate models, we can choose different values for different components of the tuning parameter λ=(λ1,λ2)\lambda=(\lambda_{1},\lambda_{2}), i.e., λ1\lambda_{1} can be different from λ2\lambda_{2}. We investigate this by considering the following bivariate model, where the target function f0f_{0} is monotonic in its second argument and non-monotonic in its first one.

  • (g)

    Model (g):

    Y=f0(X)+ϵ=2sin(2πx1)+4(x2)4/3+ϵ,Y=f_{0}(X)+\epsilon=2\sin(2\pi x_{1})+4(x_{2})^{4/3}+\epsilon,

where X=(x1,x2)X=(x_{1},x_{2}) and ϵN(0,14)\epsilon\sim N(0,\frac{1}{4}) follows a normal distribution. Heatmaps of the observed training data and the target function f0f_{0}, together with 3D surface plots of the target function f0f_{0}, under model (g) when d=2d=2 and n=256n=256 are presented in Figure S18.

For model (g), we calculate the mean testing statistics L1L_{1} and L2L_{2} for the tuning parameter λ=(λ1,λ2)\lambda=(\lambda_{1},\lambda_{2}) on 400 grid points over the region [0,3log(n)]×[0,3log(n)][0,3\log(n)]\times[0,3\log(n)]. For each λ=(λ1,λ2)\lambda=(\lambda_{1},\lambda_{2}), we run 5 replications and report the average L1L_{1} and L2L_{2} statistics. For each replication, we train the PDIR using n=256n=256 training samples and evaluate on T=10,000T=10,000 testing samples. The mean L1L_{1} and L2L_{2} distances between the estimates and the target function on the testing data under different λ\lambda are depicted in Figure S19. Since the target function f0f_{0} is increasing in its second argument, the estimates are insensitive to the tuning parameter λ2\lambda_{2}; since f0f_{0} is non-monotonic in its first argument, the estimates deteriorate as λ1\lambda_{1} gets larger. The simulation results suggest that we should penalize the partial derivatives only with respect to the monotonic arguments, but not the non-monotonic ones. The estimates are sensitive to the tuning parameter under misspecification, especially when the tuning parameter increases from 0.
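To make the role of the coordinate-wise tuning parameters concrete, the following is a minimal PyTorch sketch of one plausible form of the penalized objective, in which each λj\lambda_{j} penalizes negative partial derivatives in the jj-th coordinate; the names are ours and the exact penalty used in our implementation may differ in detail.

```python
import torch

def pdir_loss(model, X, Y, lams):
    """One plausible form of the penalized objective: squared error plus a
    per-coordinate penalty on negative partial derivatives, weighted by
    lams = (lambda_1, ..., lambda_d). A sketch only."""
    X = X.clone().requires_grad_(True)
    pred = model(X).squeeze(-1)                    # model: (n, d) -> (n, 1)
    mse = ((pred - Y) ** 2).mean()
    # per-sample gradient of the scalar output w.r.t. each input coordinate
    grads = torch.autograd.grad(pred.sum(), X, create_graph=True)[0]   # (n, d)
    penalty = sum(lam * torch.relu(-grads[:, j]).mean()
                  for j, lam in enumerate(lams))
    return mse + penalty
```

For model (g), the results above suggest taking λ1=0\lambda_{1}=0 and a positive λ2\lambda_{2}, i.e., penalizing only the coordinate in which monotonicity is assumed.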

Figure S18: Heatmaps for the observed training data, the target function f0f_{0}, and 3D surface plots for the target function f0f_{0} under model (g) when d=2d=2 and n=256n=256.
Figure S19: The mean L1L_{1} and L2L_{2} distances between the estimates and the target function on the testing data under different tuning parameters (λ1,λ2)(\lambda_{1},\lambda_{2}) for model (g) when d=2d=2 and n=256n=256.

Appendix B Proofs

Proof of Theorem 1

For an integer p2p\geq 2, let σp1(x)=max{0,x}p1\sigma_{p-1}(x)=\max\{0,x\}^{p-1} and σp(x)=max{0,x}p\sigma_{p}(x)=\max\{0,x\}^{p} denote the corresponding RePU activation functions. Let (d0,d1,,d𝒟+1)(d_{0},d_{1},\ldots,d_{\mathcal{D}+1}) be the vector of widths (numbers of neurons) of the layers in the original RePU network, where d0=dd_{0}=d and d𝒟+1=1d_{\mathcal{D}+1}=1 in our problem. We let fu(i)f^{(i)}_{u} be the function (a subnetwork of the RePU network) from 𝒳d\mathcal{X}\subset\mathbb{R}^{d} to \mathbb{R} which takes X=(x1,,xd)X=(x_{1},\ldots,x_{d}) as input and outputs the uu-th neuron of the ii-th layer, for u=1,,diu=1,\ldots,d_{i} and i=1,,𝒟+1i=1,\ldots,\mathcal{D}+1.

We construct Mixed RePUs activated subnetworks to compute (xjf1(i),,xjfdi(i))(\frac{\partial}{\partial x_{j}}f^{(i)}_{1},\ldots,\frac{\partial}{\partial x_{j}}f^{(i)}_{d_{i}}) iteratively for i=1,,𝒟+1i=1,\ldots,\mathcal{D}+1, i.e., we construct the partial derivatives of the original RePU subnetworks (up to the ii-th layer) step by step. Without loss of generality, we take j=d0j=d_{0}; the construction is the same for any other j{1,,d}j\in\{1,\ldots,d\}. We illustrate the details of the construction of the Mixed RePUs subnetworks for the first two layers (i=1,2i=1,2) and the last layer (i=𝒟+1)(i=\mathcal{D}+1), and apply induction for layers i=3,,𝒟i=3,\ldots,\mathcal{D}. Note that the derivative of the RePU activation function σp\sigma_{p} is σp(x)=pσp1(x)\sigma^{\prime}_{p}(x)=p\sigma_{p-1}(x); then for i=1i=1 and any u=1,,d1u=1,\ldots,d_{1},

xjfu(1)=xjσp(i=1d0wui(1)xi+bu(1))=pσp1(i=1d0wui(1)xi+bu(1))wu,d0(1),\displaystyle\frac{\partial}{\partial x_{j}}f^{(1)}_{u}=\frac{\partial}{\partial x_{j}}\sigma_{p}\Big{(}\sum_{i=1}^{d_{0}}w^{(1)}_{ui}x_{i}+b_{u}^{(1)}\Big{)}=p\sigma_{p-1}\Big{(}\sum_{i=1}^{d_{0}}w^{(1)}_{ui}x_{i}+b_{u}^{(1)}\Big{)}\cdot w_{u,d_{0}}^{(1)}, (B.1)

where we denote by wui(1)w^{(1)}_{ui} and bu(1)b_{u}^{(1)} the corresponding weights and biases in the first layer of the original RePU network. Now we intend to construct a 4-layer (2 hidden layers) Mixed RePUs network with width (d0,3d1,6d1,2d1)(d_{0},3d_{1},6d_{1},2d_{1}) which takes X=(x1,,xd0)X=(x_{1},\ldots,x_{d_{0}}) as input and outputs

(f1(1),,fd1(1),xjf1(1),,xjfd1(1))2d1.(f^{(1)}_{1},\ldots,f^{(1)}_{d_{1}},\frac{\partial}{\partial x_{j}}f^{(1)}_{1},\ldots,\frac{\partial}{\partial x_{j}}f^{(1)}_{d_{1}})\in\mathbb{R}^{2d_{1}}.

Note that the output of such a network contains all the quantities needed to calculate (xjf1(2),,xjfd2(2))(\frac{\partial}{\partial x_{j}}f^{(2)}_{1},\ldots,\frac{\partial}{\partial x_{j}}f^{(2)}_{d_{2}}), so the process of construction can be continued iteratively and the induction proceeds. In the first hidden layer, we can obtain the 3d13d_{1} neurons

(f1(1),,fd1(1),p|w1,d0(1)|,,p|wd1,d0(1)|,σp1(i=1d0w1i(1)xi+b1(1)),,σp1(i=1d0wd1i(1)xi+bd1(1))),(f^{(1)}_{1},\ldots,f^{(1)}_{d_{1}},p|w^{(1)}_{1,d_{0}}|,\ldots,p|w^{(1)}_{d_{1},d_{0}}|,\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{1i}x_{i}+b^{(1)}_{1}),\ldots,\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{d_{1}i}x_{i}+b^{(1)}_{d_{1}})),

with weight matrix A1(1)A^{(1)}_{1} having 2d0d12d_{0}d_{1} parameters, bias vector B1(1)B^{(1)}_{1} and activation function vector Σ1\Sigma_{1} being

A1(1)=[w1,1(1)w1,2(1)w1,d0(1)w2,1(1)w2,2(1)w2,d0(1)wd1,1(1)wd1,2(1)wd1,d0(1)0000000000w1,1(1)w1,2(1)w1,d0(1)w2,1(1)w2,2(1)w2,d0(1)wd1,1(1)wd1,2(1)wd1,d0(1)]3d1×d0,\displaystyle A^{(1)}_{1}=\left[\begin{array}[]{ccccc}w^{(1)}_{1,1}&w^{(1)}_{1,2}&\cdots&\cdots&w^{(1)}_{1,d_{0}}\\ w^{(1)}_{2,1}&w^{(1)}_{2,2}&\cdots&\cdots&w^{(1)}_{2,d_{0}}\\ \ldots&\ldots&\ldots&\ldots&\ldots\\ w^{(1)}_{d_{1},1}&w^{(1)}_{d_{1},2}&\cdots&\cdots&w^{(1)}_{d_{1},d_{0}}\\ 0&0&0&0&0\\ \ldots&\ldots&\ldots&\ldots&\ldots\\ 0&0&0&0&0\\ w^{(1)}_{1,1}&w^{(1)}_{1,2}&\cdots&\cdots&w^{(1)}_{1,d_{0}}\\ w^{(1)}_{2,1}&w^{(1)}_{2,2}&\cdots&\cdots&w^{(1)}_{2,d_{0}}\\ \ldots&\ldots&\ldots&\ldots&\ldots\\ w^{(1)}_{d_{1},1}&w^{(1)}_{d_{1},2}&\cdots&\cdots&w^{(1)}_{d_{1},d_{0}}\\ \end{array}\right]\in\mathbb{R}^{3d_{1}\times d_{0}},
B1(1)=[b1(1)b2(1)bd1(1)p|w1,d0(1)|p|w2,d0(1)|p|wd1,d0(1)|b1(1)b2(1)bd1(1)]3d1,Σ1(1)=[σpσpσ1σ1σp1σp1],\displaystyle B^{(1)}_{1}=\left[\begin{array}[]{c}b^{(1)}_{1}\\ b^{(1)}_{2}\\ \ldots\\ b^{(1)}_{d_{1}}\\ p|w^{(1)}_{1,d_{0}}|\\ p|w^{(1)}_{2,d_{0}}|\\ \ldots\\ p|w^{(1)}_{d_{1},d_{0}}|\\ b^{(1)}_{1}\\ b^{(1)}_{2}\\ \ldots\\ b^{(1)}_{d_{1}}\\ \end{array}\right]\in\mathbb{R}^{3d_{1}},\quad\Sigma^{(1)}_{1}=\left[\begin{array}[]{c}\sigma_{p}\\ \ldots\\ \sigma_{p}\\ \sigma_{1}\\ \ldots\\ \sigma_{1}\\ \sigma_{p-1}\\ \ldots\\ \sigma_{p-1}\\ \end{array}\right],

where the first d1d_{1} activation functions of Σ1\Sigma_{1} are chosen to be σp\sigma_{p}, the last d1d_{1} activation functions are chosen to be σp1\sigma_{p-1} and the rest σ1\sigma_{1}. In the second hidden layer, we can obtain 6d16d_{1} neurons. The first 2d12d_{1} neurons of the second hidden layer (or the third layer) are

(σ1(f1(1)),σ1(f1(1)),,σ1(fd1(1)),σ1(fd1(1))),(\sigma_{1}(f^{(1)}_{1}),\sigma_{1}(-f^{(1)}_{1}),\ldots,\sigma_{1}(f^{(1)}_{d_{1}}),\sigma_{1}(-f^{(1)}_{d_{1}})),

which implements the identity map so that (f1(1),,fd1(1))(f^{(1)}_{1},\ldots,f^{(1)}_{d_{1}}) can be kept and output in the next layer, since the identity map can be realized as x=σ1(x)σ1(x)x=\sigma_{1}(x)-\sigma_{1}(-x). The remaining 4d14d_{1} neurons of the second hidden layer (the third layer) are

[σ2(pw1,d0(1)+σp1(i=1d0w1i(1)xi+b1(1)))σ2(pw1,d0(1)σp1(i=1d0w1i(1)xi+b1(1)))σ2(pw1,d0(1)+σp1(i=1d0w1i(1)xi+b1(1))σ2(pw1,d0(1)σp1(i=1d0w1i(1)xi+b1(1)))σ2(pwd1,d0(1)+σp1(i=1d0wd1i(1)xi+bd1(1)))σ2(pwd1,d0(1)σp1(i=1d0wd1i(1)xi+bd1(1)))σ2(pwd1,d0(1)+σp1(i=1d0wd1i(1)xi+bd1(1))σ2(pwd1,d0(1)σp1(i=1d0wd1i(1)xi+bd1(1)))]4d1,\displaystyle\left[\begin{array}[]{c}\sigma_{2}(p\cdot w^{(1)}_{1,d_{0}}+\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{1i}x_{i}+b^{(1)}_{1}))\\ \sigma_{2}(p\cdot w^{(1)}_{1,d_{0}}-\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{1i}x_{i}+b^{(1)}_{1}))\\ \sigma_{2}(-p\cdot w^{(1)}_{1,d_{0}}+\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{1i}x_{i}+b^{(1)}_{1})\\ \sigma_{2}(-p\cdot w^{(1)}_{1,d_{0}}-\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{1i}x_{i}+b^{(1)}_{1}))\\ \ldots\\ \sigma_{2}(p\cdot w^{(1)}_{d_{1},d_{0}}+\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{d_{1}i}x_{i}+b^{(1)}_{d_{1}}))\\ \sigma_{2}(p\cdot w^{(1)}_{d_{1},d_{0}}-\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{d_{1}i}x_{i}+b^{(1)}_{d_{1}}))\\ \sigma_{2}(-p\cdot w^{(1)}_{d_{1},d_{0}}+\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{d_{1}i}x_{i}+b^{(1)}_{d_{1}})\\ \sigma_{2}(-p\cdot w^{(1)}_{d_{1},d_{0}}-\sigma_{p-1}(\sum_{i=1}^{d_{0}}w^{(1)}_{d_{1}i}x_{i}+b^{(1)}_{d_{1}}))\\ \end{array}\right]\in\mathbb{R}^{4d_{1}},

which is ready for implementing the multiplications in (B.1) to obtain (xjf1(1),,xjfd1(1))d1(\frac{\partial}{\partial x_{j}}f^{(1)}_{1},\ldots,\frac{\partial}{\partial x_{j}}f^{(1)}_{d_{1}})\in\mathbb{R}^{d_{1}} since

xy=14{(x+y)2(xy)2}=14{σ2(x+y)+σ2(xy)σ2(xy)σ2(x+y)}.\displaystyle x\cdot y=\frac{1}{4}\{(x+y)^{2}-(x-y)^{2}\}=\frac{1}{4}\{\sigma_{2}(x+y)+\sigma_{2}(-x-y)-\sigma_{2}(x-y)-\sigma_{2}(-x+y)\}.

In the second hidden layer (the third layer), the bias vector is zero B2(1)=(0,,0)6d1B^{(1)}_{2}=(0,\ldots,0)\in\mathbb{R}^{6d_{1}}, activation functions vector

Σ2(1)=(σ1,,σ12d1times,σ2,,σ24d1times),\Sigma^{(1)}_{2}=(\underbrace{\sigma_{1},\ldots,\sigma_{1}}_{2d_{1}\ {\rm times}},\underbrace{\sigma_{2},\ldots,\sigma_{2}}_{4d_{1}\ {\rm times}}),

and the corresponding weight matrix A2(1)A^{(1)}_{2} can be formulated accordingly without difficulty; it contains 2d1+8d1=10d12d_{1}+8d_{1}=10d_{1} non-zero parameters. Then in the last layer, using the identity maps and multiplication operations with weight matrix A3(1)A^{(1)}_{3} having 2d1+4d1=6d12d_{1}+4d_{1}=6d_{1} parameters and bias vector B3(1)B^{(1)}_{3} being zero, we obtain

(f1(1),,fd1(1),xjf1(1),,xjfd1(1))2d1.(f^{(1)}_{1},\ldots,f^{(1)}_{d_{1}},\frac{\partial}{\partial x_{j}}f^{(1)}_{1},\ldots,\frac{\partial}{\partial x_{j}}f^{(1)}_{d_{1}})\in\mathbb{R}^{2d_{1}}.

Such a Mixed RePUs neural network has 2 hidden layers (4 layers), 11d111d_{1} neurons, 2d0d1+3d1+10d1+6d1=2d0d1+19d12d_{0}d_{1}+3d_{1}+10d_{1}+6d_{1}=2d_{0}d_{1}+19d_{1} parameters and its width is (d0,3d1,6d1,2d1)(d_{0},3d_{1},6d_{1},2d_{1}). It is worth noting that the RePU activation functions are not applied to the last layer, since the construction here is for a single network. When we combine two consecutive subnetworks into one deep neural network, the RePU activation functions should be applied to the last layer of the first subnetwork. Hence, in the construction of the whole network, the last layer of the subnetwork here should output the 4d14d_{1} neurons

(σ1(f1(1)),σ1(f1(1)),σ1(fd1(1)),σ1(fd1(1)),\displaystyle(\sigma_{1}(f^{(1)}_{1}),\sigma_{1}(-f^{(1)}_{1})\ldots,\sigma_{1}(f^{(1)}_{d_{1}}),\sigma_{1}(-f^{(1)}_{d_{1}}),
σ1(xjf1(1)),σ1(xjf1(1)),σ1(xjfd1(1)),σ1(xjfd1(1)))4d1,\displaystyle\qquad\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(1)}_{1}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(1)}_{1})\ldots,\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(1)}_{d_{1}}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(1)}_{d_{1}}))\in\mathbb{R}^{4d_{1}},

to keep (f1(1),,fd1(1),xjf1(1),,xjfd1(1))(f^{(1)}_{1},\ldots,f^{(1)}_{d_{1}},\frac{\partial}{\partial x_{j}}f^{(1)}_{1},\ldots,\frac{\partial}{\partial x_{j}}f^{(1)}_{d_{1}}) available for the next subnetwork. Then for this Mixed RePUs neural network, the weight matrix A3(1)A^{(1)}_{3} has 2d1+8d1=10d12d_{1}+8d_{1}=10d_{1} parameters, the bias vector B3(1)B^{(1)}_{3} is zero and the activation functions vector Σ3(1)\Sigma^{(1)}_{3} has all elements equal to σ1\sigma_{1}. Such a Mixed RePUs neural network has 2 hidden layers (4 layers), 13d113d_{1} neurons, 2d0d1+3d1+10d1+10d1=2d0d1+23d12d_{0}d_{1}+3d_{1}+10d_{1}+10d_{1}=2d_{0}d_{1}+23d_{1} parameters and its width is (d0,3d1,6d1,4d1)(d_{0},3d_{1},6d_{1},4d_{1}).

Now we consider the second step, for any u=1,,d2u=1,\ldots,d_{2},

xjfu(2)=xjσp(i=1d1wui(2)fi(1)+bu(2))=pσp1(i=1d1wui(2)fi(1)+bu(2))i=1d1wu,i(2)xjfi(1),\displaystyle\frac{\partial}{\partial x_{j}}f^{(2)}_{u}=\frac{\partial}{\partial x_{j}}\sigma_{p}\Big{(}\sum_{i=1}^{d_{1}}w^{(2)}_{ui}f^{(1)}_{i}+b_{u}^{(2)}\Big{)}=p\sigma_{p-1}\Big{(}\sum_{i=1}^{d_{1}}w^{(2)}_{ui}f^{(1)}_{i}+b_{u}^{(2)}\Big{)}\cdot\sum_{i=1}^{d_{1}}w_{u,i}^{(2)}\frac{\partial}{\partial x_{j}}f^{(1)}_{i}, (B.2)

where wui(2)w^{(2)}_{ui} and bu(2)b_{u}^{(2)} are the weights and biases in the second layer of the original RePU network. Using the previously constructed subnetwork, we can start with its outputs

(σ1(f1(1)),σ1(f1(1)),σ1(fd1(1)),σ1(fd1(1)),\displaystyle(\sigma_{1}(f^{(1)}_{1}),\sigma_{1}(-f^{(1)}_{1})\ldots,\sigma_{1}(f^{(1)}_{d_{1}}),\sigma_{1}(-f^{(1)}_{d_{1}}),
σ1(xjf1(1)),σ1(xjf1(1)),σ1(xjfd1(1)),σ1(xjfd1(1)))4d1,\displaystyle\qquad\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(1)}_{1}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(1)}_{1})\ldots,\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(1)}_{d_{1}}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(1)}_{d_{1}}))\in\mathbb{R}^{4d_{1}},

as the inputs of the second subnetwork we are going to build. In the first hidden layer of the second subnetwork, we can obtain 3d23d_{2} neurons

(f1(2),,fd2(2),|i=1d1w1,i(2)xjfi(1)|,,|i=1d1wd2,i(2)xjfi(1)|,\displaystyle\Big{(}f^{(2)}_{1},\ldots,f^{(2)}_{d_{2}},|\sum_{i=1}^{d_{1}}w_{1,i}^{(2)}\frac{\partial}{\partial x_{j}}f^{(1)}_{i}|,\ldots,|\sum_{i=1}^{d_{1}}w_{d_{2},i}^{(2)}\frac{\partial}{\partial x_{j}}f^{(1)}_{i}|,
σp1(i=1d1w1i(2)fi(1)+b1(2)),,σp1(i=1d1wd2i(2)fi(1)+bd2(2))),\displaystyle\qquad\qquad\sigma_{p-1}(\sum_{i=1}^{d_{1}}w^{(2)}_{1i}f^{(1)}_{i}+b^{(2)}_{1}),\ldots,\sigma_{p-1}(\sum_{i=1}^{d_{1}}w^{(2)}_{d_{2}i}f^{(1)}_{i}+b^{(2)}_{d_{2}})\Big{)},

with weight matrix A1(2)3d2×4d1A^{(2)}_{1}\in\mathbb{R}^{3d_{2}\times 4d_{1}} having 6d1d26d_{1}d_{2} non-zero parameters, bias vector B1(2)3d2B^{(2)}_{1}\in\mathbb{R}^{3d_{2}} and activation functions vector Σ1(2)=Σ1(1)\Sigma^{(2)}_{1}=\Sigma^{(1)}_{1}. Similarly, the second hidden layer can be constructed to have 6d26d_{2} neurons with weight matrix A2(2)6d2×3d2A^{(2)}_{2}\in\mathbb{R}^{6d_{2}\times 3d_{2}} having 2d2+8d2=10d22d_{2}+8d_{2}=10d_{2} non-zero parameters, zero bias vector B2(2)6d2B^{(2)}_{2}\in\mathbb{R}^{6d_{2}} and activation functions vector Σ2(2)=Σ2(1)\Sigma^{(2)}_{2}=\Sigma^{(1)}_{2}. The second hidden layer here serves exactly the same purpose as in the first subnetwork: it implements the identity map for

(f1(2),,fd2(2)),(f^{(2)}_{1},\ldots,f^{(2)}_{d_{2}}),

and implement the multiplication in (B.2). Similarly, the last layer can also be constructed as that in the first subnetwork, which outputs

(σ1(f1(2)),σ1(f1(2)),σ1(fd2(2)),σ1(fd2(2)),\displaystyle(\sigma_{1}(f^{(2)}_{1}),\sigma_{1}(-f^{(2)}_{1})\ldots,\sigma_{1}(f^{(2)}_{d_{2}}),\sigma_{1}(-f^{(2)}_{d_{2}}),
σ1(xjf1(2)),σ1(xjf1(2)),σ1(xjfd2(2)),σ1(xjfd2(2)))4d2,\displaystyle\qquad\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(2)}_{1}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(2)}_{1})\ldots,\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(2)}_{d_{2}}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(2)}_{d_{2}}))\in\mathbb{R}^{4d_{2}},

with the weight matrix A3(2)A^{(2)}_{3} having 2d2+8d2=10d22d_{2}+8d_{2}=10d_{2} parameters, the bias vector B3(2)B^{(2)}_{3} being zero and the activation functions vector Σ3(2)\Sigma^{(2)}_{3} with all elements being σ1\sigma_{1}. Then the second Mixed RePUs subnetwork has 2 hidden layers (4 layers), 13d213d_{2} neurons, 6d1d2+3d2+10d2+10d2=6d1d2+23d26d_{1}d_{2}+3d_{2}+10d_{2}+10d_{2}=6d_{1}d_{2}+23d_{2} parameters and its width is (4d1,3d2,6d2,4d2)(4d_{1},3d_{2},6d_{2},4d_{2}).

We can then continue this process of construction. For integers k=3,,𝒟k=3,\ldots,\mathcal{D} and for any u=1,,dku=1,\ldots,d_{k},

xjfu(k)\displaystyle\frac{\partial}{\partial x_{j}}f^{(k)}_{u} =xjσp(i=1dk1wui(k)fi(k1)+bu(k))\displaystyle=\frac{\partial}{\partial x_{j}}\sigma_{p}\Big{(}\sum_{i=1}^{d_{k-1}}w^{(k)}_{ui}f^{(k-1)}_{i}+b_{u}^{(k)}\Big{)}
=pσp1(i=1dk1wui(k)fi(k1)+bu(k))i=1dk1wu,i(k)xjfi(k1),\displaystyle=p\sigma_{p-1}\Big{(}\sum_{i=1}^{d_{k-1}}w^{(k)}_{ui}f^{(k-1)}_{i}+b_{u}^{(k)}\Big{)}\cdot\sum_{i=1}^{d_{k-1}}w_{u,i}^{(k)}\frac{\partial}{\partial x_{j}}f^{(k-1)}_{i},

where wui(k)w^{(k)}_{ui} and bu(k)b_{u}^{(k)} are the weights and biases in the kk-th layer of the original RePU network. We can construct a Mixed RePUs network taking

(σ1(f1(k1)),σ1(f1(k1)),σ1(fdk1(k1)),σ1(fdk1(k1)),\displaystyle(\sigma_{1}(f^{(k-1)}_{1}),\sigma_{1}(-f^{(k-1)}_{1})\ldots,\sigma_{1}(f^{(k-1)}_{d_{k-1}}),\sigma_{1}(-f^{(k-1)}_{d_{k-1}}),
σ1(xjf1(k1)),σ1(xjf1(k1)),σ1(xjfdk1(k1)),σ1(xjfdk1(k1)))4dk1,\displaystyle\qquad\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(k-1)}_{1}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(k-1)}_{1})\ldots,\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(k-1)}_{d_{k-1}}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(k-1)}_{d_{k-1}}))\in\mathbb{R}^{4d_{k-1}},

as input, and it outputs

(σ1(f1(k)),σ1(f1(k)),σ1(fdk(k)),σ1(fdk(k)),\displaystyle(\sigma_{1}(f^{(k)}_{1}),\sigma_{1}(-f^{(k)}_{1})\ldots,\sigma_{1}(f^{(k)}_{d_{k}}),\sigma_{1}(-f^{(k)}_{d_{k}}),
σ1(xjf1(k)),σ1(xjf1(k)),σ1(xjfdk(k)),σ1(xjfdk(k)))4dk,\displaystyle\qquad\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(k)}_{1}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(k)}_{1})\ldots,\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(k)}_{d_{k}}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(k)}_{d_{k}}))\in\mathbb{R}^{4d_{k}},

with 2 hidden layers, 13dk13d_{k} hidden neurons, 6dk1dk+23dk6d_{k-1}d_{k}+23d_{k} parameters and its width is (4dk1,3dk,6dk,4dk)(4d_{k-1},3d_{k},6d_{k},4d_{k}).

Iterate this process until the k=𝒟+1k=\mathcal{D}+1 step, where the last layer of the original RePU network has only 1 neuron. For the RePU activated neural network fn=𝒟,𝒲,𝒰,𝒮,f\in\mathcal{F}_{n}=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B}}, the output of the network f:𝒳f:\mathcal{X}\to\mathbb{R} is a scalar and its partial derivative with respect to xjx_{j} is

xjf=xji=1d𝒟wi(𝒟)fi(𝒟)+b(𝒟)=i=1d𝒟wi(𝒟)xjfi(𝒟),\displaystyle\frac{\partial}{\partial x_{j}}f=\frac{\partial}{\partial x_{j}}\sum_{i=1}^{d_{\mathcal{D}}}w^{(\mathcal{D})}_{i}f^{(\mathcal{D})}_{i}+b^{(\mathcal{D})}=\sum_{i=1}^{d_{\mathcal{D}}}w^{(\mathcal{D})}_{i}\frac{\partial}{\partial x_{j}}f^{(\mathcal{D})}_{i},

where wi(𝒟)w^{(\mathcal{D})}_{i} and b(𝒟)b^{(\mathcal{D})} are the weights and the bias parameter in the last layer of the original RePU network. The constructed (𝒟+1)(\mathcal{D}+1)-th subnetwork takes

(σ1(f1(𝒟)),σ1(f1(𝒟)),σ1(fd𝒟(𝒟)),σ1(fd𝒟(𝒟)),\displaystyle(\sigma_{1}(f^{(\mathcal{D})}_{1}),\sigma_{1}(-f^{(\mathcal{D})}_{1})\ldots,\sigma_{1}(f^{(\mathcal{D})}_{d_{\mathcal{D}}}),\sigma_{1}(-f^{(\mathcal{D})}_{d_{\mathcal{D}}}),
σ1(xjf1(𝒟)),σ1(xjf1(𝒟)),σ1(xjfd𝒟(𝒟)),σ1(xjfd𝒟(𝒟)))4d𝒟,\displaystyle\qquad\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(\mathcal{D})}_{1}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(\mathcal{D})}_{1})\ldots,\sigma_{1}(\frac{\partial}{\partial x_{j}}f^{(\mathcal{D})}_{d_{\mathcal{D}}}),\sigma_{1}(-\frac{\partial}{\partial x_{j}}f^{(\mathcal{D})}_{d_{\mathcal{D}}}))\in\mathbb{R}^{4d_{\mathcal{D}}},

as input and outputs xjf(𝒟+1)=xjf\frac{\partial}{\partial x_{j}}f^{(\mathcal{D}+1)}=\frac{\partial}{\partial x_{j}}f, which is the partial derivative of the whole RePU network with respect to its jj-th argument xjx_{j}. This subnetwork has 2 hidden layers, width (4d𝒟,2,8,1)(4d_{\mathcal{D}},2,8,1), 11 neurons and 4d𝒟+2+16=4d𝒟+184d_{\mathcal{D}}+2+16=4d_{\mathcal{D}}+18 non-zero parameters.

Lastly, we combine all the 𝒟+1\mathcal{D}+1 subnetworks in order to form a big Mixed RePUs network which takes X=(x1,,xd)dX=(x_{1},\ldots,x_{d})\in\mathbb{R}^{d} as input and outputs xjf\frac{\partial}{\partial x_{j}}f for fn=𝒟,𝒲,𝒰,𝒮,,f\in\mathcal{F}_{n}=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}}. Recall that here 𝒟,𝒲,𝒰,𝒮\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S} are the depth, width, number of neurons and number of parameters of the original RePU network respectively, and we have 𝒰=i=0𝒟+1di\mathcal{U}=\sum_{i=0}^{\mathcal{D}+1}d_{i} and 𝒮=i=0𝒟(didi+1+di+1).\mathcal{S}=\sum_{i=0}^{\mathcal{D}}(d_{i}d_{i+1}+d_{i+1}). Then the big Mixed RePUs network has 3𝒟+33\mathcal{D}+3 hidden layers (3𝒟+53\mathcal{D}+5 layers in total), d0+i=1𝒟13di+1113𝒰d_{0}+\sum_{i=1}^{\mathcal{D}}13d_{i}+11\leq 13\mathcal{U} neurons, 2d0d1+23d1+i=1𝒟(6didi+1+23di+1)+4d𝒟+1823𝒮2d_{0}d_{1}+23d_{1}+\sum_{i=1}^{\mathcal{D}}(6d_{i}d_{i+1}+23d_{i+1})+4d_{\mathcal{D}}+18\leq 23\mathcal{S} parameters, and its width is 6max{d1,,d𝒟}=6𝒲6\max\{d_{1},\ldots,d_{\mathcal{D}}\}=6\mathcal{W}. This completes the proof. \hfill\Box
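The construction above amounts to propagating, layer by layer, both the neuron values and their partial derivatives with respect to xjx_{j} via (B.1)-(B.2). As a purely numerical illustration of the quantity the combined network computes (a sketch of the recursion only, not of the Mixed RePUs wiring), one may write:

```python
import numpy as np

def repu(z, p):                       # sigma_p(z) = max(0, z)^p
    return np.maximum(0.0, z) ** p

def repu_forward_with_partial(x, weights, biases, p, j):
    """Forward pass of a RePU network together with the partial derivative of
    every neuron with respect to x_j, following the recursions (B.1)-(B.2);
    weights[i], biases[i] parameterize layer i, and the last layer is linear."""
    h = np.asarray(x, dtype=float)
    dh = np.zeros_like(h)
    dh[j] = 1.0
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ h + b
        h, dh = repu(z, p), p * repu(z, p - 1) * (W @ dh)   # chain rule
    W, b = weights[-1], biases[-1]
    return (W @ h + b).item(), (W @ dh).item()              # f(x), (d/dx_j) f(x)

# a tiny example with one hidden layer (widths 3 -> 5 -> 1)
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((5, 3)), rng.standard_normal((1, 5))]
bs = [rng.standard_normal(5), rng.standard_normal(1)]
f_val, df_dx0 = repu_forward_with_partial([0.3, -0.2, 0.5], Ws, bs, p=2, j=0)
```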

Proof of Lemma 2

We follow the idea of the proof of Theorem 6 in Bartlett et al. (2019) to prove a somewhat stronger result, in which we give an upper bound on the Pseudo dimension of the class \mathcal{F} of Mixed RePUs networks in terms of the depth, size and number of neurons of the network. Our Pseudo dimension bound is stronger than the bound on the VC dimension of sign(){\rm sign}(\mathcal{F}) given in Bartlett et al. (2019), since VCdim(sign())Pdim(){\rm VCdim}({\rm sign}(\mathcal{F}))\leq{\rm Pdim}(\mathcal{F}).

Let 𝒵\mathcal{Z} denote the domain of the functions ff\in\mathcal{F} and let tt\in\mathbb{R}; we consider a new class of functions

~:={f~(z,t)=sign(f(z)t):f}.\tilde{\mathcal{F}}:=\{\tilde{f}(z,t)={\rm sign}(f(z)-t):f\in\mathcal{F}\}.

Then it is clear that Pdim()VCdim(~){\rm Pdim}(\mathcal{F})\leq{\rm VCdim}(\tilde{\mathcal{F}}), and we next bound the VC dimension of ~\tilde{\mathcal{F}}. Recall that the total number of parameters (weights and biases) in the neural network implementing functions in \mathcal{F} is 𝒮\mathcal{S}; we let θ𝒮\theta\in\mathbb{R}^{\mathcal{S}} denote the parameter vector of the network f(,θ):𝒵f(\cdot,\theta):\mathcal{Z}\to\mathbb{R} implemented in \mathcal{F}. Here we intend to derive a bound for

K(m):=|{(sign(f(z1,θ)t1),,sign(f(zm,θ)tm)):θ𝒮}|K(m):=\Big{|}\{({\rm sign}(f(z_{1},\theta)-t_{1}),\ldots,{\rm sign}(f(z_{m},\theta)-t_{m})):\theta\in\mathbb{R}^{\mathcal{S}}\}\Big{|}

which holds uniformly over all choices of {zi}i=1m\{z_{i}\}_{i=1}^{m} and {ti}i=1m\{t_{i}\}_{i=1}^{m}. Note that the maximum of K(m)K(m) over all choices of {zi}i=1m\{z_{i}\}_{i=1}^{m} and {ti}i=1m\{t_{i}\}_{i=1}^{m} is exactly the growth function of ~\tilde{\mathcal{F}}. To give a uniform bound on K(m)K(m), we use Theorem 8.3 in Anthony and Bartlett (1999) as the main tool in the analysis.

Lemma 30 (Theorem 8.3 in Anthony and Bartlett (1999))

Let p1,,pmp_{1},\ldots,p_{m} be polynomials in nn variables of degree at most dd. If nmn\leq m, define

K:=|{(sign(p1(x)),,sign(pm(x))):xn}|,K:=|\{({\rm sign}(p_{1}(x)),\ldots,{\rm sign}(p_{m}(x))):x\in\mathbb{R}^{n}\}|,

i.e. KK is the number of possible sign vectors given by the polynomials. Then K2(2emd/n)nK\leq 2(2emd/n)^{n}.

Now if we can find a partition 𝒫={P1,,PN}\mathcal{P}=\{P_{1},\ldots,P_{N}\} of the parameter domain 𝒮\mathbb{R}^{\mathcal{S}} such that within each region PiP_{i}, the functions f(zj,)f(z_{j},\cdot) are all fixed polynomials of bounded degree, then K(m)K(m) can be bounded via the following sum

K(m)i=1N|{(sign(f(z1,θ)t1),,sign(f(zm,θ)tm)):θPi}|,\displaystyle K(m)\leq\sum_{i=1}^{N}\Big{|}\{({\rm sign}(f(z_{1},\theta)-t_{1}),\ldots,{\rm sign}(f(z_{m},\theta)-t_{m})):\theta\in P_{i}\}\Big{|}, (B.3)

and each term in this sum can be bounded via Lemma 30. Next, we construct the partition in the same way as in Bartlett et al. (2019), iteratively layer by layer. We define a sequence of successive refinements 𝒫1,,𝒫𝒟\mathcal{P}_{1},\ldots,\mathcal{P}_{\mathcal{D}} satisfying the following properties:

  • 1.

    The cardinality |𝒫1|=1|\mathcal{P}_{1}|=1 and for each n{1,,𝒟}n\in\{1,\ldots,\mathcal{D}\},

    |𝒫n+1||𝒫n|2(2emkn(1+(n1)pn1)𝒮n)𝒮n,\frac{|\mathcal{P}_{n+1}|}{|\mathcal{P}_{n}|}\leq 2\Big{(}\frac{2emk_{n}(1+(n-1)p^{n-1})}{\mathcal{S}_{n}}\Big{)}^{\mathcal{S}_{n}},

    where knk_{n} denotes the number of neurons in the nn-th layer and 𝒮n\mathcal{S}_{n} denotes the total number of parameters (weights and biases) at the inputs to units in all the layers up to layer nn.

  • 2.

For each n{1,,𝒟}n\in\{1,\ldots,\mathcal{D}\}, each element PP of 𝒫n\mathcal{P}_{n}, each j{1,,m}j\in\{1,\ldots,m\}, and each unit uu in the nn-th layer, when θ\theta varies in PP, the net input to uu is a fixed polynomial function in 𝒮n\mathcal{S}_{n} variables of θ\theta, of total degree no more than 1+(n1)pn11+(n-1)p^{n-1}, where for each layer the activation functions are σ1,,σp\sigma_{1},\ldots,\sigma_{p} for some integer p2p\geq 2 (this polynomial may depend on P,jP,j and uu).

One can define 𝒫1=𝒮\mathcal{P}_{1}=\mathbb{R}^{\mathcal{S}}, and it can be verified that 𝒫1\mathcal{P}_{1} satisfies property 2 above. Note that in our case, for fixed zjz_{j} and tjt_{j} and any subset P𝒮P\subset\mathbb{R}^{\mathcal{S}}, f(zj,θ)tjf(z_{j},\theta)-t_{j} is a polynomial with respect to θ\theta whose degree is the same as that of f(zj,θ)f(z_{j},\theta), which is no more than 1+(𝒟1)p𝒟11+(\mathcal{D}-1)p^{\mathcal{D}-1}. The construction of 𝒫1,,𝒫𝒟\mathcal{P}_{1},\ldots,\mathcal{P}_{\mathcal{D}} and the verification of properties 1 and 2 then proceed in the same way as in Bartlett et al. (2019). Finally we obtain a partition 𝒫𝒟\mathcal{P}_{\mathcal{D}} of 𝒮\mathbb{R}^{\mathcal{S}} such that for each P𝒫𝒟P\in\mathcal{P}_{\mathcal{D}}, the network output in response to any zjz_{j} is a fixed polynomial of θP\theta\in P of degree no more than 1+(𝒟1)p𝒟11+(\mathcal{D}-1)p^{\mathcal{D}-1} (since the last node simply outputs its input). Then by Lemma 30,

|{(sign(f(z1,θ)t1),,sign(f(zm,θ)tm)):θP}|2(2em(1+(𝒟1)p𝒟1)𝒮𝒟)𝒮𝒟.\Big{|}\{({\rm sign}(f(z_{1},\theta)-t_{1}),\ldots,{\rm sign}(f(z_{m},\theta)-t_{m})):\theta\in P\}\Big{|}\leq 2\Big{(}\frac{2em(1+(\mathcal{D}-1)p^{\mathcal{D}-1})}{\mathcal{S}_{\mathcal{D}}}\Big{)}^{\mathcal{S}_{\mathcal{D}}}.

Besides, by property 1 we have

|𝒫𝒟|\displaystyle|\mathcal{P}_{\mathcal{D}}| Πi=1𝒟12(2emki(1+(i1)pi1)𝒮i)𝒮i.\displaystyle\leq\Pi_{i=1}^{\mathcal{D-1}}2\Big{(}\frac{2emk_{i}(1+(i-1)p^{i-1})}{\mathcal{S}_{i}}\Big{)}^{\mathcal{S}_{i}}.

Then using (B.3), and since the sample points z1,,zmz_{1},\ldots,z_{m} are arbitrary, we have

K(m)\displaystyle K(m) Πi=1𝒟2(2emki(1+(i1)pi1)𝒮i)𝒮i\displaystyle\leq\Pi_{i=1}^{\mathcal{D}}2\Big{(}\frac{2emk_{i}(1+(i-1)p^{i-1})}{\mathcal{S}_{i}}\Big{)}^{\mathcal{S}_{i}}
2𝒟(2emki(1+(i1)pi1)𝒮i)𝒮i\displaystyle\leq 2^{\mathcal{D}}\Big{(}\frac{2em\sum k_{i}(1+(i-1)p^{i-1})}{\sum\mathcal{S}_{i}}\Big{)}^{\sum\mathcal{S}_{i}}
(4em(1+(𝒟1)p𝒟1)ki𝒮i)𝒮i\displaystyle\leq\Big{(}\frac{4em(1+(\mathcal{D}-1)p^{\mathcal{D}-1})\sum k_{i}}{\sum\mathcal{S}_{i}}\Big{)}^{\sum\mathcal{S}_{i}}
(4em(1+(𝒟1)p𝒟1))𝒮i,\displaystyle\leq\Big{(}4em(1+(\mathcal{D}-1)p^{\mathcal{D}-1})\Big{)}^{\sum\mathcal{S}_{i}},

where the second inequality follows from the weighted arithmetic-geometric mean inequality, the third holds since 𝒟𝒮i\mathcal{D}\leq\sum\mathcal{S}_{i} and the last holds since ki𝒮i\sum k_{i}\leq\sum\mathcal{S}_{i}. Since the growth function of ~\tilde{\mathcal{F}} is bounded by K(m)K(m), we have

2Pdim()2VCdim(~)K(VCdim(~))2𝒟(2eRVCdim(~)𝒮i)𝒮i\displaystyle 2^{{\rm Pdim}(\mathcal{F})}\leq 2^{{\rm VCdim}(\tilde{\mathcal{F}})}\leq K({\rm VCdim}(\tilde{\mathcal{F}}))\leq 2^{\mathcal{D}}\Big{(}\frac{2eR\cdot{\rm VCdim}(\tilde{\mathcal{F}})}{\sum\mathcal{S}_{i}}\Big{)}^{\sum\mathcal{S}_{i}}

where R:=i=1𝒟ki(1+(i1)pi1)𝒰+𝒰(𝒟1)p𝒟1.R:=\sum_{i=1}^{\mathcal{D}}k_{i}(1+(i-1)p^{i-1})\leq\mathcal{U}+\mathcal{U}(\mathcal{D}-1)p^{\mathcal{D}-1}. Since 𝒰>0\mathcal{U}>0 and 2eR162eR\geq 16, then by Lemma 16 in Bartlett et al. (2019) we have

Pdim()𝒟+(i=1𝒟𝒮i)log2(4eRlog2(2eR)).{\rm Pdim}(\mathcal{F})\leq\mathcal{D}+(\sum_{i=1}^{\mathcal{D}}\mathcal{S}_{i})\log_{2}(4eR\log_{2}(2eR)).

Note that i=1𝒟𝒮i𝒟𝒮\sum_{i=1}^{\mathcal{D}}\mathcal{S}_{i}\leq\mathcal{D}\mathcal{S} and log2(R)log2(𝒰{1+(𝒟1)p𝒟1})log2(𝒰)+p𝒟\log_{2}(R)\leq\log_{2}(\mathcal{U}\{1+(\mathcal{D}-1)p^{\mathcal{D}-1}\})\leq\log_{2}(\mathcal{U})+p\mathcal{D}, then we have

Pdim()𝒟+𝒟𝒮(2p𝒟+2log2𝒰+6)3p𝒟𝒮(𝒟+log2𝒰){\rm Pdim}(\mathcal{F})\leq\mathcal{D}+\mathcal{D}\mathcal{S}(2p\mathcal{D}+2\log_{2}\mathcal{U}+6)\leq 3p\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})

This completes the proof. \hfill\Box

Proof of Theorem 3

We begin the proof by considering the simple case of constructing a RePU network that represents a univariate polynomial with no error. We can leverage Horner's method (also known as Qin Jiushao's algorithm) to construct such networks. Suppose f(x)=a0+a1x++aNxNf(x)=a_{0}+a_{1}x+\cdots+a_{N}x^{N} is a univariate polynomial of degree NN; then it can be written as

f(x)=a0+x(a1+x(a2+x(a3++x(aN1+xaN)))).f(x)=a_{0}+x(a_{1}+x(a_{2}+x(a_{3}+\cdots+x(a_{N-1}+xa_{N})))).

We can iteratively calculate a sequence of intermediate variables b1,,bNb_{1},\ldots,b_{N} by

bk={aN1+xaN,k=1,aNk+xbk1,k=2,,N.\displaystyle b_{k}=\Big{\{}\begin{array}[]{lr}a_{N-1}+xa_{N},\qquad k=1,\\ a_{N-k}+xb_{k-1},\ \ \ k=2,\ldots,N.\\ \end{array}\Big{.}

Then we obtain bN=f(x)b_{N}=f(x). By (iii) in Lemma 40, we know that a RePU network with one hidden layer and no more than 2p2p nodes can represent any polynomial of the input with order no more than pp. Obviously, for input xx, the identity map xx, the linear transformation ax+bax+b and the square map x2x^{2} are all polynomials of xx with order no more than pp. In addition, it is not hard to see that the multiplication operator xy={(x+y)2(xy)2}/4xy=\{(x+y)^{2}-(x-y)^{2}\}/4 can be represented by a RePU network with one hidden layer and 4p4p nodes. Then calculating b1b_{1} needs a RePU network with 1 hidden layer and 2p2p hidden neurons, and calculating b2b_{2} needs a RePU network with 3 hidden layers and 2×2p+1×4p+22\times 2p+1\times 4p+2 hidden neurons. By induction, calculating bN=f(x)b_{N}=f(x) for N1N\geq 1 needs a RePU network with 2N12N-1 hidden layers, N×2p+(N1)×4p+(N1)×2=(6p+2)(N1)+2pN\times 2p+(N-1)\times 4p+(N-1)\times 2=(6p+2)(N-1)+2p hidden neurons, (N1)(30p+2)+2p+1(N-1)(30p+2)+2p+1 parameters (weights and biases), and width equal to 6p6p.
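The scheme above is simply Horner evaluation of the polynomial; for illustration, a minimal Python sketch of the recursion b1,,bNb_{1},\ldots,b_{N} (ours, not part of the network construction) is:

```python
def horner(coeffs, x):
    """Evaluate a0 + a1*x + ... + aN*x^N by Horner's method;
    coeffs = [a0, a1, ..., aN]. This is exactly the recursion b_1, ..., b_N."""
    b = coeffs[-1]                        # start from a_N
    for a in reversed(coeffs[:-1]):       # b_k = a_{N-k} + x * b_{k-1}
        b = a + x * b
    return b

# example: f(x) = 1 + 2x + 3x^2 at x = 2 gives 17
assert horner([1, 2, 3], 2) == 17
```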

Apart from the construction based on Horner's method, another construction is given in Theorem 2 of Li et al. (2019), where the constructed RePU network has logpN+2\lceil\log_{p}N\rceil+2 hidden layers, O(N)O(N) neurons and O(pN)O(pN) parameters (weights and biases).

Now we consider constructing RePU networks to compute a multivariate polynomial ff with total degree NN on d\mathbb{R}^{d}. For any d+d\in\mathbb{N}^{+} and N0N\in\mathbb{N}_{0}, let

fNd(x1,,xd)=i1++id=0Nai1,i2,,idx1i1x2i2xdid,f^{d}_{N}(x_{1},\ldots,x_{d})=\sum_{i_{1}+\cdots+i_{d}=0}^{N}a_{i_{1},i_{2},\ldots,i_{d}}x_{1}^{i_{1}}x_{2}^{i_{2}}\cdots x_{d}^{i_{d}},

denote a polynomial of dd variables with total degree NN, where i1,i2,,idi_{1},i_{2},\ldots,i_{d} are non-negative integers and {ai1,i2,,id:i1++idN}\{a_{i_{1},i_{2},\ldots,i_{d}}:i_{1}+\cdots+i_{d}\leq N\} are coefficients in \mathbb{R}. Note that the multivariate polynomial fNdf^{d}_{N} can be written as

fNd(x1,,xd)=i1=0N(i2++id=0Ni1ai1,i2,idx2i2xdid)x1i1,\displaystyle f^{d}_{N}(x_{1},\ldots,x_{d})=\sum_{i_{1}=0}^{N}\Big{(}\sum_{i_{2}+\cdots+i_{d}=0}^{N-i_{1}}a_{i_{1},i_{2}\ldots,i_{d}}x_{2}^{i_{2}}\cdots x_{d}^{i_{d}}\Big{)}x_{1}^{i_{1}},

and we can view fNdf^{d}_{N} as a univariate polynomial of x1x_{1} with degree NN when x2,,xdx_{2},\ldots,x_{d} are given, provided that for each i1{0,,N}i_{1}\in\{0,\ldots,N\} the (d1)(d-1)-variate polynomial i2++id=0Ni1ai1,i2,idx2i2xdid\sum_{i_{2}+\cdots+i_{d}=0}^{N-i_{1}}a_{i_{1},i_{2}\ldots,i_{d}}x_{2}^{i_{2}}\cdots x_{d}^{i_{d}} with degree no more than NN can be computed by a proper RePU network. This suggests that the RePU network for fNdf^{d}_{N} can be constructed recursively, via composition of the networks for fN1,fN2,,fNdf^{1}_{N},f^{2}_{N},\ldots,f^{d}_{N}, by induction.
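The evaluation order that this recursive construction mirrors can be written as a small Python sketch (ours; it evaluates the polynomial, it does not build the network):

```python
def eval_poly(coeffs, x, N=None):
    """Evaluate a d-variate polynomial of total degree N by the recursion used
    in the proof: treat it as a univariate polynomial in x[0] whose coefficients
    are (d-1)-variate polynomials, zero-padded to degree N.
    coeffs maps multi-indices (i1, ..., id) with i1+...+id <= N to coefficients."""
    if N is None:
        N = max(sum(k) for k in coeffs)
    if len(x) == 1:                                   # univariate base case
        c = [coeffs.get((i,), 0.0) for i in range(N + 1)]
    else:                                             # coefficients of x[0]^i1
        c = [eval_poly({k[1:]: v for k, v in coeffs.items() if k[0] == i1},
                       x[1:], N) for i1 in range(N + 1)]
    b = c[-1]                                         # Horner's method in x[0]
    for a in reversed(c[:-1]):
        b = a + x[0] * b
    return b

# f(x1, x2) = 1 + 2*x1*x2 + x2^2  (total degree N = 2)
assert eval_poly({(0, 0): 1.0, (1, 1): 2.0, (0, 2): 1.0}, [2.0, 3.0]) == 22.0
```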

By Horner’s method we have constructed a RePU network with 2N12N-1 hidden layers, (6p+2)(N1)+2p(6p+2)(N-1)+2p hidden neurons and (N1)(30p+2)+2p+1(N-1)(30p+2)+2p+1 parameters to exactly compute fN1f^{1}_{N}. Now we start to show fN2f^{2}_{N} can be computed by RePU networks. We can write fN2f^{2}_{N} as

fN2(x1,x2)=i+j=0Naijx1ix2j=i=0N(j=0Niaijx2j)x1i.f^{2}_{N}(x_{1},x_{2})=\sum_{i+j=0}^{N}a_{ij}x_{1}^{i}x_{2}^{j}=\sum_{i=0}^{N}\Big{(}\sum_{j=0}^{N-i}a_{ij}x_{2}^{j}\Big{)}x_{1}^{i}.

Note that for i{0,,N}i\in\{0,\ldots,N\}, the degree of the polynomial j=0Niaijx2j\sum_{j=0}^{N-i}a_{ij}x_{2}^{j} is NiN-i, which is no more than NN. We can still view it as a polynomial of degree NN by padding (adding zero terms) such that j=0Niaijx2j=j=0Naijx2j\sum_{j=0}^{N-i}a_{ij}x_{2}^{j}=\sum_{j=0}^{N}a^{*}_{ij}x_{2}^{j} where aij=aija^{*}_{ij}=a_{ij} if i+jNi+j\leq N and aij=0a^{*}_{ij}=0 if i+j>Ni+j>N. In such a way, for each i{0,,N}i\in\{0,\ldots,N\} the polynomial j=0Niaijx2j\sum_{j=0}^{N-i}a_{ij}x_{2}^{j} can be computed by a RePU network with 2N12N-1 hidden layers, (6p+2)(N1)+2p(6p+2)(N-1)+2p hidden neurons, (N1)(30p+2)+2p+1(N-1)(30p+2)+2p+1 parameters and width equal to 6p6p. Besides, for each i{0,,N}i\in\{0,\ldots,N\}, the monomial x1ix_{1}^{i} can also be computed by a RePU network with 2N12N-1 hidden layers, (6p+2)(N1)+2p(6p+2)(N-1)+2p hidden neurons, (N1)(30p+2)+2p+1(N-1)(30p+2)+2p+1 parameters and width equal to 6p6p, in whose implementation the identity maps are used after the (2i1)(2i-1)-th hidden layer. Now we parallel these two subnetworks to get a RePU network which takes x1x_{1} and x2x_{2} as input and outputs (j=0Niaijx2j)x1i(\sum_{j=0}^{N-i}a_{ij}x_{2}^{j})x_{1}^{i}, with width 12p12p, 2N12N-1 hidden layers, 2×[(6p+2)(N1)+2p]2\times[(6p+2)(N-1)+2p] neurons and size 2×[(N1)(30p+2)+2p+1]2\times[(N-1)(30p+2)+2p+1]. Since such a paralleled RePU network can be constructed for each i{0,,N}i\in\{0,\ldots,N\}, with straightforward paralleling of NN such RePU networks we obtain a RePU network that exactly computes fN2f^{2}_{N} with width 12pN12pN, 2N12N-1 hidden layers, 2×[(6p+2)(N1)+2p]×N14pN22\times[(6p+2)(N-1)+2p]\times N\leq 14pN^{2} neurons and 2×[(N1)(30p+2)+2p+1]×N62pN22\times[(N-1)(30p+2)+2p+1]\times N\leq 62pN^{2} parameters.

Similarly, for the polynomial fN3f^{3}_{N} of 3 variables, we can write fN3f^{3}_{N} as

fN3(x1,x2,x3)=i+j+k=0Naijkx1ix2jx3k=i=0N(j+k=0Niaijkx2jx3k)x1i.f^{3}_{N}(x_{1},x_{2},x_{3})=\sum_{i+j+k=0}^{N}a_{ijk}x_{1}^{i}x_{2}^{j}x_{3}^{k}=\sum_{i=0}^{N}\Big{(}\sum_{j+k=0}^{N-i}a_{ijk}x_{2}^{j}x_{3}^{k}\Big{)}x_{1}^{i}.

By our previous argument, for each i{0,,N}i\in\{0,\ldots,N\}, there exists a RePU network which takes (x1,x2,x3)(x_{1},x_{2},x_{3}) as input and outputs (j+k=0Niaijkx2jx3k)x1i\Big{(}\sum_{j+k=0}^{N-i}a_{ijk}x_{2}^{j}x_{3}^{k}\Big{)}x_{1}^{i} with width 12pN+6p12pN+6p, hidden layers 2N12N-1, number of neurons 2N×[(6p+2)(N1)+2p]+[(6p+2)(N1)+2p]2N\times[(6p+2)(N-1)+2p]+[(6p+2)(N-1)+2p] and parameters 2N×[(N1)(30p+2)+2p+1]+[(N1)(30p+2)+2p+1]2N\times[(N-1)(30p+2)+2p+1]+[(N-1)(30p+2)+2p+1]. And by paralleling NN such subnetworks, we obtain a RePU network that exactly computes fN3f^{3}_{N} with width (12pN+6p)×N=12pN2+6pN(12pN+6p)\times N=12pN^{2}+6pN, hidden layers 2N12N-1, number of neurons 2N2×[(6p+2)(N1)+2p]+N×[(6p+2)(N1)+2p]2N^{2}\times[(6p+2)(N-1)+2p]+N\times[(6p+2)(N-1)+2p] and number of parameters 2N2×[(N1)(30p+2)+2p+1]+N×[(N1)(30p+2)+2p+1]2N^{2}\times[(N-1)(30p+2)+2p+1]+N\times[(N-1)(30p+2)+2p+1].

Continuing this process, we can construct RePU networks that exactly compute polynomials of any dd variables with total degree NN. With a slight abuse of notation, we let 𝒲k\mathcal{W}_{k}, 𝒟k\mathcal{D}_{k}, 𝒰k\mathcal{U}_{k} and 𝒮k\mathcal{S}_{k} denote the width, number of hidden layers, number of neurons and number of parameters (weights and biases), respectively, of the RePU network computing fNkf^{k}_{N} for k=1,2,3,k=1,2,3,\ldots. We have shown that

𝒟1=2N1,𝒲1=6p,𝒰1=(6p+2)(N1)+2p,𝒮1=(N1)(30p+2)+2p+1.\displaystyle\mathcal{D}_{1}=2N-1,\quad\mathcal{W}_{1}=6p,\quad\mathcal{U}_{1}=(6p+2)(N-1)+2p,\quad\mathcal{S}_{1}=(N-1)(30p+2)+2p+1.

Besides, based on the iterative procedure of the network construction, by induction we can see that for k=2,3,4,k=2,3,4,\ldots the following equations hold:

𝒟k=\displaystyle\mathcal{D}_{k}= 2N1,\displaystyle 2N-1,
𝒲k=\displaystyle\mathcal{W}_{k}= N×(𝒲k1+𝒲1),\displaystyle N\times(\mathcal{W}_{k-1}+\mathcal{W}_{1}),
𝒰k=\displaystyle\mathcal{U}_{k}= N×(𝒰k1+𝒰1),\displaystyle N\times(\mathcal{U}_{k-1}+\mathcal{U}_{1}),
𝒮k=\displaystyle\mathcal{S}_{k}= N×(𝒮k1+𝒮1).\displaystyle N\times(\mathcal{S}_{k-1}+\mathcal{S}_{1}).

Then based on the values of 𝒟1,𝒲1,𝒰1,𝒮1\mathcal{D}_{1},\mathcal{W}_{1},\mathcal{U}_{1},\mathcal{S}_{1} and the recursion formula, we have for k=2,3,4,k=2,3,4,\ldots

𝒟k=2N1,\displaystyle\mathcal{D}_{k}=2N-1,
𝒲k=12pNk1+6pNk1NN1,\displaystyle\mathcal{W}_{k}=12pN^{k-1}+6p\frac{N^{k-1}-N}{N-1},
𝒰k=\displaystyle\mathcal{U}_{k}= N×(𝒰k1+𝒰1)=2𝒰1Nk1+𝒰1Nk1NN1\displaystyle N\times(\mathcal{U}_{k-1}+\mathcal{U}_{1})=2\mathcal{U}_{1}N^{k-1}+\mathcal{U}_{1}\frac{N^{k-1}-N}{N-1}
=\displaystyle= (6p+2)(2NkNk1N)+2p(2NkNk1NN1),\displaystyle(6p+2)(2N^{k}-N^{k-1}-N)+2p(\frac{2N^{k}-N^{k-1}-N}{N-1}),
𝒮k=\displaystyle\mathcal{S}_{k}= N×(𝒮k1+𝒮1)=2𝒮1Nk1+𝒮1Nk1NN1\displaystyle N\times(\mathcal{S}_{k-1}+\mathcal{S}_{1})=2\mathcal{S}_{1}N^{k-1}+\mathcal{S}_{1}\frac{N^{k-1}-N}{N-1}
=\displaystyle= (30p+2)(2NkNk1N)+(2p+1)(2NkNk1NN1).\displaystyle(30p+2)(2N^{k}-N^{k-1}-N)+(2p+1)(\frac{2N^{k}-N^{k-1}-N}{N-1}).

This completes our proof. \hfill\Box
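As a quick sanity check, the closed-form expressions above agree with the recursion; a small Python sketch (ours) verifying this numerically:

```python
def repu_poly_network_sizes(p, N, d):
    """Width, number of neurons and number of parameters of the RePU network
    computing a d-variate polynomial of total degree N, via the recursion
    W_k = N (W_{k-1} + W_1), U_k = N (U_{k-1} + U_1), S_k = N (S_{k-1} + S_1)."""
    W1 = 6 * p
    U1 = (6 * p + 2) * (N - 1) + 2 * p
    S1 = (N - 1) * (30 * p + 2) + 2 * p + 1
    W, U, S = W1, U1, S1
    for _ in range(2, d + 1):
        W, U, S = N * (W + W1), N * (U + U1), N * (S + S1)
    return W, U, S

p, N, d = 2, 5, 3
W, U, S = repu_poly_network_sizes(p, N, d)
q = (2 * N ** d - N ** (d - 1) - N) // (N - 1)       # (2N^d - N^{d-1} - N)/(N-1)
assert W == 12 * p * N ** (d - 1) + 6 * p * (N ** (d - 1) - N) // (N - 1)
assert U == (6 * p + 2) * (2 * N ** d - N ** (d - 1) - N) + 2 * p * q
assert S == (30 * p + 2) * (2 * N ** d - N ** (d - 1) - N) + (2 * p + 1) * q
```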

Proof of Theorem 5

The proof is straightforward, leveraging the approximation power of multivariate polynomials together with Theorem 3. The theory of polynomial approximation has been extensively studied on various spaces of smooth functions. We rely on Bagby et al. (2002) for the polynomial approximation of smooth functions in our proof.

Lemma 31 (Theorem 2 in Bagby et al. (2002))

Let ff be a function of compact support on d\mathbb{R}^{d} of class CsC^{s} where s+s\in\mathbb{N}^{+} and let KK be a compact subset of d\mathbb{R}^{d} which contains the support of ff. Then for each nonnegative integer NN there is a polynomial pNp_{N} of degree at most NN on d\mathbb{R}^{d} with the following property: for each multi-index α\alpha with |α|1min{s,N}|\alpha|_{1}\leq\min\{s,N\} we have

supK|Dα(fpN)|CNs|α|1|α|1ssupK|Dαf|,\sup_{K}|D^{\alpha}(f-p_{N})|\leq\frac{C}{N^{s-|\alpha|_{1}}}\sum_{|\alpha|_{1}\leq s}\sup_{K}|D^{\alpha}f|,

where CC is a positive constant depending only on d,sd,s and KK.

The proof of Lemma 31 can be found in Bagby et al. (2002), based on the Whitney extension theorem (Theorem 2.3.6 in Hörmander (2015)); by examining the proof of Theorem 1 in Bagby et al. (2002), the dependence of the constant CC in Lemma 31 on d,sd,s and KK can be made explicit.

To use Lemma 31, we need to find a RePU network to compute the pNp_{N} for each N+N\in\mathbb{N}^{+}. By Theorem 3, we know that any pNp_{N} of dd variables can be exactly computed by a RePU network ϕN\phi_{N} with 2N12N-1 hidden layers, (6p+2)(2NdNd1N)+2p(2NdNd1N)/(N1)(6p+2)(2N^{d}-N^{d-1}-N)+2p(2N^{d}-N^{d-1}-N)/(N-1) number of neurons, (30p+2)(2NdNd1N)+(2p+1)(2NdNd1N)/(N1)(30p+2)(2N^{d}-N^{d-1}-N)+(2p+1)(2N^{d}-N^{d-1}-N)/(N-1) number of parameters (weights and bias) and network width 12pNd1+6p(Nd1N)/(N1)12pN^{d-1}+6p(N^{d-1}-N)/(N-1). Then we have

supK|Dα(fϕN)|Cs,d,KN(s|α|1)fCs,\sup_{K}|D^{\alpha}(f-\phi_{N})|\leq C_{s,d,K}N^{-(s-|\alpha|_{1})}\|f\|_{C^{s}},

where Cs,d,KC_{s,d,K} is a positive constant depending only on d,sd,s and KK. Note that the number of neurons 𝒰=𝒪(18pNd)\mathcal{U}=\mathcal{O}(18pN^{d}), which implies (𝒰/18p)1/dN(\mathcal{U}/18p)^{1/d}\leq N. Then we also have

supK|Dα(fϕN)|Cp,s,d,K𝒰(s|α|1)/dfCs,\sup_{K}|D^{\alpha}(f-\phi_{N})|\leq C_{p,s,d,K}\mathcal{U}^{-(s-|\alpha|_{1})/d}\|f\|_{C^{s}},

where Cp,s,d,KC_{p,s,d,K} is a positive constant depending only on p,d,sp,d,s and KK. This completes the proof. \hfill\Box

Proof of Theorem 8

The idea of our proof is to project the data to a low-dimensional space and then use a deep RePU neural network to approximate the resulting low-dimensional function.

Given any integer dδ=O(dlog(d/δ)/δ2)d_{\delta}=O(d_{\mathcal{M}}{\log(d/\delta)}/{\delta^{2}}) satisfying dδdd_{\delta}\leq d, by Theorem 3.1 in Baraniuk and Wakin (2009) there exists a linear projector Adδ×dA\in\mathbb{R}^{d_{\delta}\times d} that maps a low-dimensional manifold in a high-dimensional space to a low-dimensional space while nearly preserving distances. Specifically, there exists a matrix Adδ×dA\in\mathbb{R}^{d_{\delta}\times d} such that AAT=(d/dδ)IdδAA^{T}=(d/d_{\delta})I_{d_{\delta}}, where IdδI_{d_{\delta}} is the identity matrix of size dδ×dδd_{\delta}\times d_{\delta}, and

(1δ)x1x22Ax1Ax22(1+δ)x1x22,(1-\delta)\|x_{1}-x_{2}\|_{2}\leq\|Ax_{1}-Ax_{2}\|_{2}\leq(1+\delta)\|x_{1}-x_{2}\|_{2},

for any x1,x2.x_{1},x_{2}\in\mathcal{M}.
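The theorem only asserts the existence of such an AA; in practice a suitably scaled random projection typically has these properties with high probability. The following small NumPy sketch (ours, purely illustrative and not part of the proof) constructs a matrix satisfying AAT=(d/dδ)IdδAA^{T}=(d/d_{\delta})I_{d_{\delta}} and checks the distance distortion empirically on points of a curve in d\mathbb{R}^{d}:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_delta = 50, 12

# A with orthonormal rows, scaled so that A A^T = (d / d_delta) * I_{d_delta}
Q, _ = np.linalg.qr(rng.standard_normal((d, d_delta)))   # Q: (d, d_delta)
A = np.sqrt(d / d_delta) * Q.T                            # A: (d_delta, d)
assert np.allclose(A @ A.T, (d / d_delta) * np.eye(d_delta))

# points on a one-dimensional manifold (a curve) embedded in R^d
t = rng.uniform(0.0, 1.0, size=(200, 1))
M = np.hstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), t,
               np.zeros((200, d - 3))])

# empirical distortion of pairwise distances under x -> A x
i, j = rng.integers(0, 200, 500), rng.integers(0, 200, 500)
keep = i != j
num = np.linalg.norm((M[i[keep]] - M[j[keep]]) @ A.T, axis=1)
den = np.linalg.norm(M[i[keep]] - M[j[keep]], axis=1)
print((num / den).min(), (num / den).max())   # ratios are close to 1
```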

Note that for any zA()z\in A(\mathcal{M}), there exists a unique xx\in\mathcal{M} such that Ax=zAx=z. Then for any zA()z\in A(\mathcal{M}), define xz=𝒮({x:Ax=z})x_{z}=\mathcal{SL}(\{x\in\mathcal{M}:Ax=z\}), where 𝒮()\mathcal{SL}(\cdot) is a set function which returns the unique element of a set. If Ax=zAx=z where xx\in\mathcal{M} and zA()z\in A(\mathcal{M}), then x=xzx=x_{z} by our argument, since {x:Ax=z}\{x\in\mathcal{M}:Ax=z\} is a set with only one element when zA()z\in A(\mathcal{M}). We can see that 𝒮:A()\mathcal{SL}:A(\mathcal{M})\to\mathcal{M} is a differentiable function with the norm of its derivative lying in [1/(1+δ),1/(1δ)][1/(1+\delta),1/(1-\delta)], since

11+δz1z22xz1xz2211δz1z22,\frac{1}{1+\delta}\|z_{1}-z_{2}\|_{2}\leq\|x_{z_{1}}-x_{z_{2}}\|_{2}\leq\frac{1}{1-\delta}\|z_{1}-z_{2}\|_{2},

for any z1,z2A()z_{1},z_{2}\in A(\mathcal{M}). For the high-dimensional function f0:𝒳1f_{0}:\mathcal{X}\to\mathbb{R}^{1}, we define its low-dimensional representation f~0:dδ1\tilde{f}_{0}:\mathbb{R}^{d_{\delta}}\to\mathbb{R}^{1} by

f~0(z)=f0(xz),foranyzA()dδ.\tilde{f}_{0}(z)=f_{0}(x_{z}),\quad{\rm for\ any}\ z\in A(\mathcal{M})\subseteq\mathbb{R}^{d_{\delta}}.

Recall that f0Cs(𝒳)f_{0}\in C^{s}(\mathcal{X}); then f~0Cs(A())\tilde{f}_{0}\in C^{s}(A(\mathcal{M})). Note that \mathcal{M} is a compact manifold and AA is a linear mapping; then by the extended version of Whitney's extension theorem in Fefferman (2006), there exists a function F~0Cs(A(ρ))\tilde{F}_{0}\in C^{s}(A(\mathcal{M}_{\rho})) such that F~0(z)=f~0(z)\tilde{F}_{0}(z)=\tilde{f}_{0}(z) for any zA()z\in A(\mathcal{M}) and F~0C1(1+δ)f0C1\|\tilde{F}_{0}\|_{C^{1}}\leq(1+\delta)\|f_{0}\|_{C^{1}}. By Theorem 5, for any N+N\in\mathbb{N}^{+}, there exists a function f~n:dδ1\tilde{f}_{n}:\mathbb{R}^{d_{\delta}}\to\mathbb{R}^{1} implemented by a RePU network with its depth 𝒟\mathcal{D}, width 𝒲\mathcal{W}, number of neurons 𝒰\mathcal{U} and size 𝒮\mathcal{S} specified as

𝒟=2N1,𝒲=12pNdδ1+6p(Ndδ1N)/(N1)\displaystyle\mathcal{D}=2N-1,\qquad\mathcal{W}=12pN^{d_{\delta}-1}+6p(N^{d_{\delta}-1}-N)/(N-1)
𝒰=(6p+2)(2NdδNdδ1N)+2p(2NdδNdδ1N)/(N1),\displaystyle\mathcal{U}=(6p+2)(2N^{d_{\delta}}-N^{d_{\delta}-1}-N)+2p(2N^{d_{\delta}}-N^{d_{\delta}-1}-N)/(N-1),
𝒮=(30p+2)(2NdδNdδ1N)+(2p+1)(2NdδNdδ1N)/(N1),\displaystyle\mathcal{S}=(30p+2)(2N^{d_{\delta}}-N^{d_{\delta}-1}-N)+(2p+1)(2N^{d_{\delta}}-N^{d_{\delta}-1}-N)/(N-1),

such that for each multi-index α0d\alpha\in\mathbb{N}^{d}_{0} with |α|11|\alpha|_{1}\leq 1, we have

|Dα(f~n(z)F~0(z))|Cs,dδ,A(ρ)N(s|α|1)F~0C1,|D^{\alpha}(\tilde{f}_{n}(z)-\tilde{F}_{0}(z))|\leq C_{s,d_{\delta},A(\mathcal{M}_{\rho})}N^{-(s-|\alpha|_{1})}\|\tilde{F}_{0}\|_{C^{1}},

for all zA(ρ)z\in A(\mathcal{M}_{\rho}) where Cs,dδ,A(ρ)>0C_{s,d_{\delta},A(\mathcal{M}_{\rho})}>0 is a constant depending only on s,dδ,A(ρ)s,d_{\delta},A(\mathcal{M}_{\rho}).

By Theorem 3, the linear projection AA can be computed by a RePU network with 1 hidden layer and width no more than 18p. If we define fn=f~nAf^{*}_{n}=\tilde{f}_{n}\circ A, i.e., fn(x)=f~n(Ax)f^{*}_{n}(x)=\tilde{f}_{n}(Ax) for any x𝒳x\in\mathcal{X}, then fn𝒟,𝒲,𝒰,𝒮,f^{*}_{n}\in\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B}} is also a RePU network, with one more layer than f~n\tilde{f}_{n}. For any xρx\in\mathcal{M}_{\rho}, there exists an x~\tilde{x}\in\mathcal{M} such that xx~2ρ\|x-\tilde{x}\|_{2}\leq\rho. Then, for each multi-index α0d\alpha\in\mathbb{N}^{d}_{0} with |α|11|\alpha|_{1}\leq 1, we have

|Dα(fn(x)f0(x))|=|Dα(f~n(Ax)F~0(Ax)+F~0(Ax)F~0(Ax~)+F~0(Ax~)f0(x))|\displaystyle|D^{\alpha}(f^{*}_{n}(x)-f_{0}(x))|=|D^{\alpha}(\tilde{f}_{n}(Ax)-\tilde{F}_{0}(Ax)+\tilde{F}_{0}(Ax)-\tilde{F}_{0}(A\tilde{x})+\tilde{F}_{0}(A\tilde{x})-f_{0}(x))|
|Dα(f~n(Ax)F~0(Ax))|+|Dα(F~0(Ax)F~0(Ax~))|+|Dα(F~0(Ax~)f0(x))|\displaystyle\leq|D^{\alpha}(\tilde{f}_{n}(Ax)-\tilde{F}_{0}(Ax))|+|D^{\alpha}(\tilde{F}_{0}(Ax)-\tilde{F}_{0}(A\tilde{x}))|+|D^{\alpha}(\tilde{F}_{0}(A\tilde{x})-f_{0}(x))|
Cs,dδ,A(ρ)N(s|α|1)F~0C|α|1+(1+δ)ρF~0C|α|1+|Dα(f0(x~)f0(x))|\displaystyle\leq C_{s,d_{\delta},A(\mathcal{M}_{\rho})}N^{-(s-|\alpha|_{1})}\|\tilde{F}_{0}\|_{C^{|\alpha|_{1}}}+(1+\delta)\rho\|\tilde{F}_{0}\|_{C^{|\alpha|_{1}}}+|D^{\alpha}(f_{0}(\tilde{x})-f_{0}(x))|
[Cs,dδ,A(ρ)N(s|α|1)+(1+δ)ρ]F~0C|α|1+ρf0C|α|1\displaystyle\leq\big{[}C_{s,d_{\delta},A(\mathcal{M}_{\rho})}N^{-(s-|\alpha|_{1})}+(1+\delta)\rho\big{]}\|\tilde{F}_{0}\|_{C^{|\alpha|_{1}}}+\rho\|f_{0}\|_{C^{|\alpha|_{1}}}
Cs,dδ,A(ρ)(1+δ)f0C|α|1N(s|α|1)+2(1+δ)2ρf0C|α|1\displaystyle\leq C_{s,d_{\delta},A(\mathcal{M}_{\rho})}(1+\delta)\|f_{0}\|_{C^{|\alpha|_{1}}}N^{-(s-|\alpha|_{1})}+2(1+\delta)^{2}\rho\|f_{0}\|_{C^{|\alpha|_{1}}}
C~s,dδ,A(ρ)(1+δ)f0C|α|1N(s|α|1),\displaystyle\leq\tilde{C}_{s,d_{\delta},A(\mathcal{M}_{\rho})}(1+\delta)\|f_{0}\|_{C^{|\alpha|_{1}}}N^{-(s-|\alpha|_{1})},
Cp,s,dδ,A(ρ)(1+δ)f0C|α|1𝒰(s|α|1)/dδ,\displaystyle\leq{C}_{p,s,d_{\delta},A(\mathcal{M}_{\rho})}(1+\delta)\|f_{0}\|_{C^{|\alpha|_{1}}}\mathcal{U}^{-(s-|\alpha|_{1})/d_{\delta}},

where ${C}_{p,s,d_{\delta},A(\mathcal{M}_{\rho})}$ is a constant depending only on $p,s,d_{\delta},A(\mathcal{M}_{\rho})$. The second-to-last inequality follows from $\rho\leq C_{1}N^{-(s-1)}(1+\delta)^{-1}$. Since the number of neurons satisfies $\mathcal{U}=\mathcal{O}(18pN^{d_{\delta}})$ and $(\mathcal{U}/18p)^{1/d_{\delta}}\leq N$, the last inequality follows. This completes the proof.

\hfill\Box
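As an aside, the composition $f^{*}_{n}=\tilde{f}_{n}\circ A$ constructed in the proof above is straightforward to realize in code. The following is a minimal sketch, assuming PyTorch; the class and attribute names are ours and purely illustrative.

```python
import torch

class ProjectedRePUNet(torch.nn.Module):
    """Realizes f*_n(x) = f_tilde(A x): a fixed linear projection followed by a
    network acting on the low-dimensional representation (e.g., a RePU MLP)."""

    def __init__(self, A, low_dim_net):
        super().__init__()
        self.register_buffer("A", A)      # (d_delta, d) projection matrix, not trained
        self.low_dim_net = low_dim_net    # plays the role of f_tilde_n

    def forward(self, x):                 # x has shape (n, d)
        z = x @ self.A.T                  # project to shape (n, d_delta)
        return self.low_dim_net(z)        # apply f_tilde_n on the projected inputs
```

Since $A$ is linear, the projection adds only one extra layer in front of $\tilde{f}_{n}$, matching the layer count used in the proof.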

Proof of Lemma 10

For the empirical risk minimizer s^n\hat{s}_{n} based on the sample S={Xi}i=1nS=\{X_{i}\}_{i=1}^{n}, we consider its excess risk 𝔼{J(s^n)J(s0)}\mathbb{E}\{J(\hat{s}_{n})-J(s_{0})\}.

For any sns\in\mathcal{F}_{n}, we have

J(s^n)J(s0)\displaystyle J(\hat{s}_{n})-J(s_{0}) =J(s^n)Jn(s^n)+Jn(s^n)Jn(s)+Jn(s)J(s)+J(s)J(s0)\displaystyle=J(\hat{s}_{n})-J_{n}(\hat{s}_{n})+J_{n}(\hat{s}_{n})-J_{n}(s)+J_{n}(s)-J(s)+J(s)-J(s_{0})
J(s^n)Jn(s^n)+Jn(s)J(s)+J(s)J(s0)\displaystyle\leq J(\hat{s}_{n})-J_{n}(\hat{s}_{n})+J_{n}(s)-J(s)+J(s)-J(s_{0})
2supsn|J(s)Jn(s)|+J(s)J(s0),\displaystyle\leq 2\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)|+J(s)-J(s_{0}),

where the first inequality follows from the definition of the empirical risk minimizer $\hat{s}_{n}$, and the second inequality holds since $\hat{s}_{n},s\in\mathcal{F}_{n}$. Since the above inequality holds for any $s\in\mathcal{F}_{n}$, we obtain

\displaystyle J(\hat{s}_{n})-J(s_{0})\leq 2\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)|+\inf_{s\in\mathcal{F}_{n}}[J(s)-J(s_{0})],

where we call 2supsn|J(s)Jn(s)|2\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)| the stochastic error and infsn[J(s)J(s0)]\inf_{s\in\mathcal{F}_{n}}[J(s)-J(s_{0})] the approximation error.

Bounding the stochastic error

Recall that $s,s_{0}$ are vector-valued functions. We write $s=(s_{1},\ldots,s_{d})^{\top}$ and $s_{0}=(s_{01},\ldots,s_{0d})^{\top}$, and let $\frac{\partial}{\partial x_{j}}s_{j}$ denote the $j$-th diagonal entry of $\nabla_{x}s$ and $\frac{\partial}{\partial x_{j}}s_{0j}$ the $j$-th diagonal entry of $\nabla_{x}s_{0}$. Then

J(s)\displaystyle J(s) =𝔼[tr(xs(X))+12s(X)22]=𝔼[j=1dxjsj(X)+12j=1d|sj(X)|2].\displaystyle=\mathbb{E}\Big{[}tr(\nabla_{x}s(X))+\frac{1}{2}\|s(X)\|^{2}_{2}\Big{]}=\mathbb{E}\Big{[}\sum_{j=1}^{d}\frac{\partial}{\partial x_{j}}s_{j}(X)+\frac{1}{2}\sum_{j=1}^{d}|s_{j}(X)|^{2}\Big{]}.
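For intuition, the empirical counterpart $J_{n}(s)$ of this objective can be computed with automatic differentiation. Below is a minimal sketch, assuming PyTorch and a differentiable module `score_net` mapping a batch in $\mathbb{R}^{n\times d}$ to $\mathbb{R}^{n\times d}$ (for instance a RePU MLP); it is illustrative only and not part of the proof.

```python
import torch

def empirical_score_matching_loss(score_net, x):
    # J_n(s) = (1/n) * sum_i [ tr(grad_x s(X_i)) + 0.5 * ||s(X_i)||_2^2 ]
    x = x.detach().requires_grad_(True)
    s = score_net(x)                                   # shape (n, d)
    trace = torch.zeros(x.shape[0], device=x.device)
    for j in range(x.shape[1]):
        # j-th diagonal entry of the Jacobian: (d/dx_j) s_j evaluated at each X_i
        grad_j = torch.autograd.grad(s[:, j].sum(), x, create_graph=True)[0][:, j]
        trace = trace + grad_j
    return (trace + 0.5 * (s ** 2).sum(dim=1)).mean()
```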

Define

\displaystyle J^{1,j}(s)=\mathbb{E}\Big[\frac{\partial}{\partial x_{j}}s_{j}(X)\Big]\qquad J^{1,j}_{n}(s)=\frac{1}{n}\sum_{i=1}^{n}\Big[\frac{\partial}{\partial x_{j}}s_{j}(X_{i})\Big]
\displaystyle J^{2,j}(s)=\mathbb{E}\Big[\frac{1}{2}|s_{j}(X)|^{2}\Big]\qquad J^{2,j}_{n}(s)=\frac{1}{n}\sum_{i=1}^{n}\Big[\frac{1}{2}|s_{j}(X_{i})|^{2}\Big]

for j=1,,dj=1,\ldots,d and sns\in\mathcal{F}_{n}. Then,

supsn|J(s)Jn(s)|\displaystyle\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)| supsnj=1d[|J1,j(s)Jn1,j(s)|+|J2,j(s)Jn2,j(s)|]\displaystyle\leq\sup_{s\in\mathcal{F}_{n}}\sum_{j=1}^{d}\Big{[}|J^{1,j}(s)-J^{1,j}_{n}(s)|+|J^{2,j}(s)-J^{2,j}_{n}(s)|\Big{]}
j=1dsupsn|J1,j(s)Jn1,j(s)|+j=1dsupsn|J2,j(s)Jn2,j(s)|.\displaystyle\leq\sum_{j=1}^{d}\sup_{s\in\mathcal{F}_{n}}|J^{1,j}(s)-J^{1,j}_{n}(s)|+\sum_{j=1}^{d}\sup_{s\in\mathcal{F}_{n}}|J^{2,j}(s)-J^{2,j}_{n}(s)|. (B.4)

Recall that for any $s\in\mathcal{F}_{n}$, the outputs satisfy $\|s_{j}\|_{\infty}\leq\mathcal{B}$ and the partial derivatives satisfy $\|\frac{\partial}{\partial x_{j}}s_{j}\|_{\infty}\leq\mathcal{B}^{\prime}$ for $j=1,\ldots,d$. Then by Theorem 11.8 in Mohri et al. (2018), for any $\delta>0$, with probability at least $1-\delta$ over the choice of the $n$ i.i.d. sample $S$,

supsn|J1,j(s)Jn1,j(s)|22Pdim(jn)log(en)n+2log(1/δ)2n\displaystyle\sup_{s\in\mathcal{F}_{n}}|J^{1,j}(s)-J^{1,j}_{n}(s)|\leq 2\mathcal{B}^{\prime}\sqrt{\frac{2{\rm Pdim}(\mathcal{F}^{\prime}_{jn})\log(en)}{n}}+2\mathcal{B}^{\prime}\sqrt{\frac{\log(1/\delta)}{2n}} (B.5)

for $j=1,\ldots,d$, where ${\rm Pdim}(\mathcal{F}^{\prime}_{jn})$ is the pseudo dimension of $\mathcal{F}^{\prime}_{jn}$ and $\mathcal{F}^{\prime}_{jn}=\{\frac{\partial}{\partial x_{j}}s_{j}:s\in\mathcal{F}_{n}\}$. Similarly, with probability at least $1-\delta$ over the choice of the $n$ i.i.d. sample $S$,

supsn|J2,j(s)Jn2,j(s)|22Pdim(jn)log(en)n+2log(1/δ)2n\displaystyle\sup_{s\in\mathcal{F}_{n}}|J^{2,j}(s)-J^{2,j}_{n}(s)|\leq\mathcal{B}^{2}\sqrt{\frac{2{\rm Pdim}(\mathcal{F}_{jn})\log(en)}{n}}+\mathcal{B}^{2}\sqrt{\frac{\log(1/\delta)}{2n}} (B.6)

for j=1,,dj=1,\ldots,d where jn={sj:sn}\mathcal{F}_{jn}=\{s_{j}:s\in\mathcal{F}_{n}\}. Combining (B.4), (B.5) and (B.6), we have proved that for any δ>0\delta>0, with probability at least 12dδ1-2d\delta,

\displaystyle\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)|\leq\sqrt{\frac{2\log(en)}{n}}\sum_{j=1}^{d}\Big[\mathcal{B}^{2}\sqrt{{\rm Pdim}(\mathcal{F}_{jn})}+2\mathcal{B}^{\prime}\sqrt{{\rm Pdim}(\mathcal{F}^{\prime}_{jn})}\Big]
+d[2+2]log(1/δ)2n.\displaystyle\qquad\qquad+d\Big{[}\mathcal{B}^{2}+2\mathcal{B}^{\prime}\Big{]}\sqrt{\frac{\log(1/\delta)}{2n}}.

Note that s=(s1,,sd)s=(s_{1},\ldots,s_{d})^{\top} for sns\in\mathcal{F}_{n} where n=𝒟,𝒲,𝒰,𝒮,,\mathcal{F}_{n}=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}} is a class of RePU neural networks with depth 𝒟\mathcal{D}, width 𝒲\mathcal{W}, size 𝒮\mathcal{S} and number of neurons 𝒰\mathcal{U}. Then for each j=1,,dj=1,\ldots,d, the function class jn={sj:sn}\mathcal{F}_{jn}=\{s_{j}:s\in\mathcal{F}_{n}\} consists of RePU neural networks with depth 𝒟\mathcal{D}, width 𝒲\mathcal{W}, number of neurons 𝒰(d1)\mathcal{U}-(d-1) and size no more than 𝒮\mathcal{S}. By Lemma 2, we have Pdim(1n)=Pdim(2n)==Pdim(dn)3p𝒟𝒮(𝒟+log2𝒰){\rm Pdim}(\mathcal{F}_{1n})={\rm Pdim}(\mathcal{F}_{2n})=\ldots={\rm Pdim}(\mathcal{F}_{dn})\leq 3p\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U}). Similarly, by Theorem 1 and Lemma 2, we have Pdim(1n)=Pdim(2n)==Pdim(dn)2484p𝒟𝒮(𝒟+log2𝒰){\rm Pdim}(\mathcal{F}^{\prime}_{1n})={\rm Pdim}(\mathcal{F}^{\prime}_{2n})=\ldots={\rm Pdim}(\mathcal{F}^{\prime}_{dn})\leq 2484p\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U}). Then, for any δ>0\delta>0, with probability at least 1δ1-\delta

supsn|J(s)Jn(s)|\displaystyle\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)| 50×d(2+2)(502plog(en)𝒟𝒮(𝒟+log2𝒰)n+log(2d/δ)2n).\displaystyle\leq 50\times d(\mathcal{B}^{2}+2\mathcal{B}^{\prime})\Bigg{(}50\sqrt{\frac{2p\log(en)\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})}{n}}+\sqrt{\frac{\log(2d/\delta)}{2n}}\Bigg{)}.

If we let $t=50d(\mathcal{B}^{2}+2\mathcal{B}^{\prime})\big(50\sqrt{{2p\log(en)\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})}/{n}}\big)$, then the above inequality implies

(supsn|J(s)Jn(s)|\displaystyle\mathbb{P}\Bigg{(}\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)| ϵ)2dexp(2n(ϵt)2[50d(2+2)]2),\displaystyle\geq\epsilon\Bigg{)}\leq 2d\exp\left(\frac{-2n(\epsilon-t)^{2}}{[50d(\mathcal{B}^{2}+2\mathcal{B}^{\prime})]^{2}}\right),

for ϵt\epsilon\geq t. And

𝔼[supsn|J(s)Jn(s)|]\displaystyle\mathbb{E}\left[\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)|\right]
=0(supsn|J(s)Jn(s)|u)𝑑u\displaystyle=\int_{0}^{\infty}\mathbb{P}\Bigg{(}\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)|\geq u\Bigg{)}du
=0t(supsn|J(s)Jn(s)|u)𝑑u+t(supsn|J(s)Jn(s)|u)𝑑u\displaystyle=\int_{0}^{t}\mathbb{P}\Bigg{(}\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)|\geq u\Bigg{)}du+\int_{t}^{\infty}\mathbb{P}\Bigg{(}\sup_{s\in\mathcal{F}_{n}}|J(s)-J_{n}(s)|\geq u\Bigg{)}du
0t1𝑑u+t2dexp(2n(ut)2[50d(2+2)]2)𝑑u\displaystyle\leq\int_{0}^{t}1du+\int_{t}^{\infty}2d\exp\left(\frac{-2n(u-t)^{2}}{[50d(\mathcal{B}^{2}+2\mathcal{B}^{\prime})]^{2}}\right)du
=t+252πd2(2+2)1n\displaystyle=t+25\sqrt{2\pi}d^{2}(\mathcal{B}^{2}+2\mathcal{B}^{\prime})\frac{1}{\sqrt{n}}
2575d2(2+2)2plog(en)𝒟𝒮(𝒟+log2𝒰)/n.\displaystyle\leq 2575d^{2}(\mathcal{B}^{2}+2\mathcal{B}^{\prime})\sqrt{{2p\log(en)\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})}/{n}}.
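For completeness, the Gaussian-integral step behind the second-to-last line above is, writing $c=50d(\mathcal{B}^{2}+2\mathcal{B}^{\prime})$ and $v=u-t$,

\int_{t}^{\infty}2d\exp\Big(\frac{-2n(u-t)^{2}}{c^{2}}\Big)du=2d\int_{0}^{\infty}\exp\Big(\frac{-2nv^{2}}{c^{2}}\Big)dv=dc\sqrt{\frac{\pi}{2n}}=25\sqrt{2\pi}\,d^{2}(\mathcal{B}^{2}+2\mathcal{B}^{\prime})\frac{1}{\sqrt{n}}.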

Bounding the approximation error

Recall that for sns\in\mathcal{F}_{n},

J(s)\displaystyle J(s) =𝔼[tr(xs(X))+12s(X)22]=𝔼[j=1dxjsj(X)+12j=1d|sj(X)|2],\displaystyle=\mathbb{E}\Big{[}tr(\nabla_{x}s(X))+\frac{1}{2}\|s(X)\|^{2}_{2}\Big{]}=\mathbb{E}\Big{[}\sum_{j=1}^{d}\frac{\partial}{\partial x_{j}}s_{j}(X)+\frac{1}{2}\sum_{j=1}^{d}|s_{j}(X)|^{2}\Big{]},

and the excess risk

J(s)J(s0)\displaystyle J(s)-J(s_{0})
=𝔼[j=1dxjsj(X)+12j=1d|sj(X)|2]𝔼[j=1dxjs0j(X)+12j=1d|s0j(X)|2]\displaystyle=\mathbb{E}\Big{[}\sum_{j=1}^{d}\frac{\partial}{\partial x_{j}}s_{j}(X)+\frac{1}{2}\sum_{j=1}^{d}|s_{j}(X)|^{2}\Big{]}-\mathbb{E}\Big{[}\sum_{j=1}^{d}\frac{\partial}{\partial x_{j}}s_{0j}(X)+\frac{1}{2}\sum_{j=1}^{d}|s_{0j}(X)|^{2}\Big{]}
=j=1d𝔼[xjsj(X)xjs0j(X)+12|sj(X)|212|s0j(X)|2].\displaystyle=\sum_{j=1}^{d}\mathbb{E}\Big{[}\frac{\partial}{\partial x_{j}}s_{j}(X)-\frac{\partial}{\partial x_{j}}s_{0j}(X)+\frac{1}{2}|s_{j}(X)|^{2}-\frac{1}{2}|s_{0j}(X)|^{2}\Big{]}.

Recall that $s=(s_{1},\ldots,s_{d})^{\top}$ and $s_{0}=(s_{01},\ldots,s_{0d})^{\top}$ are vector-valued functions. For each $j=1,\ldots,d$, we let $\mathcal{F}_{jn}$ be a class of RePU neural networks with depth $\mathcal{D}$, width $\mathcal{W}$, size $\mathcal{S}$ and number of neurons $\mathcal{U}$. Define $\tilde{\mathcal{F}}_{n}=\{s=(s_{1},\ldots,s_{d})^{\top}:s_{j}\in\mathcal{F}_{jn},j=1,\ldots,d\}$. The neural networks in $\tilde{\mathcal{F}}_{n}$ have depth $\mathcal{D}$, width $d\mathcal{W}$, size $d\mathcal{S}$ and number of neurons $d\mathcal{U}$; they can be viewed as built by placing in parallel $d$ subnetworks from $\mathcal{F}_{jn}$, $j=1,\ldots,d$. Let $\mathcal{F}_{n}$ be the class of all RePU neural networks with depth $\mathcal{D}$, width $d\mathcal{W}$, size $d\mathcal{S}$ and number of neurons $d\mathcal{U}$. Then $\tilde{\mathcal{F}}_{n}\subset\mathcal{F}_{n}$ and

infsn[J(s)J(s0)]\displaystyle\inf_{s\in\mathcal{F}_{n}}\Big{[}J(s)-J(s_{0})\Big{]}
infs~n[J(s)J(s0)]\displaystyle\leq\inf_{s\in\tilde{\mathcal{F}}_{n}}\Big{[}J(s)-J(s_{0})\Big{]}
=infs=(s1,,sd),sjjn,j=1,,d[J(s)J(s0)]\displaystyle=\inf_{s=(s_{1},\ldots,s_{d})^{\top},s_{j}\in\mathcal{F}_{jn},j=1,\ldots,d}\Big{[}J(s)-J(s_{0})\Big{]}
=j=1dinfsjjn𝔼[xjsj(X)xjs0j(X)+12|sj(X)|212|s0j(X)|2].\displaystyle=\sum_{j=1}^{d}\inf_{s_{j}\in\mathcal{F}_{jn}}\mathbb{E}\Big{[}\frac{\partial}{\partial x_{j}}s_{j}(X)-\frac{\partial}{\partial x_{j}}s_{0j}(X)+\frac{1}{2}|s_{j}(X)|^{2}-\frac{1}{2}|s_{0j}(X)|^{2}\Big{]}.

We now focus on deriving an upper bound for

infsjjn𝔼[xjsj(X)xjs0j(X)+12|sj(X)|212|s0j(X)|2]\inf_{s_{j}\in\mathcal{F}_{jn}}\mathbb{E}\Big{[}\frac{\partial}{\partial x_{j}}s_{j}(X)-\frac{\partial}{\partial x_{j}}s_{0j}(X)+\frac{1}{2}|s_{j}(X)|^{2}-\frac{1}{2}|s_{0j}(X)|^{2}\Big{]}

for each $j=1,\ldots,d$. By assumption, $\|\frac{\partial}{\partial x_{j}}s_{0j}\|_{\infty},\|\frac{\partial}{\partial x_{j}}s_{j}\|_{\infty}\leq\mathcal{B}^{\prime}$ and $\|s_{0j}\|_{\infty},\|s_{j}\|_{\infty}\leq\mathcal{B}$ for any $s_{j}\in\mathcal{F}_{jn}$. For each $j=1,\ldots,d$, the target $s_{0j}$ defined on the domain $\mathcal{X}$ is a real-valued function belonging to the class $C^{m}$ for some $1\leq m<\infty$. By Theorem 5, for any $N\in\mathbb{N}^{+}$, there exists a RePU activated neural network $s_{Nj}$ with $2N-1$ hidden layers, $(6p+2)(2N^{d}-N^{d-1}-N)+2p(2N^{d}-N^{d-1}-N)/(N-1)$ neurons, $(30p+2)(2N^{d}-N^{d-1}-N)+(2p+1)(2N^{d}-N^{d-1}-N)/(N-1)$ parameters and width $12pN^{d-1}+6p(N^{d-1}-N)/(N-1)$ such that for each multi-index $\alpha\in\mathbb{N}^{d}_{0}$ with $|\alpha|_{1}\leq 1$,

sup𝒳|Dα(s0jsNj)|Cm,d,𝒳N(m|α|1)s0jC|α|1,\sup_{\mathcal{X}}|D^{\alpha}(s_{0j}-s_{Nj})|\leq C_{m,d,\mathcal{X}}N^{-(m-|\alpha|_{1})}\|s_{0j}\|_{C^{|\alpha|_{1}}},

where Cm,d,𝒳C_{m,d,\mathcal{X}} is a positive constant depending only on d,md,m and the diameter of 𝒳\mathcal{X}. Then

infsjjn𝔼[xjsj(X)xjs0j(X)+12|sj(X)|212|s0j(X)|2]\displaystyle\inf_{s_{j}\in\mathcal{F}_{jn}}\mathbb{E}\Big{[}\frac{\partial}{\partial x_{j}}s_{j}(X)-\frac{\partial}{\partial x_{j}}s_{0j}(X)+\frac{1}{2}|s_{j}(X)|^{2}-\frac{1}{2}|s_{0j}(X)|^{2}\Big{]}
𝔼[xjsNj(X)xjs0j(X)+12|sNj(X)|212|s0j(X)|2]\displaystyle\leq\mathbb{E}\Big{[}\frac{\partial}{\partial x_{j}}s_{Nj}(X)-\frac{\partial}{\partial x_{j}}s_{0j}(X)+\frac{1}{2}|s_{Nj}(X)|^{2}-\frac{1}{2}|s_{0j}(X)|^{2}\Big{]}
Cm,d,𝒳N(m1)s0jC1+Cm,d,𝒳Nms0jC0\displaystyle\leq C_{m,d,\mathcal{X}}N^{-(m-1)}\|s_{0j}\|_{C^{1}}+C_{m,d,\mathcal{X}}\mathcal{B}N^{-m}\|s_{0j}\|_{C^{0}}
Cm,d,𝒳(1+)N(m1)s0jC1\displaystyle\leq{C}_{m,d,\mathcal{X}}(1+\mathcal{B})N^{-(m-1)}\|s_{0j}\|_{C^{1}}

holds for each $j=1,\ldots,d$. Summing the above inequalities over $j$, we obtain

\inf_{s\in\mathcal{F}_{n}}\Big[J(s)-J(s_{0})\Big]\leq{C}_{m,d,\mathcal{X}}(1+\mathcal{B})N^{-(m-1)}\|s_{0}\|_{C^{1}},

Non-asymptotic error bound

Based on the obtained stochastic error bound and approximation error bound, we can conclude that with probability at least 1δ1-\delta, the empirical risk minimizer s^n\hat{s}_{n} defined in (9) satisfies

J(s^n)J(s0)\displaystyle J(\hat{s}_{n})-J(s_{0}) 100×d(2+2)(502plog(en)𝒟𝒮(𝒟+log2𝒰)n+log(2d/δ)2n)\displaystyle\leq 100\times d(\mathcal{B}^{2}+2\mathcal{B}^{\prime})\Bigg{(}50\sqrt{\frac{2p\log(en)\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})}{n}}+\sqrt{\frac{\log(2d/\delta)}{2n}}\Bigg{)}
+Cm,d,𝒳(1+)N(m1)s0C1,\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad+{C}_{m,d,\mathcal{X}}(1+\mathcal{B})N^{-(m-1)}\|s_{0}\|_{C^{1}},

and

𝔼{J(s^n)J(s0)}\displaystyle\mathbb{E}\{J(\hat{s}_{n})-J(s_{0})\} 2575d2(2+2)2plog(en)𝒟𝒮(𝒟+log2𝒰)/n\displaystyle\leq 2575d^{2}(\mathcal{B}^{2}+2\mathcal{B}^{\prime})\sqrt{{2p\log(en)\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})}/{n}}
+Cm,d,𝒳(1+)N(m1)s0C1,\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad+{C}_{m,d,\mathcal{X}}(1+\mathcal{B})N^{-(m-1)}\|s_{0}\|_{C^{1}},

where Cm,d,𝒳C_{m,d,\mathcal{X}} is a positive constant depending only on d,md,m and the diameter of 𝒳\mathcal{X}.

Note that the network depth $\mathcal{D}=2N-1$ is a positive odd number. We therefore take $\mathcal{D}$ to be a positive odd number and let the class of neural networks be specified by depth $\mathcal{D}$, width $\mathcal{W}=18pd[(\mathcal{D}+1)/2]^{d-1}$, number of neurons $\mathcal{U}=18pd[(\mathcal{D}+1)/2]^{d}$ and size $\mathcal{S}=67pd[(\mathcal{D}+1)/2]^{d}$. Then the error bound can be further expressed in terms of $\mathcal{U}$:

𝔼{J(s^n)J(s0)}\displaystyle\mathbb{E}\{J(\hat{s}_{n})-J(s_{0})\} Cp2d3(2+2)𝒰(d+2)/2d(logn)1/2n1/2\displaystyle\leq Cp^{2}d^{3}(\mathcal{B}^{2}+2\mathcal{B}^{\prime})\mathcal{U}^{(d+2)/{2d}}(\log n)^{1/2}{n}^{-1/2}
+Cm,d,𝒳(1+)s0C1𝒰(m1)/d,\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad+{C}_{m,d,\mathcal{X}}(1+\mathcal{B})\|s_{0}\|_{C^{1}}\mathcal{U}^{-(m-1)/d},

where CC is a universal positive constant and Cm,d,𝒳C_{m,d,\mathcal{X}} is a positive constant depending only on d,md,m and the diameter of 𝒳\mathcal{X}. \hfill\Box

Proof of Lemma 14

Lemma 14 can be proved by applying Theorem 8 and following the same arguments as in the proof of Lemma 10.

Proof of Lemma 18

Recall that f^nλ\hat{f}^{\lambda}_{n} is the empirical risk minimizer. Then, for any fnf\in\mathcal{F}_{n} we have

nλ(f^nλ)nλ(f).\displaystyle\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})\leq\mathcal{R}^{\lambda}_{n}(f).

For any fnf\in\mathcal{F}_{n}, let

ρλ(f):=1dj=1dλj𝔼{ρ(xjf(X))}\rho^{\lambda}(f):=\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\mathbb{E}\{\rho(\frac{\partial}{\partial x_{j}}f(X))\}

and

\rho^{\lambda}_{n}(f):=\frac{1}{n\times d}\sum_{i=1}^{n}\sum_{j=1}^{d}\lambda_{j}\rho\Big(\frac{\partial}{\partial x_{j}}f(X_{i})\Big).
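These penalty terms, and the full penalized empirical risk $\mathcal{R}^{\lambda}_{n}(f)$, are easy to compute in practice. Below is a minimal sketch, assuming PyTorch and, purely for illustration, the penalty $\rho(t)=\max(-t,0)$, which is nonnegative, $1$-Lipschitz, and vanishes whenever the partial derivative is nonnegative; the function names are ours.

```python
import torch

def penalized_isotonic_loss(f_net, x, y, lam):
    # R_n^lambda(f) = (1/n) sum_i |Y_i - f(X_i)|^2
    #                 + (1/(n*d)) sum_i sum_j lam_j * rho( (d/dx_j) f(X_i) ),
    # with the illustrative choice rho(t) = max(-t, 0); lam is a tensor of shape (d,).
    x = x.detach().requires_grad_(True)
    f = f_net(x).squeeze(-1)                                       # shape (n,)
    grad = torch.autograd.grad(f.sum(), x, create_graph=True)[0]   # shape (n, d)
    mse = ((y - f) ** 2).mean()
    penalty = (lam * torch.relu(-grad)).sum(dim=1).mean() / x.shape[1]
    return mse + penalty
```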

Then for any $f\in\mathcal{F}_{n}$, we have $\rho^{\lambda}(f)\geq 0$ and $\rho^{\lambda}_{n}(f)\geq 0$, since the penalty $\rho$ is nonnegative and the $\lambda_{j}$'s are nonnegative numbers. Note that $\rho^{\lambda}(f_{0})=\rho^{\lambda}_{n}(f_{0})=0$ by the assumption that $f_{0}$ is coordinate-wise nondecreasing. Then,

(f^nλ)(f0)\displaystyle\mathcal{R}(\hat{f}^{\lambda}_{n})-\mathcal{R}(f_{0})\leq (f^nλ)(f0)+ρλ(f^nλ)ρλ(f0)=λ(f^nλ)λ(f0).\displaystyle\mathcal{R}(\hat{f}^{\lambda}_{n})-\mathcal{R}(f_{0})+\rho^{\lambda}(\hat{f}^{\lambda}_{n})-\rho^{\lambda}(f_{0})=\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0}).

We can then give upper bounds for the excess risk (f^nλ)(f0)\mathcal{R}(\hat{f}^{\lambda}_{n})-\mathcal{R}(f_{0}). For any fnf\in\mathcal{F}_{n},

𝔼{(f^nλ)(f0)}\displaystyle\mathbb{E}\{\mathcal{R}(\hat{f}^{\lambda}_{n})-\mathcal{R}(f_{0})\}
𝔼{λ(f^nλ)λ(f0)}\displaystyle\leq\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0})\}
𝔼{λ(f^nλ)λ(f0)}+2𝔼{nλ(f)nλ(f^nλ)}\displaystyle\leq\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0})\}+2\mathbb{E}\{\mathcal{R}_{n}^{\lambda}(f)-\mathcal{R}_{n}^{\lambda}(\hat{f}^{\lambda}_{n})\}
=𝔼{λ(f^nλ)λ(f0)}+2𝔼[{nλ(f)nλ(f0)}{nλ(f^nλ)nλ(f0)}]\displaystyle=\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0})\}+2\mathbb{E}[\{\mathcal{R}_{n}^{\lambda}(f)-\mathcal{R}_{n}^{\lambda}(f_{0})\}-\{\mathcal{R}_{n}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}_{n}^{\lambda}(f_{0})\}]
=𝔼{λ(f^nλ)2nλ(f^nλ)+λ(f0)}+2𝔼{nλ(f)nλ(f0)}\displaystyle=\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-2\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}(f_{0})\}+2\mathbb{E}\{\mathcal{R}_{n}^{\lambda}(f)-\mathcal{R}_{n}^{\lambda}(f_{0})\}

where the second inequality holds by the fact that $\hat{f}^{\lambda}_{n}$ satisfies $\mathcal{R}_{n}^{\lambda}(f)\geq\mathcal{R}_{n}^{\lambda}(\hat{f}^{\lambda}_{n})$ for any $f\in\mathcal{F}_{n}$. Since the inequality holds for any $f\in\mathcal{F}_{n}$, we have

𝔼{(f^nλ)(f0)}\displaystyle\mathbb{E}\{\mathcal{R}(\hat{f}^{\lambda}_{n})-\mathcal{R}(f_{0})\} 𝔼{λ(f^nλ)2nλ(f^nλ)+λ(f0)}+2inffn{λ(f)λ(f0)}.\displaystyle\leq\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-2\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}(f_{0})\}+2\inf_{f\in\mathcal{F}_{n}}\{\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})\}.

This completes the proof. \hfill\Box

Proof of Lemma 20

Lemma 20 can be proved by combining Lemma 18, Lemma 37 and Lemma 38.

Proof of Lemma 24

Lemma 24 can be proved by combining Lemma 18, Lemma 37, Lemma 38 and Theorem 8.

Proof of Lemma 26

Under the misspecified model, the target function $f_{0}$ may not be monotonic, and the quantity $\sum_{j=1}^{d}\lambda_{j}[\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X_{i}))-\rho(\frac{\partial}{\partial x_{j}}{f}_{0}(X_{i}))]$ is not guaranteed to be nonnegative, which prevents us from using the decomposition technique in the proof of Lemma 37 to obtain a fast rate. Instead, we use the canonical decomposition of the excess risk. Let $S=\{Z_{i}=(X_{i},Y_{i})\}_{i=1}^{n}$ be the sample, and let $S_{X}=\{X_{i}\}_{i=1}^{n}$ and $S_{Y}=\{Y_{i}\}_{i=1}^{n}$. We note that

𝔼[(f^nλ)(f0)]\displaystyle\mathbb{E}[\mathcal{R}(\hat{f}^{\lambda}_{n})-\mathcal{R}(f_{0})] 𝔼[(f^nλ)(f0)+j=1dλj𝔼[ρ(xjf^nλ(X))]]\displaystyle\leq\mathbb{E}\Big{[}\mathcal{R}(\hat{f}^{\lambda}_{n})-\mathcal{R}(f_{0})+\sum_{j=1}^{d}\lambda_{j}\mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}\hat{f}^{\lambda}_{n}(X))]\Big{]}
=𝔼[λ(f^nλ)λ(f0)]+j=1dλj𝔼[ρ(xjf0(X))],\displaystyle=\mathbb{E}\Big{[}\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0})\Big{]}+\sum_{j=1}^{d}\lambda_{j}\mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}f_{0}(X))],

and

𝔼[λ(f^nλ)λ(f0)]\displaystyle\mathbb{E}\Big{[}\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0})\Big{]}
=𝔼[λ(f^nλ)nλ(f^nλ)+nλ(f^nλ)λ(fn)+λ(fn)λ(f0)]\displaystyle=\mathbb{E}\Big{[}\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f^{*}_{n})+\mathcal{R}^{\lambda}(f^{*}_{n})-\mathcal{R}^{\lambda}(f_{0})\Big{]}
𝔼[λ(f^nλ)nλ(f^nλ)+nλ(fn)λ(fn)+λ(fn)λ(f0)]\displaystyle\leq\mathbb{E}\Big{[}\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}_{n}(f^{*}_{n})-\mathcal{R}^{\lambda}(f^{*}_{n})+\mathcal{R}^{\lambda}(f^{*}_{n})-\mathcal{R}^{\lambda}(f_{0})\Big{]}
=𝔼[[λ(f^nλ)λ(f0)][nλ(f^nλ)nλ(f0)]\displaystyle=\mathbb{E}\Big{[}[\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0})]-[\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}_{n}(f_{0})]
+[nλ(fn)nλ(f0)][λ(fn)λ(f0)]+λ(fn)λ(f0)]\displaystyle+[\mathcal{R}^{\lambda}_{n}(f^{*}_{n})-\mathcal{R}^{\lambda}_{n}(f_{0})]-[\mathcal{R}^{\lambda}(f^{*}_{n})-\mathcal{R}^{\lambda}(f_{0})]+\mathcal{R}^{\lambda}(f^{*}_{n})-\mathcal{R}^{\lambda}(f_{0})\Big{]}
𝔼[2supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|]+inffn[λ(f)λ(f0)],\displaystyle\leq\mathbb{E}\Big{[}2\sup_{f\in\mathcal{F}_{n}}\Big{|}[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\Big{|}\Big{]}+\inf_{f\in\mathcal{F}_{n}}[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})],

where λ(fn)=inffnλ(f)\mathcal{R}^{\lambda}(f_{n}^{*})=\inf_{f\in\mathcal{F}_{n}}\mathcal{R}^{\lambda}(f), 𝔼\mathbb{E} denotes the expectation taken with respect to SS, and 𝔼[SX]\mathbb{E}[\cdot\mid S_{X}] denotes the conditional expectation given SXS_{X}. Then we have

𝔼[λ(f^nλ)λ(f0)]\displaystyle\mathbb{E}\Big{[}\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0})\Big{]} 𝔼[2supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|]\displaystyle\leq\mathbb{E}\Big{[}2\sup_{f\in\mathcal{F}_{n}}\Big{|}[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\Big{|}\Big{]}
+inffn[λ(f)λ(f0)]+j=1dλj𝔼[ρ(xjf0(X))],\displaystyle+\inf_{f\in\mathcal{F}_{n}}[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]+\sum_{j=1}^{d}\lambda_{j}\mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}f_{0}(X))],

where the first term is the stochastic error, the second term is the approximation error, and the third term is the misspecification error with respect to the penalty. Compared with the decomposition in Lemma 18, the approximation error is the same and can be bounded using Lemma 38. However, the stochastic error is different, and there is an additional misspecification error. We will leave the misspecification error untouched and include it in the final upper bound. Next, we focus on deriving the upper bound for the stochastic error.

For $f\in\mathcal{F}_{n}$, each $Z_{i}=(X_{i},Y_{i})$ and $j=1,\ldots,d$, let

\displaystyle g_{1}(f,X_{i})=\mathbb{E}\Big[|Y_{i}-f(X_{i})|^{2}-|Y_{i}-f_{0}(X_{i})|^{2}\mid X_{i}\Big]=|f(X_{i})-f_{0}(X_{i})|^{2}
\displaystyle g^{j}_{2}(f,X_{i})=\rho\Big(\frac{\partial}{\partial x_{j}}f(X_{i})\Big)-\rho\Big(\frac{\partial}{\partial x_{j}}f_{0}(X_{i})\Big).

Then we have

supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|\displaystyle\sup_{f\in\mathcal{F}_{n}}\left|[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\right|
supfn|𝔼[g1(f,X)]1ni=1ng1(f,Xi)+1dj=1dλj[𝔼[g2j(f,X)]1ni=1ng2j(f,Xi)]|\displaystyle\leq\sup_{f\in\mathcal{F}_{n}}\Big{|}\mathbb{E}[g_{1}(f,X)]-\frac{1}{n}\sum_{i=1}^{n}g_{1}(f,X_{i})+\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\Big{[}\mathbb{E}[g^{j}_{2}(f,X)]-\frac{1}{n}\sum_{i=1}^{n}g^{j}_{2}(f,X_{i})\Big{]}\Big{|}
supfn|𝔼[g1(f,X)]1ni=1ng1(f,Xi)|+1dj=1dλjsupfn|𝔼[g2j(f,X)]1ni=1ng2j(f,Xi)|.\displaystyle\leq\sup_{f\in\mathcal{F}_{n}}\Big{|}\mathbb{E}[g_{1}(f,X)]-\frac{1}{n}\sum_{i=1}^{n}g_{1}(f,X_{i})\Big{|}+\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\sup_{f\in\mathcal{F}_{n}}\Big{|}\mathbb{E}[g^{j}_{2}(f,X)]-\frac{1}{n}\sum_{i=1}^{n}g^{j}_{2}(f,X_{i})\Big{|}.

Recall that for any $f\in\mathcal{F}_{n}$ we have $\|f\|_{\infty}\leq\mathcal{B}$ and $\|\frac{\partial}{\partial x_{j}}f\|_{\infty}\leq\mathcal{B}^{\prime}$, and by assumption $\|f_{0}\|_{\infty}\leq\mathcal{B}$ and $\|\frac{\partial}{\partial x_{j}}f_{0}\|_{\infty}\leq\mathcal{B}^{\prime}$ for $j=1,\ldots,d$. By applying Theorem 11.8 in Mohri et al. (2018), for any $\delta>0$, with probability at least $1-\delta$ over the choice of the $n$ i.i.d. sample $S$,

supfn|𝔼[g1(f,X)]1ni=1ng1(f,Xi)|422Pdim(n)log(en)n+42log(1/δ)2n,\displaystyle\sup_{f\in\mathcal{F}_{n}}\Big{|}\mathbb{E}[g_{1}(f,X)]-\frac{1}{n}\sum_{i=1}^{n}g_{1}(f,X_{i})\Big{|}\leq 4\mathcal{B}^{2}\sqrt{\frac{2{\rm Pdim}(\mathcal{F}_{n})\log(en)}{n}}+4\mathcal{B}^{2}\sqrt{\frac{\log(1/\delta)}{2n}},

and

supfn|𝔼[g2j(f,X)]1ni=1ng2j(f,Xi)|2κ2Pdim(jn)log(en)n+2κlog(1/δ)2n\displaystyle\sup_{f\in\mathcal{F}_{n}}|\mathbb{E}[g^{j}_{2}(f,X)]-\frac{1}{n}\sum_{i=1}^{n}g^{j}_{2}(f,X_{i})|\leq 2\kappa\mathcal{B}^{\prime}\sqrt{\frac{2{\rm Pdim}(\mathcal{F}^{\prime}_{jn})\log(en)}{n}}+2\kappa\mathcal{B}^{\prime}\sqrt{\frac{\log(1/\delta)}{2n}}

for $j=1,\ldots,d$, where $\mathcal{F}^{\prime}_{jn}=\{\frac{\partial}{\partial x_{j}}f:f\in\mathcal{F}_{n}\}$. Combining the above bounds, we conclude that for any $\delta>0$, with probability at least $1-(d+1)\delta$,

supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|\displaystyle\sup_{f\in\mathcal{F}_{n}}\left|[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\right|
42log(en)n[2Pdim(n)+λ¯κPdim(jn)]+4[2+λ¯κ]log(1/δ)2n.\displaystyle\leq 4\sqrt{\frac{2\log(en)}{n}}\Big{[}\mathcal{B}^{2}\sqrt{{\rm Pdim}(\mathcal{F}_{n})}+\bar{\lambda}\kappa\mathcal{B}^{\prime}\sqrt{{\rm Pdim}(\mathcal{F}^{\prime}_{jn})}\Big{]}+4\Big{[}\mathcal{B}^{2}+\bar{\lambda}\kappa\mathcal{B}^{\prime}\Big{]}\sqrt{\frac{\log(1/\delta)}{2n}}.

Recall that $\mathcal{F}_{n}=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}}$ is a class of RePU neural networks with depth $\mathcal{D}$, width $\mathcal{W}$, size $\mathcal{S}$ and number of neurons $\mathcal{U}$. By Lemma 2, ${\rm Pdim}(\mathcal{F}_{n})\leq 3p\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})$. For each $j=1,\ldots,d$, the function class $\mathcal{F}^{\prime}_{jn}=\{\frac{\partial}{\partial x_{j}}f:f\in\mathcal{F}_{n}\}$ consists of RePU neural networks with depth $3\mathcal{D}+3$, width $6\mathcal{W}$, number of neurons $13\mathcal{U}$ and size no more than $23\mathcal{S}$. By Theorem 1 and Lemma 2, we have ${\rm Pdim}(\mathcal{F}^{\prime}_{1n})={\rm Pdim}(\mathcal{F}^{\prime}_{2n})=\ldots={\rm Pdim}(\mathcal{F}^{\prime}_{dn})\leq 2484p\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})$. Then, for any $\delta>0$, with probability at least $1-\delta$,

supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|\displaystyle\sup_{f\in\mathcal{F}_{n}}\left|[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\right|
200d(2+λ¯κ)(2plog(en)𝒟𝒮(𝒟+log2𝒰)n+log((d+1)/δ)2n).\displaystyle\leq 200d(\mathcal{B}^{2}+\bar{\lambda}\kappa\mathcal{B}^{\prime})\Bigg{(}\sqrt{\frac{2p\log(en)\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})}{n}}+\sqrt{\frac{\log((d+1)/\delta)}{2n}}\Bigg{)}.

If we let $t=200d(\mathcal{B}^{2}+\kappa\bar{\lambda}\mathcal{B}^{\prime})\sqrt{{2p\log(en)\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})}/{n}}$, then the above inequality implies

(supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|ϵ)\displaystyle\mathbb{P}\Bigg{(}\sup_{f\in\mathcal{F}_{n}}\left|[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\right|\geq\epsilon\Bigg{)}
(d+1)exp(n(ϵt)2[100d(2+κλ¯)]2),\displaystyle\leq(d+1)\exp\left(\frac{-n(\epsilon-t)^{2}}{[100d(\mathcal{B}^{2}+\kappa\bar{\lambda}\mathcal{B}^{\prime})]^{2}}\right),

for ϵt\epsilon\geq t. And

𝔼[supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|]\displaystyle\mathbb{E}\left[\sup_{f\in\mathcal{F}_{n}}\left|[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\right|\right]
=0(supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|u)du\displaystyle=\int_{0}^{\infty}\mathbb{P}\Bigg{(}\sup_{f\in\mathcal{F}_{n}}\left|[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\right|\geq u\Bigg{)}du
=0t(supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|u)du\displaystyle=\int_{0}^{t}\mathbb{P}\Bigg{(}\sup_{f\in\mathcal{F}_{n}}\left|[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\right|\geq u\Bigg{)}du
+t(supfn|[λ(f)λ(f0)]𝔼[nλ(f)nλ(f0)SX]|u)du\displaystyle+\int_{t}^{\infty}\mathbb{P}\Bigg{(}\sup_{f\in\mathcal{F}_{n}}\left|[\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})]-\mathbb{E}[\mathcal{R}^{\lambda}_{n}(f)-\mathcal{R}^{\lambda}_{n}(f_{0})\mid{S_{X}}]\right|\geq u\Bigg{)}du
\displaystyle\leq\int_{0}^{t}1du+\int_{t}^{\infty}(d+1)\exp\left(\frac{-n(u-t)^{2}}{[100d(\mathcal{B}^{2}+\kappa\bar{\lambda}\mathcal{B}^{\prime})]^{2}}\right)du
=t+2002πd(2+κλ¯)/n\displaystyle=t+200\sqrt{2\pi}d(\mathcal{B}^{2}+\kappa\bar{\lambda}\mathcal{B}^{\prime})/{\sqrt{n}}
800d2(2+κλ¯)2plog(en)𝒟𝒮(𝒟+log2𝒰)/n.\displaystyle\leq 800d^{2}(\mathcal{B}^{2}+\kappa\bar{\lambda}\mathcal{B}^{\prime})\sqrt{{2p\log(en)\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})}/{n}}.

Note that the network depth $\mathcal{D}$, number of neurons $\mathcal{U}$ and number of parameters $\mathcal{S}$ satisfy $\mathcal{U}=18pd((\mathcal{D}+1)/2)^{d}$ and $\mathcal{S}=67pd((\mathcal{D}+1)/2)^{d}$. Combining the error decomposition with Lemma 38, we have

𝔼[λ(f^nλ)λ(f0)]\displaystyle\mathbb{E}\Big{[}\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-\mathcal{R}^{\lambda}(f_{0})\Big{]} C1p2d3(2+κλ¯)(logn)1/2n1/2𝒰(d+2)/2d\displaystyle\leq C_{1}p^{2}d^{3}(\mathcal{B}^{2}+\kappa\bar{\lambda}\mathcal{B}^{\prime})(\log n)^{1/2}n^{-1/2}\mathcal{U}^{(d+2)/2d}
+C2(1+κλ¯)f0Cs2𝒰(s1)/d+j=1dλj𝔼[ρ(xjf0(X))],\displaystyle+C_{2}(1+\kappa\bar{\lambda})\|f_{0}\|^{2}_{C^{s}}\mathcal{U}^{-(s-1)/d}+\sum_{j=1}^{d}\lambda_{j}\mathbb{E}[\rho(\frac{\partial}{\partial x_{j}}f_{0}(X))],

where C1>0C_{1}>0 is a universal constant, C2>0C_{2}>0 is a constant depending only on d,sd,s and the diameter of the support 𝒳\mathcal{X}. This completes the proof. \hfill\Box

Proof of Lemma 37

Let $S=\{Z_{i}=(X_{i},Y_{i})\}_{i=1}^{n}$ be the sample from the distribution of $Z=(X,Y)$ used to estimate $\hat{f}^{\lambda}_{n}$, and let $S^{\prime}=\{Z^{\prime}_{i}=(X^{\prime}_{i},Y^{\prime}_{i})\}_{i=1}^{n}$ be another sample independent of $S$. Define

g1(f,Xi)=𝔼{|Yif(Xi)|2|Yif0(Xi)|2Xi}=𝔼{|f(Xi)f0(Xi)|2Xi}\displaystyle g_{1}(f,X_{i})=\mathbb{E}\big{\{}|Y_{i}-f(X_{i})|^{2}-|Y_{i}-f_{0}(X_{i})|^{2}\mid X_{i}\big{\}}=\mathbb{E}\big{\{}|f(X_{i})-f_{0}(X_{i})|^{2}\mid X_{i}\big{\}}
g2(f,Xi)=𝔼[1dj=1dλjρ(xjf(Xi))1dj=1dλjρ(xjf0(Xi))Xi]\displaystyle g_{2}(f,X_{i})=\mathbb{E}\big{[}\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\frac{\partial}{\partial x_{j}}f(X_{i}))-\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\frac{\partial}{\partial x_{j}}f_{0}(X_{i}))\mid X_{i}\big{]}
g(f,Xi)=g1(f,Xi)+g2(f,Xi)\displaystyle g(f,X_{i})=g_{1}(f,X_{i})+g_{2}(f,X_{i})

for any (random) $f$ and sample point $X_{i}$. It is worth noting that for any $x$ and $f\in\mathcal{F}_{n}$,

0g1(f,x)=𝔼{|f(Xi)f0(Xi)|2Xi=x}42,0\leq g_{1}(f,x)=\mathbb{E}\big{\{}|f(X_{i})-f_{0}(X_{i})|^{2}\mid X_{i}=x\big{\}}\leq 4\mathcal{B}^{2},

since f\|f\|_{\infty}\leq\mathcal{B} and f0\|f_{0}\|_{\infty}\leq\mathcal{B} for fnf\in\mathcal{F}_{n} by assumption. For any xx and fnf\in\mathcal{F}_{n},

0g2(f,x)=𝔼[1dj=1dλjρ(xjf(Xi))1dj=1dλjρ(xjf0(Xi))Xi=x]2κλ¯,0\leq g_{2}(f,x)=\mathbb{E}\big{[}\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\frac{\partial}{\partial x_{j}}f(X_{i}))-\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\rho(\frac{\partial}{\partial x_{j}}f_{0}(X_{i}))\mid X_{i}=x\big{]}\leq 2\mathcal{B}^{\prime}\kappa\bar{\lambda},

since $\rho(\cdot)$ is a nonnegative $\kappa$-Lipschitz function with $\rho(\frac{\partial}{\partial x_{j}}f_{0}(X_{i}))=0$ under the monotonicity assumption on $f_{0}$, and $\|\frac{\partial}{\partial x_{j}}f\|_{\infty}\leq\mathcal{B}^{\prime}$ and $\|\frac{\partial}{\partial x_{j}}f_{0}\|_{\infty}\leq\mathcal{B}^{\prime}$ for $j=1,\ldots,d$ and any $f\in\mathcal{F}_{n}$ by assumption.

Recall that the empirical risk minimizer $\hat{f}^{\lambda}_{n}$ depends on the sample $S$, and the stochastic error is

𝔼{λ(f^nλ)2nλ(f^nλ)+λ(f0)}\displaystyle\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-2\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}(f_{0})\} =𝔼S(1ni=1n[𝔼S{g(f^nλ,Xi)}2g(f^nλ,Xi)])\displaystyle=\mathbb{E}_{S}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big{\}}-2g(\hat{f}^{\lambda}_{n},X_{i})\bigg{]}\Big{)}
=𝔼S(1ni=1n[𝔼S{g1(f^nλ,Xi)}2g1(f^nλ,Xi)])\displaystyle=\mathbb{E}_{S}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g_{1}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big{\}}-2g_{1}(\hat{f}^{\lambda}_{n},X_{i})\bigg{]}\Big{)} (B.7)
+𝔼S(1ni=1n[𝔼S{g2(f^nλ,Xi)}2g2(f^nλ,Xi)]).\displaystyle+\mathbb{E}_{S}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g_{2}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big{\}}-2g_{2}(\hat{f}^{\lambda}_{n},X_{i})\bigg{]}\Big{)}. (B.8)

In the following, we derive upper bounds of (B.7) and (B.8) respectively. For any random variable ξ\xi, it is clear that 𝔼[ξ]𝔼[max{ξ,0}]=0(ξ>t)𝑑t\mathbb{E}[\xi]\leq\mathbb{E}[\max\{\xi,0\}]=\int_{0}^{\infty}\mathbb{P}(\xi>t)dt. In light of this, we aim at giving upper bounds for the tail probabilities

(1ni=1n[𝔼S{gk(f^nλ,Xi)}2gk(f^nλ,Xi)]>t),k=1,2\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g_{k}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big{\}}-2g_{k}(\hat{f}^{\lambda}_{n},X_{i})\bigg{]}>t\right),\qquad k=1,2

for t>0t>0. Given f^nλn\hat{f}^{\lambda}_{n}\in\mathcal{F}_{n}, for k=1,2k=1,2, we have

\displaystyle\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}\bigg[\mathbb{E}_{S^{\prime}}\big\{g_{k}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big\}-2g_{k}(\hat{f}^{\lambda}_{n},X_{i})\bigg]>t\right)
\displaystyle\leq (fn:1ni=1n[𝔼S{gk(f,Xi)}2gk(f,Xi)]>t)\displaystyle\mathbb{P}\left(\exists f\in\mathcal{F}_{n}:\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g_{k}(f,X^{\prime}_{i})\big{\}}-2g_{k}(f,X_{i})\bigg{]}>t\right)
=\displaystyle= (fn:𝔼{gk(f,X)}1ni=1n[gk(f,Xi)]>12(t+𝔼{gk(f,X)})).\displaystyle\mathbb{P}\left(\exists f\in\mathcal{F}_{n}:\mathbb{E}\big{\{}g_{k}(f,X)\big{\}}-\frac{1}{n}\sum_{i=1}^{n}\big{[}g_{k}(f,X_{i})\big{]}>\frac{1}{2}\bigg{(}t+\mathbb{E}\big{\{}g_{k}(f,X)\big{\}}\bigg{)}\right). (B.9)

To bound the probability in (B.9), we apply Lemma 24 in Shen et al. (2022); note that the event in (B.9) is simply a rearrangement of $\mathbb{E}\{g_{k}(f,X)\}-\frac{2}{n}\sum_{i=1}^{n}g_{k}(f,X_{i})>t$. For completeness, we state the lemma below.

Lemma 32 (Lemma 24 in Shen et al. (2022))

Let \mathcal{H} be a set of functions h:d[0,B]h:\mathbb{R}^{d}\to[0,B] with B1B\geq 1. Let Z,Z1,,ZnZ,Z_{1},\ldots,Z_{n} be i.i.d. d\mathbb{R}^{d}-valued random variables. Then for each n1n\geq 1 and any s>0s>0 and 0<ϵ<10<\epsilon<1,

(suph𝔼{h(Z)}1ni=1n[h(Zi)]s+𝔼{h(Z)}+1ni=1n[h(Zi)]>ϵ)4𝒩n(sϵ16,,)exp(ϵ2sn15B),\displaystyle\mathbb{P}\left(\sup_{h\in\mathcal{H}}\ \frac{\mathbb{E}\big{\{}h(Z)\big{\}}-\frac{1}{n}\sum_{i=1}^{n}\big{[}h(Z_{i})\big{]}}{s+\mathbb{E}\big{\{}h(Z)\big{\}}+\frac{1}{n}\sum_{i=1}^{n}\big{[}h(Z_{i})\big{]}}>\epsilon\right)\leq 4\mathcal{N}_{n}\Big{(}\frac{s\epsilon}{16},\mathcal{H},\|\cdot\|_{\infty}\Big{)}\exp\Big{(}-\frac{\epsilon^{2}sn}{15B}\Big{)},

where 𝒩n(sϵ16,,)\mathcal{N}_{n}(\frac{s\epsilon}{16},\mathcal{H},\|\cdot\|_{\infty}) is the covering number of \mathcal{H} with radius sϵ/16s\epsilon/16 under the norm \|\cdot\|_{\infty}. The definition of the covering number can be found in Appendix C.

We apply Lemma 32 with ϵ=1/3,s=2t\epsilon=1/3,s=2t to the class of functions 𝒢k:={gk(f,):fn}\mathcal{G}_{k}:=\{g_{k}(f,\cdot):f\in\mathcal{F}_{n}\} for k=1,2k=1,2 to get

(fn:𝔼{g1(f,X)}1ni=1n[g1(f,Xi)]>12(t+𝔼{g1(f,X)}))\displaystyle\mathbb{P}\Big{(}\exists f\in\mathcal{F}_{n}:\mathbb{E}\big{\{}g_{1}(f,X)\big{\}}-\frac{1}{n}\sum_{i=1}^{n}\big{[}g_{1}(f,X_{i})\big{]}>\frac{1}{2}\bigg{(}t+\mathbb{E}\big{\{}g_{1}(f,X)\big{\}}\bigg{)}\Big{)}
4𝒩n(t24,𝒢1,)exp(tn2702),\displaystyle\leq 4\mathcal{N}_{n}\Big{(}\frac{t}{24},\mathcal{G}_{1},\|\cdot\|_{\infty}\Big{)}\exp\Big{(}-\frac{tn}{270\mathcal{B}^{2}}\Big{)}, (B.10)

and

(fn:𝔼{g2(f,X)}1ni=1n[g2(f,Xi)]>12(t+𝔼{g2(f,X)}))\displaystyle\mathbb{P}\Big{(}\exists f\in\mathcal{F}_{n}:\mathbb{E}\big{\{}g_{2}(f,X)\big{\}}-\frac{1}{n}\sum_{i=1}^{n}\big{[}g_{2}(f,X_{i})\big{]}>\frac{1}{2}\bigg{(}t+\mathbb{E}\big{\{}g_{2}(f,X)\big{\}}\bigg{)}\Big{)}
4𝒩n(t24,𝒢2,)exp(tn135λ¯κ).\displaystyle\leq 4\mathcal{N}_{n}\Big{(}\frac{t}{24},\mathcal{G}_{2},\|\cdot\|_{\infty}\Big{)}\exp\Big{(}-\frac{tn}{135\bar{\lambda}\kappa\mathcal{B}^{\prime}}\Big{)}. (B.11)

Combining (B.9) and (B.10), for an>1/na_{n}>1/n, we have

𝔼S(1ni=1n[𝔼S{g1(f^nλ,Xi)}2g1(f^nλ,Xi)])\displaystyle\mathbb{E}_{S}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g_{1}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big{\}}-2g_{1}(\hat{f}^{\lambda}_{n},X_{i})\bigg{]}\Big{)}
0(1ni=1n[𝔼S{g1(f^nλ,Xi)}2g1(f^nλ,Xi)]>t)𝑑t\displaystyle\leq\int_{0}^{\infty}\mathbb{P}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g_{1}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big{\}}-2g_{1}(\hat{f}^{\lambda}_{n},X_{i})\bigg{]}>t\Big{)}dt
0an1dt+an4𝒩n(t24,𝒢1,)exp(tn2702)dt\displaystyle\leq\int_{0}^{a_{n}}1dt+\int_{a_{n}}^{\infty}4\mathcal{N}_{n}\Big{(}\frac{t}{24},\mathcal{G}_{1},\|\cdot\|_{\infty}\Big{)}\exp\Big{(}-\frac{tn}{270\mathcal{B}^{2}}\Big{)}dt
an+4𝒩n(124n,𝒢1,)anexp(tn2702)dt\displaystyle\leq a_{n}+4\mathcal{N}_{n}\Big{(}\frac{1}{24n},\mathcal{G}_{1},\|\cdot\|_{\infty}\Big{)}\int_{a_{n}}^{\infty}\exp\Big{(}-\frac{tn}{270\mathcal{B}^{2}}\Big{)}dt
=an+4𝒩n(124n,𝒢1,)exp(ann2702)2702n.\displaystyle=a_{n}+4\mathcal{N}_{n}\Big{(}\frac{1}{24n},\mathcal{G}_{1},\|\cdot\|_{\infty}\Big{)}\exp\Big{(}-\frac{a_{n}n}{270\mathcal{B}^{2}}\Big{)}\frac{270\mathcal{B}^{2}}{n}.

Choosing $a_{n}=\log\{4\mathcal{N}_{n}(1/(24n),\mathcal{G}_{1},\|\cdot\|_{\infty})\}\cdot 270\mathcal{B}^{2}/n$, which makes the second term equal to $270\mathcal{B}^{2}/n$, we get

𝔼S(1ni=1n[𝔼S{g1(f^nλ,Xi)}2g1(f^nλ,Xi)])270log[4e𝒩n(1/(24n),𝒢1,)]2n.\displaystyle\mathbb{E}_{S}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g_{1}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big{\}}-2g_{1}(\hat{f}^{\lambda}_{n},X_{i})\bigg{]}\Big{)}\leq\frac{270\log[4e\mathcal{N}_{n}(1/(24n),\mathcal{G}_{1},\|\cdot\|_{\infty})]\mathcal{B}^{2}}{n}.

For any f1,f2nf_{1},f_{2}\in\mathcal{F}_{n}, by the definition of g1g_{1}, it is easy to show g1(f1,)g1(f2,)4f1f2\|g_{1}(f_{1},\cdot)-g_{1}(f_{2},\cdot)\|_{\infty}\leq 4\mathcal{B}\|f_{1}-f_{2}\|_{\infty}. Then 𝒩n(1/(24n),𝒢1,)𝒩n(1/(96n),n,)\mathcal{N}_{n}(1/(24n),\mathcal{G}_{1},\|\cdot\|_{\infty})\leq\mathcal{N}_{n}(1/(96\mathcal{B}n),\mathcal{F}_{n},\|\cdot\|_{\infty}), which leads to

\displaystyle\mathbb{E}_{S}\Big(\frac{1}{n}\sum_{i=1}^{n}\bigg[\mathbb{E}_{S^{\prime}}\big\{g_{1}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big\}-2g_{1}(\hat{f}^{\lambda}_{n},X_{i})\bigg]\Big)
270log[4e𝒩n(1/(96n),n,)]2n.\displaystyle\leq\frac{270\log[4e\mathcal{N}_{n}(1/(96\mathcal{B}n),\mathcal{F}_{n},\|\cdot\|_{\infty})]\mathcal{B}^{2}}{n}. (B.12)

Similarly, combining (B.9) and (B.11), we can obtain

𝔼S(1ni=1n[𝔼S{g2(f^nλ,Xi)}2g2(f^nλ,Xi)])135λ¯κlog[4e𝒩n(1/(24n),𝒢2,)]n.\displaystyle\mathbb{E}_{S}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\mathbb{E}_{S^{\prime}}\big{\{}g_{2}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big{\}}-2g_{2}(\hat{f}^{\lambda}_{n},X_{i})\bigg{]}\Big{)}\leq\frac{135\bar{\lambda}\kappa\log[4e\mathcal{N}_{n}(1/(24n),\mathcal{G}_{2},\|\cdot\|_{\infty})]\mathcal{B}^{\prime}}{n}.

For any $f_{1},f_{2}\in\mathcal{F}_{n}$, by the definition of $g_{2}$, it can be shown that $\|g_{2}(f_{1},\cdot)-g_{2}(f_{2},\cdot)\|_{\infty}\leq\frac{\kappa}{d}\sum_{j=1}^{d}\lambda_{j}\|\frac{\partial}{\partial x_{j}}f_{1}-\frac{\partial}{\partial x_{j}}f_{2}\|_{\infty}$. Recall that $\mathcal{F}_{nj}^{\prime}=\{\frac{\partial}{\partial x_{j}}f:f\in\mathcal{F}_{n}\}$ for $j=1,\ldots,d$. Then $\mathcal{N}_{n}(1/(24n),\mathcal{G}_{2},\|\cdot\|_{\infty})\leq\Pi_{j=1}^{d}\mathcal{N}_{n}(1/(24\kappa\lambda_{j}n),\mathcal{F}_{nj}^{\prime},\|\cdot\|_{\infty})$, where the radius $1/(24\kappa\lambda_{j}n)$ is interpreted as $\infty$ if $\lambda_{j}=0$. This leads to

\displaystyle\mathbb{E}_{S}\Big(\frac{1}{n}\sum_{i=1}^{n}\bigg[\mathbb{E}_{S^{\prime}}\big\{g_{2}(\hat{f}^{\lambda}_{n},X^{\prime}_{i})\big\}-2g_{2}(\hat{f}^{\lambda}_{n},X_{i})\bigg]\Big)
135λ¯κlog[4eΠj=1d𝒩n(1/(24κλjn),nj,)]n.\displaystyle\leq\frac{135\bar{\lambda}\kappa\log[4e\Pi_{j=1}^{d}\mathcal{N}_{n}(1/(24\kappa\lambda_{j}n),\mathcal{F}^{\prime}_{nj},\|\cdot\|_{\infty})]\mathcal{B}^{\prime}}{n}. (B.13)

Then by Lemma 39 in Appendix C, we can further bound the covering numbers in terms of the pseudo dimension. More precisely, for $n\geq{\rm Pdim}(\mathcal{F}_{n})$ and any $\delta>0$, we have

log(𝒩n(δ,n,))Pdim(n)log(enδPdim(n)),\displaystyle\log(\mathcal{N}_{n}(\delta,\mathcal{F}_{n},\|\cdot\|_{\infty}))\leq{\rm Pdim}(\mathcal{F}_{n})\log\Big{(}\frac{en\mathcal{B}}{\delta{\rm Pdim}(\mathcal{F}_{n})}\Big{)},

and for $n\geq{\rm Pdim}(\mathcal{F}^{\prime}_{nj})$, $j=1,\ldots,d$, and any $\delta>0$, we have

log(𝒩n(δ,nj,))Pdim(nj)log(enδPdim(nj)).\displaystyle\log(\mathcal{N}_{n}(\delta,\mathcal{F}^{\prime}_{nj},\|\cdot\|_{\infty}))\leq{\rm Pdim}(\mathcal{F}^{\prime}_{nj})\log\Big{(}\frac{en\mathcal{B}^{\prime}}{\delta{\rm Pdim}(\mathcal{F}^{\prime}_{nj})}\Big{)}.

By Theorem 1 we know Pdim(nj)=Pdim(n){\rm Pdim}(\mathcal{F}^{\prime}_{nj})={\rm Pdim}(\mathcal{F}^{\prime}_{n}) for j=1,,dj=1,\ldots,d. Combining the upper bounds of the covering numbers, we have

𝔼{λ(f^nλ)2nλ(f^nλ)+λ(f0)}c0[3Pdim(n)+d(κλ¯)2Pdim(n)]log(n)n,\displaystyle\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-2\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}(f_{0})\}\leq c_{0}\frac{\big{[}\mathcal{B}^{3}{\rm Pdim}(\mathcal{F}_{n})+d(\kappa\bar{\lambda}\mathcal{B}^{\prime})^{2}{\rm Pdim}(\mathcal{F}_{n}^{\prime})\big{]}\log(n)}{n},

for nmax{Pdim(n),Pdim(n)}n\geq\max\{{\rm Pdim}(\mathcal{F}_{n}),{\rm Pdim}(\mathcal{F}^{\prime}_{n})\} and some universal constant c0>0c_{0}>0 where λ¯=j=1dλj/d\bar{\lambda}=\sum_{j=1}^{d}\lambda_{j}/d. By Lemma 2, for the function class n\mathcal{F}_{n} implemented by Mixed RePU activated multilayer perceptrons with depth no more than 𝒟\mathcal{D}, width no more than 𝒲\mathcal{W}, number of neurons (nodes) no more than 𝒰\mathcal{U} and size or number of parameters (weights and bias) no more than 𝒮\mathcal{S}, we have

Pdim(n)3p𝒟𝒮(𝒟+log2𝒰),\displaystyle{\rm Pdim}(\mathcal{F}_{n})\leq 3p\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U}),

and by Theorem 1 together with Lemma 2, for any function $f\in\mathcal{F}_{n}$, its partial derivative $\frac{\partial}{\partial x_{j}}f$ can be implemented by a Mixed RePU activated multilayer perceptron with depth $3\mathcal{D}+3$, width $6\mathcal{W}$, number of neurons $13\mathcal{U}$, number of parameters $23\mathcal{S}$ and bound $\mathcal{B}^{\prime}$. Then

Pdim(n)2484p𝒟𝒮(𝒟+log2𝒰).\displaystyle{\rm Pdim}(\mathcal{F}^{\prime}_{n})\leq 2484p\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U}).

It follows that

\displaystyle\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-2\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}(f_{0})\}\leq c_{1}p\big(\mathcal{B}^{3}+d(\kappa\bar{\lambda}\mathcal{B}^{\prime})^{2}\big)\frac{\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})\log(n)}{n},

for nmax{Pdim(n),Pdim(n)}n\geq\max\{{\rm Pdim}(\mathcal{F}_{n}),{\rm Pdim}(\mathcal{F}^{\prime}_{n})\} and some universal constant c1>0c_{1}>0. This completes the proof. \hfill\Box

Proof of Lemma 38

Recall that

inffn[λ(f)λ(f0)]\displaystyle\inf_{f\in\mathcal{F}_{n}}\Big{[}\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})\Big{]}
\displaystyle=\inf_{f\in\mathcal{F}_{n}}\mathbb{E}\Bigg[|f(X)-f_{0}(X)|^{2}+\frac{1}{d}\sum_{j=1}^{d}\lambda_{j}\Big\{\rho\Big(\frac{\partial}{\partial x_{j}}f(X)\Big)-\rho\Big(\frac{\partial}{\partial x_{j}}f_{0}(X)\Big)\Big\}\Bigg]
\displaystyle\leq\inf_{f\in\mathcal{F}_{n}}\mathbb{E}\Bigg[|f(X)-f_{0}(X)|^{2}+\frac{\kappa}{d}\sum_{j=1}^{d}\lambda_{j}\Big|\frac{\partial}{\partial x_{j}}f(X)-\frac{\partial}{\partial x_{j}}f_{0}(X)\Big|\Bigg].

By Theorem 5, for each $N\in\mathbb{N}^{+}$, there exists a RePU network $\phi_{N}\in\mathcal{F}_{n}$ with $2N-1$ hidden layers, no more than $15N^{d}$ neurons, no more than $24N^{d}$ parameters and width no more than $12N^{d-1}$ such that for each multi-index $\alpha\in\mathbb{N}^{d}_{0}$ with $|\alpha|_{1}\leq\min\{s,N\}$, we have

\sup_{\mathcal{X}}|D^{\alpha}(f_{0}-\phi_{N})|\leq C(s,d,\mathcal{X})\times N^{-(s-|\alpha|_{1})}\|f_{0}\|_{C^{|\alpha|_{1}}},

where C(s,d,𝒳)C(s,d,\mathcal{X}) is a positive constant depending only on d,sd,s and the diameter of 𝒳\mathcal{X}. This implies

\sup_{\mathcal{X}}|f_{0}-\phi_{N}|\leq C(s,d,\mathcal{X})\times N^{-s}\|f_{0}\|_{C^{0}},

and for j=1,,dj=1,\ldots,d

\sup_{\mathcal{X}}\Big|\frac{\partial}{\partial x_{j}}(f_{0}-\phi_{N})\Big|\leq C(s,d,\mathcal{X})\times N^{-(s-1)}\|f_{0}\|_{C^{1}}.

Combining the above two uniform bounds, we have

inffn[λ(f)λ(f0)]\displaystyle\inf_{f\in\mathcal{F}_{n}}\Big{[}\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})\Big{]}
\displaystyle\leq\mathbb{E}_{X}\Big\{|\phi_{N}(X)-f_{0}(X)|^{2}+\frac{\kappa}{d}\sum_{j=1}^{d}\lambda_{j}\Big|\frac{\partial}{\partial x_{j}}\phi_{N}(X)-\frac{\partial}{\partial x_{j}}f_{0}(X)\Big|\Big\}
\displaystyle\leq C(s,d,\mathcal{X})^{2}\times N^{-2s}\|f_{0}\|^{2}_{C^{0}}+\kappa\bar{\lambda}C(s,d,\mathcal{X})\times N^{-(s-1)}\|f_{0}\|_{C^{1}}
\displaystyle\leq C_{1}(s,d,\mathcal{X})(1+\kappa\bar{\lambda})N^{-(s-1)}\|f_{0}\|^{2}_{C^{1}},

where $C_{1}(s,d,\mathcal{X})=\max\{[C(s,d,\mathcal{X})]^{2},C(s,d,\mathcal{X})\}$ is also a constant depending only on $s,d$ and $\mathcal{X}$. By taking the network depth $\mathcal{D}$ to be a positive odd number and expressing the network width $\mathcal{W}$, number of neurons $\mathcal{U}$ and size $\mathcal{S}$ in terms of $\mathcal{D}$, one can obtain the approximation error bound in terms of $\mathcal{U}$. This completes the proof. \hfill\Box

Appendix C Definitions and Supporting Lemmas

C.1 Definitions

The following definitions are used in the proofs.

Definition 33 (Covering number)

Let $\mathcal{F}$ be a class of functions from $\mathcal{X}$ to $\mathbb{R}$. For a given sequence $x=(x_{1},\ldots,x_{n})\in\mathcal{X}^{n}$, let $\mathcal{F}_{n}|_{x}=\{(f(x_{1}),\ldots,f(x_{n})):f\in\mathcal{F}_{n}\}$ be the corresponding subset of $\mathbb{R}^{n}$. For a positive number $\delta$, let $\mathcal{N}(\delta,\mathcal{F}_{n}|_{x},\|\cdot\|_{\infty})$ be the covering number of $\mathcal{F}_{n}|_{x}$ under the norm $\|\cdot\|_{\infty}$ with radius $\delta$. Define the uniform covering number $\mathcal{N}_{n}(\delta,\mathcal{F}_{n},\|\cdot\|_{\infty})$ to be the maximum over all $x\in\mathcal{X}^{n}$ of the covering number $\mathcal{N}(\delta,\mathcal{F}_{n}|_{x},\|\cdot\|_{\infty})$, i.e.,

𝒩n(δ,n,)=max{𝒩(δ,n|x,):x𝒳n}.\mathcal{N}_{n}(\delta,\mathcal{F}_{n},\|\cdot\|_{\infty})=\max\{\mathcal{N}(\delta,\mathcal{F}_{n}|_{x},\|\cdot\|_{\infty}):x\in\mathcal{X}^{n}\}. (C.1)
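For a concrete (finite) illustration of this definition, the following sketch, assuming NumPy, computes the size of a greedy $\delta$-cover in the sup norm for a finite collection of functions evaluated at a fixed $x$; the greedy cover is valid but need not be minimal, so the returned size is only an upper bound on the covering number of that finite collection.

```python
import numpy as np

def greedy_sup_norm_cover_size(values, delta):
    # values: array of shape (m, n); row k is (f_k(x_1), ..., f_k(x_n)) for m functions in F.
    centers = []
    for v in values:
        # add v as a new center only if it is not already within delta of an existing center
        if not any(np.max(np.abs(v - c)) <= delta for c in centers):
            centers.append(v)
    return len(centers)
```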
Definition 34 (Shattering)

Let $\mathcal{F}$ be a family of functions from a set $\mathcal{Z}$ to $\mathbb{R}$. A set $\{z_{1},\ldots,z_{n}\}\subset\mathcal{Z}$ is said to be shattered by $\mathcal{F}$ if there exist $t_{1},\ldots,t_{n}\in\mathbb{R}$ such that

|{[sgn(f(z1)t1)sgn(f(zn)tn)]:f}|=2n,\displaystyle\Big{|}\Big{\{}\Big{[}\begin{array}[]{lr}{\rm sgn}(f(z_{1})-t_{1})\\ \ldots\\ {\rm sgn}(f(z_{n})-t_{n})\\ \end{array}\Big{]}:f\in\mathcal{F}\Big{\}}\Big{|}=2^{n},

where ${\rm sgn}$ is the sign function, which returns $+1$ or $-1$, and $|\cdot|$ denotes the cardinality of a set. When they exist, the threshold values $t_{1},\ldots,t_{n}$ are said to witness the shattering.

Definition 35 (Pseudo dimension)

Let \mathcal{F} be a family of functions mapping from 𝒵\mathcal{Z} to \mathbb{R}. Then, the pseudo dimension of \mathcal{F}, denoted by Pdim(){\rm Pdim}(\mathcal{F}), is the size of the largest set shattered by \mathcal{F}.

Definition 36 (VC dimension)

Let $\mathcal{F}$ be a family of functions mapping from $\mathcal{Z}$ to $\mathbb{R}$. Then, the Vapnik–Chervonenkis (VC) dimension of $\mathcal{F}$, denoted by ${\rm VCdim}(\mathcal{F})$, is the size of the largest set shattered by $\mathcal{F}$ with all threshold values being zero, i.e., $t_{1}=\cdots=t_{n}=0$.

C.2 Supporting Lemmas

Lemma 37 (Stochastic error bound)

Suppose Assumptions 16 and 17 hold. Let $\mathcal{F}_{n}=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}}$ be a class of RePU $\sigma_{p}$ activated multilayer perceptrons and let $\mathcal{F}^{\prime}_{n}=\{\frac{\partial}{\partial x_{1}}f:f\in\mathcal{F}_{n}\}$ denote the class of partial derivatives of $f\in\mathcal{F}_{n}$ with respect to the first argument. Then for $n\geq\max\{{\rm Pdim}(\mathcal{F}_{n}),{\rm Pdim}(\mathcal{F}^{\prime}_{n})\}$, the stochastic error satisfies

𝔼{λ(f^nλ)2nλ(f^nλ)+λ(f0)}c1p{3+d(κλ¯)2}𝒟𝒮(𝒟+log2𝒰)log(n)n,\displaystyle\mathbb{E}\{\mathcal{R}^{\lambda}(\hat{f}^{\lambda}_{n})-2\mathcal{R}^{\lambda}_{n}(\hat{f}^{\lambda}_{n})+\mathcal{R}^{\lambda}(f_{0})\}\leq c_{1}p\big{\{}\mathcal{B}^{3}+d(\kappa\bar{\lambda}\mathcal{B}^{\prime})^{2}\big{\}}\mathcal{D}\mathcal{S}(\mathcal{D}+\log_{2}\mathcal{U})\frac{\log(n)}{n},

for some universal constant c1>0,c_{1}>0, where λ¯:=j=1dλj/d\bar{\lambda}:=\sum_{j=1}^{d}\lambda_{j}/d.

Lemma 38 (Approximation error bound)

Suppose that the target function f0f_{0} defined in (4) belongs to CsC^{s} for some s+s\in\mathbb{N}^{+}. For any positive odd number 𝒟\mathcal{D}, let n:=𝒟,𝒲,𝒰,𝒮,,\mathcal{F}_{n}:=\mathcal{F}_{\mathcal{D},\mathcal{W},\mathcal{U},\mathcal{S},\mathcal{B},\mathcal{B}^{\prime}} be the class of RePU activated neural networks f:𝒳df:\mathcal{X}\to\mathbb{R}^{d} with depth 𝒟\mathcal{D}, width 𝒲=18pd[(𝒟+1)/2]d1\mathcal{W}=18pd[(\mathcal{D}+1)/2]^{d-1}, number of neurons 𝒰=18pd[(𝒟+1)/2]d\mathcal{U}=18pd[(\mathcal{D}+1)/2]^{d} and size 𝒮=67pd[(𝒟+1)/2]d\mathcal{S}=67pd[(\mathcal{D}+1)/2]^{d}, satisfying f0C0\mathcal{B}\geq\|f_{0}\|_{C^{0}} and f0C1\mathcal{B}^{\prime}\geq\|f_{0}\|_{C^{1}}. Then the approximation error given in Lemma 18 satisfies

inffn[λ(f)λ(f0)]C(1+κλ¯)𝒰(s1)/df0C12,\inf_{f\in\mathcal{F}_{n}}\Big{[}\mathcal{R}^{\lambda}(f)-\mathcal{R}^{\lambda}(f_{0})\Big{]}\leq C(1+\kappa\bar{\lambda})\mathcal{U}^{-(s-1)/d}\|f_{0}\|^{2}_{C^{1}},

where $\bar{\lambda}:=\sum_{j=1}^{d}\lambda_{j}/d$, $\kappa$ is the Lipschitz constant of the penalty function $\rho$, and $C>0$ is a constant depending only on $d,s$ and the diameter of the support $\mathcal{X}$.

The following lemma gives an upper bound for the covering number in terms of the pseudo-dimension.

Lemma 39 (Theorem 12.2 in Anthony and Bartlett (1999))

Let $\mathcal{F}$ be a set of real-valued functions from a domain $\mathcal{Z}$ to the bounded interval $[0,B]$. Let $\delta>0$ and suppose that $\mathcal{F}$ has finite pseudo-dimension ${\rm Pdim}(\mathcal{F})$. Then

𝒩n(δ,,)i=1Pdim()(ni)(Bδ)i,\displaystyle\mathcal{N}_{n}(\delta,\mathcal{F},\|\cdot\|_{\infty})\leq\sum_{i=1}^{{\rm Pdim}(\mathcal{F})}\binom{n}{i}\Big{(}\frac{B}{\delta}\Big{)}^{i},

which is less than {enB/(δPdim())}Pdim()\{enB/(\delta{\rm Pdim}(\mathcal{F}))\}^{{\rm Pdim}(\mathcal{F})} for nPdim()n\geq{\rm Pdim}(\mathcal{F}).

The following lemma presents basic approximation properties of RePU network on monomials.

Lemma 40 (Lemma 1 in Li et al. (2019))

The monomials $x^{N}$, $0\leq N\leq p$, can be exactly represented by a RePU ($\sigma_{p}$, $p\geq 2$) activated neural network with one hidden layer and no more than $2p$ nodes. More precisely,

  • (i)

    If $N=0$, the monomial $x^{N}$ can be computed by a RePU $\sigma_{p}$ activated network with one hidden layer and one node as

    1=x0=σp(0x+1).1=x^{0}=\sigma_{p}(0\cdot x+1).
  • (ii)

    If $N=p$, the monomial $x^{N}$ can be computed by a RePU $\sigma_{p}$ activated network with one hidden layer and 2 nodes (a numerical check of this identity is sketched after the list) as

    xN=W1σp(W0x),W1=[1(1)p],W0=[11].x^{N}=W_{1}\sigma_{p}(W_{0}x),\qquad W_{1}=\left[\begin{array}[]{c}1\\ (-1)^{p}\end{array}\right],W_{0}=\left[\begin{array}[]{c}1\\ -1\end{array}\right].
  • (iii)

    If 1Np1\leq N\leq p, the monomial xNx^{N} can be computed by a RePU σp\sigma_{p} activated network with one hidden layer and no more than 2p2p nodes. More generally, a polynomial of degree no more than pp, i.e. k=0pakxk\sum_{k=0}^{p}a_{k}x^{k}, can also be computed by a RePU σp\sigma_{p} activated network with one hidden layer and no more than 2p2p nodes as

    \sum_{k=0}^{p}a_{k}x^{k}=W_{1}^{\top}\sigma_{p}(W_{0}x+b_{0})+u_{0},

    where

    W0=[1111]2p×1,b0=[t1t1tptp]2p×1,W1=[u1(1)pu1up(1)pup]2p×1.W_{0}=\left[\begin{array}[]{c}1\\ -1\\ \vdots\\ 1\\ -1\end{array}\right]\in\mathbb{R}^{2p\times 1},\ \ b_{0}=\left[\begin{array}[]{c}t_{1}\\ -t_{1}\\ \vdots\\ t_{p}\\ -t_{p}\end{array}\right]\in\mathbb{R}^{2p\times 1},\ \ W_{1}=\left[\begin{array}[]{c}u_{1}\\ (-1)^{p}u_{1}\\ \vdots\\ u_{p}\\ (-1)^{p}u_{p}\end{array}\right]\in\mathbb{R}^{2p\times 1}.

    Here t1,,tpt_{1},\ldots,t_{p} are distinct values in \mathbb{R} and values of u0,,upu_{0},\ldots,u_{p} satisfy the linear system

    [1110t1pit2pitppi0t1p1t2p1tpp10t1pt2ptpp1][u1uiupu0]=[ap(Cpp)1ai(Cpi)1a1(Cp1)1a0(Cp0)1],\left[\begin{array}[]{ccccc}1&1&\cdots&1&0\\ \vdots&\vdots&&\vdots&\vdots\\ t_{1}^{p-i}&t_{2}^{p-i}&\cdots&t_{p}^{p-i}&0\\ \vdots&\vdots&&\vdots&\vdots\\ t_{1}^{p-1}&t_{2}^{p-1}&\cdots&t_{p}^{p-1}&0\\ t_{1}^{p}&t_{2}^{p}&\cdots&t_{p}^{p}&1\end{array}\right]\left[\begin{array}[]{c}u_{1}\\ \vdots\\ u_{i}\\ \vdots\\ u_{p}\\ u_{0}\end{array}\right]=\left[\begin{array}[]{c}a_{p}(C^{p}_{p})^{-1}\\ \vdots\\ a_{i}(C^{i}_{p})^{-1}\\ \vdots\\ a_{1}(C^{1}_{p})^{-1}\\ a_{0}(C^{0}_{p})^{-1}\end{array}\right],

    where Cpi,i=0,,pC^{i}_{p},i=0,\ldots,p are binomial coefficients. Note that the top-left p×pp\times p sub-matrix of the (p+1)×(p+1)(p+1)\times(p+1) matrix above is a Vandermonde matrix, which is invertible as long as t1,,tpt_{1},\ldots,t_{p} are distinct.
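The identity in part (ii) is easy to verify numerically. A minimal sketch, assuming NumPy; the helper name `repu` is ours.

```python
import numpy as np

def repu(x, p):
    # RePU activation: sigma_p(x) = max(0, x)^p
    return np.maximum(x, 0.0) ** p

# Part (ii): with W_0 = [1, -1]^T and W_1 = [1, (-1)^p]^T,
#   x^p = sigma_p(x) + (-1)^p * sigma_p(-x)   for all real x.
p = 3
x = np.linspace(-2.0, 2.0, 9)
print(np.allclose(x ** p, repu(x, p) + (-1) ** p * repu(-x, p)))   # prints True
```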

References

  • Abdeljawad and Grohs (2022) Ahmed Abdeljawad and Philipp Grohs. Approximations with deep neural networks in Sobolev time-space. Analysis and Applications, 20(03):499–541, 2022.
  • Ali and Nouy (2021) Mazen Ali and Anthony Nouy. Approximation of smoothness classes by deep rectifier networks. SIAM Journal on Numerical Analysis, 59(6):3032–3051, 2021.
  • Anthony and Bartlett (1999) Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999. ISBN 0-521-57353-X. doi: 10.1017/CBO9780511624216. URL https://doi.org/10.1017/CBO9780511624216.
  • Bagby et al. (2002) Thomas Bagby, Len Bos, and Norman Levenberg. Multivariate simultaneous approximation. Constructive approximation, 18(4):569–577, 2002.
  • Baraniuk and Wakin (2009) Richard G. Baraniuk and Michael B. Wakin. Random projections of smooth manifolds. Found. Comput. Math., 9(1):51–77, 2009. ISSN 1615-3375. doi: 10.1007/s10208-007-9011-z. URL https://doi.org/10.1007/s10208-007-9011-z.
  • Barlow et al. (1972) R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical Inference under Order Restrictions; the Theory and Application of Isotonic Regression. New York: Wiley, 1972.
  • Bartlett et al. (1998) Peter Bartlett, Vitaly Maiorov, and Ron Meir. Almost linear VC dimension bounds for piecewise polynomial networks. Advances in neural information processing systems, 11, 1998.
  • Bartlett et al. (2017) Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/b22b257ad0519d4500539da3c8bcf4dd-Paper.pdf.
  • Bartlett et al. (2019) Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20:Paper No. 63, 17, 2019. ISSN 1532-4435.
  • Bauer and Kohler (2019) Benedikt Bauer and Michael Kohler. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Ann. Statist., 47(4):2261–2285, 2019. ISSN 0090-5364. doi: 10.1214/18-AOS1747. URL https://doi.org/10.1214/18-AOS1747.
  • Belkin and Niyogi (2003) Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373–1396, 2003.
  • Bellec (2018) Pierre C Bellec. Sharp oracle inequalities for least squares estimators in shape restricted regression. The Annals of Statistics, 46(2):745–780, 2018.
  • Belomestny et al. (2022) Denis Belomestny, Alexey Naumov, Nikita Puchkin, and Sergey Samsonov. Simultaneous approximation of a smooth function and its derivatives by deep neural networks with piecewise-polynomial activations. arXiv:2206.09527, 2022.
  • Block et al. (2020) Adam Block, Youssef Mroueh, and Alexander Rakhlin. Generative modeling with denoising auto-encoders and langevin sampling. arXiv:2002.00107, 2020.
  • Chatterjee and Lafferty (2019) Sabyasachi Chatterjee and John Lafferty. Adaptive risk bounds in unimodal regression. Bernoulli, 25(1):1–25, 2019.
  • Chatterjee et al. (2015) Sabyasachi Chatterjee, Adityanand Guntuboyina, and Bodhisattva Sen. On risk bounds in isotonic and other shape restricted regression problems. The Annals of Statistics, 43(4):1774–1800, 2015.
  • Chatterjee et al. (2018) Sabyasachi Chatterjee, Adityanand Guntuboyina, and Bodhisattva Sen. On matrix estimation under monotonicity constraints. Bernoulli, 24(2):1072–1100, 2018.
  • Chen et al. (2019) Minshuo Chen, Haoming Jiang, and Tuo Zhao. Efficient approximation of deep ReLU networks for functions on low dimensional manifolds. Advances in Neural Information Processing Systems, 2019.
  • Chen et al. (2022) Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao. Nonparametric regression on low-dimensional manifolds using deep ReLU networks: Function approximation and statistical recovery. Information and Inference: A Journal of the IMA, 11(4):1203–1253, 2022.
  • Chen et al. (2020) Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. arXiv:2009.00713, 2020.
  • Chui and Li (1993) Charles K Chui and Xin Li. Realization of neural networks with one hidden layer. In Multivariate approximation: From CAGD to wavelets, pages 77–89. World Scientific, 1993.
  • Chui et al. (1994) Charles K Chui, Xin Li, and Hrushikesh Narhar Mhaskar. Neural networks for localized approximation. Mathematics of Computation, 63(208):607–623, 1994.
  • Deng and Zhang (2020) Hang Deng and Cun-Hui Zhang. Isotonic regression in multi-dimensional spaces and graphs. The Annals of Statistics, 48(6):3672–3698, 2020.
  • Dhariwal and Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  • Diggle et al. (1999) Peter Diggle, Sara Morris, and Tony Morton-Jones. Case-control isotonic regression for investigation of elevation in risk around a point source. Statistics in medicine, 18(13):1605–1613, 1999.
  • Duan et al. (2021) Chenguang Duan, Yuling Jiao, Yanming Lai, Xiliang Lu, and Zhijian Yang. Convergence rate analysis for deep ritz method. arXiv preprint arXiv:2103.13330, 2021.
  • Durot (2002) Cécile Durot. Sharp asymptotics for isotonic regression. Probability theory and related fields, 122(2):222–240, 2002.
  • Durot (2007) Cécile Durot. On the $l_p$-error of monotonicity constrained estimators. The Annals of Statistics, 35(3):1080–1104, 2007.
  • Durot (2008) Cécile Durot. Monotone nonparametric regression with random design. Mathematical methods of statistics, 17(4):327–341, 2008.
  • Dykstra (1983) Richard L Dykstra. An algorithm for restricted least squares regression. Journal of the American Statistical Association, 78(384):837–842, 1983.
  • Fefferman (2006) Charles Fefferman. Whitney’s extension problem for $C^m$. Annals of Mathematics, 164(1):313–359, 2006. ISSN 0003486X. URL http://www.jstor.org/stable/20159991.
  • Fefferman et al. (2016) Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.
  • Fokianos et al. (2020) Konstantinos Fokianos, Anne Leucht, and Michael H Neumann. On integrated $l_1$ convergence rate of an isotonic regression estimator for multivariate observations. IEEE Transactions on Information Theory, 66(10):6389–6402, 2020.
  • Gao et al. (2017) Chao Gao, Fang Han, and Cun-Hui Zhang. Minimax risk bounds for piecewise constant models. arXiv preprint arXiv:1705.06386, 2017.
  • Gao et al. (2019) Yuan Gao, Yuling Jiao, Yang Wang, Yao Wang, Can Yang, and Shunkang Zhang. Deep generative learning via variational gradient flow. In International Conference on Machine Learning, pages 2093–2101. PMLR, 2019.
  • Gao et al. (2022) Yuan Gao, Jian Huang, Yuling Jiao, Jin Liu, Xiliang Lu, and Zhijian Yang. Deep generative learning via euler particle transport. In Mathematical and Scientific Machine Learning, pages 336–368. PMLR, 2022.
  • Groeneboom and Jongbloed (2014) Piet Groeneboom and Geurt Jongbloed. Nonparametric estimation under shape constraints, volume 38. Cambridge University Press, 2014.
  • Gühring and Raslan (2021) Ingo Gühring and Mones Raslan. Approximation rates for neural networks with encodable weights in smoothness spaces. Neural Networks, 134:107–130, 2021.
  • Han et al. (2019) Qiyang Han, Tengyao Wang, Sabyasachi Chatterjee, and Richard J Samworth. Isotonic regression in general dimensions. The Annals of Statistics, 47(5):2440–2471, 2019.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Ho et al. (2022) Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23:47–1, 2022.
  • Hoffmann et al. (2009) Heiko Hoffmann, Stefan Schaal, and Sethu Vijayakumar. Local dimensionality reduction for non-parametric regression. Neural Processing Letters, 29(2):109, 2009.
  • Hon and Yang (2022) Sean Hon and Haizhao Yang. Simultaneous neural network approximation for smooth functions. Neural Networks, 154:152–164, 2022.
  • Hörmander (2015) Lars Hörmander. The analysis of linear partial differential operators I: Distribution theory and Fourier analysis. Springer, 2015.
  • Horner (1819) William George Horner. A new method of solving numerical equations of all orders, by continuous approximation. Philosophical Transactions of the Royal Society of London, (109):308–335, 1819.
  • Horowitz and Lee (2017) Joel L Horowitz and Sokbae Lee. Nonparametric estimation and inference under shape restrictions. Journal of Econometrics, 201(1):108–126, 2017.
  • Hyvärinen and Dayan (2005) Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
  • Jalal et al. (2021) Ajil Jalal, Marius Arvinte, Giannis Daras, Eric Price, Alexandros G Dimakis, and Jon Tamir. Robust compressed sensing MRI with deep generative priors. Advances in Neural Information Processing Systems, 34:14938–14954, 2021.
  • Jiang et al. (2011) Xiaoqian Jiang, Melanie Osl, Jihoon Kim, and Lucila Ohno-Machado. Smooth isotonic regression: A new method to calibrate predictive models. AMIA Summits on Translational Science Proceedings, 2011:16, 2011.
  • Jiao et al. (2023) Yuling Jiao, Guohao Shen, Yuanyuan Lin, and Jian Huang. Deep nonparametric regression on approximate manifolds: Nonasymptotic error bounds with polynomial prefactors. The Annals of Statistics, 51(2):691–716, 2023.
  • Kim et al. (2018) Arlene KH Kim, Adityanand Guntuboyina, and Richard J Samworth. Adaptation in log-concave density estimation. The Annals of Statistics, 46(5):2279–2306, 2018.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Klusowski and Barron (2018) Jason M Klusowski and Andrew R Barron. Approximation by combinations of ReLU and squared ReLU ridge functions with $\ell^1$ and $\ell^0$ controls. IEEE Transactions on Information Theory, 64(12):7649–7656, 2018.
  • Kong et al. (2020) Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. arXiv:2009.09761, 2020.
  • Kyng et al. (2015) Rasmus Kyng, Anup Rao, and Sushant Sachdeva. Fast, provable algorithms for isotonic regression in all $\ell_p$-norms. Advances in neural information processing systems, 28, 2015.
  • Lee et al. (2022) Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence for score-based generative modeling with polynomial complexity. arXiv:2206.06227, 2022.
  • Li et al. (2019) Bo Li, Shanshan Tang, and Haijun Yu. Better approximations of high dimensional smooth functions by deep neural networks with rectified power units. arXiv preprint arXiv:1903.05858, 2019.
  • Li et al. (2020) Bo Li, Shanshan Tang, and Haijun Yu. Powernet: Efficient representations of polynomials and smooth functions by deep neural networks with rectified power units. J. Math. Study, 53(2):159–191, 2020.
  • Li and Turner (2017) Yingzhen Li and Richard E Turner. Gradient estimators for implicit models. arXiv:1705.07107, 2017.
  • Liu et al. (2016) Qiang Liu, Jason Lee, and Michael Jordan. A kernelized stein discrepancy for goodness-of-fit tests. In International conference on machine learning, pages 276–284. PMLR, 2016.
  • Lu et al. (2021) Jianfeng Lu, Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021.
  • Luss et al. (2012) Ronny Luss, Saharon Rosset, and Moni Shahar. Efficient regularized isotonic regression with application to gene–gene interaction search. The Annals of Applied Statistics, 6(1):253–283, 2012.
  • Mhaskar (1993) Hrushikesh Narhar Mhaskar. Approximation properties of a multilayered feedforward artificial neural network. Advances in Computational Mathematics, 1(1):61–80, 1993.
  • Mittal et al. (2021) Gautam Mittal, Jesse Engel, Curtis Hawthorne, and Ian Simon. Symbolic music generation with diffusion models. arXiv:2103.16091, 2021.
  • Mohri et al. (2018) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018.
  • Morton-Jones et al. (2000) Tony Morton-Jones, Peter Diggle, Louise Parker, Heather O Dickinson, and Keith Binks. Additive isotonic regression models in epidemiology. Statistics in medicine, 19(6):849–859, 2000.
  • Nagarajan and Kolter (2019) Vaishnavh Nagarajan and J Zico Kolter. Deterministic pac-bayesian generalization bounds for deep networks via generalizing noise-resilience. arXiv preprint arXiv:1905.13344, 2019.
  • Neyshabur et al. (2015) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on learning theory, pages 1376–1401. PMLR, 2015.
  • Petersen and Voigtlaender (2018) Philipp Petersen and Felix Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
  • Picard (1976) Jean-Claude Picard. Maximal closure of a graph and applications to combinatorial problems. Management science, 22(11):1268–1272, 1976.
  • Popov et al. (2021) Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. Grad-tts: A diffusion probabilistic model for text-to-speech. In International Conference on Machine Learning, pages 8599–8608. PMLR, 2021.
  • Qin et al. (2014) Jing Qin, Tanya P Garcia, Yanyuan Ma, Ming-Xin Tang, Karen Marder, and Yuanjia Wang. Combining isotonic regression and em algorithm to predict genetic risk under monotonicity constraint. The annals of applied statistics, 8(2):1182, 2014.
  • Robertson et al. (1988) T. Robertson, F. T. Wright, and R. L. Dykstra. Order Restricted Statistical Inference. New York: Wiley, 1988.
  • Rueda et al. (2009) Cristina Rueda, Miguel A Fernández, and Shyamal Das Peddada. Estimation of parameters subject to order restrictions on a circle with application to estimation of phase angles of cell cycle genes. Journal of the American Statistical Association, 104(485):338–347, 2009.
  • Sasaki et al. (2014) Hiroaki Sasaki, Aapo Hyvärinen, and Masashi Sugiyama. Clustering via mode seeking by direct estimation of the gradient of a log-density. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 19–34. Springer, 2014.
  • Schmidt-Hieber (2019) Johannes Schmidt-Hieber. Deep ReLU network approximation of functions on a manifold. arXiv:1908.00695, 2019.
  • Schmidt-Hieber (2020) Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. Annals of Statistics, 48(4):1875–1897, 2020.
  • Shen et al. (2022) Guohao Shen, Yuling Jiao, Yuanyuan Lin, Joel L Horowitz, and Jian Huang. Estimation of non-crossing quantile regression process with deep ReQU neural networks. arXiv:2207.10442, 2022.
  • Shen et al. (2020) Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons. Commun. Comput. Phys., 28(5):1768–1811, 2020. ISSN 1815-2406. doi: 10.4208/cicp.oa-2020-0149. URL https://doi.org/10.4208/cicp.oa-2020-0149.
  • Shi et al. (2018) Jiaxin Shi, Shengyang Sun, and Jun Zhu. A spectral approach to gradient estimation for implicit distributions. In International Conference on Machine Learning, pages 4644–4653. PMLR, 2018.
  • Siegel and Xu (2022) Jonathan W Siegel and Jinchao Xu. High-order approximation rates for shallow neural networks with cosine and ReLU$^k$ activation functions. Applied and Computational Harmonic Analysis, 58:1–26, 2022.
  • Song and Ermon (2019) Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
  • Song and Ermon (2020) Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020.
  • Song et al. (2020) Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pages 574–584. PMLR, 2020.
  • Song et al. (2021) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS.
  • Spouge et al. (2003) J Spouge, H Wan, and WJ Wilbur. Least squares isotonic regression in two dimensions. Journal of Optimization Theory and Applications, 117(3):585–605, 2003.
  • Sriperumbudur et al. (2017) Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, and Revant Kumar. Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research, 2017.
  • Stone (1982) Charles J Stone. Optimal global rates of convergence for nonparametric regression. The annals of statistics, pages 1040–1053, 1982.
  • Stout (2015) Quentin F Stout. Isotonic regression for multiple independent variables. Algorithmica, 71(2):450–470, 2015.
  • Strathmann et al. (2015) Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, and Arthur Gretton. Gradient-free hamiltonian monte carlo with efficient kernel exponential families. Advances in Neural Information Processing Systems, 28, 2015.
  • Sutherland et al. (2018) Danica J Sutherland, Heiko Strathmann, Michael Arbel, and Arthur Gretton. Efficient and principled score estimation with nyström kernel exponential families. In International Conference on Artificial Intelligence and Statistics, pages 652–660. PMLR, 2018.
  • Vincent (2011) Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
  • Warde-Farley and Bengio (2016) David Warde-Farley and Yoshua Bengio. Improving generative adversarial networks with denoising feature matching. 2016.
  • Wei and Ma (2019) Colin Wei and Tengyu Ma. Data-dependent sample complexity of deep neural networks via lipschitz augmentation. Advances in Neural Information Processing Systems, 32, 2019.
  • Xu and Cao (2005) Zong-Ben Xu and Fei-Long Cao. Simultaneous lp-approximation order for neural networks. Neural Networks, 18(7):914–923, 2005.
  • Yang and Barber (2019) Fan Yang and Rina Foygel Barber. Contraction and uniform convergence of isotonic regression. Electronic Journal of Statistics, 13(1):646–677, 2019.
  • Yarotsky (2017) Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
  • Yarotsky (2018) Dmitry Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. In Conference on Learning Theory, pages 639–649. PMLR, 2018.
  • Zhang (2002) Cun-Hui Zhang. Risk bounds in isotonic regression. The Annals of Statistics, 30(2):528–555, 2002.
  • Zhou et al. (2020) Yuhao Zhou, Jiaxin Shi, and Jun Zhu. Nonparametric score estimators. In International Conference on Machine Learning, pages 11513–11522. PMLR, 2020.