
DAS-PINNs: A deep adaptive sampling method for solving high-dimensional partial differential equations

Kejun Tang, Xiaoliang Wan, Chao Yang. Peng Cheng Laboratory, Shenzhen 518052, China; Department of Mathematics and Center for Computation and Technology, Louisiana State University, Baton Rouge 70803, USA; School of Mathematical Sciences, Peking University, Beijing 100871, China; Institute of Computing and Digital Economy, Peking University, Changsha 410205, China
Abstract

In this work we propose a deep adaptive sampling (DAS) method for solving partial differential equations (PDEs), where deep neural networks are utilized to approximate the solutions of PDEs and deep generative models are employed to generate new collocation points that refine the training set. The overall procedure of DAS consists of two components: solving the PDEs by minimizing the residual loss on the collocation points in the training set and generating a new training set to further improve the accuracy of the current approximate solution. In particular, we treat the residual as a probability density function and approximate it with a deep generative model, called KRnet. The new samples from KRnet are consistent with the distribution induced by the residual, i.e., more samples are located in the region of large residual and fewer samples are located in the region of small residual. Analogous to classical adaptive methods such as the adaptive finite element method, KRnet acts as an error indicator that guides the refinement of the training set. Compared to the neural network approximation obtained with uniformly distributed collocation points, the developed algorithms can significantly improve the accuracy, especially for low regularity and high-dimensional problems. We demonstrate the effectiveness of the proposed DAS method with numerical experiments.

keywords:
deep learning; numerical approximation of PDEs; adaptive sampling; deep generative models.
journal: Journal of Computational Physics

1 Introduction

In recent years, solving partial differential equations (PDEs) with deep learning methods has been receiving increasing attention [1, 2, 3]. Two major types of deep learning methods have been proposed for solving PDEs: the variational form subject to deep learning techniques [4, 5, 6, 7] and the physics-informed neural networks (PINNs) [8, 9, 10, 3], both of which reformulate a PDE problem as an optimization problem and train a deep neural network (DNN) to approximate the solution of the PDE by minimizing the corresponding loss functional. The variational form is based on the weak formulation of PDEs, while the physics-informed neural networks are based on the residual loss of PDEs. Similar ideas of solving PDEs via minimizing the residual loss can be traced back to the works [11, 12] in the 1990s, where a shallow neural network is optimized on an a priori fixed mesh as an approximation of the solution. Some efforts have been made to incorporate traditional computational techniques to enhance the performance of solving PDEs with deep neural networks. In [13, 14, 15, 16, 7], deep neural networks based on domain decomposition are proposed to improve the efficiency. A penalty-free neural network method [17] and PhyGeoNet [16] are developed to deal with complex geometries and irregular domains. A weak formulation with primal and adversarial networks is proposed in [18], where the PDE problem is converted to an operator norm minimization problem induced by the weak formulation.

One critical step for all these methods is to approximate the loss functional, where the integral is usually approximated by the Monte Carlo method with collocation points randomly generated from a uniform distribution on the computational domain. Since the minimization of the discrete loss functional yields the approximate solution, the accuracy of the approximate solution is closely related to the accuracy of the discrete loss functional. In contrast to classical computational methods, where the main concern is the approximation error, one needs to balance the approximation error and the generalization error for the neural network approximation, where the approximation error mainly originates from the modeling capability of the neural network and the generalization error is mainly related to the data points in the training set, i.e., the random samples for the discretization of the loss functional. However, for many PDE models, the uniform random sampling strategy is not efficient, especially when the PDE solution has low regularity; in other words, an integrand of low regularity may have a large variance with respect to the uniform distribution, so that the Monte Carlo approximation of the loss functional has a large prefactor in front of the $O(N^{-1/2})$ convergence rate. This issue becomes worse for high-dimensional problems due to the curse of dimensionality. In high-dimensional spaces, most of the volume of the computational domain concentrates around its surface [19, 20, 21], which means that uniform samples may become less effective for training deep neural networks to approximate high-dimensional PDEs. For example, collocation points from the uniform distribution are not suitable for solving high-dimensional Fokker-Planck equations, while an adaptive strategy that samples the current approximate solution is effective [22]. In [23], a selection network is introduced to serve as a weight function that assigns higher weights to samples with large point-wise residuals, which yields a more accurate approximate solution if the selection network is properly chosen. However, to obtain a valid selection network, one needs to impose additional constraints on the selection network, which is often a non-trivial task. For low-dimensional problems, it is well known that one can employ adaptive numerical schemes to deal with PDEs with low-regularity solutions [24, 25, 26], which also suggests that uniform samples are not the best choice. Therefore, adaptive sampling strategies are crucial for developing more efficient and reliable deep learning techniques for the approximation of PDEs.

In this work, we develop a deep adaptive sampling (DAS) method for the neural network approximation of PDEs based on residual minimization, where a deep generative model, called KRnet [27, 28, 29], is used to guide the sample generation for the training set. To this end, we need to construct two deep neural network models: one for approximating the solution and the other for refining the training set. The neural network approximation is achieved by the standard procedure of residual minimization. KRnet defines a transport map [30] from the data distribution to a prior distribution (e.g., the standard Gaussian). KRnet retains two traits of flow-based generative models [31, 32]: exact invertibility of the transport map and efficient computation of the Jacobian determinant, based on which one can obtain an explicit density model through the change of variables and an effective approach for generating samples through the invertible mapping. The key point in our proposed framework is that the residual is viewed as a probability density function (PDF) up to a constant, and approximating this PDF can be achieved by minimizing the Kullback-Leibler (KL) divergence between the KRnet-induced density model and the residual-induced distribution. We use the trained KRnet to generate new collocation points to replace or refine the training set, where more points are put in the region of large residual and fewer points are put in the region of small residual. The updated training set is then used to further improve the accuracy of the current approximate solution. Simply speaking, KRnet acts as an error indicator for the improvement of the training set, which shares similarities with the classical adaptive finite element method equipped with a residual-based a posteriori error estimator. In summary, the main contributions of this work are as follows.

  1. We utilize a deep generative model as a generic means to reflect the correspondence between the residual and the error of approximation through efficient PDF approximation and sample generation.

  2. We propose a deep adaptive sampling (DAS) framework, including efficient sampling procedures and training algorithms, for the adaptive improvement of neural network approximation of PDEs.

The remainder of the paper is organized as follows. In the next section, we briefly describe the deep learning method used in this work for the approximation of PDEs. After that, the statistical error of the machine learning technique is illustrated from the perspective of function approximation. Our DAS approach is presented in section 4. We provide the theoretical analysis of DAS in section 5. In section 6, we demonstrate the efficiency of our adaptive sampling approach with numerical experiments. The paper is concluded in section 7.

2 Deep learning for PDEs

Let $\Omega\subset\mathbb{R}^{d}$ be a spatial domain, which is bounded, connected and with a polygonal boundary $\partial\Omega$, and $\boldsymbol{x}\in\mathbb{R}^{d}$ denote a spatial variable. The PDE problem is stated as: find $u(\boldsymbol{x})\in F:\mathbb{R}^{d}\mapsto\mathbb{R}$, where $F$ is a proper function space defined on $\Omega$, such that

\[
\begin{aligned}
\mathcal{L}u(\boldsymbol{x}) &= s(\boldsymbol{x}), && \forall\boldsymbol{x}\in\Omega,\\
\mathfrak{b}u(\boldsymbol{x}) &= g(\boldsymbol{x}), && \forall\boldsymbol{x}\in\partial\Omega,
\end{aligned}
\tag{1}
\]

where $\mathcal{L}$ is the partial differential operator, $\mathfrak{b}$ is the boundary operator, $s(\boldsymbol{x})$ is the source function, and $g(\boldsymbol{x})$ represents the boundary conditions.

Let $u(\boldsymbol{x};\Theta)$ be a neural network with parameters $\Theta$. In the framework of PINNs, the goal is to use $u(\boldsymbol{x};\Theta)$ to approximate the solution $u(\boldsymbol{x})$ through optimizing a loss functional defined as [8, 9]

\[
J\left(u(\boldsymbol{x};\Theta)\right)=\left\|r(\boldsymbol{x};\Theta)\right\|_{2,\Omega}^{2}+\gamma\left\|b(\boldsymbol{x};\Theta)\right\|_{2,\partial\Omega}^{2}=J_{r}(u(\boldsymbol{x};\Theta))+\gamma J_{b}(u(\boldsymbol{x};\Theta)),
\tag{2}
\]

where $r(\boldsymbol{x};\Theta)=\mathcal{L}u(\boldsymbol{x};\Theta)-s(\boldsymbol{x})$ and $b(\boldsymbol{x};\Theta)=\mathfrak{b}u(\boldsymbol{x};\Theta)-g(\boldsymbol{x})$ measure how well $u(\boldsymbol{x};\Theta)$ satisfies the partial differential equations and the boundary conditions, respectively, and $\gamma>0$ is a penalty parameter. Here, $\left\|u(\boldsymbol{x})\right\|_{2,\Omega}^{2}=\int_{\Omega}|u(\boldsymbol{x})|^{2}d\boldsymbol{x}$ and $\left\|u(\boldsymbol{x})\right\|_{2,\partial\Omega}^{2}=\int_{\partial\Omega}|u(\boldsymbol{x})|^{2}d\boldsymbol{x}$. The loss functional (2) is usually discretized numerically before the optimization with respect to $\Theta$ is addressed. In practice, one often chooses two sets of uniformly distributed collocation points $\mathsf{S}_{\Omega}=\{\boldsymbol{x}_{\Omega}^{(i)}\}_{i=1}^{N_{r}}$ and $\mathsf{S}_{\partial\Omega}=\{\boldsymbol{x}_{\partial\Omega}^{(i)}\}_{i=1}^{N_{b}}$ respectively for the discretization of the two terms in the objective functional (2), leading to the following empirical loss

\[
J_{N}\left(u(\boldsymbol{x};\Theta)\right)=\left\|r(\boldsymbol{x};\Theta)\right\|_{N_{r},\mathsf{S}_{\Omega}}^{2}+\hat{\gamma}\left\|b(\boldsymbol{x};\Theta)\right\|_{N_{b},\mathsf{S}_{\partial\Omega}}^{2},
\tag{3}
\]

where $\hat{\gamma}>0$, and

\[
\left\|u(\boldsymbol{x})\right\|_{N_{r},\mathsf{S}_{\Omega}}=\left(\frac{1}{N_{r}}\sum_{i=1}^{N_{r}}u^{2}(\boldsymbol{x}_{\Omega}^{(i)})\right)^{\frac{1}{2}},\quad\left\|u(\boldsymbol{x})\right\|_{N_{b},\mathsf{S}_{\partial\Omega}}=\left(\frac{1}{N_{b}}\sum_{i=1}^{N_{b}}u^{2}(\boldsymbol{x}_{\partial\Omega}^{(i)})\right)^{\frac{1}{2}}.
\]

Note that in the definition of $J_{N}$ we do not take into account the constants $|\Omega|=\int_{\Omega}d\boldsymbol{x}$ and $|\partial\Omega|=\int_{\partial\Omega}d\boldsymbol{x}$, and the ratio induced by these two constants can be dealt with by choosing $\hat{\gamma}=\frac{\gamma|\partial\Omega|}{|\Omega|}$ such that $J_{N}(u)$ is a Monte Carlo approximation of $J(u)$ up to a constant scaling factor $|\Omega|$. We then seek an approximate solution by minimizing the empirical loss (3), i.e.,

\[
\min_{\Theta}J_{N}(u(\boldsymbol{x};\Theta)),
\tag{4}
\]

which can be solved by stochastic gradient-based methods [33, 34].
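To make the discretized formulation concrete, the following minimal PyTorch-style sketch minimizes an empirical loss of the form (3) for a 1D Poisson problem $-u''=s$ on $(0,1)$ with homogeneous Dirichlet boundary conditions; the network width, sample sizes and learning rate are illustrative assumptions and not the settings used later in the paper.

```python
import torch

# Minimal sketch of residual-loss minimization (PINN style) for -u'' = s on (0, 1),
# u(0) = u(1) = 0, with s chosen so that the exact solution is u(x) = sin(pi x).
torch.manual_seed(0)

u_net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
opt = torch.optim.Adam(u_net.parameters(), lr=1e-3)

def source(x):
    return (torch.pi ** 2) * torch.sin(torch.pi * x)

for step in range(2000):
    x = torch.rand(500, 1, requires_grad=True)        # uniform collocation points in Omega
    xb = torch.tensor([[0.0], [1.0]])                  # boundary collocation points
    u = u_net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    residual = -d2u - source(x)                        # r(x; Theta)
    loss = (residual ** 2).mean() + (u_net(xb) ** 2).mean()   # J_N with gamma_hat = 1
    opt.zero_grad(); loss.backward(); opt.step()
```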

Recently, a priori error estimates of neural-network-based methods for solving PDEs have been established. Combining the analysis techniques of the least squares finite element method [35] with the universal approximation property of neural networks [36, 37, 38, 39], Shin et al. propose an abstract framework for the error estimation of physics-informed neural networks [40]. Lu et al. derive an a priori estimate of the generalization error for the deep Ritz method with two-layer neural networks [41]. Suppose that $u(\boldsymbol{x};\Theta_{N}^{*})$ is the minimizer of the empirical loss $J_{N}(u(\boldsymbol{x};\Theta))$ and $u(\boldsymbol{x};\Theta^{*})$ is the minimizer of $J(u(\boldsymbol{x};\Theta))$, i.e.,

\[
\begin{aligned}
u(\boldsymbol{x};\Theta^{*}) &= \arg\min_{\Theta}J(u(\boldsymbol{x};\Theta)),\\
u(\boldsymbol{x};\Theta_{N}^{*}) &= \arg\min_{\Theta}J_{N}(u(\boldsymbol{x};\Theta)).
\end{aligned}
\]

We have

\[
u(\boldsymbol{x};\Theta_{N}^{*})-u(\boldsymbol{x})=u(\boldsymbol{x};\Theta_{N}^{*})-u(\boldsymbol{x};\Theta^{*})+u(\boldsymbol{x};\Theta^{*})-u(\boldsymbol{x}),
\tag{5}
\]

i.e.,

\[
\mathbb{E}\left(\left\|u(\boldsymbol{x};\Theta_{N}^{*})-u(\boldsymbol{x})\right\|_{\Omega}\right)\leq\mathbb{E}\left(\left\|u(\boldsymbol{x};\Theta_{N}^{*})-u(\boldsymbol{x};\Theta^{*})\right\|_{\Omega}\right)+\left\|u(\boldsymbol{x};\Theta^{*})-u(\boldsymbol{x})\right\|_{\Omega},
\tag{6}
\]

where $\mathbb{E}$ indicates the expectation and the norm $\left\|\cdot\right\|_{\Omega}$ corresponds to the function space $F$ for $u(\boldsymbol{x};\Theta)$. The first term describes the statistical error from discretizing the loss functional with the Monte Carlo approximation, and the second term is the approximation error of minimizing the loss functional over the hypothesis space. The approximation error depends on the capability of neural networks, while the statistical error depends on the definition of $\mathsf{S}_{\Omega}$ and $\mathsf{S}_{\partial\Omega}$. In this work, we focus on how to reduce the statistical error for problem (4), and our algorithm can also be generalized to other formulations for the neural network approximation of PDEs. For simplicity, we focus on the integration of the residual $r(\boldsymbol{x};\Theta)$ and assume that the integral on the boundary is well approximated by a prescribed $\mathsf{S}_{\partial\Omega}$.

3 Illustration of the statistical error

We first use function approximation as an example to illustrate the statistical error of the machine learning technique. Let $\boldsymbol{X}\in\mathbb{R}^{d}$ and $Y\in\mathbb{R}$ be subject to a joint distribution $\rho_{\boldsymbol{X},Y}$. Let $\hat{Y}=m(\boldsymbol{X})$ be a model and $y=h(\boldsymbol{x})$ be a function to be approximated. We know that in the $L_{2}$ sense the optimal model is

\[
m^{*}(\boldsymbol{x})=\arg\min_{m(\boldsymbol{x})}\left[L(Y,\hat{Y})=\int(y-m(\boldsymbol{x}))^{2}\rho_{\boldsymbol{X},Y}(\boldsymbol{x},y)d\boldsymbol{x}dy\right].
\tag{7}
\]

In reality, we usually do not know $\rho_{\boldsymbol{X},Y}$ and only have a set $\{(\boldsymbol{x}^{(i)},y^{(i)})\}_{i=1}^{N}$ of data, which can be regarded as samples of $\rho_{\boldsymbol{X},Y}$. For a certain hypothesis space $W$, we obtain a regression problem

\[
m_{\boldsymbol{w}^{*}}(\boldsymbol{x})=\arg\min_{m_{\boldsymbol{w}}\in W}\left[L_{N}(Y,\hat{Y})=\frac{1}{N}\sum_{i=1}^{N}(y^{(i)}-m_{\boldsymbol{w}}(\boldsymbol{x}^{(i)}))^{2}\right],
\tag{8}
\]

where $L_{N}$ can be regarded as a Monte Carlo approximation of $L$ and the subscript $\boldsymbol{w}$ indicates the model parameters specified by $W$. If we let $\rho_{\boldsymbol{X},Y}(\boldsymbol{x},y)=\delta(y-h(\boldsymbol{x}))\rho(\boldsymbol{x})$, and assume $m(\boldsymbol{x})\in V$ with $V$ being a linear space, we then obtain the continuous least-squares method for function approximation

\[
m_{V}^{*}(\boldsymbol{x})=\arg\min_{m(\boldsymbol{x})\in V}\left[L_{V}(Y,\hat{Y})=\int(m(\boldsymbol{x})-h(\boldsymbol{x}))^{2}\rho(\boldsymbol{x})d\boldsymbol{x}\right],
\tag{9}
\]

where $m_{V}^{*}(\boldsymbol{x})$ is the best approximation of $h(\boldsymbol{x})$ located in $V$ subject to a weighted $L_{2}$ norm in terms of a probability density function $\rho(\boldsymbol{x})$. To approximate $h(\boldsymbol{x})$ with a machine learning technique, we consider

\[
m_{\hat{\boldsymbol{v}}^{*}}(\boldsymbol{x})=\arg\min_{m_{\hat{\boldsymbol{v}}}\in V}\left[L_{V,N}(Y,\hat{Y})=\frac{1}{N}\sum_{i=1}^{N}(m_{\hat{\boldsymbol{v}}}(\boldsymbol{x}^{(i)})-h(\boldsymbol{x}^{(i)}))^{2}\right],
\tag{10}
\]

where $\{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$ are samples of the PDF $\rho(\boldsymbol{x})$, and $L_{V,N}$ is the Monte Carlo approximation of $L_{V}$. A classical choice for $V$ is the linear space spanned by polynomials, for which we derive the error estimate for $m_{\hat{\boldsymbol{v}}^{*}}(\boldsymbol{x})$ as follows.

Lemma 1.

Let $h(\boldsymbol{x})\in C(D)$ be a continuous function defined on a compact domain $D\subset\mathbb{R}^{d}$ and $\rho(\boldsymbol{x})>0$ be a PDF on $D$. Let $V=\mathrm{span}\{q_{i}(\boldsymbol{x})\}_{i=1}^{n}$ with $q_{i}(\boldsymbol{x})$ being orthonormal polynomials in terms of $\rho(\boldsymbol{x})$. For any $\delta>0$ and with probability at least $1-2\delta$, we have for a sufficiently large $N$

\[
\|m_{\hat{\boldsymbol{v}}^{*}}(\boldsymbol{x})-h(\boldsymbol{x})\|_{\rho}\leq C\sqrt{\frac{\ln\delta^{-1}}{N}}+\|m^{*}_{V}(\boldsymbol{x})-h(\boldsymbol{x})\|_{\rho},
\]

where $C$ is a constant, and $\|\cdot\|_{\rho}$ is the weighted $L_{2}$ norm in terms of $\rho(\boldsymbol{x})$.

The first term on the right-hand side is the statistical error due to the random samples used for the approximation of $L_{V}$ (see the proof in the appendix for more details), and its existence does not depend on the choice of $V$. When $N$ goes to infinity, the statistical error goes to zero and only the approximation error is left. In other words, when applying a machine learning technique to function approximation, we need to pay attention to both the hypothesis space $W$ and the choice of random samples $\{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$, i.e., the training set, to obtain a trade-off between the statistical error and the approximation error. For low-dimensional problems, classical methods such as finite element methods avoid the statistical error by using Gauss quadrature rules, which implies that machine learning techniques are in general less efficient than classical methods due to the existence of the statistical error. On the other hand, for high-dimensional problems, classical methods may not be able to obtain a relatively small approximation error due to the curse of dimensionality, and machine learning techniques may perform better by using a capable hypothesis space such as neural networks and an affordable sample size for a relatively small statistical error. A more general estimate of the statistical error requires the Rademacher complexity of the hypothesis space. In this work, we are interested in the reduction of the statistical error instead of its bound for a certain set of random samples.
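As a simple illustration of this trade-off (not taken from the paper), the following numpy sketch fits a cubic polynomial to $h(x)=\sin(\pi x)$ using $N$ random samples of the uniform density $\rho$ on $[0,1]$ and compares the resulting weighted $L_{2}$ error with that of a (near-)best approximation: the statistical gap shrinks as $N$ grows, while the approximation error persists.

```python
import numpy as np

# Statistical vs. approximation error in least-squares regression with random samples.
rng = np.random.default_rng(0)
h = lambda x: np.sin(np.pi * x)
x_ref = np.linspace(0.0, 1.0, 20001)          # fine grid to evaluate weighted L2 norms

# Proxy for the best L2(rho) approximation m_V^*: least squares on the fine grid.
best_coef = np.polynomial.polynomial.polyfit(x_ref, h(x_ref), deg=3)
best_err = np.sqrt(np.mean((np.polynomial.polynomial.polyval(x_ref, best_coef) - h(x_ref)) ** 2))

for N in [20, 200, 2000, 20000]:
    x = rng.uniform(0.0, 1.0, N)              # random training samples of rho
    coef = np.polynomial.polynomial.polyfit(x, h(x), deg=3)
    err = np.sqrt(np.mean((np.polynomial.polynomial.polyval(x_ref, coef) - h(x_ref)) ** 2))
    print(f"N={N:6d}  error={err:.3e}  best approximation error={best_err:.3e}")
```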

4 Deep adaptive sampling method

We now focus on the reduction of the statistical error when machine learning techniques are used to approximate PDEs. Our deep adaptive sampling method, i.e., the DAS method, will be established from the viewpoint of variance reduction since the statistical error is induced by the Monte Carlo approximation of the loss. Consider the term $J_{r}(u)$ in equation (2). It is easy to see that if $r^{2}(\boldsymbol{x})$ is a smooth function with good regularity, the most effective way to reduce the error of $J_{N}(u)$ is to increase the sample size $N$. However, if the solution has low regularity, the situation might be different. For example, if the residual is strongly localized, the scenario can be regarded as a rare event. Assume that the residual $r^{2}(\boldsymbol{x})$ in equation (2) behaves similarly to an indicator function $1_{I}(\boldsymbol{x})$, where $I\subset\Omega$ is much smaller than $\Omega$, i.e.,

\[
\zeta=\int_{\Omega}1_{I}(\boldsymbol{x})d\boldsymbol{x}\approx\int_{\Omega}r^{2}(\boldsymbol{x})d\boldsymbol{x}\ll 1.
\]

For simplicity, we assume that $|\Omega|=1$ when presenting the algorithm. To improve the approximation in $I$, we need to compute $\zeta$ accurately. Consider a Monte Carlo estimator of $\zeta$ in terms of uniform samples

\[
\hat{P}_{\mathsf{MC}}=\frac{1}{N}\sum_{i=1}^{N}1_{I}(\boldsymbol{x}^{(i)}).
\]

The relative error of $\hat{P}_{\mathsf{MC}}$ is

\[
\frac{\mathrm{Var}^{1/2}(\hat{P}_{\mathsf{MC}})}{\zeta}=N^{-1/2}\left((1-\zeta)/\zeta\right)^{1/2}\approx(\zeta N)^{-1/2}.
\]

We then need a sample size of $O(1/\zeta)$ to obtain a relative error of $O(1)$. This implies that a certain sample size $N$ quickly becomes less effective if the residual is strongly localized, and this problem is exacerbated in the approximation of high-dimensional PDEs. To reduce the relative error, we need to choose more effective random samples instead of uniform samples.
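The scaling $(\zeta N)^{-1/2}$ can be checked empirically; the sketch below (an illustration, not from the paper) estimates the volume of a small square $I=[0.45,0.55]^{2}$ inside $\Omega=[0,1]^{2}$ with uniform samples and reports the observed relative error.

```python
import numpy as np

# Relative error of the Monte Carlo estimator \hat{P}_MC for a small region I.
rng = np.random.default_rng(0)
zeta = 0.1 ** 2                                  # exact volume of I = [0.45, 0.55]^2
for N in [10**3, 10**4, 10**5]:
    estimates = []
    for _ in range(200):                         # repeat to estimate the variance
        x = rng.uniform(0.0, 1.0, size=(N, 2))
        inside = np.all((x > 0.45) & (x < 0.55), axis=1)
        estimates.append(inside.mean())
    rel_err = np.std(estimates) / zeta
    print(f"N={N:7d}  observed relative error={rel_err:.3f}  (zeta*N)^(-1/2)={1/np.sqrt(zeta*N):.3f}")
```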

4.1 Some ideas on variance reduction

We outline our basic ideas on variance reduction in this section and more details about the algorithm will be presented later. We first consider the importance sampling technique. For simplicity, we only consider $J_{r}(u(\boldsymbol{x};\Theta))$ in equation (2), which is the loss induced by the residual. We have

\[
J_{r}\left(u(\boldsymbol{x};\Theta)\right)=\mathbb{E}[r^{2}]=\int_{\Omega}r^{2}(\boldsymbol{x};\Theta)d\boldsymbol{x}=\int_{\Omega}\frac{r^{2}(\boldsymbol{x};\Theta)}{p(\boldsymbol{x})}p(\boldsymbol{x})d\boldsymbol{x}\approx\frac{1}{N_{r}}\sum\limits_{i=1}^{N_{r}}\frac{r^{2}(\boldsymbol{x}_{\Omega}^{(i)};\Theta)}{p(\boldsymbol{x}_{\Omega}^{(i)})},
\tag{11}
\]

where the set $\{\boldsymbol{x}_{\Omega}^{(i)}\}_{i=1}^{N_{r}}$ is generated with respect to the probability density function (PDF) $p(\boldsymbol{x})$ instead of a uniform distribution as in equation (3). If the variance of $r^{2}(\boldsymbol{X})p^{-1}(\boldsymbol{X})$ in terms of $p(\boldsymbol{x})$ is smaller than the variance of $r^{2}(\boldsymbol{X})$ in terms of the uniform distribution, the accuracy of the Monte Carlo approximation will be improved for a fixed sample size $N_{r}$. The optimal choice for $p(\boldsymbol{x})$ is

\[
p^{*}(\boldsymbol{x})=\frac{r^{2}(\boldsymbol{x};\Theta)}{\mu},
\tag{12}
\]

where $\mu=\int_{\Omega}r^{2}(\boldsymbol{x};\Theta)d\boldsymbol{x}$. Although the optimal choice is of no practical use since $\mu$ is exactly the quantity to be computed, it suggests that variance reduction can be achieved if $p(\boldsymbol{x})$ is close to the residual-induced distribution.
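The following numpy sketch illustrates this point on a one-dimensional toy residual: a sharply localized $r^{2}$ is integrated once with uniform Monte Carlo and once with importance sampling using a truncated Gaussian proposal close to $p^{*}$; the toy problem, proposal and sample size are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Uniform MC vs. importance sampling for J_r = int_0^1 r^2 dx with a localized residual.
rng = np.random.default_rng(0)
r2 = lambda x: np.exp(-1000.0 * (x - 0.5) ** 2)

N = 2000
xu = rng.uniform(0.0, 1.0, N)                         # uniform collocation points
est_uniform = r2(xu).mean()

sigma = 0.03                                          # proposal close to p^* (a narrow bump at 0.5)
a, b = -0.5 / sigma, 0.5 / sigma                      # standardized truncation bounds for [0, 1]
xp = stats.truncnorm.rvs(a, b, loc=0.5, scale=sigma, size=N, random_state=0)
p = stats.truncnorm.pdf(xp, a, b, loc=0.5, scale=sigma)
est_is = (r2(xp) / p).mean()                          # importance-sampling estimator (11)

exact = np.sqrt(np.pi / 1000.0)                       # Gaussian integral; tails outside [0, 1] negligible
print(f"exact={exact:.6f}  uniform MC={est_uniform:.6f}  importance sampling={est_is:.6f}")
```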

Lemma 2.

Assume that $|\Omega|=1$ and $p(\boldsymbol{x})$ is a PDF satisfying

\[
D_{\mathsf{KL}}(p\|p^{*})\leq\varepsilon<\infty,
\]

where $D_{\mathsf{KL}}$ indicates the Kullback-Leibler divergence. For any $0<a<\infty$, we have

\[
\mathbb{E}\left|Q_{p}[r^{2}]-\mathbb{E}[r^{2}]\right|\leq aN_{r}^{-1/2}+2\|r^{2}/p\|_{p}\sqrt{\mathbb{P}(|r^{2}/p-\mu|>a;p)},
\tag{13}
\]

where

\[
Q_{p}(r^{2})=\frac{1}{N_{r}}\sum_{i=1}^{N_{r}}\frac{r^{2}(\boldsymbol{X}^{(i)})}{p(\boldsymbol{X}^{(i)})},
\]

and $\boldsymbol{X}^{(i)}\sim p(\boldsymbol{x})$ are i.i.d. random variables. $\|\cdot\|_{p}$ is the weighted $L_{2}$ norm in terms of $p$. The tail probability can be bounded as

\[
\mathbb{P}(|r^{2}/p-\mu|>a;p)\leq\frac{\mu(2\varepsilon)^{1/2}}{a}.
\]

When $p(\boldsymbol{x})$ is close to $p^{*}(\boldsymbol{x})$, the ratio $r^{2}(\boldsymbol{x})/p(\boldsymbol{x})$ should be close to $\mu$. The second term on the right-hand side of inequality (13) is related to the tail probability $\mathbb{P}(|r^{2}/p-\mu|>a;p)$ in terms of $p(\boldsymbol{x})$. It is seen that if the tail probability is small enough, a variance reduction can be achieved by choosing a small $a$.

Another option for variance reduction is to relax the definition of $J_{r}(u)$ as:

\[
J_{r,p}(u(\boldsymbol{x};\Theta))=\int_{\Omega}r^{2}(\boldsymbol{x};\Theta)p(\boldsymbol{x})d\boldsymbol{x}\approx\frac{1}{N_{r}}\sum_{i=1}^{N_{r}}r^{2}(\boldsymbol{x}_{\Omega}^{(i)};\Theta),
\tag{14}
\]

where the set $\{\boldsymbol{x}_{\Omega}^{(i)}\}_{i=1}^{N_{r}}$ is sampled from the PDF $p(\boldsymbol{x})$ and $p(\boldsymbol{x})>0$ on $\Omega$. It is easy to see that the minimizer of $J_{r,p}(u(\boldsymbol{x};\Theta))$ is also the solution of problem (1) when the boundary conditions are satisfied. To reduce the error induced by the Monte Carlo approximation, we may adjust $p(\boldsymbol{x})$ such that the residual $r^{2}(\boldsymbol{x};\Theta)$ does not vary dramatically. This is similar to the classical adaptive finite element method, where mesh refinement/coarsening is supposed to make the approximation error nearly uniform. For our case, we may sample a PDF that is close to the residual-induced distribution and add more samples from the region of large residual into the training set.
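The two options lead to two different discrete residual losses, sketched below for a batch of collocation points; `residual` and `likelihood` are hypothetical helpers standing for $r(\boldsymbol{x};\Theta)$ evaluated at the samples and the values $p(\boldsymbol{x})$ of the sampling density at the same points.

```python
import torch

def loss_importance_sampling(residual, x, likelihood):
    # Equation (11): unbiased estimate of J_r; the samples x are drawn from p
    # and the squared residual is reweighted by 1/p(x).
    return (residual(x) ** 2 / likelihood).mean()

def loss_weighted_measure(residual, x):
    # Equation (14): J_{r,p}; the samples x are drawn from p and no reweighting
    # is applied, so the loss is defined with respect to the measure p(x) dx.
    return (residual(x) ** 2).mean()
```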

4.2 PDF approximation and sample generation

Two options for variance reduction are considered in the previous section. To make both ideas practical, the key issue is to generate samples efficiently from a PDF $p(\boldsymbol{x})\approx\mu^{-1}r^{2}(\boldsymbol{x};\Theta)$ for a fixed $\Theta$. The first option is based on importance sampling, meaning that an explicit PDF model $p(\boldsymbol{x})$ is needed. The second option is based on refinement of the training set, which only needs samples from the region of large residual and does not need the likelihood of the samples. The second option has been employed in the recent literature to improve the neural network approximation, and is either ad hoc [42] (only for low-dimensional problems) or based on traditional sampling strategies such as MCMC [43]. To the best of our knowledge, there does not exist a generic algorithm that can be effectively adapted to both options. We intend to fill this gap in this work. It is seen from Lemma 2 that the tail probability $\mathbb{P}(|r^{2}/p-\mu|>a;p)$ should be small enough for the effectiveness of $p(\boldsymbol{x})$, which means that $p(\boldsymbol{x})$ must be close enough to the distribution induced by $r^{2}(\boldsymbol{x})$. However, the approximation of a PDF is a challenging task, especially in high-dimensional spaces. Classical explicit PDF models such as the exponential family of distributions and Gaussian mixture models are in general not sufficient as approximators for the PDF induced by $r^{2}(\boldsymbol{x})$. To alleviate this difficulty, we resort to deep generative models. In particular, we employ a recently developed deep generative model called KRnet for both probability approximation and sample generation [22, 27]. KRnet is one type of normalizing flow [44], which provides an invertible transport map between a prior distribution and the target distribution. Unlike other types of deep generative models such as GAN [45, 46, 47] and VAE [48], normalizing flows provide an explicit likelihood. KRnet can be used to address both options for variance reduction, while other deep generative models may also be employed if only the second option for variance reduction is considered. However, one computational issue faced by all deep generative models is that the distribution induced by $r^{2}(\boldsymbol{x})$ is usually defined on a compact domain while deep generative models are in general defined on the whole space. We subsequently address this issue without going into details about the structure of KRnet.

Let $\boldsymbol{X}\in\mathbb{R}^{d}$ be a random vector associated with a given data set, and its PDF is denoted by $p_{\boldsymbol{X}}(\boldsymbol{x})$. The target is to estimate $p_{\boldsymbol{X}}(\boldsymbol{x})$ from data or to generate samples that are consistent with a given $p_{\boldsymbol{X}}(\boldsymbol{x})$. Let $\boldsymbol{Z}\in\mathbb{R}^{d}$ be a random vector associated with a PDF $p_{\boldsymbol{Z}}(\boldsymbol{z})$, where $p_{\boldsymbol{Z}}(\boldsymbol{z})$ is a prior distribution (e.g., Gaussian distributions). The flow-based generative modeling is to seek an invertible mapping $\boldsymbol{z}=f(\boldsymbol{x})$ [31]. By the change of variables, we have the PDF of $\boldsymbol{X}=f^{-1}(\boldsymbol{Z})$ as

\[
p_{\boldsymbol{X}}(\boldsymbol{x})=p_{\boldsymbol{Z}}(f(\boldsymbol{x}))\left|\det\nabla_{\boldsymbol{x}}f\right|.
\tag{15}
\]

Once the prior distribution $p_{\boldsymbol{Z}}(\boldsymbol{z})$ is specified, equation (15) provides an explicit density model for $\boldsymbol{X}$. The inverse of $f(\cdot)$ provides a convenient way to sample $\boldsymbol{X}$ as $\boldsymbol{X}=f^{-1}(\boldsymbol{Z})$. The basic idea of KRnet is to define the structure of $f(\boldsymbol{x})$ in terms of the Knothe-Rosenblatt rearrangement. Let $\mu_{\boldsymbol{Z}}$ and $\mu_{\boldsymbol{X}}$ be the probability measures of two random variables $\boldsymbol{Z},\boldsymbol{X}\in\mathbb{R}^{d}$, respectively. A mapping $\mathcal{T}:\boldsymbol{Z}\mapsto\boldsymbol{X}$ is called a transport map such that $\mathcal{T}_{\#}\mu_{\boldsymbol{Z}}=\mu_{\boldsymbol{X}}$, where $\mathcal{T}_{\#}\mu_{\boldsymbol{Z}}$ is the push-forward of $\mu_{\boldsymbol{Z}}$ such that $\mu_{\boldsymbol{X}}(B)=\mu_{\boldsymbol{Z}}(\mathcal{T}^{-1}(B))$ for every Borel set $B$ [49]. The transport map $\mathcal{T}$ given by the Knothe-Rosenblatt rearrangement [49, 30] has a lower-triangular structure

\[
\boldsymbol{z}=\mathcal{T}^{-1}(\boldsymbol{x})=\left[\begin{array}{l}\mathcal{T}_{1}(x_{1})\\ \mathcal{T}_{2}(x_{1},x_{2})\\ \vdots\\ \mathcal{T}_{d}(x_{1},\ldots,x_{d})\end{array}\right].
\tag{16}
\]

Simply speaking, KRnet integrates the triangular structure of the K-R rearrangement into the definition of the invertible transport map. More details can be found in [27, 22, 29].
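The change of variables (15) can be sanity-checked with a toy invertible map; the sketch below (not part of KRnet) uses the entrywise affine map $\boldsymbol{z}=f(\boldsymbol{x})=(\boldsymbol{x}-\boldsymbol{\mu})/\boldsymbol{\sigma}$, for which the induced density is exactly a Gaussian with mean $\boldsymbol{\mu}$ and covariance $\mathrm{diag}(\boldsymbol{\sigma}^{2})$.

```python
import numpy as np
from scipy import stats

# Change of variables for an entrywise affine flow: p_X(x) = p_Z(f(x)) |det grad f|,
# with f(x) = (x - mu) / sigma, so |det grad f| = prod(1 / sigma_i).
mu = np.array([0.3, -0.1])
sigma = np.array([0.5, 2.0])

def log_px(x):
    z = (x - mu) / sigma
    log_det = -np.sum(np.log(sigma))                  # log |det grad_x f|
    return stats.multivariate_normal.logpdf(z, mean=np.zeros(2)) + log_det

x = np.array([0.7, 1.2])
direct = stats.multivariate_normal.logpdf(x, mean=mu, cov=np.diag(sigma ** 2))
print(log_px(x), direct)                              # the two values agree
```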

Let $f_{\mathsf{KRnet}}(\cdot;\Theta_{f})$ indicate the invertible transport map induced by KRnet, where $\Theta_{f}$ includes the model parameters. An explicit PDF model $p_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$ can be obtained by letting $f=f_{\mathsf{KRnet}}$ in equation (15), i.e.,

\[
p_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})=p_{\boldsymbol{Z}}(f_{\mathsf{KRnet}}(\boldsymbol{x}))\left|\det\nabla_{\boldsymbol{x}}f_{\mathsf{KRnet}}\right|.
\tag{17}
\]

Samples of $p_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$ are given as $\boldsymbol{X}=f^{-1}_{\mathsf{KRnet}}(\boldsymbol{Z})$ by sampling $\boldsymbol{Z}$. A common choice for the distribution of $\boldsymbol{Z}$ is the standard Gaussian distribution. Depending on the prior knowledge of the problem, a more general model such as a Gaussian mixture model can also be used as the prior distribution. The KRnet $f_{\mathsf{KRnet}}$ does not have any constraint on the range of the mapped data, meaning that both $\boldsymbol{X}$ and $\boldsymbol{Z}$ are defined on $\mathbb{R}^{d}$. Let $\boldsymbol{Z}$ be Gaussian. Due to the invertibility, $|\det\nabla_{\boldsymbol{x}}f_{\mathsf{KRnet}}|>0$ for any $\boldsymbol{x}\in\mathbb{R}^{d}$, which implies that $p_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})>0$ for any $\boldsymbol{x}\in\mathbb{R}^{d}$. So $p_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$ is not consistent with the distribution induced by $r^{2}(\boldsymbol{x})$, which is equal to zero on $\mathbb{R}^{d}\backslash\Omega$. To deal with this issue, we propose the following strategy.

Without loss of generality, we can assume that $\Omega=[-1/2,1/2]^{d}$. Let $B=(-(1/2+\delta/2),1/2+\delta/2)^{d}$ with $0<\delta<\infty$ such that $\Omega\subset B$. For each dimension of $\boldsymbol{x}$, we define the following logarithmic mapping

\[
y=\ell(x)=\frac{s}{2}\log\frac{2x+(1+\delta)}{(1+\delta)-2x},\quad x=\ell^{-1}(y)=\frac{1+\delta}{2}\,\frac{e^{2y/s}-1}{e^{2y/s}+1},
\]

with $s>0$ being a scale parameter, which defines a one-to-one correspondence between $x\in(-(1/2+\delta/2),1/2+\delta/2)$ and $y\in(-\infty,+\infty)$. Let $\boldsymbol{\ell}(\boldsymbol{x}):B\mapsto\mathbb{R}^{d}$ be a $d$-dimensional mapping such that

\[
\ell_{i}(x_{i})=\ell(x_{i}),\quad i=1,\ldots,d.
\]

Then the following invertible mapping

\[
\boldsymbol{z}=f_{\mathsf{KRnet}}\circ\boldsymbol{\ell}(\boldsymbol{x})
\tag{18}
\]

defines a PDF

\[
\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})=p_{\mathsf{KRnet}}(\boldsymbol{\ell}(\boldsymbol{x});\Theta_{f})|\nabla_{\boldsymbol{x}}\boldsymbol{\ell}(\boldsymbol{x})|,
\tag{19}
\]

where the support of $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$ is $B$.

We now consider a modification of $r^{2}(\boldsymbol{x};\Theta)$. Define a cutoff function as

\[
h(\boldsymbol{x})=\left\{\begin{array}{rl}1,&\boldsymbol{x}\in\Omega,\\ \prod_{i=1}^{d}h_{\delta}(x_{i}),&\boldsymbol{x}\in B\backslash\Omega,\end{array}\right.
\]

where $h_{\delta}(x)$ is a piecewise linear function

\[
h_{\delta}(x)=\left\{\begin{array}{rl}1,&x\in[-1/2,1/2],\\ \delta^{-1}(2x+1+\delta),&x\in(-1/2-\delta/2,-1/2),\\ \delta^{-1}(1+\delta-2x),&x\in(1/2,1/2+\delta/2),\\ 0,&x\in(-\infty,-1/2-\delta/2]\cup[1/2+\delta/2,\infty).\end{array}\right.
\]

We consider a modified PDF for any $\Theta$

\[
\hat{r}_{\boldsymbol{X}}(\boldsymbol{x})\propto r^{2}(\boldsymbol{x};\Theta)h(\boldsymbol{x}).
\tag{20}
\]

Note that both $\hat{r}_{\boldsymbol{X}}(\boldsymbol{x})$ and $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$ have the support $B$. We then solve the following optimization problem

\[
\Theta_{f}^{*}=\arg\min_{\Theta_{f}}D_{\mathsf{KL}}(\hat{r}_{\boldsymbol{X}}(\boldsymbol{x})\|\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})),
\tag{21}
\]

where $D_{\mathsf{KL}}(\cdot\|\cdot)$ indicates the Kullback-Leibler (KL) divergence between two distributions. We finally use

\[
p_{\boldsymbol{X}}(\boldsymbol{x})\propto\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*})1_{\Omega}(\boldsymbol{x})
\tag{22}
\]

as an approximation of the PDF induced by $r^{2}(\boldsymbol{x};\Theta)$. If $\delta\ll 1$, most of the samples $\boldsymbol{\ell}^{-1}\circ f^{-1}_{\mathsf{KRnet}}(\boldsymbol{z}^{(i)};\Theta_{f}^{*})$ will be located in $\Omega$. Since $\hat{r}_{\boldsymbol{X}}(\boldsymbol{x})\propto r^{2}(\boldsymbol{x};\Theta)$ on $\Omega$, $p_{\boldsymbol{X}}(\boldsymbol{x})$ approximates the $r^{2}(\boldsymbol{x};\Theta)$-induced PDF well when $\delta$ is small. In our numerical experiments, we set $\delta=0.01$ and $s=2$.
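A minimal numpy sketch of the entrywise map $\ell$, its inverse, and the cutoff $h_{\delta}$, using the values $\delta=0.01$ and $s=2$ from the experiments; the vectorized form of $h_{\delta}$ via clipping is an implementation choice, not prescribed by the paper.

```python
import numpy as np

delta, s = 0.01, 2.0

def ell(x):                 # B -> R^d, applied entrywise
    return 0.5 * s * np.log((2 * x + (1 + delta)) / ((1 + delta) - 2 * x))

def ell_inv(y):             # R^d -> B, applied entrywise
    t = np.exp(2 * y / s)
    return 0.5 * (1 + delta) * (t - 1) / (t + 1)

def h_delta(x):             # piecewise linear cutoff, entrywise
    return np.clip(np.minimum(2 * x + 1 + delta, 1 + delta - 2 * x) / delta, 0.0, 1.0)

x = np.random.default_rng(0).uniform(-0.5, 0.5, size=(4, 3))
assert np.allclose(ell_inv(ell(x)), x)                          # ell is invertible on B
print(h_delta(np.array([0.0, 0.5, 0.5 + delta / 4, 0.5 + delta / 2])))   # [1., 1., 0.5, 0.]
```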

We now look at the approximation of $\Theta_{f}^{*}$. The KL divergence in the optimization problem (21) can be written as

\[
D_{\mathsf{KL}}(\hat{r}_{\boldsymbol{X}}(\boldsymbol{x})\|\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}))=\int_{B}\hat{r}_{\boldsymbol{X}}\log\hat{r}_{\boldsymbol{X}}d\boldsymbol{x}-\int_{B}\hat{r}_{\boldsymbol{X}}\log\hat{p}_{\mathsf{KRnet}}d\boldsymbol{x}.
\tag{23}
\]

The first term on the right-hand side corresponds to the differential entropy of $\hat{r}_{\boldsymbol{X}}$, which does not affect the optimization with respect to $\Theta_{f}$. So minimizing the KL divergence is equivalent to minimizing the cross entropy between $\hat{r}_{\boldsymbol{X}}$ and $\hat{p}_{\mathsf{KRnet}}$ [50, 51]:

\[
H(\hat{r}_{\boldsymbol{X}},\hat{p}_{\mathsf{KRnet}})=-\int_{B}\hat{r}_{\boldsymbol{X}}\log\hat{p}_{\mathsf{KRnet}}d\boldsymbol{x}.
\tag{24}
\]

Since the samples from $\hat{r}_{\boldsymbol{X}}$ are not available, we approximate the cross entropy using the importance sampling technique:

\[
H(\hat{r}_{\boldsymbol{X}},\hat{p}_{\mathsf{KRnet}})\approx-\frac{1}{N_{r}}\sum_{i=1}^{N_{r}}\frac{\hat{r}_{\boldsymbol{X}}(\boldsymbol{x}_{B}^{(i)})}{\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{B}^{(i)};\hat{\Theta}_{f})}\log\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{B}^{(i)};\Theta_{f}),
\tag{25}
\]

where $\hat{p}(\boldsymbol{x};\hat{\Theta}_{f})$ is a PDF model with known parameter $\hat{\Theta}_{f}$ and its samples $\{\boldsymbol{x}_{B}^{(i)}\}_{i=1}^{N_{r}}$ can be generated efficiently as

\[
\boldsymbol{x}_{B}^{(i)}=\boldsymbol{\ell}^{-1}\circ f^{-1}_{\mathsf{KRnet}}(\boldsymbol{z}^{(i)};\hat{\Theta}_{f})
\tag{26}
\]

with $\boldsymbol{z}^{(i)}$ being sampled from the prior distribution. We then minimize the discretized cross entropy (25) to obtain an approximation of $\Theta_{f}^{*}$. The choice for $\hat{\Theta}_{f}$ will be specified in section 4.3 when the adaptive sampling method is defined.
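One stochastic-gradient step on the discretized cross entropy (25) might look as follows; `flow`, `flow_prev` and `r2h` are hypothetical objects standing for the current KRnet model, a frozen copy with parameters $\hat{\Theta}_{f}$, and the function $r^{2}(\boldsymbol{x};\Theta)h(\boldsymbol{x})$ with $\Theta$ fixed.

```python
import torch

# Sketch of one update on the Monte Carlo cross entropy (25). The `flow` objects are
# assumed to expose log_prob(x) and sample(n) for the composite map (18); they are
# hypothetical placeholders, not an actual KRnet API.
def cross_entropy_step(flow, flow_prev, r2h, optimizer, n_samples=500):
    with torch.no_grad():
        x = flow_prev.sample(n_samples)                  # x_B^{(i)} from the previous model
        w = r2h(x) / flow_prev.log_prob(x).exp()         # importance weights r_hat / p_hat_old
    loss = -(w * flow.log_prob(x)).mean()                # discretized cross entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```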

Remark 1.

An alternative approach for the approximation of $\hat{r}_{\boldsymbol{X}}(\boldsymbol{x})$ is to minimize the following KL divergence:

\[
D_{\mathsf{KL}}(\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})\|\hat{r}_{\boldsymbol{X}}(\boldsymbol{x}))=\int_{B}\hat{p}_{\mathsf{KRnet}}\log\hat{p}_{\mathsf{KRnet}}d\boldsymbol{x}-\int_{B}\hat{p}_{\mathsf{KRnet}}\log\hat{r}_{\boldsymbol{X}}d\boldsymbol{x},
\tag{27}
\]

which can be approximated by samples from $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$. Note that the KL divergence is asymmetric. Minimizing the KL divergence (23) is not equivalent to the minimization of the KL divergence (27), although both minimizers will be achieved at $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})=\hat{r}_{\boldsymbol{X}}(\boldsymbol{x})$ if $\hat{r}_{\boldsymbol{X}}$ can be reached exactly by a certain parameter $\Theta_{f}$.

4.3 Adaptive sampling procedure

We are now ready to present our algorithms. In this work, we mainly focus on the adaptivity of $\mathsf{S}_{\Omega}$ for simplicity. The key step of our adaptivity strategy is to improve the effectiveness of the random samples in the training set $\mathsf{S}_{\Omega}$, and we provide two algorithms corresponding to the two options discussed in section 4.1.

  I. We replace all the collocation points in the current training set using new samples from the probability measure for importance sampling. This corresponds to equation (11).

  II. We gradually add more collocation points to the current training set. This corresponds to equation (14), where the new samples are mainly from the region of large residual.

We first present strategy I. Let $\mathsf{S}_{\Omega,0}=\{\boldsymbol{x}_{\Omega,0}^{(i)}\}_{i=1}^{N_{r}}$ and $\mathsf{S}_{\partial\Omega,0}$ be two sets of collocation points that are uniformly sampled from $\Omega$ and $\partial\Omega$ respectively. Using $\mathsf{S}_{\Omega,0}$ and $\mathsf{S}_{\partial\Omega,0}$, we minimize the empirical loss (3) to obtain $u(\boldsymbol{x};\Theta_{N}^{*,(1)})$. With $u(\boldsymbol{x};\Theta_{N}^{*,(1)})$, we minimize the cross entropy (25) to get $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(1)})$, where we simply use uniform samples for importance sampling. To refine the training set, a new set $\mathsf{S}_{\Omega,1}=\{\boldsymbol{x}_{\Omega,1}^{(i)}\}_{i=1}^{N_{r}}$ is generated by $\boldsymbol{\ell}^{-1}\circ f^{-1}_{\mathsf{KRnet}}(\boldsymbol{z}^{(i)};\Theta_{f}^{*,(1)})$ (see equation (18)). Then we continue to update the approximate solution $u(\boldsymbol{x};\Theta_{N}^{*,(1)})$ using $\mathsf{S}_{\Omega,1}$ as the training set. In general, we use $\mathsf{S}_{\Omega,k}=\{\boldsymbol{x}_{\Omega,k}^{(i)}\}_{i=1}^{N_{r}}$ to obtain $u(\boldsymbol{x};\Theta_{N}^{*,(k+1)})$ as

\[
\Theta_{N}^{*,(k+1)}=\arg\min_{\Theta}J^{\mathsf{IS}}_{N}(u(\boldsymbol{x};\Theta)),
\]

where $u(\boldsymbol{x};\Theta)$ is initialized as $u(\boldsymbol{x};\Theta_{N}^{*,(k)})$ and $J^{\mathsf{IS}}_{N}$ is defined as

\[
J^{\mathsf{IS}}_{N}(u(\boldsymbol{x};\Theta))=\frac{1}{N_{r}}\sum\limits_{i=1}^{N_{r}}\frac{r^{2}(\boldsymbol{x}_{\Omega,k}^{(i)};\Theta)}{\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{\Omega,k}^{(i)};\Theta_{f}^{*,(k)})}+\frac{1}{N_{b}}\sum\limits_{i=1}^{N_{b}}b^{2}(\boldsymbol{x}_{\partial\Omega,k}^{(i)};\Theta).
\tag{28}
\]

Starting with $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k)})$, the density model $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$ is updated as

\[
\Theta_{f}^{*,(k+1)}=\arg\min_{\Theta_{f}}-\frac{1}{N_{r}}\sum_{i=1}^{N_{r}}\frac{r^{2}(\boldsymbol{x}_{B,k}^{(i)};\Theta_{N}^{*,(k+1)})h(\boldsymbol{x}_{B,k}^{(i)})}{\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{B,k}^{(i)};\Theta_{f}^{*,(k)})}\log\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{B,k}^{(i)};\Theta_{f}),
\tag{29}
\]

where we let $\hat{\Theta}_{f}=\Theta_{f}^{*,(k)}$ in equation (25), i.e., the previous PDF model $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k)})$ is used for importance sampling when computing the cross entropy. A new set $\mathsf{S}_{\Omega,k+1}=\{\boldsymbol{x}_{\Omega,k+1}^{(i)}\}_{i=1}^{N_{r}}$ of collocation points is then generated. As detailed in section 4.2, the support of data points generated by KRnet is $B=(-(1/2+\delta/2),1/2+\delta/2)^{d}$, while the computation domain is $\Omega=[-1/2,1/2]^{d}$. So we need to deal with the collocation points located in $B\backslash\Omega$. Instead of neglecting these points, we project them onto $\partial\Omega$. We define an entry-wise projection operator $\mathcal{P}(\boldsymbol{x}):B\mapsto\Omega$ as

\[
\mathcal{P}(x_{i})=\begin{cases}-1/2,&\mathrm{if}\ x_{i}<-1/2,\\ x_{i},&\mathrm{if}\ -1/2\leq x_{i}\leq 1/2,\\ 1/2,&\mathrm{if}\ x_{i}>1/2,\end{cases}\quad i=1,\ldots,d.
\tag{30}
\]

For a sequence of i.i.d. samples $\boldsymbol{z}^{(j)}$, $j=1,2,\ldots$, generated from the standard Gaussian, we compute $\boldsymbol{x}_{B}^{(j)}=\boldsymbol{\ell}^{-1}\circ f^{-1}_{\mathsf{KRnet}}(\boldsymbol{z}^{(j)})$. If $\boldsymbol{x}_{B}^{(j)}=\mathcal{P}(\boldsymbol{x}_{B}^{(j)})$, we assign $\boldsymbol{x}_{B}^{(j)}$ to $\mathsf{S}_{\Omega,k+1}$; otherwise, we add $\mathcal{P}(\boldsymbol{x}^{(j)}_{B})$ to $\mathsf{S}_{\partial\Omega,k+1}$. The updated training sets $\mathsf{S}_{\Omega,k+1}$ and $\mathsf{S}_{\partial\Omega,k+1}$ will be used for the next training stage. This procedure is repeated until the stopping criterion is satisfied (see Algorithm 1). Since the collocation points in $\mathsf{S}_{\Omega,k}$ will be completely replaced at the next stage, we call this type of deep adaptive sampling strategy DAS-R for short. The alternative approach given in Remark 1 can also be used to obtain $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k)})$.
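The refinement step of DAS-R can be sketched as follows, where `flow_inverse` and `ell_inv` are hypothetical callables for $f^{-1}_{\mathsf{KRnet}}$ and $\boldsymbol{\ell}^{-1}$; points left unchanged by the projection (30) are kept as interior collocation points, and the rest are projected onto the boundary.

```python
import numpy as np

def refine_training_set(flow_inverse, ell_inv, n_samples, d, rng):
    # Draw z from the standard Gaussian prior and map it back to B via (26).
    z = rng.standard_normal((n_samples, d))
    x_B = ell_inv(flow_inverse(z))                     # candidate points in B
    x_proj = np.clip(x_B, -0.5, 0.5)                   # entrywise projection P(x) onto Omega
    interior = np.all(x_B == x_proj, axis=1)           # x_B already inside Omega
    # Interior points refine S_Omega; projected points are added to the boundary set.
    return x_proj[interior], x_proj[~interior]
```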

We now look at strategy II. Unlike DAS-R, the number of collocation points in the training set $\mathsf{S}_{\Omega}$ increases gradually, so we denote this type of deep adaptive sampling strategy by DAS-G for short. Starting with an initial set of collocation points $\mathsf{S}_{\Omega,0}=\{\boldsymbol{x}_{\Omega}^{(i)}\}_{i=1}^{n_{r}}$ (as well as $\mathsf{S}_{\partial\Omega,0}$) drawn from a uniform distribution defined on $\Omega$, we minimize the empirical loss (3) on the training set $\mathsf{S}_{\Omega,0}$ (as well as $\mathsf{S}_{\partial\Omega,0}$) to obtain $u(\boldsymbol{x};\Theta_{N}^{*,(1)})$. Once we have $u(\boldsymbol{x};\Theta_{N}^{*,(1)})$ in hand, we can seek $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(1)})$ using the residual $r^{2}(\boldsymbol{x};\Theta_{N}^{*,(1)})$. Here we use uniform samples to approximate and minimize the cross entropy (25). Similar to DAS-R, a new set of collocation points $\mathsf{S}^{g}_{\Omega,1}=\{\boldsymbol{x}_{\Omega,1}^{g,(i)}\}_{i=1}^{n_{r}}$ is generated by $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(1)})$; the main difference is that we update the training set as $\mathsf{S}_{\Omega,1}=\mathsf{S}_{\Omega,0}\cup\mathsf{S}^{g}_{\Omega,1}$, in other words, $\mathsf{S}_{\Omega,0}$ is augmented rather than replaced by $\mathsf{S}_{\Omega,1}^{g}$. We continue to update $u(\boldsymbol{x};\Theta)$ using $\Theta_{N}^{*,(1)}$ as the initial parameters and $\mathsf{S}_{\Omega,1}$ as the training set, which yields a refined model $u(\boldsymbol{x};\Theta_{N}^{*,(2)})$. Starting from $k=2$, we seek $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k)})$ using the approach given in Remark 1. We repeat the procedure to obtain an adaptive algorithm (see Algorithm 2).

Our two adaptive training methods are summarized in Algorithm 1 (DAS-R) and Algorithm 2 (DAS-G), where $N_{\rm adaptive}$ is a given maximum number of adaptivity iterations, $m$ is the batch size for the stochastic gradient, and $N_{e}$ is the number of epochs for training $u(\boldsymbol{x};\Theta)$ and $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$. The algorithms consist of three steps: solving the PDE, training KRnet, and refining the training set.

Remark 2.

In both strategies, we use uniform samples to approximate and minimize the cross entropy to obtain $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(1)})$, after which either equation (25) or equation (27) can be employed. The main reason for doing this is to use uniform samples to capture the modes when the residual-induced distribution is multimodal. The obtained PDF $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(1)})$ provides a good initialization when equation (27) is employed to seek $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(i)})$ with $i>1$. This choice is usually not necessary if the modes are not strongly localized or if the location of the modes can be encoded into the prior distribution of KRnet through a Gaussian mixture model.

Algorithm 1 DAS-R for PDEs
Require: Initial $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{(0)})$, $u(\boldsymbol{x};\Theta^{(0)})$, maximum epoch number $N_{e}$, batch size $m$, initial training set $\mathsf{S}_{\Omega,0}=\{\boldsymbol{x}_{\Omega,0}^{(i)}\}_{i=1}^{N_{r}}$ and $\mathsf{S}_{\partial\Omega,0}=\{\boldsymbol{x}_{\partial\Omega,0}^{(i)}\}_{i=1}^{N_{b}}$.
1:  for $k=0:N_{\rm adaptive}-1$ do
2:     // Solve PDE
3:     for $i=1:N_{e}$ do
4:        for $j$ steps do
5:           Sample $m$ samples from $\mathsf{S}_{\Omega,k}$.
6:           Sample $m$ samples from $\mathsf{S}_{\partial\Omega,k}$.
7:           Update $u(\boldsymbol{x};\Theta)$ by descending the stochastic gradient of $J^{\mathsf{IS}}_{N}(u(\boldsymbol{x};\Theta))$ (see equation (28)).
8:        end for
9:     end for
10:     // Train KRnet
11:     for $i=1:N_{e}$ do
12:        for $j$ steps do
13:           Sample $m$ samples from $\mathsf{S}_{\Omega,k}$.
14:           Update $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$ by descending the stochastic gradient of $H(\hat{r}_{\boldsymbol{X}},\hat{p}_{\mathsf{KRnet}})$ (see equation (25)).
15:        end for
16:     end for
17:     // Refine training set
18:     Generate $\mathsf{S}_{\Omega,k+1}\subset\Omega$ through $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k+1)})$.
19:  end for
Output: $u(\boldsymbol{x};\Theta_{N}^{*})$

Algorithm 2 DAS-G for PDEs
Require: Initial $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{(0)})$, $u(\boldsymbol{x};\Theta^{(0)})$, maximum epoch number $N_{e}$, batch size $m$, initial training set $\mathsf{S}_{\Omega,0}=\{\boldsymbol{x}_{\Omega,0}^{(i)}\}_{i=1}^{n_{r}}$ and $\mathsf{S}_{\partial\Omega,0}=\{\boldsymbol{x}_{\partial\Omega,0}^{(i)}\}_{i=1}^{N_{b}}$.
1:  for $k=0:N_{\rm adaptive}-1$ do
2:     // Solve PDE
3:     for $i=1:N_{e}$ do
4:        for $j$ steps do
5:           Sample $m$ samples from $\mathsf{S}_{\Omega,k}$.
6:           Sample $m$ samples from $\mathsf{S}_{\partial\Omega,k}$.
7:           Update $u(\boldsymbol{x};\Theta)$ by descending the stochastic gradient of $J_{N}(u(\boldsymbol{x};\Theta))$ (see equation (3)).
8:        end for
9:     end for
10:     // Train KRnet
11:     for $i=1:N_{e}$ do
12:        for $j$ steps do
13:           Sample $m$ samples from $\mathsf{S}_{\Omega,k}$.
14:           Update $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$ by descending the stochastic gradient of $H(\hat{r}_{\boldsymbol{X}},\hat{p}_{\mathsf{KRnet}})$ (see equation (25)).
15:        end for
16:     end for
17:     // Refine training set
18:     Generate $\mathsf{S}^{g}_{\Omega,k+1}\subset\Omega$ with size $n_{r}$ through $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k+1)})$.
19:     $\mathsf{S}_{\Omega,k+1}=\mathsf{S}_{\Omega,k}\cup\mathsf{S}^{g}_{\Omega,k+1}$.
20:  end for
Output: $u(\boldsymbol{x};\Theta_{N}^{*})$
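Stripped of the inner training loops, the outer structure of Algorithm 2 can be summarized by the following sketch; `train_pinn`, `train_krnet` and `sample_krnet` are hypothetical helpers corresponding to the three steps of the algorithm.

```python
import numpy as np

# High-level sketch of the outer loop of Algorithm 2 (DAS-G). The callables
# train_pinn, train_krnet and sample_krnet stand for minimizing J_N on the
# current training set, minimizing the cross entropy (25), and drawing n_r
# new collocation points from p_hat_KRnet, respectively.
def das_g(u_net, krnet, s_omega, s_boundary, n_adaptive, n_r,
          train_pinn, train_krnet, sample_krnet):
    for k in range(n_adaptive):
        u_net = train_pinn(u_net, s_omega, s_boundary)      # solve PDE
        krnet = train_krnet(krnet, u_net, s_omega)          # fit the residual-induced density
        s_new = sample_krnet(krnet, n_r)                    # new points in regions of large residual
        s_omega = np.concatenate([s_omega, s_new], axis=0)  # augment (DAS-G), not replace (DAS-R)
    return u_net
```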

5 Analysis of DAS

As discussed in section 4.3, the key point of our DAS method is to achieve variance reduction for the discretization of the residual loss, based on which we expect to improve the accuracy of the approximate solution. Under certain conditions, we show that the expectation of the error bound becomes smaller when the adaptive sampling strategy is employed.

Assumption 1.

[35] In problem (1), we let $F=\mathscr{H}$ be a Hilbert space and $\mathcal{L}$ a linear operator. Assume that the differential operator $\mathcal{L}$ and the boundary operator $\mathfrak{b}$ satisfy

\[
C_{1}\left\|v\right\|_{2,\Omega}\leq\left\|\mathcal{L}v\right\|_{2,\Omega}+\left\|\mathfrak{b}v\right\|_{2,\partial\Omega}\leq C_{2}\left\|v\right\|_{2,\Omega}\quad\forall v\in\mathscr{H},
\tag{31}
\]

where $\mathscr{H}$ is a Hilbert space defined on $\Omega$ and the positive constants $C_{1}$ and $C_{2}$ are independent of $v$.

The above condition is called the stability bound [35], which is essential to the existence and uniqueness of problem (1). In addition to this assumption, the following two assumptions on the relationship between $r(\boldsymbol{x};\Theta_{N}^{*,(k)})$ and $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k)})$ are made.

Assumption 2.

Assume that $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k)})$ is the optimal candidate for the change of measure in equation (11),

\[
\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k)})=c_{k}r^{2}(\boldsymbol{x};\Theta_{N}^{*,(k)}),
\tag{32}
\]

where $c_{k}=1/\int_{\Omega}r^{2}(\boldsymbol{x};\Theta_{N}^{*,(k)})d\boldsymbol{x}$ is the normalization constant.

Assumption 3.

Let

\[
R_{k}=\frac{1}{N_{r}}\sum\limits_{i=1}^{N_{r}}\frac{r^{2}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{N}^{*,(k)})}{\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{f}^{*,(k-1)})}
\]

be the discrete residual loss at the $k$-th stage, where each $\boldsymbol{x}_{\Omega}^{(i)}$ is drawn from $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k-1)})$. Assume that $r^{2}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{N}^{*,(k)})/\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{f}^{*,(k-1)})\in[\tau_{1},\tau_{2}]$ almost surely for each $i=1,2,\ldots,N_{r}$.

At each adaptivity stage, the error of the approximate solution is estimated as follows.

Theorem 1.

Let $u(\boldsymbol{x};\Theta_{N}^{*,(k)})\in F$ be a solution of (3) where the collocation points are independently drawn from $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k-1)})$. Suppose that Assumption 1 and Assumption 3 are satisfied. Given $0<\varepsilon<1$, the following error estimate holds

\[
\left\|u(\boldsymbol{x};\Theta_{N}^{*,(k)})-u(\boldsymbol{x})\right\|_{2,\Omega}\leq\sqrt{2}\,C_{1}^{-1}\left(R_{k}+\varepsilon+\left\|b(\boldsymbol{x};\Theta_{N}^{*,(k)})\right\|_{2,\partial\Omega}^{2}\right)^{\frac{1}{2}}
\]

with probability at least $1-\exp(-2N_{r}\varepsilon^{2}/(\tau_{2}-\tau_{1})^{2})$; this probability follows from Hoeffding's inequality applied to the bounded ratios in Assumption 3.

The approximate solutions at two adjacent adaptivity stages satisfy:

Corollary 1.

Under the same conditions of Theorem 1, suppose that Assumption 2 is satisfied and the boundary loss $J_{b}(u)$ is zero, then the following inequality holds

\[
\mathbb{E}(R_{k+1})\leq\mathbb{E}(R_{k}).
\]

Theorem 1 provides an error estimate for the approximate solution similar to the results in [40]. Corollary 1 outlines the error behavior of a sequence of approximate solutions induced by adaptivity. However, it is not quite straightforward to quantify the decay of the error due to adaptive refinement. For example, for DAS-R the reduction in variance is up to a tail probability as shown in Lemma 2 and for DAS-G the loss is changed at each adaptivity stage due to the modification of the training set. These issues are left for future study.

6 Numerical experiments

In this section, we conduct numerical experiments to demonstrate the effectiveness of the proposed DAS method, including two low-dimensional and low regularity test problems, one high-dimensional linear test problem, and one high-dimensional nonlinear test problem. Due to the curse of dimensionality, data are sparse in high-dimensional spaces [20, 19, 21], which implies that effective samples should be able to deal with localized information. We mainly use the low-dimensional and low regularity test problems to demonstrate that the sampling strategy significantly affects the performance of the neural network approximation if the residual is strongly localized. For comparison, we also test the performance of a heuristic method based on residual refinement [52, 42] (see section 6.2 and section 6.3) for high-dimensional problems, where the heuristic method searches uniform samples to find the points with large residuals and adds them to the current training set. The heuristic method is similar to DAS-G, where the main difference is that the selection of new samples in DAS-G is completely guided by an optimization problem while the heuristic method relies partially on the user's intuition. All deep neural network models are trained by the ADAM method [34]. The penalty parameter in equation (3) is set to $\hat{\gamma}=1$. The activation function of $u(\boldsymbol{x};\Theta)$ is set to the hyperbolic tangent function. The activation function of KRnet is the rectified linear unit (ReLU) function since we only use KRnet for density approximation.

6.1 Low-dimensional and low regularity test problems

In this part, two-dimensional low regularity problems are considered, where the solution of the first one has a peak and the solution of the second one has two peaks.

6.1.1 Two-dimensional peak problem

The following elliptic equation is considered

-\Delta u(x_{1},x_{2})=s(x_{1},x_{2})\quad\text{in}\ \Omega, (33)
u(x_{1},x_{2})=g(x_{1},x_{2})\quad\text{on}\ \partial\Omega,

where the computational domain is \Omega=[-1,1]^{2}. In order to quantify the error, we use the following reference solution

u(x_{1},x_{2})=\mathrm{exp}\left(-1000[(x_{1}-r_{c})^{2}+(x_{2}-r_{c})^{2}]\right),

which has a peak at the point (r_{c},r_{c}) and decreases rapidly away from (r_{c},r_{c}). This test problem is often used to test the performance of adaptive finite element methods [53, 24].

We choose a six-layer fully connected neural network u(\boldsymbol{x};\Theta) with 32 neurons to approximate the solution. For KRnet, we take L=6 affine coupling layers, and two fully connected layers with 24 neurons for each affine coupling layer. The number of epochs for training both u(\boldsymbol{x};\Theta) and p(\boldsymbol{x};\Theta_{f}) is set to N_{e}=3000. The learning rate for the ADAM optimizer is set to 0.0001, and the batch size is set to m=500. Here, we set (r_{c},r_{c})=(0.5,0.5). To assess the effectiveness of our DAS methods, we generate a uniform meshgrid of size 256\times 256 in [-1,1]^{2} and compute the mean square error on these grid points.
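The error metric can be sketched as follows (an illustration of our own; u_nn stands for the trained network evaluated on NumPy arrays): the exact peak solution is evaluated on the uniform 256\times 256 grid and compared with the surrogate in the mean square sense.

import numpy as np

def peak_exact(x, rc=0.5):
    # reference solution with a peak at (rc, rc)
    return np.exp(-1000.0 * ((x[:, 0] - rc) ** 2 + (x[:, 1] - rc) ** 2))

def grid_mse(u_nn, n=256):
    t = np.linspace(-1.0, 1.0, n)
    xx, yy = np.meshgrid(t, t, indexing="ij")
    pts = np.column_stack([xx.ravel(), yy.ravel()])
    return np.mean((u_nn(pts) - peak_exact(pts)) ** 2)

# e.g. the trivial zero surrogate:
print(grid_mse(lambda x: np.zeros(x.shape[0])))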

In Figure 1, we plot the approximation errors given by different sampling strategies with respect to the sample size in the left plot and with respect to the number of epochs in the right plot. For each |\mathsf{S}_{\Omega}|, we take three runs with different random seeds for initialization and compute the mean error of the three runs as the final error. For the DAS strategies, the numbers of adaptivity iterations are set to N_{\rm{adaptive}}=4,6,8,10 for |\mathsf{S}_{\Omega}|=2\times 10^{3},3\times 10^{3},4\times 10^{3},5\times 10^{3} respectively, and n_{r}=500 is set for the DAS-G strategy (see section 4.3). For the uniform sampling strategy, the number of epochs is set to be the same as the total number of epochs of each DAS method. It is clear that for this test problem the DAS methods (DAS-G and DAS-R) have a better performance than the uniform sampling strategy and DAS-R performs better than DAS-G. For the same sample size, both DAS-R and DAS-G yield a smaller error than the uniform sampling method. In terms of the number of epochs, the errors of DAS-R and DAS-G decay in a more consistent way than those of the uniform sampling method.
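The adaptivity loop behind these results can be summarized by the following schematic Python sketch (pseudocode-like; density_model stands for KRnet, and its train/sample methods are hypothetical names rather than an actual API). In DAS-R the training set is regenerated from the density model at each stage, whereas DAS-G would instead augment the current set with n_{r} new samples.

import numpy as np

def das_r(u_net, density_model, train_pde, residual_sq, n_samples, n_adaptive):
    # stage 0: uniform collocation points on Omega = [-1, 1]^2
    S = np.random.uniform(-1.0, 1.0, size=(n_samples, 2))
    for k in range(n_adaptive):
        train_pde(u_net, S)                                   # minimize the residual loss on S
        density_model.train(lambda x: residual_sq(u_net, x))  # fit the residual-induced density
        S = density_model.sample(n_samples)                   # DAS-R: replace S with new samples
    return u_net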

In Figure 2 we compare the exact solution, the DAS solutions given by 5\times 10^{3} nonuniform samples and the approximate solution given by 5\times 10^{3} uniform samples. It is seen that the DAS methods are much more effective than the uniform sampling method at capturing the information in the region of low regularity. Figure 3 shows the error evolution of DAS-R at different adaptivity iteration steps. It is seen that the approximation error drops as the adaptivity iteration step k increases, which is consistent with Corollary 1, and the relaxation time for the optimization iterations decreases as well. Figure 4 shows the evolution of the training sets (|\mathsf{S}_{\Omega}|=5\times 10^{3}) of the DAS-R method with respect to adaptivity iterations k=1,4,7,9, where the initial training set \mathsf{S}_{\Omega,0} consists of uniform collocation points on \Omega (see section 4.3). It is seen that the largest density of \mathsf{S}_{\Omega,1} for DAS-R is around [0.5,0.5] since \mathsf{S}_{\Omega,k} is consistent with the residual-induced distribution. However, we also expect that the tail of the residual-induced distribution becomes heavier as k increases, since the adaptivity tries to make the residual-induced distribution more uniform, which is illustrated by \mathsf{S}_{\Omega,4}, \mathsf{S}_{\Omega,7} and \mathsf{S}_{\Omega,9}.

Figure 1: Approximation errors for the two-dimensional peak test problem. Left: The error w.r.t. sample size |\mathsf{S}_{\Omega}|; Right: The error w.r.t. epoch for |\mathsf{S}_{\Omega}|=5\times 10^{3}.
(a) The exact solution.
(b) DAS-R approximation.
(c) DAS-G approximation.
(d) Approximation using the uniform sampling strategy.
Figure 2: Solutions, two-dimensional peak test problem.
Figure 3: The errors of DAS-R at certain adaptivity iteration steps for the two-dimensional peak test problem. |\mathsf{S}_{\Omega}|=5\times 10^{3}.
Figure 4: The evolution of \mathsf{S}_{\Omega,k} in DAS-R for the two-dimensional peak test problem.

6.1.2 Two-dimensional test problem with two peaks

In this test problem, we consider the following equation

-\nabla\cdot\left[u(x_{1},x_{2})\nabla(x_{1}^{2}+x_{2}^{2})\right]+\nabla^{2}u(x_{1},x_{2})=s(x_{1},x_{2})\quad\text{in}\ \Omega, (34)
u(x_{1},x_{2})=g(x_{1},x_{2})\quad\text{on}\ \partial\Omega,

where the computational domain is \Omega=[-1,1]^{2}. The exact solution of (34) is chosen as

u(x_{1},x_{2})=\mathrm{e}^{-1000[(x_{1}-0.5)^{2}+(x_{2}-0.5)^{2}]}+\mathrm{e}^{-1000[(x_{1}+0.5)^{2}+(x_{2}+0.5)^{2}]},

which has two peaks, at the points (0.5,0.5) and (-0.5,-0.5). Here, the Dirichlet boundary condition on \partial\Omega is given by the exact solution.
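For completeness, the source term s(x_{1},x_{2}) consistent with this manufactured solution can be generated by automatic differentiation; the following PyTorch sketch is our own construction, only meant to illustrate s=-\nabla\cdot[u\nabla(x_{1}^{2}+x_{2}^{2})]+\nabla^{2}u.

import torch

def u_exact(x):
    # exact two-peak solution used above
    p1 = torch.exp(-1000.0 * ((x[:, 0] - 0.5) ** 2 + (x[:, 1] - 0.5) ** 2))
    p2 = torch.exp(-1000.0 * ((x[:, 0] + 0.5) ** 2 + (x[:, 1] + 0.5) ** 2))
    return (p1 + p2).unsqueeze(-1)

def source_term(x):
    # s = -div( u * grad(x1^2 + x2^2) ) + Laplacian(u), assembled by autograd
    x = x.clone().requires_grad_(True)
    u = u_exact(x)
    flux = u * 2.0 * x                       # u * grad(x1^2 + x2^2) = (2*x1*u, 2*x2*u)
    grad_u = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    div_flux, lap_u = 0.0, 0.0
    for i in range(2):
        div_flux = div_flux + torch.autograd.grad(flux[:, i].sum(), x, create_graph=True)[0][:, i:i + 1]
        lap_u = lap_u + torch.autograd.grad(grad_u[:, i].sum(), x, create_graph=True)[0][:, i:i + 1]
    return (-div_flux + lap_u).detach()

x = 2.0 * torch.rand(5, 2) - 1.0             # a few points in [-1, 1]^2
print(source_term(x).shape)                   # torch.Size([5, 1])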

We choose a six-layer fully connected neural network u(\boldsymbol{x};\Theta) with 64 neurons to approximate the solution of (34). For KRnet, we take L=8 affine coupling layers, and two fully connected layers with 48 neurons for each affine coupling layer. The number of epochs for training both u(\boldsymbol{x};\Theta) and p(\boldsymbol{x};\Theta_{f}) is set to N_{e}=5000. The learning rate for the ADAM optimizer is set to 0.0001, and the batch size is set to m=500. Again, we generate a uniform meshgrid of size 256\times 256 in [-1,1]^{2} and compute the mean square error on these grid points to assess the effectiveness of our DAS methods.

Figure 5 shows the approximation errors for this test problem, where the left plot displays the errors with respect to the sample size |\mathsf{S}_{\Omega}| for different sampling strategies, and the right plot shows the error evolution of DAS-G at different adaptivity iteration steps. For each |\mathsf{S}_{\Omega}|, we again take three runs with different random seeds for initialization and compute the mean error of the three runs as the final error. For the DAS-G strategy, the number of adaptivity iterations is set to N_{\rm{adaptive}}=5 (also for DAS-R), and the numbers of collocation points in \mathsf{S}^{g}_{\Omega,k} (k=1,2,3,4) are set to n_{r}=500,1\times 10^{3},1.5\times 10^{3},2\times 10^{3} for |\mathsf{S}_{\Omega}|=2.5\times 10^{3},5\times 10^{3},7.5\times 10^{3},10^{4} respectively. For the uniform sampling strategy, we train the model with 2.5\times 10^{4} epochs to match the total number of epochs of the DAS methods. From Figure 5, it is seen that for this test problem our DAS methods (DAS-G and DAS-R) have a better performance than the uniform sampling strategy and DAS-G performs better than DAS-R. It is also seen that the error decreases as the adaptivity iteration step k increases.

In Figure 6 we compare the exact solution, the DAS solutions given by 10^{4} nonuniform samples and the approximate solution given by 10^{4} uniform samples. It is seen that the DAS methods are much more effective than the uniform sampling method at capturing the information around the two peaks. Figure 7 shows the evolution of \mathsf{S}^{g}_{\Omega,k} of the DAS-G method with respect to adaptivity iterations k=1,2,3,4 (|\mathsf{S}^{g}_{\Omega,k}|=2\times 10^{3}), where the initial training set \mathsf{S}_{\Omega,0} consists of uniform collocation points on \Omega (see section 4.3). \mathsf{S}_{\Omega,1}^{g} shows that the error profile has two peaks. After the training set is augmented with \mathsf{S}_{\Omega,1}, the error profile becomes flatter, as shown by the distribution of \mathsf{S}_{\Omega,2}^{g}. After the training set is augmented with \mathsf{S}_{\Omega,2}, the largest error is found again around the two peaks, and the subsequent augmentation of the training set yields a flatter error profile. Such a pattern is repeated until no further improvement can be reached.

Figure 5: Approximation errors for the two-dimensional test problem with two peaks. Left: The error w.r.t. sample size |\mathsf{S}_{\Omega}|; Right: The errors of DAS-G at each adaptivity iteration step. |\mathsf{S}_{\Omega}|=10^{4}.
(a) The exact solution.
(b) DAS-R approximation.
(c) DAS-G approximation.
(d) Approximation using the uniform sampling strategy.
Figure 6: Solutions, two-dimensional test problem with two peaks.
Figure 7: The evolution of \mathsf{S}^{g}_{\Omega,k} in DAS-G for the two-dimensional test problem with two peaks.

6.2 High-dimensional linear test problems

Next we consider the d-dimensional elliptic equation

-\Delta u(\boldsymbol{x})=s(\boldsymbol{x}),\quad\boldsymbol{x}\ \text{in}\ \Omega=[-1,1]^{d}, (35)

with an exact solution

u(\boldsymbol{x})=\mathrm{e}^{-10\left\|\boldsymbol{x}\right\|_{2}^{2}},

where the Dirichlet boundary condition on \partial\Omega is given by the exact solution. We are interested in cases with a large d>3. Note that the geometric properties of high-dimensional spaces are significantly different from our intuitions on low-dimensional ones, e.g., most of the volume of a high-dimensional cube is located around its corners [20, 19, 21]. If we use uniform samples to generate \mathsf{S}_{\Omega}, most of the collocation points in \mathsf{S}_{\Omega} are near the surface of the hypercube. Since the information of the exact solution is mainly from the neighborhood of the origin, most of the samples in \mathsf{S}_{\Omega} may not contribute to training the neural network when d is large enough.
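This concentration effect is easy to check numerically; the short script below (an illustration only, not part of our experiments) estimates how rarely uniform samples on [-1,1]^{d} fall into the small box [-0.1,0.1]^{d} where the peak of the exact solution is located.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 4, 6, 8, 10):
    x = rng.uniform(-1.0, 1.0, size=(10**6, d))
    frac = np.mean(np.all(np.abs(x) <= 0.1, axis=1))
    print(f"d = {d:2d}: empirical fraction = {frac:.2e}, exact = {0.1**d:.2e}")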

We choose a six-layer fully connected neural network u(\boldsymbol{x};\Theta) with 64 neurons to approximate the solution. For KRnet, we set K=3 and take L=6 affine coupling layers, and two fully connected layers with 64 neurons for each affine coupling layer. The number of epochs for training both u(\boldsymbol{x};\Theta) and p(\boldsymbol{x};\Theta_{f}) is set to N_{e}=3000. The learning rate for the ADAM optimizer is set to 0.0001, and the batch size is set to m=5000. The number of adaptivity iterations is set to N_{\rm{adaptive}}=5. To measure the quality of approximation, we generate a tensor grid with n_{t}^{d} points around the origin (in [-0.1,0.1]^{d}), where n_{t} is the number of nodes for each dimension. We define the relative error

\text{Relative error}=\frac{\left\|\boldsymbol{u}_{\mathsf{NN}}-\boldsymbol{u}\right\|_{2}}{\left\|\boldsymbol{u}\right\|_{2}},

where \boldsymbol{u}_{\mathsf{NN}} and \boldsymbol{u} denote two vectors whose elements are the function values of u(\boldsymbol{x};\Theta) and u(\boldsymbol{x}) at the tensor grid respectively.
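A sketch of this evaluation (our own illustration; u_nn is a placeholder for the trained network) assembles the tensor grid in [-0.1,0.1]^{d} and computes the relative error against the exact solution of (35).

import numpy as np

def relative_error(u_nn, d=4, nt=16):
    # tensor grid with nt^d points in [-0.1, 0.1]^d
    axes = [np.linspace(-0.1, 0.1, nt)] * d
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, d)
    u_ref = np.exp(-10.0 * np.sum(grid**2, axis=1))   # exact solution of (35)
    return np.linalg.norm(u_nn(grid) - u_ref) / np.linalg.norm(u_ref)

# e.g. the trivial zero surrogate gives relative error 1:
print(relative_error(lambda x: np.zeros(x.shape[0]), d=4, nt=6))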

We first investigate the relation between the error and the dimensionality d when the uniform sampling strategy is employed. Figure 8(a) shows the relative errors in terms of a varying d for a sample size |\mathsf{S}_{\Omega}|=2\times 10^{5}. To roughly match the number of grid points for different d, we set n_{t}=16,6,4,3 for d=4,6,8,10 respectively. It is seen that the relative error grows quickly to \mathit{O}(1) as d increases. However, as shown in Figure 8(b), all training losses are finally close to zero for d=4,6,8,10. This is consistent with the fact that in a high-dimensional space most of the uniform samples are located around the boundary, where the solution is close to zero. The optimizer is then in favor of the trivial solution since there are not sufficient samples to resolve the peak at the origin. This phenomenon demonstrates that the uniform sampling method may become less effective as d increases and that the convergence of the approximate solution is highly dependent on the choice of \mathsf{S}_{\Omega} for a large d.

Figure 9 shows the relative errors for the uniform sampling strategy, the residual-based adaptive refinement (RAR) method proposed in [52], DAS-R and DAS-G, where different numbers of samples |\mathsf{S}_{\Omega}| are considered. For each |\mathsf{S}_{\Omega}|, we again take three runs with different random seeds for initialization and compute the mean error of the three runs as the final error. For the DAS-G strategy, the numbers of collocation points in \mathsf{S}^{g}_{\Omega,k} (k=1,2,3,4) are set to n_{r}=10^{4},2\times 10^{4},3\times 10^{4},4\times 10^{4} for |\mathsf{S}_{\Omega}|=5\times 10^{4},10^{5},1.5\times 10^{5},2\times 10^{5} respectively. For the uniform sampling strategy, we train the model with 1.5\times 10^{4} epochs to match the total number of epochs of the DAS methods. For the heuristic method RAR, the numbers of collocation points in \mathsf{S}^{g}_{\Omega,k} (k=1,2,3,4) are set to n_{r}=5\times 10^{3},10^{4},1.5\times 10^{4},2.5\times 10^{4} for |\mathsf{S}_{\Omega}|=5\times 10^{4},10^{5},1.5\times 10^{5},2\times 10^{5} respectively. From Figure 9, it can be seen that both DAS-G and DAS-R improve the accuracy significantly compared to the uniform sampling strategy and RAR. In addition, the error of DAS-G decreases slightly faster than that of DAS-R for this test problem. In Figure 10 we compare the error evolution of different sampling strategies. From the left plot of Figure 10, as the number of epochs increases, the errors of DAS-G and DAS-R decrease quickly, while the errors of RAR and the uniform sampling strategy do not decrease. This result suggests that for high-dimensional problems the DAS methods are able to achieve a good approximation with a relatively small number of nonuniform samples, while many more uniform samples are needed for the same accuracy. The right plot of Figure 10 shows the error of DAS-G at each adaptivity iteration step k. It is seen that the error drops dramatically after we refine the solution using \mathsf{S}_{\Omega,1}.
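For reference, the RAR-style heuristic used in this comparison can be sketched as follows (a rough illustration under our own naming; residual_sq stands for r^{2}(\boldsymbol{x};\Theta) evaluated on a batch of points): a large pool of uniform candidates is drawn, the n_{r} points with the largest residuals are kept, and they are appended to the current training set.

import numpy as np

def rar_refine(train_set, residual_sq, n_r, n_candidates=100000, d=10, rng=None):
    # keep the n_r candidate points with the largest residuals and append them
    rng = rng if rng is not None else np.random.default_rng()
    pool = rng.uniform(-1.0, 1.0, size=(n_candidates, d))
    idx = np.argsort(residual_sq(pool))[-n_r:]
    return np.concatenate([train_set, pool[idx]], axis=0)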

Figures 11 and 12 show 3000 samples from the training sets (|\mathsf{S}_{\Omega}|=2\times 10^{5}) of DAS-R and DAS-G for the first four adaptivity iterations, where the components x_{6} and x_{7} are used for visualization. We have also checked the other components, and no significantly different results were found. For DAS-R, 3000 samples are randomly chosen from \mathsf{S}_{\Omega,k} (k=1,2,3,4). For DAS-G, 3000 samples for visualization are randomly selected from \mathsf{S}^{g}_{\Omega,k} (k=1,2,3,4). It is seen that the profile of \mathsf{S}_{\Omega,k} is gradually flattened as k increases, meaning that the nonuniform samples are able to smooth the error profile, which has a peak around the origin. As for DAS-G, the improvement takes a similar path. \mathsf{S}_{\Omega,1}^{g} shows that the error profile has a peak around the origin. After the training set is augmented with \mathsf{S}_{\Omega,1}, the error profile becomes flatter, as shown by the distribution of \mathsf{S}_{\Omega,2}^{g}. This is expected since more collocation points are added to the neighborhood of the origin, which should reduce the error there. Such a pattern is similar to what we have observed in Figure 7. In Figure 13, we compare the evolution of the variance of the residual for the training with 5\times 10^{4} samples. We estimate the variance of the residual using 59049 grid points around the origin (these points are also used to compute the relative errors in the above discussion). It is clear that both DAS-R and DAS-G achieve significant variance reduction compared with RAR, which helps reduce the statistical error dramatically for a fixed sample size. Looking more closely, the variance of DAS-R has a transition between two consecutive adaptivity iterations, resulting in the oscillation of errors for DAS-R observed in the left plot of Figure 10. From Figure 13, it can be seen that DAS-G appears more robust than DAS-R for this test problem. We may adjust the communication pattern between the PDE model and the PDF model to reduce the oscillations, which will be left for future study.

(a) Error
(b) Loss
Figure 8: The convergence behavior for high-dimensional PDEs with the uniform sampling method. The loss is close to zero, but the error is still large for the ten-dimensional test problem.
Figure 9: The error w.r.t. sample size |\mathsf{S}_{\Omega}|, ten-dimensional linear test problem.
Figure 10: The error evolution of different sampling strategies with |\mathsf{S}_{\Omega}|=2\times 10^{5} and d=10 (high-dimensional linear test problem). Left: A comparison of DAS-G, DAS-R and the uniform sampling method; Right: The error evolution of DAS-G at different adaptivity iteration steps.
Figure 11: The evolution of \mathsf{S}_{\Omega,k} in DAS-R, ten-dimensional linear test problem.
Figure 12: The evolution of \mathsf{S}^{g}_{\Omega,k} in DAS-G, ten-dimensional linear test problem.
Figure 13: The evolution of the variance of the residual, ten-dimensional linear test problem.

6.3 High-dimensional nonlinear test problem

In this part, the ten-dimensional nonlinear partial differential equation considered is

-\Delta u(\boldsymbol{x})+u(\boldsymbol{x})-u^{3}(\boldsymbol{x})=s(\boldsymbol{x}),\quad\boldsymbol{x}\ \text{in}\ \Omega=[-1,1]^{10}. (36)

The exact solution is set to be the same as that of (35), and the Dirichlet boundary condition on \partial\Omega is given by the exact solution. The settings of u(\boldsymbol{x};\Theta) and KRnet are the same as those in section 6.2.
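The residual of the nonlinear operator in (36) can be assembled by automatic differentiation in the same way as for the linear case; a minimal PyTorch sketch (our own illustration; u_net and source are placeholders) is given below.

import torch

def residual_nonlinear(u_net, x, source):
    # r(x) = -Laplacian(u) + u - u^3 - s(x), with the Laplacian assembled by autograd
    x = x.clone().requires_grad_(True)
    u = u_net(x)
    grad_u = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    lap_u = 0.0
    for i in range(x.shape[1]):
        lap_u = lap_u + torch.autograd.grad(grad_u[:, i].sum(), x, create_graph=True)[0][:, i:i + 1]
    return -lap_u + u - u ** 3 - source(x)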

The observations are similar to those for the high-dimensional linear problem. Figure 14 shows the relative errors for the uniform sampling strategy, RAR, DAS-R and DAS-G, where different numbers of samples |\mathsf{S}_{\Omega}| are considered. Again, we take three runs with different random seeds for initialization and compute the mean error of the three runs as the final error for each |\mathsf{S}_{\Omega}|. For the DAS-G strategy, the numbers of collocation points in \mathsf{S}^{g}_{\Omega,k} (k=1,2,3,4) are set to the same values as those in section 6.2. For the uniform sampling strategy, we train the model with 1.5\times 10^{4} epochs to match the total number of epochs of the DAS methods. For the heuristic method RAR, the numbers of collocation points in \mathsf{S}^{g}_{\Omega,k} (k=1,2,3,4) are set to n_{r}=5\times 10^{3},10^{4},1.5\times 10^{4},2.5\times 10^{4} for |\mathsf{S}_{\Omega}|=5\times 10^{4},10^{5},1.5\times 10^{5},2\times 10^{5} respectively. Both DAS-G and DAS-R improve the accuracy significantly compared to the uniform sampling strategy and RAR. In Figure 15 we compare the error evolution of different sampling strategies. Similar to the high-dimensional linear problem, the errors of DAS-G and DAS-R decrease quickly while the errors of the uniform sampling strategy and RAR do not decrease. The error behavior of DAS-G at each adaptivity iteration step k is shown in the right plot of Figure 15. It is seen that the approximation is significantly improved when the adaptivity iteration step k increases from 0 to 1. Figures 16 and 17 show 3000 samples from the training sets (|\mathsf{S}_{\Omega}|=2\times 10^{5}) of DAS-R and DAS-G for the first four adaptivity iterations, where the components x_{6} and x_{7} are used for visualization. For DAS-R, 3000 samples are randomly chosen from \mathsf{S}_{\Omega,k} (k=1,2,3,4). For DAS-G, 3000 samples for visualization are randomly selected from \mathsf{S}^{g}_{\Omega,k} (k=1,2,3,4). Both DAS-R and DAS-G flatten the error profile through adaptive sampling, as we have observed in Figures 11 and 12. Figure 18 shows the evolution of the variance of the residual for DAS-R, DAS-G and RAR. The behavior is similar to that in Figure 13.

Figure 14: The error w.r.t. sample size |\mathsf{S}_{\Omega}|, ten-dimensional nonlinear test problem.
Figure 15: The error evolution of different sampling strategies with |\mathsf{S}_{\Omega}|=2\times 10^{5} and d=10 (high-dimensional nonlinear test problem). Left: A comparison of DAS-G, DAS-R and the uniform sampling method; Right: The error evolution of DAS-G at different adaptivity iteration steps.
Figure 16: The evolution of \mathsf{S}_{\Omega,k} in DAS-R, ten-dimensional nonlinear test problem.
Figure 17: The evolution of \mathsf{S}^{g}_{\Omega,k} in DAS-G, ten-dimensional nonlinear test problem.
Figure 18: The evolution of the variance of the residual, ten-dimensional nonlinear test problem.

7 Conclusion

In this paper we have developed a deep adaptive sampling (DAS) method and coupled it with physics-informed neural networks (PINNs) to improve the neural network approximation of PDEs iteratively. The key idea of DAS is to employ a deep generative model to generate collocation points that are consistent with the distribution induced by an appropriate error indicator function. In this way, the training set is refined according to the regularity of the PDE solution, which follows a principle similar to the adaptive mesh refinement of classical numerical methods. Numerical experiments have shown that the DAS method is able to significantly improve the accuracy for the approximation of low regularity problems, especially when the dimensionality is relatively large. The proposed DAS method provides a very general and flexible framework for an adaptive learning strategy. There are several possible ways to further improve it. First, DAS consists of two DNN-based models: one model serves as an approximator for the PDE solution and the other one serves as an error indicator for the selection of collocation points. Both models can be chosen in terms of a certain criterion. In this work, we use a regular DNN for the PDE approximation and KRnet for density approximation and sample generation. Second, the underlying distribution for the training set can be problem dependent. In this work, we choose the residual-induced distribution. We may also use the gradient of the approximate solution to define an indicator distribution. In [22], we employ KRnet to approximate the Fokker-Planck equation, where the collocation points are sampled from the approximate solution. Third, the DAS method is not limited to steady-state PDE problems. We may employ the DAS method on the space-time domain to refine the training set for the approximation of time-dependent problems. Last but not least, the current training process can also be improved. Although the current DAS methods work well enough to demonstrate the effectiveness of the algorithm, many questions remain open, e.g., what is the optimal way for the two deep models to communicate and what is the optimal sample size for \mathsf{S}_{\Omega,k}^{g}. Research on these issues will be reported in forthcoming papers.


Acknowledgments: K. Tang has been supported by the China Postdoctoral Science Foundation grant 2022M711730. X. Wan has been supported by NSF grant DMS-1913163. C. Yang has been supported by NSFC grant 12131002.

References

  • [1] J. Han, A. Jentzen, E. Weinan, Solving high-dimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences 115 (34) (2018) 8505–8510.
  • [2] E. Weinan, The dawning of a new era in applied mathematics, Notices of the American Mathematical Society 68 (4) (2021) 565–571.
  • [3] G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, L. Yang, Physics-informed machine learning, Nature Reviews Physics 3 (6) (2021) 422–440.
  • [4] W. E, B. Yu, The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems, Communications in Mathematics and Statistics 6 (1) (2018) 1–12.
  • [5] E. Kharazmi, Z. Zhang, G. E. Karniadakis, Variational physics-informed neural networks for solving partial differential equations, arXiv preprint arXiv:1912.00873.
  • [6] Y. Zhu, N. Zabaras, P.-S. Koutsourelakis, P. Perdikaris, Physics-constrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data, Journal of Computational Physics 394 (2019) 56–81.
  • [7] E. Kharazmi, Z. Zhang, G. E. Karniadakis, hp-VPINNs: Variational physics-informed neural networks with domain decomposition, Computer Methods in Applied Mechanics and Engineering 374 (2021) 113547.
  • [8] J. Sirignano, K. Spiliopoulos, DGM: A deep learning algorithm for solving partial differential equations, Journal of Computational Physics 375 (2018) 1339–1364.
  • [9] M. Raissi, P. Perdikaris, G. E. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics 378 (2019) 686–707.
  • [10] G. Pang, L. Lu, G. E. Karniadakis, fPINNs: Fractional physics-informed neural networks, SIAM Journal on Scientific Computing 41 (4) (2019) A2603–A2626.
  • [11] I. E. Lagaris, A. Likas, D. I. Fotiadis, Artificial neural networks for solving ordinary and partial differential equations, IEEE Transactions on Neural Networks 9 (5) (1998) 987–1000.
  • [12] M. Dissanayake, N. Phan-Thien, Neural-network-based approximations for solving partial differential equations, Communications in Numerical Methods in Engineering 10 (3) (1994) 195–201.
  • [13] K. Li, K. Tang, T. Wu, Q. Liao, D3M: A deep domain decomposition method for partial differential equations, IEEE Access 8 (2020) 5283–5294.
  • [14] W. Li, X. Xiang, Y. Xu, Deep domain decomposition method: Elliptic problems, in: J. Lu, R. Ward (Eds.), Proceedings of The First Mathematical and Scientific Machine Learning Conference, Vol. 107 of Proceedings of Machine Learning Research, PMLR, Princeton University, Princeton, NJ, USA, 2020, pp. 269–286.
  • [15] S. Dong, Z. Li, Local extreme learning machines and domain decomposition for solving linear and nonlinear partial differential equations, Computer Methods in Applied Mechanics and Engineering 387 (2021) 114129.
  • [16] H. Gao, L. Sun, J.-X. Wang, Phygeonet: Physics-informed geometry-adaptive convolutional neural networks for solving parameterized steady-state PDEs on irregular domain, Journal of Computational Physics 428 (2021) 110079.
  • [17] H. Sheng, C. Yang, PFNN: A penalty-free neural network method for solving a class of second-order boundary-value problems on complex geometries, Journal of Computational Physics (2020) 110085.
  • [18] Y. Zang, G. Bao, X. Ye, H. Zhou, Weak adversarial networks for high-dimensional partial differential equations, Journal of Computational Physics 411 (2020) 109409.
  • [19] A. Blum, J. Hopcroft, R. Kannan, Foundations of data science, Cambridge University Press, 2020.
  • [20] R. Vershynin, High-dimensional probability: An introduction with applications in data science, Vol. 47, Cambridge University Press, 2018.
  • [21] J. Wright, Y. Ma, High-dimensional data analysis with low-dimensional models: Principles, computation, and applications, Cambridge University Press, 2021.
  • [22] K. Tang, X. Wan, Q. Liao, Adaptive deep density approximation for Fokker-Planck equations, Journal of Computational Physics 457 (2022) 111080.
  • [23] Y. Gu, H. Yang, C. Zhou, Selectnet: Self-paced learning for high-dimensional partial differential equations, Journal of Computational Physics 441 (2021) 110444.
  • [24] P. Morin, R. H. Nochetto, K. G. Siebert, Convergence of adaptive finite element methods, SIAM Review 44 (4) (2002) 631–658.
  • [25] K. Mekchay, R. H. Nochetto, Convergence of adaptive finite element methods for general second order linear elliptic pdes, SIAM Journal on Numerical Analysis 43 (5) (2005) 1803–1827.
  • [26] H. C. Elman, D. J. Silvester, A. J. Wathen, Finite elements and fast iterative solvers: With applications in incompressible fluid dynamics, Oxford University Press, USA, 2014.
  • [27] K. Tang, X. Wan, Q. Liao, Deep density estimation via invertible block-triangular mapping, Theoretical & Applied Mechanics Letters 10 (2020) 143.
  • [28] X. Wan, S. Wei, VAE-KRnet and its applications to variational Bayes, arXiv preprint arXiv:2006.16431.
  • [29] X. Wan, K. Tang, Augmented KRnet for density estimation and approximation, arXiv preprint arXiv:2105.12866.
  • [30] F. Santambrogio, Optimal transport for applied mathematicians, Birkäuser, NY 55 (2015) 58–63.
  • [31] L. Dinh, J. Sohl-Dickstein, S. Bengio, Density estimation using real NVP, arXiv preprint arXiv:1605.08803.
  • [32] D. P. Kingma, P. Dhariwal, Glow: Generative flow with invertible 1x1 convolutions, in: Advances in Neural Information Processing Systems, 2018, pp. 10215–10224.
  • [33] L. Bottou, F. E. Curtis, J. Nocedal, Optimization methods for large-scale machine learning, SIAM Review 60 (2) (2018) 223–311.
  • [34] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
  • [35] P. Bochev, M. Gunzburger, Least-squares methods for hyperbolic problems, in: Handbook of Numerical Analysis, Vol. 17, Elsevier, 2016, pp. 289–317.
  • [36] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems 2 (4) (1989) 303–314.
  • [37] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (5) (1989) 359–366.
  • [38] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks 4 (2) (1991) 251–257.
  • [39] M. Leshno, V. Y. Lin, A. Pinkus, S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Networks 6 (6) (1993) 861–867.
  • [40] Y. Shin, Z. Zhang, G. E. Karniadakis, Error estimates of residual minimization using neural networks for linear PDEs, arXiv preprint arXiv:2010.08019.
  • [41] J. Lu, Y. Lu, M. Wang, A priori generalization analysis of the deep ritz method for solving high dimensional elliptic equations, arXiv preprint arXiv:2101.01708.
  • [42] J. Yu, L. Lu, X. Meng, G. E. Karniadakis, Gradient-enhanced physics-informed neural networks for forward and inverse PDE problems, Computer Methods in Applied Mechanics and Engineering 393 (2022) 114823.
  • [43] W. Gao, C. Wang, Active learning based sampling for high-dimensional nonlinear partial differential equations, arXiv preprint arXiv:2112.13988.
  • [44] I. Kobyzev, S. J. Prince, M. A. Brubaker, Normalizing flows: An introduction and review of current methods, IEEE transactions on pattern analysis and machine intelligence 43 (11) (2020) 3964–3979.
  • [45] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
  • [46] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 214–223.
  • [47] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. C. Courville, Improved training of wasserstein gans, in: Advances in Neural Information Processing Systems, Vol. 30, 2017.
  • [48] D. P. Kingma, M. Welling, Auto-Encoding Variational Bayes, stat 1050 (2014) 1.
  • [49] G. Carlier, A. Galichon, F. Santambrogio, From Knothe’s transport to Brenier’s map and a continuation method for optimal transport, SIAM Journal on Mathematical Analysis 41 (6) (2010) 2554–2576.
  • [50] P.-T. De Boer, D. P. Kroese, S. Mannor, R. Y. Rubinstein, A tutorial on the cross-entropy method, Annals of Operations Research 134 (1) (2005) 19–67.
  • [51] R. Y. Rubinstein, D. P. Kroese, The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning, Springer Science & Business Media, 2013.
  • [52] L. Lu, X. Meng, Z. Mao, G. E. Karniadakis, Deepxde: A deep learning library for solving differential equations, SIAM Review 63 (1) (2021) 208–228.
  • [53] W. F. Mitchell, A collection of 2D elliptic problems for testing adaptive grid refinement algorithms, Applied Mathematics and Computation 220 (2013) 350–364.

Appendix A Proof of Lemma 1

Proof.

We first introduce the following lemma:

Lemma 3.

Consider a perturbed identity matrix \mathbf{I}+\delta\mathbf{A} with \left\|\delta\mathbf{A}\right\|_{2}<1. We have

\left\|(\mathbf{I}+\delta\mathbf{A})^{-1}\right\|_{2}\leq\frac{1}{1-\left\|\delta\mathbf{A}\right\|_{2}}. (37)
Proof.

For any \boldsymbol{x}\neq 0, we have

\left\|(\mathbf{I}+\delta\mathbf{A})\boldsymbol{x}\right\|_{2}\geq\left\|\boldsymbol{x}\right\|_{2}-\left\|\delta\mathbf{A}\right\|_{2}\left\|\boldsymbol{x}\right\|_{2}=(1-\left\|\delta\mathbf{A}\right\|_{2})\left\|\boldsymbol{x}\right\|_{2}>0,

which implies that \mathbf{I}+\delta\mathbf{A} is nonsingular. We then have

1=\left\|(\mathbf{I}+\delta\mathbf{A})^{-1}(\mathbf{I}+\delta\mathbf{A})\right\|_{2}=\left\|(\mathbf{I}+\delta\mathbf{A})^{-1}+(\mathbf{I}+\delta\mathbf{A})^{-1}\delta\mathbf{A}\right\|_{2}
\geq\left\|(\mathbf{I}+\delta\mathbf{A})^{-1}\right\|_{2}-\left\|(\mathbf{I}+\delta\mathbf{A})^{-1}\right\|_{2}\left\|\delta\mathbf{A}\right\|_{2},

which yields the conclusion. ∎
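As a side remark, the bound in Lemma 3 is easy to verify numerically; the following short script is a sanity check (not part of the proof) comparing \left\|(\mathbf{I}+\delta\mathbf{A})^{-1}\right\|_{2} with 1/(1-\left\|\delta\mathbf{A}\right\|_{2}) for a random perturbation with \left\|\delta\mathbf{A}\right\|_{2}=0.5.

import numpy as np

rng = np.random.default_rng(0)
n = 50
dA = rng.standard_normal((n, n))
dA *= 0.5 / np.linalg.norm(dA, 2)                    # scale so that ||dA||_2 = 0.5 < 1
lhs = np.linalg.norm(np.linalg.inv(np.eye(n) + dA), 2)
rhs = 1.0 / (1.0 - np.linalg.norm(dA, 2))
print(lhs <= rhs)                                     # True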

Assume that m_{V}^{*}(\boldsymbol{x})=(\boldsymbol{v}^{*})^{\mathsf{T}}\boldsymbol{q}(\boldsymbol{x}) and m_{\hat{\boldsymbol{v}}^{*}}(\boldsymbol{x})=(\hat{\boldsymbol{v}}^{*})^{\mathsf{T}}\boldsymbol{q}(\boldsymbol{x}), where \boldsymbol{q}(\boldsymbol{x})=[q_{1}(\boldsymbol{x}),\ldots,q_{n}(\boldsymbol{x})]^{\mathsf{T}} includes the basis functions in V=\mathrm{span}\{q_{i}(\boldsymbol{x})\}_{i=1}^{n} and the vectors \boldsymbol{v}^{*} and \hat{\boldsymbol{v}}^{*} include the coefficients. It is easy to see that \boldsymbol{v}^{*} and \hat{\boldsymbol{v}}^{*} satisfy the following linear systems respectively

\mathbf{A}\boldsymbol{v}^{*}=\boldsymbol{b},\quad\hat{\mathbf{A}}\hat{\boldsymbol{v}}^{*}=\hat{\boldsymbol{b}}, (38)

where

\hat{a}_{ij}=\frac{1}{N}\sum_{k=1}^{N}q_{i}(\boldsymbol{x}^{(k)})q_{j}(\boldsymbol{x}^{(k)})\approx\langle q_{i},q_{j}\rangle_{\rho}=a_{ij},
\hat{b}_{i}=\frac{1}{N}\sum_{k=1}^{N}q_{i}(\boldsymbol{x}^{(k)})h(\boldsymbol{x}^{(k)})\approx\langle q_{i},h\rangle_{\rho}=b_{i},

and \langle q_{i},q_{j}\rangle_{\rho}=\int_{D}q_{i}(\boldsymbol{x})q_{j}(\boldsymbol{x})\rho(\boldsymbol{x})d\boldsymbol{x} indicates the inner product of q_{i} and q_{j}. We rewrite the linear system for \hat{\boldsymbol{v}}^{*} as

(\mathbf{A}+\delta\mathbf{A})\hat{\boldsymbol{v}}^{*}=\boldsymbol{b}+\delta\boldsymbol{b}, (39)

where \delta{\mathbf{A}}=\hat{\mathbf{A}}-\mathbf{A} and \delta\boldsymbol{b}=\hat{\boldsymbol{b}}-\boldsymbol{b}. Let \mathscr{B}=\{q_{i}(\boldsymbol{x})\}_{i=1}^{n}\cup\{h(\boldsymbol{x})\}. Since both \{q_{i}\}_{i=1}^{n} and h(\boldsymbol{x}) are continuous on a compact support, we may assume that |m_{1}(\boldsymbol{x})m_{2}(\boldsymbol{x})|\leq M for any m_{1},m_{2}\in\mathscr{B} and any \boldsymbol{x}\in D, where 0<M<\infty is a constant. Using the Hoeffding bound, we have, for any \delta>0, with probability at least 1-2\delta,

\left|\frac{1}{N}\sum_{k=1}^{N}m_{1}(\boldsymbol{x}^{(k)})m_{2}(\boldsymbol{x}^{(k)})-\langle m_{1},m_{2}\rangle_{\rho}\right|\leq\sqrt{\frac{2M^{2}\ln\delta^{-1}}{N}}, (40)

for any m_{1},m_{2}\in\mathscr{B}. This means with probability at least 1-2\delta we have

\left\|\delta\mathbf{A}\right\|_{2}\leq\left\|\delta\mathbf{A}\right\|_{F}\leq n\sqrt{\frac{2M^{2}\ln\delta^{-1}}{N}},\quad\left\|\delta\boldsymbol{b}\right\|_{2}\leq\sqrt{n}\sqrt{\frac{2M^{2}\ln\delta^{-1}}{N}}, (41)

where \left\|\cdot\right\|_{F} indicates the Frobenius norm. Let \delta\boldsymbol{v}^{*}=\hat{\boldsymbol{v}}^{*}-\boldsymbol{v}^{*}. It is seen that \left\|\delta\mathbf{A}\right\|_{2}\rightarrow 0 as N\rightarrow\infty. Assume that N is large enough such that \left\|\delta\mathbf{A}\right\|_{2}\leq(1-r) with 0<r<1, in other words, (1-\left\|\delta\mathbf{A}\right\|_{2})^{-1}\leq r^{-1}. Since q_{i}(\boldsymbol{x}) are orthonormal, we have \mathbf{A}=\mathbf{I}. From equations (38) and (39), we have

\delta\boldsymbol{v}^{*}=(\mathbf{I}+\delta\mathbf{A})^{-1}(\delta\boldsymbol{b}-\delta\mathbf{A}\boldsymbol{v}^{*}),

to which we apply Lemma 3 and the bounds in equation (41) to obtain

\left\|\delta\boldsymbol{v}^{*}\right\|_{2}\leq r^{-1}(\left\|\delta\boldsymbol{b}\right\|_{2}+\left\|\delta\mathbf{A}\right\|_{2}\left\|\boldsymbol{v}^{*}\right\|_{2})\leq r^{-1}(\sqrt{n}+n\left\|\boldsymbol{v}^{*}\right\|_{2})\sqrt{\frac{2M^{2}\ln\delta^{-1}}{N}}. (42)

Using the Pythagorean theorem, we have

\|m_{\hat{\boldsymbol{v}}^{*}}-h\|_{\rho}^{2}=\|m_{\hat{\boldsymbol{v}}^{*}}-m^{*}_{V}\|_{\rho}^{2}+\|h-m^{*}_{V}\|_{\rho}^{2}
=\|\delta\boldsymbol{v}^{*}\|^{2}_{2}+\|h-m^{*}_{V}\|_{\rho}^{2}
\leq(\|\delta\boldsymbol{v}^{*}\|_{2}+\|h-m^{*}_{V}\|_{\rho})^{2},

which yields that

\|m_{\hat{\boldsymbol{v}}^{*}}-h\|_{\rho}\leq\|\delta\boldsymbol{v}^{*}\|_{2}+\|h-m^{*}_{V}\|_{\rho}. (43)

Combining equations (42) and (43), we reach the conclusion. ∎

Appendix B Proof of Lemma 2

Proof.

Let

\hat{r}(\boldsymbol{x})=\left\{\begin{array}{rl}r(\boldsymbol{x}),&\textrm{ if }|r^{2}/p-\mu|\leq a;\\ 0,&\textrm{ otherwise},\end{array}\right.

where a>0. We consider

\left|Q_{p}[r^{2}]-\mathbb{E}[r^{2}]\right|\leq\left|Q_{p}(r^{2})-Q_{p}(\hat{r}^{2})\right|+\left|Q_{p}(\hat{r}^{2})-\mathbb{E}[\hat{r}^{2}]\right|+\left|\mathbb{E}[\hat{r}^{2}]-\mathbb{E}[r^{2}]\right|=I_{1}+I_{2}+I_{3}. (44)

The first term I_{1} on the right-hand side is bounded as

\mathbb{E}_{p}\left|Q_{p}(r^{2})-Q_{p}(\hat{r}^{2})\right|=\mathbb{E}_{p}\left|r^{2}(\boldsymbol{X})p^{-1}(\boldsymbol{X})-\hat{r}^{2}(\boldsymbol{X})p^{-1}(\boldsymbol{X})\right|
=\int_{|r^{2}/p-\mu|>a}r^{2}(\boldsymbol{x})d\boldsymbol{x}
\leq\|r^{2}/p\|_{p}\sqrt{\mathbb{P}(|r^{2}/p-\mu|>a;p)}, (45)

where the Cauchy-Schwarz inequality is used in the last step.

The second term I_{2} on the right-hand side can be bounded as

\mathbb{E}_{p}\left|Q_{p}(\hat{r}^{2})-\mathbb{E}[\hat{r}^{2}]\right|\leq\sqrt{\mathrm{Var}_{p}(Q_{p}(\hat{r}^{2}))}
\leq N^{-1/2}\sqrt{\mathrm{Var}_{p}(\hat{r}^{2}(\boldsymbol{X})/p(\boldsymbol{X}))}
\leq aN^{-1/2}, (46)

where in the last step we used the fact that for any random variable Y with \alpha\leq Y\leq\beta almost surely, \mathrm{Var}(Y)\leq\frac{(\beta-\alpha)^{2}}{4}. The third term I_{3} on the right-hand side can be bounded in the same way as I_{1}.

We now estimate the tail probability \mathbb{P}(|r^{2}/p-\mu|>a;p). Using the correspondence between the L_{1} norm and the total variation distance for two probability measures, as well as Pinsker's inequality, we have

\|p-p^{*}\|_{L_{1}}=2\delta(p,p^{*})\leq\sqrt{2D_{\mathsf{KL}}(p\|p^{*})}\leq\sqrt{2\varepsilon}, (47)

which yields that

\mathbb{E}_{p}\left[\left|r^{2}/p-\mu\right|\right]\leq\mu\sqrt{2\varepsilon}. (48)

From the Markov inequality, we have

\mathbb{P}\left(\left|r^{2}/p-\mu\right|\geq a;p\right)\leq\frac{\mu\sqrt{2\varepsilon}}{a}, (49)

where the probability is with respect to the PDF p(\boldsymbol{x}). Combining the bounds for I_{i}, i=1,2,3, and equation (49), we reach the conclusion. ∎

Appendix C Proof of Theorem 1

Proof.

By Assumption 1, we have

\left\|u(\boldsymbol{x};\Theta_{N}^{*,(k)})-u(\boldsymbol{x})\right\|_{2,\Omega}
\leq C_{1}^{-1}\left[\left\|\mathcal{L}(u(\boldsymbol{x};\Theta_{N}^{*,(k)})-u(\boldsymbol{x}))\right\|_{2,\Omega}+\left\|\mathfrak{b}(u(\boldsymbol{x};\Theta_{N}^{*,(k)})-u(\boldsymbol{x}))\right\|_{2,\partial\Omega}\right]
\leq\sqrt{2}C_{1}^{-1}\left(\left\|\mathcal{L}(u(\boldsymbol{x};\Theta_{N}^{*,(k)})-u(\boldsymbol{x}))\right\|_{2,\Omega}^{2}+\left\|\mathfrak{b}(u(\boldsymbol{x};\Theta_{N}^{*,(k)})-u(\boldsymbol{x}))\right\|_{2,\partial\Omega}^{2}\right)^{\frac{1}{2}}.

Combining \mathcal{L}u(\boldsymbol{x})=s(\boldsymbol{x}), \mathfrak{b}u(\boldsymbol{x})=g(\boldsymbol{x}), r(\boldsymbol{x};\Theta_{N}^{*,(k)})=\mathcal{L}u(\boldsymbol{x};\Theta_{N}^{*,(k)})-s(\boldsymbol{x}) and b(\boldsymbol{x};\Theta_{N}^{*,(k)})=\mathfrak{b}u(\boldsymbol{x};\Theta_{N}^{*,(k)})-g(\boldsymbol{x}) gives

\left\|u(\boldsymbol{x};\Theta_{N}^{*,(k)})-u(\boldsymbol{x})\right\|_{2,\Omega}\leq\sqrt{2}C_{1}^{-1}\left(\left\|r(\boldsymbol{x};\Theta_{N}^{*,(k)})\right\|_{2,\Omega}^{2}+\left\|b(\boldsymbol{x};\Theta_{N}^{*,(k)})\right\|_{2,\partial\Omega}^{2}\right)^{\frac{1}{2}}. (50)

Noting that \mathbb{E}(R_{k})=\left\|r(\boldsymbol{x};\Theta_{N}^{*,(k)})\right\|_{2,\Omega}^{2}, and according to the Hoeffding inequality, we have

\mathbb{P}\left(R_{k}-\mathbb{E}(R_{k})\geq-\varepsilon\right)\geq 1-\mathrm{exp}\left(\frac{-2N_{r}\varepsilon^{2}}{(\tau_{2}-\tau_{1})^{2}}\right). (51)

Combining (50) and (51) gives that

\left\|u(\boldsymbol{x};\Theta_{N}^{*,(k)})-u(\boldsymbol{x})\right\|_{2,\Omega}\leq\sqrt{2}C_{1}^{-1}\left(R_{k}+\varepsilon+\left\|b(\boldsymbol{x};\Theta_{N}^{*,(k)})\right\|_{2,\partial\Omega}^{2}\right)^{\frac{1}{2}}

with probability at least 1-\mathrm{exp}(-2N_{r}\varepsilon^{2}/(\tau_{2}-\tau_{1})^{2}). ∎

Appendix D Proof of Corollary 1

Proof.

Note that

\Theta_{N}^{*,(k+1)}=\arg\min_{\Theta}\frac{1}{N_{r}}\sum_{i=1}^{N_{r}}\frac{r^{2}(\boldsymbol{x}_{\Omega}^{(i)};\Theta)}{\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{f}^{*,(k)})}.

Since \Theta_{N}^{*,(k+1)} is the optimal solution at the (k+1)-th stage, we have

R_{k+1}=\frac{1}{N_{r}}\sum\limits_{i=1}^{N_{r}}\frac{r^{2}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{N}^{*,(k+1)})}{\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{f}^{*,(k)})}\leq\frac{1}{N_{r}}\sum\limits_{i=1}^{N_{r}}\frac{r^{2}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{N}^{*,(k)})}{\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{f}^{*,(k)})}. (52)

Plugging \hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k)})=c_{k}r^{2}(\boldsymbol{x};\Theta_{N}^{*,(k)}) into (52) gives that

R_{k+1}\leq\frac{1}{c_{k}}.

Noting that R_{k+1} is a random variable and taking its expectation, it follows that

\mathbb{E}(R_{k+1})\leq\frac{1}{c_{k}}=\int_{\Omega}r^{2}(\boldsymbol{x};\Theta_{N}^{*,(k)})d\boldsymbol{x}=\mathbb{E}(R_{k}),

which completes the proof. ∎