
DAS-PINNs: A deep adaptive sampling method for solving high-dimensional partial differential equations

Kejun Tang, Xiaoliang Wan, Chao Yang. Peng Cheng Laboratory, Shenzhen 518052, China; Department of Mathematics and Center for Computation and Technology, Louisiana State University, Baton Rouge 70803, USA; School of Mathematical Sciences, Peking University, Beijing 100871, China; Institute of Computing and Digital Economy, Peking University, Changsha 410205, China
Abstract

In this work we propose a deep adaptive sampling (DAS) method for solving partial differential equations (PDEs), where deep neural networks are utilized to approximate the solutions of PDEs and deep generative models are employed to generate new collocation points that refine the training set. The overall procedure of DAS consists of two components: solving the PDEs by minimizing the residual loss on the collocation points in the training set and generating a new training set to further improve the accuracy of the current approximate solution. In particular, we treat the residual as a probability density function and approximate it with a deep generative model, called KRnet. The new samples from KRnet are consistent with the distribution induced by the residual, i.e., more samples are located in the region of large residual and fewer samples are located in the region of small residual. Analogous to classical adaptive methods such as the adaptive finite element method, KRnet acts as an error indicator that guides the refinement of the training set. Compared to the neural network approximation obtained with uniformly distributed collocation points, the developed algorithms can significantly improve the accuracy, especially for low regularity and high-dimensional problems. We demonstrate the effectiveness of the proposed DAS method with numerical experiments.

keywords:
deep learning; numerical approximation of PDEs; adaptive sampling; deep generative models.
journal: Journal of Computational Physics

1 Introduction

In recent years, solving partial differential equations (PDEs) with deep learning methods has been receiving increasing attention [1, 2, 3]. Two major types of deep learning methods have been proposed for solving PDEs: the variational form subject to deep learning techniques [4, 5, 6, 7] and the physics-informed neural networks (PINNs) [8, 9, 10, 3], both of which reformulate a PDE problem as an optimization problem and train a deep neural network (DNN) to approximate the solution of the PDE by minimizing the corresponding loss functional. The variational form is based on the weak formulation of PDEs, while the physics-informed neural networks are based on the residual loss of PDEs. Similar ideas of solving PDEs via minimizing the residual loss can be traced back to the works [11, 12] in the 1990s, where a shallow neural network is optimized on an a priori fixed mesh as an approximation of the solution. Some efforts have been made to incorporate traditional computational techniques to enhance the performance of solving PDEs with deep neural networks. In [13, 14, 15, 16, 7], deep neural networks based on domain decomposition are proposed to improve the efficiency. A penalty-free neural network method [17] and PhyGeoNet [16] are developed to deal with complex geometries and irregular domains. A weak formulation with primal and adversarial networks is proposed in [18], where the PDE problem is converted to an operator norm minimization problem induced by the weak formulation.

One critical step for all these methods is to approximate the loss functional, where the integral is usually approximated by the Monte Carlo method with collocation points randomly generated from a uniform distribution on the computational domain. Since the minimization of the discrete loss functional yields the approximate solution, the accuracy of the approximate solution is closely related to the accuracy of the discrete loss functional. In contrast to classical computational methods, where the main concern is the approximation error, one needs to balance the approximation error and the generalization error for the neural network approximation, where the approximation error mainly originates from the modeling capability of the neural network and the generalization error is mainly related to the data points in the training set, i.e., the random samples for the discretization of the loss functional. However, for many PDE models, the uniform random sampling strategy is not efficient, especially when the PDE solution has low regularity; in other words, an integrand of low regularity may have a large variance with respect to the uniform distribution, so that the Monte Carlo approximation of the loss functional has a large prefactor in front of the $O(N^{-1/2})$ convergence rate. This issue becomes worse for high-dimensional problems due to the curse of dimensionality. In high-dimensional spaces, most of the volume of the computational domain concentrates around its surface [19, 20, 21], which means that uniform samples may become less effective for training deep neural networks to approximate high-dimensional PDEs. For example, collocation points from the uniform distribution are not suitable for solving high-dimensional Fokker-Planck equations, while an adaptive strategy that samples the current approximate solution is effective [22]. In [23], a selection network is introduced to serve as a weight function that assigns higher weights to samples with large point-wise residuals, which yields a more accurate approximate solution if the selection network is properly chosen. However, to obtain a valid selection network, one needs to impose additional constraints on the selection network, which is often a non-trivial task. For low-dimensional problems, it is well known that one can employ adaptive numerical schemes to deal with PDEs with low-regularity solutions [24, 25, 26], which also suggests that uniform samples are not the best choice. Therefore, adaptive sampling strategies are crucial for developing more efficient and reliable deep learning techniques for the approximation of PDEs.

In this work, we develop a deep adaptive sampling (DAS) method for the neural network approximation of PDEs based on residual minimization, where a deep generative model, called KRnet [27, 28, 29], is used to guide the sample generation for the training set. To this end, we need to construct two deep neural network models: one for approximating the solution and the other for refining the training set. The neural network approximation is achieved by the standard procedure of residual minimization. KRnet defines a transport map [30] from the data distribution to a prior distribution (e.g., the standard Gaussian). KRnet retains two traits of flow-based generative models [31, 32]: exact invertibility of the transport map and efficient computation of the Jacobian determinant, based on which one can obtain an explicit density model through the change of variables and an effective approach for generating samples through the invertible mapping. The key point in our proposed framework is that the residual is viewed as a probability density function (PDF) up to a constant, and approximating this PDF can be achieved by minimizing the Kullback-Leibler (KL) divergence between the KRnet-induced density model and the residual-induced distribution. We use the trained KRnet to generate new collocation points to replace or refine the training set, where more points are put in the region of large residual and fewer points are put in the region of small residual. The updated training set is then used to further improve the accuracy of the current approximate solution. Simply speaking, KRnet acts as an error indicator for the improvement of the training set, which shares similarities with the classical adaptive finite element method equipped with a residual-based a posteriori error estimator. In summary, the main contributions of this work are as follows.

  1. We utilize a deep generative model as a generic means to reflect the correspondence between the residual and the error of approximation through efficient PDF approximation and sample generation.

  2. We propose a deep adaptive sampling (DAS) framework, including efficient sampling procedures and training algorithms, for the adaptive improvement of neural network approximation of PDEs.

The remainder of the paper is organized as follows. In the next section, we briefly describe the deep learning method used in this work for the approximation of PDEs. After that, the statistical error of the machine learning technique is illustrated from the perspective of function approximation. Our DAS approach is presented in section 4. We provide the theoretical analysis of DAS in section 5. In section 6, we demonstrate the efficiency of our adaptive sampling approach with numerical experiments. The paper is concluded in section 7.

2 Deep learning for PDEs

Let $\Omega\subset\mathbb{R}^{d}$ be a spatial domain, which is bounded, connected and with a polygonal boundary $\partial\Omega$, and $\boldsymbol{x}\in\mathbb{R}^{d}$ denote a spatial variable. The PDE problem is stated as: find $u(\boldsymbol{x})\in F:\mathbb{R}^{d}\mapsto\mathbb{R}$, where $F$ is a proper function space defined on $\Omega$, such that

\[
\begin{aligned}
\mathcal{L}u(\boldsymbol{x}) &= s(\boldsymbol{x}), && \forall\boldsymbol{x}\in\Omega,\\
\mathfrak{b}u(\boldsymbol{x}) &= g(\boldsymbol{x}), && \forall\boldsymbol{x}\in\partial\Omega,
\end{aligned}
\tag{1}
\]

where $\mathcal{L}$ is the partial differential operator, $\mathfrak{b}$ is the boundary operator, $s(\boldsymbol{x})$ is the source function, and $g(\boldsymbol{x})$ represents the boundary conditions.

Let $u(\boldsymbol{x};\Theta)$ be a neural network with parameters $\Theta$. In the framework of PINNs, the goal is to use $u(\boldsymbol{x};\Theta)$ to approximate the solution $u(\boldsymbol{x})$ through optimizing a loss functional defined as [8, 9]

\[
J\left(u(\boldsymbol{x};\Theta)\right)=\left\|r(\boldsymbol{x};\Theta)\right\|_{2,\Omega}^{2}+\gamma\left\|b(\boldsymbol{x};\Theta)\right\|_{2,\partial\Omega}^{2}=J_{r}(u(\boldsymbol{x};\Theta))+\gamma J_{b}(u(\boldsymbol{x};\Theta)),
\tag{2}
\]

where $r(\boldsymbol{x};\Theta)=\mathcal{L}u(\boldsymbol{x};\Theta)-s(\boldsymbol{x})$ and $b(\boldsymbol{x};\Theta)=\mathfrak{b}u(\boldsymbol{x};\Theta)-g(\boldsymbol{x})$ measure how well $u(\boldsymbol{x};\Theta)$ satisfies the partial differential equations and the boundary conditions, respectively, and $\gamma>0$ is a penalty parameter. Here, $\left\|u(\boldsymbol{x})\right\|_{2,\Omega}^{2}=\int_{\Omega}|u(\boldsymbol{x})|^{2}d\boldsymbol{x}$ and $\left\|u(\boldsymbol{x})\right\|_{2,\partial\Omega}^{2}=\int_{\partial\Omega}|u(\boldsymbol{x})|^{2}d\boldsymbol{x}$. The loss functional (2) is usually discretized numerically before the optimization with respect to $\Theta$ is addressed. In practice, one often chooses two sets of uniformly distributed collocation points $\mathsf{S}_{\Omega}=\{\boldsymbol{x}_{\Omega}^{(i)}\}_{i=1}^{N_{r}}$ and $\mathsf{S}_{\partial\Omega}=\{\boldsymbol{x}_{\partial\Omega}^{(i)}\}_{i=1}^{N_{b}}$ respectively for the discretization of the two terms in the objective functional (2), leading to the following empirical loss

\[
J_{N}\left(u(\boldsymbol{x};\Theta)\right)=\left\|r(\boldsymbol{x};\Theta)\right\|_{N_{r},\mathsf{S}_{\Omega}}^{2}+\hat{\gamma}\left\|b(\boldsymbol{x};\Theta)\right\|_{N_{b},\mathsf{S}_{\partial\Omega}}^{2},
\tag{3}
\]

where $\hat{\gamma}>0$, and

\[
\left\|u(\boldsymbol{x})\right\|_{N_{r},\mathsf{S}_{\Omega}}=\left(\frac{1}{N_{r}}\sum_{i=1}^{N_{r}}u^{2}(\boldsymbol{x}_{\Omega}^{(i)})\right)^{\frac{1}{2}},\quad\left\|u(\boldsymbol{x})\right\|_{N_{b},\mathsf{S}_{\partial\Omega}}=\left(\frac{1}{N_{b}}\sum_{i=1}^{N_{b}}u^{2}(\boldsymbol{x}_{\partial\Omega}^{(i)})\right)^{\frac{1}{2}}.
\]

Note that in the definition of $J_{N}$ we do not take into account the constants $|\Omega|=\int_{\Omega}d\boldsymbol{x}$ and $|\partial\Omega|=\int_{\partial\Omega}d\boldsymbol{x}$, and the ratio induced by these two constants can be dealt with by choosing $\hat{\gamma}=\frac{\gamma|\partial\Omega|}{|\Omega|}$ such that $J_{N}(u)$ is a Monte Carlo approximation of $J(u)$ up to a constant scaling factor $|\Omega|$. We then seek an approximate solution by minimizing the empirical loss (3), i.e.,

\[
\min_{\Theta}J_{N}(u(\boldsymbol{x};\Theta)),
\tag{4}
\]

which can be solved by stochastic gradient-based methods [33, 34].
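To make the discretized formulation concrete, the following minimal PyTorch-style sketch minimizes an empirical loss of the form (3) for a 1D Poisson problem $-u''=s$ on $(0,1)$ with homogeneous Dirichlet boundary conditions; the network width, sample sizes and learning rate are illustrative assumptions and not the settings used later in the paper.

```python
import torch

# Minimal sketch of residual-loss minimization (PINN style) for -u'' = s on (0, 1),
# u(0) = u(1) = 0, with s chosen so that the exact solution is u(x) = sin(pi x).
torch.manual_seed(0)

u_net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
opt = torch.optim.Adam(u_net.parameters(), lr=1e-3)

def source(x):
    return (torch.pi ** 2) * torch.sin(torch.pi * x)

for step in range(2000):
    x = torch.rand(500, 1, requires_grad=True)        # uniform collocation points in Omega
    xb = torch.tensor([[0.0], [1.0]])                  # boundary collocation points
    u = u_net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    residual = -d2u - source(x)                        # r(x; Theta)
    loss = (residual ** 2).mean() + (u_net(xb) ** 2).mean()   # J_N with gamma_hat = 1
    opt.zero_grad(); loss.backward(); opt.step()
```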

Recently, a priori error estimates of neural-network-based methods for solving PDEs have been established. Combining the analysis techniques of the least squares finite element method [35] with the universal approximation property of neural networks [36, 37, 38, 39], Shin et al. propose an abstract framework for the error estimation of physics-informed neural networks [40]. Lu et al. derive an a priori estimate of the generalization error for the deep Ritz method with two-layer neural networks [41]. Suppose that $u(\boldsymbol{x};\Theta_{N}^{*})$ is the minimizer of the empirical loss $J_{N}(u(\boldsymbol{x};\Theta))$ and $u(\boldsymbol{x};\Theta^{*})$ is the minimizer of $J(u(\boldsymbol{x};\Theta))$, i.e.,

\[
\begin{aligned}
u(\boldsymbol{x};\Theta^{*}) &= \arg\min_{\Theta}J(u(\boldsymbol{x};\Theta)),\\
u(\boldsymbol{x};\Theta_{N}^{*}) &= \arg\min_{\Theta}J_{N}(u(\boldsymbol{x};\Theta)).
\end{aligned}
\]

We have

\[
u(\boldsymbol{x};\Theta_{N}^{*})-u(\boldsymbol{x})=u(\boldsymbol{x};\Theta_{N}^{*})-u(\boldsymbol{x};\Theta^{*})+u(\boldsymbol{x};\Theta^{*})-u(\boldsymbol{x}),
\tag{5}
\]

i.e.,

\[
\mathbb{E}\left(\left\|u(\boldsymbol{x};\Theta_{N}^{*})-u(\boldsymbol{x})\right\|_{\Omega}\right)\leq\mathbb{E}\left(\left\|u(\boldsymbol{x};\Theta_{N}^{*})-u(\boldsymbol{x};\Theta^{*})\right\|_{\Omega}\right)+\left\|u(\boldsymbol{x};\Theta^{*})-u(\boldsymbol{x})\right\|_{\Omega},
\tag{6}
\]

where $\mathbb{E}$ indicates the expectation and the norm $\left\|\cdot\right\|_{\Omega}$ corresponds to the function space $F$ for $u(\boldsymbol{x};\Theta)$. The first term describes the statistical error from discretizing the loss functional with the Monte Carlo approximation, and the second term is the approximation error of minimizing the loss functional over the hypothesis space. The approximation error depends on the capability of neural networks, while the statistical error depends on the definition of $\mathsf{S}_{\Omega}$ and $\mathsf{S}_{\partial\Omega}$. In this work, we focus on how to reduce the statistical error for problem (4), and our algorithm can also be generalized to other formulations for the neural network approximation of PDEs. For simplicity, we focus on the integration of the residual $r(\boldsymbol{x};\Theta)$ and assume that the integral on the boundary is well approximated by a prescribed $\mathsf{S}_{\partial\Omega}$.

3 Illustration of the statistical error

We first use function approximation as an example to illustrate the statistical error of the machine learning technique. Let $\boldsymbol{X}\in\mathbb{R}^{d}$ and $Y\in\mathbb{R}$ be subject to a joint distribution $\rho_{\boldsymbol{X},Y}$. Let $\hat{Y}=m(\boldsymbol{X})$ be a model and $y=h(\boldsymbol{x})$ be a function to be approximated. We know that in the $L_{2}$ sense the optimal model is

\[
m^{*}(\boldsymbol{x})=\arg\min_{m(\boldsymbol{x})}\left[L(Y,\hat{Y})=\int(y-m(\boldsymbol{x}))^{2}\rho_{\boldsymbol{X},Y}(\boldsymbol{x},y)d\boldsymbol{x}dy\right].
\tag{7}
\]

In reality, we usually do not know $\rho_{\boldsymbol{X},Y}$ and only have a set $\{(\boldsymbol{x}^{(i)},y^{(i)})\}_{i=1}^{N}$ of data, which can be regarded as samples of $\rho_{\boldsymbol{X},Y}$. For a certain hypothesis space $W$, we obtain a regression problem

\[
m_{\boldsymbol{w}^{*}}(\boldsymbol{x})=\arg\min_{m_{\boldsymbol{w}}\in W}\left[L_{N}(Y,\hat{Y})=\frac{1}{N}\sum_{i=1}^{N}(y^{(i)}-m_{\boldsymbol{w}}(\boldsymbol{x}^{(i)}))^{2}\right],
\tag{8}
\]

where $L_{N}$ can be regarded as a Monte Carlo approximation of $L$ and the subscript $\boldsymbol{w}$ indicates the model parameters specified by $W$. If we let $\rho_{\boldsymbol{X},Y}(\boldsymbol{x},y)=\delta(y-h(\boldsymbol{x}))\rho(\boldsymbol{x})$, and assume $m(\boldsymbol{x})\in V$ with $V$ being a linear space, we then obtain the continuous least-squares method for function approximation

\[
m_{V}^{*}(\boldsymbol{x})=\arg\min_{m(\boldsymbol{x})\in V}\left[L_{V}(Y,\hat{Y})=\int(m(\boldsymbol{x})-h(\boldsymbol{x}))^{2}\rho(\boldsymbol{x})d\boldsymbol{x}\right],
\tag{9}
\]

where $m_{V}^{*}(\boldsymbol{x})$ is the best approximation of $h(\boldsymbol{x})$ located in $V$ subject to a weighted $L_{2}$ norm in terms of a probability density function $\rho(\boldsymbol{x})$. To approximate $h(\boldsymbol{x})$ with a machine learning technique, we consider

\[
m_{\hat{\boldsymbol{v}}^{*}}(\boldsymbol{x})=\arg\min_{m_{\hat{\boldsymbol{v}}}\in V}\left[L_{V,N}(Y,\hat{Y})=\frac{1}{N}\sum_{i=1}^{N}(m_{\hat{\boldsymbol{v}}}(\boldsymbol{x}^{(i)})-h(\boldsymbol{x}^{(i)}))^{2}\right],
\tag{10}
\]

where $\{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$ are samples of the PDF $\rho(\boldsymbol{x})$, and $L_{V,N}$ is the Monte Carlo approximation of $L_{V}$. A classical choice for $V$ is the linear space spanned by polynomials, for which we derive the error estimate for $m_{\hat{\boldsymbol{v}}^{*}}(\boldsymbol{x})$ as follows.

Lemma 1.

Let $h(\boldsymbol{x})\in C(D)$ be a continuous function defined on a compact domain $D\subset\mathbb{R}^{d}$ and $\rho(\boldsymbol{x})>0$ be a PDF on $D$. Let $V=\mathrm{span}\{q_{i}(\boldsymbol{x})\}_{i=1}^{n}$ with $q_{i}(\boldsymbol{x})$ being orthonormal polynomials in terms of $\rho(\boldsymbol{x})$. For any $\delta>0$ and with probability at least $1-2\delta$, we have for a sufficiently large $N$

\[
\|m_{\hat{\boldsymbol{v}}^{*}}(\boldsymbol{x})-h(\boldsymbol{x})\|_{\rho}\leq C\sqrt{\frac{\ln\delta^{-1}}{N}}+\|m^{*}_{V}(\boldsymbol{x})-h(\boldsymbol{x})\|_{\rho},
\]

where $C$ is a constant, and $\|\cdot\|_{\rho}$ is the weighted $L_{2}$ norm in terms of $\rho(\boldsymbol{x})$.

The first term on the right-hand side is the statistical error due to the random samples used for the approximation of $L_{V}$ (see the proof in the appendix for more details), and its existence does not depend on the choice of $V$. When $N$ goes to infinity, the statistical error goes to zero and only the approximation error is left. In other words, when applying a machine learning technique to function approximation, we need to pay attention to both the hypothesis space $W$ and the choice of random samples $\{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$, i.e., the training set, to obtain a trade-off between the statistical error and the approximation error. For low-dimensional problems, classical methods such as finite element methods avoid the statistical error by using Gauss quadrature rules, which implies that machine learning techniques are in general less efficient than classical methods due to the existence of the statistical error. On the other hand, for high-dimensional problems, classical methods may not be able to obtain a relatively small approximation error due to the curse of dimensionality, and machine learning techniques may perform better by using a capable hypothesis space such as neural networks and an affordable sample size for a relatively small statistical error. A more general estimate of the statistical error requires the Rademacher complexity of the hypothesis space. In this work, we are interested in the reduction of the statistical error instead of its bound for a certain set of random samples.
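As a simple illustration of this trade-off (not taken from the paper), the following numpy sketch fits a cubic polynomial to $h(x)=\sin(\pi x)$ using $N$ random samples of the uniform density $\rho$ on $[0,1]$ and compares the resulting weighted $L_{2}$ error with that of a (near-)best approximation: the statistical gap shrinks as $N$ grows, while the approximation error persists.

```python
import numpy as np

# Statistical vs. approximation error in least-squares regression with random samples.
rng = np.random.default_rng(0)
h = lambda x: np.sin(np.pi * x)
x_ref = np.linspace(0.0, 1.0, 20001)          # fine grid to evaluate weighted L2 norms

# Proxy for the best L2(rho) approximation m_V^*: least squares on the fine grid.
best_coef = np.polynomial.polynomial.polyfit(x_ref, h(x_ref), deg=3)
best_err = np.sqrt(np.mean((np.polynomial.polynomial.polyval(x_ref, best_coef) - h(x_ref)) ** 2))

for N in [20, 200, 2000, 20000]:
    x = rng.uniform(0.0, 1.0, N)              # random training samples of rho
    coef = np.polynomial.polynomial.polyfit(x, h(x), deg=3)
    err = np.sqrt(np.mean((np.polynomial.polynomial.polyval(x_ref, coef) - h(x_ref)) ** 2))
    print(f"N={N:6d}  error={err:.3e}  best approximation error={best_err:.3e}")
```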

4 Deep adaptive sampling method

We now focus on the reduction of the statistical error when machine learning techniques are used to approximate PDEs. Our deep adaptive sampling method, i.e., the DAS method, will be established from the viewpoint of variance reduction since the statistical error is induced by the Monte Carlo approximation of the loss. Consider the term $J_{r}(u)$ in equation (2). It is easy to see that if $r^{2}(\boldsymbol{x})$ is a smooth function with good regularity, the most effective way to reduce the error of $J_{N}(u)$ is to increase the sample size $N$. However, if the solution has low regularity, the situation might be different. For example, if the residual is strongly localized, the scenario can be regarded as a rare event. Assume that the residual $r^{2}(\boldsymbol{x})$ in equation (2) behaves similarly to an indicator function $1_{I}(\boldsymbol{x})$, where $I\subset\Omega$ is much smaller than $\Omega$, i.e.,

\[
\zeta=\int_{\Omega}1_{I}(\boldsymbol{x})d\boldsymbol{x}\approx\int_{\Omega}r^{2}(\boldsymbol{x})d\boldsymbol{x}\ll 1.
\]

For simplicity, we assume that $|\Omega|=1$ when presenting the algorithm. To improve the approximation in $I$, we need to compute $\zeta$ accurately. Consider a Monte Carlo estimator of $\zeta$ in terms of uniform samples

\[
\hat{P}_{\mathsf{MC}}=\frac{1}{N}\sum_{i=1}^{N}1_{I}(\boldsymbol{x}^{(i)}).
\]

The relative error of $\hat{P}_{\mathsf{MC}}$ is

\[
\frac{\mathrm{Var}^{1/2}(\hat{P}_{\mathsf{MC}})}{\zeta}=N^{-1/2}\left((1-\zeta)/\zeta\right)^{1/2}\approx(\zeta N)^{-1/2}.
\]

We then need a sample size of $O(1/\zeta)$ to obtain a relative error of $O(1)$. This implies that a certain sample size $N$ quickly becomes less effective if the residual is strongly localized, and this problem is exacerbated in the approximation of high-dimensional PDEs. To reduce the relative error, we need to choose more effective random samples instead of uniform samples.
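The scaling $(\zeta N)^{-1/2}$ can be checked empirically; the sketch below (an illustration, not from the paper) estimates the volume of a small square $I=[0.45,0.55]^{2}$ inside $\Omega=[0,1]^{2}$ with uniform samples and reports the observed relative error.

```python
import numpy as np

# Relative error of the Monte Carlo estimator \hat{P}_MC for a small region I.
rng = np.random.default_rng(0)
zeta = 0.1 ** 2                                  # exact volume of I = [0.45, 0.55]^2
for N in [10**3, 10**4, 10**5]:
    estimates = []
    for _ in range(200):                         # repeat to estimate the variance
        x = rng.uniform(0.0, 1.0, size=(N, 2))
        inside = np.all((x > 0.45) & (x < 0.55), axis=1)
        estimates.append(inside.mean())
    rel_err = np.std(estimates) / zeta
    print(f"N={N:7d}  observed relative error={rel_err:.3f}  (zeta*N)^(-1/2)={1/np.sqrt(zeta*N):.3f}")
```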

4.1 Some ideas on variance reduction

We outline our basic ideas on variance reduction in this section and more details about the algorithm will be presented later. We first consider the importance sampling technique. For simplicity, we only consider $J_{r}(u(\boldsymbol{x};\Theta))$ in equation (2), which is the loss induced by the residual. We have

\[
J_{r}\left(u(\boldsymbol{x};\Theta)\right)=\mathbb{E}[r^{2}]=\int_{\Omega}r^{2}(\boldsymbol{x};\Theta)d\boldsymbol{x}=\int_{\Omega}\frac{r^{2}(\boldsymbol{x};\Theta)}{p(\boldsymbol{x})}p(\boldsymbol{x})d\boldsymbol{x}\approx\frac{1}{N_{r}}\sum\limits_{i=1}^{N_{r}}\frac{r^{2}(\boldsymbol{x}_{\Omega}^{(i)};\Theta)}{p(\boldsymbol{x}_{\Omega}^{(i)})},
\tag{11}
\]

where the set $\{\boldsymbol{x}_{\Omega}^{(i)}\}_{i=1}^{N_{r}}$ is generated with respect to the probability density function (PDF) $p(\boldsymbol{x})$ instead of a uniform distribution as in equation (3). If the variance of $r^{2}(\boldsymbol{X})p^{-1}(\boldsymbol{X})$ in terms of $p(\boldsymbol{x})$ is smaller than the variance of $r^{2}(\boldsymbol{X})$ in terms of the uniform distribution, the accuracy of the Monte Carlo approximation will be improved for a fixed sample size $N_{r}$. The optimal choice for $p(\boldsymbol{x})$ is

\[
p^{*}(\boldsymbol{x})=\frac{r^{2}(\boldsymbol{x};\Theta)}{\mu},
\tag{12}
\]

where $\mu=\int_{\Omega}r^{2}(\boldsymbol{x};\Theta)d\boldsymbol{x}$. Although the optimal choice is of no practical use since $\mu$ is exactly the quantity to be computed, it suggests that variance reduction can be achieved if $p(\boldsymbol{x})$ is close to the residual-induced distribution.
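The following numpy sketch illustrates this point on a one-dimensional toy residual: a sharply localized $r^{2}$ is integrated once with uniform Monte Carlo and once with importance sampling using a truncated Gaussian proposal close to $p^{*}$; the toy problem, proposal and sample size are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Uniform MC vs. importance sampling for J_r = int_0^1 r^2 dx with a localized residual.
rng = np.random.default_rng(0)
r2 = lambda x: np.exp(-1000.0 * (x - 0.5) ** 2)

N = 2000
xu = rng.uniform(0.0, 1.0, N)                         # uniform collocation points
est_uniform = r2(xu).mean()

sigma = 0.03                                          # proposal close to p^* (a narrow bump at 0.5)
a, b = -0.5 / sigma, 0.5 / sigma                      # standardized truncation bounds for [0, 1]
xp = stats.truncnorm.rvs(a, b, loc=0.5, scale=sigma, size=N, random_state=0)
p = stats.truncnorm.pdf(xp, a, b, loc=0.5, scale=sigma)
est_is = (r2(xp) / p).mean()                          # importance-sampling estimator (11)

exact = np.sqrt(np.pi / 1000.0)                       # Gaussian integral; tails outside [0, 1] negligible
print(f"exact={exact:.6f}  uniform MC={est_uniform:.6f}  importance sampling={est_is:.6f}")
```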

Lemma 2.

Assume that $|\Omega|=1$ and $p(\boldsymbol{x})$ is a PDF satisfying

\[
D_{\mathsf{KL}}(p\|p^{*})\leq\varepsilon<\infty,
\]

where $D_{\mathsf{KL}}$ indicates the Kullback-Leibler divergence. For any $0<a<\infty$, we have

\[
\mathbb{E}\left|Q_{p}[r^{2}]-\mathbb{E}[r^{2}]\right|\leq aN_{r}^{-1/2}+2\|r^{2}/p\|_{p}\sqrt{\mathbb{P}(|r^{2}/p-\mu|>a;p)},
\tag{13}
\]

where

\[
Q_{p}(r^{2})=\frac{1}{N_{r}}\sum_{i=1}^{N_{r}}\frac{r^{2}(\boldsymbol{X}^{(i)})}{p(\boldsymbol{X}^{(i)})},
\]

and $\boldsymbol{X}^{(i)}\sim p(\boldsymbol{x})$ are i.i.d. random variables. $\|\cdot\|_{p}$ is the weighted $L_{2}$ norm in terms of $p$. The tail probability can be bounded as

\[
\mathbb{P}(|r^{2}/p-\mu|>a;p)\leq\frac{\mu(2\varepsilon)^{1/2}}{a}.
\]

When $p(\boldsymbol{x})$ is close to $p^{*}(\boldsymbol{x})$, the ratio $r^{2}(\boldsymbol{x})/p(\boldsymbol{x})$ should be close to $\mu$. The second term on the right-hand side of inequality (13) is related to the tail probability $\mathbb{P}(|r^{2}/p-\mu|>a;p)$ in terms of $p(\boldsymbol{x})$. It is seen that if the tail probability is small enough, a variance reduction can be achieved by choosing a small $a$.

Another option for variance reduction is to relax the definition of $J_{r}(u)$ as:

\[
J_{r,p}(u(\boldsymbol{x};\Theta))=\int_{\Omega}r^{2}(\boldsymbol{x};\Theta)p(\boldsymbol{x})d\boldsymbol{x}\approx\frac{1}{N_{r}}\sum_{i=1}^{N_{r}}r^{2}(\boldsymbol{x}_{\Omega}^{(i)};\Theta),
\tag{14}
\]

where the set $\{\boldsymbol{x}_{\Omega}^{(i)}\}_{i=1}^{N_{r}}$ is sampled from the PDF $p(\boldsymbol{x})$ and $p(\boldsymbol{x})>0$ on $\Omega$. It is easy to see that the minimizer of $J_{r,p}(u(\boldsymbol{x};\Theta))$ is also the solution of problem (1) when the boundary conditions are satisfied. To reduce the error induced by the Monte Carlo approximation, we may adjust $p(\boldsymbol{x})$ such that the residual $r^{2}(\boldsymbol{x};\Theta)$ does not vary dramatically. This is similar to the classical adaptive finite element method, where mesh refinement/coarsening is supposed to make the approximation error nearly uniform. For our case, we may sample a PDF that is close to the residual-induced distribution and add more samples from the region of large residual into the training set.
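The two options lead to two different discrete residual losses, sketched below for a batch of collocation points; `residual` and `likelihood` are hypothetical helpers standing for $r(\boldsymbol{x};\Theta)$ evaluated at the samples and the values $p(\boldsymbol{x})$ of the sampling density at the same points.

```python
import torch

def loss_importance_sampling(residual, x, likelihood):
    # Equation (11): unbiased estimate of J_r; the samples x are drawn from p
    # and the squared residual is reweighted by 1/p(x).
    return (residual(x) ** 2 / likelihood).mean()

def loss_weighted_measure(residual, x):
    # Equation (14): J_{r,p}; the samples x are drawn from p and no reweighting
    # is applied, so the loss is defined with respect to the measure p(x) dx.
    return (residual(x) ** 2).mean()
```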

4.2 PDF approximation and sample generation

Two options for variance reduction are considered in the previous section. To make both ideas practical, the key issue is to generate samples efficiently from a PDF $p(\boldsymbol{x})\approx\mu^{-1}r^{2}(\boldsymbol{x};\Theta)$ for a fixed $\Theta$. The first option is based on importance sampling, meaning that an explicit PDF model $p(\boldsymbol{x})$ is needed. The second option is based on refinement of the training set, which only needs samples from the region of large residual and does not need the likelihood of the samples. The second option has been employed in the recent literature to improve the neural network approximation, and is either ad hoc [42] (only for low-dimensional problems) or based on traditional sampling strategies such as MCMC [43]. To the best of our knowledge, there does not exist a generic algorithm that can be effectively adapted to both options. We intend to fill this gap in this work. It is seen from Lemma 2 that the tail probability $\mathbb{P}(|r^{2}/p-\mu|>a;p)$ should be small enough for the effectiveness of $p(\boldsymbol{x})$, which means that $p(\boldsymbol{x})$ must be close enough to the distribution induced by $r^{2}(\boldsymbol{x})$. However, the approximation of a PDF is a challenging task, especially in high-dimensional spaces. Classical explicit PDF models such as the exponential family of distributions and Gaussian mixture models are in general not sufficient as approximators for the PDF induced by $r^{2}(\boldsymbol{x})$. To alleviate this difficulty, we resort to deep generative models. In particular, we employ a recently developed deep generative model called KRnet for both probability approximation and sample generation [22, 27]. KRnet is one type of normalizing flow [44], which provides an invertible transport map between a prior distribution and the target distribution. Unlike other types of deep generative models such as GAN [45, 46, 47] and VAE [48], normalizing flows provide an explicit likelihood. KRnet can be used to address both options for variance reduction, while other deep generative models may also be employed if only the second option for variance reduction is considered. However, one computational issue faced by all deep generative models is that the distribution induced by $r^{2}(\boldsymbol{x})$ is usually defined on a compact domain while deep generative models are in general defined on the whole space. We subsequently address this issue without going into details about the structure of KRnet.

Let $\boldsymbol{X}\in\mathbb{R}^{d}$ be a random vector associated with a given data set, and its PDF is denoted by $p_{\boldsymbol{X}}(\boldsymbol{x})$. The target is to estimate $p_{\boldsymbol{X}}(\boldsymbol{x})$ from data or to generate samples that are consistent with a given $p_{\boldsymbol{X}}(\boldsymbol{x})$. Let $\boldsymbol{Z}\in\mathbb{R}^{d}$ be a random vector associated with a PDF $p_{\boldsymbol{Z}}(\boldsymbol{z})$, where $p_{\boldsymbol{Z}}(\boldsymbol{z})$ is a prior distribution (e.g., Gaussian distributions). The flow-based generative modeling is to seek an invertible mapping $\boldsymbol{z}=f(\boldsymbol{x})$ [31]. By the change of variables, we have the PDF of $\boldsymbol{X}=f^{-1}(\boldsymbol{Z})$ as

\[
p_{\boldsymbol{X}}(\boldsymbol{x})=p_{\boldsymbol{Z}}(f(\boldsymbol{x}))\left|\det\nabla_{\boldsymbol{x}}f\right|.
\tag{15}
\]

Once the prior distribution $p_{\boldsymbol{Z}}(\boldsymbol{z})$ is specified, equation (15) provides an explicit density model for $\boldsymbol{X}$. The inverse of $f(\cdot)$ provides a convenient way to sample $\boldsymbol{X}$ as $\boldsymbol{X}=f^{-1}(\boldsymbol{Z})$. The basic idea of KRnet is to define the structure of $f(\boldsymbol{x})$ in terms of the Knothe-Rosenblatt rearrangement. Let $\mu_{\boldsymbol{Z}}$ and $\mu_{\boldsymbol{X}}$ be the probability measures of two random variables $\boldsymbol{Z},\boldsymbol{X}\in\mathbb{R}^{d}$, respectively. A mapping $\mathcal{T}:\boldsymbol{Z}\mapsto\boldsymbol{X}$ is called a transport map such that $\mathcal{T}_{\#}\mu_{\boldsymbol{Z}}=\mu_{\boldsymbol{X}}$, where $\mathcal{T}_{\#}\mu_{\boldsymbol{Z}}$ is the push-forward of $\mu_{\boldsymbol{Z}}$ such that $\mu_{\boldsymbol{X}}(B)=\mu_{\boldsymbol{Z}}(\mathcal{T}^{-1}(B))$ for every Borel set $B$ [49]. The transport map $\mathcal{T}$ given by the Knothe-Rosenblatt rearrangement [49, 30] has a lower-triangular structure

\[
\boldsymbol{z}=\mathcal{T}^{-1}(\boldsymbol{x})=\left[\begin{array}{l}\mathcal{T}_{1}(x_{1})\\ \mathcal{T}_{2}(x_{1},x_{2})\\ \vdots\\ \mathcal{T}_{d}(x_{1},\ldots,x_{d})\end{array}\right].
\tag{16}
\]

Simply speaking, KRnet integrates the triangular structure of the K-R rearrangement into the definition of the invertible transport map. More details can be found in [27, 22, 29].
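The change of variables (15) can be sanity-checked with a toy invertible map; the sketch below (not part of KRnet) uses the entrywise affine map $\boldsymbol{z}=f(\boldsymbol{x})=(\boldsymbol{x}-\boldsymbol{\mu})/\boldsymbol{\sigma}$, for which the induced density is exactly a Gaussian with mean $\boldsymbol{\mu}$ and covariance $\mathrm{diag}(\boldsymbol{\sigma}^{2})$.

```python
import numpy as np
from scipy import stats

# Change of variables for an entrywise affine flow: p_X(x) = p_Z(f(x)) |det grad f|,
# with f(x) = (x - mu) / sigma, so |det grad f| = prod(1 / sigma_i).
mu = np.array([0.3, -0.1])
sigma = np.array([0.5, 2.0])

def log_px(x):
    z = (x - mu) / sigma
    log_det = -np.sum(np.log(sigma))                  # log |det grad_x f|
    return stats.multivariate_normal.logpdf(z, mean=np.zeros(2)) + log_det

x = np.array([0.7, 1.2])
direct = stats.multivariate_normal.logpdf(x, mean=mu, cov=np.diag(sigma ** 2))
print(log_px(x), direct)                              # the two values agree
```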

Let $f_{\mathsf{KRnet}}(\cdot;\Theta_{f})$ indicate the invertible transport map induced by KRnet, where $\Theta_{f}$ includes the model parameters. An explicit PDF model $p_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$ can be obtained by letting $f=f_{\mathsf{KRnet}}$ in equation (15), i.e.,

\[
p_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})=p_{\boldsymbol{Z}}(f_{\mathsf{KRnet}}(\boldsymbol{x}))\left|\det\nabla_{\boldsymbol{x}}f_{\mathsf{KRnet}}\right|.
\tag{17}
\]

Samples of $p_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$ are given as $\boldsymbol{X}=f^{-1}_{\mathsf{KRnet}}(\boldsymbol{Z})$ by sampling $\boldsymbol{Z}$. A common choice for the distribution of $\boldsymbol{Z}$ is the standard Gaussian distribution. Depending on the prior knowledge of the problem, a more general model such as a Gaussian mixture model can also be used as the prior distribution. The KRnet $f_{\mathsf{KRnet}}$ does not have any constraint on the range of the mapped data, meaning that both $\boldsymbol{X}$ and $\boldsymbol{Z}$ are defined on $\mathbb{R}^{d}$. Let $\boldsymbol{Z}$ be Gaussian. Due to the invertibility, $|\det\nabla_{\boldsymbol{x}}f_{\mathsf{KRnet}}|>0$ for any $\boldsymbol{x}\in\mathbb{R}^{d}$, which implies that $p_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})>0$ for any $\boldsymbol{x}\in\mathbb{R}^{d}$. So $p_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$ is not consistent with the distribution induced by $r^{2}(\boldsymbol{x})$, which is equal to zero on $\mathbb{R}^{d}\backslash\Omega$. To deal with this issue, we propose the following strategy.

Without loss of generality, we can assume that $\Omega=[-1/2,1/2]^{d}$. Let $B=(-(1/2+\delta/2),1/2+\delta/2)^{d}$ with $0<\delta<\infty$ such that $\Omega\subset B$. For each dimension of $\boldsymbol{x}$, we define the following logarithmic mapping

\[
y=\ell(x)=\frac{s}{2}\log\frac{2x+(1+\delta)}{(1+\delta)-2x},\quad x=\ell^{-1}(y)=\frac{1+\delta}{2}\,\frac{e^{2y/s}-1}{e^{2y/s}+1},
\]

with $s>0$ being a scale parameter, which defines a one-to-one correspondence between $x\in(-(1/2+\delta/2),1/2+\delta/2)$ and $y\in(-\infty,+\infty)$. Let $\boldsymbol{\ell}(\boldsymbol{x}):B\mapsto\mathbb{R}^{d}$ be a $d$-dimensional mapping such that

\[
\ell_{i}(x_{i})=\ell(x_{i}),\quad i=1,\ldots,d.
\]

Then the following invertible mapping

\[
\boldsymbol{z}=f_{\mathsf{KRnet}}\circ\boldsymbol{\ell}(\boldsymbol{x})
\tag{18}
\]

defines a PDF

\[
\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})=p_{\mathsf{KRnet}}(\boldsymbol{\ell}(\boldsymbol{x});\Theta_{f})|\nabla_{\boldsymbol{x}}\boldsymbol{\ell}(\boldsymbol{x})|,
\tag{19}
\]

where the support of $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$ is $B$.

We now consider a modification of $r^{2}(\boldsymbol{x};\Theta)$. Define a cutoff function as

\[
h(\boldsymbol{x})=\left\{\begin{array}{rl}1,&\boldsymbol{x}\in\Omega,\\ \prod_{i=1}^{d}h_{\delta}(x_{i}),&\boldsymbol{x}\in B\backslash\Omega,\end{array}\right.
\]

where $h_{\delta}(x)$ is a piecewise linear function

\[
h_{\delta}(x)=\left\{\begin{array}{rl}1,&x\in[-1/2,1/2],\\ \delta^{-1}(2x+1+\delta),&x\in(-1/2-\delta/2,-1/2),\\ \delta^{-1}(1+\delta-2x),&x\in(1/2,1/2+\delta/2),\\ 0,&x\in(-\infty,-1/2-\delta/2]\cup[1/2+\delta/2,\infty).\end{array}\right.
\]

We consider a modified PDF for any $\Theta$

\[
\hat{r}_{\boldsymbol{X}}(\boldsymbol{x})\propto r^{2}(\boldsymbol{x};\Theta)h(\boldsymbol{x}).
\tag{20}
\]

Note that both $\hat{r}_{\boldsymbol{X}}(\boldsymbol{x})$ and $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$ have the support $B$. We then solve the following optimization problem

\[
\Theta_{f}^{*}=\arg\min_{\Theta_{f}}D_{\mathsf{KL}}(\hat{r}_{\boldsymbol{X}}(\boldsymbol{x})\|\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})),
\tag{21}
\]

where $D_{\mathsf{KL}}(\cdot\|\cdot)$ indicates the Kullback-Leibler (KL) divergence between two distributions. We finally use

\[
p_{\boldsymbol{X}}(\boldsymbol{x})\propto\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*})1_{\Omega}(\boldsymbol{x})
\tag{22}
\]

as an approximation of the PDF induced by $r^{2}(\boldsymbol{x};\Theta)$. If $\delta\ll 1$, most of the samples $\boldsymbol{\ell}^{-1}\circ f^{-1}_{\mathsf{KRnet}}(\boldsymbol{z}^{(i)};\Theta_{f}^{*})$ will be located in $\Omega$. Since $\hat{r}_{\boldsymbol{X}}(\boldsymbol{x})\propto r^{2}(\boldsymbol{x};\Theta)$ on $\Omega$, $p_{\boldsymbol{X}}(\boldsymbol{x})$ approximates the $r^{2}(\boldsymbol{x};\Theta)$-induced PDF well when $\delta$ is small. In our numerical experiments, we set $\delta=0.01$ and $s=2$.
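A minimal numpy sketch of the entrywise map $\ell$, its inverse, and the cutoff $h_{\delta}$, using the values $\delta=0.01$ and $s=2$ from the experiments; the vectorized form of $h_{\delta}$ via clipping is an implementation choice, not prescribed by the paper.

```python
import numpy as np

delta, s = 0.01, 2.0

def ell(x):                 # B -> R^d, applied entrywise
    return 0.5 * s * np.log((2 * x + (1 + delta)) / ((1 + delta) - 2 * x))

def ell_inv(y):             # R^d -> B, applied entrywise
    t = np.exp(2 * y / s)
    return 0.5 * (1 + delta) * (t - 1) / (t + 1)

def h_delta(x):             # piecewise linear cutoff, entrywise
    return np.clip(np.minimum(2 * x + 1 + delta, 1 + delta - 2 * x) / delta, 0.0, 1.0)

x = np.random.default_rng(0).uniform(-0.5, 0.5, size=(4, 3))
assert np.allclose(ell_inv(ell(x)), x)                          # ell is invertible on B
print(h_delta(np.array([0.0, 0.5, 0.5 + delta / 4, 0.5 + delta / 2])))   # [1., 1., 0.5, 0.]
```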

We now look at the approximation of $\Theta_{f}^{*}$. The KL divergence in the optimization problem (21) can be written as

\[
D_{\mathsf{KL}}(\hat{r}_{\boldsymbol{X}}(\boldsymbol{x})\|\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}))=\int_{B}\hat{r}_{\boldsymbol{X}}\log\hat{r}_{\boldsymbol{X}}d\boldsymbol{x}-\int_{B}\hat{r}_{\boldsymbol{X}}\log\hat{p}_{\mathsf{KRnet}}d\boldsymbol{x}.
\tag{23}
\]

The first term on the right-hand side corresponds to the differential entropy of $\hat{r}_{\boldsymbol{X}}$, which does not affect the optimization with respect to $\Theta_{f}$. So minimizing the KL divergence is equivalent to minimizing the cross entropy between $\hat{r}_{\boldsymbol{X}}$ and $\hat{p}_{\mathsf{KRnet}}$ [50, 51]:

\[
H(\hat{r}_{\boldsymbol{X}},\hat{p}_{\mathsf{KRnet}})=-\int_{B}\hat{r}_{\boldsymbol{X}}\log\hat{p}_{\mathsf{KRnet}}d\boldsymbol{x}.
\tag{24}
\]

Since the samples from $\hat{r}_{\boldsymbol{X}}$ are not available, we approximate the cross entropy using the importance sampling technique:

\[
H(\hat{r}_{\boldsymbol{X}},\hat{p}_{\mathsf{KRnet}})\approx-\frac{1}{N_{r}}\sum_{i=1}^{N_{r}}\frac{\hat{r}_{\boldsymbol{X}}(\boldsymbol{x}_{B}^{(i)})}{\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{B}^{(i)};\hat{\Theta}_{f})}\log\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{B}^{(i)};\Theta_{f}),
\tag{25}
\]

where $\hat{p}(\boldsymbol{x};\hat{\Theta}_{f})$ is a PDF model with known parameter $\hat{\Theta}_{f}$ and its samples $\{\boldsymbol{x}_{B}^{(i)}\}_{i=1}^{N_{r}}$ can be generated efficiently as

\[
\boldsymbol{x}_{B}^{(i)}=\boldsymbol{\ell}^{-1}\circ f^{-1}_{\mathsf{KRnet}}(\boldsymbol{z}^{(i)};\hat{\Theta}_{f})
\tag{26}
\]

with $\boldsymbol{z}^{(i)}$ being sampled from the prior distribution. We then minimize the discretized cross entropy (25) to obtain an approximation of $\Theta_{f}^{*}$. The choice for $\hat{\Theta}_{f}$ will be specified in section 4.3 when the adaptive sampling method is defined.
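One stochastic-gradient step on the discretized cross entropy (25) might look as follows; `flow`, `flow_prev` and `r2h` are hypothetical objects standing for the current KRnet model, a frozen copy with parameters $\hat{\Theta}_{f}$, and the function $r^{2}(\boldsymbol{x};\Theta)h(\boldsymbol{x})$ with $\Theta$ fixed.

```python
import torch

# Sketch of one update on the Monte Carlo cross entropy (25). The `flow` objects are
# assumed to expose log_prob(x) and sample(n) for the composite map (18); they are
# hypothetical placeholders, not an actual KRnet API.
def cross_entropy_step(flow, flow_prev, r2h, optimizer, n_samples=500):
    with torch.no_grad():
        x = flow_prev.sample(n_samples)                  # x_B^{(i)} from the previous model
        w = r2h(x) / flow_prev.log_prob(x).exp()         # importance weights r_hat / p_hat_old
    loss = -(w * flow.log_prob(x)).mean()                # discretized cross entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```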

Remark 1.

An alternative approach for the approximation of $\hat{r}_{\boldsymbol{X}}(\boldsymbol{x})$ is to minimize the following KL divergence:

\[
D_{\mathsf{KL}}(\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})\|\hat{r}_{\boldsymbol{X}}(\boldsymbol{x}))=\int_{B}\hat{p}_{\mathsf{KRnet}}\log\hat{p}_{\mathsf{KRnet}}d\boldsymbol{x}-\int_{B}\hat{p}_{\mathsf{KRnet}}\log\hat{r}_{\boldsymbol{X}}d\boldsymbol{x},
\tag{27}
\]

which can be approximated by samples from $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$. Note that the KL divergence is asymmetric. Minimizing the KL divergence (23) is not equivalent to the minimization of the KL divergence (27), although both minimizers will be achieved at $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})=\hat{r}_{\boldsymbol{X}}(\boldsymbol{x})$ if $\hat{r}_{\boldsymbol{X}}$ can be reached exactly by a certain parameter $\Theta_{f}$.

4.3 Adaptive sampling procedure

We are now ready to present our algorithms. In this work, we mainly focus on the adaptivity of $\mathsf{S}_{\Omega}$ for simplicity. The key step of our adaptivity strategy is to improve the effectiveness of the random samples in the training set $\mathsf{S}_{\Omega}$, and we provide two algorithms corresponding to the two options discussed in section 4.1.

  I. We replace all the collocation points in the current training set using new samples from the probability measure for importance sampling. This corresponds to equation (11).

  II. We gradually add more collocation points to the current training set. This corresponds to equation (14), where the new samples are mainly from the region of large residual.

We first present strategy I. Let $\mathsf{S}_{\Omega,0}=\{\boldsymbol{x}_{\Omega,0}^{(i)}\}_{i=1}^{N_{r}}$ and $\mathsf{S}_{\partial\Omega,0}$ be two sets of collocation points that are uniformly sampled from $\Omega$ and $\partial\Omega$ respectively. Using $\mathsf{S}_{\Omega,0}$ and $\mathsf{S}_{\partial\Omega,0}$, we minimize the empirical loss (3) to obtain $u(\boldsymbol{x};\Theta_{N}^{*,(1)})$. With $u(\boldsymbol{x};\Theta_{N}^{*,(1)})$, we minimize the cross entropy (25) to get $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(1)})$, where we simply use uniform samples for importance sampling. To refine the training set, a new set $\mathsf{S}_{\Omega,1}=\{\boldsymbol{x}_{\Omega,1}^{(i)}\}_{i=1}^{N_{r}}$ is generated by $\boldsymbol{\ell}^{-1}\circ f^{-1}_{\mathsf{KRnet}}(\boldsymbol{z}^{(i)};\Theta_{f}^{*,(1)})$ (see equation (18)). Then we continue to update the approximate solution $u(\boldsymbol{x};\Theta_{N}^{*,(1)})$ using $\mathsf{S}_{\Omega,1}$ as the training set. In general, we use $\mathsf{S}_{\Omega,k}=\{\boldsymbol{x}_{\Omega,k}^{(i)}\}_{i=1}^{N_{r}}$ to obtain $u(\boldsymbol{x};\Theta_{N}^{*,(k+1)})$ as

\[
\Theta_{N}^{*,(k+1)}=\arg\min_{\Theta}J^{\mathsf{IS}}_{N}(u(\boldsymbol{x};\Theta)),
\]

where $u(\boldsymbol{x};\Theta)$ is initialized as $u(\boldsymbol{x};\Theta_{N}^{*,(k)})$ and $J^{\mathsf{IS}}_{N}$ is defined as

\[
J^{\mathsf{IS}}_{N}(u(\boldsymbol{x};\Theta))=\frac{1}{N_{r}}\sum\limits_{i=1}^{N_{r}}\frac{r^{2}(\boldsymbol{x}_{\Omega,k}^{(i)};\Theta)}{\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{\Omega,k}^{(i)};\Theta_{f}^{*,(k)})}+\frac{1}{N_{b}}\sum\limits_{i=1}^{N_{b}}b^{2}(\boldsymbol{x}_{\partial\Omega,k}^{(i)};\Theta).
\tag{28}
\]

Starting with $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k)})$, the density model $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$ is updated as

\[
\Theta_{f}^{*,(k+1)}=\arg\min_{\Theta_{f}}-\frac{1}{N_{r}}\sum_{i=1}^{N_{r}}\frac{r^{2}(\boldsymbol{x}_{B,k}^{(i)};\Theta_{N}^{*,(k+1)})h(\boldsymbol{x}_{B,k}^{(i)})}{\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{B,k}^{(i)};\Theta_{f}^{*,(k)})}\log\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{B,k}^{(i)};\Theta_{f}),
\tag{29}
\]

where we let $\hat{\Theta}_{f}=\Theta_{f}^{*,(k)}$ in equation (25), i.e., the previous PDF model $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k)})$ is used for importance sampling when computing the cross entropy. A new set $\mathsf{S}_{\Omega,k+1}=\{\boldsymbol{x}_{\Omega,k+1}^{(i)}\}_{i=1}^{N_{r}}$ of collocation points is then generated. As detailed in section 4.2, the support of data points generated by KRnet is $B=(-(1/2+\delta/2),1/2+\delta/2)^{d}$, while the computation domain is $\Omega=[-1/2,1/2]^{d}$. So we need to deal with the collocation points located in $B\backslash\Omega$. Instead of neglecting these points, we project them onto $\partial\Omega$. We define an entry-wise projection operator $\mathcal{P}(\boldsymbol{x}):B\mapsto\Omega$ as

\[
\mathcal{P}(x_{i})=\begin{cases}-1/2,&\mathrm{if}\ x_{i}<-1/2,\\ x_{i},&\mathrm{if}\ -1/2\leq x_{i}\leq 1/2,\\ 1/2,&\mathrm{if}\ x_{i}>1/2,\end{cases}\quad i=1,\ldots,d.
\tag{30}
\]

For a sequence of i.i.d. samples $\boldsymbol{z}^{(j)}$, $j=1,2,\ldots$, generated from the standard Gaussian, we compute $\boldsymbol{x}_{B}^{(j)}=\boldsymbol{\ell}^{-1}\circ f^{-1}_{\mathsf{KRnet}}(\boldsymbol{z}^{(j)})$. If $\boldsymbol{x}_{B}^{(j)}=\mathcal{P}(\boldsymbol{x}_{B}^{(j)})$, we assign $\boldsymbol{x}_{B}^{(j)}$ to $\mathsf{S}_{\Omega,k+1}$; otherwise, we add $\mathcal{P}(\boldsymbol{x}^{(j)}_{B})$ to $\mathsf{S}_{\partial\Omega,k+1}$. The updated training sets $\mathsf{S}_{\Omega,k+1}$ and $\mathsf{S}_{\partial\Omega,k+1}$ will be used for the next training stage. This procedure is repeated until the stopping criterion is satisfied (see Algorithm 1). Since the collocation points in $\mathsf{S}_{\Omega,k}$ will be completely replaced at the next stage, we call this type of deep adaptive sampling strategy DAS-R for short. The alternative approach given in Remark 1 can also be used to obtain $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k)})$.
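The refinement step of DAS-R can be sketched as follows, where `flow_inverse` and `ell_inv` are hypothetical callables for $f^{-1}_{\mathsf{KRnet}}$ and $\boldsymbol{\ell}^{-1}$; points left unchanged by the projection (30) are kept as interior collocation points, and the rest are projected onto the boundary.

```python
import numpy as np

def refine_training_set(flow_inverse, ell_inv, n_samples, d, rng):
    # Draw z from the standard Gaussian prior and map it back to B via (26).
    z = rng.standard_normal((n_samples, d))
    x_B = ell_inv(flow_inverse(z))                     # candidate points in B
    x_proj = np.clip(x_B, -0.5, 0.5)                   # entrywise projection P(x) onto Omega
    interior = np.all(x_B == x_proj, axis=1)           # x_B already inside Omega
    # Interior points refine S_Omega; projected points are added to the boundary set.
    return x_proj[interior], x_proj[~interior]
```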

We now look at strategy II. Unlike DAS-R, the number of collocation points in the training set $\mathsf{S}_{\Omega}$ increases gradually, so we denote this type of deep adaptive sampling strategy by DAS-G for short. Starting with an initial set of collocation points $\mathsf{S}_{\Omega,0}=\{\boldsymbol{x}_{\Omega}^{(i)}\}_{i=1}^{n_{r}}$ (as well as $\mathsf{S}_{\partial\Omega,0}$) drawn from a uniform distribution defined on $\Omega$, we minimize the empirical loss (3) on the training set $\mathsf{S}_{\Omega,0}$ (as well as $\mathsf{S}_{\partial\Omega,0}$) to obtain $u(\boldsymbol{x};\Theta_{N}^{*,(1)})$. Once we have $u(\boldsymbol{x};\Theta_{N}^{*,(1)})$ in hand, we can seek $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(1)})$ using the residual $r^{2}(\boldsymbol{x};\Theta_{N}^{*,(1)})$. Here we use uniform samples to approximate and minimize the cross entropy (25). Similar to DAS-R, a new set of collocation points $\mathsf{S}^{g}_{\Omega,1}=\{\boldsymbol{x}_{\Omega,1}^{g,(i)}\}_{i=1}^{n_{r}}$ is generated by $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(1)})$; the main difference is that we update the training set as $\mathsf{S}_{\Omega,1}=\mathsf{S}_{\Omega,0}\cup\mathsf{S}^{g}_{\Omega,1}$, in other words, $\mathsf{S}_{\Omega,0}$ is augmented rather than replaced by $\mathsf{S}_{\Omega,1}^{g}$. We continue to update $u(\boldsymbol{x};\Theta)$ using $\Theta_{N}^{*,(1)}$ as the initial parameters and $\mathsf{S}_{\Omega,1}$ as the training set, which yields a refined model $u(\boldsymbol{x};\Theta_{N}^{*,(2)})$. Starting from $k=2$, we seek $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k)})$ using the approach given in Remark 1. We repeat the procedure to obtain an adaptive algorithm (see Algorithm 2).

Our two adaptive training methods are summarized in Algorithm 1 (DAS-R) and Algorithm 2 (DAS-G), where $N_{\rm adaptive}$ is a given maximum number of adaptivity iterations, $m$ is the batch size for the stochastic gradient, and $N_{e}$ is the number of epochs for training $u(\boldsymbol{x};\Theta)$ and $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$. The algorithms consist of three steps: solving the PDE, training KRnet, and refining the training set.

Remark 2.

In both strategies, we use uniform samples to approximate and minimize the cross entropy to obtain $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(1)})$, after which either equation (25) or equation (27) can be employed. The main reason for doing this is to use uniform samples to capture the modes when the residual-induced distribution is multimodal. The obtained PDF $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(1)})$ provides a good initialization when equation (27) is employed to seek $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(i)})$ with $i>1$. This choice is usually not necessary if the modes are not strongly localized or if the location of the modes can be encoded into the prior distribution of KRnet through a Gaussian mixture model.

Algorithm 1 DAS-R for PDEs
Require: Initial $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{(0)})$, $u(\boldsymbol{x};\Theta^{(0)})$, maximum epoch number $N_{e}$, batch size $m$, initial training set $\mathsf{S}_{\Omega,0}=\{\boldsymbol{x}_{\Omega,0}^{(i)}\}_{i=1}^{N_{r}}$ and $\mathsf{S}_{\partial\Omega,0}=\{\boldsymbol{x}_{\partial\Omega,0}^{(i)}\}_{i=1}^{N_{b}}$.
1:  for $k=0:N_{\rm adaptive}-1$ do
2:     // Solve PDE
3:     for $i=1:N_{e}$ do
4:        for $j$ steps do
5:           Sample $m$ samples from $\mathsf{S}_{\Omega,k}$.
6:           Sample $m$ samples from $\mathsf{S}_{\partial\Omega,k}$.
7:           Update $u(\boldsymbol{x};\Theta)$ by descending the stochastic gradient of $J^{\mathsf{IS}}_{N}(u(\boldsymbol{x};\Theta))$ (see equation (28)).
8:        end for
9:     end for
10:     // Train KRnet
11:     for $i=1:N_{e}$ do
12:        for $j$ steps do
13:           Sample $m$ samples from $\mathsf{S}_{\Omega,k}$.
14:           Update $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$ by descending the stochastic gradient of $H(\hat{r}_{\boldsymbol{X}},\hat{p}_{\mathsf{KRnet}})$ (see equation (25)).
15:        end for
16:     end for
17:     // Refine training set
18:     Generate $\mathsf{S}_{\Omega,k+1}\subset\Omega$ through $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k+1)})$.
19:  end for
Output: $u(\boldsymbol{x};\Theta_{N}^{*})$

Algorithm 2 DAS-G for PDEs
Require: Initial $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{(0)})$, $u(\boldsymbol{x};\Theta^{(0)})$, maximum epoch number $N_{e}$, batch size $m$, initial training set $\mathsf{S}_{\Omega,0}=\{\boldsymbol{x}_{\Omega,0}^{(i)}\}_{i=1}^{n_{r}}$ and $\mathsf{S}_{\partial\Omega,0}=\{\boldsymbol{x}_{\partial\Omega,0}^{(i)}\}_{i=1}^{N_{b}}$.
1:  for $k=0:N_{\rm adaptive}-1$ do
2:     // Solve PDE
3:     for $i=1:N_{e}$ do
4:        for $j$ steps do
5:           Sample $m$ samples from $\mathsf{S}_{\Omega,k}$.
6:           Sample $m$ samples from $\mathsf{S}_{\partial\Omega,k}$.
7:           Update $u(\boldsymbol{x};\Theta)$ by descending the stochastic gradient of $J_{N}(u(\boldsymbol{x};\Theta))$ (see equation (3)).
8:        end for
9:     end for
10:     // Train KRnet
11:     for $i=1:N_{e}$ do
12:        for $j$ steps do
13:           Sample $m$ samples from $\mathsf{S}_{\Omega,k}$.
14:           Update $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f})$ by descending the stochastic gradient of $H(\hat{r}_{\boldsymbol{X}},\hat{p}_{\mathsf{KRnet}})$ (see equation (25)).
15:        end for
16:     end for
17:     // Refine training set
18:     Generate $\mathsf{S}^{g}_{\Omega,k+1}\subset\Omega$ with size $n_{r}$ through $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k+1)})$.
19:     $\mathsf{S}_{\Omega,k+1}=\mathsf{S}_{\Omega,k}\cup\mathsf{S}^{g}_{\Omega,k+1}$.
20:  end for
Output: $u(\boldsymbol{x};\Theta_{N}^{*})$
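Stripped of the inner training loops, the outer structure of Algorithm 2 can be summarized by the following sketch; `train_pinn`, `train_krnet` and `sample_krnet` are hypothetical helpers corresponding to the three steps of the algorithm.

```python
import numpy as np

# High-level sketch of the outer loop of Algorithm 2 (DAS-G). The callables
# train_pinn, train_krnet and sample_krnet stand for minimizing J_N on the
# current training set, minimizing the cross entropy (25), and drawing n_r
# new collocation points from p_hat_KRnet, respectively.
def das_g(u_net, krnet, s_omega, s_boundary, n_adaptive, n_r,
          train_pinn, train_krnet, sample_krnet):
    for k in range(n_adaptive):
        u_net = train_pinn(u_net, s_omega, s_boundary)      # solve PDE
        krnet = train_krnet(krnet, u_net, s_omega)          # fit the residual-induced density
        s_new = sample_krnet(krnet, n_r)                    # new points in regions of large residual
        s_omega = np.concatenate([s_omega, s_new], axis=0)  # augment (DAS-G), not replace (DAS-R)
    return u_net
```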

5 Analysis of DAS

As discussed in section 4.3, the key point of our DAS method is to achieve variance reduction for the discretization of the residual loss, based on which we expect to improve the accuracy of the approximate solution. Under certain conditions, we show that the expectation of the error bound becomes smaller when the adaptive sampling strategy is employed.

Assumption 1.

[35] In problem (1), we let $F=\mathscr{H}$ be a Hilbert space and $\mathcal{L}$ a linear operator. Assume that the differential operator $\mathcal{L}$ and the boundary operator $\mathfrak{b}$ satisfy

\[
C_{1}\left\|v\right\|_{2,\Omega}\leq\left\|\mathcal{L}v\right\|_{2,\Omega}+\left\|\mathfrak{b}v\right\|_{2,\partial\Omega}\leq C_{2}\left\|v\right\|_{2,\Omega}\quad\forall v\in\mathscr{H},
\tag{31}
\]

where $\mathscr{H}$ is a Hilbert space defined on $\Omega$ and the positive constants $C_{1}$ and $C_{2}$ are independent of $v$.

The above condition is called the stability bound [35], which is essential to the existence and uniqueness of problem (1). In addition to this assumption, the following two assumptions on the relationship between $r(\boldsymbol{x};\Theta_{N}^{*,(k)})$ and $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k)})$ are made.

Assumption 2.

Assume that $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k)})$ is the optimal candidate for the change of measure in equation (11),

\[
\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k)})=c_{k}r^{2}(\boldsymbol{x};\Theta_{N}^{*,(k)}),
\tag{32}
\]

where $c_{k}=1/\int_{\Omega}r^{2}(\boldsymbol{x};\Theta_{N}^{*,(k)})d\boldsymbol{x}$ is the normalization constant.

Assumption 3.

Let

\[
R_{k}=\frac{1}{N_{r}}\sum\limits_{i=1}^{N_{r}}\frac{r^{2}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{N}^{*,(k)})}{\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{f}^{*,(k-1)})}
\]

be the discrete residual loss at the $k$-th stage, where each $\boldsymbol{x}_{\Omega}^{(i)}$ is drawn from $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k-1)})$. Assume that $r^{2}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{N}^{*,(k)})/\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{f}^{*,(k-1)})\in[\tau_{1},\tau_{2}]$ almost surely for each $i=1,2,\ldots,N_{r}$.

At each adaptivity stage, the error of the approximate solution is estimated as follows.

Theorem 1.

Let $u(\boldsymbol{x};\Theta_{N}^{*,(k)})\in F$ be a solution of (3) where the collocation points are independently drawn from $\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k-1)})$. Suppose that Assumption 1 and Assumption 3 are satisfied. Given $0<\varepsilon<1$, the following error estimate holds

\[
\left\|u(\boldsymbol{x};\Theta_{N}^{*,(k)})-u(\boldsymbol{x})\right\|_{2,\Omega}\leq\sqrt{2}\,C_{1}^{-1}\left(R_{k}+\varepsilon+\left\|b(\boldsymbol{x};\Theta_{N}^{*,(k)})\right\|_{2,\partial\Omega}^{2}\right)^{\frac{1}{2}}
\]

with probability at least $1-\exp(-2N_{r}\varepsilon^{2}/(\tau_{2}-\tau_{1})^{2})$; this probability follows from Hoeffding's inequality applied to the bounded ratios in Assumption 3.

The approximate solutions at two adjacent adaptivity stages satisfy:

Corollary 1.

Under the same conditions of Theorem 1, suppose that Assumption 2 is satisfied and the boundary loss $J_{b}(u)$ is zero, then the following inequality holds

\[
\mathbb{E}(R_{k+1})\leq\mathbb{E}(R_{k}).
\]

Theorem 1 provides an error estimate for the approximate solution similar to the results in [40]. Corollary 1 outlines the error behavior of a sequence of approximate solutions induced by adaptivity. However, it is not quite straightforward to quantify the decay of the error due to adaptive refinement. For example, for DAS-R the reduction in variance is up to a tail probability as shown in Lemma 2 and for DAS-G the loss is changed at each adaptivity stage due to the modification of the training set. These issues are left for future study.

6 Numerical experiments

In this section, we conduct numerical experiments to demonstrate the effectiveness of the proposed DAS method, including two low-dimensional and low regularity test problems, one high-dimensional linear test problem, and one high-dimensional nonlinear test problem. Due to the curse of dimensionality, data are sparse in high-dimensional spaces [20, 19, 21], which implies that effective samples should be able to deal with localized information. We mainly use the low-dimensional and low regularity test problems to demonstrate that the sampling strategy significantly affects the performance of the neural network approximation if the residual is strongly localized. For comparison, we also test the performance of a heuristic method based on residual refinement [52, 42] (see section 6.2 and section 6.3) for high-dimensional problems, where the heuristic method searches uniform samples to find the points with large residuals and adds them to the current training set. The heuristic method is similar to DAS-G, where the main difference is that the selection of new samples in DAS-G is completely guided by an optimization problem while the heuristic method relies partially on the user's intuition. All deep neural network models are trained by the ADAM method [34]. The penalty parameter in equation (3) is set to $\hat{\gamma}=1$. The activation function of $u(\boldsymbol{x};\Theta)$ is set to the hyperbolic tangent function. The activation function of KRnet is the rectified linear unit (ReLU) function since we only use KRnet for density approximation.

6.1 Low-dimensional and low regularity test problems

In this part, two-dimensional low regularity problems are considered, where the solution of the first one has a peak and the solution of the second one has two peaks.

6.1.1 Two-dimensional peak problem

The following elliptic equation is considered

-\Delta u(x_{1},x_{2})=s(x_{1},x_{2})\quad\text{in}\ \Omega, (33)
u(x_{1},x_{2})=g(x_{1},x_{2})\quad\text{on}\ \partial\Omega,

where the computational domain is \Omega=[-1,1]^{2}. In order to quantify the error, we use the following reference solution

u(x_{1},x_{2})=\mathrm{exp}\left(-1000[(x_{1}-r_{c})^{2}+(x_{2}-r_{c})^{2}]\right),

which has a peak at the point (r_{c},r_{c}) and decreases rapidly away from (r_{c},r_{c}). This test problem is often used to test the performance of adaptive finite element methods [53, 24].

We choose a six-layer fully connected neural network u(\boldsymbol{x};\Theta) with 32 neurons to approximate the solution. For KRnet, we take L=6 affine coupling layers, and two fully connected layers with 24 neurons for each affine coupling layer. The number of epochs for training both u(\boldsymbol{x};\Theta) and p(\boldsymbol{x};\Theta_{f}) is set to N_{e}=3000. The learning rate for the ADAM optimizer is set to 0.0001, and the batch size is set to m=500. Here, we set (r_{c},r_{c})=(0.5,0.5). To assess the effectiveness of our DAS methods, we generate a uniform meshgrid of size 256\times 256 in [-1,1]^{2} and compute the mean square error on these grid points.
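The error metric can be sketched as follows (an illustration of our own; u_nn stands for the trained network evaluated on NumPy arrays): the exact peak solution is evaluated on the uniform 256\times 256 grid and compared with the surrogate in the mean square sense.

import numpy as np

def peak_exact(x, rc=0.5):
    # reference solution with a peak at (rc, rc)
    return np.exp(-1000.0 * ((x[:, 0] - rc) ** 2 + (x[:, 1] - rc) ** 2))

def grid_mse(u_nn, n=256):
    t = np.linspace(-1.0, 1.0, n)
    xx, yy = np.meshgrid(t, t, indexing="ij")
    pts = np.column_stack([xx.ravel(), yy.ravel()])
    return np.mean((u_nn(pts) - peak_exact(pts)) ** 2)

# e.g. the trivial zero surrogate:
print(grid_mse(lambda x: np.zeros(x.shape[0])))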

In Figure 1, we plot the approximation errors given by different sampling strategies with respect to the sample size in the left plot and with respect to the number of epochs in the right plot. For each |\mathsf{S}_{\Omega}|, we take three runs with different random seeds for initialization and compute the mean error of the three runs as the final error. For the DAS strategies, the numbers of adaptivity iterations are set to N_{\rm{adaptive}}=4,6,8,10 for |\mathsf{S}_{\Omega}|=2\times 10^{3},3\times 10^{3},4\times 10^{3},5\times 10^{3} respectively, and n_{r}=500 is set for the DAS-G strategy (see section 4.3). For the uniform sampling strategy, the number of epochs is set to be the same as the total number of epochs of each DAS method. It is clear that for this test problem the DAS methods (DAS-G and DAS-R) have a better performance than the uniform sampling strategy and DAS-R performs better than DAS-G. For the same sample size, both DAS-R and DAS-G yield a smaller error than the uniform sampling method. In terms of the number of epochs, the errors of DAS-R and DAS-G decay in a more consistent way than those of the uniform sampling method.
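The adaptivity loop behind these results can be summarized by the following schematic Python sketch (pseudocode-like; density_model stands for KRnet, and its train/sample methods are hypothetical names rather than an actual API). In DAS-R the training set is regenerated from the density model at each stage, whereas DAS-G would instead augment the current set with n_{r} new samples.

import numpy as np

def das_r(u_net, density_model, train_pde, residual_sq, n_samples, n_adaptive):
    # stage 0: uniform collocation points on Omega = [-1, 1]^2
    S = np.random.uniform(-1.0, 1.0, size=(n_samples, 2))
    for k in range(n_adaptive):
        train_pde(u_net, S)                                   # minimize the residual loss on S
        density_model.train(lambda x: residual_sq(u_net, x))  # fit the residual-induced density
        S = density_model.sample(n_samples)                   # DAS-R: replace S with new samples
    return u_net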

In Figure 2 we compare the exact solution, the DAS solutions given by 5\times 10^{3} nonuniform samples and the approximate solution given by 5\times 10^{3} uniform samples. It is seen that the DAS methods are much more effective than the uniform sampling method at capturing the information in the region of low regularity. Figure 3 shows the error evolution of DAS-R at different adaptivity iteration steps. It is seen that the approximation error drops as the adaptivity iteration step k increases, which is consistent with Corollary 1, and the relaxation time for the optimization iterations decreases as well. Figure 4 shows the evolution of the training sets (|\mathsf{S}_{\Omega}|=5\times 10^{3}) of the DAS-R method with respect to adaptivity iterations k=1,4,7,9, where the initial training set \mathsf{S}_{\Omega,0} consists of uniform collocation points on \Omega (see section 4.3). It is seen that the largest density of \mathsf{S}_{\Omega,1} for DAS-R is around [0.5,0.5] since \mathsf{S}_{\Omega,k} is consistent with the residual-induced distribution. However, we also expect that the tail of the residual-induced distribution becomes heavier as k increases, since the adaptivity tries to make the residual-induced distribution more uniform, which is illustrated by \mathsf{S}_{\Omega,4}, \mathsf{S}_{\Omega,7} and \mathsf{S}_{\Omega,9}.

Figure 1: Approximation errors for the two-dimensional peak test problem. Left: The error w.r.t. sample size |\mathsf{S}_{\Omega}|; Right: The error w.r.t. epoch for |\mathsf{S}_{\Omega}|=5\times 10^{3}.
(a) The exact solution.
(b) DAS-R approximation.
(c) DAS-G approximation.
(d) Approximation using the uniform sampling strategy.
Figure 2: Solutions, two-dimensional peak test problem.
Figure 3: The errors of DAS-R at certain adaptivity iteration steps for the two-dimensional peak test problem. |\mathsf{S}_{\Omega}|=5\times 10^{3}.
Figure 4: The evolution of \mathsf{S}_{\Omega,k} in DAS-R for the two-dimensional peak test problem.

6.1.2 Two-dimensional test problem with two peaks

In this test problem, we consider the following equation

-\nabla\cdot\left[u(x_{1},x_{2})\nabla(x_{1}^{2}+x_{2}^{2})\right]+\nabla^{2}u(x_{1},x_{2})=s(x_{1},x_{2})\quad\text{in}\ \Omega, (34)
u(x_{1},x_{2})=g(x_{1},x_{2})\quad\text{on}\ \partial\Omega,

where the computational domain is \Omega=[-1,1]^{2}. The exact solution of (34) is chosen as

u(x_{1},x_{2})=\mathrm{e}^{-1000[(x_{1}-0.5)^{2}+(x_{2}-0.5)^{2}]}+\mathrm{e}^{-1000[(x_{1}+0.5)^{2}+(x_{2}+0.5)^{2}]},

which has two peaks, at the points (0.5,0.5) and (-0.5,-0.5). Here, the Dirichlet boundary condition on \partial\Omega is given by the exact solution.
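For completeness, the source term s(x_{1},x_{2}) consistent with this manufactured solution can be generated by automatic differentiation; the following PyTorch sketch is our own construction, only meant to illustrate s=-\nabla\cdot[u\nabla(x_{1}^{2}+x_{2}^{2})]+\nabla^{2}u.

import torch

def u_exact(x):
    # exact two-peak solution used above
    p1 = torch.exp(-1000.0 * ((x[:, 0] - 0.5) ** 2 + (x[:, 1] - 0.5) ** 2))
    p2 = torch.exp(-1000.0 * ((x[:, 0] + 0.5) ** 2 + (x[:, 1] + 0.5) ** 2))
    return (p1 + p2).unsqueeze(-1)

def source_term(x):
    # s = -div( u * grad(x1^2 + x2^2) ) + Laplacian(u), assembled by autograd
    x = x.clone().requires_grad_(True)
    u = u_exact(x)
    flux = u * 2.0 * x                       # u * grad(x1^2 + x2^2) = (2*x1*u, 2*x2*u)
    grad_u = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    div_flux, lap_u = 0.0, 0.0
    for i in range(2):
        div_flux = div_flux + torch.autograd.grad(flux[:, i].sum(), x, create_graph=True)[0][:, i:i + 1]
        lap_u = lap_u + torch.autograd.grad(grad_u[:, i].sum(), x, create_graph=True)[0][:, i:i + 1]
    return (-div_flux + lap_u).detach()

x = 2.0 * torch.rand(5, 2) - 1.0             # a few points in [-1, 1]^2
print(source_term(x).shape)                   # torch.Size([5, 1])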

We choose a six-layer fully connected neural network u(\boldsymbol{x};\Theta) with 64 neurons to approximate the solution of (34). For KRnet, we take L=8 affine coupling layers, and two fully connected layers with 48 neurons for each affine coupling layer. The number of epochs for training both u(\boldsymbol{x};\Theta) and p(\boldsymbol{x};\Theta_{f}) is set to N_{e}=5000. The learning rate for the ADAM optimizer is set to 0.0001, and the batch size is set to m=500. Again, we generate a uniform meshgrid of size 256\times 256 in [-1,1]^{2} and compute the mean square error on these grid points to assess the effectiveness of our DAS methods.

Figure 5 shows the approximation errors for this test problem, where the left plot displays the errors with respect to the sample size |\mathsf{S}_{\Omega}| for different sampling strategies, and the right plot shows the error evolution of DAS-G at different adaptivity iteration steps. For each |\mathsf{S}_{\Omega}|, we again take three runs with different random seeds for initialization and compute the mean error of the three runs as the final error. For the DAS-G strategy, the number of adaptivity iterations is set to N_{\rm{adaptive}}=5 (also for DAS-R), and the numbers of collocation points in \mathsf{S}^{g}_{\Omega,k} (k=1,2,3,4) are set to n_{r}=500,1\times 10^{3},1.5\times 10^{3},2\times 10^{3} for |\mathsf{S}_{\Omega}|=2.5\times 10^{3},5\times 10^{3},7.5\times 10^{3},10^{4} respectively. For the uniform sampling strategy, we train the model with 2.5\times 10^{4} epochs to match the total number of epochs of the DAS methods. From Figure 5, it is seen that for this test problem our DAS methods (DAS-G and DAS-R) have a better performance than the uniform sampling strategy and DAS-G performs better than DAS-R. It is also seen that the error decreases as the adaptivity iteration step k increases.

In Figure 6 we compare the exact solution, the DAS solutions given by 10^{4} nonuniform samples and the approximate solution given by 10^{4} uniform samples. It is seen that the DAS methods are much more effective than the uniform sampling method at capturing the information around the two peaks. Figure 7 shows the evolution of \mathsf{S}^{g}_{\Omega,k} of the DAS-G method with respect to adaptivity iterations k=1,2,3,4 (|\mathsf{S}^{g}_{\Omega,k}|=2\times 10^{3}), where the initial training set \mathsf{S}_{\Omega,0} consists of uniform collocation points on \Omega (see section 4.3). \mathsf{S}_{\Omega,1}^{g} shows that the error profile has two peaks. After the training set is augmented with \mathsf{S}_{\Omega,1}, the error profile becomes flatter, as shown by the distribution of \mathsf{S}_{\Omega,2}^{g}. After the training set is augmented with \mathsf{S}_{\Omega,2}, the largest error is found again around the two peaks, and the subsequent augmentation of the training set yields a flatter error profile. Such a pattern is repeated until no further improvement can be reached.

Figure 5: Approximation errors for the two-dimensional test problem with two peaks. Left: The error w.r.t. sample size |\mathsf{S}_{\Omega}|; Right: The errors of DAS-G at each adaptivity iteration step. |\mathsf{S}_{\Omega}|=10^{4}.
(a) The exact solution.
(b) DAS-R approximation.
(c) DAS-G approximation.
(d) Approximation using the uniform sampling strategy.
Figure 6: Solutions, two-dimensional test problem with two peaks.
Figure 7: The evolution of \mathsf{S}^{g}_{\Omega,k} in DAS-G for the two-dimensional test problem with two peaks.

6.2 High-dimensional linear test problems

Next we consider the d-dimensional elliptic equation

-\Delta u(\boldsymbol{x})=s(\boldsymbol{x}),\quad\boldsymbol{x}\ \text{in}\ \Omega=[-1,1]^{d}, (35)

with an exact solution

u(\boldsymbol{x})=\mathrm{e}^{-10\left\|\boldsymbol{x}\right\|_{2}^{2}},

where the Dirichlet boundary condition on \partial\Omega is given by the exact solution. We are interested in cases with a large d>3. Note that the geometric properties of high-dimensional spaces are significantly different from our intuitions on low-dimensional ones, e.g., most of the volume of a high-dimensional cube is located around its corners [20, 19, 21]. If we use uniform samples to generate \mathsf{S}_{\Omega}, most of the collocation points in \mathsf{S}_{\Omega} are near the surface of the hypercube. Since the information of the exact solution is mainly from the neighborhood of the origin, most of the samples in \mathsf{S}_{\Omega} may not contribute to training the neural network when d is large enough.
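This concentration effect is easy to check numerically; the short script below (an illustration only, not part of our experiments) estimates how rarely uniform samples on [-1,1]^{d} fall into the small box [-0.1,0.1]^{d} where the peak of the exact solution is located.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 4, 6, 8, 10):
    x = rng.uniform(-1.0, 1.0, size=(10**6, d))
    frac = np.mean(np.all(np.abs(x) <= 0.1, axis=1))
    print(f"d = {d:2d}: empirical fraction = {frac:.2e}, exact = {0.1**d:.2e}")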

We choose a six-layer fully connected neural network u(\boldsymbol{x};\Theta) with 64 neurons to approximate the solution. For KRnet, we set K=3 and take L=6 affine coupling layers, and two fully connected layers with 64 neurons for each affine coupling layer. The number of epochs for training both u(\boldsymbol{x};\Theta) and p(\boldsymbol{x};\Theta_{f}) is set to N_{e}=3000. The learning rate for the ADAM optimizer is set to 0.0001, and the batch size is set to m=5000. The number of adaptivity iterations is set to N_{\rm{adaptive}}=5. To measure the quality of approximation, we generate a tensor grid with n_{t}^{d} points around the origin (in [-0.1,0.1]^{d}), where n_{t} is the number of nodes for each dimension. We define the relative error

\text{Relative error}=\frac{\left\|\boldsymbol{u}_{\mathsf{NN}}-\boldsymbol{u}\right\|_{2}}{\left\|\boldsymbol{u}\right\|_{2}},

where \boldsymbol{u}_{\mathsf{NN}} and \boldsymbol{u} denote two vectors whose elements are the function values of u(\boldsymbol{x};\Theta) and u(\boldsymbol{x}) at the tensor grid respectively.
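A sketch of this evaluation (our own illustration; u_nn is a placeholder for the trained network) assembles the tensor grid in [-0.1,0.1]^{d} and computes the relative error against the exact solution of (35).

import numpy as np

def relative_error(u_nn, d=4, nt=16):
    # tensor grid with nt^d points in [-0.1, 0.1]^d
    axes = [np.linspace(-0.1, 0.1, nt)] * d
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, d)
    u_ref = np.exp(-10.0 * np.sum(grid**2, axis=1))   # exact solution of (35)
    return np.linalg.norm(u_nn(grid) - u_ref) / np.linalg.norm(u_ref)

# e.g. the trivial zero surrogate gives relative error 1:
print(relative_error(lambda x: np.zeros(x.shape[0]), d=4, nt=6))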

We first investigate the relation between the error and the dimensionality d when the uniform sampling strategy is employed. Figure 8(a) shows the relative errors in terms of a varying d for a sample size |\mathsf{S}_{\Omega}|=2\times 10^{5}. To roughly match the number of grid points for different d, we set n_{t}=16,6,4,3 for d=4,6,8,10 respectively. It is seen that the relative error grows quickly to \mathit{O}(1) as d increases. However, as shown in Figure 8(b), all training losses are finally close to zero for d=4,6,8,10. This is consistent with the fact that in a high-dimensional space most of the uniform samples are located around the boundary, where the solution is close to zero. The optimizer is then in favor of the trivial solution since there are not sufficient samples to resolve the peak at the origin. This phenomenon demonstrates that the uniform sampling method may become less effective as d increases and that the convergence of the approximate solution is highly dependent on the choice of \mathsf{S}_{\Omega} for a large d.

Figure 9 shows the relative errors for the uniform sampling strategy, the residual-based adaptive refinement (RAR) method proposed in [52], DAS-R and DAS-G, where different numbers of samples |\mathsf{S}_{\Omega}| are considered. For each |\mathsf{S}_{\Omega}|, we again take three runs with different random seeds for initialization and compute the mean error of the three runs as the final error. For the DAS-G strategy, the numbers of collocation points in \mathsf{S}^{g}_{\Omega,k} (k=1,2,3,4) are set to n_{r}=10^{4},2\times 10^{4},3\times 10^{4},4\times 10^{4} for |\mathsf{S}_{\Omega}|=5\times 10^{4},10^{5},1.5\times 10^{5},2\times 10^{5} respectively. For the uniform sampling strategy, we train the model with 1.5\times 10^{4} epochs to match the total number of epochs of the DAS methods. For the heuristic method RAR, the numbers of collocation points in \mathsf{S}^{g}_{\Omega,k} (k=1,2,3,4) are set to n_{r}=5\times 10^{3},10^{4},1.5\times 10^{4},2.5\times 10^{4} for |\mathsf{S}_{\Omega}|=5\times 10^{4},10^{5},1.5\times 10^{5},2\times 10^{5} respectively. From Figure 9, it can be seen that both DAS-G and DAS-R improve the accuracy significantly compared to the uniform sampling strategy and RAR. In addition, the error of DAS-G decreases slightly faster than that of DAS-R for this test problem. In Figure 10 we compare the error evolution of different sampling strategies. From the left plot of Figure 10, as the number of epochs increases, the errors of DAS-G and DAS-R decrease quickly, while the errors of RAR and the uniform sampling strategy do not decrease. This result suggests that for high-dimensional problems the DAS methods are able to achieve a good approximation with a relatively small number of nonuniform samples, while many more uniform samples are needed for the same accuracy. The right plot of Figure 10 shows the error of DAS-G at each adaptivity iteration step k. It is seen that the error drops dramatically after we refine the solution using \mathsf{S}_{\Omega,1}.
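For reference, the RAR-style heuristic used in this comparison can be sketched as follows (a rough illustration under our own naming; residual_sq stands for r^{2}(\boldsymbol{x};\Theta) evaluated on a batch of points): a large pool of uniform candidates is drawn, the n_{r} points with the largest residuals are kept, and they are appended to the current training set.

import numpy as np

def rar_refine(train_set, residual_sq, n_r, n_candidates=100000, d=10, rng=None):
    # keep the n_r candidate points with the largest residuals and append them
    rng = rng if rng is not None else np.random.default_rng()
    pool = rng.uniform(-1.0, 1.0, size=(n_candidates, d))
    idx = np.argsort(residual_sq(pool))[-n_r:]
    return np.concatenate([train_set, pool[idx]], axis=0)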

Figures 11 and 12 show 3000 samples from the training sets (|\mathsf{S}_{\Omega}|=2\times 10^{5}) of DAS-R and DAS-G for the first four adaptivity iterations, where the components x_{6} and x_{7} are used for visualization. We have also checked the other components, and no significantly different results were found. For DAS-R, 3000 samples are randomly chosen from \mathsf{S}_{\Omega,k} (k=1,2,3,4). For DAS-G, 3000 samples for visualization are randomly selected from \mathsf{S}^{g}_{\Omega,k} (k=1,2,3,4). It is seen that the profile of \mathsf{S}_{\Omega,k} is gradually flattened as k increases, meaning that the nonuniform samples are able to smooth the error profile, which has a peak around the origin. As for DAS-G, the improvement takes a similar path. \mathsf{S}_{\Omega,1}^{g} shows that the error profile has a peak around the origin. After the training set is augmented with \mathsf{S}_{\Omega,1}, the error profile becomes flatter, as shown by the distribution of \mathsf{S}_{\Omega,2}^{g}. This is expected since more collocation points are added to the neighborhood of the origin, which should reduce the error there. Such a pattern is similar to what we have observed in Figure 7. In Figure 13, we compare the evolution of the variance of the residual for the training with 5\times 10^{4} samples. We estimate the variance of the residual using 59049 grid points around the origin (these points are also used to compute the relative errors in the above discussion). It is clear that both DAS-R and DAS-G achieve significant variance reduction compared with RAR, which helps reduce the statistical error dramatically for a fixed sample size. Looking more closely, the variance of DAS-R has a transition between two consecutive adaptivity iterations, resulting in the oscillation of errors for DAS-R observed in the left plot of Figure 10. From Figure 13, it can be seen that DAS-G appears more robust than DAS-R for this test problem. We may adjust the communication pattern between the PDE model and the PDF model to reduce the oscillations, which will be left for future study.

(a) Error
(b) Loss
Figure 8: The convergence behavior for high-dimensional PDEs with the uniform sampling method. The loss is close to zero, but the error is still large for the ten-dimensional test problem.
Figure 9: The error w.r.t. sample size |\mathsf{S}_{\Omega}|, ten-dimensional linear test problem.
Figure 10: The error evolution of different sampling strategies with |\mathsf{S}_{\Omega}|=2\times 10^{5} and d=10 (high-dimensional linear test problem). Left: A comparison of DAS-G, DAS-R and the uniform sampling method; Right: The error evolution of DAS-G at different adaptivity iteration steps.
Figure 11: The evolution of \mathsf{S}_{\Omega,k} in DAS-R, ten-dimensional linear test problem.
Figure 12: The evolution of \mathsf{S}^{g}_{\Omega,k} in DAS-G, ten-dimensional linear test problem.
Figure 13: The evolution of the variance of the residual, ten-dimensional linear test problem.

6.3 High-dimensional nonlinear test problem

In this part, the ten-dimensional nonlinear partial differential equation considered is

-\Delta u(\boldsymbol{x})+u(\boldsymbol{x})-u^{3}(\boldsymbol{x})=s(\boldsymbol{x}),\quad\boldsymbol{x}\ \text{in}\ \Omega=[-1,1]^{10}. (36)

The exact solution is set to be the same as that of (35), and the Dirichlet boundary condition on \partial\Omega is given by the exact solution. The settings of u(\boldsymbol{x};\Theta) and KRnet are the same as those in section 6.2.
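The residual of the nonlinear operator in (36) can be assembled by automatic differentiation in the same way as for the linear case; a minimal PyTorch sketch (our own illustration; u_net and source are placeholders) is given below.

import torch

def residual_nonlinear(u_net, x, source):
    # r(x) = -Laplacian(u) + u - u^3 - s(x), with the Laplacian assembled by autograd
    x = x.clone().requires_grad_(True)
    u = u_net(x)
    grad_u = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    lap_u = 0.0
    for i in range(x.shape[1]):
        lap_u = lap_u + torch.autograd.grad(grad_u[:, i].sum(), x, create_graph=True)[0][:, i:i + 1]
    return -lap_u + u - u ** 3 - source(x)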

The observations are similar to those for the high-dimensional linear problem. Figure 14 shows the relative errors for the uniform sampling strategy, RAR, DAS-R and DAS-G, where different numbers of samples |\mathsf{S}_{\Omega}| are considered. Again, we take three runs with different random seeds for initialization and compute the mean error of the three runs as the final error for each |\mathsf{S}_{\Omega}|. For the DAS-G strategy, the numbers of collocation points in \mathsf{S}^{g}_{\Omega,k} (k=1,2,3,4) are set to the same values as those in section 6.2. For the uniform sampling strategy, we train the model with 1.5\times 10^{4} epochs to match the total number of epochs of the DAS methods. For the heuristic method RAR, the numbers of collocation points in \mathsf{S}^{g}_{\Omega,k} (k=1,2,3,4) are set to n_{r}=5\times 10^{3},10^{4},1.5\times 10^{4},2.5\times 10^{4} for |\mathsf{S}_{\Omega}|=5\times 10^{4},10^{5},1.5\times 10^{5},2\times 10^{5} respectively. Both DAS-G and DAS-R improve the accuracy significantly compared to the uniform sampling strategy and RAR. In Figure 15 we compare the error evolution of different sampling strategies. Similar to the high-dimensional linear problem, the errors of DAS-G and DAS-R decrease quickly while the errors of the uniform sampling strategy and RAR do not decrease. The error behavior of DAS-G at each adaptivity iteration step k is shown in the right plot of Figure 15. It is seen that the approximation is significantly improved when the adaptivity iteration step k increases from 0 to 1. Figures 16 and 17 show 3000 samples from the training sets (|\mathsf{S}_{\Omega}|=2\times 10^{5}) of DAS-R and DAS-G for the first four adaptivity iterations, where the components x_{6} and x_{7} are used for visualization. For DAS-R, 3000 samples are randomly chosen from \mathsf{S}_{\Omega,k} (k=1,2,3,4). For DAS-G, 3000 samples for visualization are randomly selected from \mathsf{S}^{g}_{\Omega,k} (k=1,2,3,4). Both DAS-R and DAS-G flatten the error profile through adaptive sampling, as we have observed in Figures 11 and 12. Figure 18 shows the evolution of the variance of the residual for DAS-R, DAS-G and RAR. The behavior is similar to that in Figure 13.

Figure 14: The error w.r.t. sample size |\mathsf{S}_{\Omega}|, ten-dimensional nonlinear test problem.
Figure 15: The error evolution of different sampling strategies with |\mathsf{S}_{\Omega}|=2\times 10^{5} and d=10 (high-dimensional nonlinear test problem). Left: A comparison of DAS-G, DAS-R and the uniform sampling method; Right: The error evolution of DAS-G at different adaptivity iteration steps.
Figure 16: The evolution of \mathsf{S}_{\Omega,k} in DAS-R, ten-dimensional nonlinear test problem.
Figure 17: The evolution of \mathsf{S}^{g}_{\Omega,k} in DAS-G, ten-dimensional nonlinear test problem.
Figure 18: The evolution of the variance of the residual, ten-dimensional nonlinear test problem.

7 Conclusion

In this paper we have developed a deep adaptive sampling (DAS) method and coupled it with physics-informed neural networks (PINNs) to improve the neural network approximation of PDEs iteratively. The key idea of DAS is to employ a deep generative model to generate collocation points that are consistent with the distribution induced by an appropriate error indicator function. In this way, the training set is refined according to the regularity of the PDE solution, which follows a principle similar to the adaptive mesh refinement of classical numerical methods. Numerical experiments have shown that the DAS method is able to significantly improve the accuracy for the approximation of low regularity problems, especially when the dimensionality is relatively large. The proposed DAS method provides a very general and flexible framework for an adaptive learning strategy. There are several possible ways to further improve it. First, DAS consists of two DNN-based models: one model serves as an approximator for the PDE solution and the other one serves as an error indicator for the selection of collocation points. Both models can be chosen in terms of a certain criterion. In this work, we use a regular DNN for the PDE approximation and KRnet for density approximation and sample generation. Second, the underlying distribution for the training set can be problem dependent. In this work, we choose the residual-induced distribution. We may also use the gradient of the approximate solution to define an indicator distribution. In [22], we employ KRnet to approximate the Fokker-Planck equation, where the collocation points are sampled from the approximate solution. Third, the DAS method is not limited to steady-state PDE problems. We may employ the DAS method on the space-time domain to refine the training set for the approximation of time-dependent problems. Last but not least, the current training process can also be improved. Although the current DAS methods work well enough to demonstrate the effectiveness of the algorithm, many questions remain open, e.g., what is the optimal way for the two deep models to communicate and what is the optimal sample size for \mathsf{S}_{\Omega,k}^{g}. Research on these issues will be reported in forthcoming papers.


Acknowledgments: K. Tang has been supported by the China Postdoctoral Science Foundation grant 2022M711730. X. Wan has been supported by NSF grant DMS-1913163. C. Yang has been supported by NSFC grant 12131002.

References

  • [1] J. Han, A. Jentzen, E. Weinan, Solving high-dimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences 115 (34) (2018) 8505–8510.
  • [2] E. Weinan, The dawning of a new era in applied mathematics, Notices of the American Mathematical Society 68 (4) (2021) 565–571.
  • [3] G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, L. Yang, Physics-informed machine learning, Nature Reviews Physics 3 (6) (2021) 422–440.
  • [4] W. E, B. Yu, The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems, Communications in Mathematics and Statistics 6 (1) (2018) 1–12.
  • [5] E. Kharazmi, Z. Zhang, G. E. Karniadakis, Variational physics-informed neural networks for solving partial differential equations, arXiv preprint arXiv:1912.00873.
  • [6] Y. Zhu, N. Zabaras, P.-S. Koutsourelakis, P. Perdikaris, Physics-constrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data, Journal of Computational Physics 394 (2019) 56–81.
  • [7] E. Kharazmi, Z. Zhang, G. E. Karniadakis, hp-VPINNs: Variational physics-informed neural networks with domain decomposition, Computer Methods in Applied Mechanics and Engineering 374 (2021) 113547.
  • [8] J. Sirignano, K. Spiliopoulos, DGM: A deep learning algorithm for solving partial differential equations, Journal of Computational Physics 375 (2018) 1339–1364.
  • [9] M. Raissi, P. Perdikaris, G. E. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics 378 (2019) 686–707.
  • [10] G. Pang, L. Lu, G. E. Karniadakis, fPINNs: Fractional physics-informed neural networks, SIAM Journal on Scientific Computing 41 (4) (2019) A2603–A2626.
  • [11] I. E. Lagaris, A. Likas, D. I. Fotiadis, Artificial neural networks for solving ordinary and partial differential equations, IEEE Transactions on Neural Networks 9 (5) (1998) 987–1000.
  • [12] M. Dissanayake, N. Phan-Thien, Neural-network-based approximations for solving partial differential equations, Communications in Numerical Methods in Engineering 10 (3) (1994) 195–201.
  • [13] K. Li, K. Tang, T. Wu, Q. Liao, D3M: A deep domain decomposition method for partial differential equations, IEEE Access 8 (2020) 5283–5294.
  • [14] W. Li, X. Xiang, Y. Xu, Deep domain decomposition method: Elliptic problems, in: J. Lu, R. Ward (Eds.), Proceedings of The First Mathematical and Scientific Machine Learning Conference, Vol. 107 of Proceedings of Machine Learning Research, PMLR, Princeton University, Princeton, NJ, USA, 2020, pp. 269–286.
  • [15] S. Dong, Z. Li, Local extreme learning machines and domain decomposition for solving linear and nonlinear partial differential equations, Computer Methods in Applied Mechanics and Engineering 387 (2021) 114129.
  • [16] H. Gao, L. Sun, J.-X. Wang, Phygeonet: Physics-informed geometry-adaptive convolutional neural networks for solving parameterized steady-state PDEs on irregular domain, Journal of Computational Physics 428 (2021) 110079.
  • [17] H. Sheng, C. Yang, PFNN: A penalty-free neural network method for solving a class of second-order boundary-value problems on complex geometries, Journal of Computational Physics (2020) 110085.
  • [18] Y. Zang, G. Bao, X. Ye, H. Zhou, Weak adversarial networks for high-dimensional partial differential equations, Journal of Computational Physics 411 (2020) 109409.
  • [19] A. Blum, J. Hopcroft, R. Kannan, Foundations of data science, Cambridge University Press, 2020.
  • [20] R. Vershynin, High-dimensional probability: An introduction with applications in data science, Vol. 47, Cambridge University Press, 2018.
  • [21] J. Wright, Y. Ma, High-dimensional data analysis with low-dimensional models: Principles, computation, and applications, Cambridge University Press, 2021.
  • [22] K. Tang, X. Wan, Q. Liao, Adaptive deep density approximation for Fokker-Planck equations, Journal of Computational Physics 457 (2022) 111080.
  • [23] Y. Gu, H. Yang, C. Zhou, Selectnet: Self-paced learning for high-dimensional partial differential equations, Journal of Computational Physics 441 (2021) 110444.
  • [24] P. Morin, R. H. Nochetto, K. G. Siebert, Convergence of adaptive finite element methods, SIAM Review 44 (4) (2002) 631–658.
  • [25] K. Mekchay, R. H. Nochetto, Convergence of adaptive finite element methods for general second order linear elliptic pdes, SIAM Journal on Numerical Analysis 43 (5) (2005) 1803–1827.
  • [26] H. C. Elman, D. J. Silvester, A. J. Wathen, Finite elements and fast iterative solvers: With applications in incompressible fluid dynamics, Oxford University Press, USA, 2014.
  • [27] K. Tang, X. Wan, Q. Liao, Deep density estimation via invertible block-triangular mapping, Theoretical & Applied Mechanics Letters 10 (2020) 143.
  • [28] X. Wan, S. Wei, VAE-KRnet and its applications to variational Bayes, arXiv preprint arXiv:2006.16431.
  • [29] X. Wan, K. Tang, Augmented KRnet for density estimation and approximation, arXiv preprint arXiv:2105.12866.
  • [30] F. Santambrogio, Optimal transport for applied mathematicians, Birkäuser, NY 55 (2015) 58–63.
  • [31] L. Dinh, J. Sohl-Dickstein, S. Bengio, Density estimation using real NVP, arXiv preprint arXiv:1605.08803.
  • [32] D. P. Kingma, P. Dhariwal, Glow: Generative flow with invertible 1x1 convolutions, in: Advances in Neural Information Processing Systems, 2018, pp. 10215–10224.
  • [33] L. Bottou, F. E. Curtis, J. Nocedal, Optimization methods for large-scale machine learning, SIAM Review 60 (2) (2018) 223–311.
  • [34] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
  • [35] P. Bochev, M. Gunzburger, Least-squares methods for hyperbolic problems, in: Handbook of Numerical Analysis, Vol. 17, Elsevier, 2016, pp. 289–317.
  • [36] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems 2 (4) (1989) 303–314.
  • [37] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (5) (1989) 359–366.
  • [38] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks 4 (2) (1991) 251–257.
  • [39] M. Leshno, V. Y. Lin, A. Pinkus, S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Networks 6 (6) (1993) 861–867.
  • [40] Y. Shin, Z. Zhang, G. E. Karniadakis, Error estimates of residual minimization using neural networks for linear PDEs, arXiv preprint arXiv:2010.08019.
  • [41] J. Lu, Y. Lu, M. Wang, A priori generalization analysis of the deep ritz method for solving high dimensional elliptic equations, arXiv preprint arXiv:2101.01708.
  • [42] J. Yu, L. Lu, X. Meng, G. E. Karniadakis, Gradient-enhanced physics-informed neural networks for forward and inverse PDE problems, Computer Methods in Applied Mechanics and Engineering 393 (2022) 114823.
  • [43] W. Gao, C. Wang, Active learning based sampling for high-dimensional nonlinear partial differential equations, arXiv preprint arXiv:2112.13988.
  • [44] I. Kobyzev, S. J. Prince, M. A. Brubaker, Normalizing flows: An introduction and review of current methods, IEEE transactions on pattern analysis and machine intelligence 43 (11) (2020) 3964–3979.
  • [45] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
  • [46] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 214–223.
  • [47] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. C. Courville, Improved training of wasserstein gans, in: Advances in Neural Information Processing Systems, Vol. 30, 2017.
  • [48] D. P. Kingma, M. Welling, Auto-Encoding Variational Bayes, stat 1050 (2014) 1.
  • [49] G. Carlier, A. Galichon, F. Santambrogio, From Knothe’s transport to Brenier’s map and a continuation method for optimal transport, SIAM Journal on Mathematical Analysis 41 (6) (2010) 2554–2576.
  • [50] P.-T. De Boer, D. P. Kroese, S. Mannor, R. Y. Rubinstein, A tutorial on the cross-entropy method, Annals of Operations Research 134 (1) (2005) 19–67.
  • [51] R. Y. Rubinstein, D. P. Kroese, The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning, Springer Science & Business Media, 2013.
  • [52] L. Lu, X. Meng, Z. Mao, G. E. Karniadakis, Deepxde: A deep learning library for solving differential equations, SIAM Review 63 (1) (2021) 208–228.
  • [53] W. F. Mitchell, A collection of 2D elliptic problems for testing adaptive grid refinement algorithms, Applied Mathematics and Computation 220 (2013) 350–364.

Appendix A Proof of Lemma 1

Proof.

We first introduce the following lemma:

Lemma 3.

Consider a perturbed identity matrix \mathbf{I}+\delta\mathbf{A} with \left\|\delta\mathbf{A}\right\|_{2}<1. We have

\left\|(\mathbf{I}+\delta\mathbf{A})^{-1}\right\|_{2}\leq\frac{1}{1-\left\|\delta\mathbf{A}\right\|_{2}}. (37)
Proof.

For any \boldsymbol{x}\neq 0, we have

\left\|(\mathbf{I}+\delta\mathbf{A})\boldsymbol{x}\right\|_{2}\geq\left\|\boldsymbol{x}\right\|_{2}-\left\|\delta\mathbf{A}\right\|_{2}\left\|\boldsymbol{x}\right\|_{2}=(1-\left\|\delta\mathbf{A}\right\|_{2})\left\|\boldsymbol{x}\right\|_{2}>0,

which implies that \mathbf{I}+\delta\mathbf{A} is nonsingular. We then have

1=\left\|(\mathbf{I}+\delta\mathbf{A})^{-1}(\mathbf{I}+\delta\mathbf{A})\right\|_{2}=\left\|(\mathbf{I}+\delta\mathbf{A})^{-1}+(\mathbf{I}+\delta\mathbf{A})^{-1}\delta\mathbf{A}\right\|_{2}
\geq\left\|(\mathbf{I}+\delta\mathbf{A})^{-1}\right\|_{2}-\left\|(\mathbf{I}+\delta\mathbf{A})^{-1}\right\|_{2}\left\|\delta\mathbf{A}\right\|_{2},

which yields the conclusion. ∎
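As a side remark, the bound in Lemma 3 is easy to verify numerically; the following short script is a sanity check (not part of the proof) comparing \left\|(\mathbf{I}+\delta\mathbf{A})^{-1}\right\|_{2} with 1/(1-\left\|\delta\mathbf{A}\right\|_{2}) for a random perturbation with \left\|\delta\mathbf{A}\right\|_{2}=0.5.

import numpy as np

rng = np.random.default_rng(0)
n = 50
dA = rng.standard_normal((n, n))
dA *= 0.5 / np.linalg.norm(dA, 2)                    # scale so that ||dA||_2 = 0.5 < 1
lhs = np.linalg.norm(np.linalg.inv(np.eye(n) + dA), 2)
rhs = 1.0 / (1.0 - np.linalg.norm(dA, 2))
print(lhs <= rhs)                                     # True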

Assume that m_{V}^{*}(\boldsymbol{x})=(\boldsymbol{v}^{*})^{\mathsf{T}}\boldsymbol{q}(\boldsymbol{x}) and m_{\hat{\boldsymbol{v}}^{*}}(\boldsymbol{x})=(\hat{\boldsymbol{v}}^{*})^{\mathsf{T}}\boldsymbol{q}(\boldsymbol{x}), where \boldsymbol{q}(\boldsymbol{x})=[q_{1}(\boldsymbol{x}),\ldots,q_{n}(\boldsymbol{x})]^{\mathsf{T}} includes the basis functions in V=\mathrm{span}\{q_{i}(\boldsymbol{x})\}_{i=1}^{n} and the vectors \boldsymbol{v}^{*} and \hat{\boldsymbol{v}}^{*} include the coefficients. It is easy to see that \boldsymbol{v}^{*} and \hat{\boldsymbol{v}}^{*} satisfy the following linear systems respectively

\mathbf{A}\boldsymbol{v}^{*}=\boldsymbol{b},\quad\hat{\mathbf{A}}\hat{\boldsymbol{v}}^{*}=\hat{\boldsymbol{b}}, (38)

where

\hat{a}_{ij}=\frac{1}{N}\sum_{k=1}^{N}q_{i}(\boldsymbol{x}^{(k)})q_{j}(\boldsymbol{x}^{(k)})\approx\langle q_{i},q_{j}\rangle_{\rho}=a_{ij},
\hat{b}_{i}=\frac{1}{N}\sum_{k=1}^{N}q_{i}(\boldsymbol{x}^{(k)})h(\boldsymbol{x}^{(k)})\approx\langle q_{i},h\rangle_{\rho}=b_{i},

and \langle q_{i},q_{j}\rangle_{\rho}=\int_{D}q_{i}(\boldsymbol{x})q_{j}(\boldsymbol{x})\rho(\boldsymbol{x})d\boldsymbol{x} indicates the inner product of q_{i} and q_{j}. We rewrite the linear system for \hat{\boldsymbol{v}}^{*} as

(\mathbf{A}+\delta\mathbf{A})\hat{\boldsymbol{v}}^{*}=\boldsymbol{b}+\delta\boldsymbol{b}, (39)

where \delta{\mathbf{A}}=\hat{\mathbf{A}}-\mathbf{A} and \delta\boldsymbol{b}=\hat{\boldsymbol{b}}-\boldsymbol{b}. Let \mathscr{B}=\{q_{i}(\boldsymbol{x})\}_{i=1}^{n}\cup\{h(\boldsymbol{x})\}. Since both \{q_{i}\}_{i=1}^{n} and h(\boldsymbol{x}) are continuous on a compact support, we may assume that |m_{1}(\boldsymbol{x})m_{2}(\boldsymbol{x})|\leq M for any m_{1},m_{2}\in\mathscr{B} and any \boldsymbol{x}\in D, where 0<M<\infty is a constant. Using the Hoeffding bound, we have, for any \delta>0, with probability at least 1-2\delta,

\left|\frac{1}{N}\sum_{k=1}^{N}m_{1}(\boldsymbol{x}^{(k)})m_{2}(\boldsymbol{x}^{(k)})-\langle m_{1},m_{2}\rangle_{\rho}\right|\leq\sqrt{\frac{2M^{2}\ln\delta^{-1}}{N}}, (40)

for any m_{1},m_{2}\in\mathscr{B}. This means with probability at least 1-2\delta we have

\left\|\delta\mathbf{A}\right\|_{2}\leq\left\|\delta\mathbf{A}\right\|_{F}\leq n\sqrt{\frac{2M^{2}\ln\delta^{-1}}{N}},\quad\left\|\delta\boldsymbol{b}\right\|_{2}\leq\sqrt{n}\sqrt{\frac{2M^{2}\ln\delta^{-1}}{N}}, (41)

where \left\|\cdot\right\|_{F} indicates the Frobenius norm. Let \delta\boldsymbol{v}^{*}=\hat{\boldsymbol{v}}^{*}-\boldsymbol{v}^{*}. It is seen that \left\|\delta\mathbf{A}\right\|_{2}\rightarrow 0 as N\rightarrow\infty. Assume that N is large enough such that \left\|\delta\mathbf{A}\right\|_{2}\leq(1-r) with 0<r<1, in other words, (1-\left\|\delta\mathbf{A}\right\|_{2})^{-1}\leq r^{-1}. Since q_{i}(\boldsymbol{x}) are orthonormal, we have \mathbf{A}=\mathbf{I}. From equations (38) and (39), we have

\delta\boldsymbol{v}^{*}=(\mathbf{I}+\delta\mathbf{A})^{-1}(\delta\boldsymbol{b}-\delta\mathbf{A}\boldsymbol{v}^{*}),

to which we apply Lemma 3 and the bounds in equation (41) to obtain

\left\|\delta\boldsymbol{v}^{*}\right\|_{2}\leq r^{-1}(\left\|\delta\boldsymbol{b}\right\|_{2}+\left\|\delta\mathbf{A}\right\|_{2}\left\|\boldsymbol{v}^{*}\right\|_{2})\leq r^{-1}(\sqrt{n}+n\left\|\boldsymbol{v}^{*}\right\|_{2})\sqrt{\frac{2M^{2}\ln\delta^{-1}}{N}}. (42)

Using the Pythagorean theorem, we have

\|m_{\hat{\boldsymbol{v}}^{*}}-h\|_{\rho}^{2}=\|m_{\hat{\boldsymbol{v}}^{*}}-m^{*}_{V}\|_{\rho}^{2}+\|h-m^{*}_{V}\|_{\rho}^{2}
=\|\delta\boldsymbol{v}^{*}\|^{2}_{2}+\|h-m^{*}_{V}\|_{\rho}^{2}
\leq(\|\delta\boldsymbol{v}^{*}\|_{2}+\|h-m^{*}_{V}\|_{\rho})^{2},

which yields that

\|m_{\hat{\boldsymbol{v}}^{*}}-h\|_{\rho}\leq\|\delta\boldsymbol{v}^{*}\|_{2}+\|h-m^{*}_{V}\|_{\rho}. (43)

Combining equations (42) and (43), we reach the conclusion. ∎

Appendix B Proof of Lemma 2

Proof.

Let

\hat{r}(\boldsymbol{x})=\left\{\begin{array}{rl}r(\boldsymbol{x}),&\textrm{ if }|r^{2}/p-\mu|\leq a;\\ 0,&\textrm{ otherwise},\end{array}\right.

where a>0. We consider

\left|Q_{p}[r^{2}]-\mathbb{E}[r^{2}]\right|\leq\left|Q_{p}(r^{2})-Q_{p}(\hat{r}^{2})\right|+\left|Q_{p}(\hat{r}^{2})-\mathbb{E}[\hat{r}^{2}]\right|+\left|\mathbb{E}[\hat{r}^{2}]-\mathbb{E}[r^{2}]\right|=I_{1}+I_{2}+I_{3}. (44)

The first term I_{1} on the right-hand side is bounded as

\mathbb{E}_{p}\left|Q_{p}(r^{2})-Q_{p}(\hat{r}^{2})\right|=\mathbb{E}_{p}\left|r^{2}(\boldsymbol{X})p^{-1}(\boldsymbol{X})-\hat{r}^{2}(\boldsymbol{X})p^{-1}(\boldsymbol{X})\right|
=\int_{|r^{2}/p-\mu|>a}r^{2}(\boldsymbol{x})d\boldsymbol{x}
\leq\|r^{2}/p\|_{p}\sqrt{\mathbb{P}(|r^{2}/p-\mu|>a;p)}, (45)

where the Cauchy-Schwarz inequality is used in the last step.

The second term I_{2} on the right-hand side can be bounded as

\mathbb{E}_{p}\left|Q_{p}(\hat{r}^{2})-\mathbb{E}[\hat{r}^{2}]\right|\leq\sqrt{\mathrm{Var}_{p}(Q_{p}(\hat{r}^{2}))}
\leq N^{-1/2}\sqrt{\mathrm{Var}_{p}(\hat{r}^{2}(\boldsymbol{X})/p(\boldsymbol{X}))}
\leq aN^{-1/2}, (46)

where in the last step we used the fact that for any random variable Y with \alpha\leq Y\leq\beta almost surely, \mathrm{Var}(Y)\leq\frac{(\beta-\alpha)^{2}}{4}. The third term I_{3} on the right-hand side can be bounded in the same way as I_{1}.

We now estimate the tail probability \mathbb{P}(|r^{2}/p-\mu|>a;p). Using the correspondence between the L_{1} norm and the total variation distance for two probability measures, as well as Pinsker's inequality, we have

\|p-p^{*}\|_{L_{1}}=2\delta(p,p^{*})\leq\sqrt{2D_{\mathsf{KL}}(p\|p^{*})}\leq\sqrt{2\varepsilon}, (47)

which yields that

\mathbb{E}_{p}\left[\left|r^{2}/p-\mu\right|\right]\leq\mu\sqrt{2\varepsilon}. (48)

From the Markov inequality, we have

\mathbb{P}\left(\left|r^{2}/p-\mu\right|\geq a;p\right)\leq\frac{\mu\sqrt{2\varepsilon}}{a}, (49)

where the probability is with respect to the PDF p(\boldsymbol{x}). Combining the bounds for I_{i}, i=1,2,3, and equation (49), we reach the conclusion. ∎

Appendix C Proof of Theorem 1

Proof.

By Assumption 1, we have

\left\|u(\boldsymbol{x};\Theta_{N}^{*,(k)})-u(\boldsymbol{x})\right\|_{2,\Omega}
\leq C_{1}^{-1}\left[\left\|\mathcal{L}(u(\boldsymbol{x};\Theta_{N}^{*,(k)})-u(\boldsymbol{x}))\right\|_{2,\Omega}+\left\|\mathfrak{b}(u(\boldsymbol{x};\Theta_{N}^{*,(k)})-u(\boldsymbol{x}))\right\|_{2,\partial\Omega}\right]
\leq\sqrt{2}C_{1}^{-1}\left(\left\|\mathcal{L}(u(\boldsymbol{x};\Theta_{N}^{*,(k)})-u(\boldsymbol{x}))\right\|_{2,\Omega}^{2}+\left\|\mathfrak{b}(u(\boldsymbol{x};\Theta_{N}^{*,(k)})-u(\boldsymbol{x}))\right\|_{2,\partial\Omega}^{2}\right)^{\frac{1}{2}}.

Combining \mathcal{L}u(\boldsymbol{x})=s(\boldsymbol{x}), \mathfrak{b}u(\boldsymbol{x})=g(\boldsymbol{x}), r(\boldsymbol{x};\Theta_{N}^{*,(k)})=\mathcal{L}u(\boldsymbol{x};\Theta_{N}^{*,(k)})-s(\boldsymbol{x}) and b(\boldsymbol{x};\Theta_{N}^{*,(k)})=\mathfrak{b}u(\boldsymbol{x};\Theta_{N}^{*,(k)})-g(\boldsymbol{x}) gives

\left\|u(\boldsymbol{x};\Theta_{N}^{*,(k)})-u(\boldsymbol{x})\right\|_{2,\Omega}\leq\sqrt{2}C_{1}^{-1}\left(\left\|r(\boldsymbol{x};\Theta_{N}^{*,(k)})\right\|_{2,\Omega}^{2}+\left\|b(\boldsymbol{x};\Theta_{N}^{*,(k)})\right\|_{2,\partial\Omega}^{2}\right)^{\frac{1}{2}}. (50)

Noting that \mathbb{E}(R_{k})=\left\|r(\boldsymbol{x};\Theta_{N}^{*,(k)})\right\|_{2,\Omega}^{2}, and according to the Hoeffding inequality, we have

\mathbb{P}\left(R_{k}-\mathbb{E}(R_{k})\geq-\varepsilon\right)\geq 1-\mathrm{exp}\left(\frac{-2N_{r}\varepsilon^{2}}{(\tau_{2}-\tau_{1})^{2}}\right). (51)

Combining (50) and (51) gives that

\left\|u(\boldsymbol{x};\Theta_{N}^{*,(k)})-u(\boldsymbol{x})\right\|_{2,\Omega}\leq\sqrt{2}C_{1}^{-1}\left(R_{k}+\varepsilon+\left\|b(\boldsymbol{x};\Theta_{N}^{*,(k)})\right\|_{2,\partial\Omega}^{2}\right)^{\frac{1}{2}}

with probability at least 1-\mathrm{exp}(-2N_{r}\varepsilon^{2}/(\tau_{2}-\tau_{1})^{2}). ∎

Appendix D Proof of Corollary 1

Proof.

Note that

\Theta_{N}^{*,(k+1)}=\arg\min_{\Theta}\frac{1}{N_{r}}\sum_{i=1}^{N_{r}}\frac{r^{2}(\boldsymbol{x}_{\Omega}^{(i)};\Theta)}{\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{f}^{*,(k)})}.

Since \Theta_{N}^{*,(k+1)} is the optimal solution at the (k+1)-th stage, we have

R_{k+1}=\frac{1}{N_{r}}\sum\limits_{i=1}^{N_{r}}\frac{r^{2}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{N}^{*,(k+1)})}{\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{f}^{*,(k)})}\leq\frac{1}{N_{r}}\sum\limits_{i=1}^{N_{r}}\frac{r^{2}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{N}^{*,(k)})}{\hat{p}_{\mathsf{KRnet}}(\boldsymbol{x}_{\Omega}^{(i)};\Theta_{f}^{*,(k)})}. (52)

Plugging \hat{p}_{\mathsf{KRnet}}(\boldsymbol{x};\Theta_{f}^{*,(k)})=c_{k}r^{2}(\boldsymbol{x};\Theta_{N}^{*,(k)}) into (52) gives that

R_{k+1}\leq\frac{1}{c_{k}}.

Noting that R_{k+1} is a random variable and taking its expectation, it follows that

\mathbb{E}(R_{k+1})\leq\frac{1}{c_{k}}=\int_{\Omega}r^{2}(\boldsymbol{x};\Theta_{N}^{*,(k)})d\boldsymbol{x}=\mathbb{E}(R_{k}),

which completes the proof. ∎