
Bayesian neural networks for weak solution of PDEs with uncertainty quantification

Xiaoxuan Zhang$^{1}$, Krishna Garikipati$^{1,2,3}$
$^{1}$Department of Mechanical Engineering, University of Michigan, United States
$^{2}$Department of Mathematics, University of Michigan, United States
$^{3}$Michigan Institute for Computational Discovery & Engineering, University of Michigan, United States
Corresponding author. E-mail address: [email protected]
Abstract

Solving partial differential equations (PDEs) is the canonical approach for understanding the behavior of physical systems. However, the large-scale solution of PDEs using state-of-the-art discretization techniques remains an expensive proposition. In this work, a new physics-constrained neural network (NN) approach is proposed to solve PDEs without labels, with a view to enabling high-throughput solutions in support of design and decision-making. Distinct from existing physics-informed NN approaches, where the strong or weak form of PDEs is used to construct the loss function, we write the loss function of NNs based on the discretized residual of PDEs through an efficient, convolutional operator-based, and vectorized implementation. We explore an encoder-decoder NN structure for both deterministic and probabilistic models, with Bayesian NNs (BNNs) for the latter, which allow us to quantify both epistemic uncertainty from model parameters and aleatoric uncertainty from noise in the data. For BNNs, the discretized residual is used to construct the likelihood function. In our approach, both deterministic and probabilistic convolutional layers are used to learn the applied boundary conditions (BCs) and to detect the problem domain. Because both Dirichlet and Neumann BCs are specified as inputs to the NNs, a single NN can solve for similar physics, but with different BCs and on a number of problem domains. The trained surrogate PDE solvers can also make interpolating, and to a certain extent extrapolating, predictions for BCs that they were not exposed to during training. Such surrogate models are of particular importance for problems in which the same type of PDE needs to be solved repeatedly with slight variations. We demonstrate the capability and performance of the proposed framework by applying it to different steady-state and equilibrium boundary value problems with physics that spans diffusion, linear elasticity, and nonlinear elasticity.

Keywords: weak form constrained neural network · Bayesian neural network · uncertainty quantification · nonlinear elasticity · surrogate PDE solver · partial differential equations

1 Introduction

Solving partial differential equations (PDEs) is crucial for scientists and engineers seeking to understand the behavior of different physical systems. Popular numerical methods to solve PDEs include the finite element method (FEM), the finite difference method, the finite volume method, and the Fourier method, each with its own advantages and limitations. Among these, the FEM is arguably the most widely used owing to its flexibility in solving problems with complex geometries and irregular shapes. However, when a large number of simulations needs to be carried out, as in homogenization, optimization, or inverse problems, these numerical methods can be very expensive, and efficient surrogate models are therefore needed.

With the drastically increasing computational power of graphics processing units (GPUs), machine learning has gained great popularity in a wide range of applications, such as computer vision, speech recognition, and anomaly detection. It has also emerged as a powerful data-driven approach for surrogate modeling in many materials, physics, and engineering applications [Ramprasad2017-material-informatics-review, Bock2019Kalidindi-ML-CM-review], such as material screening [Meredig2014-screen-materials, Wolverton2016Ward-screen-material-property], constitutive modeling [Hashash2004-NN-constitutive, Chinesta2019Ibanez-hybrid-constitutive-modeling, Sun+Wang2019-game-constitutive, Huang2020Learning-constitutive-DNN-indirect-observation], scale bridging [Brockherde2017DFT-MD, Teichert2019Garikipati-ML-bridge, Teichert2020Scale-bridge-active-learning], system identification [Brunton2016Kutz-system-id, Wang2019Garikipati-CMAME-Variational-System-Identification, Wang2020System-inference-Covid19, Wang2020Variational-System-ID-sparse-data, Wang2020Garikipati-Xun-perspective-system-id-bayesian], prediction of effective material properties [Kalidindi+Cecen2018-CNN-Structure-property, Li2019Zhuang-effective-ME-DNN, Agrawal2018Yang-Composites-S-P-deep-learning, Kondo2017CNN-ionic-conductivity], and characterization of nonlinear material response [Hambli2011multiscale-bone-with-NN, Bessa2017-data-driven-framework-elasticity-inelasticity, Jones2019Frankel-Oligocrytal-behavior-CNN, Sun2018Wang-homogenization, Yvonnet2015Le-RVE-elasticity, Yvonnet2018Lu-NN-RVE-graphene, Zhang2020Garikipati-CMAME-ML-RVE]. Attempts to use machine learning techniques to learn the full-field solution of physical systems are explored in [Zhu2018Zabaras-UQ-Bayesian-ED-CNN, Winovich2019Lin-ConvPDE-UQ, Bhatnagar2019CNN-encoder-decoder, Li2020Reaction-diffusion-prediction-CNN], and many others. For example, [Zhu2018Zabaras-UQ-Bayesian-ED-CNN] proposed a Bayesian approach to convolutional neural networks (CNNs) to study flow in heterogeneous media with uncertainty quantification. CNNs are also used to predict the velocity and pressure fields for aerodynamic problems in [Bhatnagar2019CNN-encoder-decoder] and to predict the concentration distribution of single-species reaction-diffusion systems in [Li2020Reaction-diffusion-prediction-CNN]. These approaches generally require a large number of full-field solutions as training data, obtained either from experiments or from direct numerical simulations (DNSs), which might not be readily available or might be expensive to obtain.

Another thrust in using machine learning techniques in the computational mechanics, materials, and physics communities is to solve PDEs with little or no pre-labeled data [Lagaris1998NN-PDEs, Han2018Solve-high-dimension-PDE-deep-learning, Sirignano2018DGM-solve-PDEs, Raissi2019physics-informed-forward-inverse-jcp, Zhu2019Perdikaris-Physics-PDE-CNN, Geneva2020Zabaras-JCP-auto-regressive-NN-PDE, Yang2021BPINNs-PDE, Berg2018unified-deep-ann-PDE-complex-geometries, Sun2020Surrogate-modeling-flow-physics-constrained, Samaniego2020energy-approach-PDE-ML]. For example, high-dimensional free-boundary PDEs are solved with fully connected NNs by reformulating PDEs as backward stochastic differential equations in [Han2018Solve-high-dimension-PDE-deep-learning]. The Deep Galerkin Method proposed in [Sirignano2018DGM-solve-PDEs] solves high-dimensional free-boundary PDEs with neural networks (NNs) that satisfy the differential operators, initial condition (IC), and boundary conditions (BCs). The Physics-Informed Neural Networks (PINNs) approach has been proposed to solve different transient systems [Raissi2019physics-informed-forward-inverse-jcp]. In this approach, the strong form of the PDEs, using either a continuous or a discrete time derivative, serves as part of the loss function, which further includes contributions from the IC and BCs [Raissi2019physics-informed-forward-inverse-jcp]. Various extensions of PINNs have been made to solve different systems [Jin2020NSFnets-PINN, Pang2019fPINNs, Meng2020PPINN, Geneva2020Zabaras-JCP-auto-regressive-NN-PDE, Yang2021BPINNs-PDE, Wang2020DL-free-boundary, Jagtap2020cPINN-discrete-domain-conservation-law]. Surrogate PDE solvers based on the weak/variational formulation have also been studied in [Zang2020Weak-adversarial-network-PDE, Chen2020Friedrichs-learning-weak-PDE, Khodayi-mehr2019VarNet, Li2020D3M-domain-decomposition, Kharazmi2021hp-VPINNs].

When solving PDE systems, it is important to quantify uncertainties from different sources, such as the geometry representation, BCs, and material parameters, to better understand the systems and make reliable predictions. The sources of uncertainty can generally be categorized as either epistemic or aleatoric, where the former can be reduced by gathering more data or using a more sophisticated model, and the latter is less prone to reduction [Kiureghian2009Ditlevsen-aleatory-or-epistemic]. With machine learning techniques, probabilistic models can be constructed to quantify the uncertainty. Uncertainty quantification (UQ) with surrogate PDE solvers has been investigated in [Karniadakis2019Zhang-Quantify-UQ-dropout, Yang2019Perdikaris-Adversarial-UQ, Geneva2020Zabaras-JCP-auto-regressive-NN-PDE, Yang2021BPINNs-PDE]. For example, [Karniadakis2019Zhang-Quantify-UQ-dropout] employs the dropout technique proposed in [Gal2016Dropout] to quantify uncertainties of a surrogate solver that combines arbitrary polynomial chaos with PINNs. In [Yang2019Perdikaris-Adversarial-UQ], probabilistic PINNs are constructed based on latent variable models and trained with an adversarial inference process for UQ. A Bayesian framework is used in [Geneva2020Zabaras-JCP-auto-regressive-NN-PDE] for UQ, where the posterior distribution of the surrogate model parameters is constructed based on the Stochastic Weight Averaging Gaussian technique proposed in [Maddox2019SWAG]. A Bayesian PINN is proposed in [Yang2021BPINNs-PDE] for UQ, where PINNs are used to construct the likelihood function and either Hamiltonian Monte Carlo or variational inference (VI) techniques are used to estimate the posterior distribution.

With properly trained surrogate PDE solvers, there is interest in using them in problems such as homogenization or optimization to rapidly predict the response of the same PDE systems, but with different ICs or BCs, and potentially even on different problem domains. However, such goals are in general difficult to achieve with existing surrogate approaches, which typically enforce only one specific set of BCs via the loss function; making predictions for new sets of BCs with such surrogate solvers without re-training them is very challenging. In this work, we aim to address these challenges by proposing a new physics-constrained NN to solve PDEs, where the BCs are specified as inputs to the NNs. We are motivated by the FEM, which uses the weak formulation to completely define a physical system described by the governing PDEs and the associated BCs, and solves the problem based on the discretized residual. Accordingly, we construct the discretized residual of the PDEs from NN-predicted solutions to form the loss function used to train the NNs, through an efficient, convolutional operator-based, and vectorized implementation of the residual calculation. As shown in Fig. 1, the weak PDE loss layers are independent of the NN that serves as the surrogate PDE solver and do not introduce any new trainable parameters. These features offer great flexibility in choosing the NN architecture. We focus on an encoder-decoder NN structure, which has been investigated for other physical systems [Zhu2018Zabaras-UQ-Bayesian-ED-CNN, Winovich2019Lin-ConvPDE-UQ, Bhatnagar2019CNN-encoder-decoder]. We study both deterministic and probabilistic models, with Bayesian NNs (BNNs) for the latter. The encoder-decoder structure can be easily adapted to BNNs with the modularized probabilistic layers provided in the TensorFlow Probability (TFP) library. In our approach, deterministic/probabilistic convolutional NN layers are used to learn the applied BCs (both Dirichlet and Neumann) and to detect the problem domains through a carefully designed input data structure. Thus, with our approach, a single NN can be used to simultaneously solve different BVPs that are governed by the same PDEs but posed on different domains with different BCs. In addition, similar to other surrogate PDE solvers, our approach is label free. Furthermore, the trained surrogate solvers can make predictions for interpolated and, to a certain extent, extrapolated BCs that they were not exposed to during training.

In our BNNs, each model parameter is sampled from a posterior distribution. We solve for the posterior distribution of the model parameters with the VI method instead of Markov Chain Monte Carlo (MCMC) sampling, as the latter involves expensive iterative inference steps and is not suitable for systems with a large number of parameters [Blei2017Variational-inference-review, Kingma2014Welling+Variational-bayes]. In our work, the likelihood function is constructed based on the discretized PDE residuals. The BNNs allow us to quantify both epistemic uncertainty from model parameters and aleatoric uncertainty from noise in the data. In our study, additive noise is applied to the NN-predicted solution, as in [Luo2020Bayesian-deep-learning, Wang2020Garikipati-Xun-perspective-system-id-bayesian, Geneva2020Zabaras-JCP-auto-regressive-NN-PDE, Zhu2018Zabaras-UQ-Bayesian-ED-CNN, Zhu2019Perdikaris-Physics-PDE-CNN], to represent the aleatoric uncertainty. Such additive noise represents potential errors from various sources, such as discretization error [Geneva2020Zabaras-JCP-auto-regressive-NN-PDE], geometry representation, boundary conditions, and material parameter measurement, among others.

The proposed framework is a generalized approach that is applicable to both steady-state and transient problems. In this work, we present the details of the new framework and its application to steady-state diffusion, linear elasticity, and nonlinear elasticity; we defer the investigation of transient problems to a subsequent work. To the authors' best knowledge, this is the first attempt to simultaneously solve PDEs on different domains with different BCs with a single surrogate solver, with the further feature of UQ. In this study, the problem domains are represented via pixels on a square background grid for simplicity. Thus, the boundary of a domain is not a smooth curve, but has a pixel-level resolution. One can refer to [Bhatnagar2019CNN-encoder-decoder, Li2020Reaction-diffusion-prediction-CNN, Gao2020PhyGeoNet-PDE-on-Irregular-domain] and many others for strategies to map complex and irregular domains onto a regular grid mesh. Such geometry transformations can be taken into account in the proposed PDE loss layers with the isoparametric mapping concept of the FEM via the shape functions, though this is not the focus of this work.

The rest of the paper is organized as follows. In Section 2, we briefly summarize the general mathematical description of the types of physical systems studied in this work. The structures of the discretized residual constrained NNs used in this work are presented in Section 3. Section 4 provides the details of an efficient implementation of the discretized residual calculation. Section LABEL:sec:data covers the data structure of the NN inputs, domain/boundary detection, the setup of the BVPs, and the NN training procedure. Detailed simulation results are presented in Section LABEL:sec:results, where steady-state diffusion, linear elasticity, and nonlinear elasticity are studied. Concluding remarks and perspectives are offered in Section LABEL:sec:conclusion.

2 Problem definition

In this work, we are interested in solving steady-state diffusion, linear elasticity, and nonlinear elasticity problems with discretized residual constrained NNs. These three physical systems are described by a general elliptic PDE on a domain $\Omega$ with the Dirichlet BC on $\Gamma^{\boldsymbol{\varphi}}$ and the Neumann BC on $\Gamma^{\boldsymbol{k}}$ as

$$\begin{aligned}
\nabla\cdot\boldsymbol{A}(\boldsymbol{\varphi}) &= \boldsymbol{0} &&\text{on}\quad\Omega, \\
\boldsymbol{\varphi}(\boldsymbol{X}) &= \bar{\boldsymbol{\varphi}}(\boldsymbol{X}) &&\text{on}\quad\Gamma^{\boldsymbol{\varphi}}, \\
\boldsymbol{k}(\boldsymbol{X}) &= \bar{\boldsymbol{k}}(\boldsymbol{X}) &&\text{on}\quad\Gamma^{\boldsymbol{k}},
\end{aligned} \qquad (1)$$

where $\boldsymbol{\varphi}(\boldsymbol{X})$ represents the location-dependent unknown variable and $\boldsymbol{X}$ is the coordinate. The overall boundary of the continuum body satisfies $\Gamma=\Gamma^{\boldsymbol{\varphi}}\cup\Gamma^{\boldsymbol{k}}$ and $\Gamma^{\boldsymbol{\varphi}}\cap\Gamma^{\boldsymbol{k}}=\emptyset$. It is worth mentioning that, even though the bold typeface $\boldsymbol{\varphi}$, $\boldsymbol{A}$, and $\boldsymbol{k}$ is used in (1), these quantities can represent scalar, vector, or tensor fields, depending on the degrees of freedom (DOFs) of each physical system. For example, in the diffusion problem, $\boldsymbol{\varphi}$, $\boldsymbol{A}$, and $\boldsymbol{k}$ represent the compositional order parameter (scalar), the diffusive flux (vector), and the outward flux (scalar), respectively, whereas in elasticity problems they represent the deformation field (vector), the stress field (second-order tensor), and the surface traction (vector), respectively. The details of each system are provided in the numerical simulation section.

The weak form of (1) states: for all variations $\boldsymbol{\omega}\in\mathscr{V}$, with $\mathscr{V}=\{\boldsymbol{\omega}\,|\,\boldsymbol{\omega}=\boldsymbol{0}~\text{on}~\Gamma^{\boldsymbol{\varphi}}\}$, seek trial solutions $\boldsymbol{\varphi}\in\mathscr{S}$, with $\mathscr{S}=\{\boldsymbol{\varphi}\,|\,\boldsymbol{\varphi}=\bar{\boldsymbol{\varphi}}~\text{on}~\Gamma^{\boldsymbol{\varphi}}\}$, such that

$$\int_{\Omega}\nabla\boldsymbol{\omega}\cdot\boldsymbol{A}(\boldsymbol{\varphi})\,dV - \int_{\Gamma^{\boldsymbol{k}}}\bar{\boldsymbol{k}}\cdot\boldsymbol{\omega}\,dS = 0. \qquad (2)$$

Eq. (2) is obtained by multiplying (1$_1$) by $\boldsymbol{\omega}$, integrating by parts, and then incorporating the Neumann BC in (1$_3$). For the diffusion problem, $\boldsymbol{\omega}$ is a scalar field; for elasticity problems, $\boldsymbol{\omega}$ is a vector field.

To obtain approximate solutions of (2), finite-dimensional approximations of $\boldsymbol{\omega}$ and $\boldsymbol{\varphi}$, denoted by $\boldsymbol{\omega}^{h}$ and $\boldsymbol{\varphi}^{h}$, are constructed with $\boldsymbol{\omega}^{h}\in\mathscr{V}^{h}=\{\boldsymbol{\omega}^{h}\,|\,\boldsymbol{\omega}^{h}=\boldsymbol{0}~\text{on}~\Gamma^{\boldsymbol{\varphi}}\}$ and $\boldsymbol{\varphi}^{h}\in\mathscr{S}^{h}=\{\boldsymbol{\varphi}^{h}\,|\,\boldsymbol{\varphi}^{h}=\bar{\boldsymbol{\varphi}}~\text{on}~\Gamma^{\boldsymbol{\varphi}}\}$. The discretized fields $\boldsymbol{\omega}^{h}$, $\nabla\boldsymbol{\omega}^{h}$, and $\boldsymbol{\varphi}^{h}$ are computed as

$$\boldsymbol{\omega}^{h} = \boldsymbol{N}\boldsymbol{d}_{\boldsymbol{\omega}},\quad \nabla\boldsymbol{\omega}^{h} = \boldsymbol{B}\boldsymbol{d}_{\boldsymbol{\omega}},\quad \text{and}\quad \boldsymbol{\varphi}^{h} = \boldsymbol{N}\boldsymbol{d}_{\boldsymbol{\varphi}} \qquad (3)$$

in terms of the nodal solutions $\boldsymbol{d}_{\boldsymbol{\omega}}$ and $\boldsymbol{d}_{\boldsymbol{\varphi}}$, the basis functions $\boldsymbol{N}$, and the gradient matrix $\boldsymbol{B}=\nabla\boldsymbol{N}$. Inserting (3) into (2), we obtain the discretized residual as a sum over subdomains $\Omega^{e}$ and their associated boundaries $\Gamma^{e}$:

$$\boldsymbol{R} = \sum_{e=1}^{n_{\text{elem}}}\left\{\int_{\Omega^{e}}\boldsymbol{B}^{T}\boldsymbol{A}(\boldsymbol{\varphi}^{h})\,dV - \int_{\Gamma^{e,\boldsymbol{k}}}\boldsymbol{N}^{T}\bar{\boldsymbol{k}}\,dS\right\} \qquad (4)$$

where $n_{\text{elem}}$ represents the total number of subdomains. The volume and surface integrals in (4) are evaluated numerically via Gaussian quadrature rules. In this work, the problem domain $\Omega$ is represented by images. Adjacent pixels in the images are used to form the subdomains $\Omega^{e}$, whose connectivity information is preserved automatically by the image data. The values at each pixel of the image are treated as nodal values. A more detailed discussion of constructing the subdomains based on image pixels is provided in Section 4.

3 Discretized residual constrained neural networks

In this section, we present the formulation of discretized residual constrained deterministic/probabilistic NNs for solving PDEs for given BCs without labels.

3.1 Deterministic neural networks

Figure 1: Illustration of the discretized residual constrained NNs, which consist of an encoder-decoder NN structure and a group of layers to compute the PDE-related loss. The NNs can be either deterministic or probabilistic. The weak PDE loss layers take both NN inputs and outputs as their inputs. The Dirichlet BCs from the NN inputs are applied to the NN predicted solution to compute the bulk residual. The Neumann residual is directly computed based on the Neumann BCs from the inputs. (Material parameters can also be specified as inputs to the NNs to further enrich the flexibility of the proposed approach. For simplicity, only fixed material parameters are considered in the numerical sections of this work.)

We first consider deterministic NNs, whose parameters $\boldsymbol{\Theta}$ are represented by single values rather than by distributions as in probabilistic NNs. In our approach, the NNs take image data containing information on both the Dirichlet and Neumann BCs as inputs, and output the full-field weak solutions of the PDEs associated with the input BCs. As shown in (4), the discretized residual of the PDEs consists of two contributions: a bulk term and a surface term. We propose the weak PDE loss layers, which are discussed in detail in Section 4, to compute the residual in the bulk and on the Neumann boundary. As illustrated in Fig. 1, the weak PDE loss layers are constructed based on the NN-predicted solutions. These layers only contain forward calculations based on the FEM and do not introduce any new parameters to be optimized. Since the weak PDE loss layers are independent of the NN that serves as the surrogate PDE solver and introduce no new trainable parameters, there is flexibility in choosing the NN architecture. As shown in Fig. 1, we focus on an encoder-decoder NN structure, which has been investigated for other physical systems [Zhu2018Zabaras-UQ-Bayesian-ED-CNN, Winovich2019Lin-ConvPDE-UQ, Bhatnagar2019CNN-encoder-decoder].

When using mini-batch optimization to train the discretized residual constrained deterministic NNs over a dataset $\mathcal{D}$, the batch loss $\mathcal{L}_{i}$ is written in terms of the reduced total residual $\boldsymbol{R}_{\text{tot}}^{\text{red}}$, as illustrated in Fig. 1, as

$$\mathcal{L}_{i} = \frac{1}{N}\sum_{n=1}^{N}\left(\boldsymbol{R}_{\text{tot}}^{\text{red}}(\mathcal{D}_{i},\boldsymbol{\Theta})\right)^{2}, \qquad (5)$$

for each mini-batch $i = 1, 2, \cdots, M$, with $N$ indicating the size of each mini-batch. The detailed training process of the discretized residual constrained NNs is discussed in Section LABEL:sec:NN-training.
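To make the loss in (5) concrete, the following is a minimal sketch (not the authors' exact implementation) of a label-free batch loss in TensorFlow. The helper `compute_reduced_residual` is hypothetical and stands in for the weak PDE loss layers described in Section 4.

```python
# A minimal sketch of the deterministic batch loss in Eq. (5), assuming a
# hypothetical helper `compute_reduced_residual` that evaluates the reduced
# total residual R_tot^red from the BC inputs and the NN-predicted solution.
import tensorflow as tf

def deterministic_batch_loss(model, batch_inputs):
    """Mean squared reduced total residual over one mini-batch; no labels are used."""
    predictions = model(batch_inputs)                               # NN-predicted solution field
    residual = compute_reduced_residual(batch_inputs, predictions)  # R_tot^red per example
    return tf.reduce_mean(tf.square(residual))
```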

3.2 Probabilistic neural networks

3.2.1 Background

For probabilistic NNs, we consider BNNs, whose model parameters $\boldsymbol{\Theta}$ are stochastic and sampled from a posterior distribution $P(\boldsymbol{\Theta}|\mathcal{D})$ instead of being represented by single values as in deterministic NNs. The posterior distribution $P(\boldsymbol{\Theta}|\mathcal{D})$ is computed based on Bayes' theorem

$$P(\boldsymbol{\Theta}|\mathcal{D}) = \frac{P(\mathcal{D}|\boldsymbol{\Theta})\,P(\boldsymbol{\Theta})}{P(\mathcal{D})}, \qquad (6)$$

where $\mathcal{D}$ denotes the i.i.d. observations (training data) and $P$ represents the probability density function. In (6), $P(\mathcal{D}|\boldsymbol{\Theta})$ is the likelihood, $P(\boldsymbol{\Theta})$ is the prior probability, and $P(\mathcal{D})$ is the evidence. The likelihood describes the probability of the observed data $\mathcal{D}$ for given parameters $\boldsymbol{\Theta}$: a larger value of $P(\mathcal{D}|\boldsymbol{\Theta})$ means that the observed data are more probable, implying that $\boldsymbol{\Theta}$ is more reasonable. The prior needs to be specified before the Bayesian inference process [gelman2013bayesian].

To compute the posterior distribution of $\boldsymbol{\Theta}$, one can use popular sampling-based methods, such as MCMC. However, sampling methods involve expensive iterative inference steps and are difficult to use when datasets are large or models are very complex [Kingma2014Welling+Variational-bayes, Liu2016Wang-Stein-variational-gradient-descent, Blei2017Variational-inference-review]. Alternatively, we can use VI, which approximates the exact posterior distribution $P(\boldsymbol{\Theta}|\mathcal{D})$ with a more tractable surrogate distribution $Q(\boldsymbol{\Theta})$ by minimizing the Kullback-Leibler (KL) divergence [Liu2016Wang-Stein-variational-gradient-descent, Blei2017Variational-inference-review, Graves2011Varational-inference]

$$Q^{*} = \underset{Q}{\operatorname{arg\,min}}~D_{\text{KL}}\big(Q(\boldsymbol{\Theta})\,||\,P(\boldsymbol{\Theta}|\mathcal{D})\big). \qquad (7)$$

Compared with MCMC, VI is faster and easier to scale to large datasets. We therefore explore it in this work, even though it is less rigorously studied than MCMC [Blei2017Variational-inference-review]. The KL divergence is computed as

$$D_{\text{KL}}\big(Q(\boldsymbol{\Theta})\,||\,P(\boldsymbol{\Theta}|\mathcal{D})\big) = \mathbb{E}_{Q}[\log Q(\boldsymbol{\Theta})] - \mathbb{E}_{Q}[\log P(\boldsymbol{\Theta},\mathcal{D})] + \log P(\mathcal{D}), \qquad (8)$$

which requires computing the logarithm of the evidence, $\log P(\mathcal{D})$ in (6) [Blei2017Variational-inference-review]. Since $P(\mathcal{D})$ is hard to compute, it is challenging to directly evaluate the objective function in (7). Alternatively, we can optimize the so-called evidence lower bound (ELBO)

$$\begin{aligned}
\text{ELBO}(Q) &= \mathbb{E}_{Q}[\log P(\boldsymbol{\Theta},\mathcal{D})] - \mathbb{E}_{Q}[\log Q(\boldsymbol{\Theta})] \\
&= \mathbb{E}_{Q}[\log P(\mathcal{D}|\boldsymbol{\Theta})] - \big(\mathbb{E}_{Q}[\log Q(\boldsymbol{\Theta})] - \mathbb{E}_{Q}[\log P(\boldsymbol{\Theta})]\big) \\
&= \mathbb{E}_{Q}[\log P(\mathcal{D}|\boldsymbol{\Theta})] - D_{\text{KL}}\big(Q(\boldsymbol{\Theta})\,||\,P(\boldsymbol{\Theta})\big),
\end{aligned} \qquad (9)$$

which is equivalent to the negative of the KL divergence in (7) up to an added constant. The loss function for the BNN is therefore written as

$$\mathcal{L} = D_{\text{KL}}\big(Q(\boldsymbol{\Theta})\,||\,P(\boldsymbol{\Theta})\big) - \mathbb{E}_{Q}[\log P(\mathcal{D}|\boldsymbol{\Theta})], \qquad (10)$$

which consists of a prior-dependent part and a data-dependent part, with the former being the KL divergence between the surrogate posterior distribution $Q(\boldsymbol{\Theta})$ and the prior $P(\boldsymbol{\Theta})$, and the latter being the negative log-likelihood cost.

3.2.2 Flipout

Different methods are available for training NNs with stochastic weights, such as weight perturbation [Graves2011Varational-inference, Blundell2015Weight-Uncertainty-NN, Wen2018GrosseFlipout], activation perturbation [Ioffe2015Szegedy-batch-normalization], reparameterization [Kingma2014Welling+Variational-bayes], and many others. In this work, we follow a specific weight perturbation method, so-called Flipout, proposed in [Wen2018GrosseFlipout]. Whereas other weight perturbation algorithms suffer from high variance of the gradient estimates because the same perturbation is shared by all training examples in a mini-batch, Flipout decorrelates the gradients within a mini-batch by implicitly sampling pseudo-independent weight perturbations for each example, and thus reduces the variance of the gradient estimates for NNs with stochastic weights [Wen2018GrosseFlipout]. The method can be efficiently implemented in a vectorized manner with unbiased stochastic gradients.

A brief description of Flipout is summarized here; readers are directed to Ref. [Wen2018GrosseFlipout] for details. Flipout assumes that the perturbations of different weights are independent and that the perturbation distribution is symmetric around zero. Under these assumptions, the perturbation distribution is invariant to element-wise multiplication by a random sign matrix. To minimize the loss $\mathcal{L}$, the distribution $Q(\boldsymbol{\Theta})$ can be described in terms of perturbations, $W = \overline{W} + \Delta W$, where $\overline{W}$ and $\Delta W$ are the mean and a stochastic perturbation of $\boldsymbol{\Theta}$, respectively. Flipout uses a base perturbation $\widehat{\Delta W}$ shared by all examples (training data points) in a mini-batch, and arrives at the perturbation for an individual example by multiplying $\widehat{\Delta W}$ with a different rank-one sign matrix

$$\Delta W_{n} = \widehat{\Delta W}\circ r_{n}s_{n}^{T}, \qquad (11)$$

where the subscript $n$ indicates an individual example in a mini-batch, and $r_{n}$ and $s_{n}$ are random vectors whose entries are uniformly sampled from $\pm 1$. Using a different perturbation for each example in a mini-batch, rather than an identical perturbation for all examples, is what reduces the variance of the stochastic gradients in Flipout during training. For BNNs, $\overline{W}$ and $\widehat{\Delta W}$ are obtained from the mean and standard deviation of the posterior distribution $Q(\boldsymbol{\Theta})$, which are learned via backpropagation with stochastic optimization algorithms. For mini-batch optimization, the batch loss is written as

$$\mathcal{L}_{i} = \frac{1}{M}D_{\text{KL}}\big(Q(\boldsymbol{\Theta})\,||\,P(\boldsymbol{\Theta})\big) - \mathbb{E}_{Q}[\log P(\mathcal{D}_{i}|\boldsymbol{\Theta}^{(i)})], \qquad (12)$$

for each mini-batch $i = 1, 2, \cdots, M$ [Blundell2015Weight-Uncertainty-NN]. With (12), we have $\mathcal{L} = \sum_{i}\mathcal{L}_{i}$. Following [Blundell2015Weight-Uncertainty-NN], Monte Carlo (MC) sampling is used to approximate the expectation in (12) as

$$\mathcal{L}_{i} \approx \frac{1}{M}D_{\text{KL}}\big(Q(\boldsymbol{\Theta})\,||\,P(\boldsymbol{\Theta})\big) - \frac{1}{N}\sum_{n=1}^{N}\log P(\mathcal{D}_{i}^{n}|\boldsymbol{\Theta}^{(i)}), \qquad (13)$$

where $N$ is the size of each mini-batch and $\boldsymbol{\Theta}^{(i)}$ denotes the $i$th batch sample drawn from the posterior distribution $Q(\boldsymbol{\Theta})$. Even though only one set of parameters $\boldsymbol{\Theta}^{(i)}$ is drawn from $Q(\boldsymbol{\Theta})$ for each mini-batch, the perturbation approach of Flipout ensures that the effective parameters differ for each individual example $\mathcal{D}_{i}^{n}$ when calculating the log-likelihood cost. Probabilistic dense layers and convolutional layers with the Flipout weight perturbation technique have been implemented in the TFP library (www.tensorflow.org/probability/api_docs/python/tfp/layers) and are used to construct the BNNs in this work.
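For intuition, the following is a minimal, vectorized sketch of the Flipout perturbation of Eq. (11) for a single dense layer with a weight matrix of shape [d_in, d_out]. The variable names, the Gaussian reparameterization of the base perturbation, and the assignment of the sign vectors to the input/output sides are illustrative assumptions, not the TFP implementation.

```python
# A minimal sketch of the Flipout trick: a shared base perturbation is
# decorrelated across the examples of a mini-batch by per-example +/-1 sign
# vectors, without materializing a separate weight matrix per example.
import tensorflow as tf

def flipout_dense(x, w_mean, w_std):
    # x: [N, d_in] mini-batch; w_mean, w_std: [d_in, d_out] posterior mean / std of W
    delta_w_hat = w_std * tf.random.normal(tf.shape(w_std))       # shared base perturbation
    sign_in = tf.sign(tf.random.uniform(
        [tf.shape(x)[0], tf.shape(w_mean)[0]]) - 0.5)              # per-example signs (input side)
    sign_out = tf.sign(tf.random.uniform(
        [tf.shape(x)[0], tf.shape(w_mean)[1]]) - 0.5)              # per-example signs (output side)
    # Equivalent to x_n (W_mean + delta_W_hat o (rank-one sign matrix)) for each example n:
    return tf.matmul(x, w_mean) + tf.matmul(x * sign_in, delta_w_hat) * sign_out
```

In practice, `w_mean` and a (softplus-parameterized) `w_std` would be trainable variables optimized through the ELBO loss of (10).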

3.2.3 Neural network structure and loss function

As the probabilistic layers are implemented in the TFP library in a modularized form, we can easily construct the discretized residual constrained BNNs to have an encoder-decoder architecture similar to that of the deterministic model shown in Fig. 1, but with all weights drawn from probability distributions. The loss of the BNNs is given in (10). The probabilistic layers in the TFP library automatically calculate the prior-dependent KL divergence and add it to the total loss.
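The following is a minimal sketch of how such a probabilistic layer can be instantiated and how its KL term is collected; the filter count and kernel size are illustrative, not the architecture used in this work.

```python
# A minimal sketch of a TFP Flipout convolutional layer; the prior-dependent
# KL divergence of Eq. (10) is added to the layer's `losses` collection
# automatically once the layer has been called.
import tensorflow as tf
import tensorflow_probability as tfp

probabilistic_conv = tfp.layers.Convolution2DFlipout(
    filters=16, kernel_size=3, padding="same", activation="relu")

x = tf.random.normal([8, 32, 32, 2])   # dummy mini-batch of BC input images
y = probabilistic_conv(x)              # forward pass with pseudo-independent perturbations

kl_divergence = tf.reduce_sum(probabilistic_conv.losses)  # prior-dependent part of the loss
```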

The data-dependent loss is accounted for by the likelihood function. In general, data contains noise that leads to aleatoric uncertainty, which cannot be reduced by training the surrogate model with more observations. Additive noise, which is independent of the data and is commonly treated as Gaussian, is often added to the output of the surrogate model to construct the likelihood function [Luo2020Bayesian-deep-learning, Wang2020Garikipati-Xun-perspective-system-id-bayesian, Geneva2020Zabaras-JCP-auto-regressive-NN-PDE, Zhu2018Zabaras-UQ-Bayesian-ED-CNN, Zhu2019Perdikaris-Physics-PDE-CNN]. Such additive noise represents potential errors from various sources, such as discretization error [Geneva2020Zabaras-JCP-auto-regressive-NN-PDE], geometry representation, boundary conditions, and material parameter measurement. Assuming a Gaussian noise $\boldsymbol{\epsilon}\sim\mathcal{N}(\boldsymbol{0},\Sigma_{1}\boldsymbol{I})$ with zero mean and a pre-specified constant covariance $\Sigma_{1}$ (aleatoric uncertainty can be further categorized into homoscedastic and heteroscedastic uncertainty [Kendall2017Gal-both-uncertainty]; for simplicity we assume the former, in which $\Sigma_{1}$ stays constant for different inputs, whereas the latter is useful when the output noise depends on the model inputs), the NN output $\boldsymbol{y}$ is written as

$$\boldsymbol{y} = \boldsymbol{f}(\boldsymbol{x},\boldsymbol{\Theta}) + \boldsymbol{\epsilon}, \qquad (14)$$

where $\boldsymbol{f}(\boldsymbol{x},\boldsymbol{\Theta})$ represents the surrogate NN. For the discretized residual constrained NNs, the likelihood function is constructed based on the residual value rather than on the NN-predicted solutions. The expected value of the point-wise residual is zero, which states that the governing PDEs are weakly satisfied at each location; this ensures that the proposed surrogate PDE solvers are label free. With the noise $\boldsymbol{\epsilon}$ in (14) propagating through the residual calculation, the likelihood function is written as (when nonlinear operations are involved in the residual calculation, the residual noise distribution is in general non-Gaussian even if the noise in the NN outputs is assumed to be Gaussian; under the conditions that $\Sigma_{1}$ is small and the nonlinear operations are smooth and approximately linear locally, we assume that the noise distribution of the residual is approximately Gaussian)

$$P(\boldsymbol{R}_{\text{tot}}^{\text{red}}\,|\,\boldsymbol{0},\Sigma_{2}\boldsymbol{I}) = \prod_{k=1}^{K}\mathcal{N}\big(R_{\text{tot}}^{\text{red},k}\,|\,0,\Sigma_{2}\big), \qquad (15)$$

where $k$ indicates the pixel location, with $K$ total pixels. As it is challenging to directly calculate $\Sigma_{2}$ via error propagation based on $\Sigma_{1}$, we treat $\Sigma_{2}$ as a learnable parameter to be optimized based on the NN loss. In (15), $\Sigma_{2}$ essentially serves as a threshold for the residual to converge to. The batch-wise loss of the residual constrained BNNs then has the following form

$$\mathcal{L}_{i} \approx \frac{1}{M}D_{\text{KL}}\big(Q(\boldsymbol{\Theta})\,||\,P(\boldsymbol{\Theta})\big) - \frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K}\log\Big(\mathcal{N}\big(R_{\text{tot}}^{\text{red},k}(\mathcal{D}_{i}^{n},\boldsymbol{\Theta}^{(i)})\,|\,0,\Sigma_{2}\big)\Big). \qquad (16)$$

The detailed training process of the residual constrained BNNs is discussed in Section LABEL:sec:NN-training.
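As an illustration of (16), the following minimal sketch assembles the batch-wise BNN loss. It assumes that `bnn` is a Keras model built from TFP Flipout layers (so `bnn.losses` holds the KL terms), that `compute_reduced_residual` is the hypothetical weak-PDE-loss helper of Section 4, and that `log_sigma2` is a trainable variable parameterizing $\Sigma_2$; none of these names come from the paper's source code.

```python
# A minimal sketch of the batch-wise residual constrained BNN loss of Eq. (16).
import tensorflow as tf
import tensorflow_probability as tfp

def bnn_batch_loss(bnn, batch_inputs, log_sigma2, num_batches):
    predictions = bnn(batch_inputs)                                   # one stochastic forward pass
    residual = compute_reduced_residual(batch_inputs, predictions)    # [N, H, W] pixel-wise R_tot^red
    noise = tfp.distributions.Normal(loc=0.0, scale=tf.exp(0.5 * log_sigma2))
    log_likelihood = tf.reduce_sum(noise.log_prob(residual), axis=[1, 2])  # sum over the K pixels
    nll = -tf.reduce_mean(log_likelihood)                             # average over the N examples
    kl = tf.reduce_sum(bnn.losses) / num_batches                      # (1/M) * KL(Q || P)
    return kl + nll
```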

3.2.4 Uncertainty quantification

The BNNs allow us to quantify both epistemic uncertainty from model parameters and aleatoric uncertainty from noise in the data. With the discretized residual constrained BNNs, the posterior predictive distribution $P(\boldsymbol{y}^{*}|\boldsymbol{x}^{*},\mathcal{D})$ for a specific testing data point $\{\boldsymbol{x}^{*},\boldsymbol{y}^{*}\}$ is expressed as [Zhu2018Zabaras-UQ-Bayesian-ED-CNN, Luo2020Bayesian-deep-learning]

$$P(\boldsymbol{y}^{*}|\boldsymbol{x}^{*},\mathcal{D}) = \int P(\boldsymbol{y}^{*}|\boldsymbol{x}^{*},\boldsymbol{\Theta})\,P(\boldsymbol{\Theta}|\mathcal{D})\,d\boldsymbol{\Theta} \approx \int P(\boldsymbol{y}^{*}|\boldsymbol{x}^{*},\boldsymbol{\Theta})\,Q(\boldsymbol{\Theta})\,d\boldsymbol{\Theta}, \qquad (17)$$

which can be numerically evaluated via MC sampling as

$$P(\boldsymbol{y}^{*}|\boldsymbol{x}^{*},\mathcal{D}) \approx \frac{1}{S}\sum_{s=1}^{S}P(\boldsymbol{y}^{*}|\boldsymbol{x}^{*},\boldsymbol{\Theta}^{s}) \quad\text{where}\quad \boldsymbol{\Theta}^{s}\sim Q(\boldsymbol{\Theta}), \qquad (18)$$

with $s$ indexing each sample. To represent the uncertainty, we compute the statistical moments of $\boldsymbol{y}^{*}$ via the predictive expectation

$$\mathbb{E}[\boldsymbol{y}^{*}|\boldsymbol{x}^{*},\mathcal{D}] \approx \frac{1}{S}\sum_{s=1}^{S}\boldsymbol{f}(\boldsymbol{x}^{*},\boldsymbol{\Theta}^{s}) \qquad (19)$$

and the predictive variance

$$\begin{aligned}
\text{Var}[\boldsymbol{y}^{*}|\boldsymbol{x}^{*},\mathcal{D}] &= \mathbb{E}[(\boldsymbol{y}^{*}+\boldsymbol{\epsilon})^{2}] - \big(\mathbb{E}[\boldsymbol{y}^{*}+\boldsymbol{\epsilon}]\big)^{2} \\
&\approx \frac{1}{S}\sum_{s=1}^{S}\Big(\boldsymbol{f}(\boldsymbol{x}^{*},\boldsymbol{\Theta}^{s})\boldsymbol{f}^{T}(\boldsymbol{x}^{*},\boldsymbol{\Theta}^{s})+\Sigma_{1}\boldsymbol{I}\Big) - \left(\frac{1}{S}\sum_{s=1}^{S}\boldsymbol{f}(\boldsymbol{x}^{*},\boldsymbol{\Theta}^{s})\right)\left(\frac{1}{S}\sum_{s=1}^{S}\boldsymbol{f}(\boldsymbol{x}^{*},\boldsymbol{\Theta}^{s})\right)^{T}. \qquad (20)
\end{aligned}$$
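A minimal sketch of these MC estimates follows. It reports the element-wise (diagonal) predictive variance rather than the full covariance of (20); `bnn` is assumed to have stochastic (Flipout) weights, so each forward pass corresponds to one posterior sample, and `sigma1` is an assumed value of the additive-noise variance.

```python
# A minimal sketch of the MC estimates in Eqs. (18)-(20), diagonal variance only.
import tensorflow as tf

def predictive_moments(bnn, x_star, num_samples=50, sigma1=1e-4):
    samples = tf.stack([bnn(x_star) for _ in range(num_samples)], axis=0)  # [S, ...]
    mean = tf.reduce_mean(samples, axis=0)                                  # Eq. (19)
    variance = tf.math.reduce_variance(samples, axis=0) + sigma1            # diagonal of Eq. (20)
    return mean, variance
```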

4 Efficient implementation of the residual calculation

Figure 2: Illustration of the implementation steps of computing the residual in the weak PDE loss layers. Paddings of zeros are not shown in (c, d, e).

In this section, we describe the implementation details of the weak PDE loss layers. We heavily utilize convolutional operations and vector/matrix/tensor operations to achieve numerical efficiency. Readers are directed to our source code for additional details (github.com/mechanoChem/mechanoChemML). As shown in Fig. 1, the weak PDE loss layers take both the NN inputs (BC information) and outputs (the NN-predicted solution) as their inputs. The data structure used to represent the BCs is discussed in detail in Section LABEL:sec:data-representation. A schematic of the major implementation steps of the weak PDE loss layers is shown in Fig. 2. We choose a steady-state diffusion problem with a scalar unknown at each node for illustration purposes, with Dirichlet BCs applied on the left side, non-zero Neumann BCs applied on the bottom and right sides, and zero Neumann BCs on the top. Assuming that the NN output shown in Fig. 2 is a $5\times 5$ matrix (an image with $5\times 5$ pixels), denoted as $\boldsymbol{M}_{5,5}^{\text{NN}}$ (in the source code, $\boldsymbol{M}_{5,5}^{\text{NN}}$ is stored as $\boldsymbol{M}_{5,5,1}^{\text{NN}}$ with a third dimension of 1, which indicates the DOFs per node; for elasticity problems, the third dimension has a size of 2; here, we drop the "1" to simplify the notation), with the value of each entry being the actual concentration, $\boldsymbol{M}_{5,5}^{\text{NN}}$ is equivalent to the nodal solution on a domain that is discretized with $4\times 4$ elements, as shown in Fig. 2(a). The implementation procedure is summarized in Algorithm Box LABEL:algo:compute-residual.

4.1 Dirichlet BCs

The channel of the NN inputs containing the Dirichlet BC information is denoted as $\boldsymbol{I}_{\text{D}}^{5,5}$. To enforce the Dirichlet BCs, we replace the nodal values of $\boldsymbol{M}_{5,5}^{\text{NN}}$ at the locations of the Dirichlet boundary with the actual values of $\boldsymbol{I}_{\text{D}}^{5,5}$ to obtain a new matrix, denoted as $\boldsymbol{M}_{5,5}$, as indicated by the green color in Fig. 2(a). The Dirichlet BCs are then automatically incorporated into the residual vector during the bulk residual calculation discussed in the next section.
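A minimal sketch of this Dirichlet step is given below: the NN-predicted nodal values are overwritten by the prescribed values from the Dirichlet input channel. The binary `dirichlet_mask` (1 on the Dirichlet boundary, 0 elsewhere) is an assumed helper tensor derived from the input data, not part of the notation above.

```python
# A minimal sketch of enforcing Dirichlet BCs by overwriting nodal values.
import tensorflow as tf

def apply_dirichlet_bc(m_nn, i_dirichlet, dirichlet_mask):
    # m_nn, i_dirichlet, dirichlet_mask: [batch, 5, 5, 1] in the example of Fig. 2
    return dirichlet_mask * i_dirichlet + (1.0 - dirichlet_mask) * m_nn   # M_{5,5}
```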

4.2 Bulk residual

The matrix representation of the nodal solution automatically contains the element connectivity information of the mesh. To compute the bulk residual, we first apply convolutional operations to $\boldsymbol{M}_{5,5}$ with the following kernels

$$k_{B,1}=\begin{bmatrix}1&0\\0&0\end{bmatrix},\quad k_{B,2}=\begin{bmatrix}0&1\\0&0\end{bmatrix},\quad k_{B,3}=\begin{bmatrix}0&0\\1&0\end{bmatrix},\quad k_{B,4}=\begin{bmatrix}0&0\\0&1\end{bmatrix}. \qquad (21)$$

Each convolutional operation results in a matrix of size $5\times 5$ (the resulting matrix is actually $4\times 4$; zero padding is used to restore a dimension of $5\times 5$; keeping the matrix size unchanged during the convolutional operations is not necessary and requires a small amount of extra floating-point operations, but handling matrices of a fixed size is less prone to errors), which corresponds to the selected nodes, as highlighted with colors in Fig. 2(b). With these four convolutional operations, we now have a matrix of size $5\times 5\times 4$ ($\boldsymbol{M}_{5,5,4}$), as shown in Fig. 2(c). We then reshape this matrix to an array of size $25\times 4$ ($\boldsymbol{M}_{25,4}$), as shown in Fig. 2(d). Each row of $\boldsymbol{M}_{25,4}$ corresponds to the local nodal solution vector inside one finite element, the subdomain $\Omega^{e}$ in (4), which can then be used to efficiently evaluate the residual via matrix-vector operations.
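The following minimal sketch applies the kernels of Eq. (21) as a single four-channel convolution and performs the reshape to $\boldsymbol{M}_{25,4}$; the shapes follow the $5\times 5$ example above, and the "SAME" zero padding mimics the padding described in the parenthetical remark.

```python
# A minimal sketch of gathering the four nodal values of every 2x2 element.
import tensorflow as tf

# Kernel tensor of shape [2, 2, 1, 4]: output channel j picks the j-th local node.
kernels = tf.constant(
    [[[[1., 0., 0., 0.]], [[0., 1., 0., 0.]]],
     [[[0., 0., 1., 0.]], [[0., 0., 0., 1.]]]])

m = tf.random.normal([1, 5, 5, 1])                             # M_{5,5} with Dirichlet BCs applied
m_nodes = tf.nn.conv2d(m, kernels, strides=1, padding="SAME")  # [1, 5, 5, 4], Fig. 2(c)
m_elem = tf.reshape(m_nodes, [-1, 25, 4])                      # M_{25,4}, Fig. 2(d)
```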

To evaluate the residual of the steady-state diffusion problem with $2\times 2$ Gauss points, the B-operator matrix in (LABEL:eq:discretized-residual-diffusion) has a size of $4\times 2\times 4$ (# of Gauss points $\times$ dimensions $\times$ # of nodes), denoted as $\boldsymbol{B}_{4,2,4}$, with its transpose denoted as $\boldsymbol{B}_{4,4,2}^{T}$. The bulk residual at each Gauss point $i$ is evaluated as

$$\big(\boldsymbol{R}_{\text{bulk}}^{25,4}\big)^{i} = \omega_{i}\,D\,\boldsymbol{M}_{25,4}\,\boldsymbol{B}_{i,4,2}\,\boldsymbol{B}_{i,2,4} \qquad (22)$$

with $\omega_{i}$ denoting the quadrature weights. The total bulk residual is computed as

$$\boldsymbol{R}_{\text{bulk}}^{25,4} = \sum_{i=1}^{4}\boldsymbol{R}_{\text{bulk}}^{i}, \qquad (23)$$

as shown in Fig. 2(d). $\boldsymbol{R}_{\text{bulk}}^{25,4}$ is then reshaped to $\boldsymbol{R}_{\text{bulk}}^{5,5,4}$ and stored in the element-like form, as shown in Fig. 2(e).
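A minimal sketch of Eqs. (22)-(23) for the diffusion example is shown below. It assumes that the B-operator `B` (shape [4, 2, 4]) and the quadrature weights `w` have been precomputed for the regular pixel mesh, and that `D` is the constant diffusivity; these are assumptions for illustration, not the paper's exact routine.

```python
# A minimal sketch of the vectorized bulk residual over all elements at once.
import tensorflow as tf

def bulk_residual(m_elem, B, w, D=1.0):
    # m_elem: [n_elem, 4] local nodal values per element (M_{25,4} without the batch axis)
    r = tf.zeros_like(m_elem)
    for i in range(4):                                           # loop over the 2x2 Gauss points
        grad = tf.matmul(m_elem, B[i])                           # [n_elem, 2], gradient at point i
        r += w[i] * D * tf.matmul(grad, B[i], transpose_b=True)  # Eq. (22), accumulated as in Eq. (23)
    return r                                                     # R_bulk^{25,4}
```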

Next, we use the tf.roll function to unfold the element-like residual to the correct nodal positions, as shown in Fig. 2(f), with

$$\begin{aligned}
\boldsymbol{R}_{\text{bulk}}^{5,5,0:1} &= \boldsymbol{R}_{\text{bulk}}^{5,5,0:1} \\
\boldsymbol{R}_{\text{bulk}}^{5,5,1:2} &= \text{tf.roll}\big(\boldsymbol{R}_{\text{bulk}}^{5,5,1:2},[1],[2]\big) \\
\boldsymbol{R}_{\text{bulk}}^{5,5,2:3} &= \text{tf.roll}\big(\boldsymbol{R}_{\text{bulk}}^{5,5,2:3},[1],[1]\big) \\
\boldsymbol{R}_{\text{bulk}}^{5,5,3:4} &= \text{tf.roll}\big(\boldsymbol{R}_{\text{bulk}}^{5,5,3:4},[1,1],[1,2]\big)
\end{aligned} \qquad (24)$$

The assembly operation in (LABEL:eq:discretized-residual-diffusion) for the bulk integration is now achieved by the tf.reduce_sum($\boldsymbol{R}_{\text{bulk}}^{5,5,4}$) function, without looping over all the elements as done traditionally in the FEM, to obtain $\boldsymbol{R}_{\text{bulk}}^{5,5,1}$. Readers are directed to our source code for the implementation of the linear/nonlinear elasticity problems.
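A minimal sketch of this unfold-and-assemble step is given below: Eq. (24) shifts each element-local residual channel to its nodal position with tf.roll, and tf.reduce_sum over the last axis replaces the traditional FEM assembly loop. The axis convention assumes a tensor of shape [batch, 5, 5, 4], so that axis 1 indexes rows and axis 2 indexes columns.

```python
# A minimal sketch of the roll-and-sum assembly of the bulk residual.
import tensorflow as tf

def assemble_bulk_residual(r_elem):
    # r_elem: [batch, 5, 5, 4] element-like bulk residual (Fig. 2(e))
    rolled = tf.stack([
        r_elem[..., 0],
        tf.roll(r_elem[..., 1], shift=1, axis=2),            # shift along columns
        tf.roll(r_elem[..., 2], shift=1, axis=1),             # shift along rows
        tf.roll(r_elem[..., 3], shift=[1, 1], axis=[1, 2]),   # shift along both
    ], axis=-1)                                               # Fig. 2(f)
    return tf.reduce_sum(rolled, axis=-1, keepdims=True)      # R_bulk^{5,5,1}
```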