
Solving Partial Differential Equations with Random Feature Models

Chunyang Liao (Email: [email protected]) — University of California, Los Angeles
Abstract

Machine learning based partial differential equation (PDE) solvers have received great attention in recent years. Most progress in this area has been driven by deep neural networks, such as physics-informed neural networks (PINNs), and by kernel methods. In this paper, we introduce a random feature based framework for efficiently solving PDEs. The random feature method was originally proposed to approximate large-scale kernel machines and can also be viewed as a shallow neural network. We provide an error analysis for our proposed method along with comprehensive numerical results on several PDE benchmarks. In contrast to state-of-the-art solvers that face challenges with a large number of collocation points, our proposed method reduces the computational complexity. Moreover, the implementation of our method is simple and does not require additional computational resources. Owing to its theoretical guarantee and computational advantages, our approach is an efficient tool for solving PDEs.

Keywords: Random Feature, Partial Differential Equations, Scientific Machine Learning, Error Analysis

1 Introduction

Solving partial differential equations (PDEs) is a fundamental problem in science and engineering. Traditional numerical methods include the finite element method and the finite difference method. Recently, the use of machine learning tools for solving PDEs, or more generally complex scientific tasks, has led to the new area of scientific machine learning. Unlike traditional numerical methods, machine learning (ML) based methods do not rely on complex mesh designs and intricate numerical techniques; they therefore enable simpler, faster, and more convenient implementation and use. The most prominent ML-based solver is the physics-informed neural network (PINN) [1], which uses a deep neural network to approximate the PDE solution. Given a set of collocation points in a spatiotemporal domain $\Omega$, the PDE solution is parametrized as a neural network that satisfies the PDE, boundary conditions, and initial conditions at the given collocation points. This approach leads to an optimization problem whose objective measures the PDE residual with respect to some loss functional. Finding the solution of the PDE is then equivalent to optimizing the neural network parameters using variants of stochastic gradient descent. PINNs and their variants have achieved great success in learning PDE solutions. However, it is extremely hard and expensive to optimize all parameters of a deep neural network. To reduce the computational complexity, some recent works [2] proposed to use randomized neural networks to solve PDEs. A randomized neural network is a special type of neural network in which some parameters are randomly generated from a known probability distribution instead of being optimized. This strategy has been shown to reduce the computational time while maintaining the approximation accuracy; see the numerical experiments in [2]. While these neural network based methods are often used in practice, their theoretical analysis relies on the universal approximation property of deep neural networks, which only shows the existence of a network of a requisite size achieving a certain error rate. Such an existence result does not guarantee that the network is computable in practice.

Instead of using deep neural networks, kernel methods and Gaussian processes (GPs) have also been used to learn PDE solutions [3, 4, 5, 6]. The main idea of such methods is to approximate the solution of a given PDE by an element of a reproducing kernel Hilbert space. This element is found by solving an optimal recovery problem constrained by the PDE at collocation points [5, 7]. The optimal recovery problem can also be interpreted as maximum a posteriori (MAP) estimation for a Gaussian process constrained by the PDE. The key to solving an optimal recovery problem is a celebrated representer theorem that characterizes the minimizer through a finite-dimensional representation formula, which is easy to implement and interpret. Moreover, kernel-based PDE solvers are supported by a rigorous theoretical foundation; in particular, detailed a priori error estimates are provided in [6]. However, the computational cost of kernel methods can be a significant drawback: they do not scale well when the sample size is large. For example, given $m$ training collocation points, the kernel method requires $\mathcal{O}(m^{3})$ training time and $\mathcal{O}(m^{2})$ storage for the kernel matrix, which is often computationally infeasible when $m$ is large.

To overcome the computational bottleneck of kernel methods, the random feature method was proposed to approximate large-scale kernel machines [8]. The main idea of random features is to map the data into a low-dimensional randomized feature space. The kernel matrix is then approximated by a low-rank matrix, which reduces the computational and storage costs of operating on the kernel matrix. The random feature model can be viewed as a randomized two-layer neural network: the weights connecting the input layer and the single hidden layer are randomly generated from a known distribution rather than being trainable parameters, and only the weights of the output layer are trained.

In this paper, we propose a random feature based PDE solver. Since the random feature model is both a type of randomized neural network and an approximation of the kernel method, the computational complexity can be reduced significantly. Moreover, we provide a convergence analysis of our method. The key contributions of our work are summarized as follows:

  • Framework: We propose a random feature framework for solving PDEs. By minimizing the PDE residuals, our method does not require the construction of a kernel matrix. In practice, this framework allows us to use modern automatic differentiation libraries, which is straightforward and convenient.

  • Convergence Analysis: We provide a detailed convergence analysis of our proposed framework under mild and widely used assumptions on the PDE. Our convergence analysis contains two steps: the first step follows the standard convergence analysis of the kernel method [6]; the second step concerns the approximation of the kernel method by the random feature method. To the best of our knowledge, this is the first work providing a convergence analysis of a random feature based PDE solver.

  • Numerical Experiments: We test the performance of our framework on nonlinear elliptic PDEs, (high-dimensional) nonlinear Poisson PDEs, the Allen-Cahn equation, and the advection-diffusion equation. Our method requires fewer computational resources to train the model while achieving similar or better performance compared with existing methods on all benchmarks. We also numerically verify the convergence rate obtained from our analysis for all problems.

The remainder of this paper is organized as follows. In Section 2, we give an overview of the random feature method and provide a framework for solving PDEs with random feature models along with an error analysis. Section 3 is dedicated to numerical experiments for solving PDEs with the random feature method; we compare our method with PINN on several benchmarks and provide an empirical convergence study. We conclude the paper with a summary of the results and some possible future directions in Section 4.

2 Random Feature

2.1 Overview of Random Feature Method

To introduce the random feature method, we first give a short introduction to the kernel method, also known as the kernel trick. It is one of the most popular techniques for capturing nonlinear relations between features and targets. Let $\mathbf{x},\mathbf{x}^{\prime}\in X\subset\mathbb{R}^{d}$ be samples and $\phi:X\to\mathcal{H}$ be a feature map transforming samples into a high-dimensional (even infinite-dimensional) reproducing kernel Hilbert space $\mathcal{H}$, where the mapped data can be learned by a linear model. In practice, the explicit expression of the feature map $\phi$ is not necessarily known to us. The inner product between $\phi(\mathbf{x})$ and $\phi(\mathbf{x}^{\prime})$ induced by $\mathcal{H}$ can be computed using a kernel function $k(\cdot,\cdot):X\times X\to\mathbb{R}$, i.e.

k(\mathbf{x},\mathbf{x}^{\prime})=\langle\phi(\mathbf{x}),\phi(\mathbf{x}^{\prime})\rangle_{\mathcal{H}}.

Due to the ease of computing this inner product, the kernel method is effective for nonlinear learning problems and has a wide range of successful applications [9]. However, it does not scale well to extremely large datasets. For example, given $m$ training samples, kernel regression requires $\mathcal{O}(m^{3})$ training time and $\mathcal{O}(m^{2})$ storage for the kernel matrix, which is often computationally infeasible when $m$ is large. The random feature method is one of the most popular techniques to overcome the computational challenges of the kernel method. The theoretical foundation of random (Fourier) features builds on the following classical result from harmonic analysis.

Theorem 1 (Bochner [10]).

A continuous shift-invariant kernel $k(\mathbf{x},\mathbf{x}^{\prime})=k(\mathbf{x}-\mathbf{x}^{\prime})$ on $\mathbb{R}^{d}$ is positive definite if and only if $k(\delta)$ is the Fourier transform of a non-negative measure.

Bochner's theorem guarantees that the Fourier transform of the kernel $k(\delta)$ is a proper probability distribution if the kernel is scaled properly. Denoting this probability distribution by $\rho(\boldsymbol{\omega})$, we have

k(\mathbf{x},\mathbf{x}^{\prime})=\int_{\mathbb{R}^{d}}\exp(i\langle\boldsymbol{\omega},\mathbf{x}-\mathbf{x}^{\prime}\rangle)\,d\rho(\boldsymbol{\omega})=\int_{\mathbb{R}^{d}}\exp(i\langle\boldsymbol{\omega},\mathbf{x}\rangle)\overline{\exp(i\langle\boldsymbol{\omega},\mathbf{x}^{\prime}\rangle)}\,d\rho(\boldsymbol{\omega}). (1)

Using the Monte Carlo sampling technique, we randomly generate $N$ i.i.d. samples $\{\boldsymbol{\omega}_{k}\}_{k\in[N]}$ from $\rho(\boldsymbol{\omega})$ and define a random Fourier feature map $\phi:\mathbb{R}^{d}\to\mathbb{C}^{N}$ as follows (the map depends on the i.i.d. samples $\{\boldsymbol{\omega}_{k}\}_{k\in[N]}$ drawn from $\rho(\boldsymbol{\omega})$; to simplify the notation, we omit this dependency when defining $\phi$):

\phi(\mathbf{x})=\frac{1}{\sqrt{N}}\Big{[}\exp(i\langle\boldsymbol{\omega}_{1},\mathbf{x}\rangle),\dots,\exp(i\langle\boldsymbol{\omega}_{N},\mathbf{x}\rangle)\Big{]}^{T}\in\mathbb{C}^{N}. (2)

Using the random Fourier feature map, we can define a kernel function $\hat{k}(\mathbf{x},\mathbf{x}^{\prime}):X\times X\to\mathbb{R}$ as

\hat{k}(\mathbf{x},\mathbf{x}^{\prime}):=\langle\phi(\mathbf{x}),\phi(\mathbf{x}^{\prime})\rangle.

The kernel function $\hat{k}$ is finite-dimensional since the random Fourier feature map $\phi$ maps the data to a finite-dimensional space $\mathbb{C}^{N}$. We can also use random cosine features to approximate any shift-invariant kernel. Specifically, with a random feature $\boldsymbol{\omega}\in\mathbb{R}^{d}$ generated from $\rho(\boldsymbol{\omega})$ and $b\in\mathbb{R}$ sampled from the uniform distribution on $[-\pi,\pi]$, the random cosine feature is defined as $\cos(\langle\boldsymbol{\omega},\mathbf{x}\rangle+b)$. Similarly, we can define a random cosine feature map $\phi:\mathbb{R}^{d}\to\mathbb{R}^{N}$ as

\phi(\mathbf{x})=\frac{1}{\sqrt{N}}\Big{[}\cos(\langle\boldsymbol{\omega}_{1},\mathbf{x}\rangle+b_{1}),\dots,\cos(\langle\boldsymbol{\omega}_{N},\mathbf{x}\rangle+b_{N})\Big{]}^{T}\in\mathbb{R}^{N}, (3)

using i.i.d. random samples $\{\boldsymbol{\omega}_{k},b_{k}\}_{k\in[N]}$, and hence define a kernel function $\hat{k}:X\times X\to\mathbb{R}$ using the random cosine feature map. We can thus approximate the original shift-invariant kernel function $k$ defined in (1) by the finite-dimensional kernel $\hat{k}$. Utilizing random features allows efficient learning with $\mathcal{O}(mN^{2})$ time and $\mathcal{O}(mN)$ storage.
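The following minimal sketch illustrates this approximation for the Gaussian kernel. It is not taken from the paper's code; a $\sqrt{2}$ factor is included so that the feature inner product is an unbiased Monte Carlo estimate of $k(\mathbf{x},\mathbf{x}^{\prime})$ (constant rescalings of the map in (3) can be absorbed into the trainable coefficients).

```python
import numpy as np

def gaussian_kernel(X, gamma):
    # Exact Gaussian kernel k(x, x') = exp(-gamma * ||x - x'||_2^2).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def cosine_features(X, omega, b):
    # Random cosine feature map; sqrt(2/N) makes <phi(x), phi(x')> an unbiased
    # Monte Carlo estimate of the kernel value.
    N = omega.shape[1]
    return np.sqrt(2.0 / N) * np.cos(X @ omega + b)

rng = np.random.default_rng(0)
d, N, gamma = 2, 2000, 1.0

# Bochner pair for the Gaussian kernel: omega ~ N(0, 2*gamma*I), b ~ Unif[-pi, pi].
omega = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, N))
b = rng.uniform(-np.pi, np.pi, size=N)

X = rng.uniform(0.0, 1.0, size=(50, d))
Phi = cosine_features(X, omega, b)        # 50 x N feature matrix

K_exact = gaussian_kernel(X, gamma)
K_approx = Phi @ Phi.T                    # low-rank approximation of the kernel matrix
print(np.abs(K_exact - K_approx).max())   # error decays roughly like N^{-1/2}
```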

2.2 Random Feature Regression

Now, we turn to the regression setting, where the aim is to learn a function $f:\mathbb{R}^{d}\to\mathbb{R}$ from training samples $\{(\mathbf{x}_{j},y_{j})\}_{j\in[m]}$. To implement the random feature model, we first draw $N$ i.i.d. random features $\{\boldsymbol{\omega}_{k}\}_{k\in[N]}\subset\mathbb{R}^{d}$ from a probability distribution $\rho(\boldsymbol{\omega})$, and then construct an approximation of the target function $f$ of the form

f^{\sharp}(\mathbf{x})=\sum_{k=1}^{N}c_{k}^{\sharp}\phi(\mathbf{x},\boldsymbol{\omega}_{k}). (4)

We assume that the $m$ sampling points $\mathbf{x}_{j}$ are drawn from a certain distribution with the corresponding output values

y_{j}=f(\mathbf{x}_{j})+e_{j},\quad\mbox{ for all }j\in[m],

where $e_{j}$ is the measurement noise. Let $\mathbf{A}\in\mathbb{R}^{m\times N}$ be the random feature matrix defined component-wise by $\mathbf{A}_{j,k}=\phi(\mathbf{x}_{j},\boldsymbol{\omega}_{k})$ for $j\in[m]$ and $k\in[N]$. Training the random feature model (4) is equivalent to finding a coefficient vector $\mathbf{c}^{\sharp}\in\mathbb{R}^{N}$ such that $\mathbf{A}\mathbf{c}^{\sharp}\approx\mathbf{y}$, where $\mathbf{c}^{\sharp}=[c_{1}^{\sharp},\dots,c_{N}^{\sharp}]^{\top}\in\mathbb{R}^{N}$ and $\mathbf{y}=[y_{1},\dots,y_{m}]^{\top}\in\mathbb{R}^{m}$.

In the under-parameterized regime, where we have more measurements than features ($m\geq N$), the coefficients are trained by solving the (regularized) least squares problem:

\mathbf{c}_{\lambda}^{\sharp}\in\mathop{\mathrm{argmin}}_{\mathbf{c}\in\mathbb{R}^{N}}\|\mathbf{A}\mathbf{c}-\mathbf{y}\|_{2}^{2}+\lambda\|\mathbf{c}\|_{2}^{2},

where $\lambda>0$ is the regularization parameter. This is also referred to as ridge regression since the ridge regularization term $\lambda\|\mathbf{c}\|_{2}^{2}$ is added.

Recently, over-parameterized models have received great attention since such trained models not only fit the training samples exactly but also predict well on unseen test data [11, 12]. In the over-parameterized regime, we have more features than measurements ($N\geq m$), and we consider training the coefficient vector $\mathbf{c}^{\sharp}\in\mathbb{R}^{N}$ using the min-norm interpolation problem:

\mathbf{c}^{\sharp}\in\mathop{\mathrm{argmin}}_{\mathbf{c}\in\mathbb{R}^{N}}\|\mathbf{c}\|_{2}^{2}\quad\mbox{ subject to }\mathbf{A}\mathbf{c}=\mathbf{y}.

This problem is also referred to as the ridgeless regression problem since the solution $\mathbf{c}^{\sharp}$ can be viewed as the limit of $\mathbf{c}_{\lambda}^{\sharp}$ as $\lambda\to 0$.
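Both training problems reduce to linear algebra in the coefficient vector. The sketch below is illustrative rather than the paper's implementation and shows one way to solve each regime (assuming $\mathbf{A}\mathbf{A}^{\top}$ is invertible in the over-parameterized case).

```python
import numpy as np

def fit_ridge(A, y, lam):
    # Under-parameterized regime (m >= N): min ||A c - y||_2^2 + lam * ||c||_2^2.
    N = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ y)

def fit_min_norm(A, y):
    # Over-parameterized regime (N >= m): minimum-norm interpolant of A c = y,
    # c = A^T (A A^T)^{-1} y, assuming A A^T is invertible.
    # Equivalently: np.linalg.lstsq(A, y, rcond=None)[0].
    return A.T @ np.linalg.solve(A @ A.T, y)
```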

The generalization analysis of random feature models has been of recent interest [13, 14, 15, 16, 17, 18, 19]. In [13], the authors showed that the random feature model yields a test error of $\mathcal{O}(N^{-\frac{1}{2}}+m^{-\frac{1}{2}})$ when trained with Lipschitz loss functions; hence the generalization error is $\mathcal{O}(N^{-\frac{1}{2}})$ for large $N$ if $m\asymp N$. However, the model there is trained by solving a constrained optimization problem which is rarely used in practice. In [14], it was shown that for $f$ in an RKHS, using $N=\mathcal{O}(\sqrt{m}\log(m))$ features is sufficient to achieve a test error of $\mathcal{O}(m^{-\frac{1}{2}})$ with the squared loss. In [15], the authors showed that a regularized model can achieve $N^{-1}+m^{-\frac{1}{2}}$ risk provided that the target function belongs to an RKHS. Nevertheless, the results in [14, 15] depend on assumptions on the kernel and a certain decay rate of the second moment operator, which may be difficult to verify in practice. Extending results on random feature models from the squared loss to the 0-1 loss, the authors of [16] showed that the support vector machine with $N\ll m$ random features can achieve a learning rate faster than $\mathcal{O}(m^{-\frac{1}{2}})$ on a training set with $m$ samples. In [17], the authors computed a precise asymptotic expression for the test error in the limit $N,m,d\to\infty$ with $N/d$ and $m/d$ fixed. In [18], the authors derived non-asymptotic bounds covering both the under-parameterized setting using the (regularized) least squares problem and the over-parameterized setting using the min-norm problem or sparse regression. Their results rely on the condition numbers of the random feature matrix and indicate double descent behavior in random feature models. However, the target function space there is a subset of an RKHS, which limits the approximation ability of the random feature model. In [19], the authors consider an RKHS as the target function space and derive a similar non-asymptotic bound using different proof techniques.

Kernel name | $k(\mathbf{x},\mathbf{x}^{\prime})$ | $\rho(\boldsymbol{\omega})$
Gaussian kernel | $\exp(-\gamma\|\mathbf{x}-\mathbf{x}^{\prime}\|_{2}^{2})$, $\gamma>0$ | $(4\pi\gamma)^{-d/2}\exp\left(-\frac{\|\boldsymbol{\omega}\|_{2}^{2}}{4\gamma}\right)$
Laplace kernel | $\exp(-\gamma\|\mathbf{x}-\mathbf{x}^{\prime}\|_{1})$, $\gamma>0$ | $\prod_{j=1}^{d}\frac{\gamma}{\pi(\gamma^{2}+\omega_{j}^{2})}$
Table 1: Commonly used shift-invariant kernels and the corresponding Fourier densities

Furthermore, the random feature model can be viewed as a two-layer (one hidden layer) neural network in which the weights connecting the input layer and the hidden layer, as well as the biases, are sampled randomly and independently from a known distribution. Only the weights of the output layer are trained on the training samples, and hence training a random feature model amounts to solving a convex optimization problem. There are other types of randomized neural networks, including the random vector functional link (RVFL) network [20, 21] and the extreme learning machine (ELM) [22, 23, 24], among others. Sharing the same structure as the random feature model, the RVFL network is also a shallow neural network in which the input-to-hidden weights and biases are randomly selected. However, the motivations of the two models are different: RVFL networks were designed to address the difficulties associated with training deep neural networks, and their weights are usually sampled from a uniform distribution, whereas random feature models were originally introduced to approximate large-scale kernel machines, so the random weights depend on the kernel function; see Table 1 for some commonly used kernels and the corresponding densities. The extreme learning machine is a randomized neural network, possibly with multiple hidden layers, in which all the hidden-layer weights are randomly selected and then fixed; only the output-layer coefficients are trained. Theoretical guarantees for RVFL and ELM suggest that they are universal approximators; see [25].

2.3 Random Feature Models for Solving PDEs

In this section, we present our framework for solving PDEs using random feature models. Let us consider the PDE problem of the general form

\mathcal{P}[u](\mathbf{x})=0,\quad\mathbf{x}\in\Omega, (5)
\mathcal{B}[u](\mathbf{x})=0,\quad\mathbf{x}\in\partial\Omega,

where $\Omega\subset\mathbb{R}^{d}$ is the domain with boundary $\partial\Omega$, $\mathcal{P}$ is the interior differential operator, and $\mathcal{B}$ is the boundary differential operator. For the sake of brevity, we assume throughout this paper that the PDE is well-defined pointwise and has a unique strong solution. We propose to solve the PDE (5) using a random feature model. More precisely, let $\{\mathbf{x}_{j}\}_{j\in[M]}$ be a collection of collocation points such that $\{\mathbf{x}_{j}\}_{j\in[M_{\Omega}]}$ lie in the interior of $\Omega$ and $\{\mathbf{x}_{j}\}_{j=M_{\Omega}+1}^{M}$ lie on the boundary $\partial\Omega$. Random features $\{\boldsymbol{\omega}_{k}\}_{k\in[N]}$ are drawn from a known distribution $\rho(\boldsymbol{\omega})$. The random feature model takes the form

u^{\sharp}(\mathbf{x})=\sum_{k=1}^{N}c_{k}^{\sharp}\phi(\mathbf{x},\boldsymbol{\omega}_{k}). (6)

We train the random feature model by solving the following optimization problem:

\mathop{\mathrm{minimize}}_{\mathbf{c}\in\mathbb{R}^{N}}\ \|\mathbf{c}\|_{2}^{2} (7)
s.t. \mathcal{P}[u^{\sharp}](\mathbf{x}_{j})=0,\quad\mbox{ for }j=1,\dots,M_{\Omega}
\mathcal{B}[u^{\sharp}](\mathbf{x}_{j})=0,\quad\mbox{ for }j=M_{\Omega}+1,\dots,M

We wish to find an approximation of the true solution $u$ by the min-norm random feature model satisfying the PDE and boundary data at the collocation points.

Example (Linear PDE).

If the PDE is linear, we can rewrite the constraints in (7) as a linear system. Consider the following linear PDE

-\Delta u(\mathbf{x})+u(\mathbf{x})=f(\mathbf{x}),\quad\mathbf{x}\in\Omega
u(\mathbf{x})=g(\mathbf{x}),\quad\mathbf{x}\in\partial\Omega,

we can write the constraints as the following linear system

\begin{bmatrix}\mathbf{A}\\ \mathbf{B}\end{bmatrix}\mathbf{c}=\begin{bmatrix}\mathbf{f}\\ \mathbf{g}\end{bmatrix}, (8)

where $\mathbf{A}\in\mathbb{R}^{M_{\Omega}\times N}$ and $\mathbf{B}\in\mathbb{R}^{(M-M_{\Omega})\times N}$ are defined component-wise by $\mathbf{A}_{j,k}=\phi(\mathbf{x}_{j},\boldsymbol{\omega}_{k})-\Delta\phi(\mathbf{x}_{j},\boldsymbol{\omega}_{k})$ (precisely, $\Delta\phi(\mathbf{x}_{j},\boldsymbol{\omega}_{k})$ denotes the evaluation of $\Delta\phi(\mathbf{x},\boldsymbol{\omega}_{k})$ at the point $\mathbf{x}=\mathbf{x}_{j}$) and by $\mathbf{B}_{j,k}=\phi(\mathbf{x}_{j},\boldsymbol{\omega}_{k})$, respectively. The two vectors on the right-hand side are defined as $\mathbf{f}=[f(\mathbf{x}_{1}),\dots,f(\mathbf{x}_{M_{\Omega}})]^{\top}\in\mathbb{R}^{M_{\Omega}}$ and $\mathbf{g}=[g(\mathbf{x}_{M_{\Omega}+1}),\dots,g(\mathbf{x}_{M})]^{\top}\in\mathbb{R}^{M-M_{\Omega}}$. In this case we compute $\mathbf{c}\in\mathbb{R}^{N}$ by the least squares method if the system is overdetermined or by the min-norm method if the system is underdetermined.
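A minimal assembly sketch for this linear example is shown below. It assumes cosine features, so that the Laplacian of each feature is available in closed form, and hypothetical callables f and g for the forcing and boundary data; it is an illustration rather than the paper's implementation.

```python
import numpy as np

def phi(X, omega, b):
    # Cosine features: phi(x, omega_k) = cos(<omega_k, x> + b_k), one column per feature.
    return np.cos(X @ omega + b)

def assemble_linear_system(X_int, X_bdy, omega, b, f, g):
    # Stacked system (8) for -Delta u + u = f in Omega, u = g on the boundary.
    # For cosine features, Delta_x phi(x, omega_k) = -||omega_k||_2^2 * phi(x, omega_k),
    # so each interior entry equals (1 + ||omega_k||_2^2) * phi(x_j, omega_k).
    scale = 1.0 + (omega ** 2).sum(axis=0)
    A = phi(X_int, omega, b) * scale          # interior rows (PDE residual)
    B = phi(X_bdy, omega, b)                  # boundary rows (Dirichlet data)
    return np.vstack([A, B]), np.concatenate([f(X_int), g(X_bdy)])

# np.linalg.lstsq returns the least squares solution when the stacked system is
# overdetermined and the minimum-norm solution when it is underdetermined:
# lhs, rhs = assemble_linear_system(X_int, X_bdy, omega, b, f, g)
# c = np.linalg.lstsq(lhs, rhs, rcond=None)[0]
```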

However, addressing (7) directly can be complicated if the PDE is nonlinear. Therefore, we may solve an unconstrained optimization problem with regularization instead,

\mathop{\mathrm{minimize}}_{\mathbf{c}\in\mathbb{R}^{N}}\|\mathbf{c}\|_{2}^{2}+\lambda_{1}\sum_{j=1}^{M_{\Omega}}\left(\mathcal{P}[u^{\sharp}](\mathbf{x}_{j})\right)^{2}+\lambda_{2}\sum_{j=M_{\Omega}+1}^{M}\left(\mathcal{B}[u^{\sharp}](\mathbf{x}_{j})\right)^{2}, (9)

where $\lambda_{1},\lambda_{2}>0$ are regularization parameters. As $\lambda_{1},\lambda_{2}\to\infty$, the penalties enforce the constraints and the solution of (9) converges to the solution of (7). In practice, we use variants of stochastic gradient descent and modern automatic differentiation libraries to solve problem (9). In challenging scenarios, a large number of collocation points is required to capture the solution details. Compared with the framework in [3], which uses the standard kernel matrix to construct the solution approximation, our proposed random feature method approximates the kernel matrix, and hence reduces the computational cost and accelerates computations involving the standard kernel matrix (and its inverse) when dealing with massive collocation points.
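The sketch below illustrates how an objective of the form (9) can be set up with automatic differentiation, using the nonlinear elliptic operator $-\Delta u+u^{3}$ of Section 3.1 as a concrete example. The function names, the weights lam1, lam2, and the weight reg on $\|\mathbf{c}\|_{2}^{2}$ are hypothetical (the weighting is equivalent to (9) up to an overall rescaling), and f, g stand for the forcing and boundary data; this is an illustrative sketch, not the paper's code.

```python
import torch

def rf_model(x, c, omega, b):
    # Random feature model u#(x) = sum_k c_k * cos(<omega_k, x> + b_k).
    return torch.cos(x @ omega + b) @ c

def penalized_loss(x_int, x_bdy, c, omega, b, f, g, lam1=1.0, lam2=1.0, reg=1e-8):
    # Penalized objective in the spirit of (9) for -Delta u + u^3 = f in Omega,
    # u = g on the boundary; derivatives are computed by automatic differentiation.
    x_int = x_int.clone().requires_grad_(True)
    u = rf_model(x_int, c, omega, b)
    grad_u = torch.autograd.grad(u.sum(), x_int, create_graph=True)[0]
    lap_u = torch.zeros_like(u)
    for i in range(x_int.shape[1]):
        lap_u = lap_u + torch.autograd.grad(
            grad_u[:, i].sum(), x_int, create_graph=True)[0][:, i]
    interior = ((-lap_u + u ** 3 - f(x_int)) ** 2).sum()
    boundary = ((rf_model(x_bdy, c, omega, b) - g(x_bdy)) ** 2).sum()
    return reg * (c ** 2).sum() + lam1 * interior + lam2 * boundary

# Only c is trainable; omega and b stay fixed once sampled, e.g.
# opt = torch.optim.Adam([c], lr=1e-3)
# for _ in range(n_epochs):
#     opt.zero_grad(); loss = penalized_loss(...); loss.backward(); opt.step()
```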

2.4 Convergence Analysis

In this section, we present the convergence analysis of our method. It relies on the standard convergence analysis of the kernel method in [6] and on the approximation of the kernel by random features. Recall the kernel-based method for solving PDEs in [5, 6]: a reproducing kernel Hilbert space $\mathcal{H}$ is chosen and one aims to solve the following problem

\mathop{\mathrm{minimize}}_{u\in\mathcal{H}}\ \|u\|_{\mathcal{H}} (10)
s.t. \mathcal{P}[u](\mathbf{x}_{j})=0,\quad\mbox{ for }j=1,\dots,M_{\Omega}
\mathcal{B}[u](\mathbf{x}_{j})=0,\quad\mbox{ for }j=M_{\Omega}+1,\dots,M

We first state the main assumptions on the domain $\Omega$ and its boundary $\partial\Omega$, the PDE operators $\mathcal{P}$ and $\mathcal{B}$, and the reproducing kernel Hilbert space $\mathcal{H}$.

Assumption 1.

The following assumptions hold:

  • (C1) Regularity of the domain and its boundary: $\Omega\subset\mathbb{R}^{d}$ with $d>1$ is a compact set and $\partial\Omega$ is a smooth connected Riemannian manifold of dimension $d-1$ endowed with a geodesic distance $\rho_{\partial\Omega}$.

  • (C2) Stability of the PDE: There exist $\gamma>0$ and $k,t\in\mathbb{N}$ satisfying $d/2<k+\gamma$ and $(d-1)/2<t+\gamma$, and $s,\ell\in\mathbb{R}$, so that for any $r>0$ it holds that, for any $u_{1},u_{2}\in B_{r}(H^{\ell}(\Omega))$,

    \|u_{1}-u_{2}\|_{H^{\ell}(\Omega)}\leq C\left(\|\mathcal{P}(u_{1})-\mathcal{P}(u_{2})\|_{H^{k}(\Omega)}+\|\mathcal{B}(u_{1})-\mathcal{B}(u_{2})\|_{H^{t}(\partial\Omega)}\right),

    and for any $u_{1},u_{2}\in B_{r}(H^{s}(\Omega))$,

    \|\mathcal{P}(u_{1})-\mathcal{P}(u_{2})\|_{H^{k+\gamma}(\Omega)}+\|\mathcal{B}(u_{1})-\mathcal{B}(u_{2})\|_{H^{t+\gamma}(\partial\Omega)}\leq C\|u_{1}-u_{2}\|_{H^{s}(\Omega)},

    where $C=C(r)>0$ is a constant independent of $u_{1}$ and $u_{2}$.

  • (C3) The RKHS $\mathcal{H}$ is continuously embedded in $H^{s}(\Omega)$.

Item (C1) is a standard assumption in the analysis of PDEs. Item (C2) assumes that the PDE is Lipschitz well-posed with respect to the right-hand side/source term; it relates to the analysis of nonlinear PDEs and is independent of our numerical scheme. Assumption (C3) dictates the choice of the RKHS $\mathcal{H}$, and in turn the kernel, which should be carefully selected based on the regularity of the strong solution $u$. We are now ready to state the first theorem, which concerns the convergence rate of the kernel method.

Theorem 2 (Theorem 3.8, [6]).

Suppose Assumption 1 is satisfied and denote the unique strong solution of (5) by $u\in\mathcal{H}$. Let $\hat{u}$ be a minimizer of (10) with interior collocation points $X_{\Omega}$ and collocation points on the boundary $X_{\partial\Omega}$. Define the fill-in distances

h_{\Omega}:=\sup_{\mathbf{x}^{\prime}\in\Omega}\inf_{\mathbf{x}\in X_{\Omega}}\|\mathbf{x}-\mathbf{x}^{\prime}\|_{2},\qquad h_{\partial\Omega}:=\sup_{\mathbf{x}^{\prime}\in\partial\Omega}\inf_{\mathbf{x}\in X_{\partial\Omega}}\rho_{\partial\Omega}(\mathbf{x},\mathbf{x}^{\prime}),

and set $h=\max(h_{\Omega},h_{\partial\Omega})$. Then there exists a constant $h_{0}$ so that if $h<h_{0}$ then

\|u-\hat{u}\|_{H^{s}(\Omega)}\leq Ch^{\gamma}\|u\|_{\mathcal{H}},

where $C>0$ is a constant independent of $h$ and $u$.
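In practice the fill-in distances can be estimated numerically. The following sketch (an illustration, not part of the paper) estimates $h_{\Omega}$ by replacing the supremum over $\Omega$ with a maximum over a dense set of probe points; $h_{\partial\Omega}$ would be estimated analogously with the geodesic distance $\rho_{\partial\Omega}$ in place of the Euclidean one.

```python
import numpy as np

def fill_distance(X_colloc, X_probe):
    # Estimate h_Omega = sup_{x' in Omega} min_{x in X_Omega} ||x - x'||_2 by
    # taking the maximum over a dense set of probe points X_probe.
    d2 = ((X_probe[:, None, :] - X_colloc[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(d2.min(axis=1).max())
```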

With the kernel minimizer $\hat{u}\in\mathcal{H}$ at hand, our next step is to approximate $\hat{u}$ using random features. We first adopt an alternative representation of the preselected RKHS $\mathcal{H}$. Denoting the corresponding random Fourier (or random cosine) feature by $\phi(\mathbf{x},\boldsymbol{\omega})$, we define the following function space

\mathcal{F}(\rho):=\left\{f(\mathbf{x})=\int_{\mathbb{R}^{d}}\alpha(\boldsymbol{\omega})\phi(\mathbf{x},\boldsymbol{\omega})d\rho(\boldsymbol{\omega}):\|f\|^{2}_{\rho}=\mathbb{E}_{\boldsymbol{\omega}}[\alpha(\boldsymbol{\omega})^{2}]<\infty\right\}, (11)

where $\rho(\cdot)$ is the Fourier transform density associated with the kernel $k$. Notice that the completion of $\mathcal{F}(\rho)$ is a Hilbert space equipped with the norm $\|f\|_{\rho}$. By Proposition 4.1 in [26], it is indeed the reproducing kernel Hilbert space $\mathcal{H}$ with associated kernel function $k$, and the norms $\|\cdot\|_{\mathcal{H}}$ and $\|\cdot\|_{\rho}$ are equivalent.

In the next theorem, we address the approximation ability of the finite-sum random feature model of the form (6).

Theorem 3.

Let $f$ be a function from $\mathcal{F}(\rho)$. Suppose that the random feature map $\phi$ satisfies $|\phi(\mathbf{x},\boldsymbol{\omega})|\leq 1$ for all $\mathbf{x}\in X$ and $\boldsymbol{\omega}\in\mathbb{R}^{d}$. Then for any $\delta\in(0,1)$, there exist coefficients $c^{\sharp}_{1},\dots,c^{\sharp}_{N}$ so that the function

f^{\sharp}(\mathbf{x})=\sum_{k=1}^{N}c^{\sharp}_{k}\phi(\mathbf{x},\boldsymbol{\omega}_{k}) (12)

satisfies

\left|f(\mathbf{x})-f^{\sharp}(\mathbf{x})\right|\leq\frac{12\|f\|_{\rho}\log(2/\delta)}{\sqrt{N}}

with probability at least $1-\delta$ over $\boldsymbol{\omega}_{1},\dots,\boldsymbol{\omega}_{N}$ drawn i.i.d. from $\rho(\boldsymbol{\omega})$.

Proof.

We first introduce the notation $\alpha_{\leq T}(\boldsymbol{\omega})=\alpha(\boldsymbol{\omega})\mathbbm{1}_{|\alpha(\boldsymbol{\omega})|\leq T}$ and $\alpha_{>T}(\boldsymbol{\omega})=\alpha(\boldsymbol{\omega})-\alpha_{\leq T}(\boldsymbol{\omega})$ for any $T>0$. Then we define

c_{k}^{\sharp}=\frac{1}{N}\alpha_{\leq T}(\boldsymbol{\omega}_{k})\quad\mbox{ for all }k\in[N], (13)

where the $\boldsymbol{\omega}_{k}$'s are i.i.d. samples following a probability distribution with density $\rho(\boldsymbol{\omega})$, and hence define $f^{\sharp}(\mathbf{x})$ in (12) using the $c^{\sharp}_{k}$'s. We can show that

\mathbb{E}f^{\sharp}(\mathbf{x})=\mathbb{E}_{\boldsymbol{\omega}}\left[\alpha_{\leq T}(\boldsymbol{\omega})\phi(\mathbf{x},\boldsymbol{\omega})\right].

By utilizing the triangle inequality, we decompose the error into two terms

\left|f(\mathbf{x})-f^{\sharp}(\mathbf{x})\right|\leq\underbrace{\left|f(\mathbf{x})-\mathbb{E}f^{\sharp}(\mathbf{x})\right|}_{I_{1}}+\underbrace{\left|\mathbb{E}f^{\sharp}(\mathbf{x})-f^{\sharp}(\mathbf{x})\right|}_{I_{2}}. (14)

We first bound the term $I_{1}$. Recalling the definitions of $f$ and $\alpha_{\leq T}(\boldsymbol{\omega})$, we have

\left|f(\mathbf{x})-\mathbb{E}f^{\sharp}(\mathbf{x})\right|^{2}=\Big{|}\mathbb{E}_{\boldsymbol{\omega}}\left[\alpha_{>T}(\boldsymbol{\omega})\phi(\mathbf{x},\boldsymbol{\omega})\right]\Big{|}^{2}\leq\mathbb{E}_{\boldsymbol{\omega}}\left[\alpha(\boldsymbol{\omega})^{2}\right]\,\mathbb{E}_{\boldsymbol{\omega}}\Big{[}\mathbbm{1}_{|\alpha(\boldsymbol{\omega})|>T}\,\phi(\mathbf{x},\boldsymbol{\omega})^{2}\Big{]}\leq\mathbb{E}_{\boldsymbol{\omega}}\left[\alpha(\boldsymbol{\omega})^{2}\right]\,\mathbb{P}\big{(}\alpha(\boldsymbol{\omega})^{2}>T^{2}\big{)}\leq\frac{\Big{(}\mathbb{E}_{\boldsymbol{\omega}}[\alpha(\boldsymbol{\omega})^{2}]\Big{)}^{2}}{T^{2}}=\frac{\|f\|_{\rho}^{4}}{T^{2}}, (15)

where we use the Cauchy-Schwarz inequality in the first inequality, the bound $|\phi(\mathbf{x},\boldsymbol{\omega})|\leq 1$ in the second, and Markov's inequality in the last.

Next, we bound the term $I_{2}$. For any $\mathbf{x}\in X$, we define the random variable $Z(\boldsymbol{\omega})=\alpha_{\leq T}(\boldsymbol{\omega})\phi(\mathbf{x},\boldsymbol{\omega})$ and let $Z_{1},\dots,Z_{N}$ be $N$ i.i.d. copies of $Z$ defined by $Z_{k}=Z(\boldsymbol{\omega}_{k})$ for each $k\in[N]$. By the boundedness of $\alpha_{\leq T}(\boldsymbol{\omega})$, we have the upper bound $|Z_{k}|\leq T$ for any $k\in[N]$. The variance of $Z$ is bounded above as

\sigma^{2}:=\mathbb{E}_{\boldsymbol{\omega}}|Z-\mathbb{E}_{\boldsymbol{\omega}}Z|^{2}\leq\mathbb{E}_{\boldsymbol{\omega}}|Z|^{2}\leq\mathbb{E}_{\boldsymbol{\omega}}[\alpha(\boldsymbol{\omega})^{2}]=\|f\|^{2}_{\rho}.

By Lemma A.2 and Theorem A.1 in [27], it holds that, with probability at least $1-\delta$,

\left|f^{\sharp}(\mathbf{x})-\mathbb{E}f^{\sharp}(\mathbf{x})\right|=\left|\frac{1}{N}\sum_{k=1}^{N}Z_{k}-\mathbb{E}_{\boldsymbol{\omega}}Z\right|\leq\frac{4T\log(2/\delta)}{N}+\sqrt{\frac{2\|f\|^{2}_{\rho}\log(2/\delta)}{N}}. (16)

Taking the square root of both sides of (15) and adding the result to (16) gives

|f(\mathbf{x})-f^{\sharp}(\mathbf{x})|\leq\frac{\mathbb{E}_{\boldsymbol{\omega}}[\alpha(\boldsymbol{\omega})^{2}]}{T}+\frac{4T\log(2/\delta)}{N}+\sqrt{\frac{2\|f\|^{2}_{\rho}\log(2/\delta)}{N}}.

Selecting $T=\sqrt{N}\|f\|_{\rho}$ gives the desired result. ∎

The last theorem in this section presents the convergence rate of our proposed random feature model.

Theorem 4.

Suppose that the conditions in Theorems 2 and 3 hold. Then for any $\delta\in(0,1)$ there exist coefficients $c^{\sharp}_{1},\dots,c^{\sharp}_{N}$ so that the function

u^{\sharp}(\mathbf{x})=\sum_{k=1}^{N}c^{\sharp}_{k}\phi(\mathbf{x},\boldsymbol{\omega}_{k})

satisfies

\|u-u^{\sharp}\|_{L^{2}(\Omega)}\leq Ch^{\gamma}\|u\|_{\mathcal{H}}+\frac{12\|u\|_{\mathcal{H}}\log(2/\delta)\sqrt{\mathop{\mathrm{vol}}(\Omega)}}{\sqrt{N}}

with probability at least $1-\delta$ over $\boldsymbol{\omega}_{1},\dots,\boldsymbol{\omega}_{N}$ drawn i.i.d. from $\rho(\boldsymbol{\omega})$.

Proof.

Using the triangle inequality, we decompose the error as

\|u-u^{\sharp}\|_{L^{2}(\Omega)}\leq\|u-\hat{u}\|_{L^{2}(\Omega)}+\|\hat{u}-u^{\sharp}\|_{L^{2}(\Omega)},

where $\hat{u}\in\mathcal{H}$ is a minimizer of (10) with interior collocation points $X_{\Omega}$ and collocation points on the boundary $X_{\partial\Omega}$. Noting that the $L^{2}(\Omega)$ norm is controlled by the $H^{s}(\Omega)$ norm, we directly apply Theorem 2 to bound $\|u-\hat{u}\|_{L^{2}(\Omega)}$. The second term is bounded as

\|\hat{u}-u^{\sharp}\|_{L^{2}(\Omega)}=\sqrt{\int_{\Omega}\left|\hat{u}(\mathbf{x})-u^{\sharp}(\mathbf{x})\right|^{2}d\mathbf{x}}\leq\frac{12\|\hat{u}\|_{\mathcal{H}}\log(2/\delta)\sqrt{\mathop{\mathrm{vol}}(\Omega)}}{\sqrt{N}}\leq\frac{12\|u\|_{\mathcal{H}}\log(2/\delta)\sqrt{\mathop{\mathrm{vol}}(\Omega)}}{\sqrt{N}},

where the first inequality relies on the pointwise bound in Theorem 3 applied to $\hat{u}\in\mathcal{H}$, and the second inequality holds since the strong solution $u\in\mathcal{H}$ is feasible for (10), and hence the minimizer $\hat{u}$ must satisfy $\|\hat{u}\|_{\mathcal{H}}\leq\|u\|_{\mathcal{H}}$. Adding the two bounds together leads to the desired error bound. ∎

3 Numerical experiments

In this section, we test the performance of our proposed random feature method on several PDE benchmarks from the literature [5, 2, 3]. In addition, we numerically verify the convergence rate obtained in Theorem 4. In all numerical experiments, we randomly generate training samples over the domain to train the models and generate a different set of test points to evaluate the PDE solutions. We compare the true solution and the predicted solution on these test points (of size $M$) to compute the test error, which is defined as

\mathop{\mathrm{Error}}=\frac{1}{M}\sum_{j=1}^{M}\Big{(}u(\mathbf{x}_{j})-\hat{u}(\mathbf{x}_{j})\Big{)}^{2}.

We consider Gaussian random features in all examples. Detailed problem settings are stated below. All experiments are implemented in Python based on the Torch library. Our code is available at the repository: https://github.com/liaochunyang/RF_PDE.

3.1 Nonlinear Elliptic PDEs

We test on an instance of the nonlinear elliptic PDE

-\Delta u(\mathbf{x})+u(\mathbf{x})^{3}=f(\mathbf{x}),\quad\mathbf{x}\in\Omega
u(\mathbf{x})=g(\mathbf{x}),\quad\mathbf{x}\in\partial\Omega,

where $\Omega=[0,1]^{2}$. The solution is $u(\mathbf{x})=\sin(\pi x_{1})\sin(\pi x_{2})+4\sin(4\pi x_{1})\sin(4\pi x_{2})$, and the right-hand side $f(\mathbf{x})$ is computed accordingly from the solution $u(\mathbf{x})$. The boundary condition is $g(\mathbf{x})=0$. This example was also considered in [5, 3]. We sample $N=1000$ random features from the normal distribution $\mathcal{N}(0,\sigma^{2}\mathbf{I})$ with variance $\sigma^{2}=100$. The random feature model is trained on $M_{\Omega}=900$ interior collocation points and 124 collocation points on the boundary $\partial\Omega$. We take a uniform grid of size $100\times 100$ as our test points. In Figure 1, we show the predicted solution of our proposed random feature model, the true solution, and the entry-wise absolute errors at the test points. We observe that our proposed random feature model provides an accurate prediction of the true solution.
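For reference, the forcing can be derived in closed form from the manufactured solution; the sketch below shows one such derivation (an illustration, not necessarily the implementation used in the experiments).

```python
import numpy as np

def u_true(x):
    # Manufactured solution of the nonlinear elliptic benchmark on [0, 1]^2.
    return (np.sin(np.pi * x[:, 0]) * np.sin(np.pi * x[:, 1])
            + 4.0 * np.sin(4.0 * np.pi * x[:, 0]) * np.sin(4.0 * np.pi * x[:, 1]))

def f_rhs(x):
    # Forcing f = -Delta u + u^3, using
    # Delta[sin(a*pi*x1)sin(a*pi*x2)] = -2*(a*pi)^2 * sin(a*pi*x1)sin(a*pi*x2).
    s1 = np.sin(np.pi * x[:, 0]) * np.sin(np.pi * x[:, 1])
    s4 = np.sin(4.0 * np.pi * x[:, 0]) * np.sin(4.0 * np.pi * x[:, 1])
    lap_u = -2.0 * np.pi ** 2 * s1 - 128.0 * np.pi ** 2 * s4
    return -lap_u + u_true(x) ** 3
```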

Figure 1: Numerical results of the nonlinear elliptic PDE: we show the predicted solution, the true solution, and the entry-wise absolute error.

We also compare the training epochs, test errors, and training times of our proposed method with PINN; see the results in Table 2. The PINN model has 2 hidden layers with 64 neurons each, and the nonlinear activation function is tanh. The PINN is trained on the same training samples and is tested on the same uniform grid. We observe that training the PINN is harder: it requires more epochs and hence a longer training time. If we use the same number of epochs, the PINN gives a poor prediction compared with the random feature method. Moreover, our proposed method has a smaller test error. Overall, our proposed random feature model outperforms PINN in this example.

Method | Epochs | Test error | Training Time (Seconds)
RF | 1000 | $1.44\times 10^{-4}$ | 99.05
PINN | 1000 | $1.22$ | 22.32
PINN | 10000 | $1.35\times 10^{-2}$ | 169.28
Table 2: Numerical results of the nonlinear elliptic PDE: we compare our proposed random feature method with PINN.

Finally, we numerically verify the convergence rate for the nonlinear elliptic PDE. We first fix the number of random features to $N=100$ and vary the number of collocation points: we sample $M_{\Omega}=400,900,1600$ points uniformly in the domain and $M_{\partial\Omega}=84,124,164$ points uniformly on the boundary, and we evaluate the test errors on a separate set of 100 test points. In the second experiment, we fix $M_{\Omega}=400$ interior collocation points and $M_{\partial\Omega}=84$ points on the boundary, take different numbers of random features, i.e. $N=100,200,300$, and again evaluate the test errors on a separate set of 100 test points. In Figure 2, we show the test errors as a function of the number of collocation points and as a function of the number of random features, respectively. Each point in the figure is the average over 10 repetitions of the experiment.
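The slopes reported in the figure legends are obtained from a log-log fit; a minimal sketch of such a fit is shown below (errors_vs_N is a hypothetical array of measured test errors).

```python
import numpy as np

def empirical_rate(sizes, errors):
    # Slope of log(error) vs. log(size): the empirical convergence rate
    # reported in the figure legends.
    slope, _ = np.polyfit(np.log(np.asarray(sizes, dtype=float)),
                          np.log(np.asarray(errors, dtype=float)), 1)
    return slope

# Hypothetical usage with errors measured for N = 100, 200, 300 random features:
# print(empirical_rate([100, 200, 300], errors_vs_N))
```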

Figure 2: Test error as a function of the number of collocation points, and of the number of random features, respectively. Slopes reported in the legends denote empirical convergence rates.

3.2 Nonlinear Poisson PDE

In this section, we test our proposed method on high-dimensional nonlinear Poisson PDEs. We consider the domain $\Omega=[-1,1]^{d}$ and the following problem defined on $\Omega$,

-\nabla\cdot(a(u)\nabla u)=f(\mathbf{x}),\quad\mathbf{x}\in\Omega
u(\mathbf{x})=g(\mathbf{x}),\quad\mathbf{x}\in\partial\Omega,

where $a(u)=u^{3}-u$. The solution is crafted as $u(\mathbf{x})=\exp(-\frac{1}{d}\sum_{i=1}^{d}x_{i})$. The function $g(\mathbf{x})=u(\mathbf{x})$ on the boundary, and the right-hand side $f(\mathbf{x})$ is computed from the true solution, which gives the explicit expression

f(\mathbf{x})=\frac{1}{d}\left[-3\exp\left(-\frac{3}{d}\sum_{i=1}^{d}x_{i}\right)+2\exp\left(-\frac{2}{d}\sum_{i=1}^{d}x_{i}\right)\right].

We first test the 2D nonlinear Poisson PDE and show the predicted solution, the true solution, and the entry-wise absolute errors in Figure 3. The training set of size 1024 contains 900 interior collocation points and 124 boundary points, uniformly generated over the domain $\Omega=[-1,1]^{2}$ and its boundary $\partial\Omega$, respectively. We randomly generate 500 test samples to evaluate the performance. We generate 1000 random features, drawn from the standard normal distribution, to construct the random feature model.

Figure 3: Numerical results of the nonlinear Poisson PDE: we show the predicted solution of the random feature model, the true solution, and the corresponding entry-wise errors at the test points.

Furthermore, we compare our proposed model with the PINN model. The neural network has 2 hidden layers with 64 neurons in each layer, and we select tanh as the activation function. We test the 2D nonlinear Poisson PDE as well as high-dimensional nonlinear Poisson PDEs ($d=4$ and $d=8$), comparing the test errors and training times. We also report the number of random features for our model and the number of epochs for both models. All numerical results are summarized in Table 3. We observe that our proposed random feature model achieves similar performance to, or even beats, the PINN models in terms of the test error.

Dimension | Method | $N$ | Epochs | Test error | Training Time (Seconds)
$d=2$ | RF | 500 | 1000 | $2.22\times 10^{-3}$ | 42.29
 | PINN | - | 2000 | $8.98\times 10^{-3}$ | 51.46
$d=4$ | RF | 500 | 1000 | $1.12\times 10^{-3}$ | 51.24
 | PINN | - | 2000 | $9.58\times 10^{-4}$ | 55.49
$d=8$ | RF | 100 | 1500 | $1.81\times 10^{-3}$ | 37.84
 | PINN | - | 2000 | $3.21\times 10^{-3}$ | 54.14
Table 3: Comparison between the random feature model and PINN for the nonlinear Poisson PDEs. We report the number of random features, the number of epochs, the test error, and the training time for each model.

We note that the accuracy of our model depends on the choice of the variance of the Gaussian random features, although it is not overly sensitive to this choice. In practice, the hyperparameter can be tuned by cross-validation. In Table 4, we report the variances and the corresponding test errors for the nonlinear Poisson PDE in dimension $d=8$. In this example, a smaller variance works better, and the test error is not very sensitive to the choice of variance once it is below some threshold.

$\sigma^{2}$ | 100 | 1 | 0.04 | 0.01 | 0.0025
Test error | $2.14\times 10^{-1}$ | $3.62\times 10^{-2}$ | $7.31\times 10^{-3}$ | $4.81\times 10^{-4}$ | $4.05\times 10^{-4}$
Table 4: Variances of the Gaussian random features and the corresponding test errors for the nonlinear Poisson PDE ($d=8$). The training sample size is $M=1024$, and each error is calculated on a set of 100 test samples.
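A minimal sketch of the grid search mentioned above is given below; train_and_score is a hypothetical helper that draws features from $\mathcal{N}(0,\sigma^{2}\mathbf{I})$, trains the model on a training split, and returns the validation error.

```python
import numpy as np

def select_variance(candidate_vars, train_and_score):
    # Grid search over the variance sigma^2 of the Gaussian random features.
    # Cross-validation would average train_and_score over several splits.
    errors = [train_and_score(v) for v in candidate_vars]
    return candidate_vars[int(np.argmin(errors))], errors
```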

In Figure 4, we report the empirical convergence rates for the nonlinear Poisson equations. Figure 4(a) shows the test error as a function of the number of collocation points: for each dimension, we fix $N=100$ random features and vary the number of collocation points, uniformly sampling $M_{\Omega}=100,400,900$ interior points and $M_{\partial\Omega}=44,84,124$ boundary points, and we evaluate the test errors on a separate set of 100 test points. Figure 4(b) shows the test error as a function of the number of random features: for each dimension, we fix 484 collocation points, of which 400 are interior collocation points, and take $N=100,400$ random features. We average over 10 experiments to produce each point in Figure 4.

Figure 4: Test error as a function of the number of collocation points, and of the number of random features, respectively. Slopes reported in the legends denote empirical convergence rates.

3.3 Allen-Cahn Equation

Next, we consider a 2D stationary Allen-Cahn equation with a source function and Dirichlet boundary conditions, i.e.

\Delta u+\gamma(u^{m}-u)=f(\mathbf{x}),\quad\mathbf{x}\in[0,1]^{2},

where $\gamma=1$ and $m=3$. The solution takes the form $u(\mathbf{x})=\sin(2\pi ax_{1})\cos(2\pi ax_{2})$, and the function $f(\mathbf{x})$ is computed from the solution $u(\mathbf{x})$. The positive parameter $a$ controls the frequency of the solution. We test three cases: $a=1$, $a=10$, and $a=20$.

We first compare the performance of the random feature method and PINN. In each case, we randomly sample $M=1024$ points, with $M_{\Omega}=900$ interior collocation points, to train the models. The test performance of both models is evaluated on 100 test samples, which are uniformly generated over the domain and boundary. We report the parameter selections and summarize the numerical results in Table 5. From the numerical results, we observe that the random feature models beat PINNs across all cases, especially when the frequency parameter $a$ is large. This might be related to the known “spectral bias” of neural networks [28, 29, 30]: neural networks trained by gradient descent fit a low frequency function before a high frequency one, and it is therefore difficult for PINNs to learn high frequency PDE solutions. To alleviate spectral bias and learn high frequency functions, previous work proposed to use Fourier features and showed both theoretically and empirically that a Fourier feature mapping can improve the performance [31]. Fourier features have also been used to solve high frequency PDEs, see [32]. As our results suggest, a higher frequency in the PDE solution calls for a larger variance of the Gaussian random features. Moreover, more random features and more epochs are required to train the random feature model as the frequency increases. In Figure 5, we show the predicted solution, the true solution, and the entry-wise errors at the test points.

Frequency | Method | $N$ | $\sigma^{2}$ | Epochs | Test error | Training Time (Seconds)
$a=1$ | RF | 200 | $10^{2}$ | 1000 | $7.80\times 10^{-6}$ | 14.27
 | PINN | - | - | 1000 | $2.45\times 10^{-1}$ | 8.64
$a=10$ | RF | 200 | $100^{2}$ | 2000 | $1.12\times 10^{-4}$ | 25.37
 | PINN | - | - | 2000 | $1.04\times 10^{1}$ | 18.48
$a=20$ | RF | 400 | $1000^{2}$ | 1500 | $2.87\times 10^{-1}$ | 86.54
 | PINN | - | - | 2000 | $6.23\times 10^{1}$ | 26.67
Table 5: Comparison between the random feature model and PINN for the Allen-Cahn equations. We report the number of epochs, the test error, and the training time for each model. For the random feature model, we also report the number of random features and the variance of the Gaussian random features.
Figure 5: Numerical results for the Allen-Cahn equation with frequency parameter $a=10$: we show the predicted solution, the true solution, and the corresponding entry-wise errors at the test points.

Figure 6 illustrates the numerical verification of the convergence rate for the Allen-Cahn equation. We first show the test error as a function of the number of collocation points: we fix the number of random features to $N=100$, uniformly sample $M_{\Omega}=400,900,1600$ points in the domain and $M_{\partial\Omega}=84,124,164$ points on the boundary, and evaluate the test errors on a separate set of 100 test points. In the second experiment, which shows the test error as a function of the number of random features, we uniformly sample and then fix $M_{\Omega}=400$ interior collocation points and $M_{\partial\Omega}=124$ points on the boundary, generate $N=100,200,400$ random features from the Gaussian distribution, and evaluate the test errors on another set of 100 test points. Each point in Figure 6 is the average over 10 repetitions of the experiment.

Figure 6: Allen-Cahn equation: test error as a function of the number of collocation points, and of the number of random features, respectively. Slopes reported in the legends denote empirical convergence rates.

3.4 Advection Diffusion Equation

Finally, we test our method on the advection-diffusion equation. We consider the initial-boundary value problem on the space-time domain $(x,t)\in[-1,1]\times[0,1]$:

u_{t}-u_{xx}+u_{x}=f(x,t),\quad(x,t)\in[-1,1]\times[0,1]
u(x,t)=g(x,t),\quad(x,t)\in\{-1,1\}\times[0,1],
u(x,0)=h(x),\quad x\in[-1,1].

The true solution is taken to be $u(x,t)=\sin(x)\exp(-t)$. The functions $f(x,t)$, $g(x,t)$, and $h(x)$ are set according to the true solution. When simulating this problem, we treat the time variable $t$ in the same way as the spatial variable $x$. We uniformly generate 1000 training samples over $[-1,1]\times[0,1]$, and we enforce the boundary condition on 100 collocation points and the initial condition on 200 collocation points. Gaussian random features ($N=100$) are sampled from the standard normal distribution (variance $\sigma^{2}=1$). The PINN model has 2 hidden layers with 64 neurons in each layer. We report the number of epochs, the test errors, and the training times of both models in Table 6. In this example, our proposed method achieves a test error similar to PINN, but the training is simpler in the sense that it requires fewer epochs, and the training time of the random feature model is around 50% of that of the PINN model. In Figure 7, we compare the predicted solution and the true solution at times $t=0.1,0.5,0.9$ to further highlight the ability of our method to learn the true solution.
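The space-time collocation setup described above can be sketched as follows; the sizes match those stated, while the exact sampling of the boundary and initial points is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Interior collocation points in the space-time domain [-1, 1] x [0, 1]:
# the time variable t is treated exactly like the spatial variable x.
xt_interior = np.column_stack([rng.uniform(-1.0, 1.0, 1000),
                               rng.uniform(0.0, 1.0, 1000)])

# Boundary points (x = -1 or x = 1, t in [0, 1]) and initial points (t = 0).
xt_boundary = np.column_stack([rng.choice([-1.0, 1.0], 100),
                               rng.uniform(0.0, 1.0, 100)])
xt_initial = np.column_stack([rng.uniform(-1.0, 1.0, 200), np.zeros(200)])
```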

Method | Epochs | Test error | Training Time (Seconds)
RF | 600 | $2.99\times 10^{-4}$ | 15.46
PINN | 1000 | $2.38\times 10^{-4}$ | 28.24
Table 6: Numerical results of the advection-diffusion equation: we compare our proposed random feature method with PINN.
Figure 7: Numerical results of the advection-diffusion equation: predicted solution and true solution at the time slices $t=0.1$ (a), $t=0.5$ (b), and $t=0.9$ (c).

Finally, we perform a convergence study for the advection-diffusion equation. In Figure 8(a), we show the test errors as a function of the number of collocation points. We use $M=100,200,400$ interior collocation points, uniformly generated from $[-1,1]\times[0,1]$, together with 100 points on the boundary and 200 points for the initial values. In Figure 8(b), we use 200 interior points, 100 boundary points, and 200 points for the initial condition, and we vary the number of random features, i.e. $N=100,200,400$. Each point in the figure is the average over 10 repetitions of the experiment.

Figure 8: Advection-diffusion equation: test error as a function of the number of collocation points, and of the number of random features, respectively. Slopes reported in the legends denote empirical convergence rates.

4 Conclusion

In this paper, we propose a random feature model for solving partial differential equations along with an error analysis. By utilizing some techniques from probability, we provide convergence rates of our proposed method under some mild assumptions on the PDE. Our framework allows convenient implementation and efficient computation. Moreover, it easily scales to massive collocation points, which are necessary for solving challenging PDEs. We test our method on several PDE benchmarks. The numerical experiments indicate that our method either matches or beats state-of-the-art models and reduces the computational cost.

Finally, we conclude with some directions for future work. First, our analysis does not directly address the minimizer obtained by solving the optimization problem; doing so would require analyzing a min-norm minimization problem with nonlinear constraints. Second, while it is natural to sample random features from the Fourier transform density, it can be advantageous to sample from a different density, which has been shown to yield better performance. Third, we assume that the PDE at hand is well-defined pointwise and has a unique strong solution; extending our framework to weak solutions is left for future work.

References

  • [1] M. Raissi, P. Perdikaris, and G. Karniadakis, “Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations,” Journal of Computational Physics, vol. 378, pp. 686–707, 2019.
  • [2] Y. Wang and S. Dong, “An extreme learning machine-based method for computational pdes in higher dimensions,” Computer Methods in Applied Mechanics and Engineering, vol. 418, p. 116578, 2024.
  • [3] Z. Xu, D. Long, Y. Xu, G. Yang, S. Zhe, and H. Owhadi, “Toward efficient kernel-based solvers for nonlinear PDEs,” arXiv:2410.11165, 2024.
  • [4] S. Fang, M. Cooley, D. Long, S. Li, M. Kirby, and S. Zhe, “Solving high frequency and multi-scale PDEs with gaussian processes,” in The Twelfth International Conference on Learning Representations, 2024.
  • [5] Y. Chen, B. Hosseini, H. Owhadi, and A. M. Stuart, “Solving and learning nonlinear pdes with gaussian processes,” Journal of Computational Physics, vol. 447, p. 110668, 2021.
  • [6] P. Batlle, Y. Chen, B. Hosseini, H. Owhadi, and A. M. Stuart, “Error analysis of kernel/GP methods for nonlinear and parametric pdes,” Journal of Computational Physics, vol. 520, p. 113488, 2025.
  • [7] S. Foucart, C. Liao, S. Shahrampour, and Y. Wang, “Learning from non-random data in Hilbert spaces: an optimal recovery perspective,” Sampling Theory, Signal Processing, and Data Analysis, vol. 20, no. 5, 2022.
  • [8] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Advances in Neural Information Processing Systems, vol. 20, Curran Associates, Inc., 2007.
  • [9] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2001.
  • [10] S. Bochner, Harmonic Analysis and the Theory of Probability. University of California Press Berkeley, 1955.
  • [11] M. Belkin, S. Ma, and S. Mandal, “To understand deep learning we need to understand kernel learning,” in Proceedings of the 35th International Conference on Machine Learning, vol. 80 of Proceedings of Machine Learning Research, pp. 541–549, PMLR, 10–15 Jul 2018.
  • [12] T. Liang and A. Rakhlin, “Just interpolate: Kernel “Ridgeless” regression can generalize,” The Annals of Statistics, vol. 48, no. 3, pp. 1329 – 1347, 2020.
  • [13] A. Rahimi and B. Recht, “Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning,” in Advances in Neural Information Processing Systems, vol. 21, Curran Associates, Inc., 2008.
  • [14] A. Rudi and L. Rosasco, “Generalization properties of learning with random features,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, (Red Hook, NY, USA), p. 3218–3228, Curran Associates Inc., 2017.
  • [15] W. E, C. Ma, L. Wu, and S. Wojtowytsch, “Towards a mathematical understanding of neural network-based machine learning: What we know and what we don’t,” CSIAM Transactions on Applied Mathematics, vol. 1, no. 4, pp. 561–615, 2020.
  • [16] Y. Sun, A. Gilbert, and A. Tewari, “But how does it work in theory? linear SVM with random features,” in Advances in Neural Information Processing Systems, vol. 31, Curran Associates, Inc., 2018.
  • [17] S. Mei and A. Montanari, “The generalization error of random features regression: Precise asymptotics and the double descent curve,” Communications on Pure and Applied Mathematics, vol. 75, 2019.
  • [18] Z. Chen and H. Schaeffer, “Conditioning of random Fourier feature matrices: double descent and generalization error,” Information and Inference: A Journal of the IMA, vol. 13, p. iaad054, 04 2024.
  • [19] C. Liao, D. Needell, and A. Xue, “Differentially private random feature model,” arXiv:2412.04785, 2024. Submitted.
  • [20] D. Needell, A. A. Nelson, R. Saab, P. Salanevich, and O. Schavemaker, “Random vector functional link networks for function approximation on manifolds,” Frontiers in Applied Mathematics and Statistics-Optimization, vol. 10, 2024.
  • [21] A. Malik, R. Gao, M. Ganaie, M. Tanveer, and P. N. Suganthan, “Random vector functional link network: Recent developments, applications, and future directions,” Applied Soft Computing, vol. 143, p. 110377, 2023.
  • [22] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: Theory and applications,” Neurocomputing, vol. 70, no. 1, pp. 489–501, 2006. Neural Networks.
  • [23] G.-B. Huang, L. Chen, and C.-K. Siew, “Universal approximation using incremental constructive feedforward networks with random hidden nodes,” IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 879–892, 2006.
  • [24] J. Wang, S. Lu, S.-H. Wang, and Y.-D. Zhang, “A review on extreme learning machine,” Multimedia Tools and Applications, vol. 81, no. 29, 2022.
  • [25] B. Igelnik and Y.-H. Pao, “Stochastic choice of basis functions in adaptive function approximation and the functional-link net,” IEEE Transactions on Neural Networks, vol. 6, no. 6, pp. 1320–1329, 1995.
  • [26] A. Rahimi and B. Recht, “Uniform approximation of functions with random bases,” in 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pp. 555–561, 2008.
  • [27] S. Lanthaler and N. H. Nelsen, “Error bounds for learning with vector-valued random features,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [28] R. Basri, M. Galun, A. Geifman, D. Jacobs, Y. Kasten, and S. Kritchman, “Frequency bias in neural networks for input of non-uniform density,” in Proceedings of the 37th International Conference on Machine Learning, vol. 119 of Proceedings of Machine Learning Research, pp. 685–694, PMLR, 13–18 Jul 2020.
  • [29] N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville, “On the spectral bias of neural networks,” in Proceedings of the 36th International Conference on Machine Learning, vol. 97 of Proceedings of Machine Learning Research, pp. 5301–5310, PMLR, 09–15 Jun 2019.
  • [30] B. Ronen, D. Jacobs, Y. Kasten, and S. Kritchman, “The convergence rate of neural networks for learned functions of different frequencies,” in Advances in Neural Information Processing Systems, vol. 32, Curran Associates, Inc., 2019.
  • [31] M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. Barron, and R. Ng, “Fourier features let networks learn high frequency functions in low dimensional domains,” in Advances in Neural Information Processing Systems, vol. 33, pp. 7537–7547, Curran Associates, Inc., 2020.
  • [32] S. Wang, H. Wang, and P. Perdikaris, “On the eigenvector bias of Fourier feature networks: From regression to solving multi-scale pdes with physics-informed neural networks,” Computer Methods in Applied Mechanics and Engineering, vol. 384, p. 113938, 2021.