
A MINI-BATCH METHOD FOR SOLVING NONLINEAR PDES WITH GAUSSIAN PROCESSES

Xianjin Yang$^{1,*}$, Houman Owhadi$^{1}$
$^{1}$Computing & Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125.
[email protected] [email protected]
Abstract.

Gaussian process (GP) based methods for solving partial differential equations (PDEs) show great promise by bridging the gap between the theoretical rigor of traditional numerical algorithms and the flexible design of machine learning solvers. The main bottleneck of GP methods lies in the inversion of a covariance matrix, whose cost grows cubically with the number of samples. Drawing inspiration from neural networks, we propose a mini-batch algorithm combined with GPs to solve nonlinear PDEs. A naive deployment of a stochastic gradient descent method for solving PDEs with GPs is challenging, as the objective function in the requisite minimization problem cannot be expressed as the expectation of a finite-dimensional random function. To address this issue, we apply a mini-batch method to the corresponding infinite-dimensional minimization problem over function spaces. The algorithm takes a mini-batch of samples at each step to update the GP model. Thus, the computational cost is allotted to each iteration. Using stability analysis and convexity arguments, we show that the mini-batch method steadily reduces a natural measure of errors towards zero at the rate of $O(1/K+1/M)$, where $K$ is the number of iterations and $M$ is the batch size.

Corresponding author.

1. Introduction

This work develops a mini-batch method based on stochastic proximal algorithms [12, 1, 2] to solve nonlinear partial differential equations (PDEs) with Gaussian processes (GPs). PDEs are widely used in science, economics, and biology [35, 44]. Since PDEs generally lack explicit solutions, numerical approximations are essential in applications. Standard numerical methods for solving nonlinear PDEs, for instance, finite difference [44] and finite element methods [19], are prone to the curse of dimensionality. Recently, to keep pace with growing problem sizes, machine learning methods, including neural networks [39, 27, 20] and probabilistic algorithms [38, 21, 9], have attracted considerable attention. In particular, GPs present a promising approach that integrates the rigorous principles of traditional numerical algorithms with the adaptable framework of machine learning solvers [38, 9, 46, 3]. Our method is based on the work of [9], which proposes a GP framework for solving general nonlinear PDEs. A numerical approximation of the solution to a nonlinear PDE is viewed as the maximum a posteriori (MAP) estimator of a GP conditioned on solving the PDE at a finite set of sample points. Equivalently, [9] proposes to approximate the unique strong solution of a nonlinear PDE in a reproducing kernel Hilbert space (RKHS) $\mathcal{U}$ by a minimizer of the following optimal recovery problem

\min_{u\in\mathcal{U}}\frac{\lambda}{2}\|u\|_{\mathcal{U}}^{2}+\frac{1}{N}\sum_{i=1}^{N}|f_{i}([\boldsymbol{\phi}_{i},u])-y_{i}|^{2}, \qquad (1.1)

where $\lambda$ is a small regularization parameter, $N$ is the number of samples in the domain, $f_{i}([\boldsymbol{\phi}_{i},u])-y_{i}$ encodes the PDE equation satisfied at the $i^{\textit{th}}$ sample, and $\boldsymbol{\phi}_{i}$ represents the linear operators in the $i^{\textit{th}}$ equation. Let $\boldsymbol{\phi}=(\boldsymbol{\phi}_{1},\dots,\boldsymbol{\phi}_{N})$, let $\kappa$ be the kernel associated with $\mathcal{U}$, and let $\Omega$ be the domain of the PDE. Denote by $\kappa(\boldsymbol{x},\boldsymbol{\phi})$ the action of $\boldsymbol{\phi}$ on $\kappa(\boldsymbol{x},\cdot)$ for fixed $\boldsymbol{x}\in\Omega$ and by $\kappa(\boldsymbol{\phi},\boldsymbol{\phi})$ the action of $\boldsymbol{\phi}$ on $\kappa(\cdot,\boldsymbol{\phi})$. By the representer theorem (see [34, Sec. 17.8]), a solution $u^{*}$ of (1.1) satisfies $u^{*}(\boldsymbol{x})=\kappa(\boldsymbol{x},\boldsymbol{\phi})\kappa(\boldsymbol{\phi},\boldsymbol{\phi})^{-1}\boldsymbol{z}^{*}$, $\forall\boldsymbol{x}\in\Omega$, where $\boldsymbol{z}^{*}$ solves

\min_{\boldsymbol{z}}\frac{\lambda}{2}\boldsymbol{z}^{T}\kappa(\boldsymbol{\phi},\boldsymbol{\phi})^{-1}\boldsymbol{z}+\frac{1}{2N}\sum_{i=1}^{N}|f_{i}(\boldsymbol{z}_{i})-y_{i}|^{2}. \qquad (1.2)

Then, [9] proposes to solve (1.2) by the Gauss–Newton method. We refer readers to [9, 29] for illustrative examples of how to transform a PDE-solving task into the minimization problem described in (1.1). The computational cost of the above GP method grows cubically with respect to the number of samples due to the inversion of the covariance matrix $\kappa(\boldsymbol{\phi},\boldsymbol{\phi})$. To improve the performance, based on [9], [31] proposes to approximate the solution of a PDE in a function space generated by random Fourier features [37, 49], where the corresponding covariance matrix is a low-rank approximation to $\kappa(\boldsymbol{\phi},\boldsymbol{\phi})$. In [29], another low-rank approximation method using sparse GPs is presented, employing a small number of inducing points to summarize observation information. Additionally, [10] introduces an algorithm that efficiently computes a sparse Cholesky factorization of the inverse of the Gram matrix. Despite the computational savings they achieve, low-rank approximation methods lack the scalability of neural network-based algorithms. The primary cause of this difference is that the number of parameters in the GP/kernel method scales with the number of collocation points, which in turn grows with the dimension. Conversely, in neural network models, the number of parameters is designed to be independent of the number of data points.

Compared to methods based on neural networks [39, 28, 20], Gaussian process (GP)/kernel-based approaches generally offer solid theoretical foundations. Additionally, from a probabilistic viewpoint, GP regressions provide uncertainty quantification. This quantification allows us to establish confidence intervals for the results and aids in deciding the placement of collocation points. The exploration of how to effectively utilize uncertainty quantification will be a subject of future research. Nevertheless, the GP method typically incurs a computational cost of $O(N^{3})$, where $N$ represents the number of collocation points. On the other hand, besides their capacity to model a wide range of functions, neural network methods also excel in scalability, largely due to the use of stochastic gradient descent (SGD) algorithms with mini-batches. This scalability is facilitated by the independence of their parameters from the number of data points. However, due to the global correlation of the unknowns in (1.2), the direct application of SGD is impeded by the need to compute the inverse of the covariance matrix in (1.2). Consequently, devising an effective SGD approach is vital for the GP methodology to reach a level of scalability on par with neural networks. This paper attempts to handle this issue by proposing a mini-batch method inspired by stochastic proximal methods [12, 1, 2]. Rather than pursuing low-rank approximations, we stochastically solve equation (1.1). Each iteration selects a small portion of samples to update the GP model. Hence, we only need to invert the covariance matrix of the samples in a mini-batch at each step. More precisely, we introduce a slack variable $\boldsymbol{z}$ and reformulate (1.1) as

\min_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\frac{\lambda}{2}\|u\|_{\mathcal{U}}^{2}+\frac{\beta}{2N}\sum_{i=1}^{N}|[\boldsymbol{\phi}_{i},u]-\boldsymbol{z}_{i}|^{2}+\frac{1}{2N}\sum_{i=1}^{N}|f_{i}(\boldsymbol{z}_{i})-y_{i}|^{2}, \qquad (1.3)

where $\mathcal{X}$ is a convex compact set, $\beta>0$, and $\boldsymbol{z}=(\boldsymbol{z}_{1},\dots,\boldsymbol{z}_{N})$. We note that as $\beta\rightarrow\infty$, the solutions of (1.3) converge to those of (1.1). The latent variable $\boldsymbol{z}$ encodes observations of $u$. Next, we view (1.3) as a stochastic optimization problem

\min_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}E_{\xi}\bigg[\frac{\lambda}{2}\|u\|_{\mathcal{U}}^{2}+\frac{\beta}{2}|[\boldsymbol{\phi}_{\xi},u]-\boldsymbol{z}_{\xi}|^{2}+\frac{1}{2}|f_{\xi}(\boldsymbol{z}_{\xi})-y_{\xi}|^{2}\bigg], \qquad (1.4)

where $\xi$ is an index sampled uniformly at random from the full set of samples. Then, we iteratively solve (1.4) by computing the proximal map

(u^{k+1},\boldsymbol{z}^{k+1})=\operatorname{argmin}_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\frac{\lambda}{2}\|u\|_{\mathcal{U}}^{2}+\frac{\beta}{2M}\sum_{i\in\mathcal{I}_{k}}|[\boldsymbol{\phi}_{i},u]-\boldsymbol{z}_{i}|^{2}+\frac{1}{2M}\sum_{i\in\mathcal{I}_{k}}|f_{i}(\boldsymbol{z}_{i})-y_{i}|^{2}+\frac{\gamma_{k}}{2}|\boldsymbol{z}-\boldsymbol{z}^{k}|^{2}, \qquad (1.5)

where $\mathcal{I}_{k}$ contains a mini-batch of samples, $M$ is the cardinality of $\mathcal{I}_{k}$, and $\gamma_{k}$ is the step size at the $k^{\textit{th}}$ step. Denote by $\boldsymbol{v}_{\mathcal{I}_{k}}$ the vector containing the elements of a vector $\boldsymbol{v}$ indexed by $\mathcal{I}_{k}$. Then, to handle the infinite-dimensional minimization problem in (1.5), we derive a representer formula for (1.5) in Section 2 such that

u^{k+1}(\boldsymbol{x})=\kappa(\boldsymbol{x},\boldsymbol{\phi}_{\mathcal{I}_{k}})\bigg(\kappa(\boldsymbol{\phi}_{\mathcal{I}_{k}},\boldsymbol{\phi}_{\mathcal{I}_{k}})+\frac{\lambda M}{\beta}I\bigg)^{-1}\boldsymbol{z}^{k+1}_{\mathcal{I}_{k}}, \quad \boldsymbol{x}\in\Omega, \qquad (1.6)

where $I$ is the identity matrix, $\boldsymbol{z}^{k+1}=(\boldsymbol{z}^{k}_{\mathcal{N}\backslash\mathcal{I}_{k}},\boldsymbol{z}^{k+1}_{\mathcal{I}_{k}})$, and $\boldsymbol{z}^{k+1}$ solves

\min_{\boldsymbol{z}\in\mathcal{X}}\frac{\lambda}{2}\boldsymbol{z}^{T}_{\mathcal{I}_{k}}\bigg(\kappa(\boldsymbol{\phi}_{\mathcal{I}_{k}},\boldsymbol{\phi}_{\mathcal{I}_{k}})+\frac{\lambda M}{\beta}I\bigg)^{-1}\boldsymbol{z}_{\mathcal{I}_{k}}+\frac{1}{2M}\sum_{i\in\mathcal{I}_{k}}|f_{i}(z_{i})-y_{i}|^{2}+\frac{\gamma_{k}}{2}|\boldsymbol{z}_{\mathcal{I}_{k}}-\boldsymbol{z}_{\mathcal{I}_{k}}^{k}|^{2}. \qquad (1.7)

Thus, the computational cost is reduced to $O(M^{3})$ per iteration if we use the standard Cholesky decomposition to compute the inverse of the covariance matrix when solving (1.6) and (1.7). Under assumptions on the boundedness of $\boldsymbol{\phi}$ and the weak convexity of the functions $\{f_{i}\}_{i=1}^{N}$, our convergence analysis shows that the error diminishes at the rate of $O(1/K+1/M)$, where $K$ is the number of iterations and $M$ is the batch size. (For clarification, the computational complexity of solving the inner minimization problem in (1.5) is $O(M^{3})$, while the rate of the outer iteration, which governs the convergence of $(u^{k+1},\boldsymbol{z}^{k+1})$, is $O(1/K+1/M)$.)
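As a quick sanity check of (1.6)–(1.7) (this worked example is ours and is not taken from the cited references), consider linear observations $f_{i}(\boldsymbol{z}_{i})=\boldsymbol{z}_{i}$, take $\gamma_{k}=0$, and assume the constraint set $\mathcal{X}$ is large enough to be inactive. Writing $A=\kappa(\boldsymbol{\phi}_{\mathcal{I}_{k}},\boldsymbol{\phi}_{\mathcal{I}_{k}})+\frac{\lambda M}{\beta}I$, the quadratic problem (1.7) has the closed-form solution

\boldsymbol{z}^{k+1}_{\mathcal{I}_{k}}=A\big(A+\lambda M I\big)^{-1}\boldsymbol{y}_{\mathcal{I}_{k}},

and substituting it into (1.6) gives

u^{k+1}(\boldsymbol{x})=\kappa(\boldsymbol{x},\boldsymbol{\phi}_{\mathcal{I}_{k}})\big(A+\lambda M I\big)^{-1}\boldsymbol{y}_{\mathcal{I}_{k}}\xrightarrow{\ \beta\to\infty\ }\kappa(\boldsymbol{x},\boldsymbol{\phi}_{\mathcal{I}_{k}})\big(\kappa(\boldsymbol{\phi}_{\mathcal{I}_{k}},\boldsymbol{\phi}_{\mathcal{I}_{k}})+\lambda M I\big)^{-1}\boldsymbol{y}_{\mathcal{I}_{k}},

i.e., a single mini-batch step reproduces the standard GP/kernel ridge regression on the batch with nugget $\lambda M$.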

To illustrate our contributions, we compare our mini-batch approach with closely related existing mini-batch methods in the literature. Notably, the studies [7, 8] introduce SGD techniques for hyperparameter learning in GP regressions. A key distinction of our method lies in its treatment of the minimization over the latent vector $\boldsymbol{z}$, as outlined in (1.2). In contrast, [7, 8] focus on optimizing kernel hyperparameters for given data $\boldsymbol{z}$, without considering this minimization. In alignment with our theoretical findings, the studies [7, 8] similarly identify errors comprising two main components: an optimization error term and a statistical error term, determined respectively by the number of optimization iterations and the mini-batch size. Consequently, with a fixed batch size, it becomes challenging to drive the error to zero as the number of optimization steps grows. A key reason that we cannot eliminate the statistical error is the exclusion of a proximal step for $u$ in (1.5). If we were to integrate the proximal step for $u$, it would require maintaining all mini-batch samples and making the $k^{\textit{th}}$ iteration's minimizer $u^{k}$ depend on the expressions of all previous minimizers $u^{1},\dots,u^{k-1}$. This integration would lead to significantly higher memory demands (see also Remark 3.7). Exploring ways to reduce the statistical error while balancing memory efficiency remains an area for future investigation.

Furthermore, [45] develops a stochastic Gauss–Newton algorithm for non-convex compositional optimization. However, their approach is not applicable in our context: their algorithm requires an expectation-based formulation in a finite-dimensional setting, whereas the quadratic term $\boldsymbol{z}^{T}\kappa(\boldsymbol{\phi},\boldsymbol{\phi})^{-1}\boldsymbol{z}$ in (1.2) cannot be framed as an expectation over $\boldsymbol{z}$. Our method instead employs a mini-batch approach for the infinite-dimensional minimization problem in (1.1).

Another subtle difference lies in our methodological approach compared to that of [45]. While they initiate their process by linearizing the objective function and then applying mini-batch updates to this linearized problem, we start by formulating a nonlinear proximal minimization step in each iteration, which we primarily solve with the Gauss–Newton method. It is important to note, however, that the Gauss–Newton method in our framework can be substituted with any other efficient optimization algorithm suitable for the proximal iteration step.

We organize the paper as follows. Section 2 summarizes the GP method in [9] and proposes a mini-batch method to solve PDEs in a general setting. In Section 3, we extend the arguments in the framework of [12, 6, 13] and present a convergence analysis of the mini-batch algorithm based on stability analysis and convexity arguments. Numerical experiments appear in Section 4. Further discussions and future work appear in Section 5.

Other related work.

Though GP regressions have achieved great success in various applications, their main drawback is the $O(N^{3})$ computational time and $O(N^{2})$ memory consumption for $N$ training points. Several techniques have been introduced to reduce the computational cost in recent years. One branch of these methods introduces sparsity into kernels of GPs based on variational inference procedures [36, 11, 18, 23, 48, 26] or random Fourier features [17, 49, 37, 24]. Another trend in tackling the scalability of GP regression is to utilize increasing computational power and GPU acceleration to solve exact GPs [33, 16, 47]. Alternatively, a sparse approximation of the inverse Cholesky factor of a dense covariance matrix is considered in [41]. Recently, stochastic gradient descent methods (SGDs) have also been considered for GPs. The work [7] proposes an SGD method to perform hyperparameter learning in GP regressions. The critical difference between our method and the work in [7] is that the problem in [7] does not take into consideration the minimization over the latent vector $\boldsymbol{z}$ in (1.2). Instead, the authors in [7] optimize over the hyperparameters of the kernel for known data $\boldsymbol{z}$.

SGDs are widely used in stochastic optimization problems [5, 32, 42, 50]. However, since the unknowns in (1.2) are globally correlated, SGDs cannot be applied without computing the inverse of the covariance matrix in (1.2). Moreover, SGD methods are sensitive to the choice of step size and may diverge if the objective functions do not satisfy the convergence criteria [1, 2, 25]. Recently, stochastic proximal point and model-based methods [2, 4, 12, 15, 22] have been proposed as robust alternatives to standard stochastic gradient methods. Several works [15, 12] show that stochastic proximal point and model-based methods exhibit robustness to step-size choice and convergence on a broader range of difficult problems.

Notations.

Given an index set $\mathcal{I}$, we denote by $|\mathcal{I}|$ the cardinality of $\mathcal{I}$. For a real-valued vector $\boldsymbol{v}$, we represent by $|\boldsymbol{v}|$ the Euclidean norm of $\boldsymbol{v}$ and by $\boldsymbol{v}^{T}$ its transpose. Let $\langle\boldsymbol{u},\boldsymbol{v}\rangle$ or $\boldsymbol{u}^{T}\boldsymbol{v}$ be the inner product of vectors $\boldsymbol{u}$ and $\boldsymbol{v}$. For a normed vector space $V$, let $\|\cdot\|_{V}$ be the norm of $V$. Let $\mathcal{U}$ be a Banach space associated with a quadratic norm $\|\cdot\|_{\mathcal{U}}$ and let $\mathcal{U}^{*}$ be the dual of $\mathcal{U}$. Denote by $[\cdot,\cdot]$ the duality pairing between $\mathcal{U}^{*}$ and $\mathcal{U}$, and by $\langle\cdot,\cdot\rangle_{\mathcal{U}}$ the inner product of $\mathcal{U}$. Suppose that there exists a linear, bijective, symmetric ($[\mathcal{K}_{\mathcal{U}}\phi,\psi]=[\mathcal{K}_{\mathcal{U}}\psi,\phi]$), and positive ($[\mathcal{K}_{\mathcal{U}}\phi,\phi]>0$ for $\phi\not=0$) covariance operator $\mathcal{K}_{\mathcal{U}}\colon\mathcal{U}^{*}\mapsto\mathcal{U}$ such that $\|u\|_{\mathcal{U}}^{2}=[\mathcal{K}_{\mathcal{U}}^{-1}u,u]$, $\forall u\in\mathcal{U}$. Let $\{\phi_{i}\}_{i=1}^{P}$ be a set of $P\in\mathbb{N}$ elements in $\mathcal{U}^{*}$ and let $\boldsymbol{\phi}:=(\phi_{1},\dots,\phi_{P})$ be in the product space $(\mathcal{U}^{*})^{\otimes P}$. Then, for $u\in\mathcal{U}$, we write $[\boldsymbol{\phi},u]:=([\phi_{1},u],\dots,[\phi_{P},u])$. Let $\|\boldsymbol{\phi}\|_{\mathcal{U}^{*}}^{2}=\sum_{i=1}^{P}\|\phi_{i}\|_{\mathcal{U}^{*}}^{2}$. Furthermore, for $\boldsymbol{u}:=(u_{1},\dots,u_{S})\in\mathcal{U}^{\otimes S}$, $S\in\mathbb{N}$, let $[\boldsymbol{\phi},\boldsymbol{u}]\in\mathbb{R}^{P\times S}$ be the matrix with entries $[\phi_{i},u_{j}]$. Finally, we represent by $C$ a positive real number whose value may change from line to line.

2. GP Regressions with Nonlinear Measurements

For ease of presentation, in this section, we introduce an iterative mini-batch method for solving nonlinear PDEs in the general setting of GP regressions with nonlinear measurements. We first summarize the GP method proposed by [9] in Subsection 2.1. Then, in Subsection 2.2, we detail the mini-batch method and derive a representer theorem, from which we see that the mini-batch method allocates the burden of computation to each iteration.

2.1. A Revisit of the GP Method

Here, we summarize the general framework of the GP method [9] for solving nonlinear PDEs. Let $\Omega\subset\mathbb{R}^{d}$ be an open bounded domain with boundary $\partial\Omega$ for $d\geqslant 1$. Let $L_{1},\dots,L_{D_{b}},L_{D_{b}+1},\dots,L_{D}$ be bounded linear operators for $1\leqslant D_{b}\leqslant D$, where $D_{b},D\in\mathbb{N}$. We solve

\begin{cases}P(L_{D_{b}+1}(u^{*})(\boldsymbol{x}),\dots,L_{D}(u^{*})(\boldsymbol{x}))=f(\boldsymbol{x}),\forall\boldsymbol{x}\in\Omega,\\ B(L_{1}(u^{*})(\boldsymbol{x}),\dots,L_{D_{b}}(u^{*})(\boldsymbol{x}))=g(\boldsymbol{x}),\forall\boldsymbol{x}\in\partial\Omega,\end{cases} \qquad (2.1)

where $u^{*}$ is the unknown function, $P$ and $B$ represent nonlinear operators, and $f$, $g$ are given data. Throughout this paper, we suppose that (2.1) admits a unique strong solution in an RKHS $\mathcal{U}$ associated with the covariance operator $\mathcal{K}$.

To approximate $u^{*}$ in $\mathcal{U}$, we first take $N$ sample points $\{\boldsymbol{x}_{j}\}_{j=1}^{N}$ in $\overline{\Omega}$ such that $\{\boldsymbol{x}_{j}\}_{j=1}^{N_{\Omega}}\subset\Omega$ and $\{\boldsymbol{x}_{j}\}_{j=N_{\Omega}+1}^{N}\subset\partial\Omega$ for $1\leqslant N_{\Omega}\leqslant N$. Then, we define

\phi^{(i)}_{j}:=\delta_{\boldsymbol{x}_{j}}\circ L_{i},\text{ where }\begin{cases}N_{\Omega}+1\leqslant j\leqslant N,\text{ if }1\leqslant i\leqslant D_{b},\\ 1\leqslant j\leqslant N_{\Omega},\text{ if }D_{b}+1\leqslant i\leqslant D.\end{cases}

Let $\boldsymbol{\phi}^{(i)}$ be the vector concatenating the functionals $\phi_{j}^{(i)}$ for fixed $i$ and define

\boldsymbol{\phi}:=(\boldsymbol{\phi}^{(1)},\dots,\boldsymbol{\phi}^{(D)})\in(\mathcal{U}^{*})^{\otimes R},\text{ where }R=(N-N_{\Omega})D_{b}+N_{\Omega}(D-D_{b}). \qquad (2.2)

Next, we define the data vector $\boldsymbol{y}\in\mathbb{R}^{N}$ by

y_{i}=\begin{cases}f(\boldsymbol{x}_{i}),\text{ if }i\in\{1,\dots,N_{\Omega}\},\\ g(\boldsymbol{x}_{i}),\text{ if }i\in\{N_{\Omega}+1,\dots,N\}.\end{cases}

Furthermore, we define the nonlinear map

(F([\boldsymbol{\phi},u]))_{j}:=\begin{cases}P([\phi_{j}^{(D_{b}+1)},u],\dots,[\phi_{j}^{(D)},u]),\text{ for }j=1,\dots,N_{\Omega},\\ B([\phi_{j}^{(1)},u],\dots,[\phi_{j}^{(D_{b})},u]),\text{ for }j=N_{\Omega}+1,\dots,N.\end{cases}

Then, given a regularization parameter $\lambda>0$, the GP method [9] approximates $u^{*}$ by the minimizer of the following optimization problem

\min_{u\in\mathcal{U}}\frac{\lambda}{2}\|u\|_{\mathcal{U}}^{2}+\frac{1}{2N}|F([\boldsymbol{\phi},u])-\boldsymbol{y}|^{2}. \qquad (2.3)

The authors in [9] show that, as $\lambda\rightarrow 0$ and as the fill distance of the samples goes to zero, a minimizer of (2.3) converges to the unique strong solution of (2.1) (see Sec. 3 of [9]).
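To make the abstract notation concrete, consider, as an illustration of our own (not an example reproduced verbatim from the references), the semilinear elliptic problem $-\Delta u^{*}+(u^{*})^{3}=f$ in $\Omega$ with $u^{*}=g$ on $\partial\Omega$. It fits the framework (2.1) with $D_{b}=1$, $D=3$, and

L_{1}=L_{2}=\mathrm{Id},\quad L_{3}=\Delta,\quad B(v_{1})=v_{1},\quad P(v_{2},v_{3})=v_{2}^{3}-v_{3},

so that $\phi^{(1)}_{j}=\delta_{\boldsymbol{x}_{j}}$ for the boundary samples, $\phi^{(2)}_{j}=\delta_{\boldsymbol{x}_{j}}$ and $\phi^{(3)}_{j}=\delta_{\boldsymbol{x}_{j}}\circ\Delta$ for the interior samples, and the nonlinear map in (2.3) reads

(F([\boldsymbol{\phi},u]))_{j}=\begin{cases}(u(\boldsymbol{x}_{j}))^{3}-\Delta u(\boldsymbol{x}_{j}),\text{ for }j=1,\dots,N_{\Omega},\\ u(\boldsymbol{x}_{j}),\text{ for }j=N_{\Omega}+1,\dots,N.\end{cases}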

2.2. GP Regressions with Nonlinear Measurements

In this subsection, we propose a mini-batch method to solve (2.3). For ease of presentation, we abstract (2.3) into a general framework by considering a GP problem with nonlinear observations. More precisely, let $\mathcal{N}=\{1,\dots,N\}$ and $\boldsymbol{\phi}=(\boldsymbol{\phi}_{1},\dots,\boldsymbol{\phi}_{N})$, where $\boldsymbol{\phi}_{i}\in(\mathcal{U}^{*})^{\otimes N_{i}}$ for some $N_{i}\in\mathbb{N}$ and $i\in\mathcal{N}$. For an unknown function $u$ and each $i\in\mathcal{N}$, there exists a function $f_{i}$ such that we get a noisy observation $y_{i}$ of $f_{i}([\boldsymbol{\phi}_{i},u])$. To approximate $u$ based on the nonlinear observations, we solve

\min_{u\in\mathcal{U}}\frac{\lambda}{2}\|u\|_{\mathcal{U}}^{2}+\frac{1}{2N}\sum_{i=1}^{N}|f_{i}([\boldsymbol{\phi}_{i},u])-y_{i}|^{2}, \qquad (2.4)

where $\lambda>0$. Hence, (2.4) resembles (2.3) in the sense that $f_{i}([\boldsymbol{\phi}_{i},u])-y_{i}$ corresponds to the PDE equation satisfied by $u$ at the $i^{\textit{th}}$ sample point.

As mentioned before, a typical algorithm using the Cholesky decomposition to compute the inverse of the corresponding covariance matrix incurs a cost of $O(N^{3})$. Here, we introduce an algorithm to solve (2.4) using mini-batches such that the dimensions of the matrices to be inverted depend only on the sizes of the mini-batches.

First, we reformulate a relaxed version of (2.4). More precisely, we introduce slack variables $\boldsymbol{z}=(\boldsymbol{z}_{1},\dots,\boldsymbol{z}_{N})$, a regularization parameter $\beta>0$, and consider the following regularized optimal recovery problem

\min_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\varphi(u,\boldsymbol{z}), \qquad (2.5)

where

\varphi(u,\boldsymbol{z})=\frac{\lambda}{2}\|u\|_{\mathcal{U}}^{2}+\frac{\beta}{2N}\sum_{i=1}^{N}|[\boldsymbol{\phi}_{i},u]-\boldsymbol{z}_{i}|^{2}+\frac{1}{2N}\sum_{i=1}^{N}|f_{i}(\boldsymbol{z}_{i})-y_{i}|^{2}, \qquad (2.6)

$\mathcal{X}$ is a convex compact subset of $\mathbb{R}^{\sum_{i=1}^{N}N_{i}}$, $0\in\mathcal{X}$, and $\boldsymbol{y}\in\mathcal{X}$. We further assume that $\mathcal{X}$ is the product of a collection of convex compact sets $\{\mathcal{X}_{i}\}_{i=1}^{N}$, where $\mathcal{X}_{i}\subset\mathbb{R}^{N_{i}}$.

Remark 2.1.

Constraining $\boldsymbol{z}$ to $\mathcal{X}$ is not restrictive. Indeed, we can reformulate the unconstrained minimization problem

\min_{u\in\mathcal{U},\boldsymbol{z}\in\mathbb{R}^{\sum_{i=1}^{N}N_{i}}}\frac{\lambda}{2}\|u\|_{\mathcal{U}}^{2}+\frac{\beta}{2N}\sum_{i=1}^{N}|[\boldsymbol{\phi}_{i},u]-\boldsymbol{z}_{i}|^{2}+\frac{1}{2N}\sum_{i=1}^{N}|f_{i}(\boldsymbol{z}_{i})-y_{i}|^{2}, \qquad (2.7)

into the formulation of (2.5) by introducing a large enough compact set $\mathcal{X}$. More precisely, let $(u_{\beta},\boldsymbol{z}_{\beta})$ be the minimizer of (2.7). We have

\varphi(u_{\beta},\boldsymbol{z}_{\beta})\leqslant\varphi(0,\boldsymbol{0})=\frac{1}{2N}\sum_{i=1}^{N}|f_{i}(\boldsymbol{0})-y_{i}|^{2},

which implies that $u_{\beta}$ and $\boldsymbol{z}_{\beta}$ are bounded if $\sum_{i=1}^{N}|f_{i}(\boldsymbol{0})-y_{i}|^{2}<+\infty$. Hence, if $f_{i}$, $i=1,\dots,N$, are bounded, (2.7) and (2.5) are equivalent for large enough $\mathcal{X}$. Thus, we mainly focus on (2.5) in order to cover situations in which $\boldsymbol{z}$ is constrained.

Next, we introduce a mini-batch version of $\varphi$ defined in (2.5). For $\mathcal{I}\subset\mathcal{N}$ with $|\mathcal{I}|=M$, we define

\varphi(u,\boldsymbol{z};\mathcal{I})=\frac{\lambda}{2}\|u\|_{\mathcal{U}}^{2}+\frac{\beta}{2M}\sum_{i\in\mathcal{I}}|[\boldsymbol{\phi}_{i},u]-\boldsymbol{z}_{i}|^{2}+\frac{1}{2M}\sum_{i\in\mathcal{I}}|f_{i}(\boldsymbol{z}_{i})-y_{i}|^{2}. \qquad (2.8)

In particular, when $M=1$ and $\mathcal{I}=\{\xi\}$ for $\xi\in\mathcal{N}$, we denote

\varphi(u,\boldsymbol{z};\xi):=\varphi(u,\boldsymbol{z};\{\xi\})=\frac{\lambda}{2}\|u\|_{\mathcal{U}}^{2}+\frac{\beta}{2}|[\boldsymbol{\phi}_{\xi},u]-\boldsymbol{z}_{\xi}|^{2}+\frac{1}{2}|f_{\xi}(\boldsymbol{z}_{\xi})-y_{\xi}|^{2}.

Then, we detail the mini-batch method in Algorithm 1. Given $\boldsymbol{z}^{1}\in\mathcal{X}$ and the iteration number $K$, at the $k^{\textit{th}}$ step, we update $(u^{k+1},\boldsymbol{z}^{k+1})$ by solving the minimization problem (2.10). Finally, we uniformly sample $k^{*}$ from $\{1,\dots,K\}$ and output $(\hat{u}^{k^{*}},\hat{\boldsymbol{z}}^{k^{*}})$ as an approximation to a solution of (2.5).

Remark 2.2.

In Algorithm 1, we introduce the last step (2.11) for the error analysis. In practice, we do not need to solve (2.11) directly since computing (2.11) consumes the same computational cost as calculating (2.4). Instead, if we are interested in the value of $u$ at a test point $\tilde{\boldsymbol{x}}$, we can take a mini-batch $\tilde{\mathcal{I}}$ containing training samples in a neighborhood of $\tilde{\boldsymbol{x}}$, choose $k^{*}$ and $\rho$ as in Algorithm 1, compute $(\tilde{u},\tilde{\boldsymbol{z}})=\operatorname{argmin}_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\varphi(u,\boldsymbol{z};\tilde{\mathcal{I}})+\frac{\rho}{2}|\boldsymbol{z}-\boldsymbol{z}^{k^{*}}|^{2}$, and evaluate $\tilde{u}$ at $\tilde{\boldsymbol{x}}$. An alternative is to include the test points of interest in the training samples and solve (2.5) using the mini-batch method. Then, the latent vector $\boldsymbol{z}^{k^{*}}$ contains sufficient information at $\tilde{\boldsymbol{x}}$.

Remark 2.3.

Algorithm 1 can be readily extended to incorporate a preconditioned variant. Specifically, let $Q_{k}$ be a positive definite matrix associated with the $k^{\textit{th}}$ iteration. The update step in Equation (2.10) is modified as follows:

(u^{k+1},\boldsymbol{z}^{k+1})=\operatorname{argmin}_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\varphi(u,\boldsymbol{z};\mathcal{I}_{k})+\frac{1}{2}|\boldsymbol{z}-\boldsymbol{z}^{k}|^{2}_{Q_{k}}, \qquad (2.9)

where the norm $|\cdot|_{Q_{k}}$ is defined by $|y|_{Q_{k}}^{2}=y^{T}Q_{k}^{-1}y$. This preconditioned approach maintains the theoretical foundations of Algorithm 1, given that the minimum eigenvalue of $Q_{k}$ meets the analogous conditions set by $\gamma_{k}$ in (2.10) for each iteration $k$. This adaptation offers enhanced flexibility and potential efficiency improvements in various computational scenarios.

Input: $\lambda>0$, $\beta>0$, $\boldsymbol{z}^{1}\in\mathcal{X}$, iteration number $K$
For $k=1,\dots,K$:
Take a mini-batch $\mathcal{I}_{k}$ from $\mathcal{N}$, choose a step size $\gamma_{k}$, and update $\boldsymbol{z}^{k+1}$ by solving
(u^{k+1},\boldsymbol{z}^{k+1})=\operatorname{argmin}_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\varphi(u,\boldsymbol{z};\mathcal{I}_{k})+\frac{\gamma_{k}}{2}|\boldsymbol{z}-\boldsymbol{z}^{k}|^{2}. \qquad (2.10)
End For
Choose $\rho>0$, sample $k^{*}$ uniformly from $\{1,\dots,K\}$, and compute
(\hat{u}^{k^{*}},\hat{\boldsymbol{z}}^{k^{*}})=\operatorname{argmin}_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\varphi(u,\boldsymbol{z})+\frac{\rho}{2}|\boldsymbol{z}-\boldsymbol{z}^{k^{*}}|^{2}. \qquad (2.11)
Output: $\hat{u}^{k^{*}},\hat{\boldsymbol{z}}^{k^{*}}$.
Algorithm 1: The GP Regression with Mini-batches

Though the minimization problem in (2.10) is infinite-dimensional, we can reformulate it as a finite-dimensional minimization problem using the framework developed in [43]. The following lemma resembles Lemma 3.3 in [29] and establishes properties of a sampling operator, which are crucial for deriving a representer formula for (2.10). We omit the proof since the arguments are the same as those for Lemma 3.3 of [29]. For simplicity, we denote by $I$ the identity map or the identity matrix, whose meaning is easily recognizable from the context.

Lemma 2.4.

Let $\mathcal{N}=\{1,\dots,N\}$ and $\mathcal{I}\subset\mathcal{N}$. Let $\boldsymbol{\phi}$ be as in (2.4) and let $\boldsymbol{\phi}_{\mathcal{I}}$ be the subset of $\boldsymbol{\phi}$ indexed by $\mathcal{I}$. Denote by $\ell^{2}(\mathcal{I})$ the set of sequences $\boldsymbol{a}=(a_{i})_{i\in\mathcal{I}}$ with the inner product $\langle\boldsymbol{a},\boldsymbol{b}\rangle=\sum_{i\in\mathcal{I}}a_{i}b_{i}$. Define the sampling operator $S_{\mathcal{I}}\colon\mathcal{U}\mapsto\ell^{2}(\mathcal{I})$ by $S_{\mathcal{I}}u=[\boldsymbol{\phi}_{\mathcal{I}},u]$, $\forall u\in\mathcal{U}$. Let $S_{\mathcal{I}}^{*}$ be the adjoint of $S_{\mathcal{I}}$. Then, $S_{\mathcal{I}}^{*}c=c^{T}\mathcal{K}\boldsymbol{\phi}_{\mathcal{I}}$, $\forall c\in\ell^{2}(\mathcal{I})$. Besides, for any $\gamma>0$, $S_{\mathcal{I}}^{*}S_{\mathcal{I}}+\gamma I$ and $S_{\mathcal{I}}S_{\mathcal{I}}^{*}+\gamma I$ are bijective. Meanwhile, for any $c\in\ell^{2}(\mathcal{I})$, we have

(S_{\mathcal{I}}^{*}S_{\mathcal{I}}+\gamma I)^{-1}S_{\mathcal{I}}^{*}c=(\mathcal{K}\boldsymbol{\phi}_{\mathcal{I}})^{T}(\gamma I+\mathcal{K}(\boldsymbol{\phi}_{\mathcal{I}},\boldsymbol{\phi}_{\mathcal{I}}))^{-1}c. \qquad (2.12)

Furthermore,

I-S_{\mathcal{I}}(S_{\mathcal{I}}^{*}S_{\mathcal{I}}+\gamma I)^{-1}S_{\mathcal{I}}^{*}=\gamma(\mathcal{K}(\boldsymbol{\phi}_{\mathcal{I}},\boldsymbol{\phi}_{\mathcal{I}})+\gamma I)^{-1}. \qquad (2.13)

The next lemma provides a representer formula for (2.10). The arguments are similar to those for Theorem 3.4 of [29], and we postpone the proof to Appendix A. We recall that for an index set $\mathcal{I}$ and a vector $\boldsymbol{v}$, $\boldsymbol{v}_{\mathcal{I}}$ represents the vector consisting of the elements of $\boldsymbol{v}$ indexed by $\mathcal{I}$.

Lemma 2.5.

Let $\mathcal{N}=\{1,\dots,N\}$, let $\mathcal{I}\subset\mathcal{N}$ with $|\mathcal{I}|=M$, and let $\varphi(\cdot,\cdot;\mathcal{I})\colon\mathcal{U}\times\mathcal{X}\mapsto\mathbb{R}$ be defined as in (2.8). Given $\gamma\geqslant 0$ and $\bar{\boldsymbol{z}}\in\mathcal{X}$, we define

(u^{\dagger},\boldsymbol{z}^{\dagger})=\operatorname{argmin}_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\varphi(u,\boldsymbol{z};\mathcal{I})+\frac{\gamma}{2}|\boldsymbol{z}-\bar{\boldsymbol{z}}|^{2}. \qquad (2.14)

Let $\boldsymbol{\phi}$ be as in (2.4) and let $\boldsymbol{\phi}_{\mathcal{I}}$ be the subset of $\boldsymbol{\phi}$ indexed by $\mathcal{I}$. Then,

u^{\dagger}=\mathcal{K}(\boldsymbol{\phi}_{\mathcal{I}})\bigg(\mathcal{K}(\boldsymbol{\phi}_{\mathcal{I}},\boldsymbol{\phi}_{\mathcal{I}})+\frac{\lambda M}{\beta}I\bigg)^{-1}\boldsymbol{z}^{\dagger}_{\mathcal{I}}, \qquad (2.15)

where $I$ is the identity matrix and $\boldsymbol{z}^{\dagger}=(\bar{\boldsymbol{z}}_{\mathcal{N}\backslash\mathcal{I}},\boldsymbol{z}^{\dagger}_{\mathcal{I}})$ is a solution to

\min_{\boldsymbol{z}\in\mathcal{X}}\frac{\lambda}{2}\boldsymbol{z}^{T}_{\mathcal{I}}\bigg(\mathcal{K}(\boldsymbol{\phi}_{\mathcal{I}},\boldsymbol{\phi}_{\mathcal{I}})+\frac{\lambda M}{\beta}I\bigg)^{-1}\boldsymbol{z}_{\mathcal{I}}+\frac{1}{2M}\sum_{i\in\mathcal{I}}|f_{i}(z_{i})-y_{i}|^{2}+\frac{\gamma}{2}|\boldsymbol{z}-\bar{\boldsymbol{z}}|^{2}. \qquad (2.16)

The above lemma establishes the solvability of (2.10) and implies that in each step of Algorithm 1, the sizes of the matrices to be inverted depend only on the batch size $M$. Thus, the computational cost is reduced to $O(M^{3})$ if we use the standard Cholesky decomposition. In numerical implementations for solving PDEs, an alternative way to deal with the PDE constraints in (2.8) at each step is to use the technique of eliminating variables (see Subsection 3.3 of [9] for details), which further reduces the dimension of the unknown variables. A concrete sketch of the resulting iteration is given after Remark 2.6 below.

Remark 2.6.

In numerical experiments, we replace the term $\lambda MI/\beta$ by a nugget $\eta\mathcal{R}$, where $\eta>0$ is a small real number and $\mathcal{R}$ is a block diagonal matrix formulated using the method described in Sec. 3.4 of [9]. Using block diagonal matrices lets us choose smaller nuggets and obtain more accurate results.
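To illustrate how Lemma 2.5 and Remark 2.6 turn each step of Algorithm 1 into a small finite-dimensional solve, the following is a minimal, self-contained sketch (ours, not the authors' reference implementation) for a toy one-dimensional problem with pointwise measurements $\boldsymbol{\phi}_{i}=\delta_{\boldsymbol{x}_{i}}$, the nonlinear observation map $f_{i}(z)=z^{3}+z$, and a Gaussian kernel. The constraint set $\mathcal{X}$ is assumed large enough to be inactive, the term $\lambda MI/\beta$ is replaced by a fixed nugget, and, as a simplification of the output step (2.11), the final readout merely plugs the last latent vector into the representer formula (2.15) with $\mathcal{I}=\mathcal{N}$.

```python
# Minimal illustrative sketch of Algorithm 1 via Lemma 2.5 (not the authors' code).
# Toy setting: 1D pointwise measurements phi_i = delta_{x_i}, observation map
# f(z) = z^3 + z, Gaussian kernel.  The constraint set X is assumed inactive and
# lambda*M/beta is replaced by a fixed nugget eta (cf. Remark 2.6).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def kernel(a, b, sigma=0.2):
    """Gaussian (RBF) kernel matrix between 1D point sets a and b."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))

# Ground truth u* and nonlinear observations y_i = f([phi_i, u*]) = f(u*(x_i)).
N = 200
x = np.linspace(0.0, 1.0, N)
u_true = np.sin(2.0 * np.pi * x)
f = lambda z: z ** 3 + z
y = f(u_true)

lam, eta, gamma, M, K = 1e-6, 1e-6, 0.1, 20, 300   # hand-picked hyperparameters
z = np.zeros(N)                                     # latent vector z^1

for k in range(K):
    I = rng.choice(N, size=M, replace=False)        # mini-batch I_k
    A = kernel(x[I], x[I]) + eta * np.eye(M)        # K(phi_I, phi_I) + nugget
    A_inv = np.linalg.inv(A)                        # O(M^3) work per iteration

    def subproblem(zI):
        # Finite-dimensional proximal subproblem (2.16) restricted to the batch.
        return (0.5 * lam * zI @ A_inv @ zI
                + 0.5 / M * np.sum((f(zI) - y[I]) ** 2)
                + 0.5 * gamma * np.sum((zI - z[I]) ** 2))

    z[I] = minimize(subproblem, z[I], method="L-BFGS-B").x   # z^{k+1} = z^k off the batch

# Crude readout: representer formula (2.15) with the full index set (a
# simplification of the exact output step (2.11)).
A_full = kernel(x, x) + eta * np.eye(N)
u_hat = kernel(x, x) @ np.linalg.solve(A_full, z)
print("relative L2 error:", np.linalg.norm(u_hat - u_true) / np.linalg.norm(u_true))
```

In the PDE setting of Section 2.1, the only changes to this sketch would be the choice of the functionals $\boldsymbol{\phi}_{i}$, which then involve derivatives of the kernel, and of the maps $f_{i}$.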

3. The Convergence Analysis

In this section, we study the convergence of Algorithm 1. According to (2.15), at the $k^{\textit{th}}$ step of Algorithm 1, $u^{k+1}$ is expressed as a linear combination of $\mathcal{K}(\boldsymbol{\phi}_{\mathcal{I}})$, where $\boldsymbol{\phi}_{\mathcal{I}}$ only contains the linear operators corresponding to a mini-batch, which limits its expressiveness. Thus, at the final $K^{\textit{th}}$ step, we do not expect $u^{K+1}$ to approximate a minimizer of (2.5) very well. Instead, we study the accuracy of the sequence $(\boldsymbol{z}^{k})_{k=1}^{K}$ in Algorithm 1, which contains the approximated values of the linear operators applied to the true solution, and we study the convergence of $(\hat{u}^{k^{*}},\hat{\boldsymbol{z}}^{k^{*}})$ in (2.11), which uses the information of all the linear operators.

We first summarize some basic notions of convex analysis in Subsection 3.1. Then, we introduce the stability of Algorithm 1 in Subsection 3.2. The convergence analysis follows in Subsection 3.3. The arguments are based on the framework of [12] and stability analysis [6]. Similar arguments are used in [13] to analyze the convergence of algorithms for general weakly convex optimization problems.

3.1. Prerequisites

For a non-empty, compact, and convex set $\mathcal{C}\subset\mathbb{R}^{d}$ for some $d\in\mathbb{N}$, the normal cone to $\mathcal{C}$ at $\boldsymbol{x}\in\mathcal{C}$ is defined by $N_{\mathcal{C}}(\boldsymbol{x})=\{\boldsymbol{y}\in\mathbb{R}^{d}\,|\,\boldsymbol{y}\cdot(\boldsymbol{c}-\boldsymbol{x})\leqslant 0,\forall\boldsymbol{c}\in\mathcal{C}\}$. Given $\gamma\geqslant 0$, a function $\omega\colon\mathbb{R}^{d}\mapsto\mathbb{R}$ is $\gamma$-weakly convex if the map $\boldsymbol{x}\mapsto\omega(\boldsymbol{x})+\frac{\gamma}{2}|\boldsymbol{x}|^{2}$ is convex and $\omega$ is $\gamma$-strongly convex if $\boldsymbol{x}\mapsto\omega(\boldsymbol{x})-\frac{\gamma}{2}|\boldsymbol{x}|^{2}$ is convex. We recall that the proximal map of $\omega$ is given by $\operatorname{prox}_{\gamma\omega}(\boldsymbol{x})=\operatorname{argmin}_{\boldsymbol{y}}\big\{\omega(\boldsymbol{y})+\frac{1}{2\gamma}|\boldsymbol{y}-\boldsymbol{x}|^{2}\big\}$, and that the Moreau envelope [30] is defined as $\omega_{\gamma}(\boldsymbol{x})=\min_{\boldsymbol{y}}\big\{\omega(\boldsymbol{y})+\frac{1}{2\gamma}|\boldsymbol{y}-\boldsymbol{x}|^{2}\big\}$. Standard results ([30] and [40, Theorem 31.5]) imply that if $\omega$ is $\rho$-weakly convex and $\gamma<\rho^{-1}$, the envelope $\omega_{\gamma}$ is $C^{1}$ smooth with gradient $\nabla\omega_{\gamma}(\boldsymbol{x})=\gamma^{-1}(\boldsymbol{x}-\operatorname{prox}_{\gamma\omega}(\boldsymbol{x}))$. Let $\hat{\boldsymbol{x}}=\operatorname{prox}_{\gamma\omega}(\boldsymbol{x})$. Then, the definition of the Moreau envelope implies that $|\partial\omega(\hat{\boldsymbol{x}})+N_{\mathcal{C}}(\hat{\boldsymbol{x}})|\leqslant|\nabla\omega_{\gamma}(\boldsymbol{x})|$. Hence, a point $\boldsymbol{x}$ with a small gradient $\nabla\omega_{\gamma}(\boldsymbol{x})$ lies in a neighborhood of a nearly stationary point $\hat{\boldsymbol{x}}$.
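As a simple sanity check of these notions (our illustration), take $d=1$ and the convex quadratic $\omega(x)=\frac{a}{2}x^{2}$ with $a>0$. A direct computation gives

\operatorname{prox}_{\gamma\omega}(x)=\frac{x}{1+a\gamma},\qquad\omega_{\gamma}(x)=\frac{ax^{2}}{2(1+a\gamma)},\qquad\nabla\omega_{\gamma}(x)=\frac{ax}{1+a\gamma}=\gamma^{-1}\big(x-\operatorname{prox}_{\gamma\omega}(x)\big),

so the Moreau envelope is a smoothed version of $\omega$ and the gradient identity above can be verified term by term.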

The following lemma establishes a crucial inequality for the convergence analysis.

Lemma 3.1.

Assume that $\vartheta\colon\mathcal{U}\times\mathcal{X}\mapsto\mathbb{R}$ is $\gamma$-strongly convex in $u$ and $\varpi$-weakly convex in $\boldsymbol{z}$. Let $\tau>\varpi$ and $(\hat{u},\hat{\boldsymbol{z}})=\operatorname{argmin}_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\{\vartheta(u,\boldsymbol{z})+\frac{\tau}{2}|\boldsymbol{z}-\bar{\boldsymbol{z}}|^{2}\}$. Then, for any $(u,\boldsymbol{z})\in\mathcal{U}\times\mathcal{X}$, we have

\vartheta(\hat{u},\hat{\boldsymbol{z}})+\frac{\tau}{2}|\hat{\boldsymbol{z}}-\bar{\boldsymbol{z}}|^{2}\leqslant\vartheta(u,\boldsymbol{z})+\frac{\tau}{2}|\boldsymbol{z}-\bar{\boldsymbol{z}}|^{2}-\frac{\gamma}{2}\|u-\hat{u}\|^{2}_{\mathcal{U}}-\frac{\tau-\varpi}{2}|\boldsymbol{z}-\hat{\boldsymbol{z}}|^{2}. \qquad (3.1)
Proof.

Let $(\hat{u},\hat{\boldsymbol{z}})=\operatorname{argmin}_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\{\vartheta(u,\boldsymbol{z})+\frac{\tau}{2}|\boldsymbol{z}-\bar{\boldsymbol{z}}|^{2}\}$ and $\hat{g}\in\partial\vartheta(\hat{u},\hat{\boldsymbol{z}})$. Then, $\boldsymbol{0}\in\partial\vartheta(\hat{u},\hat{\boldsymbol{z}})+\tau\begin{pmatrix}0\\ \hat{\boldsymbol{z}}-\bar{\boldsymbol{z}}\end{pmatrix}+\begin{pmatrix}0\\ N_{\mathcal{X}}(\hat{\boldsymbol{z}})\end{pmatrix}$. Thus, for any $(u,\boldsymbol{z})\in\mathcal{U}\times\mathcal{X}$, we have

\frac{1}{\tau}\bigg\langle\hat{g},\begin{pmatrix}\hat{u}-u\\ \hat{\boldsymbol{z}}-\boldsymbol{z}\end{pmatrix}\bigg\rangle\leqslant\langle\hat{\boldsymbol{z}}-\bar{\boldsymbol{z}},\boldsymbol{z}-\hat{\boldsymbol{z}}\rangle=\frac{1}{2}|\boldsymbol{z}-\bar{\boldsymbol{z}}|^{2}-\frac{1}{2}|\hat{\boldsymbol{z}}-\bar{\boldsymbol{z}}|^{2}-\frac{1}{2}|\boldsymbol{z}-\hat{\boldsymbol{z}}|^{2}. \qquad (3.2)

Since $\vartheta(\cdot,\cdot)$ is $\gamma$-strongly convex in $u$ and $\varpi$-weakly convex in $\boldsymbol{z}$, we have

\bigg\langle\hat{g},\begin{pmatrix}\hat{u}-u\\ \hat{\boldsymbol{z}}-\boldsymbol{z}\end{pmatrix}\bigg\rangle\geqslant\vartheta(\hat{u},\hat{\boldsymbol{z}})-\vartheta(u,\boldsymbol{z})+\frac{\gamma}{2}\|u-\hat{u}\|_{\mathcal{U}}^{2}-\frac{\varpi}{2}|\boldsymbol{z}-\hat{\boldsymbol{z}}|^{2}. \qquad (3.3)

Thus, combining (3.2) and (3.3), we conclude (3.1). ∎

3.2. Stability

In this subsection, we define the notion of stability for (2.10) in a general setting. Let $\mathcal{I}=\{\xi_{1},\dots,\xi_{M}\}$ be a set of i.i.d. samples from $\mathcal{N}=\{1,\dots,N\}$, where $M,N\in\mathbb{N}$ and $M\leqslant N$. Let $i\in\{1,\dots,M\}$. We replace $\xi_{i}$ with an i.i.d. copy $\xi_{i}^{\prime}$ and define $\mathcal{I}_{(i)}=\{\xi_{1},\dots,\xi_{i-1},\xi^{\prime}_{i},\xi_{i+1},\dots,\xi_{M}\}$. Denote $\mathcal{I}^{\prime}=\{\xi^{\prime}_{1},\dots,\xi^{\prime}_{M}\}$. Let $\xi$ be a random sample from $\mathcal{N}$, let $\psi(\cdot,\cdot;\xi)\colon\mathcal{U}\times\mathcal{X}\mapsto\mathbb{R}$ be a stochastic model, and let $\psi(\cdot,\cdot;\mathcal{I})=\frac{1}{M}\sum_{i=1}^{M}\psi(\cdot,\cdot;\xi_{i})$. Given $\gamma>0$, we define an operator $\widetilde{\operatorname{prox}}_{\psi/\gamma}^{\mathcal{I}}$ by

\widetilde{\operatorname{prox}}_{\psi/\gamma}^{\mathcal{I}}(\bar{\boldsymbol{z}})=\operatorname{argmin}_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\psi(u,\boldsymbol{z};\mathcal{I})+\frac{\gamma}{2}|\boldsymbol{z}-\bar{\boldsymbol{z}}|^{2},\quad\forall\bar{\boldsymbol{z}}\in\mathcal{X}. \qquad (3.4)

Given $\bar{\boldsymbol{z}}\in\mathcal{X}$, let $(\hat{u}_{\mathcal{I}},\hat{\boldsymbol{z}}_{\mathcal{I}})=\widetilde{\operatorname{prox}}_{\psi/\gamma}^{\mathcal{I}}(\bar{\boldsymbol{z}})$. Then, we say that the operator $\widetilde{\operatorname{prox}}_{\psi/\gamma}^{\mathcal{I}}$ is $\epsilon$-stable for some $\epsilon>0$ if $|E_{\mathcal{I},\mathcal{I}^{\prime},i}[\psi(\hat{u}_{\mathcal{I}},\hat{\boldsymbol{z}}_{\mathcal{I}};\xi_{i}^{\prime})-\psi(\hat{u}_{\mathcal{I}_{(i)}},\hat{\boldsymbol{z}}_{\mathcal{I}_{(i)}};\xi_{i}^{\prime})]|\leqslant\epsilon$. The stability measures the sensitivity of $\widetilde{\operatorname{prox}}_{\psi/\gamma}^{\mathcal{I}}$ to the perturbation of a single sample. The next lemma shows that if $\widetilde{\operatorname{prox}}_{\psi/\gamma}^{\mathcal{I}}$ is $\epsilon$-stable, the error of approximating $E_{\xi}[\psi(\hat{u}_{\mathcal{I}},\hat{\boldsymbol{z}}_{\mathcal{I}};\xi)]$ using a mini-batch is bounded by $\epsilon$ in expectation. The arguments are similar to those of Lemma 11 of [6] and Theorem A.3 of [13].

Lemma 3.2.

Let $\xi$ be a uniform sample from $\mathcal{N}$. For a stochastic model $\psi(\cdot,\cdot;\xi)\colon\mathcal{U}\times\mathcal{X}\mapsto\mathbb{R}$, let $\widetilde{\operatorname{prox}}_{\psi/\gamma}^{\mathcal{I}}$ be as in (3.4) given $\gamma>0$. Suppose that $\widetilde{\operatorname{prox}}_{\psi/\gamma}^{\mathcal{I}}$ is $\epsilon$-stable. Let $(\hat{u},\hat{\boldsymbol{z}})=\widetilde{\operatorname{prox}}^{\mathcal{I}}_{\psi/\gamma}(\boldsymbol{z})$ for $\boldsymbol{z}\in\mathcal{X}$. Then, we have

|E_{\mathcal{I}}[\psi(\hat{u},\hat{\boldsymbol{z}};\mathcal{I})-E_{\xi}[\psi(\hat{u},\hat{\boldsymbol{z}};\xi)]]|\leqslant\epsilon. \qquad (3.5)
Proof.

Let $\mathcal{I}_{(i)}$ and $\mathcal{I}^{\prime}$ be defined as at the beginning of this subsection. Given $\bar{\boldsymbol{z}}\in\mathcal{X}$, let $(\hat{u},\hat{\boldsymbol{z}})=\widetilde{\operatorname{prox}}_{\psi/\gamma}^{\mathcal{I}}(\bar{\boldsymbol{z}})$ and $(\hat{u}_{(i)},\hat{\boldsymbol{z}}_{(i)})=\widetilde{\operatorname{prox}}_{\psi/\gamma}^{\mathcal{I}_{(i)}}(\bar{\boldsymbol{z}})$. By the independence between $\mathcal{I}$ and $\mathcal{I}^{\prime}$, we have

E_{\mathcal{I}}[\psi(\hat{u},\hat{\boldsymbol{z}};\mathcal{I})]=\frac{1}{M}\sum_{i=1}^{M}E_{\mathcal{I}}[\psi(\hat{u},\hat{\boldsymbol{z}};\xi_{i})]=\frac{1}{M}\sum_{i=1}^{M}E_{\mathcal{I},\xi_{i}^{\prime}}[\psi(\hat{u}_{(i)},\hat{\boldsymbol{z}}_{(i)};\xi_{i}^{\prime})]. \qquad (3.6)

Let $\xi$ and $\xi^{\prime}$ be independent samples from $\mathcal{N}$. Then, $E_{\xi}[\psi(\hat{u},\hat{\boldsymbol{z}};\xi)]=E_{\xi^{\prime}}[\psi(\hat{u},\hat{\boldsymbol{z}};\xi^{\prime})]=\frac{1}{M}\sum_{i=1}^{M}E_{\xi_{i}^{\prime}}[\psi(\hat{u},\hat{\boldsymbol{z}};\xi_{i}^{\prime})]$. Thus, the previous equation and (3.6) yield

|E_{\mathcal{I}}[\psi(\hat{u},\hat{\boldsymbol{z}};\mathcal{I})-E_{\xi}[\psi(\hat{u},\hat{\boldsymbol{z}};\xi)]]|=\bigg|\frac{1}{M}\sum_{i=1}^{M}E_{\mathcal{I},\xi^{\prime}_{i}}[\psi(\hat{u}_{(i)},\hat{\boldsymbol{z}}_{(i)};\xi_{i}^{\prime})-\psi(\hat{u},\hat{\boldsymbol{z}};\xi_{i}^{\prime})]\bigg|=\big|E_{\mathcal{I},\mathcal{I}^{\prime},i}[\psi(\hat{u}_{(i)},\hat{\boldsymbol{z}}_{(i)};\xi_{i}^{\prime})-\psi(\hat{u},\hat{\boldsymbol{z}};\xi_{i}^{\prime})]\big|\leqslant\epsilon,

where the last inequality results from the stability of $\widetilde{\operatorname{prox}}_{\psi/\gamma}^{\mathcal{I}}$. Hence, (3.5) holds. ∎

3.3. Convergence

In this subsection, we study the convergence of Algorithm 1. We first assume that all the linear operators in (2.6) are bounded.

Assumption 1.

Let $\{\boldsymbol{\phi}_{i}\}_{i=1}^{N}$ be the linear operators in (2.6). Assume that there exists a constant $L>0$ such that $\|\boldsymbol{\phi}_{i}\|_{\mathcal{U}^{*}}\leqslant L$, $\forall i\in\{1,\dots,N\}$.

Remark 3.3.

The boundedness imposed on the linear operators $\{\boldsymbol{\phi}_{i}\}$ is not restrictive if the kernel $\kappa$ associated with the space $\mathcal{U}$ is smooth enough. Indeed, let $\phi=\delta_{\boldsymbol{x}}\circ\Delta$, where $\boldsymbol{x}\in\Omega$. Then, there exists a constant $C$ such that

\|\phi\|_{\mathcal{U}^{*}}=\sup_{u\in\mathcal{U}}\frac{[\phi,u]}{\|u\|_{\mathcal{U}}}=\sup_{u\in\mathcal{U}}\frac{\Delta u(\boldsymbol{x})}{\|u\|_{\mathcal{U}}}=\sup_{u\in\mathcal{U}}\frac{\langle u,\Delta\kappa(\cdot,\boldsymbol{x})\rangle}{\|u\|_{\mathcal{U}}}\leqslant\|\Delta\kappa(\cdot,\boldsymbol{x})\|_{\mathcal{U}}\leqslant C,

where we use the reproducing property in the third equality, the Cauchy–Schwarz inequality in the first inequality, and the regularity of the kernel in the final inequality. Similar arguments hold when $\phi$ is another differential operator.

Furthermore, we impose the following assumption on the regularity of $f_{i}$ in $\mathcal{X}$.

Assumption 2.

Let $\{f_{i}\}_{i=1}^{N}$ be the functions in (2.6). For any $i\in\{1,\dots,N\}$, we assume that $f_{i}$ is continuously differentiable. Moreover, there exist constants $U_{f}$ and $\mathcal{H}_{f}$ such that $\sup_{\boldsymbol{x}\in\mathcal{X}}|f_{i}(\boldsymbol{x})-y_{i}|\leqslant U_{f}$ for $i\in\{1,\dots,N\}$ and

|f_{i}(\boldsymbol{w})-f_{i}(\boldsymbol{x})-\langle\nabla f_{i}(\boldsymbol{x}),\boldsymbol{w}-\boldsymbol{x}\rangle|\leqslant\frac{\mathcal{H}_{f}}{2}|\boldsymbol{w}-\boldsymbol{x}|^{2},\quad\forall\boldsymbol{x},\boldsymbol{w}\in\mathcal{X}. \qquad (3.7)

With this second-order bound on $f_{i}$ at hand, we are able to show that $\varphi(\cdot,\cdot)$ in (2.6) and $\varphi(\cdot,\cdot;\mathcal{I})$ in (2.8) are weakly convex. The proof of the following lemma is similar to the argument of Lemma 4.2 of [14]. Recall that $\mathcal{X}$ is a convex compact set and is the product of a collection of convex compact sets $\{\mathcal{X}_{i}\}_{i=1}^{N}$.

Lemma 3.4.

Let $\{f_{i}\}_{i=1}^{N}$ be the collection of functions in (2.6). Suppose that Assumption 2 holds. Let $U_{f}$ and $\mathcal{H}_{f}$ be the constants in Assumption 2, and let $\mu=U_{f}\mathcal{H}_{f}$. For $i\in\mathcal{N}$, define $h_{i}(\boldsymbol{z}_{i})=\frac{1}{2}|f_{i}(\boldsymbol{z}_{i})-y_{i}|^{2}$ for $\boldsymbol{z}_{i}\in\mathcal{X}_{i}$. Then, $h_{i}$ is $\mu$-weakly convex in $\boldsymbol{z}_{i}$. Moreover, for any set $\mathcal{I}\subset\mathcal{N}$ with $M$ indices, let $h(\boldsymbol{z})=\frac{1}{M}\sum_{i\in\mathcal{I}}h_{i}(\boldsymbol{z}_{i})$ for $\boldsymbol{z}=(\boldsymbol{z}_{1},\dots,\boldsymbol{z}_{N})\in\mathcal{X}$. Then, $h$ is $\mu$-weakly convex in $\boldsymbol{z}$.

Proof.

For $i\in\mathcal{N}$, let $\boldsymbol{z}_{i}\in\mathcal{X}_{i}$ and $\boldsymbol{w}_{i}\in\mathcal{X}_{i}$. Using Assumption 2, we obtain

h_{i}(\boldsymbol{z}_{i})=\frac{1}{2}|f_{i}(\boldsymbol{z}_{i})-y_{i}|^{2}\geqslant\frac{1}{2}|f_{i}(\boldsymbol{w}_{i})-y_{i}|^{2}+\langle f_{i}(\boldsymbol{w}_{i})-y_{i},f_{i}(\boldsymbol{z}_{i})-f_{i}(\boldsymbol{w}_{i})\rangle
=\frac{1}{2}|f_{i}(\boldsymbol{w}_{i})-y_{i}|^{2}+\langle f_{i}(\boldsymbol{w}_{i})-y_{i},\nabla f_{i}(\boldsymbol{w}_{i})(\boldsymbol{z}_{i}-\boldsymbol{w}_{i})\rangle+\langle f_{i}(\boldsymbol{w}_{i})-y_{i},f_{i}(\boldsymbol{z}_{i})-f_{i}(\boldsymbol{w}_{i})-\nabla f_{i}(\boldsymbol{w}_{i})(\boldsymbol{z}_{i}-\boldsymbol{w}_{i})\rangle
\geqslant h_{i}(\boldsymbol{w}_{i})+\langle(\nabla f_{i}(\boldsymbol{w}_{i}))^{T}(f_{i}(\boldsymbol{w}_{i})-y_{i}),\boldsymbol{z}_{i}-\boldsymbol{w}_{i}\rangle-\frac{\mu}{2}|\boldsymbol{z}_{i}-\boldsymbol{w}_{i}|^{2},

which implies that $h_{i}$ is $\mu$-weakly convex. Thus, for $\boldsymbol{z},\boldsymbol{w}\in\mathcal{X}$, we have

h(\boldsymbol{z})\geqslant\frac{1}{M}\sum_{i\in\mathcal{I}}\Big(h_{i}(\boldsymbol{w}_{i})+\langle(\nabla f_{i}(\boldsymbol{w}_{i}))^{T}(f_{i}(\boldsymbol{w}_{i})-y_{i}),\boldsymbol{z}_{i}-\boldsymbol{w}_{i}\rangle-\frac{\mu}{2}|\boldsymbol{z}_{i}-\boldsymbol{w}_{i}|^{2}\Big)\geqslant h(\boldsymbol{w})+\langle\nabla h(\boldsymbol{w}),\boldsymbol{z}-\boldsymbol{w}\rangle-\frac{\mu}{2}|\boldsymbol{z}-\boldsymbol{w}|^{2}.

Hence, $h$ is also $\mu$-weakly convex. ∎
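As a concrete illustration of the constants in Lemma 3.4 (our example, with $N_{i}=1$), take $\mathcal{X}_{i}=[-1,1]$ and $f_{i}(z)=z^{2}$. Then (3.7) holds with $\mathcal{H}_{f}=2$, and $U_{f}\leqslant 1+|y_{i}|$, so the lemma yields that $h_{i}(z)=\frac{1}{2}|z^{2}-y_{i}|^{2}$ is $\mu$-weakly convex with $\mu=2(1+|y_{i}|)$. A direct computation confirms this (and shows that the bound need not be tight):

h_{i}^{\prime\prime}(z)=6z^{2}-2y_{i}\geqslant-2|y_{i}|,\qquad z\in[-1,1],

so $z\mapsto h_{i}(z)+\frac{\nu}{2}z^{2}$ is convex for any $\nu\geqslant 2|y_{i}|$.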

Lemma 3.4 directly implies the weak convexity of $\varphi(\cdot,\cdot;\mathcal{I})$ defined in (2.8), which is described in the following corollary.

Corollary 3.5.

Suppose that Assumption 2 holds. Let $\mathcal{I}\subset\mathcal{N}$ and let $\mu$ be the constant in Lemma 3.4. Then, $\varphi(\cdot,\cdot;\mathcal{I})$ defined in (2.8) is $\lambda$-strongly convex in $u$ and $\mu$-weakly convex in $\boldsymbol{z}$.

Proof.

The claim follows from Assumption 2, Lemma 3.4, and the strong convexity of $\varphi(\cdot,\cdot;\mathcal{I})$ in $u$. ∎

Next, we use the stability analysis and the weak convexity property to study the convergence of Algorithm 1. The following lemma establishes the stability of each iteration of Algorithm 1. The proof appears in Appendix A.

Lemma 3.6.

Suppose that Assumptions 1 and 2 hold. Let $\mathcal{I}=\{\xi_{1},\dots,\xi_{M}\}$ be a set of i.i.d. samples from $\mathcal{N}=\{1,\dots,N\}$, where $M,N\in\mathbb{N}$ and $M\leqslant N$. Given $\bar{\boldsymbol{z}}\in\mathcal{X}$ and $\gamma>\mu$, where $\mu$ is the constant in Lemma 3.4, we define

(\hat{u},\hat{\boldsymbol{z}})=\widetilde{\operatorname{prox}}_{\varphi/\gamma}^{\mathcal{I}}(\bar{\boldsymbol{z}})=\operatorname{argmin}_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\varphi(u,\boldsymbol{z};\mathcal{I})+\frac{\gamma}{2}|\boldsymbol{z}-\bar{\boldsymbol{z}}|^{2}. \qquad (3.8)

Let $r$ be the diameter of $\mathcal{X}$ and let $U_{f}$ be given in Assumption 2. Then, there exists a constant $C$ depending on $\beta,r,\lambda,L$, and $U_{f}$, such that $\widetilde{\operatorname{prox}}_{\varphi/\gamma}^{\mathcal{I}}$ is $\epsilon$-stable with

\epsilon=\frac{C}{M\min\{\lambda,\gamma-\mu\}}. \qquad (3.9)
Remark 3.7.

Lemma 3.6 demonstrates the critical distinction between the mini-batch method delineated in this paper and those explored in [15, 12, 13]. In our approach, the regularization parameter $\lambda$ is typically small, rendering $\min\{\lambda,\gamma-\mu\}$ equal to $\lambda$, a fixed parameter that does not depend on the step size. This aspect is significantly different from the methodologies in [15, 12, 13], where letting $\gamma$ increase with the number of iterations enables a reduction of the statistical error. A notable difference in our method is the omission of a proximal step over $u$ in (2.10). Incorporating the proximal step over $u$ would necessitate the retention of all mini-batch samples and would make the minimizer $u^{k}$ at the $k^{\textit{th}}$ step depend on the formulations of the preceding minimizers $u^{1},\dots,u^{k-1}$, thereby substantially elevating memory requirements. Addressing the challenge of reducing the statistical error while managing memory efficiency is a subject for future research.

Next, we study the descent property of Algorithm 1. Henceforth, we use the notation $E_{k}[\cdot]$ to represent the expectation conditioned on all the randomness up to the $k^{\textit{th}}$ step in Algorithm 1. The proof appears in Appendix A.

Proposition 3.8.

Let $(u^{k},\boldsymbol{z}^{k})_{k=1}^{K}$ be the sequence generated by Algorithm 1 with step sizes $\{\gamma_{k}\}_{k=1}^{K}$, where $\gamma_{k}>\mu$ and $\mu$ is given in Lemma 3.4. For $\rho>0$, let

(\hat{u}^{k},\hat{\boldsymbol{z}}^{k})=\operatorname{argmin}_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\varphi(u,\boldsymbol{z})+\frac{\rho}{2}|\boldsymbol{z}-\boldsymbol{z}^{k}|^{2}. \qquad (3.10)

Define

\varphi_{1/\rho}(\bar{\boldsymbol{z}})=\min_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\varphi(u,\boldsymbol{z})+\frac{\rho}{2}|\boldsymbol{z}-\bar{\boldsymbol{z}}|^{2}. \qquad (3.11)

Then, we get

\frac{\rho(\rho-\mu)}{2(\gamma_{k}-\mu)}|\hat{\boldsymbol{z}}^{k}-\boldsymbol{z}^{k}|^{2}+\frac{\rho(\gamma_{k}-\rho)}{2(\gamma_{k}-\mu)}E_{k}[|\boldsymbol{z}^{k+1}-\boldsymbol{z}^{k}|^{2}]\leqslant\varphi_{1/\rho}(\hat{\boldsymbol{z}}^{k})-E_{k}[\varphi_{1/\rho}(\boldsymbol{z}^{k+1})]+\frac{\rho\epsilon_{k}}{\gamma_{k}-\mu}, \qquad (3.12)

where $\epsilon_{k}$ is given as in (3.9) with $\gamma:=\gamma_{k}$.

Before we get to the main theorem, the following corollary, a direct byproduct of Lemma 2.5, implies that $\varphi_{1/\rho}$ in (3.11) is differentiable.

Corollary 3.9.

Suppose that Assumption 2 holds. Let \varphi be as in (2.8). Given \gamma>0 and \bar{\boldsymbol{z}}\in\mathcal{X}, define (\hat{u},\hat{\boldsymbol{z}})=\operatorname{argmin}_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\varphi(u,\boldsymbol{z})+\frac{\gamma}{2}|\boldsymbol{z}-\bar{\boldsymbol{z}}|^{2} and let \varphi_{1/\gamma} be defined as in (3.11). Define

ψ(𝒛)=λ2𝒛T(𝒦(ϕ,ϕ)+λNβI)1𝒛+12Ni𝒩|fi(zi)yi|2,𝒛𝒳.\displaystyle\psi({\boldsymbol{z}})=\frac{\lambda}{2}\boldsymbol{z}^{T}(\mathcal{K}(\boldsymbol{\phi},\boldsymbol{\phi})+\frac{\lambda N}{\beta}I)^{-1}\boldsymbol{z}+\frac{1}{2N}\sum_{i\in\mathcal{N}}|f_{i}(z_{i})-y_{i}|^{2},\forall\boldsymbol{z}\in\mathcal{X}. (3.13)

Then, 𝐳^=proxψ/γ(𝐳¯)\hat{\boldsymbol{z}}=\operatorname{prox}_{\psi/\gamma}(\bar{\boldsymbol{z}}) and φ1/γ(𝐳¯)=min𝐳𝒳ψ(𝐳)+γ2|𝐳𝐳¯|2\varphi_{1/\gamma}(\bar{\boldsymbol{z}})=\min_{\boldsymbol{z}\in\mathcal{X}}\psi(\boldsymbol{z})+\frac{\gamma}{2}|\boldsymbol{z}-\bar{\boldsymbol{z}}|^{2}. Furthermore, φ1/γ\varphi_{1/\gamma} is differentiable and φ1/γ(𝐳¯)=γ(𝐳¯𝐳^)\nabla\varphi_{1/\gamma}(\bar{\boldsymbol{z}})=\gamma(\bar{\boldsymbol{z}}-\hat{\boldsymbol{z}}).

Proof.

Let

(u^,𝒛^)=argminu𝒰,𝒛𝒳φ(u,𝒛)+γ2|𝒛𝒛¯|2.\displaystyle(\hat{u},\hat{\boldsymbol{z}})=\operatorname{argmin}_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\varphi(u,\boldsymbol{z})+\frac{\gamma}{2}|\boldsymbol{z}-\bar{\boldsymbol{z}}|^{2}.

By (3.11) and the definition of \mathcal{L} in (A.1) with \mathcal{I}=\mathcal{N}, we have (\hat{u},\hat{\boldsymbol{z}})=\operatorname{argmin}_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\mathcal{L}(u,\boldsymbol{z};\mathcal{N}) and \varphi_{1/\gamma}(\bar{\boldsymbol{z}})=\min_{u\in\mathcal{U},\boldsymbol{z}\in\mathcal{X}}\mathcal{L}(u,\boldsymbol{z};\mathcal{N}), where \mathcal{N}=\{1,\dots,N\}. Then, equation (A.3) implies that \hat{\boldsymbol{z}}=\operatorname{prox}_{\psi/\gamma}(\bar{\boldsymbol{z}}) and \varphi_{1/\gamma}(\bar{\boldsymbol{z}})=\min_{\boldsymbol{z}\in\mathcal{X}}\psi(\boldsymbol{z})+\frac{\gamma}{2}|\boldsymbol{z}-\bar{\boldsymbol{z}}|^{2}. Hence, \varphi_{1/\gamma} is a Moreau envelope. By Lemma 3.4, \psi in (3.13) is \mu-weakly convex in \boldsymbol{z}. Then, standard results [40] imply that \nabla\varphi_{1/\gamma}(\bar{\boldsymbol{z}})=\gamma(\bar{\boldsymbol{z}}-\hat{\boldsymbol{z}}). ∎
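As a quick numerical sanity check of the gradient identity \nabla\varphi_{1/\gamma}(\bar{\boldsymbol{z}})=\gamma(\bar{\boldsymbol{z}}-\hat{\boldsymbol{z}}), the following Python sketch compares the formula against a finite-difference derivative of the Moreau envelope of a toy one-dimensional weakly convex function; the choice of \psi, the step parameter, and all variable names are our own and are not part of the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy 1-D stand-in for psi in (3.13): 8-weakly convex, since psi''(z) = 1 - 9*sin(3z) >= -8.
psi = lambda z: 0.5 * z**2 + np.sin(3.0 * z)
gamma = 10.0  # must exceed the weak-convexity constant of psi

def prox_psi(z_bar):
    """prox_{psi/gamma}(z_bar) = argmin_z psi(z) + (gamma/2)(z - z_bar)^2."""
    res = minimize_scalar(lambda z: psi(z) + 0.5 * gamma * (z - z_bar) ** 2,
                          bounds=(z_bar - 5.0, z_bar + 5.0), method="bounded")
    return res.x

def moreau(z_bar):
    """Moreau envelope phi_{1/gamma}(z_bar) = min_z psi(z) + (gamma/2)(z - z_bar)^2."""
    z_hat = prox_psi(z_bar)
    return psi(z_hat) + 0.5 * gamma * (z_hat - z_bar) ** 2

z_bar = 0.7
grad_formula = gamma * (z_bar - prox_psi(z_bar))               # gamma * (z_bar - z_hat)
h = 1e-5
grad_fd = (moreau(z_bar + h) - moreau(z_bar - h)) / (2.0 * h)  # central finite difference
print(grad_formula, grad_fd)  # the two values agree up to the finite-difference error
```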

Next, we have our main theorem about the convergence of Algorithm 1 in the setting of nonlinear measurements.

Theorem 3.10.

Suppose that Assumptions 1 and 2 hold. Let ρ>μ>0\rho>\mu>0. In Algorithm 1, let γk=γ\gamma_{k}=\gamma for some γ>ρ\gamma>\rho. In each iteration, let |k|=M|\mathcal{I}_{k}|=M for all k1k\geqslant 1, where MM\in\mathbb{N}. Let (uk,𝐳k)k=1K(u^{k},\boldsymbol{z}^{k})_{k=1}^{K} be the sequence generated by Algorithm 1, let kk^{*} be an index uniformly sampled from {1,,K}\{1,\dots,K\}, and let φ1/ρ\varphi_{1/\rho} be defined as in (3.11) for ρ>0\rho>0. Then, there exists a constant CC depending on β,r,λ,L,μ\beta,r,\lambda,L,\mu such that

E[|φ1/ρ(𝒛k)|2]2ρ(γμ)Δ(ρμ)K+Cρ2M(ρμ)min{λ,γμ},\displaystyle E[|\nabla\varphi_{1/\rho}(\boldsymbol{z}^{k^{*}})|^{2}]\leqslant\frac{2\rho(\gamma-\mu)\Delta}{(\rho-\mu)K}+\frac{C\rho^{2}}{M(\rho-\mu)\min\{\lambda,\gamma-\mu\}}, (3.14)

with Δ=φ1/ρ(𝐳^1)min𝐳𝒳φ1/ρ(𝐳)\Delta=\varphi_{1/\rho}(\hat{\boldsymbol{z}}^{1})-\min_{\boldsymbol{z}\in\mathcal{X}}\varphi_{1/\rho}(\boldsymbol{z}), where 𝐳^1\hat{\boldsymbol{z}}^{1} is defined as in (3.10).

Proof.

Taking ρ>μ>0\rho>\mu>0, we define 𝒛^k\hat{\boldsymbol{z}}^{k} as in (3.10). By Corollary 3.9, we get φ1/ρ(𝒛k)=ρ(𝒛k𝒛^k)\nabla\varphi_{1/\rho}(\boldsymbol{z}^{k})=\rho(\boldsymbol{z}^{k}-\hat{\boldsymbol{z}}^{k}). Thus, using Proposition 3.8, we obtain

ρμ2ρ(γμ)|φ1/ρ(𝒛k)|2φ1/ρ(𝒛^k)Ek[φ1/ρ(𝒛k+1)]+ρϵγμ,\displaystyle\frac{\rho-\mu}{2\rho(\gamma-\mu)}|\nabla\varphi_{1/\rho}(\boldsymbol{z}^{k})|^{2}\leqslant\varphi_{1/\rho}(\hat{\boldsymbol{z}}^{k})-E_{k}[\varphi_{1/\rho}(\boldsymbol{z}^{k+1})]+\frac{\rho\epsilon}{\gamma-\mu}, (3.15)

where \epsilon is given in (3.9). Summing (3.15) from k=1 to k=K, taking the expectation over all the randomness, noting that \varphi_{1/\rho}(\hat{\boldsymbol{z}}^{k+1})\leqslant\varphi(\hat{u}^{k+1},\hat{\boldsymbol{z}}^{k+1})\leqslant\varphi_{1/\rho}(\boldsymbol{z}^{k+1}) so that the sum telescopes, and dividing the resulting inequality by K, we get

\displaystyle\frac{(\rho-\mu)}{2(\gamma-\mu)\rho K}\sum_{k=1}^{K}E[|\nabla\varphi_{1/\rho}(\boldsymbol{z}^{k})|^{2}]\leqslant\frac{\varphi_{1/\rho}(\hat{\boldsymbol{z}}^{1})-E[\varphi_{1/\rho}(\boldsymbol{z}^{K+1})]}{K}+\frac{\rho\epsilon}{\gamma-\mu}. (3.16)

Therefore, using the definition of k^{*} and Jensen’s inequality, we conclude from (3.16) and (3.9) that (3.14) holds. ∎

Theorem 3.10 shows that, in expectation, \boldsymbol{z}^{k^{*}} approaches a neighborhood of a nearly stationary point as the number of iterations and the batch size increase. The bound in (3.14) shows that, for a fixed batch size, the error does not vanish as the number of iterations grows; instead, the accuracy improves as the batch size increases, a behavior confirmed by the numerical experiments in the next section.
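To make the trade-off in (3.14) concrete, the short sketch below evaluates the right-hand side of the bound for a few values of K and M; the constants C, \gamma, \rho, \mu, \lambda, and \Delta are placeholder values chosen only for illustration and do not come from the paper.

```python
# Illustration of the error bound (3.14): an O(1/K) optimization term plus an
# O(1/M) statistical term. All constants below are hypothetical placeholders.
Delta, C = 1.0, 1.0
gamma, rho, mu, lam = 2.0, 1.0, 0.5, 1e-3

def bound(K, M):
    opt_term = 2 * rho * (gamma - mu) * Delta / ((rho - mu) * K)
    stat_term = C * rho**2 / (M * (rho - mu) * min(lam, gamma - mu))
    return opt_term + stat_term

for K in (10**2, 10**4, 10**6):
    for M in (64, 256, 1024):
        print(f"K={K:>7}, M={M:>4}: bound={bound(K, M):.3e}")
# As K grows the first term vanishes, while the second term only shrinks with M.
```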

Fig. 1. Nonlinear elliptic equation: results for the mini-batch GP method using 64 points in each batch. The nugget parameter is \eta=10^{-13}. (a): a set of sample points and the contour of the true solution; (b): the convergence history of the loss, averaged over 10 consecutive iterations (semi-log scale in the y axis); (c): the graph of the true solution u^{*}; (d): the numerical solution u_{\text{MGP}} of the mini-batch GP method; (e): the contour of point-wise errors of our mini-batch GP method; (f): the contour of point-wise errors of the GP method [9].

4. Numerical Experiments

In this section, we demonstrate the efficacy of the mini-batch method by solving a nonlinear elliptic equation in Subsection 4.1 and Burgers’ equation in Subsection 4.2. Our implementation uses Python with the JAX package for automatic differentiation, and all experiments are conducted on CPUs only. To better exploit the correlation between nearby points within each mini-batch, we form mini-batches from neighboring samples instead of sampling them uniformly: a mini-batch consists of one uniformly sampled point together with its M-1 nearest neighbors. We use a k-d tree, built over the whole data set of size N in O(N\log N) time and O(N) space, to support these nearest-neighbor searches; a minimal sketch of this selection procedure is given below.
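The following Python sketch illustrates this mini-batch selection with scipy.spatial.cKDTree standing in for the k-d tree; the array names, the random seed, and the placeholder point cloud are our own choices and not the paper's code.

```python
import numpy as np
from scipy.spatial import cKDTree

# Placeholder collocation points: N = 1200 samples in the unit square.
rng = np.random.default_rng(0)
points = rng.uniform(size=(1200, 2))

tree = cKDTree(points)              # k-d tree built once over the whole data set

def sample_minibatch(M):
    """Indices of a mini-batch: one uniformly sampled point and its M-1 nearest neighbors."""
    center = rng.integers(points.shape[0])
    _, idx = tree.query(points[center], k=M)   # distances are discarded
    return np.atleast_1d(idx)

batch = sample_minibatch(64)
print(batch.shape)                  # (64,)
```

In the algorithm, these indices would select the rows of the collocation data and the corresponding PDE constraints entering the k-th proximal subproblem.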

4.1. Solving A Nonlinear Elliptic PDE

In this example, we solve a nonlinear elliptic equation using the preconditioned mini-batch method described in Remark 2.3. The preconditioner Q_{k} is a diagonal matrix whose entries are the diagonal elements of the covariance matrix restricted to the current set of mini-batch points. Let \Omega=(0,1)^{2}. Given a continuous function f:\Omega\to\mathbb{R}, we find u solving

{Δu(x)+u3(x)=f(x),xΩ,u(x)=0,xΩ.\displaystyle\begin{cases}-\Delta u(x)+u^{3}(x)=f(x),x\in\Omega,\\ u(x)=0,x\in\partial\Omega.\end{cases} (4.1)

We prescribe the true solution u to be \sin(\pi x_{1})\sin(\pi x_{2})+4\sin(4\pi x_{1})\sin(4\pi x_{2}) for \boldsymbol{x}:=(x_{1},x_{2})\in\overline{\Omega} and compute f on the right-hand side of the first equation in (4.1) accordingly. For regression, we choose the Gaussian kernel K(\boldsymbol{x},\boldsymbol{y})=\exp(-\frac{|\boldsymbol{x}-\boldsymbol{y}|^{2}}{2\sigma^{2}}) with lengthscale \sigma=0.2. We take N=1200 samples in the domain, of which N_{\Omega}=900 are interior points, and set the nugget \eta=10^{-13} (see Remark 2.6). The algorithm stops after 40000 iterations. At each iteration, we use the technique of eliminating variables (see Subsection 3.3 in [9]) and solve the resulting nonlinear minimization problem with the Gauss–Newton method, which terminates once the error between two successive steps is less than 10^{-5} or the maximum of 30 iterations is reached.
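For concreteness, a minimal JAX sketch of this setup is shown below: it encodes the manufactured solution, obtains the right-hand side f in (4.1) by automatic differentiation, and evaluates the Gaussian kernel with \sigma=0.2. The function names are our own and the snippet is not the paper's implementation.

```python
import jax
import jax.numpy as jnp

sigma = 0.2  # lengthscale of the Gaussian kernel

def u_true(x):
    """Manufactured solution sin(pi x1) sin(pi x2) + 4 sin(4 pi x1) sin(4 pi x2)."""
    return (jnp.sin(jnp.pi * x[0]) * jnp.sin(jnp.pi * x[1])
            + 4.0 * jnp.sin(4.0 * jnp.pi * x[0]) * jnp.sin(4.0 * jnp.pi * x[1]))

def f_rhs(x):
    """Right-hand side f = -Delta u + u^3 of (4.1), computed via automatic differentiation."""
    laplacian = jnp.trace(jax.hessian(u_true)(x))
    return -laplacian + u_true(x) ** 3

def gaussian_kernel(x, y):
    """K(x, y) = exp(-|x - y|^2 / (2 sigma^2))."""
    return jnp.exp(-jnp.sum((x - y) ** 2) / (2.0 * sigma ** 2))

x = jnp.array([0.3, 0.7])
print(f_rhs(x))
print(gaussian_kernel(x, jnp.array([0.1, 0.2])))
```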

Figure 1 shows the results of one realization of the mini-batch GP method with batch size M=64. Figure 1a shows the samples taken in the domain and the contour of the true solution. Figure 1b depicts the decrease of the loss during a single run, where each plotted value is the average of the loss over the most recent 10 iterations. The explicit solution u^{*} is shown in Figure 1c, and Figure 1d plots its approximation, denoted by u_{\text{MGP}}. The contour of the point-wise errors between u_{\text{MGP}} and u^{*} is given in Figure 1e. Figure 1f presents the contour of errors for the GP method of [9] applied to the identical full sample set. We observe that the mini-batch approach with a batch size of 64 achieves accuracy comparable to that of the GP method using all samples at once.

4.2. Burgers’ Equation

Fig. 2. Burgers’ equation: results for the mini-batch GP method using 75 points in each batch. The parameters \eta=10^{-10}, \gamma=1. (a): a set of sample points and the contour of the true solution; (b): the convergence history (semi-log scale in the y axis); (c): the graph of the true solution u^{*}; (d): the numerical solution u_{\text{MGP}} of the mini-batch GP method; (e): the contour of point-wise errors.

In this subsection, we consider Burgers’ equation

\displaystyle\begin{cases}\partial_{t}u+u\partial_{x}u-\nu\partial_{x}^{2}u=0,\forall(t,x)\in(0,1]\times(-1,1),\\ u(0,x)=-\sin(\pi x),\forall x\in[-1,1],\\ u(t,-1)=u(t,1)=0,\forall t\in[0,1],\end{cases}

where ν=0.2\nu=0.2 and uu is unknown. We take N=2400N=2400 points uniformly in the space-time domain, of which NΩ=2000N_{\Omega}=2000 points are in the domain’s interior. As in [9], we choose the anisotropic kernel

K((t,x),(t,x))=exp((tt)2/σ12(xx)2/σ22),\displaystyle K((t,x),(t^{\prime},x^{\prime}))=\operatorname{exp}(-(t-t^{\prime})^{2}/\sigma_{1}^{2}-(x-x^{\prime})^{2}/{\sigma_{2}^{2}}), (4.2)

with (\sigma_{1},\sigma_{2})=(0.3,0.05). To compute point-wise errors, as in [9], we obtain the true solution using the Cole–Hopf transformation and numerical quadrature. At each iteration, we use the technique of eliminating constraints (see Subsection 3.3 in [9]) and solve the resulting nonlinear minimization problem with the Gauss–Newton method, which terminates once the error between two successive steps is less than 10^{-5} or the maximum of 100 iterations is reached. Throughout the experiments, we use the nugget \eta=10^{-10} (see Remark 2.6).
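As an illustration, the anisotropic kernel (4.2) and the Gram matrix over a mini-batch of space-time points can be assembled in JAX as in the sketch below; the variable names and the random placeholder points are ours, and the snippet is not the paper's implementation.

```python
import jax
import jax.numpy as jnp
from jax import vmap

sigma1, sigma2 = 0.3, 0.05   # lengthscales in t and x

def kernel(p, q):
    """Anisotropic Gaussian kernel (4.2); p = (t, x), q = (t', x')."""
    return jnp.exp(-(p[0] - q[0]) ** 2 / sigma1 ** 2
                   - (p[1] - q[1]) ** 2 / sigma2 ** 2)

def gram(P, Q):
    """Gram matrix G[i, j] = kernel(P[i], Q[j]) via nested vmap."""
    return vmap(lambda p: vmap(lambda q: kernel(p, q))(Q))(P)

# A placeholder mini-batch of 75 space-time points in [0, 1] x [-1, 1].
pts = jax.random.uniform(jax.random.PRNGKey(0), (75, 2),
                         minval=jnp.array([0.0, -1.0]),
                         maxval=jnp.array([1.0, 1.0]))
G = gram(pts, pts) + 1e-10 * jnp.eye(75)   # nugget eta = 1e-10 for numerical stability
print(G.shape)                             # (75, 75)
```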

Fig. 3. Burgers’ equation: averaged results over 10 realizations of the mini-batch GP method with different batch sizes. The parameters \eta=10^{-10}, \gamma=1. (a): averaged loss histories (semi-log scale in the y axis); (b): the averaged point-wise errors when M=75; (c): the averaged point-wise errors when M=150; (d): the averaged point-wise errors when M=300.

Figure 2 shows the results of one realization of the mini-batch GP method with the batch size fixed at M=75. Figure 2a shows the samples in the domain and the contour of the true solution. Figure 2b illustrates the loss history generated by the mini-batch GP method in one realization. The explicit solution u^{*} is plotted in Figure 2c, and Figure 2d shows its numerical approximation, denoted by u_{\text{MGP}}. The contour of the point-wise errors between u_{\text{MGP}} and u^{*} is presented in Figure 2e.

Then, we run the algorithm 10 times for several batch sizes and plot the results in Figure 3. The averaged loss histories are plotted in Figure 3a, from which we see that increasing the batch size results in faster convergence and smaller losses. The contours of the averaged point-wise errors are presented in Figures 3b-3d, which show that a larger batch size achieves better accuracy.

For Burgers’ equation, we observe that larger mini-batches are required than for the nonlinear elliptic example. This disparity may be attributed to the lower regularity of the solution of Burgers’ equation, for which uniformly distributed samples may not adequately resolve the regions where the solution varies sharply. Consequently, we defer the exploration of more effective strategies for sampling collocation points and selecting mini-batch points to future work.

5. Conclusions

This paper presents a mini-batch framework for solving nonlinear PDEs with GPs. The mini-batch method spreads the computational cost of GP regression over the iterative steps, and we provide a rigorous convergence analysis. For future work, it would be interesting to extend our framework to more general GP regression problems, such as semi-supervised learning and hyperparameter learning. In the numerical experiments, we observe that the placement of the mini-batch samples influences the performance. Hence, a potential direction for future work is to investigate different techniques for sampling mini-batch points.

Acknowledgement.

The authors wish to thank Professor Andrew Stuart for his insightful discussions on the convergence of the stochastic gradient descent method.

Appendix A. Proofs of Results

Proof of Lemma 2.5.

Let \mathcal{L}(u,\boldsymbol{z};\mathcal{I})=\varphi(u,\boldsymbol{z};\mathcal{I})+\frac{\gamma}{2}|\boldsymbol{z}-\bar{\boldsymbol{z}}|^{2} for all u\in\mathcal{U} and \boldsymbol{z}\in\mathcal{X}. Define the sampling operator S_{\mathcal{I}} as in Lemma 2.4. Then, we have

(u,𝒛;)=\displaystyle\mathcal{L}(u,\boldsymbol{z};\mathcal{I})= 12u,λu+βMSSu2βMS𝒛𝒰\displaystyle\frac{1}{2}\bigg{\langle}{u},\lambda{u}+\frac{\beta}{M}S_{\mathcal{I}}^{*}S_{\mathcal{I}}{u}-\frac{2\beta}{M}S_{\mathcal{I}}^{*}\boldsymbol{z}_{\mathcal{I}}\bigg{\rangle}_{\mathcal{U}}
+β2M|𝒛|2+12Mi|fi(zi)yi|2+γ2|𝒛𝒛¯|2.\displaystyle+\frac{\beta}{2M}|\boldsymbol{z}_{\mathcal{I}}|^{2}+\frac{1}{2M}\sum_{i\in\mathcal{I}}|f_{i}(z_{i})-y_{i}|^{2}+\frac{\gamma}{2}|\boldsymbol{z}-\bar{\boldsymbol{z}}|^{2}. (A.1)

Next, fix \boldsymbol{z}\in\mathcal{X} and let \tilde{u}=\operatorname{argmin}_{u\in\mathcal{U}}\mathcal{L}(u,\boldsymbol{z};\mathcal{I}). Then,

u|u=u~=λu~+βMSSu~βMS𝒛=0,\frac{\partial\mathcal{L}}{\partial u}\bigg{|}_{u=\tilde{u}}=\lambda\tilde{u}+\frac{\beta}{M}S_{\mathcal{I}}^{*}S_{\mathcal{I}}\tilde{u}-\frac{\beta}{M}S_{\mathcal{I}}^{*}\boldsymbol{z}_{\mathcal{I}}=0,

which yields

u~=(λMβI+SS)1S𝒛.\displaystyle\tilde{u}=\bigg{(}\frac{\lambda M}{\beta}I+S^{*}_{\mathcal{I}}S_{\mathcal{I}}\bigg{)}^{-1}S_{\mathcal{I}}^{*}\boldsymbol{z}_{\mathcal{I}}. (A.2)

Thus, using (A.1) and (A.2), we have

minu𝒰(u,𝒛;)=\displaystyle\min_{u\in\mathcal{U}}\mathcal{L}(u,\boldsymbol{z};\mathcal{I})= β2MSu~,𝒛+β2M|𝒛|2+12Mi|fi(zi)yi|2+γ2|𝒛𝒛¯|2\displaystyle-\frac{\beta}{2M}\langle S_{\mathcal{I}}\tilde{u},\boldsymbol{z}_{\mathcal{I}}\rangle+\frac{\beta}{2M}|\boldsymbol{z}_{\mathcal{I}}|^{2}+\frac{1}{2M}\sum_{i\in\mathcal{I}}|f_{i}(z_{i})-y_{i}|^{2}+\frac{\gamma}{2}|\boldsymbol{z}-\bar{\boldsymbol{z}}|^{2}
=\displaystyle= λ2𝒛T(𝒦(ϕ,ϕ)+λMβI)1𝒛+12Mi|fi(zi)yi|2+γ2|𝒛𝒛¯|2.\displaystyle\frac{\lambda}{2}\boldsymbol{z}^{T}_{\mathcal{I}}\bigg{(}\mathcal{K}(\boldsymbol{\phi}_{\mathcal{I}},\boldsymbol{\phi}_{\mathcal{I}})+\frac{\lambda M}{\beta}I\bigg{)}^{-1}\boldsymbol{z}_{\mathcal{I}}+\frac{1}{2M}\sum_{i\in\mathcal{I}}|f_{i}(z_{i})-y_{i}|^{2}+\frac{\gamma}{2}|\boldsymbol{z}-\bar{\boldsymbol{z}}|^{2}. (A.3)

where the last equality results from (2.13). Therefore, (A.2), (2.12), and (A.3) imply that the minimizer (u,𝒛)(u^{\dagger},\boldsymbol{z}^{\dagger}) in (2.14) satisfies (2.15) and 𝒛\boldsymbol{z}^{\dagger} minimizes (2.16). ∎
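As a small numerical sanity check of this elimination step, the following sketch verifies that the minimum over u of \frac{\lambda}{2}\|u\|_{\mathcal{U}}^{2}+\frac{\beta}{2M}|S_{\mathcal{I}}u-\boldsymbol{z}_{\mathcal{I}}|^{2} equals the first term of (A.3). By the representer theorem we restrict u to the span of the kernel functions, so that \|u\|_{\mathcal{U}}^{2}=\boldsymbol{c}^{T}K\boldsymbol{c} and S_{\mathcal{I}}u=K\boldsymbol{c} for a Gram matrix K; the random matrix standing in for \mathcal{K}(\boldsymbol{\phi}_{\mathcal{I}},\boldsymbol{\phi}_{\mathcal{I}}) and all parameter values below are placeholders of our own.

```python
import numpy as np

rng = np.random.default_rng(1)
M, lam, beta = 8, 1e-3, 1.0
a = lam * M / beta                           # the shift lambda*M/beta appearing in (A.3)

A = rng.standard_normal((M, M))
K = A @ A.T + 1e-8 * np.eye(M)               # placeholder for the Gram matrix K(phi_I, phi_I)
z = rng.standard_normal(M)

# With u = sum_i c_i kappa(., phi_i): ||u||^2 = c^T K c and S_I u = K c, so the
# u-dependent part of (A.1) becomes the quadratic below; minimize it exactly.
obj = lambda c: 0.5 * lam * c @ K @ c + beta / (2 * M) * np.sum((K @ c - z) ** 2)
H = lam * K + (beta / M) * K @ K             # Hessian of the quadratic in c
b = (beta / M) * K @ z                       # linear term of the quadratic in c
c_star = np.linalg.solve(H, b)               # exact minimizer from the normal equations

closed_form = 0.5 * lam * z @ np.linalg.solve(K + a * np.eye(M), z)   # first term of (A.3)
print(obj(c_star), closed_form)              # the two values agree up to round-off
```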

Proof of Lemma 3.6.

Let {\mathcal{I}}^{\prime} and (i){\mathcal{I}}_{(i)} be as in the beginning of this subsection. For 𝒛¯𝒳\bar{\boldsymbol{z}}\in\mathcal{X}, let

(u^,𝒛^)=prox~φ/γ(𝒛¯) and (u^(i),𝒛^(i))=prox~φ/γ(i)(𝒛¯).\displaystyle(\hat{u},\hat{\boldsymbol{z}})=\widetilde{\operatorname{prox}}_{\varphi/\gamma}^{\mathcal{I}}(\bar{\boldsymbol{z}})\text{ and }(\hat{u}_{(i)},\hat{\boldsymbol{z}}_{(i)})=\widetilde{\operatorname{prox}}_{\varphi/\gamma}^{{\mathcal{I}}_{(i)}}(\bar{\boldsymbol{z}}). (A.4)

By the optimality of (\hat{u},\hat{\boldsymbol{z}}) in (A.4), there exists a constant C depending on \beta, \lambda, the diameter r of \mathcal{X}, and the constant U_{f} in Assumption 2, such that

u^𝒰22λφ(0,𝒛¯;)βMλ|𝒛¯|2+1λMi|fi(𝒛¯i)yi|2C.\displaystyle\|\hat{u}\|_{\mathcal{U}}^{2}\leqslant\frac{2}{\lambda}\varphi(0,\bar{\boldsymbol{z}};{\mathcal{I}})\leqslant\frac{\beta}{M\lambda}|\bar{\boldsymbol{z}}|^{2}+\frac{1}{\lambda M}\sum_{i\in\mathcal{I}}|f_{i}(\bar{\boldsymbol{z}}_{i})-y_{i}|^{2}\leqslant C. (A.5)

Thus, using (A.5) and Assumption 1, there exists a constant CC depending on β\beta, λ\lambda, rr, UfU_{f}, and the constant LL in Assumption 1, such that |[ϕi,u^]|ϕi𝒰u^𝒰C|[\boldsymbol{\phi}_{i},\hat{u}]|\leqslant\|\boldsymbol{\phi}_{i}\|_{\mathcal{U}^{*}}\|\hat{u}\|_{\mathcal{U}}\leqslant C. Let ={u𝒰|u𝒰C~}×𝒳\mathcal{B}=\{u\in\mathcal{U}|\|u\|_{\mathcal{U}}\leqslant\tilde{C}\}\times\mathcal{X}, where C~\tilde{C} is large enough such that \mathcal{B} contains (u^,𝒛^)(\hat{u},\hat{\boldsymbol{z}}) in (A.4) for any 𝒛¯𝒳\bar{\boldsymbol{z}}\in\mathcal{X}. Next, we show that for any ξ{1,,N}\xi\in\{1,\dots,N\}, φ(,;ξ)\varphi(\cdot,\cdot;\xi) is Lipschitz in \mathcal{B}. Let (u1,𝒛1),(u2,𝒛2)(u_{1},\boldsymbol{z}_{1}),(u_{2},\boldsymbol{z}_{2})\in\mathcal{B}. Then, there exists a constant CC depending on β,r,λ,L\beta,r,\lambda,L, and Uf{U}_{f}, such that

|φ(u1,𝒛1;ξ)φ(u2,𝒛2;ξ)|\displaystyle|\varphi(u_{1},\boldsymbol{z}_{1};\xi)-\varphi(u_{2},\boldsymbol{z}_{2};\xi)|
\displaystyle\leqslant λ2u1+u2𝒰u1u2𝒰+12|fξ(𝒛1,ξ)+fξ(𝒛2,ξ)2yξ||fξ(𝒛1,ξ)fξ(𝒛2,ξ)|\displaystyle\frac{\lambda}{2}\|u_{1}+u_{2}\|_{\mathcal{U}}\|u_{1}-u_{2}\|_{\mathcal{U}}+\frac{1}{2}\big{|}f_{\xi}(\boldsymbol{z}_{1,\xi})+f_{\xi}(\boldsymbol{z}_{2,\xi})-2y_{\xi}\big{|}|f_{\xi}(\boldsymbol{z}_{1,\xi})-f_{\xi}(\boldsymbol{z}_{2,\xi})|
+β2|[ϕξ,u1]𝒛1,ξ+[ϕξ,u2]𝒛2,ξ|(|[ϕξ,u1][ϕξ,u2]|+|𝒛1,ξ𝒛2,ξ|)\displaystyle+\frac{\beta}{2}\bigg{|}[\boldsymbol{\phi}_{\xi},u_{1}]-\boldsymbol{z}_{1,\xi}+[\boldsymbol{\phi}_{\xi},u_{2}]-\boldsymbol{z}_{2,\xi}\bigg{|}\bigg{(}\big{|}[\boldsymbol{\phi}_{\xi},u_{1}]-[\boldsymbol{\phi}_{\xi},u_{2}]\big{|}+\big{|}\boldsymbol{z}_{1,\xi}-\boldsymbol{z}_{2,\xi}\big{|}\bigg{)}
\displaystyle\leqslant C(u1u2𝒰+|𝒛1𝒛2|),\displaystyle C(\|u_{1}-u_{2}\|_{\mathcal{U}}+|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}|), (A.6)

where (A.6) follows from the definition of \mathcal{B} and Assumptions 1 and 2.

By Jensen’s inequality, we obtain

|E,,i[φ(u^,𝒛^;ξi)φ(u^(i),𝒛^(i);ξi)]|\displaystyle|E_{\mathcal{I},\mathcal{I}^{\prime},i}[\varphi(\hat{u},\hat{\boldsymbol{z}};\xi_{i}^{\prime})-\varphi(\hat{u}_{(i)},\hat{\boldsymbol{z}}_{(i)};\xi_{i}^{\prime})]|\leqslant 1Mi=1ME,ξi|φ(u^,𝒛^;ξi)φ(u^(i),𝒛^(i);ξi)|\displaystyle\frac{1}{M}\sum_{i=1}^{M}E_{\mathcal{I},\xi_{i}^{\prime}}\bigg{|}\varphi(\hat{u},\hat{\boldsymbol{z}};\xi_{i}^{\prime})-\varphi(\hat{u}_{(i)},\hat{\boldsymbol{z}}_{(i)};\xi_{i}^{\prime})\bigg{|}
\displaystyle\leqslant CMi=1ME,ξi(u^u^(i)𝒰+|𝒛^𝒛^(i)|),\displaystyle\frac{C}{M}\sum_{i=1}^{M}E_{\mathcal{I},\xi_{i}^{\prime}}(\|\hat{u}-\hat{u}_{(i)}\|_{\mathcal{U}}+|\hat{\boldsymbol{z}}-\hat{\boldsymbol{z}}_{(i)}|), (A.7)

where (A.7) follows from (A.6). Using (A.4), Corollary 3.5, and Lemma 3.1, we get

φ(u^,𝒛^;)+γ2|𝒛^𝒛¯|2\displaystyle\varphi(\hat{u},\hat{\boldsymbol{z}};{\mathcal{I}})+\frac{\gamma}{2}|\hat{\boldsymbol{z}}-\bar{\boldsymbol{z}}|^{2}
\displaystyle\leqslant φ(u^(i),𝒛^(i);)+γ2|𝒛^(i)𝒛¯|2γμ2|𝒛^𝒛^(i)|2λ2u^u^(i)𝒰2,\displaystyle\varphi(\hat{u}_{(i)},\hat{\boldsymbol{z}}_{(i)};{\mathcal{I}})+\frac{\gamma}{2}|\hat{\boldsymbol{z}}_{(i)}-\bar{\boldsymbol{z}}|^{2}-\frac{\gamma-\mu}{2}|\hat{\boldsymbol{z}}-\hat{\boldsymbol{z}}_{(i)}|^{2}-\frac{\lambda}{2}\|\hat{u}-\hat{u}_{(i)}\|_{\mathcal{U}}^{2}, (A.8)
φ(u^(i),𝒛^(i);(i))+γ2|𝒛^(i)𝒛¯|2\displaystyle\varphi(\hat{u}_{(i)},\hat{\boldsymbol{z}}_{(i)};{\mathcal{I}}_{(i)})+\frac{\gamma}{2}|\hat{\boldsymbol{z}}_{(i)}-\bar{\boldsymbol{z}}|^{2}
\displaystyle\leqslant φ(u^,𝒛^;(i))+γ2|𝒛^𝒛¯|2γμ2|𝒛^𝒛^(i)|2λ2u^u^(i)𝒰2.\displaystyle\varphi(\hat{u},\hat{\boldsymbol{z}};{\mathcal{I}}_{(i)})+\frac{\gamma}{2}|\hat{\boldsymbol{z}}-\bar{\boldsymbol{z}}|^{2}-\frac{\gamma-\mu}{2}|\hat{\boldsymbol{z}}-\hat{\boldsymbol{z}}_{(i)}|^{2}-\frac{\lambda}{2}\|\hat{u}-\hat{u}_{(i)}\|_{\mathcal{U}}^{2}. (A.9)

Adding (A.8) and (A.9) together and using (A.6), we obtain

λu^u^(i)𝒰2+(γμ)|𝒛^𝒛^(i)|2\displaystyle\lambda\|\hat{u}-\hat{u}_{(i)}\|^{2}_{\mathcal{U}}+(\gamma-\mu)|\hat{\boldsymbol{z}}-\hat{\boldsymbol{z}}_{(i)}|^{2}
\displaystyle\leqslant φ(u^(i),𝒛^(i);)φ(u^(i),𝒛^(i);(i))+φ(u^,𝒛^;(i))φ(u^,𝒛^;)\displaystyle\varphi(\hat{u}_{(i)},\hat{\boldsymbol{z}}_{(i)};{\mathcal{I}})-\varphi(\hat{u}_{(i)},\hat{\boldsymbol{z}}_{(i)};{\mathcal{I}}_{(i)})+\varphi(\hat{u},\hat{\boldsymbol{z}};{\mathcal{I}}_{(i)})-\varphi(\hat{u},\hat{\boldsymbol{z}};{\mathcal{I}})
=\displaystyle= 1M[φ(u^(i),𝒛^(i);ξi)φ(u^(i),𝒛^(i);ξi)+φ(u^,𝒛^;ξi)φ(u^,𝒛^;ξi)]\displaystyle\frac{1}{M}[\varphi(\hat{u}_{(i)},\hat{\boldsymbol{z}}_{(i)};\xi_{i})-\varphi(\hat{u}_{(i)},\hat{\boldsymbol{z}}_{(i)};\xi_{i}^{\prime})+\varphi(\hat{u},\hat{\boldsymbol{z}};\xi_{i}^{\prime})-\varphi(\hat{u},\hat{\boldsymbol{z}};\xi_{i})]
\displaystyle\leqslant CM(u^u^(i)𝒰+|𝒛^𝒛^(i)|),\displaystyle\frac{C}{M}(\|\hat{u}-\hat{u}_{(i)}\|_{\mathcal{U}}+|\hat{\boldsymbol{z}}-\hat{\boldsymbol{z}}_{(i)}|),

which, together with the fact that (a+b)^{2}\leqslant 2a^{2}+2b^{2} for all a,b\in\mathbb{R}, yields

u^u^(i)𝒰+|𝒛^𝒛^(i)|CMmin{λ,γμ}.\displaystyle\|\hat{u}-\hat{u}_{(i)}\|_{\mathcal{U}}+|\hat{\boldsymbol{z}}-\hat{\boldsymbol{z}}_{(i)}|\leqslant\frac{C}{M\min\{\lambda,\gamma-\mu\}}. (A.10)

Hence, (A.7) and (A.10) imply that \widetilde{\operatorname{prox}}_{\varphi/\gamma}^{\mathcal{I}} is \epsilon-stable with \epsilon given in (3.9). ∎

Proof of Proposition 3.8.

Let k\mathcal{I}_{k} be a mini-batch of indices in the kthk^{\textit{th}} step of Algorithm 1. According to Corollary 3.5, (u,𝒛)φ(u,𝒛;k)(u,\boldsymbol{z})\mapsto\varphi(u,\boldsymbol{z};\mathcal{I}_{k}) is μ\mu-weakly convex in 𝒛\boldsymbol{z} and strongly convex in uu. Then, by Lemma 3.1 and (2.10), for any u𝒰u\in\mathcal{U} and 𝒛𝒳\boldsymbol{z}\in\mathcal{X}, we have

γkμ2|𝒛𝒛k+1|2\displaystyle\frac{\gamma_{k}-\mu}{2}|\boldsymbol{z}-\boldsymbol{z}^{k+1}|^{2}\leqslant γk2|𝒛𝒛k|2γk2|𝒛k+1𝒛k|2+φ(u,𝒛;k)φ(uk+1,𝒛k+1)\displaystyle\frac{\gamma_{k}}{2}|\boldsymbol{z}-\boldsymbol{z}^{k}|^{2}-\frac{\gamma_{k}}{2}|\boldsymbol{z}^{k+1}-\boldsymbol{z}^{k}|^{2}+\varphi(u,\boldsymbol{z};{\mathcal{I}}_{k})-\varphi(u^{k+1},\boldsymbol{z}^{k+1})
+φ(uk+1,𝒛k+1)φ(uk+1,𝒛k+1;k).\displaystyle+\varphi(u^{k+1},\boldsymbol{z}^{k+1})-\varphi(u^{k+1},\boldsymbol{z}^{k+1};{\mathcal{I}}_{k}). (A.11)

Lemmas 3.2 and 3.6 imply that

Ek[φ(uk+1,𝒛k+1)φ(uk+1,𝒛k+1;k)]ϵk,\displaystyle E_{k}[\varphi(u^{k+1},\boldsymbol{z}^{k+1})-\varphi(u^{k+1},\boldsymbol{z}^{k+1};{\mathcal{I}}_{k})]\leqslant\epsilon_{k}, (A.12)

where ϵk\epsilon_{k} is given in (3.9) with γ:=γk\gamma\mathrel{\mathop{\mathchar 58\relax}}=\gamma_{k}. For ρ>0\rho>0, let (u^k,𝒛^k)(\hat{u}^{k},\hat{\boldsymbol{z}}^{k}) be given in (3.10). Then, we have

φ(u^k,𝒛^k)φ(uk+1,𝒛k+1)ρ2|𝒛^k𝒛k|2+ρ2|𝒛k+1𝒛k|2.\displaystyle\varphi(\hat{u}^{k},\hat{\boldsymbol{z}}^{k})-\varphi(u^{k+1},\boldsymbol{z}^{k+1})\leqslant-\frac{\rho}{2}|\hat{\boldsymbol{z}}^{k}-\boldsymbol{z}^{k}|^{2}+\frac{\rho}{2}|\boldsymbol{z}^{k+1}-\boldsymbol{z}^{k}|^{2}. (A.13)

Plugging (u,𝒛):=(u^k,𝒛^k)(u,\boldsymbol{z})\mathrel{\mathop{\mathchar 58\relax}}=(\hat{u}^{k},\hat{\boldsymbol{z}}^{k}) into (A.11), taking the expectation, using (A.12) and (A.13), and using Lemma 3.6, we obtain

γkμ2Ek[|𝒛^k𝒛k+1|2]γkρ2|𝒛^k𝒛k|2γkρ2Ek[|𝒛k+1𝒛k|2]+ϵk.\displaystyle\frac{\gamma_{k}-\mu}{2}E_{k}[|\hat{\boldsymbol{z}}^{k}-\boldsymbol{z}^{k+1}|^{2}]\leqslant\frac{\gamma_{k}-\rho}{2}|\hat{\boldsymbol{z}}^{k}-\boldsymbol{z}^{k}|^{2}-\frac{\gamma_{k}-\rho}{2}E_{k}[|\boldsymbol{z}^{k+1}-\boldsymbol{z}^{k}|^{2}]+\epsilon_{k}. (A.14)

For ρ>0\rho>0, using the definition of φ1/ρ\varphi_{1/\rho} and (A.14), we have

Ek[φ1/ρ(𝒛k+1)]Ek[φ(u^k,𝒛^k)+ρ2|𝒛^k𝒛k+1|2]\displaystyle E_{k}[\varphi_{1/\rho}(\boldsymbol{z}^{k+1})]\leqslant E_{k}[\varphi(\hat{u}^{k},\hat{\boldsymbol{z}}^{k})+\frac{\rho}{2}|\hat{\boldsymbol{z}}^{k}-\boldsymbol{z}^{k+1}|^{2}]
\displaystyle\leqslant φ1/ρ(𝒛^k)ρ(ρμ)2(γkμ)|𝒛^k𝒛k|2ρ(γkρ)2(γkμ)Ek[|𝒛k+1𝒛k|2]+ρϵkγkμ,\displaystyle\varphi_{1/\rho}(\hat{\boldsymbol{z}}^{k})-\frac{\rho(\rho-\mu)}{2(\gamma_{k}-\mu)}|\hat{\boldsymbol{z}}^{k}-\boldsymbol{z}^{k}|^{2}-\frac{\rho(\gamma_{k}-\rho)}{2(\gamma_{k}-\mu)}E_{k}[|\boldsymbol{z}^{k+1}-\boldsymbol{z}^{k}|^{2}]+\frac{\rho\epsilon_{k}}{\gamma_{k}-\mu},

which yields (3.12) after rearrangement. ∎

References

  • [1] H. Asi and J. C. Duchi, The importance of better models in stochastic optimization, Proceedings of the National Academy of Sciences, 116 (2019), pp. 22924–22930.
  • [2] H. Asi and J. C. Duchi, Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity, SIAM Journal on Optimization, 29 (2019), pp. 2257–2290.
  • [3] P. Batlle, Y. Chen, B. Hosseini, H. Owhadi, and A. M. Stuart, Error analysis of kernel/GP methods for nonlinear and parametric PDEs, arXiv preprint arXiv:2305.04962, (2023).
  • [4] D. P. Bertsekas, Incremental proximal methods for large scale convex optimization, Mathematical programming, 129 (2011), pp. 163–195.
  • [5] L. Bottou and O. Bousquet, The tradeoffs of large scale learning, Advances in neural information processing systems, 20 (2007).
  • [6] O. Bousquet and A. Elisseeff, Stability and generalization, The Journal of Machine Learning Research, 2 (2002), pp. 499–526.
  • [7] H. Chen, L. Zheng, R. Al Kontar, and G. Raskutti, Stochastic gradient descent in correlated settings: A study on Gaussian processes, Advances in neural information processing systems, 33 (2020), pp. 2722–2733.
  • [8] H. Chen, L. Zheng, R. Al Kontar, and G. Raskutti, Gaussian process parameter estimation using mini-batch stochastic gradient descent: convergence guarantees and empirical benefits, The Journal of Machine Learning Research, 23 (2022), pp. 10298–10356.
  • [9] Y. Chen, B. Hosseini, H. Owhadi, and A. M. Stuart, Solving and learning nonlinear PDEs with Gaussian processes, Journal of Computational Physics, (2021).
  • [10] Y. Chen, H. Owhadi, and F. Schäfer, Sparse Cholesky factorization for solving nonlinear PDEs via Gaussian processes, arXiv preprint arXiv:2304.01294, (2023).
  • [11] A. C. Damianou, M. K. Titsias, and N. D. Lawrence, Variational inference for latent variables and uncertain inputs in Gaussian processes, (2016).
  • [12] D. Davis and D. Drusvyatskiy, Stochastic model-based minimization of weakly convex functions, SIAM Journal on Optimization, 29 (2019), pp. 207–239.
  • [13] Q. Deng and W. Gao, Minibatch and momentum model-based methods for stochastic weakly convex optimization, Advances in Neural Information Processing Systems, 34 (2021), pp. 23115–23127.
  • [14] D. Drusvyatskiy and C. Paquette, Efficiency of minimizing compositions of convex functions and smooth maps, Mathematical Programming, 178 (2019), pp. 503–558.
  • [15] J. C. Duchi and F. Ruan, Stochastic methods for composite and weakly convex optimization problems, SIAM Journal on Optimization, 28 (2018), pp. 3229–3259.
  • [16] J. Gardner, G. Pleiss, K. Q. Weinberger, D. Bindel, and A. G. Wilson, GPytorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration, Advances in neural information processing systems, 31 (2018).
  • [17] J. Hensman, N. Durrande, and A. Solin, Variational Fourier features for Gaussian processes., J. Mach. Learn. Res., 18 (2017), pp. 5537–5588.
  • [18] J. Hensman, N. Fusi, and N. D. Lawrence, Gaussian processes for big data, in Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI’13, Arlington, Virginia, United States, 2013, AUAI Press, pp. 282–290, http://dl.acm.org/citation.cfm?id=3023638.3023667.
  • [19] T. J. Hughes, The finite element method: linear static and dynamic finite element analysis, Courier Corporation, 2012.
  • [20] G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang, Physics-informed machine learning, Nature Reviews Physics, 3 (2021), pp. 422–440.
  • [21] N. Krämer, J. Schmidt, and P. Hennig, Probabilistic numerical method of lines for time-dependent partial differential equations, arXiv preprint arXiv:2110.11847, (2021).
  • [22] B. Kulis and P. L. Bartlett, Implicit online learning, in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 575–582.
  • [23] M. Lázaro-Gredilla and A. Figueiras-Vidal, Inter-domain Gaussian processes for sparse inference using inducing features, Advances in Neural Information Processing Systems, 22 (2009).
  • [24] M. Lázaro-Gredilla, J. Quinonero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal, Sparse spectrum Gaussian process regression, The Journal of Machine Learning Research, 11 (2010), pp. 1865–1881.
  • [25] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar, Hyperband: A novel bandit-based approach to hyperparameter optimization, The Journal of Machine Learning Research, 18 (2017), pp. 6765–6816.
  • [26] H. Liu, Y. S. Ong, X. Shen, and J. Cai, When Gaussian process meets big data: A review of scalable GPs, IEEE transactions on neural networks and learning systems, 31 (2020), pp. 4405–4423.
  • [27] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis, Learning nonlinear operators via deeponet based on the universal approximation theorem of operators, Nature Machine Intelligence, 3 (2021), pp. 218–229.
  • [28] Z. Mao, A. D. Jagtap, and G. E. Karniadakis, Physics-informed neural networks for high-speed flows, Computer Methods in Applied Mechanics and Engineering, 360 (2020), p. 112789.
  • [29] R. Meng and X. Yang, Sparse Gaussian processes for solving nonlinear PDEs, Journal of Computational Physics, 490 (2023), p. 112340.
  • [30] J.-J. Moreau, Proximité et dualité dans un espace hilbertien, Bulletin de la Société mathématique de France, 93 (1965), pp. 273–299.
  • [31] C. Mou, X. Yang, and C. Zhou, Numerical methods for mean field games based on Gaussian processes and Fourier features, Journal of Computational Physics, (2022).
  • [32] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on optimization, 19 (2009), pp. 1574–1609.
  • [33] D. T. Nguyen, M. Filippone, and P. Michiardi, Exact Gaussian process regression with distributed computations, in Proceedings of the 34th ACM/SIGAPP symposium on applied computing, 2019, pp. 1286–1295.
  • [34] H. Owhadi and C. Scovel, Operator-Adapted Wavelets, Fast Solvers, and Numerical Homogenization: From a Game Theoretic Approach to Numerical Approximation and Algorithm Design, vol. 35, Cambridge University Press, 2019.
  • [35] A. Quarteroni and A. Valli, Numerical approximation of partial differential equations, vol. 23, Springer Science and Business Media, 2008.
  • [36] J. Quinonero-Candela and C. E. Rasmussen, A unifying view of sparse approximate Gaussian process regression, The Journal of Machine Learning Research, 6 (2005), pp. 1939–1959.
  • [37] A. Rahimi and B. Recht, Random features for large-scale kernel machines., in NIPS, vol. 3, Citeseer, 2007, p. 5.
  • [38] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Numerical Gaussian processes for time-dependent and nonlinear partial differential equations, SIAM Journal on Scientific Computing, 40 (2018), pp. A172–A198.
  • [39] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational physics, 378 (2019), pp. 686–707.
  • [40] R. T. Rockafellar, Convex analysis, vol. 18, Princeton University Press, 1970.
  • [41] F. Schäfer, M. Katzfuss, and H. Owhadi, Sparse Cholesky factorization by Kullback–Leibler minimization, SIAM Journal on scientific computing, 43 (2021), pp. A2019–A2046.
  • [42] S. Shalev-Shwartz, Y. Singer, and N. Srebro, Pegasos: Primal estimated sub-gradient solver for SVM, in Proceedings of the 24th international conference on Machine learning, 2007, pp. 807–814.
  • [43] S. Smale and D. Zhou, Shannon sampling II: Connections to learning theory, Applied and Computational Harmonic Analysis, 19 (2005), pp. 285–302.
  • [44] J. W. Thomas, Numerical partial differential equations: finite difference methods, vol. 22, Springer Science and Business Media, 2013.
  • [45] Q. Tran-Dinh, N. Pham, and L. Nguyen, Stochastic gauss-newton algorithms for nonconvex compositional optimization, in International Conference on Machine Learning, PMLR, 2020, pp. 9572–9582.
  • [46] J. Wang, J. Cockayne, O. Chkrebtii, T. J. Sullivan, and C. J. Oates, Bayesian numerical methods for nonlinear partial differential equations, Statistics and Computing, 31 (2021), pp. 1–20.
  • [47] K. Wang, G. Pleiss, J. Gardner, S. Tyree, K. Q. Weinberger, and A. G. Wilson, Exact Gaussian processes on a million data points, Advances in neural information processing systems, 32 (2019).
  • [48] A. Wilson and H. Nickisch, Kernel interpolation for scalable structured Gaussian processes (KISS-GP), in International conference on machine learning, PMLR, 2015, pp. 1775–1784.
  • [49] F. X. Yu, A. T. Suresh, K. M. Choromanski, D. N. Holtmann-Rice, and S. Kumar, Orthogonal random features, Advances in neural information processing systems, 29 (2016), pp. 1975–1983.
  • [50] M. Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, in Proceedings of the 20th international conference on machine learning (ICML-03), 2003, pp. 928–936.