A MINI-BATCH METHOD FOR SOLVING NONLINEAR PDES WITH GAUSSIAN PROCESSES
Abstract.
Gaussian process (GP) based methods for solving partial differential equations (PDEs) show great promise by bridging the gap between the theoretical rigor of traditional numerical algorithms and the flexible design of machine learning solvers. The main bottleneck of GP methods lies in the inversion of a covariance matrix, whose cost grows cubically with respect to the number of samples. Drawing inspiration from neural networks, we propose a mini-batch algorithm combined with GPs to solve nonlinear PDEs. A naive deployment of a stochastic gradient descent method for solving PDEs with GPs is challenging, as the objective function in the requisite minimization problem cannot be written as the expectation of a finite-dimensional random function. To address this issue, we apply a mini-batch method to the corresponding infinite-dimensional minimization problem over function spaces. The algorithm takes a mini-batch of samples at each step to update the GP model, so the computational cost is spread across the iterations. Using stability analysis and convexity arguments, we show that the mini-batch method steadily reduces a natural measure of errors towards zero at a rate determined by the number of iterations and the batch size.
1. Introduction
This work develops a mini-batch method based on stochastic proximal algorithms [12, 1, 2] to solve nonlinear partial differential equations (PDEs) with Gaussian processes (GPs). PDEs are widely used in science, economics, and biology [35, 44]. Since PDEs generally lack explicit solutions, numerical approximations are essential in applications. Standard numerical methods for solving nonlinear PDEs, for instance finite difference [44] and finite element methods [19], suffer from the curse of dimensionality. Recently, to keep pace with growing problem sizes, machine learning methods, including neural networks [39, 27, 20] and probabilistic algorithms [38, 21, 9], have attracted considerable attention. In particular, GPs offer a promising approach that integrates the rigorous principles of traditional numerical algorithms with the adaptable framework of machine learning solvers [38, 9, 46, 3]. Our method builds on the work of [9], which proposes a GP framework for solving general nonlinear PDEs. A numerical approximation of the solution to a nonlinear PDE is viewed as the maximum a posteriori (MAP) estimator of a GP conditioned on solving the PDE at a finite set of sample points. Equivalently, [9] proposes to approximate the unique strong solution of a nonlinear PDE in a reproducing kernel Hilbert space (RKHS) by a minimizer of the following optimal recovery problem
(1.1) |
where is a small regularization parameter, is the number of samples in the domain, encodes the PDE equation satisfied at the corresponding sample point, and represents the linear operators in the equation. Let , let be the kernel associated with , and let be the domain of the PDE. Denote by the action of on for fixed and by the action of on . By the representer theorem (see [34, Sec. 17.8]), a solution of (1.1) satisfies , where solves
(1.2) |
Then, [9] proposes to solve (1.2) by the Gauss–Newton method. We refer readers to [9, 29] for illustrative examples of how to transform a PDE solving task into the minimization problem described in (1.1). The computational cost of the above GP method grows cubically with respect to the number of samples due to the inversion of the covariance matrix. To improve the performance, building on [9], [31] proposes to approximate the solution of a PDE in a function space generated by random Fourier features [37, 49], where the corresponding covariance matrix is a low-rank approximation of the full one. In [29], another low-rank approximation method using sparse GPs is presented, employing a small number of inducing points to summarize observation information. Additionally, [10] introduces an algorithm that efficiently computes a sparse Cholesky factorization of the inverse Gram matrix. Despite their computational savings, low-rank approximation methods lack the scalability of neural network-based algorithms. The primary cause of this difference is that the number of parameters in GP/kernel methods scales with the number of collocation points, which in turn grows with the dimension. In contrast, the number of parameters in neural network models is designed to be independent of the number of data points.
Compared to methods based on neural networks [39, 28, 20], Gaussian process (GP)/kernel-based approaches generally offer solid theoretical foundations. Additionally, from a probabilistic viewpoint, GP regressions provide uncertainty quantification. This quantification allows us to establish confidence intervals for the results and aids in deciding the placement of collocation points. The exploration of how to effectively utilize uncertainty quantification will be a subject for future research. Nevertheless, the GP method typically incurs a computational cost that is cubic in the number of collocation points. On the other hand, besides their capacity to model a wide range of functions, neural network methods also excel in scalability, largely due to the use of stochastic gradient descent (SGD) algorithms with mini-batches. This scalability is facilitated by the fact that their parameters are independent of the number of data points. However, due to the global correlation of the unknowns in (1.2), the direct application of SGD is impeded by the need to invert the covariance matrix in equation (1.2). Consequently, devising an effective SGD approach is vital for the GP methodology to reach a level of scalability on par with neural networks. This paper addresses this issue by proposing a mini-batch method inspired by stochastic proximal methods [12, 1, 2]. Rather than pursuing low-rank approximations, we solve equation (1.1) stochastically. Each iteration selects a small subset of samples to update the GP model. Hence, at each step we only need to invert the covariance matrix associated with the samples in a mini-batch. More precisely, we introduce a slack variable and reformulate (1.1) as
(1.3) |
where is a convex compact set, , and . We note that as , the solutions of (1.3) converge to those of (1.1). The latent variable encodes observations of . Next, we view (1.3) as a stochastic optimization problem
(1.4) |
where is a random sample taken uniformly from the whole set of samples. Then, we iteratively solve (1.4) by computing the proximal map
(1.5) |
where contains a mini-batch of samples, is the cardinality of the mini-batch, and is the step size at the current step. Denote by the vector containing the elements of a vector indexed by the given index set. Then, to handle the infinite-dimensional minimization problem in (1.5), we derive a representer formula for (1.5) in Section 2 such that
(1.6) |
where is the identity matrix, , and solves
(1.7) |
Thus, the computational cost of each iteration is cubic in the batch size if we use the standard Cholesky decomposition to compute the inverse of the covariance matrix when solving (1.6) and (1.7). Under boundedness assumptions on the linear operators and weak convexity assumptions on the functions involved, our convergence analysis shows that the error diminishes at a rate determined by the number of iterations and the batch size. (For clarification, the computational complexity just mentioned refers to the inner minimization problem in (1.5), whereas the convergence rate concerns the outer iteration.)
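As a concrete illustration of this cost reduction, the sketch below assembles and factorizes only the m-by-m covariance matrix of a mini-batch via a Cholesky decomposition; the Gaussian kernel, the nugget eta, and all variable names are illustrative assumptions rather than the exact quantities appearing in (1.6) and (1.7).

```python
# A minimal sketch (not the paper's exact formulas): the only matrix factorized
# per iteration is the m x m mini-batch covariance matrix, so the linear-algebra
# cost is O(m^3), independent of the total number of samples.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gaussian_kernel(X, Y, lengthscale=0.2):
    # Pairwise Gaussian kernel matrix: K[i, j] = exp(-||x_i - y_j||^2 / (2 l^2)).
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))

def batch_solve(batch_points, rhs, eta=1e-8):
    """Solve (K_B + eta * I) w = rhs on a mini-batch of m points."""
    m = batch_points.shape[0]
    K_B = gaussian_kernel(batch_points, batch_points)
    factor = cho_factor(K_B + eta * np.eye(m))  # O(m^3) Cholesky factorization
    return cho_solve(factor, rhs)               # O(m^2) triangular solves
```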
To illustrate our contributions, we compare our mini-batch approach with closely related mini-batch methods in the literature. Notably, the studies [7, 8] introduce SGD techniques for hyperparameter learning in GP regressions. A key distinction of our method lies in its treatment of the minimization over the latent vector, as outlined in (1.2). In contrast, [7, 8] focus on optimizing kernel hyperparameters for given data, without considering this minimization. In alignment with our theoretical findings, the studies [7, 8] similarly identify errors comprising two main components: an optimization error term and a statistical error term, determined respectively by the number of optimization iterations and the mini-batch size. Consequently, with a fixed batch size, it becomes challenging to achieve zero error as the number of optimization steps grows. A key reason that we cannot eliminate the statistical error is the exclusion of a proximal step for the latent vector, as seen in (1.5). If we were to integrate such a proximal step, it would require maintaining all mini-batch samples and making the minimizer at each iteration depend on the expressions of all previous minimizers. This would lead to significantly higher memory demands (see also Remark 3.7). Exploring ways to reduce the statistical error while balancing memory efficiency remains an area for future investigation.
Furthermore, [45] develops a stochastic Gauss–Newton algorithm for non-convex compositional optimization. However, their approach is not applicable in our context due to the quadratic nature of (1.2), which is incompatible with their algorithm's requirement of an expectation-based formulation in finite-dimensional settings. This is evident as the quadratic term in Equation (1.2) cannot be framed as an expectation over the samples. Our method instead employs a mini-batch approach that solves the infinite-dimensional minimization problem in (1.1).
Another subtle difference lies in our methodological approach compared to that of [45]. While they initiate their process by linearizing the objective function and then applying mini-batch updates to this linearized problem, we start by formulating a nonlinear proximal minimization step in each iteration, which we solve primarily with the Gauss–Newton method. It is important to note, however, that the Gauss–Newton method in our framework can be substituted with any other efficient optimization algorithm suitable for the proximal iteration step.
We organize the paper as follows. Section 2 summarizes the GP method in [9] and proposes a mini-batch method to solve PDEs in a general setting. In Section 3, we extend the arguments in the framework of [12, 6, 13] and present a convergence analysis of the mini-batch algorithm based on stability analysis and convexity arguments. Numerical experiments appear in Section 4. Further discussions and future work appear in Section 5.
Notations A0.
Given an index set , we denote by the cardinality of . For a real-valued vector , we represent by the Euclidean norm of and by its transpose. Let or be the inner product of vectors and . For a normed vector space , let be the norm of . Let be a Banach space associated with a quadratic norm and let be the dual of . Denote by the duality pairing between and , and by the inner product of . Suppose that there exists a linear, bijective, symmetric (), and positive ( for ) covariance operator , such that . Let be a set of elements in and let be in the product space . Then, for , we write . Let . Furthermore, for , , let be the matrix with entries . Finally, we represent by a positive real number whose value may change from line to line.
2. GP Regressions with Nonlinear Measurements
For ease of presentation, in this section, we introduce an iterative mini-batch method for solving nonlinear PDEs in the general setting of GP regressions with nonlinear measurements. We first summarize the GP method proposed in [9] in Subsection 2.1. Then, in Subsection 2.2, we detail the mini-batch method and derive a representer theorem, from which we see that the mini-batch method distributes the computational burden across the iterations.
2.1. A Revisit of The GP Method
Here, we summarize the general framework of the GP method [9] for solving nonlinear PDEs. Let be an open bounded domain with the boundary for . Let be bounded linear operators for , where . We solve
(2.1) |
where is the unknown function, and represent nonlinear operators, and , are given data. Throughout this paper, we suppose that (2.1) admits a unique strong solution in an RKHS associated with the covariance operator .
To approximate in , we first take sample points in such that and for . Then, we define
Let be the vector concatenating the functionals for fixed and define
(2.2) |
Next, we define the data vector by
Furthermore, we define the nonlinear map
Then, given a regularization parameter , the GP method [9] approximates by the minimizer of the following optimization problem
(2.3) |
The authors of [9] show that, as the regularization parameter vanishes and the fill distance of the samples decreases, a minimizer of (2.3) converges to the unique strong solution of (2.1) (see Sec. 3 of [9]).
2.2. GP Regressions with Nonlinear Measurements
In this subsection, we propose a mini-batch method to solve (2.3). For ease of presentation, we abstract (2.3) into a general framework by considering a GP problem with nonlinear observations. More precisely, let and , where for some and . For an unknown function and each , there exists a function such that we get a noisy observation on . To approximate based on the nonlinear observations, we solve
(2.4) |
where . Hence, (2.4) resembles (2.3) in such a way that corresponds to the PDE equation satisfied by at the sample point.
As mentioned before, a typical algorithm using the Cholesky decomposition to compute the inverse of the corresponding covariance matrix incurs a cost that is cubic in the total number of samples. Here, we introduce an algorithm to solve (2.4) using mini-batches such that the dimensions of the matrices to be inverted depend only on the sizes of the mini-batches.
First, we reformulate a relaxed version of (2.4). More precisely, we introduce slack variables , a regularization parameter , and consider the following regularized optimal recovery problem
(2.5) |
where
(2.6) |
is a convex compact subset of , , and . We further assume that is the product space of a set of convex compact subsets , where .
Remark 2.1.
Restricting the latent variable to a compact set is not a severe limitation. Indeed, we can reformulate the unconstrained minimization problem
(2.7) |
into the formulation of (2.5) by introducing a large enough compact set . More precisely, let be the minimizer of (2.7). We have
which implies that and are bounded if . Hence, if are bounded, (2.7) and (2.5) are equivalent for large . Thus, we mainly focus on (2.5) to account for situations in which the latent variable is constrained.
Next, we introduce a mini-batch version of defined in (2.5). For with , we define
(2.8) |
In particular, when and for , we denote
Then, we detail the mini-batch method in Algorithm 1. Given an initial guess and the total number of iterations, at each step we update the iterate by solving the minimization problem (2.10). Finally, we sample an index uniformly from the iteration history and output the corresponding iterate as an approximation to a solution of (2.5).
Remark 2.2.
In Algorithm 1, we introduce the last step (2.11) for error analysis. In practice, we do not need to solve (2.11) directly, since computing (2.11) incurs the same computational cost as calculating (2.4). Instead, if we are interested in the value of the approximate solution at a test point, we can take a mini-batch containing training samples in the neighborhood of that point, choose the parameters as in Algorithm 1, compute the corresponding minimizer, and evaluate it at the test point. An alternative is to include the test points of interest in the training samples and solve (2.5) using the mini-batch method. Then, the latent vector contains sufficient information at the test points.
Remark 2.3.
Algorithm 1 can be readily extended to incorporate a preconditioned variant. Specifically, let be a positive definite matrix associated with the iteration. The update step in Equation (2.10) is modified as follows:
(2.9) |
where the norm is defined by . This preconditioned approach maintains the theoretical foundations of Algorithm 1, given that the minimum eigenvalue of meets the analogous conditions set by in (2.10) for each iteration . This adaptation offers enhanced flexibility and potential efficiency improvements in various computational scenarios.
(2.10) |
(2.11) |
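The overall structure of Algorithm 1 can be summarized by the following schematic sketch; the helper solve_prox stands in for the proximal subproblem (2.10) (solved, e.g., via the representer formula of Lemma 2.5 together with a Gauss–Newton inner loop), and the step-size schedule shown is an assumption for illustration, not the paper's prescribed choice.

```python
# A schematic sketch of the mini-batch loop; solve_prox is a hypothetical helper
# that solves the small proximal subproblem on the current mini-batch.
import numpy as np

def minibatch_gp_loop(M, batch_size, num_iters, solve_prox, rng=None):
    """M: total number of samples; solve_prox(batch_idx, z_batch, step_size)
    returns updated latent values on the mini-batch."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.zeros(M)  # latent observations of the linear operators at the samples
    iterates = []
    for k in range(num_iters):
        batch = rng.choice(M, size=batch_size, replace=False)
        alpha_k = 1.0 / np.sqrt(k + 1)  # illustrative step-size schedule
        z[batch] = solve_prox(batch, z[batch], alpha_k)
        iterates.append(z.copy())
    # Output an iterate sampled uniformly from the history, as in Algorithm 1.
    return iterates[rng.integers(num_iters)]
```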
Though the minimization problem in (2.10) is infinite-dimensional, we can reformulate (2.10) into a finite-dimensional minimization problem by using the framework developed in [43]. The following lemma resembles Lemma 3.3 in [29] and establishes properties of a sampling operator, which are crucial for us to derive a representer formula for (2.10). We omit the proof since the arguments are the same as those for Lemma 3.3 of [29]. For simplicity, we denote by the identity map or the identity matrix, whose meaning is easily recognizable from the context.
Lemma 2.4.
Let and . Let be as in (2.4) and let be the subset of indexed by . Denote by the set of sequences with the inner product . Define the sampling operator by . Let be the adjoint of . Then, for each , . Besides, for any , and are bijective. Meanwhile, for any , we have
(2.12) |
Furthermore,
(2.13) |
The next lemma provides a representer formula for (2.10). The arguments are similar to those for Theorem 3.4 of [29], which we postpone to Appendix A. We recall that for an index set and a vector , represents the vector consisting of elements of indexed by .
Lemma 2.5.
The above lemma establishes the solvability of (2.10) and implies that in each step of Algorithm 1, the sizes of the matrices to be inverted depend only on the batch size. Thus, the computational cost of each step is cubic in the batch size if we use the standard Cholesky decomposition. In numerical implementations for solving PDEs, an alternative way to deal with the PDE constraints in (2.8) at each step is to use the technique of eliminating variables (see Subsection 3.3 of [9] for details), which further reduces the number of unknowns.
Remark 2.6.
In numerical experiments, we regularize the covariance term with a nugget, where the nugget parameter is a small real number and the added matrix is block diagonal, constructed using the method described in Sec. 3.4 of [9]. Using block diagonal matrices allows us to choose smaller nuggets and obtain more accurate results.
3. The Convergence Analysis
In this section, we study the convergence of Algorithm 1. According to (2.15), at each step of Algorithm 1, the iterate is expressed as a linear combination involving only the linear operators corresponding to a mini-batch, which is not expressive enough. Thus, at the final step, we do not expect this iterate to approximate a minimizer of (2.5) very well. Instead, we study the accuracy of the latent sequence generated by Algorithm 1, which contains the approximate values of the linear operators applied to the true solution, and we study the convergence of the output in (2.11), which uses information from all the linear operators.
We first summarize some basic notions of convex analysis in Subsection 3.1. Then, we introduce the stability of Algorithm 1 in Subsection 3.2. The convergence analysis follows in Subsection 3.3. The arguments are based on the framework of [12] and stability analysis [6]. Similar arguments are used in [13] to analyze the convergence of algorithms for general weakly convex optimization problems.
3.1. Prerequisites
For a non-empty, compact, and convex set for some , the normal cone to at is defined by . Given , a function is -weakly convex if the map is convex and is -strongly convex if is convex. We recall that the proximal map of is given by , and that the Moreau envelope [30] is defined as . Standard results ([30] and [40, Theorem 31.5]) imply that if is -weakly convex and , the envelope is smooth with the gradient given by . Let . Then, the definition of the Moreau envelope implies that . Hence, a point with a small gradient lies in a neighborhood of a nearly stationary point .
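For the reader's convenience, the standard definitions invoked above can be written as follows; the symbols g, ρ, and λ are generic placeholders and may differ from the paper's notation.

```latex
% rho-weak and rho-strong convexity (generic notation):
g \ \text{is } \rho\text{-weakly convex} \iff x \mapsto g(x) + \tfrac{\rho}{2}\|x\|^2 \ \text{is convex},
\qquad
g \ \text{is } \rho\text{-strongly convex} \iff x \mapsto g(x) - \tfrac{\rho}{2}\|x\|^2 \ \text{is convex}.
% Proximal map and Moreau envelope with parameter lambda > 0:
\operatorname{prox}_{\lambda g}(x) = \operatorname*{arg\,min}_{y}\Big\{ g(y) + \tfrac{1}{2\lambda}\|y-x\|^2 \Big\},
\qquad
g_{\lambda}(x) = \min_{y}\Big\{ g(y) + \tfrac{1}{2\lambda}\|y-x\|^2 \Big\}.
% If g is rho-weakly convex and lambda < 1/rho, then g_lambda is differentiable with
\nabla g_{\lambda}(x) = \tfrac{1}{\lambda}\big(x - \operatorname{prox}_{\lambda g}(x)\big).
```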
The following lemma establishes a crucial inequality for the convergence analysis.
Lemma 3.1.
Assume that is -strongly convex in and -weakly convex in . Let and . Then, for any , we have
(3.1) |
3.2. Stability
In this subsection, we define the notion of stability for (2.10) in a general setting. Let be a set of i.i.d. samples from , where and . Let . We replace with an i.i.d. copy and define . Denote . Let be a random sample from , let be a stochastic model, and let . Given , we define an operator by
(3.4) |
Given , let . Then, we say that the operator is -stable for some if . The stability measures the sensitivity of the output to the perturbation of a single sample. The next lemma shows that if the operator is -stable, the error of the mini-batch approximation is bounded in expectation. The arguments are similar to those of Lemma 11 of [6] and Theorem A.3 of [13].
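One common way to express the ε-stability requirement described above is the following inequality; the notation (the operator A, the sample set S, and its single-sample replacement S^{(i)}) is generic and may not match the paper's exact definition.

```latex
% A generic epsilon-stability condition (notation assumed, not necessarily the paper's exact form):
\sup_{1 \le i \le M} \ \mathbb{E}\Big[\, \big\| A(S) - A\big(S^{(i)}\big) \big\| \,\Big] \ \le \ \varepsilon,
% where S^{(i)} denotes the sample set S with its i-th element replaced by an i.i.d. copy.
```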
Lemma 3.2.
Let be a uniform sample from . For a stochastic model , let be as in (3.4) given . Suppose that is -stable. Let for . Then, we have
(3.5) |
3.3. Convergence
In this subsection, we study the convergence of Algorithm 1. We first assume that all the linear operators in (2.6) are bounded.
Assumption 1.
Let be the linear operators in (2.6). Assume that there exists a constant such that .
Remark 3.3.
The boundedness imposed on the linear operators is not restrictive if the kernel associated with the space is smooth enough. Indeed, let , where . Then, there exists a constant such that
where we use the reproducing property in the third equality, the Cauchy–Schwarz inequality in the first inequality, and the regularity of the kernel in the final inequality. Similar arguments hold when the functionals involve other differential operators.
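As a representative instance of this computation, consider a functional of the form φ = δ_x ∘ Δ acting on u in the RKHS; the following chain (with generic notation) records the reproducing-property and Cauchy–Schwarz steps mentioned above.

```latex
% A representative bound for phi = delta_x \circ \Delta (generic notation):
|[\varphi, u]| = |\Delta u(x)|
 = \big|\langle u, \Delta_{x} K(x,\cdot)\rangle_{\mathcal{U}}\big|
 \le \|u\|_{\mathcal{U}}\,\big\|\Delta_{x} K(x,\cdot)\big\|_{\mathcal{U}}
 = \|u\|_{\mathcal{U}}\,\sqrt{\Delta_{x}\Delta_{x'} K(x,x')\big|_{x'=x}},
% which is finite whenever the kernel K is sufficiently smooth.
```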
Furthermore, we impose the following assumption on the regularity of in .
Assumption 2.
Let be the functions in (2.6). For any , we assume that is continuously differentiable. Moreover, there exist constants and such that for and
(3.7) |
With the boundedness of the second derivative of at hand, we are able to show that in (2.6) and in (2.8) are weakly convex. The proof of the following lemma is similar to the argument of Lemma 4.2 of [14]. Recall that is a convex compact set and is the product space of a set of convex compact sets .
Lemma 3.4.
Proof.
For , let and . Using Assumption 2, we obtain
which implies that is -weakly convex. Thus, for , we have
Hence, is also -weakly convex. ∎
Lemma 3.4 directly implies the weak convexity of the objective defined in (2.8), which is stated in the following corollary.
Corollary 3.5.
Next, we use the stability analysis and the weak convexity property to study the convergence of Algorithm 1. The following lemma establishes the stability of each iteration of Algorithm 1. The proof appears in Appendix A.
Lemma 3.6.
Remark 3.7.
Lemma 3.6 demonstrates the critical distinction between the mini-batch method delineated in this paper and those explored in [15, 12, 13]. In our approach, the regularization parameter is typically small, rendering the relevant constant essentially fixed and independent of the step size. This aspect differs significantly from the methodologies in [15, 12, 13], where increasing the corresponding parameter with the number of iterations enables a reduction in the statistical error. A notable difference in our method is the omission of a proximal step over the latent variable in (2.10). Incorporating such a proximal step would necessitate retaining all mini-batch samples and making the minimizer at each step depend on the formulations of the preceding minimizers, thereby substantially elevating memory requirements. Addressing the challenge of reducing the statistical error while managing memory efficiency is a subject for future research.
Next, we study the descent property of Algorithm 1. Henceforth, we use the notation to represent the expectation conditioned on all the randomness up to the step in Algorithm 1. The proof appears in Appendix A.
Proposition 3.8.
Before stating the main theorem, we record the following corollary, a direct byproduct of Lemma 2.5, which implies that the function in (3.11) is differentiable.
Corollary 3.9.
Proof.
Next, we have our main theorem about the convergence of Algorithm 1 in the setting of nonlinear measurements.
Theorem 3.10.
Suppose that Assumptions 1 and 2 hold. Let . In Algorithm 1, let for some . In each iteration, let for all , where . Let be the sequence generated by Algorithm 1, let be an index uniformly sampled from , and let be defined as in (3.11) for . Then, there exists a constant depending on such that
(3.14) |
with , where is defined as in (3.10).
Proof.
Taking , we define as in (3.10). By Corollary 3.9, we get . Thus, using Proposition 3.8, we obtain
(3.15) |
where is given in (3.9). Summing (3.15) from to , taking the expectation over all the randomness, and dividing the resulting inequality by , we get
(3.16) |
Therefore, using the definition of and Jensen's inequality, we conclude from (3.16) and (3.9) that (3.14) holds. ∎
Theorem 3.10 shows that the output tends to a neighborhood of a nearly stationary point in expectation as the number of iterations and the batch size increase. The bound provided in (3.14) demonstrates that, for a fixed batch size, the error does not diminish to zero as the number of iterations increases. Conversely, the accuracy improves as the size of the mini-batches increases, a fact supported by the numerical experiments discussed in the following section.
4. Numerical Experiments
In this section, we demonstrate the efficacy of the mini-batch method by solving a nonlinear elliptic equation in Subsection 4.1 and Burgers' equation in Subsection 4.2. Our implementation uses Python with the JAX package for automatic differentiation. Our experiments are conducted only on CPUs. To better leverage the information shared between points in each mini-batch, we take mini-batches of nearby samples instead of sampling mini-batches uniformly. A mini-batch consists of a uniformly randomly sampled point together with its nearest neighbors. We use a k-d tree to support this search, which efficiently finds the nearest neighbors of each point in the whole data set.
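A minimal sketch of this neighbor-based mini-batch selection, using SciPy's k-d tree, is shown below; the variable names and the convention of building the tree once and reusing it per query are illustrative assumptions, not the exact implementation.

```python
# A minimal sketch of neighbor-based mini-batch selection with a k-d tree.
import numpy as np
from scipy.spatial import cKDTree

def sample_neighbor_minibatch(points, batch_size, tree=None, rng=None):
    """Pick one collocation point uniformly at random and return the indices of
    that point together with its nearest neighbors (batch_size points in total)."""
    rng = np.random.default_rng() if rng is None else rng
    tree = cKDTree(points) if tree is None else tree  # build once, reuse per query
    center = rng.integers(points.shape[0])
    _, idx = tree.query(points[center], k=batch_size)
    return np.atleast_1d(idx)
```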
4.1. Solving A Nonlinear Elliptic PDE
In this example, we solve a nonlinear elliptic equation using the preconditioned mini-batch method described in Remark 2.3. The chosen preconditioner is a diagonal matrix whose diagonal elements are the diagonal entries of the covariance matrix restricted to each set of mini-batch points. Let and . Given a continuous function , we seek the solution of
(4.1) |
We prescribe the true solution and compute the right-hand side of the first equation of (4.1) accordingly. For the regression, we choose the Gaussian kernel with a fixed lengthscale parameter. We take samples in the domain, a portion of which are interior points. Meanwhile, we set the nugget as described in Remark 2.6. The algorithm stops after a fixed number of iterations. At each iteration, we use the technique of eliminating variables (see Subsection 3.3 in [9]) and solve the resulting nonlinear minimization problem using the Gauss–Newton method, which terminates once the error between two successive steps falls below a prescribed tolerance or the maximum number of iterations is reached.
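Since the implementation relies on JAX's automatic differentiation, the following sketch shows one way entries involving the Laplacian of the Gaussian kernel can be computed; the lengthscale value and function names are assumptions for illustration, not the paper's exact code.

```python
# A minimal JAX sketch: Gaussian kernel and its Laplacian in the first argument,
# obtained via automatic differentiation (names and lengthscale are illustrative).
import jax
import jax.numpy as jnp

def gaussian_kernel(x, y, lengthscale=0.2):
    # k(x, y) = exp(-||x - y||^2 / (2 l^2)) for points x, y in R^2.
    return jnp.exp(-jnp.sum((x - y) ** 2) / (2.0 * lengthscale ** 2))

def laplacian_in_x(kernel):
    # Return a function computing the Laplacian of kernel(x, y) with respect to x.
    hess = jax.hessian(kernel, argnums=0)
    return lambda x, y: jnp.trace(hess(x, y))

x = jnp.array([0.3, 0.7])
y = jnp.array([0.5, 0.5])
print(gaussian_kernel(x, y), laplacian_in_x(gaussian_kernel)(x, y))
```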
Figure 1 shows the results of one realization of the mini-batch GP method with a batch size of 64. Figure 1a shows the samples taken in the domain and the contour of the true solution. Figure 1b depicts the reduction in loss for the mini-batch GP method during a single run, where the loss is averaged over the most recent 10 iterations of the algorithm. The explicit solution is given in Figure 1c, and Figure 1d plots its numerical approximation. The contour of the pointwise errors between the true solution and the approximation is given in Figure 1e. Figure 1f presents the contour of errors for the GP method of [9] using the same full sample set. We observe that the mini-batch approach with a batch size of 64 achieves accuracy comparable to that of the GP method using the entire sample set.
4.2. Burgers’ Equation
In this subsection, we consider Burgers’ equation
where and is unknown. We take points uniformly in the space-time domain, of which points are in the domain’s interior. As in [9], we choose the anisotropic kernel
(4.2) |
with fixed lengthscale parameters. To compute pointwise errors, as in [9], we compute the true solution using the Cole–Hopf transformation and numerical quadrature. We use the technique of eliminating constraints at each iteration (see Subsection 3.3 in [9]) and use the Gauss–Newton method to solve the resulting nonlinear quadratic minimization problem, which terminates once the error between two successive steps falls below a prescribed tolerance or the maximum number of iterations is reached. Throughout the experiments, we use a nugget as described in Remark 2.6.
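For reference, one common form of such an anisotropic space-time Gaussian kernel is recorded below; the lengthscales σ_1 and σ_2 stand in for the (elided) parameter values above, and the exact normalization in (4.2) may differ.

```latex
% One common anisotropic Gaussian kernel in space-time (normalization may differ from (4.2)):
K\big((t,x),(t',x')\big) = \exp\!\Big(-\frac{(t-t')^2}{2\sigma_1^2} - \frac{(x-x')^2}{2\sigma_2^2}\Big).
```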
Figure 2 shows the results of one realization of the mini-batch GP method with a fixed batch size. Figure 2a shows the samples in the domain and the contour of the true solution. Figure 2b illustrates the history of the loss generated by the mini-batch GP method in one realization. The explicit solution is plotted in Figure 2c. Figure 2d shows its numerical approximation. The contour of the pointwise errors between the true solution and the approximation is presented in Figure 2e.
Then, we run the algorithm multiple times for several batch sizes and plot the results in Figure 3. The averaged loss histories are plotted in Figure 3a, from which we see that increasing the batch size results in faster convergence and smaller losses. The contours of the averaged pointwise errors are presented in Figures 3b-3d, which show that a larger batch size yields better accuracy.
In our observations regarding the solution of Burgers' equation, it becomes evident that a larger number of samples per mini-batch is required compared to the nonlinear elliptic example. This disparity could be attributed to the relatively lower regularity of the solution of Burgers' equation, for which uniformly distributed samples may not adequately resolve the regions of low regularity. Consequently, we defer the exploration of more effective strategies for sampling collocation points and selecting mini-batch points to future investigations.
5. Conclusions
This paper presents a mini-batch framework to solve nonlinear PDEs with GPs. The mini-batch method allocates the computational cost of GPs to each iterative step. We also perform a rigorous convergence analysis of the mini-batch method. For future work, extending our framework to more general GP regression problems like semi-supervised learning and hyperparameter learning would be interesting. In the numerical experiments, we notice that the choice of the positions of mini-batch samples influences the performance. Hence, a potential future work is investigating different sampling techniques to take mini-batch sample points.
Acknowledgement A0.
The authors wish to thank Professor Andrew Stuart for his insightful discussions on the convergence of the stochastic gradient descent method.
A Proofs of Results
Proof of Lemma 2.5.
Let . Define the sampling operator as in Lemma 2.4. Then, we have
(A.1) |
Next, we fix and let . Then,
which yields
(A.2) |
Thus, using (A.1) and (A.2), we have
(A.3) |
where the last equality results from (2.13). Therefore, (A.2), (2.12), and (A.3) imply that the minimizer in (2.14) satisfies (2.15) and minimizes (2.16). ∎
Proof of Lemma 3.6.
Let and be as in the beginning of this subsection. For , let
(A.4) |
For , by the optimality condition of (A.4), there exists a constant depending on , , the diameter of , and the constant in Assumption 2, such that
(A.5) |
Thus, using (A.5) and Assumption 1, there exists a constant depending on , , , , and the constant in Assumption 1, such that . Let , where is large enough such that contains in (A.4) for any . Next, we show that for any , is Lipschitz in . Let . Then, there exists a constant depending on , and , such that
(A.6) |
where (A.6) follows by the definition of and Assumptions 1 and 2.
By Jensen’s inequality, we obtain
(A.7) |
where (A.7) follows by (A.6). Using (A.4), Corollary 3.5, and Lemma 3.1, we get
(A.8) | ||||
(A.9) |
Adding (A.8) and (A.9) together and using (A.6), we obtain
which, together with the fact that , yields
(A.10) |
Hence, (A.7) and (A.10) imply that is -stable with given in (3.9). ∎
Proof of Proposition 3.8.
Let be a mini-batch of indices in the step of Algorithm 1. According to Corollary 3.5, is -weakly convex in and strongly convex in . Then, by Lemma 3.1 and (2.10), for any and , we have
(A.11) |
(A.12) |
where is given in (3.9) with . For , let be given in (3.10). Then, we have
(A.13) |
Plugging into (A.11), taking the expectation, using (A.12) and (A.13), and using Lemma 3.6, we obtain
(A.14) |
For , using the definition of and (A.14), we have
which concludes (3.8) after rearrangement. ∎
References
- [1] H. Asi and J. C. Duchi, The importance of better models in stochastic optimization, Proceedings of the National Academy of Sciences, 116 (2019), pp. 22924–22930.
- [2] H. Asi and J. C. Duchi, Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity, SIAM Journal on Optimization, 29 (2019), pp. 2257–2290.
- [3] P. Batlle, Y. Chen, B. Hosseini, H. Owhadi, and A. M. Stuart, Error analysis of kernel/GP methods for nonlinear and parametric PDEs, arXiv preprint arXiv:2305.04962, (2023).
- [4] D. P. Bertsekas, Incremental proximal methods for large scale convex optimization, Mathematical Programming, 129 (2011), pp. 163–195.
- [5] L. Bottou and O. Bousquet, The tradeoffs of large scale learning, Advances in neural information processing systems, 20 (2007).
- [6] O. Bousquet and A. Elisseeff, Stability and generalization, The Journal of Machine Learning Research, 2 (2002), pp. 499–526.
- [7] H. Chen, L. Zheng, R. Al Kontar, and G. Raskutti, Stochastic gradient descent in correlated settings: A study on Gaussian processes, Advances in neural information processing systems, 33 (2020), pp. 2722–2733.
- [8] H. Chen, L. Zheng, R. Al Kontar, and G. Raskutti, Gaussian process parameter estimation using mini-batch stochastic gradient descent: convergence guarantees and empirical benefits, The Journal of Machine Learning Research, 23 (2022), pp. 10298–10356.
- [9] Y. Chen, B. Hosseini, H. Owhadi, and A. M. Stuart, Solving and learning nonlinear PDEs with Gaussian processes, Journal of Computational Physics, (2021).
- [10] Y. Chen, H. Owhadi, and F. Schäfer, Sparse Cholesky factorization for solving nonlinear PDEs via Gaussian processes, arXiv preprint arXiv:2304.01294, (2023).
- [11] A. C. Damianou, M. K. Titsias, and N. D. Lawrence, Variational inference for latent variables and uncertain inputs in Gaussian processes, (2016).
- [12] D. Davis and D. Drusvyatskiy, Stochastic model-based minimization of weakly convex functions, SIAM Journal on Optimization, 29 (2019), pp. 207–239.
- [13] Q. Deng and W. Gao, Minibatch and momentum model-based methods for stochastic weakly convex optimization, Advances in Neural Information Processing Systems, 34 (2021), pp. 23115–23127.
- [14] D. Drusvyatskiy and C. Paquette, Efficiency of minimizing compositions of convex functions and smooth maps, Mathematical Programming, 178 (2019), pp. 503–558.
- [15] J. C. Duchi and F. Ruan, Stochastic methods for composite and weakly convex optimization problems, SIAM Journal on Optimization, 28 (2018), pp. 3229–3259.
- [16] J. Gardner, G. Pleiss, K. Q. Weinberger, D. Bindel, and A. G. Wilson, GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration, Advances in neural information processing systems, 31 (2018).
- [17] J. Hensman, N. Durrande, and A. Solin, Variational Fourier features for Gaussian processes., J. Mach. Learn. Res., 18 (2017), pp. 5537–5588.
- [18] J. Hensman, N. Fusi, and N. D. Lawrence, Gaussian processes for big data, in Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI’13, Arlington, Virginia, United States, 2013, AUAI Press, pp. 282–290, http://dl.acm.org/citation.cfm?id=3023638.3023667.
- [19] T. J. Hughes, The finite element method: linear static and dynamic finite element analysis, Courier Corporation, 2012.
- [20] G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang, Physics-informed machine learning, Nature Reviews Physics, 3 (2021), pp. 422–440.
- [21] N. Krämer, J. Schmidt, and P. Hennig, Probabilistic numerical method of lines for time-dependent partial differential equations, arXiv preprint arXiv:2110.11847, (2021).
- [22] B. Kulis and P. L. Bartlett, Implicit online learning, in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 575–582.
- [23] M. Lázaro-Gredilla and A. Figueiras-Vidal, Inter-domain Gaussian processes for sparse inference using inducing features, Advances in Neural Information Processing Systems, 22 (2009).
- [24] M. Lázaro-Gredilla, J. Quinonero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal, Sparse spectrum Gaussian process regression, The Journal of Machine Learning Research, 11 (2010), pp. 1865–1881.
- [25] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar, Hyperband: A novel bandit-based approach to hyperparameter optimization, The Journal of Machine Learning Research, 18 (2017), pp. 6765–6816.
- [26] H. Liu, Y. S. Ong, X. Shen, and J. Cai, When Gaussian process meets big data: A review of scalable GPs, IEEE Transactions on Neural Networks and Learning Systems, 31 (2020), pp. 4405–4423.
- [27] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis, Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators, Nature Machine Intelligence, 3 (2021), pp. 218–229.
- [28] Z. Mao, A. D. Jagtap, and G. E. Karniadakis, Physics-informed neural networks for high-speed flows, Computer Methods in Applied Mechanics and Engineering, 360 (2020), p. 112789.
- [29] R. Meng and X. Yang, Sparse Gaussian processes for solving nonlinear PDEs, Journal of Computational Physics, 490 (2023), p. 112340.
- [30] J.-J. Moreau, Proximité et dualité dans un espace hilbertien, Bulletin de la Société mathématique de France, 93 (1965), pp. 273–299.
- [31] C. Mou, X. Yang, and C. Zhou, Numerical methods for mean field games based on Gaussian processes and Fourier features, Journal of Computational Physics, (2022).
- [32] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19 (2009), pp. 1574–1609.
- [33] D. T. Nguyen, M. Filippone, and P. Michiardi, Exact Gaussian process regression with distributed computations, in Proceedings of the 34th ACM/SIGAPP symposium on applied computing, 2019, pp. 1286–1295.
- [34] H. Owhadi and C. Scovel, Operator-Adapted Wavelets, Fast Solvers, and Numerical Homogenization: From a Game Theoretic Approach to Numerical Approximation and Algorithm Design, vol. 35, Cambridge University Press, 2019.
- [35] A. Quarteroni and A. Valli, Numerical approximation of partial differential equations, vol. 23, Springer Science and Business Media, 2008.
- [36] J. Quinonero-Candela and C. E. Rasmussen, A unifying view of sparse approximate Gaussian process regression, The Journal of Machine Learning Research, 6 (2005), pp. 1939–1959.
- [37] A. Rahimi and B. Recht, Random features for large-scale kernel machines., in NIPS, vol. 3, Citeseer, 2007, p. 5.
- [38] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Numerical Gaussian processes for time-dependent and nonlinear partial differential equations, SIAM Journal on Scientific Computing, 40 (2018), pp. A172–A198.
- [39] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics, 378 (2019), pp. 686–707.
- [40] R. Rockafellar, Convex analysis, vol. 18, Princeton university press, 1970.
- [41] F. Schäfer, M. Katzfuss, and H. Owhadi, Sparse Cholesky factorization by Kullback–Leibler minimization, SIAM Journal on Scientific Computing, 43 (2021), pp. A2019–A2046.
- [42] S. Shalev-Shwartz, Y. Singer, and N. Srebro, Pegasos: Primal estimated sub-gradient solver for SVM, in Proceedings of the 24th international conference on Machine learning, 2007, pp. 807–814.
- [43] S. Smale and D. Zhou, Shannon sampling II: Connections to learning theory, Applied and Computational Harmonic Analysis, 19 (2005), pp. 285–302.
- [44] J. W. Thomas, Numerical partial differential equations: finite difference methods, vol. 22, Springer Science and Business Media, 2013.
- [45] Q. Tran-Dinh, N. Pham, and L. Nguyen, Stochastic Gauss-Newton algorithms for nonconvex compositional optimization, in International Conference on Machine Learning, PMLR, 2020, pp. 9572–9582.
- [46] J. Wang, J. Cockayne, O. Chkrebtii, T. J. Sullivan, and C. J. Oates, Bayesian numerical methods for nonlinear partial differential equations, Statistics and Computing, 31 (2021), pp. 1–20.
- [47] K. Wang, G. Pleiss, J. Gardner, S. Tyree, K. Q. Weinberger, and A. G. Wilson, Exact Gaussian processes on a million data points, Advances in neural information processing systems, 32 (2019).
- [48] A. Wilson and H. Nickisch, Kernel interpolation for scalable structured Gaussian processes (KISS-GP), in International conference on machine learning, PMLR, 2015, pp. 1775–1784.
- [49] F. X. Yu, A. T. Suresh, K. M. Choromanski, D. N. Holtmann-Rice, and S. Kumar, Orthogonal random features, Advances in neural information processing systems, 29 (2016), pp. 1975–1983.
- [50] M. Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, in Proceedings of the 20th international conference on machine learning (ICML-03), 2003, pp. 928–936.