
Stochastic mirror descent method for linear ill-posed problems in Banach spaces

Qinian Jin Mathematical Sciences Institute, Australian National University, Canberra, ACT 2601, Australia qinian.jin@anu.edu.au Xiliang Lu School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China & Hubei Computational Science Key Laboratory, Wuhan University, Wuhan 430072, China xllv.math@whu.edu.cn  and  Liuying Zhang School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China lyzhang.math@whu.edu.cn
Abstract.

Consider linear ill-posed problems governed by the system Aix=yiA_{i}x=y_{i} for i=1,,pi=1,\cdots,p, where each AiA_{i} is a bounded linear operator from a Banach space XX to a Hilbert space YiY_{i}. In case pp is huge, solving the problem by an iterative regularization method using the whole information at each iteration step can be very expensive, due to the huge amount of memory and excessive computational load per iteration. To solve such large-scale ill-posed systems efficiently, we develop in this paper a stochastic mirror descent method which uses only a small portion of equations randomly selected at each iteration step and incorporates convex regularization terms into the algorithm design. Therefore, our method scales very well with the problem size and has the capability of capturing features of the sought solutions. The convergence property of the method depends crucially on the choice of step-sizes. We consider various rules for choosing step-sizes and obtain convergence results under a priori early stopping rules. In particular, by incorporating the spirit of the discrepancy principle we propose a rule for choosing the step-sizes which can efficiently suppress the oscillations in the iterates and reduce the effect of semi-convergence. Furthermore, we establish an order optimal convergence rate result when the sought solution satisfies a benchmark source condition. Various numerical simulations are reported to test the performance of the method.

1. Introduction

Due to the rapid growth of data sizes in practical applications, stochastic optimization methods have received tremendous attention in recent years and have proved to be efficient in various applications of science and technology, in particular in machine learning research ([5, 13]). In this paper we will develop a stochastic mirror descent method for solving linear ill-posed inverse problems in Banach spaces.

We will consider linear ill-posed inverse problems governed by the system

Aix=yi,i=1,,p\displaystyle A_{i}x=y_{i},\qquad i=1,\cdots,p (1.1)

consisting of pp linear equations, where, for each i=1,,pi=1,\cdots,p, Ai:XYiA_{i}:X\to Y_{i} is a bounded linear operator from a Banach space XX to a Hilbert space YiY_{i}. Such systems arise naturally in many practical applications. For instance, many linear ill-posed inverse problems can be described by an integral equation of the first kind ([10, 14])

(Ax)(s):=𝒟k(s,t)x(t)𝑑t=y(s),s𝒟,(Ax)(s):=\int_{\mathscr{D}}k(s,t)x(t)dt=y(s),\quad s\in\mathscr{D}^{\prime},

where 𝒟\mathscr{D} and 𝒟\mathscr{D}^{\prime} are bounded domains in Euclidean spaces and the kernel kk is a bounded continuous function on 𝒟×𝒟\mathscr{D}^{\prime}\times\mathscr{D}. Clearly AA is a bounded linear operator from Lr(𝒟)L^{r}(\mathscr{D}) to C(𝒟)C(\mathscr{D}^{\prime}) for any 1r1\leq r\leq\infty. By taking pp sample points s1,,sps_{1},\cdots,s_{p} in 𝒟\mathscr{D}^{\prime}, the problem of determining a solution using only the knowledge of yi:=y(si)y_{i}:=y(s_{i}) for i=1,,pi=1,\cdots,p reduces to solving a linear system of the form (1.1), where Ai:Lr(𝒟)A_{i}:L^{r}(\mathscr{D})\to{\mathbb{R}} is given by

Aix:=𝒟k(si,t)x(t)𝑑tA_{i}x:=\int_{\mathscr{D}}k(s_{i},t)x(t)dt

for each ii. Further examples of (1.1) can be found in various tomographic techniques using multiple measurements [27].

Throughout the paper we always assume (1.1) has a solution. The system (1.1) may have many solutions. By taking into account a priori information about the sought solution, we may use a proper, lower semi-continuous, convex function :X(,]{\mathcal{R}}:X\to(-\infty,\infty] to select a solution xx^{\dagger} of (1.1) such that

(x)=min{(x):Aix=yi for i=1,,p}\displaystyle{\mathcal{R}}(x^{\dagger})=\min\left\{{\mathcal{R}}(x):A_{i}x=y_{i}\mbox{ for }i=1,\cdots,p\right\} (1.2)

which, if it exists, is called an {\mathcal{R}}-minimizing solution of (1.1). In practical applications, the exact data y:=(y1,,yp)y:=(y_{1},\cdots,y_{p}) is in general not available; instead we only have noisy data yδ:=(y1δ,,ypδ)y^{\delta}:=(y_{1}^{\delta},\cdots,y_{p}^{\delta}) satisfying

yiδyiδi,i=1,,p,\displaystyle\|y_{i}^{\delta}-y_{i}\|\leq\delta_{i},\quad i=1,\cdots,p, (1.3)

where δi>0\delta_{i}>0 denotes the noise level corresponding to data in the space YiY_{i}. How to use the noisy data yδy^{\delta} to construct an approximation of the {\mathcal{R}}-minimizing solution of (1.1) is an important topic.

Let Y:=Y1××YpY:=Y_{1}\times\cdots\times Y_{p} and define A:XYA:X\to Y by

Ax=(A1x,,Apx),xX.Ax=(A_{1}x,\cdots,A_{p}x),\quad x\in X.

Then (1.2) can be equivalently stated as

(x)=min{(x):Ax=y}.\displaystyle{\mathcal{R}}(x^{\dagger})=\min\left\{{\mathcal{R}}(x):Ax=y\right\}. (1.4)

This problem has been considered by various variational and iterative regularization methods, see [4, 7, 21, 23, 35] for instance. In particular, the Landweber iteration in Hilbert spaces has been extended for solving (1.4), leading to the iterative method of the form

xnδ=argminxX{(x)ξnδ,x},ξn+1δ=ξnδtnδA(Axnδyδ),\displaystyle\begin{split}&x_{n}^{\delta}=\arg\min_{x\in X}\left\{{\mathcal{R}}(x)-\langle\xi_{n}^{\delta},x\rangle\right\},\\ &\xi_{n+1}^{\delta}=\xi_{n}^{\delta}-t_{n}^{\delta}A^{*}(Ax_{n}^{\delta}-y^{\delta}),\end{split} (1.5)

where A:YXA^{*}:Y\to X^{*} denotes the adjoint of AA and tnδt_{n}^{\delta} is the step-size. This method can be derived as a special case of the mirror descent method; see Section 2 for a brief account. The method (1.5) has been investigated in a number of references, see [4, 12, 21, 22, 23]. In particular, the convergence and convergence rates have been derived in [21] when the method is terminated by either an a priori stopping rule or the discrepancy principle

Axnδδyδτδ<Axnδyδ,0n<nδ,\|Ax_{n_{\delta}}^{\delta}-y^{\delta}\|\leq\tau\delta<\|Ax_{n}^{\delta}-y^{\delta}\|,\quad 0\leq n<n_{\delta},

where

δ:=δ12++δp2\delta:=\sqrt{\delta_{1}^{2}+\cdots+\delta_{p}^{2}}

denotes the total noise level of the noisy data. We remark that the minimization problem in (1.5) for defining xnδx_{n}^{\delta} can be solved easily in general as it does not depend on AA; in fact xnδx_{n}^{\delta} can be given by an explicit formula in many interesting cases; even if xnδx_{n}^{\delta} does not have an explicit formula, there exist fast algorithms for determining xnδx_{n}^{\delta} efficiently. However, note that

A(Axnδyδ)=i=1pAi(Aixnδyiδ).A^{*}(Ax_{n}^{\delta}-y^{\delta})=\sum_{i=1}^{p}A_{i}^{*}(A_{i}x_{n}^{\delta}-y_{i}^{\delta}).

Therefore, updating ξnδ\xi_{n}^{\delta} to ξn+1δ\xi_{n+1}^{\delta} requires calculating Ai(Aixnδyiδ)A_{i}^{*}(A_{i}x_{n}^{\delta}-y_{i}^{\delta}) for all i=1,,pi=1,\cdots,p. In case pp is huge, using the method (1.5) to solve (1.4) can be inefficient because it requires a huge amount of memory and excessive computational work per iteration.

In order to alleviate this drawback of the method (1.5), a Landweber-Kaczmarz method in Banach spaces, extending the Kaczmarz-type method [15] in Hilbert spaces, has been proposed in [20, 23] to solve (1.2); it cyclically considers each equation in (1.2) in a Gauss-Seidel manner and the iteration scheme takes the form

xnδ=argminxX{(x)ξnδ,x},ξn+1δ=ξnδtnδAin(Ainxnδyinδ),\displaystyle\begin{split}&x_{n}^{\delta}=\arg\min_{x\in X}\left\{{\mathcal{R}}(x)-\langle\xi_{n}^{\delta},x\rangle\right\},\\ &\xi_{n+1}^{\delta}=\xi_{n}^{\delta}-t_{n}^{\delta}A_{i_{n}}^{*}(A_{i_{n}}x_{n}^{\delta}-y_{i_{n}}^{\delta}),\end{split} (1.6)

where in=(nmodp)+1i_{n}=(n\mod p)+1. The convergence of this method has been shown in [23], where the numerical results demonstrate its good performance. However, it should be pointed out that the efficiency of the method (1.6) depends crucially on the order of the equations and its convergence speed is difficult to quantify. In order to resolve these issues, in this paper we will consider a stochastic version of (1.6): instead of taking ini_{n} cyclically, we will choose ini_{n} from {1,,p}\{1,\cdots,p\} randomly at each iteration step. The corresponding method will be called the stochastic mirror descent method; more details will be presented in Section 2, where we also propose the mini-batch version of the stochastic mirror descent method.

The stochastic mirror descent method that we will consider in this paper includes the stochastic gradient descent method as a special case. Indeed, when XX is a Hilbert space and (x)=x2/2{\mathcal{R}}(x)=\|x\|^{2}/2, the method (1.6) becomes

xn+1δ=xnδtnδAin(Ainxnδyinδ)\displaystyle x_{n+1}^{\delta}=x_{n}^{\delta}-t_{n}^{\delta}A_{i_{n}}^{*}(A_{i_{n}}x_{n}^{\delta}-y_{i_{n}}^{\delta}) (1.7)

which is exactly the stochastic gradient descent method studied in [18, 19, 25, 33] for solving linear ill-posed problems in Hilbert spaces. In many applications, however, the sought solution may sit in a Banach space instead of a Hilbert space and may have a priori known special features, such as nonnegativity, sparsity and piecewise constancy. Unfortunately, the stochastic gradient descent method does not have the capability to incorporate this information into the algorithm. However, this can be handled by the stochastic mirror descent method with careful choices of a suitable Banach space XX and a strongly convex penalty functional {\mathcal{R}}.

In this paper we will use tools from convex analysis in Banach spaces to analyze the stochastic mirror descent method. The choice of the step-size plays a crucial role in the convergence of the method. We consider several rules for choosing the step-sizes and provide criteria for terminating the iterations in order to guarantee convergence when the noise level tends to zero. The iterates produced by the stochastic mirror descent method exhibit salient oscillations and, due to the ill-posedness of the underlying problems, the method using noisy data demonstrates the semi-convergence property, i.e. the iterates tend toward the sought solution at the beginning and, after a critical number of iterations, the iterates diverge. The oscillations and semi-convergence make it difficult to determine an output with a good approximation property, in particular when the noise level is relatively large. When information on the noise level is available, by incorporating the spirit of the discrepancy principle we propose a rule for choosing the step-sizes. This rule enables us to efficiently suppress the oscillations of the iterates and remove the semi-convergence of the method, as indicated by extensive numerical simulations. Furthermore, we obtain an order optimal convergence rate result for the stochastic mirror descent method with constant step-size when the sought solution satisfies a benchmark source condition. We achieve this by interpreting the stochastic mirror descent method equivalently as a randomized block gradient method applied to the dual problem of (1.1). Even for the stochastic gradient descent method, our convergence rate result supplements the existing results since only sub-optimal convergence rates have been derived under diminishing step-sizes, see [18, 25].

This paper is organized as follows. In Section 2 we first collect some basic facts on convex analysis in Banach spaces and then give an account on the stochastic mirror descent method. In Section 3 we prove some convergence results on the stochastic mirror descent method under various choices of the step-sizes. When the sought solution satisfies a benchmark source condition, in Section 4 we establish an order optimal convergence rate result. Finally, in Section 5 we present extensive numerical simulations to test the performance of the stochastic mirror descent method.

2. The method

2.1. Preliminaries

In this section, we will collect some basic facts on convex analysis in Banach spaces which will be used in the analysis of the stochastic mirror descent method; for more details one may refer to [34, 38] for instance.

Let XX be a Banach space whose norm is denoted by \|\cdot\|; we use XX^{*} to denote its dual space. Given xXx\in X and ξX\xi\in X^{*} we write ξ,x=ξ(x)\langle\xi,x\rangle=\xi(x) for the duality pairing; in case XX is a Hilbert space, we also use ,\langle\cdot,\cdot\rangle to denote the inner product. For a convex function f:X(,]f:X\to(-\infty,\infty], its effective domain is denoted by

dom(f):={xX:f(x)<}.\mbox{dom}(f):=\{x\in X:f(x)<\infty\}.

If dom(f)\mbox{dom}(f)\neq\emptyset, ff is called proper. Given xdom(f)x\in\mbox{dom}(f), an element ξX\xi\in X^{*} is called a subgradient of ff at xx if

f(x¯)f(x)+ξ,x¯x,x¯X.f(\bar{x})\geq f(x)+\langle\xi,\bar{x}-x\rangle,\quad\forall\bar{x}\in X.

The collection of all subgradients of ff at xx is denoted as f(x)\partial f(x) and is called the subdifferential of ff at xx. If f(x)\partial f(x)\neq\emptyset, then ff is called subdifferentiable at xx. Thus xf(x)x\to\partial f(x) defines a set-valued mapping f\partial f whose domain of definition is defined as

dom(f):={xdom(f):f(x)}.\mbox{dom}(\partial f):=\{x\in\mbox{dom}(f):\partial f(x)\neq\emptyset\}.

Given xdom(f)x\in\mbox{dom}(\partial f) and ξf(x)\xi\in\partial f(x), the Bregman distance induced by ff at xx in the direction ξ\xi is defined by

Dfξ(x¯,x):=f(x¯)f(x)ξ,x¯x,x¯XD_{f}^{\xi}(\bar{x},x):=f(\bar{x})-f(x)-\langle\xi,\bar{x}-x\rangle,\quad\forall\bar{x}\in X

which is always nonnegative.

A proper function f:X(,]f:X\to(-\infty,\infty] is called strongly convex if there exists a constant σ>0\sigma>0 such that

f(tx¯+(1t)x)+σt(1t)x¯x2tf(x¯)+(1t)f(x)\displaystyle f(t\bar{x}+(1-t)x)+\sigma t(1-t)\|\bar{x}-x\|^{2}\leq tf(\bar{x})+(1-t)f(x) (2.1)

for all x¯,xdom(f)\bar{x},x\in\mbox{dom}(f) and t[0,1]t\in[0,1]. The largest number σ>0\sigma>0 such that (2.1) holds true is called the modulus of convexity of ff. It is easy to see that for a proper, strongly convex function f:X(,]f:X\to(-\infty,\infty] with modulus of convexity σ>0\sigma>0 there holds

Dfξ(x¯,x)σxx¯2\displaystyle D_{f}^{\xi}(\bar{x},x)\geq\sigma\|x-\bar{x}\|^{2} (2.2)

for all x¯dom(f)\bar{x}\in\mbox{dom}(f), xdom(f)x\in\mbox{dom}(\partial f) and ξf(x)\xi\in\partial f(x).
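For completeness, we note how (2.2) follows from (2.1) by a standard argument (spelled out here for the reader's convenience): given $\xi\in\partial f(x)$, $\bar{x}\in\mbox{dom}(f)$ and $t\in(0,1)$, applying the subgradient inequality at the point $t\bar{x}+(1-t)x$ and then (2.1) gives

f(x)+t\langle\xi,\bar{x}-x\rangle\leq f(t\bar{x}+(1-t)x)\leq tf(\bar{x})+(1-t)f(x)-\sigma t(1-t)\|\bar{x}-x\|^{2};

subtracting $f(x)$, dividing by $t$ and letting $t\to 0$ yields $\langle\xi,\bar{x}-x\rangle\leq f(\bar{x})-f(x)-\sigma\|\bar{x}-x\|^{2}$, which is exactly (2.2).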

For a proper function f:X(,]f:X\to(-\infty,\infty], its Legendre-Fenchel conjugate is defined by

f(ξ):=supxX{ξ,xf(x)},ξXf^{*}(\xi):=\sup_{x\in X}\{\langle\xi,x\rangle-f(x)\},\quad\xi\in X^{*}

which is a convex function taking values in (,](-\infty,\infty]. By definition we have the Fenchel-Young inequality

f(ξ)+f(x)ξ,x\displaystyle f^{*}(\xi)+f(x)\geq\langle\xi,x\rangle (2.3)

for all xXx\in X and ξX\xi\in X^{*}. If f:X(,]f:X\to(-\infty,\infty] is proper, lower semi-continuous and convex, ff^{*} is also proper and

ξf(x)xf(ξ)f(x)+f(ξ)=ξ,x.\displaystyle\xi\in\partial f(x)\Longleftrightarrow x\in\partial f^{*}(\xi)\Longleftrightarrow f(x)+f^{*}(\xi)=\langle\xi,x\rangle. (2.4)

The following important result gives further properties of ff^{*}; in particular, it shows that the strong convexity of ff implies the continuous differentiability of ff^{*} with gradient f\nabla f^{*} mapping XX^{*} to XX; see [38, Corollary 3.5.11].

Proposition 2.1.

Let XX be a Banach space and let f:X(,]f:X\to(-\infty,\infty] be a proper, lower semi-continuous, strongly convex function with modulus of convexity σ>0\sigma>0. Then dom(f)=X\mbox{dom}(f^{*})=X^{*}, ff^{*} is Fréchet differentiable and its gradient f\nabla f^{*} maps XX^{*} into XX satisfying

f(ξ)f(η)ξη2σ\|\nabla f^{*}(\xi)-\nabla f^{*}(\eta)\|\leq\frac{\|\xi-\eta\|}{2\sigma}

for all ξ,ηX\xi,\eta\in X^{*}.

Given a proper strongly convex function f:X(,]f:X\to(-\infty,\infty], we may consider for each ξX\xi\in X^{*} the convex minimization problem

minxX{f(x)ξ,x}\displaystyle\min_{x\in X}\left\{f(x)-\langle\xi,x\rangle\right\} (2.5)

which is involved in the formulation of the stochastic mirror descent method below. According to [38, Theorem 3.5.8], (2.4) and Proposition 2.1 we have

Proposition 2.2.

If f:X(,]f:X\to(-\infty,\infty] is a proper, lower semi-continuous, strongly convex function, then for any ξX\xi\in X^{*} the minimization problem (2.5) has a unique minimizer given by f(ξ)\nabla f^{*}(\xi).
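As a concrete illustration (a standard example, not taken from the text): let $X=\ell^{2}$ and

{\mathcal{R}}(x)=\frac{1}{2}\|x\|_{\ell^{2}}^{2}+\lambda\|x\|_{\ell^{1}}

with $\lambda\geq 0$. Then ${\mathcal{R}}$ is proper, lower semi-continuous and strongly convex with modulus of convexity $\sigma=1/2$, and the unique minimizer of (2.5), i.e. $\nabla{\mathcal{R}}^{*}(\xi)$, is given componentwise by the soft-thresholding formula

\left[\nabla{\mathcal{R}}^{*}(\xi)\right]_{j}=\mbox{sign}(\xi_{j})\max\{|\xi_{j}|-\lambda,0\},\qquad j\in{\mathbb{N}}.

Explicit formulas of this type are what make the corresponding minimization step in the algorithms below cheap to evaluate.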

2.2. Description of the method

In order to motivate our method, we first briefly review how to extend the gradient method in Hilbert spaces to solve the minimization problem

minxXf(x)\displaystyle\min_{x\in X}f(x) (2.6)

in Banach spaces, where f:Xf:X\to{\mathbb{R}} is a Fréchet differentiable function defined on a Banach space XX. When XX is a Hilbert space, the gradient method for solving (2.6) takes the form xn+1=xntnf(xn)x_{n+1}=x_{n}-t_{n}\nabla f(x_{n}) which can be equivalently stated as

xn+1\displaystyle x_{n+1} =argminxX{12x(xntnf(xn))2}\displaystyle=\arg\min_{x\in X}\left\{\frac{1}{2}\|x-(x_{n}-t_{n}\nabla f(x_{n}))\|^{2}\right\}
=argminxX{12xxn2+tnf(xn),x},\displaystyle=\arg\min_{x\in X}\left\{\frac{1}{2}\|x-x_{n}\|^{2}+t_{n}\langle\nabla f(x_{n}),x\rangle\right\}, (2.7)

where ,\langle\cdot,\cdot\rangle denotes the inner product on XX and \|\cdot\| denotes the induced norm. When XX is a general Banach space, an inner product is no longer available and f(xn)\nabla f(x_{n}) is not necessarily an element in XX; instead f(xn)\nabla f(x_{n}) is an element in XX^{*}. Therefore, in order for f(xn),x\langle\nabla f(x_{n}),x\rangle to be meaningful, this expression should be understood as the dual pairing between XX^{*} and XX. On the other hand, since no Hilbert space norm is available, 12xxn2\frac{1}{2}\|x-x_{n}\|^{2} in (2.7) should be replaced by other suitable distance-like functionals. In order to capture the features of the sought solution, a suitable convex penalty function :X(,]{\mathcal{R}}:X\to(-\infty,\infty] is usually chosen to enhance these features. In such a situation, one may use the Bregman distance induced by {\mathcal{R}} to fulfill the purpose. To be more precise, if (xn)\partial{\mathcal{R}}(x_{n})\neq\emptyset, we may take ξn(xn)\xi_{n}\in\partial{\mathcal{R}}(x_{n}) suitably and then use the Bregman distance

Dξn(x,xn)=(x)(xn)ξn,xxnD_{\mathcal{R}}^{\xi_{n}}(x,x_{n})={\mathcal{R}}(x)-{\mathcal{R}}(x_{n})-\langle\xi_{n},x-x_{n}\rangle

to replace 12xxn2\frac{1}{2}\|x-x_{n}\|^{2} in (2.7). This leads to the new updating formula

xn+1argminxX{Dξn(x,xn)+tnf(xn),x}x_{n+1}\in\arg\min_{x\in X}\left\{D_{\mathcal{R}}^{\xi_{n}}(x,x_{n})+t_{n}\langle\nabla f(x_{n}),x\rangle\right\}

for xn+1x_{n+1}. By the expression of Dξn(x,xn)D_{\mathcal{R}}^{\xi_{n}}(x,x_{n}), we have

xn+1argminxX{(x)ξntnf(xn),x}.x_{n+1}\in\arg\min_{x\in X}\left\{{\mathcal{R}}(x)-\langle\xi_{n}-t_{n}\nabla f(x_{n}),x\rangle\right\}.

Setting ξn+1=ξntnf(xn)\xi_{n+1}=\xi_{n}-t_{n}\nabla f(x_{n}), we then have

xn+1argminxX{(x)ξn+1,x}.x_{n+1}\in\arg\min_{x\in X}\left\{{\mathcal{R}}(x)-\langle\xi_{n+1},x\rangle\right\}.

If xn+1x_{n+1} is well-defined, then by the optimality condition on xn+1x_{n+1} we have ξn+1(xn+1)\xi_{n+1}\in\partial{\mathcal{R}}(x_{n+1}). Therefore, we can repeat the above procedure, leading to the algorithm

xn=argminxX{(x)ξn,x},ξn+1=ξntnf(xn)\displaystyle\begin{split}&x_{n}=\arg\min_{x\in X}\left\{{\mathcal{R}}(x)-\langle\xi_{n},x\rangle\right\},\\ &\xi_{n+1}=\xi_{n}-t_{n}\nabla f(x_{n})\end{split} (2.8)

for solving (2.6) in Banach spaces, which is called the mirror descent method in optimization community; see [1, 6, 30, 31].

In order to apply the mirror descent method to solve (1.2) when only noisy data yiδy_{i}^{\delta} are available, we may consider a problem of the form (2.6) with

f(x)=12i=1pAixyiδ2.f(x)=\frac{1}{2}\sum_{i=1}^{p}\|A_{i}x-y_{i}^{\delta}\|^{2}.

Note that f(x)=i=1pAi(Aixyiδ)\nabla f(x)=\sum_{i=1}^{p}A_{i}^{*}(A_{i}x-y_{i}^{\delta}). Therefore, an application of (2.8) gives

xn=argminxX{(x)ξn,x},ξn+1=ξntni=1pAi(Aixnyiδ).\displaystyle\begin{split}&x_{n}=\arg\min_{x\in X}\left\{{\mathcal{R}}(x)-\langle\xi_{n},x\rangle\right\},\\ &\xi_{n+1}=\xi_{n}-t_{n}\sum_{i=1}^{p}A_{i}^{*}(A_{i}x_{n}-y_{i}^{\delta}).\end{split} (2.9)

This is exactly the method (1.5) which has been considered in [4, 21, 23]. Clearly, the implementation of (2.9) requires calculating Ai(Aixnyiδ)A_{i}^{*}(A_{i}x_{n}-y_{i}^{\delta}) for all i=1,,pi=1,\cdots,p at each iteration. In case pp is huge, each iteration step in (2.9) requires a huge amount of computational work. In order to reduce the computational load per iteration, one may randomly choose a subset InI_{n} of small size from {1,,p}\{1,\cdots,p\} to form the partial term

fIn(x):=12iInAixyiδ2f_{I_{n}}(x):=\frac{1}{2}\sum_{i\in I_{n}}\|A_{i}x-y_{i}^{\delta}\|^{2}

of ff and use its gradient fIn(xn)=iInAi(Aixnyiδ)\nabla f_{I_{n}}(x_{n})=\sum_{i\in I_{n}}A_{i}^{*}(A_{i}x_{n}-y_{i}^{\delta}) at xnx_{n} as a replacement of f(xn)\nabla f(x_{n}) in (2.9). This leads to the following mini-batch stochastic mirror descent method for solving (1.2) with noisy data.

Algorithm 1.

Fix a batch size bb, pick the initial guess ξ0=0\xi_{0}=0 in XX^{*} and set ξ0δ:=ξ0\xi_{0}^{\delta}:=\xi_{0}. For n0n\geq 0 do the following:

  1. (i)

    Calculate xnδXx_{n}^{\delta}\in X by solving

    xnδ=argminxX{(x)ξnδ,x};\displaystyle x_{n}^{\delta}=\arg\min_{x\in X}\left\{{\mathcal{R}}(x)-\langle\xi_{n}^{\delta},x\rangle\right\}; (2.10)
  2. (ii)

    Randomly select a subset In{1,,p}I_{n}\subset\{1,\cdots,p\} with |In|=b|I_{n}|=b via the uniform distribution;

  3. (iii)

    Choose a step-size tnδ0t_{n}^{\delta}\geq 0;

  4. (iv)

    Define ξn+1δX\xi_{n+1}^{\delta}\in X^{*} by

    ξn+1δ=ξnδtnδiInAi(Aixnδyiδ).\displaystyle\xi_{n+1}^{\delta}=\xi_{n}^{\delta}-t_{n}^{\delta}\sum_{i\in I_{n}}A_{i}^{*}\left(A_{i}x_{n}^{\delta}-y_{i}^{\delta}\right). (2.11)

It should be pointed out that the choice of the step-size tnδt_{n}^{\delta} plays a crucial role in the convergence of Algorithm 1. We will consider in Section 3 several rules for choosing the step-size. Note also that, at each iteration step of Algorithm 1, xnδx_{n}^{\delta} is defined by the minimization problem (2.10). When {\mathcal{R}} is proper, lower semi-continuous and strongly convex, it follows from Proposition 2.2 that xnδx_{n}^{\delta} is well-defined. Moreover, for many important choices of {\mathcal{R}}, xnδx_{n}^{\delta} can be given by an explicit formula, see Section 5 for instance, and thus the calculation of xnδx_{n}^{\delta} does not take much time.
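To fix ideas, the following is a minimal sketch of Algorithm 1 (an illustration under simplifying assumptions, not the authors' reference implementation): $X={\mathbb{R}}^{m}$, the penalty is ${\mathcal{R}}(x)=\frac{1}{2}\|x\|^{2}+\lambda\|x\|_{1}$ so that the minimization step (2.10) reduces to componentwise soft-thresholding (cf. the example after Proposition 2.2), and the step-size rule is supplied as a callable; see the rules (s1)-(s3) discussed in Section 3. All names and parameter values are illustrative.

import numpy as np

def soft_threshold(xi, lam):
    # x = grad R*(xi) for R(x) = 0.5*||x||^2 + lam*||x||_1 (modulus of convexity sigma = 1/2)
    return np.sign(xi) * np.maximum(np.abs(xi) - lam, 0.0)

def stochastic_mirror_descent(A_blocks, y_blocks, step_rule, lam=0.01, batch=1,
                              n_iter=1000, noise_levels=None, seed=0):
    # A_blocks[i]: matrix representing A_i (shape d_i x m); y_blocks[i]: noisy data y_i^delta;
    # noise_levels[i]: delta_i (only needed by step-size rules of the type (s3)).
    rng = np.random.default_rng(seed)
    p, m = len(A_blocks), A_blocks[0].shape[1]
    xi = np.zeros(m)                                      # xi_0 = 0
    for n in range(n_iter):
        x = soft_threshold(xi, lam)                       # (i)   x_n^delta = grad R*(xi_n^delta)
        I = rng.choice(p, size=batch, replace=False)      # (ii)  random batch I_n with |I_n| = b
        res = np.concatenate([A_blocks[i] @ x - y_blocks[i] for i in I])        # A_I x_n - y_I^delta
        grad = sum(A_blocks[i].T @ (A_blocks[i] @ x - y_blocks[i]) for i in I)  # A_I^*(A_I x_n - y_I^delta)
        delta_I = 0.0 if noise_levels is None else np.sqrt(sum(noise_levels[i] ** 2 for i in I))
        t = step_rule(res, grad, delta_I)                 # (iii) step-size t_n^delta >= 0
        xi = xi - t * grad                                # (iv)  dual update (2.11)
    return soft_threshold(xi, lam)

The number of iterations plays the role of the regularization parameter; the rules for choosing the step-size and the corresponding stopping criteria are discussed in Section 3.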

There exist extensive studies on the mirror descent method and its stochastic variants in optimization, see [6, 8, 28, 29, 39] for instance. The existing works either depend crucially on the finite-dimensionality of the underlying spaces or establish only error estimates in terms of objective function values, and therefore they are not applicable to our Algorithm 1 for ill-posed problems; we need to develop a new analysis. Our analysis of Algorithm 1 is based on the following assumption which is assumed throughout the paper.

Assumption 1.
  1. (i)

    XX is a Banach space, YiY_{i} is a Hilbert space and Ai:XYiA_{i}:X\to Y_{i} is a bounded linear operator for each i=1,,pi=1,\cdots,p;

  2. (ii)

    :X(,]{\mathcal{R}}:X\to(-\infty,\infty] is a proper, lower semi-continuous, strongly convex function with modulus of convexity σ>0\sigma>0;

  3. (iii)

    The system Aix=yiA_{i}x=y_{i}, i=1,,pi=1,\cdots,p, has a solution in dom()\emph{dom}({\mathcal{R}}).

According to Assumption 1 (ii) and Proposition 2.1, the Legendre-Fenchel conjugate {\mathcal{R}}^{*} of {\mathcal{R}} is continuously differentiable with Lipschitz continuous gradient, i.e.

(ξ)(η)ξη2σ,ξ,ηX.\displaystyle\|\nabla{\mathcal{R}}^{*}(\xi)-\nabla{\mathcal{R}}^{*}(\eta)\|\leq\frac{\|\xi-\eta\|}{2\sigma},\quad\forall\xi,\eta\in X^{*}. (2.12)

Moreover, by virtue of Assumption 1 and Proposition 2.2, the linear system (1.1) has a unique {\mathcal{R}}-minimizing solution xx^{\dagger} and Algorithm 1 is well-defined. By the optimality condition on xnδx_{n}^{\delta} and Proposition 2.2 we have

ξnδ(xnδ) and xnδ=(ξnδ)\displaystyle\xi_{n}^{\delta}\in\partial{\mathcal{R}}(x_{n}^{\delta})\quad\mbox{ and }\quad x_{n}^{\delta}=\nabla{\mathcal{R}}^{*}(\xi_{n}^{\delta}) (2.13)

which is the starting point of our convergence analysis.

3. Convergence

In order to establish a convergence result on Algorithm 1, we need to specify a probability space on which the analysis will be carried out. Let Λp:={1,,p}\Lambda_{p}:=\{1,\cdots,p\} and let 𝒮p{\mathcal{S}}_{p} denote the σ\sigma-algebra consisting of all subsets of Λp\Lambda_{p}. Recall that, at each iteration step, a subset of indices is randomly chosen from Λp\Lambda_{p} via the uniform distribution. Therefore, for each n1n\geq 1, it is natural to consider xnδx_{n}^{\delta} and ξnδ\xi_{n}^{\delta} on the sample space

Λpn:=Λp××Λpn copies\Lambda_{p}^{n}:=\underbrace{\Lambda_{p}\times\cdots\times\Lambda_{p}}_{n\mbox{ copies}}

equipped with the σ\sigma-algebra 𝒮pn{\mathcal{S}}_{p}^{\otimes n} and the uniformly distributed probability, denoted as n{\mathbb{P}}_{n}. According to the Kolmogorov extension theorem ([2]), there exists a unique probability {\mathbb{P}} defined on the measurable space (Ω,):=(Λp,𝒮p)(\Omega,{\mathcal{F}}):=(\Lambda_{p}^{\infty},{\mathcal{S}}_{p}^{\otimes\infty}) such that each n{\mathbb{P}}_{n} is consistent with {\mathbb{P}}. Let 𝔼{\mathbb{E}} denote the expectation on the probability space (Ω,,)(\Omega,{\mathcal{F}},{\mathbb{P}}). Given a Banach space XX, we use L2(Ω,X)L^{2}(\Omega,X) to denote the space consisting of all random variables xx with values in XX such that 𝔼[x2]\mathbb{E}[\|x\|^{2}] is finite; this is a Banach space under the norm

(𝔼[x2])1/2.\left(\mathbb{E}[\|x\|^{2}]\right)^{1/2}.

Concerning Algorithm 1 we will use {n}\{{\mathcal{F}}_{n}\} to denote the natural filtration, where n:=σ(I0,,In1){\mathcal{F}}_{n}:=\sigma(I_{0},\cdots,I_{n-1}) for each n1n\geq 1. We will frequently use the identity

𝔼[ϕ]=𝔼[𝔼[ϕ|n]]\displaystyle\mathbb{E}[\phi]=\mathbb{E}[\mathbb{E}[\phi|{\mathcal{F}}_{n}]] (3.1)

for any random variable ϕ\phi on (Ω,,)(\Omega,{\mathcal{F}},{\mathbb{P}}), where 𝔼[ϕ|n]\mathbb{E}[\phi|{\mathcal{F}}_{n}] denotes the expectation of ϕ\phi conditioned on n{\mathcal{F}}_{n}.

In this section we will prove some convergence results on Algorithm 1 under suitable choices of the step-sizes. The convergence of Algorithm 1 will be established by investigating the convergence property of its counterpart for exact data together with its stability property. For simplicity of exposition, for each index set I={i1,,ib}{1,,p}I=\{i_{1},\cdots,i_{b}\}\subset\{1,\cdots,p\} of size bb we set

YI:=Yi1××Yib,yIδ:=(yi1δ,,yibδ),δI2=δi12++δib2Y_{I}:=Y_{i_{1}}\times\cdots\times Y_{i_{b}},\quad y_{I}^{\delta}:=(y_{i_{1}}^{\delta},\cdots,y_{i_{b}}^{\delta}),\quad\delta_{I}^{2}=\delta_{i_{1}}^{2}+\cdots+\delta_{i_{b}}^{2}

and define AI:XYIA_{I}:X\to Y_{I} by

AIx:=(Ai1x,,Aibx),xX.A_{I}x:=(A_{i_{1}}x,\cdots,A_{i_{b}}x),\quad\forall x\in X.

Let AIA_{I}^{*} denote the adjoint of AIA_{I}. Then the updating formula (2.11) of ξn+1δ\xi_{n+1}^{\delta} from ξnδ\xi_{n}^{\delta} can be rephrased as

ξn+1δ=ξnδtnδAIn(AInxnδyInδ).\xi_{n+1}^{\delta}=\xi_{n}^{\delta}-t_{n}^{\delta}A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}).

We start with the following result.

Lemma 3.1.

Let Assumption 1 hold. Consider Algorithm 1 and assume that

0tnδmin{μ0AInxnδyInδ2AIn(AInxnδyInδ)2,μ1}0\leq t_{n}^{\delta}\leq\min\left\{\frac{\mu_{0}\|A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}\|^{2}}{\|A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta})\|^{2}},\mu_{1}\right\}

for all n0n\geq 0, where μ0\mu_{0} and μ1\mu_{1} are two positive constants with c0:=1μ0/(4σ)>0c_{0}:=1-\mu_{0}/(4\sigma)>0. Let x^\hat{x} be any solution of (1.1) in dom()\mbox{dom}({\mathcal{R}}) and let

Δnδ:=Dξnδ(x^,xnδ),n=0,1,.\Delta_{n}^{\delta}:=D_{\mathcal{R}}^{\xi_{n}^{\delta}}(\hat{x},x_{n}^{\delta}),\quad n=0,1,\cdots.

Then

𝔼[Δn+1δ]𝔼[Δnδ]μ1b4c0pδ2.\displaystyle\mathbb{E}\left[\Delta_{n+1}^{\delta}\right]-\mathbb{E}\left[\Delta_{n}^{\delta}\right]\leq\frac{\mu_{1}b}{4c_{0}p}\delta^{2}.
Proof.

Note that

Δn+1δΔnδ=(ξn+1δ,xn+1δx^(xn+1δ))+((xnδ)+ξnδ,x^xnδ).\Delta_{n+1}^{\delta}-\Delta_{n}^{\delta}=\left(\langle\xi_{n+1}^{\delta},x_{n+1}^{\delta}-\hat{x}\rangle-{\mathcal{R}}(x_{n+1}^{\delta})\right)+\left({\mathcal{R}}(x_{n}^{\delta})+\langle\xi_{n}^{\delta},\hat{x}-x_{n}^{\delta}\rangle\right).

By using (2.13) and (2.4) we have

(xnδ)+(ξnδ)=ξnδ,xnδ{\mathcal{R}}(x_{n}^{\delta})+{\mathcal{R}}^{*}(\xi_{n}^{\delta})=\langle\xi_{n}^{\delta},x_{n}^{\delta}\rangle

for all n0n\geq 0. Therefore

Δn+1δΔnδ\displaystyle\Delta_{n+1}^{\delta}-\Delta_{n}^{\delta} =((ξn+1δ)ξn+1δ,x^)((ξnδ)ξnδ,x^).\displaystyle=\left({\mathcal{R}}^{*}(\xi_{n+1}^{\delta})-\langle\xi_{n+1}^{\delta},\hat{x}\rangle\right)-\left({\mathcal{R}}^{*}(\xi_{n}^{\delta})-\langle\xi_{n}^{\delta},\hat{x}\rangle\right).

Using xnδ=(ξnδ)x_{n}^{\delta}=\nabla{\mathcal{R}}^{*}(\xi_{n}^{\delta}) in (2.13) and the inequality (2.12) on \nabla{\mathcal{R}}^{*}, we can obtain

Δn+1δΔnδ\displaystyle\Delta_{n+1}^{\delta}-\Delta_{n}^{\delta} =((ξn+1δ)(ξnδ)ξn+1δξnδ,(ξnδ))\displaystyle=\left({\mathcal{R}}^{*}(\xi_{n+1}^{\delta})-{\mathcal{R}}^{*}(\xi_{n}^{\delta})-\langle\xi_{n+1}^{\delta}-\xi_{n}^{\delta},\nabla{\mathcal{R}}^{*}(\xi_{n}^{\delta})\rangle\right)
+ξn+1δξnδ,xnδx^\displaystyle\quad\,+\langle\xi_{n+1}^{\delta}-\xi_{n}^{\delta},x_{n}^{\delta}-\hat{x}\rangle
14σξn+1δξnδ2+ξn+1δξnδ,xnδx^.\displaystyle\leq\frac{1}{4\sigma}\|\xi_{n+1}^{\delta}-\xi_{n}^{\delta}\|^{2}+\langle\xi_{n+1}^{\delta}-\xi_{n}^{\delta},x_{n}^{\delta}-\hat{x}\rangle.

According to the definition of ξn+1δ\xi_{n+1}^{\delta} and AInx^=yInA_{I_{n}}\hat{x}=y_{I_{n}}, we can further obtain

Δn+1δΔnδ\displaystyle\Delta_{n+1}^{\delta}-\Delta_{n}^{\delta} 14σ(tnδ)2AIn(AInxnδyInδ)2tnδAInxnδyInδ,AInxnδyIn.\displaystyle\leq\frac{1}{4\sigma}\left(t_{n}^{\delta}\right)^{2}\|A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta})\|^{2}-t_{n}^{\delta}\langle A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta},A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}\rangle.

By the given condition on tnδt_{n}^{\delta} and the Cauchy-Schwarz inequality, we then obtain

Δn+1δΔnδ\displaystyle\Delta_{n+1}^{\delta}-\Delta_{n}^{\delta} μ04σtnδAInxnδyInδ2tnδAInxnδyInδ,AInxnδyIn\displaystyle\leq\frac{\mu_{0}}{4\sigma}t_{n}^{\delta}\|A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}\|^{2}-t_{n}^{\delta}\langle A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta},A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}\rangle
c0tnδAInxnδyInδ2+tnδδInAInxnδyInδ\displaystyle\leq-c_{0}t_{n}^{\delta}\|A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}\|^{2}+t_{n}^{\delta}\delta_{I_{n}}\|A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}\|
14c0tnδδIn2μ14c0δIn2.\displaystyle\leq\frac{1}{4c_{0}}t_{n}^{\delta}\delta_{I_{n}}^{2}\leq\frac{\mu_{1}}{4c_{0}}\delta_{I_{n}}^{2}.

By taking the expectation and using I:|I|=b\sum_{I:|I|=b} to denote a sum over all subsets I{1,,p}I\subset\{1,\cdots,p\} with |I|=b|I|=b, we have

𝔼[Δn+1δ]𝔼[Δnδ]\displaystyle\mathbb{E}[\Delta_{n+1}^{\delta}]-\mathbb{E}[\Delta_{n}^{\delta}] μ14c0𝔼[δIn2]=μ14c01(pb)I:|I|=bδI2=μ14c01(pb)I:|I|=biIδi2\displaystyle\leq\frac{\mu_{1}}{4c_{0}}\mathbb{E}[\delta_{I_{n}}^{2}]=\frac{\mu_{1}}{4c_{0}}\frac{1}{{p\choose b}}\sum_{I:|I|=b}\delta_{I}^{2}=\frac{\mu_{1}}{4c_{0}}\frac{1}{{p\choose b}}\sum_{I:|I|=b}\sum_{i\in I}\delta_{i}^{2}
=μ14c01(pb)i=1pI:|I|=b&iIδi2=μ14c0(p1b1)(pb)i=1pδi2\displaystyle=\frac{\mu_{1}}{4c_{0}}\frac{1}{{p\choose b}}\sum_{i=1}^{p}\sum_{I:|I|=b\,\&\,i\in I}\delta_{i}^{2}=\frac{\mu_{1}}{4c_{0}}\frac{{p-1\choose b-1}}{{p\choose b}}\sum_{i=1}^{p}\delta_{i}^{2}
=μ1b4c0pδ2\displaystyle=\frac{\mu_{1}b}{4c_{0}p}\delta^{2}

which shows the desired inequality. ∎

Next we consider Algorithm 1 with exact data and drop the superscript δ\delta for every quantity defined by the algorithm. Thus xn,ξnx_{n},\xi_{n} denote the corresponding iterative sequences and tnt_{n} denotes the step-size. We now show a convergence result for Algorithm 1 with exact data by demonstrating that {xn}\{x_{n}\} is a Cauchy sequence in L2(Ω,X)L^{2}(\Omega,X).

Theorem 3.2.

Let Assumption 1 hold. Consider Algorithm 1 with exact data and assume that

μ2tnmin{μ0AInxnyIn2AIn(AInxnyIn)2,μ1} when AInxnyIn,\displaystyle\mu_{2}\leq t_{n}\leq\min\left\{\frac{\mu_{0}\|A_{I_{n}}x_{n}-y_{I_{n}}\|^{2}}{\|A_{I_{n}}^{*}(A_{I_{n}}x_{n}-y_{I_{n}})\|^{2}},\mu_{1}\right\}\quad\mbox{ when }A_{I_{n}}x_{n}\neq y_{I_{n}}, (3.2)

where μ0\mu_{0}, μ1\mu_{1} and μ2\mu_{2} are positive numbers with c0:=1μ0/(4σ)>0c_{0}:=1-\mu_{0}/(4\sigma)>0. Then

𝔼[xnx2]0 and 𝔼[Dξn(x,xn)]0\mathbb{E}[\|x_{n}-x^{\dagger}\|^{2}]\to 0\quad\mbox{ and }\quad\mathbb{E}\left[D_{\mathcal{R}}^{\xi_{n}}(x^{\dagger},x_{n})\right]\to 0

as nn\to\infty, where xx^{\dagger} denotes the unique {\mathcal{R}}-minimizing solution of (1.1).

Proof.

Let x^\hat{x} be any solution of (1.1) in dom()\mbox{dom}({\mathcal{R}}) and define

Δn:=Dξn(x^,xn).\Delta_{n}:=D_{\mathcal{R}}^{\xi_{n}}(\hat{x},x_{n}).

By an argument similar to that in the proof of Lemma 3.1 we can obtain

Δn+1Δnc0tnAInxnyIn2c0μ2AInxnyIn2.\Delta_{n+1}-\Delta_{n}\leq-c_{0}t_{n}\|A_{I_{n}}x_{n}-y_{I_{n}}\|^{2}\leq-c_{0}\mu_{2}\|A_{I_{n}}x_{n}-y_{I_{n}}\|^{2}.

Consequently

𝔼[Δn+1]𝔼[Δn]\displaystyle\mathbb{E}[\Delta_{n+1}]-\mathbb{E}[\Delta_{n}]
c0μ2𝔼[AInxnyIn2]=c0μ2𝔼[𝔼[AInxnyIn2|n]]\displaystyle\leq-c_{0}\mu_{2}\mathbb{E}\left[\|A_{I_{n}}x_{n}-y_{I_{n}}\|^{2}\right]=-c_{0}\mu_{2}\mathbb{E}\left[\mathbb{E}\left[\|A_{I_{n}}x_{n}-y_{I_{n}}\|^{2}|{\mathcal{F}_{n}}\right]\right]
=c0μ2𝔼[1(pb)I:|I|=bAIxnyI2].\displaystyle=-c_{0}\mu_{2}\mathbb{E}\left[\frac{1}{{p\choose b}}\sum_{I:|I|=b}\|A_{I}x_{n}-y_{I}\|^{2}\right]. (3.3)

This shows that {𝔼[Δn]}\{\mathbb{E}[\Delta_{n}]\} is monotonically decreasing and therefore

limn𝔼[Δn] existsandlimnΦn=0\displaystyle\lim_{n\to\infty}\mathbb{E}[\Delta_{n}]\mbox{ exists}\quad\mbox{and}\quad\lim_{n\to\infty}\Phi_{n}=0 (3.4)

where

Φn:=𝔼[1(pb)I:|I|=bAIxnyI2].\Phi_{n}:=\mathbb{E}\left[\frac{1}{{p\choose b}}\sum_{I:|I|=b}\|A_{I}x_{n}-y_{I}\|^{2}\right].

If Φn=0\Phi_{n}=0 for some nn, we must have

I:|I|=bAIxnyI2=0\sum_{I:|I|=b}\|A_{I}x_{n}-y_{I}\|^{2}=0

along any sample path (I0,,In1)(I_{0},\cdots,I_{n-1}) since there exist only finitely many such sample paths, each with a positive probability; consequently ξm=ξn\xi_{m}=\xi_{n} and xm=xnx_{m}=x_{n} and hence Φm=0\Phi_{m}=0 for all mnm\geq n. Based on these properties of Φn\Phi_{n}, it is possible to choose a strictly increasing sequence {nl}\{n_{l}\} of integers by setting n0:=0n_{0}:=0 and, for each l1l\geq 1, by letting nln_{l} be the first integer satisfying

nlnl1+1 and ΦnlΦnl1.n_{l}\geq n_{l-1}+1\quad\mbox{ and }\quad\Phi_{n_{l}}\leq\Phi_{n_{l-1}}.

It is easy to see that for this sequence {nl}\{n_{l}\} there holds

ΦnlΦn,0nnl.\displaystyle\Phi_{n_{l}}\leq\Phi_{n},\quad\forall 0\leq n\leq n_{l}. (3.5)

For the above chosen sequence {nl}\{n_{l}\} of integers, we are now going to show that

suplk𝔼[Dξnk(xnl,xnk)]0as k.\displaystyle\sup_{l\geq k}\mathbb{E}\left[D_{\mathcal{R}}^{\xi_{n_{k}}}(x_{n_{l}},x_{n_{k}})\right]\to 0\quad\mbox{as }k\to\infty. (3.6)

By the definition of the Bregman distance we have for any l>kl>k that

Dξnk(xnl,xnk)=ΔnkΔnl+ξnlξnk,xnlx^.D_{\mathcal{R}}^{\xi_{n_{k}}}(x_{n_{l}},x_{n_{k}})=\Delta_{n_{k}}-\Delta_{n_{l}}+\langle\xi_{n_{l}}-\xi_{n_{k}},x_{n_{l}}-\hat{x}\rangle.

Taking the expectation gives

𝔼[Dξnk(xnl,xnk)]=𝔼[Δnk]𝔼[Δnl]+𝔼[ξnlξnk,xnlx^].\displaystyle\mathbb{E}\left[D_{\mathcal{R}}^{\xi_{n_{k}}}(x_{n_{l}},x_{n_{k}})\right]=\mathbb{E}[\Delta_{n_{k}}]-\mathbb{E}[\Delta_{n_{l}}]+\mathbb{E}\left[\langle\xi_{n_{l}}-\xi_{n_{k}},x_{n_{l}}-\hat{x}\rangle\right]. (3.7)

By using the definition of {ξn}\{\xi_{n}\} we have

ξnlξnk,xnlx^\displaystyle\langle\xi_{n_{l}}-\xi_{n_{k}},x_{n_{l}}-\hat{x}\rangle =n=nknl1ξn+1ξn,xnlx^\displaystyle=\sum_{n=n_{k}}^{n_{l}-1}\langle\xi_{n+1}-\xi_{n},x_{n_{l}}-\hat{x}\rangle
=n=nknl1tnAIn(AInxnyIn),xnlx^\displaystyle=-\sum_{n=n_{k}}^{n_{l}-1}t_{n}\langle A_{I_{n}}^{*}(A_{I_{n}}x_{n}-y_{I_{n}}),x_{n_{l}}-\hat{x}\rangle
=n=nknl1tnAInxnyIn,AInxnlyIn.\displaystyle=-\sum_{n=n_{k}}^{n_{l}-1}t_{n}\langle A_{I_{n}}x_{n}-y_{I_{n}},A_{I_{n}}x_{n_{l}}-y_{I_{n}}\rangle.

Therefore, by taking the expectation and using the Cauchy-Schwarz inequality, we can obtain

|𝔼[ξnlξnk,xnlx^]|μ1n=nknl1|𝔼[AInxnyIn,AInxnlyIn]|\displaystyle\left|\mathbb{E}\left[\langle\xi_{n_{l}}-\xi_{n_{k}},x_{n_{l}}-\hat{x}\rangle\right]\right|\leq\mu_{1}\sum_{n=n_{k}}^{n_{l}-1}\left|\mathbb{E}\left[\langle A_{I_{n}}x_{n}-y_{I_{n}},A_{I_{n}}x_{n_{l}}-y_{I_{n}}\rangle\right]\right|
μ1n=nknl1(𝔼[AInxnyIn2])1/2(𝔼[AInxnlyIn2])1/2.\displaystyle\qquad\qquad\qquad\leq\mu_{1}\sum_{n=n_{k}}^{n_{l}-1}\left(\mathbb{E}\left[\|A_{I_{n}}x_{n}-y_{I_{n}}\|^{2}\right]\right)^{1/2}\left(\mathbb{E}\left[\|A_{I_{n}}x_{n_{l}}-y_{I_{n}}\|^{2}\right]\right)^{1/2}.

Since xnx_{n} is n{\mathcal{F}}_{n}-measurable, we have

𝔼[AInxnyIn2]=𝔼[𝔼[AInxnyIn2|n]]=Φn.\displaystyle\mathbb{E}\left[\|A_{I_{n}}x_{n}-y_{I_{n}}\|^{2}\right]=\mathbb{E}\left[\mathbb{E}\left[\|A_{I_{n}}x_{n}-y_{I_{n}}\|^{2}|{\mathcal{F}}_{n}\right]\right]=\Phi_{n}.

We cannot treat the term 𝔼[AInxnlyIn2]\mathbb{E}\left[\|A_{I_{n}}x_{n_{l}}-y_{I_{n}}\|^{2}\right] in the same way because xnlx_{n_{l}} is not necessarily n{\mathcal{F}}_{n}-measurable. However, by noting that

AInxnlyIn2I:|I|=bAIxnlyI2,\|A_{I_{n}}x_{n_{l}}-y_{I_{n}}\|^{2}\leq\sum_{I:|I|=b}\|A_{I}x_{n_{l}}-y_{I}\|^{2},

we have

𝔼[AInxnlyIn2]𝔼[I:|I|=bAIxnlyI2](pb)Φnl.\mathbb{E}\left[\|A_{I_{n}}x_{n_{l}}-y_{I_{n}}\|^{2}\right]\leq\mathbb{E}\left[\sum_{I:|I|=b}\|A_{I}x_{n_{l}}-y_{I}\|^{2}\right]\leq{p\choose b}\Phi_{n_{l}}.

Therefore, with cp,b:=(pb)1/2c_{p,b}:={p\choose b}^{1/2}, we obtain

|𝔼[ξnlξnk,xnlx^]|μ1cp,bn=nknl1Φn1/2Φnl1/2.\displaystyle\left|\mathbb{E}\left[\langle\xi_{n_{l}}-\xi_{n_{k}},x_{n_{l}}-\hat{x}\rangle\right]\right|\leq\mu_{1}c_{p,b}\sum_{n=n_{k}}^{n_{l}-1}\Phi_{n}^{1/2}\Phi_{n_{l}}^{1/2}. (3.8)

By virtue of (3.5) and (3.3) we then have

|𝔼[ξnlξnk,xnlx^]|μ1cp,bn=nknl1Φnμ1cp,bc0μ2(𝔼[Δnk]𝔼[Δnl]).\displaystyle\left|\mathbb{E}\left[\langle\xi_{n_{l}}-\xi_{n_{k}},x_{n_{l}}-\hat{x}\rangle\right]\right|\leq\mu_{1}c_{p,b}\sum_{n=n_{k}}^{n_{l}-1}\Phi_{n}\leq\frac{\mu_{1}c_{p,b}}{c_{0}\mu_{2}}\left(\mathbb{E}[\Delta_{n_{k}}]-\mathbb{E}[\Delta_{n_{l}}]\right). (3.9)

Combining this with (3.7) yields

𝔼[Dξnk(xnl,xnk)](1+μ1cp,bc0μ2)(𝔼[Δnk]𝔼[Δnl])\displaystyle\mathbb{E}\left[D_{\mathcal{R}}^{\xi_{n_{k}}}(x_{n_{l}},x_{n_{k}})\right]\leq\left(1+\frac{\mu_{1}c_{p,b}}{c_{0}\mu_{2}}\right)\left(\mathbb{E}[\Delta_{n_{k}}]-\mathbb{E}[\Delta_{n_{l}}]\right)

which together with the first equation in (3.4) shows (3.6) immediately.

By using the strong convexity of {\mathcal{R}}, we can obtain from (3.6) that

suplk𝔼[xnlxnk2]0as k\sup_{l\geq k}\mathbb{E}[\|x_{n_{l}}-x_{n_{k}}\|^{2}]\to 0\quad\mbox{as }k\to\infty

which means that {xnl}\{x_{n_{l}}\} is a Cauchy sequence in L2(Ω,X)L^{2}(\Omega,X). Thus there exists a random vector xL2(Ω,X)x^{*}\in L^{2}(\Omega,X) such that

𝔼[xnlx2]0 as l.\displaystyle\mathbb{E}[\|x_{n_{l}}-x^{*}\|^{2}]\to 0\quad\mbox{ as }l\to\infty. (3.10)

By taking a subsequence of {nl}\{n_{l}\} if necessary, we can obtain from (3.10) and the second equation in (3.4) that

limlxnlx=0andlimlI:|I|=bAIxnlyI2=0\displaystyle\lim_{l\to\infty}\|x_{n_{l}}-x^{*}\|=0\quad\mbox{and}\quad\lim_{l\to\infty}\sum_{I:|I|=b}\|A_{I}x_{n_{l}}-y_{I}\|^{2}=0 (3.11)

almost surely. Consequently

I:|I|=bAIxyI2=0almost surely,\sum_{I:|I|=b}\|A_{I}x^{*}-y_{I}\|^{2}=0\ \ \mbox{almost surely},

i.e. xx^{*} is a solution of (1.1) almost surely.

We next show that xdom()x^{*}\in\mbox{dom}({\mathcal{R}}) almost surely. It suffices to show 𝔼[(x)]<\mathbb{E}[{\mathcal{R}}(x^{*})]<\infty. Recalling that ξnl(xnl)\xi_{n_{l}}\in\partial{\mathcal{R}}(x_{n_{l}}), we have

(xnl)(x)+ξnl,xnlx,xX.\displaystyle{\mathcal{R}}(x_{n_{l}})\leq{\mathcal{R}}(x)+\langle\xi_{n_{l}},x_{n_{l}}-x\rangle,\quad\forall x\in X. (3.12)

By using (3.12) with x=xx=x^{\dagger}, taking the expectation, and noting ξn0=ξ0=0\xi_{n_{0}}=\xi_{0}=0, we can obtain from (3.9) that

𝔼[(xnl)]\displaystyle\mathbb{E}[{\mathcal{R}}(x_{n_{l}})] (x)+𝔼[ξnlξn0,xnlx]\displaystyle\leq{\mathcal{R}}(x^{\dagger})+\mathbb{E}\left[\langle\xi_{n_{l}}-\xi_{n_{0}},x_{n_{l}}-x^{\dagger}\rangle\right]
(x)+μ1cp,bc0μ2𝔼[Dξ0(x,x0)]\displaystyle\leq{\mathcal{R}}(x^{\dagger})+\frac{\mu_{1}c_{p,b}}{c_{0}\mu_{2}}\mathbb{E}\left[D_{{\mathcal{R}}}^{\xi_{0}}(x^{\dagger},x_{0})\right]
=:C<.\displaystyle=:C<\infty.

Therefore, it follows from the first equation in (3.11), the lower semi-continuity of {\mathcal{R}} and Fatou’s Lemma that

𝔼[(x)]𝔼[lim infl(xnl)]lim infl𝔼[(xnl)]C<.\displaystyle\mathbb{E}[{\mathcal{R}}(x^{*})]\leq\mathbb{E}\left[\liminf_{l\to\infty}{\mathcal{R}}(x_{n_{l}})\right]\leq\liminf_{l\to\infty}\mathbb{E}\left[{\mathcal{R}}(x_{n_{l}})\right]\leq C<\infty. (3.13)

We have thus shown that xL2(Ω,X)x^{*}\in L^{2}(\Omega,X) is a solution of (1.1) in dom()\mbox{dom}({\mathcal{R}}) almost surely.

In order to proceed further, we will show that, for any x^L2(Ω,X)\hat{x}\in L^{2}(\Omega,X) that is a solution of (1.1) in dom()\mbox{dom}({\mathcal{R}}) almost surely, there holds

liml𝔼[ξnl,xnlx^]=0.\displaystyle\lim_{l\to\infty}\mathbb{E}\left[\langle\xi_{n_{l}},x_{n_{l}}-\hat{x}\rangle\right]=0. (3.14)

To see this, for any k<lk<l we write

𝔼[ξnl,xnlx^]=𝔼[ξnlξnk,xnlx^]+𝔼[ξnk,xnlx^].\displaystyle\mathbb{E}\left[\langle\xi_{n_{l}},x_{n_{l}}-\hat{x}\rangle\right]=\mathbb{E}\left[\langle\xi_{n_{l}}-\xi_{n_{k}},x_{n_{l}}-\hat{x}\rangle\right]+\mathbb{E}\left[\langle\xi_{n_{k}},x_{n_{l}}-\hat{x}\rangle\right].

Since x^L2(Ω,X)\hat{x}\in L^{2}(\Omega,X) is a solution of (1.1) in dom()\mbox{dom}({\mathcal{R}}) almost surely, by the definition of {ξn}\{\xi_{n}\} and ξ0=0\xi_{0}=0 we can use an argument similar to that used for deriving (3.8) to obtain

|𝔼[ξnk,xnlx^]|μ1cp,b(n=0nk1Φn1/2)Φnl1/2.\displaystyle\left|\mathbb{E}\left[\langle\xi_{n_{k}},x_{n_{l}}-\hat{x}\rangle\right]\right|\leq\mu_{1}c_{p,b}\left(\sum_{n=0}^{n_{k}-1}\Phi_{n}^{1/2}\right)\Phi_{n_{l}}^{1/2}.

This and the second equation in (3.4) imply for any fixed kk that 𝔼[ξnk,xnlx^]0\mathbb{E}[\langle\xi_{n_{k}},x_{n_{l}}-\hat{x}\rangle]\to 0 as ll\to\infty. Hence

lim supl|𝔼[ξnl,xnlx^]|\displaystyle\limsup_{l\to\infty}\left|\mathbb{E}\left[\langle\xi_{n_{l}},x_{n_{l}}-\hat{x}\rangle\right]\right| suplk|𝔼[ξnlξnk,xnlx^]|\displaystyle\leq\sup_{l\geq k}\left|\mathbb{E}\left[\langle\xi_{n_{l}}-\xi_{n_{k}},x_{n_{l}}-\hat{x}\rangle\right]\right|

for all kk. By virtue of (3.9) and letting kk\to\infty we thus obtain (3.14).

Based on (3.12) with x=xx=x^{*} and (3.14), we can obtain

lim supl𝔼[(xnl)]𝔼[(x)]+liml𝔼[ξnl,xnlx]=𝔼[(x)]\displaystyle\limsup_{l\to\infty}\mathbb{E}[{\mathcal{R}}(x_{n_{l}})]\leq\mathbb{E}\left[{\mathcal{R}}(x^{*})\right]+\lim_{l\to\infty}\mathbb{E}\left[\langle\xi_{n_{l}},x_{n_{l}}-x^{*}\rangle\right]=\mathbb{E}[{\mathcal{R}}(x^{*})]

which together with (3.13) then implies

liml𝔼[(xnl)]=𝔼[(x)]\displaystyle\lim_{l\to\infty}\mathbb{E}[{\mathcal{R}}(x_{n_{l}})]=\mathbb{E}[{\mathcal{R}}(x^{*})] (3.15)

By using (3.15), (3.12) with x=xx=x^{\dagger}, and (3.14) with x^=x\hat{x}=x^{\dagger}, we have

𝔼[(x)]=liml𝔼[(xnl)](x)+liml𝔼[ξnl,xnlx]=(x).\displaystyle\mathbb{E}[{\mathcal{R}}(x^{*})]=\lim_{l\to\infty}\mathbb{E}[{\mathcal{R}}(x_{n_{l}})]\leq{\mathcal{R}}(x^{\dagger})+\lim_{l\to\infty}\mathbb{E}[\langle\xi_{n_{l}},x_{n_{l}}-x^{\dagger}\rangle]={\mathcal{R}}(x^{\dagger}).

Since xx^{*} is a solution of (1.1) in dom()\mbox{dom}({\mathcal{R}}) almost surely, we have (x)(x){\mathcal{R}}(x^{*})\geq{\mathcal{R}}(x^{\dagger}) almost surely which implies 𝔼[(x)](x)\mathbb{E}[{\mathcal{R}}(x^{*})]\geq{\mathcal{R}}(x^{\dagger}). Consequently 𝔼[(x)]=(x)\mathbb{E}[{\mathcal{R}}(x^{*})]={\mathcal{R}}(x^{\dagger}) and thus it follows from (3.15) that

liml𝔼[(xnl)]=(x).\displaystyle\lim_{l\to\infty}\mathbb{E}[{\mathcal{R}}(x_{n_{l}})]={\mathcal{R}}(x^{\dagger}). (3.16)

Finally, from (3.16) and (3.14) with x^=x\hat{x}=x^{\dagger} it follows that

liml𝔼[Dξnl(x,xnl)]=liml(𝔼[(x)(xnl)ξnl,xxnl])=0.\displaystyle\lim_{l\to\infty}\mathbb{E}\left[D_{\mathcal{R}}^{\xi_{n_{l}}}(x^{\dagger},x_{n_{l}})\right]=\lim_{l\to\infty}\left(\mathbb{E}[{\mathcal{R}}(x^{\dagger})-{\mathcal{R}}(x_{n_{l}})-\langle\xi_{n_{l}},x^{\dagger}-x_{n_{l}}\rangle]\right)=0.

By the monotonicity of {𝔼[Dξn(x,xn)]}\{\mathbb{E}[D_{\mathcal{R}}^{\xi_{n}}(x^{\dagger},x_{n})]\} we can conclude

limn𝔼[Dξn(x,xn)]=0\lim_{n\to\infty}\mathbb{E}\left[D_{\mathcal{R}}^{\xi_{n}}(x^{\dagger},x_{n})\right]=0

and hence limn𝔼[xnx2]=0\lim_{n\to\infty}\mathbb{E}[\|x_{n}-x^{\dagger}\|^{2}]=0 by the strong convexity of {\mathcal{R}}. The proof is therefore complete. ∎

In order to use Theorem 3.2 to establish the convergence of Algorithm 1 with noisy data, we need to investigate the stability of the algorithm, i.e. the behavior of xnδx_{n}^{\delta} and ξnδ\xi_{n}^{\delta} as δ0\delta\to 0 for each fixed nn. We will provide stability results of Algorithm 1 under the following three choices of step-sizes:

  1. (s1)

    tnδt_{n}^{\delta} depends only on InI_{n}, i.e. tnδ=tInt_{n}^{\delta}=t_{I_{n}} and 0<tI<4σ/AI20<t_{I}<4\sigma/\|A_{I}\|^{2} for all I{1,,p}I\subset\{1,\cdots,p\} with |I|=b|I|=b.

  2. (s2)

    tnδt_{n}^{\delta} is chosen according to the formula

    tnδ=min{μ0AInxnδyInδ2AIn(AInxnδyInδ)2,μ~1},t_{n}^{\delta}=\min\left\{\frac{\mu_{0}\|A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}\|^{2}}{\|A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta})\|^{2}},\tilde{\mu}_{1}\right\},

    where μ0\mu_{0} and μ~1\tilde{\mu}_{1} are two positive constants with 0<μ0<4σ0<\mu_{0}<4\sigma.

  3. (s3)

    In case the information of δ1,,δp\delta_{1},\cdots,\delta_{p} is available, choose the step-size tnδt_{n}^{\delta} according to the rule

    tnδ={min{μ0AInxnδyInδ2AIn(AInxnδyInδ)2,μ~1} if AInxnδyInδ>τδIn,0 otherwise,\displaystyle t_{n}^{\delta}=\left\{\begin{array}[]{lll}\min\left\{\frac{\mu_{0}\|A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}\|^{2}}{\|A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta})\|^{2}},\tilde{\mu}_{1}\right\}&\mbox{ if }\|A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}\|>\tau\delta_{I_{n}},\\[5.16663pt] 0&\mbox{ otherwise},\end{array}\right.

    where τ1\tau\geq 1, μ0>0\mu_{0}>0 and μ~1>0\tilde{\mu}_{1}>0 are preassigned numbers with 0<μ0<4σ0<\mu_{0}<4\sigma.

The step-size chosen in (s1) is motivated by the Landweber iteration which uses a constant step-size. Along the iterations in Algorithm 1, when the same subset II is repeatedly chosen, (s1) suggests using the same step-size in computation. The step-sizes chosen by (s1) could be small and thus slow down the computation. The rules given in (s2) and (s3) use an adaptive strategy which may produce larger step-sizes. The step-size chosen by (s2) is motivated by the deterministic minimal error method, see [24]. The step-size given in (s3) is motivated by the work on the deterministic Landweber-Kaczmarz method, see [15, 20], and it incorporates the spirit of the discrepancy principle into the selection.
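In the setting of the Algorithm 1 sketch given in Section 2.2 (where $\sigma=1/2$, so that $\mu_{0}$ must satisfy $0<\mu_{0}<4\sigma=2$), the rules (s2) and (s3) can be realized, for instance, as follows; the parameter values are illustrative only.

import numpy as np

def rule_s2(res, grad, delta_I, mu0=1.0, mu1=1e4):
    # t_n = min{ mu0 * ||A_I x_n - y_I^delta||^2 / ||A_I^*(A_I x_n - y_I^delta)||^2 , mu1 }
    gn2 = float(np.dot(grad, grad))
    return min(mu0 * float(np.dot(res, res)) / gn2, mu1) if gn2 > 0.0 else 0.0

def rule_s3(res, grad, delta_I, mu0=1.0, mu1=1e4, tau=1.5):
    # as (s2), but the step-size is set to zero once ||A_I x_n - y_I^delta|| <= tau * delta_{I_n}
    if np.linalg.norm(res) <= tau * delta_I:
        return 0.0
    return rule_s2(res, grad, delta_I, mu0, mu1)

Rule (s1) would instead return a fixed value $t_{I}$ depending only on the selected block, chosen with $0<t_{I}<4\sigma/\|A_{I}\|^{2}$.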

Let {xn,ξn}\{x_{n},\xi_{n}\} be defined by Algorithm 1 with exact data, where the step-size tnt_{n} is chosen by (s1), (s2) or (s3) with the superscript δ\delta dropped and with δIn\delta_{I_{n}} replaced by 0. It is easy to see that tnt_{n} satisfies (3.2) in Theorem 3.2. Therefore, Theorem 3.2 is applicable to {xn,ξn}\{x_{n},\xi_{n}\}.

The following result gives a stability property of Algorithm 1 with the step-sizes chosen by (s1), (s2) or (s3).

Lemma 3.3.

Let Assumption 1 hold. Consider Algorithm 1 with the step-size chosen by (s1), (s2) or (s3). Then for each fixed integer n0n\geq 0 there hold

𝔼[xnδxn2]0and𝔼[ξnδξn2]0\mathbb{E}[\|x_{n}^{\delta}-x_{n}\|^{2}]\to 0\quad\mbox{and}\quad\mathbb{E}[\|\xi_{n}^{\delta}-\xi_{n}\|^{2}]\to 0

as δ0\delta\to 0.

Proof.

We show the result by induction on nn. The result is trivial for n=0n=0 because ξ0δ=ξ0\xi_{0}^{\delta}=\xi_{0}. Assuming the result holds for some n0n\geq 0, we will show it also holds for n+1n+1. Since there are only finitely many sample paths of the form (I0,,In1)(I_{0},\cdots,I_{n-1}) each with positive probability, we can conclude from the induction hypothesis that

xnδxn0 and ξnδξn0as δ0\displaystyle\|x_{n}^{\delta}-x_{n}\|\to 0\quad\mbox{ and }\quad\|\xi_{n}^{\delta}-\xi_{n}\|\to 0\quad\mbox{as }\delta\to 0 (3.17)

along every sample path (I0,,In1)(I_{0},\cdots,I_{n-1}). We now show that along each sample path (I0,,In1,In)(I_{0},\cdots,I_{n-1},I_{n}) there hold

xn+1δxn+10 and ξn+1δξn+10as δ0.\displaystyle\|x_{n+1}^{\delta}-x_{n+1}\|\to 0\quad\mbox{ and }\quad\|\xi_{n+1}^{\delta}-\xi_{n+1}\|\to 0\quad\mbox{as }\delta\to 0. (3.18)

To see this, we first show ξn+1δξn+10\|\xi_{n+1}^{\delta}-\xi_{n+1}\|\to 0 as δ0\delta\to 0 by considering two cases.

We first consider the case AInxnyIn=0A_{I_{n}}x_{n}-y_{I_{n}}=0. For this case we have ξn+1=ξn\xi_{n+1}=\xi_{n} and hence

ξn+1δξn+1=ξnδξntnδAIn(AIn(xnδxn)(yInδyIn)).\displaystyle\xi_{n+1}^{\delta}-\xi_{n+1}=\xi_{n}^{\delta}-\xi_{n}-t_{n}^{\delta}A_{I_{n}}^{*}\left(A_{I_{n}}(x_{n}^{\delta}-x_{n})-(y_{I_{n}}^{\delta}-y_{I_{n}})\right).

Note that for all the choices of the step-sizes given in (s1), (s2) and (s3) we have 0tnδμ^10\leq t_{n}^{\delta}\leq\hat{\mu}_{1}, where μ^1\hat{\mu}_{1} is a constant independent of nn and δ\delta. Note also that δInδ\delta_{I_{n}}\leq\delta. Therefore

ξn+1δξn+1ξnδξn+μ^1AIn(AInxnδxn+δ).\displaystyle\|\xi_{n+1}^{\delta}-\xi_{n+1}\|\leq\|\xi_{n}^{\delta}-\xi_{n}\|+\hat{\mu}_{1}\|A_{I_{n}}\|\left(\|A_{I_{n}}\|\|x_{n}^{\delta}-x_{n}\|+\delta\right).

Consequently, by using (3.17) we can obtain ξn+1δξn+10\|\xi_{n+1}^{\delta}-\xi_{n+1}\|\to 0 as δ0\delta\to 0.

Next we consider the case AInxnyIn0A_{I_{n}}x_{n}-y_{I_{n}}\neq 0. By using (3.17) we have AInxnδyInδ>τδIn\|A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}\|>\tau\delta_{I_{n}} when δ>0\delta>0 is sufficiently small. Note that

AIn(AInxnyIn),xnx=AInxnyIn2>0\displaystyle\langle A_{I_{n}}^{*}(A_{I_{n}}x_{n}-y_{I_{n}}),x_{n}-x^{\dagger}\rangle=\|A_{I_{n}}x_{n}-y_{I_{n}}\|^{2}>0

which implies that AIn(AInxnyIn)0A_{I_{n}}^{*}(A_{I_{n}}x_{n}-y_{I_{n}})\neq 0. Thus the step-size chosen by (s2) or (s3) is given by

tnδ=min{μ0AInxnδyInδ2AIn(AInxnδyInδ)2,μ~1}.t_{n}^{\delta}=\min\left\{\frac{\mu_{0}\|A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}\|^{2}}{\|A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta})\|^{2}},\tilde{\mu}_{1}\right\}.

By using again (3.17) we have

tnδtn as δ0.\displaystyle t_{n}^{\delta}\to t_{n}\quad\mbox{ as }\delta\to 0. (3.19)

For the step-size chosen by (s1) the equation (3.19) holds trivially as tnδ=tn=tInt_{n}^{\delta}=t_{n}=t_{I_{n}}. Therefore, by noting that

ξn+1δξn+1\displaystyle\xi_{n+1}^{\delta}-\xi_{n+1} =ξnδξntnδAIn(AInxnδyInδ)+tnAIn(AInxnyIn)\displaystyle=\xi_{n}^{\delta}-\xi_{n}-t_{n}^{\delta}A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta})+t_{n}A_{I_{n}}^{*}(A_{I_{n}}x_{n}-y_{I_{n}})
=ξnδξn+(tntnδ)AIn(AInxnyIn)\displaystyle=\xi_{n}^{\delta}-\xi_{n}+(t_{n}-t_{n}^{\delta})A_{I_{n}}^{*}(A_{I_{n}}x_{n}-y_{I_{n}})
+tnδAIn(AIn(xnxnδ)+(yInδyIn)),\displaystyle\quad\,+t_{n}^{\delta}A_{I_{n}}^{*}\left(A_{I_{n}}(x_{n}-x_{n}^{\delta})+(y_{I_{n}}^{\delta}-y_{I_{n}})\right),

we can use (3.17) and (3.19) to conclude again ξn+1δξn+10\|\xi_{n+1}^{\delta}-\xi_{n+1}\|\to 0 as δ0\delta\to 0.

Now by virtue of the formula xn+1δ=(ξn+1δ)x_{n+1}^{\delta}=\nabla{\mathcal{R}}^{*}(\xi_{n+1}^{\delta}), xn+1=(ξn+1)x_{n+1}=\nabla{\mathcal{R}}^{*}(\xi_{n+1}) and the continuity of \nabla{\mathcal{R}}^{*}, we have xn+1δxn+10\|x_{n+1}^{\delta}-x_{n+1}\|\to 0 as δ0\delta\to 0. We thus obtain (3.18).

Finally, since there are only finitely many sample paths of the form (I0,,In)(I_{0},\cdots,I_{n}), we can use (3.18) to conclude

𝔼[xn+1δxn+12]0and𝔼[ξn+1δξn+12]0\mathbb{E}\left[\|x_{n+1}^{\delta}-x_{n+1}\|^{2}\right]\to 0\quad\mbox{and}\quad\mathbb{E}\left[\|\xi_{n+1}^{\delta}-\xi_{n+1}\|^{2}\right]\to 0

as δ0\delta\to 0. The proof is thus complete by the induction principle. ∎

Theorem 3.4.

Let Assumption 1 hold. Consider Algorithm 1 with the step-size chosen by (s1), (s2) or (s3). If the integer nδn_{\delta} is chosen such that nδn_{\delta}\to\infty and δ2nδ0\delta^{2}n_{\delta}\to 0 as δ0\delta\to 0, then

𝔼[xnδδx2]0 and 𝔼[Dξnδδ(x,xnδδ)]0\mathbb{E}\left[\|x_{n_{\delta}}^{\delta}-x^{\dagger}\|^{2}\right]\to 0\quad\mbox{ and }\quad\mathbb{E}\left[D_{\mathcal{R}}^{\xi_{n_{\delta}}^{\delta}}(x^{\dagger},x_{n_{\delta}}^{\delta})\right]\to 0

as δ0\delta\to 0.

Proof.

Let Δnδ:=Dξnδ(x,xnδ)\Delta_{n}^{\delta}:=D_{\mathcal{R}}^{\xi_{n}^{\delta}}(x^{\dagger},x_{n}^{\delta}) for all n0n\geq 0. By the strong convexity of {\mathcal{R}} it suffices to show 𝔼[Δnδδ]0\mathbb{E}[\Delta_{n_{\delta}}^{\delta}]\to 0 as δ0\delta\to 0. According to Lemma 3.1 we have

𝔼[Δn+1δ]𝔼[Δnδ]μ1b4c0pδ2\mathbb{E}\left[\Delta_{n+1}^{\delta}\right]-\mathbb{E}\left[\Delta_{n}^{\delta}\right]\leq\frac{\mu_{1}b}{4c_{0}p}\delta^{2}

for all n0n\geq 0. Let nn be any fixed integer. Since nδn_{\delta}\to\infty as δ0\delta\to 0, we have nδ>nn_{\delta}>n for small δ>0\delta>0. Thus, we may repeatedly use the above inequality to obtain

𝔼[Δnδδ]𝔼[Δnδ]+μ1b4c0p(nδn)δ2.\displaystyle\mathbb{E}\left[\Delta_{n_{\delta}}^{\delta}\right]\leq\mathbb{E}\left[\Delta_{n}^{\delta}\right]+\frac{\mu_{1}b}{4c_{0}p}(n_{\delta}-n)\delta^{2}.

Since δ2nδ0\delta^{2}n_{\delta}\to 0 as δ0\delta\to 0, we thus have

lim supδ0𝔼[Δnδδ]\displaystyle\limsup_{\delta\to 0}\mathbb{E}\left[\Delta_{n_{\delta}}^{\delta}\right]
lim supδ0𝔼[Δnδ]=lim supδ0((x)𝔼[(xnδ)]𝔼[ξnδ,xxnδ])\displaystyle\leq\limsup_{\delta\to 0}\mathbb{E}\left[\Delta_{n}^{\delta}\right]=\limsup_{\delta\to 0}\left({\mathcal{R}}(x^{\dagger})-\mathbb{E}\left[{\mathcal{R}}(x_{n}^{\delta})\right]-\mathbb{E}\left[\langle\xi_{n}^{\delta},x^{\dagger}-x_{n}^{\delta}\rangle\right]\right)

for any n0n\geq 0. With the help of the Cauchy-Schwarz inequality we have

|𝔼[ξnδ,xxnδ]𝔼[ξn,xxn]|\displaystyle\left|\mathbb{E}\left[\langle\xi_{n}^{\delta},x^{\dagger}-x_{n}^{\delta}\rangle\right]-\mathbb{E}\left[\langle\xi_{n},x^{\dagger}-x_{n}\rangle\right]\right|
|𝔼[ξnδξn,xxnδ]|+|𝔼[ξn,xnxnδ]|\displaystyle\qquad\qquad\qquad\leq\left|\mathbb{E}\left[\langle\xi_{n}^{\delta}-\xi_{n},x^{\dagger}-x_{n}^{\delta}\rangle\right]\right|+\left|\mathbb{E}\left[\langle\xi_{n},x_{n}-x_{n}^{\delta}\rangle\right]\right|
(𝔼[ξnδξn2])1/2(𝔼[xnδx2])1/2\displaystyle\qquad\qquad\qquad\leq\left(\mathbb{E}\left[\|\xi_{n}^{\delta}-\xi_{n}\|^{2}\right]\right)^{1/2}\left(\mathbb{E}\left[\|x_{n}^{\delta}-x^{\dagger}\|^{2}\right]\right)^{1/2}
+(𝔼[ξn2])1/2(𝔼[xnxnδ2])1/2.\displaystyle\qquad\qquad\qquad\quad\,+\left(\mathbb{E}[\|\xi_{n}\|^{2}]\right)^{1/2}\left(\mathbb{E}\left[\|x_{n}-x_{n}^{\delta}\|^{2}\right]\right)^{1/2}.

Thus, we may use Lemma 3.3 to conclude

limδ0𝔼[ξnδ,xxnδ]=𝔼[ξn,xxn].\displaystyle\lim_{\delta\to 0}\mathbb{E}\left[\langle\xi_{n}^{\delta},x^{\dagger}-x_{n}^{\delta}\rangle\right]=\mathbb{E}\left[\langle\xi_{n},x^{\dagger}-x_{n}\rangle\right].

According to the proof of Lemma 3.3 we have xnδxn0\|x_{n}^{\delta}-x_{n}\|\to 0 as δ0\delta\to 0 along each sample path (I0,,In1)(I_{0},\cdots,I_{n-1}). Thus, from the lower semi-continuity of {\mathcal{R}} and Fatou’s lemma it follows

𝔼[(xn)]𝔼[lim infδ0(xnδ)]lim infδ0𝔼[(xnδ)].\mathbb{E}[{\mathcal{R}}(x_{n})]\leq\mathbb{E}\left[\liminf_{\delta\to 0}{\mathcal{R}}(x_{n}^{\delta})\right]\leq\liminf_{\delta\to 0}\mathbb{E}\left[{\mathcal{R}}(x_{n}^{\delta})\right].

Consequently

lim supδ0𝔼[Δnδδ]\displaystyle\limsup_{\delta\to 0}\mathbb{E}\left[\Delta_{n_{\delta}}^{\delta}\right] (x)lim infδ0𝔼[(xnδ)]limδ0𝔼[ξnδ,xxnδ]\displaystyle\leq{\mathcal{R}}(x^{\dagger})-\liminf_{\delta\to 0}\mathbb{E}\left[{\mathcal{R}}(x_{n}^{\delta})\right]-\lim_{\delta\to 0}\mathbb{E}\left[\langle\xi_{n}^{\delta},x^{\dagger}-x_{n}^{\delta}\rangle\right]
(x)𝔼[(xn)]𝔼[ξn,xxn]\displaystyle\leq{\mathcal{R}}(x^{\dagger})-\mathbb{E}[{\mathcal{R}}(x_{n})]-\mathbb{E}\left[\langle\xi_{n},x^{\dagger}-x_{n}\rangle\right]
=𝔼[Dξn(x,xn)]\displaystyle=\mathbb{E}[D_{\mathcal{R}}^{\xi_{n}}(x^{\dagger},x_{n})]

for all n0n\geq 0. Letting nn\to\infty and using Theorem 3.2 we therefore obtain 𝔼[Δnδδ]0\mathbb{E}[\Delta_{n_{\delta}}^{\delta}]\to 0 as δ0\delta\to 0. ∎

Theorem 3.4 provides convergence results on Algorithm 1 under various choices of the step-sizes. For the step-size chosen by (s3), if further conditions are imposed on τ\tau and μ0\mu_{0}, we can show that the iteration error of Algorithm 1 always decreases.

Lemma 3.5.

Let Assumption 1 hold. Consider Algorithm 1 with the step-size chosen by (s3). If μ0>0\mu_{0}>0 and τ>1\tau>1 are chosen such that

11τμ04σ0,\displaystyle 1-\frac{1}{\tau}-\frac{\mu_{0}}{4\sigma}\geq 0, (3.20)

then Δn+1δΔnδ\Delta_{n+1}^{\delta}\leq\Delta_{n}^{\delta} and consequently 𝔼[Δn+1δ]𝔼[Δnδ]\mathbb{E}\left[\Delta_{n+1}^{\delta}\right]\leq\mathbb{E}\left[\Delta_{n}^{\delta}\right] for all n0n\geq 0, where Δnδ:=Dξnδ(x,xnδ)\Delta_{n}^{\delta}:=D_{\mathcal{R}}^{\xi_{n}^{\delta}}(x^{\dagger},x_{n}^{\delta}).

Proof.

According to the proof of Lemma 3.1, we have

Δn+1δΔnδ\displaystyle\Delta_{n+1}^{\delta}-\Delta_{n}^{\delta} 14σ(tnδ)2AIn(AInxnδyInδ)2tnδAInxnδyInδ,AInxnδyIn\displaystyle\leq\frac{1}{4\sigma}\left(t_{n}^{\delta}\right)^{2}\|A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta})\|^{2}-t_{n}^{\delta}\langle A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta},A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}\rangle
14σ(tnδ)2AIn(AInxnδyInδ)2tnδAInxnδyInδ2\displaystyle\leq\frac{1}{4\sigma}\left(t_{n}^{\delta}\right)^{2}\|A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta})\|^{2}-t_{n}^{\delta}\|A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}\|^{2}
+δIntnδAInxnδyInδ.\displaystyle\quad\,+\delta_{I_{n}}t_{n}^{\delta}\|A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}\|.

By the definition of tnδt_{n}^{\delta} we have

tnδAIn(AInxnδyInδ)2μ0AInxnδyInδ2,\displaystyle t_{n}^{\delta}\|A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta})\|^{2}\leq\mu_{0}\|A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}\|^{2},
δIntnδAInxnδyInδ1τtnδAInxnδyInδ2.\displaystyle\delta_{I_{n}}t_{n}^{\delta}\|A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}\|\leq\frac{1}{\tau}t_{n}^{\delta}\|A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}\|^{2}.

Therefore

Δn+1δΔnδ(11τμ04σ)tnδAInxnδyInδ2.\displaystyle\Delta_{n+1}^{\delta}-\Delta_{n}^{\delta}\leq-\left(1-\frac{1}{\tau}-\frac{\mu_{0}}{4\sigma}\right)t_{n}^{\delta}\|A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}\|^{2}.

Since μ0\mu_{0} and τ\tau satisfy (3.20), this implies Δn+1δΔnδ\Delta_{n+1}^{\delta}\leq\Delta_{n}^{\delta}. By taking the expectation we then obtain 𝔼[Δn+1δ]𝔼[Δnδ]\mathbb{E}[\Delta_{n+1}^{\delta}]\leq\mathbb{E}[\Delta_{n}^{\delta}]. ∎

Theorem 3.6.

Let Assumption 1 hold. Consider Algorithm 1 with step-size chosen by (s3), where τ>1\tau>1 and μ0>0\mu_{0}>0 are chosen to satisfy (3.20). If the integer nδn_{\delta} is chosen such that nδn_{\delta}\to\infty as δ0\delta\to 0, then

𝔼[xnδδx2]0 and 𝔼[Dξnδδ(x,xnδδ)]0\mathbb{E}\left[\|x_{n_{\delta}}^{\delta}-x^{\dagger}\|^{2}\right]\to 0\quad\mbox{ and }\quad\mathbb{E}\left[D_{\mathcal{R}}^{\xi_{n_{\delta}}^{\delta}}(x^{\dagger},x_{n_{\delta}}^{\delta})\right]\to 0

as δ0\delta\to 0.

Proof.

Let n0n\geq 0 be any fixed integer. Since nδn_{\delta}\to\infty as δ0\delta\to 0, we have nδ>nn_{\delta}>n for sufficiently small δ>0\delta>0. It then follows from Lemma 3.5 that

𝔼[Δnδδ]𝔼[Δnδ]\mathbb{E}\left[\Delta_{n_{\delta}}^{\delta}\right]\leq\mathbb{E}\left[\Delta_{n}^{\delta}\right]

By using Lemma 3.3 and following the proof of Theorem 3.4 we can further conclude

lim supδ0𝔼[Δnδδ]𝔼[Δn]\limsup_{\delta\to 0}\mathbb{E}\left[\Delta_{n_{\delta}}^{\delta}\right]\leq\mathbb{E}[\Delta_{n}]

for all integers n0n\geq 0. Finally, by letting nn\to\infty, we can use Theorem 3.2 to obtain 𝔼[Δnδδ]0\mathbb{E}[\Delta_{n_{\delta}}^{\delta}]\to 0 as δ0\delta\to 0. ∎

We remark that, unlike Theorem 3.4, the convergence result given in Theorem 3.6 requires only nδn_{\delta} to satisfy nδn_{\delta}\to\infty as δ0\delta\to 0. This is not surprising because the discrepancy principle has been incorporated into the choice of the step-size by (s3). Theorem 3.6 can be viewed as a stochastic extension of the corresponding result for the deterministic Landweber-Kaczmarz method ([20]) for solving ill-posed problems. It should be pointed out that, due to (3.20), the application of Theorem 3.6 requires either τ\tau to be sufficiently large or μ0\mu_{0} to be sufficiently small, which may lead the corresponding algorithm to lose accuracy or converge slowly. However, our numerical simulations in Section 5 demonstrate that using step-sizes chosen by (s3) without satisfying (3.20) still has the effect of decreasing errors as the iteration proceeds.

4. Rate of convergence

In Section 3 we have proved the convergence of Algorithm 1 under various choices of the step-sizes. In this section we will focus on deriving convergence rates. For ill-posed problems, this requires the sought solution to satisfy suitable source conditions. We will consider the benchmark source condition of the form ([7])

Aλ(x)\displaystyle A^{*}\lambda^{\dagger}\in\partial{\mathcal{R}}(x^{\dagger}) (4.1)

for some λ:=(λ,1,,λ,p)Y:=Y1××Yp\lambda^{\dagger}:=(\lambda_{,1}^{\dagger},\cdots,\lambda_{,p}^{\dagger})\in Y:=Y_{1}\times\cdots\times Y_{p}, where, here and below, for any λY\lambda\in Y we use λ,i\lambda_{,i} to denote its ii-th component in YiY_{i}. Deriving convergence rates under general choices of step-sizes can be very challenging. Therefore, in this section we will restrict ourselves to the step-sizes chosen by (s1) that includes the constant step-sizes. The corresponding algorithm can be reformulated as follows.

Algorithm 2.

Fix a batch size bb, pick the initial guess ξ0=0\xi_{0}=0 in XX^{*} and step-sizes tIt_{I} for all I{1,,p}I\subset\{1,\cdots,p\} with |I|=b|I|=b. Set ξ0δ:=ξ0\xi_{0}^{\delta}:=\xi_{0}. For n0n\geq 0 do the following:

  1. (i)

    Calculate xnδXx_{n}^{\delta}\in X by solving (2.10);

  2. (ii)

    Randomly select a subset In{1,,p}I_{n}\subset\{1,\cdots,p\} with |In|=b|I_{n}|=b via the uniform distribution;

  3. (iii)

    Define ξn+1δX\xi_{n+1}^{\delta}\in X^{*} by (2.11).
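
For illustration only, Algorithm 2 admits the following short Python sketch; the mirror step argmin_R, which realizes step (i) for the chosen {\mathcal{R}}, is assumed to be supplied by the user, and all variable names are ours.

```python
import numpy as np

def algorithm2(A_blocks, y_blocks, argmin_R, t, b, n_iter, rng=None):
    """Illustrative sketch of Algorithm 2: mini-batch stochastic mirror
    descent with a constant step-size t for every batch I with |I| = b.

    A_blocks : list of p matrices; A_blocks[i] discretizes A_i
    y_blocks : list of p noisy data vectors y_i^delta
    argmin_R : callable xi -> argmin_x { R(x) - <xi, x> }  (the mirror step (2.10))
    """
    rng = np.random.default_rng() if rng is None else rng
    p, d = len(A_blocks), A_blocks[0].shape[1]
    xi = np.zeros(d)                              # xi_0 = 0 in X^*
    for _ in range(n_iter):
        x = argmin_R(xi)                          # step (i)
        I = rng.choice(p, size=b, replace=False)  # step (ii): uniform random batch
        for i in I:                               # step (iii): update xi by (2.11)
            xi = xi - t * A_blocks[i].T @ (A_blocks[i] @ x - y_blocks[i])
    return argmin_R(xi)
```

With {\mathcal{R}}(x)=\|x\|^{2}/2 one may pass argmin_R = lambda xi: xi, in which case the sketch reduces to the mini-batch stochastic gradient method (5.2) used in Example 5.1 below.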

For Algorithm 2 we have the following convergence rate result.

Theorem 4.1.

Let Assumption 1 hold and consider Algorithm 2. Assume that 0<tI<4σ/AI20<t_{I}<4\sigma/\|A_{I}\|^{2} for all subsets I{1,,p}I\subset\{1,\cdots,p\} with |I|=b|I|=b; furthermore, tI=tt_{I}=t with a constant tt for all such II in case b>1b>1. If the sought solution xx^{\dagger} satisfies the source condition (4.1) and the integer nδn_{\delta} is chosen such that 1+bpnδ1+\frac{b}{p}n_{\delta} is of the order of δ1\delta^{-1}, then

𝔼[xnδδx2]Cδ,{\mathbb{E}}\left[\|x_{n_{\delta}}^{\delta}-x^{\dagger}\|^{2}\right]\leq C\delta,

where CC is a positive constant independent of δ\delta, pp and bb.

When XX is a Hilbert space and (x)=x2/2{\mathcal{R}}(x)=\|x\|^{2}/2, the stochastic mirror descent method becomes the stochastic gradient method. In the existing literature on the stochastic gradient method for solving ill-posed problems, sub-optimal convergence rates have been derived for diminishing step-sizes under general source conditions on the sought solution, see [18, 25]. Our result given in Theorem 4.1 supplements these results by demonstrating that the stochastic gradient method can achieve the order optimal convergence rate O(δ)O(\delta) under constant step-sizes if the sought solution satisfies the source condition (4.1).

The result in Theorem 4.1 also demonstrates the role played by the batch size bb: to achieve the same convergence rate, fewer iterations are required if a larger batch size bb is used. However, this does not mean that the corresponding algorithm with a larger batch size bb is faster, because the computational time at each iteration can increase as bb increases. How to determine a batch size with the best performance remains a challenging open question.

The proof of Theorem 4.1 is based on considering an equivalent formulation of Algorithm 2 which we will derive in the following. The derivation is based on applying the randomized block gradient method ([26, 32, 36]) to the dual problem of (1.4) with yy replaced by yδy^{\delta}. The associated Lagrangian function is

L(x;λ):=(x)λ,Axyδ,L(x;\lambda):={\mathcal{R}}(x)-\langle\lambda,Ax-y^{\delta}\rangle,

where xXx\in X and λY\lambda\in Y. Thus the dual function is

infxXL(x;λ)\displaystyle\inf_{x\in X}L(x;\lambda) =supxX{Aλ,x(x)}+λ,yδ=(Aλ)+λ,yδ\displaystyle=-\sup_{x\in X}\left\{\left\langle A^{*}\lambda,x\right\rangle-{\mathcal{R}}(x)\right\}+\langle\lambda,y^{\delta}\rangle=-{\mathcal{R}}^{*}\left(A^{*}\lambda\right)+\langle\lambda,y^{\delta}\rangle

which means the dual problem is

minλY{dyδ(λ):=(Aλ)λ,yδ}.\displaystyle\min_{\lambda\in Y}\left\{d_{y^{\delta}}(\lambda):={\mathcal{R}}^{*}\left(A^{*}\lambda\right)-\langle\lambda,y^{\delta}\rangle\right\}. (4.2)

Note that {\mathcal{R}}^{*} is continuously differentiable and

dyδ(λ)=(i=1pAiλ,i)i=1pλ,i,yiδ.d_{y^{\delta}}(\lambda)={\mathcal{R}}^{*}\left(\sum_{i=1}^{p}A_{i}^{*}\lambda_{,i}\right)-\sum_{i=1}^{p}\langle\lambda_{,i},y_{i}^{\delta}\rangle.

Therefore we may solve (4.2) by the randomized block gradient method which iteratively selects partial components of λ\lambda at random to be updated by a partial gradient of dyδd_{y^{\delta}} and leaves the other components unchanged. To be more precise, for any λY\lambda\in Y and any subset I{1,,p}I\subset\{1,\cdots,p\}, let λI\lambda_{I} denote the group of components λ,i\lambda_{,i} of λ\lambda with iIi\in I. Assume λnδ:=(λn,1δ,,λn,pδ)Y\lambda_{n}^{\delta}:=(\lambda_{n,1}^{\delta},\cdots,\lambda_{n,p}^{\delta})\in Y is a current iterate. We then choose a subset of indices InI_{n} from {1,,p}\{1,\cdots,p\} with |In|=b|I_{n}|=b randomly with uniform distribution and define λn+1δ:=(λn+1,1δ,,λn+1,pδ)\lambda_{n+1}^{\delta}:=(\lambda_{n+1,1}^{\delta},\cdots,\lambda_{n+1,p}^{\delta}) by setting λn+1,iδ=λn,iδ\lambda_{n+1,i}^{\delta}=\lambda_{n,i}^{\delta} if iIni\not\in I_{n} and

λn+1,Inδ=λn,InδtInIndyδ(λnδ),\displaystyle\lambda_{n+1,I_{n}}^{\delta}=\lambda_{n,I_{n}}^{\delta}-t_{I_{n}}\nabla_{I_{n}}d_{y^{\delta}}(\lambda_{n}^{\delta}), (4.3)

with step-sizes tIn>0t_{I_{n}}>0 depending only on InI_{n}, where, for each I{1,,p}I\subset\{1,\cdots,p\}, Idyδ\nabla_{I}d_{y^{\delta}} denotes the partial gradient of dyδd_{y^{\delta}} with respect to λI\lambda_{I}. Note that

Idyδ(λ)=AI(Aλ)yIδ.\nabla_{I}d_{y^{\delta}}(\lambda)=A_{I}\nabla{\mathcal{R}}^{*}\left(A^{*}\lambda\right)-y_{I}^{\delta}.

Therefore, by setting xnδ:=(Aλnδ)x_{n}^{\delta}:=\nabla{\mathcal{R}}^{*}\left(A^{*}\lambda_{n}^{\delta}\right), we can rewrite (4.3) as

λn+1,Inδ=λn,InδtIn(AInxnδyInδ).\lambda_{n+1,I_{n}}^{\delta}=\lambda_{n,I_{n}}^{\delta}-t_{I_{n}}\left(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}\right).

By using the definition of xnδx_{n}^{\delta} and (2.4) we have Aλnδ(xnδ)A^{*}\lambda_{n}^{\delta}\in\partial{\mathcal{R}}(x_{n}^{\delta}) which implies

0(λnδ,A)(xnδ)0\in\partial\left({\mathcal{R}}-\langle\lambda_{n}^{\delta},A\cdot\rangle\right)(x_{n}^{\delta})

and thus

xnδ=argminxX{(x)λnδ,Ax}.\displaystyle x_{n}^{\delta}=\arg\min_{x\in X}\left\{{\mathcal{R}}(x)-\langle\lambda_{n}^{\delta},Ax\rangle\right\}. (4.4)

Combining the above analysis we thus obtain the following mini-batch randomized dual block gradient method for solving (1.2) with noisy data.

Algorithm 3.

Fix a batch size bb, pick the initial guess λ0:=(0,,0)Y:=Y1××Yp\lambda_{0}:=(0,\cdots,0)\in Y:=Y_{1}\times\cdots\times Y_{p} and step-sizes tIt_{I} for all I{1,,p}I\subset\{1,\cdots,p\} with |I|=b|I|=b. Set λ0δ:=λ0\lambda_{0}^{\delta}:=\lambda_{0}. For n0n\geq 0 do the following:

  1. (i)

    Calculate xnδXx_{n}^{\delta}\in X by using (4.4);

  2. (ii)

    Randomly select a subset In{1,,p}I_{n}\subset\{1,\cdots,p\} with |In|=b|I_{n}|=b via the uniform distribution;

  3. (iii)

    Define λn+1δ:=(λn+1,1δ,,λn+1,pδ)Y\lambda_{n+1}^{\delta}:=(\lambda_{n+1,1}^{\delta},\cdots,\lambda_{n+1,p}^{\delta})\in Y by setting λn+1,Incδ=λn,Incδ\lambda_{n+1,I_{n}^{c}}^{\delta}=\lambda_{n,I_{n}^{c}}^{\delta} and

    λn+1,Inδ=λn,InδtIn(AInxnδyInδ),\displaystyle\lambda_{n+1,I_{n}}^{\delta}=\lambda_{n,I_{n}}^{\delta}-t_{I_{n}}\left(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}\right), (4.5)

    where IncI_{n}^{c} denotes the complement of InI_{n} in {1,,p}\{1,\cdots,p\}.

Let us demonstrate the equivalence between Algorithm 2 and Algorithm 3. If {xnδ,λnδ}\{x_{n}^{\delta},\lambda_{n}^{\delta}\} is defined by Algorithm 3, by defining ξnδ:=Aλnδ\xi_{n}^{\delta}:=A^{*}\lambda_{n}^{\delta} we have

ξn+1δ\displaystyle\xi_{n+1}^{\delta} =Aλn+1δ=AIncλn+1,Incδ+AInλn+1,Inδ\displaystyle=A^{*}\lambda_{n+1}^{\delta}=A_{I_{n}^{c}}^{*}\lambda_{n+1,I_{n}^{c}}^{\delta}+A_{I_{n}}^{*}\lambda_{n+1,I_{n}}^{\delta}
=AIncλn,Incδ+AIn(λn,InδtIn(AInxnδyInδ))\displaystyle=A_{I_{n}^{c}}^{*}\lambda_{n,I_{n}^{c}}^{\delta}+A_{I_{n}}^{*}\left(\lambda_{n,I_{n}}^{\delta}-t_{I_{n}}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta})\right)
=(AIncλn,Incδ+AInλn,Inδ)tInAIn(AInxnδyInδ)\displaystyle=\left(A_{I_{n}^{c}}^{*}\lambda_{n,I_{n}^{c}}^{\delta}+A_{I_{n}}^{*}\lambda_{n,I_{n}}^{\delta}\right)-t_{I_{n}}A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta})
=ξnδtInAIn(AInxnδyInδ).\displaystyle=\xi_{n}^{\delta}-t_{I_{n}}A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}).

Therefore {xnδ,ξnδ}\{x_{n}^{\delta},\xi_{n}^{\delta}\} can be produced by Algorithm 2. Conversely, if {xnδ,ξnδ}\{x_{n}^{\delta},\xi_{n}^{\delta}\} is defined by Algorithm 2, by using ξ0δ=0\xi_{0}^{\delta}=0 one can easily see that ξnδRan(A)\xi_{n}^{\delta}\in\mbox{Ran}(A^{*}), the range of AA^{*}. By writing ξnδ=Aλnδ\xi_{n}^{\delta}=A^{*}\lambda_{n}^{\delta} for some λnδ=(λn,1δ,,λn,pδ)Y\lambda_{n}^{\delta}=(\lambda_{n,1}^{\delta},\cdots,\lambda_{n,p}^{\delta})\in Y, we have

ξn+1δ\displaystyle\xi_{n+1}^{\delta} =ξnδtInAIn(AInxnδyInδ)=AλnδtInAIn(AInxnδyInδ)\displaystyle=\xi_{n}^{\delta}-t_{I_{n}}A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta})=A^{*}\lambda_{n}^{\delta}-t_{I_{n}}A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta})
=AIncλn,Incδ+AIn(λn,InδtIn(AInxnδyInδ))\displaystyle=A_{I_{n}^{c}}^{*}\lambda_{n,I_{n}^{c}}^{\delta}+A_{I_{n}}^{*}\left(\lambda_{n,I_{n}}^{\delta}-t_{I_{n}}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta})\right)

Define λn+1δY\lambda_{n+1}^{\delta}\in Y by setting λn+1,Incδ=λn,Incδ\lambda_{n+1,I_{n}^{c}}^{\delta}=\lambda_{n,I_{n}^{c}}^{\delta} and λn+1,Inδ=λn,InδtIn(AInxnδyInδ)\lambda_{n+1,I_{n}}^{\delta}=\lambda_{n,I_{n}}^{\delta}-t_{I_{n}}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}). We can see that ξn+1δ=Aλn+1δ\xi_{n+1}^{\delta}=A^{*}\lambda_{n+1}^{\delta} and {xnδ,λnδ}\{x_{n}^{\delta},\lambda_{n}^{\delta}\} can be generated by Algorithm 3. Therefore we obtain

Lemma 4.2.

Algorithm 2 and Algorithm 3 are equivalent.
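
The equivalence can also be observed numerically. The toy Python example below (a finite-dimensional Hilbert-space setting with {\mathcal{R}}(x)=\|x\|^{2}/2, so that the mirror step is the identity and \nabla{\mathcal{R}}^{*}(A^{*}\lambda)=A^{*}\lambda; all dimensions, step-sizes and names are illustrative) runs the primal update of Algorithm 2 and the dual update of Algorithm 3 with the same random batches and checks that the resulting iterates coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, b, t, n_iter = 20, 30, 4, 1e-3, 200
A = [rng.standard_normal((1, d)) for _ in range(p)]              # row blocks A_i
x_true = rng.standard_normal(d)
y = [Ai @ x_true + 1e-3 * rng.standard_normal(1) for Ai in A]    # noisy data
batches = [rng.choice(p, size=b, replace=False) for _ in range(n_iter)]

# Algorithm 2 (primal): with R(x) = ||x||^2/2 the mirror step gives x_n = xi_n
xi = np.zeros(d)
for I in batches:
    x = xi
    for i in I:
        xi = xi - t * A[i].T @ (A[i] @ x - y[i])

# Algorithm 3 (dual): x_n = grad R^*(A^* lambda_n) = sum_i lambda_{n,i} a_i
lam = np.zeros(p)
for I in batches:
    x = sum(lam[i] * A[i].ravel() for i in range(p))
    for i in I:
        lam[i] = lam[i] - t * (A[i] @ x - y[i])[0]

print(np.allclose(xi, sum(lam[i] * A[i].ravel() for i in range(p))))  # True
```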

Based on Algorithm 3, we will derive the estimate on 𝔼[xnδx2]{\mathbb{E}}[\|x_{n}^{\delta}-x^{\dagger}\|^{2}] under the benchmark source condition (4.1). Let

dy(λ):=(Aλ)λ,yd_{y}(\lambda):={\mathcal{R}}^{*}\left(A^{*}\lambda\right)-\langle\lambda,y\rangle

which is obtained from dyδ(λ)d_{y^{\delta}}(\lambda) with yδy^{\delta} replaced by yy. From (4.1) and (2.4) it follows that x=(Aλ)x^{\dagger}=\nabla{\mathcal{R}}^{*}(A^{*}\lambda^{\dagger}). Thus, by using Ax=yAx^{\dagger}=y we have

dy(λ)\displaystyle\nabla d_{y}(\lambda^{\dagger}) =A(Aλ)y=Axy=0.\displaystyle=A\nabla{\mathcal{R}}^{*}\left(A^{*}\lambda^{\dagger}\right)-y=Ax^{\dagger}-y=0.

Since dy(λ)d_{y}(\lambda) is convex, this shows that λ\lambda^{\dagger} is a global minimizer of dyd_{y} on YY.

According to (4.4) we have Aλnδ(xnδ)A^{*}\lambda_{n}^{\delta}\in\partial{\mathcal{R}}(x_{n}^{\delta}). Thus, we may consider the Bregman distance

Δnδ:=DAλnδ(x,xnδ).\Delta_{n}^{\delta}:=D_{\mathcal{R}}^{A^{*}\lambda_{n}^{\delta}}(x^{\dagger},x_{n}^{\delta}).

By using (4.1), Aλnδ(xnδ)A^{*}\lambda_{n}^{\delta}\in\partial{\mathcal{R}}(x_{n}^{\delta}), and (2.4) we have

(x)+(Aλ)=Aλ,x,(xnδ)+(Aλnδ)=Aλnδ,xnδ.\displaystyle{\mathcal{R}}(x^{\dagger})+{\mathcal{R}}^{*}\left(A^{*}\lambda^{\dagger}\right)=\left\langle A^{*}\lambda^{\dagger},x^{\dagger}\right\rangle,\quad{\mathcal{R}}(x_{n}^{\delta})+{\mathcal{R}}^{*}\left(A^{*}\lambda_{n}^{\delta}\right)=\left\langle A^{*}\lambda_{n}^{\delta},x_{n}^{\delta}\right\rangle.

Therefore

Δnδ\displaystyle\Delta_{n}^{\delta} =(x)(xnδ)Aλnδ,xxnδ\displaystyle={\mathcal{R}}(x^{\dagger})-{\mathcal{R}}(x_{n}^{\delta})-\langle A^{*}\lambda_{n}^{\delta},x^{\dagger}-x_{n}^{\delta}\rangle
=(Aλnδ)(Aλ)Aλnδ,x+Aλ,x\displaystyle={\mathcal{R}}^{*}\left(A^{*}\lambda_{n}^{\delta}\right)-{\mathcal{R}}^{*}\left(A^{*}\lambda^{\dagger}\right)-\langle A^{*}\lambda_{n}^{\delta},x^{\dagger}\rangle+\langle A^{*}\lambda^{\dagger},x^{\dagger}\rangle
=((Aλnδ)λnδ,y)((Aλ)λ,y)\displaystyle=\left({\mathcal{R}}^{*}\left(A^{*}\lambda_{n}^{\delta}\right)-\langle\lambda_{n}^{\delta},y\rangle\right)-\left({\mathcal{R}}^{*}(A^{*}\lambda^{\dagger})-\langle\lambda^{\dagger},y\rangle\right)
=dy(λnδ)dy(λ).\displaystyle=d_{y}(\lambda_{n}^{\delta})-d_{y}(\lambda^{\dagger}).

Consequently, by virtue of the strong convexity of {\mathcal{R}} we have

σxnδx2Δnδdy(λnδ)dy(λ).\sigma\|x_{n}^{\delta}-x^{\dagger}\|^{2}\leq\Delta_{n}^{\delta}\leq d_{y}(\lambda_{n}^{\delta})-d_{y}(\lambda^{\dagger}).

Taking the expectation gives

σ𝔼[xnδx2]𝔼[dy(λnδ)dy(λ)].\displaystyle\sigma\mathbb{E}\left[\|x_{n}^{\delta}-x^{\dagger}\|^{2}\right]\leq\mathbb{E}\left[d_{y}(\lambda_{n}^{\delta})-d_{y}(\lambda^{\dagger})\right]. (4.6)

This demonstrates that we can achieve our goal by bounding 𝔼[dy(λnδ)dy(λ)]\mathbb{E}[d_{y}(\lambda_{n}^{\delta})-d_{y}(\lambda^{\dagger})].

Note that the sequence {λnδ}\{\lambda_{n}^{\delta}\} is defined via the function dyδ(λ)d_{y^{\delta}}(\lambda). Although λ\lambda^{\dagger} is a minimizer of dyd_{y}, it may not be a minimizer of dyδd_{y^{\delta}}. To bridge this gap, we will make use of the relation

dyδ(λ)dy(λ)=λ,yyδ\displaystyle d_{y^{\delta}}(\lambda)-d_{y}(\lambda)=\langle\lambda,y-y^{\delta}\rangle (4.7)

between dyδ(λ)d_{y^{\delta}}(\lambda) and dy(λ)d_{y}(\lambda) for all λY\lambda\in Y.

Lemma 4.3.

Let Assumption 1 hold and consider Algorithm 3. Assume that 0<tI<4σ/AI20<t_{I}<4\sigma/\|A_{I}\|^{2} for all I{1,,p}I\subset\{1,\dots,p\} with |I|=b|I|=b. Then

𝔼[dy(λn+1δ)]𝔼[dy(λnδ)]+tmaxb2c1pδ2c12(pb)I:|I|=btI𝔼[Idyδ(λnδ)2]\mathbb{E}\left[d_{y}(\lambda_{n+1}^{\delta})\right]\leq\mathbb{E}\left[d_{y}(\lambda_{n}^{\delta})\right]+\frac{t_{\max}b}{2c_{1}p}\delta^{2}-\frac{c_{1}}{2{p\choose b}}\sum_{I:|I|=b}t_{I}\mathbb{E}\left[\|\nabla_{I}d_{y^{\delta}}(\lambda_{n}^{\delta})\|^{2}\right]

for all n0n\geq 0, where

tmax:=maxI:|I|=btI and c1:=minI:|I|=b{1tIAI24σ}>0.t_{\max}:=\max_{I:|I|=b}t_{I}\quad\mbox{ and }\quad c_{1}:=\min_{I:|I|=b}\left\{1-\frac{t_{I}\|A_{I}\|^{2}}{4\sigma}\right\}>0.
Proof.

For any λY\lambda\in Y and any I{1,,p}I\subset\{1,\cdots,p\} with |I|=b|I|=b we may write λ=(λI,λIc)\lambda=(\lambda_{I},\lambda_{I^{c}}) up to a permutation. Since the partial gradient of dyδd_{y^{\delta}} with respect to λI\lambda_{I} is given by Idyδ(λ)=AI(Aλ)yIδ\nabla_{I}d_{y^{\delta}}(\lambda)=A_{I}\nabla{\mathcal{R}}^{*}(A^{*}\lambda)-y_{I}^{\delta}, it follows from (2.12) that

Idyδ(λI,λIc)Idyδ(λI+h,λIc)LIh,hYI,\|\nabla_{I}d_{y^{\delta}}(\lambda_{I},\lambda_{I^{c}})-\nabla_{I}d_{y^{\delta}}(\lambda_{I}+h,\lambda_{I^{c}})\|\leq L_{I}\|h\|,\quad\forall h\in Y_{I},

where LI:=AI2/(2σ)L_{I}:=\|A_{I}\|^{2}/(2\sigma). Consequently, by recalling the relation between λn+1δ\lambda_{n+1}^{\delta} and λnδ\lambda_{n}^{\delta}, we can obtain

dyδ(λn+1δ)\displaystyle d_{y^{\delta}}(\lambda_{n+1}^{\delta}) dyδ(λnδ)+Indyδ(λnδ),λn+1,Inδλn,Inδ+LIn2λn+1,Inδλn,Inδ2\displaystyle\leq d_{y^{\delta}}(\lambda_{n}^{\delta})+\langle\nabla_{I_{n}}d_{y^{\delta}}(\lambda_{n}^{\delta}),\lambda_{n+1,I_{n}}^{\delta}-\lambda_{n,I_{n}}^{\delta}\rangle+\frac{L_{I_{n}}}{2}\|\lambda_{n+1,I_{n}}^{\delta}-\lambda_{n,I_{n}}^{\delta}\|^{2}
=dyδ(λnδ)(1LIntIn2)tInIndyδ(λnδ)2.\displaystyle=d_{y^{\delta}}(\lambda_{n}^{\delta})-\left(1-\frac{L_{I_{n}}t_{I_{n}}}{2}\right)t_{I_{n}}\|\nabla_{I_{n}}d_{y^{\delta}}(\lambda_{n}^{\delta})\|^{2}.

By using (4.7) and the definition of λn+1δ\lambda_{n+1}^{\delta}, we can see that

dyδ(λn+1δ)dyδ(λnδ)\displaystyle d_{y^{\delta}}(\lambda_{n+1}^{\delta})-d_{y^{\delta}}(\lambda_{n}^{\delta}) =dy(λn+1δ)dy(λnδ)+λn+1,Inδλn,Inδ,yInyInδ\displaystyle=d_{y}(\lambda_{n+1}^{\delta})-d_{y}(\lambda_{n}^{\delta})+\langle\lambda_{n+1,I_{n}}^{\delta}-\lambda_{n,I_{n}}^{\delta},y_{I_{n}}-y_{I_{n}}^{\delta}\rangle
=dy(λn+1δ)dy(λnδ)tInIndyδ(λnδ),yInyInδ.\displaystyle=d_{y}(\lambda_{n+1}^{\delta})-d_{y}(\lambda_{n}^{\delta})-t_{I_{n}}\langle\nabla_{I_{n}}d_{y^{\delta}}(\lambda_{n}^{\delta}),y_{I_{n}}-y_{I_{n}}^{\delta}\rangle.

Combining the above two equations and using the definition of c1c_{1}, we thus have

dy(λn+1δ)dy(λnδ)\displaystyle d_{y}(\lambda_{n+1}^{\delta})-d_{y}(\lambda_{n}^{\delta})
tInIndyδ(λnδ),yInyInδ(1LIntIn2)tInIndyδ(λnδ)2\displaystyle\leq t_{I_{n}}\langle\nabla_{I_{n}}d_{y^{\delta}}(\lambda_{n}^{\delta}),y_{I_{n}}-y_{I_{n}}^{\delta}\rangle-\left(1-\frac{L_{I_{n}}t_{I_{n}}}{2}\right)t_{I_{n}}\|\nabla_{I_{n}}d_{y^{\delta}}(\lambda_{n}^{\delta})\|^{2}
tInδInIndyδ(λnδ)c1tInIndyδ(λnδ)2\displaystyle\leq t_{I_{n}}\delta_{I_{n}}\|\nabla_{I_{n}}d_{y^{\delta}}(\lambda_{n}^{\delta})\|-c_{1}t_{I_{n}}\|\nabla_{I_{n}}d_{y^{\delta}}(\lambda_{n}^{\delta})\|^{2}
12c1tInδIn212c1tInIndyδ(λnδ)2.\displaystyle\leq\frac{1}{2c_{1}}t_{I_{n}}\delta_{I_{n}}^{2}-\frac{1}{2}c_{1}t_{I_{n}}\|\nabla_{I_{n}}d_{y^{\delta}}(\lambda_{n}^{\delta})\|^{2}.

By taking the expectation, using (3.1), and noting that 𝔼[δIn2]=bpδ2\mathbb{E}[\delta_{I_{n}}^{2}]=\frac{b}{p}\delta^{2}, we finally obtain

𝔼[dy(λn+1δ)dy(λnδ)]\displaystyle\mathbb{E}\left[d_{y}(\lambda_{n+1}^{\delta})-d_{y}(\lambda_{n}^{\delta})\right] tmax2c1𝔼[δIn2]12c1𝔼[𝔼[tInIndyδ(λnδ)2|n]]\displaystyle\leq\frac{t_{\max}}{2c_{1}}\mathbb{E}\left[\delta_{I_{n}}^{2}\right]-\frac{1}{2}c_{1}\mathbb{E}\left[\mathbb{E}\left[t_{I_{n}}\|\nabla_{I_{n}}d_{y^{\delta}}(\lambda_{n}^{\delta})\|^{2}|{\mathcal{F}}_{n}\right]\right]
=tmaxb2c1pδ212c1𝔼[1(pb)I:|I|=btIIdyδ(λnδ)2]\displaystyle=\frac{t_{\max}b}{2c_{1}p}\delta^{2}-\frac{1}{2}c_{1}\mathbb{E}\left[\frac{1}{{p\choose b}}\sum_{I:|I|=b}t_{I}\|\nabla_{I}d_{y^{\delta}}(\lambda_{n}^{\delta})\|^{2}\right]

which completes the proof. ∎

Lemma 4.4.

Consider Algorithm 3 in which 0<tI<4σ/AI20<t_{I}<4\sigma/\|A_{I}\|^{2} for all I{1,,p}I\subset\{1,\cdots,p\} with |I|=b|I|=b, furthermore tI=tt_{I}=t with a constant tt for all such II in case b>1b>1. Assume (4.1) holds and let

rn:={i=1p1tiλn,iδλ,i2 if b=1,1tλnδλ2 if b>1r_{n}:=\left\{\begin{array}[]{lll}\displaystyle{\sum_{i=1}^{p}\frac{1}{t_{i}}\|\lambda_{n,i}^{\delta}-\lambda_{,i}^{\dagger}\|^{2}}&\mbox{ if }b=1,\\[12.05553pt] \displaystyle{\frac{1}{t}\|\lambda_{n}^{\delta}-\lambda^{\dagger}\|^{2}}&\mbox{ if }b>1\end{array}\right.

for all n0n\geq 0. Then

𝔼[rn+1]\displaystyle\mathbb{E}[r_{n+1}] 𝔼[rn]+2bp𝔼[dy(λ)dy(λnδ)]+2tmax1/2bpδ(𝔼[rn])1/2\displaystyle\leq\mathbb{E}[r_{n}]+\frac{2b}{p}\mathbb{E}\left[d_{y}(\lambda^{\dagger})-d_{y}(\lambda_{n}^{\delta})\right]+\frac{2t_{\max}^{1/2}b}{p}\delta\left(\mathbb{E}[r_{n}]\right)^{1/2}
+1(pb)I:|I|=btI𝔼[Idyδ(λnδ)2]\displaystyle\quad\,+\frac{1}{{p\choose b}}\sum_{I:|I|=b}t_{I}\mathbb{E}\left[\|\nabla_{I}d_{y^{\delta}}(\lambda_{n}^{\delta})\|^{2}\right]

for all integers n0n\geq 0.

Proof.

We will only prove the result for b>1b>1; the result for b=1b=1 can be proved similarly. By using the definition of λn+1δ\lambda_{n+1}^{\delta} and tIn=tt_{I_{n}}=t we have

rn+1\displaystyle r_{n+1} =1tλn+1,IncδλInc2+1tλn+1,InδλIn2\displaystyle=\frac{1}{t}\|\lambda_{n+1,I_{n}^{c}}^{\delta}-\lambda_{I_{n}^{c}}^{\dagger}\|^{2}+\frac{1}{t}\|\lambda_{n+1,I_{n}}^{\delta}-\lambda_{I_{n}}^{\dagger}\|^{2}
=1tλn,IncδλInc2+1tλn,InδλIntIndyδ(λnδ)2\displaystyle=\frac{1}{t}\|\lambda_{n,I_{n}^{c}}^{\delta}-\lambda_{I_{n}^{c}}^{\dagger}\|^{2}+\frac{1}{t}\|\lambda_{n,I_{n}}^{\delta}-\lambda_{I_{n}}^{\dagger}-t\nabla_{I_{n}}d_{y^{\delta}}(\lambda_{n}^{\delta})\|^{2}
=1tλn,IncδλInc2+1tλn,InδλIn22Indyδ(λnδ),λn,InδλIn\displaystyle=\frac{1}{t}\|\lambda_{n,I_{n}^{c}}^{\delta}-\lambda_{I_{n}^{c}}^{\dagger}\|^{2}+\frac{1}{t}\|\lambda_{n,I_{n}}^{\delta}-\lambda_{I_{n}}^{\dagger}\|^{2}-2\langle\nabla_{I_{n}}d_{y^{\delta}}(\lambda_{n}^{\delta}),\lambda_{n,I_{n}}^{\delta}-\lambda_{I_{n}}^{\dagger}\rangle
+tIndyδ(λnδ)2\displaystyle\quad\,+t\|\nabla_{I_{n}}d_{y^{\delta}}(\lambda_{n}^{\delta})\|^{2}
=rn2Indyδ(λnδ),λn,InδλIn+tIndyδ(λnδ)2.\displaystyle=r_{n}-2\langle\nabla_{I_{n}}d_{y^{\delta}}(\lambda_{n}^{\delta}),\lambda_{n,I_{n}}^{\delta}-\lambda_{I_{n}}^{\dagger}\rangle+t\|\nabla_{I_{n}}d_{y^{\delta}}(\lambda_{n}^{\delta})\|^{2}.

Taking the expectation and using (3.1) gives

𝔼[rn+1]\displaystyle\mathbb{E}[r_{n+1}] =𝔼[rn]2𝔼[𝔼[Indyδ(λnδ),λn,InδλIn|n]]\displaystyle=\mathbb{E}[r_{n}]-2\mathbb{E}\left[\mathbb{E}\left[\langle\nabla_{I_{n}}d_{y^{\delta}}(\lambda_{n}^{\delta}),\lambda_{n,I_{n}}^{\delta}-\lambda_{I_{n}}^{\dagger}\rangle|{\mathcal{F}}_{n}\right]\right]
+t𝔼[𝔼[Indyδ(λnδ)2|n]]\displaystyle\quad\,+t\mathbb{E}\left[\mathbb{E}\left[\|\nabla_{I_{n}}d_{y^{\delta}}(\lambda_{n}^{\delta})\|^{2}|{\mathcal{F}}_{n}\right]\right]
=𝔼[rn]2𝔼[1(pb)I:|I|=bIdyδ(λnδ),λn,IδλI]\displaystyle=\mathbb{E}[r_{n}]-2\mathbb{E}\left[\frac{1}{{p\choose b}}\sum_{I:|I|=b}\langle\nabla_{I}d_{y^{\delta}}(\lambda_{n}^{\delta}),\lambda_{n,I}^{\delta}-\lambda_{I}^{\dagger}\rangle\right]
+t𝔼[1(pb)I:|I|=bIdyδ(λnδ)2].\displaystyle\quad\,+t\mathbb{E}\left[\frac{1}{{p\choose b}}\sum_{I:|I|=b}\|\nabla_{I}d_{y^{\delta}}(\lambda_{n}^{\delta})\|^{2}\right].

Note that

I:|I|=bIdyδ(λnδ),λn,IδλI\displaystyle\sum_{I:|I|=b}\langle\nabla_{I}d_{y^{\delta}}(\lambda_{n}^{\delta}),\lambda_{n,I}^{\delta}-\lambda_{I}^{\dagger}\rangle =I:|I|=biIidyδ(λnδ),λn,iδλ,i\displaystyle=\sum_{I:|I|=b}\sum_{i\in I}\langle\nabla_{i}d_{y^{\delta}}(\lambda_{n}^{\delta}),\lambda_{n,i}^{\delta}-\lambda_{,i}^{\dagger}\rangle
=i=1pI:|I|=b and iIidyδ(λnδ),λn,iδλ,i\displaystyle=\sum_{i=1}^{p}\sum_{I:|I|=b\mbox{ and }i\in I}\langle\nabla_{i}d_{y^{\delta}}(\lambda_{n}^{\delta}),\lambda_{n,i}^{\delta}-\lambda_{,i}^{\dagger}\rangle
=(p1b1)i=1pidyδ(λnδ),λn,iδλ,i\displaystyle={p-1\choose b-1}\sum_{i=1}^{p}\langle\nabla_{i}d_{y^{\delta}}(\lambda_{n}^{\delta}),\lambda_{n,i}^{\delta}-\lambda_{,i}^{\dagger}\rangle
=(p1b1)dyδ(λnδ),λnδλ.\displaystyle={p-1\choose b-1}\langle\nabla d_{y^{\delta}}(\lambda_{n}^{\delta}),\lambda_{n}^{\delta}-\lambda^{\dagger}\rangle.

Therefore

𝔼[rn+1]\displaystyle\mathbb{E}[r_{n+1}] =𝔼[rn]+2bp𝔼[dyδ(λnδ),λλnδ]+t(pb)I:|I|=b𝔼[Idyδ(λnδ)2].\displaystyle=\mathbb{E}[r_{n}]+\frac{2b}{p}\mathbb{E}\left[\langle\nabla d_{y^{\delta}}(\lambda_{n}^{\delta}),\lambda^{\dagger}-\lambda_{n}^{\delta}\rangle\right]+\frac{t}{{p\choose b}}\sum_{I:|I|=b}\mathbb{E}\left[\|\nabla_{I}d_{y^{\delta}}(\lambda_{n}^{\delta})\|^{2}\right].

By the convexity of dyδ(λ)d_{y^{\delta}}(\lambda), the Cauchy-Schwarz inequality and the definition of rnr_{n}, we can obtain

dyδ(λnδ),λλnδ\displaystyle\langle\nabla d_{y^{\delta}}(\lambda_{n}^{\delta}),\lambda^{\dagger}-\lambda_{n}^{\delta}\rangle dyδ(λ)dyδ(λnδ)\displaystyle\leq d_{y^{\delta}}(\lambda^{\dagger})-d_{y^{\delta}}(\lambda_{n}^{\delta})
=dy(λ)dy(λnδ)+λnδλ,yδy\displaystyle=d_{y}(\lambda^{\dagger})-d_{y}(\lambda_{n}^{\delta})+\langle\lambda_{n}^{\delta}-\lambda^{\dagger},y^{\delta}-y\rangle
dy(λ)dy(λnδ)+δλnδλ\displaystyle\leq d_{y}(\lambda^{\dagger})-d_{y}(\lambda_{n}^{\delta})+\delta\|\lambda_{n}^{\delta}-\lambda^{\dagger}\|
=dy(λ)dy(λnδ)+tδrn1/2.\displaystyle=d_{y}(\lambda^{\dagger})-d_{y}(\lambda_{n}^{\delta})+\sqrt{t}\delta r_{n}^{1/2}.

Therefore

𝔼[rn+1]\displaystyle\mathbb{E}[r_{n+1}] 𝔼[rn]+2bp𝔼[dy(λ)dy(λnδ)]+2tbpδ𝔼[rn1/2]\displaystyle\leq\mathbb{E}[r_{n}]+\frac{2b}{p}\mathbb{E}\left[d_{y}(\lambda^{\dagger})-d_{y}(\lambda_{n}^{\delta})\right]+\frac{2\sqrt{t}b}{p}\delta\mathbb{E}[r_{n}^{1/2}]
+t(pb)I:|I|=b𝔼[Idyδ(λnδ)2].\displaystyle\quad\,+\frac{t}{{p\choose b}}\sum_{I:|I|=b}\mathbb{E}\left[\|\nabla_{I}d_{y^{\delta}}(\lambda_{n}^{\delta})\|^{2}\right].

Since 𝔼[rn1/2](𝔼[rn])1/2\mathbb{E}[r_{n}^{1/2}]\leq(\mathbb{E}[r_{n}])^{1/2}, we thus obtain the desired inequality. ∎

Lemma 4.5.

Consider Algorithm 3 with the step-sizes chosen as in Lemma 4.4. Then there holds

𝔼[dy(λn+1δ)dy(λ)+12c1rn+1]+bc1pk=0n𝔼[dy(λkδ)dy(λ)]\displaystyle\mathbb{E}\left[d_{y}(\lambda_{n+1}^{\delta})-d_{y}(\lambda^{\dagger})+\frac{1}{2}c_{1}r_{n+1}\right]+\frac{bc_{1}}{p}\sum_{k=0}^{n}\mathbb{E}\left[d_{y}(\lambda_{k}^{\delta})-d_{y}(\lambda^{\dagger})\right]
M0+tmaxb2c1p(n+1)δ2+tmax1/2bc1pδk=0n(𝔼[rk])1/2\displaystyle\leq M_{0}+\frac{t_{\max}b}{2c_{1}p}(n+1)\delta^{2}+\frac{t_{\max}^{1/2}bc_{1}}{p}\delta\sum_{k=0}^{n}\left(\mathbb{E}[r_{k}]\right)^{1/2}

for all n0n\geq 0, where M0:=12c1r0+(dy(λ0)dy(λ))M_{0}:=\frac{1}{2}c_{1}r_{0}+\left(d_{y}(\lambda_{0})-d_{y}(\lambda^{\dagger})\right).

Proof.

From Lemma 4.3 and Lemma 4.4 it follows that

𝔼[dy(λn+1δ)dy(λ)+12c1rn+1]\displaystyle\mathbb{E}\left[d_{y}(\lambda_{n+1}^{\delta})-d_{y}(\lambda^{\dagger})+\frac{1}{2}c_{1}r_{n+1}\right]
𝔼[dy(λnδ)dy(λ)+12c1rn]bc1p𝔼[dy(λnδ)dy(λ)]\displaystyle\leq\mathbb{E}\left[d_{y}(\lambda_{n}^{\delta})-d_{y}(\lambda^{\dagger})+\frac{1}{2}c_{1}r_{n}\right]-\frac{bc_{1}}{p}\mathbb{E}\left[d_{y}(\lambda_{n}^{\delta})-d_{y}(\lambda^{\dagger})\right]
+tmaxb2c1pδ2+tmax1/2bc1pδ(𝔼[rn])1/2.\displaystyle\quad\,+\frac{t_{\max}b}{2c_{1}p}\delta^{2}+\frac{t_{\max}^{1/2}bc_{1}}{p}\delta\left(\mathbb{E}[r_{n}]\right)^{1/2}.

Recursively using this inequality then completes the proof. ∎

In order to proceed further, we need an estimate on 𝔼[rn]{\mathbb{E}}[r_{n}]. We will use the following elementary result.

Lemma 4.6.

Let {an}\{a_{n}\} and {bn}\{b_{n}\} be two sequences of nonnegative numbers such that

an2bn2+cj=0n1aj,n=0,1,,a_{n}^{2}\leq b_{n}^{2}+c\sum_{j=0}^{n-1}a_{j},\quad n=0,1,\cdots,

where c0c\geq 0 is a constant. If {bn}\{b_{n}\} is non-decreasing, then

anbn+cn,n=0,1,.a_{n}\leq b_{n}+cn,\quad n=0,1,\cdots.
Proof.

We show the result by induction. The result is trivial for n=0n=0. Assume that the result is valid for all 0nk0\leq n\leq k for some k0k\geq 0. We show it is also true for n=k+1n=k+1. If ak+1max{a0,,ak}a_{k+1}\leq\max\{a_{0},\cdots,a_{k}\}, then ak+1ala_{k+1}\leq a_{l} for some 0lk0\leq l\leq k. Thus, by the induction hypothesis and the monotonicity of {bn}\{b_{n}\} we have

ak+1albl+clbk+1+c(k+1).a_{k+1}\leq a_{l}\leq b_{l}+cl\leq b_{k+1}+c(k+1).

If ak+1>max{a0,,ak}a_{k+1}>\max\{a_{0},\cdots,a_{k}\}, then

ak+12bk+12+cj=0kajbk+12+c(k+1)ak+1\displaystyle a_{k+1}^{2}\leq b_{k+1}^{2}+c\sum_{j=0}^{k}a_{j}\leq b_{k+1}^{2}+c(k+1)a_{k+1}

which implies that

(ak+112c(k+1))2\displaystyle\left(a_{k+1}-\frac{1}{2}c(k+1)\right)^{2} =ak+12c(k+1)ak+1+14c2(k+1)2\displaystyle=a_{k+1}^{2}-c(k+1)a_{k+1}+\frac{1}{4}c^{2}(k+1)^{2}
bk+12+14c2(k+1)2\displaystyle\leq b_{k+1}^{2}+\frac{1}{4}c^{2}(k+1)^{2}
(bk+1+12c(k+1))2.\displaystyle\leq\left(b_{k+1}+\frac{1}{2}c(k+1)\right)^{2}.

Taking square roots shows ak+1bk+1+c(k+1)a_{k+1}\leq b_{k+1}+c(k+1) again. ∎

Now we are ready to show the following result which together with Lemma 4.2 implies Theorem 4.1 immediately.

Theorem 4.7.

Let Assumption 1 hold and consider Algorithm 3 with the step-sizes chosen as in Theorem 4.1. If the sought solution xx^{\dagger} satisfies the source condition (4.1), then for all n0n\geq 0 there holds

𝔼[dy(λnδ)dy(λ)]2M01+bpc1n+5tmaxb2c1pnδ2\displaystyle\mathbb{E}\left[d_{y}(\lambda_{n}^{\delta})-d_{y}(\lambda^{\dagger})\right]\leq\frac{2M_{0}}{1+\frac{b}{p}c_{1}n}+\frac{5t_{\max}b}{2c_{1}p}n\delta^{2} (4.8)

and consequently

𝔼[xnδx2]1σ(2M01+bpc1n+5tmaxb2c1pnδ2),\displaystyle\mathbb{E}\left[\|x_{n}^{\delta}-x^{\dagger}\|^{2}\right]\leq\frac{1}{\sigma}\left(\frac{2M_{0}}{1+\frac{b}{p}c_{1}n}+\frac{5t_{\max}b}{2c_{1}p}n\delta^{2}\right),

where tmaxt_{\max} and c1c_{1} are defined in Lemma 4.3.

Proof.

According to (4.6), it suffices to show (4.8). Since λ\lambda^{\dagger} is a minimizer of dy(λ)d_{y}(\lambda) over YY, we have dy(λnδ)dy(λ)0d_{y}(\lambda_{n}^{\delta})-d_{y}(\lambda^{\dagger})\geq 0 for all n0n\geq 0. Thus, by noting that 𝔼[r0]=r02M0/c1\mathbb{E}[r_{0}]=r_{0}\leq 2M_{0}/c_{1}, it follows from Lemma 4.5 that

𝔼[rn]\displaystyle\mathbb{E}\left[r_{n}\right] 2M0c1+tmaxbc12pnδ2+2tmax1/2bpδk=0n1(𝔼[rk])1/2\displaystyle\leq\frac{2M_{0}}{c_{1}}+\frac{t_{\max}b}{c_{1}^{2}p}n\delta^{2}+\frac{2t_{\max}^{1/2}b}{p}\delta\sum_{k=0}^{n-1}\left(\mathbb{E}[r_{k}]\right)^{1/2}

for all n0n\geq 0. Applying Lemma 4.6 and using the inequality a+ba+b\sqrt{a+b}\leq\sqrt{a}+\sqrt{b} for any a,b0a,b\geq 0, we can obtain

(𝔼[rn])1/22M0c1+tmaxbc12pnδ2+2tmax1/2bpnδ2M0c1+(bnc12p+2bnp)tmax1/2δ.\displaystyle\left(\mathbb{E}[r_{n}]\right)^{1/2}\leq\sqrt{\frac{2M_{0}}{c_{1}}+\frac{t_{\max}b}{c_{1}^{2}p}n\delta^{2}}+\frac{2t_{\max}^{1/2}b}{p}n\delta\leq\sqrt{\frac{2M_{0}}{c_{1}}}+\left(\sqrt{\frac{bn}{c_{1}^{2}p}}+\frac{2bn}{p}\right)t_{\max}^{1/2}\delta.

Consequently

k=0n(𝔼[rk])1/2\displaystyle\sum_{k=0}^{n}\left(\mathbb{E}[r_{k}]\right)^{1/2} 2M0c1(n+1)+k=0n(bkc12p+2bkp)tmax1/2δ\displaystyle\leq\sqrt{\frac{2M_{0}}{c_{1}}}(n+1)+\sum_{k=0}^{n}\left(\sqrt{\frac{bk}{c_{1}^{2}p}}+\frac{2bk}{p}\right)t_{\max}^{1/2}\delta
2M0c1(n+1)+(2b1/2(n+1)3/23c1p1/2+bn(n+1)p)tmax1/2δ.\displaystyle\leq\sqrt{\frac{2M_{0}}{c_{1}}}(n+1)+\left(\frac{2b^{1/2}(n+1)^{3/2}}{3c_{1}p^{1/2}}+\frac{bn(n+1)}{p}\right)t_{\max}^{1/2}\delta.

Combining this with the estimate in Lemma 4.5 we obtain

𝔼[dy(λn+1δ)dy(λ)]+bc1pk=0n𝔼[dy(λkδ)dy(λ)]\displaystyle\mathbb{E}\left[d_{y}(\lambda_{n+1}^{\delta})-d_{y}(\lambda^{\dagger})\right]+\frac{bc_{1}}{p}\sum_{k=0}^{n}\mathbb{E}\left[d_{y}(\lambda_{k}^{\delta})-d_{y}(\lambda^{\dagger})\right]
M0+tmaxb2c1p(n+1)δ2+b2tmaxc1M0p(n+1)δ\displaystyle\leq M_{0}+\frac{t_{\max}b}{2c_{1}p}(n+1)\delta^{2}+\frac{b\sqrt{2t_{\max}c_{1}M_{0}}}{p}(n+1)\delta
+(2b3/2(n+1)3/23p3/2+b2c1(n+1)2p2)tmaxδ2.\displaystyle\quad\,+\left(\frac{2b^{3/2}(n+1)^{3/2}}{3p^{3/2}}+\frac{b^{2}c_{1}(n+1)^{2}}{p^{2}}\right)t_{\max}\delta^{2}.

By using the inequality

b2tmaxc1M0p(n+1)δM0+b2c1(n+1)22p2tmaxδ2\frac{b\sqrt{2t_{\max}c_{1}M_{0}}}{p}(n+1)\delta\leq M_{0}+\frac{b^{2}c_{1}(n+1)^{2}}{2p^{2}}t_{\max}\delta^{2}

we can further obtain

𝔼[dy(λn+1δ)dy(λ)]+bc1pk=0n𝔼[dy(λkδ)dy(λ)]\displaystyle\mathbb{E}\left[d_{y}(\lambda_{n+1}^{\delta})-d_{y}(\lambda^{\dagger})\right]+\frac{bc_{1}}{p}\sum_{k=0}^{n}\mathbb{E}\left[d_{y}(\lambda_{k}^{\delta})-d_{y}(\lambda^{\dagger})\right]
2M0+tmaxbp(12c1+2b1/2(n+1)1/23p1/2+3bc1(n+1)2p)(n+1)δ2.\displaystyle\leq 2M_{0}+\frac{t_{\max}b}{p}\left(\frac{1}{2c_{1}}+\frac{2b^{1/2}(n+1)^{1/2}}{3p^{1/2}}+\frac{3bc_{1}(n+1)}{2p}\right)(n+1)\delta^{2}. (4.9)

According to Lemma 4.3 we have for any 0kn0\leq k\leq n that

𝔼[dy(λkδ)]𝔼[dy(λn+1δ)]tmaxb2c1p(n+1k)δ2.\mathbb{E}\left[d_{y}(\lambda_{k}^{\delta})\right]\geq\mathbb{E}\left[d_{y}(\lambda_{n+1}^{\delta})\right]-\frac{t_{\max}b}{2c_{1}p}(n+1-k)\delta^{2}.

Therefore

k=0n𝔼[dy(λkδ)dy(λ)]\displaystyle\sum_{k=0}^{n}\mathbb{E}\left[d_{y}(\lambda_{k}^{\delta})-d_{y}(\lambda^{\dagger})\right]
(n+1)𝔼[dy(λn+1δ)dy(λ)]tmaxb4c1p(n+1)(n+2)δ2.\displaystyle\geq(n+1)\mathbb{E}\left[d_{y}(\lambda_{n+1}^{\delta})-d_{y}(\lambda^{\dagger})\right]-\frac{t_{\max}b}{4c_{1}p}(n+1)(n+2)\delta^{2}.

Combining this with (4.9) and noting that n+22(n+1)n+2\leq 2(n+1) and

2b1/2(n+1)1/23p1/22c1+bc1(n+1)18p,\frac{2b^{1/2}(n+1)^{1/2}}{3p^{1/2}}\leq\frac{2}{c_{1}}+\frac{bc_{1}(n+1)}{18p},

we obtain

(1+bc1(n+1)p)𝔼[dy(λn+1δ)dy(λ)]\displaystyle\left(1+\frac{bc_{1}(n+1)}{p}\right)\mathbb{E}\left[d_{y}(\lambda_{n+1}^{\delta})-d_{y}(\lambda^{\dagger})\right]
2M0+tmaxbp(12c1+2b1/2(n+1)1/23p1/2+(12+3c12)b(n+1)p)(n+1)δ2\displaystyle\leq 2M_{0}+\frac{t_{\max}b}{p}\left(\frac{1}{2c_{1}}+\frac{2b^{1/2}(n+1)^{1/2}}{3p^{1/2}}+\left(\frac{1}{2}+\frac{3c_{1}}{2}\right)\frac{b(n+1)}{p}\right)(n+1)\delta^{2}
2M0+tmaxbp(α+βb(n+1)p)(n+1)δ2,\displaystyle\leq 2M_{0}+\frac{t_{\max}b}{p}\left(\alpha+\frac{\beta b(n+1)}{p}\right)(n+1)\delta^{2},

where

α:=52c1andβ:=12+149c1.\alpha:=\frac{5}{2c_{1}}\quad\mbox{and}\quad\beta:=\frac{1}{2}+\frac{14}{9}c_{1}.

Note that β/α<c1\beta/\alpha<c_{1} as c1<1c_{1}<1. Dividing both sides of the above inequality by 1+bc1(n+1)/p1+bc_{1}(n+1)/p shows

𝔼[dy(λn+1δ)dy(λ)]\displaystyle\mathbb{E}\left[d_{y}(\lambda_{n+1}^{\delta})-d_{y}(\lambda^{\dagger})\right] 2M01+bpc1(n+1)+5tmaxb2c1p(n+1)δ2\displaystyle\leq\frac{2M_{0}}{1+\frac{b}{p}c_{1}(n+1)}+\frac{5t_{\max}b}{2c_{1}p}(n+1)\delta^{2}

which completes the proof. ∎

5. Numerical results

In this section we will present numerical simulations to test the performance of the stochastic mirror descent method for solving linear ill-posed problems in which the sought solutions have various special features that require particular choices of the regularization functional {\mathcal{R}} and the solution space XX. Except for Example 5.2, all the numerical simulations are performed on the linear ill-posed system

Aix:=abk(si,t)x(t)𝑑t=yi,i=1,,p\displaystyle A_{i}x:=\int_{a}^{b}k(s_{i},t)x(t)dt=y_{i},\quad i=1,\cdots,p (5.1)

obtained from linear integral equations of first kind on [a,b][a,b] by sampling at si[a,b]s_{i}\in[a,b] with i=1,,pi=1,\cdots,p, where the kernel k(,)k(\cdot,\cdot) is continuous on [a,b]×[a,b][a,b]\times[a,b] and si=a+(i1)(ba)/(p1)s_{i}=a+(i-1)(b-a)/(p-1) for i=1,,pi=1,\cdots,p.

Example 5.1.

Consider the linear system (1.1) where XX and YiY_{i} are all Hilbert spaces. In case the minimal norm solution is of interest, we may take (x)=x2/2{\mathcal{R}}(x)=\|x\|^{2}/2 in Algorithm 1; then the definition of xnδx_{n}^{\delta} shows xnδ=ξnδx_{n}^{\delta}=\xi_{n}^{\delta}. Consequently, Algorithm 1 takes the form

xn+1δ=xnδtnδAIn(AInxnδyInδ),\displaystyle x_{n+1}^{\delta}=x_{n}^{\delta}-t_{n}^{\delta}A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}), (5.2)

where In{1,,p}I_{n}\subset\{1,\cdots,p\} is a randomly selected subset with |In|=b|I_{n}|=b via the uniform distribution for a preassigned batch size bb. This is exactly the mini-batch stochastic gradient descent method for solving linear ill-posed problems in Hilbert spaces, which has been studied recently in [17, 18, 25]. In these works the convergence analysis has been performed under diminishing step-sizes. Our work supplements the existing results by providing convergence results under new choices of step-sizes; in particular, for the method (5.2) with batch size b=1b=1, we obtain convergence and convergence rate results for the step-size

tnδ=μ0Ain2 with 0<μ0<2\displaystyle t_{n}^{\delta}=\frac{\mu_{0}}{\|A_{i_{n}}\|^{2}}\quad\mbox{ with }0<\mu_{0}<2 (5.3)

which obeys (s1) and (s2), where ini_{n} is the index chosen at the nn-th iteration step.

We now test the performance of the method (5.2) by considering the system (5.1) with [a,b]=[6,6][a,b]=[-6,6], p=1000p=1000 and k(s,t):=φ(st)k(s,t):=\varphi(s-t), where φ(s)=(1+cos(πs/3))χ{|s|<3}\varphi(s)=(1+\cos(\pi s/3))\chi_{\{|s|<3\}}. We assume the sought solution is

x(t)=sin(πt/12)+sin(πt/3)+1200t2(1t).x^{\dagger}(t)=\sin(\pi t/12)+\sin(\pi t/3)+\frac{1}{200}t^{2}(1-t).

Instead of the exact data yi:=Aixy_{i}:=A_{i}x^{\dagger}, we use the noisy data

yiδ=yi+δrel|yi|εi,i=1,,p,\displaystyle y_{i}^{\delta}=y_{i}+\delta_{rel}|y_{i}|\varepsilon_{i},\quad i=1,\cdots,p, (5.4)

where δrel\delta_{rel} is the relative noise level and εi\varepsilon_{i} are random noises obeying the standard Gaussian distribution, i.e. εiN(0,1)\varepsilon_{i}\sim N(0,1). We execute the method (5.2) with the batch size b=1b=1 and the initial guess x0δ=0x_{0}^{\delta}=0 together with the step-sizes given by (5.3); the integrals involved in the method are approximated by the trapezoidal rule based on the partition of [6,6][-6,6] into p1p-1 subintervals of equal length. To illustrate the dependence of convergence on the magnitude of the step-size, we consider the three distinct values μ0=1.0,0.5\mu_{0}=1.0,0.5 and 0.10.1. We also use noisy data with three distinct relative noise levels δrel=101,102\delta_{rel}=10^{-1},10^{-2} and 10310^{-3}. In Figure 1 we plot the corresponding reconstruction errors; the first row plots the relative mean square errors 𝔼[xnδxL22/xL22]\mathbb{E}[\|x_{n}^{\delta}-x^{\dagger}\|_{L^{2}}^{2}/\|x^{\dagger}\|_{L^{2}}^{2}], which are calculated approximately by the average of 100100 independent runs, and the second row plots xnδxL22/xL22\|x_{n}^{\delta}-x^{\dagger}\|_{L^{2}}^{2}/\|x^{\dagger}\|_{L^{2}}^{2} for a particular individual run. From these plots we can observe that, for each individual run, the iteration errors exhibit oscillations which are particularly dramatic when the noisy data have relatively large noise levels. We can also observe that the method (5.2) exhibits the semi-convergence property, i.e. the iterates converge toward the sought solution at the beginning and, after a critical number of iterations, begin to diverge. This is typical of iterative methods for solving ill-posed problems. Furthermore, for a fixed noise level, the semi-convergence occurs earlier if a larger step-size is used. We also note that, for noisy data with a relatively large noise level, if a large step-size is used, the iterates can quickly produce a reconstruction with minimal error and then diverge quickly from the sought solution; this makes it difficult to decide when to stop the iteration to produce a good reconstruction result. The use of small step-sizes has the advantage of suppressing oscillations and deferring semi-convergence. However, it slows down the convergence and hence makes the method time-consuming.
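
A minimal Python/NumPy sketch of this experiment may look as follows; the trapezoidal discretization follows the description above, while the noise level, the value of μ0\mu_{0}, the number of iterations and all variable names are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 1000
s = np.linspace(-6.0, 6.0, p)                          # sample/quadrature points
w = np.full(p, 12.0 / (p - 1)); w[0] /= 2; w[-1] /= 2  # trapezoidal weights

phi = lambda u: (1 + np.cos(np.pi * u / 3)) * (np.abs(u) < 3)
A = phi(s[:, None] - s[None, :]) * w                   # A[i, :] discretizes A_i
x_true = np.sin(np.pi * s / 12) + np.sin(np.pi * s / 3) + s**2 * (1 - s) / 200

delta_rel = 1e-2
y = A @ x_true
y_delta = y + delta_rel * np.abs(y) * rng.standard_normal(p)   # noisy data (5.4)

mu0, n_iter = 1.0, 100 * p                             # roughly 100 epochs (illustrative)
row_norm2 = np.sum(A**2, axis=1)                       # ||A_i||^2
x = np.zeros(p)                                        # initial guess x_0^delta = 0
for _ in range(n_iter):
    i = rng.integers(p)                                # batch size b = 1
    x = x - (mu0 / row_norm2[i]) * (A[i] @ x - y_delta[i]) * A[i]   # update (5.2), step-size (5.3)

print(np.sum(w * (x - x_true)**2) / np.sum(w * x_true**2))          # squared relative L2 error
```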

Figure 1. Reconstruction errors of the stochastic gradient descent (5.2) using step-size (5.3) with various values of μ0\mu_{0}.

In order to efficiently suppress the oscillations and reduce the effect of semi-convergence, we next consider the method (5.2) using the step-size chosen by (s3), whose realization relies on knowledge of the noise level. We now assume that the noisy data have the form (5.4), where εi\varepsilon_{i} are random noises uniformly distributed on [1,1][-1,1]. Note that δi:=δrel|yi|\delta_{i}:=\delta_{rel}|y_{i}| are the noise levels with |yiδyi|δi|y_{i}^{\delta}-y_{i}|\leq\delta_{i}. Assuming the information on δi\delta_{i} is known, the step-size chosen by (s3) takes the form

tnδ={μ0/Ain2 if |Ainxnδyinδ|>τδin,0 otherwise\displaystyle t_{n}^{\delta}=\left\{\begin{array}[]{lll}\mu_{0}/\|A_{i_{n}}\|^{2}&\mbox{ if }|A_{i_{n}}x_{n}^{\delta}-y_{i_{n}}^{\delta}|>\tau\delta_{i_{n}},\\ 0&\mbox{ otherwise}\end{array}\right. (5.7)

which incorporates the spirit of the discrepancy principle. To illustrate the advantage of this choice of step-size, we compare the computed results with the ones obtained by the step-size chosen by (5.3). In Figure 2 we present the numerical results of reconstruction errors by the method (5.2) with batch size b=1b=1 for various noise levels using the step-sizes chosen by (5.3) and (5.7) with μ0=1\mu_{0}=1 and τ=1\tau=1, where “DP” and “No DP” represent the results corresponding to the step-sizes chosen by (5.7) and (5.3) respectively. The first row in Figure 2 plots the mean square errors which are calculated approximately by the average of 100 independent runs and the second row plots the reconstruction errors for a typical individual run. The results show clearly that using the step-size (5.7) can significantly suppress the oscillations in reconstruction errors, in particular when the data are corrupted by noise with a relatively large noise level. We have performed extensive simulations which indicate that, due to the regularization effect of the discrepancy principle, using the step-size (5.7) has the tendency to decrease the iteration error as the number of iterations increases and has the ability to reduce the effect of semi-convergence.
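
The step-size rule (5.7) is easy to realize once the componentwise noise levels δi\delta_{i} are available; a hedged sketch (function and variable names are ours) is given below. Used in place of the constant rule in the loop of the previous sketch, with delta_i = delta_rel * |y_i|, it corresponds to the “DP” variant compared in Figure 2: iterations whose residual already meets the componentwise discrepancy criterion leave the iterate unchanged, which is what damps the oscillations.

```python
import numpy as np

def dp_step_size(A_i, x, y_i_delta, delta_i, mu0=1.0, tau=1.0):
    """Sketch of the step-size rule (5.7): take a step of size mu0/||A_i||^2
    only while the i-th residual exceeds tau times the noise level delta_i;
    otherwise return 0 so that the iterate is left unchanged."""
    residual = abs(A_i @ x - y_i_delta)
    if residual > tau * delta_i:
        return mu0 / np.sum(A_i**2)
    return 0.0
```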

Figure 2. Comparison of reconstruction results by the stochastic gradient descent (5.2) using step-sizes (5.3) and (5.7).
Example 5.2.

We consider the linear system (1.1) in which XX and YiY_{i} are all Hilbert spaces and the sought solution satisfies the constraint x𝒞x\in\mathcal{C}, where 𝒞X\mathcal{C}\subset X is a closed convex set. Finding the unique solution xx^{\dagger} of (1.1) in 𝒞\mathcal{C} with minimal norm can be stated as (1.2) with (x):=12x2+ι𝒞(x){\mathcal{R}}(x):=\frac{1}{2}\|x\|^{2}+\iota_{\mathcal{C}}(x), where ι𝒞\iota_{\mathcal{C}} denotes the indicator function of 𝒞\mathcal{C}. Let P𝒞P_{\mathcal{C}} denote the metric projection of XX onto 𝒞\mathcal{C}. Then the mini-batch stochastic mirror descent method for determining xx^{\dagger} takes the form

xnδ=P𝒞(ξnδ),ξn+1δ=ξnδtnδAIn(AInxnδyInδ).\displaystyle x_{n}^{\delta}=P_{\mathcal{C}}(\xi_{n}^{\delta}),\qquad\xi_{n+1}^{\delta}=\xi_{n}^{\delta}-t_{n}^{\delta}A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}). (5.8)

In case X=L2(𝒟)X=L^{2}(\mathscr{D}) for some domain 𝒟d\mathscr{D}\subset{\mathbb{R}}^{d} and 𝒞={xX:x0 a.e. on 𝒟}\mathcal{C}=\{x\in X:x\geq 0\mbox{ a.e. on }\mathscr{D}\}, the iteration scheme (5.8) becomes

xnδ=max{ξnδ,0},ξn+1δ=ξnδtnδAIn(AInxnδyInδ)\displaystyle x_{n}^{\delta}=\max\{\xi_{n}^{\delta},0\},\qquad\xi_{n+1}^{\delta}=\xi_{n}^{\delta}-t_{n}^{\delta}A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{In}^{\delta}) (5.9)

with an initial guess ξ0δ=0\xi_{0}^{\delta}=0, where InI_{n} is a randomly selected subset of {1,,p}\{1,\cdots,p\} via the uniform distribution with |In|=b|I_{n}|=b for a preassigned batch size bb.
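
A compact Python sketch of the iteration (5.9) for a dense system matrix reads as follows; the step-size written here, μ0/AI2\mu_{0}/\|A_{I}\|^{2}, is only an illustrative stand-in for the rule (s2) used in the experiment below, and all names are ours.

```python
import numpy as np

def nonneg_smd(A, y_delta, mu0, b, n_iter, rng=None):
    """Sketch of (5.9): x_n = max(xi_n, 0) and
    xi_{n+1} = xi_n - t_n A_I^T (A_I x_n - y_I^delta) for a uniform batch I."""
    rng = np.random.default_rng() if rng is None else rng
    p, d = A.shape
    xi = np.zeros(d)
    for _ in range(n_iter):
        x = np.maximum(xi, 0.0)                        # projection onto the nonnegative cone
        I = rng.choice(p, size=b, replace=False)       # uniform mini-batch
        r = A[I] @ x - y_delta[I]
        step = mu0 / np.linalg.norm(A[I], 2) ** 2      # ||A_I|| = largest singular value of the batch
        xi = xi - step * (A[I].T @ r)
    return np.maximum(xi, 0.0)
```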

As a testing example, we consider computed tomography, which consists in determining the density of cross sections of a human body by measuring the attenuation of X-rays as they propagate through the biological tissues [27]. Mathematically, it requires determining a function supported on a bounded domain from its line integrals. We consider here only the standard 2D parallel-beam tomography; tomography with other scan geometries can be considered similarly. We consider a full angle problem using 9090 projection angles evenly distributed between 11 and 180180 degrees, with 367367 lines per projection. Assuming the sought image is discretized on a 256×256256\times 256 pixel grid, we may use the function paralleltomo in the MATLAB package AIR TOOLS [16] to discretize the problem. After deleting those rows with zero entries, it leads to a linear system Ax=yAx=y, where AA is a matrix of size 29658×6553629658\times 65536. This gives a problem of the form (1.1) with p=29658p=29658, where AiA_{i} corresponds to the ii-th row of AA and yiy_{i} is the ii-th component of yy.

Figure 3. Reconstruction of the Shepp-Logan phantom by the methods (5.9) and (5.2) with batch size b=400b=400 and step-size chosen by (s2) with μ0=1\mu_{0}=1.

We assume that the true image is the Shepp-Logan phantom shown in Figure 3 (a), discretized on a 256×256256\times 256 pixel grid with nonnegative pixel values. This phantom is widely used in evaluating tomographic reconstruction algorithms. Let xx^{\dagger} denote the vector formed by stacking all the columns of the true image and let y=Axy=Ax^{\dagger} be the true data. By adding Gaussian noise to yiy_{i} we get the noisy data yiδy_{i}^{\delta} of the form (5.4), where εiN(0,1)\varepsilon_{i}\sim N(0,1) and δrel=0.01\delta_{rel}=0.01. Since the sought solution is nonnegative, we may reconstruct it by using the method (5.9). As a comparison, we also consider reconstructing the sought phantom by the method (5.2), for which the nonnegativity constraint is not incorporated. For both methods, we use the batch size b=400b=400 and the step-size chosen by (s2) with μ0=1\mu_{0}=1. We execute the methods for 600 iterations and plot the reconstruction results in Figure 3 (b) and (c). In Figure 3 (d) we also plot the squares of the relative errors xnδx2/x2\|x_{n}^{\delta}-x^{\dagger}\|^{2}/\|x^{\dagger}\|^{2}, which indicate that incorporating the nonnegativity constraint can produce more accurate reconstruction results. This example demonstrates that available a priori information on sought solutions should be incorporated into the algorithm design to obtain better reconstruction results.

Example 5.3.

Let 𝒟d\mathscr{D}\subset{\mathbb{R}}^{d} be a bounded domain. Consider the linear system (1.1), where Ai:L1(𝒟)YiA_{i}:L^{1}(\mathscr{D})\to Y_{i} is a bounded linear operator and YiY_{i} is a Hilbert space for each ii. We assume the sought solution xx^{\dagger} is a probability density function, i.e. x0x^{\dagger}\geq 0 a.e. on 𝒟\mathscr{D} and 𝒟x=1\int_{\mathscr{D}}x^{\dagger}=1. To determine such a solution, we take (x):=f(x)+ιΔ(x),{\mathcal{R}}(x):=f(x)+\iota_{\Delta}(x), where ιΔ\iota_{\Delta} denotes the indicator function of

Δ:={xL+1(𝒟):𝒟x=1}\Delta:=\left\{x\in L_{+}^{1}(\mathscr{D}):\int_{\mathscr{D}}x=1\right\}

and

f(x)={𝒟xlogx, if xL+1(𝒟) and xlogxL1(𝒟),+, otherwisef(x)=\left\{\begin{array}[]{lll}\int_{\mathscr{D}}x\log x,&\mbox{ if }x\in L_{+}^{1}(\mathscr{D})\mbox{ and }x\log x\in L^{1}(\mathscr{D}),\\ +\infty,&\mbox{ otherwise}\end{array}\right.

is the negative of the Boltzmann-Shannon entropy. Here L+1(𝒟):={xL1(𝒟):x0 a.e. on 𝒟}L_{+}^{1}(\mathscr{D}):=\{x\in L^{1}(\mathscr{D}):x\geq 0\mbox{ a.e. on }\mathscr{D}\}. The Boltzmann-Shannon entropy has been used in Tikhonov regularization as a regularization functional to determine nonnegative solutions, see [9, 11] for instance. According to [3, 9], {\mathcal{R}} is strongly convex on L1(𝒟)L^{1}(\mathscr{D}) with modulus of convexity σ=1/2\sigma=1/2. By the Karush-Kuhn-Tucker theory, for any ξL(𝒟)\xi\in L^{\infty}(\mathscr{D}) the unique minimizer of

minxL1(𝒟){(x)𝒟ξx}\min_{x\in L^{1}(\mathscr{D})}\left\{{\mathcal{R}}(x)-\int_{\mathscr{D}}\xi x\right\}

is given by x^=eξ/𝒟eξ\hat{x}=e^{\xi}/\int_{\mathscr{D}}e^{\xi}. Therefore the corresponding stochastic mirror descent method takes the form

xnδ=1𝒟eξnδeξnδ,ξn+1δ=ξnδtnδAIn(AInxnδyInδ),\displaystyle x_{n}^{\delta}=\frac{1}{\int_{\mathscr{D}}e^{\xi_{n}^{\delta}}}e^{\xi_{n}^{\delta}},\qquad\xi_{n+1}^{\delta}=\xi_{n}^{\delta}-t_{n}^{\delta}A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}), (5.10)

where In{1,,p}I_{n}\subset\{1,\cdots,p\} is a randomly selected subset via the uniform distribution with a preassigned batch size bb and tnδ0t_{n}^{\delta}\geq 0 is the step-size.
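
For completeness, a Python sketch of the iteration (5.10) with batch size b=1b=1 is given below; the quadrature weight vector w (so that the integral over 𝒟\mathscr{D} is approximated by w @ v), the step-size μ0/Ai2\mu_{0}/\|A_{i}\|^{2} and all names are illustrative choices of ours.

```python
import numpy as np

def entropy_smd(A, y_delta, w, mu0, n_iter, rng=None):
    """Sketch of (5.10): x_n = e^{xi_n} / int_D e^{xi_n} and
    xi_{n+1} = xi_n - t_n A_i^T (A_i x_n - y_i^delta), batch size 1."""
    rng = np.random.default_rng() if rng is None else rng
    p, d = A.shape
    xi = np.zeros(d)
    row_norm2 = np.sum(A**2, axis=1)
    for _ in range(n_iter):
        e = np.exp(xi - xi.max())          # stabilized exponential (the shift cancels after normalization)
        x = e / (w @ e)                    # mirror step: normalize to a probability density
        i = rng.integers(p)
        xi = xi - (mu0 / row_norm2[i]) * (A[i] @ x - y_delta[i]) * A[i]
    e = np.exp(xi - xi.max())
    return e / (w @ e)
```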

For numerical simulations we consider the linear system (5.1) with [a,b]=[0,1][a,b]=[0,1], p=1000p=1000 and k(s,t):=4e(st)2/0.0064k(s,t):=4e^{-(s-t)^{2}/0.0064}. We assume the sought solution is

x(t):=c(e60(t0.3)2+0.3e40(t0.8)2),x^{\dagger}(t):=c\left(e^{-60(t-0.3)^{2}}+0.3e^{-40(t-0.8)^{2}}\right),

where c>0c>0 is a constant to ensure that 01x(t)𝑑t=1\int_{0}^{1}x^{\dagger}(t)dt=1 so that xx^{\dagger} is a probability density function. By adding random noise to the exact data yi:=Aixy_{i}:=A_{i}x^{\dagger} we get the noisy data yiδy_{i}^{\delta}; we then use these noisy data and the method (5.10) with batch size b=1b=1 to reconstruct the sought solution xx^{\dagger}; the integrals involved in the method are approximated by the trapezoidal rule based on the partition of [0,1][0,1] into p1p-1 subintervals of equal length.

Figure 4. Relative errors for reconstructing a solution, that is a probability density function, by the method (5.10) using step-size (s2) with various values of μ0\mu_{0}.

We first test the performance of the method (5.10) using the noisy data given by (5.4) corrupted by Gaussian noise, where εiN(0,1)\varepsilon_{i}\sim N(0,1) and δrel>0\delta_{rel}>0 is the relative noise level. We execute the method (5.10) using the batch size b=1b=1, the initial guess ξ0δ=0\xi_{0}^{\delta}=0 and the step-size tnδt_{n}^{\delta} chosen by (s2) with three distinct values μ0=2.0,1.0\mu_{0}=2.0,1.0 and 0.40.4. In Figure 4 we plot the reconstruction errors for three distinct relative noise levels δrel=0.5,0.1\delta_{rel}=0.5,0.1 and 0.010.01. The first row plots the mean square errors 𝔼[xnδxL12]\mathbb{E}[\|x_{n}^{\delta}-x^{\dagger}\|_{L^{1}}^{2}] which are calculated approximately by the average of 100 independent runs. The second row plots xnδxL12\|x_{n}^{\delta}-x^{\dagger}\|_{L^{1}}^{2} for a particular individual run. These plots demonstrate that the method (5.10) admits the semi-convergence property and the iterates exhibit dramatic oscillations. Furthermore, using large step-sizes can make convergence fast at the beginning but the iterates can diverge quickly; while using small step-sizes can suppress the oscillations and delay the occurrence of semi-convergence, it however makes the method converge slowly.

Figure 5. Comparison of reconstruction results for a solution that is a probability density function by the method (5.10) using step-sizes chosen by (s2) and (s3).

In order to efficiently remove oscillations and semi-convergence, we next consider the method (5.10) using the step-size chosen by (s3), whose realization relies on knowledge of the noise level. We consider noisy data of the form (5.4), where εi\varepsilon_{i} are random noises uniformly distributed on [1,1][-1,1]. Note that δi:=δrel|yi|\delta_{i}:=\delta_{rel}|y_{i}| are the noise levels such that |yiδyi|δi|y_{i}^{\delta}-y_{i}|\leq\delta_{i} for all ii; we assume the information on δi\delta_{i} is known. To illustrate the advantage of the step-size chosen by (s3), we compare the computed results with the ones obtained by the step-size chosen by (s2). In Figure 5 we plot the reconstruction errors by the method (5.10) for various relative noise levels using the step-sizes chosen by (s2) and (s3) with μ0=2\mu_{0}=2 and τ=1\tau=1, where “DP” and “No DP” represent the results corresponding to the step-sizes chosen by (s3) and (s2) respectively. The first row in Figure 5 plots the mean square errors calculated approximately by the average of 100 independent runs and the second row plots the reconstruction errors for a particular individual run. The results show clearly that using the step-size chosen by (s3) can significantly suppress the oscillations in the iterates and alleviate the semi-convergence of the method.

Example 5.4.

Let 𝒟d\mathscr{D}\subset{\mathbb{R}}^{d} be a bounded domain. We consider reconstructing a sparse solution in the linear system (1.1), where Ai:L2(𝒟)YiA_{i}:L^{2}(\mathscr{D})\to Y_{i} is a bounded linear operator and YiY_{i} is a Hilbert space for each ii. To determine such a solution, we take (x):=βxL1(𝒟)+12xL2(𝒟)2{\mathcal{R}}(x):=\beta\|x\|_{L^{1}(\mathscr{D})}+\frac{1}{2}\|x\|_{L^{2}(\mathscr{D})}^{2}, where β>0\beta>0 is a sufficiently large number whose choice reflects the role of the term xL1(𝒟)\|x\|_{L^{1}(\mathscr{D})} in the reconstruction of sparse solutions. Note that for any ξL2(𝒟)\xi\in L^{2}(\mathscr{D}) we have

argminxL2(𝒟){(x)ξ,x}=sign(ξ)max{|ξ|β,0}.\arg\min_{x\in L^{2}(\mathscr{D})}\left\{{\mathcal{R}}(x)-\langle\xi,x\rangle\right\}=\mbox{sign}(\xi)\max\{|\xi|-\beta,0\}.

Therefore the corresponding stochastic mirror descent method takes the form

xnδ=sign(ξnδ)max{|ξnδ|β,0},ξn+1δ=ξnδtnδAIn(AInxnδyInδ),\displaystyle x_{n}^{\delta}=\mbox{sign}(\xi_{n}^{\delta})\max\{|\xi_{n}^{\delta}|-\beta,0\},\quad\xi_{n+1}^{\delta}=\xi_{n}^{\delta}-t_{n}^{\delta}A_{I_{n}}^{*}(A_{I_{n}}x_{n}^{\delta}-y_{I_{n}}^{\delta}), (5.11)

where In{1,,p}I_{n}\subset\{1,\cdots,p\} is a randomly selected subset satisfying |In|=b|I_{n}|=b via the uniform distribution with the preassigned batch size bb and tnδ0t_{n}^{\delta}\geq 0 is the step-size. We remark that, when |In|=1|I_{n}|=1 and In=n(mod p)+1I_{n}=n\ (\mbox{mod }p)+1, the corresponding method is the sparse-Kaczmarz method which was first proposed in [23] to reconstruct sparse solutions of ill-posed inverse problems. For finite-dimensional linear systems with well-conditioned matrices, a randomized sparse-Kaczmarz method of the form (5.11) was introduced in [37] in which InI_{n} is chosen with a probability proportional to AIn2\|A_{I_{n}}\|^{2} and a linear convergence rate is derived. The theoretical result in [37] however is not applicable to ill-posed problems because the analysis depends heavily on the finite-dimensionality of the underlying spaces and the well-conditioning property of the matrices.
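
A Python sketch of the iteration (5.11) with batch size b=1b=1 is given below; the step-size μ0/Ai2\mu_{0}/\|A_{i}\|^{2} is an illustrative stand-in for the rules (s2)/(s3) used in the experiments, and all names are ours.

```python
import numpy as np

def sparse_smd(A, y_delta, beta, mu0, n_iter, rng=None):
    """Sketch of (5.11): x_n = sign(xi_n) * max(|xi_n| - beta, 0) and
    xi_{n+1} = xi_n - t_n A_i^T (A_i x_n - y_i^delta), batch size 1."""
    rng = np.random.default_rng() if rng is None else rng
    p, d = A.shape
    xi = np.zeros(d)
    row_norm2 = np.sum(A**2, axis=1)
    soft = lambda z: np.sign(z) * np.maximum(np.abs(z) - beta, 0.0)   # soft-thresholding (mirror step)
    for _ in range(n_iter):
        x = soft(xi)
        i = rng.integers(p)
        xi = xi - (mu0 / row_norm2[i]) * (A[i] @ x - y_delta[i]) * A[i]
    return soft(xi)
```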

For numerical simulations we consider the linear system (5.1) with [a,b]=[0,1][a,b]=[0,1] and k(s,t):=(0.12+(st)2)3/2k(s,t):=(0.1^{2}+(s-t)^{2})^{-3/2}. We assume the sought solution is

x(t):=χ{0.19t0.22}χ{0.50t0.52}+0.5χ{0.78t0.80}x^{\dagger}(t):=\chi_{\{0.19\leq t\leq 0.22\}}-\chi_{\{0.50\leq t\leq 0.52\}}+0.5\chi_{\{0.78\leq t\leq 0.80\}}

which is sparse on [0,1][0,1]. By adding random noise to the exact data yi:=Aixy_{i}:=A_{i}x^{\dagger} we get the noisy data yiδy_{i}^{\delta}; we then use these noisy data and the method (5.11) with batch size b=1b=1 to reconstruct the sought solution xx^{\dagger}. In our numerical computation we take p=1000p=1000 and β=80\beta=80 and the integrals involved in the method are approximated by the trapezoidal rule based on the partition of [0,1][0,1] into p1p-1 subintervals of equal length.

Figure 6. Relative errors for reconstructing a sparse solution by the method (5.11) using step-size (s2) with various values of μ0\mu_{0}.

We first test the performance of the method (5.11) using noisy data of the form (5.4) corrupted by Gaussian noise, where εiN(0,1)\varepsilon_{i}\sim N(0,1). We execute the method (5.11) with the batch size b=1b=1, the initial guess ξ0δ=0\xi_{0}^{\delta}=0 and the step-size tnδt_{n}^{\delta} chosen by (s2) with the three distinct values μ0=2.0,1.0\mu_{0}=2.0,1.0 and 0.40.4. In Figure 6 we plot the reconstruction errors for three distinct relative noise levels δrel=0.5,0.1\delta_{rel}=0.5,0.1 and 0.010.01. The first row plots the relative mean square errors 𝔼[xnδxL22/xL22]\mathbb{E}[\|x_{n}^{\delta}-x^{\dagger}\|_{L^{2}}^{2}/\|x^{\dagger}\|_{L^{2}}^{2}] calculated approximately by the average of 100 independent runs, and the second row plots the relative errors xnδxL22/xL22\|x_{n}^{\delta}-x^{\dagger}\|_{L^{2}}^{2}/\|x^{\dagger}\|_{L^{2}}^{2} for a particular individual run. From these plots we can observe the oscillations in the iterates, the semi-convergence of the method, and the influence of the step-size magnitude.

Figure 7. Reconstruction of sparse solutions by the method (5.11) using step-sizes chosen by (s2) and (s3). The first row plots the relative mean square errors calculated approximately by the average of 100 independent runs and the second row plots the relative errors for a particular individual run.

We next consider the method (5.11) with the step-size chosen by (s3). We use noisy data of the form (5.4) with εi\varepsilon_{i} being uniformly distributed over [1,1][-1,1] and assume that the noise levels δi=δrel|yi|\delta_{i}=\delta_{rel}|y_{i}| are known. In Figure 7 we plot the reconstruction errors of the method (5.11) using the step-sizes chosen by (s3) with μ0=2\mu_{0}=2 and τ=1.01\tau=1.01, labelled as “DP”; for comparison we also plot the corresponding results using step-sizes chosen by (s2) with μ0=2\mu_{0}=2, labelled as “No DP”. In sharp contrast to (s2), these results illustrate that the step-size chosen by (s3) can significantly suppress the oscillations in the iterates and reduce the effect of semi-convergence.
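Since the precise definition of (s3) is given earlier in the paper, we only record here a hedged sketch of how such a discrepancy-based step-size can be coded for batch size b = 1; the closure interface matches the smd_sparse sketch above and deltas[i] stores the noise level delta_i.

```python
import numpy as np

def make_step_s3(deltas, mu0=2.0, tau=1.01):
    # Hedged reading of a discrepancy-based step-size in the spirit of (s3),
    # written for batch size b = 1; deltas[i] stores the noise level delta_i.
    def step(x, A_I, r, I):
        if abs(r[0]) <= tau * deltas[I[0]]:      # discrepancy test: skip the update
            return 0.0
        g = A_I.T @ r                            # A_{I_n}^*(A_{I_n} x_n - y_{I_n}^delta)
        return mu0 * float(r @ r) / float(g @ g)
    return step
```

For instance, make_step_s3(delta_rel * np.abs(y), mu0=2.0, tau=1.01) reproduces the noise levels delta_i = delta_rel*|y_i| used in this example.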

Example 5.5.

In this final example we consider using the stochastic mirror descent method to reconstruct piecewise constant solutions. We consider again the linear system (5.1) with [a,b]=[0,1][a,b]=[0,1] and k(s,t)=4e(st)2/0.0064k(s,t)=4e^{-(s-t)^{2}/0.0064} and assume that the sought solution xx^{\dagger} is piecewise constant. By dividing [0,1][0,1] into p1p-1 subintervals of equal length and approximating integrals by the trapezoidal rule, we have a discrete ill-posed system 𝐀i𝐱=yi{\bf A}_{i}{\bf x}=y_{i}, where 𝐀i{\bf A}_{i} is a row vector for each ii. To reconstruct a piecewise constant solution, we use the model

min{(𝐱):=β𝐃𝐱1+12𝐃𝐱22+12𝐱22:𝐀i𝐱=yi,i=1,,p},\displaystyle\min\left\{{\mathcal{R}}({\bf x}):=\beta\|{\bf D}{\bf x}\|_{1}+\frac{1}{2}\|{\bf D}{\bf x}\|_{2}^{2}+\frac{1}{2}\|{\bf x}\|_{2}^{2}:{\bf A}_{i}{\bf x}=y_{i},i=1,\cdots,p\right\}, (5.12)

where 𝐃{\bf D} denotes the discrete gradient operator and β\beta is a large positive number. If we applied the stochastic mirror descent method to (5.12) directly, we would need to solve a minimization problem involving {\mathcal{R}} to obtain 𝐱nδ{\bf x}_{n}^{\delta} at each iteration step. This can make the algorithm time-consuming since these minimization problems cannot be solved explicitly. To circumvent this difficulty, we introduce 𝐳:=𝐃𝐱{\bf z}:={\bf D}{\bf x} and rewrite (5.12) as

min{β𝐳1+12𝐳22+12𝐱22:𝐁i(𝐱𝐳)=(yi𝟎),i=1,,p},\displaystyle\min\left\{\beta\|{\bf z}\|_{1}+\frac{1}{2}\|{\bf z}\|_{2}^{2}+\frac{1}{2}\|{\bf x}\|_{2}^{2}:{\bf B}_{i}\begin{pmatrix}{\bf x}\\ {\bf z}\end{pmatrix}=\begin{pmatrix}y_{i}\\ {\bf 0}\end{pmatrix},i=1,\cdots,p\right\}, (5.13)

where 𝐁i:=(𝐀i𝟎𝐃𝐈){\bf B}_{i}:=\begin{pmatrix}{\bf A}_{i}&{\bf 0}\\ {\bf D}&-{\bf I}\end{pmatrix}. Assume that, instead of yiy_{i}, we are given the noisy data yiδy_{i}^{\delta}. Applying the stochastic mirror descent method with batch size b=1b=1 to (5.13) then gives

(𝐱nδ𝐳nδ)=argmin𝐱,𝐳{β𝐳1+12𝐳22+12𝐱22ξnδ,𝐱ηnδ,𝐳},\displaystyle\begin{pmatrix}{\bf x}_{n}^{\delta}\\[4.30554pt] {\bf z}_{n}^{\delta}\end{pmatrix}=\arg\min_{{\bf x},{\bf z}}\left\{\beta\|{\bf z}\|_{1}+\frac{1}{2}\|{\bf z}\|_{2}^{2}+\frac{1}{2}\|{\bf x}\|_{2}^{2}-\langle\xi_{n}^{\delta},{\bf x}\rangle-\langle\eta_{n}^{\delta},{\bf z}\rangle\right\},
(ξn+1δηn+1δ)=(ξnδηnδ)tnδ𝐁inT(𝐁in(𝐱nδ𝐳nδ)(yinδ𝟎)).\displaystyle\begin{pmatrix}\xi_{n+1}^{\delta}\\[4.30554pt] \eta_{n+1}^{\delta}\end{pmatrix}=\begin{pmatrix}\xi_{n}^{\delta}\\[4.30554pt] \eta_{n}^{\delta}\end{pmatrix}-t_{n}^{\delta}{\bf B}_{i_{n}}^{T}\left({\bf B}_{i_{n}}\begin{pmatrix}{\bf x}_{n}^{\delta}\\[4.30554pt] {\bf z}_{n}^{\delta}\end{pmatrix}-\begin{pmatrix}y_{i_{n}}^{\delta}\\[4.30554pt] {\bf 0}\end{pmatrix}\right).

By calculating 𝐱nδ,𝐳nδ{\bf x}_{n}^{\delta},{\bf z}_{n}^{\delta} and noting 𝐱nδ=ξnδ{\bf x}_{n}^{\delta}=\xi_{n}^{\delta}, we therefore obtain the following iteration scheme

𝐳nδ=sign(ηnδ)max{|ηnδ|β,0},𝐱n+1δ=𝐱nδtnδ(𝐀inT(𝐀in𝐱nδyinδ)+𝐃T(𝐃𝐱nδ𝐳nδ)),ηn+1δ=ηnδtnδ(𝐳nδ𝐃𝐱nδ)\displaystyle\begin{split}{\bf z}_{n}^{\delta}&=\mbox{sign}(\eta_{n}^{\delta})\max\{|\eta_{n}^{\delta}|-\beta,0\},\\ {\bf x}_{n+1}^{\delta}&={\bf x}_{n}^{\delta}-t_{n}^{\delta}\left({\bf A}_{i_{n}}^{T}({\bf A}_{i_{n}}{\bf x}_{n}^{\delta}-y_{i_{n}}^{\delta})+{\bf D}^{T}({\bf D}{\bf x}_{n}^{\delta}-{\bf z}_{n}^{\delta})\right),\\ \eta_{n+1}^{\delta}&=\eta_{n}^{\delta}-t_{n}^{\delta}({\bf z}_{n}^{\delta}-{\bf D}{\bf x}_{n}^{\delta})\end{split} (5.14)

with the initial guess η0δ=𝟎\eta_{0}^{\delta}={\bf 0} and 𝐱0δ=𝟎{\bf x}_{0}^{\delta}={\bf 0}, where in{1,,p}i_{n}\in\{1,\cdots,p\} is a randomly selected index via the uniform distribution and tnδ0t_{n}^{\delta}\geq 0 denotes the step-size.
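A minimal Python sketch of the iteration (5.14) reads as follows; it is not the authors' implementation, and the interface (a matrix A whose rows discretize the A_i, a discrete gradient matrix D and a user-supplied step_rule such as (5.15) or (5.18) below) is our own assumption.

```python
import numpy as np

def smd_piecewise_constant(A, D, y_delta, step_rule, beta=400.0, n_iter=500000, rng=None):
    # Minimal sketch of iteration (5.14); not the authors' implementation.
    rng = np.random.default_rng() if rng is None else rng
    p, m = A.shape
    x = np.zeros(m)                  # x_n^delta (recall x_n^delta = xi_n^delta)
    eta = np.zeros(D.shape[0])       # eta_n^delta, initialized to 0
    for _ in range(n_iter):
        z = np.sign(eta) * np.maximum(np.abs(eta) - beta, 0.0)   # z_n^delta
        i = rng.integers(p)                                      # uniformly selected index i_n
        r_data = float(A[i] @ x - y_delta[i])                    # A_{i_n} x_n^delta - y_{i_n}^delta
        r_grad = D @ x - z                                       # D x_n^delta - z_n^delta
        t = step_rule(r_data, r_grad, A[i], D, i)
        eta_new = eta - t * (z - D @ x)                          # eta-update of (5.14), using the current x_n
        x = x - t * (A[i] * r_data + D.T @ r_grad)               # x-update of (5.14)
        eta = eta_new
    return x
```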

Figure 8. Reconstruction results of a piecewise constant solution by (5.14) using step-sizes chosen by (5.15) and (5.18) with μ0=1\mu_{0}=1 and τ=1\tau=1 after 5×1055\times 10^{5} iterations.

We assume the sought solution 𝐱{\bf x}^{\dagger} is piecewise constant; its graph is plotted in Figure 8. By adding noise to the exact data yi:=𝐀i𝐱y_{i}:={\bf A}_{i}{\bf x}^{\dagger}, we produce the noisy data yiδy_{i}^{\delta}. We then use these noisy data in the method (5.14) to reconstruct 𝐱{\bf x}^{\dagger}. In the numerical computation we use p=1000p=1000 and β=400\beta=400. For the method (5.14) with the step-size chosen by (s2), i.e.

tnδ=μ0(|𝐀in𝐱nδyinδ|2+𝐃𝐱nδ𝐳nδ22)𝐀inT(𝐀in𝐱nδyinδ)22+𝐃T(𝐃𝐱nδ𝐳nδ)22\displaystyle t_{n}^{\delta}=\frac{\mu_{0}(|{\bf A}_{i_{n}}{\bf x}_{n}^{\delta}-y_{i_{n}}^{\delta}|^{2}+\|{\bf D}{\bf x}_{n}^{\delta}-{\bf z}_{n}^{\delta}\|_{2}^{2})}{\|{\bf A}_{i_{n}}^{T}({\bf A}_{i_{n}}{\bf x}_{n}^{\delta}-y_{i_{n}}^{\delta})\|_{2}^{2}+\|{\bf D}^{T}({\bf D}{\bf x}_{n}^{\delta}-{\bf z}_{n}^{\delta})\|_{2}^{2}} (5.15)

we have performed the numerical computation for several distinct values of μ0\mu_{0} and observed the semi-convergence property, the oscillations of the iterates and the effect of the magnitude of the step-sizes. Since these observations are very similar to those in the previous examples, we do not report them here. Instead we focus on the computational effect of the step-size chosen by (s3). As before, the noisy data are generated by (5.4) with εi\varepsilon_{i} being uniform noise on [1,1][-1,1] and we assume the noise levels δi:=δrel|yi|\delta_{i}:=\delta_{rel}|y_{i}| are known. The step-size chosen by (s3) then takes the form

tnδ={μ0(|𝐀in𝐱nδyinδ|2+𝐃𝐱nδ𝐳nδ22)𝐀inT(𝐀in𝐱nδyinδ)22+𝐃T(𝐃𝐱nδ𝐳nδ)22 if |𝐀in𝐱nδyinδ|>τδin,0 otherwise.\displaystyle t_{n}^{\delta}=\left\{\begin{array}[]{lll}\frac{\mu_{0}(|{\bf A}_{i_{n}}{\bf x}_{n}^{\delta}-y_{i_{n}}^{\delta}|^{2}+\|{\bf D}{\bf x}_{n}^{\delta}-{\bf z}_{n}^{\delta}\|_{2}^{2})}{\|{\bf A}_{i_{n}}^{T}({\bf A}_{i_{n}}{\bf x}_{n}^{\delta}-y_{i_{n}}^{\delta})\|_{2}^{2}+\|{\bf D}^{T}({\bf D}{\bf x}_{n}^{\delta}-{\bf z}_{n}^{\delta})\|_{2}^{2}}&\mbox{ if }|{\bf A}_{i_{n}}{\bf x}_{n}^{\delta}-y_{i_{n}}^{\delta}|>\tau\delta_{i_{n}},\\ 0&\mbox{ otherwise}.\end{array}\right. (5.18)
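As an illustration only, the rules (5.15) and (5.18) can be packaged as a step_rule compatible with the sketch of (5.14) given above; the closure interface is our own convention, with deltas[i] storing the noise level delta_i.

```python
import numpy as np

def make_step_5_18(deltas, mu0=1.0, tau=1.0):
    # Sketch of the step-size (5.18); dropping the discrepancy test (the `if`
    # branch) recovers the rule (5.15).
    def step(r_data, r_grad, a_i, D, i):
        if abs(r_data) <= tau * deltas[i]:      # |A_{i_n} x_n - y_{i_n}^delta| <= tau * delta_{i_n}
            return 0.0
        num = mu0 * (r_data**2 + float(r_grad @ r_grad))
        g1 = a_i * r_data                       # A_{i_n}^T(A_{i_n} x_n - y_{i_n}^delta)
        g2 = D.T @ r_grad                       # D^T(D x_n - z_n)
        return num / (float(g1 @ g1) + float(g2 @ g2))
    return step
```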

In Figure 8 we plot the reconstruction results of the method (5.14) after 5×1055\times 10^{5} iterations using noisy data with three distinct relative noise levels δrel=0.5,0.1\delta_{rel}=0.5,0.1 and 0.010.01, where “DP” and “No DP” represent the reconstruction results obtained with the step-sizes chosen by (5.18) and (5.15) respectively; we use μ0=1.0\mu_{0}=1.0 and τ=1.0\tau=1.0. These results demonstrate that the method (5.14) captures the piecewise constant feature very well. To further compare the reconstructions obtained with the step-sizes (5.15) and (5.18), we present the reconstruction errors in Figure 9: the first row plots approximations of the relative mean square errors 𝔼[𝐱nδ𝐱22/𝐱22]\mathbb{E}[\|{\bf x}_{n}^{\delta}-{\bf x}^{\dagger}\|_{2}^{2}/\|{\bf x}^{\dagger}\|_{2}^{2}] obtained by averaging 100 independent runs, and the second row plots the reconstruction errors 𝐱nδ𝐱22/𝐱22\|{\bf x}_{n}^{\delta}-{\bf x}^{\dagger}\|_{2}^{2}/\|{\bf x}^{\dagger}\|_{2}^{2} for a typical individual run. The results clearly show that the step-size (5.18) can significantly suppress the oscillations in the iterates and reduce the effect of semi-convergence of the method.

Figure 9. Comparison of reconstruction results for a piecewise constant solution by (5.14) using step-sizes (5.15) and (5.18).

Acknowledgement

Q. Jin would like to thank Peter Mathé from Weierstrass Institute for discussions on stochastic gradient descent. The work of Q. Jin is partially supported by the Future Fellowship of the Australian Research Council (FT170100231). The work of X. Lu is partially supported by the National Key Research and Development Program of China (No. 2020YFA0714200) and the National Science Foundation of China (No. 11871385).

References

  • [1] A. Beck and M. Teboulle, Mirror descent and nonlinear projected subgradient methods for convex optimization, Oper. Res. Lett., 31 (2003), no. 3, 167–175.
  • [2] R. Bhattacharya and E. C. Waymire, A basic course in probability theory, Second edition. Universitext. Springer, Cham, 2016.
  • [3] J. M. Borwein and A. S. Lewis, Convergence of best entropy estimates, SIAM J. Optim., 1 (1991), pp. 191–205.
  • [4] R. Boţ and T. Hein, Iterative regularization with a general penalty term – theory and applications to L1L^{1} and TV regularization, Inverse Problems, 28 (2012), 104010.
  • [5] L. Bottou, F. E. Curtis and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Rev., 60 (2018), no. 2, 223–311.
  • [6] S. Bubeck, Convex Optimization: Algorithms and Complexity, Foundations and Trends in Machine Learning, 8 (2015), no. 3-4, 231–357.
  • [7] M. Burger and S. Osher, Convergence rates of convex variational regularization, Inverse Problems 20 (2004), no. 5, 1411–1421.
  • [8] J. C. Duchi, Introductory lectures on stochastic convex optimization, In The Mathematics of Data, IAS/Park City Mathematics Series. American Mathematical Society, 2018.
  • [9] P. P. B. Eggermont, Maximum entropy regularization for Fredholm integral equations of the first kind, SIAM J. Math. Anal., 24 (1993), no. 6, 1557–1576.
  • [10] H. W. Engl, M. Hanke and A. Neubauer, Regularization of Inverse Problems, Kluwer, Dordrecht, 1996.
  • [11] H. W. Engl and G. Landl, Convergence rates for maximum entropy regularization, SIAM J. Numer. Anal., 30 (1993), 1509–1536.
  • [12] K. Frick and O. Scherzer, Regularization of ill-posed linear equations by the non-stationary augmented Lagrangian method, J. Integral Equ. Appl., 22 (2010), no. 2, 217–257.
  • [13] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, Adaptive Computation and Machine Learning series, MIT Press, Cambridge, MA, 2017.
  • [14] C. W. Groetsch, The theory of Tikhonov regularization for Fredholm equations of the first kind, Research Notes in Mathematics, 105. Pitman (Advanced Publishing Program), Boston, MA, 1984.
  • [15] M. Haltmeier, A. Leitão and O. Scherzer, Regularization of systems of nonlinear ill-posed equations: I. Convergence Analysis, Inverse Problems and Imaging, 1 (2007), no. 2, 289–298.
  • [16] P. C. Hansen and M. Saxild-Hansen, AIR Tools–a MATLAB package of algebraic iterative reconstruction methods, J. Comput. Appl. Math. 236 (2012), 2167–2178.
  • [17] Y. Jiao, B. Jin and X. Lu, Preasymptotic convergence of randomized Kaczmarz method, Inverse Problems, 33 (2017), 125012, 21 pp.
  • [18] B. Jin and X. Lu, On the regularizing property of stochastic gradient descent, Inverse Problems, 35 (2019), no. 1, 015004, 27 pp.
  • [19] B. Jin, Z. Zhou and J. Zou, On the convergence of stochastic gradient descent for nonlinear ill-posed problems, SIAM J. Optim., 30 (2020), no. 2, 1421–1450.
  • [20] Q. Jin, Landweber-Kaczmarz method in Banach spaces with inexact inner solvers, Inverse Problems, 32 (2016), no. 10, 104005, 26 pp.
  • [21] Q. Jin, Convergence rates of a dual gradient method for linear ill-posed problems, Numer. Math., accepted for publication, 2022.
  • [22] Q. Jin and X. Lu, A fast nonstationary iterative method with convex penalty for inverse problems in Hilbert spaces, Inverse Problems, 30 (2014), 045012, 21 pp.
  • [23] Q. Jin and W. Wang, Landweber iteration of Kaczmarz type with general non-smooth convex penalty functionals, Inverse Problems, 29 (2013), no. 8, 085011, 22 pp.
  • [24] B. Kaltenbacher, A. Neubauer and O. Scherzer, Iterative Regularization Methods for Nonlinear Ill-Posed Problems, Berlin: de Gruyter, 2008.
  • [25] S. Lu and P. Mathé, Stochastic gradient descent for linear inverse problems in Hilbert spaces, Math. Comput., 91 (2022), 1763–1788.
  • [26] Z. Lu and L. Xiao, On the complexity analysis of randomized block-coordinate descent methods, Math. Program., 152 (2015), no. 1–2, Ser. A, 615–642.
  • [27] F. Natterer, The Mathematics of Computerized Tomography, SIAM, Philadelphia, 2001.
  • [28] A. Nedic and S. Lee, On stochastic subgradient mirror-descent algorithm with weighted averaging, SIAM J. Optim., 24 (2014), pp. 84–107.
  • [29] A. Nemirovski, A. Juditsky, G. Lan and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM J. Optim., 19 (2009), 1574–1609.
  • [30] A. Nemirovski and D. B. Yudin, Problem Complexity and Method Efficiency in Optimization, Wiley, New York, 1983.
  • [31] Y. Nesterov, Primal-dual subgradient methods for convex problems, Math. Program., Ser. B 120 (2009), 221–259.
  • [32] Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM J. Optim., 22 (2012), no. 2, 341–362.
  • [33] J. C. Rabelo, Y. F. Saporito and A. Leitão, On stochastic Kaczmarz type methods for solving large scale systems of ill-posed equations, Inverse Problems, 38 (2022), 025003.
  • [34] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
  • [35] T. Schuster, B. Kaltenbacher, B. Hofmann and K. S. Kazimierski, Regularization Methods in Banach Spaces, Radon Series on Computational and Applied Mathematics 10, Walter de Gruyter, Berlin, 2012.
  • [36] P. Richtárik and M. Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, Math. Program., 144 (2014), no. 1–2, 1–38.
  • [37] F. Schöpfer and D. A. Lorenz, Linear convergence of the randomized sparse Kaczmarz method, Math. Program., 173 (2019), no. 1-2, Ser. A, 509–536.
  • [38] C. Zălinescu, Convex Analysis in General Vector Spaces, World Scientific Publishing Co., Inc., River Edge, New Jersey, 2002.
  • [39] Z. Zhou, P. Mertikopoulos, N. Bambos, S. P. Boyd and P. W. Glynn, On the convergence of mirror descent beyond stochastic convex programming, SIAM J. Optim., 30 (2020), no. 1, 687–716.