
Convergence directions of the randomized Gauss–Seidel method and its extension

(The work is supported by the National Natural Science Foundation of China (No. 11671060) and the Natural Science Foundation Project of CQ CSTC (No. cstc2019jcyj-msxmX0267).)

Yanjun Zhang,  Hanyu Li College of Mathematics and Statistics, Chongqing University, Chongqing 401331, P.R. China
Abstract

The randomized Gauss–Seidel method and its extension have attracted much attention recently, and their convergence rates have been considered extensively. However, the convergence rates are usually described by upper bounds, which cannot fully reflect the actual convergence. In this paper, we make a detailed analysis of the convergence behaviors of these two methods. The analysis shows that the larger a singular value of $A$ is, the faster the error decays in the corresponding singular vector space, and that the convergence directions are mainly driven by the large singular values at the beginning, then gradually by the small singular values, and finally by the smallest nonzero singular value. These results explain the phenomenon, observed in extensive numerical experiments in the literature, that these two methods seem to converge faster at the beginning. Numerical examples are provided to confirm the above findings.

keywords:
convergence direction, randomized Gauss–Seidel method, randomized extended Gauss–Seidel method, singular vector, least squares

1 Introduction

The linear least squares problem is ubiquitous, arising frequently in data analysis and scientific computing. Specifically, given a data matrix $A\in R^{m\times n}$ and a data vector $b\in R^{m}$, a linear least squares problem can be written as follows:

\min\limits_{x\in R^{n}}\|b-Ax\|^{2}_{2}. \qquad (1)

In the literature, several direct methods have been proposed for solving its normal equations $A^{T}Ax=A^{T}b$ through either the QR factorization or the singular value decomposition (SVD) of $A^{T}A$ [1, 2], which can be prohibitive when the matrix is large-scale. Hence, iterative methods are considered for solving large linear least squares problems, such as the famous Gauss–Seidel method [3].

In [4], Leventhal and Lewis proved that the randomized Gauss–Seidel (RGS) method, also known as the randomized coordinate descent method, converges to the solution at a linear rate in expectation. This method works on the columns of the matrix $A$ at random with probability proportional to their norms. Later, Ma, Needell and Ramdas [5] provided a unified theory of the RGS method and the randomized Kaczmarz (RK) method [6], where the latter works on the rows of $A$, and showed that the RGS method converges to the minimum Euclidean norm least squares solution $x_{\star}$ of (1) only when the matrix $A$ has full column rank. To further develop the RGS method for more general matrices, inspired by the randomized extended Kaczmarz (REK) method [7], Ma et al. [5] presented a variant of the RGS method, i.e., the randomized extended Gauss–Seidel (REGS) method, and proved that the REGS method converges to $x_{\star}$ regardless of whether the matrix $A$ has full column rank. After that, many variants of the RGS (or REGS) method were developed and studied extensively; see for example [8, 9, 10, 11, 12, 13, 14] and the references therein.

To the best of our knowledge, studies of the convergence properties of the RGS and REGS methods mainly focus on their convergence rates and usually give corresponding upper bounds; no work addresses what determines their convergence rates, what drives their convergence directions, or what their ultimate directions are. As noted above, an upper bound can only serve as a reference for the convergence rate and cannot truly reflect the empirical convergence of the method. It is therefore interesting to consider the above three problems.

In 2017, Jiao, Jin and Lu [15] analyzed the preasymptotic convergence of the RK method. Recently, Steinerberger [16] made a more detailed analysis of the convergence of the RK method for overdetermined full rank linear systems. He showed that the right singular vectors of the matrix $A$ describe the directions of distinguished dynamics and that the RK method converges along small right singular vectors. After that, Zhang and Li [17] considered the convergence of the REK method for all types of linear systems (consistent or inconsistent, overdetermined or underdetermined, full rank or rank deficient) and showed that the REK method converges to the minimum Euclidean norm least squares solution $x_{\star}$ with different decay rates in different right singular vector spaces.

In this paper, we analyze the convergence properties of the RGS and REGS methods for the linear least squares problem and show that the decay rates of the sequences $\{Ax_{k}\}_{k=1}^{\infty}$ and $\{x_{k}\}_{k=1}^{\infty}$ (resp., the sequences $\{Az_{k}\}_{k=1}^{\infty}$ and $\{z_{k}\}_{k=1}^{\infty}$) generated by the RGS method (resp., the REGS method) depend on the size of the singular values of $A$. Specifically, the larger a singular value of $A$ is, the faster the error decays in the corresponding singular vector space, and the convergence directions are mainly driven by the large singular values at the beginning, then gradually by the small singular values, and finally by the smallest nonzero singular value.

The rest of this paper is organized as follows. We first introduce some notations and preliminaries in Section 2 and then present our main results about the RGS and REGS methods in Section 3 and Section 4, respectively. Numerical experiments are given in Section 5.

2 Notations and preliminaries

Throughout the paper, for a matrix $A$, we use $A^{T}$, $A^{(i)}$, $A_{(j)}$, $\sigma_{i}(A)$, $\sigma_{r}(A)$, $\|A\|_{F}$, and $\mathcal{R}(A)$ to denote its transpose, $i$th row (or $i$th entry in the case of a vector), $j$th column, $i$th singular value, smallest nonzero singular value, Frobenius norm, and column space, respectively. For any integer $m\geq 1$, let $[m]:=\{1,2,3,\ldots,m\}$. If the matrix $G\in R^{n\times n}$ is positive definite, we define the energy norm of any vector $x\in R^{n}$ as $\|x\|_{G}:=\sqrt{x^{T}Gx}$. In addition, we denote the identity matrix by $I$, its $j$th column by $e_{(j)}$, and the expectation of any random variable $\xi$ by $\mathbb{E}[\xi]$.

In the following, we use $x_{\star}=A^{\dagger}b$ to denote the minimum Euclidean norm least squares solution of (1), where $A^{\dagger}$ denotes the Moore–Penrose pseudoinverse of the matrix $A$. Because the SVD is the basic tool for the convergence analysis in the next two sections, we denote the SVD [18] of $A\in R^{m\times n}$ by

A=U\Sigma V^{T},

where $U=[u_{1},u_{2},\ldots,u_{m}]\in R^{m\times m}$ and $V=[v_{1},v_{2},\ldots,v_{n}]\in R^{n\times n}$ are orthogonal matrices whose columns are known as the left and right singular vectors, respectively, and $\Sigma\in R^{m\times n}$ is diagonal with the diagonal elements ordered nonincreasingly, i.e., $\sigma_{1}(A)\geq\sigma_{2}(A)\geq\ldots\geq\sigma_{r}(A)>0$ with $r\leq\min\{m,n\}$.

3 Convergence directions of the RGS method

We first list the RGS method [4, 5] in Algorithm 1 and restate its convergence bound in Theorem 1.

Algorithm 1 (The RGS method)

  1. INPUT: $A$, $b$, $\ell$, $x_{0}\in R^{n}$
  2. For $k=1,2,\ldots,\ell-1$ do
  3.   Select $j\in[n]$ with probability $\frac{\|A_{(j)}\|^{2}_{2}}{\|A\|^{2}_{F}}$
  4.   Set $x_{k}=x_{k-1}-\frac{A_{(j)}^{T}(Ax_{k-1}-b)}{\|A_{(j)}\|_{2}^{2}}e_{(j)}$
  5. End for
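For concreteness, the update in Algorithm 1 can be summarized by the following NumPy sketch (the function name rgs, the residual bookkeeping, and the random number generator are our own implementation choices, not part of the original algorithm statement):

```python
import numpy as np

def rgs(A, b, n_iters, x0=None, rng=None):
    """A minimal sketch of Algorithm 1 (randomized Gauss-Seidel / coordinate descent)."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float).copy()
    col_sq = np.sum(A**2, axis=0)            # ||A_(j)||_2^2 for every column j
    probs = col_sq / col_sq.sum()            # selection probabilities ||A_(j)||_2^2 / ||A||_F^2
    r = A @ x - b                            # maintain the residual A x_k - b
    for _ in range(n_iters):
        j = rng.choice(n, p=probs)           # select column j with probability prop. to its norm
        delta = A[:, j] @ r / col_sq[j]      # A_(j)^T (A x_{k-1} - b) / ||A_(j)||_2^2
        x[j] -= delta                        # x_k = x_{k-1} - delta * e_(j)
        r -= delta * A[:, j]                 # keep the residual consistent with the new iterate
    return x
```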

Theorem 1

[4, 5] Let $A\in R^{m\times n}$, $b\in R^{m}$, $x_{\star}=A^{\dagger}b$ be the minimum Euclidean norm least squares solution, and $x_{k}$ be the $k$th approximation of the RGS method generated by Algorithm 1 with initial guess $x_{0}\in R^{n}$. Then

\mathbb{E}[\|Ax_{k}-Ax_{\star}\|_{2}^{2}]\leq(1-\frac{\sigma_{r}^{2}(A)}{\|A\|^{2}_{F}})^{k}\|Ax_{0}-Ax_{\star}\|_{2}^{2}. \qquad (2)
Remark 1

Theorem 1 shows that $Ax_{k}$ converges linearly in expectation to $Ax_{\star}$ regardless of whether the matrix $A$ has full rank. Since $\|Ax_{k}-Ax_{\star}\|_{2}^{2}=\|x_{k}-x_{\star}\|_{A^{T}A}^{2}$, it follows from (2) that

\mathbb{E}[\|x_{k}-x_{\star}\|_{A^{T}A}^{2}]\leq(1-\frac{\sigma_{r}^{2}(A)}{\|A\|^{2}_{F}})^{k}\|x_{0}-x_{\star}\|_{A^{T}A}^{2},

which implies that $x_{k}$ converges linearly in expectation to the minimum Euclidean norm least squares solution $x_{\star}$ when the matrix $A$ is overdetermined and of full column rank, but cannot converge to $x_{\star}$ when $A$ does not have full column rank. So, we assume that the matrix $A$ has full column rank in this section.

Now, we give our three main results for the RGS method.

Theorem 2

Let $A\in R^{m\times n}$, $b\in R^{m}$, $x_{\star}=A^{\dagger}b$ be the minimum Euclidean norm least squares solution, and $x_{k}$ be the $k$th approximation of the RGS method generated by Algorithm 1 with initial guess $x_{0}\in R^{n}$. Then

\mathbb{E}[\langle Ax_{k}-Ax_{\star},u_{\ell}\rangle]=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle Ax_{0}-Ax_{\star},u_{\ell}\rangle. \qquad (3)
Proof 1

Let $\mathbb{E}_{k-1}[\cdot]$ be the conditional expectation conditioned on the first $k-1$ iterations of the RGS method. Then, from Algorithm 1, we have

\begin{align*}
\mathbb{E}_{k-1}[\langle Ax_{k}-Ax_{\star},u_{\ell}\rangle]
&=\sum_{j=1}^{n}\frac{\|A_{(j)}\|_{2}^{2}}{\|A\|_{F}^{2}}\langle Ax_{k-1}-\frac{A_{(j)}^{T}(Ax_{k-1}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}-Ax_{\star},u_{\ell}\rangle\\
&=\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle-\frac{1}{\|A\|_{F}^{2}}\sum_{j=1}^{n}\langle A_{(j)}^{T}(Ax_{k-1}-b)A_{(j)},u_{\ell}\rangle\\
&=\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle-\frac{1}{\|A\|_{F}^{2}}\sum_{j=1}^{n}\langle A_{(j)},Ax_{k-1}-b\rangle\langle A_{(j)},u_{\ell}\rangle\\
&=\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle-\frac{1}{\|A\|_{F}^{2}}\langle A^{T}(Ax_{k-1}-b),A^{T}u_{\ell}\rangle,
\end{align*}

which together with the facts $A^{T}(b-Ax_{\star})=0$ and $A^{T}u_{\ell}=\sigma_{\ell}(A)v_{\ell}$ yields

\begin{align*}
\mathbb{E}_{k-1}[\langle Ax_{k}-Ax_{\star},u_{\ell}\rangle]
&=\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle-\frac{1}{\|A\|_{F}^{2}}\langle A^{T}(Ax_{k-1}-Ax_{\star}),A^{T}u_{\ell}\rangle\\
&=\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle-\frac{1}{\|A\|_{F}^{2}}\langle A^{T}\big(\sum_{i=1}^{m}\langle Ax_{k-1}-Ax_{\star},u_{i}\rangle u_{i}\big),A^{T}u_{\ell}\rangle\\
&=\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle-\frac{1}{\|A\|_{F}^{2}}\langle\sum_{i=1}^{m}\langle Ax_{k-1}-Ax_{\star},u_{i}\rangle\sigma_{i}(A)v_{i},\sigma_{\ell}(A)v_{\ell}\rangle\\
&=\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle-\frac{\sigma_{\ell}^{2}(A)}{\|A\|_{F}^{2}}\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle\\
&=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|_{F}^{2}})\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle.
\end{align*}

Thus, by taking the full expectation on both sides, we have

\mathbb{E}[\langle Ax_{k}-Ax_{\star},u_{\ell}\rangle]=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|_{F}^{2}})\mathbb{E}[\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle]=\ldots=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|_{F}^{2}})^{k}\langle Ax_{0}-Ax_{\star},u_{\ell}\rangle,

which is the estimate (3).

Remark 2

Theorem 2 shows that the decay rates of $\|Ax_{k}-Ax_{\star}\|_{2}$ are different in different left singular vector spaces. Specifically, the decay rates depend on the singular values: the larger the singular value of $A$ is, the faster the error decays in the corresponding left singular vector space. This implies that the smallest singular value leads to the slowest rate of convergence, which is the one appearing in (2). So, the convergence bound presented in [4, 5] is optimal.
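Since (3) is an equality in expectation, it can be checked numerically by averaging over independent runs. The following sketch is our own illustration, reusing the rgs routine above; the problem sizes, the iteration count and the monitored index $\ell$ are arbitrary choices. It compares the empirical mean of $\langle Ax_{k}-Ax_{\star},u_{\ell}\rangle$ with the factor predicted by Theorem 2.

```python
import numpy as np

# Monte Carlo check of Theorem 2 (assumes the rgs() sketch above is in scope).
rng = np.random.default_rng(0)
m, n, k, trials, ell = 60, 20, 30, 2000, 15
A = rng.standard_normal((m, n))                  # full column rank with probability one
b = rng.standard_normal(m)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # minimum norm least squares solution
U, s, Vt = np.linalg.svd(A)                      # u_ell = U[:, ell], sigma_ell = s[ell]
x0 = np.zeros(n)

acc = 0.0
for _ in range(trials):
    xk = rgs(A, b, k, x0=x0, rng=rng)
    acc += U[:, ell] @ (A @ xk - A @ x_star)     # <Ax_k - Ax_*, u_ell> for this run
empirical = acc / trials
predicted = (1 - s[ell]**2 / np.sum(A**2))**k * (U[:, ell] @ (A @ x0 - A @ x_star))
print(empirical, predicted)                      # should agree up to Monte Carlo error
```

Repeating the check for several indices $\ell$ illustrates how the decay rate deteriorates as $\sigma_{\ell}(A)$ decreases.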

Remark 3

Let $r_{k}=b-Ax_{k}$ be the residual vector with respect to the $k$th approximation $x_{k}$, and $r_{\star}=b-Ax_{\star}$ be the true residual vector with respect to the minimum Euclidean norm least squares solution $x_{\star}$. It follows from (3) and $Ax_{k}-Ax_{\star}=-(r_{k}-r_{\star})$ that

\mathbb{E}[\langle r_{k}-r_{\star},u_{\ell}\rangle]=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle r_{0}-r_{\star},u_{\ell}\rangle.

Hence, Theorem 2 also implies that the decay rates of $\|r_{k}-r_{\star}\|_{2}$ of the RGS method depend on the singular values.

Remark 4

Using the facts $\langle Ax_{k}-Ax_{\star},u_{\ell}\rangle=\langle x_{k}-x_{\star},A^{T}u_{\ell}\rangle$ and $A^{T}u_{\ell}=\sigma_{\ell}(A)v_{\ell}$, from (3) we have

\mathbb{E}[\langle x_{k}-x_{\star},v_{\ell}\rangle]=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle x_{0}-x_{\star},v_{\ell}\rangle,

which recovers the decay rates of the RK method in different right singular vector spaces [16]. In this respect, the RGS and RK methods are essentially equivalent.

Theorem 3

Let $A\in R^{m\times n}$, $b\in R^{m}$, $x_{\star}=A^{\dagger}b$ be the minimum Euclidean norm least squares solution, and $x_{k}$ be the $k$th approximation of the RGS method generated by Algorithm 1 with initial guess $x_{0}\in R^{n}$. Then

\mathbb{E}[\|Ax_{k}-Ax_{\star}\|_{2}^{2}]=\mathbb{E}[(1-\frac{1}{\|A\|^{2}_{F}}\|A^{T}\frac{Ax_{k-1}-Ax_{\star}}{\|Ax_{k-1}-Ax_{\star}\|_{2}}\|_{2}^{2})\|Ax_{k-1}-Ax_{\star}\|_{2}^{2}].
Proof 2

Following an argument similar to the proof in [5], we can derive the desired result.

Remark 5

Since $\|A^{T}\frac{Ax_{k-1}-Ax_{\star}}{\|Ax_{k-1}-Ax_{\star}\|_{2}}\|_{2}^{2}\geq\sigma_{r}^{2}(A)$, Theorem 3 implies that the RGS method actually converges faster if $Ax_{k-1}-Ax_{\star}$ is not close to the left singular vectors corresponding to the small singular values of $A$.

Theorem 4

Let $A\in R^{m\times n}$, $b\in R^{m}$, $x_{\star}=A^{\dagger}b$ be the minimum Euclidean norm least squares solution, and $x_{k}$ be the $k$th approximation of the RGS method generated by Algorithm 1 with initial guess $x_{0}\in R^{n}$. Then

\mathbb{E}[\langle\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}},\frac{Ax_{k+1}-Ax_{\star}}{\|Ax_{k+1}-Ax_{\star}\|_{2}}\rangle^{2}]=1-\frac{1}{\|A\|^{2}_{F}}\mathbb{E}[\|A^{T}\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}}\|_{2}^{2}]. \qquad (4)
Proof 3

From Algorithm 1, we have

\begin{align*}
&\mathbb{E}_{k}[\langle\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}},\frac{Ax_{k+1}-Ax_{\star}}{\|Ax_{k+1}-Ax_{\star}\|_{2}}\rangle^{2}]\\
&=\mathbb{E}_{k}[\langle\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}},\frac{Ax_{k}-\frac{A_{(j)}^{T}(Ax_{k}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}-Ax_{\star}}{\|Ax_{k}-\frac{A_{(j)}^{T}(Ax_{k}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}-Ax_{\star}\|_{2}}\rangle^{2}]\\
&=\mathbb{E}_{k}[\frac{\langle Ax_{k}-Ax_{\star},Ax_{k}-\frac{A_{(j)}^{T}(Ax_{k}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}-Ax_{\star}\rangle^{2}}{\|Ax_{k}-Ax_{\star}\|_{2}^{2}\,\|Ax_{k}-\frac{A_{(j)}^{T}(Ax_{k}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}-Ax_{\star}\|_{2}^{2}}].
\end{align*}

Since $\langle Ax_{k}-Ax_{\star},Ax_{k}-\frac{A_{(j)}^{T}(Ax_{k}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}-Ax_{\star}\rangle=\|Ax_{k}-\frac{A_{(j)}^{T}(Ax_{k}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}-Ax_{\star}\|_{2}^{2}$, we have

\begin{align*}
&\mathbb{E}_{k}[\langle\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}},\frac{Ax_{k+1}-Ax_{\star}}{\|Ax_{k+1}-Ax_{\star}\|_{2}}\rangle^{2}]\\
&=\mathbb{E}_{k}[\frac{1}{\|Ax_{k}-Ax_{\star}\|_{2}^{2}}\|Ax_{k}-\frac{A_{(j)}^{T}(Ax_{k}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}-Ax_{\star}\|_{2}^{2}]\\
&=\mathbb{E}_{k}[\frac{1}{\|Ax_{k}-Ax_{\star}\|_{2}^{2}}(\|Ax_{k}-Ax_{\star}\|_{2}^{2}-2\langle Ax_{k}-Ax_{\star},\frac{A_{(j)}^{T}(Ax_{k}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}\rangle+\frac{(A_{(j)}^{T}(Ax_{k}-b))^{2}}{\|A_{(j)}\|_{2}^{2}})]\\
&=\mathbb{E}_{k}[\frac{1}{\|Ax_{k}-Ax_{\star}\|_{2}^{2}}(\|Ax_{k}-Ax_{\star}\|_{2}^{2}-\frac{(A_{(j)}^{T}(Ax_{k}-Ax_{\star}))^{2}}{\|A_{(j)}\|_{2}^{2}})]\\
&=\mathbb{E}_{k}[1-\frac{(A_{(j)}^{T}\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}})^{2}}{\|A_{(j)}\|_{2}^{2}}]\\
&=\sum_{j=1}^{n}\frac{\|A_{(j)}\|_{2}^{2}}{\|A\|_{F}^{2}}(1-\frac{(A_{(j)}^{T}\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}})^{2}}{\|A_{(j)}\|_{2}^{2}})\\
&=1-\frac{1}{\|A\|_{F}^{2}}\|A^{T}\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}}\|_{2}^{2}.
\end{align*}

Thus, by taking the full expectation on both sides, we obtain the desired result (4).

Remark 6

Let $u$ and $v$ be two unit vectors, i.e., $\|u\|_{2}=1$ and $\|v\|_{2}=1$. We use the quantity $\langle u,v\rangle^{2}$ to measure the angle between $u$ and $v$: the smaller this quantity is, the bigger the angle, and the bigger the angle is, the bigger the fluctuation from $u$ to $v$ becomes. Theorem 4 thus characterizes the fluctuation between two adjacent iterations. Specifically, when $\|A^{T}\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}}\|_{2}^{2}$ is large, the angle between $\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}}$ and $\frac{Ax_{k+1}-Ax_{\star}}{\|Ax_{k+1}-Ax_{\star}\|_{2}}$ is large, which implies that the direction of $\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}}$ fluctuates strongly; when $\|A^{T}\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}}\|_{2}^{2}$ is small, the angle between these two vectors is small, which implies that $\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}}$ fluctuates very little.

Since $\|A^{T}\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}}\|_{2}^{2}\geq\sigma_{r}^{2}(A)$, Theorem 4 implies that if $Ax_{k}-Ax_{\star}$ is mainly composed of left singular vectors corresponding to the small singular values of $A$, its direction hardly changes, which means that the RGS method finally converges along the left singular vectors corresponding to the small singular values of $A$.
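The fluctuation described in Theorem 4 can be observed along a single run. The sketch below is our own illustration on a random test problem (not one of the experiments of Section 5); it records the squared inner product of two consecutive normalized errors, which is typically farther from 1 in the early iterations and very close to 1 later on, when the error has aligned with the small singular directions.

```python
import numpy as np

# Tracking the quantity of Theorem 4 along one RGS run (illustrative sketch).
rng = np.random.default_rng(1)
m, n, iters = 60, 20, 500
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
col_sq = np.sum(A**2, axis=0)
probs = col_sq / col_sq.sum()

x = np.zeros(n)
e = A @ x - A @ x_star                           # current error Ax_k - Ax_*
cos_sq = []
for _ in range(iters):
    j = rng.choice(n, p=probs)
    delta = A[:, j] @ (A @ x - b) / col_sq[j]
    x[j] -= delta
    e_new = A @ x - A @ x_star
    c = (e @ e_new) / (np.linalg.norm(e) * np.linalg.norm(e_new))
    cos_sq.append(c**2)                          # squared inner product of the normalized errors
    e = e_new
print(cos_sq[:5], cos_sq[-5:])                   # late values are typically much closer to 1
```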

4 Convergence directions of the REGS method

Recalling Remark 1, when the matrix $A$ does not have full column rank, the sequence $\{x_{k}\}_{k=1}^{\infty}$ generated by the RGS method does not converge to the minimum Euclidean norm least squares solution $x_{\star}$, even though $Ax_{k}$ does converge to $Ax_{\star}$. In [5], Ma et al. proposed an extended variant of the RGS method, i.e., the REGS method, which converges to $x_{\star}$ regardless of whether $A$ has full column rank.

Now, we list the REGS method presented in [13] in Algorithm 2, which is an equivalent variant of the original REGS method [5], and restate its convergence bound presented in [13] in Theorem 5. From the algorithm we see that, in each iteration, $x_{k}$ is the $k$th approximation of the RGS method and $z_{k}$ is a one-step RK update from $z_{k-1}$ for the linear system $Az=Ax_{k}$.

Algorithm 2 (The REGS method)

  1. INPUT: $A$, $b$, $\ell$, $x_{0}\in R^{n}$, $z_{0}\in\mathcal{R}(A^{T})$
  2. For $k=1,2,\ldots,\ell-1$ do
  3.   Select $j\in[n]$ with probability $\frac{\|A_{(j)}\|^{2}_{2}}{\|A\|^{2}_{F}}$
  4.   Set $x_{k}=x_{k-1}-\frac{A_{(j)}^{T}(Ax_{k-1}-b)}{\|A_{(j)}\|_{2}^{2}}e_{(j)}$
  5.   Select $i\in[m]$ with probability $\frac{\|A^{(i)}\|^{2}_{2}}{\|A\|^{2}_{F}}$
  6.   Set $z_{k}=z_{k-1}-\frac{A^{(i)}(z_{k-1}-x_{k})}{\|A^{(i)}\|_{2}^{2}}(A^{(i)})^{T}$
  7. End for
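As with Algorithm 1, a minimal NumPy sketch of Algorithm 2 may be helpful; the function name regs and the choice $x_{0}=z_{0}=0$ are ours (the algorithm itself allows any $x_{0}\in R^{n}$ and $z_{0}\in\mathcal{R}(A^{T})$).

```python
import numpy as np

def regs(A, b, n_iters, rng=None):
    """A minimal sketch of Algorithm 2: an RGS step for x followed by a one-step RK update for z."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    col_sq = np.sum(A**2, axis=0)            # ||A_(j)||_2^2
    row_sq = np.sum(A**2, axis=1)            # ||A^(i)||_2^2
    col_p = col_sq / col_sq.sum()
    row_p = row_sq / row_sq.sum()
    x = np.zeros(n)                          # x_0 = 0
    z = np.zeros(n)                          # z_0 = 0 lies in R(A^T)
    for _ in range(n_iters):
        j = rng.choice(n, p=col_p)           # RGS step on the columns of A
        x[j] -= A[:, j] @ (A @ x - b) / col_sq[j]
        i = rng.choice(m, p=row_p)           # RK step on the rows of A for the system Az = Ax_k
        z -= (A[i, :] @ (z - x)) / row_sq[i] * A[i, :]
    return z
```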

Theorem 5

[13] Let $A\in R^{m\times n}$, $b\in R^{m}$, $x_{\star}=A^{\dagger}b$ be the minimum Euclidean norm least squares solution, and $z_{k}$ be the $k$th approximation of the REGS method generated by Algorithm 2 with initial $x_{0}\in R^{n}$ and $z_{0}\in\mathcal{R}(A^{T})$. Then

\mathbb{E}[\|z_{k}-x_{\star}\|_{2}^{2}]\leq(1-\frac{\sigma_{r}^{2}(A)}{\|A\|^{2}_{F}})^{k}\|z_{0}-x_{\star}\|_{2}^{2}+\frac{k}{\|A\|_{F}^{2}}(1-\frac{\sigma_{r}^{2}(A)}{\|A\|^{2}_{F}})^{k}\|Ax_{0}-Ax_{\star}\|_{2}^{2}. \qquad (5)

For the REGS method, we first discuss the convergence behavior of $z_{k}-x_{\star}$ in Theorem 6 and Theorem 7, and then consider the convergence behavior of $Az_{k}-Ax_{\star}$ in Theorem 8.

Theorem 6

Let $A\in R^{m\times n}$, $b\in R^{m}$, $x_{\star}=A^{\dagger}b$ be the minimum Euclidean norm least squares solution, and $z_{k}$ be the $k$th approximation of the REGS method generated by Algorithm 2 with initial $x_{0}\in R^{n}$ and $z_{0}\in\mathcal{R}(A^{T})$. Then

\mathbb{E}[\langle z_{k}-x_{\star},v_{\ell}\rangle]=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle z_{0}-x_{\star},v_{\ell}\rangle+\frac{k}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle A^{T}(Ax_{0}-Ax_{\star}),v_{\ell}\rangle. \qquad (6)
Proof 4

From Algorithm 2, we have

\begin{align*}
\mathbb{E}[\langle z_{k}-x_{\star},v_{\ell}\rangle]
&=\mathbb{E}[\langle z_{k-1}-\frac{A^{(i)}(z_{k-1}-x_{k})}{\|A^{(i)}\|_{2}^{2}}(A^{(i)})^{T}-x_{\star},v_{\ell}\rangle]\\
&=\mathbb{E}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle]+\mathbb{E}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle], \qquad (7)
\end{align*}

so we next consider $\mathbb{E}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle]$ and $\mathbb{E}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle]$ separately.

We first consider $\mathbb{E}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle]$. Let $\mathbb{E}_{k-1}[\cdot]$ be the conditional expectation conditioned on the first $k-1$ iterations of the REGS method. That is,

\mathbb{E}_{k-1}[\cdot]=\mathbb{E}[\cdot\,|\,j_{1},i_{1},j_{2},i_{2},\ldots,j_{k-1},i_{k-1}],

where $j_{t^{*}}$ is the $t^{*}$th column chosen and $i_{t^{*}}$ is the $t^{*}$th row chosen. We denote the conditional expectation conditioned on the first $k-1$ iterations and the $k$th column chosen as

\mathbb{E}_{k-1}^{i}[\cdot]=\mathbb{E}[\cdot\,|\,j_{1},i_{1},j_{2},i_{2},\ldots,j_{k-1},i_{k-1},j_{k}].

Similarly, we denote the conditional expectation conditioned on the first $k-1$ iterations and the $k$th row chosen as

\mathbb{E}_{k-1}^{j}[\cdot]=\mathbb{E}[\cdot\,|\,j_{1},i_{1},j_{2},i_{2},\ldots,j_{k-1},i_{k-1},i_{k}].

Then, by the law of total expectation, we have

\mathbb{E}_{k-1}[\cdot]=\mathbb{E}_{k-1}^{j}[\mathbb{E}_{k-1}^{i}[\cdot]].

Thus, we obtain

\begin{align*}
&\mathbb{E}_{k-1}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle]\\
&=\mathbb{E}_{k-1}[\langle z_{k-1}-x_{\star},v_{\ell}\rangle-\langle\frac{A^{(i)}(z_{k-1}-x_{\star})}{\|A^{(i)}\|_{2}^{2}}(A^{(i)})^{T},v_{\ell}\rangle]\\
&=\langle z_{k-1}-x_{\star},v_{\ell}\rangle-\frac{1}{\|A\|^{2}_{F}}\sum_{i=1}^{m}\langle A^{(i)}(z_{k-1}-x_{\star})(A^{(i)})^{T},v_{\ell}\rangle\\
&=\langle z_{k-1}-x_{\star},v_{\ell}\rangle-\frac{1}{\|A\|^{2}_{F}}\sum_{i=1}^{m}\langle(A^{(i)})^{T},z_{k-1}-x_{\star}\rangle\langle(A^{(i)})^{T},v_{\ell}\rangle\\
&=\langle z_{k-1}-x_{\star},v_{\ell}\rangle-\frac{1}{\|A\|^{2}_{F}}\langle A(z_{k-1}-x_{\star}),Av_{\ell}\rangle.
\end{align*}

Further, by making use of $z_{k-1}-x_{\star}=\sum_{i=1}^{n}\langle z_{k-1}-x_{\star},v_{i}\rangle v_{i}$ and $Av_{i}=\sigma_{i}(A)u_{i}$, we get

\begin{align*}
&\mathbb{E}_{k-1}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle]\\
&=\langle z_{k-1}-x_{\star},v_{\ell}\rangle-\frac{1}{\|A\|^{2}_{F}}\langle A\sum_{i=1}^{n}\langle z_{k-1}-x_{\star},v_{i}\rangle v_{i},\sigma_{\ell}(A)u_{\ell}\rangle\\
&=\langle z_{k-1}-x_{\star},v_{\ell}\rangle-\frac{1}{\|A\|^{2}_{F}}\langle\sum_{i=1}^{n}\langle z_{k-1}-x_{\star},v_{i}\rangle\sigma_{i}(A)u_{i},\sigma_{\ell}(A)u_{\ell}\rangle,
\end{align*}

which together with the orthogonality of the left singular vectors $u_{i}$ yields

\begin{align*}
&\mathbb{E}_{k-1}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle]\\
&=\langle z_{k-1}-x_{\star},v_{\ell}\rangle-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}}\langle z_{k-1}-x_{\star},v_{\ell}\rangle\\
&=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})\langle z_{k-1}-x_{\star},v_{\ell}\rangle.
\end{align*}

As a result, by taking the full expectation on both sides, we have

\mathbb{E}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle]=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})\mathbb{E}[\langle z_{k-1}-x_{\star},v_{\ell}\rangle]. \qquad (8)

We now consider $\mathbb{E}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle]$. It follows from

\begin{align*}
&\mathbb{E}_{k-1}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle]\\
&=\mathbb{E}_{k-1}^{j}[\mathbb{E}_{k-1}^{i}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle]]\\
&=\mathbb{E}_{k-1}^{j}[\frac{1}{\|A\|^{2}_{F}}\sum_{i=1}^{m}\langle(A^{(i)})^{T}A^{(i)}(x_{k}-x_{\star}),v_{\ell}\rangle]\\
&=\mathbb{E}_{k-1}^{j}[\frac{1}{\|A\|^{2}_{F}}\sum_{i=1}^{m}\langle A^{(i)}(x_{k}-x_{\star}),A^{(i)}v_{\ell}\rangle]\\
&=\mathbb{E}_{k-1}[\frac{1}{\|A\|^{2}_{F}}\langle A(x_{k}-x_{\star}),Av_{\ell}\rangle],
\end{align*}

that

\mathbb{E}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle]=\mathbb{E}[\frac{1}{\|A\|^{2}_{F}}\langle A(x_{k}-x_{\star}),Av_{\ell}\rangle]. \qquad (9)

Since

\mathbb{E}[\frac{1}{\|A\|^{2}_{F}}\langle A(x_{k}-x_{\star}),Av_{\ell}\rangle]=\frac{\sigma_{\ell}(A)}{\|A\|^{2}_{F}}\mathbb{E}[\langle A(x_{k}-x_{\star}),u_{\ell}\rangle],

by exploiting (3) in Theorem 2, we get

\begin{align*}
\mathbb{E}[\frac{1}{\|A\|^{2}_{F}}\langle A(x_{k}-x_{\star}),Av_{\ell}\rangle]
&=\frac{\sigma_{\ell}(A)}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle Ax_{0}-Ax_{\star},u_{\ell}\rangle\\
&=\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle Ax_{0}-Ax_{\star},Av_{\ell}\rangle.
\end{align*}

Thus, substituting the above equality into (9), we have

\begin{align*}
\mathbb{E}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle]
&=\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle Ax_{0}-Ax_{\star},Av_{\ell}\rangle\\
&=\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle A^{T}(Ax_{0}-Ax_{\star}),v_{\ell}\rangle. \qquad (10)
\end{align*}

Combining (7), (8) and (10) yields

\begin{align*}
&\mathbb{E}[\langle z_{k}-x_{\star},v_{\ell}\rangle]\\
&=\mathbb{E}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle]+\mathbb{E}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle]\\
&=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})\mathbb{E}[\langle z_{k-1}-x_{\star},v_{\ell}\rangle]+\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle A^{T}(Ax_{0}-Ax_{\star}),v_{\ell}\rangle\\
&=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{2}\mathbb{E}[\langle z_{k-2}-x_{\star},v_{\ell}\rangle]+\frac{2}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle A^{T}(Ax_{0}-Ax_{\star}),v_{\ell}\rangle\\
&=\ldots=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle z_{0}-x_{\star},v_{\ell}\rangle+\frac{k}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle A^{T}(Ax_{0}-Ax_{\star}),v_{\ell}\rangle,
\end{align*}

which is the desired result (6).

Remark 7

Theorem 6 shows that the decay rates of $\|z_{k}-x_{\star}\|_{2}$ are different in different right singular vector spaces and that the smallest singular value leads to the slowest rate of convergence, which is the one appearing in (5). So, the convergence bound presented by Du [13] is optimal.
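As with Theorem 2, the equality (6) can be checked by a Monte Carlo simulation. The sketch below is our own; it reuses the regs routine above (which fixes $x_{0}=z_{0}=0$), and the problem sizes, iteration count and index $\ell$ are arbitrary choices. Up to Monte Carlo error, the two printed numbers should agree.

```python
import numpy as np

# Monte Carlo check of Theorem 6 (assumes the regs() sketch above, with x_0 = z_0 = 0).
rng = np.random.default_rng(2)
m, n, k, trials, ell = 40, 60, 30, 2000, 20      # underdetermined, so A lacks full column rank
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x_star = np.linalg.pinv(A) @ b                   # minimum Euclidean norm least squares solution
U, s, Vt = np.linalg.svd(A, full_matrices=False) # v_ell = Vt[ell, :], sigma_ell = s[ell]
x0 = np.zeros(n)
z0 = np.zeros(n)
rho = 1 - s[ell]**2 / np.sum(A**2)               # 1 - sigma_ell^2 / ||A||_F^2

acc = 0.0
for _ in range(trials):
    zk = regs(A, b, k, rng=rng)
    acc += Vt[ell, :] @ (zk - x_star)            # <z_k - x_*, v_ell> for this run
empirical = acc / trials
predicted = (rho**k * (Vt[ell, :] @ (z0 - x_star))
             + k / np.sum(A**2) * rho**k * (Vt[ell, :] @ (A.T @ (A @ x0 - A @ x_star))))
print(empirical, predicted)
```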

Theorem 7

Let $A\in R^{m\times n}$, $b\in R^{m}$, $x_{\star}=A^{\dagger}b$ be the minimum Euclidean norm least squares solution, and $z_{k}$ be the $k$th approximation of the REGS method generated by Algorithm 2 with initial $x_{0}\in R^{n}$ and $z_{0}\in\mathcal{R}(A^{T})$. Then

\mathbb{E}[\|z_{k}-x_{\star}\|^{2}_{2}]\leq\mathbb{E}[(1-\frac{1}{\|A\|^{2}_{F}}\|A\frac{z_{k-1}-x_{\star}}{\|z_{k-1}-x_{\star}\|_{2}}\|_{2}^{2})\|z_{k-1}-x_{\star}\|_{2}^{2}]+\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{r}^{2}(A)}{\|A\|^{2}_{F}})^{k}\|Ax_{0}-Ax_{\star}\|^{2}_{2}. \qquad (11)
Proof 5

Following an argument analogous to that of Theorem 4 in [13], we get

\begin{align*}
\mathbb{E}[\|z_{k}-x_{\star}\|_{2}^{2}]
&=\mathbb{E}[\|(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star})\|_{2}^{2}]+\mathbb{E}[\|\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star})\|_{2}^{2}],\\
\mathbb{E}[\|(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star})\|_{2}^{2}]
&=\mathbb{E}[(z_{k-1}-x_{\star})^{T}(I-\frac{A^{T}A}{\|A\|_{F}^{2}})(z_{k-1}-x_{\star})]\\
&=\mathbb{E}[\|z_{k-1}-x_{\star}\|_{2}^{2}-\frac{1}{\|A\|^{2}_{F}}\|A(z_{k-1}-x_{\star})\|_{2}^{2}]\\
&=\mathbb{E}[(1-\frac{1}{\|A\|^{2}_{F}}\|A\frac{z_{k-1}-x_{\star}}{\|z_{k-1}-x_{\star}\|_{2}}\|_{2}^{2})\|z_{k-1}-x_{\star}\|_{2}^{2}],
\end{align*}

and

\mathbb{E}[\|\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star})\|_{2}^{2}]\leq\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{r}^{2}(A)}{\|A\|^{2}_{F}})^{k}\|Ax_{0}-Ax_{\star}\|^{2}_{2}.

Combining the above three relations, we have

\mathbb{E}[\|z_{k}-x_{\star}\|_{2}^{2}]\leq\mathbb{E}[(1-\frac{1}{\|A\|^{2}_{F}}\|A\frac{z_{k-1}-x_{\star}}{\|z_{k-1}-x_{\star}\|_{2}}\|_{2}^{2})\|z_{k-1}-x_{\star}\|_{2}^{2}]+\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{r}^{2}(A)}{\|A\|^{2}_{F}})^{k}\|Ax_{0}-Ax_{\star}\|^{2}_{2},

which implies the desired result (11).

Remark 8

Since $\|A\frac{z_{k-1}-x_{\star}}{\|z_{k-1}-x_{\star}\|_{2}}\|_{2}^{2}\geq\sigma_{r}^{2}(A)$, Theorem 7 implies that $z_{k}$ of the REGS method actually converges faster if $z_{k-1}-x_{\star}$ is not close to the right singular vectors corresponding to the small singular values of $A$.

Theorem 8

Let $A\in R^{m\times n}$, $b\in R^{m}$, $x_{\star}=A^{\dagger}b$ be the minimum Euclidean norm least squares solution, and $z_{k}$ be the $k$th approximation of the REGS method generated by Algorithm 2 with initial $x_{0}\in R^{n}$ and $z_{0}\in\mathcal{R}(A^{T})$. Then

\mathbb{E}[\langle Az_{k}-Ax_{\star},u_{\ell}\rangle]=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle Az_{0}-Ax_{\star},u_{\ell}\rangle+\frac{k}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle AA^{T}(Ax_{0}-Ax_{\star}),u_{\ell}\rangle. \qquad (12)
Proof 6

Similar to the proof of (7) in Theorem 6, we obtain

\mathbb{E}[\langle Az_{k}-Ax_{\star},u_{\ell}\rangle]=\mathbb{E}[\langle A(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),u_{\ell}\rangle]+\mathbb{E}[\langle A\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),u_{\ell}\rangle]. \qquad (13)

Then, we consider $\mathbb{E}[\langle A(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),u_{\ell}\rangle]$ and $\mathbb{E}[\langle A\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),u_{\ell}\rangle]$ separately.

We first consider $\mathbb{E}[\langle A(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),u_{\ell}\rangle]$. It follows from

\langle A(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),u_{\ell}\rangle=\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),A^{T}u_{\ell}\rangle

and $A^{T}u_{\ell}=\sigma_{\ell}(A)v_{\ell}$, that

\mathbb{E}[\langle A(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),u_{\ell}\rangle]=\sigma_{\ell}(A)\mathbb{E}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle],

which together with (8) yields

\begin{align*}
\mathbb{E}[\langle A(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),u_{\ell}\rangle]
&=\sigma_{\ell}(A)(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})\mathbb{E}[\langle z_{k-1}-x_{\star},v_{\ell}\rangle]\\
&=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})\mathbb{E}[\langle z_{k-1}-x_{\star},\sigma_{\ell}(A)v_{\ell}\rangle]\\
&=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})\mathbb{E}[\langle Az_{k-1}-Ax_{\star},u_{\ell}\rangle]. \qquad (14)
\end{align*}

We now consider $\mathbb{E}[\langle A\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),u_{\ell}\rangle]$. Exploiting (10), we have

\begin{align*}
\mathbb{E}[\langle A\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),u_{\ell}\rangle]
&=\mathbb{E}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),A^{T}u_{\ell}\rangle]\\
&=\sigma_{\ell}(A)\mathbb{E}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle]\\
&=\frac{\sigma_{\ell}(A)}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle A^{T}(Ax_{0}-Ax_{\star}),v_{\ell}\rangle\\
&=\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle AA^{T}(Ax_{0}-Ax_{\star}),u_{\ell}\rangle. \qquad (15)
\end{align*}

Thus, combining (13), (14) and (15) yields

\begin{align*}
&\mathbb{E}[\langle Az_{k}-Ax_{\star},u_{\ell}\rangle]\\
&=\mathbb{E}[\langle A(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),u_{\ell}\rangle]+\mathbb{E}[\langle A\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),u_{\ell}\rangle]\\
&=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})\mathbb{E}[\langle Az_{k-1}-Ax_{\star},u_{\ell}\rangle]+\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle AA^{T}(Ax_{0}-Ax_{\star}),u_{\ell}\rangle\\
&=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{2}\mathbb{E}[\langle Az_{k-2}-Ax_{\star},u_{\ell}\rangle]+\frac{2}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle AA^{T}(Ax_{0}-Ax_{\star}),u_{\ell}\rangle\\
&=\ldots=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle Az_{0}-Ax_{\star},u_{\ell}\rangle+\frac{k}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle AA^{T}(Ax_{0}-Ax_{\star}),u_{\ell}\rangle,
\end{align*}

which is the desired result (12).

Remark 9

Theorem 8 shows the decay rates of $\|Az_{k}-Ax_{\star}\|_{2}$ of the REGS method and suggests that small singular values lead to slow convergence rates and vice versa. Similar behavior arises for the RK, REK, and RGS methods, as discussed in [16], [17], and Theorem 2, respectively.

5 Numerical experiments

Now we present two simple examples to illustrate the convergence directions of the RGS and REGS methods. To this end, let $G_{0}\in R^{500\times 500}$ be a Gaussian matrix with i.i.d. $N(0,1)$ entries and $D\in R^{500\times 500}$ be a diagonal matrix whose diagonal elements are all 100. Further, we set $G_{1}=G_{0}+D$ and replace its last row $G_{1}^{(500)}$ by a tiny perturbation of $G_{1}^{(499)}$, i.e., by adding 0.01 to each entry of $G_{1}^{(499)}$. Then, we normalize all rows of $G_{1}$, i.e., set $\|G_{1}^{(i)}\|_{2}=1$, $i=1,2,\ldots,500$. After that, we set $A_{1}=\begin{bmatrix}G_{1}\\ G_{2}\end{bmatrix}\in R^{600\times 500}$ and $A_{2}=\begin{bmatrix}G_{1},G_{3}\end{bmatrix}\in R^{500\times 600}$, where $G_{2}\in R^{100\times 500}$ and $G_{3}\in R^{500\times 100}$ are zero matrices. With this construction, the first 499 singular values of the matrices $A_{1}$ and $A_{2}$ are between $\sim 0.6$ and $\sim 1.5$, and the smallest nonzero singular value is $\sim 10^{-4}$.
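For reference, the construction of the test matrices can be reproduced with a short NumPy sketch along the following lines (this is our reading of the construction above; the original experiments were run in MATLAB, and the random seed and variable names here are ours).

```python
import numpy as np

# Sketch of the test-matrix construction described above.
rng = np.random.default_rng(0)
G0 = rng.standard_normal((500, 500))              # Gaussian matrix with i.i.d. N(0,1) entries
G1 = G0 + 100.0 * np.eye(500)                     # G1 = G0 + D, D = diag(100, ..., 100)
G1[499, :] = G1[498, :] + 0.01                    # last row: tiny perturbation of the 499th row
G1 /= np.linalg.norm(G1, axis=1, keepdims=True)   # normalize every row to unit 2-norm
A1 = np.vstack([G1, np.zeros((100, 500))])        # 600 x 500, full column rank
A2 = np.hstack([G1, np.zeros((500, 100))])        # 500 x 600, rank deficient
sv = np.linalg.svd(A1, compute_uv=False)
print(sv[0], sv[498], sv[499])                    # largest, 499th and smallest nonzero singular values
```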

We first consider the convergence directions of $Ax_{k}-Ax_{\star}$ and $x_{k}-x_{\star}$ for the RGS method. We generate a vector $x\in R^{500}$ using the MATLAB function randn, set the full column rank coefficient matrix $A=A_{1}$, and set the right-hand side $b=Ax+z$, where $z$ is a nonzero vector belonging to the null space of $A^{T}$, generated by the MATLAB function null. With $x_{0}=0$, we plot $|\langle(Ax_{k}-Ax_{\star})/\|Ax_{k}-Ax_{\star}\|_{2},u_{500}\rangle|$ and $\frac{\|A(x_{k}-x_{\star})\|_{2}}{\|x_{k}-x_{\star}\|_{2}}$ in Figure 1 and Figure 2, respectively.

Figure 1: A sample evolution of $|\langle(Ax_{k}-Ax_{\star})/\|Ax_{k}-Ax_{\star}\|_{2},u_{500}\rangle|$ for the RGS method.

From Figure 1, we find that $|\langle(Ax_{k}-Ax_{\star})/\|Ax_{k}-Ax_{\star}\|_{2},u_{500}\rangle|$ is initially very small, almost 0, which indicates that $Ax_{k}-Ax_{\star}$ is not close to the left singular vector $u_{500}$. In view of the analysis in Remark 5, this phenomenon reflects the 'preconvergence' behavior of the RGS method, that is, the RGS method seems to converge quickly at the beginning. In addition, as $k\rightarrow\infty$, $|\langle(Ax_{k}-Ax_{\star})/\|Ax_{k}-Ax_{\star}\|_{2},u_{500}\rangle|\rightarrow 1$. This implies that $Ax_{k}-Ax_{\star}$ tends to the left singular vector corresponding to the smallest singular value of $A$.

Figure 2: A sample evolution of $\frac{\|A(x_{k}-x_{\star})\|_{2}}{\|x_{k}-x_{\star}\|_{2}}$ for the RGS method.

From Figure 2, we observe that the value of $\frac{\|A(x_{k}-x_{\star})\|_{2}}{\|x_{k}-x_{\star}\|_{2}}$ decreases with $k$ and finally approaches the smallest singular value. This implies that the direction of $x_{k}-x_{\star}$ is mainly determined by the right singular vectors corresponding to the large singular values of $A$ at the beginning. As $k$ increases, the direction is mainly determined by the right singular vectors corresponding to the small singular values. Finally, $x_{k}-x_{\star}$ tends to the right singular vector space corresponding to the smallest singular value. Furthermore, this phenomenon also suggests an interesting application, i.e., finding nonzero vectors $x$ such that $\frac{\|Ax\|_{2}}{\|x\|_{2}}$ is small.
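The application mentioned at the end of the previous paragraph can be sketched as follows; this is our own illustration on a synthetic matrix with one well-separated tiny singular value, not one of the experiments reported in this section. The idea is simply to run the RGS method long enough and use the late error $x_{k}-x_{\star}$ as an approximate small right singular direction.

```python
import numpy as np

# Using a late RGS error direction to find a nonzero x with small ||Ax||_2 / ||x||_2.
rng = np.random.default_rng(3)
m, n = 80, 50
s = np.linspace(1.0, 2.0, n)
s[-1] = 1e-2                                      # one well-separated tiny singular value
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(s) @ V.T
b = rng.standard_normal(m)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

col_sq = np.sum(A**2, axis=0)
probs = col_sq / col_sq.sum()
x = np.zeros(n)
for _ in range(5000):                             # run RGS until the error aligns with v_min
    j = rng.choice(n, p=probs)
    x[j] -= A[:, j] @ (A @ x - b) / col_sq[j]
d = x - x_star                                    # late error, expected to lie near small right singular vectors
print(np.linalg.norm(A @ d) / np.linalg.norm(d))  # typically settles near the smallest singular value (1e-2 here)
```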

We now consider the convergence directions of $Az_{k}-Ax_{\star}$ and $z_{k}-x_{\star}$ for the REGS method. We generate a vector $x\in R^{600}$ using the MATLAB function randn, set the coefficient matrix $A=A_{2}$, which does not have full column rank, and set the right-hand side $b=Ax$. With $x_{0}=0$ and $z_{0}=0$, we plot $|\langle(Az_{k}-Ax_{\star})/\|Az_{k}-Ax_{\star}\|_{2},u_{500}\rangle|$ and $\frac{\|A(z_{k}-x_{\star})\|_{2}}{\|z_{k}-x_{\star}\|_{2}}$ in Figure 3 and Figure 4, respectively.

Figure 3: A sample evolution of $|\langle(Az_{k}-Ax_{\star})/\|Az_{k}-Ax_{\star}\|_{2},u_{500}\rangle|$ for the REGS method.
Figure 4: A sample evolution of $\frac{\|A(z_{k}-x_{\star})\|_{2}}{\|z_{k}-x_{\star}\|_{2}}$ for the REGS method.

Figure 3 and Figure 4 show results similar to those obtained for the RGS method. That is, the convergence directions of $Az_{k}-Ax_{\star}$ and $z_{k}-x_{\star}$ for the REGS method are initially driven by the large singular values, then mainly by the small singular values, and finally by the smallest nonzero singular value of $A$.

References

  • [1] Å. Björck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, 1996.
  • [2] N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, 2002.
  • [3] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, 2003.
  • [4] D. Leventhal, A. S. Lewis, Randomized methods for linear constraints: convergence rates and conditioning, Math. Oper. Res. 35 (2010) 641–654.
  • [5] A. Ma, D. Needell, A. Ramdas, Convergence properties of the randomized extended Gauss–Seidel and Kaczmarz methods, SIAM J. Matrix Anal. Appl. 36 (2015) 1590–1604.
  • [6] T. Strohmer, R. Vershynin, A randomized Kaczmarz algorithm with exponential convergence, J. Fourier Anal. Appl. 15 (2009) 262–278.
  • [7] A. Zouzias, M. N. Freris, Randomized extended Kaczmarz for solving least squares, SIAM J. Matrix Anal. Appl. 34 (2013) 773–793.
  • [8] R. M. Gower, P. Richtárik, Randomized iterative methods for linear systems, SIAM J. Matrix Anal. Appl. 36 (2015) 1660–1690.
  • [9] J. Nutini, M. Schmidt, I. Laradji, M. Friedlander, H. Koepke, Coordinate descent converges faster with the Gauss–Southwell rule than random selection, in: International Conference on Machine Learning, PMLR, 2015, pp. 1632–1641.
  • [10] A. Hefny, D. Needell, A. Ramdas, Rows versus columns: randomized Kaczmarz or Gauss–Seidel for ridge regression, SIAM J. Sci. Comput. 39 (2017) S528–S542.
  • [11] S. Tu, S. Venkataraman, A. C. Wilson, A. Gittens, M. I. Jordan, B. Recht, Breaking locality accelerates block Gauss–Seidel, in: International Conference on Machine Learning, PMLR, 2017, pp. 3482–3491.
  • [12] Y. Y. Xu, Hybrid Jacobian and Gauss–Seidel proximal block coordinate update methods for linearly constrained convex programming, SIAM J. Optim. 28 (2018) 646–670.
  • [13] K. Du, Tight upper bounds for the convergence of the randomized extended Kaczmarz and Gauss–Seidel algorithms, Numer. Linear Algebra Appl. 26 (2019) e2233.
  • [14] M. Razaviyayn, M. Y. Hong, N. Reyhanian, Z. Q. Luo, A linearly convergent doubly stochastic Gauss–Seidel algorithm for solving linear equations and a certain class of over–parameterized optimization problems, Math. Program. 176 (2019) 465–496.
  • [15] Y. L. Jiao, B. T. Jin, X. L. Lu, Preasymptotic convergence of randomized Kaczmarz method, Inverse Problems 33 (2017) 125012.
  • [16] S. Steinerberger, Randomized Kaczmarz converges along small singular vectors, SIAM J. Matrix Anal. Appl. 42 (2021) 608–615.
  • [17] Y. J. Zhang, H. Y. Li, Preconvergence of the randomized extended Kaczmarz method, arXiv preprint arXiv:2105.04924, 2021.
  • [18] G. H. Golub, C. F. Van Loan, Matrix Computations, fourth ed., Johns Hopkins University Press, Baltimore, 2013.