
Convergence directions of the randomized Gauss–Seidel method and its extension

(The work is supported by the National Natural Science Foundation of China (No. 11671060) and the Natural Science Foundation Project of CQ CSTC (No. cstc2019jcyj-msxmX0267).)

Yanjun Zhang,  Hanyu Li College of Mathematics and Statistics, Chongqing University, Chongqing 401331, P.R. China
Abstract

The randomized Gauss–Seidel method and its extension have attracted much attention recently, and their convergence rates have been considered extensively. However, the convergence rates are usually described by upper bounds, which cannot fully reflect the actual convergence. In this paper, we make a detailed analysis of the convergence behaviors of these two methods. The analysis shows that the larger a singular value of $A$ is, the faster the error decays in the corresponding singular vector space, and that the convergence directions are mainly driven by the large singular values at the beginning, then gradually by the small singular values, and finally by the smallest nonzero singular value. These results explain the phenomenon, observed in extensive numerical experiments in the literature, that these two methods seem to converge faster at the beginning. Numerical examples are provided to confirm the above findings.

keywords:
convergence direction, randomized Gauss–Seidel method, randomized extended Gauss–Seidel method, singular vector, least squares

1 Introduction

The linear least squares problem is ubiquitous, arising frequently in data analysis and scientific computing. Specifically, given a data matrix $A\in R^{m\times n}$ and a data vector $b\in R^{m}$, a linear least squares problem can be written as follows:

\min\limits_{x\in R^{n}}\|b-Ax\|^{2}_{2}. \qquad (1)

In the literature, several direct methods have been proposed for solving its normal equations $A^{T}Ax=A^{T}b$ through either the QR factorization or the singular value decomposition (SVD) of $A^{T}A$ [1, 2], which can be prohibitive when the matrix is large-scale. Hence, iterative methods are considered for solving large linear least squares problems, such as the famous Gauss–Seidel method [3].

In [4], Leventhal and Lewis proved that the randomized Gauss–Seidel (RGS) method, also known as the randomized coordinate descent method, converges to the solution at a linear rate in expectation. This method works on the columns of the matrix $A$ at random with probability proportional to their norms. Later, Ma, Needell and Ramdas [5] provided a unified theory of the RGS method and the randomized Kaczmarz (RK) method [6], where the latter works on the rows of $A$, and showed that the RGS method converges to the minimum Euclidean norm least squares solution $x_{\star}$ of (1) only when the matrix $A$ has full column rank. To further develop the RGS method for more general matrices, inspired by the randomized extended Kaczmarz (REK) method [7], Ma et al. [5] presented a variant of the RGS method, i.e., the randomized extended Gauss–Seidel (REGS) method, and proved that the REGS method converges to $x_{\star}$ regardless of whether the matrix $A$ has full column rank. After that, many variants of the RGS (or REGS) method were developed and studied extensively; see for example [8, 9, 10, 11, 12, 13, 14] and the references therein.

To the best of our knowledge, studies of the convergence properties of the RGS and REGS methods mainly focus on their convergence rates and usually give corresponding upper bounds; no work addresses what determines their convergence rates, what drives their convergence directions, or what their ultimate directions are. As noted above, an upper bound can only serve as a reference for the convergence rate and cannot truly reflect the empirical convergence of the method. It is therefore interesting to consider the above three problems.

In 2017, Jiao, Jin and Lu [15] analyzed the preasymptotic convergence of the RK method. Recently, Steinerberger [16] made a more detailed analysis of the convergence of the RK method for overdetermined full rank linear systems. He showed that the right singular vectors of the matrix $A$ describe the directions of distinguished dynamics and that the RK method converges along small right singular vectors. After that, Zhang and Li [17] considered the convergence of the REK method for all types of linear systems (consistent or inconsistent, overdetermined or underdetermined, full rank or rank deficient) and showed that the REK method converges to the minimum Euclidean norm least squares solution $x_{\star}$ with different decay rates in different right singular vector spaces.

In this paper, we analyze the convergence properties of the RGS and REGS methods for the linear least squares problem and show that the decay rates of the sequences $\{Ax_{k}\}_{k=1}^{\infty}$ and $\{x_{k}\}_{k=1}^{\infty}$ (resp., the sequences $\{Az_{k}\}_{k=1}^{\infty}$ and $\{z_{k}\}_{k=1}^{\infty}$) generated by the RGS method (resp., the REGS method) depend on the size of the singular values of $A$. Specifically, the larger a singular value of $A$ is, the faster the error decays in the corresponding singular vector space, and the convergence directions are mainly driven by the large singular values at the beginning, then gradually by the small singular values, and finally by the smallest nonzero singular value.

The rest of this paper is organized as follows. We first introduce some notations and preliminaries in Section 2 and then present our main results about the RGS and REGS methods in Section 3 and Section 4, respectively. Numerical experiments are given in Section 5.

2 Notations and preliminaries

Throughout the paper, for a matrix $A$, we use $A^{T}$, $A^{(i)}$, $A_{(j)}$, $\sigma_{i}(A)$, $\sigma_{r}(A)$, $\|A\|_{F}$, and $\mathcal{R}(A)$ to denote its transpose, $i$th row (or $i$th entry in the case of a vector), $j$th column, $i$th singular value, smallest nonzero singular value, Frobenius norm, and column space, respectively. For any integer $m\geq 1$, let $[m]:=\{1,2,3,\ldots,m\}$. If the matrix $G\in R^{n\times n}$ is positive definite, we define the energy norm of any vector $x\in R^{n}$ as $\|x\|_{G}:=\sqrt{x^{T}Gx}$. In addition, we denote the identity matrix by $I$, its $j$th column by $e_{(j)}$, and the expectation of any random variable $\xi$ by $\mathbb{E}[\xi]$.

In the following, we use $x_{\star}=A^{\dagger}b$ to denote the minimum Euclidean norm least squares solution of (1), where $A^{\dagger}$ denotes the Moore–Penrose pseudoinverse of the matrix $A$. Because the SVD is the basic tool for the convergence analysis in the next two sections, we denote the SVD [18] of $A\in R^{m\times n}$ by

A=U\Sigma V^{T},

where $U=[u_{1},u_{2},\ldots,u_{m}]\in R^{m\times m}$ and $V=[v_{1},v_{2},\ldots,v_{n}]\in R^{n\times n}$ are orthogonal matrices whose columns are known as the left and right singular vectors, respectively, and $\Sigma\in R^{m\times n}$ is diagonal with the diagonal elements ordered nonincreasingly, i.e., $\sigma_{1}(A)\geq\sigma_{2}(A)\geq\ldots\geq\sigma_{r}(A)>0$ with $r\leq\min\{m,n\}$.

3 Convergence directions of the RGS method

We first list the RGS method [4, 5] in Algorithm 1 and restate its convergence bound in Theorem 1.

Algorithm 1 (The RGS method)

  1. INPUT: $A$, $b$, $\ell$, $x_{0}\in R^{n}$
  2. For $k=1,2,\ldots,\ell-1$ do
  3.   Select $j\in[n]$ with probability $\frac{\|A_{(j)}\|^{2}_{2}}{\|A\|^{2}_{F}}$
  4.   Set $x_{k}=x_{k-1}-\frac{A_{(j)}^{T}(Ax_{k-1}-b)}{\|A_{(j)}\|_{2}^{2}}e_{(j)}$
  5. End for
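For concreteness, the update in Algorithm 1 can be summarized by the following NumPy sketch (the function name rgs, the residual bookkeeping, and the random number generator are our own implementation choices, not part of the original algorithm statement):

```python
import numpy as np

def rgs(A, b, n_iters, x0=None, rng=None):
    """A minimal sketch of Algorithm 1 (randomized Gauss-Seidel / coordinate descent)."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float).copy()
    col_sq = np.sum(A**2, axis=0)            # ||A_(j)||_2^2 for every column j
    probs = col_sq / col_sq.sum()            # selection probabilities ||A_(j)||_2^2 / ||A||_F^2
    r = A @ x - b                            # maintain the residual A x_k - b
    for _ in range(n_iters):
        j = rng.choice(n, p=probs)           # select column j with probability prop. to its norm
        delta = A[:, j] @ r / col_sq[j]      # A_(j)^T (A x_{k-1} - b) / ||A_(j)||_2^2
        x[j] -= delta                        # x_k = x_{k-1} - delta * e_(j)
        r -= delta * A[:, j]                 # keep the residual consistent with the new iterate
    return x
```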

Theorem 1

[4, 5] Let $A\in R^{m\times n}$, $b\in R^{m}$, $x_{\star}=A^{\dagger}b$ be the minimum Euclidean norm least squares solution, and $x_{k}$ be the $k$th approximation of the RGS method generated by Algorithm 1 with initial guess $x_{0}\in R^{n}$. Then

\mathbb{E}[\|Ax_{k}-Ax_{\star}\|_{2}^{2}]\leq(1-\frac{\sigma_{r}^{2}(A)}{\|A\|^{2}_{F}})^{k}\|Ax_{0}-Ax_{\star}\|_{2}^{2}. \qquad (2)
Remark 1

Theorem 1 shows that $Ax_{k}$ converges linearly in expectation to $Ax_{\star}$ regardless of whether the matrix $A$ has full rank. Since $\|Ax_{k}-Ax_{\star}\|_{2}^{2}=\|x_{k}-x_{\star}\|_{A^{T}A}^{2}$, it follows from (2) that

\mathbb{E}[\|x_{k}-x_{\star}\|_{A^{T}A}^{2}]\leq(1-\frac{\sigma_{r}^{2}(A)}{\|A\|^{2}_{F}})^{k}\|x_{0}-x_{\star}\|_{A^{T}A}^{2},

which implies that $x_{k}$ converges linearly in expectation to the minimum Euclidean norm least squares solution $x_{\star}$ when the matrix $A$ is overdetermined and of full column rank, but cannot converge to $x_{\star}$ when $A$ does not have full column rank. So, we assume that the matrix $A$ has full column rank in this section.

Now, we give our three main results for the RGS method.

Theorem 2

Let $A\in R^{m\times n}$, $b\in R^{m}$, $x_{\star}=A^{\dagger}b$ be the minimum Euclidean norm least squares solution, and $x_{k}$ be the $k$th approximation of the RGS method generated by Algorithm 1 with initial guess $x_{0}\in R^{n}$. Then

\mathbb{E}[\langle Ax_{k}-Ax_{\star},u_{\ell}\rangle]=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle Ax_{0}-Ax_{\star},u_{\ell}\rangle. \qquad (3)
Proof 1

Let $\mathbb{E}_{k-1}[\cdot]$ be the conditional expectation conditioned on the first $k-1$ iterations of the RGS method. Then, from Algorithm 1, we have

\begin{align*}
\mathbb{E}_{k-1}[\langle Ax_{k}-Ax_{\star},u_{\ell}\rangle]
&=\sum_{j=1}^{n}\frac{\|A_{(j)}\|_{2}^{2}}{\|A\|_{F}^{2}}\langle Ax_{k-1}-\frac{A_{(j)}^{T}(Ax_{k-1}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}-Ax_{\star},u_{\ell}\rangle\\
&=\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle-\frac{1}{\|A\|_{F}^{2}}\sum_{j=1}^{n}\langle A_{(j)}^{T}(Ax_{k-1}-b)A_{(j)},u_{\ell}\rangle\\
&=\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle-\frac{1}{\|A\|_{F}^{2}}\sum_{j=1}^{n}\langle A_{(j)},Ax_{k-1}-b\rangle\langle A_{(j)},u_{\ell}\rangle\\
&=\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle-\frac{1}{\|A\|_{F}^{2}}\langle A^{T}(Ax_{k-1}-b),A^{T}u_{\ell}\rangle,
\end{align*}

which together with the facts $A^{T}(b-Ax_{\star})=0$ and $A^{T}u_{\ell}=\sigma_{\ell}(A)v_{\ell}$ yields

\begin{align*}
\mathbb{E}_{k-1}[\langle Ax_{k}-Ax_{\star},u_{\ell}\rangle]
&=\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle-\frac{1}{\|A\|_{F}^{2}}\langle A^{T}(Ax_{k-1}-Ax_{\star}),A^{T}u_{\ell}\rangle\\
&=\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle-\frac{1}{\|A\|_{F}^{2}}\langle A^{T}\big(\sum_{i=1}^{m}\langle Ax_{k-1}-Ax_{\star},u_{i}\rangle u_{i}\big),A^{T}u_{\ell}\rangle\\
&=\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle-\frac{1}{\|A\|_{F}^{2}}\langle\sum_{i=1}^{m}\langle Ax_{k-1}-Ax_{\star},u_{i}\rangle\sigma_{i}(A)v_{i},\sigma_{\ell}(A)v_{\ell}\rangle\\
&=\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle-\frac{\sigma_{\ell}^{2}(A)}{\|A\|_{F}^{2}}\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle\\
&=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|_{F}^{2}})\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle.
\end{align*}

Thus, by taking the full expectation on both sides, we have

\mathbb{E}[\langle Ax_{k}-Ax_{\star},u_{\ell}\rangle]=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|_{F}^{2}})\mathbb{E}[\langle Ax_{k-1}-Ax_{\star},u_{\ell}\rangle]=\ldots=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|_{F}^{2}})^{k}\langle Ax_{0}-Ax_{\star},u_{\ell}\rangle,

which is the estimate (3).

Remark 2

Theorem 2 shows that the decay rates of $\|Ax_{k}-Ax_{\star}\|_{2}$ are different in different left singular vector spaces. Specifically, the decay rates depend on the singular values: the larger the singular value of $A$ is, the faster the error decays in the corresponding left singular vector space. This implies that the smallest singular value leads to the slowest rate of convergence, which is the one appearing in (2). So, the convergence bound presented in [4, 5] is optimal.
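Since (3) is an equality in expectation, it can be checked numerically by averaging over independent runs. The following sketch is our own illustration, reusing the rgs routine above; the problem sizes, the iteration count and the monitored index $\ell$ are arbitrary choices. It compares the empirical mean of $\langle Ax_{k}-Ax_{\star},u_{\ell}\rangle$ with the factor predicted by Theorem 2.

```python
import numpy as np

# Monte Carlo check of Theorem 2 (assumes the rgs() sketch above is in scope).
rng = np.random.default_rng(0)
m, n, k, trials, ell = 60, 20, 30, 2000, 15
A = rng.standard_normal((m, n))                  # full column rank with probability one
b = rng.standard_normal(m)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # minimum norm least squares solution
U, s, Vt = np.linalg.svd(A)                      # u_ell = U[:, ell], sigma_ell = s[ell]
x0 = np.zeros(n)

acc = 0.0
for _ in range(trials):
    xk = rgs(A, b, k, x0=x0, rng=rng)
    acc += U[:, ell] @ (A @ xk - A @ x_star)     # <Ax_k - Ax_*, u_ell> for this run
empirical = acc / trials
predicted = (1 - s[ell]**2 / np.sum(A**2))**k * (U[:, ell] @ (A @ x0 - A @ x_star))
print(empirical, predicted)                      # should agree up to Monte Carlo error
```

Repeating the check for several indices $\ell$ illustrates how the decay rate deteriorates as $\sigma_{\ell}(A)$ decreases.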

Remark 3

Let $r_{k}=b-Ax_{k}$ be the residual vector with respect to the $k$th approximation $x_{k}$, and $r_{\star}=b-Ax_{\star}$ be the true residual vector with respect to the minimum Euclidean norm least squares solution $x_{\star}$. It follows from (3) and $Ax_{k}-Ax_{\star}=-(r_{k}-r_{\star})$ that

\mathbb{E}[\langle r_{k}-r_{\star},u_{\ell}\rangle]=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle r_{0}-r_{\star},u_{\ell}\rangle.

Hence, Theorem 2 also implies that the decay rates of $\|r_{k}-r_{\star}\|_{2}$ of the RGS method depend on the singular values.

Remark 4

Using the facts $\langle Ax_{k}-Ax_{\star},u_{\ell}\rangle=\langle x_{k}-x_{\star},A^{T}u_{\ell}\rangle$ and $A^{T}u_{\ell}=\sigma_{\ell}(A)v_{\ell}$, from (3) we have

\mathbb{E}[\langle x_{k}-x_{\star},v_{\ell}\rangle]=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle x_{0}-x_{\star},v_{\ell}\rangle,

which recovers the decay rates of the RK method in different right singular vector spaces [16]. In this respect, the RGS and RK methods are essentially equivalent.

Theorem 3

Let $A\in R^{m\times n}$, $b\in R^{m}$, $x_{\star}=A^{\dagger}b$ be the minimum Euclidean norm least squares solution, and $x_{k}$ be the $k$th approximation of the RGS method generated by Algorithm 1 with initial guess $x_{0}\in R^{n}$. Then

\mathbb{E}[\|Ax_{k}-Ax_{\star}\|_{2}^{2}]=\mathbb{E}[(1-\frac{1}{\|A\|^{2}_{F}}\|A^{T}\frac{Ax_{k-1}-Ax_{\star}}{\|Ax_{k-1}-Ax_{\star}\|_{2}}\|_{2}^{2})\|Ax_{k-1}-Ax_{\star}\|_{2}^{2}].
Proof 2

Following an argument similar to the proof in [5], we can derive the desired result.

Remark 5

Since $\|A^{T}\frac{Ax_{k-1}-Ax_{\star}}{\|Ax_{k-1}-Ax_{\star}\|_{2}}\|_{2}^{2}\geq\sigma_{r}^{2}(A)$, Theorem 3 implies that the RGS method actually converges faster if $Ax_{k-1}-Ax_{\star}$ is not close to the left singular vectors corresponding to the small singular values of $A$.

Theorem 4

Let $A\in R^{m\times n}$, $b\in R^{m}$, $x_{\star}=A^{\dagger}b$ be the minimum Euclidean norm least squares solution, and $x_{k}$ be the $k$th approximation of the RGS method generated by Algorithm 1 with initial guess $x_{0}\in R^{n}$. Then

\mathbb{E}[\langle\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}},\frac{Ax_{k+1}-Ax_{\star}}{\|Ax_{k+1}-Ax_{\star}\|_{2}}\rangle^{2}]=1-\frac{1}{\|A\|^{2}_{F}}\mathbb{E}[\|A^{T}\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}}\|_{2}^{2}]. \qquad (4)
Proof 3

From Algorithm 1, we have

\begin{align*}
&\mathbb{E}_{k}[\langle\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}},\frac{Ax_{k+1}-Ax_{\star}}{\|Ax_{k+1}-Ax_{\star}\|_{2}}\rangle^{2}]\\
&=\mathbb{E}_{k}[\langle\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}},\frac{Ax_{k}-\frac{A_{(j)}^{T}(Ax_{k}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}-Ax_{\star}}{\|Ax_{k}-\frac{A_{(j)}^{T}(Ax_{k}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}-Ax_{\star}\|_{2}}\rangle^{2}]\\
&=\mathbb{E}_{k}[\frac{\langle Ax_{k}-Ax_{\star},Ax_{k}-\frac{A_{(j)}^{T}(Ax_{k}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}-Ax_{\star}\rangle^{2}}{\|Ax_{k}-Ax_{\star}\|_{2}^{2}\,\|Ax_{k}-\frac{A_{(j)}^{T}(Ax_{k}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}-Ax_{\star}\|_{2}^{2}}].
\end{align*}

Since $\langle Ax_{k}-Ax_{\star},Ax_{k}-\frac{A_{(j)}^{T}(Ax_{k}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}-Ax_{\star}\rangle=\|Ax_{k}-\frac{A_{(j)}^{T}(Ax_{k}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}-Ax_{\star}\|_{2}^{2}$, we have

\begin{align*}
&\mathbb{E}_{k}[\langle\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}},\frac{Ax_{k+1}-Ax_{\star}}{\|Ax_{k+1}-Ax_{\star}\|_{2}}\rangle^{2}]\\
&=\mathbb{E}_{k}[\frac{1}{\|Ax_{k}-Ax_{\star}\|_{2}^{2}}\|Ax_{k}-\frac{A_{(j)}^{T}(Ax_{k}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}-Ax_{\star}\|_{2}^{2}]\\
&=\mathbb{E}_{k}[\frac{1}{\|Ax_{k}-Ax_{\star}\|_{2}^{2}}(\|Ax_{k}-Ax_{\star}\|_{2}^{2}-2\langle Ax_{k}-Ax_{\star},\frac{A_{(j)}^{T}(Ax_{k}-b)}{\|A_{(j)}\|_{2}^{2}}A_{(j)}\rangle+\frac{(A_{(j)}^{T}(Ax_{k}-b))^{2}}{\|A_{(j)}\|_{2}^{2}})]\\
&=\mathbb{E}_{k}[\frac{1}{\|Ax_{k}-Ax_{\star}\|_{2}^{2}}(\|Ax_{k}-Ax_{\star}\|_{2}^{2}-\frac{(A_{(j)}^{T}(Ax_{k}-Ax_{\star}))^{2}}{\|A_{(j)}\|_{2}^{2}})]\\
&=\mathbb{E}_{k}[1-\frac{(A_{(j)}^{T}\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}})^{2}}{\|A_{(j)}\|_{2}^{2}}]\\
&=\sum_{j=1}^{n}\frac{\|A_{(j)}\|_{2}^{2}}{\|A\|_{F}^{2}}(1-\frac{(A_{(j)}^{T}\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}})^{2}}{\|A_{(j)}\|_{2}^{2}})\\
&=1-\frac{1}{\|A\|_{F}^{2}}\|A^{T}\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}}\|_{2}^{2}.
\end{align*}

Thus, by taking the full expectation on both sides, we obtain the desired result (4).

Remark 6

Let $u$ and $v$ be two unit vectors, i.e., $\|u\|_{2}=1$ and $\|v\|_{2}=1$. We use the quantity $\langle u,v\rangle^{2}$ to measure the angle between $u$ and $v$: the smaller this quantity is, the bigger the angle, and the bigger the angle is, the bigger the fluctuation from $u$ to $v$ becomes. Theorem 4 thus characterizes the fluctuation between two adjacent iterations. Specifically, when $\|A^{T}\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}}\|_{2}^{2}$ is large, the angle between $\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}}$ and $\frac{Ax_{k+1}-Ax_{\star}}{\|Ax_{k+1}-Ax_{\star}\|_{2}}$ is large, which implies that the direction of $\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}}$ fluctuates strongly; when $\|A^{T}\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}}\|_{2}^{2}$ is small, the angle between these two vectors is small, which implies that $\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}}$ fluctuates very little.

Since $\|A^{T}\frac{Ax_{k}-Ax_{\star}}{\|Ax_{k}-Ax_{\star}\|_{2}}\|_{2}^{2}\geq\sigma_{r}^{2}(A)$, Theorem 4 implies that if $Ax_{k}-Ax_{\star}$ is mainly composed of left singular vectors corresponding to the small singular values of $A$, its direction hardly changes, which means that the RGS method finally converges along the left singular vectors corresponding to the small singular values of $A$.
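The fluctuation described in Theorem 4 can be observed along a single run. The sketch below is our own illustration on a random test problem (not one of the experiments of Section 5); it records the squared inner product of two consecutive normalized errors, which is typically farther from 1 in the early iterations and very close to 1 later on, when the error has aligned with the small singular directions.

```python
import numpy as np

# Tracking the quantity of Theorem 4 along one RGS run (illustrative sketch).
rng = np.random.default_rng(1)
m, n, iters = 60, 20, 500
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
col_sq = np.sum(A**2, axis=0)
probs = col_sq / col_sq.sum()

x = np.zeros(n)
e = A @ x - A @ x_star                           # current error Ax_k - Ax_*
cos_sq = []
for _ in range(iters):
    j = rng.choice(n, p=probs)
    delta = A[:, j] @ (A @ x - b) / col_sq[j]
    x[j] -= delta
    e_new = A @ x - A @ x_star
    c = (e @ e_new) / (np.linalg.norm(e) * np.linalg.norm(e_new))
    cos_sq.append(c**2)                          # squared inner product of the normalized errors
    e = e_new
print(cos_sq[:5], cos_sq[-5:])                   # late values are typically much closer to 1
```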

4 Convergence directions of the REGS method

Recalling Remark 1, when the matrix $A$ does not have full column rank, the sequence $\{x_{k}\}_{k=1}^{\infty}$ generated by the RGS method does not converge to the minimum Euclidean norm least squares solution $x_{\star}$, even though $Ax_{k}$ does converge to $Ax_{\star}$. In [5], Ma et al. proposed an extended variant of the RGS method, i.e., the REGS method, which converges to $x_{\star}$ regardless of whether $A$ has full column rank.

Now, we list the REGS method presented in [13] in Algorithm 2, which is an equivalent variant of the original REGS method [5], and restate its convergence bound presented in [13] in Theorem 5. From the algorithm we see that, in each iteration, $x_{k}$ is the $k$th approximation of the RGS method and $z_{k}$ is a one-step RK update from $z_{k-1}$ for the linear system $Az=Ax_{k}$.

Algorithm 2 (The REGS method)

  1. INPUT: $A$, $b$, $\ell$, $x_{0}\in R^{n}$, $z_{0}\in\mathcal{R}(A^{T})$
  2. For $k=1,2,\ldots,\ell-1$ do
  3.   Select $j\in[n]$ with probability $\frac{\|A_{(j)}\|^{2}_{2}}{\|A\|^{2}_{F}}$
  4.   Set $x_{k}=x_{k-1}-\frac{A_{(j)}^{T}(Ax_{k-1}-b)}{\|A_{(j)}\|_{2}^{2}}e_{(j)}$
  5.   Select $i\in[m]$ with probability $\frac{\|A^{(i)}\|^{2}_{2}}{\|A\|^{2}_{F}}$
  6.   Set $z_{k}=z_{k-1}-\frac{A^{(i)}(z_{k-1}-x_{k})}{\|A^{(i)}\|_{2}^{2}}(A^{(i)})^{T}$
  7. End for
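As with Algorithm 1, a minimal NumPy sketch of Algorithm 2 may be helpful; the function name regs and the choice $x_{0}=z_{0}=0$ are ours (the algorithm itself allows any $x_{0}\in R^{n}$ and $z_{0}\in\mathcal{R}(A^{T})$).

```python
import numpy as np

def regs(A, b, n_iters, rng=None):
    """A minimal sketch of Algorithm 2: an RGS step for x followed by a one-step RK update for z."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    col_sq = np.sum(A**2, axis=0)            # ||A_(j)||_2^2
    row_sq = np.sum(A**2, axis=1)            # ||A^(i)||_2^2
    col_p = col_sq / col_sq.sum()
    row_p = row_sq / row_sq.sum()
    x = np.zeros(n)                          # x_0 = 0
    z = np.zeros(n)                          # z_0 = 0 lies in R(A^T)
    for _ in range(n_iters):
        j = rng.choice(n, p=col_p)           # RGS step on the columns of A
        x[j] -= A[:, j] @ (A @ x - b) / col_sq[j]
        i = rng.choice(m, p=row_p)           # RK step on the rows of A for the system Az = Ax_k
        z -= (A[i, :] @ (z - x)) / row_sq[i] * A[i, :]
    return z
```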

Theorem 5

[13] Let $A\in R^{m\times n}$, $b\in R^{m}$, $x_{\star}=A^{\dagger}b$ be the minimum Euclidean norm least squares solution, and $z_{k}$ be the $k$th approximation of the REGS method generated by Algorithm 2 with initial $x_{0}\in R^{n}$ and $z_{0}\in\mathcal{R}(A^{T})$. Then

\mathbb{E}[\|z_{k}-x_{\star}\|_{2}^{2}]\leq(1-\frac{\sigma_{r}^{2}(A)}{\|A\|^{2}_{F}})^{k}\|z_{0}-x_{\star}\|_{2}^{2}+\frac{k}{\|A\|_{F}^{2}}(1-\frac{\sigma_{r}^{2}(A)}{\|A\|^{2}_{F}})^{k}\|Ax_{0}-Ax_{\star}\|_{2}^{2}. \qquad (5)

For the REGS method, we first discuss the convergence behavior of $z_{k}-x_{\star}$ in Theorem 6 and Theorem 7, and then consider the convergence behavior of $Az_{k}-Ax_{\star}$ in Theorem 8.

Theorem 6

Let $A\in R^{m\times n}$, $b\in R^{m}$, $x_{\star}=A^{\dagger}b$ be the minimum Euclidean norm least squares solution, and $z_{k}$ be the $k$th approximation of the REGS method generated by Algorithm 2 with initial $x_{0}\in R^{n}$ and $z_{0}\in\mathcal{R}(A^{T})$. Then

\mathbb{E}[\langle z_{k}-x_{\star},v_{\ell}\rangle]=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle z_{0}-x_{\star},v_{\ell}\rangle+\frac{k}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle A^{T}(Ax_{0}-Ax_{\star}),v_{\ell}\rangle. \qquad (6)
Proof 4

From Algorithm 2, we have

\begin{align*}
\mathbb{E}[\langle z_{k}-x_{\star},v_{\ell}\rangle]
&=\mathbb{E}[\langle z_{k-1}-\frac{A^{(i)}(z_{k-1}-x_{k})}{\|A^{(i)}\|_{2}^{2}}(A^{(i)})^{T}-x_{\star},v_{\ell}\rangle]\\
&=\mathbb{E}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle]+\mathbb{E}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle], \qquad (7)
\end{align*}

so we next consider $\mathbb{E}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle]$ and $\mathbb{E}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle]$ separately.

We first consider $\mathbb{E}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle]$. Let $\mathbb{E}_{k-1}[\cdot]$ be the conditional expectation conditioned on the first $k-1$ iterations of the REGS method. That is,

\mathbb{E}_{k-1}[\cdot]=\mathbb{E}[\cdot\,|\,j_{1},i_{1},j_{2},i_{2},\ldots,j_{k-1},i_{k-1}],

where $j_{t^{*}}$ is the $t^{*}$th column chosen and $i_{t^{*}}$ is the $t^{*}$th row chosen. We denote the conditional expectation conditioned on the first $k-1$ iterations and the $k$th column chosen as

\mathbb{E}_{k-1}^{i}[\cdot]=\mathbb{E}[\cdot\,|\,j_{1},i_{1},j_{2},i_{2},\ldots,j_{k-1},i_{k-1},j_{k}].

Similarly, we denote the conditional expectation conditioned on the first $k-1$ iterations and the $k$th row chosen as

\mathbb{E}_{k-1}^{j}[\cdot]=\mathbb{E}[\cdot\,|\,j_{1},i_{1},j_{2},i_{2},\ldots,j_{k-1},i_{k-1},i_{k}].

Then, by the law of total expectation, we have

\mathbb{E}_{k-1}[\cdot]=\mathbb{E}_{k-1}^{j}[\mathbb{E}_{k-1}^{i}[\cdot]].

Thus, we obtain

\begin{align*}
&\mathbb{E}_{k-1}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle]\\
&=\mathbb{E}_{k-1}[\langle z_{k-1}-x_{\star},v_{\ell}\rangle-\langle\frac{A^{(i)}(z_{k-1}-x_{\star})}{\|A^{(i)}\|_{2}^{2}}(A^{(i)})^{T},v_{\ell}\rangle]\\
&=\langle z_{k-1}-x_{\star},v_{\ell}\rangle-\frac{1}{\|A\|^{2}_{F}}\sum_{i=1}^{m}\langle A^{(i)}(z_{k-1}-x_{\star})(A^{(i)})^{T},v_{\ell}\rangle\\
&=\langle z_{k-1}-x_{\star},v_{\ell}\rangle-\frac{1}{\|A\|^{2}_{F}}\sum_{i=1}^{m}\langle(A^{(i)})^{T},z_{k-1}-x_{\star}\rangle\langle(A^{(i)})^{T},v_{\ell}\rangle\\
&=\langle z_{k-1}-x_{\star},v_{\ell}\rangle-\frac{1}{\|A\|^{2}_{F}}\langle A(z_{k-1}-x_{\star}),Av_{\ell}\rangle.
\end{align*}

Further, by making use of $z_{k-1}-x_{\star}=\sum_{i=1}^{n}\langle z_{k-1}-x_{\star},v_{i}\rangle v_{i}$ and $Av_{i}=\sigma_{i}(A)u_{i}$, we get

\begin{align*}
&\mathbb{E}_{k-1}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle]\\
&=\langle z_{k-1}-x_{\star},v_{\ell}\rangle-\frac{1}{\|A\|^{2}_{F}}\langle A\sum_{i=1}^{n}\langle z_{k-1}-x_{\star},v_{i}\rangle v_{i},\sigma_{\ell}(A)u_{\ell}\rangle\\
&=\langle z_{k-1}-x_{\star},v_{\ell}\rangle-\frac{1}{\|A\|^{2}_{F}}\langle\sum_{i=1}^{n}\langle z_{k-1}-x_{\star},v_{i}\rangle\sigma_{i}(A)u_{i},\sigma_{\ell}(A)u_{\ell}\rangle,
\end{align*}

which together with the orthogonality of the left singular vectors $u_{i}$ yields

\begin{align*}
&\mathbb{E}_{k-1}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle]\\
&=\langle z_{k-1}-x_{\star},v_{\ell}\rangle-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}}\langle z_{k-1}-x_{\star},v_{\ell}\rangle\\
&=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})\langle z_{k-1}-x_{\star},v_{\ell}\rangle.
\end{align*}

As a result, by taking the full expectation on both sides, we have

\mathbb{E}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle]=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})\mathbb{E}[\langle z_{k-1}-x_{\star},v_{\ell}\rangle]. \qquad (8)

We now consider $\mathbb{E}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle]$. It follows from

\begin{align*}
&\mathbb{E}_{k-1}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle]\\
&=\mathbb{E}_{k-1}^{j}[\mathbb{E}_{k-1}^{i}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle]]\\
&=\mathbb{E}_{k-1}^{j}[\frac{1}{\|A\|^{2}_{F}}\sum_{i=1}^{m}\langle(A^{(i)})^{T}A^{(i)}(x_{k}-x_{\star}),v_{\ell}\rangle]\\
&=\mathbb{E}_{k-1}^{j}[\frac{1}{\|A\|^{2}_{F}}\sum_{i=1}^{m}\langle A^{(i)}(x_{k}-x_{\star}),A^{(i)}v_{\ell}\rangle]\\
&=\mathbb{E}_{k-1}[\frac{1}{\|A\|^{2}_{F}}\langle A(x_{k}-x_{\star}),Av_{\ell}\rangle],
\end{align*}

that

\mathbb{E}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle]=\mathbb{E}[\frac{1}{\|A\|^{2}_{F}}\langle A(x_{k}-x_{\star}),Av_{\ell}\rangle]. \qquad (9)

Since

\mathbb{E}[\frac{1}{\|A\|^{2}_{F}}\langle A(x_{k}-x_{\star}),Av_{\ell}\rangle]=\frac{\sigma_{\ell}(A)}{\|A\|^{2}_{F}}\mathbb{E}[\langle A(x_{k}-x_{\star}),u_{\ell}\rangle],

by exploiting (3) in Theorem 2, we get

\begin{align*}
\mathbb{E}[\frac{1}{\|A\|^{2}_{F}}\langle A(x_{k}-x_{\star}),Av_{\ell}\rangle]
&=\frac{\sigma_{\ell}(A)}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle Ax_{0}-Ax_{\star},u_{\ell}\rangle\\
&=\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle Ax_{0}-Ax_{\star},Av_{\ell}\rangle.
\end{align*}

Thus, substituting the above equality into (9), we have

\begin{align*}
\mathbb{E}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle]
&=\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle Ax_{0}-Ax_{\star},Av_{\ell}\rangle\\
&=\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle A^{T}(Ax_{0}-Ax_{\star}),v_{\ell}\rangle. \qquad (10)
\end{align*}

Combining (7), (8) and (10) yields

\begin{align*}
&\mathbb{E}[\langle z_{k}-x_{\star},v_{\ell}\rangle]\\
&=\mathbb{E}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle]+\mathbb{E}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle]\\
&=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})\mathbb{E}[\langle z_{k-1}-x_{\star},v_{\ell}\rangle]+\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle A^{T}(Ax_{0}-Ax_{\star}),v_{\ell}\rangle\\
&=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{2}\mathbb{E}[\langle z_{k-2}-x_{\star},v_{\ell}\rangle]+\frac{2}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle A^{T}(Ax_{0}-Ax_{\star}),v_{\ell}\rangle\\
&=\ldots=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle z_{0}-x_{\star},v_{\ell}\rangle+\frac{k}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle A^{T}(Ax_{0}-Ax_{\star}),v_{\ell}\rangle,
\end{align*}

which is the desired result (6).

Remark 7

Theorem 6 shows that the decay rates of $\|z_{k}-x_{\star}\|_{2}$ are different in different right singular vector spaces and that the smallest singular value leads to the slowest rate of convergence, which is the one appearing in (5). So, the convergence bound presented by Du [13] is optimal.
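As with Theorem 2, the equality (6) can be checked by a Monte Carlo simulation. The sketch below is our own; it reuses the regs routine above (which fixes $x_{0}=z_{0}=0$), and the problem sizes, iteration count and index $\ell$ are arbitrary choices. Up to Monte Carlo error, the two printed numbers should agree.

```python
import numpy as np

# Monte Carlo check of Theorem 6 (assumes the regs() sketch above, with x_0 = z_0 = 0).
rng = np.random.default_rng(2)
m, n, k, trials, ell = 40, 60, 30, 2000, 20      # underdetermined, so A lacks full column rank
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x_star = np.linalg.pinv(A) @ b                   # minimum Euclidean norm least squares solution
U, s, Vt = np.linalg.svd(A, full_matrices=False) # v_ell = Vt[ell, :], sigma_ell = s[ell]
x0 = np.zeros(n)
z0 = np.zeros(n)
rho = 1 - s[ell]**2 / np.sum(A**2)               # 1 - sigma_ell^2 / ||A||_F^2

acc = 0.0
for _ in range(trials):
    zk = regs(A, b, k, rng=rng)
    acc += Vt[ell, :] @ (zk - x_star)            # <z_k - x_*, v_ell> for this run
empirical = acc / trials
predicted = (rho**k * (Vt[ell, :] @ (z0 - x_star))
             + k / np.sum(A**2) * rho**k * (Vt[ell, :] @ (A.T @ (A @ x0 - A @ x_star))))
print(empirical, predicted)
```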

Theorem 7

Let $A\in R^{m\times n}$, $b\in R^{m}$, $x_{\star}=A^{\dagger}b$ be the minimum Euclidean norm least squares solution, and $z_{k}$ be the $k$th approximation of the REGS method generated by Algorithm 2 with initial $x_{0}\in R^{n}$ and $z_{0}\in\mathcal{R}(A^{T})$. Then

\mathbb{E}[\|z_{k}-x_{\star}\|^{2}_{2}]\leq\mathbb{E}[(1-\frac{1}{\|A\|^{2}_{F}}\|A\frac{z_{k-1}-x_{\star}}{\|z_{k-1}-x_{\star}\|_{2}}\|_{2}^{2})\|z_{k-1}-x_{\star}\|_{2}^{2}]+\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{r}^{2}(A)}{\|A\|^{2}_{F}})^{k}\|Ax_{0}-Ax_{\star}\|^{2}_{2}. \qquad (11)
Proof 5

Following an argument analogous to that of Theorem 4 in [13], we get

\begin{align*}
\mathbb{E}[\|z_{k}-x_{\star}\|_{2}^{2}]
&=\mathbb{E}[\|(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star})\|_{2}^{2}]+\mathbb{E}[\|\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star})\|_{2}^{2}],\\
\mathbb{E}[\|(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star})\|_{2}^{2}]
&=\mathbb{E}[(z_{k-1}-x_{\star})^{T}(I-\frac{A^{T}A}{\|A\|_{F}^{2}})(z_{k-1}-x_{\star})]\\
&=\mathbb{E}[\|z_{k-1}-x_{\star}\|_{2}^{2}-\frac{1}{\|A\|^{2}_{F}}\|A(z_{k-1}-x_{\star})\|_{2}^{2}]\\
&=\mathbb{E}[(1-\frac{1}{\|A\|^{2}_{F}}\|A\frac{z_{k-1}-x_{\star}}{\|z_{k-1}-x_{\star}\|_{2}}\|_{2}^{2})\|z_{k-1}-x_{\star}\|_{2}^{2}],
\end{align*}

and

\mathbb{E}[\|\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star})\|_{2}^{2}]\leq\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{r}^{2}(A)}{\|A\|^{2}_{F}})^{k}\|Ax_{0}-Ax_{\star}\|^{2}_{2}.

Combining the above three relations, we have

\mathbb{E}[\|z_{k}-x_{\star}\|_{2}^{2}]\leq\mathbb{E}[(1-\frac{1}{\|A\|^{2}_{F}}\|A\frac{z_{k-1}-x_{\star}}{\|z_{k-1}-x_{\star}\|_{2}}\|_{2}^{2})\|z_{k-1}-x_{\star}\|_{2}^{2}]+\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{r}^{2}(A)}{\|A\|^{2}_{F}})^{k}\|Ax_{0}-Ax_{\star}\|^{2}_{2},

which implies the desired result (11).

Remark 8

Since $\|A\frac{z_{k-1}-x_{\star}}{\|z_{k-1}-x_{\star}\|_{2}}\|_{2}^{2}\geq\sigma_{r}^{2}(A)$, Theorem 7 implies that $z_{k}$ of the REGS method actually converges faster if $z_{k-1}-x_{\star}$ is not close to the right singular vectors corresponding to the small singular values of $A$.

Theorem 8

Let $A\in R^{m\times n}$, $b\in R^{m}$, $x_{\star}=A^{\dagger}b$ be the minimum Euclidean norm least squares solution, and $z_{k}$ be the $k$th approximation of the REGS method generated by Algorithm 2 with initial $x_{0}\in R^{n}$ and $z_{0}\in\mathcal{R}(A^{T})$. Then

\mathbb{E}[\langle Az_{k}-Ax_{\star},u_{\ell}\rangle]=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle Az_{0}-Ax_{\star},u_{\ell}\rangle+\frac{k}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle AA^{T}(Ax_{0}-Ax_{\star}),u_{\ell}\rangle. \qquad (12)
Proof 6

Similar to the proof of (7) in Theorem 6, we obtain

\mathbb{E}[\langle Az_{k}-Ax_{\star},u_{\ell}\rangle]=\mathbb{E}[\langle A(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),u_{\ell}\rangle]+\mathbb{E}[\langle A\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),u_{\ell}\rangle]. \qquad (13)

Then, we consider $\mathbb{E}[\langle A(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),u_{\ell}\rangle]$ and $\mathbb{E}[\langle A\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),u_{\ell}\rangle]$ separately.

We first consider $\mathbb{E}[\langle A(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),u_{\ell}\rangle]$. It follows from

\langle A(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),u_{\ell}\rangle=\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),A^{T}u_{\ell}\rangle

and $A^{T}u_{\ell}=\sigma_{\ell}(A)v_{\ell}$, that

\mathbb{E}[\langle A(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),u_{\ell}\rangle]=\sigma_{\ell}(A)\mathbb{E}[\langle(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),v_{\ell}\rangle],

which together with (8) yields

\begin{align*}
\mathbb{E}[\langle A(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),u_{\ell}\rangle]
&=\sigma_{\ell}(A)(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})\mathbb{E}[\langle z_{k-1}-x_{\star},v_{\ell}\rangle]\\
&=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})\mathbb{E}[\langle z_{k-1}-x_{\star},\sigma_{\ell}(A)v_{\ell}\rangle]\\
&=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})\mathbb{E}[\langle Az_{k-1}-Ax_{\star},u_{\ell}\rangle]. \qquad (14)
\end{align*}

We now consider $\mathbb{E}[\langle A\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),u_{\ell}\rangle]$. Exploiting (10), we have

\begin{align*}
\mathbb{E}[\langle A\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),u_{\ell}\rangle]
&=\mathbb{E}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),A^{T}u_{\ell}\rangle]\\
&=\sigma_{\ell}(A)\mathbb{E}[\langle\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),v_{\ell}\rangle]\\
&=\frac{\sigma_{\ell}(A)}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle A^{T}(Ax_{0}-Ax_{\star}),v_{\ell}\rangle\\
&=\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle AA^{T}(Ax_{0}-Ax_{\star}),u_{\ell}\rangle. \qquad (15)
\end{align*}

Thus, combining (13), (14) and (15) yields

\begin{align*}
&\mathbb{E}[\langle Az_{k}-Ax_{\star},u_{\ell}\rangle]\\
&=\mathbb{E}[\langle A(I-\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}})(z_{k-1}-x_{\star}),u_{\ell}\rangle]+\mathbb{E}[\langle A\frac{(A^{(i)})^{T}A^{(i)}}{\|A^{(i)}\|_{2}^{2}}(x_{k}-x_{\star}),u_{\ell}\rangle]\\
&=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})\mathbb{E}[\langle Az_{k-1}-Ax_{\star},u_{\ell}\rangle]+\frac{1}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle AA^{T}(Ax_{0}-Ax_{\star}),u_{\ell}\rangle\\
&=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{2}\mathbb{E}[\langle Az_{k-2}-Ax_{\star},u_{\ell}\rangle]+\frac{2}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle AA^{T}(Ax_{0}-Ax_{\star}),u_{\ell}\rangle\\
&=\ldots=(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle Az_{0}-Ax_{\star},u_{\ell}\rangle+\frac{k}{\|A\|^{2}_{F}}(1-\frac{\sigma_{\ell}^{2}(A)}{\|A\|^{2}_{F}})^{k}\langle AA^{T}(Ax_{0}-Ax_{\star}),u_{\ell}\rangle,
\end{align*}

which is the desired result (12).

Remark 9

Theorem 8 shows the decay rates of $\|Az_{k}-Ax_{\star}\|_{2}$ of the REGS method and suggests that small singular values lead to slow convergence rates and vice versa. Similar behavior arises for the RK, REK, and RGS methods, as discussed in [16], [17], and Theorem 2, respectively.

5 Numerical experiments

Now we present two simple examples to illustrate the convergence directions of the RGS and REGS methods. To this end, let $G_{0}\in R^{500\times 500}$ be a Gaussian matrix with i.i.d. $N(0,1)$ entries and $D\in R^{500\times 500}$ be a diagonal matrix whose diagonal elements are all 100. Further, we set $G_{1}=G_{0}+D$ and replace its last row $G_{1}^{(500)}$ by a tiny perturbation of $G_{1}^{(499)}$, i.e., by adding 0.01 to each entry of $G_{1}^{(499)}$. Then, we normalize all rows of $G_{1}$, i.e., set $\|G_{1}^{(i)}\|_{2}=1$, $i=1,2,\ldots,500$. After that, we set $A_{1}=\begin{bmatrix}G_{1}\\ G_{2}\end{bmatrix}\in R^{600\times 500}$ and $A_{2}=\begin{bmatrix}G_{1},G_{3}\end{bmatrix}\in R^{500\times 600}$, where $G_{2}\in R^{100\times 500}$ and $G_{3}\in R^{500\times 100}$ are zero matrices. With this construction, the first 499 singular values of the matrices $A_{1}$ and $A_{2}$ are between $\sim 0.6$ and $\sim 1.5$, and the smallest nonzero singular value is $\sim 10^{-4}$.
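For reference, the construction of the test matrices can be reproduced with a short NumPy sketch along the following lines (this is our reading of the construction above; the original experiments were run in MATLAB, and the random seed and variable names here are ours).

```python
import numpy as np

# Sketch of the test-matrix construction described above.
rng = np.random.default_rng(0)
G0 = rng.standard_normal((500, 500))              # Gaussian matrix with i.i.d. N(0,1) entries
G1 = G0 + 100.0 * np.eye(500)                     # G1 = G0 + D, D = diag(100, ..., 100)
G1[499, :] = G1[498, :] + 0.01                    # last row: tiny perturbation of the 499th row
G1 /= np.linalg.norm(G1, axis=1, keepdims=True)   # normalize every row to unit 2-norm
A1 = np.vstack([G1, np.zeros((100, 500))])        # 600 x 500, full column rank
A2 = np.hstack([G1, np.zeros((500, 100))])        # 500 x 600, rank deficient
sv = np.linalg.svd(A1, compute_uv=False)
print(sv[0], sv[498], sv[499])                    # largest, 499th and smallest nonzero singular values
```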

We first consider the convergence directions of $Ax_{k}-Ax_{\star}$ and $x_{k}-x_{\star}$ for the RGS method. We generate a vector $x\in R^{500}$ using the MATLAB function randn, set the full column rank coefficient matrix $A=A_{1}$, and set the right-hand side $b=Ax+z$, where $z$ is a nonzero vector belonging to the null space of $A^{T}$, generated by the MATLAB function null. With $x_{0}=0$, we plot $|\langle(Ax_{k}-Ax_{\star})/\|Ax_{k}-Ax_{\star}\|_{2},u_{500}\rangle|$ and $\frac{\|A(x_{k}-x_{\star})\|_{2}}{\|x_{k}-x_{\star}\|_{2}}$ in Figure 1 and Figure 2, respectively.

Figure 1: A sample evolution of $|\langle(Ax_{k}-Ax_{\star})/\|Ax_{k}-Ax_{\star}\|_{2},u_{500}\rangle|$ for the RGS method.

From Figure 1, we find that $|\langle(Ax_{k}-Ax_{\star})/\|Ax_{k}-Ax_{\star}\|_{2},u_{500}\rangle|$ is initially very small, almost 0, which indicates that $Ax_{k}-Ax_{\star}$ is not close to the left singular vector $u_{500}$. In view of the analysis in Remark 5, this phenomenon reflects the 'preconvergence' behavior of the RGS method, that is, the RGS method seems to converge quickly at the beginning. In addition, as $k\rightarrow\infty$, $|\langle(Ax_{k}-Ax_{\star})/\|Ax_{k}-Ax_{\star}\|_{2},u_{500}\rangle|\rightarrow 1$. This implies that $Ax_{k}-Ax_{\star}$ tends to the left singular vector corresponding to the smallest singular value of $A$.

Figure 2: A sample evolution of $\frac{\|A(x_{k}-x_{\star})\|_{2}}{\|x_{k}-x_{\star}\|_{2}}$ for the RGS method.

From Figure 2, we observe that the value of $\frac{\|A(x_{k}-x_{\star})\|_{2}}{\|x_{k}-x_{\star}\|_{2}}$ decreases with $k$ and finally approaches the smallest singular value. This implies that the direction of $x_{k}-x_{\star}$ is mainly determined by the right singular vectors corresponding to the large singular values of $A$ at the beginning. As $k$ increases, the direction is mainly determined by the right singular vectors corresponding to the small singular values. Finally, $x_{k}-x_{\star}$ tends to the right singular vector space corresponding to the smallest singular value. Furthermore, this phenomenon also suggests an interesting application, i.e., finding nonzero vectors $x$ such that $\frac{\|Ax\|_{2}}{\|x\|_{2}}$ is small.
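The application mentioned at the end of the previous paragraph can be sketched as follows; this is our own illustration on a synthetic matrix with one well-separated tiny singular value, not one of the experiments reported in this section. The idea is simply to run the RGS method long enough and use the late error $x_{k}-x_{\star}$ as an approximate small right singular direction.

```python
import numpy as np

# Using a late RGS error direction to find a nonzero x with small ||Ax||_2 / ||x||_2.
rng = np.random.default_rng(3)
m, n = 80, 50
s = np.linspace(1.0, 2.0, n)
s[-1] = 1e-2                                      # one well-separated tiny singular value
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(s) @ V.T
b = rng.standard_normal(m)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

col_sq = np.sum(A**2, axis=0)
probs = col_sq / col_sq.sum()
x = np.zeros(n)
for _ in range(5000):                             # run RGS until the error aligns with v_min
    j = rng.choice(n, p=probs)
    x[j] -= A[:, j] @ (A @ x - b) / col_sq[j]
d = x - x_star                                    # late error, expected to lie near small right singular vectors
print(np.linalg.norm(A @ d) / np.linalg.norm(d))  # typically settles near the smallest singular value (1e-2 here)
```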

We now consider the convergence directions of $Az_{k}-Ax_{\star}$ and $z_{k}-x_{\star}$ for the REGS method. We generate a vector $x\in R^{600}$ using the MATLAB function randn, set the coefficient matrix $A=A_{2}$, which does not have full column rank, and set the right-hand side $b=Ax$. With $x_{0}=0$ and $z_{0}=0$, we plot $|\langle(Az_{k}-Ax_{\star})/\|Az_{k}-Ax_{\star}\|_{2},u_{500}\rangle|$ and $\frac{\|A(z_{k}-x_{\star})\|_{2}}{\|z_{k}-x_{\star}\|_{2}}$ in Figure 3 and Figure 4, respectively.

Figure 3: A sample evolution of $|\langle(Az_{k}-Ax_{\star})/\|Az_{k}-Ax_{\star}\|_{2},u_{500}\rangle|$ for the REGS method.
Figure 4: A sample evolution of $\frac{\|A(z_{k}-x_{\star})\|_{2}}{\|z_{k}-x_{\star}\|_{2}}$ for the REGS method.

Figure 3 and Figure 4 show results similar to those obtained for the RGS method. That is, the convergence directions of $Az_{k}-Ax_{\star}$ and $z_{k}-x_{\star}$ for the REGS method are initially driven by the large singular values, then mainly by the small singular values, and finally by the smallest nonzero singular value of $A$.

References

  • [1] Å. Björck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, 1996.
  • [2] N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, 2002.
  • [3] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, 2003.
  • [4] D. Leventhal, A. S. Lewis, Randomized methods for linear constraints: convergence rates and conditioning, Math. Oper. Res. 35 (2010) 641–654.
  • [5] A. Ma, D. Needell, A. Ramdas, Convergence properties of the randomized extended Gauss–Seidel and Kaczmarz methods, SIAM J. Matrix Anal. Appl. 36 (2015) 1590–1604.
  • [6] T. Strohmer, R. Vershynin, A randomized Kaczmarz algorithm with exponential convergence, J. Fourier Anal. Appl. 15 (2009) 262–278.
  • [7] A. Zouzias, M. N. Freris, Randomized extended Kaczmarz for solving least squares, SIAM J. Matrix Anal. Appl. 34 (2013) 773–793.
  • [8] R. M. Gower, P. Richtárik, Randomized iterative methods for linear systems, SIAM J. Matrix Anal. Appl. 36 (2015) 1660–1690.
  • [9] J. Nutini, M. Schmidt, I. Laradji, M. Friedlander, H. Koepke, Coordinate descent converges faster with the Gauss–Southwell rule than random selection, in: International Conference on Machine Learning, PMLR, 2015, pp. 1632–1641.
  • [10] A. Hefny, D. Needell, A. Ramdas, Rows versus columns: randomized Kaczmarz or Gauss–Seidel for ridge regression, SIAM J. Sci. Comput. 39 (2017) S528–S542.
  • [11] S. Tu, S. Venkataraman, A. C. Wilson, A. Gittens, M. I. Jordan, B. Recht, Breaking locality accelerates block Gauss–Seidel, in: International Conference on Machine Learning, PMLR, 2017, pp. 3482–3491.
  • [12] Y. Y. Xu, Hybrid Jacobian and Gauss–Seidel proximal block coordinate update methods for linearly constrained convex programming, SIAM J. Optim. 28 (2018) 646–670.
  • [13] K. Du, Tight upper bounds for the convergence of the randomized extended Kaczmarz and Gauss–Seidel algorithms, Numer. Linear Algebra Appl. 26 (2019) e2233.
  • [14] M. Razaviyayn, M. Y. Hong, N. Reyhanian, Z. Q. Luo, A linearly convergent doubly stochastic Gauss–Seidel algorithm for solving linear equations and a certain class of over–parameterized optimization problems, Math. Program. 176 (2019) 465–496.
  • [15] Y. L. Jiao, B. T. Jin, X. L. Lu, Preasymptotic convergence of randomized Kaczmarz method, Inverse Problems 33 (2017) 125012.
  • [16] S. Steinerberger, Randomized Kaczmarz converges along small singular vectors, SIAM J. Matrix Anal. Appl. 42 (2021) 608–615.
  • [17] Y. J. Zhang, H. Y. Li, Preconvergence of the randomized extended Kaczmarz method, arXiv preprint arXiv:2105.04924, 2021.
  • [18] G. H. Golub, C. F. Van Loan, Matrix Computations, fourth ed., Johns Hopkins University Press, Baltimore, 2013.