A Semi-Randomized and Augmented Kaczmarz Method with Simple Random Sampling for Large-Scale Inconsistent Linear Systems

Shunchang Li School of Mathematics, China University of Mining and Technology, Xuzhou 221116, Jiangsu, P.R. China. E-mail: [email protected].    Gang Wu Corresponding author. School of Mathematics, China University of Mining and Technology, Xuzhou 221116, Jiangsu, P.R. China. E-mail: [email protected]. This work is supported by the National Natural Science Foundation of China under Grant 12271518, the Fujian Natural Science Foundation under Grant 2023J01354, the Key Research and Development Project of Xuzhou Natural Science Foundation under Grant KC22288, and the Open Project of Key Laboratory of Data Science and Intelligence Education of the Ministry of Education under Grant DSIE202203.
Abstract

A greedy randomized augmented Kaczmarz (GRAK) method was proposed in [Z.-Z. Bai and W.-T. Wu, SIAM J. Sci. Comput., 43 (2021), pp. A3892–A3911] for large and sparse inconsistent linear systems. However, in each iteration one has to construct two new index sets and compute the residual vector of the augmented linear system. Thus, the computational overhead of this method is large for extremely large-scale problems. Moreover, there is no reliable stopping criterion for this method. In this work, we are interested in solving large-scale sparse or dense inconsistent linear systems, and try to enhance the numerical performance of the GRAK method. First, we propose an accelerated greedy randomized augmented Kaczmarz method. Theoretical analysis indicates that it converges faster than the GRAK method under very weak assumptions. Second, in order to further reduce the overhead, we propose a semi-randomized augmented Kaczmarz method with simple random sampling. Third, to the best of our knowledge, there have been no practical stopping criteria for randomized Kaczmarz-type methods to date. To fill in this gap, we introduce a practical stopping criterion for Kaczmarz-type methods, and show its rationality from a theoretical point of view. Numerical experiments are performed on both real-world and synthetic data sets, which demonstrate the efficiency of the proposed methods and the effectiveness of our stopping criterion.

keywords:
Randomized Kaczmarz Method, Greedy Randomized Augmented Kaczmarz Method (GRAK), Semi-Randomized Kaczmarz Method, Inconsistent Linear System, Simple Random Sampling.
AMS:
65F10, 65F15

1 Introduction

Consider the large-scale linear system

A𝐱=𝐛,A{\bf x}={\bf b}, (1)

where Am×nA\in\mathbb{C}^{m\times n}, 𝐛m{\bf b}\in\mathbb{C}^{m}, and 𝐱n{\bf x}\in\mathbb{C}^{n} is an unknown vector. In this paper, we make the assumption that there are no zero rows or columns in AA. On one hand, if the linear system (1) is consistent, the interest is to find the least-norm solution 𝐱=A𝐛{\bf x}_{\star}=A^{\dagger}{\bf b}. The Kaczmarz method [23] is an effective iterative method for consistent linear systems, which has been applied in many real-world applications, e.g., signal processing [2, 16, 24], image and sound reconstructions [15, 18, 38, 42], computed tomography [1, 16, 24], and so on; see [11] and the references therein. At the kk-th iteration of the Kaczmarz method, the approximate solution 𝐱k+1{\bf x}_{k+1} is obtained from orthogonally projecting the current iteration vector 𝐱k{\bf x}_{k} onto the hyperplane Hik={𝐱|A(ik)𝐱=𝐛(ik)}H_{i_{k}}=\big{\{}{\bf x}|A^{\left(i_{k}\right)}{\bf x}={\bf b}^{\left(i_{k}\right)}\big{\}}:

𝐱k+1=𝐱k+𝐛(ik)A(ik)𝐱kA(ik)22(A(ik))H,ik=(kmodm)+1.{\bf x}_{k+1}={\bf x}_{k}+\frac{{\bf b}^{\left(i_{k}\right)}-A^{\left(i_{k}\right)}{\bf x}_{k}}{\left\|A^{\left(i_{k}\right)}\right\|_{2}^{2}}\left(A^{\left(i_{k}\right)}\right)^{H},\quad i_{k}=(k~{}mod~{}m)+1. (2)
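
For concreteness, the following MATLAB sketch runs the cyclic update (2) on a small consistent test problem; the test data, the number of sweeps and all variable names are our own and serve only as an illustration.

% A minimal MATLAB sketch of the cyclic Kaczmarz update (2) on a small
% consistent test problem (the data below are ours, for illustration only).
rng(1);
m = 200;  n = 50;
A = randn(m, n);
xTrue = randn(n, 1);
b = A * xTrue;                           % consistent right-hand side
rowNorms = sum(abs(A).^2, 2);            % squared Euclidean norms of the rows
x = zeros(n, 1);
for k = 0:20*m-1
    i = mod(k, m) + 1;                   % i_k = (k mod m) + 1
    x = x + ((b(i) - A(i, :) * x) / rowNorms(i)) * A(i, :)';   % project onto H_{i_k}
end
disp(norm(x - xTrue) / norm(xTrue));     % relative error after 20 sweeps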

However, if the angle between the hyperplanes related to two consecutive iterations is small, the convergence rate of the Kaczmarz method may be slow. In order to partially overcome this difficulty, Strohmer and Vershynin [41] proposed a randomized version, called the randomized Kaczmarz (RK) method, and proved its expected exponential rate of convergence. However, it is difficult for RK to exploit real-time information when selecting working rows in each iteration. Another disadvantage is that the probability of each row being selected is directly proportional to the square of its Euclidean norm. If the Euclidean norms of a few rows of the coefficient matrix are much larger than those of the others, the RK method will converge very slowly or even diverge. In order to improve the convergence of the RK method, many variations have been proposed, such as the sampling Kaczmarz-Motzkin method [28], the accelerated randomized Kaczmarz method [27], the weighted randomized Kaczmarz method [40], and so on [6, 14, 17, 45, 46, 47]. Some randomized block Kaczmarz methods were introduced in [9, 20, 26, 30, 32, 34, 35].

In [3], Bai and Wu proposed a greedy randomized Kaczmarz (GRK) method for large consistent linear systems. Theoretical analysis and numerical experiments demonstrate that GRK works better than RK. However, in each iteration of the GRK method, one has to compute the residual vector and construct a new index set, which incurs a large amount of work, especially for big-data problems. In order to reduce the computational overhead, Jiang, Wu and Jiang [21] proposed a semi-randomized Kaczmarz method with simple random sampling. This method only uses a small portion of the rows of the matrix $A$, and there is no need to calculate probabilities nor to construct an index set for choosing working rows.

On the other hand, if the linear system (1) is inconsistent, the Kaczmarz-type methods cannot converge to the least-norm least-squares solution ${\bf x}_{\star}=A^{\dagger}{\bf b}$ [4]. Inspired by the works in [36, 37], Zouzias and Freris [48] introduced a randomized extended Kaczmarz (REK) method, and proved that the REK method converges to the least-norm least-squares solution. Indeed, the REK method can be understood as using the randomized Kaczmarz method to solve the two linear systems $A^{H}{\bf z}=0$ and $A{\bf x}={\bf b}-{\bf z}$. Based on the REK method, Bai and Wu [5] proposed a partially random extended Kaczmarz (PREK) method. However, both the REK method and the PREK method will converge slowly if one of the two linear systems does so. A number of randomized block Kaczmarz methods for solving inconsistent linear systems were investigated in [13, 33].

Recently, Bai and Wu proposed a greedy randomized augmented Kaczmarz (GRAK) method for solving large sparse inconsistent linear systems [4]. In essence, the GRAK method first transforms the inconsistent linear system (1) into an equivalent consistent augmented linear system, and then applies the GRK method to this augmented linear system. A relaxed version of the GRAK method was proposed in [7]. However, similar to the GRK method, it is required to calculate the residual vector with respect to the augmented linear system and to construct two index sets in each iteration. This is unfavorable for extremely large-scale problems.

In this paper, we focus on solving large (dense or sparse) inconsistent linear systems, and try to improve the performance of the GRAK method proposed in [4]. The contributions of this work are as follows. First, we modify the iteration scheme for updating the approximate solution, and propose an accelerated greedy randomized augmented Kaczmarz method. We prove that it converges faster than the GRAK method if $\left\|A\right\|_{F}^{2}\gg 1$. Second, we propose a semi-randomized augmented Kaczmarz method with simple random sampling for large inconsistent linear systems. In this method, we do not compute the residual vector corresponding to the augmented linear system, and there is no need to construct index sets. Hence, the proposed method is much cheaper than the GRAK method per iteration. The convergence of the method is established. Third, as far as we know, there are no practical stopping criteria for Kaczmarz-type methods; the third contribution of this work is to introduce one. With this strategy, there is no need to know the exact solution a priori nor to compute the residual vector.

The organization of this paper is as follows. In Section 2, we briefly review the randomized extended Kaczmarz method and the greedy randomized augmented Kaczmarz method for large inconsistent linear systems. In Section 3, we propose an accelerated greedy randomized augmented Kaczmarz method and an accelerated greedy randomized augmented Kaczmarz method with simple random sampling. Convergence results of the two methods are established. In Section 4, we present a practical stopping criterion for Kaczmarz-type methods, and show its rationality theoretically. In Section 5, we perform comprehensive numerical experiments on both synthetic and real-world data sets, to show the efficiency of the proposed methods as well as the effectiveness of our stopping criterion. Some concluding remarks are given in Section 6.

Let us introduce some notations. For a given matrix Am×nA\in\mathbb{C}^{m\times n}, we denote by AHA^{H}, A2\left\|A\right\|_{2}, AA^{\dagger}, A(i)A^{\left(i\right)} and A(j)A_{\left(j\right)} its conjugate transpose, Euclidean norm, Moore-Penrose inverse, the iith row and the jjth column of AA, respectively. Let (A)\mathcal{R}\left(A\right) and (A)\mathcal{R}\left(A\right)^{\bot} be the range space of AA and the orthogonal complement subspace of (A)\mathcal{R}\left(A\right). Let 𝐛(A){\bf b}_{\mathcal{R}\left(A\right)} and 𝐛(A){\bf b}_{\mathcal{R}\left(A\right)^{\bot}} be the orthogonal projection of the vector 𝐛{\bf b} onto (A)\mathcal{R}\left(A\right) and (A)\mathcal{R}\left(A\right)^{\bot}, respectively. Denote by λmin(AHA)\lambda_{\min}(A^{H}A) the minimal nonzero eigenvalue of the matrix AHAA^{H}A, and by 𝐞𝐢{\bf e_{i}} the iith column of the identity matrix II. Let 𝔼[]\mathbb{E}\left[\cdot\right] be the full expectation, and let 𝔼k[]\mathbb{E}_{k}\left[\cdot\right] be the conditional expectation on the first kk iterations [3]. That is,

𝔼k[]=𝔼[|i0,i1,,ik1],k=1,2,\mathbb{E}_{k}\left[\cdot\right]=\mathbb{E}\left[\cdot|i_{0},i_{1},\ldots,i_{k-1}\right],\quad k=1,2,\ldots

In the light of the properties of full expectation and conditional expectation, we have that 𝔼[𝔼k[]]=𝔼[]\mathbb{E}\left[\mathbb{E}_{k}\left[\cdot\right]\right]=\mathbb{E}\left[\cdot\right].

2 The randomized extended Kaczmarz method and the greedy randomized augmented Kaczmarz method

In this section, we briefly introduce the randomized extended Kaczmarz (REK) method and the greedy randomized augmented Kaczmarz (GRAK) method for large-scale inconsistent linear systems. The REK method is a combination of the randomized orthogonal projection algorithm and the randomized Kaczmarz method [48]. The main idea is to efficiently reduce the norm of the "noisy" part of ${\bf b}$, i.e., ${\bf b}_{\mathcal{R}\left(A\right)^{\bot}}$, by using the randomized orthogonal projection algorithm, and then to apply the randomized Kaczmarz method to a new linear system whose right-hand side is (arbitrarily) close to the column space of $A$. The framework of the REK method is as follows; for more details, refer to [48].

Algorithm 1.

The randomized extended Kaczmarz (REK) method [48]
Input: Given AA, 𝐛\bf b, the convergence tolerance ϵ>0\epsilon>0, and let 𝐱0=0{\bf x}_{0}=0 and 𝐳0=𝐛{\bf z}_{0}={\bf b}.
Output: The approximate solution 𝐱~\widetilde{\bf x}.
1. for k=0,1,2,,k=0,1,2,\ldots, do
2. Pick ik{1,2,,m}i_{k}\in\left\{1,2,\ldots,m\right\} with probability qi=A(i)22/AF2,i{1,2,,m}q_{i}=\left\|A^{\left(i\right)}\right\|_{2}^{2}/\left\|A\right\|_{F}^{2},~{}i\in\left\{1,2,\ldots,m\right\}.
3. Pick jk{1,2,,n}j_{k}\in\left\{1,2,\ldots,n\right\} with probability pj=A(j)22/AF2,j{1,2,,n}p_{j}=\left\|A^{\left(j\right)}\right\|_{2}^{2}/\left\|A\right\|_{F}^{2},~{}j\in\left\{1,2,\ldots,n\right\}.
4. Set 𝐳k+1=𝐳kA(jk)H𝐳kA(jk)22A(jk){\bf z}_{k+1}={\bf z}_{k}-\frac{A_{\left(j_{k}\right)}^{H}{\bf z}_{k}}{\left\|A_{\left(j_{k}\right)}\right\|_{2}^{2}}A_{\left(j_{k}\right)}.
5. Set ${\bf x}_{k+1}={\bf x}_{k}+\frac{{\bf b}^{\left(i_{k}\right)}-{\bf z}_{k}^{\left(i_{k}\right)}-A^{\left(i_{k}\right)}{\bf x}_{k}}{\left\|A^{\left(i_{k}\right)}\right\|_{2}^{2}}\left(A^{\left(i_{k}\right)}\right)^{H}$.
6. Check every $8\min\left(m,n\right)$ iterations and terminate if

A𝐱k(𝐛𝐳k)2AF𝐱k2εandAH𝐳k2AF2𝐱k2ε.\frac{\left\|A{\bf x}_{k}-\left({\bf b}-{\bf z}_{k}\right)\right\|_{2}}{\left\|A\right\|_{F}\left\|{\bf x}_{k}\right\|_{2}}\leqslant\varepsilon\,\,~{}~{}and~{}~{}\,\,\frac{\left\|A^{H}{\bf z}_{k}\right\|_{2}}{\left\|A\right\|_{F}^{2}\left\|{\bf x}_{k}\right\|_{2}}\leqslant\varepsilon.

7. end for
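
To make Algorithm 1 concrete, here is a minimal MATLAB sketch of REK on a randomly generated inconsistent problem. The test data, the fixed iteration count and the variable names are our own choices, and the periodic stopping test of step 6 is omitted for brevity.

% A sketch of Algorithm 1 (REK); data and names are ours, step 6 is omitted.
rng(1);
m = 500;  n = 100;
A = randn(m, n);
b = randn(m, 1);                          % generically inconsistent right-hand side
rowNorms = sum(abs(A).^2, 2);             % squared row norms
colNorms = sum(abs(A).^2, 1)';            % squared column norms
fro2 = sum(rowNorms);                     % ||A||_F^2
x = zeros(n, 1);
z = b;
for k = 1:50*(m+n)
    i = find(cumsum(rowNorms) >= rand * fro2, 1);   % step 2: row with probability q_i
    j = find(cumsum(colNorms) >= rand * fro2, 1);   % step 3: column with probability p_j
    z = z - (A(:, j)' * z / colNorms(j)) * A(:, j); % step 4: push z toward R(A)^perp
    x = x + ((b(i) - z(i) - A(i, :) * x) / rowNorms(i)) * A(i, :)';   % step 5
end
xStar = pinv(A) * b;                      % reference least-squares solution
disp(norm(x - xStar) / norm(xStar));      % relative solution error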

In [4], Bai and Wu proposed a greedy randomized augmented Kaczmarz (GRAK) method. The main idea is to rewrite the large-scale inconsistent linear system A𝐱=𝐛A{\bf x}={\bf b} into a consistent augmented linear system

A~𝐱~=𝐛~,\tilde{A}\tilde{{\bf x}}=\tilde{{\bf b}}, (1)

where

A~=(IAAH0),𝐱~=(𝐳𝐱),and𝐛~=[𝐛𝟎],\tilde{A}=\left(\begin{matrix}I&A\\ A^{H}&0\\ \end{matrix}\right),\quad\tilde{{\bf x}}=\left(\begin{array}[]{c}{\bf z}\\ {\bf x}\\ \end{array}\right),\quad and\quad\tilde{{\bf b}}=\left[\begin{array}[]{c}{\bf b}\\ {\bf 0}\\ \end{array}\right], (2)

and then apply the greedy randomized augmented Kaczmarz (GRK) method [3] to the augmented linear system (1).

More precisely, define the two index sets

ΩkR={ik{1,2,,m}:|𝐛(ik)𝐳k(ik)A(ik)𝐱k|21+A(ik)22ϵk(𝐛𝐳kA𝐱k22+AH𝐳k22)},\varOmega_{k}^{R}=\left\{i_{k}\in\left\{1,2,\ldots,m\right\}:\frac{\left|{\bf b}^{\left(i_{k}\right)}-{\bf z}_{k}^{\left(i_{k}\right)}-A^{\left(i_{k}\right)}{\bf x}_{k}\right|^{2}}{1+\left\|A^{\left(i_{k}\right)}\right\|_{2}^{2}}\geqslant\epsilon_{k}\left(\left\|{\bf b}-{\bf z}_{k}-A{\bf x}_{k}\right\|_{2}^{2}+\left\|A^{H}{\bf z}_{k}\right\|_{2}^{2}\right)\right\}, (3)

and

ΩkC={jk{1,2,,n}:|A(jk)H𝐳k|2A(jk)22εk(𝐛𝐳kA𝐱k22+AH𝐳k22)},\varOmega_{k}^{C}=\left\{j_{k}\in\left\{1,2,\ldots,n\right\}:\frac{\left|A_{\left(j_{k}\right)}^{H}{\bf z}_{k}\right|^{2}}{\left\|A_{\left(j_{k}\right)}\right\|_{2}^{2}}\geqslant\varepsilon_{k}\left(\left\|{\bf b}-{\bf z}_{k}-A{\bf x}_{k}\right\|_{2}^{2}+\left\|A^{H}{\bf z}_{k}\right\|_{2}^{2}\right)\right\}, (4)

where

ϵk=max{ϵkR,ϵkC},\epsilon_{k}=\max\left\{\epsilon_{k}^{R},\epsilon_{k}^{C}\right\}, (5)

with

ϵkR=12(1𝐛𝐳kA𝐱k22+AH𝐳k22max1ikm{|𝐛(ik)𝐳k(ik)A(ik)𝐱k|21+A(ik)22}+1m+2AF2)\epsilon_{k}^{R}=\frac{1}{2}\left(\frac{1}{\left\|{\bf b}-{\bf z}_{k}-A{\bf x}_{k}\right\|_{2}^{2}+\left\|A^{H}{\bf z}_{k}\right\|_{2}^{2}}\underset{1\leqslant i_{k}\leqslant m}{\max}\left\{\frac{\left|{\bf b}^{\left(i_{k}\right)}-{\bf z}_{k}^{\left(i_{k}\right)}-A^{\left(i_{k}\right)}{\bf x}_{k}\right|^{2}}{1+\left\|A^{\left(i_{k}\right)}\right\|_{2}^{2}}\right\}+\frac{1}{m+2\left\|A\right\|_{F}^{2}}\right) (6)

and

\epsilon_{k}^{C}=\frac{1}{2}\left(\frac{1}{\left\|{\bf b}-{\bf z}_{k}-A{\bf x}_{k}\right\|_{2}^{2}+\left\|A^{H}{\bf z}_{k}\right\|_{2}^{2}}\underset{1\leqslant j_{k}\leqslant n}{\max}\left\{\frac{\left|A_{\left(j_{k}\right)}^{H}{\bf z}_{k}\right|^{2}}{\left\|A_{\left(j_{k}\right)}\right\|_{2}^{2}}\right\}+\frac{1}{m+2\left\|A\right\|_{F}^{2}}\right). (7)

Let

𝐫~k(i)={𝐛(i)𝐳k(i)A(i)𝐱kifiΩkR,0otherwise,\tilde{{\bf r}}_{k}^{\left(i\right)}=\begin{cases}{\bf b}^{\left(i\right)}-{\bf z}_{k}^{\left(i\right)}-A^{\left(i\right)}{\bf x}_{k}&if\,\,i\in\varOmega_{k}^{R},\\ 0&otherwise,\\ \end{cases} (8)

and

𝐬~k(j)={A(j)H𝐳kifjΩkC,0otherwise.\tilde{{\bf s}}_{k}^{\left(j\right)}=\begin{cases}-A_{\left(j\right)}^{H}{\bf z}_{k}&if\,\,j\in\varOmega_{k}^{C},\\ 0&otherwise.\\ \end{cases} (9)
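
In code, the greedy machinery (3)–(9) amounts to a few vectorized operations. The following MATLAB helper (our own sketch, saved as grakQuantities.m; the function and variable names are not from [4]) computes $\epsilon_{k}$ and the projected residuals $\tilde{{\bf r}}_{k}$ and $\tilde{{\bf s}}_{k}$ from the current iterates, and is reused in the sketch of the GRAK iteration given after Algorithm 2 below.

function [rTilde, sTilde, epsK] = grakQuantities(A, b, x, z, rowNorms, colNorms)
% Compute the projected residuals (8)-(9) and the threshold (5)-(7) of GRAK.
% rowNorms and colNorms hold the squared row and column norms of A.
    m = size(A, 1);
    fro2 = sum(rowNorms);                         % ||A||_F^2
    r = b - z - A * x;                            % residual of the first block row
    s = -(A' * z);                                % residual of the second block row
    normSq = norm(r)^2 + norm(s)^2;               % ||b - z - A x||^2 + ||A^H z||^2
    wR = abs(r).^2 ./ (1 + rowNorms);             % row ratios used in (3) and (6)
    wC = abs(s).^2 ./ colNorms(:);                % column ratios used in (4) and (7)
    epsR = 0.5 * (max(wR) / normSq + 1 / (m + 2 * fro2));   % (6)
    epsC = 0.5 * (max(wC) / normSq + 1 / (m + 2 * fro2));   % (7)
    epsK = max(epsR, epsC);                       % (5)
    rTilde = r .* (wR >= epsK * normSq);          % (8): entries with i in Omega_k^R
    sTilde = s .* (wC >= epsK * normSq);          % (9): entries with j in Omega_k^C
end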

The greedy randomized augmented Kaczmarz method is given as follows; for more details, refer to [4].

Algorithm 2.

The greedy randomized augmented Kaczmarz  (GRAK) method [4]
Input: AA, 𝐛\bf b, ll, 𝐱0{\bf x}_{0}, and 𝐳0{\bf z}_{0}.
Output: 𝐱l{\bf x}_{l} and 𝐳l{\bf z}_{l}.
1. for k=0,1,,l1k=0,1,\ldots,l-1 do
2. Compute the tolerances $\epsilon_{k}^{R}$, $\epsilon_{k}^{C}$ and $\epsilon_{k}$ by (5)–(7), identify the index sets $\varOmega_{k}^{R}$ and $\varOmega_{k}^{C}$ by the criteria (3) and (4), and determine the projected residuals $\tilde{\bf r}_{k}$ and $\tilde{\bf s}_{k}$ by the definitions (8) and (9), respectively.
3. Select an index tkt_{k} satisfying 1tkm+n1\leqslant t_{k}\leqslant m+n, with the probability

Pr(index=tk)={|𝐫~k(tk)|2𝐫~k22+𝐬~k22if  1tkm,|𝐬~k(tkm)|2𝐫~k22+𝐬~k22ifm+1tkm+n.Pr\left(index=t_{k}\right)=\begin{cases}\frac{\left|\tilde{{\bf r}}_{k}^{\left(t_{k}\right)}\right|^{2}}{\left\|\tilde{{\bf r}}_{k}\right\|_{2}^{2}+\left\|\tilde{{\bf s}}_{k}\right\|_{2}^{2}}&if\,\,1\leqslant t_{k}\leqslant m,\\ \frac{\left|\tilde{{\bf s}}_{k}^{\left(t_{k}-m\right)}\right|^{2}}{\left\|\tilde{{\bf r}}_{k}\right\|_{2}^{2}+\left\|\tilde{{\bf s}}_{k}\right\|_{2}^{2}}&if\,\,m+1\leqslant t_{k}\leqslant m+n.\\ \end{cases}

4. If 1tkm1\leqslant t_{k}\leqslant m, then set ik=tki_{k}=t_{k} and compute

𝐳k+1=𝐳k+(𝐛(ik)𝐳k(ik)A(ik)𝐱k)1+A(ik)22𝐞ik{\bf z}_{k+1}={\bf z}_{k}+\frac{\left({\bf b}^{\left(i_{k}\right)}-{\bf z}_{k}^{\left(i_{k}\right)}-A^{\left(i_{k}\right)}{\bf x}_{k}\right)}{1+\left\|A^{\left(i_{k}\right)}\right\|_{2}^{2}}{\bf e}_{i_{k}}

and

𝐱k+1=𝐱k+(𝐛(ik)𝐳k(ik)A(ik)𝐱k)1+A(ik)22(A(ik))H{\bf x}_{k+1}={\bf x}_{k}+\frac{\left({\bf b}^{\left(i_{k}\right)}-{\bf z}_{k}^{\left(i_{k}\right)}-A^{\left(i_{k}\right)}{\bf x}_{k}\right)}{1+\left\|A^{\left(i_{k}\right)}\right\|_{2}^{2}}\left(A^{\left(i_{k}\right)}\right)^{H}

else if   m+1tkm+nm+1\leqslant t_{k}\leqslant m+n, then set jk=tkmj_{k}=t_{k}-m, and compute

𝐳k+1=𝐳kA(jk)H𝐳kA(jk)22A(jk)and𝐱k+1=𝐱k{\bf z}_{k+1}={\bf z}_{k}-\frac{A_{\left(j_{k}\right)}^{H}{\bf z}_{k}}{\left\|A_{\left(j_{k}\right)}\right\|_{2}^{2}}A_{\left(j_{k}\right)}\quad and\quad{\bf x}_{k+1}={\bf x}_{k}

5. endfor
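
Built on the helper grakQuantities above, one possible MATLAB realization of Algorithm 2 reads as follows; the random test data, the fixed number of iterations and the inverse-CDF sampling used for step 3 are our own choices.

% A sketch of Algorithm 2 (GRAK) using grakQuantities; data and names are ours.
rng(1);
m = 500;  n = 100;
A = randn(m, n);  b = randn(m, 1);
rowNorms = sum(abs(A).^2, 2);  colNorms = sum(abs(A).^2, 1)';
x = zeros(n, 1);  z = b;
for k = 1:20000
    [rTilde, sTilde] = grakQuantities(A, b, x, z, rowNorms, colNorms);   % step 2
    w = [abs(rTilde).^2; abs(sTilde).^2];           % unnormalized probabilities, step 3
    t = find(cumsum(w) >= rand * sum(w), 1);        % draw t_k with probability w(t)/sum(w)
    if t <= m                                       % step 4: update both z and x
        i = t;
        d = (b(i) - z(i) - A(i, :) * x) / (1 + rowNorms(i));
        z(i) = z(i) + d;
        x = x + d * A(i, :)';
    else                                            % step 4: update z only
        j = t - m;
        z = z - (A(:, j)' * z / colNorms(j)) * A(:, j);
    end
end
xStar = pinv(A) * b;                                % reference least-squares solution
disp(norm(x - xStar) / norm(xStar));                % relative solution error (RSE)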

The following theorem indicates that the GRAK method converges in expectation to the least-norm least-squares solution ${\bf x}_{\star}=A^{\dagger}{\bf b}$.

Theorem 1.

[4] Starting from any initial vectors 𝐱0range(AH){\bf x}_{0}\in range\left(A^{H}\right) and 𝐳0m{\bf z}_{0}\in\mathbb{C}^{m}, the iteration sequences {𝐱k}k=0\left\{{\bf x}_{k}\right\}_{k=0}^{\infty} and {𝐳k}k=0\left\{{\bf z}_{k}\right\}_{k=0}^{\infty}, generated by the GRAK method, converge in expectation to the least-norm least-squares solution 𝐱=A𝐛{\bf x}_{\star}=A^{\dagger}{\bf b} of the linear system (1) and the orthogonal projection vector 𝐳=𝐛(A){\bf z}_{\star}={\bf b}_{\mathcal{R}\left(A\right)^{\bot}}, respectively. Moreover, the global solution error in expectation with respect to both iteration sequences {𝐱k}k=0\left\{{\bf x}_{k}\right\}_{k=0}^{\infty} and {𝐳k}k=0\left\{{\bf z}_{k}\right\}_{k=0}^{\infty} obeys

𝔼(𝐱1𝐱22+𝐳1𝐳22)ζ(𝐱0𝐱22+𝐳0𝐳22)\mathbb{E}\left(\left\|{\bf x}_{1}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{1}-{\bf z}_{\star}\right\|_{2}^{2}\right)\leqslant\zeta\left(\left\|{\bf x}_{0}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{0}-{\bf z}_{\star}\right\|_{2}^{2}\right)

and

𝔼k(𝐱k+1𝐱22+𝐳k+1𝐳22)β(𝐱k𝐱22+𝐳k𝐳22),k=1,2\mathbb{E}_{k}\left(\left\|{\bf x}_{k+1}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k+1}-{\bf z}_{\star}\right\|_{2}^{2}\right)\leqslant\beta\left(\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\right),k=1,2\ldots (10)

where

ζ=1η2AF2+m\zeta=1-\frac{\eta}{2\left\|A\right\|_{F}^{2}+m}

and

β=112(2AF2+mγ+1)η2AF2+m\beta=1-\frac{1}{2}\left(\frac{2\left\|A\right\|_{F}^{2}+m}{\gamma}+1\right)\frac{\eta}{2\left\|A\right\|_{F}^{2}+m} (11)

with

η=min{1,(λmin(AHA)+1412)2}\eta=\min\left\{1,\left(\sqrt{\lambda_{\min}\left(A^{H}A\right)+\frac{1}{4}}-\frac{1}{2}\right)^{2}\right\} (12)

and

γ=2AF2+mmin{1+min1imA(i)22,min1jnA(j)22}.\gamma=2\left\|A\right\|_{F}^{2}+m-\min\left\{1+\underset{1\leqslant i\leqslant m}{\min}\left\|A^{\left(i\right)}\right\|_{2}^{2},\underset{1\leqslant j\leqslant n}{\min}\left\|A_{\left(j\right)}\right\|_{2}^{2}\right\}. (13)

As a result, it holds that

𝔼(𝐱k𝐱22+𝐳k𝐳22)βk1ζ(𝐱0𝐱22+𝐳0𝐳22)\mathbb{E}\left(\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\right)\leqslant\beta^{k-1}\zeta\cdot\left(\left\|{\bf x}_{0}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{0}-{\bf z}_{\star}\right\|_{2}^{2}\right)

Theorem 1 shows that the convergence rate of GRAK depends on the smallest nonzero eigenvalue of $A^{H}A$, the Frobenius norm of $A$, the number of rows $m$, and the Euclidean norms of the rows and columns of $A$.
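
The quantities appearing in Theorem 1 are easy to evaluate for matrices of moderate size. The following MATLAB script (our own illustration; computing $\lambda_{\min}(A^{H}A)$ with eig is only affordable for small problems) evaluates $\eta$, $\gamma$, $\zeta$ and $\beta$ for a random test matrix.

% Evaluate the convergence factors of Theorem 1 for a small test matrix (ours).
rng(1);
A = randn(300, 60);
m = size(A, 1);
fro2 = norm(A, 'fro')^2;
lamMin = min(eig(A' * A));                       % smallest (nonzero) eigenvalue of A^H A
eta = min(1, (sqrt(lamMin + 1/4) - 1/2)^2);      % (12)
gamma = 2*fro2 + m - min(1 + min(sum(abs(A).^2, 2)), min(sum(abs(A).^2, 1)));   % (13)
zeta = 1 - eta / (2*fro2 + m);
beta = 1 - 0.5 * ((2*fro2 + m) / gamma + 1) * eta / (2*fro2 + m);               % (11)
fprintf('eta = %.3e, gamma = %.3e, zeta = %.6f, beta = %.6f\n', eta, gamma, zeta, beta);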

3 The Proposed Methods

In each iteration of the GRAK method, one has to compute the two residual vectors $\tilde{{\bf r}}_{k}$ and $\tilde{{\bf s}}_{k}$ and the probabilities $\Pr(\cdot)$, and to construct the two index sets $\varOmega_{k}^{R}$ and $\varOmega_{k}^{C}$. Thus, one has to scan all the $m+n$ rows of the coefficient matrix $\tilde{A}$, which is very time consuming when the matrix $A$ is extremely large, and runs counter to the original motivation of the Kaczmarz-type methods.

To deal with these problems, in this section we first propose an accelerated greedy randomized augmented Kaczmarz method. Convergence analysis shows that, if $\|A\|_{F}^{2}\gg 1$, the accelerated method often converges faster than the GRAK method. To further reduce the computational overhead of the accelerated method, we then propose an accelerated greedy randomized augmented Kaczmarz method with simple random sampling. The convergence of this method is also established.

3.1 An accelerated greedy randomized augmented Kaczmarz method

It is seen from Algorithm 2 that, as m+1tkm+nm+1\leqslant t_{k}\leqslant m+n, the GRAK method only updates the vector 𝐳k{\bf z}_{k} rather than the approximate solution 𝐱k{\bf x}_{k}, which may slow down the rate of convergence of this algorithm. Indeed, just like the REK method, GRAK may converge slowly if one of the iteration sequences {𝐱k}k=0\left\{{\bf x}_{k}\right\}_{k=0}^{\infty} or {𝐳k}k=0\left\{{\bf z}_{k}\right\}_{k=0}^{\infty} does so. Thus, it is necessary to give new insight into the GRAK method and improve its numerical performance.

The idea is as follows: when $m+1\leqslant t_{k}\leqslant m+n$, after computing ${\bf z}_{k+1}$, we select a working row with probability $\|A^{\left(i_{k}\right)}\|_{2}^{2}/\|A\|_{F}^{2}$ to further update ${\bf x}_{k}$ for the linear system $A{\bf x}={\bf b}-{\bf z}_{k+1}$. Motivated by the semi-randomized Kaczmarz method proposed in [21], and in order to avoid evaluating probabilities and constructing two index sets in each iteration, we propose an accelerated greedy randomized augmented Kaczmarz (AGRAK) method to solve the augmented linear system (1).

Algorithm 3.

An accelerated greedy randomized augmented Kaczmarz (AGRAK) method
Input:
AA, 𝐛\bf b, ll, 𝐱0=𝟎{\bf x}_{0}={\bf 0}, and 𝐳0=𝐛{\bf z}_{0}={\bf b}.
Output: 𝐱l{\bf x}_{l} and 𝐳l{\bf z}_{l}.
1. for k=0,1,,l1k=0,1,\ldots,l-1 do
2. Select tk{1,2,,m+n}t_{k}\in\left\{1,2,\dots,m+n\right\} according to

tk=argmax{max1im{|𝐛(i)𝐳k(i)A(i)𝐱k|21+A(i)22},maxm+1im+n{|A(im)H𝐳k|2A(im)22}}.t_{k}=\mathop{\arg\max}\left\{\underset{1\leqslant i\leqslant m}{\max}\left\{\frac{\left|{\bf b}^{\left(i\right)}-{\bf z}_{k}^{\left(i\right)}-A^{\left(i\right)}{\bf x}_{k}\right|^{2}}{1+\left\|A^{\left(i\right)}\right\|_{2}^{2}}\right\},\underset{m+1\leqslant i\leqslant m+n}{\max}\left\{\frac{\left|A_{\left(i-m\right)}^{H}{\bf z}_{k}\right|^{2}}{\left\|A_{\left(i-m\right)}\right\|_{2}^{2}}\right\}\right\}.

3. If 1tkm1\leqslant t_{k}\leqslant m, then set ik=tki_{k}=t_{k} and compute

𝐳k+1=𝐳k+(𝐛(ik)𝐳k(ik)A(ik)𝐱k)1+A(ik)22𝐞ik,{\bf z}_{k+1}={\bf z}_{k}+\frac{\left({\bf b}^{\left(i_{k}\right)}-{\bf z}_{k}^{\left(i_{k}\right)}-A^{\left(i_{k}\right)}{\bf x}_{k}\right)}{1+\left\|A^{\left(i_{k}\right)}\right\|_{2}^{2}}{\bf e}_{i_{k}},

and

𝐱k+1=𝐱k+(𝐛(ik)𝐳k(ik)A(ik)𝐱k)1+A(ik)22(A(ik))H.{\bf x}_{k+1}={\bf x}_{k}+\frac{\left({\bf b}^{\left(i_{k}\right)}-{\bf z}_{k}^{\left(i_{k}\right)}-A^{\left(i_{k}\right)}{\bf x}_{k}\right)}{1+\left\|A^{\left(i_{k}\right)}\right\|_{2}^{2}}\left(A^{\left(i_{k}\right)}\right)^{H}.

Else if m+1tkm+nm+1\leqslant t_{k}\leqslant m+n, then set jk=tkmj_{k}=t_{k}-m,

𝐳k+1=𝐳kA(jk)H𝐳kA(jk)22A(jk),{\bf z}_{k+1}={\bf z}_{k}-\frac{A_{\left(j_{k}\right)}^{H}{\bf z}_{k}}{\left\|A_{\left(j_{k}\right)}\right\|_{2}^{2}}A_{\left(j_{k}\right)},

and select an index ik{1,2,,m}i_{k}\in\left\{1,2,\dots,m\right\} with probability Pr(row=ik)=A(ik)22AF2P_{r}\left(row=i_{k}\right)=\frac{\left\|A^{\left(i_{k}\right)}\right\|_{2}^{2}}{\left\|A\right\|_{F}^{2}}, and compute

𝐱k+1=𝐱k+(𝐛(ik)𝐳k+1(ik)A(ik)𝐱k)A(ik)22(A(ik))H.{\bf x}_{k+1}={\bf x}_{k}+\frac{\left({\bf b}^{\left(i_{k}\right)}-{\bf z}_{k+1}^{\left(i_{k}\right)}-A^{\left(i_{k}\right)}{\bf x}_{k}\right)}{\left\|A^{\left(i_{k}\right)}\right\|_{2}^{2}}\left(A^{\left(i_{k}\right)}\right)^{H}.

4. endfor
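
Compared with the GRAK sketch given after Algorithm 2, only the selection rule of step 2 and the second branch of step 3 change. The following MATLAB fragment shows one AGRAK iteration; it assumes that A, b, x, z, rowNorms, colNorms and fro2 = norm(A,'fro')^2 have been set up as in that sketch, and all names are our own.

% One iteration of Algorithm 3 (AGRAK); the setup is as in the GRAK sketch above.
r = b - z - A * x;
s = A' * z;
wR = abs(r).^2 ./ (1 + rowNorms);                % row ratios of step 2
wC = abs(s).^2 ./ colNorms;                      % column ratios of step 2
[~, t] = max([wR; wC]);                          % greedy choice of t_k, no probabilities
if t <= size(A, 1)                               % step 3, first branch: as in GRAK
    i = t;
    d = (b(i) - z(i) - A(i, :) * x) / (1 + rowNorms(i));
    z(i) = z(i) + d;
    x = x + d * A(i, :)';
else                                             % step 3, second branch: also update x
    j = t - size(A, 1);
    z = z - (A(:, j)' * z / colNorms(j)) * A(:, j);
    i = find(cumsum(rowNorms) >= rand * fro2, 1);   % row with prob. ||A^(i)||^2/||A||_F^2
    x = x + ((b(i) - z(i) - A(i, :) * x) / rowNorms(i)) * A(i, :)';
end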

Next, we consider the convergence of the AGRAK method. The following lemma is needed.

Lemma 3.1.

Let a1,a2,b1,b2+a_{1},a_{2},b_{1},b_{2}\in\mathbb{R}^{+}, and a1<a2a_{1}<a_{2}. Then there exists υ(a1,a2)\upsilon\in\left(a_{1},a_{2}\right), such that a1b1+a2b2=υ(b1+b2)a_{1}b_{1}+a_{2}b_{2}=\upsilon\left(b_{1}+b_{2}\right).

Proof.

Let υ=a1b1+a2b2b1+b2\upsilon=\frac{a_{1}b_{1}+a_{2}b_{2}}{b_{1}+b_{2}}, then

υ=a1b1+a2b2b1+b2<a2b1+a2b2b1+b2=a2(b1+b2)b1+b2=a2,\upsilon=\frac{a_{1}b_{1}+a_{2}b_{2}}{b_{1}+b_{2}}<\frac{a_{2}b_{1}+a_{2}b_{2}}{b_{1}+b_{2}}=\frac{a_{2}\left(b_{1}+b_{2}\right)}{b_{1}+b_{2}}=a_{2},

and

υ=a1b1+a2b2b1+b2>a1b1+a1b2b1+b2=a1(b1+b2)b1+b2=a1,\upsilon=\frac{a_{1}b_{1}+a_{2}b_{2}}{b_{1}+b_{2}}>\frac{a_{1}b_{1}+a_{1}b_{2}}{b_{1}+b_{2}}=\frac{a_{1}\left(b_{1}+b_{2}\right)}{b_{1}+b_{2}}=a_{1},

which satisfies υ(b1+b2)=a1b1+a2b2\upsilon\left(b_{1}+b_{2}\right)=a_{1}b_{1}+a_{2}b_{2}. ∎
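
A quick numerical illustration of Lemma 3.1, with arbitrarily chosen positive values:

% Numerical check of Lemma 3.1 with arbitrary a1 < a2 and b1, b2 > 0 (our values).
a1 = 0.3;  a2 = 0.9;  b1 = 2;  b2 = 5;
v = (a1*b1 + a2*b2) / (b1 + b2);                 % the weighted mean used in the proof
fprintf('a1 < v < a2: %d, identity residual: %.2e\n', a1 < v && v < a2, v*(b1+b2) - (a1*b1 + a2*b2));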

First, notice that if 1tkm1\leqslant t_{k}\leqslant m, Algorithm 3 applies the semi-randomized Kaczmarz method [21] to solve the augmented linear system (1). Denote by

𝐱~k=[𝐳k𝐱k]and𝐱~=[𝐳𝐱],\tilde{{\bf x}}_{k}=\left[\begin{array}[]{c}{\bf z}_{k}\\ {\bf x}_{k}\\ \end{array}\right]\quad and\quad\tilde{{\bf x}}_{\star}=\left[\begin{array}[]{c}{\bf z}_{\star}\\ {\bf x}_{\star}\\ \end{array}\right],

then, from [21, Theorem 3.2], we have that

𝔼k𝐱~k+1𝐱~22(1λmin(A~HA~)A~F2min1im+n{A~(i)22})𝐱~k𝐱~22.\mathbb{E}_{k}\left\|\tilde{{\bf x}}_{k+1}-\tilde{{\bf x}}_{\star}\right\|_{2}^{2}\leqslant\left(1-\frac{\lambda_{\min}\left(\tilde{A}^{H}\tilde{A}\right)}{\left\|\tilde{A}\right\|_{F}^{2}-\underset{1\leqslant i\leqslant m+n}{\min}\left\{\left\|\tilde{A}^{\left(i\right)}\right\|_{2}^{2}\right\}}\right)\left\|\tilde{{\bf x}}_{k}-\tilde{{\bf x}}_{\star}\right\|_{2}^{2}. (1)

Thanks to the structure of A~\tilde{A}, 𝐱~k+1\tilde{{\bf x}}_{k+1} and 𝐱~\tilde{{\bf x}}_{\star}; see (2), one can rewrite (1) as

𝔼k(𝐱k+1𝐱22+𝐳k+1𝐳22)(1ηγ)(𝐱k𝐱22+𝐳k𝐳22),\mathbb{E}_{k}\left(\left\|{\bf x}_{k+1}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k+1}-{\bf z}_{\star}\right\|_{2}^{2}\right)\leqslant\left(1-\frac{\eta}{\gamma}\right)\left(\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\right), (2)

where $\eta$ and $\gamma$ are defined in (12) and (13), respectively.

Taking the full expectation on both sides of (2), and setting

β~=1ηγ,\tilde{\beta}=1-\frac{\eta}{\gamma}, (3)

we arrive at

𝔼(𝐱k+1𝐱22+𝐳k+1𝐳22)β~𝔼(𝐱k𝐱22+𝐳k𝐳22),1tkm.\mathbb{E}\left(\left\|{\bf x}_{k+1}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k+1}-{\bf z}_{\star}\right\|_{2}^{2}\right)\leqslant\tilde{\beta}\mathbb{E}\left(\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\right),\quad 1\leqslant t_{k}\leqslant m.

Second, we consider the convergence of 𝐳k{\bf z}_{k} as m+1tkm+nm+1\leqslant t_{k}\leqslant m+n. Notice that if m+1tkm+nm+1\leqslant t_{k}\leqslant m+n, then

max{max1im{|𝐛(i)𝐳k(i)A(i)𝐱k|21+A(i)22},maxm+1im+n{|A(im)H𝐳k|2A(im)22}}=maxm+1im+n{|A(im)H𝐳k|2A(im)22}.\max\left\{\underset{1\leqslant i\leqslant m}{\max}\left\{\frac{\left|{\bf b}^{\left(i\right)}-{\bf z}_{k}^{\left(i\right)}-A^{\left(i\right)}{\bf x}_{k}\right|^{2}}{1+\left\|A^{\left(i\right)}\right\|_{2}^{2}}\right\},\underset{m+1\leqslant i\leqslant m+n}{\max}\left\{\frac{\left|A_{\left(i-m\right)}^{H}{\bf z}_{k}\right|^{2}}{\left\|A_{\left(i-m\right)}\right\|_{2}^{2}}\right\}\right\}=\underset{m+1\leqslant i\leqslant m+n}{\max}\left\{\frac{\left|A_{\left(i-m\right)}^{H}{\bf z}_{k}\right|^{2}}{\left\|A_{\left(i-m\right)}\right\|_{2}^{2}}\right\}.

Consequently,

jk=tkm=argmax1jn{|A(j)H𝐳k|2A(j)22},j_{k}=t_{k}-m=\mathop{\arg\max}_{1\leqslant j\leqslant n}\left\{\frac{\left|A_{\left(j\right)}^{H}{\bf z}_{k}\right|^{2}}{\left\|A_{\left(j\right)}\right\|_{2}^{2}}\right\},

and

𝐳k+1=𝐳kA(jk)H𝐳kA(jk)22A(jk),wherejk=argmax1jn{|A(j)H𝐳k|2A(j)22},{\bf z}_{k+1}={\bf z}_{k}-\frac{A_{\left(j_{k}\right)}^{H}{\bf z}_{k}}{\left\|A_{\left(j_{k}\right)}\right\|_{2}^{2}}A_{\left(j_{k}\right)},\quad where\quad j_{k}=\mathop{\arg\max}_{1\leqslant j\leqslant n}\left\{\frac{\left|A_{\left(j\right)}^{H}{\bf z}_{k}\right|^{2}}{\left\|A_{\left(j\right)}\right\|_{2}^{2}}\right\},

which is nothing but applying the semi-randomized Kaczmarz method to solve the equation AH𝐳=0A^{H}{\bf z}=0. Therefore, by [21, Theorem 3.2],

𝔼k𝐳k+1𝐳22(1λmin(AHA)AF2min1jn{A(j)22})𝐳k𝐳22,m+1tkm+n.\mathbb{E}_{k}\left\|{\bf z}_{k+1}-{\bf z}_{\star}\right\|_{2}^{2}\leqslant\left(1-\frac{\lambda_{\min}\left(A^{H}A\right)}{\left\|A\right\|_{F}^{2}-\underset{1\leqslant j\leqslant n}{\min}\left\{\left\|A_{\left(j\right)}\right\|_{2}^{2}\right\}}\right)\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2},\quad m+1\leqslant t_{k}\leqslant m+n. (4)

Taking the full expectation on both sides of (4), and setting

δ=1λmin(AHA)AF2min1jn{A(j)22},\delta=1-\frac{\lambda_{\min}\left(A^{H}A\right)}{\left\|A\right\|_{F}^{2}-\underset{1\leqslant j\leqslant n}{\min}\left\{\left\|A_{\left(j\right)}\right\|_{2}^{2}\right\}},

we get

𝔼𝐳k+1𝐳22δ𝔼𝐳k𝐳22.\mathbb{E}\left\|{\bf z}_{k+1}-{\bf z}_{\star}\right\|_{2}^{2}\leqslant\delta\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}. (5)

Next, we update 𝐱k{\bf x}_{k} for the linear equation A𝐱=𝐛𝐳k+1A{\bf x}={\bf b}-{\bf z}_{k+1} by using the randomized Kaczmarz method. From [31, 41], we obtain

𝔼k𝐱k+1𝐱22(1λmin(AHA)AF2)𝐱k𝐱22+𝐳k+1𝐳22AF2,m+1tkm+n.\mathbb{E}_{k}\left\|{\bf x}_{k+1}-{\bf x}_{\star}\right\|_{2}^{2}\leqslant\left(1-\frac{\lambda_{\min}\left(A^{H}A\right)}{\left\|A\right\|_{F}^{2}}\right)\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\frac{\left\|{\bf z}_{k+1}-{\bf z}_{\star}\right\|_{2}^{2}}{\left\|A\right\|_{F}^{2}},\quad m+1\leqslant t_{k}\leqslant m+n. (6)

Taking the full expectation on both sides of (6), and setting

α=1λmin(AHA)AF2,\alpha=1-\frac{\lambda_{\min}\left(A^{H}A\right)}{\left\|A\right\|_{F}^{2}}, (7)

we get

𝔼𝐱k+1𝐱22\displaystyle\mathbb{E}\left\|{\bf x}_{k+1}-{\bf x}_{\star}\right\|_{2}^{2} α𝔼𝐱k𝐱22+𝔼𝐳k+1𝐳22AF2\displaystyle\leqslant\alpha\mathbb{E}\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\frac{\mathbb{E}\left\|{\bf z}_{k+1}-{\bf z}_{\star}\right\|_{2}^{2}}{\left\|A\right\|_{F}^{2}}
α𝔼𝐱k𝐱22+δAF2𝔼𝐳k𝐳22,\displaystyle\leqslant\alpha\mathbb{E}\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\frac{\delta}{\left\|A\right\|_{F}^{2}}\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}, (8)

where the second inequality follows from (5). Suppose that $\left\|A\right\|_{F}^{2}\gg 1$ (we mention that this assumption is very weak in practice, since Kaczmarz-type methods are often applied to very large, possibly dense, linear systems), such that

δAF2𝔼𝐳k𝐳22α𝔼𝐱k𝐱22+δ𝔼𝐳k𝐳22,\frac{\delta}{\left\|A\right\|_{F}^{2}}\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\ll\alpha\mathbb{E}\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\delta\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2},

or in other words,

δAF2𝔼𝐳k𝐳22=o(α𝔼𝐱k𝐱22+δ𝔼𝐳k𝐳22).\frac{\delta}{\left\|A\right\|_{F}^{2}}\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}=o\left(\alpha\mathbb{E}\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\delta\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\right). (9)

Combining (5), (3.1) and (9), we have from Lemma 3.1 that there exists δ<θk<α\delta<\theta_{k}<\alpha, such that

𝔼(𝐱k+1𝐱22+𝐳k+1𝐳22)\displaystyle\mathbb{E}\left(\left\|{\bf x}_{k+1}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k+1}-{\bf z}_{\star}\right\|_{2}^{2}\right) α𝔼𝐱k𝐱22+δ𝔼𝐳k𝐳22+δAF2𝔼𝐳k𝐳22\displaystyle\leqslant\alpha\mathbb{E}\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\delta\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}+\frac{\delta}{\left\|A\right\|_{F}^{2}}\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}
=α𝔼𝐱k𝐱22+δ𝔼𝐳k𝐳22+o(α𝔼𝐱k𝐱22+δ𝔼𝐳k𝐳22)\displaystyle=\alpha\mathbb{E}\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\delta\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}+o\big{(}\alpha\mathbb{E}\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\delta\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\big{)}
α𝔼𝐱k𝐱22+δ𝔼𝐳k𝐳22\displaystyle\approx\alpha\mathbb{E}\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\delta\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}
=θk𝔼(𝐱k𝐱22+𝐳k𝐳22),m+1tkm+n,\displaystyle=\theta_{k}\mathbb{E}\left(\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\right),\quad m+1\leqslant t_{k}\leqslant m+n, (10)

where the high order term $o\big{(}\alpha\mathbb{E}\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\delta\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\big{)}$ is omitted. In conclusion, we have the following theorem on the convergence of the proposed AGRAK method.

Theorem 2.

Let AF21\left\|A\right\|_{F}^{2}\gg 1 such that (9) is satisfied. Then under the above notations, the iteration sequences {𝐱k}k=1\left\{{\bf x}_{k}\right\}_{k=1}^{\infty} and {𝐳k}k=1\left\{{\bf z}_{k}\right\}_{k=1}^{\infty} generated by Algorithm 3 converge in expectation to 𝐱{\bf x}_{\star} and 𝐳{\bf z}_{\star}, with

{𝔼(𝐱k+1𝐱22+𝐳k+1𝐳22)β~𝔼(𝐱k𝐱22+𝐳k𝐳22)if  1tkm,𝔼(𝐱k+1𝐱22+𝐳k+1𝐳22)θk𝔼(𝐱k𝐱22+𝐳k𝐳22)ifm+1tkm+n,\begin{cases}\mathbb{E}\left(\left\|{\bf x}_{k+1}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k+1}-{\bf z}_{\star}\right\|_{2}^{2}\right)\leqslant\tilde{\beta}\mathbb{E}\left(\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\right)&if\,\,1\leqslant t_{k}\leqslant m,\\ \mathbb{E}\left(\left\|{\bf x}_{k+1}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k+1}-{\bf z}_{\star}\right\|_{2}^{2}\right)\lesssim\theta_{k}\mathbb{E}\left(\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\right)&if\,\,m+1\leqslant t_{k}\leqslant m+n,\\ \end{cases}

where “\lesssim” means the high order term o(α𝔼𝐱k𝐱22+δ𝔼𝐳k𝐳22)o\big{(}\alpha\mathbb{E}\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\delta\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\big{)} is omitted.

Under the condition that AF21\|A\|_{F}^{2}\gg 1, the following theorem indicates that Algorithm 3 converges faster than the GRAK method; refer to (10).

Theorem 3.

Let β,β~\beta,\tilde{\beta} and θk\theta_{k} be defined in (11), (3) and (3.1), respectively. If AF21\left\|A\right\|_{F}^{2}\gg 1 such that (9) is satisfied, then

0<β~<βand0<θk<β.0<\tilde{\beta}<\beta\quad{\rm and}\quad 0<\theta_{k}<\beta.
Proof.

First, we prove that β~<β\tilde{\beta}<\beta. Recall that

β=112(A~F2A~F2min1im+n{A~(i)22}+1)λmin(A~HA~)A~F2,\beta=1-\frac{1}{2}\left(\frac{\left\|\tilde{A}\right\|_{F}^{2}}{\left\|\tilde{A}\right\|_{F}^{2}-\underset{1\leqslant i\leqslant m+n}{\min}\left\{\left\|\tilde{A}^{\left(i\right)}\right\|_{2}^{2}\right\}}+1\right)\frac{\lambda_{\min}\left(\tilde{A}^{H}\tilde{A}\right)}{\left\|\tilde{A}\right\|_{F}^{2}},

and

β~\displaystyle\tilde{\beta} =1λmin(A~HA~)A~F2min1im+n{A~(i)22}=1A~F2A~F2min1im+n{A~(i)22}λmin(A~HA~)A~F2.\displaystyle=1-\frac{\lambda_{\min}\left(\tilde{A}^{H}\tilde{A}\right)}{\left\|\tilde{A}\right\|_{F}^{2}-\underset{1\leqslant i\leqslant m+n}{\min}\left\{\left\|\tilde{A}^{\left(i\right)}\right\|_{2}^{2}\right\}}=1-\frac{\left\|\tilde{A}\right\|_{F}^{2}}{\left\|\tilde{A}\right\|_{F}^{2}-\underset{1\leqslant i\leqslant m+n}{\min}\left\{\left\|\tilde{A}^{\left(i\right)}\right\|_{2}^{2}\right\}}\cdot\frac{\lambda_{\min}\left(\tilde{A}^{H}\tilde{A}\right)}{\left\|\tilde{A}\right\|_{F}^{2}}.

As there are no zero columns in AA, we obtain

12(A~F2A~F2min1im+n{A~(i)22}+1)<A~F2A~F2min1im+n{A~(i)22},\frac{1}{2}\left(\frac{\left\|\tilde{A}\right\|_{F}^{2}}{\left\|\tilde{A}\right\|_{F}^{2}-\underset{1\leqslant i\leqslant m+n}{\min}\left\{\left\|\tilde{A}^{\left(i\right)}\right\|_{2}^{2}\right\}}+1\right)<\frac{\left\|\tilde{A}\right\|_{F}^{2}}{\left\|\tilde{A}\right\|_{F}^{2}-\underset{1\leqslant i\leqslant m+n}{\min}\left\{\left\|\tilde{A}^{\left(i\right)}\right\|_{2}^{2}\right\}},

and β~<β\tilde{\beta}<\beta.

Second, we prove that θk<β\theta_{k}<\beta. Indeed, it is sufficient to show that α<β\alpha<\beta. Let

{η=1<λmin(AHA),ifλmin(AHA)2,η=(λmin(AHA)+1412)2,ifλmin(AHA)<2.\begin{cases}\eta=1<\lambda_{\min}\left(A^{H}A\right),&if\,\,\lambda_{\min}\left(A^{H}A\right)\geqslant 2,\\ \eta=\left(\sqrt{\lambda_{\min}\left(A^{H}A\right)+\frac{1}{4}}-\frac{1}{2}\right)^{2},&if\,\,\lambda_{\min}\left(A^{H}A\right)<2.\\ \end{cases}

Notice that η<λmin(AHA)\eta<\lambda_{\min}\left(A^{H}A\right). Let ξ=2+mAF2\xi=2+\frac{m}{\left\|A\right\|_{F}^{2}}, then

12(2AF2+mγ+1)η2AF2+m=12ξ(2AF2+mγ+1)ηAF2.\frac{1}{2}\left(\frac{2\left\|A\right\|_{F}^{2}+m}{\gamma}+1\right)\frac{\eta}{2\left\|A\right\|_{F}^{2}+m}=\frac{1}{2\xi}\left(\frac{2\left\|A\right\|_{F}^{2}+m}{\gamma}+1\right)\frac{\eta}{\left\|A\right\|_{F}^{2}}. (11)

Next, we show that 2AF2+mγ+1<4.\frac{2\left\|A\right\|_{F}^{2}+m}{\gamma}+1<4. Otherwise, if 2AF2+mγ+14\frac{2\left\|A\right\|_{F}^{2}+m}{\gamma}+1\geqslant 4, then

3AF23min{1+min1im{A(i)22},min1jn{A(j)22}}4AF2+2m,3\left\|A\right\|_{F}^{2}\geqslant 3\cdot\min\left\{1+\underset{1\leqslant i\leqslant m}{\min}\left\{\left\|A^{\left(i\right)}\right\|_{2}^{2}\right\},\underset{1\leqslant j\leqslant n}{\min}\left\{\left\|A_{\left(j\right)}\right\|_{2}^{2}\right\}\right\}\geq 4\left\|A\right\|_{F}^{2}+2m,

which is a contradiction. As a result,

12ξ(2AF2+mγ+1)<1,\frac{1}{2\xi}\left(\frac{2\left\|A\right\|_{F}^{2}+m}{\gamma}+1\right)<1,

and

12ξ(2AF2+mγ+1)ηAF2<λmin(AHA)AF2.\frac{1}{2\xi}\left(\frac{2\left\|A\right\|_{F}^{2}+m}{\gamma}+1\right)\frac{\eta}{\left\|A\right\|_{F}^{2}}<\frac{\lambda_{\min}\left(A^{H}A\right)}{\left\|A\right\|_{F}^{2}}. (12)

A combination of (11) and (12) yields

12(2AF2+mγ+1)η2AF2+m<λmin(AHA)AF2,\frac{1}{2}\left(\frac{2\left\|A\right\|_{F}^{2}+m}{\gamma}+1\right)\frac{\eta}{2\left\|A\right\|_{F}^{2}+m}<\frac{\lambda_{\min}\left(A^{H}A\right)}{\left\|A\right\|_{F}^{2}},

and it follows from (11) and (7) that θk<α<β.\theta_{k}<\alpha<\beta.

3.2 An accelerated greedy randomized augmented Kaczmarz method with simple random sampling

Although Algorithm 3 can converge faster than GRAK, just like the GRAK method it requires computing the residual vectors $A{\bf x}_{k}-({\bf b}-{\bf z}_{k})$ and $A^{H}{\bf z}_{k}$ in each iteration. That is, we have to access all the rows and columns of the data matrix $A$, which is contrary to the motivation of the Kaczmarz-type methods. This is particularly unfavorable when the coefficient matrix is large and dense, and it is therefore of interest to accelerate Algorithm 3 further.

To deal with this problem, we apply the simple random sampling strategy advocated in [21] to solve (1). The key idea is that, in each iteration, we first generate a subset by using simple random sampling, and then select working rows in this subset. Notice that the cost of generating a subset via simple random sampling is negligible compared with that of the overall algorithm. The advantages are that we are free of accessing all the rows or columns of the data matrix, and there is no need to compute the residual vector associated with the augmented linear system. The algorithm is given as follows.

Algorithm 4.

A semi-randomized augmented Kaczmarz method with simple random sampling
Input:
AA, 𝐛\bf b, 𝐱0{\bf x}_{0}, 𝐳0{\bf z}_{0}, a parameter 0<η10<\eta\ll 1 and the maximal iteration number ll.
Output: 𝐱l{\bf x}_{l} and 𝐳l{\bf z}_{l}.
1. for k=0,1,,l1k=0,1,\ldots,l-1 do
2. Generate an indicator set $\varOmega_{k}$ with $\lfloor\left(m+n\right)\eta\rfloor$ rows by using the simple random sampling method, where $\lfloor\cdot\rfloor$ means rounding down. Let $\varOmega_{k}^{1}=\varOmega_{k}\cap\left\{1,2,\dots,m\right\}$ and $\varOmega_{k}^{2}=\varOmega_{k}\cap\left\{m+1,m+2,\dots,m+n\right\}$. Then, select $t_{k}$ according to

tk=argmax{maxjkΩk1{|𝐛(jk)𝐳k(jk)A(jk)𝐱k|1+A(jk)22},maxjkΩk2{|A(jkm)H𝐳k|A(jkm)H2}}t_{k}=\mathop{\arg\max}\left\{\max_{j_{k}\in\varOmega_{k}^{1}}\left\{\frac{\left|{\bf b}^{\left(j_{k}\right)}-{\bf z}_{k}^{\left(j_{k}\right)}-A^{\left(j_{k}\right)}{\bf x}_{k}\right|}{\sqrt{1+\left\|A^{\left(j_{k}\right)}\right\|_{2}^{2}}}\right\},\max_{j_{k}\in\varOmega_{k}^{2}}\left\{\frac{\left|-A_{\left(j_{k}-m\right)}^{H}{\bf z}_{k}\right|}{\left\|A_{\left(j_{k}-m\right)}^{H}\right\|_{2}}\right\}\right\}

3. If 1tkm1\leqslant t_{k}\leqslant m, then set ik=tki_{k}=t_{k}, and compute

𝐳k+1=𝐳k+(𝐛(ik)𝐳k(ik)A(ik)𝐱k)1+A(ik)22𝐞ik,{\bf z}_{k+1}={\bf z}_{k}+\frac{\left({\bf b}^{\left(i_{k}\right)}-{\bf z}_{k}^{\left(i_{k}\right)}-A^{\left(i_{k}\right)}{\bf x}_{k}\right)}{1+\left\|A^{\left(i_{k}\right)}\right\|_{2}^{2}}{\bf e}_{i_{k}},

and

𝐱k+1=𝐱k+(𝐛(ik)𝐳k(ik)A(ik)𝐱k)1+A(ik)22(A(ik))H.{\bf x}_{k+1}={\bf x}_{k}+\frac{\left({\bf b}^{\left(i_{k}\right)}-{\bf z}_{k}^{\left(i_{k}\right)}-A^{\left(i_{k}\right)}{\bf x}_{k}\right)}{1+\left\|A^{\left(i_{k}\right)}\right\|_{2}^{2}}\left(A^{\left(i_{k}\right)}\right)^{H}.

Else if m+1tkm+nm+1\leqslant t_{k}\leqslant m+n, then set jk=tkmj_{k}=t_{k}-m, and compute

𝐳k+1=𝐳kA(jk)H𝐳kA(jk)22A(jk),{\bf z}_{k+1}={\bf z}_{k}-\frac{A_{\left(j_{k}\right)}^{H}{\bf z}_{k}}{\left\|A_{\left(j_{k}\right)}\right\|_{2}^{2}}A_{\left(j_{k}\right)},

and select an index ik{1,2,,m}i_{k}\in\left\{1,2,\dots,m\right\} with probability Pr(row=ik)=A(ik)22AF2P_{r}\left(row=i_{k}\right)=\frac{\left\|A^{\left(i_{k}\right)}\right\|_{2}^{2}}{\left\|A\right\|_{F}^{2}}, and compute

𝐱k+1=𝐱k+(𝐛(ik)𝐳k+1(ik)A(ik)𝐱k)A(ik)22(A(ik))H.{\bf x}_{k+1}={\bf x}_{k}+\frac{\left({\bf b}^{\left(i_{k}\right)}-{\bf z}_{k+1}^{\left(i_{k}\right)}-A^{\left(i_{k}\right)}{\bf x}_{k}\right)}{\left\|A^{\left(i_{k}\right)}\right\|_{2}^{2}}\left(A^{\left(i_{k}\right)}\right)^{H}.

4. end for
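
The essential difference from Algorithm 3 is step 2, where only the sampled rows and columns are touched. A MATLAB sketch of one iteration of Algorithm 4 is given below; it assumes the same setup (A, b, x, z, rowNorms, colNorms, fro2) as in the earlier sketches, eta denotes the sampling parameter $\eta$ (e.g., eta = 0.01), and all names are our own.

% One iteration of Algorithm 4; the setup follows the earlier sketches (ours).
m = size(A, 1);  n = size(A, 2);
Omega = randperm(m + n, floor((m + n) * eta));   % step 2: simple random sampling
Omega = Omega(:);
O1 = Omega(Omega <= m);                          % sampled row indices
O2 = Omega(Omega > m) - m;                       % sampled column indices
wR = abs(b(O1) - z(O1) - A(O1, :) * x) ./ sqrt(1 + rowNorms(O1));
wC = abs(A(:, O2)' * z) ./ sqrt(colNorms(O2));
[vR, pR] = max([wR; -Inf]);                      % -Inf guards against an empty O1
[vC, pC] = max([wC; -Inf]);                      % -Inf guards against an empty O2
if vR >= vC                                      % step 3, first branch
    i = O1(pR);
    d = (b(i) - z(i) - A(i, :) * x) / (1 + rowNorms(i));
    z(i) = z(i) + d;
    x = x + d * A(i, :)';
else                                             % step 3, second branch
    j = O2(pC);
    z = z - (A(:, j)' * z / colNorms(j)) * A(:, j);
    i = find(cumsum(rowNorms) >= rand * fro2, 1);
    x = x + ((b(i) - z(i) - A(i, :) * x) / rowNorms(i)) * A(i, :)';
end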

We are now in a position to consider the convergence of the semi-randomized augmented Kaczmarz method with simple random sampling. First, if $1\leqslant t_{k}\leqslant m$, then, similarly to the proof of Theorem 4.2 in [21], since the number of rows of $A$ is sufficiently large, it follows from Chebyshev's (weak) law of large numbers [22] that there are two scalars $0\leqslant\hat{\varepsilon}_{k},\tilde{\varepsilon}_{k}\ll 1$ such that

𝔼k𝐱~k+1𝐱~22(1(1ε^k)λmin(A~HA~)(1+ε~k)(A~F2min1im+n{A~(i)22}))𝐱~k𝐱~22.\mathbb{E}_{k}\left\|\tilde{{\bf x}}_{k+1}-\tilde{{\bf x}}_{\star}\right\|_{2}^{2}\leqslant\left(1-\frac{\left(1-\hat{\varepsilon}_{k}\right)\lambda_{\min}\left(\tilde{A}^{H}\tilde{A}\right)}{\left(1+\tilde{\varepsilon}_{k}\right)\left(\left\|\tilde{A}\right\|_{F}^{2}-\underset{1\leqslant i\leqslant m+n}{\min}\left\{\left\|\tilde{A}^{\left(i\right)}\right\|_{2}^{2}\right\}\right)}\right)\left\|\tilde{{\bf x}}_{k}-\tilde{{\bf x}}_{\star}\right\|_{2}^{2}.

It can be rewritten as

𝔼k(𝐱k+1𝐱22+𝐳k+1𝐳22)(1(1ε^k)η(1+ε~k)γ)(𝐱k𝐱22+𝐳k𝐳22),\mathbb{E}_{k}\left(\left\|{\bf x}_{k+1}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k+1}-{\bf z}_{\star}\right\|_{2}^{2}\right)\leqslant\left(1-\frac{\left(1-\hat{\varepsilon}_{k}\right)\eta}{\left(1+\tilde{\varepsilon}_{k}\right)\gamma}\right)\left(\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\right), (13)

where $\eta$ and $\gamma$ are defined in (12) and (13), respectively. Taking the full expectation on both sides of (13) and setting

β^=1(1ε^k)η(1+ε~k)γ,\hat{\beta}=1-\frac{\left(1-\hat{\varepsilon}_{k}\right)\eta}{\left(1+\tilde{\varepsilon}_{k}\right)\gamma},

we obtain

𝔼(𝐱k+1𝐱22+𝐳k+1𝐳22)β^𝔼(𝐱k𝐱22+𝐳k𝐳22),1tkm.\mathbb{E}\left(\left\|{\bf x}_{k+1}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k+1}-{\bf z}_{\star}\right\|_{2}^{2}\right)\leqslant\hat{\beta}\mathbb{E}\left(\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\right),\quad 1\leqslant t_{k}\leqslant m.

Second, if m+1tkm+nm+1\leqslant t_{k}\leqslant m+n, analogous to the analysis of Algorithm 3, we have that

𝔼𝐳k+1𝐳22δ~𝔼𝐳k𝐳22.\mathbb{E}\left\|{\bf z}_{k+1}-{\bf z}_{\star}\right\|_{2}^{2}\leqslant\tilde{\delta}\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}.

As the number of rows of $A$ is sufficiently large, it follows from Chebyshev's (weak) law of large numbers [22] that there are two scalars $0\leqslant\check{\varepsilon}_{k},\mathring{\varepsilon}_{k}\ll 1$ such that

𝔼𝐱k+1𝐱22α𝔼𝐱k𝐱22+δ~AF2𝔼𝐳k𝐳22,\mathbb{E}\left\|{\bf x}_{k+1}-{\bf x}_{\star}\right\|_{2}^{2}\leqslant\alpha\mathbb{E}\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\frac{\tilde{\delta}}{\left\|A\right\|_{F}^{2}}\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2},

where α\alpha is defined in (7), and

δ~=1(1εˇk)λmin(AHA)(1+ε̊k)(AF2min1jn{A(j)22}).\tilde{\delta}=1-\frac{\left(1-\check{\varepsilon}_{k}\right)\lambda_{\min}\left(A^{H}A\right)}{\left(1+\mathring{\varepsilon}_{k}\right)\left(\left\|A\right\|_{F}^{2}-\underset{1\leqslant j\leqslant n}{\min}\left\{\left\|A_{\left(j\right)}\right\|_{2}^{2}\right\}\right)}.

If AF21\left\|A\right\|_{F}^{2}\gg 1 such that

δ~AF2𝔼𝐳k𝐳22α𝔼𝐱k𝐱22+δ~𝔼𝐳k𝐳22,\frac{\tilde{\delta}}{\left\|A\right\|_{F}^{2}}\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\ll\alpha\mathbb{E}\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\tilde{\delta}\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2},

i.e.,

δ~AF2𝔼𝐳k𝐳22=o(α𝔼𝐱k𝐱22+δ~𝔼𝐳k𝐳22).\frac{\tilde{\delta}}{\left\|A\right\|_{F}^{2}}\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}=o\left(\alpha\mathbb{E}\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\tilde{\delta}\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\right). (14)

Then we have the following theorem on the convergence of Algorithm 4.

Theorem 4.

Suppose that AF21\left\|A\right\|_{F}^{2}\gg 1 such that (14) is satisfied. Then the iteration sequences {𝐱k}k=1\left\{{\bf x}_{k}\right\}_{k=1}^{\infty} and {𝐳k}k=1\left\{{\bf z}_{k}\right\}_{k=1}^{\infty} generated by Algorithm 4, converge in expectation to 𝐱{\bf x}_{\star} and 𝐳{\bf z}_{\star}, with

{𝔼(𝐱k+1𝐱22+𝐳k+1𝐳22)β^𝔼(𝐱k𝐱22+𝐳k𝐳22)if  1tkm,𝔼(𝐱k+1𝐱22+𝐳k+1𝐳22)θ~k𝔼(𝐱k𝐱22+𝐳k𝐳22)ifm+1tkm+n,\begin{cases}\mathbb{E}\left(\left\|{\bf x}_{k+1}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k+1}-{\bf z}_{\star}\right\|_{2}^{2}\right)\leqslant\hat{\beta}\mathbb{E}\left(\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\right)&if\,\,1\leqslant t_{k}\leqslant m,\\ \mathbb{E}\left(\left\|{\bf x}_{k+1}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k+1}-{\bf z}_{\star}\right\|_{2}^{2}\right)\lesssim\tilde{\theta}_{k}\mathbb{E}\left(\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\right)&if\,\,m+1\leqslant t_{k}\leqslant m+n,\\ \end{cases}

where “\lesssim” means the high order term o(α𝔼𝐱k𝐱22+δ~𝔼𝐳k𝐳22)o\left(\alpha\mathbb{E}\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\tilde{\delta}\mathbb{E}\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}\right) is omitted, and θ~k<max{α,δ~}<1\tilde{\theta}_{k}<\max\big{\{}\alpha,\tilde{\delta}\big{\}}<1.

Remark 3.1.

As we only use a small portion of the rows of the data matrix AA, we often have that β^>β~\hat{\beta}>\tilde{\beta} and θ~k>θk\tilde{\theta}_{k}>\theta_{k}. That is, Algorithm 4 may use more iterations than Algorithm 3. However, this does not mean that Algorithm 4 definitely runs slower than Algorithm 3. Indeed, Algorithm 4 is much cheaper than Algorithm 3 per iteration, and it may use less CPU time in practice. One refers to Section 5 for numerical comparisons of the algorithms.

4 A practical stopping criterion for Kaczmarz-type methods

To the best of our knowledge, there have been no practical stopping criteria for randomized Kaczmarz-type methods to date. To fill in this gap, we introduce a practical stopping criterion for randomized Kaczmarz-type methods in this section. Let $\left\{{\bf x}_{i}\right\}_{i=0}^{\infty}$ be an iterative sequence generated by any Kaczmarz-type method, and let ${\bf x}_{\star}=A^{\dagger}{\bf b}$ be the least-squares solution of the linear system $A{\bf x}={\bf b}$. One scheme is to use the relative solution error (RSE) as the stopping criterion [3, 5, 6, 21]

RSE=𝐱k𝐱2𝐱2.RSE=\frac{\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}}{\left\|{\bf x}_{\star}\right\|_{2}}. (1)

In [29], the absolute solution error (ASE) is used

ASE=𝐱k𝐱2.ASE=\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}. (2)

Obviously, both (1) and (2) are impractical since xx_{\star} is unknown in advance. In [43], the relative residual (RRes) is utilized

RRes=𝐛A𝐱k2𝐛2.RRes=\frac{\left\|{\bf b}-A{\bf x}_{k}\right\|_{2}}{\left\|{\bf b}\right\|_{2}}.

However, it is unsuitable for inconsistent linear systems, and one has to calculate the residual ${\bf b}-A{\bf x}_{k}$. In other words, we have to access all the rows of $A$ in each iteration. In [25], Li and Wu suggested using the adjacent iterative solution error (AISE for short) as the stopping criterion

AISE=𝐱k𝐱k12𝐛2.AISE=\frac{\left\|{\bf x}_{k}-{\bf x}_{k-1}\right\|_{2}}{\left\|{\bf b}\right\|_{2}}. (3)

We observe that (3) involves neither ${\bf x}_{\star}$ nor the residual vector. Unfortunately, the convergence of $\left\|{\bf x}_{k}-{\bf x}_{k-1}\right\|_{2}/\left\|{\bf b}\right\|_{2}$ to zero is only a necessary condition for the convergence of $\left\{{\bf x}_{i}\right\}_{i=0}^{\infty}$ to ${\bf x}_{\star}$. More precisely,

|𝐱k𝐱2𝐱k1𝐱2|𝐛2𝐱k𝐱k12𝐛2𝐱k𝐱2+𝐱k1𝐱2𝐛2.\frac{\left|\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}-\left\|{\bf x}_{k-1}-{\bf x}_{\star}\right\|_{2}\right|}{\left\|{\bf b}\right\|_{2}}\leqslant\frac{\left\|{\bf x}_{k}-{\bf x}_{k-1}\right\|_{2}}{\left\|{\bf b}\right\|_{2}}\leqslant\frac{\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}+\left\|{\bf x}_{k-1}-{\bf x}_{\star}\right\|_{2}}{\left\|{\bf b}\right\|_{2}}.

In this section, we will propose a practical stopping criterion for all the randomized Kaczmarz-type methods, and show its rationality theoretically. We need the following lemma.

Lemma 4.1.

Let {𝐱i}i=0\left\{{\bf x}_{i}\right\}_{i=0}^{\infty} be an iterative sequence obtained from a convergent Kaczmarz-type method. Then, there is a positive integer LL such that 𝐱(k+1)L𝐱2<𝐱kL𝐱2,k=0,1,2,\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{\star}\right\|_{2}<\left\|{\bf x}_{kL}-{\bf x}_{\star}\right\|_{2},~{}k=0,1,2,\ldots

Proof.

We prove it by contradiction. Suppose that 𝐱(k+1)L𝐱2𝐱kL𝐱2,k=0,1,2,\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{\star}\right\|_{2}\geqslant\left\|{\bf x}_{kL}-{\bf x}_{\star}\right\|_{2},k=0,1,2,\ldots, for all the positive integers LL. If we take L=1L=1, then

𝐱k+1𝐱2𝐱k𝐱2,k=0,1,2,,\left\|{\bf x}_{k+1}-{\bf x}_{\star}\right\|_{2}\geqslant\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2},\quad k=0,1,2,\ldots,

which contradicts the fact that $\left\{{\bf x}_{i}\right\}_{i=0}^{\infty}$ is generated by a convergent Kaczmarz-type method. ∎

Let

g()=𝐱𝐱2,=0,L,2L,g\left(\ell\right)=\left\|{\bf x}_{\ell}-{\bf x}_{\star}\right\|_{2}\,\,,\quad\ell=0,L,2L,\ldots

then the backward divided difference of g()g\left(\ell\right) at kLkL is defined as [10]

|g(kL)g((k1)L)|kL(k1)L=|𝐱kL𝐱2𝐱(k1)L𝐱2|L.\frac{\left|g\left(kL\right)-g\left(\left(k-1\right)L\right)\right|}{kL-\left(k-1\right)L}=\frac{\left|\left\|{\bf x}_{kL}-{\bf x}_{\star}\right\|_{2}-\left\|{\bf x}_{\left(k-1\right)L}-{\bf x}_{\star}\right\|_{2}\right|}{L}.

From Lemma 4.1, we see that g()g\left(\ell\right) is a strictly monotonically decreasing discrete function whose infimum is 0, and

limkg(kL)=0.\lim_{k\rightarrow\infty}g\left(kL\right)=0.

Thus, for any $\varepsilon>0$, there exists a positive integer $k_{1}>0$ such that $g\left(kL\right)\leqslant\frac{\varepsilon}{2}$ for all $k\geqslant k_{1}$. Next, we show that there are $k_{2}>0$ and $\varepsilon_{1}<\varepsilon$ such that, if $\left\|{\bf x}_{k_{2}L}-{\bf x}_{\left(k_{2}-1\right)L}\right\|_{2}/L\leqslant\varepsilon_{1}$, then $g\left(k_{2}L\right)<\varepsilon$. Indeed, let $\varepsilon_{1}=\frac{\varepsilon}{2Lk_{1}}$; then there exists $k_{2}>0$ such that

|g(k2L)g((k21)L)|L\displaystyle\frac{\left|g\left(k_{2}L\right)-g\left(\left(k_{2}-1\right)L\right)\right|}{L} =|𝐱k2L𝐱2𝐱(k21)L𝐱2|L\displaystyle=\frac{\left|\left\|{\bf x}_{k_{2}L}-{\bf x}_{\star}\right\|_{2}-\left\|{\bf x}_{\left(k_{2}-1\right)L}-{\bf x}_{\star}\right\|_{2}\right|}{L}
𝐱k2L𝐱(k21)L2Lε2Lk1.\displaystyle\leqslant\frac{\left\|{\bf x}_{k_{2}L}-{\bf x}_{\left(k_{2}-1\right)L}\right\|_{2}}{L}\leqslant\frac{\varepsilon}{2Lk_{1}}. (4)

On one hand, if k1k2k_{1}\leqslant k_{2}, then g(k2L)ε2<εg\left(k_{2}L\right)\leqslant\frac{\varepsilon}{2}<\varepsilon. On the other hand, if k1>k2k_{1}>k_{2}, then there is a scalar

a=g(k1L)g(k2L)(k1k2)L,a=\frac{g\left(k_{1}L\right)-g\left(k_{2}L\right)}{\left(k_{1}-k_{2}\right)L}, (5)

such that

g(k2L)g((k21)L)Lag(k1L)g((k11)L)L<0.\frac{g\left(k_{2}L\right)-g\left(\left(k_{2}-1\right)L\right)}{L}\leqslant a\leqslant\frac{g\left(k_{1}L\right)-g\left(\left(k_{1}-1\right)L\right)}{L}<0. (6)

Combining the inequalities (4) and (6), we arrive at

|a||g(k2L)g((k21)L)|Lε2k1L.\left|a\right|\leqslant\frac{\left|g\left(k_{2}L\right)-g\left(\left(k_{2}-1\right)L\right)\right|}{L}\leqslant\frac{\varepsilon}{2k_{1}L}.

From (5), we have g(k2L)=g(k1L)a(k1k2)Lg\left(k_{2}L\right)=g\left(k_{1}L\right)-a\left(k_{1}-k_{2}\right)L. Hence,

g(k2L)\displaystyle g\left(k_{2}L\right) =|g(k2L)|=|g(k1L)a(k1k2)L|\displaystyle=\left|g\left(k_{2}L\right)\right|=\left|g\left(k_{1}L\right)-a\left(k_{1}-k_{2}\right)L\right|
|g(k1L)|+|a||k1k2|Lε2+ε2k1L|k1k2|L\displaystyle\leqslant\left|g\left(k_{1}L\right)\right|+\left|a\right|\cdot\left|k_{1}-k_{2}\right|L\leqslant\frac{\varepsilon}{2}+\frac{\varepsilon}{2k_{1}L}\left|k_{1}-k_{2}\right|L
=ε2+k1k2k1ε2ε2+ε2=ε.\displaystyle=\frac{\varepsilon}{2}+\frac{k_{1}-k_{2}}{k_{1}}\cdot\frac{\varepsilon}{2}\leqslant\frac{\varepsilon}{2}+\frac{\varepsilon}{2}=\varepsilon.

In conclusion, we have the following theorem.

Theorem 5.

Let $g\left(\ell\right)=\left\|{\bf x}_{\ell}-{\bf x}_{\star}\right\|_{2}~{}(\ell=0,L,2L,\ldots)$ be a strictly monotonically decreasing discrete function whose infimum is 0. For any $\varepsilon>0$, there are $k_{2}>0$ and $\varepsilon_{1}<\varepsilon$ such that, if

\frac{\left\|{\bf x}_{k_{2}L}-{\bf x}_{\left(k_{2}-1\right)L}\right\|_{2}}{L}\leqslant\varepsilon_{1},

then $g\left(k_{2}L\right)<\varepsilon$.

Remark 4.1.

Based on Theorem 5, we propose the following stopping criterion for Kaczmarz-type methods

LISE=𝐱kL𝐱(k1)L2L<tol,LISE=\frac{\left\|{\bf x}_{kL}-{\bf x}_{\left(k-1\right)L}\right\|_{2}}{L}<tol, (7)

where L>0L>0 is a user-provided number and 𝑡𝑜𝑙\it tol is the convergence tolerance.
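
In code, the criterion (7) only requires caching the iterate every $L$ steps. The following MATLAB sketch (our own wrapper; for brevity the inner update is the plain randomized Kaczmarz step on a consistent test problem rather than one of the augmented methods above) shows how (7) is embedded into the iteration loop.

% Sketch of the stopping criterion (7) around a Kaczmarz-type loop (our example).
rng(1);
A = randn(2000, 100);  xTrue = randn(100, 1);  b = A * xTrue;
rowNorms = sum(abs(A).^2, 2);  fro2 = sum(rowNorms);
x = zeros(100, 1);  xOld = x;
L = 50;  tol = 1e-8;  maxit = 1e6;
for k = 1:maxit
    i = find(cumsum(rowNorms) >= rand * fro2, 1);          % randomized Kaczmarz step
    x = x + ((b(i) - A(i, :) * x) / rowNorms(i)) * A(i, :)';
    if mod(k, L) == 0                                      % test LISE every L iterations
        if norm(x - xOld) / L < tol, break; end            % criterion (7)
        xOld = x;                                          % cache x_{kL} for the next test
    end
end
fprintf('stopped at k = %d, RSE = %.2e\n', k, norm(x - xTrue) / norm(xTrue));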

It is seen that the parameter LL is crucial to our stopping criterion. Next, we briefly discuss how to choose it in practice. Assume that

𝔼𝐱k+1𝐱22c𝔼𝐱k𝐱22,\mathbb{E}\left\|{\bf x}_{k+1}-{\bf x}_{\star}\right\|_{2}^{2}\leqslant c\mathbb{E}\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2},

where $0<c<1$, so that

𝔼𝐱(k+1)L𝐱22cL𝔼𝐱kL𝐱22.\mathbb{E}\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{\star}\right\|_{2}^{2}\leqslant c^{L}\mathbb{E}\left\|{\bf x}_{kL}-{\bf x}_{\star}\right\|_{2}^{2}. (8)

Therefore,

𝔼𝐱kL𝐱22\displaystyle\mathbb{E}\left\|{\bf x}_{kL}-{\bf x}_{\star}\right\|_{2}^{2} =𝔼𝐱kL𝐱(k+1)L+𝐱(k+1)L𝐱22𝔼(𝐱(k+1)L𝐱kL2+𝐱(k+1)L𝐱2)2\displaystyle=\mathbb{E}\left\|{\bf x}_{kL}-{\bf x}_{\left(k+1\right)L}+{\bf x}_{\left(k+1\right)L}-{\bf x}_{\star}\right\|_{2}^{2}\leqslant\mathbb{E}\left(\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{kL}\right\|_{2}+\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{\star}\right\|_{2}\right)^{2}
=𝔼(𝐱(k+1)L𝐱kL22+2𝐱(k+1)L𝐱kL2𝐱(k+1)L𝐱2+𝐱(k+1)L𝐱22)\displaystyle=\mathbb{E}\left(\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{kL}\right\|_{2}^{2}+2\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{kL}\right\|_{2}\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{\star}\right\|_{2}+\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{\star}\right\|_{2}^{2}\right)
=𝔼𝐱(k+1)L𝐱kL22+2𝔼(𝐱(k+1)L𝐱kL2𝐱(k+1)L𝐱2)+𝔼𝐱(k+1)L𝐱22\displaystyle=\mathbb{E}\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{kL}\right\|_{2}^{2}+2\mathbb{E}\left(\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{kL}\right\|_{2}\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{\star}\right\|_{2}\right)+\mathbb{E}\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{\star}\right\|_{2}^{2}
𝔼𝐱(k+1)L𝐱kL22+2𝔼(𝐱(k+1)L𝐱kL2𝐱(k+1)L𝐱2)+cL𝔼𝐱kL𝐱22.\displaystyle\leqslant\mathbb{E}\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{kL}\right\|_{2}^{2}+2\mathbb{E}\left(\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{kL}\right\|_{2}\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{\star}\right\|_{2}\right)+c^{L}\mathbb{E}\left\|{\bf x}_{kL}-{\bf x}_{\star}\right\|_{2}^{2}.

As a result,

(1cL)𝔼𝐱kL𝐱22𝔼𝐱(k+1)L𝐱kL22+2𝔼(𝐱(k+1)L𝐱kL2𝐱(k+1)L𝐱2),\left(1-c^{L}\right)\mathbb{E}\left\|{\bf x}_{kL}-{\bf x}_{\star}\right\|_{2}^{2}\leqslant\mathbb{E}\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{kL}\right\|_{2}^{2}+2\mathbb{E}\left(\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{kL}\right\|_{2}\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{\star}\right\|_{2}\right),

and

𝔼𝐱kL𝐱2211cL𝔼𝐱(k+1)L𝐱kL22+21cL𝔼(𝐱(k+1)L𝐱kL2𝐱(k+1)L𝐱2).\mathbb{E}\left\|{\bf x}_{kL}-{\bf x}_{\star}\right\|_{2}^{2}\leqslant\frac{1}{1-c^{L}}\mathbb{E}\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{kL}\right\|_{2}^{2}+\frac{2}{1-c^{L}}\mathbb{E}\left(\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{kL}\right\|_{2}\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{\star}\right\|_{2}\right). (9)

From (8) and (9), it follows that

𝔼𝐱(k+1)L𝐱22cL1cL𝔼𝐱(k+1)L𝐱kL22+2cL1cL𝔼(𝐱(k+1)L𝐱kL2𝐱(k+1)L𝐱2).\mathbb{E}\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{\star}\right\|_{2}^{2}\leqslant\frac{c^{L}}{1-c^{L}}\mathbb{E}\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{kL}\right\|_{2}^{2}+\frac{2c^{L}}{1-c^{L}}\mathbb{E}\left(\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{kL}\right\|_{2}\left\|{\bf x}_{\left(k+1\right)L}-{\bf x}_{\star}\right\|_{2}\right).

Since c^{L}/(1-c^{L}) decreases rapidly as L grows, the error \mathbb{E}\left\|{\bf x}_{(k+1)L}-{\bf x}_{\star}\right\|_{2}^{2} is essentially controlled by the computable difference \left\|{\bf x}_{(k+1)L}-{\bf x}_{kL}\right\|_{2} provided that L is sufficiently large. This implies that we need to use a sufficiently large L in practical calculations; one can use, say, L=50 or 80 in practice.
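To make the use of (7) concrete, the following is a minimal MATLAB sketch of the criterion embedded in a plain randomized Kaczmarz loop. The projection step is the classical randomized Kaczmarz update and only serves as a stand-in for Algorithm 3 or Algorithm 4; the function and variable names are illustrative.

function x = kaczmarz_lise(A, b, L, tol, maxit)
    % Sketch: stopping rule (7) inside a generic row-action loop.
    n = size(A, 2);
    x = zeros(n, 1);
    x_old = x;                              % iterate stored L steps ago
    rn2 = sum(abs(A).^2, 2);                % squared row norms of A
    cdf = cumsum(rn2) / sum(rn2);           % sampling distribution prop. to rn2
    for k = 1:maxit
        i = find(cdf >= rand, 1);                        % sample a row index
        x = x + ((b(i) - A(i,:)*x) / rn2(i)) * A(i,:)';  % project onto H_i
        if mod(k, L) == 0
            if norm(x - x_old) / L < tol                 % LISE test (7)
                return;
            end
            x_old = x;
        end
    end
end

The test is evaluated only once every L iterations and uses nothing beyond two stored iterates, so its overhead is negligible compared with one sweep of L projection steps.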

5 Numerical Experiments

In this section, we perform numerical experiments to show the efficiency of our algorithms as well as the effectiveness of the proposed stopping criterion. All the numerical experiments are run on an HP workstation with dual Intel(R) Xeon(R) E5-2640 v3 processors (20 cores), a 2.60 GHz CPU and 256 GB of RAM. The operating system is 64-bit Windows 10, and all the numerical results are obtained with MATLAB 2018b. In order to show the efficiency of our proposed algorithms, we compare Algorithm 3 and Algorithm 4 with the following state-of-the-art Kaczmarz-type methods for solving large-scale (dense or sparse) inconsistent linear systems:
\bullet REK [48]: The randomized extended Kaczmarz method.
\bullet PREK [5]: The partially randomized extended Kaczmarz method.
\bullet GRAK [4]: The greedy randomized augmented Kaczmarz method.
\bullet RGRAK [7]: The relaxed greedy randomized augmented Kaczmarz method, in which the relaxation parameter is set to be 1 [7].

In the tables below, we denote by “IT” the number of iterations, by “CPU” the CPU time in seconds, and by “RSE” the relative error defined as

RSE=𝐱k𝐱2𝐱2,{\rm RSE}=\frac{\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}}{\left\|{\bf x}_{\star}\right\|_{2}}, (10)

where {\bf x}_{\star}=A^{\dagger}{\bf b} is the minimum 2-norm least-squares solution of the inconsistent linear system (1), computed by the MATLAB function pinv.m. All the numerical results are the means over 10 runs.

5.1 Effectiveness of the stopping criterion (7)

In this subsection, we show the effectiveness of our stopping criterion for large inconsistent linear systems. The test matrix A is synthetic and is generated by the MATLAB built-in function randn(m,n). Similar to [4], the right-hand side is chosen as {\bf b}=A{\bf x}_{\ast}+{\bf r}, where {\bf x}_{\ast}\in\mathbb{R}^{n} is a least-squares solution of the inconsistent linear system (1), and {\bf r}\in\mathbb{R}^{m} is a nonzero vector in the null space of the matrix A^{H}, generated by the MATLAB function null.m. Two examples are performed in this subsection.
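For concreteness, a small MATLAB sketch of this construction is given below; the dimensions are illustrative, and null is used only to obtain a vector orthogonal to the range of A so that the resulting system is inconsistent.

% Sketch of the synthetic inconsistent system used in this subsection:
% b = A*x_ast + r with A'*r = 0, so A*x = b has no exact solution.
m = 5000;  n = 1000;
A = randn(m, n);
x_ast = randn(n, 1);              % one least-squares solution
N = null(A');                     % orthonormal basis of the null space of A^H
r = N * randn(size(N, 2), 1);     % nonzero vector with A'*r = 0
b = A * x_ast + r;                % inconsistent right-hand side
x_star = pinv(A) * b;             % reference solution used in RSE (10)

Note that forming null(A') and pinv(A) is affordable only for moderate sizes; it is used here merely to build a controlled test problem and the reference solution.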

\bullet In the first example, we run the six algorithms using their own stopping criteria. More precisely, in the REK method, we use [48]

A𝐱k(𝐛𝐳k)2AF𝐱k2tolandAH𝐳k2AF2𝐱k2tol\frac{\left\|A{\bf x}_{k}-\left({\bf b}-{\bf z}_{k}\right)\right\|_{2}}{\left\|A\right\|_{F}\left\|{\bf x}_{k}\right\|_{2}}\leqslant tol\quad{\rm and}\quad\frac{\left\|A^{H}{\bf z}_{k}\right\|_{2}}{\left\|A\right\|_{F}^{2}\left\|{\bf x}_{k}\right\|_{2}}\leqslant tol (11)

as the stopping criterion; in the PREK method, we use [5]

𝐱k𝐱22𝐱22tol\frac{\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}}{\left\|{\bf x}_{\star}\right\|_{2}^{2}}\leqslant tol (12)

as the stopping criterion; in the GRAK method and the RGRAK method, we use [4, 7]

𝐱k𝐱22+𝐳k𝐳22𝐱22+𝐛(A)22tol\frac{\left\|{\bf x}_{k}-{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf z}_{k}-{\bf z}_{\star}\right\|_{2}^{2}}{\left\|{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf b}_{\mathcal{R}\left(A\right)}\right\|_{2}^{2}}\leqslant tol (13)

as the stopping criteria, where 𝐳=𝐛(A){\bf z}_{\star}={\bf b}_{\mathcal{R}\left(A\right)^{\bot}} is computed by 𝐛(A)=𝐛A𝐱{\bf b}_{\mathcal{R}\left(A\right)^{\bot}}={\bf b}-A{\bf x}_{\star}. In Algorithm 3 and Algorithm 4, we make use of

LISE=𝐱~kL𝐱~(k1)L2LtolLISE=\frac{\left\|\tilde{{\bf x}}_{kL}-\tilde{{\bf x}}_{\left(k-1\right)L}\right\|_{2}}{L}\leqslant tol (14)

as the stopping criterion, where L=400, \tilde{{\bf x}}_{kL}=\left[{\bf z}_{kL}^{H},{\bf x}_{kL}^{H}\right]^{H} (cf. (2)), and the convergence tolerance is chosen as tol=10^{-4}. Table 1 lists the numerical results, where the values in bold are the best ones.

Table 1: Section 5.1. Example 1: Numerical results of the algorithms using their own stopping criteria (11)–(14); A=randn(m,n)A=randn(m,n), and tol=104tol=10^{-4}. The sampling ratio is chosen as η=0.01\eta=0.01 in Algorithm 4.
m×nm\times n 5000×10005000\times 1000 5000×15005000\times 1500 5000×20005000\times 2000 5000×25005000\times 2500 5000×30005000\times 3000
REK IT 19916 37204.9 64410.7 109528.8 185659.2
CPU 161.9 375.7 646.4 1362.3 2618.0
RSE 4.38e-3 6.53e-3 9.44e-3 1.34e-2 1.98e-2
PREK IT 13191.5 24724.2 47252.7 87730.9 169481.5
CPU 3.5 7.3 14.9 31.4 64.3
RSE 1.00e-2 1.00e-2 1.00e-2 1.00e-2 1.00e-2
GRAK IT 3003.5 5645.6 10181.9 16879.9 29117.3
CPU 17.1 49.9 84.7 139.9 282.1
RSE 4.79e-1 4.50e-1 4.16e-1 3.94e-1 3.61e-1
RGRAK IT 2977.3 5601.4 10206 16702.6 28927.7
CPU 15.7 47.6 84.1 139.6 283.5
RSE 4.90e-1 4.73e-1 4.55e-1 4.44e-1 4.10e-1
Algorithm 3 IT 9600 19200 36880 65480 127200
CPU 61.1 117.9 253.6 581.5 1356.9
RSE 7.52e-4 1.17e-3 1.73e-3 2.22e-3 3.19e-3
Algorithm 4 IT 10240 19920 37960 68080 131360
CPU 14.0 38.9 107.5 250.7 748.6
RSE 6.61e-4 1.06e-3 1.63e-3 2.09e-3 2.99e-3

It is seen from Table 1 that the PREK method and the RGRAK method may require less CPU time and fewer iterations when using their own stopping criteria. However, the computed solutions obtained from these methods are unreliable, and the “real” relative errors of the corresponding approximations can be large. For instance, the RSE values of the GRAK and RGRAK methods are only of the order of \mathcal{O}(10^{-1}), and those of the REK and PREK methods are of the order of \mathcal{O}(10^{-2}) or \mathcal{O}(10^{-3}), while those of Algorithm 3 and Algorithm 4 are of the order of \mathcal{O}(10^{-3}) or \mathcal{O}(10^{-4}).

Indeed, we find that the denominator of the stopping criterion (13) can be quite large, which leads to an early termination of the GRAK and RGRAK methods even when {\bf x}_{k} is far from being a good approximation to {\bf x}_{\star}. For example, for the matrix A with m=5000 and n=1000, the value of \left\|{\bf x}_{\star}\right\|_{2}^{2}+\left\|{\bf b}_{\mathcal{R}\left(A\right)}\right\|_{2}^{2} can be over 5\times 10^{6}. Moreover, both (12) and (13) are impractical since {\bf x}_{\star} is unknown in advance, and one has to perform two matrix-vector products with respect to A and A^{H} to evaluate (11); that is, one has to access all the rows and columns of A. Thus, compared with the available stopping criteria, our stopping criterion (14) is effective and more appropriate for Kaczmarz-type methods.
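The difference in cost can be seen from the following MATLAB fragment (a sketch only; the current iterates xk, zk and the iterates xk_old, zk_old stored L steps earlier are assumed to be available from the solver):

% Sketch: cost of the two practical tests.  Evaluating (11) needs two
% matrix-vector products with A and A' per check, i.e., access to all rows
% and columns of A, while (14) only reuses iterates the method already stores.
normAF = norm(A, 'fro');
test11 = norm(A*xk - (b - zk))/(normAF*norm(xk)) <= tol && ...
         norm(A'*zk)/(normAF^2*norm(xk)) <= tol;

xt_new = [zk; xk];                 % current augmented iterate [z_{kL}; x_{kL}]
xt_old = [zk_old; xk_old];         % augmented iterate from L steps earlier
test14 = norm(xt_new - xt_old)/L <= tol;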

Table 2: Section 5.1. Example 2: Numerical results of the algorithms using the proposed stopping criterion (14) with L=400L=400; A=randn(m,n)A=randn(m,n) and tol=104tol=10^{-4}. The sampling ratio is chosen as η=0.01\eta=0.01 in Algorithm 4.
m×nm\times n 5000×10005000\times 1000 5000×15005000\times 1500 5000×20005000\times 2000 5000×25005000\times 2500 5000×30005000\times 3000
REK IT 27680 54360 98480 173520 306360
CPU 9.5 21.3 41.4 73.3 139.1
RSE 4.43e-4 7.04e-4 9.94e-4 1.46e-3 2.28e-3
PREK IT 17600 33360 60520 106200 186120
CPU 4.7 9.9 19.4 36.7 66.5
RSE 2.17e-3 2.87e-3 3.49e-3 4.52e-3 5.84e-3
GRAK IT 11200 23040 43760 81960 146280
CPU 73.9 230.9 455.4 1013.4 1889.4
RSE 1.01e-3 1.63e-3 2.22e-3 3.04e-3 4.13e-3
RGRAK IT 11040 22640 42520 78640 140800
CPU 73.5 184.1 501.6 1123.3 1961.4
RSE 1.02e-3 1.76e-3 2.59e-3 3.61e-3 5.58e-3
Algorithm 3 IT 9600 19200 36040 66440 116600
CPU 57.1 96.1 289.2 592.9 1119.5
RSE 7.77e-4 1.19e-3 1.66e-3 2.23e-3 2.99e-3
Algorithm 4 IT 10120 20320 37720 68920 120760
CPU 14.0 40.6 110.2 223.3 713.5
RSE 6.99e-4 1.10e-3 1.54e-3 2.09e-3 2.88e-3

\bullet In view of the numerical results given in the first example, in the second example, we run the six algorithms using the proposed stopping criterion (7) with L=400L=400 and tol=104tol=10^{-4}. We generate another five test matrices by using the MATLAB built-in function randn(m,n)randn(m,n). The numerical results are given in Table 2, where the best results are in bold.

Some remarks are in order. First, comparing the RSE values of the GRAK, RGRAK, REK and PREK methods in Table 1 and Table 2, we find that the approximate solutions obtained from these four methods are more accurate than before when the proposed criterion (14) is used as the stopping criterion. This means that the proposed stopping criterion is feasible and more effective than the available ones. Second, we observe from Table 1 and Table 2 that REK runs much faster when using (14) as the stopping criterion. This is due to the fact that one has to perform two matrix-vector products with respect to A and A^{H} when evaluating (11), and the cost of matrix-vector products is high when A is large and dense. Third, it is seen from the table that, for this synthetic problem, REK enjoys the highest accuracy and PREK runs the fastest, while Algorithm 3 needs the fewest iterations. This implies that the convergence speed of Algorithm 3 is the highest among the six. Of course, the algorithm with the fewest iterations is not necessarily the fastest; indeed, the frameworks of REK and PREK are different from those of the other four algorithms. Fourth, we see that Algorithm 3 and Algorithm 4 outperform GRAK and RGRAK significantly, both in terms of CPU time and iterations. Specifically, Algorithm 3 needs the fewest iterations and Algorithm 4 uses the least CPU time, because Algorithm 4 only utilizes a small portion of the rows of the matrix \tilde{A} in each iteration.

5.2 Efficiency of the proposed algorithms

In this section, we perform numerical experiments to show the superiority of the proposed algorithms over some state-of-the-art Kaczmarz-type methods for large and inconsistent linear systems. Based on the numerical experiments made in Section 5.1, and for the sake of a fair comparison, we make use of (7) as the stopping criterion for all the algorithms, with L=400. In order to demonstrate the superiority of Algorithm 4 over the GRAK method, we consider the value of

Speed-up=CPUtimeofGRAKCPUtimeofAlgorithm4.{\rm Speed\text{-}up}=\frac{\rm CPU~{}time~{}of~{}GRAK}{\rm CPU~{}time~{}of~{}Algorithm~{}\ref{alg6}}. (15)

There are three numerical experiments in this section.

Table 3: Section 5.2. Example 1: Test matrices for solving large inconsistent linear systems.
Matrix (A)   Size (m×n)   Nnz   Background
abtaha1   14596×209   51307   Combinatorial Problem
abtaha2   37932×331   137228   Combinatorial Problem
sls   1748122×62729   6804304   Least Squares Problem
California   9664×9664   16150   Directed Graph
stat96v5   2307×75779   233921   Linear Programming Problem
GL7d25   2798×21074   81671   Combinatorial Problem

\bullet In the first experiment, we show the numerical behavior of the proposed algorithms on some large sparse inconsistent linear systems. The test matrices are from the University of Florida Sparse Matrix Collection (https://sparse.tamu.edu/). The details of these data matrices are listed in Table 3. In this experiment, the right-hand side {\bf b} corresponding to the matrices abtaha1, abtaha2, California, stat96v5 and GL7d25 is constructed in the same way as in Section 5.1, while for the matrix sls it is generated by the MATLAB built-in function randn. The convergence tolerance is chosen as tol=10^{-4}, and the sampling ratio in Algorithm 4 is set to \eta=0.01. Table 4 lists the numerical results, where the values in bold are the best ones.

Table 4: Section 5.2. Example 1: Numerical results of the six algorithms on some sparse matrices from the University of Florida Sparse Matrix Collection, tol=104tol=10^{-4} and L=400L=400. The sampling ratio is chosen as η=0.01\eta=0.01 in Algorithm 4. Here “/” means that the number of iterations exceeds 1000000 or the CPU time exceeds 24 hours.
Matrix & Size   abtaha1 (14596×209)   abtaha2 (37932×331)   sls (1748122×62729)   California (9664×9664)   stat96v5 (2307×75779)   GL7d25 (2798×21074)
REK IT 49640 64760 / 476680 69440 103840
CPU 21.8 51.6 / 396.7 273.4 152.8
PREK IT 36640 41800 / / 100880 /
CPU 14.1 29.7 / / 175.6 /
GRAK IT 38880 56200 / 300320 18680 73880
CPU 35.1 99.4 / 418.2 119.6 190.9
RGRAK IT 33680 47440 / 398080 18720 93240
CPU 29.5 85.5 / 531.9 118.2 202.9
Algorithm 3 IT 25680 32560 / 223280 11800 54440
CPU 11.9 29.2 / 165.7 33.4 70.1
Algorithm 4 IT 33880 45920 406640 251760 12680 59040
CPU 15.6 45.1 59120.4 161.2 31.8 68.5
Speed-up 2.25 2.20 / 2.59 3.76 2.79

Some remarks are in order. First, it is seen from Table 4 that Algorithm 3 and Algorithm 4 are far superior to the other algorithms in terms of CPU time and number of iterations. In terms of the values of “Speed-up”, we observe from Table 4 that Algorithm 4 is about 2 to 3 times faster than the GRAK method, and the improvements are impressive. Specifically, for the large matrix sls, all the algorithms fail to converge within 24 hours, except for Algorithm 4. Therefore, we benefit from the strategy of simple random sampling, as there is no need to compute the whole residual vector during the iterations. Thus, Algorithm 4 is competitive for solving large and sparse inconsistent linear systems. Notice that the PREK method does not work for the two matrices California and GL7d25, either. Indeed, the PREK method requires the coefficient matrix A to be tall and of full column rank [5, Theorem 3.1], while these two matrices are not of full column rank.

\bullet In the second experiment, we consider the problem of tomographic image reconstruction and run the six algorithms on a 2-D parallel-beam tomography problem. In this experiment, a sparse matrix A\in\mathbb{R}^{m\times n} and an “exact solution” {\bf x}_{\star}\in\mathbb{R}^{n} are generated by using the function paralleltomo(N,\theta,p) in the AIR toolbox [19], with N=60, \theta=0:1:178\degree and p=125. The matrix A is of size 22375\times 3600, and we set the right-hand side {\bf b}=A{\bf x}_{\star}+{\bf r}, where the noise {\bf r}\in\mathbb{R}^{m} is a nonzero vector in the null space of the matrix A^{H}, generated by the MATLAB function null.

In this experiment, we solve the inconsistent linear system A{\bf x}={\bf b}, and all the algorithms are run for 20000 iterations. Consider the signal-to-noise ratio (SNR) [44]

SNR=i=1n𝐱i2i=1n(𝐱i𝐱^i)2,SNR=\frac{\sum_{i=1}^{n}{{\bf x}_{i}^{2}}}{\sum_{i=1}^{n}{\left({\bf x}_{i}-\hat{{\bf x}}_{i}\right)^{2}}}, (16)

where {\bf x} is the original clean signal (i.e., the “exact solution” {\bf x}_{\star}), and \hat{{\bf x}} is the denoised signal (i.e., the computed solution obtained from one of the six algorithms). Notice that the larger the SNR, the better the denoising effect of an algorithm. The original image (Exact), the reconstructed ones, and the values of SNR are shown in Figure 1. It is clearly seen from the figures and the SNR values that Algorithm 3 and Algorithm 4 can remove the noise and restore the true image efficiently, and our algorithms are superior to the other methods in terms of SNR.
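As an illustration, the SNR in (16) and the displayed reconstruction can be obtained as follows (a sketch; x_star denotes the vectorized exact phantom, xhat the computed solution of one algorithm, and N=60):

% Sketch: SNR (16) of a reconstruction and its display as an N-by-N image.
SNR = sum(x_star.^2) / sum((x_star - xhat).^2);
imagesc(reshape(xhat, N, N));  axis image off;  colormap gray;
title(sprintf('SNR = %.2f', SNR));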



Fig. 1: Example 2: The original image (Exact), the reconstructed ones, and the values of SNR obtained from the six algorithms REK, PREK, GRAK, RGRAK, Algorithm 3 and Algorithm 4 (η=0.01)\left(\eta=0.01\right). All the algorithms were run for 20000 iterations.

\bullet Spectral Regression Discriminant Analysis (SRDA) is one of the most popular methods for large-scale discriminant analysis [12]. By using the tool of spectral graph analysis, SRDA casts discriminant analysis into a regression framework which facilitates both efficient computation and the use of regularization techniques. Specifically, SRDA only needs to solve a set of least squares problems and there is no eigenvector computation involved, which can save both the running time and the memory compared with some classical linear discriminant analysis methods.

Let X=[𝐱1,𝐱2,,𝐱n]d×nX=\left[{\bf x}_{1},{\bf x}_{2},\dots,{\bf x}_{n}\right]\in\mathbb{R}^{d\times n} be a set of nn training samples in a dd-dimensional feature space. Assume that the data matrix is partitioned into kk classes as X=[X1,,Xk]X=\left[X_{1},\dots,X_{k}\right], where XjX_{j} is the jj-th set with njn_{j} being the number of samples. Denote by 𝐜j{\bf c}_{j} the centroid vector of XjX_{j}, and by 𝐜{\bf c} the global centroid vector of the training data. Denote by

Y=[𝐲1,𝐲2,,𝐲k]=[𝟏n1𝟎𝟎𝟎𝟏n2𝟎𝟎𝟎𝟏nk]n×k,Y=\left[{\bf y}_{1},{\bf y}_{2},\dots,{\bf y}_{k}\right]=\left[\begin{matrix}{\bf 1}_{n_{1}}&{\bf 0}&\cdots&{\bf 0}\\ {\bf 0}&{\bf 1}_{n_{2}}&\cdots&{\bf 0}\\ \vdots&\vdots&\ddots&\vdots\\ {\bf 0}&{\bf 0}&\cdots&{\bf 1}_{n_{k}}\\ \end{matrix}\right]\in\mathbb{R}^{n\times k},

where {\bf 1}_{n_{i}} and {\bf 1}_{n} are the all-ones vectors of size n_{i} and n, respectively, and I_{n} is the n-by-n identity matrix. Let \hat{Y}=\left(I_{n}-\frac{1}{n}{\bf 1}_{n}{\bf 1}_{n}^{T}\right)Y, and let \bar{Y}\in\mathbb{R}^{n\times\left(k-1\right)} be an orthonormal basis of {\rm span}\{\hat{Y}\}.

Let X^=[XT,𝟏n]T\hat{X}=\left[X^{T},{\bf 1}_{n}\right]^{T}. In essence, the SRDA method resorts to the following least squares problems [12, 39]:

\hat{a}_{i}=\mathop{\arg\min}\limits_{a_{i}\in\mathbb{R}^{d+1}}\left\{\left\|\hat{X}^{T}a_{i}-{\bf y}_{i}\right\|_{2}\right\},\quad i=1,2,\dots,k-1, (17)

where 𝐲i{\bf y}_{i} is the ii-th column vector of Y¯,i=1,2,,k1\bar{Y},~{}i=1,2,\dots,k-1.
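The following MATLAB sketch shows one way to assemble the quantities appearing in (17); the variables X (d-by-n, columns grouped by class), labels and k are assumptions introduced only for illustration.

% Sketch: assembling the SRDA least-squares problems (17).
n = size(X, 2);
Y = zeros(n, k);
for j = 1:n
    Y(j, labels(j)) = 1;                    % class-indicator matrix
end
Yhat = Y - repmat(mean(Y, 1), n, 1);        % centering, (I - (1/n)*1*1')*Y
Ybar = orth(Yhat);                          % orthonormal basis, n-by-(k-1)
Xhat = [X; ones(1, n)];                     % \hat{X} = [X^T, 1_n]^T
y1 = Ybar(:, 1);                            % right-hand side for i = 1
% The problem handed to the Kaczmarz-type methods is then
%   min_a || Xhat' * a - y1 ||_2 ,
% an overdetermined and, in general, inconsistent linear system.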

Without loss of generality, we solve the problem (17) for i=1 by using Algorithm 3 and Algorithm 4, where the sampling ratio is chosen as \eta=0.001 in the latter algorithm. We compare the two proposed methods with the REK method, the PREK method, the GRAK method, and the RGRAK method. In the stopping criterion (7), we use tol=10^{-6} and L=400. We run the algorithms on five dense databases, including the MNIST dataset (http://yann.lecun.com/exdb/mnist/), the ISOLET dataset (http://archive.ics.uci.edu/dataset/54/isolet), the USPS dataset (https://hyper.ai/datasets/16041) [8], the CIFAR-10 dataset (http://www.cs.toronto.edu/~kriz/cifar.html), as well as the YouTube Faces dataset (http://www.cs.tau.ac.il/~wolf/ytfaces/). In all the experiments, we randomly choose n=70\%\cdot N samples as the training set, where N is the total number of samples. The details of the data sets are listed in Table 5, and Table 6 lists the numerical results.

Table 5: Section 5.2. Example 3: Details of the databases.
Datasets Dimensionality (dd) Number of total samples (NN) Background Type
MNIST   784   70000   Handwritten digits recognition   dense
ISOLET   617   7797   Classification   dense
USPS   256   9598   Handwritten digits recognition   dense
CIFAR-10   3072   50000   Object recognition   dense
YouTubeFaceArrange_64×64   4096   370319   Face recognition   dense
Table 6: Section 5.2. Example 3: Numerical results of the algorithms for solving the least squares problem (17) with i=1. Here “/” means that the number of iterations exceeds 1000000 or the CPU time exceeds 24 hours.
Matrix & Size   MNIST (70000×784)   ISOLET (7797×617)   USPS (9598×256)   CIFAR-10 (50000×3072)   YouTubeFaceArrange_64×64 (370319×4096)
REK IT 166800 246440 367000 / /
CPU 237.6 83.3 111.9 / /
PREK IT / 374680 162600 / /
CPU / 101.6 40.6 / /
GRAK IT 19200 124840 188480 465080 /
CPU 419.9 312.5 153.1 30428.1 /
RGRAK IT 17760 88600 98440 130800 /
CPU 350.2 217.9 79.9 14115.9 /
Algorithm 3 IT 17200 82200 72320 130760 /
CPU 1451.5 187.2 46.2 26437.2 /
Algorithm 4 IT 88000 236240 338280 975600 905200
CPU 183.4 72.7 94.6 4393.7 44789.7
Speed-up 2.3 4.3 1.6 6.9 /

From Table 6, we observe that Algorithm 4 is the fastest, while Algorithm 3 needs the fewest iterations in most cases. Specifically, Algorithm 4 is about two to six times faster than GRAK, and the improvement is impressive. For the extremely large-scale data set YouTubeFaceArrange_64×64, all the algorithms fail to converge except for Algorithm 4. Thus, we benefit from the strategy of simple random sampling, as there is no need to calculate the whole residual vector at each iteration. All these results show that Algorithm 4 is advantageous for solving large-scale least squares problems with dense data. Notice that the PREK method does not work for the dataset MNIST; this is because the method may fail to converge if the coefficient matrix is not of full column rank [5].

6 Concluding Remarks

The greedy randomized augmented Kaczmarz method is an effective approach proposed recently for large and sparse inconsistent linear systems. The key idea of this method is to equivalently transform an inconsistent linear system of size m-by-n into a consistent augmented linear system of size (m+n)-by-(m+n), and then apply the GRK method to the augmented linear system. However, in the GRAK method, one has to compute the residual of the augmented linear system and construct two index sets at each iteration. Moreover, it only updates the vector {\bf z}_{k} rather than the approximate solution {\bf x}_{k} whenever m+1\leqslant t_{k}\leqslant m+n, which may slow down the convergence. Thus, the computational overhead of this method is large, especially for big-data problems, and it is desirable to enhance its numerical performance.

The main contributions of this work are two-fold. First, we propose an accelerated greedy randomized augmented Kaczmarz method for large-scale sparse or dense inconsistent linear systems, and apply the strategy of simple random sampling to further reduce the workload. Theoretical analysis shows that, under very weak assumptions, the accelerated greedy randomized augmented Kaczmarz method converges faster than the greedy randomized augmented Kaczmarz method. Second, as far as we know, there have been no practical stopping criteria for randomized Kaczmarz-type methods so far. To fill this gap, we introduce a practical stopping criterion that is applicable to all the randomized Kaczmarz-type methods. Numerical experiments demonstrate the numerical behavior of the proposed algorithms and the effectiveness of the stopping criterion. However, there is a parameter L involved in the proposed stopping criterion. How to pick its optimal value is interesting and deserves further investigation; we believe it is problem- and data-dependent.

References

  • [1] R. Ansorge, Connections between the Cimmino-method and the Kaczmarz-method for the solution of singular and regular systems of equations, Computing, 33 (1984), pp. 367–375.
  • [2] V. Borkar, N. Karamchandani, and S. Mirani, Randomized Kaczmarz for rank aggregation from pairwise comparisons, IEEE Information Theory Workshop (ITW), Cambridge, 2016, pp. 389–393.
  • [3] Z.-Z. Bai and W.-T. Wu, On greedy randomized Kaczmarz method for solving large sparse linear systems, SIAM Journal on Scientific Computing, 40 (2018), pp. A592–A606.
  • [4] Z.-Z. Bai and W.-T. Wu, On greedy randomized augmented Kaczmarz method for solving large sparse inconsistent linear systems, SIAM Journal on Scientific Computing, 43 (2021), pp. A3892–A3911.
  • [5] Z.-Z. Bai and W.-T. Wu, On partially randomized extended Kaczmarz method for solving large sparse overdetermined inconsistent linear systems, Linear Algebra and its Applications, 578 (2019), pp. 225–250.
  • [6] Z.-Z. Bai and W.-T. Wu, On relaxed greedy randomized Kaczmarz methods for solving large sparse linear systems, Applied Mathematics Letters, 83 (2018), pp. 21–26.
  • [7] Z.-Z. Bai, L. Wang, and G.V. Muratova, On relaxed greedy randomized augmented Kaczmarz methods for solving large sparse inconsistent linear systems, East Asian Journal on Applied Mathematics, 11 (2021), pp. 323–332.
  • [8] L. Bai, J. Liang and Y. Zhao, Self-constrained spectral clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, 45 (2023), pp. 5126–5138.
  • [9] J. Chen and Z.-D. Huang, On a fast deterministic block Kaczmarz method for solving large scale linear system, Numerical Algorithms, 89 (2022), pp. 1007–1029.
  • [10] S. Chapra and R. Canale, Numerical Methods for Engineers, 7th ed., McGraw-Hill, New York, 2005.
  • [11] Y. Censor, Row-action methods for huge and sparse systems and their applications, SIAM Review, 23 (1981), pp. 444–466.
  • [12] D. Cai, X.-F. He, and J.-W. Han, SRDA: An Efficient Algorithm for Large-Scale Discriminant Analysis, IEEE Transactions on Knowledge and Data Engineering, 20 (2008), pp. 1–12.
  • [13] K. Du, W. Si, and X. Sun, Randomized extended average block Kaczmarz for solving least squares, SIAM Journal on Scientific Computing, 42 (2020), pp. A3541–A3559.
  • [14] C. Gu and Y. Liu, Variant of greedy randomized Kaczmarz for ridge regression, Applied Numerical Mathematics, 143 (2019), pp. 223–246.
  • [15] R. Gordon, R. Bender, and G.T. Herman, Algebraic reconstruction techniques (ART) for three-dimensional electron microscopy and X-ray photography, Journal of Theoretical Biology, 29 (1970), pp. 471–481.
  • [16] W. Guo, H. Chen, W. Geng, and L. Lei, A modified Kaczmarz algorithm for computerized tomographic image reconstruction, IEEE International Conference on Biomedical Engineering and Informatics, 3 (2009), pp. 1–4.
  • [17] A. Hefny, D. Needell, and A. Ramdas, Rows versus columns: randomized Kaczmarz or Gauss-Seidel for ridge regression, SIAM Journal on Scientific Computing, 39 (2017), pp. S528–S542.
  • [18] G. Herman, Fundamentals of Computerized Tomography: Image Reconstruction from Projections, 2nd. ed., Springer, Dordrecht, 2009.
  • [19] P. C. Hansen and J. S. Jørgensen, AIR Tools II: algebraic iterative reconstruction methods, improved implementation, Numerical Algorithms, 79 (2018), pp. 107–137.
  • [20] X. Jiang, K. Zhang, and J. Yin, Randomized block Kaczmarz methods with k-means clustering for solving large linear systems, Journal of Computational and Applied Mathematics, 403 (2022), Article 113828.
  • [21] Y. Jiang, G. Wu, and L. Jiang, A semi-randomized Kaczmarz method with simple random sampling for large-scale linear systems, Advances in Computational Mathematics, 49 (2023), Article 20.
  • [22] J. Johnson, Probability and Statistics for Computer Scientists, The John Wiley & Sons Press, 1997.
  • [23] S. Kaczmarz, Approximate solution of systems of linear equations, International Journal of Control, 35 (1937), pp. 355–357.
  • [24] S. Lee and H. Kim, Noise properties of reconstructed images in a kilo-voltage on-board imaging system with iterative reconstruction techniques: A phantom study, Physica Medica, 30 (2014), pp. 365–373.
  • [25] K. Li and G. Wu, Randomized approximate class-specific kernel spectral regression analysis for large-scale face verification, Machine Learning, 111 (2022), pp. 2037–2091.
  • [26] R. Li and H. Liu, On global randomized block Kaczmarz method for image reconstruction, Electronic Research Archive, 30 (2022), pp. 1442–1453.
  • [27] J. Liu and S. Wright, An accelerated randomized Kaczmarz algorithm, Mathematics of Computation, 85 (2016), pp. 153–178.
  • [28] J. Loera, J. Haddock, and D. Needell, A sampling Kaczmarz-Motzkin algorithm for linear feasibility, SIAM Journal on Scientific Computing, 39 (2017), pp. S66–S87.
  • [29] A. Ma, D. Needell, and A. Ramdas, Convergence properties of the randomized extended Gauss-Seidel and Kaczmarz methods, SIAM Journal on Matrix Analysis and Applications, 36 (2015), pp. 1590–1604.
  • [30] C.-Q. Miao and W.-T. Wu, On greedy randomized average block Kaczmarz method for solving large linear systems, Journal of Computational and Applied Mathematics, 413 (2022), Article 114372.
  • [31] D. Needell, Randomized Kaczmarz solver for noisy linear systems, Bit Numerical Mathematics, 50 (2010), pp. 395–403.
  • [32] D. Needell and J. Tropp, Paved with good intentions: Analysis of a randomized block Kaczmarz method, Linear Algebra and Its Applications, 441 (2014), pp. 199–221.
  • [33] D. Needell, R. Zhao, and A. Zouzias, Randomized block Kaczmarz method with projection for solving least squares, Linear Algebra and Its Applications, 484 (2015), pp. 322–343.
  • [34] I. Necoara, Faster randomized block Kaczmarz algorithms, SIAM Journal on Matrix Analysis and Applications, 40 (2019), pp. 1425–1452.
  • [35] Y. Niu and B. Zheng, A greedy block Kaczmarz algorithm for solving large-scale linear systems, Applied Mathematics Letters, 104 (2020), Article 106294.
  • [36] C. Popa, Least-squares solution of overdetermined inconsistent linear systems using kaczmarz’s relaxation, International Journal of Computer Mathematics, 55 (1995), pp. 79–89.
  • [37] C. Popa, Extensions of block-projections methods with relaxation parameters to inconsistent and rank-deficient least-squares problems, Bit Numerical Mathematics, 38 (1998), pp. 151–176.
  • [38] R. Ramlau and M. Rosensteiner, An efficient solution to the atmospheric turbulence tomography problem using Kaczmarz iteration, Inverse Problems, 28 (2012), Article 095004.
  • [39] W. Shi and G. Wu, New algorithms for trace-ratio problem with application to high-dimension and large-sample data dimensionality reduction, Machine Learning (2021), pp. 1–28.
  • [40] S. Steinerberger, A weighted randomized Kaczmarz method for solving linear systems, Mathematics of Computation, 90 (2021), pp. 2815–2826.
  • [41] T. Strohmer and R. Vershynin, A randomized Kaczmarz algorithm with exponential convergence, Journal of Fourier Analysis and Applications, 15 (2009), pp. 262–278.
  • [42] G. Thoppe, V. Borkar, and D. Manjunath, A stochastic Kaczmarz algorithm for network tomography, Automatica, 50 (2014), pp. 910–914.
  • [43] X. Yang, A geometric probability randomized Kaczmarz method for large scale linear systems, Applied Numerical Mathematics, 164 (2021), pp. 139–160.
  • [44] J. Yin, N. Li, and N. Zheng, Restarted randomized surrounding methods for solving large linear equations, Applied Mathematics Letters, 133 (2022), Article 108290.
  • [45] J. Zhang, A new greedy Kaczmarz algorithm for the solution of very large linear systems, Applied Mathematics Letters, 91 (2019), pp. 207–212.
  • [46] Y. Zhang and H. Li, Greedy Motzkin-Kaczmarz methods for solving linear systems, Numerical Linear Algebra with Applications, 29 (2022), e2429.
  • [47] Y. Zhang and H. Li, A count sketch maximal weighted residual Kaczmarz method for solving highly overdetermined linear systems, Applied Mathematics and Computation, 410 (2021), Article 126486.
  • [48] A. Zouzias and N.M. Freris, Randomized extended Kaczmarz for solving least-squares, SIAM Journal on Matrix Analysis and Applications, 34 (2013), pp. 773–793.