
A Kaczmarz-Inspired Method for Orthogonalization

Rikhav Shah (supported by NSF CCF-2420130)
UC Berkeley
   Isabel Detherage
UC Berkeley
(November 2024)
Abstract

This paper asks if it is possible to orthogonalize a set of $n$ linearly independent unit vectors in $n$ dimensions by only considering random pairs of vectors at a time. In particular, at each step one accesses a random pair of vectors and can replace them with some different basis for their span. If one could instead deterministically decide which pair of vectors to access, one could run the Gram-Schmidt procedure. In our setting, however, the pair is selected at random, so the answer is less clear.

We provide a positive answer to the question: given a pair of vectors at each iteration, replace one with its component perpendicular to the other, renormalized. This produces a sequence that almost surely converges to an orthonormal set. Quantitatively, we can measure the rate of convergence in terms of the condition number $\kappa(A)$ of the square matrix $A$ formed by taking the $n$ vectors as its columns. When the condition number is initially very large, we prove a rapidly decreasing upper bound on the expected logarithm of the condition number after $t$ iterations. The bound initially decreases by $O(1/n^2)$ each iteration, but the decrease tapers off as the condition number improves. We then show that $O(n^4/\delta\varepsilon^2+n^3\log(\kappa(A))\log\log(\kappa(A)))$ iterations suffice to bring the condition number below $1+\varepsilon$ with probability $1-\delta$.

1 Introduction

We consider the following simple procedure for improving the condition number of a collection of $n$ linearly independent unit vectors, thought of as the columns of a matrix $A$. Sample two columns, and replace one with its component perpendicular to the other, renormalized. The only fixed points of this operation are unitary matrices. We ask the following questions: does this procedure converge to such matrices? If so, at what rate?

We give a positive answer to the first question and a bound on the rate. Our approach is to consider the evolution of the absolute value of the determinant as the orthogonalizations are performed. We show it is non-decreasing for every selection of columns at each iteration, and that its logarithm increases non-trivially in expectation when the columns are selected randomly.

This procedure bears some resemblance to the Kaczmarz method for solving linear systems. If one is solving $A^*x=0$ using Kaczmarz, a column of $A$ is sampled and $x$ is projected to be orthogonal to the sampled column; this is then repeated many times. In our case, $x$ is itself one of the columns of $A$, say $x=a_k$, and one applies one step of Kaczmarz to the system $(A^*x)_j=0$ for $j\neq k$ (and then renormalizes).

Our two results about this procedure concern the initial and asymptotic behavior of the condition number as these pairwise orthogonalization steps are performed. For very badly conditioned matrices, we show the condition number initially decays rapidly at a rate of $\exp(-t/n^2)$, where $t$ is the number of steps performed (see Proposition 2.9 and the ensuing remark). Our main theorem concerns the eventual convergence of the condition number to 1.

Theorem 1.1 (Simplification of Theorem 2.10).

For any initial $n\times n$ matrix $A$, let $A_t$ denote the matrix after $t$ orthogonalization steps. Then for every $\varepsilon,\delta\in(0,1)$, we have

$$t\gtrsim\frac{n^4}{\delta\varepsilon^2}+n^3\log(\kappa(A))\log\log(\kappa(A))\implies\kappa(A_t)\leq 1+\varepsilon\text{ with probability }1-\delta,$$

where $\gtrsim$ hides an absolute constant.

Concurrent and independent work: This project was prompted by the very interesting related works of Stefan Steinerberger, which inspired us to explore variants of Kaczmarz, particularly those which modify the matrix as the solver runs [Ste21a, Ste21b]. He and the present authors independently conceived of the particular variant analyzed in this paper, and he recently announced some progress on this problem [Ste24]. First, he gives conditions on $x$ under which $\left\|Ax\right\|$ increases in expectation. Second, he gives a heuristic argument and numerical evidence that the rate of convergence should be $\kappa(A_t)\sim\exp(-t/n^2)\kappa(A)$, where $A_t$ is the matrix after $t$ orthogonalizations. Our work is independent of his.

1.1 Technical overview

We first precisely define the procedure we analyze. Throughout this paper, let $A=\begin{bmatrix}a_1&\cdots&a_n\end{bmatrix}\in\mathbb{C}^{n\times n}$ be the decomposition of $A$ into columns. We assume the columns are unit length. The operation we define is

$$\begin{split}\textnormal{orth}&:\mathbb{C}^{n\times n}\times\left\{(i,j)\in[n]^2:i\neq j\right\}\to\mathbb{C}^{n\times n}\\ \textnormal{orth}(A,i,j)_k&=\begin{cases}a_k&k\neq i\\ \dfrac{a_i-a_j\left\langle a_i,a_j\right\rangle}{\left\|a_i-a_j\left\langle a_i,a_j\right\rangle\right\|}&k=i.\end{cases}\end{split} \qquad (1)$$

At each timestep $t$, our procedure samples $(i_t,j_t)$ uniformly at random from ordered pairs of distinct indices and updates $A_{t+1}\leftarrow\textnormal{orth}(A_t,i_t,j_t)$.
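For concreteness, here is a minimal NumPy sketch of the procedure for real matrices (the complex case replaces the dot product with a conjugated inner product); the function names are ours and purely illustrative.

```python
import numpy as np

def orth_step(A, i, j):
    # Replace column i with its component orthogonal to column j, renormalized.
    inner = A[:, j] @ A[:, i]                # <a_i, a_j> in the real case
    v = A[:, i] - inner * A[:, j]
    out = A.copy()
    out[:, i] = v / np.linalg.norm(v)        # nonzero by linear independence
    return out

def run(A, steps, seed=0):
    # Apply `steps` pairwise orthogonalizations at uniformly random pairs.
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    for _ in range(steps):
        i, j = rng.choice(n, size=2, replace=False)
        A = orth_step(A, i, j)
    return A
```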

One major difficulty in analyzing the evolution of the condition number $\kappa(A_t)$ is that orth will sometimes increase it. A second difficulty is that rare events producing poorly conditioned matrices can skew the expectation of $\kappa(A_t)$ upward substantially. To overcome these issues, we study the evolution of the non-negative potential function

$$\Phi(A)=-\log\left|\det(A)\right|. \qquad (2)$$

Note that since $A$ has unit-length columns, $\Phi(A)=0$ if and only if $\kappa(A)=1$. We show $\Phi(A_t)$ is monotonically decreasing no matter the selection of $(i_t,j_t)$ (see Lemma 2.1). Over a random selection of $(i_t,j_t)$, it strictly decreases in expectation unless $\Phi(A_t)=0$ (see Lemma 2.3). Another complication is that the amount of decrease depends on the present value of $\Phi(A_t)$. Of course, when $\Phi(A_t)=0$ already, no decrease can be expected. Our bound roughly says that when $\Phi(A_t)\gg n$, one should expect $\Phi(A_{t+1})\leq\Phi(A_t)-O(1/n^2)$, i.e. steady progress is made (see Lemma 2.4). On the other hand, once $\Phi(A_t)$ drops below this $O(n)$ threshold, progress toward 0 slows dramatically. The progress that one operation makes in bringing the potential to zero can be bounded solely in terms of the current value of the potential. This allows us to describe the evolution of $\Phi(A_t)$ as the discretization of a differential equation (see Lemma 2.6).

In solving that differential equation, we show $\Phi(A_t)$ indeed converges to 0 almost surely (see Theorem 2.7). Using the relationship between $\Phi(A)$ and $\kappa(A)$ supplied by Lemma 2.8, one finally arrives at a tail bound for $\kappa(A_t)$ (see Theorem 2.10).

Correction note:

A previous version of this paper set $\Phi(A)=-\sum_i\log\left(\operatorname{dist}(a_i,\operatorname{span}(a_j:j\neq i))\right)$ and stated that each of the $n$ terms in the sum is monotone in applications of orth. This was incorrect; a corrected version comes from replacing the condition “$j\neq i$” with “$j<i$”, which by the base-times-height formula results in $\Phi(A)=-\log\left|\det(A)\right|$. This minorly alters the remaining proofs. The statement of Proposition 2.9 improves slightly, and the statements of Theorems 2.7 and 2.10 remain unchanged.

Remark on algorithmic application:

One could imagine running this procedure at the same time as running a Kaczmarz solver for the linear system $A^*x=b$. By updating the entries of $b$ in the way corresponding to the updates of the columns of $A$, the solution is preserved. The motivation is that the condition number gets smaller over time, so perhaps the increase in the rate of convergence of Kaczmarz justifies the additional computational expense; for a characterization of the rate of convergence of Kaczmarz in terms of the condition number, see [SV09]. We suspect this is not practical, however. In particular, the Kaczmarz algorithm requires only $O(n)$ writable words of memory and performs only dot products, which makes it attractive for huge sparse systems. The procedure we outline requires updating all of $A$, and sparsity is not preserved. Nevertheless, the trajectory of the condition number is interesting as a standalone mathematical question.
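For concreteness, in the real case the co-update is simple: if column $a_i$ is replaced by $(a_i-\langle a_i,a_j\rangle a_j)/r$ with $r=\|a_i-\langle a_i,a_j\rangle a_j\|$, then since $b_k=\langle a_k,x\rangle$, entry $b_i$ must be replaced by $(b_i-\langle a_i,a_j\rangle b_j)/r$. The following sketch (our own illustrative helper, not an optimized solver component) performs both updates together.

```python
import numpy as np

def orth_step_with_rhs(A, b, i, j):
    # One pairwise orthogonalization that also updates b, preserving the
    # solution x of the real system A.T @ x = b.
    inner = A[:, j] @ A[:, i]
    v = A[:, i] - inner * A[:, j]
    r = np.linalg.norm(v)
    A, b = A.copy(), b.copy()
    A[:, i] = v / r
    b[i] = (b[i] - inner * b[j]) / r         # mirrors the column update
    return A, b
```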

2 Results

2.1 Analysis of a single step

We claim that the absolute value of the determinant of $A$ is monotone under applications of orth. In fact, the change in its value depends only on the inner product of the two vectors being orthogonalized.

Lemma 2.1.

For any distinct indices $i,j$ and any matrix $A$ with unit-length columns, one has

$$\left|\det(A')\right|=\frac{\left|\det(A)\right|}{\sqrt{1-\left|\left\langle a_i,a_j\right\rangle\right|^2}},$$

where $A'=\textnormal{orth}(A,i,j)$.

Proof.

By applying an appropriate permutation to the columns of $A$, we may assume without loss of generality that $i=1$ and $j=2$. The absolute value of the determinant of a matrix is the $n$-volume of the parallelepiped generated by its columns. This can be expressed as

$$\left|\det(A)\right|=\prod_{i=1}^{n}\operatorname{dist}(a_i,\operatorname{span}(a_j:j<i)).$$

After the update $\textnormal{orth}(A,1,2)$, the terms for $i>2$ remain unaltered since $\operatorname{span}(a_1,a_2)=\operatorname{span}(a_1',a_2')$. The $i=1$ term is also unchanged, as it remains equal to $1$. After the update, the $i=2$ term is

$$\begin{split}\operatorname{dist}(a_2',\operatorname{span}(a_1'))&=\operatorname{dist}(a_2',\operatorname{span}(a_1))\\&=\operatorname{dist}\left(\frac{a_2-\left\langle a_1,a_2\right\rangle a_1}{\left\|a_2-\left\langle a_1,a_2\right\rangle a_1\right\|},\operatorname{span}(a_1)\right)\\&=\frac{\operatorname{dist}\left(a_2,\operatorname{span}(a_1)\right)}{\left\|a_2-\left\langle a_1,a_2\right\rangle a_1\right\|},\end{split} \qquad (3)$$

which is precisely the $i=2$ term before the update divided by $\left\|a_2-\left\langle a_1,a_2\right\rangle a_1\right\|$. Since the $a_j$ are unit vectors, this quantity equals $\sqrt{1-\left|\left\langle a_1,a_2\right\rangle\right|^2}$, as required. ∎
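As a quick numerical sanity check of Lemma 2.1 (illustrative only, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
A /= np.linalg.norm(A, axis=0)               # unit-length columns

i, j = 0, 1
inner = A[:, j] @ A[:, i]
Ap = A.copy()
v = A[:, i] - inner * A[:, j]
Ap[:, i] = v / np.linalg.norm(v)

lhs = abs(np.linalg.det(Ap))
rhs = abs(np.linalg.det(A)) / np.sqrt(1 - inner**2)
assert np.isclose(lhs, rhs)                  # matches Lemma 2.1
```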

From the above lemma, it is clear that the main driver increasing the determinant is the inner product $\left\langle a_i,a_j\right\rangle$. Our next lemma considers the matrix of these inner products, $A^*A-I$, and relates its average entry to the smallest singular value of $A$.

Lemma 2.2.

Say $A$ has unit-length columns. Then $\left\|A^*A-I\right\|_F^2\geq\frac{n}{n-1}\left(1-\sigma_n(A)^2\right)^2$.

Proof.

Since $A$ has unit-length columns, $\left\|A\right\|_F^2=\sum_j\sigma_j(A)^2=n$. Consider the expression

$$\left\|A^*A-I\right\|_F^2=\sum_j(\sigma_j(A)^2-1)^2=\sum_j\sigma_j(A)^4-2n+n.$$

Thought of as a function of the singular values, this expression is convex. For a given value of $\sigma_n$, the minimum is achieved when the rest of the singular values are equal. Given the constraint that the sum of the squares of the singular values is $n$, this means the expression is minimized for $\sigma_j(A)^2=\frac{n}{n-1}-\frac{\sigma_n(A)^2}{n-1}$ for $j<n$. Plugging this back in gives the desired bound. ∎
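For completeness, the plug-in computation left implicit above reads: with $\sigma_j(A)^2=\frac{n-\sigma_n(A)^2}{n-1}$ for $j<n$,

$$\sum_j(\sigma_j(A)^2-1)^2=(n-1)\left(\frac{n-\sigma_n(A)^2}{n-1}-1\right)^2+(\sigma_n(A)^2-1)^2=\frac{(1-\sigma_n(A)^2)^2}{n-1}+(1-\sigma_n(A)^2)^2=\frac{n}{n-1}\left(1-\sigma_n(A)^2\right)^2.$$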

With these two lemmas in place, we can bound the expected value of $-\log\left|\det(\cdot)\right|$ after one application of orth. Define the iterative map

$$f(x)=x-\frac{(1-\exp(-2x/n))^2}{2(n-1)^2}.$$
Lemma 2.3 (One step estimate).

If one samples $i,j$ uniformly at random and sets $A'=\textnormal{orth}(A,i,j)$, then

$$\Phi(A')\leq\Phi(A)\text{ pointwise}\quad\text{and}\quad\mathbb{E}\,\Phi(A')\leq f(\Phi(A)).$$
Proof.

Lemma 2.1 states

$$\Phi(A')=\Phi(A)+\frac{1}{2}\log\left(1-\left|\left\langle a_i,a_j\right\rangle\right|^2\right).$$

Since $\left|\left\langle a_i,a_j\right\rangle\right|<1$, the first claim follows immediately. By Jensen's inequality, when $i,j$ are picked randomly one has

$$\begin{split}\mathbb{E}\,\Phi(A')&\leq\Phi(A)+\frac{1}{2}\log\left(1-\mathbb{E}\left|\left\langle a_i,a_j\right\rangle\right|^2\right)\\&=\Phi(A)+\frac{1}{2}\log\left(1-\frac{\left\|A^*A-I\right\|_F^2}{n(n-1)}\right).\end{split} \qquad (4)$$

Now apply Lemma 2.2 together with the bound $\log(1-y)\leq-y$:

$$\begin{split}\mathbb{E}\,\Phi(A')&\leq\Phi(A)+\frac{1}{2}\log\left(1-\frac{(1-\sigma_n(A)^2)^2}{(n-1)^2}\right)\\&\leq\Phi(A)-\frac{(1-\sigma_n(A)^2)^2}{2(n-1)^2}.\end{split} \qquad (5)$$

The determinant is the product of the singular values, so

$$\sigma_n(A)\leq(\sigma_n(A)\cdots\sigma_1(A))^{1/n}=\left|\det(A)\right|^{1/n}=\exp(-\Phi(A)/n).$$

So altogether we have

$$\mathbb{E}\,\Phi(A')\leq\Phi(A)-\frac{\left(1-\exp(-2\Phi(A)/n)\right)^2}{2(n-1)^2}=f(\Phi(A)). \qquad (6)$$

∎
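As an illustrative numerical check of Lemma 2.3 (not part of the proof), one can compare the exact average of $\Phi$ over all ordered pairs against $f(\Phi(A))$; the following NumPy sketch uses our own helper names.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n))
A /= np.linalg.norm(A, axis=0)               # unit-length columns

def phi(M):
    return -np.linalg.slogdet(M)[1]          # Phi(M) = -log|det M|

def orth_step(M, i, j):
    out = M.copy()
    v = M[:, i] - (M[:, j] @ M[:, i]) * M[:, j]
    out[:, i] = v / np.linalg.norm(v)
    return out

f = lambda x: x - (1 - np.exp(-2 * x / n)) ** 2 / (2 * (n - 1) ** 2)

# Exact expectation over the uniform ordered pair (i, j), i != j.
avg = np.mean([phi(orth_step(A, i, j)) for i in range(n) for j in range(n) if i != j])
assert avg <= f(phi(A))
```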

2.2 Markov chain supermartingale

Because applying $f$ does not commute with taking expectations, obtaining a bound for several applications of orth by repeatedly applying Lemma 2.3 is not immediate. We initially define the stochastic process $\{\phi_t\}$ by

$$A_0=A\quad\text{and}\quad A_{t+1}=\textnormal{orth}(A_t,i_t,j_t)\quad\text{and}\quad\phi_t=\Phi(A_t), \qquad (7)$$

where $(i_t,j_t)$ are indices selected uniformly at random. The sequence $\{A_t\}$ is a Markov chain, but $\{\phi_t\}$ is not. Both parts of Lemma 2.3 independently imply that $\{\phi_t\}$ is a supermartingale. In particular, convergence to 0 in expectation is enough to guarantee almost sure convergence to 0 as well. We can modify (7) so that $\{\phi_t\}$ becomes a time-independent Markov chain. Given $\phi_t$, let an adversary pick any matrix $B_t$ satisfying $\Phi(B_t)=\phi_t$ and set

$$\phi_{t+1}=\Phi(\textnormal{orth}(B_t,i_t,j_t)). \qquad (8)$$

In this section, we exploit both implications of Lemma 2.3 for the law of $\phi_t$:

$$\mathbb{E}(\phi_{t+1}\,|\,\phi_t)\leq f(\phi_t)\quad\text{and}\quad\phi_{t+1}\leq\phi_t\text{ pointwise}.$$

Note that $f$ has an inflection point at $\frac{\log 2}{2}\cdot n$ and is concave to the left of it. We define the corresponding stopping time

$$t^*=\inf\left\{t:\phi_t<\frac{\log 2}{2}\cdot n\right\}.$$

Our control over the trajectory of $\mathbb{E}(\phi_t)$ differs before and after $t^*$. Our control before $t^*$ will mostly be used to estimate how long it takes to reach $t^*$. We combine this with control over the trajectory after $t^*$ to produce a final convergence rate.

Lemma 2.4 (Before $t^*$).

$$\mathbb{E}(\phi_t)\leq\phi_0-\frac{\sum_{j=1}^{t}\Pr\left(t^*\geq j\right)}{8(n-1)^2}.$$
Proof.

Note that for $x\geq\frac{\log 2}{2}\cdot n$, one has

$$f(x)\leq x-\frac{1}{8(n-1)^2}.$$

Use the approximation $f(x)\leq x$ for $x<\frac{\log 2}{2}\cdot n$. This gives

$$\begin{split}\mathbb{E}(\phi_{j+1})&\leq\mathbb{E}(\phi_j)-\frac{\Pr\left(\phi_j\geq\frac{\log 2}{2}\cdot n\right)}{8(n-1)^2}\\&=\mathbb{E}(\phi_j)-\frac{\Pr\left(t^*>j\right)}{8(n-1)^2}.\end{split} \qquad (9)$$

By iterating this argument and reindexing the sum (note $\Pr(t^*>j)=\Pr(t^*\geq j+1)$), we obtain

$$\mathbb{E}(\phi_t)\leq\phi_0-\frac{\sum_{j=0}^{t-1}\Pr\left(t^*>j\right)}{8(n-1)^2}=\phi_0-\frac{\sum_{j=1}^{t}\Pr\left(t^*\geq j\right)}{8(n-1)^2}.$$

∎

This is used to control how large $t^*$ can be.

Lemma 2.5 (Exponential concentration of $t^*$).

For any $c\in\mathbb{Z}_+$,

$$\Pr(t^*>c\cdot 16(n-1)^2\phi_0)\leq 2^{-c}.$$
Proof.

Apply Lemma 2.4 in the limit as $t\to\infty$. Then by the tail-sum formula for expectation, $\mathbb{E}(t^*)=\sum_{j\geq 1}\Pr(t^*\geq j)$, so

$$0\leq\lim_{t\to\infty}\mathbb{E}(\phi_t)\leq\phi_0-\frac{\mathbb{E}(t^*)}{8(n-1)^2}\implies\mathbb{E}(t^*)\leq 8(n-1)^2\phi_0.$$

Now note that $t^*$ is implicitly a function of the initial value $\phi_0$. In fact, conditioned on $t^*>k$, the distribution of $t^*$ depends only on the value of $\phi_k$. So we have the more refined fact that for each $k$,

$$\begin{split}\mathbb{E}(t^*(\phi_0)\,|\,t^*(\phi_0)>k)&=\mathbb{E}\left(\mathbb{E}(k+t^*(\phi_k)\,|\,t^*(\phi_0)>k,\phi_k)\right)\\&\leq k+\sup_{x\in[\frac{\log 2}{2}\cdot n,\,\phi_0]}\mathbb{E}(t^*(x))\\&\leq k+8(n-1)^2\phi_0.\end{split} \qquad (10)$$

Set $\mu=16(n-1)^2\phi_0$ and consider the case $k=c\mu$. Then (10) implies the following tail bound by Markov's inequality:

$$\begin{split}\Pr\left(t^*>k+\mu\,|\,t^*>k\right)&=\Pr\left(t^*-k>\mu\,|\,t^*>k\right)\\&\leq\frac{\mathbb{E}(t^*-k\,|\,t^*>k)}{\mu}\\&\leq\frac{8(n-1)^2\phi_0}{\mu}\\&=1/2.\end{split} \qquad (11)$$

Then by iterating this argument, we have

$$\Pr(t^*>c\mu)=\Pr(t^*>0)\prod_{j=0}^{c-1}\Pr\left(t^*>j\mu+\mu\,|\,t^*>j\mu\right)\leq 2^{-c}.$$

∎

After the stopping time, we can exploit the concavity of $f$ and apply a tighter estimate to obtain a better bound.

Lemma 2.6 (After $t^*$).

Set $C_n=2(\log 2)^2n^2(n-1)^2$. For any $t\geq k$,

$$\mathbb{E}(\phi_t\,|\,t^*=k,\phi_k)\leq\frac{C_n}{C_n+\phi_k\cdot(t-k)}\cdot\phi_k.$$
Proof.

Since $\{\phi_t\}$ is a time-independent Markov chain, we need only consider the case $k=0$ and replace $t$ with $t-k$ at the end. For $x\leq\frac{\log 2}{2}\cdot n$, observe that $f$ is concave. Since $\{\phi_t\}$ is monotone, the process deterministically stays in the concave region of the domain. So we may apply Jensen's inequality to obtain

$$\mathbb{E}(\phi_{j+1})\leq\mathbb{E}(f(\phi_j))\leq f(\mathbb{E}(\phi_j)).$$

Consider the sequence $x(j)=\mathbb{E}(\phi_j)$. Note that we may view the update rule

$$x(j+1)=f(x(j))=x(j)-\frac{(1-\exp(-2x(j)/n))^2}{2(n-1)^2}$$

as running Euler's method with a step size of $1$ on the differential equation

$$\frac{\textnormal{d}}{\textnormal{d}t}x(t)=-\frac{(1-\exp(-2x(t)/n))^2}{2(n-1)^2}.$$

By monotonicity of the right-hand side in $x$, the Euler iterates lie below the true solution of the differential equation, so the true solution is an upper bound. We can obtain a looser, but easier to solve for, upper bound by using

$$(1-\exp(-2x/n))^2\geq\frac{x^2}{(\log 2)^2n^2}$$

for $x\leq\frac{\log 2}{2}\cdot n$, which holds by concavity of $x\mapsto 1-\exp(-2x/n)$ (the chord from $0$ to $\frac{\log 2}{2}\cdot n$ lies below the curve); shrinking the magnitude of the drift in this way only increases the solution. Then we have a first-order differential equation with quadratic right-hand side, for which the explicit solution is

$$x(t)=\left(\frac{1}{\phi_0}+\frac{t}{2(\log 2)^2n^2(n-1)^2}\right)^{-1}.$$

After rearranging, this is exactly the bound appearing in the lemma statement. ∎
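For the reader's convenience, the explicit solution follows by separation of variables: writing $C_n=2(\log 2)^2n^2(n-1)^2$, the comparison equation is $\dot{x}=-x^2/C_n$, and

$$\frac{\textnormal{d}}{\textnormal{d}t}\left(\frac{1}{x(t)}\right)=-\frac{\dot{x}(t)}{x(t)^2}=\frac{1}{C_n}\implies\frac{1}{x(t)}=\frac{1}{\phi_0}+\frac{t}{C_n},$$

which inverts to the formula above.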

Remark 1.

The bound in Lemma 2.6 simplifies slightly differently depending on whether the initial state $\phi_0$ is above or below the threshold $\frac{\log 2}{2}\cdot n$, i.e. whether $t^*=0$ or not. When $t^*=0$, we simply have the unconditional expectation

$$\mathbb{E}(\phi_t)\leq\frac{C_n}{C_n+\phi_0\cdot t}\cdot\phi_0.$$

When $t^*>0$, we do not incur too much loss by using $\phi_k=\phi_{t^*}\leq\frac{\log 2}{2}\cdot n$, which holds by definition of $t^*$. This implies

$$\mathbb{E}(\phi_t\,|\,t^*=k,\phi_k)\leq\frac{C_n}{C_n+\frac{\log 2}{2}\cdot n\cdot(t-k)}\cdot\frac{\log 2}{2}\cdot n\leq\frac{n^4}{\frac{2}{\log 2}n^3+t-k}.$$

Our only remaining task is to bound the unconditional expectation $\mathbb{E}(\phi_t)$ when $\phi_0\geq\frac{\log 2}{2}\cdot n$. We do so using the exponential concentration of $t^*$.

Theorem 2.7 (Convergence of $\phi_t$).

Set $C_n=2(\log 2)^2n^2(n-1)^2\leq n^4$ as in Lemma 2.6. Suppose $\phi_0<\frac{\log 2}{2}\cdot n$. Then

$$\mathbb{E}(\phi_t)\leq\frac{C_n}{C_n+\phi_0\cdot t}\cdot\phi_0.$$

Suppose instead $\phi_0\geq\frac{\log 2}{2}\cdot n$. Then

$$\mathbb{E}(\phi_t)\leq\frac{n^4}{\frac{2}{\log 2}n^3+\frac{1}{2}\cdot t}+\left(\phi_0-\frac{n^4}{\frac{2}{\log 2}n^3+\frac{1}{2}\cdot t}\right)\cdot\exp\left(-\frac{t}{47n^2\phi_0}\right). \qquad (12)$$

In particular, by taking $t\to\infty$, we see that $\mathbb{E}(\phi_t)\to 0$ in both cases; since $\{\phi_t\}$ is a non-negative supermartingale, this gives $\phi_t\to 0$ almost surely as well.

Proof.

The first result is the simplification of Lemma 2.6 described in Remark 1. For the second, set $\mu=16(n-1)^2\phi_0$ and $c=k/\mu$ with $k=t/2$. By the law of total expectation, $\mathbb{E}(\phi_t)$ is a convex combination of the conditional expectations

$$\mathbb{E}(\phi_t\,|\,t^*\leq c\mu)\quad\text{and}\quad\mathbb{E}(\phi_t\,|\,t^*>c\mu).$$

The first term is bounded using Lemma 2.6 (in the form of Remark 1) by

$$\sup_{k\in[0,c\mu]}\mathbb{E}(\phi_t\,|\,t^*=k)\leq\sup_{k\in[0,c\mu]}\frac{n^4}{\frac{2}{\log 2}n^3+t-k}\leq\frac{n^4}{\frac{2}{\log 2}n^3+t-c\mu}.$$

The second term is bounded by $\phi_0$ by pointwise monotonicity. The weights of the convex combination are

$$\Pr(t^*\leq c\mu)\quad\text{and}\quad\Pr(t^*>c\mu),$$

the latter of which is bounded by $2^{-c}$ using Lemma 2.5. Since the second term is at least as large as the first, again by pointwise monotonicity, we may take $\Pr(t^*>c\mu)=2^{-c}$ for an upper bound. This establishes

$$\begin{split}\mathbb{E}(\phi_t)&\leq\frac{n^4}{\frac{2}{\log 2}n^3+t-k}\left(1-2^{-k/\mu}\right)+\phi_0\cdot 2^{-k/\mu}\\&=\frac{n^4}{\frac{2}{\log 2}n^3+t-k}+\left(\phi_0-\frac{n^4}{\frac{2}{\log 2}n^3+t-k}\right)\cdot 2^{-k/\mu}.\end{split} \qquad (13)$$

Recall that we took $k=t/2$. Use the numerical approximation $32/\log 2\leq 47$ for the final result. ∎

Remark 2 (Two rates of decay).

The right-hand side of (12) can be viewed as an exponential decay from the initial value $\phi_0$ toward the curve $\sim n^4/(n^3+t)$. Unfortunately, the constant $(47n^2\phi_0)^{-1}$ in the exponent is so small that the bound does not exhibit exponential decay in a meaningful sense. Truly exponential decay would look like $\phi_0\cdot\exp(-ct)$ for some value $c$ not depending on $\phi_0$. In our case, however, on the order of $n^2\phi_0$ iterations are needed to bring the second term under any kind of control. Specifically, for $t\ll n^2\phi_0$, the exponential is well approximated by its first-order Taylor series, making the expression roughly

$$\mathbb{E}(\phi_t)\leq\phi_0-O(t/n^2).$$

On the other hand, after order $n^2\phi_0\log(\phi_0/n)$ iterations, the first term of (12) dominates, so the bound simplifies to just

$$\mathbb{E}(\phi_t)=O(n^4/t).$$
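The two regimes are easy to observe empirically. The following simulation sketch (the badly conditioned starting matrix and all names are our own illustration) tracks $\Phi(A_t)$ via a log-determinant and $\kappa(A_t)$ along one trajectory:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
# Badly conditioned start: prescribed singular values, then unit-length columns.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(np.logspace(-4, 0, n)) @ V
A /= np.linalg.norm(A, axis=0)

history = []
for t in range(200_000):
    i, j = rng.choice(n, size=2, replace=False)
    inner = A[:, j] @ A[:, i]
    v = A[:, i] - inner * A[:, j]
    A[:, i] = v / np.linalg.norm(v)
    if t % 1000 == 0:
        phi = -np.linalg.slogdet(A)[1]       # Phi(A_t)
        history.append((t, phi, np.linalg.cond(A)))
# Early entries show a roughly linear decrease of phi in t (rate ~ 1/n^2);
# late entries decay like ~ n^4 / t, as in Remark 2.
```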

2.3 Relationship to the condition number

Theorem 2.7 shows asymptotic convergence of $\Phi(A_t)=-\log\left|\det(A_t)\right|$ to 0, and the only matrices with unit-length columns and $\left|\det(\cdot)\right|=1$ are unitary. We can measure convergence to a unitary matrix in terms of the convergence of the condition number $\kappa(A_t)=\sigma_1(A_t)/\sigma_n(A_t)$ to 1. Indeed, for a matrix $A$ with unit-length columns, $\kappa(A)=1$ if and only if $A$ is unitary. This section contains two main results (Proposition 2.9 and Theorem 2.10), corresponding to the two rates of decay described in Remark 2. When $\kappa(A)$ is initially large, we show that $\kappa(A_t)$ decays rapidly for $t$ up to some point. Then we show that for $t$ sufficiently large, $\kappa(A_t)$ is very close to 1.

Lemma 2.8 (Bounds on $\kappa$ in terms of $\Phi$).

For any $n\times n$ matrix $A$ with unit-length columns,

$$\kappa(A)\geq\exp\left(\Phi(A)/n\right)\quad\text{and}\quad\kappa(A)\leq\sqrt{en}\cdot\exp\left(\Phi(A)\right)\quad\text{and}\quad\kappa(A)\leq\frac{1+\sqrt{2\Phi(A)}}{1-\sqrt{2\Phi(A)}},$$

where the third bound holds only when its denominator is positive.

Proof.

The first bound is an immediate consequence of $\sigma_1(A)\geq\max_j\left\|Ae_j\right\|=1$ and $\sigma_n(A)\leq\left|\det(A)\right|^{1/n}$. For the second bound, first note

$$\sigma_1(A)\leq\left\|A\right\|_F=\sqrt{n}.$$

Then, the largest value of $\sigma_1(A)\cdots\sigma_{n-1}(A)$ subject to the constraint $\sigma_1(A)^2+\cdots+\sigma_{n-1}(A)^2\leq n$ is achieved for $\sigma_j(A)^2=\frac{n}{n-1}$, so

$$\sigma_1(A)^2\cdots\sigma_{n-1}(A)^2\leq\left(\frac{n}{n-1}\right)^{n-1}\leq e\implies\left|\det(A)\right|\leq\sqrt{e}\,\sigma_n(A).$$

Now for the third bound. By considering the Gram-Schmidt basis, one may assume without loss of generality that $A$ is upper triangular with positive real diagonal entries. Let $\lambda_1,\ldots,\lambda_n$ be its eigenvalues, i.e. its diagonal entries. Let $r_j$ be the portion of the $j$th column strictly above the diagonal. Then

$$\left\|A-I\right\|_F^2=\sum_j\left((\lambda_j-1)^2+\left\|r_j\right\|^2\right)=\sum_j\left(1-2\lambda_j+\lambda_j^2+\left\|r_j\right\|^2\right). \qquad (14)$$

Note $\lambda_j^2+\left\|r_j\right\|^2=1$ since the columns are unit length. Thus

$$\left\|A-I\right\|_F^2=2\sum_j\left(1-\lambda_j\right)\leq 2\sum_j\log\left(1/\lambda_j\right)=2\Phi(A). \qquad (15)$$

Finally, note $\left\|A-I\right\|_F\geq\left\|A-I\right\|\geq\max(\sigma_1(A)-1,\,1-\sigma_n(A))$. ∎
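As an illustrative numerical check of the three bounds (not part of the argument):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
A = rng.standard_normal((n, n))
A /= np.linalg.norm(A, axis=0)               # unit-length columns

phi = -np.linalg.slogdet(A)[1]               # Phi(A) >= 0 by Hadamard's inequality
kappa = np.linalg.cond(A)

assert kappa >= np.exp(phi / n)                          # first bound
assert kappa <= np.sqrt(np.e * n) * np.exp(phi)          # second bound
if np.sqrt(2 * phi) < 1:                                 # third bound needs a positive denominator
    assert kappa <= (1 + np.sqrt(2 * phi)) / (1 - np.sqrt(2 * phi))
```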

Using Lemma 2.8 to convert the bounds on $\Phi(A_t)$ from Theorem 2.7 into bounds on $\kappa(A_t)$ results in the following final theorems.

Proposition 2.9 (Initial rate of decay).

Assume the initial matrix $A$ satisfies $\Phi(A)\geq\frac{\log 2}{2}\cdot n$. Then for all $t\leq n^2\cdot\Phi(A)$,

$$\mathbb{E}\log\kappa(A_t)\leq 0.5\log(en)+\Phi(A)-\frac{1}{96}\left(1-\frac{\frac{\log 2}{2}\cdot n}{\Phi(A)}\right)\cdot\frac{t}{n^2}. \qquad (16)$$
Proof.

The second bound of Lemma 2.8 gives

$$\log\kappa(A_t)\leq 0.5\log(en)+\Phi(A_t),$$

so combined with Theorem 2.7, and using $\exp(-z)\leq 1-z/2$ for $z\in[0,1]$ (applicable since $t\leq n^2\Phi(A)$) together with $\frac{n^4}{\frac{2}{\log 2}n^3+\frac{1}{2}t}\leq\frac{\log 2}{2}\cdot n$, this gives

$$\begin{split}\mathbb{E}\log\kappa(A_t)&\leq 0.5\log(en)+\mathbb{E}\,\Phi(A_t)\\&\leq 0.5\log(en)+\frac{n^4}{\frac{2}{\log 2}n^3+\frac{1}{2}t}+\left(\Phi(A)-\frac{n^4}{\frac{2}{\log 2}n^3+\frac{1}{2}t}\right)\cdot\exp\left(-\frac{t}{47n^2\Phi(A)}\right)\\&\leq 0.5\log(en)+\frac{\log 2}{2}\cdot n+\left(\Phi(A)-\frac{\log 2}{2}\cdot n\right)\cdot\left(1-\frac{1}{2}\cdot\frac{t}{47n^2\Phi(A)}\right)\\&=0.5\log(en)+\Phi(A)-\frac{1}{2}\left(1-\frac{\frac{\log 2}{2}\cdot n}{\Phi(A)}\right)\cdot\frac{t}{47n^2}.\end{split} \qquad (17)$$

∎

Remark 3.

If one plugs the inequality $\Phi(A)\leq n\log\kappa(A)$ into the right-hand side of (16), the result is fairly weak as a bound on $\kappa(A_t)$. We include Proposition 2.9 only to highlight the $O(t/n^2)$ decrease. If we omit the first step of the proof, the bound is as stated in Remark 2,

$$\mathbb{E}\,\Phi(A_t)\leq\Phi(A)-O(t/n^2)$$

for $\Phi(A)\geq c\cdot\frac{\log 2}{2}\cdot n$ with fixed $c>1$. This shows a strict decrease of $\Phi(A_t)$ in expectation on the order of $t/n^2$ for badly conditioned initial matrices. Unfortunately, when passing from $\Phi$ to $\log\kappa$ we lose a multiplicative factor of $n$ and an additive factor of $\log(n)$ in Lemma 2.8. Nevertheless, the dependence on $t$ remains the same. This bound agrees with the heuristic argument and numerical evidence provided in [Ste24] that the decay of $\kappa(A_t)$ should go like $\exp(-t/n^2)$.

Theorem 2.10 (Convergence of $\kappa(A_t)\to 1$).

Fix any initial matrix $A$ and parameters $\varepsilon,\delta\in(0,0.01)$. Then for

$$t\geq 200n^2\Phi(A)\log(7\Phi(A)/n)\quad\text{and}\quad t\geq\frac{48n^4}{\delta\varepsilon^2},$$

one has

$$\Pr\left(\kappa(A_t)\geq 1+\varepsilon\right)\leq\delta.$$

In particular, one can take $\varepsilon,\delta\to 0$ as $t\to\infty$.

Proof.

Set $x=\varepsilon^2/16$, so that

$$1+\varepsilon\geq\frac{1+\sqrt{2x}}{1-\sqrt{2x}};$$

indeed, $\sqrt{2x}=\varepsilon/(2\sqrt{2})\leq\varepsilon/(2+\varepsilon)$ for $\varepsilon\leq 0.01$, which rearranges to the displayed inequality. This fact, along with the third bound of Lemma 2.8 and Markov's inequality, gives

$$\Pr\left(\kappa(A_t)\geq 1+\varepsilon\right)\leq\Pr\left(\kappa(A_t)\geq\frac{1+\sqrt{2x}}{1-\sqrt{2x}}\right)\leq\Pr\left(\Phi(A_t)\geq x\right)\leq\frac{\mathbb{E}\,\Phi(A_t)}{x}. \qquad (18)$$

Now apply Theorem 2.7 to the numerator and simplify the bound using $z\exp(-z)\leq\exp(-z/2)$:

$$\begin{split}\Pr\left(\kappa(A_t)\geq 1+\varepsilon\right)&\leq\frac{\dfrac{n^4}{\frac{2}{\log 2}n^3+\frac{1}{2}t}+\left(\Phi(A)-\dfrac{n^4}{\frac{2}{\log 2}n^3+\frac{1}{2}t}\right)\exp\left(-\dfrac{t}{47n^2\Phi(A)}\right)}{x}\\&\leq\frac{n^4}{tx}\left(\frac{1}{\frac{2}{\log 2}\frac{n^3}{t}+\frac{1}{2}}+\Phi(A)\cdot\frac{t}{n^4}\cdot\exp\left(-\frac{t}{47n^2\Phi(A)}\right)\right)\\&\leq\frac{n^4}{tx}\left(2+\frac{47\Phi(A)^2}{n^2}\cdot\frac{t}{47n^2\Phi(A)}\cdot\exp\left(-\frac{t}{47n^2\Phi(A)}\right)\right)\\&\leq\frac{n^4}{tx}\left(2+\frac{47\Phi(A)^2}{n^2}\cdot\exp\left(-\frac{t}{94n^2\Phi(A)}\right)\right).\end{split} \qquad (19)$$

When $t\geq 192n^2\Phi(A)\log(7\Phi(A)/n)$, the bound further simplifies to $\frac{3n^4}{tx}$. When $t\geq\frac{3n^4}{\delta x}=\frac{48n^4}{\delta\varepsilon^2}$, the bound is simply $\delta$, as required. ∎

Final thoughts and future directions: We note that we do not produce lower bounds. In our estimation, the primary weakness is in Lemma 2.2 (or in its application). In particular, it seems one could obtain a better bound by using the entire distribution of the singular values, rather than just the smallest or their geometric mean, as we do here.

Another idea is to sample pairs of columns from a distribution other than uniform. For example, in light of Lemma 2.1, one could pick them with probability proportional to $\left|\left\langle a_i,a_j\right\rangle\right|^2$, or even greedily maximize $\left|\left\langle a_i,a_j\right\rangle\right|$.

Acknowledgments: We would like to thank Stefan Steinerberger for his inventive works on the Kaczmarz algorithm and engaging discussions which inspired this project.

References

  • [Ste21a] Stefan Steinerberger. Randomized Kaczmarz converges along small singular vectors. SIAM J. Matrix Anal. Appl., 42:608–615, 2021.
  • [Ste21b] Stefan Steinerberger. Surrounding the solution of a linear system of equations from all sides. Quarterly of Applied Mathematics, 79(3):419–429, 2021.
  • [Ste24] Stefan Steinerberger. Kaczmarz Kac walk. arXiv preprint arXiv:2411.06614, 2024.
  • [SV09] Thomas Strohmer and Roman Vershynin. A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl., 15:262–278, 2009.