
A Kaczmarz-Inspired Method for Orthogonalization

Rikhav Shah (supported by NSF CCF-2420130)
UC Berkeley
   Isabel Detherage
UC Berkeley
(November 2024)
Abstract

This paper asks if it is possible to orthogonalize a set of $n$ linearly independent unit vectors in $n$ dimensions by only considering random pairs of vectors at a time. In particular, at each step one accesses a random pair of vectors and can replace them with some different basis for their span. If one could instead deterministically decide which pair of vectors to access, one could run the Gram-Schmidt procedure. In our setting, however, the pair is selected at random, so the answer is less clear.

We provide a positive answer to the question: given a pair of vectors at each iteration, replace one with its component perpendicular to the other, renormalized. This produces a sequence that almost surely converges to an orthonormal set. Quantitatively, we can measure the rate of convergence in terms of the condition number $\kappa(A)$ of the square matrix $A$ formed by taking the $n$ vectors as its columns. When the condition number is initially very large, we prove a rapidly decreasing upper bound on the expected logarithm of the condition number after $t$ iterations. The bound initially decreases by $O(1/n^2)$ each iteration, but the decrease tapers off as the condition number improves. We then show that $O(n^4/\delta\varepsilon^2+n^3\log(\kappa(A))\log\log(\kappa(A)))$ iterations suffice to bring the condition number below $1+\varepsilon$ with probability $1-\delta$.

1 Introduction

We consider the following simple procedure for improving the condition number of a collection of $n$ linearly independent unit vectors, thought of as the columns of a matrix $A$. Sample two columns, and replace one with its component perpendicular to the other, renormalized. The only fixed points of this operation are unitary matrices. We ask the following questions: does this procedure converge to such matrices? If so, at what rate?

We give a positive answer to the first question and a bound on the rate. Our approach is to consider the evolution of the absolute value of the determinant as the orthogonalizations are performed. We show it is non-decreasing for every selection of columns at each iteration, and that its logarithm increases non-trivially in expectation when the columns are selected randomly.

This procedure bears some resemblance to the Kaczmarz method for solving linear systems. If one is solving $A^*x=0$ using Kaczmarz, a column of $A$ is sampled and $x$ is projected to be orthogonal to the sampled column; this is then repeated many times. In our case, $x$ is itself one of the columns of $A$, say $x=a_k$, and one applies one step of Kaczmarz to the system $(A^*x)_j=0$ for $j\neq k$ (and then renormalizes).

Our two results about this procedure concern the initial and asymptotic behavior of the condition number as these pairwise orthogonalization steps are performed. For very badly conditioned matrices, we show the condition number initially decays rapidly at a rate of $\exp(-t/n^2)$, where $t$ is the number of steps performed (see Proposition 2.9 and the ensuing remark). Our main theorem concerns the eventual convergence of the condition number to 1.

Theorem 1.1 (Simplification of Theorem 2.10).

For any initial $n\times n$ matrix $A$, let $A_t$ denote the matrix after $t$ orthogonalization steps. Then for every $\varepsilon,\delta\in(0,1)$, we have

$$t\gtrsim\frac{n^4}{\delta\varepsilon^2}+n^3\log(\kappa(A))\log\log(\kappa(A))\implies\kappa(A_t)\leq 1+\varepsilon\text{ with probability }1-\delta,$$

where $\gtrsim$ hides an absolute constant.

Concurrent and independent work: This project was prompted by the very interesting related works of Stefan Steinerberger, which inspired us to explore variants of Kaczmarz, particularly those which modify the matrix as the solver runs [Ste21a, Ste21b]. He and the present authors independently conceived of the particular variant analyzed in this paper, and he recently announced some progress on this problem [Ste24]. First, he gives conditions on $x$ under which $\left\|Ax\right\|$ increases in expectation. Second, he gives a heuristic argument and numerical evidence that the rate of convergence should be $\kappa(A_t)\sim\exp(-t/n^2)\kappa(A)$, where $A_t$ is the matrix after $t$ orthogonalizations. Our work is independent of his.

1.1 Technical overview

We first precisely define the procedure we analyze. Throughout this paper, let $A=\begin{bmatrix}a_1&\cdots&a_n\end{bmatrix}\in\mathbb{C}^{n\times n}$ be the decomposition of $A$ into columns. We assume the columns are unit length. The operation we define is

$$\begin{split}\textnormal{orth}&:\mathbb{C}^{n\times n}\times\left\{(i,j)\in[n]^2:i\neq j\right\}\to\mathbb{C}^{n\times n}\\ \textnormal{orth}(A,i,j)_k&=\begin{cases}a_k&k\neq i\\ \dfrac{a_i-a_j\left\langle a_i,a_j\right\rangle}{\left\|a_i-a_j\left\langle a_i,a_j\right\rangle\right\|}&k=i.\end{cases}\end{split} \qquad (1)$$

At each timestep $t$, our procedure samples $(i_t,j_t)$ uniformly at random from ordered pairs of distinct indices and updates $A_{t+1}\leftarrow\textnormal{orth}(A_t,i_t,j_t)$.
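For concreteness, here is a minimal NumPy sketch of the procedure for real matrices (the complex case replaces the dot product with a conjugated inner product); the function names are ours and purely illustrative.

```python
import numpy as np

def orth_step(A, i, j):
    # Replace column i with its component orthogonal to column j, renormalized.
    inner = A[:, j] @ A[:, i]                # <a_i, a_j> in the real case
    v = A[:, i] - inner * A[:, j]
    out = A.copy()
    out[:, i] = v / np.linalg.norm(v)        # nonzero by linear independence
    return out

def run(A, steps, seed=0):
    # Apply `steps` pairwise orthogonalizations at uniformly random pairs.
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    for _ in range(steps):
        i, j = rng.choice(n, size=2, replace=False)
        A = orth_step(A, i, j)
    return A
```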

One major difficulty in analyzing the evolution of the condition number $\kappa(A_t)$ is that orth will sometimes increase it. A second difficulty is that rare events producing poorly conditioned matrices can skew the expectation of $\kappa(A_t)$ upward substantially. To overcome these issues, we study the evolution of the non-negative potential function

$$\Phi(A)=-\log\left|\det(A)\right|. \qquad (2)$$

Note that since $A$ has unit-length columns, $\Phi(A)=0$ if and only if $\kappa(A)=1$. We show $\Phi(A_t)$ is monotonically decreasing no matter the selection of $(i_t,j_t)$ (see Lemma 2.1). Over a random selection of $(i_t,j_t)$, it strictly decreases in expectation unless $\Phi(A_t)=0$ (see Lemma 2.3). Another complication is that the amount of decrease depends on the present value of $\Phi(A_t)$. Of course, when $\Phi(A_t)=0$ already, no decrease can be expected. Our bound roughly says that when $\Phi(A_t)\gg n$, one should expect $\Phi(A_{t+1})\leq\Phi(A_t)-O(1/n^2)$, i.e. steady progress is made (see Lemma 2.4). On the other hand, once $\Phi(A_t)$ drops below this $O(n)$ threshold, progress toward 0 slows dramatically. The progress that one operation makes in bringing the potential to zero can be bounded solely in terms of the current value of the potential. This allows us to describe the evolution of $\Phi(A_t)$ as the discretization of a differential equation (see Lemma 2.6).

In solving that differential equation, we show $\Phi(A_t)$ indeed converges to 0 almost surely (see Theorem 2.7). Using the relationship between $\Phi(A)$ and $\kappa(A)$ supplied by Lemma 2.8, one finally arrives at a tail bound for $\kappa(A_t)$ (see Theorem 2.10).

Correction note:

A previous version of this paper set $\Phi(A)=-\sum_i\log\left(\operatorname{dist}(a_i,\operatorname{span}(a_j:j\neq i))\right)$ and stated that each of the $n$ terms in the sum is monotone in applications of orth. This was incorrect; a corrected version comes from replacing the condition “$j\neq i$” with “$j<i$”, which by the base-times-height formula results in $\Phi(A)=-\log\left|\det(A)\right|$. This minorly alters the remaining proofs. The statement of Proposition 2.9 improves slightly, and the statements of Theorems 2.7 and 2.10 remain unchanged.

Remark on algorithmic application:

One could imagine running this procedure at the same time as running a Kaczmarz solver for the linear system $A^*x=b$. By updating the entries of $b$ in the way corresponding to the updates of the columns of $A$, the solution is preserved. The motivation is that the condition number gets smaller over time, so perhaps the increase in the rate of convergence of Kaczmarz justifies the additional computational expense; for a characterization of the rate of convergence of Kaczmarz in terms of the condition number, see [SV09]. We suspect this is not practical, however. In particular, the Kaczmarz algorithm requires only $O(n)$ writable words of memory and performs only dot products, which makes it attractive for huge sparse systems. The procedure we outline requires updating all of $A$, and sparsity is not preserved. Nevertheless, the trajectory of the condition number is interesting as a standalone mathematical question.
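For concreteness, in the real case the co-update is simple: if column $a_i$ is replaced by $(a_i-\langle a_i,a_j\rangle a_j)/r$ with $r=\|a_i-\langle a_i,a_j\rangle a_j\|$, then since $b_k=\langle a_k,x\rangle$, entry $b_i$ must be replaced by $(b_i-\langle a_i,a_j\rangle b_j)/r$. The following sketch (our own illustrative helper, not an optimized solver component) performs both updates together.

```python
import numpy as np

def orth_step_with_rhs(A, b, i, j):
    # One pairwise orthogonalization that also updates b, preserving the
    # solution x of the real system A.T @ x = b.
    inner = A[:, j] @ A[:, i]
    v = A[:, i] - inner * A[:, j]
    r = np.linalg.norm(v)
    A, b = A.copy(), b.copy()
    A[:, i] = v / r
    b[i] = (b[i] - inner * b[j]) / r         # mirrors the column update
    return A, b
```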

2 Results

2.1 Analysis of a single step

We claim that the absolute value of the determinant of $A$ is monotone under applications of orth. In fact, the change in its value depends only on the inner product of the two vectors being orthogonalized.

Lemma 2.1.

For any distinct indices $i,j$ and any matrix $A$ with unit-length columns, one has

$$\left|\det(A')\right|=\frac{\left|\det(A)\right|}{\sqrt{1-\left|\left\langle a_i,a_j\right\rangle\right|^2}},$$

where $A'=\textnormal{orth}(A,i,j)$.

Proof.

By applying an appropriate permutation to the columns of $A$, we may assume without loss of generality that $i=1$ and $j=2$. The absolute value of the determinant of a matrix is the $n$-volume of the parallelepiped generated by its columns. This can be expressed as

$$\left|\det(A)\right|=\prod_{i=1}^{n}\operatorname{dist}(a_i,\operatorname{span}(a_j:j<i)).$$

After the update $\textnormal{orth}(A,1,2)$, the terms for $i>2$ remain unaltered since $\operatorname{span}(a_1,a_2)=\operatorname{span}(a_1',a_2')$. The $i=1$ term is also unchanged, as it remains equal to $1$. After the update, the $i=2$ term is

$$\begin{split}\operatorname{dist}(a_2',\operatorname{span}(a_1'))&=\operatorname{dist}(a_2',\operatorname{span}(a_1))\\&=\operatorname{dist}\left(\frac{a_2-\left\langle a_1,a_2\right\rangle a_1}{\left\|a_2-\left\langle a_1,a_2\right\rangle a_1\right\|},\operatorname{span}(a_1)\right)\\&=\frac{\operatorname{dist}\left(a_2,\operatorname{span}(a_1)\right)}{\left\|a_2-\left\langle a_1,a_2\right\rangle a_1\right\|},\end{split} \qquad (3)$$

which is precisely the $i=2$ term before the update divided by $\left\|a_2-\left\langle a_1,a_2\right\rangle a_1\right\|$. Since the $a_j$ are unit vectors, this quantity equals $\sqrt{1-\left|\left\langle a_1,a_2\right\rangle\right|^2}$, as required. ∎
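As a quick numerical sanity check of Lemma 2.1 (illustrative only, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
A /= np.linalg.norm(A, axis=0)               # unit-length columns

i, j = 0, 1
inner = A[:, j] @ A[:, i]
Ap = A.copy()
v = A[:, i] - inner * A[:, j]
Ap[:, i] = v / np.linalg.norm(v)

lhs = abs(np.linalg.det(Ap))
rhs = abs(np.linalg.det(A)) / np.sqrt(1 - inner**2)
assert np.isclose(lhs, rhs)                  # matches Lemma 2.1
```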

From the above lemma, it is clear that the main driver increasing the determinant is the inner product $\left\langle a_i,a_j\right\rangle$. Our next lemma considers the matrix of these inner products, $A^*A-I$, and relates its average entry to the smallest singular value of $A$.

Lemma 2.2.

Say $A$ has unit-length columns. Then $\left\|A^*A-I\right\|_F^2\geq\frac{n}{n-1}\left(1-\sigma_n(A)^2\right)^2$.

Proof.

Since $A$ has unit-length columns, $\left\|A\right\|_F^2=\sum_j\sigma_j(A)^2=n$. Consider the expression

$$\left\|A^*A-I\right\|_F^2=\sum_j(\sigma_j(A)^2-1)^2=\sum_j\sigma_j(A)^4-2n+n.$$

Thought of as a function of the singular values, this expression is convex. For a given value of $\sigma_n$, the minimum is achieved when the rest of the singular values are equal. Given the constraint that the sum of the squares of the singular values is $n$, this means the expression is minimized for $\sigma_j(A)^2=\frac{n}{n-1}-\frac{\sigma_n(A)^2}{n-1}$ for $j<n$. Plugging this back in gives the desired bound. ∎
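For completeness, the plug-in computation left implicit above reads: with $\sigma_j(A)^2=\frac{n-\sigma_n(A)^2}{n-1}$ for $j<n$,

$$\sum_j(\sigma_j(A)^2-1)^2=(n-1)\left(\frac{n-\sigma_n(A)^2}{n-1}-1\right)^2+(\sigma_n(A)^2-1)^2=\frac{(1-\sigma_n(A)^2)^2}{n-1}+(1-\sigma_n(A)^2)^2=\frac{n}{n-1}\left(1-\sigma_n(A)^2\right)^2.$$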

With these two lemmas in place, we can bound the expected value of $-\log\left|\det(\cdot)\right|$ after one application of orth. Define the iterative map

$$f(x)=x-\frac{(1-\exp(-2x/n))^2}{2(n-1)^2}.$$
Lemma 2.3 (One step estimate).

If one samples $i,j$ uniformly at random and sets $A'=\textnormal{orth}(A,i,j)$, then

$$\Phi(A')\leq\Phi(A)\text{ pointwise}\quad\text{and}\quad\mathbb{E}\,\Phi(A')\leq f(\Phi(A)).$$
Proof.

Lemma 2.1 states

$$\Phi(A')=\Phi(A)+\frac{1}{2}\log\left(1-\left|\left\langle a_i,a_j\right\rangle\right|^2\right).$$

Since $\left|\left\langle a_i,a_j\right\rangle\right|<1$, the first claim follows immediately. By Jensen's inequality, when $i,j$ are picked randomly one has

$$\begin{split}\mathbb{E}\,\Phi(A')&\leq\Phi(A)+\frac{1}{2}\log\left(1-\mathbb{E}\left|\left\langle a_i,a_j\right\rangle\right|^2\right)\\&=\Phi(A)+\frac{1}{2}\log\left(1-\frac{\left\|A^*A-I\right\|_F^2}{n(n-1)}\right).\end{split} \qquad (4)$$

Now apply Lemma 2.2 together with the bound $\log(1-y)\leq-y$:

$$\begin{split}\mathbb{E}\,\Phi(A')&\leq\Phi(A)+\frac{1}{2}\log\left(1-\frac{(1-\sigma_n(A)^2)^2}{(n-1)^2}\right)\\&\leq\Phi(A)-\frac{(1-\sigma_n(A)^2)^2}{2(n-1)^2}.\end{split} \qquad (5)$$

The determinant is the product of the singular values, so

$$\sigma_n(A)\leq(\sigma_n(A)\cdots\sigma_1(A))^{1/n}=\left|\det(A)\right|^{1/n}=\exp(-\Phi(A)/n).$$

So altogether we have

$$\mathbb{E}\,\Phi(A')\leq\Phi(A)-\frac{\left(1-\exp(-2\Phi(A)/n)\right)^2}{2(n-1)^2}=f(\Phi(A)). \qquad (6)$$

∎
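As an illustrative numerical check of Lemma 2.3 (not part of the proof), one can compare the exact average of $\Phi$ over all ordered pairs against $f(\Phi(A))$; the following NumPy sketch uses our own helper names.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n))
A /= np.linalg.norm(A, axis=0)               # unit-length columns

def phi(M):
    return -np.linalg.slogdet(M)[1]          # Phi(M) = -log|det M|

def orth_step(M, i, j):
    out = M.copy()
    v = M[:, i] - (M[:, j] @ M[:, i]) * M[:, j]
    out[:, i] = v / np.linalg.norm(v)
    return out

f = lambda x: x - (1 - np.exp(-2 * x / n)) ** 2 / (2 * (n - 1) ** 2)

# Exact expectation over the uniform ordered pair (i, j), i != j.
avg = np.mean([phi(orth_step(A, i, j)) for i in range(n) for j in range(n) if i != j])
assert avg <= f(phi(A))
```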

2.2 Markov chain supermartingale

Because applying $f$ does not commute with taking expectations, obtaining a bound for several applications of orth by repeatedly applying Lemma 2.3 is not immediate. We initially define the stochastic process $\{\phi_t\}$ by

$$A_0=A\quad\text{and}\quad A_{t+1}=\textnormal{orth}(A_t,i_t,j_t)\quad\text{and}\quad\phi_t=\Phi(A_t), \qquad (7)$$

where $(i_t,j_t)$ are indices selected uniformly at random. The sequence $\{A_t\}$ is a Markov chain, but $\{\phi_t\}$ is not. Both parts of Lemma 2.3 independently imply that $\{\phi_t\}$ is a supermartingale. In particular, convergence to 0 in expectation is enough to guarantee almost sure convergence to 0 as well. We can modify (7) so that $\{\phi_t\}$ becomes a time-independent Markov chain. Given $\phi_t$, let an adversary pick any matrix $B_t$ satisfying $\Phi(B_t)=\phi_t$ and set

$$\phi_{t+1}=\Phi(\textnormal{orth}(B_t,i_t,j_t)). \qquad (8)$$

In this section, we exploit both implications of Lemma 2.3 for the law of $\phi_t$:

$$\mathbb{E}(\phi_{t+1}\,|\,\phi_t)\leq f(\phi_t)\quad\text{and}\quad\phi_{t+1}\leq\phi_t\text{ pointwise}.$$

Note that $f$ has an inflection point at $\frac{\log 2}{2}\cdot n$ and is concave to the left of it. We define the corresponding stopping time

$$t^*=\inf\left\{t:\phi_t<\frac{\log 2}{2}\cdot n\right\}.$$

Our control over the trajectory of $\mathbb{E}(\phi_t)$ differs before and after $t^*$. Our control before $t^*$ will mostly be used to estimate how long it takes to reach $t^*$. We combine this with control over the trajectory after $t^*$ to produce a final convergence rate.

Lemma 2.4 (Before $t^*$).

$$\mathbb{E}(\phi_t)\leq\phi_0-\frac{\sum_{j=1}^{t}\Pr\left(t^*\geq j\right)}{8(n-1)^2}.$$
Proof.

Note that for $x\geq\frac{\log 2}{2}\cdot n$, one has

$$f(x)\leq x-\frac{1}{8(n-1)^2}.$$

Use the approximation $f(x)\leq x$ for $x<\frac{\log 2}{2}\cdot n$. This gives

$$\begin{split}\mathbb{E}(\phi_{j+1})&\leq\mathbb{E}(\phi_j)-\frac{\Pr\left(\phi_j\geq\frac{\log 2}{2}\cdot n\right)}{8(n-1)^2}\\&=\mathbb{E}(\phi_j)-\frac{\Pr\left(t^*>j\right)}{8(n-1)^2}.\end{split} \qquad (9)$$

By iterating this argument and reindexing the sum (note $\Pr(t^*>j)=\Pr(t^*\geq j+1)$), we obtain

$$\mathbb{E}(\phi_t)\leq\phi_0-\frac{\sum_{j=0}^{t-1}\Pr\left(t^*>j\right)}{8(n-1)^2}=\phi_0-\frac{\sum_{j=1}^{t}\Pr\left(t^*\geq j\right)}{8(n-1)^2}.$$

∎

This is used to control how large $t^*$ can be.

Lemma 2.5 (Exponential concentration of $t^*$).

For any $c\in\mathbb{Z}_+$,

$$\Pr(t^*>c\cdot 16(n-1)^2\phi_0)\leq 2^{-c}.$$
Proof.

Apply Lemma 2.4 in the limit as $t\to\infty$. Then by the tail-sum formula for expectation, $\mathbb{E}(t^*)=\sum_{j\geq 1}\Pr(t^*\geq j)$, so

$$0\leq\lim_{t\to\infty}\mathbb{E}(\phi_t)\leq\phi_0-\frac{\mathbb{E}(t^*)}{8(n-1)^2}\implies\mathbb{E}(t^*)\leq 8(n-1)^2\phi_0.$$

Now note that $t^*$ is implicitly a function of the initial value $\phi_0$. In fact, conditioned on $t^*>k$, the distribution of $t^*$ depends only on the value of $\phi_k$. So we have the more refined fact that for each $k$,

$$\begin{split}\mathbb{E}(t^*(\phi_0)\,|\,t^*(\phi_0)>k)&=\mathbb{E}\left(\mathbb{E}(k+t^*(\phi_k)\,|\,t^*(\phi_0)>k,\phi_k)\right)\\&\leq k+\sup_{x\in[\frac{\log 2}{2}\cdot n,\,\phi_0]}\mathbb{E}(t^*(x))\\&\leq k+8(n-1)^2\phi_0.\end{split} \qquad (10)$$

Set $\mu=16(n-1)^2\phi_0$ and consider the case $k=c\mu$. Then (10) implies the following tail bound by Markov's inequality:

$$\begin{split}\Pr\left(t^*>k+\mu\,|\,t^*>k\right)&=\Pr\left(t^*-k>\mu\,|\,t^*>k\right)\\&\leq\frac{\mathbb{E}(t^*-k\,|\,t^*>k)}{\mu}\\&\leq\frac{8(n-1)^2\phi_0}{\mu}\\&=1/2.\end{split} \qquad (11)$$

Then by iterating this argument, we have

$$\Pr(t^*>c\mu)=\Pr(t^*>0)\prod_{j=0}^{c-1}\Pr\left(t^*>j\mu+\mu\,|\,t^*>j\mu\right)\leq 2^{-c}.$$

∎

After the stopping time, we can exploit the concavity of $f$ and apply a tighter estimate to obtain a better bound.

Lemma 2.6 (After $t^*$).

Set $C_n=2(\log 2)^2n^2(n-1)^2$. For any $t\geq k$,

$$\mathbb{E}(\phi_t\,|\,t^*=k,\phi_k)\leq\frac{C_n}{C_n+\phi_k\cdot(t-k)}\cdot\phi_k.$$
Proof.

Since $\{\phi_t\}$ is a time-independent Markov chain, we need only consider the case $k=0$ and replace $t$ with $t-k$ at the end. For $x\leq\frac{\log 2}{2}\cdot n$, observe that $f$ is concave. Since $\{\phi_t\}$ is monotone, the process deterministically stays in the concave region of the domain. So we may apply Jensen's inequality to obtain

$$\mathbb{E}(\phi_{j+1})\leq\mathbb{E}(f(\phi_j))\leq f(\mathbb{E}(\phi_j)).$$

Consider the sequence $x(j)=\mathbb{E}(\phi_j)$. Note that we may view the update rule

$$x(j+1)=f(x(j))=x(j)-\frac{(1-\exp(-2x(j)/n))^2}{2(n-1)^2}$$

as running Euler's method with a step size of $1$ on the differential equation

$$\frac{\textnormal{d}}{\textnormal{d}t}x(t)=-\frac{(1-\exp(-2x(t)/n))^2}{2(n-1)^2}.$$

By monotonicity of the right-hand side in $x$, the Euler iterates lie below the true solution of the differential equation, so the true solution is an upper bound. We can obtain a looser, but easier to solve for, upper bound by using

$$(1-\exp(-2x/n))^2\geq\frac{x^2}{(\log 2)^2n^2}$$

for $x\leq\frac{\log 2}{2}\cdot n$, which holds by concavity of $x\mapsto 1-\exp(-2x/n)$ (the chord from $0$ to $\frac{\log 2}{2}\cdot n$ lies below the curve); shrinking the magnitude of the drift in this way only increases the solution. Then we have a first-order differential equation with quadratic right-hand side, for which the explicit solution is

$$x(t)=\left(\frac{1}{\phi_0}+\frac{t}{2(\log 2)^2n^2(n-1)^2}\right)^{-1}.$$

After rearranging, this is exactly the bound appearing in the lemma statement. ∎
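For the reader's convenience, the explicit solution follows by separation of variables: writing $C_n=2(\log 2)^2n^2(n-1)^2$, the comparison equation is $\dot{x}=-x^2/C_n$, and

$$\frac{\textnormal{d}}{\textnormal{d}t}\left(\frac{1}{x(t)}\right)=-\frac{\dot{x}(t)}{x(t)^2}=\frac{1}{C_n}\implies\frac{1}{x(t)}=\frac{1}{\phi_0}+\frac{t}{C_n},$$

which inverts to the formula above.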

Remark 1.

The bound in Lemma 2.6 simplifies slightly differently depending on whether the initial state $\phi_0$ is above or below the threshold $\frac{\log 2}{2}\cdot n$, i.e. whether $t^*=0$ or not. When $t^*=0$, we simply have the unconditional expectation

$$\mathbb{E}(\phi_t)\leq\frac{C_n}{C_n+\phi_0\cdot t}\cdot\phi_0.$$

When $t^*>0$, we do not incur too much loss by using $\phi_k=\phi_{t^*}\leq\frac{\log 2}{2}\cdot n$, which holds by definition of $t^*$. This implies

$$\mathbb{E}(\phi_t\,|\,t^*=k,\phi_k)\leq\frac{C_n}{C_n+\frac{\log 2}{2}\cdot n\cdot(t-k)}\cdot\frac{\log 2}{2}\cdot n\leq\frac{n^4}{\frac{2}{\log 2}n^3+t-k}.$$

Our only remaining task is to bound the unconditional expectation $\mathbb{E}(\phi_t)$ when $\phi_0\geq\frac{\log 2}{2}\cdot n$. We do so using the exponential concentration of $t^*$.

Theorem 2.7 (Convergence of $\phi_t$).

Set $C_n=2(\log 2)^2n^2(n-1)^2\leq n^4$ as in Lemma 2.6. Suppose $\phi_0<\frac{\log 2}{2}\cdot n$. Then

$$\mathbb{E}(\phi_t)\leq\frac{C_n}{C_n+\phi_0\cdot t}\cdot\phi_0.$$

Suppose instead $\phi_0\geq\frac{\log 2}{2}\cdot n$. Then

$$\mathbb{E}(\phi_t)\leq\frac{n^4}{\frac{2}{\log 2}n^3+\frac{1}{2}\cdot t}+\left(\phi_0-\frac{n^4}{\frac{2}{\log 2}n^3+\frac{1}{2}\cdot t}\right)\cdot\exp\left(-\frac{t}{47n^2\phi_0}\right). \qquad (12)$$

In particular, by taking $t\to\infty$, we see that $\mathbb{E}(\phi_t)\to 0$ in both cases; since $\{\phi_t\}$ is a non-negative supermartingale, this gives $\phi_t\to 0$ almost surely as well.

Proof.

The first result is the simplification of Lemma 2.6 described in Remark 1. For the second, set $\mu=16(n-1)^2\phi_0$ and $c=k/\mu$ with $k=t/2$. By the law of total expectation, $\mathbb{E}(\phi_t)$ is a convex combination of the conditional expectations

$$\mathbb{E}(\phi_t\,|\,t^*\leq c\mu)\quad\text{and}\quad\mathbb{E}(\phi_t\,|\,t^*>c\mu).$$

The first term is bounded using Lemma 2.6 (in the form of Remark 1) by

$$\sup_{k\in[0,c\mu]}\mathbb{E}(\phi_t\,|\,t^*=k)\leq\sup_{k\in[0,c\mu]}\frac{n^4}{\frac{2}{\log 2}n^3+t-k}\leq\frac{n^4}{\frac{2}{\log 2}n^3+t-c\mu}.$$

The second term is bounded by $\phi_0$ by pointwise monotonicity. The weights of the convex combination are

$$\Pr(t^*\leq c\mu)\quad\text{and}\quad\Pr(t^*>c\mu),$$

the latter of which is bounded by $2^{-c}$ using Lemma 2.5. Since the second term is at least as large as the first, again by pointwise monotonicity, we may take $\Pr(t^*>c\mu)=2^{-c}$ for an upper bound. This establishes

$$\begin{split}\mathbb{E}(\phi_t)&\leq\frac{n^4}{\frac{2}{\log 2}n^3+t-k}\left(1-2^{-k/\mu}\right)+\phi_0\cdot 2^{-k/\mu}\\&=\frac{n^4}{\frac{2}{\log 2}n^3+t-k}+\left(\phi_0-\frac{n^4}{\frac{2}{\log 2}n^3+t-k}\right)\cdot 2^{-k/\mu}.\end{split} \qquad (13)$$

Recall that we took $k=t/2$. Use the numerical approximation $32/\log 2\leq 47$ for the final result. ∎

Remark 2 (Two rates of decay).

The right-hand side of (12) can be viewed as an exponential decay from the initial value $\phi_0$ toward the curve $\sim n^4/(n^3+t)$. Unfortunately, the constant $(47n^2\phi_0)^{-1}$ in the exponent is so small that the bound does not exhibit exponential decay in a meaningful sense. Truly exponential decay would look like $\phi_0\cdot\exp(-ct)$ for some value $c$ not depending on $\phi_0$. In our case, however, on the order of $n^2\phi_0$ iterations are needed to bring the second term under any kind of control. Specifically, for $t\ll n^2\phi_0$, the exponential is well approximated by its first-order Taylor series, making the expression roughly

$$\mathbb{E}(\phi_t)\leq\phi_0-O(t/n^2).$$

On the other hand, after order $n^2\phi_0\log(\phi_0/n)$ iterations, the first term of (12) dominates, so the bound simplifies to just

$$\mathbb{E}(\phi_t)=O(n^4/t).$$
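The two regimes are easy to observe empirically. The following simulation sketch (the badly conditioned starting matrix and all names are our own illustration) tracks $\Phi(A_t)$ via a log-determinant and $\kappa(A_t)$ along one trajectory:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
# Badly conditioned start: prescribed singular values, then unit-length columns.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(np.logspace(-4, 0, n)) @ V
A /= np.linalg.norm(A, axis=0)

history = []
for t in range(200_000):
    i, j = rng.choice(n, size=2, replace=False)
    inner = A[:, j] @ A[:, i]
    v = A[:, i] - inner * A[:, j]
    A[:, i] = v / np.linalg.norm(v)
    if t % 1000 == 0:
        phi = -np.linalg.slogdet(A)[1]       # Phi(A_t)
        history.append((t, phi, np.linalg.cond(A)))
# Early entries show a roughly linear decrease of phi in t (rate ~ 1/n^2);
# late entries decay like ~ n^4 / t, as in Remark 2.
```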

2.3 Relationship to the condition number

Theorem 2.7 shows asymptotic convergence of $\Phi(A_t)=-\log\left|\det(A_t)\right|$ to 0, and the only matrices with unit-length columns and $\left|\det(\cdot)\right|=1$ are unitary. We can measure convergence to a unitary matrix in terms of the convergence of the condition number $\kappa(A_t)=\sigma_1(A_t)/\sigma_n(A_t)$ to 1. Indeed, for a matrix $A$ with unit-length columns, $\kappa(A)=1$ if and only if $A$ is unitary. This section contains two main results (Proposition 2.9 and Theorem 2.10), corresponding to the two rates of decay described in Remark 2. When $\kappa(A)$ is initially large, we show that $\kappa(A_t)$ decays rapidly for $t$ up to some point. Then we show that for $t$ sufficiently large, $\kappa(A_t)$ is very close to 1.

Lemma 2.8 (Bounds on $\kappa$ in terms of $\Phi$).

For any $n\times n$ matrix $A$ with unit-length columns,

$$\kappa(A)\geq\exp\left(\Phi(A)/n\right)\quad\text{and}\quad\kappa(A)\leq\sqrt{en}\cdot\exp\left(\Phi(A)\right)\quad\text{and}\quad\kappa(A)\leq\frac{1+\sqrt{2\Phi(A)}}{1-\sqrt{2\Phi(A)}},$$

where the third bound holds only when its denominator is positive.

Proof.

The first bound is an immediate consequence of $\sigma_1(A)\geq\max_j\left\|Ae_j\right\|=1$ and $\sigma_n(A)\leq\left|\det(A)\right|^{1/n}$. For the second bound, first note

$$\sigma_1(A)\leq\left\|A\right\|_F=\sqrt{n}.$$

Then, the largest value of $\sigma_1(A)\cdots\sigma_{n-1}(A)$ subject to the constraint $\sigma_1(A)^2+\cdots+\sigma_{n-1}(A)^2\leq n$ is achieved for $\sigma_j(A)^2=\frac{n}{n-1}$, so

$$\sigma_1(A)^2\cdots\sigma_{n-1}(A)^2\leq\left(\frac{n}{n-1}\right)^{n-1}\leq e\implies\left|\det(A)\right|\leq\sqrt{e}\,\sigma_n(A).$$

Now for the third bound. By considering the Gram-Schmidt basis, one may assume without loss of generality that $A$ is upper triangular with positive real diagonal entries. Let $\lambda_1,\ldots,\lambda_n$ be its eigenvalues, i.e. its diagonal entries. Let $r_j$ be the portion of the $j$th column strictly above the diagonal. Then

$$\left\|A-I\right\|_F^2=\sum_j\left((\lambda_j-1)^2+\left\|r_j\right\|^2\right)=\sum_j\left(1-2\lambda_j+\lambda_j^2+\left\|r_j\right\|^2\right). \qquad (14)$$

Note $\lambda_j^2+\left\|r_j\right\|^2=1$ since the columns are unit length. Thus

$$\left\|A-I\right\|_F^2=2\sum_j\left(1-\lambda_j\right)\leq 2\sum_j\log\left(1/\lambda_j\right)=2\Phi(A). \qquad (15)$$

Finally, note $\left\|A-I\right\|_F\geq\left\|A-I\right\|\geq\max(\sigma_1(A)-1,\,1-\sigma_n(A))$. ∎
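As an illustrative numerical check of the three bounds (not part of the argument):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
A = rng.standard_normal((n, n))
A /= np.linalg.norm(A, axis=0)               # unit-length columns

phi = -np.linalg.slogdet(A)[1]               # Phi(A) >= 0 by Hadamard's inequality
kappa = np.linalg.cond(A)

assert kappa >= np.exp(phi / n)                          # first bound
assert kappa <= np.sqrt(np.e * n) * np.exp(phi)          # second bound
if np.sqrt(2 * phi) < 1:                                 # third bound needs a positive denominator
    assert kappa <= (1 + np.sqrt(2 * phi)) / (1 - np.sqrt(2 * phi))
```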

Using Lemma 2.8 to convert the bounds on $\Phi(A_t)$ from Theorem 2.7 into bounds on $\kappa(A_t)$ results in the following final theorems.

Proposition 2.9 (Initial rate of decay).

Assume the initial matrix $A$ satisfies $\Phi(A)\geq\frac{\log 2}{2}\cdot n$. Then for all $t\leq n^2\cdot\Phi(A)$,

$$\mathbb{E}\log\kappa(A_t)\leq 0.5\log(en)+\Phi(A)-\frac{1}{96}\left(1-\frac{\frac{\log 2}{2}\cdot n}{\Phi(A)}\right)\cdot\frac{t}{n^2}. \qquad (16)$$
Proof.

The second bound of Lemma 2.8 gives

$$\log\kappa(A_t)\leq 0.5\log(en)+\Phi(A_t),$$

so combined with Theorem 2.7, and using $\exp(-z)\leq 1-z/2$ for $z\in[0,1]$ (applicable since $t\leq n^2\Phi(A)$) together with $\frac{n^4}{\frac{2}{\log 2}n^3+\frac{1}{2}t}\leq\frac{\log 2}{2}\cdot n$, this gives

$$\begin{split}\mathbb{E}\log\kappa(A_t)&\leq 0.5\log(en)+\mathbb{E}\,\Phi(A_t)\\&\leq 0.5\log(en)+\frac{n^4}{\frac{2}{\log 2}n^3+\frac{1}{2}t}+\left(\Phi(A)-\frac{n^4}{\frac{2}{\log 2}n^3+\frac{1}{2}t}\right)\cdot\exp\left(-\frac{t}{47n^2\Phi(A)}\right)\\&\leq 0.5\log(en)+\frac{\log 2}{2}\cdot n+\left(\Phi(A)-\frac{\log 2}{2}\cdot n\right)\cdot\left(1-\frac{1}{2}\cdot\frac{t}{47n^2\Phi(A)}\right)\\&=0.5\log(en)+\Phi(A)-\frac{1}{2}\left(1-\frac{\frac{\log 2}{2}\cdot n}{\Phi(A)}\right)\cdot\frac{t}{47n^2}.\end{split} \qquad (17)$$

∎

Remark 3.

If one plugs the inequality $\Phi(A)\leq n\log\kappa(A)$ into the right-hand side of (16), the result is fairly weak as a bound on $\kappa(A_t)$. We include Proposition 2.9 only to highlight the $O(t/n^2)$ decrease. If we omit the first step of the proof, the bound is as stated in Remark 2,

$$\mathbb{E}\,\Phi(A_t)\leq\Phi(A)-O(t/n^2)$$

for $\Phi(A)\geq c\cdot\frac{\log 2}{2}\cdot n$ with fixed $c>1$. This shows a strict decrease of $\Phi(A_t)$ in expectation on the order of $t/n^2$ for badly conditioned initial matrices. Unfortunately, when passing from $\Phi$ to $\log\kappa$ we lose a multiplicative factor of $n$ and an additive factor of $\log(n)$ in Lemma 2.8. Nevertheless, the dependence on $t$ remains the same. This bound agrees with the heuristic argument and numerical evidence provided in [Ste24] that the decay of $\kappa(A_t)$ should go like $\exp(-t/n^2)$.

Theorem 2.10 (Convergence of $\kappa(A_t)\to 1$).

Fix any initial matrix $A$ and parameters $\varepsilon,\delta\in(0,0.01)$. Then for

$$t\geq 200n^2\Phi(A)\log(7\Phi(A)/n)\quad\text{and}\quad t\geq\frac{48n^4}{\delta\varepsilon^2},$$

one has

$$\Pr\left(\kappa(A_t)\geq 1+\varepsilon\right)\leq\delta.$$

In particular, one can take $\varepsilon,\delta\to 0$ as $t\to\infty$.

Proof.

Set $x=\varepsilon^2/16$, so that

$$1+\varepsilon\geq\frac{1+\sqrt{2x}}{1-\sqrt{2x}};$$

indeed, $\sqrt{2x}=\varepsilon/(2\sqrt{2})\leq\varepsilon/(2+\varepsilon)$ for $\varepsilon\leq 0.01$, which rearranges to the displayed inequality. This fact, along with the third bound of Lemma 2.8 and Markov's inequality, gives

$$\Pr\left(\kappa(A_t)\geq 1+\varepsilon\right)\leq\Pr\left(\kappa(A_t)\geq\frac{1+\sqrt{2x}}{1-\sqrt{2x}}\right)\leq\Pr\left(\Phi(A_t)\geq x\right)\leq\frac{\mathbb{E}\,\Phi(A_t)}{x}. \qquad (18)$$

Now apply Theorem 2.7 to the numerator and simplify the bound using $z\exp(-z)\leq\exp(-z/2)$:

$$\begin{split}\Pr\left(\kappa(A_t)\geq 1+\varepsilon\right)&\leq\frac{\dfrac{n^4}{\frac{2}{\log 2}n^3+\frac{1}{2}t}+\left(\Phi(A)-\dfrac{n^4}{\frac{2}{\log 2}n^3+\frac{1}{2}t}\right)\exp\left(-\dfrac{t}{47n^2\Phi(A)}\right)}{x}\\&\leq\frac{n^4}{tx}\left(\frac{1}{\frac{2}{\log 2}\frac{n^3}{t}+\frac{1}{2}}+\Phi(A)\cdot\frac{t}{n^4}\cdot\exp\left(-\frac{t}{47n^2\Phi(A)}\right)\right)\\&\leq\frac{n^4}{tx}\left(2+\frac{47\Phi(A)^2}{n^2}\cdot\frac{t}{47n^2\Phi(A)}\cdot\exp\left(-\frac{t}{47n^2\Phi(A)}\right)\right)\\&\leq\frac{n^4}{tx}\left(2+\frac{47\Phi(A)^2}{n^2}\cdot\exp\left(-\frac{t}{94n^2\Phi(A)}\right)\right).\end{split} \qquad (19)$$

When $t\geq 192n^2\Phi(A)\log(7\Phi(A)/n)$, the bound further simplifies to $\frac{3n^4}{tx}$. When $t\geq\frac{3n^4}{\delta x}=\frac{48n^4}{\delta\varepsilon^2}$, the bound is simply $\delta$, as required. ∎

Final thoughts and future directions: We note that we do not produce lower bounds. In our estimation, the primary weakness is in Lemma 2.2 (or in its application). In particular, it seems one could obtain a better bound by using the entire distribution of the singular values, rather than just the smallest or their geometric mean, as we do here.

Another idea is to sample pairs of columns from a distribution other than uniform. For example, in light of Lemma 2.1, one could pick them with probability proportional to $\left|\left\langle a_i,a_j\right\rangle\right|^2$, or even greedily maximize $\left|\left\langle a_i,a_j\right\rangle\right|$.

Acknowledgments: We would like to thank Stefan Steinerberger for his inventive works on the Kaczmarz algorithm and engaging discussions which inspired this project.

References

  • [Ste21a] Stefan Steinerberger. Randomized Kaczmarz converges along small singular vectors. SIAM J. Matrix Anal. Appl., 42:608–615, 2021.
  • [Ste21b] Stefan Steinerberger. Surrounding the solution of a linear system of equations from all sides. Quarterly of Applied Mathematics, 79(3):419–429, 2021.
  • [Ste24] Stefan Steinerberger. Kaczmarz Kac walk. arXiv preprint arXiv:2411.06614, 2024.
  • [SV09] Thomas Strohmer and Roman Vershynin. A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl., 15:262–278, 2009.