A Kaczmarz-Inspired Method for Orthogonalization
Abstract
This paper asks whether it is possible to orthogonalize a set of $n$ linearly independent unit vectors in $\mathbb{R}^n$ by considering only a random pair of vectors at a time. In particular, at each step one accesses a random pair of vectors and can replace them with some different basis for their span. If one could instead deterministically decide which pair of vectors to access, one could run the Gram-Schmidt procedure. In our setting, however, the pair is selected at random, so the answer is less clear.
We provide a positive answer to the question: given a pair of vectors at each iteration, replace one with its component perpendicular to the other, renormalized. This produces a sequence that almost surely converges to an orthonormal set. Quantitatively, we measure the rate of convergence in terms of the condition number of the square matrix formed by taking the vectors to be its columns. When the condition number is initially very large, we prove a rapidly decreasing upper bound on the expected logarithm of the condition number after $t$ iterations. The bound decreases quickly during the early iterations, but the decrease tapers off as the condition number improves. We then show how many iterations suffice to bring the condition number below a prescribed threshold with a prescribed probability.
1 Introduction
We consider the following simple procedure for improving the condition number of a collection of linearly independent unit vectors, thought of as the columns of a matrix $A$. Sample two columns, and replace one with its component perpendicular to the other, renormalized. The only fixed points of this operation are unitary matrices. We ask the following questions: does this procedure converge to such matrices? If so, at what rate?
We give a positive answer to the first question and a bound on the rate. Our approach is to consider the evolution of the absolute value of the determinant as the orthogonalizations are performed. We show it is monotone for every selection of columns at each iteration, and its logarithm increases non-trivially in expectation when the columns are selected randomly.
This procedure bears some resemblance to the Kaczmarz method for solving linear systems. When solving a linear system with Kaczmarz, a row of the system is sampled and the current iterate is projected onto the solution set of the corresponding equation; this is then repeated many times. In our case, the iterate is itself one of the columns of $A$, and one applies one step of Kaczmarz to the homogeneous equation given by the other sampled column, i.e. the column is projected to be orthogonal to it (and then renormalized).
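To make this correspondence concrete, the following is a minimal numerical sketch of ours (the helper names `kaczmarz_step` and `orth` and the test dimensions are our choices, not the paper's): one Kaczmarz step for the homogeneous equation determined by the other column, started at the column being updated and followed by renormalization, reproduces the pairwise orthogonalization step.

```python
# A minimal sketch (ours) of the correspondence described above: one Kaczmarz
# step for the homogeneous equation <a_j, x> = 0, started at x = a_i and then
# renormalized, equals the pairwise orthogonalization step.
import numpy as np

def kaczmarz_step(x, row, rhs=0.0):
    """Project x onto the hyperplane {z : <row, z> = rhs}."""
    return x - (np.dot(row, x) - rhs) / np.dot(row, row) * row

def orth(x, y):
    """Component of x perpendicular to the unit vector y, renormalized."""
    z = x - np.dot(x, y) * y
    return z / np.linalg.norm(z)

rng = np.random.default_rng(0)
a_i, a_j = rng.standard_normal((2, 5))
a_i, a_j = a_i / np.linalg.norm(a_i), a_j / np.linalg.norm(a_j)

z = kaczmarz_step(a_i, a_j)                  # one Kaczmarz step toward <a_j, x> = 0
assert np.allclose(z / np.linalg.norm(z), orth(a_i, a_j))
```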
Our two results about this procedure concern the initial and asymptotic behavior of the condition number as these pairwise orthogonalization steps are performed. For very badly conditioned matrices, we show the condition number initially decays rapidly in the number of steps performed (see Proposition 2.9 and the ensuing remark). Our main theorem concerns the eventual convergence of the condition number to 1.
Theorem 1.1 (Simplification of Theorem 2.10).
For any initial matrix $A_0$, let $A_t$ denote the matrix after $t$ orthogonalization steps. Then for every $t$, we have
where $\lesssim$ hides an absolute constant.
Concurrent and independent work: This project was prompted by the very interesting related works of Stefan Steinerberger, which inspired us to explore variants of Kaczmarz, particularly those which modify the matrix as the solver runs [Ste21a, Ste21b]. He and the present authors independently conceived of the particular variant analyzed in this paper. He recently announced some progress on this problem [Ste24]. First, he gives conditions on the matrix under which progress is made in expectation. Second, he gives a heuristic argument and numerical evidence for the rate at which the iterates should approach orthogonality. Our work is independent of his.
1.1 Technical overview
We first precisely define the procedure we analyze. Throughout this paper, let $A = [\,a_1 \mid a_2 \mid \cdots \mid a_n\,]$ be the decomposition of $A$ into columns. We assume the columns are unit length. The operation we define is
\[
\operatorname{orth}(x, y) \;:=\; \frac{x - \langle x, y\rangle\, y}{\lVert x - \langle x, y\rangle\, y \rVert}. \tag{1}
\]
At each timestep $t$, our procedure samples a pair $(i, j)$ uniformly at random from the ordered pairs of distinct indices and updates $a_i \leftarrow \operatorname{orth}(a_i, a_j)$.
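The procedure can be summarized in a few lines of code. The following self-contained sketch is ours (function names and demonstration parameters are our choices), not reference code from the paper.

```python
# A sketch (ours) of the random pairwise orthogonalization procedure: repeatedly
# sample an ordered pair (i, j) of distinct column indices and replace column i
# by its component perpendicular to column j, renormalized.
import numpy as np

def orth_step(A, i, j):
    """In place, replace column i of A by its unit component perpendicular to column j."""
    v = A[:, i] - np.dot(A[:, i], A[:, j]) * A[:, j]
    A[:, i] = v / np.linalg.norm(v)

def pairwise_orthogonalize(A, num_steps, seed=0):
    A = A / np.linalg.norm(A, axis=0)        # ensure unit-length columns
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    for _ in range(num_steps):
        i, j = rng.choice(n, size=2, replace=False)
        orth_step(A, i, j)
    return A

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = pairwise_orthogonalize(rng.standard_normal((50, 50)), num_steps=50_000)
    print(np.linalg.cond(A))                 # should now be close to 1
```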
One major difficulty in analyzing the evolution of the condition number is that orth will sometimes increase it. Secondly, rare events that produce poorly conditioned matrices will skew the expectation of the condition number upward by a lot. To overcome these issues, we study the evolution of the non-negative potential function
\[
F(A) \;:=\; -\log\,\lvert \det(A) \rvert. \tag{2}
\]
Note that since $A$ has unit-length columns, $F(A) = 0$ if and only if $A$ is unitary. We show $F(A_t)$ is monotonically decreasing no matter the selection of column pairs (see Lemma 2.1). Over a random selection of the pair, it strictly decreases in expectation unless $F(A_t) = 0$ (see Lemma 2.3). Another complication is that the amount of decrease depends on the present value of $F(A_t)$. Of course, when $F(A_t) = 0$ already, no decrease could be expected. Our bound roughly suggests that while $F(A_t)$ remains above a certain threshold, it decreases in expectation by an amount proportional to its current value, i.e. steady progress is made (see Lemma 2.4). On the other hand, once $F(A_t)$ drops below this threshold, progress toward 0 slows dramatically. The progress that one operation makes in bringing the potential to zero can be bounded solely in terms of the current value of the potential. This allows us to describe the evolution of $F(A_t)$ as the discretization of a differential equation (see Lemma 2.6).
In solving that differential equation, we show $F(A_t)$ indeed converges to 0 almost surely (see Theorem 2.7). Using the relationship between $F$ and the condition number supplied by Lemma 2.8, one finally arrives at a tail bound for the condition number of $A_t$ (see Theorem 2.10).
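As a sanity check on this outline, the following small experiment of ours tracks the potential along one run, taking the potential to be $-\log|\det A|$ as in (2); the recorded values should be non-increasing and drift toward zero.

```python
# Track the potential -log|det A| (as in (2)) along one run of the procedure;
# it should be non-increasing at every step.  This is our experiment, not the
# paper's code.
import numpy as np

def potential(A):
    return -np.linalg.slogdet(A)[1]          # -log|det A|, computed stably

rng = np.random.default_rng(2)
n = 30
A = rng.standard_normal((n, n))
A /= np.linalg.norm(A, axis=0)

values = [potential(A)]
for _ in range(5000):
    i, j = rng.choice(n, size=2, replace=False)
    v = A[:, i] - np.dot(A[:, i], A[:, j]) * A[:, j]
    A[:, i] = v / np.linalg.norm(v)
    values.append(potential(A))

assert all(b <= a + 1e-9 for a, b in zip(values, values[1:]))   # monotone decrease
print(values[0], values[-1])                 # the potential should have dropped toward 0
```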
Correction note:
A previous version of this paper had defined the potential as a sum of terms and stated that each of the terms in the sum was monotone in applications of orth. This was incorrect; a corrected version comes from modifying the condition defining those terms, which by the base-times-height formula results in the expression (2). This minorly alters the remaining proofs. The statement of Proposition 2.9 is improved slightly and the statements of Theorems 2.7 and 2.10 remain unchanged.
Remark on algorithmic application:
One could imagine running this procedure at the same time as running a Kaczmarz solver for the linear system $A^* x = b$. By updating the entries of $b$ in the way corresponding to the updates of the columns of $A$, the solution is preserved. The motivation is that the condition number gets smaller over time, so perhaps the increase in the rate of convergence of Kaczmarz justifies the additional computational expense. For a characterization of the rate of convergence of Kaczmarz in terms of the condition number, see [SV09]. We suspect this is not practical, however. In particular, the Kaczmarz algorithm only requires $O(n)$ writeable words of memory and only performs dot products, and so it is attractive for huge sparse systems. The procedure we outline requires updating all of $A$, and sparsity is not preserved. Nevertheless, the trajectory of the condition number is interesting as a standalone mathematical question.
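The coupled update can be sketched as follows. This is our illustration, under the assumption that the solver operates on the system whose rows are the unit vectors being orthogonalized (for the columns of $A$, the system $A^* x = b$); mirroring each row operation on the entries of $b$ leaves the solution unchanged while the system becomes better conditioned.

```python
# Our sketch of the coupled update: orthogonalize the rows of R while applying
# the same operation to the right-hand side b, so that the solution of R x = b
# is preserved.  Names and dimensions are illustrative.
import numpy as np

def coupled_step(R, b, i, j):
    c = np.dot(R[i], R[j])
    tau = np.sqrt(1.0 - c * c)
    R[i] = (R[i] - c * R[j]) / tau           # orthogonalization step on row i
    b[i] = (b[i] - c * b[j]) / tau           # mirror the row operation on b

rng = np.random.default_rng(3)
n = 20
R = rng.standard_normal((n, n))
R /= np.linalg.norm(R, axis=1, keepdims=True)    # unit-length rows
x_true = rng.standard_normal(n)
b = R @ x_true

for _ in range(1000):
    i, j = rng.choice(n, size=2, replace=False)
    coupled_step(R, b, i, j)

assert np.allclose(R @ x_true, b)            # the original solution still solves the system
print(np.linalg.cond(R))                     # and the system is better conditioned
```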
2 Results
2.1 Analysis of a single step
We claim that the absolute value of the determinant of $A$ is monotone in applications of orth. In fact, the change in its value depends only on the inner product of the two vectors being orthogonalized.
Lemma 2.1.
For any distinct indices $i \neq j$ and any matrix $A$ with unit length columns, one has
\[
\bigl|\det \operatorname{orth}_{ij}(A)\bigr| \;=\; \frac{\lvert\det A\rvert}{\tau},
\]
where $\tau = \sqrt{1 - \lvert\langle a_i, a_j\rangle\rvert^2}$ and $\operatorname{orth}_{ij}(A)$ denotes the matrix obtained from $A$ by the update $a_i \leftarrow \operatorname{orth}(a_i, a_j)$.
Proof.
By applying the appropriate permutation to the columns of $A$, we may assume without loss of generality that $j = 1$ and $i = 2$. The absolute value of the determinant of a matrix is the $n$-volume of the parallelepiped generated by its columns. This can be expressed as
\[
\lvert\det A\rvert \;=\; \prod_{k=1}^{n} \operatorname{dist}\bigl(a_k,\, \operatorname{span}(a_1, \dots, a_{k-1})\bigr).
\]
After the update $a_2 \leftarrow \operatorname{orth}(a_2, a_1)$, the terms for $k \geq 3$ remain unaltered since $\operatorname{span}(a_1, a_2) = \operatorname{span}\bigl(a_1, \operatorname{orth}(a_2, a_1)\bigr)$. The term for $k = 1$ is also unchanged, as it remains equal to 1. After the update, the term for $k = 2$ is
\[
\operatorname{dist}\!\left(\frac{a_2 - \langle a_2, a_1\rangle a_1}{\lVert a_2 - \langle a_2, a_1\rangle a_1\rVert},\; \operatorname{span}(a_1)\right) \;=\; \frac{\operatorname{dist}\bigl(a_2, \operatorname{span}(a_1)\bigr)}{\lVert a_2 - \langle a_2, a_1\rangle a_1\rVert}, \tag{3}
\]
which is precisely the term before the update divided by $\lVert a_2 - \langle a_2, a_1\rangle a_1\rVert$. Since the $a_k$ are unit vectors, this quantity is equal to $\tau = \sqrt{1 - \lvert\langle a_1, a_2\rangle\rvert^2}$ as required. ∎
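A quick numerical check of ours of the identity just proved: a single orthogonalization step divides the absolute value of the determinant by $\sqrt{1 - \langle a_i, a_j\rangle^2}$.

```python
# Numerical check (ours) of Lemma 2.1: one step divides |det A| by sqrt(1 - <a_i, a_j>^2).
import numpy as np

rng = np.random.default_rng(4)
n = 8
A = rng.standard_normal((n, n))
A /= np.linalg.norm(A, axis=0)

i, j = 3, 5
c = np.dot(A[:, i], A[:, j])
before = abs(np.linalg.det(A))

v = A[:, i] - c * A[:, j]
A[:, i] = v / np.linalg.norm(v)
after = abs(np.linalg.det(A))

assert np.isclose(after, before / np.sqrt(1.0 - c * c))
```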
From the above lemma, it is clear that the main driver increasing the determinant is the inner product $\langle a_i, a_j\rangle$. Our next lemma considers the matrix of these inner products, $A^*A$, and relates the average entry to the smallest singular value of $A$.
Lemma 2.2.
Say $A$ has unit length columns. Then
Proof.
Since $A$ has unit length columns, $\operatorname{tr}(A^*A) = n$. Consider the expression
Thought of as a function of the singular values, this expression is convex. For a given value of the smallest singular value, the minimum is achieved when the rest of the singular values are equal. Given the constraint that the sum of the squares of the singular values is $n$, this means the expression is minimized when the remaining squared singular values all equal $(n - \sigma_{\min}^2)/(n-1)$. Plugging this back in gives the desired bound. ∎
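The fact used at the start of the proof, that for unit-length columns the squared singular values sum to $n$, can be checked in one line; this snippet is ours.

```python
# Check (ours): for a matrix with unit-length columns, the squared singular
# values sum to n, since their sum is the squared Frobenius norm.
import numpy as np

rng = np.random.default_rng(7)
n = 12
A = rng.standard_normal((n, n))
A /= np.linalg.norm(A, axis=0)
assert np.isclose(np.sum(np.linalg.svd(A, compute_uv=False) ** 2), n)
```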
With these two lemmas in place, we can compute the expected value of $F$ after one application of orth. Define the iterative map $G$ by
Lemma 2.3 (One step estimate).
If one samples the pair $(i, j)$ uniformly at random and sets $A' = \operatorname{orth}_{ij}(A)$, then
2.2 Markov chain supermartingale
Because applying $G$ does not commute with taking expectations, obtaining a bound for several applications of orth by applying Lemma 2.3 over and over again is not immediate. We initially define the stochastic process of interest by
(7)
where the indices at each step are selected uniformly at random. The sequence of matrices is a Markov chain, but the sequence of potential values is not. Both parts of Lemma 2.3 independently imply that the latter is a supermartingale. In particular, convergence to 0 in expectation is enough to guarantee almost sure convergence to 0 as well. We can modify (7) so that the process becomes a time-independent Markov chain. Given the current value, let an adversary pick any matrix attaining that value of the potential and set
(8)
In this section, we exploit both implications of Lemma 2.3 on the law of this process.
Note that $G$ has an inflection point and is concave to the left of it. We define the corresponding stopping time $\tau$ to be the first time the process drops below the inflection point.
Our control over the trajectory of the process differs before and after $\tau$. Our control before $\tau$ will mostly be used to estimate how long it takes to reach the threshold. We combine this with control over the trajectory after $\tau$ to produce a final convergence rate.
Lemma 2.4 (Before $\tau$).
Proof.
Note that for values above the inflection point, one has
Apply the elementary approximation valid in this range. This gives
(9)
By iterating this argument, we obtain
∎
This is used to control how large $\tau$ can be.
Lemma 2.5 (Exponential concentration of $\tau$).
For any
Proof.
Apply Lemma 2.4 in the limit as $t \to \infty$. Then by the variational expression for expectation,
Now note that $\tau$ is implicitly a function of the initial value of the process. In fact, conditioned on the value at time $t$, the distribution of the remaining time until $\tau$ depends only on that value. So we have the more refined fact that for each $t$,
(10)
Fix the relevant parameters. Then (10) implies the following tail bound by Markov’s inequality,
(11)
Then by iterating this argument, we have
∎
After the stopping time, we can exploit the concavity of $G$ and apply a tighter estimate to obtain a better bound.
Lemma 2.6 (After $\tau$).
Fix a constant $c$. For any $t$,
Proof.
Since the modified process is a time-independent Markov chain, we need only consider the case $\tau = 0$ and replace $t$ with $t - \tau$ at the end. For values below the inflection point, observe that $G$ is concave. Since the process is monotone, we deterministically stay in the concave region of the domain. So we may apply Jensen to obtain
Consider the sequence obtained by iterating $G$ deterministically. Note that we may view the update rule
as running Euler’s method with step size 1 on the differential equation
By monotonicity of the left-hand side, the true solution to the differential equation is an upper bound. We can obtain a looser, but easier to solve for, upper bound by using
for small arguments. Then we have a quadratic first-order differential equation, for which the explicit solution is
After rearranging, this is exactly the bound appearing in the lemma statement. ∎
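The discretization-versus-ODE comparison used in the proof can be illustrated numerically. In the sketch below, which is ours, the constant c and the quadratic right-hand side are purely illustrative stand-ins rather than the quantities appearing in the lemma.

```python
# Illustration (ours) of the Euler-method comparison: the recursion
# y_{t+1} = y_t - c*y_t^2 (forward Euler with step size 1 for y' = -c*y^2)
# stays below the explicit ODE solution y(t) = y0 / (1 + c*y0*t).
c, y0, T = 0.05, 0.5, 2000                   # illustrative constants, not from the lemma

y = y0
for t in range(1, T + 1):
    y = y - c * y * y                        # one Euler step
    ode = y0 / (1.0 + c * y0 * t)            # explicit ODE solution at time t
    assert y <= ode + 1e-12
print(y, y0 / (1.0 + c * y0 * T))
```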
Remark 1.
The bound in Lemma 2.6 simplifies slightly differently when the initial state is above versus below the threshold, i.e. whether $\tau = 0$ or not. When $\tau = 0$, then we simply have the unconditional expectation
When $\tau > 0$, we do not incur too much loss by bounding the value at time $\tau$ by the threshold, which holds by definition of $\tau$. This implies
Our only remaining task is to bound the unconditional expectation when $\tau > 0$. We do so using the exponential concentration of $\tau$.
Theorem 2.7 (Convergence of $F$).
Set $c$ as in Lemma 2.6. Suppose first that the initial value of the potential lies below the threshold. Then
Suppose instead that it lies above the threshold. Then
(12)
In particular, by taking $t \to \infty$, we see that $\mathbb{E}\,F(A_t) \to 0$ in both cases.
Proof.
The first result is the simplification of Lemma 2.6 described in Remark 1. For the second, fix an intermediate time and condition on whether $\tau$ has occurred by then. By the law of total expectation, $\mathbb{E}\,F(A_t)$ is a convex combination of the conditional expectations
The first term is bounded using Lemma 2.6 by
The second term is bounded by the initial value of the potential by pointwise monotonicity. The weights of the convex combination are
the latter of which is bounded using Lemma 2.5. Since the second term is at least as large as the first, again by pointwise monotonicity, we may use this bound on its weight for an upper bound. This establishes
(13)
Recall the choice of intermediate time made above. Use an elementary numerical approximation for the final result. ∎
Remark 2 (Two rates of decay).
The right-hand side of (12) can be viewed as an exponential decay from the initial value toward a slowly decreasing curve. Unfortunately, the constant in the exponent is so small that the bound doesn’t exhibit exponential decay in a meaningful sense. Truly exponential decay would look like $e^{-ct}$ for some value $c$ not depending on the dimension. But in our case, we need a number of iterations growing with the dimension to bring the second term under any kind of control. Specifically, for smaller $t$, the exponential is well approximated by its first-order Taylor series, making the expression roughly
On the other hand, after that many iterations, the first term of (12) dominates, so the bound simplifies to just
2.3 Relationship to the condition number
Theorem 2.7 shows asymptotic convergence of $F(A_t)$ to 0, and the only matrices with unit-length columns and determinant of absolute value one are unitary. We can measure convergence to a unitary matrix in terms of the convergence of the condition number to 1. Indeed, for a matrix $A$ with unit-length columns, $\kappa(A) = 1$ if and only if $A$ is unitary. This section contains two main results (Proposition 2.9 and Theorem 2.10), corresponding to the two rates of decay described in Remark 2. When the condition number is initially large, we want to show it is rapidly decaying for $t$ up to some point. Then we want to show for sufficiently large $t$ that it is very close to 1.
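The quantity tracked in this section can be observed directly. The following sketch of ours runs the procedure from a deliberately badly conditioned matrix with unit-length columns and prints the condition number at a few times, which should display the two regimes of Remark 2.

```python
# Our illustration: condition number of A_t along one run, started from a badly
# conditioned matrix (two nearly parallel columns).
import numpy as np

rng = np.random.default_rng(5)
n = 30
A = rng.standard_normal((n, n))
A[:, 0] = A[:, 1] + 1e-3 * rng.standard_normal(n)    # two nearly parallel columns
A /= np.linalg.norm(A, axis=0)

for t in range(1, 20001):
    i, j = rng.choice(n, size=2, replace=False)
    v = A[:, i] - np.dot(A[:, i], A[:, j]) * A[:, j]
    A[:, i] = v / np.linalg.norm(v)
    if t in (10, 100, 1000, 20000):
        print(t, np.linalg.cond(A))
```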
Lemma 2.8 (Bounds on $\kappa(A)$ in terms of $F(A)$).
For any matrix $A$ with unit-length columns,
where the right-aligned result holds only when the denominator is positive.
Proof.
The left-aligned result is an immediate consequence of elementary properties of the singular values. For the center-aligned result, first note
Then, the largest value of this quantity subject to the constraint is achieved in the extremal case, so
Now for the right-aligned result. By considering the Gram-Schmidt basis, one may assume without loss of generality that $A$ is upper triangular with positive real diagonal entries. Let $\lambda_1, \dots, \lambda_n$ be its eigenvalues. Let $v_k$ be the portion of the $k$th column above the diagonal. Then
(14)
Note that $\lambda_k^2 + \lVert v_k\rVert^2 = 1$ since the columns are unit length. Thus
(15)
Finally, note that $\lvert\det A\rvert = \prod_{k} \lambda_k$. ∎
Using Lemma 2.8 to convert the bounds on $F(A_t)$ from Theorem 2.7 into bounds on the condition number results in the following final theorems.
Proposition 2.9 (Initial rate of decay).
Assume the initial matrix is sufficiently badly conditioned. Then for all $t$,
(16)
Remark 3.
If one plugs the relevant inequality into the right-hand side of (16), it becomes fairly ineffective as a bound on the condition number. We include Proposition 2.9 only to highlight the leading term. If we omit the first step of the proof, the bound would be, as we stated in Remark 2,
for $t$ in the stated range and fixed dimension. This shows a strict decrease of the expected logarithm of the condition number at each step for badly conditioned initial matrices. Unfortunately, when passing from the potential to the condition number we lose a multiplicative factor and an additive factor in Lemma 2.8. Nevertheless, the dependence on $t$ remains the same. This bound agrees with the heuristic argument and numerical evidence provided in [Ste24] for the rate at which the decay of the condition number should occur.
Theorem 2.10 (Convergence of $\kappa(A_t)$).
Fix any initial matrix $A_0$ and parameters $\epsilon, \delta > 0$. Then for
one has
In particular, one can take an explicit value of $t$ satisfying the displayed condition.
Proof.
Final thoughts and future directions: We note that we do not produce lower bounds. In our estimation, the primary weakness is in Lemma 2.2 (or in its application). In particular, it seems like one could obtain a better bound by using the entire distribution of the singular values, rather than just the smallest or their geometric mean, as we do here.
Another idea is to sample pairs of columns from a distribution other than the uniform one. For example, in light of Lemma 2.1, one could pick the pair with probability proportional to $\lvert\langle a_i, a_j\rangle\rvert$, or even greedily maximize $\lvert\langle a_i, a_j\rangle\rvert$ (see the sketch below).
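For concreteness, here is our sketch of the greedy variant; the function name and parameters are our choices, and this variant is not analyzed in the present paper.

```python
# Greedy variant (our sketch): at each step orthogonalize the pair of columns
# with the largest absolute inner product.
import numpy as np

def greedy_pairwise_orthogonalize(A, num_steps):
    A = A / np.linalg.norm(A, axis=0)
    n = A.shape[1]
    for _ in range(num_steps):
        G = np.abs(A.T @ A)
        np.fill_diagonal(G, -1.0)            # exclude i == j
        i, j = np.unravel_index(np.argmax(G), G.shape)
        v = A[:, i] - np.dot(A[:, i], A[:, j]) * A[:, j]
        A[:, i] = v / np.linalg.norm(v)
    return A

rng = np.random.default_rng(6)
A = greedy_pairwise_orthogonalize(rng.standard_normal((30, 30)), 2000)
print(np.linalg.cond(A))
```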
Acknowledgments: We would like to thank Stefan Steinerberger for his inventive works on the Kaczmarz algorithm and engaging discussions which inspired this project.
References
- [Ste21a] Stefan Steinerberger. Randomized Kaczmarz converges along small singular vectors. SIAM J. Matrix Anal. Appl., 42:608–615, 2021.
- [Ste21b] Stefan Steinerberger. Surrounding the solution of a linear system of equations from all sides. Quarterly of Applied Mathematics, 79(3):419–429, 2021.
- [Ste24] Stefan Steinerberger. Kaczmarz Kac walk. arXiv preprint arXiv:2411.06614, 2024.
- [SV09] Thomas Strohmer and Roman Vershynin. A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl., 15:262–278, 2009.