Computing Lewis Weights to High Precision
We present an algorithm for computing approximate Lewis weights to high precision. Given a full-rank $\mathbf{A} \in \mathbb{R}^{m \times n}$ with $m \ge n$ and a scalar $p > 2$, our algorithm computes $\epsilon$-approximate $\ell_p$ Lewis weights of $\mathbf{A}$ in $\widetilde{O}_p(\log(1/\epsilon))$ iterations; the cost of each iteration is linear in the input size plus the cost of computing the leverage scores of $\mathbf{D}\mathbf{A}$ for diagonal $\mathbf{D} \in \mathbb{R}^{m \times m}$. Prior to our work, such a computational complexity was known only for $p \in (0, 4)$ [CP15], and combined with this result, our work yields the first polylogarithmic-depth polynomial-work algorithm for the problem of computing Lewis weights to high precision for all constant $p > 0$. An important consequence of this result is also the first polylogarithmic-depth polynomial-work algorithm for computing a nearly optimal self-concordant barrier for a polytope.
1 Introduction to Lewis Weights
In this paper, we study the problem of computing the $\ell_p$ Lewis weights (from here on, referred to simply as “Lewis weights” for brevity) of a matrix.
Definition 1.
[Lew78, CP15] Given a full-rank matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ with $m \ge n$ and a scalar $p > 0$, the $\ell_p$ Lewis weights of $\mathbf{A}$ are the entries of the unique (existence and uniqueness were first proven by D. R. Lewis [Lew78], after whom the weights are named) vector $\overline{w} \in \mathbb{R}^m_{>0}$ satisfying, for each $i \in [m]$, the equation
$\overline{w}_i = a_i^\top \left(\mathbf{A}^\top \overline{\mathbf{W}}^{1 - 2/p} \mathbf{A}\right)^{-1} a_i$, where $a_i^\top$ is the $i$’th row of matrix $\mathbf{A}$ and $\overline{\mathbf{W}}$ is the diagonal matrix with the vector $\overline{w}$ on the diagonal.
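To make the fixed-point characterization above concrete, here is a small numerical sketch (ours, not part of the paper; function names and tolerances are illustrative) that computes leverage scores and measures how far a candidate vector is from the Lewis-weight fixed point $\overline{w}_i = \tau_i(\overline{\mathbf{W}}^{1/2 - 1/p}\mathbf{A})$:

```python
import numpy as np

def leverage_scores(B):
    """Leverage scores of the rows of B: tau_i = b_i^T (B^T B)^{-1} b_i."""
    G = B.T @ B
    # Solve G X = B^T rather than forming the inverse explicitly.
    X = np.linalg.solve(G, B.T)
    return np.einsum("ij,ji->i", B, X)

def lewis_weight_residual(A, w, p):
    """Max deviation of w from the fixed point w_i = tau_i(W^{1/2-1/p} A)."""
    B = (w ** (0.5 - 1.0 / p))[:, None] * A  # rows of A scaled by w_i^{1/2-1/p}
    return np.max(np.abs(w - leverage_scores(B)))
```

For $p = 2$, the scaling matrix is the identity, so the Lewis weights coincide with the ordinary leverage scores of $\mathbf{A}$, and `leverage_scores(A)` has residual zero.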
Motivation. We contextualize our problem with a simpler, geometric notion. Given a set of points (the rows of the preceding matrix $\mathbf{A}$), their John ellipsoid [Joh48] is the minimum-volume ellipsoid enclosing them (the term may also refer to the maximum-volume ellipsoid enclosed by the set, but in this paper, we use the former definition). This ellipsoid finds use across experiment design and computational geometry [Tod16] and is central to certain cutting-plane methods [Vai89, LSW15], a family of algorithms fundamental to mathematical optimization (Section 1.3). It turns out that the John ellipsoid of a set of points is expressible [BV04] as the solution to the following convex program, with the objective being a stand-in for the volume of the ellipsoid and the constraints encoding the requirement that each given point lie within the ellipsoid:
The problem (1) may be generalized by the following convex program [Woj96, CP15], the generalization immediate from substituting in (1):
Geometrically, (1) seeks the minimum volume ellipsoid with a bound on the $\ell_p$-norm of the distances of the points to the ellipsoid, and its solution is the “Lewis ellipsoid” [CP15] of $\mathbf{A}$.
The optimality condition of (1), written using defined as , is equivalent to (1), and this demonstrates that solving (1) is one approach to obtaining the Lewis weights of (see [CP15]). This equivalence also underscores the fact that the problem of computing Lewis weights is a natural generalization of the problem of computing the John ellipsoid.
More broadly, Lewis weights are ubiquitous across statistics, machine learning, and mathematical optimization in diverse applications, of which we presently highlight two (see Section 1.3 for details). First, their interpretation as “importance scores” of rows of matrices makes them key to shrinking the row dimension of input data [DMM06]. Second, through their role in constructing self-concordant barriers of polytopes [LS14], variants of Lewis weights have found prominence in recent advances in the computational complexity of linear programming.
From a purely optimization perspective, Lewis weights may be viewed as the optimal solution to the following convex optimization problem (which is in fact essentially dual to (1)):
As elaborated in [CP15, LS19], the reason this problem yields the Lewis weights is that an appropriate scaling of its solution transforms its optimality condition into (1). The problem (1) is a simple and natural one and, in the case corresponding to the John ellipsoid, has been the subject of study for designing new optimization methods [Tod16].
In summary, Lewis weights naturally arise as generalizations of extensively studied problems in convex geometry and optimization. This, coupled with their role in machine learning, makes understanding the complexity of computing Lewis weights, i.e., solving (1), a fundamental problem.
Our Goal.
We aim to design high-precision algorithms for computing $\epsilon$-approximate Lewis weights, i.e., a vector $w$ satisfying
$(1 - \epsilon)\,\overline{w} \le w \le (1 + \epsilon)\,\overline{w}$, where $\le$ is used to denote entrywise inequality. To this end, we design algorithms to solve the convex program (1) to $\epsilon'$-additive accuracy for an appropriate $\epsilon'$, which we prove suffices in Lemma 1.
By a “high-precision” algorithm, we mean one with a runtime polylogarithmic in $1/\epsilon$. We emphasize that for several applications such as randomized sampling [CP15], approximate Lewis weights suffice; however, we believe that high-precision methods such as ours enrich our understanding of the structure of the optimization problem (1). Further, as stated in Theorem 3, such methods yield new runtimes for directly computing a near-optimal self-concordant barrier for polytopes.
We use the number of leverage score computations as the complexity measure of our algorithms. Our choice is a result of the fact that leverage scores of appropriately scaled matrices appear both in the gradient and Hessian of our objective (see Lemma 3) and in the verification of correctness of Lewis weights. This measure of complexity stresses the number of iterations rather than the details of iteration costs (which depend on exact techniques used for leverage score computation, e.g., fast matrix multiplication) and is consistent with many prior algorithms (see Table 1).
Prior Results.
The first polynomial-time algorithm for computing Lewis weights was presented by [CP15] and performed only $\widetilde{O}(1)$ leverage score computations (we use $O_p$ to hide a polynomial in $p$ and $\widetilde{O}$ to hide factors polylogarithmic in $m$, $n$, and $1/\epsilon$). However, their result holds only for $p \in (0, 4)$. We explain the source of this limited range in Section 1.2.
In comparison, for , existing algorithms are slower: the algorithms by [CP15], [Lee16], and [LS19] perform , , and leverage score computations, respectively. [CP15] also gave an algorithm with total runtime . Of note is the fact that the algorithms with runtimes polynomial in ([Lee16, CP15]) satisfy the weaker approximation condition , which is in fact implied by our condition (1).
We display these runtimes in Table 1, assuming a standard matrix-multiplication-based cost for each leverage score computation (which, we reiterate, may be reduced through the use of fast matrix multiplication). In terms of the number of leverage score computations, Table 1 highlights the contrast between the polylogarithmic dependence on input size and accuracy for $p \in (0, 4)$ and the polynomial dependence on these factors for $p \ge 4$. The motivation behind our paper is to close this gap.
1.1 Our Contribution
We design an algorithm that computes Lewis weights to high precision for all $p > 2$ using only polylogarithmically many leverage score computations. Together with [CP15]’s result for $p \in (0, 4)$, our result therefore completes the picture on a near-optimal reduction from leverage scores to Lewis weights for all $p$.
Theorem 1 (Main Theorem (Parallel)).
Given a full-rank matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\epsilon > 0$, we can compute (Algorithms 1 and 2) its $\epsilon$-approximate Lewis weights (1) in $\widetilde{O}_p(\log(1/\epsilon))$ iterations (our algorithms work for all $p > 2$, as can be seen in our proof in Section 3.1; however, for $p \in (2, 4)$, the algorithm of [CP15] is faster, and therefore, in our main theorems, we state runtimes only for $p \ge 4$). Each iteration computes the leverage scores of a matrix $\mathbf{D}\mathbf{A}$ for a diagonal matrix $\mathbf{D}$. The total runtime is polynomial in the input size, with polylogarithmic depth.
Theorem 1 is attained by a parallel algorithm for computing Lewis weights that consists of polylogarithmic rounds of leverage score computations and therefore has polylogarithmic-depth, a result that was not known prior to this work.
Theorem 2 (Main Theorem (Sequential)).
Remark 1.1.
The solution to (1) characterizes a “Lewis ellipsoid,” and the Lewis ellipsoid of $\mathbf{A}$ as $p \to \infty$ is precisely its John ellipsoid. After symmetrization [Tod16], computing the John ellipsoid is equivalent to solving a linear program (LP). Therefore, computing Lewis weights in a number of iterations independent of $p$ would imply a polylogarithmic-depth algorithm for solving LPs, which, given the current state-of-the-art depth [LS19], would be a significant breakthrough in the field of optimization. We therefore believe that it would be difficult to remove the polynomial dependence on $p$ in our runtime.
Table 1: Comparison of algorithms for computing $\ell_p$ Lewis weights, by authors ([CP15] in three regimes, [Lee16], [LS19], and Theorem 1), the range of $p$ handled, the number of leverage score computations (equivalently, depth), and the total runtime. Starred rows denote algorithms satisfying the weaker approximation guarantee discussed above. (The numerical entries of this table did not survive extraction.)
1.2 Overview of Approach
Before presenting our algorithm, we describe obstacles to directly extending previous work on the problem for $p \in (0, 4)$ to the case $p \ge 4$. For $p \in (0, 4)$, [CP15, LS19] design algorithms that, with a single computation of leverage scores, make constant (dependent on $p$) multiplicative progress on error (such as function error or distance to the optimal point), thus attaining runtimes polylogarithmic in $1/\epsilon$. However, these methods crucially rely on contractive properties that, in contrast to our work, do not necessarily hold for $p \ge 4$.
For example, one of the algorithms in [CP15] starts with a vector $w$ satisfying $c^{-1}\,\overline{w} \le w \le c\,\overline{w}$ entrywise, where $\overline{w}$ is the vector of true Lewis weights and $c > 1$ some constant, and repeatedly applies the fixed-point map implied by Definition 1. Due to this map being a contraction for $|p/2 - 1| < 1$, or equivalently, for $p \in (0, 4)$, recursive calls to it give Lewis weights for such $p$, but the contraction (and, by extension, this method) does not immediately extend to the setting $p \ge 4$.
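As an illustration of the contraction just described, the following sketch (our own rendering, not [CP15]’s implementation; the iteration count and initialization are illustrative) iterates the fixed-point map $w_i \leftarrow \big(a_i^\top(\mathbf{A}^\top \mathbf{W}^{1-2/p}\mathbf{A})^{-1}a_i\big)^{p/2}$, which for $p \in (0, 4)$ converges geometrically:

```python
import numpy as np

def cp15_style_lewis_weights(A, p, iters=200):
    """Sketch of a [CP15]-style fixed-point iteration for p in (0, 4):
    w_i <- (a_i^T (A^T W^{1-2/p} A)^{-1} a_i)^{p/2}."""
    m, n = A.shape
    w = np.full(m, n / m)  # any positive initialization
    for _ in range(iters):
        B = (w ** (0.5 - 1.0 / p))[:, None] * A   # W^{1/2-1/p} A
        G = B.T @ B                               # A^T W^{1-2/p} A
        X = np.linalg.solve(G, A.T)
        quad = np.einsum("ij,ji->i", A, X)        # a_i^T G^{-1} a_i
        w = quad ** (p / 2)
    return w
```

At a fixed point, $w_i = \tau_i(\mathbf{W}^{1/2-1/p}\mathbf{A})$, so the weights sum to $n$ by Fact 1.1; for $p \ge 4$ the map is no longer contractive, matching the discussion above.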
Prior algorithms for $p \ge 4$ therefore resort to alternative optimization techniques. [CP15] frames Lewis weight computation as determinant maximization (1) (see Section D) and applies cutting-plane methods [GLS81, LSW15]. [Lee16] uses mirror descent, and [LS19] uses homotopy methods. These approaches require a number of leverage score computations polynomial in $m$ or in $1/\epsilon$, and therefore, in order to attain polylogarithmic rates, we need to rethink the algorithm.
Our Approach.
As stated in Section 1, to obtain $\epsilon$-approximate Lewis weights for $p \ge 4$, we compute a vector that minimizes the objective of (1) to sufficient additive accuracy. In light of the preceding bottlenecks in prior work, we circumvent techniques that directly target constant multiplicative progress (on some potential) in each iteration.
Our main technical insight is that when the leverage scores for the current weight $w$ satisfy a certain technical condition (inequality (1.2)), it is indeed possible to update $w$ to get a multiplicative decrease in function error, thus resulting in our target runtime. To turn this insight into an algorithm, we design a corrective procedure that ensures that (1.2) is always satisfied: in other words, whenever (1.2) is violated, this procedure updates $w$ so that the new iterate does satisfy (1.2), setting the stage for the aforementioned multiplicative progress. An important additional property of this procedure is that it does not increase the objective function and is therefore in keeping with our goal of minimizing (1).
Specifically, the technical condition that our geometric decrease in function error hinges on is
This ratio follows naturally from the gradient and Hessian of the objective (see Lemma 3). Our algorithm’s update rule to solve (1) is obtained from minimizing a second-order approximation to the objective at the current point, and the condition specified in (1.2) allows us to relate the progress of a type of quasi-Newton step to lower bounds on the progress there is to make, which is critical to turning a runtime of $\mathrm{poly}(1/\epsilon)$ into $\mathrm{polylog}(1/\epsilon)$ (Lemma 5).
The process of updating $w$ so that (1.2) goes from being violated to being satisfied corresponds, geometrically, to sufficiently rounding an associated ellipsoid (see Section C), and this is the reason we use the term “rounding” to describe our corrective procedure to get $w$ to satisfy (1.2) and the term “rounding condition” to refer to (1.2).
We develop two versions of rounding: a parallel method and a sequential one that has an improved dependence on $p$. Each version is based on the principles that (1) one can increase those entries of $w$ at which the rounding condition (1.2) does not hold while decreasing the objective value, and (2) the vector obtained after this update is closer to satisfying (1.2).
We believe that such a principle of identifying a technical condition needed for fast convergence, together with the accompanying rounding procedures, could be useful in other optimization problems. Additionally, we develop Algorithm 4, which, by varying the step sizes in the update rule, maintains (1.2) as an invariant, thereby eliminating the need for separate rounding and progress steps.
1.3 Applications and Related Work
We elaborate here on the applications of Lewis weights we briefly alluded to in Section 1. While approximate weights suffice for many applications (such as pre-processing in optimization [CP15]), solving regularized D-optimal design and computing self-concordant barriers to high precision do require high-precision Lewis weights.
Pre-processing in optimization.
Lewis weights are used as scores to sample rows of a tall input data matrix so that norms of products of the matrix with vectors are preserved. They have been used in row sampling algorithms for data pre-processing [DMM06, DMIMW12, LMP13, CLM+15], for computing dimension-free strong coresets for $k$-median and subspace approximation [SW18], and for fast tensor factorization in the streaming model [CCDS20]. Lewis weights are also used for $\ell_1$ regression, a popular model in machine learning used to capture robustness to outliers: in [DLS18] for stochastic gradient descent pre-conditioning, in [LWYZ20] for quantile regression, and in [BDM+20] to provide algorithms for linear algebraic problems in the sliding window model.
John ellipsoid and D-optimal design.
As noted in Remark 1.1, a fast algorithm for Lewis weights could yield faster algorithms for computing the John ellipsoid, a problem with a long history of work [Kha96, SF04, KY05, DAST08, CCLY19, ZF20]. It is known [Tod16] that the John ellipsoid problem is dual to the (relaxed) D-optimal experiment design problem [Puk06]. D-optimal design seeks to select a set of linear experiments with the largest confidence ellipsoid for its least-squares estimator [AZLSW17, MSTX19, SX20].
Our problem (1) is equivalent to regularized D-optimal design, which can be interpreted as enforcing a polynomial experiment cost: viewing $w_i$ as the fraction of resources allocated to experiment $i$, each experiment is penalized by a cost polynomial in $w_i$. This regularization also appears in fair packing and fair covering problems [MSZ16, DFO20] from operations research.
Self-concordance.
Self-concordant barriers are fundamental in convex optimization [NN94], combinatorial optimization [LS14], sampling [KN09, LLV20], and online learning [AHR08]. Although there are (nearly) optimal self-concordant barriers for any convex set [NN94, BE15, LY18], computing them involves sampling from log-concave distributions, itself an expensive process with a polynomial runtime. [LS14] shows how to construct nearly optimal barriers for polytopes using Lewis weights. Unfortunately, doing so still requires polynomially many steps to compute these weights; [LS14] bypasses this issue by showing that it suffices to work with approximately computed Lewis weights. In this paper, we show how to compute Lewis weights by computing leverage scores of polylogarithmically many matrices. This gives the first nearly optimal self-concordant barrier for polytopes that can be evaluated to high accuracy with depth polylogarithmic in the dimension.
1.4 Notation and Preliminaries
We use $\mathbf{A} \in \mathbb{R}^{m \times n}$ to denote our full-rank ($m \ge n$) real-valued input matrix and $\overline{w}$ to denote the vector of Lewis weights of $\mathbf{A}$, as defined in (1). All matrices appear in boldface uppercase and vectors in lowercase. For any vector (say, $w$), we use its uppercase boldfaced form ($\mathbf{W}$) to denote the diagonal matrix $\mathbf{Diag}(w)$. For a matrix $\mathbf{M}$, the matrix $\mathbf{M} \circ \mathbf{M}$ is the Schur product (entry-wise product) of $\mathbf{M}$ with itself. For matrices $\mathbf{M}$ and $\mathbf{N}$, we use $\mathbf{M} \succeq \mathbf{N}$ to mean $\mathbf{M} - \mathbf{N}$ is positive-semidefinite. For vectors $u$ and $v$, the inequality $u \le v$ applies entry-wise. We use $e_i$ to denote the $i$’th standard basis vector. We define $\alpha := 1 - 2/p$. As in (1), since we defined $\alpha = 1 - 2/p$, the ranges $p > 0$ and $p \ge 2$ translate to $\alpha < 1$ and $\alpha \in [0, 1)$, respectively. From here on, we work with $\alpha \in [0, 1)$. For a matrix $\mathbf{B} \in \mathbb{R}^{m \times n}$ and $i \in [m]$, we define the projection matrix $\mathbf{P} = \mathbf{B}(\mathbf{B}^\top \mathbf{B})^{-1}\mathbf{B}^\top$. The quantity $\mathbf{P}_{ii} = b_i^\top(\mathbf{B}^\top \mathbf{B})^{-1} b_i$ is precisely the leverage score $\sigma_i(\mathbf{B})$ of the $i$’th row of $\mathbf{B}$:
Fact 1.1 ([LS14]).
For all full-rank $\mathbf{B} \in \mathbb{R}^{m \times n}$ we have that for all $i \in [m]$, $\sigma_i(\mathbf{B}) \in [0, 1]$, and $\sum_{i \in [m]} \sigma_i(\mathbf{B}) = n$.
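These leverage-score facts are easy to check numerically. The following snippet (illustrative only; dimensions and seed are arbitrary) verifies that the diagonal of the projection matrix lies in $[0,1]$, sums to $n$, and that the matrix is indeed a projection:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 25, 6
B = rng.standard_normal((m, n))  # full rank with probability 1

# P = B (B^T B)^{-1} B^T, computed via a linear solve.
P = B @ np.linalg.solve(B.T @ B, B.T)
sigma = np.diag(P)  # leverage scores

assert np.all(sigma >= -1e-12) and np.all(sigma <= 1 + 1e-12)
assert abs(sigma.sum() - n) < 1e-8   # trace equals the rank n
assert np.allclose(P @ P, P)         # P is a projection
```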
2 Our Algorithm
We present Algorithm 1 to compute an -additive solution to (1). We first provide the following definitions that we frequently refer to in our algorithm and analysis. Given and , the ’th coordinate of is
Based on this quantity, we define the following procedure, derived from approximating a quasi-Newton update on the objective from (1):
Using these definitions, we can now describe our algorithm. Depending on whether the following condition (“rounding condition”) holds, we run either or in each iteration:
Specifically, if (2) is not satisfied, we run , which returns a vector that does satisfy it without increasing the objective value. We design two versions of , one parallel (Algorithm 2) and one sequential (Algorithm 3), with the sequential algorithm having an improved dependence on , to update the coordinates violating (2). We apply one extra step of rounding to the vector returned after iterations of Algorithm 1 and transform it appropriately to obtain our final output. In the following lemma (proved in Section B), we justify that this output is indeed the solution to (1).
Lemma 1 (Lewis Weights from Optimization Solution).
2.1 Analysis of
We first analyze since it is common to both the parallel and sequential algorithms.
Lemma 2 (Iteration Complexity of ).
As is often the case to obtain such an iteration complexity, we prove Lemma 2 by incorporating the maximum sub-optimality in function value (Lemma 5) and the initial error bound (Lemma 4) into the inequality describing minimum function progress (Lemma 6). The assumption that does not increase the value of the objective is justified in Lemma 7.
Since our algorithm leverages quasi-Newton steps, we begin our analysis by stating the gradient and Hessian of the objective in (1) as well as the error at the initial vector, as measured against the optimal function value. The Hessian below is positive semidefinite when $p \ge 2$ (equivalently, when $\alpha \in [0, 1)$) and not necessarily so otherwise. Consequently, the objective is convex for $p \ge 2$, and we therefore consider only this setting throughout.
Lemma 3 (Gradient and Hessian).
For any , the objective in (1), , has gradient and Hessian given by the following expressions.
Lemma 4 (Initial Sub-Optimality).
2.1.1 Minimum Progress and Maximum Sub-optimality in
We first prove an upper bound on objective sub-optimality, necessary to obtain a runtime polylogarithmic in . Often, to obtain such a rate, the bound involving objective sub-optimality has a quadratic term derived from the Hessian; our lemma is somewhat non-standard in that it uses only the convexity of . Note that this lemma crucially uses (2).
Lemma 5 (Objective Sub-optimality).
Suppose and . Then the value of the objective of (1) at differs from the optimum objective value as follows.
Proof.
Since is convex and , we have
and therefore,
To prove the claim, it suffices to bound each from below. First, note that
(2.4) |
where the first equality used that the minimization problem is convex and a point where the gradient is $0$ is a minimizer, and the second equality used . Applying , , and for yields
(2.5) |
Combining (2.5) with (2.4) yields that
The claim then follows from the fact that for , we have . ∎
Lemma 6 (Function Decrease in ).
Let with for all . Further, let , where Descent is defined in (2). Then, with the following decrease in function objective.
The proof of this lemma resembles that of the quasi-Newton method: first, we write a second-order Taylor approximation of the objective around the current point and apply Fact 1.1 to Lemma 3 to obtain an upper bound on the Hessian. We then use the expression for the update rule in this second-order approximation and simplify to obtain the claim, as detailed in Section A.
2.1.2 Iteration Complexity of
Proof of Lemma 2.
Since Algorithm 1 calls after running , the requirement in Lemma 5 is met. Therefore, we may combine Lemma 5 with Lemma 6 and our choice of step size in Algorithm 1 to get a geometric decrease in function error as follows.
(2.6) |
We apply this inequality recursively over all iterations of Algorithm 1, while also using the assumption that does not increase the objective value. Setting the final value of (2.6) to , bounding the initial error as by Lemma 4, observing , and taking logarithms on both sides of the inequality gives the claimed iteration complexity of . ∎
3 Analysis of : The Parallel Algorithm
The main export of this section is the proof of our main theorem about the parallel algorithm (Theorem 1). This proof combines the iteration count of from the preceding section with the analysis of Algorithm 2 (invoked by in the parallel setting), shown next. In Lemma 7, we show that decreases the function objective, thereby justifying the key assumption in Lemma 2. Lemma 7 also shows an upper bound on the new value of after one while loop of , and by combining this with the maximum value of for termination in Algorithm 2, we get the iteration complexity of in Corollary 1.
Lemma 7 (Outcome of ).
Let be the state of at the end of one while loop of (Algorithm 2). Then, and .
Proof.
Each iteration of the while loop in performs over the set of coordinates . Lemma 6 then immediately proves , which is our first claim.
To prove the second claim, note that in Algorithm 2, for every
where the second step is by the monotonicity of for . Combining it with for all implies that . Therefore, for all , we have
(3.1) |
∎
Corollary 1.
Let be the input to . Then, the number of iterations of the while loop of is at most .
Proof.
Let be the value of at the start of the ’th execution of the while loop of . Repeated application of Lemma 7 over executions of the while loop gives . We set in accordance with the termination condition of . Next, applying , and taking logarithms on both sides yields the claimed limit on the number of iterations, . ∎
Lemma 8.
Over the entire run of Algorithm 1, the while loop of runs for at most iterations if and iterations if .
Proof.
Note that ; consequently, in the first iteration of Algorithm 1, there are at most iterations of the while loop of by Corollary 1. Note that between each call to , for all ,
where the first inequality is by using the fact that the output of satisfies . Therefore, applying the same logic as in (3.1), we get that between two calls to , the value of increases by at most for all . Combining this with Corollary 1 and the total initial iteration count and observing that is the total number of calls to finishes the proof. ∎
3.1 Proof of Main Theorem (Parallel)
Proof.
(Proof of Theorem 1) First, we show correctness. Note that, as a corollary of Lemma 2, . By the properties of Round as shown in Lemma 7, we also have that and for . Therefore, Lemma 1 is applicable, and by the choice of , we conclude that defined as satisfies for all . Combining the iteration counts of from Lemma 2 and of from Lemma 8 yields the total iteration count as if and if . As stated in Section 1.4, , and so translating these rates in terms of gives for and for , thereby proving the stated claim. The cost per iteration is dominated by the multiplication of two matrices; this can be improved using fast matrix multiplication. ∎
4 Analysis of : Sequential Algorithm
We now analyze Algorithm 3. Note that these proofs work for all .
Lemma 9 (Coordinate Step Progress).
Given , a coordinate , and , we have
Proof.
By definition of , we have
Recall the matrix determinant lemma: for an invertible matrix $\mathbf{M}$ and vectors $u, v$, we have $\det(\mathbf{M} + uv^\top) = \det(\mathbf{M})\,(1 + v^\top \mathbf{M}^{-1} u)$. Applying it to the rank-one update in the preceding expression proves the lemma.
∎
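The matrix determinant lemma invoked in the proof above can be sanity-checked numerically; the snippet below (an illustrative check, not part of the proof) compares both sides on a random well-conditioned matrix and a random rank-one update:

```python
import numpy as np

# Check: det(M + u v^T) = det(M) * (1 + v^T M^{-1} u).
rng = np.random.default_rng(3)
n = 5
M = rng.standard_normal((n, n)) + n * np.eye(n)  # keep M well-conditioned
u, v = rng.standard_normal(n), rng.standard_normal(n)

lhs = np.linalg.det(M + np.outer(u, v))
rhs = np.linalg.det(M) * (1 + v @ np.linalg.solve(M, u))
assert abs(lhs - rhs) <= 1e-6 * max(1.0, abs(lhs))
```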
Lemma 10 (Coordinate Step Outcome).
Given and , let for any , where . Then, we have and .
Proof.
We note that . Then, Lemma 9 implies the first claim. Since the update rule optimizes over coordinates one at a time, at each step the coordinate-wise optimality condition (a vanishing partial derivative) is met. The second claim is then proved by noting the relevant monotonicity and by the Sherman-Morrison-Woodbury identity, $(\mathbf{M} + uv^\top)^{-1} = \mathbf{M}^{-1} - \frac{\mathbf{M}^{-1} u v^\top \mathbf{M}^{-1}}{1 + v^\top \mathbf{M}^{-1} u}$:
∎
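The rank-one case of the Sherman-Morrison-Woodbury identity used above can likewise be verified numerically (again, illustrative only):

```python
import numpy as np

# Check: (M + u v^T)^{-1} = M^{-1} - (M^{-1} u v^T M^{-1}) / (1 + v^T M^{-1} u).
rng = np.random.default_rng(4)
n = 5
M = rng.standard_normal((n, n)) + n * np.eye(n)  # keep M well-conditioned
u, v = rng.standard_normal(n), rng.standard_normal(n)

Minv = np.linalg.inv(M)
lhs = np.linalg.inv(M + np.outer(u, v))
rhs = Minv - (Minv @ np.outer(u, v) @ Minv) / (1 + v @ Minv @ u)
assert np.allclose(lhs, rhs)
```

This identity is what lets each coordinate step in the sequential algorithm update the relevant inverse in time proportional to the matrix dimension squared rather than recomputing it from scratch.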
Lemma 11 (Number of Coordinate Steps).
Proof.
4.1 Proof of Main Theorem (Sequential)
We now combine the preceding results to prove the main theorem about the sequential algorithm (Algorithm 1 with Algorithm 3).
Proof.
(Proof of Theorem 2) The proof of correctness is the same as that for Theorem 1 since the parallel and sequential algorithms share the same meta-algorithm. Computing leverage scores in the sequential version (Algorithm 1 with Algorithm 3) takes coordinate steps. The costliest component of a coordinate step is computing . By the Sherman-Morrison-Woodbury formula, computing this costs for each coordinate. Since the initial cost to compute is , the total runtime is . When translated in terms of , this gives for and for . ∎
5 A “One-Step” Parallel Algorithm
We conclude our paper with an alternative algorithm (Algorithm 4) in which each iteration avoids any rounding and performs only the descent step.
Theorem 4 (Main Theorem (One-Step Parallel Algorithm)).
Given a full-rank matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\epsilon > 0$, we can compute $\epsilon$-approximate Lewis weights (1) in iterations. Each iteration computes the leverage score of one row of $\mathbf{D}\mathbf{A}$ for some diagonal matrix $\mathbf{D}$. The total runtime is .
We first spell out the key idea of the proof of Theorem 4 in Lemma 12: the condition (2) is maintained in every iteration through the use of varying step sizes, without explicitly invoking rounding procedures. Since (2) always holds, we may use Lemma 5 in bounding the iteration complexity.
Lemma 12 (Rounding Condition Invariance).
For any iteration in Algorithm 4, if , then .
Proof.
Proof of Theorem 4.
By our choice of for all , we have that by Fact 1.1. Then applying Lemma 12 yields by induction that at every iteration . We may now therefore upper bound the objective sub-optimality from Lemma 5; as before, combining this with the lower bound on progress from Lemma 6 (noticing that ) yields
(5.4) |
Thus, decreases . Using from Lemma 4 and setting (5.4) to gives an iteration complexity of if and otherwise. As in the proofs of Theorems 1 and 2, we can then invoke Lemma 1 to construct the vector that is -approximate to the Lewis weights. ∎
6 Acknowledgements
We are grateful to the anonymous reviewers of SODA 2022 for their careful reading and thoughtful comments that helped us improve our exposition. Maryam Fazel was supported in part by grants NSF TRIPODS II DMS 2023166, NSF TRIPODS CCF 1740551, and NSF CCF 2007036. Yin Tat Lee was supported in part by NSF awards CCF-1749609, DMS-1839116, DMS-2023166, CCF-2105772, a Microsoft Research Faculty Fellowship, Sloan Research Fellowship, and Packard Fellowship. Swati Padmanabhan was supported in part by NSF TRIPODS II DMS 2023166. Aaron Sidford was supported in part by a Microsoft Research Faculty Fellowship, NSF CAREER Award CCF-1844855, NSF Grant CCF-1955039, a PayPal research award, and a Sloan Research Fellowship.
References
- [AHR08] Jacob Abernethy, Elad E Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, COLT 2008, 2008.
- [AZLSW17] Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, and Yining Wang. Near-optimal design of experiments via regret minimization. In Proceedings of the 34th International Conference on Machine Learning, 2017.
- [BDM+20] Vladimir Braverman, Petros Drineas, Cameron Musco, Christopher Musco, Jalaj Upadhyay, David P Woodruff, and Samson Zhou. Near optimal linear algebra in the online and sliding window models. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), 2020.
- [BE15] Sébastien Bubeck and Ronen Eldan. The entropic barrier: a simple and optimal universal self-concordant barrier. In Conference on Learning Theory, 2015.
- [BV04] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
- [CCDS20] Rachit Chhaya, Jayesh Choudhari, Anirban Dasgupta, and Supratim Shit. Streaming coresets for symmetric tensor factorization. In International Conference on Machine Learning, 2020.
- [CCLY19] Michael B. Cohen, Ben Cousins, Yin Tat Lee, and Xin Yang. A near-optimal algorithm for approximating the john ellipsoid. In Proceedings of the Thirty-Second Conference on Learning Theory, 2019.
- [CLM+15] Michael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform sampling for matrix approximation. In Tim Roughgarden, editor, Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, ITCS 2015, Rehovot, Israel, January 11-13, 2015. ACM, 2015.
- [CP15] Michael B. Cohen and Richard Peng. Lp row sampling by lewis weights. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC ’15. Association for Computing Machinery, 2015.
- [DAST08] S Damla Ahipasaoglu, Peng Sun, and Michael J Todd. Linear convergence of a modified frank–wolfe algorithm for computing minimum-volume enclosing ellipsoids. Optimisation Methods and Software, 23(1), 2008.
- [DFO20] Jelena Diakonikolas, Maryam Fazel, and Lorenzo Orecchia. Fair packing and covering on a relative scale. SIAM J. Optim., 30(4), 2020.
- [DLS18] David Durfee, Kevin A Lai, and Saurabh Sawlani. $\ell_1$ regression using lewis weights preconditioning and stochastic gradient descent. In Conference On Learning Theory, 2018.
- [DMIMW12] Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff. Fast approximation of matrix coherence and statistical leverage. The Journal of Machine Learning Research, 13(1), 2012.
- [DMM06] Petros Drineas, Michael W Mahoney, and Shan Muthukrishnan. Sampling algorithms for l 2 regression and applications. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, 2006.
- [GLS81] Martin Grötschel, László Lovász, and Alexander Schrijver. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1(2), 1981.
- [Joh48] Fritz John. Extremum problems with inequalities as subsidiary conditions, studies and essays presented to r. courant on his 60th birthday, january 8, 1948, 1948.
- [Kha96] Leonid G Khachiyan. Rounding of polytopes in the real number model of computation. Mathematics of Operations Research, 21(2), 1996.
- [KN09] Ravi Kannan and Hariharan Narayanan. Random walks on polytopes and an affine interior point method for linear programming. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, STOC ’09, New York, NY, USA, 2009. Association for Computing Machinery.
- [KY05] Piyush Kumar and E Alper Yildirim. Minimum-volume enclosing ellipsoids and core sets. Journal of Optimization Theory and applications, 126(1), 2005.
- [Lee16] Yin Tat Lee. Faster Algorithms for Convex and Combinatorial Optimization. PhD thesis, Massachusetts Institute of Technology, 2016.
- [Lew78] D Lewis. Finite dimensional subspaces of $L_p$. Studia Mathematica, 63(2), 1978.
- [LLV20] Aditi Laddha, Yin Tat Lee, and Santosh Vempala. Strong self-concordance and sampling. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, New York, NY, USA, 2020. Association for Computing Machinery.
- [LMP13] Mu Li, Gary L Miller, and Richard Peng. Iterative row sampling. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, 2013.
- [LS14] Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programs in õ(sqrt(rank)) iterations and faster algorithms for maximum flow. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21, 2014, 2014.
- [LS19] Yin Tat Lee and Aaron Sidford. Solving linear programs with sqrt(rank) linear system solves. CoRR, abs/1910.08033, 2019.
- [LSW15] Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, 2015.
- [LWYZ20] Yi Li, Ruosong Wang, Lin Yang, and Hanrui Zhang. Nearly linear row sampling algorithm for quantile regression. In Proceedings of the 37th International Conference on Machine Learning, 2020.
- [LY18] Yin Tat Lee and Man-Chung Yue. Universal barrier is n-self-concordant. arXiv preprint arXiv:1809.03011, 2018.
- [MSTX19] Vivek Madan, Mohit Singh, Uthaipon Tantipongpipat, and Weijun Xie. Combinatorial algorithms for optimal design. In Proceedings of the Thirty-Second Conference on Learning Theory, 2019.
- [MSZ16] Jelena Marasevic, Clifford Stein, and Gil Zussman. A fast distributed stateless algorithm for α-fair packing problems. In 43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016), volume 55, 2016.
- [NN94] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM, 1994.
- [Puk06] Friedrich Pukelsheim. Optimal Design of Experiments (Classics in Applied Mathematics) (Classics in Applied Mathematics, 50). Society for Industrial and Applied Mathematics, USA, 2006.
- [SF04] Peng Sun and Robert M Freund. Computation of minimum-volume covering ellipsoids. Operations Research, 52(5), 2004.
- [SW18] Christian Sohler and David P Woodruff. Strong coresets for k-median and subspace approximation: Goodbye dimension. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), 2018.
- [SX20] Mohit Singh and Weijun Xie. Approximation algorithms for d-optimal design. Mathematics of Operations Research, 45(4), 2020.
- [Tod16] Michael J. Todd. Minimum volume ellipsoids - theory and algorithms, volume 23 of MOS-SIAM Series on Optimization. SIAM, 2016.
- [Vai89] Pravin M Vaidya. A new algorithm for minimizing convex functions over convex sets. In 30th Annual Symposium on Foundations of Computer Science, 1989.
- [Woj96] Przemyslaw Wojtaszczyk. Banach spaces for analysts. Number 25. Cambridge University Press, 1996.
- [ZF20] Renbo Zhao and Robert M Freund. Analysis of the Frank-Wolfe method for logarithmically-homogeneous barriers, with an extension. arXiv preprint arXiv:2010.08999, 2020.
We start with a piece of notation we frequently use in the appendix. For a given vector v, we use Diag(v) to denote the diagonal matrix with v on its diagonal. For a matrix M, we use diag(M) to denote the vector made up of the diagonal entries of M. Further, recall, as stated in Section 1.4, that given any vector v, we denote Diag(v) by its uppercase boldface name V.
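As a concrete illustration of these conventions, here is a minimal numpy sketch (the particular vector and matrix are arbitrary examples chosen for illustration):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
V = np.diag(v)                      # Diag(v): the diagonal matrix with v on its diagonal
assert np.allclose(np.diag(V), v)   # round-trips back to v

M = np.arange(9.0).reshape(3, 3)
assert np.allclose(np.diag(M), [0.0, 4.0, 8.0])  # diag(M): vector of diagonal entries
```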
Appendix A Technical Proofs: Gradient, Hessian, Initial Error, Minimum Progress
See Lemma 3
Proof.
The proof essentially follows by combining Lemmas 48 and 49 of [LS19]. For completeness, we provide the full proof here. Applying the chain rule to the function and then the definition of from (2) gives the claim that
We now set up some notation to compute the Hessian: let , let be an arbitrary vector, and let . For and for , we let denote the directional derivative of at in the direction , i.e., . Then we have
where the last step follows by symmetry of . This implies
which, in shorthand, is . We may express this Hessian as in the statement of the lemma by writing in terms of . ∎
See Lemma 4
Proof.
We study the two terms constituting the objective in (1). First, by choice of , we have
Next, since leverage scores always lie between zero and one, the optimality condition for (1), , implies , which in turn gives . This implies . Therefore,
Next, observe that , and , where we invoked Fact 1.1. By now applying , we get
Combining (A), (A), and the definition of the objective (1) finishes the claim. ∎
See Lemma 6
Proof.
By the remainder form of Taylor’s theorem, for some and
(A.5)
We prove the result by bounding the quadratic form of from above and leveraging the structure of and . Lemma 3 and Fact 1.1 imply that
Further, the positivity of and and the non-negativity of and imply that for all . Since , this implies that
Consequently, for all , we bound the first term of (A) as
(A.6)
Further, when , we bound the second term of (A) as
and when , we have
Using (A.6), (A), and (A) in (A), we have that in all cases
Applying the definition of to the above Loewner inequality gives
(A.9)
Next, recall that by Lemma 3, for all . Consequently,
Combining (A.5), (A.9), and (A) yields that
The result follows by plugging in , as assumed. ∎
Appendix B From Optimization Problem to Lewis Weights
The goal of this section is to show how to obtain -approximate Lewis weights from an -approximate solution to the problem in (1). Our proof strategy is as follows. First, we use the fact that the vector obtained after the rounding step following the for loop of Algorithm 1 is both -suboptimal (additively) and satisfies the rounding condition (2). In Lemma 1, the -suboptimality is used to bound . Coupled with the rounding condition, Lemma 13 then shows that the vector constructed in the last line of Algorithm 1 satisfies approximate optimality, , for some small . Lemma 14 relates this approximate optimality to coordinate-wise multiplicative closeness between and the vector of true Lewis weights. Finally, in Lemma 1, we pick appropriate approximation factors for each of the invoked lemmas and prove the desired approximation guarantee. Since the vector obtained at the end of the for loop of Algorithm 4 satisfies the same properties as , the same lemmas apply to Algorithm 4 as well. We begin with some technical lemmas.
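For intuition about the object this pipeline certifies, the following numpy sketch computes Lewis weights by the simple fixed-point iteration of [CP15] — which converges for p < 4 and is not the high-precision algorithm of this paper — and then checks the defining identity that each weight equals the corresponding leverage score of W^{1/2-1/p}A. The matrix A, the choice p = 3, and the iteration count are illustrative assumptions.

```python
import numpy as np

def lewis_weights_fixed_point(A, p, iters=100):
    """Sketch of the [CP15] fixed-point iteration (converges for p < 4):
    w_i <- (a_i^T (A^T W^{1-2/p} A)^{-1} a_i)^{p/2}."""
    n, d = A.shape
    w = np.full(n, d / n)  # any positive initialization works here
    for _ in range(iters):
        G = A.T @ (w[:, None] ** (1 - 2 / p) * A)             # A^T W^{1-2/p} A
        ell = np.einsum('ij,ij->i', A @ np.linalg.inv(G), A)  # a_i^T G^{-1} a_i
        w = ell ** (p / 2)
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
p = 3
w = lewis_weights_fixed_point(A, p)

# Defining identity: w_i = sigma_i(W^{1/2 - 1/p} A)
M = (w[:, None] ** (0.5 - 1 / p)) * A
sigma = np.einsum('ij,ij->i', M @ np.linalg.inv(M.T @ M), M)
assert np.allclose(w, sigma, rtol=1e-8)
assert np.isclose(w.sum(), A.shape[1])  # Lewis weights sum to d
```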
B.1 From Approximate Closeness to Approximate Optimality
Lemma 13.
Let be such that for some parameter , and also let . Define . Then, for the parameter , we have that .
Proof.
Our strategy to prove is to first note that it is equivalent to proving and then, using the definition of from the statement of the lemma, to instead prove .
To this end, we split into two matrices based on the size of its coordinates, setting the following notation. Define to be the diagonal matrix with zeroes at indices corresponding to , and to be the diagonal matrix with zeroes at indices corresponding to . We first show that and are small compared to and can therefore be ignored in the preceding desired approximation. We then prove that for , we have . This proof technique is inspired by Lemma 4 of [Vai89].
First, we prove that is small compared to . Since (2) is satisfied, we have
Combining this with the definition of as in the statement of the lemma, we may use non-negativity of to derive
We apply this inequality in the following expression to obtain
(B.2)
This implies that (given , we have ; then, if , we have , and combining these with the previous matrix inequality, we conclude that , which implies that ).
Our next goal is to bound in terms of , which we do by first bounding it in terms of and then bounding in terms of . By definition, . Further, by assumption, . Therefore, for any
and
By our choice of , for , we have
Further, we have the following inequality:
Hence, we can combine (B.1), (B.1), and (B.1) to see that
Set for the upper bound.
For the lower bound, we bound and, therefore, also . Observe that
where the second step is by , as assumed in the lemma. This implies that
and therefore that
Repeating the method for the upper bound then finishes the proof. ∎
B.2 From Approximate Optimality to Approximate Lewis Weights
In this section, we go from the previous notion of approximation to the one we finally seek in (1). Specifically, we show that if , then . To prove this, we first give a technical result. We recall notation stated in Section 1.4: for any projection matrix , we have the vector of leverage scores .
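The facts about leverage scores invoked throughout this section — that σ(P) = diag(P) for a projection matrix P, that each leverage score lies in [0, 1], and that they sum to the rank — can be checked concretely; the matrix below is an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 3))

# Orthogonal projection onto the column span of A
P = A @ np.linalg.inv(A.T @ A) @ A.T
assert np.allclose(P @ P, P)                # P is indeed a projection

sigma = np.diag(P)                          # leverage scores: sigma(P) = diag(P)
assert np.all((sigma >= 0) & (sigma <= 1))  # each score lies in [0, 1]
assert np.isclose(sigma.sum(), 3)           # their sum is trace(P) = rank = d
```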
Claim 1.
For any projection matrix , , and vector , we have that
Proof.
Let . Since (Fact 1.1), we have that and . Consequently, taking norms in terms of these matrices gives
(B.6)
Next, since by Lemma 47 of [LS19] we have for all , we see that for all ; further, since by definition of we have for all , it follows that
(B.7)
Combining (B.6) and (B.7) and using that yields the claim. ∎
Lemma 14.
Let be a vector that satisfies approximate optimality of (1) in the following sense:
Then, is also coordinate-wise multiplicatively close to , the true vector of Lewis weights, as formalized below.
Proof.
For all , let so that and . Further, for all , let be the unique solution to
Then we have the following gradients.
(B.8)
(B.9)
Consequently, by optimality of as defined in (B.2), we have . Rearranging the terms of this equation yields that
and therefore and . To prove the lemma, it thus suffices to bound
(B.11)
To bound (B.11), it remains to compute and apply Claim 1. To do this, note that
Using that , we have, by rearranging the above equation and applying (B.8) and (B.9), that
(B.12)
Applying (B.2) to (B.12), we have that
Applying Claim 1 to the above equality and substituting into (B.11) therefore yields
∎
B.3 From Optimization Problem to Approximate Lewis Weights
See Lemma 1
Proof.
We are given a vector satisfying . Then, by Lemma 5, we have for each . This bound implies that for all : if not, then by and the fact that is decreasing over for a fixed , we would obtain , a contradiction. Therefore . Coupled with the provided guarantee , the requirements of Lemma 13 are thus met with , for , and Algorithm 1 therefore guarantees a satisfying . We can now apply Lemma 14 with , and choosing lets us conclude that , as claimed. ∎
Appendix C A Geometric View of Rounding
At the end of Algorithms 2 and 3, the iterate satisfies the condition . We now show the geometry implied by this condition, thereby providing the reason behind the terminology “rounding.”
Lemma 15.
Let be such that . Define the ellipsoid . Then, we have that
Proof.
Consider any point . Then, by the Cauchy-Schwarz inequality and ,
∎
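To make the geometry tangible: under the natural reading in which E is the Lewis ellipsoid {x : x^T A^T W^{1-2/p} A x ≤ 1} — this specific choice of E and the bound |a_i^T x| ≤ w_i^{1/p} below are our reconstruction, since the displayed equations above are not reproduced here — the Cauchy-Schwarz step can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 4))
n, d = A.shape
p = 3

# Lewis weights via the [CP15] fixed point (sketch; converges for p < 4)
w = np.full(n, d / n)
for _ in range(200):
    G = A.T @ (w[:, None] ** (1 - 2 / p) * A)
    w = np.einsum('ij,ij->i', A @ np.linalg.inv(G), A) ** (p / 2)

G = A.T @ (w[:, None] ** (1 - 2 / p) * A)
L = np.linalg.cholesky(np.linalg.inv(G))    # L @ L.T = G^{-1}
for _ in range(100):
    u = rng.standard_normal(d)
    x = L @ (u / np.linalg.norm(u))         # a point with x^T G x = 1 (boundary of E)
    # Cauchy-Schwarz: |a_i^T x| <= (a_i^T G^{-1} a_i)^{1/2} (x^T G x)^{1/2} = w_i^{1/p}
    assert np.all(np.abs(A @ x) <= w ** (1 / p) + 1e-9)
assert np.all(w <= 1 + 1e-9)                # hence E lies inside {x : ||Ax||_inf <= 1}
```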
Appendix D Explanations of Runtimes in Prior Work
The convex program (1) formulated by [CP15] has a variable size of . Therefore, by [LSW15], the number of iterations to solve it using the cutting plane method is , with each iteration computing for . This can be computed by multiplying an matrix with an matrix, which costs between (at least the size of the larger input matrix) and (each entry of the resulting matrix is obtained via an inner product of length- vectors). Further, the cutting plane method performs at least an additional total of work. This gives a total cost of at least . The runtime of [Lee16] follows from Theorem .