
Computing Lewis Weights to High Precision

Maryam Fazel (University of Washington)    Yin Tat Lee (University of Washington)    Swati Padmanabhan (University of Washington)    Aaron Sidford (Stanford University)

We present an algorithm for computing approximate $\ell_p$ Lewis weights to high precision. Given a full-rank $\mathbf{A}\in\mathbb{R}^{m\times n}$ with $m\geq n$ and a scalar $p>2$, our algorithm computes $\epsilon$-approximate $\ell_p$ Lewis weights of $\mathbf{A}$ in $\widetilde{O}_{p}(\log(1/\epsilon))$ iterations; the cost of each iteration is linear in the input size plus the cost of computing the leverage scores of $\mathbf{D}\mathbf{A}$ for diagonal $\mathbf{D}\in\mathbb{R}^{m\times m}$. Prior to our work, such a computational complexity was known only for $p\in(0,4)$ [CP15], and combined with this result, our work yields the first polylogarithmic-depth polynomial-work algorithm for the problem of computing $\ell_p$ Lewis weights to high precision for all constant $p>0$. An important consequence of this result is also the first polylogarithmic-depth polynomial-work algorithm for computing a nearly optimal self-concordant barrier for a polytope.

1 Introduction to Lewis Weights

In this paper, we study the problem of computing the $\ell_p$ Lewis weights (from hereon, simply “Lewis weights” for brevity) of a matrix.

Definition 1.

[Lew78, CP15] Given a full-rank matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ with $m\geq n$ and a scalar $p\in(0,\infty)$, the Lewis weights of $\mathbf{A}$ are the entries of the unique vector $\overline{w}\in\mathbb{R}^{m}$ (existence and uniqueness were first proven by D. R. Lewis [Lew78], after whom the weights are named) satisfying the equation

\overline{w}_{i}^{2/p}=a_{i}^{\top}(\mathbf{A}^{\top}\overline{\mathbf{W}}^{1-2/p}\mathbf{A})^{-1}a_{i}\ \text{ for all } i\in[m],

where $a_{i}$ is the $i$'th row of matrix $\mathbf{A}$ and $\overline{\mathbf{W}}$ is the diagonal matrix with vector $\overline{w}$ on the diagonal.
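To make the fixed-point condition above concrete, the following minimal sketch (assuming NumPy; the helper name lewis_residual is ours, not from the paper) measures how far a candidate weight vector $w$ is from satisfying Definition 1 by forming the quadratic forms directly.

```python
import numpy as np

def lewis_residual(A, w, p):
    """Maximum violation of w_i^{2/p} = a_i^T (A^T W^{1-2/p} A)^{-1} a_i over all rows i."""
    W_pow = w ** (1.0 - 2.0 / p)                        # diagonal entries of W^{1-2/p}
    M_inv = np.linalg.inv(A.T @ (W_pow[:, None] * A))   # (A^T W^{1-2/p} A)^{-1}
    quad = np.einsum('ij,jk,ik->i', A, M_inv, A)        # a_i^T M_inv a_i for every i
    return np.max(np.abs(w ** (2.0 / p) - quad))
```

At the exact Lewis weights this residual is zero.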

Motivation. We contextualize our problem with a simpler, geometric notion. Given a set of $m$ points $\{a_{i}\}_{i=1}^{m}\subset\mathbb{R}^{n}$ (the rows of the preceding matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$), their John ellipsoid [Joh48] is the minimum volume ellipsoid enclosing them. (The term John ellipsoid may also refer to the maximal volume ellipsoid contained in the set $\{x:|x^{\top}a_{i}|\leq 1\}$, but in this paper, we use the former definition.) This ellipsoid finds use across experiment design and computational geometry [Tod16] and is central to certain cutting-plane methods [Vai89, LSW15], a class of algorithms fundamental to mathematical optimization (Section 1.3). It turns out that the John ellipsoid of a set of points $\{a_{i}\}_{i=1}^{m}\subset\mathbb{R}^{n}$ is expressible [BV04] as the solution to the following convex program, with the objective being a stand-in for the volume of the ellipsoid and the constraints encoding the requirement that each given point $a_{i}$ lie within the ellipsoid:

\mbox{minimize}_{\mathbf{M}\succeq 0}\ \det(\mathbf{M})^{-1},\ \textup{ subject to } a_{i}^{\top}\mathbf{M}a_{i}\leq 1\ \textup{ for all } i\in[m].

The problem (1) may be generalized by the following convex program [Woj96, CP15], the generalization immediate from substituting $p=\infty$ in (1):

\mbox{minimize}_{\mathbf{M}\succeq 0}\ \det(\mathbf{M})^{-1},\ \textup{ subject to } \sum_{i=1}^{m}(a_{i}^{\top}\mathbf{M}a_{i})^{p/2}\leq 1.

Geometrically, (1) seeks the minimum volume ellipsoid with a bound on the $p/2$-norm of the distance of the points to the ellipsoid, and its solution $\mathbf{M}$ is the “Lewis ellipsoid” [CP15] of $\{a_{i}\}_{i=1}^{m}$.

The optimality condition of (1), written using $\overline{w}\in\mathbb{R}^{m}$ defined as $\overline{w}_{i}\stackrel{\mathrm{def}}{=}(a_{i}^{\top}\mathbf{M}a_{i})^{p/2}$, is equivalent to (1), and this demonstrates that solving (1) is one approach to obtaining the Lewis weights of $\mathbf{A}$ (see [CP15]). This equivalence also underscores the fact that the problem of computing Lewis weights is a natural $\ell_p$ generalization of the problem of computing the John ellipsoid.

More broadly, Lewis weights are ubiquitous across statistics, machine learning, and mathematical optimization in diverse applications, of which we presently highlight two (see Section 1.3 for details). First, their interpretation as “importance scores” of rows of matrices makes them key to shrinking the row dimension of input data [DMM06]. Second, through their role in constructing self-concordant barriers of polytopes [LS14], variants of Lewis weights have found prominence in recent advances in the computational complexity of linear programming.

From a purely optimization perspective, Lewis weights may be viewed as the optimal solution to the following convex optimization problem (which is in fact essentially dual to (1)):

\overline{w}=\arg\min_{w\in\mathbb{R}^{m}_{>0}}\mathcal{F}(w)\stackrel{\mathrm{def}}{=}-\log\det(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})+\frac{1}{1+\alpha}\mathbf{1}^{\top}w^{1+\alpha},\ \textrm{ for } \alpha=\tfrac{2}{p-2}.

As elaborated in [CP15, LS19], the reason this problem yields the Lewis weights is that an appropriate scaling of its solution $\overline{w}$ transforms its optimality condition from $\overline{w}_{i}^{\alpha}=a_{i}^{\top}(\mathbf{A}^{\top}\overline{\mathbf{W}}\mathbf{A})^{-1}a_{i}$ to (1). The problem (1) is a simple and natural one and, in the case of $\alpha=1$ (corresponding to the John ellipsoid), has been the subject of study for designing new optimization methods [Tod16].
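For reference, $\mathcal{F}$ as written above can be evaluated directly; the sketch below (assuming NumPy; lewis_objective is a name we introduce for illustration) does exactly that.

```python
import numpy as np

def lewis_objective(A, w, p):
    """F(w) = -logdet(A^T W A) + (1/(1+alpha)) * sum_i w_i^(1+alpha), with alpha = 2/(p-2)."""
    alpha = 2.0 / (p - 2.0)
    sign, logabsdet = np.linalg.slogdet(A.T @ (w[:, None] * A))
    assert sign > 0, "A^T W A should be positive definite for full-rank A and w > 0"
    return -logabsdet + np.sum(w ** (1.0 + alpha)) / (1.0 + alpha)
```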

In summary, Lewis weights naturally arise as generalizations of extensively studied problems in convex geometry and optimization. This, coupled with their role in machine learning, makes understanding the complexity of computing Lewis weights, i.e., solving (1), a fundamental problem.

Our Goal.

We aim to design high-precision algorithms for computing $\varepsilon$-approximate Lewis weights, i.e., a vector $w\in\mathbb{R}^{m}$ satisfying

w_{i}\approx_{\varepsilon}\overline{w}_{i}\ \text{ for all } i\in[m],\ \text{ where } \overline{w}\text{ is defined in Definition 1 and (1)},

where $a\approx_{\varepsilon}b$ denotes $(1-\varepsilon)a\leq b\leq(1+\varepsilon)a$. To this end, we design algorithms to solve the convex program (1) to $\widetilde{\varepsilon}$-additive accuracy for an appropriate $\widetilde{\varepsilon}=\mathrm{poly}(\varepsilon,n)$, which we prove suffices in Lemma 1.

By a “high-precision” algorithm, we mean one with a runtime polylogarithmic in $1/\varepsilon$. We emphasize that for several applications such as randomized sampling [CP15], approximate Lewis weights suffice; however, we believe that high-precision methods such as ours enrich our understanding of the structure of the optimization problem (1). Further, as stated in Theorem 3, such methods yield new runtimes for directly computing a near-optimal self-concordant barrier for polytopes.

We use the number of leverage score computations as the complexity measure of our algorithms. Our choice is a result of the fact that leverage scores of appropriately scaled matrices appear both in $\nabla\mathcal{F}(w)$ (see Lemma 3) and in the verification of correctness of Lewis weights. This measure of complexity stresses the number of iterations rather than the details of iteration costs (which depend on the exact techniques used for leverage score computation, e.g., fast matrix multiplication) and is consistent with many prior algorithms (see Table 1).

Prior Results.

The first polynomial-time algorithm for computing Lewis weights was presented by [CP15] and performed only $\widetilde{O}_{p}(\log(1/\varepsilon))$ leverage score computations. (We use $O_{p}$ to hide a polynomial in $p$, and $\widetilde{O}$ and $\widetilde{\Omega}$ to hide factors polylogarithmic in $p$, $n$, and $m$.) However, their result holds only for $p\in(0,4)$. We explain the source of this limited range in Section 1.2.

In comparison, for $p\geq 4$, existing algorithms are slower: the algorithms by [CP15], [Lee16], and [LS19] perform $\widetilde{\Omega}(n)$, $\widetilde{O}(1/\varepsilon)$, and $\widetilde{O}(\sqrt{n})$ leverage score computations, respectively. [CP15] also gave an algorithm with total runtime $\mathcal{O}(\tfrac{1}{\varepsilon}\mathrm{nnz}(\mathbf{A})+c_{p}n^{O(p)})$. Of note is the fact that the algorithms with runtimes polynomial in $1/\varepsilon$ ([Lee16, CP15]) satisfy the weaker approximation condition $\overline{w}_{i}^{2/p}\approx_{\varepsilon}a_{i}^{\top}(\mathbf{A}^{\top}\overline{\mathbf{W}}^{1-2/p}\mathbf{A})^{-1}a_{i}$, which is in fact implied by our condition (1).

We display these runtimes in Table 1, assuming that the cost of a leverage score computation is $O(mn^{2})$ (which, we reiterate, may be reduced through the use of fast matrix multiplication). In terms of the number of leverage score computations, Table 1 highlights the contrast between the polylogarithmic dependence on input size and accuracy for $p\in(0,4)$ and the polynomial dependence on these factors for $p\geq 4$. The motivation behind our paper is to close this gap.

1.1 Our Contribution

We design an algorithm that computes Lewis weights to high precision for all $p>2$ using only $\widetilde{O}_{p}(\log(1/\varepsilon))$ leverage score computations. Together with [CP15]'s result for $p\in(0,4)$, our result therefore completes the picture on a near-optimal reduction from leverage scores to Lewis weights for all $p>0$.

Theorem 1 (Main Theorem (Parallel)).

Given a full-rank matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ and $p\geq 4$, we can compute (Algorithms 1 and 2) its $\varepsilon$-approximate Lewis weights (1) in $O(p^{3}\log(mp/\varepsilon))$ iterations. (Our algorithms work for all $p>2$, as can be seen in our proof in Section 3.1; however, for $p\in(2,4)$, the algorithm of [CP15] is faster, and we therefore state runtimes in our main theorems only for $p\geq 4$.) Each iteration computes the leverage scores of a matrix $\mathbf{D}\mathbf{A}$ for a diagonal matrix $\mathbf{D}$. The total runtime is $O(p^{3}mn^{2}\log(mp/\varepsilon))$, with $O(p^{3}\log(mp/\varepsilon)\log^{2}(m))$ depth.

Theorem 1 is attained by a parallel algorithm for computing Lewis weights that consists of polylogarithmically many rounds of leverage score computations and therefore has polylogarithmic depth, a result that was not known prior to this work.

Theorem 2 (Main Theorem (Sequential)).

Given a full-rank matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ and $p\geq 4$, we can compute (Algorithms 1 and 3) its $\varepsilon$-approximate Lewis weights (1) in $O(pm\log(mp/\varepsilon))$ iterations. Each iteration computes the leverage score of one row of $\mathbf{D}\mathbf{A}$ for a diagonal matrix $\mathbf{D}$. The total runtime is $O(pmn^{2}\log(mp/\varepsilon))$.

Remark 1.1.

The solution to (1) characterizes a “Lewis ellipsoid,” and the $\ell_{\infty}$ Lewis ellipsoid of $\mathbf{A}$ is precisely its John ellipsoid. After symmetrization [Tod16], computing the John ellipsoid is equivalent to solving a linear program (LP). Therefore, computing Lewis weights in $O(\log(mp/\varepsilon))$ iterations would imply a polylogarithmic-depth algorithm for solving LPs, which, given the current $O(\sqrt{n})$ depth [LS19], would be a significant breakthrough in the field of optimization. We therefore believe that it would be difficult to remove the polynomial dependence on $p$ in our runtime.

Authors | Range of $p$ | Number of Leverage Score Computations/Depth | Total Runtime
[CP15] | $p\in(0,4)$ | $O\left(\tfrac{1}{1-|1-p/2|}\cdot\log(\tfrac{\log(m)}{\varepsilon})\right)$ | $O\left(\tfrac{1}{1-|1-p/2|}\cdot mn^{2}\cdot\log(\tfrac{\log(m)}{\varepsilon})\right)$
[CP15] | $p\geq 4$ | $\Omega(n)$ | $\Omega(mn^{3}\cdot\log(\tfrac{m}{\varepsilon}))$
[CP15]* | $p\geq 4$ | not applicable | $O\left(\tfrac{\mathrm{nnz}(\mathbf{A})}{\varepsilon}+c_{p}n^{O(p)}\right)$
[Lee16]* | $p\geq 4$ | $O\left(\tfrac{1}{\varepsilon}\cdot\log(m/n)\right)$ | $O\left(\left(\tfrac{\mathrm{nnz}(\mathbf{A})}{\varepsilon}+\tfrac{n^{3}}{\varepsilon^{3}}\right)\cdot\log(m/n)\right)$
[LS19] | $p\geq 4$ | $O(p^{2}\cdot n^{1/2}\cdot\log(\tfrac{1}{\varepsilon}))$ | $O(p^{2}\cdot mn^{2.5}\cdot\mathrm{poly}\log(\tfrac{m}{\varepsilon}))$
Theorem 1 | $p\geq 4$ | $O(p^{3}\cdot\log(\tfrac{mp}{\varepsilon}))$ | $O(p^{3}\cdot mn^{2}\cdot\log(\tfrac{mp}{\varepsilon}))$
Table 1: Runtime comparison for computing Lewis weights. Results with asterisks use a weaker notion of approximation than our paper (1). All dependencies on $n$ in the running times of these methods can be improved using fast matrix multiplication.

1.2 Overview of Approach

Before presenting our algorithm, we describe obstacles to directly extending previous work on the problem for $p\in(0,4)$ to the case $p\geq 4$. For $p\in(0,4)$, [CP15, LS19] design algorithms that, with a single computation of leverage scores, make constant (dependent on $p$) multiplicative progress on error (such as function error or distance to the optimal point), thus attaining runtimes polylogarithmic in $1/\varepsilon$. However, these methods crucially rely on contractive properties that, in contrast to our work, do not necessarily hold for $p\geq 4$.

For example, one of the algorithms in [CP15] starts with a vector $v\approx_{c}\overline{w}$, where $\overline{w}$ is the vector of true Lewis weights and $c$ some constant. Consequently, we have $(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{V}^{1-2/p}\mathbf{A})^{-1}a_{i})^{p/2}\approx_{c^{|p/2-1|}}(a_{i}^{\top}(\mathbf{A}^{\top}\overline{\mathbf{W}}^{1-2/p}\mathbf{A})^{-1}a_{i})^{p/2}$. Since this map is a contraction for $|p/2-1|<1$, or equivalently, for $p\in(0,4)$, $O(\log\log n)$ recursive calls to it give Lewis weights for $p<4$; but the contraction, and by extension this method, does not immediately extend to the setting $p\geq 4$.

Prior algorithms for $p\geq 4$ therefore resort to alternate optimization techniques. [CP15] frames Lewis weight computation as determinant maximization (1) (see Section D) and applies cutting-plane methods [GLS81, LSW15]. [Lee16] uses mirror descent, and [LS19] uses homotopy methods. These approaches yield runtimes with $\mathrm{poly}(n)$ or $\mathrm{poly}(\tfrac{1}{\varepsilon})$ leverage score computations, and therefore, in order to attain runtimes of $\mathrm{polylog}(1/\varepsilon)$ leverage score computations, we need to rethink the algorithm.

Our Approach.

As stated in Section 1, to obtain $\varepsilon$-approximate Lewis weights for $p\geq 4$, we compute a $w$ that satisfies $\mathcal{F}(\overline{w})\leq\mathcal{F}(w)\leq\mathcal{F}(\overline{w})+\widetilde{\varepsilon}$, where $\mathcal{F}$ and $\overline{w}$ are as defined in (1) and $\widetilde{\varepsilon}=O(\mathrm{poly}(n,\varepsilon))$. In light of the preceding bottlenecks in prior work, we circumvent techniques that directly target constant multiplicative progress (on some potential) in each iteration.

Our main technical insight is that when the leverage scores for the current weight $w\in\mathbb{R}^{m}_{>0}$ satisfy a certain technical condition (inequality (1.2)), it is indeed possible to update $w$ to obtain a multiplicative decrease in the function error $\mathcal{F}(w)-\mathcal{F}(\overline{w})$, thus resulting in our target runtime. To turn this insight into an algorithm, we design a corrective procedure that ensures that (1.2) is always satisfied: in other words, whenever (1.2) is violated, this procedure updates $w$ so that the new $w$ does satisfy (1.2), setting the stage for the aforementioned multiplicative progress. An important additional property of this procedure is that it does not increase the objective function and is therefore in keeping with our goal of minimizing (1).

Specifically, the technical condition that our geometric decrease in function error hinges on is

\max_{i\in[m]}\frac{a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i}}{w_{i}^{\alpha}}\leq 1+\alpha\,.

This ratio follows naturally from the gradient and Hessian of the function objective (see Lemma 3). Our algorithm's update rule to solve (1) is obtained from minimizing a second-order approximation to the objective at the current point, and the condition specified in (1.2) allows us to relate the progress of a type of quasi-Newton step to lower bounds on the progress there is to make, which is critical to turning a runtime of $\mathrm{poly}(1/\varepsilon)$ into $\mathrm{polylog}(1/\varepsilon)$ (Lemma 5).

The process of updating $w$ so that (1.2) goes from being violated to being satisfied corresponds, geometrically, to sufficiently rounding the ellipsoid $\mathcal{E}(w)=\{x:x^{\top}\mathbf{A}^{\top}\mathbf{W}\mathbf{A}x\leq 1\}$; specifically, the updated ellipsoid satisfies $\mathcal{E}(w)\subseteq\{\|\mathbf{W}^{\frac{1}{2-p}}\mathbf{A}x\|_{\infty}\leq\sqrt{1+\alpha}\}$ (see Section C), and this is the reason we use the term “rounding” to describe our corrective procedure to get $w$ to satisfy (1.2) and the term “rounding condition” to refer to (1.2).

We develop two versions of rounding: a parallel method and a sequential one that has an improved dependence on $p$. Each version is based on the principles that (1) one can increase those entries of $w$ at which the rounding condition (1.2) does not hold while decreasing the objective value, and (2) the vector $w$ obtained after this update is closer to satisfying (1.2).

We believe that such a principle of identifying a technical condition needed for fast convergence, along with the accompanying rounding procedures, could be useful in other optimization problems. Additionally, we develop Algorithm 4, which, by varying the step sizes in the update rule, maintains (1.2) as an invariant, thereby eliminating the need for separate rounding and progress steps.

1.3 Applications and Related Work

We elaborate here on the applications of Lewis weights we briefly alluded to in Section 1. While for many applications (such as pre-processing in optimization [CP15]) approximate weights suffice, solving regularized D-optimal design and computing $\tilde{O}(n)$-self-concordant barriers to high precision do require high-precision Lewis weights.

Pre-processing in optimization.

Lewis weights are used as scores to sample rows of an input tall data matrix so that the $\ell_p$ norms of the product of the matrix with vectors are preserved. They have been used in row sampling algorithms for data pre-processing [DMM06, DMIMW12, LMP13, CLM+15], for computing dimension-free strong coresets for $k$-median and subspace approximation [SW18], and for fast tensor factorization in the streaming model [CCDS20]. Lewis weights are also used for $\ell_1$ regression, a popular model in machine learning used to capture robustness to outliers, in: [DLS18] for stochastic gradient descent pre-conditioning, [LWYZ20] for quantile regression, and [BDM+20] to provide algorithms for linear algebraic problems in the sliding window model.

John ellipsoid and D-optimal design.

As noted in Remark 1.1, a fast algorithm for Lewis weights could yield faster algorithms for computing the John ellipsoid, a problem with a long history of work [Kha96, SF04, KY05, DAST08, CCLY19, ZF20]. It is known [Tod16] that the John ellipsoid problem is dual to the (relaxed) D-optimal experiment design problem [Puk06]. D-optimal design seeks to select a set of linear experiments with the largest confidence ellipsoid for its least-squares estimator [AZLSW17, MSTX19, SX20].

Our problem (1) is equivalent to $\frac{p}{p-2}$-regularized D-optimal design, which can be interpreted as enforcing a polynomial experiment cost: viewing $w_{i}$ as the fraction of resources allocated to experiment $i$, each $w_{i}$ is penalized by $w_{i}^{\frac{p}{p-2}}$. This regularization also appears in fair packing and fair covering problems [MSZ16, DFO20] from operations research.

Self-concordance.

Self-concordant barriers are fundamental in convex optimization [NN94], combinatorial optimization [LS14], sampling [KN09, LLV20], and online learning [AHR08]. Although there are (nearly) optimal self-concordant barriers for any convex set [NN94, BE15, LY18], computing them involves sampling from log-concave distributions, itself an expensive process with a $\mathrm{poly}(1/\varepsilon)$ runtime. [LS14] shows how to construct nearly optimal barriers for polytopes using Lewis weights. Unfortunately, doing so still requires polynomially many steps to compute these weights; [LS14] bypasses this issue by showing that it suffices to work with Lewis weights for $p\approx 1$. In this paper, we show how to compute Lewis weights by computing the leverage scores of polylogarithmically many matrices. This gives the first nearly optimal self-concordant barrier for polytopes that can be evaluated to high accuracy with depth polylogarithmic in the dimension.

Theorem 3 (Applying Theorem 1 to [LS19, Section 5]).

Given a non-empty polytope $P=\{x\in\mathbb{R}^{n}~|~\mathbf{A}x>b\}$ for full-rank $\mathbf{A}\in\mathbb{R}^{m\times n}$, there is an $O(n\log^{5}m)$-self-concordant barrier $\psi$ for $P$ such that for any $\epsilon>0$ and $x\in P$, in $O(mn^{\omega-1}\log^{3}m\log(m/\epsilon))$ work and $O(\log^{3}m\log(m/\epsilon))$ depth, we can compute $g\in\mathbb{R}^{n}$ and $\mathbf{H}\in\mathbb{R}^{n\times n}$ with $\|g-\nabla\psi(x)\|_{\nabla^{2}\psi(x)^{-1}}\leq\epsilon$ and $\nabla^{2}\psi(x)\preceq\mathbf{H}\preceq O(\log m)\nabla^{2}\psi(x)$. With an additional $O(m^{\omega+o(1)})$ work, $\mathbf{H}\in\mathbb{R}^{n\times n}$ with $(1-\epsilon)\nabla^{2}\psi(x)\preceq\mathbf{H}\preceq O(1+\epsilon)\nabla^{2}\psi(x)$ can be computed as well.

1.4 Notation and Preliminaries

We use $\mathbf{A}$ to denote our full-rank $m\times n$ ($m\geq n$) real-valued input matrix and $\overline{w}\in\mathbb{R}^{m}$ to denote the vector of Lewis weights of $\mathbf{A}$, as defined in (1) and (1). All matrices appear in boldface uppercase and vectors in lowercase. For any vector (say, $\sigma$), we use its uppercase boldfaced form ($\mathbf{\Sigma}$) to denote the diagonal matrix with $\mathbf{\Sigma}_{ii}=\sigma_{i}$. For a matrix $\mathbf{M}$, the matrix $\mathbf{M}^{(2)}$ is the Schur product (entry-wise product) of $\mathbf{M}$ with itself. For matrices $\mathbf{A}$ and $\mathbf{B}$, we use $\mathbf{A}\succeq\mathbf{B}$ to mean that $\mathbf{A}-\mathbf{B}$ is positive semidefinite. For vectors $a$ and $b$, the inequality $a\leq b$ applies entry-wise. We use $e_{i}$ to denote the $i$'th standard basis vector. We define $[n]\stackrel{\mathrm{def}}{=}\{1,2,\dotsc,n\}$. As in (1), since we defined $\alpha\stackrel{\mathrm{def}}{=}\frac{2}{p-2}$, the ranges $p\in(2,4)$ and $p\geq 4$ translate to $\alpha>1$ and $\alpha\in(0,1]$, respectively. From hereon, we work with $\alpha$. We also define $\bar{\alpha}=\max\{1,\alpha\}$. For a matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ and $w\in\mathbb{R}^{m}_{>0}$, we define the projection matrix $\mathbf{P}(w)\stackrel{\mathrm{def}}{=}\mathbf{W}^{1/2}\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top}\mathbf{W}^{1/2}\in\mathbb{R}^{m\times m}$. The quantity $\mathbf{P}(w)_{ii}$ is precisely the leverage score of the $i$'th row of $\mathbf{W}^{1/2}\mathbf{A}$:

\sigma_{i}(w)\stackrel{\mathrm{def}}{=}w_{i}\cdot a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i}.
Fact 1.1 ([LS14]).

For all $w\in\mathbb{R}_{>0}^{m}$, we have $0\leq\sigma_{i}(w)\leq 1$ for all $i\in[m]$, $\sum_{i\in[m]}\sigma_{i}(w)\leq n$, and $\mathbf{0}\preceq\mathbf{P}(w)^{(2)}\preceq\mathbf{\Sigma}(w)$.
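For concreteness, the leverage scores $\sigma(w)$ can be computed directly from their definition, as in the following sketch (assuming NumPy; the helper name is ours, and in practice one would use a factorization or sketching instead of an explicit inverse).

```python
import numpy as np

def leverage_scores(A, w):
    """sigma_i(w) = w_i * a_i^T (A^T W A)^{-1} a_i, i.e., the diagonal of P(w)."""
    M_inv = np.linalg.inv(A.T @ (w[:, None] * A))     # (A^T W A)^{-1}
    return w * np.einsum('ij,jk,ik->i', A, M_inv, A)  # w_i * a_i^T M_inv a_i

# Fact 1.1 in code: every score lies in [0, 1], and their sum is at most n
# (for full-rank A and w > 0 the sum equals n, the trace of the projection P(w)).
```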

2 Our Algorithm

We present Algorithm 1 to compute an $\widetilde{\varepsilon}$-additive solution to (1). We first provide the following definitions that we frequently refer to in our algorithm and analysis. Given $\alpha>0$ and $w\in\mathbb{R}^{m}_{>0}$, the $i$'th coordinate of $\rho(w)\in\mathbb{R}^{m}$ is

\rho_{i}(w)\stackrel{\mathrm{def}}{=}\frac{\sigma_{i}(w)}{w_{i}^{1+\alpha}}.

Based on this quantity, we define the following procedure, derived from approximating a quasi-Newton update on the objective $\mathcal{F}$ from (1):

\left[\textbf{Descent}(w,\textup{C},\eta)\right]_{i}\stackrel{\mathrm{def}}{=}w_{i}\left[1+\eta_{i}\cdot\frac{\rho_{i}(w)-1}{\rho_{i}(w)+1}\right]\ \text{ for all } i\in\textup{C}\subseteq\{1,2,\dotsc,m\}.

Using these definitions, we can now describe our algorithm. Depending on whether the following condition (the “rounding condition”) holds, we run either $\textbf{Descent}(\cdot)$ or $\textbf{Round}(\cdot)$ in each iteration:

\rho_{\max}(w)\stackrel{\mathrm{def}}{=}\max_{i\in[m]}\rho_{i}(w)\leq 1+\alpha.

Specifically, if (2) is not satisfied, we run $\textbf{Round}(\cdot)$, which returns a vector that does satisfy it without increasing the objective value. We design two versions of $\textbf{Round}(\cdot)$, one parallel (Algorithm 2) and one sequential (Algorithm 3), with the sequential algorithm having an improved dependence on $\alpha$, to update the coordinates violating (2). We apply one extra step of rounding to the vector returned after $\mathcal{T}_{\textrm{total}}$ iterations of Algorithm 1 and transform it appropriately to obtain our final output. In the following lemma (proved in Section B), we justify that this output is indeed the solution to (1).

Lemma 1 (Lewis Weights from Optimization Solution).

Let $w\in\mathbb{R}_{>0}^{m}$ be a vector at which the objective (1) is $\widetilde{\varepsilon}$-suboptimal in the additive sense for $\widetilde{\varepsilon}=\tfrac{\alpha^{8}\varepsilon^{4}}{(25m(\sqrt{n}+\alpha)(\alpha+\alpha^{-1}))^{4}}$, i.e., $\mathcal{F}(\overline{w})\leq\mathcal{F}(w)\leq\mathcal{F}(\overline{w})+\widetilde{\varepsilon}$. Further assume that $w$ satisfies the rounding condition $\rho_{\max}(w)\leq 1+\alpha$. Then, the vector $\widehat{w}$ defined as $\widehat{w}_{i}=(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i})^{1/\alpha}$ satisfies $\widehat{w}_{i}\approx_{\varepsilon}\overline{w}_{i}$ for all $i\in[m]$, thus achieving the goal spelt out in (1).

Algorithm 1 Lewis Weight Computation Meta-Algorithm

Input: Matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$, parameter $p>2$, accuracy $\varepsilon$

Output: Vector $\widehat{w}\in\mathbb{R}^{m}_{>0}$ that satisfies (1)

For all $i\in[m]$, initialize $w_{i}^{(0)}=\frac{n}{m}$.

Set $\alpha=\frac{2}{p-2}$, $\bar{\alpha}=\max(\alpha,1)$, $\widetilde{\varepsilon}=\tfrac{\alpha^{8}\varepsilon^{4}}{(25m(\sqrt{n}+\alpha)(\alpha+\alpha^{-1}))^{4}}$, and $\mathcal{T}_{\textrm{total}}=\mathcal{O}(\max(\alpha^{-1},\alpha)\log(m/\widetilde{\varepsilon}))$.

for $k=1,2,3,\dotsc,\mathcal{T}_{\textrm{total}}$ do

       $\widetilde{w}^{(k)}\leftarrow\textbf{Round}(w^{(k-1)},\mathbf{A},\alpha)$ \triangleright Invoke Algorithm 2 (parallel) or 3 (sequential)
       $w^{(k)}\leftarrow\textbf{Descent}(\widetilde{w}^{(k)},[m],\tfrac{1}{3\bar{\alpha}}\mathbf{1})$ \triangleright See (2) and Lemma 2
end for
Set $w_{\textrm{R}}\leftarrow\textbf{Round}(w^{(\mathcal{T}_{\textrm{total}})},\mathbf{A},\alpha)$
Return $\widehat{w}\in\mathbb{R}^{m}_{>0}$, where $\widehat{w}_{i}=(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}_{\textrm{R}}\mathbf{A})^{-1}a_{i})^{1/\alpha}$. \triangleright See Section B
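To illustrate how the pieces of Algorithm 1 fit together, here is a compact NumPy sketch of the meta-algorithm with the parallel rounding step inlined. It is an illustration only: the function names are ours, and the iteration budget num_iters is a stand-in for $\mathcal{T}_{\textrm{total}}$ rather than the tuned value from the analysis.

```python
import numpy as np

def sigma(A, w):
    """Leverage scores of W^{1/2} A (see Section 1.4)."""
    M_inv = np.linalg.inv(A.T @ (w[:, None] * A))
    return w * np.einsum('ij,jk,ik->i', A, M_inv, A)

def rho(A, w, alpha):
    """rho_i(w) = sigma_i(w) / w_i^{1+alpha}."""
    return sigma(A, w) / w ** (1.0 + alpha)

def descent(A, w, coords, eta, alpha):
    """Descent update (2): w_i <- w_i * (1 + eta * (rho_i - 1)/(rho_i + 1)) on coords."""
    r = rho(A, w, alpha)
    w = w.copy()
    w[coords] *= 1.0 + eta * (r[coords] - 1.0) / (r[coords] + 1.0)
    return w

def round_parallel(A, w, alpha, eta):
    """Algorithm 2: repeatedly push up coordinates violating the rounding condition (2)."""
    while True:
        bad = np.where(rho(A, w, alpha) > 1.0 + alpha)[0]
        if bad.size == 0:
            return w
        w = descent(A, w, bad, eta, alpha)

def lewis_weights_meta(A, p, num_iters=100):
    """Simplified Algorithm 1 (with the parallel Round)."""
    m, n = A.shape
    alpha = 2.0 / (p - 2.0)
    eta = 1.0 / (3.0 * max(alpha, 1.0))
    w = np.full(m, n / m)
    for _ in range(num_iters):
        w = round_parallel(A, w, alpha, eta)          # Round step
        w = descent(A, w, np.arange(m), eta, alpha)   # Descent (progress) step
    w = round_parallel(A, w, alpha, eta)              # one extra rounding pass
    # Final transformation: w_hat_i = (a_i^T (A^T W A)^{-1} a_i)^{1/alpha} = (sigma_i(w)/w_i)^{1/alpha}.
    return (sigma(A, w) / w) ** (1.0 / alpha)
```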
Algorithm 2 RoundParallel($w$, $\mathbf{A}$, $\alpha$)

Input: Vector $w\in\mathbb{R}^{m}_{>0}$, matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$, parameter $\alpha>0$

Output: Vector $w\in\mathbb{R}^{m}_{>0}$ satisfying (2)

Define $\rho(w)$ as in (2)

while $\textup{C}=\{i~|~\rho_{i}(w)>1+\alpha\}\neq\emptyset$ do

       $w\leftarrow\textbf{Descent}(w,\textup{C},\tfrac{1}{3\bar{\alpha}}\mathbf{1})$ \triangleright See Section 3
end while
Return $w$
Algorithm 3 RoundSequential($w$, $\mathbf{A}$, $\alpha$)

Input: Vector $w\in\mathbb{R}^{m}_{>0}$, matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$, parameter $\alpha>0$

Output: Vector $w\in\mathbb{R}^{m}_{>0}$ satisfying (2)

Define $\rho(w)$ as in (2) and $\sigma(w)$ as in (1.4)

Define $\textup{C}=\{i~|~\rho_{i}(w)\geq 1\}$

for $i\in\textup{C}$ do

       $w_{i}\leftarrow w_{i}(1+\delta_{i})$, where $\delta_{i}$ solves $\rho_{i}(w)=(1+\delta_{i}\sigma_{i}(w))(1+\delta_{i})^{\alpha}$ \triangleright See Section 4
end for
Return $w$
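The coordinate update in Algorithm 3 requires solving a scalar equation for $\delta_i$. Since the right-hand side $(1+\delta\sigma_i(w))(1+\delta)^{\alpha}$ is increasing in $\delta\geq 0$, equals $1\leq\rho_i(w)$ at $\delta=0$ for $i\in\textup{C}$, and exceeds $\rho_i(w)$ at $\delta=\rho_i(w)^{1/\alpha}-1$, a simple bisection suffices; the sketch below (a plain-Python helper of our own) is one way to do it.

```python
def solve_delta(rho_i, sigma_i, alpha, tol=1e-12):
    """Solve rho_i = (1 + delta * sigma_i) * (1 + delta)^alpha for delta >= 0 by bisection.

    Assumes rho_i >= 1 (i.e., coordinate i is in C), so the root lies in
    [0, rho_i**(1/alpha) - 1], over which the right-hand side is monotone increasing.
    """
    lo, hi = 0.0, max(rho_i ** (1.0 / alpha) - 1.0, 0.0)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (1.0 + mid * sigma_i) * (1.0 + mid) ** alpha < rho_i:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```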

2.1 Analysis of $\textbf{Descent}(\cdot)$

We first analyze $\textbf{Descent}(\cdot)$ since it is common to both the parallel and sequential algorithms.

Lemma 2 (Iteration Complexity of $\textbf{Descent}(\cdot)$).

Each iteration of $\textbf{Descent}(\cdot)$ (described in (2)) decreases the value of $\mathcal{F}$. Assuming that $\textbf{Round}(\cdot)$ does not increase the value of the objective in (1), for any given accuracy parameter $0<\widetilde{\varepsilon}<1$, the number of $\textbf{Descent}(\cdot)$ steps that Algorithm 1 performs before achieving $\mathcal{F}(w)\leq\mathcal{F}(\overline{w})+\widetilde{\varepsilon}$ is given by the following bound:

\mathcal{T}_{\textrm{total}}=\mathcal{O}(\max(\alpha^{-1},\alpha)\log(m/\widetilde{\varepsilon})).

As is often the case to obtain such an iteration complexity, we prove Lemma 2 by incorporating the maximum sub-optimality in function value (Lemma 5) and the initial error bound (Lemma 4) into the inequality describing minimum function progress (Lemma 6). The assumption that $\textbf{Round}(\cdot)$ does not increase the value of the objective is justified in Lemma 7.

Since our algorithm leverages quasi-Newton steps, we begin our analysis by stating the gradient and Hessian of the objective in (1) as well as the error at the initial vector $w^{(0)}$, as measured against the optimal function value. The Hessian below is positive semidefinite when $\alpha\geq 0$ (equivalently, when $p\geq 2$) and not necessarily so otherwise. Consequently, the objective is convex for $\alpha\geq 0$, and we therefore consider only this setting throughout.

Lemma 3 (Gradient and Hessian).

For any $w\in\mathbb{R}^{m}_{>0}$, the objective in (1), $\mathcal{F}(w)=-\log\det(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})+\frac{1}{1+\alpha}\mathbf{1}^{\top}w^{1+\alpha}$, has gradient and Hessian given by the following expressions.

\left[\nabla\mathcal{F}(w)\right]_{i}=w_{i}^{-1}\cdot(w_{i}^{1+\alpha}-\sigma_{i}(w))\ \text{ and }\ \nabla^{2}\mathcal{F}(w)=\mathbf{W}^{-1}\mathbf{P}(w)^{(2)}\mathbf{W}^{-1}+\alpha\mathbf{W}^{\alpha-1}.
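The gradient formula rearranges to $[\nabla\mathcal{F}(w)]_i = w_i^{\alpha}-\sigma_i(w)/w_i$, so the optimality condition $\nabla\mathcal{F}(w)=0$ is exactly $\rho_i(w)=1$ for all $i$. A small sketch (assuming NumPy; grad_F is our name) that can be checked against finite differences of $\mathcal{F}$:

```python
import numpy as np

def grad_F(A, w, alpha):
    """[grad F(w)]_i = w_i^{-1} * (w_i^{1+alpha} - sigma_i(w)) = w_i^alpha - sigma_i(w) / w_i."""
    M_inv = np.linalg.inv(A.T @ (w[:, None] * A))
    sig = w * np.einsum('ij,jk,ik->i', A, M_inv, A)   # leverage scores sigma(w)
    return w ** alpha - sig / w
```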
Lemma 4 (Initial Sub-Optimality).

At the start of Algorithm 1, the value of the objective of (1) differs from the optimum objective value as $\mathcal{F}(w^{(0)})\leq\mathcal{F}(\overline{w})+n\log(m/n)$.

2.1.1 Minimum Progress and Maximum Sub-optimality in $\textbf{Descent}(\cdot)$

We first prove an upper bound on objective sub-optimality, necessary to obtain a runtime polylogarithmic in $1/\varepsilon$. Often, to obtain such a rate, the bound involving objective sub-optimality has a quadratic term derived from the Hessian; our lemma is somewhat non-standard in that it uses only the convexity of $\mathcal{F}$. Note that this lemma crucially uses (2).

Lemma 5 (Objective Sub-optimality).

Suppose $w\in\mathbb{R}^{m}_{>0}$ and $\rho_{\max}(w)\leq 1+\alpha$. Then the value of the objective of (1) at $w$ differs from the optimum objective value as follows.

\mathcal{F}(w)-\mathcal{F}(\overline{w})\leq\sum_{i\in[m]}\frac{w_{i}^{1+\alpha}}{1+\alpha}\left(1+\frac{\rho_{i}(w)}{\alpha}\right)\left(\rho_{i}(w)-1\right)^{2}\leq 5\max\{1,\alpha^{-1}\}\sum_{i\in[m]}w_{i}^{1+\alpha}\frac{(\rho_{i}(w)-1)^{2}}{\rho_{i}(w)+1}.
Proof.

Since $g(w)\stackrel{\mathrm{def}}{=}-\log\det(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})$ is convex and $[\nabla g(w)]_{i}=-w_{i}^{-1}\sigma_{i}(w)$, we have

g(\overline{w})\geq g(w)+\nabla g(w)^{\top}(\overline{w}-w)=g(w)+\sum_{i\in[m]}\left(-\frac{\sigma_{i}(w)\overline{w}_{i}}{w_{i}}+\sigma_{i}(w)\right),

and therefore,

\mathcal{F}(\overline{w})-\mathcal{F}(w)=g(\overline{w})-g(w)+\frac{1}{1+\alpha}\sum_{i\in[m]}\left([\overline{w}]_{i}^{1+\alpha}-w_{i}^{1+\alpha}\right)\geq\sum_{i\in[m]}c_{i},\ \text{ where } c_{i}\stackrel{\mathrm{def}}{=}-\frac{\sigma_{i}(w)\overline{w}_{i}}{w_{i}}+\sigma_{i}(w)+\frac{1}{1+\alpha}\left([\overline{w}]_{i}^{1+\alpha}-w_{i}^{1+\alpha}\right)\,.

To prove the claim, it suffices to bound each $c_{i}$ from below. First, note that

c_{i}\geq\min_{v\geq 0}-\frac{v\cdot\sigma_{i}(w)}{w_{i}}+\sigma_{i}(w)+\frac{1}{1+\alpha}\left(v^{1+\alpha}-w_{i}^{1+\alpha}\right)=-\frac{\alpha}{1+\alpha}\left(\frac{\sigma_{i}(w)}{w_{i}}\right)^{1+\frac{1}{\alpha}}+\sigma_{i}(w)-\frac{w_{i}^{1+\alpha}}{1+\alpha}=\frac{w_{i}^{1+\alpha}}{1+\alpha}\left[-\alpha\rho_{i}(w)^{1+\frac{1}{\alpha}}+(1+\alpha)\rho_{i}(w)-1\right], (2.4)

where the first equality used that the minimization problem is convex and the solution to $-\sigma_{i}(w)w_{i}^{-1}+v^{\alpha}=0$ (i.e., where the gradient is 0) is a minimizer, and the second equality used $\rho_{i}(w)=\sigma_{i}(w)/w_{i}^{1+\alpha}$. Applying $\rho_{i}(w)\leq 1+\alpha$, $1+x\leq\exp x$, and $\exp x\leq 1+x+x^{2}$ for $x\leq 1$ yields

\rho_{i}(w)^{\frac{1}{\alpha}}=(1-(1-\rho_{i}(w)))^{\frac{1}{\alpha}}\leq\exp(\tfrac{1}{\alpha}(\rho_{i}(w)-1))\leq 1+\frac{1}{\alpha}(\rho_{i}(w)-1)+\frac{1}{\alpha^{2}}(\rho_{i}(w)-1)^{2}. (2.5)

Combining (2.5) with (2.4) yields that

c_{i}\geq\frac{w_{i}^{1+\alpha}}{1+\alpha}\left[-\alpha\rho_{i}(w)\left[1+\left(\frac{\rho_{i}(w)-1}{\alpha}\right)+\left(\frac{\rho_{i}(w)-1}{\alpha}\right)^{2}\right]+(1+\alpha)\rho_{i}(w)-1\right]=\frac{w_{i}^{1+\alpha}}{1+\alpha}\left[-1+2\rho_{i}(w)-\rho_{i}(w)^{2}-\frac{\rho_{i}(w)}{\alpha}\cdot(\rho_{i}(w)-1)^{2}\right]=-\frac{w_{i}^{1+\alpha}}{1+\alpha}\left(1+\frac{\rho_{i}(w)}{\alpha}\right)\cdot\left(\rho_{i}(w)-1\right)^{2}.

The claim then follows from the fact that for $\rho_{i}(w)\leq 1+\alpha$, we have $\frac{(1+\rho_{i}(w)\alpha^{-1})(1+\rho_{i}(w))}{1+\alpha}\leq\frac{1}{1+\alpha}+\frac{1}{\alpha}+1+1+\frac{1}{\alpha}\leq 5\max\{1,\alpha^{-1}\}$. ∎

Lemma 6 (Function Decrease in $\textbf{Descent}(\cdot)$).

Let $w,\eta\in\mathbb{R}^{m}_{>0}$ with $\eta_{i}\in[0,\tfrac{1}{3\bar{\alpha}}]$ for all $i\in[m]$. Further, let $w^{+}=\textbf{Descent}(w,[m],\eta)$, where $\textbf{Descent}$ is defined in (2). Then, $w^{+}\in\mathbb{R}^{m}_{>0}$ with the following decrease in function objective.

\mathcal{F}(w^{+})\leq\mathcal{F}(w)-\sum_{i\in[m]}\frac{\eta_{i}}{2}\cdot w_{i}^{1+\alpha}\cdot\frac{(\rho_{i}(w)-1)^{2}}{\rho_{i}(w)+1}.

The proof of this lemma resembles that of a quasi-Newton method: first, we write a second-order Taylor approximation of $\mathcal{F}(w^{+})$ around $w$ and apply Fact 1.1 to Lemma 3 to obtain the upper bound on the Hessian: $\nabla^{2}\mathcal{F}(\widetilde{w})=\mathbf{\widetilde{W}}^{-1}\mathbf{P}(\widetilde{w})^{(2)}\mathbf{\widetilde{W}}^{-1}+\alpha\mathbf{\widetilde{W}}^{\alpha-1}\preceq\mathbf{\widetilde{W}}^{-1}\mathbf{\Sigma}(\widetilde{w})\mathbf{\widetilde{W}}^{-1}+\alpha\mathbf{\widetilde{W}}^{\alpha-1}$. We further use the expression for $\nabla\mathcal{F}(w)$ in this second-order approximation and simplify to obtain the claim, as detailed in Section A.

2.1.2 Iteration Complexity of $\textbf{Descent}(\cdot)$

Proof of Lemma 2.

Since Algorithm 1 calls $\textbf{Descent}(\cdot)$ after running $\textbf{Round}(\cdot)$, the requirement $\rho_{\max}(w)\leq 1+\alpha$ in Lemma 5 is met. Therefore, we may combine Lemma 5 along with Lemma 6 and our choice of $\eta_{i}=\frac{1}{3\bar{\alpha}}$ in Algorithm 1 to get a geometric decrease in function error as follows.

\mathcal{F}(w^{+})-\mathcal{F}(\overline{w})\leq\mathcal{F}(w)-\mathcal{F}(\overline{w})-\frac{1}{6\max(\alpha,1)}\sum_{i=1}^{m}w_{i}^{1+\alpha}\frac{(\rho_{i}(w)-1)^{2}}{\rho_{i}(w)+1}\leq\left(1-\frac{1}{30\max(1,\alpha)\cdot\max(1,\alpha^{-1})}\right)(\mathcal{F}(w)-\mathcal{F}(\overline{w})). (2.6)

We apply this inequality recursively over all iterations of Algorithm 1, while also using the assumption that $\textbf{Round}(\cdot)$ does not increase the objective value. Setting the final value of (2.6) to $\widetilde{\varepsilon}$, bounding the initial error as $\mathcal{F}(w)-\mathcal{F}(\overline{w})\leq n\log(m/n)\leq m^{2}$ by Lemma 4, observing $\max(1,\alpha)\cdot\max(1,\alpha^{-1})=\max(\alpha,\alpha^{-1})$, and taking logarithms on both sides of the inequality gives the claimed iteration complexity of $\textbf{Descent}(\cdot)$. ∎

3 Analysis of $\textbf{Round}(\cdot)$: The Parallel Algorithm

The main export of this section is the proof of our main theorem about the parallel algorithm (Theorem 1). This proof combines the iteration count of $\textbf{Descent}(\cdot)$ from the preceding section with the analysis of Algorithm 2 (invoked by $\textbf{Round}(\cdot)$ in the parallel setting), shown next. In Lemma 7, we show that $\textbf{RoundParallel}(\cdot)$ decreases the function objective, thereby justifying the key assumption in Lemma 2. Lemma 7 also shows an upper bound on the new value of $\rho$ after one iteration of the while loop of $\textbf{RoundParallel}(\cdot)$, and by combining this with the maximum value of $\rho$ at termination in Algorithm 2, we get the iteration complexity of $\textbf{RoundParallel}(\cdot)$ in Corollary 1.

Lemma 7 (Outcome of $\textbf{RoundParallel}(\cdot)$).

Let $w^{+}\in\mathbb{R}^{m}_{>0}$ be the state of $w\in\mathbb{R}^{m}_{>0}$ at the end of one iteration of the while loop of $\textbf{RoundParallel}(\cdot)$ (Algorithm 2). Then, $\mathcal{F}(w^{+})\leq\mathcal{F}(w)$ and $\rho_{\max}(w^{+})\leq(1+\frac{\alpha}{3\bar{\alpha}(2+\alpha)})^{-\alpha}\rho_{\max}(w)$.

Proof.

Each iteration of the while loop in $\textbf{RoundParallel}(\cdot)$ performs $\textbf{Descent}(w,\textup{C},\frac{1}{3\bar{\alpha}}\mathbf{1})$ over the set of coordinates $\textup{C}=\{i:\rho_{i}(w)>1+\alpha\}$. Lemma 6 then immediately proves $\mathcal{F}(w^{+})\leq\mathcal{F}(w)$, which is our first claim.

To prove the second claim, note that in Algorithm 2, for every $i\in\textup{C}$,

w^{+}_{i}=w_{i}+\frac{w_{i}}{3\bar{\alpha}}\cdot\left[\frac{\rho_{i}(w)-1}{\rho_{i}(w)+1}\right]\geq w_{i}+\frac{w_{i}}{3\bar{\alpha}}\cdot\left[\frac{\alpha}{1+1+\alpha}\right]=w_{i}\cdot\left(1+\frac{\alpha}{3\bar{\alpha}(2+\alpha)}\right),

where the second step is by the monotonicity of $x\mapsto\frac{x-1}{x+1}$ for $x\geq 1$. Combining it with $w_{i}^{+}=w_{i}$ for all $i\notin\textup{C}$ implies that $w^{+}\geq w$. Therefore, for all $i\in\textup{C}$, we have

\rho(w^{+})_{i}=[w^{+}_{i}]^{-\alpha}[\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}^{+}\mathbf{A})^{-1}\mathbf{A}^{\top}]_{ii}\leq\left[1+\frac{\alpha}{3\bar{\alpha}(2+\alpha)}\right]^{-\alpha}\cdot w_{i}^{-\alpha}[\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top}]_{ii}. (3.1)

Corollary 1.

Let $w$ be the input to $\textbf{RoundParallel}(\cdot)$. Then, the number of iterations of the while loop of $\textbf{RoundParallel}(\cdot)$ is at most $O\left((1+\alpha^{-2})\log(\frac{\rho_{\max}(w)}{1+\alpha})\right)$.

Proof.

Let $w^{(i)}$ be the value of $w$ at the start of the $i$'th execution of the while loop of $\textbf{RoundParallel}(\cdot)$. Repeated application of Lemma 7 over $k$ executions of the while loop gives $\rho_{\max}(w^{(k)})\leq\rho_{\max}(w)\left(1+\frac{\alpha}{3\bar{\alpha}(2+\alpha)}\right)^{-\alpha k}$. We set $\rho_{\max}(w)\left(1+\frac{\alpha}{3\bar{\alpha}(2+\alpha)}\right)^{-\alpha k}\leq 1+\alpha$ in accordance with the termination condition of $\textbf{RoundParallel}(\cdot)$. Next, applying $1+x\leq\exp(x)$ and taking logarithms on both sides yields the claimed limit on the number of iterations $k$. ∎

Lemma 8.

Over the entire run of Algorithm 1, the while loop of $\textbf{RoundParallel}(\cdot)$ runs for at most $O\left(\mathcal{T}_{\textrm{total}}\cdot\alpha^{-2}\cdot\log\left(\tfrac{m}{n(1+\alpha)}\right)\right)$ iterations if $\alpha\in(0,1]$ and $\mathcal{O}\left(\mathcal{T}_{\textrm{total}}\cdot\alpha\cdot\log\left(\tfrac{m}{n(1+\alpha)}\right)\right)$ iterations if $\alpha\geq 1$.

Proof.

Note that $\rho_{\max}(\tfrac{n}{m}\mathbf{1})\leq(\tfrac{m}{n})^{1+\alpha}$; consequently, in the first iteration of Algorithm 1, there are at most $O((\alpha+\alpha^{-2})\log(m/(n(1+\alpha))))$ iterations of the while loop of $\textbf{RoundParallel}(\cdot)$ by Corollary 1. Note that between each call to $\textbf{RoundParallel}(\cdot)$, for all $i\in[m]$,

w^{+}_{i}=w_{i}+\frac{w_{i}}{3\bar{\alpha}}\cdot\left[\frac{\rho_{i}(w)-1}{\rho_{i}(w)+1}\right]\geq w_{i}+\frac{w_{i}}{3\bar{\alpha}}\cdot\left[\frac{-1}{1+1+\alpha}\right]=w_{i}\cdot\left(1-\frac{1}{3\bar{\alpha}(2+\alpha)}\right),

where the first inequality uses the fact that the output $w$ of $\textbf{RoundParallel}(\cdot)$ satisfies $\rho_{\max}(w)\leq 1+\alpha$. Therefore, applying the same logic as in (3.1), we get that between two calls to $\textbf{RoundParallel}(\cdot)$, the value of $\rho_{i}(w)$ increases by at most a factor of $\left(1-\frac{1}{3\bar{\alpha}(2+\alpha)}\right)^{-(1+\alpha)}=O(1)$ for all $i\in[m]$. Combining this with Corollary 1 and the total initial iteration count, and observing that $\mathcal{T}_{\textrm{total}}$ is the total number of calls to $\textbf{RoundParallel}(\cdot)$, finishes the proof. ∎

3.1 Proof of Main Theorem (Parallel)

Proof.

(Proof of Theorem 1) First, we show correctness. Note that, as a corollary of Lemma 2, $\mathcal{F}(w^{(\mathcal{T}_{\textrm{total}})})\leq\mathcal{F}(\overline{w})+\widetilde{\varepsilon}$. By the properties of $\textbf{Round}$ as shown in Lemma 7, we also have that $\mathcal{F}(w_{\textrm{R}})\leq\mathcal{F}(\overline{w})+\widetilde{\varepsilon}$ and $\rho_{\max}(w_{\textrm{R}})\leq 1+\alpha$ for $w_{\textrm{R}}=\textbf{Round}(w^{(\mathcal{T}_{\textrm{total}})},\mathbf{A},\alpha)$. Therefore, Lemma 1 is applicable, and by the choice of $\widetilde{\varepsilon}=\tfrac{\alpha^{4}\varepsilon^{4}}{(2m(\sqrt{n}+\alpha)(\alpha+\alpha^{-1}))^{4}}$, we conclude that $\widehat{w}\in\mathbb{R}^{m}$ defined as $\widehat{w}_{i}=(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}_{\textrm{R}}\mathbf{A})^{-1}a_{i})^{1/\alpha}$ satisfies $\widehat{w}_{i}\approx_{\varepsilon}\overline{w}_{i}$ for all $i\in[m]$. Combining the iteration counts of $\textbf{Descent}(\cdot)$ from Lemma 2 and of $\textbf{RoundParallel}(\cdot)$ from Lemma 8 yields the total iteration count as $O(\alpha^{-3}\log(m/(\varepsilon\alpha)))$ if $\alpha\leq 1$ and $O(\alpha^{2}\log(m/\varepsilon))$ if $\alpha>1$. As stated in Section 1.4, $\alpha=\frac{2}{p-2}$, and so translating these rates in terms of $p$ gives $O(p^{3}\log(mp/\varepsilon))$ for $p\geq 4$ and $O(p^{-2}\log(mp/\varepsilon))$ for $p\in(2,4)$, thereby proving the stated claim. The cost per iteration is $O(mn^{2})$ for multiplying two $m\times n$ matrices (this can be improved to $O(mn^{\omega-1})$ using fast matrix multiplication). ∎

4 Analysis of $\textbf{Round}(\cdot)$: Sequential Algorithm

We now analyze Algorithm 3. Note that these proofs work for all $\alpha>0$.

Lemma 9 (Coordinate Step Progress).

Given $w\in\mathbb{R}_{>0}^{m}$, a coordinate $i\in[m]$, and $\delta_{i}\in\mathbb{R}$, we have

\mathcal{F}(w+\delta_{i}w_{i}e_{i})=\mathcal{F}(w)-\log(1+\delta_{i}\sigma_{i}(w))+\frac{w_{i}^{1+\alpha}}{1+\alpha}((1+\delta_{i})^{1+\alpha}-1).
Proof.

By definition of $\mathcal{F}$, we have

\mathcal{F}(w+\delta_{i}w_{i}e_{i})=-\log\det(\mathbf{A}^{\top}\mathbf{W}\mathbf{A}+\delta_{i}w_{i}a_{i}a_{i}^{\top})+\frac{1}{1+\alpha}\sum_{j\neq i}w_{j}^{1+\alpha}+\frac{w_{i}^{1+\alpha}}{1+\alpha}(1+\delta_{i})^{1+\alpha}.

Recall the matrix determinant lemma: $\det(\mathbf{A}+uv^{\top})=(1+v^{\top}\mathbf{A}^{-1}u)\det(\mathbf{A})$. Applying it to $\det(\mathbf{A}^{\top}\mathop{\mathbf{diag}}(w+\delta_{i}w_{i}e_{i})\mathbf{A})$ in the preceding expression for $\mathcal{F}(w+\delta_{i}w_{i}e_{i})$ proves the lemma.

Lemma 10 (Coordinate Step Outcome).

Given $w\in\mathbb{R}_{>0}^{m}$ and $\textup{C}=\{i:\rho_{i}(w)\geq 1\}$, let $w^{+}=w+\delta_{i}w_{i}e_{i}$ for any $i\in\textup{C}$, where $\delta_{i}=\arg\min_{\delta}\left[-\log(1+\delta\sigma_{i}(w))+\frac{1}{1+\alpha}w_{i}^{1+\alpha}((1+\delta)^{1+\alpha}-1)\right]$. Then, we have $\mathcal{F}(w^{+})\leq\mathcal{F}(w)$ and $\rho_{i}(w^{+})\leq 1$.

Proof.

We note that $\min_{\delta}\left[-\log(1+\delta\sigma_{i}(w))+\frac{1}{1+\alpha}w_{i}^{1+\alpha}((1+\delta)^{1+\alpha}-1)\right]\leq 0$. Then, Lemma 9 implies the first claim. Since the update rule optimizes $\mathcal{F}$ coordinate-wise, at each step the optimality condition, given by $\rho_{i}(w^{+})=1$, is met for each $i\in\textup{C}$. The second claim is then proved by noting that for $j\neq i$, $w_{j}^{+}=w_{j}$ and, by the Sherman-Morrison-Woodbury identity, $\rho_{j}(w^{+})\leq\rho_{j}(w)$:

a_{j}^{\top}(\mathbf{A}^{\top}\mathbf{W}^{+}\mathbf{A})^{-1}a_{j}=a_{j}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{j}-\delta_{i}w_{i}\frac{(a_{j}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i})^{2}}{1+\delta_{i}w_{i}a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i}}\leq a_{j}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{j}.

Lemma 11 (Number of Coordinate Steps).

For any $0\leq\widetilde{\varepsilon}\leq 1$, over all $\mathcal{T}_{\textrm{total}}$ iterations of Algorithm 1, there are at most $O(m\max(\alpha,\alpha^{-1})\log(m/\widetilde{\varepsilon}))$ coordinate steps (see Algorithm 3).

Proof.

There are at most $m$ coordinate steps in each call to Algorithm 3. Combining this with the value of $\mathcal{T}_{\textrm{total}}$ in Algorithm 1 gives the claimed count of $O(m\max(\alpha,\alpha^{-1})\log(m/\widetilde{\varepsilon}))$ coordinate steps. ∎

4.1 Proof of Main Theorem (Sequential)

We now combine the preceding results to prove the main theorem about the sequential algorithm (Algorithm 1 with Algorithm 3).

Proof.

(Proof of Theorem 2) The proof of correctness is the same as that for Theorem 1 since the parallel and sequential algorithms share the same meta-algorithm. Computing leverage scores in the sequential version (Algorithm 1 with Algorithm 3) takes $O(m\max(\alpha,\alpha^{-1})\log(m/(\alpha\varepsilon)))$ coordinate steps. The costliest component of a coordinate step is computing $a_{i}^{\top}(\mathbf{A}^{\top}(\mathbf{W}+\delta_{i}w_{i}e_{i}e_{i}^{\top})\mathbf{A})^{-1}a_{i}$. By the Sherman-Morrison-Woodbury formula, computing this costs $O(n^{2})$ for each coordinate. Since the initial cost to compute $(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}$ is $O(mn^{2})$, the total runtime is $O(\max(\alpha,\alpha^{-1})mn^{2}\log(m/\varepsilon))$. When translated in terms of $p$, this gives $O(pmn^{2}\log(mp/\varepsilon))$ for $p\geq 4$ and $O(p^{-1}mn^{2}\log(mp/\varepsilon))$ for $p\in(2,4)$. ∎
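The $O(n^{2})$ per-coordinate cost above comes from maintaining $(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}$ under a rank-one change of one weight. A minimal sketch of that maintenance step, assuming NumPy (the helper name is ours):

```python
import numpy as np

def sherman_morrison_update(M_inv, a_i, c):
    """Given M_inv = (A^T W A)^{-1}, return (A^T W A + c * a_i a_i^T)^{-1} in O(n^2) time;
    in the sequential algorithm, c = delta_i * w_i for the updated coordinate i."""
    u = M_inv @ a_i
    return M_inv - np.outer(u, u) * (c / (1.0 + c * (a_i @ u)))
```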

5 A “One-Step” Parallel Algorithm

We conclude our paper with an alternative algorithm (Algorithm 4) in which each iteration avoids any rounding and performs only $\textbf{Descent}(\cdot)$.

Algorithm 4 One-Step Algorithm

Input: Matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$, parameter $p>2$, accuracy $\varepsilon$

Output: Vector $\widehat{w}\in\mathbb{R}^{m}_{>0}$ that satisfies (1)

For all $i\in[m]$, initialize $w_{i}^{(0)}=1$. Set $\alpha=\frac{2}{p-2}$. Set $\widetilde{\varepsilon}=\frac{\alpha^{4}\varepsilon^{4}}{(2m(\sqrt{n}+\alpha)(\alpha+\alpha^{-1}))^{4}}$.

Set $\beta=\frac{1}{1000}\min(\alpha^{2},1)$, and set $\mathcal{T}_{\textrm{total}}=\mathcal{O}(\alpha^{-3}\log(mp/\widetilde{\varepsilon}))$ if $\alpha\in(0,1]$ and $\mathcal{T}_{\textrm{total}}=\mathcal{O}(\alpha^{2}\log(mp/\widetilde{\varepsilon}))$ if $\alpha>1$.

for $k=0,1,2,3,\dotsc,\mathcal{T}_{\textrm{total}}-1$ do

       Let $\eta^{(k)}\in\mathbb{R}^{m}$, where for all $i\in[m]$ we set $\eta^{(k)}_{i}=\tfrac{1}{3\bar{\alpha}}$ if $\rho_{i}(w^{(k)})\geq 1$ and $\eta^{(k)}_{i}=\tfrac{1}{3\bar{\alpha}}\beta$ if $\rho_{i}(w^{(k)})<1$
       $w^{(k+1)}\leftarrow\textbf{Descent}(w^{(k)},[m],\eta^{(k)})$ \triangleright See (2) and Lemma 2
end for
Return $\widehat{w}\in\mathbb{R}^{m}_{>0}$, where $\widehat{w}_{i}=(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}^{(\mathcal{T}_{\textrm{total}})}\mathbf{A})^{-1}a_{i})^{1/\alpha}$. \triangleright See Section B
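A sketch of the varying step-size rule of Algorithm 4, reusing the rho helper from the sketch following Algorithm 1 (the wrapper name is ours):

```python
import numpy as np

def one_step_iteration(A, w, alpha, beta):
    """One iteration of Algorithm 4: full step 1/(3*abar) where rho_i >= 1 and a damped
    step beta/(3*abar) where rho_i < 1, which preserves the rounding condition (Lemma 12)."""
    abar = max(alpha, 1.0)
    r = rho(A, w, alpha)  # rho as sketched after Algorithm 1
    eta = np.where(r >= 1.0, 1.0 / (3.0 * abar), beta / (3.0 * abar))
    return w * (1.0 + eta * (r - 1.0) / (r + 1.0))
```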
Theorem 4 (Main Theorem (One-Step Parallel Algorithm)).

Given a full-rank matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ and $p\geq 4$, we can compute $\varepsilon$-approximate Lewis weights (1) in $O(p^{3}\log(mp/\varepsilon))$ iterations. Each iteration computes the leverage scores of a matrix $\mathbf{D}\mathbf{A}$ for some diagonal matrix $\mathbf{D}$. The total runtime is $O(p^{3}mn^{2}\log(mp/\varepsilon))$.

We first spell out the key idea of the proof of Theorem 4 in Lemma 12 next, which is that (2) is maintained in every iteration through the use of varying step sizes, without explicitly invoking rounding procedures. Since (2) always holds, we may use Lemma 5 in bounding the iteration complexity.

Lemma 12 (Rounding Condition Invariance).

For any iteration $k\in[\mathcal{T}_{\textrm{total}}-2]$ in Algorithm 4, if $\rho_{\max}(w^{(k)})\leq 1+\alpha$, then $\rho_{\max}(w^{(k+1)})\leq 1+\alpha$.

Proof.

By the definition of $\textbf{Descent}(\cdot)$ in (2) and the choice of $\eta^{(k)}_{i}$ in Algorithm 4, we have

w^{(k+1)}_{i}=w_{i}^{(k)}\cdot\left[1+\eta_{i}^{(k)}\left(\frac{\rho_{i}(w^{(k)})-1}{\rho_{i}(w^{(k)})+1}\right)\right] (5.1)
\geq w_{i}^{(k)}(1-\eta_{i}^{(k)})\geq w_{i}^{(k)}\left(1-\frac{\beta}{3\bar{\alpha}}\right). (5.2)

Applying this inequality to the definition of $\rho(w)$ in (2), for all $i\in[m]$, we have

\rho_{i}(w^{(k+1)})=\left[\frac{w^{(k+1)}_{i}}{w^{(k)}_{i}}\right]^{-\alpha}\frac{1}{[w^{(k)}_{i}]^{\alpha}}a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}^{(k+1)}\mathbf{A})^{-1}a_{i}\leq\left(1-\frac{\beta}{3\bar{\alpha}}\right)^{-1}\left[\frac{w^{(k+1)}_{i}}{w^{(k)}_{i}}\right]^{-\alpha}\rho_{i}(w^{(k)}).

Plugging (5.2) into (5) when $\rho_{i}(w^{(k)})\leq 1$ and using the upper bound on $\beta$ yields that

\rho_{i}(w^{(k+1)})\leq\left(1-\frac{\beta}{3\bar{\alpha}}\right)^{-(1+\alpha)}\leq 1+\alpha\,.

If $\rho_{i}(w^{(k)})\geq 1$, then (5), the equality in (5.2), the bound on $\beta$, and $\rho_{i}(w^{(k)})\leq 1+\alpha$ imply that

\rho_{i}(w^{(k+1)})\leq\left(1-\frac{\beta}{3\bar{\alpha}}\right)^{-1}\left[1+\frac{1}{3\bar{\alpha}}\left(\frac{\rho_{i}(w^{(k)})-1}{\rho_{i}(w^{(k)})+1}\right)\right]^{-\alpha}\rho_{i}(w^{(k)})\leq 1+\alpha.

Proof of Theorem 4.

By our choice of $w_{i}^{(0)}=1$ for all $i\in[m]$, we have that $\rho_{i}(w^{(0)})=\sigma_{i}(w^{(0)})\leq 1$ by Fact 1.1. Then applying Lemma 12 yields by induction that $\rho_{\max}(w^{(k)})\leq 1+\alpha$ at every iteration $k$. We may now therefore upper bound the objective sub-optimality from Lemma 5; as before, combining this with the lower bound on progress from Lemma 6 (noticing that $\eta_{i}\geq\frac{\beta}{3\bar{\alpha}}$) yields

\mathcal{F}(w^{+})-\mathcal{F}(\overline{w})\leq\mathcal{F}(w)-\mathcal{F}(\overline{w})-\frac{\beta}{6\bar{\alpha}}\sum_{i=1}^{m}w_{i}^{1+\alpha}\frac{(\rho_{i}(w)-1)^{2}}{\rho_{i}(w)+1}\leq\left(1-\frac{\beta}{30\max(1,\alpha)\max(1,\alpha^{-1})}\right)(\mathcal{F}(w)-\mathcal{F}(\overline{w})). (5.4)

Thus, $\textbf{Descent}(\cdot)$ decreases $\mathcal{F}$. Using $\mathcal{F}(w)-\mathcal{F}(\overline{w})\leq n\log(m/n)\leq m^{2}$ from Lemma 4 and setting (5.4) to $\widetilde{\varepsilon}$ gives an iteration complexity of $\mathcal{O}(\beta^{-1}\alpha^{-1}\log(m/\widetilde{\varepsilon}))=\mathcal{O}(\alpha^{-3}\log(m/\widetilde{\varepsilon}))$ if $\alpha\in(0,1]$ and $\mathcal{O}(\alpha\beta^{-1}\log(m/\widetilde{\varepsilon}))=\mathcal{O}(\alpha\log(m/\widetilde{\varepsilon}))$ otherwise. As in the proofs of Theorems 1 and 2, we can then invoke Lemma 1 to construct the vector that is $\varepsilon$-approximate to the Lewis weights. ∎

6 Acknowledgements

We are grateful to the anonymous reviewers of SODA 2022 for their careful reading and thoughtful comments that helped us improve our exposition. Maryam Fazel was supported in part by grants NSF TRIPODS II DMS 2023166, NSF TRIPODS CCF 1740551, and NSF CCF 2007036. Yin Tat Lee was supported in part by NSF awards CCF-1749609, DMS-1839116, DMS-2023166, CCF-2105772, a Microsoft Research Faculty Fellowship, Sloan Research Fellowship, and Packard Fellowship. Swati Padmanabhan was supported in part by NSF TRIPODS II DMS 2023166. Aaron Sidford was supported in part by a Microsoft Research Faculty Fellowship, NSF CAREER Award CCF-1844855, NSF Grant CCF-1955039, a PayPal research award, and a Sloan Research Fellowship.

References

  • [AHR08] Jacob Abernethy, Elad E Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, COLT 2008, 2008.
  • [AZLSW17] Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, and Yining Wang. Near-optimal design of experiments via regret minimization. In Proceedings of the 34th International Conference on Machine Learning, 2017.
  • [BDM+20] Vladimir Braverman, Petros Drineas, Cameron Musco, Christopher Musco, Jalaj Upadhyay, David P Woodruff, and Samson Zhou. Near optimal linear algebra in the online and sliding window models. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), 2020.
  • [BE15] Sébastien Bubeck and Ronen Eldan. The entropic barrier: a simple and optimal universal self-concordant barrier. In Conference on Learning Theory, 2015.
  • [BV04] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
  • [CCDS20] Rachit Chhaya, Jayesh Choudhari, Anirban Dasgupta, and Supratim Shit. Streaming coresets for symmetric tensor factorization. In International Conference on Machine Learning, 2020.
  • [CCLY19] Michael B. Cohen, Ben Cousins, Yin Tat Lee, and Xin Yang. A near-optimal algorithm for approximating the John ellipsoid. In Proceedings of the Thirty-Second Conference on Learning Theory, 2019.
  • [CLM+15] Michael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform sampling for matrix approximation. In Tim Roughgarden, editor, Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, ITCS 2015, Rehovot, Israel, January 11-13, 2015. ACM, 2015.
  • [CP15] Michael B. Cohen and Richard Peng. Lp row sampling by Lewis weights. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC ’15. Association for Computing Machinery, 2015.
  • [DAST08] S Damla Ahipasaoglu, Peng Sun, and Michael J Todd. Linear convergence of a modified Frank–Wolfe algorithm for computing minimum-volume enclosing ellipsoids. Optimisation Methods and Software, 23(1), 2008.
  • [DFO20] Jelena Diakonikolas, Maryam Fazel, and Lorenzo Orecchia. Fair packing and covering on a relative scale. SIAM J. Optim., 30(4), 2020.
  • [DLS18] David Durfee, Kevin A Lai, and Saurabh Sawlani. ℓ1 regression using Lewis weights preconditioning and stochastic gradient descent. In Conference On Learning Theory, 2018.
  • [DMIMW12] Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff. Fast approximation of matrix coherence and statistical leverage. The Journal of Machine Learning Research, 13(1), 2012.
  • [DMM06] Petros Drineas, Michael W Mahoney, and Shan Muthukrishnan. Sampling algorithms for ℓ2 regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2006.
  • [GLS81] Martin Grötschel, László Lovász, and Alexander Schrijver. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1(2), 1981.
  • [Joh48] Fritz John. Extremum problems with inequalities as subsidiary conditions. In Studies and Essays Presented to R. Courant on his 60th Birthday, January 8, 1948. 1948.
  • [Kha96] Leonid G Khachiyan. Rounding of polytopes in the real number model of computation. Mathematics of Operations Research, 21(2), 1996.
  • [KN09] Ravi Kannan and Hariharan Narayanan. Random walks on polytopes and an affine interior point method for linear programming. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, STOC ’09, New York, NY, USA, 2009. Association for Computing Machinery.
  • [KY05] Piyush Kumar and E Alper Yildirim. Minimum-volume enclosing ellipsoids and core sets. Journal of Optimization Theory and applications, 126(1), 2005.
  • [Lee16] Yin Tat Lee. Faster Algorithms for Convex and Combinatorial Optimization. PhD thesis, Massachusetts Institute of Technology, 2016.
  • [Lew78] D. R. Lewis. Finite dimensional subspaces of l_p. Studia Mathematica, 63(2), 1978.
  • [LLV20] Aditi Laddha, Yin Tat Lee, and Santosh Vempala. Strong self-concordance and sampling. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, New York, NY, USA, 2020. Association for Computing Machinery.
  • [LMP13] Mu Li, Gary L Miller, and Richard Peng. Iterative row sampling. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, 2013.
  • [LS14] Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programs in Õ(√rank) iterations and faster algorithms for maximum flow. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21, 2014, 2014.
  • [LS19] Yin Tat Lee and Aaron Sidford. Solving linear programs with √rank linear system solves. CoRR, abs/1910.08033, 2019.
  • [LSW15] Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, 2015.
  • [LWYZ20] Yi Li, Ruosong Wang, Lin Yang, and Hanrui Zhang. Nearly linear row sampling algorithm for quantile regression. In Proceedings of the 37th International Conference on Machine Learning, 2020.
  • [LY18] Yin Tat Lee and Man-Chung Yue. Universal barrier is n-self-concordant. arXiv preprint arXiv:1809.03011, 2018.
  • [MSTX19] Vivek Madan, Mohit Singh, Uthaipon Tantipongpipat, and Weijun Xie. Combinatorial algorithms for optimal design. In Proceedings of the Thirty-Second Conference on Learning Theory, 2019.
  • [MSZ16] Jelena Marasevic, Clifford Stein, and Gil Zussman. A fast distributed stateless algorithm for α-fair packing problems. In 43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016), volume 55, 2016.
  • [NN94] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM, 1994.
  • [Puk06] Friedrich Pukelsheim. Optimal Design of Experiments, volume 50 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics, USA, 2006.
  • [SF04] Peng Sun and Robert M Freund. Computation of minimum-volume covering ellipsoids. Operations Research, 52(5), 2004.
  • [SW18] Christian Sohler and David P Woodruff. Strong coresets for k-median and subspace approximation: Goodbye dimension. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), 2018.
  • [SX20] Mohit Singh and Weijun Xie. Approximation algorithms for d-optimal design. Mathematics of Operations Research, 45(4), 2020.
  • [Tod16] Michael J. Todd. Minimum volume ellipsoids - theory and algorithms, volume 23 of MOS-SIAM Series on Optimization. SIAM, 2016.
  • [Vai89] Pravin M Vaidya. A new algorithm for minimizing convex functions over convex sets. In 30th Annual Symposium on Foundations of Computer Science, 1989.
  • [Woj96] Przemyslaw Wojtaszczyk. Banach spaces for analysts. Number 25. Cambridge University Press, 1996.
  • [ZF20] Renbo Zhao and Robert M Freund. Analysis of the Frank-Wolfe method for logarithmically-homogeneous barriers, with an extension. arXiv preprint arXiv:2010.08999, 2020.

We start with a piece of notation used frequently in the appendices. For a given vector x\in\mathbb{R}^{m}, we use \mathop{\mathbf{Diag}}(x) to denote the diagonal matrix with x on its diagonal. For a matrix \mathbf{X}, we use \mathbf{diag}(\mathbf{X}) to denote the vector made up of the diagonal entries of \mathbf{X}. Further, recall, as stated in Section 1.4, that given any vector x, we use its uppercase boldface name \mathbf{X}\stackrel{{\scriptstyle\mathrm{{\scriptscriptstyle def}}}}{{=}}\mathop{\mathbf{Diag}}(x).

Appendix A Technical Proofs: Gradient, Hessian, Initial Error, Minimum Progress

See 3

Proof.

The proof essentially follows by combining Lemmas 48 and 49 of [LS19]. For completeness, we provide the full proof here. Applying the chain rule to the \log\det function and then using the definition of the leverage scores \sigma(w) gives the claim that

i(w)=(𝐀(𝐀𝐖𝐀)1𝐀)ii+wiα=ai(𝐀𝐖𝐀)1ai+wiα=σi(w)wi+wiα.\nabla_{i}\mathcal{F}(w)=-(\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top})_{ii}+w_{i}^{\alpha}=-a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i}+w_{i}^{\alpha}=\frac{-\sigma_{i}(w)}{w_{i}}+w_{i}^{\alpha}.

We now set some notation to compute the Hessian: let 𝐌=def𝐀(𝐀𝐖𝐀)1𝐀\mathbf{M}\stackrel{{\scriptstyle\mathrm{{\scriptscriptstyle def}}}}{{=}}\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top}, let hmh\in\mathbb{R}^{m} be any arbitrary vector, and let 𝐇=def𝐃𝐢𝐚𝐠(h)\mathbf{H}\stackrel{{\scriptstyle\mathrm{{\scriptscriptstyle def}}}}{{=}}\mathop{\mathbf{Diag}}(h). For f:nf:\mathbb{R}^{n}\rightarrow\mathbb{R} and for x,hnx,h\in\mathbb{R}^{n} we let 𝒟xf(x)[h]\mathcal{D}_{x}f(x)[h] denote the directional derivative of ff at xx in the direction hh, i.e., 𝒟xf(x)[h]=limt0(f(x+th)f(x))/t\mathcal{D}_{x}f(x)[h]=\lim_{t\rightarrow 0}(f(x+th)-f(x))/t. Then we have,

\mathcal{D}_{w}\langle{h},{-\mathop{\mathbf{Diag}}(\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top})}\rangle[h] =\langle{h},{-\mathop{\mathbf{Diag}}(\mathbf{A}\mathcal{D}_{w}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}[h]\mathbf{A}^{\top})}\rangle
=\langle{h},{\mathop{\mathbf{Diag}}(\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathcal{D}_{w}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})[h](\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top})}\rangle
=\langle{h},{\mathop{\mathbf{Diag}}(\mathbf{M}\mathbf{H}\mathbf{M})}\rangle
=\sum_{i,j}h_{i}h_{j}\mathbf{M}_{ij}\mathbf{M}_{ji}=\sum_{i,j}h_{i}h_{j}\mathbf{M}_{ij}^{2},

where the last step follows by symmetry of 𝐌\mathbf{M}. This implies

\nabla_{ij}^{2}\mathcal{F}(w)=\begin{cases}(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{j})^{2}&\text{if }i\neq j,\\ (a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i})^{2}+\alpha w_{i}^{\alpha-1}&\text{if }i=j,\end{cases}

which, in shorthand, is 2(w)=𝐌𝐌+α𝐖α1\nabla^{2}\mathcal{F}(w)=\mathbf{M}\circ\mathbf{M}+\alpha\mathbf{W}^{\alpha-1}. We may express this Hessian as in the statement of the lemma by writing 𝐌\mathbf{M} in terms of 𝐏(w)\mathbf{P}(w). ∎
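As a sanity check of the formulas just derived, the following Python sketch compares the closed-form gradient and Hessian of Lemma 3 against central finite differences of the objective, which we again take to be \mathcal{F}(w)=-\log\det(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})+\tfrac{1}{1+\alpha}\sum_{i}w_{i}^{1+\alpha}; the instance below is an arbitrary random one, not taken from the paper.

import numpy as np

rng = np.random.default_rng(1)
m, n, alpha = 25, 4, 1.5                     # arbitrary small instance
A = rng.standard_normal((m, n))
w = rng.uniform(0.5, 2.0, size=m)

def objective(w):
    # Assumed objective: F(w) = -logdet(A^T W A) + (1/(1+alpha)) * sum_i w_i^{1+alpha}
    _, logdet = np.linalg.slogdet(A.T @ (w[:, None] * A))
    return -logdet + np.sum(w ** (1 + alpha)) / (1 + alpha)

def gradient(w):
    # Lemma 3: grad_i F(w) = -a_i^T (A^T W A)^{-1} a_i + w_i^alpha
    M = A @ np.linalg.solve(A.T @ (w[:, None] * A), A.T)
    return -np.diag(M) + w ** alpha

# Closed-form Hessian: nabla^2 F(w) = M o M + alpha * W^{alpha-1}, with o the entrywise product.
M = A @ np.linalg.solve(A.T @ (w[:, None] * A), A.T)
hess = M * M + np.diag(alpha * w ** (alpha - 1))

eps = 1e-5
I = np.eye(m)
grad_fd = np.array([(objective(w + eps * I[i]) - objective(w - eps * I[i])) / (2 * eps)
                    for i in range(m)])
hess_fd = np.column_stack([(gradient(w + eps * I[j]) - gradient(w - eps * I[j])) / (2 * eps)
                           for j in range(m)])

print("max |gradient - finite difference| :", np.abs(gradient(w) - grad_fd).max())
print("max |Hessian  - finite difference| :", np.abs(hess - hess_fd).max())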

See 4

Proof.

We study the two terms constituting the objective in (1). First, by choice of w(0)=nm𝟏w^{(0)}=\tfrac{n}{m}\mathbf{1}, we have

logdet(𝐀𝐖(0)𝐀)=logdet((n/m)𝐀𝐀).-\log\det(\mathbf{A}^{\top}\mathbf{W}^{(0)}\mathbf{A})=-\log\det((n/m)\mathbf{A}^{\top}\mathbf{A}).

Next, since leverage scores always lie between zero and one, the optimality condition for (1), σ(w¯)=(w¯)1+α\sigma(\overline{w})=({\overline{w}})^{1+\alpha}, implies w¯1\overline{w}\leq 1, which in turn gives 𝐖¯I\overline{\mathbf{W}}\preceq I. This implies 𝐀𝐖¯𝐀𝐀𝐀\mathbf{A}^{\top}\overline{\mathbf{W}}\mathbf{A}\preceq\mathbf{A}^{\top}\mathbf{A}. Therefore,

logdet(𝐀𝐀)logdet(𝐀𝐖¯𝐀).-\log\det(\mathbf{A}^{\top}\mathbf{A})\leq-\log\det(\mathbf{A}^{\top}\overline{\mathbf{W}}\mathbf{A}).

Combining the two preceding displays gives

logdet(𝐀𝐖(0)𝐀)logdet(𝐀𝐖¯𝐀)+nlog(m/n).-\log\det(\mathbf{A}^{\top}\mathbf{W}^{(0)}\mathbf{A})\leq-\log\det(\mathbf{A}^{\top}\overline{\mathbf{W}}\mathbf{A})+n\log(m/n).

Next, observe that \mathbf{1}^{\top}(w^{(0)})^{1+\alpha}=m\cdot(n/m)^{1+\alpha} and \mathbf{1}^{\top}({\overline{w}})^{1+\alpha}=\sum_{i=1}^{m}\sigma_{i}(\overline{w})=n, where we invoked Fact 1.1. Applying m\geq n then gives

𝟏(w(0))1+α𝟏(w¯)1+α.\mathbf{1}^{\top}(w^{(0)})^{1+\alpha}\leq\mathbf{1}^{\top}({\overline{w}})^{1+\alpha}.

Combining the two preceding displays with the definition of the objective in (1) finishes the claim. ∎

See 6

Proof.

By the remainder form of Taylor's theorem, for some t\in[0,1] and \widetilde{w}=tw+(1-t)w^{+}, we have

(w+)=(w)+(w),w+w+12(w+w)2(w~)(w+w).\mathcal{F}(w^{+})=\mathcal{F}(w)+\langle{\nabla\mathcal{F}(w)},{w^{+}-w}\rangle+\frac{1}{2}(w^{+}-w)^{\top}\nabla^{2}\mathcal{F}(\widetilde{w})(w^{+}-w). (A.5)

We prove the result by bounding the quadratic form of 2(w~)\nabla^{2}\mathcal{F}(\widetilde{w}) from above and leveraging the structure of w+w^{+} and (w)\nabla\mathcal{F}(w). Lemma 3 and Fact 1.1 imply that

2(w~)=𝐖~1𝐏(w~)(2)𝐖~1+α𝐖~α1𝐖~1Σ(w~)𝐖~1+α𝐖~α1.\nabla^{2}\mathcal{F}(\widetilde{w})=\mathbf{\widetilde{W}}^{-1}\mathbf{P}(\widetilde{w})^{(2)}\mathbf{\widetilde{W}}^{-1}+\alpha\mathbf{\widetilde{W}}^{\alpha-1}\preceq\mathbf{\widetilde{W}}^{-1}\Sigma(\widetilde{w})\mathbf{\widetilde{W}}^{-1}+\alpha\mathbf{\widetilde{W}}^{\alpha-1}~{}.

Further, the positivity of wiw_{i} and σi(w)\sigma_{i}(w) and the non-negativity of η\eta and ρ\rho imply that (1η)wiwi+(1+η)wi(1-\norm{\eta}_{\infty})w_{i}\leq w_{i}^{+}\leq(1+\norm{\eta}_{\infty})w_{i} for all i[m]i\in[m]. Since η13α¯\norm{\eta}_{\infty}\leq\frac{1}{3\bar{\alpha}}, this implies that

(113α¯)wiw~i(1+13α¯)wi for all i[m].(1-\tfrac{1}{3\bar{\alpha}})w_{i}\leq\widetilde{w}_{i}\leq(1+\tfrac{1}{3\bar{\alpha}})w_{i}~{}\text{ for all }~{}i\in[m]~{}.

Consequently, for all i\in[m], we bound the first term of the preceding Hessian bound as

[𝐖~1Σ(w~)𝐖~1]ii\displaystyle\left[\mathbf{\widetilde{W}}^{-1}\Sigma(\widetilde{w})\mathbf{\widetilde{W}}^{-1}\right]_{ii} =ei𝐖~1/2𝐀(𝐀𝐖~𝐀)1𝐀𝐖~1/2ei=1w~iai(𝐀𝐖~𝐀)1ai\displaystyle=e_{i}^{\top}\mathbf{\widetilde{W}}^{-1/2}\mathbf{A}(\mathbf{A}^{\top}\mathbf{\widetilde{W}}\mathbf{A})^{-1}\mathbf{A}^{\top}\mathbf{\widetilde{W}}^{-1/2}e_{i}=\frac{1}{\widetilde{w}_{i}}a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{\widetilde{W}}\mathbf{A})^{-1}a_{i}
(113α¯)11wiai(𝐀𝐖~𝐀)1ai(113α¯)21wiai(𝐀𝐖𝐀)1ai\displaystyle\leq(1-\tfrac{1}{3\bar{\alpha}})^{-1}\frac{1}{w_{i}}a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{\widetilde{W}}\mathbf{A})^{-1}a_{i}\leq(1-\tfrac{1}{3\bar{\alpha}})^{-2}\frac{1}{w_{i}}a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i}
=(1-\tfrac{1}{3\bar{\alpha}})^{-2}\left[\mathbf{W}^{-1}\Sigma(w)\mathbf{W}^{-1}\right]_{ii}\leq 3\left[\mathbf{W}^{-1}\Sigma(w)\mathbf{W}^{-1}\right]_{ii}~{}. (A.6)

Further, when \alpha\in(0,1], we bound the second term of the preceding Hessian bound as

𝐖~α1(113α¯)α1𝐖α1(113α¯)1𝐖α13𝐖α1,\mathbf{\widetilde{W}}^{\alpha-1}\preceq(1-\tfrac{1}{3\bar{\alpha}})^{\alpha-1}\mathbf{W}^{\alpha-1}\preceq(1-\tfrac{1}{3\bar{\alpha}})^{-1}\mathbf{W}^{\alpha-1}\preceq 3\mathbf{W}^{\alpha-1},

and when α1\alpha\geq 1, we have

𝐖~α1(1+13α¯)α1𝐖α1exp(α13α¯)𝐖α1=exp(α13α)𝐖α13𝐖α1.\mathbf{\widetilde{W}}^{\alpha-1}\preceq(1+\tfrac{1}{3\bar{\alpha}})^{\alpha-1}\mathbf{W}^{\alpha-1}\preceq\exp(\frac{\alpha-1}{3\bar{\alpha}})\mathbf{W}^{\alpha-1}=\exp(\frac{\alpha-1}{3\alpha})\mathbf{W}^{\alpha-1}\preceq 3\mathbf{W}^{\alpha-1}.

Using (A.6) and the two preceding bounds in the earlier bound on \nabla^{2}\mathcal{F}(\widetilde{w}), we have that in all cases

2(w~)3[𝐖1Σ(w)𝐖1+α𝐖α1]3α¯𝐖1[Σ(w)+𝐖1+α]𝐖1.\nabla^{2}\mathcal{F}(\widetilde{w})\preceq 3\left[\mathbf{W}^{-1}\Sigma(w)\mathbf{W}^{-1}+\alpha\mathbf{W}^{\alpha-1}\right]\preceq 3\bar{\alpha}\mathbf{W}^{-1}\left[\Sigma(w)+\mathbf{W}^{1+\alpha}\right]\mathbf{W}^{-1}~{}.

Applying the definition of w^{+} to the above Loewner inequality gives

(w+w)2(w~)(w+w)\displaystyle(w^{+}-w)^{\top}\nabla^{2}\mathcal{F}(\widetilde{w})(w^{+}-w) i[m]3α¯(wi1+α+σi(w))(ηiρi(w)1ρi(w)+1)2\displaystyle\leq\sum_{i\in[m]}3\bar{\alpha}\cdot(w_{i}^{1+\alpha}+\sigma_{i}(w))\cdot\left(\eta_{i}\cdot\frac{\rho_{i}(w)-1}{\rho_{i}(w)+1}\right)^{2}
=i[m]3α¯ηi2wi1+α(ρi(w)1)2ρi(w)+1.\displaystyle=\sum_{i\in[m]}3\bar{\alpha}\cdot\eta_{i}^{2}\cdot w_{i}^{1+\alpha}\cdot\frac{(\rho_{i}(w)-1)^{2}}{\rho_{i}(w)+1}~{}. (A.9)

Next, recall that by Lemma 3, [(w)]i=wi1(wi1+ασi(w))\left[\nabla\mathcal{F}(w)\right]_{i}=w_{i}^{-1}\cdot(w_{i}^{1+\alpha}-\sigma_{i}(w)) for all i[m]i\in[m]. Consequently,

(w),w+w=i[m](wi1+ασi(w))(ηiρi(w)1ρi(w)+1)=i[m]ηiwi1+α(ρi(w)1)2ρi(w)+1.\langle{\nabla\mathcal{F}(w)},{w^{+}-w}\rangle=\sum_{i\in[m]}(w_{i}^{1+\alpha}-\sigma_{i}(w))\cdot\left(\eta_{i}\cdot\frac{\rho_{i}(w)-1}{\rho_{i}(w)+1}\right)=-\sum_{i\in[m]}\eta_{i}\cdot w_{i}^{1+\alpha}\cdot\frac{(\rho_{i}(w)-1)^{2}}{\rho_{i}(w)+1}~{}.

Combining (A.5), (A.9), and the preceding display yields that

(w+)\displaystyle\mathcal{F}(w^{+}) (w)+i[m](ηi+3α¯ηi22)wi1+α(ρi(w)1)2ρi(w)+1.\displaystyle\leq\mathcal{F}(w)+\sum_{i\in[m]}\left(-\eta_{i}+\frac{3\bar{\alpha}\eta_{i}^{2}}{2}\right)\cdot w_{i}^{1+\alpha}\cdot\frac{(\rho_{i}(w)-1)^{2}}{\rho_{i}(w)+1}~{}.

The result follows by plugging in ηi[0,(3α¯)1]\eta_{i}\in[0,(3\bar{\alpha})^{-1}], as assumed. ∎
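The conclusion of this proof is straightforward to probe numerically. The Python sketch below forms w^{+} from a random positive w via w^{+}_{i}=w_{i}\big(1+\eta_{i}\tfrac{\rho_{i}(w)-1}{\rho_{i}(w)+1}\big) with random step sizes \eta_{i}\in[0,(3\bar{\alpha})^{-1}], and checks the derived bound \mathcal{F}(w^{+})\leq\mathcal{F}(w)+\sum_{i}\big(-\eta_{i}+\tfrac{3\bar{\alpha}\eta_{i}^{2}}{2}\big)w_{i}^{1+\alpha}\tfrac{(\rho_{i}(w)-1)^{2}}{\rho_{i}(w)+1}. The instance and the objective form are the same illustrative assumptions as in the earlier sketches.

import numpy as np

rng = np.random.default_rng(2)
m, n, alpha = 40, 5, 2.5                     # arbitrary instance
abar = max(1.0, alpha)
A = rng.standard_normal((m, n))
w = rng.uniform(0.5, 2.0, size=m)
eta = rng.uniform(0.0, 1.0 / (3 * abar), size=m)   # eta_i in [0, 1/(3*abar)], as assumed

def objective(w):
    _, logdet = np.linalg.slogdet(A.T @ (w[:, None] * A))
    return -logdet + np.sum(w ** (1 + alpha)) / (1 + alpha)

def leverage_scores(w):
    Y = np.linalg.solve(A.T @ (w[:, None] * A), A.T)
    return w * np.einsum('ij,ji->i', A, Y)

rho = leverage_scores(w) / w ** (1 + alpha)        # rho_i(w) = sigma_i(w) / w_i^{1+alpha}
w_plus = w * (1 + eta * (rho - 1) / (rho + 1))     # the step analyzed in Lemma 6

bound = objective(w) + np.sum((-eta + 1.5 * abar * eta ** 2)
                              * w ** (1 + alpha) * (rho - 1) ** 2 / (rho + 1))
print("F(w+)  =", objective(w_plus))
print("bound  =", bound)
print("F(w+) <= bound:", objective(w_plus) <= bound + 1e-10)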

Appendix B From Optimization Problem to Lewis Weights

The goal of this section is to show how to obtain \varepsilon-approximate Lewis weights from an \widetilde{\varepsilon}-approximate solution to the problem in (1). Our proof strategy is to first use the fact that the vector w_{\textrm{R}} obtained after the rounding step following the for loop of Algorithm 1 is both \widetilde{\varepsilon}-suboptimal (additively) and satisfies the rounding condition (2). In Lemma 1, the \widetilde{\varepsilon}-suboptimality is used to bound \|\sigma(w_{\textrm{R}})-w_{\textrm{R}}^{1+\alpha}\|_{\infty}. Coupled with the rounding condition, this lets us show in Lemma 13 that the vector \widehat{w} constructed in the last line of Algorithm 1 satisfies approximate optimality, \sigma(\widehat{w})\approx_{\delta}\widehat{w}^{1+\alpha}, for some small \delta>0. In Lemma 14, we then relate this approximate optimality to coordinate-wise multiplicative closeness between \widehat{w} and the vector of true Lewis weights. Finally, in Lemma 1, we pick the appropriate approximation factors for each of the lemmas invoked and prove the desired approximation guarantee. Since the vector w^{(\mathcal{T}_{\textrm{total}})} obtained at the end of the for loop of Algorithm 4 also satisfies the aforementioned properties of w_{\textrm{R}}, the same lemmas apply to Algorithm 4 as well. We begin with some technical lemmas.

B.1 From Approximate Closeness to Approximate Optimality

Lemma 13.

Let w\in\mathbb{R}^{m}_{>0} be such that \|\sigma(w)-w^{1+\alpha}\|_{\infty}\leq\overline{\varepsilon} for some parameter 0<\overline{\varepsilon}\leq\frac{1}{100m^{2}(\alpha+\alpha^{-1})^{2}}, and suppose further that \rho_{\max}(w)\leq 1+\alpha. Define \widehat{w}_{i}=(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i})^{1/\alpha}. Then, for the parameter \delta=20\sqrt{\overline{\varepsilon}}\,m(\alpha+\alpha^{-1}), we have that \sigma(\widehat{w})\approx_{\delta}\widehat{w}^{1+\alpha}.

Proof.

Our strategy to prove \sigma(\widehat{w})\approx_{\delta}\widehat{w}^{1+\alpha} is to first note that this is equivalent to \widehat{w}^{-1}\cdot\sigma(\widehat{w})\approx_{\delta}\widehat{w}^{\alpha}; by the definition of \widehat{w} in the statement of the lemma, it therefore suffices to prove \mathbf{A}^{\top}\mathbf{\widehat{W}}\mathbf{A}\approx_{\delta}\mathbf{A}^{\top}\mathbf{W}\mathbf{A}.

To this end, we split 𝐖\mathbf{W} into two matrices based on the size of its coordinates, setting the following notation. Define 𝐖wη\mathbf{W}_{w\leq\eta} to be the diagonal matrix 𝐖\mathbf{W} with zeroes at indices corresponding to w>ηw>\eta, and 𝐖^wη\mathbf{\widehat{W}}_{w\leq\eta} to be the diagonal matrix 𝐖^\mathbf{\widehat{W}} with zeroes at indices corresponding to w>ηw>\eta. We first show that 𝐀𝐖^wη𝐀\mathbf{A}^{\top}\mathbf{\widehat{W}}_{w\leq\eta}\mathbf{A} and 𝐀𝐖wη𝐀\mathbf{A}^{\top}\mathbf{W}_{w\leq\eta}\mathbf{A} are small compared to 𝐀𝐖𝐀\mathbf{A}^{\top}\mathbf{W}\mathbf{A} and can therefore be ignored in the preceding desired approximation. We then prove that for w>ηw>\eta, we have wδw^w\approx_{\delta}\widehat{w}. This proof technique is inspired by Lemma 4 of [Vai89].

First, we prove that \mathbf{A}^{\top}\mathbf{\widehat{W}}_{w\leq\eta}\mathbf{A} is small compared to \mathbf{A}^{\top}\mathbf{W}\mathbf{A}. Since the rounding condition (2) is satisfied, we have

ai(𝐀𝐖𝐀)1ai=σi(w)wi1(1+α)wiα.a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i}=\sigma_{i}(w)\cdot w_{i}^{-1}\leq(1+\alpha)w_{i}^{\alpha}.

Combining this with the definition of \widehat{w}_{i} in the statement of the lemma and using that (1+\alpha)^{1/\alpha}\leq e\leq 3 for all \alpha>0, we derive

w^i(1+α)1/αwi3wi.\widehat{w}_{i}\leq(1+\alpha)^{1/\alpha}w_{i}\leq 3w_{i}.

We apply this inequality in the following expression to obtain

Tr((𝐀𝐖^wη𝐀)(𝐀𝐖𝐀)1)\displaystyle\Tr((\mathbf{A}^{\top}\mathbf{\widehat{W}}_{w\leq\eta}\mathbf{A})(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}) =wiηw^i(ai(𝐀𝐖𝐀)1ai)\displaystyle=\sum_{w_{i}\leq\eta}\widehat{w}_{i}(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i})
=wiη(ai(𝐀𝐖𝐀)1ai)1+1/α\displaystyle=\sum_{w_{i}\leq\eta}(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i})^{1+1/\alpha}
(1+α)1+1/αwiηwi1+α\displaystyle\leq(1+\alpha)^{1+1/\alpha}\sum_{w_{i}\leq\eta}w_{i}^{1+\alpha}
3(1+α)mη1+α.\displaystyle\leq 3(1+\alpha)m\eta^{1+\alpha}. (B.2)

This implies that (this uses the fact that if X,Y\succeq 0 and \operatorname{Tr}(XY)\leq 1, then Y^{1/2}XY^{1/2}\succeq 0 with \operatorname{Tr}(Y^{1/2}XY^{1/2})\leq 1, so Y^{1/2}XY^{1/2}\preceq I and hence X\preceq Y^{-1}; here it is applied after rescaling)

𝐀𝐖^wη𝐀3(1+α)mη1+α𝐀𝐖𝐀.\mathbf{A}^{\top}\mathbf{\widehat{W}}_{w\leq\eta}\mathbf{A}\preceq 3(1+\alpha)m\eta^{1+\alpha}\mathbf{A}^{\top}\mathbf{W}\mathbf{A}.

Our next goal is to bound 𝐀𝐖^w>η𝐀\mathbf{A}^{\top}\mathbf{\widehat{W}}_{w>\eta}\mathbf{A} in terms of 𝐀𝐖𝐀\mathbf{A}^{\top}\mathbf{W}\mathbf{A}, which we do by first bounding it in terms of 𝐀𝐖w>η𝐀\mathbf{A}^{\top}\mathbf{W}_{w>\eta}\mathbf{A} and then bounding 𝐀𝐖w>η𝐀\mathbf{A}^{\top}\mathbf{W}_{w>\eta}\mathbf{A} in terms of 𝐀𝐖𝐀\mathbf{A}^{\top}\mathbf{W}\mathbf{A}. By definition, w^iα=σi(w)wi1\widehat{w}_{i}^{\alpha}=\sigma_{i}(w)\cdot w_{i}^{-1}. Further, by assumption, σ(w)w1+αε¯\|\sigma(w)-w^{1+\alpha}\|_{\infty}\leq\overline{\varepsilon}. Therefore, for any wiηw_{i}\geq\eta

w^iα(wi1+α+ε¯)wi1(1+ε¯/η1+α)wi1+αwi1=(1+ε¯/η1+α)wiα,\widehat{w}_{i}^{\alpha}\leq(w_{i}^{1+\alpha}+\overline{\varepsilon})\cdot w_{i}^{-1}\leq(1+\overline{\varepsilon}/\eta^{1+\alpha})w_{i}^{1+\alpha}\cdot w_{i}^{-1}=(1+\overline{\varepsilon}/\eta^{1+\alpha})w_{i}^{\alpha},

and

w^iα(wi1+αε¯)wi1(1ε¯/η1+α)wi1+αwi1=(1ε¯/η1+α)wiα.\widehat{w}_{i}^{\alpha}\geq(w_{i}^{1+\alpha}-\overline{\varepsilon})\cdot w_{i}^{-1}\geq(1-\overline{\varepsilon}/\eta^{1+\alpha})w_{i}^{1+\alpha}\cdot w_{i}^{-1}=(1-\overline{\varepsilon}/\eta^{1+\alpha})w_{i}^{\alpha}.

By our choice of ε¯\overline{\varepsilon}, for wiηw_{i}\geq\eta, we have

(12ε¯αη1+α)wiw^i(1+2ε¯αη1+α)wi.\left(1-\frac{2\overline{\varepsilon}}{\alpha\eta^{1+\alpha}}\right)w_{i}\leq\widehat{w}_{i}\leq\left(1+\frac{2\overline{\varepsilon}}{\alpha\eta^{1+\alpha}}\right)w_{i}.

Further, we have the following inequality:

𝐀𝐖w>η𝐀𝐀𝐖𝐀.\mathbf{A}^{\top}\mathbf{W}_{w>\eta}\mathbf{A}\preceq\mathbf{A}^{\top}\mathbf{W}\mathbf{A}.

Hence, we can combine the bound on \mathbf{A}^{\top}\mathbf{\widehat{W}}_{w\leq\eta}\mathbf{A}, the coordinate-wise closeness of \widehat{w} and w on the coordinates with w_{i}>\eta, and the preceding inequality to see that

𝐀𝐖^𝐀\displaystyle\mathbf{A}^{\top}\mathbf{\widehat{W}}\mathbf{A} =𝐀𝐖^w>η𝐀+𝐀𝐖^wη𝐀\displaystyle=\mathbf{A}^{\top}\mathbf{\widehat{W}}_{w>\eta}\mathbf{A}+\mathbf{A}^{\top}\mathbf{\widehat{W}}_{w\leq\eta}\mathbf{A}
(1+2ε¯αη1+α)𝐀𝐖w>η𝐀+3(1+α)mη1+α𝐀𝐖𝐀\displaystyle\preceq\left(1+\frac{2\overline{\varepsilon}}{\alpha\eta^{1+\alpha}}\right)\mathbf{A}^{\top}\mathbf{W}_{w>\eta}\mathbf{A}+3(1+\alpha)m\eta^{1+\alpha}\mathbf{A}^{\top}\mathbf{W}\mathbf{A}
𝐀𝐖𝐀(1+2ε¯αη1+α+3(1+α)mη1+α).\displaystyle\preceq\mathbf{A}^{\top}\mathbf{W}\mathbf{A}\left(1+\frac{2\overline{\varepsilon}}{\alpha\eta^{1+\alpha}}+3(1+\alpha)m\eta^{1+\alpha}\right).

Setting \eta^{1+\alpha}=\sqrt{\overline{\varepsilon}} then yields the claimed upper bound.

For the lower bound, we bound 𝐀𝐖wη𝐀\mathbf{A}^{\top}\mathbf{W}_{w\leq\eta}\mathbf{A} and, therefore, also 𝐀𝐖w>η𝐀\mathbf{A}^{\top}\mathbf{W}_{w>\eta}\mathbf{A}. Observe that

Tr((𝐀𝐖wη𝐀)(𝐀𝐖𝐀)1)\displaystyle\Tr((\mathbf{A}^{\top}\mathbf{W}_{w\leq\eta}\mathbf{A})(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}) =wiηwiai(𝐀𝐖𝐀)1ai=wiησi(w)\displaystyle=\sum_{w_{i}\leq\eta}w_{i}a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i}=\sum_{w_{i}\leq\eta}\sigma_{i}(w)
wiη(wi1+α+ε¯)m(η1+α+ε¯),\displaystyle\leq\sum_{w_{i}\leq\eta}(w_{i}^{1+\alpha}+\overline{\varepsilon})\leq m(\eta^{1+\alpha}+\overline{\varepsilon}),

where the second step is by σ(w)w1+αε¯\|\sigma(w)-w^{1+\alpha}\|_{\infty}\leq\overline{\varepsilon}, as assumed in the lemma. This implies that

𝐀𝐖wη𝐀m(η1+α+ε¯)𝐀𝐖𝐀,\mathbf{A}^{\top}\mathbf{W}_{w\leq\eta}\mathbf{A}\preceq m(\eta^{1+\alpha}+\overline{\varepsilon})\mathbf{A}^{\top}\mathbf{W}\mathbf{A},

and therefore that

𝐀𝐖w>η𝐀(1m(η1+α+ε¯))𝐀𝐖𝐀.\mathbf{A}^{\top}\mathbf{W}_{w>\eta}\mathbf{A}\succeq(1-m(\eta^{1+\alpha}+\overline{\varepsilon}))\mathbf{A}^{\top}\mathbf{W}\mathbf{A}.

Repeating the argument used for the upper bound then finishes the proof. ∎
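To illustrate Lemma 13, the following Python sketch first runs a few descent steps from w=\mathbf{1} (a hypothetical warm-up, standing in for Algorithm 1) so that \sigma(w) is only approximately equal to w^{1+\alpha}, then forms \widehat{w}_{i}=(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i})^{1/\alpha} as in the lemma and reports the multiplicative optimality error of both vectors; the instance, step size, and iteration count are arbitrary.

import numpy as np

rng = np.random.default_rng(3)
m, n, alpha = 80, 6, 2.0                     # arbitrary instance
abar = max(1.0, alpha)
A = rng.standard_normal((m, n))

def quad(w):
    # q_i(w) = a_i^T (A^T W A)^{-1} a_i, so that sigma_i(w) = w_i * q_i(w)
    Y = np.linalg.solve(A.T @ (w[:, None] * A), A.T)
    return np.einsum('ij,ji->i', A, Y)

# A few descent steps to obtain a w with sigma(w) approximately equal to w^{1+alpha}.
w = np.ones(m)
for _ in range(30):
    rho = quad(w) / w ** alpha               # rho_i(w) = sigma_i(w) / w_i^{1+alpha}
    w = w * (1 + (rho - 1) / ((rho + 1) * 3 * abar))

w_hat = quad(w) ** (1 / alpha)               # the construction of Lemma 13

for name, v in [("w    ", w), ("w_hat", w_hat)]:
    err = np.abs(v * quad(v) / v ** (1 + alpha) - 1).max()
    print(name, ": max_i |sigma_i / v_i^{1+alpha} - 1| =", err)

The point of the comparison is that the constructed \widehat{w} remains approximately optimal, which is exactly what the lemma quantifies.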

B.2 From Approximate Optimality to Approximate Lewis Weights

In this section, we go from the previous notion of approximation to the one we finally seek in (1). Specifically, we show that if σ(w)βw1+α\sigma(w)\approx_{\beta}w^{1+\alpha}, then wO((β/α)n)w¯w\approx_{O((\beta/\alpha)\sqrt{n})}\overline{w}. To prove this, we first give a technical result. We recall notation stated in Section 1.4: for any projection matrix 𝐏(w)m×m\mathbf{P}(w)\in\mathbb{R}^{m\times m}, we have the vector of leverage scores σ(w)=𝐝𝐢𝐚𝐠(𝐏(w))\sigma(w)=\mathbf{diag}(\mathbf{P}(w)).

Claim 1.

For any projection matrix 𝐏(w)m×m\mathbf{P}(w)\in\mathbb{R}^{m\times m}, α0\alpha\geq 0, and vector xmx\in\mathbb{R}^{m}, we have that

[𝐏(w)(2)+α𝚺(w)]1𝚺(w)x1αx+1α2x𝚺(w)(1+n/αα)x\norm{\left[\mathbf{P}(w)^{(2)}+\alpha\mathbf{\Sigma}(w)\right]^{-1}\mathbf{\Sigma}(w)x}_{\infty}\leq\frac{1}{\alpha}\norm{x}_{\infty}+\frac{1}{\alpha^{2}}\norm{x}_{\mathbf{\Sigma}(w)}\leq\left(\frac{1+\sqrt{n}/\alpha}{\alpha}\right)\norm{x}_{\infty}
Proof.

Let y=def[𝐏(w)(2)+α𝚺(w)]1𝚺(w)xy\stackrel{{\scriptstyle\mathrm{{\scriptscriptstyle def}}}}{{=}}\left[\mathbf{P}(w)^{(2)}+\alpha\mathbf{\Sigma}(w)\right]^{-1}\mathbf{\Sigma}(w)x. Since 𝟎𝐏(w)(2)𝚺(w)\mathbf{0}\preceq\mathbf{P}(w)^{(2)}\preceq\mathbf{\Sigma}(w) (Fact 1.1), we have that 𝚺(w)1α[𝐏(w)(2)+α𝚺(w)]\mathbf{\Sigma}(w)\preceq\frac{1}{\alpha}\left[\mathbf{P}(w)^{(2)}+\alpha\mathbf{\Sigma}(w)\right] and (𝐏(w)(2)+α𝚺(w))1α1𝚺(w)1(\mathbf{P}(w)^{(2)}+\alpha\mathbf{\Sigma}(w))^{-1}\preceq\alpha^{-1}\mathbf{\Sigma}(w)^{-1}. Consequently, taking norms in terms of these matrices gives

y𝚺(w)=[𝐏(w)(2)+α𝚺(w)]1𝚺(w)x𝚺(w)\displaystyle\norm{y}_{\mathbf{\Sigma}(w)}=\norm{\left[\mathbf{P}(w)^{(2)}+\alpha\mathbf{\Sigma}(w)\right]^{-1}\mathbf{\Sigma}(w)x}_{\mathbf{\Sigma}(w)} 1α𝚺(w)x[𝐏(w)(2)+α𝚺(w)]11αx𝚺(w).\displaystyle\leq\frac{1}{\sqrt{\alpha}}\norm{\mathbf{\Sigma}(w)x}_{\left[\mathbf{P}(w)^{(2)}+\alpha\mathbf{\Sigma}(w)\right]^{-1}}\leq\frac{1}{\alpha}\norm{x}_{\mathbf{\Sigma}(w)}\,. (B.6)

Next, by Lemma 47 of [LS19], we have \|\mathbf{\Sigma}(w)^{-1}\mathbf{P}(w)^{(2)}z\|_{\infty}\leq\|z\|_{\mathbf{\Sigma}(w)} for all z\in\mathbb{R}^{m}, so \left|[\mathbf{P}(w)^{(2)}y]_{i}\right|\leq\sigma_{i}(w)\|y\|_{\mathbf{\Sigma}(w)} for all i\in[m]. Moreover, by the definition of y, we have [(\mathbf{P}(w)^{(2)}+\alpha\mathbf{\Sigma}(w))y]_{i}=\sigma_{i}(w)x_{i} for all i\in[m]. Therefore,

y=maxi[m]|yi|=maxi[m]|1αxi+1ασi(w)[𝐏(w)(2)y]i|1αx+1αy𝚺(w).\norm{y}_{\infty}=\max_{i\in[m]}|y_{i}|=\max_{i\in[m]}\left|\frac{1}{\alpha}x_{i}+\frac{1}{\alpha\sigma_{i}(w)}\left[\mathbf{P}(w)^{(2)}y\right]_{i}\right|\leq\frac{1}{\alpha}\norm{x}_{\infty}+\frac{1}{\alpha}\norm{y}_{\mathbf{\Sigma}(w)}\,. (B.7)

Combining (B.6) and (B.7) and using that i[m]σi(w)n\sum_{i\in[m]}\sigma_{i}(w)\leq n yields the claim. ∎
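Claim 1 can also be checked numerically. In the sketch below we take \mathbf{P}(w)=\mathbf{W}^{1/2}\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top}\mathbf{W}^{1/2} and interpret \mathbf{P}(w)^{(2)} as its entrywise square, consistent with the Hessian expressions in Lemma 3; the instance and the test vector x are arbitrary.

import numpy as np

rng = np.random.default_rng(4)
m, n, alpha = 30, 4, 1.3                     # arbitrary instance
A = rng.standard_normal((m, n))
w = rng.uniform(0.2, 3.0, size=m)
x = rng.standard_normal(m)

half = np.sqrt(w)[:, None] * A               # W^{1/2} A
P = half @ np.linalg.solve(A.T @ (w[:, None] * A), half.T)   # P(w)
sigma = np.diag(P)                           # sigma(w) = diag(P(w))
P2 = P * P                                   # P(w)^(2): entrywise square (assumed interpretation)

y = np.linalg.solve(P2 + alpha * np.diag(sigma), sigma * x)
print("||y||_inf                              =", np.abs(y).max())
print("(1 + sqrt(n)/alpha)/alpha * ||x||_inf  =", (1 + np.sqrt(n) / alpha) / alpha * np.abs(x).max())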

Lemma 14.

Let w^>0m\widehat{w}\in\mathbb{R}_{>0}^{m} be a vector that satisfies approximate optimality of (1) in the following sense:

σ(w^)=𝐖^1+αv, for exp(μ)𝟏vexp(μ)𝟏.\sigma(\widehat{w})=\widehat{\mathbf{W}}^{1+\alpha}v,\text{ for }\exp(-\mu)\mathbf{1}\leq v\leq\exp(\mu)\mathbf{1}.

Then, w^\widehat{w} is also coordinate-wise multiplicatively close to w¯\overline{w}, the true vector of Lewis weights, as formalized below.

exp(1α(1+n/α)μ)w¯w^exp(1α(1+n/α)μ)w¯.\exp\left(-\frac{1}{\alpha}(1+\sqrt{n}/\alpha)\mu\right)\overline{w}\leq\widehat{w}\leq\exp\left(\frac{1}{\alpha}(1+\sqrt{n}/\alpha)\mu\right)\overline{w}\,.
Proof.

For all t\in[0,1], let [v_{t}]_{i}=v_{i}^{t}, so that v_{1}=v and v_{0}=\mathbf{1}. Further, for all t\in[0,1], let w_{t} be the unique solution to

wt=argminw>0mft(w)=deflogdet(𝐀𝐖𝐀)+11+αi[m][vt]iwi1+α.w_{t}=\operatorname*{\arg\!\min}_{w\in\mathbb{R}_{>0}^{m}}f_{t}(w)\stackrel{{\scriptstyle\mathrm{{\scriptscriptstyle def}}}}{{=}}-\log\det\left(\mathbf{A}^{\top}\mathbf{W}\mathbf{A}\right)+\frac{1}{1+\alpha}\sum_{i\in[m]}[v_{t}]_{i}w_{i}^{1+\alpha}.

Then we have the following derivatives.

wft(w)\displaystyle\nabla_{w}f_{t}(w) =𝐖1σ(w)+𝐖αvt,\displaystyle=-\mathbf{W}^{-1}\sigma(w)+\mathbf{W}^{\alpha}v_{t}\,,
w(ddtft)(w)\displaystyle\nabla_{w}(\frac{d}{dt}f_{t})(w) =𝐖αddtvt=𝐖αvtln(v)\displaystyle=\mathbf{W}^{\alpha}\frac{d}{dt}v_{t}=\mathbf{W}^{\alpha}v_{t}\ln(v)\, (B.8)
ww2ft(w)\displaystyle\nabla^{2}_{ww}f_{t}(w) =𝐖1[𝐏(w)(2)+α𝐖1+α𝐕]𝐖1.\displaystyle=\mathbf{W}^{-1}\left[\mathbf{P}(w)^{(2)}+\alpha\mathbf{W}^{1+\alpha}\mathbf{V}\right]\mathbf{W}^{-1}\,. (B.9)

Consequently, by the first-order optimality condition for w_{t} in the preceding minimization problem, we have \mathbf{0}=\nabla_{w}f_{t}(w_{t})=-\mathbf{W}_{t}^{-1}\sigma(w_{t})+\mathbf{W}_{t}^{\alpha}v_{t}. Rearranging the terms of this equation yields that

σ(wt)=𝐖t1+αvt,\sigma(w_{t})=\mathbf{W}_{t}^{1+\alpha}v_{t},

and therefore w1=w^w_{1}=\widehat{w} and w0=w¯w_{0}=\overline{w}. To prove the lemma, it therefore suffices to bound

ln(w^/w¯)=ln(w1/w0)=t=01[ddtln(wt)]𝑑t=t=01𝐖t1[ddtwt]𝑑t.\ln(\widehat{w}/\overline{w})=\ln(w_{1}/w_{0})=\int_{t=0}^{1}\left[\frac{d}{dt}\ln(w_{t})\right]dt=\int_{t=0}^{1}\mathbf{W}_{t}^{-1}\left[\frac{d}{dt}w_{t}\right]dt\,. (B.11)

To bound (B.11), it remains to compute ddtwt\frac{d}{dt}w_{t} and apply Claim 1. To do this, note that

𝟎=ddtw[ft(wt)]=w(ddtft)(wt)+ww2ft(wt)ddtwt.\mathbf{0}=\frac{d}{dt}\gradient_{w}\left[f_{t}(w_{t})\right]=\nabla_{w}(\frac{d}{dt}f_{t})(w_{t})+\nabla^{2}_{ww}f_{t}(w_{t})\cdot\frac{d}{dt}w_{t}\,.

Using that \mathbf{P}(w_{t})^{(2)}+\alpha\mathbf{W}_{t}^{1+\alpha}\mathbf{V}_{t}\succ\mathbf{0}, we have, by rearranging the above equation and applying (B.8) and (B.9), that

ddtwt\displaystyle\frac{d}{dt}w_{t} =[ww2ft(wt)]1[w(ddtft)(wt)]=𝐖t[𝐏(wt)(2)+α𝐖t1+α𝐕t]1𝐖t1+αvtln(v).\displaystyle=-\left[\nabla^{2}_{ww}f_{t}(w_{t})\right]^{-1}\cdot\left[\nabla_{w}(\frac{d}{dt}f_{t})(w_{t})\right]=-\mathbf{W}_{t}\left[\mathbf{P}(w_{t})^{(2)}+\alpha\mathbf{W}_{t}^{1+\alpha}\mathbf{V}_{t}\right]^{-1}\mathbf{W}_{t}^{1+\alpha}v_{t}\ln(v)\,. (B.12)

Applying the relation \sigma(w_{t})=\mathbf{W}_{t}^{1+\alpha}v_{t} established above to (B.12), we have that

𝐖t1[ddtwt]\displaystyle\mathbf{W}_{t}^{-1}\left[\frac{d}{dt}w_{t}\right] =[𝐏(wt)(2)+α𝚺(wt)]1𝚺(wt)ln(v).\displaystyle=-\left[\mathbf{P}(w_{t})^{(2)}+\alpha\mathbf{\Sigma}(w_{t})\right]^{-1}\mathbf{\Sigma}(w_{t})\ln(v)\,.

Applying Claim 1 to the above equality, substituting into (B.11), and using \|\ln(v)\|_{\infty}\leq\mu therefore yields

ln(w^/w¯)=ln(w1/w0)\displaystyle\norm{\ln(\widehat{w}/\overline{w})}_{\infty}=\norm{\ln(w_{1}/w_{0})}_{\infty} t=01𝐖t1[ddtwt]𝑑tt=01(1+n/αα)μ𝑑t.\displaystyle\leq\int_{t=0}^{1}\norm{\mathbf{W}_{t}^{-1}\left[\frac{d}{dt}w_{t}\right]}_{\infty}dt\leq\int_{t=0}^{1}\left(\frac{1+\sqrt{n}/\alpha}{\alpha}\right)\mu dt\,.

B.3 From Optimization Problem to Approximate Lewis Weights

See 1

Proof.

We are given a vector w\in\mathbb{R}^{m} satisfying \mathcal{F}(\overline{w})\leq\mathcal{F}(w)\leq\mathcal{F}(\overline{w})+\widetilde{\varepsilon}. Then by Lemma 5, we have \frac{(\sigma_{i}(w)-{w}_{i}^{1+\alpha})^{2}}{\sigma_{i}(w)+{w}_{i}^{1+\alpha}}\leq\widetilde{\varepsilon} for each i\in[m]. This bound implies that w_{i}^{1+\alpha}\leq 3 for all i: if not, then since \sigma_{i}(w)\in[0,1] and (x-a)^{2}/(x+a) is decreasing over x\in[0,1] for any fixed a\geq 3, we would obtain \frac{(\sigma_{i}(w)-{w}_{i}^{1+\alpha})^{2}}{\sigma_{i}(w)+{w}_{i}^{1+\alpha}}\geq\frac{(1-{w}_{i}^{1+\alpha})^{2}}{1+{w}_{i}^{1+\alpha}}\geq 1, a contradiction. Therefore \|\sigma(w)-w^{1+\alpha}\|_{\infty}\leq 2\sqrt{\widetilde{\varepsilon}}. Coupled with the provided guarantee \rho_{\max}(w)\leq 1+\alpha, we see that the requirements of Lemma 13 are met with \overline{\varepsilon}=2\sqrt{\widetilde{\varepsilon}}, for \widetilde{\varepsilon}\stackrel{{\scriptstyle\mathrm{{\scriptscriptstyle def}}}}{{=}}\frac{\widehat{\varepsilon}^{4}}{(25m(\alpha+\alpha^{-1}))^{4}}, and Algorithm 1 therefore guarantees a \widehat{w} satisfying \sigma(\widehat{w})\approx_{\widehat{\varepsilon}}\widehat{w}^{1+\alpha}. We can now apply Lemma 14 with \mu=\widehat{\varepsilon}, and choosing \widehat{\varepsilon}=\tfrac{\alpha^{2}}{\alpha+\sqrt{n}}\varepsilon lets us conclude that \widehat{w}_{i}\approx_{\varepsilon}\overline{w}_{i}, as claimed. ∎
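For concreteness, here is an illustrative (and deliberately non-optimized) instantiation of the parameter chain in this proof, with the hypothetical values m=10^{3}, n=10^{2}, \alpha=1, and target accuracy \varepsilon=0.1:

\widehat{\varepsilon}=\frac{\alpha^{2}}{\alpha+\sqrt{n}}\,\varepsilon=\frac{0.1}{11}\approx 9.1\times 10^{-3},\qquad \widetilde{\varepsilon}=\frac{\widehat{\varepsilon}^{4}}{(25m(\alpha+\alpha^{-1}))^{4}}\approx\frac{6.8\times 10^{-9}}{6.3\times 10^{18}}\approx 1.1\times 10^{-27},

so that \overline{\varepsilon}=2\sqrt{\widetilde{\varepsilon}}\approx 6.6\times 10^{-14} and \delta=20\sqrt{\overline{\varepsilon}}\,m(\alpha+\alpha^{-1})\approx 1.0\times 10^{-2}. Although \widetilde{\varepsilon} is tiny, the algorithms depend on it only through \log(m/\widetilde{\varepsilon})\approx 69, which is why the overall iteration complexity remains polylogarithmic in 1/\varepsilon.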

Appendix C A Geometric View of Rounding

At the end of Algorithms 2 and 3, the iterate w satisfies the condition \rho_{\max}(w)\leq 1+\alpha. We now show the geometry implied by this condition, thereby providing the reason behind the terminology “rounding.”

Lemma 15.

Let w\in\mathbb{R}_{>0}^{m} be such that \rho_{\max}(w)\leq 1+\alpha, and define the ellipsoid \mathcal{E}(w):=\{x:x^{\top}\mathbf{A}^{\top}\mathbf{W}\mathbf{A}x\leq 1\}. Then, we have that

(w){xn|𝐖α/2𝐀x1+α}.\mathcal{E}(w)\subset\{x\in\mathbb{R}^{n}~{}|~{}\|\mathbf{W}^{-\alpha/2}\mathbf{A}x\|_{\infty}\leq\sqrt{1+\alpha}\}.
Proof.

Consider any point x\in\mathcal{E}(w). Then, by the Cauchy-Schwarz inequality and \rho_{\max}(w)\leq 1+\alpha,

𝐖α/2𝐀x\displaystyle\|\mathbf{W}^{-\alpha/2}\mathbf{A}x\|_{\infty} =maxi[m]ei𝐖α/2𝐀x=maxi[m]ei𝐖α/2𝐀(𝐀𝐖𝐀)12(𝐀𝐖𝐀)12x\displaystyle=\max_{i\in[m]}e_{i}^{\top}\mathbf{W}^{-\alpha/2}\mathbf{A}x=\max_{i\in[m]}e_{i}^{\top}\mathbf{W}^{-\alpha/2}\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-\frac{1}{2}}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{\frac{1}{2}}x
maxi[m]ei𝐖α/2𝐀(𝐀𝐖𝐀)1𝐀𝐖α/2eix𝐀𝐖𝐀x\displaystyle\leq\max_{i\in[m]}\sqrt{e_{i}^{\top}\mathbf{W}^{-\alpha/2}\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top}\mathbf{W}^{-\alpha/2}e_{i}}\sqrt{x^{\top}\mathbf{A}^{\top}\mathbf{W}\mathbf{A}x}
maxi[m]ei𝐖α/2𝐀(𝐀𝐖𝐀)1𝐀𝐖α/2ei=maxi[m]σi(w)wi1+α1+α.\displaystyle\leq\max_{i\in[m]}\sqrt{e_{i}^{\top}\mathbf{W}^{-\alpha/2}\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top}\mathbf{W}^{-\alpha/2}e_{i}}=\max_{i\in[m]}\sqrt{\frac{\sigma_{i}(w)}{w_{i}^{1+\alpha}}}\leq\sqrt{1+\alpha}.
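The containment of Lemma 15 is also easy to observe numerically. The Python sketch below rescales a random positive w so that \rho_{\max}(w)=1+\alpha exactly (scaling w by c>0 multiplies every \rho_{i}(w) by c^{-(1+\alpha)}), samples points on the boundary of \mathcal{E}(w), and compares \|\mathbf{W}^{-\alpha/2}\mathbf{A}x\|_{\infty} against \sqrt{1+\alpha}; the instance is an arbitrary random one.

import numpy as np

rng = np.random.default_rng(5)
m, n, alpha = 50, 4, 1.0                     # arbitrary instance
A = rng.standard_normal((m, n))

w = rng.uniform(0.5, 2.0, size=m)
Y = np.linalg.solve(A.T @ (w[:, None] * A), A.T)
rho = np.einsum('ij,ji->i', A, Y) / w ** alpha            # rho_i(w) = sigma_i(w)/w_i^{1+alpha}
w = w * (rho.max() / (1 + alpha)) ** (1 / (1 + alpha))    # rescale so that rho_max(w) = 1 + alpha

H = A.T @ (w[:, None] * A)                   # E(w) = {x : x^T H x <= 1}
L = np.linalg.cholesky(H)
U = rng.standard_normal((n, 2000))
X = np.linalg.solve(L.T, U / np.linalg.norm(U, axis=0))   # columns lie on the boundary of E(w)

vals = np.abs((w ** (-alpha / 2))[:, None] * (A @ X)).max(axis=0)
print("max over samples of ||W^{-alpha/2} A x||_inf :", vals.max())
print("bound sqrt(1 + alpha)                        :", np.sqrt(1 + alpha))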

Appendix D Explanations of Runtimes in Prior Work

The convex program (1) formulated by [CP15] has n^{2} variables. Therefore, by [LSW15], the number of iterations needed to solve it using the cutting plane method is O(n^{2}\log(n\varepsilon^{-1})), with each iteration computing a_{i}^{\top}\mathbf{M}a_{i} for all i\in[m]. This can be done by multiplying an n\times n matrix with an n\times m matrix, which costs between O(mn) (at least the size of the larger input matrix) and O(mn^{2}) (each entry of the resulting m\times n matrix obtained via an inner product of length-n vectors). Further, the cutting plane method performs at least on the order of n^{6} additional work in total. This gives a cost of at least n^{2}(mn+n^{4}). The runtime of [Lee16] follows from Theorem 5.3.4 therein.