
Computing Lewis Weights to High Precision

Maryam Fazel (University of Washington)    Yin Tat Lee (University of Washington)    Swati Padmanabhan (University of Washington)    Aaron Sidford (Stanford University)

We present an algorithm for computing approximate $\ell_p$ Lewis weights to high precision. Given a full-rank $\mathbf{A}\in\mathbb{R}^{m\times n}$ with $m\geq n$ and a scalar $p>2$, our algorithm computes $\epsilon$-approximate $\ell_p$ Lewis weights of $\mathbf{A}$ in $\widetilde{O}_{p}(\log(1/\epsilon))$ iterations; the cost of each iteration is linear in the input size plus the cost of computing the leverage scores of $\mathbf{D}\mathbf{A}$ for diagonal $\mathbf{D}\in\mathbb{R}^{m\times m}$. Prior to our work, such a computational complexity was known only for $p\in(0,4)$ [CP15], and combined with this result, our work yields the first polylogarithmic-depth polynomial-work algorithm for the problem of computing $\ell_p$ Lewis weights to high precision for all constant $p>0$. An important consequence of this result is also the first polylogarithmic-depth polynomial-work algorithm for computing a nearly optimal self-concordant barrier for a polytope.

1 Introduction to Lewis Weights

In this paper, we study the problem of computing the $\ell_p$ Lewis weights (from hereon, simply “Lewis weights” for brevity) of a matrix.

Definition 1.

[Lew78, CP15] Given a full-rank matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ with $m\geq n$ and a scalar $p\in(0,\infty)$, the Lewis weights of $\mathbf{A}$ are the entries of the unique vector $\overline{w}\in\mathbb{R}^{m}$ (existence and uniqueness were first proven by D. R. Lewis [Lew78], after whom the weights are named) satisfying the equation

\overline{w}_{i}^{2/p}=a_{i}^{\top}(\mathbf{A}^{\top}\overline{\mathbf{W}}^{1-2/p}\mathbf{A})^{-1}a_{i}\ \text{ for all } i\in[m],

where $a_{i}$ is the $i$'th row of matrix $\mathbf{A}$ and $\overline{\mathbf{W}}$ is the diagonal matrix with vector $\overline{w}$ on the diagonal.
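To make the fixed-point condition above concrete, the following minimal sketch (assuming NumPy; the helper name lewis_residual is ours, not from the paper) measures how far a candidate weight vector $w$ is from satisfying Definition 1 by forming the quadratic forms directly.

```python
import numpy as np

def lewis_residual(A, w, p):
    """Maximum violation of w_i^{2/p} = a_i^T (A^T W^{1-2/p} A)^{-1} a_i over all rows i."""
    W_pow = w ** (1.0 - 2.0 / p)                        # diagonal entries of W^{1-2/p}
    M_inv = np.linalg.inv(A.T @ (W_pow[:, None] * A))   # (A^T W^{1-2/p} A)^{-1}
    quad = np.einsum('ij,jk,ik->i', A, M_inv, A)        # a_i^T M_inv a_i for every i
    return np.max(np.abs(w ** (2.0 / p) - quad))
```

At the exact Lewis weights this residual is zero.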

Motivation. We contextualize our problem with a simpler, geometric notion. Given a set of $m$ points $\{a_{i}\}_{i=1}^{m}\subset\mathbb{R}^{n}$ (the rows of the preceding matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$), their John ellipsoid [Joh48] is the minimum volume ellipsoid enclosing them. (The term John ellipsoid may also refer to the maximal volume ellipsoid contained in the set $\{x:|x^{\top}a_{i}|\leq 1\}$, but in this paper, we use the former definition.) This ellipsoid finds use across experiment design and computational geometry [Tod16] and is central to certain cutting-plane methods [Vai89, LSW15], a class of algorithms fundamental to mathematical optimization (Section 1.3). It turns out that the John ellipsoid of a set of points $\{a_{i}\}_{i=1}^{m}\subset\mathbb{R}^{n}$ is expressible [BV04] as the solution to the following convex program, with the objective being a stand-in for the volume of the ellipsoid and the constraints encoding the requirement that each given point $a_{i}$ lie within the ellipsoid:

\mbox{minimize}_{\mathbf{M}\succeq 0}\ \det(\mathbf{M})^{-1},\ \textup{ subject to } a_{i}^{\top}\mathbf{M}a_{i}\leq 1\ \textup{ for all } i\in[m].

The problem (1) may be generalized by the following convex program [Woj96, CP15], the generalization immediate from substituting $p=\infty$ in (1):

\mbox{minimize}_{\mathbf{M}\succeq 0}\ \det(\mathbf{M})^{-1},\ \textup{ subject to } \sum_{i=1}^{m}(a_{i}^{\top}\mathbf{M}a_{i})^{p/2}\leq 1.

Geometrically, (1) seeks the minimum volume ellipsoid with a bound on the $p/2$-norm of the distance of the points to the ellipsoid, and its solution $\mathbf{M}$ is the “Lewis ellipsoid” [CP15] of $\{a_{i}\}_{i=1}^{m}$.

The optimality condition of (1), written using $\overline{w}\in\mathbb{R}^{m}$ defined as $\overline{w}_{i}\stackrel{\mathrm{def}}{=}(a_{i}^{\top}\mathbf{M}a_{i})^{p/2}$, is equivalent to (1), and this demonstrates that solving (1) is one approach to obtaining the Lewis weights of $\mathbf{A}$ (see [CP15]). This equivalence also underscores the fact that the problem of computing Lewis weights is a natural $\ell_p$ generalization of the problem of computing the John ellipsoid.

More broadly, Lewis weights are ubiquitous across statistics, machine learning, and mathematical optimization in diverse applications, of which we presently highlight two (see Section 1.3 for details). First, their interpretation as “importance scores” of rows of matrices makes them key to shrinking the row dimension of input data [DMM06]. Second, through their role in constructing self-concordant barriers of polytopes [LS14], variants of Lewis weights have found prominence in recent advances in the computational complexity of linear programming.

From a purely optimization perspective, Lewis weights may be viewed as the optimal solution to the following convex optimization problem (which is in fact essentially dual to (1)):

\overline{w}=\arg\min_{w\in\mathbb{R}^{m}_{>0}}\mathcal{F}(w)\stackrel{\mathrm{def}}{=}-\log\det(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})+\frac{1}{1+\alpha}\mathbf{1}^{\top}w^{1+\alpha},\ \textrm{ for } \alpha=\tfrac{2}{p-2}.

As elaborated in [CP15, LS19], the reason this problem yields the Lewis weights is that an appropriate scaling of its solution $\overline{w}$ transforms its optimality condition from $\overline{w}_{i}^{\alpha}=a_{i}^{\top}(\mathbf{A}^{\top}\overline{\mathbf{W}}\mathbf{A})^{-1}a_{i}$ to (1). The problem (1) is a simple and natural one and, in the case of $\alpha=1$ (corresponding to the John ellipsoid), has been the subject of study for designing new optimization methods [Tod16].
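For reference, $\mathcal{F}$ as written above can be evaluated directly; the sketch below (assuming NumPy; lewis_objective is a name we introduce for illustration) does exactly that.

```python
import numpy as np

def lewis_objective(A, w, p):
    """F(w) = -logdet(A^T W A) + (1/(1+alpha)) * sum_i w_i^(1+alpha), with alpha = 2/(p-2)."""
    alpha = 2.0 / (p - 2.0)
    sign, logabsdet = np.linalg.slogdet(A.T @ (w[:, None] * A))
    assert sign > 0, "A^T W A should be positive definite for full-rank A and w > 0"
    return -logabsdet + np.sum(w ** (1.0 + alpha)) / (1.0 + alpha)
```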

In summary, Lewis weights naturally arise as generalizations of extensively studied problems in convex geometry and optimization. This, coupled with their role in machine learning, makes understanding the complexity of computing Lewis weights, i.e., solving (1), a fundamental problem.

Our Goal.

We aim to design high-precision algorithms for computing $\varepsilon$-approximate Lewis weights, i.e., a vector $w\in\mathbb{R}^{m}$ satisfying

w_{i}\approx_{\varepsilon}\overline{w}_{i}\ \text{ for all } i\in[m],\ \text{ where } \overline{w}\text{ is defined in Definition 1 and (1)},

where $a\approx_{\varepsilon}b$ denotes $(1-\varepsilon)a\leq b\leq(1+\varepsilon)a$. To this end, we design algorithms to solve the convex program (1) to $\widetilde{\varepsilon}$-additive accuracy for an appropriate $\widetilde{\varepsilon}=\mathrm{poly}(\varepsilon,n)$, which we prove suffices in Lemma 1.

By a “high-precision” algorithm, we mean one with a runtime polylogarithmic in $1/\varepsilon$. We emphasize that for several applications such as randomized sampling [CP15], approximate Lewis weights suffice; however, we believe that high-precision methods such as ours enrich our understanding of the structure of the optimization problem (1). Further, as stated in Theorem 3, such methods yield new runtimes for directly computing a near-optimal self-concordant barrier for polytopes.

We use the number of leverage score computations as the complexity measure of our algorithms. Our choice is a result of the fact that leverage scores of appropriately scaled matrices appear both in $\nabla\mathcal{F}(w)$ (see Lemma 3) and in the verification of correctness of Lewis weights. This measure of complexity stresses the number of iterations rather than the details of iteration costs (which depend on the exact techniques used for leverage score computation, e.g., fast matrix multiplication) and is consistent with many prior algorithms (see Table 1).

Prior Results.

The first polynomial-time algorithm for computing Lewis weights was presented by [CP15] and performed only $\widetilde{O}_{p}(\log(1/\varepsilon))$ leverage score computations. (We use $O_{p}$ to hide a polynomial in $p$, and $\widetilde{O}$ and $\widetilde{\Omega}$ to hide factors polylogarithmic in $p$, $n$, and $m$.) However, their result holds only for $p\in(0,4)$. We explain the source of this limited range in Section 1.2.

In comparison, for $p\geq 4$, existing algorithms are slower: the algorithms by [CP15], [Lee16], and [LS19] perform $\widetilde{\Omega}(n)$, $\widetilde{O}(1/\varepsilon)$, and $\widetilde{O}(\sqrt{n})$ leverage score computations, respectively. [CP15] also gave an algorithm with total runtime $\mathcal{O}(\tfrac{1}{\varepsilon}\mathrm{nnz}(\mathbf{A})+c_{p}n^{O(p)})$. Of note is the fact that the algorithms with runtimes polynomial in $1/\varepsilon$ ([Lee16, CP15]) satisfy the weaker approximation condition $\overline{w}_{i}^{2/p}\approx_{\varepsilon}a_{i}^{\top}(\mathbf{A}^{\top}\overline{\mathbf{W}}^{1-2/p}\mathbf{A})^{-1}a_{i}$, which is in fact implied by our condition (1).

We display these runtimes in Table 1, assuming that the cost of a leverage score computation is $O(mn^{2})$ (which, we reiterate, may be reduced through the use of fast matrix multiplication). In terms of the number of leverage score computations, Table 1 highlights the contrast between the polylogarithmic dependence on input size and accuracy for $p\in(0,4)$ and the polynomial dependence on these factors for $p\geq 4$. The motivation behind our paper is to close this gap.

1.1 Our Contribution

We design an algorithm that computes Lewis weights to high precision for all $p>2$ using only $\widetilde{O}_{p}(\log(1/\varepsilon))$ leverage score computations. Together with [CP15]'s result for $p\in(0,4)$, our result therefore completes the picture on a near-optimal reduction from leverage scores to Lewis weights for all $p>0$.

Theorem 1 (Main Theorem (Parallel)).

Given a full-rank matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ and $p\geq 4$, we can compute (Algorithms 1 and 2) its $\varepsilon$-approximate Lewis weights (1) in $O(p^{3}\log(mp/\varepsilon))$ iterations. (Our algorithms work for all $p>2$, as can be seen in our proof in Section 3.1; however, for $p\in(2,4)$, the algorithm of [CP15] is faster, and we therefore state runtimes in our main theorems only for $p\geq 4$.) Each iteration computes the leverage scores of a matrix $\mathbf{D}\mathbf{A}$ for a diagonal matrix $\mathbf{D}$. The total runtime is $O(p^{3}mn^{2}\log(mp/\varepsilon))$, with $O(p^{3}\log(mp/\varepsilon)\log^{2}(m))$ depth.

Theorem 1 is attained by a parallel algorithm for computing Lewis weights that consists of polylogarithmically many rounds of leverage score computations and therefore has polylogarithmic depth, a result that was not known prior to this work.

Theorem 2 (Main Theorem (Sequential)).

Given a full-rank matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ and $p\geq 4$, we can compute (Algorithms 1 and 3) its $\varepsilon$-approximate Lewis weights (1) in $O(pm\log(mp/\varepsilon))$ iterations. Each iteration computes the leverage score of one row of $\mathbf{D}\mathbf{A}$ for a diagonal matrix $\mathbf{D}$. The total runtime is $O(pmn^{2}\log(mp/\varepsilon))$.

Remark 1.1.

The solution to (1) characterizes a “Lewis ellipsoid,” and the $\ell_{\infty}$ Lewis ellipsoid of $\mathbf{A}$ is precisely its John ellipsoid. After symmetrization [Tod16], computing the John ellipsoid is equivalent to solving a linear program (LP). Therefore, computing Lewis weights in $O(\log(mp/\varepsilon))$ iterations would imply a polylogarithmic-depth algorithm for solving LPs, which, given the current $O(\sqrt{n})$ depth [LS19], would be a significant breakthrough in the field of optimization. We therefore believe that it would be difficult to remove the polynomial dependence on $p$ in our runtime.

Authors | Range of $p$ | Number of Leverage Score Computations/Depth | Total Runtime
[CP15] | $p\in(0,4)$ | $O\left(\tfrac{1}{1-|1-p/2|}\cdot\log(\tfrac{\log(m)}{\varepsilon})\right)$ | $O\left(\tfrac{1}{1-|1-p/2|}\cdot mn^{2}\cdot\log(\tfrac{\log(m)}{\varepsilon})\right)$
[CP15] | $p\geq 4$ | $\Omega(n)$ | $\Omega(mn^{3}\cdot\log(\tfrac{m}{\varepsilon}))$
[CP15]* | $p\geq 4$ | not applicable | $O\left(\tfrac{\mathrm{nnz}(\mathbf{A})}{\varepsilon}+c_{p}n^{O(p)}\right)$
[Lee16]* | $p\geq 4$ | $O\left(\tfrac{1}{\varepsilon}\cdot\log(m/n)\right)$ | $O\left(\left(\tfrac{\mathrm{nnz}(\mathbf{A})}{\varepsilon}+\tfrac{n^{3}}{\varepsilon^{3}}\right)\cdot\log(m/n)\right)$
[LS19] | $p\geq 4$ | $O(p^{2}\cdot n^{1/2}\cdot\log(\tfrac{1}{\varepsilon}))$ | $O(p^{2}\cdot mn^{2.5}\cdot\mathrm{poly}\log(\tfrac{m}{\varepsilon}))$
Theorem 1 | $p\geq 4$ | $O(p^{3}\cdot\log(\tfrac{mp}{\varepsilon}))$ | $O(p^{3}\cdot mn^{2}\cdot\log(\tfrac{mp}{\varepsilon}))$
Table 1: Runtime comparison for computing Lewis weights. Results with asterisks use a weaker notion of approximation than our paper (1). All dependencies on $n$ in the running times of these methods can be improved using fast matrix multiplication.

1.2 Overview of Approach

Before presenting our algorithm, we describe obstacles to directly extending previous work on the problem for $p\in(0,4)$ to the case $p\geq 4$. For $p\in(0,4)$, [CP15, LS19] design algorithms that, with a single computation of leverage scores, make constant (dependent on $p$) multiplicative progress on error (such as function error or distance to the optimal point), thus attaining runtimes polylogarithmic in $1/\varepsilon$. However, these methods crucially rely on contractive properties that, in contrast to our work, do not necessarily hold for $p\geq 4$.

For example, one of the algorithms in [CP15] starts with a vector $v\approx_{c}\overline{w}$, where $\overline{w}$ is the vector of true Lewis weights and $c$ some constant. Consequently, we have $(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{V}^{1-2/p}\mathbf{A})^{-1}a_{i})^{p/2}\approx_{c^{|p/2-1|}}(a_{i}^{\top}(\mathbf{A}^{\top}\overline{\mathbf{W}}^{1-2/p}\mathbf{A})^{-1}a_{i})^{p/2}$. Since this map is a contraction for $|p/2-1|<1$, or equivalently, for $p\in(0,4)$, $O(\log\log n)$ recursive calls to it give Lewis weights for $p<4$; but the contraction, and by extension this method, does not immediately extend to the setting $p\geq 4$.

Prior algorithms for $p\geq 4$ therefore resort to alternate optimization techniques. [CP15] frames Lewis weight computation as determinant maximization (1) (see Section D) and applies cutting-plane methods [GLS81, LSW15]. [Lee16] uses mirror descent, and [LS19] uses homotopy methods. These approaches yield runtimes with $\mathrm{poly}(n)$ or $\mathrm{poly}(\tfrac{1}{\varepsilon})$ leverage score computations, and therefore, in order to attain runtimes of $\mathrm{polylog}(1/\varepsilon)$ leverage score computations, we need to rethink the algorithm.

Our Approach.

As stated in Section 1, to obtain $\varepsilon$-approximate Lewis weights for $p\geq 4$, we compute a $w$ that satisfies $\mathcal{F}(\overline{w})\leq\mathcal{F}(w)\leq\mathcal{F}(\overline{w})+\widetilde{\varepsilon}$, where $\mathcal{F}$ and $\overline{w}$ are as defined in (1) and $\widetilde{\varepsilon}=O(\mathrm{poly}(n,\varepsilon))$. In light of the preceding bottlenecks in prior work, we circumvent techniques that directly target constant multiplicative progress (on some potential) in each iteration.

Our main technical insight is that when the leverage scores for the current weight $w\in\mathbb{R}^{m}_{>0}$ satisfy a certain technical condition (inequality (1.2)), it is indeed possible to update $w$ to obtain a multiplicative decrease in the function error $\mathcal{F}(w)-\mathcal{F}(\overline{w})$, thus resulting in our target runtime. To turn this insight into an algorithm, we design a corrective procedure that ensures that (1.2) is always satisfied: in other words, whenever (1.2) is violated, this procedure updates $w$ so that the new $w$ does satisfy (1.2), setting the stage for the aforementioned multiplicative progress. An important additional property of this procedure is that it does not increase the objective function and is therefore in keeping with our goal of minimizing (1).

Specifically, the technical condition that our geometric decrease in function error hinges on is

\max_{i\in[m]}\frac{a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i}}{w_{i}^{\alpha}}\leq 1+\alpha\,.

This ratio follows naturally from the gradient and Hessian of the function objective (see Lemma 3). Our algorithm's update rule to solve (1) is obtained from minimizing a second-order approximation to the objective at the current point, and the condition specified in (1.2) allows us to relate the progress of a type of quasi-Newton step to lower bounds on the progress there is to make, which is critical to turning a runtime of $\mathrm{poly}(1/\varepsilon)$ into $\mathrm{polylog}(1/\varepsilon)$ (Lemma 5).

The process of updating $w$ so that (1.2) goes from being violated to being satisfied corresponds, geometrically, to sufficiently rounding the ellipsoid $\mathcal{E}(w)=\{x:x^{\top}\mathbf{A}^{\top}\mathbf{W}\mathbf{A}x\leq 1\}$; specifically, the updated ellipsoid satisfies $\mathcal{E}(w)\subseteq\{\|\mathbf{W}^{\frac{1}{2-p}}\mathbf{A}x\|_{\infty}\leq\sqrt{1+\alpha}\}$ (see Section C), and this is the reason we use the term “rounding” to describe our corrective procedure to get $w$ to satisfy (1.2) and the term “rounding condition” to refer to (1.2).

We develop two versions of rounding: a parallel method and a sequential one that has an improved dependence on $p$. Each version is based on the principles that (1) one can increase those entries of $w$ at which the rounding condition (1.2) does not hold while decreasing the objective value, and (2) the vector $w$ obtained after this update is closer to satisfying (1.2).

We believe that such a principle of identifying a technical condition needed for fast convergence, along with the accompanying rounding procedures, could be useful in other optimization problems. Additionally, we develop Algorithm 4, which, by varying the step sizes in the update rule, maintains (1.2) as an invariant, thereby eliminating the need for separate rounding and progress steps.

1.3 Applications and Related Work

We elaborate here on the applications of Lewis weights we briefly alluded to in Section 1. While for many applications (such as pre-processing in optimization [CP15]) approximate weights suffice, solving regularized D-optimal design and computing $\tilde{O}(n)$-self-concordant barriers to high precision do require high-precision Lewis weights.

Pre-processing in optimization.

Lewis weights are used as scores to sample rows of an input tall data matrix so that the $\ell_p$ norms of the product of the matrix with vectors are preserved. They have been used in row sampling algorithms for data pre-processing [DMM06, DMIMW12, LMP13, CLM+15], for computing dimension-free strong coresets for $k$-median and subspace approximation [SW18], and for fast tensor factorization in the streaming model [CCDS20]. Lewis weights are also used for $\ell_1$ regression, a popular model in machine learning used to capture robustness to outliers, in: [DLS18] for stochastic gradient descent pre-conditioning, [LWYZ20] for quantile regression, and [BDM+20] to provide algorithms for linear algebraic problems in the sliding window model.

John ellipsoid and D-optimal design.

As noted in Remark 1.1, a fast algorithm for Lewis weights could yield faster algorithms for computing the John ellipsoid, a problem with a long history of work [Kha96, SF04, KY05, DAST08, CCLY19, ZF20]. It is known [Tod16] that the John ellipsoid problem is dual to the (relaxed) D-optimal experiment design problem [Puk06]. D-optimal design seeks to select a set of linear experiments with the largest confidence ellipsoid for its least-squares estimator [AZLSW17, MSTX19, SX20].

Our problem (1) is equivalent to $\frac{p}{p-2}$-regularized D-optimal design, which can be interpreted as enforcing a polynomial experiment cost: viewing $w_{i}$ as the fraction of resources allocated to experiment $i$, each $w_{i}$ is penalized by $w_{i}^{\frac{p}{p-2}}$. This regularization also appears in fair packing and fair covering problems [MSZ16, DFO20] from operations research.

Self-concordance.

Self-concordant barriers are fundamental in convex optimization [NN94], combinatorial optimization [LS14], sampling [KN09, LLV20], and online learning [AHR08]. Although there are (nearly) optimal self-concordant barriers for any convex set [NN94, BE15, LY18], computing them involves sampling from log-concave distributions, itself an expensive process with a $\mathrm{poly}(1/\varepsilon)$ runtime. [LS14] shows how to construct nearly optimal barriers for polytopes using Lewis weights. Unfortunately, doing so still requires polynomially many steps to compute these weights; [LS14] bypasses this issue by showing that it suffices to work with Lewis weights for $p\approx 1$. In this paper, we show how to compute Lewis weights by computing the leverage scores of polylogarithmically many matrices. This gives the first nearly optimal self-concordant barrier for polytopes that can be evaluated to high accuracy with depth polylogarithmic in the dimension.

Theorem 3 (Applying Theorem 1 to [LS19, Section 5]).

Given a non-empty polytope $P=\{x\in\mathbb{R}^{n}~|~\mathbf{A}x>b\}$ for full-rank $\mathbf{A}\in\mathbb{R}^{m\times n}$, there is an $O(n\log^{5}m)$-self-concordant barrier $\psi$ for $P$ such that for any $\epsilon>0$ and $x\in P$, in $O(mn^{\omega-1}\log^{3}m\log(m/\epsilon))$ work and $O(\log^{3}m\log(m/\epsilon))$ depth, we can compute $g\in\mathbb{R}^{n}$ and $\mathbf{H}\in\mathbb{R}^{n\times n}$ with $\|g-\nabla\psi(x)\|_{\nabla^{2}\psi(x)^{-1}}\leq\epsilon$ and $\nabla^{2}\psi(x)\preceq\mathbf{H}\preceq O(\log m)\nabla^{2}\psi(x)$. With an additional $O(m^{\omega+o(1)})$ work, $\mathbf{H}\in\mathbb{R}^{n\times n}$ with $(1-\epsilon)\nabla^{2}\psi(x)\preceq\mathbf{H}\preceq O(1+\epsilon)\nabla^{2}\psi(x)$ can be computed as well.

1.4 Notation and Preliminaries

We use $\mathbf{A}$ to denote our full-rank $m\times n$ ($m\geq n$) real-valued input matrix and $\overline{w}\in\mathbb{R}^{m}$ to denote the vector of Lewis weights of $\mathbf{A}$, as defined in (1) and (1). All matrices appear in boldface uppercase and vectors in lowercase. For any vector (say, $\sigma$), we use its uppercase boldfaced form ($\mathbf{\Sigma}$) to denote the diagonal matrix with $\mathbf{\Sigma}_{ii}=\sigma_{i}$. For a matrix $\mathbf{M}$, the matrix $\mathbf{M}^{(2)}$ is the Schur product (entry-wise product) of $\mathbf{M}$ with itself. For matrices $\mathbf{A}$ and $\mathbf{B}$, we use $\mathbf{A}\succeq\mathbf{B}$ to mean that $\mathbf{A}-\mathbf{B}$ is positive semidefinite. For vectors $a$ and $b$, the inequality $a\leq b$ applies entry-wise. We use $e_{i}$ to denote the $i$'th standard basis vector. We define $[n]\stackrel{\mathrm{def}}{=}\{1,2,\dotsc,n\}$. As in (1), since we defined $\alpha\stackrel{\mathrm{def}}{=}\frac{2}{p-2}$, the ranges $p\in(2,4)$ and $p\geq 4$ translate to $\alpha>1$ and $\alpha\in(0,1]$, respectively. From hereon, we work with $\alpha$. We also define $\bar{\alpha}=\max\{1,\alpha\}$. For a matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ and $w\in\mathbb{R}^{m}_{>0}$, we define the projection matrix $\mathbf{P}(w)\stackrel{\mathrm{def}}{=}\mathbf{W}^{1/2}\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top}\mathbf{W}^{1/2}\in\mathbb{R}^{m\times m}$. The quantity $\mathbf{P}(w)_{ii}$ is precisely the leverage score of the $i$'th row of $\mathbf{W}^{1/2}\mathbf{A}$:

\sigma_{i}(w)\stackrel{\mathrm{def}}{=}w_{i}\cdot a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i}.
Fact 1.1 ([LS14]).

For all $w\in\mathbb{R}_{>0}^{m}$, we have $0\leq\sigma_{i}(w)\leq 1$ for all $i\in[m]$, $\sum_{i\in[m]}\sigma_{i}(w)\leq n$, and $\mathbf{0}\preceq\mathbf{P}(w)^{(2)}\preceq\mathbf{\Sigma}(w)$.
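For concreteness, the leverage scores $\sigma(w)$ can be computed directly from their definition, as in the following sketch (assuming NumPy; the helper name is ours, and in practice one would use a factorization or sketching instead of an explicit inverse).

```python
import numpy as np

def leverage_scores(A, w):
    """sigma_i(w) = w_i * a_i^T (A^T W A)^{-1} a_i, i.e., the diagonal of P(w)."""
    M_inv = np.linalg.inv(A.T @ (w[:, None] * A))     # (A^T W A)^{-1}
    return w * np.einsum('ij,jk,ik->i', A, M_inv, A)  # w_i * a_i^T M_inv a_i

# Fact 1.1 in code: every score lies in [0, 1], and their sum is at most n
# (for full-rank A and w > 0 the sum equals n, the trace of the projection P(w)).
```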

2 Our Algorithm

We present Algorithm 1 to compute an $\widetilde{\varepsilon}$-additive solution to (1). We first provide the following definitions that we frequently refer to in our algorithm and analysis. Given $\alpha>0$ and $w\in\mathbb{R}^{m}_{>0}$, the $i$'th coordinate of $\rho(w)\in\mathbb{R}^{m}$ is

\rho_{i}(w)\stackrel{\mathrm{def}}{=}\frac{\sigma_{i}(w)}{w_{i}^{1+\alpha}}.

Based on this quantity, we define the following procedure, derived from approximating a quasi-Newton update on the objective $\mathcal{F}$ from (1):

\left[\textbf{Descent}(w,\textup{C},\eta)\right]_{i}\stackrel{\mathrm{def}}{=}w_{i}\left[1+\eta_{i}\cdot\frac{\rho_{i}(w)-1}{\rho_{i}(w)+1}\right]\ \text{ for all } i\in\textup{C}\subseteq\{1,2,\dotsc,m\}.

Using these definitions, we can now describe our algorithm. Depending on whether the following condition (the “rounding condition”) holds, we run either $\textbf{Descent}(\cdot)$ or $\textbf{Round}(\cdot)$ in each iteration:

\rho_{\max}(w)\stackrel{\mathrm{def}}{=}\max_{i\in[m]}\rho_{i}(w)\leq 1+\alpha.

Specifically, if (2) is not satisfied, we run $\textbf{Round}(\cdot)$, which returns a vector that does satisfy it without increasing the objective value. We design two versions of $\textbf{Round}(\cdot)$, one parallel (Algorithm 2) and one sequential (Algorithm 3), with the sequential algorithm having an improved dependence on $\alpha$, to update the coordinates violating (2). We apply one extra step of rounding to the vector returned after $\mathcal{T}_{\textrm{total}}$ iterations of Algorithm 1 and transform it appropriately to obtain our final output. In the following lemma (proved in Section B), we justify that this output is indeed the solution to (1).

Lemma 1 (Lewis Weights from Optimization Solution).

Let $w\in\mathbb{R}_{>0}^{m}$ be a vector at which the objective (1) is $\widetilde{\varepsilon}$-suboptimal in the additive sense for $\widetilde{\varepsilon}=\tfrac{\alpha^{8}\varepsilon^{4}}{(25m(\sqrt{n}+\alpha)(\alpha+\alpha^{-1}))^{4}}$, i.e., $\mathcal{F}(\overline{w})\leq\mathcal{F}(w)\leq\mathcal{F}(\overline{w})+\widetilde{\varepsilon}$. Further assume that $w$ satisfies the rounding condition $\rho_{\max}(w)\leq 1+\alpha$. Then, the vector $\widehat{w}$ defined as $\widehat{w}_{i}=(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i})^{1/\alpha}$ satisfies $\widehat{w}_{i}\approx_{\varepsilon}\overline{w}_{i}$ for all $i\in[m]$, thus achieving the goal spelt out in (1).

Algorithm 1 Lewis Weight Computation Meta-Algorithm

Input: Matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$, parameter $p>2$, accuracy $\varepsilon$

Output: Vector $\widehat{w}\in\mathbb{R}^{m}_{>0}$ that satisfies (1)

For all $i\in[m]$, initialize $w_{i}^{(0)}=\frac{n}{m}$.

Set $\alpha=\frac{2}{p-2}$, $\bar{\alpha}=\max(\alpha,1)$, $\widetilde{\varepsilon}=\tfrac{\alpha^{8}\varepsilon^{4}}{(25m(\sqrt{n}+\alpha)(\alpha+\alpha^{-1}))^{4}}$, and $\mathcal{T}_{\textrm{total}}=\mathcal{O}(\max(\alpha^{-1},\alpha)\log(m/\widetilde{\varepsilon}))$.

for $k=1,2,3,\dotsc,\mathcal{T}_{\textrm{total}}$ do

       $\widetilde{w}^{(k)}\leftarrow\textbf{Round}(w^{(k-1)},\mathbf{A},\alpha)$ \triangleright Invoke Algorithm 2 (parallel) or 3 (sequential)
       $w^{(k)}\leftarrow\textbf{Descent}(\widetilde{w}^{(k)},[m],\tfrac{1}{3\bar{\alpha}}\mathbf{1})$ \triangleright See (2) and Lemma 2
end for
Set $w_{\textrm{R}}\leftarrow\textbf{Round}(w^{(\mathcal{T}_{\textrm{total}})},\mathbf{A},\alpha)$
Return $\widehat{w}\in\mathbb{R}^{m}_{>0}$, where $\widehat{w}_{i}=(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}_{\textrm{R}}\mathbf{A})^{-1}a_{i})^{1/\alpha}$. \triangleright See Section B
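To illustrate how the pieces of Algorithm 1 fit together, here is a compact NumPy sketch of the meta-algorithm with the parallel rounding step inlined. It is an illustration only: the function names are ours, and the iteration budget num_iters is a stand-in for $\mathcal{T}_{\textrm{total}}$ rather than the tuned value from the analysis.

```python
import numpy as np

def sigma(A, w):
    """Leverage scores of W^{1/2} A (see Section 1.4)."""
    M_inv = np.linalg.inv(A.T @ (w[:, None] * A))
    return w * np.einsum('ij,jk,ik->i', A, M_inv, A)

def rho(A, w, alpha):
    """rho_i(w) = sigma_i(w) / w_i^{1+alpha}."""
    return sigma(A, w) / w ** (1.0 + alpha)

def descent(A, w, coords, eta, alpha):
    """Descent update (2): w_i <- w_i * (1 + eta * (rho_i - 1)/(rho_i + 1)) on coords."""
    r = rho(A, w, alpha)
    w = w.copy()
    w[coords] *= 1.0 + eta * (r[coords] - 1.0) / (r[coords] + 1.0)
    return w

def round_parallel(A, w, alpha, eta):
    """Algorithm 2: repeatedly push up coordinates violating the rounding condition (2)."""
    while True:
        bad = np.where(rho(A, w, alpha) > 1.0 + alpha)[0]
        if bad.size == 0:
            return w
        w = descent(A, w, bad, eta, alpha)

def lewis_weights_meta(A, p, num_iters=100):
    """Simplified Algorithm 1 (with the parallel Round)."""
    m, n = A.shape
    alpha = 2.0 / (p - 2.0)
    eta = 1.0 / (3.0 * max(alpha, 1.0))
    w = np.full(m, n / m)
    for _ in range(num_iters):
        w = round_parallel(A, w, alpha, eta)          # Round step
        w = descent(A, w, np.arange(m), eta, alpha)   # Descent (progress) step
    w = round_parallel(A, w, alpha, eta)              # one extra rounding pass
    # Final transformation: w_hat_i = (a_i^T (A^T W A)^{-1} a_i)^{1/alpha} = (sigma_i(w)/w_i)^{1/alpha}.
    return (sigma(A, w) / w) ** (1.0 / alpha)
```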
Algorithm 2 RoundParallel($w$, $\mathbf{A}$, $\alpha$)

Input: Vector $w\in\mathbb{R}^{m}_{>0}$, matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$, parameter $\alpha>0$

Output: Vector $w\in\mathbb{R}^{m}_{>0}$ satisfying (2)

Define $\rho(w)$ as in (2)

while $\textup{C}=\{i~|~\rho_{i}(w)>1+\alpha\}\neq\emptyset$ do

       $w\leftarrow\textbf{Descent}(w,\textup{C},\tfrac{1}{3\bar{\alpha}}\mathbf{1})$ \triangleright See Section 3
end while
Return $w$
Algorithm 3 RoundSequential($w$, $\mathbf{A}$, $\alpha$)

Input: Vector $w\in\mathbb{R}^{m}_{>0}$, matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$, parameter $\alpha>0$

Output: Vector $w\in\mathbb{R}^{m}_{>0}$ satisfying (2)

Define $\rho(w)$ as in (2) and $\sigma(w)$ as in (1.4)

Define $\textup{C}=\{i~|~\rho_{i}(w)\geq 1\}$

for $i\in\textup{C}$ do

       $w_{i}\leftarrow w_{i}(1+\delta_{i})$, where $\delta_{i}$ solves $\rho_{i}(w)=(1+\delta_{i}\sigma_{i}(w))(1+\delta_{i})^{\alpha}$ \triangleright See Section 4
end for
Return $w$
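The coordinate update in Algorithm 3 requires solving a scalar equation for $\delta_i$. Since the right-hand side $(1+\delta\sigma_i(w))(1+\delta)^{\alpha}$ is increasing in $\delta\geq 0$, equals $1\leq\rho_i(w)$ at $\delta=0$ for $i\in\textup{C}$, and exceeds $\rho_i(w)$ at $\delta=\rho_i(w)^{1/\alpha}-1$, a simple bisection suffices; the sketch below (a plain-Python helper of our own) is one way to do it.

```python
def solve_delta(rho_i, sigma_i, alpha, tol=1e-12):
    """Solve rho_i = (1 + delta * sigma_i) * (1 + delta)^alpha for delta >= 0 by bisection.

    Assumes rho_i >= 1 (i.e., coordinate i is in C), so the root lies in
    [0, rho_i**(1/alpha) - 1], over which the right-hand side is monotone increasing.
    """
    lo, hi = 0.0, max(rho_i ** (1.0 / alpha) - 1.0, 0.0)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (1.0 + mid * sigma_i) * (1.0 + mid) ** alpha < rho_i:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```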

2.1 Analysis of $\textbf{Descent}(\cdot)$

We first analyze $\textbf{Descent}(\cdot)$ since it is common to both the parallel and sequential algorithms.

Lemma 2 (Iteration Complexity of $\textbf{Descent}(\cdot)$).

Each iteration of $\textbf{Descent}(\cdot)$ (described in (2)) decreases the value of $\mathcal{F}$. Assuming that $\textbf{Round}(\cdot)$ does not increase the value of the objective in (1), for any given accuracy parameter $0<\widetilde{\varepsilon}<1$, the number of $\textbf{Descent}(\cdot)$ steps that Algorithm 1 performs before achieving $\mathcal{F}(w)\leq\mathcal{F}(\overline{w})+\widetilde{\varepsilon}$ is given by the following bound:

\mathcal{T}_{\textrm{total}}=\mathcal{O}(\max(\alpha^{-1},\alpha)\log(m/\widetilde{\varepsilon})).

As is often the case to obtain such an iteration complexity, we prove Lemma 2 by incorporating the maximum sub-optimality in function value (Lemma 5) and the initial error bound (Lemma 4) into the inequality describing minimum function progress (Lemma 6). The assumption that $\textbf{Round}(\cdot)$ does not increase the value of the objective is justified in Lemma 7.

Since our algorithm leverages quasi-Newton steps, we begin our analysis by stating the gradient and Hessian of the objective in (1) as well as the error at the initial vector $w^{(0)}$, as measured against the optimal function value. The Hessian below is positive semidefinite when $\alpha\geq 0$ (equivalently, when $p\geq 2$) and not necessarily so otherwise. Consequently, the objective is convex for $\alpha\geq 0$, and we therefore consider only this setting throughout.

Lemma 3 (Gradient and Hessian).

For any $w\in\mathbb{R}^{m}_{>0}$, the objective in (1), $\mathcal{F}(w)=-\log\det(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})+\frac{1}{1+\alpha}\mathbf{1}^{\top}w^{1+\alpha}$, has gradient and Hessian given by the following expressions.

\left[\nabla\mathcal{F}(w)\right]_{i}=w_{i}^{-1}\cdot(w_{i}^{1+\alpha}-\sigma_{i}(w))\ \text{ and }\ \nabla^{2}\mathcal{F}(w)=\mathbf{W}^{-1}\mathbf{P}(w)^{(2)}\mathbf{W}^{-1}+\alpha\mathbf{W}^{\alpha-1}.
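The gradient formula rearranges to $[\nabla\mathcal{F}(w)]_i = w_i^{\alpha}-\sigma_i(w)/w_i$, so the optimality condition $\nabla\mathcal{F}(w)=0$ is exactly $\rho_i(w)=1$ for all $i$. A small sketch (assuming NumPy; grad_F is our name) that can be checked against finite differences of $\mathcal{F}$:

```python
import numpy as np

def grad_F(A, w, alpha):
    """[grad F(w)]_i = w_i^{-1} * (w_i^{1+alpha} - sigma_i(w)) = w_i^alpha - sigma_i(w) / w_i."""
    M_inv = np.linalg.inv(A.T @ (w[:, None] * A))
    sig = w * np.einsum('ij,jk,ik->i', A, M_inv, A)   # leverage scores sigma(w)
    return w ** alpha - sig / w
```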
Lemma 4 (Initial Sub-Optimality).

At the start of Algorithm 1, the value of the objective of (1) differs from the optimum objective value as $\mathcal{F}(w^{(0)})\leq\mathcal{F}(\overline{w})+n\log(m/n)$.

2.1.1 Minimum Progress and Maximum Sub-optimality in $\textbf{Descent}(\cdot)$

We first prove an upper bound on objective sub-optimality, necessary to obtain a runtime polylogarithmic in $1/\varepsilon$. Often, to obtain such a rate, the bound involving objective sub-optimality has a quadratic term derived from the Hessian; our lemma is somewhat non-standard in that it uses only the convexity of $\mathcal{F}$. Note that this lemma crucially uses (2).

Lemma 5 (Objective Sub-optimality).

Suppose $w\in\mathbb{R}^{m}_{>0}$ and $\rho_{\max}(w)\leq 1+\alpha$. Then the value of the objective of (1) at $w$ differs from the optimum objective value as follows.

\mathcal{F}(w)-\mathcal{F}(\overline{w})\leq\sum_{i\in[m]}\frac{w_{i}^{1+\alpha}}{1+\alpha}\left(1+\frac{\rho_{i}(w)}{\alpha}\right)\left(\rho_{i}(w)-1\right)^{2}\leq 5\max\{1,\alpha^{-1}\}\sum_{i\in[m]}w_{i}^{1+\alpha}\frac{(\rho_{i}(w)-1)^{2}}{\rho_{i}(w)+1}.
Proof.

Since $g(w)\stackrel{\mathrm{def}}{=}-\log\det(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})$ is convex and $[\nabla g(w)]_{i}=-w_{i}^{-1}\sigma_{i}(w)$, we have

g(\overline{w})\geq g(w)+\nabla g(w)^{\top}(\overline{w}-w)=g(w)+\sum_{i\in[m]}\left(-\frac{\sigma_{i}(w)\overline{w}_{i}}{w_{i}}+\sigma_{i}(w)\right),

and therefore,

\mathcal{F}(\overline{w})-\mathcal{F}(w)=g(\overline{w})-g(w)+\frac{1}{1+\alpha}\sum_{i\in[m]}\left([\overline{w}]_{i}^{1+\alpha}-w_{i}^{1+\alpha}\right)\geq\sum_{i\in[m]}c_{i},\ \text{ where } c_{i}\stackrel{\mathrm{def}}{=}-\frac{\sigma_{i}(w)\overline{w}_{i}}{w_{i}}+\sigma_{i}(w)+\frac{1}{1+\alpha}\left([\overline{w}]_{i}^{1+\alpha}-w_{i}^{1+\alpha}\right)\,.

To prove the claim, it suffices to bound each $c_{i}$ from below. First, note that

c_{i}\geq\min_{v\geq 0}-\frac{v\cdot\sigma_{i}(w)}{w_{i}}+\sigma_{i}(w)+\frac{1}{1+\alpha}\left(v^{1+\alpha}-w_{i}^{1+\alpha}\right)=-\frac{\alpha}{1+\alpha}\left(\frac{\sigma_{i}(w)}{w_{i}}\right)^{1+\frac{1}{\alpha}}+\sigma_{i}(w)-\frac{w_{i}^{1+\alpha}}{1+\alpha}=\frac{w_{i}^{1+\alpha}}{1+\alpha}\left[-\alpha\rho_{i}(w)^{1+\frac{1}{\alpha}}+(1+\alpha)\rho_{i}(w)-1\right], (2.4)

where the first equality used that the minimization problem is convex and the solution to $-\sigma_{i}(w)w_{i}^{-1}+v^{\alpha}=0$ (i.e., where the gradient is 0) is a minimizer, and the second equality used $\rho_{i}(w)=\sigma_{i}(w)/w_{i}^{1+\alpha}$. Applying $\rho_{i}(w)\leq 1+\alpha$, $1+x\leq\exp x$, and $\exp x\leq 1+x+x^{2}$ for $x\leq 1$ yields

\rho_{i}(w)^{\frac{1}{\alpha}}=(1-(1-\rho_{i}(w)))^{\frac{1}{\alpha}}\leq\exp(\tfrac{1}{\alpha}(\rho_{i}(w)-1))\leq 1+\frac{1}{\alpha}(\rho_{i}(w)-1)+\frac{1}{\alpha^{2}}(\rho_{i}(w)-1)^{2}. (2.5)

Combining (2.5) with (2.4) yields that

c_{i}\geq\frac{w_{i}^{1+\alpha}}{1+\alpha}\left[-\alpha\rho_{i}(w)\left[1+\left(\frac{\rho_{i}(w)-1}{\alpha}\right)+\left(\frac{\rho_{i}(w)-1}{\alpha}\right)^{2}\right]+(1+\alpha)\rho_{i}(w)-1\right]=\frac{w_{i}^{1+\alpha}}{1+\alpha}\left[-1+2\rho_{i}(w)-\rho_{i}(w)^{2}-\frac{\rho_{i}(w)}{\alpha}\cdot(\rho_{i}(w)-1)^{2}\right]=-\frac{w_{i}^{1+\alpha}}{1+\alpha}\left(1+\frac{\rho_{i}(w)}{\alpha}\right)\cdot\left(\rho_{i}(w)-1\right)^{2}.

The claim then follows from the fact that for $\rho_{i}(w)\leq 1+\alpha$, we have $\frac{(1+\rho_{i}(w)\alpha^{-1})(1+\rho_{i}(w))}{1+\alpha}\leq\frac{1}{1+\alpha}+\frac{1}{\alpha}+1+1+\frac{1}{\alpha}\leq 5\max\{1,\alpha^{-1}\}$. ∎

Lemma 6 (Function Decrease in $\textbf{Descent}(\cdot)$).

Let $w,\eta\in\mathbb{R}^{m}_{>0}$ with $\eta_{i}\in[0,\tfrac{1}{3\bar{\alpha}}]$ for all $i\in[m]$. Further, let $w^{+}=\textbf{Descent}(w,[m],\eta)$, where $\textbf{Descent}$ is defined in (2). Then, $w^{+}\in\mathbb{R}^{m}_{>0}$ with the following decrease in function objective.

\mathcal{F}(w^{+})\leq\mathcal{F}(w)-\sum_{i\in[m]}\frac{\eta_{i}}{2}\cdot w_{i}^{1+\alpha}\cdot\frac{(\rho_{i}(w)-1)^{2}}{\rho_{i}(w)+1}.

The proof of this lemma resembles that of a quasi-Newton method: first, we write a second-order Taylor approximation of $\mathcal{F}(w^{+})$ around $w$ and apply Fact 1.1 to Lemma 3 to obtain the upper bound on the Hessian: $\nabla^{2}\mathcal{F}(\widetilde{w})=\mathbf{\widetilde{W}}^{-1}\mathbf{P}(\widetilde{w})^{(2)}\mathbf{\widetilde{W}}^{-1}+\alpha\mathbf{\widetilde{W}}^{\alpha-1}\preceq\mathbf{\widetilde{W}}^{-1}\mathbf{\Sigma}(\widetilde{w})\mathbf{\widetilde{W}}^{-1}+\alpha\mathbf{\widetilde{W}}^{\alpha-1}$. We further use the expression for $\nabla\mathcal{F}(w)$ in this second-order approximation and simplify to obtain the claim, as detailed in Section A.

2.1.2 Iteration Complexity of $\textbf{Descent}(\cdot)$

Proof of Lemma 2.

Since Algorithm 1 calls $\textbf{Descent}(\cdot)$ after running $\textbf{Round}(\cdot)$, the requirement $\rho_{\max}(w)\leq 1+\alpha$ in Lemma 5 is met. Therefore, we may combine Lemma 5 along with Lemma 6 and our choice of $\eta_{i}=\frac{1}{3\bar{\alpha}}$ in Algorithm 1 to get a geometric decrease in function error as follows.

\mathcal{F}(w^{+})-\mathcal{F}(\overline{w})\leq\mathcal{F}(w)-\mathcal{F}(\overline{w})-\frac{1}{6\max(\alpha,1)}\sum_{i=1}^{m}w_{i}^{1+\alpha}\frac{(\rho_{i}(w)-1)^{2}}{\rho_{i}(w)+1}\leq\left(1-\frac{1}{30\max(1,\alpha)\cdot\max(1,\alpha^{-1})}\right)(\mathcal{F}(w)-\mathcal{F}(\overline{w})). (2.6)

We apply this inequality recursively over all iterations of Algorithm 1, while also using the assumption that $\textbf{Round}(\cdot)$ does not increase the objective value. Setting the final value of (2.6) to $\widetilde{\varepsilon}$, bounding the initial error as $\mathcal{F}(w)-\mathcal{F}(\overline{w})\leq n\log(m/n)\leq m^{2}$ by Lemma 4, observing $\max(1,\alpha)\cdot\max(1,\alpha^{-1})=\max(\alpha,\alpha^{-1})$, and taking logarithms on both sides of the inequality gives the claimed iteration complexity of $\textbf{Descent}(\cdot)$. ∎

3 Analysis of $\textbf{Round}(\cdot)$: The Parallel Algorithm

The main export of this section is the proof of our main theorem about the parallel algorithm (Theorem 1). This proof combines the iteration count of $\textbf{Descent}(\cdot)$ from the preceding section with the analysis of Algorithm 2 (invoked by $\textbf{Round}(\cdot)$ in the parallel setting), shown next. In Lemma 7, we show that $\textbf{RoundParallel}(\cdot)$ decreases the function objective, thereby justifying the key assumption in Lemma 2. Lemma 7 also shows an upper bound on the new value of $\rho$ after one iteration of the while loop of $\textbf{RoundParallel}(\cdot)$, and by combining this with the maximum value of $\rho$ at termination in Algorithm 2, we get the iteration complexity of $\textbf{RoundParallel}(\cdot)$ in Corollary 1.

Lemma 7 (Outcome of $\textbf{RoundParallel}(\cdot)$).

Let $w^{+}\in\mathbb{R}^{m}_{>0}$ be the state of $w\in\mathbb{R}^{m}_{>0}$ at the end of one iteration of the while loop of $\textbf{RoundParallel}(\cdot)$ (Algorithm 2). Then, $\mathcal{F}(w^{+})\leq\mathcal{F}(w)$ and $\rho_{\max}(w^{+})\leq(1+\frac{\alpha}{3\bar{\alpha}(2+\alpha)})^{-\alpha}\rho_{\max}(w)$.

Proof.

Each iteration of the while loop in $\textbf{RoundParallel}(\cdot)$ performs $\textbf{Descent}(w,\textup{C},\frac{1}{3\bar{\alpha}}\mathbf{1})$ over the set of coordinates $\textup{C}=\{i:\rho_{i}(w)>1+\alpha\}$. Lemma 6 then immediately proves $\mathcal{F}(w^{+})\leq\mathcal{F}(w)$, which is our first claim.

To prove the second claim, note that in Algorithm 2, for every $i\in\textup{C}$,

w^{+}_{i}=w_{i}+\frac{w_{i}}{3\bar{\alpha}}\cdot\left[\frac{\rho_{i}(w)-1}{\rho_{i}(w)+1}\right]\geq w_{i}+\frac{w_{i}}{3\bar{\alpha}}\cdot\left[\frac{\alpha}{1+1+\alpha}\right]=w_{i}\cdot\left(1+\frac{\alpha}{3\bar{\alpha}(2+\alpha)}\right),

where the second step is by the monotonicity of $x\mapsto\frac{x-1}{x+1}$ for $x\geq 1$. Combining it with $w_{i}^{+}=w_{i}$ for all $i\notin\textup{C}$ implies that $w^{+}\geq w$. Therefore, for all $i\in\textup{C}$, we have

\rho(w^{+})_{i}=[w^{+}_{i}]^{-\alpha}[\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}^{+}\mathbf{A})^{-1}\mathbf{A}^{\top}]_{ii}\leq\left[1+\frac{\alpha}{3\bar{\alpha}(2+\alpha)}\right]^{-\alpha}\cdot w_{i}^{-\alpha}[\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top}]_{ii}. (3.1)

Corollary 1.

Let $w$ be the input to $\textbf{RoundParallel}(\cdot)$. Then, the number of iterations of the while loop of $\textbf{RoundParallel}(\cdot)$ is at most $O\left((1+\alpha^{-2})\log(\frac{\rho_{\max}(w)}{1+\alpha})\right)$.

Proof.

Let $w^{(i)}$ be the value of $w$ at the start of the $i$'th execution of the while loop of $\textbf{RoundParallel}(\cdot)$. Repeated application of Lemma 7 over $k$ executions of the while loop gives $\rho_{\max}(w^{(k)})\leq\rho_{\max}(w)\left(1+\frac{\alpha}{3\bar{\alpha}(2+\alpha)}\right)^{-\alpha k}$. We set $\rho_{\max}(w)\left(1+\frac{\alpha}{3\bar{\alpha}(2+\alpha)}\right)^{-\alpha k}\leq 1+\alpha$ in accordance with the termination condition of $\textbf{RoundParallel}(\cdot)$. Next, applying $1+x\leq\exp(x)$ and taking logarithms on both sides yields the claimed limit on the number of iterations $k$. ∎

Lemma 8.

Over the entire run of Algorithm 1, the while loop of $\textbf{RoundParallel}(\cdot)$ runs for at most $O\left(\mathcal{T}_{\textrm{total}}\cdot\alpha^{-2}\cdot\log\left(\tfrac{m}{n(1+\alpha)}\right)\right)$ iterations if $\alpha\in(0,1]$ and $\mathcal{O}\left(\mathcal{T}_{\textrm{total}}\cdot\alpha\cdot\log\left(\tfrac{m}{n(1+\alpha)}\right)\right)$ iterations if $\alpha\geq 1$.

Proof.

Note that $\rho_{\max}(\tfrac{n}{m}\mathbf{1})\leq(\tfrac{m}{n})^{1+\alpha}$; consequently, in the first iteration of Algorithm 1, there are at most $O((\alpha+\alpha^{-2})\log(m/(n(1+\alpha))))$ iterations of the while loop of $\textbf{RoundParallel}(\cdot)$ by Corollary 1. Note that between each call to $\textbf{RoundParallel}(\cdot)$, for all $i\in[m]$,

w^{+}_{i}=w_{i}+\frac{w_{i}}{3\bar{\alpha}}\cdot\left[\frac{\rho_{i}(w)-1}{\rho_{i}(w)+1}\right]\geq w_{i}+\frac{w_{i}}{3\bar{\alpha}}\cdot\left[\frac{-1}{1+1+\alpha}\right]=w_{i}\cdot\left(1-\frac{1}{3\bar{\alpha}(2+\alpha)}\right),

where the first inequality uses the fact that the output $w$ of $\textbf{RoundParallel}(\cdot)$ satisfies $\rho_{\max}(w)\leq 1+\alpha$. Therefore, applying the same logic as in (3.1), we get that between two calls to $\textbf{RoundParallel}(\cdot)$, the value of $\rho_{i}(w)$ increases by at most a factor of $\left(1-\frac{1}{3\bar{\alpha}(2+\alpha)}\right)^{-(1+\alpha)}=O(1)$ for all $i\in[m]$. Combining this with Corollary 1 and the total initial iteration count, and observing that $\mathcal{T}_{\textrm{total}}$ is the total number of calls to $\textbf{RoundParallel}(\cdot)$, finishes the proof. ∎

3.1 Proof of Main Theorem (Parallel)

Proof.

(Proof of Theorem 1) First, we show correctness. Note that, as a corollary of Lemma 2, $\mathcal{F}(w^{(\mathcal{T}_{\textrm{total}})})\leq\mathcal{F}(\overline{w})+\widetilde{\varepsilon}$. By the properties of $\textbf{Round}$ as shown in Lemma 7, we also have that $\mathcal{F}(w_{\textrm{R}})\leq\mathcal{F}(\overline{w})+\widetilde{\varepsilon}$ and $\rho_{\max}(w_{\textrm{R}})\leq 1+\alpha$ for $w_{\textrm{R}}=\textbf{Round}(w^{(\mathcal{T}_{\textrm{total}})},\mathbf{A},\alpha)$. Therefore, Lemma 1 is applicable, and by the choice of $\widetilde{\varepsilon}=\tfrac{\alpha^{4}\varepsilon^{4}}{(2m(\sqrt{n}+\alpha)(\alpha+\alpha^{-1}))^{4}}$, we conclude that $\widehat{w}\in\mathbb{R}^{m}$ defined as $\widehat{w}_{i}=(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}_{\textrm{R}}\mathbf{A})^{-1}a_{i})^{1/\alpha}$ satisfies $\widehat{w}_{i}\approx_{\varepsilon}\overline{w}_{i}$ for all $i\in[m]$. Combining the iteration counts of $\textbf{Descent}(\cdot)$ from Lemma 2 and of $\textbf{RoundParallel}(\cdot)$ from Lemma 8 yields the total iteration count as $O(\alpha^{-3}\log(m/(\varepsilon\alpha)))$ if $\alpha\leq 1$ and $O(\alpha^{2}\log(m/\varepsilon))$ if $\alpha>1$. As stated in Section 1.4, $\alpha=\frac{2}{p-2}$, and so translating these rates in terms of $p$ gives $O(p^{3}\log(mp/\varepsilon))$ for $p\geq 4$ and $O(p^{-2}\log(mp/\varepsilon))$ for $p\in(2,4)$, thereby proving the stated claim. The cost per iteration is $O(mn^{2})$ for multiplying two $m\times n$ matrices (this can be improved to $O(mn^{\omega-1})$ using fast matrix multiplication). ∎

4 Analysis of $\textbf{Round}(\cdot)$: Sequential Algorithm

We now analyze Algorithm 3. Note that these proofs work for all $\alpha>0$.

Lemma 9 (Coordinate Step Progress).

Given $w\in\mathbb{R}_{>0}^{m}$, a coordinate $i\in[m]$, and $\delta_{i}\in\mathbb{R}$, we have

\mathcal{F}(w+\delta_{i}w_{i}e_{i})=\mathcal{F}(w)-\log(1+\delta_{i}\sigma_{i}(w))+\frac{w_{i}^{1+\alpha}}{1+\alpha}((1+\delta_{i})^{1+\alpha}-1).
Proof.

By definition of $\mathcal{F}$, we have

\mathcal{F}(w+\delta_{i}w_{i}e_{i})=-\log\det(\mathbf{A}^{\top}\mathbf{W}\mathbf{A}+\delta_{i}w_{i}a_{i}a_{i}^{\top})+\frac{1}{1+\alpha}\sum_{j\neq i}w_{j}^{1+\alpha}+\frac{w_{i}^{1+\alpha}}{1+\alpha}(1+\delta_{i})^{1+\alpha}.

Recall the matrix determinant lemma: $\det(\mathbf{A}+uv^{\top})=(1+v^{\top}\mathbf{A}^{-1}u)\det(\mathbf{A})$. Applying it to $\det(\mathbf{A}^{\top}\mathop{\mathbf{diag}}(w+\delta_{i}w_{i}e_{i})\mathbf{A})$ in the preceding expression for $\mathcal{F}(w+\delta_{i}w_{i}e_{i})$ proves the lemma.

Lemma 10 (Coordinate Step Outcome).

Given $w\in\mathbb{R}_{>0}^{m}$ and $\textup{C}=\{i:\rho_{i}(w)\geq 1\}$, let $w^{+}=w+\delta_{i}w_{i}e_{i}$ for any $i\in\textup{C}$, where $\delta_{i}=\arg\min_{\delta}\left[-\log(1+\delta\sigma_{i}(w))+\frac{1}{1+\alpha}w_{i}^{1+\alpha}((1+\delta)^{1+\alpha}-1)\right]$. Then, we have $\mathcal{F}(w^{+})\leq\mathcal{F}(w)$ and $\rho_{i}(w^{+})\leq 1$.

Proof.

We note that $\min_{\delta}\left[-\log(1+\delta\sigma_{i}(w))+\frac{1}{1+\alpha}w_{i}^{1+\alpha}((1+\delta)^{1+\alpha}-1)\right]\leq 0$. Then, Lemma 9 implies the first claim. Since the update rule optimizes $\mathcal{F}$ coordinate-wise, at each step the optimality condition, given by $\rho_{i}(w^{+})=1$, is met for each $i\in\textup{C}$. The second claim is then proved by noting that for $j\neq i$, $w_{j}^{+}=w_{j}$ and, by the Sherman-Morrison-Woodbury identity, $\rho_{j}(w^{+})\leq\rho_{j}(w)$:

a_{j}^{\top}(\mathbf{A}^{\top}\mathbf{W}^{+}\mathbf{A})^{-1}a_{j}=a_{j}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{j}-\delta_{i}w_{i}\frac{(a_{j}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i})^{2}}{1+\delta_{i}w_{i}a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i}}\leq a_{j}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{j}.

Lemma 11 (Number of Coordinate Steps).

For any $0\leq\widetilde{\varepsilon}\leq 1$, over all $\mathcal{T}_{\textrm{total}}$ iterations of Algorithm 1, there are at most $O(m\max(\alpha,\alpha^{-1})\log(m/\widetilde{\varepsilon}))$ coordinate steps (see Algorithm 3).

Proof.

There are at most $m$ coordinate steps in each call to Algorithm 3. Combining this with the value of $\mathcal{T}_{\textrm{total}}$ in Algorithm 1 gives the claimed count of $O(m\max(\alpha,\alpha^{-1})\log(m/\widetilde{\varepsilon}))$ coordinate steps. ∎

4.1 Proof of Main Theorem (Sequential)

We now combine the preceding results to prove the main theorem about the sequential algorithm (Algorithm 1 with Algorithm 3).

Proof.

(Proof of Theorem 2) The proof of correctness is the same as that for Theorem 1 since the parallel and sequential algorithms share the same meta-algorithm. Computing leverage scores in the sequential version (Algorithm 1 with Algorithm 3) takes $O(m\max(\alpha,\alpha^{-1})\log(m/(\alpha\varepsilon)))$ coordinate steps. The costliest component of a coordinate step is computing $a_{i}^{\top}(\mathbf{A}^{\top}(\mathbf{W}+\delta_{i}w_{i}e_{i}e_{i}^{\top})\mathbf{A})^{-1}a_{i}$. By the Sherman-Morrison-Woodbury formula, computing this costs $O(n^{2})$ for each coordinate. Since the initial cost to compute $(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}$ is $O(mn^{2})$, the total runtime is $O(\max(\alpha,\alpha^{-1})mn^{2}\log(m/\varepsilon))$. When translated in terms of $p$, this gives $O(pmn^{2}\log(mp/\varepsilon))$ for $p\geq 4$ and $O(p^{-1}mn^{2}\log(mp/\varepsilon))$ for $p\in(2,4)$. ∎
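The $O(n^{2})$ per-coordinate cost above comes from maintaining $(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}$ under a rank-one change of one weight. A minimal sketch of that maintenance step, assuming NumPy (the helper name is ours):

```python
import numpy as np

def sherman_morrison_update(M_inv, a_i, c):
    """Given M_inv = (A^T W A)^{-1}, return (A^T W A + c * a_i a_i^T)^{-1} in O(n^2) time;
    in the sequential algorithm, c = delta_i * w_i for the updated coordinate i."""
    u = M_inv @ a_i
    return M_inv - np.outer(u, u) * (c / (1.0 + c * (a_i @ u)))
```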

5 A “One-Step” Parallel Algorithm

We conclude our paper with an alternative algorithm (Algorithm 4) in which each iteration avoids any rounding and performs only $\textbf{Descent}(\cdot)$.

Algorithm 4 One-Step Algorithm

Input: Matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$, parameter $p>2$, accuracy $\varepsilon$

Output: Vector $\widehat{w}\in\mathbb{R}^{m}_{>0}$ that satisfies (1)

For all $i\in[m]$, initialize $w_{i}^{(0)}=1$. Set $\alpha=\frac{2}{p-2}$. Set $\widetilde{\varepsilon}=\frac{\alpha^{4}\varepsilon^{4}}{(2m(\sqrt{n}+\alpha)(\alpha+\alpha^{-1}))^{4}}$.

Set $\beta=\frac{1}{1000}\min(\alpha^{2},1)$, and set $\mathcal{T}_{\textrm{total}}=\mathcal{O}(\alpha^{-3}\log(mp/\widetilde{\varepsilon}))$ if $\alpha\in(0,1]$ and $\mathcal{T}_{\textrm{total}}=\mathcal{O}(\alpha^{2}\log(mp/\widetilde{\varepsilon}))$ if $\alpha>1$.

for $k=0,1,2,3,\dotsc,\mathcal{T}_{\textrm{total}}-1$ do

       Let $\eta^{(k)}\in\mathbb{R}^{m}$, where for all $i\in[m]$ we set $\eta^{(k)}_{i}=\tfrac{1}{3\bar{\alpha}}$ if $\rho_{i}(w^{(k)})\geq 1$ and $\eta^{(k)}_{i}=\tfrac{1}{3\bar{\alpha}}\beta$ if $\rho_{i}(w^{(k)})<1$
       $w^{(k+1)}\leftarrow\textbf{Descent}(w^{(k)},[m],\eta^{(k)})$ \triangleright See (2) and Lemma 2
end for
Return $\widehat{w}\in\mathbb{R}^{m}_{>0}$, where $\widehat{w}_{i}=(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}^{(\mathcal{T}_{\textrm{total}})}\mathbf{A})^{-1}a_{i})^{1/\alpha}$. \triangleright See Section B
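A sketch of the varying step-size rule of Algorithm 4, reusing the rho helper from the sketch following Algorithm 1 (the wrapper name is ours):

```python
import numpy as np

def one_step_iteration(A, w, alpha, beta):
    """One iteration of Algorithm 4: full step 1/(3*abar) where rho_i >= 1 and a damped
    step beta/(3*abar) where rho_i < 1, which preserves the rounding condition (Lemma 12)."""
    abar = max(alpha, 1.0)
    r = rho(A, w, alpha)  # rho as sketched after Algorithm 1
    eta = np.where(r >= 1.0, 1.0 / (3.0 * abar), beta / (3.0 * abar))
    return w * (1.0 + eta * (r - 1.0) / (r + 1.0))
```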
Theorem 4 (Main Theorem (One-Step Parallel Algorithm)).

Given a full-rank matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ and $p\geq 4$, we can compute $\varepsilon$-approximate Lewis weights (1) in $O(p^{3}\log(mp/\varepsilon))$ iterations. Each iteration computes the leverage scores of a matrix $\mathbf{D}\mathbf{A}$ for some diagonal matrix $\mathbf{D}$. The total runtime is $O(p^{3}mn^{2}\log(mp/\varepsilon))$.

We first spell out the key idea of the proof of Theorem 4 in Lemma 12 next, which is that (2) is maintained in every iteration through the use of varying step sizes, without explicitly invoking rounding procedures. Since (2) always holds, we may use Lemma 5 in bounding the iteration complexity.

Lemma 12 (Rounding Condition Invariance).

For any iteration $k\in[\mathcal{T}_{\textrm{total}}-2]$ in Algorithm 4, if $\rho_{\max}(w^{(k)})\leq 1+\alpha$, then $\rho_{\max}(w^{(k+1)})\leq 1+\alpha$.

Proof.

By the definition of $\textbf{Descent}(\cdot)$ in (2) and the choice of $\eta^{(k)}_{i}$ in Algorithm 4, we have

w^{(k+1)}_{i}=w_{i}^{(k)}\cdot\left[1+\eta_{i}^{(k)}\left(\frac{\rho_{i}(w^{(k)})-1}{\rho_{i}(w^{(k)})+1}\right)\right] (5.1)
\geq w_{i}^{(k)}(1-\eta_{i}^{(k)})\geq w_{i}^{(k)}\left(1-\frac{\beta}{3\bar{\alpha}}\right). (5.2)

Applying this inequality to the definition of $\rho(w)$ in (2), for all $i\in[m]$, we have

\rho_{i}(w^{(k+1)})=\left[\frac{w^{(k+1)}_{i}}{w^{(k)}_{i}}\right]^{-\alpha}\frac{1}{[w^{(k)}_{i}]^{\alpha}}a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}^{(k+1)}\mathbf{A})^{-1}a_{i}\leq\left(1-\frac{\beta}{3\bar{\alpha}}\right)^{-1}\left[\frac{w^{(k+1)}_{i}}{w^{(k)}_{i}}\right]^{-\alpha}\rho_{i}(w^{(k)}).

Plugging (5.2) into (5) when $\rho_{i}(w^{(k)})\leq 1$ and using the upper bound on $\beta$ yields that

\rho_{i}(w^{(k+1)})\leq\left(1-\frac{\beta}{3\bar{\alpha}}\right)^{-(1+\alpha)}\leq 1+\alpha\,.

If $\rho_{i}(w^{(k)})\geq 1$, then (5), the equality in (5.2), the bound on $\beta$, and $\rho_{i}(w^{(k)})\leq 1+\alpha$ imply that

\rho_{i}(w^{(k+1)})\leq\left(1-\frac{\beta}{3\bar{\alpha}}\right)^{-1}\left[1+\frac{1}{3\bar{\alpha}}\left(\frac{\rho_{i}(w^{(k)})-1}{\rho_{i}(w^{(k)})+1}\right)\right]^{-\alpha}\rho_{i}(w^{(k)})\leq 1+\alpha.

Proof of Theorem 4.

By our choice of $w_{i}^{(0)}=1$ for all $i\in[m]$, we have that $\rho_{i}(w^{(0)})=\sigma_{i}(w^{(0)})\leq 1$ by Fact 1.1. Then applying Lemma 12 yields by induction that $\rho_{\max}(w^{(k)})\leq 1+\alpha$ at every iteration $k$. We may now therefore upper bound the objective sub-optimality from Lemma 5; as before, combining this with the lower bound on progress from Lemma 6 (noticing that $\eta_{i}\geq\frac{\beta}{3\bar{\alpha}}$) yields

\mathcal{F}(w^{+})-\mathcal{F}(\overline{w})\leq\mathcal{F}(w)-\mathcal{F}(\overline{w})-\frac{\beta}{6\bar{\alpha}}\sum_{i=1}^{m}w_{i}^{1+\alpha}\frac{(\rho_{i}(w)-1)^{2}}{\rho_{i}(w)+1}\leq\left(1-\frac{\beta}{30\max(1,\alpha)\max(1,\alpha^{-1})}\right)(\mathcal{F}(w)-\mathcal{F}(\overline{w})). (5.4)

Thus, $\textbf{Descent}(\cdot)$ decreases $\mathcal{F}$. Using $\mathcal{F}(w)-\mathcal{F}(\overline{w})\leq n\log(m/n)\leq m^{2}$ from Lemma 4 and setting (5.4) to $\widetilde{\varepsilon}$ gives an iteration complexity of $\mathcal{O}(\beta^{-1}\alpha^{-1}\log(m/\widetilde{\varepsilon}))=\mathcal{O}(\alpha^{-3}\log(m/\widetilde{\varepsilon}))$ if $\alpha\in(0,1]$ and $\mathcal{O}(\alpha\beta^{-1}\log(m/\widetilde{\varepsilon}))=\mathcal{O}(\alpha\log(m/\widetilde{\varepsilon}))$ otherwise. As in the proofs of Theorems 1 and 2, we can then invoke Lemma 1 to construct the vector that is $\varepsilon$-approximate to the Lewis weights. ∎

6 Acknowledgements

We are grateful to the anonymous reviewers of SODA 2022 for their careful reading and thoughtful comments that helped us improve our exposition. Maryam Fazel was supported in part by grants NSF TRIPODS II DMS 2023166, NSF TRIPODS CCF 1740551, and NSF CCF 2007036. Yin Tat Lee was supported in part by NSF awards CCF-1749609, DMS-1839116, DMS-2023166, CCF-2105772, a Microsoft Research Faculty Fellowship, Sloan Research Fellowship, and Packard Fellowship. Swati Padmanabhan was supported in part by NSF TRIPODS II DMS 2023166. Aaron Sidford was supported in part by a Microsoft Research Faculty Fellowship, NSF CAREER Award CCF-1844855, NSF Grant CCF-1955039, a PayPal research award, and a Sloan Research Fellowship.

References

  • [AHR08] Jacob Abernethy, Elad E Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, COLT 2008, 2008.
  • [AZLSW17] Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, and Yining Wang. Near-optimal design of experiments via regret minimization. In Proceedings of the 34th International Conference on Machine Learning, 2017.
  • [BDM+20] Vladimir Braverman, Petros Drineas, Cameron Musco, Christopher Musco, Jalaj Upadhyay, David P Woodruff, and Samson Zhou. Near optimal linear algebra in the online and sliding window models. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), 2020.
  • [BE15] Sébastien Bubeck and Ronen Eldan. The entropic barrier: a simple and optimal universal self-concordant barrier. In Conference on Learning Theory, 2015.
  • [BV04] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
  • [CCDS20] Rachit Chhaya, Jayesh Choudhari, Anirban Dasgupta, and Supratim Shit. Streaming coresets for symmetric tensor factorization. In International Conference on Machine Learning, 2020.
  • [CCLY19] Michael B. Cohen, Ben Cousins, Yin Tat Lee, and Xin Yang. A near-optimal algorithm for approximating the John ellipsoid. In Proceedings of the Thirty-Second Conference on Learning Theory, 2019.
  • [CLM+15] Michael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform sampling for matrix approximation. In Tim Roughgarden, editor, Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, ITCS 2015, Rehovot, Israel, January 11-13, 2015. ACM, 2015.
  • [CP15] Michael B. Cohen and Richard Peng. Lp row sampling by Lewis weights. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC ’15. Association for Computing Machinery, 2015.
  • [DAST08] S Damla Ahipasaoglu, Peng Sun, and Michael J Todd. Linear convergence of a modified Frank–Wolfe algorithm for computing minimum-volume enclosing ellipsoids. Optimisation Methods and Software, 23(1), 2008.
  • [DFO20] Jelena Diakonikolas, Maryam Fazel, and Lorenzo Orecchia. Fair packing and covering on a relative scale. SIAM J. Optim., 30(4), 2020.
  • [DLS18] David Durfee, Kevin A Lai, and Saurabh Sawlani. ℓ1 regression using Lewis weights preconditioning and stochastic gradient descent. In Conference On Learning Theory, 2018.
  • [DMIMW12] Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff. Fast approximation of matrix coherence and statistical leverage. The Journal of Machine Learning Research, 13(1), 2012.
  • [DMM06] Petros Drineas, Michael W Mahoney, and Shan Muthukrishnan. Sampling algorithms for ℓ2 regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2006.
  • [GLS81] Martin Grötschel, László Lovász, and Alexander Schrijver. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1(2), 1981.
  • [Joh48] Fritz John. Extremum problems with inequalities as subsidiary conditions. In Studies and Essays Presented to R. Courant on his 60th Birthday, January 8, 1948. 1948.
  • [Kha96] Leonid G Khachiyan. Rounding of polytopes in the real number model of computation. Mathematics of Operations Research, 21(2), 1996.
  • [KN09] Ravi Kannan and Hariharan Narayanan. Random walks on polytopes and an affine interior point method for linear programming. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, STOC ’09, New York, NY, USA, 2009. Association for Computing Machinery.
  • [KY05] Piyush Kumar and E Alper Yildirim. Minimum-volume enclosing ellipsoids and core sets. Journal of Optimization Theory and applications, 126(1), 2005.
  • [Lee16] Yin Tat Lee. Faster Algorithms for Convex and Combinatorial Optimization. PhD thesis, Massachusetts Institute of Technology, 2016.
  • [Lew78] D. R. Lewis. Finite dimensional subspaces of l_p. Studia Mathematica, 63(2), 1978.
  • [LLV20] Aditi Laddha, Yin Tat Lee, and Santosh Vempala. Strong self-concordance and sampling. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, New York, NY, USA, 2020. Association for Computing Machinery.
  • [LMP13] Mu Li, Gary L Miller, and Richard Peng. Iterative row sampling. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, 2013.
  • [LS14] Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programs in Õ(√rank) iterations and faster algorithms for maximum flow. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21, 2014, 2014.
  • [LS19] Yin Tat Lee and Aaron Sidford. Solving linear programs with √rank linear system solves. CoRR, abs/1910.08033, 2019.
  • [LSW15] Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, 2015.
  • [LWYZ20] Yi Li, Ruosong Wang, Lin Yang, and Hanrui Zhang. Nearly linear row sampling algorithm for quantile regression. In Proceedings of the 37th International Conference on Machine Learning, 2020.
  • [LY18] Yin Tat Lee and Man-Chung Yue. Universal barrier is n-self-concordant. arXiv preprint arXiv:1809.03011, 2018.
  • [MSTX19] Vivek Madan, Mohit Singh, Uthaipon Tantipongpipat, and Weijun Xie. Combinatorial algorithms for optimal design. In Proceedings of the Thirty-Second Conference on Learning Theory, 2019.
  • [MSZ16] Jelena Marasevic, Clifford Stein, and Gil Zussman. A fast distributed stateless algorithm for α-fair packing problems. In 43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016), volume 55, 2016.
  • [NN94] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM, 1994.
  • [Puk06] Friedrich Pukelsheim. Optimal Design of Experiments, volume 50 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics, USA, 2006.
  • [SF04] Peng Sun and Robert M Freund. Computation of minimum-volume covering ellipsoids. Operations Research, 52(5), 2004.
  • [SW18] Christian Sohler and David P Woodruff. Strong coresets for k-median and subspace approximation: Goodbye dimension. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), 2018.
  • [SX20] Mohit Singh and Weijun Xie. Approximation algorithms for d-optimal design. Mathematics of Operations Research, 45(4), 2020.
  • [Tod16] Michael J. Todd. Minimum volume ellipsoids - theory and algorithms, volume 23 of MOS-SIAM Series on Optimization. SIAM, 2016.
  • [Vai89] Pravin M Vaidya. A new algorithm for minimizing convex functions over convex sets. In 30th Annual Symposium on Foundations of Computer Science, 1989.
  • [Woj96] Przemyslaw Wojtaszczyk. Banach spaces for analysts. Number 25. Cambridge University Press, 1996.
  • [ZF20] Renbo Zhao and Robert M Freund. Analysis of the Frank-Wolfe method for logarithmically-homogeneous barriers, with an extension. arXiv preprint arXiv:2010.08999, 2020.

We start with a piece of notation used frequently in the appendices. For a given vector x\in\mathbb{R}^{m}, we use \mathop{\mathbf{Diag}}(x) to denote the diagonal matrix with x on its diagonal. For a matrix \mathbf{X}, we use \mathbf{diag}(\mathbf{X}) to denote the vector made up of the diagonal entries of \mathbf{X}. Further, recall, as stated in Section 1.4, that given any vector x, we use its uppercase boldface name \mathbf{X}\stackrel{{\scriptstyle\mathrm{{\scriptscriptstyle def}}}}{{=}}\mathop{\mathbf{Diag}}(x).

Appendix A Technical Proofs: Gradient, Hessian, Initial Error, Minimum Progress

See 3

Proof.

The proof essentially follows by combining Lemmas 48 and 49 of [LS19]. For completeness, we provide the full proof here. Applying the chain rule to the \log\det function and then using the definition of the leverage scores \sigma(w) gives the claim that

i(w)=(𝐀(𝐀𝐖𝐀)1𝐀)ii+wiα=ai(𝐀𝐖𝐀)1ai+wiα=σi(w)wi+wiα.\nabla_{i}\mathcal{F}(w)=-(\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top})_{ii}+w_{i}^{\alpha}=-a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i}+w_{i}^{\alpha}=\frac{-\sigma_{i}(w)}{w_{i}}+w_{i}^{\alpha}.

We now set some notation to compute the Hessian: let 𝐌=def𝐀(𝐀𝐖𝐀)1𝐀\mathbf{M}\stackrel{{\scriptstyle\mathrm{{\scriptscriptstyle def}}}}{{=}}\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top}, let hmh\in\mathbb{R}^{m} be any arbitrary vector, and let 𝐇=def𝐃𝐢𝐚𝐠(h)\mathbf{H}\stackrel{{\scriptstyle\mathrm{{\scriptscriptstyle def}}}}{{=}}\mathop{\mathbf{Diag}}(h). For f:nf:\mathbb{R}^{n}\rightarrow\mathbb{R} and for x,hnx,h\in\mathbb{R}^{n} we let 𝒟xf(x)[h]\mathcal{D}_{x}f(x)[h] denote the directional derivative of ff at xx in the direction hh, i.e., 𝒟xf(x)[h]=limt0(f(x+th)f(x))/t\mathcal{D}_{x}f(x)[h]=\lim_{t\rightarrow 0}(f(x+th)-f(x))/t. Then we have,

\mathcal{D}_{w}\langle{h},{-\mathop{\mathbf{Diag}}(\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top})}\rangle[h] =\langle{h},{-\mathop{\mathbf{Diag}}(\mathbf{A}\mathcal{D}_{w}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}[h]\mathbf{A}^{\top})}\rangle
=\langle{h},{\mathop{\mathbf{Diag}}(\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathcal{D}_{w}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})[h](\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top})}\rangle
=\langle{h},{\mathop{\mathbf{Diag}}(\mathbf{M}\mathbf{H}\mathbf{M})}\rangle
=\sum_{i,j}h_{i}h_{j}\mathbf{M}_{ij}\mathbf{M}_{ji}=\sum_{i,j}h_{i}h_{j}\mathbf{M}_{ij}^{2},

where the last step follows by symmetry of 𝐌\mathbf{M}. This implies

\nabla_{ij}^{2}\mathcal{F}(w)=\begin{cases}(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{j})^{2}&\text{if }i\neq j,\\ (a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i})^{2}+\alpha w_{i}^{\alpha-1}&\text{if }i=j,\end{cases}

which, in shorthand, is 2(w)=𝐌𝐌+α𝐖α1\nabla^{2}\mathcal{F}(w)=\mathbf{M}\circ\mathbf{M}+\alpha\mathbf{W}^{\alpha-1}. We may express this Hessian as in the statement of the lemma by writing 𝐌\mathbf{M} in terms of 𝐏(w)\mathbf{P}(w). ∎
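As a sanity check of the formulas just derived, the following Python sketch compares the closed-form gradient and Hessian of Lemma 3 against central finite differences of the objective, which we again take to be \mathcal{F}(w)=-\log\det(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})+\tfrac{1}{1+\alpha}\sum_{i}w_{i}^{1+\alpha}; the instance below is an arbitrary random one, not taken from the paper.

import numpy as np

rng = np.random.default_rng(1)
m, n, alpha = 25, 4, 1.5                     # arbitrary small instance
A = rng.standard_normal((m, n))
w = rng.uniform(0.5, 2.0, size=m)

def objective(w):
    # Assumed objective: F(w) = -logdet(A^T W A) + (1/(1+alpha)) * sum_i w_i^{1+alpha}
    _, logdet = np.linalg.slogdet(A.T @ (w[:, None] * A))
    return -logdet + np.sum(w ** (1 + alpha)) / (1 + alpha)

def gradient(w):
    # Lemma 3: grad_i F(w) = -a_i^T (A^T W A)^{-1} a_i + w_i^alpha
    M = A @ np.linalg.solve(A.T @ (w[:, None] * A), A.T)
    return -np.diag(M) + w ** alpha

# Closed-form Hessian: nabla^2 F(w) = M o M + alpha * W^{alpha-1}, with o the entrywise product.
M = A @ np.linalg.solve(A.T @ (w[:, None] * A), A.T)
hess = M * M + np.diag(alpha * w ** (alpha - 1))

eps = 1e-5
I = np.eye(m)
grad_fd = np.array([(objective(w + eps * I[i]) - objective(w - eps * I[i])) / (2 * eps)
                    for i in range(m)])
hess_fd = np.column_stack([(gradient(w + eps * I[j]) - gradient(w - eps * I[j])) / (2 * eps)
                           for j in range(m)])

print("max |gradient - finite difference| :", np.abs(gradient(w) - grad_fd).max())
print("max |Hessian  - finite difference| :", np.abs(hess - hess_fd).max())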

See 4

Proof.

We study the two terms constituting the objective in (1). First, by choice of w(0)=nm𝟏w^{(0)}=\tfrac{n}{m}\mathbf{1}, we have

logdet(𝐀𝐖(0)𝐀)=logdet((n/m)𝐀𝐀).-\log\det(\mathbf{A}^{\top}\mathbf{W}^{(0)}\mathbf{A})=-\log\det((n/m)\mathbf{A}^{\top}\mathbf{A}).

Next, since leverage scores always lie between zero and one, the optimality condition for (1), σ(w¯)=(w¯)1+α\sigma(\overline{w})=({\overline{w}})^{1+\alpha}, implies w¯1\overline{w}\leq 1, which in turn gives 𝐖¯I\overline{\mathbf{W}}\preceq I. This implies 𝐀𝐖¯𝐀𝐀𝐀\mathbf{A}^{\top}\overline{\mathbf{W}}\mathbf{A}\preceq\mathbf{A}^{\top}\mathbf{A}. Therefore,

logdet(𝐀𝐀)logdet(𝐀𝐖¯𝐀).-\log\det(\mathbf{A}^{\top}\mathbf{A})\leq-\log\det(\mathbf{A}^{\top}\overline{\mathbf{W}}\mathbf{A}).

Combining the two preceding displays gives

logdet(𝐀𝐖(0)𝐀)logdet(𝐀𝐖¯𝐀)+nlog(m/n).-\log\det(\mathbf{A}^{\top}\mathbf{W}^{(0)}\mathbf{A})\leq-\log\det(\mathbf{A}^{\top}\overline{\mathbf{W}}\mathbf{A})+n\log(m/n).

Next, observe that \mathbf{1}^{\top}(w^{(0)})^{1+\alpha}=m\cdot(n/m)^{1+\alpha} and \mathbf{1}^{\top}({\overline{w}})^{1+\alpha}=\sum_{i=1}^{m}\sigma_{i}(\overline{w})=n, where we invoked Fact 1.1. Applying m\geq n then gives

𝟏(w(0))1+α𝟏(w¯)1+α.\mathbf{1}^{\top}(w^{(0)})^{1+\alpha}\leq\mathbf{1}^{\top}({\overline{w}})^{1+\alpha}.

Combining the two preceding displays with the definition of the objective in (1) finishes the claim. ∎

See 6

Proof.

By the remainder form of Taylor's theorem, for some t\in[0,1] and \widetilde{w}=tw+(1-t)w^{+}, we have

(w+)=(w)+(w),w+w+12(w+w)2(w~)(w+w).\mathcal{F}(w^{+})=\mathcal{F}(w)+\langle{\nabla\mathcal{F}(w)},{w^{+}-w}\rangle+\frac{1}{2}(w^{+}-w)^{\top}\nabla^{2}\mathcal{F}(\widetilde{w})(w^{+}-w). (A.5)

We prove the result by bounding the quadratic form of 2(w~)\nabla^{2}\mathcal{F}(\widetilde{w}) from above and leveraging the structure of w+w^{+} and (w)\nabla\mathcal{F}(w). Lemma 3 and Fact 1.1 imply that

2(w~)=𝐖~1𝐏(w~)(2)𝐖~1+α𝐖~α1𝐖~1Σ(w~)𝐖~1+α𝐖~α1.\nabla^{2}\mathcal{F}(\widetilde{w})=\mathbf{\widetilde{W}}^{-1}\mathbf{P}(\widetilde{w})^{(2)}\mathbf{\widetilde{W}}^{-1}+\alpha\mathbf{\widetilde{W}}^{\alpha-1}\preceq\mathbf{\widetilde{W}}^{-1}\Sigma(\widetilde{w})\mathbf{\widetilde{W}}^{-1}+\alpha\mathbf{\widetilde{W}}^{\alpha-1}~{}.

Further, the positivity of wiw_{i} and σi(w)\sigma_{i}(w) and the non-negativity of η\eta and ρ\rho imply that (1η)wiwi+(1+η)wi(1-\norm{\eta}_{\infty})w_{i}\leq w_{i}^{+}\leq(1+\norm{\eta}_{\infty})w_{i} for all i[m]i\in[m]. Since η13α¯\norm{\eta}_{\infty}\leq\frac{1}{3\bar{\alpha}}, this implies that

(113α¯)wiw~i(1+13α¯)wi for all i[m].(1-\tfrac{1}{3\bar{\alpha}})w_{i}\leq\widetilde{w}_{i}\leq(1+\tfrac{1}{3\bar{\alpha}})w_{i}~{}\text{ for all }~{}i\in[m]~{}.

Consequently, for all i\in[m], we bound the first term of the preceding Hessian bound as

[𝐖~1Σ(w~)𝐖~1]ii\displaystyle\left[\mathbf{\widetilde{W}}^{-1}\Sigma(\widetilde{w})\mathbf{\widetilde{W}}^{-1}\right]_{ii} =ei𝐖~1/2𝐀(𝐀𝐖~𝐀)1𝐀𝐖~1/2ei=1w~iai(𝐀𝐖~𝐀)1ai\displaystyle=e_{i}^{\top}\mathbf{\widetilde{W}}^{-1/2}\mathbf{A}(\mathbf{A}^{\top}\mathbf{\widetilde{W}}\mathbf{A})^{-1}\mathbf{A}^{\top}\mathbf{\widetilde{W}}^{-1/2}e_{i}=\frac{1}{\widetilde{w}_{i}}a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{\widetilde{W}}\mathbf{A})^{-1}a_{i}
(113α¯)11wiai(𝐀𝐖~𝐀)1ai(113α¯)21wiai(𝐀𝐖𝐀)1ai\displaystyle\leq(1-\tfrac{1}{3\bar{\alpha}})^{-1}\frac{1}{w_{i}}a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{\widetilde{W}}\mathbf{A})^{-1}a_{i}\leq(1-\tfrac{1}{3\bar{\alpha}})^{-2}\frac{1}{w_{i}}a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i}
=(1-\tfrac{1}{3\bar{\alpha}})^{-2}\left[\mathbf{W}^{-1}\Sigma(w)\mathbf{W}^{-1}\right]_{ii}\leq 3\left[\mathbf{W}^{-1}\Sigma(w)\mathbf{W}^{-1}\right]_{ii}~{}. (A.6)

Further, when \alpha\in(0,1], we bound the second term of the preceding Hessian bound as

𝐖~α1(113α¯)α1𝐖α1(113α¯)1𝐖α13𝐖α1,\mathbf{\widetilde{W}}^{\alpha-1}\preceq(1-\tfrac{1}{3\bar{\alpha}})^{\alpha-1}\mathbf{W}^{\alpha-1}\preceq(1-\tfrac{1}{3\bar{\alpha}})^{-1}\mathbf{W}^{\alpha-1}\preceq 3\mathbf{W}^{\alpha-1},

and when α1\alpha\geq 1, we have

𝐖~α1(1+13α¯)α1𝐖α1exp(α13α¯)𝐖α1=exp(α13α)𝐖α13𝐖α1.\mathbf{\widetilde{W}}^{\alpha-1}\preceq(1+\tfrac{1}{3\bar{\alpha}})^{\alpha-1}\mathbf{W}^{\alpha-1}\preceq\exp(\frac{\alpha-1}{3\bar{\alpha}})\mathbf{W}^{\alpha-1}=\exp(\frac{\alpha-1}{3\alpha})\mathbf{W}^{\alpha-1}\preceq 3\mathbf{W}^{\alpha-1}.

Using (A.6) and the two preceding bounds in the earlier bound on \nabla^{2}\mathcal{F}(\widetilde{w}), we have that in all cases

2(w~)3[𝐖1Σ(w)𝐖1+α𝐖α1]3α¯𝐖1[Σ(w)+𝐖1+α]𝐖1.\nabla^{2}\mathcal{F}(\widetilde{w})\preceq 3\left[\mathbf{W}^{-1}\Sigma(w)\mathbf{W}^{-1}+\alpha\mathbf{W}^{\alpha-1}\right]\preceq 3\bar{\alpha}\mathbf{W}^{-1}\left[\Sigma(w)+\mathbf{W}^{1+\alpha}\right]\mathbf{W}^{-1}~{}.

Applying the definition of w^{+} to the above Loewner inequality gives

(w+w)2(w~)(w+w)\displaystyle(w^{+}-w)^{\top}\nabla^{2}\mathcal{F}(\widetilde{w})(w^{+}-w) i[m]3α¯(wi1+α+σi(w))(ηiρi(w)1ρi(w)+1)2\displaystyle\leq\sum_{i\in[m]}3\bar{\alpha}\cdot(w_{i}^{1+\alpha}+\sigma_{i}(w))\cdot\left(\eta_{i}\cdot\frac{\rho_{i}(w)-1}{\rho_{i}(w)+1}\right)^{2}
=i[m]3α¯ηi2wi1+α(ρi(w)1)2ρi(w)+1.\displaystyle=\sum_{i\in[m]}3\bar{\alpha}\cdot\eta_{i}^{2}\cdot w_{i}^{1+\alpha}\cdot\frac{(\rho_{i}(w)-1)^{2}}{\rho_{i}(w)+1}~{}. (A.9)

Next, recall that by Lemma 3, [(w)]i=wi1(wi1+ασi(w))\left[\nabla\mathcal{F}(w)\right]_{i}=w_{i}^{-1}\cdot(w_{i}^{1+\alpha}-\sigma_{i}(w)) for all i[m]i\in[m]. Consequently,

(w),w+w=i[m](wi1+ασi(w))(ηiρi(w)1ρi(w)+1)=i[m]ηiwi1+α(ρi(w)1)2ρi(w)+1.\langle{\nabla\mathcal{F}(w)},{w^{+}-w}\rangle=\sum_{i\in[m]}(w_{i}^{1+\alpha}-\sigma_{i}(w))\cdot\left(\eta_{i}\cdot\frac{\rho_{i}(w)-1}{\rho_{i}(w)+1}\right)=-\sum_{i\in[m]}\eta_{i}\cdot w_{i}^{1+\alpha}\cdot\frac{(\rho_{i}(w)-1)^{2}}{\rho_{i}(w)+1}~{}.

Combining (A.5), (A.9), and the preceding display yields that

(w+)\displaystyle\mathcal{F}(w^{+}) (w)+i[m](ηi+3α¯ηi22)wi1+α(ρi(w)1)2ρi(w)+1.\displaystyle\leq\mathcal{F}(w)+\sum_{i\in[m]}\left(-\eta_{i}+\frac{3\bar{\alpha}\eta_{i}^{2}}{2}\right)\cdot w_{i}^{1+\alpha}\cdot\frac{(\rho_{i}(w)-1)^{2}}{\rho_{i}(w)+1}~{}.

The result follows by plugging in ηi[0,(3α¯)1]\eta_{i}\in[0,(3\bar{\alpha})^{-1}], as assumed. ∎
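The conclusion of this proof is straightforward to probe numerically. The Python sketch below forms w^{+} from a random positive w via w^{+}_{i}=w_{i}\big(1+\eta_{i}\tfrac{\rho_{i}(w)-1}{\rho_{i}(w)+1}\big) with random step sizes \eta_{i}\in[0,(3\bar{\alpha})^{-1}], and checks the derived bound \mathcal{F}(w^{+})\leq\mathcal{F}(w)+\sum_{i}\big(-\eta_{i}+\tfrac{3\bar{\alpha}\eta_{i}^{2}}{2}\big)w_{i}^{1+\alpha}\tfrac{(\rho_{i}(w)-1)^{2}}{\rho_{i}(w)+1}. The instance and the objective form are the same illustrative assumptions as in the earlier sketches.

import numpy as np

rng = np.random.default_rng(2)
m, n, alpha = 40, 5, 2.5                     # arbitrary instance
abar = max(1.0, alpha)
A = rng.standard_normal((m, n))
w = rng.uniform(0.5, 2.0, size=m)
eta = rng.uniform(0.0, 1.0 / (3 * abar), size=m)   # eta_i in [0, 1/(3*abar)], as assumed

def objective(w):
    _, logdet = np.linalg.slogdet(A.T @ (w[:, None] * A))
    return -logdet + np.sum(w ** (1 + alpha)) / (1 + alpha)

def leverage_scores(w):
    Y = np.linalg.solve(A.T @ (w[:, None] * A), A.T)
    return w * np.einsum('ij,ji->i', A, Y)

rho = leverage_scores(w) / w ** (1 + alpha)        # rho_i(w) = sigma_i(w) / w_i^{1+alpha}
w_plus = w * (1 + eta * (rho - 1) / (rho + 1))     # the step analyzed in Lemma 6

bound = objective(w) + np.sum((-eta + 1.5 * abar * eta ** 2)
                              * w ** (1 + alpha) * (rho - 1) ** 2 / (rho + 1))
print("F(w+)  =", objective(w_plus))
print("bound  =", bound)
print("F(w+) <= bound:", objective(w_plus) <= bound + 1e-10)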

Appendix B From Optimization Problem to Lewis Weights

The goal of this section is to show how to obtain \varepsilon-approximate Lewis weights from an \widetilde{\varepsilon}-approximate solution to the problem in (1). Our proof strategy is to first use the fact that the vector w_{\textrm{R}} obtained after the rounding step following the for loop of Algorithm 1 is both \widetilde{\varepsilon}-suboptimal (additively) and satisfies the rounding condition (2). In Lemma 1, the \widetilde{\varepsilon}-suboptimality is used to bound \|\sigma(w_{\textrm{R}})-w_{\textrm{R}}^{1+\alpha}\|_{\infty}. Coupled with the rounding condition, this lets us show in Lemma 13 that the vector \widehat{w} constructed in the last line of Algorithm 1 satisfies approximate optimality, \sigma(\widehat{w})\approx_{\delta}\widehat{w}^{1+\alpha}, for some small \delta>0. In Lemma 14, we then relate this approximate optimality to coordinate-wise multiplicative closeness between \widehat{w} and the vector of true Lewis weights. Finally, in Lemma 1, we pick the appropriate approximation factors for each of the lemmas invoked and prove the desired approximation guarantee. Since the vector w^{(\mathcal{T}_{\textrm{total}})} obtained at the end of the for loop of Algorithm 4 also satisfies the aforementioned properties of w_{\textrm{R}}, the same lemmas apply to Algorithm 4 as well. We begin with some technical lemmas.

B.1 From Approximate Closeness to Approximate Optimality

Lemma 13.

Let w\in\mathbb{R}^{m}_{>0} be such that \|\sigma(w)-w^{1+\alpha}\|_{\infty}\leq\overline{\varepsilon} for some parameter 0<\overline{\varepsilon}\leq\frac{1}{100m^{2}(\alpha+\alpha^{-1})^{2}}, and suppose further that \rho_{\max}(w)\leq 1+\alpha. Define \widehat{w}_{i}=(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i})^{1/\alpha}. Then, for the parameter \delta=20\sqrt{\overline{\varepsilon}}\,m(\alpha+\alpha^{-1}), we have that \sigma(\widehat{w})\approx_{\delta}\widehat{w}^{1+\alpha}.

Proof.

Our strategy to prove \sigma(\widehat{w})\approx_{\delta}\widehat{w}^{1+\alpha} is to first note that this is equivalent to \widehat{w}^{-1}\cdot\sigma(\widehat{w})\approx_{\delta}\widehat{w}^{\alpha}; by the definition of \widehat{w} in the statement of the lemma, it therefore suffices to prove \mathbf{A}^{\top}\mathbf{\widehat{W}}\mathbf{A}\approx_{\delta}\mathbf{A}^{\top}\mathbf{W}\mathbf{A}.

To this end, we split 𝐖\mathbf{W} into two matrices based on the size of its coordinates, setting the following notation. Define 𝐖wη\mathbf{W}_{w\leq\eta} to be the diagonal matrix 𝐖\mathbf{W} with zeroes at indices corresponding to w>ηw>\eta, and 𝐖^wη\mathbf{\widehat{W}}_{w\leq\eta} to be the diagonal matrix 𝐖^\mathbf{\widehat{W}} with zeroes at indices corresponding to w>ηw>\eta. We first show that 𝐀𝐖^wη𝐀\mathbf{A}^{\top}\mathbf{\widehat{W}}_{w\leq\eta}\mathbf{A} and 𝐀𝐖wη𝐀\mathbf{A}^{\top}\mathbf{W}_{w\leq\eta}\mathbf{A} are small compared to 𝐀𝐖𝐀\mathbf{A}^{\top}\mathbf{W}\mathbf{A} and can therefore be ignored in the preceding desired approximation. We then prove that for w>ηw>\eta, we have wδw^w\approx_{\delta}\widehat{w}. This proof technique is inspired by Lemma 4 of [Vai89].

First, we prove that \mathbf{A}^{\top}\mathbf{\widehat{W}}_{w\leq\eta}\mathbf{A} is small compared to \mathbf{A}^{\top}\mathbf{W}\mathbf{A}. Since the rounding condition (2) is satisfied, we have

ai(𝐀𝐖𝐀)1ai=σi(w)wi1(1+α)wiα.a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i}=\sigma_{i}(w)\cdot w_{i}^{-1}\leq(1+\alpha)w_{i}^{\alpha}.

Combining this with the definition of \widehat{w}_{i} in the statement of the lemma and using that (1+\alpha)^{1/\alpha}\leq e\leq 3 for all \alpha>0, we derive

w^i(1+α)1/αwi3wi.\widehat{w}_{i}\leq(1+\alpha)^{1/\alpha}w_{i}\leq 3w_{i}.

We apply this inequality in the following expression to obtain

Tr((𝐀𝐖^wη𝐀)(𝐀𝐖𝐀)1)\displaystyle\Tr((\mathbf{A}^{\top}\mathbf{\widehat{W}}_{w\leq\eta}\mathbf{A})(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}) =wiηw^i(ai(𝐀𝐖𝐀)1ai)\displaystyle=\sum_{w_{i}\leq\eta}\widehat{w}_{i}(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i})
=wiη(ai(𝐀𝐖𝐀)1ai)1+1/α\displaystyle=\sum_{w_{i}\leq\eta}(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i})^{1+1/\alpha}
(1+α)1+1/αwiηwi1+α\displaystyle\leq(1+\alpha)^{1+1/\alpha}\sum_{w_{i}\leq\eta}w_{i}^{1+\alpha}
3(1+α)mη1+α.\displaystyle\leq 3(1+\alpha)m\eta^{1+\alpha}. (B.2)

This implies that (this uses the fact that if X,Y\succeq 0 and \operatorname{Tr}(XY)\leq 1, then Y^{1/2}XY^{1/2}\succeq 0 with \operatorname{Tr}(Y^{1/2}XY^{1/2})\leq 1, so Y^{1/2}XY^{1/2}\preceq I and hence X\preceq Y^{-1}; here it is applied after rescaling)

𝐀𝐖^wη𝐀3(1+α)mη1+α𝐀𝐖𝐀.\mathbf{A}^{\top}\mathbf{\widehat{W}}_{w\leq\eta}\mathbf{A}\preceq 3(1+\alpha)m\eta^{1+\alpha}\mathbf{A}^{\top}\mathbf{W}\mathbf{A}.

Our next goal is to bound 𝐀𝐖^w>η𝐀\mathbf{A}^{\top}\mathbf{\widehat{W}}_{w>\eta}\mathbf{A} in terms of 𝐀𝐖𝐀\mathbf{A}^{\top}\mathbf{W}\mathbf{A}, which we do by first bounding it in terms of 𝐀𝐖w>η𝐀\mathbf{A}^{\top}\mathbf{W}_{w>\eta}\mathbf{A} and then bounding 𝐀𝐖w>η𝐀\mathbf{A}^{\top}\mathbf{W}_{w>\eta}\mathbf{A} in terms of 𝐀𝐖𝐀\mathbf{A}^{\top}\mathbf{W}\mathbf{A}. By definition, w^iα=σi(w)wi1\widehat{w}_{i}^{\alpha}=\sigma_{i}(w)\cdot w_{i}^{-1}. Further, by assumption, σ(w)w1+αε¯\|\sigma(w)-w^{1+\alpha}\|_{\infty}\leq\overline{\varepsilon}. Therefore, for any wiηw_{i}\geq\eta

w^iα(wi1+α+ε¯)wi1(1+ε¯/η1+α)wi1+αwi1=(1+ε¯/η1+α)wiα,\widehat{w}_{i}^{\alpha}\leq(w_{i}^{1+\alpha}+\overline{\varepsilon})\cdot w_{i}^{-1}\leq(1+\overline{\varepsilon}/\eta^{1+\alpha})w_{i}^{1+\alpha}\cdot w_{i}^{-1}=(1+\overline{\varepsilon}/\eta^{1+\alpha})w_{i}^{\alpha},

and

w^iα(wi1+αε¯)wi1(1ε¯/η1+α)wi1+αwi1=(1ε¯/η1+α)wiα.\widehat{w}_{i}^{\alpha}\geq(w_{i}^{1+\alpha}-\overline{\varepsilon})\cdot w_{i}^{-1}\geq(1-\overline{\varepsilon}/\eta^{1+\alpha})w_{i}^{1+\alpha}\cdot w_{i}^{-1}=(1-\overline{\varepsilon}/\eta^{1+\alpha})w_{i}^{\alpha}.

By our choice of ε¯\overline{\varepsilon}, for wiηw_{i}\geq\eta, we have

(12ε¯αη1+α)wiw^i(1+2ε¯αη1+α)wi.\left(1-\frac{2\overline{\varepsilon}}{\alpha\eta^{1+\alpha}}\right)w_{i}\leq\widehat{w}_{i}\leq\left(1+\frac{2\overline{\varepsilon}}{\alpha\eta^{1+\alpha}}\right)w_{i}.

Further, we have the following inequality:

𝐀𝐖w>η𝐀𝐀𝐖𝐀.\mathbf{A}^{\top}\mathbf{W}_{w>\eta}\mathbf{A}\preceq\mathbf{A}^{\top}\mathbf{W}\mathbf{A}.

Hence, we can combine the bound on \mathbf{A}^{\top}\mathbf{\widehat{W}}_{w\leq\eta}\mathbf{A}, the coordinate-wise closeness of \widehat{w} and w on the coordinates with w_{i}>\eta, and the preceding inequality to see that

𝐀𝐖^𝐀\displaystyle\mathbf{A}^{\top}\mathbf{\widehat{W}}\mathbf{A} =𝐀𝐖^w>η𝐀+𝐀𝐖^wη𝐀\displaystyle=\mathbf{A}^{\top}\mathbf{\widehat{W}}_{w>\eta}\mathbf{A}+\mathbf{A}^{\top}\mathbf{\widehat{W}}_{w\leq\eta}\mathbf{A}
(1+2ε¯αη1+α)𝐀𝐖w>η𝐀+3(1+α)mη1+α𝐀𝐖𝐀\displaystyle\preceq\left(1+\frac{2\overline{\varepsilon}}{\alpha\eta^{1+\alpha}}\right)\mathbf{A}^{\top}\mathbf{W}_{w>\eta}\mathbf{A}+3(1+\alpha)m\eta^{1+\alpha}\mathbf{A}^{\top}\mathbf{W}\mathbf{A}
𝐀𝐖𝐀(1+2ε¯αη1+α+3(1+α)mη1+α).\displaystyle\preceq\mathbf{A}^{\top}\mathbf{W}\mathbf{A}\left(1+\frac{2\overline{\varepsilon}}{\alpha\eta^{1+\alpha}}+3(1+\alpha)m\eta^{1+\alpha}\right).

Setting \eta^{1+\alpha}=\sqrt{\overline{\varepsilon}} then yields the claimed upper bound.

For the lower bound, we bound 𝐀𝐖wη𝐀\mathbf{A}^{\top}\mathbf{W}_{w\leq\eta}\mathbf{A} and, therefore, also 𝐀𝐖w>η𝐀\mathbf{A}^{\top}\mathbf{W}_{w>\eta}\mathbf{A}. Observe that

Tr((𝐀𝐖wη𝐀)(𝐀𝐖𝐀)1)\displaystyle\Tr((\mathbf{A}^{\top}\mathbf{W}_{w\leq\eta}\mathbf{A})(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}) =wiηwiai(𝐀𝐖𝐀)1ai=wiησi(w)\displaystyle=\sum_{w_{i}\leq\eta}w_{i}a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i}=\sum_{w_{i}\leq\eta}\sigma_{i}(w)
wiη(wi1+α+ε¯)m(η1+α+ε¯),\displaystyle\leq\sum_{w_{i}\leq\eta}(w_{i}^{1+\alpha}+\overline{\varepsilon})\leq m(\eta^{1+\alpha}+\overline{\varepsilon}),

where the second step is by σ(w)w1+αε¯\|\sigma(w)-w^{1+\alpha}\|_{\infty}\leq\overline{\varepsilon}, as assumed in the lemma. This implies that

𝐀𝐖wη𝐀m(η1+α+ε¯)𝐀𝐖𝐀,\mathbf{A}^{\top}\mathbf{W}_{w\leq\eta}\mathbf{A}\preceq m(\eta^{1+\alpha}+\overline{\varepsilon})\mathbf{A}^{\top}\mathbf{W}\mathbf{A},

and therefore that

𝐀𝐖w>η𝐀(1m(η1+α+ε¯))𝐀𝐖𝐀.\mathbf{A}^{\top}\mathbf{W}_{w>\eta}\mathbf{A}\succeq(1-m(\eta^{1+\alpha}+\overline{\varepsilon}))\mathbf{A}^{\top}\mathbf{W}\mathbf{A}.

Repeating the argument used for the upper bound then finishes the proof. ∎
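To illustrate Lemma 13, the following Python sketch first runs a few descent steps from w=\mathbf{1} (a hypothetical warm-up, standing in for Algorithm 1) so that \sigma(w) is only approximately equal to w^{1+\alpha}, then forms \widehat{w}_{i}=(a_{i}^{\top}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}a_{i})^{1/\alpha} as in the lemma and reports the multiplicative optimality error of both vectors; the instance, step size, and iteration count are arbitrary.

import numpy as np

rng = np.random.default_rng(3)
m, n, alpha = 80, 6, 2.0                     # arbitrary instance
abar = max(1.0, alpha)
A = rng.standard_normal((m, n))

def quad(w):
    # q_i(w) = a_i^T (A^T W A)^{-1} a_i, so that sigma_i(w) = w_i * q_i(w)
    Y = np.linalg.solve(A.T @ (w[:, None] * A), A.T)
    return np.einsum('ij,ji->i', A, Y)

# A few descent steps to obtain a w with sigma(w) approximately equal to w^{1+alpha}.
w = np.ones(m)
for _ in range(30):
    rho = quad(w) / w ** alpha               # rho_i(w) = sigma_i(w) / w_i^{1+alpha}
    w = w * (1 + (rho - 1) / ((rho + 1) * 3 * abar))

w_hat = quad(w) ** (1 / alpha)               # the construction of Lemma 13

for name, v in [("w    ", w), ("w_hat", w_hat)]:
    err = np.abs(v * quad(v) / v ** (1 + alpha) - 1).max()
    print(name, ": max_i |sigma_i / v_i^{1+alpha} - 1| =", err)

The point of the comparison is that the constructed \widehat{w} remains approximately optimal, which is exactly what the lemma quantifies.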

B.2 From Approximate Optimality to Approximate Lewis Weights

In this section, we go from the previous notion of approximation to the one we finally seek in (1). Specifically, we show that if σ(w)βw1+α\sigma(w)\approx_{\beta}w^{1+\alpha}, then wO((β/α)n)w¯w\approx_{O((\beta/\alpha)\sqrt{n})}\overline{w}. To prove this, we first give a technical result. We recall notation stated in Section 1.4: for any projection matrix 𝐏(w)m×m\mathbf{P}(w)\in\mathbb{R}^{m\times m}, we have the vector of leverage scores σ(w)=𝐝𝐢𝐚𝐠(𝐏(w))\sigma(w)=\mathbf{diag}(\mathbf{P}(w)).

Claim 1.

For any projection matrix 𝐏(w)m×m\mathbf{P}(w)\in\mathbb{R}^{m\times m}, α0\alpha\geq 0, and vector xmx\in\mathbb{R}^{m}, we have that

[𝐏(w)(2)+α𝚺(w)]1𝚺(w)x1αx+1α2x𝚺(w)(1+n/αα)x\norm{\left[\mathbf{P}(w)^{(2)}+\alpha\mathbf{\Sigma}(w)\right]^{-1}\mathbf{\Sigma}(w)x}_{\infty}\leq\frac{1}{\alpha}\norm{x}_{\infty}+\frac{1}{\alpha^{2}}\norm{x}_{\mathbf{\Sigma}(w)}\leq\left(\frac{1+\sqrt{n}/\alpha}{\alpha}\right)\norm{x}_{\infty}
Proof.

Let y=def[𝐏(w)(2)+α𝚺(w)]1𝚺(w)xy\stackrel{{\scriptstyle\mathrm{{\scriptscriptstyle def}}}}{{=}}\left[\mathbf{P}(w)^{(2)}+\alpha\mathbf{\Sigma}(w)\right]^{-1}\mathbf{\Sigma}(w)x. Since 𝟎𝐏(w)(2)𝚺(w)\mathbf{0}\preceq\mathbf{P}(w)^{(2)}\preceq\mathbf{\Sigma}(w) (Fact 1.1), we have that 𝚺(w)1α[𝐏(w)(2)+α𝚺(w)]\mathbf{\Sigma}(w)\preceq\frac{1}{\alpha}\left[\mathbf{P}(w)^{(2)}+\alpha\mathbf{\Sigma}(w)\right] and (𝐏(w)(2)+α𝚺(w))1α1𝚺(w)1(\mathbf{P}(w)^{(2)}+\alpha\mathbf{\Sigma}(w))^{-1}\preceq\alpha^{-1}\mathbf{\Sigma}(w)^{-1}. Consequently, taking norms in terms of these matrices gives

y𝚺(w)=[𝐏(w)(2)+α𝚺(w)]1𝚺(w)x𝚺(w)\displaystyle\norm{y}_{\mathbf{\Sigma}(w)}=\norm{\left[\mathbf{P}(w)^{(2)}+\alpha\mathbf{\Sigma}(w)\right]^{-1}\mathbf{\Sigma}(w)x}_{\mathbf{\Sigma}(w)} 1α𝚺(w)x[𝐏(w)(2)+α𝚺(w)]11αx𝚺(w).\displaystyle\leq\frac{1}{\sqrt{\alpha}}\norm{\mathbf{\Sigma}(w)x}_{\left[\mathbf{P}(w)^{(2)}+\alpha\mathbf{\Sigma}(w)\right]^{-1}}\leq\frac{1}{\alpha}\norm{x}_{\mathbf{\Sigma}(w)}\,. (B.6)

Next, by Lemma 47 of [LS19], we have \|\mathbf{\Sigma}(w)^{-1}\mathbf{P}(w)^{(2)}z\|_{\infty}\leq\|z\|_{\mathbf{\Sigma}(w)} for all z\in\mathbb{R}^{m}, so \left|[\mathbf{P}(w)^{(2)}y]_{i}\right|\leq\sigma_{i}(w)\|y\|_{\mathbf{\Sigma}(w)} for all i\in[m]. Moreover, by the definition of y, we have [(\mathbf{P}(w)^{(2)}+\alpha\mathbf{\Sigma}(w))y]_{i}=\sigma_{i}(w)x_{i} for all i\in[m]. Therefore,

y=maxi[m]|yi|=maxi[m]|1αxi+1ασi(w)[𝐏(w)(2)y]i|1αx+1αy𝚺(w).\norm{y}_{\infty}=\max_{i\in[m]}|y_{i}|=\max_{i\in[m]}\left|\frac{1}{\alpha}x_{i}+\frac{1}{\alpha\sigma_{i}(w)}\left[\mathbf{P}(w)^{(2)}y\right]_{i}\right|\leq\frac{1}{\alpha}\norm{x}_{\infty}+\frac{1}{\alpha}\norm{y}_{\mathbf{\Sigma}(w)}\,. (B.7)

Combining (B.6) and (B.7) and using that i[m]σi(w)n\sum_{i\in[m]}\sigma_{i}(w)\leq n yields the claim. ∎
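Claim 1 can also be checked numerically. In the sketch below we take \mathbf{P}(w)=\mathbf{W}^{1/2}\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top}\mathbf{W}^{1/2} and interpret \mathbf{P}(w)^{(2)} as its entrywise square, consistent with the Hessian expressions in Lemma 3; the instance and the test vector x are arbitrary.

import numpy as np

rng = np.random.default_rng(4)
m, n, alpha = 30, 4, 1.3                     # arbitrary instance
A = rng.standard_normal((m, n))
w = rng.uniform(0.2, 3.0, size=m)
x = rng.standard_normal(m)

half = np.sqrt(w)[:, None] * A               # W^{1/2} A
P = half @ np.linalg.solve(A.T @ (w[:, None] * A), half.T)   # P(w)
sigma = np.diag(P)                           # sigma(w) = diag(P(w))
P2 = P * P                                   # P(w)^(2): entrywise square (assumed interpretation)

y = np.linalg.solve(P2 + alpha * np.diag(sigma), sigma * x)
print("||y||_inf                              =", np.abs(y).max())
print("(1 + sqrt(n)/alpha)/alpha * ||x||_inf  =", (1 + np.sqrt(n) / alpha) / alpha * np.abs(x).max())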

Lemma 14.

Let w^>0m\widehat{w}\in\mathbb{R}_{>0}^{m} be a vector that satisfies approximate optimality of (1) in the following sense:

σ(w^)=𝐖^1+αv, for exp(μ)𝟏vexp(μ)𝟏.\sigma(\widehat{w})=\widehat{\mathbf{W}}^{1+\alpha}v,\text{ for }\exp(-\mu)\mathbf{1}\leq v\leq\exp(\mu)\mathbf{1}.

Then, w^\widehat{w} is also coordinate-wise multiplicatively close to w¯\overline{w}, the true vector of Lewis weights, as formalized below.

exp(1α(1+n/α)μ)w¯w^exp(1α(1+n/α)μ)w¯.\exp\left(-\frac{1}{\alpha}(1+\sqrt{n}/\alpha)\mu\right)\overline{w}\leq\widehat{w}\leq\exp\left(\frac{1}{\alpha}(1+\sqrt{n}/\alpha)\mu\right)\overline{w}\,.
Proof.

For all t\in[0,1], let [v_{t}]_{i}=v_{i}^{t}, so that v_{1}=v and v_{0}=\mathbf{1}. Further, for all t\in[0,1], let w_{t} be the unique solution to

wt=argminw>0mft(w)=deflogdet(𝐀𝐖𝐀)+11+αi[m][vt]iwi1+α.w_{t}=\operatorname*{\arg\!\min}_{w\in\mathbb{R}_{>0}^{m}}f_{t}(w)\stackrel{{\scriptstyle\mathrm{{\scriptscriptstyle def}}}}{{=}}-\log\det\left(\mathbf{A}^{\top}\mathbf{W}\mathbf{A}\right)+\frac{1}{1+\alpha}\sum_{i\in[m]}[v_{t}]_{i}w_{i}^{1+\alpha}.

Then we have the following derivatives.

wft(w)\displaystyle\nabla_{w}f_{t}(w) =𝐖1σ(w)+𝐖αvt,\displaystyle=-\mathbf{W}^{-1}\sigma(w)+\mathbf{W}^{\alpha}v_{t}\,,
w(ddtft)(w)\displaystyle\nabla_{w}(\frac{d}{dt}f_{t})(w) =𝐖αddtvt=𝐖αvtln(v)\displaystyle=\mathbf{W}^{\alpha}\frac{d}{dt}v_{t}=\mathbf{W}^{\alpha}v_{t}\ln(v)\, (B.8)
ww2ft(w)\displaystyle\nabla^{2}_{ww}f_{t}(w) =𝐖1[𝐏(w)(2)+α𝐖1+α𝐕]𝐖1.\displaystyle=\mathbf{W}^{-1}\left[\mathbf{P}(w)^{(2)}+\alpha\mathbf{W}^{1+\alpha}\mathbf{V}\right]\mathbf{W}^{-1}\,. (B.9)

Consequently, by the first-order optimality condition for w_{t} in the preceding minimization problem, we have \mathbf{0}=\nabla_{w}f_{t}(w_{t})=-\mathbf{W}_{t}^{-1}\sigma(w_{t})+\mathbf{W}_{t}^{\alpha}v_{t}. Rearranging the terms of this equation yields that

σ(wt)=𝐖t1+αvt,\sigma(w_{t})=\mathbf{W}_{t}^{1+\alpha}v_{t},

and therefore w1=w^w_{1}=\widehat{w} and w0=w¯w_{0}=\overline{w}. To prove the lemma, it therefore suffices to bound

ln(w^/w¯)=ln(w1/w0)=t=01[ddtln(wt)]𝑑t=t=01𝐖t1[ddtwt]𝑑t.\ln(\widehat{w}/\overline{w})=\ln(w_{1}/w_{0})=\int_{t=0}^{1}\left[\frac{d}{dt}\ln(w_{t})\right]dt=\int_{t=0}^{1}\mathbf{W}_{t}^{-1}\left[\frac{d}{dt}w_{t}\right]dt\,. (B.11)

To bound (B.11), it remains to compute ddtwt\frac{d}{dt}w_{t} and apply Claim 1. To do this, note that

𝟎=ddtw[ft(wt)]=w(ddtft)(wt)+ww2ft(wt)ddtwt.\mathbf{0}=\frac{d}{dt}\gradient_{w}\left[f_{t}(w_{t})\right]=\nabla_{w}(\frac{d}{dt}f_{t})(w_{t})+\nabla^{2}_{ww}f_{t}(w_{t})\cdot\frac{d}{dt}w_{t}\,.

Using that \mathbf{P}(w_{t})^{(2)}+\alpha\mathbf{W}_{t}^{1+\alpha}\mathbf{V}_{t}\succ\mathbf{0}, we have, by rearranging the above equation and applying (B.8) and (B.9), that

ddtwt\displaystyle\frac{d}{dt}w_{t} =[ww2ft(wt)]1[w(ddtft)(wt)]=𝐖t[𝐏(wt)(2)+α𝐖t1+α𝐕t]1𝐖t1+αvtln(v).\displaystyle=-\left[\nabla^{2}_{ww}f_{t}(w_{t})\right]^{-1}\cdot\left[\nabla_{w}(\frac{d}{dt}f_{t})(w_{t})\right]=-\mathbf{W}_{t}\left[\mathbf{P}(w_{t})^{(2)}+\alpha\mathbf{W}_{t}^{1+\alpha}\mathbf{V}_{t}\right]^{-1}\mathbf{W}_{t}^{1+\alpha}v_{t}\ln(v)\,. (B.12)

Applying the relation \sigma(w_{t})=\mathbf{W}_{t}^{1+\alpha}v_{t} established above to (B.12), we have that

𝐖t1[ddtwt]\displaystyle\mathbf{W}_{t}^{-1}\left[\frac{d}{dt}w_{t}\right] =[𝐏(wt)(2)+α𝚺(wt)]1𝚺(wt)ln(v).\displaystyle=-\left[\mathbf{P}(w_{t})^{(2)}+\alpha\mathbf{\Sigma}(w_{t})\right]^{-1}\mathbf{\Sigma}(w_{t})\ln(v)\,.

Applying Claim 1 to the above equality, substituting into (B.11), and using \|\ln(v)\|_{\infty}\leq\mu therefore yields

ln(w^/w¯)=ln(w1/w0)\displaystyle\norm{\ln(\widehat{w}/\overline{w})}_{\infty}=\norm{\ln(w_{1}/w_{0})}_{\infty} t=01𝐖t1[ddtwt]𝑑tt=01(1+n/αα)μ𝑑t.\displaystyle\leq\int_{t=0}^{1}\norm{\mathbf{W}_{t}^{-1}\left[\frac{d}{dt}w_{t}\right]}_{\infty}dt\leq\int_{t=0}^{1}\left(\frac{1+\sqrt{n}/\alpha}{\alpha}\right)\mu dt\,.

B.3 From Optimization Problem to Approximate Lewis Weights

See 1

Proof.

We are given a vector w\in\mathbb{R}^{m} satisfying \mathcal{F}(\overline{w})\leq\mathcal{F}(w)\leq\mathcal{F}(\overline{w})+\widetilde{\varepsilon}. Then by Lemma 5, we have \frac{(\sigma_{i}(w)-{w}_{i}^{1+\alpha})^{2}}{\sigma_{i}(w)+{w}_{i}^{1+\alpha}}\leq\widetilde{\varepsilon} for each i\in[m]. This bound implies that w_{i}^{1+\alpha}\leq 3 for all i: if not, then since \sigma_{i}(w)\in[0,1] and (x-a)^{2}/(x+a) is decreasing over x\in[0,1] for any fixed a\geq 3, we would obtain \frac{(\sigma_{i}(w)-{w}_{i}^{1+\alpha})^{2}}{\sigma_{i}(w)+{w}_{i}^{1+\alpha}}\geq\frac{(1-{w}_{i}^{1+\alpha})^{2}}{1+{w}_{i}^{1+\alpha}}\geq 1, a contradiction. Therefore \|\sigma(w)-w^{1+\alpha}\|_{\infty}\leq 2\sqrt{\widetilde{\varepsilon}}. Coupled with the provided guarantee \rho_{\max}(w)\leq 1+\alpha, we see that the requirements of Lemma 13 are met with \overline{\varepsilon}=2\sqrt{\widetilde{\varepsilon}}, for \widetilde{\varepsilon}\stackrel{{\scriptstyle\mathrm{{\scriptscriptstyle def}}}}{{=}}\frac{\widehat{\varepsilon}^{4}}{(25m(\alpha+\alpha^{-1}))^{4}}, and Algorithm 1 therefore guarantees a \widehat{w} satisfying \sigma(\widehat{w})\approx_{\widehat{\varepsilon}}\widehat{w}^{1+\alpha}. We can now apply Lemma 14 with \mu=\widehat{\varepsilon}, and choosing \widehat{\varepsilon}=\tfrac{\alpha^{2}}{\alpha+\sqrt{n}}\varepsilon lets us conclude that \widehat{w}_{i}\approx_{\varepsilon}\overline{w}_{i}, as claimed. ∎
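For concreteness, here is an illustrative (and deliberately non-optimized) instantiation of the parameter chain in this proof, with the hypothetical values m=10^{3}, n=10^{2}, \alpha=1, and target accuracy \varepsilon=0.1:

\widehat{\varepsilon}=\frac{\alpha^{2}}{\alpha+\sqrt{n}}\,\varepsilon=\frac{0.1}{11}\approx 9.1\times 10^{-3},\qquad \widetilde{\varepsilon}=\frac{\widehat{\varepsilon}^{4}}{(25m(\alpha+\alpha^{-1}))^{4}}\approx\frac{6.8\times 10^{-9}}{6.3\times 10^{18}}\approx 1.1\times 10^{-27},

so that \overline{\varepsilon}=2\sqrt{\widetilde{\varepsilon}}\approx 6.6\times 10^{-14} and \delta=20\sqrt{\overline{\varepsilon}}\,m(\alpha+\alpha^{-1})\approx 1.0\times 10^{-2}. Although \widetilde{\varepsilon} is tiny, the algorithms depend on it only through \log(m/\widetilde{\varepsilon})\approx 69, which is why the overall iteration complexity remains polylogarithmic in 1/\varepsilon.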

Appendix C A Geometric View of Rounding

At the end of Algorithms 2 and 3, the iterate w satisfies the condition \rho_{\max}(w)\leq 1+\alpha. We now show the geometry implied by this condition, thereby providing the reason behind the terminology “rounding.”

Lemma 15.

Let w\in\mathbb{R}_{>0}^{m} be such that \rho_{\max}(w)\leq 1+\alpha, and define the ellipsoid \mathcal{E}(w):=\{x:x^{\top}\mathbf{A}^{\top}\mathbf{W}\mathbf{A}x\leq 1\}. Then, we have that

(w){xn|𝐖α/2𝐀x1+α}.\mathcal{E}(w)\subset\{x\in\mathbb{R}^{n}~{}|~{}\|\mathbf{W}^{-\alpha/2}\mathbf{A}x\|_{\infty}\leq\sqrt{1+\alpha}\}.
Proof.

Consider any point x\in\mathcal{E}(w). Then, by the Cauchy-Schwarz inequality and \rho_{\max}(w)\leq 1+\alpha,

𝐖α/2𝐀x\displaystyle\|\mathbf{W}^{-\alpha/2}\mathbf{A}x\|_{\infty} =maxi[m]ei𝐖α/2𝐀x=maxi[m]ei𝐖α/2𝐀(𝐀𝐖𝐀)12(𝐀𝐖𝐀)12x\displaystyle=\max_{i\in[m]}e_{i}^{\top}\mathbf{W}^{-\alpha/2}\mathbf{A}x=\max_{i\in[m]}e_{i}^{\top}\mathbf{W}^{-\alpha/2}\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-\frac{1}{2}}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{\frac{1}{2}}x
maxi[m]ei𝐖α/2𝐀(𝐀𝐖𝐀)1𝐀𝐖α/2eix𝐀𝐖𝐀x\displaystyle\leq\max_{i\in[m]}\sqrt{e_{i}^{\top}\mathbf{W}^{-\alpha/2}\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top}\mathbf{W}^{-\alpha/2}e_{i}}\sqrt{x^{\top}\mathbf{A}^{\top}\mathbf{W}\mathbf{A}x}
maxi[m]ei𝐖α/2𝐀(𝐀𝐖𝐀)1𝐀𝐖α/2ei=maxi[m]σi(w)wi1+α1+α.\displaystyle\leq\max_{i\in[m]}\sqrt{e_{i}^{\top}\mathbf{W}^{-\alpha/2}\mathbf{A}(\mathbf{A}^{\top}\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^{\top}\mathbf{W}^{-\alpha/2}e_{i}}=\max_{i\in[m]}\sqrt{\frac{\sigma_{i}(w)}{w_{i}^{1+\alpha}}}\leq\sqrt{1+\alpha}.
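The containment of Lemma 15 is also easy to observe numerically. The Python sketch below rescales a random positive w so that \rho_{\max}(w)=1+\alpha exactly (scaling w by c>0 multiplies every \rho_{i}(w) by c^{-(1+\alpha)}), samples points on the boundary of \mathcal{E}(w), and compares \|\mathbf{W}^{-\alpha/2}\mathbf{A}x\|_{\infty} against \sqrt{1+\alpha}; the instance is an arbitrary random one.

import numpy as np

rng = np.random.default_rng(5)
m, n, alpha = 50, 4, 1.0                     # arbitrary instance
A = rng.standard_normal((m, n))

w = rng.uniform(0.5, 2.0, size=m)
Y = np.linalg.solve(A.T @ (w[:, None] * A), A.T)
rho = np.einsum('ij,ji->i', A, Y) / w ** alpha            # rho_i(w) = sigma_i(w)/w_i^{1+alpha}
w = w * (rho.max() / (1 + alpha)) ** (1 / (1 + alpha))    # rescale so that rho_max(w) = 1 + alpha

H = A.T @ (w[:, None] * A)                   # E(w) = {x : x^T H x <= 1}
L = np.linalg.cholesky(H)
U = rng.standard_normal((n, 2000))
X = np.linalg.solve(L.T, U / np.linalg.norm(U, axis=0))   # columns lie on the boundary of E(w)

vals = np.abs((w ** (-alpha / 2))[:, None] * (A @ X)).max(axis=0)
print("max over samples of ||W^{-alpha/2} A x||_inf :", vals.max())
print("bound sqrt(1 + alpha)                        :", np.sqrt(1 + alpha))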

Appendix D Explanations of Runtimes in Prior Work

The convex program (1) formulated by [CP15] has n^{2} variables. Therefore, by [LSW15], the number of iterations needed to solve it using the cutting plane method is O(n^{2}\log(n\varepsilon^{-1})), with each iteration computing a_{i}^{\top}\mathbf{M}a_{i} for all i\in[m]. This can be done by multiplying an n\times n matrix with an n\times m matrix, which costs between O(mn) (at least the size of the larger input matrix) and O(mn^{2}) (each entry of the resulting m\times n matrix obtained via an inner product of length-n vectors). Further, the cutting plane method performs at least on the order of n^{6} additional work in total. This gives a cost of at least n^{2}(mn+n^{4}). The runtime of [Lee16] follows from Theorem 5.3.4 therein.