
Distributed Sketching for Randomized Optimization: Exact Characterization, Concentration and Lower Bounds

Burak Bartan and Mert Pilanci B. Bartan is with the Department of Electrical Engineering, Stanford University, CA, 94305 USA (e-mail: [email protected]).M. Pilanci is with the Department of Electrical Engineering, Stanford University, CA, 94305 USA (e-mail: [email protected]).
Abstract

We consider distributed optimization methods for problems where forming the Hessian is computationally challenging and communication is a significant bottleneck. We leverage randomized sketches for reducing the problem dimensions as well as preserving privacy and improving straggler resilience in asynchronous distributed systems. We derive novel approximation guarantees for classical sketching methods and establish tight concentration results that serve as both upper and lower bounds on the error. We then extend our analysis to the accuracy of parameter averaging for distributed sketches. Furthermore, we develop unbiased parameter averaging methods for randomized second order optimization for regularized problems that employ sketching of the Hessian. Existing works do not take the bias of the estimators into consideration, which limits their application to massively parallel computation. We provide closed-form formulas for regularization parameters and step sizes that provably minimize the bias for sketched Newton directions. Additionally, we demonstrate the implications of our theoretical findings via large scale experiments on a serverless cloud computing platform.

Index Terms:
second order optimization, sketching, distributed optimization, randomized algorithms, convex optimization, regularized least squares, large scale problems, differential privacy

I Introduction

We investigate distributed sketching methods for solving large scale regression problems. In particular, we study parameter averaging for variance reduction and establish theoretical results on approximation performance. We consider a distributed computing model with a single central node and multiple worker nodes that run in parallel. Employing parameter averaging in distributed computing systems enables asynchronous updates, since a running average of available parameters can approximate the result without requiring all worker nodes to finish their tasks. Moreover, sketching provably preserves the privacy of the data, making it an attractive choice for massively parallel cloud computing.

We consider distributed averaging in a variety of randomized second order methods including Newton Sketch, iterative Hessian sketch (IHS), and also in direct or non-iterative methods. We focus on the communication-efficient setting where we avoid the communication of approximate Hessian matrices of size d2d^{2} and communicate only the approximate solutions of size dd. Averaging sketched solutions was proposed in the literature in certain restrictive settings [1]. The presence of a regularization term requires additional caution, as naïve averaging may lead to biased estimators of the solution. In this work, bias is always with respect to the randomness of the sketching matrices; the input data are not assumed to be sampled from a probability distribution. Although bias is often overlooked in the literature, we show that one can re-calibrate the regularization coefficient of the sketched problems to obtain unbiased estimators. We show that having unbiased estimators leads to better performance without imposing any additional computational cost.

We consider both underdetermined and overdetermined linear regression problems in the regime where the data does not fit in main memory. Such linear regression problems and linear systems are commonly encountered in a multitude of problems ranging from statistics and machine learning to optimization. Being able to solve large scale linear regression problems efficiently is crucial for many applications. In this paper, the setting that we consider consists of a distributed system with qq workers that run in parallel and a single central node that collects and processes the outputs of the worker nodes. Applications of randomized sketches and dimension reduction to linear regression and other optimization problems have been extensively studied in the recent literature by works including [2, 3, 4, 5, 6, 7]. In this work, we investigate averaging the solutions of sketched sub-problems. This setting for overdetermined problems was also studied in [3]. In addition, we also consider regression problems where the number of data samples is less than the dimensionality and investigate the properties of averaging for such problems.

An important advantage of randomized sketching in distributed computing is the independent and identically distributed nature of all of the computational tasks. Therefore, sketching offers a resilient computing model where node failures, stragglers as well as additions of new nodes can be easily handled, e.g., via generating more data sketches. An alternative to averaging that would offer similar benefits is the asynchronous stochastic gradient descent (SGD) algorithm [8, 9]. However, the convergence rates of asynchronous SGD methods necessarily depend on the properties of the input data such as its condition number [8, 9]. In contrast, as we show in this work, distributed sketching has stronger convergence guarantees that do not depend on the condition number of the data matrix.

A fundamental emphasis of the paper is on distributed computing and parameter averaging. However, novel theoretical results including exact characterizations of expected error and exponential concentration are derived for the single sketch estimator as well.

Table I summarizes the types of problems that we study in this work. The last column of the table contains references to the theorems and lemmas for each result.

TABLE I: Summary of theoretical results. The data matrix is An×dA\in\mathbb{R}^{n\times d} and the output vector is bnb\in\mathbb{R}^{n}.
Problem | Sketch Type | Method and Theorem
\min_{x}\|Ax-b\|_{2}^{2} | Gaussian | Distributed randomized regression: Theorems II.2, II.8, II.11
\min_{x}\|Ax-b\|_{2}^{2} | Gaussian | Distributed iterative Hessian sketch: Theorem II.13
\min_{x}\|Ax-b\|_{2}^{2} | Other | Distributed randomized regression: Lemmas IV.3, IV.4, IV.5, Theorem IV.6
\min_{x}\|x\|_{2}^{2}\mbox{ s.t. }Ax=b | Gaussian | Right sketch: Theorem II.17
\min_{x}\|Ax-b\|_{2}^{2}+\lambda_{1}\|x\|_{2}^{2} | Gaussian | Distributed randomized ridge regression: Theorem II.20
Convex problems with Hessian (H_{t}^{1/2})^{T}H_{t}^{1/2} | Gaussian | Distributed Newton sketch (step size for unbiasedness): Theorem V.1
Convex problems with Hessian (H_{t}^{1/2})^{T}H_{t}^{1/2}+\lambda_{1}I_{d} | Gaussian | Distributed Newton sketch (regularization coefficient for unbiasedness): Theorem V.2

I-A Cloud Computing

Serverless computing is a relatively new technology that offers computing on the cloud without requiring any server management from end users. Functions in serverless computing can be thought of as worker nodes that have very limited resources and lifetimes, but are highly scalable. The algorithms that we study in this paper are particularly suitable for serverless computing platforms, since they do not require peer-to-peer communication among worker nodes and have low memory and compute requirements per node. In our numerical simulations, we evaluate our methods on the serverless computing platform AWS Lambda.

Data privacy is an increasingly important issue in cloud computing, one that has been studied in many recent works including [10, 11, 12, 13]. Distributed sketching comes with the benefit of privacy preservation. To clarify, let us consider a setting where the master node computes sketched data SkAS_{k}A, SkbS_{k}b, k=1,,qk=1,\dots,q locally where Skm×nS_{k}\in\mathbb{R}^{m\times n} are the sketching matrices and An×dA\in\mathbb{R}^{n\times d}, bnb\in\mathbb{R}^{n} are the data matrix and the output vector, respectively, for the regression problem minxAxb22\min_{x}\|Ax-b\|_{2}^{2}. In the distributed sketching setting, the master node sends only the sketched data to worker nodes for computational efficiency, as well as data privacy preservation. In particular, the mutual information and differential privacy can be controlled when we reveal SkAS_{k}A and keep AA hidden. Furthermore, one can trade privacy for accuracy by choosing a suitable sketch dimension mm.

I-B Notation

We use hats x^\hat{x} to denote the estimator for a single sketch and bars x¯\bar{x} to denote the averaged estimator x¯:=1qk=1qx^k\bar{x}:=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k}, where x^k\hat{x}_{k} is the estimator for the kk’th worker node. Stars are used to denote the optimal solution xx^{*}. We use f()f(\cdot) to denote the objective of the optimization problem being considered at that point in the text. The letter ϵ\epsilon is used for error while ε\varepsilon is used as the differential privacy parameter. The notation σmin()\sigma_{\min}(\cdot) denotes the smallest nonzero singular value of its argument. O()O(\cdot) is used for big O notation.

All the expectations in the paper are with respect to the randomness over the sketching matrices, and no randomness assumptions are made for the data. We use Sm×nS\in\mathbb{R}^{m\times n} to denote random sketching matrices and we often refer to mm as the sketch size. For non-iterative distributed algorithms, we use Skm×nS_{k}\in\mathbb{R}^{m\times n} for the sketching matrix used by worker node kk. For iterative algorithms, St,km×nS_{t,k}\in\mathbb{R}^{m\times n} is used to denote the sketching matrix used by worker kk in iteration tt. We assume the sketching matrices are scaled such that 𝔼[STS]=In\operatorname{\mathbb{E}}[S^{T}S]=I_{n}. We omit the subscripts in SkS_{k} and St,kS_{t,k} for simplicity whenever it does not cause confusion. The notation mdm\gtrsim d is used to denote that there exists a positive finite constant cc such that mcdm\geq cd. We denote the thin SVD decomposition of the data matrix as A=UΣVTA=U\Sigma V^{T}.

For regularized problems, we use λ1\lambda_{1} for the regularization coefficient of the original problem, and λ2\lambda_{2} for the regularization coefficient of the sketched problem. For instance, in the case of the regularized least squares problem, we have

x\displaystyle x^{*} =argminxAxb22+λ1x22,\displaystyle=\arg\min_{x}\|Ax-b\|_{2}^{2}+\lambda_{1}\|x\|_{2}^{2}\,,
x^\displaystyle\hat{x} =argminxSAxSb22+λ2x22.\displaystyle=\arg\min_{x}\|SAx-Sb\|_{2}^{2}+\lambda_{2}\|x\|_{2}^{2}\,. (1)

We will derive expressions in the sequel for the optimal selection of the coefficient λ2\lambda_{2} which we denote as λ2\lambda_{2}^{*}.
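To make the distinction between λ1\lambda_{1} and λ2\lambda_{2} concrete, the following minimal numpy sketch (with synthetic data and dimensions of our own choosing) solves the original ridge problem and a single Gaussian-sketched sub-problem in closed form; setting lam2 equal to lam1 reproduces the naive choice whose bias we correct in Section II-F.

  import numpy as np

  def ridge(A, b, lam):
      # Closed-form ridge solution: (A^T A + lam I)^{-1} A^T b.
      d = A.shape[1]
      return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

  rng = np.random.default_rng(0)
  n, d, m = 1000, 20, 100
  A = rng.standard_normal((n, d))
  b = rng.standard_normal(n)
  lam1 = 5.0
  lam2 = lam1                                   # naive choice; see (25) for the corrected value

  x_star = ridge(A, b, lam1)                    # original problem
  S = rng.standard_normal((m, n)) / np.sqrt(m)  # Gaussian sketch with E[S^T S] = I_n
  x_hat = ridge(S @ A, S @ b, lam2)             # sketched sub-problem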

I-C Related Work

Random projections are a popular way of performing randomized dimensionality reduction, which are widely used in many computational and learning problems [14, 15, 16, 17]. Many works have studied randomized sketching methods for least squares and other optimization problems [18, 19, 2, 20, 21, 3, 22].

Sarlós [23] showed that the relative error of the single Gaussian sketch estimator in (I-B) in the unregularized case λ1=λ2=0\lambda_{1}=\lambda_{2}=0 is bounded as f(x^)/f(x)(1+Cdlog(d)/m)2f(\hat{x})/f(x^{*})\leq(1+Cd\log(d)/m)^{2} with probability at least 1/31/3, where CC is a constant. The relative error was improved to f(x^)/f(x)(1+Cd/m)f(\hat{x})/f(x^{*})\leq(1+Cd/m) with exponentially small failure probability in subsequent work [7]. In contrast, we derive the expectation of the error exactly and show exponential concentration around this expected value in this paper. As a result, we obtain a tight upper and lower bound for the relative error that holds with exponentially high probability (see Theorem II.8).

The work [3] investigates distributed sketching and averaging for regression from optimization and statistical perspectives. The most relevant result in [3], using our notation, can be stated as follows. Setting the sketch dimension m=O(μd(logd)/ϵ)m=O(\mu d(\log d)/\epsilon) for uniform sampling, where μ\mu is row coherence μ:=n/rank(A)maxii\mu:=n/\operatorname{rank}(A)\max_{i}\ell_{i} and i\ell_{i} is the ii’th row leverage score, and m=O~(d/ϵ)m=\tilde{O}(d/\epsilon) (with O~\tilde{O} hiding logarithmic factors) for other sketches, the inequality f(x¯)f(x)(ϵ/q+ϵ2)f(x)f(\bar{x})-f(x^{*})\leq(\epsilon/q+\epsilon^{2})f(x^{*}) holds with high probability. According to this result, for large qq, the cost at the averaged solution f(x¯)f(\bar{x}) will be upper bounded by ϵ2f(x)\epsilon^{2}f(x^{*}). In this work we prove that for Gaussian sketch, f(x¯)f(\bar{x}) will converge to the optimal cost f(x)f(x^{*}) as qq increases. We also identify the exact expected error for a given number of workers qq. In addition, our results on regularized least squares regression improve on the results in [3] for averaging multiple sketched solutions. In particular, in [3], the sketched sub-problems use the same regularization parameter as the original problem, which leads to biased solutions. We analyze the bias of the averaged solution, and provide explicit formulas for selecting the regularization parameter of the sketched sub-problems to achieve unbiasedness.

We show that the expected difference between the costs of the averaged solution and the optimal solution has two components, namely variance and squared bias (see Lemma IV.1). This result implies that for the Gaussian sketch, which we prove to be unbiased (see Lemma II.1), the number of workers required for a target error ϵ\epsilon scales as 1/ϵ1/\epsilon. Remarkably, this result does not involve condition numbers. In contrast, for the asynchronous Hogwild algorithm [9], the number of iterations required for error ϵ\epsilon scales with log(1/ϵ)/ϵ\log(1/\epsilon)/\epsilon and also depends on the condition number of the input data, although it addresses a more general class of problems.

One of the crucial results of our work is the development of bias correction for averaging sketched solutions. For instance, the setting considered in [22] is based on a distributed second order optimization method that involves averaging approximate update directions. However, in that work, the bias was not taken into account, which degrades accuracy and limits applicability to massively parallel computation. We additionally provide results for the unregularized case, which corresponds to the distributed version of the Newton sketch algorithm introduced in [21].

I-D Overview of Our Contributions

  • We characterize the exact expected error of the averaged estimator for Gaussian sketch in closed form as

    𝔼[f(x¯)]f(x)f(x)=1qdmd1\displaystyle\frac{\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*})}{f(x^{*})}=\frac{1}{q}\frac{d}{m-d-1} (2)

    where x¯=1qk=1qx^k\bar{x}=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k} is the averaged solution and x^k=argminxSkAxSkb22\hat{x}_{k}=\arg\min_{x}\|S_{k}Ax-S_{k}b\|^{2}_{2} is the output of the kk’th worker node. In addition, we obtain a similar result for the error of the averaged estimator for the least-norm solution in the underdetermined case n<dn<d.

  • For both the single sketch and averaged estimator, we show that the relative error f(x¯)f(x)f(x)\frac{f(\bar{x})-f(x^{*})}{f(x^{*})} is concentrated around its expectation given in (2) with exponentially high probability. This provides both an upper and lower bound on the performance of sketching, unlike previous results in the literature. Moreover, it offers a guaranteed recipe to set the sketch size mm and number of workers qq.

  • We show that for Gaussian distributed sketch, the expected error of the averaged estimator 𝔼[f(x¯)]f(x)\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*}) matches the error lower bound for any unbiased estimator obtained via Fisher information. In addition, we provide a lower bound for general, possibly biased sketching based estimators.

  • We consider the privacy preserving properties of distributed sketching methods, in which only random projections of the data matrix AA need to be shared with worker nodes. We derive conditions for (ε,δ)(\varepsilon,\delta)-differential privacy when SS is the i.i.d. Gaussian sketch matrix. Combined with our results, we show that the approximation error of this distributed privacy preserving regression algorithm scales as O(1/ε2)O(1/\varepsilon^{2}).

  • We analyze the convergence rate of a distributed version of the iterative Hessian sketch algorithm which was introduced in [4] and show that the number of iterations required to reach error ϵ\epsilon with qq workers scales with log(1/ϵ)/log(q)\log(1/\epsilon)/\log(q).

  • We show that x^=argminxSAxSb22+λ2x22\hat{x}=\arg\min_{x}\|SAx-Sb\|_{2}^{2}+\lambda_{2}\|x\|_{2}^{2} is not an unbiased estimator of the optimal solution, i.e. 𝔼[A(x^x)]0\operatorname{\mathbb{E}}[A(\hat{x}-x^{*})]\neq 0 when λ2=λ1\lambda_{2}=\lambda_{1} and provide a closed form expression for the regularization coefficient λ2\lambda_{2}^{*} to make the estimator unbiased under certain assumptions.

  • In addition to Gaussian sketch, we derive bias bounds for uniform sampling, randomized Hadamard sketch and leverage score sampling. Analysis of the bias term is critical in understanding how close to the optimal solution we can hope to get and establishing the dependence on the sketch dimension mm. Moreover, we utilize the derived bias bounds and find an upper bound on the error of the averaged estimator for these other sketching methods.

  • We discuss averaging for the distributed version of Newton Sketch [21] and show that using the same regularization coefficient as the original problem, i.e., λ2=λ1\lambda_{2}=\lambda_{1}, as most works in the literature consider, is not optimal. Furthermore, we derive an expression for choosing the regularization coefficient λ2\lambda_{2}^{*} for unbiasedness.

  • We provide numerical simulations that illustrate the practicality and scalability of distributed sketching methods on the serverless cloud computing platform AWS Lambda.

I-E Preliminaries on Sketching Matrices

We consider various sketching matrices in this work including Gaussian sketch, uniform sampling, randomized Hadamard sketch, Sparse Johnson-Lindenstrauss Transform (SJLT), and hybrid sketch. We now briefly describe each of these sketching methods:

  1. 1.

    Gaussian sketch: Entries of Sm×nS\in\mathbb{R}^{m\times n} are i.i.d. and sampled from the Gaussian distribution. Sketching a matrix An×dA\in\mathbb{R}^{n\times d} using Gaussian sketch requires computing the matrix product SASA which has computational complexity equal to O(mnd)O(mnd).

  2. 2.

    Randomized Hadamard sketch: The sketch matrix in this case can be represented as S=PHDS=PHD where Pm×nP\in\mathbb{R}^{m\times n} is for uniform sampling of mm rows out of nn rows, Hn×nH\in\mathbb{R}^{n\times n} is the Hadamard matrix, and Dn×nD\in\mathbb{R}^{n\times n} is a diagonal matrix with diagonal entries sampled randomly from the Rademacher distribution. Multiplication by DD to obtain DADA requires O(nd)O(nd) scalar multiplications. Hadamard transform can be implemented as a fast transform with complexity O(nlog(n))O(n\log(n)) per column, and a total complexity of O(ndlog(n))O(nd\log(n)) to sketch all dd columns of DADA.

  3. 3.

    Uniform sampling: Uniform sampling randomly selects mm rows out of the nn rows of AA where the probability of any row being selected is the same.

  4. 4.

    Leverage score sampling: Row leverage scores of a matrix AA are given by i=u~i22\ell_{i}=\|\tilde{u}_{i}\|_{2}^{2} for i=1,,ni=1,\dots,n where u~id\tilde{u}_{i}\in\mathbb{R}^{d} denotes the ii’th row of UU. The matrix UU is the matrix whose columns are the left singular vectors of AA, i.e., A=UΣVTA=U\Sigma V^{T}. There is only one nonzero element in every row sins_{i}\in\mathbb{R}^{n} of the sketching matrix SS and the probability that the jj’th entry of sis_{i} is nonzero is proportional to the leverage score j\ell_{j}. More precisely, the rows s1,,sms_{1},\dots,s_{m} are sampled i.i.d. such that [si=ek/mpk]=pk,i,k\mathbb{P}[s_{i}=e_{k}/\sqrt{mp_{k}}]=p_{k},\,\forall i,\,\forall k where pk=kj=1njp_{k}=\frac{\ell_{k}}{\sum_{j=1}^{n}\ell_{j}}. Note that we have 𝔼[STS]=𝔼[i=1msisiT]=mk=1npkekekT/(mpk)=In\operatorname{\mathbb{E}}[S^{T}S]=\operatorname{\mathbb{E}}[\sum_{i=1}^{m}s_{i}s_{i}^{T}]=m\sum_{k=1}^{n}p_{k}e_{k}e_{k}^{T}/(mp_{k})=I_{n}.

  5. 5.

    Sparse Johnson-Lindenstrauss Transform (SJLT) [24]: The sketching matrix for SJLT is a sparse matrix where each column has exactly ss nonzero entries and the columns are independently distributed. The nonzero entries are sampled from the Rademacher distribution. It takes O(snd/m)O(snd/m) addition operations to sketch a data matrix using SJLT.

  6. 6.

    Hybrid sketch: The method that we refer to as hybrid sketch is a sequential application of two different sketching methods. In particular, it might be computationally feasible for worker nodes to sample as much data as possible, say mm^{\prime} rows, and then reduce the dimension of the available data to the final sketch dimension mm using another sketch with better error properties than uniform sampling such as Gaussian sketch or SJLT. For instance, a hybrid sketch of uniform sampling followed by Gaussian sketch would have computational complexity O(mmd)O(mm^{\prime}d).
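The sketching constructions listed above can be written down compactly in code. The following is a minimal numpy/scipy sketch (function names are ours, and the scalings are chosen so that \operatorname{\mathbb{E}}[S^{T}S]=I_{n} as assumed in Section I-B); it is an illustration rather than an optimized implementation, and the fast Hadamard transform is replaced by an explicit Hadamard matrix, which requires nn to be a power of two.

  import numpy as np
  from scipy.linalg import hadamard

  def gaussian_sketch(m, n, rng):
      # Entries are i.i.d. N(0, 1/m).
      return rng.standard_normal((m, n)) / np.sqrt(m)

  def randomized_hadamard_sketch(m, n, rng):
      # S = P H D up to scaling; n must be a power of two for scipy's hadamard().
      D = rng.choice([-1.0, 1.0], size=n)
      HD = hadamard(n) * D                       # right-multiplication by diag(D)
      rows = rng.integers(0, n, size=m)          # uniform row sampling
      return HD[rows] / np.sqrt(m)

  def uniform_sampling_sketch(m, n, rng):
      rows = rng.integers(0, n, size=m)          # sampling with replacement
      S = np.zeros((m, n))
      S[np.arange(m), rows] = np.sqrt(n / m)
      return S

  def leverage_score_sketch(m, A, rng):
      n = A.shape[0]
      U, _, _ = np.linalg.svd(A, full_matrices=False)
      lev = np.sum(U ** 2, axis=1)               # row leverage scores ||u_i||_2^2
      p = lev / lev.sum()
      rows = rng.choice(n, size=m, p=p)
      S = np.zeros((m, n))
      S[np.arange(m), rows] = 1.0 / np.sqrt(m * p[rows])
      return S

  def sjlt_sketch(m, n, s, rng):
      # Each column has exactly s nonzeros with values +/- 1/sqrt(s).
      S = np.zeros((m, n))
      for j in range(n):
          rows = rng.choice(m, size=s, replace=False)
          S[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
      return S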

I-F Paper Organization

Section II deals with the application of distributed sketching to quadratic problems for Gaussian sketch. Section III deals with the privacy preserving property of distributed sketching. Section IV gives theoretical results for randomized Hadamard sketch, uniform sampling, and leverage score sampling. The tools developed in Section II are applied to second order optimization for unconstrained convex problems in Section V. Section VI provides an overview of applications and example problems where distributed sketching methods could be applied. Section VII presents numerical results and Section VIII concludes the main part of the paper.

We give the proofs for the majority of the lemmas and theorems in the Appendix along with additional numerical results.

II Distributed Sketching for Quadratic Optimization problems

In this section, we focus on quadratic optimization problems of the form

x=argminxAxb22+λ1x22.\displaystyle x^{*}=\arg\min_{x}\|Ax-b\|_{2}^{2}+\lambda_{1}\|x\|_{2}^{2}\,. (3)

We study various distributed algorithms for solving problems of this form based on model averaging. Some of these algorithms are tailored for the unregularized case λ1=0\lambda_{1}=0.

II-A Closed Form Expressions for Gaussian Sketch

We begin with a non-iterative model averaging based algorithm, which we refer to as distributed randomized regression. This algorithm is for the unregularized case λ1=0\lambda_{1}=0. Each of the worker nodes computes an approximate solution x^k\hat{x}_{k} and these are averaged at the master node to compute the final solution x¯\bar{x}. This method is outlined in Algorithm 1. The theoretical analysis in this section assumes that the sketch matrix is Gaussian, which is generalized to other sketching matrices in Section IV. Although computing the Gaussian sketch is not as efficient as other fast sketches such as the randomized Hadamard sketch, it has several significant advantages over other sketches: (1) exact relative error expressions can be derived as we show in this section, (2) the solution is unbiased, (3) a differential privacy bound can be provided as we show in Section III, and (4) computation of the sketch can be trivially parallelized.

  Input: Data matrix An×dA\in\mathbb{R}^{n\times d}, target vector bnb\in\mathbb{R}^{n}.
  for worker k=1,,qk=1,\dots,q in parallel do
     Obtain the sketched data and sketched output: SkAS_{k}A and SkbS_{k}b.
     Solve x^k=argminxSkAxSkb22\hat{x}_{k}=\arg\min_{x}\|S_{k}Ax-S_{k}b\|^{2}_{2} and send x^k\hat{x}_{k} to the master node.
  end for
  Master node: return x¯=1qk=1qx^k\bar{x}=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k}.
Algorithm 1 Distributed Randomized Regression
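A minimal serial simulation of Algorithm 1 in numpy is given below; this is a sketch under the assumption of Gaussian SkS_{k}, and in an actual deployment each loop iteration would run on a separate worker, with only the averaging performed at the master node.

  import numpy as np

  def distributed_randomized_regression(A, b, m, q, rng):
      # Each "worker" solves an independently sketched least squares problem.
      n, d = A.shape
      solutions = []
      for _ in range(q):
          S = rng.standard_normal((m, n)) / np.sqrt(m)       # Gaussian sketch
          x_k, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
          solutions.append(x_k)
      return np.mean(solutions, axis=0)                      # master node averages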

We first obtain a characterization of the expected error for the single sketch estimator in Lemma II.1.

Lemma II.1

For the Gaussian sketch with m>d+1m>d+1, the estimator x^\hat{x} satisfies

𝔼[A(x^x)22]=𝔼[f(x^)]f(x)=f(x)dmd1,\displaystyle\operatorname{\mathbb{E}}[\|A(\hat{x}-x^{*})\|_{2}^{2}]=\operatorname{\mathbb{E}}[f(\hat{x})]-f(x^{*})=f(x^{*})\frac{d}{m-d-1}\,, (4)

where f(x):=Axb22f(x):=\|Ax-b\|_{2}^{2} and x=argminxf(x)x^{*}=\arg\min_{x}f(x).

Proof:

Suppose that the matrix AA is full column rank. Then, for mdm\geq d, the matrix ATSTSAA^{T}S^{T}SA follows a Wishart distribution, and is invertible with probability one. Conditioned on the invertibility of ATSTSAA^{T}S^{T}SA, we have

x^\displaystyle\hat{x} =(ATSTSA)1ATSTSb\displaystyle=(A^{T}S^{T}SA)^{-1}A^{T}S^{T}Sb
=(ATSTSA)1ATSTS(Ax+b)\displaystyle=(A^{T}S^{T}SA)^{-1}A^{T}S^{T}S(Ax^{*}+b^{\perp})
=x+(ATSTSA)1ATSTSb,\displaystyle=x^{*}+(A^{T}S^{T}SA)^{-1}A^{T}S^{T}Sb^{\perp}\,,

where we have defined b:=bAxb^{\perp}:=b-Ax^{*} . Note that SASA and SbSb^{\perp} are independent since they are Gaussian and uncorrelated as a result of the normal equations ATb=0A^{T}b^{\perp}=0. Conditioned on the realization of the matrix SASA and the event ATSTSA0A^{T}S^{T}SA\succ 0, a simple covariance calculation shows that

x^𝒩(x,1mf(x)(ATSTSA)1).\displaystyle\hat{x}\sim\mathcal{N}\big{(}x^{*},\frac{1}{m}f(x^{*})(A^{T}S^{T}SA)^{-1}\big{)}\,. (5)

Multiplying with the data matrix AA on the left yields the distribution of the prediction error, conditioned on SASA, as

A(x^x)𝒩(0,1mf(x)A(ATSTSA)1AT).\displaystyle A(\hat{x}-x^{*})\sim\mathcal{N}\big{(}0,\frac{1}{m}f(x^{*})A(A^{T}S^{T}SA)^{-1}A^{T}\big{)}\,. (6)

Then we can compute the conditional expectation of the squared norm of the error

𝔼[A(x^x)22|SA]=f(x)mtr(A(ATSTSA)1AT).\operatorname{\mathbb{E}}[\|A(\hat{x}-x^{*})\|_{2}^{2}~{}\,\big{|}SA]=\frac{f(x^{*})}{m}\operatorname{tr}(A(A^{T}S^{T}SA)^{-1}A^{T})\,.

Next we recall that the expected inverse of the Wishart matrix ATSTSAA^{T}S^{T}SA satisfies (see, e.g., [25])

𝔼[(ATSTSA)1]=(ATA)1mmd1.\displaystyle\operatorname{\mathbb{E}}[(A^{T}S^{T}SA)^{-1}]=(A^{T}A)^{-1}\frac{m}{m-d-1}\,.

Plugging in the previous result and using the tower property of expectations and then noting that tr(A(ATA)1AT)=d\operatorname{tr}(A(A^{T}A)^{-1}A^{T})=d give us the claimed result. ∎

To the best of our knowledge, this result of exact error characterization is novel in the theory of sketching. Tools similar to those used in this proof appear in [26]; however, the exact context in which they are used and the end result are different from this work. Furthermore, existing results (see e.g. [6, 3, 16]) characterize a high probability upper bound on the error, whereas the above is a sharp and exact formula for the expected squared norm of the error. Theorem II.2 builds on Lemma II.1 to characterize the expected error for the averaged solution x¯\bar{x}.

Theorem II.2 (Expected error of the averaging estimator)

Let SkS_{k}, k=1,,qk=1,\dots,q be Gaussian sketching matrices, then Algorithm 1 runs in time O(md2)O(md^{2}), and the error of the averaged solution x¯\bar{x} satisfies

𝔼[f(x¯)]f(x)f(x)=1qdmd1.\displaystyle\frac{\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*})}{f(x^{*})}=\frac{1}{q}\frac{d}{m-d-1}\,. (7)

Consequently, Markov’s inequality implies that f(x¯)f(x)f(x)ϵ\frac{f(\bar{x})-f(x^{*})}{f(x^{*})}\leq\epsilon holds with probability at least (11qϵdmd1)\left(1-\frac{1}{q\epsilon}\frac{d}{m-d-1}\right) for any ϵ>0\epsilon>0.

Theorem II.2 illustrates that the expected error scales as 1/q1/q, and converges to zero as qq\rightarrow\infty as long as md+2m\geq d+2. This is due to the unbiasedness of the Gaussian sketch. Other sketching methods such as uniform sampling or randomized Hadamard sketch do not have this property, as we further investigate in Section IV.
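The exact formula (7) can be checked numerically. The following Monte Carlo snippet (with synthetic data of our choosing) compares the empirical relative error of the averaged estimator against \frac{1}{q}\frac{d}{m-d-1}.

  import numpy as np

  rng = np.random.default_rng(1)
  n, d, m, q, trials = 512, 20, 100, 4, 500
  A = rng.standard_normal((n, d))
  b = rng.standard_normal(n)
  x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
  f_star = np.sum((A @ x_star - b) ** 2)

  errors = []
  for _ in range(trials):
      xs = []
      for _ in range(q):
          S = rng.standard_normal((m, n)) / np.sqrt(m)
          x_k, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
          xs.append(x_k)
      x_bar = np.mean(xs, axis=0)
      errors.append((np.sum((A @ x_bar - b) ** 2) - f_star) / f_star)

  print("empirical mean relative error:", np.mean(errors))
  print("theoretical value d/(q(m-d-1)):", d / (q * (m - d - 1)))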

Remark II.3

In Algorithm 1, the worker nodes are tasked with obtaining the sketched data SkAS_{k}A and the sketched output SkbS_{k}b. We identify two options for this step:

  • Option 1: The master node computes the sketched data SkA,SkbS_{k}A,S_{k}b, k=1,,qk=1,\dots,q and transmits to worker nodes. This option preserves data privacy, which we discuss in Section III.

  • Option 2: The worker nodes have access to the data A,bA,b and compute the sketched data SkAS_{k}A and SkbS_{k}b. This option does not preserve privacy, however, parallelizes the computation of the sketch via the workers.

II-B Exponential Concentration of the Gaussian Sketch Estimator

In this subsection, we show high probability concentration bounds for the error of the Gaussian sketch. Importantly, our result provides both upper and lower bounds on the relative error with exponentially small failure probability. We begin with the single sketch estimator x^\hat{x} and then extend the results to the averaged estimator x¯=1qk=1qx^k\bar{x}=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k}.

The next theorem states the main concentration result for the single sketch estimator. Both upper and lower bounds are given for the ratio of the cost of the sketched solution to the optimal solution. The full proof is given in the Appendix.

Theorem II.4 (Concentration bound for the single sketch estimator x^\hat{x})

Suppose that the sketch size is such that mdm\gtrsim d. Then, the optimality ratio of x^\hat{x} with respect to the optimal solution xx^{*} is concentrated around its mean as

P\displaystyle P (|f(x^)f(x)1dmd1|<ϵ)1C1eC2ϵ4m\displaystyle\bigg{(}\Big{|}\frac{f(\hat{x})}{f(x^{*})}-1-\frac{d}{m-d-1}\Big{|}<\epsilon\bigg{)}\geq 1-C_{1}e^{-C_{2}\epsilon^{4}m} (8)

where C1,C2C_{1},C_{2} are positive constants.

Proof sketch: Our proof technique involves the concentration of Gaussian quadratic forms for SbSb conditioned on SASA. We leverage that the error A(x^x)22\|A(\hat{x}-x^{*})\|_{2}^{2} is a Gaussian quadratic when conditioned on SASA as shown in (6). In particular, we use the concentration result given in Lemma II.5 for quadratic form of Gaussian random variables.

Lemma II.5 (Concentration of Gaussian quadratic forms, [27])

Let the entries of zmz\in\mathbb{R}^{m} be distributed as i.i.d. 𝒩(0,1)\mathcal{N}(0,1). For any Gm×mG\in\mathbb{R}^{m\times m} and ϵ>0\epsilon>0,

P(zTGz𝔼[zTGz]>2GFϵ+2G2ϵ)eϵ.\displaystyle P\big{(}z^{T}Gz-\operatorname{\mathbb{E}}[z^{T}Gz]>2\|G\|_{F}\sqrt{\epsilon}+2\|G\|_{2}\epsilon\big{)}\leq e^{-\epsilon}\,. (9)

Then we note that after applying Lemma II.5, the conditional expectation of the error is given by 𝔼[A(x^x)22|SA]=f(x)mtr((UTSTSU)1)\operatorname{\mathbb{E}}[\|A(\hat{x}-x^{*})\|_{2}^{2}\,|SA]=\frac{f(x^{*})}{m}\operatorname{tr}((U^{T}S^{T}SU)^{-1}). Next, we focus on the trace term tr((UTSTSU)1)\operatorname{tr}((U^{T}S^{T}SU)^{-1}) and relate it to the Stieltjes transform.

Definition II.6 (Stieltjes Transform)

We define the Stieltjes Transform for a random rectangular matrix Sm×dS\in\mathbb{R}^{m\times d} such that dmd\leq m as

mS(z):\displaystyle m_{S}(z): =1dtr(STSzI)1\displaystyle=\frac{1}{d}\operatorname{tr}(S^{T}S-zI)^{-1} (10)
=1di=1d1λi(STS)z.\displaystyle=\frac{1}{d}\sum_{i=1}^{d}\frac{1}{\lambda_{i}(S^{T}S)-z}\,. (11)

Here λi(STS)\lambda_{i}(S^{T}S) denotes the ii’th eigenvalue of the symmetric matrix STSS^{T}S.

We derive a high probability bound for the trace term around its expectation by leveraging the concentration of the empirical Stieltjes transform to its expectation, which is given in Lemma II.7 below.

Lemma II.7

The trace tr((UTSTSU)1)\operatorname{tr}((U^{T}S^{T}SU)^{-1}) is concentrated around its mean with high probability as follows

P(|tr((UTSTSU)1)𝔼[tr((UTSTSU)1)]|ϵ)14eϵ4(1d/mδ)8210md2emδ2/2,\displaystyle P\Big{(}|\operatorname{tr}((U^{T}S^{T}SU)^{-1})-\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]|\leq\epsilon\Big{)}\geq 1-4e^{-\frac{\epsilon^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}-e^{-m\delta^{2}/2}\,, (12)

for any ϵ,δ>0\epsilon,\delta>0.

The proof of Lemma II.7 is provided in the Appendix. Combining the concentration of the Stieltjes transform with the concentration of Gaussian quadratic forms completes the proof of Theorem II.4.

Next, we show that the relative error of the averaged estimator is also concentrated with exponentially small failure probability.

Theorem II.8 (Concentration bound for the averaged estimator x¯\bar{x})

Let the sketch size satisfy mdm\gtrsim d. Then, the ratio of the cost for the averaged estimator x¯=1qk=1qx^k\bar{x}=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k} to the optimal cost is concentrated around its mean as follows

P(|f(x¯)f(x)11qdmd1|<ϵ)1qC1eC2(qϵ)4m\displaystyle P\bigg{(}\Big{|}\frac{f(\bar{x})}{f(x^{*})}-1-\frac{1}{q}\frac{d}{m-d-1}\Big{|}<\epsilon\bigg{)}\geq 1-qC_{1}e^{-C_{2}(q\epsilon)^{4}m} (13)

where C1,C2C_{1},C_{2} are positive constants.

The above result shows that the relative error of the averaged estimator using distributed sketching concentrates considerably faster compared to a single sketch. Moreover, the above bound offers a method to choose the values of the sketch size mm and number of workers qq to achieve a desired relative error with exponentially small error probability.


Figure 1: Histograms of the optimality ratio f(x¯)/f(x)f(\bar{x})/f(x^{*}) for both Gaussian sketch and randomized Hadamard sketch. The vertical lines indicate the empirical mean of the error. The dataset is synthetically generated with n=512,d=20n=512,d=20 and the sketch size is m=100m=100. The histogram is calculated over 25002500 independent trials. The plot at the top shows the performance of the single sketch estimator, and the middle and bottom plots are for the averaged estimator with q=4q=4 and q=10q=10 workers, respectively.

Figure 1 shows the histograms for the single sketch (q=1q=1) and the averaged estimator (q=4q=4 and q=10q=10) for Gaussian and randomized Hadamard sketches. This experiment demonstrates that the optimality ratio f(x¯)/f(x)f(\bar{x})/f(x^{*}) is concentrated around its mean, and that both the mean and the concentration improve as the number of workers qq increases.

II-C Error Lower Bounds via Fisher Information

We now present an error lower bound result for the Gaussian sketch. We first consider single sketch estimators and then discuss the distributed sketching case. Two different lower bounds can be obtained depending on whether the estimator is restricted to be unbiased or not. Lemma II.9 and II.10 provide error lower bounds for all unbiased and general (i.e., possibly biased) estimators, respectively. The proof of Lemma II.9 is based on Fisher information and Cramér-Rao lower bound while Lemma II.10 additionally employs the van Trees inequality. We refer the reader to [28] for details on these single sketch lower bounds.

Lemma II.9 (Unbiased estimators, [28])

For any single sketch unbiased estimator x^\hat{x} obtained from the Gaussian sketched data SASA and SbSb, the expected error is lower bounded as follows

𝔼[f(x^)]f(x)f(x)dmd1.\displaystyle\operatorname{\mathbb{E}}[f(\hat{x})]-f(x^{*})\geq f(x^{*})\frac{d}{m-d-1}. (14)

We note that the expected error of the single sketch estimator that we provide in Lemma II.1 exactly matches the error lower bound in Lemma II.9. Therefore, we conclude that no other unbiased estimator based on a single sketch SASA, SbSb can achieve a better expected relative error.

Lemma II.10 (General estimators, [28])

For any single sketch estimator x^\hat{x}, which is possibly biased, obtained from the Gaussian sketched data SASA and SbSb, the expected error is lower bounded as follows

𝔼[f(x^)]f(x)f(x)dm.\displaystyle\operatorname{\mathbb{E}}[f(\hat{x})]-f(x^{*})\geq f(x^{*})\frac{d}{m}. (15)

We now provide a novel generalized lower bound that applies to averaged estimators x¯\bar{x} obtained via distributed sketching. Theorem II.11 states the main lower bound result for the averaged estimator.

Theorem II.11

For any averaged estimator x¯=1qk=1qx^k\bar{x}=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k}, where each x^k\hat{x}_{k} is based on Gaussian sketched data SkA,SkbS_{k}A,S_{k}b, k=1,,qk=1,\dots,q, the expected error is lower bounded as follows

  1. (i)

    for unbiased estimators satisfying 𝔼[x^k]=x\operatorname{\mathbb{E}}[\hat{x}_{k}]=x^{*}

    𝔼[f(x¯)]f(x)f(x)qdmd1.\displaystyle\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*})\geq\frac{f(x^{*})}{q}\frac{d}{m-d-1}\,. (16)
  2. (ii)

    for general (i.e., possibly biased) estimators

    𝔼[f(x¯)]f(x)f(x)qdm.\displaystyle\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*})\geq\frac{f(x^{*})}{q}\frac{d}{m}\,. (17)

We note that the expected error of the averaged estimator given in Theorem II.2 matches exactly the lower bound in Theorem II.11 for unbiased estimators. For general estimators, we observe that the lower bound matches the upper bound for large mm, e.g., when md1=O(m)m-d-1=O(m).

II-D Distributed Iterative Hessian Sketch

In this section, we consider an iterative algorithm for solving the unregularized least squares problem in (3) (i.e., λ1=0\lambda_{1}=0) with higher accuracy. Note that Newton’s method terminates in one step when applied to this problem since the Hessian is ATAA^{T}A and

xt+1=xtμ(ATA)1AT(Axtb),\displaystyle x_{t+1}=x_{t}-\mu(A^{T}A)^{-1}A^{T}(Ax_{t}-b)\,, (18)

reduces to directly solving the normal equations (ATA)x=ATb(A^{T}A)x=A^{T}b. However, the computational cost of this direct solution is often prohibitive for large scale problems. To remedy this, the method of iterative Hessian sketch introduced in [4] employs a randomly sketched Hessian ATStTStAA^{T}S_{t}^{T}S_{t}A as follows

xt+1=xtμ(ATStTStA)1AT(Axtb),\displaystyle x_{t+1}=x_{t}-\mu(A^{T}S_{t}^{T}S_{t}A)^{-1}A^{T}(Ax_{t}-b),

where StS_{t} corresponds to the sketching matrix at iteration tt. Sketching reduces the row dimension of the data from nn to mm and hence computing an approximate Hessian ATStTStAA^{T}S_{t}^{T}S_{t}A is computationally cheaper than the exact Hessian ATAA^{T}A. Moreover, for regularized problems one can choose mm to be smaller than dd as we investigate in Section II-F.

In a distributed computing setting, one can obtain more accurate update directions by averaging multiple trials, where each worker node computes an independent estimate of the update direction. These approximate update directions can be averaged at the master node and the following update takes place

xt+1=xtμ1qk=1q(ATSt,kTSt,kA)1AT(Axtb).\displaystyle x_{t+1}=x_{t}-\mu\frac{1}{q}\sum_{k=1}^{q}(A^{T}S_{t,k}^{T}S_{t,k}A)^{-1}A^{T}(Ax_{t}-b). (19)

Here St,kS_{t,k} is the sketching matrix for the kk’th node at iteration tt. The details of the distributed IHS algorithm are given in Algorithm 2. We note that although the update equation involves a matrix inverse term, in practice, this can be replaced with an approximate linear system solver. In particular, it might be computationally more efficient for worker nodes to compute their approximate update directions using indirect methods such as conjugate gradient.

We note that in Algorithm 2, worker nodes communicate their approximate update directions and not the approximate Hessian matrix, which reduces the communication complexity from O(d2)O(d^{2}) to O(d)O(d) for each worker per iteration.

  Input: Number of iterations TT, step size μ\mu.
  for t=1t=1 to TT do
     for worker k=1,,qk=1,\dots,q in parallel do
        Obtain the sketched data St,kAS_{t,k}A.
        Compute gradient gt=AT(Axtb)g_{t}=A^{T}(Ax_{t}-b).
        Solve Δ^t,k=argminΔ12St,kAΔ22+gtTΔ\hat{\Delta}_{t,k}=\arg\min_{\Delta}\frac{1}{2}\|S_{t,k}A\Delta\|_{2}^{2}+g_{t}^{T}\Delta and send to master node.
     end for
     Master node: Update xt+1=xt+μ1qk=1qΔ^t,kx_{t+1}=x_{t}+\mu\frac{1}{q}\sum_{k=1}^{q}\hat{\Delta}_{t,k} and send xt+1x_{t+1} to worker nodes.
  end for
  return  xTx_{T}
Algorithm 2 Distributed Iterative Hessian Sketch
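A serial numpy simulation of Algorithm 2 with Gaussian sketches is sketched below (again, each inner loop iteration would run on a separate worker in practice; the step size \mu=1/\theta_{1} anticipates Theorem II.13).

  import numpy as np

  def distributed_ihs(A, b, m, q, T, rng):
      n, d = A.shape
      mu = (m - d - 1) / m                          # mu = 1/theta_1
      x = np.zeros(d)
      for _ in range(T):
          g = A.T @ (A @ x - b)                     # gradient at the current iterate
          directions = []
          for _ in range(q):
              S = rng.standard_normal((m, n)) / np.sqrt(m)
              SA = S @ A
              # Worker solves min_Delta 0.5*||S A Delta||^2 + g^T Delta.
              directions.append(-np.linalg.solve(SA.T @ SA, g))
          x = x + mu * np.mean(directions, axis=0)  # master averages and updates
      return x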

We establish the convergence rate for Gaussian sketch in Theorem II.13, which provides an exact characterization of the expected error. First, we give the definition of the error in Definition II.12.

Definition II.12

To quantify the approximation quality of the iterate xtdx_{t}\in\mathbb{R}^{d} with respect to the optimal solution xdx^{*}\in\mathbb{R}^{d}, we define the error as etAA(xtx)e^{A}_{t}\coloneqq A(x_{t}-x^{*}).

To state the result, we first introduce the following moments of the inverse Wishart distribution (see Appendix).

θ1\displaystyle\theta_{1} mmd1,\displaystyle\coloneqq\frac{m}{m-d-1},
θ2\displaystyle\theta_{2} m2(m1)(md)(md1)(md3).\displaystyle\coloneqq\frac{m^{2}(m-1)}{(m-d)(m-d-1)(m-d-3)}. (20)
Theorem II.13 (Expected error decay for Gaussian sketch)

In Algorithm 2, let us set μ=1/θ1\mu=1/\theta_{1} and assume St,kS_{t,k}’s are i.i.d. Gaussian sketches, then the expected squared norm of the error etAe^{A}_{t} evolves according to the following relation:

𝔼[et+1A22]=1q(θ2θ121)etA22.\displaystyle\operatorname{\mathbb{E}}[\|e^{A}_{t+1}\|_{2}^{2}]=\frac{1}{q}\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)\|e^{A}_{t}\|_{2}^{2}.

Next, Corollary II.14 builds on Theorem II.13 to characterize the number of iterations for Algorithm 2 to achieve an error of ϵ\epsilon. The number of iterations required for error ϵ\epsilon scales with log(1/ϵ)/log(q)\log(1/\epsilon)/\log(q).

Corollary II.14

Let St,km×nS_{t,k}\in\mathbb{R}^{m\times n} (t=1,,Tt=1,\dots,T, k=1,,qk=1,\dots,q) be Gaussian sketching matrices. Then, Algorithm 2 outputs xTx_{T} that is ϵ\epsilon-accurate with respect to the initial error in expectation, that is, \frac{\operatorname{\mathbb{E}}[\|e^{A}_{T}\|_{2}^{2}]}{\|Ax^{*}\|_{2}^{2}}=\epsilon where TT is given by

T\displaystyle T =log(1/ϵ)log(q)log(θ2θ121),\displaystyle=\frac{\log(1/\epsilon)}{\log(q)-\log\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)},

where the overall required communication is TqdTqd real numbers, and the computational complexity per worker is

O(Tmnd+Tmd2+Td3).\displaystyle O(Tmnd+Tmd^{2}+Td^{3}).
Remark II.15

Provided that mm is at least 2d2d, the term log(θ2θ121)\log\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right) is negative. Hence, TT is upper-bounded by log(1/ϵ)log(q)\frac{\log(1/\epsilon)}{\log(q)}.
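For concreteness, the iteration count of Corollary II.14 can be evaluated directly from the moments in (20); the helper below (our own naming) does this for given ϵ\epsilon, qq, mm and dd.

  import numpy as np

  def ihs_iteration_count(eps, q, m, d):
      # Number of iterations T from Corollary II.14; requires m > d + 3.
      theta1 = m / (m - d - 1)
      theta2 = m ** 2 * (m - 1) / ((m - d) * (m - d - 1) * (m - d - 3))
      return np.log(1 / eps) / (np.log(q) - np.log(theta2 / theta1 ** 2 - 1))

  # Example: q = 10 workers, sketch size m = 4d.
  print(ihs_iteration_count(eps=1e-6, q=10, m=400, d=100))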

II-E Distributed Sketching for Least-Norm Problems

In this section, we consider the underdetermined case where n<dn<d and apply the sketching matrix from the right, i.e., to the features; we refer to this method as right sketch. Let us define the minimum norm solution

x=\displaystyle x^{*}= argminxx22s.t.Ax=b.\displaystyle\arg\min_{x}\|x\|_{2}^{2}\quad\mbox{s.t.}~{}Ax=b. (21)

The above problem has a closed-form solution given by x=AT(AAT)1bx^{*}=A^{T}(AA^{T})^{-1}b when the matrix AA is full row rank. We will assume that the full row rank condition holds in the sequel. Let us denote the optimal value of the minimum norm objective as f(x)=x22=bT(AAT)1bf(x^{*})=\|x^{*}\|_{2}^{2}=b^{T}(AA^{T})^{-1}b. The kk’th worker node will compute the approximate solution via x^k=SkTz^k\hat{x}_{k}=S_{k}^{T}\hat{z}_{k} where

z^k=\displaystyle\hat{z}_{k}= argminzz22s.t.ASkTz=b,\displaystyle\arg\min_{z}\|z\|_{2}^{2}\quad\mbox{s.t.}~{}AS_{k}^{T}z=b, (22)

where Skm×dS_{k}\in\mathbb{R}^{m\times d} and zmz\in\mathbb{R}^{m}. The averaged solution is then computed at the master node as x¯=1qk=1qx^k\bar{x}=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k}. We will assume that the sketch matrices SkS_{k} are i.i.d. Gaussian in deriving the error expressions. Lemma II.16 establishes the approximation error for a single right sketch estimator.

Lemma II.16

For the Gaussian sketch with sketch size m>n+1m>n+1, the estimator x^\hat{x} satisfies

𝔼[x^x22]=dnmn1f(x).\displaystyle\operatorname{\mathbb{E}}[\|\hat{x}-x^{*}\|_{2}^{2}]=\frac{d-n}{m-n-1}f(x^{*}).
Proof:

Conditioned on ASTAS^{T}, we have

\hat{x}\sim\mathcal{N}\Big{(}x^{*},\frac{1}{m}\|SA^{T}(AS^{T}SA^{T})^{-1}b\|_{2}^{2}\,P_{\mbox{Null}(A)}\Big{)}\,.

Noting that 𝔼[(ASTSAT)1]=(AAT)1mmn1\operatorname{\mathbb{E}}[(AS^{T}SA^{T})^{-1}]=(AA^{T})^{-1}\frac{m}{m-n-1}, taking the expectation and noting that tr(PNull(A))=dn\operatorname{tr}(P_{\mbox{Null}(A)})=d-n, we obtain

𝔼[x^x22]\displaystyle\operatorname{\mathbb{E}}[\|\hat{x}-x^{*}\|_{2}^{2}] =dnmn1bT(AAT)1b=dnmn1f(x).\displaystyle=\frac{d-n}{m-n-1}b^{T}(AA^{T})^{-1}b=\frac{d-n}{m-n-1}f(x^{*}). (23)

An exact formula for averaging multiple outputs in right sketch that parallels Theorem II.2 can be obtained in a similar fashion:

𝔼[x¯x22]\displaystyle\operatorname{\mathbb{E}}[\|\bar{x}-x^{*}\|_{2}^{2}] =1q𝔼[x^kx22]=1qdnmn1f(x).\displaystyle=\frac{1}{q}\operatorname{\mathbb{E}}[\|\hat{x}_{k}-x^{*}\|_{2}^{2}]=\frac{1}{q}\frac{d-n}{m-n-1}f(x^{*}).

Hence, for the distributed Gaussian right sketch, we establish the approximation error as stated in Theorem II.17.

Theorem II.17 (Cost approximation for least-norm problems)

Let SkS_{k}, k=1,,qk=1,\dots,q be Gaussian sketching matrices, then the error of the averaged solution x¯\bar{x} satisfies

𝔼[f(x¯)]f(x)f(x)=1qdnmn1.\displaystyle\frac{\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*})}{f(x^{*})}=\frac{1}{q}\frac{d-n}{m-n-1}\,. (24)

Consequently, Markov’s inequality implies that f(x¯)f(x)f(x)ϵ\frac{f(\bar{x})-f(x^{*})}{f(x^{*})}\leq\epsilon holds with probability at least (11qϵdnmn1)\left(1-\frac{1}{q\epsilon}\frac{d-n}{m-n-1}\right) for any ϵ>0\epsilon>0.
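A minimal numpy simulation of distributed right sketch (assuming Gaussian SkS_{k} and m>n+1m>n+1) is given below; each worker solves the reduced least-norm problem (22) and lifts its solution back via SkTz^kS_{k}^{T}\hat{z}_{k}.

  import numpy as np

  def distributed_right_sketch(A, b, m, q, rng):
      # Averaged right-sketch estimator for min ||x||_2 s.t. Ax = b with n < d.
      n, d = A.shape
      xs = []
      for _ in range(q):
          S = rng.standard_normal((m, d)) / np.sqrt(m)   # sketch applied to the features
          AS_t = A @ S.T
          z_k, *_ = np.linalg.lstsq(AS_t, b, rcond=None) # min-norm solution of (A S^T) z = b
          xs.append(S.T @ z_k)
      return np.mean(xs, axis=0)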

II-F Bias Correction for Regularized Least Squares

We have previously studied distributed randomized regression for unregularized problems and showed that Gaussian sketch leads to unbiased estimators. In this section, we focus on the regularized case and show that using the original regularization coefficient for the sketched problems causes the estimators to be biased. In addition, we provide a bias correction procedure. More precisely, the method described in this section is based on non-iterative averaging for solving the linear least squares problem with 2\ell_{2} regularization, i.e., ridge regression.

Note that we have λ1\lambda_{1} as the regularization coefficient of the original problem and λ2\lambda_{2} for the sketched sub-problems. If λ2\lambda_{2} is chosen to be equal to λ1\lambda_{1}, then this scheme reduces to the framework given in the work of [3] and we show in Theorem II.20 that λ2=λ1\lambda_{2}=\lambda_{1} leads to a biased estimator, which does not converge to the optimal solution.

We first introduce the following results on traces involving random Gaussian matrices which are instrumental in our result.

Lemma II.18 ([29])

For a Gaussian sketching matrix SS, the following asymptotic formula holds

limn𝔼[tr((UTSTSU+λ2I)1)]=d×θ3(d/m,λ2),\displaystyle\lim_{n\rightarrow\infty}\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU+\lambda_{2}I)^{-1})]=d\times\theta_{3}(d/m,\lambda_{2}),

where θ3(d/m,λ2)\theta_{3}(d/m,\lambda_{2}) is defined as

θ3(d/m,λ2):=λ2+d/m1+(λ2+d/m1)2+4λ2d/m2λ2d/m.\displaystyle\theta_{3}(d/m,\lambda_{2}):=\frac{-\lambda_{2}+d/m-1+\sqrt{(-\lambda_{2}+d/m-1)^{2}+4\lambda_{2}d/m}}{2\lambda_{2}d/m}\,.
Lemma II.19

For a Gaussian sketching matrix SS, the following asymptotic formula holds

limn𝔼[(UTSTSU+λ2I)1]=θ3(d/m,λ2)Id,\displaystyle\lim_{n\rightarrow\infty}\operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}]=\theta_{3}(d/m,\lambda_{2})I_{d},

where θ3(d/m,λ2)\theta_{3}(d/m,\lambda_{2}) is as defined in Lemma II.18.

We list the distributed randomized ridge regression method in Algorithm 3. This algorithm assumes that our goal is to solve a large scale regression problem for a given value of regularization coefficient λ1\lambda_{1}. Theorem II.20 states the main result of this subsection. In short, for the averaged result to converge to the optimal solution xx^{*}, the regularization coefficient needs to be modified to λ2\lambda_{2}^{*}.

Theorem II.20

Given the thin SVD decomposition A=UΣVTn×dA=U\Sigma V^{T}\in\mathbb{R}^{n\times d} with ndn\geq d, and assuming AA has full rank and identical singular values (i.e., Σ=σId\Sigma=\sigma I_{d} for some σ>0\sigma>0), there is a value of λ2\lambda_{2} for which the bias of the single sketch estimator, 𝔼[A(x^kx)]\operatorname{\mathbb{E}}[A(\hat{x}_{k}-x^{*})], vanishes as nn tends to infinity if

  • (i)

    m>dm>d   or

  • (ii)

    mdm\leq d and λ1σ2(dm1)\lambda_{1}\geq\sigma^{2}\left(\frac{d}{m}-1\right)

and the value of λ2\lambda_{2} that achieves zero bias is given by

λ2=λ1dmλ11+λ1/σ2,\displaystyle\lambda_{2}^{*}=\lambda_{1}-\frac{d}{m}\frac{\lambda_{1}}{1+\lambda_{1}/\sigma^{2}}, (25)

where the matrix SkS_{k} in x^k=argminxSkAxSkb22+λ2x22\hat{x}_{k}=\arg\min_{x}\|S_{k}Ax-S_{k}b\|_{2}^{2}+\lambda_{2}\|x\|_{2}^{2} is the Gaussian sketch.

  Set σ\sigma to the mean of singular values of AA.
  Calculate λ2=λ1dmλ11+λ1/σ2\lambda_{2}^{*}=\lambda_{1}-\frac{d}{m}\frac{\lambda_{1}}{1+\lambda_{1}/\sigma^{2}}.
  for worker k=1,,qk=1,\dots,q in parallel do
     Obtain the sketched data and sketched output: SkAS_{k}A and SkbS_{k}b.
     Solve x^k=argminxSkAxSkb22+λ2x22\hat{x}_{k}=\arg\min_{x}\|S_{k}Ax-S_{k}b\|_{2}^{2}+\lambda_{2}^{*}\|x\|_{2}^{2} and send x^k\hat{x}_{k} to master node.
  end for
  Master node: return x¯=1qk=1qx^k\bar{x}=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k}.
Algorithm 3 Distributed Randomized Ridge Regression
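A serial numpy sketch of Algorithm 3 is shown below; the regularization coefficient of the sub-problems is set to λ2\lambda_{2}^{*} from (25), with σ\sigma taken as the mean singular value of AA as in the listing above.

  import numpy as np

  def distributed_ridge_regression(A, b, m, q, lam1, rng):
      n, d = A.shape
      sigma = np.mean(np.linalg.svd(A, compute_uv=False))      # heuristic sigma
      lam2 = lam1 - (d / m) * lam1 / (1 + lam1 / sigma ** 2)    # bias-corrected coefficient (25)
      xs = []
      for _ in range(q):
          S = rng.standard_normal((m, n)) / np.sqrt(m)          # Gaussian sketch
          SA, Sb = S @ A, S @ b
          xs.append(np.linalg.solve(SA.T @ SA + lam2 * np.eye(d), SA.T @ Sb))
      return np.mean(xs, axis=0)                                # master node averages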

Figure 2 illustrates the practical implications of Theorem II.20. If λ2\lambda_{2} is chosen according to the formula in (25), then the averaged solution x¯\bar{x} gives a significantly better approximation to xx^{*} than if we had used λ2=λ1\lambda_{2}=\lambda_{1}. The data matrix AA in Figure 2(a) has identical singular values, and 2(b) shows the case where the singular values of AA are not identical. When the singular values of AA are not all equal to each other, we set σ\sigma to the mean of the singular values of AA as a heuristic, which works extremely well as shown in Figure 2(b). According to the formula in (25), the value of λ2\lambda_{2} that we need to use to achieve zero bias is found to be λ2=0.833\lambda_{2}^{*}=0.833 whereas λ1=5\lambda_{1}=5. The plot in Figure 2(b) illustrates that even if the assumption that Σ=σId\Sigma=\sigma I_{d} in Theorem II.20 is violated, the proposed bias corrected averaging method outperforms vanilla averaging in [3] where λ2=λ1\lambda_{2}=\lambda_{1}.


(a) Identical singular values


(b) Non-identical singular values

Figure 2: Plots of x¯x2/x2\|\bar{x}-x^{*}\|_{2}/\|x^{*}\|_{2} against the number of averaged worker outputs for an unconstrained least squares problem with regularization using Algorithm 3. The dashed blue line corresponds to the case where λ2\lambda_{2} is determined according to the formula (25), and the solid red line corresponds to the case where λ2\lambda_{2} is the same as λ1\lambda_{1}. The parameters are as follows: n=1000n=1000, d=100d=100, λ1=5\lambda_{1}=5, m=20m=20, sketch type is Gaussian. (a) All singular values of AA are 11. (b) Singular values of AA are not identical and their mean is 11.
Remark II.21 (Varying sketch sizes)

Let us now consider the scenario where we have different sketch sizes for each worker node. This situation frequently arises in heterogeneous computing environments. Specifically, let us assume that the sketch size for worker kk is mkm_{k}, k=1,,qk=1,\dots,q. It follows from Theorem II.20 that by choosing the regularization parameter for worker node kk as

λ2(k)=λ1dmkλ11+λ1/σ2,\displaystyle\lambda_{2}^{*}(k)=\lambda_{1}-\frac{d}{m_{k}}\frac{\lambda_{1}}{1+\lambda_{1}/\sigma^{2}},

it is possible to obtain unbiased estimators x^k\hat{x}_{k}, k=1,,qk=1,\dots,q and hence an unbiased averaged result x¯\bar{x}. Note that here we assume that the sketch size for each worker satisfies the condition in Theorem II.20, that is, either mk>dm_{k}>d or mkdm_{k}\leq d and λ1σ2(d/mk1)\lambda_{1}\geq\sigma^{2}(d/m_{k}-1).

III Privacy Preserving Property

We now digress from the error and convergence properties to consider the privacy preserving properties of distributed sketching. We use the notion of differential privacy for our privacy result, given in Definition III.1. Differential privacy is a worst case privacy notion that does not require distributional assumptions on the data, and it is the privacy framework adopted by many works in the literature.

Definition III.1 ((ε,δ)(\varepsilon,\delta)-Differential Privacy, [30], [31])

An algorithm ALG which maps (n×d)(n\times d)-matrices into some range \mathcal{R} satisfies (ε,δ)(\varepsilon,\delta)-differential privacy if for all pairs of neighboring inputs AA and AA^{\prime} (i.e. they differ only in a single row) and all subsets 𝒮\mathcal{S}\subset\mathcal{R}, it holds that P(ALG(A)𝒮)eεP(ALG(A)𝒮)+δP(\text{ALG}(A)\in\mathcal{S})\leq e^{\varepsilon}P(\text{ALG}(A^{\prime})\in\mathcal{S})+\delta.

Theorem III.2 characterizes the required conditions and the sketch size to guarantee (ε,δ)(\varepsilon,\delta)-differential privacy for a given dataset. The proof of Theorem III.2 is based on the (ε,δ)(\varepsilon,\delta)-differential privacy result for i.i.d. Gaussian random projections from [31].

Theorem III.2

Given the data matrix An×dA\in\mathbb{R}^{n\times d} and the output vector bnb\in\mathbb{R}^{n}, let AcA_{c} denote the concatenated version of AA and bb, i.e., Ac=[A,b]n×(d+1)A_{c}=[A,b]\in\mathbb{R}^{n\times{(d+1)}}. Define β:=ln(4/δ)\beta:=\ln(4/\delta). Define σ0:=σmin(Ac)/n\sigma_{0}:=\sigma_{\min}(A_{c})/\sqrt{n}, and B0:=maxi,j|Ac,ij|B_{0}:=\max_{i,j}|A_{c,ij}|. Suppose that the condition

nd+1(3+2βε)B02σ02\displaystyle\frac{n}{d+1}\geq\left(3+\frac{2\beta}{\varepsilon}\right)\frac{B_{0}^{2}}{\sigma_{0}^{2}} (26)

is satisfied and the sketch size satisfies

m\displaystyle m =O(βn2(d+1)2ε2(ε+β)2)for privacy w.r.t. one worker,\displaystyle=O\left(\beta\frac{n^{2}}{(d+1)^{2}}\frac{\varepsilon^{2}}{(\varepsilon+\beta)^{2}}\right)\,\text{for privacy w.r.t. one worker,} (27)
m\displaystyle m =O(βqn2(d+1)2ε2(ε+β)2)for privacy w.r.t. q workers,\displaystyle=O\left(\frac{\beta}{q}\frac{n^{2}}{(d+1)^{2}}\frac{\varepsilon^{2}}{(\varepsilon+\beta)^{2}}\right)\,\text{for privacy w.r.t. $q$ workers,} (28)

and publishing the sketched matrices SkAcm×(d+1)S_{k}A_{c}\in\mathbb{R}^{m\times{(d+1)}}, k=1,,qk=1,\dots,q, provided that d+1<md+1<m, is (ε,δ)(\varepsilon,\delta)-differentially private, where Skm×nS_{k}\in\mathbb{R}^{m\times n} is the Gaussian sketch, and ε>0\varepsilon>0, β>1+ln(4)\beta>1+\ln(4).
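Since the sketch sizes in (27) and (28) are stated up to unspecified constants, the snippet below only checks the data-dependent condition (26) for given privacy parameters (a hedged illustration; the function name is ours, and using the smallest singular value assumes AcA_{c} has full column rank).

  import numpy as np

  def privacy_condition_holds(A, b, eps, delta):
      # Checks condition (26) of Theorem III.2 for A_c = [A, b].
      Ac = np.column_stack([A, b])
      n, d_plus_1 = Ac.shape
      beta = np.log(4 / delta)
      sigma0 = np.linalg.svd(Ac, compute_uv=False).min() / np.sqrt(n)
      B0 = np.abs(Ac).max()
      return n / d_plus_1 >= (3 + 2 * beta / eps) * (B0 / sigma0) ** 2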

Remark III.3

For fixed values of β\beta, σmin\sigma_{\min}, B0B_{0}, nn and dd, the approximation error is on the order of O(1ε2)O\left(\frac{1}{\varepsilon^{2}}\right) for (ε,δ)(\varepsilon,\delta)-differential privacy with respect to all qq workers.

The work [32] considers convex optimization under privacy constraints and shows that under ε\varepsilon-differential privacy (i.e., Definition III.1 with δ=0\delta=0), the approximation error of their distributed iterative algorithm is on the order of O(1ε2)O(\frac{1}{\varepsilon^{2}}), which is of the same order as Algorithm 1. The two algorithms however have different dependencies on parameters, which are hidden in the OO-notation. We note that the approximation error of the algorithm in [32] depends on parameters that Algorithm 1 does not have, such as the initial step size, the step size decay rate, and the noise decay rate. The reason for this is that the algorithm in [32] is a synchronous iterative algorithm designed to solve a more general class of optimization problems. Algorithm 1, on the other hand, is designed to solve linear regression problems and offers a significant advantage due to its single-round communication requirement.

Remark III.4

A similar privacy statement holds for the right sketch method discussed in Section II-E. In right sketch, we only sketch the data matrix AA and not the output vector bb. For publishing ASkTAS_{k}^{T} to be (ε,δ)(\varepsilon,\delta)-differentially private, Theorem III.2 still holds with the modification that we replace AcA_{c} with ATA^{T}.

IV Bounds for Other Sketching Matrices

In this section, we consider randomized Hadamard sketch, uniform sampling, and leverage score sampling for solving the unregularized linear least squares problem minxf(x)=Axb22\min_{x}f(x)=\|Ax-b\|_{2}^{2}. For each of these sketching methods, we present upper bounds on the bias of the corresponding single sketch estimator. Then, we combine these in Theorem IV.6, which provides high probability upper bounds for the error of the averaged estimator.

Lemma IV.1 expresses the expected objective value difference in terms of the bias and variance of the single sketch estimator.

Lemma IV.1

For any i.i.d. sketching matrices SkS_{k}, k=1,,qk=1,\dots,q, the expected objective value difference for the averaged estimator x¯\bar{x} can be decomposed as

𝔼[f(x¯)]f(x)=1q𝔼[Ax^Ax22]+q1q𝔼[Ax^]Ax22,\displaystyle\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*})=\frac{1}{q}\operatorname{\mathbb{E}}\left[\|A\hat{x}-Ax^{*}\|_{2}^{2}\right]+\frac{q-1}{q}\|\operatorname{\mathbb{E}}[A\hat{x}]-Ax^{*}\|_{2}^{2}, (29)

where x^\hat{x} is the single sketch estimator.
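As a quick numerical sanity check of (29), the following self-contained Python sketch estimates both sides of the decomposition for the Gaussian sketch by Monte Carlo; the dimensions, sketch size, and trial counts are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, d, m, q, trials = 500, 20, 80, 5, 2000

A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
f_star = np.linalg.norm(A @ x_star - b) ** 2

def single_sketch(rng):
    """One Gaussian-sketch estimator xhat = argmin ||S A x - S b||_2^2."""
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    xh, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return xh

# left-hand side of (29): E[f(xbar)] - f(x*), estimated by Monte Carlo
gaps = []
for _ in range(trials):
    x_bar = np.mean([single_sketch(rng) for _ in range(q)], axis=0)
    gaps.append(np.linalg.norm(A @ x_bar - b) ** 2 - f_star)
lhs = np.mean(gaps)

# right-hand side of (29): (1/q) E||A xhat - A x*||^2 + ((q-1)/q) ||E[A xhat] - A x*||^2
err = np.array([A @ (single_sketch(rng) - x_star) for _ in range(trials)])
rhs = np.mean(np.sum(err ** 2, axis=1)) / q + (q - 1) / q * np.sum(err.mean(axis=0) ** 2)

print(lhs, rhs)   # the two estimates agree up to Monte Carlo error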

Lemma IV.2 establishes an upper bound on the norm of the bias for any i.i.d. sketching matrix. The results in the remainder of this section will build on this lemma.

Lemma IV.2

Let the SVD of AA be denoted as A=UΣVTA=U\Sigma V^{T}. Let z:=UTSTSbz:=U^{T}S^{T}Sb^{\perp} and Q:=(UTSTSU)1Q:=(U^{T}S^{T}SU)^{-1} where b:=bAxb^{\perp}:=b-Ax^{*}. Define the event EE as (1ϵ)IdQ(1+ϵ)Id(1-\epsilon)I_{d}\preceq Q\preceq(1+\epsilon)I_{d}. Then, for any sketch matrix SS with 𝔼[STS]=In\operatorname{\mathbb{E}}[S^{T}S]=I_{n}, the bias of the single sketch estimator, conditioned on EE, is upper bounded as

𝔼[Ax^E]Ax24ϵ𝔼[z22|E].\displaystyle\|\operatorname{\mathbb{E}}[A\hat{x}|E]-Ax^{*}\|_{2}\leq\sqrt{4\epsilon\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E]}\,. (30)

The event EE is a high probability event and it is equivalent to the subspace embedding property (e.g. [3]). We will analyze the unconditioned error later in Theorem IV.6.

We note that Lemma IV.1 and IV.2 apply to all of the sketching matrices considered in this work. We now give specific bounds for the bias of the single sketch estimator separately for each of randomized Hadamard sketch, uniform sampling, and leverage score sampling.

Randomized Hadamard sketch: This method has the advantage of low computational complexity due to the Fast Hadamard Transform, which can be computed in O(nlogn)O(n\log n), whereas applying the Gaussian sketch takes O(mn)O(mn) time for length nn vectors. Lemma IV.3 states the upper bound for the bias for randomized Hadamard sketch.

Lemma IV.3

For randomized Hadamard sketch, the bias is upper bounded as

𝔼[Ax^E]Ax24ϵdmf(x).\displaystyle\|\operatorname{\mathbb{E}}[A\hat{x}|E]-Ax^{*}\|_{2}\leq\sqrt{4\epsilon\frac{d}{m}f(x^{*})}. (31)
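To make the $O(n\log n)$ cost concrete, a minimal Python sketch of the randomized Hadamard sketch is given below; it is an illustration that assumes $n$ is a power of two and uses an in-place fast Walsh-Hadamard transform, with the scaling chosen so that $\mathbb{E}[S^{T}S]=I_{n}$.

import numpy as np

def fwht(x):
    """In-place fast Walsh-Hadamard transform along axis 0; x.shape[0] must be a power of two."""
    n, h = x.shape[0], 1
    while h < n:
        for i in range(0, n, 2 * h):
            a, c = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + c, a - c
        h *= 2
    return x

def randomized_hadamard_sketch(A, m, rng):
    """Return S A where S = sqrt(n/m) P H D, so that E[S^T S] = I_n."""
    n = A.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)           # D: random sign flips
    HDA = fwht(signs[:, None] * A) / np.sqrt(n)       # H D A, with H / sqrt(n) orthonormal
    rows = rng.choice(n, size=m, replace=False)       # P: uniform row subsampling
    return np.sqrt(n / m) * HDA[rows]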

Uniform sampling: We note that the bias of the uniform sampling estimator differs depending on whether sampling is performed with or without replacement. The reason is that the rows of the sketching matrix $S$ are independent when sampling with replacement, which no longer holds when sampling without replacement. Lemma IV.4 provides bounds for both cases.

Lemma IV.4

For uniform sampling, the bias can be upper bounded as

𝔼[Ax^E]Ax24ϵμdmf(x)\displaystyle\|\operatorname{\mathbb{E}}[A\hat{x}|E]-Ax^{*}\|_{2}\leq\sqrt{4\epsilon\frac{\mu d}{m}\,f(x^{*})} (32)
𝔼[Ax^E]Ax24ϵμdmnmn1f(x),\displaystyle\|\operatorname{\mathbb{E}}[A\hat{x}|E]-Ax^{*}\|_{2}\leq\sqrt{4\epsilon\frac{\mu d}{m}\frac{n-m}{n-1}\,f(x^{*})}, (33)

for sampling with and without replacement, respectively, where $\mu$ denotes the row coherence of $A$ defined in Section I-C.

Leverage score sampling: Recall that the row leverage scores of a matrix A=UΣVTA=U\Sigma V^{T} are computed via i=u~i22\ell_{i}=\|\tilde{u}_{i}\|_{2}^{2} for i=1,,ni=1,\dots,n where u~id\tilde{u}_{i}\in\mathbb{R}^{d} denotes the ii’th row of UU. Since leverage score sampling looks at the data to help guide its sampling strategy, it achieves a lower error compared to uniform sampling which treats all the samples equally. Lemma IV.5 gives the upper bound for the bias of the leverage score sampling estimator.

Lemma IV.5

For leverage score sampling, the bias can be upper bounded as

𝔼[Ax^E]Ax24ϵdmf(x).\displaystyle\|\operatorname{\mathbb{E}}[A\hat{x}|E]-Ax^{*}\|_{2}\leq\sqrt{4\epsilon\frac{d}{m}f(x^{*})}. (34)
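A minimal Python sketch of leverage score sampling is given below as an illustration; it computes the scores from the thin SVD and rescales the sampled rows in the standard way so that $\mathbb{E}[S^{T}S]=I_{n}$ under sampling with replacement.

import numpy as np

def leverage_score_sampling_sketch(A, m, rng):
    """Sample m rows of A with probabilities proportional to their leverage scores.

    Sampling is with replacement and rows are rescaled by 1/sqrt(m p_i) so that E[S^T S] = I_n.
    """
    n, d = A.shape
    U, _, _ = np.linalg.svd(A, full_matrices=False)   # thin SVD; the rows of U give the leverage scores
    lev = np.sum(U ** 2, axis=1)                      # leverage scores l_i = ||u_i||_2^2
    p = lev / lev.sum()                               # sampling probabilities (the scores sum to d for full rank A)
    idx = rng.choice(n, size=m, replace=True, p=p)
    return A[idx] / np.sqrt(m * p[idx])[:, None]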

The results given in Lemmas IV.3, IV.4, and IV.5 can be combined with Lemma IV.1. In particular, the error of the averaged estimator contains the squared bias term scaled by $(q-1)/q$. For a distributed computing system with a large number of worker nodes $q$, this scaling factor is very close to $1$, while the variance term vanishes since it is scaled by $1/q$.

We now leverage our analysis of the bias upper bounds to establish upper bounds on the error of the averaged estimator. Theorem IV.6 gives high probability error bounds for different sketch types when the sketch size is set accordingly.

Theorem IV.6

Suppose that the sketch size is selected as

m\displaystyle m d+log(n)ϵ2log(d/δ) for randomized Hadamard sketch,\displaystyle\gtrsim\frac{d+\log(n)}{\epsilon^{2}}\log(d/\delta)\mbox{ for randomized Hadamard sketch,}
m\displaystyle m μdϵ2log(d/δ) for uniform sampling,\displaystyle\gtrsim\frac{\mu d}{\epsilon^{2}}\log(d/\delta)\mbox{ for uniform sampling,}
m\displaystyle m dϵ2log(d/δ) for leverage score sampling,\displaystyle\gtrsim\frac{d}{\epsilon^{2}}\log(d/\delta)\mbox{ for leverage score sampling}, (35)

for some ϵ(0,14]\epsilon\in(0,\frac{1}{4}] and δ>0\delta>0. Here μ\mu is the row coherence defined in Section I-C, that is, μ=n/dmaxiu~i22\mu=n/d\max_{i}\|\tilde{u}_{i}\|_{2}^{2}. Then, the relative optimality gap of the averaged estimator obeys the following upper bounds:

P(f(x¯)f(x)1+γ)\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}\leq 1+\gamma\Big{)}\geq
1qδdqγm((1+ϵ)2+4ϵ(q1)) for randomized Hadamard sketch,\displaystyle 1-q\delta-\frac{d}{q\gamma m}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\mbox{ for randomized Hadamard sketch,} (36)
1qδμdqγm((1+ϵ)2+4ϵ(q1)) for uniform sampling with replacement,\displaystyle 1-q\delta-\frac{\mu d}{q\gamma m}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\mbox{ for uniform sampling with replacement,} (37)
1qδμdqγmnmn1((1+ϵ)2+4ϵ(q1)) for uniform sampling without replacement,\displaystyle 1-q\delta-\frac{\mu d}{q\gamma m}\frac{n-m}{n-1}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\mbox{ for uniform sampling without replacement,} (38)
1qδdqγm((1+ϵ)2+4ϵ(q1)) for leverage score sampling,\displaystyle 1-q\delta-\frac{d}{q\gamma m}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\mbox{ for leverage score sampling,} (39)

for any γ>0\gamma>0.

Remarkably, the optimality ratio of the distributed sketching estimator converges to $1$ as $q$ or $m$ gets large. This is different from the results of [3], which do not imply convergence to $1$ as $q\rightarrow\infty$ due to an additional bias term. We note that the relative error of the uniform sampling sketch depends on the row coherence, unlike the randomized Hadamard and leverage score sampling sketches.

The main idea for the proof of Theorem IV.6 involves using the decomposition given in Lemma IV.1 along with the upper bounds for 𝔼[UTSTSb22|E]\operatorname{\mathbb{E}}[\|U^{T}S^{T}Sb^{\perp}\|_{2}^{2}|E]. Then, we use Markov’s inequality to find a high probability bound for the approximation quality of the averaged estimator, conditioned on the events EkE_{k}, k=1,,qk=1,\dots,q defined as (1ϵ)Id(UTSkTSkU)1(1+ϵ)Id(1-\epsilon)I_{d}\preceq(U^{T}S_{k}^{T}S_{k}U)^{-1}\preceq(1+\epsilon)I_{d}. This inequality is a direct result of the subspace embedding property and holds with high probability. The last component of the proof deals with removing the conditioning on the events EkE_{k} to find the upper bound for the relative error f(x¯)/f(x)f(\bar{x})/f(x^{*}).
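To illustrate Theorem IV.6 numerically, the short self-contained Python sketch below builds the averaged estimator from $q$ uniform-sampling sketches with replacement and reports the relative optimality gap $f(\bar{x})/f(x^{*})$; the dimensions are arbitrary illustrative choices and the gap approaches $1$ as $q$ and $m$ grow.

import numpy as np

rng = np.random.default_rng(1)
n, d, m, q = 5000, 50, 500, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
f_star = np.linalg.norm(A @ x_star - b) ** 2

solutions = []
for _ in range(q):
    rows = rng.integers(0, n, size=m)                            # uniform sampling with replacement
    SA, Sb = A[rows] * np.sqrt(n / m), b[rows] * np.sqrt(n / m)  # rescaling so that E[S^T S] = I_n
    xk, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    solutions.append(xk)

x_bar = np.mean(solutions, axis=0)
print(np.linalg.norm(A @ x_bar - b) ** 2 / f_star)               # relative optimality gap f(x_bar)/f(x*)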

V Distributed Sketching for Nonlinear Convex Optimization Problems

In this section, we consider randomized second order methods for solving a broader range of problems by leveraging a distributed version of the Newton Sketch algorithm described in [21].

V-A Distributed Newton Sketch

The update equation for Newton’s method is of the form xt+1=xtα1Ht1gtx_{t+1}=x_{t}-\alpha_{1}H_{t}^{-1}g_{t}, where Htd×dH_{t}\in\mathbb{R}^{d\times d} and gtdg_{t}\in\mathbb{R}^{d} denote the Hessian matrix and the gradient vector at iteration tt respectively, and α1\alpha_{1} is the step size. In contrast, Newton Sketch performs the approximate updates

xt+1=xt+α1argminΔ(12StHt1/2Δ22+gtTΔ),\displaystyle x_{t+1}=x_{t}+\alpha_{1}\arg\min_{\Delta}(\frac{1}{2}\|S_{t}H_{t}^{1/2}\Delta\|_{2}^{2}+g_{t}^{T}\Delta), (40)

where the sketching matrices $S_{t}\in\mathbb{R}^{m\times n}$ are refreshed at every iteration. There are many possible ways to devise a distributed Newton's method or a distributed Newton sketch algorithm. Here we consider a scheme that is similar in spirit to the GIANT algorithm [22], where worker nodes communicate length-$d$ approximate update directions to be averaged at the master node. An alternative scheme would be to communicate the approximate Hessian matrices, which would increase the communication load to $d^{2}$ numbers.

We consider Hessian matrices of the form Ht=(Ht1/2)THt1/2H_{t}=(H_{t}^{1/2})^{T}H_{t}^{1/2}, where we assume that Ht1/2n×dH_{t}^{1/2}\in\mathbb{R}^{n\times d} is a full rank matrix and ndn\geq d. Note that this factorization is already available in terms of scaled data matrices in many problems as we illustrate in the sequel. This enables the fast construction of an approximation of HtH_{t} by applying sketching StHt1/2S_{t}H_{t}^{1/2} which leads to the approximation H^t=(StHt1/2)TStHt1/2\hat{H}_{t}=(S_{t}H_{t}^{1/2})^{T}S_{t}H_{t}^{1/2}. Averaging for regularized problems with Hessian matrices of the form Ht=(Ht1/2)THt1/2+λ1IdH_{t}=(H_{t}^{1/2})^{T}H_{t}^{1/2}+\lambda_{1}I_{d} will be considered in the next subsection.

The update equation for distributed Newton sketch for a system with qq worker nodes can be written as

xt+1=xt+α21qk=1qargminΔ(12St,kHt1/2Δ22+gtTΔ).\displaystyle x_{t+1}=x_{t}+\alpha_{2}\frac{1}{q}\sum_{k=1}^{q}\arg\min_{\Delta}\,\left(\frac{1}{2}\|S_{t,k}H_{t}^{1/2}\Delta\|_{2}^{2}+g_{t}^{T}\Delta\right). (41)

Note that the above update requires access to the full gradient gtg_{t}. If worker nodes do not have access to the entire dataset, then this requires an additional communication round per iteration where worker nodes communicate their local gradients with the master node, which computes the full gradient and broadcasts to worker nodes. The details of the distributed Newton sketch method are given in Algorithm 4.

  Input: Tolerance ϵ\epsilon
  while $g_{t}^{T}\left(\sum_{k=1}^{q}\hat{\Delta}_{t,k}\right)/2>\epsilon$ do
     for worker k=1,,qk=1,\dots,q in parallel do
        Obtain St,kHt1/2S_{t,k}H_{t}^{1/2}.
        Obtain the gradient gtg_{t}.
        Compute approximate Newton direction Δ^t,k=argminΔ(12St,kHt1/2Δ22+gtTΔ)\hat{\Delta}_{t,k}=\arg\min_{\Delta}(\frac{1}{2}\|S_{t,k}H_{t}^{1/2}\Delta\|_{2}^{2}+g_{t}^{T}\Delta) and send to master node.
     end for
     Master node: Determine step size α2\alpha_{2} and update xt+1=xt+α21qk=1qΔ^t,kx_{t+1}=x_{t}+\alpha_{2}\frac{1}{q}\sum_{k=1}^{q}\hat{\Delta}_{t,k}.
  end while
Algorithm 4 Distributed Newton Sketch
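The following Python sketch is an illustration of the update (41) rather than the serverless implementation of Section VII: it simulates Algorithm 4 with Gaussian sketches on the least squares objective $f(x)=\frac{1}{2}\|Ax-b\|_{2}^{2}$, for which $H^{1/2}=A$ and $g=A^{T}(Ax-b)$. The serial loop over $k$ stands in for the $q$ parallel workers, and the step size is simply left at $\alpha_{2}=1$ rather than the scalings of Theorem V.1 below.

import numpy as np

rng = np.random.default_rng(0)
n, d, m, q = 2000, 100, 400, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = np.zeros(d)
alpha2 = 1.0                                   # averaged step size; Theorem V.1 suggests rescaling by alpha_s

for t in range(20):
    g = A.T @ (A @ x - b)                      # exact gradient of f(x) = 0.5 ||Ax - b||^2
    directions = []
    for k in range(q):                         # serial stand-in for q parallel workers
        S = rng.standard_normal((m, n)) / np.sqrt(m)
        SA = S @ A                             # sketched Hessian square root S_{t,k} H_t^{1/2}
        # argmin_D 0.5 ||SA D||^2 + g^T D  solves  (SA)^T (SA) D = -g
        directions.append(np.linalg.solve(SA.T @ SA, -g))
    x = x + alpha2 * np.mean(directions, axis=0)

print(np.linalg.norm(A.T @ (A @ x - b)))       # gradient norm; should be small after a few iterations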

We analyze the bias and the variance of the update directions for distributed Newton sketch, and derive exact expressions for Gaussian sketching matrices. First, we establish the notation for update directions. We will let Δt\Delta_{t}^{*} denote the optimal Newton update direction at iteration tt

Δt=((Ht1/2)THt1/2)1gt\displaystyle\Delta_{t}^{*}=((H_{t}^{1/2})^{T}H_{t}^{1/2})^{-1}g_{t} (42)

and let Δ^t,k\hat{\Delta}_{t,k} denote the approximate update direction returned by worker node kk at iteration tt, which has the closed form expression

Δ^t,k=αs((Ht1/2)TSt,kTSt,kHt1/2)1gt.\displaystyle\hat{\Delta}_{t,k}=\alpha_{s}\left((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2}\right)^{-1}g_{t}. (43)

Note that the step size for the averaged update direction will be calculated as α2=α1αs\alpha_{2}=\alpha_{1}\alpha_{s}. Theorem V.1 characterizes how the update directions need to be scaled to obtain an unbiased update direction, and also a minimum variance estimator.

Theorem V.1

For Gaussian sketches St,kS_{t,k}, assuming Ht1/2H_{t}^{1/2} is full column rank, the variance 𝔼[Ht1/2(Δ^t,kΔt)22]\operatorname{\mathbb{E}}[\|H_{t}^{1/2}(\hat{\Delta}_{t,k}-\Delta_{t}^{*})\|_{2}^{2}] is minimized when αs\alpha_{s} is chosen as αs=θ1θ2\alpha_{s}=\frac{\theta_{1}}{\theta_{2}} whereas the bias 𝔼[Ht1/2(Δ^t,kΔt)]\operatorname{\mathbb{E}}[H_{t}^{1/2}(\hat{\Delta}_{t,k}-\Delta_{t}^{*})] is zero when αs=1θ1\alpha_{s}=\frac{1}{\theta_{1}}, where θ1\theta_{1} and θ2\theta_{2} are as defined in (II-D).

(a) $q=10$ workers
(b) $q=2$ workers

Figure 3: Cost approximation (f(xt)f(x))/f(x)(f(x_{t})-f(x^{*}))/f(x^{*}) for Algorithm 4 against iteration number tt for various step sizes in solving a linear least squares problem on randomly generated data. The cyan colored dotted lines show cost approximation when we make a search for the learning rate αs\alpha_{s} between 0.050.05 and 11. The blue line with circle markers corresponds to αs=1/θ1\alpha_{s}=1/\theta_{1} that leads to the unbiased estimator and the red line with square markers corresponds to αs=θ1/θ2\alpha_{s}=\theta_{1}/\theta_{2} that gives the minimum variance. The resulting step size scaling factors αs\alpha_{s} are shown in the legends of the plots. The parameters used in these experiments are n=1000n=1000, d=200d=200, m=400m=400.

Figure 3 demonstrates that choosing $\alpha_{2}=\alpha_{1}\alpha_{s}$ with $\alpha_{s}$ calculated from the unbiased estimator formula $\alpha_{s}=1/\theta_{1}$ leads to a faster decrease of the objective value when the number of workers is high. If the number of workers is small, one should instead choose the step size that minimizes the variance, using $\alpha_{s}=\theta_{1}/\theta_{2}$. Furthermore, Figure 3(a) illustrates that the blue curve with circle markers is in fact close to the best one could hope to achieve, since it nearly matches the best cyan dotted line.

V-B Distributed Newton Sketch for Regularized Problems

We now consider problems with squared $\ell_{2}$-norm regularization. In particular, we study problems with Hessian matrices of the form $H_{t}=(H_{t}^{1/2})^{T}H_{t}^{1/2}+\lambda_{1}I_{d}$. Sketching can be applied to obtain approximate Hessian matrices as $\hat{H}_{t}=(S_{t}H_{t}^{1/2})^{T}S_{t}H_{t}^{1/2}+\lambda_{2}I_{d}$. Note that the case $\lambda_{2}=\lambda_{1}$ corresponds to the setting of the GIANT algorithm described in [22].

Theorem V.2 establishes that λ2\lambda_{2} should be chosen according to the formula (44) under the assumption that the singular values of Ht1/2H_{t}^{1/2} are identical. We later verify empirically that when the singular values are not identical, plugging the mean of the singular values into the formula still leads to improvements over the case of λ2=λ1\lambda_{2}=\lambda_{1}.

Theorem V.2

Given the thin SVD $H_{t}^{1/2}=U\Sigma V^{T}\in\mathbb{R}^{n\times d}$ with $n\geq d$, where $H_{t}^{1/2}$ is assumed to have full rank and satisfy $\Sigma=\sigma I_{d}$, the bias of the single sketch Newton step estimator $\operatorname{\mathbb{E}}[H_{t}^{1/2}(\hat{\Delta}_{t,k}-\Delta_{t}^{*})]$ vanishes as $n$ goes to infinity when $\lambda_{2}$ is chosen as

λ2=(λ1+dmσ2)(1d/m1+λ1σ2+d/m),\displaystyle\lambda_{2}^{*}=\left(\lambda_{1}+\frac{d}{m}\sigma^{2}\right)\left(1-\frac{d/m}{1+\lambda_{1}\sigma^{-2}+d/m}\right), (44)

where Δt=((Ht1/2)THt1/2+λ1Id)1gt\Delta_{t}^{*}=((H_{t}^{1/2})^{T}H_{t}^{1/2}+\lambda_{1}I_{d})^{-1}g_{t} and Δ^t,k=((St,kHt1/2)TSt,kHt1/2+λ2Id)1gt\hat{\Delta}_{t,k}=((S_{t,k}H_{t}^{1/2})^{T}S_{t,k}H_{t}^{1/2}+\lambda_{2}I_{d})^{-1}g_{t}, and St,kS_{t,k} is the Gaussian sketch.

Remark V.3

The proof of Theorem V.2 builds on the proof of Theorem II.20 (see Appendix). The main difference between these results is that the setting in Theorem V.2 is such that we assume access to the full gradient, i.e. only the Hessian matrix is sketched. In contrast, the setting in Theorem II.20 does not use the exact gradient.
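Formula (44) is straightforward to evaluate in code; the short Python sketch below is an illustration that also includes the heuristic of plugging in the mean singular value of $H_{t}^{1/2}$ when the singular values are not identical.

import numpy as np

def lambda2_star(lambda1, d, m, sigma):
    """Bias-correcting regularization coefficient from (44)."""
    ratio = d / m
    return (lambda1 + ratio * sigma**2) * (1 - ratio / (1 + lambda1 / sigma**2 + ratio))

def lambda2_star_from_sqrt_hessian(H_half, lambda1, m):
    """Heuristic version: plug the mean singular value of H_t^{1/2} into (44)."""
    d = H_half.shape[1]
    sigma = np.linalg.svd(H_half, compute_uv=False).mean()
    return lambda2_star(lambda1, d, m, sigma)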

VI Applications of Distributed Sketching

This section describes some example problems where our methodology can be applied. In particular, the problems in this section are convex problems that are efficiently addressed by our methods in distributed systems. We present numerical results on these problems in Section VII.

VI-A Logistic Regression

We begin by considering the well-known logistic regression model with squared 2\ell_{2}-norm penalty. The optimization problem for this model can be formulated as minimizexf(x)\textrm{minimize}_{x}f(x) where

f(x)=i=1n(yilog(pi)+(1yi)log(1pi))+λ12x22,\displaystyle f(x)=-\sum_{i=1}^{n}\left(y_{i}\log(p_{i})+(1-y_{i})\log(1-p_{i})\right)+\frac{\lambda_{1}}{2}\|x\|_{2}^{2}, (45)

and pnp\in\mathbb{R}^{n} is defined such that pi=1/(1+exp(a~iTx))p_{i}=1/(1+\exp(-\tilde{a}_{i}^{T}x)). a~id\tilde{a}_{i}\in\mathbb{R}^{d} denotes the ii’th row of the data matrix An×dA\in\mathbb{R}^{n\times d}. The output vector is denoted by yny\in\mathbb{R}^{n}.

We can use the distributed Newton sketch algorithm to solve this optimization problem. The gradient and Hessian for f(x)f(x) are as follows

g\displaystyle g =AT(py)+λ1x,\displaystyle=A^{T}(p-y)+\lambda_{1}x,
H\displaystyle H =ATDA+λ1Id.\displaystyle=A^{T}DA+\lambda_{1}I_{d}\,.

$D$ is a diagonal matrix whose $i$'th diagonal entry is $p_{i}(1-p_{i})$. The sketched Hessian matrix in this case can be formed as $(SD^{1/2}A)^{T}(SD^{1/2}A)+\lambda_{2}^{*}I_{d}$, where $\lambda_{2}^{*}$ can be calculated using (44) and we can set $\sigma$ to the mean of the singular values of $D^{1/2}A$. Since the entries of the matrix $D$ are a function of the variable $x$, the matrix $D$ is updated at every iteration, and it might be computationally expensive to re-compute the mean of the singular values of $D^{1/2}A$ each time. However, we have found through experiments that the exact mean of the singular values is not required for bias reduction; for instance, setting $\sigma$ to the mean of the diagonal entries of $D^{1/2}$ works sufficiently well as a heuristic.
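A minimal Python sketch of one worker's computation for this problem is given below as an illustration; it uses the gradient and Hessian expressions above, a Gaussian sketch, formula (44) inlined, and the heuristic of setting $\sigma$ to the mean of the diagonal entries of $D^{1/2}$.

import numpy as np

def logistic_newton_sketch_direction(A, y, x, lam1, m, rng):
    """One worker's sketched Newton direction for the l2-regularized logistic model (45)."""
    n, d = A.shape
    p = 1.0 / (1.0 + np.exp(-A @ x))                  # p_i = 1 / (1 + exp(-a_i^T x))
    g = A.T @ (p - y) + lam1 * x                      # gradient
    w = np.sqrt(p * (1 - p))                          # diagonal entries of D^{1/2}
    H_half = w[:, None] * A                           # H^{1/2} = D^{1/2} A
    S = rng.standard_normal((m, n)) / np.sqrt(m)      # Gaussian sketch
    SH = S @ H_half
    sigma = w.mean()                                  # heuristic: mean of the diagonals of D^{1/2}
    ratio = d / m
    lam2 = (lam1 + ratio * sigma**2) * (1 - ratio / (1 + lam1 / sigma**2 + ratio))  # formula (44)
    H_hat = SH.T @ SH + lam2 * np.eye(d)              # sketched regularized Hessian
    return -np.linalg.solve(H_hat, g)                 # approximate Newton direction sent to the master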

VI-B Inequality Constrained Optimization

The second application that we consider is the inequality constrained optimization problem of the form

minimizex\displaystyle\textrm{minimize}_{x}\quad xc22\displaystyle\|x-c\|_{2}^{2}
subject to Axλ\displaystyle\|Ax\|_{\infty}\leq\lambda (46)

where An×dA\in\mathbb{R}^{n\times d}, and cdc\in\mathbb{R}^{d} are the problem data, and λ\lambda\in\mathbb{R} is a positive scalar. Note that this problem is the dual of the Lasso problem given by minx\text{min}_{x} λx1+12Axc22\lambda\|x\|_{1}+\frac{1}{2}\|Ax-c\|_{2}^{2}.

The above problem can be tackled by the standard log-barrier method [33], by solving sequences of unconstrained barrier penalized problems as follows

\displaystyle\textrm{minimize}_{x}\,-\sum_{i=1}^{n}\log(-\tilde{a}_{i}^{T}x+\lambda)-\sum_{i=1}^{n}\log(\tilde{a}_{i}^{T}x+\lambda)+\lambda_{1}\|x\|_{2}^{2}-2\lambda_{1}c^{T}x+\lambda_{1}\|c\|_{2}^{2} (47)

where a~i\tilde{a}_{i} represents the ii’th row of AA. The distributed Newton sketch algorithm could be used to solve this problem. The gradient and Hessian of the objective are given by

g\displaystyle g =AcTD𝟏2n×1+2λ1x2λ1c,\displaystyle=-A_{c}^{T}D\mathbf{1}_{2n\times 1}+2\lambda_{1}x-2\lambda_{1}c,
H\displaystyle H =(DAc)T(DAc)+2λ1Id.\displaystyle=(DA_{c})^{T}(DA_{c})+2\lambda_{1}I_{d}.

Here $A_{c}=[A^{T},-A^{T}]^{T}$ and $D$ is a diagonal matrix whose diagonal entries are the element-wise inverse of the vector $(A_{c}x-\lambda\mathbf{1}_{2n\times 1})$, where $\mathbf{1}_{2n\times 1}$ is the length-$2n$ vector of all $1$'s. The sketched Hessian can be written in the form $(SDA_{c})^{T}(SDA_{c})+\lambda_{2}I_{d}$.

Remark VI.1

Since the contribution of the regularization term in the Hessian matrix is 2λ1Id2\lambda_{1}I_{d}, we need to plug 2λ12\lambda_{1} instead of λ1\lambda_{1} in the formula for computing λ2\lambda_{2}^{*}.

VI-C Fine Tuning of Pre-Trained Neural Networks

The third application is neural network fine tuning. Let fNN():drf_{NN}(\cdot):\mathbb{R}^{d}\rightarrow\mathbb{R}^{r} represent the output of the (L1)(L-1)’st layer of a pre-trained network which consists of LL layers. We are interested in re-using a pre-trained neural network for, say, a task on a different dataset by learning a different final layer. More precisely, assuming squared loss, one way that we can formulate this problem is as follows

WL=argminWL[fNN(a~1)TfNN(a~n)T]fNN(A)WLbF2+λ1WLF2,\displaystyle W_{L}^{*}=\arg\min_{W_{L}}\Bigg{\|}\underbrace{\begin{bmatrix}f_{NN}(\tilde{a}_{1})^{T}\\ \vdots\\ f_{NN}(\tilde{a}_{n})^{T}\end{bmatrix}}_{f_{NN}(A)}W_{L}-b\Bigg{\|}_{F}^{2}+\lambda_{1}\|W_{L}\|_{F}^{2}\,, (48)

where WLr×CW_{L}\in\mathbb{R}^{r\times C} is the weight matrix of the final layer. The vectors a~id\tilde{a}_{i}\in\mathbb{R}^{d} i=1,,ni=1,\dots,n denote the rows of the data matrix An×dA\in\mathbb{R}^{n\times d} and bn×Cb\in\mathbb{R}^{n\times C} is the output matrix. In a classification problem, CC would correspond to the number of classes.

For large scale datasets, it is important to develop efficient methods for solving the problem in (48). The distributed randomized ridge regression algorithm (listed in Algorithm 3) can be applied to this problem, where the sketched sub-problems are of the form

W^L,k=argminWLSkfNN(A)WLSkbF2+λ2WLF2\displaystyle\hat{W}_{L,k}=\arg\min_{W_{L}}\|S_{k}f_{NN}(A)W_{L}-S_{k}b\|_{F}^{2}+\lambda_{2}^{*}\|W_{L}\|_{F}^{2} (49)

and the averaged solution can be computed at the master node as W¯L=1qk=1qW^L,k\bar{W}_{L}=\frac{1}{q}\sum_{k=1}^{q}\hat{W}_{L,k}. This leads to an efficient non-iterative asynchronous method for fine-tuning of a neural network.
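The sketched sub-problems (49) and the averaging step can be written in a few lines of Python; the sketch below is an illustration in which the matrix `features` stands for the precomputed $f_{NN}(A)$, the Gaussian sketch is used, and the bias-corrected coefficient $\lambda_{2}^{*}$ is taken as an input.

import numpy as np

def distributed_ridge_finetune(features, b, lam2, m, q, rng):
    """Average q sketched ridge solutions of (49); `features` stands for f_NN(A), of shape n x r."""
    n, r = features.shape
    W_sum = np.zeros((r,) + b.shape[1:])
    for k in range(q):                                   # serial stand-in for q asynchronous workers
        S = rng.standard_normal((m, n)) / np.sqrt(m)     # Gaussian sketch
        SF, Sb = S @ features, S @ b
        # solve ||S f_NN(A) W - S b||_F^2 + lam2 ||W||_F^2 in closed form
        W_sum += np.linalg.solve(SF.T @ SF + lam2 * np.eye(r), SF.T @ Sb)
    return W_sum / q                                     # averaged final-layer weights W_bar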

VII Numerical Results

We have implemented the distributed sketching methods for AWS Lambda in Python using the Pywren package [34]. The setting for the experiments is a centralized computing model where a single master node collects and averages the outputs of the $q$ worker nodes. Most of the figures in this section plot the approximation error, which we define as $(f(\bar{x})-f(x^{*}))/f(x^{*})$.

VII-A Hybrid Sketch

In a distributed computing setting, the amount of data that can be fit into the memory of a node and the size of the largest problem that the node can solve often do not match. The hybrid sketch is motivated by this mismatch: it is a sequentially composed sketching scheme where we first perform uniform sampling down to dimension $m^{\prime}$ and then sketch the sampled data down to dimension $m$ using another sketching method with better convergence properties (say, Gaussian). Worker nodes load $m^{\prime}$ rows of the data matrix $A$ into their memory and then perform sketching, reducing the number of rows from $m^{\prime}$ to $m$. We note that if $m^{\prime}=m$, the hybrid sketch reduces to sampling, and if $m^{\prime}=n$, it reduces to the Gaussian sketch. The hybrid sketching scheme does not take privacy into account since worker nodes are assumed to have access to the data matrix.

For the experiments involving very large scale datasets, we have used Sparse Johnson-Lindenstrauss Transform (SJLT) [24] as the second sketching method in the hybrid sketch due to its low computational complexity.
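A minimal Python sketch of the hybrid scheme is shown below as an illustration; for brevity the second stage is a Gaussian sketch rather than the SJLT used in the experiments.

import numpy as np

def hybrid_sketch(A, b, m_prime, m, rng):
    """Uniform sampling down to m' rows followed by a second sketch down to m rows."""
    n = A.shape[0]
    rows = rng.choice(n, size=m_prime, replace=False)    # stage 1: uniform sampling of m' rows
    scale = np.sqrt(n / m_prime)
    A1, b1 = scale * A[rows], scale * b[rows]
    S2 = rng.standard_normal((m, m_prime)) / np.sqrt(m)  # stage 2: Gaussian (SJLT in the experiments)
    return S2 @ A1, S2 @ b1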

VII-B Airline Dataset

We have conducted experiments with the publicly available Airline dataset [35]. This dataset contains information on domestic USA flights between the years 1987-2008. We are interested in predicting whether there is going to be a departure delay or not, based on information about the flights. More precisely, we are interested in predicting whether DepDelay >15>15 minutes using the attributes Month, DayofMonth, DayofWeek, CRSDepTime, CRSElapsedTime, Dest, Origin, and Distance. Most of these attributes are categorical and we have used dummy coding to convert these categorical attributes into binary representations. The size of the input matrix AA, after converting categorical features into binary representations, becomes (1.21×108)×774(1.21\times 10^{8})\times 774.

We have solved the linear least squares problem on the entire airline dataset: minimizexAxb22\text{minimize}_{x}\|Ax-b\|_{2}^{2} using qq workers on AWS Lambda. The output bb for the plots a and b in Figure 4 is a vector of binary variables indicating delay. The output bb for the plots c and d in Figure 4 is artificially generated via b=Axtruth+ϵb=Ax_{truth}+\epsilon where xtruthx_{truth} is the underlying solution and ϵ\epsilon is random Gaussian noise distributed as 𝒩(0,0.01I)\mathcal{N}(0,0.01I). Figure 4 shows that sampling followed by SJLT leads to a lower error.

We note that choosing $m$ and $m^{\prime}$ as large as the available resources allow increases the convergence rate. Based on the run times given in the caption of Figure 4, we see that the run times are slightly longer when SJLT is involved. Decreasing $m^{\prime}$ will help reduce this processing time at the expense of error performance.

a) $m^{\prime}=5m=5\times 10^{5}$
b) $m^{\prime}=2m=2\times 10^{6}$
c) $m^{\prime}=5m=5\times 10^{5}$
d) $m^{\prime}=2m=2\times 10^{6}$

Figure 4: AWS Lambda experiments on the entire airline dataset (n=1.21×108n=1.21\times 10^{8}) with q=100q=100 workers. The run times for each plot are as follows (given in this order: sampling, sampling followed by SJLT): a: 37.5, 43.9 seconds, b: 48.3, 60.1 seconds, c: 39.8, 52.9 seconds, d: 78.8, 107.6 seconds.

VII-C Image Dataset: Extended MNIST

The experiments of this subsection are performed on the image dataset EMNIST (extended MNIST) [36]. We have used the "bymerge" split of EMNIST, which has 700K training and 115K test images. The images are of size $28\times 28$ and there are 47 classes in total (letters and digits); some capital letter classes have been merged with the corresponding small letter classes (such as C and c), which is why there are 47 classes rather than 62.

a) Approximation error
b) Test accuracy

Figure 5: Approximation error and test set classification accuracy plots for the EMNIST-bymerge dataset where q=100q=100, m=2000m=2000, s=20s=20. The run times on AWS Lambda for uniform sampling and SJLT are 41.541.5 and 66.966.9 seconds, respectively.

Figure 5 shows the approximation error and test accuracy plots when we fit a least squares model on the EMNIST-bymerge dataset using the distributed randomized regression algorithm. Because this is a multi-class classification problem, we have one-hot encoded the labels. Figure 5 demonstrates that SJLT is able to drive the cost down more and leads to a better classification accuracy than uniform sampling.

VII-D Performance on Large Scale Synthetic Datasets

We now present the results of the experiments carried out on randomly generated large scale data to illustrate the scalability of the methods. The plots in Figure 6 show the approximation error as a function of time, where the problem dimensions are as follows: $A\in\mathbb{R}^{10^{7}\times 10^{3}}$ for plot a and $A\in\mathbb{R}^{(2\times 10^{7})\times(2\times 10^{3})}$ for plot b. These data matrices take up $75$ GB and $298$ GB, respectively. The data used in these experiments were randomly generated from Student's t-distribution with $1.5$ degrees of freedom for plot a and $1.7$ for plot b. The output vector $b$ was computed according to $b=Ax_{truth}+\epsilon$ where $\epsilon\in\mathbb{R}^{n}$ is i.i.d. noise distributed as $\mathcal{N}(0,0.1I_{n})$. Other parameters used in the experiments are $m=10^{4},m^{\prime}=10^{5}$ for plot a, and $m=8\times 10^{3},m^{\prime}=8\times 10^{4}$ for plot b. We observe that both plots in Figure 6 reveal similar trends: the hybrid approach leads to a lower approximation error but takes longer due to the additional processing required for SJLT.

a) $A\in\mathbb{R}^{10^{7}\times 10^{3}}$
b) $A\in\mathbb{R}^{(2\times 10^{7})\times(2\times 10^{3})}$

Figure 6: Approximation error vs time for AWS Lambda experiments on randomly generated large scale datasets (q=200q=200 AWS Lambda functions have been used).

VII-E Numerical Results for the Right Sketch Method

Figure 7 shows the approximation error as a function of the number of averaged outputs in solving the least norm problem for two different datasets. The dataset for Figure 7(a) is randomly generated with dimensions A50×1000A\in\mathbb{R}^{50\times 1000}. We observe that Gaussian sketch outperforms uniform sampling in terms of the approximation error. Furthermore, Figure 7(a) verifies that if we apply the hybrid approach of sampling first and then applying Gaussian sketch, then its performance falls between the extreme ends of only sampling and only using Gaussian sketch. Moreover, Figure 7(b) shows the results for the same experiment on a subset of the airline dataset where we have included the pairwise interactions as features which makes this an underdetermined linear system. Originally, we had 774774 features for this dataset and if we include all xixjx_{i}x_{j} terms as features, we would have a total of 299925299925 features, most of which are zero for all samples. We have excluded the all-zero columns from this extended matrix to obtain the final dimensions 2000×115882000\times 11588.

a) Random data
b) Airline data

Figure 7: Averaging for least norm problems. Plot (a): The parameters are n=50n=50, d=1000d=1000, m=200m=200, m=500m^{\prime}=500. Plot (b): Least norm averaging applied to a subset of the airline dataset. The parameters are n=2000n=2000, d=11588d=11588, m=4000m=4000, m=8000m^{\prime}=8000. For this plot, the features include the pairwise interactions in addition to the original features.

VII-F Distributed Iterative Hessian Sketch

We have evaluated the performance of the distributed IHS algorithm on AWS Lambda. In the implementation, each serverless function is responsible for solving one sketched problem per iteration. Worker nodes wait idly once they finish their computation for that iteration until the next iterate xt+1x_{t+1} becomes available. The master node is implemented as another AWS Lambda function and is responsible for collecting and averaging the worker outputs and broadcasting the next iterate xt+1x_{t+1}.

Figure 8 shows the scaled difference between the cost for the tt’th iterate and the optimal cost (i.e. (f(xt)f(x))/f(x)(f(x_{t})-f(x^{*}))/f(x^{*})) against wall-clock time for the distributed IHS algorithm given in Algorithm 2. Due to the relatively small size of the problem, we have each worker compute the exact gradient without requiring an additional communication round per iteration to form the full gradient. We note that, for problems where it is not feasible for workers to form the full gradient due to limited memory, one can include an additional communication round where each worker sends their local gradient to the master node, and the master node forms the full gradient and distributes it to the worker nodes.


Figure 8: Cost approximation (f(xt)f(x))/f(x)(f(x_{t})-f(x^{*}))/f(x^{*}) vs time for the distributed IHS algorithm running on AWS Lambda to fit the linear least squares model on randomly generated data. Unif is short for uniform sampling and unif&sjlt is short for hybrid sketch where uniform sampling is followed by SJLT. Problem parameters are as follows: n=250000n=250000, d=500d=500, m=6000m=6000, m2=20000m_{2}=20000, and qq as specified in the legend.

VII-G Inequality Constrained Optimization

Figure 9 compares the error performance of sketches with and without bias correction for the distributed Newton sketch algorithm when it is used to solve the log-barrier penalized problem given in (47). For each sketch, we have plotted the performance for λ2=λ1\lambda_{2}=\lambda_{1} and the bias corrected version λ2=λ2\lambda_{2}=\lambda_{2}^{*} (see Theorem V.2). The bias corrected versions are shown as the dotted lines. In these experiments, we have set σ\sigma to the minimum of the singular values of DADA as we have observed that setting σ\sigma to the minimum of the singular values of DADA performed better than setting it to their mean.

Even though the bias correction formula is derived for Gaussian sketch, we observe that it improves the performance of SJLT as well. We see that Gaussian sketch and SJLT perform the best out of the 4 sketches we have experimented with. We note that computational complexity of sketching for SJLT is lower than it is for Gaussian sketch, and hence the natural choice would be to use SJLT in this case.


Figure 9: Plot shows cost approximation of the iterate xtx_{t} (i.e., (f(xt)f(x))/f(x)(f(x_{t})-f(x^{*}))/f(x^{*})) against iteration number tt for various sketches in solving the log-barrier version of an inequality constrained optimization problem given in (47). Abbreviations used in the plot are as follows. Unif: Uniform sampling, gaus: Gaussian sketch, unif&sjlt: Hybrid sketch where uniform sampling is followed by SJLT. The abbreviations followed by ++’s in the legend refer to the bias corrected versions. Problem parameters are as follows: n=500n=500, d=200d=200, λ1=1000\lambda_{1}=1000, m=50m=50, m2=8m=400m_{2}=8m=400, q=10q=10, λ=0.01\lambda=0.01, s=10s=10.

VIII Discussion

In this work, we study averaging sketched solutions for linear least squares problems and averaging for randomized second order optimization algorithms. We discuss distributed sketching methods from the perspectives of convergence, bias, privacy, and distributed computing. Our results and numerical experiments suggest that distributed sketching methods offer a competitive straggler-resilient solution for solving large scale problems.

We have shown that for problems involving regularization, averaging requires a more detailed analysis compared to problems without regularization. When the regularization term is not scaled properly, the resulting estimators are biased, and averaging a large number of independent sketched solutions does not converge to the true solution. We have provided closed-form formulas for scaling the regularization coefficient to obtain unbiased estimators that guarantee convergence to the optimum.

Appendix A Proofs

This section contains the proofs for the lemmas and theorems stated in the paper.

A-A Proofs of Theorems and Lemmas in Section II

Proof:

Since the Gaussian sketch estimator is unbiased (i.e., $\operatorname{\mathbb{E}}[\hat{x}_{k}]=x^{*}$), Lemma IV.1 reduces to $\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*})=\frac{1}{q}\operatorname{\mathbb{E}}[\|A\hat{x}_{1}-Ax^{*}\|_{2}^{2}]$. By Lemma II.1, the error of the averaged solution, conditioned on the events $E_{k}=\{A^{T}S_{k}^{T}S_{k}A\succ 0\}$, $\forall k=1,\dots,q$, can be written exactly as

𝔼[A(x¯x)22|E1Eq]=1qdmd1f(x).\displaystyle\operatorname{\mathbb{E}}[\|A(\bar{x}-x^{*})\|_{2}^{2}|E_{1}\cap\dots\cap E_{q}]=\frac{1}{q}\frac{d}{m-d-1}f(x^{*}).

Using Markov’s inequality, it follows that

P(A(x¯x)22a|E1Eq)1qadmd1f(x).\displaystyle P(\|A(\bar{x}-x^{*})\|_{2}^{2}\geq a|E_{1}\cap\dots\cap E_{q})\leq\frac{1}{qa}\frac{d}{m-d-1}f(x^{*}).

The LHS can be lower bounded as

P(A(x¯x)22a(k=1qEk))P(k=1qEk)\displaystyle\frac{P(\|A(\bar{x}-x^{*})\|_{2}^{2}\geq a\cap(\bigcap_{k=1}^{q}E_{k}))}{P(\bigcap_{k=1}^{q}E_{k})} P(A(x¯x)22a)+P(k=1qEk)1P(k=1qEk)\displaystyle\geq\frac{P(\|A(\bar{x}-x^{*})\|_{2}^{2}\geq a)+P(\bigcap_{k=1}^{q}E_{k})-1}{P(\bigcap_{k=1}^{q}E_{k})} (50)
=P(A(x¯x)22a)+P(E1)q1P(E1)q,\displaystyle=\frac{P(\|A(\bar{x}-x^{*})\|_{2}^{2}\geq a)+P(E_{1})^{q}-1}{P(E_{1})^{q}}, (51)

where we have used the identity P(AB)P(A)+P(B)1P(A\cap B)\geq P(A)+P(B)-1 in (50) and the independence of the events EkE_{k} in (51). It follows

P(A(x¯x)22a)P(E1)q(11qadmd1f(x)).\displaystyle P(\|A(\bar{x}-x^{*})\|_{2}^{2}\leq a)\geq P(E_{1})^{q}\left(1-\frac{1}{qa}\frac{d}{m-d-1}f(x^{*})\right).

Setting a=f(x)ϵqa=f(x^{*})\frac{\epsilon}{q} and plugging in P(E1)=1P(E_{1})=1, which holds for mdm\geq d, we obtain

P(A(x¯x)22f(x)ϵq)(1d/ϵmd1).\displaystyle P\left(\frac{\|A(\bar{x}-x^{*})\|_{2}^{2}}{f(x^{*})}\leq\frac{\epsilon}{q}\right)\geq\left(1-\frac{d/\epsilon}{m-d-1}\right).

Lemma A.1

For the Gaussian sketching matrix SS and the matrix UU whose columns are orthonormal, the smallest singular value of the product SUSU is bounded as:

P(σmin(SU)1d/mδ)exp(mδ2/2).\displaystyle P(\sigma_{\min}(SU)\leq 1-\sqrt{d/m}-\delta)\leq\exp(-m\delta^{2}/2). (52)

This result follows from the concentration of Lipschitz functions of Gaussian random variables.

Proof:

We start by invoking a result on the concentration of Stieltjes transforms, namely Lemma 6 of [37]. This implies that

P\displaystyle P (|tr(UTSTSUϵiI)1𝔼[tr(UTSTSUϵiI)1]|>t)4exp(t2ϵ2/(16m))\displaystyle\big{(}|\operatorname{tr}(U^{T}S^{T}SU-\epsilon iI)^{-1}-\operatorname{\mathbb{E}}[\operatorname{tr}(U^{T}S^{T}SU-\epsilon iI)^{-1}]|>t\big{)}\leq 4\exp\left(-t^{2}\epsilon^{2}/(16m)\right) (53)

where i=1i=\sqrt{-1}. The matrix UTSTSUU^{T}S^{T}SU can be written as a sum of rank-1 matrices as follows:

UTSTSU=i=1m(s~iTU)T(s~iTU)\displaystyle U^{T}S^{T}SU=\sum_{i=1}^{m}(\tilde{s}_{i}^{T}U)^{T}(\tilde{s}_{i}^{T}U) (54)

where s~iT1×n\tilde{s}_{i}^{T}\in\mathbb{R}^{1\times n} is the ii’th row of SS. The vectors (s~iTU)T(\tilde{s}_{i}^{T}U)^{T} are independent as required by Lemma 6 of [37]. Next, we will bound the difference between the trace terms tr((UTSTSU)1)\operatorname{tr}((U^{T}S^{T}SU)^{-1}) and tr((UTSTSUϵiI)1)\operatorname{tr}((U^{T}S^{T}SU-\epsilon iI)^{-1}). First, we let the SVD decomposition of SUSU be U~Σ~V~T\tilde{U}\tilde{\Sigma}\tilde{V}^{T}. Then, we have the following relations:

SU\displaystyle SU =U~Σ~V~T\displaystyle=\tilde{U}\tilde{\Sigma}\tilde{V}^{T}
UTSTSU\displaystyle U^{T}S^{T}SU =V~Σ~2V~T\displaystyle=\tilde{V}\tilde{\Sigma}^{2}{\tilde{V}}^{T}
(UTSTSU)1\displaystyle(U^{T}S^{T}SU)^{-1} =V~Σ~2V~T\displaystyle=\tilde{V}\tilde{\Sigma}^{-2}{\tilde{V}}^{T}
UTSTSUϵiI\displaystyle U^{T}S^{T}SU-\epsilon iI =V~diag(σ~j2ϵi)V~T\displaystyle=\tilde{V}\operatorname{diag}(\tilde{\sigma}_{j}^{2}-\epsilon i){\tilde{V}}^{T}
(UTSTSUϵiI)1\displaystyle(U^{T}S^{T}SU-\epsilon iI)^{-1} =V~diag(1σ~j2ϵi)V~T\displaystyle=\tilde{V}\operatorname{diag}\left(\frac{1}{\tilde{\sigma}_{j}^{2}-\epsilon i}\right){\tilde{V}}^{T} (55)

where the notation diag(σ~j2)\operatorname{diag}(\tilde{\sigma}_{j}^{2}) refers to a diagonal matrix with diagonal entries equal to σ~12,σ~22,\tilde{\sigma}_{1}^{2},\tilde{\sigma}_{2}^{2},\dots. The difference between the trace terms can be bounded as follows:

|tr((UTSTSU)1)tr((UTSTSUϵiI)1)|=|tr(V~diag(1σ~j21σ~j2ϵi)V~T)|\displaystyle\left|\operatorname{tr}((U^{T}S^{T}SU)^{-1})-\operatorname{tr}((U^{T}S^{T}SU-\epsilon iI)^{-1})\right|=\left|\operatorname{tr}\left(\tilde{V}\operatorname{diag}\left(\frac{1}{\tilde{\sigma}_{j}^{2}}-\frac{1}{\tilde{\sigma}_{j}^{2}-\epsilon i}\right){\tilde{V}}^{T}\right)\right|
=|j=1d(1σ~j21σ~j2ϵi)|=|j=1d(ϵiσ~j2(σ~j2ϵi))|\displaystyle=\left|\sum_{j=1}^{d}\left(\frac{1}{\tilde{\sigma}_{j}^{2}}-\frac{1}{\tilde{\sigma}_{j}^{2}-\epsilon i}\right)\right|=\left|\sum_{j=1}^{d}\left(\frac{-\epsilon i}{\tilde{\sigma}_{j}^{2}(\tilde{\sigma}_{j}^{2}-\epsilon i)}\right)\right|
j=1d|ϵiσ~j2(σ~j2ϵi)|=j=1dϵσ~j2|σ~j2ϵi|\displaystyle\leq\sum_{j=1}^{d}\left|\frac{-\epsilon i}{\tilde{\sigma}_{j}^{2}(\tilde{\sigma}_{j}^{2}-\epsilon i)}\right|=\sum_{j=1}^{d}\frac{\epsilon}{\tilde{\sigma}_{j}^{2}|\tilde{\sigma}_{j}^{2}-\epsilon i|}
j=1dϵσ~j4j=1dϵσ~j2σmin2(SU)\displaystyle\leq\sum_{j=1}^{d}\frac{\epsilon}{\tilde{\sigma}_{j}^{4}}\leq\sum_{j=1}^{d}\frac{\epsilon}{\tilde{\sigma}_{j}^{2}\sigma^{2}_{\min}(SU)}
=ϵσmin2(SU)j=1d1σ~j2=ϵσmin2(SU)tr((UTSTSU)1).\displaystyle=\frac{\epsilon}{\sigma^{2}_{\min}(SU)}\sum_{j=1}^{d}\frac{1}{\tilde{\sigma}_{j}^{2}}=\frac{\epsilon}{\sigma^{2}_{\min}(SU)}\operatorname{tr}\left((U^{T}S^{T}SU)^{-1}\right)\,. (56)

Equivalently, we have

(1ϵσmin2(SU))tr((UTSTSU)1)tr((UTSTSUϵiI)1)(1+ϵσmin2(SU))tr((UTSTSU)1).\displaystyle\big{(}1-\frac{\epsilon}{\sigma^{2}_{\min}(SU)}\big{)}\operatorname{tr}((U^{T}S^{T}SU)^{-1})\leq\operatorname{tr}((U^{T}S^{T}SU-\epsilon iI)^{-1})\leq\big{(}1+\frac{\epsilon}{\sigma^{2}_{\min}(SU)}\big{)}\operatorname{tr}((U^{T}S^{T}SU)^{-1})\,. (57)

The relations in (53) and (57) together imply the following concentration result:

P(|tr((UTSTSU)1)𝔼[tr((UTSTSU)1)]|t+ϵσmin2(SU)(tr((UTSTSU)1)+𝔼[tr((UTSTSU)1)]))\displaystyle P\big{(}|\operatorname{tr}((U^{T}S^{T}SU)^{-1})-\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]|\leq t+\frac{\epsilon}{\sigma^{2}_{\min}(SU)}\left(\operatorname{tr}((U^{T}S^{T}SU)^{-1})+\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]\right)\big{)}
14exp(t2ϵ2/(16m)).\displaystyle\geq 1-4\exp\left(-t^{2}\epsilon^{2}/(16m)\right)\,. (58)

We know that $\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]=\frac{md}{m-d-1}$, which follows from the mean of the inverse Wishart distribution since the rows of $SU$ are i.i.d. Gaussian. Using the upper bound $\operatorname{tr}((U^{T}S^{T}SU)^{-1})\leq d\sigma_{\min}^{-2}(SU)$ yields

P(|tr((UTSTSU)1)𝔼[tr((UTSTSU)1)]|t+ϵσmin2(SU)(dσmin2(SU)+mdmd1))\displaystyle P\Big{(}|\operatorname{tr}((U^{T}S^{T}SU)^{-1})-\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]|\leq t+\frac{\epsilon}{\sigma^{2}_{\min}(SU)}\big{(}\frac{d}{\sigma^{2}_{\min}(SU)}+\frac{md}{m-d-1}\big{)}\Big{)}
14exp(t2ϵ2/(16m)).\displaystyle\geq 1-4\exp\left(-t^{2}\epsilon^{2}/(16m)\right). (59)

The concentration inequality in (A-A) and the concentration of the minimum singular value P(1σmin(SU)11d/mδ)1exp(mδ2/2)P(\frac{1}{\sigma_{\min}(SU)}\leq\frac{1}{1-\sqrt{d/m}-\delta})\geq 1-\exp(-m\delta^{2}/2) from Lemma A.1 together imply that

P(|tr((UTSTSU)1)𝔼[tr((UTSTSU)1)]|t+ϵ(1d/mδ)2(d(1d/mδ)2+mdmd1))\displaystyle P\Big{(}|\operatorname{tr}((U^{T}S^{T}SU)^{-1})-\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]|\leq t+\frac{\epsilon}{(1-\sqrt{d/m}-\delta)^{2}}\big{(}\frac{d}{(1-\sqrt{d/m}-\delta)^{2}}+\frac{md}{m-d-1}\big{)}\Big{)}
14exp(t2ϵ2/(16m))exp(mδ2/2).\displaystyle\geq 1-4\exp\left(-t^{2}\epsilon^{2}/(16m)\right)-\exp(-m\delta^{2}/2). (60)

We now simplify this bound by observing that

d(1d/mδ)2=d(1+dm2dm)+δ22δ(1dm)\displaystyle\frac{d}{(1-\sqrt{d/m}-\delta)^{2}}=\frac{d}{(1+\frac{d}{m}-2\sqrt{\frac{d}{m}})+\delta^{2}-2\delta(1-\sqrt{\frac{d}{m}})}
=mdm+d2md+δ2m2δm(1dm)\displaystyle\quad=\frac{md}{m+d-2\sqrt{md}+\delta^{2}m-2\delta m(1-\sqrt{\frac{d}{m}})}
mdmd(2α1)\displaystyle\quad\geq\frac{md}{m-d(2\sqrt{\alpha}-1)}
mdmd1\displaystyle\quad\geq\frac{md}{m-d-1} (61)

where we used mαdm\geq\alpha d in the third line. We have also used the following simple inequality

2δm(1d/m)\displaystyle 2\delta m(1-\sqrt{d/m}) 2δm(112)δmδ2m.\displaystyle\geq 2\delta m(1-\frac{1}{\sqrt{2}})\geq\delta m\geq\delta^{2}m\,. (62)

For the fourth line to be true, we need m2dα+dmd1m-2d\sqrt{\alpha}+d\leq m-d-1. This is satisfied if 12d(α1)1\leq 2d(\sqrt{\alpha}-1) or α(12d+1)2\alpha\geq(\frac{1}{2d}+1)^{2}. We note that this is already implied by our assumption that mdm\gtrsim d. Using the above observation, we simplify the bound as follows:

P(|tr((UTSTSU)1)𝔼[tr((UTSTSU)1)]|t+2dϵ(1d/mδ)4)14exp(t2ϵ2/(16m))exp(mδ2/2).\displaystyle P\Big{(}|\operatorname{tr}((U^{T}S^{T}SU)^{-1})-\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]|\leq t+\frac{2d\epsilon}{(1-\sqrt{d/m}-\delta)^{4}}\Big{)}\geq 1-4\exp\left(-t^{2}\epsilon^{2}/(16m)\right)-\exp(-m\delta^{2}/2). (63)

Scaling ϵϵ(1d/mδ)42d\epsilon\leftarrow\epsilon\frac{(1-\sqrt{d/m}-\delta)^{4}}{2d} gives

P(|tr((UTSTSU)1)𝔼[tr((UTSTSU)1)]|t+ϵ)14et2ϵ2(1d/mδ)864md2emδ2/2.\displaystyle P\Big{(}|\operatorname{tr}((U^{T}S^{T}SU)^{-1})-\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]|\leq t+\epsilon\Big{)}\geq 1-4e^{-\frac{t^{2}\epsilon^{2}(1-\sqrt{d/m}-\delta)^{8}}{64md^{2}}}-e^{-m\delta^{2}/2}. (64)

Setting t=ϵt=\epsilon and then ϵϵ/2\epsilon\leftarrow\epsilon/2 give us the final result:

P(|tr((UTSTSU)1)𝔼[tr((UTSTSU)1)]|ϵ)14eϵ4(1d/mδ)8210md2emδ2/2.\displaystyle P\Big{(}|\operatorname{tr}((U^{T}S^{T}SU)^{-1})-\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]|\leq\epsilon\Big{)}\geq 1-4e^{-\frac{\epsilon^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}-e^{-m\delta^{2}/2}. (65)

Proof:

Let us consider the single sketch estimator $\hat{x}$. The error of $\hat{x}$ can be expressed as

f(x^)f(x)\displaystyle f(\hat{x})-f(x^{*}) =A(x^x)22\displaystyle=\|A(\hat{x}-x^{*})\|_{2}^{2}
=A(ATSTSA)1ATSTSb22\displaystyle=\|A(A^{T}S^{T}SA)^{-1}A^{T}S^{T}Sb^{\perp}\|_{2}^{2}
=A(SA)Sb22\displaystyle=\|A(SA)^{\dagger}Sb^{\perp}\|_{2}^{2} (66)

where the superscript ~{}^{\dagger} indicates the pseudo-inverse. We will rewrite the error expression as:

f(x^)f(x)\displaystyle f(\hat{x})-f(x^{*}) =b2mA(SA)mb2Sb22\displaystyle=\left\|\frac{\|b^{\perp}\|_{2}}{\sqrt{m}}A(SA)^{\dagger}\frac{\sqrt{m}}{\|b^{\perp}\|_{2}}Sb^{\perp}\right\|_{2}^{2} (67)

and define z:=mb2Sbz:=\frac{\sqrt{m}}{\|b^{\perp}\|_{2}}Sb^{\perp} and M:=b2mA(SA)M:=\frac{\|b^{\perp}\|_{2}}{\sqrt{m}}A(SA)^{\dagger}. This simplifies the previous expression

f\displaystyle f (x^)f(x)=Mz22=zTMTMz.\displaystyle(\hat{x})-f(x^{*})=\|Mz\|_{2}^{2}=z^{T}M^{T}Mz. (68)

Lemma II.5 implies

P(f(x^)f(x)𝔼Sb[f(x^)f(x)]>2MTMFϵ+2MTM2ϵ|SU)eϵ.\displaystyle P\Big{(}f(\hat{x})-f(x^{*})-\operatorname{\mathbb{E}}_{Sb^{\perp}}[f(\hat{x})-f(x^{*})]>2\|M^{T}M\|_{F}\sqrt{\epsilon}+2\|M^{T}M\|_{2}\epsilon\Big{|}SU\Big{)}\leq e^{-\epsilon}. (69)

The terms MTMF\|M^{T}M\|_{F} and MTM2\|M^{T}M\|_{2} reduce to the following expressions:

MTMF\displaystyle\|M^{T}M\|_{F} =f(x)mtr((UTSTSU)2)f(x)mdσmin2(SU),\displaystyle=\frac{f(x^{*})}{m}\sqrt{\operatorname{tr}((U^{T}S^{T}SU)^{-2})}\leq\frac{f(x^{*})}{m}\frac{\sqrt{d}}{\sigma^{2}_{\min}(SU)}\,, (70)
MTM2\displaystyle\|M^{T}M\|_{2} =f(x)m1σmin2(SU).\displaystyle=\frac{f(x^{*})}{m}\frac{1}{\sigma^{2}_{\min}(SU)}\,. (71)

Plugging these in (69) and taking the expectation of both sides with respect to SUSU give us

P(f(x^)f(x)>f(x)mtr((UTSTSU)1)+2f(x)mσmin2(SU)(ϵd+ϵ))eϵ.\displaystyle P\Big{(}f(\hat{x})-f(x^{*})>\frac{f(x^{*})}{m}\operatorname{tr}((U^{T}S^{T}SU)^{-1})+2\frac{f(x^{*})}{m\sigma^{2}_{\min}(SU)}(\sqrt{\epsilon d}+\epsilon)\Big{)}\leq e^{-\epsilon}. (72)

Next, we combine (72) with Lemma II.7 and the concentration of the minimum singular value from Lemma A.1 to obtain

P(f(x^)f(x)>f(x)m𝔼[tr((UTSTSU)1)]+2f(x)m(1d/ms)2(ϵd+ϵ)+f(x)mγ)\displaystyle P\Big{(}f(\hat{x})-f(x^{*})>\frac{f(x^{*})}{m}\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]+2\frac{f(x^{*})}{m(1-\sqrt{d/m}-s)^{2}}(\sqrt{\epsilon d}+\epsilon)+\frac{f(x^{*})}{m}\gamma\Big{)}
eϵ+4eγ4(1d/mδ)8210md2+emδ2/2+ems2/2.\displaystyle\leq e^{-\epsilon}+4e^{-\frac{\gamma^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}+e^{-m\delta^{2}/2}+e^{-ms^{2}/2}. (73)

Assuming that ϵd>ϵ\sqrt{\epsilon d}>\epsilon, we can write:

P(f(x^)f(x)>f(x)m𝔼[tr((UTSTSU)1)]+4f(x)ϵdm(1d/ms)2+f(x)mγ)\displaystyle P\Big{(}f(\hat{x})-f(x^{*})>\frac{f(x^{*})}{m}\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]+\frac{4f(x^{*})\sqrt{\epsilon d}}{m(1-\sqrt{d/m}-s)^{2}}+\frac{f(x^{*})}{m}\gamma\Big{)}
eϵ+4eγ4(1d/mδ)8210md2+emδ2/2+ems2/2.\displaystyle\leq e^{-\epsilon}+4e^{-\frac{\gamma^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}+e^{-m\delta^{2}/2}+e^{-ms^{2}/2}. (74)

Making the change of variable ϵϵ2/d\epsilon\leftarrow\epsilon^{2}/d yields:

P(f(x^)f(x)>f(x)m𝔼[tr((UTSTSU)1)]+4f(x)ϵm(1d/ms)2+f(x)mγ)\displaystyle P\Big{(}f(\hat{x})-f(x^{*})>\frac{f(x^{*})}{m}\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]+\frac{4f(x^{*})\epsilon}{m(1-\sqrt{d/m}-s)^{2}}+\frac{f(x^{*})}{m}\gamma\Big{)}
eϵ2/d+4eγ4(1d/mδ)8210md2+emδ2/2+ems2/2.\displaystyle\leq e^{-\epsilon^{2}/d}+4e^{-\frac{\gamma^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}+e^{-m\delta^{2}/2}+e^{-ms^{2}/2}. (75)

Scaling ϵϵ(1d/ms)24\epsilon\leftarrow\epsilon\frac{(1-\sqrt{d/m}-s)^{2}}{4} yields:

P(f(x^)f(x)>f(x)m𝔼[tr((UTSTSU)1)]+f(x)(ϵ+γ)m)\displaystyle P\Big{(}f(\hat{x})-f(x^{*})>\frac{f(x^{*})}{m}\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]+\frac{f(x^{*})(\epsilon+\gamma)}{m}\Big{)}
eϵ2(1d/ms)416d+4eγ4(1d/mδ)8210md2+emδ2/2+ems2/2.\displaystyle\leq e^{-\frac{\epsilon^{2}(1-\sqrt{d/m}-s)^{4}}{16d}}+4e^{-\frac{\gamma^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}+e^{-m\delta^{2}/2}+e^{-ms^{2}/2}\,. (76)

Letting γ=ϵ\gamma=\epsilon and s=δs=\delta, and then scaling ϵϵm/2\epsilon\leftarrow\epsilon m/2 lead to

P(f(x^)f(x)>m1md1+ϵ)eϵ2m2(1d/mδ)464d+4eϵ4m4(1d/mδ)8214md2+2emδ2/2.\displaystyle P\big{(}\frac{f(\hat{x})}{f(x^{*})}>\frac{m-1}{m-d-1}+\epsilon\big{)}\leq e^{-\frac{\epsilon^{2}m^{2}(1-\sqrt{d/m}-\delta)^{4}}{64d}}+4e^{-\frac{\epsilon^{4}m^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{14}md^{2}}}+2e^{-m\delta^{2}/2}. (77)

Let us also set δ=ϵ\delta=\epsilon:

P(f(x^)f(x)>m1md1+ϵ)eϵ2m2(1d/mϵ)464d+4eϵ4m4(1d/mϵ)8214md2+2emϵ2/2.\displaystyle P\big{(}\frac{f(\hat{x})}{f(x^{*})}>\frac{m-1}{m-d-1}+\epsilon\big{)}\leq e^{-\frac{\epsilon^{2}m^{2}(1-\sqrt{d/m}-\epsilon)^{4}}{64d}}+4e^{-\frac{\epsilon^{4}m^{4}(1-\sqrt{d/m}-\epsilon)^{8}}{2^{14}md^{2}}}+2e^{-m\epsilon^{2}/2}\,. (78)

Under our assumption that mdm\gtrsim d, the above probability is bounded by

P(f(x^)f(x)>m1md1+ϵ)C1eC2ϵ4m.\displaystyle P\big{(}\frac{f(\hat{x})}{f(x^{*})}>\frac{m-1}{m-d-1}+\epsilon\big{)}\leq C_{1}e^{-C_{2}\epsilon^{4}m}\,. (79)

We now turn to the lower bound. We note that Lemma II.5 can also be used as follows:

P(zT(G)z𝔼[zT(G)z]>2(G)Fϵ+2(G)2ϵ)=P(zTGz<𝔼[zTGz]2GFϵ2G2ϵ)eϵ.\displaystyle P\big{(}z^{T}(-G)z-\operatorname{\mathbb{E}}[z^{T}(-G)z]>2\|(-G)\|_{F}\sqrt{\epsilon}+2\|(-G)\|_{2}\epsilon\big{)}=P\big{(}z^{T}Gz<\operatorname{\mathbb{E}}[z^{T}Gz]-2\|G\|_{F}\sqrt{\epsilon}-2\|G\|_{2}\epsilon\big{)}\leq e^{-\epsilon}\,. (80)

Therefore, we have the following lower bound for the error:

P(f(x^)f(x)<𝔼Sb[f(x^)f(x)]2MTMFϵ2MTM2ϵ|SU)eϵ.\displaystyle P\Big{(}f(\hat{x})-f(x^{*})<\operatorname{\mathbb{E}}_{Sb^{\perp}}[f(\hat{x})-f(x^{*})]-2\|M^{T}M\|_{F}\sqrt{\epsilon}-2\|M^{T}M\|_{2}\epsilon\Big{|}SU\Big{)}\leq e^{-\epsilon}. (81)

Using the upper bound for MTMF\|M^{T}M\|_{F} and the exact expression for MTM2\|M^{T}M\|_{2}, and taking the expectation with respect to SUSU, we obtain the following

P(f(x^)f(x)<f(x)mtr((UTSTSU)1)2f(x)mσmin2(SU)(ϵd+ϵ))eϵ.\displaystyle P\Big{(}f(\hat{x})-f(x^{*})<\frac{f(x^{*})}{m}\operatorname{tr}((U^{T}S^{T}SU)^{-1})-2\frac{f(x^{*})}{m\sigma^{2}_{\min}(SU)}(\sqrt{\epsilon d}+\epsilon)\Big{)}\leq e^{-\epsilon}. (82)

Applying the same steps as before, we arrive at the following bound:

P(f(x^)f(x)<m1md1ϵ)C1eC2ϵ4m.\displaystyle P\big{(}\frac{f(\hat{x})}{f(x^{*})}<\frac{m-1}{m-d-1}-\epsilon\big{)}\leq C_{1}e^{-C_{2}\epsilon^{4}m}. (83)

Proof:

Consider the averaged estimator x¯=1qk=1qx^k\bar{x}=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k}:

f(x¯)f(x)\displaystyle f(\bar{x})-f(x^{*}) =A(x¯x)22=1qk=1qA(SkA)Skb22\displaystyle=\|A(\bar{x}-x^{*})\|_{2}^{2}=\left\|\frac{1}{q}\sum_{k=1}^{q}A(S_{k}A)^{\dagger}S_{k}b^{\perp}\right\|_{2}^{2}
=1qk=1qMkzk22\displaystyle=\left\|\frac{1}{q}\sum_{k=1}^{q}M_{k}z_{k}\right\|_{2}^{2} (84)

where we defined

Mk\displaystyle M_{k} :=b2mA(SkA),zk:=mb2Skb.\displaystyle:=\frac{\|b^{\perp}\|_{2}}{\sqrt{m}}A(S_{k}A)^{\dagger},\quad z_{k}:=\frac{\sqrt{m}}{\|b^{\perp}\|_{2}}S_{k}b^{\perp}. (85)

We note that the entries of the vector zz are distributed as i.i.d. standard normal. We can manipulate the error expression as follows

k=1qMkzk\displaystyle\sum_{k=1}^{q}M_{k}z_{k} =[M1Mq][z1zq]=Mz\displaystyle=\begin{bmatrix}M_{1}&\dots&M_{q}\end{bmatrix}\begin{bmatrix}z_{1}\\ \vdots\\ z_{q}\end{bmatrix}=Mz (86)

and then

f(x¯)f(x)=1q2Mz22=zT(1q2MTM)z.\displaystyle f(\bar{x})-f(x^{*})=\frac{1}{q^{2}}\|Mz\|_{2}^{2}=z^{T}\big{(}\frac{1}{q^{2}}M^{T}M\big{)}z. (87)

We now bound $\|\frac{1}{q^{2}}M^{T}M\|_{F}^{2}=\frac{1}{q^{4}}\|M^{T}M\|_{F}^{2}$:

MTMF2\displaystyle\|M^{T}M\|_{F}^{2} =tr(MTMMTM)=tr(MMTMMT)\displaystyle=\operatorname{tr}(M^{T}MM^{T}M)=\operatorname{tr}(MM^{T}MM^{T})
=tr(k=1qMkMkTl=1qMlMlT)\displaystyle=\operatorname{tr}\left(\sum_{k=1}^{q}M_{k}M_{k}^{T}\sum_{l=1}^{q}M_{l}M_{l}^{T}\right)
=f2(x)m2k=1ql=1qtr((SkU)T(SkU)(SlU)T(SlU))\displaystyle=\frac{f^{2}(x^{*})}{m^{2}}\sum_{k=1}^{q}\sum_{l=1}^{q}\operatorname{tr}\left({(S_{k}U)^{\dagger}}^{T}(S_{k}U)^{\dagger}{(S_{l}U)^{\dagger}}^{T}(S_{l}U)^{\dagger}\right)
=f2(x)m2(k=1qtr((UTSkTSkU)2)+kltr((SkU)T(SkU)(SlU)T(SlU)))\displaystyle=\frac{f^{2}(x^{*})}{m^{2}}\Big{(}\sum_{k=1}^{q}\operatorname{tr}((U^{T}S_{k}^{T}S_{k}U)^{-2})+\sum_{k\neq l}\operatorname{tr}\left({(S_{k}U)^{\dagger}}^{T}(S_{k}U)^{\dagger}{(S_{l}U)^{\dagger}}^{T}(S_{l}U)^{\dagger}\right)\Big{)}
f2(x)m2(k=1qdσmin4(SkU)+kltr((SkU)T(SkU))tr((SlU)T(SlU)))\displaystyle\leq\frac{f^{2}(x^{*})}{m^{2}}\Big{(}\sum_{k=1}^{q}\frac{d}{\sigma^{4}_{\min}(S_{k}U)}+\sum_{k\neq l}\sqrt{\operatorname{tr}\left({(S_{k}U)^{\dagger}}^{T}(S_{k}U)^{\dagger}\right)}\sqrt{\operatorname{tr}\left({(S_{l}U)^{\dagger}}^{T}(S_{l}U)^{\dagger}\right)}\Big{)}
=f2(x)m2(k=1qdσmin4(SkU)+kltr((UTSkTSkU)1)tr((UTSlTSlU)1))\displaystyle=\frac{f^{2}(x^{*})}{m^{2}}\Big{(}\sum_{k=1}^{q}\frac{d}{\sigma^{4}_{\min}(S_{k}U)}+\sum_{k\neq l}\sqrt{\operatorname{tr}\left((U^{T}S_{k}^{T}S_{k}U)^{-1}\right)}\sqrt{\operatorname{tr}\left((U^{T}S_{l}^{T}S_{l}U)^{-1}\right)}\Big{)}
f2(x)m2(k=1qdσmin4(SkU)+kldσmin(SkU)σmin(SlU))\displaystyle\leq\frac{f^{2}(x^{*})}{m^{2}}\Big{(}\sum_{k=1}^{q}\frac{d}{\sigma^{4}_{\min}(S_{k}U)}+\sum_{k\neq l}\frac{d}{\sigma_{\min}(S_{k}U)\sigma_{\min}(S_{l}U)}\Big{)} (88)

where we have used the Cauchy-Schwarz inequality to bound the trace of the matrix product.

Next, we look at 1q2MTM2=1q2MTM2\|\frac{1}{q^{2}}M^{T}M\|_{2}=\frac{1}{q^{2}}\|M^{T}M\|_{2}:

MTM2\displaystyle\|M^{T}M\|_{2} =M22=MMT2=k=1qMkMkT2\displaystyle=\|M\|_{2}^{2}=\|MM^{T}\|_{2}=\left\|\sum_{k=1}^{q}M_{k}M_{k}^{T}\right\|_{2}
=b22mk=1q(SkU)T(SkU)2\displaystyle=\frac{\|b^{\perp}\|_{2}^{2}}{m}\left\|\sum_{k=1}^{q}{(S_{k}U)^{\dagger}}^{T}(S_{k}U)^{\dagger}\right\|_{2}
b22mk=1q(SkU)T(SkU)2\displaystyle\leq\frac{\|b^{\perp}\|_{2}^{2}}{m}\sum_{k=1}^{q}\|{(S_{k}U)^{\dagger}}^{T}(S_{k}U)^{\dagger}\|_{2}
=b22mk=1q1σmin2(SkU).\displaystyle=\frac{\|b^{\perp}\|_{2}^{2}}{m}\sum_{k=1}^{q}\frac{1}{\sigma^{2}_{\min}(S_{k}U)}. (89)

By Lemma II.5, we have

\displaystyle P\Big{(}f(\bar{x})-f(x^{*})-\operatorname{\mathbb{E}}_{S_{k}b^{\perp}}[f(\bar{x})-f(x^{*})]>\frac{2}{q^{2}}\|M^{T}M\|_{F}\sqrt{\epsilon}+\frac{2}{q^{2}}\|M^{T}M\|_{2}\epsilon\Big{|}S_{k}U,\,k=1,\dots,q\Big{)}\leq e^{-\epsilon} (90)

We can take the expectation of both sides with respect to $S_{k}U$, $k=1,\dots,q$ to remove the conditioning. The expectation term inside the probability is equal to

𝔼Skb[f(x¯)f(x)]\displaystyle\operatorname{\mathbb{E}}_{S_{k}b^{\perp}}[f(\bar{x})-f(x^{*})] =tr(1q2MTM)=1q2tr(MMT)=f(x)mq2k=1qtr((UTSkTSkU)1).\displaystyle=\operatorname{tr}(\frac{1}{q^{2}}M^{T}M)=\frac{1}{q^{2}}\operatorname{tr}(MM^{T})=\frac{f(x^{*})}{mq^{2}}\sum_{k=1}^{q}\operatorname{tr}((U^{T}S_{k}^{T}S_{k}U)^{-1}). (91)

Hence, we have

P(f(x¯)f(x)f(x)mq2k=1qtr((UTSkTSkU)1)>2q2MTMFϵ+2q2MTM2ϵ)eϵ.\displaystyle P\Big{(}f(\bar{x})-f(x^{*})-\frac{f(x^{*})}{mq^{2}}\sum_{k=1}^{q}\operatorname{tr}((U^{T}S_{k}^{T}S_{k}U)^{-1})>\frac{2}{q^{2}}\|M^{T}M\|_{F}\sqrt{\epsilon}+\frac{2}{q^{2}}\|M^{T}M\|_{2}\epsilon\Big{)}\leq e^{-\epsilon}\,. (92)

We now use the concentration bound of the trace term in Lemma II.7 to obtain

P(f(x¯)f(x)f(x)mq2k=1q𝔼[tr((UTSkTSkU)1)]>2q2MTMFϵ+2q2MTM2ϵ+f(x)mq2qγ)\displaystyle P\Big{(}f(\bar{x})-f(x^{*})-\frac{f(x^{*})}{mq^{2}}\sum_{k=1}^{q}\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S_{k}^{T}S_{k}U)^{-1})]>\frac{2}{q^{2}}\|M^{T}M\|_{F}\sqrt{\epsilon}+\frac{2}{q^{2}}\|M^{T}M\|_{2}\epsilon+\frac{f(x^{*})}{mq^{2}}q\gamma\Big{)}
eϵ+1(14eγ4(1d/mδ)8210md2emδ2/2)q\displaystyle\leq e^{-\epsilon}+1-\big{(}1-4e^{-\frac{\gamma^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}-e^{-m\delta^{2}/2}\big{)}^{q} (93)

where we have used the independence of SkS_{k}’s. Using the upper bounds for MTMF\|M^{T}M\|_{F} and MTM2\|M^{T}M\|_{2}, we make the following observation. If the minimum singular values of all of SkUS_{k}U, k=1,,qk=1,\dots,q satisfy σmin(SkU)>1d/ms\sigma_{\min}(S_{k}U)>1-\sqrt{d/m}-s, which occurs with probability at least (1ems2/2)q(1-e^{-ms^{2}/2})^{q} from Lemma A.1, then the following holds

MTMFϵ+MTM2ϵ\displaystyle\|M^{T}M\|_{F}\sqrt{\epsilon}+\|M^{T}M\|_{2}\epsilon f(x)m(qd(1d/ms)4+(q2q)d(1d/ms)2ϵ+qϵ(1d/ms)2)\displaystyle\leq\frac{f(x^{*})}{m}\bigg{(}\sqrt{\frac{qd}{(1-\sqrt{d/m}-s)^{4}}+\frac{(q^{2}-q)d}{(1-\sqrt{d/m}-s)^{2}}}\sqrt{\epsilon}+\frac{q\epsilon}{(1-\sqrt{d/m}-s)^{2}}\bigg{)}
f(x)mq(ϵd+ϵ)(1d/ms)2\displaystyle\leq\frac{f(x^{*})}{m}\frac{q(\sqrt{\epsilon d}+\epsilon)}{(1-\sqrt{d/m}-s)^{2}}
f(x)m2qϵd(1d/ms)2\displaystyle\leq\frac{f(x^{*})}{m}\frac{2q\sqrt{\epsilon d}}{(1-\sqrt{d/m}-s)^{2}} (94)

where we used 1d/ms<11-\sqrt{d/m}-s<1 and ϵϵd\epsilon\leq\sqrt{\epsilon d} in the last two inequalities. Combining this observation with the probability bound, we arrive at

P(f(x¯)f(x)f(x)mq2k=1q𝔼[tr((UTSkTSkU)1)]>2q2f(x)m2qϵd(1d/ms)2+f(x)mq2qγ)\displaystyle P\Big{(}f(\bar{x})-f(x^{*})-\frac{f(x^{*})}{mq^{2}}\sum_{k=1}^{q}\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S_{k}^{T}S_{k}U)^{-1})]>\frac{2}{q^{2}}\frac{f(x^{*})}{m}\frac{2q\sqrt{\epsilon d}}{(1-\sqrt{d/m}-s)^{2}}+\frac{f(x^{*})}{mq^{2}}q\gamma\Big{)}
eϵ+1(14eγ4(1d/mδ)8210md2emδ2/2)q+1(1ems2/2)q.\displaystyle\leq e^{-\epsilon}+1-\big{(}1-4e^{-\frac{\gamma^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}-e^{-m\delta^{2}/2}\big{)}^{q}+1-(1-e^{-ms^{2}/2})^{q}. (95)

Simplifying the expressions and making the change ϵϵ2/d\epsilon\leftarrow\epsilon^{2}/d lead to

P(f(x¯)f(x)f(x)qdmd1>4f(x)mqϵ(1d/ms)2+f(x)mqγ)\displaystyle P\Big{(}f(\bar{x})-f(x^{*})-\frac{f(x^{*})}{q}\frac{d}{m-d-1}>\frac{4f(x^{*})}{mq}\frac{\epsilon}{(1-\sqrt{d/m}-s)^{2}}+\frac{f(x^{*})}{mq}\gamma\Big{)}
eϵ2/d+2(14eγ4(1d/mδ)8210md2emδ2/2)q(1ems2/2)q\displaystyle\leq e^{-\epsilon^{2}/d}+2-\big{(}1-4e^{-\frac{\gamma^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}-e^{-m\delta^{2}/2}\big{)}^{q}-(1-e^{-ms^{2}/2})^{q} (96)

Next, we let ϵϵ(1d/ms)24\epsilon\leftarrow\epsilon\frac{(1-\sqrt{d/m}-s)^{2}}{4} and obtain

P(f(x¯)f(x)f(x)qdmd1>f(x)mq(ϵ+γ))\displaystyle P\Big{(}f(\bar{x})-f(x^{*})-\frac{f(x^{*})}{q}\frac{d}{m-d-1}>\frac{f(x^{*})}{mq}(\epsilon+\gamma)\Big{)}
eϵ2(1d/ms)416d+2(1ems2/2)q(14eγ4(1d/mδ)8210md2emδ2/2)q.\displaystyle\leq e^{-\frac{\epsilon^{2}(1-\sqrt{d/m}-s)^{4}}{16d}}+2-(1-e^{-ms^{2}/2})^{q}-\big{(}1-4e^{-\frac{\gamma^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}-e^{-m\delta^{2}/2}\big{)}^{q}\,. (97)

We now set γ=ϵ\gamma=\epsilon and scale ϵϵmq2\epsilon\leftarrow\epsilon\frac{mq}{2}:

\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}>1+\frac{1}{q}\frac{d}{m-d-1}+\epsilon\Big{)}\leq e^{-\frac{\epsilon^{2}m^{2}q^{2}(1-\sqrt{d/m}-s)^{4}}{64d}}+2-(1-e^{-ms^{2}/2})^{q}-\big{(}1-4e^{-\frac{\epsilon^{4}m^{4}q^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{14}md^{2}}}-e^{-m\delta^{2}/2}\big{)}^{q}. (98)

Let us use the assumption mdm\gtrsim d to further simplify:

P(f(x¯)f(x)>1+1qdmd1+ϵ)C1eC2(qϵ)2m+2(1ems2/2)q(14eϵ4m4q4(1d/mδ)8214md2emδ2/2)q.\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}>1+\frac{1}{q}\frac{d}{m-d-1}+\epsilon\Big{)}\leq C_{1}e^{-C_{2}(q\epsilon)^{2}m}+2-(1-e^{-ms^{2}/2})^{q}-\big{(}1-4e^{-\frac{\epsilon^{4}m^{4}q^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{14}md^{2}}}-e^{-m\delta^{2}/2}\big{)}^{q}. (99)

We can use Bernoulli’s inequality (1ems2/2)q1qems2/2(1-e^{-ms^{2}/2})^{q}\geq 1-qe^{-ms^{2}/2} and arrive at:

P(f(x¯)f(x)>1+1qdmd1+ϵ)C1eC2(qϵ)2m+C3qeC4(qϵ)4m(1d/ms)8+2qems2/2\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}>1+\frac{1}{q}\frac{d}{m-d-1}+\epsilon\Big{)}\leq C_{1}e^{-C_{2}(q\epsilon)^{2}m}+C_{3}qe^{-C_{4}(q\epsilon)^{4}m(1-\sqrt{d/m}-s)^{8}}+2qe^{-ms^{2}/2} (100)

where we also set δ=s\delta=s. Picking s=qϵs=q\epsilon yields:

P(f(x¯)f(x)>1+1qdmd1+ϵ)C1eC2(qϵ)2m+C3qeC4(qϵ)4m(1d/mqϵ)8+2qem(qϵ)2/2.\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}>1+\frac{1}{q}\frac{d}{m-d-1}+\epsilon\Big{)}\leq C_{1}e^{-C_{2}(q\epsilon)^{2}m}+C_{3}qe^{-C_{4}(q\epsilon)^{4}m(1-\sqrt{d/m}-q\epsilon)^{8}}+2qe^{-m(q\epsilon)^{2}/2}\,. (101)

Finally, we obtain the following simpler expression for the bound:

P(f(x¯)f(x)>1+1qdmd1+ϵ)qC1eC2(qϵ)4m\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}>1+\frac{1}{q}\frac{d}{m-d-1}+\epsilon\Big{)}\leq qC_{1}e^{-C_{2}(q\epsilon)^{4}m} (102)

where we redefine the constants C1,C2C_{1},C_{2}.

The lower bound can be obtained by applying the same steps:

P(f(x¯)f(x)<1+1qdmd1ϵ)qC1eC2(qϵ)4m.\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}<1+\frac{1}{q}\frac{d}{m-d-1}-\epsilon\Big{)}\leq qC_{1}e^{-C_{2}(q\epsilon)^{4}m}. (103)
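To illustrate the two-sided concentration above, the following minimal NumPy sketch (all problem sizes are hypothetical) draws q independent Gaussian sketches, averages the sketch-and-solve solutions, and compares the empirical mean of f(\bar{x})/f(x^{*}) with the predicted center 1+\frac{1}{q}\frac{d}{m-d-1}.

```python
import numpy as np

# Monte Carlo illustration of the concentration result above: the averaged
# Gaussian sketch-and-solve estimator has f(x_bar)/f(x*) concentrated around
# 1 + d/(q(m-d-1)). All problem sizes below are hypothetical.
rng = np.random.default_rng(0)
n, d, m, q, trials = 2000, 20, 200, 10, 50

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
f_star = np.linalg.norm(A @ x_star - b) ** 2

ratios = []
for _ in range(trials):
    x_hats = []
    for _ in range(q):
        S = rng.standard_normal((m, n)) / np.sqrt(m)          # Gaussian sketch
        x_hat, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
        x_hats.append(x_hat)
    x_bar = np.mean(x_hats, axis=0)                           # parameter averaging
    ratios.append(np.linalg.norm(A @ x_bar - b) ** 2 / f_star)

print("empirical mean of f(x_bar)/f(x*) :", np.mean(ratios))
print("predicted center 1 + d/(q(m-d-1)):", 1 + d / (q * (m - d - 1)))
```

With these (hypothetical) sizes the empirical ratios cluster around the predicted center, and the spread shrinks as m and q grow, consistent with the bound.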

Proof:

i) For unbiased estimators: We begin by invoking Lemma IV.1 which gives us

𝔼[A(x¯x)22]\displaystyle\operatorname{\mathbb{E}}[\|A(\bar{x}-x^{*})\|_{2}^{2}] =1q𝔼[A(x^1x)22]\displaystyle=\frac{1}{q}\operatorname{\mathbb{E}}[\|A(\hat{x}_{1}-x^{*})\|_{2}^{2}] (104)

since we assume that x^k\hat{x}_{k}’s are unbiased, i.e., 𝔼[x^k]=x\operatorname{\mathbb{E}}[\hat{x}_{k}]=x^{*} for k=1,,qk=1,\dots,q.

This shows that the error of the averaged estimator is equal to the error of the single sketch estimator scaled by 1q\frac{1}{q}. Hence, we can leverage the lower bound result for the single sketch estimator. We now give the details of how to find the lower bound for the single sketch estimator which was first shown in [28].

The Fisher information matrix for estimating xx^{*} from SbSb can be constructed as follows:

I(Sb;x)=𝔼Sb[xlogg(Sb;x)xlogg(Sb;x)T]\displaystyle I(Sb;x^{*})=\operatorname{\mathbb{E}}_{Sb}[\nabla_{x^{*}}\log g(Sb;x^{*})\nabla_{x^{*}}\log g(Sb;x^{*})^{T}] (105)

where g(Sb;x^{*}) is the probability density function of Sb conditioned on SA, i.e., g(Sb;x^{*}) is the density of the multivariate Gaussian distribution \mathcal{N}(SAx^{*},\frac{1}{m}\|b^{\perp}\|_{2}^{2}I_{m}). We recall the definition b^{\perp}:=b-Ax^{*}.

The gradient of the logarithm of the probability density function with respect to xx^{*} is

xlogg(Sb;x)=mb22(SA)T(SbSAx).\displaystyle\nabla_{x^{*}}\log g(Sb;x^{*})=\frac{m}{\|b^{\perp}\|_{2}^{2}}(SA)^{T}(Sb-SAx^{*})\,. (106)

Plugging this in (105) yields

I(Sb;x)\displaystyle I(Sb;x^{*}) =m2b24𝔼Sb[(SA)T(SbSAx)(SbSAx)TSA]\displaystyle=\frac{m^{2}}{\|b^{\perp}\|_{2}^{4}}\operatorname{\mathbb{E}}_{Sb}\left[(SA)^{T}(Sb-SAx^{*})(Sb-SAx^{*})^{T}SA\right]
=m2b24(SA)T(b22mIm)SA\displaystyle=\frac{m^{2}}{\|b^{\perp}\|_{2}^{4}}(SA)^{T}\big{(}\frac{\|b^{\perp}\|_{2}^{2}}{m}I_{m}\big{)}SA
=mb22ATSTSA.\displaystyle=\frac{m}{\|b^{\perp}\|_{2}^{2}}A^{T}S^{T}SA\,. (107)

We note that the expectation of the error conditioned on SA can be lower bounded as

𝔼[A(x^x)22|SA]\displaystyle\operatorname{\mathbb{E}}[\|A(\hat{x}-x^{*})\|_{2}^{2}|SA] =𝔼[tr(A(x^x)(x^x)TAT)|SA]\displaystyle=\operatorname{\mathbb{E}}[\operatorname{tr}(A(\hat{x}-x^{*})(\hat{x}-x^{*})^{T}A^{T})|SA]
=tr(A𝔼[(x^x)(x^x)T|SA]AT)\displaystyle=\operatorname{tr}\big{(}A\operatorname{\mathbb{E}}[(\hat{x}-x^{*})(\hat{x}-x^{*})^{T}|SA]A^{T}\big{)}
b22mtr(A(ATSTSA)1AT),\displaystyle\geq\frac{\|b^{\perp}\|_{2}^{2}}{m}\operatorname{tr}(A(A^{T}S^{T}SA)^{-1}A^{T})\,, (108)

where the last line follows from the Cramér–Rao lower bound [38] given by

𝔼[(x^x)(x^x)T|SA]I1(Sb;x).\displaystyle\operatorname{\mathbb{E}}[(\hat{x}-x^{*})(\hat{x}-x^{*})^{T}|SA]\succeq I^{-1}(Sb;x^{*})\,. (109)

Taking the expectation of both sides with respect to SASA yields

𝔼\displaystyle\operatorname{\mathbb{E}} [A(x^x)22]f(x)dmd1.\displaystyle[\|A(\hat{x}-x^{*})\|_{2}^{2}]\geq f(x^{*})\frac{d}{m-d-1}\,. (110)

Hence, we arrive at

𝔼[A(x¯x)22]f(x)qdmd1.\displaystyle\operatorname{\mathbb{E}}[\|A(\bar{x}-x^{*})\|_{2}^{2}]\geq\frac{f(x^{*})}{q}\frac{d}{m-d-1}\,. (111)

ii) For any estimator: We again invoke Lemma IV.1, which gives us

𝔼[A(x¯x)22]\displaystyle\operatorname{\mathbb{E}}[\|A(\bar{x}-x^{*})\|_{2}^{2}] =1q𝔼[A(x^1x)22]+q1q𝔼[Ax^1]Ax22\displaystyle=\frac{1}{q}\operatorname{\mathbb{E}}[\|A(\hat{x}_{1}-x^{*})\|_{2}^{2}]+\frac{q-1}{q}\|\operatorname{\mathbb{E}}[A\hat{x}_{1}]-Ax^{*}\|_{2}^{2}
f(x)qdm+q1q𝔼[Ax^1]Ax22\displaystyle\geq\frac{f(x^{*})}{q}\frac{d}{m}+\frac{q-1}{q}\|\operatorname{\mathbb{E}}[A\hat{x}_{1}]-Ax^{*}\|_{2}^{2}
f(x)qdm\displaystyle\geq\frac{f(x^{*})}{q}\frac{d}{m} (112)

where the inequality on the second line follows from Lemma II.10, and the last inequality holds because the squared bias term is non-negative. ∎

Proof:

The update rule for distributed IHS is given as

xt+1=xtμ1qk=1q(ATSt,kTSt,kA)1AT(Axtb).\displaystyle x_{t+1}=x_{t}-\mu\frac{1}{q}\sum_{k=1}^{q}(A^{T}S_{t,k}^{T}S_{t,k}A)^{-1}A^{T}(Ax_{t}-b). (113)
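Before analyzing this recursion, we note that the update (113) is straightforward to implement. The following is a minimal NumPy simulation with hypothetical sizes; each simulated worker draws a fresh Gaussian sketch at every iteration, and the step size is set to \mu=1/\theta_{1}=(m-d-1)/m, the choice justified later in the proof.

```python
import numpy as np

# Minimal simulation of the distributed IHS iteration (113); all sizes are
# hypothetical and each simulated worker draws a fresh Gaussian sketch.
rng = np.random.default_rng(1)
n, d, m, q, iters = 2000, 40, 400, 8, 10

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

mu = (m - d - 1) / m                      # mu = 1/theta_1 for the Gaussian sketch
x_t = np.zeros(d)
for t in range(iters):
    grad = A.T @ (A @ x_t - b)            # full gradient A^T(A x_t - b)
    directions = []
    for k in range(q):                    # one sketched Hessian per worker
        SA = (rng.standard_normal((m, n)) / np.sqrt(m)) @ A
        directions.append(np.linalg.solve(SA.T @ SA, grad))
    x_t = x_t - mu * np.mean(directions, axis=0)
    print("iteration", t, "error ||A(x_t - x*)|| =", np.linalg.norm(A @ (x_t - x_star)))
```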

Let us decompose bb as b=Ax+bb=Ax^{*}+b^{\perp} and note that ATb=0A^{T}b^{\perp}=0 which gives us:

xt+1=xtμ1qk=1q(ATSt,kTSt,kA)1ATAet.\displaystyle x_{t+1}=x_{t}-\mu\frac{1}{q}\sum_{k=1}^{q}(A^{T}S_{t,k}^{T}S_{t,k}A)^{-1}A^{T}Ae_{t}. (114)

Subtracting xx^{*} from both sides, we obtain an equation in terms of the error vector ete_{t} only:

et+1\displaystyle e_{t+1} =etμ1qk=1q(ATSt,kTSt,kA)1ATAet\displaystyle=e_{t}-\mu\frac{1}{q}\sum_{k=1}^{q}(A^{T}S_{t,k}^{T}S_{t,k}A)^{-1}A^{T}Ae_{t}
=(Iμ1qk=1q(ATSt,kTSt,kA)1ATA)et.\displaystyle=\left(I-\mu\frac{1}{q}\sum_{k=1}^{q}(A^{T}S_{t,k}^{T}S_{t,k}A)^{-1}A^{T}A\right)e_{t}.

Let us multiply both sides by A from the left and define Q_{t,k}\coloneqq A(A^{T}S_{t,k}^{T}S_{t,k}A)^{-1}A^{T} to obtain the following equation:

et+1A\displaystyle e^{A}_{t+1} =(Iμ1qk=1qQt,k)etA.\displaystyle=\left(I-\mu\frac{1}{q}\sum_{k=1}^{q}Q_{t,k}\right)e^{A}_{t}.

We now analyze the expectation of 2\ell_{2} norm of et+1Ae^{A}_{t+1}:

𝔼[et+1A22]\displaystyle\operatorname{\mathbb{E}}[\|e^{A}_{t+1}\|_{2}^{2}] =𝔼[1qk=1q(IμQt,k)etA22]\displaystyle=\operatorname{\mathbb{E}}\left[\left\|\frac{1}{q}\sum_{k=1}^{q}(I-\mu Q_{t,k})e^{A}_{t}\right\|_{2}^{2}\right]
=1q2𝔼[k=1ql=1q(IμQt,k)etA,(IμQt,l)etA]\displaystyle=\frac{1}{q^{2}}\operatorname{\mathbb{E}}\left[\sum_{k=1}^{q}\sum_{l=1}^{q}\langle(I-\mu Q_{t,k})e^{A}_{t},(I-\mu Q_{t,l})e^{A}_{t}\rangle\right]
=1q2k=1ql=1q𝔼[(IμQt,k)etA,(IμQt,l)etA].\displaystyle=\frac{1}{q^{2}}\sum_{k=1}^{q}\sum_{l=1}^{q}\operatorname{\mathbb{E}}\left[\langle(I-\mu Q_{t,k})e^{A}_{t},(I-\mu Q_{t,l})e^{A}_{t}\rangle\right]. (115)

The contribution for k\neq l in the double summation of (115) is equal to zero because for k\neq l, we have

𝔼[(IμQt,k)etA,(IμQt,l)etA]\displaystyle\operatorname{\mathbb{E}}\left[\langle(I-\mu Q_{t,k})e^{A}_{t},(I-\mu Q_{t,l})e^{A}_{t}\rangle\right] =𝔼[(IμQt,k)etA],𝔼[(IμQt,l)etA]\displaystyle=\langle\operatorname{\mathbb{E}}[(I-\mu Q_{t,k})e^{A}_{t}],\operatorname{\mathbb{E}}[(I-\mu Q_{t,l})e^{A}_{t}]\rangle
=𝔼[(IμQt,k)etA],𝔼[(IμQt,k)etA]\displaystyle=\langle\operatorname{\mathbb{E}}[(I-\mu Q_{t,k})e^{A}_{t}],\operatorname{\mathbb{E}}[(I-\mu Q_{t,k})e^{A}_{t}]\rangle
=𝔼[(IμQt,k)etA]22.\displaystyle=\left\|\operatorname{\mathbb{E}}[(I-\mu Q_{t,k})e^{A}_{t}]\right\|_{2}^{2}.

The term in the last line above is zero for μ=1θ1\mu=\frac{1}{\theta_{1}}:

𝔼[(IμQt,k)etA]\displaystyle\operatorname{\mathbb{E}}[(I-\mu Q_{t,k})e^{A}_{t}] =𝔼[(IμA(ATSt,kTSt,kA)1AT)etA]\displaystyle=\operatorname{\mathbb{E}}[(I-\mu A(A^{T}S_{t,k}^{T}S_{t,k}A)^{-1}A^{T})e^{A}_{t}]
=(Iμθ1A(ATA)1AT)etA\displaystyle=(I-\mu\theta_{1}A(A^{T}A)^{-1}A^{T})e^{A}_{t}
=(Iμθ1UUT)etA\displaystyle=(I-\mu\theta_{1}UU^{T})e^{A}_{t}
=(IUUT)etA=0\displaystyle=(I-UU^{T})e^{A}_{t}=0

where we used A=U\Sigma V^{T}. For the rest of the proof, we assume that we set \mu=1/\theta_{1}. Now that we know the contribution from terms with k\neq l is zero, the expansion in (115) can be rewritten as

𝔼[et+1A22]\displaystyle\operatorname{\mathbb{E}}[\|e^{A}_{t+1}\|_{2}^{2}] =1q2k=1q𝔼[(IμQt,k)etA,(IμQt,k)etA]\displaystyle=\frac{1}{q^{2}}\sum_{k=1}^{q}\operatorname{\mathbb{E}}\left[\langle(I-\mu Q_{t,k})e^{A}_{t},(I-\mu Q_{t,k})e^{A}_{t}\rangle\right]
=1q2k=1q𝔼[(IμQt,k)etA22]\displaystyle=\frac{1}{q^{2}}\sum_{k=1}^{q}\operatorname{\mathbb{E}}[\|(I-\mu Q_{t,k})e^{A}_{t}\|_{2}^{2}]
=1q𝔼[(IμQt,1)etA22]\displaystyle=\frac{1}{q}\operatorname{\mathbb{E}}[\|(I-\mu Q_{t,1})e^{A}_{t}\|_{2}^{2}]
=1q(etA22+μ2𝔼[Qt,1etA22]2μ(etA)T𝔼[Qt,1]etA)\displaystyle=\frac{1}{q}\left(\|e^{A}_{t}\|_{2}^{2}+\mu^{2}\operatorname{\mathbb{E}}[\|Q_{t,1}e^{A}_{t}\|_{2}^{2}]-2\mu(e^{A}_{t})^{T}\operatorname{\mathbb{E}}[Q_{t,1}]e^{A}_{t}\right)
=1q(μ2𝔼[Qt,1etA22]etA22)\displaystyle=\frac{1}{q}\left(\mu^{2}\operatorname{\mathbb{E}}[\|Q_{t,1}e^{A}_{t}\|_{2}^{2}]-\|e^{A}_{t}\|_{2}^{2}\right)
=1q(μ2(etA)T𝔼[Qt,1TQt,1]etAetA22).\displaystyle=\frac{1}{q}\left(\mu^{2}(e^{A}_{t})^{T}\operatorname{\mathbb{E}}[Q_{t,1}^{T}Q_{t,1}]e^{A}_{t}-\|e^{A}_{t}\|_{2}^{2}\right).

The term 𝔼[Qt,1TQt,1]\operatorname{\mathbb{E}}[Q_{t,1}^{T}Q_{t,1}] can be simplified using SVD decomposition A=UΣVTA=U\Sigma V^{T}. This gives us Qt,k=U(UTSt,kTSt,kU)1UTQ_{t,k}=U(U^{T}S_{t,k}^{T}S_{t,k}U)^{-1}U^{T} and furthermore we have:

𝔼[Qt,1TQt,1]\displaystyle\operatorname{\mathbb{E}}[Q_{t,1}^{T}Q_{t,1}] =𝔼[U(UTSt,1TSt,1U)1UTU(UTSt,1TSt,1U)1UT]\displaystyle=\operatorname{\mathbb{E}}[U(U^{T}S_{t,1}^{T}S_{t,1}U)^{-1}U^{T}U(U^{T}S_{t,1}^{T}S_{t,1}U)^{-1}U^{T}]
=𝔼[U(UTSt,1TSt,1U)1(UTSt,1TSt,1U)1UT]\displaystyle=\operatorname{\mathbb{E}}[U(U^{T}S_{t,1}^{T}S_{t,1}U)^{-1}(U^{T}S_{t,1}^{T}S_{t,1}U)^{-1}U^{T}]
=U𝔼[(UTSt,1TSt,1U)2]UT\displaystyle=U\operatorname{\mathbb{E}}[(U^{T}S_{t,1}^{T}S_{t,1}U)^{-2}]U^{T}
=θ2UUT.\displaystyle=\theta_{2}UU^{T}.

Plugging this in, we obtain:

𝔼[et+1A22]\displaystyle\operatorname{\mathbb{E}}[\|e^{A}_{t+1}\|_{2}^{2}] =1q(θ2μ2(etA)TUUTetAetA22)\displaystyle=\frac{1}{q}\left(\theta_{2}\mu^{2}(e^{A}_{t})^{T}UU^{T}e^{A}_{t}-\|e^{A}_{t}\|_{2}^{2}\right)
=1q(θ2μ2etA22etA22)\displaystyle=\frac{1}{q}\left(\theta_{2}\mu^{2}\|e^{A}_{t}\|_{2}^{2}-\|e^{A}_{t}\|_{2}^{2}\right)
=θ2μ21qetA22\displaystyle=\frac{\theta_{2}\mu^{2}-1}{q}\|e^{A}_{t}\|_{2}^{2}
=1q(θ2θ121)etA22.\displaystyle=\frac{1}{q}\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)\|e^{A}_{t}\|_{2}^{2}\,.

Proof:

Taking the expectation with respect to the sketching matrices St,kS_{t,k}, k=1,,qk=1,\dots,q of both sides of the equation given in Theorem II.13, we obtain

𝔼[et+1A22]=1q(θ2θ121)𝔼[etA22].\displaystyle\operatorname{\mathbb{E}}[\|e^{A}_{t+1}\|_{2}^{2}]=\frac{1}{q}\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)\operatorname{\mathbb{E}}[\|e^{A}_{t}\|_{2}^{2}]\,.

This gives us the relationship between the initial error (when we initialize x0x_{0} to be the zero vector) and the expected error in iteration tt:

𝔼[etA22]=1qt(θ2θ121)tAx22.\displaystyle\operatorname{\mathbb{E}}[\|e^{A}_{t}\|_{2}^{2}]=\frac{1}{q^{t}}\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)^{t}\|Ax^{*}\|_{2}^{2}.

It follows that the expected error reaches ϵ\epsilon-accuracy with respect to the initial error at iteration TT where:

1qT(θ2θ121)T\displaystyle\frac{1}{q^{T}}\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)^{T} =ϵ\displaystyle=\epsilon
qT(θ2θ121)T\displaystyle q^{T}\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)^{-T} =1ϵ\displaystyle=\frac{1}{\epsilon}
T(log(q)log(θ2θ121))\displaystyle T\left(\log(q)-\log\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)\right) =log(1/ϵ)\displaystyle=\log(1/\epsilon)
T\displaystyle T =log(1/ϵ)log(q)log(θ2θ121).\displaystyle=\frac{\log(1/\epsilon)}{\log(q)-\log\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)}.

Each iteration requires communicating a d-dimensional vector for every worker; since there are q workers and the algorithm runs for T iterations, the communication load is Tqd.

The computational load per worker at each iteration involves the following numbers of operations (these quantities, together with T and the communication load, are tallied in the short script after the list):

  • Sketching AA: mndmnd multiplications

  • Computing H~t,k\tilde{H}_{t,k}: md2md^{2} multiplications

  • Computing gtg_{t}: O(nd)O(nd) operations

  • Solving H~t,k1gt\tilde{H}_{t,k}^{-1}g_{t}: O(d3)O(d^{3}) operations.
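The iteration count T, the total communication load Tqd, and the per-worker operation counts above can be tallied with a short script. This is a back-of-the-envelope calculator under the Gaussian-sketch expressions for \theta_{1} and \theta_{2} (see Lemma A.3); the problem sizes are hypothetical and constants hidden by the O(\cdot) notation are ignored.

```python
import numpy as np

# Back-of-the-envelope calculator for the quantities derived above; the problem
# sizes are hypothetical and the Gaussian-sketch formulas for theta_1, theta_2
# from Lemma A.3 are assumed.
n, d, m, q, eps = 100_000, 500, 4_000, 20, 1e-6

theta1 = m / (m - d - 1)
theta2 = m**2 * (m - 1) / ((m - d) * (m - d - 1) * (m - d - 3))

T = np.log(1 / eps) / (np.log(q) - np.log(theta2 / theta1**2 - 1))
print("iterations T             :", T)
print("communication load T*q*d :", T * q * d)
print("per iteration, per worker:")
print("  sketching A      ~", m * n * d, "multiplications")
print("  computing H~_t,k ~", m * d**2, "multiplications")
print("  computing g_t    ~ O(n d)  =", n * d)
print("  solving H~^{-1} g_t ~ O(d^3) =", d**3)
```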

Proof:

In the following, we assume that we are in the regime where nn approaches infinity.

The expectation term \operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}] is equal to the identity matrix times a scalar (i.e., cI_{d}) because it is signed permutation invariant, which we show as follows. Let P\in\mathbb{R}^{d\times d} be a permutation matrix and D\in\mathbb{R}^{d\times d} be a diagonal sign matrix (with \pm 1 entries on the diagonal). A matrix M is signed permutation invariant if (DP)M(DP)^{T}=M for every such DP. We note that every signed permutation matrix is orthogonal: (DP)^{T}(DP)=P^{T}D^{T}DP=P^{T}P=I_{d}, a fact we use below.

(DP)𝔼S[(UTSTSU+λ2I)1](DP)T\displaystyle(DP)\operatorname{\mathbb{E}}_{S}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}](DP)^{T} =𝔼S[(DP)(UTSTSU+λ2I)1(DP)T]\displaystyle=\operatorname{\mathbb{E}}_{S}[(DP)(U^{T}S^{T}SU+\lambda_{2}I)^{-1}(DP)^{T}]
=𝔼S[((DP)TUTSTSU(DP)+λ2I)1]\displaystyle=\operatorname{\mathbb{E}}_{S}[((DP)^{T}U^{T}S^{T}SU(DP)+\lambda_{2}I)^{-1}]
=𝔼SUPD[𝔼S[((DP)TUTSTSU(DP)+λ2I)1|SUPD]]\displaystyle=\operatorname{\mathbb{E}}_{SUPD}[\operatorname{\mathbb{E}}_{S}[((DP)^{T}U^{T}S^{T}SU(DP)+\lambda_{2}I)^{-1}|SUPD]]
=𝔼SUPD[((DP)TUTSTSU(DP)+λ2I)1]\displaystyle=\operatorname{\mathbb{E}}_{SUPD}[((DP)^{T}U^{T}S^{T}SU(DP)+\lambda_{2}I)^{-1}]
=𝔼SU[(UTSTSU+λ2I)1]\displaystyle=\operatorname{\mathbb{E}}_{SU^{\prime}}[({U^{\prime}}^{T}S^{T}SU^{\prime}+\lambda_{2}I)^{-1}]

where we made the variable change U^{\prime}=UDP and note that U^{\prime} has orthonormal columns because DP is an orthogonal transformation. Moreover, SU^{\prime} and SU have the same distribution because DP is an orthogonal transformation and S is a Gaussian matrix. This shows that \operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}] is signed permutation invariant.

Now that we established that 𝔼[(UTSTSU+λ2I)1]\operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}] is equal to the identity matrix times a scalar, we move on to find the value of the scalar. We use the identity 𝔼DP[(DP)Q(DP)T]=trQdId\operatorname{\mathbb{E}}_{DP}[(DP)Q(DP)^{T}]=\frac{\operatorname{tr}Q}{d}I_{d} for Qd×dQ\in\mathbb{R}^{d\times d} where the diagonal entries of DD are sampled from the Rademacher distribution and PP is sampled uniformly from the set of all possible permutation matrices. We already established that 𝔼[(UTSTSU+λ2I)1]\operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}] is equal to (DP)𝔼S[(UTSTSU+λ2I)1](DP)T(DP)\operatorname{\mathbb{E}}_{S}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}](DP)^{T} for any signed permutation matrix of the form DPDP. It follows that

𝔼[(UTSTSU+λ2I)1]\displaystyle\operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}] =(DP)𝔼S[(UTSTSU+λ2I)1](DP)T\displaystyle=(DP)\operatorname{\mathbb{E}}_{S}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}](DP)^{T}
=1|R|DPR(DP)𝔼S[(UTSTSU+λ2I)1](DP)T\displaystyle=\frac{1}{|R|}\sum_{DP\in R}(DP)\operatorname{\mathbb{E}}_{S}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}](DP)^{T}
=𝔼DP[(DP)𝔼S[(UTSTSU+λ2I)1](DP)T]\displaystyle=\operatorname{\mathbb{E}}_{DP}[(DP)\operatorname{\mathbb{E}}_{S}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}](DP)^{T}]
=1dtr(𝔼S[(UTSTSU+λ2I)1])Id\displaystyle=\frac{1}{d}\operatorname{tr}(\operatorname{\mathbb{E}}_{S}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}])I_{d}

where we define RR to be the set of all possible signed permutation matrices DPDP in going from line 1 to line 2.

By Lemma II.18, the trace term is equal to d×θ3(d/m,λ2)d\times\theta_{3}(d/m,\lambda_{2}), which concludes the proof. ∎
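The signed permutation invariance argument can also be checked numerically: for a Gaussian sketch, the Monte Carlo average of (U^{T}S^{T}SU+\lambda_{2}I)^{-1} should be close to a scalar multiple of the identity. The minimal sketch below (hypothetical dimensions and \lambda_{2}) uses the fact that SU has i.i.d. Gaussian entries whenever U has orthonormal columns.

```python
import numpy as np

# Empirical check that E[(U^T S^T S U + lam2*I)^{-1}] is (close to) c*I_d for a
# Gaussian sketch S; dimensions and lam2 below are hypothetical.
rng = np.random.default_rng(2)
d, m, lam2, trials = 10, 100, 0.5, 2000

acc = np.zeros((d, d))
for _ in range(trials):
    # SU has i.i.d. N(0, 1/m) entries since U has orthonormal columns.
    SU = rng.standard_normal((m, d)) / np.sqrt(m)
    acc += np.linalg.inv(SU.T @ SU + lam2 * np.eye(d))
avg = acc / trials

print("mean diagonal entry:", np.mean(np.diag(avg)))
print("max |off-diagonal| :", np.abs(avg - np.diag(np.diag(avg))).max())
```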

Proof:

Closed form expressions for the optimal solution and the output of the kk’th worker are as follows:

x\displaystyle x^{*} =(ATA+λ1Id)1ATb,\displaystyle=(A^{T}A+\lambda_{1}I_{d})^{-1}A^{T}b,
x^k\displaystyle\hat{x}_{k} =(ATSkTSkA+λ2Id)1ATSkTSkb.\displaystyle=(A^{T}S_{k}^{T}S_{k}A+\lambda_{2}I_{d})^{-1}A^{T}S_{k}^{T}S_{k}b.

Equivalently, xx^{*} can be written as:

x=argmin[Aλ1Id]x[b0d]22.\displaystyle x^{*}=\arg\min\left\|\begin{bmatrix}A\\ \sqrt{\lambda_{1}}I_{d}\end{bmatrix}x-\begin{bmatrix}b\\ 0_{d}\end{bmatrix}\right\|_{2}^{2}.

This allows us to decompose [b0d]\begin{bmatrix}b\\ 0_{d}\end{bmatrix} as

[b0d]=[Aλ1Id]x+b\displaystyle\begin{bmatrix}b\\ 0_{d}\end{bmatrix}=\begin{bmatrix}A\\ \sqrt{\lambda_{1}}I_{d}\end{bmatrix}x^{*}+b^{\perp}

where b=[b1b2]b^{\perp}=\begin{bmatrix}b_{1}^{\perp}\\ b_{2}^{\perp}\end{bmatrix} with b1nb_{1}^{\perp}\in\mathbb{R}^{n} and b2db_{2}^{\perp}\in\mathbb{R}^{d}. From the above equation we obtain b2=λ1xb^{\perp}_{2}=-\sqrt{\lambda_{1}}x^{*} and [ATλ1Id]b=ATb1+λ1b2=0\begin{bmatrix}A^{T}&\sqrt{\lambda_{1}}I_{d}\end{bmatrix}b^{\perp}=A^{T}b_{1}^{\perp}+\sqrt{\lambda_{1}}b_{2}^{\perp}=0.

The bias of x^k\hat{x}_{k} is given by (omitting the subscript kk in SkS_{k} for simplicity)

𝔼[A(x^kx)]\displaystyle\operatorname{\mathbb{E}}[A(\hat{x}_{k}-x^{*})] =𝔼[A(ATSTSA+λ2Id)1ATSTSbAx]\displaystyle=\operatorname{\mathbb{E}}[A(A^{T}S^{T}SA+\lambda_{2}I_{d})^{-1}A^{T}S^{T}Sb-Ax^{*}]
=𝔼[U(UTSTSU+λ2Σ2)1UTSTS(Ax+b1)]Ax\displaystyle=\operatorname{\mathbb{E}}[U(U^{T}S^{T}SU+\lambda_{2}\Sigma^{-2})^{-1}U^{T}S^{T}S(Ax^{*}+b_{1}^{\perp})]-Ax^{*}
=𝔼[λ2U(UTSTSU+λ2Σ2)1Σ1VTx]+𝔼[U(UTSTSU+λ2Σ2)1UTSTSb1].\displaystyle=\operatorname{\mathbb{E}}[-\lambda_{2}U(U^{T}S^{T}SU+\lambda_{2}\Sigma^{-2})^{-1}\Sigma^{-1}V^{T}x^{*}]+\operatorname{\mathbb{E}}[U(U^{T}S^{T}SU+\lambda_{2}\Sigma^{-2})^{-1}U^{T}S^{T}Sb_{1}^{\perp}]\,.

By the assumption Σ=σId\Sigma=\sigma I_{d}, the bias becomes

𝔼\displaystyle\operatorname{\mathbb{E}} [A(x^kx)]=𝔼[λ2σ1U(UTSTSU+λ2σ2Id)1VTx]+𝔼[U(UTSTSU+λ2σ2Id)1UTSTSb1].\displaystyle[A(\hat{x}_{k}-x^{*})]=\operatorname{\mathbb{E}}[-\lambda_{2}\sigma^{-1}U(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}V^{T}x^{*}]+\operatorname{\mathbb{E}}[U(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}U^{T}S^{T}Sb_{1}^{\perp}]\,. (116)

The first expectation term of (116) can be evaluated using Lemma II.19 (as nn goes to infinity):

𝔼[λ2σ1U\displaystyle\operatorname{\mathbb{E}}[-\lambda_{2}\sigma^{-1}U (UTSTSU+λ2Id)1VTx]=λ2σ1θ3(d/m,λ2σ2)UVTx.\displaystyle(U^{T}S^{T}SU+\lambda_{2}I_{d})^{-1}V^{T}x^{*}]=-\lambda_{2}\sigma^{-1}\theta_{3}(d/m,\lambda_{2}\sigma^{-2})UV^{T}x^{*}. (117)

To find the second expectation term in (116), let us first consider the full SVD of A given by A=\begin{bmatrix}U&U^{\perp}\end{bmatrix}\begin{bmatrix}\Sigma\\ 0_{(n-d)\times d}\end{bmatrix}V^{T} where U\in\mathbb{R}^{n\times d} and U^{\perp}\in\mathbb{R}^{n\times(n-d)}. The matrix \begin{bmatrix}U&U^{\perp}\end{bmatrix} is an orthogonal matrix, which implies UU^{T}+U^{\perp}(U^{\perp})^{T}=I_{n}. If we insert UU^{T}+U^{\perp}(U^{\perp})^{T}=I_{n} between S and b_{1}^{\perp}, the second term of (116) becomes

𝔼[U(UTSTSU+λ2σ2Id)1UTSTSb1]=\displaystyle\operatorname{\mathbb{E}}[U(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}U^{T}S^{T}Sb_{1}^{\perp}]=
=𝔼[U(UTSTSU+λ2σ2Id)1UTSTSUUTb1]+𝔼[U(UTSTSU+λ2σ2Id)1UTSTSU(U)Tb1]\displaystyle=\operatorname{\mathbb{E}}[U(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}U^{T}S^{T}SUU^{T}b_{1}^{\perp}]+\operatorname{\mathbb{E}}[U(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}U^{T}S^{T}SU^{\perp}(U^{\perp})^{T}b_{1}^{\perp}]
=𝔼[U(UTSTSU+λ2σ2Id)1UTSTSUUTb1]\displaystyle=\operatorname{\mathbb{E}}[U(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}U^{T}S^{T}SUU^{T}b_{1}^{\perp}]
=U(Idλ2σ2𝔼[(UTSTSU+λ2σ2Id)1])UTb1\displaystyle=U(I_{d}-\lambda_{2}\sigma^{-2}\operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}])U^{T}b_{1}^{\perp}
=(1λ2σ2θ3(d/m,λ2σ2))UUTb1.\displaystyle=(1-\lambda_{2}\sigma^{-2}\theta_{3}(d/m,\lambda_{2}\sigma^{-2}))UU^{T}b_{1}^{\perp}\,.

In these derivations, we used 𝔼S[U(UTSTSU+λ2σ2Id)1UTSTSU(U)Tb1]=𝔼SU[𝔼S[U(UTSTSU+λ2σ2Id)1UTSTSU(U)Tb1|SU]]=0\operatorname{\mathbb{E}}_{S}[U(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}U^{T}S^{T}SU^{\perp}(U^{\perp})^{T}b_{1}^{\perp}]=\operatorname{\mathbb{E}}_{SU}[\operatorname{\mathbb{E}}_{S}[U(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}U^{T}S^{T}SU^{\perp}(U^{\perp})^{T}b_{1}^{\perp}|SU]]=0. This follows from 𝔼S[SU|SU]=0\operatorname{\mathbb{E}}_{S}[SU^{\perp}|SU]=0 as UU and UU^{\perp} are orthogonal. The last line follows from Lemma II.19, as nn goes to infinity.

We note that UTb1=λ1Σ1VTxU^{T}b_{1}^{\perp}=\lambda_{1}\Sigma^{-1}V^{T}x^{*} and for Σ=σId\Sigma=\sigma I_{d}, this becomes UTb1=λ1σ1VTxU^{T}b_{1}^{\perp}=\lambda_{1}\sigma^{-1}V^{T}x^{*}. Bringing the pieces together, we have the bias equal to (as nn goes to infinity):

𝔼[A(x^kx)]\displaystyle\operatorname{\mathbb{E}}[A(\hat{x}_{k}-x^{*})] =λ2σ1θ3(d/m,λ2σ2)UVTx+λ1σ1(1λ2σ2θ3(d/m,λ2σ2))UVTx\displaystyle=-\lambda_{2}\sigma^{-1}\theta_{3}(d/m,\lambda_{2}\sigma^{-2})UV^{T}x^{*}+\lambda_{1}\sigma^{-1}(1-\lambda_{2}\sigma^{-2}\theta_{3}(d/m,\lambda_{2}\sigma^{-2}))UV^{T}x^{*}
=σ1(λ1λ2θ3(d/m,λ2σ2)(1+λ1σ2))UVTx.\displaystyle=\sigma^{-1}(\lambda_{1}-\lambda_{2}\theta_{3}(d/m,\lambda_{2}\sigma^{-2})(1+\lambda_{1}\sigma^{-2}))UV^{T}x^{*}.

If there is a value of λ2>0\lambda_{2}>0 that satisfies λ1λ2θ3(d/m,λ2σ2)(1+λ1σ2)=0\lambda_{1}-\lambda_{2}\theta_{3}(d/m,\lambda_{2}\sigma^{-2})(1+\lambda_{1}\sigma^{-2})=0, then that value of λ2\lambda_{2} makes x^k\hat{x}_{k} an unbiased estimator. Equivalently,

λ2σ2+dm1+\displaystyle-\lambda_{2}\sigma^{-2}+\frac{d}{m}-1+ (λ2σ2+dm1)2+4λ2σ2dm=2dmσ2λ11+λ1σ2,\displaystyle\sqrt{(-\lambda_{2}\sigma^{-2}+\frac{d}{m}-1)^{2}+4\lambda_{2}\sigma^{-2}\frac{d}{m}}=2\frac{d}{m\sigma^{2}}\frac{\lambda_{1}}{1+\lambda_{1}\sigma^{-2}}, (118)

where we note that the LHS is a monotonically increasing function of λ2\lambda_{2} in the regime λ20\lambda_{2}\geq 0 and it attains its minimum in this regime at λ2=0\lambda_{2}=0. Analyzing this equation using these observations, for the cases of m>dm>d and mdm\leq d separately, we find that for the case of mdm\leq d, we need the following to be satisfied for zero bias:

2dmσ2λ11+λ1/σ2\displaystyle 2\frac{d}{m\sigma^{2}}\frac{\lambda_{1}}{1+\lambda_{1}/\sigma^{2}} 2(dm1),or more simply,\displaystyle\geq 2\left(\frac{d}{m}-1\right),\,\mbox{or more simply,}
λ1\displaystyle\lambda_{1} σ2(dm1),\displaystyle\geq\sigma^{2}\left(\frac{d}{m}-1\right),

whereas there is no additional condition on λ1\lambda_{1} for the case of m>dm>d.

The value of λ2\lambda_{2} that will lead to zero bias can be computed by solving the equation (118) where the expression for the inverse of the left-hand side is given by LHS1(y)=yσ2d/md/m+1y1σ2LHS^{-1}(y)=\frac{y\sigma^{-2}d/m-d/m+1}{y^{-1}-\sigma^{-2}}. We evaluate the inverse at y=λ1/(1+λ1/σ2)y=\lambda_{1}/(1+\lambda_{1}/\sigma^{2}) and obtain the following expression for λ2\lambda_{2}^{*}:

λ2=λ1dmλ11+λ1/σ2.\displaystyle\lambda_{2}^{*}=\lambda_{1}-\frac{d}{m}\frac{\lambda_{1}}{1+\lambda_{1}/\sigma^{2}}.
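The re-calibrated parameter can be checked numerically: the bias coefficient \lambda_{1}-\lambda_{2}\theta_{3}(d/m,\lambda_{2}\sigma^{-2})(1+\lambda_{1}\sigma^{-2}) vanishes at \lambda_{2}=\lambda_{2}^{*} but not at the naive choice \lambda_{2}=\lambda_{1}. The sketch below uses the closed form of \theta_{3} implied by (118); all parameter values are hypothetical.

```python
import numpy as np

# Check that lambda_2^* zeroes the bias coefficient
# lambda_1 - lambda_2 * theta_3(d/m, lambda_2/sigma^2) * (1 + lambda_1/sigma^2).
# The closed form of theta_3 below is the one implied by (118);
# all parameter values are hypothetical.
def theta3(r, x):
    # r = d/m, x = lambda_2 / sigma^2
    return (-x + r - 1 + np.sqrt((-x + r - 1) ** 2 + 4 * x * r)) / (2 * x * r)

def bias_coefficient(lam1, lam2, sigma, d, m):
    return lam1 - lam2 * theta3(d / m, lam2 / sigma**2) * (1 + lam1 / sigma**2)

d, m, sigma, lam1 = 200, 500, 2.0, 1.5
lam2_star = lam1 - (d / m) * lam1 / (1 + lam1 / sigma**2)

print("bias coefficient at lambda_2 = lambda_1  :", bias_coefficient(lam1, lam1, sigma, d, m))
print("bias coefficient at lambda_2 = lambda_2* :", bias_coefficient(lam1, lam2_star, sigma, d, m))
```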

A-B Proofs of Theorems and Lemmas in Section III

The proof of the privacy result follows mainly from Theorem A.2 (stated below for completeness), which is a result from [39].

Proof:

For given \varepsilon, \delta and matrix A_{c}, if the smallest singular value of A_{c} satisfies \sigma_{\min}(A_{c})\geq w, then using Theorem A.2 we find that the sketch size m has to satisfy the following for (\varepsilon,\delta)-differential privacy:

m\displaystyle m 18ln(4/δ)((σmin2B21)11ε+1ln(4/δ)2ln(4/δ))2\displaystyle\leq\frac{1}{8\ln(4/\delta)}\left(\left(\frac{\sigma_{\min}^{2}}{B^{2}}-1\right)\frac{1}{\frac{1}{\varepsilon}+\frac{1}{\ln(4/\delta)}}-2\ln(4/\delta)\right)^{2}
=18β((σmin2B21)εβε+β2β)2,\displaystyle=\frac{1}{8\beta}\left(\left(\frac{\sigma_{\min}^{2}}{B^{2}}-1\right)\frac{\varepsilon\beta}{\varepsilon+\beta}-2\beta\right)^{2}, (119)

where we have set δ=4/eβ\delta=4/e^{\beta} in the second line. For the first line to follow from Theorem A.2, we also need the condition σmin2B23+2βε\frac{\sigma_{\min}^{2}}{B^{2}}\geq 3+2\frac{\beta}{\varepsilon} to be satisfied. Note that the rows of AcA_{c} have bounded 2\ell_{2}-norm of B0d+1B_{0}\sqrt{d+1}. We now substitute B=B0d+1B=B_{0}\sqrt{d+1} and σmin=σ0n\sigma_{\min}=\sigma_{0}\sqrt{n} to obtain the simplified condition

nd+1(3+2βε)B02σ02,\displaystyle\frac{n}{d+1}\geq\left(3+2\frac{\beta}{\varepsilon}\right)\frac{B_{0}^{2}}{\sigma_{0}^{2}},

where B0B_{0} and σ0\sigma_{0} are constants. Assuming this condition is satisfied, then we pick the sketch size mm as (A-B) which can also be simplified:

m=O(βn2(d+1)2ε2(ε+β)2).\displaystyle m=O\left(\beta\frac{n^{2}}{(d+1)^{2}}\frac{\varepsilon^{2}}{(\varepsilon+\beta)^{2}}\right).

Note that the above arguments are for the privacy of a single sketch (i.e., SkAcS_{k}A_{c}). In the distributed setting where the adversary can attack all of the sketched data S1Ac,,SqAcS_{1}A_{c},\dots,S_{q}A_{c}, we can consider all of the sketched data to be a single sketch with size mqmq. Based on this argument, we can pick the sketch size as

m=O(βqn2(d+1)2ε2(ε+β)2).\displaystyle m=O\left(\frac{\beta}{q}\frac{n^{2}}{(d+1)^{2}}\frac{\varepsilon^{2}}{(\varepsilon+\beta)^{2}}\right).
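A short calculator for the sketch size bound is given below (a sketch under the stated assumptions; it evaluates the exact expression (119) rather than the O(\cdot) form, and the values of \varepsilon, \delta, \sigma_{\min}, B and q are hypothetical placeholders).

```python
import numpy as np

# Calculator for the privacy-driven sketch size bound (119); the epsilon,
# delta, sigma_min, B and q values below are hypothetical placeholders.
def max_sketch_size(eps, delta, sigma_min, B, q=1):
    beta = np.log(4.0 / delta)                 # delta = 4 / e^beta
    ratio = sigma_min**2 / B**2
    if ratio < 3 + 2 * beta / eps:
        raise ValueError("condition sigma_min^2/B^2 >= 3 + 2*beta/eps fails")
    # largest single-sketch size allowed by (119)
    m_single = ((ratio - 1) / (1 / eps + 1 / beta) - 2 * beta) ** 2 / (8 * beta)
    # q sketches are treated as a single sketch of size m*q
    return m_single / q

print(max_sketch_size(eps=1.0, delta=1e-6, sigma_min=200.0, B=1.0, q=10))
```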

Theorem A.2 (Differential privacy for random projections [39])

Fix ε>0\varepsilon>0 and δ(0,1/e)\delta\in(0,1/e). Fix B>0B>0. Fix a positive integer mm and let ww be such that

w2=B2(1+1+εln(4/δ)ε(22mln(4/δ)+2ln(4/δ))).\displaystyle w^{2}=B^{2}\left(1+\frac{1+\frac{\varepsilon}{\ln(4/\delta)}}{\varepsilon}\left(2\sqrt{2m\ln(4/\delta)}+2\ln(4/\delta)\right)\right). (120)

Let AA be an (n×d)(n\times d)-matrix with d<md<m and where each row of AA has bounded 2\ell_{2}-norm of BB. Given that σmin(A)w\sigma_{\min}(A)\geq w, the algorithm that picks an (m×n)(m\times n)-matrix RR whose entries are iid samples from the normal distribution 𝒩(0,1)\mathcal{N}(0,1) and publishes the projection RARA is (ε,δ)(\varepsilon,\delta)-differentially private.

A-C Proofs of Theorems and Lemmas in Section IV

Proof:

The expectation of the difference between the costs f(x¯)f(\bar{x}) and f(x)f(x^{*}) is given by

𝔼[f(x¯)]f(x)\displaystyle\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*}) =𝔼[A(x¯x)+Axb22]f(x)\displaystyle=\operatorname{\mathbb{E}}[\|A(\bar{x}-x^{*})+Ax^{*}-b\|_{2}^{2}]-f(x^{*})
=𝔼[A(x¯x)22+Axb22]f(x)\displaystyle=\operatorname{\mathbb{E}}[\|A(\bar{x}-x^{*})\|_{2}^{2}+\|Ax^{*}-b\|_{2}^{2}]-f(x^{*})
=𝔼[A(x¯x)22],\displaystyle=\operatorname{\mathbb{E}}[\|A(\bar{x}-x^{*})\|_{2}^{2}]\,, (121)

where we have used the orthogonality property of the optimal least squares solution xx^{*} given by the normal equations AT(Axb)=0A^{T}(Ax^{*}-b)=0. Next, we have

𝔼[A(x¯x)22]\displaystyle\operatorname{\mathbb{E}}[\|A(\bar{x}-x^{*})\|_{2}^{2}] =𝔼[1qk=1q(Ax^kAx)22]\displaystyle=\operatorname{\mathbb{E}}\left[\left\|\frac{1}{q}\sum_{k=1}^{q}(A\hat{x}_{k}-Ax^{*})\right\|_{2}^{2}\right]
=1q2𝔼[k=1ql=1qAx^kAx,Ax^lAx]\displaystyle=\frac{1}{q^{2}}\operatorname{\mathbb{E}}\left[\sum_{k=1}^{q}\sum_{l=1}^{q}\langle A\hat{x}_{k}-Ax^{*},A\hat{x}_{l}-Ax^{*}\rangle\right]
=1q2k=1q𝔼[Ax^kAx22]+1q2kl,1k,lq𝔼[Ax^kAx,Ax^lAx]\displaystyle=\frac{1}{q^{2}}\sum_{k=1}^{q}\operatorname{\mathbb{E}}\left[\|A\hat{x}_{k}-Ax^{*}\|_{2}^{2}\right]+\frac{1}{q^{2}}\sum_{k\neq l\,,1\leq k,l\leq q}\operatorname{\mathbb{E}}\left[\langle A\hat{x}_{k}-Ax^{*},A\hat{x}_{l}-Ax^{*}\rangle\right]
=1q𝔼[Ax^Ax22]+q1q𝔼[Ax^]Ax22.\displaystyle=\frac{1}{q}\operatorname{\mathbb{E}}\left[\|A\hat{x}-Ax^{*}\|_{2}^{2}\right]+\frac{q-1}{q}\|\operatorname{\mathbb{E}}[A\hat{x}]-Ax^{*}\|_{2}^{2}\,.

Proof:

The bias of the single sketch estimator can be expanded as follows:

𝔼[Ax^]Ax2\displaystyle\|\operatorname{\mathbb{E}}[A\hat{x}]-Ax^{*}\|_{2} =𝔼[A(ATSTSA)1ATSTS(Ax+b)]Ax2\displaystyle=\|\operatorname{\mathbb{E}}[A(A^{T}S^{T}SA)^{-1}A^{T}S^{T}S(Ax^{*}+b^{\perp})]-Ax^{*}\|_{2}
\displaystyle=\|U\operatorname{\mathbb{E}}[(U^{T}S^{T}SU)^{-1}U^{T}S^{T}Sb^{\perp}]\|_{2}
\displaystyle=\|\operatorname{\mathbb{E}}[(U^{T}S^{T}SU)^{-1}U^{T}S^{T}Sb^{\perp}]\|_{2}
=𝔼[Qz]2,\displaystyle=\|\operatorname{\mathbb{E}}[Qz]\|_{2},

where we define Q:=(UTSTSU)1Q:=(U^{T}S^{T}SU)^{-1} and z:=UTSTSbz:=U^{T}S^{T}Sb^{\perp}. The term 𝔼[Qz]22\|\operatorname{\mathbb{E}}[Qz]\|_{2}^{2} can be upper bounded as follows when conditioned on the event EE:

𝔼[Qz]22\displaystyle\|\operatorname{\mathbb{E}}[Qz]\|_{2}^{2} =𝔼[Qz]T𝔼[Qz]=𝔼S[Qz]T𝔼S[Qz]\displaystyle=\operatorname{\mathbb{E}}[Qz]^{T}\operatorname{\mathbb{E}}[Qz]=\operatorname{\mathbb{E}}_{S}[Qz]^{T}\operatorname{\mathbb{E}}_{S^{\prime}}[Q^{\prime}z^{\prime}]
=𝔼Sk𝔼Sk[zTQQz]\displaystyle=\operatorname{\mathbb{E}}_{S_{k}}\operatorname{\mathbb{E}}_{S_{k}^{\prime}}[z^{T}QQ^{\prime}z^{\prime}]
=12𝔼S𝔼S[(z+z)TQQ(z+z)zTQQzzTQQz]\displaystyle=\frac{1}{2}\operatorname{\mathbb{E}}_{S}\operatorname{\mathbb{E}}_{S^{\prime}}[(z+z^{\prime})^{T}QQ^{\prime}(z+z^{\prime})-z^{T}QQ^{\prime}z-{z^{\prime}}^{T}QQ^{\prime}z^{\prime}]
12𝔼S𝔼S[z+z22(1+ϵ)2(z22+z22)(1ϵ)2]\displaystyle\leq\frac{1}{2}\operatorname{\mathbb{E}}_{S}\operatorname{\mathbb{E}}_{S^{\prime}}[\|z+z^{\prime}\|_{2}^{2}(1+\epsilon)^{2}-(\|z\|_{2}^{2}+\|z^{\prime}\|_{2}^{2})(1-\epsilon)^{2}]
=𝔼S𝔼S[(z222ϵ+z222ϵ+zTz(1+ϵ)2)]\displaystyle=\operatorname{\mathbb{E}}_{S}\operatorname{\mathbb{E}}_{S^{\prime}}[\left(\|z\|_{2}^{2}2\epsilon+\|z^{\prime}\|_{2}^{2}2\epsilon+z^{T}z^{\prime}(1+\epsilon)^{2}\right)]
=4ϵ𝔼[z22]+(1+ϵ)2𝔼[z]22,\displaystyle=4\epsilon\operatorname{\mathbb{E}}[\|z\|_{2}^{2}]+(1+\epsilon)^{2}\|\operatorname{\mathbb{E}}[z]\|_{2}^{2}\,,

where the inequality follows from the inequality (1ϵ)IdQ(1+ϵ)Id(1-\epsilon)I_{d}\preceq Q\preceq(1+\epsilon)I_{d} and some simple bounds for the minimum and maximum eigenvalues of the product of two positive definite matrices. Furthermore, the expectation of zz is equal to zero because 𝔼[z]=𝔼[UTSTSb]=UT𝔼[STS]b=UTb=0\operatorname{\mathbb{E}}[z]=\operatorname{\mathbb{E}}[U^{T}S^{T}Sb^{\perp}]=U^{T}\operatorname{\mathbb{E}}[S^{T}S]b^{\perp}=U^{T}b^{\perp}=0. Hence we obtain the claimed bound 𝔼[Ax^E]Ax24ϵ𝔼[z22|E].\|\operatorname{\mathbb{E}}[A\hat{x}|E]-Ax^{*}\|_{2}\leq\sqrt{4\epsilon\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E]}.

Proof:

For the randomized Hadamard sketch (ROS), the term 𝔼[z22]\operatorname{\mathbb{E}}[\|z\|_{2}^{2}] can be expanded as follows. We will assume that all the expectations are conditioned on the event EE, which we defined earlier as (1ϵ)IdQ(1+ϵ)Id(1-\epsilon)I_{d}\preceq Q\preceq(1+\epsilon)I_{d}.

𝔼[z22]\displaystyle\operatorname{\mathbb{E}}[\|z\|_{2}^{2}] =𝔼[bT1mi=1msisiTUUT1mj=1msjsjTb]\displaystyle=\operatorname{\mathbb{E}}\left[{b^{\perp}}^{T}\frac{1}{m}\sum_{i=1}^{m}s_{i}s_{i}^{T}UU^{T}\frac{1}{m}\sum_{j=1}^{m}s_{j}s_{j}^{T}b^{\perp}\right]
=𝔼[1m2i=1mj=1mbTsisiTUUTsjsjTb]\displaystyle=\operatorname{\mathbb{E}}\left[\frac{1}{m^{2}}\sum_{i=1}^{m}\sum_{j=1}^{m}{b^{\perp}}^{T}s_{i}s_{i}^{T}UU^{T}s_{j}s_{j}^{T}b^{\perp}\right]
=1m21i=jmbT𝔼[sisiTUUTsjsjT]b+1m2ij, 1i,jmbT𝔼[sisiT]UUT𝔼[sjsjT]b\displaystyle=\frac{1}{m^{2}}\sum_{1\leq i=j\leq m}{b^{\perp}}^{T}\operatorname{\mathbb{E}}\left[s_{i}s_{i}^{T}UU^{T}s_{j}s_{j}^{T}\right]b^{\perp}+\frac{1}{m^{2}}\sum_{i\neq j,\,1\leq i,j\leq m}{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{i}s_{i}^{T}]UU^{T}\operatorname{\mathbb{E}}[s_{j}s_{j}^{T}]b^{\perp}
=mm2bT𝔼[s1s1TUUTs1s1T]b+1m2ij, 1i,jmbTInUUTInb\displaystyle=\frac{m}{m^{2}}\,{b^{\perp}}^{T}\operatorname{\mathbb{E}}\left[s_{1}s_{1}^{T}UU^{T}s_{1}s_{1}^{T}\right]b^{\perp}+\frac{1}{m^{2}}\sum_{i\neq j,\,1\leq i,j\leq m}{b^{\perp}}^{T}I_{n}UU^{T}I_{n}b^{\perp}
=1mbT𝔼[s1s1TUUTs1s1T]b,\displaystyle=\frac{1}{m}\,{b^{\perp}}^{T}\operatorname{\mathbb{E}}\left[s_{1}s_{1}^{T}UU^{T}s_{1}s_{1}^{T}\right]b^{\perp},

where we have used the independence of sis_{i} and sjs_{j}, iji\neq j. This is true because of the assumption that the matrix PP corresponds to sampling with replacement.

bT𝔼[s1s1TUUTs1s1T]b\displaystyle{b^{\perp}}^{T}\operatorname{\mathbb{E}}\left[s_{1}s_{1}^{T}UU^{T}s_{1}s_{1}^{T}\right]b^{\perp} =𝔼[(s1TUUTs1)(s1TbbTs1)]\displaystyle=\operatorname{\mathbb{E}}[(s_{1}^{T}UU^{T}s_{1})(s_{1}^{T}b^{\perp}{b^{\perp}}^{T}s_{1})]
=𝔼[(s1TUUTs1)(bTs1)2]\displaystyle=\operatorname{\mathbb{E}}[(s_{1}^{T}UU^{T}s_{1})({b^{\perp}}^{T}s_{1})^{2}]
=1ni=1n𝔼[(hiTDUUTDhi)(bTDhi)2],\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\operatorname{\mathbb{E}}[(h_{i}^{T}DUU^{T}Dh_{i})({b^{\perp}}^{T}Dh_{i})^{2}],

where the row vector hiTh_{i}^{T} corresponds to the ii’th row of the Hadamard matrix HH. We also note that the expectation in the last line is with respect to the randomness of DD.

Let us define rr to be the column vector containing the diagonal entries of the diagonal matrix DD, that is, r:=[D11,D22,,Dnn]Tr:=[D_{11},D_{22},\dots,D_{nn}]^{T}. Then, the vector DhiDh_{i} is equivalent to diag(hi)r\operatorname{diag}(h_{i})r where diag(hi)\operatorname{diag}(h_{i}) is the diagonal matrix with the entries of hih_{i} on its diagonal.

1ni=1n𝔼[(hiTDUUTDhi)(bTDhi)2]\displaystyle\frac{1}{n}\sum_{i=1}^{n}\operatorname{\mathbb{E}}[(h_{i}^{T}DUU^{T}Dh_{i})({b^{\perp}}^{T}Dh_{i})^{2}] =1ni=1n𝔼[(rTdiag(hi)UUTdiag(hi)r)(bTdiag(hi)r)2]\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\operatorname{\mathbb{E}}[(r^{T}\operatorname{diag}(h_{i})UU^{T}\operatorname{diag}(h_{i})r)({b^{\perp}}^{T}\operatorname{diag}(h_{i})r)^{2}]
=1ni=1n𝔼[bTdiag(hi)r(rTPr)rTdiag(hi)b]\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\operatorname{\mathbb{E}}[{b^{\perp}}^{T}\operatorname{diag}(h_{i})r(r^{T}Pr)r^{T}\operatorname{diag}(h_{i})b^{\perp}]
=1ni=1nbTdiag(hi)𝔼[r(rTPr)rT]diag(hi)b,\displaystyle=\frac{1}{n}\sum_{i=1}^{n}{b^{\perp}}^{T}\operatorname{diag}(h_{i})\operatorname{\mathbb{E}}[r(r^{T}Pr)r^{T}]\operatorname{diag}(h_{i})b^{\perp},

where we have defined P:=diag(hi)UUTdiag(hi)P:=\operatorname{diag}(h_{i})UU^{T}\operatorname{diag}(h_{i}). It follows that 𝔼[r(rTPr)rT]=2P2diag(P)+tr(P)In\operatorname{\mathbb{E}}[r(r^{T}Pr)r^{T}]=2P-2\operatorname{diag}(P)+\operatorname{tr}(P)I_{n}. Here, diag(P)\operatorname{diag}(P) is used to refer to the diagonal matrix with the diagonal entries of PP as its diagonal. The trace of PP can be easily computed using the cyclic property of matrix trace as tr(P)=tr(diag(hi)UUTdiag(hi))=tr(UTdiag(hi)diag(hi)U)=tr(UTU)=tr(Id)=d\operatorname{tr}(P)=\operatorname{tr}(\operatorname{diag}(h_{i})UU^{T}\operatorname{diag}(h_{i}))=\operatorname{tr}(U^{T}\operatorname{diag}(h_{i})\operatorname{diag}(h_{i})U)=\operatorname{tr}(U^{T}U)=\operatorname{tr}(I_{d})=d. Next, we note that the term diag(P)\operatorname{diag}(P) can be simplified as diag(P)jj=u~j22\operatorname{diag}(P)_{jj}=\|\tilde{u}_{j}\|_{2}^{2} where u~jT\tilde{u}_{j}^{T} is the jj’th row of UU. This leads to

bTdiag(P)b\displaystyle{b^{\perp}}^{T}\operatorname{diag}(P)b^{\perp} =j=1n(bj)2u~j22j=1n(bj)2miniu~i22=b22miniu~i22.\displaystyle=\sum_{j=1}^{n}(b_{j}^{\perp})^{2}\|\tilde{u}_{j}\|_{2}^{2}\geq\sum_{j=1}^{n}(b_{j}^{\perp})^{2}\min_{i}\|\tilde{u}_{i}\|_{2}^{2}=\|b^{\perp}\|_{2}^{2}\min_{i}\|\tilde{u}_{i}\|_{2}^{2}\,.

Going back to 𝔼[z22]\operatorname{\mathbb{E}}[\|z\|_{2}^{2}], we have

𝔼[z22]\displaystyle\operatorname{\mathbb{E}}[\|z\|_{2}^{2}] =1mnbTndiag(hi)(2P2diag(P)+tr(P)In)diag(hi)b\displaystyle=\frac{1}{mn}{b^{\perp}}^{T}n\operatorname{diag}(h_{i})\left(2P-2\operatorname{diag}(P)+\operatorname{tr}(P)I_{n}\right)\operatorname{diag}(h_{i})b^{\perp}
=dmb222mbTdiag(P)b\displaystyle=\frac{d}{m}\|b^{\perp}\|_{2}^{2}-\frac{2}{m}{b^{\perp}}^{T}\operatorname{diag}(P)b^{\perp}
dmb222mb22miniu~i22\displaystyle\leq\frac{d}{m}\|b^{\perp}\|_{2}^{2}-\frac{2}{m}\|b^{\perp}\|_{2}^{2}\min_{i}\|\tilde{u}_{i}\|_{2}^{2}
=1mb22(d2miniu~i22)=dm(12miniu~i22d)f(x).\displaystyle=\frac{1}{m}\|b^{\perp}\|_{2}^{2}(d-2\min_{i}\|\tilde{u}_{i}\|_{2}^{2})=\frac{d}{m}\left(1-\frac{2\min_{i}\|\tilde{u}_{i}\|_{2}^{2}}{d}\right)f(x^{*}).

Proof:

We will assume that all the expectations are conditioned on the event EE, which we defined earlier as (1ϵ)IdQ(1+ϵ)Id(1-\epsilon)I_{d}\preceq Q\preceq(1+\epsilon)I_{d}. For uniform sampling with replacement, we have

𝔼[z22]\displaystyle\operatorname{\mathbb{E}}[\|z\|_{2}^{2}] =1m2𝔼[bTi=1msisiTUUTj=1msjsjTb]\displaystyle=\frac{1}{m^{2}}\,\operatorname{\mathbb{E}}\left[{b^{\perp}}^{T}\sum_{i=1}^{m}s_{i}s_{i}^{T}UU^{T}\sum_{j=1}^{m}s_{j}s_{j}^{T}b^{\perp}\right]
=1m2𝔼[i=1mj=1mbTsisiTUUTsjsjTb]\displaystyle=\frac{1}{m^{2}}\,\operatorname{\mathbb{E}}\left[\sum_{i=1}^{m}\sum_{j=1}^{m}{b^{\perp}}^{T}s_{i}s_{i}^{T}UU^{T}s_{j}s_{j}^{T}b^{\perp}\right]
=1m21i=jmbT𝔼[sisiTUUTsjsjT]b+1m2ij, 1i,jmbT𝔼[sisiT]UUT𝔼[sjsjT]b\displaystyle=\frac{1}{m^{2}}\sum_{1\leq i=j\leq m}{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{i}s_{i}^{T}UU^{T}s_{j}s_{j}^{T}]b^{\perp}+\frac{1}{m^{2}}\sum_{i\neq j,\,1\leq i,j\leq m}{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{i}s_{i}^{T}]UU^{T}\operatorname{\mathbb{E}}[s_{j}s_{j}^{T}]b^{\perp}
=1mbT𝔼[s1s1TUUTs1s1T]b+1m2ij, 1i,jmbTInUUTInb\displaystyle=\frac{1}{m}\,{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{1}s_{1}^{T}UU^{T}s_{1}s_{1}^{T}]b^{\perp}+\frac{1}{m^{2}}\,\sum_{i\neq j,\,1\leq i,j\leq m}{b^{\perp}}^{T}I_{n}UU^{T}I_{n}b^{\perp}
=1mbT𝔼[s1s1TUUTs1s1T]b\displaystyle=\frac{1}{m}\,{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{1}s_{1}^{T}UU^{T}s_{1}s_{1}^{T}]b^{\perp}
=1mbTn21ni=1neieiTUUTeieiTb\displaystyle=\frac{1}{m}\,{b^{\perp}}^{T}n^{2}\frac{1}{n}\sum_{i=1}^{n}e_{i}e_{i}^{T}UU^{T}e_{i}e_{i}^{T}b^{\perp}
=nmi=1nbi2u~i22\displaystyle=\frac{n}{m}\,\sum_{i=1}^{n}{b_{i}^{\perp}}^{2}\|\tilde{u}_{i}\|_{2}^{2}
nmi=1nbi2maxju~j22=μdmf(x).\displaystyle\leq\frac{n}{m}\,\sum_{i=1}^{n}{b_{i}^{\perp}}^{2}\max_{j}\|\tilde{u}_{j}\|_{2}^{2}=\frac{\mu d}{m}f(x^{*})\,.

Next, for uniform sampling without replacement, the rows sis_{i} and sjs_{j} are not independent which can be seen by noting that given sis_{i}, we know that sjs_{j} will have its nonzero entry at a different place than sis_{i}. Hence, differently from uniform sampling with replacement, the following term will not be zero:

1m2ij, 1i,jmbT𝔼[sisiTUUTsjsjT]b\displaystyle\frac{1}{m^{2}}\sum_{i\neq j,\,1\leq i,j\leq m}{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{i}s_{i}^{T}UU^{T}s_{j}s_{j}^{T}]b^{\perp} =m2mm2bT𝔼[s1s1TUUTs2s2T]b\displaystyle=\frac{m^{2}-m}{m^{2}}{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{1}s_{1}^{T}UU^{T}s_{2}s_{2}^{T}]b^{\perp}
=m1mbTn21n2nij,1i,jneieiTUUTejejTb\displaystyle=\frac{m-1}{m}{b^{\perp}}^{T}n^{2}\frac{1}{n^{2}-n}\sum_{i\neq j,1\leq i,j\leq n}e_{i}e_{i}^{T}UU^{T}e_{j}e_{j}^{T}b^{\perp}
=m1mnn1bTij,1i,jneiu~iTu~jejTb\displaystyle=\frac{m-1}{m}\frac{n}{n-1}{b^{\perp}}^{T}\sum_{i\neq j,1\leq i,j\leq n}e_{i}\tilde{u}_{i}^{T}\tilde{u}_{j}e_{j}^{T}b^{\perp}
\displaystyle=\frac{m-1}{m}\frac{n}{n-1}{b^{\perp}}^{T}\big{(}UU^{T}-\operatorname{diag}(\|\tilde{u}_{i}\|_{2}^{2})\big{)}b^{\perp}
=m1mnn1i=1nbi2u~i22.\displaystyle=-\frac{m-1}{m}\frac{n}{n-1}\sum_{i=1}^{n}{b_{i}^{\perp}}^{2}\|\tilde{u}_{i}\|_{2}^{2}\,.

It follows that for uniform sampling without replacement, we obtain

𝔼[z22]\displaystyle\operatorname{\mathbb{E}}[\|z\|_{2}^{2}] =(nmm1mnn1)i=1nbi2u~i22\displaystyle=\left(\frac{n}{m}-\frac{m-1}{m}\frac{n}{n-1}\right)\sum_{i=1}^{n}{b_{i}^{\perp}}^{2}\|\tilde{u}_{i}\|_{2}^{2}
=nmnmn1i=1nbi2u~i22\displaystyle=\frac{n}{m}\frac{n-m}{n-1}\sum_{i=1}^{n}{b_{i}^{\perp}}^{2}\|\tilde{u}_{i}\|_{2}^{2}
nmnmn1f(x)maxiu~i22\displaystyle\leq\frac{n}{m}\frac{n-m}{n-1}f(x^{*})\max_{i}\|\tilde{u}_{i}\|_{2}^{2}
=μdmnmn1f(x).\displaystyle=\frac{\mu d}{m}\frac{n-m}{n-1}f(x^{*})\,.

Proof:

We consider leverage score sampling with replacement. The rows sis_{i}, sjs_{j} iji\neq j are independent because sampling is with replacement. We will assume that all the expectations are conditioned on the event EE, which we defined earlier as (1ϵ)IdQ(1+ϵ)Id(1-\epsilon)I_{d}\preceq Q\preceq(1+\epsilon)I_{d}. For leverage score sampling, the term 𝔼[z22]\operatorname{\mathbb{E}}[\|z\|_{2}^{2}] is upper bounded as follows:

𝔼[z22]\displaystyle\operatorname{\mathbb{E}}[\|z\|_{2}^{2}] =1m2𝔼[bTi=1msisiTUUTj=1msjsjTb]\displaystyle=\frac{1}{m^{2}}\,\operatorname{\mathbb{E}}\left[{b^{\perp}}^{T}\sum_{i=1}^{m}s_{i}s_{i}^{T}UU^{T}\sum_{j=1}^{m}s_{j}s_{j}^{T}b^{\perp}\right]
=1m2𝔼[i=1mj=1mbTsisiTUUTsjsjTb]\displaystyle=\frac{1}{m^{2}}\,\operatorname{\mathbb{E}}\left[\sum_{i=1}^{m}\sum_{j=1}^{m}{b^{\perp}}^{T}s_{i}s_{i}^{T}UU^{T}s_{j}s_{j}^{T}b^{\perp}\right]
=1m21i=jmbT𝔼[sisiTUUTsjsjT]b+1m2ij, 1i,jmbT𝔼[sisiT]UUT𝔼[sjsjT]b\displaystyle=\frac{1}{m^{2}}\sum_{1\leq i=j\leq m}{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{i}s_{i}^{T}UU^{T}s_{j}s_{j}^{T}]b^{\perp}+\frac{1}{m^{2}}\sum_{i\neq j,\,1\leq i,j\leq m}{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{i}s_{i}^{T}]UU^{T}\operatorname{\mathbb{E}}[s_{j}s_{j}^{T}]b^{\perp}
=1mbTi=1niddieieiTUUTdieieiTb+m2mm2bTInUUTInb\displaystyle=\frac{1}{m}\,{b^{\perp}}^{T}\sum_{i=1}^{n}\frac{\ell_{i}}{d}\frac{d}{\ell_{i}}e_{i}e_{i}^{T}UU^{T}\frac{d}{\ell_{i}}e_{i}e_{i}^{T}b^{\perp}+\frac{m^{2}-m}{m^{2}}{b^{\perp}}^{T}I_{n}UU^{T}I_{n}b^{\perp}
=1mbTi=1ndiieieiTb\displaystyle=\frac{1}{m}\,{b^{\perp}}^{T}\sum_{i=1}^{n}\frac{d}{\ell_{i}}\ell_{i}e_{i}e_{i}^{T}b^{\perp}
=dmb22=dmf(x).\displaystyle=\frac{d}{m}\,\|b^{\perp}\|_{2}^{2}=\frac{d}{m}f(x^{*})\,.
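The exact identity \operatorname{\mathbb{E}}[\|z\|_{2}^{2}]=\frac{d}{m}f(x^{*}) for leverage score sampling can be verified empirically with the minimal sketch below (hypothetical sizes; the rows of S are sampled with replacement according to the leverage scores and rescaled by \sqrt{d/\ell_{i}}/\sqrt{m}).

```python
import numpy as np

# Empirical check of E[||z||^2] = (d/m) f(x*) for leverage score sampling with
# replacement (hypothetical sizes; z = U^T S^T S b_perp as in the proof).
rng = np.random.default_rng(4)
n, d, m, trials = 3000, 15, 150, 2000

A = rng.standard_normal((n, d)) * rng.uniform(0.2, 3.0, size=(n, 1))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)
U, _, _ = np.linalg.svd(A, full_matrices=False)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
b_perp = b - A @ x_star
f_star = b_perp @ b_perp

lev = np.sum(U**2, axis=1)            # leverage scores, summing to d
probs = lev / d
probs = probs / probs.sum()           # guard against floating point drift

acc = 0.0
for _ in range(trials):
    idx = rng.choice(n, size=m, p=probs)          # sample rows with replacement
    scale = np.sqrt(d / lev[idx]) / np.sqrt(m)    # rows of S: sqrt(d/l_i) e_i^T / sqrt(m)
    Sb = scale * b_perp[idx]
    SU = scale[:, None] * U[idx, :]
    z = SU.T @ Sb
    acc += z @ z

print("empirical E[||z||^2] :", acc / trials)
print("predicted (d/m) f(x*):", d / m * f_star)
```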

Proof:

We will begin by using the results from Table 5 of [3] for selecting the sketch size:

m\displaystyle m =O(d+log(n)ϵ2log(d/δ)) for rand. Hadamard sketch\displaystyle=O\Big{(}\frac{d+\log(n)}{\epsilon^{2}}\log(d/\delta)\Big{)}\mbox{ for rand. Hadamard sketch}
m\displaystyle m =O(μdϵ2log(d/δ)) for uniform sampling\displaystyle=O\Big{(}\frac{\mu d}{\epsilon^{2}}\log(d/\delta)\Big{)}\mbox{ for uniform sampling}
m\displaystyle m =O(dϵ2log(d/δ)) for leverage score sampling\displaystyle=O\Big{(}\frac{d}{\epsilon^{2}}\log(d/\delta)\Big{)}\mbox{ for leverage score sampling} (122)

where μ\mu is the row coherence of UU as defined before. When the sketch sizes are selected according to the formulas above, the subspace embedding property given by

UTSTSUId2ϵ\displaystyle\|U^{T}S^{T}SU-I_{d}\|_{2}\leq\epsilon (123)

is satisfied with probability at least 1δ1-\delta. Note that the subspace embedding property can be rewritten as

(1ϵ)IdUTSTSU(1+ϵ)Id.\displaystyle(1-\epsilon)I_{d}\preceq U^{T}S^{T}SU\preceq(1+\epsilon)I_{d}\,. (124)

This implies the following relation for the inverse matrix (UTSTSU)1(U^{T}S^{T}SU)^{-1}:

11+ϵId(UTSTSU)111ϵId.\displaystyle\frac{1}{1+\epsilon}I_{d}\preceq(U^{T}S^{T}SU)^{-1}\preceq\frac{1}{1-\epsilon}I_{d}\,. (125)

Observe that 1ϵ11+ϵ1-\epsilon\leq\frac{1}{1+\epsilon} for any ϵ\epsilon and that 11ϵ1+2ϵ\frac{1}{1-\epsilon}\leq 1+2\epsilon for 0ϵ0.50\leq\epsilon\leq 0.5. Assuming that 0ϵ0.50\leq\epsilon\leq 0.5, we have

(12ϵ)Id(UTSTSU)1(1+2ϵ)Id.\displaystyle(1-2\epsilon)I_{d}\preceq(U^{T}S^{T}SU)^{-1}\preceq(1+2\epsilon)I_{d}\,. (126)

We can rescale ϵϵ/2\epsilon\leftarrow\epsilon/2 so that

(1ϵ)Id(UTSTSU)1(1+ϵ)Id\displaystyle(1-\epsilon)I_{d}\preceq(U^{T}S^{T}SU)^{-1}\preceq(1+\epsilon)I_{d} (127)

and the effect of this rescaling will be hidden in the OO notation for the sketch size formulas.

We will define the events E_{k}, k=1,\dots,q as (1-\epsilon)I_{d}\preceq(U^{T}S_{k}^{T}S_{k}U)^{-1}\preceq(1+\epsilon)I_{d} and it follows that when the sketch sizes are selected according to (122), we will have

P(Ek)1δ,k=1,,q.\displaystyle P(E_{k})\geq 1-\delta\,,k=1,\dots,q\,. (128)

Now, we find a simpler expression for the error of the single-sketch estimator:

\displaystyle A\hat{x}-Ax^{*}=A(A^{T}S^{T}SA)^{-1}A^{T}S^{T}Sb-Ax^{*}
\displaystyle=A(A^{T}S^{T}SA)^{-1}A^{T}S^{T}S(b^{\perp}+Ax^{*})-Ax^{*}
\displaystyle=A(A^{T}S^{T}SA)^{-1}A^{T}S^{T}Sb^{\perp}
=U(UTSTSU)1UTSTSb\displaystyle=U(U^{T}S^{T}SU)^{-1}U^{T}S^{T}Sb^{\perp}
=UQz\displaystyle=UQz (129)

where we defined Q:=(UTSTSU)1Q:=(U^{T}S^{T}SU)^{-1}, and z:=UTSTSbz:=U^{T}S^{T}Sb^{\perp}. The variance term is equal to

𝔼[Ax^Ax22]\displaystyle\operatorname{\mathbb{E}}[\|A\hat{x}-Ax^{*}\|_{2}^{2}] =𝔼[UQz22]=𝔼[Qz22].\displaystyle=\operatorname{\mathbb{E}}[\|UQz\|_{2}^{2}]=\operatorname{\mathbb{E}}[\|Qz\|_{2}^{2}]\,. (130)

Conditioned on the event EE, we can bound the expectation 𝔼[Qz22|E]\operatorname{\mathbb{E}}[\|Qz\|_{2}^{2}|E] as follows:

𝔼[Qz22|E]\displaystyle\operatorname{\mathbb{E}}[\|Qz\|_{2}^{2}|E] =𝔼[zTQTQz|E]\displaystyle=\operatorname{\mathbb{E}}[z^{T}Q^{T}Qz|E]
𝔼[(1+ϵ)2z22|E]\displaystyle\leq\operatorname{\mathbb{E}}[(1+\epsilon)^{2}\|z\|_{2}^{2}|E]
=(1+ϵ)2𝔼[z22|E]\displaystyle=(1+\epsilon)^{2}\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E] (131)

where we have used QTQ(1+ϵ)2IdQ^{T}Q\preceq(1+\epsilon)^{2}I_{d}, which follows from all eigenvalues of QQ being less than (1+ϵ)(1+\epsilon) when conditioned on the event EE. Next, from Lemma IV.1, we obtain the following bound for the expected error

𝔼[f(x¯)f(x)|E1,,Eq]\displaystyle\operatorname{\mathbb{E}}[f(\bar{x})-f(x^{*})|E_{1},\dots,E_{q}] =1q𝔼[Ax^Ax22|E]+q1q𝔼[Ax^AxE]22\displaystyle=\frac{1}{q}\operatorname{\mathbb{E}}\left[\|A\hat{x}-Ax^{*}\|_{2}^{2}|E\right]+\frac{q-1}{q}\|\operatorname{\mathbb{E}}[A\hat{x}-Ax^{*}|E]\|_{2}^{2}
\displaystyle\leq\frac{1}{q}(1+\epsilon)^{2}\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E]+\frac{q-1}{q}4\epsilon\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E]\,. (132)

Using Markov’s inequality gives us the following probability bound:

P(f(x¯)f(x)γ|E1,,Eq)\displaystyle P(f(\bar{x})-f(x^{*})\geq\gamma|E_{1},\dots,E_{q}) 1γ𝔼[f(x¯)f(x)|E1,,Eq]1γq((1+ϵ)2+4ϵ(q1))𝔼[z22|E].\displaystyle\leq\frac{1}{\gamma}\operatorname{\mathbb{E}}[f(\bar{x})-f(x^{*})|E_{1},\dots,E_{q}]\leq\frac{1}{\gamma q}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E]\,. (133)

The unconditioned probability can be computed as

P(f(x¯)f(x)γ)\displaystyle P(f(\bar{x})-f(x^{*})\geq\gamma) =P(f(x¯)f(x)γ|E1,,Eq)P(k=1qEk)+P(f(x¯)f(x)γ|k=1qEkC)P(k=1qEkC)\displaystyle=P(f(\bar{x})-f(x^{*})\geq\gamma|E_{1},\dots,E_{q})P(\cap_{k=1}^{q}E_{k})+P(f(\bar{x})-f(x^{*})\geq\gamma|\cup_{k=1}^{q}E_{k}^{C})P(\cup_{k=1}^{q}E_{k}^{C})
1γq((1+ϵ)2+4ϵ(q1))𝔼[z22|E]+qδ\displaystyle\leq\frac{1}{\gamma q}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E]+q\delta (134)

where we have used P(k=1qEkC)qδP(\cup_{k=1}^{q}E_{k}^{C})\leq q\delta and P(k=1qEk)1P(\cap_{k=1}^{q}E_{k})\leq 1. We can obtain a bound for the relative error as follows:

P\displaystyle P (f(x¯)f(x)1+γf(x))1γq((1+ϵ)2+4ϵ(q1))𝔼[z22|E]+qδ.\displaystyle\Big{(}\frac{f(\bar{x})}{f(x^{*})}\geq 1+\frac{\gamma}{f(x^{*})}\Big{)}\leq\frac{1}{\gamma q}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E]+q\delta\,. (135)

Scaling γγf(x)\gamma\leftarrow\gamma f(x^{*}) leads to

P\displaystyle P (f(x¯)f(x)1+γ)1qδ1γf(x)q((1+ϵ)2+4ϵ(q1))𝔼[z22|E].\displaystyle\Big{(}\frac{f(\bar{x})}{f(x^{*})}\leq 1+\gamma\Big{)}\geq 1-q\delta-\frac{1}{\gamma f(x^{*})q}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E]\,. (136)

We now plug the bounds for \operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E] from Lemmas IV.3, IV.4, and IV.5 into the above bound and obtain the following (a small numerical helper follows the list):

  • Randomized Hadamard sketch:

    P(f(x¯)f(x)1+γ)1qδdqγm((1+ϵ)2+4ϵ(q1)).\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}\leq 1+\gamma\Big{)}\geq 1-q\delta-\frac{d}{q\gamma m}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\,. (137)
  • Uniform sampling with replacement:

    P(f(x¯)f(x)1+γ)1qδμdqγm((1+ϵ)2+4ϵ(q1)).\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}\leq 1+\gamma\Big{)}\geq 1-q\delta-\frac{\mu d}{q\gamma m}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\,. (138)
  • Uniform sampling without replacement:

    P(f(x¯)f(x)1+γ)1qδμdqγmnmn1((1+ϵ)2+4ϵ(q1)).\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}\leq 1+\gamma\Big{)}\geq 1-q\delta-\frac{\mu d}{q\gamma m}\frac{n-m}{n-1}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}. (139)
  • Leverage score sampling:

    P(f(x¯)f(x)1+γ)1qδdqγm((1+ϵ)2+4ϵ(q1)).\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}\leq 1+\gamma\Big{)}\geq 1-q\delta-\frac{d}{q\gamma m}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\,. (140)
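The four success-probability lower bounds above differ only in the coherence-dependent factor multiplying \frac{1}{q\gamma m}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}; the small helper below evaluates them (all parameter values passed in are hypothetical).

```python
# Success-probability lower bounds from the four cases above (a small helper;
# all parameter values passed in are hypothetical).
def success_prob_lower_bound(sketch, n, d, m, q, gamma, eps, delta, mu=1.0):
    factor = {"hadamard": d,
              "uniform_with_replacement": mu * d,
              "uniform_without_replacement": mu * d * (n - m) / (n - 1),
              "leverage": d}[sketch]
    return 1 - q * delta - factor / (q * gamma * m) * ((1 + eps) ** 2 + 4 * eps * (q - 1))

print(success_prob_lower_bound("hadamard", n=10**6, d=100, m=10**4,
                               q=20, gamma=0.1, eps=0.05, delta=1e-3))
```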

A-D Proofs of Theorems and Lemmas in Section V

Proof:

The optimal update direction is given by

Δt=((Ht1/2)THt1/2)1gt=Ht1gt\displaystyle\Delta_{t}^{*}=((H_{t}^{1/2})^{T}H_{t}^{1/2})^{-1}g_{t}=H_{t}^{-1}g_{t}

and the estimate update direction due to a single sketch is given by

Δ^t,k=αs((Ht1/2)TSt,kTSt,kHt1/2)1gt.\displaystyle\hat{\Delta}_{t,k}=\alpha_{s}((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2})^{-1}g_{t}.

where αs\alpha_{s}\in\mathbb{R} is the step size scaling factor to be determined. Letting St,kS_{t,k} be a Gaussian sketch, the bias can be written as

𝔼[Ht1/2(Δ^t,kΔt)]\displaystyle\operatorname{\mathbb{E}}[H_{t}^{1/2}(\hat{\Delta}_{t,k}-\Delta_{t}^{*})] =𝔼[αsHt1/2((Ht1/2)TSt,kTSt,kHt1/2)1gtHt1/2Ht1gt]\displaystyle=\operatorname{\mathbb{E}}[\alpha_{s}H_{t}^{1/2}((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2})^{-1}g_{t}-H_{t}^{1/2}H_{t}^{-1}g_{t}]
=αsHt1/2𝔼[((Ht1/2)TSt,kTSt,kHt1/2)1]gtHt1/2Ht1gt\displaystyle=\alpha_{s}H_{t}^{1/2}\operatorname{\mathbb{E}}[((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2})^{-1}]g_{t}-H_{t}^{1/2}H_{t}^{-1}g_{t}
=αsθ1Ht1/2((Ht1/2)THt1/2)1gtHt1/2Ht1gt\displaystyle=\alpha_{s}\theta_{1}H_{t}^{1/2}((H_{t}^{1/2})^{T}H_{t}^{1/2})^{-1}g_{t}-H_{t}^{1/2}H_{t}^{-1}g_{t}
=(αsθ11)Ht1/2Ht1gt.\displaystyle=\left(\alpha_{s}\theta_{1}-1\right)H_{t}^{1/2}H_{t}^{-1}g_{t}\,.

In the third line, we plug in the mean of ((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2})^{-1}, which follows an inverse Wishart distribution (see Lemma A.3). This calculation shows that the single sketch estimator gives an unbiased update direction for \alpha_{s}=1/\theta_{1}.

The variance analysis is as follows:

𝔼[Ht1/2(Δ^t,kΔt)22]\displaystyle\operatorname{\mathbb{E}}[\|H_{t}^{1/2}(\hat{\Delta}_{t,k}-\Delta_{t}^{*})\|_{2}^{2}] =𝔼[Δ^t,kTHtΔ^t,k+ΔtTHtΔt2ΔtTHtΔ^t,k]\displaystyle=\operatorname{\mathbb{E}}[\hat{\Delta}_{t,k}^{T}H_{t}\hat{\Delta}_{t,k}+{\Delta_{t}^{*}}^{T}H_{t}\Delta_{t}^{*}-2{\Delta_{t}^{*}}^{T}H_{t}\hat{\Delta}_{t,k}]
=αs2gtT𝔼[((Ht1/2)TSt,kTSt,kHt1/2)1Ht((Ht1/2)TSt,kTSt,kHt1/2)1]gt+(12αsθ1)gtTHt1gt.\displaystyle=\alpha_{s}^{2}g_{t}^{T}\operatorname{\mathbb{E}}[((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2})^{-1}H_{t}((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2})^{-1}]g_{t}+\left(1-2\alpha_{s}\theta_{1}\right)g_{t}^{T}H_{t}^{-1}g_{t}.

Plugging Ht1/2=UΣVTH_{t}^{1/2}=U\Sigma V^{T} into the first term and assuming Ht1/2H_{t}^{1/2} has full column rank, the expectation term becomes

𝔼[((Ht1/2)TSt,kTSt,kHt1/2)1Ht((Ht1/2)TSt,kTSt,kHt1/2)1]\displaystyle\operatorname{\mathbb{E}}[((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2})^{-1}H_{t}((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2})^{-1}] =VΣ1𝔼[(UTSt,kTSt,kU)2]Σ1VT\displaystyle=V\Sigma^{-1}\operatorname{\mathbb{E}}[(U^{T}S_{t,k}^{T}S_{t,k}U)^{-2}]\Sigma^{-1}V^{T}
=VΣ1(θ2Id)Σ1VT\displaystyle=V\Sigma^{-1}(\theta_{2}I_{d})\Sigma^{-1}V^{T}
=θ2VΣ2VT,\displaystyle=\theta_{2}V\Sigma^{-2}V^{T},

where the second line follows from Lemma A.3. Because H_{t}^{-1}=V\Sigma^{-2}V^{T}, the variance becomes:

𝔼[Ht1/2(Δ^t,kΔt)22]\displaystyle\operatorname{\mathbb{E}}[\|H_{t}^{1/2}(\hat{\Delta}_{t,k}-\Delta_{t}^{*})\|_{2}^{2}] =(αs2θ2+12αsθ1)gtTVΣ2VTgt=(αs2θ2+12αsθ1)Σ1VTgt22.\displaystyle=\left(\alpha_{s}^{2}\theta_{2}+1-2\alpha_{s}\theta_{1}\right)g_{t}^{T}V\Sigma^{-2}V^{T}g_{t}=\left(\alpha_{s}^{2}\theta_{2}+1-2\alpha_{s}\theta_{1}\right)\|\Sigma^{-1}V^{T}g_{t}\|_{2}^{2}\,.

It follows that the variance is minimized when αs\alpha_{s} is chosen as αs=θ1/θ2\alpha_{s}=\theta_{1}/\theta_{2}. ∎

Lemma A.3 ([40])

For the Gaussian sketch matrix Sm×nS\in\mathbb{R}^{m\times n} with i.i.d. entries distributed as 𝒩(0,1/m)\mathcal{N}(0,1/\sqrt{m}) where mdm\geq d, and for Un×dU\in\mathbb{R}^{n\times d} with UTU=IdU^{T}U=I_{d}, the following are true:

𝔼[(UTSTSU)1]\displaystyle\operatorname{\mathbb{E}}[(U^{T}S^{T}SU)^{-1}] =θ1Id,\displaystyle=\theta_{1}I_{d},
𝔼[(UTSTSU)2]\displaystyle\operatorname{\mathbb{E}}[(U^{T}S^{T}SU)^{-2}] =θ2Id,\displaystyle=\theta_{2}I_{d}, (141)

where θ1\theta_{1} and θ2\theta_{2} are defined as

θ1\displaystyle\theta_{1} mmd1,\displaystyle\coloneqq\frac{m}{m-d-1},
θ2\displaystyle\theta_{2} m2(m1)(md)(md1)(md3).\displaystyle\coloneqq\frac{m^{2}(m-1)}{(m-d)(m-d-1)(m-d-3)}. (142)
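The moments in Lemma A.3 can be verified by simulation, and they give the two step-size scalings used above: \alpha_{s}=1/\theta_{1} for an unbiased update and \alpha_{s}=\theta_{1}/\theta_{2} for minimum variance. The following minimal sketch (hypothetical m and d) uses the fact that SU has i.i.d. Gaussian entries when U has orthonormal columns.

```python
import numpy as np

# Monte Carlo check of the inverse Wishart moments in Lemma A.3 and the
# resulting step size scalings; m and d below are hypothetical.
rng = np.random.default_rng(5)
d, m, trials = 8, 80, 2000

inv1 = np.zeros((d, d))
inv2 = np.zeros((d, d))
for _ in range(trials):
    # SU has i.i.d. N(0, 1/m) entries when U has orthonormal columns.
    SU = rng.standard_normal((m, d)) / np.sqrt(m)
    Winv = np.linalg.inv(SU.T @ SU)
    inv1 += Winv
    inv2 += Winv @ Winv

theta1 = m / (m - d - 1)
theta2 = m**2 * (m - 1) / ((m - d) * (m - d - 1) * (m - d - 3))
print("empirical theta1:", np.mean(np.diag(inv1 / trials)), " closed form:", theta1)
print("empirical theta2:", np.mean(np.diag(inv2 / trials)), " closed form:", theta2)
print("unbiased step size 1/theta1         :", 1 / theta1)
print("min-variance step size theta1/theta2:", theta1 / theta2)
```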
Proof:

In the following, we omit the subscripts in St,kS_{t,k} for simplicity. Using the SVD decomposition of Ht1/2=UΣVTH_{t}^{1/2}=U\Sigma V^{T}, the bias can be written as

𝔼[Ht1/2(Δ^t,kΔt)]\displaystyle\operatorname{\mathbb{E}}[H_{t}^{1/2}(\hat{\Delta}_{t,k}-\Delta_{t}^{*})] =U𝔼[(UTSTSU+λ2Σ2)1]Σ1VTgtU(Id+λ1Σ2)1Σ1VTgt.\displaystyle=U\operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}\Sigma^{-2})^{-1}]\Sigma^{-1}V^{T}g_{t}-U(I_{d}+\lambda_{1}\Sigma^{-2})^{-1}\Sigma^{-1}V^{T}g_{t}\,.

By the assumption that Σ=σId\Sigma=\sigma I_{d}, the bias term can be simplified as

𝔼[Ht1/2\displaystyle\operatorname{\mathbb{E}}[H_{t}^{1/2} (Δ^t,kΔt)]=σ1U𝔼[(UTSTSU+λ2σ2Id)1]VTgtσ1(1+λ1σ2)1UVTgt.\displaystyle(\hat{\Delta}_{t,k}-\Delta_{t}^{*})]=\sigma^{-1}U\operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}]V^{T}g_{t}-\sigma^{-1}(1+\lambda_{1}\sigma^{-2})^{-1}UV^{T}g_{t}.

By Lemma II.19, as nn goes to infinity, we have

𝔼[Ht1/2(Δ^t,kΔt)]\displaystyle\operatorname{\mathbb{E}}[H_{t}^{1/2}(\hat{\Delta}_{t,k}-\Delta_{t}^{*})] =σ1(θ3(d/m,λ2σ2)11+λ1σ2)UVTgt\displaystyle=\sigma^{-1}\left(\theta_{3}(d/m,\lambda_{2}\sigma^{-2})-\frac{1}{1+\lambda_{1}\sigma^{-2}}\right)UV^{T}g_{t}
=λ2σ2+dm1+(λ2σ2+dm1)2+4λ2σ2dm2λ2σ1d/mUVTgtσ11+λ1σ2UVTgt.\displaystyle=\frac{-\frac{\lambda_{2}}{\sigma^{2}}+\frac{d}{m}-1+\sqrt{(-\frac{\lambda_{2}}{\sigma^{2}}+\frac{d}{m}-1)^{2}+4\frac{\lambda_{2}}{\sigma^{2}}\frac{d}{m}}}{2\lambda_{2}\sigma^{-1}d/m}UV^{T}g_{t}-\frac{\sigma^{-1}}{1+\lambda_{1}\sigma^{-2}}UV^{T}g_{t}.

The bias becomes zero for the value of $\lambda_{2}$ that satisfies the following equation:

\displaystyle\sqrt{\left(-\sigma^{-2}+\frac{1}{\lambda_{2}}\left(\frac{d}{m}-1\right)\right)^{2}+4\sigma^{-2}\frac{d}{m\lambda_{2}}}-\sigma^{-2}+\frac{1}{\lambda_{2}}\left(\frac{d}{m}-1\right)=2\sigma^{-2}\frac{d}{m}\frac{1}{1+\lambda_{1}\sigma^{-2}}\,. (143)

In the regime $\lambda_{2}\geq 0$, the LHS of (143) is always non-negative, is monotonically decreasing in $\lambda_{2}$, and approaches zero as $\lambda_{2}\rightarrow\infty$. We now consider the following cases:

  • Case 1: $m\leq d$. Because $d/m-1\geq 0$, the LHS goes to infinity as $\lambda_{2}\rightarrow 0$. Since the LHS takes every value between $0$ and $\infty$, there is a value $\lambda_{2}^{*}$ that makes the bias zero for any $\lambda_{1}$.

  • Case 2: $m>d$. In this case, $d/m-1<0$. The maximum of the LHS is attained as $\lambda_{2}\rightarrow 0$ and equals $2\sigma^{-2}\frac{d}{m-d}$. As long as $2\sigma^{-2}\frac{d}{m}\frac{1}{1+\lambda_{1}\sigma^{-2}}\leq 2\sigma^{-2}\frac{d}{m-d}$, we can drive the bias to zero; this condition simplifies to $\lambda_{1}\sigma^{-2}\geq-d/m$, which always holds. Therefore, in the case $m>d$ as well, for any value of $\lambda_{1}$ there is a $\lambda_{2}^{*}$ that drives the bias to zero.

To sum up, for any given value of the regularization parameter $\lambda_{1}$, it is possible to find a value $\lambda_{2}^{*}$ that makes the sketched update direction unbiased. The optimal value of $\lambda_{2}$ is given by $\mathrm{LHS}^{-1}\left(2\sigma^{-2}\frac{d}{m}\frac{1}{1+\lambda_{1}\sigma^{-2}}\right)$, where $\mathrm{LHS}^{-1}(y)=\frac{4\sigma^{-2}y^{-1}d/m+2(d/m-1)}{y+2\sigma^{-2}}$, which simplifies to the following expression:

\displaystyle\lambda_{2}^{*}=\left(\lambda_{1}+\frac{d}{m}\sigma^{2}\right)\left(1-\frac{d/m}{1+\lambda_{1}\sigma^{-2}+d/m}\right). ∎
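As a sanity check, the brief sketch below (illustrative only; the chosen values of $\sigma$, $\lambda_{1}$, $d$, and $m$ are arbitrary) evaluates the closed-form $\lambda_{2}^{*}$ and verifies numerically that it satisfies the bias equation (143).

import numpy as np

# Numerical check that lambda2* satisfies the bias equation (143)
# when all singular values of H_t^{1/2} equal sigma.
def lhs(lam2, sigma, d, m):
    r = d / m
    a = -sigma ** -2 + (r - 1) / lam2
    return np.sqrt(a ** 2 + 4 * sigma ** -2 * r / lam2) + a

def lambda2_star(lam1, sigma, d, m):
    r = d / m
    return (lam1 + r * sigma ** 2) * (1 - r / (1 + lam1 * sigma ** -2 + r))

sigma, lam1, d, m = 2.0, 1.0, 30, 100   # arbitrary illustrative values
rhs = 2 * sigma ** -2 * (d / m) / (1 + lam1 * sigma ** -2)
lam2 = lambda2_star(lam1, sigma, d, m)
print("lambda2* =", lam2)
print("LHS(lambda2*) =", lhs(lam2, sigma, d, m), " RHS =", rhs)  # the two agree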

Appendix B Additional Numerical Results

In this section, we present additional experimental results.

B-A Scalability of the Serverless Implementation

Figure 10 shows the cost against time when we solve the problem given in (47) for large scale data on AWS Lambda using the distributed Newton sketch algorithm. The setting in this experiment is such that each worker node has access to a different subset of the data, and no additional sketching is applied. The dataset is randomly generated, and the goal here is to demonstrate the scalability of the algorithm and the serverless implementation. The size of the data matrix $A$ is 44 GB.


Figure 10: Cost approximation vs time when we solve the problem given in (47) for a large scale randomly generated dataset (44 GB) on AWS Lambda. Circles correspond to times at which the iterates $x_{t}$ are computed. Problem parameters are as follows: $n=200000$, $d=30000$, $\lambda_{1}=1$, $m=2000$, $q=100$, $\lambda=10$.

In the serverless computing implementation, we reuse the serverless functions during the course of the algorithm, meaning that the same $q=100$ functions are used in every iteration. Every iteration requires two rounds of communication with the master node: the first round communicates the local gradients, and the second round communicates the approximate update directions. The master node, also a serverless function, is likewise reused across iterations. Figure 10 illustrates that each iteration takes a different amount of time, and iteration times can be as short as 5 seconds. Some iterations take longer because of the straggler problem, a phenomenon commonly encountered in distributed computing: the iteration time is determined by the slowest of the $q=100$ nodes, and nodes often slow down for a variety of reasons. A possible remedy is to use error correcting codes to insert redundancy into the computation and hence avoid waiting for the outputs of all of the worker nodes [41, 42]. We identify implementing straggler mitigation for solving large scale problems via approximate second order optimization methods, such as distributed Newton sketch, as a promising direction.
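The following single-machine simulation (an illustrative sketch, not the AWS Lambda implementation; the ridge regression objective and all problem sizes are assumptions made for illustration) mimics the two communication rounds described above: the workers first send their local gradient contributions, the master aggregates them, and the workers then return approximate Newton directions computed from their local data blocks, which the master averages to update the iterate.

import numpy as np

# Single-machine simulation of the two-round communication pattern of
# distributed Newton sketch with partitioned data (illustration only).
# Objective assumed here: 0.5*||Ax - b||^2 + 0.5*lam*||x||^2.
rng = np.random.default_rng(0)
n, d, q, lam, alpha, iters = 2000, 50, 10, 1e-1, 1.0, 10

A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
blocks = np.array_split(np.arange(n), q)  # each "worker" holds a block of rows
x = np.zeros(d)

for it in range(iters):
    # Round 1: workers send local gradient contributions; master aggregates.
    g = sum(A[idx].T @ (A[idx] @ x - b[idx]) for idx in blocks) + lam * x

    # Round 2: each worker returns an approximate Newton direction based on
    # its rescaled local Hessian block; the master averages the q directions.
    dirs = []
    for idx in blocks:
        H_local = q * (A[idx].T @ A[idx]) + lam * np.eye(d)
        dirs.append(np.linalg.solve(H_local, g))
    x = x - alpha * np.mean(dirs, axis=0)

    cost = 0.5 * np.linalg.norm(A @ x - b) ** 2 + 0.5 * lam * np.linalg.norm(x) ** 2
    print(f"iteration {it}: cost = {cost:.4f}")

In an actual serverless deployment each row block would reside on a different worker, and only the $d$-dimensional gradients and update directions would be communicated, as described above.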

B-B Experiments on UCI Datasets

In the case of large datasets and limited worker node computing resources, such as memory and lifetime, most of the standard sketches are computationally too expensive, as discussed in the main body of the paper. This is why we limited the scope of the large scale experiments to uniform sampling, SJLT, and the hybrid sketch. In this section, we present additional experimental results on smaller datasets to empirically verify the theoretical results of the paper.

We present results on two UCI datasets in Figure 11, comparing the performance of the sketches discussed in the paper.


a) echocardiogram


b) oocytes merluccius nucleus

Figure 11: Approximation error against the number of averaged outputs in log-log scale for various sketching methods on two UCI datasets. All of the curves have been averaged over 25 independent trials and the vertical error bars show the standard error. The parameters are as follows. Plot a: $n=131$, $d=10$, $m=20$. Plot b: $n=1022$, $d=41$, $m=100$.

Figure 11 shows that the Gaussian and ROS sketches lead to unbiased estimators in these experiments, as the corresponding curves appear linear in the log-log scale plots. These experimental results suggest that the upper bound we derived for the bias of the ROS sketch may not be tight. The estimates for uniform sampling and leverage score sampling, in contrast, are biased. The observation that the Gaussian sketch estimator is unbiased in the experiments is consistent with our theoretical findings. Furthermore, we observe that the approximation error is highest for uniform sampling, which is also in agreement with the theoretical upper bounds that we have presented.
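For readers who wish to reproduce the qualitative behavior of Figure 11 on synthetic data, the small sketch below (illustrative only; the problem sizes and noise level are arbitrary, and the problem is taken to be unregularized least squares for simplicity) averages independent Gaussian-sketch solutions and tracks the error of the running average against the optimal solution; because the Gaussian sketch estimator is unbiased, the error keeps decreasing as more outputs are averaged.

import numpy as np

# Averaging independent Gaussian-sketch solutions of a least squares
# problem; the error of the running average decreases as more sketched
# solutions are averaged (illustration only, synthetic data).
rng = np.random.default_rng(0)
n, d, m, K = 1000, 20, 60, 256

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)  # optimal solution

running_sum = np.zeros(d)
for k in range(1, K + 1):
    S = rng.standard_normal((m, n)) / np.sqrt(m)          # Gaussian sketch
    x_k, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)   # sketched solution
    running_sum += x_k
    if k in (1, 4, 16, 64, 256):
        err = np.linalg.norm(running_sum / k - x_star) / np.linalg.norm(x_star)
        print(f"k = {k:3d}, relative error = {err:.4e}")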

References

  • [1] C.-C. Wang, K. L. Tan, C.-T. Chen, Y.-H. Lin, S. S. Keerthi, D. Mahajan, S. Sundararajan, and C.-J. Lin, “Distributed Newton methods for deep neural networks,” Neural Comput., vol. 30, no. 6, pp. 1673–1724, Jun. 2018. [Online]. Available: https://doi.org/10.1162/neco_a_01088
  • [2] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós, “Faster least squares approximation,” Numerische mathematik, vol. 117, no. 2, pp. 219–249, 2011.
  • [3] S. Wang, A. Gittens, and M. W. Mahoney, “Sketched ridge regression: Optimization perspective, statistical perspective, and model averaging,” J. Mach. Learn. Res., vol. 18, no. 1, pp. 8039–8088, Jan. 2017.
  • [4] M. Pilanci and M. J. Wainwright, “Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1842–1879, 2016.
  • [5] J. Lacotte and M. Pilanci, “Faster least squares optimization,” arXiv preprint, arXiv:1911.02675, 2019.
  • [6] M. W. Mahoney, “Randomized algorithms for matrices and data,” Foundations and Trends® in Machine Learning, vol. 3, no. 2, pp. 123–224, 2011.
  • [7] M. Pilanci and M. J. Wainwright, “Randomized sketches of convex programs with sharp guarantees,” IEEE Trans. Info. Theory, vol. 9, no. 61, pp. 5096–5115, September 2015.
  • [8] D. Bertsekas and J. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods.   Prentice-Hall, 1989.
  • [9] F. Niu, B. Recht, C. Re, and S. J. Wright, “Hogwild! a lock-free approach to parallelizing stochastic gradient descent,” in Proceedings of the 24th International Conference on Neural Information Processing Systems, ser. NIPS’11.   Red Hook, NY, USA: Curran Associates Inc., 2011, p. 693–701.
  • [10] S. Zhou, J. Lafferty, and L. Wasserman, “Compressed and privacy-sensitive sparse regression,” IEEE Transactions on Information Theory, vol. 55, no. 2, pp. 846–866, Feb 2009.
  • [11] M. Showkatbakhsh, C. Karakus, and S. Diggavi, “Privacy-utility trade-off of linear regression under random projections and additive noise,” in 2018 IEEE International Symposium on Information Theory (ISIT), June 2018, pp. 186–190.
  • [12] S. Zhou, J. Lafferty, and L. Wasserman, “Compressed regression,” in Neural Information Processing Systems, December 2007.
  • [13] J. Blocki, A. Blum, A. Datta, and O. Sheffet, “The Johnson-Lindenstrauss transform itself preserves differential privacy,” in 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, 2012, pp. 410–419.
  • [14] S. S. Vempala, The random projection method.   American Mathematical Soc., 2005, vol. 65.
  • [15] M. W. Mahoney, “Randomized algorithms for matrices and data,” Foundations and Trends® in Machine Learning, vol. 3, no. 2, pp. 123–224, 2011.
  • [16] D. P. Woodruff et al., “Sketching as a tool for numerical linear algebra,” Foundations and Trends® in Theoretical Computer Science, vol. 10, no. 1–2, pp. 1–157, 2014.
  • [17] P. Drineas and M. W. Mahoney, “RandNLA: randomized numerical linear algebra,” Communications of the ACM, vol. 59, no. 6, pp. 80–90, 2016.
  • [18] H. Avron, P. Maymounkov, and S. Toledo, “Blendenpik: Supercharging lapack’s least-squares solver,” SIAM Journal on Scientific Computing, vol. 32, no. 3, pp. 1217–1236, 2010.
  • [19] V. Rokhlin, A. Szlam, and M. Tygert, “A randomized algorithm for principal component analysis,” SIAM Journal on Matrix Analysis and Applications, vol. 31, no. 3, pp. 1100–1124, 2009.
  • [20] M. Pilanci and M. J. Wainwright, “Randomized sketches of convex programs with sharp guarantees,” IEEE Transactions on Information Theory, vol. 61, no. 9, pp. 5096–5115, 2015.
  • [21] ——, “Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence,” SIAM Journal on Optimization, vol. 27, no. 1, pp. 205–245, 2017.
  • [22] S. Wang, F. Roosta, P. Xu, and M. W. Mahoney, “Giant: Globally improved approximate newton method for distributed optimization,” in Advances in Neural Information Processing Systems 31.   Curran Associates, Inc., 2018, pp. 2332–2342.
  • [23] T. Sarlos, “Improved approximation algorithms for large matrices via random projections,” in Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on.   IEEE, 2006, pp. 143–152.
  • [24] J. Nelson and H. L. Nguyên, “Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings,” in Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on.   IEEE, 2013, pp. 117–126.
  • [25] G. Letac and H. Massam, “All invariant moments of the Wishart distribution,” Scandinavian Journal of Statistics, vol. 31, no. 2, pp. 295–318, 2004.
  • [26] D. C. Ahfock, W. J. Astle, and S. Richardson, “Statistical properties of sketching algorithms,” Biometrika, vol. 108, no. 2, pp. 283–297, 07 2020. [Online]. Available: https://doi.org/10.1093/biomet/asaa062
  • [27] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities - A Nonasymptotic Theory of Independence.   Oxford University Press, 2013.
  • [28] S. Sridhar, M. Pilanci, and A. Özgür, “Lower bounds and a near-optimal shrinkage estimator for least squares using random projections,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 3, pp. 660–668, 2020.
  • [29] S. Liu and E. Dobriban, “Ridge regression: Structure, cross-validation, and sketching,” 2019.
  • [30] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor, “Our data, ourselves: Privacy via distributed noise generation,” in Proceedings of the 24th Annual International Conference on The Theory and Applications of Cryptographic Techniques, ser. EUROCRYPT’06.   Berlin, Heidelberg: Springer-Verlag, 2006, p. 486–503. [Online]. Available: https://doi.org/10.1007/11761679_29
  • [31] O. Sheffet, “Private approximations of the 2nd-moment matrix using existing techniques in linear regression,” CoRR, vol. abs/1507.00056, 2015. [Online]. Available: http://arxiv.org/abs/1507.00056
  • [32] Z. Huang, S. Mitra, and N. Vaidya, “Differentially private distributed optimization,” in Proceedings of the 2015 International Conference on Distributed Computing and Networking, ser. ICDCN ’15.   New York, NY, USA: Association for Computing Machinery, 2015. [Online]. Available: https://doi.org/10.1145/2684464.2684480
  • [33] S. Boyd and L. Vandenberghe, Convex optimization.   Cambridge university press, 2004.
  • [34] E. Jonas, S. Venkataraman, I. Stoica, and B. Recht, “Occupy the cloud: Distributed computing for the 99%,” CoRR, vol. abs/1702.04024, 2017. [Online]. Available: http://arxiv.org/abs/1702.04024
  • [35] Bureau of Transportation Statistics, “Airline on-time statistics and delay causes,” http://stat-computing.org/dataexpo/2009/, 2018.
  • [36] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, “EMNIST: an extension of MNIST to handwritten letters,” CoRR, vol. abs/1702.05373, 2017. [Online]. Available: http://arxiv.org/abs/1702.05373
  • [37] N. E. Karoui, “Concentration of measure and spectra of random matrices: Applications to correlation matrices, elliptical distributions and beyond,” The Annals of Applied Probability, vol. 19, no. 6, pp. 2362–2405, 2009.
  • [38] A. A. Borovkov, Mathematical statistics.   Australia: Gordon and Breach Science Publishers, 1998.
  • [39] O. Sheffet, “Private approximations of the 2nd-moment matrix using existing techniques in linear regression,” CoRR, vol. abs/1507.00056, 2015. [Online]. Available: http://arxiv.org/abs/1507.00056
  • [40] J. Lacotte and M. Pilanci, “Faster least squares optimization,” arXiv preprint, arXiv:1911.02675, 2019.
  • [41] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, March 2018.
  • [42] B. Bartan and M. Pilanci, “Straggler resilient serverless computing based on polar codes,” in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2019, pp. 276–283.