
Distributed Sketching for Randomized Optimization: Exact Characterization, Concentration and Lower Bounds

Burak Bartan and Mert Pilanci B. Bartan is with the Department of Electrical Engineering, Stanford University, CA, 94305 USA (e-mail: [email protected]).M. Pilanci is with the Department of Electrical Engineering, Stanford University, CA, 94305 USA (e-mail: [email protected]).
Abstract

We consider distributed optimization methods for problems where forming the Hessian is computationally challenging and communication is a significant bottleneck. We leverage randomized sketches for reducing the problem dimensions as well as preserving privacy and improving straggler resilience in asynchronous distributed systems. We derive novel approximation guarantees for classical sketching methods and establish tight concentration results that serve as both upper and lower bounds on the error. We then extend our analysis to the accuracy of parameter averaging for distributed sketches. Furthermore, we develop unbiased parameter averaging methods for randomized second order optimization for regularized problems that employ sketching of the Hessian. Existing works do not take the bias of the estimators into consideration, which limits their application to massively parallel computation. We provide closed-form formulas for regularization parameters and step sizes that provably minimize the bias for sketched Newton directions. Additionally, we demonstrate the implications of our theoretical findings via large scale experiments on a serverless cloud computing platform.

Index Terms:
second order optimization, sketching, distributed optimization, randomized algorithms, convex optimization, regularized least squares, large scale problems, differential privacy

I Introduction

We investigate distributed sketching methods for solving large scale regression problems. In particular, we study parameter averaging for variance reduction and establish theoretical results on approximation performance. We consider a distributed computing model with a single central node and multiple worker nodes that run in parallel. Employing parameter averaging in distributed computing systems enables asynchronous updates, since a running average of available parameters can approximate the result without requiring all worker nodes to finish their tasks. Moreover, sketching provably preserves the privacy of the data, making it an attractive choice for massively parallel cloud computing.

We consider distributed averaging in a variety of randomized second order methods including Newton Sketch, iterative Hessian sketch (IHS), and also in direct or non-iterative methods. We focus on the communication-efficient setting where we avoid the communication of approximate Hessian matrices of size d2d^{2} and communicate only the approximate solutions of size dd. Averaging sketched solutions was proposed in the literature in certain restrictive settings [1]. The presence of a regularization term requires additional caution, as naïve averaging may lead to biased estimators of the solution. In this work, bias is always with respect to the randomness of the sketching matrices; the input data are not assumed to be sampled from a probability distribution. Although bias is often overlooked in the literature, we show that one can re-calibrate the regularization coefficient of the sketched problems to obtain unbiased estimators. We show that having unbiased estimators leads to better performance without imposing any additional computational cost.

We consider both underdetermined and overdetermined linear regression problems in the regime where the data does not fit in main memory. Such linear regression problems and linear systems are commonly encountered in a multitude of problems ranging from statistics and machine learning to optimization. Being able to solve large scale linear regression problems efficiently is crucial for many applications. In this paper, the setting that we consider consists of a distributed system with qq workers that run in parallel and a single central node that collects and processes the outputs of the worker nodes. Applications of randomized sketches and dimension reduction to linear regression and other optimization problems have been extensively studied in the recent literature by works including [2, 3, 4, 5, 6, 7]. In this work, we investigate averaging the solutions of sketched sub-problems. This setting for overdetermined problems was also studied in [3]. In addition, we also consider regression problems where the number of data samples is less than the dimensionality and investigate the properties of averaging for such problems.

An important advantage of randomized sketching in distributed computing is the independent and identically distributed nature of all of the computational tasks. Therefore, sketching offers a resilient computing model where node failures, stragglers as well as additions of new nodes can be easily handled, e.g., via generating more data sketches. An alternative to averaging that would offer similar benefits is the asynchronous stochastic gradient descent (SGD) algorithm [8, 9]. However, the convergence rates of asynchronous SGD methods necessarily depend on the properties of the input data such as its condition number [8, 9]. In contrast, as we show in this work, distributed sketching has stronger convergence guarantees that do not depend on the condition number of the data matrix.

A fundamental emphasis of the paper is on distributed computing and parameter averaging. However, novel theoretical results including exact characterizations of expected error and exponential concentration are derived for the single sketch estimator as well.

Table I summarizes the types of problems that we study in this work. The last column of the table contains references to the theorems and lemmas for each result.

TABLE I: Summary of theoretical results. The data matrix is An×dA\in\mathbb{R}^{n\times d} and the output vector is bnb\in\mathbb{R}^{n}.
Problem | Sketch Type | Method and Theorem
\min_{x}\|Ax-b\|_{2}^{2} | Gaussian | Distributed randomized regression: Theorems II.2, II.8, II.11
\min_{x}\|Ax-b\|_{2}^{2} | Gaussian | Distributed iterative Hessian sketch: Theorem II.13
\min_{x}\|Ax-b\|_{2}^{2} | Other | Distributed randomized regression: Lemmas IV.3, IV.4, IV.5, Theorem IV.6
\min_{x}\|x\|_{2}^{2}\mbox{ s.t. }Ax=b | Gaussian | Right sketch: Theorem II.17
\min_{x}\|Ax-b\|_{2}^{2}+\lambda_{1}\|x\|_{2}^{2} | Gaussian | Distributed randomized ridge regression: Theorem II.20
Convex problems with Hessian (H_{t}^{1/2})^{T}H_{t}^{1/2} | Gaussian | Distributed Newton sketch (step size for unbiasedness): Theorem V.1
Convex problems with Hessian (H_{t}^{1/2})^{T}H_{t}^{1/2}+\lambda_{1}I_{d} | Gaussian | Distributed Newton sketch (regularization coefficient for unbiasedness): Theorem V.2

I-A Cloud Computing

Serverless computing is a relatively new technology that offers computing on the cloud without requiring any server management from end users. Functions in serverless computing can be thought of as worker nodes that have very limited resources and lifetimes, but are highly scalable. The algorithms that we study in this paper are particularly suitable for serverless computing platforms, since they do not require peer-to-peer communication among worker nodes and have low memory and compute requirements per node. In our numerical simulations, we evaluate our methods on the serverless computing platform AWS Lambda.

Data privacy is an increasingly important issue in cloud computing, one that has been studied in many recent works including [10, 11, 12, 13]. Distributed sketching comes with the benefit of privacy preservation. To clarify, let us consider a setting where the master node computes sketched data SkAS_{k}A, SkbS_{k}b, k=1,,qk=1,\dots,q locally where Skm×nS_{k}\in\mathbb{R}^{m\times n} are the sketching matrices and An×dA\in\mathbb{R}^{n\times d}, bnb\in\mathbb{R}^{n} are the data matrix and the output vector, respectively, for the regression problem minxAxb22\min_{x}\|Ax-b\|_{2}^{2}. In the distributed sketching setting, the master node sends only the sketched data to worker nodes for computational efficiency, as well as data privacy preservation. In particular, the mutual information and differential privacy can be controlled when we reveal SkAS_{k}A and keep AA hidden. Furthermore, one can trade privacy for accuracy by choosing a suitable sketch dimension mm.

I-B Notation

We use hats x^\hat{x} to denote the estimator for a single sketch and bars x¯\bar{x} to denote the averaged estimator x¯:=1qk=1qx^k\bar{x}:=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k}, where x^k\hat{x}_{k} is the estimator for the kk’th worker node. Stars are used to denote the optimal solution xx^{*}. We use f()f(\cdot) to denote the objective of the optimization problem being considered at that point in the text. The letter ϵ\epsilon is used for error while ε\varepsilon is used as the differential privacy parameter. The notation σmin()\sigma_{\min}(\cdot) denotes the smallest nonzero singular value of its argument. O()O(\cdot) is used for big O notation.

All the expectations in the paper are with respect to the randomness over the sketching matrices, and no randomness assumptions are made for the data. We use Sm×nS\in\mathbb{R}^{m\times n} to denote random sketching matrices and we often refer to mm as the sketch size. For non-iterative distributed algorithms, we use Skm×nS_{k}\in\mathbb{R}^{m\times n} for the sketching matrix used by worker node kk. For iterative algorithms, St,km×nS_{t,k}\in\mathbb{R}^{m\times n} is used to denote the sketching matrix used by worker kk in iteration tt. We assume the sketching matrices are scaled such that 𝔼[STS]=In\operatorname{\mathbb{E}}[S^{T}S]=I_{n}. We omit the subscripts in SkS_{k} and St,kS_{t,k} for simplicity whenever it does not cause confusion. The notation mdm\gtrsim d is used to denote that there exists a positive finite constant cc such that mcdm\geq cd. We denote the thin SVD decomposition of the data matrix as A=UΣVTA=U\Sigma V^{T}.

For regularized problems, we use λ1\lambda_{1} for the regularization coefficient of the original problem, and λ2\lambda_{2} for the regularization coefficient of the sketched problem. For instance, in the case of the regularized least squares problem, we have

x\displaystyle x^{*} =argminxAxb22+λ1x22,\displaystyle=\arg\min_{x}\|Ax-b\|_{2}^{2}+\lambda_{1}\|x\|_{2}^{2}\,,
x^\displaystyle\hat{x} =argminxSAxSb22+λ2x22.\displaystyle=\arg\min_{x}\|SAx-Sb\|_{2}^{2}+\lambda_{2}\|x\|_{2}^{2}\,. (1)

We will derive expressions in the sequel for the optimal selection of the coefficient λ2\lambda_{2} which we denote as λ2\lambda_{2}^{*}.
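To make the distinction between λ1\lambda_{1} and λ2\lambda_{2} concrete, the following minimal numpy sketch (with synthetic data and dimensions of our own choosing) solves the original ridge problem and a single Gaussian-sketched sub-problem in closed form; setting lam2 equal to lam1 reproduces the naive choice whose bias we correct in Section II-F.

  import numpy as np

  def ridge(A, b, lam):
      # Closed-form ridge solution: (A^T A + lam I)^{-1} A^T b.
      d = A.shape[1]
      return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

  rng = np.random.default_rng(0)
  n, d, m = 1000, 20, 100
  A = rng.standard_normal((n, d))
  b = rng.standard_normal(n)
  lam1 = 5.0
  lam2 = lam1                                   # naive choice; see (25) for the corrected value

  x_star = ridge(A, b, lam1)                    # original problem
  S = rng.standard_normal((m, n)) / np.sqrt(m)  # Gaussian sketch with E[S^T S] = I_n
  x_hat = ridge(S @ A, S @ b, lam2)             # sketched sub-problem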

I-C Related Work

Random projections are a popular way of performing randomized dimensionality reduction, which are widely used in many computational and learning problems [14, 15, 16, 17]. Many works have studied randomized sketching methods for least squares and other optimization problems [18, 19, 2, 20, 21, 3, 22].

Sarlós [23] showed that the relative error of the single Gaussian sketch estimator in (I-B) in the unregularized case λ1=λ2=0\lambda_{1}=\lambda_{2}=0 is bounded as f(x^)/f(x)(1+Cdlog(d)/m)2f(\hat{x})/f(x^{*})\leq(1+Cd\log(d)/m)^{2} with probability at least 1/31/3, where CC is a constant. The relative error was improved to f(x^)/f(x)(1+Cd/m)f(\hat{x})/f(x^{*})\leq(1+Cd/m) with exponentially small failure probability in subsequent work [7]. In contrast, we derive the expectation of the error exactly and show exponential concentration around this expected value in this paper. As a result, we obtain a tight upper and lower bound for the relative error that holds with exponentially high probability (see Theorem II.8).

The work [3] investigates distributed sketching and averaging for regression from optimization and statistical perspectives. The most relevant result in [3], using our notation, can be stated as follows. Setting the sketch dimension m=O(μd(logd)/ϵ)m=O(\mu d(\log d)/\epsilon) for uniform sampling, where μ\mu is row coherence μ:=n/rank(A)maxii\mu:=n/\operatorname{rank}(A)\max_{i}\ell_{i} and i\ell_{i} is the ii’th row leverage score, and m=O~(d/ϵ)m=\tilde{O}(d/\epsilon) (with O~\tilde{O} hiding logarithmic factors) for other sketches, the inequality f(x¯)f(x)(ϵ/q+ϵ2)f(x)f(\bar{x})-f(x^{*})\leq(\epsilon/q+\epsilon^{2})f(x^{*}) holds with high probability. According to this result, for large qq, the cost at the averaged solution f(x¯)f(\bar{x}) will be upper bounded by ϵ2f(x)\epsilon^{2}f(x^{*}). In this work we prove that for Gaussian sketch, f(x¯)f(\bar{x}) will converge to the optimal cost f(x)f(x^{*}) as qq increases. We also identify the exact expected error for a given number of workers qq. In addition, our results on regularized least squares regression improve on the results in [3] for averaging multiple sketched solutions. In particular, in [3], the sketched sub-problems use the same regularization parameter as the original problem, which leads to biased solutions. We analyze the bias of the averaged solution, and provide explicit formulas for selecting the regularization parameter of the sketched sub-problems to achieve unbiasedness.

We show that the expected difference between the costs of the averaged solution and the optimal solution has two components, namely variance and squared bias (see Lemma IV.1). This result implies that for the Gaussian sketch, which we prove to be unbiased (see Lemma II.1), the number of workers required for a target error ϵ\epsilon scales as 1/ϵ1/\epsilon. Remarkably, this result does not involve condition numbers. In contrast, for the asynchronous Hogwild algorithm [9], the number of iterations required for error ϵ\epsilon scales with log(1/ϵ)/ϵ\log(1/\epsilon)/\epsilon and also depends on the condition number of the input data, although it addresses a more general class of problems.

One of the crucial results of our work is the development of bias correction for averaging sketched solutions. For instance, the setting considered in [22] is based on a distributed second order optimization method that involves averaging approximate update directions. However, in that work, the bias was not taken into account, which degrades accuracy and limits applicability to massively parallel computation. We additionally provide results for the unregularized case, which corresponds to the distributed version of the Newton sketch algorithm introduced in [21].

I-D Overview of Our Contributions

  • We characterize the exact expected error of the averaged estimator for Gaussian sketch in closed form as

    𝔼[f(x¯)]f(x)f(x)=1qdmd1\displaystyle\frac{\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*})}{f(x^{*})}=\frac{1}{q}\frac{d}{m-d-1} (2)

    where x¯=1qk=1qx^k\bar{x}=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k} is the averaged solution and x^k=argminxSkAxSkb22\hat{x}_{k}=\arg\min_{x}\|S_{k}Ax-S_{k}b\|^{2}_{2} is the output of the kk’th worker node. In addition, we obtain a similar result for the error of the averaged estimator for the least-norm solution in the underdetermined case n<dn<d.

  • For both the single sketch and averaged estimator, we show that the relative error f(x¯)f(x)f(x)\frac{f(\bar{x})-f(x^{*})}{f(x^{*})} is concentrated around its expectation given in (2) with exponentially high probability. This provides both an upper and lower bound on the performance of sketching, unlike previous results in the literature. Moreover, it offers a guaranteed recipe to set the sketch size mm and number of workers qq.

  • We show that for Gaussian distributed sketch, the expected error of the averaged estimator 𝔼[f(x¯)]f(x)\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*}) matches the error lower bound for any unbiased estimator obtained via Fisher information. In addition, we provide a lower bound for general, possibly biased sketching based estimators.

  • We consider the privacy preserving properties of distributed sketching methods, in which only random projections of the data matrix AA need to be shared with worker nodes. We derive conditions for (ε,δ)(\varepsilon,\delta)-differential privacy when SS is the i.i.d. Gaussian sketch matrix. Combined with our results, we show that the approximation error of this distributed privacy preserving regression algorithm scales as O(1/ε2)O(1/\varepsilon^{2}).

  • We analyze the convergence rate of a distributed version of the iterative Hessian sketch algorithm which was introduced in [4] and show that the number of iterations required to reach error ϵ\epsilon with qq workers scales with log(1/ϵ)/log(q)\log(1/\epsilon)/\log(q).

  • We show that x^=argminxSAxSb22+λ2x22\hat{x}=\arg\min_{x}\|SAx-Sb\|_{2}^{2}+\lambda_{2}\|x\|_{2}^{2} is not an unbiased estimator of the optimal solution, i.e. 𝔼[A(x^x)]0\operatorname{\mathbb{E}}[A(\hat{x}-x^{*})]\neq 0 when λ2=λ1\lambda_{2}=\lambda_{1} and provide a closed form expression for the regularization coefficient λ2\lambda_{2}^{*} to make the estimator unbiased under certain assumptions.

  • In addition to Gaussian sketch, we derive bias bounds for uniform sampling, randomized Hadamard sketch and leverage score sampling. Analysis of the bias term is critical in understanding how close to the optimal solution we can hope to get and establishing the dependence on the sketch dimension mm. Moreover, we utilize the derived bias bounds and find an upper bound on the error of the averaged estimator for these other sketching methods.

  • We discuss averaging for the distributed version of Newton Sketch [21] and show that using the same regularization coefficient as the original problem, i.e., λ2=λ1\lambda_{2}=\lambda_{1}, as most works in the literature consider, is not optimal. Furthermore, we derive an expression for choosing the regularization coefficient λ2\lambda_{2}^{*} for unbiasedness.

  • We provide numerical simulations that illustrate the practicality and scalability of distributed sketching methods on the serverless cloud computing platform AWS Lambda.

I-E Preliminaries on Sketching Matrices

We consider various sketching matrices in this work including Gaussian sketch, uniform sampling, randomized Hadamard sketch, Sparse Johnson-Lindenstrauss Transform (SJLT), and hybrid sketch. We now briefly describe each of these sketching methods:

  1. 1.

    Gaussian sketch: Entries of Sm×nS\in\mathbb{R}^{m\times n} are i.i.d. and sampled from the Gaussian distribution. Sketching a matrix An×dA\in\mathbb{R}^{n\times d} using Gaussian sketch requires computing the matrix product SASA which has computational complexity equal to O(mnd)O(mnd).

  2. 2.

    Randomized Hadamard sketch: The sketch matrix in this case can be represented as S=PHDS=PHD where Pm×nP\in\mathbb{R}^{m\times n} is for uniform sampling of mm rows out of nn rows, Hn×nH\in\mathbb{R}^{n\times n} is the Hadamard matrix, and Dn×nD\in\mathbb{R}^{n\times n} is a diagonal matrix with diagonal entries sampled randomly from the Rademacher distribution. Multiplication by DD to obtain DADA requires O(nd)O(nd) scalar multiplications. Hadamard transform can be implemented as a fast transform with complexity O(nlog(n))O(n\log(n)) per column, and a total complexity of O(ndlog(n))O(nd\log(n)) to sketch all dd columns of DADA.

  3. 3.

    Uniform sampling: Uniform sampling randomly selects mm rows out of the nn rows of AA where the probability of any row being selected is the same.

  4. 4.

    Leverage score sampling: Row leverage scores of a matrix AA are given by i=u~i22\ell_{i}=\|\tilde{u}_{i}\|_{2}^{2} for i=1,,ni=1,\dots,n where u~id\tilde{u}_{i}\in\mathbb{R}^{d} denotes the ii’th row of UU. The matrix UU is the matrix whose columns are the left singular vectors of AA, i.e., A=UΣVTA=U\Sigma V^{T}. There is only one nonzero element in every row sins_{i}\in\mathbb{R}^{n} of the sketching matrix SS and the probability that the jj’th entry of sis_{i} is nonzero is proportional to the leverage score j\ell_{j}. More precisely, the rows s1,,sms_{1},\dots,s_{m} are sampled i.i.d. such that [si=ek/mpk]=pk,i,k\mathbb{P}[s_{i}=e_{k}/\sqrt{mp_{k}}]=p_{k},\,\forall i,\,\forall k where pk=kj=1njp_{k}=\frac{\ell_{k}}{\sum_{j=1}^{n}\ell_{j}}. Note that we have 𝔼[STS]=𝔼[i=1msisiT]=mk=1npkekekT/(mpk)=In\operatorname{\mathbb{E}}[S^{T}S]=\operatorname{\mathbb{E}}[\sum_{i=1}^{m}s_{i}s_{i}^{T}]=m\sum_{k=1}^{n}p_{k}e_{k}e_{k}^{T}/(mp_{k})=I_{n}.

  5. 5.

    Sparse Johnson-Lindenstrauss Transform (SJLT) [24]: The sketching matrix for SJLT is a sparse matrix where each column has exactly ss nonzero entries and the columns are independently distributed. The nonzero entries are sampled from the Rademacher distribution. It takes O(snd/m)O(snd/m) addition operations to sketch a data matrix using SJLT.

  6. 6.

    Hybrid sketch: The method that we refer to as hybrid sketch is a sequential application of two different sketching methods. In particular, it might be computationally feasible for worker nodes to sample as much data as possible, say mm^{\prime} rows, and then reduce the dimension of the available data to the final sketch dimension mm using another sketch with better error properties than uniform sampling such as Gaussian sketch or SJLT. For instance, a hybrid sketch of uniform sampling followed by Gaussian sketch would have computational complexity O(mmd)O(mm^{\prime}d).
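The sketching constructions listed above can be written down compactly in code. The following is a minimal numpy/scipy sketch (function names are ours, and the scalings are chosen so that \operatorname{\mathbb{E}}[S^{T}S]=I_{n} as assumed in Section I-B); it is an illustration rather than an optimized implementation, and the fast Hadamard transform is replaced by an explicit Hadamard matrix, which requires nn to be a power of two.

  import numpy as np
  from scipy.linalg import hadamard

  def gaussian_sketch(m, n, rng):
      # Entries are i.i.d. N(0, 1/m).
      return rng.standard_normal((m, n)) / np.sqrt(m)

  def randomized_hadamard_sketch(m, n, rng):
      # S = P H D up to scaling; n must be a power of two for scipy's hadamard().
      D = rng.choice([-1.0, 1.0], size=n)
      HD = hadamard(n) * D                       # right-multiplication by diag(D)
      rows = rng.integers(0, n, size=m)          # uniform row sampling
      return HD[rows] / np.sqrt(m)

  def uniform_sampling_sketch(m, n, rng):
      rows = rng.integers(0, n, size=m)          # sampling with replacement
      S = np.zeros((m, n))
      S[np.arange(m), rows] = np.sqrt(n / m)
      return S

  def leverage_score_sketch(m, A, rng):
      n = A.shape[0]
      U, _, _ = np.linalg.svd(A, full_matrices=False)
      lev = np.sum(U ** 2, axis=1)               # row leverage scores ||u_i||_2^2
      p = lev / lev.sum()
      rows = rng.choice(n, size=m, p=p)
      S = np.zeros((m, n))
      S[np.arange(m), rows] = 1.0 / np.sqrt(m * p[rows])
      return S

  def sjlt_sketch(m, n, s, rng):
      # Each column has exactly s nonzeros with values +/- 1/sqrt(s).
      S = np.zeros((m, n))
      for j in range(n):
          rows = rng.choice(m, size=s, replace=False)
          S[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
      return S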

I-F Paper Organization

Section II deals with the application of distributed sketching to quadratic problems for Gaussian sketch. Section III deals with the privacy preserving property of distributed sketching. Section IV gives theoretical results for randomized Hadamard sketch, uniform sampling, and leverage score sampling. The tools developed in Section II are applied to second order optimization for unconstrained convex problems in Section V. Section VI provides an overview of applications and example problems where distributed sketching methods could be applied. Section VII presents numerical results and Section VIII concludes the main part of the paper.

We give the proofs for the majority of the lemmas and theorems in the Appendix along with additional numerical results.

II Distributed Sketching for Quadratic Optimization problems

In this section, we focus on quadratic optimization problems of the form

x=argminxAxb22+λ1x22.\displaystyle x^{*}=\arg\min_{x}\|Ax-b\|_{2}^{2}+\lambda_{1}\|x\|_{2}^{2}\,. (3)

We study various distributed algorithms for solving problems of this form based on model averaging. Some of these algorithms are tailored for the unregularized case λ1=0\lambda_{1}=0.

II-A Closed Form Expressions for Gaussian Sketch

We begin with a non-iterative model averaging based algorithm, which we refer to as distributed randomized regression. This algorithm is for the unregularized case λ1=0\lambda_{1}=0. Each of the worker nodes computes an approximate solution x^k\hat{x}_{k} and these are averaged at the master node to compute the final solution x¯\bar{x}. This method is outlined in Algorithm 1. The theoretical analysis in this section assumes that the sketch matrix is Gaussian, which is generalized to other sketching matrices in Section IV. Although computing the Gaussian sketch is not as efficient as other fast sketches such as the randomized Hadamard sketch, it has several significant advantages over other sketches: (1) exact relative error expressions can be derived as we show in this section, (2) the solution is unbiased, (3) a differential privacy bound can be provided as we show in Section III, and (4) computation of the sketch can be trivially parallelized.

  Input: Data matrix An×dA\in\mathbb{R}^{n\times d}, target vector bnb\in\mathbb{R}^{n}.
  for worker k=1,,qk=1,\dots,q in parallel do
     Obtain the sketched data and sketched output: SkAS_{k}A and SkbS_{k}b.
     Solve x^k=argminxSkAxSkb22\hat{x}_{k}=\arg\min_{x}\|S_{k}Ax-S_{k}b\|^{2}_{2} and send x^k\hat{x}_{k} to the master node.
  end for
  Master node: return x¯=1qk=1qx^k\bar{x}=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k}.
Algorithm 1 Distributed Randomized Regression
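A minimal serial simulation of Algorithm 1 in numpy is given below; this is a sketch under the assumption of Gaussian SkS_{k}, and in an actual deployment each loop iteration would run on a separate worker, with only the averaging performed at the master node.

  import numpy as np

  def distributed_randomized_regression(A, b, m, q, rng):
      # Each "worker" solves an independently sketched least squares problem.
      n, d = A.shape
      solutions = []
      for _ in range(q):
          S = rng.standard_normal((m, n)) / np.sqrt(m)       # Gaussian sketch
          x_k, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
          solutions.append(x_k)
      return np.mean(solutions, axis=0)                      # master node averages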

We first obtain a characterization of the expected error for the single sketch estimator in Lemma II.1.

Lemma II.1

For the Gaussian sketch with m>d+1m>d+1, the estimator x^\hat{x} satisfies

𝔼[A(x^x)22]=𝔼[f(x^)]f(x)=f(x)dmd1,\displaystyle\operatorname{\mathbb{E}}[\|A(\hat{x}-x^{*})\|_{2}^{2}]=\operatorname{\mathbb{E}}[f(\hat{x})]-f(x^{*})=f(x^{*})\frac{d}{m-d-1}\,, (4)

where f(x):=Axb22f(x):=\|Ax-b\|_{2}^{2} and x=argminxf(x)x^{*}=\arg\min_{x}f(x).

Proof:

Suppose that the matrix AA is full column rank. Then, for mdm\geq d, the matrix ATSTSAA^{T}S^{T}SA follows a Wishart distribution, and is invertible with probability one. Conditioned on the invertibility of ATSTSAA^{T}S^{T}SA, we have

x^\displaystyle\hat{x} =(ATSTSA)1ATSTSb\displaystyle=(A^{T}S^{T}SA)^{-1}A^{T}S^{T}Sb
=(ATSTSA)1ATSTS(Ax+b)\displaystyle=(A^{T}S^{T}SA)^{-1}A^{T}S^{T}S(Ax^{*}+b^{\perp})
=x+(ATSTSA)1ATSTSb,\displaystyle=x^{*}+(A^{T}S^{T}SA)^{-1}A^{T}S^{T}Sb^{\perp}\,,

where we have defined b:=bAxb^{\perp}:=b-Ax^{*} . Note that SASA and SbSb^{\perp} are independent since they are Gaussian and uncorrelated as a result of the normal equations ATb=0A^{T}b^{\perp}=0. Conditioned on the realization of the matrix SASA and the event ATSTSA0A^{T}S^{T}SA\succ 0, a simple covariance calculation shows that

x^𝒩(x,1mf(x)(ATSTSA)1).\displaystyle\hat{x}\sim\mathcal{N}\big{(}x^{*},\frac{1}{m}f(x^{*})(A^{T}S^{T}SA)^{-1}\big{)}\,. (5)

Multiplying with the data matrix AA on the left yields the distribution of the prediction error, conditioned on SASA, as

A(x^x)𝒩(0,1mf(x)A(ATSTSA)1AT).\displaystyle A(\hat{x}-x^{*})\sim\mathcal{N}\big{(}0,\frac{1}{m}f(x^{*})A(A^{T}S^{T}SA)^{-1}A^{T}\big{)}\,. (6)

Then we can compute the conditional expectation of the squared norm of the error

𝔼[A(x^x)22|SA]=f(x)mtr(A(ATSTSA)1AT).\operatorname{\mathbb{E}}[\|A(\hat{x}-x^{*})\|_{2}^{2}~{}\,\big{|}SA]=\frac{f(x^{*})}{m}\operatorname{tr}(A(A^{T}S^{T}SA)^{-1}A^{T})\,.

Next we recall that the expected inverse of the Wishart matrix ATSTSAA^{T}S^{T}SA satisfies (see, e.g., [25])

𝔼[(ATSTSA)1]=(ATA)1mmd1.\displaystyle\operatorname{\mathbb{E}}[(A^{T}S^{T}SA)^{-1}]=(A^{T}A)^{-1}\frac{m}{m-d-1}\,.

Plugging in the previous result and using the tower property of expectations and then noting that tr(A(ATA)1AT)=d\operatorname{tr}(A(A^{T}A)^{-1}A^{T})=d give us the claimed result. ∎

To the best of our knowledge, this result of exact error characterization is novel in the theory of sketching. Tools similar to those used in this proof appear in [26]; however, the exact context in which they are used and the end result are different from this work. Furthermore, existing results (see e.g. [6, 3, 16]) characterize a high probability upper bound on the error, whereas the above is a sharp and exact formula for the expected squared norm of the error. Theorem II.2 builds on Lemma II.1 to characterize the expected error for the averaged solution x¯\bar{x}.

Theorem II.2 (Expected error of the averaging estimator)

Let SkS_{k}, k=1,,qk=1,\dots,q be Gaussian sketching matrices, then Algorithm 1 runs in time O(md2)O(md^{2}), and the error of the averaged solution x¯\bar{x} satisfies

𝔼[f(x¯)]f(x)f(x)=1qdmd1.\displaystyle\frac{\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*})}{f(x^{*})}=\frac{1}{q}\frac{d}{m-d-1}\,. (7)

Consequently, Markov’s inequality implies that f(x¯)f(x)f(x)ϵ\frac{f(\bar{x})-f(x^{*})}{f(x^{*})}\leq\epsilon holds with probability at least (11qϵdmd1)\left(1-\frac{1}{q\epsilon}\frac{d}{m-d-1}\right) for any ϵ>0\epsilon>0.

Theorem II.2 illustrates that the expected error scales as 1/q1/q, and converges to zero as qq\rightarrow\infty as long as md+2m\geq d+2. This is due to the unbiasedness of the Gaussian sketch. Other sketching methods such as uniform sampling or randomized Hadamard sketch do not have this property, as we further investigate in Section IV.
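The exact formula (7) can be checked numerically. The following Monte Carlo snippet (with synthetic data of our choosing) compares the empirical relative error of the averaged estimator against \frac{1}{q}\frac{d}{m-d-1}.

  import numpy as np

  rng = np.random.default_rng(1)
  n, d, m, q, trials = 512, 20, 100, 4, 500
  A = rng.standard_normal((n, d))
  b = rng.standard_normal(n)
  x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
  f_star = np.sum((A @ x_star - b) ** 2)

  errors = []
  for _ in range(trials):
      xs = []
      for _ in range(q):
          S = rng.standard_normal((m, n)) / np.sqrt(m)
          x_k, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
          xs.append(x_k)
      x_bar = np.mean(xs, axis=0)
      errors.append((np.sum((A @ x_bar - b) ** 2) - f_star) / f_star)

  print("empirical mean relative error:", np.mean(errors))
  print("theoretical value d/(q(m-d-1)):", d / (q * (m - d - 1)))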

Remark II.3

In Algorithm 1, the worker nodes are tasked with obtaining the sketched data SkAS_{k}A and the sketched output SkbS_{k}b. We identify two options for this step:

  • Option 1: The master node computes the sketched data SkA,SkbS_{k}A,S_{k}b, k=1,,qk=1,\dots,q and transmits to worker nodes. This option preserves data privacy, which we discuss in Section III.

  • Option 2: The worker nodes have access to the data A,bA,b and compute the sketched data SkAS_{k}A and SkbS_{k}b. This option does not preserve privacy, however, parallelizes the computation of the sketch via the workers.

II-B Exponential Concentration of the Gaussian Sketch Estimator

In this subsection, we show high probability concentration bounds for the error of the Gaussian sketch. Importantly, our result provides both upper and lower bounds on the relative error with exponentially small failure probability. We begin with the single sketch estimator x^\hat{x} and then extend the results to the averaged estimator x¯=1qk=1qx^k\bar{x}=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k}.

The next theorem states the main concentration result for the single sketch estimator. Both upper and lower bounds are given for the ratio of the cost of the sketched solution to the optimal solution. The full proof is given in the Appendix.

Theorem II.4 (Concentration bound for the single sketch estimator x^\hat{x})

Suppose that the sketch size is such that mdm\gtrsim d. Then, the optimality ratio of x^\hat{x} with respect to the optimal solution xx^{*} is concentrated around its mean as

P\displaystyle P (|f(x^)f(x)1dmd1|<ϵ)1C1eC2ϵ4m\displaystyle\bigg{(}\Big{|}\frac{f(\hat{x})}{f(x^{*})}-1-\frac{d}{m-d-1}\Big{|}<\epsilon\bigg{)}\geq 1-C_{1}e^{-C_{2}\epsilon^{4}m} (8)

where C1,C2C_{1},C_{2} are positive constants.

Proof sketch: Our proof technique involves the concentration of Gaussian quadratic forms for SbSb conditioned on SASA. We leverage that the error A(x^x)22\|A(\hat{x}-x^{*})\|_{2}^{2} is a Gaussian quadratic when conditioned on SASA as shown in (6). In particular, we use the concentration result given in Lemma II.5 for quadratic form of Gaussian random variables.

Lemma II.5 (Concentration of Gaussian quadratic forms, [27])

Let the entries of zmz\in\mathbb{R}^{m} be distributed as i.i.d. 𝒩(0,1)\mathcal{N}(0,1). For any Gm×mG\in\mathbb{R}^{m\times m} and ϵ>0\epsilon>0,

P(zTGz𝔼[zTGz]>2GFϵ+2G2ϵ)eϵ.\displaystyle P\big{(}z^{T}Gz-\operatorname{\mathbb{E}}[z^{T}Gz]>2\|G\|_{F}\sqrt{\epsilon}+2\|G\|_{2}\epsilon\big{)}\leq e^{-\epsilon}\,. (9)

Then we note that after applying Lemma II.5, the conditional expectation of the error is given by 𝔼[A(x^x)22|SA]=f(x)mtr((UTSTSU)1)\operatorname{\mathbb{E}}[\|A(\hat{x}-x^{*})\|_{2}^{2}\,|SA]=\frac{f(x^{*})}{m}\operatorname{tr}((U^{T}S^{T}SU)^{-1}). Next, we focus on the trace term tr((UTSTSU)1)\operatorname{tr}((U^{T}S^{T}SU)^{-1}) and relate it to the Stieltjes transform.

Definition II.6 (Stieltjes Transform)

We define the Stieltjes Transform for a random rectangular matrix Sm×dS\in\mathbb{R}^{m\times d} such that dmd\leq m as

mS(z):\displaystyle m_{S}(z): =1dtr(STSzI)1\displaystyle=\frac{1}{d}\operatorname{tr}(S^{T}S-zI)^{-1} (10)
=1di=1d1λi(STS)z.\displaystyle=\frac{1}{d}\sum_{i=1}^{d}\frac{1}{\lambda_{i}(S^{T}S)-z}\,. (11)

Here λi(STS)\lambda_{i}(S^{T}S) denotes the ii’th eigenvalue of the symmetric matrix STSS^{T}S.

We derive a high probability bound for the trace term around its expectation by leveraging the concentration of the empirical Stieltjes transform to its expectation, which is given in Lemma II.7 below.

Lemma II.7

The trace tr((UTSTSU)1)\operatorname{tr}((U^{T}S^{T}SU)^{-1}) is concentrated around its mean with high probability as follows

P(|tr((UTSTSU)1)𝔼[tr((UTSTSU)1)]|ϵ)14eϵ4(1d/mδ)8210md2emδ2/2,\displaystyle P\Big{(}|\operatorname{tr}((U^{T}S^{T}SU)^{-1})-\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]|\leq\epsilon\Big{)}\geq 1-4e^{-\frac{\epsilon^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}-e^{-m\delta^{2}/2}\,, (12)

for any ϵ,δ>0\epsilon,\delta>0.

The proof of Lemma II.7 is provided in the Appendix. Combining the concentration of the Stieltjes transform with the concentration of Gaussian quadratic forms completes the proof of Theorem II.4.

Next, we show that the relative error of the averaged estimator is also concentrated with exponentially small failure probability.

Theorem II.8 (Concentration bound for the averaged estimator x¯\bar{x})

Let the sketch size satisfy mdm\gtrsim d. Then, the ratio of the cost for the averaged estimator x¯=1qk=1qx^k\bar{x}=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k} to the optimal cost is concentrated around its mean as follows

P(|f(x¯)f(x)11qdmd1|<ϵ)1qC1eC2(qϵ)4m\displaystyle P\bigg{(}\Big{|}\frac{f(\bar{x})}{f(x^{*})}-1-\frac{1}{q}\frac{d}{m-d-1}\Big{|}<\epsilon\bigg{)}\geq 1-qC_{1}e^{-C_{2}(q\epsilon)^{4}m} (13)

where C1,C2C_{1},C_{2} are positive constants.

The above result shows that the relative error of the averaged estimator using distributed sketching concentrates considerably faster compared to a single sketch. Moreover, the above bound offers a method to choose the values of the sketch size mm and number of workers qq to achieve a desired relative error with exponentially small error probability.


Figure 1: Histograms of the optimality ratio f(x¯)/f(x)f(\bar{x})/f(x^{*}) for both Gaussian sketch and randomized Hadamard sketch. The vertical lines indicate the empirical mean of the error. The dataset is synthetically generated with n=512,d=20n=512,d=20 and the sketch size is m=100m=100. The histogram is calculated over 25002500 independent trials. The plot at the top shows the performance of the single sketch estimator, and the middle and bottom plots are for the averaged estimator with q=4q=4 and q=10q=10 workers, respectively.

Figure 1 shows the histograms for the single sketch (q=1q=1) and the averaged estimator (q=4q=4 and q=10q=10) for Gaussian and randomized Hadamard sketches. This experiment demonstrates that the optimality ratio f(x¯)/f(x)f(\bar{x})/f(x^{*}) is concentrated around its mean, and that both the mean and the concentration improve as the number of workers qq increases.

II-C Error Lower Bounds via Fisher Information

We now present an error lower bound result for the Gaussian sketch. We first consider single sketch estimators and then discuss the distributed sketching case. Two different lower bounds can be obtained depending on whether the estimator is restricted to be unbiased or not. Lemma II.9 and II.10 provide error lower bounds for all unbiased and general (i.e., possibly biased) estimators, respectively. The proof of Lemma II.9 is based on Fisher information and Cramér-Rao lower bound while Lemma II.10 additionally employs the van Trees inequality. We refer the reader to [28] for details on these single sketch lower bounds.

Lemma II.9 (Unbiased estimators, [28])

For any single sketch unbiased estimator x^\hat{x} obtained from the Gaussian sketched data SASA and SbSb, the expected error is lower bounded as follows

𝔼[f(x^)]f(x)f(x)dmd1.\displaystyle\operatorname{\mathbb{E}}[f(\hat{x})]-f(x^{*})\geq f(x^{*})\frac{d}{m-d-1}. (14)

We note that the expected error of the single sketch estimator that we provide in Lemma II.1 exactly matches the error lower bound in Lemma II.9. Therefore, we conclude that no other unbiased estimator based on a single sketch SASA, SbSb can achieve a better expected relative error.

Lemma II.10 (General estimators, [28])

For any single sketch estimator x^\hat{x}, which is possibly biased, obtained from the Gaussian sketched data SASA and SbSb, the expected error is lower bounded as follows

𝔼[f(x^)]f(x)f(x)dm.\displaystyle\operatorname{\mathbb{E}}[f(\hat{x})]-f(x^{*})\geq f(x^{*})\frac{d}{m}. (15)

We now provide a novel generalized lower bound that applies to averaged estimators x¯\bar{x} obtained via distributed sketching. Theorem II.11 states the main lower bound result for the averaged estimator.

Theorem II.11

For any averaged estimator x¯=1qk=1qx^k\bar{x}=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k}, where each x^k\hat{x}_{k} is based on Gaussian sketched data SkA,SkbS_{k}A,S_{k}b, k=1,,qk=1,\dots,q, the expected error is lower bounded as follows

  1. (i)

    for unbiased estimators satisfying 𝔼[x^k]=x\operatorname{\mathbb{E}}[\hat{x}_{k}]=x^{*}

    𝔼[f(x¯)]f(x)f(x)qdmd1.\displaystyle\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*})\geq\frac{f(x^{*})}{q}\frac{d}{m-d-1}\,. (16)
  2. (ii)

    for general (i.e., possibly biased) estimators

    𝔼[f(x¯)]f(x)f(x)qdm.\displaystyle\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*})\geq\frac{f(x^{*})}{q}\frac{d}{m}\,. (17)

We note that the expected error of the averaged estimator given in Theorem II.2 matches exactly the lower bound in Theorem II.11 for unbiased estimators. For general estimators, we observe that the lower bound matches the upper bound for large mm, e.g., when md1=O(m)m-d-1=O(m).

II-D Distributed Iterative Hessian Sketch

In this section, we consider an iterative algorithm for solving the unregularized least squares problem in (3) (i.e., λ1=0\lambda_{1}=0) with higher accuracy. Note that Newton’s method terminates in one step when applied to this problem since the Hessian is ATAA^{T}A and

xt+1=xtμ(ATA)1AT(Axtb),\displaystyle x_{t+1}=x_{t}-\mu(A^{T}A)^{-1}A^{T}(Ax_{t}-b)\,, (18)

reduces to directly solving the normal equations (ATA)x=ATb(A^{T}A)x=A^{T}b. However, the computational cost of this direct solution is often prohibitive for large scale problems. To remedy this, the method of iterative Hessian sketch introduced in [4] employs a randomly sketched Hessian ATStTStAA^{T}S_{t}^{T}S_{t}A as follows

xt+1=xtμ(ATStTStA)1AT(Axtb),\displaystyle x_{t+1}=x_{t}-\mu(A^{T}S_{t}^{T}S_{t}A)^{-1}A^{T}(Ax_{t}-b),

where StS_{t} corresponds to the sketching matrix at iteration tt. Sketching reduces the row dimension of the data from nn to mm and hence computing an approximate Hessian ATStTStAA^{T}S_{t}^{T}S_{t}A is computationally cheaper than the exact Hessian ATAA^{T}A. Moreover, for regularized problems one can choose mm to be smaller than dd as we investigate in Section II-F.

In a distributed computing setting, one can obtain more accurate update directions by averaging multiple trials, where each worker node computes an independent estimate of the update direction. These approximate update directions can be averaged at the master node and the following update takes place

xt+1=xtμ1qk=1q(ATSt,kTSt,kA)1AT(Axtb).\displaystyle x_{t+1}=x_{t}-\mu\frac{1}{q}\sum_{k=1}^{q}(A^{T}S_{t,k}^{T}S_{t,k}A)^{-1}A^{T}(Ax_{t}-b). (19)

Here St,kS_{t,k} is the sketching matrix for the kk’th node at iteration tt. The details of the distributed IHS algorithm are given in Algorithm 2. We note that although the update equation involves a matrix inverse term, in practice, this can be replaced with an approximate linear system solver. In particular, it might be computationally more efficient for worker nodes to compute their approximate update directions using indirect methods such as conjugate gradient.

We note that in Algorithm 2, worker nodes communicate their approximate update directions and not the approximate Hessian matrix, which reduces the communication complexity from O(d2)O(d^{2}) to O(d)O(d) for each worker per iteration.

  Input: Number of iterations TT, step size μ\mu.
  for t=1t=1 to TT do
     for worker k=1,,qk=1,\dots,q in parallel do
        Obtain the sketched data St,kAS_{t,k}A.
        Compute gradient gt=AT(Axtb)g_{t}=A^{T}(Ax_{t}-b).
        Solve Δ^t,k=argminΔ12St,kAΔ22+gtTΔ\hat{\Delta}_{t,k}=\arg\min_{\Delta}\frac{1}{2}\|S_{t,k}A\Delta\|_{2}^{2}+g_{t}^{T}\Delta and send to master node.
     end for
     Master node: Update xt+1=xt+μ1qk=1qΔ^t,kx_{t+1}=x_{t}+\mu\frac{1}{q}\sum_{k=1}^{q}\hat{\Delta}_{t,k} and send xt+1x_{t+1} to worker nodes.
  end for
  return  xTx_{T}
Algorithm 2 Distributed Iterative Hessian Sketch
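A serial numpy simulation of Algorithm 2 with Gaussian sketches is sketched below (again, each inner loop iteration would run on a separate worker in practice; the step size \mu=1/\theta_{1} anticipates Theorem II.13).

  import numpy as np

  def distributed_ihs(A, b, m, q, T, rng):
      n, d = A.shape
      mu = (m - d - 1) / m                          # mu = 1/theta_1
      x = np.zeros(d)
      for _ in range(T):
          g = A.T @ (A @ x - b)                     # gradient at the current iterate
          directions = []
          for _ in range(q):
              S = rng.standard_normal((m, n)) / np.sqrt(m)
              SA = S @ A
              # Worker solves min_Delta 0.5*||S A Delta||^2 + g^T Delta.
              directions.append(-np.linalg.solve(SA.T @ SA, g))
          x = x + mu * np.mean(directions, axis=0)  # master averages and updates
      return x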

We establish the convergence rate for Gaussian sketch in Theorem II.13, which provides an exact characterization of the expected error. First, we give the definition of the error in Definition II.12.

Definition II.12

To quantify the approximation quality of the iterate xtdx_{t}\in\mathbb{R}^{d} with respect to the optimal solution xdx^{*}\in\mathbb{R}^{d}, we define the error as etAA(xtx)e^{A}_{t}\coloneqq A(x_{t}-x^{*}).

To state the result, we first introduce the following moments of the inverse Wishart distribution (see Appendix).

θ1\displaystyle\theta_{1} mmd1,\displaystyle\coloneqq\frac{m}{m-d-1},
θ2\displaystyle\theta_{2} m2(m1)(md)(md1)(md3).\displaystyle\coloneqq\frac{m^{2}(m-1)}{(m-d)(m-d-1)(m-d-3)}. (20)
Theorem II.13 (Expected error decay for Gaussian sketch)

In Algorithm 2, let us set μ=1/θ1\mu=1/\theta_{1} and assume St,kS_{t,k}’s are i.i.d. Gaussian sketches, then the expected squared norm of the error etAe^{A}_{t} evolves according to the following relation:

𝔼[et+1A22]=1q(θ2θ121)etA22.\displaystyle\operatorname{\mathbb{E}}[\|e^{A}_{t+1}\|_{2}^{2}]=\frac{1}{q}\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)\|e^{A}_{t}\|_{2}^{2}.

Next, Corollary II.14 builds on Theorem II.13 to characterize the number of iterations for Algorithm 2 to achieve an error of ϵ\epsilon. The number of iterations required for error ϵ\epsilon scales with log(1/ϵ)/log(q)\log(1/\epsilon)/\log(q).

Corollary II.14

Let St,km×nS_{t,k}\in\mathbb{R}^{m\times n} (t=1,,Tt=1,\dots,T, k=1,,qk=1,\dots,q) be Gaussian sketching matrices. Then, Algorithm 2 outputs xTx_{T} that is ϵ\epsilon-accurate with respect to the initial error in expectation, that is, \frac{\operatorname{\mathbb{E}}[\|e^{A}_{T}\|_{2}^{2}]}{\|Ax^{*}\|_{2}^{2}}=\epsilon where TT is given by

T\displaystyle T =log(1/ϵ)log(q)log(θ2θ121),\displaystyle=\frac{\log(1/\epsilon)}{\log(q)-\log\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)},

where the overall required communication is TqdTqd real numbers, and the computational complexity per worker is

O(Tmnd+Tmd2+Td3).\displaystyle O(Tmnd+Tmd^{2}+Td^{3}).
Remark II.15

Provided that mm is at least 2d2d, the term log(θ2θ121)\log\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right) is negative. Hence, TT is upper-bounded by log(1/ϵ)log(q)\frac{\log(1/\epsilon)}{\log(q)}.
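For concreteness, the iteration count of Corollary II.14 can be evaluated directly from the moments in (20); the helper below (our own naming) does this for given ϵ\epsilon, qq, mm and dd.

  import numpy as np

  def ihs_iteration_count(eps, q, m, d):
      # Number of iterations T from Corollary II.14; requires m > d + 3.
      theta1 = m / (m - d - 1)
      theta2 = m ** 2 * (m - 1) / ((m - d) * (m - d - 1) * (m - d - 3))
      return np.log(1 / eps) / (np.log(q) - np.log(theta2 / theta1 ** 2 - 1))

  # Example: q = 10 workers, sketch size m = 4d.
  print(ihs_iteration_count(eps=1e-6, q=10, m=400, d=100))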

II-E Distributed Sketching for Least-Norm Problems

In this section, we consider the underdetermined case where n<dn<d and apply the sketching matrix from the right, i.e., to the features; we refer to this method as right sketch. Let us define the minimum norm solution

x=\displaystyle x^{*}= argminxx22s.t.Ax=b.\displaystyle\arg\min_{x}\|x\|_{2}^{2}\quad\mbox{s.t.}~{}Ax=b. (21)

The above problem has a closed-form solution given by x=AT(AAT)1bx^{*}=A^{T}(AA^{T})^{-1}b when the matrix AA is full row rank. We will assume that the full row rank condition holds in the sequel. Let us denote the optimal value of the minimum norm objective as f(x)=x22=bT(AAT)1bf(x^{*})=\|x^{*}\|_{2}^{2}=b^{T}(AA^{T})^{-1}b. The kk’th worker node will compute the approximate solution via x^k=SkTz^k\hat{x}_{k}=S_{k}^{T}\hat{z}_{k} where

z^k=\displaystyle\hat{z}_{k}= argminzz22s.t.ASkTz=b,\displaystyle\arg\min_{z}\|z\|_{2}^{2}\quad\mbox{s.t.}~{}AS_{k}^{T}z=b, (22)

where Skm×dS_{k}\in\mathbb{R}^{m\times d} and zmz\in\mathbb{R}^{m}. The averaged solution is then computed at the master node as x¯=1qk=1qx^k\bar{x}=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k}. We will assume that the sketch matrices SkS_{k} are i.i.d. Gaussian in deriving the error expressions. Lemma II.16 establishes the approximation error for a single right sketch estimator.

Lemma II.16

For the Gaussian sketch with sketch size m>n+1m>n+1, the estimator x^\hat{x} satisfies

𝔼[x^x22]=dnmn1f(x).\displaystyle\operatorname{\mathbb{E}}[\|\hat{x}-x^{*}\|_{2}^{2}]=\frac{d-n}{m-n-1}f(x^{*}).
Proof:

Conditioned on ASTAS^{T}, we have

\hat{x}\sim\mathcal{N}\Big{(}x^{*},\frac{1}{m}\|SA^{T}(AS^{T}SA^{T})^{-1}b\|_{2}^{2}\,P_{\mbox{Null}(A)}\Big{)}\,.

Noting that 𝔼[(ASTSAT)1]=(AAT)1mmn1\operatorname{\mathbb{E}}[(AS^{T}SA^{T})^{-1}]=(AA^{T})^{-1}\frac{m}{m-n-1}, taking the expectation and noting that tr(PNull(A))=dn\operatorname{tr}(P_{\mbox{Null}(A)})=d-n, we obtain

𝔼[x^x22]\displaystyle\operatorname{\mathbb{E}}[\|\hat{x}-x^{*}\|_{2}^{2}] =dnmn1bT(AAT)1b=dnmn1f(x).\displaystyle=\frac{d-n}{m-n-1}b^{T}(AA^{T})^{-1}b=\frac{d-n}{m-n-1}f(x^{*}). (23)

An exact formula for averaging multiple outputs in right sketch that parallels Theorem II.2 can be obtained in a similar fashion:

𝔼[x¯x22]\displaystyle\operatorname{\mathbb{E}}[\|\bar{x}-x^{*}\|_{2}^{2}] =1q𝔼[x^kx22]=1qdnmn1f(x).\displaystyle=\frac{1}{q}\operatorname{\mathbb{E}}[\|\hat{x}_{k}-x^{*}\|_{2}^{2}]=\frac{1}{q}\frac{d-n}{m-n-1}f(x^{*}).

Hence, for the distributed Gaussian right sketch, we establish the approximation error as stated in Theorem II.17.

Theorem II.17 (Cost approximation for least-norm problems)

Let SkS_{k}, k=1,,qk=1,\dots,q be Gaussian sketching matrices, then the error of the averaged solution x¯\bar{x} satisfies

𝔼[f(x¯)]f(x)f(x)=1qdnmn1.\displaystyle\frac{\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*})}{f(x^{*})}=\frac{1}{q}\frac{d-n}{m-n-1}\,. (24)

Consequently, Markov’s inequality implies that f(x¯)f(x)f(x)ϵ\frac{f(\bar{x})-f(x^{*})}{f(x^{*})}\leq\epsilon holds with probability at least (11qϵdnmn1)\left(1-\frac{1}{q\epsilon}\frac{d-n}{m-n-1}\right) for any ϵ>0\epsilon>0.
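A minimal numpy simulation of distributed right sketch (assuming Gaussian SkS_{k} and m>n+1m>n+1) is given below; each worker solves the reduced least-norm problem (22) and lifts its solution back via SkTz^kS_{k}^{T}\hat{z}_{k}.

  import numpy as np

  def distributed_right_sketch(A, b, m, q, rng):
      # Averaged right-sketch estimator for min ||x||_2 s.t. Ax = b with n < d.
      n, d = A.shape
      xs = []
      for _ in range(q):
          S = rng.standard_normal((m, d)) / np.sqrt(m)   # sketch applied to the features
          AS_t = A @ S.T
          z_k, *_ = np.linalg.lstsq(AS_t, b, rcond=None) # min-norm solution of (A S^T) z = b
          xs.append(S.T @ z_k)
      return np.mean(xs, axis=0)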

II-F Bias Correction for Regularized Least Squares

We have previously studied distributed randomized regression for unregularized problems and showed that Gaussian sketch leads to unbiased estimators. In this section, we focus on the regularized case and show that using the original regularization coefficient for the sketched problems causes the estimators to be biased. In addition, we provide a bias correction procedure. More precisely, the method described in this section is based on non-iterative averaging for solving the linear least squares problem with 2\ell_{2} regularization, i.e., ridge regression.

Note that we have λ1\lambda_{1} as the regularization coefficient of the original problem and λ2\lambda_{2} for the sketched sub-problems. If λ2\lambda_{2} is chosen to be equal to λ1\lambda_{1}, then this scheme reduces to the framework given in the work of [3] and we show in Theorem II.20 that λ2=λ1\lambda_{2}=\lambda_{1} leads to a biased estimator, which does not converge to the optimal solution.

We first introduce the following results on traces involving random Gaussian matrices which are instrumental in our result.

Lemma II.18 ([29])

For a Gaussian sketching matrix SS, the following asymptotic formula holds

limn𝔼[tr((UTSTSU+λ2I)1)]=d×θ3(d/m,λ2),\displaystyle\lim_{n\rightarrow\infty}\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU+\lambda_{2}I)^{-1})]=d\times\theta_{3}(d/m,\lambda_{2}),

where θ3(d/m,λ2)\theta_{3}(d/m,\lambda_{2}) is defined as

θ3(d/m,λ2):=λ2+d/m1+(λ2+d/m1)2+4λ2d/m2λ2d/m.\displaystyle\theta_{3}(d/m,\lambda_{2}):=\frac{-\lambda_{2}+d/m-1+\sqrt{(-\lambda_{2}+d/m-1)^{2}+4\lambda_{2}d/m}}{2\lambda_{2}d/m}\,.
Lemma II.19

For a Gaussian sketching matrix SS, the following asymptotic formula holds

limn𝔼[(UTSTSU+λ2I)1]=θ3(d/m,λ2)Id,\displaystyle\lim_{n\rightarrow\infty}\operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}]=\theta_{3}(d/m,\lambda_{2})I_{d},

where θ3(d/m,λ2)\theta_{3}(d/m,\lambda_{2}) is as defined in Lemma II.18.

We list the distributed randomized ridge regression method in Algorithm 3. This algorithm assumes that our goal is to solve a large scale regression problem for a given value of regularization coefficient λ1\lambda_{1}. Theorem II.20 states the main result of this subsection. In short, for the averaged result to converge to the optimal solution xx^{*}, the regularization coefficient needs to be modified to λ2\lambda_{2}^{*}.

Theorem II.20

Given the thin SVD decomposition A=UΣVTn×dA=U\Sigma V^{T}\in\mathbb{R}^{n\times d} with ndn\geq d, and assuming AA has full rank and identical singular values (i.e., Σ=σId\Sigma=\sigma I_{d} for some σ>0\sigma>0), there is a value of λ2\lambda_{2} for which the bias of the single sketch estimator, 𝔼[A(x^kx)]\operatorname{\mathbb{E}}[A(\hat{x}_{k}-x^{*})], vanishes as nn tends to infinity if

  • (i)

    m>dm>d   or

  • (ii)

    mdm\leq d and λ1σ2(dm1)\lambda_{1}\geq\sigma^{2}\left(\frac{d}{m}-1\right)

and the value of λ2\lambda_{2} that achieves zero bias is given by

λ2=λ1dmλ11+λ1/σ2,\displaystyle\lambda_{2}^{*}=\lambda_{1}-\frac{d}{m}\frac{\lambda_{1}}{1+\lambda_{1}/\sigma^{2}}, (25)

where the matrix SkS_{k} in x^k=argminxSkAxSkb22+λ2x22\hat{x}_{k}=\arg\min_{x}\|S_{k}Ax-S_{k}b\|_{2}^{2}+\lambda_{2}\|x\|_{2}^{2} is the Gaussian sketch.

  Set σ\sigma to the mean of singular values of AA.
  Calculate λ2=λ1dmλ11+λ1/σ2\lambda_{2}^{*}=\lambda_{1}-\frac{d}{m}\frac{\lambda_{1}}{1+\lambda_{1}/\sigma^{2}}.
  for worker k=1,,qk=1,\dots,q in parallel do
     Obtain the sketched data and sketched output: SkAS_{k}A and SkbS_{k}b.
     Solve x^k=argminxSkAxSkb22+λ2x22\hat{x}_{k}=\arg\min_{x}\|S_{k}Ax-S_{k}b\|_{2}^{2}+\lambda_{2}^{*}\|x\|_{2}^{2} and send x^k\hat{x}_{k} to master node.
  end for
  Master node: return x¯=1qk=1qx^k\bar{x}=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k}.
Algorithm 3 Distributed Randomized Ridge Regression
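A serial numpy sketch of Algorithm 3 is shown below; the regularization coefficient of the sub-problems is set to λ2\lambda_{2}^{*} from (25), with σ\sigma taken as the mean singular value of AA as in the listing above.

  import numpy as np

  def distributed_ridge_regression(A, b, m, q, lam1, rng):
      n, d = A.shape
      sigma = np.mean(np.linalg.svd(A, compute_uv=False))      # heuristic sigma
      lam2 = lam1 - (d / m) * lam1 / (1 + lam1 / sigma ** 2)    # bias-corrected coefficient (25)
      xs = []
      for _ in range(q):
          S = rng.standard_normal((m, n)) / np.sqrt(m)          # Gaussian sketch
          SA, Sb = S @ A, S @ b
          xs.append(np.linalg.solve(SA.T @ SA + lam2 * np.eye(d), SA.T @ Sb))
      return np.mean(xs, axis=0)                                # master node averages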

Figure 2 illustrates the practical implications of Theorem II.20. If λ2\lambda_{2} is chosen according to the formula in (25), then the averaged solution x¯\bar{x} gives a significantly better approximation to xx^{*} than if we had used λ2=λ1\lambda_{2}=\lambda_{1}. The data matrix AA in Figure 2(a) has identical singular values, and 2(b) shows the case where the singular values of AA are not identical. When the singular values of AA are not all equal to each other, we set σ\sigma to the mean of the singular values of AA as a heuristic, which works extremely well as shown in Figure 2(b). According to the formula in (25), the value of λ2\lambda_{2} that we need to use to achieve zero bias is found to be λ2=0.833\lambda_{2}^{*}=0.833 whereas λ1=5\lambda_{1}=5. The plot in Figure 2(b) illustrates that even if the assumption that Σ=σId\Sigma=\sigma I_{d} in Theorem II.20 is violated, the proposed bias corrected averaging method outperforms vanilla averaging in [3] where λ2=λ1\lambda_{2}=\lambda_{1}.


(a) Identical singular values


(b) Non-identical singular values

Figure 2: Plots of x¯x2/x2\|\bar{x}-x^{*}\|_{2}/\|x^{*}\|_{2} against the number of averaged worker outputs for an unconstrained least squares problem with regularization using Algorithm 3. The dashed blue line corresponds to the case where λ2\lambda_{2} is determined according to the formula (25), and the solid red line corresponds to the case where λ2\lambda_{2} is the same as λ1\lambda_{1}. The parameters are as follows: n=1000n=1000, d=100d=100, λ1=5\lambda_{1}=5, m=20m=20, sketch type is Gaussian. (a) All singular values of AA are 11. (b) Singular values of AA are not identical and their mean is 11.
Remark II.21 (Varying sketch sizes)

Let us now consider the scenario where we have different sketch sizes for each worker node. This situation frequently arises in heterogeneous computing environments. Specifically, let us assume that the sketch size for worker kk is mkm_{k}, k=1,,qk=1,\dots,q. It follows from Theorem II.20 that by choosing the regularization parameter for worker node kk as

λ2(k)=λ1dmkλ11+λ1/σ2,\displaystyle\lambda_{2}^{*}(k)=\lambda_{1}-\frac{d}{m_{k}}\frac{\lambda_{1}}{1+\lambda_{1}/\sigma^{2}},

it is possible to obtain unbiased estimators x^k\hat{x}_{k}, k=1,,qk=1,\dots,q and hence an unbiased averaged result x¯\bar{x}. Note that here we assume that the sketch size for each worker satisfies the condition in Theorem II.20, that is, either mk>dm_{k}>d or mkdm_{k}\leq d and λ1σ2(d/mk1)\lambda_{1}\geq\sigma^{2}(d/m_{k}-1).

III Privacy Preserving Property

We now digress from the error and convergence properties to consider the privacy preserving properties of distributed sketching. We use the notion of differential privacy for our privacy result, given in Definition III.1. Differential privacy is a worst case privacy notion that does not require distributional assumptions on the data, and it is the privacy framework adopted by many works in the literature.

Definition III.1 ((ε,δ)(\varepsilon,\delta)-Differential Privacy, [30], [31])

An algorithm ALG which maps (n×d)(n\times d)-matrices into some range \mathcal{R} satisfies (ε,δ)(\varepsilon,\delta)-differential privacy if for all pairs of neighboring inputs AA and AA^{\prime} (i.e. they differ only in a single row) and all subsets 𝒮\mathcal{S}\subset\mathcal{R}, it holds that P(ALG(A)𝒮)eεP(ALG(A)𝒮)+δP(\text{ALG}(A)\in\mathcal{S})\leq e^{\varepsilon}P(\text{ALG}(A^{\prime})\in\mathcal{S})+\delta.

Theorem III.2 characterizes the required conditions and the sketch size to guarantee (ε,δ)(\varepsilon,\delta)-differential privacy for a given dataset. The proof of Theorem III.2 is based on the (ε,δ)(\varepsilon,\delta)-differential privacy result for i.i.d. Gaussian random projections from [31].

Theorem III.2

Given the data matrix An×dA\in\mathbb{R}^{n\times d} and the output vector bnb\in\mathbb{R}^{n}, let AcA_{c} denote the concatenated version of AA and bb, i.e., Ac=[A,b]n×(d+1)A_{c}=[A,b]\in\mathbb{R}^{n\times{(d+1)}}. Define β:=ln(4/δ)\beta:=\ln(4/\delta). Define σ0:=σmin(Ac)/n\sigma_{0}:=\sigma_{\min}(A_{c})/\sqrt{n}, and B0:=maxi,j|Ac,ij|B_{0}:=\max_{i,j}|A_{c,ij}|. Suppose that the condition

nd+1(3+2βε)B02σ02\displaystyle\frac{n}{d+1}\geq\left(3+\frac{2\beta}{\varepsilon}\right)\frac{B_{0}^{2}}{\sigma_{0}^{2}} (26)

is satisfied and the sketch size satisfies

m\displaystyle m =O(βn2(d+1)2ε2(ε+β)2)for privacy w.r.t. one worker,\displaystyle=O\left(\beta\frac{n^{2}}{(d+1)^{2}}\frac{\varepsilon^{2}}{(\varepsilon+\beta)^{2}}\right)\,\text{for privacy w.r.t. one worker,} (27)
m\displaystyle m =O(βqn2(d+1)2ε2(ε+β)2)for privacy w.r.t. q workers,\displaystyle=O\left(\frac{\beta}{q}\frac{n^{2}}{(d+1)^{2}}\frac{\varepsilon^{2}}{(\varepsilon+\beta)^{2}}\right)\,\text{for privacy w.r.t. $q$ workers,} (28)

and publishing the sketched matrices SkAcm×(d+1)S_{k}A_{c}\in\mathbb{R}^{m\times{(d+1)}}, k=1,,qk=1,\dots,q, provided that d+1<md+1<m, is (ε,δ)(\varepsilon,\delta)-differentially private, where Skm×nS_{k}\in\mathbb{R}^{m\times n} is the Gaussian sketch, and ε>0\varepsilon>0, β>1+ln(4)\beta>1+\ln(4).
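Since the sketch sizes in (27) and (28) are stated up to unspecified constants, the snippet below only checks the data-dependent condition (26) for given privacy parameters (a hedged illustration; the function name is ours, and using the smallest singular value assumes AcA_{c} has full column rank).

  import numpy as np

  def privacy_condition_holds(A, b, eps, delta):
      # Checks condition (26) of Theorem III.2 for A_c = [A, b].
      Ac = np.column_stack([A, b])
      n, d_plus_1 = Ac.shape
      beta = np.log(4 / delta)
      sigma0 = np.linalg.svd(Ac, compute_uv=False).min() / np.sqrt(n)
      B0 = np.abs(Ac).max()
      return n / d_plus_1 >= (3 + 2 * beta / eps) * (B0 / sigma0) ** 2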

Remark III.3

For fixed values of β\beta, σmin\sigma_{\min}, B0B_{0}, nn and dd, the approximation error is on the order of O(1ε2)O\left(\frac{1}{\varepsilon^{2}}\right) for (ε,δ)(\varepsilon,\delta)-differential privacy with respect to all qq workers.

The work [32] considers convex optimization under privacy constraints and shows that under ε\varepsilon-differential privacy (i.e., Definition III.1 with δ=0\delta=0), the approximation error of their distributed iterative algorithm is on the order of O(1ε2)O(\frac{1}{\varepsilon^{2}}), which is of the same order as Algorithm 1. The two algorithms however have different dependencies on parameters, which are hidden in the OO-notation. We note that the approximation error of the algorithm in [32] depends on parameters that Algorithm 1 does not have, such as the initial step size, the step size decay rate, and the noise decay rate. The reason for this is that the algorithm in [32] is a synchronous iterative algorithm designed to solve a more general class of optimization problems. Algorithm 1, on the other hand, is designed to solve linear regression problems and offers a significant advantage due to its single-round communication requirement.

Remark III.4

A similar privacy statement holds for the right sketch method discussed in Section II-E. In right sketch, we only sketch the data matrix AA and not the output vector bb. For publishing ASkTAS_{k}^{T} to be (ε,δ)(\varepsilon,\delta)-differentially private, Theorem III.2 still holds with the modification that we replace AcA_{c} with ATA^{T}.

IV Bounds for Other Sketching Matrices

In this section, we consider randomized Hadamard sketch, uniform sampling, and leverage score sampling for solving the unregularized linear least squares problem minxf(x)=Axb22\min_{x}f(x)=\|Ax-b\|_{2}^{2}. For each of these sketching methods, we present upper bounds on the bias of the corresponding single sketch estimator. Then, we combine these in Theorem IV.6, which provides high probability upper bounds for the error of the averaged estimator.

Lemma IV.1 expresses the expected objective value difference in terms of the bias and variance of the single sketch estimator.

Lemma IV.1

For any i.i.d. sketching matrices SkS_{k}, k=1,,qk=1,\dots,q, the expected objective value difference for the averaged estimator x¯\bar{x} can be decomposed as

𝔼[f(x¯)]f(x)=1q𝔼[Ax^Ax22]+q1q𝔼[Ax^]Ax22,\displaystyle\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*})=\frac{1}{q}\operatorname{\mathbb{E}}\left[\|A\hat{x}-Ax^{*}\|_{2}^{2}\right]+\frac{q-1}{q}\|\operatorname{\mathbb{E}}[A\hat{x}]-Ax^{*}\|_{2}^{2}, (29)

where x^\hat{x} is the single sketch estimator.
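As a quick numerical sanity check of (29), the following self-contained Python sketch estimates both sides of the decomposition for the Gaussian sketch by Monte Carlo; the dimensions, sketch size, and trial counts are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, d, m, q, trials = 500, 20, 80, 5, 2000

A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
f_star = np.linalg.norm(A @ x_star - b) ** 2

def single_sketch(rng):
    """One Gaussian-sketch estimator xhat = argmin ||S A x - S b||_2^2."""
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    xh, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return xh

# left-hand side of (29): E[f(xbar)] - f(x*), estimated by Monte Carlo
gaps = []
for _ in range(trials):
    x_bar = np.mean([single_sketch(rng) for _ in range(q)], axis=0)
    gaps.append(np.linalg.norm(A @ x_bar - b) ** 2 - f_star)
lhs = np.mean(gaps)

# right-hand side of (29): (1/q) E||A xhat - A x*||^2 + ((q-1)/q) ||E[A xhat] - A x*||^2
err = np.array([A @ (single_sketch(rng) - x_star) for _ in range(trials)])
rhs = np.mean(np.sum(err ** 2, axis=1)) / q + (q - 1) / q * np.sum(err.mean(axis=0) ** 2)

print(lhs, rhs)   # the two estimates agree up to Monte Carlo error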

Lemma IV.2 establishes an upper bound on the norm of the bias for any i.i.d. sketching matrix. The results in the remainder of this section will build on this lemma.

Lemma IV.2

Let the SVD of AA be denoted as A=UΣVTA=U\Sigma V^{T}. Let z:=UTSTSbz:=U^{T}S^{T}Sb^{\perp} and Q:=(UTSTSU)1Q:=(U^{T}S^{T}SU)^{-1} where b:=bAxb^{\perp}:=b-Ax^{*}. Define the event EE as (1ϵ)IdQ(1+ϵ)Id(1-\epsilon)I_{d}\preceq Q\preceq(1+\epsilon)I_{d}. Then, for any sketch matrix SS with 𝔼[STS]=In\operatorname{\mathbb{E}}[S^{T}S]=I_{n}, the bias of the single sketch estimator, conditioned on EE, is upper bounded as

𝔼[Ax^E]Ax24ϵ𝔼[z22|E].\displaystyle\|\operatorname{\mathbb{E}}[A\hat{x}|E]-Ax^{*}\|_{2}\leq\sqrt{4\epsilon\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E]}\,. (30)

The event EE is a high probability event and it is equivalent to the subspace embedding property (e.g. [3]). We will analyze the unconditioned error later in Theorem IV.6.

We note that Lemma IV.1 and IV.2 apply to all of the sketching matrices considered in this work. We now give specific bounds for the bias of the single sketch estimator separately for each of randomized Hadamard sketch, uniform sampling, and leverage score sampling.

Randomized Hadamard sketch: This method has the advantage of low computational complexity due to the Fast Hadamard Transform, which can be computed in O(nlogn)O(n\log n), whereas applying the Gaussian sketch takes O(mn)O(mn) time for length nn vectors. Lemma IV.3 states the upper bound for the bias for randomized Hadamard sketch.

Lemma IV.3

For randomized Hadamard sketch, the bias is upper bounded as

𝔼[Ax^E]Ax24ϵdmf(x).\displaystyle\|\operatorname{\mathbb{E}}[A\hat{x}|E]-Ax^{*}\|_{2}\leq\sqrt{4\epsilon\frac{d}{m}f(x^{*})}. (31)
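To make the $O(n\log n)$ cost concrete, a minimal Python sketch of the randomized Hadamard sketch is given below; it is an illustration that assumes $n$ is a power of two and uses an in-place fast Walsh-Hadamard transform, with the scaling chosen so that $\mathbb{E}[S^{T}S]=I_{n}$.

import numpy as np

def fwht(x):
    """In-place fast Walsh-Hadamard transform along axis 0; x.shape[0] must be a power of two."""
    n, h = x.shape[0], 1
    while h < n:
        for i in range(0, n, 2 * h):
            a, c = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + c, a - c
        h *= 2
    return x

def randomized_hadamard_sketch(A, m, rng):
    """Return S A where S = sqrt(n/m) P H D, so that E[S^T S] = I_n."""
    n = A.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)           # D: random sign flips
    HDA = fwht(signs[:, None] * A) / np.sqrt(n)       # H D A, with H / sqrt(n) orthonormal
    rows = rng.choice(n, size=m, replace=False)       # P: uniform row subsampling
    return np.sqrt(n / m) * HDA[rows]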

Uniform sampling: We note that the bias of the uniform sampling estimator differs depending on whether sampling is performed with or without replacement. The reason is that the rows of the sketching matrix $S$ are independent when sampling with replacement, which no longer holds when sampling without replacement. Lemma IV.4 provides bounds for both cases.

Lemma IV.4

For uniform sampling, the bias can be upper bounded as

𝔼[Ax^E]Ax24ϵμdmf(x)\displaystyle\|\operatorname{\mathbb{E}}[A\hat{x}|E]-Ax^{*}\|_{2}\leq\sqrt{4\epsilon\frac{\mu d}{m}\,f(x^{*})} (32)
𝔼[Ax^E]Ax24ϵμdmnmn1f(x),\displaystyle\|\operatorname{\mathbb{E}}[A\hat{x}|E]-Ax^{*}\|_{2}\leq\sqrt{4\epsilon\frac{\mu d}{m}\frac{n-m}{n-1}\,f(x^{*})}, (33)

for sampling with and without replacement, respectively, where $\mu$ denotes the row coherence of $A$ defined in Section I-C.

Leverage score sampling: Recall that the row leverage scores of a matrix A=UΣVTA=U\Sigma V^{T} are computed via i=u~i22\ell_{i}=\|\tilde{u}_{i}\|_{2}^{2} for i=1,,ni=1,\dots,n where u~id\tilde{u}_{i}\in\mathbb{R}^{d} denotes the ii’th row of UU. Since leverage score sampling looks at the data to help guide its sampling strategy, it achieves a lower error compared to uniform sampling which treats all the samples equally. Lemma IV.5 gives the upper bound for the bias of the leverage score sampling estimator.

Lemma IV.5

For leverage score sampling, the bias can be upper bounded as

𝔼[Ax^E]Ax24ϵdmf(x).\displaystyle\|\operatorname{\mathbb{E}}[A\hat{x}|E]-Ax^{*}\|_{2}\leq\sqrt{4\epsilon\frac{d}{m}f(x^{*})}. (34)
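A minimal Python sketch of leverage score sampling is given below as an illustration; it computes the scores from the thin SVD and rescales the sampled rows in the standard way so that $\mathbb{E}[S^{T}S]=I_{n}$ under sampling with replacement.

import numpy as np

def leverage_score_sampling_sketch(A, m, rng):
    """Sample m rows of A with probabilities proportional to their leverage scores.

    Sampling is with replacement and rows are rescaled by 1/sqrt(m p_i) so that E[S^T S] = I_n.
    """
    n, d = A.shape
    U, _, _ = np.linalg.svd(A, full_matrices=False)   # thin SVD; the rows of U give the leverage scores
    lev = np.sum(U ** 2, axis=1)                      # leverage scores l_i = ||u_i||_2^2
    p = lev / lev.sum()                               # sampling probabilities (the scores sum to d for full rank A)
    idx = rng.choice(n, size=m, replace=True, p=p)
    return A[idx] / np.sqrt(m * p[idx])[:, None]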

The results given in Lemmas IV.3, IV.4, and IV.5 can be combined with Lemma IV.1. In particular, the error of the averaged estimator contains the squared bias term scaled by $(q-1)/q$. For a distributed computing system with a large number of worker nodes $q$, this scaling factor is very close to $1$, while the variance term vanishes since it is scaled by $1/q$.

We now leverage our analysis of the bias upper bounds to establish upper bounds on the error of the averaged estimator. Theorem IV.6 gives high probability error bounds for different sketch types when the sketch size is set accordingly.

Theorem IV.6

Suppose that the sketch size is selected as

m\displaystyle m d+log(n)ϵ2log(d/δ) for randomized Hadamard sketch,\displaystyle\gtrsim\frac{d+\log(n)}{\epsilon^{2}}\log(d/\delta)\mbox{ for randomized Hadamard sketch,}
m\displaystyle m μdϵ2log(d/δ) for uniform sampling,\displaystyle\gtrsim\frac{\mu d}{\epsilon^{2}}\log(d/\delta)\mbox{ for uniform sampling,}
m\displaystyle m dϵ2log(d/δ) for leverage score sampling,\displaystyle\gtrsim\frac{d}{\epsilon^{2}}\log(d/\delta)\mbox{ for leverage score sampling}, (35)

for some ϵ(0,14]\epsilon\in(0,\frac{1}{4}] and δ>0\delta>0. Here μ\mu is the row coherence defined in Section I-C, that is, μ=n/dmaxiu~i22\mu=n/d\max_{i}\|\tilde{u}_{i}\|_{2}^{2}. Then, the relative optimality gap of the averaged estimator obeys the following upper bounds:

P(f(x¯)f(x)1+γ)\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}\leq 1+\gamma\Big{)}\geq
1qδdqγm((1+ϵ)2+4ϵ(q1)) for randomized Hadamard sketch,\displaystyle 1-q\delta-\frac{d}{q\gamma m}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\mbox{ for randomized Hadamard sketch,} (36)
1qδμdqγm((1+ϵ)2+4ϵ(q1)) for uniform sampling with replacement,\displaystyle 1-q\delta-\frac{\mu d}{q\gamma m}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\mbox{ for uniform sampling with replacement,} (37)
1qδμdqγmnmn1((1+ϵ)2+4ϵ(q1)) for uniform sampling without replacement,\displaystyle 1-q\delta-\frac{\mu d}{q\gamma m}\frac{n-m}{n-1}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\mbox{ for uniform sampling without replacement,} (38)
1qδdqγm((1+ϵ)2+4ϵ(q1)) for leverage score sampling,\displaystyle 1-q\delta-\frac{d}{q\gamma m}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\mbox{ for leverage score sampling,} (39)

for any γ>0\gamma>0.

Remarkably, the optimality ratio of the distributed sketching estimator converges to $1$ as $q$ or $m$ gets large. This is different from the results of [3], which do not imply convergence to $1$ as $q\rightarrow\infty$ due to an additional bias term. We note that the relative error of the uniform sampling sketch depends on the row coherence, unlike the randomized Hadamard and leverage score sampling sketches.

The main idea for the proof of Theorem IV.6 involves using the decomposition given in Lemma IV.1 along with the upper bounds for 𝔼[UTSTSb22|E]\operatorname{\mathbb{E}}[\|U^{T}S^{T}Sb^{\perp}\|_{2}^{2}|E]. Then, we use Markov’s inequality to find a high probability bound for the approximation quality of the averaged estimator, conditioned on the events EkE_{k}, k=1,,qk=1,\dots,q defined as (1ϵ)Id(UTSkTSkU)1(1+ϵ)Id(1-\epsilon)I_{d}\preceq(U^{T}S_{k}^{T}S_{k}U)^{-1}\preceq(1+\epsilon)I_{d}. This inequality is a direct result of the subspace embedding property and holds with high probability. The last component of the proof deals with removing the conditioning on the events EkE_{k} to find the upper bound for the relative error f(x¯)/f(x)f(\bar{x})/f(x^{*}).
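To illustrate Theorem IV.6 numerically, the short self-contained Python sketch below builds the averaged estimator from $q$ uniform-sampling sketches with replacement and reports the relative optimality gap $f(\bar{x})/f(x^{*})$; the dimensions are arbitrary illustrative choices and the gap approaches $1$ as $q$ and $m$ grow.

import numpy as np

rng = np.random.default_rng(1)
n, d, m, q = 5000, 50, 500, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
f_star = np.linalg.norm(A @ x_star - b) ** 2

solutions = []
for _ in range(q):
    rows = rng.integers(0, n, size=m)                            # uniform sampling with replacement
    SA, Sb = A[rows] * np.sqrt(n / m), b[rows] * np.sqrt(n / m)  # rescaling so that E[S^T S] = I_n
    xk, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    solutions.append(xk)

x_bar = np.mean(solutions, axis=0)
print(np.linalg.norm(A @ x_bar - b) ** 2 / f_star)               # relative optimality gap f(x_bar)/f(x*)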

V Distributed Sketching for Nonlinear Convex Optimization Problems

In this section, we consider randomized second order methods for solving a broader range of problems by leveraging a distributed version of the Newton Sketch algorithm described in [21].

V-A Distributed Newton Sketch

The update equation for Newton’s method is of the form xt+1=xtα1Ht1gtx_{t+1}=x_{t}-\alpha_{1}H_{t}^{-1}g_{t}, where Htd×dH_{t}\in\mathbb{R}^{d\times d} and gtdg_{t}\in\mathbb{R}^{d} denote the Hessian matrix and the gradient vector at iteration tt respectively, and α1\alpha_{1} is the step size. In contrast, Newton Sketch performs the approximate updates

xt+1=xt+α1argminΔ(12StHt1/2Δ22+gtTΔ),\displaystyle x_{t+1}=x_{t}+\alpha_{1}\arg\min_{\Delta}(\frac{1}{2}\|S_{t}H_{t}^{1/2}\Delta\|_{2}^{2}+g_{t}^{T}\Delta), (40)

where the sketching matrices $S_{t}\in\mathbb{R}^{m\times n}$ are refreshed at every iteration. There are many possible ways to devise a distributed Newton's method or a distributed Newton sketch algorithm. Here we consider a scheme that is similar in spirit to the GIANT algorithm [22], where worker nodes communicate length-$d$ approximate update directions to be averaged at the master node. An alternative scheme would be to communicate the approximate Hessian matrices, which would increase the communication load to $d^{2}$ numbers.

We consider Hessian matrices of the form Ht=(Ht1/2)THt1/2H_{t}=(H_{t}^{1/2})^{T}H_{t}^{1/2}, where we assume that Ht1/2n×dH_{t}^{1/2}\in\mathbb{R}^{n\times d} is a full rank matrix and ndn\geq d. Note that this factorization is already available in terms of scaled data matrices in many problems as we illustrate in the sequel. This enables the fast construction of an approximation of HtH_{t} by applying sketching StHt1/2S_{t}H_{t}^{1/2} which leads to the approximation H^t=(StHt1/2)TStHt1/2\hat{H}_{t}=(S_{t}H_{t}^{1/2})^{T}S_{t}H_{t}^{1/2}. Averaging for regularized problems with Hessian matrices of the form Ht=(Ht1/2)THt1/2+λ1IdH_{t}=(H_{t}^{1/2})^{T}H_{t}^{1/2}+\lambda_{1}I_{d} will be considered in the next subsection.

The update equation for distributed Newton sketch for a system with qq worker nodes can be written as

xt+1=xt+α21qk=1qargminΔ(12St,kHt1/2Δ22+gtTΔ).\displaystyle x_{t+1}=x_{t}+\alpha_{2}\frac{1}{q}\sum_{k=1}^{q}\arg\min_{\Delta}\,\left(\frac{1}{2}\|S_{t,k}H_{t}^{1/2}\Delta\|_{2}^{2}+g_{t}^{T}\Delta\right). (41)

Note that the above update requires access to the full gradient gtg_{t}. If worker nodes do not have access to the entire dataset, then this requires an additional communication round per iteration where worker nodes communicate their local gradients with the master node, which computes the full gradient and broadcasts to worker nodes. The details of the distributed Newton sketch method are given in Algorithm 4.

  Input: Tolerance ϵ\epsilon
  while $g_{t}^{T}\left(\sum_{k=1}^{q}\hat{\Delta}_{t,k}\right)/2>\epsilon$ do
     for worker k=1,,qk=1,\dots,q in parallel do
        Obtain St,kHt1/2S_{t,k}H_{t}^{1/2}.
        Obtain the gradient gtg_{t}.
        Compute approximate Newton direction Δ^t,k=argminΔ(12St,kHt1/2Δ22+gtTΔ)\hat{\Delta}_{t,k}=\arg\min_{\Delta}(\frac{1}{2}\|S_{t,k}H_{t}^{1/2}\Delta\|_{2}^{2}+g_{t}^{T}\Delta) and send to master node.
     end for
     Master node: Determine step size α2\alpha_{2} and update xt+1=xt+α21qk=1qΔ^t,kx_{t+1}=x_{t}+\alpha_{2}\frac{1}{q}\sum_{k=1}^{q}\hat{\Delta}_{t,k}.
  end while
Algorithm 4 Distributed Newton Sketch
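The following Python sketch is an illustration of the update (41) rather than the serverless implementation of Section VII: it simulates Algorithm 4 with Gaussian sketches on the least squares objective $f(x)=\frac{1}{2}\|Ax-b\|_{2}^{2}$, for which $H^{1/2}=A$ and $g=A^{T}(Ax-b)$. The serial loop over $k$ stands in for the $q$ parallel workers, and the step size is simply left at $\alpha_{2}=1$ rather than the scalings of Theorem V.1 below.

import numpy as np

rng = np.random.default_rng(0)
n, d, m, q = 2000, 100, 400, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = np.zeros(d)
alpha2 = 1.0                                   # averaged step size; Theorem V.1 suggests rescaling by alpha_s

for t in range(20):
    g = A.T @ (A @ x - b)                      # exact gradient of f(x) = 0.5 ||Ax - b||^2
    directions = []
    for k in range(q):                         # serial stand-in for q parallel workers
        S = rng.standard_normal((m, n)) / np.sqrt(m)
        SA = S @ A                             # sketched Hessian square root S_{t,k} H_t^{1/2}
        # argmin_D 0.5 ||SA D||^2 + g^T D  solves  (SA)^T (SA) D = -g
        directions.append(np.linalg.solve(SA.T @ SA, -g))
    x = x + alpha2 * np.mean(directions, axis=0)

print(np.linalg.norm(A.T @ (A @ x - b)))       # gradient norm; should be small after a few iterations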

We analyze the bias and the variance of the update directions for distributed Newton sketch, and derive exact expressions for Gaussian sketching matrices. First, we establish the notation for update directions. We will let Δt\Delta_{t}^{*} denote the optimal Newton update direction at iteration tt

Δt=((Ht1/2)THt1/2)1gt\displaystyle\Delta_{t}^{*}=((H_{t}^{1/2})^{T}H_{t}^{1/2})^{-1}g_{t} (42)

and let Δ^t,k\hat{\Delta}_{t,k} denote the approximate update direction returned by worker node kk at iteration tt, which has the closed form expression

Δ^t,k=αs((Ht1/2)TSt,kTSt,kHt1/2)1gt.\displaystyle\hat{\Delta}_{t,k}=\alpha_{s}\left((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2}\right)^{-1}g_{t}. (43)

Note that the step size for the averaged update direction will be calculated as α2=α1αs\alpha_{2}=\alpha_{1}\alpha_{s}. Theorem V.1 characterizes how the update directions need to be scaled to obtain an unbiased update direction, and also a minimum variance estimator.

Theorem V.1

For Gaussian sketches St,kS_{t,k}, assuming Ht1/2H_{t}^{1/2} is full column rank, the variance 𝔼[Ht1/2(Δ^t,kΔt)22]\operatorname{\mathbb{E}}[\|H_{t}^{1/2}(\hat{\Delta}_{t,k}-\Delta_{t}^{*})\|_{2}^{2}] is minimized when αs\alpha_{s} is chosen as αs=θ1θ2\alpha_{s}=\frac{\theta_{1}}{\theta_{2}} whereas the bias 𝔼[Ht1/2(Δ^t,kΔt)]\operatorname{\mathbb{E}}[H_{t}^{1/2}(\hat{\Delta}_{t,k}-\Delta_{t}^{*})] is zero when αs=1θ1\alpha_{s}=\frac{1}{\theta_{1}}, where θ1\theta_{1} and θ2\theta_{2} are as defined in (II-D).

(a) $q=10$ workers
(b) $q=2$ workers

Figure 3: Cost approximation (f(xt)f(x))/f(x)(f(x_{t})-f(x^{*}))/f(x^{*}) for Algorithm 4 against iteration number tt for various step sizes in solving a linear least squares problem on randomly generated data. The cyan colored dotted lines show cost approximation when we make a search for the learning rate αs\alpha_{s} between 0.050.05 and 11. The blue line with circle markers corresponds to αs=1/θ1\alpha_{s}=1/\theta_{1} that leads to the unbiased estimator and the red line with square markers corresponds to αs=θ1/θ2\alpha_{s}=\theta_{1}/\theta_{2} that gives the minimum variance. The resulting step size scaling factors αs\alpha_{s} are shown in the legends of the plots. The parameters used in these experiments are n=1000n=1000, d=200d=200, m=400m=400.

Figure 3 demonstrates that choosing $\alpha_{2}=\alpha_{1}\alpha_{s}$ with $\alpha_{s}$ calculated from the unbiased estimator formula $\alpha_{s}=1/\theta_{1}$ leads to a faster decrease of the objective value when the number of workers is high. If the number of workers is small, one should instead choose the step size that minimizes the variance, using $\alpha_{s}=\theta_{1}/\theta_{2}$. Furthermore, Figure 3(a) illustrates that the blue curve with circle markers is in fact close to the best one could hope to achieve, since it nearly matches the best cyan dotted line.

V-B Distributed Newton Sketch for Regularized Problems

We now consider problems with squared $\ell_{2}$-norm regularization. In particular, we study problems with Hessian matrices of the form $H_{t}=(H_{t}^{1/2})^{T}H_{t}^{1/2}+\lambda_{1}I_{d}$. Sketching can be applied to obtain approximate Hessian matrices as $\hat{H}_{t}=(S_{t}H_{t}^{1/2})^{T}S_{t}H_{t}^{1/2}+\lambda_{2}I_{d}$. Note that the case $\lambda_{2}=\lambda_{1}$ corresponds to the setting of the GIANT algorithm described in [22].

Theorem V.2 establishes that λ2\lambda_{2} should be chosen according to the formula (44) under the assumption that the singular values of Ht1/2H_{t}^{1/2} are identical. We later verify empirically that when the singular values are not identical, plugging the mean of the singular values into the formula still leads to improvements over the case of λ2=λ1\lambda_{2}=\lambda_{1}.

Theorem V.2

Given the thin SVD $H_{t}^{1/2}=U\Sigma V^{T}\in\mathbb{R}^{n\times d}$ with $n\geq d$, where $H_{t}^{1/2}$ is assumed to have full rank and satisfy $\Sigma=\sigma I_{d}$, the bias of the single sketch Newton step estimator $\operatorname{\mathbb{E}}[H_{t}^{1/2}(\hat{\Delta}_{t,k}-\Delta_{t}^{*})]$ vanishes as $n$ goes to infinity when $\lambda_{2}$ is chosen as

λ2=(λ1+dmσ2)(1d/m1+λ1σ2+d/m),\displaystyle\lambda_{2}^{*}=\left(\lambda_{1}+\frac{d}{m}\sigma^{2}\right)\left(1-\frac{d/m}{1+\lambda_{1}\sigma^{-2}+d/m}\right), (44)

where Δt=((Ht1/2)THt1/2+λ1Id)1gt\Delta_{t}^{*}=((H_{t}^{1/2})^{T}H_{t}^{1/2}+\lambda_{1}I_{d})^{-1}g_{t} and Δ^t,k=((St,kHt1/2)TSt,kHt1/2+λ2Id)1gt\hat{\Delta}_{t,k}=((S_{t,k}H_{t}^{1/2})^{T}S_{t,k}H_{t}^{1/2}+\lambda_{2}I_{d})^{-1}g_{t}, and St,kS_{t,k} is the Gaussian sketch.

Remark V.3

The proof of Theorem V.2 builds on the proof of Theorem II.20 (see Appendix). The main difference between these results is that the setting in Theorem V.2 is such that we assume access to the full gradient, i.e. only the Hessian matrix is sketched. In contrast, the setting in Theorem II.20 does not use the exact gradient.
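Formula (44) is straightforward to evaluate in code; the short Python sketch below is an illustration that also includes the heuristic of plugging in the mean singular value of $H_{t}^{1/2}$ when the singular values are not identical.

import numpy as np

def lambda2_star(lambda1, d, m, sigma):
    """Bias-correcting regularization coefficient from (44)."""
    ratio = d / m
    return (lambda1 + ratio * sigma**2) * (1 - ratio / (1 + lambda1 / sigma**2 + ratio))

def lambda2_star_from_sqrt_hessian(H_half, lambda1, m):
    """Heuristic version: plug the mean singular value of H_t^{1/2} into (44)."""
    d = H_half.shape[1]
    sigma = np.linalg.svd(H_half, compute_uv=False).mean()
    return lambda2_star(lambda1, d, m, sigma)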

VI Applications of Distributed Sketching

This section describes some example problems where our methodology can be applied. In particular, the problems in this section are convex problems that are efficiently addressed by our methods in distributed systems. We present numerical results on these problems in Section VII.

VI-A Logistic Regression

We begin by considering the well-known logistic regression model with squared 2\ell_{2}-norm penalty. The optimization problem for this model can be formulated as minimizexf(x)\textrm{minimize}_{x}f(x) where

f(x)=i=1n(yilog(pi)+(1yi)log(1pi))+λ12x22,\displaystyle f(x)=-\sum_{i=1}^{n}\left(y_{i}\log(p_{i})+(1-y_{i})\log(1-p_{i})\right)+\frac{\lambda_{1}}{2}\|x\|_{2}^{2}, (45)

and pnp\in\mathbb{R}^{n} is defined such that pi=1/(1+exp(a~iTx))p_{i}=1/(1+\exp(-\tilde{a}_{i}^{T}x)). a~id\tilde{a}_{i}\in\mathbb{R}^{d} denotes the ii’th row of the data matrix An×dA\in\mathbb{R}^{n\times d}. The output vector is denoted by yny\in\mathbb{R}^{n}.

We can use the distributed Newton sketch algorithm to solve this optimization problem. The gradient and Hessian for f(x)f(x) are as follows

g\displaystyle g =AT(py)+λ1x,\displaystyle=A^{T}(p-y)+\lambda_{1}x,
H\displaystyle H =ATDA+λ1Id.\displaystyle=A^{T}DA+\lambda_{1}I_{d}\,.

$D$ is a diagonal matrix whose $i$'th diagonal entry is $p_{i}(1-p_{i})$. The sketched Hessian matrix in this case can be formed as $(SD^{1/2}A)^{T}(SD^{1/2}A)+\lambda_{2}^{*}I_{d}$, where $\lambda_{2}^{*}$ can be calculated using (44) and we can set $\sigma$ to the mean of the singular values of $D^{1/2}A$. Since the entries of the matrix $D$ are a function of the variable $x$, the matrix $D$ is updated at every iteration, and it might be computationally expensive to re-compute the mean of the singular values of $D^{1/2}A$ each time. However, we have found through experiments that the exact mean of the singular values is not required for bias reduction; for instance, setting $\sigma$ to the mean of the diagonal entries of $D^{1/2}$ works sufficiently well as a heuristic.
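A minimal Python sketch of one worker's computation for this problem is given below as an illustration; it uses the gradient and Hessian expressions above, a Gaussian sketch, formula (44) inlined, and the heuristic of setting $\sigma$ to the mean of the diagonal entries of $D^{1/2}$.

import numpy as np

def logistic_newton_sketch_direction(A, y, x, lam1, m, rng):
    """One worker's sketched Newton direction for the l2-regularized logistic model (45)."""
    n, d = A.shape
    p = 1.0 / (1.0 + np.exp(-A @ x))                  # p_i = 1 / (1 + exp(-a_i^T x))
    g = A.T @ (p - y) + lam1 * x                      # gradient
    w = np.sqrt(p * (1 - p))                          # diagonal entries of D^{1/2}
    H_half = w[:, None] * A                           # H^{1/2} = D^{1/2} A
    S = rng.standard_normal((m, n)) / np.sqrt(m)      # Gaussian sketch
    SH = S @ H_half
    sigma = w.mean()                                  # heuristic: mean of the diagonals of D^{1/2}
    ratio = d / m
    lam2 = (lam1 + ratio * sigma**2) * (1 - ratio / (1 + lam1 / sigma**2 + ratio))  # formula (44)
    H_hat = SH.T @ SH + lam2 * np.eye(d)              # sketched regularized Hessian
    return -np.linalg.solve(H_hat, g)                 # approximate Newton direction sent to the master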

VI-B Inequality Constrained Optimization

The second application that we consider is the inequality constrained optimization problem of the form

minimizex\displaystyle\textrm{minimize}_{x}\quad xc22\displaystyle\|x-c\|_{2}^{2}
subject to Axλ\displaystyle\|Ax\|_{\infty}\leq\lambda (46)

where An×dA\in\mathbb{R}^{n\times d}, and cdc\in\mathbb{R}^{d} are the problem data, and λ\lambda\in\mathbb{R} is a positive scalar. Note that this problem is the dual of the Lasso problem given by minx\text{min}_{x} λx1+12Axc22\lambda\|x\|_{1}+\frac{1}{2}\|Ax-c\|_{2}^{2}.

The above problem can be tackled by the standard log-barrier method [33], by solving sequences of unconstrained barrier penalized problems as follows

\displaystyle\textrm{minimize}_{x}\,-\sum_{i=1}^{n}\log(-\tilde{a}_{i}^{T}x+\lambda)-\sum_{i=1}^{n}\log(\tilde{a}_{i}^{T}x+\lambda)+\lambda_{1}\|x\|_{2}^{2}-2\lambda_{1}c^{T}x+\lambda_{1}\|c\|_{2}^{2} (47)

where a~i\tilde{a}_{i} represents the ii’th row of AA. The distributed Newton sketch algorithm could be used to solve this problem. The gradient and Hessian of the objective are given by

g\displaystyle g =AcTD𝟏2n×1+2λ1x2λ1c,\displaystyle=-A_{c}^{T}D\mathbf{1}_{2n\times 1}+2\lambda_{1}x-2\lambda_{1}c,
H\displaystyle H =(DAc)T(DAc)+2λ1Id.\displaystyle=(DA_{c})^{T}(DA_{c})+2\lambda_{1}I_{d}.

Here $A_{c}=[A^{T},-A^{T}]^{T}$ and $D$ is a diagonal matrix whose diagonal entries are the element-wise inverse of the vector $(A_{c}x-\lambda\mathbf{1}_{2n\times 1})$, where $\mathbf{1}_{2n\times 1}$ is the length-$2n$ vector of all $1$'s. The sketched Hessian can be written in the form $(SDA_{c})^{T}(SDA_{c})+\lambda_{2}I_{d}$.

Remark VI.1

Since the contribution of the regularization term in the Hessian matrix is 2λ1Id2\lambda_{1}I_{d}, we need to plug 2λ12\lambda_{1} instead of λ1\lambda_{1} in the formula for computing λ2\lambda_{2}^{*}.

VI-C Fine Tuning of Pre-Trained Neural Networks

The third application is neural network fine tuning. Let fNN():drf_{NN}(\cdot):\mathbb{R}^{d}\rightarrow\mathbb{R}^{r} represent the output of the (L1)(L-1)’st layer of a pre-trained network which consists of LL layers. We are interested in re-using a pre-trained neural network for, say, a task on a different dataset by learning a different final layer. More precisely, assuming squared loss, one way that we can formulate this problem is as follows

WL=argminWL[fNN(a~1)TfNN(a~n)T]fNN(A)WLbF2+λ1WLF2,\displaystyle W_{L}^{*}=\arg\min_{W_{L}}\Bigg{\|}\underbrace{\begin{bmatrix}f_{NN}(\tilde{a}_{1})^{T}\\ \vdots\\ f_{NN}(\tilde{a}_{n})^{T}\end{bmatrix}}_{f_{NN}(A)}W_{L}-b\Bigg{\|}_{F}^{2}+\lambda_{1}\|W_{L}\|_{F}^{2}\,, (48)

where WLr×CW_{L}\in\mathbb{R}^{r\times C} is the weight matrix of the final layer. The vectors a~id\tilde{a}_{i}\in\mathbb{R}^{d} i=1,,ni=1,\dots,n denote the rows of the data matrix An×dA\in\mathbb{R}^{n\times d} and bn×Cb\in\mathbb{R}^{n\times C} is the output matrix. In a classification problem, CC would correspond to the number of classes.

For large scale datasets, it is important to develop efficient methods for solving the problem in (48). The distributed randomized ridge regression algorithm (listed in Algorithm 3) can be applied to this problem, where the sketched sub-problems are of the form

W^L,k=argminWLSkfNN(A)WLSkbF2+λ2WLF2\displaystyle\hat{W}_{L,k}=\arg\min_{W_{L}}\|S_{k}f_{NN}(A)W_{L}-S_{k}b\|_{F}^{2}+\lambda_{2}^{*}\|W_{L}\|_{F}^{2} (49)

and the averaged solution can be computed at the master node as W¯L=1qk=1qW^L,k\bar{W}_{L}=\frac{1}{q}\sum_{k=1}^{q}\hat{W}_{L,k}. This leads to an efficient non-iterative asynchronous method for fine-tuning of a neural network.
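The sketched sub-problems (49) and the averaging step can be written in a few lines of Python; the sketch below is an illustration in which the matrix `features` stands for the precomputed $f_{NN}(A)$, the Gaussian sketch is used, and the bias-corrected coefficient $\lambda_{2}^{*}$ is taken as an input.

import numpy as np

def distributed_ridge_finetune(features, b, lam2, m, q, rng):
    """Average q sketched ridge solutions of (49); `features` stands for f_NN(A), of shape n x r."""
    n, r = features.shape
    W_sum = np.zeros((r,) + b.shape[1:])
    for k in range(q):                                   # serial stand-in for q asynchronous workers
        S = rng.standard_normal((m, n)) / np.sqrt(m)     # Gaussian sketch
        SF, Sb = S @ features, S @ b
        # solve ||S f_NN(A) W - S b||_F^2 + lam2 ||W||_F^2 in closed form
        W_sum += np.linalg.solve(SF.T @ SF + lam2 * np.eye(r), SF.T @ Sb)
    return W_sum / q                                     # averaged final-layer weights W_bar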

VII Numerical Results

We have implemented the distributed sketching methods for AWS Lambda in Python using the Pywren package [34]. The setting for the experiments is a centralized computing model where a single master node collects and averages the outputs of the $q$ worker nodes. Most of the figures in this section plot the approximation error, which we define as $(f(\bar{x})-f(x^{*}))/f(x^{*})$.

VII-A Hybrid Sketch

In a distributed computing setting, the amount of data that can be fit into the memory of a node and the size of the largest problem that the node can solve often do not match. The hybrid sketch is motivated by this mismatch: it is a sequentially composed sketching scheme where we first perform uniform sampling down to dimension $m^{\prime}$ and then sketch the sampled data down to dimension $m$ using another sketching method with better convergence properties (say, Gaussian). Worker nodes load $m^{\prime}$ rows of the data matrix $A$ into their memory and then perform sketching, reducing the number of rows from $m^{\prime}$ to $m$. We note that if $m^{\prime}=m$, the hybrid sketch reduces to sampling, and if $m^{\prime}=n$, it reduces to the Gaussian sketch. The hybrid sketching scheme does not take privacy into account since worker nodes are assumed to have access to the data matrix.

For the experiments involving very large scale datasets, we have used Sparse Johnson-Lindenstrauss Transform (SJLT) [24] as the second sketching method in the hybrid sketch due to its low computational complexity.
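A minimal Python sketch of the hybrid scheme is shown below as an illustration; for brevity the second stage is a Gaussian sketch rather than the SJLT used in the experiments.

import numpy as np

def hybrid_sketch(A, b, m_prime, m, rng):
    """Uniform sampling down to m' rows followed by a second sketch down to m rows."""
    n = A.shape[0]
    rows = rng.choice(n, size=m_prime, replace=False)    # stage 1: uniform sampling of m' rows
    scale = np.sqrt(n / m_prime)
    A1, b1 = scale * A[rows], scale * b[rows]
    S2 = rng.standard_normal((m, m_prime)) / np.sqrt(m)  # stage 2: Gaussian (SJLT in the experiments)
    return S2 @ A1, S2 @ b1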

VII-B Airline Dataset

We have conducted experiments with the publicly available Airline dataset [35]. This dataset contains information on domestic USA flights between the years 1987-2008. We are interested in predicting whether there is going to be a departure delay or not, based on information about the flights. More precisely, we are interested in predicting whether DepDelay >15>15 minutes using the attributes Month, DayofMonth, DayofWeek, CRSDepTime, CRSElapsedTime, Dest, Origin, and Distance. Most of these attributes are categorical and we have used dummy coding to convert these categorical attributes into binary representations. The size of the input matrix AA, after converting categorical features into binary representations, becomes (1.21×108)×774(1.21\times 10^{8})\times 774.

We have solved the linear least squares problem on the entire airline dataset: minimizexAxb22\text{minimize}_{x}\|Ax-b\|_{2}^{2} using qq workers on AWS Lambda. The output bb for the plots a and b in Figure 4 is a vector of binary variables indicating delay. The output bb for the plots c and d in Figure 4 is artificially generated via b=Axtruth+ϵb=Ax_{truth}+\epsilon where xtruthx_{truth} is the underlying solution and ϵ\epsilon is random Gaussian noise distributed as 𝒩(0,0.01I)\mathcal{N}(0,0.01I). Figure 4 shows that sampling followed by SJLT leads to a lower error.

We note that choosing $m$ and $m^{\prime}$ as large as the available resources allow increases the convergence rate. Based on the run times given in the caption of Figure 4, we see that the run times are slightly longer when SJLT is involved. Decreasing $m^{\prime}$ will help reduce this processing time at the expense of error performance.

a) $m^{\prime}=5m=5\times 10^{5}$
b) $m^{\prime}=2m=2\times 10^{6}$
c) $m^{\prime}=5m=5\times 10^{5}$
d) $m^{\prime}=2m=2\times 10^{6}$

Figure 4: AWS Lambda experiments on the entire airline dataset (n=1.21×108n=1.21\times 10^{8}) with q=100q=100 workers. The run times for each plot are as follows (given in this order: sampling, sampling followed by SJLT): a: 37.5, 43.9 seconds, b: 48.3, 60.1 seconds, c: 39.8, 52.9 seconds, d: 78.8, 107.6 seconds.

VII-C Image Dataset: Extended MNIST

The experiments of this subsection are performed on the image dataset EMNIST (extended MNIST) [36]. We have used the "bymerge" split of EMNIST, which has 700K training and 115K test images. The images are of size $28\times 28$ and there are 47 classes in total (letters and digits); some capital letter classes have been merged with the corresponding small letter classes (such as C and c), which is why there are 47 classes rather than 62.

a) Approximation error
b) Test accuracy

Figure 5: Approximation error and test set classification accuracy plots for the EMNIST-bymerge dataset where q=100q=100, m=2000m=2000, s=20s=20. The run times on AWS Lambda for uniform sampling and SJLT are 41.541.5 and 66.966.9 seconds, respectively.

Figure 5 shows the approximation error and test accuracy plots when we fit a least squares model on the EMNIST-bymerge dataset using the distributed randomized regression algorithm. Because this is a multi-class classification problem, we have one-hot encoded the labels. Figure 5 demonstrates that SJLT is able to drive the cost down more and leads to a better classification accuracy than uniform sampling.

VII-D Performance on Large Scale Synthetic Datasets

We now present the results of the experiments carried out on randomly generated large scale data to illustrate the scalability of the methods. The plots in Figure 6 show the approximation error as a function of time, where the problem dimensions are as follows: $A\in\mathbb{R}^{10^{7}\times 10^{3}}$ for plot a and $A\in\mathbb{R}^{(2\times 10^{7})\times(2\times 10^{3})}$ for plot b. These data matrices take up $75$ GB and $298$ GB, respectively. The data used in these experiments were randomly generated from Student's t-distribution with $1.5$ degrees of freedom for plot a and $1.7$ for plot b. The output vector $b$ was computed according to $b=Ax_{truth}+\epsilon$ where $\epsilon\in\mathbb{R}^{n}$ is i.i.d. noise distributed as $\mathcal{N}(0,0.1I_{n})$. Other parameters used in the experiments are $m=10^{4},m^{\prime}=10^{5}$ for plot a, and $m=8\times 10^{3},m^{\prime}=8\times 10^{4}$ for plot b. We observe that both plots in Figure 6 reveal similar trends: the hybrid approach leads to a lower approximation error but takes longer due to the additional processing required for SJLT.

a) $A\in\mathbb{R}^{10^{7}\times 10^{3}}$
b) $A\in\mathbb{R}^{(2\times 10^{7})\times(2\times 10^{3})}$

Figure 6: Approximation error vs time for AWS Lambda experiments on randomly generated large scale datasets (q=200q=200 AWS Lambda functions have been used).

VII-E Numerical Results for the Right Sketch Method

Figure 7 shows the approximation error as a function of the number of averaged outputs in solving the least norm problem for two different datasets. The dataset for Figure 7(a) is randomly generated with dimensions A50×1000A\in\mathbb{R}^{50\times 1000}. We observe that Gaussian sketch outperforms uniform sampling in terms of the approximation error. Furthermore, Figure 7(a) verifies that if we apply the hybrid approach of sampling first and then applying Gaussian sketch, then its performance falls between the extreme ends of only sampling and only using Gaussian sketch. Moreover, Figure 7(b) shows the results for the same experiment on a subset of the airline dataset where we have included the pairwise interactions as features which makes this an underdetermined linear system. Originally, we had 774774 features for this dataset and if we include all xixjx_{i}x_{j} terms as features, we would have a total of 299925299925 features, most of which are zero for all samples. We have excluded the all-zero columns from this extended matrix to obtain the final dimensions 2000×115882000\times 11588.

a) Random data
b) Airline data

Figure 7: Averaging for least norm problems. Plot (a): The parameters are n=50n=50, d=1000d=1000, m=200m=200, m=500m^{\prime}=500. Plot (b): Least norm averaging applied to a subset of the airline dataset. The parameters are n=2000n=2000, d=11588d=11588, m=4000m=4000, m=8000m^{\prime}=8000. For this plot, the features include the pairwise interactions in addition to the original features.

VII-F Distributed Iterative Hessian Sketch

We have evaluated the performance of the distributed IHS algorithm on AWS Lambda. In the implementation, each serverless function is responsible for solving one sketched problem per iteration. Worker nodes wait idly once they finish their computation for that iteration until the next iterate xt+1x_{t+1} becomes available. The master node is implemented as another AWS Lambda function and is responsible for collecting and averaging the worker outputs and broadcasting the next iterate xt+1x_{t+1}.

Figure 8 shows the scaled difference between the cost for the tt’th iterate and the optimal cost (i.e. (f(xt)f(x))/f(x)(f(x_{t})-f(x^{*}))/f(x^{*})) against wall-clock time for the distributed IHS algorithm given in Algorithm 2. Due to the relatively small size of the problem, we have each worker compute the exact gradient without requiring an additional communication round per iteration to form the full gradient. We note that, for problems where it is not feasible for workers to form the full gradient due to limited memory, one can include an additional communication round where each worker sends their local gradient to the master node, and the master node forms the full gradient and distributes it to the worker nodes.


Figure 8: Cost approximation (f(xt)f(x))/f(x)(f(x_{t})-f(x^{*}))/f(x^{*}) vs time for the distributed IHS algorithm running on AWS Lambda to fit the linear least squares model on randomly generated data. Unif is short for uniform sampling and unif&sjlt is short for hybrid sketch where uniform sampling is followed by SJLT. Problem parameters are as follows: n=250000n=250000, d=500d=500, m=6000m=6000, m2=20000m_{2}=20000, and qq as specified in the legend.

VII-G Inequality Constrained Optimization

Figure 9 compares the error performance of sketches with and without bias correction for the distributed Newton sketch algorithm when it is used to solve the log-barrier penalized problem given in (47). For each sketch, we have plotted the performance for λ2=λ1\lambda_{2}=\lambda_{1} and the bias corrected version λ2=λ2\lambda_{2}=\lambda_{2}^{*} (see Theorem V.2). The bias corrected versions are shown as the dotted lines. In these experiments, we have set σ\sigma to the minimum of the singular values of DADA as we have observed that setting σ\sigma to the minimum of the singular values of DADA performed better than setting it to their mean.

Even though the bias correction formula is derived for Gaussian sketch, we observe that it improves the performance of SJLT as well. We see that Gaussian sketch and SJLT perform the best out of the 4 sketches we have experimented with. We note that computational complexity of sketching for SJLT is lower than it is for Gaussian sketch, and hence the natural choice would be to use SJLT in this case.


Figure 9: Plot shows cost approximation of the iterate xtx_{t} (i.e., (f(xt)f(x))/f(x)(f(x_{t})-f(x^{*}))/f(x^{*})) against iteration number tt for various sketches in solving the log-barrier version of an inequality constrained optimization problem given in (47). Abbreviations used in the plot are as follows. Unif: Uniform sampling, gaus: Gaussian sketch, unif&sjlt: Hybrid sketch where uniform sampling is followed by SJLT. The abbreviations followed by ++’s in the legend refer to the bias corrected versions. Problem parameters are as follows: n=500n=500, d=200d=200, λ1=1000\lambda_{1}=1000, m=50m=50, m2=8m=400m_{2}=8m=400, q=10q=10, λ=0.01\lambda=0.01, s=10s=10.

VIII Discussion

In this work, we study averaging sketched solutions for linear least squares problems and averaging for randomized second order optimization algorithms. We discuss distributed sketching methods from the perspectives of convergence, bias, privacy, and distributed computing. Our results and numerical experiments suggest that distributed sketching methods offer a competitive straggler-resilient solution for solving large scale problems.

We have shown that for problems involving regularization, averaging requires a more detailed analysis compared to problems without regularization. When the regularization term is not scaled properly, the resulting estimators are biased, and averaging a large number of independent sketched solutions does not converge to the true solution. We have provided closed-form formulas for scaling the regularization coefficient to obtain unbiased estimators that guarantee convergence to the optimum.

Appendix A Proofs

This section contains the proofs for the lemmas and theorems stated in the paper.

A-A Proofs of Theorems and Lemmas in Section II

Proof:

Since the Gaussian sketch estimator is unbiased (i.e., $\operatorname{\mathbb{E}}[\hat{x}_{k}]=x^{*}$), Lemma IV.1 reduces to $\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*})=\frac{1}{q}\operatorname{\mathbb{E}}[\|A\hat{x}_{1}-Ax^{*}\|_{2}^{2}]$. By Lemma II.1, the error of the averaged solution, conditioned on the events $E_{k}=\{A^{T}S_{k}^{T}S_{k}A\succ 0\}$, $\forall k=1,\dots,q$, can be written exactly as

𝔼[A(x¯x)22|E1Eq]=1qdmd1f(x).\displaystyle\operatorname{\mathbb{E}}[\|A(\bar{x}-x^{*})\|_{2}^{2}|E_{1}\cap\dots\cap E_{q}]=\frac{1}{q}\frac{d}{m-d-1}f(x^{*}).

Using Markov’s inequality, it follows that

P(A(x¯x)22a|E1Eq)1qadmd1f(x).\displaystyle P(\|A(\bar{x}-x^{*})\|_{2}^{2}\geq a|E_{1}\cap\dots\cap E_{q})\leq\frac{1}{qa}\frac{d}{m-d-1}f(x^{*}).

The LHS can be lower bounded as

P(A(x¯x)22a(k=1qEk))P(k=1qEk)\displaystyle\frac{P(\|A(\bar{x}-x^{*})\|_{2}^{2}\geq a\cap(\bigcap_{k=1}^{q}E_{k}))}{P(\bigcap_{k=1}^{q}E_{k})} P(A(x¯x)22a)+P(k=1qEk)1P(k=1qEk)\displaystyle\geq\frac{P(\|A(\bar{x}-x^{*})\|_{2}^{2}\geq a)+P(\bigcap_{k=1}^{q}E_{k})-1}{P(\bigcap_{k=1}^{q}E_{k})} (50)
=P(A(x¯x)22a)+P(E1)q1P(E1)q,\displaystyle=\frac{P(\|A(\bar{x}-x^{*})\|_{2}^{2}\geq a)+P(E_{1})^{q}-1}{P(E_{1})^{q}}, (51)

where we have used the identity P(AB)P(A)+P(B)1P(A\cap B)\geq P(A)+P(B)-1 in (50) and the independence of the events EkE_{k} in (51). It follows

P(A(x¯x)22a)P(E1)q(11qadmd1f(x)).\displaystyle P(\|A(\bar{x}-x^{*})\|_{2}^{2}\leq a)\geq P(E_{1})^{q}\left(1-\frac{1}{qa}\frac{d}{m-d-1}f(x^{*})\right).

Setting a=f(x)ϵqa=f(x^{*})\frac{\epsilon}{q} and plugging in P(E1)=1P(E_{1})=1, which holds for mdm\geq d, we obtain

P(A(x¯x)22f(x)ϵq)(1d/ϵmd1).\displaystyle P\left(\frac{\|A(\bar{x}-x^{*})\|_{2}^{2}}{f(x^{*})}\leq\frac{\epsilon}{q}\right)\geq\left(1-\frac{d/\epsilon}{m-d-1}\right).

Lemma A.1

For the Gaussian sketching matrix SS and the matrix UU whose columns are orthonormal, the smallest singular value of the product SUSU is bounded as:

P(σmin(SU)1d/mδ)exp(mδ2/2).\displaystyle P(\sigma_{\min}(SU)\leq 1-\sqrt{d/m}-\delta)\leq\exp(-m\delta^{2}/2). (52)

This result follows from the concentration of Lipschitz functions of Gaussian random variables.

Proof:

We start by invoking a result on the concentration of Stieltjes transforms, namely Lemma 6 of [37]. This implies that

P\displaystyle P (|tr(UTSTSUϵiI)1𝔼[tr(UTSTSUϵiI)1]|>t)4exp(t2ϵ2/(16m))\displaystyle\big{(}|\operatorname{tr}(U^{T}S^{T}SU-\epsilon iI)^{-1}-\operatorname{\mathbb{E}}[\operatorname{tr}(U^{T}S^{T}SU-\epsilon iI)^{-1}]|>t\big{)}\leq 4\exp\left(-t^{2}\epsilon^{2}/(16m)\right) (53)

where i=1i=\sqrt{-1}. The matrix UTSTSUU^{T}S^{T}SU can be written as a sum of rank-1 matrices as follows:

UTSTSU=i=1m(s~iTU)T(s~iTU)\displaystyle U^{T}S^{T}SU=\sum_{i=1}^{m}(\tilde{s}_{i}^{T}U)^{T}(\tilde{s}_{i}^{T}U) (54)

where s~iT1×n\tilde{s}_{i}^{T}\in\mathbb{R}^{1\times n} is the ii’th row of SS. The vectors (s~iTU)T(\tilde{s}_{i}^{T}U)^{T} are independent as required by Lemma 6 of [37]. Next, we will bound the difference between the trace terms tr((UTSTSU)1)\operatorname{tr}((U^{T}S^{T}SU)^{-1}) and tr((UTSTSUϵiI)1)\operatorname{tr}((U^{T}S^{T}SU-\epsilon iI)^{-1}). First, we let the SVD decomposition of SUSU be U~Σ~V~T\tilde{U}\tilde{\Sigma}\tilde{V}^{T}. Then, we have the following relations:

SU\displaystyle SU =U~Σ~V~T\displaystyle=\tilde{U}\tilde{\Sigma}\tilde{V}^{T}
UTSTSU\displaystyle U^{T}S^{T}SU =V~Σ~2V~T\displaystyle=\tilde{V}\tilde{\Sigma}^{2}{\tilde{V}}^{T}
(UTSTSU)1\displaystyle(U^{T}S^{T}SU)^{-1} =V~Σ~2V~T\displaystyle=\tilde{V}\tilde{\Sigma}^{-2}{\tilde{V}}^{T}
UTSTSUϵiI\displaystyle U^{T}S^{T}SU-\epsilon iI =V~diag(σ~j2ϵi)V~T\displaystyle=\tilde{V}\operatorname{diag}(\tilde{\sigma}_{j}^{2}-\epsilon i){\tilde{V}}^{T}
(UTSTSUϵiI)1\displaystyle(U^{T}S^{T}SU-\epsilon iI)^{-1} =V~diag(1σ~j2ϵi)V~T\displaystyle=\tilde{V}\operatorname{diag}\left(\frac{1}{\tilde{\sigma}_{j}^{2}-\epsilon i}\right){\tilde{V}}^{T} (55)

where the notation diag(σ~j2)\operatorname{diag}(\tilde{\sigma}_{j}^{2}) refers to a diagonal matrix with diagonal entries equal to σ~12,σ~22,\tilde{\sigma}_{1}^{2},\tilde{\sigma}_{2}^{2},\dots. The difference between the trace terms can be bounded as follows:

|tr((UTSTSU)1)tr((UTSTSUϵiI)1)|=|tr(V~diag(1σ~j21σ~j2ϵi)V~T)|\displaystyle\left|\operatorname{tr}((U^{T}S^{T}SU)^{-1})-\operatorname{tr}((U^{T}S^{T}SU-\epsilon iI)^{-1})\right|=\left|\operatorname{tr}\left(\tilde{V}\operatorname{diag}\left(\frac{1}{\tilde{\sigma}_{j}^{2}}-\frac{1}{\tilde{\sigma}_{j}^{2}-\epsilon i}\right){\tilde{V}}^{T}\right)\right|
=|j=1d(1σ~j21σ~j2ϵi)|=|j=1d(ϵiσ~j2(σ~j2ϵi))|\displaystyle=\left|\sum_{j=1}^{d}\left(\frac{1}{\tilde{\sigma}_{j}^{2}}-\frac{1}{\tilde{\sigma}_{j}^{2}-\epsilon i}\right)\right|=\left|\sum_{j=1}^{d}\left(\frac{-\epsilon i}{\tilde{\sigma}_{j}^{2}(\tilde{\sigma}_{j}^{2}-\epsilon i)}\right)\right|
j=1d|ϵiσ~j2(σ~j2ϵi)|=j=1dϵσ~j2|σ~j2ϵi|\displaystyle\leq\sum_{j=1}^{d}\left|\frac{-\epsilon i}{\tilde{\sigma}_{j}^{2}(\tilde{\sigma}_{j}^{2}-\epsilon i)}\right|=\sum_{j=1}^{d}\frac{\epsilon}{\tilde{\sigma}_{j}^{2}|\tilde{\sigma}_{j}^{2}-\epsilon i|}
j=1dϵσ~j4j=1dϵσ~j2σmin2(SU)\displaystyle\leq\sum_{j=1}^{d}\frac{\epsilon}{\tilde{\sigma}_{j}^{4}}\leq\sum_{j=1}^{d}\frac{\epsilon}{\tilde{\sigma}_{j}^{2}\sigma^{2}_{\min}(SU)}
=ϵσmin2(SU)j=1d1σ~j2=ϵσmin2(SU)tr((UTSTSU)1).\displaystyle=\frac{\epsilon}{\sigma^{2}_{\min}(SU)}\sum_{j=1}^{d}\frac{1}{\tilde{\sigma}_{j}^{2}}=\frac{\epsilon}{\sigma^{2}_{\min}(SU)}\operatorname{tr}\left((U^{T}S^{T}SU)^{-1}\right)\,. (56)

Equivalently, we have

(1ϵσmin2(SU))tr((UTSTSU)1)tr((UTSTSUϵiI)1)(1+ϵσmin2(SU))tr((UTSTSU)1).\displaystyle\big{(}1-\frac{\epsilon}{\sigma^{2}_{\min}(SU)}\big{)}\operatorname{tr}((U^{T}S^{T}SU)^{-1})\leq\operatorname{tr}((U^{T}S^{T}SU-\epsilon iI)^{-1})\leq\big{(}1+\frac{\epsilon}{\sigma^{2}_{\min}(SU)}\big{)}\operatorname{tr}((U^{T}S^{T}SU)^{-1})\,. (57)

The relations in (53) and (57) together imply the following concentration result:

P(|tr((UTSTSU)1)𝔼[tr((UTSTSU)1)]|t+ϵσmin2(SU)(tr((UTSTSU)1)+𝔼[tr((UTSTSU)1)]))\displaystyle P\big{(}|\operatorname{tr}((U^{T}S^{T}SU)^{-1})-\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]|\leq t+\frac{\epsilon}{\sigma^{2}_{\min}(SU)}\left(\operatorname{tr}((U^{T}S^{T}SU)^{-1})+\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]\right)\big{)}
14exp(t2ϵ2/(16m)).\displaystyle\geq 1-4\exp\left(-t^{2}\epsilon^{2}/(16m)\right)\,. (58)

We know that $\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]=\frac{md}{m-d-1}$, which follows from the mean of the inverse Wishart distribution since the rows of $SU$ are i.i.d. Gaussian. Using the upper bound $\operatorname{tr}((U^{T}S^{T}SU)^{-1})\leq d\sigma_{\min}^{-2}(SU)$ yields

P(|tr((UTSTSU)1)𝔼[tr((UTSTSU)1)]|t+ϵσmin2(SU)(dσmin2(SU)+mdmd1))\displaystyle P\Big{(}|\operatorname{tr}((U^{T}S^{T}SU)^{-1})-\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]|\leq t+\frac{\epsilon}{\sigma^{2}_{\min}(SU)}\big{(}\frac{d}{\sigma^{2}_{\min}(SU)}+\frac{md}{m-d-1}\big{)}\Big{)}
14exp(t2ϵ2/(16m)).\displaystyle\geq 1-4\exp\left(-t^{2}\epsilon^{2}/(16m)\right). (59)

The concentration inequality in (A-A) and the concentration of the minimum singular value P(1σmin(SU)11d/mδ)1exp(mδ2/2)P(\frac{1}{\sigma_{\min}(SU)}\leq\frac{1}{1-\sqrt{d/m}-\delta})\geq 1-\exp(-m\delta^{2}/2) from Lemma A.1 together imply that

P(|tr((UTSTSU)1)𝔼[tr((UTSTSU)1)]|t+ϵ(1d/mδ)2(d(1d/mδ)2+mdmd1))\displaystyle P\Big{(}|\operatorname{tr}((U^{T}S^{T}SU)^{-1})-\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]|\leq t+\frac{\epsilon}{(1-\sqrt{d/m}-\delta)^{2}}\big{(}\frac{d}{(1-\sqrt{d/m}-\delta)^{2}}+\frac{md}{m-d-1}\big{)}\Big{)}
14exp(t2ϵ2/(16m))exp(mδ2/2).\displaystyle\geq 1-4\exp\left(-t^{2}\epsilon^{2}/(16m)\right)-\exp(-m\delta^{2}/2). (60)

We now simplify this bound by observing that

d(1d/mδ)2=d(1+dm2dm)+δ22δ(1dm)\displaystyle\frac{d}{(1-\sqrt{d/m}-\delta)^{2}}=\frac{d}{(1+\frac{d}{m}-2\sqrt{\frac{d}{m}})+\delta^{2}-2\delta(1-\sqrt{\frac{d}{m}})}
=mdm+d2md+δ2m2δm(1dm)\displaystyle\quad=\frac{md}{m+d-2\sqrt{md}+\delta^{2}m-2\delta m(1-\sqrt{\frac{d}{m}})}
mdmd(2α1)\displaystyle\quad\geq\frac{md}{m-d(2\sqrt{\alpha}-1)}
mdmd1\displaystyle\quad\geq\frac{md}{m-d-1} (61)

where we used mαdm\geq\alpha d in the third line. We have also used the following simple inequality

2δm(1d/m)\displaystyle 2\delta m(1-\sqrt{d/m}) 2δm(112)δmδ2m.\displaystyle\geq 2\delta m(1-\frac{1}{\sqrt{2}})\geq\delta m\geq\delta^{2}m\,. (62)

For the fourth line to be true, we need m2dα+dmd1m-2d\sqrt{\alpha}+d\leq m-d-1. This is satisfied if 12d(α1)1\leq 2d(\sqrt{\alpha}-1) or α(12d+1)2\alpha\geq(\frac{1}{2d}+1)^{2}. We note that this is already implied by our assumption that mdm\gtrsim d. Using the above observation, we simplify the bound as follows:

P(|tr((UTSTSU)1)𝔼[tr((UTSTSU)1)]|t+2dϵ(1d/mδ)4)14exp(t2ϵ2/(16m))exp(mδ2/2).\displaystyle P\Big{(}|\operatorname{tr}((U^{T}S^{T}SU)^{-1})-\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]|\leq t+\frac{2d\epsilon}{(1-\sqrt{d/m}-\delta)^{4}}\Big{)}\geq 1-4\exp\left(-t^{2}\epsilon^{2}/(16m)\right)-\exp(-m\delta^{2}/2). (63)

Scaling ϵϵ(1d/mδ)42d\epsilon\leftarrow\epsilon\frac{(1-\sqrt{d/m}-\delta)^{4}}{2d} gives

P(|tr((UTSTSU)1)𝔼[tr((UTSTSU)1)]|t+ϵ)14et2ϵ2(1d/mδ)864md2emδ2/2.\displaystyle P\Big{(}|\operatorname{tr}((U^{T}S^{T}SU)^{-1})-\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]|\leq t+\epsilon\Big{)}\geq 1-4e^{-\frac{t^{2}\epsilon^{2}(1-\sqrt{d/m}-\delta)^{8}}{64md^{2}}}-e^{-m\delta^{2}/2}. (64)

Setting t=ϵt=\epsilon and then ϵϵ/2\epsilon\leftarrow\epsilon/2 give us the final result:

P(|tr((UTSTSU)1)𝔼[tr((UTSTSU)1)]|ϵ)14eϵ4(1d/mδ)8210md2emδ2/2.\displaystyle P\Big{(}|\operatorname{tr}((U^{T}S^{T}SU)^{-1})-\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]|\leq\epsilon\Big{)}\geq 1-4e^{-\frac{\epsilon^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}-e^{-m\delta^{2}/2}. (65)

Proof:

Let us consider the single sketch estimator $\hat{x}$. The error of $\hat{x}$ can be expressed as

f(x^)f(x)\displaystyle f(\hat{x})-f(x^{*}) =A(x^x)22\displaystyle=\|A(\hat{x}-x^{*})\|_{2}^{2}
=A(ATSTSA)1ATSTSb22\displaystyle=\|A(A^{T}S^{T}SA)^{-1}A^{T}S^{T}Sb^{\perp}\|_{2}^{2}
=A(SA)Sb22\displaystyle=\|A(SA)^{\dagger}Sb^{\perp}\|_{2}^{2} (66)

where the superscript ~{}^{\dagger} indicates the pseudo-inverse. We will rewrite the error expression as:

f(x^)f(x)\displaystyle f(\hat{x})-f(x^{*}) =b2mA(SA)mb2Sb22\displaystyle=\left\|\frac{\|b^{\perp}\|_{2}}{\sqrt{m}}A(SA)^{\dagger}\frac{\sqrt{m}}{\|b^{\perp}\|_{2}}Sb^{\perp}\right\|_{2}^{2} (67)

and define z:=mb2Sbz:=\frac{\sqrt{m}}{\|b^{\perp}\|_{2}}Sb^{\perp} and M:=b2mA(SA)M:=\frac{\|b^{\perp}\|_{2}}{\sqrt{m}}A(SA)^{\dagger}. This simplifies the previous expression

f\displaystyle f (x^)f(x)=Mz22=zTMTMz.\displaystyle(\hat{x})-f(x^{*})=\|Mz\|_{2}^{2}=z^{T}M^{T}Mz. (68)

Lemma II.5 implies

P(f(x^)f(x)𝔼Sb[f(x^)f(x)]>2MTMFϵ+2MTM2ϵ|SU)eϵ.\displaystyle P\Big{(}f(\hat{x})-f(x^{*})-\operatorname{\mathbb{E}}_{Sb^{\perp}}[f(\hat{x})-f(x^{*})]>2\|M^{T}M\|_{F}\sqrt{\epsilon}+2\|M^{T}M\|_{2}\epsilon\Big{|}SU\Big{)}\leq e^{-\epsilon}. (69)

The terms MTMF\|M^{T}M\|_{F} and MTM2\|M^{T}M\|_{2} reduce to the following expressions:

MTMF\displaystyle\|M^{T}M\|_{F} =f(x)mtr((UTSTSU)2)f(x)mdσmin2(SU),\displaystyle=\frac{f(x^{*})}{m}\sqrt{\operatorname{tr}((U^{T}S^{T}SU)^{-2})}\leq\frac{f(x^{*})}{m}\frac{\sqrt{d}}{\sigma^{2}_{\min}(SU)}\,, (70)
MTM2\displaystyle\|M^{T}M\|_{2} =f(x)m1σmin2(SU).\displaystyle=\frac{f(x^{*})}{m}\frac{1}{\sigma^{2}_{\min}(SU)}\,. (71)

Plugging these in (69) and taking the expectation of both sides with respect to SUSU give us

P(f(x^)f(x)>f(x)mtr((UTSTSU)1)+2f(x)mσmin2(SU)(ϵd+ϵ))eϵ.\displaystyle P\Big{(}f(\hat{x})-f(x^{*})>\frac{f(x^{*})}{m}\operatorname{tr}((U^{T}S^{T}SU)^{-1})+2\frac{f(x^{*})}{m\sigma^{2}_{\min}(SU)}(\sqrt{\epsilon d}+\epsilon)\Big{)}\leq e^{-\epsilon}. (72)

Next, we combine (72) with Lemma II.7 and the concentration of the minimum singular value from Lemma A.1 to obtain

P(f(x^)f(x)>f(x)m𝔼[tr((UTSTSU)1)]+2f(x)m(1d/ms)2(ϵd+ϵ)+f(x)mγ)\displaystyle P\Big{(}f(\hat{x})-f(x^{*})>\frac{f(x^{*})}{m}\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]+2\frac{f(x^{*})}{m(1-\sqrt{d/m}-s)^{2}}(\sqrt{\epsilon d}+\epsilon)+\frac{f(x^{*})}{m}\gamma\Big{)}
eϵ+4eγ4(1d/mδ)8210md2+emδ2/2+ems2/2.\displaystyle\leq e^{-\epsilon}+4e^{-\frac{\gamma^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}+e^{-m\delta^{2}/2}+e^{-ms^{2}/2}. (73)

Assuming that ϵd>ϵ\sqrt{\epsilon d}>\epsilon, we can write:

P(f(x^)f(x)>f(x)m𝔼[tr((UTSTSU)1)]+4f(x)ϵdm(1d/ms)2+f(x)mγ)\displaystyle P\Big{(}f(\hat{x})-f(x^{*})>\frac{f(x^{*})}{m}\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]+\frac{4f(x^{*})\sqrt{\epsilon d}}{m(1-\sqrt{d/m}-s)^{2}}+\frac{f(x^{*})}{m}\gamma\Big{)}
eϵ+4eγ4(1d/mδ)8210md2+emδ2/2+ems2/2.\displaystyle\leq e^{-\epsilon}+4e^{-\frac{\gamma^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}+e^{-m\delta^{2}/2}+e^{-ms^{2}/2}. (74)

Making the change of variable ϵϵ2/d\epsilon\leftarrow\epsilon^{2}/d yields:

P(f(x^)f(x)>f(x)m𝔼[tr((UTSTSU)1)]+4f(x)ϵm(1d/ms)2+f(x)mγ)\displaystyle P\Big{(}f(\hat{x})-f(x^{*})>\frac{f(x^{*})}{m}\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]+\frac{4f(x^{*})\epsilon}{m(1-\sqrt{d/m}-s)^{2}}+\frac{f(x^{*})}{m}\gamma\Big{)}
eϵ2/d+4eγ4(1d/mδ)8210md2+emδ2/2+ems2/2.\displaystyle\leq e^{-\epsilon^{2}/d}+4e^{-\frac{\gamma^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}+e^{-m\delta^{2}/2}+e^{-ms^{2}/2}. (75)

Scaling ϵϵ(1d/ms)24\epsilon\leftarrow\epsilon\frac{(1-\sqrt{d/m}-s)^{2}}{4} yields:

P(f(x^)f(x)>f(x)m𝔼[tr((UTSTSU)1)]+f(x)(ϵ+γ)m)\displaystyle P\Big{(}f(\hat{x})-f(x^{*})>\frac{f(x^{*})}{m}\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S^{T}SU)^{-1})]+\frac{f(x^{*})(\epsilon+\gamma)}{m}\Big{)}
eϵ2(1d/ms)416d+4eγ4(1d/mδ)8210md2+emδ2/2+ems2/2.\displaystyle\leq e^{-\frac{\epsilon^{2}(1-\sqrt{d/m}-s)^{4}}{16d}}+4e^{-\frac{\gamma^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}+e^{-m\delta^{2}/2}+e^{-ms^{2}/2}\,. (76)

Letting γ=ϵ\gamma=\epsilon and s=δs=\delta, and then scaling ϵϵm/2\epsilon\leftarrow\epsilon m/2 lead to

P(f(x^)f(x)>m1md1+ϵ)eϵ2m2(1d/mδ)464d+4eϵ4m4(1d/mδ)8214md2+2emδ2/2.\displaystyle P\big{(}\frac{f(\hat{x})}{f(x^{*})}>\frac{m-1}{m-d-1}+\epsilon\big{)}\leq e^{-\frac{\epsilon^{2}m^{2}(1-\sqrt{d/m}-\delta)^{4}}{64d}}+4e^{-\frac{\epsilon^{4}m^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{14}md^{2}}}+2e^{-m\delta^{2}/2}. (77)

Let us also set δ=ϵ\delta=\epsilon:

P(f(x^)f(x)>m1md1+ϵ)eϵ2m2(1d/mϵ)464d+4eϵ4m4(1d/mϵ)8214md2+2emϵ2/2.\displaystyle P\big{(}\frac{f(\hat{x})}{f(x^{*})}>\frac{m-1}{m-d-1}+\epsilon\big{)}\leq e^{-\frac{\epsilon^{2}m^{2}(1-\sqrt{d/m}-\epsilon)^{4}}{64d}}+4e^{-\frac{\epsilon^{4}m^{4}(1-\sqrt{d/m}-\epsilon)^{8}}{2^{14}md^{2}}}+2e^{-m\epsilon^{2}/2}\,. (78)

Under our assumption that mdm\gtrsim d, the above probability is bounded by

P(f(x^)f(x)>m1md1+ϵ)C1eC2ϵ4m.\displaystyle P\big{(}\frac{f(\hat{x})}{f(x^{*})}>\frac{m-1}{m-d-1}+\epsilon\big{)}\leq C_{1}e^{-C_{2}\epsilon^{4}m}\,. (79)

We now turn to the lower bound. We note that Lemma II.5 can also be used as follows:

P(zT(G)z𝔼[zT(G)z]>2(G)Fϵ+2(G)2ϵ)=P(zTGz<𝔼[zTGz]2GFϵ2G2ϵ)eϵ.\displaystyle P\big{(}z^{T}(-G)z-\operatorname{\mathbb{E}}[z^{T}(-G)z]>2\|(-G)\|_{F}\sqrt{\epsilon}+2\|(-G)\|_{2}\epsilon\big{)}=P\big{(}z^{T}Gz<\operatorname{\mathbb{E}}[z^{T}Gz]-2\|G\|_{F}\sqrt{\epsilon}-2\|G\|_{2}\epsilon\big{)}\leq e^{-\epsilon}\,. (80)

Therefore, we have the following lower bound for the error:

P(f(x^)f(x)<𝔼Sb[f(x^)f(x)]2MTMFϵ2MTM2ϵ|SU)eϵ.\displaystyle P\Big{(}f(\hat{x})-f(x^{*})<\operatorname{\mathbb{E}}_{Sb^{\perp}}[f(\hat{x})-f(x^{*})]-2\|M^{T}M\|_{F}\sqrt{\epsilon}-2\|M^{T}M\|_{2}\epsilon\Big{|}SU\Big{)}\leq e^{-\epsilon}. (81)

Using the upper bound for MTMF\|M^{T}M\|_{F} and the exact expression for MTM2\|M^{T}M\|_{2}, and taking the expectation with respect to SUSU, we obtain the following

P(f(x^)f(x)<f(x)mtr((UTSTSU)1)2f(x)mσmin2(SU)(ϵd+ϵ))eϵ.\displaystyle P\Big{(}f(\hat{x})-f(x^{*})<\frac{f(x^{*})}{m}\operatorname{tr}((U^{T}S^{T}SU)^{-1})-2\frac{f(x^{*})}{m\sigma^{2}_{\min}(SU)}(\sqrt{\epsilon d}+\epsilon)\Big{)}\leq e^{-\epsilon}. (82)

Applying the same steps as before, we arrive at the following bound:

P(f(x^)f(x)<m1md1ϵ)C1eC2ϵ4m.\displaystyle P\big{(}\frac{f(\hat{x})}{f(x^{*})}<\frac{m-1}{m-d-1}-\epsilon\big{)}\leq C_{1}e^{-C_{2}\epsilon^{4}m}. (83)

Proof:

Consider the averaged estimator x¯=1qk=1qx^k\bar{x}=\frac{1}{q}\sum_{k=1}^{q}\hat{x}_{k}:

f(x¯)f(x)\displaystyle f(\bar{x})-f(x^{*}) =A(x¯x)22=1qk=1qA(SkA)Skb22\displaystyle=\|A(\bar{x}-x^{*})\|_{2}^{2}=\left\|\frac{1}{q}\sum_{k=1}^{q}A(S_{k}A)^{\dagger}S_{k}b^{\perp}\right\|_{2}^{2}
=1qk=1qMkzk22\displaystyle=\left\|\frac{1}{q}\sum_{k=1}^{q}M_{k}z_{k}\right\|_{2}^{2} (84)

where we defined

Mk\displaystyle M_{k} :=b2mA(SkA),zk:=mb2Skb.\displaystyle:=\frac{\|b^{\perp}\|_{2}}{\sqrt{m}}A(S_{k}A)^{\dagger},\quad z_{k}:=\frac{\sqrt{m}}{\|b^{\perp}\|_{2}}S_{k}b^{\perp}. (85)

We note that the entries of the vector zz are distributed as i.i.d. standard normal. We can manipulate the error expression as follows

k=1qMkzk\displaystyle\sum_{k=1}^{q}M_{k}z_{k} =[M1Mq][z1zq]=Mz\displaystyle=\begin{bmatrix}M_{1}&\dots&M_{q}\end{bmatrix}\begin{bmatrix}z_{1}\\ \vdots\\ z_{q}\end{bmatrix}=Mz (86)

and then

f(x¯)f(x)=1q2Mz22=zT(1q2MTM)z.\displaystyle f(\bar{x})-f(x^{*})=\frac{1}{q^{2}}\|Mz\|_{2}^{2}=z^{T}\big{(}\frac{1}{q^{2}}M^{T}M\big{)}z. (87)

We now bound $\|\frac{1}{q^{2}}M^{T}M\|_{F}^{2}=\frac{1}{q^{4}}\|M^{T}M\|_{F}^{2}$:

MTMF2\displaystyle\|M^{T}M\|_{F}^{2} =tr(MTMMTM)=tr(MMTMMT)\displaystyle=\operatorname{tr}(M^{T}MM^{T}M)=\operatorname{tr}(MM^{T}MM^{T})
=tr(k=1qMkMkTl=1qMlMlT)\displaystyle=\operatorname{tr}\left(\sum_{k=1}^{q}M_{k}M_{k}^{T}\sum_{l=1}^{q}M_{l}M_{l}^{T}\right)
=f2(x)m2k=1ql=1qtr((SkU)T(SkU)(SlU)T(SlU))\displaystyle=\frac{f^{2}(x^{*})}{m^{2}}\sum_{k=1}^{q}\sum_{l=1}^{q}\operatorname{tr}\left({(S_{k}U)^{\dagger}}^{T}(S_{k}U)^{\dagger}{(S_{l}U)^{\dagger}}^{T}(S_{l}U)^{\dagger}\right)
=f2(x)m2(k=1qtr((UTSkTSkU)2)+kltr((SkU)T(SkU)(SlU)T(SlU)))\displaystyle=\frac{f^{2}(x^{*})}{m^{2}}\Big{(}\sum_{k=1}^{q}\operatorname{tr}((U^{T}S_{k}^{T}S_{k}U)^{-2})+\sum_{k\neq l}\operatorname{tr}\left({(S_{k}U)^{\dagger}}^{T}(S_{k}U)^{\dagger}{(S_{l}U)^{\dagger}}^{T}(S_{l}U)^{\dagger}\right)\Big{)}
f2(x)m2(k=1qdσmin4(SkU)+kltr((SkU)T(SkU))tr((SlU)T(SlU)))\displaystyle\leq\frac{f^{2}(x^{*})}{m^{2}}\Big{(}\sum_{k=1}^{q}\frac{d}{\sigma^{4}_{\min}(S_{k}U)}+\sum_{k\neq l}\sqrt{\operatorname{tr}\left({(S_{k}U)^{\dagger}}^{T}(S_{k}U)^{\dagger}\right)}\sqrt{\operatorname{tr}\left({(S_{l}U)^{\dagger}}^{T}(S_{l}U)^{\dagger}\right)}\Big{)}
=f2(x)m2(k=1qdσmin4(SkU)+kltr((UTSkTSkU)1)tr((UTSlTSlU)1))\displaystyle=\frac{f^{2}(x^{*})}{m^{2}}\Big{(}\sum_{k=1}^{q}\frac{d}{\sigma^{4}_{\min}(S_{k}U)}+\sum_{k\neq l}\sqrt{\operatorname{tr}\left((U^{T}S_{k}^{T}S_{k}U)^{-1}\right)}\sqrt{\operatorname{tr}\left((U^{T}S_{l}^{T}S_{l}U)^{-1}\right)}\Big{)}
f2(x)m2(k=1qdσmin4(SkU)+kldσmin(SkU)σmin(SlU))\displaystyle\leq\frac{f^{2}(x^{*})}{m^{2}}\Big{(}\sum_{k=1}^{q}\frac{d}{\sigma^{4}_{\min}(S_{k}U)}+\sum_{k\neq l}\frac{d}{\sigma_{\min}(S_{k}U)\sigma_{\min}(S_{l}U)}\Big{)} (88)

where we have used the Cauchy-Schwarz inequality to bound the trace of the matrix product.

Next, we look at 1q2MTM2=1q2MTM2\|\frac{1}{q^{2}}M^{T}M\|_{2}=\frac{1}{q^{2}}\|M^{T}M\|_{2}:

MTM2\displaystyle\|M^{T}M\|_{2} =M22=MMT2=k=1qMkMkT2\displaystyle=\|M\|_{2}^{2}=\|MM^{T}\|_{2}=\left\|\sum_{k=1}^{q}M_{k}M_{k}^{T}\right\|_{2}
=b22mk=1q(SkU)T(SkU)2\displaystyle=\frac{\|b^{\perp}\|_{2}^{2}}{m}\left\|\sum_{k=1}^{q}{(S_{k}U)^{\dagger}}^{T}(S_{k}U)^{\dagger}\right\|_{2}
b22mk=1q(SkU)T(SkU)2\displaystyle\leq\frac{\|b^{\perp}\|_{2}^{2}}{m}\sum_{k=1}^{q}\|{(S_{k}U)^{\dagger}}^{T}(S_{k}U)^{\dagger}\|_{2}
=b22mk=1q1σmin2(SkU).\displaystyle=\frac{\|b^{\perp}\|_{2}^{2}}{m}\sum_{k=1}^{q}\frac{1}{\sigma^{2}_{\min}(S_{k}U)}. (89)

By Lemma II.5, we have

\displaystyle P\Big{(}f(\bar{x})-f(x^{*})-\operatorname{\mathbb{E}}_{S_{k}b^{\perp}}[f(\bar{x})-f(x^{*})]>\frac{2}{q^{2}}\|M^{T}M\|_{F}\sqrt{\epsilon}+\frac{2}{q^{2}}\|M^{T}M\|_{2}\epsilon\Big{|}S_{k}U,\,k=1,\dots,q\Big{)}\leq e^{-\epsilon} (90)

We can take the expectation of both sides with respect to $S_{k}U$, $k=1,\dots,q$ to remove the conditioning. The expectation term inside the probability is equal to

𝔼Skb[f(x¯)f(x)]\displaystyle\operatorname{\mathbb{E}}_{S_{k}b^{\perp}}[f(\bar{x})-f(x^{*})] =tr(1q2MTM)=1q2tr(MMT)=f(x)mq2k=1qtr((UTSkTSkU)1).\displaystyle=\operatorname{tr}(\frac{1}{q^{2}}M^{T}M)=\frac{1}{q^{2}}\operatorname{tr}(MM^{T})=\frac{f(x^{*})}{mq^{2}}\sum_{k=1}^{q}\operatorname{tr}((U^{T}S_{k}^{T}S_{k}U)^{-1}). (91)

Hence, we have

P(f(x¯)f(x)f(x)mq2k=1qtr((UTSkTSkU)1)>2q2MTMFϵ+2q2MTM2ϵ)eϵ.\displaystyle P\Big{(}f(\bar{x})-f(x^{*})-\frac{f(x^{*})}{mq^{2}}\sum_{k=1}^{q}\operatorname{tr}((U^{T}S_{k}^{T}S_{k}U)^{-1})>\frac{2}{q^{2}}\|M^{T}M\|_{F}\sqrt{\epsilon}+\frac{2}{q^{2}}\|M^{T}M\|_{2}\epsilon\Big{)}\leq e^{-\epsilon}\,. (92)

We now use the concentration bound of the trace term in Lemma II.7 to obtain

P(f(x¯)f(x)f(x)mq2k=1q𝔼[tr((UTSkTSkU)1)]>2q2MTMFϵ+2q2MTM2ϵ+f(x)mq2qγ)\displaystyle P\Big{(}f(\bar{x})-f(x^{*})-\frac{f(x^{*})}{mq^{2}}\sum_{k=1}^{q}\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S_{k}^{T}S_{k}U)^{-1})]>\frac{2}{q^{2}}\|M^{T}M\|_{F}\sqrt{\epsilon}+\frac{2}{q^{2}}\|M^{T}M\|_{2}\epsilon+\frac{f(x^{*})}{mq^{2}}q\gamma\Big{)}
eϵ+1(14eγ4(1d/mδ)8210md2emδ2/2)q\displaystyle\leq e^{-\epsilon}+1-\big{(}1-4e^{-\frac{\gamma^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}-e^{-m\delta^{2}/2}\big{)}^{q} (93)

where we have used the independence of SkS_{k}’s. Using the upper bounds for MTMF\|M^{T}M\|_{F} and MTM2\|M^{T}M\|_{2}, we make the following observation. If the minimum singular values of all of SkUS_{k}U, k=1,,qk=1,\dots,q satisfy σmin(SkU)>1d/ms\sigma_{\min}(S_{k}U)>1-\sqrt{d/m}-s, which occurs with probability at least (1ems2/2)q(1-e^{-ms^{2}/2})^{q} from Lemma A.1, then the following holds

MTMFϵ+MTM2ϵ\displaystyle\|M^{T}M\|_{F}\sqrt{\epsilon}+\|M^{T}M\|_{2}\epsilon f(x)m(qd(1d/ms)4+(q2q)d(1d/ms)2ϵ+qϵ(1d/ms)2)\displaystyle\leq\frac{f(x^{*})}{m}\bigg{(}\sqrt{\frac{qd}{(1-\sqrt{d/m}-s)^{4}}+\frac{(q^{2}-q)d}{(1-\sqrt{d/m}-s)^{2}}}\sqrt{\epsilon}+\frac{q\epsilon}{(1-\sqrt{d/m}-s)^{2}}\bigg{)}
f(x)mq(ϵd+ϵ)(1d/ms)2\displaystyle\leq\frac{f(x^{*})}{m}\frac{q(\sqrt{\epsilon d}+\epsilon)}{(1-\sqrt{d/m}-s)^{2}}
f(x)m2qϵd(1d/ms)2\displaystyle\leq\frac{f(x^{*})}{m}\frac{2q\sqrt{\epsilon d}}{(1-\sqrt{d/m}-s)^{2}} (94)

where we used 1d/ms<11-\sqrt{d/m}-s<1 and ϵϵd\epsilon\leq\sqrt{\epsilon d} in the last two inequalities. Combining this observation with the probability bound, we arrive at

P(f(x¯)f(x)f(x)mq2k=1q𝔼[tr((UTSkTSkU)1)]>2q2f(x)m2qϵd(1d/ms)2+f(x)mq2qγ)\displaystyle P\Big{(}f(\bar{x})-f(x^{*})-\frac{f(x^{*})}{mq^{2}}\sum_{k=1}^{q}\operatorname{\mathbb{E}}[\operatorname{tr}((U^{T}S_{k}^{T}S_{k}U)^{-1})]>\frac{2}{q^{2}}\frac{f(x^{*})}{m}\frac{2q\sqrt{\epsilon d}}{(1-\sqrt{d/m}-s)^{2}}+\frac{f(x^{*})}{mq^{2}}q\gamma\Big{)}
eϵ+1(14eγ4(1d/mδ)8210md2emδ2/2)q+1(1ems2/2)q.\displaystyle\leq e^{-\epsilon}+1-\big{(}1-4e^{-\frac{\gamma^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}-e^{-m\delta^{2}/2}\big{)}^{q}+1-(1-e^{-ms^{2}/2})^{q}. (95)

Simplifying the expressions and making the change ϵϵ2/d\epsilon\leftarrow\epsilon^{2}/d lead to

P(f(x¯)f(x)f(x)qdmd1>4f(x)mqϵ(1d/ms)2+f(x)mqγ)\displaystyle P\Big{(}f(\bar{x})-f(x^{*})-\frac{f(x^{*})}{q}\frac{d}{m-d-1}>\frac{4f(x^{*})}{mq}\frac{\epsilon}{(1-\sqrt{d/m}-s)^{2}}+\frac{f(x^{*})}{mq}\gamma\Big{)}
eϵ2/d+2(14eγ4(1d/mδ)8210md2emδ2/2)q(1ems2/2)q\displaystyle\leq e^{-\epsilon^{2}/d}+2-\big{(}1-4e^{-\frac{\gamma^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}-e^{-m\delta^{2}/2}\big{)}^{q}-(1-e^{-ms^{2}/2})^{q} (96)

Next, we let ϵϵ(1d/ms)24\epsilon\leftarrow\epsilon\frac{(1-\sqrt{d/m}-s)^{2}}{4} and obtain

P(f(x¯)f(x)f(x)qdmd1>f(x)mq(ϵ+γ))\displaystyle P\Big{(}f(\bar{x})-f(x^{*})-\frac{f(x^{*})}{q}\frac{d}{m-d-1}>\frac{f(x^{*})}{mq}(\epsilon+\gamma)\Big{)}
eϵ2(1d/ms)416d+2(1ems2/2)q(14eγ4(1d/mδ)8210md2emδ2/2)q.\displaystyle\leq e^{-\frac{\epsilon^{2}(1-\sqrt{d/m}-s)^{4}}{16d}}+2-(1-e^{-ms^{2}/2})^{q}-\big{(}1-4e^{-\frac{\gamma^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{10}md^{2}}}-e^{-m\delta^{2}/2}\big{)}^{q}\,. (97)

We now set γ=ϵ\gamma=\epsilon and scale ϵϵmq2\epsilon\leftarrow\epsilon\frac{mq}{2}:

\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}>1+\frac{1}{q}\frac{d}{m-d-1}+\epsilon\Big{)}\leq e^{-\frac{\epsilon^{2}m^{2}q^{2}(1-\sqrt{d/m}-s)^{4}}{64d}}+2-(1-e^{-ms^{2}/2})^{q}-\big{(}1-4e^{-\frac{\epsilon^{4}m^{4}q^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{14}md^{2}}}-e^{-m\delta^{2}/2}\big{)}^{q}. (98)

Let us use the assumption mdm\gtrsim d to further simplify:

P(f(x¯)f(x)>1+1qdmd1+ϵ)C1eC2(qϵ)2m+2(1ems2/2)q(14eϵ4m4q4(1d/mδ)8214md2emδ2/2)q.\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}>1+\frac{1}{q}\frac{d}{m-d-1}+\epsilon\Big{)}\leq C_{1}e^{-C_{2}(q\epsilon)^{2}m}+2-(1-e^{-ms^{2}/2})^{q}-\big{(}1-4e^{-\frac{\epsilon^{4}m^{4}q^{4}(1-\sqrt{d/m}-\delta)^{8}}{2^{14}md^{2}}}-e^{-m\delta^{2}/2}\big{)}^{q}. (99)

We can use Bernoulli’s inequality (1ems2/2)q1qems2/2(1-e^{-ms^{2}/2})^{q}\geq 1-qe^{-ms^{2}/2} and arrive at:

P(f(x¯)f(x)>1+1qdmd1+ϵ)C1eC2(qϵ)2m+C3qeC4(qϵ)4m(1d/ms)8+2qems2/2\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}>1+\frac{1}{q}\frac{d}{m-d-1}+\epsilon\Big{)}\leq C_{1}e^{-C_{2}(q\epsilon)^{2}m}+C_{3}qe^{-C_{4}(q\epsilon)^{4}m(1-\sqrt{d/m}-s)^{8}}+2qe^{-ms^{2}/2} (100)

where we also set δ=s\delta=s. Picking s=qϵs=q\epsilon yields:

P(f(x¯)f(x)>1+1qdmd1+ϵ)C1eC2(qϵ)2m+C3qeC4(qϵ)4m(1d/mqϵ)8+2qem(qϵ)2/2.\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}>1+\frac{1}{q}\frac{d}{m-d-1}+\epsilon\Big{)}\leq C_{1}e^{-C_{2}(q\epsilon)^{2}m}+C_{3}qe^{-C_{4}(q\epsilon)^{4}m(1-\sqrt{d/m}-q\epsilon)^{8}}+2qe^{-m(q\epsilon)^{2}/2}\,. (101)

Finally, we obtain the following simpler expression for the bound:

P(f(x¯)f(x)>1+1qdmd1+ϵ)qC1eC2(qϵ)4m\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}>1+\frac{1}{q}\frac{d}{m-d-1}+\epsilon\Big{)}\leq qC_{1}e^{-C_{2}(q\epsilon)^{4}m} (102)

where we redefine the constants C1,C2C_{1},C_{2}.

The lower bound can be obtained by applying the same steps:

P(f(x¯)f(x)<1+1qdmd1ϵ)qC1eC2(qϵ)4m.\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}<1+\frac{1}{q}\frac{d}{m-d-1}-\epsilon\Big{)}\leq qC_{1}e^{-C_{2}(q\epsilon)^{4}m}. (103)
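To illustrate the two-sided concentration above, the following minimal NumPy sketch (all problem sizes are hypothetical) draws q independent Gaussian sketches, averages the sketch-and-solve solutions, and compares the empirical mean of f(\bar{x})/f(x^{*}) with the predicted center 1+\frac{1}{q}\frac{d}{m-d-1}.

```python
import numpy as np

# Monte Carlo illustration of the concentration result above: the averaged
# Gaussian sketch-and-solve estimator has f(x_bar)/f(x*) concentrated around
# 1 + d/(q(m-d-1)). All problem sizes below are hypothetical.
rng = np.random.default_rng(0)
n, d, m, q, trials = 2000, 20, 200, 10, 50

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
f_star = np.linalg.norm(A @ x_star - b) ** 2

ratios = []
for _ in range(trials):
    x_hats = []
    for _ in range(q):
        S = rng.standard_normal((m, n)) / np.sqrt(m)          # Gaussian sketch
        x_hat, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
        x_hats.append(x_hat)
    x_bar = np.mean(x_hats, axis=0)                           # parameter averaging
    ratios.append(np.linalg.norm(A @ x_bar - b) ** 2 / f_star)

print("empirical mean of f(x_bar)/f(x*) :", np.mean(ratios))
print("predicted center 1 + d/(q(m-d-1)):", 1 + d / (q * (m - d - 1)))
```

With these (hypothetical) sizes the empirical ratios cluster around the predicted center, and the spread shrinks as m and q grow, consistent with the bound.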

Proof:

i) For unbiased estimators: We begin by invoking Lemma IV.1 which gives us

𝔼[A(x¯x)22]\displaystyle\operatorname{\mathbb{E}}[\|A(\bar{x}-x^{*})\|_{2}^{2}] =1q𝔼[A(x^1x)22]\displaystyle=\frac{1}{q}\operatorname{\mathbb{E}}[\|A(\hat{x}_{1}-x^{*})\|_{2}^{2}] (104)

since we assume that x^k\hat{x}_{k}’s are unbiased, i.e., 𝔼[x^k]=x\operatorname{\mathbb{E}}[\hat{x}_{k}]=x^{*} for k=1,,qk=1,\dots,q.

This shows that the error of the averaged estimator is equal to the error of the single sketch estimator scaled by 1q\frac{1}{q}. Hence, we can leverage the lower bound result for the single sketch estimator. We now give the details of how to find the lower bound for the single sketch estimator which was first shown in [28].

The Fisher information matrix for estimating xx^{*} from SbSb can be constructed as follows:

I(Sb;x)=𝔼Sb[xlogg(Sb;x)xlogg(Sb;x)T]\displaystyle I(Sb;x^{*})=\operatorname{\mathbb{E}}_{Sb}[\nabla_{x^{*}}\log g(Sb;x^{*})\nabla_{x^{*}}\log g(Sb;x^{*})^{T}] (105)

where g(Sb;x^{*}) is the probability density function of Sb conditioned on SA, i.e., g(Sb;x^{*}) is the density of the multivariate Gaussian distribution \mathcal{N}(SAx^{*},\frac{1}{m}\|b^{\perp}\|_{2}^{2}I_{m}). We recall the definition b^{\perp}:=b-Ax^{*}.

The gradient of the logarithm of the probability density function with respect to xx^{*} is

xlogg(Sb;x)=mb22(SA)T(SbSAx).\displaystyle\nabla_{x^{*}}\log g(Sb;x^{*})=\frac{m}{\|b^{\perp}\|_{2}^{2}}(SA)^{T}(Sb-SAx^{*})\,. (106)

Plugging this in (105) yields

I(Sb;x)\displaystyle I(Sb;x^{*}) =m2b24𝔼Sb[(SA)T(SbSAx)(SbSAx)TSA]\displaystyle=\frac{m^{2}}{\|b^{\perp}\|_{2}^{4}}\operatorname{\mathbb{E}}_{Sb}\left[(SA)^{T}(Sb-SAx^{*})(Sb-SAx^{*})^{T}SA\right]
=m2b24(SA)T(b22mIm)SA\displaystyle=\frac{m^{2}}{\|b^{\perp}\|_{2}^{4}}(SA)^{T}\big{(}\frac{\|b^{\perp}\|_{2}^{2}}{m}I_{m}\big{)}SA
=mb22ATSTSA.\displaystyle=\frac{m}{\|b^{\perp}\|_{2}^{2}}A^{T}S^{T}SA\,. (107)

We note that the expectation of the error conditioned on SA can be lower bounded as

𝔼[A(x^x)22|SA]\displaystyle\operatorname{\mathbb{E}}[\|A(\hat{x}-x^{*})\|_{2}^{2}|SA] =𝔼[tr(A(x^x)(x^x)TAT)|SA]\displaystyle=\operatorname{\mathbb{E}}[\operatorname{tr}(A(\hat{x}-x^{*})(\hat{x}-x^{*})^{T}A^{T})|SA]
=tr(A𝔼[(x^x)(x^x)T|SA]AT)\displaystyle=\operatorname{tr}\big{(}A\operatorname{\mathbb{E}}[(\hat{x}-x^{*})(\hat{x}-x^{*})^{T}|SA]A^{T}\big{)}
b22mtr(A(ATSTSA)1AT),\displaystyle\geq\frac{\|b^{\perp}\|_{2}^{2}}{m}\operatorname{tr}(A(A^{T}S^{T}SA)^{-1}A^{T})\,, (108)

where the last line follows from the Cramér–Rao lower bound [38] given by

𝔼[(x^x)(x^x)T|SA]I1(Sb;x).\displaystyle\operatorname{\mathbb{E}}[(\hat{x}-x^{*})(\hat{x}-x^{*})^{T}|SA]\succeq I^{-1}(Sb;x^{*})\,. (109)

Taking the expectation of both sides with respect to SASA yields

𝔼\displaystyle\operatorname{\mathbb{E}} [A(x^x)22]f(x)dmd1.\displaystyle[\|A(\hat{x}-x^{*})\|_{2}^{2}]\geq f(x^{*})\frac{d}{m-d-1}\,. (110)

Hence, we arrive at

𝔼[A(x¯x)22]f(x)qdmd1.\displaystyle\operatorname{\mathbb{E}}[\|A(\bar{x}-x^{*})\|_{2}^{2}]\geq\frac{f(x^{*})}{q}\frac{d}{m-d-1}\,. (111)

ii) For any estimator: We again invoke Lemma IV.1, which gives us

𝔼[A(x¯x)22]\displaystyle\operatorname{\mathbb{E}}[\|A(\bar{x}-x^{*})\|_{2}^{2}] =1q𝔼[A(x^1x)22]+q1q𝔼[Ax^1]Ax22\displaystyle=\frac{1}{q}\operatorname{\mathbb{E}}[\|A(\hat{x}_{1}-x^{*})\|_{2}^{2}]+\frac{q-1}{q}\|\operatorname{\mathbb{E}}[A\hat{x}_{1}]-Ax^{*}\|_{2}^{2}
f(x)qdm+q1q𝔼[Ax^1]Ax22\displaystyle\geq\frac{f(x^{*})}{q}\frac{d}{m}+\frac{q-1}{q}\|\operatorname{\mathbb{E}}[A\hat{x}_{1}]-Ax^{*}\|_{2}^{2}
f(x)qdm\displaystyle\geq\frac{f(x^{*})}{q}\frac{d}{m} (112)

where the inequality on the second line follows from Lemma II.10, and the last inequality holds because the squared bias term is non-negative. ∎

Proof:

The update rule for distributed IHS is given as

xt+1=xtμ1qk=1q(ATSt,kTSt,kA)1AT(Axtb).\displaystyle x_{t+1}=x_{t}-\mu\frac{1}{q}\sum_{k=1}^{q}(A^{T}S_{t,k}^{T}S_{t,k}A)^{-1}A^{T}(Ax_{t}-b). (113)
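Before analyzing this recursion, we note that the update (113) is straightforward to implement. The following is a minimal NumPy simulation with hypothetical sizes; each simulated worker draws a fresh Gaussian sketch at every iteration, and the step size is set to \mu=1/\theta_{1}=(m-d-1)/m, the choice justified later in the proof.

```python
import numpy as np

# Minimal simulation of the distributed IHS iteration (113); all sizes are
# hypothetical and each simulated worker draws a fresh Gaussian sketch.
rng = np.random.default_rng(1)
n, d, m, q, iters = 2000, 40, 400, 8, 10

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

mu = (m - d - 1) / m                      # mu = 1/theta_1 for the Gaussian sketch
x_t = np.zeros(d)
for t in range(iters):
    grad = A.T @ (A @ x_t - b)            # full gradient A^T(A x_t - b)
    directions = []
    for k in range(q):                    # one sketched Hessian per worker
        SA = (rng.standard_normal((m, n)) / np.sqrt(m)) @ A
        directions.append(np.linalg.solve(SA.T @ SA, grad))
    x_t = x_t - mu * np.mean(directions, axis=0)
    print("iteration", t, "error ||A(x_t - x*)|| =", np.linalg.norm(A @ (x_t - x_star)))
```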

Let us decompose bb as b=Ax+bb=Ax^{*}+b^{\perp} and note that ATb=0A^{T}b^{\perp}=0 which gives us:

xt+1=xtμ1qk=1q(ATSt,kTSt,kA)1ATAet.\displaystyle x_{t+1}=x_{t}-\mu\frac{1}{q}\sum_{k=1}^{q}(A^{T}S_{t,k}^{T}S_{t,k}A)^{-1}A^{T}Ae_{t}. (114)

Subtracting xx^{*} from both sides, we obtain an equation in terms of the error vector ete_{t} only:

et+1\displaystyle e_{t+1} =etμ1qk=1q(ATSt,kTSt,kA)1ATAet\displaystyle=e_{t}-\mu\frac{1}{q}\sum_{k=1}^{q}(A^{T}S_{t,k}^{T}S_{t,k}A)^{-1}A^{T}Ae_{t}
=(Iμ1qk=1q(ATSt,kTSt,kA)1ATA)et.\displaystyle=\left(I-\mu\frac{1}{q}\sum_{k=1}^{q}(A^{T}S_{t,k}^{T}S_{t,k}A)^{-1}A^{T}A\right)e_{t}.

Let us multiply both sides by A from the left and define Q_{t,k}\coloneqq A(A^{T}S_{t,k}^{T}S_{t,k}A)^{-1}A^{T} to obtain the following equation:

et+1A\displaystyle e^{A}_{t+1} =(Iμ1qk=1qQt,k)etA.\displaystyle=\left(I-\mu\frac{1}{q}\sum_{k=1}^{q}Q_{t,k}\right)e^{A}_{t}.

We now analyze the expectation of 2\ell_{2} norm of et+1Ae^{A}_{t+1}:

𝔼[et+1A22]\displaystyle\operatorname{\mathbb{E}}[\|e^{A}_{t+1}\|_{2}^{2}] =𝔼[1qk=1q(IμQt,k)etA22]\displaystyle=\operatorname{\mathbb{E}}\left[\left\|\frac{1}{q}\sum_{k=1}^{q}(I-\mu Q_{t,k})e^{A}_{t}\right\|_{2}^{2}\right]
=1q2𝔼[k=1ql=1q(IμQt,k)etA,(IμQt,l)etA]\displaystyle=\frac{1}{q^{2}}\operatorname{\mathbb{E}}\left[\sum_{k=1}^{q}\sum_{l=1}^{q}\langle(I-\mu Q_{t,k})e^{A}_{t},(I-\mu Q_{t,l})e^{A}_{t}\rangle\right]
=1q2k=1ql=1q𝔼[(IμQt,k)etA,(IμQt,l)etA].\displaystyle=\frac{1}{q^{2}}\sum_{k=1}^{q}\sum_{l=1}^{q}\operatorname{\mathbb{E}}\left[\langle(I-\mu Q_{t,k})e^{A}_{t},(I-\mu Q_{t,l})e^{A}_{t}\rangle\right]. (115)

The contribution for k\neq l in the double summation of (115) is equal to zero because for k\neq l, we have

𝔼[(IμQt,k)etA,(IμQt,l)etA]\displaystyle\operatorname{\mathbb{E}}\left[\langle(I-\mu Q_{t,k})e^{A}_{t},(I-\mu Q_{t,l})e^{A}_{t}\rangle\right] =𝔼[(IμQt,k)etA],𝔼[(IμQt,l)etA]\displaystyle=\langle\operatorname{\mathbb{E}}[(I-\mu Q_{t,k})e^{A}_{t}],\operatorname{\mathbb{E}}[(I-\mu Q_{t,l})e^{A}_{t}]\rangle
=𝔼[(IμQt,k)etA],𝔼[(IμQt,k)etA]\displaystyle=\langle\operatorname{\mathbb{E}}[(I-\mu Q_{t,k})e^{A}_{t}],\operatorname{\mathbb{E}}[(I-\mu Q_{t,k})e^{A}_{t}]\rangle
=𝔼[(IμQt,k)etA]22.\displaystyle=\left\|\operatorname{\mathbb{E}}[(I-\mu Q_{t,k})e^{A}_{t}]\right\|_{2}^{2}.

The term in the last line above is zero for μ=1θ1\mu=\frac{1}{\theta_{1}}:

𝔼[(IμQt,k)etA]\displaystyle\operatorname{\mathbb{E}}[(I-\mu Q_{t,k})e^{A}_{t}] =𝔼[(IμA(ATSt,kTSt,kA)1AT)etA]\displaystyle=\operatorname{\mathbb{E}}[(I-\mu A(A^{T}S_{t,k}^{T}S_{t,k}A)^{-1}A^{T})e^{A}_{t}]
=(Iμθ1A(ATA)1AT)etA\displaystyle=(I-\mu\theta_{1}A(A^{T}A)^{-1}A^{T})e^{A}_{t}
=(Iμθ1UUT)etA\displaystyle=(I-\mu\theta_{1}UU^{T})e^{A}_{t}
=(IUUT)etA=0\displaystyle=(I-UU^{T})e^{A}_{t}=0

where we used A=U\Sigma V^{T}. For the rest of the proof, we assume that we set \mu=1/\theta_{1}. Now that we know the contribution from terms with k\neq l is zero, the expansion in (115) can be rewritten as

𝔼[et+1A22]\displaystyle\operatorname{\mathbb{E}}[\|e^{A}_{t+1}\|_{2}^{2}] =1q2k=1q𝔼[(IμQt,k)etA,(IμQt,k)etA]\displaystyle=\frac{1}{q^{2}}\sum_{k=1}^{q}\operatorname{\mathbb{E}}\left[\langle(I-\mu Q_{t,k})e^{A}_{t},(I-\mu Q_{t,k})e^{A}_{t}\rangle\right]
=1q2k=1q𝔼[(IμQt,k)etA22]\displaystyle=\frac{1}{q^{2}}\sum_{k=1}^{q}\operatorname{\mathbb{E}}[\|(I-\mu Q_{t,k})e^{A}_{t}\|_{2}^{2}]
=1q𝔼[(IμQt,1)etA22]\displaystyle=\frac{1}{q}\operatorname{\mathbb{E}}[\|(I-\mu Q_{t,1})e^{A}_{t}\|_{2}^{2}]
=1q(etA22+μ2𝔼[Qt,1etA22]2μ(etA)T𝔼[Qt,1]etA)\displaystyle=\frac{1}{q}\left(\|e^{A}_{t}\|_{2}^{2}+\mu^{2}\operatorname{\mathbb{E}}[\|Q_{t,1}e^{A}_{t}\|_{2}^{2}]-2\mu(e^{A}_{t})^{T}\operatorname{\mathbb{E}}[Q_{t,1}]e^{A}_{t}\right)
=1q(μ2𝔼[Qt,1etA22]etA22)\displaystyle=\frac{1}{q}\left(\mu^{2}\operatorname{\mathbb{E}}[\|Q_{t,1}e^{A}_{t}\|_{2}^{2}]-\|e^{A}_{t}\|_{2}^{2}\right)
=1q(μ2(etA)T𝔼[Qt,1TQt,1]etAetA22).\displaystyle=\frac{1}{q}\left(\mu^{2}(e^{A}_{t})^{T}\operatorname{\mathbb{E}}[Q_{t,1}^{T}Q_{t,1}]e^{A}_{t}-\|e^{A}_{t}\|_{2}^{2}\right).

The term 𝔼[Qt,1TQt,1]\operatorname{\mathbb{E}}[Q_{t,1}^{T}Q_{t,1}] can be simplified using SVD decomposition A=UΣVTA=U\Sigma V^{T}. This gives us Qt,k=U(UTSt,kTSt,kU)1UTQ_{t,k}=U(U^{T}S_{t,k}^{T}S_{t,k}U)^{-1}U^{T} and furthermore we have:

𝔼[Qt,1TQt,1]\displaystyle\operatorname{\mathbb{E}}[Q_{t,1}^{T}Q_{t,1}] =𝔼[U(UTSt,1TSt,1U)1UTU(UTSt,1TSt,1U)1UT]\displaystyle=\operatorname{\mathbb{E}}[U(U^{T}S_{t,1}^{T}S_{t,1}U)^{-1}U^{T}U(U^{T}S_{t,1}^{T}S_{t,1}U)^{-1}U^{T}]
=𝔼[U(UTSt,1TSt,1U)1(UTSt,1TSt,1U)1UT]\displaystyle=\operatorname{\mathbb{E}}[U(U^{T}S_{t,1}^{T}S_{t,1}U)^{-1}(U^{T}S_{t,1}^{T}S_{t,1}U)^{-1}U^{T}]
=U𝔼[(UTSt,1TSt,1U)2]UT\displaystyle=U\operatorname{\mathbb{E}}[(U^{T}S_{t,1}^{T}S_{t,1}U)^{-2}]U^{T}
=θ2UUT.\displaystyle=\theta_{2}UU^{T}.

Plugging this in, we obtain:

𝔼[et+1A22]\displaystyle\operatorname{\mathbb{E}}[\|e^{A}_{t+1}\|_{2}^{2}] =1q(θ2μ2(etA)TUUTetAetA22)\displaystyle=\frac{1}{q}\left(\theta_{2}\mu^{2}(e^{A}_{t})^{T}UU^{T}e^{A}_{t}-\|e^{A}_{t}\|_{2}^{2}\right)
=1q(θ2μ2etA22etA22)\displaystyle=\frac{1}{q}\left(\theta_{2}\mu^{2}\|e^{A}_{t}\|_{2}^{2}-\|e^{A}_{t}\|_{2}^{2}\right)
=θ2μ21qetA22\displaystyle=\frac{\theta_{2}\mu^{2}-1}{q}\|e^{A}_{t}\|_{2}^{2}
=1q(θ2θ121)etA22.\displaystyle=\frac{1}{q}\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)\|e^{A}_{t}\|_{2}^{2}\,.

Proof:

Taking the expectation with respect to the sketching matrices St,kS_{t,k}, k=1,,qk=1,\dots,q of both sides of the equation given in Theorem II.13, we obtain

𝔼[et+1A22]=1q(θ2θ121)𝔼[etA22].\displaystyle\operatorname{\mathbb{E}}[\|e^{A}_{t+1}\|_{2}^{2}]=\frac{1}{q}\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)\operatorname{\mathbb{E}}[\|e^{A}_{t}\|_{2}^{2}]\,.

This gives us the relationship between the initial error (when we initialize x0x_{0} to be the zero vector) and the expected error in iteration tt:

𝔼[etA22]=1qt(θ2θ121)tAx22.\displaystyle\operatorname{\mathbb{E}}[\|e^{A}_{t}\|_{2}^{2}]=\frac{1}{q^{t}}\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)^{t}\|Ax^{*}\|_{2}^{2}.

It follows that the expected error reaches ϵ\epsilon-accuracy with respect to the initial error at iteration TT where:

1qT(θ2θ121)T\displaystyle\frac{1}{q^{T}}\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)^{T} =ϵ\displaystyle=\epsilon
qT(θ2θ121)T\displaystyle q^{T}\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)^{-T} =1ϵ\displaystyle=\frac{1}{\epsilon}
T(log(q)log(θ2θ121))\displaystyle T\left(\log(q)-\log\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)\right) =log(1/ϵ)\displaystyle=\log(1/\epsilon)
T\displaystyle T =log(1/ϵ)log(q)log(θ2θ121).\displaystyle=\frac{\log(1/\epsilon)}{\log(q)-\log\left(\frac{\theta_{2}}{\theta_{1}^{2}}-1\right)}.

Each iteration requires communicating a d-dimensional vector for every worker; since there are q workers and the algorithm runs for T iterations, the communication load is Tqd.

The computational load per worker at each iteration involves the following numbers of operations (these quantities, together with T and the communication load, are tallied in the short script after the list):

  • Sketching AA: mndmnd multiplications

  • Computing H~t,k\tilde{H}_{t,k}: md2md^{2} multiplications

  • Computing gtg_{t}: O(nd)O(nd) operations

  • Solving H~t,k1gt\tilde{H}_{t,k}^{-1}g_{t}: O(d3)O(d^{3}) operations.
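The iteration count T, the total communication load Tqd, and the per-worker operation counts above can be tallied with a short script. This is a back-of-the-envelope calculator under the Gaussian-sketch expressions for \theta_{1} and \theta_{2} (see Lemma A.3); the problem sizes are hypothetical and constants hidden by the O(\cdot) notation are ignored.

```python
import numpy as np

# Back-of-the-envelope calculator for the quantities derived above; the problem
# sizes are hypothetical and the Gaussian-sketch formulas for theta_1, theta_2
# from Lemma A.3 are assumed.
n, d, m, q, eps = 100_000, 500, 4_000, 20, 1e-6

theta1 = m / (m - d - 1)
theta2 = m**2 * (m - 1) / ((m - d) * (m - d - 1) * (m - d - 3))

T = np.log(1 / eps) / (np.log(q) - np.log(theta2 / theta1**2 - 1))
print("iterations T             :", T)
print("communication load T*q*d :", T * q * d)
print("per iteration, per worker:")
print("  sketching A      ~", m * n * d, "multiplications")
print("  computing H~_t,k ~", m * d**2, "multiplications")
print("  computing g_t    ~ O(n d)  =", n * d)
print("  solving H~^{-1} g_t ~ O(d^3) =", d**3)
```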

Proof:

In the following, we assume that we are in the regime where nn approaches infinity.

The expectation term \operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}] is equal to the identity matrix times a scalar (i.e., cI_{d}) because it is signed permutation invariant, which we show as follows. Let P\in\mathbb{R}^{d\times d} be a permutation matrix and D\in\mathbb{R}^{d\times d} be a diagonal sign matrix (with \pm 1 entries on the diagonal). A matrix M is signed permutation invariant if (DP)M(DP)^{T}=M for every such DP. We note that every signed permutation matrix is orthogonal: (DP)^{T}(DP)=P^{T}D^{T}DP=P^{T}P=I_{d}, a fact we use below.

(DP)𝔼S[(UTSTSU+λ2I)1](DP)T\displaystyle(DP)\operatorname{\mathbb{E}}_{S}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}](DP)^{T} =𝔼S[(DP)(UTSTSU+λ2I)1(DP)T]\displaystyle=\operatorname{\mathbb{E}}_{S}[(DP)(U^{T}S^{T}SU+\lambda_{2}I)^{-1}(DP)^{T}]
=𝔼S[((DP)TUTSTSU(DP)+λ2I)1]\displaystyle=\operatorname{\mathbb{E}}_{S}[((DP)^{T}U^{T}S^{T}SU(DP)+\lambda_{2}I)^{-1}]
=𝔼SUPD[𝔼S[((DP)TUTSTSU(DP)+λ2I)1|SUPD]]\displaystyle=\operatorname{\mathbb{E}}_{SUPD}[\operatorname{\mathbb{E}}_{S}[((DP)^{T}U^{T}S^{T}SU(DP)+\lambda_{2}I)^{-1}|SUPD]]
=𝔼SUPD[((DP)TUTSTSU(DP)+λ2I)1]\displaystyle=\operatorname{\mathbb{E}}_{SUPD}[((DP)^{T}U^{T}S^{T}SU(DP)+\lambda_{2}I)^{-1}]
=𝔼SU[(UTSTSU+λ2I)1]\displaystyle=\operatorname{\mathbb{E}}_{SU^{\prime}}[({U^{\prime}}^{T}S^{T}SU^{\prime}+\lambda_{2}I)^{-1}]

where we made the variable change U^{\prime}=UDP and note that U^{\prime} has orthonormal columns because DP is an orthogonal transformation. Moreover, SU^{\prime} and SU have the same distribution because DP is an orthogonal transformation and S is a Gaussian matrix. This shows that \operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}] is signed permutation invariant.

Now that we established that 𝔼[(UTSTSU+λ2I)1]\operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}] is equal to the identity matrix times a scalar, we move on to find the value of the scalar. We use the identity 𝔼DP[(DP)Q(DP)T]=trQdId\operatorname{\mathbb{E}}_{DP}[(DP)Q(DP)^{T}]=\frac{\operatorname{tr}Q}{d}I_{d} for Qd×dQ\in\mathbb{R}^{d\times d} where the diagonal entries of DD are sampled from the Rademacher distribution and PP is sampled uniformly from the set of all possible permutation matrices. We already established that 𝔼[(UTSTSU+λ2I)1]\operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}] is equal to (DP)𝔼S[(UTSTSU+λ2I)1](DP)T(DP)\operatorname{\mathbb{E}}_{S}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}](DP)^{T} for any signed permutation matrix of the form DPDP. It follows that

𝔼[(UTSTSU+λ2I)1]\displaystyle\operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}] =(DP)𝔼S[(UTSTSU+λ2I)1](DP)T\displaystyle=(DP)\operatorname{\mathbb{E}}_{S}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}](DP)^{T}
=1|R|DPR(DP)𝔼S[(UTSTSU+λ2I)1](DP)T\displaystyle=\frac{1}{|R|}\sum_{DP\in R}(DP)\operatorname{\mathbb{E}}_{S}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}](DP)^{T}
=𝔼DP[(DP)𝔼S[(UTSTSU+λ2I)1](DP)T]\displaystyle=\operatorname{\mathbb{E}}_{DP}[(DP)\operatorname{\mathbb{E}}_{S}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}](DP)^{T}]
=1dtr(𝔼S[(UTSTSU+λ2I)1])Id\displaystyle=\frac{1}{d}\operatorname{tr}(\operatorname{\mathbb{E}}_{S}[(U^{T}S^{T}SU+\lambda_{2}I)^{-1}])I_{d}

where we define RR to be the set of all possible signed permutation matrices DPDP in going from line 1 to line 2.

By Lemma II.18, the trace term is equal to d×θ3(d/m,λ2)d\times\theta_{3}(d/m,\lambda_{2}), which concludes the proof. ∎
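The signed permutation invariance argument can also be checked numerically: for a Gaussian sketch, the Monte Carlo average of (U^{T}S^{T}SU+\lambda_{2}I)^{-1} should be close to a scalar multiple of the identity. The minimal sketch below (hypothetical dimensions and \lambda_{2}) uses the fact that SU has i.i.d. Gaussian entries whenever U has orthonormal columns.

```python
import numpy as np

# Empirical check that E[(U^T S^T S U + lam2*I)^{-1}] is (close to) c*I_d for a
# Gaussian sketch S; dimensions and lam2 below are hypothetical.
rng = np.random.default_rng(2)
d, m, lam2, trials = 10, 100, 0.5, 2000

acc = np.zeros((d, d))
for _ in range(trials):
    # SU has i.i.d. N(0, 1/m) entries since U has orthonormal columns.
    SU = rng.standard_normal((m, d)) / np.sqrt(m)
    acc += np.linalg.inv(SU.T @ SU + lam2 * np.eye(d))
avg = acc / trials

print("mean diagonal entry:", np.mean(np.diag(avg)))
print("max |off-diagonal| :", np.abs(avg - np.diag(np.diag(avg))).max())
```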

Proof:

Closed form expressions for the optimal solution and the output of the kk’th worker are as follows:

x\displaystyle x^{*} =(ATA+λ1Id)1ATb,\displaystyle=(A^{T}A+\lambda_{1}I_{d})^{-1}A^{T}b,
x^k\displaystyle\hat{x}_{k} =(ATSkTSkA+λ2Id)1ATSkTSkb.\displaystyle=(A^{T}S_{k}^{T}S_{k}A+\lambda_{2}I_{d})^{-1}A^{T}S_{k}^{T}S_{k}b.

Equivalently, xx^{*} can be written as:

x=argmin[Aλ1Id]x[b0d]22.\displaystyle x^{*}=\arg\min\left\|\begin{bmatrix}A\\ \sqrt{\lambda_{1}}I_{d}\end{bmatrix}x-\begin{bmatrix}b\\ 0_{d}\end{bmatrix}\right\|_{2}^{2}.

This allows us to decompose [b0d]\begin{bmatrix}b\\ 0_{d}\end{bmatrix} as

[b0d]=[Aλ1Id]x+b\displaystyle\begin{bmatrix}b\\ 0_{d}\end{bmatrix}=\begin{bmatrix}A\\ \sqrt{\lambda_{1}}I_{d}\end{bmatrix}x^{*}+b^{\perp}

where b=[b1b2]b^{\perp}=\begin{bmatrix}b_{1}^{\perp}\\ b_{2}^{\perp}\end{bmatrix} with b1nb_{1}^{\perp}\in\mathbb{R}^{n} and b2db_{2}^{\perp}\in\mathbb{R}^{d}. From the above equation we obtain b2=λ1xb^{\perp}_{2}=-\sqrt{\lambda_{1}}x^{*} and [ATλ1Id]b=ATb1+λ1b2=0\begin{bmatrix}A^{T}&\sqrt{\lambda_{1}}I_{d}\end{bmatrix}b^{\perp}=A^{T}b_{1}^{\perp}+\sqrt{\lambda_{1}}b_{2}^{\perp}=0.

The bias of x^k\hat{x}_{k} is given by (omitting the subscript kk in SkS_{k} for simplicity)

𝔼[A(x^kx)]\displaystyle\operatorname{\mathbb{E}}[A(\hat{x}_{k}-x^{*})] =𝔼[A(ATSTSA+λ2Id)1ATSTSbAx]\displaystyle=\operatorname{\mathbb{E}}[A(A^{T}S^{T}SA+\lambda_{2}I_{d})^{-1}A^{T}S^{T}Sb-Ax^{*}]
=𝔼[U(UTSTSU+λ2Σ2)1UTSTS(Ax+b1)]Ax\displaystyle=\operatorname{\mathbb{E}}[U(U^{T}S^{T}SU+\lambda_{2}\Sigma^{-2})^{-1}U^{T}S^{T}S(Ax^{*}+b_{1}^{\perp})]-Ax^{*}
=𝔼[λ2U(UTSTSU+λ2Σ2)1Σ1VTx]+𝔼[U(UTSTSU+λ2Σ2)1UTSTSb1].\displaystyle=\operatorname{\mathbb{E}}[-\lambda_{2}U(U^{T}S^{T}SU+\lambda_{2}\Sigma^{-2})^{-1}\Sigma^{-1}V^{T}x^{*}]+\operatorname{\mathbb{E}}[U(U^{T}S^{T}SU+\lambda_{2}\Sigma^{-2})^{-1}U^{T}S^{T}Sb_{1}^{\perp}]\,.

By the assumption Σ=σId\Sigma=\sigma I_{d}, the bias becomes

𝔼\displaystyle\operatorname{\mathbb{E}} [A(x^kx)]=𝔼[λ2σ1U(UTSTSU+λ2σ2Id)1VTx]+𝔼[U(UTSTSU+λ2σ2Id)1UTSTSb1].\displaystyle[A(\hat{x}_{k}-x^{*})]=\operatorname{\mathbb{E}}[-\lambda_{2}\sigma^{-1}U(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}V^{T}x^{*}]+\operatorname{\mathbb{E}}[U(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}U^{T}S^{T}Sb_{1}^{\perp}]\,. (116)

The first expectation term of (116) can be evaluated using Lemma II.19 (as nn goes to infinity):

𝔼[λ2σ1U\displaystyle\operatorname{\mathbb{E}}[-\lambda_{2}\sigma^{-1}U (UTSTSU+λ2Id)1VTx]=λ2σ1θ3(d/m,λ2σ2)UVTx.\displaystyle(U^{T}S^{T}SU+\lambda_{2}I_{d})^{-1}V^{T}x^{*}]=-\lambda_{2}\sigma^{-1}\theta_{3}(d/m,\lambda_{2}\sigma^{-2})UV^{T}x^{*}. (117)

To find the second expectation term in (116), let us first consider the full SVD of A given by A=\begin{bmatrix}U&U^{\perp}\end{bmatrix}\begin{bmatrix}\Sigma\\ 0_{(n-d)\times d}\end{bmatrix}V^{T} where U\in\mathbb{R}^{n\times d} and U^{\perp}\in\mathbb{R}^{n\times(n-d)}. The matrix \begin{bmatrix}U&U^{\perp}\end{bmatrix} is an orthogonal matrix, which implies UU^{T}+U^{\perp}(U^{\perp})^{T}=I_{n}. If we insert UU^{T}+U^{\perp}(U^{\perp})^{T}=I_{n} between S and b_{1}^{\perp}, the second term of (116) becomes

𝔼[U(UTSTSU+λ2σ2Id)1UTSTSb1]=\displaystyle\operatorname{\mathbb{E}}[U(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}U^{T}S^{T}Sb_{1}^{\perp}]=
=𝔼[U(UTSTSU+λ2σ2Id)1UTSTSUUTb1]+𝔼[U(UTSTSU+λ2σ2Id)1UTSTSU(U)Tb1]\displaystyle=\operatorname{\mathbb{E}}[U(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}U^{T}S^{T}SUU^{T}b_{1}^{\perp}]+\operatorname{\mathbb{E}}[U(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}U^{T}S^{T}SU^{\perp}(U^{\perp})^{T}b_{1}^{\perp}]
=𝔼[U(UTSTSU+λ2σ2Id)1UTSTSUUTb1]\displaystyle=\operatorname{\mathbb{E}}[U(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}U^{T}S^{T}SUU^{T}b_{1}^{\perp}]
=U(Idλ2σ2𝔼[(UTSTSU+λ2σ2Id)1])UTb1\displaystyle=U(I_{d}-\lambda_{2}\sigma^{-2}\operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}])U^{T}b_{1}^{\perp}
=(1λ2σ2θ3(d/m,λ2σ2))UUTb1.\displaystyle=(1-\lambda_{2}\sigma^{-2}\theta_{3}(d/m,\lambda_{2}\sigma^{-2}))UU^{T}b_{1}^{\perp}\,.

In these derivations, we used 𝔼S[U(UTSTSU+λ2σ2Id)1UTSTSU(U)Tb1]=𝔼SU[𝔼S[U(UTSTSU+λ2σ2Id)1UTSTSU(U)Tb1|SU]]=0\operatorname{\mathbb{E}}_{S}[U(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}U^{T}S^{T}SU^{\perp}(U^{\perp})^{T}b_{1}^{\perp}]=\operatorname{\mathbb{E}}_{SU}[\operatorname{\mathbb{E}}_{S}[U(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}U^{T}S^{T}SU^{\perp}(U^{\perp})^{T}b_{1}^{\perp}|SU]]=0. This follows from 𝔼S[SU|SU]=0\operatorname{\mathbb{E}}_{S}[SU^{\perp}|SU]=0 as UU and UU^{\perp} are orthogonal. The last line follows from Lemma II.19, as nn goes to infinity.

We note that UTb1=λ1Σ1VTxU^{T}b_{1}^{\perp}=\lambda_{1}\Sigma^{-1}V^{T}x^{*} and for Σ=σId\Sigma=\sigma I_{d}, this becomes UTb1=λ1σ1VTxU^{T}b_{1}^{\perp}=\lambda_{1}\sigma^{-1}V^{T}x^{*}. Bringing the pieces together, we have the bias equal to (as nn goes to infinity):

𝔼[A(x^kx)]\displaystyle\operatorname{\mathbb{E}}[A(\hat{x}_{k}-x^{*})] =λ2σ1θ3(d/m,λ2σ2)UVTx+λ1σ1(1λ2σ2θ3(d/m,λ2σ2))UVTx\displaystyle=-\lambda_{2}\sigma^{-1}\theta_{3}(d/m,\lambda_{2}\sigma^{-2})UV^{T}x^{*}+\lambda_{1}\sigma^{-1}(1-\lambda_{2}\sigma^{-2}\theta_{3}(d/m,\lambda_{2}\sigma^{-2}))UV^{T}x^{*}
=σ1(λ1λ2θ3(d/m,λ2σ2)(1+λ1σ2))UVTx.\displaystyle=\sigma^{-1}(\lambda_{1}-\lambda_{2}\theta_{3}(d/m,\lambda_{2}\sigma^{-2})(1+\lambda_{1}\sigma^{-2}))UV^{T}x^{*}.

If there is a value of λ2>0\lambda_{2}>0 that satisfies λ1λ2θ3(d/m,λ2σ2)(1+λ1σ2)=0\lambda_{1}-\lambda_{2}\theta_{3}(d/m,\lambda_{2}\sigma^{-2})(1+\lambda_{1}\sigma^{-2})=0, then that value of λ2\lambda_{2} makes x^k\hat{x}_{k} an unbiased estimator. Equivalently,

λ2σ2+dm1+\displaystyle-\lambda_{2}\sigma^{-2}+\frac{d}{m}-1+ (λ2σ2+dm1)2+4λ2σ2dm=2dmσ2λ11+λ1σ2,\displaystyle\sqrt{(-\lambda_{2}\sigma^{-2}+\frac{d}{m}-1)^{2}+4\lambda_{2}\sigma^{-2}\frac{d}{m}}=2\frac{d}{m\sigma^{2}}\frac{\lambda_{1}}{1+\lambda_{1}\sigma^{-2}}, (118)

where we note that the LHS is a monotonically increasing function of λ2\lambda_{2} in the regime λ20\lambda_{2}\geq 0 and it attains its minimum in this regime at λ2=0\lambda_{2}=0. Analyzing this equation using these observations, for the cases of m>dm>d and mdm\leq d separately, we find that for the case of mdm\leq d, we need the following to be satisfied for zero bias:

2dmσ2λ11+λ1/σ2\displaystyle 2\frac{d}{m\sigma^{2}}\frac{\lambda_{1}}{1+\lambda_{1}/\sigma^{2}} 2(dm1),or more simply,\displaystyle\geq 2\left(\frac{d}{m}-1\right),\,\mbox{or more simply,}
λ1\displaystyle\lambda_{1} σ2(dm1),\displaystyle\geq\sigma^{2}\left(\frac{d}{m}-1\right),

whereas there is no additional condition on λ1\lambda_{1} for the case of m>dm>d.

The value of λ2\lambda_{2} that will lead to zero bias can be computed by solving the equation (118) where the expression for the inverse of the left-hand side is given by LHS1(y)=yσ2d/md/m+1y1σ2LHS^{-1}(y)=\frac{y\sigma^{-2}d/m-d/m+1}{y^{-1}-\sigma^{-2}}. We evaluate the inverse at y=λ1/(1+λ1/σ2)y=\lambda_{1}/(1+\lambda_{1}/\sigma^{2}) and obtain the following expression for λ2\lambda_{2}^{*}:

λ2=λ1dmλ11+λ1/σ2.\displaystyle\lambda_{2}^{*}=\lambda_{1}-\frac{d}{m}\frac{\lambda_{1}}{1+\lambda_{1}/\sigma^{2}}.
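The re-calibrated parameter can be checked numerically: the bias coefficient \lambda_{1}-\lambda_{2}\theta_{3}(d/m,\lambda_{2}\sigma^{-2})(1+\lambda_{1}\sigma^{-2}) vanishes at \lambda_{2}=\lambda_{2}^{*} but not at the naive choice \lambda_{2}=\lambda_{1}. The sketch below uses the closed form of \theta_{3} implied by (118); all parameter values are hypothetical.

```python
import numpy as np

# Check that lambda_2^* zeroes the bias coefficient
# lambda_1 - lambda_2 * theta_3(d/m, lambda_2/sigma^2) * (1 + lambda_1/sigma^2).
# The closed form of theta_3 below is the one implied by (118);
# all parameter values are hypothetical.
def theta3(r, x):
    # r = d/m, x = lambda_2 / sigma^2
    return (-x + r - 1 + np.sqrt((-x + r - 1) ** 2 + 4 * x * r)) / (2 * x * r)

def bias_coefficient(lam1, lam2, sigma, d, m):
    return lam1 - lam2 * theta3(d / m, lam2 / sigma**2) * (1 + lam1 / sigma**2)

d, m, sigma, lam1 = 200, 500, 2.0, 1.5
lam2_star = lam1 - (d / m) * lam1 / (1 + lam1 / sigma**2)

print("bias coefficient at lambda_2 = lambda_1  :", bias_coefficient(lam1, lam1, sigma, d, m))
print("bias coefficient at lambda_2 = lambda_2* :", bias_coefficient(lam1, lam2_star, sigma, d, m))
```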

A-B Proofs of Theorems and Lemmas in Section III

The proof of the privacy result follows mainly from Theorem A.2 (stated below for completeness), which is a result from [39].

Proof:

For given \varepsilon, \delta and matrix A_{c}, if the smallest singular value of A_{c} satisfies \sigma_{\min}(A_{c})\geq w, then using Theorem A.2 we find that the sketch size m has to satisfy the following for (\varepsilon,\delta)-differential privacy:

m\displaystyle m 18ln(4/δ)((σmin2B21)11ε+1ln(4/δ)2ln(4/δ))2\displaystyle\leq\frac{1}{8\ln(4/\delta)}\left(\left(\frac{\sigma_{\min}^{2}}{B^{2}}-1\right)\frac{1}{\frac{1}{\varepsilon}+\frac{1}{\ln(4/\delta)}}-2\ln(4/\delta)\right)^{2}
=18β((σmin2B21)εβε+β2β)2,\displaystyle=\frac{1}{8\beta}\left(\left(\frac{\sigma_{\min}^{2}}{B^{2}}-1\right)\frac{\varepsilon\beta}{\varepsilon+\beta}-2\beta\right)^{2}, (119)

where we have set δ=4/eβ\delta=4/e^{\beta} in the second line. For the first line to follow from Theorem A.2, we also need the condition σmin2B23+2βε\frac{\sigma_{\min}^{2}}{B^{2}}\geq 3+2\frac{\beta}{\varepsilon} to be satisfied. Note that the rows of AcA_{c} have bounded 2\ell_{2}-norm of B0d+1B_{0}\sqrt{d+1}. We now substitute B=B0d+1B=B_{0}\sqrt{d+1} and σmin=σ0n\sigma_{\min}=\sigma_{0}\sqrt{n} to obtain the simplified condition

nd+1(3+2βε)B02σ02,\displaystyle\frac{n}{d+1}\geq\left(3+2\frac{\beta}{\varepsilon}\right)\frac{B_{0}^{2}}{\sigma_{0}^{2}},

where B0B_{0} and σ0\sigma_{0} are constants. Assuming this condition is satisfied, then we pick the sketch size mm as (A-B) which can also be simplified:

m=O(βn2(d+1)2ε2(ε+β)2).\displaystyle m=O\left(\beta\frac{n^{2}}{(d+1)^{2}}\frac{\varepsilon^{2}}{(\varepsilon+\beta)^{2}}\right).

Note that the above arguments are for the privacy of a single sketch (i.e., SkAcS_{k}A_{c}). In the distributed setting where the adversary can attack all of the sketched data S1Ac,,SqAcS_{1}A_{c},\dots,S_{q}A_{c}, we can consider all of the sketched data to be a single sketch with size mqmq. Based on this argument, we can pick the sketch size as

m=O(βqn2(d+1)2ε2(ε+β)2).\displaystyle m=O\left(\frac{\beta}{q}\frac{n^{2}}{(d+1)^{2}}\frac{\varepsilon^{2}}{(\varepsilon+\beta)^{2}}\right).
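A short calculator for the sketch size bound is given below (a sketch under the stated assumptions; it evaluates the exact expression (119) rather than the O(\cdot) form, and the values of \varepsilon, \delta, \sigma_{\min}, B and q are hypothetical placeholders).

```python
import numpy as np

# Calculator for the privacy-driven sketch size bound (119); the epsilon,
# delta, sigma_min, B and q values below are hypothetical placeholders.
def max_sketch_size(eps, delta, sigma_min, B, q=1):
    beta = np.log(4.0 / delta)                 # delta = 4 / e^beta
    ratio = sigma_min**2 / B**2
    if ratio < 3 + 2 * beta / eps:
        raise ValueError("condition sigma_min^2/B^2 >= 3 + 2*beta/eps fails")
    # largest single-sketch size allowed by (119)
    m_single = ((ratio - 1) / (1 / eps + 1 / beta) - 2 * beta) ** 2 / (8 * beta)
    # q sketches are treated as a single sketch of size m*q
    return m_single / q

print(max_sketch_size(eps=1.0, delta=1e-6, sigma_min=200.0, B=1.0, q=10))
```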

Theorem A.2 (Differential privacy for random projections [39])

Fix ε>0\varepsilon>0 and δ(0,1/e)\delta\in(0,1/e). Fix B>0B>0. Fix a positive integer mm and let ww be such that

w2=B2(1+1+εln(4/δ)ε(22mln(4/δ)+2ln(4/δ))).\displaystyle w^{2}=B^{2}\left(1+\frac{1+\frac{\varepsilon}{\ln(4/\delta)}}{\varepsilon}\left(2\sqrt{2m\ln(4/\delta)}+2\ln(4/\delta)\right)\right). (120)

Let AA be an (n×d)(n\times d)-matrix with d<md<m and where each row of AA has bounded 2\ell_{2}-norm of BB. Given that σmin(A)w\sigma_{\min}(A)\geq w, the algorithm that picks an (m×n)(m\times n)-matrix RR whose entries are iid samples from the normal distribution 𝒩(0,1)\mathcal{N}(0,1) and publishes the projection RARA is (ε,δ)(\varepsilon,\delta)-differentially private.

A-C Proofs of Theorems and Lemmas in Section IV

Proof:

The expectation of the difference between the costs f(x¯)f(\bar{x}) and f(x)f(x^{*}) is given by

𝔼[f(x¯)]f(x)\displaystyle\operatorname{\mathbb{E}}[f(\bar{x})]-f(x^{*}) =𝔼[A(x¯x)+Axb22]f(x)\displaystyle=\operatorname{\mathbb{E}}[\|A(\bar{x}-x^{*})+Ax^{*}-b\|_{2}^{2}]-f(x^{*})
=𝔼[A(x¯x)22+Axb22]f(x)\displaystyle=\operatorname{\mathbb{E}}[\|A(\bar{x}-x^{*})\|_{2}^{2}+\|Ax^{*}-b\|_{2}^{2}]-f(x^{*})
=𝔼[A(x¯x)22],\displaystyle=\operatorname{\mathbb{E}}[\|A(\bar{x}-x^{*})\|_{2}^{2}]\,, (121)

where we have used the orthogonality property of the optimal least squares solution xx^{*} given by the normal equations AT(Axb)=0A^{T}(Ax^{*}-b)=0. Next, we have

𝔼[A(x¯x)22]\displaystyle\operatorname{\mathbb{E}}[\|A(\bar{x}-x^{*})\|_{2}^{2}] =𝔼[1qk=1q(Ax^kAx)22]\displaystyle=\operatorname{\mathbb{E}}\left[\left\|\frac{1}{q}\sum_{k=1}^{q}(A\hat{x}_{k}-Ax^{*})\right\|_{2}^{2}\right]
=1q2𝔼[k=1ql=1qAx^kAx,Ax^lAx]\displaystyle=\frac{1}{q^{2}}\operatorname{\mathbb{E}}\left[\sum_{k=1}^{q}\sum_{l=1}^{q}\langle A\hat{x}_{k}-Ax^{*},A\hat{x}_{l}-Ax^{*}\rangle\right]
=1q2k=1q𝔼[Ax^kAx22]+1q2kl,1k,lq𝔼[Ax^kAx,Ax^lAx]\displaystyle=\frac{1}{q^{2}}\sum_{k=1}^{q}\operatorname{\mathbb{E}}\left[\|A\hat{x}_{k}-Ax^{*}\|_{2}^{2}\right]+\frac{1}{q^{2}}\sum_{k\neq l\,,1\leq k,l\leq q}\operatorname{\mathbb{E}}\left[\langle A\hat{x}_{k}-Ax^{*},A\hat{x}_{l}-Ax^{*}\rangle\right]
=1q𝔼[Ax^Ax22]+q1q𝔼[Ax^]Ax22.\displaystyle=\frac{1}{q}\operatorname{\mathbb{E}}\left[\|A\hat{x}-Ax^{*}\|_{2}^{2}\right]+\frac{q-1}{q}\|\operatorname{\mathbb{E}}[A\hat{x}]-Ax^{*}\|_{2}^{2}\,.

Proof:

The bias of the single sketch estimator can be expanded as follows:

𝔼[Ax^]Ax2\displaystyle\|\operatorname{\mathbb{E}}[A\hat{x}]-Ax^{*}\|_{2} =𝔼[A(ATSTSA)1ATSTS(Ax+b)]Ax2\displaystyle=\|\operatorname{\mathbb{E}}[A(A^{T}S^{T}SA)^{-1}A^{T}S^{T}S(Ax^{*}+b^{\perp})]-Ax^{*}\|_{2}
\displaystyle=\|U\operatorname{\mathbb{E}}[(U^{T}S^{T}SU)^{-1}U^{T}S^{T}Sb^{\perp}]\|_{2}
\displaystyle=\|\operatorname{\mathbb{E}}[(U^{T}S^{T}SU)^{-1}U^{T}S^{T}Sb^{\perp}]\|_{2}
=𝔼[Qz]2,\displaystyle=\|\operatorname{\mathbb{E}}[Qz]\|_{2},

where we define Q:=(UTSTSU)1Q:=(U^{T}S^{T}SU)^{-1} and z:=UTSTSbz:=U^{T}S^{T}Sb^{\perp}. The term 𝔼[Qz]22\|\operatorname{\mathbb{E}}[Qz]\|_{2}^{2} can be upper bounded as follows when conditioned on the event EE:

𝔼[Qz]22\displaystyle\|\operatorname{\mathbb{E}}[Qz]\|_{2}^{2} =𝔼[Qz]T𝔼[Qz]=𝔼S[Qz]T𝔼S[Qz]\displaystyle=\operatorname{\mathbb{E}}[Qz]^{T}\operatorname{\mathbb{E}}[Qz]=\operatorname{\mathbb{E}}_{S}[Qz]^{T}\operatorname{\mathbb{E}}_{S^{\prime}}[Q^{\prime}z^{\prime}]
=𝔼Sk𝔼Sk[zTQQz]\displaystyle=\operatorname{\mathbb{E}}_{S_{k}}\operatorname{\mathbb{E}}_{S_{k}^{\prime}}[z^{T}QQ^{\prime}z^{\prime}]
=12𝔼S𝔼S[(z+z)TQQ(z+z)zTQQzzTQQz]\displaystyle=\frac{1}{2}\operatorname{\mathbb{E}}_{S}\operatorname{\mathbb{E}}_{S^{\prime}}[(z+z^{\prime})^{T}QQ^{\prime}(z+z^{\prime})-z^{T}QQ^{\prime}z-{z^{\prime}}^{T}QQ^{\prime}z^{\prime}]
12𝔼S𝔼S[z+z22(1+ϵ)2(z22+z22)(1ϵ)2]\displaystyle\leq\frac{1}{2}\operatorname{\mathbb{E}}_{S}\operatorname{\mathbb{E}}_{S^{\prime}}[\|z+z^{\prime}\|_{2}^{2}(1+\epsilon)^{2}-(\|z\|_{2}^{2}+\|z^{\prime}\|_{2}^{2})(1-\epsilon)^{2}]
=𝔼S𝔼S[(z222ϵ+z222ϵ+zTz(1+ϵ)2)]\displaystyle=\operatorname{\mathbb{E}}_{S}\operatorname{\mathbb{E}}_{S^{\prime}}[\left(\|z\|_{2}^{2}2\epsilon+\|z^{\prime}\|_{2}^{2}2\epsilon+z^{T}z^{\prime}(1+\epsilon)^{2}\right)]
=4ϵ𝔼[z22]+(1+ϵ)2𝔼[z]22,\displaystyle=4\epsilon\operatorname{\mathbb{E}}[\|z\|_{2}^{2}]+(1+\epsilon)^{2}\|\operatorname{\mathbb{E}}[z]\|_{2}^{2}\,,

where the inequality follows from the inequality (1ϵ)IdQ(1+ϵ)Id(1-\epsilon)I_{d}\preceq Q\preceq(1+\epsilon)I_{d} and some simple bounds for the minimum and maximum eigenvalues of the product of two positive definite matrices. Furthermore, the expectation of zz is equal to zero because 𝔼[z]=𝔼[UTSTSb]=UT𝔼[STS]b=UTb=0\operatorname{\mathbb{E}}[z]=\operatorname{\mathbb{E}}[U^{T}S^{T}Sb^{\perp}]=U^{T}\operatorname{\mathbb{E}}[S^{T}S]b^{\perp}=U^{T}b^{\perp}=0. Hence we obtain the claimed bound 𝔼[Ax^E]Ax24ϵ𝔼[z22|E].\|\operatorname{\mathbb{E}}[A\hat{x}|E]-Ax^{*}\|_{2}\leq\sqrt{4\epsilon\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E]}.

Proof:

For the randomized Hadamard sketch (ROS), the term 𝔼[z22]\operatorname{\mathbb{E}}[\|z\|_{2}^{2}] can be expanded as follows. We will assume that all the expectations are conditioned on the event EE, which we defined earlier as (1ϵ)IdQ(1+ϵ)Id(1-\epsilon)I_{d}\preceq Q\preceq(1+\epsilon)I_{d}.

𝔼[z22]\displaystyle\operatorname{\mathbb{E}}[\|z\|_{2}^{2}] =𝔼[bT1mi=1msisiTUUT1mj=1msjsjTb]\displaystyle=\operatorname{\mathbb{E}}\left[{b^{\perp}}^{T}\frac{1}{m}\sum_{i=1}^{m}s_{i}s_{i}^{T}UU^{T}\frac{1}{m}\sum_{j=1}^{m}s_{j}s_{j}^{T}b^{\perp}\right]
=𝔼[1m2i=1mj=1mbTsisiTUUTsjsjTb]\displaystyle=\operatorname{\mathbb{E}}\left[\frac{1}{m^{2}}\sum_{i=1}^{m}\sum_{j=1}^{m}{b^{\perp}}^{T}s_{i}s_{i}^{T}UU^{T}s_{j}s_{j}^{T}b^{\perp}\right]
=1m21i=jmbT𝔼[sisiTUUTsjsjT]b+1m2ij, 1i,jmbT𝔼[sisiT]UUT𝔼[sjsjT]b\displaystyle=\frac{1}{m^{2}}\sum_{1\leq i=j\leq m}{b^{\perp}}^{T}\operatorname{\mathbb{E}}\left[s_{i}s_{i}^{T}UU^{T}s_{j}s_{j}^{T}\right]b^{\perp}+\frac{1}{m^{2}}\sum_{i\neq j,\,1\leq i,j\leq m}{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{i}s_{i}^{T}]UU^{T}\operatorname{\mathbb{E}}[s_{j}s_{j}^{T}]b^{\perp}
=mm2bT𝔼[s1s1TUUTs1s1T]b+1m2ij, 1i,jmbTInUUTInb\displaystyle=\frac{m}{m^{2}}\,{b^{\perp}}^{T}\operatorname{\mathbb{E}}\left[s_{1}s_{1}^{T}UU^{T}s_{1}s_{1}^{T}\right]b^{\perp}+\frac{1}{m^{2}}\sum_{i\neq j,\,1\leq i,j\leq m}{b^{\perp}}^{T}I_{n}UU^{T}I_{n}b^{\perp}
=1mbT𝔼[s1s1TUUTs1s1T]b,\displaystyle=\frac{1}{m}\,{b^{\perp}}^{T}\operatorname{\mathbb{E}}\left[s_{1}s_{1}^{T}UU^{T}s_{1}s_{1}^{T}\right]b^{\perp},

where we have used the independence of sis_{i} and sjs_{j}, iji\neq j. This is true because of the assumption that the matrix PP corresponds to sampling with replacement.

bT𝔼[s1s1TUUTs1s1T]b\displaystyle{b^{\perp}}^{T}\operatorname{\mathbb{E}}\left[s_{1}s_{1}^{T}UU^{T}s_{1}s_{1}^{T}\right]b^{\perp} =𝔼[(s1TUUTs1)(s1TbbTs1)]\displaystyle=\operatorname{\mathbb{E}}[(s_{1}^{T}UU^{T}s_{1})(s_{1}^{T}b^{\perp}{b^{\perp}}^{T}s_{1})]
=𝔼[(s1TUUTs1)(bTs1)2]\displaystyle=\operatorname{\mathbb{E}}[(s_{1}^{T}UU^{T}s_{1})({b^{\perp}}^{T}s_{1})^{2}]
=1ni=1n𝔼[(hiTDUUTDhi)(bTDhi)2],\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\operatorname{\mathbb{E}}[(h_{i}^{T}DUU^{T}Dh_{i})({b^{\perp}}^{T}Dh_{i})^{2}],

where the row vector hiTh_{i}^{T} corresponds to the ii’th row of the Hadamard matrix HH. We also note that the expectation in the last line is with respect to the randomness of DD.

Let us define rr to be the column vector containing the diagonal entries of the diagonal matrix DD, that is, r:=[D11,D22,,Dnn]Tr:=[D_{11},D_{22},\dots,D_{nn}]^{T}. Then, the vector DhiDh_{i} is equivalent to diag(hi)r\operatorname{diag}(h_{i})r where diag(hi)\operatorname{diag}(h_{i}) is the diagonal matrix with the entries of hih_{i} on its diagonal.

1ni=1n𝔼[(hiTDUUTDhi)(bTDhi)2]\displaystyle\frac{1}{n}\sum_{i=1}^{n}\operatorname{\mathbb{E}}[(h_{i}^{T}DUU^{T}Dh_{i})({b^{\perp}}^{T}Dh_{i})^{2}] =1ni=1n𝔼[(rTdiag(hi)UUTdiag(hi)r)(bTdiag(hi)r)2]\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\operatorname{\mathbb{E}}[(r^{T}\operatorname{diag}(h_{i})UU^{T}\operatorname{diag}(h_{i})r)({b^{\perp}}^{T}\operatorname{diag}(h_{i})r)^{2}]
=1ni=1n𝔼[bTdiag(hi)r(rTPr)rTdiag(hi)b]\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\operatorname{\mathbb{E}}[{b^{\perp}}^{T}\operatorname{diag}(h_{i})r(r^{T}Pr)r^{T}\operatorname{diag}(h_{i})b^{\perp}]
=1ni=1nbTdiag(hi)𝔼[r(rTPr)rT]diag(hi)b,\displaystyle=\frac{1}{n}\sum_{i=1}^{n}{b^{\perp}}^{T}\operatorname{diag}(h_{i})\operatorname{\mathbb{E}}[r(r^{T}Pr)r^{T}]\operatorname{diag}(h_{i})b^{\perp},

where we have defined P:=diag(hi)UUTdiag(hi)P:=\operatorname{diag}(h_{i})UU^{T}\operatorname{diag}(h_{i}). It follows that 𝔼[r(rTPr)rT]=2P2diag(P)+tr(P)In\operatorname{\mathbb{E}}[r(r^{T}Pr)r^{T}]=2P-2\operatorname{diag}(P)+\operatorname{tr}(P)I_{n}. Here, diag(P)\operatorname{diag}(P) is used to refer to the diagonal matrix with the diagonal entries of PP as its diagonal. The trace of PP can be easily computed using the cyclic property of matrix trace as tr(P)=tr(diag(hi)UUTdiag(hi))=tr(UTdiag(hi)diag(hi)U)=tr(UTU)=tr(Id)=d\operatorname{tr}(P)=\operatorname{tr}(\operatorname{diag}(h_{i})UU^{T}\operatorname{diag}(h_{i}))=\operatorname{tr}(U^{T}\operatorname{diag}(h_{i})\operatorname{diag}(h_{i})U)=\operatorname{tr}(U^{T}U)=\operatorname{tr}(I_{d})=d. Next, we note that the term diag(P)\operatorname{diag}(P) can be simplified as diag(P)jj=u~j22\operatorname{diag}(P)_{jj}=\|\tilde{u}_{j}\|_{2}^{2} where u~jT\tilde{u}_{j}^{T} is the jj’th row of UU. This leads to

bTdiag(P)b\displaystyle{b^{\perp}}^{T}\operatorname{diag}(P)b^{\perp} =j=1n(bj)2u~j22j=1n(bj)2miniu~i22=b22miniu~i22.\displaystyle=\sum_{j=1}^{n}(b_{j}^{\perp})^{2}\|\tilde{u}_{j}\|_{2}^{2}\geq\sum_{j=1}^{n}(b_{j}^{\perp})^{2}\min_{i}\|\tilde{u}_{i}\|_{2}^{2}=\|b^{\perp}\|_{2}^{2}\min_{i}\|\tilde{u}_{i}\|_{2}^{2}\,.

Going back to 𝔼[z22]\operatorname{\mathbb{E}}[\|z\|_{2}^{2}], we have

𝔼[z22]\displaystyle\operatorname{\mathbb{E}}[\|z\|_{2}^{2}] =1mnbTndiag(hi)(2P2diag(P)+tr(P)In)diag(hi)b\displaystyle=\frac{1}{mn}{b^{\perp}}^{T}n\operatorname{diag}(h_{i})\left(2P-2\operatorname{diag}(P)+\operatorname{tr}(P)I_{n}\right)\operatorname{diag}(h_{i})b^{\perp}
=dmb222mbTdiag(P)b\displaystyle=\frac{d}{m}\|b^{\perp}\|_{2}^{2}-\frac{2}{m}{b^{\perp}}^{T}\operatorname{diag}(P)b^{\perp}
dmb222mb22miniu~i22\displaystyle\leq\frac{d}{m}\|b^{\perp}\|_{2}^{2}-\frac{2}{m}\|b^{\perp}\|_{2}^{2}\min_{i}\|\tilde{u}_{i}\|_{2}^{2}
=1mb22(d2miniu~i22)=dm(12miniu~i22d)f(x).\displaystyle=\frac{1}{m}\|b^{\perp}\|_{2}^{2}(d-2\min_{i}\|\tilde{u}_{i}\|_{2}^{2})=\frac{d}{m}\left(1-\frac{2\min_{i}\|\tilde{u}_{i}\|_{2}^{2}}{d}\right)f(x^{*}).

Proof:

We will assume that all the expectations are conditioned on the event EE, which we defined earlier as (1ϵ)IdQ(1+ϵ)Id(1-\epsilon)I_{d}\preceq Q\preceq(1+\epsilon)I_{d}. For uniform sampling with replacement, we have

𝔼[z22]\displaystyle\operatorname{\mathbb{E}}[\|z\|_{2}^{2}] =1m2𝔼[bTi=1msisiTUUTj=1msjsjTb]\displaystyle=\frac{1}{m^{2}}\,\operatorname{\mathbb{E}}\left[{b^{\perp}}^{T}\sum_{i=1}^{m}s_{i}s_{i}^{T}UU^{T}\sum_{j=1}^{m}s_{j}s_{j}^{T}b^{\perp}\right]
=1m2𝔼[i=1mj=1mbTsisiTUUTsjsjTb]\displaystyle=\frac{1}{m^{2}}\,\operatorname{\mathbb{E}}\left[\sum_{i=1}^{m}\sum_{j=1}^{m}{b^{\perp}}^{T}s_{i}s_{i}^{T}UU^{T}s_{j}s_{j}^{T}b^{\perp}\right]
=1m21i=jmbT𝔼[sisiTUUTsjsjT]b+1m2ij, 1i,jmbT𝔼[sisiT]UUT𝔼[sjsjT]b\displaystyle=\frac{1}{m^{2}}\sum_{1\leq i=j\leq m}{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{i}s_{i}^{T}UU^{T}s_{j}s_{j}^{T}]b^{\perp}+\frac{1}{m^{2}}\sum_{i\neq j,\,1\leq i,j\leq m}{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{i}s_{i}^{T}]UU^{T}\operatorname{\mathbb{E}}[s_{j}s_{j}^{T}]b^{\perp}
=1mbT𝔼[s1s1TUUTs1s1T]b+1m2ij, 1i,jmbTInUUTInb\displaystyle=\frac{1}{m}\,{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{1}s_{1}^{T}UU^{T}s_{1}s_{1}^{T}]b^{\perp}+\frac{1}{m^{2}}\,\sum_{i\neq j,\,1\leq i,j\leq m}{b^{\perp}}^{T}I_{n}UU^{T}I_{n}b^{\perp}
=1mbT𝔼[s1s1TUUTs1s1T]b\displaystyle=\frac{1}{m}\,{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{1}s_{1}^{T}UU^{T}s_{1}s_{1}^{T}]b^{\perp}
=1mbTn21ni=1neieiTUUTeieiTb\displaystyle=\frac{1}{m}\,{b^{\perp}}^{T}n^{2}\frac{1}{n}\sum_{i=1}^{n}e_{i}e_{i}^{T}UU^{T}e_{i}e_{i}^{T}b^{\perp}
=nmi=1nbi2u~i22\displaystyle=\frac{n}{m}\,\sum_{i=1}^{n}{b_{i}^{\perp}}^{2}\|\tilde{u}_{i}\|_{2}^{2}
nmi=1nbi2maxju~j22=μdmf(x).\displaystyle\leq\frac{n}{m}\,\sum_{i=1}^{n}{b_{i}^{\perp}}^{2}\max_{j}\|\tilde{u}_{j}\|_{2}^{2}=\frac{\mu d}{m}f(x^{*})\,.

Next, for uniform sampling without replacement, the rows sis_{i} and sjs_{j} are not independent which can be seen by noting that given sis_{i}, we know that sjs_{j} will have its nonzero entry at a different place than sis_{i}. Hence, differently from uniform sampling with replacement, the following term will not be zero:

1m2ij, 1i,jmbT𝔼[sisiTUUTsjsjT]b\displaystyle\frac{1}{m^{2}}\sum_{i\neq j,\,1\leq i,j\leq m}{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{i}s_{i}^{T}UU^{T}s_{j}s_{j}^{T}]b^{\perp} =m2mm2bT𝔼[s1s1TUUTs2s2T]b\displaystyle=\frac{m^{2}-m}{m^{2}}{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{1}s_{1}^{T}UU^{T}s_{2}s_{2}^{T}]b^{\perp}
=m1mbTn21n2nij,1i,jneieiTUUTejejTb\displaystyle=\frac{m-1}{m}{b^{\perp}}^{T}n^{2}\frac{1}{n^{2}-n}\sum_{i\neq j,1\leq i,j\leq n}e_{i}e_{i}^{T}UU^{T}e_{j}e_{j}^{T}b^{\perp}
=m1mnn1bTij,1i,jneiu~iTu~jejTb\displaystyle=\frac{m-1}{m}\frac{n}{n-1}{b^{\perp}}^{T}\sum_{i\neq j,1\leq i,j\leq n}e_{i}\tilde{u}_{i}^{T}\tilde{u}_{j}e_{j}^{T}b^{\perp}
\displaystyle=\frac{m-1}{m}\frac{n}{n-1}{b^{\perp}}^{T}\big{(}UU^{T}-\operatorname{diag}(\|\tilde{u}_{i}\|_{2}^{2})\big{)}b^{\perp}
=m1mnn1i=1nbi2u~i22.\displaystyle=-\frac{m-1}{m}\frac{n}{n-1}\sum_{i=1}^{n}{b_{i}^{\perp}}^{2}\|\tilde{u}_{i}\|_{2}^{2}\,.

It follows that for uniform sampling without replacement, we obtain

𝔼[z22]\displaystyle\operatorname{\mathbb{E}}[\|z\|_{2}^{2}] =(nmm1mnn1)i=1nbi2u~i22\displaystyle=\left(\frac{n}{m}-\frac{m-1}{m}\frac{n}{n-1}\right)\sum_{i=1}^{n}{b_{i}^{\perp}}^{2}\|\tilde{u}_{i}\|_{2}^{2}
=nmnmn1i=1nbi2u~i22\displaystyle=\frac{n}{m}\frac{n-m}{n-1}\sum_{i=1}^{n}{b_{i}^{\perp}}^{2}\|\tilde{u}_{i}\|_{2}^{2}
nmnmn1f(x)maxiu~i22\displaystyle\leq\frac{n}{m}\frac{n-m}{n-1}f(x^{*})\max_{i}\|\tilde{u}_{i}\|_{2}^{2}
=μdmnmn1f(x).\displaystyle=\frac{\mu d}{m}\frac{n-m}{n-1}f(x^{*})\,.

Proof:

We consider leverage score sampling with replacement. The rows sis_{i}, sjs_{j} iji\neq j are independent because sampling is with replacement. We will assume that all the expectations are conditioned on the event EE, which we defined earlier as (1ϵ)IdQ(1+ϵ)Id(1-\epsilon)I_{d}\preceq Q\preceq(1+\epsilon)I_{d}. For leverage score sampling, the term 𝔼[z22]\operatorname{\mathbb{E}}[\|z\|_{2}^{2}] is upper bounded as follows:

𝔼[z22]\displaystyle\operatorname{\mathbb{E}}[\|z\|_{2}^{2}] =1m2𝔼[bTi=1msisiTUUTj=1msjsjTb]\displaystyle=\frac{1}{m^{2}}\,\operatorname{\mathbb{E}}\left[{b^{\perp}}^{T}\sum_{i=1}^{m}s_{i}s_{i}^{T}UU^{T}\sum_{j=1}^{m}s_{j}s_{j}^{T}b^{\perp}\right]
=1m2𝔼[i=1mj=1mbTsisiTUUTsjsjTb]\displaystyle=\frac{1}{m^{2}}\,\operatorname{\mathbb{E}}\left[\sum_{i=1}^{m}\sum_{j=1}^{m}{b^{\perp}}^{T}s_{i}s_{i}^{T}UU^{T}s_{j}s_{j}^{T}b^{\perp}\right]
=1m21i=jmbT𝔼[sisiTUUTsjsjT]b+1m2ij, 1i,jmbT𝔼[sisiT]UUT𝔼[sjsjT]b\displaystyle=\frac{1}{m^{2}}\sum_{1\leq i=j\leq m}{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{i}s_{i}^{T}UU^{T}s_{j}s_{j}^{T}]b^{\perp}+\frac{1}{m^{2}}\sum_{i\neq j,\,1\leq i,j\leq m}{b^{\perp}}^{T}\operatorname{\mathbb{E}}[s_{i}s_{i}^{T}]UU^{T}\operatorname{\mathbb{E}}[s_{j}s_{j}^{T}]b^{\perp}
=1mbTi=1niddieieiTUUTdieieiTb+m2mm2bTInUUTInb\displaystyle=\frac{1}{m}\,{b^{\perp}}^{T}\sum_{i=1}^{n}\frac{\ell_{i}}{d}\frac{d}{\ell_{i}}e_{i}e_{i}^{T}UU^{T}\frac{d}{\ell_{i}}e_{i}e_{i}^{T}b^{\perp}+\frac{m^{2}-m}{m^{2}}{b^{\perp}}^{T}I_{n}UU^{T}I_{n}b^{\perp}
=1mbTi=1ndiieieiTb\displaystyle=\frac{1}{m}\,{b^{\perp}}^{T}\sum_{i=1}^{n}\frac{d}{\ell_{i}}\ell_{i}e_{i}e_{i}^{T}b^{\perp}
=dmb22=dmf(x).\displaystyle=\frac{d}{m}\,\|b^{\perp}\|_{2}^{2}=\frac{d}{m}f(x^{*})\,.
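The exact identity \operatorname{\mathbb{E}}[\|z\|_{2}^{2}]=\frac{d}{m}f(x^{*}) for leverage score sampling can be verified empirically with the minimal sketch below (hypothetical sizes; the rows of S are sampled with replacement according to the leverage scores and rescaled by \sqrt{d/\ell_{i}}/\sqrt{m}).

```python
import numpy as np

# Empirical check of E[||z||^2] = (d/m) f(x*) for leverage score sampling with
# replacement (hypothetical sizes; z = U^T S^T S b_perp as in the proof).
rng = np.random.default_rng(4)
n, d, m, trials = 3000, 15, 150, 2000

A = rng.standard_normal((n, d)) * rng.uniform(0.2, 3.0, size=(n, 1))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)
U, _, _ = np.linalg.svd(A, full_matrices=False)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
b_perp = b - A @ x_star
f_star = b_perp @ b_perp

lev = np.sum(U**2, axis=1)            # leverage scores, summing to d
probs = lev / d
probs = probs / probs.sum()           # guard against floating point drift

acc = 0.0
for _ in range(trials):
    idx = rng.choice(n, size=m, p=probs)          # sample rows with replacement
    scale = np.sqrt(d / lev[idx]) / np.sqrt(m)    # rows of S: sqrt(d/l_i) e_i^T / sqrt(m)
    Sb = scale * b_perp[idx]
    SU = scale[:, None] * U[idx, :]
    z = SU.T @ Sb
    acc += z @ z

print("empirical E[||z||^2] :", acc / trials)
print("predicted (d/m) f(x*):", d / m * f_star)
```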

Proof:

We will begin by using the results from Table 5 of [3] for selecting the sketch size:

m\displaystyle m =O(d+log(n)ϵ2log(d/δ)) for rand. Hadamard sketch\displaystyle=O\Big{(}\frac{d+\log(n)}{\epsilon^{2}}\log(d/\delta)\Big{)}\mbox{ for rand. Hadamard sketch}
m\displaystyle m =O(μdϵ2log(d/δ)) for uniform sampling\displaystyle=O\Big{(}\frac{\mu d}{\epsilon^{2}}\log(d/\delta)\Big{)}\mbox{ for uniform sampling}
m\displaystyle m =O(dϵ2log(d/δ)) for leverage score sampling\displaystyle=O\Big{(}\frac{d}{\epsilon^{2}}\log(d/\delta)\Big{)}\mbox{ for leverage score sampling} (122)

where μ\mu is the row coherence of UU as defined before. When the sketch sizes are selected according to the formulas above, the subspace embedding property given by

UTSTSUId2ϵ\displaystyle\|U^{T}S^{T}SU-I_{d}\|_{2}\leq\epsilon (123)

is satisfied with probability at least 1δ1-\delta. Note that the subspace embedding property can be rewritten as

(1ϵ)IdUTSTSU(1+ϵ)Id.\displaystyle(1-\epsilon)I_{d}\preceq U^{T}S^{T}SU\preceq(1+\epsilon)I_{d}\,. (124)

This implies the following relation for the inverse matrix (UTSTSU)1(U^{T}S^{T}SU)^{-1}:

11+ϵId(UTSTSU)111ϵId.\displaystyle\frac{1}{1+\epsilon}I_{d}\preceq(U^{T}S^{T}SU)^{-1}\preceq\frac{1}{1-\epsilon}I_{d}\,. (125)

Observe that 1ϵ11+ϵ1-\epsilon\leq\frac{1}{1+\epsilon} for any ϵ\epsilon and that 11ϵ1+2ϵ\frac{1}{1-\epsilon}\leq 1+2\epsilon for 0ϵ0.50\leq\epsilon\leq 0.5. Assuming that 0ϵ0.50\leq\epsilon\leq 0.5, we have

(12ϵ)Id(UTSTSU)1(1+2ϵ)Id.\displaystyle(1-2\epsilon)I_{d}\preceq(U^{T}S^{T}SU)^{-1}\preceq(1+2\epsilon)I_{d}\,. (126)

We can rescale ϵϵ/2\epsilon\leftarrow\epsilon/2 so that

(1ϵ)Id(UTSTSU)1(1+ϵ)Id\displaystyle(1-\epsilon)I_{d}\preceq(U^{T}S^{T}SU)^{-1}\preceq(1+\epsilon)I_{d} (127)

and the effect of this rescaling will be hidden in the OO notation for the sketch size formulas.

We will define the events E_{k}, k=1,\dots,q as (1-\epsilon)I_{d}\preceq(U^{T}S_{k}^{T}S_{k}U)^{-1}\preceq(1+\epsilon)I_{d} and it follows that when the sketch sizes are selected according to (122), we will have

P(Ek)1δ,k=1,,q.\displaystyle P(E_{k})\geq 1-\delta\,,k=1,\dots,q\,. (128)

Now, we find a simpler expression for the error of the single-sketch estimator:

\displaystyle A\hat{x}-Ax^{*}=A(A^{T}S^{T}SA)^{-1}A^{T}S^{T}Sb-Ax^{*}
\displaystyle=A(A^{T}S^{T}SA)^{-1}A^{T}S^{T}S(b^{\perp}+Ax^{*})-Ax^{*}
\displaystyle=A(A^{T}S^{T}SA)^{-1}A^{T}S^{T}Sb^{\perp}
=U(UTSTSU)1UTSTSb\displaystyle=U(U^{T}S^{T}SU)^{-1}U^{T}S^{T}Sb^{\perp}
=UQz\displaystyle=UQz (129)

where we defined Q:=(UTSTSU)1Q:=(U^{T}S^{T}SU)^{-1}, and z:=UTSTSbz:=U^{T}S^{T}Sb^{\perp}. The variance term is equal to

𝔼[Ax^Ax22]\displaystyle\operatorname{\mathbb{E}}[\|A\hat{x}-Ax^{*}\|_{2}^{2}] =𝔼[UQz22]=𝔼[Qz22].\displaystyle=\operatorname{\mathbb{E}}[\|UQz\|_{2}^{2}]=\operatorname{\mathbb{E}}[\|Qz\|_{2}^{2}]\,. (130)

Conditioned on the event EE, we can bound the expectation 𝔼[Qz22|E]\operatorname{\mathbb{E}}[\|Qz\|_{2}^{2}|E] as follows:

𝔼[Qz22|E]\displaystyle\operatorname{\mathbb{E}}[\|Qz\|_{2}^{2}|E] =𝔼[zTQTQz|E]\displaystyle=\operatorname{\mathbb{E}}[z^{T}Q^{T}Qz|E]
𝔼[(1+ϵ)2z22|E]\displaystyle\leq\operatorname{\mathbb{E}}[(1+\epsilon)^{2}\|z\|_{2}^{2}|E]
=(1+ϵ)2𝔼[z22|E]\displaystyle=(1+\epsilon)^{2}\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E] (131)

where we have used QTQ(1+ϵ)2IdQ^{T}Q\preceq(1+\epsilon)^{2}I_{d}, which follows from all eigenvalues of QQ being less than (1+ϵ)(1+\epsilon) when conditioned on the event EE. Next, from Lemma IV.1, we obtain the following bound for the expected error

𝔼[f(x¯)f(x)|E1,,Eq]\displaystyle\operatorname{\mathbb{E}}[f(\bar{x})-f(x^{*})|E_{1},\dots,E_{q}] =1q𝔼[Ax^Ax22|E]+q1q𝔼[Ax^AxE]22\displaystyle=\frac{1}{q}\operatorname{\mathbb{E}}\left[\|A\hat{x}-Ax^{*}\|_{2}^{2}|E\right]+\frac{q-1}{q}\|\operatorname{\mathbb{E}}[A\hat{x}-Ax^{*}|E]\|_{2}^{2}
\displaystyle\leq\frac{1}{q}(1+\epsilon)^{2}\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E]+\frac{q-1}{q}4\epsilon\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E]\,. (132)

Using Markov’s inequality gives us the following probability bound:

P(f(x¯)f(x)γ|E1,,Eq)\displaystyle P(f(\bar{x})-f(x^{*})\geq\gamma|E_{1},\dots,E_{q}) 1γ𝔼[f(x¯)f(x)|E1,,Eq]1γq((1+ϵ)2+4ϵ(q1))𝔼[z22|E].\displaystyle\leq\frac{1}{\gamma}\operatorname{\mathbb{E}}[f(\bar{x})-f(x^{*})|E_{1},\dots,E_{q}]\leq\frac{1}{\gamma q}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E]\,. (133)

The unconditioned probability can be computed as

P(f(x¯)f(x)γ)\displaystyle P(f(\bar{x})-f(x^{*})\geq\gamma) =P(f(x¯)f(x)γ|E1,,Eq)P(k=1qEk)+P(f(x¯)f(x)γ|k=1qEkC)P(k=1qEkC)\displaystyle=P(f(\bar{x})-f(x^{*})\geq\gamma|E_{1},\dots,E_{q})P(\cap_{k=1}^{q}E_{k})+P(f(\bar{x})-f(x^{*})\geq\gamma|\cup_{k=1}^{q}E_{k}^{C})P(\cup_{k=1}^{q}E_{k}^{C})
1γq((1+ϵ)2+4ϵ(q1))𝔼[z22|E]+qδ\displaystyle\leq\frac{1}{\gamma q}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E]+q\delta (134)

where we have used P(k=1qEkC)qδP(\cup_{k=1}^{q}E_{k}^{C})\leq q\delta and P(k=1qEk)1P(\cap_{k=1}^{q}E_{k})\leq 1. We can obtain a bound for the relative error as follows:

P\displaystyle P (f(x¯)f(x)1+γf(x))1γq((1+ϵ)2+4ϵ(q1))𝔼[z22|E]+qδ.\displaystyle\Big{(}\frac{f(\bar{x})}{f(x^{*})}\geq 1+\frac{\gamma}{f(x^{*})}\Big{)}\leq\frac{1}{\gamma q}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E]+q\delta\,. (135)

Scaling γγf(x)\gamma\leftarrow\gamma f(x^{*}) leads to

P\displaystyle P (f(x¯)f(x)1+γ)1qδ1γf(x)q((1+ϵ)2+4ϵ(q1))𝔼[z22|E].\displaystyle\Big{(}\frac{f(\bar{x})}{f(x^{*})}\leq 1+\gamma\Big{)}\geq 1-q\delta-\frac{1}{\gamma f(x^{*})q}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E]\,. (136)

We now plug the bounds for \operatorname{\mathbb{E}}[\|z\|_{2}^{2}|E] from Lemmas IV.3, IV.4, and IV.5 into the above bound and obtain the following (a small numerical helper follows the list):

  • Randomized Hadamard sketch:

    P(f(x¯)f(x)1+γ)1qδdqγm((1+ϵ)2+4ϵ(q1)).\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}\leq 1+\gamma\Big{)}\geq 1-q\delta-\frac{d}{q\gamma m}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\,. (137)
  • Uniform sampling with replacement:

    P(f(x¯)f(x)1+γ)1qδμdqγm((1+ϵ)2+4ϵ(q1)).\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}\leq 1+\gamma\Big{)}\geq 1-q\delta-\frac{\mu d}{q\gamma m}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\,. (138)
  • Uniform sampling without replacement:

    P(f(x¯)f(x)1+γ)1qδμdqγmnmn1((1+ϵ)2+4ϵ(q1)).\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}\leq 1+\gamma\Big{)}\geq 1-q\delta-\frac{\mu d}{q\gamma m}\frac{n-m}{n-1}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}. (139)
  • Leverage score sampling:

    P(f(x¯)f(x)1+γ)1qδdqγm((1+ϵ)2+4ϵ(q1)).\displaystyle P\Big{(}\frac{f(\bar{x})}{f(x^{*})}\leq 1+\gamma\Big{)}\geq 1-q\delta-\frac{d}{q\gamma m}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}\,. (140)
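The four success-probability lower bounds above differ only in the coherence-dependent factor multiplying \frac{1}{q\gamma m}\big{(}(1+\epsilon)^{2}+4\epsilon(q-1)\big{)}; the small helper below evaluates them (all parameter values passed in are hypothetical).

```python
# Success-probability lower bounds from the four cases above (a small helper;
# all parameter values passed in are hypothetical).
def success_prob_lower_bound(sketch, n, d, m, q, gamma, eps, delta, mu=1.0):
    factor = {"hadamard": d,
              "uniform_with_replacement": mu * d,
              "uniform_without_replacement": mu * d * (n - m) / (n - 1),
              "leverage": d}[sketch]
    return 1 - q * delta - factor / (q * gamma * m) * ((1 + eps) ** 2 + 4 * eps * (q - 1))

print(success_prob_lower_bound("hadamard", n=10**6, d=100, m=10**4,
                               q=20, gamma=0.1, eps=0.05, delta=1e-3))
```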

A-D Proofs of Theorems and Lemmas in Section V

Proof:

The optimal update direction is given by

Δt=((Ht1/2)THt1/2)1gt=Ht1gt\displaystyle\Delta_{t}^{*}=((H_{t}^{1/2})^{T}H_{t}^{1/2})^{-1}g_{t}=H_{t}^{-1}g_{t}

and the estimate update direction due to a single sketch is given by

Δ^t,k=αs((Ht1/2)TSt,kTSt,kHt1/2)1gt.\displaystyle\hat{\Delta}_{t,k}=\alpha_{s}((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2})^{-1}g_{t}.

where αs\alpha_{s}\in\mathbb{R} is the step size scaling factor to be determined. Letting St,kS_{t,k} be a Gaussian sketch, the bias can be written as

𝔼[Ht1/2(Δ^t,kΔt)]\displaystyle\operatorname{\mathbb{E}}[H_{t}^{1/2}(\hat{\Delta}_{t,k}-\Delta_{t}^{*})] =𝔼[αsHt1/2((Ht1/2)TSt,kTSt,kHt1/2)1gtHt1/2Ht1gt]\displaystyle=\operatorname{\mathbb{E}}[\alpha_{s}H_{t}^{1/2}((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2})^{-1}g_{t}-H_{t}^{1/2}H_{t}^{-1}g_{t}]
=αsHt1/2𝔼[((Ht1/2)TSt,kTSt,kHt1/2)1]gtHt1/2Ht1gt\displaystyle=\alpha_{s}H_{t}^{1/2}\operatorname{\mathbb{E}}[((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2})^{-1}]g_{t}-H_{t}^{1/2}H_{t}^{-1}g_{t}
=αsθ1Ht1/2((Ht1/2)THt1/2)1gtHt1/2Ht1gt\displaystyle=\alpha_{s}\theta_{1}H_{t}^{1/2}((H_{t}^{1/2})^{T}H_{t}^{1/2})^{-1}g_{t}-H_{t}^{1/2}H_{t}^{-1}g_{t}
=(αsθ11)Ht1/2Ht1gt.\displaystyle=\left(\alpha_{s}\theta_{1}-1\right)H_{t}^{1/2}H_{t}^{-1}g_{t}\,.

In the third line, we plug in the mean of ((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2})^{-1}, which follows an inverse Wishart distribution (see Lemma A.3). This calculation shows that the single sketch estimator gives an unbiased update direction for \alpha_{s}=1/\theta_{1}.

The variance analysis is as follows:

𝔼[Ht1/2(Δ^t,kΔt)22]\displaystyle\operatorname{\mathbb{E}}[\|H_{t}^{1/2}(\hat{\Delta}_{t,k}-\Delta_{t}^{*})\|_{2}^{2}] =𝔼[Δ^t,kTHtΔ^t,k+ΔtTHtΔt2ΔtTHtΔ^t,k]\displaystyle=\operatorname{\mathbb{E}}[\hat{\Delta}_{t,k}^{T}H_{t}\hat{\Delta}_{t,k}+{\Delta_{t}^{*}}^{T}H_{t}\Delta_{t}^{*}-2{\Delta_{t}^{*}}^{T}H_{t}\hat{\Delta}_{t,k}]
=αs2gtT𝔼[((Ht1/2)TSt,kTSt,kHt1/2)1Ht((Ht1/2)TSt,kTSt,kHt1/2)1]gt+(12αsθ1)gtTHt1gt.\displaystyle=\alpha_{s}^{2}g_{t}^{T}\operatorname{\mathbb{E}}[((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2})^{-1}H_{t}((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2})^{-1}]g_{t}+\left(1-2\alpha_{s}\theta_{1}\right)g_{t}^{T}H_{t}^{-1}g_{t}.

Plugging Ht1/2=UΣVTH_{t}^{1/2}=U\Sigma V^{T} into the first term and assuming Ht1/2H_{t}^{1/2} has full column rank, the expectation term becomes

𝔼[((Ht1/2)TSt,kTSt,kHt1/2)1Ht((Ht1/2)TSt,kTSt,kHt1/2)1]\displaystyle\operatorname{\mathbb{E}}[((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2})^{-1}H_{t}((H_{t}^{1/2})^{T}S_{t,k}^{T}S_{t,k}H_{t}^{1/2})^{-1}] =VΣ1𝔼[(UTSt,kTSt,kU)2]Σ1VT\displaystyle=V\Sigma^{-1}\operatorname{\mathbb{E}}[(U^{T}S_{t,k}^{T}S_{t,k}U)^{-2}]\Sigma^{-1}V^{T}
=VΣ1(θ2Id)Σ1VT\displaystyle=V\Sigma^{-1}(\theta_{2}I_{d})\Sigma^{-1}V^{T}
=θ2VΣ2VT,\displaystyle=\theta_{2}V\Sigma^{-2}V^{T},

where the second line follows from Lemma A.3. Because H_{t}^{-1}=V\Sigma^{-2}V^{T}, the variance becomes:

𝔼[Ht1/2(Δ^t,kΔt)22]\displaystyle\operatorname{\mathbb{E}}[\|H_{t}^{1/2}(\hat{\Delta}_{t,k}-\Delta_{t}^{*})\|_{2}^{2}] =(αs2θ2+12αsθ1)gtTVΣ2VTgt=(αs2θ2+12αsθ1)Σ1VTgt22.\displaystyle=\left(\alpha_{s}^{2}\theta_{2}+1-2\alpha_{s}\theta_{1}\right)g_{t}^{T}V\Sigma^{-2}V^{T}g_{t}=\left(\alpha_{s}^{2}\theta_{2}+1-2\alpha_{s}\theta_{1}\right)\|\Sigma^{-1}V^{T}g_{t}\|_{2}^{2}\,.

It follows that the variance is minimized when αs\alpha_{s} is chosen as αs=θ1/θ2\alpha_{s}=\theta_{1}/\theta_{2}. ∎

Lemma A.3 ([40])

For the Gaussian sketch matrix Sm×nS\in\mathbb{R}^{m\times n} with i.i.d. entries distributed as 𝒩(0,1/m)\mathcal{N}(0,1/\sqrt{m}) where mdm\geq d, and for Un×dU\in\mathbb{R}^{n\times d} with UTU=IdU^{T}U=I_{d}, the following are true:

𝔼[(UTSTSU)1]\displaystyle\operatorname{\mathbb{E}}[(U^{T}S^{T}SU)^{-1}] =θ1Id,\displaystyle=\theta_{1}I_{d},
𝔼[(UTSTSU)2]\displaystyle\operatorname{\mathbb{E}}[(U^{T}S^{T}SU)^{-2}] =θ2Id,\displaystyle=\theta_{2}I_{d}, (141)

where θ1\theta_{1} and θ2\theta_{2} are defined as

θ1\displaystyle\theta_{1} mmd1,\displaystyle\coloneqq\frac{m}{m-d-1},
θ2\displaystyle\theta_{2} m2(m1)(md)(md1)(md3).\displaystyle\coloneqq\frac{m^{2}(m-1)}{(m-d)(m-d-1)(m-d-3)}. (142)
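The moments in Lemma A.3 can be verified by simulation, and they give the two step-size scalings used above: \alpha_{s}=1/\theta_{1} for an unbiased update and \alpha_{s}=\theta_{1}/\theta_{2} for minimum variance. The following minimal sketch (hypothetical m and d) uses the fact that SU has i.i.d. Gaussian entries when U has orthonormal columns.

```python
import numpy as np

# Monte Carlo check of the inverse Wishart moments in Lemma A.3 and the
# resulting step size scalings; m and d below are hypothetical.
rng = np.random.default_rng(5)
d, m, trials = 8, 80, 2000

inv1 = np.zeros((d, d))
inv2 = np.zeros((d, d))
for _ in range(trials):
    # SU has i.i.d. N(0, 1/m) entries when U has orthonormal columns.
    SU = rng.standard_normal((m, d)) / np.sqrt(m)
    Winv = np.linalg.inv(SU.T @ SU)
    inv1 += Winv
    inv2 += Winv @ Winv

theta1 = m / (m - d - 1)
theta2 = m**2 * (m - 1) / ((m - d) * (m - d - 1) * (m - d - 3))
print("empirical theta1:", np.mean(np.diag(inv1 / trials)), " closed form:", theta1)
print("empirical theta2:", np.mean(np.diag(inv2 / trials)), " closed form:", theta2)
print("unbiased step size 1/theta1         :", 1 / theta1)
print("min-variance step size theta1/theta2:", theta1 / theta2)
```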
Proof:

In the following, we omit the subscripts in St,kS_{t,k} for simplicity. Using the SVD decomposition of Ht1/2=UΣVTH_{t}^{1/2}=U\Sigma V^{T}, the bias can be written as

𝔼[Ht1/2(Δ^t,kΔt)]\displaystyle\operatorname{\mathbb{E}}[H_{t}^{1/2}(\hat{\Delta}_{t,k}-\Delta_{t}^{*})] =U𝔼[(UTSTSU+λ2Σ2)1]Σ1VTgtU(Id+λ1Σ2)1Σ1VTgt.\displaystyle=U\operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}\Sigma^{-2})^{-1}]\Sigma^{-1}V^{T}g_{t}-U(I_{d}+\lambda_{1}\Sigma^{-2})^{-1}\Sigma^{-1}V^{T}g_{t}\,.

By the assumption that Σ=σId\Sigma=\sigma I_{d}, the bias term can be simplified as

𝔼[Ht1/2\displaystyle\operatorname{\mathbb{E}}[H_{t}^{1/2} (Δ^t,kΔt)]=σ1U𝔼[(UTSTSU+λ2σ2Id)1]VTgtσ1(1+λ1σ2)1UVTgt.\displaystyle(\hat{\Delta}_{t,k}-\Delta_{t}^{*})]=\sigma^{-1}U\operatorname{\mathbb{E}}[(U^{T}S^{T}SU+\lambda_{2}\sigma^{-2}I_{d})^{-1}]V^{T}g_{t}-\sigma^{-1}(1+\lambda_{1}\sigma^{-2})^{-1}UV^{T}g_{t}.

By Lemma II.19, as nn goes to infinity, we have

𝔼[Ht1/2(Δ^t,kΔt)]\displaystyle\operatorname{\mathbb{E}}[H_{t}^{1/2}(\hat{\Delta}_{t,k}-\Delta_{t}^{*})] =σ1(θ3(d/m,λ2σ2)11+λ1σ2)UVTgt\displaystyle=\sigma^{-1}\left(\theta_{3}(d/m,\lambda_{2}\sigma^{-2})-\frac{1}{1+\lambda_{1}\sigma^{-2}}\right)UV^{T}g_{t}
=λ2σ2+dm1+(λ2σ2+dm1)2+4λ2σ2dm2λ2σ1d/mUVTgtσ11+λ1σ2UVTgt.\displaystyle=\frac{-\frac{\lambda_{2}}{\sigma^{2}}+\frac{d}{m}-1+\sqrt{(-\frac{\lambda_{2}}{\sigma^{2}}+\frac{d}{m}-1)^{2}+4\frac{\lambda_{2}}{\sigma^{2}}\frac{d}{m}}}{2\lambda_{2}\sigma^{-1}d/m}UV^{T}g_{t}-\frac{\sigma^{-1}}{1+\lambda_{1}\sigma^{-2}}UV^{T}g_{t}.

The bias becomes zero for the value of $\lambda_{2}$ that satisfies the following equation:

\displaystyle\sqrt{\left(-\sigma^{-2}+\frac{1}{\lambda_{2}}\left(\frac{d}{m}-1\right)\right)^{2}+4\sigma^{-2}\frac{d}{m\lambda_{2}}}-\sigma^{-2}+\frac{1}{\lambda_{2}}\left(\frac{d}{m}-1\right)=2\sigma^{-2}\frac{d}{m}\frac{1}{1+\lambda_{1}\sigma^{-2}}\,. (143)

In the regime $\lambda_{2}\geq 0$, the LHS of (143) is always non-negative, is monotonically decreasing in $\lambda_{2}$, and approaches zero as $\lambda_{2}\rightarrow\infty$. We now consider the following cases:

  • Case 1: $m\leq d$. Because $d/m-1\geq 0$, the LHS goes to infinity as $\lambda_{2}\rightarrow 0$. Since the LHS takes every value between $0$ and $\infty$, there is a value $\lambda_{2}^{*}$ that makes the bias zero for any $\lambda_{1}$.

  • Case 2: $m>d$. In this case, $d/m-1<0$. The maximum of the LHS is attained as $\lambda_{2}\rightarrow 0$ and equals $2\sigma^{-2}\frac{d}{m-d}$. As long as $2\sigma^{-2}\frac{d}{m}\frac{1}{1+\lambda_{1}\sigma^{-2}}\leq 2\sigma^{-2}\frac{d}{m-d}$, we can drive the bias to zero; this condition simplifies to $\lambda_{1}\sigma^{-2}\geq-d/m$, which always holds. Therefore, in the case $m>d$ as well, for any value of $\lambda_{1}$ there is a $\lambda_{2}^{*}$ that drives the bias to zero.

To sum up, for any given value of the regularization parameter $\lambda_{1}$, it is possible to find a value $\lambda_{2}^{*}$ that makes the sketched update direction unbiased. The optimal value of $\lambda_{2}$ is given by $\mathrm{LHS}^{-1}\left(2\sigma^{-2}\frac{d}{m}\frac{1}{1+\lambda_{1}\sigma^{-2}}\right)$, where $\mathrm{LHS}^{-1}(y)=\frac{4\sigma^{-2}y^{-1}d/m+2(d/m-1)}{y+2\sigma^{-2}}$, which simplifies to the following expression:

\displaystyle\lambda_{2}^{*}=\left(\lambda_{1}+\frac{d}{m}\sigma^{2}\right)\left(1-\frac{d/m}{1+\lambda_{1}\sigma^{-2}+d/m}\right). ∎
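As a sanity check, the brief sketch below (illustrative only; the chosen values of $\sigma$, $\lambda_{1}$, $d$, and $m$ are arbitrary) evaluates the closed-form $\lambda_{2}^{*}$ and verifies numerically that it satisfies the bias equation (143).

import numpy as np

# Numerical check that lambda2* satisfies the bias equation (143)
# when all singular values of H_t^{1/2} equal sigma.
def lhs(lam2, sigma, d, m):
    r = d / m
    a = -sigma ** -2 + (r - 1) / lam2
    return np.sqrt(a ** 2 + 4 * sigma ** -2 * r / lam2) + a

def lambda2_star(lam1, sigma, d, m):
    r = d / m
    return (lam1 + r * sigma ** 2) * (1 - r / (1 + lam1 * sigma ** -2 + r))

sigma, lam1, d, m = 2.0, 1.0, 30, 100   # arbitrary illustrative values
rhs = 2 * sigma ** -2 * (d / m) / (1 + lam1 * sigma ** -2)
lam2 = lambda2_star(lam1, sigma, d, m)
print("lambda2* =", lam2)
print("LHS(lambda2*) =", lhs(lam2, sigma, d, m), " RHS =", rhs)  # the two agree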

Appendix B Additional Numerical Results

In this section, we present additional experimental results.

B-A Scalability of the Serverless Implementation

Figure 10 shows the cost against time when we solve the problem given in (47) for large scale data on AWS Lambda using the distributed Newton sketch algorithm. The setting in this experiment is such that each worker node has access to a different subset of the data, and no additional sketching is applied. The dataset is randomly generated, and the goal here is to demonstrate the scalability of the algorithm and the serverless implementation. The size of the data matrix $A$ is 44 GB.


Figure 10: Cost approximation vs time when we solve the problem given in (47) for a large scale randomly generated dataset (44 GB) on AWS Lambda. Circles correspond to times at which the iterates $x_{t}$ are computed. Problem parameters are as follows: $n=200000$, $d=30000$, $\lambda_{1}=1$, $m=2000$, $q=100$, $\lambda=10$.

In the serverless computing implementation, we reuse the serverless functions during the course of the algorithm, meaning that the same $q=100$ functions are used in every iteration. Every iteration requires two rounds of communication with the master node: the first round communicates the local gradients, and the second round communicates the approximate update directions. The master node, also a serverless function, is likewise reused across iterations. Figure 10 illustrates that each iteration takes a different amount of time, and iteration times can be as short as 5 seconds. Some iterations take longer because of the straggler problem, a phenomenon commonly encountered in distributed computing: the iteration time is determined by the slowest of the $q=100$ nodes, and nodes often slow down for a variety of reasons. A possible remedy is to use error correcting codes to insert redundancy into the computation and hence avoid waiting for the outputs of all of the worker nodes [41, 42]. We identify implementing straggler mitigation for solving large scale problems via approximate second order optimization methods, such as distributed Newton sketch, as a promising direction.
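The following single-machine simulation (an illustrative sketch, not the AWS Lambda implementation; the ridge regression objective and all problem sizes are assumptions made for illustration) mimics the two communication rounds described above: the workers first send their local gradient contributions, the master aggregates them, and the workers then return approximate Newton directions computed from their local data blocks, which the master averages to update the iterate.

import numpy as np

# Single-machine simulation of the two-round communication pattern of
# distributed Newton sketch with partitioned data (illustration only).
# Objective assumed here: 0.5*||Ax - b||^2 + 0.5*lam*||x||^2.
rng = np.random.default_rng(0)
n, d, q, lam, alpha, iters = 2000, 50, 10, 1e-1, 1.0, 10

A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
blocks = np.array_split(np.arange(n), q)  # each "worker" holds a block of rows
x = np.zeros(d)

for it in range(iters):
    # Round 1: workers send local gradient contributions; master aggregates.
    g = sum(A[idx].T @ (A[idx] @ x - b[idx]) for idx in blocks) + lam * x

    # Round 2: each worker returns an approximate Newton direction based on
    # its rescaled local Hessian block; the master averages the q directions.
    dirs = []
    for idx in blocks:
        H_local = q * (A[idx].T @ A[idx]) + lam * np.eye(d)
        dirs.append(np.linalg.solve(H_local, g))
    x = x - alpha * np.mean(dirs, axis=0)

    cost = 0.5 * np.linalg.norm(A @ x - b) ** 2 + 0.5 * lam * np.linalg.norm(x) ** 2
    print(f"iteration {it}: cost = {cost:.4f}")

In an actual serverless deployment each row block would reside on a different worker, and only the $d$-dimensional gradients and update directions would be communicated, as described above.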

B-B Experiments on UCI Datasets

In the case of large datasets and limited worker node computing resources, such as memory and lifetime, most of the standard sketches are computationally too expensive, as discussed in the main body of the paper. This is why we limited the scope of the large scale experiments to uniform sampling, SJLT, and the hybrid sketch. In this section, we present additional experimental results on smaller datasets to empirically verify the theoretical results of the paper.

We present results on two UCI datasets in Figure 11, comparing the performance of the sketches discussed in the paper.


a) echocardiogram


b) oocytes merluccius nucleus

Figure 11: Approximation error against the number of averaged outputs in log-log scale for various sketching methods on two UCI datasets. All of the curves have been averaged over 25 independent trials and the vertical error bars show the standard error. The parameters are as follows. Plot a: $n=131$, $d=10$, $m=20$. Plot b: $n=1022$, $d=41$, $m=100$.

Figure 11 shows that the Gaussian and ROS sketches lead to unbiased estimators in these experiments, as the corresponding curves appear linear in the log-log scale plots. These experimental results suggest that the upper bound we derived for the bias of the ROS sketch may not be tight. The estimates for uniform sampling and leverage score sampling, in contrast, are biased. The observation that the Gaussian sketch estimator is unbiased in the experiments is consistent with our theoretical findings. Furthermore, we observe that the approximation error is highest for uniform sampling, which is also in agreement with the theoretical upper bounds that we have presented.
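For readers who wish to reproduce the qualitative behavior of Figure 11 on synthetic data, the small sketch below (illustrative only; the problem sizes and noise level are arbitrary, and the problem is taken to be unregularized least squares for simplicity) averages independent Gaussian-sketch solutions and tracks the error of the running average against the optimal solution; because the Gaussian sketch estimator is unbiased, the error keeps decreasing as more outputs are averaged.

import numpy as np

# Averaging independent Gaussian-sketch solutions of a least squares
# problem; the error of the running average decreases as more sketched
# solutions are averaged (illustration only, synthetic data).
rng = np.random.default_rng(0)
n, d, m, K = 1000, 20, 60, 256

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)  # optimal solution

running_sum = np.zeros(d)
for k in range(1, K + 1):
    S = rng.standard_normal((m, n)) / np.sqrt(m)          # Gaussian sketch
    x_k, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)   # sketched solution
    running_sum += x_k
    if k in (1, 4, 16, 64, 256):
        err = np.linalg.norm(running_sum / k - x_star) / np.linalg.norm(x_star)
        print(f"k = {k:3d}, relative error = {err:.4e}")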

References

  • [1] C.-C. Wang, K. L. Tan, C.-T. Chen, Y.-H. Lin, S. S. Keerthi, D. Mahajan, S. Sundararajan, and C.-J. Lin, “Distributed Newton methods for deep neural networks,” Neural Comput., vol. 30, no. 6, pp. 1673–1724, Jun. 2018. [Online]. Available: https://doi.org/10.1162/neco_a_01088
  • [2] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós, “Faster least squares approximation,” Numerische mathematik, vol. 117, no. 2, pp. 219–249, 2011.
  • [3] S. Wang, A. Gittens, and M. W. Mahoney, “Sketched ridge regression: Optimization perspective, statistical perspective, and model averaging,” J. Mach. Learn. Res., vol. 18, no. 1, pp. 8039–8088, Jan. 2017.
  • [4] M. Pilanci and M. J. Wainwright, “Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1842–1879, 2016.
  • [5] J. Lacotte and M. Pilanci, “Faster least squares optimization,” arXiv preprint, arXiv:1911.02675, 2019.
  • [6] M. W. Mahoney, “Randomized algorithms for matrices and data,” Foundations and Trends® in Machine Learning, vol. 3, no. 2, pp. 123–224, 2011.
  • [7] M. Pilanci and M. J. Wainwright, “Randomized sketches of convex programs with sharp guarantees,” IEEE Trans. Info. Theory, vol. 9, no. 61, pp. 5096–5115, September 2015.
  • [8] D. Bertsekas and J. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods.   Prentice-Hall, 1989.
  • [9] F. Niu, B. Recht, C. Re, and S. J. Wright, “Hogwild! a lock-free approach to parallelizing stochastic gradient descent,” in Proceedings of the 24th International Conference on Neural Information Processing Systems, ser. NIPS’11.   Red Hook, NY, USA: Curran Associates Inc., 2011, p. 693–701.
  • [10] S. Zhou, J. Lafferty, and L. Wasserman, “Compressed and privacy-sensitive sparse regression,” IEEE Transactions on Information Theory, vol. 55, no. 2, pp. 846–866, Feb 2009.
  • [11] M. Showkatbakhsh, C. Karakus, and S. Diggavi, “Privacy-utility trade-off of linear regression under random projections and additive noise,” in 2018 IEEE International Symposium on Information Theory (ISIT), June 2018, pp. 186–190.
  • [12] S. Zhou, J. Lafferty, and L. Wasserman, “Compressed regression,” in Neural Information Processing Systems, December 2007.
  • [13] J. Blocki, A. Blum, A. Datta, and O. Sheffet, “The Johnson-Lindenstrauss transform itself preserves differential privacy,” in 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, 2012, pp. 410–419.
  • [14] S. S. Vempala, The random projection method.   American Mathematical Soc., 2005, vol. 65.
  • [15] M. W. Mahoney, “Randomized algorithms for matrices and data,” Foundations and Trends® in Machine Learning, vol. 3, no. 2, pp. 123–224, 2011.
  • [16] D. P. Woodruff et al., “Sketching as a tool for numerical linear algebra,” Foundations and Trends® in Theoretical Computer Science, vol. 10, no. 1–2, pp. 1–157, 2014.
  • [17] P. Drineas and M. W. Mahoney, “RandNLA: randomized numerical linear algebra,” Communications of the ACM, vol. 59, no. 6, pp. 80–90, 2016.
  • [18] H. Avron, P. Maymounkov, and S. Toledo, “Blendenpik: Supercharging lapack’s least-squares solver,” SIAM Journal on Scientific Computing, vol. 32, no. 3, pp. 1217–1236, 2010.
  • [19] V. Rokhlin, A. Szlam, and M. Tygert, “A randomized algorithm for principal component analysis,” SIAM Journal on Matrix Analysis and Applications, vol. 31, no. 3, pp. 1100–1124, 2009.
  • [20] M. Pilanci and M. J. Wainwright, “Randomized sketches of convex programs with sharp guarantees,” IEEE Transactions on Information Theory, vol. 61, no. 9, pp. 5096–5115, 2015.
  • [21] ——, “Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence,” SIAM Journal on Optimization, vol. 27, no. 1, pp. 205–245, 2017.
  • [22] S. Wang, F. Roosta, P. Xu, and M. W. Mahoney, “Giant: Globally improved approximate newton method for distributed optimization,” in Advances in Neural Information Processing Systems 31.   Curran Associates, Inc., 2018, pp. 2332–2342.
  • [23] T. Sarlos, “Improved approximation algorithms for large matrices via random projections,” in Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on.   IEEE, 2006, pp. 143–152.
  • [24] J. Nelson and H. L. Nguyên, “Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings,” in Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on.   IEEE, 2013, pp. 117–126.
  • [25] G. Letac and H. Massam, “All invariant moments of the Wishart distribution,” Scandinavian Journal of Statistics, vol. 31, no. 2, pp. 295–318, 2004.
  • [26] D. C. Ahfock, W. J. Astle, and S. Richardson, “Statistical properties of sketching algorithms,” Biometrika, vol. 108, no. 2, pp. 283–297, 07 2020. [Online]. Available: https://doi.org/10.1093/biomet/asaa062
  • [27] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities - A Nonasymptotic Theory of Independence.   Oxford University Press, 2013.
  • [28] S. Sridhar, M. Pilanci, and A. Özgür, “Lower bounds and a near-optimal shrinkage estimator for least squares using random projections,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 3, pp. 660–668, 2020.
  • [29] S. Liu and E. Dobriban, “Ridge regression: Structure, cross-validation, and sketching,” 2019.
  • [30] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor, “Our data, ourselves: Privacy via distributed noise generation,” in Proceedings of the 24th Annual International Conference on The Theory and Applications of Cryptographic Techniques, ser. EUROCRYPT’06.   Berlin, Heidelberg: Springer-Verlag, 2006, p. 486–503. [Online]. Available: https://doi.org/10.1007/11761679_29
  • [31] O. Sheffet, “Private approximations of the 2nd-moment matrix using existing techniques in linear regression,” CoRR, vol. abs/1507.00056, 2015. [Online]. Available: http://arxiv.org/abs/1507.00056
  • [32] Z. Huang, S. Mitra, and N. Vaidya, “Differentially private distributed optimization,” in Proceedings of the 2015 International Conference on Distributed Computing and Networking, ser. ICDCN ’15.   New York, NY, USA: Association for Computing Machinery, 2015. [Online]. Available: https://doi.org/10.1145/2684464.2684480
  • [33] S. Boyd and L. Vandenberghe, Convex optimization.   Cambridge university press, 2004.
  • [34] E. Jonas, S. Venkataraman, I. Stoica, and B. Recht, “Occupy the cloud: Distributed computing for the 99%,” CoRR, vol. abs/1702.04024, 2017. [Online]. Available: http://arxiv.org/abs/1702.04024
  • [35] Bureau of Transportation Statistics, “Airline on-time statistics and delay causes,” http://stat-computing.org/dataexpo/2009/, 2018.
  • [36] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, “EMNIST: an extension of MNIST to handwritten letters,” CoRR, vol. abs/1702.05373, 2017. [Online]. Available: http://arxiv.org/abs/1702.05373
  • [37] N. E. Karoui, “Concentration of measure and spectra of random matrices: Applications to correlation matrices, elliptical distributions and beyond,” The Annals of Applied Probability, vol. 19, no. 6, pp. 2362–2405, 2009.
  • [38] A. A. Borovkov, Mathematical statistics.   Australia: Gordon and Breach Science Publishers, 1998.
  • [39] O. Sheffet, “Private approximations of the 2nd-moment matrix using existing techniques in linear regression,” CoRR, vol. abs/1507.00056, 2015. [Online]. Available: http://arxiv.org/abs/1507.00056
  • [40] J. Lacotte and M. Pilanci, “Faster least squares optimization,” arXiv preprint, arXiv:1911.02675, 2019.
  • [41] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, March 2018.
  • [42] B. Bartan and M. Pilanci, “Straggler resilient serverless computing based on polar codes,” in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2019, pp. 276–283.