
Communication-Efficient Distributed SGD with Compressed Sensing

Yujie Tang, Vikram Ramanathan, Junshan Zhang, Na Li

School of Engineering and Applied Sciences, Harvard University
School of Electrical, Computer and Energy Engineering, Arizona State University
Abstract

We consider large-scale distributed optimization over a set of edge devices connected to a central server, where the limited communication bandwidth between the server and the edge devices imposes a significant bottleneck on the optimization procedure. Inspired by recent advances in federated learning, we propose a distributed stochastic gradient descent (SGD) type algorithm that exploits the sparsity of the gradient, when possible, to reduce the communication burden. At the heart of the algorithm is the use of compressed sensing techniques: on the device side, each local stochastic gradient is compressed before being uploaded; on the server side, a sparse approximation of the global stochastic gradient is recovered from the noisy aggregation of the compressed local gradients. We analyze the convergence of our algorithm in the presence of noise perturbations incurred by the communication channels, and conduct numerical experiments that corroborate its effectiveness.

1 Introduction

Large-scale distributed stochastic optimization plays a fundamental role in recent advances in machine learning, allowing models of vast size to be trained on massive datasets by multiple machines. Meanwhile, the past few years have witnessed an explosive growth of networks of IoT devices such as smartphones, self-driving cars, robots, and unmanned aerial vehicles (UAVs), which are capable of data collection and processing for many learning tasks. In many of these applications, due to privacy concerns, it is preferable that the edge devices learn the model by cooperating with the central server without sending their own data to the server. Moreover, the communication between the edge devices and the server often goes through wireless channels, which are lossy and unreliable in nature and have limited bandwidth, imposing significant challenges, especially for high-dimensional problems.

To address the communication bottlenecks, researchers have investigated communication-efficient distributed optimization methods for large-scale problems, for both the device-server setting [1, 2] and the peer-to-peer setting [3, 4]. In this paper, we consider the device-server setting where a group of edge devices are coordinated by a central server.

Most existing techniques for the device-server setting fall into two categories. The first category aims to reduce the number of communication rounds, based on the idea that each edge device runs multiple local SGD steps in parallel before sending its local update to the server for aggregation. This approach is known as FedAvg [1] in federated learning, and its convergence has been studied in [5, 6, 7]. Another line of work investigates lazy/adaptive uploading of information, i.e., local gradients are uploaded only when they are found to be informative enough [8].

The second category focuses on efficient compression of gradient information transmitted from edge devices to the server. Commonly adopted compression techniques include quantization [9, 10, 11] and sparsification [12, 13, 14]. These techniques can be further classified according to whether the gradient compression yields biased [9, 14] or unbiased [10, 13] gradient estimators. To handle the bias and boost convergence, [12, 15] introduced the error feedback method that accumulates and corrects the error caused by gradient compression at each step.

Two recent papers [16, 17] employ sketching methods for gradient compression. Specifically, each device compresses its local stochastic gradient by count sketch [18] via a common sketching operator, and the server recovers the indices and values of the large entries of the aggregated stochastic gradient from the gradient sketches. However, the theoretical guarantees of count sketch were developed for recovering one fixed signal with a sketching operator drawn at random from a given probability distribution. During SGD the gradient signals are constantly changing, and generating a new sketching operator for every SGD iteration is impractical. These papers therefore apply a single sketching operator to all the gradients throughout the optimization procedure, sacrificing the theoretical guarantees. Further, the performance in the presence of transmission error/noise on the uploading links is not well understood.

Our Contributions. We propose a distributed SGD-type algorithm that employs compressed sensing for gradient compression. Specifically, we adopt compressed sensing techniques for the compression of local stochastic gradients at the device side and the reconstruction of the aggregated stochastic gradient at the server side. The use of compressed sensing enables the server to approximately identify the top entries of the aggregated gradient without directly querying each local gradient. Our algorithm also integrates an error feedback strategy at the server side to handle the bias introduced by compression, while keeping the edge devices stateless. We provide convergence analysis of our algorithm in the presence of additive noise incurred by the uploading communication channels, and conduct numerical experiments that justify its effectiveness.

Besides the related work discussed above, it is worth noting that a recent paper [19] uses compressed sensing for zeroth-order optimization, which has a mathematical structure similar to that of this study. However, [19] considers the centralized setting and only establishes convergence to a neighborhood of the minimizer.

Notations: For $x\in\mathbb{R}^{d}$, $\|x\|_{p}$ denotes its $\ell_{p}$-norm, and $x^{[K]}\in\mathbb{R}^{d}$ denotes its best-$K$ approximation, i.e., the vector that keeps the top $K$ entries of $x$ in magnitude and sets the other entries to 0.

2 Problem Setup

Consider a group of $n$ edge devices and a server. Each device $i$ is associated with a differentiable local objective function $f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}$, and is able to query a stochastic gradient $\mathsf{g}_{i}(x)$ such that $\mathbb{E}[\mathsf{g}_{i}(x)]=\nabla f_{i}(x)$. Between each device and the server are an uploading communication link and a broadcasting communication link. The goal is to solve

$$\min_{x\in\mathbb{R}^{d}}\ \ f(x)\coloneqq\frac{1}{n}\sum_{i=1}^{n}f_{i}(x) \qquad (1)$$

through queries of stochastic gradients at each device and exchange of information between the server and each device.

One common approach for our problem setup is the stochastic gradient descent (SGD) method: at each time step $t$, the server first broadcasts the current iterate $x(t)$ to all devices; each device then produces a stochastic gradient $g_{i}(t)=\mathsf{g}_{i}(x(t))$ and uploads it to the server, after which the server updates $x(t+1)=x(t)-\eta\cdot\frac{1}{n}\sum_{i}g_{i}(t)$. However, as the server needs to collect local stochastic gradients from every device at every iteration, vanilla SGD may encounter a significant bottleneck imposed by the uploading links if $d$ is very large. This issue can be further exacerbated when the server and the devices are connected via lossy wireless networks of limited bandwidth, which is the case for many IoT applications.
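For reference, a minimal single-process sketch of this vanilla distributed SGD step (ours, for illustration only; `oracles` stands for the stochastic-gradient oracles $\mathsf{g}_i$):

```python
import numpy as np

def vanilla_distributed_sgd_step(x, oracles, eta):
    """One round of uncompressed distributed SGD.

    x       : current iterate, shape (d,)
    oracles : list of callables g_i(x) returning local stochastic gradients
    eta     : step size
    """
    # Each device uploads its full d-dimensional stochastic gradient.
    grads = [g(x) for g in oracles]
    # The server averages the gradients and takes a descent step.
    return x - eta * np.mean(grads, axis=0)
```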

In this work, we investigate the situation where the communication links, particularly the uploading links from each edge device to the server, have limited bandwidth that can significantly slow down the whole optimization procedure; the data transmitted through each uploading link may also be corrupted by noise. Our goal is to develop an SGD-type algorithm for solving (1) that achieves better communication efficiency over the uploading links.

3 Algorithm

Our algorithm is outlined in Algorithm 1, which is based on the SGD method with the following major ingredients:

  1. Compression of local stochastic gradients using compressed sensing techniques. Each edge device compresses its local gradient as $y_{i}(t)=\Phi g_{i}(t)$ before uploading it to the server. The matrix $\Phi\in\mathbb{R}^{Q\times d}$ is called the sensing matrix, and its number of rows $Q$ is strictly less than the number of columns $d$. As a result, the communication burden of uploading the local gradient information is reduced.

     We emphasize that Algorithm 1 employs the for-all scheme of compressed sensing, which allows one $\Phi$ to be used for the compression of all local stochastic gradients (see Section 3.1 for more details on the for-each and for-all schemes).

     After collecting the compressed local gradients and obtaining $\frac{1}{n}\sum_{i=1}^{n}y_{i}(t)$ (corrupted by communication channel noise), the server recovers a vector $\Delta(t)$ by a compressed sensing algorithm, which is then used to update $x(t)$.

  2. Error feedback of compressed gradients. In general, the compressed sensing reconstruction introduces a nonzero bias into the SGD iterations that hinders convergence. To handle this bias, we adopt the error feedback method of [12, 15] and modify it in a manner similar to FetchSGD [17]. The resulting error feedback procedure is carried out purely at the server side, without knowledge of the true aggregated stochastic gradients.

Note that the aggregated vector $\tilde{y}(t)$ is corrupted by additive noise $w(t)$ from the uploading links. This noise model incorporates a variety of communication schemes, including digital transmission with quantization and over-the-air transmission for wireless multi-access networks [14].

Input: sparsity level $K$, size of sensing matrix $Q\times d$, step size $\eta$, number of iterations $T$, initial point $x_0$
Initialize: $x(1)=x_0$, $\varepsilon(1)=0$
The server generates the sensing matrix $\Phi\in\mathbb{R}^{Q\times d}$ and sends it to every edge device
for $t=1,2,\ldots,T$ do
    The server sends $x(t)$ to every edge device
    foreach device $i=1,\ldots,n$ do
        Device $i$ samples a stochastic gradient $g_{i}(t)=\mathsf{g}_{i}(x(t))$
        Device $i$ constructs $y_{i}(t)=\Phi g_{i}(t)\in\mathbb{R}^{Q}$
        Device $i$ sends $y_{i}(t)$ back to the server
    end foreach
    The server receives $\tilde{y}(t)=\frac{1}{n}\sum_{i=1}^{n}y_{i}(t)+w(t)$, where $w(t)$ denotes additive noise incurred by the communication channels
    The server computes $z(t)=\eta\tilde{y}(t)+\varepsilon(t)$
    The server reconstructs $\Delta(t)=\mathcal{A}(z(t);\Phi)$, where $\mathcal{A}(z(t);\Phi)$ denotes the output of the compressed sensing algorithm of choice
    The server updates $x(t+1)=x(t)-\Delta(t)$
    The server updates $\varepsilon(t+1)=z(t)-\Phi\Delta(t)$
end for
Algorithm 1: SGD with Compressed Sensing
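To make the server-side data flow concrete, the following is a minimal NumPy sketch of Algorithm 1 (ours, for illustration), assuming hypothetical helpers `Phi_mul` (computing $\Phi v$), `stoch_grad` (a list of local stochastic-gradient oracles), and `reconstruct` (the compressed sensing algorithm $\mathcal{A}$).

```python
import numpy as np

def run_cs_sgd(x0, stoch_grad, Phi_mul, reconstruct, eta, T, Q, noise_std=0.0):
    """Sketch of Algorithm 1. All error feedback lives on the server."""
    x = x0.copy()
    eps = np.zeros(Q)                           # server-side error accumulator in R^Q
    for t in range(T):
        # Devices: compress local stochastic gradients with the common Phi.
        ys = [Phi_mul(g(x)) for g in stoch_grad]
        # Uplink: aggregation corrupted by additive channel noise w(t).
        w = noise_std * np.random.randn(Q)
        y_tilde = np.mean(ys, axis=0) + w
        # Server: error feedback, sparse reconstruction, and model update.
        z = eta * y_tilde + eps
        delta = reconstruct(z)                  # K-sparse approximation Delta(t)
        x = x - delta
        eps = z - Phi_mul(delta)
    return x
```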

We now provide more details on our algorithm design.

3.1 Preliminaries on Compressed Sensing

Compressed sensing [20] is a technique that allows efficient sensing and reconstruction of an approximately sparse signal. Mathematically, in the sensing step, a signal $x\in\mathbb{R}^{d}$ is observed through the linear measurement $y=\Phi x+w$, where $\Phi\in\mathbb{R}^{Q\times d}$ is a pre-specified sensing matrix with $Q<d$, and $w\in\mathbb{R}^{Q}$ is additive noise. Then, in the reconstruction step, one recovers the original signal $x$ by approximately solving

$$\hat{x}=\operatorname*{arg\,min}_{z}\,\|z\|_{0}\quad\text{s.t.}\ \ y=\Phi z, \qquad (w=0) \qquad (2)$$
$$\hat{x}=\operatorname*{arg\,min}_{z}\,\tfrac{1}{2}\|y-\Phi z\|_{2}^{2}\quad\text{s.t.}\ \ \|z\|_{0}\leq K, \qquad (w\neq 0) \qquad (3)$$

where $K$ restricts the number of nonzero entries of $\hat{x}$.

Both (2) and (3) are NP-hard nonconvex problems [21, 22], and researchers have proposed various compressed sensing algorithms for obtaining approximate solutions. As discussed below, the reconstruction error $\|\hat{x}-x\|$ heavily depends on i) the design of the sensing matrix $\Phi$, and ii) whether the signal $x$ can be well approximated by a sparse vector.

Design of the sensing matrix $\Phi$. Compressed sensing algorithms can be categorized into two schemes [23]: i) the for-each scheme, in which a probability distribution over sensing matrices is designed to provide the desired reconstruction for a fixed signal, and a new $\Phi$ must be generated at random every time a new signal is to be measured and reconstructed; ii) the for-all scheme, in which a single $\Phi$ is used for the sensing and reconstruction of all possible signals.¹ We mention that count sketch is an example of a for-each scheme algorithm. In this paper, we choose the for-all scheme so that the server does not need to send a new matrix to each device at every iteration.

¹A more detailed explanation of the two schemes can be found in Appendix D.

To ensure that the linear measurement $y=\Phi x$ can discriminate approximately sparse signals, researchers have proposed the restricted isometry property (RIP) [20] as a condition on $\Phi$:

Definition 1.

We say that $\Phi\in\mathbb{R}^{Q\times d}$ satisfies the $(K,\delta)$-restricted isometry property if $(1-\delta)\|x\|_{2}^{2}\leq\|\Phi x\|_{2}^{2}\leq(1+\delta)\|x\|_{2}^{2}$ for every $x\in\mathbb{R}^{d}$ that has at most $K$ nonzero entries.

The restricted isometry property of $\Phi$ is fundamental for analyzing the reconstruction error of many compressed sensing algorithms under the for-all scheme [24].
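Verifying RIP exactly is itself intractable, but the restricted isometry constant can be probed empirically. The following sketch (an illustration of the definition, not part of the paper's method) estimates how far $\|\Phi x\|_2^2/\|x\|_2^2$ deviates from 1 over random $K$-sparse vectors, here for a Gaussian sensing matrix.

```python
import numpy as np

def empirical_rip_deviation(Phi, K, trials=1000, seed=None):
    """Monte-Carlo estimate of max | ||Phi x||^2 / ||x||^2 - 1 | over random K-sparse x."""
    rng = np.random.default_rng(seed)
    Q, d = Phi.shape
    worst = 0.0
    for _ in range(trials):
        x = np.zeros(d)
        support = rng.choice(d, size=K, replace=False)
        x[support] = rng.standard_normal(K)
        ratio = np.linalg.norm(Phi @ x) ** 2 / np.linalg.norm(x) ** 2
        worst = max(worst, abs(ratio - 1.0))
    return worst

# Example: a Gaussian sensing matrix with Q on the order of K log(d/K) rows.
d, K = 1024, 10
Q = int(4 * K * np.log(d / K))
Phi = np.random.randn(Q, d) / np.sqrt(Q)
print(empirical_rip_deviation(Phi, K))
```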

Metric of sparsity. The classical metric of sparsity is the $\ell_{0}$ norm, defined as the number of nonzero entries. However, in our setup the vectors to be compressed are in general only approximately sparse, which the $\ell_{0}$ norm cannot handle, as it is not stable under small perturbations. Here, we adopt the following sparsity metric from [25]:²

$$\operatorname{sp}(x)\coloneqq\frac{\|x\|_{1}^{2}}{\|x\|_{2}^{2}\cdot d},\qquad x\in\mathbb{R}^{d}\backslash\{0\}. \qquad (4)$$

The continuity of $\operatorname{sp}(x)$ makes it robust to small perturbations of $x$, and it can be shown that $\operatorname{sp}(x)$ is Schur-concave, so that it characterizes the approximate sparsity of a signal. $\operatorname{sp}(x)$ has also been used in [25] for the performance analysis of compressed sensing algorithms.

²Compared to [25], we add a scaling factor $d^{-1}$ so that $\operatorname{sp}(x)\in(0,1]$.
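As a quick illustration (ours, not from the paper), the metric in (4) is a one-liner; a fully dense constant vector attains $\operatorname{sp}(x)=1$, while a 1-sparse vector attains $1/d$.

```python
import numpy as np

def sp(x):
    """Sparsity metric (4): ||x||_1^2 / (||x||_2^2 * d), taking values in (0, 1]."""
    x = np.asarray(x, dtype=float)
    return np.linalg.norm(x, 1) ** 2 / (np.linalg.norm(x, 2) ** 2 * x.size)

d = 1000
print(sp(np.ones(d)))       # 1.0 : fully dense
print(sp(np.eye(d)[0]))     # 1/d : 1-sparse
```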

3.2 Details of Algorithm Design

Generation of $\Phi$. As mentioned before, we choose compressed sensing under the for-all scheme for gradient compression and reconstruction. We require the sensing matrix $\Phi\in\mathbb{R}^{Q\times d}$ to have a low storage cost, since it will be transmitted to and stored at each device; $\Phi$ should also satisfy RIP so that the compressed sensing algorithm $\mathcal{A}$ has good reconstruction performance. The following proposition suggests a storage-friendly approach for generating matrices satisfying RIP.

Proposition 1 ([26]).

Let $B\in\mathbb{R}^{d\times d}$ be an orthogonal matrix whose entries have absolute values $O(1/\sqrt{d})$, and let $\delta>0$ be sufficiently small. For some $Q=\tilde{O}(\delta^{-2}K\log^{2}K\log d)$,³ let $\Phi\in\mathbb{R}^{Q\times d}$ be a matrix whose $Q$ rows are chosen uniformly and independently from the rows of $B$, multiplied by $\sqrt{d/Q}$. Then, with high probability, $\Phi$ satisfies the $(K,\delta)$-RIP.

³The $\tilde{O}$ notation hides logarithmic dependence on $1/\delta$.

This proposition indicates that we can choose a "base matrix" $B$ satisfying the condition in Proposition 1, and then randomly choose $Q$ of its rows to form $\Phi$. In this way, $\Phi$ can be stored or transmitted merely as the corresponding row indices of $B$. Note that Proposition 1 only requires $Q$ to depend logarithmically on $d$. Candidates for the base matrix $B$ include the discrete cosine transform (DCT) matrix and the Walsh-Hadamard transform (WHT) matrix: both the DCT and the WHT and their inverses have fast algorithms with time complexity $O(d\log d)$, so multiplication of $\Phi$ or $\Phi^{\top}$ with any vector can be completed in $O(d\log d)$ time.
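A minimal sketch of this construction, using the DCT as the base matrix via `scipy.fft` (the helper names and row-index bookkeeping below are our own, for illustration only):

```python
import numpy as np
from scipy.fft import dct, idct

def make_sensing_operator(d, Q, seed=0):
    """Subsample Q rows of the orthonormal d x d DCT matrix, scaled by sqrt(d/Q).

    Only the Q row indices need to be stored or transmitted.
    """
    rng = np.random.default_rng(seed)
    rows = rng.choice(d, size=Q, replace=False)
    scale = np.sqrt(d / Q)

    def Phi_mul(u):                    # computes Phi @ u in O(d log d)
        return scale * dct(u, norm='ortho')[rows]

    def Phi_T_mul(v):                  # computes Phi.T @ v in O(d log d)
        padded = np.zeros(d)
        padded[rows] = v
        return scale * idct(padded, norm='ortho')

    return Phi_mul, Phi_T_mul, rows

# Example: compress a d-dimensional gradient down to Q measurements.
d, Q = 2**14, 5000
Phi_mul, Phi_T_mul, rows = make_sensing_operator(d, Q)
y = Phi_mul(np.random.randn(d))
```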

Choice of the compressed sensing algorithm. We let $\mathcal{A}$ be the Fast Iterative Hard Thresholding (FIHT) algorithm [27].⁴ Our experiments suggest that FIHT achieves a good balance between computational efficiency and empirical reconstruction error compared to the other algorithms we have tried.

⁴See also Appendix E for a brief summary.

We note that FIHT has a tunable parameter $K$ that controls the number of nonzero entries of $\Delta(t)$. This parameter should accord with the sparsity of the vector to be recovered (see Section 3.3 for theoretical results). In addition, the server can broadcast the sparse vector $\Delta(t)$ instead of the whole $x(t)$ for the edge devices to update their local copies of $x(t)$, which saves communication over the broadcasting links.

Error feedback. We adopt error feedback to facilitate the convergence of Algorithm 1. The following lemma verifies that Algorithm 1 indeed incorporates the error feedback steps of [15]; the proof is straightforward and is omitted here.

Lemma 1.

Consider Algorithm 1, and suppose $\Phi$ is generated according to Proposition 1. Then for each $t$, there exist unique $p(t)\in\mathbb{R}^{d}$ and $e(t)\in\mathbb{R}^{d}$ satisfying $z(t)=\Phi p(t)+\eta w(t)$, $\varepsilon(t)=\Phi e(t)$, $e(1)=0$, such that

$$\begin{aligned}
p(t)&=\eta g(t)+e(t), & x(t+1)&=x(t)-\Delta(t),\\
\Delta(t)&=\mathcal{A}\left(\Phi p(t)+\eta w(t);\Phi\right), & e(t+1)&=p(t)-\Delta(t)+\frac{\eta Q}{d}\Phi^{\top}w(t),
\end{aligned} \qquad (5)$$

where $g(t)\coloneqq\frac{1}{n}\sum_{i=1}^{n}g_{i}(t)$.

By comparing Lemma 1 with [15, Algorithm 2], we see that the only difference lies in the presence of the communication channel noise $w(t)$ in our setting. In addition, since error feedback is implemented purely at the server side, the edge devices remain stateless during the whole optimization procedure.

3.3 Theoretical Analysis

First, we make the following assumptions on the objective function $f$, the stochastic gradients $g_{i}(t)$, and the communication channel noise $w(t)$:

Assumption 1.

The function $f(x)$ is $L$-smooth over $x\in\mathbb{R}^{d}$, i.e., there exists $L>0$ such that $\|\nabla f(x)-\nabla f(y)\|_{2}\leq L\|x-y\|_{2}$ for all $x,y\in\mathbb{R}^{d}$.

Assumption 2.

Denote $g(t)=\frac{1}{n}\sum_{i=1}^{n}g_{i}(t)$. There exists $G>0$ such that $\mathbb{E}\big[\|g(t)-\nabla f(x(t))\|_{2}^{2}\big]\leq G^{2}$ for all $t$.

Assumption 3.

The communication channel noise $w(t)$ satisfies $\mathbb{E}[\|w(t)\|_{2}^{2}]\leq\sigma^{2}$ for each $t$.

Our theoretical analysis will be based on the following result on the reconstruction error of FIHT:

Lemma 2 ([27, Corollary I.3]).

Let $K$ be the maximum number of nonzero entries of the output of FIHT. Suppose the sensing matrix $\Phi\in\mathbb{R}^{Q\times d}$ satisfies $(4K,\delta_{4K})$-RIP for sufficiently small $\delta_{4K}$. Then, for any $x\in\mathbb{R}^{d}$ and $w\in\mathbb{R}^{Q}$,

$$\|\mathcal{A}(\Phi x+w;\Phi)-x\|_{2}\leq(C_{\mathcal{A},\mathrm{s}}+1)\big\|x-x^{[K]}\big\|_{2}+\frac{C_{\mathcal{A},\mathrm{s}}}{\sqrt{K}}\big\|x-x^{[K]}\big\|_{1}+C_{\mathcal{A},\mathrm{n}}\|w\|_{2} \qquad (6)$$

where $C_{\mathcal{A},\mathrm{s}}$ and $C_{\mathcal{A},\mathrm{n}}$ are positive constants that depend on $\delta_{4K}$.

We are now ready to establish convergence of Algorithm 1.

Theorem 1.

Let $K$ be the maximum number of nonzero entries of the output of FIHT. Suppose the sensing matrix $\Phi\in\mathbb{R}^{Q\times d}$ satisfies $(4K,\delta_{4K})$-RIP for sufficiently small $\delta_{4K}$. Furthermore, assume that

$$\operatorname{sp}(p(t))\leq\gamma\cdot\frac{2K/d}{[1+C_{\mathcal{A},\mathrm{s}}(3-2K/d)]^{2}} \qquad (7)$$

for all $t\geq 1$ for some $\gamma\in(0,1)$, where $p(t)$ is defined in Lemma 1. Then, for sufficiently large $T$, by choosing $\eta=1/(L\sqrt{T})$, we have

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]\leq\frac{6L\left(f(x(1))-f^{\ast}\right)+3G^{2}}{2\sqrt{T}}+\frac{3}{T}\left[\frac{\gamma(1+\gamma)}{(1-\gamma)^{2}}G^{2}+\frac{2\big(C_{\mathcal{A},\mathrm{n}}+\sqrt{Q/d}\big)^{2}\sigma^{2}}{1-\gamma}\right].$$

In addition, if $f$ is convex and has a minimizer $x^{\ast}\in\mathbb{R}^{d}$, then we further have

$$\mathbb{E}\!\left[f(\bar{x}(t))-f^{\ast}\right]\leq\frac{L\|x(1)-x^{\ast}\|_{2}^{2}+G^{2}/L}{\sqrt{T}}+\frac{6}{T}\left[\frac{\gamma(1+\gamma)G^{2}}{(1-\gamma)^{2}L}+\frac{2\big(C_{\mathcal{A},\mathrm{n}}+\sqrt{Q/d}\big)^{2}\sigma^{2}}{(1-\gamma)L}\right],$$

where $\bar{x}(t)\coloneqq\frac{1}{T}\sum_{t=1}^{T}x(t)$ and $f^{\ast}\coloneqq f(x^{\ast})$.

Proof of Theorem 1 is given in Appendix B.

Remark 1.

Theorem 1 requires $\operatorname{sp}(p(t))$ to remain sufficiently low. This condition is hard to check and can be violated in practice (see Section 4). However, our numerical experiments suggest that even if condition (7) is violated, Algorithm 1 may still exhibit relatively good convergence behavior when the gradient $g(t)$ itself has a relatively low sparsity level. A theoretical investigation of these observations is an interesting future direction.

4 Numerical Results

4.1 Test Case with Synthetic Data

We first conduct numerical experiments on a test case with synthetic data. Here we set the dimension to $d=2^{14}$ and the number of edge devices to $n=20$. The local objective functions are of the form $f_{i}(x)=\frac{1}{2}(x-x_{0})^{\top}A_{i}(x-x_{0})$, where each $A_{i}\in\mathbb{R}^{d\times d}$ is a diagonal matrix, and we denote $A=\frac{1}{n}\sum_{i}A_{i}$. We generate the $A_{i}$ so that the diagonal entries of $A$ are given by $A_{jj}=e^{-j/300}+0.001$ for each $j=1,\ldots,d$, while the diagonal of each $A_{i}$ is dense. We also let $\mathsf{g}_{i}(x)$ produce approximately sparse stochastic gradients for every $x\in\mathbb{R}^{d}$. We refer to Appendix C for details of the test case.

We test three algorithms: uncompressed vanilla SGD, Algorithm 1, and SGD with count sketch. SGD with count sketch simply replaces the gradient compression and reconstruction of Algorithm 1 with the count sketch method [18]. We set $K=500$ for both Algorithm 1 and SGD with count sketch. For Algorithm 1, we generate $\Phi$ from the WHT matrix and use the FFHT library [28] for fast WHT. We set $T=1000$, $\eta=1/\sqrt{T}$, and the initial point to the origin for all three algorithms.

Figure 1: (a) Convergence of the three algorithms ($w(t)=0$). (b) Evolution of $\operatorname{sp}(p(t))$ and $\operatorname{sp}(g(t))$ for Algorithm 1 ($w(t)=0$). (c) Convergence of Algorithm 1 with different amplitudes of $w(t)$. Curves represent the average of 50 random trials, and light-colored shades represent 3-sigma confidence intervals.

Figure 1(a) illustrates the convergence of the three algorithms without communication channel noise (i.e., $w(t)=0$). For Algorithm 1 we set $Q=5000$ (the compression rate $d/Q$ is $3.28$), and for SGD with count sketch we set the sketch size to $16\times 500$ (the compression rate is $d/(16\times 500)=2.05$). We see that Algorithm 1 has better convergence behavior while also achieving a higher compression rate than SGD with count sketch. Our numerical experiments suggest that, for approximately sparse signals, FIHT achieves higher reconstruction accuracy and more aggressive compression than count sketch, and for signals that are not very sparse, FIHT also appears to be more robust.

Figure 1(b) shows the evolution of $\operatorname{sp}(p(t))$ and $\operatorname{sp}(g(t))$ for Algorithm 1. We see that $\operatorname{sp}(p(t))$ is small for the first few iterations, and then increases and stabilizes around $0.5$, which suggests that condition (7) is likely to have been violated for large $t$. On the other hand, Fig. 1(a) shows that Algorithm 1 can still achieve relatively good convergence behavior. This indicates a gap between the theoretical results in Section 3.3 and the empirical results, and suggests that our analysis could be improved. We leave the relevant investigation as future work.

Figure 1(c) illustrates the convergence of Algorithm 1 under different levels of communication channel noise. Here the entries of $w(t)$ are i.i.d. samples from $\mathcal{N}(0,W^{2})$ with $W\in\{0.2,0.4,0.6,0.8,1.0\}$. We see that the convergence of Algorithm 1 deteriorates only gradually as $W$ increases, suggesting its robustness against communication channel noise.

4.2 Test Case of Federated Learning with CIFAR-10 Dataset

We implement our algorithm on a residual network with 668,426 trainable parameters in two different settings. We primarily use TensorFlow and MPI in the implementation (details of the specific experimental setup can be found in Appendix C). In addition, we only present upload compression results here; download compression is not considered as significant as upload compression (given that download speeds are generally higher than upload speeds), and in our case the download compression rate is simply $d/K$. For both settings, we use the CIFAR-10 dataset (60,000 32×32×3 images from 10 classes) with a 50,000/10,000 train/test split.

Figure 2: Training accuracy (left), training loss (middle), and testing accuracy (right) achieved on CIFAR-10 with i.i.d. datasets for local workers.

In the first setting, we instantiate 100 workers and split the CIFAR-10 training dataset such that all local datasets are i.i.d. As seen in Figure 2, our algorithm achieves 2× upload compression with marginal effect on the training and testing accuracy over 50 epochs. As the compression rate increases, the convergence of our method gradually deteriorates (echoing the results in the synthetic case). For comparison, we also show the results of using count sketch with 5 rows and $d/(5\lambda)$ columns, where $\lambda$ is the desired compression rate, in lieu of FIHT. In our setting, while uncompressed (1×) count sketch performs well, it is very sensitive to higher compression, diverging for 1.1× and 1.25× compression.

Figure 3: Training accuracy (left), training loss (middle), and testing accuracy (right) achieved on CIFAR-10 with non-i.i.d. datasets for local workers.

In the second setting, we split the CIFAR-10 dataset into 10,000 sets of 5 images of a single class and assign these sets to 10,000 workers. Each epoch consists of 100 rounds, with 1% (100) of the workers participating in each round. In Figure 3, similar to the i.i.d. setting, we see that our algorithm's training accuracy convergence gradually deteriorates with higher compression. As is typical of the non-i.i.d. setting, the testing accuracy is not as high as that of the i.i.d. setting. However, we note that FIHT achieves 10× compression with negligible effect on the testing accuracy. In addition, count sketch with 5 rows and $d/(5\lambda)$ columns diverges even for small compression rates in this problem setting.

5 Conclusion

In this paper, we developed a communication-efficient SGD algorithm based on compressed sensing. This algorithm has several direct variants. For example, the momentum method can be directly incorporated. Also, when the number of devices $n$ is very large, the server can choose to query compressed stochastic gradients from a randomly chosen subset of workers.

Our convergence guarantees require $\operatorname{sp}(p(t))$ to remain persistently low, which is hard to check in practice. The numerical experiments also show that our algorithm can work even when $\operatorname{sp}(p(t))$ grows to a relatively high level. This suggests that our theoretical analysis can be further improved, which will be an interesting future direction.

Appendix A Auxiliary Results

In this section, we provide some auxiliary results for the proof of Theorem 1. We first give an alternative form of the reconstruction error derived from the condition (7) and the performance guarantee (6).

Lemma 3.

Suppose the conditions in Lemma 2 are satisfied. Let $w\in\mathbb{R}^{Q}$ be arbitrary, and let $x\in\mathbb{R}^{d}$ be such that $\operatorname{sp}(x)$ is upper bounded by the right-hand side of (7). Then

$$\|\mathcal{A}(\Phi x+w;\Phi)-x\|_{2}\leq\sqrt{\frac{\gamma}{2}}\|x\|_{2}+C_{\mathcal{A},\mathrm{n}}\|w\|_{2}.$$
Proof.

By [29, Lemma 7], we have xx[K]2x1/(2K)\|x-x^{[K]}\|_{2}\leq\|x\|_{1}/(2\sqrt{K}). Therefore by Lemma 2,

𝒜(Φx+w;Φ)x2\displaystyle\|\mathcal{A}(\Phi x+w;\Phi)-x\|_{2}\leq\ C𝒜,s+12Kx1+C𝒜,sKxx[K]1+C𝒜,nw2\displaystyle\frac{C_{\mathcal{A},\mathrm{s}}+1}{2\sqrt{K}}\|x\|_{1}+\frac{C_{\mathcal{A},\mathrm{s}}}{\sqrt{K}}\big{\|}x-x^{[K]}\big{\|}_{1}+C_{\mathcal{A},\mathrm{n}}\|w\|_{2}
\displaystyle\leq\ [C𝒜,s+12+C𝒜,s(1\mfracKd)]x1K+C𝒜,nw2.\displaystyle\left[\frac{C_{\mathcal{A},\mathrm{s}}+1}{2}+C_{\mathcal{A},\mathrm{s}}\!\left(1-\mfrac{K}{d}\right)\right]\frac{\big{\|}x\big{\|}_{1}}{\sqrt{K}}+C_{\mathcal{A},\mathrm{n}}\|w\|_{2}.

Plugging in the definition and upper bound of sp(x)\operatorname{sp}(x) leads to the result. ∎

Next, we derive a bound on the second moment of e(t)e(t).

Lemma 4.

We have

1Tt=1T𝔼[e(t)22]\displaystyle\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|e(t)\|_{2}^{2}\right]\!\leq\, 2η21γ[γ(1+γ)1γG2+2(C𝒜,n+Q/d)2σ2]+2η2γ(1+γ)(1γ)21Tt=1T𝔼[f(x(t))22].\displaystyle\frac{2\eta^{2}}{1\!-\!\gamma}\bigg{[}\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}G^{2}\!+\!2\!\left(C_{\mathcal{A},\mathrm{n}}\!\!+\!\!\sqrt{Q/d}\right)^{\!2}\!\!\sigma^{2}\bigg{]}+\frac{2\eta^{2}\gamma(1\!+\!\gamma)}{(1\!-\!\gamma)^{2}}\cdot\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right].
Proof.

By definition, we have

𝔼[e(t+1)22]\displaystyle\mathbb{E}\!\left[\|e(t\!+\!1)\|_{2}^{2}\right]\leq\ 𝔼[(Δ(t)p(t)2+\mfracηQdΦw(t)2)2]\displaystyle\mathbb{E}\Big{[}\big{(}\|\Delta(t)-p(t)\|_{2}+\mfrac{\eta Q}{d}\big{\|}\Phi^{\top}w(t)\big{\|}_{2}\big{)}^{2}\Big{]}
=\displaystyle=\ 𝔼[(𝒜(Φp(t)+ηw(t))p(t)2+ηQ/dw2)2]\displaystyle\mathbb{E}\Big{[}\big{(}\left\|\mathcal{A}(\Phi p(t)+\eta w(t))-p(t)\right\|_{2}+\eta\sqrt{Q/d}\|w\|_{2}\big{)}^{2}\Big{]}
\displaystyle\leq\ 𝔼[(γ/2p(t)2+η(C𝒜,n+Q/d)w2)2]\displaystyle\mathbb{E}\Big{[}\big{(}\sqrt{\gamma/2}\|p(t)\|_{2}+\eta\big{(}C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\big{)}\|w\|_{2}\big{)}^{2}\Big{]}
\displaystyle\leq\ γ𝔼[p(t)22]+2η2(C𝒜,n+Q/d)2𝔼[w(t)22]\displaystyle\gamma\mathbb{E}\!\left[\|p(t)\|_{2}^{2}\right]+2\eta^{2}\big{(}C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\big{)}^{2}\mathbb{E}\!\left[\|w(t)\|_{2}^{2}\right]
\displaystyle\leq\ γ𝔼[ηg(t)+e(t)22]+2η2(C𝒜,n+Q/d)2σ2,\displaystyle\gamma\mathbb{E}\!\left[\|\eta g(t)+e(t)\|_{2}^{2}\right]+2\eta^{2}\big{(}C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\big{)}^{2}\sigma^{2},

where the first inequality follows from Lemma 3, and the second inequality follows from the definition of p(t)p(t) and the assumption that 𝔼[w22]σ2\mathbb{E}[\|w\|_{2}^{2}]\leq\sigma^{2}. Notice that

𝔼[ηg(t)+e(t)22](1+2γ1γ)𝔼[ηg(t)22]+(1+1γ2γ)𝔼[e(t)22],\displaystyle\mathbb{E}\!\left[\|\eta g(t)+e(t)\|_{2}^{2}\right]\leq\left(1+\frac{2\gamma}{1-\gamma}\right)\mathbb{E}\!\left[\|\eta g(t)\|_{2}^{2}\right]+\left(1+\frac{1-\gamma}{2\gamma}\right)\mathbb{E}\!\left[\|e(t)\|_{2}^{2}\right],

which leads to

𝔼[e(t+1)22]1+γ2𝔼[e(t)22]+η2γ(1+γ)1γ𝔼[g(t)22]+2η2(C𝒜,n+Q/d)2σ2.\displaystyle\mathbb{E}\!\left[\|e(t\!+\!1)\|_{2}^{2}\right]\leq\frac{1\!+\!\gamma}{2}\mathbb{E}\!\left[\|e(t)\|_{2}^{2}\right]+\eta^{2}\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}\mathbb{E}\!\left[\|g(t)\|_{2}^{2}\right]+2\eta^{2}\!\big{(}C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\big{)}^{\!2}\!\sigma^{2}.

By 𝔼[g(t)|x(t)]=f(x(t))\mathbb{E}[g(t)|x(t)]=\nabla f(x(t)) and Assumption 2, we have

𝔼[g(t)22]=\displaystyle\mathbb{E}\!\left[\|g(t)\|_{2}^{2}\right]=\ 𝔼[f(x(t))22]+𝔼[g(t)f(x(t))22]\displaystyle\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]+\mathbb{E}\!\left[\|g(t)-\nabla f(x(t))\|_{2}^{2}\right]
\displaystyle\leq\ 𝔼[f(x(t))22]+G2.\displaystyle\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]+G^{2}.

Therefore

𝔼[e(t+1)22]1+γ2𝔼[e(t)22]+η2γ(1+γ)1γ𝔼[f(x(t))22]+η2[γ(1+γ)1γG2+2(C𝒜,n+Q/d)2σ2].\displaystyle\mathbb{E}\!\left[\|e(t\!+\!1)\|_{2}^{2}\right]\leq\frac{1\!+\!\gamma}{2}\mathbb{E}\mkern-4.0mu\left[\|e(t)\|_{2}^{2}\right]\!+\!\eta^{2}\mkern-1.0mu\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}\mathbb{E}\mkern-4.0mu\left[\|\nabla\!f(x(t)\mkern-2.0mu)\|_{2}^{2}\right]+\eta^{2}\!\left[\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}G^{2}\!+\!2\big{(}C_{\mathcal{A},\mathrm{n}}\!\!+\!\!\sqrt{Q/d}\big{)}^{\!2}\!\sigma^{2}\right].

By summing over t=1,,Tt=1,\ldots,T and noting that e(1)=0e(1)=0 and 𝔼[e(T+1)22]0\mathbb{E}[\|e(T+1)\|_{2}^{2}]\geq 0, we get

1Tt=1T𝔼[e(t)22]\displaystyle\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|e(t)\|_{2}^{2}\right]\leq\ 1+γ2Tt=1T𝔼[e(t)22]+η2γ(1+γ)(1γ)Tt=1T𝔼[f(x(t))22]\displaystyle\frac{1\!+\!\gamma}{2T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|e(t)\|_{2}^{2}\right]+\frac{\eta^{2}\gamma(1\!+\!\gamma)}{(1\!-\!\gamma)T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|\nabla f(x(t)\mkern-2.0mu)\|_{2}^{2}\right]
+η2[γ(1+γ)1γG2+2(C𝒜,n+Q/d)2σ2],\displaystyle+\eta^{2}\!\left[\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}G^{2}+2\big{(}C_{\mathcal{A},\mathrm{n}}\!\!+\!\!\sqrt{Q/d}\big{)}^{\!2}\!\sigma^{2}\right],

which then leads to the desired result. ∎

Appendix B Proof of Theorem 1

Convex case: Denote x~(t)=x(t)e(t)\tilde{x}(t)=x(t)-e(t), and it can be checked that

x~(t+1)\displaystyle\tilde{x}(t\!+\!1) =x(t+1)e(t+1)=x(t)Δ(t)(p(t)Δ(t))=x(t)p(t)\displaystyle=x(t\!+\!1)-e(t\!+\!1)=x(t)-\Delta(t)-(p(t)-\Delta(t))=x(t)-p(t)
=x~(t)ηg(t).\displaystyle=\tilde{x}(t)-\eta g(t).

We then have

x~(t+1)x22=x~(t)x22+η2g(t)222ηg(t),x~(t)x.\|\tilde{x}(t\!+\!1)-x^{\ast}\|_{2}^{2}=\|\tilde{x}(t)-x^{\ast}\|_{2}^{2}+\eta^{2}\|g(t)\|_{2}^{2}-2\eta\langle g(t),\tilde{x}(t)-x^{\ast}\rangle.

By taking the expectation and noting 𝔼[g(t)|x(t)]=f(x(t))\mathbb{E}[g(t)|x(t)]=\nabla f(x(t)) and Assumption 2, we get

𝔼[x~(t+1)x22]\displaystyle\mathbb{E}\!\left[\|\tilde{x}(t\!+\!1)-x^{\ast}\|_{2}^{2}\right]\leq 𝔼[x~(t)x22]+η2(𝔼[f(x(t))22]+G2)\displaystyle\mathbb{E}\!\left[\|\tilde{x}(t)-x^{\ast}\|_{2}^{2}\right]+\eta^{2}\!\left(\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]+G^{2}\right)
2η𝔼[f(x(t)),x(t)x]+2η𝔼[f(x(t)),e(t)],\displaystyle-2\eta\,\mathbb{E}\!\left[\langle\nabla f(x(t)),x(t)-x^{\ast}\rangle\right]+2\eta\,\mathbb{E}\!\left[\langle\nabla f(x(t)),e(t)\rangle\right],

which leads to

𝔼[f(x(t)),x(t)x]\displaystyle\mathbb{E}\!\left[\langle\nabla f(x(t)),x(t)-x^{\ast}\rangle\right]\leq 12η(𝔼[x~(t)x22]𝔼[x~(t+1)x22])+ηG22\displaystyle\frac{1}{2\eta}(\mathbb{E}\!\left[\|\tilde{x}(t)-x^{\ast}\|_{2}^{2}\right]-\mathbb{E}\!\left[\|\tilde{x}(t\!+\!1)-x^{\ast}\|_{2}^{2}\right])+\frac{\eta G^{2}}{2}
+η2𝔼[f(x(t))22]+𝔼[f(x(t)),e(t)]\displaystyle+\frac{\eta}{2}\,\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]+\mathbb{E}\!\left[\langle\nabla f(x(t)),e(t)\rangle\right]
\displaystyle\leq 12η(𝔼[x~(t)x22]𝔼[x~(t+1)x22])+ηG22\displaystyle\frac{1}{2\eta}(\mathbb{E}\!\left[\|\tilde{x}(t)-x^{\ast}\|_{2}^{2}\right]-\mathbb{E}\!\left[\|\tilde{x}(t\!+\!1)-x^{\ast}\|_{2}^{2}\right])+\frac{\eta G^{2}}{2}
+η+(3L)12𝔼[f(x(t))22]+3L2𝔼[e(t)22],\displaystyle+\frac{\eta+(3L)^{-1}}{2}\,\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]+\frac{3L}{2}\mathbb{E}\!\left[\|e(t)\|_{2}^{2}\right],

where in the second inequality we used f(x(t)),e(t)16Lf(x(t))22+3L2e(t)22\langle\nabla f(x(t)),e(t)\rangle\leq\frac{1}{6L}\|\nabla f(x(t))\|_{2}^{2}+\frac{3L}{2}\|e(t)\|_{2}^{2}. Now, we take the average of both sides over t=1,,Tt=1,\ldots,T and plug in the bound in Lemma 4 to get

1Tt=1T𝔼[f(x(t)),x(t)x]\displaystyle\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\langle\nabla f(x(t)),x(t)-x^{\ast}\rangle\right]\leq\ 12ηTx(1)x22+ηG22+3η2L1γ[γ(1+γ)1γG2+2(C𝒜,n+Q/d)2σ2]\displaystyle\frac{1}{2\eta T}\|x(1)-x^{\ast}\|_{2}^{2}+\frac{\eta G^{2}}{2}+\frac{3\eta^{2}L}{1\!-\!\gamma}\!\left[\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}G^{2}+2\big{(}C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\big{)}^{\!2}\sigma^{2}\right]
+(η+(3L)12+3η2Lγ(1+γ)(1γ)2)1Tt=1T𝔼[f(x(t))22].\displaystyle+\left(\frac{\eta\!+\!(3L)^{-1}}{2}\!+\!\frac{3\eta^{2}L\gamma(1\!+\!\gamma)}{(1\!-\!\gamma)^{2}}\right)\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]\!.

By η=1/(LT)\eta=1/(L\sqrt{T}), we can show that for sufficiently large TT,

η+(3L)12+3η2Lγ(1+γ)(1γ)2=1L(16+12T+3γ(1+γ)T(1γ)2)14L.\frac{\eta+(3L)^{-1}}{2}+\frac{3\eta^{2}L\gamma(1+\gamma)}{(1-\gamma)^{2}}=\frac{1}{L}\left(\frac{1}{6}+\frac{1}{2\sqrt{T}}+\frac{3\gamma(1+\gamma)}{T(1-\gamma)^{2}}\right)\leq\frac{1}{4L}.

Thus

1Tt=1T𝔼[f(x(t)),x(t)x]\displaystyle\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\langle\nabla f(x(t)),x(t)-x^{\ast}\rangle\right]\leq 12ηTx(1)x22+ηG22+3η2L1γ[γ(1+γ)1γG2+2(C𝒜,n+Q/d)2σ2]\displaystyle\frac{1}{2\eta T}\|x(1)-x^{\ast}\|_{2}^{2}+\frac{\eta G^{2}}{2}+\frac{3\eta^{2}L}{1\!-\!\gamma}\!\left[\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}G^{2}+2\big{(}C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\big{)}^{\!2}\sigma^{2}\right]
+14L1Tt=1T𝔼[f(x(t))22].\displaystyle+\frac{1}{4L}\cdot\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right].

By the convexity of ff, we have f(x(t))f(x)f(x(t)),x(t)xf(x(t))-f(x^{\ast})\leq\langle\nabla f(x(t)),x(t)-x^{\ast}\rangle. Furthermore, since ff is LL-smooth, we see that

f(x)f(x(t)1Lf(x(t)))f(x(t))12Lf(x(t))22,f(x^{\ast})\leq f\!\left(x(t)-\frac{1}{L}\nabla f(x(t))\right)\leq f(x(t))-\frac{1}{2L}\|\nabla f(x(t))\|_{2}^{2},

which leads to f(x(t))222L(f(x(t))f(x))\|\nabla f(x(t))\|_{2}^{2}\leq 2L(f(x(t))-f(x^{\ast})). We then get

1Tt=1T𝔼[f(x(t))f(x)]\displaystyle\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[f(x(t))-f(x^{\ast})\right]\leq 12Tt=1T𝔼[f(x(t))f(x)]+12ηTx(1)x22+ηG22\displaystyle\frac{1}{2T}\sum_{t=1}^{T}\mathbb{E}\!\left[f(x(t))-f(x^{\ast})\right]+\frac{1}{2\eta T}\|x(1)-x^{\ast}\|_{2}^{2}+\frac{\eta G^{2}}{2}
+3η2L1γ[γ(1+γ)1γG2+2(C𝒜,n+Q/d)2σ2].\displaystyle+\frac{3\eta^{2}L}{1\!-\!\gamma}\!\left[\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}G^{2}+2\big{(}C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\big{)}^{\!2}\sigma^{2}\right].

By subtracting 12Tt=1T𝔼[f(x(t))f(x)]\frac{1}{2T}\sum_{t=1}^{T}\mathbb{E}\!\left[f(x(t))\!-\!f(x^{\ast})\right] from both sides of the inequality, and using the bound f(x¯(t))1Tt=1Tf(x(t))f(\bar{x}(t))\leq\frac{1}{T}\sum_{t=1}^{T}f(x(t)) that follows from the convexity of ff, we get the final bound.

Nonconvex case: Denote x~(t)=x(t)e(t)\tilde{x}(t)=x(t)-e(t), and it can be checked that x~(t+1)=x~(t)ηg(t)\tilde{x}(t\!+\!1)=\tilde{x}(t)-\eta g(t). Since ff is LL-smooth, we get

f(x~(t+1))f(x~(t))ηf(x~(t)),g(t)+η2L2g(t)22.f(\tilde{x}(t\!+\!1))\leq f(\tilde{x}(t))-\eta\langle\nabla f(\tilde{x}(t)),g(t)\rangle+\frac{\eta^{2}L}{2}\|g(t)\|_{2}^{2}.

By taking the expectation and using 𝔼[g(t)|x(t)]=f(x(t))\mathbb{E}[g(t)|x(t)]=\nabla f(x(t)) and Assumption 2, we see that

𝔼[f(x~(t+1))]\displaystyle\mathbb{E}\!\left[f(\tilde{x}(t\!+\!1))\right]\leq 𝔼[f(x~(t))]η𝔼[f(x~(t)),f(x(t))]+η2L2(𝔼[f(x(t))22]+G2)\displaystyle\mathbb{E}\!\left[f(\tilde{x}(t))\right]-\eta\,\mathbb{E}\!\left[\langle\nabla f(\tilde{x}(t)),\nabla f(x(t))\rangle\right]+\frac{\eta^{2}L}{2}\!\left(\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]+G^{2}\right)
=\displaystyle= 𝔼[f(x~(t))]η𝔼[f(x(t))22]\displaystyle\mathbb{E}\!\left[f(\tilde{x}(t))\right]-\eta\,\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]
η𝔼[f(x~(t))f(x(t)),f(x(t))]+η2L2(𝔼[f(x(t))22]+G2)\displaystyle-\eta\,\mathbb{E}\!\left[\langle\nabla f(\tilde{x}(t))-\nabla f(x(t)),\nabla f(x(t))\rangle\right]+\frac{\eta^{2}L}{2}\!\left(\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]+G^{2}\right)
\displaystyle\leq 𝔼[f(x~(t))]η(1ηL)2𝔼[f(x(t))22]\displaystyle\mathbb{E}\!\left[f(\tilde{x}(t))\right]-\frac{\eta(1-\eta L)}{2}\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]
+η2𝔼[f(x~(t))f(x(t))22]+η2L2G2\displaystyle+\frac{\eta}{2}\,\mathbb{E}\!\left[\|\nabla f(\tilde{x}(t))-\nabla f(x(t))\|_{2}^{2}\right]+\frac{\eta^{2}L}{2}G^{2}
\displaystyle\leq 𝔼[f(x~(t))]η(1ηL)2𝔼[f(x(t))22]+ηL22𝔼[e(t)22]+η2L2G2,\displaystyle\mathbb{E}\!\left[f(\tilde{x}(t))\right]-\frac{\eta(1-\eta L)}{2}\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]+\frac{\eta L^{2}}{2}\mathbb{E}\!\left[\|e(t)\|_{2}^{2}\right]+\frac{\eta^{2}L}{2}G^{2},

where in the second inequality we used f(x~(t))f(x(t)),f(x(t))12f(x~(t))f(x(t))22+12f(x(t))22\langle\nabla f(\tilde{x}(t))-\nabla f(x(t)),\nabla f(x(t))\rangle\leq\frac{1}{2}\|\nabla f(\tilde{x}(t))-\nabla f(x(t))\|_{2}^{2}+\frac{1}{2}\|\nabla f(x(t))\|_{2}^{2}, and in the last inequality we used the LL-smoothness of ff. By taking the telescoping sum, we get

1ηL21Tt=1T𝔼[f(x(t))22]\displaystyle\frac{1-\eta L}{2}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\|\nabla f(x(t))\|_{2}^{2}\right]\leq f(x(1))fη+ηLG22+L221Tt=1T𝔼[e(t)22].\displaystyle\frac{f(x(1))-f^{\ast}}{\eta}+\frac{\eta LG^{2}}{2}+\frac{L^{2}}{2}\cdot\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|e(t)\|_{2}^{2}\right].

After plugging in the bound in Lemma 4, we get

1ηL21Tt=1T𝔼[f(x(t))22]\displaystyle\frac{1-\eta L}{2}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\|\nabla f(x(t))\|_{2}^{2}\right]\leq f(x(1))fη+ηLG22+η2L21γ[γ(1+γ)1γG2+2(C𝒜,n+Q/d)2σ2]\displaystyle\frac{f(x(1))-f^{\ast}}{\eta}+\frac{\eta LG^{2}}{2}+\frac{\eta^{2}L^{2}}{1\!-\!\gamma}\!\left[\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}G^{2}+2\!\left(C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\right)^{\!2}\!\sigma^{2}\right]
+η2L2γ(1+γ)(1γ)21Tt=1T𝔼[f(x(t))22].\displaystyle+\eta^{2}L^{2}\frac{\gamma(1+\gamma)}{(1-\gamma)^{2}}\cdot\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right].

By choosing η=1/(LT)\eta=1/(L\sqrt{T}) and letting TT be sufficiently large, we have

1ηL2η2L2γ(1+γ)(1γ)2=1212Tγ(1+γ)T(1γ)213.\frac{1-\eta L}{2}-\eta^{2}L^{2}\frac{\gamma(1+\gamma)}{(1-\gamma)^{2}}=\frac{1}{2}-\frac{1}{2\sqrt{T}}-\frac{\gamma(1+\gamma)}{T(1-\gamma)^{2}}\geq\frac{1}{3}.

Finally, we get

13Tt=1T𝔼[f(x(t))22]2L(f(x(1))f)+G22T+1T[γ(1+γ)(1γ)2G2+2(C𝒜,n+Q/d)2σ21γ],\frac{1}{3T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]\leq\frac{2L\!\left(f(x(1))\!-\!f^{\ast}\right)+G^{2}}{2\sqrt{T}}+\frac{1}{T}\!\left[\frac{\gamma(1+\gamma)}{(1-\gamma)^{2}}G^{2}+\frac{2\big{(}C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\big{)}^{\!2}\sigma^{2}}{1-\gamma}\right]\!,

which completes the proof.

Appendix C Further Details on Numerical Experiments

For all numerical experiments in this paper, we employ the WHT matrix as the "base matrix" for generating the sensing matrix $\Phi$. We choose WHT instead of DCT because the Fast Fast Hadamard Transform (FFHT) library [28] provides a heavily optimized C99 implementation of the fast WHT, which turned out to be faster than the fast DCT in the Python library SciPy on our computing platforms.

Note that the WHT matrix is only defined when $d$ is a power of $2$. In order to use the WHT for general $d$, we let $d_{\mathrm{aug}}=2^{\lceil\log_{2}d\rceil}$ and generate $\Phi\in\mathbb{R}^{Q\times d_{\mathrm{aug}}}$ by randomly choosing $Q$ rows from the $d_{\mathrm{aug}}\times d_{\mathrm{aug}}$ WHT matrix as in Proposition 1. Then, for any $u\in\mathbb{R}^{d}$, we first pad the vector $u$ with zeros to augment its dimension to $d_{\mathrm{aug}}$, and then multiply $\Phi\in\mathbb{R}^{Q\times d_{\mathrm{aug}}}$ with the augmented vector to form the compressed vector of dimension $Q$. For the reconstruction, after obtaining the recovered signal, we truncate the last $d_{\mathrm{aug}}-d$ entries to get a vector of the original dimension $d$.
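A minimal illustration of this padding/truncation logic (using a dense Hadamard matrix from `scipy.linalg` for clarity; the paper's implementation instead uses the fast transform from FFHT):

```python
import numpy as np
from scipy.linalg import hadamard

def compress(u, rows, d_aug):
    """Pad u to length d_aug, then apply the subsampled, rescaled WHT rows."""
    B = hadamard(d_aug) / np.sqrt(d_aug)            # orthogonal WHT matrix
    u_pad = np.concatenate([u, np.zeros(d_aug - len(u))])
    return np.sqrt(d_aug / len(rows)) * (B[rows] @ u_pad)

def truncate(x_rec, d):
    """Drop the padded entries of a reconstructed signal."""
    return x_rec[:d]

d = 1000                                            # not a power of two
d_aug = 2 ** int(np.ceil(np.log2(d)))               # 1024
rows = np.random.default_rng(0).choice(d_aug, size=200, replace=False)
y = compress(np.random.randn(d), rows, d_aug)
```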

Code for the federated learning test case is available at https://github.com/vikr53/cosamp.

C.1 Test Case with Synthetic Data

All experiments on the synthetic test case were run on a MacBook Pro (Model A1502) with a 2.9 GHz dual-core Intel Core i5 CPU and 16 GB of memory. GPU acceleration was not exploited. For the synthetic test case, we implement count sketch with our own code.

Recall that the local objective functions are given by $f_{i}(x)=\frac{1}{2}(x-x_{0})^{\top}A_{i}(x-x_{0})$, where each $A_{i}$ is a diagonal matrix. The $j$'th diagonal entries of $A_{1},\ldots,A_{n}$ are randomly sampled such that $\big((A_{1})_{jj},(A_{2})_{jj},\ldots,(A_{n})_{jj}\big)$ follows the Gaussian distribution $\mathcal{N}\big((e^{-j/300}+0.001)\mathbf{1},\,I-\frac{1}{n}\mathbf{1}\mathbf{1}^{\top}\big)$, where $I$ denotes the identity matrix and $\mathbf{1}$ denotes the vector of all ones. As a result, $A\coloneqq\frac{1}{n}\sum_{i}A_{i}$ has diagonal entries $A_{jj}=e^{-j/300}+0.001$. The point $x_{0}\in\mathbb{R}^{d}$ is randomly generated from the standard Gaussian distribution. We fix each $A_{i}$ and $x_{0}$ once they have been generated.

The stochastic gradient $\mathsf{g}_{i}(x)$ is given by

$$\mathsf{g}_{i}(x)=A_{i}(x-x_{0})+R_{1}\cdot A\mathsf{u}_{1}+R_{2}\cdot(\mathsf{b}\odot\mathsf{u}_{2}).$$

Here $\mathsf{u}_{1}$ and $\mathsf{u}_{2}$ are i.i.d. random vectors following the standard Gaussian distribution $\mathcal{N}(0,I)$; the entries of $\mathsf{b}\in\mathbb{R}^{d}$ are i.i.d. samples from the Bernoulli distribution $\mathrm{Ber}(p_{\mathsf{b}})$ with $p_{\mathsf{b}}\in(0,1)$; and $\odot$ denotes the Hadamard product. We set $R_{1}=12.5$, $R_{2}=50$, and $p_{\mathsf{b}}=1.5\times 10^{-3}$ in our test case.
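For concreteness, a NumPy sketch of this stochastic gradient oracle (ours, with the diagonals of $A_i$ and $A$ stored as vectors `Ai_diag` and `A_diag`) could look as follows.

```python
import numpy as np

def make_synthetic_oracle(Ai_diag, A_diag, x0, R1=12.5, R2=50.0, p_b=1.5e-3, seed=None):
    """Stochastic gradient oracle g_i(x) for the synthetic quadratic test case."""
    rng = np.random.default_rng(seed)
    d = x0.size

    def g_i(x):
        u1 = rng.standard_normal(d)        # dense Gaussian perturbation
        u2 = rng.standard_normal(d)
        b = rng.random(d) < p_b            # sparse Bernoulli mask
        return Ai_diag * (x - x0) + R1 * (A_diag * u1) + R2 * (b * u2)

    return g_i
```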

C.2 Test Case of Federated Learning with CIFAR-10 Dataset

We ran all federated learning experiments using the research computing service provided by the authors' institution, and did not exploit GPU acceleration. We primarily use TensorFlow [30] as the deep learning framework. We use MPI for Python [31] for parallel computing on multiple processors, and communication between nodes is facilitated by Open MPI [32]. Additionally, we use the CSVec package (available at https://github.com/nikitaivkin/csh) for count sketch.

Regarding the compressed sensing algorithms, we set $K=30000$ (the number of nonzero entries in the reconstructed signal) for both FIHT and count sketch. The dimension $Q$ of the compressed gradients for FIHT is simply $Q=d/\lambda$, where $\lambda$ is the desired compression rate. The sketch size for count sketch is $r$ rows $\times$ $\frac{d}{r\lambda}$ columns; we tried $r\in\{1,5,10,50\}$ and found the best results with $r=5$ for all tested compression rates.

Recall that we conduct experiments in an i.i.d. setting and a non-i.i.d. setting. The differences between the two settings include the following:

  1. In the i.i.d. setting, there are 100 local workers, and we split the training dataset evenly into 100 local datasets, so that each local dataset has exactly 50 samples from each of the 10 classes. In the non-i.i.d. setting, there are 10,000 workers, and each of the 10,000 local datasets has 5 samples belonging to the same class.

  2. In the i.i.d. setting, the server queries compressed local gradients from all workers at each iteration. In the non-i.i.d. setting, the server queries compressed local gradients from 1% of all workers at each iteration.

  3. In the i.i.d. setting, we set the local batch size to 8. In the non-i.i.d. setting, we set the local batch size to 1.

  4. In the non-i.i.d. setting, we perform data augmentation (while dividing the dataset amongst the 10,000 workers) using random flips and random rotations.

The step size $\eta$ for both the i.i.d. and the non-i.i.d. settings is set to $0.01$. We also use the same parameters for FIHT and count sketch in both settings.

Experiments on the reconstruction error:

We also conducted numerical experiments that compare the reconstruction errors of FIHT and count sketch. Here we let $g\in\mathbb{R}^{d}$ with $d=668426$ be randomly generated as $g=g_{\mathrm{sp}}+g_{\mathrm{n}}$, where $g_{\mathrm{sp}}$ is a vector having $30000$ nonzero entries that are i.i.d. samples from the standard Gaussian distribution, and $g_{\mathrm{n}}\sim\mathcal{N}(0,0.05^{2}I)$. We apply FIHT and count sketch to the compression and reconstruction of $g$, and measure the relative reconstruction error at different compression rates $\lambda$. The relative reconstruction error is defined as $\|g-\hat{g}\|_{2}^{2}/\|g\|_{2}^{2}$, where $\hat{g}$ is the signal reconstructed by FIHT or count sketch. For a given compression rate $\lambda$, we set $Q=d/\lambda$ for FIHT, and set the sketch size to $5$ rows $\times$ $\frac{d}{5\lambda}$ columns for count sketch. We set $\|\hat{g}\|_{0}=K=30000$ for both FIHT and count sketch. For each compression rate $\lambda$, we test FIHT and count sketch over $20$ random instances of $g$; we fix the sensing matrix $\Phi$ of FIHT and the sketching operator of count sketch across the $20$ random trials. Figure 4 empirically confirms that the reconstruction performance of count sketch is worse than that of FIHT, especially at higher compression rates.
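A minimal sketch of how such a comparison can be scripted (ours, for illustration; the signal model follows the description above, and `reconstruct` stands in for either FIHT or count sketch decoding):

```python
import numpy as np

def make_test_signal(d=668426, k=30000, noise_std=0.05, seed=None):
    """Approximately sparse test signal g = g_sp + g_n."""
    rng = np.random.default_rng(seed)
    g = noise_std * rng.standard_normal(d)           # dense Gaussian tail g_n
    support = rng.choice(d, size=k, replace=False)
    g[support] += rng.standard_normal(k)             # k-sparse component g_sp
    return g

def relative_error(g, g_hat):
    return np.linalg.norm(g - g_hat) ** 2 / np.linalg.norm(g) ** 2

# Hypothetical usage, with Phi_mul / reconstruct defined elsewhere:
# errs = [relative_error(g, reconstruct(Phi_mul(g))) for g in signals]
```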

Figure 4: Relative reconstruction error of FIHT and count sketch at different compression rates, averaged over 20 random trials.

Appendix D For-each Scheme and For-all Scheme

In what follows, we elaborate on the two schemes in compressed sensing: the for-each scheme and the for-all scheme.

For-each scheme:

In this scheme, the sensing matrix $\Phi$ is chosen at random from some distribution for each individual signal $x$. Specifically, consider an algorithm belonging to the for-each scheme, and let $\mathcal{A}(y;\Phi)$ denote the signal reconstructed by this algorithm from the measurement $y$ and the sensing matrix $\Phi$. Then associated with the algorithm $\mathcal{A}$ is an indexed set $\{\mathcal{D}_{d,Q}\}$, with each $\mathcal{D}_{d,Q}$ being a probability distribution over $\mathbb{R}^{Q\times d}$, and the theoretical guarantee of the algorithm $\mathcal{A}$ can typically be recast as follows:

Let $K$ be sufficiently small [say, $K\leq O(Q/\log d)$]. Then there exist $\varsigma>0$ and $c\geq 0$ depending on $K$, $Q$ and $d$, such that

$$\mathbb{P}_{\Phi\sim\mathcal{D}_{d,Q}}\!\left(\left\|\mathcal{A}(\Phi x;\Phi)-x\right\|_{2}\leq(1+\varsigma)\big\|x-x^{[K]}\big\|_{2}\right)\geq 1-O(d^{-c}), \qquad (8)$$

where $x\in\mathbb{R}^{d}$ is an arbitrary deterministic vector.

Note that the bound (8) applies to the reconstruction of one deterministic signal $x\in\mathbb{R}^{d}$, and the probability on the left-hand side of (8) is with respect to the distribution $\mathcal{D}_{d,Q}$ over the sensing matrix $\Phi$. Consequently, every time a new vector $x$ is to be measured and reconstructed under the for-each scheme, one typically generates a new sensing matrix $\Phi$ independent of $x$. In particular, if a randomly generated matrix $\Phi$ has already been used for the reconstruction of $x$, and $y$ is a vector that depends on $\mathcal{A}(\Phi x;\Phi)$, then the bound (8) can no longer be directly applied to the reconstruction of $y$ by $\mathcal{A}(\Phi y;\Phi)$, and as a result, bounding $\|\mathcal{A}(\Phi y;\Phi)-y\|_{2}$ becomes more challenging.

Examples of the for-each scheme include count sketch [18] and count-min sketch [33].

For-all scheme:

In this scheme, one generates a single sensing matrix $\Phi\in\mathbb{R}^{Q\times d}$ that is used for the reconstruction of all possible signals $x\in\mathbb{R}^{d}$. As already mentioned, the restricted isometry property (see Definition 1) has been proposed as a condition on $\Phi$ under the for-all scheme. Intuitively, the restricted isometry property ensures that the linear measurements $y=\Phi x$ can discriminate sparse signals: suppose $\Phi\in\mathbb{R}^{Q\times d}$ satisfies $(2K,\delta)$-RIP for some $K$ and $\delta\in(0,1)$. Then for any $x_{1},x_{2}\in\mathbb{R}^{d}$ with $\|x_{1}\|_{0}\leq K$ and $\|x_{2}\|_{0}\leq K$, we have

$$\|\Phi(x_{1}-x_{2})\|_{2}\geq\sqrt{1-\delta}\cdot\|x_{1}-x_{2}\|_{2},$$

indicating that the map $x\mapsto\Phi x$ is an injection from $\{x\in\mathbb{R}^{d}:\|x\|_{0}\leq K\}$ to $\mathbb{R}^{Q}$. Consequently, given the measurement vector $y=\Phi x$, the solution to (2) is always unique and equal to $x$ as long as $\|x\|_{0}\leq K$. This explains why RIP can provide reconstruction guarantees under the for-all scheme in the ideal scenario (i.e., the original vector $x$ is strictly sparse, the measurement is noiseless, and the solution to the nonconvex problem (2) can be obtained).

Extending the above discussion to more general scenarios is not trivial, but the existing literature has established reconstruction guarantees for various compressed sensing algorithms when the original vector $x$ is only approximately sparse or the linear measurement by $\Phi$ is noisy. One typical form of guarantee for compressed sensing algorithms under the for-all scheme is as follows:

Suppose $\Phi\in\mathbb{R}^{Q\times d}$ satisfies $(cK,\delta)$-RIP for some $c>1$ and sufficiently small $\delta>0$. Then there exist $C_{1}>0$ and $C_{2}>0$ depending on $\delta$, such that

$$\left\|\mathcal{A}(\Phi x+w;\Phi)-x\right\|_{2}\leq\frac{C_{1}}{\sqrt{K}}\big\|x-x^{[K]}\big\|_{1}+C_{2}\|w\|_{2}$$

for all $x\in\mathbb{R}^{d}$, where $\mathcal{A}(\Phi x+w;\Phi)$ denotes the signal reconstructed from the (noisy) measurement vector $\Phi x+w$ and the sensing matrix $\Phi$.

Examples of the for-all scheme include $\ell_{1}$-minimization [34] and various greedy algorithms [35, 27, 36]; see also [24]. We also mention that there are variants of RIP that have been used for compressed sensing with sparse sensing matrices [37, 23].

Finally, we note that in practice, sensing matrices satisfying RIP are usually generated by randomized methods. For example, it has been shown that a $Q\times d$ matrix with i.i.d. standard Gaussian entries satisfies the $(K,\delta)$-RIP for $Q\geq O(K\log(d/K))$ with high probability [38]. In this work, we generate the sensing matrix $\Phi$ based on Proposition 1 so that $\Phi$ has low storage and transmission cost.

Appendix E The Fast Iterative Hard Thresholding (FIHT) Algorithm

We introduce some additional notation before presenting FIHT. For any $x\in\mathbb{R}^{d}$, we let $\mathsf{Supp}(x)$ denote the set of indices of the nonzero entries of $x$, and let $\mathsf{PrincipalSupp}(x;K)$ denote the set of indices corresponding to the top-$K$ entries of $x$ in magnitude. Given $x\in\mathbb{R}^{d}$ and an index set $S\subseteq\{1,\ldots,d\}$, we use $\mathsf{Proj}(x,S)$ to denote the $d$-dimensional vector that keeps the entries of $x$ with indices in $S$ and sets the other entries to zero.

The Fast Iterative Hard Thresholding (FIHT) algorithm [27] is outlined as follows:

Input: measurement $y$, sensing matrix $\Phi$, number of nonzero entries in the output $K$
Initialize: $g(0)=0\in\mathbb{R}^{d}$, $w(0)=\Phi^{\top}y$, $\Omega(0)=\mathsf{PrincipalSupp}(w(0),K)$, $g(1)=\mathsf{Proj}(w(0),\Omega(0))$, $s=1$
repeat
        if $s=1$ then
                $\tau(s)=0$
        else
                $\tau(s)=\dfrac{\langle y-\Phi g(s),\,\Phi(g(s)-g(s-1))\rangle}{\|\Phi(g(s)-g(s-1))\|^{2}}$
        end if
        $w(s)=g(s)+\tau(s)(g(s)-g(s-1))$
        $r^{w}(s)=\Phi^{\top}(y-\Phi w(s))$
        $\Gamma(s)=\mathsf{Supp}(w(s))$
        $\tilde{\alpha}(s)=\dfrac{\left\|\mathsf{Proj}(r^{w}(s),\Gamma(s))\right\|^{2}}{\left\|\Phi\cdot\mathsf{Proj}(r^{w}(s),\Gamma(s))\right\|^{2}}$
        $h(s)=w(s)+\tilde{\alpha}(s)\,r^{w}(s)$
        $\Omega(s)=\mathsf{PrincipalSupp}(h(s),K)$
        $\tilde{g}(s)=\mathsf{Proj}(h(s),\Omega(s))$
        $r(s)=\Phi^{\top}(y-\Phi\tilde{g}(s))$
        $\alpha(s)=\dfrac{\left\|\mathsf{Proj}(r(s),\Omega(s))\right\|^{2}}{\left\|\Phi\cdot\mathsf{Proj}(r(s),\Omega(s))\right\|^{2}}$
        $g(s+1)=\tilde{g}(s)+\alpha(s)\cdot\mathsf{Proj}(r(s),\Omega(s))$
        $s\leftarrow s+1$
until stopping criterion satisfied
Output: $g(s)$
Algorithm 2 Fast Iterative Hard Thresholding (FIHT)
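
To make the iteration concrete, below is a minimal NumPy sketch of Algorithm 2; it is illustrative rather than our exact implementation, and the helper names and numerical safeguards are ours. It uses dense matrix-vector products with $\Phi$ and a simple norm-based stopping rule; the fast DCT/WHT-based products and the full stopping criterion used in our experiments are described below.

import numpy as np

def principal_supp(x, K):
    # indices of the top-K entries of x in magnitude
    return np.argpartition(np.abs(x), -K)[-K:]

def proj(x, S):
    # keep entries of x with indices in S, set all other entries to zero
    out = np.zeros_like(x)
    out[S] = x[S]
    return out

def fiht(y, Phi, K, max_iter=25, tol=1e-4, eps=1e-30):
    d = Phi.shape[1]
    g_prev = np.zeros(d)                       # g(0)
    w = Phi.T @ y                              # w(0) = Phi^T y
    g = proj(w, principal_supp(w, K))          # g(1)
    for s in range(1, max_iter + 1):
        if s == 1:
            tau = 0.0
        else:
            diff = g - g_prev
            Phi_diff = Phi @ diff
            tau = np.dot(y - Phi @ g, Phi_diff) / (np.dot(Phi_diff, Phi_diff) + eps)
        w = g + tau * (g - g_prev)             # w(s)
        r_w = Phi.T @ (y - Phi @ w)            # r^w(s)
        p = proj(r_w, np.nonzero(w)[0])        # restrict to Supp(w(s))
        alpha_t = np.dot(p, p) / (np.dot(Phi @ p, Phi @ p) + eps)
        h = w + alpha_t * r_w                  # h(s)
        Omega = principal_supp(h, K)
        g_tilde = proj(h, Omega)               # hard thresholding to K entries
        r = Phi.T @ (y - Phi @ g_tilde)        # r(s)
        q = proj(r, Omega)
        alpha = np.dot(q, q) / (np.dot(Phi @ q, Phi @ q) + eps)
        g_prev, g = g, g_tilde + alpha * q     # g(s+1)
        if np.linalg.norm(w) <= tol:           # simple surrogate stopping rule
            break
    return g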

We now elaborate on the details of our implementation of FIHT.

Stopping criterion. In our implementation of FIHT, we stop the iterations whenever any of the following holds:

  • $s\geq 26$ (i.e., the maximum number of iterations is $25$);

  • $\|w(s)\|_{2}\leq 10^{-4}$;

  • $\operatorname{std}\big(\|w(s^{\prime})\|_{2}:s-3\leq s^{\prime}\leq s\big)\leq 0.01\cdot\operatorname{mean}\big(\|w(s^{\prime})\|_{2}:s-3\leq s^{\prime}\leq s\big)$.

    Here the tuple $\big(\|w(s^{\prime})\|_{2}:s-3\leq s^{\prime}\leq s\big)$ records the values of $\|w(s)\|_{2}$ over the last $4$ iterations, and $\operatorname{std}$ and $\operatorname{mean}$ denote the standard deviation and the mean, respectively.
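
In code, a minimal sketch of this check (function name ours) could read as follows, where w_norms is the list of $\|w(s^{\prime})\|_{2}$ values recorded so far:

import numpy as np

def should_stop(s, w_norms, max_iter=25, abs_tol=1e-4, rel_std_tol=0.01):
    if s >= max_iter + 1:                       # at most 25 iterations
        return True
    if w_norms[-1] <= abs_tol:                  # ||w(s)||_2 is small
        return True
    if len(w_norms) >= 4:                       # last 4 values nearly constant
        last4 = np.array(w_norms[-4:])
        if last4.std() <= rel_std_tol * last4.mean():
            return True
    return False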

Multiplication of $\Phi$ with a vector. We generate the sensing matrix $\Phi$ from an orthogonal matrix $B$ as in Proposition 1, where the “base matrix” $B$ is chosen to be either the DCT matrix

B_{ij}=\sqrt{\frac{2}{d}}\cdot\frac{1}{\sqrt{1+\delta_{1,i}}}\cos\frac{\pi(i-1)(2j-1)}{2d},\quad 1\leq i,j\leq d,

or, when $d$ is a power of $2$, the WHT matrix $B=H^{(\log_{2}d)}$ defined recursively by

H^{(0)}=1,\qquad H^{(k)}=\frac{1}{\sqrt{2}}\begin{bmatrix}H^{(k-1)}&H^{(k-1)}\\ H^{(k-1)}&-H^{(k-1)}\end{bmatrix},\quad k\geq 1.

Since both the DCT and the WHT admit fast algorithms of time complexity $O(d\log d)$, the matrix-vector products $\Phi u$ and $\Phi^{\top}u$ can also be computed in $O(d\log d)$ time. Specifically, suppose $\Sigma\subseteq\{1,\ldots,d\}$ is the index set of the $Q$ rows of $B$ that form $\Phi$. Taking the DCT as an example, for any $u\in\mathbb{R}^{d}$, we can compute $\Phi u$ as follows:

  1. $\widehat{u}\leftarrow\sqrt{d/Q}\cdot\mathsf{FastDCT}(u)$;

  2. $\widehat{u}_{j}\leftarrow 0$ for each $j\in\Sigma^{c}$.

Then $\widehat{u}$ is a vector whose entries indexed by $\Sigma$ are exactly the entries of $\Phi u$, while all other entries are zero. Similarly, for any $u\in\mathbb{R}^{Q}$, we can compute $\Phi^{\top}u$ by the following steps:

  1. $\tilde{u}\leftarrow 0\in\mathbb{R}^{d}$;

  2. $\tilde{u}_{\Sigma}\leftarrow u$, where $\tilde{u}_{\Sigma}$ denotes the subvector of $\tilde{u}$ with indices in $\Sigma$;

  3. $v\leftarrow\sqrt{d/Q}\cdot\mathsf{FastInverseDCT}(\tilde{u})$.

Then we have $v=\Phi^{\top}u$. Note that if we employ the WHT instead of the DCT, then both $\mathsf{FastDCT}$ and $\mathsf{FastInverseDCT}$ can be replaced by $\mathsf{FastWHT}$, since the WHT matrix is symmetric.
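
As a minimal sketch consistent with the steps above (so that $\Phi=\sqrt{d/Q}\,B_{\Sigma}$ with $B$ the orthonormal DCT matrix; the function names are ours), these two fast products can be implemented with SciPy's DCT routines as follows. Here $\Phi u$ is returned directly as a vector in $\mathbb{R}^{Q}$ rather than embedded in $\mathbb{R}^{d}$.

import numpy as np
from scipy.fft import dct, idct

def phi_mul(u, Sigma, d, Q):
    # Phi @ u, where Phi consists of the rows of sqrt(d/Q)*B indexed by Sigma
    u_hat = np.sqrt(d / Q) * dct(u, norm='ortho')   # sqrt(d/Q) * B u via the fast DCT
    return u_hat[Sigma]                             # keep only the rows in Sigma

def phi_t_mul(u, Sigma, d, Q):
    # Phi^T @ u: embed u on Sigma and apply the fast inverse DCT (B^T = B^{-1})
    u_tilde = np.zeros(d)
    u_tilde[Sigma] = u
    return np.sqrt(d / Q) * idct(u_tilde, norm='ortho')

With this construction, each product costs $O(d\log d)$ time, and only the index set $\Sigma$ (or a random seed generating it) needs to be stored or transmitted.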

References

  • [1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
  • [2] S. Magnússon, C. Enyioha, N. Li, C. Fischione, and V. Tarokh, “Communication complexity of dual decomposition methods for distributed resource allocation optimization,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 4, pp. 717–732, 2018.
  • [3] C.-S. Lee, N. Michelusi, and G. Scutari, “Finite rate quantized distributed optimization with geometric convergence,” in 52nd Asilomar Conference on Signals, Systems, and Computers, 2018, pp. 1876–1880.
  • [4] A. Reisizadeh, A. Mokhtari, H. Hassani, and R. Pedarsani, “An exact quantized decentralized gradient descent algorithm,” IEEE Transactions on Signal Processing, vol. 67, no. 19, pp. 4934–4947, 2019.
  • [5] S. U. Stich, “Local SGD converges fast and communicates little,” arXiv preprint arXiv:1805.09767, 2018.
  • [6] J. Wang and G. Joshi, “Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms,” arXiv preprint arXiv:1808.07576, 2018.
  • [7] H. Yu, S. Yang, and S. Zhu, “Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 5693–5700.
  • [8] J. Sun, T. Chen, G. B. Giannakis, Q. Yang, and Z. Yang, “Lazily aggregated quantized gradient innovation for communication-efficient federated learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [9] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [10] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: Compressed optimisation for non-convex problems,” in Proceedings of the 35th International Conference on Machine Learning, 2018, pp. 560–569.
  • [11] S. Magnússon, H. Shokri-Ghadikolaei, and N. Li, “On maintaining linear convergence of distributed learning and optimization under limited communication,” IEEE Transactions on Signal Processing, vol. 68, pp. 6101–6116, 2020.
  • [12] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified SGD with memory,” in Advances in Neural Information Processing Systems, vol. 31.   Curran Associates, Inc., 2018.
  • [13] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli, “The convergence of sparsified gradient methods,” in Advances in Neural Information Processing Systems, vol. 31.   Curran Associates, Inc., 2018.
  • [14] J. Zhang, N. Li, and M. Dedeoglu, “Federated learning over wireless networks: A band-limited coordinated descent approach,” in IEEE Conference on Computer Communications, 2021.
  • [15] S. P. Karimireddy, Q. Rebjock, S. Stich, and M. Jaggi, “Error feedback fixes signSGD and other gradient compression schemes,” in International Conference on Machine Learning, 2019, pp. 3252–3261.
  • [16] N. Ivkin, D. Rothchild, E. Ullah, V. Braverman, I. Stoica, and R. Arora, “Communication-efficient distributed SGD with sketching,” in Advances in Neural Information Processing Systems, vol. 32.   Curran Associates, Inc., 2019.
  • [17] D. Rothchild, A. Panda, E. Ullah, N. Ivkin, I. Stoica, V. Braverman, J. Gonzalez, and R. Arora, “FetchSGD: Communication-efficient federated learning with sketching,” in International Conference on Machine Learning, 2020, pp. 8253–8265.
  • [18] M. Charikar, K. Chen, and M. Farach-Colton, “Finding frequent items in data streams,” in International Colloquium on Automata, Languages, and Programming.   Springer, 2002, pp. 693–703.
  • [19] H. Cai, D. Mckenzie, W. Yin, and Z. Zhang, “Zeroth-order regularized optimization (ZORO): Approximately sparse gradients and adaptive sampling,” arXiv preprint arXiv:2003.13001, 2020.
  • [20] E. J. Candès and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 21–30, 2008.
  • [21] B. K. Natarajan, “Sparse approximate solutions to linear systems,” SIAM Journal on Computing, vol. 24, no. 2, pp. 227–234, 1995.
  • [22] G. Davis, S. Mallat, and M. Avellaneda, “Adaptive greedy approximations,” Constructive Approximation, vol. 13, no. 1, pp. 57–98, 1997.
  • [23] A. Gilbert and P. Indyk, “Sparse recovery using sparse matrices,” Proceedings of the IEEE, vol. 98, no. 6, pp. 937–947, 2010.
  • [24] S. Foucart, “Sparse recovery algorithms: Sufficient conditions in terms of restricted isometry constants,” in Approximation Theory XIII: San Antonio 2010.   Springer, 2012, pp. 65–77.
  • [25] M. E. Lopes, “Unknown sparsity in compressed sensing: Denoising and inference,” IEEE Transactions on Information Theory, vol. 62, no. 9, pp. 5145–5166, 2016.
  • [26] I. Haviv and O. Regev, “The restricted isometry property of subsampled Fourier matrices,” in Geometric aspects of functional analysis.   Springer, 2017, pp. 163–179.
  • [27] K. Wei, “Fast iterative hard thresholding for compressed sensing,” IEEE Signal Processing Letters, vol. 22, no. 5, pp. 593–597, 2014.
  • [28] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt, “Practical and optimal LSH for angular distance,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, 2015, pp. 1225–1233, full version available at arXiv:1509.02897.
  • [29] A. C. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin, “One sketch for all: fast algorithms for compressed sensing,” in Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, 2007, pp. 237–246.
  • [30] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
  • [31] L. Dalcín, R. Paz, M. Storti, and J. D’Elía, “MPI for Python: Performance improvements and MPI-2 extensions,” Journal of Parallel and Distributed Computing, vol. 68, no. 5, pp. 655–662, 2008.
  • [32] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, “Open MPI: Goals, concept, and design of a next generation MPI implementation,” in Proceedings, 11th European PVM/MPI Users’ Group Meeting, 2004, pp. 97–104.
  • [33] G. Cormode and S. Muthukrishnan, “An improved data stream summary: the count-min sketch and its applications,” Journal of Algorithms, vol. 55, no. 1, pp. 58–75, 2005.
  • [34] E. J. Candes and T. Tao, “Decoding by linear programming,” IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
  • [35] D. Needell and J. A. Tropp, “CoSaMP: Iterative signal recovery from incomplete and inaccurate samples,” Applied and Computational Harmonic Analysis, vol. 26, no. 3, pp. 301–321, 2009.
  • [36] J. D. Blanchard, J. Tanner, and K. Wei, “CGIHT: Conjugate gradient iterative hard thresholding for compressed sensing and matrix completion,” Information and Inference: A Journal of the IMA, vol. 4, no. 4, pp. 289–327, 2015.
  • [37] R. Berinde, A. C. Gilbert, P. Indyk, H. Karloff, and M. J. Strauss, “Combining geometry and combinatorics: A unified approach to sparse signal recovery,” in 2008 46th Annual Allerton Conference on Communication, Control, and Computing.   IEEE, 2008, pp. 798–805.
  • [38] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, “A simple proof of the restricted isometry property for random matrices,” Constructive Approximation, vol. 28, no. 3, pp. 253–263, 2008.