Communication-Efficient Distributed SGD with Compressed Sensing
Abstract
We consider large-scale distributed optimization over a set of edge devices connected to a central server, where the limited communication bandwidth between the server and the edge devices imposes a significant bottleneck on the optimization procedure. Inspired by recent advances in federated learning, we propose a distributed stochastic gradient descent (SGD) type algorithm that exploits the sparsity of the gradient, when possible, to reduce the communication burden. At the heart of the algorithm is the use of compressed sensing techniques: the local stochastic gradients are compressed at the device side, and at the server side a sparse approximation of the global stochastic gradient is recovered from the noisy aggregated compressed local gradients. We analyze the convergence of our algorithm in the presence of the noise perturbation incurred by the communication channels, and conduct numerical experiments to corroborate its effectiveness.
1 Introduction
Large-scale distributed stochastic optimization plays a fundamental role in the recent advances of machine learning, allowing models of vast size to be trained on massive datasets by multiple machines. Meanwhile, the past few years have witnessed an explosive growth of networks of IoT devices such as smartphones, self-driving cars, robots, and unmanned aerial vehicles (UAVs), which are capable of data collection and processing for many learning tasks. In many of these applications, due to privacy concerns, it is preferable that the local edge devices learn the model by cooperating with the central server without sending their own data to the server. Moreover, the communication between the edge devices and the server often goes through wireless channels, which are lossy and unreliable in nature and have limited bandwidth, imposing significant challenges, especially for high-dimensional problems.
To address the communication bottlenecks, researchers have investigated communication-efficient distributed optimization methods for large-scale problems, for both the device-server setting [1, 2] and the peer-to-peer setting [3, 4]. In this paper, we consider the device-server setting where a group of edge devices are coordinated by a central server.
Most existing techniques for the device-server setting can be classified into two categories. The first category aims to reduce the number of communication rounds, based on the idea that each edge device runs multiple local SGD steps in parallel before sending its local update to the server for aggregation. This approach is known as FedAvg [1] in federated learning, and its convergence has been studied in [5, 6, 7]. Another line of work investigates lazy/adaptive uploading of information, i.e., local gradients are uploaded only when they are found to be sufficiently informative [8].
The second category focuses on efficient compression of gradient information transmitted from edge devices to the server. Commonly adopted compression techniques include quantization [9, 10, 11] and sparsification [12, 13, 14]. These techniques can be further classified according to whether the gradient compression yields biased [9, 14] or unbiased [10, 13] gradient estimators. To handle the bias and boost convergence, [12, 15] introduced the error feedback method that accumulates and corrects the error caused by gradient compression at each step.
Two recent papers [16, 17] employ sketching methods for gradient compression. Specifically, each device compresses its local stochastic gradient by count sketch [18] via a common sketching operator, and the server recovers the indices and the values of the large entries of the aggregated stochastic gradient from the gradient sketches. However, the theoretical guarantees of count sketch were developed for recovering one fixed signal by randomly generating a sketching operator from a given probability distribution. During SGD, the gradient signals are constantly changing, making it impractical to generate a new sketching operator for every SGD iteration. Thus these papers apply a single sketching operator to all the gradients throughout the optimization procedure, thereby sacrificing theoretical guarantees. Further, there is limited understanding of the performance when there is transmission error/noise over the uploading links.
Our Contributions. We propose a distributed SGD-type algorithm that employs compressed sensing for gradient compression. Specifically, we adopt compressed sensing techniques for the compression of the local stochastic gradients at the device side, and for the reconstruction of the aggregated stochastic gradient at the server side. The use of compressed sensing enables the server to approximately identify the top entries of the aggregated gradient without directly querying each local gradient. Our algorithm also integrates an error feedback strategy at the server side to handle the bias introduced by compression, while keeping the edge devices stateless. We provide a convergence analysis of our algorithm in the presence of additive noise incurred by the uploading communication channels, and conduct numerical experiments that justify its effectiveness.
Besides the related work discussed above, it is worth noting that a recent paper [19] uses compressed sensing for zeroth-order optimization, which exhibits a mathematical structure similar to that of this study. However, [19] considers the centralized setting and only establishes convergence to a neighborhood of the minimizer.
Notations: For $x\in\mathbb{R}^d$, $\|x\|$ denotes its $\ell_2$-norm, and $x^{(k)}$ denotes its best-$k$ approximation, i.e., the vector that keeps the top $k$ entries of $x$ in magnitude with the other entries set to $0$.
2 Problem Setup
Consider a group of $n$ edge devices and a server. Each device $i$ is associated with a differentiable local objective function $f_i:\mathbb{R}^d\to\mathbb{R}$, and is able to query a stochastic gradient $g_i(x)$ satisfying $\mathbb{E}[g_i(x)] = \nabla f_i(x)$. Between each device and the server are an uploading communication link and a broadcasting communication link. The goal is to solve
$$\min_{x\in\mathbb{R}^d}\ f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) \qquad (1)$$
through queries of stochastic gradients at each device and exchange of information between the server and each device.
One common approach for our problem setup is the stochastic gradient descent (SGD) method: at each time step $t$, the server first broadcasts the current iterate $x^t$ to all devices, each device $i$ then produces a stochastic gradient $g_i^t = g_i(x^t)$ and uploads it to the server, after which the server updates $x^{t+1} = x^t - \eta_t\cdot\frac{1}{n}\sum_{i=1}^{n} g_i^t$. However, as the server needs to collect the local stochastic gradients from every device at each iteration, vanilla SGD may encounter a significant bottleneck imposed by the uploading links if the dimension $d$ is very large. This issue may be further exacerbated if the server and the devices are connected via lossy wireless networks of limited bandwidth, which is the case for many IoT applications.
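For concreteness, the following is a minimal sketch of this vanilla distributed SGD loop in Python/NumPy; the quadratic local objectives, the dimensions, and the step size are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta, T = 100, 8, 0.1, 200              # dimension, #devices, step size, #iterations

# Illustrative local objectives f_i(x) = 0.5 * ||x - c_i||^2 (assumed for this sketch).
centers = [rng.normal(size=d) for _ in range(n)]

def stochastic_gradient(i, x):
    """Device i returns an unbiased, noisy gradient of its local objective f_i."""
    return (x - centers[i]) + 0.01 * rng.normal(size=d)

x = np.zeros(d)
for t in range(T):
    # Server broadcasts x; each device uploads its full d-dimensional stochastic gradient.
    grads = [stochastic_gradient(i, x) for i in range(n)]
    # Server aggregates and takes a gradient step.
    x = x - eta * np.mean(grads, axis=0)
```

Every round requires each device to upload $d$ numbers, which is exactly the uplink cost that the compressed scheme below is designed to reduce.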
In this work, we investigate the situation where the communication links, particularly the uploading links from each edge device to the server, have limited bandwidth that can significantly slow down the whole optimization procedure; the data transmitted through each uploading link may also be corrupted by noise. Our goal is to develop an SGD-type algorithm for solving (1) that achieves better communication efficiency over the uploading links.
3 Algorithm
Our algorithm is outlined in Algorithm 1, which is based on the SGD method with the following major ingredients:
1. Compression of local stochastic gradients using compressed sensing techniques. Here each edge device $i$ compresses its local stochastic gradient $g_i^t$ by computing $A g_i^t$ before uploading it to the server. The matrix $A\in\mathbb{R}^{m\times d}$ is called the sensing matrix, and its number of rows $m$ is strictly less than its number of columns $d$. As a result, the communication burden of uploading the local gradient information is reduced.
We emphasize that Algorithm 1 employs the for-all scheme of compressed sensing, which allows a single sensing matrix $A$ to be used for the compression of all local stochastic gradients (see Section 3.1 for more details on the for-each and the for-all schemes).
After collecting the compressed local gradients and obtaining their noise-corrupted aggregate, the server recovers a sparse vector by a compressed sensing algorithm, which is then used for updating the iterate.
2. Error feedback of compressed gradients. In general, the compressed sensing reconstruction introduces a nonzero bias into the SGD iterations that hinders convergence. To handle this bias, we adopt the error feedback method of [12, 15] and modify it in a manner similar to FetchSGD [17]. The resulting error feedback procedure is carried out purely at the server side, without knowledge of the true aggregated stochastic gradients.
Note that the aggregated vector is corrupted by additive noise from the uploading links. This noise model incorporates a variety of communication schemes, including digital transmission with quantization, and over-the-air transmission for wireless multi-access networks [14].
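To make the data flow concrete, below is a hedged sketch of one communication round, with the error accumulator kept in the compressed domain in the spirit of FetchSGD [17]. The function names, the noise model, and the placement of the step size are illustrative assumptions and do not reproduce the exact Algorithm 1.

```python
import numpy as np

def server_round(x, acc, A, local_grads, eta, cs_reconstruct, rng, noise_std=0.0):
    """One communication round of a compressed-sensing SGD sketch (illustrative only).

    local_grads:    list of callables; local_grads[i](x) returns a stochastic gradient of f_i.
    cs_reconstruct: callable (A, y) -> sparse vector z such that A @ z approximates y.
    acc:            server-side error accumulator, kept in the compressed (m-dimensional) domain.
    """
    m, _ = A.shape
    # Devices upload compressed local gradients; the server only sees their (noisy) average.
    y = sum(A @ g(x) for g in local_grads) / len(local_grads)
    y = y + noise_std * rng.normal(size=m)      # additive uplink channel noise
    acc = acc + eta * y                         # accumulate the compressed update
    delta = cs_reconstruct(A, acc)              # sparse approximation of the accumulated update
    acc = acc - A @ delta                       # error feedback: retain what was not applied
    return x - delta, acc                       # server updates the iterate and broadcasts delta
```

Keeping the accumulator in the compressed domain means the server never needs to form the $d$-dimensional aggregated gradient explicitly.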
We now provide more details on our algorithm design.
3.1 Preliminaries on Compressed Sensing
Compressed sensing [20] is a technique that allows efficient sensing and reconstruction of an approximately sparse signal. Mathematically, in the sensing step, a signal $x\in\mathbb{R}^d$ is observed through the linear measurement $y = Ax + e$, where $A\in\mathbb{R}^{m\times d}$ is a pre-specified sensing matrix with $m<d$, and $e\in\mathbb{R}^m$ is additive noise. Then, in the reconstruction step, one recovers the original signal by approximately solving
$$\min_{z\in\mathbb{R}^d}\ \|y - Az\|_2 \quad \text{subject to}\quad \|z\|_0 \le s, \qquad (2)$$
$$\min_{z\in\mathbb{R}^d}\ \|z\|_0 \quad \text{subject to}\quad \|y - Az\|_2 \le \eta, \qquad (3)$$
where $\|z\|_0$ denotes the number of nonzero entries of $z$ and $s$ restricts the number of nonzero entries allowed in the reconstructed signal.
Both (2) and (3) are NP-hard nonconvex problems [21, 22], and researchers have proposed various compressed sensing algorithms for obtaining approximate solutions. As discussed below, the reconstruction error depends heavily on (i) the design of the sensing matrix $A$, and (ii) whether the signal can be well approximated by a sparse vector.
Design of the sensing matrix $A$. Compressed sensing algorithms can be categorized into two schemes [23]: (i) the for-each scheme, in which a probability distribution over sensing matrices is designed to provide the desired reconstruction guarantee for a fixed signal, so that every time a new signal is to be measured and reconstructed, one needs to randomly generate a new $A$; and (ii) the for-all scheme, in which a single $A$ is used for the sensing and reconstruction of all possible signals (a more detailed explanation of the two schemes can be found in Appendix D). We mention that count sketch is an example of a for-each scheme algorithm. In this paper, we choose the for-all scheme so that the server does not need to send a new matrix to each device at every iteration.
To ensure that the linear measurement can discriminate approximately sparse signals, researchers have proposed the restricted isometry property (RIP) [20] as a condition on $A$:
Definition 1.
We say that $A$ satisfies the $(s,\delta)$-restricted isometry property if
$$(1-\delta)\|x\|^2 \le \|Ax\|^2 \le (1+\delta)\|x\|^2$$
for any $x\in\mathbb{R}^d$ that has at most $s$ nonzero entries.
The restricted isometry property of $A$ is fundamental for analyzing the reconstruction error of many compressed sensing algorithms under the for-all scheme [24].
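As a quick empirical illustration of Definition 1 (not a proof of the RIP), the snippet below samples random $s$-sparse vectors and checks how far a random Gaussian sensing matrix, of the kind mentioned in Appendix D, is from an isometry on them; all dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, s = 1024, 256, 10

# Gaussian sensing matrix scaled so that E[||A x||^2] = ||x||^2.
A = rng.normal(size=(m, d)) / np.sqrt(m)

ratios = []
for _ in range(1000):
    x = np.zeros(d)
    idx = rng.choice(d, size=s, replace=False)
    x[idx] = rng.normal(size=s)                 # random s-sparse test vector
    ratios.append(np.linalg.norm(A @ x) ** 2 / np.linalg.norm(x) ** 2)

# For the (s, delta)-RIP, all such ratios would need to lie in [1 - delta, 1 + delta];
# random sampling only gives an optimistic empirical range.
print(min(ratios), max(ratios))
```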
Metric of sparsity. The classical metric of sparsity is the $\ell_0$ norm, defined as the number of nonzero entries. However, in our setup the vectors to be compressed are in general only approximately sparse, which cannot be handled by the $\ell_0$ norm since it is not stable under small perturbations. Here, we adopt the following sparsity metric from [25] (compared with [25], we add a scaling factor so that $\sigma(x)\le 1$):
$$\sigma(x) := \frac{\|x\|_1^2}{d\,\|x\|_2^2}, \qquad x\ne 0. \qquad (4)$$
The continuity of $\sigma$ indicates that it is robust to small perturbations of $x$, and it can be shown that $\sigma$ is Schur-concave, meaning that it can characterize the approximate sparsity of a signal. This metric has also been used in [25] for the performance analysis of compressed sensing algorithms.
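The metric is cheap to evaluate; a minimal Python helper, assuming the normalized form shown in (4) above, is:

```python
import numpy as np

def approx_sparsity(x):
    """Normalized sparsity metric ||x||_1^2 / (d * ||x||_2^2), assuming the form in (4).
    Values near 1/d indicate a nearly 1-sparse vector; values near 1 indicate a flat vector."""
    x = np.asarray(x, dtype=float)
    return np.linalg.norm(x, 1) ** 2 / (x.size * np.linalg.norm(x, 2) ** 2)

print(approx_sparsity([1, 0, 0, 0]))   # 0.25  (1-sparse vector in dimension 4)
print(approx_sparsity([1, 1, 1, 1]))   # 1.0   (maximally dense, flat vector)
```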
3.2 Details of Algorithm Design
Generation of $A$. As mentioned before, we choose compressed sensing under the for-all scheme for gradient compression and reconstruction. We require that the sensing matrix $A$ have a low storage cost, since it will be transmitted to and stored at each device; $A$ should also satisfy the RIP so that the compressed sensing algorithm has good reconstruction performance. The following proposition suggests a storage-friendly approach for generating matrices satisfying the RIP.
Proposition 1 ([26]).
Let $B\in\mathbb{R}^{d\times d}$ be an orthogonal matrix whose entries have absolute value $O(1/\sqrt{d})$, and let $\delta>0$ be sufficiently small. For some $m = \tilde{O}(\delta^{-2}\, s\log d)$ (the notation $\tilde{O}$ hides logarithmic dependence on $s$ and $1/\delta$), let $A$ be the $m\times d$ matrix whose rows are chosen uniformly and independently from the rows of $B$, multiplied by $\sqrt{d/m}$. Then, with high probability, $A$ satisfies the $(s,\delta)$-RIP.
This proposition indicates that we can choose a "base matrix" $B$ satisfying the conditions of Proposition 1, and then randomly choose $m$ of its rows to form $A$. In this way, $A$ can be stored or transmitted merely via the corresponding row indices in $B$. Note that Proposition 1 only requires $m$ to have logarithmic dependence on $d$. Candidates for the base matrix include the discrete cosine transform (DCT) matrix and the Walsh-Hadamard transform (WHT) matrix, as both the DCT and the WHT and their inverses have fast algorithms of time complexity $O(d\log d)$, implying that the multiplication of $A$ or $A^{\mathsf T}$ with any vector can be carried out in $O(d\log d)$ time.
Choice of the compressed sensing algorithm. We use the Fast Iterative Hard Thresholding (FIHT) algorithm [27] for reconstruction (see Appendix E for a brief summary). Our experiments suggest that FIHT achieves a good balance between computational efficiency and empirical reconstruction error compared to the other algorithms we have tried.
We note that FIHT has a tunable parameter $k$ that controls the number of nonzero entries of the reconstructed vector. This parameter should accord with the sparsity of the vector to be recovered (see Section 3.3 for theoretical results). In addition, the server can broadcast the sparse reconstructed update instead of the whole iterate for the edge devices to update their local copies, which saves communication over the broadcasting links.
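Since the reconstructed update is $k$-sparse, broadcasting it as an (indices, values) pair costs roughly $2k$ numbers instead of $d$. A minimal sketch of this encoding (the function names and the update direction are illustrative assumptions):

```python
import numpy as np

def encode_sparse(delta):
    """Encode a k-sparse update as (indices, values) for a bandwidth-friendly broadcast."""
    idx = np.nonzero(delta)[0].astype(np.int32)
    return idx, delta[idx]

def apply_sparse_update(x, idx, vals):
    """Each device applies the broadcast update to its local copy of the iterate."""
    x = x.copy()
    x[idx] -= vals          # assumes the server update has the form x <- x - delta
    return x
```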
Error feedback. We adopt error feedback to facilitate the convergence of Algorithm 1. The following lemma verifies that Algorithm 1 indeed incorporates the error feedback steps of [15]; the proof is straightforward and omitted here.
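For reference, the generic (single-compressor) error-feedback template of [15] reads as follows, where $\mathcal{C}$ denotes the compression operator, $g_t$ the stochastic gradient, and $\gamma$ the step size; the notation here is generic and not that of Algorithm 1:
$$p_t = \gamma\, g_t + e_t, \qquad \Delta_t = \mathcal{C}(p_t), \qquad x_{t+1} = x_t - \Delta_t, \qquad e_{t+1} = p_t - \Delta_t.$$
The residual $e_t$ stores the part of the update that compression discarded, so that it can be re-injected in later iterations.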
Lemma 1.
3.3 Theoretical Analysis
First, we make the following assumptions on the objective function $f$, the stochastic gradients, and the communication channel noise:
Assumption 1.
The function $f$ is $L$-smooth over $\mathbb{R}^d$, i.e., there exists $L>0$ such that $\|\nabla f(x) - \nabla f(y)\| \le L\|x-y\|$ for all $x,y\in\mathbb{R}^d$.
Assumption 2.
Denote by $g^t$ the aggregated stochastic gradient at iteration $t$. There exists $G>0$ such that $\mathbb{E}\big[\|g^t\|^2\big] \le G^2$ for all $t$.
Assumption 3.
The communication channel noise $\epsilon^t$ satisfies $\mathbb{E}\big[\|\epsilon^t\|^2\big] \le \sigma_\epsilon^2$ for each $t$.
Our theoretical analysis will be based on the following result on the reconstruction error of FIHT:
Lemma 2 ([27, Corollary I.3]).
Let $k$ be the maximum number of nonzero entries of the output of FIHT. Suppose the sensing matrix $A$ satisfies the RIP of Definition 1 with a sufficiently small constant $\delta$. Then, for any input signal and any additive measurement noise,
(6) |
where and are positive constants that depend on .
We are now ready to establish convergence of Algorithm 1.
Theorem 1.
Let $k$ be the maximum number of nonzero entries of the output of FIHT. Suppose the sensing matrix $A$ satisfies the RIP of Definition 1 with a sufficiently small constant $\delta$. Furthermore, assume that
(7) |
for all $t$ and for some constant, where the quantity appearing in (7) is defined in Lemma 1. Then, for sufficiently large $T$, by choosing the step size appropriately, we have that
In addition, if $f$ is convex and has a minimizer $x^\star$, then we further have
where and .
Remark 1.
Theorem 1 requires the sparsity metric in condition (7) to remain sufficiently low. This condition is hard to check and can be violated in practice (see Section 4). However, our numerical experiments suggest that even if condition (7) is violated, Algorithm 1 may still exhibit relatively good convergence behavior as long as the gradients remain reasonably sparse, i.e., their sparsity metric stays relatively low. A theoretical investigation of these observations is an interesting future direction.
4 Numerical Results
4.1 Test Case with Synthetic Data
We first conduct numerical experiments on a test case with synthetic data. Here we set the dimension to be and the number of edge devices to be . The local objective functions are of the form , where each is a diagonal matrix, and we denote . We generate such that the diagonal entries of is given by for each while the diagonal of each is dense. We also let give approximately sparse stochastic gradients for every . We refer to Appendix C for details on the test case.
We test three algorithms: uncompressed vanilla SGD, Algorithm 1, and SGD with count sketch. SGD with count sketch simply replaces the gradient compression and reconstruction of Algorithm 1 by the count sketch method [18]. We set the sparsity parameter $k$ to the same value for both Algorithm 1 and SGD with count sketch. For Algorithm 1, we generate the sensing matrix $A$ from the WHT matrix and use the FFHT library [28] for fast WHT. We use the same step size and set the initial point to be the origin for all three algorithms.
[Figure 1: (a) convergence of the three algorithms without channel noise; (b) evolution of the sparsity metric for Algorithm 1; (c) convergence of Algorithm 1 under different noise levels.]
Figure 1(a) illustrates the convergence of the three algorithms without communication channel noise. For Algorithm 1 and for SGD with count sketch we choose the compressed dimension and the sketch size, respectively, so that Algorithm 1 operates at a higher compression rate. We see that Algorithm 1 has better convergence behavior while also achieving a higher compression rate compared to SGD with count sketch. Our numerical experiments suggest that for approximately sparse signals, FIHT can achieve higher reconstruction accuracy and more aggressive compression than count sketch, and for signals that are not very sparse, FIHT also appears more robust.
Figure 1(b) shows the evolution of the sparsity metric for Algorithm 1. We see that it is small for the first few iterations, and then increases and stabilizes at a higher level, which suggests that condition (7) is likely to have been violated for large $t$. On the other hand, Figure 1(a) shows that Algorithm 1 still achieves relatively good convergence behavior. This indicates a gap between the theoretical results in Section 3.3 and the empirical results, and suggests that our analysis could be improved. We leave the relevant investigation as future work.
Figure 1(c) illustrates the convergence of Algorithm 1 under different levels of communication channel noise, where the entries of the noise are i.i.d. zero-mean Gaussian with increasing variance. We see that the convergence of Algorithm 1 only gradually deteriorates as the noise level increases, suggesting robustness against communication channel noise.
4.2 Test Case of Federated Learning with CIFAR-10 Dataset
We implement our algorithm on a residual network with 668,426 trainable parameters in two different settings. We primarily use TensorFlow and MPI in the implementation of these results (details about the specific experimental setup can be found in Appendix C). In addition, we only present upload compression results here; download compression is not as critical as upload compression (given that download speeds are generally higher than upload speeds), and in our case the download compression rate is determined by the sparsity of the broadcast update. For both settings, we use the CIFAR-10 dataset (60,000 32×32×3 images from 10 classes) with a 50,000/10,000 train/test split.
[Figure 2: training and testing accuracy in the i.i.d. setting under various upload compression rates.]
In the first setting, we instantiate 100 workers and split the CIFAR-10 training dataset such that all local datasets are i.i.d. As seen in Figure 2, our algorithm is able to achieve 2× upload compression with marginal effect on the training and testing accuracy over 50 epochs. As the compression rate increases, the convergence of our method gradually deteriorates (echoing the results in the synthetic case). For comparison, we also show the results of using count sketch, with the sketch dimensions chosen according to the desired compression rate, in lieu of FIHT. In our setting, while uncompressed (1×) count sketch performs well, it is very sensitive to higher compression, diverging at 1.1× and 1.25× compression.
[Figure 3: training and testing accuracy in the non-i.i.d. setting under various upload compression rates.]
In the second setting, we split the CIFAR-10 dataset into 10,000 sets of 5 images of a single class and assign these sets to 10,000 workers. Each epoch consists of 100 rounds, with 1% (100) of the workers participating in each round. In Figure 3, similarly to the i.i.d. setting, we see that the training accuracy of our algorithm gradually deteriorates with higher compression. As is typical of the non-i.i.d. setting, the testing accuracy is not as high as that of the i.i.d. setting. However, we note that FIHT is able to achieve 10× compression with negligible effect on the testing accuracy. In addition, count sketch (with the same sizing rule as above) diverges even for small compression rates in this problem setting.
5 Conclusion
In this paper, we develop a communication-efficient SGD algorithm based on compressed sensing. The algorithm admits several direct variants. For example, the momentum method can be directly incorporated. Also, when the number of devices is very large, the server can choose to query compressed stochastic gradients from a randomly chosen subset of workers.
Our convergence guarantees require the sparsity metric of the gradients to remain persistently low, which is hard to check in practice. The numerical experiments also show that our algorithm can work even if this quantity grows to a relatively high level. This suggests that our theoretical analysis can be further improved, which will be an interesting future direction.
Appendix A Auxiliary Results
In this section, we provide some auxiliary results for the proof of Theorem 1. We first give an alternative form of the reconstruction error derived from the condition (7) and the performance guarantee (6).
Lemma 3.
Proof.
Next, we derive a bound on the second moment of .
Lemma 4.
We have
Proof.
By definition, we have
where the first inequality follows from Lemma 3, and the second inequality follows from the definition of and the assumption that . Notice that
which leads to
By and Assumption 2, we have
Therefore
By summing over and noting that and , we get
which then leads to the desired result. ∎
Appendix B Proof of Theorem 1
Convex case: Denote , and it can be checked that
We then have
By taking the expectation and noting and Assumption 2, we get
which leads to
where in the second inequality we used . Now, we take the average of both sides over and plug in the bound in Lemma 4 to get
By , we can show that for sufficiently large ,
Thus
By the convexity of , we have . Furthermore, since is -smooth, we see that
which leads to . We then get
By subtracting from both sides of the inequality, and using the bound that follows from the convexity of , we get the final bound.
Nonconvex case: Denote , and it can be checked that . Since is -smooth, we get
By taking the expectation and using and Assumption 2, we see that
where in the second inequality we used , and in the last inequality we used the -smoothness of . By taking the telescoping sum, we get
After plugging in the bound in Lemma 4, we get
By choosing and letting be sufficiently large, we have
Finally, we get
which completes the proof.
Appendix C Further Details on Numerical Experiments
For all numerical experiments in this paper, we employ the WHT matrix as the “base matrix” for generating the sensing matrix . We choose WHT instead of DCT here because the library Fast Fast Hadamard Transform (FFHT) [28] provides a heavily optimized C99 implementation of fast WHT, which turns out to be faster than the fast DCT in the Python library SciPy on our computing platforms.
Note that the $D\times D$ WHT matrix is only defined when $D$ is a power of $2$. In order to use the WHT for a general dimension $d$, we let $D = 2^{\lceil \log_2 d\rceil}$ and generate $A$ by randomly choosing $m$ rows from the $D\times D$ WHT matrix as in Proposition 1. Then, for any vector in $\mathbb{R}^d$, we first pad it with $D-d$ zeros to augment its dimension to $D$, and then multiply $A$ with the augmented vector to form the compressed vector of dimension $m$. For the reconstruction, after obtaining the recovered $D$-dimensional signal, we truncate its last $D-d$ entries to get a vector of the original dimension $d$.
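A minimal helper for this padding step (the function names are ours; the 668,426-parameter model from Section 4.2 pads to $2^{20}$):

```python
import numpy as np

def pad_to_pow2(x):
    """Zero-pad x so that its length is the smallest power of two >= len(x),
    as required before applying the Walsh-Hadamard transform."""
    d = len(x)
    D = 1 << (d - 1).bit_length()
    return np.concatenate([x, np.zeros(D - d)])

def truncate(z, d):
    """Drop the padded entries after reconstruction to recover the original dimension d."""
    return z[:d]

print(len(pad_to_pow2(np.zeros(668426))))   # 1048576 = 2**20
```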
Code for the federated learning test case is available at https://github.com/vikr53/cosamp.
C.1 Test Case with Synthetic Data
All experiments on the synthetic test case were run on a MacBook Pro Model A1502 with a 2.9 GHz Dual-Core Intel Core i5 CPU and 16 GB memory. GPU acceleration was not exploited. We use our own implementation of count sketch for the synthetic test case.
Recall that the local objective functions are given by , where each is a diagonal matrix. The ’th diagonal entries of are randomly sampled such that satisfies the Gaussian distribution , where denotes the identity matrix and denotes the vector of all ones. As a result, has diagonal entries given by . The point is randomly generated from the standard Gaussian distribution. We fix each and once they have been generated.
The stochastic gradient will be given by
Here and are i.i.d. random vectors following the standard Gaussian distribution ; the entries of are i.i.d. satisfying the Bernoulli distribution with ; denotes the Hadamard product. We set in our test case.
C.2 Test Case of Federated Learning with CIFAR-10 Dataset
We ran all federated learning experiments using the research computing service provided by the authors' institution, and we did not exploit GPU acceleration. We primarily use TensorFlow [30] as the deep learning framework. We use MPI for Python [31] for parallel computing on multiple processors, and communication between nodes is facilitated by Open MPI [32]. Additionally, we use the CSVec package (available at https://github.com/nikitaivkin/csh) for count sketch.
Regarding the compressed sensing algorithms, we set the parameter $k$ (the number of nonzero entries in the reconstructed signal) to the same value for both FIHT and count sketch. The dimension of the compressed gradients for FIHT is simply the model dimension divided by the desired compression rate. The sketch size for count sketch is specified as rows $\times$ columns; we tried several row/column configurations and report the best-performing one for all tested compression rates.
Recall that we conduct experiments in the i.i.d. setting and the non-i.i.d. setting. The differences between the two settings include the following:
1. In the i.i.d. setting, there are 100 local workers, and we split the training dataset evenly into 100 local datasets, so that each local dataset has exactly 50 samples from each of the 10 classes. In the non-i.i.d. setting, there are 10,000 workers, and each of the 10,000 local datasets has 5 samples belonging to a single class.
2. In the i.i.d. setting, the server queries compressed local gradients from all workers at each iteration. In the non-i.i.d. setting, the server queries compressed local gradients from 1% (100) of the workers at each iteration.
3. The local batch size is set differently in the two settings.
4. In the non-i.i.d. setting, we perform data augmentation (while dividing the dataset amongst the 10,000 workers) using random flips and random rotations.
The step size is set to the same value in both the i.i.d. and the non-i.i.d. settings. We also use the same parameters for FIHT and count sketch in both settings.
Experiments on the reconstruction error:
We also conducted numerical experiments that compare the reconstruction errors of FIHT and count sketch. Here the test signal is randomly generated from a vector with a prescribed number of nonzero entries drawn i.i.d. from the standard Gaussian distribution. We apply FIHT and count sketch to the compression and reconstruction of this signal, and measure the relative reconstruction error incurred at different compression rates. The relative reconstruction error is defined as $\|\hat{x}-x\|/\|x\|$, where $\hat{x}$ is the signal reconstructed by FIHT or count sketch and $x$ is the original signal. For a given compression rate, we set the compressed dimension of FIHT and the sketch size (rows $\times$ columns) of count sketch accordingly, and we use the same sparsity parameter $k$ for both methods. For each compression rate, we test FIHT and count sketch over multiple random instances of the signal; we fix the sensing matrix of FIHT and the sketching operator of count sketch across the random trials. Figure 4 empirically confirms that the reconstruction performance of count sketch is worse than that of FIHT, especially at higher compression rates.
[Figure 4: relative reconstruction error of FIHT and count sketch at different compression rates.]
Appendix D For-each Scheme and For-all Scheme
In what follows, we elaborate on the two schemes in compressed sensing: the for-each scheme and the for-all scheme.
For-each scheme:
In this scheme, the sensing matrix $A$ is chosen at random from some distribution for each individual signal $x$. Specifically, consider an algorithm belonging to the for-each scheme, and let $\mathcal{R}(y, A)$ denote the signal reconstructed by this algorithm from the measurement $y$ and the sensing matrix $A$. Then associated with the algorithm is a family of probability distributions over sensing matrices, and the theoretical guarantee of the algorithm can typically be recast as follows:
Let $p\in(0,1)$ be sufficiently small. Then there exist a number of measurements $m$ and a constant $C$, depending on $p$ and the sparsity level $k$, such that
$$\Pr_{A}\Big[\,\big\|\mathcal{R}(Ax, A) - x\big\| \le C\,\big\|x - x^{(k)}\big\|\,\Big] \ge 1 - p, \qquad (8)$$
where $x$ is an arbitrary deterministic vector and the probability is taken over the random draw of $A$.
Note that the bound (8) applies to the reconstruction of one deterministic signal $x$, and the probability on the left-hand side of (8) is with respect to the distribution over the sensing matrix $A$. Consequently, every time a new vector is to be measured and reconstructed under the for-each scheme, one typically generates a new sensing matrix independent of that vector. In particular, if a randomly generated matrix $A$ has already been used for the reconstruction of some signal, and a second signal depends on $A$ (as happens along an SGD trajectory), then the bound (8) can no longer be directly applied to the reconstruction of the second signal by $A$, and as a result, bounding its reconstruction error becomes more challenging.
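For intuition, the following is a minimal count sketch, the canonical for-each example mentioned in Section 3.1; the table sizes, the median-based recovery, and the use of a fresh seed per signal are illustrative simplifications rather than the CSVec implementation used in our experiments.

```python
import numpy as np

def count_sketch(x, rows, cols, seed):
    """Sketch x into a (rows x cols) table; a fresh seed plays the role of drawing
    a new sketching operator for each signal (the for-each scheme)."""
    rng = np.random.default_rng(seed)
    d = len(x)
    buckets = rng.integers(0, cols, size=(rows, d))    # bucket hash h_r(j)
    signs = rng.choice([-1.0, 1.0], size=(rows, d))    # sign hash s_r(j)
    table = np.zeros((rows, cols))
    for r in range(rows):
        np.add.at(table[r], buckets[r], signs[r] * x)  # accumulate signed contributions
    return table, buckets, signs

def estimate_entry(table, buckets, signs, j):
    """Median-of-rows estimate of x[j] from its sketch."""
    return np.median([signs[r, j] * table[r, buckets[r, j]] for r in range(table.shape[0])])

rng = np.random.default_rng(1)
x = np.zeros(10000)
x[rng.choice(10000, size=20, replace=False)] = 10.0 * rng.normal(size=20)  # 20 heavy entries
table, b, s = count_sketch(x, rows=5, cols=500, seed=42)
j = int(np.argmax(np.abs(x)))
print(x[j], estimate_entry(table, b, s, j))   # the estimate should be close to the true value
```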
For-all scheme:
In this scheme, one generates a single sensing matrix $A$ that is used for the reconstruction of all possible signals. As already mentioned, the restricted isometry property (see Definition 1) has been proposed as a condition on $A$ under the for-all scheme. Intuitively, the restricted isometry property ensures that the linear measurements can discriminate sparse signals: suppose $A$ satisfies the $(2s,\delta)$-RIP for some $s$ and $\delta<1$. Then for any $x_1\ne x_2$ with $\|x_1\|_0\le s$ and $\|x_2\|_0\le s$, we have
$$\|Ax_1 - Ax_2\|^2 = \|A(x_1 - x_2)\|^2 \ge (1-\delta)\,\|x_1 - x_2\|^2 > 0,$$
indicating that the map $x\mapsto Ax$ is an injection from the set of $s$-sparse vectors to $\mathbb{R}^m$. Consequently, given the noiseless measurement vector $y = Ax$, the solution to (2) will always be unique and equal to $x$ as long as $\|x\|_0\le s$. This explains why the RIP can provide reconstruction guarantees under the for-all scheme in the ideal scenario (i.e., the original vector is strictly sparse, the measurement is noiseless, and the solution to the nonconvex problem (2) can be obtained).
Extending the above discussion to more general scenarios is not trivial, but the existing literature has investigated reconstruction guarantees of various compressed sensing algorithms when the original vector is only approximately sparse or the linear measurement is noisy. One typical form of guarantee for compressed sensing algorithms under the for-all scheme is as follows:
Suppose $A$ satisfies the RIP of an appropriate order with a sufficiently small constant $\delta$. Then there exist constants $C_1$ and $C_2$, depending on $\delta$, such that
$$\big\|\mathcal{R}(Ax+e, A) - x\big\| \le C_1\,\frac{\|x - x^{(s)}\|_1}{\sqrt{s}} + C_2\,\|e\|$$
for all $x\in\mathbb{R}^d$, where $\mathcal{R}(y, A)$ denotes the signal reconstructed from the (noisy) measurement vector $y = Ax + e$ and the sensing matrix $A$.
Examples of the for-all scheme include $\ell_1$-minimization [34] and various greedy algorithms [35, 27, 36]; see also [24]. We also mention that there are variants of the RIP that have been used for compressed sensing with sparse sensing matrices [37, 23].
Finally, we note that in practice, sensing matrices that satisfy the RIP are usually generated by randomized methods. For example, it has been shown that a matrix with i.i.d. (properly scaled) Gaussian entries satisfies the $(s,\delta)$-RIP with high probability when the number of rows is on the order of $\delta^{-2}s\log(d/s)$ [38]. In this work, we generate the sensing matrix based on Proposition 1 so that $A$ has low storage and transmission cost.
Appendix E The Fast Iterative Hard Thresholding (FIHT) Algorithm
We introduce some additional notation before presenting FIHT. For any $z\in\mathbb{R}^d$, we let $\mathrm{supp}(z)$ denote the set of indices of the nonzero entries of $z$, and let $\mathcal{T}_k(z)$ denote the set of indices corresponding to the top-$k$ entries of $z$ in magnitude. Given $z\in\mathbb{R}^d$ and an index set $S$, we use $z_S$ to denote the $d$-dimensional vector that keeps the entries of $z$ with indices in $S$ and sets the other entries to zero.
The Fast Iterative Hard Thresholding (FIHT) algorithm is presented in detail in [27].
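To convey the flavor of the reconstruction step without reproducing FIHT itself, below is a plain iterative hard thresholding sketch: a gradient step on the least-squares objective followed by keeping the top-$k$ entries. It is a simplified stand-in, not the FIHT of [27], which additionally uses adaptive step sizes and acceleration.

```python
import numpy as np

def iht(A_mul, AT_mul, y, d, k, mu=1.0, iters=200):
    """Plain iterative hard thresholding for recovering a k-sparse approximation z
    of a signal from y ~ A z. A_mul / AT_mul apply A and A^T implicitly
    (e.g., via fast transforms); mu = 1 is appropriate when A satisfies the RIP
    with the scaling of Proposition 1."""
    z = np.zeros(d)
    for _ in range(iters):
        z = z + mu * AT_mul(y - A_mul(z))            # gradient step on 0.5 * ||y - A z||^2
        idx = np.argpartition(np.abs(z), -k)[-k:]    # indices of the k largest magnitudes
        z_sparse = np.zeros(d)
        z_sparse[idx] = z[idx]                       # hard thresholding: keep only the top k
        z = z_sparse
    return z
```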
We now elaborate on some details of our implementation of FIHT.
Stopping criterion. In our implementation of FIHT, we stop the iterations whenever any of the following holds: (i) the maximum number of iterations is reached; (ii) the residual becomes sufficiently small; or (iii) the residual values recorded over the last several iterations have stabilized, in the sense that their standard deviation is small relative to their mean.
Multiplication of $A$ with a vector. We generate the sensing matrix $A$ from an orthogonal matrix as in Proposition 1, and the "base matrix" is chosen to be either the (orthonormal) DCT matrix, with entries
$$[B_{\mathrm{DCT}}]_{kj} = \sqrt{\frac{2}{D}}\, c_k \cos\!\left(\frac{\pi(2j+1)k}{2D}\right), \qquad c_0 = \frac{1}{\sqrt{2}},\ \ c_k = 1 \text{ for } k\ge 1,$$
or the WHT matrix defined recursively by
$$B_{\mathrm{WHT}}^{(1)} = [1], \qquad B_{\mathrm{WHT}}^{(2D)} = \frac{1}{\sqrt{2}}\begin{bmatrix} B_{\mathrm{WHT}}^{(D)} & B_{\mathrm{WHT}}^{(D)} \\ B_{\mathrm{WHT}}^{(D)} & -B_{\mathrm{WHT}}^{(D)} \end{bmatrix}$$
when $D$ is a power of $2$. Then, since both the DCT and the WHT have fast algorithms of time complexity $O(D\log D)$, the matrix-vector multiplications $Az$ and $A^{\mathsf T}w$ can be carried out with $O(D\log D)$ time complexity. Specifically, let $\Omega$ denote the index set of the rows of the base matrix that form $A$. Taking the DCT as an example, $Az$ can be computed by applying the fast DCT to $z$ and then keeping the entries indexed by $\Omega$, scaled by $\sqrt{D/m}$; conversely, $A^{\mathsf T}w$ can be computed by embedding $w$ into a $D$-dimensional vector supported on $\Omega$, applying the fast inverse DCT, and scaling by $\sqrt{D/m}$. Note that if we employ the WHT instead of the DCT, the inverse transform can be replaced by the forward transform, since the WHT matrix is symmetric.
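A minimal sketch of these implicit multiplications using SciPy's fast DCT is given below; the dimensions and the random seed are arbitrary, and the paper's actual implementation uses the WHT via the FFHT library instead.

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
D, m = 256, 64
Omega = rng.choice(D, size=m, replace=False)   # indices of the selected base-matrix rows
scale = np.sqrt(D / m)

def A_mul(z):
    """Compute A z: fast DCT of z, restricted to the selected rows and scaled by sqrt(D/m)."""
    return scale * dct(z, norm='ortho')[Omega]

def AT_mul(w):
    """Compute A^T w: embed w at the selected row indices and apply the fast inverse DCT."""
    v = np.zeros(D)
    v[Omega] = w
    return scale * idct(v, norm='ortho')

# Sanity check against the explicitly materialized sensing matrix.
B = dct(np.eye(D), norm='ortho', axis=0)       # orthonormal DCT-II base matrix
A = scale * B[Omega, :]
z, w = rng.normal(size=D), rng.normal(size=m)
assert np.allclose(A @ z, A_mul(z)) and np.allclose(A.T @ w, AT_mul(w))
```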
References
- [1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
- [2] S. Magnússon, C. Enyioha, N. Li, C. Fischione, and V. Tarokh, “Communication complexity of dual decomposition methods for distributed resource allocation optimization,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 4, pp. 717–732, 2018.
- [3] C.-S. Lee, N. Michelusi, and G. Scutari, “Finite rate quantized distributed optimization with geometric convergence,” in 52nd Asilomar Conference on Signals, Systems, and Computers, 2018, pp. 1876–1880.
- [4] A. Reisizadeh, A. Mokhtari, H. Hassani, and R. Pedarsani, “An exact quantized decentralized gradient descent algorithm,” IEEE Transactions on Signal Processing, vol. 67, no. 19, pp. 4934–4947, 2019.
- [5] S. U. Stich, “Local SGD converges fast and communicates little,” arXiv preprint arXiv:1805.09767, 2018.
- [6] J. Wang and G. Joshi, “Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms,” arXiv preprint arXiv:1808.07576, 2018.
- [7] H. Yu, S. Yang, and S. Zhu, “Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 5693–5700.
- [8] J. Sun, T. Chen, G. B. Giannakis, Q. Yang, and Z. Yang, “Lazily aggregated quantized gradient innovation for communication-efficient federated learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
- [9] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
- [10] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: Compressed optimisation for non-convex problems,” in Proceedings of the 35th International Conference on Machine Learning, 2018, pp. 560–569.
- [11] S. Magnússon, H. Shokri-Ghadikolaei, and N. Li, “On maintaining linear convergence of distributed learning and optimization under limited communication,” IEEE Transactions on Signal Processing, vol. 68, pp. 6101–6116, 2020.
- [12] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified SGD with memory,” in Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc., 2018.
- [13] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli, “The convergence of sparsified gradient methods,” in Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc., 2018.
- [14] J. Zhang, N. Li, and M. Dedeoglu, “Federated learning over wireless networks: A band-limited coordinated descent approach,” in IEEE Conference on Computer Communications, 2021.
- [15] S. P. Karimireddy, Q. Rebjock, S. Stich, and M. Jaggi, “Error feedback fixes signSGD and other gradient compression schemes,” in International Conference on Machine Learning, 2019, pp. 3252–3261.
- [16] N. Ivkin, D. Rothchild, E. Ullah, V. Braverman, I. Stoica, and R. Arora, “Communication-efficient distributed SGD with sketching,” in Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc., 2019.
- [17] D. Rothchild, A. Panda, E. Ullah, N. Ivkin, I. Stoica, V. Braverman, J. Gonzalez, and R. Arora, “FetchSGD: Communication-efficient federated learning with sketching,” in International Conference on Machine Learning, 2020, pp. 8253–8265.
- [18] M. Charikar, K. Chen, and M. Farach-Colton, “Finding frequent items in data streams,” in International Colloquium on Automata, Languages, and Programming. Springer, 2002, pp. 693–703.
- [19] H. Cai, D. Mckenzie, W. Yin, and Z. Zhang, “Zeroth-order regularized optimization (ZORO): Approximately sparse gradients and adaptive sampling,” arXiv preprint arXiv:2003.13001, 2020.
- [20] E. J. Candès and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 21–30, 2008.
- [21] B. K. Natarajan, “Sparse approximate solutions to linear systems,” SIAM Journal on Computing, vol. 24, no. 2, pp. 227–234, 1995.
- [22] G. Davis, S. Mallat, and M. Avellaneda, “Adaptive greedy approximations,” Constructive Approximation, vol. 13, no. 1, pp. 57–98, 1997.
- [23] A. Gilbert and P. Indyk, “Sparse recovery using sparse matrices,” Proceedings of the IEEE, vol. 98, no. 6, pp. 937–947, 2010.
- [24] S. Foucart, “Sparse recovery algorithms: Sufficient conditions in terms of restricted isometry constants,” in Approximation Theory XIII: San Antonio 2010. Springer, 2012, pp. 65–77.
- [25] M. E. Lopes, “Unknown sparsity in compressed sensing: Denoising and inference,” IEEE Transactions on Information Theory, vol. 62, no. 9, pp. 5145–5166, 2016.
- [26] I. Haviv and O. Regev, “The restricted isometry property of subsampled Fourier matrices,” in Geometric aspects of functional analysis. Springer, 2017, pp. 163–179.
- [27] K. Wei, “Fast iterative hard thresholding for compressed sensing,” IEEE Signal Processing Letters, vol. 22, no. 5, pp. 593–597, 2014.
- [28] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt, “Practical and optimal LSH for angular distance,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, 2015, pp. 1225–1233, full version available at arXiv:1509.02897.
- [29] A. C. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin, “One sketch for all: fast algorithms for compressed sensing,” in Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, 2007, pp. 237–246.
- [30] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
- [31] L. Dalcín, R. Paz, M. Storti, and J. D’Elía, “MPI for Python: Performance improvements and MPI-2 extensions,” Journal of Parallel and Distributed Computing, vol. 68, no. 5, pp. 655–662, 2008.
- [32] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, “Open MPI: Goals, concept, and design of a next generation MPI implementation,” in Proceedings, 11th European PVM/MPI Users’ Group Meeting, 2004, pp. 97–104.
- [33] G. Cormode and S. Muthukrishnan, “An improved data stream summary: the count-min sketch and its applications,” Journal of Algorithms, vol. 55, no. 1, pp. 58–75, 2005.
- [34] E. J. Candes and T. Tao, “Decoding by linear programming,” IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
- [35] D. Needell and J. A. Tropp, “CoSaMP: Iterative signal recovery from incomplete and inaccurate samples,” Applied and Computational Harmonic Analysis, vol. 26, no. 3, pp. 301–321, 2009.
- [36] J. D. Blanchard, J. Tanner, and K. Wei, “CGIHT: Conjugate gradient iterative hard thresholding for compressed sensing and matrix completion,” Information and Inference: A Journal of the IMA, vol. 4, no. 4, pp. 289–327, 2015.
- [37] R. Berinde, A. C. Gilbert, P. Indyk, H. Karloff, and M. J. Strauss, “Combining geometry and combinatorics: A unified approach to sparse signal recovery,” in 2008 46th Annual Allerton Conference on Communication, Control, and Computing. IEEE, 2008, pp. 798–805.
- [38] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, “A simple proof of the restricted isometry property for random matrices,” Constructive Approximation, vol. 28, no. 3, pp. 253–263, 2008.