
Communication-Efficient Distributed SGD with Compressed Sensing

Yujie Tang, Vikram Ramanathan, Junshan Zhang, Na Li

School of Engineering and Applied Sciences, Harvard University
School of Electrical, Computer and Energy Engineering, Arizona State University
Abstract

We consider large-scale distributed optimization over a set of edge devices connected to a central server, where the limited communication bandwidth between the server and the edge devices imposes a significant bottleneck on the optimization procedure. Inspired by recent advances in federated learning, we propose a distributed stochastic gradient descent (SGD) type algorithm that exploits the sparsity of the gradient, when possible, to reduce the communication burden. At the heart of the algorithm is the use of compressed sensing techniques: on the device side, each local stochastic gradient is compressed before being uploaded; on the server side, a sparse approximation of the global stochastic gradient is recovered from the noisy aggregation of the compressed local gradients. We analyze the convergence of our algorithm in the presence of noise perturbations incurred by the communication channels, and conduct numerical experiments that corroborate its effectiveness.

1 Introduction

Large-scale distributed stochastic optimization plays a fundamental role in recent advances in machine learning, allowing models of vast size to be trained on massive datasets by multiple machines. Meanwhile, the past few years have witnessed an explosive growth of networks of IoT devices such as smartphones, self-driving cars, robots, and unmanned aerial vehicles (UAVs), which are capable of data collection and processing for many learning tasks. In many of these applications, due to privacy concerns, it is preferable that the edge devices learn the model by cooperating with the central server without sending their own data to the server. Moreover, the communication between the edge devices and the server often goes through wireless channels, which are lossy and unreliable in nature and have limited bandwidth, imposing significant challenges, especially for high-dimensional problems.

To address the communication bottlenecks, researchers have investigated communication-efficient distributed optimization methods for large-scale problems, for both the device-server setting [1, 2] and the peer-to-peer setting [3, 4]. In this paper, we consider the device-server setting where a group of edge devices are coordinated by a central server.

Most existing techniques for the device-server setting fall into two categories. The first category aims to reduce the number of communication rounds, based on the idea that each edge device runs multiple local SGD steps in parallel before sending its local update to the server for aggregation. This approach is known as FedAvg [1] in federated learning, and its convergence has been studied in [5, 6, 7]. Another line of work investigates lazy/adaptive uploading of information, i.e., local gradients are uploaded only when they are found to be informative enough [8].

The second category focuses on efficient compression of gradient information transmitted from edge devices to the server. Commonly adopted compression techniques include quantization [9, 10, 11] and sparsification [12, 13, 14]. These techniques can be further classified according to whether the gradient compression yields biased [9, 14] or unbiased [10, 13] gradient estimators. To handle the bias and boost convergence, [12, 15] introduced the error feedback method that accumulates and corrects the error caused by gradient compression at each step.

Two recent papers [16, 17] employ sketching methods for gradient compression. Specifically, each device compresses its local stochastic gradient by count sketch [18] via a common sketching operator, and the server recovers the indices and values of the large entries of the aggregated stochastic gradient from the gradient sketches. However, the theoretical guarantees of count sketch were developed for recovering one fixed signal with a sketching operator drawn at random from a given probability distribution. During SGD the gradient signals are constantly changing, and generating a new sketching operator for every SGD iteration is impractical. These papers therefore apply a single sketching operator to all the gradients throughout the optimization procedure, sacrificing the theoretical guarantees. Further, the performance in the presence of transmission error/noise on the uploading links is not well understood.

Our Contributions. We propose a distributed SGD-type algorithm that employs compressed sensing for gradient compression. Specifically, we adopt compressed sensing techniques for the compression of local stochastic gradients at the device side and the reconstruction of the aggregated stochastic gradient at the server side. The use of compressed sensing enables the server to approximately identify the top entries of the aggregated gradient without directly querying each local gradient. Our algorithm also integrates an error feedback strategy at the server side to handle the bias introduced by compression, while keeping the edge devices stateless. We provide convergence analysis of our algorithm in the presence of additive noise incurred by the uploading communication channels, and conduct numerical experiments that justify its effectiveness.

Besides the related work discussed above, it is worth noting that a recent paper [19] uses compressed sensing for zeroth-order optimization, which has a mathematical structure similar to that of this study. However, [19] considers the centralized setting and only establishes convergence to a neighborhood of the minimizer.

Notations: For $x\in\mathbb{R}^{d}$, $\|x\|_{p}$ denotes its $\ell_{p}$-norm, and $x^{[K]}\in\mathbb{R}^{d}$ denotes its best-$K$ approximation, i.e., the vector that keeps the top $K$ entries of $x$ in magnitude and sets the other entries to 0.

2 Problem Setup

Consider a group of $n$ edge devices and a server. Each device $i$ is associated with a differentiable local objective function $f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}$, and is able to query a stochastic gradient $\mathsf{g}_{i}(x)$ such that $\mathbb{E}[\mathsf{g}_{i}(x)]=\nabla f_{i}(x)$. Between each device and the server are an uploading communication link and a broadcasting communication link. The goal is to solve

$$\min_{x\in\mathbb{R}^{d}}\ \ f(x)\coloneqq\frac{1}{n}\sum_{i=1}^{n}f_{i}(x) \qquad (1)$$

through queries of stochastic gradients at each device and exchange of information between the server and each device.

One common approach for our problem setup is the stochastic gradient descent (SGD) method: at each time step $t$, the server first broadcasts the current iterate $x(t)$ to all devices; each device then produces a stochastic gradient $g_{i}(t)=\mathsf{g}_{i}(x(t))$ and uploads it to the server, after which the server updates $x(t+1)=x(t)-\eta\cdot\frac{1}{n}\sum_{i}g_{i}(t)$. However, as the server needs to collect local stochastic gradients from every device at every iteration, vanilla SGD may encounter a significant bottleneck imposed by the uploading links if $d$ is very large. This issue can be further exacerbated when the server and the devices are connected via lossy wireless networks of limited bandwidth, which is the case for many IoT applications.
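For reference, a minimal single-process sketch of this vanilla distributed SGD step (ours, for illustration only; `oracles` stands for the stochastic-gradient oracles $\mathsf{g}_i$):

```python
import numpy as np

def vanilla_distributed_sgd_step(x, oracles, eta):
    """One round of uncompressed distributed SGD.

    x       : current iterate, shape (d,)
    oracles : list of callables g_i(x) returning local stochastic gradients
    eta     : step size
    """
    # Each device uploads its full d-dimensional stochastic gradient.
    grads = [g(x) for g in oracles]
    # The server averages the gradients and takes a descent step.
    return x - eta * np.mean(grads, axis=0)
```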

In this work, we investigate the situation where the communication links, particularly the uploading links from each edge device to the server, have limited bandwidth that can significantly slow down the whole optimization procedure; the data transmitted through each uploading link may also be corrupted by noise. Our goal is to develop an SGD-type algorithm for solving (1) that achieves better communication efficiency over the uploading links.

3 Algorithm

Our algorithm is outlined in Algorithm 1, which is based on the SGD method with the following major ingredients:

  1. Compression of local stochastic gradients using compressed sensing techniques. Each edge device compresses its local gradient as $y_{i}(t)=\Phi g_{i}(t)$ before uploading it to the server. The matrix $\Phi\in\mathbb{R}^{Q\times d}$ is called the sensing matrix, and its number of rows $Q$ is strictly less than the number of columns $d$. As a result, the communication burden of uploading the local gradient information is reduced.

     We emphasize that Algorithm 1 employs the for-all scheme of compressed sensing, which allows one $\Phi$ to be used for the compression of all local stochastic gradients (see Section 3.1 for more details on the for-each and for-all schemes).

     After collecting the compressed local gradients and obtaining $\frac{1}{n}\sum_{i=1}^{n}y_{i}(t)$ (corrupted by communication channel noise), the server recovers a vector $\Delta(t)$ by a compressed sensing algorithm, which is then used to update $x(t)$.

  2. Error feedback of compressed gradients. In general, the compressed sensing reconstruction introduces a nonzero bias into the SGD iterations that hinders convergence. To handle this bias, we adopt the error feedback method of [12, 15] and modify it in a manner similar to FetchSGD [17]. The resulting error feedback procedure is carried out purely at the server side, without knowledge of the true aggregated stochastic gradients.

Note that the aggregated vector $\tilde{y}(t)$ is corrupted by additive noise $w(t)$ from the uploading links. This noise model incorporates a variety of communication schemes, including digital transmission with quantization and over-the-air transmission for wireless multi-access networks [14].

Input: sparsity level $K$, size of sensing matrix $Q\times d$, step size $\eta$, number of iterations $T$, initial point $x_0$
Initialize: $x(1)=x_0$, $\varepsilon(1)=0$
The server generates the sensing matrix $\Phi\in\mathbb{R}^{Q\times d}$ and sends it to every edge device
for $t=1,2,\ldots,T$ do
    The server sends $x(t)$ to every edge device
    foreach device $i=1,\ldots,n$ do
        Device $i$ samples a stochastic gradient $g_{i}(t)=\mathsf{g}_{i}(x(t))$
        Device $i$ constructs $y_{i}(t)=\Phi g_{i}(t)\in\mathbb{R}^{Q}$
        Device $i$ sends $y_{i}(t)$ back to the server
    end foreach
    The server receives $\tilde{y}(t)=\frac{1}{n}\sum_{i=1}^{n}y_{i}(t)+w(t)$, where $w(t)$ denotes additive noise incurred by the communication channels
    The server computes $z(t)=\eta\tilde{y}(t)+\varepsilon(t)$
    The server reconstructs $\Delta(t)=\mathcal{A}(z(t);\Phi)$, where $\mathcal{A}(z(t);\Phi)$ denotes the output of the compressed sensing algorithm of choice
    The server updates $x(t+1)=x(t)-\Delta(t)$
    The server updates $\varepsilon(t+1)=z(t)-\Phi\Delta(t)$
end for
Algorithm 1: SGD with Compressed Sensing
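To make the server-side data flow concrete, the following is a minimal NumPy sketch of Algorithm 1 (ours, for illustration), assuming hypothetical helpers `Phi_mul` (computing $\Phi v$), `stoch_grad` (a list of local stochastic-gradient oracles), and `reconstruct` (the compressed sensing algorithm $\mathcal{A}$).

```python
import numpy as np

def run_cs_sgd(x0, stoch_grad, Phi_mul, reconstruct, eta, T, Q, noise_std=0.0):
    """Sketch of Algorithm 1. All error feedback lives on the server."""
    x = x0.copy()
    eps = np.zeros(Q)                           # server-side error accumulator in R^Q
    for t in range(T):
        # Devices: compress local stochastic gradients with the common Phi.
        ys = [Phi_mul(g(x)) for g in stoch_grad]
        # Uplink: aggregation corrupted by additive channel noise w(t).
        w = noise_std * np.random.randn(Q)
        y_tilde = np.mean(ys, axis=0) + w
        # Server: error feedback, sparse reconstruction, and model update.
        z = eta * y_tilde + eps
        delta = reconstruct(z)                  # K-sparse approximation Delta(t)
        x = x - delta
        eps = z - Phi_mul(delta)
    return x
```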

We now provide more details on our algorithm design.

3.1 Preliminaries on Compressed Sensing

Compressed sensing [20] is a technique that allows efficient sensing and reconstruction of an approximately sparse signal. Mathematically, in the sensing step, a signal $x\in\mathbb{R}^{d}$ is observed through the linear measurement $y=\Phi x+w$, where $\Phi\in\mathbb{R}^{Q\times d}$ is a pre-specified sensing matrix with $Q<d$, and $w\in\mathbb{R}^{Q}$ is additive noise. Then, in the reconstruction step, one recovers the original signal $x$ by approximately solving

$$\hat{x}=\operatorname*{arg\,min}_{z}\,\|z\|_{0}\quad\text{s.t.}\ \ y=\Phi z, \qquad (w=0) \qquad (2)$$
$$\hat{x}=\operatorname*{arg\,min}_{z}\,\tfrac{1}{2}\|y-\Phi z\|_{2}^{2}\quad\text{s.t.}\ \ \|z\|_{0}\leq K, \qquad (w\neq 0) \qquad (3)$$

where $K$ restricts the number of nonzero entries of $\hat{x}$.

Both (2) and (3) are NP-hard nonconvex problems [21, 22], and researchers have proposed various compressed sensing algorithms for obtaining approximate solutions. As discussed below, the reconstruction error $\|\hat{x}-x\|$ heavily depends on i) the design of the sensing matrix $\Phi$, and ii) whether the signal $x$ can be well approximated by a sparse vector.

Design of the sensing matrix $\Phi$. Compressed sensing algorithms can be categorized into two schemes [23]: i) the for-each scheme, in which a probability distribution over sensing matrices is designed to provide the desired reconstruction for a fixed signal, and a new $\Phi$ must be generated at random every time a new signal is to be measured and reconstructed; ii) the for-all scheme, in which a single $\Phi$ is used for the sensing and reconstruction of all possible signals.¹ We mention that count sketch is an example of a for-each scheme algorithm. In this paper, we choose the for-all scheme so that the server does not need to send a new matrix to each device at every iteration.

¹A more detailed explanation of the two schemes can be found in Appendix D.

To ensure that the linear measurement $y=\Phi x$ can discriminate approximately sparse signals, researchers have proposed the restricted isometry property (RIP) [20] as a condition on $\Phi$:

Definition 1.

We say that $\Phi\in\mathbb{R}^{Q\times d}$ satisfies the $(K,\delta)$-restricted isometry property if $(1-\delta)\|x\|_{2}^{2}\leq\|\Phi x\|_{2}^{2}\leq(1+\delta)\|x\|_{2}^{2}$ for every $x\in\mathbb{R}^{d}$ that has at most $K$ nonzero entries.

The restricted isometry property of $\Phi$ is fundamental for analyzing the reconstruction error of many compressed sensing algorithms under the for-all scheme [24].
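Verifying RIP exactly is itself intractable, but the restricted isometry constant can be probed empirically. The following sketch (an illustration of the definition, not part of the paper's method) estimates how far $\|\Phi x\|_2^2/\|x\|_2^2$ deviates from 1 over random $K$-sparse vectors, here for a Gaussian sensing matrix.

```python
import numpy as np

def empirical_rip_deviation(Phi, K, trials=1000, seed=None):
    """Monte-Carlo estimate of max | ||Phi x||^2 / ||x||^2 - 1 | over random K-sparse x."""
    rng = np.random.default_rng(seed)
    Q, d = Phi.shape
    worst = 0.0
    for _ in range(trials):
        x = np.zeros(d)
        support = rng.choice(d, size=K, replace=False)
        x[support] = rng.standard_normal(K)
        ratio = np.linalg.norm(Phi @ x) ** 2 / np.linalg.norm(x) ** 2
        worst = max(worst, abs(ratio - 1.0))
    return worst

# Example: a Gaussian sensing matrix with Q on the order of K log(d/K) rows.
d, K = 1024, 10
Q = int(4 * K * np.log(d / K))
Phi = np.random.randn(Q, d) / np.sqrt(Q)
print(empirical_rip_deviation(Phi, K))
```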

Metric of sparsity. The classical metric of sparsity is the $\ell_{0}$ norm, defined as the number of nonzero entries. However, in our setup the vectors to be compressed are in general only approximately sparse, which the $\ell_{0}$ norm cannot handle, as it is not stable under small perturbations. Here, we adopt the following sparsity metric from [25]:²

$$\operatorname{sp}(x)\coloneqq\frac{\|x\|_{1}^{2}}{\|x\|_{2}^{2}\cdot d},\qquad x\in\mathbb{R}^{d}\backslash\{0\}. \qquad (4)$$

The continuity of $\operatorname{sp}(x)$ makes it robust to small perturbations of $x$, and it can be shown that $\operatorname{sp}(x)$ is Schur-concave, so that it characterizes the approximate sparsity of a signal. $\operatorname{sp}(x)$ has also been used in [25] for the performance analysis of compressed sensing algorithms.

²Compared to [25], we add a scaling factor $d^{-1}$ so that $\operatorname{sp}(x)\in(0,1]$.
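As a quick illustration (ours, not from the paper), the metric in (4) is a one-liner; a fully dense constant vector attains $\operatorname{sp}(x)=1$, while a 1-sparse vector attains $1/d$.

```python
import numpy as np

def sp(x):
    """Sparsity metric (4): ||x||_1^2 / (||x||_2^2 * d), taking values in (0, 1]."""
    x = np.asarray(x, dtype=float)
    return np.linalg.norm(x, 1) ** 2 / (np.linalg.norm(x, 2) ** 2 * x.size)

d = 1000
print(sp(np.ones(d)))       # 1.0 : fully dense
print(sp(np.eye(d)[0]))     # 1/d : 1-sparse
```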

3.2 Details of Algorithm Design

Generation of $\Phi$. As mentioned before, we choose compressed sensing under the for-all scheme for gradient compression and reconstruction. We require the sensing matrix $\Phi\in\mathbb{R}^{Q\times d}$ to have a low storage cost, since it will be transmitted to and stored at each device; $\Phi$ should also satisfy RIP so that the compressed sensing algorithm $\mathcal{A}$ has good reconstruction performance. The following proposition suggests a storage-friendly approach for generating matrices satisfying RIP.

Proposition 1 ([26]).

Let $B\in\mathbb{R}^{d\times d}$ be an orthogonal matrix whose entries have absolute values $O(1/\sqrt{d})$, and let $\delta>0$ be sufficiently small. For some $Q=\tilde{O}(\delta^{-2}K\log^{2}K\log d)$,³ let $\Phi\in\mathbb{R}^{Q\times d}$ be a matrix whose $Q$ rows are chosen uniformly and independently from the rows of $B$, multiplied by $\sqrt{d/Q}$. Then, with high probability, $\Phi$ satisfies the $(K,\delta)$-RIP.

³The $\tilde{O}$ notation hides logarithmic dependence on $1/\delta$.

This proposition indicates that we can choose a "base matrix" $B$ satisfying the condition in Proposition 1, and then randomly choose $Q$ of its rows to form $\Phi$. In this way, $\Phi$ can be stored or transmitted merely as the corresponding row indices of $B$. Note that Proposition 1 only requires $Q$ to depend logarithmically on $d$. Candidates for the base matrix $B$ include the discrete cosine transform (DCT) matrix and the Walsh-Hadamard transform (WHT) matrix: both the DCT and the WHT and their inverses have fast algorithms with time complexity $O(d\log d)$, so multiplication of $\Phi$ or $\Phi^{\top}$ with any vector can be completed in $O(d\log d)$ time.
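A minimal sketch of this construction, using the DCT as the base matrix via `scipy.fft` (the helper names and row-index bookkeeping below are our own, for illustration only):

```python
import numpy as np
from scipy.fft import dct, idct

def make_sensing_operator(d, Q, seed=0):
    """Subsample Q rows of the orthonormal d x d DCT matrix, scaled by sqrt(d/Q).

    Only the Q row indices need to be stored or transmitted.
    """
    rng = np.random.default_rng(seed)
    rows = rng.choice(d, size=Q, replace=False)
    scale = np.sqrt(d / Q)

    def Phi_mul(u):                    # computes Phi @ u in O(d log d)
        return scale * dct(u, norm='ortho')[rows]

    def Phi_T_mul(v):                  # computes Phi.T @ v in O(d log d)
        padded = np.zeros(d)
        padded[rows] = v
        return scale * idct(padded, norm='ortho')

    return Phi_mul, Phi_T_mul, rows

# Example: compress a d-dimensional gradient down to Q measurements.
d, Q = 2**14, 5000
Phi_mul, Phi_T_mul, rows = make_sensing_operator(d, Q)
y = Phi_mul(np.random.randn(d))
```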

Choice of the compressed sensing algorithm. We let $\mathcal{A}$ be the Fast Iterative Hard Thresholding (FIHT) algorithm [27].⁴ Our experiments suggest that FIHT achieves a good balance between computational efficiency and empirical reconstruction error compared to the other algorithms we have tried.

⁴See also Appendix E for a brief summary.

We note that FIHT has a tunable parameter $K$ that controls the number of nonzero entries of $\Delta(t)$. This parameter should accord with the sparsity of the vector to be recovered (see Section 3.3 for theoretical results). In addition, the server can broadcast the sparse vector $\Delta(t)$ instead of the whole $x(t)$ for the edge devices to update their local copies of $x(t)$, which saves communication over the broadcasting links.

Error feedback. We adopt error feedback to facilitate the convergence of Algorithm 1. The following lemma verifies that Algorithm 1 indeed incorporates the error feedback steps of [15]; the proof is straightforward and is omitted here.

Lemma 1.

Consider Algorithm 1, and suppose $\Phi$ is generated according to Proposition 1. Then for each $t$, there exist unique $p(t)\in\mathbb{R}^{d}$ and $e(t)\in\mathbb{R}^{d}$ satisfying $z(t)=\Phi p(t)+\eta w(t)$, $\varepsilon(t)=\Phi e(t)$, $e(1)=0$, such that

$$\begin{aligned}
p(t)&=\eta g(t)+e(t), & x(t+1)&=x(t)-\Delta(t),\\
\Delta(t)&=\mathcal{A}\left(\Phi p(t)+\eta w(t);\Phi\right), & e(t+1)&=p(t)-\Delta(t)+\frac{\eta Q}{d}\Phi^{\top}w(t),
\end{aligned} \qquad (5)$$

where $g(t)\coloneqq\frac{1}{n}\sum_{i=1}^{n}g_{i}(t)$.

By comparing Lemma 1 with [15, Algorithm 2], we see that the only difference lies in the presence of the communication channel noise $w(t)$ in our setting. In addition, since error feedback is implemented purely at the server side, the edge devices remain stateless during the whole optimization procedure.

3.3 Theoretical Analysis

First, we make the following assumptions on the objective function $f$, the stochastic gradients $g_{i}(t)$, and the communication channel noise $w(t)$:

Assumption 1.

The function $f(x)$ is $L$-smooth over $x\in\mathbb{R}^{d}$, i.e., there exists $L>0$ such that $\|\nabla f(x)-\nabla f(y)\|_{2}\leq L\|x-y\|_{2}$ for all $x,y\in\mathbb{R}^{d}$.

Assumption 2.

Denote $g(t)=\frac{1}{n}\sum_{i=1}^{n}g_{i}(t)$. There exists $G>0$ such that $\mathbb{E}\big[\|g(t)-\nabla f(x(t))\|_{2}^{2}\big]\leq G^{2}$ for all $t$.

Assumption 3.

The communication channel noise $w(t)$ satisfies $\mathbb{E}[\|w(t)\|_{2}^{2}]\leq\sigma^{2}$ for each $t$.

Our theoretical analysis will be based on the following result on the reconstruction error of FIHT:

Lemma 2 ([27, Corollary I.3]).

Let $K$ be the maximum number of nonzero entries of the output of FIHT. Suppose the sensing matrix $\Phi\in\mathbb{R}^{Q\times d}$ satisfies $(4K,\delta_{4K})$-RIP for sufficiently small $\delta_{4K}$. Then, for any $x\in\mathbb{R}^{d}$ and $w\in\mathbb{R}^{Q}$,

$$\|\mathcal{A}(\Phi x+w;\Phi)-x\|_{2}\leq(C_{\mathcal{A},\mathrm{s}}+1)\big\|x-x^{[K]}\big\|_{2}+\frac{C_{\mathcal{A},\mathrm{s}}}{\sqrt{K}}\big\|x-x^{[K]}\big\|_{1}+C_{\mathcal{A},\mathrm{n}}\|w\|_{2} \qquad (6)$$

where $C_{\mathcal{A},\mathrm{s}}$ and $C_{\mathcal{A},\mathrm{n}}$ are positive constants that depend on $\delta_{4K}$.

We are now ready to establish convergence of Algorithm 1.

Theorem 1.

Let $K$ be the maximum number of nonzero entries of the output of FIHT. Suppose the sensing matrix $\Phi\in\mathbb{R}^{Q\times d}$ satisfies $(4K,\delta_{4K})$-RIP for sufficiently small $\delta_{4K}$. Furthermore, assume that

$$\operatorname{sp}(p(t))\leq\gamma\cdot\frac{2K/d}{[1+C_{\mathcal{A},\mathrm{s}}(3-2K/d)]^{2}} \qquad (7)$$

for all $t\geq 1$ for some $\gamma\in(0,1)$, where $p(t)$ is defined in Lemma 1. Then, for sufficiently large $T$, by choosing $\eta=1/(L\sqrt{T})$, we have

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]\leq\frac{6L\left(f(x(1))-f^{\ast}\right)+3G^{2}}{2\sqrt{T}}+\frac{3}{T}\left[\frac{\gamma(1+\gamma)}{(1-\gamma)^{2}}G^{2}+\frac{2\big(C_{\mathcal{A},\mathrm{n}}+\sqrt{Q/d}\big)^{2}\sigma^{2}}{1-\gamma}\right].$$

In addition, if $f$ is convex and has a minimizer $x^{\ast}\in\mathbb{R}^{d}$, then we further have

$$\mathbb{E}\!\left[f(\bar{x}(t))-f^{\ast}\right]\leq\frac{L\|x(1)-x^{\ast}\|_{2}^{2}+G^{2}/L}{\sqrt{T}}+\frac{6}{T}\left[\frac{\gamma(1+\gamma)G^{2}}{(1-\gamma)^{2}L}+\frac{2\big(C_{\mathcal{A},\mathrm{n}}+\sqrt{Q/d}\big)^{2}\sigma^{2}}{(1-\gamma)L}\right],$$

where $\bar{x}(t)\coloneqq\frac{1}{T}\sum_{t=1}^{T}x(t)$ and $f^{\ast}\coloneqq f(x^{\ast})$.

Proof of Theorem 1 is given in Appendix B.

Remark 1.

Theorem 1 requires $\operatorname{sp}(p(t))$ to remain sufficiently low. This condition is hard to check and can be violated in practice (see Section 4). However, our numerical experiments suggest that even if condition (7) is violated, Algorithm 1 may still exhibit relatively good convergence behavior when the gradient $g(t)$ itself has a relatively low sparsity level. A theoretical investigation of these observations is an interesting future direction.

4 Numerical Results

4.1 Test Case with Synthetic Data

We first conduct numerical experiments on a test case with synthetic data. Here we set the dimension to $d=2^{14}$ and the number of edge devices to $n=20$. The local objective functions are of the form $f_{i}(x)=\frac{1}{2}(x-x_{0})^{\top}A_{i}(x-x_{0})$, where each $A_{i}\in\mathbb{R}^{d\times d}$ is a diagonal matrix, and we denote $A=\frac{1}{n}\sum_{i}A_{i}$. We generate the $A_{i}$ so that the diagonal entries of $A$ are given by $A_{jj}=e^{-j/300}+0.001$ for each $j=1,\ldots,d$, while the diagonal of each $A_{i}$ is dense. We also let $\mathsf{g}_{i}(x)$ produce approximately sparse stochastic gradients for every $x\in\mathbb{R}^{d}$. We refer to Appendix C for details of the test case.

We test three algorithms: uncompressed vanilla SGD, Algorithm 1, and SGD with count sketch. SGD with count sketch simply replaces the gradient compression and reconstruction of Algorithm 1 with the count sketch method [18]. We set $K=500$ for both Algorithm 1 and SGD with count sketch. For Algorithm 1, we generate $\Phi$ from the WHT matrix and use the FFHT library [28] for fast WHT. We set $T=1000$, $\eta=1/\sqrt{T}$, and the initial point to the origin for all three algorithms.

Figure 1: (a) Convergence of the three algorithms ($w(t)=0$). (b) Evolution of $\operatorname{sp}(p(t))$ and $\operatorname{sp}(g(t))$ for Algorithm 1 ($w(t)=0$). (c) Convergence of Algorithm 1 with different amplitudes of $w(t)$. Curves represent the average of 50 random trials, and light-colored shades represent 3-sigma confidence intervals.

Figure 1(a) illustrates the convergence of the three algorithms without communication channel noise (i.e., $w(t)=0$). For Algorithm 1 we set $Q=5000$ (the compression rate $d/Q$ is $3.28$), and for SGD with count sketch we set the sketch size to $16\times 500$ (the compression rate is $d/(16\times 500)=2.05$). We see that Algorithm 1 has better convergence behavior while also achieving a higher compression rate than SGD with count sketch. Our numerical experiments suggest that, for approximately sparse signals, FIHT achieves higher reconstruction accuracy and more aggressive compression than count sketch, and for signals that are not very sparse, FIHT also appears to be more robust.

Figure 1(b) shows the evolution of $\operatorname{sp}(p(t))$ and $\operatorname{sp}(g(t))$ for Algorithm 1. We see that $\operatorname{sp}(p(t))$ is small for the first few iterations, and then increases and stabilizes around $0.5$, which suggests that condition (7) is likely to have been violated for large $t$. On the other hand, Fig. 1(a) shows that Algorithm 1 can still achieve relatively good convergence behavior. This indicates a gap between the theoretical results in Section 3.3 and the empirical results, and suggests that our analysis could be improved. We leave the relevant investigation as future work.

Figure 1(c) illustrates the convergence of Algorithm 1 under different levels of communication channel noise. Here the entries of $w(t)$ are i.i.d. samples from $\mathcal{N}(0,W^{2})$ with $W\in\{0.2,0.4,0.6,0.8,1.0\}$. We see that the convergence of Algorithm 1 deteriorates only gradually as $W$ increases, suggesting its robustness against communication channel noise.

4.2 Test Case of Federated Learning with CIFAR-10 Dataset

We implement our algorithm on a residual network with 668,426 trainable parameters in two different settings. We primarily use TensorFlow and MPI in the implementation (details of the specific experimental setup can be found in Appendix C). In addition, we only present upload compression results here; download compression is not considered as significant as upload compression (given that download speeds are generally higher than upload speeds), and in our case the download compression rate is simply $d/K$. For both settings, we use the CIFAR-10 dataset (60,000 32×32×3 images from 10 classes) with a 50,000/10,000 train/test split.

Figure 2: Training accuracy (left), training loss (middle), and testing accuracy (right) achieved on CIFAR-10 with i.i.d. datasets for local workers.

In the first setting, we instantiate 100 workers and split the CIFAR-10 training dataset such that all local datasets are i.i.d. As seen in Figure 2, our algorithm achieves 2× upload compression with marginal effect on the training and testing accuracy over 50 epochs. As the compression rate increases, the convergence of our method gradually deteriorates (echoing the results in the synthetic case). For comparison, we also show the results of using count sketch with 5 rows and $d/(5\lambda)$ columns, where $\lambda$ is the desired compression rate, in lieu of FIHT. In our setting, while uncompressed (1×) count sketch performs well, it is very sensitive to higher compression, diverging for 1.1× and 1.25× compression.

Figure 3: Training accuracy (left), training loss (middle), and testing accuracy (right) achieved on CIFAR-10 with non-i.i.d. datasets for local workers.

In the second setting, we split the CIFAR-10 dataset into 10,000 sets of 5 images of a single class and assign these sets to 10,000 workers. Each epoch consists of 100 rounds, with 1% (100) of the workers participating in each round. In Figure 3, similar to the i.i.d. setting, we see that our algorithm's training accuracy convergence gradually deteriorates with higher compression. As is typical of the non-i.i.d. setting, the testing accuracy is not as high as that of the i.i.d. setting. However, we note that FIHT achieves 10× compression with negligible effect on the testing accuracy. In addition, count sketch with 5 rows and $d/(5\lambda)$ columns diverges even for small compression rates in this problem setting.

5 Conclusion

In this paper, we developed a communication-efficient SGD algorithm based on compressed sensing. This algorithm has several direct variants. For example, the momentum method can be directly incorporated. Also, when the number of devices $n$ is very large, the server can choose to query compressed stochastic gradients from a randomly chosen subset of workers.

Our convergence guarantees require $\operatorname{sp}(p(t))$ to remain persistently low, which is hard to check in practice. The numerical experiments also show that our algorithm can work even when $\operatorname{sp}(p(t))$ grows to a relatively high level. This suggests that our theoretical analysis can be further improved, which will be an interesting future direction.

Appendix A Auxiliary Results

In this section, we provide some auxiliary results for the proof of Theorem 1. We first give an alternative form of the reconstruction error derived from the condition (7) and the performance guarantee (6).

Lemma 3.

Suppose the conditions in Lemma 2 are satisfied. Let $w\in\mathbb{R}^{Q}$ be arbitrary, and let $x\in\mathbb{R}^{d}$ be such that $\operatorname{sp}(x)$ is upper bounded by the right-hand side of (7). Then

$$\|\mathcal{A}(\Phi x+w;\Phi)-x\|_{2}\leq\sqrt{\frac{\gamma}{2}}\|x\|_{2}+C_{\mathcal{A},\mathrm{n}}\|w\|_{2}.$$
Proof.

By [29, Lemma 7], we have xx[K]2x1/(2K)\|x-x^{[K]}\|_{2}\leq\|x\|_{1}/(2\sqrt{K}). Therefore by Lemma 2,

𝒜(Φx+w;Φ)x2\displaystyle\|\mathcal{A}(\Phi x+w;\Phi)-x\|_{2}\leq\ C𝒜,s+12Kx1+C𝒜,sKxx[K]1+C𝒜,nw2\displaystyle\frac{C_{\mathcal{A},\mathrm{s}}+1}{2\sqrt{K}}\|x\|_{1}+\frac{C_{\mathcal{A},\mathrm{s}}}{\sqrt{K}}\big{\|}x-x^{[K]}\big{\|}_{1}+C_{\mathcal{A},\mathrm{n}}\|w\|_{2}
\displaystyle\leq\ [C𝒜,s+12+C𝒜,s(1\mfracKd)]x1K+C𝒜,nw2.\displaystyle\left[\frac{C_{\mathcal{A},\mathrm{s}}+1}{2}+C_{\mathcal{A},\mathrm{s}}\!\left(1-\mfrac{K}{d}\right)\right]\frac{\big{\|}x\big{\|}_{1}}{\sqrt{K}}+C_{\mathcal{A},\mathrm{n}}\|w\|_{2}.

Plugging in the definition and upper bound of sp(x)\operatorname{sp}(x) leads to the result. ∎

Next, we derive a bound on the second moment of e(t)e(t).

Lemma 4.

We have

1Tt=1T𝔼[e(t)22]\displaystyle\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|e(t)\|_{2}^{2}\right]\!\leq\, 2η21γ[γ(1+γ)1γG2+2(C𝒜,n+Q/d)2σ2]+2η2γ(1+γ)(1γ)21Tt=1T𝔼[f(x(t))22].\displaystyle\frac{2\eta^{2}}{1\!-\!\gamma}\bigg{[}\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}G^{2}\!+\!2\!\left(C_{\mathcal{A},\mathrm{n}}\!\!+\!\!\sqrt{Q/d}\right)^{\!2}\!\!\sigma^{2}\bigg{]}+\frac{2\eta^{2}\gamma(1\!+\!\gamma)}{(1\!-\!\gamma)^{2}}\cdot\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right].
Proof.

By definition, we have

𝔼[e(t+1)22]\displaystyle\mathbb{E}\!\left[\|e(t\!+\!1)\|_{2}^{2}\right]\leq\ 𝔼[(Δ(t)p(t)2+\mfracηQdΦw(t)2)2]\displaystyle\mathbb{E}\Big{[}\big{(}\|\Delta(t)-p(t)\|_{2}+\mfrac{\eta Q}{d}\big{\|}\Phi^{\top}w(t)\big{\|}_{2}\big{)}^{2}\Big{]}
=\displaystyle=\ 𝔼[(𝒜(Φp(t)+ηw(t))p(t)2+ηQ/dw2)2]\displaystyle\mathbb{E}\Big{[}\big{(}\left\|\mathcal{A}(\Phi p(t)+\eta w(t))-p(t)\right\|_{2}+\eta\sqrt{Q/d}\|w\|_{2}\big{)}^{2}\Big{]}
\displaystyle\leq\ 𝔼[(γ/2p(t)2+η(C𝒜,n+Q/d)w2)2]\displaystyle\mathbb{E}\Big{[}\big{(}\sqrt{\gamma/2}\|p(t)\|_{2}+\eta\big{(}C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\big{)}\|w\|_{2}\big{)}^{2}\Big{]}
\displaystyle\leq\ γ𝔼[p(t)22]+2η2(C𝒜,n+Q/d)2𝔼[w(t)22]\displaystyle\gamma\mathbb{E}\!\left[\|p(t)\|_{2}^{2}\right]+2\eta^{2}\big{(}C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\big{)}^{2}\mathbb{E}\!\left[\|w(t)\|_{2}^{2}\right]
\displaystyle\leq\ γ𝔼[ηg(t)+e(t)22]+2η2(C𝒜,n+Q/d)2σ2,\displaystyle\gamma\mathbb{E}\!\left[\|\eta g(t)+e(t)\|_{2}^{2}\right]+2\eta^{2}\big{(}C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\big{)}^{2}\sigma^{2},

where the first inequality follows from Lemma 3, and the second inequality follows from the definition of p(t)p(t) and the assumption that 𝔼[w22]σ2\mathbb{E}[\|w\|_{2}^{2}]\leq\sigma^{2}. Notice that

𝔼[ηg(t)+e(t)22](1+2γ1γ)𝔼[ηg(t)22]+(1+1γ2γ)𝔼[e(t)22],\displaystyle\mathbb{E}\!\left[\|\eta g(t)+e(t)\|_{2}^{2}\right]\leq\left(1+\frac{2\gamma}{1-\gamma}\right)\mathbb{E}\!\left[\|\eta g(t)\|_{2}^{2}\right]+\left(1+\frac{1-\gamma}{2\gamma}\right)\mathbb{E}\!\left[\|e(t)\|_{2}^{2}\right],

which leads to

𝔼[e(t+1)22]1+γ2𝔼[e(t)22]+η2γ(1+γ)1γ𝔼[g(t)22]+2η2(C𝒜,n+Q/d)2σ2.\displaystyle\mathbb{E}\!\left[\|e(t\!+\!1)\|_{2}^{2}\right]\leq\frac{1\!+\!\gamma}{2}\mathbb{E}\!\left[\|e(t)\|_{2}^{2}\right]+\eta^{2}\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}\mathbb{E}\!\left[\|g(t)\|_{2}^{2}\right]+2\eta^{2}\!\big{(}C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\big{)}^{\!2}\!\sigma^{2}.

By 𝔼[g(t)|x(t)]=f(x(t))\mathbb{E}[g(t)|x(t)]=\nabla f(x(t)) and Assumption 2, we have

𝔼[g(t)22]=\displaystyle\mathbb{E}\!\left[\|g(t)\|_{2}^{2}\right]=\ 𝔼[f(x(t))22]+𝔼[g(t)f(x(t))22]\displaystyle\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]+\mathbb{E}\!\left[\|g(t)-\nabla f(x(t))\|_{2}^{2}\right]
\displaystyle\leq\ 𝔼[f(x(t))22]+G2.\displaystyle\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]+G^{2}.

Therefore

𝔼[e(t+1)22]1+γ2𝔼[e(t)22]+η2γ(1+γ)1γ𝔼[f(x(t))22]+η2[γ(1+γ)1γG2+2(C𝒜,n+Q/d)2σ2].\displaystyle\mathbb{E}\!\left[\|e(t\!+\!1)\|_{2}^{2}\right]\leq\frac{1\!+\!\gamma}{2}\mathbb{E}\mkern-4.0mu\left[\|e(t)\|_{2}^{2}\right]\!+\!\eta^{2}\mkern-1.0mu\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}\mathbb{E}\mkern-4.0mu\left[\|\nabla\!f(x(t)\mkern-2.0mu)\|_{2}^{2}\right]+\eta^{2}\!\left[\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}G^{2}\!+\!2\big{(}C_{\mathcal{A},\mathrm{n}}\!\!+\!\!\sqrt{Q/d}\big{)}^{\!2}\!\sigma^{2}\right].

By summing over t=1,,Tt=1,\ldots,T and noting that e(1)=0e(1)=0 and 𝔼[e(T+1)22]0\mathbb{E}[\|e(T+1)\|_{2}^{2}]\geq 0, we get

1Tt=1T𝔼[e(t)22]\displaystyle\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|e(t)\|_{2}^{2}\right]\leq\ 1+γ2Tt=1T𝔼[e(t)22]+η2γ(1+γ)(1γ)Tt=1T𝔼[f(x(t))22]\displaystyle\frac{1\!+\!\gamma}{2T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|e(t)\|_{2}^{2}\right]+\frac{\eta^{2}\gamma(1\!+\!\gamma)}{(1\!-\!\gamma)T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|\nabla f(x(t)\mkern-2.0mu)\|_{2}^{2}\right]
+η2[γ(1+γ)1γG2+2(C𝒜,n+Q/d)2σ2],\displaystyle+\eta^{2}\!\left[\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}G^{2}+2\big{(}C_{\mathcal{A},\mathrm{n}}\!\!+\!\!\sqrt{Q/d}\big{)}^{\!2}\!\sigma^{2}\right],

which then leads to the desired result. ∎

Appendix B Proof of Theorem 1

Convex case: Denote x~(t)=x(t)e(t)\tilde{x}(t)=x(t)-e(t), and it can be checked that

x~(t+1)\displaystyle\tilde{x}(t\!+\!1) =x(t+1)e(t+1)=x(t)Δ(t)(p(t)Δ(t))=x(t)p(t)\displaystyle=x(t\!+\!1)-e(t\!+\!1)=x(t)-\Delta(t)-(p(t)-\Delta(t))=x(t)-p(t)
=x~(t)ηg(t).\displaystyle=\tilde{x}(t)-\eta g(t).

We then have

x~(t+1)x22=x~(t)x22+η2g(t)222ηg(t),x~(t)x.\|\tilde{x}(t\!+\!1)-x^{\ast}\|_{2}^{2}=\|\tilde{x}(t)-x^{\ast}\|_{2}^{2}+\eta^{2}\|g(t)\|_{2}^{2}-2\eta\langle g(t),\tilde{x}(t)-x^{\ast}\rangle.

By taking the expectation and noting 𝔼[g(t)|x(t)]=f(x(t))\mathbb{E}[g(t)|x(t)]=\nabla f(x(t)) and Assumption 2, we get

𝔼[x~(t+1)x22]\displaystyle\mathbb{E}\!\left[\|\tilde{x}(t\!+\!1)-x^{\ast}\|_{2}^{2}\right]\leq 𝔼[x~(t)x22]+η2(𝔼[f(x(t))22]+G2)\displaystyle\mathbb{E}\!\left[\|\tilde{x}(t)-x^{\ast}\|_{2}^{2}\right]+\eta^{2}\!\left(\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]+G^{2}\right)
2η𝔼[f(x(t)),x(t)x]+2η𝔼[f(x(t)),e(t)],\displaystyle-2\eta\,\mathbb{E}\!\left[\langle\nabla f(x(t)),x(t)-x^{\ast}\rangle\right]+2\eta\,\mathbb{E}\!\left[\langle\nabla f(x(t)),e(t)\rangle\right],

which leads to

𝔼[f(x(t)),x(t)x]\displaystyle\mathbb{E}\!\left[\langle\nabla f(x(t)),x(t)-x^{\ast}\rangle\right]\leq 12η(𝔼[x~(t)x22]𝔼[x~(t+1)x22])+ηG22\displaystyle\frac{1}{2\eta}(\mathbb{E}\!\left[\|\tilde{x}(t)-x^{\ast}\|_{2}^{2}\right]-\mathbb{E}\!\left[\|\tilde{x}(t\!+\!1)-x^{\ast}\|_{2}^{2}\right])+\frac{\eta G^{2}}{2}
+η2𝔼[f(x(t))22]+𝔼[f(x(t)),e(t)]\displaystyle+\frac{\eta}{2}\,\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]+\mathbb{E}\!\left[\langle\nabla f(x(t)),e(t)\rangle\right]
\displaystyle\leq 12η(𝔼[x~(t)x22]𝔼[x~(t+1)x22])+ηG22\displaystyle\frac{1}{2\eta}(\mathbb{E}\!\left[\|\tilde{x}(t)-x^{\ast}\|_{2}^{2}\right]-\mathbb{E}\!\left[\|\tilde{x}(t\!+\!1)-x^{\ast}\|_{2}^{2}\right])+\frac{\eta G^{2}}{2}
+η+(3L)12𝔼[f(x(t))22]+3L2𝔼[e(t)22],\displaystyle+\frac{\eta+(3L)^{-1}}{2}\,\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]+\frac{3L}{2}\mathbb{E}\!\left[\|e(t)\|_{2}^{2}\right],

where in the second inequality we used f(x(t)),e(t)16Lf(x(t))22+3L2e(t)22\langle\nabla f(x(t)),e(t)\rangle\leq\frac{1}{6L}\|\nabla f(x(t))\|_{2}^{2}+\frac{3L}{2}\|e(t)\|_{2}^{2}. Now, we take the average of both sides over t=1,,Tt=1,\ldots,T and plug in the bound in Lemma 4 to get

1Tt=1T𝔼[f(x(t)),x(t)x]\displaystyle\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\langle\nabla f(x(t)),x(t)-x^{\ast}\rangle\right]\leq\ 12ηTx(1)x22+ηG22+3η2L1γ[γ(1+γ)1γG2+2(C𝒜,n+Q/d)2σ2]\displaystyle\frac{1}{2\eta T}\|x(1)-x^{\ast}\|_{2}^{2}+\frac{\eta G^{2}}{2}+\frac{3\eta^{2}L}{1\!-\!\gamma}\!\left[\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}G^{2}+2\big{(}C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\big{)}^{\!2}\sigma^{2}\right]
+(η+(3L)12+3η2Lγ(1+γ)(1γ)2)1Tt=1T𝔼[f(x(t))22].\displaystyle+\left(\frac{\eta\!+\!(3L)^{-1}}{2}\!+\!\frac{3\eta^{2}L\gamma(1\!+\!\gamma)}{(1\!-\!\gamma)^{2}}\right)\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]\!.

By η=1/(LT)\eta=1/(L\sqrt{T}), we can show that for sufficiently large TT,

η+(3L)12+3η2Lγ(1+γ)(1γ)2=1L(16+12T+3γ(1+γ)T(1γ)2)14L.\frac{\eta+(3L)^{-1}}{2}+\frac{3\eta^{2}L\gamma(1+\gamma)}{(1-\gamma)^{2}}=\frac{1}{L}\left(\frac{1}{6}+\frac{1}{2\sqrt{T}}+\frac{3\gamma(1+\gamma)}{T(1-\gamma)^{2}}\right)\leq\frac{1}{4L}.

Thus

1Tt=1T𝔼[f(x(t)),x(t)x]\displaystyle\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\langle\nabla f(x(t)),x(t)-x^{\ast}\rangle\right]\leq 12ηTx(1)x22+ηG22+3η2L1γ[γ(1+γ)1γG2+2(C𝒜,n+Q/d)2σ2]\displaystyle\frac{1}{2\eta T}\|x(1)-x^{\ast}\|_{2}^{2}+\frac{\eta G^{2}}{2}+\frac{3\eta^{2}L}{1\!-\!\gamma}\!\left[\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}G^{2}+2\big{(}C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\big{)}^{\!2}\sigma^{2}\right]
+14L1Tt=1T𝔼[f(x(t))22].\displaystyle+\frac{1}{4L}\cdot\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right].

By the convexity of ff, we have f(x(t))f(x)f(x(t)),x(t)xf(x(t))-f(x^{\ast})\leq\langle\nabla f(x(t)),x(t)-x^{\ast}\rangle. Furthermore, since ff is LL-smooth, we see that

f(x)f(x(t)1Lf(x(t)))f(x(t))12Lf(x(t))22,f(x^{\ast})\leq f\!\left(x(t)-\frac{1}{L}\nabla f(x(t))\right)\leq f(x(t))-\frac{1}{2L}\|\nabla f(x(t))\|_{2}^{2},

which leads to f(x(t))222L(f(x(t))f(x))\|\nabla f(x(t))\|_{2}^{2}\leq 2L(f(x(t))-f(x^{\ast})). We then get

1Tt=1T𝔼[f(x(t))f(x)]\displaystyle\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[f(x(t))-f(x^{\ast})\right]\leq 12Tt=1T𝔼[f(x(t))f(x)]+12ηTx(1)x22+ηG22\displaystyle\frac{1}{2T}\sum_{t=1}^{T}\mathbb{E}\!\left[f(x(t))-f(x^{\ast})\right]+\frac{1}{2\eta T}\|x(1)-x^{\ast}\|_{2}^{2}+\frac{\eta G^{2}}{2}
+3η2L1γ[γ(1+γ)1γG2+2(C𝒜,n+Q/d)2σ2].\displaystyle+\frac{3\eta^{2}L}{1\!-\!\gamma}\!\left[\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}G^{2}+2\big{(}C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\big{)}^{\!2}\sigma^{2}\right].

By subtracting 12Tt=1T𝔼[f(x(t))f(x)]\frac{1}{2T}\sum_{t=1}^{T}\mathbb{E}\!\left[f(x(t))\!-\!f(x^{\ast})\right] from both sides of the inequality, and using the bound f(x¯(t))1Tt=1Tf(x(t))f(\bar{x}(t))\leq\frac{1}{T}\sum_{t=1}^{T}f(x(t)) that follows from the convexity of ff, we get the final bound.

Nonconvex case: Denote x~(t)=x(t)e(t)\tilde{x}(t)=x(t)-e(t), and it can be checked that x~(t+1)=x~(t)ηg(t)\tilde{x}(t\!+\!1)=\tilde{x}(t)-\eta g(t). Since ff is LL-smooth, we get

f(x~(t+1))f(x~(t))ηf(x~(t)),g(t)+η2L2g(t)22.f(\tilde{x}(t\!+\!1))\leq f(\tilde{x}(t))-\eta\langle\nabla f(\tilde{x}(t)),g(t)\rangle+\frac{\eta^{2}L}{2}\|g(t)\|_{2}^{2}.

By taking the expectation and using 𝔼[g(t)|x(t)]=f(x(t))\mathbb{E}[g(t)|x(t)]=\nabla f(x(t)) and Assumption 2, we see that

𝔼[f(x~(t+1))]\displaystyle\mathbb{E}\!\left[f(\tilde{x}(t\!+\!1))\right]\leq 𝔼[f(x~(t))]η𝔼[f(x~(t)),f(x(t))]+η2L2(𝔼[f(x(t))22]+G2)\displaystyle\mathbb{E}\!\left[f(\tilde{x}(t))\right]-\eta\,\mathbb{E}\!\left[\langle\nabla f(\tilde{x}(t)),\nabla f(x(t))\rangle\right]+\frac{\eta^{2}L}{2}\!\left(\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]+G^{2}\right)
=\displaystyle= 𝔼[f(x~(t))]η𝔼[f(x(t))22]\displaystyle\mathbb{E}\!\left[f(\tilde{x}(t))\right]-\eta\,\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]
η𝔼[f(x~(t))f(x(t)),f(x(t))]+η2L2(𝔼[f(x(t))22]+G2)\displaystyle-\eta\,\mathbb{E}\!\left[\langle\nabla f(\tilde{x}(t))-\nabla f(x(t)),\nabla f(x(t))\rangle\right]+\frac{\eta^{2}L}{2}\!\left(\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]+G^{2}\right)
\displaystyle\leq 𝔼[f(x~(t))]η(1ηL)2𝔼[f(x(t))22]\displaystyle\mathbb{E}\!\left[f(\tilde{x}(t))\right]-\frac{\eta(1-\eta L)}{2}\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]
+η2𝔼[f(x~(t))f(x(t))22]+η2L2G2\displaystyle+\frac{\eta}{2}\,\mathbb{E}\!\left[\|\nabla f(\tilde{x}(t))-\nabla f(x(t))\|_{2}^{2}\right]+\frac{\eta^{2}L}{2}G^{2}
\displaystyle\leq 𝔼[f(x~(t))]η(1ηL)2𝔼[f(x(t))22]+ηL22𝔼[e(t)22]+η2L2G2,\displaystyle\mathbb{E}\!\left[f(\tilde{x}(t))\right]-\frac{\eta(1-\eta L)}{2}\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]+\frac{\eta L^{2}}{2}\mathbb{E}\!\left[\|e(t)\|_{2}^{2}\right]+\frac{\eta^{2}L}{2}G^{2},

where in the second inequality we used f(x~(t))f(x(t)),f(x(t))12f(x~(t))f(x(t))22+12f(x(t))22\langle\nabla f(\tilde{x}(t))-\nabla f(x(t)),\nabla f(x(t))\rangle\leq\frac{1}{2}\|\nabla f(\tilde{x}(t))-\nabla f(x(t))\|_{2}^{2}+\frac{1}{2}\|\nabla f(x(t))\|_{2}^{2}, and in the last inequality we used the LL-smoothness of ff. By taking the telescoping sum, we get

1ηL21Tt=1T𝔼[f(x(t))22]\displaystyle\frac{1-\eta L}{2}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\|\nabla f(x(t))\|_{2}^{2}\right]\leq f(x(1))fη+ηLG22+L221Tt=1T𝔼[e(t)22].\displaystyle\frac{f(x(1))-f^{\ast}}{\eta}+\frac{\eta LG^{2}}{2}+\frac{L^{2}}{2}\cdot\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|e(t)\|_{2}^{2}\right].

After plugging in the bound in Lemma 4, we get

1ηL21Tt=1T𝔼[f(x(t))22]\displaystyle\frac{1-\eta L}{2}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\|\nabla f(x(t))\|_{2}^{2}\right]\leq f(x(1))fη+ηLG22+η2L21γ[γ(1+γ)1γG2+2(C𝒜,n+Q/d)2σ2]\displaystyle\frac{f(x(1))-f^{\ast}}{\eta}+\frac{\eta LG^{2}}{2}+\frac{\eta^{2}L^{2}}{1\!-\!\gamma}\!\left[\frac{\gamma(1\!+\!\gamma)}{1\!-\!\gamma}G^{2}+2\!\left(C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\right)^{\!2}\!\sigma^{2}\right]
+η2L2γ(1+γ)(1γ)21Tt=1T𝔼[f(x(t))22].\displaystyle+\eta^{2}L^{2}\frac{\gamma(1+\gamma)}{(1-\gamma)^{2}}\cdot\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right].

By choosing η=1/(LT)\eta=1/(L\sqrt{T}) and letting TT be sufficiently large, we have

1ηL2η2L2γ(1+γ)(1γ)2=1212Tγ(1+γ)T(1γ)213.\frac{1-\eta L}{2}-\eta^{2}L^{2}\frac{\gamma(1+\gamma)}{(1-\gamma)^{2}}=\frac{1}{2}-\frac{1}{2\sqrt{T}}-\frac{\gamma(1+\gamma)}{T(1-\gamma)^{2}}\geq\frac{1}{3}.

Finally, we get

13Tt=1T𝔼[f(x(t))22]2L(f(x(1))f)+G22T+1T[γ(1+γ)(1γ)2G2+2(C𝒜,n+Q/d)2σ21γ],\frac{1}{3T}\sum_{t=1}^{T}\mathbb{E}\!\left[\|\nabla f(x(t))\|_{2}^{2}\right]\leq\frac{2L\!\left(f(x(1))\!-\!f^{\ast}\right)+G^{2}}{2\sqrt{T}}+\frac{1}{T}\!\left[\frac{\gamma(1+\gamma)}{(1-\gamma)^{2}}G^{2}+\frac{2\big{(}C_{\mathcal{A},\mathrm{n}}\!+\!\sqrt{Q/d}\big{)}^{\!2}\sigma^{2}}{1-\gamma}\right]\!,

which completes the proof.

Appendix C Further Details on Numerical Experiments

For all numerical experiments in this paper, we employ the WHT matrix as the "base matrix" for generating the sensing matrix $\Phi$. We choose WHT instead of DCT because the Fast Fast Hadamard Transform (FFHT) library [28] provides a heavily optimized C99 implementation of the fast WHT, which turned out to be faster than the fast DCT in the Python library SciPy on our computing platforms.

Note that the WHT matrix is only defined when $d$ is a power of $2$. In order to use the WHT for general $d$, we let $d_{\mathrm{aug}}=2^{\lceil\log_{2}d\rceil}$ and generate $\Phi\in\mathbb{R}^{Q\times d_{\mathrm{aug}}}$ by randomly choosing $Q$ rows from the $d_{\mathrm{aug}}\times d_{\mathrm{aug}}$ WHT matrix as in Proposition 1. Then, for any $u\in\mathbb{R}^{d}$, we first pad the vector $u$ with zeros to augment its dimension to $d_{\mathrm{aug}}$, and then multiply $\Phi\in\mathbb{R}^{Q\times d_{\mathrm{aug}}}$ with the augmented vector to form the compressed vector of dimension $Q$. For the reconstruction, after obtaining the recovered signal, we truncate the last $d_{\mathrm{aug}}-d$ entries to get a vector of the original dimension $d$.
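A minimal illustration of this padding/truncation logic (using a dense Hadamard matrix from `scipy.linalg` for clarity; the paper's implementation instead uses the fast transform from FFHT):

```python
import numpy as np
from scipy.linalg import hadamard

def compress(u, rows, d_aug):
    """Pad u to length d_aug, then apply the subsampled, rescaled WHT rows."""
    B = hadamard(d_aug) / np.sqrt(d_aug)            # orthogonal WHT matrix
    u_pad = np.concatenate([u, np.zeros(d_aug - len(u))])
    return np.sqrt(d_aug / len(rows)) * (B[rows] @ u_pad)

def truncate(x_rec, d):
    """Drop the padded entries of a reconstructed signal."""
    return x_rec[:d]

d = 1000                                            # not a power of two
d_aug = 2 ** int(np.ceil(np.log2(d)))               # 1024
rows = np.random.default_rng(0).choice(d_aug, size=200, replace=False)
y = compress(np.random.randn(d), rows, d_aug)
```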

Code for the federated learning test case is available at https://github.com/vikr53/cosamp.

C.1 Test Case with Synthetic Data

All experiments on the synthetic test case were run on a MacBook Pro (Model A1502) with a 2.9 GHz dual-core Intel Core i5 CPU and 16 GB of memory. GPU acceleration was not exploited. For the synthetic test case, we implement count sketch with our own code.

Recall that the local objective functions are given by $f_{i}(x)=\frac{1}{2}(x-x_{0})^{\top}A_{i}(x-x_{0})$, where each $A_{i}$ is a diagonal matrix. The $j$'th diagonal entries of $A_{1},\ldots,A_{n}$ are randomly sampled such that $\big((A_{1})_{jj},(A_{2})_{jj},\ldots,(A_{n})_{jj}\big)$ follows the Gaussian distribution $\mathcal{N}\big((e^{-j/300}+0.001)\mathbf{1},\,I-\frac{1}{n}\mathbf{1}\mathbf{1}^{\top}\big)$, where $I$ denotes the identity matrix and $\mathbf{1}$ denotes the vector of all ones. As a result, $A\coloneqq\frac{1}{n}\sum_{i}A_{i}$ has diagonal entries $A_{jj}=e^{-j/300}+0.001$. The point $x_{0}\in\mathbb{R}^{d}$ is randomly generated from the standard Gaussian distribution. We fix each $A_{i}$ and $x_{0}$ once they have been generated.

The stochastic gradient $\mathsf{g}_{i}(x)$ is given by

$$\mathsf{g}_{i}(x)=A_{i}(x-x_{0})+R_{1}\cdot A\mathsf{u}_{1}+R_{2}\cdot(\mathsf{b}\odot\mathsf{u}_{2}).$$

Here $\mathsf{u}_{1}$ and $\mathsf{u}_{2}$ are i.i.d. random vectors following the standard Gaussian distribution $\mathcal{N}(0,I)$; the entries of $\mathsf{b}\in\mathbb{R}^{d}$ are i.i.d. samples from the Bernoulli distribution $\mathrm{Ber}(p_{\mathsf{b}})$ with $p_{\mathsf{b}}\in(0,1)$; and $\odot$ denotes the Hadamard product. We set $R_{1}=12.5$, $R_{2}=50$, and $p_{\mathsf{b}}=1.5\times 10^{-3}$ in our test case.
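For concreteness, a NumPy sketch of this stochastic gradient oracle (ours, with the diagonals of $A_i$ and $A$ stored as vectors `Ai_diag` and `A_diag`) could look as follows.

```python
import numpy as np

def make_synthetic_oracle(Ai_diag, A_diag, x0, R1=12.5, R2=50.0, p_b=1.5e-3, seed=None):
    """Stochastic gradient oracle g_i(x) for the synthetic quadratic test case."""
    rng = np.random.default_rng(seed)
    d = x0.size

    def g_i(x):
        u1 = rng.standard_normal(d)        # dense Gaussian perturbation
        u2 = rng.standard_normal(d)
        b = rng.random(d) < p_b            # sparse Bernoulli mask
        return Ai_diag * (x - x0) + R1 * (A_diag * u1) + R2 * (b * u2)

    return g_i
```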

C.2 Test Case of Federated Learning with CIFAR-10 Dataset

We ran all federated learning experiments using the research computing service provided by the authors' institution, and did not exploit GPU acceleration. We primarily use TensorFlow [30] as the deep learning framework. We use MPI for Python [31] for parallel computing on multiple processors, and communication between nodes is facilitated by Open MPI [32]. Additionally, we use the CSVec package (available at https://github.com/nikitaivkin/csh) for count sketch.

Regarding the compressed sensing algorithms, we set $K=30000$ (the number of nonzero entries in the reconstructed signal) for both FIHT and count sketch. The dimension $Q$ of the compressed gradients for FIHT is simply $Q=d/\lambda$, where $\lambda$ is the desired compression rate. The sketch size for count sketch is $r$ rows $\times$ $\frac{d}{r\lambda}$ columns; we tried $r\in\{1,5,10,50\}$ and found the best results with $r=5$ for all tested compression rates.

Recall that we conduct experiments in an i.i.d. setting and a non-i.i.d. setting. The differences between the two settings include the following:

  1. In the i.i.d. setting, there are 100 local workers, and we split the training dataset evenly into 100 local datasets, so that each local dataset has exactly 50 samples from each of the 10 classes. In the non-i.i.d. setting, there are 10,000 workers, and each of the 10,000 local datasets has 5 samples belonging to the same class.

  2. In the i.i.d. setting, the server queries compressed local gradients from all workers at each iteration. In the non-i.i.d. setting, the server queries compressed local gradients from 1% of all workers at each iteration.

  3. In the i.i.d. setting, we set the local batch size to 8. In the non-i.i.d. setting, we set the local batch size to 1.

  4. In the non-i.i.d. setting, we perform data augmentation (while dividing the dataset amongst the 10,000 workers) using random flips and random rotations.

The step size $\eta$ for both the i.i.d. and the non-i.i.d. settings is set to $0.01$. We also use the same parameters for FIHT and count sketch in both settings.

Experiments on the reconstruction error:

We also conducted numerical experiments that compare the reconstruction errors of FIHT and count sketch. Here we let $g\in\mathbb{R}^{d}$ with $d=668426$ be randomly generated as $g=g_{\mathrm{sp}}+g_{\mathrm{n}}$, where $g_{\mathrm{sp}}$ is a vector having $30000$ nonzero entries that are i.i.d. samples from the standard Gaussian distribution, and $g_{\mathrm{n}}\sim\mathcal{N}(0,0.05^{2}I)$. We apply FIHT and count sketch to the compression and reconstruction of $g$, and measure the relative reconstruction error at different compression rates $\lambda$. The relative reconstruction error is defined as $\|g-\hat{g}\|_{2}^{2}/\|g\|_{2}^{2}$, where $\hat{g}$ is the signal reconstructed by FIHT or count sketch. For a given compression rate $\lambda$, we set $Q=d/\lambda$ for FIHT, and set the sketch size to $5$ rows $\times$ $\frac{d}{5\lambda}$ columns for count sketch. We set $\|\hat{g}\|_{0}=K=30000$ for both FIHT and count sketch. For each compression rate $\lambda$, we test FIHT and count sketch over $20$ random instances of $g$; we fix the sensing matrix $\Phi$ of FIHT and the sketching operator of count sketch across the $20$ random trials. Figure 4 empirically confirms that the reconstruction performance of count sketch is worse than that of FIHT, especially at higher compression rates.
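A minimal sketch of how such a comparison can be scripted (ours, for illustration; the signal model follows the description above, and `reconstruct` stands in for either FIHT or count sketch decoding):

```python
import numpy as np

def make_test_signal(d=668426, k=30000, noise_std=0.05, seed=None):
    """Approximately sparse test signal g = g_sp + g_n."""
    rng = np.random.default_rng(seed)
    g = noise_std * rng.standard_normal(d)           # dense Gaussian tail g_n
    support = rng.choice(d, size=k, replace=False)
    g[support] += rng.standard_normal(k)             # k-sparse component g_sp
    return g

def relative_error(g, g_hat):
    return np.linalg.norm(g - g_hat) ** 2 / np.linalg.norm(g) ** 2

# Hypothetical usage, with Phi_mul / reconstruct defined elsewhere:
# errs = [relative_error(g, reconstruct(Phi_mul(g))) for g in signals]
```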

Figure 4: Relative reconstruction error of FIHT and count sketch at different compression rates, averaged over 20 random trials.

Appendix D For-each Scheme and For-all Scheme

In what follows, we elaborate on the two schemes in compressed sensing: the for-each scheme and the for-all scheme.

For-each scheme:

In this scheme, the sensing matrix $\Phi$ is chosen at random from some distribution for each individual signal $x$. Specifically, consider an algorithm belonging to the for-each scheme, and let $\mathcal{A}(y;\Phi)$ denote the signal reconstructed by this algorithm from the measurement $y$ and the sensing matrix $\Phi$. Then associated with the algorithm $\mathcal{A}$ is an indexed set $\{\mathcal{D}_{d,Q}\}$, with each $\mathcal{D}_{d,Q}$ being a probability distribution over $\mathbb{R}^{Q\times d}$, and the theoretical guarantee of the algorithm $\mathcal{A}$ can typically be recast as follows:

Let $K$ be sufficiently small [say, $K\leq O(Q/\log d)$]. Then there exist $\varsigma>0$ and $c\geq 0$ depending on $K$, $Q$ and $d$, such that

$$\mathbb{P}_{\Phi\sim\mathcal{D}_{d,Q}}\!\left(\left\|\mathcal{A}(\Phi x;\Phi)-x\right\|_{2}\leq(1+\varsigma)\big\|x-x^{[K]}\big\|_{2}\right)\geq 1-O(d^{-c}), \qquad (8)$$

where $x\in\mathbb{R}^{d}$ is an arbitrary deterministic vector.

Note that the bound (8) applies to the reconstruction of one deterministic signal $x\in\mathbb{R}^{d}$, and the probability on the left-hand side of (8) is with respect to the distribution $\mathcal{D}_{d,Q}$ over the sensing matrix $\Phi$. Consequently, every time a new vector $x$ is to be measured and reconstructed under the for-each scheme, one typically generates a new sensing matrix $\Phi$ independent of $x$. In particular, if a randomly generated matrix $\Phi$ has already been used for the reconstruction of $x$, and $y$ is a vector that depends on $\mathcal{A}(\Phi x;\Phi)$, then the bound (8) can no longer be directly applied to the reconstruction of $y$ by $\mathcal{A}(\Phi y;\Phi)$, and as a result, bounding $\|\mathcal{A}(\Phi y;\Phi)-y\|_{2}$ becomes more challenging.

Examples of the for-each scheme include count sketch [18] and count-min sketch [33].

For-all scheme:

In this scheme, one generates a single sensing matrix $\Phi\in\mathbb{R}^{Q\times d}$ that is used for the reconstruction of all possible signals $x\in\mathbb{R}^{d}$. As already mentioned, the restricted isometry property (see Definition 1) has been proposed as a condition on $\Phi$ under the for-all scheme. Intuitively, the restricted isometry property ensures that the linear measurements $y=\Phi x$ can discriminate sparse signals: suppose $\Phi\in\mathbb{R}^{Q\times d}$ satisfies $(2K,\delta)$-RIP for some $K$ and $\delta\in(0,1)$. Then for any $x_{1},x_{2}\in\mathbb{R}^{d}$ with $\|x_{1}\|_{0}\leq K$ and $\|x_{2}\|_{0}\leq K$, we have

$$\|\Phi(x_{1}-x_{2})\|_{2}\geq\sqrt{1-\delta}\cdot\|x_{1}-x_{2}\|_{2},$$

indicating that the map $x\mapsto\Phi x$ is an injection from $\{x\in\mathbb{R}^{d}:\|x\|_{0}\leq K\}$ to $\mathbb{R}^{Q}$. Consequently, given the measurement vector $y=\Phi x$, the solution to (2) is always unique and equal to $x$ as long as $\|x\|_{0}\leq K$. This explains why RIP can provide reconstruction guarantees under the for-all scheme in the ideal scenario (i.e., the original vector $x$ is strictly sparse, the measurement is noiseless, and the solution to the nonconvex problem (2) can be obtained).

Extending the above discussion to more general scenarios is not trivial, but the existing literature has established reconstruction guarantees for various compressed sensing algorithms when the original vector $x$ is only approximately sparse or the linear measurement by $\Phi$ is noisy. One typical form of guarantee for compressed sensing algorithms under the for-all scheme is as follows:

Suppose $\Phi\in\mathbb{R}^{Q\times d}$ satisfies $(cK,\delta)$-RIP for some $c>1$ and sufficiently small $\delta>0$. Then there exist $C_{1}>0$ and $C_{2}>0$ depending on $\delta$, such that

$$\left\|\mathcal{A}(\Phi x+w;\Phi)-x\right\|_{2}\leq\frac{C_{1}}{\sqrt{K}}\big\|x-x^{[K]}\big\|_{1}+C_{2}\|w\|_{2}$$

for all $x\in\mathbb{R}^{d}$, where $\mathcal{A}(\Phi x+w;\Phi)$ denotes the signal reconstructed from the (noisy) measurement vector $\Phi x+w$ and the sensing matrix $\Phi$.

Examples of the for-all scheme include $\ell_{1}$-minimization [34] and various greedy algorithms [35, 27, 36]; see also [24]. We also mention that there are variants of RIP that have been used for compressed sensing with sparse sensing matrices [37, 23].

Finally, we note that in practice, sensing matrices satisfying RIP are usually generated by randomized methods. For example, it has been shown that a $Q\times d$ matrix with i.i.d. standard Gaussian entries satisfies the $(K,\delta)$-RIP for $Q\geq O(K\log(d/K))$ with high probability [38]. In this work, we generate the sensing matrix $\Phi$ based on Proposition 1 so that $\Phi$ has low storage and transmission cost.

Appendix E The Fast Iterative Hard Thresholding (FIHT) Algorithm

We introduce some additional notation before presenting FIHT. For any $x\in\mathbb{R}^{d}$, we let $\mathsf{Supp}(x)$ denote the set of indices of the nonzero entries of $x$, and let $\mathsf{PrincipalSupp}(x;K)$ denote the set of indices corresponding to the top-$K$ entries of $x$ in magnitude. Given $x\in\mathbb{R}^{d}$ and an index set $S\subseteq\{1,\ldots,d\}$, we use $\mathsf{Proj}(x,S)$ to denote the $d$-dimensional vector that keeps the entries of $x$ with indices in $S$ and sets the other entries to zero.

The Fast Iterative Hard Thresholding (FIHT) algorithm [27] is outlined as follows:

Input: measurement $y$, sensing matrix $\Phi$, number of nonzero entries in the output $K$
Initialize: $g(0)=0\in\mathbb{R}^{d}$, $w(0)=\Phi^{\top}y$, $\Omega(0)=\mathsf{PrincipalSupp}(w(0),K)$, $g(1)=\mathsf{Proj}(w(0),\Omega(0))$, $s=1$
repeat
        if $s=1$ then
                $\tau(s)=0$
        else
                $\tau(s)=\dfrac{\langle y-\Phi g(s),\,\Phi(g(s)-g(s-1))\rangle}{\|\Phi(g(s)-g(s-1))\|^{2}}$
        end if
        $w(s)=g(s)+\tau(s)(g(s)-g(s-1))$
        $r^{w}(s)=\Phi^{\top}(y-\Phi w(s))$
        $\Gamma(s)=\mathsf{Supp}(w(s))$
        $\tilde{\alpha}(s)=\dfrac{\left\|\mathsf{Proj}(r^{w}(s),\Gamma(s))\right\|^{2}}{\left\|\Phi\cdot\mathsf{Proj}(r^{w}(s),\Gamma(s))\right\|^{2}}$
        $h(s)=w(s)+\tilde{\alpha}(s)\,r^{w}(s)$
        $\Omega(s)=\mathsf{PrincipalSupp}(h(s),K)$
        $\tilde{g}(s)=\mathsf{Proj}(h(s),\Omega(s))$
        $r(s)=\Phi^{\top}(y-\Phi\tilde{g}(s))$
        $\alpha(s)=\dfrac{\left\|\mathsf{Proj}(r(s),\Omega(s))\right\|^{2}}{\left\|\Phi\cdot\mathsf{Proj}(r(s),\Omega(s))\right\|^{2}}$
        $g(s+1)=\tilde{g}(s)+\alpha(s)\cdot\mathsf{Proj}(r(s),\Omega(s))$
        $s\leftarrow s+1$
until stopping criterion satisfied
Output: $g(s)$
Algorithm 2 Fast Iterative Hard Thresholding (FIHT)
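
To make the iteration concrete, below is a minimal NumPy sketch of Algorithm 2; it is illustrative rather than our exact implementation, and the helper names and numerical safeguards are ours. It uses dense matrix-vector products with $\Phi$ and a simple norm-based stopping rule; the fast DCT/WHT-based products and the full stopping criterion used in our experiments are described below.

import numpy as np

def principal_supp(x, K):
    # indices of the top-K entries of x in magnitude
    return np.argpartition(np.abs(x), -K)[-K:]

def proj(x, S):
    # keep entries of x with indices in S, set all other entries to zero
    out = np.zeros_like(x)
    out[S] = x[S]
    return out

def fiht(y, Phi, K, max_iter=25, tol=1e-4, eps=1e-30):
    d = Phi.shape[1]
    g_prev = np.zeros(d)                       # g(0)
    w = Phi.T @ y                              # w(0) = Phi^T y
    g = proj(w, principal_supp(w, K))          # g(1)
    for s in range(1, max_iter + 1):
        if s == 1:
            tau = 0.0
        else:
            diff = g - g_prev
            Phi_diff = Phi @ diff
            tau = np.dot(y - Phi @ g, Phi_diff) / (np.dot(Phi_diff, Phi_diff) + eps)
        w = g + tau * (g - g_prev)             # w(s)
        r_w = Phi.T @ (y - Phi @ w)            # r^w(s)
        p = proj(r_w, np.nonzero(w)[0])        # restrict to Supp(w(s))
        alpha_t = np.dot(p, p) / (np.dot(Phi @ p, Phi @ p) + eps)
        h = w + alpha_t * r_w                  # h(s)
        Omega = principal_supp(h, K)
        g_tilde = proj(h, Omega)               # hard thresholding to K entries
        r = Phi.T @ (y - Phi @ g_tilde)        # r(s)
        q = proj(r, Omega)
        alpha = np.dot(q, q) / (np.dot(Phi @ q, Phi @ q) + eps)
        g_prev, g = g, g_tilde + alpha * q     # g(s+1)
        if np.linalg.norm(w) <= tol:           # simple surrogate stopping rule
            break
    return g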

We now elaborate on the details of our implementation of FIHT.

Stopping criterion. In our implementation of FIHT, we stop the iterations whenever any of the following holds:

  • $s\geq 26$ (i.e., the maximum number of iterations is $25$);

  • $\|w(s)\|_{2}\leq 10^{-4}$;

  • $\operatorname{std}\big(\|w(s^{\prime})\|_{2}:s-3\leq s^{\prime}\leq s\big)\leq 0.01\cdot\operatorname{mean}\big(\|w(s^{\prime})\|_{2}:s-3\leq s^{\prime}\leq s\big)$.

    Here the tuple $\big(\|w(s^{\prime})\|_{2}:s-3\leq s^{\prime}\leq s\big)$ records the values of $\|w(s)\|_{2}$ over the last $4$ iterations, and $\operatorname{std}$ and $\operatorname{mean}$ denote the standard deviation and the mean, respectively.
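
In code, a minimal sketch of this check (function name ours) could read as follows, where w_norms is the list of $\|w(s^{\prime})\|_{2}$ values recorded so far:

import numpy as np

def should_stop(s, w_norms, max_iter=25, abs_tol=1e-4, rel_std_tol=0.01):
    if s >= max_iter + 1:                       # at most 25 iterations
        return True
    if w_norms[-1] <= abs_tol:                  # ||w(s)||_2 is small
        return True
    if len(w_norms) >= 4:                       # last 4 values nearly constant
        last4 = np.array(w_norms[-4:])
        if last4.std() <= rel_std_tol * last4.mean():
            return True
    return False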

Multiplication of $\Phi$ with a vector. We generate the sensing matrix $\Phi$ from an orthogonal matrix $B$ as in Proposition 1, where the “base matrix” $B$ is chosen to be either the DCT matrix

B_{ij}=\sqrt{\frac{2}{d}}\cdot\frac{1}{\sqrt{1+\delta_{1,i}}}\cos\frac{\pi(i-1)(2j-1)}{2d},\quad 1\leq i,j\leq d,

or, when $d$ is a power of $2$, the WHT matrix $B=H^{(\log_{2}d)}$ defined recursively by

H^{(0)}=1,\qquad H^{(k)}=\frac{1}{\sqrt{2}}\begin{bmatrix}H^{(k-1)}&H^{(k-1)}\\ H^{(k-1)}&-H^{(k-1)}\end{bmatrix},\quad k\geq 1.

Since both the DCT and the WHT admit fast algorithms of time complexity $O(d\log d)$, the matrix-vector products $\Phi u$ and $\Phi^{\top}u$ can also be computed in $O(d\log d)$ time. Specifically, suppose $\Sigma\subseteq\{1,\ldots,d\}$ is the index set of the $Q$ rows of $B$ that form $\Phi$. Taking the DCT as an example, for any $u\in\mathbb{R}^{d}$, we can compute $\Phi u$ as follows:

  1. $\widehat{u}\leftarrow\sqrt{d/Q}\cdot\mathsf{FastDCT}(u)$;

  2. $\widehat{u}_{j}\leftarrow 0$ for each $j\in\Sigma^{c}$.

Then $\widehat{u}$ is a vector whose entries indexed by $\Sigma$ are exactly the entries of $\Phi u$, while all other entries are zero. Similarly, for any $u\in\mathbb{R}^{Q}$, we can compute $\Phi^{\top}u$ by the following steps:

  1. $\tilde{u}\leftarrow 0\in\mathbb{R}^{d}$;

  2. $\tilde{u}_{\Sigma}\leftarrow u$, where $\tilde{u}_{\Sigma}$ denotes the subvector of $\tilde{u}$ with indices in $\Sigma$;

  3. $v\leftarrow\sqrt{d/Q}\cdot\mathsf{FastInverseDCT}(\tilde{u})$.

Then we have $v=\Phi^{\top}u$. Note that if we employ the WHT instead of the DCT, then both $\mathsf{FastDCT}$ and $\mathsf{FastInverseDCT}$ can be replaced by $\mathsf{FastWHT}$, since the WHT matrix is symmetric.
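
As a minimal sketch consistent with the steps above (so that $\Phi=\sqrt{d/Q}\,B_{\Sigma}$ with $B$ the orthonormal DCT matrix; the function names are ours), these two fast products can be implemented with SciPy's DCT routines as follows. Here $\Phi u$ is returned directly as a vector in $\mathbb{R}^{Q}$ rather than embedded in $\mathbb{R}^{d}$.

import numpy as np
from scipy.fft import dct, idct

def phi_mul(u, Sigma, d, Q):
    # Phi @ u, where Phi consists of the rows of sqrt(d/Q)*B indexed by Sigma
    u_hat = np.sqrt(d / Q) * dct(u, norm='ortho')   # sqrt(d/Q) * B u via the fast DCT
    return u_hat[Sigma]                             # keep only the rows in Sigma

def phi_t_mul(u, Sigma, d, Q):
    # Phi^T @ u: embed u on Sigma and apply the fast inverse DCT (B^T = B^{-1})
    u_tilde = np.zeros(d)
    u_tilde[Sigma] = u
    return np.sqrt(d / Q) * idct(u_tilde, norm='ortho')

With this construction, each product costs $O(d\log d)$ time, and only the index set $\Sigma$ (or a random seed generating it) needs to be stored or transmitted.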

References

  • [1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
  • [2] S. Magnússon, C. Enyioha, N. Li, C. Fischione, and V. Tarokh, “Communication complexity of dual decomposition methods for distributed resource allocation optimization,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 4, pp. 717–732, 2018.
  • [3] C.-S. Lee, N. Michelusi, and G. Scutari, “Finite rate quantized distributed optimization with geometric convergence,” in 52nd Asilomar Conference on Signals, Systems, and Computers, 2018, pp. 1876–1880.
  • [4] A. Reisizadeh, A. Mokhtari, H. Hassani, and R. Pedarsani, “An exact quantized decentralized gradient descent algorithm,” IEEE Transactions on Signal Processing, vol. 67, no. 19, pp. 4934–4947, 2019.
  • [5] S. U. Stich, “Local SGD converges fast and communicates little,” arXiv preprint arXiv:1805.09767, 2018.
  • [6] J. Wang and G. Joshi, “Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms,” arXiv preprint arXiv:1808.07576, 2018.
  • [7] H. Yu, S. Yang, and S. Zhu, “Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 5693–5700.
  • [8] J. Sun, T. Chen, G. B. Giannakis, Q. Yang, and Z. Yang, “Lazily aggregated quantized gradient innovation for communication-efficient federated learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [9] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [10] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: Compressed optimisation for non-convex problems,” in Proceedings of the 35th International Conference on Machine Learning, 2018, pp. 560–569.
  • [11] S. Magnússon, H. Shokri-Ghadikolaei, and N. Li, “On maintaining linear convergence of distributed learning and optimization under limited communication,” IEEE Transactions on Signal Processing, vol. 68, pp. 6101–6116, 2020.
  • [12] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified SGD with memory,” in Advances in Neural Information Processing Systems, vol. 31.   Curran Associates, Inc., 2018.
  • [13] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli, “The convergence of sparsified gradient methods,” in Advances in Neural Information Processing Systems, vol. 31.   Curran Associates, Inc., 2018.
  • [14] J. Zhang, N. Li, and M. Dedeoglu, “Federated learning over wireless networks: A band-limited coordinated descent approach,” in IEEE Conference on Computer Communications, 2021.
  • [15] S. P. Karimireddy, Q. Rebjock, S. Stich, and M. Jaggi, “Error feedback fixes signSGD and other gradient compression schemes,” in International Conference on Machine Learning, 2019, pp. 3252–3261.
  • [16] N. Ivkin, D. Rothchild, E. Ullah, V. Braverman, I. Stoica, and R. Arora, “Communication-efficient distributed SGD with sketching,” in Advances in Neural Information Processing Systems, vol. 32.   Curran Associates, Inc., 2019.
  • [17] D. Rothchild, A. Panda, E. Ullah, N. Ivkin, I. Stoica, V. Braverman, J. Gonzalez, and R. Arora, “FetchSGD: Communication-efficient federated learning with sketching,” in International Conference on Machine Learning, 2020, pp. 8253–8265.
  • [18] M. Charikar, K. Chen, and M. Farach-Colton, “Finding frequent items in data streams,” in International Colloquium on Automata, Languages, and Programming.   Springer, 2002, pp. 693–703.
  • [19] H. Cai, D. Mckenzie, W. Yin, and Z. Zhang, “Zeroth-order regularized optimization (ZORO): Approximately sparse gradients and adaptive sampling,” arXiv preprint arXiv:2003.13001, 2020.
  • [20] E. J. Candès and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 21–30, 2008.
  • [21] B. K. Natarajan, “Sparse approximate solutions to linear systems,” SIAM Journal on Computing, vol. 24, no. 2, pp. 227–234, 1995.
  • [22] G. Davis, S. Mallat, and M. Avellaneda, “Adaptive greedy approximations,” Constructive Approximation, vol. 13, no. 1, pp. 57–98, 1997.
  • [23] A. Gilbert and P. Indyk, “Sparse recovery using sparse matrices,” Proceedings of the IEEE, vol. 98, no. 6, pp. 937–947, 2010.
  • [24] S. Foucart, “Sparse recovery algorithms: Sufficient conditions in terms of restricted isometry constants,” in Approximation Theory XIII: San Antonio 2010.   Springer, 2012, pp. 65–77.
  • [25] M. E. Lopes, “Unknown sparsity in compressed sensing: Denoising and inference,” IEEE Transactions on Information Theory, vol. 62, no. 9, pp. 5145–5166, 2016.
  • [26] I. Haviv and O. Regev, “The restricted isometry property of subsampled Fourier matrices,” in Geometric aspects of functional analysis.   Springer, 2017, pp. 163–179.
  • [27] K. Wei, “Fast iterative hard thresholding for compressed sensing,” IEEE Signal Processing Letters, vol. 22, no. 5, pp. 593–597, 2014.
  • [28] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt, “Practical and optimal LSH for angular distance,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, 2015, pp. 1225–1233, full version available at arXiv:1509.02897.
  • [29] A. C. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin, “One sketch for all: fast algorithms for compressed sensing,” in Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, 2007, pp. 237–246.
  • [30] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
  • [31] L. Dalcín, R. Paz, M. Storti, and J. D’Elía, “MPI for Python: Performance improvements and MPI-2 extensions,” Journal of Parallel and Distributed Computing, vol. 68, no. 5, pp. 655–662, 2008.
  • [32] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, “Open MPI: Goals, concept, and design of a next generation MPI implementation,” in Proceedings, 11th European PVM/MPI Users’ Group Meeting, 2004, pp. 97–104.
  • [33] G. Cormode and S. Muthukrishnan, “An improved data stream summary: the count-min sketch and its applications,” Journal of Algorithms, vol. 55, no. 1, pp. 58–75, 2005.
  • [34] E. J. Candes and T. Tao, “Decoding by linear programming,” IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
  • [35] D. Needell and J. A. Tropp, “CoSaMP: Iterative signal recovery from incomplete and inaccurate samples,” Applied and Computational Harmonic Analysis, vol. 26, no. 3, pp. 301–321, 2009.
  • [36] J. D. Blanchard, J. Tanner, and K. Wei, “CGIHT: Conjugate gradient iterative hard thresholding for compressed sensing and matrix completion,” Information and Inference: A Journal of the IMA, vol. 4, no. 4, pp. 289–327, 2015.
  • [37] R. Berinde, A. C. Gilbert, P. Indyk, H. Karloff, and M. J. Strauss, “Combining geometry and combinatorics: A unified approach to sparse signal recovery,” in 2008 46th Annual Allerton Conference on Communication, Control, and Computing.   IEEE, 2008, pp. 798–805.
  • [38] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, “A simple proof of the restricted isometry property for random matrices,” Constructive Approximation, vol. 28, no. 3, pp. 253–263, 2008.