
Optimal Compression of Unit Norm Vectors
in the High Distortion Regime

Heng Zhu, Avishek Ghosh and Arya Mazumdar
University of California, San Diego; Indian Institute of Technology, Bombay
Emails: [email protected], avishek_[email protected], [email protected]
Abstract

Motivated by the need for communication-efficient distributed learning, we investigate methods for compressing a unit norm vector into the minimum number of bits, while still allowing for some acceptable level of distortion in recovery. This problem has been explored in the rate-distortion/covering code literature, but our focus is exclusively on the "high-distortion" regime. We approach this problem in a worst-case scenario, without any prior information on the vector, but allowing for the use of randomized compression maps. Our study considers both biased and unbiased compression methods and determines the optimal compression rates. It turns out that simple compression schemes are nearly optimal in this scenario. While the results are a mix of new and known, they are compiled in this paper for completeness.

I Introduction

Data compression is an active area of research in signal processing, especially in the context of image, audio, and video processing [1, 2, 3]. Apart from the classical applications, data compression plays a crucial role in modern large-scale learning systems such as federated learning [4, 5, 6]. Federated learning (FL) is a distributed learning paradigm with one central server and a large number (millions or more) of client machines, and the exchange of information between the server and the clients is crucial for learning models. In a generic federated learning framework, each of a large number of clients computes a local stochastic gradient of a loss function and sends it to the server. The server averages the received stochastic gradients and updates the model parameters. In modern machine learning, one necessarily needs to deal with extremely high-dimensional data ($d$ is very large) and train large neural networks with billions (if not more) of parameters. Hence, one of the major challenges faced by FL is the communication cost between the client machines and the server, since this cost is associated with the internet bandwidth of the users, who are often resource constrained [4].

A canonical way to reduce the communication cost in FL is to compress (sparsify or quantize) the data (gradients) before communicating, and this forms the basis of our study. Indeed, data compression is widely used in FL systems ([7, 8, 9, 10]), and several compression schemes, both deterministic and randomized, have been proposed in the last few years. Broadly, the compression schemes used in FL fall into two categories: (i) unbiased ([7, 8]) and (ii) biased ([9, 8, 10]). For an unbiased compressor, we require conditions on the first and second moments (see Definition 1), while for a biased one, we only require a condition on the second moment (see Definition 2). We now define them formally:

Definition 1 (Unbiased $\omega$-compressor).

A randomized operator $\mathcal{Q}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is an unbiased $\omega$-compressor if it satisfies

{\mathbb{E}}_{\mathcal{Q}}[\mathcal{Q}(x)]=x,
{\mathbb{E}}_{\mathcal{Q}}\left\|\mathcal{Q}(x)-x\right\|^{2}\leq(\omega-1)\left\|x\right\|^{2},\quad\forall x\in\mathbb{R}^{d},

where $\omega\geq 1$ is the (unbiased) compression parameter.

Typical unbiased compressors include:

  • Randomized Quantization in QSGD [7]: For any real number $r\in[a,b]$, quantize $r$ to $a$ with probability $\frac{b-r}{b-a}$, and to $b$ with probability $\frac{r-a}{b-a}$.

  • Rand-$k$ sparsification [11]: For any $x\in\mathbb{R}^{d}$, randomly select $k$ elements of $x$ to be scaled by $\frac{d}{k}$, and set the other elements to zero.
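Below is a minimal NumPy sketch of these two unbiased compressors; the function names (randomized_round, rand_k) and the random-generator setup are ours, for illustration only, and are not taken from [7] or [11].

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_round(r, a, b):
    """Quantize r in [a, b]: return a with probability (b - r)/(b - a), else b.
    The expectation of the output equals r, so the operator is unbiased."""
    return b if rng.random() < (r - a) / (b - a) else a

def rand_k(x, k):
    """Rand-k sparsification: keep k random coordinates of x, scaled by d/k,
    so that E[Q(x)] = x."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * x[idx]
    return out
```

For rand-$k$, a direct computation gives ${\mathbb{E}}\|Q(x)-x\|^{2}=(\frac{d}{k}-1)\|x\|^{2}$, i.e., $\omega=\frac{d}{k}$.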

A more general definition, which includes biased compressors, is the following:

Definition 2 (Biased $\delta$-compressor).

A randomized operator $\mathcal{Q}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is a $\delta$-compressor if it satisfies

{\mathbb{E}}_{\mathcal{Q}}\left\|\mathcal{Q}(x)-x\right\|^{2}\leq(1-\delta)\left\|x\right\|^{2},\quad\forall x\in\mathbb{R}^{d},

where $\delta\in[0,1]$ is the (biased) compression parameter. If the compressor is not random, we remove the expectation.

Typical biased compressors include:

  • $\ell_{1}$-sign quantization [9]: For any $x\in\mathbb{R}^{d}$, $\mathcal{Q}(x)=\frac{\left\|x\right\|_{1}}{d}{\rm sign}(x)$. Here $\delta$ is $\frac{\left\|x\right\|_{1}^{2}}{d\left\|x\right\|^{2}}$.

  • Top-$k$ sparsification [8]: For any $x\in\mathbb{R}^{d}$, retain the $k$ elements with the largest absolute values and set the other elements to zero. Here $\delta$ is $\frac{k}{d}$.
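A minimal NumPy sketch of these two biased compressors follows; the function names (top_k, l1_sign) are ours, and the $\ell_{1}$-sign scaling uses the dimension $d$ as the normalizer, as in the bullet above.

```python
import numpy as np

def top_k(x, k):
    """Top-k sparsification: keep the k entries of x with largest magnitude
    (a biased delta-compressor with delta = k/d)."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]      # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

def l1_sign(x):
    """ell_1-sign quantization: Q(x) = (||x||_1 / d) * sign(x)."""
    return (np.abs(x).sum() / x.size) * np.sign(x)

# Check the Definition 2 bound ||Q(x) - x||^2 <= (1 - delta) ||x||^2 for top-k
rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
x /= np.linalg.norm(x)
q = top_k(x, 100)
print(np.linalg.norm(q - x) ** 2, "<=", 1 - 100 / 1000)
```

For top-$k$, the $k$ largest-magnitude entries carry at least a $\frac{k}{d}$ fraction of $\|x\|^{2}$, which gives the stated $\delta=\frac{k}{d}$.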

Ranges of $\omega$ and $\delta$: The parameters $\omega$ and $1-\delta$ measure the amount of distortion of the compressors. We emphasize that for the biased compressor, $\delta$ is in the range $[0,1]$, while in the unbiased case, $\omega$ is in the range $[1,+\infty)$. Owing to unbiasedness, the variance of the compressor often increases, and we require $\omega$ to be large in such settings [7, 12]. Thus there is a fundamental difference between the definitions of the biased $\delta$-compressor and the unbiased $\omega$-compressor. In this work we consider the high distortion regime, where $\delta$ is small and $\omega$ is large, again motivated by FL applications where extreme compression is desirable.

We aim to find the optimal communication cost of the above-mentioned widely used compressors. Since, in standard FL applications, one can normalize the information vector and send the norm separately, we can assume without loss of generality that the compressors are applied to a unit vector $x$, i.e., $\|x\|^{2}=1$ in Definitions 1 and 2.

I-A Our contributions

Motivated by the compression needs in FL, in this paper we study the basic problem of compressing a unit norm vector in high dimension. Though this problem has been extensively explored in the context of rate-distortion and covering codes [13, 1, 14], we are interested in the high distortion regime, where the recovery error is high and the compression rate tends to $0$. Moreover, we do not assume any prior information on the vector to be compressed, and hence our results hold in the worst-case setting.

We first obtain lower bounds on the number of bits required to transmit such unit norm vectors under biased as well as unbiased compression schemes. For the biased schemes, this follows directly via a sphere covering argument, and for the unbiased case, we leverage results from [15, 16]. Moreover, we propose and analyze several efficient algorithms that match these lower bounds and are hence optimal. It turns out that for an unbiased $\omega$-compressor the minimum number of bits required is $\Omega(d/\omega)$, whereas for a biased $\delta$-compressor the least number of bits required is $\Omega(d\delta)$. For upper bounds, we provide the following compressors:

An optimal but inefficient biased $\delta$-compressor

In Section III-B, we propose and analyze a biased $\delta$-compressor based on a random Gaussian codebook. In this scheme, out of the generated codebook, we choose the Gaussian vector closest (in $\ell_{2}$ norm) to the information vector as the quantized vector. This scheme only requires $O(d\delta)$ bits, which matches the lower bound. Although optimal, the encoding is an exhaustive search over an exponentially large number of possible codewords, and hence this algorithm is not computationally efficient.

A near-optimal and efficient biased $\delta$-compressor

In Section III-C, we propose an efficient biased compressor that is (near) optimal, namely Max Block Norm Quantization (MBNQ). In this algorithm, we first find the sub-block of length $k$ that has the largest norm, and then use scalar quantization (coordinate-wise) on this sub-vector. As shown in Theorem 2, MBNQ requires $O(d\delta\log(d\delta))$ bits. Furthermore, MBNQ is computationally efficient, as seen in Algorithm 2.

A near-optimal and efficient unbiased $\omega$-compressor

In Section IV-B, we discuss the vector quantized SGD (VQSGD) algorithm of [12], which requires $\mathcal{O}(d/\omega)$ bits and is hence optimal. However, similar to the random Gaussian codebook algorithm, this is also inefficient, and the computational complexity scales exponentially with the dimension $d$. In Section IV-C, we propose an efficient algorithm, namely the Sparse Randomized Quantization Scheme (SRQS), that first applies the rand-$k$ compressor (of [8]) and then uses QSGD (of [7]) on the sparse $k$-length sub-vector. As shown in Theorem 5, this simple combination yields an efficient algorithm requiring $O((d/\omega)\log(d\omega))$ bits, and hence SRQS is (near) optimal.

Our contributions are summarized in Table I.

TABLE I: Summary of findings.
LB: Lower Bounds; Random: Random Gaussian Codebook Scheme and VQSGD [12]; Sparse: Max Block Norm Quantization and Sparse Randomized Quantization Scheme.

Compressor | LB | Random | Sparse
Biased | $\Omega(d\delta)$ | $O(d\delta)$ | $O(d\delta\log d\delta)$
Unbiased | $\Omega\left(\frac{d}{\omega}\right)$ [15] | $O\left(\frac{d}{\omega}\right)$ [12] | $O\left(\frac{d}{\omega}\log d\omega\right)$

II Related Work

II-A $\epsilon$-nets

Note that our biased compressors are just $\epsilon$-nets for the unit sphere (with $\epsilon=1-\delta$), and it is known that there exists an $\epsilon$-net of size $(1+2/\epsilon)^{d}$ [17]. However, such results on $\epsilon$-nets are tailored to very small $\epsilon$; for $\epsilon$ very close to 1, they do not provide a fine-grained dependence on $\delta$.

II-B Quantization

A straightforward way to cut down communication is to use quantization or sparsification techniques. Since the dimension of such data is typically huge, dimension reduction techniques often turn out to be useful. To achieve compression, one can quantize each coordinate of a transmitted vector into a few bits [7, 18, 19], or obtain a sparser vector by setting some elements to zero [8, 11, 20]. These compressors are either biased or unbiased, which influences the convergence of learning algorithms. In [21, 22], a lower bound on the compression budget is obtained to retain the convergence rate of stochastic optimization in the uncompressed setting; to achieve the lower bound, a compression scheme based on random rotation and block quantization was proposed. In the context of distributed mean estimation, a lower bound on the estimation accuracy is derived given a compression budget [23], based on a distributed statistical estimation lower bound; a random rotation scheme was also proposed to approach that lower bound.

II-C Federated Learning and Communication Cost

As mentioned in the introduction, FL algorithms employ several techniques to reduce the communication cost. One simple way is to use local iterations and communicate with the server sparsely [6, 24, 25]. Another way to reduce the number of communication rounds is to use second-order Newton-based algorithms [26, 27, 28, 29], which exploit the compute power of the client machines and cut the communication cost.

III Biased $\delta$-Compressors

We start with the biased $\delta$-compressor. We first answer the question of the minimum communication cost of a $\delta$-compressor by providing a lower bound, which follows from a sphere covering argument.

III-A Lower Bound

Proposition 1.

Let $Q(v)$ be a compressor of $v$ satisfying Definition 2. Then the number of bits $b$ required to transmit $Q(v)$ satisfies

b\geq d\delta+\log d. \qquad (1)
Proof.

The proof employs a simple sphere covering of the unit sphere $S^{d-1}$ [14]. We know that $\mathcal{Q}(v)$ resides within the ball $B_{d}(v,1-\delta)$. The cardinality of the set of balls covering the sphere $S^{d-1}$ is at least

C=\frac{\text{vol}(S^{d-1})}{\text{vol}(S^{d-1}\cap B_{d}(v,1-\delta))}\geq\frac{\text{vol}(S^{d-1})}{\text{vol}(B_{d}(v,1-\delta))}.

We can send the indices of these balls covering the unit sphere. Thus the necessary number of bits $b$ to represent $\mathcal{Q}(v)$ satisfies

b=\log C\geq\log\frac{\text{vol}(S^{d-1})}{\text{vol}(B_{d}(v,1-\delta))}=\log\frac{2\pi^{\frac{d}{2}}/\Gamma(\frac{d}{2})}{[\pi^{\frac{d}{2}}/\Gamma(\frac{d}{2}+1)](1-\delta)^{d}}=\log\frac{d}{(1-\delta)^{d}}=\log d-d\log(1-\delta)\geq\log d+d\delta,

where $\Gamma(\cdot)$ is the Gamma function, and the last inequality follows from the fact that $\log(1+x)\leq x$ for $x>-1$. ∎

Considering that $\delta$ falls within the range $[0,1]$, the required number of bits is smaller than that of transmitting the uncompressed vector directly, which would require $O(d)$ bits.
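For a rough, purely illustrative sense of scale (ignoring constants and the base of the logarithm in (1)): with $d=10^{6}$ and $\delta=0.01$,

b\geq d\delta+\log d\approx 10^{6}\cdot 10^{-2}+\log 10^{6}\approx 10^{4}\ \text{bits},

which is about three orders of magnitude fewer than the roughly $32d\approx 3.2\times 10^{7}$ bits needed to send the raw vector in single precision.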

In the following we first present a random Gaussian codebook scheme that achieves the aforementioned lower bound. However, this scheme incurs significant computational and storage requirements, rendering it infeasible in practice. Consequently, we introduce an alternative practical sparse quantization scheme which is nearly optimal.

III-B Random Gaussian Codebook Scheme

We use a vector quantization method via a random Gaussian construction: let a random Gaussian matrix $A\in\mathbb{R}^{d\times n}$ be

A=\frac{1}{\sqrt{N}}\left[a_{1},\dots,a_{n}\right] \qquad (2)

where each column $a_{i}\in\mathbb{R}^{d}$ is a Gaussian vector with each element being a standard Gaussian variable $\mathcal{N}(0,1)$, and $N$ is a normalization factor that will be chosen later. The columns of the Gaussian matrix $A$ are regarded as a codebook for compression.

For a unit norm vector $v$ to be compressed, we find the nearest Gaussian vector

a_{\text{min}}=\underset{\{a_{i}\}_{i=1}^{n}}{\arg\min}\left\|v-\frac{1}{\sqrt{N}}a_{i}\right\|^{2}.

Then we map $v$ to the Gaussian vector

\mathcal{Q}(v)=\frac{1}{\sqrt{N}}a_{\text{min}}.

To transmit the compressed vector, we only need to send the index $i_{\text{min}}$ of the nearest Gaussian vector. The random Gaussian codebook scheme is described in Algorithm 1.

Algorithm 1 Random Gaussian Codebook Scheme

Input: Unit norm vector $v$; a matrix $A\in\mathbb{R}^{d\times n}$ constructed as in (2)
Encoding:

1:  Calculate the distances between $v$ and all the Gaussian vectors $\{a_{i}\}_{i=1}^{n}$: $\text{dist}_{i}=\left\|v-\frac{1}{\sqrt{N}}a_{i}\right\|^{2}$
2:  Find the smallest distance and the corresponding index $i_{\text{min}}$

Output: Index of the Gaussian vector $i_{\text{min}}$
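For concreteness, a minimal NumPy sketch of the encoder in Algorithm 1 is given below; the toy sizes are ours and far smaller than the $n=\exp(O(d\delta))$ required by Theorem 1, so the code only illustrates the exhaustive search, not the guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_codebook(d, n, N):
    """Codebook A / sqrt(N): columns are i.i.d. N(0, I_d) vectors scaled by 1/sqrt(N)."""
    return rng.standard_normal((d, n)) / np.sqrt(N)

def encode(v, codebook):
    """Exhaustive search: index of the codeword closest to v in squared ell_2 distance."""
    dists = np.sum((codebook - v[:, None]) ** 2, axis=0)
    return int(np.argmin(dists))

d, delta = 64, 0.05
n, N = 2 ** 12, d / delta                 # toy codebook size, for illustration only
A = gaussian_codebook(d, n, N)
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
i_min = encode(v, A)
print(i_min, np.linalg.norm(v - A[:, i_min]) ** 2)   # ideally at most 1 - delta
```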

For this scheme, we have the following guarantee:

Theorem 1.

A Gaussian codebook constructed as in (2), with $N=\frac{d}{\delta}$ and of size $n=\exp(O(d\delta))$, is a $\delta$-compressor with probability at least $P_{\delta}=1-\exp\left[-\frac{n}{2\sqrt{2\pi}(2+t)\sqrt{d\delta}}\exp(-\frac{(2+t)^{2}d\delta}{8})+\frac{2d}{\sqrt{1-\delta}}\right]=1-o(1)$. The number of bits needed is $b=\log_{2}n=O(d\delta)$.

The proof is in Appendix A.

Remark 1.

Since we use $O(d\delta)$ bits to represent the index among the $n$ Gaussian vectors, the random Gaussian codebook scheme achieves the lower bound for biased $\delta$-compressors. However, it is impractical to store the large number of Gaussian vectors and perform that many distance computations at each iteration, since $n$ grows exponentially with the model dimension $d$.

III-C Max Block Norm Quantization (MBNQ)

To give a practical compression scheme, we show that a simple block quantization scheme can approach the lower bound up to an extra logarithmic factor. This scheme, Max Block Norm Quantization, uses standard scalar quantization on the sub-block with the largest norm.

Our scheme first uniformly partitions the vector into sub-blocks $v_{j}\in\mathbb{R}^{k}$, $j=1,2,\dots,d/k$. Then we pick the sub-block $v_{\max}$ with the largest norm among $\{v_{j}\}_{j=1}^{d/k}$.

We next quantize $v_{\max}$ with a standard scalar quantizer $SQ(\cdot)$ with $l$ quantization levels to get the quantized vector $q=SQ(v_{\max})$. The quantization function is $SQ(x)=r_{i}$ if $x\in[low_{i},up_{i})$, $i=1,\dots,l$, where $low_{i},up_{i}$ are the lower and upper bounds of a quantization interval, and $r_{i}$ can be any quantization value inside that interval. That is, each element of $v_{\max}$ is quantized to the closest quantization value; the scalar quantization function is applied element-wise. The final compressed vector is $\mathcal{Q}(v)=q$. The detailed process of MBNQ is described in Algorithm 2.

Algorithm 2 Max Block Norm Quantization (MBNQ)

Input: Unit norm vector $v$
Initialization: Define a scalar quantization function $SQ(\cdot)$ with $l$ quantization levels, where $SQ(x)=r_{i}$ if $x\in[low_{i},up_{i})$, $i=1,\dots,l$; $low_{i},up_{i}$ are the lower and upper bounds of a quantization interval, and $r_{i}$ is the quantization value inside that interval

1:  Partition the vector $v$ into $d/k$ sub-vectors $v_{j}\in\mathbb{R}^{k}$, $j=1,2,\dots,d/k$
2:  Calculate the norms of the sub-vectors $u_{j}=\|v_{j}\|$, then pick the sub-vector $v_{\text{max}}$ with the largest norm $u_{\text{max}}$
3:  Use the scalar quantization function to quantize $v_{\text{max}}$: $q=SQ(v_{\text{max}})$, where the quantization function is applied element-wise

Output: Compressed vector $q$
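A minimal NumPy sketch of MBNQ is given below (our own function names, assuming $k$ divides $d$); the uniform mid-rise quantizer over $[-\|v_{\max}\|,\|v_{\max}\|]$ is one concrete choice of $SQ(\cdot)$ whose per-coordinate error is at most $\|v_{\max}\|/l$, consistent with the bound used in the proof of Theorem 2.

```python
import numpy as np

def mbnq(v, k, l):
    """Max Block Norm Quantization (sketch): quantize the length-k block of v
    with the largest norm using a uniform l-level scalar quantizer."""
    d = v.size
    blocks = v.reshape(d // k, k)                     # assumes k divides d
    j = int(np.argmax(np.sum(blocks ** 2, axis=1)))   # index of the max-norm block
    vmax = blocks[j]
    m = np.linalg.norm(vmax)
    width = 2 * m / l                                 # cell width of the quantizer
    cells = np.clip(np.floor((vmax + m) / width), 0, l - 1)
    q = -m + (cells + 0.5) * width                    # midpoint of each cell
    return j, q

def mbnq_decode(j, q, d, k):
    """Place the quantized block back at its position; zeros elsewhere."""
    out = np.zeros(d)
    out[j * k:(j + 1) * k] = q
    return out

d, k, l = 1024, 64, 12
v = np.random.default_rng(2).standard_normal(d)
v /= np.linalg.norm(v)
j, q = mbnq(v, k, l)
print(np.linalg.norm(v - mbnq_decode(j, q, d, k)) ** 2)   # compare with 1 - k/(2d)
```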

For the MBNQ scheme, we have the following guarantee.

Theorem 2.

With sub-block length $k=2d\delta$ and number of quantization levels $l=\sqrt{4d\delta}$, MBNQ is a $\delta$-compressor with number of bits

b=d\delta\log(4d\delta)+\log\frac{1}{2\delta}.
Proof.

The compression error of our proposed scheme can be expressed as follows:

{\mathbb{E}}\left\|v-Q(v)\right\|^{2}={\mathbb{E}}\left\|v-\bar{v}_{\max}+\bar{v}_{\max}-q\right\|^{2}={\mathbb{E}}\left\|v-\bar{v}_{\max}\right\|^{2}+{\mathbb{E}}\left\|\bar{v}_{\max}-q\right\|^{2}, \qquad (3)

where $\bar{v}_{\max}\in\mathbb{R}^{d}$ represents the vector whose $k$ elements equal $v_{\max}$ and whose remaining elements are zero. The last equality follows from the fact that $\bar{v}_{\max}-q$ has non-zero entries only in those $k$ coordinates, while $v-\bar{v}_{\max}$ has non-zero entries only in the remaining $d-k$ coordinates, so that $\langle v-\bar{v}_{\max},\bar{v}_{\max}-q\rangle=0$.

The first term in (3) corresponds to the error resulting from the partitioning of the vector. Since $v_{\max}$ is a sub-vector of $v$, we have

\left\|v-\bar{v}_{\max}\right\|^{2}=1-u_{\max}^{2},

where $u_{\max}$ denotes the norm of $v_{\max}$. The second term in (3) represents the error resulting from scalar quantization with $l$ levels. From the standard scalar quantization error bound, we have

{\mathbb{E}}\left\|\bar{v}_{\max}-q\right\|^{2}\leq\frac{k}{l^{2}}\left\|\bar{v}_{\max}\right\|^{2}=\frac{k}{l^{2}}u_{\max}^{2}.

Here we choose $k=\frac{l^{2}}{2}$; then we obtain

{\mathbb{E}}\left\|v-Q(v)\right\|^{2}\leq 1-u_{\max}^{2}+\frac{k}{l^{2}}u_{\max}^{2}=1-\frac{u_{\max}^{2}}{2}.

Note that the error decreases with a larger $u_{\max}$. Since we pick the sub-vector with the largest norm, we have $u_{\max}^{2}\geq\frac{k}{d}$. Therefore we obtain

{\mathbb{E}}\left\|v-Q(v)\right\|^{2}\leq 1-\frac{k}{2d}.

Hence, the MBNQ scheme is a $\delta$-compressor with $\delta=\frac{k}{2d}$.

Regarding the communication cost of the MBNQ scheme, each worker needs to transmit the index of the sub-vector with the largest norm and $k$ quantized real numbers. Thus, the total number of required bits can be calculated as:

b=\log\frac{d}{k}+k\log l=\log\frac{d}{k}+\frac{k}{2}\log(2k)=\log\frac{1}{2\delta}+d\delta\log(4d\delta),

which represents the communication cost in Theorem 2. ∎

Remark 2.

The implementation of MBNQ is straightforward in real systems, and this simple scheme approaches the lower bound for biased $\delta$-compressors up to a logarithmic factor. If we directly applied scalar quantization to the whole vector $v$, the communication cost would be $O(d)$. The improvement comes from quantizing only a sub-vector: when the sub-vector size $k$ is chosen properly, the quantization achieves nearly optimal performance.

IV Unbiased $\omega$-Compressors

In this section we provide results on the unbiased compressors of Definition 1. We start with a lower bound and a random codebook scheme that achieves this lower bound, both following from existing results. Next we propose a practical and simple sparse quantization scheme that approaches the lower bound up to a logarithmic factor.

IV-A Lower Bound

For unbiased compressors, [15] already provides a lower bound under both communication and privacy constraints. The lower bound is proven by constructing a prior distribution on $v$ and analyzing the compression error. Here we directly state the lower bound from [15]. The lower bound also follows from [12].

Theorem 3 ([15], Appendix D.2).

Let $Q(v)$ be a compressor of $v$ satisfying Definition 1. Then the number of bits $b$ required to describe $Q(v)$ satisfies

b=\Omega\left(\frac{d}{\omega}\right). \qquad (4)

Note that $\omega$ is in the range $[1,+\infty)$. Thus the lower bound of $\Omega(\frac{d}{\omega})$ is smaller than the $\Omega(d)$ cost of directly transmitting the vector as real numbers.

IV-B Unbiased Random Gaussian Codebook Scheme

In [12], a randomized vector quantization framework is proposed for unbiased compression. Among the quantization methods introduced in [12], there is a random Gaussian codebook method that achieves the lower bound for unbiased compressors.

Similar to the random Gaussian codebook scheme in the biased case, we first construct a random Gaussian matrix $A\in\mathbb{R}^{d\times n}$:

A=\frac{1}{\sqrt{N}}\left[a_{1},\dots,a_{n}\right] \qquad (5)

where each column $a_{i}\in\mathbb{R}^{d}$ is a Gaussian vector with standard Gaussian $\mathcal{N}(0,1)$ elements, and $N$ is the normalization factor. The columns of the Gaussian matrix $A$ are regarded as a codebook $C$ for compression, i.e., the points in $C$ are $c_{i}=\frac{1}{\sqrt{N}}a_{i}$, $i=1,\dots,n$. Then we use the Gaussian vectors to perform the compression:

Q(v)=\frac{1}{\sqrt{N}}a_{i},\quad\text{with probability } p_{i}.

The detailed process of the unbiased random Gaussian codebook scheme is described in Algorithm 3.

Algorithm 3 Unbiased Random Gaussian Codebook Scheme

Input: Unit norm vector $v$; a matrix $A\in\mathbb{R}^{d\times n}$ constructed as in (5)
Encoding:

1:  Construct a convex combination of the Gaussian vectors $\{\frac{1}{\sqrt{N}}a_{i}\}_{i=1}^{n}$ that equals $v$: $v=\sum_{i=1}^{n}p_{i}\frac{1}{\sqrt{N}}a_{i}$
2:  Randomly choose from $\{a_{i}\}_{i=1}^{n}$ with probability distribution $\{p_{i}\}_{i=1}^{n}$: $Q(v)=\frac{1}{\sqrt{N}}a_{i}$ with probability $p_{i}$
3:  Get the index $i$ of the chosen Gaussian vector

Output: Index of the chosen Gaussian vector $i$
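Step 1 above requires convex-combination weights $p_{i}$; one way to obtain them in a sketch is to solve a linear feasibility problem, as below. The function names are ours, feasibility is only guaranteed (with high probability) for codebooks of the size in Theorem 4, and this is an illustration rather than the exact construction analyzed in [12].

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

def encode_unbiased(v, C):
    """C holds the scaled codewords c_i = a_i / sqrt(N) as columns.
    Find p >= 0 with sum(p) = 1 and C p = v, then sample an index i ~ p,
    so that E[Q(v)] = sum_i p_i c_i = v."""
    d, n = C.shape
    A_eq = np.vstack([C, np.ones((1, n))])       # constraints: C p = v, sum(p) = 1
    b_eq = np.concatenate([v, [1.0]])
    res = linprog(np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    if not res.success:
        raise ValueError("no convex combination found for this (toy) codebook")
    p = np.clip(res.x, 0.0, None)
    p /= p.sum()
    return int(rng.choice(n, p=p))
```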

Theorem 4 ([12], Theorem 7).

A Gaussian codebook of size $n=\exp(O(\frac{d}{\omega}+\log d))$, constructed as in (5), with $N=\frac{9d}{\omega}$ and $\omega\in[25,36d]$, is an unbiased compressor (Definition 1) with probability $1-o(1)$. The number of compressed bits is

b=\log n=O\left(\frac{d}{\omega}+\log d\right).
Remark 3.

From the above theorem, the dominating term is $O(\frac{d}{\omega})$ bits, matching the lower bound in Theorem 3. However, the unbiased Gaussian codebook scheme also suffers from an impractical computational and storage burden.

IV-C Sparse Randomized Quantization Scheme (SRQS)

To design a practical compression method that approaches the lower bound, we propose a Sparse Randomized Quantization Scheme (SRQS) based on the well-known rand-$k$ sparsification and the randomized quantization of QSGD [7]. It is easy to implement, and we prove that it approaches the lower bound up to a logarithmic factor.

Our SRQS method is as follows. The rand-$k$ compressor randomly chooses $k$ elements of the vector $v$ and sets the other elements to zero. Specifically, each coordinate $j$ with value $v_{j}$ is set to $\frac{d}{k}v_{j}$ with probability $\frac{k}{d}$, and to $0$ with probability $1-\frac{k}{d}$.

For the sparse vector $v_{\text{spa}}$ obtained from $v$, we perform an unbiased Randomized Quantization as in [7] to quantize the vector: $q=RQ(v_{\text{spa}})$. This method, used in QSGD, works as follows. The quantization function consists of $l$ quantization levels. For each element $v_{j}$ of $v_{\text{spa}}$, the Randomized Quantization function is

RQ(v_{j})=\|v_{\text{spa}}\|_{2}\cdot\text{sign}(v_{j})\cdot\xi_{j}(v_{\text{spa}},l),

where $\text{sign}(x)\in\{+1,-1\}$ represents the sign of $x$, and $\xi_{j}$ is an independent random variable that determines which quantization level $v_{j}$ is mapped to, as defined next. Let $0\leq t<l$ be an integer such that $\frac{|v_{j}|}{\|v_{\text{spa}}\|_{2}}\in[\frac{t}{l},\frac{t+1}{l}]$, i.e., $[\frac{t}{l},\frac{t+1}{l}]$ is the quantization interval for $v_{j}$. Then the random variable $\xi_{j}$ is

\xi_{j}(v_{\text{spa}},l)=\left\{\begin{array}{ll}(t+1)/l,&\text{with probability }\frac{|v_{j}|}{\|v_{\text{spa}}\|}\cdot l-t,\\ t/l,&\text{otherwise}.\end{array}\right. \qquad (8)

Thus the Randomized Quantization is an unbiased method: ${\mathbb{E}}[RQ(v_{\text{spa}})]=v_{\text{spa}}$. As in QSGD, the final quantized vector $q$ is expressed by a tuple $(\left\|v_{\text{spa}}\right\|,s,z)$, where the vector $s$ contains the signs of the non-zero elements, $s_{i}\in\{+1,-1\}$, and $z$ contains the quantization levels of the non-zero elements, i.e., $z_{j}=\xi_{j}\cdot l\in\{0,1,\dots,l\}$.

Overall, the two-stage compression scheme, sequentially applying rand-$k$ and Randomized Quantization, is unbiased. It can be seen as a sparser version of QSGD. The detailed process of SRQS is described in Algorithm 4.

Algorithm 4 Sparse Randomized Quantization Scheme

Input: Unit norm vector $v$
Initialization: Design a Randomized Quantization function $RQ(\cdot)$ with $l$ quantization levels

1:  Apply the rand-$k$ compressor on $v$ to output $v_{\text{spa}}$: each coordinate $v_{j}$ is set to $\frac{d}{k}v_{j}$ with probability $\frac{k}{d}$, and to $0$ with probability $1-\frac{k}{d}$
2:  Use the Randomized Quantization function on $v_{\text{spa}}$ to obtain the compressed vector $q=RQ(v_{\text{spa}})$; $q$ is expressed as the tuple $(\left\|v_{\text{spa}}\right\|,s,z)$

Output: Compressed vector $q$
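A minimal NumPy sketch of SRQS follows (our own function names); decoding multiplies $\|v_{\text{spa}}\|$, the signs $s$, and the levels $z/l$, and the Monte Carlo check at the end compares the empirical error with the bound of Lemma 1 below.

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_k(x, k):
    """Rand-k as in Algorithm 4: each coordinate kept (scaled by d/k) with probability k/d."""
    d = x.size
    mask = rng.random(d) < k / d
    return np.where(mask, (d / k) * x, 0.0)

def randomized_quantize(v_spa, l):
    """QSGD-style randomized quantization with l levels; returns (norm, signs s, levels z)."""
    norm = np.linalg.norm(v_spa)
    if norm == 0.0:
        return 0.0, np.zeros_like(v_spa), np.zeros(v_spa.size, dtype=int)
    ratio = np.abs(v_spa) / norm * l                 # lies in [0, l]
    t = np.floor(ratio)
    z = (t + (rng.random(v_spa.size) < ratio - t)).astype(int)   # stochastic rounding
    return norm, np.sign(v_spa), z

def srqs(v, k, l):
    """Sparse Randomized Quantization Scheme: rand-k followed by randomized quantization."""
    v_spa = rand_k(v, k)
    return randomized_quantize(v_spa, l)             # decode as norm * s * z / l

# Monte Carlo check of the error bound in Lemma 1 with l = sqrt(2k)
d, k = 512, 32
l = int(np.sqrt(2 * k))
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
errs = []
for _ in range(2000):
    norm, s, z = srqs(v, k, l)
    errs.append(np.linalg.norm(v - norm * s * z / l) ** 2)
print(np.mean(errs), "<=", (d / k) * (1 + k / l ** 2) - 1)
```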

Now we give a lemma on the accuracy of SRQS.

Lemma 1.

The SRQS achieves compression error

{\mathbb{E}}\left\|v-\mathcal{Q}(v)\right\|^{2}\leq\left[\frac{d}{k}\left(1+\frac{k}{l^{2}}\right)-1\right]\left\|v\right\|^{2},

with the rand-$k$ compressor and Randomized Quantization with $l$ quantization levels.

Proof.

Note that our scheme incorporates two sources of randomness: the rand-$k$ sparsification and the unbiased Randomized Quantization. Let ${\mathbb{E}}_{Q}$ denote the expectation over the Randomized Quantization. The total compression error of the SRQS scheme can be expressed as follows:

{\mathbb{E}}\left\|v-\mathcal{Q}(v)\right\|^{2}={\mathbb{E}}\left\|v-v_{spa}+v_{spa}-q\right\|^{2}
={\mathbb{E}}\left(\left\|v-v_{spa}\right\|^{2}+2\left<v-v_{spa},v_{spa}-q\right>+\left\|v_{spa}-q\right\|^{2}\right)
={\mathbb{E}}\left(\left\|v-v_{spa}\right\|^{2}+2{\mathbb{E}}_{Q}\left<v-v_{spa},v_{spa}-q\right>+{\mathbb{E}}_{Q}\left\|v_{spa}-q\right\|^{2}\right)
={\mathbb{E}}\left(\left\|v-v_{spa}\right\|^{2}+{\mathbb{E}}_{Q}\left\|v_{spa}-q\right\|^{2}\right), \qquad (9)

where the last equality arises from the unbiasedness property ${\mathbb{E}}_{Q}(v_{spa}-q)=0$. The error in the second term of (9) is from the Randomized Quantization. According to Lemma 3.1 in QSGD [7], we obtain the quantization error as:

{\mathbb{E}}_{Q}\left\|v_{spa}-q\right\|^{2}\leq\min\left\{\frac{k}{l^{2}},\frac{\sqrt{k}}{l}\right\}\left\|v_{spa}\right\|^{2}.

Here we choose $l=\sqrt{2k}$, which ensures $\frac{k}{l^{2}}\leq 1$. Consequently, we can deduce that ${\mathbb{E}}_{Q}\left\|v_{spa}-q\right\|^{2}\leq\frac{k}{l^{2}}\left\|v_{spa}\right\|^{2}$.

Then, the total compression error can be expressed as follows:

{\mathbb{E}}\left\|v-\mathcal{Q}(v)\right\|^{2}\leq{\mathbb{E}}\left(\left\|v-v_{spa}\right\|^{2}+\frac{k}{l^{2}}\left\|v_{spa}\right\|^{2}\right)
={\mathbb{E}}\left(\left\|v\right\|^{2}-2\left<v,v_{spa}\right>+\left(1+\frac{k}{l^{2}}\right)\left\|v_{spa}\right\|^{2}\right)
=\left(1+\frac{k}{l^{2}}\right){\mathbb{E}}\left\|v_{spa}\right\|^{2}-\left\|v\right\|^{2},

where the last equality is from the unbiasedness of the rand-$k$ sparsification, ${\mathbb{E}}\,v_{spa}=v$.

Given that $v_{spa}$ is obtained through the rand-$k$ sparsification, we have:

{\mathbb{E}}\left\|v_{spa}\right\|^{2}=\sum_{j=1}^{d}\frac{k}{d}\left(\frac{d}{k}v_{j}\right)^{2}=\sum_{j=1}^{d}\frac{d}{k}v_{j}^{2}=\frac{d}{k}\left\|v\right\|^{2}.

Finally we can get the total compression error as:

{\mathbb{E}}\left\|v-\mathcal{Q}(v)\right\|^{2}\leq\left[\frac{d}{k}\left(1+\frac{k}{l^{2}}\right)-1\right]\left\|v\right\|^{2}. ∎

This lemma indicates that the compression parameter of our scheme is $\omega=\frac{d}{k}(1+\frac{k}{l^{2}})$.

To transmit the final compressed vector $q$, we encode and transmit the tuple $(\left\|v_{\text{spa}}\right\|,s,z)$. In our scheme, we employ Elias coding for this purpose, similar to QSGD. Elias coding is a method used to encode positive integers, such as the locations of the non-zero elements of $q$ and the integer vector $z$.

Now let us present the formal guarantee of SRQS.

Theorem 5.

When $\omega\geq 9$, $k=\frac{3d}{2\omega}$, and the quantization level is $l=\sqrt{2k}$, SRQS is an unbiased $\omega$-compressor with number of bits $b=O(\frac{d}{\omega}\log d\omega)$.

The proof is in Appendix B.

As can be seen, SRQS approaches the lower bound for unbiased compressors up to a logarithmic factor $\log d\omega$.

V Conclusion

In this paper we compile the number of necessary and sufficient bits required for the widely used biased $\delta$-compressors and unbiased $\omega$-compressors. For the biased $\delta$-compressor, we propose a random Gaussian codebook scheme that achieves the lower bound, and a Max Block Norm Quantization scheme that approaches the lower bound up to a logarithmic factor. For the unbiased compressor, we also show that an unbiased random Gaussian codebook scheme achieves the lower bound, and we further propose a practical Sparse Randomized Quantization Scheme that approaches the lower bound up to a logarithmic factor. In short, a simple combination of sparsification and quantization methods applied to distributed learning leads to near-optimal compression.

Acknowledgement: This work is supported in part by NSF awards 2133484, 2217058, and 2112665.

References

  • [1] Allen Gersho and Robert M Gray, Vector quantization and signal compression, vol. 159, Springer Science & Business Media, 2012.
  • [2] David Salomon, Giovanni Motta, and Giovanni Motta, Handbook of data compression, vol. 2, Springer, 2010.
  • [3] Khalid Sayood, Introduction to data compression, Morgan Kaufmann, 2017.
  • [4] Peter Kairouz and H Brendan McMahan, “Advances and open problems in federated learning,” Foundations and Trends® in Machine Learning, vol. 14, no. 1, pp. 1–210, 2021.
  • [5] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon, “Federated learning: Strategies for improving communication efficiency,” arXiv preprint arXiv:1610.05492, 2016.
  • [6] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics. PMLR, 2017, pp. 1273–1282.
  • [7] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.
  • [8] Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi, “Sparsified SGD with memory,” in Advances in Neural Information Processing Systems, 2018, pp. 4447–4458.
  • [9] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U Stich, and Martin Jaggi, “Error feedback fixes SignSGD and other gradient compression schemes,” in International Conference on Machine Learning, 2019, pp. 3252–3261.
  • [10] Avishek Ghosh, Raj Kumar Maity, Swanand Kadhe, Arya Mazumdar, and Kannan Ramchandran, “Communication-efficient and byzantine-robust distributed learning with error feedback,” IEEE Journal on Selected Areas in Information Theory, vol. 2, no. 3, pp. 942–953, 2021.
  • [11] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang, “Gradient sparsification for communication-efficient distributed optimization,” in Advances in Neural Information Processing Systems, 2018, pp. 1299–1309.
  • [12] Venkata Gandikota, Daniel Kane, Raj Kumar Maity, and Arya Mazumdar, “vqsgd: Vector quantized stochastic gradient descent,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2021, pp. 2197–2205.
  • [13] Toby Berger, “Rate-distortion theory,” Wiley Encyclopedia of Telecommunications, 2003.
  • [14] Gérard Cohen, Iiro Honkala, Simon Litsyn, and Antoine Lobstein, Covering codes, Elsevier, 1997.
  • [15] Wei-Ning Chen, Peter Kairouz, and Ayfer Ozgur, “Breaking the communication-privacy-accuracy trilemma,” Advances in Neural Information Processing Systems, vol. 33, pp. 3312–3324, 2020.
  • [16] Wei-Ning Chen, Christopher A Choquette Choo, Peter Kairouz, and Ananda Theertha Suresh, “The fundamental price of secure aggregation in differentially private federated learning,” in International Conference on Machine Learning. PMLR, 2022, pp. 3056–3089.
  • [17] Roman Vershynin, “Introduction to the non-asymptotic analysis of random matrices,” arXiv preprint arXiv:1011.3027, 2010.
  • [18] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li, “Terngrad: Ternary gradients to reduce communication in distributed deep learning,” in Advances in Neural Information Processing Systems, 2017, pp. 1509–1519.
  • [19] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang, “Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning,” in International Conference on Machine Learning, 2017, pp. 4035–4043.
  • [20] Jakub Konečnỳ and Peter Richtárik, “Randomized distributed mean estimation: Accuracy vs. communication,” Frontiers in Applied Mathematics and Statistics, vol. 4, no. 62, 2018.
  • [21] Prathamesh Mayekar and Himanshu Tyagi, “Ratq: A universal fixed-length quantizer for stochastic optimization,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 1399–1409.
  • [22] Prathamesh Mayekar and Himanshu Tyagi, “Limits on gradient compression for stochastic optimization,” in 2020 IEEE International Symposium on Information Theory (ISIT). IEEE, 2020, pp. 2658–2663.
  • [23] Ananda Theertha Suresh, X Yu Felix, Sanjiv Kumar, and H Brendan McMahan, “Distributed mean estimation with limited communication,” in International conference on machine learning. PMLR, 2017, pp. 3329–3337.
  • [24] Sebastian U Stich, “Local SGD converges fast and communicates little,” in International Conference on Learning Representations, 2019.
  • [25] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh, “Scaffold: Stochastic controlled averaging for federated learning,” in International Conference on Machine Learning. PMLR, 2020, pp. 5132–5143.
  • [26] Ohad Shamir, Nathan Srebro, and Tong Zhang, “Communication efficient distributed optimization using an approximate newton-type method,” CoRR, vol. abs/1312.7853, 2013.
  • [27] Shusen Wang, Farbod Roosta-Khorasani, Peng Xu, and Michael W. Mahoney, “Giant: Globally improved approximate newton method for distributed optimization,” 2017.
  • [28] Avishek Ghosh, Raj Kumar Maity, and Arya Mazumdar, “Distributed newton can communicate less and resist byzantine workers,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2020, NIPS’20, Curran Associates Inc.
  • [29] Avishek Ghosh, Raj Kumar Maity, Arya Mazumdar, and Kannan Ramchandran, “Escaping saddle points in distributed newton’s method with communication efficiency and byzantine resilience,” CoRR, vol. abs/2103.09424, 2021.

Appendix A Proof of Theorem 1

Proof.

We begin by considering a specific fixed unit vector $v$. Let $\gamma^{2}=1-\delta$. For the $i$-th column $a_{i}$ of $A$, the probability that the distance between $v$ and $\frac{1}{\sqrt{N}}a_{i}$ exceeds $\gamma$ can be expressed as follows:

\text{Pr}\left(\left\|v-\frac{1}{\sqrt{N}}a_{i}\right\|^{2}>\gamma^{2}\right)=\text{Pr}\left(1+\frac{1}{N}\|a_{i}\|^{2}-\frac{2}{\sqrt{N}}\left<v,a_{i}\right>>\gamma^{2}\right).

Since $\left\|v\right\|^{2}=1$ and $a_{i}$ is a vector with standard Gaussian elements, the inner product $\left<v,a_{i}\right>$ is a standard Gaussian variable. We denote it by $z=\left<v,a_{i}\right>$ with $z\sim\mathcal{N}(0,1)$. Thus, we can write the probability as:

\text{Pr}\left(1+\frac{1}{N}\|a_{i}\|^{2}-\frac{2}{\sqrt{N}}z>\gamma^{2}\right)=\text{Pr}\left(z<\frac{\sqrt{N}}{2}\left(1+\frac{1}{N}\|a_{i}\|^{2}-\gamma^{2}\right)\right).

Since $\|a_{i}\|^{2}$ follows a chi-squared distribution with $d$ degrees of freedom, we have the tail bound

\text{Pr}\left(\left|\frac{1}{d}\|a_{i}\|^{2}-1\right|\geq t\right)\leq 2\exp\left(-\frac{dt^{2}}{8}\right).

We define the event $A=\left\{z<\frac{\sqrt{N}}{2}\left(1+\frac{1}{N}\|a_{i}\|^{2}-\gamma^{2}\right)\right\}$, the event $B_{1}=\left\{\left|\frac{1}{d}\|a_{i}\|^{2}-1\right|\geq t\right\}$, and $B_{2}=\left\{\left|\frac{1}{d}\|a_{i}\|^{2}-1\right|<t\right\}$. Then we have

\text{Pr}(A)=\text{Pr}(A,B_{1})+\text{Pr}(A,B_{2})\leq\text{Pr}(B_{1})+\text{Pr}(A,B_{2}).

On the event $\{A,B_{2}\}$, we have $z<\frac{\sqrt{N}}{2}\left(1+\frac{d}{N}(1+t)-\gamma^{2}\right)$. Thus we know that

\text{Pr}(B_{1})\leq 2\exp\left(-\frac{dt^{2}}{8}\right),
\text{Pr}(A,B_{2})\leq\text{Pr}\left(z<\frac{\sqrt{N}}{2}\left(1+\frac{d}{N}(1+t)-\gamma^{2}\right)\right)=1-\text{Pr}\left(z\geq\frac{\sqrt{N}}{2}\left(1+\frac{d}{N}(1+t)-\gamma^{2}\right)\right).

Let $E=\frac{\sqrt{N}}{2}(1+\frac{d}{N}(1+t)-\gamma^{2})$. By choosing $N=\frac{d}{\delta}$ and recalling that $\gamma^{2}=1-\delta$, we obtain $E=(1+\frac{t}{2})\sqrt{d\delta}$.

For the standard Gaussian variable $z$, the tail bound is

\text{Pr}(z\geq x)\geq\frac{C}{x}e^{-x^{2}/2},\quad\text{when } x\geq 1,

where $C=\frac{1}{2\sqrt{2\pi}}$. We can see that $E>1$ for the chosen $N$, thus we have

\text{Pr}\left(z\geq E\right)\geq\frac{C}{E}\exp\left(-\frac{E^{2}}{2}\right).

Combining the above results, we have

\text{Pr}\left(\left\|v-\frac{1}{\sqrt{N}}a_{i}\right\|^{2}>\gamma^{2}\right)\leq 1-\frac{C}{E}\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right)+2\exp\left(-\frac{dt^{2}}{8}\right).

In our setting, $\delta$ is small and can approach $0$, thus we can choose a constant $t$ to satisfy $(2+t)^{2}\delta<t^{2}$. We can further require

2\exp\left(-\frac{dt^{2}}{8}\right)\leq\frac{C}{2E}\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right).

Taking logarithms on both sides, we get

4\log d\delta+8\log\frac{2+t}{2}+4\log 128\pi\leq d[t^{2}-(2+t)^{2}\delta]. \qquad (10)

The left-hand side of (10) is of order $O(\log d\delta)$, while the right-hand side is of order $d$. Hence, when we choose $t$ to satisfy $(2+t)^{2}\delta<t^{2}$ and $d$ is large, condition (10) easily holds.

Then we can obtain

\text{Pr}\left(\left\|v-\frac{1}{\sqrt{N}}a_{i}\right\|^{2}>\gamma^{2}\right)\leq 1-\frac{C}{2E}\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right).

Since there are $n$ i.i.d. random Gaussian vectors, we define the event that no Gaussian vector is close to the particular $v$:

\mathcal{E}_{v}=\left\{\forall i\in[n]:\quad\left\|v-\frac{1}{\sqrt{N}}a_{i}\right\|^{2}>\gamma^{2}\right\}.

The probability of the event $\mathcal{E}_{v}$ is

\text{Pr}[\mathcal{E}_{v}]\leq\left[1-\frac{C}{2E}\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right)\right]^{n}\leq\exp\left[-\frac{C}{2E}n\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right)\right],

where the second inequality is from the fact that $1-x\leq e^{-x}$.

To encompass all possible $v$ on the unit sphere, we employ an $\epsilon$-net to cover the unit sphere, where we set the error parameter $\epsilon$ equal to $\gamma$. Consequently, the size of the $\epsilon$-net is at most $\left(1+\frac{2}{\gamma}\right)^{d}$ [17]. Subsequently, we define the event that no Gaussian vector is close to any of the vectors in the $\epsilon$-net:

\mathcal{E}=\left\{\forall i\in[n],\forall v\in\epsilon\text{-net}:\quad\left\|v-\frac{1}{\sqrt{N}}a_{i}\right\|^{2}>\gamma^{2}\right\}.

By the union bound, the probability of the event $\mathcal{E}$ is

\text{Pr}[\mathcal{E}]\leq\left(1+\frac{2}{\gamma}\right)^{d}\exp\left[-\frac{C}{2E}n\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right)\right]\leq\exp\left(\frac{2d}{\gamma}\right)\exp\left[-\frac{C}{2E}n\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right)\right]=\exp\left[\frac{2d}{\gamma}-\frac{C}{2E}n\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right)\right],

where the second inequality is from the fact that $1+x\leq e^{x}$.

We want the exponent to be negative, so that $\text{Pr}[\mathcal{E}]$ approaches $0$. Hence, we need to ensure that the following condition holds:

\frac{2d}{\gamma}\leq\frac{C}{2E}n\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right).

Taking logarithms on both sides, we get

\log n\geq\frac{(2+t)^{2}d\delta}{8}+\log 4d-\log\gamma-\log\frac{C}{E}=\frac{(2+t)^{2}d\delta}{8}+\frac{1}{2}\log\left(\frac{32\pi d^{3}\delta(2+t)^{2}}{1-\delta}\right).

When $\log n\geq O(d\delta)$ holds, the probability $\text{Pr}[\mathcal{E}]$ approaches $0$. Then the communication cost of the random Gaussian codebook scheme is

b=\log n=O(d\delta).

The scheme is a $\delta$-compressor with probability at least

1-\text{Pr}[\mathcal{E}]=1-\exp\left[-\frac{n}{2\sqrt{2\pi}(2+t)\sqrt{d\delta}}\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right)+\frac{2d}{\sqrt{1-\delta}}\right]. ∎

Appendix B Proof of Theorem 5

During the transmission of compressed vectors, we use Elias coding to encode positive integers. Before proving Theorem 5, we need a lemma from [7] that bounds the number of bits needed to represent a vector after Elias coding.

Lemma 2 ([7], Appendix, Lemma A.3).

Let $y\in\mathbb{N}^{d}$ be a vector whose elements $y_{i}$ are positive integers and whose $\ell_{p}$-norm satisfies $\left\|y\right\|^{p}_{p}\leq\rho$. Then we have

\sum_{i=1}^{d}|\text{Elias}(y_{i})|\leq\left(\frac{1+o(1)}{p}\log\frac{\rho}{d}+1\right)d, \qquad (11)

where $\text{Elias}(\cdot)$ is the Elias coding function applied to a positive integer.

Now we give the proof of Theorem 5.

Proof.

After rand-$k$ sparsification and Randomized Quantization, the compressed vector $q$ becomes very sparse. To transmit $q$, we first need to send the locations of the $\|q\|_{0}$ non-zero elements. Let $i_{1},i_{2},\dots,i_{\|q\|_{0}}$ represent the non-zero indices of $q$. We use Elias coding to encode the integer vector $[i_{1},i_{2}-i_{1},\dots,i_{\|q\|_{0}}-i_{\|q\|_{0}-1}]$. This integer vector has length $\|q\|_{0}$ and $\ell_{1}$-norm at most $d$. Thus from Lemma 2 the required number of bits after Elias coding is

b_{1}=\left((1+o(1))\log\frac{d}{\|q\|_{0}}+1\right)\|q\|_{0}.

Secondly, we need to transmit the sign vector $s$ and the integer vector $z$. For the sign vector $s$ with $\|q\|_{0}$ elements, we need $b_{2}=\|q\|_{0}$ bits. For the integer vector $z$, we also use Elias coding. The required number of bits after Elias coding is

b_{3}=\left(\frac{1+o(1)}{2}\log\frac{\left\|z\right\|_{2}^{2}}{\|q\|_{0}}+1\right)\|q\|_{0}.

From [7], we know that for Randomized Quantization, the expected number of non-zero elements is

{\mathbb{E}}\|q\|_{0}\leq l^{2}+\sqrt{\|v_{spa}\|_{0}},

and the squared norm of $z$ is

\|z\|^{2}_{2}\leq 2(l^{2}+\|v_{spa}\|_{0}).

Summing up the three parts $b_{1},b_{2},b_{3}$, we get the total number of bits

{\mathbb{E}}b=3{\mathbb{E}}\|q\|_{0}+{\mathbb{E}}(1+o(1))\|q\|_{0}\left(\log\frac{d}{\|q\|_{0}}+\frac{1}{2}\log\frac{2(l^{2}+\|v_{spa}\|_{0})}{\|q\|_{0}}\right).

Note that the function $x\log\frac{C}{x}$ increases until $x=\frac{C}{2}$ and then decreases. Also, the function $x\log\frac{C}{x}$ is concave, so that ${\mathbb{E}}\left[x\log\frac{C}{x}\right]\leq{\mathbb{E}}x\log\frac{C}{{\mathbb{E}}x}$. Assuming $l^{2}+\|v_{spa}\|_{0}\leq\frac{d}{2}$, it follows that $l^{2}+\sqrt{\|v_{spa}\|_{0}}\leq\frac{d}{2}$. Applying $x=l^{2}+\sqrt{\|v_{spa}\|_{0}}$ and $C=d$ in the function, we obtain

{\mathbb{E}}b\leq 3{\mathbb{E}}\|q\|_{0}+{\mathbb{E}}(1+o(1))\|q\|_{0}\left(\frac{3}{2}\log\frac{d}{\|q\|_{0}}\right)
\leq 3{\mathbb{E}}\left(l^{2}+\sqrt{\|v_{spa}\|_{0}}\right)+{\mathbb{E}}(1+o(1))\left(l^{2}+\sqrt{\|v_{spa}\|_{0}}\right)\left(\frac{3}{2}\log\frac{d}{l^{2}+\sqrt{\|v_{spa}\|_{0}}}\right),

where the first inequality is from $l^{2}+\|v_{spa}\|_{0}\leq\frac{d}{2}$, and the second inequality is from the property of the function $x\log\frac{C}{x}$.

From the rand-$k$ sparsification, we know that ${\mathbb{E}}(l^{2}+\sqrt{\|v_{spa}\|_{0}})\leq l^{2}+\sqrt{k}$. Assume $l^{2}+k\leq\frac{d}{2}$. Applying the property of the function $x\log\frac{C}{x}$ again, with $x=l^{2}+\sqrt{\|v_{spa}\|_{0}}$ and $C=d$, we have

{\mathbb{E}}b\leq 3(l^{2}+\sqrt{k})+\frac{3}{2}(1+o(1))(l^{2}+\sqrt{k})\log\frac{d}{l^{2}+\sqrt{k}}=(l^{2}+\sqrt{k})\left[3+\frac{3}{2}(1+o(1))\log\frac{d}{l^{2}+\sqrt{k}}\right].

From Lemma 1, our scheme has parameter $\omega=\frac{d}{k}(1+\frac{k}{l^{2}})$. Here we choose $l=\sqrt{2k}$; then we have $\omega=\frac{3}{2}\frac{d}{k}$.

Finally, the communication cost of SRQS is

{\mathbb{E}}b\leq(2k+\sqrt{k})\left[3+\frac{3}{2}(1+o(1))\log\frac{d}{2k+\sqrt{k}}\right].

The dominating term is $O(k\log\frac{d}{\sqrt{k}})$, i.e., $O(\frac{d}{\omega}\log d\omega)$.

To satisfy the condition $l^{2}+\|v_{spa}\|_{0}\leq l^{2}+k\leq\frac{d}{2}$, we require

l^{2}+k=3k\leq\frac{d}{2}.

Thus $k\leq\frac{d}{6}$, and it follows that $\omega\geq 9$.

Therefore, when $\omega\geq 9$ and we choose $k=\frac{3d}{2\omega}$ and $l=\sqrt{2k}$, we achieve an unbiased $\omega$-compressor with communication cost $O(\frac{d}{\omega}\log d\omega)$ bits. ∎