Optimal Compression of Unit Norm Vectors
in the High Distortion Regime
Abstract
Motivated by the need for communication-efficient distributed learning, we investigate the method for compressing a unit norm vector into the minimum number of bits, while still allowing for some acceptable level of distortion in recovery. This problem has been explored in the rate-distortion/covering code literature, but our focus is exclusively on the "high-distortion" regime. We approach this problem in a worst-case scenario, without any prior information on the vector, but allowing for the use of randomized compression maps. Our study considers both biased and unbiased compression methods and determines the optimal compression rates. It turns out that simple compression schemes are nearly optimal in this scenario. While the results are a mix of new and known, they are compiled in this paper for completeness.
I Introduction
Data compression is an active area of research in signal processing, especially in the context of image, audio and video processing [1, 2, 3]. Apart from the classical applications, in modern large-scale learning systems, such as federated learning [4, 5, 6], data compression plays a crucial role. Federated learning (FL) is a distributed learning paradigm with one central server and many (millions or more) client machines, and the exchange of information between the server and the clients is crucial for learning models. In a generic federated learning framework, each of a large number of clients computes a local stochastic gradient of a loss function and sends it to the server. The server averages the received stochastic gradients and updates the model parameters. In modern machine learning, one necessarily needs to deal with extremely high dimensional data (the dimension $d$ is very large) and train large neural networks with billions (if not more) of parameters. Hence, one of the major challenges faced by FL is the communication cost between the client machines and the server, since this cost is tied to the internet bandwidth of the users, who are often resource constrained [4].
A canonical way to reduce the communication cost in FL is to compress (sparsify or quantize) the data (gradients) before communicating, and this forms the basis of our study. Indeed, data compression is widely used in FL systems ([7, 8, 9, 10]), and several compression schemes, both deterministic and randomized, have been proposed in the last few years. Broadly, the compression schemes used in FL fall under two categories: (i) unbiased ([7, 8]) and (ii) biased ([9, 8, 10]), where for an unbiased compressor we require conditions on the first and second moments (see Definition 1), while for the biased one we only require a condition on the second moment (see Definition 2). We now define them formally:
Definition 1 (Unbiased $\omega$-compressor).
A randomized operator $Q : \mathbb{R}^d \to \mathbb{R}^d$ is an unbiased $\omega$-compressor if it satisfies
$$\mathbb{E}[Q(x)] = x, \qquad \mathbb{E}\,\|Q(x)\|_2^2 \le \omega\,\|x\|_2^2 \quad \text{for all } x \in \mathbb{R}^d,$$
where $\omega \ge 1$ is the (unbiased) compression parameter.
Typical unbiased compressors include the rand-$k$ sparsification of [8, 11] and the randomized quantization used in QSGD [7]; a short code sketch of rand-$k$ is given below.
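As an illustration (added; not from the original text), a minimal Python sketch of the rand-$k$ compressor, which satisfies Definition 1 with $\omega = d/k$; the function name and the empirical check are ours:

import numpy as np

rng = np.random.default_rng(0)

def rand_k(x, k):
    """Rand-k sparsification: keep k uniformly chosen coordinates and rescale
    by d/k so that E[Q(x)] = x; then E||Q(x)||^2 = (d/k)||x||^2, i.e. omega = d/k."""
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    q = np.zeros_like(x)
    q[idx] = (d / k) * x[idx]
    return q

# Quick empirical check of Definition 1 on a unit-norm vector.
d, k = 1000, 10
x = rng.standard_normal(d); x /= np.linalg.norm(x)
Q = np.stack([rand_k(x, k) for _ in range(5000)])
print(np.linalg.norm(Q.mean(axis=0) - x))    # close to 0   (unbiasedness)
print((Q ** 2).sum(axis=1).mean(), d / k)    # close to d/k (second-moment bound)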
A more general definition, which includes biased compressors, is given by the following:
Definition 2 (Biased $\delta$-compressor).
A randomized operator $Q : \mathbb{R}^d \to \mathbb{R}^d$ is a $\delta$-compressor if it satisfies
$$\mathbb{E}\,\|Q(x) - x\|_2^2 \le (1-\delta)\,\|x\|_2^2 \quad \text{for all } x \in \mathbb{R}^d,$$
where $\delta \in (0, 1]$ is the (biased) compression parameter. If the compressor is not random, we remove the expectation.
Typical biased compressors include:
•
$\ell_1$-sign quantization [9]: For any $x \in \mathbb{R}^d$, $Q(x) = \frac{\|x\|_1}{d}\,\mathrm{sign}(x)$. Here $\delta$ is $\frac{\|x\|_1^2}{d\,\|x\|_2^2}$.
•
Top-$k$ sparsification [8]: For any $x \in \mathbb{R}^d$, select the $k$ elements with the largest absolute values to be retained, and set the other elements to zero. Here $\delta$ is $k/d$ (see the code sketch after this list).
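As an illustration (added; the function names are ours), a minimal Python sketch of the two biased compressors above, with an empirical check of the error bound in Definition 2:

import numpy as np

def l1_sign(x):
    """l1-sign quantization: Q(x) = (||x||_1 / d) * sign(x);
    a biased delta-compressor with delta = ||x||_1^2 / (d ||x||_2^2)."""
    d = x.size
    return (np.abs(x).sum() / d) * np.sign(x)

def top_k(x, k):
    """Top-k sparsification: keep the k largest-magnitude entries (delta = k/d)."""
    q = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    q[idx] = x[idx]
    return q

x = np.random.default_rng(1).standard_normal(10_000)
x /= np.linalg.norm(x)
for q, name in [(l1_sign(x), "l1-sign"), (top_k(x, 100), "top-100")]:
    print(name, np.linalg.norm(q - x) ** 2)   # squared error, bounded by 1 - delta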
Ranges of $\delta$ and $\omega$: The parameters $\delta$ and $\omega$ measure the amount of distortion of the compressors. We emphasize that for the biased compressor, $\delta$ is in the range $(0, 1]$, while in the unbiased case, $\omega$ is in the range $[1, \infty)$. Owing to unbiasedness, the variance of the compressor often increases, and we require $\omega$ to be large in such settings [7, 12]. Thus there is a fundamental difference between the definitions of the biased $\delta$-compressor and the unbiased $\omega$-compressor. In this work we consider the high distortion regime, where $\delta$ is small and $\omega$ is large, again motivated by FL applications where extreme compression is desirable.
We aim to find the optimal communication cost of the above-mentioned widely used compressors. Since, in standard applications of FL, one can normalize the information vector and send its norm separately, we can, without loss of generality, assume that the compressors are applied to a unit-norm vector $x$, i.e., $\|x\|_2 = 1$ in Definitions 1 and 2.
I-A Our contributions
Motivated by the compression needs in FL, in this paper we study the basic problem of compressing a unit norm vector in high dimension. Though this problem has been extensively explored in the context of rate-distortion theory and covering codes [13, 1, 14], we are interested in the high distortion regime, where the recovery error is high and the number of bits used is much smaller than the dimension $d$. Moreover, we do not put any prior information on the vector to be compressed, and hence our results hold in the worst-case setting.
We first obtain lower bounds on the number of bits required to transmit such unit norm vectors under biased as well as unbiased compression schemes. For the biased schemes, this follows directly via a sphere-covering argument, and for the unbiased case we leverage results from [15, 16]. Moreover, we propose and analyze several efficient algorithms that match these lower bounds and are hence optimal. It turns out that for an unbiased $\omega$-compressor the minimum number of bits required is $\Theta(d/\omega)$, whereas for a biased $\delta$-compressor the least number of bits required is $\Theta(d\delta)$. For upper bounds, we provide the following different compressors:
An optimal but inefficient biased $\delta$-compressor
In Section III-B, we propose and analyze a biased $\delta$-compressor based on generating a random Gaussian codebook. In this scheme, out of the generated codebook, we choose the Gaussian vector closest (in Euclidean distance) to the information vector as the quantized vector. This scheme requires only $O(d\delta)$ bits, which matches the lower bound. Although optimal, the encoding is an exhaustive search over an exponentially large number of possible codewords, and hence this algorithm is not computationally efficient.
A near-optimal and efficient biased $\delta$-compressor
In Section III-C, we propose an efficient biased compressor that is (near) optimal, namely Max Block Norm Quantization (MBNQ). In this algorithm, we first find the sub-block of length $d/m$ that has the largest norm, and then use scalar quantization (coordinate-wise) on that sub-vector. As shown in Theorem 2, MBNQ requires a number of bits that exceeds the $\Theta(d\delta)$ lower bound by only a logarithmic factor. Furthermore, MBNQ is computationally efficient, as seen in Algorithm 2.
Optimal and near-optimal unbiased $\omega$-compressors
In Section IV-B, we discuss the vector quantized SGD (VQSGD) algorithm of [12], which requires $\Theta(d/\omega)$ bits and is hence optimal. However, similar to the random Gaussian codebook algorithm, this is also inefficient, and the computational complexity scales exponentially with the dimension $d$. In Section IV-C, we propose an efficient algorithm, namely the Sparse Randomized Quantization Scheme (SRQS), that first applies the rand-$k$ compressor (of [8]) and then uses QSGD (of [7]) on the sparse, effectively $k$-length sub-vector. As shown in Theorem 5, this simple combination yields an efficient algorithm whose bit requirement exceeds the lower bound by only a logarithmic factor, and hence SRQS is (near) optimal.
Our contribution is summarized in Table 1.
II Related Work
II-A $\epsilon$-nets
Note that our biased compressors are simply $\sqrt{1-\delta}$-nets for the unit sphere $\mathbb{S}^{d-1}$, and it is known that there exists an $\epsilon$-net of the sphere of size $(1 + 2/\epsilon)^d$ [17]. However, results on $\epsilon$-nets such as this are tailored to very small $\epsilon$; for $\epsilon$ very close to $1$, they do not provide a fine-grained dependence on $\delta$.
II-B Quantization
Possibly the most straightforward way to cut down communication is to use quantization or sparsification techniques. Since the dimension of such data is typically huge, dimension reduction techniques often turn out to be useful. To achieve compression, one can quantize each coordinate of a transmitted vector into a few bits [7, 18, 19], or obtain a sparser vector by setting some elements to zero [8, 11, 20]. These compressors are either biased or unbiased, which influences the convergence of learning algorithms. In [21, 22], a lower bound on the compression budget needed to retain the convergence rate of stochastic optimization in the uncompressed setting is obtained; to achieve the lower bound, a compression scheme based on random rotation and block quantization was proposed. In the context of distributed mean estimation, a lower bound on the estimation accuracy under a given compression budget is derived in [23], based on a distributed statistical estimation lower bound; a random rotation scheme was also proposed there to approach the lower bound.
II-C Federated Learning and Communication Cost
As mentioned in the introduction, FL algorithms employ several techniques to reduce the communication cost. One simple way is to use local iterations and communicate with the server sparsely [6, 24, 25]. Another way to reduce the number of communication rounds is to use second-order Newton-based algorithms [26, 27, 28, 29], which exploit the compute power of the client machines to cut down the communication cost.
III Biased $\delta$-Compressors
We start with the biased $\delta$-compressor. We first answer the question of the minimum communication cost of a $\delta$-compressor by providing a lower bound, which is obtained via sphere covering.
III-A Lower Bound
Proposition 1.
Let $Q$ be a compressor of $x \in \mathbb{S}^{d-1}$ satisfying Definition 2. Then the number of bits $B$ required to transmit $Q(x)$ satisfies
$$B \;\ge\; \frac{d}{2}\log_2\frac{1}{1-\delta} \;\ge\; \frac{d\delta}{2\ln 2}. \qquad (1)$$
Proof.
The proof employs a simple sphere covering of the unit sphere $\mathbb{S}^{d-1}$ [14]. By Definition 2, we know that $Q(x)$ resides within the ball of radius $\sqrt{1-\delta}$ centered at $x$, so the set of possible outputs of $Q$ must form a covering of $\mathbb{S}^{d-1}$ by balls of radius $\sqrt{1-\delta}$. The cardinality $N$ of the set of balls covering the sphere is at least
$$N \;\ge\; \frac{\mathrm{Vol}(\mathcal{B}_2^d)}{\mathrm{Vol}\big(\sqrt{1-\delta}\,\mathcal{B}_2^d\big)} \;=\; \frac{\pi^{d/2}/\Gamma(d/2+1)}{(1-\delta)^{d/2}\,\pi^{d/2}/\Gamma(d/2+1)} \;=\; (1-\delta)^{-d/2}.$$
We can send the indices of these balls covering the unit sphere. Thus the necessary number of bits $B$ to represent $Q(x)$ satisfies
$$B \;\ge\; \log_2 N \;\ge\; \frac{d}{2}\log_2\frac{1}{1-\delta} \;\ge\; \frac{d\delta}{2\ln 2},$$
where $\Gamma$ is the Gamma function, and the last inequality is from the fact that $\ln\frac{1}{1-\delta} \ge \delta$ for $\delta \in (0,1)$. ∎
Considering that $\delta$ falls within the range $(0,1]$, the required number of bits $\Omega(d\delta)$ is smaller than what is needed to transmit the uncompressed vector directly, which would require $d$ real numbers, i.e., $\Theta(d)$ bits.
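As a worked example (added for illustration, using the reconstructed form of the bound in (1)): for $d = 10^{6}$ and $\delta = 0.01$,
$$B \;\ge\; \frac{d\delta}{2\ln 2} \;=\; \frac{10^{6}\times 0.01}{2\ln 2} \;\approx\; 7.2\times 10^{3}\ \text{bits},$$
compared with roughly $32d = 3.2\times 10^{7}$ bits for sending the vector as 32-bit floating-point numbers.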
In the following we first present a random Gaussian codebook scheme that achieves the aforementioned lower bound. However, this scheme incurs significant computational and storage requirements, rendering it infeasible in practice. Consequently, we introduce an alternative practical sparse quantization scheme which is nearly optimal.
III-B Random Gaussian Codebook Scheme
We use a vector quantization method via a random Gaussian construction: let the random Gaussian codebook matrix be
$$G \;=\; c\,[\,g_1, g_2, \ldots, g_N\,] \;\in\; \mathbb{R}^{d\times N}, \qquad (2)$$
where each column $g_i$ is a Gaussian vector with each element being a standard Gaussian variable $\mathcal{N}(0,1)$, and $c$ is a normalization factor that will be chosen later. The columns of the Gaussian matrix $G$ are regarded as a codebook for compression.
For a unit norm vector $x$ to be compressed, we find the nearest Gaussian codeword
$$i^\star \;=\; \arg\min_{i \in [N]} \|x - c\,g_i\|_2.$$
Then we map $x$ to this Gaussian vector, $Q(x) = c\,g_{i^\star}$.
To transmit the compressed vector, we only need to send the index $i^\star$ of the nearest Gaussian vector. The process of the random Gaussian scheme is described in Algorithm 1.
Input: Unit norm vector $x \in \mathbb{S}^{d-1}$; a matrix $G$ constructed as in (2)
Encoding: $i^\star = \arg\min_{i\in[N]} \|x - c\,g_i\|_2$
Output: Index $i^\star$ of the chosen Gaussian vector $c\,g_{i^\star}$
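For illustration (not part of the original text), a minimal Python sketch of Algorithm 1; the codebook size $N$ and the normalization $c$ below are placeholders, since the analysis chooses them as functions of $d$ and $\delta$:

import numpy as np

rng = np.random.default_rng(0)

def gaussian_codebook(d, N, c):
    """Random Gaussian codebook: N columns of i.i.d. N(0,1) entries, scaled by c."""
    return c * rng.standard_normal((d, N))

def encode_nearest(x, G):
    """Return the index of the codeword closest to x in Euclidean norm
    (exhaustive search over all N columns -- this is the costly step)."""
    dists = np.linalg.norm(G - x[:, None], axis=0)
    return int(np.argmin(dists))

# Toy run; N and c are illustrative choices, not those of Theorem 1.
d, N = 64, 4096
x = rng.standard_normal(d); x /= np.linalg.norm(x)
G = gaussian_codebook(d, N, c=1.0 / np.sqrt(d))
i_star = encode_nearest(x, G)                         # transmit only i_star (log2 N bits)
print(i_star, np.linalg.norm(G[:, i_star] - x) ** 2)  # squared distortion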
For this scheme, we have the following guarantee:
Theorem 1.
A Gaussian codebook constructed as in (2) with $N = 2^{O(d\delta)}$ columns (in the regime $d\delta = \Omega(\log d)$) is a $\delta$-compressor with probability at least $1 - o(1)$ over the draw of the codebook. The number of bits needed is $B = \log_2 N = O(d\delta)$.
The proof is in Appendix A.
Remark 1.
Since we use $\log_2 N = O(d\delta)$ bits to represent the index of the chosen Gaussian vector, the random Gaussian codebook scheme achieves the lower bound for biased $\delta$-compressors. However, it is impractical to store the $N$ Gaussian vectors and perform $N$ distance computations at each iteration, since $N$ grows exponentially with the model dimension $d$.
III-C Max Block Norm Quantization (MBNQ)
To give a practical compression scheme, we show that a simple block quantization scheme can approach the lower bound, with an extra logarithmic factor. This scheme, Max Block Norm Quantization (MBNQ), uses standard scalar quantization on the sub-block with the largest norm.
Our scheme first uniformly partitions the vector $x$ into $m$ sub-blocks $x_{(1)}, \ldots, x_{(m)}$, each of length $d/m$. Then we pick the sub-block $x_{(i^\star)}$ with the largest norm among the sub-vectors $x_{(1)}, \ldots, x_{(m)}$.
We next quantize $x_{(i^\star)}$ with standard scalar quantization with $s$ quantization levels to get the quantized vector $\hat{x}_{(i^\star)}$. The scalar quantization function is $Q_{\mathrm{sc}}(u) = q_j$ if $u \in [b_j, b_{j+1})$, where $b_j, b_{j+1}$ are the lower and upper bounds of a quantization interval and $q_j$ is a quantization value inside the interval; that is, each element of $x_{(i^\star)}$ is quantized to the closest quantization value. The scalar quantization function is applied element-wise. The final compressed vector $Q(x)$ equals $\hat{x}_{(i^\star)}$ on the coordinates of the chosen sub-block and is zero elsewhere. The detailed process of MBNQ is described in Algorithm 2.
Input: Unit norm vector $x \in \mathbb{S}^{d-1}$
Initialization: Define a scalar quantization function $Q_{\mathrm{sc}}$ with $s$ quantization levels, where $Q_{\mathrm{sc}}(u) = q_j$ if $u \in [b_j, b_{j+1})$; $b_j, b_{j+1}$ are the lower and upper bounds of a quantization interval and $q_j$ is the quantization value inside the interval
Encoding: Partition $x$ into $m$ sub-blocks; find the sub-block $x_{(i^\star)}$ with the largest norm; apply $Q_{\mathrm{sc}}$ element-wise to $x_{(i^\star)}$
Output: Compressed vector $Q(x)$, i.e., the index $i^\star$ and the quantized sub-block $\hat{x}_{(i^\star)}$
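The following is a minimal Python sketch of MBNQ (added for illustration; it assumes $m$ divides $d$ and a uniform quantizer on $[-1,1]$, and the parameter values are illustrative rather than those of Theorem 2):

import numpy as np

def mbnq(x, m, s):
    """Max Block Norm Quantization (sketch).
    Split x into m equal sub-blocks, keep the one with the largest norm,
    and quantize its entries with a uniform s-level scalar quantizer on [-1, 1].
    Returns (block index, quantized block)."""
    d = x.size
    blocks = x.reshape(m, d // m)          # assumes m divides d
    i_star = int(np.argmax(np.linalg.norm(blocks, axis=1)))
    b = blocks[i_star]
    levels = np.linspace(-1.0, 1.0, s)     # entries of a unit-norm x lie in [-1, 1]
    q = levels[np.argmin(np.abs(b[:, None] - levels[None, :]), axis=1)]
    return i_star, q

def mbnq_decode(i_star, q, d, m):
    """Place the quantized block back; all other coordinates are zero."""
    xhat = np.zeros(d)
    xhat[i_star * (d // m):(i_star + 1) * (d // m)] = q
    return xhat

rng = np.random.default_rng(2)
d, m, s = 1024, 32, 64                     # illustrative parameters only
x = rng.standard_normal(d); x /= np.linalg.norm(x)
i_star, q = mbnq(x, m, s)
print(np.linalg.norm(mbnq_decode(i_star, q, d, m) - x) ** 2)   # squared compression error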
For the MBNQ scheme, we have the following guarantee.
Theorem 2.
With the number of sub-blocks $m = 1/(2\delta)$ and the number of quantization levels $s = \sqrt{2d}$, MBNQ is a $\delta$-compressor with the number of bits
$$B \;=\; \log_2\frac{1}{2\delta} + d\delta\,\log_2(2d) \;=\; O(d\delta\log d).$$
Proof.
The compression error of our proposed scheme can be expressed as follows:
$$\|Q(x) - x\|_2^2 \;=\; \|x - x_B\|_2^2 + \|x_B - Q(x)\|_2^2, \qquad (3)$$
where $x_B \in \mathbb{R}^d$ represents the vector whose entries on the chosen sub-block equal those of $x_{(i^\star)}$ and whose remaining elements are set to zero. The last equality is from the fact that $x_B - Q(x)$ is non-zero only on the $d/m$ coordinates of the chosen sub-block, while $x - x_B$ is non-zero only on the remaining coordinates, so that the cross term is zero.
The first term in (3) corresponds to the error resulting from the partitioning of the vector. Since $x_{(i^\star)}$ is a sub-vector of $x$, we have
$$\|x - x_B\|_2^2 \;=\; \|x\|_2^2 - \|x_{(i^\star)}\|_2^2 \;=\; 1 - \|x_{(i^\star)}\|_2^2,$$
where $\|x_{(i^\star)}\|_2$ denotes the $\ell_2$ norm of the chosen sub-block. The second term in (3) represents the error resulting from scalar quantization with $s$ levels. Since every entry of the unit-norm $x$ lies in $[-1,1]$ and rounding to the nearest of $s$ uniformly spaced levels incurs a per-coordinate squared error of at most $1/s^2$, we have
$$\|x_B - Q(x)\|_2^2 \;\le\; \frac{d}{m}\cdot\frac{1}{s^2}.$$
Here we choose $s = \sqrt{2d}$; then we can obtain
$$\|x_B - Q(x)\|_2^2 \;\le\; \frac{1}{2m}.$$
Note that the error decreases with a larger $m$. Since we pick the sub-vector with the largest norm, we have $\|x_{(i^\star)}\|_2^2 \ge 1/m$. Therefore we can obtain
$$\|Q(x) - x\|_2^2 \;\le\; 1 - \frac{1}{m} + \frac{1}{2m} \;=\; 1 - \frac{1}{2m}.$$
Hence, the MBNQ scheme achieves a $\delta$-compressor with $\delta = 1/(2m)$, i.e., we set the number of sub-blocks to $m = 1/(2\delta)$.
Regarding the communication cost of the MBNQ scheme, each worker needs to transmit the index of the sub-vector with the largest norm and $d/m$ quantized values. Thus, the total number of required bits can be calculated as
$$B \;=\; \log_2 m + \frac{d}{m}\log_2 s \;=\; \log_2\frac{1}{2\delta} + d\delta\,\log_2(2d) \;=\; O(d\delta\log d),$$
which represents the communication cost in Theorem 2. ∎
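As an added worked example, under the parameter choices assumed in the reconstruction above and with $d = 10^{6}$, $\delta = 0.01$:
$$B_{\mathrm{MBNQ}} \;=\; \log_2\frac{1}{2\delta} + d\delta\,\log_2(2d) \;\approx\; 5.6 + 10^{4}\times 21 \;\approx\; 2.1\times 10^{5}\ \text{bits},$$
roughly a logarithmic factor ($\approx 29\times$) above the $\approx 7.2\times 10^{3}$-bit lower bound of Proposition 1, whereas direct scalar quantization of all $d$ coordinates would cost about $\tfrac{d}{2}\log_2 d \approx 10^{7}$ bits.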
Remark 2.
The implementation of MBNQ is straightforward in real systems, and the simple MBNQ scheme approaches the lower bound for biased $\delta$-compressors up to a logarithmic factor. If we directly applied scalar quantization to the whole vector $x$, the communication cost would be $O(d\log d)$ bits. The improvement comes from quantizing only a sub-vector: when the size of the sub-vector is chosen properly, the quantization achieves nearly optimal performance.
IV Unbiased $\omega$-Compressors
In this section we provide results on the unbiased compressor of Definition 1. We start with a lower bound and a random codebook scheme that achieves this lower bound, both following from existing results. Next we propose a practical and simple sparse quantization scheme that approaches the lower bound, up to a logarithmic factor.
IV-A Lower Bound
For unbiased compressors, [15] already provides a lower bound under both communication and privacy constraints. The lower bound is proven by constructing a prior distribution on $x$ and analyzing the compression error. Here we directly state the lower bound from [15]; it also follows from [12].
Theorem 3 ([15], Appendix D.2).
Let $Q$ be a compressor of $x \in \mathbb{S}^{d-1}$ satisfying Definition 1. Then the number of bits $B$ required to describe $Q(x)$ satisfies
$$B \;=\; \Omega\!\Big(\frac{d}{\omega}\Big). \qquad (4)$$
Note that $\omega$ is in the range $[1, \infty)$. Thus the lower bound of $\Omega(d/\omega)$ bits is smaller than the $\Theta(d)$ bits needed to directly transmit the vector as $d$ real numbers.
IV-B Unbiased Random Gaussian Codebook Scheme
In [12], a randomized vector quantization framework is proposed for unbiased compression. Among the quantization methods introduced in [12], there is a random Gaussian codebook method that achieves the lower bound for unbiased compressors.
Similar to the random Gaussian codebook scheme in the biased case, we first construct a random Gaussian matrix $G$:
$$G \;=\; c\,[\,g_1, g_2, \ldots, g_N\,] \;\in\; \mathbb{R}^{d\times N}, \qquad (5)$$
where each column $g_i$ is a Gaussian vector with elements that are standard Gaussians $\mathcal{N}(0,1)$, and $c$ is the normalization factor. The columns of the Gaussian matrix are regarded as a codebook for compression, i.e., the points of the codebook are $\{c\,g_1, \ldots, c\,g_N\}$. Then we use the Gaussian vectors to perform the compression: $Q(x)$ is set to a randomly chosen codeword $c\,g_i$, with the selection probabilities chosen (as in [12]) so that $\mathbb{E}[Q(x)] = x$.
The detailed process of unbiased random Gaussian codebook scheme is described in Algorithm 3.
Input: Unit norm vector $x$; a matrix $G$ constructed as in (5)
Encoding: sample an index $i \in [N]$ with the selection probabilities of [12], so that $\mathbb{E}[c\,g_i] = x$
Output: Index $i$ of the chosen Gaussian vector $c\,g_i$
Theorem 4 ([12], Theorem 7).
The unbiased random Gaussian codebook scheme above is an unbiased $\omega$-compressor, and the number of bits it requires is $O(d/\omega)$ up to lower-order additive terms.
Remark 3.
From the above theorem, the dominating term is $O(d/\omega)$ bits, matching the lower bound in Theorem 3. However, like its biased counterpart, the unbiased Gaussian codebook scheme suffers from an impractical computational and storage burden.
IV-C Sparse Randomized Quantization Scheme (SRQS)
To design a practical compression method that approaches the lower bound, we propose a Sparse Randomized Quantization Scheme (SRQS) based on the well-known rand-$k$ sparsification and the randomized quantization in QSGD [7]. It is easy to implement, and we prove that it approaches the lower bound up to a logarithmic factor.
Our SRQS method is as follows. The rand-$k$ compressor randomly chooses (in expectation) $k$ elements of the vector $x$ and sets the other elements to zero. Specifically, for each coordinate $i$ with value $x_i$, with probability $k/d$ the value is set to $(d/k)\,x_i$, and with probability $1 - k/d$ it is set to $0$. We denote the resulting sparse vector by $x_k$; the scaling by $d/k$ makes the rand-$k$ step unbiased, $\mathbb{E}[x_k] = x$.
For the sparse vector $x_k$ obtained from $x$, we perform an unbiased randomized quantization as in [7] to quantize the vector $x_k$. This method, used in QSGD, works as follows. The quantization function consists of $s$ quantization levels. For each element $x_{k,i}$ of $x_k$, the randomized quantization function is
$$Q_s(x_{k,i}) \;=\; \|x_k\|_2 \cdot \mathrm{sign}(x_{k,i}) \cdot \xi_i(x_k, s),$$
where $\mathrm{sign}(x_{k,i})$ represents the sign of $x_{k,i}$, and $\xi_i(x_k, s)$ is an independent random variable that determines which quantization level $x_{k,i}$ is mapped to, as defined next. Let $0 \le \ell < s$ be an integer such that $|x_{k,i}|/\|x_k\|_2 \in [\ell/s, (\ell+1)/s]$, i.e., $[\ell/s, (\ell+1)/s]$ is the quantization interval for $|x_{k,i}|/\|x_k\|_2$. Then the random variable $\xi_i(x_k, s)$ is
$$\xi_i(x_k, s) \;=\; \begin{cases} \ell/s & \text{with probability } 1 - p_i,\\ (\ell+1)/s & \text{with probability } p_i, \end{cases} \qquad p_i = \frac{|x_{k,i}|}{\|x_k\|_2}\,s - \ell. \qquad (8)$$
Thus the randomized quantization is an unbiased method, $\mathbb{E}[Q_s(x_{k,i})] = x_{k,i}$. As in QSGD, the final quantized vector is expressed by a tuple $(\|x_k\|_2, \sigma, \zeta)$, where the vector $\sigma$ contains the signs of the non-zero elements, $\sigma_i = \mathrm{sign}(x_{k,i})$, and $\zeta$ contains the quantization levels of the non-zero elements, i.e., $\zeta_i = s\cdot\xi_i(x_k, s) \in \{0, 1, \ldots, s\}$.
Overall, the two-stage compression scheme obtained by sequentially applying rand-$k$ and randomized quantization is unbiased. It can be seen as a sparser version of QSGD. The detailed process of SRQS is described in Algorithm 4.
Input: Unit norm vector $x$
Initialization: Design a randomized quantization function $Q_s$ with $s$ quantization levels
Encoding: compute $x_k = \text{rand-}k(x)$; apply $Q_s$ to $x_k$
Output: Compressed vector $Q(x) = Q_s(x_k)$, represented by the tuple $(\|x_k\|_2, \sigma, \zeta)$ and the locations of the non-zero elements
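The following minimal Python sketch (added for illustration; the names and the parameter values $d$, $k$, $s$ are ours) implements the two SRQS stages, unbiased rand-$k$ followed by QSGD-style randomized quantization; the Elias coding of indices and levels is omitted:

import numpy as np

rng = np.random.default_rng(3)

def rand_k(x, k):
    """Unbiased rand-k: each coordinate survives w.p. k/d and is scaled by d/k."""
    d = x.size
    mask = rng.random(d) < k / d
    return np.where(mask, (d / k) * x, 0.0)

def qsgd_quantize(v, s):
    """QSGD-style randomized quantization with s levels (unbiased).
    Encodes v as (norm, signs, integer levels in {0,...,s})."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return norm, np.zeros_like(v), np.zeros(v.size, dtype=int)
    a = np.abs(v) / norm                  # in [0, 1]
    low = np.floor(a * s)                 # level index l with a in [l/s, (l+1)/s]
    p = a * s - low                       # round up to (l+1)/s with this probability
    levels = (low + (rng.random(v.size) < p)).astype(int)
    return norm, np.sign(v), levels

def decode(norm, signs, levels, s):
    return norm * signs * levels / s

d, k, s = 10_000, 100, 10                 # illustrative; the analysis takes s ~ sqrt(k)
x = rng.standard_normal(d); x /= np.linalg.norm(x)
xk = rand_k(x, k)
norm, signs, levels = qsgd_quantize(xk, s)   # only non-zeros of xk need transmitting
xhat = decode(norm, signs, levels, s)
print(np.linalg.norm(xhat - x) ** 2)         # squared error of the unbiased compressor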
Now we give a lemma on the accuracy of SRQS.
Lemma 1.
The SRQS achieves compression error
$$\mathbb{E}\,\|Q(x) - x\|_2^2 \;\le\; \Big(\frac{2d}{k} - 1\Big)\|x\|_2^2$$
with the rand-$k$ compressor and randomized quantization with $s = \sqrt{k}$ quantization levels.
Proof.
Please note that our scheme incorporates two sources of randomness: the rand-$k$ sparsification and the unbiased randomized quantization. Let $\mathbb{E}_Q$ denote the expectation over the randomized quantization. The total compression error of the SRQS scheme can be expressed as follows:
$$\mathbb{E}\,\|Q(x) - x\|_2^2 \;=\; \mathbb{E}\,\|x_k - x\|_2^2 + \mathbb{E}\,\|Q_s(x_k) - x_k\|_2^2, \qquad (9)$$
where the last equality arises from the unbiased property $\mathbb{E}_Q[Q_s(x_k)] = x_k$, which makes the cross term vanish. The error in the second term of (9) is from the randomized quantization. According to Lemma 3.1 in QSGD [7], we obtain the quantization error as
$$\mathbb{E}_Q\,\|Q_s(x_k) - x_k\|_2^2 \;\le\; \min\!\Big(\frac{k}{s^2}, \frac{\sqrt{k}}{s}\Big)\,\|x_k\|_2^2.$$
Here we choose $s = \sqrt{k}$, which ensures $\min(k/s^2, \sqrt{k}/s) = 1$. Consequently, we can deduce that $\mathbb{E}_Q\|Q_s(x_k) - x_k\|_2^2 \le \|x_k\|_2^2$.
Then, the total compression error can be expressed as follows:
$$\mathbb{E}\,\|Q(x) - x\|_2^2 \;\le\; \mathbb{E}\,\|x_k - x\|_2^2 + \mathbb{E}\,\|x_k\|_2^2 \;=\; 2\,\mathbb{E}\,\|x_k\|_2^2 - \|x\|_2^2,$$
where the last equality is from the unbiased property of rand-$k$ sparsification, $\mathbb{E}[x_k] = x$, which gives $\mathbb{E}\|x_k - x\|_2^2 = \mathbb{E}\|x_k\|_2^2 - \|x\|_2^2$.
Given that $x_k$ is obtained through the rand-$k$ sparsification, we have
$$\mathbb{E}\,\|x_k\|_2^2 \;=\; \frac{d}{k}\,\|x\|_2^2.$$
Finally, we can get the total compression error as
$$\mathbb{E}\,\|Q(x) - x\|_2^2 \;\le\; \Big(\frac{2d}{k} - 1\Big)\|x\|_2^2.$$
∎
This lemma indicates that in our scheme, the compression parameter is $\omega = 2d/k$.
To transmit the final compressed vector $Q(x)$, we encode and transmit the tuple $(\|x_k\|_2, \sigma, \zeta)$ together with the locations of the non-zero elements. In our scheme, we employ Elias coding, similar to QSGD, for this purpose. Elias coding is a method used to encode the positive integers arising in the process, such as the gaps between the locations of non-zero elements in $Q(x)$ and the entries of the integer vector $\zeta$.
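To make the encoding concrete, here is a minimal Python sketch (added for illustration) of Elias gamma coding, one of the simpler members of the Elias family of universal integer codes; QSGD and the scheme above use Elias coding in this spirit, though the recursive variant used there differs in detail:

def elias_gamma(n: int) -> str:
    """Elias gamma code of a positive integer: floor(log2 n) zeros, then n in binary."""
    assert n >= 1
    b = bin(n)[2:]                         # binary representation, no leading zeros
    return "0" * (len(b) - 1) + b

def elias_gamma_decode(bits: str) -> int:
    """Count leading zeros, then read that many bits plus one as the integer."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)

for n in [1, 2, 5, 17]:
    code = elias_gamma(n)
    print(n, code, elias_gamma_decode(code))   # code length is about 2*log2(n) + 1 bits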
Now let us present the formal guarantee of SRQS.
Theorem 5.
When $2 \le \omega \le 2d$, $k = 2d/\omega$, and the quantization level is $s = \sqrt{k}$, SRQS is an unbiased $\omega$-compressor with the number of bits
$$B \;=\; O\!\Big(\frac{d}{\omega}\log\omega\Big).$$
The proof is in Appendix B.
As can be seen, SRQS approaches the $\Omega(d/\omega)$ lower bound for unbiased compressors up to a logarithmic factor.
V Conclusion
In this paper we compile the number of necessary and sufficient bits required for the widely used biased $\delta$-compressors and unbiased $\omega$-compressors. For the biased $\delta$-compressor, we propose a random Gaussian codebook scheme that achieves the lower bound, and a Max Block Norm Quantization scheme that approaches the lower bound up to a logarithmic factor. For the unbiased compressor, we show that an unbiased random Gaussian codebook scheme achieves the lower bound, and we further propose a practical Sparse Randomized Quantization Scheme that approaches the lower bound, again up to a logarithmic factor. In short, applying a simple combination of sparsification and quantization methods to distributed learning leads to near-optimal compression.
Acknowledgement: This work is supported in part by NSF awards 2133484, 2217058, and 2112665.
References
- [1] Allen Gersho and Robert M Gray, Vector quantization and signal compression, vol. 159, Springer Science & Business Media, 2012.
- [2] David Salomon and Giovanni Motta, Handbook of data compression, Springer, 2010.
- [3] Khalid Sayood, Introduction to data compression, Morgan Kaufmann, 2017.
- [4] Peter Kairouz and H Brendan McMahan, “Advances and open problems in federated learning,” Foundations and Trends® in Machine Learning, vol. 14, no. 1, pp. 1–210, 2021.
- [5] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon, “Federated learning: Strategies for improving communication efficiency,” arXiv preprint arXiv:1610.05492, 2016.
- [6] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics. PMLR, 2017, pp. 1273–1282.
- [7] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.
- [8] Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi, “Sparsified SGD with memory,” in Advances in Neural Information Processing Systems, 2018, pp. 4447–4458.
- [9] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U Stich, and Martin Jaggi, “Error feedback fixes SignSGD and other gradient compression schemes,” in International Conference on Machine Learning, 2019, pp. 3252–3261.
- [10] Avishek Ghosh, Raj Kumar Maity, Swanand Kadhe, Arya Mazumdar, and Kannan Ramchandran, “Communication-efficient and byzantine-robust distributed learning with error feedback,” IEEE Journal on Selected Areas in Information Theory, vol. 2, no. 3, pp. 942–953, 2021.
- [11] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang, “Gradient sparsification for communication-efficient distributed optimization,” in Advances in Neural Information Processing Systems, 2018, pp. 1299–1309.
- [12] Venkata Gandikota, Daniel Kane, Raj Kumar Maity, and Arya Mazumdar, “vqsgd: Vector quantized stochastic gradient descent,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2021, pp. 2197–2205.
- [13] Toby Berger, “Rate-distortion theory,” Wiley Encyclopedia of Telecommunications, 2003.
- [14] Gérard Cohen, Iiro Honkala, Simon Litsyn, and Antoine Lobstein, Covering codes, Elsevier, 1997.
- [15] Wei-Ning Chen, Peter Kairouz, and Ayfer Ozgur, “Breaking the communication-privacy-accuracy trilemma,” Advances in Neural Information Processing Systems, vol. 33, pp. 3312–3324, 2020.
- [16] Wei-Ning Chen, Christopher A Choquette Choo, Peter Kairouz, and Ananda Theertha Suresh, “The fundamental price of secure aggregation in differentially private federated learning,” in International Conference on Machine Learning. PMLR, 2022, pp. 3056–3089.
- [17] Roman Vershynin, “Introduction to the non-asymptotic analysis of random matrices,” arXiv preprint arXiv:1011.3027, 2010.
- [18] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li, “Terngrad: Ternary gradients to reduce communication in distributed deep learning,” in Advances in Neural Information Processing Systems, 2017, pp. 1509–1519.
- [19] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang, “Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning,” in International Conference on Machine Learning, 2017, pp. 4035–4043.
- [20] Jakub Konečnỳ and Peter Richtárik, “Randomized distributed mean estimation: Accuracy vs. communication,” Frontiers in Applied Mathematics and Statistics, vol. 4, no. 62, 2018.
- [21] Prathamesh Mayekar and Himanshu Tyagi, “Ratq: A universal fixed-length quantizer for stochastic optimization,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 1399–1409.
- [22] Prathamesh Mayekar and Himanshu Tyagi, “Limits on gradient compression for stochastic optimization,” in 2020 IEEE International Symposium on Information Theory (ISIT). IEEE, 2020, pp. 2658–2663.
- [23] Ananda Theertha Suresh, X Yu Felix, Sanjiv Kumar, and H Brendan McMahan, “Distributed mean estimation with limited communication,” in International conference on machine learning. PMLR, 2017, pp. 3329–3337.
- [24] Sebastian U Stich, “Local SGD converges fast and communicates little,” in International Conference on Learning Representations, 2019.
- [25] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh, “Scaffold: Stochastic controlled averaging for federated learning,” in International Conference on Machine Learning. PMLR, 2020, pp. 5132–5143.
- [26] Ohad Shamir, Nathan Srebro, and Tong Zhang, “Communication efficient distributed optimization using an approximate newton-type method,” CoRR, vol. abs/1312.7853, 2013.
- [27] Shusen Wang, Farbod Roosta-Khorasani, Peng Xu, and Michael W. Mahoney, “Giant: Globally improved approximate newton method for distributed optimization,” 2017.
- [28] Avishek Ghosh, Raj Kumar Maity, and Arya Mazumdar, “Distributed newton can communicate less and resist byzantine workers,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2020, NIPS’20, Curran Associates Inc.
- [29] Avishek Ghosh, Raj Kumar Maity, Arya Mazumdar, and Kannan Ramchandran, “Escaping saddle points in distributed newton’s method with communication efficiency and byzantine resilience,” CoRR, vol. abs/2103.09424, 2021.
Appendix A Proof of Theorem 1
Proof.
We begin by considering a specific fixed unit vector $x \in \mathbb{S}^{d-1}$. For the $i$-th column $g_i$ in $G$, the probability of the event that the squared distance between $x$ and $c\,g_i$ exceeds $1-\delta$ can be expressed as follows:
$$\mathbb{P}\big(\|x - c\,g_i\|_2^2 > 1-\delta\big) \;=\; \mathbb{P}\big(1 - 2c\,\langle x, g_i\rangle + c^2\|g_i\|_2^2 > 1-\delta\big).$$
Since $\|x\|_2 = 1$ and $g_i$ is a vector with each element being a standard Gaussian variable, the inner product $\langle x, g_i\rangle$ is a standard Gaussian variable. We denote it as $h$, with $h \sim \mathcal{N}(0,1)$. Thus, we can write the complementary probability as
$$\mathbb{P}\big(\|x - c\,g_i\|_2^2 \le 1-\delta\big) \;=\; \mathbb{P}\big(2c\,h \ge \delta + c^2\|g_i\|_2^2\big).$$
Since $\|g_i\|_2^2$ follows a chi-squared distribution with $d$ degrees of freedom, we can get the tail bound of $\|g_i\|_2^2$ as
$$\mathbb{P}\big(\|g_i\|_2^2 > d(1+t)\big) \;\le\; e^{-dt^2/8}, \qquad t \in (0,1).$$
We define the event $A_i$ as $\{\|x - c\,g_i\|_2^2 \le 1-\delta\}$ and the event $B_i$ as $\{\|g_i\|_2^2 \le d(1+t)\}$, and use $\mathbb{P}(A_i) \ge \mathbb{P}(A_i \cap B_i)$. Then we have
$$\mathbb{P}(A_i \cap B_i) \;\ge\; \mathbb{P}\big(\{2c\,h \ge \delta + c^2 d(1+t)\}\cap B_i\big) \;\ge\; \mathbb{P}\big(2c\,h \ge \delta + c^2 d(1+t)\big) - \mathbb{P}(B_i^{c}).$$
From event $B_i$, we can imply that $\delta + c^2\|g_i\|_2^2 \le \delta + c^2 d(1+t)$, so on $B_i$ the condition $2c\,h \ge \delta + c^2 d(1+t)$ implies $A_i$. Thus we know that it suffices to bound the probability that $h$ exceeds
$$a \;:=\; \frac{\delta + c^2 d(1+t)}{2c}.$$
By choosing $c = \sqrt{\delta/(d(1+t))}$ and recalling that $\|x\|_2 = 1$, we obtain $a = \sqrt{d\delta(1+t)}$.
For the standard Gaussian variable $h$, the tail bound is
$$\mathbb{P}(h > a) \;\ge\; \frac{1}{\sqrt{2\pi}}\cdot\frac{a}{a^2+1}\,e^{-a^2/2},$$
where $a > 0$. We can see that $a = \sqrt{d\delta(1+t)}$ grows with $d$ under the chosen normalization, thus for $d$ large enough we can have
$$\mathbb{P}(h > a) \;\ge\; e^{-a^2} \;=\; e^{-d\delta(1+t)}.$$
Combining the above results, we have
$$\mathbb{P}(A_i) \;\ge\; \mathbb{P}(h > a) - \mathbb{P}(B_i^{c}) \;\ge\; e^{-d\delta(1+t)} - e^{-dt^2/8}.$$
In our setting, $\delta$ is small and can approach $0$, thus we can choose a constant $t \in (0,1)$ to satisfy $e^{-dt^2/8} \le \tfrac{1}{2}e^{-d\delta(1+t)}$. We can further make this condition explicit. Taking logarithm on both sides, we can get
$$\frac{dt^2}{8} \;\ge\; d\delta(1+t) + \ln 2. \qquad (10)$$
Since the left-hand side of (10) is of order $dt^2$ and the right-hand side is of order $d\delta$, when we choose $t$ to satisfy $t^2 \gg \delta$ (for instance $t = 8\sqrt{\delta}$ for small $\delta$) and $d$ is large, the condition (10) easily holds.
Then we can obtain
$$p \;:=\; \mathbb{P}(A_i) \;\ge\; \tfrac{1}{2}\,e^{-d\delta(1+t)}.$$
Since there are $N$ i.i.d. random Gaussian vectors, we define the event $E_x$ that for all $N$ Gaussian vectors, there is no close vector to the particular $x$:
$$E_x \;:=\; \bigcap_{i=1}^{N} A_i^{c}.$$
The probability of event $E_x$ is
$$\mathbb{P}(E_x) \;=\; (1-p)^{N} \;\le\; e^{-pN},$$
where the second inequality is from the fact that $1 - p \le e^{-p}$.
To encompass all possible $x$ on the unit sphere, we employ an $\epsilon$-net $\mathcal{N}_\epsilon$ to cover the unit sphere, where we set the error parameter $\epsilon$ to be a small constant multiple of $\delta$ (so that closeness to a net point can be converted into closeness to an arbitrary $x$, at the cost of a constant factor in $\delta$). Consequently, the size of the $\epsilon$-net is given by $|\mathcal{N}_\epsilon| \le (1+2/\epsilon)^{d}$ [17]. Subsequently, we define the event $E$ that for all $N$ Gaussian vectors, there is no close vector to some vector of the $\epsilon$-net:
$$E \;:=\; \bigcup_{y \in \mathcal{N}_\epsilon} E_y.$$
By the union bound, the probability of event $E$ is
$$\mathbb{P}(E) \;\le\; |\mathcal{N}_\epsilon|\,e^{-pN} \;\le\; \exp\big(d\ln(1+2/\epsilon) - pN\big),$$
where the second inequality is from the bound on the size of the $\epsilon$-net.
Here we expect the exponential term to be negative, so that $\mathbb{P}(E)$ approaches $0$. Hence, we need to ensure the following condition holds:
$$p\,N \;\ge\; 2\,d\ln(1+2/\epsilon), \qquad\text{i.e.,}\qquad N \;\ge\; 4\,d\ln(1+2/\epsilon)\,e^{d\delta(1+t)}.$$
Taking logarithm on both sides, we get
$$\log_2 N \;\ge\; d\delta(1+t)\log_2 e + \log_2\!\big(4\,d\ln(1+2/\epsilon)\big).$$
When $N$ is chosen this large, the probability $\mathbb{P}(E) \le e^{-d\ln(1+2/\epsilon)}$ approaches $0$. Then the communication cost of the random Gaussian codebook scheme is
$$B \;=\; \log_2 N \;=\; O(d\delta) + O(\log d) \;=\; O(d\delta)$$
in the regime $d\delta = \Omega(\log d)$, where the additive lower-order term is dominated. The scheme is a $\delta$-compressor, simultaneously for every vector of the net and hence (after the constant adjustment of $\delta$ accounted for by the choice of $\epsilon$) for every unit vector, with probability at least $1 - \mathbb{P}(E) \to 1$.
∎
Appendix B Proof of Theorem 5
During the transmission of compressed vectors, we use Elias coding to encode positive integers. Before proving Theorem 5, we need to introduce a lemma from [7] that bounds the number of bits needed to represent an integer vector after Elias coding.
Lemma 2 ([7], Appendix, Lemma A.3).
Let $q$ be a vector of $n$ elements, each of which is a positive integer, and let its $\ell_1$-norm be $\|q\|_1$. Then we have
$$\sum_{i=1}^{n} \big|\mathrm{Elias}(q_i)\big| \;\le\; (1+o(1))\,n\log_2\!\frac{\|q\|_1}{n} + O(n), \qquad (11)$$
where $\mathrm{Elias}(\cdot)$ is the Elias coding function applied to a positive integer and $|\mathrm{Elias}(q_i)|$ denotes the length of the resulting codeword.
Now we give the proof of Theorem 5.
Proof.
After rand-$k$ sparsification and randomized quantization, the compressed vector $Q(x)$ becomes very sparse. To transmit $Q(x)$, we first need to send the locations of its non-zero elements. Let $u$ represent the vector of gaps between consecutive non-zero indices of $Q(x)$ (from which the indices can be recovered). We use Elias coding to encode the integer vector $u$. This integer vector has a length of $\|Q(x)\|_0$ and an $\ell_1$-norm of at most $d$. Thus from Lemma 2 the required number of bits after Elias coding is given by
$$B_1 \;\le\; (1+o(1))\,\|Q(x)\|_0\,\log_2\!\frac{d}{\|Q(x)\|_0} + O\big(\|Q(x)\|_0\big).$$
Secondly, we need to transmit the sign vector $\sigma$ and the integer vector $\zeta$. For the sign vector $\sigma$ with $\|Q(x)\|_0$ entries, we need $B_2 = \|Q(x)\|_0$ bits. For the integer vector $\zeta$, we also use Elias coding to encode it. The required number of bits after Elias coding is
$$B_3 \;\le\; (1+o(1))\,\|Q(x)\|_0\,\log_2\!\frac{\|\zeta\|_1}{\|Q(x)\|_0} + O\big(\|Q(x)\|_0\big).$$
From [7], we know that for randomized quantization the number of non-zero elements satisfies
$$\mathbb{E}\,\|Q(x)\|_0 \;\le\; \min\big\{k,\; s(s+\sqrt{k})\big\},$$
and the norm of $\zeta$ satisfies
$$\mathbb{E}\,\|\zeta\|_1 \;\le\; s\sqrt{k}.$$
Summing up the three parts $B_1 + B_2 + B_3$ (plus a constant number of bits for the scalar $\|x_k\|_2$), we can get the total number of bits
$$B \;\le\; (1+o(1))\,\|Q(x)\|_0\Big[\log_2\!\frac{d}{\|Q(x)\|_0} + \log_2\!\frac{\|\zeta\|_1}{\|Q(x)\|_0}\Big] + O\big(\|Q(x)\|_0\big).$$
Note that the function $f(y) = y\log_2(A/y)$ increases until $y = A/e$ and then decreases. Also, the function $f$ is concave, so that $\mathbb{E}[f(Y)] \le f(\mathbb{E}[Y])$. Assuming $k \le d/e$, it follows that we may replace $\|Q(x)\|_0$ and $\|\zeta\|_1$ by (upper bounds on) their expectations. Applying $\mathbb{E}\|Q(x)\|_0 \le k$ and $\mathbb{E}\|\zeta\|_1 \le s\sqrt{k}$ in the function, we can obtain
$$\mathbb{E}[B] \;\le\; (1+o(1))\,k\Big[\log_2\frac{d}{k} + \log_2\frac{s\sqrt{k}}{k}\Big] + O(k),$$
where the first inequality is from Jensen's inequality applied to the concave function $f$, and the second inequality is from the monotonicity of $f(y)$ for $y \le A/e$.
From the rand-$k$ sparsification, we know that $\mathbb{E}\|x_k\|_0 = k$. From Lemma 1, our scheme has compression parameter $\omega = 2d/k$. Here we choose $k = 2d/\omega$; then we have $s = \sqrt{k} = \sqrt{2d/\omega}$, so that $\log_2(s\sqrt{k}/k) = 0$.
Finally, the communication cost of SRQS is
$$\mathbb{E}[B] \;\le\; (1+o(1))\,k\log_2\frac{d}{k} + O(k) \;=\; O\!\Big(\frac{d}{\omega}\log\omega\Big) + O\!\Big(\frac{d}{\omega}\Big).$$
The dominating term is $k\log_2(d/k)$, i.e., $O\big(\frac{d}{\omega}\log\omega\big)$.
To satisfy the condition $1 \le k \le d$, we require
$$2 \;\le\; \omega \;\le\; 2d.$$
Thus $k = 2d/\omega \in [1, d]$, and it follows that $s = \sqrt{k} \ge 1$ is a valid number of quantization levels.
When $2 \le \omega \le 2d$ and we choose $k = 2d/\omega$, $s = \sqrt{k}$, then we can achieve an unbiased $\omega$-compressor with communication cost $O\big(\frac{d}{\omega}\log\omega\big)$ bits. ∎