
Optimal Compression of Unit Norm Vectors
in the High Distortion Regime

Heng Zhu, Avishek Ghosh and Arya Mazumdar
University of California, San Diego; Indian Institute of Technology, Bombay
Emails: [email protected], avishek_[email protected], [email protected]
Abstract

Motivated by the need for communication-efficient distributed learning, we investigate methods for compressing a unit norm vector into the minimum number of bits, while still allowing for some acceptable level of distortion in recovery. This problem has been explored in the rate-distortion/covering code literature, but our focus is exclusively on the "high-distortion" regime. We approach this problem in a worst-case scenario, without any prior information on the vector, but allowing for the use of randomized compression maps. Our study considers both biased and unbiased compression methods and determines the optimal compression rates. It turns out that simple compression schemes are nearly optimal in this scenario. While the results are a mix of new and known, they are compiled in this paper for completeness.

I Introduction

Data compression is an active area of research in signal processing, especially in the context of image, audio, and video processing [1, 2, 3]. Apart from the classical applications, data compression plays a crucial role in modern large-scale learning systems such as federated learning [4, 5, 6]. Federated learning (FL) is a distributed learning paradigm with one central server and a large number (millions or more) of client machines, and the exchange of information between the server and the clients is crucial for learning models. In a generic federated learning framework, each of a large number of clients computes a local stochastic gradient of a loss function and sends it to the server. The server averages the received stochastic gradients and updates the model parameters. In modern machine learning, one necessarily needs to deal with extremely high-dimensional data ($d$ is very large) and train large neural networks with billions (if not more) of parameters. Hence, one of the major challenges faced by FL is the communication cost between the client machines and the server, since this cost is associated with the internet bandwidth of the users, who are often resource constrained [4].

A canonical way to reduce the communication cost in FL is to compress (sparsify or quantize) the data (gradients) before communicating, and this forms the basis of our study. Indeed, data compression is widely used in FL systems ([7, 8, 9, 10]), and several compression schemes, both deterministic and randomized, have been proposed in the last few years. Broadly, the compression schemes used in FL fall into two categories: (i) unbiased ([7, 8]) and (ii) biased ([9, 8, 10]). For an unbiased compressor, we require conditions on the first and second moments (see Definition 1), while for a biased one, we only require a condition on the second moment (see Definition 2). We now define them formally:

Definition 1 (Unbiased $\omega$-compressor).

A randomized operator $\mathcal{Q}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is an unbiased $\omega$-compressor if it satisfies

{\mathbb{E}}_{\mathcal{Q}}[\mathcal{Q}(x)]=x,
{\mathbb{E}}_{\mathcal{Q}}\left\|\mathcal{Q}(x)-x\right\|^{2}\leq(\omega-1)\left\|x\right\|^{2},\quad\forall x\in\mathbb{R}^{d},

where $\omega\geq 1$ is the (unbiased) compression parameter.

Typical unbiased compressors include:

  • Randomized Quantization in QSGD [7]: For any real number $r\in[a,b]$, quantize $r$ to $a$ with probability $\frac{b-r}{b-a}$, and to $b$ with probability $\frac{r-a}{b-a}$.

  • Rand-$k$ sparsification [11]: For any $x\in\mathbb{R}^{d}$, randomly select $k$ elements of $x$ to be scaled by $\frac{d}{k}$, and set the other elements to zero.
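Below is a minimal NumPy sketch of these two unbiased compressors; the function names (randomized_round, rand_k) and the random-generator setup are ours, for illustration only, and are not taken from [7] or [11].

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_round(r, a, b):
    """Quantize r in [a, b]: return a with probability (b - r)/(b - a), else b.
    The expectation of the output equals r, so the operator is unbiased."""
    return b if rng.random() < (r - a) / (b - a) else a

def rand_k(x, k):
    """Rand-k sparsification: keep k random coordinates of x, scaled by d/k,
    so that E[Q(x)] = x."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * x[idx]
    return out
```

For rand-$k$, a direct computation gives ${\mathbb{E}}\|Q(x)-x\|^{2}=(\frac{d}{k}-1)\|x\|^{2}$, i.e., $\omega=\frac{d}{k}$.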

A more general definition, which includes biased compressors, is the following:

Definition 2 (Biased $\delta$-compressor).

A randomized operator $\mathcal{Q}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is a $\delta$-compressor if it satisfies

{\mathbb{E}}_{\mathcal{Q}}\left\|\mathcal{Q}(x)-x\right\|^{2}\leq(1-\delta)\left\|x\right\|^{2},\quad\forall x\in\mathbb{R}^{d},

where $\delta\in[0,1]$ is the (biased) compression parameter. If the compressor is not random, we remove the expectation.

Typical biased compressors include:

  • $\ell_{1}$-sign quantization [9]: For any $x\in\mathbb{R}^{d}$, $\mathcal{Q}(x)=\frac{\left\|x\right\|_{1}}{d}{\rm sign}(x)$. Here $\delta$ is $\frac{\left\|x\right\|_{1}^{2}}{d\left\|x\right\|^{2}}$.

  • Top-$k$ sparsification [8]: For any $x\in\mathbb{R}^{d}$, retain the $k$ elements with the largest absolute values and set the other elements to zero. Here $\delta$ is $\frac{k}{d}$.
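A minimal NumPy sketch of these two biased compressors follows; the function names (top_k, l1_sign) are ours, and the $\ell_{1}$-sign scaling uses the dimension $d$ as the normalizer, as in the bullet above.

```python
import numpy as np

def top_k(x, k):
    """Top-k sparsification: keep the k entries of x with largest magnitude
    (a biased delta-compressor with delta = k/d)."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]      # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

def l1_sign(x):
    """ell_1-sign quantization: Q(x) = (||x||_1 / d) * sign(x)."""
    return (np.abs(x).sum() / x.size) * np.sign(x)

# Check the Definition 2 bound ||Q(x) - x||^2 <= (1 - delta) ||x||^2 for top-k
rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
x /= np.linalg.norm(x)
q = top_k(x, 100)
print(np.linalg.norm(q - x) ** 2, "<=", 1 - 100 / 1000)
```

For top-$k$, the $k$ largest-magnitude entries carry at least a $\frac{k}{d}$ fraction of $\|x\|^{2}$, which gives the stated $\delta=\frac{k}{d}$.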

Ranges of $\omega$ and $\delta$: The parameters $\omega$ and $1-\delta$ measure the amount of distortion of the compressors. We emphasize that for the biased compressor, $\delta$ is in the range $[0,1]$, while in the unbiased case, $\omega$ is in the range $[1,+\infty)$. Owing to unbiasedness, the variance of the compressor often increases, and we require $\omega$ to be large in such settings [7, 12]. Thus there is a fundamental difference between the definitions of the biased $\delta$-compressor and the unbiased $\omega$-compressor. In this work we consider the high distortion regime, where $\delta$ is small and $\omega$ is large, again motivated by FL applications where extreme compression is desirable.

We aim to find the optimal communication cost of the above-mentioned widely used compressors. Since, in standard FL applications, one can normalize the information vector and send the norm separately, we can assume without loss of generality that the compressors are applied to a unit vector $x$, i.e., $\|x\|^{2}=1$ in Definitions 1 and 2.

I-A Our contributions

Motivated by the compression needs in FL, in this paper we study the basic problem of compressing a unit norm vector in high dimension. Though this problem has been extensively explored in the context of rate-distortion and covering codes [13, 1, 14], we are interested in the high distortion regime, where the recovery error is high and the compression rate tends to $0$. Moreover, we do not assume any prior information on the vector to be compressed, and hence our results hold in the worst-case setting.

We first obtain lower bounds on the number of bits required to transmit such unit norm vectors under biased as well as unbiased compression schemes. For the biased schemes, this follows directly via a sphere covering argument, and for the unbiased case, we leverage results from [15, 16]. Moreover, we propose and analyze several efficient algorithms that match these lower bounds and are hence optimal. It turns out that for an unbiased $\omega$-compressor the minimum number of bits required is $\Omega(d/\omega)$, whereas for a biased $\delta$-compressor the least number of bits required is $\Omega(d\delta)$. For upper bounds, we provide the following compressors:

An optimal but inefficient biased $\delta$-compressor

In Section III-B, we propose and analyze a biased $\delta$-compressor based on a random Gaussian codebook. In this scheme, out of the generated codebook, we choose the Gaussian vector closest (in $\ell_{2}$ norm) to the information vector as the quantized vector. This scheme only requires $O(d\delta)$ bits, which matches the lower bound. Although optimal, the encoding is an exhaustive search over an exponentially large number of possible codewords, and hence this algorithm is not computationally efficient.

A near-optimal and efficient biased $\delta$-compressor

In Section III-C, we propose an efficient biased compressor that is (near) optimal, namely Max Block Norm Quantization (MBNQ). In this algorithm, we first find the sub-block of length $k$ that has the largest norm, and then use scalar quantization (coordinate-wise) on this sub-vector. As shown in Theorem 2, MBNQ requires $O(d\delta\log(d\delta))$ bits. Furthermore, MBNQ is computationally efficient, as seen in Algorithm 2.

A near-optimal and efficient unbiased $\omega$-compressor

In Section IV-B, we discuss the vector quantized SGD (VQSGD) algorithm of [12], which requires $\mathcal{O}(d/\omega)$ bits and is hence optimal. However, similar to the random Gaussian codebook algorithm, this is also inefficient, and the computational complexity scales exponentially with the dimension $d$. In Section IV-C, we propose an efficient algorithm, namely the Sparse Randomized Quantization Scheme (SRQS), that first applies the rand-$k$ compressor (of [8]) and then uses QSGD (of [7]) on the sparse $k$-length sub-vector. As shown in Theorem 5, this simple combination yields an efficient algorithm requiring $O((d/\omega)\log(d\omega))$ bits, and hence SRQS is (near) optimal.

Our contributions are summarized in Table I.

TABLE I: Summary of findings.
LB: Lower Bounds; Random: Random Gaussian Codebook Scheme and VQSGD [12]; Sparse: Max Block Norm Quantization and Sparse Randomized Quantization Scheme.

Compressor | LB | Random | Sparse
Biased | $\Omega(d\delta)$ | $O(d\delta)$ | $O(d\delta\log d\delta)$
Unbiased | $\Omega\left(\frac{d}{\omega}\right)$ [15] | $O\left(\frac{d}{\omega}\right)$ [12] | $O\left(\frac{d}{\omega}\log d\omega\right)$

II Related Work

II-A $\epsilon$-nets

Note that our biased compressors are just $\epsilon$-nets for the unit sphere (with $\epsilon=1-\delta$), and it is known that there exists an $\epsilon$-net of size $(1+2/\epsilon)^{d}$ [17]. However, such results on $\epsilon$-nets are tailored to very small $\epsilon$; for $\epsilon$ very close to 1, they do not provide a fine-grained dependence on $\delta$.

II-B Quantization

A straightforward way to cut down communication is to use quantization or sparsification techniques. Since the dimension of such data is typically huge, dimension reduction techniques often turn out to be useful. To achieve compression, one can quantize each coordinate of a transmitted vector into a few bits [7, 18, 19], or obtain a sparser vector by setting some elements to zero [8, 11, 20]. These compressors are either biased or unbiased, which influences the convergence of learning algorithms. In [21, 22], a lower bound on the compression budget is obtained to retain the convergence rate of stochastic optimization in the uncompressed setting; to achieve the lower bound, a compression scheme based on random rotation and block quantization was proposed. In the context of distributed mean estimation, a lower bound on the estimation accuracy is derived given a compression budget [23], based on a distributed statistical estimation lower bound; a random rotation scheme was also proposed to approach that lower bound.

II-C Federated Learning and Communication Cost

As mentioned in the introduction, FL algorithms employ several techniques to reduce the communication cost. One simple way is to use local iterations and communicate with the server sparsely [6, 24, 25]. Another way to reduce the number of communication rounds is to use second-order Newton-based algorithms [26, 27, 28, 29], which exploit the compute power of the client machines and cut the communication cost.

III Biased $\delta$-Compressors

We start with the biased $\delta$-compressor. We first answer the question of the minimum communication cost of a $\delta$-compressor by providing a lower bound, which follows from a sphere covering argument.

III-A Lower Bound

Proposition 1.

Let $Q(v)$ be a compressor of $v$ satisfying Definition 2. Then the number of bits $b$ required to transmit $Q(v)$ satisfies

b\geq d\delta+\log d. \qquad (1)
Proof.

The proof employs a simple sphere covering of the unit sphere $S^{d-1}$ [14]. We know that $\mathcal{Q}(v)$ resides within the ball $B_{d}(v,1-\delta)$. The cardinality of the set of balls covering the sphere $S^{d-1}$ is at least

C=\frac{\text{vol}(S^{d-1})}{\text{vol}(S^{d-1}\cap B_{d}(v,1-\delta))}\geq\frac{\text{vol}(S^{d-1})}{\text{vol}(B_{d}(v,1-\delta))}.

We can send the indices of these balls covering the unit sphere. Thus the necessary number of bits $b$ to represent $\mathcal{Q}(v)$ satisfies

b=\log C\geq\log\frac{\text{vol}(S^{d-1})}{\text{vol}(B_{d}(v,1-\delta))}=\log\frac{2\pi^{\frac{d}{2}}/\Gamma(\frac{d}{2})}{[\pi^{\frac{d}{2}}/\Gamma(\frac{d}{2}+1)](1-\delta)^{d}}=\log\frac{d}{(1-\delta)^{d}}=\log d-d\log(1-\delta)\geq\log d+d\delta,

where $\Gamma(\cdot)$ is the Gamma function, and the last inequality follows from the fact that $\log(1+x)\leq x$ for $x>-1$. ∎

Considering that $\delta$ falls within the range $[0,1]$, the required number of bits is smaller than that of transmitting the uncompressed vector directly, which would require $O(d)$ bits.
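For a rough, purely illustrative sense of scale (ignoring constants and the base of the logarithm in (1)): with $d=10^{6}$ and $\delta=0.01$,

b\geq d\delta+\log d\approx 10^{6}\cdot 10^{-2}+\log 10^{6}\approx 10^{4}\ \text{bits},

which is about three orders of magnitude fewer than the roughly $32d\approx 3.2\times 10^{7}$ bits needed to send the raw vector in single precision.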

In the following we first present a random Gaussian codebook scheme that achieves the aforementioned lower bound. However, this scheme incurs significant computational and storage requirements, rendering it infeasible in practice. Consequently, we introduce an alternative practical sparse quantization scheme which is nearly optimal.

III-B Random Gaussian Codebook Scheme

We use a vector quantization method via a random Gaussian construction: let a random Gaussian matrix $A\in\mathbb{R}^{d\times n}$ be

A=\frac{1}{\sqrt{N}}\left[a_{1},\dots,a_{n}\right] \qquad (2)

where each column $a_{i}\in\mathbb{R}^{d}$ is a Gaussian vector with each element being a standard Gaussian variable $\mathcal{N}(0,1)$, and $N$ is a normalization factor that will be chosen later. The columns of the Gaussian matrix $A$ are regarded as a codebook for compression.

For a unit norm vector $v$ to be compressed, we find the nearest Gaussian vector

a_{\text{min}}=\underset{\{a_{i}\}_{i=1}^{n}}{\arg\min}\left\|v-\frac{1}{\sqrt{N}}a_{i}\right\|^{2}.

Then we map $v$ to the Gaussian vector

\mathcal{Q}(v)=\frac{1}{\sqrt{N}}a_{\text{min}}.

To transmit the compressed vector, we only need to send the index $i_{\text{min}}$ of the nearest Gaussian vector. The random Gaussian codebook scheme is described in Algorithm 1.

Algorithm 1 Random Gaussian Codebook Scheme

Input: Unit norm vector $v$; a matrix $A\in\mathbb{R}^{d\times n}$ constructed as in (2)
Encoding:

1:  Calculate the distances between $v$ and all the Gaussian vectors $\{a_{i}\}_{i=1}^{n}$: $\text{dist}_{i}=\left\|v-\frac{1}{\sqrt{N}}a_{i}\right\|^{2}$
2:  Find the smallest distance and the corresponding index $i_{\text{min}}$

Output: Index of the Gaussian vector $i_{\text{min}}$
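For concreteness, a minimal NumPy sketch of the encoder in Algorithm 1 is given below; the toy sizes are ours and far smaller than the $n=\exp(O(d\delta))$ required by Theorem 1, so the code only illustrates the exhaustive search, not the guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_codebook(d, n, N):
    """Codebook A / sqrt(N): columns are i.i.d. N(0, I_d) vectors scaled by 1/sqrt(N)."""
    return rng.standard_normal((d, n)) / np.sqrt(N)

def encode(v, codebook):
    """Exhaustive search: index of the codeword closest to v in squared ell_2 distance."""
    dists = np.sum((codebook - v[:, None]) ** 2, axis=0)
    return int(np.argmin(dists))

d, delta = 64, 0.05
n, N = 2 ** 12, d / delta                 # toy codebook size, for illustration only
A = gaussian_codebook(d, n, N)
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
i_min = encode(v, A)
print(i_min, np.linalg.norm(v - A[:, i_min]) ** 2)   # ideally at most 1 - delta
```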

For this scheme, we have the following guarantee:

Theorem 1.

A Gaussian codebook constructed as in (2), with $N=\frac{d}{\delta}$ and of size $n=\exp(O(d\delta))$, is a $\delta$-compressor with probability at least $P_{\delta}=1-\exp\left[-\frac{n}{2\sqrt{2\pi}(2+t)\sqrt{d\delta}}\exp(-\frac{(2+t)^{2}d\delta}{8})+\frac{2d}{\sqrt{1-\delta}}\right]=1-o(1)$. The number of bits needed is $b=\log_{2}n=O(d\delta)$.

The proof is in Appendix A.

Remark 1.

Since we use $O(d\delta)$ bits to represent the index among the $n$ Gaussian vectors, the random Gaussian codebook scheme achieves the lower bound for biased $\delta$-compressors. However, it is impractical to store the large number of Gaussian vectors and perform that many distance computations at each iteration, since $n$ grows exponentially with the model dimension $d$.

III-C Max Block Norm Quantization (MBNQ)

To give a practical compression scheme, we show that a simple block quantization scheme can approach the lower bound up to an extra logarithmic factor. This scheme, Max Block Norm Quantization, uses standard scalar quantization on the sub-block with the largest norm.

Our scheme first uniformly partitions the vector into sub-blocks $v_{j}\in\mathbb{R}^{k}$, $j=1,2,\dots,d/k$. Then we pick the sub-block $v_{\max}$ with the largest norm among $\{v_{j}\}_{j=1}^{d/k}$.

We next quantize $v_{\max}$ with a standard scalar quantizer $SQ(\cdot)$ with $l$ quantization levels to get the quantized vector $q=SQ(v_{\max})$. The quantization function is $SQ(x)=r_{i}$ if $x\in[low_{i},up_{i})$, $i=1,\dots,l$, where $low_{i},up_{i}$ are the lower and upper bounds of a quantization interval, and $r_{i}$ can be any quantization value inside that interval. That is, each element of $v_{\max}$ is quantized to the closest quantization value; the scalar quantization function is applied element-wise. The final compressed vector is $\mathcal{Q}(v)=q$. The detailed process of MBNQ is described in Algorithm 2.

Algorithm 2 Max Block Norm Quantization (MBNQ)

Input: Unit norm vector $v$
Initialization: Define a scalar quantization function $SQ(\cdot)$ with $l$ quantization levels, where $SQ(x)=r_{i}$ if $x\in[low_{i},up_{i})$, $i=1,\dots,l$; $low_{i},up_{i}$ are the lower and upper bounds of a quantization interval, and $r_{i}$ is the quantization value inside that interval

1:  Partition the vector $v$ into $d/k$ sub-vectors $v_{j}\in\mathbb{R}^{k}$, $j=1,2,\dots,d/k$
2:  Calculate the norms of the sub-vectors $u_{j}=\|v_{j}\|$, then pick the sub-vector $v_{\text{max}}$ with the largest norm $u_{\text{max}}$
3:  Use the scalar quantization function to quantize $v_{\text{max}}$: $q=SQ(v_{\text{max}})$, where the quantization function is applied element-wise

Output: Compressed vector $q$
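A minimal NumPy sketch of MBNQ is given below (our own function names, assuming $k$ divides $d$); the uniform mid-rise quantizer over $[-\|v_{\max}\|,\|v_{\max}\|]$ is one concrete choice of $SQ(\cdot)$ whose per-coordinate error is at most $\|v_{\max}\|/l$, consistent with the bound used in the proof of Theorem 2.

```python
import numpy as np

def mbnq(v, k, l):
    """Max Block Norm Quantization (sketch): quantize the length-k block of v
    with the largest norm using a uniform l-level scalar quantizer."""
    d = v.size
    blocks = v.reshape(d // k, k)                     # assumes k divides d
    j = int(np.argmax(np.sum(blocks ** 2, axis=1)))   # index of the max-norm block
    vmax = blocks[j]
    m = np.linalg.norm(vmax)
    width = 2 * m / l                                 # cell width of the quantizer
    cells = np.clip(np.floor((vmax + m) / width), 0, l - 1)
    q = -m + (cells + 0.5) * width                    # midpoint of each cell
    return j, q

def mbnq_decode(j, q, d, k):
    """Place the quantized block back at its position; zeros elsewhere."""
    out = np.zeros(d)
    out[j * k:(j + 1) * k] = q
    return out

d, k, l = 1024, 64, 12
v = np.random.default_rng(2).standard_normal(d)
v /= np.linalg.norm(v)
j, q = mbnq(v, k, l)
print(np.linalg.norm(v - mbnq_decode(j, q, d, k)) ** 2)   # compare with 1 - k/(2d)
```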

For the MBNQ scheme, we have the following guarantee.

Theorem 2.

With sub-block length $k=2d\delta$ and number of quantization levels $l=\sqrt{4d\delta}$, MBNQ is a $\delta$-compressor with number of bits

b=d\delta\log(4d\delta)+\log\frac{1}{2\delta}.
Proof.

The compression error of our proposed scheme can be expressed as follows:

{\mathbb{E}}\left\|v-Q(v)\right\|^{2}={\mathbb{E}}\left\|v-\bar{v}_{\max}+\bar{v}_{\max}-q\right\|^{2}={\mathbb{E}}\left\|v-\bar{v}_{\max}\right\|^{2}+{\mathbb{E}}\left\|\bar{v}_{\max}-q\right\|^{2}, \qquad (3)

where $\bar{v}_{\max}\in\mathbb{R}^{d}$ represents the vector whose $k$ elements equal $v_{\max}$ and whose remaining elements are zero. The last equality follows from the fact that $\bar{v}_{\max}-q$ has non-zero entries only in those $k$ coordinates, while $v-\bar{v}_{\max}$ has non-zero entries only in the remaining $d-k$ coordinates, so that $\langle v-\bar{v}_{\max},\bar{v}_{\max}-q\rangle=0$.

The first term in (3) corresponds to the error resulting from the partitioning of the vector. Since $v_{\max}$ is a sub-vector of $v$, we have

\left\|v-\bar{v}_{\max}\right\|^{2}=1-u_{\max}^{2},

where $u_{\max}$ denotes the norm of $v_{\max}$. The second term in (3) represents the error resulting from scalar quantization with $l$ levels. From the standard scalar quantization error bound, we have

{\mathbb{E}}\left\|\bar{v}_{\max}-q\right\|^{2}\leq\frac{k}{l^{2}}\left\|\bar{v}_{\max}\right\|^{2}=\frac{k}{l^{2}}u_{\max}^{2}.

Here we choose $k=\frac{l^{2}}{2}$; then we obtain

{\mathbb{E}}\left\|v-Q(v)\right\|^{2}\leq 1-u_{\max}^{2}+\frac{k}{l^{2}}u_{\max}^{2}=1-\frac{u_{\max}^{2}}{2}.

Note that the error decreases with a larger $u_{\max}$. Since we pick the sub-vector with the largest norm, we have $u_{\max}^{2}\geq\frac{k}{d}$. Therefore we obtain

{\mathbb{E}}\left\|v-Q(v)\right\|^{2}\leq 1-\frac{k}{2d}.

Hence, the MBNQ scheme is a $\delta$-compressor with $\delta=\frac{k}{2d}$.

Regarding the communication cost of the MBNQ scheme, each worker needs to transmit the index of the sub-vector with the largest norm and $k$ quantized real numbers. Thus, the total number of required bits can be calculated as:

b=\log\frac{d}{k}+k\log l=\log\frac{d}{k}+\frac{k}{2}\log(2k)=\log\frac{1}{2\delta}+d\delta\log(4d\delta),

which represents the communication cost in Theorem 2. ∎

Remark 2.

The implementation of MBNQ is straightforward in real systems, and this simple scheme approaches the lower bound for biased $\delta$-compressors up to a logarithmic factor. If we directly applied scalar quantization to the whole vector $v$, the communication cost would be $O(d)$. The improvement comes from quantizing only a sub-vector: when the sub-vector size $k$ is chosen properly, the quantization achieves nearly optimal performance.

IV Unbiased $\omega$-Compressors

In this section we provide results on the unbiased compressors of Definition 1. We start with a lower bound and a random codebook scheme that achieves this lower bound, both following from existing results. Next we propose a practical and simple sparse quantization scheme that approaches the lower bound up to a logarithmic factor.

IV-A Lower Bound

For unbiased compressors, [15] already provides a lower bound under both communication and privacy constraints. The lower bound is proven by constructing a prior distribution on $v$ and analyzing the compression error. Here we directly state the lower bound from [15]. The lower bound also follows from [12].

Theorem 3 ([15], Appendix D.2).

Let $Q(v)$ be a compressor of $v$ satisfying Definition 1. Then the number of bits $b$ required to describe $Q(v)$ satisfies

b=\Omega\left(\frac{d}{\omega}\right). \qquad (4)

Note that $\omega$ is in the range $[1,+\infty)$. Thus the lower bound of $\Omega(\frac{d}{\omega})$ is smaller than the $\Omega(d)$ cost of directly transmitting the vector as real numbers.

IV-B Unbiased Random Gaussian Codebook Scheme

In [12], a randomized vector quantization framework is proposed for unbiased compression. Among the quantization methods introduced in [12], there is a random Gaussian codebook method that achieves the lower bound for unbiased compressors.

Similar to the random Gaussian codebook scheme in the biased case, we first construct a random Gaussian matrix $A\in\mathbb{R}^{d\times n}$:

A=\frac{1}{\sqrt{N}}\left[a_{1},\dots,a_{n}\right] \qquad (5)

where each column $a_{i}\in\mathbb{R}^{d}$ is a Gaussian vector with standard Gaussian $\mathcal{N}(0,1)$ elements, and $N$ is the normalization factor. The columns of the Gaussian matrix $A$ are regarded as a codebook $C$ for compression, i.e., the points in $C$ are $c_{i}=\frac{1}{\sqrt{N}}a_{i}$, $i=1,\dots,n$. Then we use the Gaussian vectors to perform the compression:

Q(v)=\frac{1}{\sqrt{N}}a_{i},\quad\text{with probability } p_{i}.

The detailed process of the unbiased random Gaussian codebook scheme is described in Algorithm 3.

Algorithm 3 Unbiased Random Gaussian Codebook Scheme

Input: Unit norm vector $v$; a matrix $A\in\mathbb{R}^{d\times n}$ constructed as in (5)
Encoding:

1:  Construct a convex combination of the Gaussian vectors $\{\frac{1}{\sqrt{N}}a_{i}\}_{i=1}^{n}$ that equals $v$: $v=\sum_{i=1}^{n}p_{i}\frac{1}{\sqrt{N}}a_{i}$
2:  Randomly choose from $\{a_{i}\}_{i=1}^{n}$ with probability distribution $\{p_{i}\}_{i=1}^{n}$: $Q(v)=\frac{1}{\sqrt{N}}a_{i}$ with probability $p_{i}$
3:  Get the index $i$ of the chosen Gaussian vector

Output: Index of the chosen Gaussian vector $i$
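Step 1 above requires convex-combination weights $p_{i}$; one way to obtain them in a sketch is to solve a linear feasibility problem, as below. The function names are ours, feasibility is only guaranteed (with high probability) for codebooks of the size in Theorem 4, and this is an illustration rather than the exact construction analyzed in [12].

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

def encode_unbiased(v, C):
    """C holds the scaled codewords c_i = a_i / sqrt(N) as columns.
    Find p >= 0 with sum(p) = 1 and C p = v, then sample an index i ~ p,
    so that E[Q(v)] = sum_i p_i c_i = v."""
    d, n = C.shape
    A_eq = np.vstack([C, np.ones((1, n))])       # constraints: C p = v, sum(p) = 1
    b_eq = np.concatenate([v, [1.0]])
    res = linprog(np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    if not res.success:
        raise ValueError("no convex combination found for this (toy) codebook")
    p = np.clip(res.x, 0.0, None)
    p /= p.sum()
    return int(rng.choice(n, p=p))
```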

Theorem 4 ([12], Theorem 7).

A Gaussian codebook of size $n=\exp(O(\frac{d}{\omega}+\log d))$, constructed as in (5), with $N=\frac{9d}{\omega}$ and $\omega\in[25,36d]$, is an unbiased compressor (Definition 1) with probability $1-o(1)$. The number of compressed bits is

b=\log n=O\left(\frac{d}{\omega}+\log d\right).
Remark 3.

From the above theorem, the dominating term is $O(\frac{d}{\omega})$ bits, matching the lower bound in Theorem 3. However, the unbiased Gaussian codebook scheme also suffers from an impractical computational and storage burden.

IV-C Sparse Randomized Quantization Scheme (SRQS)

To design a practical compression method that approaches the lower bound, we propose a Sparse Randomized Quantization Scheme (SRQS) based on the well-known rand-$k$ sparsification and the randomized quantization of QSGD [7]. It is easy to implement, and we prove that it approaches the lower bound up to a logarithmic factor.

Our SRQS method is as follows. The rand-$k$ compressor randomly chooses $k$ elements of the vector $v$ and sets the other elements to zero. Specifically, each coordinate $j$ with value $v_{j}$ is set to $\frac{d}{k}v_{j}$ with probability $\frac{k}{d}$, and to $0$ with probability $1-\frac{k}{d}$.

For the sparse vector $v_{\text{spa}}$ obtained from $v$, we perform an unbiased Randomized Quantization as in [7] to quantize the vector: $q=RQ(v_{\text{spa}})$. This method, used in QSGD, works as follows. The quantization function consists of $l$ quantization levels. For each element $v_{j}$ of $v_{\text{spa}}$, the Randomized Quantization function is

RQ(v_{j})=\|v_{\text{spa}}\|_{2}\cdot\text{sign}(v_{j})\cdot\xi_{j}(v_{\text{spa}},l),

where $\text{sign}(x)\in\{+1,-1\}$ represents the sign of $x$, and $\xi_{j}$ is an independent random variable that determines which quantization level $v_{j}$ is mapped to, as defined next. Let $0\leq t<l$ be an integer such that $\frac{|v_{j}|}{\|v_{\text{spa}}\|_{2}}\in[\frac{t}{l},\frac{t+1}{l}]$, i.e., $[\frac{t}{l},\frac{t+1}{l}]$ is the quantization interval for $v_{j}$. Then the random variable $\xi_{j}$ is

\xi_{j}(v_{\text{spa}},l)=\left\{\begin{array}{ll}(t+1)/l,&\text{with probability }\frac{|v_{j}|}{\|v_{\text{spa}}\|}\cdot l-t,\\ t/l,&\text{otherwise}.\end{array}\right. \qquad (8)

Thus the Randomized Quantization is an unbiased method: ${\mathbb{E}}[RQ(v_{\text{spa}})]=v_{\text{spa}}$. As in QSGD, the final quantized vector $q$ is expressed by a tuple $(\left\|v_{\text{spa}}\right\|,s,z)$, where the vector $s$ contains the signs of the non-zero elements, $s_{i}\in\{+1,-1\}$, and $z$ contains the quantization levels of the non-zero elements, i.e., $z_{j}=\xi_{j}\cdot l\in\{0,1,\dots,l\}$.

Overall, the two-stage compression scheme, sequentially applying rand-$k$ and Randomized Quantization, is unbiased. It can be seen as a sparser version of QSGD. The detailed process of SRQS is described in Algorithm 4.

Algorithm 4 Sparse Randomized Quantization Scheme

Input: Unit norm vector $v$
Initialization: Design a Randomized Quantization function $RQ(\cdot)$ with $l$ quantization levels

1:  Apply the rand-$k$ compressor on $v$ to output $v_{\text{spa}}$: each coordinate $v_{j}$ is set to $\frac{d}{k}v_{j}$ with probability $\frac{k}{d}$, and to $0$ with probability $1-\frac{k}{d}$
2:  Use the Randomized Quantization function on $v_{\text{spa}}$ to obtain the compressed vector $q=RQ(v_{\text{spa}})$; $q$ is expressed as the tuple $(\left\|v_{\text{spa}}\right\|,s,z)$

Output: Compressed vector $q$
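A minimal NumPy sketch of SRQS follows (our own function names); decoding multiplies $\|v_{\text{spa}}\|$, the signs $s$, and the levels $z/l$, and the Monte Carlo check at the end compares the empirical error with the bound of Lemma 1 below.

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_k(x, k):
    """Rand-k as in Algorithm 4: each coordinate kept (scaled by d/k) with probability k/d."""
    d = x.size
    mask = rng.random(d) < k / d
    return np.where(mask, (d / k) * x, 0.0)

def randomized_quantize(v_spa, l):
    """QSGD-style randomized quantization with l levels; returns (norm, signs s, levels z)."""
    norm = np.linalg.norm(v_spa)
    if norm == 0.0:
        return 0.0, np.zeros_like(v_spa), np.zeros(v_spa.size, dtype=int)
    ratio = np.abs(v_spa) / norm * l                 # lies in [0, l]
    t = np.floor(ratio)
    z = (t + (rng.random(v_spa.size) < ratio - t)).astype(int)   # stochastic rounding
    return norm, np.sign(v_spa), z

def srqs(v, k, l):
    """Sparse Randomized Quantization Scheme: rand-k followed by randomized quantization."""
    v_spa = rand_k(v, k)
    return randomized_quantize(v_spa, l)             # decode as norm * s * z / l

# Monte Carlo check of the error bound in Lemma 1 with l = sqrt(2k)
d, k = 512, 32
l = int(np.sqrt(2 * k))
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
errs = []
for _ in range(2000):
    norm, s, z = srqs(v, k, l)
    errs.append(np.linalg.norm(v - norm * s * z / l) ** 2)
print(np.mean(errs), "<=", (d / k) * (1 + k / l ** 2) - 1)
```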

Now we give a lemma on the accuracy of SRQS.

Lemma 1.

The SRQS achieves compression error

{\mathbb{E}}\left\|v-\mathcal{Q}(v)\right\|^{2}\leq\left[\frac{d}{k}\left(1+\frac{k}{l^{2}}\right)-1\right]\left\|v\right\|^{2},

with the rand-$k$ compressor and Randomized Quantization with $l$ quantization levels.

Proof.

Note that our scheme incorporates two sources of randomness: the rand-$k$ sparsification and the unbiased Randomized Quantization. Let ${\mathbb{E}}_{Q}$ denote the expectation over the Randomized Quantization. The total compression error of the SRQS scheme can be expressed as follows:

{\mathbb{E}}\left\|v-\mathcal{Q}(v)\right\|^{2}={\mathbb{E}}\left\|v-v_{spa}+v_{spa}-q\right\|^{2}
={\mathbb{E}}\left(\left\|v-v_{spa}\right\|^{2}+2\left<v-v_{spa},v_{spa}-q\right>+\left\|v_{spa}-q\right\|^{2}\right)
={\mathbb{E}}\left(\left\|v-v_{spa}\right\|^{2}+2{\mathbb{E}}_{Q}\left<v-v_{spa},v_{spa}-q\right>+{\mathbb{E}}_{Q}\left\|v_{spa}-q\right\|^{2}\right)
={\mathbb{E}}\left(\left\|v-v_{spa}\right\|^{2}+{\mathbb{E}}_{Q}\left\|v_{spa}-q\right\|^{2}\right), \qquad (9)

where the last equality arises from the unbiasedness property ${\mathbb{E}}_{Q}(v_{spa}-q)=0$. The error in the second term of (9) is from the Randomized Quantization. According to Lemma 3.1 in QSGD [7], we obtain the quantization error as:

{\mathbb{E}}_{Q}\left\|v_{spa}-q\right\|^{2}\leq\min\left\{\frac{k}{l^{2}},\frac{\sqrt{k}}{l}\right\}\left\|v_{spa}\right\|^{2}.

Here we choose $l=\sqrt{2k}$, which ensures $\frac{k}{l^{2}}\leq 1$. Consequently, we can deduce that ${\mathbb{E}}_{Q}\left\|v_{spa}-q\right\|^{2}\leq\frac{k}{l^{2}}\left\|v_{spa}\right\|^{2}$.

Then, the total compression error can be expressed as follows:

{\mathbb{E}}\left\|v-\mathcal{Q}(v)\right\|^{2}\leq{\mathbb{E}}\left(\left\|v-v_{spa}\right\|^{2}+\frac{k}{l^{2}}\left\|v_{spa}\right\|^{2}\right)
={\mathbb{E}}\left(\left\|v\right\|^{2}-2\left<v,v_{spa}\right>+\left(1+\frac{k}{l^{2}}\right)\left\|v_{spa}\right\|^{2}\right)
=\left(1+\frac{k}{l^{2}}\right){\mathbb{E}}\left\|v_{spa}\right\|^{2}-\left\|v\right\|^{2},

where the last equality is from the unbiasedness of the rand-$k$ sparsification, ${\mathbb{E}}\,v_{spa}=v$.

Given that $v_{spa}$ is obtained through the rand-$k$ sparsification, we have:

{\mathbb{E}}\left\|v_{spa}\right\|^{2}=\sum_{j=1}^{d}\frac{k}{d}\left(\frac{d}{k}v_{j}\right)^{2}=\sum_{j=1}^{d}\frac{d}{k}v_{j}^{2}=\frac{d}{k}\left\|v\right\|^{2}.

Finally we can get the total compression error as:

{\mathbb{E}}\left\|v-\mathcal{Q}(v)\right\|^{2}\leq\left[\frac{d}{k}\left(1+\frac{k}{l^{2}}\right)-1\right]\left\|v\right\|^{2}. ∎

This lemma indicates that the compression parameter of our scheme is $\omega=\frac{d}{k}(1+\frac{k}{l^{2}})$.

To transmit the final compressed vector $q$, we encode and transmit the tuple $(\left\|v_{\text{spa}}\right\|,s,z)$. In our scheme, we employ Elias coding for this purpose, similar to QSGD. Elias coding is a method used to encode positive integers, such as the locations of the non-zero elements of $q$ and the integer vector $z$.

Now let us present the formal guarantee of SRQS.

Theorem 5.

When $\omega\geq 9$, $k=\frac{3d}{2\omega}$, and the quantization level is $l=\sqrt{2k}$, SRQS is an unbiased $\omega$-compressor with number of bits $b=O(\frac{d}{\omega}\log d\omega)$.

The proof is in Appendix B.

As can be seen, SRQS approaches the lower bound for unbiased compressors up to a logarithmic factor $\log d\omega$.

V Conclusion

In this paper we compile the number of necessary and sufficient bits required for the widely used biased $\delta$-compressors and unbiased $\omega$-compressors. For the biased $\delta$-compressor, we propose a random Gaussian codebook scheme that achieves the lower bound, and a Max Block Norm Quantization scheme that approaches the lower bound up to a logarithmic factor. For the unbiased compressor, we also show that an unbiased random Gaussian codebook scheme achieves the lower bound, and we further propose a practical Sparse Randomized Quantization Scheme that approaches the lower bound up to a logarithmic factor. In short, a simple combination of sparsification and quantization methods applied to distributed learning leads to near-optimal compression.

Acknowledgement: This work is supported in part by NSF awards 2133484, 2217058, and 2112665.

References

  • [1] Allen Gersho and Robert M Gray, Vector quantization and signal compression, vol. 159, Springer Science & Business Media, 2012.
  • [2] David Salomon, Giovanni Motta, and Giovanni Motta, Handbook of data compression, vol. 2, Springer, 2010.
  • [3] Khalid Sayood, Introduction to data compression, Morgan Kaufmann, 2017.
  • [4] Peter Kairouz and H Brendan McMahan, “Advances and open problems in federated learning,” Foundations and Trends® in Machine Learning, vol. 14, no. 1, pp. 1–210, 2021.
  • [5] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon, “Federated learning: Strategies for improving communication efficiency,” arXiv preprint arXiv:1610.05492, 2016.
  • [6] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics. PMLR, 2017, pp. 1273–1282.
  • [7] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.
  • [8] Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi, “Sparsified SGD with memory,” in Advances in Neural Information Processing Systems, 2018, pp. 4447–4458.
  • [9] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U Stich, and Martin Jaggi, “Error feedback fixes SignSGD and other gradient compression schemes,” in International Conference on Machine Learning, 2019, pp. 3252–3261.
  • [10] Avishek Ghosh, Raj Kumar Maity, Swanand Kadhe, Arya Mazumdar, and Kannan Ramchandran, “Communication-efficient and byzantine-robust distributed learning with error feedback,” IEEE Journal on Selected Areas in Information Theory, vol. 2, no. 3, pp. 942–953, 2021.
  • [11] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang, “Gradient sparsification for communication-efficient distributed optimization,” in Advances in Neural Information Processing Systems, 2018, pp. 1299–1309.
  • [12] Venkata Gandikota, Daniel Kane, Raj Kumar Maity, and Arya Mazumdar, “vqsgd: Vector quantized stochastic gradient descent,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2021, pp. 2197–2205.
  • [13] Toby Berger, “Rate-distortion theory,” Wiley Encyclopedia of Telecommunications, 2003.
  • [14] Gérard Cohen, Iiro Honkala, Simon Litsyn, and Antoine Lobstein, Covering codes, Elsevier, 1997.
  • [15] Wei-Ning Chen, Peter Kairouz, and Ayfer Ozgur, “Breaking the communication-privacy-accuracy trilemma,” Advances in Neural Information Processing Systems, vol. 33, pp. 3312–3324, 2020.
  • [16] Wei-Ning Chen, Christopher A Choquette Choo, Peter Kairouz, and Ananda Theertha Suresh, “The fundamental price of secure aggregation in differentially private federated learning,” in International Conference on Machine Learning. PMLR, 2022, pp. 3056–3089.
  • [17] Roman Vershynin, “Introduction to the non-asymptotic analysis of random matrices,” arXiv preprint arXiv:1011.3027, 2010.
  • [18] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li, “Terngrad: Ternary gradients to reduce communication in distributed deep learning,” in Advances in Neural Information Processing Systems, 2017, pp. 1509–1519.
  • [19] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang, “Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning,” in International Conference on Machine Learning, 2017, pp. 4035–4043.
  • [20] Jakub Konečnỳ and Peter Richtárik, “Randomized distributed mean estimation: Accuracy vs. communication,” Frontiers in Applied Mathematics and Statistics, vol. 4, no. 62, 2018.
  • [21] Prathamesh Mayekar and Himanshu Tyagi, “Ratq: A universal fixed-length quantizer for stochastic optimization,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 1399–1409.
  • [22] Prathamesh Mayekar and Himanshu Tyagi, “Limits on gradient compression for stochastic optimization,” in 2020 IEEE International Symposium on Information Theory (ISIT). IEEE, 2020, pp. 2658–2663.
  • [23] Ananda Theertha Suresh, X Yu Felix, Sanjiv Kumar, and H Brendan McMahan, “Distributed mean estimation with limited communication,” in International conference on machine learning. PMLR, 2017, pp. 3329–3337.
  • [24] Sebastian U Stich, “Local SGD converges fast and communicates little,” in International Conference on Learning Representations, 2019.
  • [25] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh, “Scaffold: Stochastic controlled averaging for federated learning,” in International Conference on Machine Learning. PMLR, 2020, pp. 5132–5143.
  • [26] Ohad Shamir, Nathan Srebro, and Tong Zhang, “Communication efficient distributed optimization using an approximate newton-type method,” CoRR, vol. abs/1312.7853, 2013.
  • [27] Shusen Wang, Farbod Roosta-Khorasani, Peng Xu, and Michael W. Mahoney, “Giant: Globally improved approximate newton method for distributed optimization,” 2017.
  • [28] Avishek Ghosh, Raj Kumar Maity, and Arya Mazumdar, “Distributed newton can communicate less and resist byzantine workers,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2020, NIPS’20, Curran Associates Inc.
  • [29] Avishek Ghosh, Raj Kumar Maity, Arya Mazumdar, and Kannan Ramchandran, “Escaping saddle points in distributed newton’s method with communication efficiency and byzantine resilience,” CoRR, vol. abs/2103.09424, 2021.

Appendix A Proof of Theorem 1

Proof.

We begin by considering a specific fixed unit vector $v$. Let $\gamma^{2}=1-\delta$. For the $i$-th column $a_{i}$ of $A$, the probability that the distance between $v$ and $\frac{1}{\sqrt{N}}a_{i}$ exceeds $\gamma$ can be expressed as follows:

\text{Pr}\left(\left\|v-\frac{1}{\sqrt{N}}a_{i}\right\|^{2}>\gamma^{2}\right)=\text{Pr}\left(1+\frac{1}{N}\|a_{i}\|^{2}-\frac{2}{\sqrt{N}}\left<v,a_{i}\right>>\gamma^{2}\right).

Since $\left\|v\right\|^{2}=1$ and $a_{i}$ is a vector with standard Gaussian elements, the inner product $\left<v,a_{i}\right>$ is a standard Gaussian variable. We denote it by $z=\left<v,a_{i}\right>$ with $z\sim\mathcal{N}(0,1)$. Thus, we can write the probability as:

\text{Pr}\left(1+\frac{1}{N}\|a_{i}\|^{2}-\frac{2}{\sqrt{N}}z>\gamma^{2}\right)=\text{Pr}\left(z<\frac{\sqrt{N}}{2}\left(1+\frac{1}{N}\|a_{i}\|^{2}-\gamma^{2}\right)\right).

Since $\|a_{i}\|^{2}$ follows a chi-squared distribution with $d$ degrees of freedom, we have the tail bound

\text{Pr}\left(\left|\frac{1}{d}\|a_{i}\|^{2}-1\right|\geq t\right)\leq 2\exp\left(-\frac{dt^{2}}{8}\right).

We define the event $A=\left\{z<\frac{\sqrt{N}}{2}\left(1+\frac{1}{N}\|a_{i}\|^{2}-\gamma^{2}\right)\right\}$, the event $B_{1}=\left\{\left|\frac{1}{d}\|a_{i}\|^{2}-1\right|\geq t\right\}$, and $B_{2}=\left\{\left|\frac{1}{d}\|a_{i}\|^{2}-1\right|<t\right\}$. Then we have

\text{Pr}(A)=\text{Pr}(A,B_{1})+\text{Pr}(A,B_{2})\leq\text{Pr}(B_{1})+\text{Pr}(A,B_{2}).

On the event $\{A,B_{2}\}$, we have $z<\frac{\sqrt{N}}{2}\left(1+\frac{d}{N}(1+t)-\gamma^{2}\right)$. Thus we know that

\text{Pr}(B_{1})\leq 2\exp\left(-\frac{dt^{2}}{8}\right),
\text{Pr}(A,B_{2})\leq\text{Pr}\left(z<\frac{\sqrt{N}}{2}\left(1+\frac{d}{N}(1+t)-\gamma^{2}\right)\right)=1-\text{Pr}\left(z\geq\frac{\sqrt{N}}{2}\left(1+\frac{d}{N}(1+t)-\gamma^{2}\right)\right).

Let $E=\frac{\sqrt{N}}{2}(1+\frac{d}{N}(1+t)-\gamma^{2})$. By choosing $N=\frac{d}{\delta}$ and recalling that $\gamma^{2}=1-\delta$, we obtain $E=(1+\frac{t}{2})\sqrt{d\delta}$.

For the standard Gaussian variable $z$, the tail bound is

\text{Pr}(z\geq x)\geq\frac{C}{x}e^{-x^{2}/2},\quad\text{when } x\geq 1,

where $C=\frac{1}{2\sqrt{2\pi}}$. We can see that $E>1$ for the chosen $N$, thus we have

\text{Pr}\left(z\geq E\right)\geq\frac{C}{E}\exp\left(-\frac{E^{2}}{2}\right).

Combining the above results, we have

\text{Pr}\left(\left\|v-\frac{1}{\sqrt{N}}a_{i}\right\|^{2}>\gamma^{2}\right)\leq 1-\frac{C}{E}\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right)+2\exp\left(-\frac{dt^{2}}{8}\right).

In our setting, $\delta$ is small and can approach $0$, thus we can choose a constant $t$ to satisfy $(2+t)^{2}\delta<t^{2}$. We can further require

2\exp\left(-\frac{dt^{2}}{8}\right)\leq\frac{C}{2E}\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right).

Taking logarithms on both sides, we get

4\log d\delta+8\log\frac{2+t}{2}+4\log 128\pi\leq d[t^{2}-(2+t)^{2}\delta]. \qquad (10)

The left-hand side of (10) is of order $O(\log d\delta)$, while the right-hand side is of order $d$. Hence, when we choose $t$ to satisfy $(2+t)^{2}\delta<t^{2}$ and $d$ is large, condition (10) easily holds.

Then we can obtain

\text{Pr}\left(\left\|v-\frac{1}{\sqrt{N}}a_{i}\right\|^{2}>\gamma^{2}\right)\leq 1-\frac{C}{2E}\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right).

Since there are $n$ i.i.d. random Gaussian vectors, we define the event that no Gaussian vector is close to the particular $v$:

\mathcal{E}_{v}=\left\{\forall i\in[n]:\quad\left\|v-\frac{1}{\sqrt{N}}a_{i}\right\|^{2}>\gamma^{2}\right\}.

The probability of the event $\mathcal{E}_{v}$ is

\text{Pr}[\mathcal{E}_{v}]\leq\left[1-\frac{C}{2E}\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right)\right]^{n}\leq\exp\left[-\frac{C}{2E}n\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right)\right],

where the second inequality is from the fact that $1-x\leq e^{-x}$.

To encompass all possible $v$ on the unit sphere, we employ an $\epsilon$-net to cover the unit sphere, where we set the error parameter $\epsilon$ equal to $\gamma$. Consequently, the size of the $\epsilon$-net is at most $\left(1+\frac{2}{\gamma}\right)^{d}$ [17]. Subsequently, we define the event that no Gaussian vector is close to any of the vectors in the $\epsilon$-net:

\mathcal{E}=\left\{\forall i\in[n],\forall v\in\epsilon\text{-net}:\quad\left\|v-\frac{1}{\sqrt{N}}a_{i}\right\|^{2}>\gamma^{2}\right\}.

By the union bound, the probability of the event $\mathcal{E}$ is

\text{Pr}[\mathcal{E}]\leq\left(1+\frac{2}{\gamma}\right)^{d}\exp\left[-\frac{C}{2E}n\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right)\right]\leq\exp\left(\frac{2d}{\gamma}\right)\exp\left[-\frac{C}{2E}n\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right)\right]=\exp\left[\frac{2d}{\gamma}-\frac{C}{2E}n\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right)\right],

where the second inequality is from the fact that $1+x\leq e^{x}$.

We want the exponent to be negative, so that $\text{Pr}[\mathcal{E}]$ approaches $0$. Hence, we need to ensure that the following condition holds:

\frac{2d}{\gamma}\leq\frac{C}{2E}n\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right).

Taking logarithms on both sides, we get

\log n\geq\frac{(2+t)^{2}d\delta}{8}+\log 4d-\log\gamma-\log\frac{C}{E}=\frac{(2+t)^{2}d\delta}{8}+\frac{1}{2}\log\left(\frac{32\pi d^{3}\delta(2+t)^{2}}{1-\delta}\right).

When $\log n\geq O(d\delta)$ holds, the probability $\text{Pr}[\mathcal{E}]$ approaches $0$. Then the communication cost of the random Gaussian codebook scheme is

b=\log n=O(d\delta).

The scheme is a $\delta$-compressor with probability at least

1-\text{Pr}[\mathcal{E}]=1-\exp\left[-\frac{n}{2\sqrt{2\pi}(2+t)\sqrt{d\delta}}\exp\left(-\frac{(2+t)^{2}d\delta}{8}\right)+\frac{2d}{\sqrt{1-\delta}}\right]. ∎

Appendix B Proof of Theorem 5

During the transmission of compressed vectors, we use Elias coding to encode positive integers. Before proving Theorem 5, we need a lemma from [7] that bounds the number of bits needed to represent a vector after Elias coding.

Lemma 2 ([7], Appendix, Lemma A.3).

Let $y\in\mathbb{N}^{d}$ be a vector whose elements $y_{i}$ are positive integers and whose $\ell_{p}$-norm satisfies $\left\|y\right\|^{p}_{p}\leq\rho$. Then we have

\sum_{i=1}^{d}|\text{Elias}(y_{i})|\leq\left(\frac{1+o(1)}{p}\log\frac{\rho}{d}+1\right)d, \qquad (11)

where $\text{Elias}(\cdot)$ is the Elias coding function applied to a positive integer.

Now we give the proof of Theorem 5.

Proof.

After rand-$k$ sparsification and Randomized Quantization, the compressed vector $q$ becomes very sparse. To transmit $q$, we first need to send the locations of the $\|q\|_{0}$ non-zero elements. Let $i_{1},i_{2},\dots,i_{\|q\|_{0}}$ represent the non-zero indices of $q$. We use Elias coding to encode the integer vector $[i_{1},i_{2}-i_{1},\dots,i_{\|q\|_{0}}-i_{\|q\|_{0}-1}]$. This integer vector has length $\|q\|_{0}$ and $\ell_{1}$-norm at most $d$. Thus from Lemma 2 the required number of bits after Elias coding is

b_{1}=\left((1+o(1))\log\frac{d}{\|q\|_{0}}+1\right)\|q\|_{0}.

Secondly, we need to transmit the sign vector $s$ and the integer vector $z$. For the sign vector $s$ with $\|q\|_{0}$ elements, we need $b_{2}=\|q\|_{0}$ bits. For the integer vector $z$, we also use Elias coding. The required number of bits after Elias coding is

b_{3}=\left(\frac{1+o(1)}{2}\log\frac{\left\|z\right\|_{2}^{2}}{\|q\|_{0}}+1\right)\|q\|_{0}.

From [7], we know that for Randomized Quantization, the expected number of non-zero elements is

{\mathbb{E}}\|q\|_{0}\leq l^{2}+\sqrt{\|v_{spa}\|_{0}},

and the squared norm of $z$ is

\|z\|^{2}_{2}\leq 2(l^{2}+\|v_{spa}\|_{0}).

Summing up the three parts $b_{1},b_{2},b_{3}$, we get the total number of bits

{\mathbb{E}}b=3{\mathbb{E}}\|q\|_{0}+{\mathbb{E}}(1+o(1))\|q\|_{0}\left(\log\frac{d}{\|q\|_{0}}+\frac{1}{2}\log\frac{2(l^{2}+\|v_{spa}\|_{0})}{\|q\|_{0}}\right).

Note that the function $x\log\frac{C}{x}$ increases until $x=\frac{C}{2}$ and then decreases. Also, the function $x\log\frac{C}{x}$ is concave, so that ${\mathbb{E}}\left[x\log\frac{C}{x}\right]\leq{\mathbb{E}}x\log\frac{C}{{\mathbb{E}}x}$. Assuming $l^{2}+\|v_{spa}\|_{0}\leq\frac{d}{2}$, it follows that $l^{2}+\sqrt{\|v_{spa}\|_{0}}\leq\frac{d}{2}$. Applying $x=l^{2}+\sqrt{\|v_{spa}\|_{0}}$ and $C=d$ in the function, we obtain

{\mathbb{E}}b\leq 3{\mathbb{E}}\|q\|_{0}+{\mathbb{E}}(1+o(1))\|q\|_{0}\left(\frac{3}{2}\log\frac{d}{\|q\|_{0}}\right)
\leq 3{\mathbb{E}}\left(l^{2}+\sqrt{\|v_{spa}\|_{0}}\right)+{\mathbb{E}}(1+o(1))\left(l^{2}+\sqrt{\|v_{spa}\|_{0}}\right)\left(\frac{3}{2}\log\frac{d}{l^{2}+\sqrt{\|v_{spa}\|_{0}}}\right),

where the first inequality is from $l^{2}+\|v_{spa}\|_{0}\leq\frac{d}{2}$, and the second inequality is from the property of the function $x\log\frac{C}{x}$.

From the rand-$k$ sparsification, we know that ${\mathbb{E}}(l^{2}+\sqrt{\|v_{spa}\|_{0}})\leq l^{2}+\sqrt{k}$. Assume $l^{2}+k\leq\frac{d}{2}$. Applying the property of the function $x\log\frac{C}{x}$ again, with $x=l^{2}+\sqrt{\|v_{spa}\|_{0}}$ and $C=d$, we have

{\mathbb{E}}b\leq 3(l^{2}+\sqrt{k})+\frac{3}{2}(1+o(1))(l^{2}+\sqrt{k})\log\frac{d}{l^{2}+\sqrt{k}}=(l^{2}+\sqrt{k})\left[3+\frac{3}{2}(1+o(1))\log\frac{d}{l^{2}+\sqrt{k}}\right].

From Lemma 1, our scheme has parameter $\omega=\frac{d}{k}(1+\frac{k}{l^{2}})$. Here we choose $l=\sqrt{2k}$; then we have $\omega=\frac{3}{2}\frac{d}{k}$.

Finally, the communication cost of SRQS is

{\mathbb{E}}b\leq(2k+\sqrt{k})\left[3+\frac{3}{2}(1+o(1))\log\frac{d}{2k+\sqrt{k}}\right].

The dominating term is $O(k\log\frac{d}{\sqrt{k}})$, i.e., $O(\frac{d}{\omega}\log d\omega)$.

To satisfy the condition $l^{2}+\|v_{spa}\|_{0}\leq l^{2}+k\leq\frac{d}{2}$, we require

l^{2}+k=3k\leq\frac{d}{2}.

Thus $k\leq\frac{d}{6}$, and it follows that $\omega\geq 9$.

Therefore, when $\omega\geq 9$ and we choose $k=\frac{3d}{2\omega}$ and $l=\sqrt{2k}$, we achieve an unbiased $\omega$-compressor with communication cost $O(\frac{d}{\omega}\log d\omega)$ bits. ∎