
Polar Coded Computing:
The Role of the Scaling Exponent

Dorsa Fathollahi (Stanford University, Stanford, CA 94305, USA, [email protected]) and Marco Mondelli (Institute of Science and Technology (IST) Austria, Klosterneuburg, Austria, [email protected])
Abstract

We consider the problem of coded distributed computing using polar codes. Recent work by Soleymani, Jamali, and Mahdavifar relates the average execution time of a coded computing system to the error probability for transmission over the binary erasure channel, and investigates the performance of binary linear codes in this setting. In this paper, we focus on polar codes and unveil a connection between the average execution time and the scaling exponent $\mu$ of the family of codes. The scaling exponent has emerged as a central object in the finite-length characterization of polar codes, and it captures the speed of convergence to capacity. In particular, we show that (i) the gap between the normalized average execution time of polar codes and that of optimal MDS codes is $O(n^{-1/\mu})$, and (ii) this upper bound can be improved to roughly $O(n^{-1/2})$ by considering polar codes with large kernels. We conjecture that these bounds could be improved to $O(n^{-2/\mu})$ and $O(n^{-1})$, respectively, and provide a heuristic argument as well as numerical evidence supporting this view.

I Introduction

In recent years, there has been a high demand for large-scale data processing. This has motivated a paradigm shift towards distributed computing, where a large-scale computation is spread over $n$ workers. A central issue is that some workers may finish later than others or crash, and a common solution is to add redundancy: coded distributed computing breaks a given large computational job into $k$ smaller tasks and adds $n-k$ redundant tasks, so that the computation is performed over $n$ workers. As a result, the final output is available even if we receive only part of the outputs.

Applications of coding theory to distributed computing have been studied extensively, see e.g. [1, 2, 3, 4, 5, 6, 7, 8] and references therein. Product codes are applied to distributed matrix multiplication in [1]. In [2], maximum distance separable (MDS) codes are used to speed up matrix multiplication, and polynomial codes are employed in [3] for matrix multiplication over a large finite field. Luby Transform (LT) codes are proposed in [4, 5] for coded computation, and the application of polar codes to serverless computing is considered in [6, 7]. In [8], the average execution time of a coded computing system is related to the error probability of the linear code used to add redundancy. Although MDS codes are optimal in terms of average delay, they suffer from numerical instability and high decoding complexity; hence, binary linear codes are studied in [8]: the authors characterize the performance gap between MDS and random binary linear codes and provide a numerical analysis of systems based on polar and Reed-Muller codes.

In this paper, we analyze the performance of polar codes for coded distributed computing. Polar codes provably achieve capacity for any binary memoryless symmetric (BMS) channel with quasi-linear encoding and decoding complexity [9]. The code construction can be performed with linear complexity [10, 11] and, by exploiting a partial order between the synthetic channels, the construction complexity becomes sub-linear [12]. The error probability under successive cancellation (SC) decoding scales as $2^{-\sqrt{n}}$ [13], $n$ being the block length, and polar codes are not affected by error floors [14]. Because of their attractive properties, polar codes have been recently adopted for the 5G wireless communications standard [15].

A central object of interest in the analysis of polar codes has been the speed of convergence of the rate to capacity [16, 17, 18, 19, 14, 20, 21]. This line of work shows that the gap to capacity scales with the block length as $n^{-1/\mu}$. The parameter $\mu$ is called the scaling exponent; it depends on the transmission channel and, for the special case of the binary erasure channel (BEC), $\mu\approx 3.63$ [16, 14]. Furthermore, by using large polarization kernels, it is possible to approach the optimal scaling exponent $\mu=2$ [20, 21].

The goal of this paper is to characterize the average execution time of a coded computing system using polar codes via their scaling exponent. In particular, our main contributions can be summarized as follows:

  • In Section III, we prove an upper bound on the average execution time of polar codes (Theorem 1) and, as a consequence, we show that the normalized gap between polar codes and the optimal MDS codes is $O(n^{-1/\mu})$, $\mu$ being the scaling exponent (Corollary 1).

  • In Section IV, we show that the normalized gap can be improved to roughly $O(n^{-1/2})$ by considering polar codes with large kernels (Theorem 2).

  • In Section V, we consider codes of rate $R=1/2$ and provide a heuristic argument suggesting that, for polar codes with scaling exponent $\mu$, the normalized gap should scale as $n^{-2/\mu}$. The argument is also validated by numerical evidence. Hence, as polar codes with large kernels approach the optimal $\mu=2$ [20, 21], this would lead to a gap scaling as $n^{-1}$, which matches the performance of random binary linear codes.

II Preliminaries

II-A Distributed Computing System Model

We consider a distributed computing model with $n$ workers having the same computational capabilities, as in [8]. For the $i$-th worker, the run-time for performing a task is represented by a shifted exponential random variable $T_{i}$ [2, 22, 23]. Thus, dividing the computation equally among $k$ workers results in the following cumulative distribution function:

$$\mathbb{P}(T_{i}\leq t)=1-\exp(-\mu_{\rm s}(kt-1)),\quad\forall\,t\geq 1/k, \tag{1}$$

where $\mu_{\rm s}$ is the straggling parameter of each worker. Hence, the probability that the task is not completed by time $t\geq 1/k$ is given by

$$\epsilon(t)=\mathbb{P}(T_{i}\geq t)=\exp(-\mu_{\rm s}(kt-1)). \tag{2}$$

We can now relate the problem of computing a task divided into $k$ parts over $n$ servers to the problem of transmitting $k$ symbols via an $(n,k)$ linear code over $n$ i.i.d. BECs with erasure probability $\epsilon(t)$. Let $T$ be the (random) execution time, defined as the time at which a decodable set of tasks is obtained from the workers. Then, the error probability of the $(n,k)$ code is related to the average execution time $\mathbb{E}[T]$ via the following lemma.

Lemma 1 (Average execution time and error probability, Lemma 1 in [8])

The average execution time $T_{\rm avg}$ of a coded distributed computing system using an $(n,k)$ linear code is given by

$$T_{\rm avg}\triangleq\mathbb{E}[T]=\int_{0}^{\infty}P_{e}(\epsilon(\tau),n)\,{\rm d}\tau=\frac{1}{k}+\frac{1}{\mu_{\rm s}k}\int_{0}^{1}\frac{P_{e}(\epsilon,n)}{\epsilon}\,{\rm d}\epsilon, \tag{3}$$

where $\epsilon(\tau)$ is defined in (2) and $P_{e}(\epsilon(\tau),n)$ denotes the error probability of the $(n,k)$ code for transmission over a BEC$(\epsilon(\tau))$.
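
As a sanity check on Lemma 1, the following Python sketch compares a Monte Carlo estimate of $\mathbb{E}[T]$ for an MDS code, whose execution time is the $k$-th order statistic of the worker run-times, against the integral formula (3); here we use the standard fact that an $(n,k)$ MDS code over a BEC fails exactly when more than $n-k$ symbols are erased, and the parameter values ($n=20$, $k=10$, $\mu_{\rm s}=1$) are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import binom

rng = np.random.default_rng(0)
n, k, mu_s = 20, 10, 1.0                # workers, tasks, straggling parameter

# Left-hand side of (3): Monte Carlo over shifted-exponential run-times (1).
# An (n, k) MDS code is decodable once any k workers finish, so the
# execution time T is the k-th order statistic of the T_i.
T = (1.0 + rng.exponential(1.0 / mu_s, size=(200_000, n))) / k
T_mc = np.sort(T, axis=1)[:, k - 1].mean()

def P_e(eps):
    # MDS decoding over BEC(eps) fails iff more than n - k symbols are erased.
    return binom.sf(n - k, n, eps) / eps if eps > 0 else 0.0

integral, _ = quad(P_e, 0.0, 1.0, limit=200)
T_formula = 1.0 / k + integral / (mu_s * k)

print(f"Monte Carlo E[T]: {T_mc:.5f}")
print(f"Lemma 1, eq. (3): {T_formula:.5f}")
```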

II-B Channel Polarization

Channel polarization is based on mapping two identical copies of the transmission channel $W$ into a pair of "synthetic" channels $W^{+}$ and $W^{-}$, such that $W^{+}$ is strictly better and $W^{-}$ is strictly worse than $W$. Here, we will focus on the special case in which $W$ is a BEC, since this is the relevant setting for coded computing. When $W$ is a BEC$(\epsilon)$, we have that $W^{+}$ is a BEC$(\epsilon^{2})$ and $W^{-}$ is a BEC$(2\epsilon-\epsilon^{2})$. By repeating this operation $m$ times, we map $n\triangleq 2^{m}$ identical copies of $W$ into the synthetic channels $W_{m}^{(i)}$, $i\in\{1,\ldots,n\}$. Consider a sequence of channels $W_{m}$ obtained by picking uniformly at random one of the synthetic channels $W_{m}^{(i)}$, and let $Z_{m}(W)$ be the random process that tracks the erasure probability of $W_{m}$. Then,

$$Z_{m}=\begin{cases}Z_{m-1}^{2},&\text{w.p. }1/2,\\ 2Z_{m-1}-Z_{m-1}^{2},&\text{w.p. }1/2,\end{cases} \tag{4}$$

with initialization $Z_{0}=\epsilon$, $\epsilon$ being the erasure probability of the original channel $W$. The synthetic channels $W_{m}^{(i)}$ are all BECs and they polarize in the sense that, as $m$ grows, their erasure probabilities are either close to $0$ (noiseless channels) or close to $1$ (noisy channels). Formally, as $m\to\infty$, $Z_{m}$ converges almost surely to a random variable $Z_{\infty}$ such that $\mathbb{P}(Z_{\infty}=0)=1-\epsilon$ and $\mathbb{P}(Z_{\infty}=1)=\epsilon$.

Now, the idea is to use the noiseless channels to transmit the information, while the noisy channels contain a fixed sequence shared between encoder and decoder (e.g., the all-zero sequence). In particular, in order to construct a polar code of block length $n$ and rate $R$ for transmission over $W=$ BEC$(\epsilon)$, we apply $m=\log_{2}n$ steps of polarization to $W$ and select the $nR$ synthetic channels $W_{m}^{(i)}$ with the smallest erasure probabilities $Z_{m}^{(i)}$. We denote by $\mathcal{I}\subset\{1,\ldots,n\}$ the corresponding set of indices ($|\mathcal{I}|=nR$). Then, the error probability $P_{e}(\epsilon,n)$ of the polar code can be bounded as (see e.g. Section III of [24])

$$\max_{i\in\mathcal{I}}Z_{m}^{(i)}\leq P_{e}(\epsilon,n)\leq\sum_{i\in\mathcal{I}}Z_{m}^{(i)}. \tag{5}$$

In [9], it is shown that, as $n\to\infty$, the upper bound in (5) vanishes for any $R<I(W)=1-\epsilon$, thus implying that polar codes achieve capacity.
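
For concreteness, here is a short Python sketch of this construction: it evaluates the erasure probabilities $Z_{m}^{(i)}$ via the recursion (4), selects the $nR$ most reliable synthetic channels, and prints the two bounds in (5). The parameters ($m=10$, $R=1/2$, $\epsilon=0.3$) are illustrative choices.

```python
import numpy as np

def synthetic_erasure_probs(eps, m):
    """Erasure probabilities Z_m^{(i)} of the n = 2^m synthetic BECs,
    via the recursion (4): z -> z^2 ("plus") and z -> 2z - z^2 ("minus")."""
    Z = np.array([eps])
    for _ in range(m):
        Z = np.concatenate([2 * Z - Z**2, Z**2])
    return Z

m, R, eps = 10, 0.5, 0.3               # n = 1024, rate 1/2, BEC(0.3)
n = 2**m
Z = synthetic_erasure_probs(eps, m)

# Construction: put information on the n*R most reliable synthetic channels.
info_set = np.argsort(Z)[: int(n * R)]

# Bounds (5) on the SC error probability.
print(f"lower bound (max): {Z[info_set].max():.3e}")
print(f"upper bound (sum): {Z[info_set].sum():.3e}")
```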

II-C Scaling Exponent

The scaling exponent captures the speed of convergence of the rate to the channel capacity by quantifying how the gap to capacity scales as a function of the block length.

Definition 1 (Upper bound on scaling exponent of BEC)

We say that $\mu$ is an upper bound on the scaling exponent of the BEC if there exists a function $h:[0,1]\rightarrow[0,1]$ such that $h(0)=h(1)=0$, $h(x)>0$ for any $x\in(0,1)$, and

$$\sup_{x\in(0,1)}\frac{h(x^{2})+h(2x-x^{2})}{2h(x)}<2^{-1/\mu}.$$

Using the definition above, in [14] it is proved that, if $\mu$ is an upper bound on the scaling exponent, then the gap to capacity $I(W)-R$ scales with the block length as $n^{-1/\mu}$. Furthermore, by constructing a suitable $h(x)$, in [14] it is also shown that $\mu=3.639$ is a valid upper bound for the BEC.
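
To illustrate Definition 1 numerically, the sketch below evaluates the supremum on a fine grid for the simple (and far from optimal) candidate $h(x)=\sqrt{x(1-x)}$, which certifies an upper bound of roughly $\mu\approx 4.8$; the function constructed in [14] is tuned to reach $3.639$. The candidate $h$ is our own illustrative choice, not the one from [14].

```python
import numpy as np

# A simple, non-optimized candidate for Definition 1.
h = lambda x: np.sqrt(x * (1 - x))

x = np.linspace(1e-6, 1 - 1e-6, 1_000_000)
ratio = (h(x**2) + h(2 * x - x**2)) / (2 * h(x))
sup = ratio.max()

# If sup < 2^(-1/mu), then mu upper-bounds the scaling exponent.
print(f"sup  = {sup:.4f}")
print(f"mu  <= {-1 / np.log2(sup):.3f}")   # roughly 4.8 for this crude h
```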

II-D Polar Codes with Large Kernels

Conventional polar codes are obtained from the kernel $K=\begin{bmatrix}1&0\\ 1&1\end{bmatrix}$, in the sense that their generator matrix is given by the rows of $K^{\otimes m}$ (the $m$-th Kronecker power of $K$) corresponding to the most reliable synthetic channels. In [25], it was shown that capacity-achieving polar codes can be constructed from any kernel $K$ given by an $\ell\times\ell$ non-singular binary matrix that cannot be transformed into an upper triangular matrix under any column permutation. In particular, given an $\ell\times\ell$ kernel $K$, we map $n\triangleq\ell^{m}$ identical copies of the transmission channel $W$ into the synthetic channels $W_{m}^{(i)}$, $i\in\{1,\ldots,n\}$. If $W$ is a BEC, then the channels $\{W_{m}^{(i)}\}_{i=1}^{n}$ are all BECs. Their erasure probabilities are denoted by $\{Z_{m}^{(i)}\}_{i=1}^{n}$ and they are tracked by the random process $Z_{m}(W)$, which is defined as

$$Z_{m}=f_{K,B_{m-1}}(Z_{m-1}), \tag{6}$$

where $B_{m-1}$ is drawn uniformly from $\{1,\ldots,\ell\}$ and the set of polynomials $\{f_{K,i}\}_{i=1}^{\ell}$ characterizes the polarization behavior of the kernel $K$ (for details and a formal definition, see e.g. (57) in [20]). The error probability $P_{e}(\epsilon,n)$ of the resulting polar code can again be bounded as in (5). By carefully analyzing the behavior of the polynomials $\{f_{K,i}\}_{i=1}^{\ell}$ when $K$ is a random non-singular binary matrix, in [20] it is shown that the gap to capacity of the polar code obtained from $K$ approaches the optimal scaling $n^{-1/2}$ as $\ell\to\infty$. These results were then generalized to transmission over any BMS channel in [21].

III Average Execution Time for Polar Codes

Our characterization of the average execution time via the scaling exponent exploits an intermediate result from [14]: we count the fraction of synthetic channels $W_{m}^{(i)}$ whose erasure probability is polynomially small in $n=2^{m}$, and we show that this fraction is at least the channel capacity $1-\epsilon$ minus a term scaling as $n^{-1/\mu}=2^{-m/\mu}$, where $\mu$ is the scaling exponent.

Lemma 2 (Bound on $Z_{m}$)

Let $\mu\geq 2$ be an upper bound on the scaling exponent of the BEC in the sense of Definition 1, and let $Z_{m}$ be the random process tracking the erasure probability of $W_{m}$, where $W_{0}$ is a BEC$(\epsilon)$. Then,

$$\mathbb{P}\left(Z_{m}\leq 2^{-10m}\right)\geq 1-\epsilon-c_{1}2^{-m/\mu}, \tag{7}$$

where $c_{1}$ is a numerical constant independent of $m$ and $\epsilon$.

The proof of Lemma 2 follows by setting $\nu=9$ in (35) of [14], where $\rho$ is given by (33) and $\alpha$ is a factor $10$ smaller than in (31). At this point, we are ready to state and prove our main result on the average execution time.
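
The statement (7) can also be checked empirically by exact enumeration of all $2^{m}$ synthetic erasure probabilities: the fraction of channels with $Z_{m}^{(i)}\leq 2^{-10m}$ should approach the capacity $1-\epsilon$ at the polynomial rate $2^{-m/\mu}$. A minimal sketch, with the illustrative choice $\epsilon=0.3$:

```python
import numpy as np

def synthetic_erasure_probs(eps, m):
    # Recursion (4): z -> 2z - z^2 and z -> z^2.
    Z = np.array([eps])
    for _ in range(m):
        Z = np.concatenate([2 * Z - Z**2, Z**2])
    return Z

eps = 0.3
for m in [6, 9, 12, 15]:
    Z = synthetic_erasure_probs(eps, m)
    frac = np.mean(Z <= 2.0 ** (-10 * m))
    # Lemma 2: frac >= 1 - eps - c1 * 2^(-m/mu), with mu = 3.639 for the BEC.
    print(f"m = {m:2d}: fraction with Z <= 2^(-10m) = {frac:.4f} "
          f"(capacity 1 - eps = {1 - eps:.2f})")
```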

Theorem 1 ($T_{\rm avg}$ for polar codes)

Let $\mu\geq 2$ be an upper bound on the scaling exponent of the BEC in the sense of Definition 1. Fix $R\in(0,1)$ and a sufficiently large $n$, and define

$$\epsilon^{*}=1-R-c_{1}\,n^{-1/\mu}, \tag{8}$$

where $c_{1}$ is the constant in (7). Let $T_{\rm avg}^{\rm polar}(n,R)$ be the average execution time of a coded distributed computing system using a polar code of block length $n$ and rate $R=k/n$, which is constructed for transmission over a BEC$(\epsilon^{*})$. Then,

$$T_{\rm avg}^{\rm polar}(n,R)\leq\frac{1}{nR}+\frac{1}{\mu_{\rm s}nR}\left(-\ln\left(1-R\right)+c\,n^{-1/\mu}\right), \tag{9}$$

where $c$ is a numerical constant independent of $n$.

Proof:

An application of Lemma 1 gives that

$$T_{\rm avg}^{\rm polar}(n,R)=\frac{1}{nR}+\frac{1}{\mu_{\rm s}nR}\int_{0}^{1}\frac{P_{e}(\epsilon,n)}{\epsilon}\,{\rm d}\epsilon. \tag{10}$$

To upper bound $\int_{0}^{1}\frac{P_{e}(\epsilon,n)}{\epsilon}\,{\rm d}\epsilon$, we divide the integration domain into three intervals and bound each piece individually.

First, pick $\epsilon\in[0,1/n^{3}]$ and note that $P_{e}(\epsilon,n)$ is trivially upper bounded by the probability that there is at least $1$ erasure, which is equal to $g(\epsilon)\triangleq 1-(1-\epsilon)^{n}$. Note that $g(0)=0$ and, for all $\epsilon\in[0,1]$, $g^{\prime}(\epsilon)=n(1-\epsilon)^{n-1}\leq n$. Thus, $g(\epsilon)\leq n\epsilon$ and, consequently, $P_{e}(\epsilon,n)\leq n\epsilon$. Hence,

$$\int_{0}^{1/n^{3}}\frac{P_{e}(\epsilon,n)}{\epsilon}\,{\rm d}\epsilon\leq\frac{1}{n^{2}}. \tag{11}$$

Next, let ϵ[1/n3,ϵ]\epsilon\in[1/n^{3},\epsilon^{*}], where ϵ\epsilon^{*} is given by (8). As the family of BECs is ordered by degradation [26], we have that

$$P_{e}(\epsilon,n)\leq P_{e}(\epsilon^{*},n). \tag{12}$$

Recall that the polar code is obtained by polarizing a BEC$(\epsilon^{*})$, and let $\mathcal{I}\subset\{1,\ldots,n\}$ denote the corresponding set of $nR$ indices containing the information bits. By applying Lemma 2 with initial condition $W_{0}=$ BEC$(\epsilon^{*})$, we have that the fraction of synthetic channels whose erasure probability is at most $2^{-10m}=n^{-10}$ is lower bounded by

$$1-\epsilon^{*}-c_{1}2^{-m/\mu}=R, \tag{13}$$

where we have used the definition (8) of $\epsilon^{*}$. Therefore, from the upper bound in (5), we obtain that

$$P_{e}(\epsilon^{*},n)\leq n^{-9}, \tag{14}$$

where we have also used that $|\mathcal{I}|=nR\leq n$. Hence,

$$\int_{1/n^{3}}^{\epsilon^{*}}\frac{P_{e}(\epsilon,n)}{\epsilon}\,{\rm d}\epsilon\leq\int_{1/n^{3}}^{\epsilon^{*}}\frac{1}{n^{9}\epsilon}\,{\rm d}\epsilon\leq\int_{1/n^{3}}^{\epsilon^{*}}\frac{1}{n^{6}}\,{\rm d}\epsilon\leq\frac{1}{n^{6}}. \tag{15}$$

Here, in the first inequality, we combine (12) and (14); and in the second inequality, we use that $\epsilon\geq 1/n^{3}$.

Finally, let ϵ[ϵ,1]\epsilon\in[\epsilon^{*},1]. In this interval, we use the trivial upper bound Pe(ϵ,n)1P_{e}(\epsilon,n)\leq 1. Hence,

$$\int_{\epsilon^{*}}^{1}\frac{P_{e}(\epsilon,n)}{\epsilon}\,{\rm d}\epsilon\leq\int_{\epsilon^{*}}^{1}\frac{1}{\epsilon}\,{\rm d}\epsilon=-\ln(\epsilon^{*}). \tag{16}$$

By combining (8), (11), (15) and (16), we have that

$$\begin{aligned}\int_{0}^{1}\frac{P_{e}(\epsilon,n)}{\epsilon}\,{\rm d}\epsilon&\leq\frac{1}{n^{2}}+\frac{1}{n^{6}}-\ln(1-R-c_{1}n^{-1/\mu})\\ &\leq\frac{1}{n^{2}}+\frac{1}{n^{6}}-\ln(1-R)+\frac{2c_{1}}{1-R}\,n^{-1/\mu}\\ &\leq-\ln(1-R)+c\,n^{-1/\mu}.\end{aligned} \tag{17}$$

Here, the second inequality holds as $-\ln(1-x)\leq 2x$ for $x\in[0,1/2]$ and $\frac{c_{1}}{1-R}n^{-1/\mu}\in[0,1/2]$ for sufficiently large $n$; and, to obtain the third inequality, we use that $\mu\geq 2$ and pick $c=\frac{2c_{1}}{1-R}+2$. By combining (10) and (17), the desired result readily follows. ∎

Armed with this result, we can bound the gap between polar codes and MDS codes, which achieve the minimum average execution time (cf. Corollary 6 in [8]).

Corollary 1 (Gap to optimum for polar codes)

Fix $R\in(0,1)$ and a sufficiently large $n$. Let $T_{\rm avg}^{\rm MDS}(n,R)$ be the average execution time of a coded distributed computing system using an MDS code of block length $n$ and rate $R=k/n$. Let $T_{\rm avg}^{\rm polar}(n,R)$ be defined as in Theorem 1. Then,

$$nT_{\rm avg}^{\rm polar}(n,R)-nT_{\rm avg}^{\rm MDS}(n,R)\leq c^{\prime}\,n^{-1/\mu}, \tag{18}$$

where $c^{\prime}$ is a numerical constant independent of $n$.

Proof:

Recall from (15) in [8] that

$$T_{\rm avg}^{\rm MDS}(n,R)=\frac{1}{nR}+\frac{1}{\mu_{\rm s}nR}\sum_{i=n(1-R)+1}^{n}\frac{1}{i}. \tag{19}$$

We lower bound the harmonic sum on the RHS of (19) as

$$\begin{aligned}\sum_{i=n(1-R)+1}^{n}\frac{1}{i}&\geq\int_{n(1-R)+1}^{n}\frac{1}{t}\,{\rm d}t=\ln(n)-\ln(n(1-R)+1)\\ &=-\ln(1-R+1/n)\geq-\ln(1-R)-\frac{1}{n(1-R)},\end{aligned} \tag{20}$$

where the first inequality uses that $f(t)=1/t$ is decreasing, and the last inequality uses that $\ln(1+t)\leq t$ for all $t\geq 0$. By combining (19) and (20), we deduce that

$$nT_{\rm avg}^{\rm MDS}(n,R)\geq\frac{1}{R}-\frac{1}{\mu_{\rm s}R}\ln(1-R)-\frac{1}{\mu_{\rm s}R(1-R)}\cdot\frac{1}{n}. \tag{21}$$

Recall that $\mu\geq 2$ and pick $c^{\prime}=c(\mu_{\rm s}R)^{-1}+(\mu_{\rm s}R(1-R))^{-1}$, where $c$ is the numerical constant in (9). Thus, the claim follows by combining (21) with the result of Theorem 1. ∎
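
As a numerical illustration of (19) and (21), the sketch below evaluates $nT_{\rm avg}^{\rm MDS}(n,R)$ exactly and shows its $O(1/n)$ convergence to the limit $\frac{1}{R}-\frac{1}{\mu_{\rm s}R}\ln(1-R)$; the choices $R=1/2$ and $\mu_{\rm s}=1$ are illustrative.

```python
import numpy as np

R, mu_s = 0.5, 1.0
limit = 1 / R - np.log(1 - R) / (mu_s * R)   # large-n value of n * T_avg^MDS

for n in [2**j for j in range(4, 15, 2)]:
    k = int(n * R)
    harmonic = sum(1.0 / i for i in range(n - k + 1, n + 1))   # sum in (19)
    nT = 1.0 / R + harmonic / (mu_s * R)
    print(f"n = {n:6d}   n*T_avg^MDS = {nT:.6f}   gap to limit = {nT - limit:.2e}")
```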

We remark that random binary linear codes achieve a smaller gap with respect to the optimal MDS codes. In fact, $nT_{\rm avg}^{\rm BRC}(n,R)-nT_{\rm avg}^{\rm MDS}(n,R)$ is $O\left(\frac{\log_{2}n}{n}\right)$, where $T_{\rm avg}^{\rm BRC}(n,R)$ denotes the average execution time for a random binary linear code of block length $n$ and rate $R$ (cf. Corollary 9 in [8]). Thus, a natural question is whether it is possible to reduce the gap between polar and MDS codes. We answer this question affirmatively in the next section, where we consider polar codes obtained from large random kernels.

IV Extension to Polar Codes with Large Kernels

First, we consider the fraction of synthetic channels whose erasure probability is polynomially small in $n=\ell^{m}$, and we show that it is lower bounded by the channel capacity $1-\epsilon$ minus a term scaling roughly as $\ell^{-m/2}=n^{-1/2}$. This result is stated below, and it is a (slight) improvement over Theorem 3 in [20]. Its proof is deferred to the appendix.

Lemma 3 (Bounds on $Z_{m}$ for large kernels)

Let $K$ be a kernel selected uniformly at random from all $\ell\times\ell$ non-singular binary matrices, and let $Z_{m}$ be the random process defined in (6) with initial condition $Z_{0}=\epsilon$. Fix a small constant $\delta>0$. Then, there exists $\ell_{0}(\delta)$ such that, for any $\ell>\ell_{0}(\delta)$ and for all $m\geq 1$, the following holds with high probability over the choice of $K$:

$$\mathbb{P}\left(Z_{m}\leq\ell^{-10m}\right)\geq 1-\epsilon-c_{2}\,\ell^{-m/(2+\delta)}, \tag{22}$$

where $c_{2}$ is a numerical constant independent of $m$, $\ell$, $\delta$, and $\epsilon$.

At this point, we are ready to state our results for polar codes based on large kernels.

Theorem 2 ($T_{\rm avg}$ and gap to optimum for polar codes with large kernels)

Fix $R\in(0,1)$, a small constant $\delta>0$, a sufficiently large $\ell$, and a sufficiently large $n$. Define

$$\epsilon^{*}=1-R-c_{2}\,n^{-1/(2+\delta)}, \tag{23}$$

where $c_{2}$ is the constant in (22). Let $K$ be a kernel selected uniformly at random from all $\ell\times\ell$ non-singular binary matrices, let $\mathcal{C}_{K}(n,R,\delta)$ be the polar code of block length $n=\ell^{m}$ and rate $R$ obtained from the kernel $K$ and constructed for transmission over a BEC$(\epsilon^{*})$, and let $T_{\rm avg}^{K}(n,R,\delta)$ be the average execution time of a coded distributed computing system using $\mathcal{C}_{K}(n,R,\delta)$. Then, the following holds with high probability over the choice of $K$:

$$T_{\rm avg}^{K}(n,R,\delta)\leq\frac{1}{nR}+\frac{1}{\mu_{\rm s}nR}\left(-\ln\left(1-R\right)+\tilde{c}\,n^{-1/(2+\delta)}\right), \tag{24}$$

where $\tilde{c}$ is a numerical constant independent of $n$. Furthermore, let $T_{\rm avg}^{\rm MDS}(n,R)$ be the average execution time of a coded distributed computing system using an MDS code of block length $n$ and rate $R$. Then,

$$nT_{\rm avg}^{K}(n,R,\delta)-nT_{\rm avg}^{\rm MDS}(n,R)\leq\tilde{c}^{\prime}\,n^{-1/(2+\delta)}, \tag{25}$$

where $\tilde{c}^{\prime}$ is a numerical constant independent of $n$.

The proof of (24) is analogous to that of Theorem 1, the only difference being that the application of Lemma 2 is replaced by the application of Lemma 3. The proof of (25) follows from (24) and from the lower bound on $nT_{\rm avg}^{\rm MDS}(n,R)$ in (21).

The upper bound in (25) roughly scales as $n^{-1/2}$, hence the result of Theorem 2 still does not match the performance of random binary linear codes. We conjecture that this is due not to a sub-optimality of polar codes with large kernels, but rather to a sub-optimality of our bounds. In fact, in the following section, we give a simple heuristic argument suggesting that, for a family of polar codes of rate $R=1/2$ and scaling exponent $\mu$, the gap to optimum $nT_{\rm avg}^{\rm polar}(n,R)-nT_{\rm avg}^{\rm MDS}(n,R)$ is $O(n^{-2/\mu})$ (as opposed to the $O(n^{-1/\mu})$ of Corollary 1). Since polar codes with large kernels approach $\mu=2$, this improved bound would match the gap to optimum of random binary linear codes.

V Improving the Bounds: A Heuristic Argument and Numerical Experiments

We focus on the case $R=1/2$. Our heuristic argument relies on the assumption that the error probability $P_{e}(\epsilon,n)$ behaves as the lower bound $\max_{i\in\mathcal{I}}Z_{m}^{(i)}$ in (5). We also assume the existence of a scaling law such that

$$\lim_{n\to\infty\,:\,n^{1/\mu}(1/2-\epsilon)=z}P_{e}(\epsilon,n)=f(z). \tag{26}$$

Here, $\mu$ represents the scaling exponent and $f$ is typically referred to as the mother curve. The idea is that, as the block length diverges, the gap to capacity $1/2-\epsilon$ scales as $n^{-1/\mu}$, and in this regime the error probability tends to the limit $f(n^{1/\mu}(1/2-\epsilon))\triangleq f(z)$. In [24], the authors provide numerical and rigorous evidence that the lower bound in (5) satisfies such a scaling law. Furthermore, the scaling exponent for the BEC is estimated as $\mu\approx 3.6261$, a value which is close to the rigorous upper bound of $3.639$ coming from [14].

We will also use that the derivative $f^{\prime}(\cdot)$ of the mother curve is even. To justify this, recall that $Z_{m}(\epsilon)$ is given by the recursion (4) with initial condition $Z_{0}=\epsilon$, $\{Z_{m}^{(i)}(\epsilon)\}_{i=1}^{n}$ are the erasure probabilities of the synthetic channels obtained by polarizing a BEC$(\epsilon)$, and $\mathcal{I}(\epsilon)$ is the set of indices corresponding to the synthetic channels with the smallest erasure probabilities $Z_{m}^{(i)}(\epsilon)$. Similarly, $Z_{m}(1-\epsilon)$ is given by (4) with initial condition $Z_{0}=1-\epsilon$, $\{Z_{m}^{(i)}(1-\epsilon)\}_{i=1}^{n}$ denote the erasure probabilities obtained from the polarization of a BEC$(1-\epsilon)$, and $\mathcal{I}(1-\epsilon)$ is the corresponding set of indices. Then, one can easily verify that $Z_{m}(1-\epsilon)=1-Z_{m}(\epsilon)$. As $R=1/2$, this implies that $\max_{i\in\mathcal{I}(1-\epsilon)}Z_{m}^{(i)}(1-\epsilon)=1-\min_{i\in\mathcal{I}^{c}(\epsilon)}Z_{m}^{(i)}(\epsilon)$, where $\mathcal{I}^{c}(\epsilon)$ is the complement of $\mathcal{I}(\epsilon)$. Furthermore, as $m\to\infty$, $\min_{i\in\mathcal{I}^{c}(\epsilon)}Z_{m}^{(i)}(\epsilon)\approx\max_{i\in\mathcal{I}(\epsilon)}Z_{m}^{(i)}(\epsilon)$. Therefore, assuming that the lower bound in (5) is tight, $P_{e}(1-\epsilon,n)=1-P_{e}(\epsilon,n)$, which, under the scaling assumption (26), implies that $f^{\prime}(\cdot)$ is even.
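
The identity $Z_{m}(1-\epsilon)=1-Z_{m}(\epsilon)$ underlying this parity argument is easy to verify numerically: the multiset of erasure probabilities obtained by polarizing a BEC$(1-\epsilon)$ coincides with $\{1-Z_{m}^{(i)}(\epsilon)\}_{i=1}^{n}$. A minimal check, with illustrative parameters:

```python
import numpy as np

def synthetic_erasure_probs(eps, m):
    # Recursion (4): z -> 2z - z^2 and z -> z^2.
    Z = np.array([eps])
    for _ in range(m):
        Z = np.concatenate([2 * Z - Z**2, Z**2])
    return Z

m, eps = 10, 0.3
Za = np.sort(synthetic_erasure_probs(eps, m))
Zb = np.sort(1.0 - synthetic_erasure_probs(1.0 - eps, m))

# Duality: sorting {Z_m^{(i)}(eps)} and {1 - Z_m^{(i)}(1 - eps)} gives
# the same list.
print(np.allclose(Za, Zb))   # True
```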

The goal of our heuristic argument is to show that

$$\int_{1/4}^{3/4}\frac{P_{e}(\epsilon,n)}{\epsilon}\,{\rm d}\epsilon=\ln(3/4)-\ln(1/2)+O(n^{-2/\mu}). \tag{27}$$

Note that, by using (11) and (15), the integral of $P_{e}(\epsilon,n)/\epsilon$ over the interval $[0,1/4]$ is $O(n^{-2})$. Furthermore, by using the trivial upper bound $P_{e}(\epsilon,n)\leq 1$, the same integral over the interval $[3/4,1]$ is upper bounded by $-\ln(3/4)$. Hence, (27) implies that

$$T_{\rm avg}^{\rm polar}(n,R)\leq\frac{1}{nR}+\frac{1}{\mu_{\rm s}nR}\left(-\ln\left(1-R\right)+O(n^{-2/\mu})\right), \tag{28}$$

which implies that the gap to optimum for polar codes scales as $n^{-2/\mu}$, as desired.

We now show how to obtain (27). By performing the change of variables $z=n^{1/\mu}(1/2-\epsilon)$ and using the scaling law assumption (26), we have

$$\begin{aligned}\int_{1/4}^{3/4}\frac{P_{e}(\epsilon,n)}{\epsilon}\,{\rm d}\epsilon&=n^{-1/\mu}\int_{-n^{1/\mu}/4}^{n^{1/\mu}/4}\frac{P_{e}(1/2-zn^{-1/\mu},n)}{1/2-zn^{-1/\mu}}\,{\rm d}z\\ &\approx n^{-1/\mu}\int_{-n^{1/\mu}/4}^{n^{1/\mu}/4}\frac{f(z)}{1/2-zn^{-1/\mu}}\,{\rm d}z.\end{aligned} \tag{29}$$

Next, we integrate by parts the RHS of (29), which gives

$$\begin{aligned}&-f(z)\ln(1/2-zn^{-1/\mu})\Big|_{-n^{1/\mu}/4}^{n^{1/\mu}/4}\\ &\quad+\int_{-n^{1/\mu}/4}^{n^{1/\mu}/4}f^{\prime}(z)\ln(1/2-zn^{-1/\mu})\,{\rm d}z:=T_{1}+T_{2}.\end{aligned} \tag{30}$$

The term $T_{1}$ in the first line of (30) is upper bounded by $\ln(3/4)+O(n^{-9})$. In fact, $f(-n^{1/\mu}/4)\approx P_{e}(3/4,n)\approx 1$ and $f(n^{1/\mu}/4)\approx P_{e}(1/4,n)\leq n^{-9}$, where the last inequality follows from (12) and (14). Finally, by performing a Taylor expansion of $\ln(1/2-zn^{-1/\mu})$ around $1/2$, the term $T_{2}$ in the second line of (30) is upper bounded by

$$\begin{aligned}&\ln(1/2)\int_{-n^{1/\mu}/4}^{n^{1/\mu}/4}f^{\prime}(z)\,{\rm d}z+2n^{-1/\mu}\int_{-n^{1/\mu}/4}^{n^{1/\mu}/4}zf^{\prime}(z)\,{\rm d}z+O(n^{-2/\mu})\\ &\quad=-\ln(1/2)+O(n^{-2/\mu}),\end{aligned} \tag{31}$$

where we have used that the second term in the first line is $0$ (being the integral of an odd function over a symmetric interval). By summing up these bounds on $T_{1}$ and $T_{2}$, we obtain that the quantity in (30) is upper bounded by $\ln(3/4)-\ln(1/2)+O(n^{-2/\mu})$, which, combined with (29), implies the desired result (27).

[Figure 1: log-log plot of the deviation of $g(n)$ from $-\ln(1-R)$ versus $\log_{2}(n)$; the data points lie on a line of slope $-0.550536$.]
Figure 1: Evaluation of the function $g(n)$ defined in (32) for polar codes of rate $R=1/2$ and block lengths from $2^{8}$ up to $2^{24}$.

The scaling suggested by the heuristic argument above is also supported by numerical results. Let $g(n)$ be defined as

$$g(n)\triangleq\int_{0}^{1}\frac{\max_{i\in\mathcal{I}(\epsilon)}Z_{m}^{(i)}(\epsilon)}{\epsilon}\,{\rm d}\epsilon\leq\int_{0}^{1}\frac{P_{e}(\epsilon,n)}{\epsilon}\,{\rm d}\epsilon, \tag{32}$$

where the upper bound follows from (5). The plot in Figure 1 shows that the deviation $g(n)+\ln(1-R)$, i.e., the gap between $g(n)$ and its limiting value $-\ln(1-R)$, scales as $n^{-2/\mu}$, where the numerical value of $\mu$ matches the scaling exponent of the BEC.
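
A sketch reproducing this experiment at smaller block lengths: it evaluates the integral in (32) on a grid of $\epsilon$ values (the $nR$-th smallest $Z_{m}^{(i)}(\epsilon)$ is the maximum over the information set $\mathcal{I}(\epsilon)$) and fits the slope of the deviation from the limit $-\ln(1-R)=\ln 2$ on a log-log scale. The grid size and the range of $m$ are illustrative choices, so the fitted slope only roughly approximates $-2/\mu$.

```python
import numpy as np

def synthetic_erasure_probs(eps, m):
    # Recursion (4): z -> 2z - z^2 and z -> z^2.
    Z = np.array([eps])
    for _ in range(m):
        Z = np.concatenate([2 * Z - Z**2, Z**2])
    return Z

R = 0.5
eps_grid = np.linspace(0.0, 1.0, 2001)
log_n, log_dev = [], []

for m in range(8, 15, 2):                   # n = 2^8, 2^10, 2^12, 2^14
    k = int(2**m * R)
    # Integrand of (32): the k-th smallest Z_m^{(i)}(eps) is the largest
    # erasure probability over the information set I(eps).
    vals = np.array([0.0 if e == 0.0 else
                     np.partition(synthetic_erasure_probs(e, m), k - 1)[k - 1] / e
                     for e in eps_grid])
    g = np.sum((vals[1:] + vals[:-1]) / 2 * np.diff(eps_grid))  # trapezoid rule
    log_n.append(m)                         # log2(n) = m
    log_dev.append(np.log2(g - np.log(2)))  # deviation from -ln(1-R) = ln 2

slope = np.polyfit(log_n, log_dev, 1)[0]
print(f"fitted slope: {slope:.3f}   (the heuristic predicts -2/mu ~ -0.55)")
```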

VI Conclusion

In this paper, we consider the average execution time of a coded computing system using polar codes. We show that the performance gap between polar codes and the optimal MDS codes is upper bounded as $O(n^{-1/\mu})$, $n$ being the block length. Here, $\mu$ is the scaling exponent of polar codes for transmission over the BEC, and tight bounds have been proved on this quantity ($\mu\approx 3.63$). Next, we consider polar codes based on large kernels, which are known to approach the optimal scaling exponent $\mu=2$. For this family of codes, we prove that the gap to MDS codes scales roughly as $O(n^{-1/2})$. Finally, we conjecture that our bounds could be improved to $O(n^{-2/\mu})$ and $O(n^{-1})$, respectively, which would imply that polar codes with large kernels match the performance of random binary linear codes. Our conjecture is supported by a heuristic argument, as well as by numerical evidence.

Acknowledgments

D. Fathollahi and M. Mondelli were partially supported by the 2019 Lopez-Loreta Prize. The authors thank Hamed Hassani and Hessam Mahdavifar for helpful discussions.

References

  • [1] T. Baharav, K. Lee, O. Ocal, and K. Ramchandran, “Straggler-proofing massive-scale distributed matrix multiplication with d-dimensional product codes,” in IEEE International Symposium on Information Theory (ISIT), 2018, pp. 1993–1997.
  • [2] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, Mar. 2018.
  • [3] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Polynomial codes: An optimal design for high-dimensional coded matrix multiplication,” in Neural Information Processing Systems (NeurIPS), 2017, pp. 4406–4416.
  • [4] A. Mallick, M. Chaudhari, and G. Joshi, “Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication,” Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 3, pp. 1–40, 2019.
  • [5] A. Severinson, A. G. i Amat, and E. Rosnes, “Block-diagonal and LT codes for distributed computing with straggling servers,” IEEE Transactions on Communications, vol. 67, no. 3, pp. 1739–1753, Mar. 2019.
  • [6] B. Bartan and M. Pilanci, “Straggler resilient serverless computing based on polar codes,” in 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2019, pp. 276–283.
  • [7] M. Pilanci, “Computational polarization: An information-theoretic method for resilient computing,” IEEE Transactions on Information Theory, to appear.
  • [8] M. Soleymani, M. V. Jamali, and H. Mahdavifar, “Coded computing via binary linear codes: Designs and performance limits,” IEEE Journal on Selected Areas in Information Theory, vol. 2, no. 3, pp. 879–892, Sept. 2021.
  • [9] E. Arikan, “Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels,” IEEE Transactions on Information Theory, vol. 55, no. 7, pp. 3051–3073, 2009.
  • [10] I. Tal and A. Vardy, “How to construct polar codes,” IEEE Transactions on Information Theory, vol. 59, no. 10, pp. 6562–6582, Oct. 2013.
  • [11] R. Pedarsani, H. Hassani, I. Tal, and E. Telatar, “On the construction of polar codes,” in IEEE International Symposium on Information Theory (ISIT), St. Petersburg, Russia, Aug. 2011, pp. 11–15.
  • [12] M. Mondelli, S. H. Hassani, and R. Urbanke, “Construction of polar codes with sublinear complexity,” IEEE Transactions on Information Theory, vol. 65, no. 5, pp. 2782–2791, May 2019.
  • [13] E. Arıkan and I. E. Telatar, “On the rate of channel polarization,” in IEEE International Symposium on Information Theory (ISIT), Seoul, South Korea, July 2009, pp. 1493–1495.
  • [14] M. Mondelli, S. H. Hassani, and R. L. Urbanke, “Unified scaling of polar codes: Error exponent, scaling exponent, moderate deviations, and error floors,” IEEE Transactions on Information Theory, vol. 62, no. 12, pp. 6698–6712, 2016.
  • [15] “Final report of 3GPP TSG RAN WG1 #87 v1.0.0,” Reno, USA, Nov. 2016.
  • [16] S. H. Hassani, K. Alishahi, and R. L. Urbanke, “Finite-length scaling for polar codes,” IEEE Transactions on Information Theory, vol. 60, no. 10, pp. 5875–5898, Oct. 2014.
  • [17] D. Goldin and D. Burshtein, “Improved bounds on the finite length scaling of polar codes,” IEEE Transactions on Information Theory, vol. 60, no. 11, pp. 6966–6978, Nov. 2014.
  • [18] V. Guruswami and P. Xia, “Polar codes: Speed of polarization and polynomial gap to capacity,” IEEE Transactions on Information Theory, vol. 61, no. 1, pp. 3–16, Jan. 2015.
  • [19] M. Mondelli, S. H. Hassani, and R. Urbanke, “Scaling exponent of list decoders with applications to polar codes,” IEEE Transactions on Information Theory, vol. 61, no. 9, pp. 4838–4851, Sept. 2015.
  • [20] A. Fazeli, H. Hassani, M. Mondelli, and A. Vardy, “Binary linear codes with optimal scaling: Polar codes with large kernels,” IEEE Transactions on Information Theory, vol. 67, no. 9, pp. 5693–5710, Sept. 2021.
  • [21] V. Guruswami, A. Riazanov, and M. Ye, “Arıkan meets Shannon: Polar codes with near-optimal convergence to channel capacity,” in 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2020, pp. 552–564.
  • [22] A. Reisizadeh, S. Prakash, R. Pedarsani, and A. S. Avestimehr, “Coded computation over heterogeneous clusters,” IEEE Transactions on Information Theory, vol. 65, no. 7, pp. 4227–4242, July 2019.
  • [23] G. Liang and U. C. Kozat, “TOFEC: Achieving optimal throughput-delay trade-off of cloud storage using erasure codes,” in IEEE INFOCOM 2014-IEEE Conference on Computer Communications, 2014, pp. 826–834.
  • [24] S. B. Korada, A. Montanari, E. Telatar, and R. Urbanke, “An empirical scaling law for polar codes,” in IEEE International Symposium on Information Theory (ISIT), 2010, pp. 884–888.
  • [25] S. B. Korada, E. Şaşoğlu, and R. Urbanke, “Polar codes: Characterization of exponent, bounds, and constructions,” IEEE Transactions on Information Theory, vol. 56, no. 12, pp. 6253–6264, Dec. 2010.
  • [26] T. Richardson and R. Urbanke, Modern Coding Theory.   Cambridge University Press, 2008.
Appendix: Proof of Lemma 3

Proof:

Let $g_{\ell}(z)=z^{1/\ln\ell}(1-z)^{1/\ln\ell}$. By combining Lemma 5 and Lemma 6 in [20], we have that, for any $m\geq 0$, with high probability over the choice of $K$,

$$\mathbb{E}[g_{\ell}(Z_{m})]\leq\left(c_{3}\,\ell^{-1/2}\ln\ell\right)^{m}g_{\ell}(\epsilon), \tag{33}$$

where $c_{3}$ is a numerical constant. Then, the following chain of inequalities holds:

$$\begin{aligned}\mathbb{P}(Z_{m}\in[\ell^{-10m},1-\ell^{-10m}])&=\mathbb{P}(g_{\ell}(Z_{m})\geq g_{\ell}(\ell^{-10m}))\\ &\leq\frac{\mathbb{E}[g_{\ell}(Z_{m})]}{g_{\ell}(\ell^{-10m})}\\ &\leq\frac{\left(c_{3}\,\ell^{-1/2}\ln\ell\right)^{m}g_{\ell}(\epsilon)}{g_{\ell}(\ell^{-10m})}\\ &\leq\left(c_{3}\,\ell^{-1/2+10/\ln\ell}\ln\ell\right)^{m}\frac{1}{(1-\ell^{-10m})^{1/\ln\ell}}\\ &\leq 2\left(c_{3}\,\ell^{-1/2+10/\ln\ell}\ln\ell\right)^{m}.\end{aligned} \tag{34}$$

Here, in the first line we use the concavity of $g_{\ell}(\cdot)$ together with its symmetry around $1/2$; in the second line, we use Markov's inequality; in the third line, we use (33); in the fourth line, we use that $g_{\ell}(\epsilon)\leq 1$ and that $g_{\ell}(\ell^{-10m})=\ell^{-10m/\ln\ell}(1-\ell^{-10m})^{1/\ln\ell}$; and in the fifth line, we use that $(1-\ell^{-10m})^{1/\ln\ell}\geq 1/2$ for $m\geq 1$ and $\ell\geq 3$.

The rest of the argument follows the proof of Lemma 4 in [20], and it is briefly outlined below. Let us define

$$\begin{aligned}A&\triangleq\mathbb{P}(Z_{m}\in[0,\ell^{-10m})),\\ B&\triangleq\mathbb{P}(Z_{m}\in[\ell^{-10m},1-\ell^{-10m}]),\\ C&\triangleq\mathbb{P}(Z_{m}\in(1-\ell^{-10m},1)).\end{aligned} \tag{35}$$

Furthermore, let $A^{\prime}$, $B^{\prime}$, and $C^{\prime}$ be the fractions of indices in $A$, $B$, and $C$, respectively, that will have a vanishing erasure probability, i.e.,

$$\begin{aligned}A^{\prime}&\triangleq\lim_{m^{\prime}\to\infty}\mathbb{P}\left(Z_{m}\in(0,\ell^{-10m}),\,Z_{m+m^{\prime}}\leq\ell^{-m^{\prime}}\right),\\ B^{\prime}&\triangleq\lim_{m^{\prime}\to\infty}\mathbb{P}\left(Z_{m}\in[\ell^{-10m},1-\ell^{-10m}],\,Z_{m+m^{\prime}}\leq\ell^{-m^{\prime}}\right),\\ C^{\prime}&\triangleq\lim_{m^{\prime}\to\infty}\mathbb{P}\left(Z_{m}\in(1-\ell^{-10m},1),\,Z_{m+m^{\prime}}\leq\ell^{-m^{\prime}}\right).\end{aligned} \tag{36}$$

The existence of the limits in (36) – and therefore the well-posedness of $A^{\prime}$, $B^{\prime}$, and $C^{\prime}$ – is proved in Lemma 4 of [20].

A simple counting argument over the binary subspaces of dimension $\ell$ shows that, with high probability, none of the column permutations of $K$ is upper triangular (cf. (47) in [20]). Hence, with high probability, $K$ is polarizing [25], which implies that

$$A^{\prime}+B^{\prime}+C^{\prime}=\lim_{m^{\prime}\to\infty}\mathbb{P}\left(Z_{m+m^{\prime}}\leq\ell^{-m^{\prime}}\right)=1-\epsilon. \tag{37}$$

In order to upper bound $B^{\prime}$, we use the definitions (35) and (36) of $B$ and $B^{\prime}$ as well as (34):

$$B^{\prime}\leq B\leq\left(c_{3}\,\ell^{-1/2+10/\ln\ell}\ln\ell\right)^{m}. \tag{38}$$

Furthermore, by using the definition (36), we can upper bound $C^{\prime}$ as

$$C^{\prime}\leq\lim_{m^{\prime}\to\infty}\mathbb{P}\left(Z_{m+m^{\prime}}\leq\ell^{-m^{\prime}}\mid Z_{m}\in(1-\ell^{-10m},1)\right). \tag{39}$$

As the kernel $K$ is polarizing (with high probability), the RHS of (39) equals the capacity of a BEC with erasure probability at least $1-\ell^{-10m}$. Therefore,

$$C^{\prime}\leq\ell^{-10m}. \tag{40}$$

As a result, we conclude that $A=\mathbb{P}(Z_{m}\in[0,\ell^{-10m}))$ is lower bounded as

$$\begin{aligned}A\geq A^{\prime}&=1-\epsilon-B^{\prime}-C^{\prime}\\ &\geq 1-\epsilon-\left(c_{3}\,\ell^{-1/2+10/\ln\ell}\ln\ell\right)^{m}-\ell^{-10m},\end{aligned} \tag{41}$$

where the equality in the first line follows from (37), and the inequality in the second line follows from (38) and (40). The desired result is readily implied by (41). ∎