Polar Coded Computing:
The Role of the Scaling Exponent
Abstract
We consider the problem of coded distributed computing using polar codes. In recent work by Soleymani, Jamali and Mahdavifar, the average execution time of a coded computing system was related to the error probability for transmission over the binary erasure channel, and the performance of binary linear codes was investigated. In this paper, we focus on polar codes and unveil a connection between the average execution time and the scaling exponent $\mu$ of the family of codes. The scaling exponent has emerged as a central object in the finite-length characterization of polar codes, and it captures the speed of convergence to capacity. In particular, we show that (i) the gap between the normalized average execution time of polar codes and that of optimal MDS codes is $O(n^{-1/\mu})$, where $n$ is the block length, and (ii) this upper bound can be improved to roughly $O(n^{-1/2})$ by considering polar codes with large kernels. We conjecture that these bounds could be improved to $O(n^{-2/\mu})$ and $O(n^{-1})$, respectively, and provide a heuristic argument as well as numerical evidence supporting this view.
I Introduction
In recent years, there has been a high demand for data processing at large scale. This has motivated a paradigm shift towards distributed computing, where a large-scale computation is spread over $n$ workers. A central issue is that some workers may finish later than others or crash, and a common solution is to add redundancy: coded distributed computing breaks a given large computational job into $k$ smaller tasks and adds $n-k$ redundant tasks, so that the computation is performed over $n$ workers. As a result, the final output is available even if we receive only part of the $n$ outputs.
Applications of coding theory to distributed computing have been studied extensively, see e.g. [1, 2, 3, 4, 5, 6, 7, 8] and references therein. Product codes are applied to distributed matrix multiplication in [1]. In [2], maximum distance separable (MDS) codes are used to speed up matrix multiplication, and polynomial codes are employed for matrix multiplication over large finite fields in [3]. Luby Transform (LT) codes are proposed in [4, 5] for coded computation, and the application of polar codes to serverless computing is considered in [6, 7]. In [8], the average execution time of a coded computing system is related to the error probability of the linear code used to add redundancy. Although MDS codes are optimal in terms of average delay, they suffer from numerical instability and high decoding complexity; hence, binary linear codes are studied in [8]: the authors characterize the performance gap between MDS and random binary linear codes and provide a numerical analysis of systems based on polar and Reed-Muller codes.
In this paper, we analyze the performance of polar codes for coded distributed computing. Polar codes provably achieve capacity for any binary memoryless symmetric (BMS) channel with quasi-linear encoding and decoding complexity [9]. The code construction can be performed with linear complexity [10, 11] and, by exploiting a partial order between the synthetic channels, the construction complexity becomes sub-linear [12]. The error probability under successive cancellation (SC) decoding scales roughly as $2^{-\sqrt{n}}$ [13], $n$ being the block length, and polar codes are not affected by error floors [14]. Because of their attractive properties, polar codes have been recently adopted for the 5G wireless communications standard [15].
A central object of interest in the analysis of polar codes has been the speed of convergence of the rate to capacity [16, 17, 18, 19, 14, 20, 21]. This line of work shows that the gap to capacity scales with the block length $n$ as $n^{-1/\mu}$. The parameter $\mu$ is called the scaling exponent; it depends on the transmission channel and, for the special case of the binary erasure channel (BEC), $\mu \approx 3.63$ [16, 14]. Furthermore, by using large polarization kernels, it is possible to approach the optimal scaling exponent $\mu = 2$ [20, 21].
The goal of this paper is to characterize the average execution time of a coded computing system using polar codes via their scaling exponent. In particular, our main contributions can be summarized as follows:
- In Section III, we show that the average execution time of a coded computing system based on polar codes exceeds that of optimal MDS codes, after normalization, by at most $O(n^{-1/\mu})$, where $n$ is the block length and $\mu$ is the scaling exponent of polar codes over the BEC (Theorem 1 and Corollary 1).
- In Section IV, we extend the analysis to polar codes obtained from large random kernels. As the scaling exponent of this family approaches the optimal value $\mu = 2$, the gap to MDS codes improves to roughly $O(n^{-1/2})$ (Theorem 2).
- In Section V, we consider codes of rate $1/2$ and provide a heuristic argument suggesting that, for polar codes with scaling exponent $\mu$, the normalized gap should scale as $n^{-2/\mu}$. The argument is also validated by numerical evidence. Hence, as polar codes with large kernels approach the optimal $\mu = 2$ [20, 21], this would lead to a gap scaling as $n^{-1}$, which matches the performance of random binary linear codes.
II Preliminaries
II-A Distributed Computing System Model
We consider a distributed computing model with $n$ workers having the same computational capabilities, as in [8]. For the $i$-th worker, the run-time $T_i$ for performing a task is represented via a shifted exponential random variable [2, 22, 23]. Thus, dividing the computation equally into $k$ tasks (so that each worker is assigned a task of size $1/k$) results in the following cumulative distribution function:

$\Pr\left(T_i \le t\right) = 1 - e^{-\lambda k\left(t - \frac{1}{k}\right)}, \qquad t \ge \frac{1}{k}, \qquad (1)$

where $\lambda$ is the straggling parameter corresponding to each worker. Hence, the probability that a given task is not completed by time $t$ is given by

$\epsilon(t) = \begin{cases} 1, & 0 \le t < \frac{1}{k}, \\ e^{-\lambda k\left(t - \frac{1}{k}\right)}, & t \ge \frac{1}{k}. \end{cases} \qquad (2)$

We can now relate the problem of computing a task divided into $k$ parts over $n$ servers to the problem of transmitting $k$ symbols via an $(n, k)$ linear code over $n$ i.i.d. BECs with erasure probability $\epsilon(t)$: a worker that has not finished by time $t$ corresponds to an erased symbol. Let $T_{\rm exec}$ be the (random) execution time, defined as the time at which a decodable set of tasks is obtained from the workers. Then, the error probability of the code is related to the average execution time via the following lemma.

Lemma 1 (Average execution time, cf. [8])
Let $P_e(\epsilon)$ be the block error probability of the $(n, k)$ code over a BEC with erasure probability $\epsilon$. Then,

$\mathbb{E}\left[T_{\rm exec}\right] = \int_0^{\infty} P_e(\epsilon(t))\,{\rm d}t. \qquad (3)$

This follows from $\mathbb{E}[T_{\rm exec}] = \int_0^\infty \Pr(T_{\rm exec} > t)\,{\rm d}t$ and from the fact that $\Pr(T_{\rm exec} > t)$ is exactly the probability that the set of tasks received by time $t$ is not decodable.
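The following minimal sketch (ours, not from [8]) checks Lemma 1 numerically for an idealized MDS code, for which $P_e(\epsilon)$ is simply the probability of more than $n-k$ erasures; the parameters $n$, $k$ and $\lambda$ are illustrative.

```python
# Monte Carlo check of Lemma 1 for an idealized (n, k) MDS code.
# Assumptions: shifted-exponential worker times as in (1); illustrative n, k, lam.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n, k, lam = 64, 32, 1.0

# Worker run-times: shift 1/k, rate lam*k, cf. (1).
T = 1.0 / k + rng.exponential(1.0 / (lam * k), size=(100_000, n))
# MDS decodes as soon as any k workers are done: k-th order statistic.
t_exec = np.sort(T, axis=1)[:, k - 1]
print("Monte Carlo E[T_exec]:", t_exec.mean())

# Lemma 1: E[T_exec] = integral of P_e(eps(t)); for an MDS code,
# P_e(eps) = P(Bin(n, eps) > n - k), i.e., fewer than k workers finished.
def eps(t):
    return np.where(t < 1.0 / k, 1.0, np.exp(-lam * k * (t - 1.0 / k)))

ts = np.linspace(0.0, 1.0, 20_000)
dt = ts[1] - ts[0]
print("Integral of P_e:      ", binom.sf(n - k, n, eps(ts)).sum() * dt)
```

The two printed values should agree up to Monte Carlo and discretization error.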
II-B Channel Polarization
Channel polarization is based on mapping two identical copies of the transmission channel $W$ into a pair of “synthetic” channels $W^+$ and $W^-$, such that $W^+$ is strictly better and $W^-$ is strictly worse than $W$. Here, we will focus on the special case in which $W$ is a BEC, since this is the relevant setting for coded computing. When $W$ is a BEC($\epsilon$), we have that $W^+$ is a BEC($\epsilon^2$) and $W^-$ is a BEC($2\epsilon - \epsilon^2$). By repeating this operation $m$ times, we map $n = 2^m$ identical copies of $W$ into the synthetic channels $W_m^{(i)}$ ($i \in \{1, \ldots, n\}$). Consider a sequence of channels $\{W_m\}_{m \ge 0}$ obtained by picking uniformly at random one of the synthetic channels at each level, and let $\{Z_m\}_{m \ge 0}$ be the random process that tracks the erasure probability of $W_m$. Then,

$Z_{m+1} = \begin{cases} Z_m^2, & \text{w.p. } 1/2, \\ 2Z_m - Z_m^2, & \text{w.p. } 1/2, \end{cases} \qquad (4)$

with initialization $Z_0 = \epsilon$, $\epsilon$ being the erasure probability of the original channel $W$. The synthetic channels are all BECs and they polarize in the sense that, as $m$ grows, their erasure probabilities are either close to $0$ (noiseless channels) or close to $1$ (noisy channels). Formally, as $m \to \infty$, $Z_m$ converges almost surely to a random variable $Z_\infty$ such that $\mathbb{P}(Z_\infty = 0) = 1 - \epsilon$ and $\mathbb{P}(Z_\infty = 1) = \epsilon$.

Now, the idea is to use the noiseless channels to transmit the information, while the noisy channels contain a fixed sequence shared between encoder and decoder (e.g., the all-zero sequence). In particular, in order to construct a polar code of block length $n = 2^m$ and rate $R$ for transmission over BEC($\epsilon$), we apply $m$ steps of polarization to $W$ and select the $nR$ synthetic channels with the smallest erasure probabilities $\epsilon_m^{(i)}$. We denote by $\mathcal{I}$ the corresponding set of indices ($|\mathcal{I}| = nR$). Then, the error probability $P_e$ of the polar code under SC decoding can be bounded as (see e.g. Section III of [24])

$\max_{i \in \mathcal{I}} \epsilon_m^{(i)} \;\le\; P_e \;\le\; \sum_{i \in \mathcal{I}} \epsilon_m^{(i)}. \qquad (5)$

In [9], it is shown that, as $n \to \infty$, the upper bound in (5) vanishes for any rate $R < 1 - \epsilon$, thus implying that polar codes achieve capacity.
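As a quick illustration (ours, not from the paper), the recursion (4) yields all $2^m$ synthetic erasure probabilities exactly, from which both bounds in (5) follow; the values of $\epsilon$, $m$ and $R$ below are arbitrary.

```python
# Exact synthetic-channel erasure probabilities for BEC(eps) via (4),
# and the resulting bounds (5) on P_e for a rate-R polar code.
import numpy as np

def synthetic_erasure_probs(eps, m):
    z = np.array([eps])
    for _ in range(m):
        z = np.concatenate([2 * z - z**2, z**2])  # minus / plus transforms
    return z

eps, m, R = 0.5, 10, 0.4
z = synthetic_erasure_probs(eps, m)
n = len(z)                        # block length n = 2^m
info = np.sort(z)[: int(n * R)]   # nR most reliable synthetic channels
print("lower bound on P_e (max):", info.max())
print("upper bound on P_e (sum):", info.sum())
```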
II-C Scaling Exponent
The scaling exponent $\mu$ captures the speed of convergence of the rate to the channel capacity, by quantifying how the gap to capacity scales as a function of the block length $n$: roughly speaking, at fixed error probability, $1 - \epsilon - R \sim n^{-1/\mu}$.
Definition 1 (Upper bound on scaling exponent of BEC)
We say that $\mu$ is an upper bound on the scaling exponent of the BEC, if there exists a function $h(x): [0,1] \to [0,1]$ such that $h(0) = h(1) = 0$, $h(x) > 0$ for any $x \in (0,1)$, and

$\sup_{x \in (0,1)} \frac{h\left(x^2\right) + h\left(2x - x^2\right)}{2\, h(x)} \le 2^{-1/\mu}.$
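As a sanity check (ours, not from [14]), one can plug a candidate $h$ into Definition 1 and evaluate the supremum numerically. The choice $h(x) = (x(1-x))^{0.42}$ below is an illustrative assumption and certifies only a loose upper bound; [14] optimize over $h$ to reach $\mu \le 3.639$.

```python
# Numerical evaluation of the condition in Definition 1 for a candidate h.
import numpy as np

a = 0.42                                  # illustrative exponent, not optimized
h = lambda x: (x * (1.0 - x)) ** a
x = np.linspace(1e-6, 1.0 - 1e-6, 1_000_000)
ratio = (h(x**2) + h(2 * x - x**2)) / (2.0 * h(x))
# Definition 1 requires sup ratio <= 2^(-1/mu), i.e., mu >= -1 / log2(sup).
print("certified upper bound on mu:", -1.0 / np.log2(ratio.max()))
```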
II-D Polar Codes with Large Kernels
Conventional polar codes are obtained from the kernel $K_2 = \left(\begin{smallmatrix}1 & 0\\ 1 & 1\end{smallmatrix}\right)$, in the sense that their generator matrix is given by the rows of $K_2^{\otimes m}$ (here, $K_2^{\otimes m}$ denotes the $m$-th Kronecker power of $K_2$) corresponding to the most reliable synthetic channels. In [25], it was shown that capacity-achieving polar codes can be constructed from any kernel $K_\ell$ that is an $\ell \times \ell$ non-singular binary matrix which cannot be transformed into an upper triangular matrix under any column permutation. In particular, given an $\ell \times \ell$ kernel $K_\ell$, we map $n = \ell^m$ identical copies of the transmission channel $W$ into the synthetic channels $W_m^{(i)}$ ($i \in \{1, \ldots, n\}$). If $W$ is a BEC, then the synthetic channels are all BECs. Their erasure probabilities are denoted by $\epsilon_m^{(i)}$ and they are tracked by the random process $\{Z_m\}_{m \ge 0}$, which is defined as

$Z_{m+1} = p_{B_m}(Z_m), \qquad Z_0 = \epsilon, \qquad (6)$

where $B_m$ is drawn uniformly at random in $\{1, \ldots, \ell\}$ and the set of polynomials $\{p_i(x)\}_{i=1}^{\ell}$ characterizes the polarization behavior of the kernel (for details and a formal definition, see e.g. (57) in [20]). The error probability of the resulting polar code can again be bounded as in (5). By carefully analyzing the behavior of the polynomials $\{p_i(x)\}_{i=1}^{\ell}$ when $K_\ell$ is a random non-singular binary matrix, in [20] it is shown that the gap to capacity of the polar code obtained from $K_\ell$ approaches the optimal scaling $n^{-1/2}$ (i.e., scaling exponent $\mu = 2$) as the kernel size $\ell$ grows. These results were then generalized to the transmission over any BMS channel in [21].
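The process (6) is straightforward to simulate once the polynomials $\{p_i\}$ are available. The sketch below (ours) uses the Arikan kernel $K_2$, whose two polarization polynomials are $x^2$ and $2x - x^2$; for an $\ell \times \ell$ kernel one would substitute its $\ell$ polynomials from (57) in [20].

```python
# Monte Carlo simulation of the erasure-probability process (6).
import numpy as np

rng = np.random.default_rng(1)
polys = [lambda x: x**2, lambda x: 2 * x - x**2]  # K_2; replace for larger kernels

def simulate_Z(eps, m, trials=200_000):
    z = np.full(trials, eps)
    for _ in range(m):
        vals = np.stack([p(z) for p in polys])    # all candidate transforms
        z = vals[rng.integers(len(polys), size=trials), np.arange(trials)]
    return z

z = simulate_Z(0.5, 16)
print("P(Z_m < 1e-6):", np.mean(z < 1e-6), "  P(Z_m > 1 - 1e-6):", np.mean(z > 1 - 1e-6))
```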
III Average Execution Time for Polar Codes
Our characterization of the average execution time via the scaling exponent exploits an intermediate result from [14]: we count the fraction of synthetic channels whose erasure probability is polynomially small in the block length $n$, and we show that this fraction is at least the channel capacity minus a term scaling as $n^{-1/\mu}$, where $\mu$ is the scaling exponent.
Lemma 2 (Bound on $\mathbb{P}(Z_m \le 2^{-3m})$)
Let $\mu$ be an upper bound on the scaling exponent of the BEC in the sense of Definition 1, and let $\{Z_m\}_{m \ge 0}$ be the random process tracking the erasure probability of $W_m$, where $W$ is a BEC($\epsilon$). Then,

$\mathbb{P}\left(Z_m \le 2^{-3m}\right) \ge 1 - \epsilon - c_1\, 2^{-m/\mu}, \qquad (7)$

where $c_1$ is a numerical constant independent of $m$ and $\epsilon$.

The proof of Lemma 2 follows from (35) of [14], after a suitable choice of the parameters defined in (31) and (33) of that paper. At this point, we are ready to state and prove our main result on the average execution time.
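The flavor of Lemma 2 can be probed exactly for the Arikan kernel (our sketch, with illustrative constants $c_1 = 1$ and $\mu = 3.639$): we compare the fraction of synthetic channels below the polynomial threshold $2^{-3m}$ with the capacity minus $c_1\, 2^{-m/\mu}$.

```python
# Exact check of the scaling in Lemma 2 for the Arikan kernel over BEC(eps).
import numpy as np

eps, mu, c1 = 0.5, 3.639, 1.0
z = np.array([eps])
for m in range(1, 19):
    z = np.concatenate([2 * z - z**2, z**2])
    frac = float(np.mean(z <= 2.0 ** (-3 * m)))
    bound = 1 - eps - c1 * 2.0 ** (-m / mu)
    print(f"m={m:2d}  fraction={frac:.4f}  capacity - c1*2^(-m/mu)={bound:.4f}")
```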
Theorem 1 ($\mathbb{E}[T_{\rm exec}]$ for polar codes)
Let $\mu$ be an upper bound on the scaling exponent of the BEC in the sense of Definition 1. Fix $R \in (0,1)$ and a sufficiently large block length $n = 2^m$, let $k = nR$, and define

$t^* = \frac{1}{k}\left(1 + \frac{1}{\lambda}\,\ln\frac{1}{1 - R - c_1\, n^{-1/\mu}}\right), \qquad (8)$

where $c_1$ is the constant in (7). Let $\mathbb{E}[T^{\rm polar}_{\rm exec}]$ be the average execution time of a coded distributed computing system using a polar code of block length $n$ and rate $R$, which is constructed for transmission over a BEC($\epsilon(t^*)$). Then,

$\mathbb{E}\left[T^{\rm polar}_{\rm exec}\right] \le t^* + \frac{c_2\,\ln n}{\lambda k\, n^2}, \qquad (9)$

where $c_2$ is a numerical constant independent of $n$.
Proof:
An application of Lemma 1 gives that

$\mathbb{E}\left[T^{\rm polar}_{\rm exec}\right] = \int_0^{\infty} P_e(\epsilon(t))\,{\rm d}t. \qquad (10)$

To upper bound the RHS of (10), we divide the integration domain into three intervals and bound each piece individually.

First, pick $t_2 = \frac{1}{k}\left(1 + \frac{3\ln n}{\lambda}\right)$ and note that $P_e(\epsilon(t))$ is trivially upper bounded by the probability that there is at least one erasure, which is equal to $1 - (1 - \epsilon(t))^n$. Note that $\epsilon(t_2) = n^{-3}$ and, for all $t \ge t_2$, $\epsilon(t) \le n^{-3}$. Thus, $1 - (1 - \epsilon(t))^n \le n\,\epsilon(t)$ and, consequently, $P_e(\epsilon(t)) \le n\,\epsilon(t)$ for all $t \ge t_2$. Hence,

$\int_{t_2}^{\infty} P_e(\epsilon(t))\,{\rm d}t \le n \int_{t_2}^{\infty} \epsilon(t)\,{\rm d}t = \frac{n\,\epsilon(t_2)}{\lambda k} = \frac{1}{\lambda k\, n^2}. \qquad (11)$

Next, let $t \in [t^*, t_2)$, where $t^*$ is given by (8). As the family of BECs is ordered by degradation [26] and $\epsilon(t)$ is non-increasing in $t$, we have that

$P_e(\epsilon(t)) \le P_e(\epsilon(t^*)), \qquad \text{for all } t \ge t^*. \qquad (12)$

Recall that the polar code is obtained by polarizing a BEC($\epsilon(t^*)$), and let $\mathcal{I}$ denote the corresponding set of indices containing the information bits. By applying Lemma 2 with initial condition $Z_0 = \epsilon(t^*)$, we have that the fraction of synthetic channels whose erasure probability is at most $2^{-3m} = n^{-3}$ is lower bounded by

$1 - \epsilon(t^*) - c_1\, 2^{-m/\mu} = R, \qquad (13)$

where we have used the definition (8) of $t^*$, which gives $\epsilon(t^*) = 1 - R - c_1\, n^{-1/\mu}$. Therefore, every channel indexed by $\mathcal{I}$ has erasure probability at most $n^{-3}$ and, from the upper bound in (5), we obtain that

$P_e(\epsilon(t^*)) \le \sum_{i \in \mathcal{I}} \epsilon_m^{(i)} \le nR \cdot n^{-3} \le n^{-2}, \qquad (14)$

where we have also used that $R \le 1$. Hence,

$\int_{t^*}^{t_2} P_e(\epsilon(t))\,{\rm d}t \le (t_2 - t^*)\, n^{-2} \le t_2\, n^{-2}. \qquad (15)$

Here, in the first inequality, we combine (12) and (14); and, in the second inequality, we use that $t^* \ge 0$.

Finally, let $t \in [0, t^*)$. In this interval, we use the trivial upper bound $P_e(\epsilon(t)) \le 1$. Hence,

$\int_0^{t^*} P_e(\epsilon(t))\,{\rm d}t \le t^*. \qquad (16)$

By combining (8), (11), (15) and (16), we have that

$\mathbb{E}\left[T^{\rm polar}_{\rm exec}\right] \le t^* + \frac{1}{\lambda k\, n^2} + \frac{1}{k}\left(1 + \frac{3\ln n}{\lambda}\right)\frac{1}{n^2} = t^* + \frac{1 + \lambda + 3\ln n}{\lambda k\, n^2} \le t^* + \frac{c_2\,\ln n}{\lambda k\, n^2}. \qquad (17)$

Here, the equality simply collects the three terms; and, to obtain the last inequality, we use that $1 + \lambda \le \ln n$ for sufficiently large $n$ and pick $c_2 = 4$. By combining (10) and (17), the desired result readily follows. ∎
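To make the proof concrete, here is a numerical sketch (ours; all constants illustrative, with $c_1 = 1$) that designs a polar code for BEC($\epsilon(t^*)$), evaluates the union bound in (5) on $P_e(\epsilon(t))$, and integrates it over $t$ as in (10).

```python
# Numerical illustration of Theorem 1: integrate the union bound on
# P_e(eps(t)) for a polar code designed for BEC(eps(t*)).
import numpy as np

lam, m, R, mu = 1.0, 11, 0.4, 3.639
n = 2**m
k = int(n * R)

def z_probs(eps, m):
    z = np.array([eps])
    for _ in range(m):
        z = np.concatenate([2 * z - z**2, z**2])
    return z

def eps_of_t(t):
    return 1.0 if t < 1 / k else float(np.exp(-lam * k * (t - 1 / k)))

eps_star = 1 - R - n ** (-1 / mu)             # mirrors (8) with c1 = 1
t_star = (1 + np.log(1 / eps_star) / lam) / k
info = np.argsort(z_probs(eps_star, m))[:k]   # information set of the code

ts = np.linspace(0.0, 3 * t_star, 600)
dt = ts[1] - ts[0]
pe = [min(1.0, z_probs(eps_of_t(t), m)[info].sum()) for t in ts]
print("t* =", t_star, "  bound on E[T_exec] =", sum(pe) * dt)
```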
Armed with this result, we can bound the gap between polar codes and MDS codes, which achieve the minimum average execution time (cf. Corollary 6 in [8]).
Corollary 1 (Gap to optimum for polar codes)
Fix $R \in (0,1)$ and a sufficiently large block length $n = 2^m$. Let $\mathbb{E}[T^{\rm MDS}_{\rm exec}]$ be the average execution time of a coded distributed computing system using an MDS code of block length $n$ and rate $R$. Let $\mathbb{E}[T^{\rm polar}_{\rm exec}]$ be defined as in Theorem 1. Then,

$k\left(\mathbb{E}\left[T^{\rm polar}_{\rm exec}\right] - \mathbb{E}\left[T^{\rm MDS}_{\rm exec}\right]\right) \le \frac{c_4}{n^{1/\mu}}, \qquad (18)$

where $c_4$ is a constant independent of $n$.
Proof:
Recall from (15) in [8] that

$\mathbb{E}\left[T^{\rm MDS}_{\rm exec}\right] = \frac{1}{k}\left(1 + \frac{1}{\lambda}\sum_{i=n-k+1}^{n}\frac{1}{i}\right). \qquad (19)$

We lower bound the harmonic sum on the RHS of (19) as

$\sum_{i=n-k+1}^{n}\frac{1}{i} \ge \int_{n-k+1}^{n+1}\frac{{\rm d}x}{x} = \ln\frac{n+1}{n-k+1} \ge \ln\frac{1}{1-R} - \frac{2}{(1-R)\,n}, \qquad (20)$

where the first inequality uses that $1/x$ is decreasing and the last inequality uses that $\ln(1-x) \ge -2x$ for all $x \in [0, 1/2]$. By combining (19) and (20), we deduce that

$\mathbb{E}\left[T^{\rm MDS}_{\rm exec}\right] \ge \frac{1}{k}\left(1 + \frac{1}{\lambda}\ln\frac{1}{1-R} - \frac{2}{\lambda(1-R)\,n}\right). \qquad (21)$

Recall that $\ln(1+x) \le x$ for all $x \ge 0$, which gives

$\ln\frac{1}{1-R-c_1\,n^{-1/\mu}} - \ln\frac{1}{1-R} = \ln\left(1 + \frac{c_1\,n^{-1/\mu}}{1-R-c_1\,n^{-1/\mu}}\right) \le \frac{2c_1}{1-R}\cdot\frac{1}{n^{1/\mu}}$

for sufficiently large $n$, and pick $c_4 = \frac{1}{\lambda}\left(\frac{2c_1 + 2}{1-R} + c_2\right)$, where $c_2$ is the numerical constant in (9). Thus, the claim follows by combining (21) with the result of Theorem 1. ∎
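The corollary can also be checked numerically (our sketch; $c_1 = 1$ is an illustrative stand-in for the constant in (8)): the normalized difference between the polar bound $k\,t^*$ and the exact MDS value from (19) indeed tracks $n^{-1/\mu}$.

```python
# Normalized gap between the polar bound (via t*) and the exact MDS value (19).
import numpy as np

lam, R, mu = 1.0, 0.4, 3.639
for m in [8, 10, 12, 14]:
    n = 2**m
    k = int(n * R)
    eps_star = 1 - R - n ** (-1 / mu)
    t_star = (1 + np.log(1 / eps_star) / lam) / k
    H = np.cumsum(1.0 / np.arange(1, n + 1))          # harmonic numbers
    t_mds = (1 + (H[-1] - H[n - k - 1]) / lam) / k    # cf. (19)
    print(n, "k*(t* - E[T_MDS]) =", k * (t_star - t_mds), "  n^(-1/mu) =", n ** (-1 / mu))
```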
We remark that random binary linear codes achieve a smaller gap with respect to the optimal MDS codes. In fact, $k\left(\mathbb{E}\left[T^{\rm RLC}_{\rm exec}\right] - \mathbb{E}\left[T^{\rm MDS}_{\rm exec}\right]\right)$ is $O(n^{-1})$, where $\mathbb{E}[T^{\rm RLC}_{\rm exec}]$ denotes the average execution time for a random binary linear code of block length $n$ and rate $R$ (cf. Corollary 9 in [8]). Thus, a natural question is whether it is possible to reduce the gap between polar and MDS codes. We answer this question affirmatively in the next section, where we consider polar codes obtained from large random kernels.
IV Extension to Polar Codes with Large Kernels
First, we consider the fraction of synthetic channels whose erasure probability is polynomially small in the block length $n$, and we show that it is lower bounded by the channel capacity minus a term scaling roughly as $n^{-1/2}$. This result is stated below, and it is a (slight) improvement over Theorem 3 in [20]. Its proof is deferred to the appendix.
Lemma 3 (Bound on $\mathbb{P}(Z_m \le \ell^{-3m})$ for large kernels)
Let $K_\ell$ be a kernel selected uniformly at random from all $\ell \times \ell$ non-singular binary matrices, and let $\{Z_m\}_{m \ge 0}$ be the random process defined in (6) with initial condition $Z_0 = \epsilon$. Fix a small constant $\delta > 0$. Then, there exists $\ell_0(\delta)$ such that, for any $\ell \ge \ell_0(\delta)$ and for all $m \ge 1$, the following holds with high probability over the choice of $K_\ell$:

$\mathbb{P}\left(Z_m \le \ell^{-3m}\right) \ge 1 - \epsilon - c_5\,\ell^{-m\left(\frac{1}{2} - \delta\right)}, \qquad (22)$

where $c_5$ is a numerical constant independent of $m$ and $\epsilon$.
At this point, we are ready to state our results for polar codes based on large kernels.
Theorem 2 ($\mathbb{E}[T_{\rm exec}]$ and gap to optimum for polar codes with large kernels)
Fix $R \in (0,1)$, a small constant $\delta > 0$, a sufficiently large kernel size $\ell$ and a sufficiently large block length $n = \ell^m$, let $k = nR$, and define

$t^*_\ell = \frac{1}{k}\left(1 + \frac{1}{\lambda}\,\ln\frac{1}{1 - R - c_5\, n^{-\left(\frac{1}{2} - \delta\right)}}\right), \qquad (23)$

where $c_5$ is the constant in (22). Let $K_\ell$ be a kernel selected uniformly at random from all $\ell \times \ell$ non-singular binary matrices, let $\mathcal{C}_n$ be the polar code of block length $n$ and rate $R$ obtained from the kernel $K_\ell$ and constructed for transmission over a BEC($\epsilon(t^*_\ell)$), and let $\mathbb{E}[T^{\mathcal{C}_n}_{\rm exec}]$ be the average execution time of a coded distributed computing system using $\mathcal{C}_n$. Then, the following holds with high probability over the choice of $K_\ell$:

$\mathbb{E}\left[T^{\mathcal{C}_n}_{\rm exec}\right] \le t^*_\ell + \frac{c_6\,\ln n}{\lambda k\, n^2}, \qquad (24)$

where $c_6$ is a numerical constant independent of $n$. Furthermore, let $\mathbb{E}[T^{\rm MDS}_{\rm exec}]$ be the average execution time of a coded distributed computing system using an MDS code of block length $n$ and rate $R$. Then,

$k\left(\mathbb{E}\left[T^{\mathcal{C}_n}_{\rm exec}\right] - \mathbb{E}\left[T^{\rm MDS}_{\rm exec}\right]\right) \le \frac{c_7}{n^{\frac{1}{2} - \delta}}, \qquad (25)$

where $c_7$ is a constant independent of $n$.
The proof of (24) is analogous to that of Theorem 1, the only difference being that the application of Lemma 2 is replaced by the application of Lemma 3. The proof of (25) follows from (24) and from the lower bound on $\mathbb{E}[T^{\rm MDS}_{\rm exec}]$ in (21).
The upper bound in (25) roughly scales as $n^{-1/2}$, hence the result of Theorem 2 still does not match the performance of random binary linear codes. We conjecture that this is not due to a sub-optimality of polar codes using large kernels, but rather to a sub-optimality of our bounds. In fact, in the following section, we will give a simple heuristic argument suggesting that, for a family of polar codes of rate $1/2$ and scaling exponent $\mu$, the gap to optimum is $O(n^{-2/\mu})$ (as opposed to $O(n^{-1/\mu})$ via Corollary 1). Since polar codes with large kernels approach $\mu = 2$, this improved bound would match the $O(n^{-1})$ gap to optimum of random binary linear codes.
V Improving the Bounds: A Heuristic Argument and Numerical Experiments
We focus on the case $R = 1/2$. Our heuristic argument relies on the assumption that the error probability $P_e$ behaves as the lower bound in (5). We also assume the existence of a scaling law, i.e., of a function $f$ s.t.

$\lim_{n \to \infty} P_e\left(n,\, \epsilon = \frac{1}{2} + x\, n^{-1/\mu}\right) = f(x). \qquad (26)$

Here, $\mu$ represents the scaling exponent and $f$ is typically referred to as the mother curve. The idea is that, as the block length $n$ diverges, the gap to capacity scales as $n^{-1/\mu}$, and in this regime the error probability tends to the limit $f(x)$. In [24], the authors provide numerical and rigorous evidence that the lower bound in (5) satisfies such a scaling law. Furthermore, the scaling exponent for the BEC is estimated as $\mu \approx 3.63$, a value which is close to the rigorous upper bound of $3.639$ coming from [14].

We will also use that the derivative $f'$ of the mother curve is even. To justify this, recall that the erasure probabilities $\{\epsilon_m^{(i)}\}_{i=1}^{n}$ of the synthetic channels obtained by polarizing a BEC($\epsilon$) are given by the recursion (4) with initial condition $Z_0 = \epsilon$, and that $\mathcal{I}(\epsilon)$ is the set of indices corresponding to the $n/2$ synthetic channels with the smallest erasure probabilities. Similarly, the erasure probabilities $\{\tilde{\epsilon}_m^{(i)}\}_{i=1}^{n}$ obtained from the polarization of a BEC($1-\epsilon$) are given by (4) with initial condition $Z_0 = 1-\epsilon$, and $\mathcal{I}(1-\epsilon)$ is the corresponding set of indices. Then, one can easily verify that, after a suitable relabeling, $\tilde{\epsilon}_m^{(i)} = 1 - \epsilon_m^{(i)}$. As $R = 1/2$, this implies that $\mathcal{I}(1-\epsilon)$ is identified with $\mathcal{I}^c(\epsilon)$, where $\mathcal{I}^c(\epsilon)$ is the complement set of $\mathcal{I}(\epsilon)$. Furthermore, the smallest erasure probability outside $\mathcal{I}(\epsilon)$ essentially coincides with the largest one inside, i.e., $\min_{i \in \mathcal{I}^c(\epsilon)} \epsilon_m^{(i)} \approx \max_{i \in \mathcal{I}(\epsilon)} \epsilon_m^{(i)}$. Therefore, assuming that the lower bound in (5) is tight, $P_e(n, \epsilon) + P_e(n, 1-\epsilon) \approx 1$, which, under the scaling assumption (26), implies that $f(x) + f(-x) = 1$ and, hence, that $f'$ is even.
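The symmetry just described is easy to verify numerically (our sketch, using the lower bound in (5) as a proxy for $P_e$, in line with [24]): for $R = 1/2$, evaluating the proxy at $\epsilon = \frac{1}{2} \pm x\, n^{-1/\mu}$ gives values of $f(x) + f(-x)$ close to $1$.

```python
# Numerical check that f(x) + f(-x) is close to 1 for R = 1/2, using the
# lower bound in (5) (max over the information set) as a proxy for P_e.
import numpy as np

mu, R = 3.63, 0.5

def pe_lower(eps, m):
    z = np.array([eps])
    for _ in range(m):
        z = np.concatenate([2 * z - z**2, z**2])
    k = int(len(z) * R)
    return np.sort(z)[:k].max()        # largest erasure prob. in the info set

for m in [10, 14]:
    n = 2**m
    for x in [0.5, 1.0, 2.0]:
        s = pe_lower(0.5 + x * n ** (-1 / mu), m) + pe_lower(0.5 - x * n ** (-1 / mu), m)
        print(f"n={n}, x={x}: f(x) + f(-x) ≈ {s:.4f}")
```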
The goal of our heuristic argument is to show that

$\int_{t_-}^{t_+} P_e(\epsilon(t))\,{\rm d}t = (t_{1/2} - t_-) + O\!\left(\frac{1}{k\, n^{2/\mu}}\right), \qquad (27)$

where $t_{1/2}$ is the time at which $\epsilon(t_{1/2}) = 1/2$, the endpoints $t_\pm$ are s.t. $\epsilon(t_\pm) = \frac{1}{2} \mp \omega_n\, n^{-1/\mu}$ for some $\omega_n$ growing slowly with $n$, and $O(\cdot)$ hides factors that are polynomial in $\omega_n$.

Note that, arguing as in (11) and (15), the integral of $P_e(\epsilon(t))$ over the interval $[t_+, \infty)$ is negligible. Furthermore, by using the trivial upper bound $P_e(\epsilon(t)) \le 1$, the same integral over the interval $[0, t_-)$ is upper bounded by $t_-$. Hence, (27) implies that

$\mathbb{E}\left[T^{\rm polar}_{\rm exec}\right] \le t_{1/2} + O\!\left(\frac{1}{k\, n^{2/\mu}}\right), \qquad (28)$

which implies that the gap to optimum for polar codes scales as $n^{-2/\mu}$, as desired. In fact, $k\, t_{1/2} = 1 + \frac{\ln 2}{\lambda}$ coincides, up to $O(n^{-1})$ terms, with the normalized average execution time of MDS codes of rate $1/2$, cf. (19)-(21).
We now show how to obtain (27). By performing the change of variables $x = n^{1/\mu}\left(\epsilon(t) - \frac{1}{2}\right)$ and using the scaling law assumption (26), we have

$\int_{t_-}^{t_+} P_e(\epsilon(t))\,{\rm d}t \approx \frac{n^{-1/\mu}}{\lambda k}\int_{-\omega_n}^{\omega_n} \frac{f(x)}{\frac{1}{2} + x\, n^{-1/\mu}}\,{\rm d}x, \qquad (29)$

where we have used that, by (2), ${\rm d}t = -\frac{{\rm d}\epsilon}{\lambda k\,\epsilon}$. Next, we integrate by parts the RHS of (29), which gives

$\frac{n^{-1/\mu}}{\lambda k}\int_{-\omega_n}^{\omega_n} \frac{f(x)}{\frac{1}{2} + x\, n^{-1/\mu}}\,{\rm d}x = \underbrace{\frac{1}{\lambda k}\left[f(x)\,\ln\left(\frac{1}{2} + x\, n^{-1/\mu}\right)\right]_{-\omega_n}^{\omega_n}}_{T_1} \;\underbrace{-\; \frac{1}{\lambda k}\int_{-\omega_n}^{\omega_n} f'(x)\,\ln\left(\frac{1}{2} + x\, n^{-1/\mu}\right)\,{\rm d}x}_{T_2}. \qquad (30)$

The term $T_1$ is upper bounded by $\frac{1}{\lambda k}\ln\left(\frac{1}{2} + \omega_n\, n^{-1/\mu}\right)$, up to negligible corrections. In fact, $f(\omega_n) \le 1$ and $f(-\omega_n)$ is vanishingly small, which can be argued via the chain of bounds in (12) and (14). Finally, by performing a Taylor expansion of $\ln\left(\frac{1}{2} + x\, n^{-1/\mu}\right)$ around $x = 0$, the term $T_2$ is upper bounded by

$T_2 = \frac{1}{\lambda k}\left(\ln 2 \int_{-\omega_n}^{\omega_n} f'(x)\,{\rm d}x - 2\, n^{-1/\mu}\int_{-\omega_n}^{\omega_n} x\, f'(x)\,{\rm d}x\right) + O\!\left(\frac{\omega_n^2}{k\, n^{2/\mu}}\right) \le \frac{\ln 2}{\lambda k} + O\!\left(\frac{\omega_n^2}{k\, n^{2/\mu}}\right), \qquad (31)$

where we have used that the second term in the bracket is $0$, being the integral of an odd function over a symmetric interval ($f'$ is even, so $x\,f'(x)$ is odd), and that $\int_{-\omega_n}^{\omega_n} f'(x)\,{\rm d}x = f(\omega_n) - f(-\omega_n) \le 1$. By summing up these bounds on $T_1$ and $T_2$, we obtain that the quantity in (30) is upper bounded by

$\frac{1}{\lambda k}\,\ln\left(1 + 2\,\omega_n\, n^{-1/\mu}\right) + O\!\left(\frac{\omega_n^2}{k\, n^{2/\mu}}\right) = (t_{1/2} - t_-) + O\!\left(\frac{\omega_n^2}{k\, n^{2/\mu}}\right),$

where the equality follows from $t_{1/2} - t_- = \frac{1}{\lambda k}\ln\frac{\epsilon(t_-)}{1/2}$. Combined with (29), this implies the desired result (27).
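Finally, the conjectured $n^{-2/\mu}$ scaling can be tested numerically (our sketch; again the lower bound in (5) serves as a proxy for $P_e$, and all parameters are illustrative): we integrate the proxy for $P_e(\epsilon(t))$ at rate $1/2$ and compare the normalized excess over $k\,t_{1/2}$ with $n^{-2/\mu}$.

```python
# Numerical evidence for the conjectured O(n^(-2/mu)) gap at R = 1/2.
import numpy as np

lam, mu = 1.0, 3.63

def pe_lower(eps, m, k):
    z = np.array([eps])
    for _ in range(m):
        z = np.concatenate([2 * z - z**2, z**2])
    return np.sort(z)[:k].max()

for m in [8, 10, 12]:
    n = 2**m
    k = n // 2
    t_half = (1 + np.log(2.0) / lam) / k           # eps(t_half) = 1/2
    ts = np.linspace(1e-9, 4 * t_half, 1200)
    dt = ts[1] - ts[0]
    pe = [1.0 if t < 1 / k else pe_lower(float(np.exp(-lam * k * (t - 1 / k))), m, k)
          for t in ts]
    excess = k * (sum(pe) * dt - t_half)           # normalized excess over k*t_half
    print(n, "normalized excess =", excess, "  n^(-2/mu) =", n ** (-2 / mu))
```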
VI Conclusion
In this paper, we consider the average execution time of a coded computing system using polar codes. We show that the normalized performance gap between polar codes and the optimal MDS codes is upper bounded as $O(n^{-1/\mu})$, $n$ being the block length. Here, $\mu$ is the scaling exponent of polar codes for transmission over the BEC, and tight bounds have been proved on this quantity ($3.579 \le \mu \le 3.639$). Next, we consider polar codes based on large kernels, which are known to approach the optimal scaling exponent $\mu = 2$. For this family of codes, we prove that the gap to MDS codes scales roughly as $n^{-1/2}$. Finally, we conjecture that our bounds could be improved to $O(n^{-2/\mu})$ and $O(n^{-1})$, respectively, which would imply that polar codes with large kernels match the performance of random binary linear codes. Our conjecture is supported by a heuristic argument, as well as by numerical evidence.
Acknowledgments
D. Fathollahi and M. Mondelli were partially supported by the 2019 Lopez-Loreta Prize. The authors thank Hamed Hassani and Hessam Mahdavifar for helpful discussions.
References
- [1] T. Baharav, K. Lee, O. Ocal, and K. Ramchandran, “Straggler-proofing massive-scale distributed matrix multiplication with d-dimensional product codes,” in IEEE International Symposium on Information Theory (ISIT), 2018, pp. 1993–1997.
- [2] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, Mar. 2017.
- [3] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Polynomial codes: An optimal design for high-dimensional coded matrix multiplication,” in Neural Information Processing Systems (NeurIPS), 2017, p. 4406–4416.
- [4] A. Mallick, M. Chaudhari, and G. Joshi, “Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication,” Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 3, pp. 1 – 40, 2019.
- [5] A. Severinson, A. G. i Amat, and E. Rosnes, “Block-diagonal and LT codes for distributed computing with straggling servers,” IEEE Transactions on Communications, vol. 67, no. 3, pp. 1739–1753, Mar. 2019.
- [6] B. Bartan and M. Pilanci, “Straggler resilient serverless computing based on polar codes,” in 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2019, pp. 276–283.
- [7] M. Pilanci, “Computational polarization: An information-theoretic method for resilient computing,” IEEE Transactions on Information Theory, to appear.
- [8] M. Soleymani, M. V. Jamali, and H. Mahdavifar, “Coded computing via binary linear codes: Designs and performance limits,” IEEE Journal on Selected Areas in Information Theory, vol. 2, no. 3, pp. 879–892, Sept. 2021.
- [9] E. Arikan, “Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels,” IEEE Transactions on Information Theory, vol. 55, no. 7, pp. 3051–3073, July 2009.
- [10] I. Tal and A. Vardy, “How to construct polar codes,” IEEE Transactions on Information Theory, vol. 59, no. 10, pp. 6562–6582, Oct. 2013.
- [11] R. Pedarsani, H. Hassani, I. Tal, and E. Telatar, “On the construction of polar codes,” in IEEE International Symposium on Information Theory (ISIT), St. Petersburg, Russia, Aug. 2011, pp. 11–15.
- [12] M. Mondelli, S. H. Hassani, and R. Urbanke, “Construction of polar codes with sublinear complexity,” IEEE Transactions on Information Theory, vol. 65, no. 5, pp. 2782–2791, May 2019.
- [13] E. Arıkan and I. E. Telatar, “On the rate of channel polarization,” in IEEE International Symposium on Information Theory (ISIT), Seoul, South Korea, July 2009, pp. 1493–1495.
- [14] M. Mondelli, S. H. Hassani, and R. L. Urbanke, “Unified scaling of polar codes: Error exponent, scaling exponent, moderate deviations, and error floors,” IEEE Transactions on Information Theory, vol. 62, no. 12, pp. 6698–6712, 2016.
- [15] “Final report of 3GPP TSG RAN WG1 #87 v1.0.0,” Reno, USA, Nov. 2016.
- [16] S. H. Hassani, K. Alishahi, and R. L. Urbanke, “Finite-length scaling for polar codes,” IEEE Transactions on Information Theory, vol. 60, no. 10, pp. 5875–5898, Oct. 2014.
- [17] D. Goldin and D. Burshtein, “Improved bounds on the finite length scaling of polar codes,” IEEE Transactions on Information Theory, vol. 60, no. 11, pp. 6966–6978, Nov. 2014.
- [18] V. Guruswami and P. Xia, “Polar codes: Speed of polarization and polynomial gap to capacity,” IEEE Transactions on Information Theory, vol. 61, no. 1, pp. 3–16, Jan. 2015.
- [19] M. Mondelli, S. H. Hassani, and R. Urbanke, “Scaling exponent of list decoders with applications to polar codes,” IEEE Transactions on Information Theory, vol. 61, no. 9, pp. 4838–4851, Sept. 2015.
- [20] A. Fazeli, H. Hassani, M. Mondelli, and A. Vardy, “Binary linear codes with optimal scaling: Polar codes with large kernels,” IEEE Transactions on Information Theory, vol. 67, no. 9, pp. 5693–5710, Sept. 2021.
- [21] V. Guruswami, A. Riazanov, and M. Ye, “Arıkan meets Shannon: Polar codes with near-optimal convergence to channel capacity,” in 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2020, pp. 552–564.
- [22] A. Reisizadeh, S. Prakash, R. Pedarsani, and A. S. Avestimehr, “Coded computation over heterogeneous clusters,” IEEE Transactions on Information Theory, vol. 65, no. 7, pp. 4227–4242, July 2019.
- [23] G. Liang and U. C. Kozat, “TOFEC: Achieving optimal throughput-delay trade-off of cloud storage using erasure codes,” in IEEE INFOCOM 2014-IEEE Conference on Computer Communications, 2014, pp. 826–834.
- [24] S. B. Korada, A. Montanari, E. Telatar, and R. Urbanke, “An empirical scaling law for polar codes,” in 2010 IEEE International Symposium on Information Theory, 2010, pp. 884–888.
- [25] S. B. Korada, E. Şaşoğlu, and R. Urbanke, “Polar codes: Characterization of exponent, bounds, and constructions,” IEEE Transactions on Information Theory, vol. 56, no. 12, pp. 6253–6264, Dec. 2010.
- [26] T. Richardson and R. Urbanke, Modern coding theory. Cambridge university press, 2008.
Appendix: Proof of Lemma 3
Proof:
Let $h(x) = (x(1-x))^{\eta}$, with $\eta = \delta/12$. By combining Lemma 5 and Lemma 6 in [20], we have that, for any $x \in [0,1]$, with high probability over the choice of $K_\ell$,

$\frac{1}{\ell}\sum_{i=1}^{\ell} h\left(p_i(x)\right) \le c\,\ell^{-\left(\frac{1}{2} - \frac{\delta}{2}\right)}\,h(x), \qquad (33)$

where $c$ is a numerical constant. Then, the following chain of inequalities holds:

$\mathbb{P}\left(Z_m \in \left(\ell^{-3m},\, 1 - \ell^{-3m}\right)\right) \le \mathbb{P}\left(h(Z_m) \ge h\left(\ell^{-3m}\right)\right) \le \frac{\mathbb{E}\left[h(Z_m)\right]}{h\left(\ell^{-3m}\right)} \le \frac{c^m\,\ell^{-m\left(\frac{1}{2}-\frac{\delta}{2}\right)}\,h(\epsilon)}{h\left(\ell^{-3m}\right)} \le 2\,c^m\,\ell^{-m\left(\frac{1}{2}-\frac{\delta}{2}\right) + 3m\eta} \le 2\,\ell^{-m\left(\frac{1}{2}-\delta\right)}. \qquad (34)$

Here, in the first inequality we use the concavity of $h$ together with its symmetry around $1/2$, which imply that $h(x) \ge h(\ell^{-3m})$ for all $x \in (\ell^{-3m}, 1 - \ell^{-3m})$; in the second inequality, we use Markov’s inequality; in the third inequality, we use (33) iterated $m$ times via (6); in the fourth inequality, we use that $h(\epsilon) \le 1$ and that $h(\ell^{-3m}) \ge \ell^{-3m\eta}/2$; and in the last inequality, we use that $c \le \ell^{\delta/4}$ for $\ell \ge \ell_0(\delta)$, having chosen $\eta = \delta/12$.
The rest of the argument follows the proof of Lemma 4 in [20], and it is briefly outlined below. Denote by $\epsilon_m^{(i)}$, $i \in \{1, \ldots, \ell^m\}$, the erasure probabilities of the synthetic channels, and let us define

$\mathcal{A} = \left\{i : \epsilon_m^{(i)} \le \ell^{-3m}\right\}, \qquad \mathcal{B} = \left\{i : \epsilon_m^{(i)} \in \left(\ell^{-3m},\, 1 - \ell^{-3m}\right)\right\}, \qquad \mathcal{C} = \left\{i : \epsilon_m^{(i)} \ge 1 - \ell^{-3m}\right\}. \qquad (35)$

Furthermore, let $\theta_{\mathcal{A}}$, $\theta_{\mathcal{B}}$, and $\theta_{\mathcal{C}}$ be the fraction of indices in $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$, respectively, that will have a vanishing erasure probability, i.e., with a slight abuse of notation, for $\mathcal{X} \in \{\mathcal{A}, \mathcal{B}, \mathcal{C}\}$,

$\theta_{\mathcal{X}} = \lim_{m' \to \infty} \frac{\left|\left\{i' \in \{1, \ldots, \ell^{m'}\} : \epsilon_{m'}^{(i')} \to 0 \text{ and } i' \text{ descends from some } i \in \mathcal{X}\right\}\right|}{\ell^{m'-m}\,\left|\mathcal{X}\right|}. \qquad (36)$

The existence of the limits in (36) – and therefore the well-posedness of $\theta_{\mathcal{A}}$, $\theta_{\mathcal{B}}$, and $\theta_{\mathcal{C}}$ – is proved in Lemma 4 of [20].
A simple counting argument over binary subspaces shows that, with high probability, none of the column permutations of $K_\ell$ is upper triangular (cf. (47) in [20]). Hence, with high probability, $K_\ell$ is polarizing [25], which implies that the overall fraction of noiseless synthetic channels equals the capacity of the original channel:

$\theta_{\mathcal{A}}\,\frac{|\mathcal{A}|}{\ell^m} + \theta_{\mathcal{B}}\,\frac{|\mathcal{B}|}{\ell^m} + \theta_{\mathcal{C}}\,\frac{|\mathcal{C}|}{\ell^m} = 1 - \epsilon. \qquad (37)$
In order to upper bound $\theta_{\mathcal{B}}\,\frac{|\mathcal{B}|}{\ell^m}$, we use the definitions (35) and (36) of $\mathcal{B}$ and $\theta_{\mathcal{B}}$ as well as (34):

$\theta_{\mathcal{B}}\,\frac{|\mathcal{B}|}{\ell^m} \le \frac{|\mathcal{B}|}{\ell^m} = \mathbb{P}\left(Z_m \in \left(\ell^{-3m},\, 1 - \ell^{-3m}\right)\right) \le 2\,\ell^{-m\left(\frac{1}{2}-\delta\right)}. \qquad (38)$

Furthermore, by using the definition (36), we can upper bound $\theta_{\mathcal{C}}$ as

$\theta_{\mathcal{C}} \le \max_{i \in \mathcal{C}}\, \lim_{m' \to \infty} \frac{\left|\left\{i' : \epsilon_{m'}^{(i')} \to 0 \text{ and } i' \text{ descends from } i\right\}\right|}{\ell^{m'-m}}. \qquad (39)$

As the kernel $K_\ell$ is polarizing (with high probability), the RHS of (39) equals the capacity of a BEC with erasure probability at least $1 - \ell^{-3m}$. Therefore,

$\theta_{\mathcal{C}} \le \ell^{-3m}. \qquad (40)$

As a result, we conclude that $\frac{|\mathcal{A}|}{\ell^m}$ is lower bounded as

$\frac{|\mathcal{A}|}{\ell^m} \ge \theta_{\mathcal{A}}\,\frac{|\mathcal{A}|}{\ell^m} = 1 - \epsilon - \theta_{\mathcal{B}}\,\frac{|\mathcal{B}|}{\ell^m} - \theta_{\mathcal{C}}\,\frac{|\mathcal{C}|}{\ell^m} \ge 1 - \epsilon - 2\,\ell^{-m\left(\frac{1}{2}-\delta\right)} - \ell^{-3m} \ge 1 - \epsilon - c_5\,\ell^{-m\left(\frac{1}{2}-\delta\right)}, \qquad (41)$

where the equality follows from (37), and the second inequality follows from (38) and (40). Since $\mathbb{P}\left(Z_m \le \ell^{-3m}\right) = \frac{|\mathcal{A}|}{\ell^m}$, the desired result (22) is readily implied by (41), with $c_5 = 3$. ∎