
Nearly-Optimal Consensus Tolerating Adaptive Omissions:
Why is a Lot of Randomness Needed?

Mohammad T. Hajiaghayi
University of Maryland, Maryland, USA
   Dariusz R. Kowalski
School of Computer and Cyber Sciences, Augusta University, Georgia, USA
   Jan Olkowski
University of Maryland, Maryland, USA
Abstract

We study the complexity of reaching agreement in a synchronous distributed system, also called consensus, by $n$ autonomous parties, when the communication links from/to faulty parties can omit messages. The faulty parties are selected and controlled by an adaptive, full-information, computationally unbounded adversary. We design a randomized algorithm that works in $O(\sqrt{n}\log^{2}n)$ rounds and sends $O(n^{2}\log^{3}n)$ communication bits in total, where the number of faulty parties can be $\Theta(n)$. When the number of faulty parties is linear in $n$, our result is simultaneously tight for both these measures within polylogarithmic factors: due to the $\Omega(n^{2})$ lower bound on the number of messages sent by any Monte Carlo solution, by Abraham et al. (PODC'19), and due to the $\Omega(\sqrt{n/\log n})$ lower bound on the number of rounds of any Las Vegas solution, by Bar-Joseph and Ben-Or (PODC'98). Thereby, this work settles the landscape of the consensus problem in the omission failures model, which stood as an open question since the work of Dolev and Strong (SICOMP'83).

Additionally, we quantify precisely how much randomness is necessary and sufficient to reduce the time complexity to a given value, while keeping the communication complexity optimal up to polylogarithmic factors. We prove that no Monte Carlo algorithm can work in fewer than $\Omega\left(\frac{n^{2}}{\max\{R,n\}\log n}\right)$ rounds if it uses fewer than $O(R)$ calls to a random source, assuming a constant fraction of all parties is faulty. This result should be contrasted with a long line of work on consensus algorithms against an adversary limited to polynomial computation time, thus unable to break cryptographic primitives, culminating in the work of Ghinea et al. (EUROCRYPT'22), where an optimal $O(r)$-round solution reaching consensus with probability $1-(cr)^{-r}$ is given. Our lower bound strictly separates these two regimes, by excluding such results if the adversary is computationally unbounded.

On the upper bound side, we show that for $R\in\tilde{\mathcal{O}}(n^{3/2})$ there exists a randomized algorithm solving consensus in $\tilde{\mathcal{O}}\left(\frac{n^{2}}{R}\right)$ rounds, with probability polynomially close to $1$ (whp), where the tilde notation hides a poly-logarithmic factor. The communication complexity of the algorithm does not depend on the amount of randomness $R$ and stays (universally) optimal within polylogarithmic factors. As a consequence, we give a spectrum of solutions that interpolates between optimal results in the deterministic regime ($R\in O(n)$; $O(1)$ entropy per party) and the randomized regime ($R\in O(n^{3/2})$ random bits).

1 Introduction

In any distributed system, reaching agreement (consensus) is essential for coordinating the actions of the participating $n$ parties (also called processes), out of which up to $t$ could be faulty. The hardness of the task primarily depends on the type of fault present in the system. The classical hierarchies [Attiya-Welch-book2004, Lynch-book96] for synchronous message-passing models distinguish the following types of faults, in order of increasing hardness: crash failures, omission failures, authenticated Byzantine failures, and Byzantine failures. In this work, we focus on the second type of failures in this list, the omission failures. Precisely, we assume that a computationally unbounded and malicious adversary can observe the system during the computation and omit an arbitrary subset of messages sent to / received from selected faulty processes in an online, adaptive fashion. The adversary can also, based on the history of the computation, corrupt new processes as long as the number of corrupted processes stays within a fixed limit $t$. The adversary, however, cannot see the future random bits.

Despite the huge volume of research on the performance of consensus algorithms, the proper assessment of the hardness of consensus under this type of failure has been elusive. There is no theoretical evidence that omission failures are weaker than the next model in the hierarchy, authenticated Byzantine, even if such a hypothesis seems compelling. Compared to the weaker model of crash failures, there is a strict hardness barrier following from the existence of an algorithm using $O(n^{3/2}\log^{13/2}n)$ messages by Hajiaghayi et al. [DBLP:conf/stoc/HajiaghayiKO22] (STOC'22) in the case of crash failures and the $\Omega(n^{2})$ lower bound on the number of messages in the model with omission failures, proved by Dolev and Reischuk [DolevR85] (JACM'85) for deterministic solutions and by Abraham et al. [AbrahamCDNPRS19] (PODC'19) for randomized ones, all results assuming $\Theta(n)$ faulty parties. However, if the space of solutions is categorized based on the round complexity of a solution, even for these two models (i.e., crashes and omissions) the picture is not clear. Both models admit an $\Omega(\sqrt{n/\log n})$ lower bound for Las Vegas solutions, due to the work of Bar-Joseph and Ben-Or [Bar-JosephB98] (PODC'98) (when the number of faulty parties is linear in the system size). In the case of crashes, the lower bound is matched by an algorithm of the same authors [Bar-JosephB98], while in the case of omissions, the best previous solution is the 40-year-old result of Dolev and Strong [DolevS83] (SICOMP'83) that works in $O(n)$ rounds! Hence, our first goal is to fully understand the hardness of omission failures in reaching consensus:
Question 1: Is there a consensus algorithm, possibly randomized, that matches both lower bounds simultaneously, up to poly-logarithmic factors? That is, is there an algorithm that solves consensus in $\tilde{O}(\sqrt{n})$ rounds with $\tilde{O}(n^{2})$ total sent communication bits even if a linear number of omission failures occur in the network?

We answer this question affirmatively and give a new algorithm that is almost-optimal (in this work, almost-optimal means optimal within a poly-logarithmic factor) with respect to, simultaneously, the number of rounds and the total number of sent communication bits when the number of faults is $\Theta(n)$.

Then, we focus on the question of how much randomness is necessary to break the $\Omega(n)$ lower bound on the number of rounds for deterministic solutions [FischerL82] (again, assuming $\Theta(n)$ faults):
Question 2: What is the impact of the amount of randomness available at processes on the efficiency of consensus algorithms? In particular, could pseudo-random generators be used efficiently?

We quantify the number of random bits that is sufficient and necessary to achieve a given running time $\in\tilde{\Omega}(\sqrt{n})$ (we use the tilde notation to hide a polylogarithmic factor), while keeping almost-optimal communication complexity. In particular, we prove that using pseudo-random generators with small seeds may delay reaching consensus nearly quadratically, which may have severe consequences in distributed ledger implementations and distributed database applications based on consensus. Even more interestingly, our lower bounds apply to Monte Carlo algorithms. This makes a surprising distinction from a parallel line of work, done by the cryptography community, on consensus protocols in the case of authenticated Byzantine failures governed by a computationally-bounded adversary [abraham2019synchronous, fitzi2003efficient, ghinea2022round, katz2006expected]. In that model, the adversary is still adaptive, but the algorithm has a trusted setup for unique threshold signatures and an unforgeable public-key infrastructure; thus, the view of the adversary is limited to the history of the faulty processes and all messages sent to them. Nevertheless, the adversary can adaptively corrupt new parties based on its view and has priority over messages being delivered in the round in which it corrupts. In this setting, the result of Ghinea et al. [ghinea2022round] (EUROCRYPT'22) shows an algorithm that terminates in $r$ rounds with probability at least $1-(cr)^{-r}$, for any $r$ and some absolute constant $c>0$. Our lower bound excludes such results in the model with a computationally unbounded adversary, showing that not only many random bits but also strong cryptography could be necessary for some applications of consensus protocols.

Summary of results and the paper structure.

As mentioned earlier, we focus on the classical message-passing model in which processes are autonomous, synchronous, and can exchange messages via point-to-point communication channels. The adversary corrupts and controls at most $t$ processes, causes omission failures at them, and is (full-information) strongly adaptive. In Sections 3, LABEL:sec:overview-lower, and LABEL:sec:overview-randomness, we present an extended overview of the three main results and novel techniques obtained in this work; for the sake of the overview, we typically assume $t=\Theta(n)$. A summary can be found in Table 1. In Section LABEL:sec:related we discuss the related work in more detail. Section 2 presents a detailed formal description of the model. Sections LABEL:sec:main-algo, LABEL:sec:proof-lower, and LABEL:sec:tradeoff-upper contain the full and formal analysis of our main results. Section LABEL:sec:future states major research directions opened by our work.

result | time | comm. bits | random bits | comments
algorithms:
  Thm 1 | $O(\sqrt{n}\log^{2}n)$ | $O(n^{2}\log^{3}n)$ | $O(n^{3/2}\log^{2}n)$ |
  Thm LABEL:thm:trade-off-res | $O(\frac{n^{2}}{R}\log^{2}n)$ | $O(n^{2}\log^{5}n)$ | $O(R\log^{2}n)$ | $\forall R\in O(n^{3/2})$
lower bounds:
  [Bar-JosephB98] | $\Omega(\frac{t}{\sqrt{n\log n}})$ | - | - | correct prob. $=1$
  [AbrahamCDNPRS19] | - | $\Omega(\epsilon t^{2})$ | - | correct prob. $\geq\frac{3}{4}+\epsilon$
  Thm LABEL:thm:lower-randomness-res | $T$ | - | $R$ | $T\times(R+T)=\Omega(\frac{t^{2}}{\log n})$; correct prob. $\geq 1-\frac{1}{n^{3/2}}$
Table 1: Our main results presented for three metrics: time complexity, total number of communication bits used, and total number of random bits used. All our algorithms are subject to their complexity bounds whp, i.e., with probability polynomially close to $1$, see Section 2. $T$ and $R$ are random variables denoting, respectively, the number of rounds and the cumulative number of random bits used by all processes in a run of an algorithm. The "correct prob." column denotes the probability of correctness of the class of algorithms to which the lower bound applies. The new results are Theorems 1, LABEL:thm:trade-off-res, and LABEL:thm:lower-randomness-res. Extended versions of Theorems 1 and LABEL:thm:trade-off-res, with explicit dependency on the parameter $t$, are given in Sections LABEL:sec:main-algo and LABEL:sec:tradeoff-upper, resp.

2 Model Details and Definitions

We consider the classical synchronous message-passing distributed system with omission faults, cf. [Attiya-Welch-book2004, Lynch-book96, raipin2010strongly]. The system contains $n$ processes, also called parties. Each process has a unique ID in $\mathcal{P}=[n]=\{1,\ldots,n\}$. For simplicity, we will use $p$ to refer to a process with ID $p\in\mathcal{P}$. Both $\mathcal{P}$ and $n$ are known to all processes. Processes operate in synchronized rounds. Without loss of generality (see [Attiya-Welch-book2004, Bar-JosephB98] for a discussion of the generality of our assumptions and the related settings of crash and Byzantine faults), we assume that each round consists of the following two phases:

  1. Local computation phase: Each process performs a local computation (i.e., autonomously from other processes) in order to change its state. The computation can be any function of the current state, of all messages received prior to this phase, and of a sequence of uniformly distributed and independent random bits of an arbitrary, but finite, length that can be accessed by the process at the beginning of the phase. More precisely, for the last parameter of the function, we assume that there exists a random source that, when called, provides the process and its state-changing function with a 0-1 sequence of the requested length, containing uniformly and independently distributed random bits.

  2. Communication phase: During this phase, each process can send messages to any other processes in the system. The content of a message is a function of the current state computed in the preceding local computation phase, and is not limited by the model. In particular, the length and content of messages are not restricted by the model, although our algorithms are designed to use short messages. Each message sent is delivered to its destination at the end of the same phase, and is ready to be processed in the next round, unless an omission failure occurs.

Processes’ omission failures and adversaries.

Not all processes are reliable. Up to some $t=O(n)$ processes may become (omission) faulty during the execution – once a process becomes faulty, it stays faulty through the end of the computation and some of its incoming/outgoing messages could be lost. We assume that $t$ is a part of the problem input and thus known up-front to all processes. The decision of which process becomes faulty and when, as well as the control over the faulty processes, is governed by an adversary. We consider an adaptive full-power full-information adversary that

  • has unlimited computational power, knows the algorithm and input parameters and can see the states (and thus also the current random bits used) of all processes, as well as the content of all arriving messages, at any time, and

  • can select online which (non-faulty) processes to fail and when, and, with respect to faulty processes, it can omit any subset of messages incoming/outgoing to/from the faulty processes (i.e., such messages are not delivered to their destinations, having the same effect as if no message were sent).

Consequently, an adversarial strategy is a deterministic function, which assigns to each possible history that may occur in any execution some adversarial action for the subsequent phase of the execution, i.e., which processes to fail in that phase and at which moment, and which messages sent by/to them will reach their destinations. Note that, in the above definition, the adversary has the flexibility to adapt its actions between any two phases of the algorithm (not only between consecutive rounds). In the remainder, we refer to this adversary simply as the adaptive adversary.
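As an illustration of this definition, the sketch below (with hypothetical names and simplified observables) phrases one adversarial strategy as a deterministic function of the observed history: it adaptively corrupts senders it sees, up to the budget $t$, and omits all messages from/to corrupted processes.

```python
# A hypothetical sketch of an adversarial strategy; all names are ours.

class SilencingAdversary:
    """Corrupts the first t senders it observes and omits all their messages."""

    def __init__(self, t):
        self.t = t
        self.faulty = set()   # processes corrupted so far (at most t)

    def on_phase(self, history):
        # Full information: the strategy may inspect the entire history,
        # including states, random bits already drawn, and messages in transit.
        for (sender, _dest, _msg) in history.get("in_transit", []):
            if len(self.faulty) < self.t:
                self.faulty.add(sender)   # adaptively corrupt a new process

    def omits(self, sender, dest, msg):
        # The adversary may drop any subset of messages from/to faulty
        # processes; this simple strategy drops them all.
        return sender in self.faulty or dest in self.faulty
```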

We remark that crash failures of processes can be viewed as omission failures – the adversary simply blocks all their incoming and outgoing messages after the failure round (while in the round of a crash, the adversary could allow any subset of outgoing messages to reach their destinations). In the case of crashes, incoming messages are not relevant, because the faulty process cannot influence correct processes any more and it is not required to satisfy any of the consensus properties.

Consensus problem.

A randomized algorithm for processes $\mathcal{P}=\{1,\ldots,n\}$, where each process $p\in\mathcal{P}$ holds an initial input $b_{p}\in\{0,1\}$, is a consensus protocol tolerating $t$ faulty processes if all three of the following conditions hold with probability $1$ in the presence of an adaptive adversary failing/controlling at most $t$ processes:

Agreement. All non-faulty processes output the same value.

Validity. If all non-faulty processes begin with the same input value $b$, then all of them output $b$.

Termination. Non-faulty processes have to decide and terminate.
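For concreteness, the three conditions can be phrased as a simple check on a finished execution; the sketch below (with hypothetical names) tests Termination, Agreement, and Validity given the inputs and outputs of all processes.

```python
def check_consensus(inputs, outputs, faulty):
    # inputs, outputs: dicts mapping process ID -> bit; faulty: set of IDs.
    good = [p for p in inputs if p not in faulty]
    # Termination: every non-faulty process must have decided.
    assert all(p in outputs for p in good), "Termination violated"
    # Agreement: all non-faulty processes output the same value.
    decisions = {outputs[p] for p in good}
    assert len(decisions) == 1, "Agreement violated"
    # Validity: if all non-faulty inputs equal b, the decision is b.
    in_bits = {inputs[p] for p in good}
    if len(in_bits) == 1:
        assert decisions == in_bits, "Validity violated"
```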

Consider a randomized consensus algorithm against a fixed adversarial strategy. The following metrics determine the quality of the execution against that strategy:

  • Time of an execution of the algorithm is defined as the smallest number $\tau_{1}$ such that the number of rounds that occur by the termination of the last non-faulty process is at most $\tau_{1}$;

  • The number of communication bits in an execution of the algorithm is the smallest number $\tau_{2}$ such that the total number of bits sent by all processes in point-to-point messages by the termination of the last non-faulty process is at most $\tau_{2}$;

  • Randomness of an execution of the algorithm is defined as the smallest number $\tau_{3}$ such that the number of (independent and uniform) random bits used by all processes by the termination of the last non-faulty process is at most $\tau_{3}$. When describing the lower bound result, we abuse the notation slightly and define the randomness of an execution as the total number of times processes access their random sources during the local computation phase. Complying with the definition of the local computation phase, in each such access a process can use a sequence of random bits of finite length. Observe that such a definition makes a lower bound result even stronger.

We define the time/communication/randomness complexity of a (distributed) algorithm as the supremum of the time, the number of communication bits, and the number of random bits, respectively, taken over all adversarial strategies. (Note that the suprema for different measures could be achieved by sequences of different strategies – nevertheless, each of the complexities of our solutions is still close to optimal.) Finally, the time/communication/randomness complexity of a distributed problem is the infimum of all algorithms' time/communication/randomness complexities, respectively. In our paper, we are interested in studying almost-optimal solutions, i.e., optimal within a polylogarithmic factor, wrt the abovementioned complexities. The bounds on the complexities should hold with high probability. We say that a random event occurs with high probability (whp for short) if its probability can be made $1-O(n^{-c})$ for any positive constant $c$ by linear scaling of the parameters of the considered random process. (We note that, after ignoring polylogarithmic factors, all our upper bounds hold with the same asymptotic complexities in the even stronger regime where the probabilities are of the form $1-O(n^{-\omega(1)})$.)

3 Main consensus algorithm

Our first and main result is a new consensus algorithm which is almost-optimal with respect to time complexity, bit complexity and randomness complexity, if the number of faulty parties is $\Theta(n)$.

Theorem 1.

There is a randomized algorithm solving consensus with probability $1$ against the adaptive omission adversary that can control $t<\frac{n}{30}$ processes, which terminates in $O(\sqrt{n}\log^{2}n)$ rounds and uses $O(n^{2}\log^{3}n)$ bits of communication and $O(n\cdot\sqrt{n}\log^{2}n)$ random bits, whp.

For ease of presentation, and due to space constraints, in this section we assume that $t=\frac{n}{30}-1$ and we provide a high-level overview of the techniques used to obtain the result. A self-contained and fully formal derivation of this theorem, incorporating the upper bound $t$ on the number of faulty processes into the time and communication complexities and containing all the omitted proofs, is deferred to Appendix LABEL:sec:main-algo, where the above theorem is restated as Theorem LABEL:thm:main-algo.

In the case when $t=\Theta(n)$, the almost-optimality of the running time follows from the $\Omega(\sqrt{n/\log n})$ lower bound shown in [Bar-JosephB98]. The almost-optimality of the communication bit complexity is due to the result of Abraham et al. [AbrahamCDNPRS19], who showed that any randomized algorithm solving consensus with at least a constant positive probability against the adaptive omission-causing adversary requires $\Omega(n^{2})$ messages (each message carries at least one bit). The almost-optimality of the randomness complexity follows from Theorem LABEL:thm:lower-randomness-res presented in the later part of the paper. We next give an overview of the algorithm. The pseudocode can be found in Algorithm 1.

input: $\mathcal{P}$, $p$, $b_{p}$, $t$
1: $\texttt{operative}_{p}\leftarrow true$, $\texttt{decided}_{p}\leftarrow false$;
2: $V_{p}\leftarrow$ a set of neighbors of $p$ in a predetermined graph $G$ guaranteed by Theorem LABEL:thm:random-graph-properties;
3: $W_{1},\ldots,W_{\lceil\sqrt{n}\rceil}\leftarrow$ a pre-defined partition of $\mathcal{P}$ into $\lceil\sqrt{n}\rceil$ disjoint sets of size $\leq\lceil\sqrt{n}\rceil$ each;
4: let $\ell$ be such that $p\in W_{\ell}$;
5: for $\frac{t}{\sqrt{n}}\log n$ epochs do
6:   $\texttt{g\_ones}_{p},\texttt{g\_zeros}_{p},\texttt{operative}_{p}\leftarrow\textsc{GroupBitsAggregation}(W_{\ell},p,\texttt{operative}_{p};b_{p})$;
7:   if $\texttt{operative}_{p}=false$ then stay idle until the end of the epoch;
8:   $\texttt{ones}_{p},\texttt{zeros}_{p},\texttt{operative}_{p}\leftarrow\textsc{GroupBitsSpreading}(V_{p},p,\ell,\texttt{operative}_{p};\texttt{g\_ones}_{p},\texttt{g\_zeros}_{p})$;
9:   if $\texttt{ones}_{p}>\frac{18}{30}(\texttt{ones}_{p}+\texttt{zeros}_{p})$ then $b_{p}\leftarrow 1$;
10:  else if $\texttt{ones}_{p}<\frac{15}{30}(\texttt{ones}_{p}+\texttt{zeros}_{p})$ then $b_{p}\leftarrow 0$;
11:  else set $b_{p}$ to $0$ or $1$ uniformly at random;
12:  if $\texttt{ones}_{p}>\frac{27}{30}(\texttt{ones}_{p}+\texttt{zeros}_{p})$ or $\texttt{ones}_{p}<\frac{3}{30}(\texttt{ones}_{p}+\texttt{zeros}_{p})$ then $\texttt{decided}_{p}\leftarrow true$;
13: end for
14: if $\texttt{operative}_{p}=true$ and $\texttt{decided}_{p}=true$ then send $b_{p}$ to all processes in $\mathcal{P}$;
15: else if any message $b_{q}$ is received from some process $q$ then $b_{p}\leftarrow b_{q}$;  /* $q$ can be chosen arbitrarily from the received messages */
16: if $\texttt{decided}_{p}=true$ or ($\texttt{operative}_{p}=false$ and $p$ received a message in the previous round) then decide $b_{p}$;
17: else
18:   if $\texttt{operative}_{p}=true$ then $p$ participates in the deterministic synchronous consensus algorithm given in Theorem 4 in [DolevS83] with input bit $b_{p}$; if $p$ reaches agreement in that protocol, it broadcasts the decision to all processes in $\mathcal{P}$ and decides on the algorithm's decision;
19:   else $p$ remains idle until a decision is sent to it; upon receiving a decision, it decides on this value;
20: end if
Algorithm 1 OptimalOmissionsConsensus

Universal idea: Local and dynamic partitioning of processes into operative / inoperative and implementing time- and communication-efficient biased-majority-voting only by the operative ones.

We introduce a new partitioning of processes into operative and inoperative, based on the communication received by each process from a certain pre-defined set of other processes which maintain their operative status (this set may vary, depending on which procedure is executed – this will be emphasized later). This partition is not equivalent to the standard classification into faulty / non-faulty processes. With our partition, we can guarantee that faulty processes either communicate well enough to contribute to the progress towards a unified decision (i.e., stay operative) or become excluded from the set of operative processes, having no impact on the final decision. Our partition also avoids a major performance problem in omission-tolerant or Byzantine-tolerant computation – namely, that identifying a single faulty process may require at least a quadratic number of messages, cf. [AbrahamCDNPRS19] – as it is fast, local, and incorporated into efficient communication schedules.

Then, we employ the idea of reaching consensus by applying the biased-majority-voting rule, as proposed in [Bar-JosephB98], but with a novel twist – only the operative processes run the vote protocol to agree on a consensus decision. It uses $O(\sqrt{n}\log n)$ repetitions of the single-vote subroutine, called epochs in the pseudocode; see lines 5–13 in Algorithm 1. Each repetition/epoch consists of $O(\log n)$ rounds of communication-efficient counting (see the description below); however, it succeeds in unifying the votes only if the number of newly failed processes is $O(\sqrt{n})$, and only with a constant probability (this is why we need $O(\sqrt{n}\log n)$ epochs). Only after the part implementing the biased-majority-voting rule ends do the operative processes communicate the decision to the remaining parties. We first describe how we implement a single epoch, based on two technical advancements, and conclude with more details on how the overall consensus protocol (based on the biased-majority-voting rule) is designed; a sketch of one process's local control flow across epochs follows below.
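The following Python sketch, under our own naming and with the two communication subroutines stubbed out, traces the local view of one process over the epochs (cf. lines 5–13 of Algorithm 1); it is an illustration of the control flow, not the paper's implementation.

```python
import random

def run_epochs(b, num_epochs, aggregate, spread):
    # `aggregate` and `spread` stand in for GroupBitsAggregation /
    # GroupBitsSpreading; each returns (ones, zeros, operative).
    operative, decided = True, False
    for _ in range(num_epochs):
        g_ones, g_zeros, operative = aggregate(b, operative)
        if not operative:
            break                      # protocol: stay idle for the rest
        ones, zeros, operative = spread(g_ones, g_zeros, operative)
        total = ones + zeros           # estimate of all operative processes
        if ones > 18 * total / 30:
            b = 1
        elif ones < 15 * total / 30:
            b = 0
        else:
            b = random.randint(0, 1)   # the biased-majority random coin
        if ones > 27 * total / 30 or ones < 3 * total / 30:
            decided = True             # overwhelming majority: safe to decide
    return b, decided, operative

# Toy usage with stub subroutines that keep everyone operative:
agg = lambda b, op: (60 + b, 40 - b, op)
spr = lambda go, gz, op: (go, gz, op)
print(run_epochs(1, 3, agg, spr))      # -> (1, False, True)
```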

An implementation of a single biased-majority-vote subroutine (epoch).

In our case, a single epoch (i.e., a single repetition of the biased-majority-vote subroutine, lines 6–12) heavily relies on counting, collaboratively by every operative process, the number of operative processes that have candidate decision value $0$ and, separately, value $1$. These numbers must be approximate, up to an additive factor linearly dependent on the number of processes that become inoperative, as the operative status may change dynamically – some processes can lose it before the calculation finishes and, in consequence, their candidate values might not be properly counted by others. Moreover, this calculation has to account for the fact that some operative processes can be controlled by the adversary; thus it must, regardless, exploit the property that the operative processes communicate with enough other processes. A protocol performing this calculation in $O(\log n)$ rounds and using $O(n^{3/2}\log^{2}n)$ communication bits in total is the main technical advancement of this algorithm, and we present it next in the form of two technical advancements.

Technical advancement 1: $\sqrt{n}$-decomposition into groups and binary-tree-like intra-group calculations of operative processes for communication saving.

As mentioned above, there are some inherent difficulties in the omission failure model that complicate the time- and communication-efficient counting of the number of candidate values (i.e., votes) $0$ and $1$ among the operative processes. Here, we present the techniques we use to mitigate them. We pre-define a fixed partition of the set of processes into $\lceil\sqrt{n}\rceil$ groups of size $\lfloor\sqrt{n}\rfloor$ or $\lceil\sqrt{n}\rceil$ each (see line 3 of Algorithm 1, the example in Figure 1, and the small sketch below), and first require the operative processes to count the number of operative $0$'s and $1$'s only within the groups (the procedure GroupBitsAggregation executed in line 6). In this part, we use a virtual sparse data structure on some subsets of processes, structured into a balanced binary tree, which is used to aggregate the counts.
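A minimal sketch of one such pre-defined partition (hypothetical helper name; IDs $1..n$, groups as balanced as possible):

```python
import math

def sqrt_decomposition(n):
    # Partition IDs 1..n into ceil(sqrt(n)) disjoint groups,
    # each of size at most ceil(sqrt(n)), as evenly as possible.
    g = math.isqrt(n)
    if g * g < n:
        g += 1                          # g = ceil(sqrt(n)), exactly
    base, extra = divmod(n, g)          # first `extra` groups get one more
    groups, start = [], 1
    for i in range(g):
        size = base + (1 if i < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups

print(sqrt_decomposition(10))  # [[1, 2, 3], [4, 5, 6], [7, 8], [9, 10]]
```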

Figure 1: A schematic picture of two different techniques used for communication between processes. Different colors represent different groups in the $\sqrt{n}$-decomposition of the processes. The links represent the overlaying communication resembling a sparse random graph, used for exchanging the operative counts of different groups. The choice of links is independent of the $\sqrt{n}$-decomposition.

Vertices of the binary tree correspond to specific subsets of processes in the group: the leaves of the binary tree are singletons in the group, and each vertex in a higher layer corresponds to the union of the subsets identified with the children of that vertex in the tree. The root of the binary tree corresponds to the set of all processes in the group. Processes calculate the number of operative $0$'s and $1$'s, called operative counts, starting from the leaves of the tree (i.e., singletons) and then keep moving up the tree. At each vertex in a higher layer, the processes in the subset corresponding to that vertex work together to relay and sum up the operative counts from the lower layer. In this relay-and-aggregation procedure, all operative processes in the group exchange messages, not only to relay and aggregate values from children to parents in the virtual tree, but also to keep track of who remains operative in the whole group. More precisely, if a process receives information from fewer than half of the other processes in the group during the procedure of relaying and aggregating the operative counts of $0$'s and $1$'s from the lower layer, it becomes inoperative. See the example in Figure 2. After $O(\log n)$ rounds, corresponding to the height of the binary tree, the operative counts for the entire group – corresponding to the root of the tree – can be calculated. Complying with the rule of receiving a sufficient number of messages in order to maintain the operative status, only the processes of the group that remain operative use the operative counts further in the protocol. In the analysis, we prove that, regardless of the omission failures' pattern, there is a group of $\Theta(n)$ operative processes whose counts of operative $0$'s and $1$'s differ by at most the number of processes that have become inoperative. Taking advantage of the binary-tree-structured communication, we can guarantee that the processes of each group exchange at most $\tilde{O}(n)$ bits in total and that the procedure of counting the operative $0$'s and $1$'s takes only $O(\log n)$ rounds; a sequential sketch of this aggregation is given after the lemma below. Summarizing, we prove the following result about a single execution of the procedure GroupBitsAggregation. A detailed description can be found in Appendix LABEL:subsec:comm-patterns and in Algorithm LABEL:alg:bits-agreg; the formal analysis is in Appendix LABEL:subsec:analysis-main.

Lemma (Lemmas LABEL:lem:bits-agg-contr and LABEL:lem:msg-aggr in Appendix LABEL:subsec:analysis-main).

A single execution of the procedure GroupBitsAggregation works in $O(\log n)$ rounds and guarantees that every operative process in a single group knows an approximate number of other operative processes in the group having candidate value $0$ and $1$. The numbers at different operative processes differ by at most the number of processes of the group that became inoperative during the execution of the procedure. Processes in a single group use at most $O(n\log^{2}n)$ bits of communication, in total, during the execution.
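The following sequential sketch (our own simplification) mirrors the layer-by-layer aggregation up the virtual binary tree: per-leaf (zeros, ones) pairs are summed pairwise, one tree layer per pass, with an inoperative member contributing nothing; the relay-and-confirm message exchanges of Figure 2 are abstracted away.

```python
def tree_aggregate(bits, operative):
    # bits[i] in {0, 1}: candidate value of group member i;
    # operative[i]: whether member i still holds operative status.
    # Leaves are singletons; each internal node sums its two children.
    counts = [(1 - b, b) if op else (0, 0)        # (zeros, ones) per leaf
              for b, op in zip(bits, operative)]
    while len(counts) > 1:
        nxt = []
        for j in range(0, len(counts), 2):
            left = counts[j]
            right = counts[j + 1] if j + 1 < len(counts) else (0, 0)
            nxt.append((left[0] + right[0], left[1] + right[1]))
        counts = nxt                               # one tree layer per pass
    return counts[0]                               # (g_zeros, g_ones) at root

# Example: five members, one inoperative (cf. process c in Figure 2).
print(tree_aggregate([1, 0, 1, 1, 0], [True, True, False, True, True]))
# -> (2, 2): the inoperative member's bit is not counted.
```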

Figure 2: Visualization of the $\sqrt{n}$-decomposition of the blue group from Figure 1. The processes $a,b,c,d,e$ in the group are logically decomposed into a binary tree. The pink arrows visualize the three-round process of relaying the operative counts of the two children of the root to the root itself. First, the counts are relayed to all processes in the group (arrow #1), then the processes send a confirmation if they received the counts (arrow #2), and finally all processes in the group transmit the received counts to the higher layer – the root in this case (arrow #3). Some processes can be faulty (process $c$ does not communicate, only $\{a,b,d,e\}$ do) and their values are not guaranteed to be accumulated accurately.

Technical advancement 2: Fast inter-group communication and status maintenance between operative processes.

After the tree-based communication, the operative processes in each group have shared knowledge of the count of operative $0$'s and $1$'s within their group. Since there are $\lceil\sqrt{n}\rceil$ different groups, the number of logically different counts is $O(\sqrt{n})$. To exchange these $O(\sqrt{n})$ counts between the groups, the operative processes communicate along the links of a sparse, but well-connected, graph – the neighborhoods of which are pre-selected locally in line 2 – that underlays the entire network; see the procedure GroupBitsSpreading executed in line 8 of Algorithm 1 and the illustration of the graph on top of the group partition in Figure 1. The graph used by the operative processes is selected as follows. We consider a random graph, where each edge is selected independently at random with probability $\Theta(\log n/n)$. Next, we use probabilistic analysis to show that the following event holds whp: a random graph with such edge density has the property that every subgraph of constant-fraction size is dense and shallow – see Theorem LABEL:thm:random-graph-properties in Appendix LABEL:subsec:analysis-main. The "dense" property refers to the fact that after removing an arbitrary set of at most an $\alpha$-fraction of the edges incident to any vertex from a linear number of vertices, one can still find a connected subgraph on this linear number of vertices such that every vertex has degree at least $\beta\log n$ within this subgraph. The "shallow" property refers to the fact that the latter subgraph has (asymptotically) logarithmic diameter. These two properties justify our partition into operative and inoperative classes from the perspective of inter-group communication: as long as a process has more than $\beta\log n$ active links, intuitively it belongs to the connected shallow subgraph and thus is capable of exchanging information with any other process with this property in $O(\log n)$ rounds. This holds regardless of the actual faulty / non-faulty status of the processes. Therefore, the operative processes can spread among themselves the operative counts of the $\lceil\sqrt{n}\rceil$ different groups, yielding the property that as long as a process remains operative, it knows some operative counts of any group with at least one operative process. More specifically, every operative process stores a data structure memorizing the operative counts of each of the $\lceil\sqrt{n}\rceil$ groups present in the system. Initially, it knows only the counts of the group it belongs to. Over $\Theta(\log n)$ rounds of communication, it keeps sending these counts along every edge determined by the underlying graph, maintaining the invariant that the operative counts of a particular group are sent only once via each edge; at the same time, it receives counts of other groups and updates the data structure based on this information. In case a process receives two or more different count values of some group, it can choose any of them arbitrarily – as argued earlier, all of them differ by at most the number of processes that have become inoperative in that group. A sketch of the overlay sampling and spreading procedure is given after the lemma below. To summarize, the procedure GroupBitsSpreading has the following outcome (the formal description is given in Algorithm LABEL:alg:bits-spread and in Appendix LABEL:subsec:comm-patterns; the formal analysis is provided in Appendix LABEL:subsec:analysis-main).

Lemma (Lemmas LABEL:lem:spreading-reaching, LABEL:lem:operative-contribution, and Theorem LABEL:thm:main-algo in Appendix LABEL:subsec:analysis-main).

Assume that processes run the procedure GroupBitsSpreading with $O(\sqrt{n})$ different logical input values (which are the operative counts of candidate values $0$ and $1$ of each group). At the end of the procedure, each operative process knows at least one copy of each logical value, provided that at least one process starting with this logical value remains operative. The procedure uses $O(\log n)$ communication rounds and $O(n\sqrt{n}\log^{2}n)$ communication bits in total.
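A minimal sketch of the two ingredients, under our own naming and with the operative-status bookkeeping omitted: sampling the sparse overlay with edge probability $\Theta(\log n/n)$, and flooding the per-group counts along its edges for a prescribed number of rounds (logarithmically many suffice on a shallow subgraph).

```python
import math
import random

def sample_overlay(n, c=8, seed=0):
    # Erdos-Renyi overlay: each edge kept independently with prob c*log(n)/n.
    rng = random.Random(seed)
    p_edge = min(1.0, c * math.log(n) / n)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p_edge:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def spread_counts(adj, initial, rounds):
    # initial[v]: dict {group_id: (zeros, ones)} known to v at the start.
    # Each round, every process sends its known counts to all neighbors;
    # a receiver adopts any count for a group it has not seen yet.
    known = {v: dict(d) for v, d in initial.items()}
    for _ in range(rounds):
        updates = {v: {} for v in adj}
        for v in adj:
            for u in adj[v]:
                for gid, cnt in known[v].items():
                    if gid not in known[u]:
                        updates[u][gid] = cnt
        for v in adj:
            known[v].update(updates[v])
    return known

# Toy usage: 64 processes, 8 groups; one "leader" per group starts informed.
adj = sample_overlay(64)
init = {v: ({v // 8: (3, 5)} if v % 8 == 0 else {}) for v in adj}
known = spread_counts(adj, init, rounds=12)
```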

The combination of the two above technical advancements leads to calculating the number of candidate values $0$ and $1$ among the operative processes. Although these numbers are approximate, as they can differ by the number of processes that have become inoperative, this difference is acceptable: it still allows employing a variant of biased-majority-voting consensus, as we discuss in the next part.

Putting them all together: consensus protocol based on biased-majority-voting adjusted to the new efficient voting implementation in an epoch.

Assuming that each operative process has the approximate numbers of other operative processes with candidate value $1$ and $0$, we explain the modification to the consensus framework based on biased-majority-voting by Bar-Joseph and Ben-Or [Bar-JosephB98], made in order to adapt to the dynamic character of the operative set of processes and to the properties of our new efficient single-epoch implementation of voting. Our modification takes into account that, in our case, the counts of operative $0$'s and $1$'s are not the same at every operative process. The communication protocols guarantee only that a candidate value of an operative process is accounted for by every other operative process; they give no guarantee regarding the candidate values of the processes that become inoperative during the calculation and communication. Also, in our implementation of a single biased-majority-voting subroutine (epoch), described earlier, the operative processes do not assign any default values to the candidate values of the inoperative processes. Thus, any operative process estimates the number of all operative processes simply by adding the operative counts of $0$'s and $1$'s. Based on the above estimates, the algorithm employs the following procedure for converging the candidate values of the operative processes to the final decision value: if an operative process has an operative count of $1$'s (i.e., the number of operative processes with candidate value $1$) at least $\frac{18}{30}$ of the estimated total of all operative processes, it sets its candidate value to $1$. If the operative count is less than half of the estimated total, it sets its candidate value to $0$. In all other cases, the candidate value for the next step is a uniformly chosen random bit. See lines 9–11 of Algorithm 1 and Figure LABEL:fig:random-coin for an illustration.
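As a small worked example of this rule (hypothetical function name; thresholds taken from the description above), suppose an operative process ends an epoch with counts ones = 70 and zeros = 30:

```python
import random

def next_candidate(ones, zeros, rng=random):
    # The epoch's update rule on the approximate operative counts.
    total = ones + zeros            # estimate of all operative processes
    if ones >= (18 / 30) * total:   # clear majority of ones
        return 1
    if ones < (15 / 30) * total:    # fewer than half: fall back to zero
        return 0
    return rng.randint(0, 1)        # borderline: flip a fair coin

print(next_candidate(70, 30))  # 70 >= 0.6 * 100, so the candidate becomes 1
```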