Dispel: Byzantine SMR with Distributed Pipelining

Gauthier Voron University of Sydney [email protected] and Vincent Gramoli 0000-0001-5632-8572 University of Sydney [email protected]

Abstract.

Byzantine State Machine Replication (SMR) is a long studied topic that received increasing attention recently with the advent of blockchains as companies are trying to scale them to hundreds of nodes. Byzantine SMRs try to increase throughput by either reducing the latency of consensus instances that they run sequentially or by reducing the number of replicas that send messages to others in order to reduce the network usage. Unfortunately, the former approach makes use of resources in burst whereas the latter requires CPU-intensive authentication mechanisms.

In this paper, we propose a new Byzantine SMR called Dispel (Distributed Pipeline) that allows any node to distributively start new consensus instances whenever they detect sufficient resources locally. We evaluate the performance of Dispel within a single datacenter and across up to 380 machines over 3 continents by comparing it against four other SMRs. On 128 nodes, Dispel speeds up HotStuff, the Byzantine fault tolerant SMR being integrated within Facebook’s blockchain, by more than 12 times. In addition, we also test Dispel under isolated and correlated failures and show that the Dispel distributed design is more robust than HotStuff. Finally, we evaluate Dispel in a cryptocurrency application with Bitcoin transactions and show that this SMR is not the bottleneck.

^†^†copyright: none^†^†copyright: none

1. Introduction

State machine replication (SMR) makes use of consensus for totally ordering a set of commands (or proposals) that are executed in the same order by all replicas. Consensus protocols are generally network bound as they often rely on some broadcast patterns to minimize the time it takes to reach agreement on the next command. Byzantine fault tolerant (BFT) SMRs have regained in popularity with the introduction of blockchain technology: Facebook even aims at deploying a variant of the HotStuff SMR (YMR19, ) on at least 100 replicas over a large network (Fac19, , §5). With the 1.6 billion daily active users on Facebook¹¹1https://newsroom.fb.com/company-info/., comes the question of the amount of payload needed to be treated by such a blockchain system once it will be in production. Unfortunately, the performance of SMRs generally drops significantly before reaching a hundred of nodes.

One of the reasons of this limitation is commonly believed to be the all-to-all communication pattern between replicas (Vuc15, ; CJB18, ; YMR19, ). In fact, $n$ replicas sending messages to all other replicas necessarily lead to $\Theta(n^{2})$ messages (CL02, ; CWA09, ; BSA14, ). Given that the network bandwidth is a limited resource, it could seem that this quadratic complexity becomes unaffordable on large networks, like the Internet. For these reasons, various protocols (AMG05, ; KAD07, ; AGK14, ; YMR19, ; BBC19, ) replaced this all-to-all message exchanges by one-to-all exchanges where they could. The problem is that the evaluation of network usage is far from being trivial and unexpected causes may impact the observed throughput.

Refer to caption — Figure 1. The goodput of SMRs (e.g., HotStuff) is low due to its unbalanced bandwidth consumption whereas Dispel exploits the bandwidth of more links to offer, at 128 nodes, a 12-fold speedup.

In this paper, we revisit this idea by offering a new SMR, called Dispel (Distributed Pipeline) that balances its quadratic number of messages onto as many routes between distributed replicas to offer high throughput with reasonably low latency. Dispel follows from a year-long investigation of the application-level throughput in SMRs—commonly referred to as goodput that accounts for payload amount. This extensive evaluation of network performance allowed us to identify important causes of performance limitations, like head-of-line (HOL) blocking (SK06, ) that delays subsequent packet delivery due to one TCP packet loss. To illustrate the benefit of Dispel consider Figure 1 that compares its throughput to HotStuff, the latest SMR that we are aware of. Although HotStuff outperforms preceding SMRs thanks to its linear message complexity, it suffers from the same leader-based message pattern as its predecessors (cf. §5 for the detailed setting of this figure and more results).

The key innovation of Dispel is its distributed pipeline, a technique adapted from the centralized pipelining of micro-architectures to the context of SMR to leverage unused resources. Both pipelining techniques consist of executing multiple epochs (or concurrent executions) in order to provide higher throughput than what could be obtained with a single epoch. As opposed to the classic pipelining that aims at maximizing the usage of central resources, the distributed pipelining maximizes the resource usage of distributed replicas by allowing them to decide locally when to spawn new consensus epochs. In particular, in Dispel each replica that detects idle network resources at its network interface (NIC) and sufficient available memory spawns a new consensus instance in a dedicated pipeline epoch. This distributed detection is important to leverage links of heterogeneous capacity like inter-datacenters vs. intra-datacenters communication links (cf. §5.5). We draw three conclusions out of this work:

(1)

HOL blocking limits SMRs at high speed. Dispel allows us to increase the performance of the SMR to a new level, where we identified head-of-line (HOL) blocking as the bottleneck (instead of the network interface (NIC) bandwidth). This phenomenon is detailed in §5.5 where the throughput increases at small scale due to the multiplication of communication routes between replicas and where performance increases proportionally to the number of TCP connections.
(2)

Distributed pipelining increases robustness. Distributing the pipelining allows any replica to start a new consensus instance by proposing requests, hence allowing the SMR to progress despite the failure of a single replica. This is in contrast with centralized pipelining: the leader failure can impact the SMR performance until a correct leader is selected. This is detailed in §5.6 where a single failure has no noticeable impact on Dispel and where correlated failures do not prevent Dispel from treating 150,000 requests per second.
(3)

CPU-bound planet-scale BFT cryptocurrency. Our experiments of Dispel within a cryptocurrency application where miners verify all transactions (as in classic blockchains (Nak08, ; Woo15, )) show that the performance is limited locally by the cryptographic cost at each node and no longer globally by the network cost of BFT consensus. A direct consequence is that the performance does not drop when the system size increases, even when deployed on up to 380 nodes distributed across 3 continents (§6).

In the remainder of the paper, we start by presenting the background that motivates our work (§2). Then we present the design of Dispel to pipeline the consensus executions (§3) before we detail the implementation choices (§4). We then evaluate Dispel within a datacenter, on geo-distributed systems across 3 continents, against other SMRs (§5), under correlated failures, and when running a cryptocurrency application (§6). Finally, we discuss the related work (§7) and conclude (§8).

2. Background

BFT state machine replication (SMR) relies on executing repeatedly a BFT consensus protocol designed for subsecond latency. Typical implementations are sequential (CL02, ; KAD07, ; BSA14, ; AGK14, ); they execute one consensus instance at a time. In order to increase throughput, they batch commands (e.g., (CL02, ; BSA14, )), hence proposing multiple commands to the same consensus instance. More precisely, when the system starts, $n$ replicas (or nodes) propose a batch (or proposal), possibly empty. They execute a BFT consensus instance until they decide on a batch. When a batch is decided, the replicas execute all the commands of this batch and then start a new consensus instance. Executing consensus instances sequentially is instrumental in identifying and discarding conflicting commands in the batch they process. This sequential design, yet simple, is efficient enough when the consensus latency is low.

BFT consensus instances totally order proposals at consecutive indices. As it is impossible to implement consensus in an asynchronous failure-prone environment (FLP85, ), various systems assume a partially synchronous environment, in that every message gets delivered in an unknown but bounded amount of time (DLS88, ). In this environment, one can solve the BFT consensus if the number $f$ of Byzantine (or malicious) nodes is lower than $\frac{n}{3}$ (PSL80, ), meaning that $n-f$ nodes remain correct. Most of the practical BFT SMRs implemented today rely on a leader-based pattern where a single node exchanges messages with Byzantine quorums of $2f+1$ nodes. Recently, BFT SMR gained attention for its ability to totally order blocks of transactions in a blockchain (Nak08, ) in which the challenge becomes to perform efficiently on larger networks.

2.1. The network bottleneck of the first phase

The leader-based pattern typically starts with a message exchange phase where a specific node, called the leader, aims at proposing a command to the rest of the nodes. If the leader is faulty, the system may choose another leader. As it is impossible to distinguish a slow network from a mute leader in a partially synchronous environment, such changes affect performance (SRM12, ; MAK13, ; AMQ13, ; GPB16, ). If the leader is correct and the network is timely, then the command is decided by all nodes. But in this case, the leader may have to send its proposal to many, making its NIC the bottleneck (JRS11, ). One may think of batching proposals within the same consensus instance (FH06, ), so that multiple proposals are piggybacked in the same messages and can be decided in a row. This increases the information conveyed to all nodes by the leader, hence adding to the congestion.

2.2. The CPU-intensive subsequent phases

To mitigate the leader bottleneck once the proposals are conveyed to all nodes, the nodes typically hash the content of the proposal into some digest (CL02, ; BSA14, ; YMR19, ) and exchange the resulting digests to refer to specific proposals. On the one hand, this reduces considerably the network utilization in the phases following the prepare phase. On the other hand, the subsequent phases convey more frequent but smaller messages than in the first phase whose treatment consumes CPU. The hashing function necessary to encode these messages is also CPU intensive. As the communication is partially synchronous it is likely that the hashing function execution overlaps at many nodes with the reception of these message digests, hence further increasing the CPU usage.

When requests must be cryptographically verified (BSA14, ), as in cryptocurrency applications (Nak08, ; Woo15, ), or when the phases require message authenticators (CL02, ), the system can become CPU bound. To put things in perspective, an AWS EC2 c5.xlarge instance has an upload bandwidth of 600 MiB/s but a CPU of this same instance can only hash 425 MiB/s with SHA256 and verify up to 5 MiB/s for 400 byte transactions with the fastest ECDSA curve provided by OpenSSL as we detail in §6. BFT SMRs typically alternate between the network-intensive phase and these CPU-intensive phases.

2.3. Bypassing network and CPU bottlenecks

To bypass the leader completely, several deterministic leaderless Byzantine consensus were proposed. Lamport suggests a virtual leader election (Lam10, ; Lam11, ) to transform a leader-based Byzantine consensus algorithm into a leaderless one, however, no virtual leader election algorithm is given. Borran and Schiper (BS10, ) proposed a Byzantine leaderless consensus whose communication complexity is exponential.

Recently, Crain et al. (CGLR18, ) proposed a leaderless deterministic Byzantine consensus algorithm. The algorithm called Democratic BFT (DBFT), is depicted in Algorithm 1 and builds upon a reduction of the problem of multivalue consensus to the problem of binary consensus (BKR94, ) and a resilient-optimal and time-optimal deterministic binary consensus algorithm that was formally verified using model checking (TG19, ): Each replica of id $0\leq k<n-1$ reliably broadcasts its input value such that all correct replicas deliver the same value (B87, ) into the $k^{th}$ coordinate of a local array. Later, DBFT spawns $n$ binary instances whose $k^{th}$ takes input value 1 if the $k^{th}$ coordinate of the array was reliably delivered, or 0 if it was not yet delivered. The decisions of the $n$ binary instances form a bitmask that is applied to the array of values to extract the decidable values. DBFT outputs the first of these values to solve the classic Byzantine consensus problem, the Red Belly Blockchain extends DBFT (CNG18, ) to output all these values hence committing more transactions, unfortunately, it runs consensus instances sequentially.

Algorithm 1 DBFT, Consensus for Blockchains (CGLR18, )

\mathord{\sf consensus}(\mathord{\it batch})

: // similar to (BKR94, )

\mathord{\sf reliable-broadcast}(\mathord{\it batch})

// e.g., (B87, )

\mathord{\it array}[1..n]\leftarrow

reliably-delivered

n-t

batches // wait for

n-t

values

4: for

k=1..n

5: if

\mathord{\it array}[k]

was reliably delivered then

\mathord{\it bitmask}[k]\leftarrow\mathord{\sf fast-deterministic-binary-consensus}_{k}(1)

7: else

\mathord{\it bitmask}[k]\leftarrow\mathord{\sf fast-deterministic-binary-consensus}_{k}(0)

8: wait until

\mathord{\it bitmask}

is full and

\forall\ell,\mathord{\it bitmask}[\ell]=1:\mathord{\it array}[\ell]\neq\emptyset

9: return (

\mathord{\it bitmask}~{}\&~{}\mathord{\it array}

) // apply the bitmask to decide a batch subset

2.4. Limits of sequential consensus instances

Once the bottleneck effects are mitigated, one can further improve the throughput of SMRs by reducing the latency of one consensus instance or by overlapping different consensus instances. First, a long series of results (FV97, ; KAD07, ; SR08, ; SB12, ; MBS13, ; VCB13, ) reduce the latency of the BFT consensus, sometimes by assuming correct clients (KAD07, ), tolerating less Byzantine nodes (SR08, ) or using a trusted execution environment (VCB13, ; KBC12, ). The effect of reducing latency on increasing the throughput is quite visible (MBS13, ), however, this increase is limited by the sequential execution of these consensus instances.

Second, some form of centralized pipelining was proposed (SS13, ; YMR19, ). This technique inherited from networking (PM95, ), consists of executing some consensus instance before another terminates. This approach is “centralized” because it is leader-based: a leader is needed to spawn a new pipeline epoch. The benefit of centralized pipelining was observed to be limited (SS13, ), again due to the network bottleneck at the leader (cf. §2.1). HotStuff (YMR19, ) is a recent BFT SMR that piggybacks phases of consecutive consensus instances and Facebook is building Libra (BBC19, ) on top of its variant.²²2https://medium.com/ontologynetwork/hotstuff-the-consensus- protocol-behind-facebooks-librabft-a5503680b151. Unfortunately, it again relies on a leader. Dispel is the first distributed pipeline SMR as described below.

3. Dispel Overview

This section presents Dispel, standing for Distributed Pipeline, a BFT SMR for the partially synchronous model tolerating $f<n/3$ Byzantine replicas. Unlike previous SMRs, each replica in Dispel spawns a new consensus instance based on its available local resources.

Figure 2. Architecture of Dispel. Transactions are collected in the pool and batched (1). Replicas exchange batches (2A,2B), store and associate them with their hash digest in the manager (3). Replicas execute consensus protocol over hashes (2C,3) and transmit the decisions to the manager (4). The manager transmits the batches associated with the decided hashes to the application (5).

3.1. Architecture of a pipeline epoch

Figure 2 shows the architecture and the main steps of one pipeline epoch that runs one consensus instance. This epoch consists of creating a batch of commands, called transactions in the context of blockchains, exchanging batches with other replicas and selecting an ordered subset of the batches created by the replicas. A Dispel replica continuously listens for client transactions and stores them in its transaction pool (step 1). When a replica decides to spawn a new pipeline epoch, it concatenates all the transactions in the pool to create a batch. The replica then broadcasts the batch as the first step of the reliable broadcast of the DBFT consensus protocol (§2.3). In parallel, a dedicated hashing thread computes the sha256 checksum of the batch (step 2A).

All Dispel replicas decide independently to spawn a new pipeline epoch. We describe this process in details in §4.1. As a consequence, a replica receives batches from the other replicas in parallel to steps 1 and 2A (in step 2B). When a hashing thread has computed the hash of a batch, it stores the batch and its associated hash in the manager component (step 3). The hashing thread also transmits the hash to the consensus component. From this point on, the consensus component sent the hash digest instead of the batch itself when communicating with other replicas. The consensus component exchanges hashes to complete the reliable broadcast and executes the subsequent steps of Algorithm 1 (step 2C). When the replicas decide a set of hashes, the consensus component transmit the decision to the manager (step 4). The manager then fetches the batches associated to the decided hashes and gives these decided batches to the pipeline epoch orderer that we describe in §4.5 (step 5).

3.2. Pipeline overview

Unlike traditional SMRs, Dispel leverages bandwidth, CPU and memory resources. Traditional SMRs start a new consensus instance no sooner than when the previous consensus instance decides. Dispel uses a different approach in order to leverage the network bandwidth and the CPU resources. As we described in §3.1, a pipeline epoch first receives transactions from clients to create a batch, then hashes this batch, broadcasts it and finally executes a consensus over small hash values. This results in four distinct phases depicted in Figure 3: a network reception phase (Rx), a CPU intensive hash phase (H), a network transmission phase (Tx) and a wait-mostly consensus phase (C).

Figure 3. Principle of a consensus pipeline. A pipeline epoch consists of four phases: network reception (Rx), network transmission (Tx), CPU intensive hash (H) and latency bound consensus (C). As soon as an epoch finishes its first phase (Rx), the replica spawns a new epoch. When the replica executes four epochs concurrently, it leverages all its resources.

The goal of Dispel is to have all these phases executing at the same time from different pipeline epochs, hence leveraging most resources. To this end, a Dispel replica spawns a new pipeline epoch before the previous epoch terminates. Figure 3 illustrates a 4-epoch pipeline where each replica of Dispel starts a new epoch as soon as its resources permit. Each row stands for a pipeline epoch and has four phases, Rx, H, Tx and V. As soon as four epochs run concurrently, the replica can execute all these distinct phases at the same time, one phase per epoch, and leverages most resources. This technique called pipelining is especially effective when most of the transactions are issued by geodistributed clients to their closest replicas and when most concurrent transactions do not conflict because such phases consume different resources. Recall that the fact that concurrent transactions are usually not conflicting was observed before and benefited others (KAD07, ; MAK13, ).

4. Implementing the pipeline

In this section, we present how to implement a pipeline by making sure each replica can spawn a new pipeline epoch (§4.1), how replicas naturally coordinate to participate in the same uniquely identified epoch (§4.2), how the algorithm spawns new epochs (§4.3), how batches are committed (§4.4), and how epochs are ordered (§4.5).

4.1. Distributed spawn of epochs

The first phase (Rx) of a pipeline epoch (Fig. 3) consists of receiving transactions from clients in the transaction pool, as we explained in §3.1. As a replica has no control over the number of transactions that the clients send, its only responsibility is to ensure that there is always space in the pool to collect new transactions. If the pool is full, incoming client transactions are discarded and the client must retry later or on another replica. This motivates the first condition to spawn a new pipeline epoch: having a full transaction pool. Spawning a new epoch before the pool is full would result in fewer transactions per batch and thus in a lower throughput. When the clients send transactions slowly, filling a transaction pool takes a long time. To prevent old transactions from hanging in the transaction pool for too long, a replica also spawns a new pipeline epoch after a timeout expires, however, this never happens in our experiments as the demand is high.

The next phase (Tx) consists of broadcasting an epoch batch to the other replicas. A pipeline epoch can only progress to its next phase if the previous epoch has finished the same phase. A second condition for spawning a new epoch is thus to have an idle network, at least for the sending part. To detect when the network is idle, each replica of Dispel continuously monitors its network usage. Every 2 ms, each node measures its sending rate and compares this sending rate after 3 samples (over the resulting 6 ms duration) to the 600 MiB/s physical network capacity (as we measure in §5). If the network usage is lower than 5% of the physical network bandwidth limit, then the network is considered idle.

It is also important for a node to spawn a new pipeline epoch only if it has enough memory available. The risk is, otherwise, for replicas to inflate the latency of a consensus instance by accepting to participate in too many concurrent epochs. To decide the maximum number of epochs, each replica divides, prior to the execution, the amount of available memory (returned by the OS command free) by the batch size. Each replica also keeps a few megabytes to store thread stacks, consensus instance states and other small sized objects. This available memory is observed offline because it is hard to assess the memory available in real-time in Java due to the garbage collector of OpenJDK11 that is unpredictable.

4.2. Following remote epochs

Each pipeline epoch corresponds to a new epoch in which one consensus instance executes. By monitoring its own resource usage, a replica decides when to spawn a new pipeline epoch independently of the other replicas. Each replica tags the pipeline epochs it spawns with an increasing epoch number based on the current index of the pipeline—if two replicas spawn two pipeline epochs with the same number then it means that there are actually the same epoch. Additionally, all the consensus messages associated with an epoch are prefixed by the epoch number of the epoch. A replica considers that the received messages tagged with the same epoch number belong to the same pipeline epoch and thus to the same consensus instance. With this simple method, replicas join an existing pipeline epoch by spawning an epoch locally which happens to have the same epoch number.

A replica participates to remotely spawned epochs depending on its local state and the received messages. When every replica in the system receives an equal number of transactions from the clients, the replicas end up spawning new pipeline epochs at the same rate. In this ideal scenario, all replicas participate to the same pipeline epochs without any additional synchronization. When the clients load is imbalanced among the replicas, some replicas may never decide to spawn a new epoch on their own. However, sufficiently many replicas must participate in a consensus instance for this instance to terminate. For this reason, when a replica receives a message from a pipeline epoch to which it does not participate, if the previous epoch is decided then this replica participates to the epoch. This restriction of having the previous epoch decided prevents a malicious node from making other replicas participate in epochs in a distant future, which could lead to starvation.

Algorithm 2 Dispel at process

p_{i}

1:Initially:

\mathord{\it batch}_{i}\leftarrow\emptyset

// gathers the client requests

\mathord{\it stage-num}_{i}\leftarrow 0

// number of running instances

\mathord{\it started}_{i}\leftarrow\emptyset

;

\mathord{\it decided}_{i}\leftarrow\emptyset

// epochs of started / decided instances

\mathord{\it max-stages}\leftarrow\mathord{\it available-memory}/\mathord{\it batch-capacity}

7:When

\mathord{\it batch}_{i}

is full, network is idle and

\mathord{\it stage-num}_{i}<\mathord{\it max-stages}

\mathord{\sf spawn-instance}(\mathord{\sf max}(\mathord{\it started}_{i})+1,\mathord{\it batch}_{i})

// non-blocking call

\mathord{\it batch}_{i}\leftarrow\emptyset

10:

11:When receive a message for

\mathord{\sf consensus}^{\mathord{\it epoch}}

and

epoch\notin\mathord{\it started}_{i}

12: wait until

\mathord{\sf consensus}^{\mathord{\it epoch}-1}

decides

13: if

\mathord{\it epoch}=\mathord{\sf max}(\mathord{\it started}_{i})+1

then

14:

\mathord{\sf spawn-instance}(\mathord{\it epoch},\mathord{\it batch}_{i})

// non-blocking call

15:

\mathord{\it batch}_{i}\leftarrow\emptyset

16: else

\mathord{\sf spawn-instance}(\mathord{\it epoch},\emptyset)

// non-blocking call

17:

18:

\mathord{\sf spawn-instance}(\mathord{\it epoch},\mathord{\it batch})

: // starts a new consensus instance

19:

\mathord{\it stage-num}_{i}\leftarrow\mathord{\it stage-num}_{i}+1

;

\mathord{\it started}_{i}\leftarrow\mathord{\it started}_{i}\cup\{\mathord{\it epoch}\}

20:

\mathord{\it results}\leftarrow\mathord{\sf consensus}^{\mathord{\it epoch}}(\mathord{\it batch})

// blocking call

21:

\mathord{\it decided}_{i}\leftarrow\mathord{\it decided}_{i}\cup\{\mathord{\it epoch}\}

22: if

\mathord{\it epoch}>0

then

23: wait until

\mathord{\sf consensus}^{\mathord{\it epoch}-1}

decides

24: transmit

\mathord{\it results}

to the application

25:

\mathord{\it stage-num}_{i}\leftarrow\mathord{\it stage-num}_{i}-1

4.3. The algorithm for spawning new pipeline epochs

Algorithm 2 summarizes with pseudocode how Dispel decides to spawn new epochs. If a replica $p_{i}$ meets the three conditions (a) the transaction pool $\mathord{\it batch}_{i}$ is full, (b) its network is idle and (c) there is enough available memory, then the replica spawns a new pipeline epoch (lines 7–9). The epoch number of this new epoch is the immediately greater integer than the epoch number of the newest pipeline epoch to which the replica $p_{i}$ participated. Additionally, when a replica receives a message from a pipeline epoch to which it does not participate, it waits for the previous epoch to decide and then participates (lines 11,12). If this new epoch is the new epoch the replica $p_{i}$ has spawned on its own, it participates with the content of its transaction pool (lines 13–15). Otherwise, it participates with an empty batch (lines 16). This avoids submitting a batch to a consensus instance that is likely to already have accepted $n-t$ batches from other replicas. In this case, the batch submitted by the replica $p_{i}$ is not part of the decision set. The epoch spawning routine (lines 18–25) starts a new consensus instance (cf. §2.3) tagged by an epoch number and delivers the consensus decisions in the order of the epoch numbers. We detail how a replica orders the decision inside a epoch in §4.4 and across epochs in §4.5.

4.4. Intra-epoch ordering between hashes and batches

An interesting side effect of our pipeline is that correct replicas may decide upon the hash of batches before receiving the corresponding batches in full. As we mentioned in §3.1, because replicas decide locally to spawn a new epoch, the transactions reception and the batches exchange happen concurrently. Additionally, as the consensus decides on hashes and not on batches, the consensus decision also happens concurrently with the batch exchange. As a result of this concurrency, a replica sometimes decides on batches it has not yet received. Indeed, as we described in §2.3, the DBFT consensus algorithm starts by a reliable broadcast and then executes a binary consensus on every reliably delivered value. During the first step of a reliable broadcast, a Dispel correct replica sends the batch to all the replicas.

The two following steps of a reliable broadcast are all-to-all broadcasts used to prevent correct replicas from delivering values broadcast by Byzantine to only a subset of the replicas or from delivering different values. Dispel correct replicas only send hashes during these to all-to-all broadcasts in order to save bandwidth. A reliable broadcast delivers its value when the replica delivers enough messages from the two all-to-all reliable broadcast. As a result, a Dispel correct replica sometimes has the hash of a batch reliably delivered without knowing the corresponding batch. This is not an issue since this is only possible if at least one correct replica knows the batch. A replica handles this corner case by broadcasting a batch request for the decided hashes with an unknown batch at the end of each pipeline epoch. When a replica knows all the decided batches, it sorts them with their hash as a sorting key.

4.5. Inter-epoch ordering to guarantee starvation freedom

Having each pipeline epoch deciding an ordered set of batches is not sufficient. Because of the concurrency between pipeline epochs, the order in which epochs are decided is not always the order in which they are spawned. Replicas solve this issue by transmitting the pipeline epochs decisions sorted by the epoch number of these epochs. More formally, a replica transmits the decision of a pipeline epoch with the epoch number $k$ to the application only after transmitting the decisions of the previous pipeline epochs for all $j<k$ to the application. This corresponds to the lines 22–24 in Algorithm 2.

Figure 4. The consensus reordering pipeline of Dispel indicates a possible evolution of pipeline epochs (

C^{0}

C^{2}

) as events occur. The character in the middle of each cell indicates the epoch state, either running (R), decided (D), ready for commit (C) or not yet started (/). The top left characters indicate the received batches A, B or C. The bottom left characters indicate the batches the decided hashes stand for.

Figure 4 illustrates how both intra and inter-epoch orderings take place. It shows the evolution of three pipeline epochs on a replica when this replica receives messages. Initially ( $T_{0}$ ), only two epochs $C^{0}$ and $C^{1}$ are running. The replica has received the batch A for $C^{0}$ and the batches A, B and C for $C^{1}$ . Upon reception of the batch B for $C^{2}$ from another replica ( $T_{1}$ ), the receiving replica spawns locally the new pipeline epoch $C^{2}$ as described in lines 11–15 of Algorithm 2. Upon reception of the consensus messages for $C^{0}$ ( $T_{2}$ ), the replica reaches consensus and decides on the batches A and B. However, as we explain in §4.4, the replica has not received the decided batch B. Upon reception of this batch, the epoch $C^{0}$ is marked as decided but not yet ready to commit (i.e., to transmit the decision to the application). The replica then receives consensus messages for $C^{1}$ ( $T_{3}$ ) and decides on the batches A, B and C. Since the replica knows the decided hashes, it is ready to commit $C^{1}$ but waits for the previous epoch $C^{0}$ before transmitting to the application. Finally, the replica receives the batch B for $C^{0}$ ( $T_{4}$ ). The replica then commits both epochs $C^{0}$ and $C^{1}$ and mark them as terminated.

To conclude, we summarize three properties for the distributed pipelining that allows to proceed efficiently:

(1)

Early epoch: this requires launching a epoch $k+1$ early, typically epoch $k+1$ should start before the epoch $k$ completed in order to guarantee that the pipeline translates into higher throughput.
(2)

Oldest epoch priority: epoch $k$ always has priority over epoch $k+1$ to ensure that the number of concurrently active epochs eventually decreases, preventing undesirable situations like starvation.
(3)

Pipeline feeding: the SMR must guarantee that commands are enqueued into the upcoming epochs to guarantee progress, despite the workload induced by the inter replicas communications.

Guaranteeing these three properties help ensure the good performance of the distributed pipelining.

5. SMR Evaluation

This section presents an evaluation of Dispel in geo-distributed setups with up to 256 machines. We compare the performance of Dispel against four SMRs protocols from the literature. The larger experiments with a cryptocurrency application are deferred to §6.

5.1. SMRs protocols

For this evaluation, we implement Dispel, a leaderless SMR using the DBFT consensus described in §2 and the pipelined architecture described in §3 and §4. Dispel is written in Java using only the standard Java libraries. We compare Dispel against four SMRs:

•

BFT-SMaRt: is a leader-based BFT SMR similar to PBFT with further optimizations that has been maintained in Java for more than a decade (BSA14, ). It was used for the ordering service of Hyperledger Fabric (ABB18, ) to tolerate Byzantine failures (SBV18, ). We used the branch weat2 of the official BFT-SMaRt git repo https://github.com/bft-smart/library that outperforms the master branch when geo-distributed (SB15, ; BRS19, ).
•

EPaxos: is a leaderless crash fault tolerant (CFT) SMR (MAK13, ) that does not provide the guarantees of a BFT SMR but that serves as a fast baseline. It is an improved version of Paxos with no leader and a fast path for non conflicting commands. We use the Go author’s library from https://github.com/efficient/epaxos.git.
•

HotStuff: is a recent leader-based BFT SMR. It outperforms BFT-SMaRt by having the leader piggybacking phases of distinct consensus instances into the same messages (YMR19, ). Facebook is currently developing in Rust the Libra state machine (BBC19, ) on a variant of HotStuff (Fac19, ). We use the C++ authors’ library from https://github.com/hot-stuff/libhotstuff.
•

Zyzzyva: is a leader-based BFT SMR designed to reduce decisions latency (KAD07, ). To this end, Zyzzyva replicas reply optimistically to the client before reaching consensus and let the client solve disagreements. We use the Zlight C++ implementation (AGK14, ) that we patched to make it run on a geo-distributed setup. More precisely, the original implementation uses IP multicast and UDP communication without any mechanism to handle packet loss. We replace the UDP implementation by a standard event-driven loop TCP implementation.

Other SMRs: there is a long body of BFT SMRs papers, however, not all implementations are available: we contacted the authors of BFT-Mencius (MBS13, ) and ezBFT (APR19, ), two leaderless BFT protocols, but there were no readily available implementation.

5.2. Benchmark setup

This evaluation takes place on AWS EC2 instances. We use c5.xlarge instances for running replica and client processes. Each instance is a KVM virtual machine with four hardware threads implemented by two hyperthreaded Intel Xeon core running at 3 GHz, with 8 GiB of memory and a network interface of 600 MiB/s upload and download speed. Each virtual machine is running a Ubuntu 18.04 distribution with an AWS modified Linux kernel 4.15, OpenJRE 11 and the glibc 2.27. The SMRs are compiled using OpenJDK 11, gcc 7.5 and go 1.10.4.

We ran BFT-SMaRt, Zyzzyva, HotStuff and EPaxos in their default configurations indicated in their original papers or reports (BSA14, ; AGK14, ; YMR19, ; MAK13, ). Each Dispel replica uses a batch size of $25/n$ MB where $n$ is the number of replica. We configure Dispel replicas to use up to 12 pipeline epochs. In every experiment, we execute one replica process per VM. In addition, we dedicate two VMs per region to execute the client processes. For EPaxos, BFT-SMaRt, HotStuff and Dispel, each client VM executes a single client process which sends many parallel transactions to the replicas. Zyzzyva clients only implement blocking requests. For Zyzzyva, we execute 100 client processes per client VM. We found that this number of client process is sufficient to saturate Zyzzyva replicas.

The goal of the following experiments is to compare the throughput that each SMR delivers. This throughput increases with the client load. As a tradeoff, increasing the client load generally results in a higher latency. To provide a fair comparison, we choose to explore different client loads for every data points we plot. We then select the client loads resulting in a throughput of at least 90% of the best observed throughput. Among these selected loads, we pick the one resulting in the best latency. The intuition behind this methodology is that when the client load increases, the latency remains stable until an inflection point and then skyrockets. The throughput increases until this inflection point and then remains stable. Our methodogy identifies the inflection point by selecting the set of client loads with the best throughputs and then pick the latency obtained in this set right before the inflection point. All the data points are the average over at least 5 runs.

	Frankfurt	Ireland	London	Paris
Frankfurt	–	60	109	179
Ireland	25	–	148	80
London	14	10	–	210
Paris	8	19	7	–

Table 1. Latency in ms (bottom left) and bandwidth in MiB/s (top right) between the AWS Europeans datacenters.

5.3. Pipelining copes with round-trip delays

As we expect the pipeline to increase throughput despite high latency, we start by comparing the throughput and latency of Dispel to EPaxos, HotStuff, BFT-SMaRt and Zyzzyva within a single continent. More precisely, we measured the throughput and latency on up to 256 VMs located in Europe (Ireland, London, Paris and Frankfurt). Table 1 summarizes the latencies and bandwidth between the four datacenters, mesured with nuTCP.

Figure 5(a) depicts the throughput of Dispel, EPaxos, HotStuff and Zyzzyva. We found that BFT-SMaRt and HotStuff exhibit the same behavior although BFT-SMaRt throughput is consistently at least $2\times$ lower than HotStuff throughput. We choose to not plot BFT-SMaRt for sake of clarity. First, we observe that Zyzzyva does not perform as well as the others with a throughput of 25 Kops/s at 4 nodes and of 2 Kops/s or below with 64 nodes or more. This is because Zyzzyva is optimized to perform in a local area network (LAN): it does not batch requests and executes sequentiallySecond, as expected, EPaxos outperforms HotStuff by delivering a throughput up to $7\times$ larger at 32 nodes and still $3\times$ larger at 256 nodes. EPaxos does not offer the same level of fault tolerance than HotStuff: it does not tolerate Byzantine failures while HotStuff does. Third, all solutions perform significantly slower than Dispel due to their lack of pipeline. Dispel exhibits the best performance at 4 nodes with a throughput of 462 Kops/s (more than $2\times$ larger than EPaxos and $13\times$ larger than HotStuff). At worst, with 256 nodes, Dispel has a throughput of 73 Kops/s (still $2\times$ larger than EPaxos and $6\times$ larger than HotStuff).

Pipelining typically hides the increase in latency through parallelization, executing multiple consensus instances at a time. Although HotStuff pipelines to some extent, its pipeline only piggybacks phases of consecutive consensus instances into the same messages and does not leverage resources from any replicas like Dispel’s pipeline does. Interestingly, Dispel also doubles the performance of EPaxos at 4, 128 and 256 nodes even though EPaxos does not tolerate Byzantine failures. As we consider only non-conflicting requests here, all command leaders of EPaxos should commit their command in parallel—similarly to a pipeline. The difference is that EPaxos does not spawn new leader proposal based on resources as Dispel does, hence unable to detect resource contention.

A surprising phenomenon is the unexpected high throughput of Dispel with 4 replicas and the througput drop between 4 and 16 replicas. As we describe later in §5.5, we expect the throughput of Dispel to increase with the number of replicas while the number of replica is small, then to decrease. This is what we observe from 16 replicas, however the throughput of Dispel at 4 replicas is $1.68\times$ larger than at 16 nodes. Although we do not have a definitive explanation for this throughput, we are confident that it does not come from either our measurement nor the methodology we use. Indeed, over the 5 runs we use to compute the average throughput of this point, the individual throughputs are all between 449 Kops/s and 499 Kops/s. The client load we use is the one that results in the best throughput over the explored values. Our main hypothesis is that for this small number of replica, all the replicas progress at the same rate in the pipeline, which is the ideal scenario that we describe in §4.2 where all replicas always participate with full batches.

5.4. Higher performance than SMR for blockchains

Figure 5(b) depicts the latencies obtained for the four SMRs, Dispel, EPaxos, HotStuff and Zyzzyva, in the same settings as above (§5.3). This shows that Dispel does not offer the best latency. This is because the pipeline does not need to reduce the latency to increase throughput. Zyzzyva, designed for LANs, offers consistently the lowest latency by staying below 300 ms until 64 nodes and always below 1700 ms. In contrast with other solutions, the latency of HotStuff improves with the system size at small scale, which is likely due to our measurements that extract the best throughput runs of all runs before selecting the best latency run within these values (§5.2). Due to HotStuff lower communication complexity, HotStuff latency increases slower than other SMRs. As a result, the best throughput values are observed in conditions where all latencies are relatively similar. Given that Facebook aims at deploying a variant of HotStuff on 100 nodes and more as part of their Libra blockchain (Fac19, , §5), it is particularly interesting to compare performance at 128 nodes: the latencies of HotStuff and Dispel are in the same order of magnitude, respectively 1300 ms and 2100 ms, whereas the throughput of Dispel (Fig. 5(a)) is more than $12\times$ higher than HotStuff. We realized that the default HotStuff is limited so in §6 we tune HotStuff to obtained better performance on geodistributed experiments, however, as we will see, it remains significantly slower than Dispel.

	Throughput (MiB/s)
Parallel streams	Total	Per stream
1	4	4.53
10	45	4.53
50	225	4.51
100	443	4.43
200	533	2.67

Table 2. Bandwidth between two machines located in Sydney and São Paulo depending on the number of parallel TCP streams obtained by nutTCP and showing that HOL blocking limits performance.

5.5. Multiplying TCP bandwidth capacity

Because TCP preserves ordering, using one connection to transmit independent data often leads to the head-of-line (HOL) blocking suboptimal phenomenon (SK06, ) where the loss of one packet actually delays the reception of potentially numerous subsequent packets until after the packet is retransmitted and successfully delivered. As simple way to circumvent this phenomenon is to use several TCP connections in parallel. Table 2 illustrates the difference between the NIC physical limit and the bandwidth limit of a single TCP connection. As one can see, the network usage increases linearly with the number of parallel connections until the limit of the network interface is met at around 533 MB/s, after which the bandwidth per stream decreases due to this physical limit.

Despite having several open TCP connections, one per peer, sequential SMRs replicas are unable to benefit from this parallelism. Indeed, when such sequential replicas broadcast a message, they must wait the transmission of this message completes on all the TCP connections before to progress. On the contrary, Dispel replicas do not wait the transmission completion of every TCP connection thanks to their pipeline structure.

To illustrate how Dispel cirumvents the HOL blocking phenomenon, we measure the throughput of EPaxos, HotStuff, Zyzzyva and Dispel when the number of replica varies on two setups. The first setup evenly spreads the replicas across two regions, North Virginia and North California. The bandwidth of one TCP connection between these two regions if 23 MiB/s. The second setup evenly spreads the replicas across North Virginia and Tokyo. The TCP bandwidth between these two regions 11 MiB/s. On these two setups where the TCP bandwidth is low, Dispel benefits from an increasing number of replicas. However, when the number of replicas becomes too large, the probability of having a packet loss on a majority of TCP connections becomes significant and Dispel does not benefit from additional replicas.

We report on Figure 6 the throughput for Dispel and EPaxos on the two setups, 11 MiB/s and 23 MiB/s of TCP bandwidth, when the number of replica increases with transactions of 128 bytes. We do not report the throughput for HotStuff, BFT-SMaRt and Zyzzyva as their throughput is systematically below respectively 20 Kops/s, 14 Kops/s and 6 Kops/s. We observe that for each setup and every number of replica, the throughput of Dispel is larger than for the other SMRs. Moreover, the throughput of Dispel first increases up to 108 MiB/s at 6 replicas in the 23 MiB/s bandwidth setup and up to 85 MiB/s at 10 replicas in the 11 MiB/s bandwidth setup. This increase is caused by Dispel replicas using more TCP connections in parallel when the number of replicas increases. We observe a similar yet less pronounced evolution for EPaxos. Indeed, EPaxos replicas decide transactions in parallel as long as they are non conflicting. This parallelism makes EPaxos also using TCP connections in parallel although to a lesser extent. While we do not observe this phenomenon on HotStuff, BFT-SMaRt and Zyzzyva, their throughput is too small to conclude.

5.6. Robustness in case of failures

In order to assess the performance of Dispel and HotStuff under failures, we ran an experiment for 122 seconds on $n=32$ nodes spread evenly in Ireland, London, Paris and Frankfurt during which we manually injected crash failures. In this experiment, each transaction is of size 400 bytes, similar to the size of simple Bitcoin transactions.

Boosting HotStuff performance for geo-distribution.

As we observed in §5.4, HotStuff performance is particularly limited by default. As far as we know, HotStuff has only been evaluated in a single datacenter where the “geodistributed” experiments were emulated by adding artificial latencies (Agm18, ). As we show in §5.5, real geodistributed setups, where large blockchains are expected to be deployed, also suffer from packet loss forcing TCP connections to block during packet retransmissions. As an example, during the Red Belly Blockchain experiments (CNG18, ), the bandwidth of a TCP connection between the AWS availability zones of Sydney and São Paulo was only 5 MB/s. To cope with this limitation, we increased the size of the batches used by HotStuff to obtain the same decision size of 25 MB in both HotStuff and Dispel in large-scale experiments. When HotStuff uses larger batches, there are proportionally more packet loss during the batch transmission than during the other phases of consensus. Since a replica broadcasts batches to every peers in parallel, larger batches brings more opportunities for parallel TCP transmissions. While this increases the throughput, it comes at the cost or larger latencies. In our case, using 25 MB batches results in an average request latency of 20 seconds.

Impact of isolated and correlated failures.

Figure 7 depicts the performance expressed as the throughput in MiB/second and the number of transactions committed per second by the two protocols. First, one can see that HotStuff takes more time than Dispel to reach its peak throughput that is lower than Dispel. These two observations confirm previous conclusions that the leader-based pattern leads to higher latency and lower throughput (CNG18, ).

More interestingly, we manually injected a failure of the leader on HotStuff and the weak coordinator of the binary consensus on Dispel. As HotStuff triggers a view-change when the faulty leader is detected as slow, the system must wait for the detection to occur and for the view-change to complete before the throughput can increase again. Again this phenomenon was already experienced in leader-based SMRs, like Zookeeper (SRM12, , Fig.6), Multi-Paxos (MAK13, , Fig.10), Mir-BFT (SDM19, , Fig.8) and Paxos (EBR20, , Fig.8). More surprisingly, a second view-change seems to occur systematically after one failure, this can be seen at 57 seconds. As no single node correctness is required for termination in Dispel, the throughput does not seem to drop in Dispel after a single failure.

We also injected correlated failures to see the performance variations in Dispel and HotStuff. The correlated failures consists of shutting down all the remaining machines of the Frankfurt region. The throughput of HotStuff drops to 0 whereas the throughput of Dispel drops by about 30%. Note that the impact of failures on performance of leader-based SMRs raises the question of their suitability for cryptocurrency applications where simply DDoS-ing one node, the leader, or a region can DDoS the entire system.

6. Cryptocurrency Application

We now present the performance of Dispel within a cryptocurrency application on up to 380 machines deployed over 3 continents and show that the bottleneck is not Dispel but the cryptographic verifications.

Application cryptographic overhead.

To understand if an SMR can be used for blockchain applications like a cryptocurrency where assets are transferred among owners of some accounts represented with a public key, one must evaluate the performance of the SMR when transactions are cryptographically signed by the clients and these signatures are verified by the replica. Figure 8 depicts the time it takes to verify signatures with one core using the public-key cryptosystems available in OpenSSL and written in C, and in Bouncy Castle and written in Java. One can observe the large variation of performance of the cryptosystems and the overhead of the Java library.

Dispel with a cryptocurrency application.

We extended Dispel to support a cryptocurrency application by signing and verifying all transactions. Each user is equipped with an account and a public-private key pair. Each client pre-generates signed transactions using its private key prior to sending them to some replica, whereas every correct replica that receives a transaction verifies it using the public key associated with the account. Once the verification is correct the transaction can be stored.

From blockchain to block-sequence.

Note that the hash-link between blocks is not necessary under the assumption that $f<\frac{n}{3}$ processes can fail as one can simply retrieve an immutable copy of a block by requesting it from $2f+1$ correct processes. By definition, among these responses, all correct replicas will respond with the same tamper-proof copy. Hence one has to find the copy duplicated $f+1$ times among the $2f+1$ ones to identify it as the correct copy of the block.

Performance at large-scale.

For this experiment, we used up to 380 c5.2xlarge replica VMs located equally in 10 regions on 3 continents: California, Canada, Frankfurt, Ireland, London, North Virginia, Ohio, Oregon, Sydney and Tokyo. The c5.2xlarge VMs have the same configuration than the c5.xlarge VMs we use in §5 but with 8 hardware threads instead of 4 and 16 GiB of memory instead of 8 GiB. For sending transactions, we spawned 10 additional client VMs, one from each region. We selected the per-node batch size to be 2 MB for 10 replicas, 8 MB for 40 replicas and 24 MB for 120, 250, 380 replicas to minimize the duration of the experiments. We use the secp256k1 public-key cryptosystem that is used by Bitcoin (Nak08, ) and transactions of size 400 bytes, just like Bitcoin UTXO transactions.

Figure 9 depicts the performance of Dispel when executing a cryptocurrency application where all transactions are verified by all replicas. We represent the transactions committed per second, including the signatures in addition to the application payload that contains the simple transfer information. One can see that the throughput and latency vary slightly across the different network sizes but there is no clear difference between the performance on 10 nodes or 380 nodes, indicating that Dispel is not the bottleneck when running the cryptocurrency application. In these experiments, only 3 epochs in Dispel is sufficient to achieve the best performance in the cryptocurrency application, so further cryptographic optimizations like verification sharding (CNG18, ; SDM19, ) would better leverage Dispel.

7. Related Work

Research on Byzantine fault tolerance (BFT) started in the 80’s (PSL80, ; LSP82, ) and later improvements (CL02, ) led to a long series of leader-based BFT systems. The technique of pipelining originated in network (PM95, ) and was later applied to consensus in the context of crash fault tolerance (Lam98, ).

Bypassing the leader bottleneck.

The idea that the leader-based design can limit the throughput of consensus is not new (BS10, ; BMS12, ; ABQ13, ; CNG18, ; SDM19, ). S-Paxos is a variant of Paxos that aims at disseminating client requests to all replicas to offload the leader (BMS12, ). In particular, it increases the throughput of Paxos by balancing the CPU workload over all replicas, similar to what our hashing phase does in Dispel. However, it tolerates only crash failures. RBFT (ABQ13, ) uses multiple concurrent instances of PBFT to detect a slow master instance and triggers a leader replacement through PBFT’s complex view change. Mir-BFT (SDM19, ) combines these instances to outperform PBFT but when some leaders fail, the throughput can only recover after multiple view changes discard the faulty leaders. Some consensus algorithms alleviate the need for a leader by requiring an exponential information gathering tree (BS10, ) or synchrony (GHM17, ). Democratic BFT does not use a leader but a weak coordinator whose failure does not prevent termination within the same round (CNG18, ). Despite this observation, most blockchain SMRs build upon the classic leader-based pattern (Buc16, ; EGSvR16, ; KJGK16, ; KJGG17, ; GAGM18, ).

EPaxos (MAK13, ) bypasses the leader whose failure might impact performance by exploiting one leader per command issued. If these commands do not conflict, they are committed concurrently. Atlas (EBR20, ) improves over EPaxos on 13 sites distributed world-wide by adding replicas closer to clients. These solutions are not Byzantine fault tolerant. Classic reductions (BCG93, ) from the multivalue consensus to the binary consensus problem, like the one we use, avoid the leader to propose the value that will be decided. Instead they reliably broadcast values and spawn binary consensus instances, which has already proved efficient in SMRs (MSC16, ; CNG18, ). HoneyBadger (MSC16, ) is an SMR for asynchronous networks building upon this reduction. It exploits erasure coding to limit the communication complexity of the reliable broadcast. As consensus cannot be solved in an asynchronous model (FLP85, ), it builds upon a randomized consensus (MMR14, ) that converges in constant expected time as long as messages are scheduled in a fair manner (MMR14, ; TG19, ), an assumption called fair scheduler (BT85, ). Red Belly Blockchain (CNG18, ) is deterministic, works in a partially synchronous model (DLS88, ) and outperforms HoneyBadger by avoiding the CPU overhead of erasure coding and introducing verification sharding. Although it relies on DBFT (CGLR18, ) like we do to balance the load across multiple links, it does not leverage the bandwidth like Dispel because it does not pipeline: consensus instances are executed sequentially as each block depends on the previous one.

Pipelining to increase performance.

The idea of pipelining is quite old and consists, in the context of networking, of sending a packet before its predecessor has been acknowledged within the same connection (PM95, ). The original version of Paxos (Lam98, ) mentioned the idea of pipelining as ballotting could take place in parallel with ballots initiated by different legislators. Pipelining ballots in Paxos has also been implemented (KSZ11, ; SS12, ; SS13, ), however, the benefit of pipelining was not significant. JPaxos (KSZ11, ) is an implementation of MultiPaxos that focuses on recovery, batching, pipelining and concurrency, but the pipeline is actually tuned in a separate work (SS12, ; SS13, ) where the leader of MultiPaxos can spawn multiple instances in parallel to increase the resource utilization. The authors note that the drawback of the approach is that pipelining may lead to congestion: multiple instances can max out the leader’s CPU or cause network congestion. Distributed pipelining leads to different conclusions.

Chain (AGK14, ) organizes nodes in a different pipeline so that only one head node, that can be seen as a leader, spawns all instances. HotStuff (YMR19, ) is a leader-based SMR with a reduced communication complexity. Its leader piggybacks the phase of one consensus instance with the phase of another consensus instance, hence offering a form of pipelining. In addition it reduces the leader bottleneck by having clients sending their proposals directly to all replicas so that the leader can simply send digests in all the consensus phases. To reduce message complexity, HotStuff makes use of threshold signatures. Dispel differs in that it does not reduce message complexity but instead balances its network load by having any replica spawn new consensus instances based on its resource usage.

The COP scheme (BDK15, ) dispatches batches to concurrent consensus instances to execute pipeline stages in parallel at every replica by exploiting the multiple cores available. Dispel distributed the pipeline over the network, minimizing the number of threads per replica in an even-driven loop that favors run-to-completion. Multiple threads are only spawned for cryptographic tasks including the hashing task.

Mencius (MJM08, ) is an SMR that exploits pipelining by allowing replicas that are faster than the leader to propose a no-op proposals, hence allowing to speed up the consensus by (i) preventing replicas with nothing to propose from blocking the protocol and (ii) coping with faulty leaders. The authors mentioned the difficulty of making Mencius Byzantine fault tolerant because not all communications are exchanged through a quorum and would require a trusted component. Some efforts were devoted to reducing the latency of replicated state machines to increase their throughput without the need for pipelining (FV97, ; SB12, ; MBS13, ). BFT-Mencius (MBS13, ) consists of upper-bounding the latency of updates initiated by correct processes by using an abortable timely announced broadcast. Like in Mencius, BFT-Mencius allows a replica to skip its turn by proposing no-op. The experimental results clearly indicates in a cluster setting that the lower the latency the higher the throughput for BFT-Mencius as well as Spinning and PBFT (CL02, ). This confirms previous observations on non-pipelined replicated state machines (FV97, ). Some research work even explored the latency optimality of BFT state machine replication (SB12, ), which was implemened in BFT-SMaRt (BSA14, ). Our approach indicates that distributed pipelining can lead to significantly higher throughput while still achieving reasonable latency at large scale.

8. Conclusion

Dispel is the first SMR to exploit resources through distributed pipelining. At 128 nodes, it improves the throughput of the HotStuff SMR we know by 12-fold. Within a cryptocurrency application, Dispel demonstrates that blockchains can suffer from bottlenecks that are not related to the network usage of the consensus protocols they build upon. Our work has revealed the significant impact of the HOL blocking factor on performance, that was neglected because it could not be observed in slower SMRs before. This observation opens up new interesting research directions on the choice of layer-4 network protocols for large-scale applications like blockchains.

Acknowledgments

We wish to thank Alysson Bessani for his fruitful comments on an earlier version of this draft. This research is supported under Australian Research Council Discovery Projects funding scheme (project number 180104030) entitled “Taipan: A Blockchain with Democratic Consensus and Validated Contracts” and Australian Research Council Future Fellowship funding scheme (project number 180100496) entitled “The Red Belly Blockchain: A Scalable Blockchain for Internet of Things”.

References

[1] Michael Abd-El-Malek, Gregory R. Ganger, Garth R. Goodson, Michael K. Reiter, and Jay J. Wylie. Fault-scalable Byzantine fault-tolerant services. In Proceedings of the Twentieth ACM Symposium on Operating Systems Principles, SOSP ’05, pages 59–74, 2005.
[2] Ittai Abraham, Guy Gueta, and Dahlia Malkhi. Hot-stuff the linear, optimal-resilience, one-message BFT devil. CoRR, abs/1803.05069, 2018.
[3] Elli Androulaki, Artem Barger, Vita Bortnikov, Christian Cachin, Konstantinos Christidis, Angelo De Caro, David Enyeart, Christopher Ferris, Gennady Laventman, Yacov Manevich, Srinivasan Muralidharan, Chet Murthy, Binh Nguyen, Manish Sethi, Gari Singh, Keith Smith, Alessandro Sorniotti, Chrysoula Stathakopoulou, Marko Vukolić, Sharon Weed Cocco, and Jason Yellick. Hyperledger fabric: A distributed operating system for permissioned blockchains. In Proceedings of the Thirteenth EuroSys Conference, EuroSys ’18, pages 30:1–30:15, 2018.
[4] Balaji Arun, Sebastiano Peluso, and Binoy Ravindran. ezBFT: Decentralizing Byzantine fault-tolerant state machine replication. In 39th IEEE International Conference on Distributed Computing Systems, ICDCS 2019, Dallas, TX, USA, July 7-10, 2019, pages 565–577, 2019.
[5] Pierre-Louis Aublin, Rachid Guerraoui, Nikola Knežević, Vivien Quéma, and Marko Vukolić. The next 700 BFT protocols. ACM Trans. Comput. Syst., 32(4):12:1–12:45, January 2015.
[6] Pierre-Louis Aublin, Sonia Ben Mokhtar, and Vivien Quéma. RBFT: redundant byzantine fault tolerance. In ICDCS, pages 297–306, 2013.
[7] Pierre-Louis Aublin, Sonia Ben Mokhtar, and Vivien Quéma. RBFT: Redundant Byzantine fault tolerance. In 2013 IEEE 33rd International Conference on Distributed Computing Systems, pages 297–306, July 2013.
[8] Shehar Bano, Mathieu Baudet, Avery Ching, Andrey Chursin, George Danezis, François Garillot, Zekun Li, Dahlia Malkhi, Oded Naor, Dmitri Perelman, and Alberto Sonnino. State machine replication in the libra blockchain, 2019. https://developers.libra.org/docs/assets/papers/libra-consensus-state-machine-replication-in-the-libra-blockchain.pdf (accessed 10/2019).
[9] Johannes Behl, Tobias Distler, and Rüdiger Kapitza. Consensus-oriented parallelization: How to earn your first million. In Proceedings of the 16th Annual Middleware Conference, pages 173–184, 2015.
[10] Michael Ben-Or, Ran Canetti, and Oded Goldreich. Asynchronous secure computation. In Proceedings of the Twenty-fifth Annual ACM Symposium on Theory of Computing, STOC ’93, pages 52–61, New York, NY, USA, 1993. ACM.
[11] Michael Ben-Or, Boaz Kelmer, and Tal Rabin. Asynchronous secure computations with optimal resilience (extended abstract). In Proceedings of the Thirteenth Annual ACM Symposium on Principles of Distributed Computing, PODC, pages 183–192, 1994.
[12] Christian Berger, Hans P. Reiser, João Sousa, and Alysson Bessani. Resilient wide-area Byzantine consensus using adaptive weighted replication. In Proceedings of the 34th IEEE Symposium on Reliable Distributed Systems (SRDS), 2019.
[13] Alyson Bessani, João Sousa, and Eduardo E. P. Alchieri. State machine replication for the masses with BFT-Smart. In Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 355–362, June 2014.
[14] Martin Biely, Zarko Milosevic, Nuno Santos, and André Schiper. S-Paxos: Offloading the leader for high throughput state machine replication. In Proceedings of the IEEE 31st Symposium on Reliable Distributed Systems SRDS, pages 111–120, 2012.
[15] Fatemeh Borran and André Schiper. A leader-free Byzantine consensus algorithm. In Krishna Kant, Sriram V. Pemmaraju, Krishna M. Sivalingam, and Jie Wu, editors, Distributed Computing and Networking, pages 67–78, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
[16] Gabriel Bracha. Asynchronous Byzantine agreement protocols. Inf. Comput., 75(2):130–143, November 1987.
[17] Gabriel Bracha and Sam Toueg. Asynchronous consensus and broadcast protocols. J. ACM, 32(4):824–840, October 1985.
[18] Ethan Buchman. Tendermint: Byzantine fault tolerance in the age of blockchains, 2016. MS Thesis.
[19] Miguel Castro and Barbara Liskov. Practical Byzantine fault tolerance and proactive recovery. ACM Trans. Comput. Syst., 20(4):398–461, November 2002.
[20] Allen Clement, Edmund Wong, Lorenzo Alvisi, Mike Dahlin, and Mirco Marchetti. Making Byzantine fault tolerant systems tolerate Byzantine faults. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI’09, pages 153–168, 2009.
[21] Paulo R. Coelho, Tarcisio Ceolin Junior, Alysson Bessani, Fernando Luís Dotti, and Fernando Pedone. Byzantine fault-tolerant atomic multicast. In 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018, Luxembourg City, Luxembourg, June 25-28, 2018, pages 39–50, 2018.
[22] Tyler Crain, Vincent Gramoli, Mikel Larrea, and Michel Raynal. DBFT: Efficient leaderless Byzantine consensus and its applications to blockchains. In Proceedings of the 17th IEEE International Symposium on Network Computing and Applications (NCA’18), 2018.
[23] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in the presence of partial synchrony. J. ACM, 35(2):288–323, April 1988.
[24] Vitor Enes, Carlos Baquero, Tuanir França Rezende, Alexey Gotsman, Matthieu Perrin, and Pierre Sutra. State-machine replication for planet-scale systems. In Proceedings of the Fifteenth European Conference on Computer Systems, 2020.
[25] Zachary Amsden et al. The Libra blockchain. Technical report, Calibra, 2019. Revised version of April 9th, 2020, https://developers.libra.org/docs/assets/papers/the-libra-blockchain/2020-04-09.pdf.
[26] Ittay Eyal, Adem Efe Gencer, Emin Gün Sirer, and Robbert van Renesse. Bitcoin-NG: A scalable blockchain protocol. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016.
[27] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. J. ACM, 32(2):374–382, April 1985.
[28] R. Friedman and A. Vaysburd. Fast replicated state machines over partitionable networks. In Proceedings of the 16th IEEE Symposium on Reliable Distributed Systems, pages 130–137, 1997.
[29] Roy Friedman and Erez Hadad. Adaptive batching for replicated servers. In Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems (SRDS), pages 311–320, 2006.
[30] Yossi Gilad, Rotem Hemo, Silvio Micali, Georgios Vlachos, and Nickolai Zeldovich. Algorand: Scaling Byzantine agreements for cryptocurrencies. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP ’17, pages 51–68, New York, NY, USA, 2017. ACM.
[31] Guy Golan-Gueta, Ittai Abraham, Shelly Grossman, Dahlia Malkhi, Benny Pinkas, Michael K. Reiter, Dragos-Adrian Seredinschi, Orr Tamir, and Alin Tomescu. SBFT: a scalable decentralized trust infrastructure for blockchains. CoRR, abs/1804.01626, 2018.
[32] Divya Gupta, Lucas Perronne, and Sara Bouchenak. BFT-Bench: Towards a practical evaluation of robustness and effectiveness of BFT protocols. In DAIS, pages 115–128, 2016.
[33] Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini. Zab: High-performance broadcast for primary-backup systems. In Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems&Networks, DSN ’11, pages 245–256, 2011.
[34] Rüdiger Kapitza, Johannes Behl, Christian Cachin, Tobias Distler, Simon Kuhnle, Seyed Vahid Mohammadi, Wolfgang Schröder-Preikschat, and Klaus Stengel. CheapBFT: Resource-efficient byzantine fault tolerance. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys ’12, pages 295–308, 2012.
[35] Eleftherios Kokoris Kogias, Philipp Jovanovic, Nicolas Gailly, Ismail Khoffi, Linus Gasser, and Bryan Ford. Enhancing Bitcoin security and performance with strong consistency via collective signing. In 25th USENIX Security Symposium (USENIX Security 16), pages 279–296, Austin, TX, 2016. USENIX Association.
[36] Eleftherios Kokoris-Kogias, Philipp Jovanovic, Linus Gasser, Nicolas Gailly, Ewa Syta, and Bryan Ford. Omniledger: A secure, scale-out, decentralized ledger via sharding. Cryptology ePrint Archive, Report 2017/406, 2017. https://eprint.iacr.org/2017/406.
[37] Jan Kończak, Nuno Santos, Tomasz Żurkowski, Paweĺ T. Wojciechowski, and André Schiper. JPaxos: State machine replication based on the Paxos protocol. Technical Report 167765, EPFL, 2011.
[38] Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund Wong. Zyzzyva: Speculative Byzantine fault tolerance. In Proceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, SOSP ’07, pages 45–58, 2007.
[39] Leslie Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16(2):133–169, May 1998.
[40] Leslie Lamport. Leaderless Byzantine consensus, 2010. United States Patent, Microsoft Corporation, Redmond, WA (USA).
[41] Leslie Lamport. Brief announcement: Leaderless Byzantine Paxos. In Proceedings of the 25th International Symposium on Distributed Computing, DISC, pages 141–142, 2011.
[42] Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine generals problem. ACM Trans. Program. Lang. Syst., 4(3):382–401, July 1982.
[43] Yanhua Mao, Flavio P. Junqueira, and Keith Marzullo. Mencius: Building efficient replicated state machines for WANs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI’08, pages 369–384, Berkeley, CA, USA, 2008. USENIX Association.
[44] Andrew Miller, Yu Xia, Kyle Croman, Elaine Shi, and Dawn Song. The honey badger of BFT protocols. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, pages 31–42, New York, NY, USA, 2016. ACM.
[45] Zarko Milosevic, Martin Biely, and André Schiper. Bounded delay in Byzantine-tolerant state machine replication. In Proceedings of the IEEE 32nd International Symposium on Reliable Distributed Systems (SRDS), pages 61–70, Sep. 2013.
[46] Iulian Moraru, David G. Andersen, and Michael Kaminsky. There is more consensus in egalitarian parliaments. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP’13, pages 358–372, New York, NY, USA, 2013. Association for Computing Machinery.
[47] Achour Mostéfaoui, Hamouma Moumen, and Michel Raynal. Signature-free asynchronous Byzantine consensus with t ¿ n/3 and o(n2) messages. In Proceedings of the 2014 ACM Symposium on Principles of Distributed Computing, PODC ’14, pages 2–9, New York, NY, USA, 2014. ACM.
[48] Satoshi Nakamoto. Bitcoin: a peer-to-peer electronic cash system, 2008. http://www.bitcoin.org.
[49] Venkata N. Padmanabhan and Jeffrey C. Mogul. Improving HTTP latency. Comput. Netw. ISDN Syst., 28(1-2):25–35, December 1995.
[50] Marshall C. Pease, Robert E. Shostak, and Leslie Lamport. Reaching agreement in the presence of faults. J. ACM, 27(2):228–234, 1980.
[51] Nuno Santos and André Schiper. Tuning Paxos for high-throughput with batching and pipelining. In Proceedings of the 13th International Conference on Distributed Computing and Networking (ICDCN), pages 153–167, 2012.
[52] Nuno Santos and André Schiper. Optimizing Paxos with batching and pipelining. Theor. Comput. Sci., 496:170–183, July 2013.
[53] Michael Scharf and Sebastian Kiesel. NXG03-5: Head-of-line blocking in TCP and SCTP: Analysis and measurements. In IEEE Globecom 2006, pages 1–5, Nov 2006.
[54] Alexander Shraer, Benjamin Reed, Dahlia Malkhi, and Flavio Junqueira. Dynamic reconfiguration of primary/backup clusters. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference, USENIX ATC’12, pages 39–39, Berkeley, CA, USA, 2012. USENIX Association.
[55] Yee Jiun Song and Robbert Renesse. Bosco: One-step Byzantine asynchronous consensus. In Proceedings of the 22Nd International Symposium on Distributed Computing, DISC ’08, pages 438–450, Berlin, Heidelberg, 2008. Springer-Verlag.
[56] João Sousa and Alysson Bessani. From Byzantine consensus to BFT state machine replication: A latency-optimal transformation. In Proceedings of the 2012 Ninth European Dependable Computing Conference, EDCC ’12, pages 37–48, 2012.
[57] João Sousa and Alysson Bessani. Separating the WHEAT from the chaff: An empirical design for geo-replicated state machines. In Proceedings of the 34th IEEE Symposium on Reliable Distributed Systems (SRDS), pages 146–155, 2015.
[58] João Sousa, Alysson Bessani, and Marko Vukolić. A byzantine fault-tolerant ordering service for the hyperledger fabric blockchain platform. In 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018, Luxembourg City, Luxembourg, June 25-28, 2018, pages 51–58, 2018.
[59] Chrysoula Stathakopoulou, Tudor David, and Marko Vukolić. Mir-BFT: High-throughput BFT for blockchains. Technical Report 1906.05552v1, arXiv, 2019.
[60] Pierre Tholoniat and Vincent Gramoli. Formally verifying blockchain Byzantine fault tolerance. In The 6th Workshop on Formal Reasoning in Distributed Algorithms (FRIDA’19), 2019. Available at https://arxiv.org/pdf/1909.07453.pdf.
[61] Vincent Gramoli Tyler Crain, Christopher Natoli. Evaluating the Red Belly Blockchain. Technical Report 1812.11747, arXiv, 2018.
[62] Giuliana Santos Veronese, Miguel Correia, Alysson Neves Bessani, Lau Cheuk Lung, and Paulo Veríssimo. Efficient Byzantine fault-tolerance. IEEE Trans. Comput., 62(1):16–30, January 2013.
[63] Marko Vukolić. The quest for scalable blockchain fabric: Proof-of-work vs. BFT replication. In IFIP WG 11.4 International Workshop on Open Problems in Network Security, pages 112–125, 2015.
[64] Gavin Wood. Ethereum: A secure decentralised generalised transaction ledger, 2015. Yellow paper.
[65] Maofan Yin, Dahlia Malkhi, Michael K. Reiter, Guy Golan-Gueta, and Ittai Abraham. HotStuff: BFT consensus with linearity and responsiveness. In Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing, pages 347–356, 2019.