From PACELC to Byzantine Consensus
Why Raft/Paxos stop being enough, and what PBFT/HotStuff/Tendermint do differently
A note up front: when people name “BGP” as the next, harder consensus problem, they usually mean the Byzantine Generals Problem (not the Internet routing protocol that shares the acronym). The Byzantine Generals Problem is the classic way to talk about Byzantine faults—nodes that can lie, equivocate, or behave maliciously.
This post connects the dots:
PACELC: the trade-offs you’re always making (availability vs. consistency under partition; latency vs. consistency otherwise)
Crash-fault consensus (Paxos/Raft): great when nodes fail benignly
Byzantine consensus (PBFT, HotStuff, Tendermint): needed when nodes can be arbitrary/malicious
Concrete, easy-to-follow examples and “how the algorithm actually moves messages”
1) PACELC sets the stage: consistency is never “free”
PACELC (Abadi) is basically:
If there’s a Partition (P) → choose Availability (A) or Consistency (C)
Else (E) (no partition) → you still trade Latency (L) vs Consistency (C)
Abadi’s key point is that the consistency/latency trade-off exists whenever you replicate—even when there’s no partition—so focusing only on CAP misses what dominates day-to-day system behavior.
Why this matters for consensus:
Consensus is one of the strongest “pick C” tools we have. It buys you linearizable ordering (or close variants), but it does so by forcing extra communication rounds—i.e., it very directly “spends latency” to buy safety/consistency.
2) Crash faults vs Byzantine faults: the single most important distinction
Crash / fail-stop model (what Raft is built for)
In the crash-fault world, a node:
follows the protocol faithfully, but may stop responding (crash), restart, or run slowly
Raft is a replicated-log consensus algorithm designed to be understandable and equivalent in outcome to Multi-Paxos-style log replication.
It’s widely used in infrastructure systems like etcd (Kubernetes’ backing store), which explicitly states it uses Raft for replication and agreement. (etcd)
Byzantine model (what “Byzantine Generals” is about)
In the Byzantine world, a faulty node can do anything, including:
send different stories to different peers (“equivocation”)
forge progress, reorder messages, selectively omit messages
behave maliciously on purpose
Lamport et al. frame this as the Byzantine Generals Problem, and show a core impossibility flavor: with only “oral messages” (normal computer messages), you need more than 2/3 loyal participants to solve it.
That single difference (“faulty nodes may lie”) is why Raft/Paxos don’t automatically extend to adversarial settings.
3) The Byzantine Generals Problem in one simple story
Imagine 4 generals (A, B, C, D). They must agree on ATTACK or RETREAT.
If everyone is loyal, any reasonable protocol works.
If D is a traitor, D can tell:
A: “ATTACK”
B: “RETREAT”
C: “ATTACK”
Now the loyal generals don’t even share the same inputs.
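The equivocation in this story is small enough to sketch in Python. This is a toy illustration (not a real protocol); the orders are the ones from the story above:

```python
from collections import Counter

LOYAL = ["A", "B", "C"]

def traitor_message(recipient: str) -> str:
    # Traitor D equivocates: a different order for each loyal general.
    return {"A": "ATTACK", "B": "RETREAT", "C": "ATTACK"}[recipient]

# Round 1: each loyal general records what D told them directly.
direct = {g: traitor_message(g) for g in LOYAL}
assert len(set(direct.values())) > 1   # the loyal generals' inputs already diverge

# Round 2: loyal generals exchange what they heard and majority-vote;
# with enough honest overlap, honesty outvotes the single liar.
decision, _ = Counter(direct.values()).most_common(1)[0]
print(decision)  # ATTACK
```

The exchange-and-vote step is the essence of the oral-messages algorithm: one traitor among four participants cannot split the loyal majority.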
The problem requires interactive consistency properties like:
all loyal parties decide the same value
if the commander is loyal, loyal lieutenants follow the commander
…and Lamport et al. show why, with normal messages, you need a strong resilience threshold (intuitively, enough honest overlap to outvote lies).
This is the bridge from “crash tolerance” to “malicious tolerance”.
4) Why Byzantine systems need 3f+1 replicas (intuition, not magic)
Most practical Byzantine Fault Tolerant (BFT) replication assumes:
up to f replicas may be Byzantine
total replicas n ≥ 3f + 1
decisions require a supermajority quorum (often 2f + 1 votes)
HotStuff states this directly in the partially synchronous model: n ≥ 3f + 1 is required, and deterministic progress is only guaranteed after network stabilizes (GST). (Michael Reiter's Home Page)
Quorum intersection intuition:
If you require 2f+1 votes to commit,
and there are at most f liars,
then any two sets of 2f+1 votes must overlap in at least f+1 nodes,
which guarantees at least one honest node sits in the overlap,
preventing two conflicting commits from both being “valid.”
That overlap property is the spine of safety proofs in most modern BFT protocols.
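The overlap claim is small enough to check exhaustively. This sketch (illustrative, not from any paper) enumerates every pair of 2f+1-sized quorums drawn from n = 3f+1 replicas and finds the smallest intersection:

```python
from itertools import combinations

def min_quorum_overlap(f: int) -> int:
    """Smallest intersection between any two 2f+1 quorums out of n = 3f+1 replicas."""
    n, quorum = 3 * f + 1, 2 * f + 1
    return min(
        len(set(q1) & set(q2))
        for q1 in combinations(range(n), quorum)
        for q2 in combinations(range(n), quorum)
    )

for f in (1, 2):
    # Any two quorums share at least f+1 replicas,
    # so at least one honest node sits in every overlap.
    assert min_quorum_overlap(f) == f + 1
```

For f=1 this enumerates all pairs of 3-of-4 quorums and confirms every pair shares at least 2 replicas—one of which must be honest when at most 1 can lie.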
5) PBFT: the classic “practical” Byzantine consensus
What PBFT promises
PBFT (Castro + Liskov) is one of the first widely cited practical state-machine-replication protocols tolerating Byzantine faults in realistic networks, and it guarantees safety + liveness provided at most ⌊(n−1)/3⌋ replicas are faulty (i.e., fewer than one-third).
The PBFT roles
Primary (leader) proposes ordering
Backups validate and vote
Clients send requests and wait for enough matching replies
The PBFT normal-case pipeline
PBFT’s normal case has three key phases (plus client request/reply):
PRE-PREPARE (leader proposes an order)
PREPARE (replicas echo/confirm what they saw)
COMMIT (replicas lock it in strongly enough to execute)
PBFT describes how the primary assigns sequence numbers and multicasts pre-prepare, then replicas send prepare, then commit to ensure total order even across leader changes.
Easy example: PBFT with f=1 (so n=4)
Let replicas be R0, R1, R2, R3.
Assume f = 1, so at most one is Byzantine. You need:
2f+1 = 3 matching “votes” to cross key thresholds.
Step-by-step (happy path):
(0) Client request
Client sends operation op = "transfer(…)" to the primary R0.
(1) PRE-PREPARE
R0 picks a sequence number (say s=42) and broadcasts: PRE-PREPARE(view=v, seq=42, digest(op))
Backups check:
signatures/digests are valid
no conflicting pre-prepare has already been accepted for (v, 42)
(2) PREPARE
Each backup broadcasts: PREPARE(v, 42, digest(op))
A replica becomes prepared when it has:
the pre-prepare, and
2f prepares from different backups matching it
(for f=1, that’s 2 prepares)
(3) COMMIT
Once prepared, each replica broadcasts: COMMIT(v, 42, digest(op))
A replica becomes committed-local once it is prepared and has 2f+1 matching commits (for f=1, that’s 3).
(4) Execute + reply
Now it is safe to execute operation 42 in order and reply to the client. The client waits for enough matching replies to be confident.
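The happy-path vote counting above can be sketched as a per-slot state machine. This is an illustrative simplification (no signatures, view changes, or checkpoints), not the actual PBFT implementation:

```python
F = 1   # tolerated Byzantine replicas; n = 3f+1 = 4 total

class Slot:
    """Vote bookkeeping for one sequence number at one replica."""

    def __init__(self):
        self.pre_prepare = None   # digest proposed by the primary for this slot
        self.prepares = set()     # replicas whose PREPARE matched the pre-prepare
        self.commits = set()      # replicas whose COMMIT matched it

    def on_pre_prepare(self, digest):
        if self.pre_prepare is None:   # reject conflicting proposals for (v, seq)
            self.pre_prepare = digest

    def on_prepare(self, replica_id, digest):
        if digest == self.pre_prepare:
            self.prepares.add(replica_id)

    def on_commit(self, replica_id, digest):
        if digest == self.pre_prepare:
            self.commits.add(replica_id)

    @property
    def prepared(self):            # pre-prepare + 2f matching prepares
        return self.pre_prepare is not None and len(self.prepares) >= 2 * F

    @property
    def committed_local(self):     # prepared + 2f+1 matching commits
        return self.prepared and len(self.commits) >= 2 * F + 1
```

Note how `on_pre_prepare` ignores a second, conflicting digest for the same slot: an equivocating primary can’t overwrite what a backup already accepted.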
What if the leader is malicious?
If R0 (primary) is Byzantine and tries to send different pre-prepares for the same (v, 42) to different backups, PBFT’s prepare/commit quorum logic ensures that honest replicas won’t end up committing two different values for the same slot—because committing requires enough overlapping honest votes.
PBFT cost profile (practical reality)
PBFT’s strength is safety under Byzantine behavior, but you pay for it:
more messages than crash-fault protocols
heavier cryptography (signatures/digests)
more complex view changes
That cost shows up as extra latency and CPU—exactly the “ELC” side of PACELC in action.
6) HotStuff: modern BFT designed to scale better
HotStuff is a leader-based BFT replication protocol for the partially synchronous model. It targets two big practical improvements:
Responsiveness: after the network stabilizes, a correct leader can drive progress at the pace of actual network delay (not worst-case bounds).
Linearity: communication complexity that is linear in the number of replicas (in the right conditions), especially around leader replacement. (Michael Reiter's Home Page)
HotStuff also explicitly states:
it works in partial synchrony with an unknown “GST” after which message delay is bounded, and
requires n ≥ 3f+1. (Michael Reiter's Home Page)
The HotStuff mental model: “Quorum Certificates”
Instead of “I saw 2f+1 prepares individually,” HotStuff tends to bundle votes into a QC (Quorum Certificate)—a compact proof that a supermajority signed off on something.
This makes it much easier to:
carry proofs forward across views,
perform leader changes without O(n³) storms,
pipeline decisions.
HotStuff even includes a concrete algorithm listing with phases like prepare → pre-commit → commit → decide and rules such as safeNode that combine safety and liveness constraints. (Michael Reiter's Home Page)
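A QC can be modeled as a simple value object plus a threshold check. The sketch below is illustrative (real HotStuff aggregates votes with threshold signatures so the certificate stays constant-size):

```python
from dataclasses import dataclass

F = 1
QUORUM = 2 * F + 1   # 3 votes out of n = 3f+1 = 4 replicas

@dataclass(frozen=True)
class QC:
    view: int
    block_hash: str
    signers: frozenset   # real HotStuff: one aggregated/threshold signature

def try_form_qc(view: int, block_hash: str, votes: dict):
    """votes maps replica id -> the (view, block_hash) pair it signed."""
    matching = frozenset(
        rid for rid, signed in votes.items() if signed == (view, block_hash)
    )
    # A QC exists only once a supermajority signed the same (view, block).
    return QC(view, block_hash, matching) if len(matching) >= QUORUM else None
```

Carrying a QC forward then means passing one compact object instead of 2f+1 separate messages, which is what keeps leader changes linear.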
Where HotStuff shows up
The Diem (formerly Libra) project documents that DiemBFT is based on HotStuff. (developers.diem.com)
This is a good real-world pointer for “HotStuff-family BFT” in a production-grade design space.
7) Tendermint: BFT consensus built for WAN + gossip
Tendermint is presented as a BFT consensus algorithm intended for:
wide-area networks,
many nodes,
nodes not all in a single administrative domain
The Tendermint paper explicitly contrasts older SMR deployments (small, single-domain) that often assume benign/crash faults, with blockchain/consortium settings where Byzantine behavior is a real design assumption. (arXiv)
It also notes:
Tendermint is inspired by PBFT/DLS-style approaches
it can decide in three communication steps under favorable conditions (correct proposer + timely comms), similar in spirit to PBFT’s normal case. (arXiv)
Tendermint uses supermajority thresholds over voting power: a decision requires validators holding more than two-thirds of the total voting power—the weighted analogue of 2f+1 out of 3f+1 nodes. (arXiv)
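A minimal sketch of that weighted tally (validator names and voting powers below are made up for illustration):

```python
# Made-up validator set; Tendermint weighs votes by voting power, not node count.
VOTING_POWER = {"val-a": 40, "val-b": 30, "val-c": 20, "val-d": 10}
TOTAL_POWER = sum(VOTING_POWER.values())

def has_two_thirds(precommits: dict) -> bool:
    """precommits maps validator -> block hash it precommitted (None = nil vote)."""
    tally = {}
    for validator, block in precommits.items():
        if block is not None:
            tally[block] = tally.get(block, 0) + VOTING_POWER[validator]
    # Decide only if some block gathered strictly more than 2/3 of total power.
    return any(3 * power > 2 * TOTAL_POWER for power in tally.values())
```

With these weights, val-a and val-b together hold 70% of the power and can decide a block even if the other two vote nil—whereas val-a plus val-d (50%) cannot.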
8) The “deep theory” you actually need: FLP + partial synchrony
If you go deep on consensus, two classic results explain why real systems make the assumptions they do:
FLP impossibility (why “fully async + deterministic consensus” fails)
FLP shows that in a fully asynchronous system, no deterministic consensus protocol can guarantee termination if even one process can fail (even benignly). (MIT CSAIL)
This is not a “PBFT vs Raft” detail—it’s why so many protocols introduce some additional assumption to regain liveness.
Partial synchrony (a common escape hatch)
Dwork–Lynch–Stockmeyer (DLS) formalizes partial synchrony: the system may behave asynchronously for a while, but eventually message delays become bounded (after some unknown GST). (MIT CSAIL)
Modern BFT protocols (HotStuff explicitly) live in that world: safety always, liveness after stabilization. (Michael Reiter's Home Page)
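A common liveness pattern in this model: since GST is unknown, grow the per-round timeout so that some round after GST is long enough for a correct leader to finish. A sketch with illustrative constants:

```python
BASE_TIMEOUT_MS = 100   # illustrative starting budget

def round_timeout_ms(round_number: int) -> int:
    # Exponential growth: eventually exceeds any (unknown) post-GST delay bound.
    return BASE_TIMEOUT_MS * (2 ** round_number)

def first_live_round(actual_delay_ms: int) -> int:
    """First round whose timeout budget exceeds the real message delay."""
    r = 0
    while round_timeout_ms(r) <= actual_delay_ms:
        r += 1
    return r
```

If the post-GST delay turns out to be 900 ms, rounds 0–3 (budgets 100–800 ms) time out, and round 4 (1600 ms) is the first in which a correct leader can drive a decision. Safety never depended on the timer; only progress did.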
9) PACELC lens on Byzantine consensus: what are you “choosing,” practically?
Let’s translate all this into PACELC-style trade-offs.
During partitions (P)
BFT protocols typically prioritize Consistency/Safety:
they will not “make up” a decision just to be available
if the network is partitioned such that you can’t assemble a quorum, progress halts (availability drops) until the partition heals
So under partition, you’re usually on the PC side (choose Consistency over Availability).
Else (E): when the network is fine
You’re still choosing between:
low latency (fast replies)
strong consistency (linearizable, globally ordered decisions)
BFT tends to choose consistency, and “pays”:
extra message rounds (often more than Raft)
signature verification / cryptographic overhead
larger quorum thresholds (2f+1 vs majority)
more complex leader rotation/view-change logic
That’s Abadi’s point in action: replication + strong semantics forces a consistency/latency trade.
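A back-of-envelope message count makes the gap concrete. The formulas below are rough happy-path estimates (ignoring batching, pipelining, and client traffic), not measurements:

```python
def raft_messages(n: int) -> int:
    # Happy path: leader sends AppendEntries to n-1 followers, gets n-1 acks.
    return 2 * (n - 1)

def pbft_messages(n: int) -> int:
    pre_prepare = n - 1        # primary multicasts the proposal to backups
    prepare = (n - 1) ** 2     # each backup multicasts PREPARE to all others
    commit = n * (n - 1)       # every replica multicasts COMMIT to all others
    return pre_prepare + prepare + commit

print(raft_messages(4), pbft_messages(4))  # 6 24
```

Even at n=4 the quadratic PREPARE/COMMIT rounds quadruple the traffic, and the ratio worsens as n grows—part of what HotStuff's linear communication aims to fix.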
10) Practical engineering notes (the stuff that bites in production)
If you ever implement or adopt BFT-style consensus, these are the real dragons:
1) Identity + key management is part of the protocol
BFT assumes authenticated identities (public keys, signatures). Your “cluster PKI” becomes a correctness dependency.
2) Membership & reconfiguration are hard
Adding/removing validators changes quorum math. Doing it safely often requires dedicated “epoch” / “configuration” mechanisms.
3) DoS and performance attacks matter
A Byzantine node can:
spam invalid signatures to waste CPU,
try to trigger repeated view changes,
exploit bandwidth asymmetries.
4) Observability must be protocol-aware
You want metrics like:
view/round number,
vote counts per phase,
QC formation latency,
suspected leader timeouts,
equivocation detection signals (if supported).
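One way such metrics might be grouped in code (field names are illustrative, not from any particular implementation):

```python
from dataclasses import dataclass

@dataclass
class BftReplicaMetrics:
    current_view: int              # view/round number
    prepare_votes: int             # vote counts per phase
    commit_votes: int
    last_qc_latency_ms: float      # how long the latest QC took to form
    leader_timeouts: int           # suspected-leader timeout count
    equivocations_detected: int    # equivocation signals, if the protocol exposes them
```

Exporting these per view makes stalls diagnosable: a rising `leader_timeouts` with flat vote counts points at the leader, while stuck `prepare_votes` points at quorum loss.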
5) If you don’t need Byzantine tolerance, don’t pay for it
Within a single trusted org, crash-fault protocols (Raft/Paxos/Zab) are often the right cost/perf point.
etcd uses Raft for exactly this style of “trusted infra consensus.” (etcd)
ZooKeeper uses Zab (not Raft) as its atomic broadcast protocol. (GitHub)
11) Quick “which should I use?” decision guide
Choose Raft/Paxos/Zab (crash-fault) when:
nodes are within one security domain
compromise is rare compared to crashes
you want simpler ops and lower latency
Choose PBFT/HotStuff/Tendermint-family (Byzantine-fault) when:
validators span multiple orgs / trust boundaries
malicious behavior is in-scope, not hypothetical
auditability and adversarial safety matter more than minimal latency
A nice rule of thumb:
If your threat model includes “a node may actively lie,” you’re already in Byzantine land.
If your worst case is “node disappears or restarts,” crash consensus is usually enough.
A small closing summary
PACELC tells you you’re always trading latency vs consistency once you replicate.
Raft/Paxos solve consensus under benign failures.
Byzantine Generals Problem formalizes the “nodes can lie” case and shows why you need strong honest majorities.
PBFT makes Byzantine SMR practical with pre-prepare/prepare/commit and <1/3 faulty tolerance.
HotStuff modernizes BFT for better scaling and leader changes, in partial synchrony. (Michael Reiter's Home Page)
Tendermint adapts PBFT-like ideas to WAN + gossip + multi-domain settings. (arXiv)