There are at least five distinct paradigms for replication: Group Communication [0], Viewstamped Replication [1], MultiPaxos [2], Raft [3], and Shared Logs [4].

In a previous post, I did a deep-dive on MultiPaxos, showing that it implements a specific abstraction: State Machine Replication or SMR. The SMR API allows servers to propose commands and play them back in a durable total order:

#include <functional>

template <typename Command, typename Result>
class SMR {
    public:
        //callback that applies a command to the local state machine
        //and returns its result
        using Applicator = std::function<Result(const Command&)>;
        //insert a new command into the durable total order;
        //play it back and apply it on the local state machine;
        //return the result
        Result propose(Command C);
        //return once all prior commands in the durable total order
        //have been applied to the local state machine.
        void sync();
        //register the local state machine to receive new commands
        void registerApply(Applicator A);
};

To kick things off, we make the following point: the five paradigms of replication are simply different implementations of the SMR API.

This is a useful observation for at least two reasons. First, an application (e.g., a database) built above the SMR API can work on any of these systems; if you need to modify your database to work over Raft rather than MultiPaxos, something went wrong in the design. Second, it implies that systems in each paradigm can be compared in a straightforward way, since they provide identical functionality.
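To make the first point concrete, here is a minimal sketch of a key-value store written purely against the SMR API above (PutCmd and ReplicatedKVStore are hypothetical names, not taken from any system); the same code would run unchanged over a Raft-based, MultiPaxos-based, or shared-log-based implementation of SMR.

#include <map>
#include <string>

//a toy replicated key-value store layered on the SMR API
struct PutCmd { std::string key; std::string value; };

class ReplicatedKVStore {
    public:
        explicit ReplicatedKVStore(SMR<PutCmd, bool>& smr) : smr_(smr) {
            //the SMR layer invokes this applicator for every command
            //in the durable total order, on every server
            smr_.registerApply([this](const PutCmd& c) {
                state_[c.key] = c.value;
                return true;
            });
        }
        //writes go through the durable total order
        bool put(const std::string& k, const std::string& v) {
            return smr_.propose(PutCmd{k, v});
        }
        //reads first sync against the total order, then consult local state
        std::string get(const std::string& k) {
            smr_.sync();
            return state_[k];
        }
    private:
        SMR<PutCmd, bool>& smr_;
        std::map<std::string, std::string> state_;
};

Nothing in ReplicatedKVStore knows (or cares) which of the five paradigms sits behind the SMR object it was handed.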

In this post, we take a closer look at one particular aspect of these paradigms: modularity.

How modular is each implementation of SMR?

A key question is whether systems in these paradigms use a low-level distributed abstraction as a building block. To understand this, consider an analogy to storage stacks: if someone asked you to build a filesystem, rather than implement the file API directly over hardware (e.g., sending SATA commands), you would typically layer it over a block device, which hides the complexity of the underlying hardware behind a simple interface. The block API makes it easy to run filesystems over diverse hardware (e.g., SSDs vs. HDDs; or network-attached disks vs. local storage). In addition, the block layer can add functionality without modifying the filesystem (e.g., RAID, encryption, deduplication, etc.). A similar analogy exists in networks: TCP layers complex functionality over IP, which hides the complexity of specific networks.

In our case, the SMR API is the filesystem (or TCP layer); but what’s the block device (or IP layer) analogue that hides the complexity of distributed consensus?

For two of these paradigms, there is no such “block device”. Raft and Viewstamped Replication both implement SMR directly via RPCs between a collection of machines, without a lower-level distributed abstraction. Each one is aware of consensus and implements it using simplifying primitives (e.g., single-machine logs, leaders, etc.).

For MultiPaxos, the “block device” is the Write-Once Register or WOR (as we showed in the earlier post). Given a collection of WORs, MultiPaxos stitches them together into a single total order of command slots. MultiPaxos on its own does not need to be aware of the internals of each WOR slot (e.g., it need not understand quorums or phase 1 / phase 2 messages in Paxos). However, the logic in MultiPaxos for determining the location / membership of each slot can be quite subtle, contributing to its reputation as a complex protocol.
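As a refresher, a WOR is just a register that can be written exactly once; roughly (the method names below are illustrative, not from any particular codebase):

#include <optional>

//a Write-Once Register (WOR): the "block device" underneath MultiPaxos.
//At most one write ever succeeds, and all readers agree on the winner
//(i.e., single-slot consensus).
template <typename Value>
class WOR {
    public:
        //attempt to set the register; return true if this write won
        bool write(const Value& v);
        //return the committed value, if one exists
        std::optional<Value> read();
};

MultiPaxos can then be understood as managing a sequence of such WORs, one per command slot, without caring how any individual WOR reaches consensus.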

In Group Communication, the “block device” is a multicast group. Groups hide more complexity than single WORs: they provide not just single-slot consensus, but also the ordering and membership mechanisms across consensus slots. In group communication (also known as virtual synchrony), each server implements the SMR API’s propose method by multicasting the command to the other servers and then delivering / applying it locally. Group communication systems were astonishingly modular nearly three decades before Raft first appeared, with full support for dynamic membership / view changes, pluggable consistency guarantees, etc. Unfortunately, groups provide slightly weaker default semantics than MultiPaxos: a command can be executed on a server before it is durable on a set of acceptors. (I’ll do a deep dive on group communication in a future post!)
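In sketch form, the group plays the same “block device” role (the Group interface below is hypothetical; real virtual synchrony toolkits expose richer APIs):

#include <functional>

//a totally-ordered multicast group: the abstraction underneath
//group communication. This interface is a hypothetical sketch.
template <typename Command>
class Group {
    public:
        //atomically multicast a command to all members, including the sender
        void multicast(const Command& c);
        //upcall delivered at every member in the same total order
        void registerDeliver(std::function<void(const Command&)> cb);
};

An SMR propose() then amounts to multicasting the command and waiting for its own delivery upcall to apply it locally.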

Shared Logs combine the strong semantics of WORs with the WOR-stitching functionality of groups. Each server implements the SMR API’s propose call by appending to a shared log; playing the log forward until the appended command; and applying each command from the log to its local state. The shared log acts as the “block device”; in fact, this is more than a simple analogy since a shared log is effectively an append-only address space. In a sense, the extra complexity in MultiPaxos is around stitching together individual WORs with distinct memberships into an append-only address space; we take that functionality and push it underneath a Shared Log API.
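Concretely, the shared log exposes a data-centric API along these lines (a sketch; real systems such as Corfu or Delos differ in the details), and propose() becomes a short loop over it:

#include <functional>

//a Shared Log: an append-only address space shared by all servers
template <typename Command>
class SharedLog {
    public:
        //append a command; return its position in the total order
        long append(const Command& c);
        //read the command at a given position
        Command read(long pos);
};

//sketch of SMR::propose over a shared log: append the command, then
//play the log forward through its position, applying every entry to
//the local state machine (localTail is the next unapplied position).
template <typename Command, typename Result>
Result proposeOverLog(SharedLog<Command>& log, long& localTail,
                      std::function<Result(const Command&)> apply,
                      const Command& c) {
    long pos = log.append(c);
    Result result{};
    while (localTail <= pos) {
        Result r = apply(log.read(localTail));
        if (localTail == pos) result = r;
        localTail++;
    }
    return result;
}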

So MultiPaxos, Shared Logs, and Group Communication are each more modular than Raft or Viewstamped Replication, for a precise definition of modularity: there is an internal abstraction layer that hides consensus.

Once you have a low-level abstraction like a WOR or a Shared Log, you can implement it with any protocol. A WOR can be implemented via any single-slot consensus protocol (Paxos, Flexible Paxos, etc.); whereas a Shared Log can be implemented with a protocol like Raft or Viewstamped Replication or MultiPaxos. Since both abstractions have data-centric storage APIs (i.e., no upcalls; no self-initiated activity; simple request/response interfaces), they can also be layered over other storage systems like key-value stores.
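For example, a WOR could be layered over a key-value store that supports a conditional put; the KVBackend interface below is a hypothetical stand-in for whatever put-if-absent or compare-and-swap primitive the store provides, and the resulting WOR is only as fault-tolerant as the KV store underneath it.

#include <optional>
#include <string>

//a hypothetical key-value backend with a conditional put
class KVBackend {
    public:
        //store value under key only if the key is absent;
        //return true if this call installed the value
        bool putIfAbsent(const std::string& key, const std::string& value);
        std::optional<std::string> get(const std::string& key);
};

//a WOR layered over the KV backend: write-once semantics fall out
//of the conditional put
class KVBackedWOR {
    public:
        KVBackedWOR(KVBackend& kv, std::string key) : kv_(kv), key_(std::move(key)) {}
        bool write(const std::string& v) { return kv_.putIfAbsent(key_, v); }
        std::optional<std::string> read() { return kv_.get(key_); }
    private:
        KVBackend& kv_;
        std::string key_;
};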

So far, all we have argued is that three of these five paradigms have an internal abstraction, and the other two do not. Why do these abstractions help if they are just internal APIs that you are going to implement using some other protocol anyway? After all, a compact, monolithic, and well-defined implementation like Raft may well be preferable to a layered or modular one, unless we actually obtain some benefit from modularity in this setting.

Why does modularity matter?

The extra layer of internal abstraction is important because it logically separates learners from acceptors. In Lamport’s terminology, learners / proposers are the machines storing state and keeping it synchronized via a durable total order; whereas acceptors are the machines storing the total order itself. In other words, each learner keeps a first-class copy of the database; whereas acceptors store the commands mutating that database.

In MultiPaxos, the acceptors are hidden underneath the WOR API. In Shared Log systems, they are hidden underneath the Shared Log API. Logically separating learners from acceptors via an abstraction boundary has multiple benefits (see the Delos paper for more details):

  • The database tier (learners) can be collocated with the consensus tier (acceptors) or disaggregated.
  • We can run the database on any implementation of consensus (the same way that a filesystem can run on any block device).
  • We can scale the database and consensus tiers independently.
  • The consensus tier can be scaled by RAID-ing it (e.g., see Corfu in NSDI 2012; or Scalog in NSDI 2020).
  • We can deploy the DB tier separately (to roll out new DB features) without impacting the consensus tier.
  • We can deploy fewer DB copies, since learners have lower fault-tolerance requirements (only one needs to be alive) than acceptors.

Once you design an SMR-based system with the ability to decouple and disaggregate acceptors from learners, some connections become obvious. For instance: a replicated database like ZooKeeper or etcd and a streaming system like Kafka can both be expressed as simple, consensus-free state machines above a shared log. But more on that in a future post!

Citations:

  • [0] The process group approach to reliable distributed computing. Ken Birman. Communications of the ACM 1993.
  • [1] Viewstamped Replication Revisited. Barbara Liskov and James Cowling. MIT Tech Report 2012.
  • [2] Paxos made Moderately Complex. Robert van Renesse and Deniz Altinbuken. ACM Computing Surveys 2015.
  • [3] In Search of an Understandable Consensus Algorithm. Diego Ongaro and John Ousterhout. Usenix ATC 2014.
  • [4] Virtual Consensus in Delos. Mahesh Balakrishnan, Jason Flinn, Chen Shen, et al. Usenix OSDI 2020.