I’ve heard multiple times that a strong notion of leadership somehow simplifies replication. I don’t think this is true. I explain why in this post.

In a prior post, we described the State Machine Replication (or SMR) abstraction. The SMR API allows you to propose commands and play them back in a total order. In another post, we described how SMR can be implemented either directly using RPC (e.g., as in Raft or Viewstamped Replication) or via intermediate abstractions like a shared log or a process group.

To start with, consider a system that’s implementing the SMR API. In such a system, there is typically a layer of stateful servers (i.e., either application or database servers) storing replicated state. When a server wants to update the replicated state, it proposes a new command to the SMR layer. The SMR layer continuously applies new commands to each server.
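
To make the discussion concrete, here is a minimal sketch of what such an SMR API might look like, written in Go; the names (Propose, Apply, Replay) are illustrative rather than taken from any particular system.

```go
// A minimal sketch of the SMR API described above; all names are
// illustrative, not from any particular system.
package smr

// Command is an opaque, serialized update proposed by a server.
type Command []byte

// StateMachine is implemented by each stateful server above the SMR layer.
type StateMachine interface {
	// Apply is the upcall through which the SMR layer delivers
	// commands, in the total order, to a server's local state.
	Apply(pos uint64, cmd Command)
}

// SMR is the interface a server sees when it talks to the SMR layer.
type SMR interface {
	// Propose adds a new command to the total order and returns its position.
	Propose(cmd Command) (pos uint64, err error)
	// Replay re-delivers commands from a given position onward,
	// e.g., when a server needs to catch up after a reboot.
	Replay(from uint64, sm StateMachine) error
}
```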

For now, let’s ignore how the SMR layer is implemented (i.e., via some custom consensus protocol or as a layer above some underlying abstraction like a shared log). Without loss of generality, I’ll use the terms ‘log’ or ‘shared log’ to mean ‘SMR total order’, regardless of whether the SMR layer constructs this total order via an actual shared log or some protocol like Raft or MultiPaxos.

In such a system, there are two key questions:

  • What is proposed to the shared log?
  • Who is allowed to propose to the shared log?

In classical SMR, the answer to what is simple: the proposal is an unexecuted piece of code (or a lambda, if you prefer). The proposing server does not execute the code (i.e., apply the update to its local copy of state) before proposing; instead, it hands the code itself to the SMR layer as a command. For example, a command might be “x++” or “if(x<5) y++;”. Each server is provided the same total order of commands by the SMR layer and executes the new commands on its local state as it receives them via the apply upcall.
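
As a sketch of what this looks like in code (with an illustrative command encoding; this is not the API of any real system), the proposer ships the operation itself and every replica executes it inside its apply upcall:

```go
package sketch

// Sketch of classical (input-logging) SMR: the command is the
// unexecuted operation itself, and every server runs the same code in
// its apply upcall. The command encoding here is illustrative.

type Server struct {
	x, y int
}

// The proposer ships the operation ("x++"), not its result ("x=1").
func incrementX() []byte { return []byte("x++") }

// Every server receives the same total order of commands and executes
// each one against its local state.
func (s *Server) Apply(pos uint64, cmd []byte) {
	switch string(cmd) {
	case "x++":
		s.x++
	case "if(x<5) y++":
		if s.x < 5 {
			s.y++
		}
	}
}
```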

This simple protocol results in a strong invariant: the local state at any given server always corresponds to some prefix of the SMR layer’s total order; if the server reboots, the SMR layer can examine its local copy of state and determine the position in the total order at which it should resume applying new commands. Ideally, local state is stored in some persistent store that has failure atomicity (e.g., RocksDB), so that the state can never reflect a half-applied command.
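
One way to maintain this invariant, sketched below, is to persist the applied position atomically with the command's effects; the Store and Batch types are stand-ins for whatever atomic write primitive the local store provides (e.g., a RocksDB-style write batch).

```go
package sketch

import "encoding/binary"

// Sketch: persist the applied position atomically with each command's
// effects, so local state always corresponds to an exact prefix of the
// total order. Store and Batch are illustrative stand-ins for a local
// store with failure atomicity.

type Batch interface {
	Put(key, value []byte)
	Commit() error // both puts survive a crash, or neither does
}

type Store interface {
	NewBatch() Batch
}

func applyCommand(store Store, pos uint64, key, value []byte) error {
	b := store.NewBatch()
	b.Put(key, value)                         // the command's effect
	b.Put([]byte("applied_pos"), encode(pos)) // the prefix marker
	return b.Commit()
}

// On reboot, the server reads "applied_pos" and resumes applying new
// commands from the next position in the total order.

func encode(pos uint64) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, pos)
	return buf
}
```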

In classical SMR, the answer to who is also simple: any server can propose a new command. This results in a surprising and powerful property: there is no primary / master / leader at the database layer. You get active-active or multi-master replication for free. Clients can contact any database server and get a linearizable / strictly serializable response. The replicated database is trivially serializable since its state corresponds to the serial execution of the commands in the SMR total order.
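
One simple (if not the most efficient) way to realize a linearizable read at an arbitrary server is to route the read through the total order itself, so that it is serialized against all writes; the sketch below assumes illustrative Proposer and Server types, and a real system would block on a condition variable rather than spinning.

```go
package sketch

// Sketch: in the multi-primary model, any server can serve a
// linearizable read by serializing the read against the total order.
// Proposer, Server, and appliedPos are illustrative, not from any
// real system.

type Proposer interface {
	Propose(cmd []byte) (pos uint64, err error)
}

type Server struct {
	state      map[string]int
	appliedPos func() uint64 // position reached by this server's apply loop
}

func (s *Server) linearizableRead(log Proposer, key string) (int, error) {
	// Propose a no-op marker; once the apply loop has passed its
	// position, local state reflects every command ordered before it.
	pos, err := log.Propose([]byte("noop"))
	if err != nil {
		return 0, err
	}
	for s.appliedPos() < pos {
		// busy-wait to keep the sketch short
	}
	return s.state[key], nil
}
```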

There is no notion of leadership in classical SMR: systems are multi-primary by default.

Implementing an active-active / multi-master / multi-primary replicated database is simple: you can literally follow the most basic definition of SMR and obtain this property. If active-active systems are this easy to build, why do systems even bother with electing leaders? The reasons lie on either side of the SMR API; let’s start with life above the API.

Leadership above the Log: One reason for introducing leadership above the SMR API is safety: if the total order contains inputs proposed by any server, then each server is executing a lambda independently. As a result, if there is any non-determinism in the database server’s code (e.g., if a command is “if(random()>0.5) x++”), replica state will diverge. Non-determinism can arise due to any use of real time or randomness in the apply logic; or non-deterministic errors thrown during execution (e.g., a disk out-of-space error). In a sense, the complexity in the system moves away from reasoning about leadership changes (since there is no leader) and towards ensuring determinism.
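
To make the determinism burden concrete: anything that could differ across replicas has to be resolved before the command is proposed, never inside the apply logic. A sketch of the distinction (the Proposer interface and command strings are illustrative):

```go
package sketch

import "math/rand"

// Sketch: with inputs in the log, non-determinism must be resolved at
// propose time rather than inside the apply upcall. Proposer and the
// command strings are illustrative.

type Proposer interface {
	Propose(cmd []byte) (pos uint64, err error)
}

// BAD: if each replica evaluated rand.Float64() inside its own apply
// upcall, replica state would diverge.
//
// GOOD: the proposer resolves the coin flip once and logs the outcome,
// so every replica applies the same decision.
func proposeMaybeIncrement(log Proposer) error {
	cmd := []byte("noop")
	if rand.Float64() > 0.5 { // resolved once, before proposing
		cmd = []byte("x++")
	}
	_, err := log.Propose(cmd)
	return err
}
```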

In contrast, if the total order contains the output produced by an executed command (e.g., a write-set of keys updated by the command), we are no longer executing arbitrary code on multiple machines, so the burden of determinism is lower. Instead of “x++”, the command would simply say “x=1”.

If the log contains outputs, then we can no longer blindly apply commands proposed by different servers. Each database server will still converge to the same state, but updates are no longer linearizable. For example, if two servers simultaneously receive an “x++” command from clients, each would read its current value of x and propose the same new value (e.g., “x=1”), so one of the increments is lost. In effect, the replicated database no longer behaves as if it’s executing commands in a total order; as a result, it is no longer strictly serializable.
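
The lost update is easy to see in a toy sketch: with outputs in the log, both servers read x as 0 locally, both propose the same absolute value, and one increment silently disappears.

```go
package main

import "fmt"

// Toy sketch of the lost-update problem with output logging: two
// servers concurrently execute "x++" against the same initial state
// and both propose the absolute value "x=1".
func main() {
	x := 0 // initial replicated value

	// Server A and server B each execute "x++" locally...
	outA := x + 1 // A proposes "x=1"
	outB := x + 1 // B proposes "x=1"

	// ...and every replica applies both outputs in the log's order.
	x = outA
	x = outB

	fmt.Println(x) // prints 1, even though two increments were requested
}
```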

To re-introduce serializability, we could somehow ensure that there is only one server proposing to the log at any given time, by baking mechanisms such as leases and fencing into the log itself; or equivalently, stitching together a multi-proposer shared log from a sequence of single-proposer shared logs. This is an easier option if the log itself has a leader under the SMR API, as we discuss shortly; but exploiting that property can constrain the database to running over a specific log implementation.
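
The usual mechanism for baking this into the log is an epoch (or fencing token): the log accepts appends only from the proposer holding the current epoch, and bumping the epoch is equivalent to sealing one single-proposer log and starting the next. A sketch of what such an API might look like (illustrative; not the API of any particular log):

```go
package sketch

import "errors"

// Sketch of a fenced append API: appends are accepted only from the
// proposer holding the current epoch, and sealing a higher epoch
// fences out the old proposer. This is one way to view a chain of
// single-proposer logs. The API is illustrative.

var ErrFenced = errors.New("append rejected: stale epoch")

type FencedLog struct {
	epoch   uint64
	entries [][]byte
}

// Seal moves the log to a higher epoch, handing exclusivity to a new proposer.
func (l *FencedLog) Seal(newEpoch uint64) uint64 {
	if newEpoch > l.epoch {
		l.epoch = newEpoch
	}
	return l.epoch
}

// Append succeeds only for the proposer holding the current epoch.
func (l *FencedLog) Append(epoch uint64, cmd []byte) (uint64, error) {
	if epoch != l.epoch {
		return 0, ErrFenced
	}
	l.entries = append(l.entries, cmd)
	return uint64(len(l.entries) - 1), nil
}
```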

However, there’s an easier option that’s agnostic to the log implementation: we can store speculative outputs in the log. In this case, each proposing server would first execute the update (“x++”) on its own local state, and then propose a command with the output and a read-set: “if(x==0) x=1;”. The speculation could be predicated either on the value of the state read by the command (i.e., “x==0”) or on the version of that state. Essentially, each database server applying that command asks the question: “would I get the same output as the proposer if I had executed the original command now?” or, equivalently, “has the state seen by the command changed since the command was executed at the proposer?”.
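
A sketch of what a speculative command and its conditional apply might look like, predicating on the version of the read-set (all names are illustrative):

```go
package sketch

// Sketch of speculative output logging: the proposer executes "x++"
// locally, then logs the output along with the version of the state it
// read. Each server applies the output only if that version is still
// current. All names are illustrative.

type SpeculativeCmd struct {
	Key         string
	ReadVersion uint64 // version of Key observed at the proposer
	NewValue    int    // output produced by executing the command there
}

type Server struct {
	values   map[string]int
	versions map[string]uint64
}

// Apply asks: has the state read by this command changed since the
// proposer executed it? If not, accept the output; if so, reject it
// and let the proposer retry.
func (s *Server) Apply(cmd SpeculativeCmd) bool {
	if s.versions[cmd.Key] != cmd.ReadVersion {
		return false // speculation failed
	}
	s.values[cmd.Key] = cmd.NewValue
	s.versions[cmd.Key]++
	return true
}
```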

Note that the speculation could also be on the identity of the proposer: “if(proposer==primary) x=1;”. This brings us to the second way to re-introduce serializability: we can simply elect a designated proposer. Crucially, this election can happen above the SMR API. All we have to do is propose a ‘takeover’ command to the log saying “I am the designated proposer now; from this point, ignore commands (except takeovers) in the log from any other proposer”.
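
This election needs nothing from the log beyond the total order itself; the entire mechanism lives in the apply logic. A sketch (illustrative types and fields):

```go
package sketch

// Sketch of electing a designated proposer above the SMR API: a
// takeover command names a new proposer, and every server's apply
// logic drops commands from anyone else until the next takeover.
// All names are illustrative.

type Cmd struct {
	Proposer string
	Takeover bool   // "I am the designated proposer from this point on"
	Op       []byte // the actual update; empty for takeover commands
}

type Server struct {
	primary string
}

func (s *Server) Apply(cmd Cmd) {
	if cmd.Takeover {
		s.primary = cmd.Proposer // takeovers are always honored
		return
	}
	if cmd.Proposer != s.primary {
		return // command from a deposed proposer; ignore it
	}
	s.execute(cmd.Op)
}

func (s *Server) execute(op []byte) {
	// Apply the operation to local state (elided in this sketch).
}
```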

To switch between multi-primary and single-primary, we simply decide whether to store inputs or outputs in the log, and whether to elect a designated proposer via the log itself.

A system that elects a primary in this manner obtains the performance benefits typically associated with primary-based systems. For example, we can do strongly consistent reads at the primary without catching up with the SMR total order (or equivalently, without contacting any other machine over a network): since the primary is the only proposer to the total order, it knows the sequence of commands in that order. Note that this kind of optimization in our SMR-driven election requires a real-time lease (new primaries have to wait for some time period after proposing the ‘takeover’ command), but such a real-time lease is fundamentally required for this optimization in any consensus-based system.
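
A sketch of the lease check involved (illustrative; how the lease interval is derived from the takeover protocol varies across systems):

```go
package sketch

import (
	"errors"
	"time"
)

// Sketch of a lease-guarded local read at the designated proposer: the
// primary answers from its own state only while its lease is valid;
// otherwise it has to catch up with the total order first. The lease
// bookkeeping here is illustrative.

var ErrLeaseExpired = errors.New("lease expired: must sync with the log")

type Primary struct {
	state      map[string]int
	leaseUntil time.Time // earliest time a rival takeover could take effect
}

func (p *Primary) Read(key string) (int, error) {
	if time.Now().After(p.leaseUntil) {
		return 0, ErrLeaseExpired
	}
	// Within the lease, no other proposer's commands can precede ours
	// in the total order, so local state is up to date.
	return p.state[key], nil
}
```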

Leadership below the Log: Note that so far we have said absolutely nothing about the implementation of the SMR layer. This in itself might be a surprising observation: most of the properties associated with leadership (e.g., strongly consistent reads at a leader) have to do with the design of the database state machine above the SMR API, and absolutely nothing to do with the consensus protocol used to implement the SMR API.

Internally, the SMR layer could be implemented in any of the five ways we discussed in a previous post. Some implementations have a notion of a leader and others don’t. The Paxos protocol that implements a single consensus slot is explicitly designed to not require a leader; MultiPaxos reintroduces leadership as a liveness and latency optimization. Raft has a strong notion of a leader (e.g., the leader sees all I/O); Corfu has a weak notion of a leader (i.e., a sequencer that hands out timestamps but does not see all I/O).

The leader above the log – i.e., the database primary; the designated proposer – does not have to be the leader below the log.

In the 90s and 00s, the most common deployment model for replicated systems was a single replicated shard that colocated the database layer with the shared log, with a strong primary above the log as well as a leader-based SMR implementation. In this deployment model, it makes perfect sense to have a single machine play the role of the leader above and below the SMR API. Over time, implementations began to fuse these roles, blurring the distinction between leadership above and below the log.

However, understanding the difference between these two different leadership roles is very useful in a cloud setting. With the right layering, you can disaggregate your log layer from your database and scale it independently; switch your database from single-primary to multi-primary without changing the consensus protocol; change your consensus protocol to be leaderless without disturbing your database layer, and so on. For a more technical description of these ideas, see the Delos papers from Meta.

(This post is informed by several discussions over the years with various collaborators at Confluent and Meta; as well as Ben Reed and Allen Clement)