There are three questions to ask of any system:

  • What abstraction does it implement?
  • What is the design space for such an abstraction?
  • Why is this abstraction useful?

In a previous post, we examined the Paxos protocol and answered the first two questions. Paxos implements the abstraction of a Write-once Register (a WOR) using a combination of quorums and a two-phase locking protocol. As for the third question: Paxos is useful because it can be used to implement MultiPaxos (among other things). But that’s an unhelpful answer unless you know what MultiPaxos does.

In this post, we seek to understand MultiPaxos by asking the same three questions of it. To start with, what abstraction does MultiPaxos implement?

MultiPaxos implements State Machine Replication.

Consider a stateful service that exposes mutators and accessors: for example, a key-value store with a put and a get API. If we run this service on a single server, the system is neither available (e.g., the server can reboot) nor durable (e.g., the server can explode). To make the service available and durable, we need to run it on multiple servers; however, if each server maintains its own local copy of the state machine and updates it independently, clients will see inconsistent state for the service (e.g., if a client does a put on one server and then a get on a different one, it will not see the value it just wrote). In other words, the service will not be linearizable.

How do we run multiple copies of a stateful service on different servers so that it appears to be executing on a single copy? State Machine Replication (SMR) provides one answer. In SMR, rather than immediately modify its local copy of state in response to a put, a server first proposes the command to an underlying layer implementing the SMR abstraction. This layer appends the command to a totally ordered sequence of commands (assume for now that this sequence is stored in some location that is magically durable and available). The sequence – which includes commands proposed by all the servers – is then applied to each server’s local copy in strict order. On a get, the server first calls sync on the SMR API, applying any commands from the sequence that it hasn’t already seen; and then executes the get on its local, updated copy of state.

This simple protocol – propose first on mutators, sync first on accessors – ensures linearizability for the replicated service.

Concretely, here is the SMR API:

#include <functional>

template <typename Command, typename Result>
class SMR {
	public:
		//callback that applies a command to the local state machine and
		//returns its result (the concrete signature is an assumption here)
		using Applicator = std::function<Result(const Command&)>;
		//insert a new command into the durable total order;
		//play it back and apply it on the local state machine;
		//return the result
		Result propose(Command C);
		//return once all prior commands in the durable total order
		//have been applied to the local state machine.
		void sync();
		//register the local state machine to receive new commands
		void registerApply(Applicator A);
};
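
To make the propose-first / sync-first pattern concrete, here is a minimal sketch of a key-value store layered on this API. None of it comes from a particular implementation: KVCommand, KVStore, and the std::map-backed local state are placeholders, and since the SMR methods above are declarations only, this illustrates the calling pattern rather than something you can run end-to-end.

#include <map>
#include <string>

//hypothetical command type: only mutators need to enter the total order
struct KVCommand {
	std::string key;
	std::string value;
};

class KVStore {
	public:
		KVStore() {
			//the SMR layer will call apply() for each command, in sequence order
			smr.registerApply([this](const KVCommand& c) { apply(c); });
		}
		//mutator: propose first, so the put enters the durable total order
		//before it touches the local copy
		void put(const std::string& key, const std::string& value) {
			smr.propose(KVCommand{key, value});
		}
		//accessor: sync first, so the local copy reflects every prior command
		std::string get(const std::string& key) {
			smr.sync();
			return state[key];
		}
	private:
		void apply(const KVCommand& c) {
			state[c.key] = c.value;
		}
		SMR<KVCommand, void> smr;
		std::map<std::string, std::string> state;
};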

MultiPaxos implements State Machine Replication using an address space of WORs.

There are many, many ways to implement the SMR API (groups, shared logs, the Raft protocol… this is a topic for a future post). For now, we consider how MultiPaxos implements such a sequence: using an address space of WORs.

The idea (outlined by Lamport) is simple: to propose a new command to the sequence, you simply locate the first unwritten WOR; and write to it. To sync with the sequence, you play it until the first unwritten WOR, applying each command to the local state machine.
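
A rough sketch of that loop, with placeholder names throughout: assume a single-slot WOR interface along the lines of the previous post (a read that reports whether the slot has been written, and a write that may lose to a competing writer), and a fixed-size vector standing in for the unbounded address space. A real implementation would also remember which slots it has already applied rather than replaying from slot 0.

#include <cstddef>
#include <optional>
#include <vector>

//placeholder single-slot interface, standing in for the WOR of the previous post
template <typename Command>
struct WOR {
	std::optional<Command> read();       //the winning command, if the slot is written
	bool write(const Command& c);        //true if this write won the slot
};

//propose: walk forward to the first unwritten WOR and write the command there,
//applying every command encountered along the way in slot order
template <typename Command, typename Apply>
void propose(std::vector<WOR<Command>>& wors, const Command& c, Apply apply) {
	size_t i = 0;
	while (i < wors.size()) {
		if (auto winner = wors[i].read()) {
			apply(*winner);              //slot already decided: apply it and move on
			++i;
		} else if (wors[i].write(c)) {
			apply(c);                    //our command now occupies slot i
			return;
		}
		//if our write lost a race, stay on slot i; the next read sees the winner
	}
}

//sync: play the sequence forward until the first unwritten WOR
template <typename Command, typename Apply>
void sync(std::vector<WOR<Command>>& wors, Apply apply) {
	for (auto& w : wors) {
		auto winner = w.read();
		if (!winner) break;
		apply(*winner);
	}
}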

One interesting implication of this design is that the WOR can be implemented in many ways: as a key-value pair on some remote storage system; or using a consensus protocol like Paxos, with the acceptors either collocated with the set of servers running our replicated service, or running on some entirely different set of machines. Further, a single address space can combine WORs with different implementations: WORs 0-10 might be on some external storage service, while WORs 11-20 could be on a collocated set of Paxos instances. More commonly, WORs 0-10 and 11-20 might use the same Paxos-based implementation, but reside on different sets of acceptors (e.g., if some acceptors fail in the earlier range).

How do we know exactly where each WOR in the array resides (and what implementation it uses)? This is the membership or reconfiguration problem for MultiPaxos. Two types of solutions exist. First, Lamport’s original proposal was to store membership in-band: each WOR maintains a pointer to the next WOR (i.e., a linked list of WORs). When we write the payload for WOR I, we also include the location of WOR I+1. To allow parallel writes, we can include the location of the next segment of K WORs rather than just the next one.
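
One way to picture the in-band scheme (the types and field names below are illustrative, not Lamport’s): each slot’s payload carries, next to the command, the locations of the slots that follow it.

#include <string>
#include <vector>

//illustrative description of where one WOR lives and how it is implemented
struct WORLocation {
	std::string implementation;          //e.g., "paxos" or "external-kv"
	std::vector<std::string> acceptors;  //acceptor addresses, if Paxos-based
};

//payload written into slot I: the command itself, plus the locations of the
//next K slots, so that K writes can proceed in parallel before the chain
//has to be extended again
template <typename Command>
struct SlotPayload {
	Command command;
	std::vector<WORLocation> nextSegment;
};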

A second solution is to store membership out-of-band: we can store a map from the WOR address space to different implementations / locations in an external location. This location – the membership store – has to be some kind of versioned conditional register. Think of this as implementing a virtual address space of WORs using an indirection map; or equivalently, a view of the system. Critically, there has to be some mechanism for stopping activity by servers that happen to have a stale view before we can switch to a new one. In other words, we need to stop the world in view X before writing view X+1 to the membership store. Typically this requires some extra capability from the WORs: some kind of seal or fence operation that prevents servers with old views from accessing WORs and tells them to go check the latest view. Alternatives to sealing include checking the membership store in the critical path of each operation (which is simple but slow); or relying on real-time leases for each view (this gets complex).
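
Here is a sketch of what such a membership store might expose; the names and the range-keyed map are assumptions. The conditional write installs view X+1 only if the store still holds view X; sealing is a capability of the WORs themselves, so it does not appear here.

#include <cstdint>
#include <map>
#include <string>

//a view: a version number plus a map from ranges of the WOR address space
//(keyed by the first slot of each range) to the location serving that range
struct View {
	uint64_t version;
	std::map<uint64_t, std::string> rangeStartToLocation;
};

//versioned conditional register holding the current view
class MembershipStore {
	public:
		//read the current view
		View getView();
		//install newView only if the store still holds expectedVersion;
		//returns false if another server reconfigured first
		bool conditionalWrite(uint64_t expectedVersion, const View& newView);
};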

MultiPaxos implements State Machine Replication using an address space of WORs and a designated writer.

The SMR-based implementation above has two performance issues when it runs over Paxos-based WORs. The first relates to livelock. Recall that the API for a WOR has a lock (which in a Paxos-based WOR executes the first phase of Paxos) and a write (the second phase of Paxos). If two servers try to propose a new command at the same time, they will both try to lock the first unwritten WOR at the tail of the array; each lock invalidates the other’s, and the two servers can keep locking each other out without either write succeeding.

As a result, we need a single designated writer to each WOR for better performance. Most commonly, MultiPaxos simply stores the identity of the designated writer along with the membership (either in-band or in the out-of-band membership store). Note that this is simply a hint to avoid livelock; if the designated writer fails, we can fall back on dueling writers.

Second, each write on a Paxos-based WOR requires two RTTs to a quorum of acceptors: the first to lock them, and the second to do the actual write. However, WORs can be pre-locked by a designated writer, allowing it to skip the first RTT. For instance, each newly created WOR can be pre-locked with lock number 0; and the designated writer for the WOR can issue the write with lock number 0. We are guaranteed that each WOR has only a single designated writer via the membership mechanism.
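
Here is a sketch of the resulting fast path. The explicit lock-number argument and the signatures are assumptions made for illustration: the designated writer treats each fresh WOR as pre-locked with lock number 0 and jumps straight to the write phase, falling back to the full two-phase path if that write is rejected.

#include <cstdint>

//placeholder Paxos-based WOR with the two phases made explicit
template <typename Command>
struct PaxosWOR {
	int64_t lock();                                  //phase 1: lock a quorum, returns a lock number
	bool write(const Command& c, int64_t lockNum);   //phase 2: write under that lock number
};

//the designated writer's write path: one RTT in the common case
template <typename Command>
bool designatedWrite(PaxosWOR<Command>& wor, const Command& c) {
	if (wor.write(c, /*lockNum=*/0)) {
		return true;                                 //fresh WOR was still pre-locked at 0
	}
	//the pre-lock was overridden (e.g., by a server that suspected a failure):
	//fall back to the full two-phase protocol
	int64_t lockNum = wor.lock();
	return wor.write(c, lockNum);
}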

A designated writer solves a third performance / liveness concern generic to any kind of WOR implementation. To propose or sync, we need to find the first unwritten WOR; this is not guaranteed to terminate, since a server could keep chasing the tail of an expanding sequence. With a single designated writer, we can efficiently track the first unwritten WOR in the failure-free case; if this writer fails, we fall back on reading forward on the array.

That’s it. MultiPaxos is simple if Paxos is abstract, because it can then ignore all the subtlety of the single-slot Paxos protocol and be expressed purely in terms of invocations on the WOR API. Practically, you can implement and test full-fledged MultiPaxos by using some existing key-value store to implement each WOR, without requiring a Paxos implementation.