The road to Paxos is a long one (as with other greek islands) and also somewhat elusive (it’s an island, after all). It took me longer than I’d like to admit to obtain a working understanding of the Paxos protocol. In my early attempts, I’d hit a brick wall of complexity: do I really need to know what this particular acceptor is going to do? What’s a learner anyway? What does it even mean to decide a value? Why do I need all these ballot numbers?

In systems, we deal with complexity via abstraction. For any system, there are three key questions:

  • What abstraction does it implement?
  • What is the design space for such an abstraction?
  • Why is this abstraction useful?

In this post, we answer the first two questions for Paxos.

This document is not meant to replace the Paxos papers (if you are relying on blog posts to implement Paxos… don’t). Rather, it supplies some of the informal systems intuition for why Paxos looks the way it does. Most of this will be obvious to experts.

Paxos implements a Write-Once Register:

Paxos is a protocol for implementing consensus. This means it implements a logical object called a write-once register (or WOR). A WOR has a simple API: you can write to it once; and you can read from it many times. Literally:

class WOR{
		//success means some write succeeded;
		//read after a write to see what was written.
		void write(std::string payload);
		//throw an exception if unwritten
		std::string read();

A WOR is safe for concurrent access. It is linearizable, or equivalently, strongly consistent. Informally, your implementation has to behave as if you had a big fat lock around it (regardless of how it’s actually implemented).

The WOR is an API for consensus. Consensus can be hidden behind many other APIs, but a WOR is the minimal API for it.

Understanding that consensus is just a type of object has some useful implications for system-builders.

  • You can immediately spot the existence or absence of consensus in other systems simply by examining their APIs. For instance, a key-value store with a simple put/get API does not necessarily implement consensus (you can’t implement a WOR on top); whereas a conditional put/get API does implement consensus (you can implement a WOR on top).
  • When building a real system that needs consensus, you can depend on some existing system as a makeshift implementation of the WOR (e.g., some existing key-value store with conditional puts); get the entire stack running; and later replace it cleanly with a consensus protocol.
  • You can cleanly isolate the complexity associated with consensus in a distributed system in the presence of failures (coming up next) within the WOR implementation, away from all the other complexity of your system.

Paxos implements a Write-Once Register using quorums for fault-tolerance:

The whole point of consensus is to do useful things in a distributed system in the presence of failures. What usually trips people up is the exact failure model.

Here is the model that Paxos uses. Machines can reboot arbitrarily; but they have persistent storage and will come back up. More worryingly, they can fail by exploding (losing all the bits they store).

We modeled consensus as a logical object: a WOR. Consider a system where a client is accessing a WOR, which in turn is stored on a cluster of storage servers.

A simple way to implement a write-once register (let’s call this WOR-server) is to have a single server store the value in RAM and expose the WOR API via RPC. But this is not durable; if the server reboots, the value is lost. A slightly better solution is to store the data on disk, so it’s not lost on a reboot. But if the server explodes, you lose data.

So WOR-server – a single-server design – is not durable if machines can explode. To get durability against exploding machines, we need to replicate data across servers. But clearly, if all the servers can explode, there is no way to get durability. So we need to make an assumption about the number of exploding servers.

Paxos assumes that only a minority of servers can explode; equivalently, to tolerate F exploding servers, Paxos requires 2F+1 servers. This leads directly to a quorum-based protocol. To write data durably, we simply have to store it on a majority quorum of servers. We assumed that only a minority can explode; so at least one unexploded server will contain the data.

So we arrive at WOR-quorum: a client writing to a WOR can simply write the data to a quorum of servers (each of which is running the single-server WOR-server design). Reading from the WOR requires the client to go to a quorum of servers. Such a solution is durable.

In Paxos, the servers are called acceptors; the client that is writing a value is called a proposer.

In a system where there’s only a single client that doesn’t fail, we are done. But in real systems we typically want multiple clients accessing the WOR concurrently; and each client can itself reboot or explode.

This brings us to the Paxos protocol.

Paxos implements a Write-Once Register using quorums for fault-tolerance and two-phase locking for concurrency control:

In WOR-quorum, when more than two clients write to a quorum at the same time, we can end up with different values on different minorities of acceptors. One way to prevent this is to first lock the acceptors, and then write to them (and unlock them).

Locking is tricky for two reasons. First, you can have deadlocks; acquiring locks in a strict order to prevent deadlocks adds latency. Second, locking in a distributed system has a new failure mode: a client can crash after acquiring locks.

Paxos provides a solution to both these issues via a form of lock stealing. Locks are versioned or numbered; higher-numbered locks can override the lower-numbered ones. Clients will pick a unique lock number; and then try to lock a quorum with it. The lock acquisition will succeed if the acceptor is unlocked, or locked with a lower number; and fail if the acceptor is locked with a higher number (in which case the locking client can retry with a higher lock number). Locks are not advisory; writes are predicated on lock numbers and will fail at the acceptor if the lock has been stolen.

Now for some subtleties.

Completing writes: Recall that we assume a minority of servers can explode. If a client locks a majority of servers; can’t access the remaining minority; and finds a value already written on a single acceptor in that majority, it has to assume that the value was also written on the inaccessible minority and acknowledged back to some older client. As a result, the only path forward is for the new client to adopt that value as its own and write it on the majority that is accessible to it. If more than one such value exists, the client has to pick the value with the highest associated lock number.

Livelock: Clearly the protocol above can livelock, if two clients keep stealing locks from each other. This problem turns out to be theoretically impossible to solve: the FLP result (which predates the Paxos protocol) states that you can’t have both liveness and safety for fault-tolerant consensus. Livelock in Paxos is a real-world example of the FLP result in action.

Different lock/write quorums: It turns out that you can lock some majority quorum and write to another majority quorum; it doesn’t have to be the same quorum in both phases. (But a write to an unlocked acceptor has to be interpreted as a lock-then-write, else you get this bug). Flexible Paxos takes this further by pointing out that the write quorum does not necessarily have to be a majority if lower durability is acceptable.

Pre-locking: Clients can pre-lock the quorum to avoid a round-trip when they write a value. To provide fine-grained control over locking, we can explicitly expose an extra API to the WOR called lock(). If a client is interacting with multiple WORs that happen to live on the same set of acceptors, we can pre-lock an entire batch of WORs. Pre-locking turns out to cover the key optimizations in MultiPaxos, which we will discuss later.

So here is the final WOR API:

class WOR{
		//lock a quorum
		int lock();
		//success means some write succeeded;
		//read after a write to see what was written.
		//throws an exception if you lost the lock.
		void write(std::string payload, int lockId);
		//throw an exception if unwritten
		std::string read();

At this point, we know the abstraction implemented by Paxos; and how it is implemented. In a future post, we will go into why this abstraction is useful.

Once again: this post is only an informal, intuitive description; go read the papers armed with this intuition before implementing the protocol!

A formal description of the WOR API is in this paper. David Geraghty proof-read an early version of this post. Dahlia Malkhi introduced me to idea of a WOR; Zhong Shao and Ji-Yong Shin worked with me to refine its API.