42 things I learned from building a production database

In 2017, I went to Facebook on a sabbatical from my faculty position at Yale. I created a team to build a storage system called Delos at the bottom of the Facebook stack (think of it as Facebook’s version of Chubby). We hit production with a 3-person team in less than a year, and subsequently scaled the team to 30+ engineers spanning multiple sub-teams. In the four years that I led the team (until Spring 2021), we did not experience a single severe outage (nothing higher than a SEV3). The Delos design is well-documented in two academic papers (in OSDI 2020 and SOSP 2021). Delos is currently replacing all uses of ZooKeeper at Facebook.

Here are some of the things I learned as the tech lead for Delos. My intent in publishing this is to help others in similar roles (leading teams that are building new infra at large companies); much of it may not generalize to different settings.

Note: an IC is an “Individual Contributor” in company parlance; i.e., someone who does not manage others. In this context, IC can be interpreted as engineer or developer.

Customers:

[1] Keep your customers happy; else the rest of this document doesn’t matter.

[2] Be careful to have the right number of customers (in the beginning, just one) and the right customers (whose requirements allow you to build out key technology); and grow that number carefully.

[3] Interface directly with customer ICs. A lot of intra-team conflict can be resolved by saying “I talked to the customer just now and they said…”. In infra we often don’t need to speculate about what customers want; we can just ask them.

[4] But realize that customers may not express what they really need; don’t take requirements at face value. Instead, spend the time to understand their use case in detail. Read their code.

Project Management:

[5] Have a simple, crisp mission statement that expresses your raison d’etre. For Delos it was: we will be a reliable foundation for FB infra.

[6] Socialize estimates of task difficulty repeatedly; decision-makers may not have the time, inclination, context, or training to generate these estimates, and may get them wrong (literally) by orders of magnitude.

[7] Task allocation to ICs is critical; ask to be in the critical path of any decision, because you typically have a much better understanding of the problem, the codebase, and the IC’s strengths than the manager. Most managers are thrilled if you and the other ICs figure out the task allocation on your own.

[8] A road-map is a means, not an end.

[9] If you get good and/or aligned managers, be as understanding, supportive, and accommodating as you can. If you don’t get such managers… well, I haven’t figured this one out, let me know if you do.

[10] Make your project robust to re-orgs. A company management hierarchy is inherently fragile (a tree is a 1-connected graph, after all); socialize the project continuously with managers who might take over in the future. Do whatever it takes to make sure that manager churn does not result in unfair career outcomes for ICs.

[11] Keep track of how long similar features took in other projects in your space and use this as evidence for task difficulty estimates (e.g., “feature X took three years in system Y; it’s not a one-half (six-month) job for one IC”).

Design:

[12] Be conservative on APIs and liberal with implementations.

[13] But insist on careful process around rolling out new implementations (shadowing, staged roll-out).

[14] When designing APIs, write code for one implementation; plan actively for the second implementation; and hope/pray that things will work for a third implementation.

[15] Design APIs with migration to new implementations as a first-class consideration; custom migrations are huge time-sinks and sources of unreliability. Every major API should have a single CLI-driven call for switching implementations.

[16] Design as a team; implement as individuals. This will make design the bottleneck, but it’s worth it: push back on impulses to parallelize design.

[17] For storage systems, bias heavily in the beginning towards consistency and durability rather than availability; these are harder to measure and harder to fix if broken. Because availability is easier to measure, there will be external pressure to prioritize it first; push back.

[18] Maintain multiple implementations in test for APIs; compare results between them. The cost is worth it (it will help with correctness, and also prevent leakage of implementation detail).

[19] Late-bind to designs: encourage the team to think about the entire design space without committing to a particular point solution. Running brainstorming meetings with a bunch of high-IQ, opinionated ICs is an art worth mastering. Encourage rough prototyping in the critical path of binding to a design.

[20] Late-bind to implementers: once design is done, any IC should be able to write the code.

[21] Have the right number of abstractions (this is hard). Too few and you end up with a messy monolith; too many and the team will be overwhelmed by the cognitive overhead of understanding each abstraction’s semantics.

[22] Avoid using real-time for correctness guarantees or comparing clocks across machines unless you have (and understand) error bounds on the clock.

[23] Have a single source of truth. Establish simple invariants between various types of state.

[24] Create a culture where ICs are constantly thinking about radically different designs; do not shut down conversations about hypothetical alternative designs. Encourage curiosity.

[25] Know your SKUs. Cloud infra makes it easy to ignore hardware; but an understanding of hardware (and hardware trends) is critical for design.
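As a rough illustration of item [18], here is a minimal sketch of comparing two implementations of the same API in test. Everything in it is hypothetical and chosen for illustration (the key-value API, the names DictStore, LogStore, and differential_test); it is not how Delos is built. The same random sequence of operations is applied to both implementations, and any read that disagrees fails the test.

```python
import random
from typing import List, Optional, Tuple


class DictStore:
    """Hypothetical reference implementation: a plain dict."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


class LogStore:
    """Hypothetical second implementation: an append-only log scanned on read."""

    def __init__(self) -> None:
        self._log: List[Tuple[str, Optional[str]]] = []

    def put(self, key: str, value: str) -> None:
        self._log.append((key, value))

    def get(self, key: str) -> Optional[str]:
        # The most recent entry for the key wins; None marks a delete.
        for logged_key, value in reversed(self._log):
            if logged_key == key:
                return value
        return None

    def delete(self, key: str) -> None:
        self._log.append((key, None))


def differential_test(impl_a, impl_b, steps: int = 1000, seed: int = 0) -> int:
    """Apply one random operation sequence to both stores and compare reads."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    keys = [f"k{i}" for i in range(8)]
    for step in range(steps):
        op = rng.choice(("put", "get", "delete"))
        key = rng.choice(keys)
        if op == "put":
            value = str(rng.randrange(100))
            impl_a.put(key, value)
            impl_b.put(key, value)
        elif op == "delete":
            impl_a.delete(key)
            impl_b.delete(key)
        else:
            a, b = impl_a.get(key), impl_b.get(key)
            if a != b:
                raise AssertionError(
                    f"step {step}: get({key!r}) returned {a!r} vs {b!r}"
                )
    return steps


if __name__ == "__main__":
    differential_test(DictStore(), LogStore(), steps=1000, seed=42)
    print("implementations agree")
```

Because the test only touches put, get, and delete, it cannot depend on how either store works internally, which is the "prevent leakage of implementation detail" benefit mentioned in [18].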

Code Review:

[26] In a transparent codebase with quick review cycles, APIs will leak implementation details unless you gate-keep.

[27] Encourage ICs to think critically about diffs and create an environment where people feel free to express concerns. Your response as a diff writer to someone pointing out a problem with a diff should be gratitude, not dismay.

[28] For critical components, consider informal rules such as requiring two accepts or even unanimous accept from some subset of ICs.

[29] For critical components, time to landing a diff is not a metric of importance: push back against impulses to measure this metric and optimize it. Create a culture where ICs are okay with diffs not landing quickly (creative endeavors – books, papers, etc. – typically involve long review cycles due to the cost of high-quality reviewing; why should code be different?).

[30] Sometimes you realize the right design for something only after an IC has written up a candidate design as a diff. Fight the impulse to say “oh well, let’s land it and then fix it later”; you are not helping either the IC or the project by doing this. Create a culture where ICs feel comfortable throwing away code if it’s not the right solution (lead by example).

Strategy:

[31] Ask yourself on some cadence: why does the team/project exist? If it didn’t exist, what would happen (which other team / system would fill the gap)? How is the team adding value to the company and how can it continue doing so in the future?

[32] Keep track of every other major project in your space within the company: you should be able to explain their technical design better than their own ICs. Grab any opportunities to debate scope with the leads of other similar projects: you should be able to articulate how your project fits into the larger ecosystem of options. Inter-team competition is healthy and necessary. Make friends with ICs in these projects: they understand your technical challenges better than anyone else in the company.

[33] Do not compete on raw performance or efficiency with other teams; this will escalate into an arms race where both teams waste time optimizing their systems for point workloads, generating apples-to-oranges comparisons, etc. Compete on fundamental design characteristics.

[34] If someone objectively has a better system for your use case and wants to take it on, go find something else to do.

Observability:

[35] Measurement is a means, not an end.

[36] You should be able to detect problems in your service before your customer does.

[37] As much as humanly possible, observability should be above APIs and external to implementations. This ensures that you can switch implementations and compare performance without introducing bugs in the measurement code. It also de-clutters implementations and lowers the bar for new implementations.

[38] Anything that can’t be measured easily (e.g., consistency) is often forgotten; pay particular attention to attributes that are difficult to measure.

[39] Push critical checks (e.g. for consistency) into the deployment itself whenever possible; minimize reliance on external services for checks (else you now have two things to track instead of one).
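One way to read item [37] is as a wrapper that sits above the API and measures whatever implementation is plugged in underneath. The sketch below assumes a hypothetical two-method key-value API; the names MeteredStore and InMemoryStore are invented for illustration and do not come from Delos.

```python
import time
from collections import defaultdict
from typing import Optional


class InMemoryStore:
    """Hypothetical implementation with no measurement code of its own."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class MeteredStore:
    """Wraps any store exposing put/get and records per-call metrics."""

    def __init__(self, inner, clock=time.perf_counter) -> None:
        self._inner = inner
        self._clock = clock
        self.calls = defaultdict(int)      # call count per API method
        self.errors = defaultdict(int)     # exception count per API method
        self.seconds = defaultdict(float)  # total elapsed time per API method

    def _timed(self, name, fn, *args):
        start = self._clock()
        try:
            return fn(*args)
        except Exception:
            self.errors[name] += 1
            raise
        finally:
            # Runs on success and on failure, so every call is counted.
            self.calls[name] += 1
            self.seconds[name] += self._clock() - start

    def put(self, key: str, value: str) -> None:
        return self._timed("put", self._inner.put, key, value)

    def get(self, key: str) -> Optional[str]:
        return self._timed("get", self._inner.get, key)


if __name__ == "__main__":
    store = MeteredStore(InMemoryStore())
    store.put("a", "1")
    store.get("a")
    print(dict(store.calls), dict(store.seconds))
```

Swapping InMemoryStore for another implementation leaves the measurement code untouched, so numbers from the two are collected the same way and can be compared directly.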

Research:

[40] Keep track of research in your space. Soon you’ll have a shorthand with your ICs that enables super-fast communication: “what if we try that thing from projectX? And combine it with the technique in projectY?”

[41] Try new things. Bias towards novelty within the space of feasible solutions. Fight the impulse to copy designs verbatim. Every major system was just a half-baked idea in someone’s head at some point.

[42] Write papers. Writing for an audience that has zero context on what you are doing will force you to examine and clarify your assumptions. Papers make it easier to hire good people and to on-board them. Grad students should be able to explain your design back to you (and find bugs!). Try to say yes when asked to give talks. They are fun, and you get to meet new people.

changelog:

11/24/2021: Added a definition for “IC” before the list after Hacker News feedback.