The Art of Managing Skunks

Since moving from academic research to industry in 2017, I’ve worked on two software projects. Each one started as a small, clean-slate¹ skunkworks effort involving 2-3 people and gradually expanded to a large, conventional software engineering effort with dozens of engineers. The first of these (from 2017 to 2021) was Delos at Meta, a Chubby/ZooKeeper/etcd-like control plane storage system. The second was a new Kafka engine (from 2022 to 2024) that can run on any disaggregated storage layer (and powers the Confluent Freight product, where S3 is used as that storage layer). Nearly every system at Meta depends in some way on Delos as of 2025 (e.g., this article describes an example dependency chain); Confluent Freight just became generally available and time will tell if it succeeds commercially, though early results are promising.

While these systems were technically difficult to build and operate (particularly given their critical roles in the stacks of the respective companies), I found that much of the challenge lay in the management of these projects. Even the most innovative companies on the planet have incentive structures (for line managers and engineers) that are incompatible with clean-slate skunkworks innovation. In my discussions with various managers over the last few years, I found myself converging on a set of key principles, which I outline in this article. I hope these rules are helpful for other engineers and managers looking to define a shared set of principles for their own skunkworks projects.

Some caveats: I am an engineer, not a manager, and have never managed anything in my life beyond a handful of interns and graduate students; this is just my wish-list as a technical project lead / architect for what I need from managers. These rules may be highly specific to building new storage services at large companies. At some point, the project has to exit skunkworks mode and these rules cease to apply. My sample size is N=2 and it’s difficult to establish that these rules are causally related to (or even just correlated with) project success.

A. No non-coding architects: If you want to participate in designing the system, you have to write code. (Note that it’s okay to not code if you are bringing some particular expertise to the table: e.g., if you are a world-class erasure coding theorist. It might also be okay if you are a world-class ops specialist, though in my experience most such people are very comfortable with code). If there’s anyone on the team whose only job is to delegate work to other people, something has gone extremely wrong.

B. No individual “ownership”: Everyone is responsible for everything. If the boat sinks, everyone sinks with it; there’s no way to win or lose independently. We want people accelerating and enabling each other to expand the pie, not competing with each other on a fixed pie. We want zero tolerance for self-promoting activity: it is the job of the manager and the TLs to make sure that people are rewarded fairly based on what they actually did. The best managers of such teams tell them to run as fast as possible and get the job done, and take the burden of justifying ratings completely off the engineers. If the need to safeguard an engineer’s rating begins to shape the project’s priorities, something has gone extremely wrong.

C. Operate on strengths, not weaknesses: We want each person to focus on what they are truly good at, rather than what they are weakest at. The latter often happens in a big-company setup when engineers want to get promoted and managers tell them to focus on the areas that are holding them back. If engineers ask managers what will get them promoted, the answer has to be “run as fast as you can and make the team ship”. If that person cannot be promoted on objective grounds under those circumstances, something has gone extremely wrong: either the project is not impactful enough to warrant a skunkworks approach, or the person should not be on this project (e.g., maybe their skillset is not needed on the project, or they are not good enough in their assumed area of expertise).

D. Formal communication (exposed outside the team) has to be extremely precise, high-quality, and reviewed: To paraphrase Jeff Bezos, we want “crisp documents and messy meetings”. Different constituencies need different types of messaging: some may need to know about technically impressive details; others may want to know about business impact; some may need to know what’s happening next month and others may need to know what’s happening in 3 years. Writing a single document with that kind of versatility takes months even for experienced writers. Everything the team says publicly affects its reputation and credibility, and constrains its actions in the future. Note that this is not at odds with transparency: informal communication should happen at all levels with the utmost transparency, since it’s typically easy to convey nuance and context when discussing things informally.

E. Avoid a first-doc-wins culture. Having a single person write a public-facing doc has a chilling effect on design engagement within the team; it “muddies the pool”. We want less experienced members of the team to experience the rush of discovering (or re-discovering) ideas; it’s a part of the training process. Public-facing docs should be written after a design process and authored collaboratively. We should not reward docs as deliverables. (None of this applies to internal communication within the team, which can happen in any form and quality level that the team likes).

F. Reward on impact: Everyone on the team has the same rules: they will be rewarded when the project ships in some form. No promotions until something ships (unless someone on the team is already long-due for a promotion). On the flip side, we guarantee reasonable baseline ratings even if there’s no shipped impact. Basically we take out both the upside and the downside for engineers until something ships.

G. Minimize dependencies: Dependencies take a lot of time (external teams often have multiple priorities). They can create uneven quality across the project. Air-gaps can show up in the design. One failure mode is that external teams often specialize in specific solutions rather than a problem domain; so asking an external team to “build a component to solve X” often translates into “modify our existing solution Y to solve X”, which can add a ton of accidental complexity. Note that “reward on impact” incentives force the team to be careful about dependencies: if they do their job perfectly but a dependency fails to show up, the team does not get rewarded. In practice, this ensures that the team only takes a dependency if it absolutely makes sense and they are comfortable with the risk profile.

H. Understand the hierarchy of needs for a new project: For some technical problems, the slope of progress is continuous: it’s easy to get an initial version that works somewhat well and then incrementally improve it, but quite hard to get to an ideal version (e.g., a multi-tenant load-balancer can require years of tinkering with policies). Other problems have a discrete progression: it’s difficult to get to a reasonable v0, but after that you can pretty much leave it alone (e.g., a consensus protocol that’s only used on reconfigurations). In a mature 1-to-10 system, managers and engineers will spend most of their time on the former class of problems; as a result, it may be tempting to prioritize the same problems in a 0-to-1 system. But in a brand new database (for example), it’s far more important to have a working consensus protocol than to have excellent load-balancing in your first release.

I. Hire Pigs, not Chickens: Pigs are full-time engineers committed to a project, whereas Chickens are part-time engineers involved in the project. We want to bias towards a small number of Pigs rather than a larger number of Chickens. Note that this is not a question of competence: even the best Chicken can hurt velocity and undermine the sense of shared fate in a project.

J. Eliminate process ruthlessly: This one should be obvious: a small team does not need process. Do not impose any make-work activity on the team. Free them up to execute. A manager’s role in this setup is to provide inspirational leadership and motivate the troops, rather than manage / limit risk.

K. Progressively overload the team: Pick goals for the team that are ambitious and just a little bit impossible. This has two effects: one, it forces the team to prioritize ruthlessly, where you cut out anything inessential for success; and two, it pushes the team to somehow find leverage through system design, where you find new ways to deliver the same result without as much code / complexity because you literally don’t have cycles to write the code / manage the complexity.

L. Do not exit skunkworks mode prematurely: It makes sense to exit skunkworks mode once execution risk begins to dominate design risk. However, there’s a second consideration: ideally, the team stays in skunkworks mode until it achieves some kind of actual success, i.e., something ships. To understand why, consider that a key reason to create a skunkworks project is to incubate a new type of culture within an incumbent org. Over time, we can create more conventional-looking ancillary teams around the core project, creating a composite of the new culture and the incumbent one. But timing is critical; if we expand before anything ships, the incumbent culture will drown out the new one (which makes sense, since the new culture has no success to back it).

M. Fail-fast vs. Zombie mode: New, risky projects often have to operate in uncertain environments where our assumptions (about the market, hardware, customers) are shifting rapidly. It’s better to move quickly and try something rather than aim for perfect decision making; and to stop quickly rather than allow the project to meander, consume resources / attention, and incur opportunity cost. If we can fail fast and recover quickly, bad decisions don’t matter as much; and we get data for the next attempt. Failing fast on any endeavor requires us to establish concrete criteria for determining its success in some short time-frame (e.g., 3-6 months) before starting work. Fail-fast works for entire projects, but also for smaller decisions within the project (e.g., personnel assignments) or even as a philosophy for building the system (in effect, we can always convert throughput into goodput via rapid iteration on failures).

N. We want R&D, not !R&!D: R&D projects can often end up in no-man’s land, partly because it’s difficult to hold these projects accountable and measure their success. Ideally we want the project to be good “R” (publishable in top conferences) and good “D” (shipping to production). An anti-pattern is if external researchers think the project must be “D” (since it’s obviously not good “R”) and external developers think the project must be “R” (since it’s obviously not good “D”).

O. Synchronous, frequent, informal communication is critical: Prioritize a daily synchronous meeting. Note that this meeting is not for listing and managing work items (nobody wants a stressful daily stand-up in a skunkworks project); its goal is to build a shared understanding of the design space and a shared set of values for assessing points in that space. We want to encourage free-wheeling debate on designs, long-term strategy, and short-term tactics, and train engineers to collaboratively think and talk about design. To manage meeting load, eliminate all other broadcast meetings. This principle is one of the reasons we prefer small teams.

P. People are not fungible. Team composition is critical. Our operating model is a sports team where we pick individuals for particular positions based on the needs of the team and their skill-sets. A second goalkeeper doesn’t help a soccer team much, even if they are absolutely stellar at what they do. Good managers will often do their best to make engineers fungible, in order to reduce personnel risk to the project; but in a true skunkworks team, nobody is fungible.

Q. Run towards risk. In skunkworks mode, the goal is to reduce technical risk as quickly as possible. Accordingly, the team has to surge on areas where risk is high. Fight the temptation to make steady progress on well-understood, low-risk parts of the system.

R. Keep the team small. This one seems obvious but is notoriously difficult to enforce in large companies, for a number of reasons. A well-meaning manager might add engineers to a project to 1) make it go faster; and 2) reward the engineer. But we’ve known for 50 years (!) that software projects actually do not go faster if you add people to them². And adding the wrong type of engineer can often hurt the project and the engineer’s career (see rule P about goalkeepers). Critically, keeping the team small ensures that it’s always resource-constrained (see rule K about ruthless prioritization); and also protects the project against cost-cutting initiatives (since the company doesn’t significantly reduce cost by shutting the project down).

I hope these rules help managers and engineers find common ground – good luck starting your own clean-slate skunkworks projects!

Footnotes:

¹ This prior post might explain why I think clean-slate innovation is critical in systems.
² One VP argued – pedantically but accurately – that Fred Brooks only said this about projects that are running late; though in my experience, every software project is already late on day one.