Branching Came Late to Data

On April 3, 2005, Linus Torvalds began writing a new version-control system. The free-use license for BitKeeper, the proprietary tool the Linux kernel had used since 2002, was being withdrawn, and no replacement worked at kernel scale. Four days later, Torvalds had a first working version of git, with its own source tree committed into itself. The design goal he set was that applying a patch should take no more than three seconds, fast enough to absorb the hundreds of operations a kernel maintainer needed each day. Branching, for the same reason, had to be free. A branch in git is a label on a commit. The commit is content-addressed. The tree exists once. Creating a branch costs the bytes of a forty-character hash.

Twenty-one years later, branching code is effectively free and branching data is mostly not. The storage primitives that close the gap arrived in two layers separated by two decades. The contract layer that decides whether a branch is useful is still being written, and most of the industry has not noticed it is a separate problem.

I argued earlier this month that branching is what gives AI agents a substrate to do real work on production data. This post is about the layer underneath that argument: what a branch contractually means, and why the contract is harder than the storage primitive that took thirty years to engineer.

The asymmetry

A working software engineer in 2026 branches dozens of times a day. Feature branches, hotfix branches, release branches, throwaway experiment branches. The unit cost is zero.

A working data engineer in 2026 still treats a clone of production as a project. Setting up a copy of the production database is paperwork: open a ticket, schedule a window, pick an off-hours snapshot, restore to a staging cluster, anonymize the PII, hope the schema is current, accept that the data is stale the moment it lands. Teams that need to verify a migration against production-shaped data often forgo the verification entirely because the cost of producing the environment is higher than the perceived risk of the migration. The result is a class of Friday-night incidents where the migration ran against prod because nobody could afford to run it anywhere else.

The asymmetry is not about willingness. Data engineers would happily branch their production databases hundreds of times a day if they could. The asymmetry is physical. Code is small, mostly text, mostly append-only, and addressable by hash. Data is large, mutable, transactionally consistent, and referentially entangled. A 200 GB database does not branch by labeling a commit. A naive copy is a dump and a restore. Two hundred gigabytes at a generous, sustained 200 MB/s is about seventeen minutes of transfer alone, and the copy is stale before it finishes landing.
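The back-of-envelope figure is easy to reproduce. This is a minimal sketch that assumes decimal units (1 GB = 1000 MB) and a transfer rate that never dips; copy_minutes is an illustrative helper, not part of any tool.

```python
def copy_minutes(size_gb: float, throughput_mb_s: float) -> float:
    """Minutes of pure transfer, assuming decimal units (1 GB = 1000 MB)
    and a throughput that never dips. Both are simplifying assumptions."""
    seconds = size_gb * 1000 / throughput_mb_s
    return seconds / 60


if __name__ == "__main__":
    # 200 GB at 200 MB/s is 1000 seconds, a little under seventeen minutes.
    print(f"{copy_minutes(200, 200):.1f} minutes")
    # The cost grows linearly with size, which is the whole problem.
    print(f"{copy_minutes(10_000, 200) / 60:.1f} hours for 10 TB")
```

The point of the arithmetic is the linear term: a naive copy pays for every byte, every time.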

The storage primitives

The first primitive that mattered shipped before git did. In 1993, the Write Anywhere File Layout (WAFL) at NetApp made filesystem snapshots cheap by storing data in a tree of immutable blocks and changing only the root pointer on write. A snapshot was a saved copy of the root pointer. The blocks underneath were shared. Dave Hitz, James Lau, and Michael Malcolm described the system at USENIX Winter 1994. NetApp built an entire decade of revenue on selling appliances whose differentiator was that snapshots were free.

WAFL was proprietary and tied to NetApp hardware. The same architecture became broadly available with ZFS, released as part of OpenSolaris build 27 on November 16, 2005, designed by Jeff Bonwick, Bill Moore, and Matthew Ahrens. ZFS shipped snapshots and writable clones as first-class operations. The clones were copy-on-write at the block layer: the clone shared blocks with the parent until either side wrote, at which point only the changed blocks diverged. A multi-terabyte clone cost the time to update a few pointers.
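The sharing behavior is easier to see in a toy model. The sketch below is an assumption-laden illustration, not how WAFL or ZFS are implemented: BlockStore and Volume are hypothetical names, and the volume uses a flat pointer table instead of a block tree, so clone here copies one small integer per block rather than a single root pointer. What it does preserve is the key property: a clone duplicates no data, and a write on either side diverges only the block that changed.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BlockStore:
    """Shared, append-only pool of immutable blocks."""
    blocks: List[bytes] = field(default_factory=list)

    def put(self, data: bytes) -> int:
        self.blocks.append(data)
        return len(self.blocks) - 1


class Volume:
    """A volume is only a table of pointers into the shared store."""

    def __init__(self, store: BlockStore, pointers: Optional[List[int]] = None):
        self.store = store
        self.pointers = list(pointers or [])

    def write(self, index: int, data: bytes) -> None:
        # Never overwrite in place: store a new block, then repoint.
        block_id = self.store.put(data)
        if index == len(self.pointers):
            self.pointers.append(block_id)
        else:
            self.pointers[index] = block_id  # IndexError if index is out of range

    def read(self, index: int) -> bytes:
        return self.store.blocks[self.pointers[index]]

    def clone(self) -> "Volume":
        # Copies pointers only; no data block is duplicated.
        return Volume(self.store, self.pointers)


if __name__ == "__main__":
    store = BlockStore()
    parent = Volume(store)
    for i, chunk in enumerate([b"a", b"b", b"c"]):
        parent.write(i, chunk)
    child = parent.clone()
    child.write(1, b"B")
    print(parent.read(1), child.read(1), len(store.blocks))
```

After the child's write, the parent still reads its original block, and the store has grown by exactly one block rather than by a full copy.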

For application databases, the copy-on-write primitive took another twelve years to arrive in a form that engineers could actually use without administering their own ZFS pool. Heroku Postgres shipped fork-and-follow in 2012, but fork was a snapshot-based copy: faster than a dump-and-restore, but still O(database size). Amazon Aurora database cloning, announced on June 12, 2017, was the first widely used cloud database with a true page-level copy-on-write clone. AWS engineered the storage layer to share pages between source and clone until one side wrote. Clone time became independent of database size. A 10 TB Aurora cluster could be cloned in roughly the same time as a 10 GB cluster. The clone was writable, isolated, and disposable. The blue-green pattern that Dan North and Jez Humble had named for application deployments around 2005, and that Humble and David Farley would later codify in Continuous Delivery, now had a database-layer analogue.

Aurora’s clones were locked to Aurora. The next move was to build the same primitive in a form that lived outside any one cloud provider’s storage architecture. Neon launched publicly on June 15, 2022, with a separated storage-and-compute architecture in which the storage layer was a log-structured copy-on-write substrate. Branches were labels on the log. Semantically, the branch became what git had made it on April 7, 2005: a cheap pointer into shared underlying state.

The change primitive

A storage clone is a frozen moment. If the source database keeps writing, the clone drifts. For a clone to be useful for “what would happen if I ran this migration against current production,” you also need a way to apply the source’s ongoing changes to the clone, or to a parent the clone is taken from. That primitive is logical replication.

PostgreSQL’s logical decoding shipped in 9.4 on December 18, 2014. For the first time, the WAL could be streamed out of a Postgres instance in a customizable format, row by row, without the receiver having to be another Postgres of the same major version. Logical decoding was the substrate. Built-in publish/subscribe logical replication arrived in PostgreSQL 10 on October 5, 2017. Debezium, pgstream, and the change-data-capture ecosystem grew up around logical decoding to give the same capability to teams whose source databases were managed services that did not let them install replication subscriptions.

Combining the two layers is what produces a branch that behaves like a clone of production rather than a clone of last Tuesday. CDC streams source changes into a parent mirror; copy-on-write at the parent makes child branches free. The parent stays current; the child stays isolated. The clone is cheap, fresh, and disposable. The architecture is more than a decade younger than git and has roughly the same shape: an immutable substrate, cheap labels, and a way to converge or diverge.

The honesty cost

Cheap branches make the storage problem solvable. They do not make the contract problem solvable, and the contract is what determines whether a branch is useful or a footgun.

A database is not one thing. It is a heterogeneous collection of objects: tables, indexes, views, triggers, functions, sequences, extensions, row-level security policies, grants, materialized views, foreign tables. PostgreSQL replicates each category at a different fidelity, and the divergences are not minor.

Trigger definitions replicate, but the enabled state is dangerous. A trigger that calls a Resend webhook or a Supabase Edge Function on insert, enabled on a parent mirror while logical replication is catching up, fires the webhook again for every row replication replays. A migration that touches a million rows fires the webhook a million times from the mirror. The customer-facing system thinks a million users just signed up. Sequences replicate as definitions; the running last_value does not. The first insert against a fresh branch collides with an existing primary key unless the counter is advanced after the snapshot. pg_cron jobs replicate as rows in cron.job, and unless the target filters the cron.* schema out of the WAL, the same nightly billing run fires from production and from the branch within milliseconds of each other. Extensions depend on what the target supports: a source with pg_cron, pg_partman, or pgvector is not a branch on a target that lacks them, no matter how page-level the copy-on-write underneath is. Row-level security policies, GRANTs, materialized view contents, tables without a replica identity — each is a separate contract.
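The sequence case is the easiest of these to reproduce in miniature. The following is a toy Python model, not PostgreSQL: Sequence, Table, naive_branch, and advance_sequence are hypothetical names invented for the illustration. It assumes the branch receives the rows but starts with a fresh counter, which is the failure described above. In PostgreSQL itself the analogous repair would be a setval call on the sequence after the snapshot.

```python
from __future__ import annotations

from typing import Dict


class Sequence:
    def __init__(self, last_value: int = 0):
        self.last_value = last_value

    def nextval(self) -> int:
        self.last_value += 1
        return self.last_value


class Table:
    def __init__(self) -> None:
        self.rows: Dict[int, str] = {}
        self.id_seq = Sequence()

    def insert(self, value: str) -> int:
        new_id = self.id_seq.nextval()
        if new_id in self.rows:
            raise KeyError(f"duplicate primary key {new_id}")
        self.rows[new_id] = value
        return new_id


def naive_branch(source: Table) -> Table:
    """Rows arrive on the branch; the running counter does not."""
    branch = Table()
    branch.rows = dict(source.rows)
    return branch


def advance_sequence(table: Table) -> None:
    """Move the counter past every id already present on the branch."""
    table.id_seq.last_value = max(table.rows, default=0)


if __name__ == "__main__":
    source = Table()
    for name in ["ann", "bob", "cy"]:
        source.insert(name)
    branch = naive_branch(source)
    try:
        branch.insert("dee")
    except KeyError as exc:
        print("collision:", exc)
    advance_sequence(branch)
    print("new id:", branch.insert("dee"))
```

The fix is one line, which is the point: the hard part is not the repair but knowing, per category, that a repair is needed.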

The honest engineering move is to write the divergences down per category and let users make informed decisions. The dishonest move is to call the result “a clone” and let the corner cases surface at 2 AM. The honest version looks like a parity contract: triggers default-disabled on the parent mirror and re-enabled on the child at the moment of branch cut; sequences advanced once post-snapshot; cron.* excluded from WAL replay by default, with an explicit opt-in symmetric strategy for customers who want it; extensions filtered through the target’s allowlist and unsupported ones either failing the branch loudly or dropped with explicit consent. Defaults are full-clone; divergences are flags. A branching product is not its storage architecture. The storage architecture is the substrate. The product is the contract: a per-category statement of what a branch faithfully mirrors of source, where it intentionally diverges, and why.
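One way to picture such a contract is as plain data that can be printed, reviewed, and versioned. The sketch below is only an illustration under stated assumptions: ParityRule, PARITY_CONTRACT, and render are hypothetical names, the flag names are invented, and the four entries simply paraphrase the defaults listed above rather than describe any shipping product.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ParityRule:
    category: str                # object category the rule covers
    default: str                 # what a branch does unless told otherwise
    opt_in_flag: Optional[str]   # invented flag name that changes the default
    reason: str                  # why the branch diverges from source


# Entries paraphrase the defaults described in the text; flag names are made up.
PARITY_CONTRACT: List[ParityRule] = [
    ParityRule("triggers",
               "disabled on the parent mirror, re-enabled on the child at branch cut",
               None,
               "replayed rows must not re-fire side effects such as webhooks"),
    ParityRule("sequences",
               "counter advanced once after the snapshot",
               None,
               "last_value is not carried over, so inserts would collide"),
    ParityRule("cron",
               "cron.* excluded from WAL replay",
               "replay_cron",
               "scheduled jobs would otherwise run on both source and branch"),
    ParityRule("extensions",
               "filtered through the target allowlist; unsupported ones fail the branch",
               "drop_unsupported_extensions",
               "a target that lacks an extension cannot mirror objects that need it"),
]


def render(contract: List[ParityRule]) -> str:
    """Produce one human-readable line per category."""
    lines = []
    for rule in contract:
        flag = rule.opt_in_flag or "none"
        lines.append(f"{rule.category}: {rule.default} (flag: {flag}); {rule.reason}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(PARITY_CONTRACT))
```

Keeping the reason next to the default is the design choice that matters: a divergence without a stated reason is indistinguishable from a bug.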

This is the work that is mostly still ahead, and the conversation in the industry has not caught up to it.

Most of the writing about data branching is about the storage primitive, which is the part that is now solved. Aurora’s clone is a property of Aurora. It does not extend to a database whose storage is not Aurora. Neon’s branching is a property of Neon. It does not extend to a database whose storage is not Neon. Branching across providers, where production sits on RDS or Cloud SQL or self-managed Postgres and the branch sits on a copy-on-write substrate somewhere else, requires a CDC pipeline whose parity is documented per object category, not assumed. The teams that are actually writing the per-category contracts down are the ones that will be remembered for this generation of data infrastructure. The teams that ship “a clone” and let customers discover the corner cases will not.

What this unlocks

When a branch is genuinely cheap and the contract is genuinely written down, several things change.

Pre-migration verification stops being a project. A migration that an engineer is nervous about gets run against a branch first. If it succeeds against production-shaped data, that is a better signal than success against an empty test fixture. If it fails, it fails against a copy that nobody depends on.

CI runs against production-shaped data. Most application test suites today run against an empty Postgres with a handful of seed rows. The bugs that production-shaped data triggers — query plans that depend on table statistics, locks that depend on concurrent writers, constraints that fire only on the data shape that exists in prod — go undiscovered until the change reaches the production cluster.

Coding agents stop being purely advisory. An agent that can run destructive queries against a branch can verify its own work. The blast radius is the branch. This is not a hypothetical. In the run-up to its acquisition by Databricks, Neon reported internal telemetry showing more than 80% of databases provisioned on the platform were created automatically by AI agents rather than by humans. The agents are choosing the substrate that gives them disposable environments, because an environment that costs nothing to spin up and nothing to throw away is the only environment in which an autonomous process can take an action that might be wrong.

The unifying frame is that reversibility is becoming infrastructure one layer at a time. The editor got it in the 1970s with undo. The database got it in the 1980s with transactions. Deployment got it in the 2000s with blue-green and feature flags, and by the late 2010s every major cloud platform shipped it as the default. The data layer is the one that has not commoditized yet. Cheap branching is what that commoditization looks like when it arrives.

What I am still figuring out

Default safety. A branch that is by default a full clone of production is the most useful version and the most dangerous version. The most-useful argument is that branches are for testing the change you would otherwise have made against prod, and a branch that strips the data shape of production is not testing the same thing. The most-dangerous argument is that production data inside a non-production environment is a perimeter problem regardless of intent. Anonymization-by-default has its own corner cases: timestamps that are themselves sensitive, foreign keys whose values reveal identity, free-text columns that contain PII without a schema declaring them PII. My answer is that the customer owns the data classification, not the branching layer. Defaults stay production-faithful; masking is a deliberate action the customer takes, not a hidden behavior the platform applies. The perimeter argument is upstream of that. It is not a question a branching primitive should be trying to answer.

Promote vs. merge. Git’s three-way merge works because most code conflicts are line-local and most lines do not depend on each other transactionally. Row data with foreign keys, triggers, constraints, and partial uniqueness has neither property. The closest existing analogue, conflict-free replicated data types, works for a narrow class of values and does not generalize to arbitrary schema. The instinct from a decade of git is that the merge primitive is what data branching is reaching toward. I think that instinct is wrong.

The better primitive is promote, in the sense Terraform uses apply. A branch is not a sibling of main; it is a plan against the current state of production. Conflicts surface at plan time, when they are reviewable, not at apply time, when they are irreversible. The plan is the operator’s eyeball encoded in the wire: what an attentive engineer used to do before clicking apply, the system now does as a first-class step. Schema migrations have always worked this way. You do not three-way-merge 001_add_users.sql and 002_add_orgs.sql; you apply N+1 to whatever prod is now. Blue-green deployments work this way too: you do not merge the green environment into blue, you swap the pointer. The asymmetry that git pretends does not exist between main and every other branch is the asymmetry that promote codifies. Prod is special. The branch is a proposed change to it. The operation is unidirectional.
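A minimal sketch of the plan-then-apply shape, under heavy simplifying assumptions: production is modeled as a key-value dict, a branch is a list of changes that each record the value production had when the branch was cut, and Change, plan, and promote are hypothetical names. It is not Terraform's mechanism or any product's API; it only shows conflicts surfacing at plan time and the apply step running in one direction.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Change:
    key: str
    expected: Optional[str]  # value prod had when the branch was cut
    new: Optional[str]       # value the branch wants; None means delete


def plan(prod: Dict[str, str], changes: List[Change]) -> List[str]:
    """Return reviewable conflicts; an empty list means safe to promote."""
    conflicts = []
    for change in changes:
        current = prod.get(change.key)
        if current != change.expected:
            conflicts.append(
                f"{change.key}: expected {change.expected!r}, prod has {current!r}"
            )
    return conflicts


def promote(prod: Dict[str, str], changes: List[Change]) -> None:
    """Apply the branch to prod, but only if the plan is clean."""
    conflicts = plan(prod, changes)
    if conflicts:
        raise RuntimeError("plan has conflicts: " + "; ".join(conflicts))
    for change in changes:
        if change.new is None:
            prod.pop(change.key, None)
        else:
            prod[change.key] = change.new


if __name__ == "__main__":
    prod = {"plan_tier": "free"}
    change = Change("plan_tier", expected="free", new="pro")
    prod["plan_tier"] = "trial"   # prod moved after the branch was cut
    print(plan(prod, [change]))   # the conflict is visible before any write
```

Nothing here attempts to reconcile two sets of changes. A stale plan is rejected and must be rebuilt against whatever production is now.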

Promote does not solve every workflow. Two branches that have independently diverged from prod still cannot be combined cleanly; you promote one, and then promote the other against the new prod. That is a real limitation. But it is a smaller limitation than the one you accept by trying to make three-way row-level merge work for arbitrary schema, which is that it does not work. The question is not “how do we build merge for data.” The question is “what is the right primitive for applying a tested branch back to production,” and the honest answer is the one that ships, not the one that copies git uncritically.

Whether the primitive belongs in the provider or in a layer. Aurora’s clone is a feature of Aurora. Neon’s branching is a feature of Neon. A CDC-based branching layer can run on top of any source database but pays a latency and parity tax that a native implementation does not. Both have working implementations and real customer bases. I do not have a strong view on which shape dominates at scale.

On April 3, 2005, Linus Torvalds began writing a new version-control system because BitKeeper’s free-use license was being withdrawn and there was nothing to replace it with. By April 7, he had a working tool, committed its source to itself, and made branching what kernel work required at that scale: a cheap pointer into shared, content-addressed history. The Linux kernel community switched. The rest of the software industry followed over the next five years. The substrate that made it possible — content-addressed storage, cheap pointers, an immutable history — was thirty years old by the time git put a usable interface on it.

The substrate for data branching is now roughly as old. WAFL was 1993. ZFS was 2005. Aurora’s clone was 2017. Logical decoding was 2014. What is missing is not the storage primitive. What is missing is the contract: a per-category statement of what a branch means, written down honestly enough that an engineer can read it before 2 AM and know what they can and cannot rely on. The next twenty years of data infrastructure are about that contract, not about whether copy-on-write exists.

The reason branching came late to data is that the physical asymmetry between code and data was real, and it took thirty years to engineer it away. The reason branching is still arriving is that the contract layer is harder than the storage layer. The storage layer is what got most of the attention.