The Week-Long Transaction

Four weeks ago, Alice forked the production database to investigate something. Over those weeks she ran hundreds of queries (who has not ordered in 90 days, what was last quarter’s revenue, how many orders are over a hundred dollars) and made decisions on each answer. Production has been moving the whole time. Today she wants to fold her work back in. The question that has to be answered before any byte moves: which of Alice’s answers are still true? She cannot re-run every query. The system has to know which ones matter, and check those. I do not yet know how to build that system. This post is what I have so far.

In January 2011, at the CIDR conference at Asilomar, Phil Bernstein and Colin Reid of Microsoft Research, together with Sudipto Das, presented a system called Hyder. It won best paper that year. The pitch was contrarian: build a transactional database without sharding, by sharing one append-only log across all servers and letting each server roll the log forward against its own copy of the data. The architectural piece that made the design work was an algorithm Bernstein called meld. Given two transactions that produced changes against the same starting state, meld decided whether they could both be applied, and if so how, or whether one of them had to lose. Meld processed those intentions at microsecond scale; Alice’s branch is the same kind of decision over weeks. I keep coming back to meld even though I am not sure the implementation transfers — the concept does, the duration doesn’t, and the gap is where the engineering lives. Branching a database used to be a project and now mostly isn’t, and the operation that ends a branch is better called promote than merge. Whatever you call it, the mechanism has to sort the branch’s work into a few different shapes: representation changes it can translate, operations whose intent it can safely combine, read answers that may have gone stale, and procedural effects it probably should not replay at all.

The workload that makes any of this matter is AI agents using databases. A human might promote a careful branch occasionally; an agentic system invites branches as a working surface. Every analysis, migration dry run, cleanup, and experiment becomes a disposable fork with a proposed promotion at the end. A September 2025 paper out of Berkeley reports that on Neon’s serverless Postgres, agents create roughly twenty times more branches and perform fifty times more rollbacks than human developers. The paper does not measure branch lifetime (that part is my extrapolation), but when forks are this cheap to spawn, some sit for hours or days while production keeps moving, and the rare ones that sit for a week become the design pressure. The Berkeley authors argue that branching-consistency models from the weak-consistency era (Bayou, Dynamo, TARDiS) can offer inspiration but that agentic speculation goes further. The difference between a useful branching database and a useless one is whether a failed promote tells Alice “start over” or “your branch read this predicate, main added these matching rows, and these three updates still commute.” The first is what an agent gets today. The second is what I want to build.

What a long-lived branch breaks

The copying is now cheap; that’s the part of the problem the last two posts said was solved. The folding-back is where things go wrong, in four specific ways.

The first thing that goes wrong is that there is no production-grade way to answer Alice’s question — which of my answers are still true — at branch scale and over time. The closest existing family of techniques is serializable isolation: systems track enough of what a transaction read to know whether a concurrent write would have changed its answer. Postgres does this with serializable snapshot isolation; other databases use their own versions of range locks, gap locks, or snapshot checks. They were designed for active transactions, not branches that sleep for weeks. As the read footprint grows, the implementation has to trade precision for bounded memory: tuple locks become page or relation locks, exact ranges become coarse ranges, and the result is more false conflicts. At week scale, the bookkeeping becomes the product problem, and I do not yet know of anyone shipping the bookkeeping at that scale.
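A toy model of that precision-for-memory trade, in Python. This is a sketch under loud assumptions, not Postgres’s implementation: the ReadTracker class, the max_tuples_per_page threshold, and the page/slot addressing are all invented for illustration. It shows the same write being judged harmless while tracking is tuple-precise and flagged as a conflict once the tracker has coarsened to page granularity.

```python
from typing import Dict, Set


class ReadTracker:
    """Hypothetical read-set tracker that coarsens itself to bound memory."""

    def __init__(self, max_tuples_per_page: int) -> None:
        self.max_tuples_per_page = max_tuples_per_page
        self.tuple_reads: Dict[int, Set[int]] = {}  # page -> slots read
        self.page_reads: Set[int] = set()           # pages tracked coarsely

    def record_read(self, page: int, slot: int) -> None:
        if page in self.page_reads:
            return  # already coarse; nothing more to remember
        slots = self.tuple_reads.setdefault(page, set())
        slots.add(slot)
        if len(slots) > self.max_tuples_per_page:
            # Promote: forget individual slots, remember only the page.
            del self.tuple_reads[page]
            self.page_reads.add(page)

    def conflicts_with_write(self, page: int, slot: int) -> bool:
        if page in self.page_reads:
            return True  # possibly a false conflict
        return slot in self.tuple_reads.get(page, set())


tracker = ReadTracker(max_tuples_per_page=2)
tracker.record_read(page=1, slot=0)
tracker.record_read(page=1, slot=1)
precise = tracker.conflicts_with_write(page=1, slot=5)  # slot 5 was never read

tracker.record_read(page=1, slot=2)  # third read pushes page 1 over the limit
coarse = tracker.conflicts_with_write(page=1, slot=5)   # same write, now flagged
```

Nothing about the write changed between the two checks; only the tracker’s memory of what was read got blurrier. Stretch that over weeks of reads and most of the table ends up tracked coarsely.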

The second thing that goes wrong is concurrent cleanup. Databases get messy with use. Old versions of rows accumulate, deleted rows still take space, the physical layout fragments. So databases periodically do a compaction pass — they reorganize themselves into a tidier shape. It’s the way defragmenting a hard drive used to be a maintenance task, and the way garbage collection runs in the background in most modern languages. The user shouldn’t have to think about it.

The problem is that two branches can compact independently. Alice’s branch reorganizes a million rows from one set of files to another and updates the indexes to point at the new locations. Bob’s branch does the same against his copy, but picks a different layout. When the branches promote, you don’t only have to reconcile Alice’s data and Bob’s data — you have to reconcile Alice’s reorganization with Bob’s reorganization. The cleanup itself becomes a conflict, and if you handle it naively, you end up with both reorganizations and double the storage.

The kitchen analogy is exact. Imagine you and a roommate both independently decide to reorganize the kitchen while the other is out. You move the spices alphabetically and put the pots on a high shelf. Your roommate moves the spices by cuisine and puts the pots in a drawer. When you both come home, you don’t have one reorganized kitchen — you have two competing reorganizations of an overlapping set of things. Keep both to be safe and the kitchen doubles. Repeat that across many promotes, and the database is mostly bookkeeping about old reorganizations. Distributed stores call this the sibling explosion problem: the system cannot choose one canonical version, so it keeps accumulating alternatives. CRDTs exist partly to avoid that shape by making more updates converge by construction, but the engineering work is still the kind that looks tractable until you are three months in.

The third thing that goes wrong is that some of Alice’s queries weren’t about specific rows. They were about whatever currently matches a rule. Compare two questions: what is customer 12345’s email address, and who has not ordered in the last 90 days. The first has a clean answer that depends on one row; if 12345’s row doesn’t change, the answer doesn’t change. The second has an answer that depends on the whole table and on what today means. New customers can move into the category, existing ones can move out, and the question itself names no specific row.

The clean illustration is the order count from earlier. Alice’s branch ran SELECT count(*) FROM orders WHERE total > 100 and got forty-seven. She acted on that number, planned the campaign, wrote the budget. Meanwhile, production has accepted a new order with total = 150. The true count is now forty-eight, and Alice’s forty-seven is stale. The order that changes the answer is new; her branch never saw it, because it didn’t exist when she asked. Row-level conflict detection misses this one, because the conflict isn’t about a row Alice read. Serializable systems know this as the phantom problem and handle it with predicate or range tracking; the week-scale version is harder because the thing you have to preserve is not a row identity but a rule.
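To make the gap concrete, here is a minimal Python sketch. It does not describe any real engine; the function names and the dict-shaped rows are hypothetical. It compares a check keyed on row identities with a check that logs the predicate itself and tests main’s new rows against it.

```python
from typing import Callable, Dict, Set

Row = Dict[str, float]


def row_level_conflict(read_ids: Set[int], main_changed_ids: Set[int]) -> bool:
    """Conflict only if main touched a row the branch actually read."""
    return bool(read_ids & main_changed_ids)


def predicate_conflict(predicate: Callable[[Row], bool],
                       main_new_rows: Dict[int, Row]) -> bool:
    """Conflict if any row written on main satisfies the logged predicate."""
    return any(predicate(row) for row in main_new_rows.values())


def over_100(row: Row) -> bool:
    return row["total"] > 100


# Branch snapshot at fork time (three orders stand in for the real table).
branch_orders = {1: {"total": 120.0}, 2: {"total": 80.0}, 3: {"total": 300.0}}
alice_read_ids = {oid for oid, row in branch_orders.items() if over_100(row)}

# Main accepts a new order after the fork; id 4 did not exist when Alice asked.
main_new_rows = {4: {"total": 150.0}}

missed = row_level_conflict(alice_read_ids, set(main_new_rows))
caught = predicate_conflict(over_100, main_new_rows)
```

The row-level check passes because id 4 is in nobody’s read set. The predicate check fails the promote, which is the correct outcome. A fuller version would also test the before-images of rows that main updated or deleted, since those can move rows out of the predicate.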

The fourth thing that goes wrong is the rules behind the rules. Most database changes are simple: set this value to that one. But databases also let you set up automatic behavior — when a new order comes in, add a row to the loyalty-points table; when a customer’s status becomes VIP, send the welcome email. These are triggers — automatic actions defined on tables that fire a trigger function when the table event occurs — and they’re code, not data. Their definitions live in catalogs like Postgres’s pg_trigger, so the rule itself isn’t hidden. What the transaction log doesn’t preserve is the causal attribution: a row got inserted into the loyalty table, but the log records the write, not this version of this trigger fired because of this input.

It’s the same problem as a smart-home log that says porch light turned on at 11:23 PM. The log records the event but not the rule that fired it. Was it the door sensor, because someone came home? Was it manual? Was it the rule firing because the door sensor’s battery died and the system fell back to a timer? At promote time, you have to reason about whether the rule that produced a triggered change on Alice’s branch would still fire under main’s current data, and main’s current version of the rule, which might be different from the version that fired on Alice’s branch. That gets philosophically tangled fast, and most systems duck it.

What we know how to handle

The four failure modes are where long-lived branches hurt. They are not the whole promote workload. Some branch work already has a clean shape: a row address moved, a bounded resource was reserved, a collection update commutes. Those parts map onto ideas that exist, work, and have credible engineering paths. They are not equally mature. Escrow and CRDTs have long production histories; compaction maps are newer as an explicit conflict-resolution artifact, but they express an old storage-engine habit: preserve enough context to translate a change after the representation moves. Each one preserves intent or context rather than only the final answer, at a different layer of the system.

For the cleanup problem, there is something called a compaction map. When a database physically rewrites a file, the map records a translation table: the row you used to know as (file_X, position_17) now lives at (file_Y, position_42). It is the post office’s forwarding-address system, applied to rows. The same shape generalizes to schema renames (the column customer_email becomes email), internal surrogate-key rewrites where identity is otherwise preserved, and partition reorganizations. In each case, the claim is narrow: the same logical thing has moved into a different representation, and the map translates between them. Chris Douglas’s prototype for Apache Iceberg, which I walked through last week, is the clean example. A delete prepared against a pre-compaction layout can be translated through the map and retried against the post-compaction layout, with no human deciding whether the row moved or the meaning changed.
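A minimal sketch of the forwarding-address idea, in Python. It is not the Iceberg prototype’s code or format; the Address tuple, the rewritten_files set, and both function names are assumptions made for illustration. A delete prepared against old addresses is pushed through the map, and anything the map cannot account for is escalated rather than guessed at.

```python
from typing import Dict, List, Optional, Set, Tuple

Address = Tuple[str, int]  # (file name, position in file)


def translate(addr: Address,
              cmap: Dict[Address, Address],
              rewritten_files: Set[str]) -> Optional[Address]:
    """Map a pre-compaction address to its post-compaction address.

    Returns None when the file was rewritten but the row has no entry,
    which this sketch treats as "cannot translate, escalate".
    """
    if addr in cmap:
        return cmap[addr]
    if addr[0] in rewritten_files:
        return None
    return addr  # file was not compacted; the address is unchanged


def retry_deletes(deletes: List[Address],
                  cmap: Dict[Address, Address],
                  rewritten_files: Set[str]
                  ) -> Tuple[List[Address], List[Address]]:
    """Split deletes into (translated, escalated)."""
    translated: List[Address] = []
    escalated: List[Address] = []
    for addr in deletes:
        new_addr = translate(addr, cmap, rewritten_files)
        if new_addr is None:
            escalated.append(addr)
        else:
            translated.append(new_addr)
    return translated, escalated


compaction_map = {("file_X", 17): ("file_Y", 42)}
deletes = [("file_X", 17), ("file_Z", 3), ("file_X", 99)]
translated, escalated = retry_deletes(deletes, compaction_map, {"file_X"})
```

The first delete follows the forwarding address, the second is untouched because its file was never compacted, and the third is escalated because the map has no record of where that row went.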

For shared counters and budgets, there is something called escrow. Picture three branches all wanting to subtract from a shared inventory of a thousand units. The naive approach makes them coordinate every transaction: can I take ten? Did anyone just take a hundred? That serializes everything and kills the value of branching. Escrow flips the model. Before any branch starts, you pre-allocate budgets — Branch A gets to subtract up to three hundred, Branch B gets three hundred, Branch C gets three hundred. Each branch operates independently inside its budget. At promote, you sum the actual subtractions. As long as nobody exceeded their budget, no coordination was needed and no conflict has to be resolved.

The everyday version is three siblings sharing a weekly grocery budget. Instead of texting each other before every trip to the store (“did you already buy stuff this week?”), each sibling gets a hundred dollars at the start of the week. They shop independently. At the end of the week, the actual spending reconciles against the three-hundred-dollar total. No coordination during the week, no overspend at the end. That is escrow.
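The inventory example can be sketched in a few lines of Python. This is an illustration of the budget idea only, not O’Neil’s full method; the EscrowBranch class and the allocate and promote functions are hypothetical names.

```python
from typing import Dict


class EscrowBranch:
    """A branch holding a pre-allocated budget it may spend without coordinating."""

    def __init__(self, name: str, budget: int) -> None:
        self.name = name
        self.budget = budget
        self.spent = 0

    def withdraw(self, amount: int) -> bool:
        """Local decision: succeed only while the branch stays inside its budget."""
        if amount < 0 or self.spent + amount > self.budget:
            return False
        self.spent += amount
        return True


def allocate(total: int, budgets: Dict[str, int]) -> Dict[str, EscrowBranch]:
    """Hand out budgets up front; they must fit inside the shared total."""
    if sum(budgets.values()) > total:
        raise ValueError("budgets exceed the shared total")
    return {name: EscrowBranch(name, b) for name, b in budgets.items()}


def promote(total: int, branches: Dict[str, EscrowBranch]) -> int:
    """At promote, sum the actual subtractions; no conflict check is needed."""
    return total - sum(b.spent for b in branches.values())


branches = allocate(1000, {"A": 300, "B": 300, "C": 300})
ok_a = branches["A"].withdraw(250)
refused_a = branches["A"].withdraw(100)  # would take A to 350, over its 300
ok_b = branches["B"].withdraw(300)
ok_c = branches["C"].withdraw(10)
remaining = promote(1000, branches)
```

The only coordination happens in allocate. Every withdraw is a local decision, and the refusal is also local: branch A learns it is out of budget without asking anyone.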

Escrow descends from IBM’s IMS Fast Path field calls in the late 1970s and was formalized by Patrick O’Neil in 1986. Account balances, inventory, vote counts, rate limits, capacity reservations — anything numeric with a known bound — fits this shape.

For state that combines naturally, there are conflict-free replicated data types, CRDTs for short. The state-based version designs the merge rule so that order and duplication do not matter; the operation-based version designs concurrent operations to commute and relies on the delivery layer, or deduplication, to avoid replaying the same operation twice. A counter that only goes up: Alice’s branch adds five; Bob’s branch adds three; merging gives eight regardless of which one applies first. A set of tags: Alice adds urgent; Bob adds draft; the merged tag set is {urgent, draft} by union. Marc Shapiro and his collaborators formalized the family in 2011, but versions of the ideas had been showing up in distributed systems and collaborative-editing tools for years before that.
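Both examples fit in a short Python sketch of the state-based flavor. The representation is one common textbook choice (a grow-only counter stored as per-replica totals, merged by taking the maximum of each entry), not a claim about any particular library; the function names are made up.

```python
from typing import Dict

Counter = Dict[str, int]  # replica name -> how much that replica has added


def merge_counter(a: Counter, b: Counter) -> Counter:
    """State-based merge: per-replica maximum, so order and repeats do not matter."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def value(counter: Counter) -> int:
    return sum(counter.values())


alice = {"alice": 5}  # Alice's branch added five
bob = {"bob": 3}      # Bob's branch added three

ab = merge_counter(alice, bob)
ba = merge_counter(bob, alice)
again = merge_counter(ab, alice)  # re-delivering Alice's state changes nothing

# A grow-only set of tags merges by plain union.
merged_tags = {"urgent"} | {"draft"}
```

Naively adding the two totals would double-count if the same state were delivered twice. Taking the per-replica maximum is what makes duplication harmless.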

The three primitives map onto three layers of the merge problem. The compaction map handles the vocabulary problem: the same thing has a different name on the branch and on main, and the map translates between names. Escrow handles the budget problem: many parties draw from one resource, and budgets prevent overdrafts without coordination. CRDTs handle the combination problem: many parties add to a shared collection, and the combine rule always agrees with itself. Peter Bailis’s I-confluence paper from 2014 is the theoretical lens that ties them together: a set of operations is I-confluent with respect to an invariant if any two states reachable from a common ancestor can be merged without violating the invariant. Bailis and his coauthors ran the math against TPC-C and found that ten of twelve invariants in that workload were I-confluent; a coordination-free execution strategy delivered a 25-fold throughput improvement against serializable execution on a 200-server cluster. The qualification matters: TPC-C is a specific benchmark, not the universe of application invariants. The result is encouraging for the common OLTP shapes; it is not a universal guarantee about arbitrary application logic.

Better still, these three primitives compose into a product shape. A promote walks them cheapest first: translate through the compaction map; apply escrow and CRDT operations directly; use read-set checks for what remains; escalate the residue. Each layer’s false positives become the next layer’s input, never a corruption. The user experience is not “merge failed, retry the whole job.” It is “we applied the safe parts, proved these parts irrelevant, and need judgment on these three facts.” That is the database version of the cost of doubt: spend human attention only where the system cannot honestly decide.
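One way to picture the cheapest-first walk is a chain of layers where each layer either resolves a change or declines and lets it fall through. The Python sketch below is a shape, not a design: the change kinds, the main_touched flag, and every function name are hypothetical stand-ins for the real checks.

```python
from typing import Callable, Dict, List, Optional

Change = Dict[str, object]
Layer = Callable[[Change], Optional[str]]  # a verdict, or None for "not mine"


def via_compaction_map(change: Change) -> Optional[str]:
    return "translated" if change["kind"] == "moved_row" else None


def via_escrow_or_crdt(change: Change) -> Optional[str]:
    return "applied" if change["kind"] in {"escrow_debit", "crdt_add"} else None


def via_read_set(change: Change) -> Optional[str]:
    if change["kind"] != "read":
        return None
    # A read that main never touched is provably still valid.
    return None if change["main_touched"] else "proved irrelevant"


LAYERS: List[Layer] = [via_compaction_map, via_escrow_or_crdt, via_read_set]


def promote(changes: List[Change]) -> Dict[str, str]:
    report: Dict[str, str] = {}
    for change in changes:
        for layer in LAYERS:
            verdict = layer(change)
            if verdict is not None:
                report[str(change["id"])] = verdict
                break
        else:
            # No layer could decide: escalate instead of guessing.
            report[str(change["id"])] = "needs judgment"
    return report


report = promote([
    {"id": "c1", "kind": "moved_row"},
    {"id": "c2", "kind": "escrow_debit"},
    {"id": "c3", "kind": "read", "main_touched": False},
    {"id": "c4", "kind": "read", "main_touched": True},
    {"id": "c5", "kind": "trigger_side_effect"},
])
```

The report is the user experience: three changes handled, two handed to a human with a reason attached.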

Here is my current read of the state of the art, mapped against the four failure modes, with the disclaimer that the labels are mine and I would not bet my career on any one of them. Failure mode #2 (compaction conflicts) is the only one I would call shipping, on the strength of the Iceberg prototype walked through last week. Failure mode #4 (hidden trigger rules) has a pragmatic workaround (treat the recorded writes as the intent), and the semantics question of what that means for triggers with external side effects is open enough to get its own section below. Failure modes #1 (row-level read-set staleness at week scale) and #3 (predicate and aggregate results going stale) are Open in the most honest sense, where Open partly means “I have not yet found the paper that does it.” One purpose of this post is to find out whether that paper exists, and the second-most-honest reading of #1 and #3 is that they might not be two problems at all; more on that below.

Escrow and CRDTs are not on this list because they are not section-1 failure modes. They handle adjacent workload shapes (bounded numeric resources, mergeable collections) that any promote engine has to support but that do not break on their own. Of the four shapes that do break, exactly one ships today.

What we don’t know how to handle yet

The third problem — some questions don’t point at a specific thing — doesn’t fit any of the three primitives above. The compaction map knows where rows went, not what a query means; an inserted row matching a branch’s predicate isn’t a representational change to a row the branch read, so the map has no entry for it. Escrow needs a known budget, and how many orders are over a hundred dollars isn’t a budget. CRDTs need an operation whose result doesn’t depend on what else is in the collection, and count over a condition is exactly the kind of operation whose result depends on what else is in the collection. There is a primitive for predicate-result maintenance — the next paragraph gets to it — but no version of it has shipped as a packaged branch-promote feature. That is the open product problem, and it is the part of the design that everything else hinges on.

The most promising lead is an idea called incremental view maintenance, and especially Frank McSherry’s differential dataflow line of work from the early 2010s. The mental model is a query that keeps running. When the underlying data changes, the answer updates incrementally, the way a spreadsheet recalculates cells when an input changes. When something is inserted on main that matches a branch’s question, the dataflow tells the branch the answer has changed. That delta is the conflict signal. You don’t have to re-ask the question; the answer already arrived. Aggregates work the same way: count and sum are operators in the dataflow graph, and they emit deltas when their inputs change. The predicate-count case falls out of this machinery. The hidden-rules case does not. Trigger code with external side effects is still code with external side effects.
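The “query that keeps running” idea, reduced to a single filter-and-count in Python. This is a toy and not differential dataflow or DBSP, which handle joins, nested operators, and timestamps; the MaintainedCount class and its method names are invented for illustration. It only shows the delta doubling as a conflict signal.

```python
from typing import Callable, Dict, Iterable

Row = Dict[str, float]


class MaintainedCount:
    """Keeps count(*) WHERE predicate up to date by consuming row changes."""

    def __init__(self, predicate: Callable[[Row], bool],
                 rows: Iterable[Row]) -> None:
        self.predicate = predicate
        self.count = sum(1 for row in rows if predicate(row))

    def on_insert(self, row: Row) -> int:
        """Apply an insert from main; return the change in the answer."""
        delta = 1 if self.predicate(row) else 0
        self.count += delta
        return delta

    def on_delete(self, row: Row) -> int:
        """Apply a delete from main; return the change in the answer."""
        delta = -1 if self.predicate(row) else 0
        self.count += delta
        return delta


# Fork-time table: 47 orders over a hundred dollars, 10 under.
orders = [{"total": 200.0}] * 47 + [{"total": 50.0}] * 10
view = MaintainedCount(lambda row: row["total"] > 100, orders)
answer_at_fork = view.count

delta_big = view.on_insert({"total": 150.0})   # matches the predicate
delta_small = view.on_insert({"total": 20.0})  # does not match
stale = view.count != answer_at_fork
```

A nonzero delta is the signal: the branch’s answer changed, and nobody had to re-ask the question. A zero delta is just as useful, because it proves a write on main was irrelevant to this read.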

The catch is that this is not free. Materialize, the company built around differential dataflow, spent the better part of a decade making it production-grade: partial-order timestamps that handle out-of-order updates, shared arrangements that let many queries reuse the same indexed state, a compiler from SQL down to dataflow graphs. DBSP is a newer formalization in the same direction, with a cleaner incremental-view-maintenance model. There is a fit problem to acknowledge first. Differential dataflow and DBSP maintain answers for queries that are registered as continuous; a branch’s queries are one-shot — Alice runs a count, gets forty-seven, and walks away. Ad-hoc queries can run through an ephemeral dataflow, but they do not become maintained conflict detectors unless the system promotes them into indexed or materialized views. That gap is the engineering work, not the dataflow engine itself. A branching database that wants this capability has two choices: build on an existing dataflow or DBSP implementation and pay the registration cost honestly, or scope predicate-result conflicts out of the first version and escalate them to a human with the right context. The second is shippable. The first is the next major version.

This is the fork in the road I cannot see past. If you have shipped incremental view maintenance at production scale and have thoughts on the registration cost when most queries fire once — whether it amortizes, whether DBSP’s flat-map formulation changes the calculus, whether there is a way to get the conflict-detection benefit without paying full materialization — I want to argue about it.

Step back to meld for a moment, because the IVM problem is also a meld problem in disguise. Meld resolves the merge problem at transaction granularity: two transactions produce intentions (deltas against a shared multiversion search tree), and meld decides whether both can be applied, or whether their preconditions conflict. What the algorithm gives you, when you squint, is a precondition language for did this change still make sense against the current state. The translation layer in particular has to handle schema changes that happened on main while the branch was running, which is what the compaction map is for. What I have not yet seen worked out anywhere is what meld’s precondition language looks like when the precondition is a query result over arbitrary predicates — which is the version Alice needs. The honest disclaimer is that Hyder never shipped, and neither did Microsoft’s follow-ups in that lineage; meld is load-bearing prior art for the concept, not evidence the architecture is viable as a product.

The closest shipping system to a SQL database with branch merge is Dolt, which in July 2025 added a tree-aware merge path on top of its prolly-tree storage. The trick is structural: walk both branches against their common ancestor, generate patches at the tree-node level, and apply unchanged nodes by reference when key ranges don’t overlap — reducing work from proportional-to-changed-rows to proportional-to-affected-nodes, with reported speedups around 1000× on a five-million-row workload. What Dolt shipped is a throughput primitive for merging long-lived branches with many disjoint changes; it falls back to row-by-row when tables have secondary indexes, check constraints, or schema changes on either branch, and the announcement does not address any of the four semantic failure modes above — read-set staleness, compaction conflicts, predicate results, or trigger semantics. That is not a criticism; it is the right separation. Dolt solved the question can we afford to merge a branch this big, which the four-failure-mode taxonomy quietly assumed away. Whether prolly-tree merge and meld converge on the same primitive at scale, or are distinct shapes for the same problem, is a question this post does not settle.

What I am still figuring out

The duration problem. The shape that looks most promising for week-scale read-set tracking is bounded-memory probabilistic summaries: small, fixed-size structures that answer “did this row come up in anything we did?” with a configurable false-positive rate and zero false negatives. A branch that read ten million rows at a 0.1% false-positive rate fits in roughly eighteen megabytes of Bloom filter: small enough to keep in memory for a month, with a tolerable rate of “let’s double-check this one.” That handles only the easy version of the question, row-id read-set tracking. Phantoms, predicates, and aggregates still need predicate logging or incremental view maintenance, and the duration angle for those is the part of the design that, if it cracks, differentiates a branching database from a faster way to clone production.
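The eighteen-megabyte figure can be checked with the standard Bloom filter sizing formulas: m = -n ln(p) / (ln 2)^2 bits for n items at false-positive rate p, and k = (m / n) ln 2 hash functions. A short Python calculation, with hypothetical function names:

```python
import math


def bloom_bits(n_items: int, false_positive_rate: float) -> int:
    """Optimal number of bits for n_items at the given false-positive rate."""
    return math.ceil(-n_items * math.log(false_positive_rate) / (math.log(2) ** 2))


def bloom_hashes(n_items: int, n_bits: int) -> int:
    """Optimal number of hash functions for that size."""
    return max(1, round(n_bits / n_items * math.log(2)))


rows_read = 10_000_000
bits = bloom_bits(rows_read, 0.001)
megabytes = bits / 8 / 1_000_000
hashes = bloom_hashes(rows_read, bits)
bits_per_row = bits / rows_read
```

That works out to a little over fourteen bits per row read, about 18 MB in total, with ten hash functions. The size depends only on the number of rows and the target error rate, not on how long the branch stays open.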

The cross-mode question. Failure mode #3 (predicate results going stale) is arguably failure mode #1 (row-level reads going stale) extended from tuples to rules. The boundary between them is whether the read identifies a specific row or matches a condition, and the closer I look the more arbitrary that boundary feels. If they are really the same problem, the primitive that solves one solves both, and the two Open shapes in the landscape above collapse into one. If they are genuinely different, the primitive that solves #1 will not generalize, and I will have spent a year on the wrong abstraction. I am not yet sure which.

Compaction-map composition. Even at depth one, the combine rule for two compaction maps has to be associative and commutative or order matters and the property that made the primitive useful is gone. Collaborative data systems have spent years turning this class of problem into data structures that converge by construction. A promote engine does not get that for free; the practical answer in the meantime is to allow only one promote at a time, which kills the architectural elegance but ships. The harder version is recursive. When merged branches spawn sub-branches that also compact, you need to combine maps of maps. I suspect compaction maps do not actually compose under recursive promote — once the depth gets past two, the map-of-maps has properties I have not seen anyone write down. I might be wrong about this. If I am, and you can point me at the paper, I will buy you a drink.
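A small Python illustration of why the combine rule matters even at depth one. Both combine functions are hypothetical, and the second is only one possible rule; I am not claiming it is associative in general or that it handles the recursive case. The point is narrower: a naive last-writer-wins merge of two maps gives a different answer depending on which branch promotes first, while a rule that refuses to pick a winner gives the same answer either way.

```python
from typing import Dict, Set, Tuple

Address = Tuple[str, int]
CompactionMap = Dict[Address, Address]


def naive_combine(a: CompactionMap, b: CompactionMap) -> CompactionMap:
    """Last writer wins: entries from b silently overwrite entries from a."""
    return {**a, **b}


def checked_combine(a: CompactionMap, b: CompactionMap
                    ) -> Tuple[CompactionMap, Set[Address]]:
    """Keep entries the maps agree on (or that only one has); report the rest."""
    conflicts = {k for k in a.keys() & b.keys() if a[k] != b[k]}
    merged = {k: v for k, v in {**a, **b}.items() if k not in conflicts}
    return merged, conflicts


# Alice and Bob each compacted their own copy and moved row (file_X, 17)
# to a different place.
alice_map = {("file_X", 17): ("file_A", 1), ("file_X", 18): ("file_A", 2)}
bob_map = {("file_X", 17): ("file_B", 9), ("file_W", 3): ("file_B", 10)}

order_matters = naive_combine(alice_map, bob_map) != naive_combine(bob_map, alice_map)
same_either_way = checked_combine(alice_map, bob_map) == checked_combine(bob_map, alice_map)
merged, conflicts = checked_combine(alice_map, bob_map)
```

The checked version buys order-independence on this example by producing a residue, here the one address both branches moved, that still has to be resolved somewhere else.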

The scope question. One answer to the hidden rules problem — triggers and the functions they invoke — is to say branches don’t support them; run them on main only. That trades testing parity for promote safety. The cleaner answer, and the one Ardent has been heading toward, is to enable triggers on the branch (because that is the user’s actual workspace) but treat the trigger’s recorded row writes as the intent at promote time, rather than re-running the trigger function against main. The WAL-as-intent assumption survives if you promote what the trigger did, not the trigger code itself.
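The difference between promoting what the trigger did and re-running the trigger can be shown with a toy database in Python. Everything here is a stand-in: the Db class, its in-memory log, and the hard-coded loyalty trigger are assumptions for illustration, not Ardent’s design or Postgres’s WAL format.

```python
from typing import Dict, List, Tuple

LogEntry = Tuple[str, Dict[str, int]]  # (table name, row written)


class Db:
    """Toy database with one trigger: each order insert adds a loyalty row."""

    def __init__(self) -> None:
        self.tables: Dict[str, List[Dict[str, int]]] = {"orders": [], "loyalty": []}
        self.log: List[LogEntry] = []

    def _write(self, table: str, row: Dict[str, int]) -> None:
        self.tables[table].append(row)
        self.log.append((table, row))  # the log records writes, not causes

    def insert_order(self, order: Dict[str, int]) -> None:
        """User-facing insert; the loyalty trigger fires as part of it."""
        self._write("orders", order)
        self._write("loyalty", {"customer": order["customer"],
                                "points": order["total"]})

    def apply_recorded(self, entries: List[LogEntry]) -> None:
        """Promote path: apply logged writes as-is, without firing triggers."""
        for table, row in entries:
            self._write(table, row)


branch = Db()
branch.insert_order({"customer": 12345, "total": 150})

# Writes-as-intent: main receives exactly what the branch recorded.
main = Db()
main.apply_recorded(branch.log)

# Naive replay: re-run the order insert (firing the trigger again) and also
# apply the recorded loyalty write.
naive = Db()
for table, row in branch.log:
    if table == "orders":
        naive.insert_order(row)
    else:
        naive.apply_recorded([(table, row)])
```

The naive path awards the loyalty points twice. The recorded-writes path awards them once, at the price of never asking whether main’s current version of the trigger would have done the same thing.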

The side-effect trigger. WAL-as-intent works for triggers whose effect is a write the WAL records. It does not work for triggers that send a Slack message, charge a card, or call an external API. The honest answer for the first version is to scope external-effect triggers out of branches, which buys promote safety at the cost of testing parity. I do not yet know whether that trade is right, or whether there is a third option (replay the side effect from the branch with a sentinel that lets main ignore it on promote, say) that does not turn into a footgun.

If you have built or thought hard about any of these — week-scale read-set tracking, recursive compaction-map composition, trigger semantics that survive branching, the predicate-vs-row boundary, or the IVM registration cost when most queries fire once — I want to argue about it. The conversation I keep wanting to have is with someone who has held one of these problems in their hands and either solved it, decided it was unsolvable, or punted on it for reasons they can defend. I am building toward this at Ardent, and the kind of person I most want to hear from is the kind of person who finished this post irritated by something specific. If that’s you: cal.com/evanvolgas.

Thanks to Chris Douglas for pointing me toward the right historical sources and helping me understand the academic lineage behind this piece. Any mistakes are mine.

I am involved with Ardent, which is exploring this design space. The opinions here are my own.