The Escalation Interface

At 14:14 EDT on August 14, 2003, the alarm and event processor in FirstEnergy’s Akron control center stopped processing alarms. The XA21 Energy Management System, built by General Electric, had hit a race condition that caused the alarm subsystem to hang in a way that did not propagate to the operator console. (GE’s own post-incident code audit, not the Task Force report, identified the race condition.) No alarms were declared. The display was calm. The U.S.–Canada Power System Outage Task Force, reporting eight months later, would spend most of a chapter on what happened in the next two hours.

What happened was that the grid started to fail and the operators did not know.

At 15:05, the Harding–Chamberlin 345-kilovolt line sagged into a tree and tripped. At 15:32, Hanna–Juniper did the same. At 15:41, Star–South Canton. None of these losses produced an alarm on the FirstEnergy operators’ consoles, because the alarm processor was hung and the event logger had stopped writing. American Electric Power and the Midwest ISO called Akron to report line problems they were seeing on their side. The FirstEnergy operators looked at their consoles, saw nothing wrong, and said so. At 16:05, the Sammis–Star line tripped when its relay misread a heavily loaded, voltage-depressed line as a fault, and the cascade went regional. By 16:13, fifty million people across eight U.S. states and Ontario were without power. The Task Force cited loss estimates of four to ten billion dollars in the United States. Later epidemiological work attributed dozens of excess deaths to the outage; a 2012 study of New York City alone estimated about ninety, most of them from disease rather than heat or trauma.

The escalation interface failed. That is the part I want to write about. Not the tree, not the line, not the cascade. The layer that was supposed to tell the operators “the system you are operating is no longer the system you think you are operating” had been silent for almost two hours, and the silence was indistinguishable from calm.

What an escalation interface is for

An escalation interface is the layer of a system that tells a competent receiver what the system could not honestly decide for itself. Every system that automates something has one, implicitly or explicitly. In 1985, patch had one: a hunk it could not place was written to a .rej file in the working directory. Terraform has a plan step that emits a structured artifact for a human eyeball before any byte moves. Postgres, when it cannot serialize a transaction, raises a structured error naming the conflict. PagerDuty is one. Three Mile Island’s control room was one. The XA21 alarm subsystem was one. Each is a contract about how a system tells the truth at the moment it has run out of the ability to do its own job.

Across the posts in this series, the recurring claim has been that each primitive that becomes infrastructure is something the operator used to do. Idempotency was a header that absorbed what the retry-attentive operator used to handle. The plan/apply split externalized the operator’s pre-commit eyeball. Structured errors externalized the operator’s parser. Each of those is, in its own way, an escalation interface: a structured signal from the system to the operator about a state the system cannot honestly resolve alone. The interface gets richer over decades as more of the operator’s judgment gets pulled into the wire.

The XA21 failure is what happens to a system whose escalation interface has, itself, become an implicit operator. Nobody was watching whether the alarm processor was alive. The alarm processor was supposed to be the thing that told you whether anything was wrong, and the question “is the thing that tells me whether anything is wrong itself wrong?” was outside its scope.

The anatomy of an honest signal

Read closely, the artifacts in the list above share a shape. patch’s .rej file is the cleanest specimen, because Larry Wall designed the workflow around four properties that have to hold at once or the contract breaks.

The first is that the system applies what it can before it escalates. patch puts every hunk it can honestly place into the file. The receiver is not asked to redo the work the tool could do; the receiver is asked to decide the part the tool could not. The XA21 system did not do this. When the alarm processor hung, it did not partially deliver; it stopped entirely. The operators did not see “here is what is fine and here is what I am not sure about.” They saw “everything is fine,” which was a lie composed entirely of silences.

The second is that the residue is named precisely. A .rej file contains the exact hunk that could not be placed, with its original context. The receiver does not have to reconstruct what the tool was trying to do. Postgres’s serialization error names the kind of conflict through a dedicated error code (SQLSTATE 40001, serialization_failure). Terraform’s plan names every resource it will create, modify, or destroy, with the existing and proposed values side by side. The XA21 failure named nothing: the operators received no signal from their own system, and the signals they did receive, phone calls from neighboring utilities, did not name what was wrong in terms their consoles could confirm.

The third is that the signal lands where context lives. A .rej file appears in the working directory, next to the source the receiver is already editing. A Postgres error returns on the connection that issued the conflicting statement. A page goes to the on-call engineer whose service is implicated. When XA21 was working, alarms landed on the operator console of the team responsible for the relevant region. When the processor hung, the meta-signal (the thing you rely on is not working) landed nowhere. IT staff at FirstEnergy eventually noticed the processor was down, but the chain that would have told the control-room operators in time did not exist.

The fourth is that the receiver’s resolution lands back as a normal write. When a human resolves a .rej file, they edit the source. The edit is indistinguishable from any other edit; it shows up in git diff and gets committed normally. The resolution is a fact in the audit trail, not a side-channel acknowledgment. The XA21 case never reached this property because it never reached the first three; the conversation between AEP and FirstEnergy was a phone call, and there is no record on the FirstEnergy control system of the moment the operators were told something was wrong and disbelieved it.
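A minimal sketch of the four properties held together, in Python. Nothing here is patch's actual implementation: Escalation, apply_hunks, resolve, the target.rej filename, and the shared log are hypothetical names chosen for illustration, and whether a hunk can be placed is passed in as a flag rather than decided by matching context lines.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Escalation:
    """Hypothetical record of one escalation; all field names are illustrative."""
    applied: list[str]   # property 1: work the tool completed on its own
    residue: list[str]   # property 2: exactly what it could not decide
    destination: str     # property 3: where the receiver is already working


def apply_hunks(hunks: dict[str, bool], workdir: str, log: list[str]) -> Escalation:
    """Apply every hunk that can be placed, and name the ones that cannot.

    `hunks` maps a hunk label to whether it can be placed. A real tool would
    decide that itself; here it is an input so the sketch stays small.
    """
    applied: list[str] = []
    residue: list[str] = []
    for name, placeable in hunks.items():
        if placeable:
            log.append(f"edit: {name}")  # applied work is an ordinary write
            applied.append(name)
        else:
            residue.append(name)         # kept verbatim, not summarized
    return Escalation(applied, residue, destination=f"{workdir}/target.rej")


def resolve(escalation: Escalation, hunk: str, log: list[str]) -> None:
    """Property 4: the receiver's decision lands as the same kind of write."""
    if hunk not in escalation.residue:
        raise ValueError(f"{hunk!r} is not an open item")
    escalation.residue.remove(hunk)
    log.append(f"edit: {hunk}")          # indistinguishable from a tool edit
```

Calling apply_hunks with two placeable hunks and one that is not leaves two entries in the log and one named item in the residue; calling resolve on that item appends a third log entry of the same shape. The design choice worth noticing is that there is no separate acknowledgment channel: the log is the only place a decision can land.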

The four properties are not a checklist. They are the anatomy of an honest signal at the moment a system runs out of the ability to do its own job. A system that has them tells the truth. A system that has lost any of them tells a partial truth, which is usually indistinguishable from a lie.

The four properties assume a precondition they do not name: that the signal fires at all. XA21 failed there first, before any of the four could apply, which is why the blackout is the limiting case of an escalation interface and not a clean specimen of one. The four describe how an escalation tells the truth once it happens; XA21 is what happens when the layer that would tell the truth is the one thing nobody is watching.
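One common way to make that precondition explicit is a dead-man's switch: the monitor must keep saying "I am alive," and the console treats prolonged silence as its own state rather than as calm. The sketch below is an assumption-laden illustration, not a description of how XA21 or any energy management system works; the Heartbeat class, its method names, and the status strings are all hypothetical.

```python
import time
from typing import Callable, Optional


class Heartbeat:
    """Dead-man's switch: silence past `timeout_s` reads as UNKNOWN, never OK."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.clock = clock                       # injectable so it can be tested
        self.last_beat: Optional[float] = None   # no beat has been seen yet
        self.active_alarms = 0

    def beat(self, active_alarms: int) -> None:
        """Called by the alarm processor every cycle, even with zero alarms."""
        self.last_beat = self.clock()
        self.active_alarms = active_alarms

    def status(self) -> str:
        """Called by the console, which must not depend on the processor."""
        silent = (self.last_beat is None
                  or self.clock() - self.last_beat > self.timeout_s)
        if silent:
            return "UNKNOWN: alarm processor silent"
        return "ALARM" if self.active_alarms else "OK"
```

The point of the third state is that "OK" becomes a claim the processor has to keep renewing. A hung processor stops calling beat, and the console drifts to UNKNOWN on its own clock instead of continuing to display the last calm reading.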

What modern systems lose

The list of systems that have lost one or two of the four is long. Terraform’s plan has the first three and loses the fourth: the apply log records what changed, but not the human’s decision to approve in a structured form. CI/CD pipelines often lose the first by rolling the whole job back on a single-step failure; what could have been applied isn’t, and the receiver sees the job’s failure rather than the residue. Postgres’s serialization error has the second and third but lacks the first: the conflicting transaction was rolled back, not partially applied. Most webhook delivery systems lose the third: a 503 from an endpoint lands in a queue the integrator did not know they had to monitor.

Agent frameworks have, in the current shape of the work, lost all four. When an agent’s tool call fails (a 409, a 504, a malformed response), the agent typically writes a sentence into the chat: “I encountered an error. Let me try a different approach.” Nothing is applied that could be applied. The residue is not named. The signal lands in the chat, which is the operator’s eyeball but not their working surface. The resolution, whatever it is, lands as another agent action rather than as a structured fact about what the human decided. The interface is, by the criteria above, a sequence of four small lies, repeated.

The reason this matters is not that the failures are dramatic. They mostly aren’t. The reason it matters is the rate. “The cost of doubt” argued that verification is paid from a finite human attention budget. An escalation interface that costs more per receipt than the system saves by automating the rest is the same mechanism in a different dress. At agent throughput, the rate of escalations is the system’s throughput. A four-lies-per-failure interface bankrupts the receiver’s attention well before the system’s compute budget gets close to the question.

What the brief looks like, for Alice

The trilogy of database posts that preceded this one (branching, promote, the week-long transaction) ended with a specific failure mode no existing system handles. Alice forks production, runs a hundred queries over four weeks, makes decisions, comes back to promote. Among those decisions is a campaign budget set against SELECT count(*) FROM orders WHERE total > 100, which returned forty-seven on day three. Production has since accepted matching orders Alice never saw. The count is now stale by an amount the system can compute.

A promote that has lost the four properties tells Alice “merge failed.” A promote that holds them tells her: “I applied the parts of your branch the compaction map could rebase. I committed the writes whose intents survive the new base. The one decision I cannot make is your day-three count: it returned forty-seven, main has since inserted four matching rows on days twelve, fourteen, nineteen, and twenty-one, and your campaign-budget value depends on the count. Here are the four rows, here is the budget, decide whether the budget changes.” That is a brief, in the legal sense. It carries the auto-resolved work in a form the receiver does not have to redo, names what could not be decided in terms the receiver can decide, lands in the workspace the receiver was already in, and records her decision as the next row in the audit trail.
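The undecidable part of that brief can be sketched as a small record plus a renderer. This is a hedged illustration, not the promote mechanism itself: StaleRead, its fields, and render_brief are hypothetical names, and the recomputed count assumes the only change on main was the four inserts, with no matching rows deleted or updated out of the predicate.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StaleRead:
    """Hypothetical record of a branch read that main has since invalidated."""
    query: str
    read_on_day: int
    observed: int                 # what the branch saw when it ran the query
    inserted_on_days: list[int]   # days on which main gained matching rows
    dependents: list[str]         # decisions that were made against the read

    @property
    def current(self) -> int:
        # Assumes inserts are the only change on main since the read.
        return self.observed + len(self.inserted_on_days)


def render_brief(stale: StaleRead) -> str:
    """Name the one undecided item in terms the receiver can act on."""
    days = ", ".join(str(d) for d in stale.inserted_on_days)
    return (
        f"Undecided: {stale.query}\n"
        f"  returned {stale.observed} on day {stale.read_on_day}; main has "
        f"since inserted {len(stale.inserted_on_days)} matching rows "
        f"(days {days}), so the count would now be {stale.current}.\n"
        f"  depends on it: {', '.join(stale.dependents)}"
    )


alice = StaleRead(
    query="SELECT count(*) FROM orders WHERE total > 100",
    read_on_day=3,
    observed=47,
    inserted_on_days=[12, 14, 19, 21],
    dependents=["campaign budget"],
)
print(render_brief(alice))
```

Under the stated assumption the recomputed count is fifty-one. The record keeps the original query, the observed value, and the dependent decision together, so the receiver is handed the evidence and the question rather than a failure code.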

The four properties are why that brief is shippable and “merge failed” is not. The brief is not better UX; it is the same primitive that turned .rej into a working contract in 1985, applied to a workload that did not exist then.

What I am still figuring out

Whether the four properties survive when the receiver is also an agent. “Lands where context lives” assumes a receiver with context to bring. If the receiver is a downstream agent, what is its context? A second agent will not have read the day-three campaign budget Alice wrote; it will have access to whatever the system thought to surface to it. The properties as I have them are the human-receiver properties. The agent-receiver version may need a fifth property, something like “carries enough provenance for the receiver to acquire context cheaply,” that does not appear at the human boundary because humans bring context with them. I am not yet sure whether the fifth is genuinely new or a consequence of “named precisely” taken seriously.

How singularity survives at volume. The XA21 failure was that nothing arrived; Three Mile Island was the inverse: about a hundred alarms in the first minute, indistinguishable from each other. Agent throughput will produce TMI-shaped problems by default; the first property (apply what you can before escalating) is a partial answer because it shrinks the receiver’s queue, but it is not enough on its own. The receiver needs a way to know which of the escalations on the queue is the load-bearing one. I do not yet know what that primitive looks like. It is probably not ranking; it is probably structural, as priority queues for alarms have always been, but for briefs with semantically heterogeneous content.

The XA21 alarm processor came back online at FirstEnergy after the blackout had already happened. The post-mortem ran for months; GE patched the race condition; the operators were retrained. The grid continued to run on systems whose escalation interfaces could, in principle, still fail silently, and standards bodies wrote new requirements about how to know whether your monitoring system is itself monitored. The list of primitives the operator no longer has to carry, in this domain, got a little longer.

There is no version of this work where the list finishes. Each generation, a little more of what the receiver used to do gets pulled into the wire, and a little more of the receiver’s attention is freed for the next thing that cannot yet be decided automatically. Building an honest escalation interface is, in this sense, the same project as building an honest API contract: it is the work of writing down what the implicit receiver used to be expected to handle, one structured property at a time, each one paid for by an incident whose cost was high enough to justify it.