How I Learned to Stop Worrying and Write the Spec
I answered questions for an hour. Then I said “implement this” and walked away. I came back to a working library: Python SDK, TypeScript SDK, FastAPI collector, TimescaleDB migrations, 63 passing tests.
The spec was extracted, not authored. The architecture emerged from answers, not from upfront design.
The experiment
I wanted to build Hikari, an LLM pipeline cost tracking tool. Instead of writing a spec, I built a tool that would interrogate me until the spec was obvious.
Elenchus is an MCP server with one constraint: it will not generate a specification while unresolved contradictions exist. Claude performs the interrogation: asking questions, extracting premises from my answers, detecting contradictions. Elenchus stores that data and blocks progress until I resolve the conflicts. The intelligence is Claude’s. The constraint is Elenchus’s.
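The gate can be sketched as a small invariant: spec generation refuses to run while any recorded contradiction is open. This is an illustrative sketch only, not Elenchus's actual code; the names (`SpecSession`, `Contradiction`, `generate_spec`) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Premise:
    text: str

@dataclass
class Contradiction:
    a: Premise
    b: Premise
    resolved: bool = False

@dataclass
class SpecSession:
    premises: list = field(default_factory=list)
    contradictions: list = field(default_factory=list)

    def generate_spec(self) -> str:
        """The gate: refuse to emit a spec while any contradiction is open."""
        open_conflicts = [c for c in self.contradictions if not c.resolved]
        if open_conflicts:
            raise RuntimeError(
                f"{len(open_conflicts)} unresolved contradiction(s); "
                "resolve them before generating the spec"
            )
        return "\n".join(p.text for p in self.premises)
```

The intelligence (asking, extracting, detecting) lives outside this constraint; the constraint is just the refusal.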
Eleven questions. Forty-nine premises extracted. Four potential contradictions flagged. One was real:
“Zero-config (one import) vs. needing user to provide pipeline_id and stage attributes.”
I had said I wanted zero-config instrumentation. I had also said I wanted users to set pipeline IDs. They cannot both be true by default. Resolution: pipeline_id defaults to trace_id, stage auto-derives from the span name. Zero-config works. Overrides are optional.
That design decision was hiding in my head. The interrogation surfaced it.
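The resolution reduces to a fallback chain. A minimal sketch, with a hypothetical function name and signature rather than Hikari's actual API:

```python
def resolve_pipeline_attrs(span_name, trace_id, pipeline_id=None, stage=None):
    # Zero-config path: nothing provided, so fall back to values the
    # tracing layer already has. Explicit overrides always win.
    return {
        "pipeline_id": pipeline_id if pipeline_id is not None else trace_id,
        "stage": stage if stage is not None else span_name,
    }
```

With no arguments beyond the span context, instrumentation works out of the box; passing pipeline_id or stage opts into the user-controlled behavior.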
The other three flags (SDK buffer size vs collector buffer size, batch sizes that looked mismatched) turned out to be non-issues. But explaining why they were non-issues made the reasoning explicit in the spec. An implementer reading “SDK buffer: 10,000, Collector buffer: 50,000” might wonder if that is a mistake. Now the spec explains: the SDK buffers per-process; the collector receives from many instances.
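The arithmetic behind that explanation, using the numbers from the spec (the helper is illustrative, not part of any shipped code):

```python
SDK_BUFFER = 10_000        # spans buffered per SDK process
COLLECTOR_BUFFER = 50_000  # spans buffered by the shared collector

def collector_headroom(n_instances: int) -> int:
    # Worst case: every SDK instance flushes a completely full buffer at once.
    return COLLECTOR_BUFFER - n_instances * SDK_BUFFER
```

Five instances flushing full buffers simultaneously exactly fill the collector; fewer leave headroom. The mismatch is by design, not a typo.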
Once contradictions were resolved, Claude generated a 1,061-line engineering specification. Not prose. Function signatures, DDL statements, API schemas, Pydantic models, test expectations. I said “implement this” and left.
The code matches the spec. The buffer sizes I specified (10,000 SDK, 50,000 collector), the retry logic (3 retries, exponential backoff), the SQL injection prevention pattern, the version checking on provider SDKs. These specific decisions from my answers appear in the implementation.
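The retry behavior (3 retries, exponential backoff) might look like the following sketch; the function name and the base delay are assumptions, not Hikari's actual implementation:

```python
import time

def send_with_retry(send, payload, retries=3, base_delay=0.5):
    """Call send(payload); on failure retry up to `retries` more times,
    sleeping base_delay * 2**attempt between attempts (0.5s, 1s, 2s)."""
    for attempt in range(retries + 1):
        try:
            return send(payload)
        except Exception:
            if attempt == retries:
                raise  # retries exhausted; surface the last error
            time.sleep(base_delay * (2 ** attempt))
```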
My contribution after the spec: nine lines changed. Two minor fixes that Claude’s self-review suggested.
Why it worked
Writing specs is hard because you must anticipate questions nobody has asked yet. You guess what an implementer will need to know. You guess wrong. The implementer fills gaps with assumptions. The assumptions are wrong.
Interrogation inverts this. You do not anticipate. You answer. The cognitive load shifts from “what might someone need to know” to “what is the answer to this specific question.”
But interrogation alone is not enough. Without a gate, you can answer questions inconsistently and never notice. The gate forces reckoning. You cannot proceed by ignoring the conflict. You must resolve it.
Contradictions surface during interrogation, when fixing them is cheap, not during implementation, when fixing them is expensive.
The caveats
The conditions were favorable. Greenfield project. Single author. Well-bounded scope: an observability SDK is a known pattern. No legacy code, no team conventions, no existing architecture. The hard parts of most software projects were absent.
And I am a domain expert. My answers were detailed because I have thought about this domain. When Claude asked about token extraction paths, I knew the answer. A non-expert giving vague answers would get a vague spec, and vague specs produce wrong code.
Elenchus catches contradictions between explicit statements. It cannot catch contradictions between stated requirements and unstated assumptions. If I never mentioned authentication, the spec would not include it. The tool surfaces conflicts in what you say. It cannot surface conflicts with what you forgot to say. This is a variant of the seeing problem: the gap between what was specified and what was assumed.
What I am still figuring out
Whether adversarial interrogation scales to teams. One person answering questions produces one coherent perspective. Five people answering questions produce five perspectives with different assumptions, terminology, and priorities. The contradiction detection would need to handle not just logical contradictions but semantic disagreements about what words mean. This is a harder problem than catching “zero-config vs. requires pipeline_id.”
This is one data point. I do not know if it generalizes.
I answered questions for an hour. I got a working library. The hard part was not the implementation. The hard part was the hour of reckoning with my own contradictions.
The code and extracted premises are public.