What I Learned Building Four Tools With AI Agents

Over the past year, I built four open-source tools almost entirely with AI coding agents: Engram (a memory system for AI applications that preserves ground truth and tracks confidence), Tessera (data contract coordination for warehouses), Conduit (ML-powered LLM routing using contextual bandits), and Arbiter (a provider-agnostic LLM evaluation framework). Combined: 1,016 commits, tens of thousands of lines of code.

This is what I learned about designing systems that agents can actually build.

The workflow

Every project started the same way.

Step 1: Write the CLAUDE.md file. Before any code, an agent constitution. What the project is. What the agent’s role is. What it should always do, what it should ask about first, and what it should never do. The skeleton stayed constant across projects; the guardrails got more specific each time. The Engram version is the compact baseline; the Conduit and Arbiter versions show how it grows under contact.

Step 2: Create GitHub Issues for coordination. Issues as durable state. Each issue had clear acceptance criteria with checkboxes. Labels routed work: focus:core-functionality, phase-1-foundation, priority:high. The agent could not close an issue until every checkbox was checked.
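The "every checkbox checked" rule is easy to check mechanically. Here is a minimal sketch, assuming acceptance criteria are written as GitHub-style task list items (`- [ ]` / `- [x]`) in the issue body. The function names are hypothetical, and this is not the tooling the projects actually used.

```python
import re

# Matches GitHub-style task list items such as "- [ ] text" or "- [x] text".
TASK_RE = re.compile(r"^[ \t]*[-*][ \t]+\[([ xX])\][ \t]+(.*)$", re.MULTILINE)


def unchecked_criteria(issue_body: str) -> list[str]:
    """Return the text of every task list item that is still unchecked."""
    return [text.strip() for mark, text in TASK_RE.findall(issue_body) if mark == " "]


def can_close(issue_body: str) -> bool:
    """An issue is closable only if it has criteria and all of them are checked."""
    tasks = TASK_RE.findall(issue_body)
    return bool(tasks) and all(mark in "xX" for mark, _ in tasks)
```

An issue with no checkboxes at all is treated as not closable, on the assumption that missing criteria are a spec problem rather than a pass.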

Step 3: Let the agent build. I point the agent at an issue, it does the work and opens a PR, and I review. Repeat.

The workflow sounds simple. Getting it to work required learning some lessons the hard way.

The eight lessons

1. Design for deletion

Across the four projects, 20-30% of agent-written code was eventually removed. Not because it was wrong, but because it was unnecessary. The agent builds what you ask for, including scaffolding you did not need, abstractions that never got reused, and edge case handling for cases that never happened.

Implication: Small, deletable modules beat monoliths. When you delete agent code, you want to delete a file, not untangle a dependency graph.

2. Security cannot be retrofitted

In all four projects, security improvements came late. Path traversal protection, input validation, and authentication hardening all arrived in the final 20% of commits. The pattern was consistent: build functionality first, realize it is exposed, add protection.

This was my fault, not the agent’s. I did not specify security requirements upfront. The agent built exactly what I asked for.

Implication: Security constraints belong in the CLAUDE.md file and the first issues, not the last ones.
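As an illustration of the kind of constraint that is cheap to state upfront, here is a minimal sketch of path traversal protection. It assumes user-supplied paths must stay inside a single base directory; `resolve_inside` is a hypothetical helper, not code from any of the four projects.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    # Joining with an absolute user_path discards base, so the containment
    # check below also rejects absolute paths outside the base directory.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

A rule like "all file access goes through one helper that checks containment" is short enough to put in the CLAUDE.md and in the first issue that touches the file system.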

3. Documentation describes intent, not reality

Agent-written documentation is aspirational. It describes what the code should do, not what it actually does. I found README sections for features that were half-implemented, API docs for endpoints that had drifted, architecture diagrams that no longer matched the code.

Implication: Treat agent documentation as a first draft. Verify claims against the implementation before publishing.

4. Validation comes last (it should come first)

The same pattern in every repo: core logic first, then features, then validation and error handling at the end. This meant the happy path worked for weeks before edge cases got attention.

Implication: Write issues that require validation as part of acceptance criteria, not as follow-up work.

5. Modules do not emerge naturally

Agents do not refactor toward better abstractions on their own. They add code where you point them. If you do not explicitly create module boundaries, you get files that grow until they are unmaintainable. One Conduit file hit 800 lines before I noticed.

Implication: Create the module structure yourself. Tell the agent where new code belongs.
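A file growing to 800 lines unnoticed is something a small check can catch earlier. This is a minimal sketch, assuming Python sources and an arbitrary 400-line threshold. Both the threshold and the function name are assumptions, so tune them to your own codebase.

```python
from pathlib import Path


def oversized_files(root: str, max_lines: int = 400, pattern: str = "*.py") -> list[tuple[str, int]]:
    """Return (relative path, line count) for files longer than max_lines."""
    results = []
    for path in sorted(Path(root).rglob(pattern)):
        with path.open(encoding="utf-8", errors="replace") as handle:
            count = sum(1 for _ in handle)
        if count > max_lines:
            results.append((path.relative_to(root).as_posix(), count))
    return results
```

Run in CI or before each review, a check like this turns "before I noticed" into a failing signal.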

6. Decomposition is the specification

The quality of agent output tracks how well you decompose the work. Vague issues produce vague code. “Implement authentication” gets you something. “Implement JWT token generation with refresh tokens, 15-minute access token expiry, 7-day refresh token expiry, and token rotation on refresh” gets you what you want.

Implication: Spend more time writing issues than you think you need to. The issue is the spec. This is the design department problem at the individual practitioner level.

7. Labels route the agent

I started with three labels. By the end, I had a taxonomy: focus: for domain, phase- for project stage, priority: for urgency. The labels were not just for organization. They gave the agent context about what kind of work this was.

Implication: Design your label system deliberately. Labels are hints to the agent about how to approach the work.
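A taxonomy built on prefixes is also easy to parse. Here is a minimal sketch, assuming the three prefixes described above, that turns an issue's labels into a short context line an agent prompt could include. The function names and the output format are hypothetical.

```python
# Label prefixes from the taxonomy: focus: (domain), phase- (stage), priority: (urgency).
PREFIXES = {"focus:": "focus", "phase-": "phase", "priority:": "priority"}


def classify_labels(labels: list[str]) -> dict[str, str]:
    """Map each recognized label to its taxonomy dimension; ignore the rest."""
    routed = {}
    for label in labels:
        for prefix, dimension in PREFIXES.items():
            if label.startswith(prefix):
                routed[dimension] = label[len(prefix):]
                break
    return routed


def context_line(labels: list[str]) -> str:
    """Render the routed labels as one line of context for the agent."""
    routed = classify_labels(labels)
    return ", ".join(f"{key}={value}" for key, value in sorted(routed.items()))
```

For the labels `focus:core-functionality`, `phase-1-foundation`, and `priority:high`, `context_line` produces `focus=core-functionality, phase=1-foundation, priority=high`.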

8. State lives in GitHub, not agent memory

Agents forget. Context windows fill up. Sessions end. GitHub Issues persist. Every decision, every requirement, every acceptance criterion went into an issue. When I resumed work after a break, the agent could read the issues and pick up where we left off.

Implication: Write issues as if explaining to a new team member. Because the agent is always a new team member.

What I would do differently

Enforce PR size limits. Some PRs bundled multiple issues for convenience. The convenience made review harder. Smaller PRs are easier to understand, review, and revert.
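A size limit only helps if something enforces it. This is a minimal sketch of such a check, assuming you already have the PR's addition and deletion counts and its linked issue numbers. The thresholds (400 changed lines, one issue per PR) are illustrative assumptions, not numbers from these projects.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequestStats:
    linked_issues: list[int] = field(default_factory=list)
    additions: int = 0
    deletions: int = 0


def size_violations(
    pr: PullRequestStats,
    max_changed_lines: int = 400,  # assumed threshold
    max_linked_issues: int = 1,    # assumed: one issue per PR
) -> list[str]:
    """Return human-readable reasons the PR is too large; empty means OK."""
    problems = []
    changed = pr.additions + pr.deletions
    if changed > max_changed_lines:
        problems.append(f"{changed} changed lines exceeds limit of {max_changed_lines}")
    if len(pr.linked_issues) > max_linked_issues:
        problems.append(
            f"{len(pr.linked_issues)} linked issues exceeds limit of {max_linked_issues}"
        )
    return problems
```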

Require security review earlier. Not as a final check, but as a gate on the first PR that touches user input, file systems, or network boundaries.

Add explicit validation checkboxes. Every issue should include “validates inputs” and “handles error cases” as acceptance criteria, not implied expectations.
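Checking that an issue actually carries those two criteria can be automated before the agent ever sees it. A minimal sketch, assuming criteria are written as task list items and matched by simple substring. `missing_criteria` is a hypothetical helper.

```python
# The two criteria every issue should carry, per the text above.
REQUIRED_CRITERIA = ("validates inputs", "handles error cases")


def missing_criteria(issue_body: str, required: tuple[str, ...] = REQUIRED_CRITERIA) -> list[str]:
    """Return required criteria that do not appear in any task list item."""
    checkbox_lines = [
        line.lower()
        for line in issue_body.splitlines()
        if line.lstrip().startswith(("- [ ]", "- [x]", "- [X]"))
    ]
    return [
        criterion
        for criterion in required
        if not any(criterion in line for line in checkbox_lines)
    ]
```

Substring matching is crude, but it is enough to catch an issue template that was edited until the validation checkboxes disappeared.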

Define module structure upfront. Before writing any code, sketch the folder structure and tell the agent where each type of code belongs.

What I am still figuring out

Whether this workflow scales beyond solo development. One person reviewing AI-generated PRs works because the reviewer holds the full mental model. A second reviewer introduces coordination questions: who owns which modules, which conventions matter, how to resolve disagreements about the CLAUDE.md. The document becomes a shared governance artifact rather than a personal operating guide. I have not tested this with a team and do not know where it breaks.

Building with agents is not pair programming. It is more like managing a very fast, very literal contractor who has no institutional memory. The systems that work are the ones designed for that constraint: explicit specifications, durable state, clear boundaries, and the assumption that everything will need to be reviewed.

The agents did most of the typing. The architecture was still my job.

Thanks to the contributors who helped with these projects: SoulSniper-V2, ChanfeiLi, MukeshK17, JithinBathula, nicoleman0, Aditya-Singh-008, Nikhita-14, harshit-69, hhayan, DAShaikh10, MeghanBao, and jskrable.