agent architecture SPEC · V1

System Architecture

Agent-Based Product Development

A small set of role-specialized agents, run in sequence, each handing locked work to the next — with a human at the gates and no agent grading its own homework.

This document: the reasoning, in prose & diagrams.
The executable rules: in each agent’s definition.
01

The one idea

What the whole system reduces to.

Development is not one agent doing everything. It is a small set of role-specialized agents, run sequentially, where each step hands its finished, locked output to the next. A human stays in control at evaluation gates between steps. The work that builds the product is judged by agents that did not build it.

Three principles generate everything else in this document:

  • Sequential over parallel (for the build path). Each step departs with more information than the last — a locked spec sharpens design, an approved design makes implementation unambiguous. Parallelism would force a step to start before the prior output exists.
  • Separation of authorship and verification. The agent that builds a thing is never the sole agent that judges it. An agent grading its own work brings the same blind spots to the review that it brought to the build.
  • The human is the gate, not the worker. Routine mechanical failures are resolved between agents. The human reviews locked artifacts and makes judgment calls.
02

Core concepts

Read this before the rules.

An agent is a stateless function

An agent does not “contain” knowledge that persists between runs. Every run is assembled fresh from four layers stacked into one context window:

| Layer | What it is | Portable? |
| --- | --- | --- |
| 1 · Model weights | General reasoning & coding ability | yes (shared) |
| 2 · Role definition | “You are a security reviewer who…” | yes (reused across projects) |
| 3 · Project context | This codebase, its decisions, conventions | no (per project) |
| 4 · Live task + state | The current prompt + files right now | no (this moment) |

The agent is Layer 2 — a reusable role template. The project is Layer 3, loaded in at runtime. A well-written role agent works on any project, because the project-specific part is injected at invocation, never baked in.
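The layering above can be sketched as a pure function that builds a run's context from its inputs. This is a minimal illustration, not the system's actual loader: the names `RoleDefinition` and `assemble_context`, the section headings, and the sample role and project text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleDefinition:
    """Layer 2: a reusable role template that knows nothing about any project."""
    name: str
    instructions: str


def assemble_context(role: RoleDefinition, project_context: str, task: str) -> str:
    """Build one run's context from scratch; nothing is remembered between calls.

    Layer 1 (model weights) is implicit in whichever model receives this text.
    """
    layers = [
        f"## Role: {role.name}\n{role.instructions}",  # Layer 2, portable
        f"## Project context\n{project_context}",      # Layer 3, per project
        f"## Live task\n{task}",                       # Layer 4, this moment
    ]
    return "\n\n".join(layers)


# Hypothetical role text and project notes, for illustration only.
researcher = RoleDefinition("research", "Gather external evidence and present options.")
context_a = assemble_context(researcher, "Project A conventions.", "Compare auth providers.")
context_b = assemble_context(researcher, "Project B conventions.", "Compare video hosts.")
```

Because the function takes the project as an argument, the same role text appears unchanged in both contexts while the project-specific part differs.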

Definition vs. instance

A role definition is a template. Invoking it creates an instance: a running context bound to one project for its lifetime. The same definition can spawn many independent instances — research on Project A and Project B at once. They share no state and cannot interfere, because the project-binding happens at instance creation, not in the definition.

[Diagram: the Research agent definition (the reusable Layer 2 template) is invoked once with Project A and once with Project B. Each invocation yields an instance with its own context window. The instances share no state and neither knows the other exists; parallelizing them is safe because research produces documents, not shared code.]
Fig 1. One definition, many instances. The project binds at instance creation — same agent, two projects, zero interference.
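One way to picture the definition/instance split is a template object plus a factory that binds a project at creation time. The sketch below assumes hypothetical names (`Instance`, `invoke`, `findings`); it only shows that two instances of one definition hold separate state.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RoleDefinition:
    """The reusable template; it carries no project reference."""
    name: str


@dataclass
class Instance:
    """A running context bound to exactly one project for its lifetime."""
    role: RoleDefinition
    project: str
    findings: list[str] = field(default_factory=list)  # private to this instance


def invoke(role: RoleDefinition, project: str) -> Instance:
    """Project binding happens here, at instance creation, not in the definition."""
    return Instance(role=role, project=project)


research = RoleDefinition("research")
on_a = invoke(research, "Project A")
on_b = invoke(research, "Project B")
on_a.findings.append("option 1 for A")  # does not touch on_b
```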

Why parallelism is dangerous on the build path

Isolating files (e.g. git worktrees) prevents file collisions. It does not prevent semantic staleness: an agent that branched before a core refactor landed will build against a reality that no longer exists, and merge cleanly but wrong — worse than a conflict, because it’s silent.

The rule

Parallelism is safe only where agents share no mutable state. Document work (research, specs) shares no state with code → safe to parallelize. Code-writing all mutates the same files → must be sequenced.
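As a rough sketch of this rule, each task can declare the mutable resources it writes, and a batch is parallel-safe only when no two write sets overlap. The function name and the resource labels are assumptions made for illustration; the codebase is modelled as a single shared resource because, per the text, isolating files does not prevent semantic staleness.

```python
from itertools import combinations


def safe_to_parallelize(write_sets: dict[str, set[str]]) -> bool:
    """True when no two tasks write to the same mutable resource."""
    return all(a.isdisjoint(b) for a, b in combinations(write_sets.values(), 2))


# Document work: each research instance writes only its own document.
research_tasks = {
    "research-a": {"docs/research-a.md"},
    "research-b": {"docs/research-b.md"},
}

# Code work: every build task mutates the same codebase.
build_tasks = {
    "build-auth": {"codebase"},
    "build-home": {"codebase"},
}

research_ok = safe_to_parallelize(research_tasks)
build_ok = safe_to_parallelize(build_tasks)
```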

The gate

Between every step there is a gate that does three jobs at once:

  • Control: the human reviews one small, readable artifact before the next step builds on it.
  • The staleness cure: each step starts only after the prior artifact is locked, so it can’t start from stale input.
  • The handoff: the locked output of step N is the input to step N+1.

Most gates are a conversation with the agent that produced the artifact, not a bare approve/reject button, so the human can interrogate the reasoning and adjust in place. But not every gate is a dialogue. Where the upstream step is a verifier (it judges rather than authors), its gate is an inspection: you read a verdict and the evidence behind it, and any conversation about it happens with the agent that can act on it (the CTO), not with the verifier itself. Which gates converse and which are inspections is settled in realization (Part II, R5).
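The staleness cure and the handoff can be sketched as a single precondition: a step refuses to start unless its input artifact is locked. The types and function names below (`Artifact`, `approve`, `start_next_step`, `StaleInputError`) are hypothetical, and the human review itself is reduced to a flag flip.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    step: str
    content: str
    locked: bool = False


class StaleInputError(RuntimeError):
    """Raised when a step tries to build on an artifact that is not locked."""


def approve(artifact: Artifact) -> Artifact:
    """The human gate: after review, the artifact is locked."""
    artifact.locked = True
    return artifact


def start_next_step(step: str, upstream: Artifact) -> Artifact:
    """The locked output of step N is the only valid input to step N+1."""
    if not upstream.locked:
        raise StaleInputError(f"{upstream.step} output is not locked yet")
    return Artifact(step=step, content=f"{step} built on locked {upstream.step} output")


card = Artifact(step="CTO", content="requirement card + tests")
try:
    start_next_step("Design", card)
    started_early = True
except StaleInputError:
    started_early = False

design = start_next_step("Design", approve(card))
```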

03

System shape

Two layers: a roadmap above, a per-requirement pipeline below.

The unit of work is a requirement — any unit of change to the system: a new capability, a refactor, a migration, an architectural fix. Not only new user-facing features. That word is load-bearing: it means the same pipeline that builds a screen also handles “auth needs SSO now, re-architect it,” so structural changes have a home instead of being exceptions.

Any project shape — two on-ramps

The pipeline is layout-agnostic. It never hard-codes a folder shape: what an agent needs is not “edit front/” but “here is where the frontend lives, here are its conventions” — and that is Layer 3 (project context), injected at runtime, never baked into a role. So a single tree with back/ and front/, a three-repo workspace (app/ + panel/ + back/), or any other arrangement are all just different Layer-3 facts. The architecture absorbs the variation in the project descriptor; the roles stay portable.

The mental model is a dev team brought onto a project: first it understands what exists, then it works. A project reaches a valid Layer-3 descriptor by one of two on-ramps, which converge on the same handoff so the pipeline never knows which ran:

  • Scaffold (greenfield) — generate a new project and its descriptor from a stack template. The team writes structure that wasn’t there. (The concrete template — its stack and folder shape — is a realization choice, deliberately kept out of Part I: R8.)
  • Adopt (brownfield) — point the team at an existing codebase. There is no template to apply; the architecture already exists in the code, so adopt is a conversation in which the CTO reads the repos, proposes what it believes the project is, and you correct until the derived descriptor matches reality. The team reads and confirms structure that was already there.

Adopt is that “understand first” step, and it is the CTO at project altitude (below) — no new role, just the CTO’s existing “ground yourself before acting” instinct run once over the whole project instead of once per requirement. Both on-ramps end at the same place — a project that speaks the system’s language — after which the roadmap and pipeline below run identically. (Their concrete realization is Part II, R8.)

Dev and prod — workbench and live system

A real project usually has two running environments, sitting in the workspace side by side (the scaffold creates both): xproject_dev/ (the agent’s workbench — full read/write, where tests and the browser-eyes run) and xproject_prod/ (the live system). The split is itself just a Layer-3 fact, but it carries one hard safety principle: the agent authors only in dev. Source flows dev → deploy → prod, never by editing prod directly — a direct prod edit would bypass the tests, the gate, and everything the pipeline exists to enforce. Prod is authoring-frozen.

The freeze is on authoring, not on operating: the agent may still read prod (logs, processes, status — the context it needs to debug) and, through a narrow path, deploy/restart it. “Author never on prod; operate yes, but scoped” is the whole rule. Its enforcement — at the OS level, not by instruction — is Part II, R9.
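The authoring rule itself is small enough to state as a predicate. The sketch below is only an illustration of the rule, not its enforcement, which the text places at the OS level (R9); it assumes workspace-relative paths and uses the `xproject_dev/` folder name from above, while the function name `may_author` is hypothetical.

```python
from pathlib import PurePosixPath


def may_author(path: str, dev_root: str = "xproject_dev") -> bool:
    """True only for workspace-relative paths that stay inside the dev workbench."""
    parts = PurePosixPath(path).parts
    # Reject empty paths, anything outside dev, and ".." escapes back out of dev.
    return bool(parts) and parts[0] == dev_root and ".." not in parts


dev_edit = may_author("xproject_dev/src/app.py")
prod_edit = may_author("xproject_prod/src/app.py")
escape_attempt = may_author("xproject_dev/../xproject_prod/src/app.py")
```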

Layer A — the roadmap

The human talks with the CTO about the roadmap in general terms: value vs. effort, what to build next, priorities. The output of that conversation is a set of prioritized requirement tickets. The CTO operates at three altitudes: at project altitude (once, at onboarding, where it grounds itself in the whole project and confirms its architecture; this is the adopt on-ramp above), at roadmap altitude (what is worth building and in what order), and, when a requirement enters the pipeline, at requirement altitude, where it settles the approach and writes the tests. So the CTO sits above the pipeline and opens each run. It is the single conversational role you plan and argue strategy with. Roadmap planning and approach-setting belong to one agent because effort can’t be priced without reasoning about approach.

The roadmap is the dev team’s planning surface, not part of the product. Its format and flow (the requirement-card shape, the Backlog→Done rules) are the team’s instructions: they ship with the team from origin, but the roadmap itself starts empty and its cards accumulate per project. It is persistent state on disk: a folder, not a session and not a single file. Each requirement is its own markdown file, and the model is hybrid by design:

  • Coarse state is location. Two folders only — Backlog/ and Done/. A requirement is shipped because its file lives in Done/. The one thing you never want ambiguous — done or not — is physical location, which can’t silently drift.
  • Planning dimensions are fields. Everything you want to slice and re-plan — version, size, value — lives in the card’s frontmatter. The UI surface derives its columns from version, so the board can group by release, filter by size, or sort by value without moving any files.
  • Per-requirement files prevent collisions. Each requirement is its own file, so two CTO sessions on different requirements never clobber each other — the same shared-mutable-state principle, applied to the roadmap.

So there are two different operations for two different kinds of change: moving a card between versions edits a field; marking it done moves the file. If the status field and the folder ever disagree, the folder wins — location is the source of truth that can’t rot. A CTO session is disposable; close it freely and the roadmap folder remains.
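A minimal sketch of "the folder wins": done-ness is read from the card's location, and marking a card done is a file move. The function names are hypothetical, and the `---` frontmatter delimiters in the sample card are an assumption, since the text does not specify the frontmatter syntax.

```python
import tempfile
from pathlib import Path


def effective_status(card: Path) -> str:
    """Done-ness comes from the folder; the frontmatter status field is ignored."""
    return "done" if card.parent.name == "Done" else "backlog"


def mark_done(card: Path) -> Path:
    """Shipping a requirement is a file move from Backlog/ to Done/."""
    done_dir = card.parent.parent / "Done"
    done_dir.mkdir(exist_ok=True)
    return card.rename(done_dir / card.name)


with tempfile.TemporaryDirectory() as tmp:
    backlog = Path(tmp) / "roadmap" / "Backlog"
    backlog.mkdir(parents=True)
    card = backlog / "auth.md"
    card.write_text("---\nstatus: backlog\nversion: V0.1\n---\n", encoding="utf-8")

    shipped = mark_done(card)
    # The stale field still says "backlog", but location decides.
    status_after_move = effective_status(shipped)
    stale_field_present = "status: backlog" in shipped.read_text(encoding="utf-8")
```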

[Diagram: persistent state on disk outlives every session. The product consists of dev/ (the workbench) and prod/ (the live system). The dev team owns the roadmap, skills, and method, which ship from origin with the roadmap empty and fill per project. Each requirement session (Requirement 1, 2, 3) is disposable: it reads state, works, and writes back. Nothing durable lives in a session; the files remain.]
Fig 2. Disposable sessions over two kinds of persistent state: the product (dev/, prod/ — its shape a per-project fact) and the dev team, which owns the roadmap. The team ships from origin with the requirement format and flow but an empty roadmap; cards accumulate per project.

The roadmap folder holds just two folders for done-ness; the planning dimensions live inside each card as frontmatter, and the board groups by the version field:

[Diagram: roadmap/ contains Backlog/ (auth.md, home.md, video.md) and Done/ (signup.md, cap-table.md). The frontmatter of auth.md reads status: backlog, version: V0.1, size: M, value: 8; status mirrors the folder and the folder wins, while version, size, and value are queryable. The roadmap board shows one column per version (V0.1, V0.5, V1.0, V1.1, Later), each card labeled with its size and value, e.g. auth · M·8. Changing a version edits a field; marking done moves the file to Done/.]
Fig 3. Done-ness is location (two folders); version, size, value are queryable fields. The board groups by version — re-planning a release never touches the filesystem.
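Deriving the board from frontmatter could look roughly like the sketch below. It assumes `---`-delimited `key: value` frontmatter (the delimiter is not stated in the text), and it files cards without a version under "Later", which is a guess at what that column means. The card values for auth and home are taken from the figure.

```python
from collections import defaultdict


def parse_frontmatter(text: str) -> dict[str, str]:
    """Read simple `key: value` pairs between two `---` lines (assumed format)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def group_by_version(cards: dict[str, str]) -> dict[str, list[str]]:
    """Board columns are derived from the version field; no files move."""
    board: dict[str, list[str]] = defaultdict(list)
    for name, text in cards.items():
        board[parse_frontmatter(text).get("version", "Later")].append(name)
    return dict(board)


cards = {
    "auth": "---\nstatus: backlog\nversion: V0.1\nsize: M\nvalue: 8\n---\n",
    "home": "---\nstatus: backlog\nversion: V1.0\nsize: M\nvalue: 6\n---\n",
}
board = group_by_version(cards)

# Re-planning a release is a field edit, not a file move.
cards["home"] = cards["home"].replace("version: V1.0", "version: V0.1")
replanned = group_by_version(cards)
```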

Layer B — the per-requirement pipeline

The human picks a ticket and opens a requirement session. The pipeline runs mostly autonomously, surfacing only at gates or escalations: CTO → Design → Build → QA → Document → Review, each separated by a human gate. Research is not a fixed step — it is an on-demand portable agent, invokable in either layer, and because it produces documents its instances are safe to run in parallel. Research presents options; it does not decide.

One pipeline, no fast lanes. Every requirement runs this same flow regardless of size; there is no separate “lightweight” track for small changes. The flow compresses on its own when the work is small: a one-line copy change has a trivial CTO conversation, a one-step Design change, and a two-second gate, while still leaving behind a test, current docs, and a passing suite. A second profile would buy a near-zero saving (the heavy part is the gates, and gates already cost time proportional to content, not to the flow’s length) at the price of a track to maintain and an “is this small enough?” decision on every requirement. Uniformity is the simpler design here, and it keeps every change, large or small, leaving the same evidence behind. So size stays a planning field only; it never selects a workflow.

04

The agents

Who does what — and who is allowed to judge whose work.

| Role | Kind | Layer | Responsibility |
| --- | --- | --- | --- |
| CTO | creator | A·B | At onboarding, once: grounds itself in the project and confirms its architecture (the adopt on-ramp). Owns the roadmap and prioritization (value vs. effort). For each requirement: settles the approach with you, fills in the requirement card (the roadmap file), and writes its tests against that card |
| Research | advisor | A / B | Gathers external evidence; presents options, does not decide |
| Design | creator | B | Produces UI against the design system; objective props auto-tested, taste judged in evaluation |
| Build | creator | B | Implements backend + frontend until the requirement test passes; self-tests for speed |
| QA | verifier | B | Runs the CTO’s tests on the deliverer’s work; routes failures back to the deliverer |
| Document | creator | B | Updates docs from merged, current code; runs after build, so it documents reality |
| Review | verifier | B | Independent security + cleanup audit; catches what tests can’t |

Two kinds of role, one firewall. The roster splits cleanly. Creators author deliverables — the CTO writes the card and tests, Design and Build produce the work, Document writes the docs. Verifiers judge that work and author none of it — QA runs the tests, Review audits the finished code. (Research is neither: an advisor that gathers evidence and presents options without deciding.) The whole system reduces to one rule across this split — a verifier never judges anything it authored — and Part II (R5) makes it structural rather than trusted: verifiers run write-walled, literally unable to edit the code they judge, while creators hold the write access their job needs. This is also why you converse with creators (you reason with them) but only receive a verdict from a verifier (then take it to the CTO) — the gate-type split of §02, mapped onto the roster.
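The creator/verifier split maps naturally onto a small lookup from which both write access and gate type follow. This is a sketch under assumptions: the function names are hypothetical, the real write wall is realized in Part II (R5) rather than by a lookup like this, and Research is left out because the text classifies it as an advisor, not a pipeline step.

```python
# Kinds as listed in the roster for the six pipeline roles.
ROLE_KIND = {
    "CTO": "creator",
    "Design": "creator",
    "Build": "creator",
    "QA": "verifier",
    "Document": "creator",
    "Review": "verifier",
}


def has_write_access(role: str) -> bool:
    """Verifiers are write-walled; creators hold the write access their job needs."""
    return ROLE_KIND[role] == "creator"


def gate_type(role: str) -> str:
    """Creator gates are conversations; verifier gates are inspections of a verdict."""
    return "conversation" if ROLE_KIND[role] == "creator" else "inspection"


write_walled = sorted(role for role in ROLE_KIND if not has_write_access(role))
```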

How judgment is split. No role judges its own work — but “judging” isn’t a single role one agent owns. Each deliverable is partitioned, and the parts go to different judges. Design is the clearest case: its measurable properties (contrast, tokens, spacing, touch targets) are tested by QA, while its taste is judged by you at the gate — one deliverable, two judges. Build is the same shape, lopsided: almost all of it is measurable → QA, with anything untestable flagged to HUMAN-TEST-QUEUE.md → you. The CTO’s card is the opposite extreme — there’s no test to run on an approach, so it’s judged entirely by you. QA owns the measurable partition wherever it exists; you own the judgment partition; Review adds an orthogonal pass over the finished code for security and cleanliness that tests can’t catch.

The CTO advises; you decide. It does not decide the how by fiat — it grounds the requirement in the codebase, lays out approaches with tradeoffs, and the how is settled in conversation with you. Same pattern as research, and as every gate: the CTO structures the decision and records it, you make it.

The requirement card is the roadmap file, filled in. The card is created in Backlog/ during roadmap planning — entering the pipeline is the CTO filling that same file in with the settled approach (the status field tracks finer progress; the file only moves once, to Done/, when it ships). One artifact per requirement accretes detail across its life rather than spawning competing documents.

Why the CTO owns all tests. The requirement test asks “does this requirement do its job?” and the regression concern asks “does it respect the rest of the system?” — both architectural questions about fit and contract. The CTO writes the filled-in card and the tests in the same step, locked together at one gate, so the tests can’t drift from the spec. This keeps QA clean: QA only ever runs tests it did not write, on work it did not build.

[Diagram: the pipeline as a sequence of steps and gates. CTO: proposes, you decide the how; fills in the requirement card and writes tests against it (gate: you + CTO). Design: mockup against the design system; objective props auto-tested, taste in evaluation (gate: you + Design). Build: implements until the requirement test passes, running the test continuously as it works (gate: you + Build). QA: runs tests it did not write on work it did not build, plus the full suite, which must not break other requirements (gate: you + QA). Document: from the merged, current code. Review: security + cleanup audit. Authorship and verification are split at every layer; creator gates converse, verifier gates report.]
Fig 4. The pipeline. CTO/build make it, QA judges it; creator gates are interrogable conversations with the delivering agent, while verifier gates (QA, Review) hand you a verdict you take to the CTO.
05

Testing discipline

Test-first as standard, with an honest escape hatch.

Test-first is the standard

Tests are authored from the locked requirement card before the code exists, by the CTO, in the same step that fills the card in. With agents this matters more than with humans: an agent that writes code and then its own tests tends to shape those tests around whatever it wrote, producing a green check that means nothing. A locked, pre-written test is an external, objective definition of done. It cannot move the goalposts.

[Diagram: two flows compared. Test-after (the risk): the agent writes code, then writes tests for the code it just wrote; the tests pass, but shaped to the code, not the spec. Test-first (the fix): the test is written from the spec, you review and lock it, then the agent writes code; the tests pass against a target fixed before the code existed. The locked test is an objective gate the agent must satisfy, a precise definition of “done”, and a reviewable artifact for your gate.]
Fig 5. Why test-first matters more for agents: it stops “draw the target around the arrow after firing.”
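The text does not say how a test is locked. One possible mechanism, shown here purely as an assumption, is to record a digest of the test source when the human approves the gate and to refuse any later run whose source no longer matches. The function names and the sample test strings are hypothetical.

```python
import hashlib


def lock(test_source: str) -> str:
    """Record a fingerprint of the approved test at the gate."""
    return hashlib.sha256(test_source.encode("utf-8")).hexdigest()


def still_locked(test_source: str, digest: str) -> bool:
    """True only if the test is byte-for-byte what the human approved."""
    return lock(test_source) == digest


# Test sources are plain strings here; they are compared, never executed.
approved = "def test_signup():\n    assert signup('a@b.c').ok\n"
digest = lock(approved)  # recorded when the human approves the gate

weakened = "def test_signup():\n    assert True\n"  # goalposts moved after the fact

approved_ok = still_locked(approved, digest)
weakened_ok = still_locked(weakened, digest)
```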

Two scopes — both required

A requirement (acceptance) test asks “did I build this right?” — scoped to one requirement, shipping with it. A regression suite asks “did I break anything else?” — the accumulated union of every requirement test ever written. The suite is the only thing that catches “the requirement works but broke something else,” and it can only catch it because every past requirement left its test behind.

[Diagram: acceptance tests, one per requirement (A, B, C), each shipping with its requirement and joining the regression suite permanently. The regression suite covers the whole project, grows over time, and runs every accumulated test together. A new requirement must pass its own test and the entire suite; the build gate runs both, the requirement test for proof and the full suite for safety.]
Fig 6. Authored once by the CTO, the requirement test graduates permanently into the shared suite.
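The relationship between the two scopes can be sketched with tests modelled as plain callables. The names `failing_tests` and `ship` are hypothetical, and a real suite would live in /tests and run through a test runner; the point is only that a shipped requirement's test stays in the suite and can catch a later regression.

```python
from typing import Callable

Check = Callable[[], bool]


def failing_tests(name: str, check: Check, suite: dict[str, Check]) -> list[str]:
    """Run the requirement's own test plus every accumulated test."""
    candidates = {**suite, name: check}
    return [n for n, c in candidates.items() if not c()]


def ship(name: str, check: Check, suite: dict[str, Check]) -> dict[str, Check]:
    """Admit the requirement only if nothing fails; its test then joins the suite."""
    failures = failing_tests(name, check, suite)
    if failures:
        raise AssertionError(f"blocked by failing tests: {failures}")
    return {**suite, name: check}


state = {"signup_works": True}
suite = ship("signup", lambda: state["signup_works"], {})

# A later requirement passes its own test but breaks signup as a side effect.
state["signup_works"] = False
broken = failing_tests("auth", lambda: True, suite)
```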

The human-flag escape hatch

“Test-first as standard” has a failure mode: faced with something genuinely hard to test, an agent may write a hollow test that passes but asserts nothing — worse than no test.

The rule

If an agent cannot write a meaningful automated assertion, it must not write a hollow one. It appends an entry to HUMAN-TEST-QUEUE.md describing what needs human verification and why. Honest by construction.
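Appending to the queue might look like the sketch below. The file name comes from the text; the entry format (a checkbox line plus a reason line), the function name, and the two sample behaviors are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def flag_for_human(queue: Path, behavior: str, reason: str) -> None:
    """Append one entry; never overwrite what earlier agents queued."""
    with queue.open("a", encoding="utf-8") as handle:
        handle.write(f"- [ ] {behavior}\n      Why not automated: {reason}\n")


with tempfile.TemporaryDirectory() as tmp:
    queue = Path(tmp) / "HUMAN-TEST-QUEUE.md"
    flag_for_human(queue, "Drag-to-reorder feels smooth",
                   "perceived smoothness has no stable assertion")
    flag_for_human(queue, "Email renders in a desktop client",
                   "no mail client available in the test environment")
    queue_text = queue.read_text(encoding="utf-8")
```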

Design is more testable than it looks

A design system turns most “subjective” design into objective conformance: approved token or off-system value? Approved component or bespoke? Spacing from the scale? Contrast meeting WCAG? All pass/fail, checked automatically. The design system is to the design agent what the requirement spec is to the CTO — the source of truth its tests assert against. Only true taste reaches the human evaluation conversation.
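Two of the checks named above can be written as pass/fail functions. The contrast computation follows the WCAG 2.x relative-luminance and contrast-ratio formulas with the 4.5:1 AA threshold for normal text; the token set is hypothetical and stands in for whatever DESIGN-SYSTEM.md actually approves.

```python
APPROVED_COLOR_TOKENS = {"#000000", "#ffffff", "#767676"}  # hypothetical tokens


def _channel(value: int) -> float:
    """Linearize one 8-bit sRGB channel as WCAG 2.x defines it."""
    c = value / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (1, 3, 5))
    return 0.2126 * _channel(r) + 0.7152 * _channel(g) + 0.0722 * _channel(b)


def contrast_ratio(fg: str, bg: str) -> float:
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def check_text_color(fg: str, bg: str) -> dict[str, bool]:
    """Objective conformance: approved token, and AA contrast for normal text."""
    return {
        "on_system": fg in APPROVED_COLOR_TOKENS,
        "contrast_aa": contrast_ratio(fg, bg) >= 4.5,
    }


off_system = check_text_color("#777777", "#ffffff")
on_system = check_text_color("#767676", "#ffffff")
```

Both results are plain booleans, so QA can assert on them without any judgment call; whether the gray looks right stays with the human.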

Existing code — own it all, test what earns it

Test-first is clean on a greenfield requirement. An adopted codebase (§03) is different: it arrives with legacy code and often no suite at all, so the regression net — “did I break anything else?” — has nothing to run against. The resolution comes straight from the dev-team model: a real team owns the whole codebase, legacy included, but it does not stop to fabricate exhaustive tests for code that already works — that effort buys the client nothing. Coverage is value-prioritized, not mandatory.

So writing a critical test is itself a requirement, flowing through the same pipeline and competing for priority like any other work. The main flow of a booking system — “book an appointment” — earns a test because breaking it is catastrophic; a rarely-touched admin corner may never earn one. This collapses “testing legacy” into the ordinary mechanism instead of making it a special rule. Those critical-flow tests run at the behavior level through the headless browser (“eyes” on the rendered product, R7), verifying a flow without unit-testing the code beneath it. The suite then grows forward, concentrated exactly where the team actually works.

06

The QA loop

Correction loops — and when to convene.

Real development is a line with correction loops. When QA finds a failure, the fix goes to the agent that delivered the work — not to QA (fixing would make it an author), and not to the human (the gate, not the debugger). The deliverer has the full context of what it built, so it diagnoses fastest. The loop is QA → deliverer → QA, and the human is outside it.

[Diagram: the CTO writes the requirement and regression tests; the delivering agent (Build or Design) produces the work; QA runs the tests, having neither written them nor built the work. A pass goes to your evaluation gate; a fail goes back to the deliverer. The fix goes to the agent that built it, not to QA and not to you.]
Fig 7. Routine “code doesn’t match the test yet” failures resolve autonomously. You see the work only once it passes.

When a producer is stuck — the only way back

The loop above handles “the code doesn’t match the test yet.” But sometimes the problem is deeper: the approach itself can’t work, the requirement is ambiguous, or something structural is wrong. A producer is never allowed to reopen the approach on its own — a stuck agent has every incentive to blame the spec rather than its own work, which is the self-judgment problem the system exists to prevent. So being stuck does not trigger a rewrite; it triggers a conversation.

That conversation convenes you and the CTO together — the same agent that wrote the approach, so there is no handoff — and it decides the course of action openly. It might amend the current requirement (add a clarifying constraint and continue), open a new requirement (a refactor or architectural fix, prioritized against everything else), fix a structural problem first and then resume, or some combination. This is the system’s only backward motion, and it is deliberately a human-plus-CTO decision rather than an automated transition. It is also why the unit of work is a requirement and not a “feature”: “the approach was wrong” has a natural home, because a refactor is just another requirement.

Circuit breaker · the trigger

What fires that conversation is a hard limit, so an autonomous fix-retry loop can’t spin for a day burning tokens against a wrong spec. After MAX_FIX_ATTEMPTS (default 3) failed fixes, the loop stops and hands the full failure history to the you-plus-CTO conversation above. Persistent failure is treated as an upstream problem, not a mechanical one.
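The QA → deliverer → QA loop with its circuit breaker can be sketched as follows. `MAX_FIX_ATTEMPTS` and its default of 3 come from the text; the function signature, the return values, and the convention that `run_tests` returns `None` on a pass are assumptions made for illustration.

```python
from typing import Callable, Optional

MAX_FIX_ATTEMPTS = 3


def qa_loop(
    run_tests: Callable[[], Optional[str]],   # QA: failure description, or None on pass
    fix: Callable[[str], None],               # the deliverer attempts a fix
    max_fix_attempts: int = MAX_FIX_ATTEMPTS,
) -> tuple[str, list[str]]:
    """Return ("pass", history) or ("escalate", history) for the you-plus-CTO talk."""
    history: list[str] = []
    failed_fixes = 0
    failure = run_tests()
    while failure is not None:
        history.append(failure)
        if failed_fixes == max_fix_attempts:
            return "escalate", history  # persistent failure: treat as upstream
        fix(failure)
        failure = run_tests()
        if failure is not None:
            failed_fixes += 1
    return "pass", history


# A deliverer that never manages to fix the problem.
fix_calls: list[str] = []
stuck_outcome, stuck_history = qa_loop(lambda: "login returns 500", fix_calls.append)

# A deliverer that succeeds on its second fix.
results = iter(["off by one", "off by one", None])
ok_outcome, ok_history = qa_loop(lambda: next(results), lambda failure: None)
```

The human never appears inside the loop; the escalation branch is the only path that reaches the you-plus-CTO conversation, and it carries the full failure history with it.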

07

Project files

Where durable state lives.

These files live in each project (Layer 3). The role definitions (Layer 2) are reused across projects. There is no canonical folder shape: a project may be one tree with back/ and front/, or several repos side by side (app/ + panel/ + back/), or anything else — the layout is a Layer-3 fact recorded in the project descriptor (CLAUDE.md), written by scaffold for new projects or derived by adopt for existing ones (§03). What the table below lists is the set of durable artifacts every project carries, wherever they physically sit.

| File | Purpose | Written by | Read by |
| --- | --- | --- | --- |
| roadmap/ folder | Folders Backlog/ and Done/ (done-ness = location). One markdown file per requirement, carrying frontmatter fields version, size, value, status. The board groups by version. One file per requirement → sessions never collide | CTO | Human, pipeline, roadmap board |
| CLAUDE.md | Project conventions, stack, structure | Human + agents | All agents |
| DESIGN-SYSTEM.md | Tokens, components, rules the design agent must conform to | Human | Design, QA |
| HUMAN-TEST-QUEUE.md | Behaviors needing human verification because they can’t be meaningfully auto-tested | Any agent | Human (at gates) |
| /tests | The accumulated requirement + regression suite | CTO | Build (self-test), QA (verify) |
| .env / .env.example | Secrets (git-ignored) plus a value-less structure file. One of the files the agent cannot author (R9) | Human only | Human; agent reads .env.example structure only |
| permission switch | Hand-edited armed_until flag gating sensitive prod actions; off by default, self-expiring (R9) | Human only | Agent (read-only) · dashboard (display) |
08

Parameters

The knobs worth tuning.

| Parameter | Default | Meaning |
| --- | --- | --- |
| MAX_FIX_ATTEMPTS | 3 | Failed QA-fix cycles before the loop stops and convenes the you-plus-CTO conversation |
| Pipeline roles (v1) | CTO → Design → Build → QA → Document → Review | Applied uniformly to every requirement; the flow compresses naturally when the work is small (no separate track) |
| Research parallelism | on | Research instances may run concurrently across requirements / projects |

End of Part I — Architecture

This architecture is a long-term setup, not a one-project plan. The agent definitions are portable across projects; only the project files are per-project. Revisit the parameters and agent set as needs evolve.