System Architecture

Agent-Based Product Development

A small set of role-specialized agents, run in sequence, each handing locked work to the next — with a human at the gates and no agent grading its own homework.

This document: the reasoning, in prose and diagrams.
The executable rules: in each agent’s definition.

The one idea

What the whole system reduces to.

Development is not one agent doing everything. It is a small set of role-specialized agents, run sequentially, where each step hands its finished, locked output to the next. A human stays in control at evaluation gates between steps. The work that builds the product is judged by agents that did not build it.

Three principles generate everything else in this document:

Sequential over parallel (for the build path). Each step departs with more information than the last — a locked spec sharpens design, an approved design makes implementation unambiguous. Parallelism would force a step to start before the prior output exists.
Separation of authorship and verification. The agent that builds a thing is never the sole agent that judges it. An agent grading its own work shares its own blind spots.
The human is the gate, not the worker. Routine mechanical failures are resolved between agents. The human reviews locked artifacts and makes judgment calls.

Core concepts

Read this before the rules.

An agent is a stateless function

An agent does not “contain” knowledge that persists between runs. Every run is assembled fresh from four layers stacked into one context window:

Layer	What it is	Portable?
1 · Model weights	General reasoning & coding ability	yes — shared
2 · Role definition	“You are a security reviewer who…”	yes — reused across projects
3 · Project context	This codebase, its decisions, conventions	no — per project
4 · Live task + state	The current prompt + files right now	no — this moment

The agent is Layer 2 — a reusable role template. The project is Layer 3, loaded in at runtime. A well-written role agent works on any project, because the project-specific part is injected at invocation, never baked in.
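The layering above can be sketched as a pure function. This is a minimal illustration, not the system's implementation: `RoleDefinition` and `assemble_context` are hypothetical names, and Layer 1 (the model weights) is implicit in whichever model receives the assembled text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleDefinition:
    """Layer 2: a reusable role template (hypothetical shape)."""
    name: str
    instructions: str


def assemble_context(role: RoleDefinition, project_context: str, task: str) -> str:
    """Build one context window from Layers 2 to 4.

    A pure function: nothing is remembered between calls, so the same
    inputs always yield the same context.
    """
    return "\n\n".join([
        f"# Role: {role.name}\n{role.instructions}",  # Layer 2, portable
        f"# Project\n{project_context}",              # Layer 3, per project
        f"# Task\n{task}",                            # Layer 4, this moment
    ])


reviewer = RoleDefinition("security-reviewer", "You are a security reviewer who...")
ctx_a = assemble_context(reviewer, "Project A: one tree, back/ and front/", "Audit the login flow")
ctx_b = assemble_context(reviewer, "Project B: app/ + panel/ + back/", "Audit the login flow")
```

The same `reviewer` object serves both projects; only the injected Layer-3 string differs.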

Definition vs. instance

A role definition is a template. Invoking it creates an instance: a running context bound to one project for its lifetime. The same definition can spawn many independent instances — research on Project A and Project B at once. They share no state and cannot interfere, because the project-binding happens at instance creation, not in the definition.

Fig 1. One definition, many instances. The project binds at instance creation — same agent, two projects, zero interference.
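A small sketch of the definition/instance split, assuming hypothetical class names (`RoleDefinition`, `AgentInstance`) and a `notes` list standing in for whatever state a running context accumulates.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RoleDefinition:
    """A template: holds no project and no state (hypothetical shape)."""
    name: str

    def instantiate(self, project: str) -> "AgentInstance":
        # The project binds here, at instance creation, not in the definition.
        return AgentInstance(role=self, project=project)


@dataclass
class AgentInstance:
    """A running context bound to one project for its lifetime."""
    role: RoleDefinition
    project: str
    notes: list[str] = field(default_factory=list)  # per-instance state only


research = RoleDefinition("research")
on_a = research.instantiate("Project A")
on_b = research.instantiate("Project B")
on_a.notes.append("compared three auth providers")
```

Because each instance gets its own `notes` list, work recorded on Project A never appears in the Project B instance.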

Why parallelism is dangerous on the build path

Isolating files (e.g. git worktrees) prevents file collisions. It does not prevent semantic staleness: an agent that branched before a core refactor landed will build against a reality that no longer exists, and merge cleanly but wrong — worse than a conflict, because it’s silent.

The rule

Parallelism is safe only where agents share no mutable state. Document work (research, specs) shares no state with code → safe to parallelize. Code-writing all mutates the same files → must be sequenced.
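One way to make the rule checkable is to compare the sets of paths each agent would write. This is a sketch under stated assumptions: the agent names and paths are invented, and disjoint write sets are only a necessary condition. The staleness argument above is why code-writing stays sequenced even when files do not collide.

```python
from itertools import combinations


def parallel_safe(write_sets: dict[str, set[str]]) -> bool:
    """True when no two agents would mutate the same path."""
    return all(a.isdisjoint(b) for a, b in combinations(write_sets.values(), 2))


# Document work: each research instance writes its own report.
docs = {
    "research-1": {"docs/auth-options.md"},
    "research-2": {"docs/db-options.md"},
}
# Code work: both builders touch the same module.
code = {
    "build-1": {"back/auth.py", "back/models.py"},
    "build-2": {"back/models.py"},
}

docs_ok = parallel_safe(docs)
code_ok = parallel_safe(code)
```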

The gate

Between every step there is a gate that does three jobs at once: control (the human reviews one small, readable artifact before the next step builds on it), the staleness cure (each step starts only after the prior artifact is locked, so it can’t start from stale input), and the handoff (the locked output of step N is the input to step N+1). Most gates are a conversation with the agent that produced the artifact — not a bare approve/reject button — so the human can interrogate the reasoning and adjust in place. But not every gate is a dialogue. Where the upstream step is a verifier (it judges rather than authors), its gate is an inspection: you read a verdict and the evidence behind it, and any conversation about it happens with the agent that can act on it — the CTO — not the verifier itself. Which gates converse and which are inspections is settled in realization (Part II, R5).
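The three jobs of the gate can be sketched in a few lines. This is a deliberately reduced model: `run_pipeline` and its callback are hypothetical, and the gate is collapsed to a boolean `approve` call, which the text says most real gates are not (they are conversations). What the sketch does show is the lock (a read-only view) and the handoff (step N+1 receives only step N's locked output).

```python
from types import MappingProxyType
from typing import Callable, Mapping

Artifact = Mapping[str, str]
Step = Callable[[Artifact], dict]


def run_pipeline(steps: list[tuple[str, Step]],
                 approve: Callable[[str, Artifact], bool]) -> Artifact:
    """Run steps in order; each starts only from the prior locked artifact."""
    locked: Artifact = MappingProxyType({})
    for name, step in steps:
        draft = step(locked)                        # handoff: input is the prior locked output
        candidate = MappingProxyType(dict(draft))   # lock: read-only from here on
        if not approve(name, candidate):            # control: the human gate
            raise RuntimeError(f"gate rejected after {name}")
        locked = candidate
    return locked


steps = [
    ("CTO", lambda prev: {"card": "approach settled"}),
    ("Design", lambda prev: {**prev, "design": "built against: " + prev["card"]}),
]
result = run_pipeline(steps, approve=lambda name, artifact: True)
```

Since a step cannot run until the previous artifact is locked and approved, it has no way to start from stale input.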

System shape

Two layers: a roadmap above, a per-requirement pipeline below.

The unit of work is a requirement — any unit of change to the system: a new capability, a refactor, a migration, an architectural fix. Not only new user-facing features. That word is load-bearing: it means the same pipeline that builds a screen also handles “auth needs SSO now, re-architect it,” so structural changes have a home instead of being exceptions.

Any project shape — two on-ramps

The pipeline is layout-agnostic. It never hard-codes a folder shape: what an agent needs is not “edit front/” but “here is where the frontend lives, here are its conventions” — and that is Layer 3 (project context), injected at runtime, never baked into a role. So a single tree with back/ and front/, a three-repo workspace (app/ + panel/ + back/), or any other arrangement are all just different Layer-3 facts. The architecture absorbs the variation in the project descriptor; the roles stay portable.

The mental model is a dev team brought onto a project: first it understands what exists, then it works. A project reaches a valid Layer-3 descriptor by one of two on-ramps, which converge on the same handoff so the pipeline never knows which ran:

Scaffold (greenfield) — generate a new project and its descriptor from a stack template. The team writes structure that wasn’t there. (The concrete template — its stack and folder shape — is a realization choice, deliberately kept out of Part I: R8.)
Adopt (brownfield) — point the team at an existing codebase. There is no template to apply; the architecture already exists in the code, so adopt is a conversation in which the CTO reads the repos, proposes what it believes the project is, and you correct until the derived descriptor matches reality. The team reads and confirms structure that was already there.

Adopt is that “understand first” step, and it is the CTO at project altitude (below) — no new role, just the CTO’s existing “ground yourself before acting” instinct run once over the whole project instead of once per requirement. Both on-ramps end at the same place — a project that speaks the system’s language — after which the roadmap and pipeline below run identically. (Their concrete realization is Part II, R8.)

Dev and prod — workbench and live system

A real project usually has two running environments, sitting in the workspace side by side (the scaffold creates both): xproject_dev/ (the agent’s workbench — full read/write, where tests and the browser-eyes run) and xproject_prod/ (the live system). The split is itself just a Layer-3 fact, but it carries one hard safety principle: the agent authors only in dev. Source flows dev → deploy → prod, never by editing prod directly — a direct prod edit would bypass the tests, the gate, and everything the pipeline exists to enforce. Prod is authoring-frozen.

The freeze is on authoring, not on operating: the agent may still read prod (logs, processes, status — the context it needs to debug) and, through a narrow path, deploy/restart it. “Author never on prod; operate yes, but scoped” is the whole rule. Its enforcement — at the OS level, not by instruction — is Part II, R9.
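The rule itself is small enough to state as two predicates. This is only an illustration of the policy: the text places real enforcement at the OS level (R9), not in application code, and the `/workspace` root is an assumed location. The sketch also ignores symlinks.

```python
import posixpath
from pathlib import PurePosixPath

WORKSPACE = PurePosixPath("/workspace")  # hypothetical workspace root
DEV = WORKSPACE / "xproject_dev"
PROD = WORKSPACE / "xproject_prod"


def _normalize(path: str) -> PurePosixPath:
    # Collapse ".." segments first so "xproject_dev/../xproject_prod" cannot slip through.
    return PurePosixPath(posixpath.normpath(path))


def may_author(path: str) -> bool:
    """Authoring is allowed only inside the dev tree."""
    return _normalize(path).is_relative_to(DEV)


def may_read(path: str) -> bool:
    """Reading is allowed in both trees (source, logs, status)."""
    p = _normalize(path)
    return p.is_relative_to(DEV) or p.is_relative_to(PROD)
```

Deploy and restart are deliberately absent here: they go through the narrow operating path, not through file writes.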

Layer A — the roadmap

The human talks with the CTO about the roadmap in general terms: value vs. effort, what to build next, priorities. The output of that conversation is prioritized requirement tickets. The CTO operates at three altitudes: at project altitude (once, at onboarding, it grounds itself in the whole project and confirms its architecture; this is the adopt on-ramp above), at roadmap altitude (what is worth building and in what order), and, when a requirement enters the pipeline, at requirement altitude, where it settles the approach and writes the tests. So the CTO sits above the pipeline and opens each run; it is the single conversational role you plan and argue strategy with. Roadmap and approach belong to one agent because effort can’t be priced without reasoning about approach.

The roadmap is the dev team’s planning surface, not part of the product. Its format and flow (the requirement-card shape, the Backlog→Done rules) are the team’s instructions and ship with the team from origin; the roadmap itself ships empty, and its cards accumulate per project. It is persistent state on disk: a folder, not a session and not a single file. Each requirement is its own markdown file, and the model is hybrid by design:

Coarse state is location. Two folders only: Backlog/ and Done/. A requirement is shipped because its file lives in Done/. The one thing you never want ambiguous, whether a requirement is done, is encoded as physical location, which can’t silently drift.
Planning dimensions are fields. Everything you want to slice and re-plan — version, size, value — lives in the card’s frontmatter. The UI surface derives its columns from version, so the board can group by release, filter by size, or sort by value without moving any files.
Per-requirement files prevent collisions. Each requirement is its own file, so two CTO sessions on different requirements never clobber each other — the same shared-mutable-state principle, applied to the roadmap.

So there are two different operations for two different kinds of change: moving a card between versions edits a field; marking it done moves the file. If the status field and the folder ever disagree, the folder wins — location is the source of truth that can’t rot. A CTO session is disposable; close it freely and the roadmap folder remains.

Fig 2. Disposable sessions over two kinds of persistent state: the product (dev/, prod/ — its shape a per-project fact) and the dev team, which owns the roadmap. The team ships from origin with the requirement format and flow but an empty roadmap; cards accumulate per project.

The roadmap folder holds just two folders for done-ness; the planning dimensions live inside each card as frontmatter, and the board groups by the version field:

Fig 3. Done-ness is location (two folders); version, size, value are queryable fields. The board groups by version — re-planning a release never touches the filesystem.
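A minimal sketch of the hybrid model, working on in-memory card text rather than the filesystem. The frontmatter parser handles only flat `key: value` pairs, the field values (`v1`, `L`, `in-progress`) are invented for illustration, and showing only open cards on the board is an assumption.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def parse_frontmatter(text: str) -> dict[str, str]:
    """Read simple `key: value` pairs between the leading '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_done(card_path: str) -> bool:
    """Location is the source of truth: done means the file sits in Done/."""
    return PurePosixPath(card_path).parent.name == "Done"


def board(cards: dict[str, str]) -> dict[str, list[str]]:
    """Group open cards by their `version` field; no file is moved."""
    columns: dict[str, list[str]] = defaultdict(list)
    for path, text in sorted(cards.items()):
        if not is_done(path):
            columns[parse_frontmatter(text).get("version", "unplanned")].append(path)
    return dict(columns)


cards = {
    "roadmap/Backlog/sso.md": "---\nversion: v2\nsize: L\nstatus: in-progress\n---\nRe-architect auth for SSO.",
    "roadmap/Backlog/copy-fix.md": "---\nversion: v1\nsize: S\n---\nFix button copy.",
    # The status field is stale here; the folder wins.
    "roadmap/Done/booking.md": "---\nversion: v1\nstatus: in-progress\n---\nBook an appointment.",
}
columns = board(cards)
```

Moving `sso.md` from v2 to v1 would change one frontmatter line; marking it done would move the file.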

Layer B — the per-requirement pipeline

The human picks a ticket and opens a requirement session. The pipeline runs mostly autonomously, surfacing only at gates or escalations: CTO → Design → Build → QA → Document → Review, each separated by a human gate. Research is not a fixed step — it is an on-demand portable agent, invokable in either layer, and because it produces documents its instances are safe to run in parallel. Research presents options; it does not decide.

One pipeline, no fast lanes. Every requirement runs this same flow regardless of size; there is no separate “lightweight” track for small changes. The flow compresses on its own when the work is small: a one-line copy change has a trivial CTO conversation, a one-step Design change, and a two-second gate, while still leaving behind a test, current docs, and a passing suite. A second profile would buy a near-zero saving (the heavy part is the gates, and gates already cost time proportional to content, not to the flow’s length) at the price of a track to maintain and an “is this small enough?” decision on every requirement. Uniformity is the simpler design here, and it ensures every change, large or small, leaves the same evidence behind. So size stays a planning field only; it never selects a workflow.

The agents

Who does what — and who is allowed to judge whose work.

Role	Kind	Layer	Responsibility
CTO	creator	A·B	At onboarding, once: grounds itself in the project and confirms its architecture (the adopt on-ramp). Owns the roadmap and prioritization (value vs. effort). For each requirement: settles the approach with you, fills in the requirement card (the roadmap file), and writes its tests against that card
Research	advisor	A / B	Gathers external evidence; presents options, does not decide
Design	creator	B	Produces UI against the design system; objective props auto-tested, taste judged in evaluation
Build	creator	B	Implements backend + frontend until the requirement test passes; self-tests for speed
QA	verifier	B	Runs the CTO’s tests on the deliverer’s work; routes failures back to the deliverer
Document	creator	B	Updates docs from merged, current code — after build, so it documents reality
Review	verifier	B	Independent security + cleanup audit — catches what tests can’t

Two kinds of role, one firewall. The roster splits cleanly. Creators author deliverables — the CTO writes the card and tests, Design and Build produce the work, Document writes the docs. Verifiers judge that work and author none of it — QA runs the tests, Review audits the finished code. (Research is neither: an advisor that gathers evidence and presents options without deciding.) The whole system reduces to one rule across this split — a verifier never judges anything it authored — and Part II (R5) makes it structural rather than trusted: verifiers run write-walled, literally unable to edit the code they judge, while creators hold the write access their job needs. This is also why you converse with creators (you reason with them) but only receive a verdict from a verifier (then take it to the CTO) — the gate-type split of §02, mapped onto the roster.

How judgment is split. No role judges its own work — but “judging” isn’t a single role one agent owns. Each deliverable is partitioned, and the parts go to different judges. Design is the clearest case: its measurable properties (contrast, tokens, spacing, touch targets) are tested by QA, while its taste is judged by you at the gate — one deliverable, two judges. Build is the same shape, lopsided: almost all of it is measurable → QA, with anything untestable flagged to HUMAN-TEST-QUEUE.md → you. The CTO’s card is the opposite extreme — there’s no test to run on an approach, so it’s judged entirely by you. QA owns the measurable partition wherever it exists; you own the judgment partition; Review adds an orthogonal pass over the finished code for security and cleanliness that tests can’t catch.

The CTO advises; you decide. It does not decide the how by fiat — it grounds the requirement in the codebase, lays out approaches with tradeoffs, and the how is settled in conversation with you. Same pattern as research, and as every gate: the CTO structures the decision and records it, you make it.

The requirement card is the roadmap file, filled in. The card is created in Backlog/ during roadmap planning — entering the pipeline is the CTO filling that same file in with the settled approach (the status field tracks finer progress; the file only moves once, to Done/, when it ships). One artifact per requirement accretes detail across its life rather than spawning competing documents.

Why the CTO owns all tests. The requirement test asks “does this requirement do its job?” and the regression concern asks “does it respect the rest of the system?” — both architectural questions about fit and contract. The CTO writes the filled-in card and the tests in the same step, locked together at one gate, so the tests can’t drift from the spec. This keeps QA clean: QA only ever runs tests it did not write, on work it did not build.

Fig 4. The pipeline. CTO/build make it, QA judges it; creator gates are interrogable conversations with the delivering agent, while verifier gates (QA, Review) hand you a verdict you take to the CTO.

Testing discipline

Test-first as standard, with an honest escape hatch.

Test-first is the standard

Tests are authored from the locked requirement card before the code exists, by the CTO, in the same step that fills the card in. With agents this matters more than with humans: an agent that writes code and then its own tests will tend to shape the tests to pass whatever it wrote, producing a green check that means nothing. A locked, pre-written test is an external, objective definition of done. The goalposts cannot move.

Fig 5. Why test-first matters more for agents: it stops “draw the target around the arrow after firing.”

Two scopes — both required

A requirement (acceptance) test asks “did I build this right?” — scoped to one requirement, shipping with it. A regression suite asks “did I break anything else?” — the accumulated union of every requirement test ever written. The suite is the only thing that catches “the requirement works but broke something else,” and it can only catch it because every past requirement left its test behind.

Fig 6. Authored once by the CTO, the requirement test graduates permanently into the shared suite.
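The two scopes can be illustrated with a toy system. Everything here is hypothetical (the `RegressionSuite` class, the requirement labels, the pricing function); the point is only that a later requirement can pass its own test while breaking an earlier one, and that the accumulated suite is what notices.

```python
from typing import Callable

Test = Callable[[], bool]


class RegressionSuite:
    """The accumulated union of every shipped requirement's test."""

    def __init__(self) -> None:
        self._tests: dict[str, Test] = {}

    def graduate(self, requirement_id: str, test: Test) -> None:
        """A shipped requirement leaves its test behind, permanently."""
        self._tests[requirement_id] = test

    def failures(self) -> list[str]:
        """Run every accumulated test; return the requirements that broke."""
        return [rid for rid, test in self._tests.items() if not test()]


# Toy system under test: one pricing function that two requirements touch.
system = {"price": lambda base, discount=0: max(base - discount, 0)}

suite = RegressionSuite()
suite.graduate("REQ-1 price never negative", lambda: system["price"](10, 20) >= 0)
before = suite.failures()

# A later requirement rewrites pricing; its own acceptance test passes...
system["price"] = lambda base, discount=0: base - discount
req2_ok = system["price"](100, 25) == 75
# ...but the suite still holds REQ-1, which now fails.
after = suite.failures()
```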

The human-flag escape hatch

“Test-first as standard” has a failure mode: faced with something genuinely hard to test, an agent may write a hollow test that passes but asserts nothing — worse than no test.

The rule

If an agent cannot write a meaningful automated assertion, it must not write a hollow one. It appends an entry to HUMAN-TEST-QUEUE.md describing what needs human verification and why. Honest by construction.
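A sketch of what appending to the queue might look like. The entry layout (checkbox, date, requirement label) is an assumption, since the source names the file but not its format; the functions are pure so the append-only behavior is easy to check.

```python
from datetime import date


def queue_entry(behavior: str, reason: str, requirement: str, today: date) -> str:
    """Format one HUMAN-TEST-QUEUE.md entry (layout is an assumption)."""
    return (
        f"- [ ] {today.isoformat()} {requirement}\n"
        f"  - Verify: {behavior}\n"
        f"  - Why not automated: {reason}\n"
    )


def append_entry(queue_text: str, entry: str) -> str:
    """Append-only: existing entries are never rewritten."""
    if queue_text and not queue_text.endswith("\n"):
        queue_text += "\n"
    return queue_text + entry


existing = "- [ ] 2026-01-05 checkout\n  - Verify: receipt email looks right\n  - Why not automated: visual judgment"
entry = queue_entry(
    behavior="drag-to-reorder feels smooth on a touch screen",
    reason="smoothness is perceptual; an assertion would only check that it ran",
    requirement="reorder-list",
    today=date(2026, 1, 12),
)
updated = append_entry(existing, entry)
```

An agent would then write `updated` back to the file, or open the file in append mode and write `entry` directly.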

Design is more testable than it looks

A design system turns most “subjective” design into objective conformance: approved token or off-system value? Approved component or bespoke? Spacing from the scale? Contrast meeting WCAG? All pass/fail, checked automatically. The design system is to the design agent what the requirement spec is to the CTO — the source of truth its tests assert against. Only true taste reaches the human evaluation conversation.
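Two of these checks are simple enough to sketch. The contrast calculation follows the WCAG 2.x definitions of relative luminance and contrast ratio, with 4.5:1 as the AA threshold for normal-size text; the token palette and the `check` function are hypothetical.

```python
def relative_luminance(hex_color: str) -> float:
    """WCAG 2.x relative luminance of an sRGB color such as '#1a2b3c'."""
    digits = hex_color.lstrip("#")
    channels = [int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    r, g, b = [
        c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: str, bg: str) -> float:
    """(lighter + 0.05) / (darker + 0.05), so the result is always >= 1."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


APPROVED_TOKENS = {"#000000", "#ffffff", "#1a73e8"}  # hypothetical palette


def check(fg: str, bg: str) -> list[str]:
    """Return conformance problems; an empty list means pass."""
    problems = []
    for color in (fg, bg):
        if color.lower() not in APPROVED_TOKENS:
            problems.append(f"off-system color {color}")
    if contrast_ratio(fg, bg) < 4.5:  # WCAG AA, normal-size text
        problems.append("contrast below 4.5:1")
    return problems
```

Both answers are pass/fail, so they belong to QA's measurable partition; whether the result looks good stays with the human.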

Existing code — own it all, test what earns it

Test-first is clean on a greenfield requirement. An adopted codebase (§03) is different: it arrives with legacy code and often no suite at all, so the regression net — “did I break anything else?” — has nothing to run against. The resolution comes straight from the dev-team model: a real team owns the whole codebase, legacy included, but it does not stop to fabricate exhaustive tests for code that already works — that effort buys the client nothing. Coverage is value-prioritized, not mandatory.

So writing a critical test is itself a requirement, flowing through the same pipeline and competing for priority like any other work. The main flow of a booking system — “book an appointment” — earns a test because breaking it is catastrophic; a rarely-touched admin corner may never earn one. This collapses “testing legacy” into the ordinary mechanism instead of making it a special rule. Those critical-flow tests run at the behavior level through the headless browser (“eyes” on the rendered product, R7), verifying a flow without unit-testing the code beneath it. The suite then grows forward, concentrated exactly where the team actually works.

The QA loop

Correction loops — and when to convene.

Real development is a line with correction loops. When QA finds a failure, the fix goes to the agent that delivered the work — not to QA (fixing would make it an author), and not to the human (the gate, not the debugger). The deliverer has the full context of what it built, so it diagnoses fastest. The loop is QA → deliverer → QA, and the human is outside it.

Fig 7. Routine “code doesn’t match the test yet” failures resolve autonomously. You see the work only once it passes.

When a producer is stuck — the only way back

The loop above handles “the code doesn’t match the test yet.” But sometimes the problem is deeper: the approach itself can’t work, the requirement is ambiguous, or something structural is wrong. A producer is never allowed to reopen the approach on its own — a stuck agent has every incentive to blame the spec rather than its own work, which is the self-judgment problem the system exists to prevent. So being stuck does not trigger a rewrite; it triggers a conversation.

That conversation convenes you and the CTO together — the same agent that wrote the approach, so there is no handoff — and it decides the course of action openly. It might amend the current requirement (add a clarifying constraint and continue), open a new requirement (a refactor or architectural fix, prioritized against everything else), fix a structural problem first and then resume, or some combination. This is the system’s only backward motion, and it is deliberately a human-plus-CTO decision rather than an automated transition. It is also why the unit of work is a requirement and not a “feature”: “the approach was wrong” has a natural home, because a refactor is just another requirement.

Circuit breaker · the trigger

What fires that conversation is a hard limit, so an autonomous fix-retry loop can’t spin for a day burning tokens against a wrong spec. After MAX_FIX_ATTEMPTS (default 3) failed fixes, the loop stops and hands the full failure history to the you-plus-CTO conversation above. Persistent failure is treated as an upstream problem, not a mechanical one.
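The loop and its breaker can be sketched as follows. `qa_loop`, `run_tests`, and `fix` are hypothetical names; `run_tests` stands in for QA and returns the names of failing tests, and `fix` stands in for the deliverer.

```python
from typing import Callable

MAX_FIX_ATTEMPTS = 3  # the default named in the text


def qa_loop(run_tests: Callable[[], list[str]],
            fix: Callable[[list[str]], None],
            max_fix_attempts: int = MAX_FIX_ATTEMPTS) -> tuple[str, list[list[str]]]:
    """QA -> deliverer -> QA, until green or the breaker trips.

    Returns ("pass", history) or ("escalate", history). History holds the
    failures seen on each QA run, for the human-plus-CTO conversation.
    """
    history: list[list[str]] = []
    failures = run_tests()
    attempts = 0
    while failures:
        history.append(failures)
        if attempts == max_fix_attempts:
            return "escalate", history  # persistent failure: an upstream problem
        fix(failures)                   # the deliverer fixes; QA never edits
        attempts += 1
        failures = run_tests()
    return "pass", history


# A deliverer that needs two fix attempts before the test goes green.
state = {"bugs": 2, "fix_calls": 0}

def run_tests() -> list[str]:
    return ["test_book_appointment"] if state["bugs"] > 0 else []

def fix(failures: list[str]) -> None:
    state["bugs"] -= 1
    state["fix_calls"] += 1

outcome, history = qa_loop(run_tests, fix)
```

With the default limit, a fix that never works produces three attempts and four recorded QA runs before escalation.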

Project files

Where durable state lives.

These files live in each project (Layer 3). The role definitions (Layer 2) are reused across projects. There is no canonical folder shape: a project may be one tree with back/ and front/, or several repos side by side (app/ + panel/ + back/), or anything else — the layout is a Layer-3 fact recorded in the project descriptor (CLAUDE.md), written by scaffold for new projects or derived by adopt for existing ones (§03). What the table below lists is the set of durable artifacts every project carries, wherever they physically sit.

File	Purpose	Written by	Read by
roadmap/ folder	Folders `Backlog/` and `Done/` (done-ness = location). One markdown file per requirement, carrying frontmatter fields `version`, `size`, `value`, `status`. The board groups by `version`. One file per requirement → sessions never collide	CTO	Human, pipeline, roadmap board
CLAUDE.md	Project conventions, stack, structure	Human + agents	All agents
DESIGN-SYSTEM.md	Tokens, components, rules the design agent must conform to	Human	Design, QA
HUMAN-TEST-QUEUE.md	Behaviors needing human verification because they can’t be meaningfully auto-tested	Any agent	Human (at gates)
/tests	The accumulated requirement + regression suite	CTO	Build (self-test), QA (verify)
.env / .env.example	Secrets (git-ignored) plus a value-less structure file. One of the files the agent cannot author (R9)	Human only	Human; agent reads `.env.example` structure only
permission switch	Hand-edited `armed_until` flag gating sensitive prod actions — off by default, self-expiring (R9)	Human only	Agent (read-only) · dashboard (display)
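
A sketch of how a self-expiring `armed_until` flag might be read. The source names the flag but not its format, so the ISO 8601 timestamp with an explicit UTC offset is an assumption, as is the function name `is_armed`.

```python
from datetime import datetime, timezone
from typing import Optional


def is_armed(armed_until: Optional[str], now: datetime) -> bool:
    """True only while an explicit, unexpired deadline is set.

    Missing, malformed, or timezone-less values all read as "off",
    which is the safe default.
    """
    if not armed_until:
        return False
    try:
        deadline = datetime.fromisoformat(armed_until)
    except ValueError:
        return False
    if deadline.tzinfo is None:
        return False  # refuse ambiguous local times
    return now < deadline


flag = "2026-03-01T12:00:00+00:00"
during = is_armed(flag, datetime(2026, 3, 1, 11, 0, tzinfo=timezone.utc))
after_expiry = is_armed(flag, datetime(2026, 3, 1, 13, 0, tzinfo=timezone.utc))
```

Expiry needs no cleanup step: once `now` passes the deadline the switch reads as off, even if nobody edits the file again.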

Parameters

The knobs worth tuning.

Parameter	Default	Meaning
MAX_FIX_ATTEMPTS	3	Failed QA-fix cycles before the loop stops and convenes the you-plus-CTO conversation
Pipeline roles (v1)	—	CTO → Design → Build → QA → Document → Review. Applied uniformly to every requirement; the flow compresses naturally when the work is small (no separate track)
Research parallelism	on	Research instances may run concurrently across requirements / projects

Part I is the architecture — the why: provider-neutral, slow-moving, true regardless of tools. Part II is the realization — the how: the concrete stack that runs it, which is tool-specific and faster-moving. They are kept apart on purpose. Nothing here changes the architecture, and Part I names nothing here — so when a price, policy, or product shifts (in one week of 2026 the SDK billing went fine → metered → paused), only this part has to move.

Three layers of portability

Locked · Where commitment is cheap, and where it must be contained.

The system runs on specific tools, but commitment to them is layered by how expensive it is to reverse. Three layers, most portable to most locked:

The architecture (Part I) — names no provider; pure reasoning. Fully portable, and free to keep that way, because tooling decisions simply never enter it. If the stack became untenable, the thinking transfers wholesale.
The capability manifest — for every provider-locked organ adopted, a short, provider-neutral record of what it does for us and what replacing it would require. Portable knowledge about a native dependency; it does not wrap the tool and costs nothing at runtime — it is prose written once.
The native implementation — the actual stack (Claude Code, the Agent SDK, specific skills), used at full power, unabstracted, and confined at the seams. Provider-locked, but small and known — because the manifest documents it.

The principle: depend natively; document the dependency’s contract provider-neutrally. The alternative, a runtime abstraction layer wrapping every provider, is rejected: it drops the stack to the lowest common denominator, forfeits the native fit that is the whole reason to choose these tools, and is a maintenance burden forever. Knowledge is portable and cheap; abstraction is operational and expensive. We buy the first.

Manifest entry — the shape

The contract is capabilities plus interface, never implementation. For a design skill, the capabilities are: enforces design-system tokens; generates variants for human selection; flags “AI-slop”; runs accessibility checks. The interface is: invoked in-session, takes a spec, returns styled code + variants. To replace it, we would need token-enforcement + variant-generation + a11y checks; the taste-memory portion is ours and already portable. Two failure modes sit on either side: “Design skill, replace if migrating” is too vague to act on, while copying its internal logic rebuilds it on paper and goes stale. The contract is the middle, and writing it also reveals the boundary between what is ours and what is the provider’s.

Managing provider risk

Locked · Three tools, each applied where it is cheapest.

Provider risk is handled by three tools acting at different moments — and the discipline is to use each only where it costs the least.

Prefer agnostic — at selection (proactive). When choosing a dependency, favour one that already works across providers. A weighted criterion, not a veto: a meaningfully better locked option may still win, but then the choice is conscious and earns a manifest entry. Where two options are comparable, the portable one wins.
The capability manifest — after adoption (reactive). Documents what a locked organ does, so a future migration is a scoping exercise, not an archaeology dig.
The one-seam discipline — in the code (structural). Confine every provider-touching call to a single place — the way a prior project kept its whole model invocation in one file. Migration then has a known, small blast radius instead of being smeared across the system.

These specialise naturally, because dependencies fall into two kinds. Knowledge-layer dependencies — skills, prompts, role definitions, methods — are prose, so they tend to be portable already (a good design skill reads on any capable model); prefer-agnostic dominates here and is nearly free. Runtime-layer dependencies — the harness, hooks, tool-permission mechanics, the SDK — are machinery, so they tend to be locked; the manifest and the one-seam discipline do their work here. None of the three requires a line of abstraction code — they are selection discipline and documentation.
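The one-seam discipline can be checked without any abstraction layer, for instance with a small lint that flags provider imports outside the seam. This is a sketch for a Python codebase; the seam path and the provider module name are placeholders for whatever the real stack uses.

```python
import ast

PROVIDER_MODULES = {"anthropic"}   # placeholder: the provider SDK's import name
SEAM = "back/model_client.py"      # hypothetical: the one file allowed to import it


def imported_roots(source: str) -> set[str]:
    """Top-level module names imported by a piece of Python source."""
    roots: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            roots.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            roots.add(node.module.split(".")[0])
    return roots


def seam_violations(sources: dict[str, str]) -> list[str]:
    """Files other than the seam that import a provider module."""
    return [
        path for path, text in sorted(sources.items())
        if path != SEAM and imported_roots(text) & PROVIDER_MODULES
    ]


sources = {
    SEAM: "import anthropic\n",
    "back/routes.py": "from anthropic import Anthropic\n",
    "back/util.py": "import os, json\n",
}
violations = seam_violations(sources)
```

The lint only parses source text; it never imports the provider package, so it runs anywhere.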

The stack, so far

Slots filled, leaning, and deferred.

Slot	Choice	Status
Runtime	Claude Code, run inside Cursor (free tier). Editor swappable to VS Code or Zed — reversible, affects nothing downstream	locked
Host	EC2, reached over SSH from desktop and laptop — sessions live server-side, so they’re available from either machine. Interactive over SSH is still interactive → subscription billing	locked
Dashboard	Committed (see R6). Node back + React/TS front; nginx only if served over the network. The dev-team management surface — roadmap, skills, md files, gate review	committed · stack leaning
Surface & billing	Planning in the dashboard (headless chat → the metered side) · building in Cursor (interactive → subscription). The two axes line up: the heavy work stays interactive/subscription, only the light planning chat is headless. Currently moot — metering paused (see R6)	locked
Context / memory	File-based for the build agents; a database per product feature (see R4)	principle locked
Tool access (MCP)	Composio leaning; to be pressure-tested	open
Skills / instructions	Anthropic Skills format as the container; prefer cross-provider skills	leaning
Knowledge / method	A reading list to internalise — an action, not a decision	n/a
Infra / deploy	Deferred — a solved problem that picks itself later	deferred

Why Claude Code is the locked runtime, and why the lock is acceptable: it is the model-maker’s own harness, so it lines up natively with the Skills format and the model layer with no translation; and its core loop — an interactive conversation that can also act — is the architecture’s gate model. The commitment is contained by R1–R2: the harness is a runtime-layer dependency, documented and seam-confined, not abstracted away.

Two worlds of memory

Locked · Build-time context for the agents vs. runtime data for the product.

“Memory” is two unrelated problems that must never be merged.

World 1 — context for the agents that build. The docs/, skills, rules and CLAUDE.md files, plus the codebase itself. Curated, human-approved, in-repo, file-based. Build-time memory: it shapes how the agents work.
World 2 — data the product stores and serves. Listings, people, whatever the app holds — in a real database (Postgres, a vector store, ClickHouse), chosen per feature when that feature is built. Product RAG, if any, lives here. Runtime data: what the running app reads for its users.

The test for which world a thing belongs to: does an agent read it while building, or does the app read it while running? A design decision the build agent needs is World 1, a file. A user’s profile the app serves is World 2, a database. The agents will build World 2 databases; they do not use them as their own memory.

World 1 stays files — not a retrieval index — for a concrete reason: Claude Code does not embed or index the repository. It navigates like an engineer (path globbing, content grep, reading whole files), so the leverage is well-organised, current, curated authored context that tells that search where to look. A retrieval layer over your own curated docs would reintroduce staleness and context-poisoning for near-zero benefit, since curated docs stay small enough to read directly. Semantic retrieval (e.g. a Milvus-backed code-search server over MCP) is an optional add-on, justified only if a repository outgrows the context window.

How the agents are realized

Locked · Creators are modes you converse with; verifiers are walled workers that report.

The architecture’s “agents” map onto two Claude Code primitives. Each has a property the mapping depends on, and the mapping is now settled.

A sub-agent is an isolated, stateless worker — its own context window, its own tool allowlist. Excellent for dispatch and for enforced separation (a verifier given no write access literally cannot edit the code it judges). It cannot hold a back-and-forth: it runs once and closes.
A skill runs in the main session as injected knowledge — conversational, stateful, able to chain to other skills, but sharing one context window.

The deciding test — do I need to converse with this role, or only receive its output? — resolves the mapping:

Creators are conversational modes. The CTO and Design run as skills in the main session, because you reason with them — settling an approach, confirming an adopted architecture, judging taste are all dialogues.
Verifiers are walled sub-agents. QA and Review run dispatched, each with an allowlist that omits Write and Edit, so a verifier structurally cannot touch the code it judges. They run once and emit a verdict; you do not converse with them.
The remaining roles fall out by the same test. Build and Document are creators that do hold write access (they author code and docs), but you mostly receive their output and review it at a gate rather than reasoning through it turn by turn — so they run dispatched, just not walled. Research is a read-only advisor, dispatched and parallelizable (§03). The firewall-critical line is the one that must be structural: write-walled verifiers. Conversational-vs-dispatched is a comfort detail; write-allowed-vs-write-walled is the security boundary.
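The structural half of the firewall can be expressed as a check over tool allowlists. `Write` and `Edit` are the tool names given above; the other tool names and the exact per-role sets are assumptions made for illustration.

```python
WRITE_TOOLS = {"Write", "Edit"}
VERIFIERS = {"QA", "Review"}

# Hypothetical allowlists for the dispatched roles. QA's Bash entry is here so
# it can run the suite; a real wall would also have to limit what that shell
# may write, which this sketch does not model.
ALLOWLISTS = {
    "QA":       {"Read", "Grep", "Glob", "Bash"},
    "Review":   {"Read", "Grep", "Glob"},
    "Build":    {"Read", "Grep", "Glob", "Bash", "Write", "Edit"},
    "Document": {"Read", "Grep", "Glob", "Write", "Edit"},
    "Research": {"Read", "Grep", "Glob"},
}


def firewall_breaches(allowlists: dict[str, set[str]]) -> list[str]:
    """Verifiers whose allowlist would let them edit what they judge."""
    return sorted(
        role for role in VERIFIERS
        if allowlists.get(role, set()) & WRITE_TOOLS
    )


breaches = firewall_breaches(ALLOWLISTS)
```

A check like this could run whenever a role definition is edited, so the wall never depends on an instruction being followed.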

The conflict that looked real — you might want to interrogate a reviewer’s finding (→ converse) yet also need it write-walled (→ dispatch) — dissolves by separating the moments. The audit runs walled; then, if you want to understand or act on a finding, you talk to the CTO, who reads the report and can spawn a fix requirement. The verifier that judged never discusses; the creator you discuss with never judged — so authorship and verification stay split even inside the conversation. A verifier emitting only a verdict is the simplest possible walled thing; giving it a voice is surface we don’t need.

This refines Part I’s “no agent judges its own work”: the firewall now runs structurally along the mode/agent line, enforced by tool-scoping rather than by instruction. It also resolves which gates (Part I, §02) converse and which inspect — creator gates are conversations; verifier gates are verdicts you inspect and take to the CTO.

The human-facing surface

Locked · You review at the gates; a committed dashboard manages, Cursor implements.

One correction is settled: you are the reviewer at the gates, not the dispatcher between steps. The pipeline auto-advances and stops for your approval; the QA fix-loop runs without you, surfacing only on a pass or at the circuit breaker. Mechanically this is a deterministic gate — a hook that pauses for confirmation at each role boundary, which holds even in fast or headless modes.
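As a rough sketch of that behaviour: the gate is a deterministic decision at each role boundary, and the QA fix-loop runs unattended until it either passes or trips the three-strikes circuit breaker. The function names, the `Surface` shape, and the callback signatures below are assumptions for illustration; the real mechanism is a Claude Code hook, which is not reproduced here.

```typescript
type QaVerdict = "pass" | "fail";

interface Surface {
  reason: "pass" | "circuit_breaker"; // the only two ways the loop reaches the human
  attempts: number;                   // QA runs performed
}

const MAX_STRIKES = 3; // three-strikes circuit breaker

// Runs QA, applies a fix after each failure, and stops on the first pass
// or after MAX_STRIKES failed QA runs. No human input inside the loop.
function runFixLoop(
  runQa: (attempt: number) => QaVerdict,
  applyFix: (attempt: number) => void,
): Surface {
  for (let attempt = 1; attempt <= MAX_STRIKES; attempt++) {
    if (runQa(attempt) === "pass") return { reason: "pass", attempts: attempt };
    if (attempt < MAX_STRIKES) applyFix(attempt);
  }
  return { reason: "circuit_breaker", attempts: MAX_STRIKES };
}

// Deterministic gate: crossing from one role to the next pauses unless the
// human has approved the hand-off. Staying inside a role never pauses.
function gateDecision(prevRole: string, nextRole: string, approved: boolean): "advance" | "pause" {
  if (prevRole === nextRole) return "advance";
  return approved ? "advance" : "pause";
}
```

Because `gateDecision` depends only on its inputs and not on any prompt, the same pause occurs in interactive, fast, or headless runs.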

The management surface is now committed, and split by tool. A custom dashboard handles managing the team: the kanban roadmap, editing skills and the .md context files, reviewing at the gates, and the planning conversations with the CTO (writing requirements, settling approach, discussing a verdict) — hosted as a chat inside the dashboard. Cursor + the Claude Code plugin handles implementing features — coding belongs in the editor where the code lives. They sit side by side over the same files (roadmap folder, skills, descriptors): editing a skill in the dashboard and using it in Cursor are the same file, two windows.

The hard constraint, refined: the dashboard may drive Claude Code — it hosts the planning chat by spawning Claude headlessly (the Amiko pattern, below) — but it must never own the pipeline or re-implement agent logic. The line was never “don’t drive Claude”; it is “don’t become the brain.” Claude Code plus the role files stay the engine and the orchestrator; the dashboard hosts conversation, editing, and the board, and triggers runs. Drive Claude: fine. Own the pipeline: never.

Stack: Node back + React/TS front (nginx only if served over the network, not for localhost). React over the more-familiar Vue for one reason specific to this system: AI assistants generate React with fewer correction loops than Vue’s template SFCs, because JSX + TS is a far larger share of training data — and a system whose whole job is AI-built products should lean toward what the agents build most reliably. The learning-curve cost lands on the one hand-built component (the dashboard), which — internal, single-user, low-stakes — is also the safest place to pay it; and the skill that actually matters is reading React well enough to gate it, not writing it, since the agents write it. That AI-affinity is also what makes React the choice for the current scaffold (scaffold_node_react, R8) — but that is the scaffold’s property, not a global product law; a different scaffold carries a different stack. The dashboard’s own React is a separate committed choice for this one hand-built component.

How the dashboard hosts that chat, and the billing it implies. Amiko already proves the mechanism: it drives Claude Code by spawning the CLI in headless mode (claude -p / --print, NDJSON streamed into the UI), not through the Agent SDK, though either works. That headless driving sits on the metered side of Anthropic’s interactive-vs-programmatic line: interactive use draws from the subscription, while claude -p and the SDK are both metered. That change was announced for June 2026 and then paused, with a signal that it will return in some form, so for now everything still draws from the subscription. The allocation this produces is the right way round: planning conversations (light) run headless in the dashboard, on the metered side; building features (heavy) runs interactive in Cursor, on the subscription. The expensive work stays on the flat plan; only the cheap chat touches the meter.
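The streaming half of that mechanism can be sketched without committing to any particular event schema. The hypothetical `NdjsonBuffer` below only handles the NDJSON framing: output arrives in arbitrary chunks, and each complete line is one JSON value. Spawning the CLI and interpreting the events are left out, and nothing here is taken from Amiko's code.

```typescript
// Accumulates raw stdout chunks and yields one parsed value per complete line.
// Hypothetical helper; the event payloads themselves are treated as opaque.
class NdjsonBuffer {
  private pending = "";

  // Feed one chunk; returns the events completed by this chunk.
  // A malformed complete line makes JSON.parse throw, which is left to the caller.
  push(chunk: string): unknown[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is either "" (chunk ended on a newline) or a partial line.
    this.pending = lines.pop() ?? "";
    const events: unknown[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue;
      events.push(JSON.parse(trimmed));
    }
    return events;
  }
}
```

In a Node back end, each `data` chunk from the child process's stdout would be passed to `push`, and the returned events forwarded to the UI.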

Reference architectures

Steal patterns; do not adopt the whole.

Opinionated complete setups (gstack and others) are someone else’s assembly of these same slots. They are reference implementations to raid, not foundations to fork: forking inherits the author’s sprawl and agenda and undermines the rule that the human must understand every piece. The method is a curated assembly: own the spine, borrow self-contained organs, and adapt each to the spine’s conventions, gated by three questions. What does it do that I need, stripped of the rest? What does it assume that I must reconcile? Can I explain how it works?

Patterns already validated against this architecture, worth taking: explicit role “gears” (planning ≠ review ≠ shipping); a three-strikes circuit breaker (the same number reached independently here); an auto-advancing pipeline that surfaces only taste decisions, by classifying each decision mechanical-vs-taste; a decision audit-trail written to disk rather than accumulated in context; curated, prunable, file-based memory rather than auto-capture; a headless browser giving the QA and Design roles real “eyes” on the rendered product; anti-sycophancy phrase-bans in the planning role; and a blind reviewer that sees the artifact, not the reasoning. The convergence is the validation; the divergences — stricter test-first authorship, a single merged CTO — are deliberate and arguably cleaner for a from-scratch build.

Distributing & installing the dev team

Locked · Central origin, vendored clones, one paste to install.

The role definitions, skills, and config are one repo — the dev team — distributed by git’s ordinary fork/upstream model, chosen deliberately over a single shared install:

Central origin holds the canonical team; generic improvements are made there.
Vendored clones — each project git clones the team into its workspace, getting its own copy. Per-project edits (a new skill, an adapted one, the project’s requirements) live only in that clone and never flow back to origin. Generic updates flow the other way, origin → pull → project, opt-in per project.
Manual-merge on conflict is the accepted cost: when a project has locally edited a skill that origin also changed, pull conflicts — the correct signal (your local divergence overlaps an upstream change), not noise.
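The three flows above can be made concrete with a small forecast of what a pull from origin will do. This is a file-level approximation under stated assumptions: git actually merges at the level of changed regions, so a file edited on both sides may still merge cleanly. The function name, the result shape, and the example paths are hypothetical.

```typescript
interface PullForecast {
  cleanUpdates: string[];   // changed upstream only: expected to arrive without help
  needsAttention: string[]; // changed on both sides: may need a manual merge
  localOnly: string[];      // changed locally only: stays in this project's clone
}

// Compares the files a project has edited locally with the files origin changed.
function forecastPull(localEdits: string[], upstreamChanges: string[]): PullForecast {
  const inLocal = (p: string) => localEdits.indexOf(p) !== -1;
  const inUpstream = (p: string) => upstreamChanges.indexOf(p) !== -1;
  return {
    cleanUpdates: upstreamChanges.filter((p) => !inLocal(p)),
    needsAttention: upstreamChanges.filter(inLocal),
    localOnly: localEdits.filter((p) => !inUpstream(p)),
  };
}
```

An entry in `needsAttention` is the signal described above: the project's own divergence overlaps an upstream change, so the human decides how the two combine.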

This is the opposite of Amiko’s choice, deliberately: Amiko symlinked every user to one global skills dir because users must not diverge. Here divergence is the point — each project tunes its own team — so clone-per-project is right and symlink-to-global is wrong. (Part I’s “Layer 2 is reused across projects” still holds; “reused” means cloned-from-one-origin, not one-shared-copy.)

Workspace layout

workspace/
├─ xproject_dev/   — product · the workbench
├─ xproject_prod/  — product · the live system
└─ dev_team/       — the cloned team (Layer 2)
   ├─ roadmap/   — Backlog/ · Done/ · cards fill per project
   ├─ skills/
   └─ … method, config, dashboard

The product (xproject_dev/, xproject_prod/) and the team (dev_team/) sit side by side; the roadmap lives inside the team because it is the team’s planning surface, not product code (Part I, §03). The operative .claude/ files are symlinked up to the workspace root, as described under Install below.

Install — one paste, then a script

The operative files Claude Code reads — .claude/skills/, settings.local.json, allowlists, CLAUDE.md — must sit at workspace root, alongside the product repos, because that is where Claude Code is opened. But the team clones into a dev_team/ subfolder. So install is not a bare clone; it must clone and place files one level up:

A copy-paste instruction to Claude Code at workspace root — small enough to live in the origin README, just “clone repo X, run script Y”. The only thing outside the team repo is that one trigger; all real logic is script Y, which lives inside the team and is versioned with it.
Script Y hoists by symlink, not copy: it links root .claude/ to the clone’s, so editing a skill “at root” is editing the team’s own file — already inside the clone, hence committable to that project’s vendored team for free. (Copy would fork root and clone into drift; symlink keeps one real copy. Same mechanism Amiko used, here serving divergence instead of sharing.)
Script Y then runs init_dev_team, which branches on new-vs-existing: new → scaffold (pick a stack template) writes the descriptor; existing → adopt (the CTO confirmation conversation, §03/§04) derives and confirms it. From there both feed the identical pipeline.
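The two decisions script Y makes can be sketched as pure planning functions: which symlinks to create at workspace root, and which on-ramp `init_dev_team` takes. This is an illustration only. The actual script is not shown in this document, and the helper names, the `LinkStep` shape, and the exact list of hoisted paths are assumptions.

```typescript
interface LinkStep {
  linkPath: string; // path created at workspace root
  target: string;   // real file or folder inside the cloned team
}

// Plans one symlink per operative path, so the root entry and the clone's
// entry are the same file rather than two copies that drift apart.
function planHoist(teamDir: string, operativePaths: string[]): LinkStep[] {
  return operativePaths.map((p) => ({ linkPath: p, target: teamDir + "/" + p }));
}

type OnRamp = "scaffold" | "adopt";

// New workspace: scaffold writes the descriptor.
// Existing product: adopt derives and confirms it with the CTO.
function chooseOnRamp(hasExistingProduct: boolean): OnRamp {
  return hasExistingProduct ? "adopt" : "scaffold";
}

// Example plan for a workspace whose team was cloned into dev_team/.
// The two hoisted paths are assumed for illustration.
const examplePlan = planHoist("dev_team", [".claude", "CLAUDE.md"]);
```

Keeping the plan separate from the filesystem calls makes the install step easy to inspect before anything is linked.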

So Part I’s two on-ramps have a concrete home: they are the two branches of init_dev_team. For now the library holds one template, which scaffolds both environments at once:

back/ — Node + Express (the API)
app/ — React + TypeScript (the authenticated application)
web/ — Astro (home, landing, blog — public, no auth; runs React as islands, so React stays the component language)

All three are produced under both xproject_dev/ and xproject_prod/. The two-frontend split is by auth boundary: public content is static, fast, and crawlable, structurally isolated from the logged-in app. This is also the dashboard’s own stack, so the system dogfoods its own scaffold.

But this React stack is a property of this scaffold, not a system-wide law: each scaffold is a self-contained skill named for what it produces (scaffold_node_react today, a future scaffold_vue or scaffold_ecommerce carrying its own stack). Selecting a scaffold is selecting the stack, which keeps the spine stack-agnostic. It is the same discipline as layout-agnosticism, applied to the stack: the opinion lives in a swappable skill, never in the architecture. The library may grow on one rule: every scaffold converges on the same handoff, a project with a valid Layer-3 descriptor, which keeps the pipeline ignorant of which scaffold produced it. Starting with one, not a speculative library, is the deliberate guard against the “zoo.”
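The convergence rule can be pictured as a shared interface: each scaffold carries its own stack, but all of them return the same kind of descriptor, and the pipeline only ever sees that. The sketch below assumes a descriptor shape, since the real Layer-3 descriptor fields are not specified here; `Descriptor`, `Scaffold`, `isValidDescriptor`, and `initProject` are hypothetical names.

```typescript
// Assumed descriptor shape, for illustration only.
interface Descriptor {
  project: string;
  scaffold: string;
  apps: Record<string, string>; // folder name -> what it runs
  environments: string[];
}

interface Scaffold {
  id: string;
  produce(projectName: string): Descriptor;
}

const scaffoldNodeReact: Scaffold = {
  id: "scaffold_node_react",
  produce(projectName) {
    return {
      project: projectName,
      scaffold: "scaffold_node_react",
      apps: { back: "Node + Express", app: "React + TypeScript", web: "Astro" },
      environments: [projectName + "_dev", projectName + "_prod"],
    };
  },
};

function isValidDescriptor(d: Descriptor): boolean {
  return d.project.length > 0 && Object.keys(d.apps).length > 0 && d.environments.length > 0;
}

// The only thing the pipeline depends on: a valid descriptor, whichever scaffold made it.
function initProject(scaffold: Scaffold, projectName: string): Descriptor {
  const descriptor = scaffold.produce(projectName);
  if (!isValidDescriptor(descriptor)) {
    throw new Error(scaffold.id + " did not produce a valid descriptor");
  }
  return descriptor;
}
```

A future scaffold would implement the same `Scaffold` interface with different `apps`, and `initProject` would not change.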

Environments & prod access

Locked · Author never on prod; operate yes, but OS-enforced.

Part I’s principle — the agent authors only in dev, prod is authoring-frozen — becomes a concrete, OS-level safety model. The reach splits in two:

Reads → the CTO, freely. Inspecting prod (logs, processes, nginx, status) is diagnosis, and needs the project context the CTO already has. No separate “DevOps” role — that would conflate reading (needs context) with mutating (needs containment); only the second needs walling.
Mutations → the deploy skill, scoped. Pull / build / restart / edit named config run as one command, not a conversational role. Deploy is a skill, not a gate: the real gate already happened in dev (the requirement passed QA and your evaluation before it could ship), so re-confirming at prod would be asking “are you sure?” about something already approved.

The armed switch

Sensitive prod actions sit behind a switch that is off by default, armed only in the window you’re actively supervising. It is a hand-edited config file carrying an armed_until timestamp (not a bare bool): the agent treats prod as armed only while now < armed_until, so it self-expires — forgetting to disarm can’t leave prod exposed, with no background process, just a comparison at read time. The dashboard shows the state (loudly) but does not write it; you edit the file by hand. Removing the write-path is the simplification — there is no dashboard button to misuse.
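The self-expiring check is small enough to sketch in full. The field name armed_until comes from the text above; the ISO 8601 timestamp format, the `ArmedSwitch` type, and the fail-closed handling of an unreadable value are assumptions.

```typescript
// Assumed file content after parsing, e.g. { "armed_until": "2030-01-01T12:00:00Z" }.
interface ArmedSwitch {
  armed_until: string; // timestamp, assumed to be ISO 8601
}

// Prod counts as armed only while now < armed_until.
// There is no timer: the comparison happens each time the switch is read.
function isArmed(sw: ArmedSwitch, now: Date): boolean {
  const until = Date.parse(sw.armed_until);
  if (isNaN(until)) return false; // unparseable timestamp: treat as disarmed
  return now.getTime() < until;
}
```

A caller would read the file, parse it, and pass the current time, for example `isArmed(parsed, new Date())`. Passing `now` explicitly keeps the check deterministic and easy to test.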

OS-enforced, not instruction-enforced

The boundary that matters is below Claude’s world. --allowedTools and CLAUDE.md govern what the agent is told it may do, which is negotiable in principle. So the real lock is the operating system: the agent runs as an unprivileged user; the armed-switch file, the env files, and the prod source folders (xproject_prod/back, /app, /web) are owned by a more privileged user. The agent reads what it’s allowed and is structurally barred from writing the rest: no prompt or jailbreak grants an unprivileged user write access to a root-owned file. Deploy elevates through a narrow per-command sudo rule (just the deploy commands), never blanket sudo.

This forms a small class of files the agent cannot author: env files (git-ignored, human-only; the agent never sees secret values, only a .env.example structure), the armed switch (agent-readable, not writable), and prod source (deploy-writable, agent-frozen). The freeze is on authoring: deploy must still write xproject_prod/back when it pulls and builds; the agent’s ordinary Edit must not. Four independent guards stack into defense in depth: when (the switch), secrets (env-exclusion), source (frozen folders), enforcement (OS ownership). An accident must pass all four to do damage.
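The file classes can be written down as a table of what the agent may do with each. The sketch mirrors the policy only; the enforcement itself is OS file ownership, which no application code replaces. The path conventions in `classify`, the armed-switch file name, and the assumption that the agent may read prod source are all hypothetical.

```typescript
type FileClass = "env" | "armed_switch" | "prod_source";

interface AgentAccess {
  read: boolean;
  write: boolean;
}

// What the agent (not deploy, not the human) may do with each class.
const agentAccess: Record<FileClass, AgentAccess> = {
  env:          { read: false, write: false }, // secrets: never seen, never written
  armed_switch: { read: true,  write: false }, // readable so the agent can check it
  prod_source:  { read: true,  write: false }, // read access assumed; authoring frozen
};

const ARMED_SWITCH_FILE = "armed.json"; // hypothetical file name

// Maps a workspace-relative path to its class, or null if it is unprotected.
function classify(path: string): FileClass | null {
  const base = path.split("/").pop() ?? "";
  const isEnv = base === ".env" || (base.indexOf(".env.") === 0 && base !== ".env.example");
  if (isEnv) return "env";
  if (base === ARMED_SWITCH_FILE) return "armed_switch";
  if (path.indexOf("xproject_prod/") === 0) return "prod_source";
  return null;
}

function agentMay(action: "read" | "write", path: string): boolean {
  const cls = classify(path);
  if (cls === null) return true; // outside the protected classes, e.g. the dev workbench
  return agentAccess[cls][action];
}
```

A table like this is useful as documentation and as a test fixture for the ownership setup: each row states what the file permissions on disk should make true for the agent's user.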

Amiko — a conscious reversal

Amiko deliberately skipped OS-user separation (chmod) because on a single-user box it would have blocked the developer too and bought nothing: there was no second actor to protect the system from. Here the agent is the thing being constrained, so the separation is the entire point. The complexity Amiko avoided is the complexity that buys the unbypassable switch.