System Architecture
A small set of role-specialized agents, run in sequence, each handing locked work to the next — with a human at the gates and no agent grading its own homework.
What the whole system reduces to.
Development is not one agent doing everything. It is a small set of role-specialized agents, run sequentially, where each step hands its finished, locked output to the next. A human stays in control at evaluation gates between steps. The work that builds the product is judged by agents that did not build it.
Three principles generate everything else in this document:
Read this before the rules.
An agent does not “contain” knowledge that persists between runs. Every run is assembled fresh from four layers stacked into one context window:
| Layer | What it is | Portable? |
|---|---|---|
| 1 · Model weights | General reasoning & coding ability | yes — shared |
| 2 · Role definition | “You are a security reviewer who…” | yes — reused across projects |
| 3 · Project context | This codebase, its decisions, conventions | no — per project |
| 4 · Live task + state | The current prompt + files right now | no — this moment |
The agent is Layer 2 — a reusable role template. The project is Layer 3, loaded in at runtime. A well-written role agent works on any project, because the project-specific part is injected at invocation, never baked in.
A role definition is a template. Invoking it creates an instance: a running context bound to one project for its lifetime. The same definition can spawn many independent instances — research on Project A and Project B at once. They share no state and cannot interfere, because the project-binding happens at instance creation, not in the definition.
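The template-versus-instance distinction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `RoleDefinition`, `ProjectContext`, `Instance`, and `instantiate` are hypothetical, and the source does not prescribe any data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RoleDefinition:
    """Layer 2: a portable role template with no project baked in."""
    name: str
    prompt: str


@dataclass(frozen=True)
class ProjectContext:
    """Layer 3: per-project facts, loaded at invocation."""
    project: str
    conventions: str


@dataclass
class Instance:
    """A running context bound to exactly one project for its lifetime."""
    role: RoleDefinition
    context: ProjectContext
    state: list[str] = field(default_factory=list)  # private to this instance

    def assemble(self, task: str) -> str:
        # Layers 2, 3, and 4 stacked into one context window
        # (Layer 1, the model weights, sits underneath).
        return "\n\n".join([self.role.prompt, self.context.conventions, task])


def instantiate(role: RoleDefinition, context: ProjectContext) -> Instance:
    # Project binding happens here, at instance creation, never in the definition.
    return Instance(role=role, context=context)


research = RoleDefinition("research", "You gather evidence and present options.")
on_a = instantiate(research, ProjectContext("A", "Project A conventions"))
on_b = instantiate(research, ProjectContext("B", "Project B conventions"))

on_a.state.append("note for A")  # does not leak into on_b
```

Both instances share the same immutable definition but hold separate state, which is the property that makes parallel instances safe.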
Isolating files (e.g. git worktrees) prevents file collisions. It does not prevent semantic staleness: an agent that branched before a core refactor landed will build against a reality that no longer exists, and merge cleanly but wrong — worse than a conflict, because it’s silent.
Parallelism is safe only where agents share no mutable state. Document work (research, specs) shares no state with code, so it is safe to parallelize. Code-writing agents all mutate the same files, so they must be sequenced.
Between every step there is a gate that does three jobs at once: control (the human reviews one small, readable artifact before the next step builds on it), the staleness cure (each step starts only after the prior artifact is locked, so it can’t start from stale input), and the handoff (the locked output of step N is the input to step N+1). Most gates are a conversation with the agent that produced the artifact — not a bare approve/reject button — so the human can interrogate the reasoning and adjust in place. But not every gate is a dialogue. Where the upstream step is a verifier (it judges rather than authors), its gate is an inspection: you read a verdict and the evidence behind it, and any conversation about it happens with the agent that can act on it — the CTO — not the verifier itself. Which gates converse and which are inspections is settled in realization (Part II, R5).
Two layers: a roadmap above, a per-requirement pipeline below.
The unit of work is a requirement — any unit of change to the system: a new capability, a refactor, a migration, an architectural fix. Not only new user-facing features. That word is load-bearing: it means the same pipeline that builds a screen also handles “auth needs SSO now, re-architect it,” so structural changes have a home instead of being exceptions.
The pipeline is layout-agnostic. It never hard-codes a folder shape: what an agent needs is not “edit front/” but “here is where the frontend lives, here are its conventions” — and that is Layer 3 (project context), injected at runtime, never baked into a role. So a single tree with back/ and front/, a three-repo workspace (app/ + panel/ + back/), or any other arrangement are all just different Layer-3 facts. The architecture absorbs the variation in the project descriptor; the roles stay portable.
The mental model is a dev team brought onto a project: first it understands what exists, then it works. A project reaches a valid Layer-3 descriptor by one of two on-ramps, which converge on the same handoff so the pipeline never knows which ran:
Adopt is that “understand first” step, and it is the CTO at project altitude (below) — no new role, just the CTO’s existing “ground yourself before acting” instinct run once over the whole project instead of once per requirement. Both on-ramps end at the same place — a project that speaks the system’s language — after which the roadmap and pipeline below run identically. (Their concrete realization is Part II, R8.)
A real project usually has two running environments, sitting in the workspace side by side (the scaffold creates both): xproject_dev/ (the agent’s workbench — full read/write, where tests and the browser-eyes run) and xproject_prod/ (the live system). The split is itself just a Layer-3 fact, but it carries one hard safety principle: the agent authors only in dev. Source flows dev → deploy → prod, never by editing prod directly — a direct prod edit would bypass the tests, the gate, and everything the pipeline exists to enforce. Prod is authoring-frozen.
The freeze is on authoring, not on operating: the agent may still read prod (logs, processes, status — the context it needs to debug) and, through a narrow path, deploy/restart it. “Author never on prod; operate yes, but scoped” is the whole rule. Its enforcement — at the OS level, not by instruction — is Part II, R9.
The human talks with the CTO about the roadmap in general terms: value vs. effort, what to build next, priorities. Its output is prioritized requirement tickets. The CTO operates at three altitudes: at project altitude (once, at onboarding — it grounds itself in the whole project and confirms its architecture, the adopt on-ramp above), at roadmap altitude (what is worth building and in what order), and — when a requirement enters the pipeline — at requirement altitude, where it settles the approach and writes the tests. So the CTO sits above the pipeline and opens each run; it is the single conversational role you plan and argue strategy with, and effort can’t be priced without reasoning about approach, which is why those two conversations are one agent.
The roadmap is the dev team’s planning surface, not part of the product: its format and flow (the requirement-card shape, the Backlog→Done rules) are the team’s instructions, shipping with it from origin but empty; its cards accumulate per project. It is persistent state on disk — a folder, not a session and not a single file. Each requirement is its own markdown file, and the model is hybrid by design:
- Done-ness is two folders, Backlog/ and Done/. A requirement is shipped because its file lives in Done/. The one thing you never want ambiguous (done or not) is physical location, which can’t silently drift.
- The planning dimensions (version, size, value) live in the card’s frontmatter. The UI surface derives its columns from version, so the board can group by release, filter by size, or sort by value without moving any files.

So there are two different operations for two different kinds of change: moving a card between versions edits a field; marking it done moves the file. If the status field and the folder ever disagree, the folder wins: location is the source of truth that can’t rot. A CTO session is disposable; close it freely and the roadmap folder remains.
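A minimal sketch of the hybrid model, assuming each card opens with flat `key: value` frontmatter between two `---` lines. The helper names (`read_frontmatter`, `is_done`, `board`, `mark_done`) and the `"unplanned"` fallback column are hypothetical; the source fixes only the folder names and the frontmatter fields.

```python
from collections import defaultdict
from pathlib import Path


def read_frontmatter(card: Path) -> dict[str, str]:
    """Parse flat `key: value` frontmatter between two `---` lines (assumed format)."""
    lines = card.read_text(encoding="utf-8").splitlines()
    fields: dict[str, str] = {}
    if not lines or lines[0].strip() != "---":
        return fields
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_done(card: Path) -> bool:
    # Location is the source of truth: the folder wins over any status field.
    return card.parent.name == "Done"


def board(roadmap: Path) -> dict[str, list[str]]:
    """Group every card by its `version` field without moving any file."""
    columns: dict[str, list[str]] = defaultdict(list)
    for folder in ("Backlog", "Done"):
        for card in sorted((roadmap / folder).glob("*.md")):
            version = read_frontmatter(card).get("version", "unplanned")
            columns[version].append(card.stem)
    return dict(columns)


def mark_done(card: Path) -> Path:
    # Marking done is a file move; re-planning a version would be a field edit.
    target = card.parent.parent / "Done" / card.name
    target.parent.mkdir(exist_ok=True)
    return card.rename(target)
```

Note that `is_done` never reads the `status` field, so a card whose frontmatter claims `status: done` while it still sits in Backlog/ is reported as not done.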
[Figure: the product (dev/, prod/; its shape a per-project fact) and the dev team, which owns the roadmap. The team ships from origin with the requirement format and flow but an empty roadmap; cards accumulate per project.]
The roadmap folder holds just two folders for done-ness; the planning dimensions live inside each card as frontmatter, and the board groups by the version field, so re-planning a release never touches the filesystem.
The human picks a ticket and opens a requirement session. The pipeline runs mostly autonomously, surfacing only at gates or escalations: CTO → Design → Build → QA → Document → Review, each separated by a human gate. Research is not a fixed step; it is an on-demand portable agent, invokable in either layer, and because it produces documents its instances are safe to run in parallel. Research presents options; it does not decide.
One pipeline, no fast lanes. Every requirement runs this same flow regardless of size — there is no separate “lightweight” track for small changes. The flow compresses on its own when the work is small: a one-line copy change has a trivial CTO conversation, a one-step Design change, and a two-second gate, while still leaving behind a test, current docs, and a passing suite. A second profile would buy a near-zero saving (the heavy part is the gates, and gates already cost time proportional to content, not to the flow’s length) at the price of a track to maintain and a “is this small enough?” decision on every requirement. Uniformity is the simpler design here, and it keeps every change — large or small — leaving the same evidence behind. So size stays a planning field only; it never selects a workflow.
Who does what — and who is allowed to judge whose work.
| Role | Kind | Layer | Responsibility |
|---|---|---|---|
| CTO | creator | A·B | At onboarding, once: grounds itself in the project and confirms its architecture (the adopt on-ramp). Owns the roadmap and prioritization (value vs. effort). For each requirement: settles the approach with you, fills in the requirement card (the roadmap file), and writes its tests against that card |
| Research | advisor | A / B | Gathers external evidence; presents options, does not decide |
| Design | creator | B | Produces UI against the design system; objective props auto-tested, taste judged in evaluation |
| Build | creator | B | Implements backend + frontend until the requirement test passes; self-tests for speed |
| QA | verifier | B | Runs the CTO’s tests on the deliverer’s work; routes failures back to the deliverer |
| Document | creator | B | Updates docs from merged, current code — after build, so it documents reality |
| Review | verifier | B | Independent security + cleanup audit — catches what tests can’t |
Two kinds of role, one firewall. The roster splits cleanly. Creators author deliverables — the CTO writes the card and tests, Design and Build produce the work, Document writes the docs. Verifiers judge that work and author none of it — QA runs the tests, Review audits the finished code. (Research is neither: an advisor that gathers evidence and presents options without deciding.) The whole system reduces to one rule across this split — a verifier never judges anything it authored — and Part II (R5) makes it structural rather than trusted: verifiers run write-walled, literally unable to edit the code they judge, while creators hold the write access their job needs. This is also why you converse with creators (you reason with them) but only receive a verdict from a verifier (then take it to the CTO) — the gate-type split of §02, mapped onto the roster.
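The firewall can be pictured as data rather than instruction. The sketch below is an assumption-laden illustration: `Kind`, `Role`, `ROSTER`, and `may_judge` are hypothetical names, the read-only tool set is illustrative, and the source does not say what tool access the Research advisor gets (it is treated like a creator here because it produces documents).

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    CREATOR = "creator"
    VERIFIER = "verifier"
    ADVISOR = "advisor"


WRITE_TOOLS = frozenset({"Write", "Edit"})
READ_TOOLS = frozenset({"Read", "Grep", "Glob"})  # illustrative read-only set


@dataclass(frozen=True)
class Role:
    name: str
    kind: Kind

    @property
    def tools(self) -> frozenset[str]:
        # Verifiers are write-walled: they cannot edit what they judge.
        # Assumption: advisors get creator-style access for their documents.
        if self.kind is Kind.VERIFIER:
            return READ_TOOLS
        return READ_TOOLS | WRITE_TOOLS


ROSTER = [
    Role("CTO", Kind.CREATOR),
    Role("Research", Kind.ADVISOR),
    Role("Design", Kind.CREATOR),
    Role("Build", Kind.CREATOR),
    Role("QA", Kind.VERIFIER),
    Role("Document", Kind.CREATOR),
    Role("Review", Kind.VERIFIER),
]


def may_judge(judge: Role, author: Role) -> bool:
    """Only a verifier may judge, and never anything it authored."""
    return judge.kind is Kind.VERIFIER and judge.name != author.name
```

Because the tool set is derived from the role's kind, a verifier cannot be granted write access by a prompt alone; the kind itself would have to change.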
How judgment is split. No role judges its own work — but “judging” isn’t a single role one agent owns. Each deliverable is partitioned, and the parts go to different judges. Design is the clearest case: its measurable properties (contrast, tokens, spacing, touch targets) are tested by QA, while its taste is judged by you at the gate — one deliverable, two judges. Build is the same shape, lopsided: almost all of it is measurable → QA, with anything untestable flagged to HUMAN-TEST-QUEUE.md → you. The CTO’s card is the opposite extreme — there’s no test to run on an approach, so it’s judged entirely by you. QA owns the measurable partition wherever it exists; you own the judgment partition; Review adds an orthogonal pass over the finished code for security and cleanliness that tests can’t catch.
The CTO advises; you decide. It does not decide the how by fiat — it grounds the requirement in the codebase, lays out approaches with tradeoffs, and the how is settled in conversation with you. Same pattern as research, and as every gate: the CTO structures the decision and records it, you make it.
The requirement card is the roadmap file, filled in. The card is created in Backlog/ during roadmap planning — entering the pipeline is the CTO filling that same file in with the settled approach (the status field tracks finer progress; the file only moves once, to Done/, when it ships). One artifact per requirement accretes detail across its life rather than spawning competing documents.
Why the CTO owns all tests. The requirement test asks “does this requirement do its job?” and the regression concern asks “does it respect the rest of the system?” — both architectural questions about fit and contract. The CTO writes the filled-in card and the tests in the same step, locked together at one gate, so the tests can’t drift from the spec. This keeps QA clean: QA only ever runs tests it did not write, on work it did not build.
Test-first as standard, with an honest escape hatch.
Tests are authored from the locked requirement card before the code exists, by the CTO, in the same step that fills the card in. With agents this matters more than with humans: an agent that writes code and then its own tests will tend to shape the tests to pass whatever it wrote, producing a green check that means nothing. A locked, pre-written test is an external, objective definition of done. It cannot move the goalposts.
A requirement (acceptance) test asks “did I build this right?” — scoped to one requirement, shipping with it. A regression suite asks “did I break anything else?” — the accumulated union of every requirement test ever written. The suite is the only thing that catches “the requirement works but broke something else,” and it can only catch it because every past requirement left its test behind.
“Test-first as standard” has a failure mode: faced with something genuinely hard to test, an agent may write a hollow test that passes but asserts nothing — worse than no test.
If an agent cannot write a meaningful automated assertion, it must not write a hollow one. It appends an entry to HUMAN-TEST-QUEUE.md describing what needs human verification and why. Honest by construction.
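A minimal sketch of the escape hatch, assuming the queue is a plain markdown checklist. The entry layout and the `queue_for_human` helper are hypothetical; the source specifies only the file name and that each entry says what needs human verification and why.

```python
from datetime import date
from pathlib import Path


def queue_for_human(queue: Path, behavior: str, reason: str, requirement: str) -> None:
    """Append one entry instead of writing a test that asserts nothing."""
    entry = (
        f"- [ ] {behavior}\n"
        f"  - requirement: {requirement}\n"
        f"  - why not automated: {reason}\n"
        f"  - queued: {date.today().isoformat()}\n"
    )
    # Append-only: earlier entries are never rewritten by a later agent.
    with queue.open("a", encoding="utf-8") as fh:
        fh.write(entry)
```

A call might look like `queue_for_human(Path("HUMAN-TEST-QUEUE.md"), "Drag handle feels smooth", "subjective motion quality", "REQ-12")`, where the requirement id is invented for the example.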
A design system turns most “subjective” design into objective conformance: approved token or off-system value? Approved component or bespoke? Spacing from the scale? Contrast meeting WCAG? All pass/fail, checked automatically. The design system is to the design agent what the requirement spec is to the CTO — the source of truth its tests assert against. Only true taste reaches the human evaluation conversation.
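Two of the conformance questions above reduce to short functions. The contrast calculation follows the WCAG 2.x relative-luminance definition, and 4.5:1 is the AA minimum for normal-size text. The palette in `APPROVED_TOKENS` and the `conforms` helper are hypothetical stand-ins for whatever DESIGN-SYSTEM.md actually defines.

```python
def _linear(channel: int) -> float:
    """Linearize one 8-bit sRGB channel per the WCAG 2.x luminance formula."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def luminance(hex_color: str) -> float:
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(fg: str, bg: str) -> float:
    lighter, darker = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)


APPROVED_TOKENS = {"#1a1a1a", "#ffffff", "#0055cc"}  # hypothetical palette


def conforms(fg: str, bg: str, minimum: float = 4.5) -> bool:
    """Pass/fail: both colors are approved tokens and contrast meets the minimum."""
    on_system = fg.lower() in APPROVED_TOKENS and bg.lower() in APPROVED_TOKENS
    return on_system and contrast_ratio(fg, bg) >= minimum
```

Both checks return a plain boolean, which is what lets QA own them; whether the resulting screen looks good is the part that still reaches the human.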
Test-first is clean on a greenfield requirement. An adopted codebase (§03) is different: it arrives with legacy code and often no suite at all, so the regression net — “did I break anything else?” — has nothing to run against. The resolution comes straight from the dev-team model: a real team owns the whole codebase, legacy included, but it does not stop to fabricate exhaustive tests for code that already works — that effort buys the client nothing. Coverage is value-prioritized, not mandatory.
So writing a critical test is itself a requirement, flowing through the same pipeline and competing for priority like any other work. The main flow of a booking system — “book an appointment” — earns a test because breaking it is catastrophic; a rarely-touched admin corner may never earn one. This collapses “testing legacy” into the ordinary mechanism instead of making it a special rule. Those critical-flow tests run at the behavior level through the headless browser (“eyes” on the rendered product, R7), verifying a flow without unit-testing the code beneath it. The suite then grows forward, concentrated exactly where the team actually works.
Correction loops — and when to convene.
Real development is a line with correction loops. When QA finds a failure, the fix goes to the agent that delivered the work — not to QA (fixing would make it an author), and not to the human (the gate, not the debugger). The deliverer has the full context of what it built, so it diagnoses fastest. The loop is QA → deliverer → QA, and the human is outside it.
The loop above handles “the code doesn’t match the test yet.” But sometimes the problem is deeper: the approach itself can’t work, the requirement is ambiguous, or something structural is wrong. A producer is never allowed to reopen the approach on its own — a stuck agent has every incentive to blame the spec rather than its own work, which is the self-judgment problem the system exists to prevent. So being stuck does not trigger a rewrite; it triggers a conversation.
That conversation convenes you and the CTO together — the same agent that wrote the approach, so there is no handoff — and it decides the course of action openly. It might amend the current requirement (add a clarifying constraint and continue), open a new requirement (a refactor or architectural fix, prioritized against everything else), fix a structural problem first and then resume, or some combination. This is the system’s only backward motion, and it is deliberately a human-plus-CTO decision rather than an automated transition. It is also why the unit of work is a requirement and not a “feature”: “the approach was wrong” has a natural home, because a refactor is just another requirement.
What fires that conversation is a hard limit, so an autonomous fix-retry loop can’t spin for a day burning tokens against a wrong spec. After MAX_FIX_ATTEMPTS (default 3) failed fixes, the loop stops and hands the full failure history to the you-plus-CTO conversation above. Persistent failure is treated as an upstream problem, not a mechanical one.
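The correction loop and its hard limit can be sketched as follows. Only `MAX_FIX_ATTEMPTS` and its default come from the source; `Outcome`, `fix_loop`, and the two callbacks are hypothetical names for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_FIX_ATTEMPTS = 3  # default from the source


@dataclass
class Outcome:
    passed: bool
    escalated: bool
    history: list[str] = field(default_factory=list)


def fix_loop(
    run_qa: Callable[[], tuple[bool, str]],
    deliverer_fix: Callable[[str], None],
    max_attempts: int = MAX_FIX_ATTEMPTS,
) -> Outcome:
    """QA -> deliverer -> QA, with a hard stop after max_attempts failed fixes."""
    history: list[str] = []
    passed, report = run_qa()
    failed_fixes = 0
    while not passed:
        history.append(report)
        if failed_fixes >= max_attempts:
            # Persistent failure is an upstream problem: stop and convene
            # the human-plus-CTO conversation with the full failure history.
            return Outcome(passed=False, escalated=True, history=history)
        deliverer_fix(report)      # the fix goes to whoever delivered the work
        passed, report = run_qa()  # QA only re-runs tests; it never edits
        if not passed:
            failed_fixes += 1
    return Outcome(passed=True, escalated=False, history=history)
```

With the default limit and a QA run that never passes, the deliverer gets exactly three fix attempts before the loop escalates, and the escalation carries every failure report collected along the way.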
Where durable state lives.
These files live in each project (Layer 3). The role definitions (Layer 2) are reused across projects. There is no canonical folder shape: a project may be one tree with back/ and front/, or several repos side by side (app/ + panel/ + back/), or anything else — the layout is a Layer-3 fact recorded in the project descriptor (CLAUDE.md), written by scaffold for new projects or derived by adopt for existing ones (§03). What the table below lists is the set of durable artifacts every project carries, wherever they physically sit.
| File | Purpose | Written by | Read by |
|---|---|---|---|
| roadmap/ folder | Folders Backlog/ and Done/ (done-ness = location). One markdown file per requirement, carrying frontmatter fields version, size, value, status. The board groups by version. One file per requirement → sessions never collide | CTO | Human, pipeline, roadmap board |
| CLAUDE.md | Project conventions, stack, structure | Human + agents | All agents |
| DESIGN-SYSTEM.md | Tokens, components, rules the design agent must conform to | Human | Design, QA |
| HUMAN-TEST-QUEUE.md | Behaviors needing human verification because they can’t be meaningfully auto-tested | Any agent | Human (at gates) |
| /tests | The accumulated requirement + regression suite | CTO | Build (self-test), QA (verify) |
| .env / .env.example | Secrets (git-ignored) plus a value-less structure file. One of the files the agent cannot author (R9) | Human only | Human; agent reads .env.example structure only |
| permission switch | Hand-edited armed_until flag gating sensitive prod actions — off by default, self-expiring (R9) | Human only | Agent (read-only) · dashboard (display) |
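
The self-expiring behaviour of the permission switch can be illustrated with a small check. The storage format is an assumption: the source names the `armed_until` flag but not its encoding, so this sketch treats it as an ISO-8601 timestamp string, and `is_armed` is a hypothetical helper. It shows only the expiry logic, not the OS-level enforcement that R9 describes.

```python
from datetime import datetime, timezone
from typing import Optional


def is_armed(armed_until: Optional[str], now: Optional[datetime] = None) -> bool:
    """True only while a hand-set expiry lies in the future; off by default."""
    if not armed_until:
        return False  # missing or empty flag means disarmed
    try:
        expiry = datetime.fromisoformat(armed_until)
    except ValueError:
        return False  # fail closed on anything unparseable
    if expiry.tzinfo is None:
        expiry = expiry.replace(tzinfo=timezone.utc)  # assumption: naive values are UTC
    now = now or datetime.now(timezone.utc)
    return now < expiry
```

Every failure path returns `False`, so a deleted, empty, or malformed flag leaves sensitive prod actions disabled.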
The knobs worth tuning.
| Parameter | Default | Meaning |
|---|---|---|
| MAX_FIX_ATTEMPTS | 3 | Failed QA-fix cycles before the loop stops and convenes the you-plus-CTO conversation |
| Pipeline roles (v1) | — | CTO → Design → Build → QA → Document → Review. Applied uniformly to every requirement; the flow compresses naturally when the work is small (no separate track) |
| Research parallelism | on | Research instances may run concurrently across requirements / projects |
Locked · Where commitment is cheap, and where it must be contained.
The system runs on specific tools, but commitment to them is layered by how expensive it is to reverse. Three layers, most portable to most locked:
The principle: depend natively; document the dependency’s contract provider-neutrally. The alternative — a runtime abstraction layer wrapping every provider — is rejected: it drops the stack to a lowest-common-denominator, forfeits the native fit that is the whole reason to choose these tools, and is a maintenance burden forever. Knowledge is portable and cheap; abstraction is operational and expensive. We buy the first.
The contract is capabilities plus interface, never implementation. For a design skill: enforces design-system tokens; generates variants for human selection; flags “AI-slop”; runs accessibility checks; invoked in-session, takes a spec, returns styled code + variants; to replace, need token-enforcement + variant-generation + a11y checks — and the taste-memory portion is ours and already portable. “Design skill, replace if migrating” is too vague to act on; copying its internal logic rebuilds it on paper and goes stale. The contract is the middle — and writing it also reveals the boundary between what is ours and what is the provider’s.
Locked · Three tools, each applied where it is cheapest.
Provider risk is handled by three tools acting at different moments — and the discipline is to use each only where it costs the least.
These specialise naturally, because dependencies fall into two kinds. Knowledge-layer dependencies — skills, prompts, role definitions, methods — are prose, so they tend to be portable already (a good design skill reads on any capable model); prefer-agnostic dominates here and is nearly free. Runtime-layer dependencies — the harness, hooks, tool-permission mechanics, the SDK — are machinery, so they tend to be locked; the manifest and the one-seam discipline do their work here. None of the three requires a line of abstraction code — they are selection discipline and documentation.
Slots filled, leaning, and deferred.
| Slot | Choice | Status |
|---|---|---|
| Runtime | Claude Code, run inside Cursor (free tier). Editor swappable to VS Code or Zed — reversible, affects nothing downstream | locked |
| Host | EC2, reached over SSH from desktop and laptop — sessions live server-side, so they’re available from either machine. Interactive over SSH is still interactive → subscription billing | locked |
| Dashboard | Committed (see R6). Node back + React/TS front; nginx only if served over the network. The dev-team management surface — roadmap, skills, md files, gate review | committed · stack leaning |
| Surface & billing | Planning in the dashboard (headless chat → the metered side) · building in Cursor (interactive → subscription). The two axes line up: the heavy work stays interactive/subscription, only the light planning chat is headless. Currently moot — metering paused (see R6) | locked |
| Context / memory | File-based for the build agents; a database per product feature (see R4) | principle locked |
| Tool access (MCP) | Composio leaning; to be pressure-tested | open |
| Skills / instructions | Anthropic Skills format as the container; prefer cross-provider skills | leaning |
| Knowledge / method | A reading list to internalise — an action, not a decision | n/a |
| Infra / deploy | Deferred — a solved problem that picks itself later | deferred |
Why Claude Code is the locked runtime, and why the lock is acceptable: it is the model-maker’s own harness, so it lines up natively with the Skills format and the model layer with no translation; and its core loop — an interactive conversation that can also act — is the architecture’s gate model. The commitment is contained by R1–R2: the harness is a runtime-layer dependency, documented and seam-confined, not abstracted away.
Locked · Build-time context for the agents vs. runtime data for the product.
“Memory” is two unrelated problems that must never be merged.
- docs/, skills, rules and CLAUDE.md files, plus the codebase itself. Curated, human-approved, in-repo, file-based. Build-time memory: it shapes how the agents work.

The test for which world a thing belongs to: does an agent read it while building, or does the app read it while running? A design decision the build agent needs is World 1, a file. A user’s profile the app serves is World 2, a database. The agents will build World 2 databases; they do not use them as their own memory.
World 1 stays files — not a retrieval index — for a concrete reason: Claude Code does not embed or index the repository. It navigates like an engineer (path globbing, content grep, reading whole files), so the leverage is well-organised, current, curated authored context that tells that search where to look. A retrieval layer over your own curated docs would reintroduce staleness and context-poisoning for near-zero benefit, since curated docs stay small enough to read directly. Semantic retrieval (e.g. a Milvus-backed code-search server over MCP) is an optional add-on, justified only if a repository outgrows the context window.
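To make the "glob, grep, read" style of navigation concrete, here is a small stand-in written with the Python standard library. It is not Claude Code's implementation, and the `grep` helper and its default pattern are hypothetical; it only shows that well-named, curated files are findable with plain path and content matching, with no index to go stale.

```python
import re
from pathlib import Path


def grep(root: Path, pattern: str, glob: str = "**/*.md") -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) for every match under root."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(root.glob(glob)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), 1):
            if regex.search(line):
                hits.append((path.relative_to(root).as_posix(), number, line.strip()))
    return hits
```

Every call reads the files as they are on disk at that moment, which is the property a separate retrieval index would give up.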
Locked · Creators are modes you converse with; verifiers are walled workers that report.
The architecture’s “agents” map onto two Claude Code primitives. Both facts hold; the mapping is now settled.
The deciding test — do I need to converse with this role, or only receive its output? — resolves the mapping:
- Write and Edit, so a verifier structurally cannot touch the code it judges. They run once and emit a verdict; you do not converse with them.

The conflict that looked real, that you might want to interrogate a reviewer’s finding (→ converse) yet also need it write-walled (→ dispatch), dissolves by separating the moments. The audit runs walled; then, if you want to understand or act on a finding, you talk to the CTO, who reads the report and can spawn a fix requirement. The verifier that judged never discusses; the creator you discuss with never judged, so authorship and verification stay split even inside the conversation. A verifier emitting only a verdict is the simplest possible walled thing; giving it a voice is surface we don’t need.
This refines Part I’s “no agent judges its own work”: the firewall now runs structurally along the mode/agent line, enforced by tool-scoping rather than by instruction. It also resolves which gates (Part I, §02) converse and which inspect — creator gates are conversations; verifier gates are verdicts you inspect and take to the CTO.
Locked · You review at the gates; a committed dashboard manages, Cursor implements.
One correction is settled: you are the reviewer at the gates, not the dispatcher between steps. The pipeline auto-advances and stops for your approval; the QA fix-loop runs without you, surfacing only on a pass or at the circuit breaker. Mechanically this is a deterministic gate — a hook that pauses for confirmation at each role boundary, which holds even in fast or headless modes.
The management surface is now committed, and split by tool. A custom dashboard handles managing the team: the kanban roadmap, editing skills and the .md context files, reviewing at the gates, and the planning conversations with the CTO (writing requirements, settling approach, discussing a verdict) — hosted as a chat inside the dashboard. Cursor + the Claude Code plugin handles implementing features — coding belongs in the editor where the code lives. They sit side by side over the same files (roadmap folder, skills, descriptors): editing a skill in the dashboard and using it in Cursor are the same file, two windows.
The hard constraint, refined: the dashboard may drive Claude Code — it hosts the planning chat by spawning Claude headlessly (the Amiko pattern, below) — but it must never own the pipeline or re-implement agent logic. The line was never “don’t drive Claude”; it is “don’t become the brain.” Claude Code plus the role files stay the engine and the orchestrator; the dashboard hosts conversation, editing, and the board, and triggers runs. Drive Claude: fine. Own the pipeline: never.
Stack: Node back + React/TS front (nginx only if served over the network, not for localhost). React over the more-familiar Vue for one reason specific to this system: AI assistants generate React with fewer correction loops than Vue’s template SFCs, because JSX + TS is a far larger share of training data — and a system whose whole job is AI-built products should lean toward what the agents build most reliably. The learning-curve cost lands on the one hand-built component (the dashboard), which — internal, single-user, low-stakes — is also the safest place to pay it; and the skill that actually matters is reading React well enough to gate it, not writing it, since the agents write it. That AI-affinity is also what makes React the choice for the current scaffold (scaffold_node_react, R8) — but that is the scaffold’s property, not a global product law; a different scaffold carries a different stack. The dashboard’s own React is a separate committed choice for this one hand-built component.
How the dashboard hosts that chat, and the billing it implies. Amiko already proves the mechanism: it drives Claude Code by spawning the CLI in headless mode (claude -p / --print, NDJSON streamed into the UI) rather than through the Agent SDK, though either works. That headless driving is the metered side of Anthropic’s interactive vs programmatic line: interactive use draws from subscription, while claude -p and the SDK alike are metered. That change was announced for June 2026 and then paused, so for now everything still draws from subscription; it is signalled to return in some form. The allocation this produces is the right way round: planning conversations (light) run headless in the dashboard → the metered side; building features (heavy) runs interactive in Cursor → subscription. The expensive work stays on the flat plan; only the cheap chat touches the meter.
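The spawn-and-stream mechanism can be sketched generically. The `stream_ndjson` helper, the stand-in child process, and the `{"type": ..., "n": ...}` event shape are all hypothetical; in the dashboard the command would start with `claude -p`, and the exact flags for streaming JSON output are an assumption to check against the CLI's own help.

```python
import json
import subprocess
import sys
from typing import Iterator


def stream_ndjson(command: list[str]) -> Iterator[dict]:
    """Spawn a process and yield one parsed JSON object per stdout line."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON noise rather than crash the UI


# Stand-in child process so the sketch runs without the CLI installed.
# It prints three JSON lines, mimicking a newline-delimited event stream.
fake_cli = [
    sys.executable,
    "-c",
    'import json; [print(json.dumps({"type": "text", "n": i})) for i in range(3)]',
]

events = list(stream_ndjson(fake_cli))
```

The dashboard's role here stays thin: it relays events to the UI as they arrive and holds no pipeline logic of its own, in line with the "drive Claude, never own the pipeline" constraint.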
Steal patterns; do not adopt the whole.
Opinionated complete setups (gstack and others) are someone else’s assembly of these same slots. They are reference implementations to raid, not foundations to fork — forking inherits the author’s sprawl and agenda and undermines the rule that the human must understand every piece. The method is a curated assembly: own the spine, borrow self-contained organs, and adapt each to the spine’s conventions, gated by three questions — what does it do that I need, stripped of the rest? what does it assume that I must reconcile? can I explain how it works?
Patterns already validated against this architecture, worth taking: explicit role “gears” (planning ≠ review ≠ shipping); a three-strikes circuit breaker (the same number reached independently here); an auto-advancing pipeline that surfaces only taste decisions, by classifying each decision mechanical-vs-taste; a decision audit-trail written to disk rather than accumulated in context; curated, prunable, file-based memory rather than auto-capture; a headless browser giving the QA and Design roles real “eyes” on the rendered product; anti-sycophancy phrase-bans in the planning role; and a blind reviewer that sees the artifact, not the reasoning. The convergence is the validation; the divergences — stricter test-first authorship, a single merged CTO — are deliberate and arguably cleaner for a from-scratch build.
Locked · Central origin, vendored clones, one paste to install.
The role definitions, skills, and config are one repo — the dev team — distributed by git’s ordinary fork/upstream model, chosen deliberately over a single shared install:
- git clones the team into its workspace, getting its own copy. Per-project edits (a new skill, an adapted one, the project’s requirements) live only in that clone and never flow back to origin. Generic updates flow the other way, origin → pull → project, opt-in per project.
- pull conflicts: the correct signal (your local divergence overlaps an upstream change), not noise.

This is the opposite of Amiko’s choice, deliberately: Amiko symlinked every user to one global skills dir because users must not diverge. Here divergence is the point (each project tunes its own team), so clone-per-project is right and symlink-to-global is wrong. (Part I’s “Layer 2 is reused across projects” still holds; “reused” means cloned-from-one-origin, not one-shared-copy.)
workspace/
├─ xproject_dev/     product · the workbench
├─ xproject_prod/    product · the live system
└─ dev_team/         the cloned team (Layer 2)
   ├─ roadmap/       Backlog/ · Done/ · cards fill per project
   ├─ skills/
   └─ … method, config, dashboard
The product (dev/, prod/) and the team (dev_team/) sit side by side; the roadmap lives inside the team because it is the team’s planning surface, not product code (Part I, §03). The operative .claude/ files are symlinked up to root, below.
The operative files Claude Code reads — .claude/skills/, settings.local.json, allowlists, CLAUDE.md — must sit at workspace root, alongside the product repos, because that is where Claude Code is opened. But the team clones into a dev_team/ subfolder. So install is not a bare clone; it must clone and place files one level up:
- The install symlinks the root .claude/ to the clone’s, so editing a skill “at root” is editing the team’s own file: it is already inside the clone, hence committable to that project’s vendored team for free. (Copy would fork root and clone into drift; symlink keeps one real copy. Same mechanism Amiko used, here serving divergence instead of sharing.)
- The install runs through init_dev_team, which branches on new-vs-existing: new → scaffold (pick a stack template) writes the descriptor; existing → adopt (the CTO confirmation conversation, §03/§04) derives and confirms it. From there both feed the identical pipeline.

So Part I’s two on-ramps have a concrete home: they are the two branches of init_dev_team. For now the library holds one template, which scaffolds both environments at once:
- back/: Node + Express (the API)
- app/: React + TypeScript (the authenticated application)
- web/: Astro (home, landing, blog; public, no auth; runs React as islands, so React stays the component language)

All three are produced under both xproject_dev/ and xproject_prod/. The two-frontend split is by auth boundary: public content is static, fast, and crawlable, structurally isolated from the logged-in app. This is also the dashboard’s own stack, so the system dogfoods its own scaffold.

But this React stack is a property of this scaffold, not a system-wide law: each scaffold is a self-contained skill named for what it produces (scaffold_node_react today, a future scaffold_vue or scaffold_ecommerce carrying its own stack). Selecting a scaffold is selecting the stack, which keeps the spine stack-agnostic. It is the same discipline as layout-agnosticism, applied to the stack: the opinion lives in a swappable skill, never in the architecture. The library may grow on one rule: every scaffold converges on the same handoff, a project with a valid Layer-3 descriptor, which keeps the pipeline ignorant of which scaffold produced it. Starting with one, not a speculative library, is the deliberate guard against the “zoo.”
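The install steps and the two on-ramps described above can be sketched together. This is an illustration under stated assumptions, not the real init_dev_team: the function signature, the `Descriptor` fields, the `SCAFFOLDS` registry, and the helper `link_claude_dir` are hypothetical, and only the .claude/ symlink is handled (the other operative files are left out).

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical Layer-3 descriptor; the real fields are not specified here."""
    project: str
    stack: tuple[str, ...]


def scaffold_node_react(project: str) -> Descriptor:
    # The one template described above: back/, app/, web/.
    return Descriptor(
        project,
        ("back: Node + Express", "app: React + TypeScript", "web: Astro"),
    )


# Each scaffold is a self-contained entry; adding one never touches the pipeline.
SCAFFOLDS: dict[str, Callable[[str], Descriptor]] = {
    "scaffold_node_react": scaffold_node_react,
}


def link_claude_dir(workspace: Path, team_dir: str = "dev_team") -> Path:
    """Point <workspace>/.claude at the clone's .claude so one real copy exists."""
    if not (workspace / team_dir / ".claude").is_dir():
        raise FileNotFoundError("clone the team before linking")
    link = workspace / ".claude"
    if link.is_symlink():
        link.unlink()  # refresh a stale link
    elif link.exists():
        raise FileExistsError(f"{link} is a real path; refusing to overwrite it")
    # Relative target, so moving the workspace does not break the link.
    os.symlink(os.path.join(team_dir, ".claude"), link, target_is_directory=True)
    return link


def init_dev_team(
    workspace: Path,
    project: str,
    scaffold: Optional[str] = None,
    confirmed_stack: Optional[tuple[str, ...]] = None,
) -> Descriptor:
    """Branch on new-vs-existing; both branches return the same kind of handoff."""
    link_claude_dir(workspace)
    if scaffold is not None:
        # New project: the chosen scaffold writes the descriptor.
        return SCAFFOLDS[scaffold](project)
    if confirmed_stack is None:
        # Existing project: adoption must derive and confirm the stack first.
        raise ValueError("adopting an existing project needs a confirmed stack")
    return Descriptor(project, confirmed_stack)
```

Because both branches return a `Descriptor`, whatever consumes the result never needs to know whether the project was scaffolded or adopted, which is the convergence rule stated above.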
Locked Author never on prod; operate yes, but OS-enforced.
Part I’s principle — the agent authors only in dev, prod is authoring-frozen — becomes a concrete, OS-level safety model. The reach splits in two:
- Routine operation: a scoped deploy skill. Pull / build / restart / edit named config run as one command, not a conversational role. Deploy is a skill, not a gate: the real gate already happened in dev (the requirement passed QA and your evaluation before it could ship), so re-confirming at prod would be asking “are you sure?” about something already approved.
- Sensitive prod actions sit behind a switch that is off by default, armed only in the window you’re actively supervising. It is a hand-edited config file carrying an armed_until timestamp (not a bare bool): the agent treats prod as armed only while now < armed_until, so it self-expires. Forgetting to disarm can’t leave prod exposed, and there is no background process, just a comparison at read time. The dashboard shows the state (loudly) but does not write it; you edit the file by hand. Removing the write-path is the simplification: there is no dashboard button to misuse.
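The armed_until comparison can be sketched as a single read-time check. The file format is an assumption: the text says only that the switch is a hand-edited config file carrying an armed_until timestamp, so this example assumes JSON with an ISO 8601 value that includes a UTC offset. The function name `prod_is_armed` is hypothetical, and failing closed on any unreadable input is a design choice of the sketch.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def prod_is_armed(switch_file: Path, now: Optional[datetime] = None) -> bool:
    """True only while now < armed_until; anything unreadable counts as disarmed."""
    now = now or datetime.now(timezone.utc)
    try:
        raw = json.loads(switch_file.read_text())["armed_until"]
        armed_until = datetime.fromisoformat(raw)
    except (OSError, ValueError, KeyError, TypeError):
        return False  # fail closed: a missing or malformed file means "off"
    if armed_until.tzinfo is None:
        return False  # assumption: refuse timestamps without an explicit offset
    return now < armed_until
```

A switch file for this sketch would contain, for example, `{"armed_until": "2030-01-01T12:00:00+00:00"}`. Nothing has to run when the window closes; the next read simply evaluates to False.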
The boundary that matters is below Claude’s world. --allowedTools and CLAUDE.md govern what the agent is told it may do, which is negotiable in principle. So the real lock is the operating system: the agent runs as an unprivileged user; the armed-switch file, the env files, and the prod source folders (xproject_prod/back, /app, /web) are owned by a more privileged account. The agent reads what it’s allowed and is structurally barred from writing the rest: no prompt or jailbreak grants an unprivileged user write on a root-owned file. Deploy elevates through a narrow per-command sudo rule (just the deploy commands), never blanket sudo.
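Why the OS lock cannot be talked around is visible in the classic Unix permission check itself, sketched here as a pure function. It is a simplified model and an assumption of this example: it ignores ACLs, capabilities, and the root user, and the name `may_write` and the uid values in the usage note are hypothetical.

```python
from __future__ import annotations

import stat


def may_write(
    mode: int, owner_uid: int, owner_gid: int, uid: int, gids: set[int]
) -> bool:
    """Classic owner/group/other write check for a non-root user."""
    if uid == owner_uid:
        return bool(mode & stat.S_IWUSR)  # owner bits apply, and only those
    if owner_gid in gids:
        return bool(mode & stat.S_IWGRP)  # otherwise group bits, if a member
    return bool(mode & stat.S_IWOTH)      # otherwise the "other" bits
```

For a root-owned switch file with mode 0o644 and an agent running as uid 1001 in group 1001, `may_write(0o644, 0, 0, 1001, {1001})` is False: the decision depends only on ids and mode bits, none of which a prompt can change. The containing directory needs the same ownership, since a user who can write to a directory can replace files inside it.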
This forms a small class of files the agent cannot author: env files (git-ignored, human-only — the agent never sees secret values, only a .env.example structure), the armed switch (agent-readable, not writable), and prod source (deploy-writable, agent-frozen). The freeze is on authoring — deploy must still write prod/back when it pulls and builds; the agent’s ordinary Edit must not. Four independent guards stack into defense in depth: when (the switch), secrets (env-exclusion), source (frozen folders), enforcement (OS ownership) — an accident must pass all four to do damage.
Amiko deliberately skipped OS-user separation (chmod) because on a single-user box it would block the developer too and bought nothing: there was no second actor to protect the system from. Here the agent is the thing being constrained, so the separation is the entire point. The complexity Amiko avoided is the complexity that buys the unbypassable switch.