Friday, 22 May 2026

The Invisible Compiler: How One Human and Two AI Pair-Programmers Turned a Five-Step Stepper Into a Chat in Just 4 Days

May 22, 2026. South Africa.

The V1 agentic layer of the platform — the typed-and-versioned, schema-validated, Composer-plus-Linter-plus-Runtime-plus-Scheduler stack I wrote about four days ago — was working. Real Monday-morning briefings. Real citations. Real charts. Real customers. But the way you composed an agent in V1 was a five-step stepper: Compose → Lint → Dry-Run → Review → Publish. Status pills like [tool-availability] would flicker across a panel. If something went wrong, the user was asked to edit a typed-field draft by hand.

That UX treated the user like a developer.

This week I rewrote it. The new Composer is a chat. The user describes the agent they want in plain English. The model asks clarifying questions, drafts the agent, runs a preview inline in the conversation, accepts refinements ("move the regional breakdown above the top movers"), and when the user says "ship it" a single orange button publishes. No stepper. No lint codes. No JSON. The entire authoring surface is a conversation.

That's V2 of the Agent Composer. It shipped this week as two PRs — a foundational chat-profiles refactor, then the Composer V2 specialisation on top. Along the way Claude and Codex held each other accountable, six fix-cycles deep, against a four-round production QA loop. This post is the after-action.

Other ways to understand what I did:

"From Lint Pills to Plain English: Building a Chat-First Agent Composer with a Self-Correcting LLM-to-Zod Loop"

"Profiles Over Forks: How I Specialized an AI Chat Engine Without Touching the Runtime — Solo, with Claude and Codex"

"Chat Profiles, MCP Tools, and a Schema as the Trust Boundary: A Tier-3 Agentic Composer in 71 Commits"

"How to Build a Tier-3 Autonomous Analytics Agent Without an Orchestration Framework, a Vector Store, or a Sub-Agent Swarm"


Act 1: Why the V1 Composer Wasn't Good Enough

V1's authoring stepper was a faithful UI mapping of the typed contract underneath. It had to be: that's how the data shape was implemented. AgentDefinitionV1 has a name, a description, a taskSpec, a schedule, and an allowedTools array. The stepper just rendered each block of the contract as a form section, validated each step before letting the user proceed, and surfaced the linter's 9-check pass/fail result.

It was mechanically correct. It was conversationally wrong.

The owner-facing review of V1 surfaced four hard problems:

  • Non-technical users were asked to read lint codes. [tool-availability] means nothing to a sales manager.
  • The dry-run sample report rendered in a separate panel below the typed-field editor — users had to switch context to see what they were authoring.
  • If the LLM proposed something invalid, the user had to fix it by hand in a Zod-shaped form. Most users didn't even know what a draft was.
  • The mental model the stepper imposed (Compose, then Lint, then Dry-Run, then Review, then Publish) didn't match how anyone actually thinks about a report. People think in iteration loops: describe, see, refine, see, refine, publish.

The fundamental insight: the typed contract is the trust boundary; it must stay. But the typed contract should be invisible to the user. The conversation, the inline preview, and the single publish action are the entire surface area. Everything underneath is implementation detail.

Act 2: The Iteration Loop That Reshaped the Plan

I asked Claude to scope the work. The first plan came back at about 600 lines of markdown and proposed a single-PR build — new chat profile, new tools, new UI, all landing together. It described the Composer V2 surface in detail but treated the underlying chat engine as a black box that the new Composer would simply call.

I pushed back twice. The first push was structural: "the AI chat engine bakes a single global system prompt into its runtime. If we want a chat tuned for a different purpose — designing agents, editing forecasts, master-data cleanup — there is no clean way to plug in a different system prompt, a different tool allowlist, a different conversational discipline. That's the real refactor."

That reframing rippled. The plan grew a "Layer 0" — the Chat Profiles platform. Profiles became a first-class abstraction: each profile is a server-defined bundle of { systemPromptBuilder, allowedTools, suggestedPrompts, markerProtocol, hooks? }. The existing global behaviour became the general profile. The new Composer became the composer-v2 profile. The runtime no longer baked one prompt; it looked up a profile at request time.
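A minimal sketch of what such a profile bundle and its request-time use might look like. Only the field names come from the description above; the ChatContext type, the prompt text, the buildChatSetup helper, and the tool lists are assumptions for illustration, and markerProtocol and hooks are left out for brevity.

```typescript
// Hypothetical sketch of a chat profile bundle and its request-time use.
// Only the field names come from the post; types and values are assumptions.
interface ChatContext {
  businessUnit: string;
}

interface ChatProfile {
  id: string;
  systemPromptBuilder: (ctx: ChatContext) => string;
  allowedTools: readonly string[];
  suggestedPrompts: readonly string[];
}

const generalProfile: ChatProfile = {
  id: "general",
  systemPromptBuilder: (ctx) => `You are the analytics assistant for ${ctx.businessUnit}.`,
  allowedTools: ["search_insights", "customer_health"],
  suggestedPrompts: ["What moved this week?"],
};

const composerProfile: ChatProfile = {
  id: "composer-v2",
  systemPromptBuilder: (ctx) => `You help users design report agents for ${ctx.businessUnit}.`,
  allowedTools: [
    "propose_agent_draft",
    "validate_draft_silently",
    "dry_run_draft",
    "materialize_and_publish",
    "query_customer_master",
  ],
  suggestedPrompts: ["Weekly Finance briefing"],
};

interface ChatSetup {
  systemPrompt: string;
  tools: string[];
}

// The runtime asks the profile for its prompt and narrows the tool surface
// to the profile's allowlist, instead of using one hard-coded global prompt.
function buildChatSetup(
  profile: ChatProfile,
  ctx: ChatContext,
  availableTools: readonly string[],
): ChatSetup {
  const allowed = new Set(profile.allowedTools);
  return {
    systemPrompt: profile.systemPromptBuilder(ctx),
    tools: availableTools.filter((tool) => allowed.has(tool)),
  };
}
```

Because the allowlist is applied on the server, a caller that names the wrong profile still only ever sees that profile's tools.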

The second push was sequencing: "ship the foundational refactor first, soak it, then build Composer V2 on top." That split the work into two PRs — PR 1 (foundation, zero behaviour change) and PR 2 (Composer V2 on the soaked foundation). The byte-equivalence safety net got built INTO PR 1 as the "do-no-harm" gate.

Codex did a clean-room review of the v3 plan and posted four findings that materially changed what shipped. I'll paraphrase the gist:

Codex plan-review pass (paraphrased):
  1. "PR 1 needs a byte-equivalence snapshot suite as the do-no-harm gate. Replay a corpus of ~30 canonical prompts against the migrated general profile and assert byte-identical responses against a pre-refactor snapshot. Any drift fails the build. This is the gate, not a nice-to-have."
  2. "Composer-v2 MUST be registered at module-load time, not lazily. If validateChatRequestBody sees an unknown profileId on the first request, every initial Composer invocation 400s. Eager registration with a guardrail test."
  3. "The four authoring tools need a session-scoped allowlist. The Composer agent gets only those four (plus disambiguation tools), nothing else. If a regular chat caller passes profileId: 'composer-v2' by accident, the runtime should still enforce the narrow allowlist server-side."
  4. "Materialise server-owned fields after the LLM returns. id, pk, defType, ownerUserId, systemPrompt, audit, publishToken — never let the LLM author these. Same rule as V1. Pull materializeAgentDefinition out of the V1 Composer into a shared helper so V2 reuses it."

All four landed. The byte-equivalence snapshot suite became the merge gate for PR 1. The eager-registration guardrail is now a permanent test. The four authoring tools live in a hard-coded COMPOSER_AUTHORING_TOOLS array; the bridge filter excludes them from general chat. The materializer was refactored as a pure function and reused by both V1 and V2 Composers.
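The "never let the LLM author these" rule from finding 4 can be sketched as a pure function. This is an illustration only: the real materializeAgentDefinition signature is not shown in this post, the types and the "agent-definition" value are assumptions, and only a subset of the server-owned fields (id, defType, ownerUserId, audit) is modelled.

```typescript
// Hypothetical sketch: server-owned fields are stamped after the LLM returns.
// Types, field subset and the defType value are assumptions for illustration.
interface ComposerDraft {
  name: string;
  description: string;
  allowedTools: string[];
}

interface ServerContext {
  newId: string;
  ownerUserId: string;
  nowIso: string;
}

interface AgentDefinitionSketch {
  id: string;
  defType: string;
  ownerUserId: string;
  name: string;
  description: string;
  allowedTools: string[];
  audit: { createdAt: string; createdBy: string };
}

function materializeAgentDefinition(
  draft: ComposerDraft,
  ctx: ServerContext,
): AgentDefinitionSketch {
  return {
    // Author-controlled fields are copied one by one, so any unexpected
    // keys in the LLM output (a spoofed id, for example) are dropped.
    name: draft.name,
    description: draft.description,
    allowedTools: draft.allowedTools.slice(),
    // Server-owned fields always come from trusted context.
    id: ctx.newId,
    defType: "agent-definition",
    ownerUserId: ctx.ownerUserId,
    audit: { createdAt: ctx.nowIso, createdBy: ctx.ownerUserId },
  };
}
```

Keeping the function pure (no I/O, no mutation of the draft) is what makes it cheap to share between two Composers and to pin with tests.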

The final plan landed at 1,771 lines of markdown after about ten review cycles with Codex — almost three times the length of the first draft, because the iterations forced every layer to be specified before any code was written.


Act 3: The Architecture We Landed On

Here's the V2 picture in plain ASCII. Compare to the V1 diagram from the last post — the additions are bracketed by [NEW].

   
   +-----------------------------------------------------------------+
   |                       Web App (browser SPA)                     |
   |                                                                 |
   |   Gallery  -  Compose (V1)  -  [NEW] Compose with AI chat (V2)  |
   |   Filed Reports  -  Insights  -  Admin                          |
   +---------------------------+-------------------------------------+
                               |
                               | HTTPS + auth proxy
                               v
+-----------------------+   /api/agents/*    +-----------------------------+
|  Web App (Express)    |  ----------------> |  AI service (Fastify)       |
|  React shell + static |                    |                             |
|  files. Proxies AI    |   /api/ai-chat     |  +-----------------------+  |
|  routes to the AI     |  ----------------> |  | Chat route            |  |
|  service.             |                    |  | accepts profileId     |  |
|                       |   /api/composer-v2 |  | (default = general)   |  |
|  + serves the new     |  ----------------> |  +----------+------------+  |
|  agent-composer-chat  |                    |             |               |
|  React view + the V2  |                    |             v               |
|  HTTP shims for       |                    |  +-----------------------+  |
|  draft fetch, reset,  |                    |  | [NEW] Chat-Profile    |  |
|  publish.             |                    |  |       Dispatcher      |  |
+-----------------------+                    |  +----------+------------+  |
                                             |             |               |
                                             |       +-----+-----+         |
                                             |       |           |         |
                                             |       v           v         |
                                             |  +---------+ +---------+    |
                                             |  | general | | [NEW]   |    |
                                             |  | profile | | composer|    |
                                             |  | (V1     | | -v2     |    |
                                             |  |  prompt | | profile |    |
                                             |  |  +tools)| | (chat-  |    |
                                             |  +----+----+ |  first  |    |
                                             |       |      |  multi- |    |
                                             |       |      |  turn)  |    |
                                             |       |      +----+----+    |
                                             |       |           |         |
                                             |       |     [NEW] |         |
                                             |       |     four  |         |
                                             |       |     auth. |         |
                                             |       |     tools |         |
                                             |       |  +--------+-----+   |
                                             |       |  | propose      |   |
                                             |       |  | validate     |   |
                                             |       |  | dry_run      |   |
                                             |       |  | materialize  |   |
                                             |       |  | _and_publish |   |
                                             |       |  +--------+-----+   |
                                             |       |           |         |
                                             |       |           v         |
                                             |       |  +-----------------+|
                                             |       |  | [NEW] in-mem    ||
                                             |       |  | draft store     ||
                                             |       |  | TTL 30min,      ||
                                             |       |  | quota 10        ||
                                             |       |  | dry-runs/sess.  ||
                                             |       |  +-----------------+|
                                             |       v           |         |
                                             |  +-----------------v----+   |
                                             |  | Same MCP tool surface|   |
                                             |  | as V1 chat:          |   |
                                             |  |   build_movement_pack|   |
                                             |  |   compute_period_agg |   |
                                             |  |   compute_plan_vs_act|   |
                                             |  |   customer_health    |   |
                                             |  |   query_customer_mas |   |
                                             |  |   get_insight_def    |   |
                                             |  |   search_insights    |   |
                                             |  |   ... (60+)          |   |
                                             |  +----------+-----------+   |
                                             |             |               |
                                             |             v               |
                                             |  +-----------------------+  |
                                             |  | Runtime adapter       |  |
                                             |  | (claude-agent-sdk     |  |
                                             |  |  - unchanged)         |  |
                                             |  +----------+------------+  |
                                             |             |               |
                                             |   Hooks (unchanged):        |
                                             |     PreToolUse  PolicyGuard |
                                             |     PostToolUse Citation+ds |
                                             |     Stop        CostTracker |
                                             |                             |
                                             +-----------+-----------------+
                                                         |
                                                         v
   +-----------------------------------------------------------------+
   |                        Cosmos DB (single account)               |
   |   definitions (incl. agent-definitions from V2 publishes,       |
   |   schema-identical to V1)                                       |
   |   agent_runs (incl. dry-run artefacts owned by composer-v2)     |
   |   bu-ai-data-policy, runtime-state, etc.                        |
   +-----------------------------------------------------------------+

The point worth labouring: everything below the dispatcher is unchanged. The four new authoring tools are MCP tools just like every other tool. The draft store is an in-memory Map with a 30-minute TTL. The runtime adapter, the hooks, the stores, the Cosmos schema, the scheduler — not touched. The Composer V2 surface is purely an additive specialisation.

An agent composed by V2 is byte-identical to an agent composed by V1, up to a single optional audit.composedVia: 'chat' telemetry field. The Gallery doesn't know V2 exists. The scheduler doesn't know V2 exists. If we deleted the V2 surface tomorrow, every previously V2-composed agent would still run.

The non-negotiable invariant: V2 is a better front door, nothing more. It does not fork the agentic engine. Same definition store. Same scheduler. Same runtime. Same hooks. Same ABAC. Same Gallery. Same Filed Reports. Same renderer. Same audit log. The four new authoring tools, the in-memory draft store, and the chat profile are the entire footprint.
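For a sense of how small the draft store can be, here is a sketch of an in-memory Map with a 30-minute TTL, a 10-dry-run quota, draft versioning, and a server-issued publish token. The class shape, the method names, keying by session id, and rotating the token on every new draft are all assumptions; only the TTL, the quota, and the existence of a version and publishToken come from this post.

```typescript
// Hypothetical in-memory draft store sketch. TTL and quota values are from
// the post; the API shape and token rotation policy are assumptions.
const DRAFT_TTL_MS = 30 * 60 * 1000;
const MAX_DRY_RUNS_PER_SESSION = 10;

interface DraftEntry<T> {
  draft: T;
  version: number;
  publishToken: string; // server-issued; never sent to the LLM
  dryRuns: number;
  expiresAt: number;
}

class DraftStore<T> {
  private readonly entries = new Map<string, DraftEntry<T>>();
  private readonly now: () => number;
  private readonly newToken: () => string;

  constructor(
    now: () => number = () => Date.now(),
    // Placeholder generator: a real token should come from a cryptographic source.
    newToken: () => string = () => Math.random().toString(36).slice(2),
  ) {
    this.now = now;
    this.newToken = newToken;
  }

  // Storing a draft bumps the version, rotates the token and refreshes the TTL.
  put(sessionId: string, draft: T): DraftEntry<T> {
    const previous = this.get(sessionId);
    const entry: DraftEntry<T> = {
      draft,
      version: (previous ? previous.version : 0) + 1,
      publishToken: this.newToken(),
      dryRuns: previous ? previous.dryRuns : 0,
      expiresAt: this.now() + DRAFT_TTL_MS,
    };
    this.entries.set(sessionId, entry);
    return entry;
  }

  // Expired entries are evicted lazily on read.
  get(sessionId: string): DraftEntry<T> | undefined {
    const entry = this.entries.get(sessionId);
    if (entry && entry.expiresAt <= this.now()) {
      this.entries.delete(sessionId);
      return undefined;
    }
    return entry;
  }

  // Returns false once the session has used up its dry-run quota.
  tryConsumeDryRun(sessionId: string): boolean {
    const entry = this.get(sessionId);
    if (!entry || entry.dryRuns >= MAX_DRY_RUNS_PER_SESSION) return false;
    entry.dryRuns += 1;
    return true;
  }

  // Publishing requires the token issued for the latest draft version.
  canPublish(sessionId: string, token: string): boolean {
    const entry = this.get(sessionId);
    return entry !== undefined && entry.publishToken === token;
  }
}
```

Injecting the clock and the token generator keeps the store deterministic under test, which matters when the TTL is half an hour.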

Act 4: PR 1 — The Foundational Refactor (Zero Regression)

PR 1 was the first 1.5 days. Six commits. No new user-visible surface. The entire change was that standaloneAiChat.js — the file that owns the chat runtime — learned to look up a chat profile at request time and use it instead of a hard-coded global prompt.

The challenge wasn't writing the dispatcher. The challenge was proving the existing chat behaviour didn't drift by a single byte.

That's what the byte-equivalence snapshot suite was for. ~30 canonical prompts, drawn from a real week of production chat usage (sanitised), replayed against both the pre-refactor runtime and the post-refactor runtime with profileId defaulted to general. The system prompt the runtime composes must be byte-identical. The tool allowlist must be byte-identical. The streaming behaviour must be byte-identical. Any drift fails the build.

The snapshot suite caught one real regression mid-PR — my first attempt at the dispatcher had a sneaky difference in how it normalised the absence of profileId in the request body (treating undefined as 'general' instead of preserving the absence). Codex flagged it during review, the snapshot diffed, and the fix was a four-line change.
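A sketch of how a gate like this can be structured: serialise what the runtime composes for each canonical prompt, then compare against a stored baseline. The types and compose functions below are stand-ins, not the real runtime; in particular, forwardedBody is an assumed placeholder for wherever the absence of profileId was observable, since this post does not say exactly where the difference surfaced.

```typescript
// Hypothetical snapshot-gate sketch. ChatRequest, ComposedCall and the compose
// functions are stand-ins; the post does not show the real runtime types.
interface ChatRequest {
  prompt: string;
  profileId?: string;
}

interface ComposedCall {
  systemPrompt: string;
  tools: string[];
  forwardedBody: ChatRequest;
}

type ComposeFn = (req: ChatRequest) => ComposedCall;

// Serialise with sorted keys so key order alone never counts as drift.
function stableStringify(value: unknown): string {
  if (value === undefined) return "null";
  if (Array.isArray(value)) return `[${value.map((v) => stableStringify(v)).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).filter((k) => obj[k] !== undefined).sort();
    return `{${keys.map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`).join(",")}}`;
  }
  return JSON.stringify(value);
}

function snapshotCorpus(corpus: readonly ChatRequest[], compose: ComposeFn): Map<string, string> {
  const snapshot = new Map<string, string>();
  for (const req of corpus) {
    snapshot.set(req.prompt, stableStringify(compose(req)));
  }
  return snapshot;
}

// Returns the baseline prompts whose composed call no longer matches.
function findDrift(baseline: Map<string, string>, current: Map<string, string>): string[] {
  const drifted: string[] = [];
  baseline.forEach((expected, prompt) => {
    if (current.get(prompt) !== expected) drifted.push(prompt);
  });
  return drifted;
}

const legacyCompose: ComposeFn = (req) => ({
  systemPrompt: "GLOBAL PROMPT",
  tools: ["search_insights"],
  forwardedBody: req,
});

// Drifts: fills in the default instead of preserving the absent profileId.
const eagerDefaultCompose: ComposeFn = (req) => ({
  ...legacyCompose(req),
  forwardedBody: { ...req, profileId: req.profileId ?? "general" },
});
```

Running findDrift over a baseline taken with legacyCompose and a current snapshot taken with eagerDefaultCompose flags every prompt that omitted profileId, which is the class of regression described above.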

Commits in PR 1:

  1. ChatProfileV1 contract in @app/ai-core: Zod schema for { id, label, description, systemPromptBuilder, allowedTools, suggestedPrompts, markerProtocol?, hooks? }.
  2. Profile registry in ai-service: getChatProfile(id), listChatProfiles(), registerChatProfile(profile). Eager registration at module-load.
  3. Inline profileId branch in the chat runtime: look up the profile, call its systemPromptBuilder, filter the MCP tools by profile.allowedTools, apply profile-specific hooks on top of the global ones.
  4. Chat route accepts an optional profileId field. Default general preserves backward compat with every existing caller.
  5. Byte-equivalence snapshot suite — the do-no-harm gate. 30 canonical prompts, byte-identical assertions, fails the build on any drift.
  6. Docs update: agent-contract.md gains a "Chat Profiles platform" section with the platform invariants.
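Commits 2 and 4 can be sketched together: a registry that is populated when the module loads, and a request validator that defaults a missing profileId to general and rejects unknown ids. The three registry function names come from the commit list; their bodies, the ChatProfileSummary type, and the shape of the validation result are assumptions.

```typescript
// Hypothetical registry sketch. Function names mirror the commit list above;
// their bodies and the validation result shape are assumptions.
interface ChatProfileSummary {
  id: string;
  label: string;
}

const profileRegistry = new Map<string, ChatProfileSummary>();

function registerChatProfile(profile: ChatProfileSummary): void {
  if (profileRegistry.has(profile.id)) {
    throw new Error(`Duplicate chat profile: ${profile.id}`);
  }
  profileRegistry.set(profile.id, profile);
}

function getChatProfile(id: string): ChatProfileSummary | undefined {
  return profileRegistry.get(id);
}

function listChatProfiles(): ChatProfileSummary[] {
  return Array.from(profileRegistry.values());
}

// Eager registration: these run when the module is loaded, before any
// request arrives, so the first Composer request cannot hit an unknown id.
registerChatProfile({ id: "general", label: "General chat" });
registerChatProfile({ id: "composer-v2", label: "Agent Composer" });

type ChatBodyValidation =
  | { ok: true; profileId: string }
  | { ok: false; status: 400; error: string };

function validateChatRequestBody(body: { profileId?: unknown }): ChatBodyValidation {
  // Existing callers send no profileId and keep getting the general profile.
  if (body.profileId === undefined) return { ok: true, profileId: "general" };
  if (typeof body.profileId !== "string" || !profileRegistry.has(body.profileId)) {
    return { ok: false, status: 400, error: `Unknown profileId: ${String(body.profileId)}` };
  }
  return { ok: true, profileId: body.profileId };
}
```

A guardrail test in this style only needs to import the module and assert that composer-v2 is already present in listChatProfiles().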

PR 1 sat on the feature branch for a few hours of soak in production before PR 2 began. During the soak the snapshot suite ran on every CI build and never failed. The existing chat experience — AI helper buttons on dashboards, the standalone chat view, page-context-aware tool routing — behaved identically. None of those callers passed profileId; all defaulted to general; none drifted.

That's the value of a do-no-harm gate. A foundational refactor of the chat engine landed on master with confidence high enough that no human had to manually test the existing surface. The snapshot was the proof.


Act 5: PR 2 — The Composer V2 Specialisation

PR 2 was fifteen commits over two days. New profile. Four new tools. New draft store. New frontend view. New HTTP endpoints. New navigation entry. Everything else unchanged.

Commits in PR 2:

  1. lint translator: Pure ai-core helper that translates V1's 9 linter codes into plain English. [tool-availability] becomes "The tool 'X' isn't available for this connector; try 'Y' instead." Engineering codes never leave the server.
  2. COMPOSER_AUTHORING_TOOLS exclusion: The hard-coded set of four authoring tools. The auto-bridge filter excludes them from the general chat profile so they only surface inside composer-v2.
  3. Extract materializeAgentDefinition: Pure refactor: pulled V1's server-owned-field materialiser out into a shared helper. Reused by V2's propose_agent_draft tool.
  4. ComposerIntentDraftV1 schema: The Zod schema for the draft payload the V2 LLM hands to propose_agent_draft. Strict. Field-level error messages. Hand-authored, not derived (Codex caught that a derived schema lost field-level hints).
  5. Draft store with versioning + publishToken + dry-run quota: In-memory Map, 30-minute TTL, max 10 dry-runs per session. publishToken is server-issued, never returned to the LLM, only known to the UI's PublishActionChip.
  6. runAgent accepts AbortSignal: Lets the V2 frontend cancel an in-flight dry-run when the user backs out. Threaded through the runtime adapter.
  7. Four Composer tool handlers + definitions + executor wiring: propose_agent_draft, validate_draft_silently, dry_run_draft, materialize_and_publish. Each Zod-validated at the MCP boundary. dry_run_draft calls the existing runtime.runAgent({ trigger: 'dry-run' }), with no V2-specific runtime branch.
  8. composer-v2 chat profile + EAGER registration: The profile bundle. Multi-turn system prompt. Suggested prompts. Marker protocol for {{ATTACH_REPORT:runId}}, {{FOLLOWUP:text}}, {{ACTION:publish}}. Registered at module load with a guardrail test.
  9. Dispatcher branch in standaloneAiChat.js: If the resolved profile is composer-v2, swap in the COMPOSER_AUTHORING_TOOLS allowlist; otherwise behave exactly as PR 1.
  10. HTTP endpoints (drafts fetch, session reset, publish): GET /api/composer-v2/drafts/:id for the technical-details disclosure, POST /api/composer-v2/sessions/:id/reset for the "Start a new agent" button, POST /api/composer-v2/drafts/:id/publish for the UI's PublishActionChip.
  11. SSE tool_result event + AIChatPanel props: Streams structured tool results back to the browser so the chat panel can render the inline report in real time.
  12. AgentComposerChatView + InlineSampleReport + PublishActionChip: The new chat-first view. Renders the sample report inline using the same AgentReportView sub-components the Filed Reports surface uses, with no private copy.
  13. Navigation entry + AM Contract Performance starter prompt: Sidebar entry "Compose Agent with AI chat" at /agent-composer-chat. AM Contract Performance template surfaces as a suggested prompt.
  14. MCP bridge coverage test + AM Contract Performance smoke: End-to-end smoke that drives the Composer through a full session, asserts the published agent is V1-shape, and confirms a spoken name like "City of Cape Town" is REFUSED at propose_agent_draft with a structured hint to call query_customer_master.
  15. Docs update: agent-contract.md gains the Composer V2 invariants section.
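The marker protocol from commit 8 lends itself to a small parser that separates the prose the user reads from the structured markers the UI acts on. The three marker names are from the commit description; the regular expression, the extractMarkers helper, and the idea that markers are stripped from the visible text are assumptions about how a client might consume them.

```typescript
// Hypothetical marker parser sketch. Marker names are from the post; the
// parsing rules and return shape are assumptions.
type MarkerKind = "ATTACH_REPORT" | "FOLLOWUP" | "ACTION";

interface ChatMarker {
  kind: MarkerKind;
  value: string;
}

interface ParsedMessage {
  text: string;
  markers: ChatMarker[];
}

const MARKER_PATTERN = /\{\{(ATTACH_REPORT|FOLLOWUP|ACTION):([^{}]+)\}\}/g;

function extractMarkers(raw: string): ParsedMessage {
  const markers: ChatMarker[] = [];
  // Collect each marker and remove it from the text shown to the user.
  const withoutMarkers = raw.replace(
    MARKER_PATTERN,
    (_match: string, kind: string, value: string) => {
      markers.push({ kind: kind as MarkerKind, value: value.trim() });
      return "";
    },
  );
  // Collapse the blank lines that removed markers leave behind.
  const text = withoutMarkers.replace(/\n{3,}/g, "\n\n").trim();
  return { text, markers };
}
```

In a layout like the active-state wireframe in Act 6, an ATTACH_REPORT marker would map to the inline preview, FOLLOWUP markers to the follow-up chips, and ACTION:publish to the publish button; that mapping is inferred from the marker names rather than stated in the commit.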

That's the build. Fifteen commits. About 4,300 lines of V2-specific source and 7,200 lines of V2-specific tests. The plan doc was 1,771 lines. The test-to-source ratio is intentionally above 1.5 — the contract is the trust boundary, the tests pin the contract.
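As an illustration of the lint translator from commit 1, here is a sketch of a pure lookup from engineering codes to plain-English sentences. Only the tool-availability code and the gist of its message appear in this post; the other eight codes are deliberately not guessed at, and the LintFinding shape, the parameter names, and the fallback wording are hypothetical.

```typescript
// Hypothetical lint translator sketch. Only "tool-availability" is a code
// named in the post; the finding shape and fallback copy are assumptions.
interface LintFinding {
  code: string;
  params: Record<string, string>;
}

type LintTranslator = (params: Record<string, string>) => string;

const LINT_TRANSLATIONS = new Map<string, LintTranslator>([
  [
    "tool-availability",
    (p) =>
      `The tool '${p["tool"]}' isn't available for this connector. Try '${p["suggestion"]}' instead.`,
  ],
]);

// Generic wording for any code without a translation, so that raw
// engineering codes never reach the user.
const LINT_FALLBACK = "Something in this draft needs another look before it can run.";

function translateLint(finding: LintFinding): string {
  const translate = LINT_TRANSLATIONS.get(finding.code);
  return translate ? translate(finding.params) : LINT_FALLBACK;
}
```

The fallback is the important design choice: an untranslated code degrades to a vague but human sentence instead of leaking a bracketed identifier into the conversation.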


Act 6: The User Experience (Currently Lightweight, Evolving)

The V2 surface is deliberately minimal in its first form. ASCII wireframe of the empty state:

+-----------------------------------------------------------------+
|  Compose Agent with AI chat            ?    Start a new agent   |
|-----------------------------------------------------------------|
|                                                                 |
|    Agent Composer                                               |
|    (BU scope . current fiscal year)                             |
|                                                                 |
|    Suggested starters:                                          |
|    [ Weekly Finance briefing ]  [ Monthly performance report ]  |
|    [ Customer health watcher ]  [ Stock coverage review ]       |
|                                                                 |
|                                                                 |
|                                                                 |
|                                                                 |
|                                                                 |
|                                                                 |
|-----------------------------------------------------------------|
|  Converse with AI to craft the agentic report you desire... [>] |
|-----------------------------------------------------------------|
|  > Show technical details                                       |
+-----------------------------------------------------------------+

And the active state, mid-composition:

+-----------------------------------------------------------------+
|  Compose Agent with AI chat                  Start a new agent  |
|-----------------------------------------------------------------|
|                                                                 |
|  [You]    Compose a monthly contract-performance briefing       |
|           agent for City of Cape Town -- sales vs plan,         |
|           delivery progress, debtors, customer health, ...      |
|                                                                 |
|  [Bot]    Got it -- City of Cape Town, monthly, account-        |
|           manager briefing with the seven sections you named.   |
|                                                                 |
|           [O Schedule this report]                              |
|                                                                 |
|           SAMPLE PREVIEW                                        |
|           +-------------------------------------------------+   |
|           | Monthly Contract Performance Briefing --        |   |
|           | City of Cape Town                               |   |
|           |                                                 |   |
|           | Executive Summary                               |   |
|           | Account: City of Cape Town . Municipality .     |   |
|           | Western Cape . Local logistics zone . SA        |   |
|           |   [Sales 01 Aug -- 07 Aug 2025]                 |   |
|           |                                                 |   |
|           | Headline                                        |   |
|           | YTD net sales RXXXX at XXX%    gross margin     |   |
|           | translate to RXXXX of gross profit; the         |   |
|           | account is pacing at YYY% of the RZZZZZ         |   |
|           | annual budget target.                           |   |
|           |   [Sales 01 Aug -- 07 Aug 2025]                 |   |
|           |                                                 |   |
|           |   [chart: SAP Orderbook Insights by Status]     |   |
|           |                                                 |   |
|           | Account Health Composite                        |   |
|           | +-----------+--------+-----------------------+  |   |
|           | | Component | Score  | Signal                |  |   |
|           | +-----------+--------+-----------------------+  |   |
|           | | Debtor    |   50   | All Rabck overdue     |  |   |
|           | | Orderbook |  100   | Ra.bcm vs R282k prior |  |   |
|           | | Delivery  |  n/a   | No delivery signal    |  |   |
|           | | Sales     |   50   | No WoW movement       |  |   |
|           | +-----------+--------+-----------------------+  |   |
|           |                                                 |   |
|           | ... 5 more sections ...                         |   |
|           +-------------------------------------------------+   |
|                                                                 |
|           [ Looks good -- schedule it ]                         |
|           [ Move Risks above Top movers ]                       |
|           [ Add a gross-margin trend chart ]                    |
|           [ Make the tone more detailed ]                       |
|                                                                 |
|-----------------------------------------------------------------|
|  Converse with AI to craft the agentic report you desire... [>] |
|-----------------------------------------------------------------|
|  > Show technical details                                       |
+-----------------------------------------------------------------+

That's the entire user-facing surface in V1 of V2. One chat pane. Inline report preview. Follow-up chips. One disclosed publish button. One collapsed "technical details" disclosure for power users.

The roadmap is to evolve this into a canvas-style dual-pane layout — the conversation in a left rail, the live agent design (sections, charts, schedule, scope) on the right. Drag-to-reorder sections. Inline chart-type swap. Live citation map. The conversation stays the source of truth; the canvas becomes the visual confirmation.

The lightweight chat surface ships first because conversation alone is enough to design a high-quality agent. The canvas is a power-user affordance for refinement, not a precondition for authoring.


Act 7: Six Fix-Cycles, Four QA Rounds

The Composer V2 PR merged on a Thursday afternoon. The one-day production QA loop that followed is what actually convinced me the surface is ready.

Here's the post-merge fix-cycle ledger:

  1. Trigger: Round-0 production smoke: the "ship it" follow-up chip kicked off the whole workflow again instead of just publishing.
     What landed: Tightened the SHIP rule in the system prompt with explicit "WRONG behaviour at SHIP time" examples. The LLM now responds to "looks good -- schedule it" with a short acknowledgement and STOPS; the orange publish button is the actual publish path.
  2. Trigger: Owner direction: non-technical users have no concept of an SAP/simulation data plane; do not surface a "boundary confirmation" chip.
     What landed: Dropped the boundary chip entirely. Boundary acknowledgement is now server-derived. The seven simulation tools are excluded from the agent reporting toolset; plan-vs-actual tools using static targets stay.
  3. Trigger: Owner trust incident: agentic surface produced "0 rows" for City of Cape Town across multiple connectors, then auto-rendered top-N tables that included other customers.
     What landed: Stopped auto-generating tableSpecs in composeOutput. Made compute_customer_health_composite honour scope='specific'. Tightened the prompt to ban SAP customer numbers in narrative, to use the correct master-data filter field, and to label BU-aggregate facts explicitly.
  4. Trigger: QA round 1: IR-slug ("cust-city-of-cape-town") matched no SAP-keyed row. Every customer-scoped tool returned empty.
     What landed: Built a shared customerScopeResolver module: master-data lookup once, expand each input to the full set of identity strings (customerId, sapCustomerNumber, name, plus slug variants), filter downstream rows by the expanded match-key set. Auto-include query_customer_master + get_customer_deepdive when customerScope is specific. Token budget bumped 512KB → 2MB.
  5. Trigger: QA round 2: composer-generated slugs like "cust-city-of-cape-town" still didn't match master rows whose customerId stored a different shape.
     What landed: Slug-to-name reverse transform in the resolver: strip cust- prefix, replace hyphens/underscores with spaces, case-fold, match against row.name. Plus a defensive slug-variant collection so downstream stores keyed by either the friendly name or the slug both match.
  6. Trigger: QA round 3: movement pack threw "entityKey must be a function" because the connector was summary-mode (sales/finpack) without the detail-level per-row entityKey.
     What landed: Short-circuit the movement-pack builder for summary-mode connectors with a warning, not a throw. The narrative now reports honestly: "Sales movement pack is summary-mode; per-customer week-over-week diff is on the backlog."
  7. Trigger: QA round 3 also surfaced: raw [cite-tool-...] markers were leaking into rendered narrative prose.
     What landed: Defence in depth. Prompt-side: forbid inline citation IDs explicitly. Render-side: stripInlineCiteMarkers() scrubs the narrative before composeOutput calls splitNarrativeIntoSections. Preserves real markdown link refs ([1], [footnote]); only removes cite--prefixed brackets.
  8. Trigger: Owner pressure-test: "How confident are you in customer-name lookup for OTHER customers? We only tested City of Cape Town."
     What landed: Option B re-architecture. The resolver becomes strict-equality only (no regex, no slug guessing). The smart fuzzy matching moves UPSTREAM into query_customer_master with three new capabilities: acronym matching (JRA -> Johannesburg Roads Agency, EMM -> eThekwini Metropolitan Municipality), operator-curated aliases harvested from master rows (aliases/searchTerms/shortNames), and a three-status flow (resolved/ambiguous/unresolved) the LLM follows with disambiguation questions.
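The render-side half of fix 7 can be sketched in a few lines. The function name is from the ledger; the regular expression and the choice to also swallow the whitespace in front of a marker are assumptions about how such a scrubber might work.

```typescript
// Hypothetical sketch of the render-side scrubber named in fix 7.
// Removes only brackets that start with "cite-"; ordinary reference-style
// brackets such as [1] or [footnote] are left alone.
function stripInlineCiteMarkers(narrative: string): string {
  // Also drop spaces or tabs directly before the marker so that removing
  // it does not leave a double space behind.
  return narrative.replace(/[ \t]*\[cite-[^\]\n]*\]/g, "");
}
```

For example, "Sales rose 4% [cite-tool-sales-01] week over week [1]." would come out as "Sales rose 4% week over week [1]." under this sketch.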

Each of these landed within hours of being identified. Each was committed to master, pushed through CI/CD, deployed to the AI App Service, and re-tested in production. Then the next fix.
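Fix 8's three-status flow is worth sketching, because the ambiguous branch is what turns a silent wrong match into a disambiguation question. This is an illustration under assumptions: the CustomerRow shape, the acronym rule (first letter of every word), and the resolveCustomer helper are hypothetical, and the real query_customer_master is not shown in this post.

```typescript
// Hypothetical sketch of a three-status customer lookup (fix 8).
// Row shape, acronym rule and function names are assumptions.
interface CustomerRow {
  customerId: string;
  name: string;
  aliases?: string[];
}

type CustomerResolution =
  | { status: "resolved"; customer: CustomerRow }
  | { status: "ambiguous"; candidates: CustomerRow[] }
  | { status: "unresolved" };

function foldCase(value: string): string {
  return value.trim().toLowerCase();
}

// "Johannesburg Roads Agency" -> "jra"
function acronymOf(name: string): string {
  return name
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .map((word) => word.charAt(0))
    .join("")
    .toLowerCase();
}

function resolveCustomer(query: string, rows: readonly CustomerRow[]): CustomerResolution {
  const wanted = foldCase(query);
  if (wanted === "") return { status: "unresolved" };

  // A row matches on its full name, its acronym, or any curated alias.
  const matches = rows.filter(
    (row) =>
      foldCase(row.name) === wanted ||
      acronymOf(row.name) === wanted ||
      (row.aliases ?? []).some((alias) => foldCase(alias) === wanted),
  );

  const [first] = matches;
  if (first === undefined) return { status: "unresolved" };
  if (matches.length === 1) return { status: "resolved", customer: first };
  // More than one candidate: the caller should ask the user which one.
  return { status: "ambiguous", candidates: matches };
}
```

With the fuzzy matching concentrated here, the downstream resolver can stay strict-equality only: it receives an identity that has already been confirmed, or it receives nothing.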

The convergence trajectory: by QA round 4, the inline preview produced a multi-section report with real numbers (Rxxxx YTD revenue, Rxxx budget target, xx% pacing, Rxx outstanding, Rxxxm orderbook surge), a real account-health composite table, citation chips at every section, follow-up chips like "Looks good — schedule it" / "Move Risks above Top movers" / "Add a gross-margin trend chart", and zero raw cite-marker noise. The quality became indistinguishable from the native chat engine's quality on the same prompt.

The convergence rule: ship the surface, then iterate against real workloads until quality matches the existing chat surface on the same prompt. Don't ship without a comparison baseline. Don't accept "the agent gave an answer" as success — the answer has to be as good as the native chat's answer, or you haven't shipped a real upgrade.

Act 8: Closing the QA Loop with Chrome Automation

The thing that compressed the 1-day QA loop into something tractable for one human was Claude driving Chrome directly.

The pattern: I'd describe the issue I'd seen in production. Claude would open the production URL in a real Chrome instance, click through the Composer wizard the same way a user would, type the same prompt I'd typed, wait for the dry-run, screenshot the inline report, compare it to a reference, identify the regression, trace the cause through the codebase, push the fix, monitor CI, verify the deploy, re-run the same flow, and confirm the fix held.

During the last QA round Claude ran ten Chrome sessions in a single afternoon — navigate, reset, send, wait, screenshot, parse, diff, commit, push, wait for CI, repeat. No human intervention except for me reviewing the post-QA summary at the end. One of those cycles ran while I was AFK; I came back to "round 4 complete, here's the rendered preview, no cite markers, narrative quality matches native chat".

This pattern — AI driving the same UI a human would drive, reading its own logs, deploying its own fixes, re-running its own QA — is the single most under-appreciated capability shift of the last six months. It's not a demo. It worked. Repeatedly. In production.

The gating constraints are:

  • The AI must be able to see what the user sees (screenshot, page-text, console messages, network requests).
  • The AI must be able to act on what the user acts on (clicks, types, navigation, form submission).
  • The AI must be able to verify its own work (read backend logs, query the API directly, compare against a known-good baseline).
  • The AI must know when to stop (the loop converges, or hits a hard limit, or hands back to the human).

All four were satisfied in this build. The QA loop ran and the fixes landed while the owner was doing other things or had gone to bed.
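
The fourth constraint, knowing when to stop, can be written down as a loop with three exits. The sketch below is an assumption about shape only: `runQaLoop`, `CycleResult`, and the `needsHuman` flag are hypothetical names, and each real cycle (reproduce, fix, deploy, re-test) is collapsed into a single callback.

```typescript
// Hypothetical types: the post names the stop conditions, not this API.
type CycleResult = { matchesBaseline: boolean; needsHuman: boolean };
type LoopOutcome = "converged" | "handed-back" | "hard-limit";

function runQaLoop(
  runCycle: (cycle: number) => CycleResult,
  maxCycles: number,
): { outcome: LoopOutcome; cycles: number } {
  for (let cycle = 1; cycle <= maxCycles; cycle++) {
    const result = runCycle(cycle); // stands in for reproduce, fix, deploy, re-test
    if (result.matchesBaseline) return { outcome: "converged", cycles: cycle };
    if (result.needsHuman) return { outcome: "handed-back", cycles: cycle };
  }
  // Never converged within the budget: stop and report rather than loop forever.
  return { outcome: "hard-limit", cycles: maxCycles };
}
```

Making the cycle budget an explicit parameter is the design point: an unattended loop needs a bound that does not depend on the model judging its own progress.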


Self-Assessment: Benchmarking Against the Market

I asked one of my non-coding AI co-thinkers (Google's Gemini used purely for market analysis) where this V2 architecture sits relative to the enterprise BI tools shipping in 2026. The assessment was specific enough that I want to capture it here, because it informs what I prioritise next.

Market penetration

By Gemini's estimate, fewer than 5% of production enterprise BI platforms globally provide the exact convergence of capabilities V2 ships: conversational agent authoring + persistent scheduled artefacts + self-healing background automation + deterministic citations + open MCP tool exposure. Conversational BI chatbots that answer questions about existing charts are common (think Looker Studio's Gemini integration, Snowflake Copilot, Power BI Copilot). What's rare is autonomous report-authoring agents that the user composes via chat and that then run unattended on a schedule, with every claim traceable back to an immutable ETL run id.

Tier matrix

| Tier | Capability | Interface | Where it shows up |
| --- | --- | --- | --- |
| Tier 1: Conversational BI | Natural language to query data; ad-hoc charts; single-turn or basic chat thread. No persistence, no background automation. | Chat sidebar / input box | Looker Studio (Gemini), Snowflake Copilot, Power BI Copilot |
| Tier 2: The Data Canvas | Visual space; natural-language nodes that auto-generate SQL blocks; chart-organisation on a canvas layout. | Unified node/canvas UI | BigQuery Data Canvas, Power Apps Agent Studio |
| Tier 3: Autonomous analytics agents | Chat compiles a scheduled employee. Multi-turn refinement, dry runs, self-healing, persistent + scheduled artefacts, citations, MCP-exposed tools. | Dual-pane split canvas (conversation + live preview) | Anthropic Claude Artifacts, WisdomAI analytics agents, this V2 architecture |

Three architectural benchmarks where V2 lands well

1. Agentic compiler UX vs raw text-to-SQL. Most current data tools focus on text-to-code translation. If the LLM misses a comma or references an invalid schema object, the user troubleshoots the raw error. V2 wraps the composition phase in an invisible, asynchronous LLM-to-Zod self-correcting loop. The non-technical user never sees a Zod error. Agent creation is treated like code compilation, with the validation layer hidden behind a directive system prompt and a "fix silently, retry at most 3 times" rule.

2. Open interoperability vs walled gardens. Major cloud database vendors are building closed-loop agent solutions tightly coupled to their proprietary data stacks. V2 anchors execution to the open Model Context Protocol. The Composer can configure agents to run against an isolated ERP database, a local Fastify service, or third-party APIs interchangeably. There's no vendor lock-in in the runtime — the SDK is one adapter file.

3. Absolute traceability. Enterprise AI surfaces frequently generate narrative summaries directly from data snapshots, creating a trust deficit when finance demands auditability. V2 forces deterministic citations: every claim or table row binds back to an immutable ETL run id and dataset id. The LLM never authors a citation; PostToolUse hooks accumulate them onto the run document. The schema rejects any run without coverage. This matches the compliance posture of specialised enterprise analytical platforms.
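
The "fix silently, retry at most 3 times" rule from benchmark 1 can be sketched as a bounded validate-and-feed-back loop. Several things here are assumptions: the production validator is a Zod schema (a hand-written check stands in so the example has no dependencies), the three-field draft is a cut-down hypothetical, the model call is modelled as a synchronous callback for brevity, and "at most 3 retries" is read as one initial attempt plus three retries.

```typescript
// Hypothetical cut-down draft; the real contract is ComposerIntentDraftV1 (Zod).
interface IntentDraft { name: string; allowedTools: string[]; schedule: string }

type Validation =
  | { ok: true; draft: IntentDraft }
  | { ok: false; issues: string[] };

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Stand-in for schema validation: returns either a typed draft or a list of issues.
function validateDraft(candidate: unknown): Validation {
  const fields = (typeof candidate === "object" && candidate !== null
    ? candidate
    : {}) as Record<string, unknown>;
  const name = fields["name"];
  const allowedTools = fields["allowedTools"];
  const schedule = fields["schedule"];
  const issues: string[] = [];
  if (typeof name !== "string" || name.length === 0) issues.push("name: expected a non-empty string");
  if (!isStringArray(allowedTools)) issues.push("allowedTools: expected string[]");
  if (typeof schedule !== "string") issues.push("schedule: expected a string");
  if (issues.length === 0 && typeof name === "string" && isStringArray(allowedTools) && typeof schedule === "string") {
    return { ok: true, draft: { name, allowedTools, schedule } };
  }
  return { ok: false, issues };
}

const MAX_RETRIES = 3;

type ComposeResult =
  | { status: "ready"; draft: IntentDraft; attempts: number }
  | { status: "needs-clarification"; attempts: number };

// `propose` stands in for the model call; it receives the previous attempt's issues.
function composeWithSelfCorrection(propose: (feedback: string[]) => unknown): ComposeResult {
  let feedback: string[] = [];
  for (let attempt = 1; attempt <= 1 + MAX_RETRIES; attempt++) {
    const result = validateDraft(propose(feedback));
    if (result.ok) return { status: "ready", draft: result.draft, attempts: attempt };
    feedback = result.issues; // fed back to the model, never shown to the user
  }
  // Budget exhausted: ask the user a plain-English question instead of surfacing errors.
  return { status: "needs-clarification", attempts: 1 + MAX_RETRIES };
}
```

The validation issues travel only between the validator and the model. What the user sees is either a ready draft or a clarifying question, which is what keeps the compiler invisible.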

Benchmark verdict: V2's combination of zero-friction chat-first authoring with rigid, type-safe validation boundaries lands in the "autonomous analytics agents" tier. The next step — the dual-pane split canvas with a live agent-design surface — is on the roadmap and is what closes the visual-confirmation gap relative to the most polished products in the same tier.

Modern Software Development with Two AI Pair-Programmers

I want to spend a section on the actual practice of building this with two AIs and one human, because the pattern is doing real work and I think it's under-described.

The two AIs have different shapes. Claude is the long-context implementer. It holds the full codebase context for hours at a stretch, writes code, runs tests, drives CI, observes production, makes plans, executes plans, reflects, iterates. Its strength is depth and continuity. Its bias is tunnel vision — once it's in a flow, it can miss things adjacent to the work that fresh eyes would catch immediately. It also codes a little hastily, leaving gaps or assuming things without validating deeply.

Codex is the short-context reviewer. It has long context and memory too, but I use it frugally because of the 5-hour time limits, so it often enters cold: it reads the diff or the file at master, with only some prior-session memory to lean on. Its strength is freshness, depth and architectural rigour: if there's a bug in plain sight, Codex finds it; if a design assumption looks suspect to an outsider, Codex flags it. Its bias is occasional outdatedness and loss of the owner's intent. Sometimes it reviews against an older mental model of the code than master actually carries, and sometimes it assumes an intent from its own judgment, so I have to steer it back to my design thinking.

The human's job is to steer the two. Claude implements. Codex reviews. Sometimes they both drift and don't get my instinct or intuition for simplifying complexity. The human triages, decides what to take, what to push back on, what to send back to either AI for refinement. The pattern that worked best for V2:

  1. I describe the change I want at a high level. Claude proposes a plan. I push back on framing. Claude revises. Repeat 2-3 times until the plan is the shape I want.
  2. I hand the plan to Codex for clean-room review. Codex returns numbered findings (P0/P1/P2 with rationale). I relay them to Claude.
  3. Claude triages, agrees with most, occasionally pushes back. I adjudicate the disagreements.
  4. Claude implements, commits, pushes, drives CI. I watch the diff over Claude's shoulder. I got Codex to review every single commit from Claude; Codex was the gatekeeper.
  5. I ship to production. I run the surface in Chrome, find issues, describe them.
  6. Claude drives Chrome itself, reproduces, fixes, redeploys, re-tests. Hand back when converged.
  7. I take the merged work back to Codex for a post-implementation review. Codex finds the bugs Claude missed. Claude fixes them. Repeat.

The asymmetry is the leverage. Claude's tunnel vision is covered by Codex's fresh eyes. Codex's cold, sometimes outdated context is covered by Claude's depth and continuity. The human's job is to keep the loop honest: reject sycophantic agreement, demand pushback on weak findings, take credit for nothing the AIs caught.

For V2 specifically: Codex caught five substantive things during plan review and another seven during code review across both PRs. None of them were trivial — the eager-registration guardrail, the byte-equivalence snapshot suite, the hand-written JSON schema for propose_agent_draft.intent, the proxy mount for the V2 endpoints in the root server, the canonical-key terminology audit, the FOLLOW_UPS parser regression tests, the customer-key server-side enforcement. Every one of those would have shipped broken without the second reviewer.

The pattern, distilled: Claude is the depth-first implementer. Codex is the breadth-first reviewer. The human is the architect, the arbiter, and the QA lead. Each AI's bias is the other's strength. Each iteration gets sharper. No one of the three could ship V2 alone.

Metrics

| Metric | Value |
| --- | --- |
| Plan document length | 1,771 lines of markdown |
| Total commits during the V2 window (May 12-22, 2026) | 217 |
| Composer-v2 / chat-profile-specific commits | 71 |
| PR 1 commits (foundational refactor) | 6 |
| PR 2 commits (Composer V2 specialisation) | 15 |
| Post-merge production fix-cycles | 8 |
| V2-specific source LOC (ai-core chatProfiles + composer/v2 + ai-service composerV2 + customerScopeResolver + frontend view + sub-components) | ~4,300 lines |
| V2-specific test LOC (chatProfiles + composer-v2 + customer-scope-resolver tests) | ~7,200 lines |
| Test-to-source ratio | ~1.67 |
| Byte-equivalence snapshot corpus size | ~30 canonical prompts |
| Composer V2 tools (authoring + disambiguation) | 4 + 2 |
| Draft store TTL | 30 minutes |
| Dry-run quota per session | 10 |
| QA rounds against production | 4 (with ~5 round-0 hotfixes before round 1) |
| Chrome QA cycles driven by Claude unattended | ~10 in the final QA round alone |
| Total elapsed wall time | ~4 days from "plan-the-V2-refactor" to "round-4-QA-convergence" |

Lessons & Advice

For anyone building a similar agentic specialisation on top of an existing AI chat surface, here's what I'd write down up-front.

1. Build the platform abstraction before the specialisation

The strongest decision in the V2 plan was carving out the Chat Profiles platform as Layer 0. Without it, Composer V2 would have been a special case bolted onto the chat engine, and the next specialisation (forecast editor? scenario architect? master-data cleanup?) would have repeated the same bolt-on pattern. With it, future specialisations are a new profile bundle in a registry — no runtime changes, no new endpoints.

The cost: the foundational refactor needed a byte-equivalence safety net (the snapshot suite). The benefit: a one-day soak proved the refactor didn't drift, and PR 2 could land with no fear of regressing the existing chat surface.
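
To make "a new profile bundle in a registry" concrete, here is a minimal sketch of a registry with eager registration and a load-time guardrail. The `ChatProfile` shape, the profile ids, and the prompt text are hypothetical; the real contract is the ChatProfileV1 Zod schema, and only the tool name propose_agent_draft is taken from this post.

```typescript
// Hypothetical shape: the post's real contract is the ChatProfileV1 Zod schema.
interface ChatProfile {
  id: string;
  systemPromptAddendum: string;
  toolNames: readonly string[];
}

const profileRegistry = new Map<string, ChatProfile>();

function registerProfile(profile: ChatProfile): void {
  if (profileRegistry.has(profile.id)) {
    throw new Error(`duplicate chat profile: ${profile.id}`);
  }
  // Frozen copy, so a registered profile cannot be mutated later.
  profileRegistry.set(profile.id, Object.freeze({ ...profile }));
}

function resolveProfile(id: string): ChatProfile {
  const profile = profileRegistry.get(id);
  if (profile === undefined) throw new Error(`unknown chat profile: ${id}`);
  return profile;
}

// Eager registration: runs at module load, not lazily on first request.
registerProfile({ id: "default-chat", systemPromptAddendum: "", toolNames: [] });
registerProfile({
  id: "composer-v2",
  systemPromptAddendum: "You help the user author an agent.",
  toolNames: ["propose_agent_draft"], // the other authoring tools are omitted here
});

// Guardrail check: fails fast if an expected profile was never registered.
for (const expected of ["default-chat", "composer-v2"]) {
  resolveProfile(expected);
}
```

In this shape a future specialisation is one more `registerProfile` call plus its tools. The runtime only ever calls `resolveProfile`, so it needs no new branch.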

2. The do-no-harm gate is non-negotiable for foundational refactors

If you're touching the runtime that powers the existing AI chat experience, you need a mechanical proof that nothing changed. A snapshot of canonical responses, replayed on every CI build, byte-diffed against a pre-refactor baseline. Don't trust manual smoke testing — it misses the prompts you don't think to try. Don't trust unit tests — they pin the parts you remembered to test. The snapshot pins the whole black-box behaviour.
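
A minimal sketch of such a gate follows, under some assumptions: responses are held as strings keyed by a prompt id, exact string equality stands in for the byte diff, and `findDrift` and its types are hypothetical names rather than the suite's real API.

```typescript
// Hypothetical types: prompt id -> canonical response text.
type SnapshotMap = Record<string, string>;

interface DriftReport {
  promptId: string;
  reason: "missing" | "changed";
  firstDiffAt?: number; // offset of the first differing character, when changed
}

function firstDiff(a: string, b: string): number {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    if (a.charCodeAt(i) !== b.charCodeAt(i)) return i;
  }
  return a.length === b.length ? -1 : shared; // -1 means identical
}

// Compares a replayed run against the pre-refactor baseline, prompt by prompt.
function findDrift(baseline: SnapshotMap, replayed: SnapshotMap): DriftReport[] {
  const drift: DriftReport[] = [];
  for (const [promptId, expected] of Object.entries(baseline)) {
    const actual = replayed[promptId];
    if (actual === undefined) {
      drift.push({ promptId, reason: "missing" });
      continue;
    }
    const at = firstDiff(expected, actual);
    if (at !== -1) drift.push({ promptId, reason: "changed", firstDiffAt: at });
  }
  return drift;
}
```

A CI step would fail the build whenever `findDrift` returns a non-empty array. Reporting the first differing offset makes a one-character drift easy to locate in a long response.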

3. Specialise via profile, not via fork

The Composer V2 surface adds a profile, four tools, a draft store, and a UI view. Nothing else. No new runtime branch. No new MCP server. No new hooks. No new scheduler. No new Cosmos partition. No new Gallery. No new renderer. The specialisation is purely additive. An agent composed by V2 is V1-shape downstream; the surface that authored it is a UI affordance, not an architectural divergence.

This is the rule that prevents fragmentation. If your "improved" front door produces artefacts that the rest of the platform doesn't know how to handle, you haven't built a better front door — you've built a parallel platform you now have to maintain.

4. The LLM is authoritative for intent only

The Composer LLM authors a typed payload describing the user's intent: name, description, allowedTools, taskSpec, schedule, customerScope, comparisons, visualPreferences. Everything else — id, pk, defType, ownerUserId, audit, revision, the system prompt itself, the publish token — is server-owned, materialised after extraction, never trusted to the LLM. The schema rejects any draft that tries to set those fields.

This is what prevents the LLM from impersonating users, skipping the runtime's tool-first directive, or short-circuiting the publish gate. The trust boundary is the schema, not the prompt.
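
A dependency-free sketch of that boundary: a strict parse that rejects any key outside the intent fields, then a materialiser that writes the server-owned fields. The intent field names come from the list above; `parseIntentStrict`, `materialise`, `ServerContext`, and the sample values are hypothetical. In the real system the strictness is a Zod schema's job, where `.strict()` on an object schema has the same reject-unknown-keys effect.

```typescript
const INTENT_FIELDS = [
  "name", "description", "allowedTools", "taskSpec",
  "schedule", "customerScope", "comparisons", "visualPreferences",
] as const;

type IntentField = (typeof INTENT_FIELDS)[number];
type IntentDraft = Partial<Record<IntentField, unknown>>;

// Rejects the whole payload if the model tried to set anything it does not own.
function parseIntentStrict(payload: Record<string, unknown>): IntentDraft {
  const allowed = new Set<string>(INTENT_FIELDS);
  const extras = Object.keys(payload).filter((key) => !allowed.has(key));
  if (extras.length > 0) {
    throw new Error(`draft rejected, server-owned or unknown keys: ${extras.join(", ")}`);
  }
  const draft: IntentDraft = {};
  for (const field of INTENT_FIELDS) {
    if (field in payload) draft[field] = payload[field];
  }
  return draft;
}

// Hypothetical server-side context, derived from the authenticated request.
interface ServerContext { newId: string; ownerUserId: string; nowIso: string }

function materialise(intent: IntentDraft, ctx: ServerContext) {
  return {
    ...intent,
    // Server-owned fields are written last so nothing upstream can shadow them.
    id: ctx.newId,
    ownerUserId: ctx.ownerUserId,
    revision: 1,
    audit: { createdBy: ctx.ownerUserId, createdAt: ctx.nowIso },
  };
}
```

The two steps are deliberately redundant: even if the strict parse were loosened by mistake, the materialiser would still overwrite identity and audit fields with values the server derived itself.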

5. Build the smart matching upstream, not downstream

The single biggest mistake in V2's first three days was putting fuzzy customer-name matching in the wrong layer — the downstream scope resolver instead of the upstream master-data lookup. The downstream resolver got progressively more regex-laden as edge cases surfaced (slug variants, punctuation, abbreviations). When the owner pressed "how confident are you for OTHER customers, not just City of Cape Town?", I had to admit the resolver was a guess that worked for clean names and failed silently for the rest.

The architecturally correct fix (Option B) was to make the master-data lookup smart (exact + label + alias + acronym + prefix + all-tokens, ranked) and the downstream resolver dumb (strict equality only). The LLM is now expected to call query_customer_master with the user's free-text phrase FIRST, get back a canonical id, and pass THAT to the resolver. No regex anywhere downstream. Acronyms like "JRA" resolve to "Johannesburg Roads Agency" deterministically. Operator-curated aliases on the master row pick up colloquial short-forms like "CCT" without code changes.
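
A compact sketch of the smart-upstream, dumb-downstream split is below. It assumes a simplified master row (id, label, aliases) and a rank order taken from the list above; the ids, the "Johannesburg Water" row, and the function bodies are illustrative, not the production query_customer_master.

```typescript
// Hypothetical, simplified master row.
interface CustomerRow { id: string; label: string; aliases: string[] }

type Lookup =
  | { status: "resolved"; id: string }
  | { status: "ambiguous"; candidates: string[] }
  | { status: "unresolved" };

const normalise = (s: string): string =>
  s.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();

const acronymOf = (label: string): string =>
  normalise(label).split(" ").map((word) => word.charAt(0)).join("");

// Lower rank is a stronger match; null means no match at all.
function matchRank(row: CustomerRow, phrase: string): number | null {
  const q = normalise(phrase);
  if (q.length === 0) return null;
  const label = normalise(row.label);
  if (normalise(row.id) === q) return 1;                        // exact id
  if (label === q) return 2;                                    // exact label
  if (row.aliases.some((a) => normalise(a) === q)) return 3;    // curated alias
  if (acronymOf(row.label) === q) return 4;                     // acronym
  if (label.startsWith(q)) return 5;                            // prefix
  const labelTokens = new Set(label.split(" "));
  if (q.split(" ").every((t) => labelTokens.has(t))) return 6;  // all tokens
  return null;
}

// Upstream: smart, ranked, and explicit about uncertainty.
function queryCustomerMaster(rows: CustomerRow[], phrase: string): Lookup {
  const scored = rows
    .map((row) => ({ row, rank: matchRank(row, phrase) }))
    .filter((s): s is { row: CustomerRow; rank: number } => s.rank !== null);
  if (scored.length === 0) return { status: "unresolved" };
  const best = Math.min(...scored.map((s) => s.rank));
  const winners = scored.filter((s) => s.rank === best);
  const [only] = winners;
  if (winners.length === 1 && only !== undefined) return { status: "resolved", id: only.row.id };
  return { status: "ambiguous", candidates: winners.map((s) => s.row.label) };
}

// Downstream: strict equality on the canonical id, nothing else.
function resolveScopeStrict(rows: CustomerRow[], canonicalId: string): CustomerRow | undefined {
  return rows.find((row) => row.id === canonicalId);
}

const master: CustomerRow[] = [
  { id: "C-1001", label: "City of Cape Town", aliases: ["CCT"] },
  { id: "C-2002", label: "Johannesburg Roads Agency", aliases: [] },
  { id: "C-3003", label: "Johannesburg Water", aliases: [] },
];
console.log(queryCustomerMaster(master, "JRA"));
console.log(queryCustomerMaster(master, "Johannesburg"));
```

With this sample data "JRA" resolves through the acronym rule, "CCT" through the curated alias, and "Johannesburg" comes back ambiguous with two candidates, which is the cue for the LLM to ask a disambiguation question rather than guess.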

6. Hand the AI the browser

If your AI assistant can drive Chrome, your QA loop changes character. The same loop that used to be "human describes bug, AI infers, AI fixes, human verifies, repeat" becomes "AI reproduces, AI fixes, AI verifies, hands back when converged". This is not a luxury feature — it's a compounding capability. Each fix-cycle is hours shorter. Multiple cycles run in parallel. The human supervises rather than drives.

For this to work, the AI needs: clickable accessibility-tree access to the page, the ability to read console messages and network requests, JavaScript-eval in the page context, and screenshot output. All four are table stakes; if your tooling doesn't provide them, prioritise getting them before you optimise anything else.

7. Stop QA when quality matches the comparison baseline

The QA loop's exit criterion was not "the agent gave an answer". It was "the agent's answer is as good as the native chat engine's answer on the same prompt". The four QA rounds existed because that bar was higher than the first three rounds met. Round 4 met it. That was when the loop stopped.

If you don't have a comparison baseline, you don't have a stopping criterion. If your "agentic surface" doesn't have to match the quality of your existing surface, you're shipping a regression dressed as an upgrade.


Best Practices for AI Developers

  • Schemas as the trust boundary. Zod or equivalent. Strict. Frozen before code. Tagged in git. Every interface across an AI surface should be schema-validated; every claim the AI makes should reduce to a schema-conformant artefact.
  • The materialiser pattern. Server-owned fields (id, audit, ownerUserId, system prompt, publish token) get overwritten after the LLM returns. Never let the LLM author identity, audit, or runtime-directive fields.
  • One adapter file imports the SDK. Everything else stays SDK-agnostic. Swap cost is bounded.
  • MCP tools always-loaded; built-ins locked down. If alwaysLoad is false, your model never sees your tools when built-ins are disallowed. This is a known SDK sharp edge.
  • PostToolUse.tool_response is a JSON string. Parse first. Always.
  • Tool-first system prompts. List the tools, then directly tell the model "CALL THESE TOOLS FIRST". The model will not infer the instruction from a list.
  • Citations are tool-handler output. Never LLM-authored. Hooks accumulate them. The LLM writes prose around them.
  • Render via the same components that render the final artefact. Inline previews must use the same sub-components as the eventual filed report — no private copy. Future renderer changes land in one place.
  • Draft stores are in-memory. 30-minute TTL. Per-session quotas. No Cosmos coupling until the user explicitly publishes.
  • Eager registration with a guardrail test. Profiles, plugins, tools — whatever the registration model, prove it at module load and pin the proof with a test.
  • Snapshot suites for foundational refactors. Replay real prompts, byte-diff against a baseline, fail the build on drift.
  • Comparison baseline for QA convergence. Don't ship the new surface without proving its quality matches the surface it's supposed to replace.
  • Two AI pair-programmers, one human supervisor. Long-context implementer + short-context reviewer + arbitrating human. Each AI's bias is the other's strength.
  • Give the AI the browser. Closed-loop QA. Compounding capability.
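
Two of the bullets above (parse the tool response first; citations are accumulated by hooks, never authored by the LLM) fit in one sketch. Treat the shapes as assumptions: `PostToolUseEvent` models only the one fact stated above, that tool_response arrives as a JSON string, and the `citations`, `etlRunId`, and `datasetId` field names are hypothetical. This is not the SDK's actual hook signature.

```typescript
// Hypothetical shapes for illustration only.
interface CitationRef { etlRunId: string; datasetId: string }
interface RunDocument { citations: CitationRef[] }
interface PostToolUseEvent { tool_name: string; tool_response: string }

function onPostToolUse(event: PostToolUseEvent, run: RunDocument): void {
  let parsed: unknown;
  try {
    parsed = JSON.parse(event.tool_response); // parse first: the response is a string
  } catch {
    return; // not JSON, so there is nothing to accumulate
  }
  if (typeof parsed !== "object" || parsed === null) return;
  const rawCitations = (parsed as Record<string, unknown>)["citations"];
  if (!Array.isArray(rawCitations)) return;
  for (const item of rawCitations) {
    if (typeof item !== "object" || item === null) continue;
    const record = item as Record<string, unknown>;
    const etlRunId = record["etlRunId"];
    const datasetId = record["datasetId"];
    if (typeof etlRunId === "string" && typeof datasetId === "string") {
      run.citations.push({ etlRunId, datasetId });
    }
  }
}

// Mirrors the rule that a run without citation coverage is rejected.
function assertCoverage(run: RunDocument): void {
  if (run.citations.length === 0) throw new Error("run rejected: no citation coverage");
}
```

Because citations only ever enter the run document through this path, the model has no way to invent one: anything it writes in prose that looks like a citation is just text.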

The Takeaway

The Composer V2 work is what happens when a one-person engineering team treats two frontier AI assistants as pair-programmers and a third AI as a market analyst — with the human as the architect, supervisor, and accountability layer.

The shape of the work is different from solo development. The shape is different from team development. It's not "the AI wrote the code"; it's "the AI implemented the design I steered, while a second AI clean-room reviewed it, while I drove the browser and made the calls about what was good enough to ship." The product of that loop is in production today. Real users compose real agents in plain English; the platform publishes a typed, scheduled, ABAC-scoped, citation-stamped artefact that runs unattended every Monday at 06:30 SAST.

The platform is still small. One person. One repo. One Fastify process for the AI surface. One App Service for the web app. One Cosmos account. No vector store. No background-job platform. No multi-agent orchestration framework. The architecture is deliberately minimalist — the abstractions that matter (schemas, profiles, MCP tools, hooks, structured artefacts) are sharp; the abstractions that don't (a sub-agent swarm, a custom DSL, a parallel data plane) are absent.

What it isn't, is a slide deck. It's a Monday-morning briefing that fires by itself, an Account Manager who composes their own contract-performance agent in a five-minute chat, a citation chip that traces every number back to its SAP source. The Composer V2 surface is one piece of that puzzle — the front door. The next piece — the dual-pane canvas with live agent design — is what closes the gap to the most polished tier-3 products in the market.

Onwards to V3: a fully canvas-style UX of the kind that frontier platforms like Claude, ChatGPT and Gemini provide. I'll ship this too, within a week (if time allows).

Remember, I'm just a GM, building my own enterprise BI platform to manage my business. Working with my two AI copilots in between operational meetings, and on evenings and weekends, has unlocked a ton of productivity that is truly amazing!


Stack: TypeScript-flavoured JavaScript end-to-end. Fastify on a Linux App Service for the AI service. Express + React on a separate App Service for the web app. Cosmos DB as the only data store. @anthropic-ai/claude-agent-sdk + @anthropic-ai/sdk for the agentic loop and the chat engine. node-schedule for the cron tick. Recharts for chart rendering. react-markdown + remark-gfm for the markdown renderer. Zod for the frozen contracts. The new abstractions in V2: ChatProfileV1 (Zod schema for chat-engine specialisations), ComposerIntentDraftV1 (the typed draft payload), the four-tool authoring set, the in-memory draft store with TTL + quota, the marker protocol for inline-report attachment, and the byte-equivalence snapshot suite as the do-no-harm gate for the foundational refactor.
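
For readers who want a picture of the in-memory draft store with TTL + quota, here is a minimal sketch. The 30-minute TTL and the 10 dry-runs per session are the figures from the Metrics table; the class, its method names, and the injectable clock are assumptions made for illustration and testability.

```typescript
const TTL_MS = 30 * 60 * 1000; // 30-minute TTL
const DRY_RUN_QUOTA = 10;      // dry-runs per session

interface DraftEntry<T> { draft: T; expiresAt: number; dryRuns: number }

class DraftStore<T> {
  private entries = new Map<string, DraftEntry<T>>();
  private now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Saving a draft refreshes the TTL but keeps the session's dry-run count.
  put(sessionId: string, draft: T): void {
    const previous = this.live(sessionId);
    this.entries.set(sessionId, {
      draft,
      expiresAt: this.now() + TTL_MS,
      dryRuns: previous?.dryRuns ?? 0,
    });
  }

  get(sessionId: string): T | undefined {
    return this.live(sessionId)?.draft;
  }

  // Returns false when there is no live draft or the quota is spent.
  consumeDryRun(sessionId: string): boolean {
    const entry = this.live(sessionId);
    if (entry === undefined || entry.dryRuns >= DRY_RUN_QUOTA) return false;
    entry.dryRuns += 1;
    return true;
  }

  // Expired entries are evicted lazily, on access.
  private live(sessionId: string): DraftEntry<T> | undefined {
    const entry = this.entries.get(sessionId);
    if (entry !== undefined && entry.expiresAt <= this.now()) {
      this.entries.delete(sessionId);
      return undefined;
    }
    return entry;
  }
}
```

Nothing here touches the database: a draft that is never published simply expires, which keeps abandoned conversations from leaving artefacts behind.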

Total elapsed: ~4 days from "let's reframe the Composer as a chat" to "round-4 QA convergence in production". Total Composer-V2-specific commits: 71 across two PRs. Total post-merge fix-cycles: 8. Total Chrome QA cycles driven unattended by the implementer AI: ~10 in the final round alone. Cups of coffee: still lost count.
