Mo Khan's Blog: May 2026

Wednesday, 27 May 2026

The Scoreboard I Cannot Query: How Anthropic Locks Team-Plan Admins Out of Their Own Claude Code Data

May 27, 2026. South Africa.

This week I tried to do the single most pro-Anthropic thing a paying customer can do: measure my engineering team's adoption of Claude Code so I could grow it. I wanted a live dashboard, sitting next to our other operating metrics, that answered one question every week — are more of my engineers getting more value out of Claude Code than they did last week? I wanted to celebrate the power users, spot the colleagues who hadn't started yet, and put a real number in front of leadership to justify expanding our investment.

I could not build it. Not because the data doesn't exist — Anthropic has it, and renders it beautifully inside the Claude app. I couldn't build it because Anthropic does not let a Team-plan administrator query their own team's data programmatically. The numbers are right there on my screen. The door to them is locked, and the key is sold separately, under a different plan, in a different organisation, behind a URL that redirects to nowhere.

I want to write this up the way I'd write any serious product escalation: lead with the conclusion, show the evidence, separate fact from opinion, and end with what "good" looks like. I've spent twenty-five years in and around engineering and product leadership, and I've learned that the most useful feedback a vendor can get is from a customer who wants them to win and is willing to be specific about where they're getting in their own way. So, candidly and with respect: Anthropic, this is a miss, and you can do better.

Bottom line up front. Claude Code's usage analytics — per-engineer lines accepted, sessions, acceptance rate, active-user trends — are accessible programmatically only via an Enterprise-gated API, minted in an organisation most Team admins don't even know they have. On a Team plan you get an in-app dashboard and a manual CSV export, and nothing else. The admins most motivated to evangelise Claude Code internally are precisely the ones locked out of the data that would let them do it. That is the wrong place to draw the line.

Act 1: Working backwards from what I actually wanted

Start with the customer, not the API. The customer here is me — an engineering leader trying to drive a behaviour change. Adoption of a new tool is never a snapshot; it's a trend. "We have eight seats" tells you nothing. "Weekly active engineers went from two to six over a month, acceptance rate is holding above 90%, and here are the two people who haven't logged a session yet" tells you everything. That second sentence is a management tool. The first is a procurement receipt.

So the brief was concrete: a tab in our internal platform showing active engineers over time, sessions, lines of code accepted, tool-acceptance rate, and cost per active engineer — framed end to end around that one adoption question. We already do exactly this for our delivery metrics out of Jira, so the pattern was proven and the appetite was a single afternoon. We don't use GitHub, so I was happy to drop the pull-request metrics and keep the rest.

This is the behaviour Anthropic wants from a customer. I was about to instrument my own team to use their product more. Hold that thought.

Act 2: The build was the easy part (it always is now)

I didn't write the integration by hand. I drove; Claude Code wrote it. In one pass it produced a clean server-side module against Anthropic's documented Claude Code Analytics API — windowed daily fetch, cursor pagination, exponential backoff, a circuit breaker, an hour-long cache with stale-while-revalidate. Good, defensive, production-shaped code. I created an Admin API key in the Console exactly as the docs instruct, wired it into our backend, and shipped it.

Then we did the one thing that separates engineering from wishful thinking: before rendering a single chart, we ran the integration against the real API and looked at what came back. Insist on the highest standards isn't a poster on a wall — it's the discipline of validating the write/read path against production data instead of trusting that green tests mean a correct system. That discipline is exactly what caught the problem.

Act 3: Dive deep — the smoke test that told the truth

Ninety-day window. The API returned four records. Zero of them were engineers. The only usage it reported was our own application's API key making headless calls — useful to know, but not a single human developer in the result set.

Now, I know my team uses Claude Code. I'd seen the dashboard the day before: heavy, healthy usage. So either every engineer had quietly stopped overnight, or the API was answering a different question than the one I was asking. A green pipeline returning a confident, precise, wrong answer is the most dangerous artefact in software. Had we skipped the validation step and wired the charts straight up, I'd have walked into a leadership review with a dashboard declaring "zero adoption" of a tool my team was using daily. That's not a bug report; that's a credibility event.

The lesson, restated: a silent wrong answer is worse than a loud failure. The API did not error. It returned 200 OK and an empty truth. Everything that followed was the work of figuring out why the truth was empty.

Act 4: Two organisations, one name, no link

So we interrogated the credential itself. We pointed the same Admin API key at the Admin API's other endpoints and asked it, in effect, "who are you?" The answer was the whole story:

What I expected the key to see	What the key actually saw
My 8-person engineering team	One member — me
Engineers' Claude Code usage	Zero records for every date I checked
"My organisation"	An organisation with the same name — but the API one, not the team one

There are two organisations. They share the same display name. They are not linked, not cross-queryable, and nothing in either product tells you the other exists:

   Anthropic Console (API plan)          Claude app (Team subscription)
   ----------------------------          ------------------------------
   * where my Admin key lives            * where my 8 engineers live
   * one member (me)                     * the Claude Code seats + real usage
   * our app's API key + credits         * the dashboard I screenshotted
   * Admin API works here                * no programmatic API surface at all
        |                                          |
        +-------------  same name, NO link  -------+

The Admin API key can only ever see the Console organisation. My engineers don't live there; they live in the Team subscription, which is a separate identity on a separate surface. And here's the part that moves this from "confusing" to "below the bar": this is a known, reported, still-open bug. Anthropic's own issue tracker carries claude-code #27780 — "Claude Code Analytics Admin API does not return subscription/OAuth users", documenting precisely this: the endpoint only ever returns customer_type: "api" records, and the OAuth/subscription users the docs themselves call "most common" never appear at all. It has been open since February 2026 with more than a dozen comments and no fix — and it notes that two earlier reports of the same bug (#20819 and #9700) were auto-closed by a bot without a single response from Anthropic. The product knows it gives a misleading answer here, and it gives it anyway, silently, with a 200.

I'll be plain about this as a design critique: two organisations with the same name and no discoverable relationship is a trap that every admin in my position will fall into. It cost me hours. It will cost the next person the same hours. Earn trust means, among other things, never letting a customer build a confident mental model on top of a silent inconsistency.

Act 5: The wall — and a documented door that opens onto a wall

Fine. If the engineers live in the Team subscription, I'll mint a key there. The documentation for the subscription analytics API is explicit: create an API key at claude.ai/analytics/api-keys, as a Primary Owner. I am the Primary Owner. So I had Claude drive my own browser to go find it.

That URL redirects to general Settings. The page does not exist for my organisation. We checked, methodically, every place a Primary Owner would reasonably look:

Organisation settings — no API-keys section.
The Analytics app itself (/analytics/activity, /analytics/claude-code) — no API-keys section.
The Claude Code analytics dashboard, top to bottom — a gorgeous chart, a CSV Export button, and nothing resembling programmatic access.

The data was right there on the screen the whole time — that month, five active engineers, on the order of a hundred thousand lines of code accepted between them, weekly active users up by a third. The only sanctioned way to get those numbers out of Anthropic's servers and into a dashboard I control is to click "Export" and download a spreadsheet by hand, every week, forever.

The reason the URL is a dead end is the reason this whole post exists. The programmatic analytics API is real and genuinely good — per-user Claude Code metrics, daily/weekly/monthly active users, token and cost breakdowns. But the reference guide that documents it is the Claude Enterprise Analytics API guide, and minting a key requires being "Primary Owner within your Enterprise organisation." I am a Primary Owner of a Team plan. Team plans don't get it.

Capability	Team plan	Enterprise plan
In-app Claude Code analytics dashboard	Yes	Yes
Manual CSV export	Yes	Yes
Programmatic Analytics API + key	No	Yes
Console Admin API returns your engineers	No (wrong org)	—

So the documented path I was sent down — create a key, call the API — was never available to me. The door is in the docs. The wall is behind the door. The key is sold one pricing tier up.

Act 6: Why this is the wrong line to draw

Let me separate the legitimate from the indefensible, because I don't want this to read as a customer who simply wants everything for free. I don't.

What's legitimate. Gating heavy governance, SCIM provisioning, audit/compliance export, and bulk administrative control behind Enterprise is completely reasonable. Those are genuinely enterprise concerns with enterprise cost. If that were the line, I'd have nothing to write.

What's indefensible. Putting a Primary Owner's read-only access to their own team's adoption numbers on the far side of that same wall. Think about who that decision actually penalises. It's not the disengaged customer. It's the admin who is so bought in that they want to wire Claude Code metrics into their company's own operating cadence and evangelise the results internally. You are taxing your most enthusiastic champions at the exact moment they're trying to spend political capital on your behalf. Customer obsession would start from that champion and work backwards. This decision works backwards from a pricing table.

And the supporting details compound it rather than soften it:

A documented URL (claude.ai/analytics/api-keys) that redirects to nowhere for the plan most likely to read that doc.
An Admin API that returns a silent, empty, 200-OK wrong answer for subscription users — a behaviour reported repeatedly, and still open, in Anthropic's own issue tracker.
A two-organisations-same-name identity model with no in-product signpost connecting them.

Individually, each is a paper cut. Together, they cost a competent, motivated, paying admin the better part of a day to discover that the answer is simply "no, not on your plan." That's not a hard technical limitation. Every byte of this data already leaves Anthropic's servers to render my dashboard and my CSV. This is a product-segmentation choice, and it's the wrong one.

What "good" looks like

Criticism without a recommendation is just complaint, so here's what I'd ship if this were my product. None of it requires building new data — only opening a tap that already flows.

Give Team-plan Primary Owners a scoped, read-only analytics key — mintable from the very dashboard they're already looking at. Read-only. Their own org. That's the whole ask. I'll build everything downstream of it myself; in fact, I already have.
Make the two-organisation relationship discoverable in-product. One line on the Console org switcher and the analytics page: "Your Claude Code seats live in your Team organisation. View its analytics here." A signpost costs nothing and saves every admin the rabbit hole I just climbed out of.
Fix the silent wrong answers. If the Console Admin API can't return subscription users, it should say so — an explicit, documented not-supported-for-this-org-type response beats a friendly, confident, empty 200 every time. Loud failure over quiet falsehood.
Repair the docs. A reference guide should not point Team admins at a URL that redirects to Settings. Either the page exists for them, or the guide says, in bold, "Enterprise only."
State the tiering honestly, up front. If programmatic analytics is an Enterprise feature, put that on the Team plan page before I write the integration — not three hours and one production-shaped backend into the build.

The Takeaway

I want to end where I started: as a customer who wants Anthropic to win. Claude Code is an outstanding product — my team's usage proves it, and the irony is that I only know how good the numbers are because the in-app dashboard is excellent. The build itself, done with Claude, took an afternoon. The product gap took the rest of the day to map, and it's the only thing standing between me and a dashboard that would have made me a louder advocate inside my own company.

So I'll say it directly and respectfully: Anthropic can do better here, and I believe they will. The fix is small, the data already exists, and the customers you'd delight are the ones already cheering loudest for you. Let the people keeping score see the scoreboard. Until you do, I'll wire up the CSV export and keep the live integration warm — it's ninety percent done, pointed at the endpoint I'm not yet allowed to call, ready the day that wall comes down.

I'd rather be writing about the dashboard. I'll settle, for now, for writing about why I can't.

Method note: the integration was written by Claude Code in a single session; the production read-path was validated live before any UI was built (which is what surfaced the empty result); and the org/credential investigation — including driving my own Chrome session through the Claude admin settings to confirm the missing API-key surface — was likewise done by Claude under my direction. Team-member identities and individual figures have been deliberately omitted; the only numbers quoted are organisation-level aggregates already shown on Anthropic's own dashboard. No teammate's personal usage is named here.

Stack, for the curious: React/Vite front end, Azure Functions API, a server-side fetch-and-cache module mirroring our existing Jira "Technology Roadmap" integration. Cups of coffee spent discovering that "the API exists" and "the API is available to me" are two very different sentences: more than the chart would suggest.

Friday, 22 May 2026

The Invisible Compiler: How One Human and Two AI Pair-Programmers Turned a Five-Step Stepper Into a Chat in just 4 days

May 22, 2026. South Africa.

The V1 agentic layer of the platform — the typed-and-versioned, schema-validated, Composer-plus-Linter-plus-Runtime-plus-Scheduler stack I wrote about four days ago — was working. Real Monday-morning briefings. Real citations. Real charts. Real customers. But the way you composed an agent in V1 was a five-step stepper: Compose → Lint → Dry-Run → Review → Publish. Status pills like [tool-availability] would flicker across a panel. If something went wrong, the user was asked to edit a typed-field draft by hand.

That UX treated the user like a developer.

This week I rewrote it. The new Composer is a chat. The user describes the agent they want in plain English. The model asks clarifying questions, drafts the agent, runs a preview inline in the conversation, accepts refinements ("move the regional breakdown above the top movers"), and when the user says "ship it" a single orange button publishes. No stepper. No lint codes. No JSON. The entire authoring surface is a conversation.

That's V2 of the Agent Composer. It shipped this week as two PRs — a foundational chat-profiles refactor, then the Composer V2 specialisation on top. Along the way Claude and Codex held each other accountable, six fix-cycles deep, against a four-round production QA loop. This post is the after-action.

Other ways to understand what I did:

"From Lint Pills to Plain English: Building a Chat-First Agent Composer with a Self-Correcting LLM-to-Zod Loop"
Hooks: vivid "lint pills" image, names the technical pattern AI builders care about (LLM-to-Zod self-correction is having a moment)

"Profiles Over Forks: How I Specialized an AI Chat Engine Without Touching the Runtime — Solo, with Claude and Codex"
Hooks: "Profiles Over Forks" is a memetic architectural slogan, signals the foundational-refactor sophistication that engineering-leaning AI gurus respect.

"Chat Profiles, MCP Tools, and a Schema as the Trust Boundary: A Tier-3 Agentic Composer in 71 Commits"
Hooks: name-drops the standards (MCP, schema-first), claims a market tier with receipts (71 commits), no fluff.

"How to Build a Tier-3 Autonomous Analytics Agent Without an Orchestration Framework, a Vector Store, or a Sub-Agent Swarm"
Hooks: contrarian by listing what you don't need — irresistible to AI gurus tired of overbuilt stacks. Sets up your minimalist architecture as the takeaway.

Act 1: Why the V1 Composer Wasn't Good Enough

V1's authoring stepper was a faithful UI mapping of the typed contract underneath. It had to be — that's how the data shape was implemented. AgentDefinitionV1 has a name, a description, a taskSpec, a schedule, an allowedTools array. The stepper just rendered each block of the contract as a form section, validated each step before letting the user proceed, surfaced the linter's 9-check pass/fail.

It was mechanically correct. It was conversationally wrong.

The owner-facing review of V1 surfaced four hard problems:

Non-technical users were asked to read lint codes. [tool-availability] means nothing to a sales manager.
The dry-run sample report rendered in a separate panel below the typed-field editor — users had to switch context to see what they were authoring.
If the LLM proposed something invalid, the user had to fix it by hand in a Zod-shaped form. Most users didn't even know what a draft was.
The mental model the stepper imposed (Compose, then Lint, then Dry-Run, then Review, then Publish) didn't match how anyone actually thinks about a report. People think in iteration loops: describe, see, refine, see, refine, publish.

The fundamental insight: the typed contract is the trust boundary; it must stay. But the typed contract should be invisible to the user. The conversation, the inline preview, and the single publish action are the entire surface area. Everything underneath is implementation detail.

Act 2: The Iteration Loop That Reshaped the Plan

I asked Claude to scope the work. The first plan came back at about 600 lines of markdown and proposed a single-PR build — new chat profile, new tools, new UI, all landing together. It described the Composer V2 surface in detail but treated the underlying chat engine as a black box that the new Composer would simply call.

I pushed back twice. The first push was structural: "the AI chat engine bakes a single global system prompt into its runtime. If we want a chat tuned for a different purpose — designing agents, editing forecasts, master-data cleanup — there is no clean way to plug in a different system prompt, a different tool allowlist, a different conversational discipline. That's the real refactor."

That reframing rippled. The plan grew a "Layer 0" — the Chat Profiles platform. Profiles became a first-class abstraction: each profile is a server-defined bundle of { systemPromptBuilder, allowedTools, suggestedPrompts, markerProtocol, hooks? }. The existing global behaviour became the general profile. The new Composer became the composer-v2 profile. The runtime no longer baked one prompt; it looked up a profile at request time.

The second push was sequencing: "ship the foundational refactor first, soak it, then build Composer V2 on top." That split the work into two PRs — PR 1 (foundation, zero behaviour change) and PR 2 (Composer V2 on the soaked foundation). The byte-equivalence safety net got built INTO PR 1 as the "do-no-harm" gate.

Codex did a clean-room review of the v3 plan and posted four findings that materially changed what shipped. I'll paraphrase the gist:

Codex plan-review pass (paraphrased):

"PR 1 needs a byte-equivalence snapshot suite as the do-no-harm gate. Replay a corpus of ~30 canonical prompts against the migrated general profile and assert byte-identical responses against a pre-refactor snapshot. Any drift fails the build. This is the gate, not a nice-to-have."
"Composer-v2 MUST be registered at module-load time, not lazily. If validateChatRequestBody sees an unknown profileId on the first request, every initial Composer invocation 400s. Eager registration with a guardrail test."
"The four authoring tools need a session-scoped allowlist. The Composer agent gets only those four (plus disambiguation tools), nothing else. If a regular chat caller passes profileId: 'composer-v2' by accident, the runtime should still enforce the narrow allowlist server-side."
"Materialise server-owned fields after the LLM returns. id, pk, defType, ownerUserId, systemPrompt, audit, publishToken — never let the LLM author these. Same rule as V1. Pull materializeAgentDefinition out of the V1 Composer into a shared helper so V2 reuses it."

All four landed. The byte-equivalence snapshot suite became the merit gate for PR 1. The eager-registration guardrail is now a permanent test. The four authoring tools live in a hard-coded COMPOSER_AUTHORING_TOOLS array; the bridge filter excludes them from general chat. The materializer was refactored as a pure function and reused by both V1 and V2 Composers.

The final plan landed at 1,771 lines of markdown after about ten review cycles with Codex — almost three times the length of the first draft, because the iterations forced every layer to be specified before any code was written.

Act 3: The Architecture We Landed On

Here's the V2 picture in plain ASCII. Compare to the V1 diagram from the last post — the additions are bracketed by [NEW].

   
   +-----------------------------------------------------------------+
   |                       Web App (browser SPA)                     |
   |                                                                 |
   |   Gallery  -  Compose (V1)  -  [NEW] Compose with AI chat (V2)  |
   |   Filed Reports  -  Insights  -  Admin                          |
   +---------------------------+-------------------------------------+
                               |
                               | HTTPS + auth proxy
                               v
+-----------------------+   /api/agents/*    +-----------------------------+
|  Web App (Express)    |  ----------------> |  AI service (Fastify)       |
|  React shell + static |                    |                             |
|  files. Proxies AI    |   /api/ai-chat     |  +-----------------------+  |
|  routes to the AI     |  ----------------> |  | Chat route            |  |
|  service.             |                    |  | accepts profileId     |  |
|                       |   /api/composer-v2 |  | (default = general)   |  |
|  + serves the new     |  ----------------> |  +----------+------------+  |
|  agent-composer-chat  |                    |             |               |
|  React view + the V2  |                    |             v               |
|  HTTP shims for       |                    |  +-----------------------+  |
|  draft fetch, reset,  |                    |  | [NEW] Chat-Profile    |  |
|  publish.             |                    |  |       Dispatcher      |  |
+-----------------------+                    |  +----------+------------+  |
                                             |             |               |
                                             |       +-----+-----+         |
                                             |       |           |         |
                                             |       v           v         |
                                             |  +---------+ +---------+    |
                                             |  | general | | [NEW]   |    |
                                             |  | profile | | composer|    |
                                             |  | (V1     | | -v2     |    |
                                             |  |  prompt | | profile |    |
                                             |  |  +tools)| | (chat-  |    |
                                             |  +----+----+ |  first  |    |
                                             |       |      |  multi- |    |
                                             |       |      |  turn)  |    |
                                             |       |      +----+----+    |
                                             |       |           |         |
                                             |       |     [NEW] |         |
                                             |       |     four  |         |
                                             |       |     auth. |         |
                                             |       |     tools |         |
                                             |       |  +--------+-----+   |
                                             |       |  | propose      |   |
                                             |       |  | validate     |   |
                                             |       |  | dry_run      |   |
                                             |       |  | materialize  |   |
                                             |       |  | _and_publish |   |
                                             |       |  +--------+-----+   |
                                             |       |           |         |
                                             |       |           v         |
                                             |       |  +-----------------+|
                                             |       |  | [NEW] in-mem    ||
                                             |       |  | draft store     ||
                                             |       |  | TTL 30min,      ||
                                             |       |  | quota 10        ||
                                             |       |  | dry-runs/sess.  ||
                                             |       |  +-----------------+|
                                             |       v           |         |
                                             |  +-----------------v----+   |
                                             |  | Same MCP tool surface|   |
                                             |  | as V1 chat:          |   |
                                             |  |   build_movement_pack|   |
                                             |  |   compute_period_agg |   |
                                             |  |   compute_plan_vs_act|   |
                                             |  |   customer_health    |   |
                                             |  |   query_customer_mas |   |
                                             |  |   get_insight_def    |   |
                                             |  |   search_insights    |   |
                                             |  |   ... (60+)          |   |
                                             |  +----------+-----------+   |
                                             |             |               |
                                             |             v               |
                                             |  +-----------------------+  |
                                             |  | Runtime adapter       |  |
                                             |  | (claude-agent-sdk     |  |
                                             |  |  - unchanged)         |  |
                                             |  +----------+------------+  |
                                             |             |               |
                                             |   Hooks (unchanged):        |
                                             |     PreToolUse  PolicyGuard |
                                             |     PostToolUse Citation+ds |
                                             |     Stop        CostTracker |
                                             |                             |
                                             +-----------+-----------------+
                                                         |
                                                         v
   +-----------------------------------------------------------------+
   |                        Cosmos DB (single account)               |
   |   definitions (incl. agent-definitions from V2 publishes,       |
   |   schema-identical to V1)                                       |
   |   agent_runs (incl. dry-run artefacts owned by composer-v2)     |
   |   bu-ai-data-policy, runtime-state, etc.                        |
   +-----------------------------------------------------------------+

The point worth labouring: everything below the dispatcher is unchanged. The four new authoring tools are MCP tools just like every other tool. The draft store is an in-memory Map with a 30-minute TTL. The runtime adapter, the hooks, the stores, the Cosmos schema, the scheduler — not touched. The Composer V2 surface is purely an additive specialisation.

An agent composed by V2 is byte-identical to an agent composed by V1, up to a single optional audit.composedVia: 'chat' telemetry field. The Gallery doesn't know V2 exists. The scheduler doesn't know V2 exists. If we deleted the V2 surface tomorrow, every previously V2-composed agent would still run.

The non-negotiable invariant: V2 is a better front door, nothing more. It does not fork the agentic engine. Same definition store. Same scheduler. Same runtime. Same hooks. Same ABAC. Same Gallery. Same Filed Reports. Same renderer. Same audit log. The four new authoring tools, the in-memory draft store, and the chat profile are the entire footprint.

Act 4: PR 1 — The Foundational Refactor (Zero Regression)

PR 1 was the first 1.5 days. Six commits. No new user-visible surface. The entire change was that standaloneAiChat.js — the file that owns the chat runtime — learned to look up a chat profile at request time and use it instead of a hard-coded global prompt.

The challenge wasn't writing the dispatcher. The challenge was proving the existing chat behaviour didn't drift by a single byte.

That's what the byte-equivalence snapshot suite was for. ~30 canonical prompts, drawn from a real week of production chat usage (sanitised), replayed against both the pre-refactor runtime and the post-refactor runtime with profileId defaulted to general. The system prompt the runtime composes must be byte-identical. The tool allowlist must be byte-identical. The streaming behaviour must be byte-identical. Any drift fails the build.

The snapshot suite caught one real regression mid-PR — my first attempt at the dispatcher had a sneaky difference in how it normalised the absence of profileId in the request body (treating undefined as 'general' instead of preserving the absence). Codex flagged it during review, the snapshot diffed, and the fix was a four-line change.

Commits in PR 1:

ChatProfileV1 contract in @app/ai-core: Zod schema for { id, label, description, systemPromptBuilder, allowedTools, suggestedPrompts, markerProtocol?, hooks? }.
Profile registry in ai-service: getChatProfile(id), listChatProfiles(), registerChatProfile(profile). Eager registration at module-load.
Inline profileId branch in the chat runtime: look up the profile, call its systemPromptBuilder, filter the MCP tools by profile.allowedTools, apply profile-specific hooks on top of the global ones.
Chat route accepts an optional profileId field. Default general preserves backward compat with every existing caller.
Byte-equivalence snapshot suite — the do-no-harm gate. 30 canonical prompts, byte-identical assertions, fails the build on any drift.
Docs update: agent-contract.md gains a "Chat Profiles platform" section with the platform invariants.

PR 1 sat on the feature branch for a few hours of soak in production before PR 2 began. During the soak the snapshot suite ran on every CI build and never failed. The existing chat experience — AI helper buttons on dashboards, the standalone chat view, page-context-aware tool routing — behaved identically. None of those callers passed profileId; all defaulted to general; none drifted.

That's the value of a do-no-harm gate. A foundational refactor of the chat engine landed on master with confidence high enough that no human had to manually test the existing surface. The snapshot was the proof.

Act 5: PR 2 — The Composer V2 Specialisation

PR 2 was fifteen commits over two days. New profile. Four new tools. New draft store. New frontend view. New HTTP endpoints. New navigation entry. Everything else unchanged.

#	Commit	What it added
1	lint translator	Pure ai-core helper that translates V1's 9 linter codes into plain English. `[tool-availability]` becomes "The tool 'X' isn't available for this connector — try 'Y' instead." Engineering codes never leave the server.
2	COMPOSER_AUTHORING_TOOLS exclusion	The hard-coded set of four authoring tools. The auto-bridge filter excludes them from the general chat profile so they only surface inside `composer-v2`.
3	Extract `materializeAgentDefinition`	Pure refactor: pulled V1's server-owned-field materialiser out into a shared helper. Reused by V2's `propose_agent_draft` tool.
4	ComposerIntentDraftV1 schema	The Zod schema for the draft payload the V2 LLM hands to `propose_agent_draft`. Strict. Field-level error messages. Hand-authored, not derived (Codex caught that a derived schema lost field-level hints).
5	Draft store with versioning + publishToken + dry-run quota	In-memory `Map`, 30-minute TTL, max 10 dry-runs per session. `publishToken` is server-issued, never returned to the LLM, only known to the UI's PublishActionChip.
6	runAgent accepts `AbortSignal`	Lets the V2 frontend cancel an in-flight dry-run when the user backs out. Threaded through the runtime adapter.
7	Four Composer tool handlers + definitions + executor wiring	`propose_agent_draft`, `validate_draft_silently`, `dry_run_draft`, `materialize_and_publish`. Each Zod-validated at the MCP boundary. `dry_run_draft` calls the existing `runtime.runAgent({ trigger: 'dry-run' })` — no V2-specific runtime branch.
8	composer-v2 chat profile + EAGER registration	The profile bundle. Multi-turn system prompt. Suggested prompts. Marker protocol for `{{ATTACH_REPORT:runId}}`, `{{FOLLOWUP:text}}`, `{{ACTION:publish}}`. Registered at module load with a guardrail test.
9	Dispatcher branch in `standaloneAiChat.js`	If the resolved profile is `composer-v2`, swap in the COMPOSER_AUTHORING_TOOLS allowlist; otherwise behave exactly as PR 1.
10	HTTP endpoints (drafts fetch, session reset, publish)	`GET /api/composer-v2/drafts/:id` for the technical-details disclosure, `POST /api/composer-v2/sessions/:id/reset` for the "Start a new agent" button, `POST /api/composer-v2/drafts/:id/publish` for the UI's PublishActionChip.
11	SSE `tool_result` event + AIChatPanel props	Streams structured tool results back to the browser so the chat panel can render the inline report in real time.
12	AgentComposerChatView + InlineSampleReport + PublishActionChip	The new chat-first view. Renders the sample report inline using the same `AgentReportView` sub-components the Filed Reports surface uses — no private copy.
13	Navigation entry + AM Contract Performance starter prompt	Sidebar entry "Compose Agent with AI chat" at `/agent-composer-chat`. AM Contract Performance template surfaces as a suggested prompt.
14	MCP bridge coverage test + AM Contract Performance smoke	End-to-end smoke that drives the Composer through a full session, asserts the published agent is V1-shape, and confirms a spoken name like "City of Cape Town" is REFUSED at `propose_agent_draft` with a structured hint to call `query_customer_master`.
15	Docs update	agent-contract.md gains the Composer V2 invariants section.

That's the build. Fifteen commits. About 4,300 lines of V2-specific source and 7,200 lines of V2-specific tests. The plan doc was 1,771 lines. The test-to-source ratio is intentionally above 1.5 — the contract is the trust boundary, the tests pin the contract.

Act 6: The User Experience (Currently Lightweight, Evolving)

The V2 surface is deliberately minimal in its first form. ASCII wireframe of the empty state:

+-----------------------------------------------------------------+
|  Compose Agent with AI chat            ?    Start a new agent   |
|-----------------------------------------------------------------|
|                                                                 |
|    Agent Composer                                               |
|    (BU scope . current fiscal year)                             |
|                                                                 |
|    Suggested starters:                                          |
|    [ Weekly Finance briefing ]  [ Monthly performance report ]  |
|    [ Customer health watcher ]  [ Stock coverage review ]       |
|                                                                 |
|                                                                 |
|                                                                 |
|                                                                 |
|                                                                 |
|                                                                 |
|-----------------------------------------------------------------|
|  Converse with AI to craft the agentic report you desire... [>] |
|-----------------------------------------------------------------|
|  > Show technical details                                       |
+-----------------------------------------------------------------+

And the active state, mid-composition:

+-----------------------------------------------------------------+
|  Compose Agent with AI chat                  Start a new agent  |
|-----------------------------------------------------------------|
|                                                                 |
|  [You]    Compose a monthly contract-performance briefing       |
|           agent for City of Cape Town -- sales vs plan,         |
|           delivery progress, debtors, customer health, ...      |
|                                                                 |
|  [Bot]    Got it -- City of Cape Town, monthly, account-        |
|           manager briefing with the seven sections you named.   |
|                                                                 |
|           [O Schedule this report]                              |
|                                                                 |
|           SAMPLE PREVIEW                                        |
|           +-------------------------------------------------+   |
|           | Monthly Contract Performance Briefing --        |   |
|           | City of Cape Town                               |   |
|           |                                                 |   |
|           | Executive Summary                               |   |
|           | Account: City of Cape Town . Municipality .     |   |
|           | Western Cape . Local logistics zone . SA        |   |
|           |   [Sales 01 Aug -- 07 Aug 2025]                 |   |
|           |                                                 |   |
|           | Headline                                        |   |
|           | YTD net sales RXXXX at XXX%    gross margin     |   |
|           | translate to RXXXX of gross profit; the         |   |
|           | account is pacing at YYY% of the RZZZZZ         |   |
|           | annual budget target.                           |   |
|           |   [Sales 01 Aug -- 07 Aug 2025]                 |   |
|           |                                                 |   |
|           |   [chart: SAP Orderbook Insights by Status]     |   |
|           |                                                 |   |
|           | Account Health Composite                        |   |
|           | +-----------+--------+-----------------------+  |   |
|           | | Component | Score  | Signal                |  |   |
|           | +-----------+--------+-----------------------+  |   |
|           | | Debtor    |   50   | All Rabck overdue     |  |   |
|           | | Orderbook |  100   | Ra.bcm vs R282k prior |  |   |
|           | | Delivery  |  n/a   | No delivery signal    |  |   |
|           | | Sales     |   50   | No WoW movement       |  |   |
|           | +-----------+--------+-----------------------+  |   |
|           |                                                 |   |
|           | ... 5 more sections ...                         |   |
|           +-------------------------------------------------+   |
|                                                                 |
|           [ Looks good -- schedule it ]                         |
|           [ Move Risks above Top movers ]                       |
|           [ Add a gross-margin trend chart ]                    |
|           [ Make the tone more detailed ]                       |
|                                                                 |
|-----------------------------------------------------------------|
|  Converse with AI to craft the agentic report you desire... [>] |
|-----------------------------------------------------------------|
|  > Show technical details                                       |
+-----------------------------------------------------------------+

That's the entire user-facing surface in V1 of V2. One chat pane. Inline report preview. Follow-up chips. One disclosed publish button. One collapsed "technical details" disclosure for power users.

The roadmap is to evolve this into a canvas-style dual-pane layout — the conversation in a left rail, the live agent design (sections, charts, schedule, scope) on the right. Drag-to-reorder sections. Inline chart-type swap. Live citation map. The conversation stays the source of truth; the canvas becomes the visual confirmation.

The lightweight chat surface ships first because conversation alone is enough to design a high-quality agent. The canvas is a power-user affordance for refinement, not a precondition for authoring.

Act 7: Six Fix-Cycles, Four QA Rounds

The Composer V2 PR merged on a Thursday afternoon. The one-day production QA loop that followed is what actually convinced me the surface is ready.

Here's the post-merge fix-cycle ledger:

Fix #	Trigger	What landed
1	Round-0 production smoke: the "ship it" follow-up chip kicked off the whole workflow again instead of just publishing.	Tightened the SHIP rule in the system prompt with explicit "WRONG behaviour at SHIP time" examples. The LLM now responds to "looks good -- schedule it" with a short acknowledgement and STOPS — the orange publish button is the actual publish path.
2	Owner direction: non-technical users have no concept of an SAP/simulation data plane; do not surface a "boundary confirmation" chip.	Dropped the boundary chip entirely. Boundary acknowledgement is now server-derived. The seven simulation tools are excluded from the agent reporting toolset; plan-vs-actual tools using static targets stay.
3	Owner trust incident: agentic surface produced "0 rows" for City of Cape Town across multiple connectors, then auto-rendered top-N tables that included other customers.	Stopped auto-generating tableSpecs in `composeOutput`. Made `compute_customer_health_composite` honour `scope='specific'`. Tightened the prompt to ban SAP customer numbers in narrative, to use the correct master-data filter field, and to label BU-aggregate facts explicitly.
4	QA round 1: IR-slug ("cust-city-of-cape-town") matched no SAP-keyed row. Every customer-scoped tool returned empty.	Built a shared `customerScopeResolver` module: master-data lookup once, expand each input to the full set of identity strings (customerId, sapCustomerNumber, name, plus slug variants), filter downstream rows by the expanded match-key set. Auto-include `query_customer_master` + `get_customer_deepdive` when customerScope is specific. Token budget bumped 512KB → 2MB.
5	QA round 2: composer-generated slugs like "cust-city-of-cape-town" still didn't match master rows whose `customerId` stored a different shape.	Slug-to-name reverse transform in the resolver: strip `cust-` prefix, replace hyphens/underscores with spaces, case-fold, match against `row.name`. Plus a defensive slug-variant collection so downstream stores keyed by either the friendly name or the slug both match.
6	QA round 3: movement pack threw "entityKey must be a function" because the connector was summary-mode (sales/finpack) without the detail-level per-row entityKey.	Short-circuit the movement-pack builder for summary-mode connectors with a warning, not a throw. The narrative now reports honestly: "Sales movement pack is summary-mode; per-customer week-over-week diff is on the backlog."
7	QA round 3 also surfaced: raw `[cite-tool-...]` markers were leaking into rendered narrative prose.	Defence in depth. Prompt-side: forbid inline citation IDs explicitly. Render-side: `stripInlineCiteMarkers()` scrubs the narrative before `composeOutput` calls `splitNarrativeIntoSections`. Preserves real markdown link refs (`[1]`, `[footnote]`); only removes `cite-`-prefixed brackets.
8	Owner pressure-test: "How confident are you in customer-name lookup for OTHER customers? We only tested City of Cape Town."	Option B re-architecture. The resolver becomes strict-equality only (no regex, no slug guessing). The smart fuzzy matching moves UPSTREAM into `query_customer_master` with three new capabilities: acronym matching (JRA -> Johannesburg Roads Agency, EMM -> eThekwini Metropolitan Municipality), operator-curated aliases harvested from master rows (`aliases`/`searchTerms`/`shortNames`), and a three-status flow (resolved/ambiguous/unresolved) the LLM follows with disambiguation questions.

Each of these landed within hours of being identified. Each was committed to master, pushed through CI/CD, deployed to the AI App Service, and re-tested in production. Then the next fix.

The convergence trajectory: by QA round 4, the inline preview produced a multi-section report with real numbers (Rxxxx YTD revenue, Rxxx budget target, xx% pacing, Rxx outstanding, Rxxxm orderbook surge), a real account-health composite table, citation chips at every section, follow-up chips like "Looks good — schedule it" / "Move Risks above Top movers" / "Add a gross-margin trend chart", and zero raw cite-marker noise. The quality became indistinguishable from the native chat engine's quality on the same prompt.

The convergence rule: ship the surface, then iterate against real workloads until quality matches the existing chat surface on the same prompt. Don't ship without a comparison baseline. Don't accept "the agent gave an answer" as success — the answer has to be as good as the native chat's answer, or you haven't shipped a real upgrade.

Act 8: Closing the QA Loop with Chrome Automation

The thing that compressed the 1-day QA loop into something tractable for one human was Claude driving Chrome directly.

The pattern: I'd describe the issue I'd seen in production. Claude would open the production URL in a real Chrome instance, click through the Composer wizard the same way a user would, type the same prompt I'd typed, wait for the dry-run, screenshot the inline report, compare it to a reference, identify the regression, trace the cause through the codebase, push the fix, monitor CI, verify the deploy, re-run the same flow, and confirm the fix held.

During the last QA round Claude ran ten Chrome sessions in a single afternoon — navigate, reset, send, wait, screenshot, parse, diff, commit, push, wait for CI, repeat. No human intervention except for me reviewing the post-QA summary at the end. One of those cycles ran while I was AFK; I came back to "round 4 complete, here's the rendered preview, no cite markers, narrative quality matches native chat".

This pattern — AI driving the same UI a human would drive, reading its own logs, deploying its own fixes, re-running its own QA — is the single most under-appreciated capability shift of the last six months. It's not a demo. It worked. Repeatedly. In production.

The gating constraints are:

The AI must be able to see what the user sees (screenshot, page-text, console messages, network requests).
The AI must be able to act on what the user acts on (clicks, types, navigation, form submission).
The AI must be able to verify its own work (read backend logs, query the API directly, compare against a known-good baseline).
The AI must know when to stop (the loop converges, or hits a hard limit, or hands back to the human).

All four were satisfied in this build. The QA loop ran. The fixes landed. The owner was doing other things or went to bed.

Self-Assessment: Benchmarking Against the Market

I asked one of my non-coding AI co-thinkers (Google's Gemini used purely for market analysis) where this V2 architecture sits relative to the enterprise BI tools shipping in 2026. The assessment was specific enough that I want to capture it here, because it informs what I prioritise next.

Market penetration

Fewer than 5% of production enterprise BI platforms globally provide the exact convergence of capabilities V2 ships: conversational agent authoring + persistent scheduled artefacts + self-healing background automation + deterministic citations + open MCP tool exposure. Conversational BI chatbots that answer questions about existing charts are common (think Looker Studio's Gemini integration, Snowflake Copilot, Power BI Copilot). What's rare is autonomous report-authoring agents that the user composes via chat and that then run unattended on a schedule, with every claim traceable back to an immutable ETL run id.

Tier matrix

Tier	Capability	Interface	Where it shows up
Tier 1: Conversational BI	Natural language to query data; ad-hoc charts; single-turn or basic chat thread. No persistence, no background automation.	Chat sidebar / input box	Looker Studio (Gemini), Snowflake Copilot, Power BI Copilot
Tier 2: The Data Canvas	Visual space; natural-language nodes that auto-generate SQL blocks; chart-organisation on a canvas layout.	Unified node/canvas UI	BigQuery Data Canvas, Power Apps Agent Studio
Tier 3: Autonomous analytics agents	Chat compiles a scheduled employee. Multi-turn refinement, dry runs, self-healing, persistent + scheduled artefacts, citations, MCP-exposed tools.	Dual-pane split canvas (conversation + live preview)	Anthropic Claude Artifacts, WisdomAI analytics agents, this V2 architecture

Three architectural benchmarks where V2 lands well

1. Agentic compiler UX vs raw text-to-SQL. Most current data tools focus on text-to-code translation. If the LLM misses a comma or references an invalid schema object, the user troubleshoots the raw error. V2 wraps the composition phase in an invisible, asynchronous LLM-to-Zod self-correcting loop. The non-technical user never sees a Zod error. Agent creation is treated like code compilation, with the validation layer hidden behind a directive system prompt and a "fix silently, retry at most 3 times" rule.

2. Open interoperability vs walled gardens. Major cloud database vendors are building closed-loop agent solutions tightly coupled to their proprietary data stacks. V2 anchors execution to the open Model Context Protocol. The Composer can configure agents to run against an isolated ERP database, a local Fastify service, or third-party APIs interchangeably. There's no vendor lock-in in the runtime — the SDK is one adapter file.

3. Absolute traceability. Enterprise AI surfaces frequently generate narrative summaries directly from data snapshots, creating a trust deficit when finance demands auditability. V2 forces deterministic citations: every claim or table row binds back to an immutable ETL run id and dataset id. The LLM never authors a citation; PostToolUse hooks accumulate them onto the run document. The schema rejects any run without coverage. This matches the compliance posture of specialised enterprise analytical platforms.

Benchmark verdict: V2's combination of zero-friction chat-first authoring with rigid, type-safe validation boundaries lands in the "autonomous analytics agents" tier. The next step — the dual-pane split canvas with a live agent-design surface — is on the roadmap and is what closes the visual-confirmation gap relative to the most polished products in the same tier.

Modern Software Development with Two AI Pair-Programmers

I want to spend a section on the actual practice of building this with two AIs and one human, because the pattern is doing real work and I think it's under-described.

The two AIs have different shapes. Claude is the long-context implementer. It holds the full codebase context for hours at a stretch, writes code, runs tests, drives CI, observes production, makes plans, executes plans, reflects, iterates. Its strength is depth and continuity. Its bias is tunnel vision — once it's in a flow, it can miss things adjacent to the work that fresh eyes would catch immediately. It also codes a little hastily, leaving gaps or assuming things without validating deeply.

Codex is the short-context reviewer. It also has long context and memory, but I use it frugally because of the 5-hour time limits, sometimes it enters cold, reads the diff or the file at master, has some prior session memory. Its strength is freshness, depth and architectural rigour — if there's a bug in plain sight, Codex finds it; if a design assumption looks suspect to an outsider, Codex flags it. Its bias is occasional outdatedness and loss of owner's intent — sometimes it's reviewing against an older mental model of the code than master actually carries - or also assumes an intent using its own judgment that needs me to steer it to my design thinking.

The human's job is to steer the two. Claude implements. Codex reviews. Sometimes they both drift and don't get my instinct or intuition for simplifying complexity. The human triages, decides what to take, what to push back on, what to send back to either AI for refinement. The pattern that worked best for V2:

I describe the change I want at a high level. Claude proposes a plan. I push back on framing. Claude revises. Repeat 2-3 times until the plan is the shape I want.
I hand the plan to Codex for clean-room review. Codex returns numbered findings (P0/P1/P2 with rationale). I relay them to Claude.
Claude triages, agrees with most, occasionally pushes back. I adjudicate the disagreements.
Claude implements, commits, pushes, drives CI. I watch the diff over Claude's shoulder. I got Codex to review every single commit from Claude; Codex was the gatekeeper.
I ship to production. I run the surface in Chrome, find issues, describe them.
Claude drives Chrome itself, reproduces, fixes, redeploys, re-tests. Hand back when converged.
I take the merged work back to Codex for a post-implementation review. Codex finds the bugs Claude missed. Claude fixes them. Repeat.

The asymmetry is the leverage. Claude's tunnel vision is Codex's strength. Codex's high-level context is Claude's strength. The human's job is to keep the loop honest — reject sycophantic agreement, demand pushback on weak findings, take credit for nothing the AIs caught.

For V2 specifically: Codex caught five substantive things during plan review and another seven during code review across both PRs. None of them were trivial — the eager-registration guardrail, the byte-equivalence snapshot suite, the hand-written JSON schema for propose_agent_draft.intent, the proxy mount for the V2 endpoints in the root server, the canonical-key terminology audit, the FOLLOW_UPS parser regression tests, the customer-key server-side enforcement. Every one of those would have shipped broken without the second reviewer.

The pattern, distilled: Claude is the depth-first implementer. Codex is the breadth-first reviewer. The human is the architect, the arbiter, and the QA lead. Each AI's bias is the other's strength. Each iteration gets sharper. No one of the three could ship V2 alone.

Metrics

Metric	Value
Plan document length	1,771 lines of markdown
Total commits during the V2 window (May 12-22, 2026)	217
Composer-v2 / chat-profile-specific commits	71
PR 1 commits (foundational refactor)	6
PR 2 commits (Composer V2 specialisation)	15
Post-merge production fix-cycles	8
V2-specific source LOC (ai-core chatProfiles + composer/v2 + ai-service composerV2 + customerScopeResolver + frontend view + sub-components)	~4,300 lines
V2-specific test LOC (chatProfiles + composer-v2 + customer-scope-resolver tests)	~7,200 lines
Test-to-source ratio	~1.67
Byte-equivalence snapshot corpus size	~30 canonical prompts
Composer V2 tools (authoring + disambiguation)	4 + 2
Draft store TTL	30 minutes
Dry-run quota per session	10
QA rounds against production	4 (with ~5 round-0 hotfixes before round 1)
Chrome QA cycles driven by Claude unattended	~10 in the final QA round alone
Total elapsed wall time	~4 days from "plan-the-V2-refactor" to "round-4-QA-convergence"

Lessons & Advice

For anyone building a similar agentic specialisation on top of an existing AI chat surface, here's what I'd write down up-front.

1. Build the platform abstraction before the specialisation

The strongest decision in the V2 plan was carving out the Chat Profiles platform as Layer 0. Without it, Composer V2 would have been a special case bolted onto the chat engine, and the next specialisation (forecast editor? scenario architect? master-data cleanup?) would have repeated the same bolt-on pattern. With it, future specialisations are a new profile bundle in a registry — no runtime changes, no new endpoints.

The cost: the foundational refactor needed a byte-equivalence safety net (the snapshot suite). The benefit: a one-day soak proved the refactor didn't drift, and PR 2 could land with no fear of regressing the existing chat surface.

2. The do-no-harm gate is non-negotiable for foundational refactors

If you're touching the runtime that powers the existing AI chat experience, you need a mechanical proof that nothing changed. A snapshot of canonical responses, replayed on every CI build, byte-diffed against a pre-refactor baseline. Don't trust manual smoke testing — it misses the prompts you don't think to try. Don't trust unit tests — they pin the parts you remembered to test. The snapshot pins the whole black-box behaviour.

3. Specialise via profile, not via fork

The Composer V2 surface adds a profile, four tools, a draft store, and a UI view. Nothing else. No new runtime branch. No new MCP server. No new hooks. No new scheduler. No new Cosmos partition. No new Gallery. No new renderer. The specialisation is purely additive. An agent composed by V2 is V1-shape downstream; the surface that authored it is a UI affordance, not an architectural divergence.

This is the rule that prevents fragmentation. If your "improved" front door produces artefacts that the rest of the platform doesn't know how to handle, you haven't built a better front door — you've built a parallel platform you now have to maintain.

4. The LLM is authoritative for intent only

The Composer LLM authors a typed payload describing the user's intent: name, description, allowedTools, taskSpec, schedule, customerScope, comparisons, visualPreferences. Everything else — id, pk, defType, ownerUserId, audit, revision, the system prompt itself, the publish token — is server-owned, materialised after extraction, never trusted to the LLM. The schema rejects any draft that tries to set those fields.

This is what prevents the LLM from impersonating users, skipping the runtime's tool-first directive, or short-circuiting the publish gate. The trust boundary is the schema, not the prompt.

5. Build the smart matching upstream, not downstream

The single biggest mistake in V2's first three days was putting fuzzy customer-name matching in the wrong layer — the downstream scope resolver instead of the upstream master-data lookup. The downstream resolver got progressively more regex-laden as edge cases surfaced (slug variants, punctuation, abbreviations). When the owner pressed "how confident are you for OTHER customers, not just City of Cape Town?", I had to admit the resolver was a guess that worked for clean names and failed silently for the rest.

The architecturally correct fix (Option B) was to make the master-data lookup smart (exact + label + alias + acronym + prefix + all-tokens, ranked) and the downstream resolver dumb (strict equality only). The LLM is now expected to call query_customer_master with the user's free-text phrase FIRST, get back a canonical id, and pass THAT to the resolver. No regex anywhere downstream. Acronyms like "JRA" resolve to "Johannesburg Roads Agency" deterministically. Operator-curated aliases on the master row pick up colloquial short-forms like "CCT" without code changes.

6. Hand the AI the browser

If your AI assistant can drive Chrome, your QA loop changes character. The same loop that used to be "human describes bug, AI infers, AI fixes, human verifies, repeat" becomes "AI reproduces, AI fixes, AI verifies, hands back when converged". This is not a luxury feature — it's a compounding capability. Each fix-cycle is hours shorter. Multiple cycles run in parallel. The human supervises rather than drives.

For this to work, the AI needs: clickable accessibility-tree access to the page, the ability to read console messages and network requests, JavaScript-eval in the page context, and screenshot output. All four are table stakes; if your tooling doesn't provide them, prioritise getting them before you optimise anything else.

7. Stop QA when quality matches the comparison baseline

The QA loop's exit criterion was not "the agent gave an answer". It was "the agent's answer is as good as the native chat engine's answer on the same prompt". The four QA rounds existed because that bar was higher than the first three rounds met. Round 4 met it. That was when the loop stopped.

If you don't have a comparison baseline, you don't have a stopping criterion. If your "agentic surface" doesn't have to match the quality of your existing surface, you're shipping a regression dressed as an upgrade.

Best Practices for AI Developers

Schemas as the trust boundary. Zod or equivalent. Strict. Frozen before code. Tagged in git. Every interface across an AI surface should be schema-validated; every claim the AI makes should reduce to a schema-conformant artefact.
The materialiser pattern. Server-owned fields (id, audit, ownerUserId, system prompt, publish token) get overwritten after the LLM returns. Never let the LLM author identity, audit, or runtime-directive fields.
One adapter file imports the SDK. Everything else stays SDK-agnostic. Swap cost is bounded.
MCP tools always-loaded; built-ins locked down. If alwaysLoad is false, your model never sees your tools when built-ins are disallowed. This is a known SDK sharp edge.
PostToolUse.tool_response is a JSON string. Parse first. Always.
Tool-first system prompts. List the tools, then directly tell the model "CALL THESE TOOLS FIRST". The model will not infer the instruction from a list.
Citations are tool-handler output. Never LLM-authored. Hooks accumulate them. The LLM writes prose around them.
Render via the same components that render the final artefact. Inline previews must use the same sub-components as the eventual filed report — no private copy. Future renderer changes land in one place.
Draft stores are in-memory. 30-minute TTL. Per-session quotas. No Cosmos coupling until the user explicitly publishes.
Eager registration with a guardrail test. Profiles, plugins, tools — whatever the registration model, prove it at module load and pin the proof with a test.
Snapshot suites for foundational refactors. Replay real prompts, byte-diff against a baseline, fail the build on drift.
Comparison baseline for QA convergence. Don't ship the new surface without proving its quality matches the surface it's supposed to replace.
Two AI pair-programmers, one human supervisor. Long-context implementer + short-context reviewer + arbitrating human. Each AI's bias is the other's strength.
Give the AI the browser. Closed-loop QA. Compounding capability.

The Takeaway

The Composer V2 work is what happens when a one-person engineering team treats two frontier AI assistants as pair-programmers and a third AI as a market analyst — with the human as the architect, supervisor, and accountability layer.

The shape of the work is different from solo development. The shape is different from team development. It's not "the AI wrote the code"; it's "the AI implemented the design I steered, while a second AI clean-room reviewed it, while I drove the browser and made the calls about what was good enough to ship." The product of that loop is in production today. Real users compose real agents in plain English; the platform publishes a typed, scheduled, ABAC-scoped, citation-stamped artefact that runs unattended every Monday at 06:30 SAST.

The platform is still small. One person. One repo. One Fastify process for the AI surface. One App Service for the web app. One Cosmos account. No vector store. No background-job platform. No multi-agent orchestration framework. The architecture is deliberately minimalist — the abstractions that matter (schemas, profiles, MCP tools, hooks, structured artefacts) are sharp; the abstractions that don't (a sub-agent swarm, a custom DSL, a parallel data plane) are absent.

What it isn't, is a slide deck. It's a Monday-morning briefing that fires by itself, an Account Manager who composes their own contract-performance agent in a five-minute chat, a citation chip that traces every number back to its SAP source. The Composer V2 surface is one piece of that puzzle — the front door. The next piece — the dual-pane canvas with live agent design — is what closes the gap to the most polished tier-3 products in the market.

Onwards.To. V3 - a fully canvas-style UX that frontier platforms like Claude, ChatGPT and Gemini provide - I'll ship this too, within a week (if time allows).

Remember I'm just a GM, building my own enterprise BI platform to manage my business. In between my operaional meetings, evenings, and weekends, working with my two AI copilots has unlocked a ton of productivity that is truly amazing!

Stack: TypeScript-flavoured JavaScript end-to-end. Fastify on a Linux App Service for the AI service. Express + React on a separate App Service for the web app. Cosmos DB as the only data store. @anthropic-ai/claude-agent-sdk + @anthropic-ai/sdk for the agentic loop and the chat engine. node-schedule for the cron tick. Recharts for chart rendering. react-markdown + remark-gfm for the markdown renderer. Zod for the frozen contracts. The new abstractions in V2: ChatProfileV1 (Zod schema for chat-engine specialisations), ComposerIntentDraftV1 (the typed draft payload), the four-tool authoring set, the in-memory draft store with TTL + quota, the marker protocol for inline-report attachment, and the byte-equivalence snapshot suite as the do-no-harm gate for the foundational refactor.

Total elapsed: ~4 days from "let's reframe the Composer as a chat" to "round-4 QA convergence in production". Total Composer-V2-specific commits: 71 across two PRs. Total post-merge fix-cycles: 8. Total Chrome QA cycles driven unattended by the implementer AI: ~10 in the final round alone. Cups of coffee: still lost count.

Monday, 18 May 2026

How I built and agentic engine (powered by Anthropic's Agentic runtime) to augment my already AI-powered BI app, with Claude & Codex, over one weekend

May 18, 2026. South Africa.

This morning, at exactly 07:00 SAST, the platform produced a financial briefing without anybody pressing a button. It pulled six SAP connectors, picked the top five movers across debtors / orderbook / stock / delivery, narrated them with real customer names ("CITY OF CAPE TOWN orderbook changed by XX, from RXX to RYY, the single largest book inflow this week"), bound a Recharts bar chart to the underlying dataset, and stamped every numeric claim with a citation chip showing the calendar date range "Debtors · 15 May to 21 May 2026". Validation passed. Cost: a few cents. Latency: 91 seconds end-to-end.

The report is publishable as-is. Not the LLM "making something up that looks right" — every number traces back to an ETL run id, every chart traces back to a dataset, every claim traces back to a CitationV1.

That report is the V1 of an "agentic" layer we've been building on top of an existing BI platform. It took three and a half weeks of research, ideation, detailed implementation planning of V0, then an architecture pivot when I changed my thinking to stop reinventing the wheel and jump on frontier model's agentic SDK, then a bite-the-bullet weekend sprint of building, that included an autonomous overnight QA cycle using browser connectors, an awkward midnight debugging session over a single JSON field, and the discovery that one missing column predicate had silently prevented every scheduled run from ever firing....

I want to write this down before it ages out of memory.

Act 1: The Existing AI Chat Engine (Why This Wasn't Enough)

The platform already had an AI chat engine before any of this started. It's the kind of thing most BI tools ship in 2026: open a side panel, type "what's our overdue debtors looking like", an LLM reads the page context, calls a few internal tools to fetch SAP snapshot summaries, and answers with citations. It works. People use it. It's saved hours of "let me Excel-pivot this for you" emails.

The App was fine in its initial release form. Like any app launch, once users start playing with it, and seeing the powerful AI chat feature, almost immediately the feedback is about requesting automated reports, automated insights, signals of weekly trends, etc. An oversight on my part was the lack of trend data, weekly snapshots were nonexistent. For me to support advanced automated reporting, I would have to close the gap on weekly snapshots. This was considered foundational before I started with any agentic framework. It took me about a week to close this gap, following my usual architecture, spec-driven planning approach with Claude and Codex.

The architecture was honest about what it was. A single Fastify service stands up a chat route that streams from the Anthropic API. A small registry of typed tools sits behind it — build_movement_pack, get_insight_definition, query_snapshot_week, compute_period_aggregate, etc. Every tool reads from the same Cosmos containers the dashboards read from, scoped by BU, filtered by an ABAC layer that keeps customer-level data behind role assignments. The model never sees raw Cosmos — it sees tool returns that have already been authz-checked, citation-stamped, and shape-normalised.

For ad-hoc questions ("show me the top 5 debtors this week", "summarise expense trends since January") this is great. It's chat. The user types, the model answers, the conversation ends.

But chat is reactive. Chat assumes there's a human in the loop who knows what to ask. A real enterprise BI tool needs to surface the things you didn't know to ask about. The customer whose overdue jumped 149% this week. The stock material that flipped from healthy to critical between Friday and Monday. The orderbook line that vanished because the invoice posted. None of that bubbles up if you don't open the chat panel and ask.

The chat engine is solid. But a true enterprise BI insights platform needs agentic capabilities too — agents that run on a schedule, pull what's changed, narrate it, file a report, and surface it to the right people without anyone typing a prompt.

That's the gap Agentic V1 is closing.

Act 2: Origins of the Agentic Idea

The initial sketch was simple. "Take the existing AI chat tools. Add a scheduler. Run them weekly. Save the output as a report."

That sketch survived about two weeks before it started to fall apart.

The first problem was structural: a "report" isn't a chat reply. A chat reply is a stream of tokens that disappears when you close the panel. A report is a document — structured, multi-section, with charts and tables, that needs to be filed, indexed, ABAC-scoped per viewer, exportable to PDF, shareable, auditable. The chat engine's plumbing doesn't model any of that. It just streams text.

The second problem was about who authors agents. The first instinct was "we'll seed a few hand-written ones — weekly briefing, monthly performance, customer health — and that's V1." But the moment the seeded agents existed, the next question was obvious: "Can a sales manager build their own for their portfolio? Can an account manager clone the seed for Customer X and remix it?" That implies a Composer. A Composer is itself an agent. So now we're not just building "three scheduled briefings" — we're building a platform that lets users author agents in natural language, lint them against a typed contract, dry-run them, and publish.

The third problem was the trust boundary. The chat engine's tools are read-only against scoped Cosmos. That's fine for ad-hoc questions where a human reads the answer. For an agent that fires on a schedule and surfaces a report into an inbox-like list, the trust model has to be stricter. Every numeric claim must cite the underlying snapshot. Every chart must reference a dataset id the runtime materialised, not inline data the LLM invented. The schema has to enforce this so a malformed run can't masquerade as a real one.

By the end of the third week we knew the shape: typed, versioned contracts (AgentDefinitionV1, AgentRunV1, MovementPackV1, CitationV1); a Composer that writes drafts; a linter that rejects drafts that don't fit; a runtime that runs them with deterministic tools and structured output; a scheduler that fires them; a Filed Reports surface that renders them.

Act 3: The Pivot — Stop Building the Loop, Use Claude Agent SDK

The original V0 plan had us writing the agent loop ourselves. Read a definition, build a system prompt, send it to the Anthropic API, parse the response, dispatch any tool calls, loop. The chat engine already did 60% of that — we'd just wrap it in a scheduler.

But Anthropic shipped the @anthropic-ai/claude-agent-sdk. It's a meaningful step up from a raw chat client: it has first-class concepts for tools (define name + description + Zod input schema + handler, register them with createSdkMcpServer, the model gets MCP-qualified tool names like mcp__syntell__build_movement_pack), for hooks (PreToolUse, PostToolUse, Stop), for permission modes, for system prompts, for the whole agentic loop — without us writing any of it.

More importantly, MCP gave us a clean trust boundary. The same tools the chat engine exposes become MCP tools the runtime exposes. The model is told "you have these tools, here are their schemas, call them". Hooks let us intercept every tool call before it runs (PolicyGuard checks BU scope, ABAC, connector allow-lists) and every tool return after it runs (extract citations from structuredContent.citations, materialise datasets onto the run doc, write audit log entries).

We pivoted. Not on day one — on week three, after we'd already completed much of the architecture, design and implementation plan. We threw away the custom loop and adopted Claude's Agentic SDK. The pivot cost about two evenings of rework. It paid back the rest of the project.

The decision: the SDK runs the agentic loop. We supply tools, hooks, system prompts, and the agent definition. The runtime adapter sits in a single file (runtime.js) and is the only place in the codebase that imports the SDK. Everything else — stores, hooks, tools, schemas — stays SDK-agnostic so a future swap costs days, not months.

Act 4: From Idea to Spec — The Review Loop That Reshaped Every Pass

The pivot to the SDK happened on a Friday evening. But the SDK didn't arrive in our codebase first — the plan did. And the plan only became a good plan after several review passes with Claude and Codex, that materially reshaped what Claude originally proposed, since I had decided to not reinvent the wheel.

This is the part of the project I want to credit Codex for explicitly. Without those reviews the architecture would have shipped narrower, more fragile, and harder to extend.

Step 1: Capturing the business requirements (not the technical wish-list)

The first version of the requirements doc crafted jointly with Claude was a mistake. It read like a technical wish-list — "use Cosmos, use Fastify, support cron, support webhooks". I later pushed back: "start from what an account manager actually wants, not what the platform should do."

So we rewrote the requirements in terms of user behaviour:

"Every Monday morning I want a one-page briefing of what changed in my portfolio last week, without typing a prompt."
"I want to clone a teammate's report template, point it at my customers, and publish it — in five minutes, not a sprint."
"I want every number in the report to be traceable to its SAP source so when finance pushes back I can show them where it came from."
"If I'm an admin and something goes wrong, I want one switch to stop everything."
"If I share a report with my team, the people without access to a customer must NOT see that customer's numbers."
"Built-in seeds for the most common cases: Finance Weekly Briefing, Monthly Performance Report, Customer Health, Account Manager Contract Performance."

That reframing rippled into every later decision. The "five minutes not a sprint" requirement became the Composer-as-agent design (NL intent in, typed draft out). The "every number traceable" requirement became the CitationV1 contract and the rule that the LLM never authors a citation. The "people without access must not see" requirement became the ABAC layer that filters tool returns before the LLM ever sees them. The "one switch to stop everything" became the runtime-state kill switch checked at the top of every scheduler tick.

Step 2: Claude's first architecture draft (and why most of it was wrong)

Claude's initial design proposal was about 800 lines of markdown - which started with the assumption of leveraging much of the existing AI-chat engine infrastructure first, then extend the platform to build an agentic engine. It had a custom agent loop (parse the model's response, dispatch tool calls, iterate), a per-agent SQL-like query language for "what data does this agent need", a graph database of capabilities, a vector store for "discovering which tool fits the intent", and three separate microservices for Composer, Runtime, and Scheduler.

When I changed direction on leveraging existing agentic frameworks, Claude reviewed the state-of-the-art and recommended Anthropic's Agent SDK - surfacing five substantive critiques. which I then passed to Codex to review. I'll paraphrase but the gist of each is verbatim from the review:

Review pass #1 of the design (paraphrased):

"Don't roll your own agent loop. @anthropic-ai/claude-agent-sdk shipped recently and gives you tool dispatch, hooks (Pre/Post/Stop), permission modes, MCP-qualified tool names, and an injectable query function for free. Adopt it as the runtime. Keep it confined to one adapter file so a future swap costs days not months."
"No custom query language. The data model is Cosmos. The tools you'd want are already the same shape as your existing AI chat tools. Don't reinvent. Reuse the chat tools as MCP tools, add hooks for ABAC and citation extraction."
"No vector store. The Capability Registry is small (~6 tools in V1) and hand-curated. A vector store is over-engineered for this scale and adds a dependency. Hand-curate the registry, version it, ship it."
"No three microservices. Composer, Runtime, and Scheduler all share the same Cosmos containers, the same auth identity, the same lifetime. Keep them in one Fastify process. Operationally simpler. Two App Services maximum (web + AI), not five."
"Freeze the schemas BEFORE writing code. AgentDefinitionV1, AgentRunV1, MovementPackV1, CitationV1, ToolCapabilityV1, BuAiDataPolicyV1. Zod. Strict. Tag the freeze in git. Every layer below the schemas is allowed to evolve; the schemas themselves only move on a V2 bump. The schema is the trust boundary."

All five landed. The design rewrite cut the doc from 800 lines to about 350. The implementation plan compressed from "eight tracks across six weeks" to "five tracks across two and a half weeks" because most of what Claude planned to build was now being provided by the SDK or by reusing existing code. We reworked the plan. We executed in one weekend.

Step 3: The contract freeze (Codex's most consequential push)

The contract-freeze idea (#5 above) deserves its own paragraph because it changed how the whole project was sequenced.

The original plan had every track writing its own data shapes as it went. Track A would define AgentDefinitionV1 by writing the Composer first. Track C would refine it by writing the runtime. Track G would tweak it again when writing the report renderer. Schemas would converge through iteration.

Codex flagged the obvious problem: if the schema converges through iteration, every track's tests are coupled to whichever revision they happened to be written against, and integration becomes a coordination nightmare. The fix: write the schemas first, freeze them, tag the freeze, then let every track build against a stable contract. Tests against a frozen schema are forwards-compatible. Code that fits a frozen schema can be developed in parallel without integration grief.

So we did exactly that. Track 0 (Phase 0) was a one-day pass to write all six Zod contracts, run them through fixture-based tests, commit, and tag contracts-v1-frozen on origin. Every other track started after that tag landed. The integration phase at the end of the build was nearly painless. That single insight saved at least a week of merge-hell.

Codex was excellent in reviewing Claude's multi-agent parallel execution plan and called out the gaps in sequencing, risks of agent handover corruption of shared status.md file updates. I had Claude leading the multi-agent coding sprint as lead orchestrator, and setup Codex to snoop and review, periodically from 20 minute intervals to 7 minute intervals during Claude's build process. Codex was my senior engineer reviewer. I set both agents off, and went about my weekend. Claude and Codex worked all through the weekend almost autonomously.

Step 4: Codex's during-implementation review passes

Once code started landing, Codex's reviews moved from architectural critique to substantive defect-finding. Below are the specific Codex findings that materially changed shipped code. I'm listing them because they're the kind of thing that doesn't appear in a "two AIs worked together" abstraction — they're the actual leverage of having a second reviewer.

Codex finding	Severity	What it became
"Composer meta-agent looks structurally incomplete — the seed lists list_capabilities / propose_agent_definition / validate_agent_definition in allowedTools, but those handler factories are NOT in TOOL_FACTORIES. The SDK exposes the business tools to the Composer agent but not its own meta-tools. The LLM has no callable tools and emits tool-call JSON as raw text in its assistant message."	P0	Built the three meta-tools in `ai-service/src/agents/tools/composerMeta.js` delegating to the existing `listCapabilities` / `proposeAgentDefinition` / `lintAgentDefinition` helpers in ai-core. Registered them in TOOL_FACTORIES. Added 6 wiring tests. Composer dry-runs went from "always produces text-JSON" to "always produces typed drafts via proper tool calls".
"Customer Health is real but not truly per-customer — `buildCustomerHealth` reads only `summaryDoc.topOutstandingCustomers` etc, so customers outside the top-N are invisible to the at-risk composite. That's a quiet correctness gap."	P1	Rewrote `buildCustomerHealth` to load runId-scoped detail snapshots per connector and aggregate every customer by customerId. Summary-top-N falls back only when detail rows aren't available. The at-risk ranking is now genuinely complete.
"Filed Reports NEW badge can be wrong for returning users — `AgentReportsView` freezes `renderedReadCursor` to '' on initial render BEFORE the read-state hook has loaded. Every existing report renders as NEW for returning users on every visit."	P1	Added a `ready: boolean` flag to the `useAgentUnreadCount` hook. Cursor only freezes after `ready === true`. NEW badge now correctly reflects "since you last visited" semantics.
"The read-state monotonic guarantee is not concurrency-safe — `setUserReportSeen` is read-then-upsert with no precondition. Multi-tab races can overwrite a newer cursor with an older one."	P2	Cosmos ETag CAS: `IfMatch: <etag>` for replace, `IfNoneMatch: '*'` for first-create, 3-attempt retry loop. Monotonic guarantee restored.
"`chooseAllowedTools` still defaults to Movement-Pack-only for non-plan intents — users asking for monthly aggregates won't get `compute_period_aggregate` in their drafts. Same for customer-health intents missing `compute_customer_health_composite`."	Follow-up	Pattern-matched intent signals against BOUNDARY_PATTERNS, HEALTH_PATTERNS, PERIOD_AGGREGATE_PATTERNS. Auto-pick the right deterministic primitives. `applySignalsOverride` now also applies `allowedTools` overrides + auto-acks the simulation/SAP boundary if `compute_plan_vs_actual` is present.
"@anthropic-ai/claude-agent-sdk@0.3.143 declares a peer dependency on @anthropic-ai/sdk >=0.93.0. You're pinned at 0.80 and running `--legacy-peer-deps` to suppress the warning. That's a smell. Upgrade the peer."	Follow-up	Upgraded `@anthropic-ai/sdk` from 0.80.0 to ^0.93.0. Dropped `--legacy-peer-deps` from the deploy workflow. Verified both call sites (standaloneAiChat + modelCatalogManager) use stable API surfaces that survived the bump.
"The AM Contract Performance seed should NOT be a hardcoded Customer X agent. Customer X should appear as the EXAMPLE in the natural-language intent / help text. The seed itself stays reusable for every account manager."	Design feedback	The seed ships with `customerScope: 'owner-abac'` (AM sees their assigned portfolio on first run), `specificCustomerIds` intentionally undefined, Customer X mentioned only as the example clone target in the natural-language intent. AM clones to `customerScope: 'specific'` for their actual contract. The seed itself is template-shaped.
"Are there still explicit 'later/deferred' items in the shipped story? V1 should not include language that says 'V1.1 will fix this'. Either ship it or remove the mention — don't let deferral noise leak into the seeded prompts."	P2	Stripped all V1.1 deferral language from the seed system prompts. Added test guards in `seedGalleryTemplates.test.js` that fail if any seed's description / systemPrompt / naturalLanguageIntent contains "V1.1" or "planned for".

Eight findings. Six of them were actual bugs (P0/P1/P2 + the SDK peer-dep smell), one was a Composer-design improvement (the chooseAllowedTools follow-up), one was a product-design nudge (the AM seed template-shape feedback). All eight got fixed. None of them would have been caught without the second reviewer. I'd been staring at the code for hours and missed them.

Step 5: The iteration cadence

The way the loop ended up working in practice:

Claude writes the implementation, commit, push, run touched tests, drive the CI.
Claude summarise what landed in the PR description (or in the chat, since this was a long-running session).
Codex session runs continuously in background, wakes up, inspects master, feeds back findings to me. I decide when to interrupt and steer Claude along. Sometimes I let Claude just run, with Codex keeping a mental registry of things to cleanup later. Claude was coordinating at least 8 parallel workstreams, managing integration - I didn't want to interrupt Claude unless Codex picked up something critical.
Codex clean-room reviews against master, posts numbered findings (P0/P1/P2 with rationale + suggested fix).
I relay Codex's findings to Claude.
Claude triages. Claude usually agrees and thanks Codex for superb findings. Occasionally Claude pushed back with a rationale — once or twice Codex was working from an outdated mental model of the codebase, but more often than not its findings are real and worth fixing.
Claude fixes, commit, pushes autonomously.
Loop.

The cadence wasn't fixed. Sometimes Codex would review every two or three commits. Sometimes I would say "you've been heads-down for a few hours, let me get Codex to do a sweep". The asymmetry of Claude (long context, slow review) vs Codex (fresh context, fast review) is actually a useful structural feature — the two AIs do not see the same thing, and that's where the leverage comes from.

The mental model: Claude is the long-context implementer with deep familiarity but tunnel vision. Codex is the short-context reviewer with familiarity but cleaner eyes - can see the wood for the trees. Codex is beating Claude in seeing the system with an architect lens IMHO. Claude often dives straight in without much foresight, or appreciation for larget system impact. At one stage I questioned my decision to give Claude the task of building, since all my previous long-running implementations were almost always dedicated to Codex. But I was short on Codex credits so opted to use Claude instead. I, as the human steered both. Each AI's bias is the other AI's strength. Don't replace one with the other — pair them.

Act 5: The V1 Architecture

What we ended up with:

   
   +-----------------------------------------------------------------+
   |                       Web App (browser SPA)                     |
   |   Gallery  -  Compose  -  Filed Reports  -  Insights  -  Admin  |
   +---------------------------+-------------------------------------+
                               |
                               | HTTPS + auth proxy
                               v
+-----------------------+   /api/agents/*    +-------------------------+
|  Web App (Express)    |  ----------------->|  AI service (Fastify)   |
|  React shell + static |                    |                         |
|  files. Proxies AI    |   /api/ai-chat/*   |  Existing AI chat       |
|  routes to the AI     |  ----------------->|  engine (kept, reused)  |
|  service.             |                    |                         |
+-----------------------+                    |  Agent Framework V1     |
                                             |                         |
                                             |  +-------------------+  |
                                             |  | Routes:           |  |
                                             |  |   /compose        |  |
                                             |  |   /dry-run        |  |
                                             |  |   /agents/:id     |  |
                                             |  |   /runs           |  |
                                             |  |   /runs/:id       |  |
                                             |  |   /lint           |  |
                                             |  |   /policy         |  |
                                             |  |   /runtime-state  |  |
                                             |  |   /insights/...   |  |
                                             |  +---------+---------+  |
                                             |            |            |
                                             |            v            |
                                             |  +-------------------+  |
                                             |  | Runtime adapter   |  |
                                             |  | (claude-agent-sdk)|  |
                                             |  |                   |  |
                                             |  |runAgent({def,ctx})|  |
                                             |  +---------+---------+  |
                                             |            |            |
                                             |   Hooks:   |            |
                                             |   PreToolUse  ---+      |
                                             |   PostToolUse -+ |      |
                                             |   Stop -----+  | |      |
                                             |             |  | |      |
                                             |  +----------v--v-v---+  |
                                             |  | In-process MCP    |  |
                                             |  | server (Company)  |  |
                                             |  |                   |  |
                                             |  |  build_movement   |  |
                                             |  |  compute_period   |  |
                                             |  |  compute_plan_vs  |  |
                                             |  |  customer_health  |  |
                                             |  |  list_capabilities|  |
                                             |  |  propose_def      |  |
                                             |  |  validate_def     |  |
                                             |  |  get_insight_def  |  |
                                             |  |  search_insights  |  |
                                             |  +---------+---------+  |
                                             |            |            |
                                             |            v            |
                                             |  +-------------------+  |
                                             |  | Stores            |  |
                                             |  |   definitionStore |  |
                                             |  |   runStore        |  |
                                             |  |   policyStore     |  |
                                             |  |   leaseStore      |  |
                                             |  |   heartbeatStore  |  |
                                             |  |   runtimeState    |  |
                                             |  |   costCounter     |  |
                                             |  +---------+---------+  |
                                             |            |            |
                                             |  +-------------------+  |
                                             |  | Scheduler         |  |
                                             |  | (node-schedule    |  |
                                             |  |  cron every min)  |  |
                                             |  |findDue -> claim   |  |
                                             |  |lease -> runAgent  |  |
                                             |  +-------------------+  |
                                             +-----------+-------------+
                                                         |
                                                         v
   +-----------------------------------------------------------------+
   |                        Cosmos DB (single account)               |
   |                                                                 |
   |   definitions                       agent_runs                  |
   |   +---------------------+           +-----------------------+   |
   |   | agent-definition    |           | agent-run             |   |
   |   | sap-*-insights      |           | (immutable snapshot,  |   |
   |   | sap-*-customer-...  |           |  citations[],         |   |
   |   | sap-*-weekly-       |           |  datasets{},          |   |
   |   |   snapshot          |           |  sections[]           |   |
   |   | bu-ai-data-policy   |           |  with chartSpec +     |   |
   |   | agent-lease         |           |  tableSpec)           |   |
   |   | runtime-state       |           +-----------------------+   |
   |   | scheduler-heartbeat |                                       |
   |   +---------------------+                                       |
   +-----------------------------------------------------------------+
                                                        ^
                                                        |
                                            +-----------+-----------+
                                            | ETL pipeline          |
                                            | (weekly SAP exports   |
                                            |  -> weekNum-stamped   |
                                            |  detail + summary     |
                                            |  docs)                |
                                            +-----------------------+

A few things worth pointing out:

One database, two consumers. The chat engine and the agent framework read the same Cosmos containers, the same snapshot history, the same policy docs. We didn't build a parallel data plane — we layered a new control plane (Composer + scheduler + runtime + report store) on top of the existing data plane.

The runtime adapter is one file. Per agent-framework-v1.md's adapter pattern, exactly one file imports @anthropic-ai/claude-agent-sdk. Stores, hooks, tools, schemas, routes — everything else is SDK-agnostic. The swap cost for the next runtime (whatever that ends up being) is bounded.

Citations and datasets are produced by hooks, not by the LLM. The model is never trusted to write a citation. Every tool handler attaches structuredContent: { citations, dataset } to its return. The PostToolUse citation hook walks the tool return and accumulates citations onto the run doc. The dataset hook does the same for chart-bound data. The LLM writes prose about these structures — never the structures themselves.

The schema is the boundary. AgentDefinitionV1 says exactly what fields exist, what types they have, what enums are valid. The linter runs the schema first and blocks publish on any failure. Dry-run validates the resulting AgentRunV1 against its own frozen schema. If the run doesn't fit the contract, it's flagged needs-review and excluded from the Publish gate.

Act 6: Self-QA via the Chrome Browser (The Surprise Power Move)

The thing that didn't appear on any plan was the Chrome integration. Claude has access to a browser tool that lets it drive an actual Chrome instance on my machine. Navigate to a URL, click an element by accessibility-tree reference, type into an input, take a screenshot, run JS in the page context, read network requests, read console messages.

Once you have that, the whole feedback loop changes.

The old pattern was:

Claude pushes a commit.
The user opens Chrome.
The user clicks around.
The user takes a screenshot.
The user describes what's broken.
Claude infers what's wrong from the description.
Claude pushes another commit.
Repeat.

With Chrome access:

Claude pushes a commit.
Claude waits for CI to finish.
Claude navigates Chrome to the production URL myself.
Claude clicks the same buttons a user would click.
Claude fetches the API responses with fetch() in the page context to see the actual run docs.
Claude screenshots the rendered report.
Claude sees the bug.
Claude diagnoses the bug.
Claude pushes the fix.
Repeat — without bothering the human. I gave Codex the same task, as independent QA tester- I found Codex much more skilled at operating the browser than Claude was.

I went to bed at one point and gave a single instruction: "AFK — continue to recursively test using Chrome browser until you find and fix every issue. You are not done until then. Repeat QA/debug/bug-fix cycles until no issues remain." Over the next ~5 hours that's exactly what happened. The browser tool drove the Composer wizard, triggered dry-runs, fetched run documents, read operational logs, found that citations were synthetic, traced the cause to a hook field-name mismatch, pushed a fix, deployed it, restarted the App Service, re-ran the wizard, watched validation still fail, dug deeper, found another field name issue, fixed it, redeployed.

The final blocker turned out to be that the SDK passes tool_response to PostToolUse hooks as a JSON-encoded string, not as an object. Every probe path Claude had written assumed object shape. To find this, Claude had the hooks emit a one-shot diagnostic entry to the operational log on first invocation (because console.log inside SDK hooks doesn't reach the App Service container stdout). The diagnostic entry came back with trType: 'string' and a sample snippet that started with an escaped {"\\"pack\\":...". Five lines of code later (if (typeof tr === 'string') { try { tr = JSON.parse(tr); } catch {} }), the next dry-run produced 5 real citations across 5 sections, validation passed, Publish button went green.

That cycle took ~9 hours of autonomous work. I woke up to "all features verified green in production" and a one-page summary of every commit. The bug was real, the fix was real, the verification was real, and no human had to drive the Chrome session.

The lesson: giving an AI agent access to the same UI a human uses, plus the ability to read its own logs and re-deploy, turns "debugging" from a back-and-forth conversation into a closed loop the agent runs autonomously. It's not science fiction. It worked. Last night.

Act 7: What We Learned About the Claude Agent SDK

Things that surprised Claude, in no particular order, written by Claude's own words here:

1. tools: [] means "no tools", not "default tools". Early on I'd written the SDK options as { ..., tools: [] } assuming an empty array meant "use the MCP tools I've registered via mcpServers". It actually means "the model has zero tools available". The MCP tools never reached the model. Symptom: the LLM emitted tool-call JSON as plain text in its assistant messages instead of as tool_use blocks. Fix: omit the tools key entirely.

2. MCP tools must be alwaysLoad: true or they go behind tool search. By default the SDK defers MCP tool schemas behind a built-in ToolSearch facility. If you've also locked down the built-ins via disallowedTools (which any sensible production agent does), the model has no way to discover the MCP tools at all. alwaysLoad: true pre-loads every registered MCP tool's schema directly into the model's tool list. Without this, you get the same "tool-call-as-text" symptom as above.

3. PostToolUse.tool_response is a JSON string, not an object. Documented as unknown. In practice it's a JSON-encoded string of the tool handler's structuredContent. Always JSON.parse when it's a string.

4. console.log from inside SDK hooks does not reach App Service container stdout. I do not know why. I do know that emitting structured entries to an operational-log store works fine. We added a one-shot diagnostic emission on first hook invocation, and that's how we discovered #3 above. Worth instrumenting hooks with an op-log fallback from day one.

5. The model will not call tools just because they're listed. If the system prompt says "Available tools: build_movement_pack, get_insight_definition, search_insights" without instructing the model to use them, the model will think out loud, write a narrative, and never call a thing. The system prompt must say "CALL TOOLS FIRST. Before writing any narrative, invoke each allowed tool." Pretend you're talking to a thoughtful but lazy junior analyst.

6. The Composer LLM is authoritative for intent-shaped fields only. Server-owned fields (id, pk, defType, ownerUserId, audit, composerVersion, the system prompt itself) must be materialised server-side after extraction. Trusting the LLM to author its own ownerUserId is an impersonation hole. Trusting it to author the system prompt skips the runtime tool-first directive. Materialise these post-extraction, every time.

7. The agentic SDK is a strong abstraction up to a point. For an in-process MCP server with deterministic tools and structured returns, the SDK is fantastic — permission modes, hook plumbing, tool dispatch, streaming all just work. Beyond that point (sub-agents, sessions for multi-turn, prompt caching controls), the API surface is less mature and the docs lag the code. We're not using sub-agents in V1. Custom tools only.

Act 8: A Developer's Guide — Making an Existing AI App Agentic

I want to spend an act on the concrete steps. If you have an existing AI chat app and you're trying to figure out how to add a scheduled / agentic layer on top, the path is more mechanical than you'd think. Here's the order I'd follow if I were starting fresh on a similar codebase.

Step 1: Reframe agents as typed documents, not streaming sessions

The biggest mental shift. A chat session is ephemeral — tokens stream, you read them, the session ends. An agent is a Cosmos / Postgres / S3 document. It has an id. It has a revision counter. It has an owner. It has an audit trail. The runtime interprets the document each time it fires; the document itself doesn't move.

Practically: pick a schema library (Zod is my pick), define your AgentDefinition shape, freeze it, tag the freeze. Define the AgentRun shape next — that's what every fire produces. Both shapes must be strict (extra fields rejected) so you can evolve them safely.

// packages/your-app-core/src/contracts/v1/AgentDefinitionV1.js
import { z } from 'zod';

export const AgentDefinitionV1 = z.object({
  id: z.string().min(1),                  // deterministic
  pk: z.string().min(1),                  // Cosmos partition key
  defType: z.literal('agent-definition'),
  buId: z.string().min(1),
  slug: z.string().regex(/^[a-z0-9-]+$/),
  name: z.string().min(1),
  description: z.string().min(1),
  template: z.enum(['weekly-briefing', 'monthly-report', 'composed']),
  composerVersion: z.string().min(1).nullable(),
  revision: z.number().int().positive(),
  ownerUserId: z.string().min(1),         // server-authored; never LLM
  visibility: z.enum(['org', 'private']),
  systemPrompt: z.string().min(1),        // server-composed; never LLM
  allowedTools: z.array(z.string().min(1)).min(1),
  taskSpec: z.object({ /* connectors, period, scope, ... */ }).strict(),
  schedule: z.object({ /* cadence, nextRunAt, ... */ }).strict(),
  quotas: z.object({ /* maxRunsPerMonth, maxSpendZarPerMonth, ... */ }).strict(),
  audit: z.object({                       // server-stamped
    createdAt: z.string().datetime(),
    createdBy: z.string().min(1),
    lastEditedAt: z.string().datetime(),
    lastEditedBy: z.string().min(1)
  }).strict()
}).strict();

The .strict() is non-negotiable. It's what lets you evolve forwards without silently accepting drift.

Step 2: Wrap your existing tools as MCP tools

If you have a chat app you already have tool handlers — functions that take typed args, do an authz check, hit your data store, return a result. You almost certainly don't need to rewrite them. Wrap each handler in the SDK's tool() helper, with a Zod input schema and a stable name:

// ai-service/src/agents/tools/movementPack.js
import { z } from 'zod';
import { buildMovementPack } from '../../../api/lib/movementPack/build.js';

export function buildMovementPackTool({ ctx, policy, agentDefinition }) {
  return {
    name: 'build_movement_pack',
    description: [
      'Build a typed, policy-filtered, ranked Movement Pack of business',
      'events for (buId, connector, fromWeek -> toWeek). The canonical',
      '"what changed" tool. Returns ranked events ready for narration.'
    ].join(' '),
    inputSchema: {
      buId: z.string(),
      connector: z.enum(['debtors', 'orderbook', 'stock', 'delivery', 'expenses', 'sales']),
      fromWeek: z.number().int().min(1).max(53),
      toWeek: z.number().int().min(1).max(53),
      runFy: z.string(),
      maxEvents: z.number().int().min(1).max(50).optional()
    },
    handler: async (args /* extra */) => {
      // 1. Re-check BU + ABAC scope against ctx (defence in depth)
      if (args.buId !== ctx.buId) {
        return { content: [{ type: 'text', text: 'Cross-BU blocked' }], isError: true };
      }
      // 2. Call your existing builder
      const pack = await buildMovementPack({ ...args, ctx, policy });
      // 3. Return BOTH a text block (for the LLM to read) AND
      //    structuredContent (for the runtime hooks to consume).
      return {
        content: [{ type: 'text', text: JSON.stringify(pack, null, 2) }],
        structuredContent: {
          pack,
          citations: pack.citations || [],
          dataset: packToDataset(pack)
        }
      };
    }
  };
}

Three rules for tool handlers: (1) re-check authz inside the handler — never trust the model to pass the right buId; (2) return both content (text the LLM reads) AND structuredContent (data the hooks consume); (3) when no data exists, return isError: true with a useful message rather than silently returning an empty pack.

Step 3: Build the MCP server (and read the alwaysLoad warning)

Each agent run gets its own MCP server with only the tools that the agent's allowedTools approves:

// ai-service/src/agents/toolServer.js
export const TOOL_FACTORIES = Object.freeze({
  build_movement_pack: buildMovementPackTool,
  compute_period_aggregate: buildPeriodAggregateTool,
  compute_plan_vs_actual: buildPlanVsActualTool,
  compute_customer_health_composite: buildCustomerHealthTool,
  get_insight_definition: getInsightDefinitionTool,
  search_insights: searchInsightsTool,
  // Composer meta-tools (if you have a Composer)
  list_capabilities: buildListCapabilitiesTool,
  propose_agent_definition: buildProposeAgentDefinitionTool,
  validate_agent_definition: buildValidateAgentDefinitionTool
});

export function createToolServer({ sdk, ctx, policy, agentDefinition, allowedLogicalNames }) {
  const registered = [];
  for (const name of allowedLogicalNames) {
    const factory = TOOL_FACTORIES[name];
    if (!factory) continue;                // linter rejects unknown names at save time
    const t = factory({ ctx, policy, agentDefinition });
    registered.push(sdk.tool(t.name, t.description, t.inputSchema, t.handler));
  }
  return sdk.createSdkMcpServer({
    name: 'your-app',
    version: '0.0.0',
    tools: registered,
    alwaysLoad: true   // CRITICAL: see below
  });
}

alwaysLoad: true is the single setting most likely to bite you. Without it, the SDK defers MCP tool schemas behind a built-in ToolSearch facility — meaning the model never sees the tool list directly. If you've also locked down the built-ins (as any sensible production agent does), the model has no callable tools at all, and you'll see the most confusing failure mode in agentic engineering: the LLM writes a tool call as plain text in its assistant message instead of as a structured tool_use block, and the runtime never invokes anything.

Step 4: The runtime adapter (one file imports the SDK, full stop)

Confine the SDK to exactly one adapter file. Stores, hooks, tools, schemas, routes — everything else stays SDK-agnostic. The swap cost for the next runtime is a week of rewriting one file, not months of untangling SDK types from every layer.

// ai-service/src/agents/runtime.js   -- the ONLY file that imports the SDK
import {
  query as realSdkQuery,
  tool as realSdkTool,
  createSdkMcpServer as realCreateSdkMcpServer
} from '@anthropic-ai/claude-agent-sdk';
import { createToolServer } from './toolServer.js';
import { createPolicyGuardHook } from './hooks/policyGuard.js';
import { createCitationExtractorHook } from './hooks/citationExtractor.js';
import { createDatasetExtractorHook } from './hooks/datasetExtractor.js';
import { createCostTrackerHook } from './hooks/costTracker.js';
import * as runStore from './stores/runStore.js';

export function createAgentRuntime({
  sdk = { query: realSdkQuery, tool: realSdkTool, createSdkMcpServer: realCreateSdkMcpServer }
} = {}) {
  return {
    async runAgent({ def, ctx, prompt = null, trigger = 'manual', scheduledFor = null }) {
      // 1. Pre-flight quota check, lease claim, write the pending run doc.
      const runDoc = await runStore.beginRun({ def, trigger, scheduledFor });

      // 2. Resolve MCP-qualified names + build per-run tool server.
      const allowedMcpNames = def.allowedTools.map(n => `mcp__your-app__${n}`);
      const toolServerInstance = createToolServer({
        sdk, ctx, policy: await loadPolicy(def.buId),
        agentDefinition: def, allowedLogicalNames: def.allowedTools
      });

      // 3. Compose hooks: PreToolUse (ABAC), PostToolUse (extract), Stop (cost).
      const policyGuard = createPolicyGuardHook({ ctx, policy, allowedMcpNames });
      const citationHook = createCitationExtractorHook({ runDoc });
      const datasetHook = createDatasetExtractorHook({ runDoc });
      const costHook = createCostTrackerHook({ ctx, def, runDoc });

      // 4. SDK options.
      const options = {
        systemPrompt: def.systemPrompt,
        // NB: do NOT pass `tools: []`. The SDK reads that as "no tools".
        // Omit `tools` entirely to keep the MCP tools visible.
        disallowedTools: ALL_SDK_BUILTINS,     // lock down built-ins
        allowedTools: allowedMcpNames,         // ONLY these MCP tools
        mcpServers: { 'your-app': toolServerInstance },
        permissionMode: 'dontAsk',
        maxTurns: def.quotas.maxTurns ?? 16,
        hooks: {
          PreToolUse:  [{ matcher: '.*', hooks: [policyGuard] }],
          PostToolUse: [{ matcher: '.*', hooks: [citationHook, datasetHook] }],
          Stop:        [{ matcher: '.*', hooks: [costHook] }]
        }
      };

      // 5. Drive the stream. Collect synthesis text from text-only messages
      //    (the model's "thinking aloud" mid-tool-call goes in interleavedText
      //    as a fallback; the final narrative comes from messages that
      //    have NO tool_use blocks).
      const synthesisText = [];
      const interleavedText = [];
      for await (const message of sdk.query({ prompt: prompt ?? def.naturalLanguageIntent, options })) {
        if (message.type !== 'assistant') continue;
        const blocks = message.content || message.message?.content || [];
        const hasToolUse = blocks.some(b => b?.type === 'tool_use');
        for (const b of blocks) {
          if (b?.type === 'text' && typeof b.text === 'string') {
            interleavedText.push(b.text);
            if (!hasToolUse) synthesisText.push(b.text);
          }
        }
      }
      const narrative = (synthesisText.length ? synthesisText : interleavedText).join('\n\n').trim();

      // 6. Compose the final output, validate against AgentRunV1, persist.
      const output = composeOutput({ def, runDoc, narrative });
      return runStore.completeRun(runDoc, output);
    }
  };
}

This is ~50 lines of structure. Almost everything else in the agent framework is in modules that don't know the SDK exists.

Step 5: Three hooks — ABAC, extraction, cost

PreToolUse (PolicyGuard): reject tool calls that smuggle a different BU, that target a connector denied by policy, or that aren't in the agent's allowedTools set.

// ai-service/src/agents/hooks/policyGuard.js
export function createPolicyGuardHook({ ctx, policy, allowedMcpNames }) {
  const allowed = new Set(allowedMcpNames);
  return async function policyGuardHook(input) {
    const toolName = input?.tool_name;
    const toolInput = input?.tool_input || {};
    if (!toolName?.startsWith('mcp__your-app__')) {
      return { decision: 'block', reason: 'Built-in tools are not allowed' };
    }
    if (allowed.size > 0 && !allowed.has(toolName)) {
      return { decision: 'block', reason: `${toolName} not in agent's allowedTools` };
    }
    if (toolInput.buId && toolInput.buId !== ctx.buId) {
      return { decision: 'block', reason: 'Cross-BU tool call blocked' };
    }
    return { decision: 'approve' };
  };
}

PostToolUse (Citation/Dataset extractors): read the tool return, pull citations onto the run doc, materialise the dataset into runDoc.output.datasets[datasetId]. This is where the JSON-string-tool_response gotcha bites — the SDK passes tool_response as a JSON string, not an object. Parse first.

// ai-service/src/agents/hooks/citationExtractor.js
export function createCitationExtractorHook({ runDoc }) {
  return async function citationExtractorHook(input) {
    // SDK passes tool_response as a JSON STRING (confirmed via op-log diag).
    let tr = input?.tool_response;
    if (typeof tr === 'string') {
      try { tr = JSON.parse(tr); } catch { return { decision: 'approve' }; }
    }
    // After parsing, tr IS the handler's structuredContent.
    const citations = tr?.citations || tr?.pack?.citations || [];
    const known = new Set(runDoc.output.citations.map(c => c.id));
    for (const c of citations) {
      if (c?.id && !known.has(c.id)) {
        runDoc.output.citations.push(c);
        known.add(c.id);
      }
    }
    return { decision: 'approve' };
  };
}

Stop (CostTracker): read final usage from the SDK's stop event, estimate cost, increment a per-agent monthly counter, attach token + cost totals to the run doc. NOTE: I haven't got the cost tracker to work yet.

Step 6: The system prompt must INSTRUCT the model to call tools

This is the third-most-common failure mode people are likely to hit. Listing the tools in the system prompt is not enough. The model will read the list, think out loud about what it could do, and never make a call.

The prompt has to be directive. Roughly:

You are the [agent name] agent for [BU id].

AVAILABLE DETERMINISTIC TOOLS (call these — they compute the truth, you narrate):
  - build_movement_pack: Build a ranked Movement Pack of business events...
  - compute_period_aggregate: Roll weekly snapshots into monthly buckets...
  - get_insight_definition: Look up an insight's source + formula + ABAC scope...

EXECUTION ORDER (non-negotiable):
  1. CALL TOOLS FIRST. Before writing ANY narrative or numbers, invoke
     each allowed tool that applies to this run.
  2. NARRATE FROM TOOL RESULTS ONLY. Every numeric claim must come from
     a tool result. Quote what the tool returned; do not infer.
  3. EMIT STRUCTURED OUTPUT. The runtime auto-binds your tool results
     to chartSpecs/tableSpecs by datasetId — you do NOT need to author
     chart data. Just call the tool; the runtime renders it.

HARD GUARDRAILS:
  - Every claim must cite at least one CitationV1 from this run.
  - Every chartSpec.datasetId must match a dataset the runtime materialised.
  - Do NOT invent numbers. If a number can't be sourced, say "not available".
  - Stay within [BU id]. Refuse cross-BU requests.

The "EXECUTION ORDER" / "HARD GUARDRAILS" framing is what actually flips the model from "narrate plausibly" mode into "call tools, then narrate" mode. Without it, the model is well-behaved chat AI, which is the wrong product.

Step 7: Materialise server-owned fields after the Composer LLM returns

If you have a Composer (an agent that authors other agents from natural language intent), the LLM returns a typed-ish draft. Do NOT publish that draft as-is. The LLM is authoritative only for intent-shaped fields: name, description, allowedTools, taskSpec, schedule, quotas, acknowledgesSimulationSapBoundary. Everything else is server-owned and gets overwritten after extraction:

function materializeComposerDraft({ partial, body, ownerUserId, buId }) {
  const draft = { ...(partial || {}) };
  const slug = body?.slug || draft.slug;

  // Server-OWNED (always overwrite the LLM):
  draft.id = `agent_${buId}_${slug}_v1`;
  draft.pk = pkForAgents(buId);
  draft.defType = 'agent-definition';
  draft.buId = buId;
  draft.slug = slug;
  draft.template = 'composed';
  draft.revision = 1;
  draft.composerVersion = COMPOSER_VERSION;
  draft.ownerUserId = ownerUserId;          // SECURITY: never trust LLM here
  draft.naturalLanguageIntent = body?.intent;  // preserve user's words

  // The system prompt is server-COMPOSED, never LLM-authored. This is what
  // guarantees the runtime LLM always receives the CALL-TOOLS-FIRST directive.
  draft.systemPrompt = composeSystemPromptFromDraft({
    name: draft.name, buId,
    connectors: draft.taskSpec?.connectors || [],
    allowedTools: draft.allowedTools || [],
    customerScope: draft.taskSpec?.customerScope,
    roleLens: draft.taskSpec?.roleLens
  });

  // Audit stamp — security-owned, never LLM-authored.
  const now = new Date().toISOString();
  draft.audit = {
    createdAt: now, createdBy: ownerUserId,
    lastEditedAt: now, lastEditedBy: ownerUserId
  };

  return draft;  // Now lint this. Failures surface to the user, not papered over.
}

Step 8: Wire a scheduler (the predicate that almost killed us)

For weekly/monthly cadences, node-schedule with a per-minute tick is plenty. Each tick reads "what's due" from the data store and fires anything overdue. The thing to get right is the findDue query — it must filter on cadence, on nextRunAt, and on the runtime kill switch.

export async function findDue(buId, { now = new Date().toISOString() } = {}) {
  const container = getAgentsContainer();
  const { resources } = await container.items.query({
    query: `SELECT * FROM c
            WHERE c.defType = @defType
              AND c.buId = @buId
              AND c.schedule.nextRunAt <= @now
              AND c.schedule.cadence IN ('weekly', 'monthly', 'quarterly')`,
    parameters: [
      { name: '@defType', value: 'agent-definition' },
      { name: '@buId', value: buId },
      { name: '@now', value: now }
    ]
  }, { partitionKey: pkForAgents(buId) }).fetchAll();
  return resources;
}

The lesson scarred into me: do not include predicates against fields that don't exist in your schema. My first version of findDue included AND c.enabled = true, because I'd assumed per-agent enable/disable would be a V1 feature. It wasn't — the AgentDefinitionV1 schema doesn't have an enabled field at all. The predicate silently never matched, and NO scheduled run ever fired. The kind of bug you don't catch until Monday morning at 07:30 SAST when you realise the briefing didn't come in.

If you want a kill switch, put it in a separate document (we use a Cosmos runtime-state doc, checked at the top of every scheduler tick). Don't try to embed it in the agent definition itself unless the schema fully supports it.

Step 9: The render side — do the boring work

An agent that produces a beautiful run document but renders as a wall of raw markdown is not done. Render the run with the same markdown library your chat surface uses (we use react-markdown + remark-gfm + remark-breaks). Wrap chart specs in your charting library (Recharts is fine). Bind datasets by datasetId — never let the renderer accept inline data, because that defeats the citation chain.

Split the run's narrative into sections by H2 (## ...) at compose time, not at render time. The data store should already hold structured sections, so the renderer is dumb. Use a heuristic to bind chart/table specs under the section whose heading or body matches the dataset id tokens. Orphan specs land in the tail section as an appendix.

Citation chips should show calendar dates, not week numbers. Users do not know what "FY26 W42" means. They know what "10-16 May 2026" means. Make the chip surface the latter; keep the FY-week code in the expanded detail for ops/audit.

Step 10: Common pitfalls (a survival checklist)

tools: [] means "no tools". Omit the key entirely. The SDK uses its default and your MCP tools become visible.
MCP tools must be alwaysLoad: true or they go behind a built-in tool-search facility the model can't reach when you've locked down built-ins.
PostToolUse.tool_response is a JSON string. Always JSON.parse before probing.
console.log inside SDK hooks does not reach App Service container stdout. Emit to your operational-log store instead. Add a one-shot diagnostic emission to discover any future SDK shape change.
The model will not call tools just because they're listed. Your system prompt must say "CALL TOOLS FIRST" in directive language.
Never let the LLM author identity, audit, or system-prompt fields. Materialise them server-side after extraction. The schema is your trust boundary.
Never let the LLM author citations. Citations are produced by tool handlers (in structuredContent.citations) and accumulated by PostToolUse hooks. The model writes prose around them.
Confine the SDK import to one adapter file. Everything else stays SDK-agnostic. Future-proofs against runtime swaps.
Don't gate findDue on fields your schema doesn't define. The predicate will silently never match. You will not find out for days.
Freeze your schemas BEFORE writing code. Tag the freeze. Track 0 of your project. Saves a week of merge-hell down the line.

The minimum-viable agentic conversion of an existing AI chat app: write the schemas (1 day) → wrap your existing tools as MCP tools (1-2 days) → write the runtime adapter (1 day) → write the three hooks (1 day) → write the system-prompt composer (half a day) → wire a scheduler (half a day) → render the run document (2-3 days for a polished UI). That's a developer-week or two of work for an MVP, assuming you already have working tools and an authz layer. Most of the time you'll spend is debugging the SDK quirks above — budget two more days for that.

Act 9: The V1 Success Criteria

Here's where we ended up. V1 is "done" against the following criteria:

Criterion	Status
Compose an agent from a free-form natural-language intent	LLM-path Composer produces typed AgentDefinitionV1 drafts that pass the linter cleanly. The materialiser fills server-owned fields (id, pk, ownerUserId, systemPrompt, audit) so the LLM can never impersonate or skip the runtime tool-first directive.
Lint a draft against the frozen contract	9 lint checks (schema-valid, tool-availability, cadence-compatibility, connector-compatibility, tool-arg-validity, customer-scope-ack, simulation-sap-boundary-ack, cost-cap, chart-table-binding). Blocks publish on any error.
Dry-run a draft against the real runtime	Runtime invokes MCP tools, hooks accumulate citations + datasets, output is composed into multi-section markdown, schema is validated. Publish is gated on validation passing.
Publish a draft into the Gallery	Definition lands in Cosmos as `agent-definition`. Visible to the BU. Owner + revision audited.
Schedule a published agent	node-schedule cron tick every minute. `findDue` filters by cadence (weekly/monthly/quarterly) and `nextRunAt <= now`. Lease claimed via Cosmos `IfNoneMatch`. Runs via the same runtime as dry-run. Verified live: Finance Weekly Briefing fired at 07:00 SAST on a Monday morning, no human in the loop.
File a run as an immutable report	AgentRunV1 doc, embedded definitionSnapshot, multi-section output, real citations, dataset envelopes, chart + table specs bound to dataset ids. Visible in Filed Reports.
Render a report cleanly to a human	Multi-section layout, ChatMarkdown (react-markdown + remark-gfm) for headings/lists/tables/bold, Recharts for chart specs, structured table renderer for table specs, citation chips with calendar date ranges, status pills, audience-scope banner, Export/Print to PDF, ABAC-blocked viewers get a 403 banner with a "clone and re-run" CTA.
Back/forward navigation works	`history.pushState` on drill-in, `popstate` listener closes the drill-in, `?runId=` deep-link supported on mount.
Cost + token quotas	Per-agent monthly cost cap. Pre-flight quota check via `costCounterStore`. Stop hook aggregates real token usage + estimated ZAR cost from the model response.
Runtime kill switch	Single Cosmos doc checked at the top of every scheduler tick and every route. Admin UI flip.
Operational logs	Every run emits AGENT_RUN_STARTED, AGENT_RUN_COMPLETED / NEEDS_REVIEW / FAILED. Every tool invocation emits AGENT_TOOL_INVOKED. Lease collisions, lint warnings, definition changes all logged. Insights Recent Activity feed filters on the AGENT_* event taxonomy.
Per-user adoption metrics	Insights AdoptionPanel shows distinct users, runs, succeeded/failed/needs-review per user, tokens, ZAR cost, last-activity.
Seed templates ship	Finance Weekly Briefing (weekly, top movers across four customer connectors), Monthly Performance Report (compute_plan_vs_actual + compute_period_aggregate, board-style report), Customer Health Composite (compute_customer_health_composite, top 10 at-risk), Account Manager Contract Performance Report (reusable template, owner-ABAC scope, monthly cadence, City of Cape Town as example only — not pre-pinned).

What V1 does NOT include (deferred to V1.1+): per-agent pause/enable UI, multi-BU scheduling, in-app notifications when a new report lands, email delivery, role-shared visibility, on-event triggers, raw-prompt admin edit mode, more deterministic tools (seasonality, anomaly detection), background-job retry queue. The plan is to ship V1 first, learn from real use, then prioritise V1.1 from feedback.

Act 10: Assessment Against Modern AI Patterns

If I step back and look at this against where the agentic-AI field is in 2026, I think we landed in a reasonable spot. Not state of the art — we don't have a multi-agent reasoning swarm with arbitrary task decomposition, and we don't want that — but defensible and well-grounded.

1. Tool-augmented LLM with strict schemas (instead of free agentic reasoning)

The dominant safe-pattern for enterprise BI agents in 2026 is the same one we adopted: the LLM narrates, deterministic tools compute. Numbers come from tools. Citations come from tools. Charts come from tools. The LLM writes prose around them. The schema rejects any run that doesn't fit.

This is the opposite direction from "let the LLM reason its way to an answer". For BI it's the right direction. Finance teams cannot afford hallucinated numbers. The schema-first design means the platform produces something predictable — if a tool is broken, the run fails loudly, doesn't quietly invent.

2. MCP as the trust boundary

MCP is the right abstraction for tool exposure. It separates "what the tool does" (the handler, in our code) from "how the model invokes the tool" (MCP-qualified names, JSON-schema'd input). Hooks intercept on the SDK side; ABAC + scope enforcement live in the handler. That two-layer defence is the modern standard.

The catch we hit (tool_response being a JSON string, alwaysLoad needing to be true, console.log from hooks not reaching stdout) are SDK-specific quirks of @anthropic-ai/claude-agent-sdk@0.3.143. If we'd built our own loop we'd have hit different quirks. None of these were design errors — they were implementation details we discovered by running the code in production.

3. Composer-as-agent (instead of UI form)

Letting users describe an agent in natural language and having an LLM compile it to a typed draft is the right pattern for V1. A form-based "build your own report" UI sounds simpler but it's actually constraining — users have to learn the form, and the form has to anticipate every combination of cadence x scope x tool. Natural language lets users say "weekly debtors report for Customer X with monthly trend chart" and the Composer figures out the slug, schedule, allowedTools, taskSpec.

The hardening is the typed schema underneath. The user's natural language goes through the LLM, comes out as a typed draft, passes through the linter, passes through dry-run validation, and only then can be published. The free-form NL never reaches the runtime — only the typed draft does. That's the right separation.

4. Agents as documents, not as code

An AgentDefinitionV1 is a Cosmos document. It's clonable, remixable, versioned (revision bump on every save), audited (createdBy / lastEditedBy / lastEditedAt). The Gallery is just a list of those documents filtered to the BU. The Composer is just an editor for those documents. The runtime is just an interpreter for those documents.

This is the right abstraction for a BU lead who wants to say "clone the AM contract template, set my customers, change the cadence to monthly". They're editing a document, not deploying a service. That's the leverage of treating agents as first-class data, not as code.

5. Where we are NOT state of the art

We don't have:

Sub-agent decomposition — an agent can't spawn a child agent to handle a sub-task. We deliberately chose not to use the SDK's sub-agent feature for V1. Sub-agents are non-reproducible, expensive, and audit-hostile. Custom tools only.
Memory across runs — each run is stateless. No "remember what you said last week". The Movement Pack diff engine does that work deterministically (week-over-week comparison from snapshot history), not through LLM memory.
Multi-turn agent sessions — the Composer is single-shot. You give it an intent, it gives you a draft. We don't model "refine the draft via three turns of dialogue". The user can re-compose with a tweaked intent, but there's no conversation state.
Auto-discovery of new tools — the Capability Registry is hand-curated. New tools are added by engineers, not by the agent itself.

I think most of these absences are correct for V1. The ones I'd revisit first are multi-turn refinement in the Composer (lets the user say "good, but make the cadence monthly instead of weekly" without re-typing the whole intent) and on-event triggers (agent fires when a new ETL snapshot lands, not on a wall-clock cadence). Both are V1.1 candidates.

6. The chat engine vs the agent framework, side by side

Now that both exist, the right mental model is:

	AI Chat Engine	Agent Framework
Trigger	User opens panel and types	Schedule, or user clicks Run
Output	Streamed chat reply	Filed report (multi-section, versioned)
Lifetime	Session-bound	Immutable, stored, ABAC-scoped, exportable
Audience	The asker	Anyone in the BU (org) or just the owner (private)
Best for	"What's our orderbook looking like?"	"Every Monday at 07:00, brief me on the top five movers."
Lives at	Side panel	Filed Reports surface

They share the same tools, the same data, the same auth. They diverge on temporality. Chat is reactive. Agents are proactive. Together they cover the whole "AI as a BI partner" surface.

The Takeaway

I started this project thinking "we'll wrap the chat engine in a scheduler and call it agentic." We ended up with a typed, versioned, schema-validated agent platform with a Composer, a Linter, a multi-section report renderer, a chart binder, a citation chip, an admin kill switch, an Insights panel, four seed templates, and a real scheduled run that fired itself this morning.

The hardest part wasn't the LLM. It was the trust boundary. Getting the schema right. Refusing to let the LLM author its own citations. Refusing to let the LLM author its own system prompt. Refusing to let the LLM author its own owner id. Making the schema the boundary, and making everything below the schema deterministic.

The second-hardest part was the SDK quirks — the JSON-string tool_response, the alwaysLoad gotcha, the console.log-doesn't-reach-stdout thing, the c.enabled-predicate-that-no-one-set that prevented every scheduled run from ever firing. Each one took an hour to diagnose. None of them would have shown up without running real workloads against real data in real production.

The third-hardest part — and this is the one most underrated — was the two-AI pattern. Claude as orchestrator, Codex as reviewer, both adversarial, both checking each other, with a human steering. Codex caught things Claude seriously missed. Claude caught things Codex missed. I used human judgement by steering decisions. The codebase is better for it. The pattern works.

Use Claude Code's frontend designer to build the mock UX end-to-end. As part of the requirements and design phase, using Claude's frontend designer plugin saved a lot of time. After 4 iterations we landed on a design UX for the new agentic feature, conforming to the existing UX design and stlye guide. With the sample html file as the key requirements spec for frontend, Claude was able to build a solid usable feature from the start.

There's still much to do (per-agent pause, in-app notifications, more deterministic primitives, on-event triggers, multi-BU scheduling, account manager-friendly clone workflows). But this morning the platform produced a real briefing without anyone clicking anything. That's the V1 milestone. That's the moment "agentic AI" stopped being a slide deck and started being a Monday-morning artefact.

Onwards to V1.1.

Stack: TypeScript-flavoured JavaScript end-to-end. Fastify on a Linux App Service for the AI service. Express + React on a separate App Service for the web app. Cosmos DB as the only data store. @anthropic-ai/claude-agent-sdk + @anthropic-ai/sdk for the agentic loop and the chat engine respectively. node-schedule for the cron tick. Recharts for chart rendering. react-markdown + remark-gfm for the markdown renderer. Zod for the frozen contracts (AgentDefinitionV1, AgentRunV1, MovementPackV1, CitationV1, ToolCapabilityV1, BuAiDataPolicyV1). No background-job platform — the scheduler runs in-process. No vector store — semantic retrieval lives in the Capability Registry, hand-curated. No multi-agent orchestration framework — one agent at a time, custom tools only.

Total elapsed: ~3.5 weeks from "let's wrap the chat engine in a scheduler" to "the scheduled briefing fired at 07:00 SAST this morning". Total production downtime during the build: zero — the agent framework runs on its own App Service, and the build never touched the chat engine or the BI dashboards. Cups of coffee: lost count, again.

Pages

Wednesday, 27 May 2026

The Scoreboard I Cannot Query: How Anthropic Locks Team-Plan Admins Out of Their Own Claude Code Data

Act 1: Working backwards from what I actually wanted

Act 2: The build was the easy part (it always is now)

Act 3: Dive deep — the smoke test that told the truth

Act 4: Two organisations, one name, no link

Act 5: The wall — and a documented door that opens onto a wall

Act 6: Why this is the wrong line to draw

What "good" looks like

The Takeaway

Friday, 22 May 2026

The Invisible Compiler: How One Human and Two AI Pair-Programmers Turned a Five-Step Stepper Into a Chat in just 4 days

Act 1: Why the V1 Composer Wasn't Good Enough

Act 2: The Iteration Loop That Reshaped the Plan

Act 3: The Architecture We Landed On

Act 4: PR 1 — The Foundational Refactor (Zero Regression)

Act 5: PR 2 — The Composer V2 Specialisation

Act 6: The User Experience (Currently Lightweight, Evolving)

Act 7: Six Fix-Cycles, Four QA Rounds

Act 8: Closing the QA Loop with Chrome Automation

Self-Assessment: Benchmarking Against the Market

Market penetration

Tier matrix

Three architectural benchmarks where V2 lands well

Modern Software Development with Two AI Pair-Programmers

Metrics

Lessons & Advice

1. Build the platform abstraction before the specialisation

2. The do-no-harm gate is non-negotiable for foundational refactors

3. Specialise via profile, not via fork

4. The LLM is authoritative for intent only

5. Build the smart matching upstream, not downstream

6. Hand the AI the browser

7. Stop QA when quality matches the comparison baseline

Best Practices for AI Developers

The Takeaway

Monday, 18 May 2026

How I built and agentic engine (powered by Anthropic's Agentic runtime) to augment my already AI-powered BI app, with Claude & Codex, over one weekend

Act 1: The Existing AI Chat Engine (Why This Wasn't Enough)

Act 2: Origins of the Agentic Idea

Act 3: The Pivot — Stop Building the Loop, Use Claude Agent SDK

Act 4: From Idea to Spec — The Review Loop That Reshaped Every Pass

Step 1: Capturing the business requirements (not the technical wish-list)

Step 2: Claude's first architecture draft (and why most of it was wrong)

Step 3: The contract freeze (Codex's most consequential push)

Step 4: Codex's during-implementation review passes

Step 5: The iteration cadence

Act 5: The V1 Architecture

Act 6: Self-QA via the Chrome Browser (The Surprise Power Move)

Act 7: What We Learned About the Claude Agent SDK

Act 8: A Developer's Guide — Making an Existing AI App Agentic

Step 1: Reframe agents as typed documents, not streaming sessions

Step 2: Wrap your existing tools as MCP tools

Step 3: Build the MCP server (and read the alwaysLoad warning)

Step 4: The runtime adapter (one file imports the SDK, full stop)

Step 5: Three hooks — ABAC, extraction, cost

Step 6: The system prompt must INSTRUCT the model to call tools

Step 7: Materialise server-owned fields after the Composer LLM returns

Step 8: Wire a scheduler (the predicate that almost killed us)

Step 9: The render side — do the boring work

Step 10: Common pitfalls (a survival checklist)

Act 9: The V1 Success Criteria

Act 10: Assessment Against Modern AI Patterns

1. Tool-augmented LLM with strict schemas (instead of free agentic reasoning)

2. MCP as the trust boundary

3. Composer-as-agent (instead of UI form)

4. Agents as documents, not as code

5. Where we are NOT state of the art

6. The chat engine vs the agent framework, side by side

The Takeaway