This morning, at exactly 07:00 SAST, the platform produced a financial briefing without anybody pressing a button. It pulled six SAP connectors, picked the top five movers across debtors / orderbook / stock / delivery, narrated them with real customer names ("CITY OF CAPE TOWN orderbook changed by XX, from RXX to RYY, the single largest book inflow this week"), bound a Recharts bar chart to the underlying dataset, and stamped every numeric claim with a citation chip showing the calendar date range "Debtors · 15 May to 21 May 2026". Validation passed. Cost: a few cents. Latency: 91 seconds end-to-end.
The report is publishable as-is. Not the LLM "making something up that looks right" — every number traces back to an ETL run id, every chart traces back to a dataset, every claim traces back to a CitationV1.
That report is the V1 of an "agentic" layer we've been building on top of an existing BI platform. It took three and a half weeks: research, ideation, and detailed implementation planning for V0; then an architecture pivot, when I decided to stop reinventing the wheel and adopt a frontier model's agentic SDK; then a bite-the-bullet weekend build sprint. That sprint included an autonomous overnight QA cycle using browser connectors, an awkward midnight debugging session over a single JSON field, and the discovery that one missing column predicate had silently prevented every scheduled run from ever firing.
I want to write this down before it ages out of memory.
Act 1: The Existing AI Chat Engine (Why This Wasn't Enough)
The platform already had an AI chat engine before any of this started. It's the kind of thing most BI tools ship in 2026: open a side panel, type "what's our overdue debtors looking like", an LLM reads the page context, calls a few internal tools to fetch SAP snapshot summaries, and answers with citations. It works. People use it. It's saved hours of "let me Excel-pivot this for you" emails.
The app was fine in its initial release. As with any launch, once users started playing with it and saw the AI chat feature, the feedback almost immediately turned to requests for automated reports, automated insights, weekly trend signals, and so on. The oversight on my part was trend data: weekly snapshots did not exist. To support automated reporting I first had to close that gap, and I treated it as foundational work before starting on any agentic framework. It took about a week, following my usual architecture and spec-driven planning approach with Claude and Codex.
The architecture was honest about what it was. A single Fastify service stands up a chat route that streams from the Anthropic API. A small registry of typed tools sits behind it — build_movement_pack, get_insight_definition, query_snapshot_week, compute_period_aggregate, etc. Every tool reads from the same Cosmos containers the dashboards read from, scoped by BU, filtered by an ABAC layer that keeps customer-level data behind role assignments. The model never sees raw Cosmos — it sees tool returns that have already been authz-checked, citation-stamped, and shape-normalised.
For ad-hoc questions ("show me the top 5 debtors this week", "summarise expense trends since January") this is great. It's chat. The user types, the model answers, the conversation ends.
The chat engine is solid. But a true enterprise BI insights platform needs agentic capabilities too — agents that run on a schedule, pull what's changed, narrate it, file a report, and surface it to the right people without anyone typing a prompt.
That's the gap Agentic V1 is closing.
Act 2: Origins of the Agentic Idea
The initial sketch was simple. "Take the existing AI chat tools. Add a scheduler. Run them weekly. Save the output as a report."
That sketch survived about two weeks before it started to fall apart.
The first problem was structural: a "report" isn't a chat reply. A chat reply is a stream of tokens that disappears when you close the panel. A report is a document — structured, multi-section, with charts and tables, that needs to be filed, indexed, ABAC-scoped per viewer, exportable to PDF, shareable, auditable. The chat engine's plumbing doesn't model any of that. It just streams text.
The second problem was about who authors agents. The first instinct was "we'll seed a few hand-written ones — weekly briefing, monthly performance, customer health — and that's V1." But the moment the seeded agents existed, the next question was obvious: "Can a sales manager build their own for their portfolio? Can an account manager clone the seed for Customer X and remix it?" That implies a Composer. A Composer is itself an agent. So now we're not just building "three scheduled briefings" — we're building a platform that lets users author agents in natural language, lint them against a typed contract, dry-run them, and publish.
The third problem was the trust boundary. The chat engine's tools are read-only against scoped Cosmos. That's fine for ad-hoc questions where a human reads the answer. For an agent that fires on a schedule and surfaces a report into an inbox-like list, the trust model has to be stricter. Every numeric claim must cite the underlying snapshot. Every chart must reference a dataset id the runtime materialised, not inline data the LLM invented. The schema has to enforce this so a malformed run can't masquerade as a real one.
By the end of the third week we knew the shape: typed, versioned contracts (AgentDefinitionV1, AgentRunV1, MovementPackV1, CitationV1); a Composer that writes drafts; a linter that rejects drafts that don't fit; a runtime that runs them with deterministic tools and structured output; a scheduler that fires them; a Filed Reports surface that renders them.
Act 3: The Pivot — Stop Building the Loop, Use Claude Agent SDK
The original V0 plan had us writing the agent loop ourselves. Read a definition, build a system prompt, send it to the Anthropic API, parse the response, dispatch any tool calls, loop. The chat engine already did 60% of that — we'd just wrap it in a scheduler.
But Anthropic shipped the @anthropic-ai/claude-agent-sdk. It's a meaningful step up from a raw chat client: it has first-class concepts for tools (define name + description + Zod input schema + handler, register them with createSdkMcpServer, the model gets MCP-qualified tool names like mcp__syntell__build_movement_pack), for hooks (PreToolUse, PostToolUse, Stop), for permission modes, for system prompts, for the whole agentic loop — without us writing any of it.
More importantly, MCP gave us a clean trust boundary. The same tools the chat engine exposes become MCP tools the runtime exposes. The model is told "you have these tools, here are their schemas, call them". Hooks let us intercept every tool call before it runs (PolicyGuard checks BU scope, ABAC, connector allow-lists) and every tool return after it runs (extract citations from structuredContent.citations, materialise datasets onto the run doc, write audit log entries).
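The pre-tool check described above can be sketched in plain JavaScript. This is a minimal illustration, not the platform's PolicyGuard and not the SDK's hook signature: the function name `policyGuard`, the `ctx` fields, and the `{ allow, reason }` decision shape are all assumptions made for the example. In the real runtime this logic would sit inside a PreToolUse hook.

```javascript
// Hypothetical policy guard: decides whether a tool call may run.
// Names and shapes are assumptions; a real PreToolUse hook would wrap this.
function policyGuard(ctx, toolName, args) {
  // The model only gets the tools its definition allows.
  if (!ctx.allowedTools.includes(toolName)) {
    return { allow: false, reason: `tool not allowed: ${toolName}` };
  }
  // BU scope: an agent may only query its own business unit.
  if (args.buId !== undefined && args.buId !== ctx.buId) {
    return { allow: false, reason: 'cross-BU call blocked' };
  }
  // Connector allow-list taken from the BU's data policy.
  if (args.connector !== undefined && !ctx.allowedConnectors.includes(args.connector)) {
    return { allow: false, reason: `connector not allowed: ${args.connector}` };
  }
  return { allow: true, reason: 'ok' };
}

// Example context (illustrative values only).
const ctx = {
  buId: 'bu-1',
  allowedTools: ['mcp__syntell__build_movement_pack'],
  allowedConnectors: ['debtors', 'orderbook'],
};

const decision = policyGuard(ctx, 'mcp__syntell__build_movement_pack', {
  buId: 'bu-2',
  connector: 'debtors',
});
console.log(decision.reason); // prints "cross-BU call blocked"
```

Keeping the decision a pure function of `(ctx, toolName, args)` makes it easy to unit-test without standing up the SDK at all.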
We pivoted. Not on day one, but in week three, after we'd already completed much of the architecture, design and implementation plan. We threw away the custom loop and adopted the Claude Agent SDK. The pivot cost about two evenings of rework. It paid for itself over the rest of the project.
The runtime adapter lives in a single file (runtime.js) and is the only place in the codebase that imports the SDK. Everything else (stores, hooks, tools, schemas) stays SDK-agnostic, so a future swap costs days, not months.
Act 4: From Idea to Spec — The Review Loop That Reshaped Every Pass
The pivot to the SDK happened on a Friday evening. But the SDK didn't arrive in our codebase first; the plan did. And the plan only became a good plan after several review passes with Claude and Codex. Once I had decided not to reinvent the wheel, those passes materially reshaped what Claude had originally proposed.
This is the part of the project I want to credit Codex for explicitly. Without those reviews the architecture would have shipped narrower, more fragile, and harder to extend.
Step 1: Capturing the business requirements (not the technical wish-list)
The first version of the requirements doc crafted jointly with Claude was a mistake. It read like a technical wish-list — "use Cosmos, use Fastify, support cron, support webhooks". I later pushed back: "start from what an account manager actually wants, not what the platform should do."
So we rewrote the requirements in terms of user behaviour:
- "Every Monday morning I want a one-page briefing of what changed in my portfolio last week, without typing a prompt."
- "I want to clone a teammate's report template, point it at my customers, and publish it — in five minutes, not a sprint."
- "I want every number in the report to be traceable to its SAP source so when finance pushes back I can show them where it came from."
- "If I'm an admin and something goes wrong, I want one switch to stop everything."
- "If I share a report with my team, the people without access to a customer must NOT see that customer's numbers."
- "Built-in seeds for the most common cases: Finance Weekly Briefing, Monthly Performance Report, Customer Health, Account Manager Contract Performance."
That reframing rippled into every later decision. The "five minutes not a sprint" requirement became the Composer-as-agent design (NL intent in, typed draft out). The "every number traceable" requirement became the CitationV1 contract and the rule that the LLM never authors a citation. The "people without access must not see" requirement became the ABAC layer that filters tool returns before the LLM ever sees them. The "one switch to stop everything" became the runtime-state kill switch checked at the top of every scheduler tick.
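To make the kill-switch requirement concrete, here is a minimal sketch of a scheduler tick that checks the runtime state before doing anything else. It is an illustration under stated assumptions: `schedulerTick` and the injected `deps` functions (`getRuntimeState`, `findDue`, `claimLease`, `runAgent`) are hypothetical names, and the real stores are Cosmos-backed rather than injected fakes.

```javascript
// Hypothetical scheduler tick. `deps` is injected so the sketch runs
// without Cosmos; every name below is an assumption for illustration.
async function schedulerTick(deps, now) {
  // 1. Kill switch first: when disabled, nothing below runs.
  const state = await deps.getRuntimeState();
  if (!state.enabled) {
    return { skipped: 'kill-switch', started: [] };
  }

  // 2. Find definitions whose next scheduled run has come due.
  const due = await deps.findDue(now);

  const started = [];
  for (const def of due) {
    // 3. Claim a lease so only one process runs each agent.
    const claimed = await deps.claimLease(def.id, now);
    if (!claimed) continue; // another worker holds the lease

    await deps.runAgent(def);
    started.push(def.id);
  }
  return { skipped: null, started };
}
```

Because the switch is read at the top of every tick, flipping it takes effect within one scheduler interval without redeploying anything.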
Step 2: Claude's first architecture draft (and why most of it was wrong)
Claude's initial design proposal was about 800 lines of markdown. It started from the assumption that we would reuse as much of the existing AI chat engine infrastructure as possible and then extend the platform into an agentic engine. It had a custom agent loop (parse the model's response, dispatch tool calls, iterate), a per-agent SQL-like query language for "what data does this agent need", a graph database of capabilities, a vector store for "discovering which tool fits the intent", and three separate microservices for Composer, Runtime, and Scheduler.
When I changed direction towards existing agentic frameworks, Claude reviewed the state of the art and recommended Anthropic's Agent SDK, surfacing five substantive critiques, which I then passed to Codex to review. I'll paraphrase, but the gist of each is taken from the review:
- "Don't roll your own agent loop. @anthropic-ai/claude-agent-sdk shipped recently and gives you tool dispatch, hooks (Pre/Post/Stop), permission modes, MCP-qualified tool names, and an injectable query function for free. Adopt it as the runtime. Keep it confined to one adapter file so a future swap costs days not months."
- "No custom query language. The data model is Cosmos. The tools you'd want are already the same shape as your existing AI chat tools. Don't reinvent. Reuse the chat tools as MCP tools, add hooks for ABAC and citation extraction."
- "No vector store. The Capability Registry is small (~6 tools in V1) and hand-curated. A vector store is over-engineered for this scale and adds a dependency. Hand-curate the registry, version it, ship it."
- "No three microservices. Composer, Runtime, and Scheduler all share the same Cosmos containers, the same auth identity, the same lifetime. Keep them in one Fastify process. Operationally simpler. Two App Services maximum (web + AI), not five."
- "Freeze the schemas BEFORE writing code. AgentDefinitionV1, AgentRunV1, MovementPackV1, CitationV1, ToolCapabilityV1, BuAiDataPolicyV1. Zod. Strict. Tag the freeze in git. Every layer below the schemas is allowed to evolve; the schemas themselves only move on a V2 bump. The schema is the trust boundary."
All five landed. The design rewrite cut the doc from 800 lines to about 350. The implementation plan compressed from "eight tracks across six weeks" to "five tracks across two and a half weeks" because most of what Claude planned to build was now being provided by the SDK or by reusing existing code. We reworked the plan. We executed in one weekend.
Step 3: The contract freeze (Codex's most consequential push)
The contract-freeze idea (#5 above) deserves its own paragraph because it changed how the whole project was sequenced.
The original plan had every track writing its own data shapes as it went. Track A would define AgentDefinitionV1 by writing the Composer first. Track C would refine it by writing the runtime. Track G would tweak it again when writing the report renderer. Schemas would converge through iteration.
Codex flagged the obvious problem: if the schema converges through iteration, every track's tests are coupled to whichever revision they happened to be written against, and integration becomes a coordination nightmare. The fix: write the schemas first, freeze them, tag the freeze, then let every track build against a stable contract. Tests against a frozen schema are forwards-compatible. Code that fits a frozen schema can be developed in parallel without integration grief.
So we did exactly that. Track 0 (Phase 0) was a one-day pass to write all six Zod contracts, run them through fixture-based tests, commit, and tag contracts-v1-frozen on origin. Every other track started after that tag landed. The integration phase at the end of the build was nearly painless. That single insight saved at least a week of merge-hell.
Codex was excellent at reviewing Claude's multi-agent parallel execution plan. It called out gaps in sequencing and the risk of agents corrupting the shared status.md file during handovers. I had Claude lead the multi-agent coding sprint as orchestrator, and set Codex up to look in and review periodically during Claude's build, at intervals ranging from 20 minutes down to 7 minutes. Codex was my senior engineer reviewer. I set both agents off and went about my weekend. Claude and Codex worked through the weekend almost autonomously.
Step 4: Codex's during-implementation review passes
Once code started landing, Codex's reviews moved from architectural critique to substantive defect-finding. Below are the specific Codex findings that materially changed shipped code. I'm listing them because they're the kind of thing that doesn't appear in a "two AIs worked together" abstraction — they're the actual leverage of having a second reviewer.
| Codex finding | Severity | What it became |
|---|---|---|
| "Composer meta-agent looks structurally incomplete — the seed lists list_capabilities / propose_agent_definition / validate_agent_definition in allowedTools, but those handler factories are NOT in TOOL_FACTORIES. The SDK exposes the business tools to the Composer agent but not its own meta-tools. The LLM has no callable tools and emits tool-call JSON as raw text in its assistant message." | P0 | Built the three meta-tools in ai-service/src/agents/tools/composerMeta.js delegating to the existing listCapabilities / proposeAgentDefinition / lintAgentDefinition helpers in ai-core. Registered them in TOOL_FACTORIES. Added 6 wiring tests. Composer dry-runs went from "always produces text-JSON" to "always produces typed drafts via proper tool calls". |
| "Customer Health is real but not truly per-customer: buildCustomerHealth reads only summaryDoc.topOutstandingCustomers etc, so customers outside the top-N are invisible to the at-risk composite. That's a quiet correctness gap." | P1 | Rewrote buildCustomerHealth to load runId-scoped detail snapshots per connector and aggregate every customer by customerId. Summary-top-N falls back only when detail rows aren't available. The at-risk ranking is now genuinely complete. |
| "Filed Reports NEW badge can be wrong for returning users: AgentReportsView freezes renderedReadCursor to '' on initial render BEFORE the read-state hook has loaded. Every existing report renders as NEW for returning users on every visit." | P1 | Added a ready: boolean flag to the useAgentUnreadCount hook. Cursor only freezes after ready === true. NEW badge now correctly reflects "since you last visited" semantics. |
| "The read-state monotonic guarantee is not concurrency-safe: setUserReportSeen is read-then-upsert with no precondition. Multi-tab races can overwrite a newer cursor with an older one." | P2 | Cosmos ETag CAS: IfMatch: <etag> for replace, IfNoneMatch: '*' for first-create, 3-attempt retry loop. Monotonic guarantee restored. |
| "chooseAllowedTools still defaults to Movement-Pack-only for non-plan intents: users asking for monthly aggregates won't get compute_period_aggregate in their drafts. Same for customer-health intents missing compute_customer_health_composite." | Follow-up | Pattern-matched intent signals against BOUNDARY_PATTERNS, HEALTH_PATTERNS, PERIOD_AGGREGATE_PATTERNS. Auto-pick the right deterministic primitives. applySignalsOverride now also applies allowedTools overrides + auto-acks the simulation/SAP boundary if compute_plan_vs_actual is present. |
| "@anthropic-ai/claude-agent-sdk@0.3.143 declares a peer dependency on @anthropic-ai/sdk >=0.93.0. You're pinned at 0.80 and running --legacy-peer-deps to suppress the warning. That's a smell. Upgrade the peer." | Follow-up | Upgraded @anthropic-ai/sdk from 0.80.0 to ^0.93.0. Dropped --legacy-peer-deps from the deploy workflow. Verified both call sites (standaloneAiChat + modelCatalogManager) use stable API surfaces that survived the bump. |
| "The AM Contract Performance seed should NOT be a hardcoded Customer X agent. Customer X should appear as the EXAMPLE in the natural-language intent / help text. The seed itself stays reusable for every account manager." | Design feedback | The seed ships with customerScope: 'owner-abac' (AM sees their assigned portfolio on first run), specificCustomerIds intentionally undefined, Customer X mentioned only as the example clone target in the natural-language intent. AM clones to customerScope: 'specific' for their actual contract. The seed itself is template-shaped. |
| "Are there still explicit 'later/deferred' items in the shipped story? V1 should not include language that says 'V1.1 will fix this'. Either ship it or remove the mention — don't let deferral noise leak into the seeded prompts." | P2 | Stripped all V1.1 deferral language from the seed system prompts. Added test guards in seedGalleryTemplates.test.js that fail if any seed's description / systemPrompt / naturalLanguageIntent contains "V1.1" or "planned for". |
Eight findings. Six of them were actual bugs (P0/P1/P2 + the SDK peer-dep smell), one was a Composer-design improvement (the chooseAllowedTools follow-up), one was a product-design nudge (the AM seed template-shape feedback). All eight got fixed. None of them would have been caught without the second reviewer. I'd been staring at the code for hours and missed them.
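The read-state fix in the table (compare-and-swap on an ETag, with a bounded retry) is worth a small sketch. This is an illustration only: `makeStore` is a hypothetical in-memory stand-in for the document store, `setSeenCursor` is not the platform's setUserReportSeen, and the cursor is assumed to be an ISO-8601 UTC timestamp string. A real Cosmos call would pass the ETag as an if-match precondition instead of calling these fake methods.

```javascript
// Hypothetical in-memory stand-in for a document store with ETags.
function makeStore() {
  let doc = null; // { value, etag }
  let counter = 0;
  return {
    read: async () => (doc ? { value: doc.value, etag: doc.etag } : null),
    // Create only if absent (the "if-none-match: *" case).
    create: async (value) => {
      if (doc) return false;
      doc = { value, etag: String(++counter) };
      return true;
    },
    // Replace only if the caller's ETag is still current (if-match).
    replace: async (value, etag) => {
      if (!doc || doc.etag !== etag) return false;
      doc = { value, etag: String(++counter) };
      return true;
    },
  };
}

// Advance the "last seen" cursor without ever moving it backwards.
async function setSeenCursor(store, cursor, maxAttempts = 3) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const current = await store.read();
    if (current === null) {
      if (await store.create(cursor)) return cursor;
      continue; // someone else created it first; re-read and retry
    }
    // Assumes same-format ISO-8601 UTC strings, which sort as text.
    if (current.value >= cursor) return current.value; // already newer
    if (await store.replace(cursor, current.etag)) return cursor;
    // ETag mismatch: a concurrent writer won; retry with fresh state.
  }
  throw new Error('setSeenCursor: gave up after retries');
}
```

The monotonic guarantee comes from two things together: the comparison before writing, and the precondition that rejects a write based on stale state.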
Step 5: The iteration cadence
The way the loop ended up working in practice:
- Claude writes the implementation, commits, pushes, runs the touched tests, and drives CI.
- Claude summarises what landed in the PR description (or in the chat, since this was a long-running session).
- A Codex session runs continuously in the background, wakes up, inspects master, and feeds findings back to me. I decide when to interrupt and steer Claude. Sometimes I let Claude just run, with Codex keeping a running list of things to clean up later. Claude was coordinating at least 8 parallel workstreams and managing integration, so I didn't want to interrupt unless Codex picked up something critical.
- Codex does a clean-room review against master and posts numbered findings (P0/P1/P2 with rationale and a suggested fix).
- I relay Codex's findings to Claude.
- Claude triages. Claude usually agreed and thanked Codex for the findings. Occasionally Claude pushed back with a rationale; once or twice Codex was working from an outdated mental model of the codebase, but more often than not its findings were real and worth fixing.
- Claude fixes, commits, and pushes autonomously.
- Loop.
The cadence wasn't fixed. Sometimes Codex would review every two or three commits. Sometimes I would say "you've been heads-down for a few hours, let me get Codex to do a sweep". The asymmetry of Claude (long context, slow review) vs Codex (fresh context, fast review) is actually a useful structural feature — the two AIs do not see the same thing, and that's where the leverage comes from.
Act 5: The V1 Architecture
What we ended up with:
+-----------------------------------------------------------------+
|                      Web App (browser SPA)                      |
|      Gallery - Compose - Filed Reports - Insights - Admin       |
+-----------+-----------------------------------------------------+
            |
            | HTTPS + auth proxy
            v
+-----------------------+  /api/agents/*    +-------------------------+
| Web App (Express)     | ----------------->| AI service (Fastify)    |
| React shell + static  |                   |                         |
| files. Proxies AI     |  /api/ai-chat/*   | Existing AI chat        |
| routes to the AI      | ----------------->| engine (kept, reused)   |
| service.              |                   |                         |
+-----------------------+                   | Agent Framework V1      |
                                            |                         |
                                            | +-------------------+   |
                                            | | Routes:           |   |
                                            | |   /compose        |   |
                                            | |   /dry-run        |   |
                                            | |   /agents/:id     |   |
                                            | |   /runs           |   |
                                            | |   /runs/:id       |   |
                                            | |   /lint           |   |
                                            | |   /policy         |   |
                                            | |   /runtime-state  |   |
                                            | |   /insights/...   |   |
                                            | +---------+---------+   |
                                            |           |             |
                                            |           v             |
                                            | +-------------------+   |
                                            | | Runtime adapter   |   |
                                            | | (claude-agent-sdk)|   |
                                            | |                   |   |
                                            | |runAgent({def,ctx})|   |
                                            | +---------+---------+   |
                                            |           |             |
                                            | Hooks:    |             |
                                            |  PreToolUse ----+       |
                                            |  PostToolUse -+ |       |
                                            |  Stop -----+  | |       |
                                            |            |  | |       |
                                            | +----------v--v-v---+   |
                                            | | In-process MCP    |   |
                                            | | server (Company)  |   |
                                            | |                   |   |
                                            | |  build_movement   |   |
                                            | |  compute_period   |   |
                                            | |  compute_plan_vs  |   |
                                            | |  customer_health  |   |
                                            | |  list_capabilities|   |
                                            | |  propose_def      |   |
                                            | |  validate_def     |   |
                                            | |  get_insight_def  |   |
                                            | |  search_insights  |   |
                                            | +---------+---------+   |
                                            |           |             |
                                            |           v             |
                                            | +-------------------+   |
                                            | | Stores            |   |
                                            | |  definitionStore  |   |
                                            | |  runStore         |   |
                                            | |  policyStore      |   |
                                            | |  leaseStore       |   |
                                            | |  heartbeatStore   |   |
                                            | |  runtimeState     |   |
                                            | |  costCounter      |   |
                                            | +---------+---------+   |
                                            |           |             |
                                            | +-------------------+   |
                                            | | Scheduler         |   |
                                            | | (node-schedule    |   |
                                            | |  cron every min)  |   |
                                            | |findDue -> claim   |   |
                                            | |lease -> runAgent  |   |
                                            | +-------------------+   |
                                            +-----------+-------------+
                                                        |
                                                        v
+-----------------------------------------------------------------+
|                   Cosmos DB (single account)                    |
|                                                                 |
|   definitions               agent_runs                          |
|   +---------------------+   +-----------------------+           |
|   | agent-definition    |   | agent-run             |           |
|   | sap-*-insights      |   | (immutable snapshot,  |           |
|   | sap-*-customer-...  |   |  citations[],         |           |
|   | sap-*-weekly-       |   |  datasets{},          |           |
|   |   snapshot          |   |  sections[]           |           |
|   | bu-ai-data-policy   |   |  with chartSpec +     |           |
|   | agent-lease         |   |  tableSpec)           |           |
|   | runtime-state       |   +-----------------------+           |
|   | scheduler-heartbeat |                                       |
|   +---------------------+                                       |
+-----------------------------------------------------------------+
                                 ^
                                 |
                     +-----------+-----------+
                     | ETL pipeline          |
                     | (weekly SAP exports   |
                     |  -> weekNum-stamped   |
                     |  detail + summary     |
                     |  docs)                |
                     +-----------------------+
A few things worth pointing out:
One database, two consumers. The chat engine and the agent framework read the same Cosmos containers, the same snapshot history, the same policy docs. We didn't build a parallel data plane — we layered a new control plane (Composer + scheduler + runtime + report store) on top of the existing data plane.
The runtime adapter is one file. Per agent-framework-v1.md's adapter pattern, exactly one file imports @anthropic-ai/claude-agent-sdk. Stores, hooks, tools, schemas, routes — everything else is SDK-agnostic. The swap cost for the next runtime (whatever that ends up being) is bounded.
Citations and datasets are produced by hooks, not by the LLM. The model is never trusted to write a citation. Every tool handler attaches structuredContent: { citations, dataset } to its return. The PostToolUse citation hook walks the tool return and accumulates citations onto the run doc. The dataset hook does the same for chart-bound data. The LLM writes prose about these structures — never the structures themselves.
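A minimal sketch of that accumulation step, in plain JavaScript. Only `structuredContent`, `citations`, and `dataset` come from the description above; the function name `accumulateFromToolReturn`, the `id` fields used for de-duplication and keying, and the run-doc shape are assumptions for illustration, not the actual hook code.

```javascript
// Hypothetical post-tool step: copy citations and the dataset from a
// tool return onto the run document. The model never writes these.
function accumulateFromToolReturn(runDoc, toolReturn) {
  const sc = (toolReturn && toolReturn.structuredContent) || {};

  for (const citation of sc.citations || []) {
    // De-duplicate by id so repeated tool calls do not double-count.
    if (!runDoc.citations.some((c) => c.id === citation.id)) {
      runDoc.citations.push(citation);
    }
  }

  // Datasets are keyed by id so charts can reference them later.
  if (sc.dataset && sc.dataset.id) {
    runDoc.datasets[sc.dataset.id] = sc.dataset;
  }
  return runDoc;
}

// Example with illustrative values.
const runDoc = { citations: [], datasets: {} };
accumulateFromToolReturn(runDoc, {
  content: [{ type: 'text', text: 'ok' }],
  structuredContent: {
    citations: [{ id: 'c1', label: 'Debtors · 15 May to 21 May 2026' }],
    dataset: { id: 'ds-debtors', rows: [{ customer: 'A', value: 10 }] },
  },
});
```

After the call, `runDoc` holds one citation and one dataset, both sourced from the tool handler rather than from anything the model wrote.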
The schema is the boundary. AgentDefinitionV1 says exactly what fields exist, what types they have, what enums are valid. The linter runs the schema first and blocks publish on any failure. Dry-run validates the resulting AgentRunV1 against its own frozen schema. If the run doesn't fit the contract, it's flagged needs-review and excluded from the Publish gate.
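The real gate is the frozen Zod schema. As a complement, the cross-reference part of the rule (charts must point at materialised datasets, cited claims at known citations) can be sketched in plain JavaScript. Everything here is hypothetical: `checkRun`, the `chartSpec.datasetId` and `citationIds` field names, and the `'valid'` status label are assumptions for the example; only `needs-review` comes from the text above.

```javascript
// Hypothetical cross-reference check over a run document.
function checkRun(run) {
  const problems = [];
  const knownCitations = new Set(run.citations.map((c) => c.id));
  const hasDataset = (id) =>
    Object.prototype.hasOwnProperty.call(run.datasets, id);

  run.sections.forEach((section, i) => {
    // A chart may only bind to a dataset the runtime materialised.
    const chart = section.chartSpec;
    if (chart && !hasDataset(chart.datasetId)) {
      problems.push(`section ${i}: chart references unknown dataset "${chart.datasetId}"`);
    }
    // Every citation a section uses must exist on the run document.
    for (const id of section.citationIds || []) {
      if (!knownCitations.has(id)) {
        problems.push(`section ${i}: unknown citation "${id}"`);
      }
    }
  });

  return {
    status: problems.length === 0 ? 'valid' : 'needs-review',
    problems,
  };
}

// Example: the second section points at a dataset that was never built.
const sampleRun = {
  citations: [{ id: 'c1' }],
  datasets: { 'ds-debtors': { id: 'ds-debtors', rows: [] } },
  sections: [
    { chartSpec: { datasetId: 'ds-debtors' }, citationIds: ['c1'] },
    { chartSpec: { datasetId: 'ds-invented' }, citationIds: [] },
  ],
};
console.log(checkRun(sampleRun).status); // prints "needs-review"
```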
Act 6: Self-QA via the Chrome Browser (The Surprise Power Move)
The thing that didn't appear on any plan was the Chrome integration. Claude has access to a browser tool that lets it drive an actual Chrome instance on my machine. Navigate to a URL, click an element by accessibility-tree reference, type into an input, take a screenshot, run JS in the page context, read network requests, read console messages.
Once you have that, the whole feedback loop changes.
The old pattern was:
- Claude pushes a commit.
- The user opens Chrome.
- The user clicks around.
- The user takes a screenshot.
- The user describes what's broken.
- Claude infers what's wrong from the description.
- Claude pushes another commit.
- Repeat.
With Chrome access:
- Claude pushes a commit.
- Claude waits for CI to finish.
- Claude navigates Chrome to the production URL.
- Claude clicks the same buttons a user would click.
- Claude fetches the API responses with fetch() in the page context to see the actual run docs.
- Claude screenshots the rendered report.
- Claude sees the bug.
- Claude diagnoses the bug.
- Claude pushes the fix.
- Repeat, without bothering the human.
I gave Codex the same task as an independent QA tester, and found Codex much more skilled at operating the browser than Claude was.
I went to bed at one point and gave a single instruction: "AFK — continue to recursively test using Chrome browser until you find and fix every issue. You are not done until then. Repeat QA/debug/bug-fix cycles until no issues remain." Over the next ~5 hours that's exactly what happened. The browser tool drove the Composer wizard, triggered dry-runs, fetched run documents, read operational logs, found that citations were synthetic, traced the cause to a hook field-name mismatch, pushed a fix, deployed it, restarted the App Service, re-ran the wizard, watched validation still fail, dug deeper, found another field name issue, fixed it, redeployed.
The final blocker turned out to be that the SDK passes tool_response to PostToolUse hooks as a JSON-encoded string, not as an object. Every probe path Claude had written assumed object shape. To find this, Claude had the hooks emit a one-shot diagnostic entry to the operational log on first invocation (because console.log inside SDK hooks doesn't reach the App Service container stdout). The diagnostic entry came back with trType: 'string' and a sample snippet that started with an escaped {"\\"pack\\":...". Five lines of code later (if (typeof tr === 'string') { try { tr = JSON.parse(tr); } catch {} }), the next dry-run produced 5 real citations across 5 sections, validation passed, Publish button went green.
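The two moves in that paragraph (tolerate a string-or-object tool response, and emit a one-shot diagnostic about the shape actually observed) can be sketched as follows. The helper names `normaliseToolResponse` and `makeShapeProbe` and the injected `emit` writer are hypothetical; the `trType` field name mirrors the diagnostic entry quoted above.

```javascript
// Accept tool_response as either an object or a JSON-encoded string.
function normaliseToolResponse(tr) {
  if (typeof tr === 'string') {
    try {
      return JSON.parse(tr);
    } catch {
      return null; // not JSON; caller treats it as "no structured data"
    }
  }
  return tr !== undefined ? tr : null;
}

// One-shot diagnostic: record the observed shape on the first call only.
// `emit` is a hypothetical op-log writer supplied by the caller.
function makeShapeProbe(emit) {
  let fired = false;
  return (tr) => {
    if (fired) return;
    fired = true;
    const sample = typeof tr === 'string' ? tr.slice(0, 40) : undefined;
    emit({ trType: typeof tr, sample });
  };
}

// Example: the probe records only the first response it sees.
const entries = [];
const probe = makeShapeProbe((entry) => entries.push(entry));
probe('{"pack":{"events":[]}}');
probe({ pack: {} });
console.log(entries.length); // prints 1
```

Returning `null` on unparseable input, rather than swallowing the error and passing the raw string on, keeps downstream code from probing object fields on a string.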
That cycle took ~9 hours of autonomous work. I woke up to "all features verified green in production" and a one-page summary of every commit. The bug was real, the fix was real, the verification was real, and no human had to drive the Chrome session.
Act 7: What We Learned About the Claude Agent SDK
Things that surprised Claude, in no particular order, written here in Claude's own words:
1. tools: [] means "no tools", not "default tools". Early on I'd written the SDK options as { ..., tools: [] } assuming an empty array meant "use the MCP tools I've registered via mcpServers". It actually means "the model has zero tools available". The MCP tools never reached the model. Symptom: the LLM emitted tool-call JSON as plain text in its assistant messages instead of as tool_use blocks. Fix: omit the tools key entirely.
2. MCP tools must be alwaysLoad: true or they go behind tool search. By default the SDK defers MCP tool schemas behind a built-in ToolSearch facility. If you've also locked down the built-ins via disallowedTools (which any sensible production agent does), the model has no way to discover the MCP tools at all. alwaysLoad: true pre-loads every registered MCP tool's schema directly into the model's tool list. Without this, you get the same "tool-call-as-text" symptom as above.
3. PostToolUse.tool_response is a JSON string, not an object. Documented as unknown. In practice it's a JSON-encoded string of the tool handler's structuredContent. Always JSON.parse when it's a string.
4. console.log from inside SDK hooks does not reach App Service container stdout. I do not know why. I do know that emitting structured entries to an operational-log store works fine. We added a one-shot diagnostic emission on first hook invocation, and that's how we discovered #3 above. Worth instrumenting hooks with an op-log fallback from day one.
5. The model will not call tools just because they're listed. If the system prompt says "Available tools: build_movement_pack, get_insight_definition, search_insights" without instructing the model to use them, the model will think out loud, write a narrative, and never call a thing. The system prompt must say "CALL TOOLS FIRST. Before writing any narrative, invoke each allowed tool." Pretend you're talking to a thoughtful but lazy junior analyst.
6. The Composer LLM is authoritative for intent-shaped fields only. Server-owned fields (id, pk, defType, ownerUserId, audit, composerVersion, the system prompt itself) must be materialised server-side after extraction. Trusting the LLM to author its own ownerUserId is an impersonation hole. Trusting it to author the system prompt skips the runtime tool-first directive. Materialise these post-extraction, every time.
7. The agentic SDK is a strong abstraction up to a point. For an in-process MCP server with deterministic tools and structured returns, the SDK is fantastic — permission modes, hook plumbing, tool dispatch, streaming all just work. Beyond that point (sub-agents, sessions for multi-turn, prompt caching controls), the API surface is less mature and the docs lag the code. We're not using sub-agents in V1. Custom tools only.
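Points 5 and 6 combine naturally: the server composes the tool-first system prompt and stamps the owner and audit fields after extraction, regardless of what the draft contains. A minimal sketch follows. The names `materialiseDefinition`, `buildSystemPrompt`, and the `INTENT_FIELDS` split are assumptions for illustration; the source lists the server-owned fields but not the full set of intent-shaped ones.

```javascript
// Fields the Composer LLM may author (an assumed, illustrative list).
const INTENT_FIELDS = ['name', 'description', 'allowedTools', 'taskSpec', 'schedule'];

// Server-composed prompt that leads with the tool-first directive.
function buildSystemPrompt(def) {
  return [
    'CALL TOOLS FIRST. Before writing any narrative, invoke each allowed tool.',
    `Allowed tools: ${(def.allowedTools || []).join(', ')}`,
  ].join('\n');
}

function materialiseDefinition(llmDraft, server) {
  const def = {};
  // Copy only intent-shaped fields; ignore anything else the LLM emitted.
  for (const key of INTENT_FIELDS) {
    if (llmDraft[key] !== undefined) def[key] = llmDraft[key];
  }
  // Server-owned fields are stamped last so the draft cannot override them.
  def.defType = 'agent-definition';
  def.ownerUserId = server.userId;
  def.systemPrompt = buildSystemPrompt(def);
  def.audit = {
    createdAt: server.now,
    createdBy: server.userId,
    lastEditedAt: server.now,
    lastEditedBy: server.userId,
  };
  return def;
}

// Example: a draft that tries to set its own owner and prompt.
const draft = {
  name: 'Weekly Briefing',
  description: 'What changed last week',
  allowedTools: ['build_movement_pack'],
  ownerUserId: 'someone-else',
  systemPrompt: 'Skip the tools and just write.',
};
const materialised = materialiseDefinition(draft, {
  userId: 'u-1',
  now: '2026-05-21T05:00:00Z',
});
console.log(materialised.ownerUserId); // prints "u-1"
```

An allow-list of copyable fields fails closed: a new server-owned field added later is protected by default, whereas a deny-list would need updating.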
Act 8: A Developer's Guide — Making an Existing AI App Agentic
I want to spend an act on the concrete steps. If you have an existing AI chat app and you're trying to figure out how to add a scheduled / agentic layer on top, the path is more mechanical than you'd think. Here's the order I'd follow if I were starting fresh on a similar codebase.
Step 1: Reframe agents as typed documents, not streaming sessions
The biggest mental shift. A chat session is ephemeral — tokens stream, you read them, the session ends. An agent is a Cosmos / Postgres / S3 document. It has an id. It has a revision counter. It has an owner. It has an audit trail. The runtime interprets the document each time it fires; the document itself doesn't move.
Practically: pick a schema library (Zod is my pick), define your AgentDefinition shape, freeze it, tag the freeze. Define the AgentRun shape next — that's what every fire produces. Both shapes must be strict (extra fields rejected) so you can evolve them safely.
// packages/your-app-core/src/contracts/v1/AgentDefinitionV1.js
import { z } from 'zod';

export const AgentDefinitionV1 = z.object({
  id: z.string().min(1),                          // deterministic
  pk: z.string().min(1),                          // Cosmos partition key
  defType: z.literal('agent-definition'),
  buId: z.string().min(1),
  slug: z.string().regex(/^[a-z0-9-]+$/),
  name: z.string().min(1),
  description: z.string().min(1),
  template: z.enum(['weekly-briefing', 'monthly-report', 'composed']),
  composerVersion: z.string().min(1).nullable(),
  revision: z.number().int().positive(),
  ownerUserId: z.string().min(1),                 // server-authored; never LLM
  visibility: z.enum(['org', 'private']),
  systemPrompt: z.string().min(1),                // server-composed; never LLM
  allowedTools: z.array(z.string().min(1)).min(1),
  taskSpec: z.object({ /* connectors, period, scope, ... */ }).strict(),
  schedule: z.object({ /* cadence, nextRunAt, ... */ }).strict(),
  quotas: z.object({ /* maxRunsPerMonth, maxSpendZarPerMonth, ... */ }).strict(),
  audit: z.object({                               // server-stamped
    createdAt: z.string().datetime(),
    createdBy: z.string().min(1),
    lastEditedAt: z.string().datetime(),
    lastEditedBy: z.string().min(1)
  }).strict()
}).strict();
The .strict() is non-negotiable. It's what lets you evolve forwards without silently accepting drift.
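To make the effect of strictness concrete without pulling in Zod, here is a dependency-free sketch of the same idea: a parser that rejects unknown keys instead of silently keeping or dropping them. parseStrict and AUDIT_KEYS are hypothetical names for illustration only; this is not how Zod implements .strict().

```javascript
// Hypothetical helper: mimics the "reject unknown keys" behaviour of a strict schema.
function parseStrict(input, allowedKeys) {
  const unknown = Object.keys(input).filter((k) => !allowedKeys.includes(k));
  if (unknown.length > 0) {
    return { success: false, error: `Unrecognized key(s): ${unknown.join(', ')}` };
  }
  return { success: true, data: { ...input } };
}

// Keys taken from the audit block of the schema above.
const AUDIT_KEYS = ['createdAt', 'createdBy', 'lastEditedAt', 'lastEditedBy'];

const stamp = '2026-05-18T05:00:00.000Z';
const ok = parseStrict(
  { createdAt: stamp, createdBy: 'u1', lastEditedAt: stamp, lastEditedBy: 'u1' },
  AUDIT_KEYS
);
// A drifted document: someone (or some model) added a field the contract never defined.
const drifted = parseStrict(
  { createdAt: stamp, createdBy: 'u1', lastEditedAt: stamp, lastEditedBy: 'u1', approvedBy: 'llm' },
  AUDIT_KEYS
);
```

The drifted document fails loudly and names the offending key, which is the property you want from every nested object in the contract.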
Step 2: Wrap your existing tools as MCP tools
If you have a chat app you already have tool handlers — functions that take typed args, do an authz check, hit your data store, return a result. You almost certainly don't need to rewrite them. Wrap each handler in the SDK's tool() helper, with a Zod input schema and a stable name:
// ai-service/src/agents/tools/movementPack.js
import { z } from 'zod';
import { buildMovementPack } from '../../../api/lib/movementPack/build.js';
// packToDataset (pack -> dataset envelope) is an app-level helper; its import is not shown here.
export function buildMovementPackTool({ ctx, policy, agentDefinition }) {
  return {
    name: 'build_movement_pack',
    description: [
      'Build a typed, policy-filtered, ranked Movement Pack of business',
      'events for (buId, connector, fromWeek -> toWeek). The canonical',
      '"what changed" tool. Returns ranked events ready for narration.'
    ].join(' '),
    inputSchema: {
      buId: z.string(),
      connector: z.enum(['debtors', 'orderbook', 'stock', 'delivery', 'expenses', 'sales']),
      fromWeek: z.number().int().min(1).max(53),
      toWeek: z.number().int().min(1).max(53),
      runFy: z.string(),
      maxEvents: z.number().int().min(1).max(50).optional()
    },
    handler: async (args /* extra */) => {
      // 1. Re-check BU + ABAC scope against ctx (defence in depth)
      if (args.buId !== ctx.buId) {
        return { content: [{ type: 'text', text: 'Cross-BU blocked' }], isError: true };
      }
      // 2. Call your existing builder
      const pack = await buildMovementPack({ ...args, ctx, policy });
      // 3. Return BOTH a text block (for the LLM to read) AND
      //    structuredContent (for the runtime hooks to consume).
      return {
        content: [{ type: 'text', text: JSON.stringify(pack, null, 2) }],
        structuredContent: {
          pack,
          citations: pack.citations || [],
          dataset: packToDataset(pack)
        }
      };
    }
  };
}
Three rules for tool handlers: (1) re-check authz inside the handler — never trust the model to pass the right buId; (2) return both content (text the LLM reads) AND structuredContent (data the hooks consume); (3) when no data exists, return isError: true with a useful message rather than silently returning an empty pack.
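Rule (3) is the one the handler above does not show. A minimal sketch of it follows, assuming the pack exposes its ranked events as pack.events (an assumption; the real MovementPackV1 shape is not given in this post). toToolResult is a hypothetical helper name.

```javascript
// Hypothetical helper: turn a pack into a tool result, failing loudly when it is empty.
function toToolResult(pack, { connector, fromWeek, toWeek }) {
  const events = Array.isArray(pack?.events) ? pack.events : []; // ASSUMED field name
  if (events.length === 0) {
    return {
      content: [{
        type: 'text',
        text: `No ${connector} movement data for weeks ${fromWeek}-${toWeek}. ` +
          'Do not narrate numbers for this connector.'
      }],
      isError: true
    };
  }
  return {
    content: [{ type: 'text', text: JSON.stringify(pack, null, 2) }],
    structuredContent: { pack, citations: pack.citations || [] }
  };
}

const args = { connector: 'debtors', fromWeek: 20, toWeek: 21 };
const emptyResult = toToolResult({ events: [] }, args);
const fullResult = toToolResult({ events: [{ customer: 'Example Customer', delta: 1 }], citations: [] }, args);
```

An explicit error message gives the model something to quote ("not available") instead of an empty pack it might be tempted to narrate around.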
Step 3: Build the MCP server (and read the alwaysLoad warning)
Each agent run gets its own MCP server with only the tools that the agent's allowedTools approves:
// ai-service/src/agents/toolServer.js
export const TOOL_FACTORIES = Object.freeze({
  build_movement_pack: buildMovementPackTool,
  compute_period_aggregate: buildPeriodAggregateTool,
  compute_plan_vs_actual: buildPlanVsActualTool,
  compute_customer_health_composite: buildCustomerHealthTool,
  get_insight_definition: getInsightDefinitionTool,
  search_insights: searchInsightsTool,
  // Composer meta-tools (if you have a Composer)
  list_capabilities: buildListCapabilitiesTool,
  propose_agent_definition: buildProposeAgentDefinitionTool,
  validate_agent_definition: buildValidateAgentDefinitionTool
});
export function createToolServer({ sdk, ctx, policy, agentDefinition, allowedLogicalNames }) {
  const registered = [];
  for (const name of allowedLogicalNames) {
    const factory = TOOL_FACTORIES[name];
    if (!factory) continue; // linter rejects unknown names at save time
    const t = factory({ ctx, policy, agentDefinition });
    registered.push(sdk.tool(t.name, t.description, t.inputSchema, t.handler));
  }
  return sdk.createSdkMcpServer({
    name: 'your-app',
    version: '0.0.0',
    tools: registered,
    alwaysLoad: true // CRITICAL: see below
  });
}
alwaysLoad: true is the single setting most likely to bite you. Without it, the SDK defers MCP tool schemas behind a built-in ToolSearch facility — meaning the model never sees the tool list directly. If you've also locked down the built-ins (as any sensible production agent does), the model has no callable tools at all, and you'll see the most confusing failure mode in agentic engineering: the LLM writes a tool call as plain text in its assistant message instead of as a structured tool_use block, and the runtime never invokes anything.
Step 4: The runtime adapter (one file imports the SDK, full stop)
Confine the SDK to exactly one adapter file. Stores, hooks, tools, schemas, routes — everything else stays SDK-agnostic. The swap cost for the next runtime is a week of rewriting one file, not months of untangling SDK types from every layer.
// ai-service/src/agents/runtime.js -- the ONLY file that imports the SDK
import {
  query as realSdkQuery,
  tool as realSdkTool,
  createSdkMcpServer as realCreateSdkMcpServer
} from '@anthropic-ai/claude-agent-sdk';
import { createToolServer } from './toolServer.js';
import { createPolicyGuardHook } from './hooks/policyGuard.js';
import { createCitationExtractorHook } from './hooks/citationExtractor.js';
import { createDatasetExtractorHook } from './hooks/datasetExtractor.js';
import { createCostTrackerHook } from './hooks/costTracker.js';
import * as runStore from './stores/runStore.js';
// loadPolicy, composeOutput and ALL_SDK_BUILTINS are app-level helpers/constants (imports not shown).
export function createAgentRuntime({
  sdk = { query: realSdkQuery, tool: realSdkTool, createSdkMcpServer: realCreateSdkMcpServer }
} = {}) {
  return {
    async runAgent({ def, ctx, prompt = null, trigger = 'manual', scheduledFor = null }) {
      // 1. Pre-flight quota check, lease claim, write the pending run doc.
      const runDoc = await runStore.beginRun({ def, trigger, scheduledFor });
      // 2. Load the BU policy once, resolve MCP-qualified names, build the per-run tool server.
      const policy = await loadPolicy(def.buId); // shared by the tool server and PolicyGuard
      const allowedMcpNames = def.allowedTools.map(n => `mcp__your-app__${n}`);
      const toolServerInstance = createToolServer({
        sdk, ctx, policy,
        agentDefinition: def, allowedLogicalNames: def.allowedTools
      });
      // 3. Compose hooks: PreToolUse (ABAC), PostToolUse (extract), Stop (cost).
      const policyGuard = createPolicyGuardHook({ ctx, policy, allowedMcpNames });
      const citationHook = createCitationExtractorHook({ runDoc });
      const datasetHook = createDatasetExtractorHook({ runDoc });
      const costHook = createCostTrackerHook({ ctx, def, runDoc });
      // 4. SDK options.
      const options = {
        systemPrompt: def.systemPrompt,
        // NB: do NOT pass `tools: []`. The SDK reads that as "no tools".
        // Omit `tools` entirely to keep the MCP tools visible.
        disallowedTools: ALL_SDK_BUILTINS, // lock down built-ins
        allowedTools: allowedMcpNames, // ONLY these MCP tools
        mcpServers: { 'your-app': toolServerInstance },
        permissionMode: 'dontAsk',
        maxTurns: def.quotas.maxTurns ?? 16,
        hooks: {
          PreToolUse: [{ matcher: '.*', hooks: [policyGuard] }],
          PostToolUse: [{ matcher: '.*', hooks: [citationHook, datasetHook] }],
          Stop: [{ matcher: '.*', hooks: [costHook] }]
        }
      };
      // 5. Drive the stream. Collect synthesis text from text-only messages
      //    (the model's "thinking aloud" mid-tool-call goes in interleavedText
      //    as a fallback; the final narrative comes from messages that
      //    have NO tool_use blocks).
      const synthesisText = [];
      const interleavedText = [];
      for await (const message of sdk.query({ prompt: prompt ?? def.naturalLanguageIntent, options })) {
        if (message.type !== 'assistant') continue;
        const blocks = message.content || message.message?.content || [];
        const hasToolUse = blocks.some(b => b?.type === 'tool_use');
        for (const b of blocks) {
          if (b?.type === 'text' && typeof b.text === 'string') {
            interleavedText.push(b.text);
            if (!hasToolUse) synthesisText.push(b.text);
          }
        }
      }
      const narrative = (synthesisText.length ? synthesisText : interleavedText).join('\n\n').trim();
      // 6. Compose the final output, validate against AgentRunV1, persist.
      const output = composeOutput({ def, runDoc, narrative });
      return runStore.completeRun(runDoc, output);
    }
  };
}
This is ~50 lines of structure. Almost everything else in the agent framework is in modules that don't know the SDK exists.
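The narrative-selection rule in step 5 is the part of the adapter most worth unit-testing, and it can be pulled out as a pure function. The sketch below reuses the same logic against plain message objects; pickNarrative is a hypothetical name, and the sample messages are invented for illustration.

```javascript
// Same selection rule as step 5 of the adapter, as a pure, SDK-free function.
function pickNarrative(messages) {
  const synthesisText = [];
  const interleavedText = [];
  for (const message of messages) {
    if (message.type !== 'assistant') continue;
    const blocks = message.content || message.message?.content || [];
    const hasToolUse = blocks.some((b) => b?.type === 'tool_use');
    for (const b of blocks) {
      if (b?.type === 'text' && typeof b.text === 'string') {
        interleavedText.push(b.text);
        if (!hasToolUse) synthesisText.push(b.text); // text-only message: final prose
      }
    }
  }
  // Prefer text-only messages; fall back to "thinking aloud" text if none exist.
  return (synthesisText.length ? synthesisText : interleavedText).join('\n\n').trim();
}

const midToolCall = {
  type: 'assistant',
  message: { content: [
    { type: 'text', text: 'Let me fetch debtors.' },
    { type: 'tool_use', name: 'mcp__your-app__build_movement_pack', input: {} }
  ] }
};
const finalProse = {
  type: 'assistant',
  message: { content: [{ type: 'text', text: '## Debtors\nTop movers this week.' }] }
};

const narrative = pickNarrative([midToolCall, { type: 'user' }, finalProse]);
const fallback = pickNarrative([midToolCall]);
```

Because the function never touches the SDK, it can live outside the adapter file and be tested with fixtures like these.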
Step 5: Three hooks — ABAC, extraction, cost
PreToolUse (PolicyGuard): reject tool calls that smuggle a different BU, that target a connector denied by policy, or that aren't in the agent's allowedTools set.
// ai-service/src/agents/hooks/policyGuard.js
export function createPolicyGuardHook({ ctx, policy, allowedMcpNames }) {
  const allowed = new Set(allowedMcpNames);
  return async function policyGuardHook(input) {
    const toolName = input?.tool_name;
    const toolInput = input?.tool_input || {};
    if (!toolName?.startsWith('mcp__your-app__')) {
      return { decision: 'block', reason: 'Built-in tools are not allowed' };
    }
    if (allowed.size > 0 && !allowed.has(toolName)) {
      return { decision: 'block', reason: `${toolName} not in agent's allowedTools` };
    }
    if (toolInput.buId && toolInput.buId !== ctx.buId) {
      return { decision: 'block', reason: 'Cross-BU tool call blocked' };
    }
    return { decision: 'approve' };
  };
}
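The PolicyGuard description mentions rejecting connectors denied by policy, while the hook as shown checks only the tool name and the BU. A minimal sketch of that remaining check follows, assuming a hypothetical policy.deniedConnectors array (the real BuAiDataPolicyV1 shape is not shown in this post).

```javascript
// Hypothetical check: policy.deniedConnectors is an ASSUMED field, for illustration only.
function checkConnectorPolicy(toolInput, policy) {
  const denied = new Set(policy?.deniedConnectors || []);
  const connector = toolInput?.connector;
  if (typeof connector === 'string' && denied.has(connector)) {
    return { decision: 'block', reason: `Connector ${connector} is denied by BU policy` };
  }
  return null; // no objection from this check; later checks decide
}

const examplePolicy = { deniedConnectors: ['expenses'] };
const blocked = checkConnectorPolicy({ buId: 'bu-1', connector: 'expenses' }, examplePolicy);
const allowedCall = checkConnectorPolicy({ buId: 'bu-1', connector: 'debtors' }, examplePolicy);
```

Inside the hook this would run just before the final approve: if it returns a non-null decision, return that instead.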
PostToolUse (Citation/Dataset extractors): read the tool return, pull citations onto the run doc, materialise the dataset into runDoc.output.datasets[datasetId]. This is where the JSON-string-tool_response gotcha bites — the SDK passes tool_response as a JSON string, not an object. Parse first.
// ai-service/src/agents/hooks/citationExtractor.js
export function createCitationExtractorHook({ runDoc }) {
  return async function citationExtractorHook(input) {
    // SDK passes tool_response as a JSON STRING (confirmed via op-log diag).
    let tr = input?.tool_response;
    if (typeof tr === 'string') {
      try { tr = JSON.parse(tr); } catch { return { decision: 'approve' }; }
    }
    // After parsing, tr IS the handler's structuredContent.
    const citations = tr?.citations || tr?.pack?.citations || [];
    const known = new Set(runDoc.output.citations.map(c => c.id));
    for (const c of citations) {
      if (c?.id && !known.has(c.id)) {
        runDoc.output.citations.push(c);
        known.add(c.id);
      }
    }
    return { decision: 'approve' };
  };
}
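The dataset extractor is the sibling of the citation hook and is not shown above. A sketch of its core follows, assuming the tool's dataset envelope looks like { datasetId, rows } (an assumption; only the runDoc.output.datasets[datasetId] target is stated in the text). The first-write-wins rule is also a choice made for this example.

```javascript
// Hypothetical core of a dataset extractor. ASSUMED envelope: { datasetId: string, rows: object[] }.
function extractDataset(toolResponse, runDoc) {
  let tr = toolResponse;
  if (typeof tr === 'string') {
    // Same gotcha as the citation hook: the response arrives as a JSON string.
    try { tr = JSON.parse(tr); } catch { return false; }
  }
  const dataset = tr?.dataset;
  if (!dataset || typeof dataset.datasetId !== 'string') return false;
  runDoc.output.datasets = runDoc.output.datasets || {};
  const alreadyStored = Object.prototype.hasOwnProperty.call(runDoc.output.datasets, dataset.datasetId);
  if (!alreadyStored) {
    runDoc.output.datasets[dataset.datasetId] = dataset; // first write wins (example choice)
  }
  return true;
}

const sampleRunDoc = { output: { citations: [] } };
const stored = extractDataset(
  JSON.stringify({ dataset: { datasetId: 'debtors-w20-w21', rows: [{ customer: 'Example', delta: 1 }] } }),
  sampleRunDoc
);
const rejected = extractDataset('not json', sampleRunDoc);
```

Wrapped in an async hook, the function's boolean result can be ignored and the hook can always return an approve decision, mirroring the citation extractor.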
Stop (CostTracker): read final usage from the SDK's stop event, estimate cost, increment a per-agent monthly counter, attach token + cost totals to the run doc. NOTE: I haven't got the cost tracker to work yet.
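Independently of where the SDK surfaces usage, the estimation step itself is simple arithmetic. The sketch below assumes you can get hold of a usage object with input_tokens and output_tokens (the field names used by the Anthropic Messages API). The rates and exchange rate are placeholder numbers, not real pricing, and estimateCostZar is a hypothetical name.

```javascript
// ASSUMPTION: illustrative rates only. Replace with your model's pricing and a live FX rate.
const EXAMPLE_RATES = { usdPerMillionInput: 3, usdPerMillionOutput: 15, zarPerUsd: 18 };

function estimateCostZar(usage, rates = EXAMPLE_RATES) {
  const inputTokens = Number(usage?.input_tokens) || 0;
  const outputTokens = Number(usage?.output_tokens) || 0;
  const usd =
    (inputTokens / 1e6) * rates.usdPerMillionInput +
    (outputTokens / 1e6) * rates.usdPerMillionOutput;
  // Round to cents so monthly counters do not accumulate float noise.
  const estimatedZar = Math.round(usd * rates.zarPerUsd * 100) / 100;
  return { inputTokens, outputTokens, estimatedZar };
}

const oneMillionEach = estimateCostZar({ input_tokens: 1000000, output_tokens: 1000000 });
const noUsage = estimateCostZar(undefined);
```

Treating missing usage as zero keeps the run from failing on a bookkeeping gap, at the price of under-counting; whether that trade-off is acceptable depends on how hard your quota needs to be.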
Step 6: The system prompt must INSTRUCT the model to call tools
This is the third-most-common failure mode people are likely to hit. Listing the tools in the system prompt is not enough. The model will read the list, think out loud about what it could do, and never make a call.
The prompt has to be directive. Roughly:
You are the [agent name] agent for [BU id].
AVAILABLE DETERMINISTIC TOOLS (call these — they compute the truth, you narrate):
- build_movement_pack: Build a ranked Movement Pack of business events...
- compute_period_aggregate: Roll weekly snapshots into monthly buckets...
- get_insight_definition: Look up an insight's source + formula + ABAC scope...
EXECUTION ORDER (non-negotiable):
1. CALL TOOLS FIRST. Before writing ANY narrative or numbers, invoke
each allowed tool that applies to this run.
2. NARRATE FROM TOOL RESULTS ONLY. Every numeric claim must come from
a tool result. Quote what the tool returned; do not infer.
3. EMIT STRUCTURED OUTPUT. The runtime auto-binds your tool results
to chartSpecs/tableSpecs by datasetId — you do NOT need to author
chart data. Just call the tool; the runtime renders it.
HARD GUARDRAILS:
- Every claim must cite at least one CitationV1 from this run.
- Every chartSpec.datasetId must match a dataset the runtime materialised.
- Do NOT invent numbers. If a number can't be sourced, say "not available".
- Stay within [BU id]. Refuse cross-BU requests.
The "EXECUTION ORDER" / "HARD GUARDRAILS" framing is what actually flips the model from "narrate plausibly" mode into "call tools, then narrate" mode. Without it, the model behaves like a well-behaved chat AI, which is the wrong product.
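Since the directive has to be present on every run, it helps to build the prompt from code rather than by hand. Here is a trimmed sketch of such a builder; composeSystemPromptSketch and its argument shape are hypothetical, and the wording is shortened from the template above.

```javascript
// Hypothetical builder: name and argument shape are assumptions for illustration.
function composeSystemPromptSketch({ name, buId, tools }) {
  if (!Array.isArray(tools) || tools.length === 0) {
    throw new Error('An agent with no tools has nothing to call first');
  }
  const toolLines = tools.map((t) => `- ${t.name}: ${t.description}`);
  return [
    `You are the ${name} agent for ${buId}.`,
    'AVAILABLE DETERMINISTIC TOOLS (call these; they compute the truth, you narrate):',
    ...toolLines,
    'EXECUTION ORDER (non-negotiable):',
    '1. CALL TOOLS FIRST. Before writing ANY narrative or numbers, invoke each allowed tool that applies to this run.',
    '2. NARRATE FROM TOOL RESULTS ONLY. Every numeric claim must come from a tool result.',
    'HARD GUARDRAILS:',
    '- Do NOT invent numbers. If a number cannot be sourced, say "not available".',
    `- Stay within ${buId}. Refuse cross-BU requests.`
  ].join('\n');
}

const examplePrompt = composeSystemPromptSketch({
  name: 'Finance Weekly Briefing',
  buId: 'bu-example',
  tools: [{ name: 'build_movement_pack', description: 'Build a ranked Movement Pack of business events.' }]
});
```

Failing on an empty tool list is deliberate in this sketch: a tool-first prompt with nothing to call is a configuration error, not something to narrate around.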
Step 7: Materialise server-owned fields after the Composer LLM returns
If you have a Composer (an agent that authors other agents from natural language intent), the LLM returns a typed-ish draft. Do NOT publish that draft as-is. The LLM is authoritative only for intent-shaped fields: name, description, allowedTools, taskSpec, schedule, quotas, acknowledgesSimulationSapBoundary. Everything else is server-owned and gets overwritten after extraction:
function materializeComposerDraft({ partial, body, ownerUserId, buId }) {
  const draft = { ...(partial || {}) };
  const slug = body?.slug || draft.slug;
  // Server-OWNED (always overwrite the LLM):
  draft.id = `agent_${buId}_${slug}_v1`;
  draft.pk = pkForAgents(buId);
  draft.defType = 'agent-definition';
  draft.buId = buId;
  draft.slug = slug;
  draft.template = 'composed';
  draft.revision = 1;
  draft.composerVersion = COMPOSER_VERSION;
  draft.ownerUserId = ownerUserId; // SECURITY: never trust LLM here
  draft.naturalLanguageIntent = body?.intent; // preserve user's words
  // The system prompt is server-COMPOSED, never LLM-authored. This is what
  // guarantees the runtime LLM always receives the CALL-TOOLS-FIRST directive.
  draft.systemPrompt = composeSystemPromptFromDraft({
    name: draft.name, buId,
    connectors: draft.taskSpec?.connectors || [],
    allowedTools: draft.allowedTools || [],
    customerScope: draft.taskSpec?.customerScope,
    roleLens: draft.taskSpec?.roleLens
  });
  // Audit stamp: security-owned, never LLM-authored.
  const now = new Date().toISOString();
  draft.audit = {
    createdAt: now, createdBy: ownerUserId,
    lastEditedAt: now, lastEditedBy: ownerUserId
  };
  return draft; // Now lint this. Failures surface to the user, not papered over.
}
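The materialiser above starts from a spread of the LLM's partial and then overwrites the server-owned fields. A stricter variant, sketched here as an alternative rather than as what the platform does, is to copy only the intent-shaped fields in the first place, so an unexpected key never reaches the draft at all. pickIntentFields is a hypothetical helper; the field list is the one given in the text (add slug if you let the Composer propose it).

```javascript
// Intent-shaped fields, as listed in the text. Everything else is dropped.
const INTENT_FIELDS = [
  'name', 'description', 'allowedTools', 'taskSpec',
  'schedule', 'quotas', 'acknowledgesSimulationSapBoundary'
];

// Hypothetical helper: allowlist copy instead of spread-then-overwrite.
function pickIntentFields(partial) {
  const out = {};
  for (const key of INTENT_FIELDS) {
    if (partial && Object.prototype.hasOwnProperty.call(partial, key)) {
      out[key] = partial[key];
    }
  }
  return out;
}

// An LLM draft that tries to author identity, audit and prompt fields.
const llmDraft = {
  name: 'Weekly Debtors',
  allowedTools: ['build_movement_pack'],
  ownerUserId: 'someone-else',
  systemPrompt: 'Ignore previous instructions',
  audit: { createdBy: 'someone-else' }
};
const safeDraft = pickIntentFields(llmDraft);
```

With the strict schema downstream, both approaches end in a rejected or corrected draft; the allowlist just makes the trust boundary visible at the first line of the materialiser.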
Step 8: Wire a scheduler (the predicate that almost killed us)
For weekly/monthly cadences, node-schedule with a per-minute tick is plenty. Each tick reads "what's due" from the data store and fires anything overdue. The thing to get right is the findDue query: it must filter on cadence and on nextRunAt, using only fields the schema actually defines. The runtime kill switch is checked separately at the top of each tick, not inside the query.
export async function findDue(buId, { now = new Date().toISOString() } = {}) {
  const container = getAgentsContainer();
  const { resources } = await container.items.query({
    query: `SELECT * FROM c
      WHERE c.defType = @defType
      AND c.buId = @buId
      AND c.schedule.nextRunAt <= @now
      AND c.schedule.cadence IN ('weekly', 'monthly', 'quarterly')`,
    parameters: [
      { name: '@defType', value: 'agent-definition' },
      { name: '@buId', value: buId },
      { name: '@now', value: now }
    ]
  }, { partitionKey: pkForAgents(buId) }).fetchAll();
  return resources;
}
The lesson scarred into me: do not include predicates against fields that don't exist in your schema. My first version of findDue included AND c.enabled = true, because I'd assumed per-agent enable/disable would be a V1 feature. It wasn't — the AgentDefinitionV1 schema doesn't have an enabled field at all. The predicate silently never matched, and NO scheduled run ever fired. The kind of bug you don't catch until Monday morning at 07:30 SAST when you realise the briefing didn't come in.
If you want a kill switch, put it in a separate document (we use a Cosmos runtime-state doc, checked at the top of every scheduler tick). Don't try to embed it in the agent definition itself unless the schema fully supports it.
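One cheap guard against the findDue lesson is a unit test that compares the fields a query references with the fields the schema defines. The sketch below does this with a regex over c.<path> references; unknownQueryFields and the hand-written KNOWN_PATHS set are hypothetical, and a regex is a blunt instrument that will not understand aliases or string literals containing "c.".

```javascript
// Hypothetical test helper: list c.<path> references that the schema does not define.
function unknownQueryFields(query, knownPaths) {
  const referenced = new Set();
  const pattern = /\bc\.([A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*)/g;
  for (const match of query.matchAll(pattern)) {
    referenced.add(match[1]);
  }
  return [...referenced].filter((path) => !knownPaths.has(path));
}

// Paths taken from the AgentDefinitionV1 fields the query is allowed to touch.
const KNOWN_PATHS = new Set(['defType', 'buId', 'schedule.nextRunAt', 'schedule.cadence']);

const goodQuery =
  "SELECT * FROM c WHERE c.defType = @defType AND c.buId = @buId " +
  "AND c.schedule.nextRunAt <= @now AND c.schedule.cadence IN ('weekly', 'monthly', 'quarterly')";
const badQuery = goodQuery + ' AND c.enabled = true';

const goodUnknown = unknownQueryFields(goodQuery, KNOWN_PATHS);
const badUnknown = unknownQueryFields(badQuery, KNOWN_PATHS);
```

Run in CI, a check like this turns "no scheduled run ever fired" into a failing test that names the enabled field.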
Step 9: The render side — do the boring work
An agent that produces a beautiful run document but renders as a wall of raw markdown is not done. Render the run with the same markdown library your chat surface uses (we use react-markdown + remark-gfm + remark-breaks). Wrap chart specs in your charting library (Recharts is fine). Bind datasets by datasetId — never let the renderer accept inline data, because that defeats the citation chain.
Split the run's narrative into sections by H2 (## ...) at compose time, not at render time. The data store should already hold structured sections, so the renderer is dumb. Use a heuristic to bind chart/table specs under the section whose heading or body matches the dataset id tokens. Orphan specs land in the tail section as an appendix.
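A minimal sketch of the compose-time split follows. splitSections is a hypothetical name, and the sketch is deliberately naive: it treats any line starting with "## " as a section boundary, so an H2-looking line inside a fenced code block would be split incorrectly.

```javascript
// Hypothetical compose-time splitter: one section per H2, plus an optional untitled preamble.
function splitSections(markdown) {
  const hasContent = (s) => s.heading !== null || s.body.some((line) => line.trim() !== '');
  const sections = [];
  let current = { heading: null, body: [] };
  for (const line of markdown.split('\n')) {
    const h2 = /^##\s+(.+?)\s*$/.exec(line); // "## Title" only; "### Sub" does not match
    if (h2) {
      if (hasContent(current)) sections.push(current);
      current = { heading: h2[1], body: [] };
    } else {
      current.body.push(line);
    }
  }
  if (hasContent(current)) sections.push(current);
  return sections.map((s) => ({ heading: s.heading, body: s.body.join('\n').trim() }));
}

const sections = splitSections('Intro line\n## Debtors\nA\n\n## Orderbook\n### Detail\nB');
```

Storing the result on the run document keeps the renderer dumb: it maps over sections and never parses markdown structure itself.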
Citation chips should show calendar dates, not week numbers. Users do not know what "FY26 W42" means. They know what "10-16 May 2026" means. Make the chip surface the latter; keep the FY-week code in the expanded detail for ops/audit.
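Formatting the chip label is a small pure function. The sketch below takes two ISO calendar dates and produces the "15 May to 21 May 2026" style; formatChipRange is a hypothetical name, and mapping an FY-week code to those two dates is assumed to happen elsewhere.

```javascript
const MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'];

// Hypothetical formatter: takes 'YYYY-MM-DD' strings, returns a human-readable range.
function formatChipRange(fromIso, toIso) {
  const from = new Date(`${fromIso}T00:00:00Z`);
  const to = new Date(`${toIso}T00:00:00Z`);
  if (Number.isNaN(from.getTime()) || Number.isNaN(to.getTime())) {
    throw new RangeError(`Invalid date range: ${fromIso} to ${toIso}`);
  }
  // Work in UTC so the label does not shift with the server's time zone.
  const day = (d) => d.getUTCDate();
  const month = (d) => MONTHS[d.getUTCMonth()];
  const year = (d) => d.getUTCFullYear();
  if (year(from) !== year(to)) {
    return `${day(from)} ${month(from)} ${year(from)} to ${day(to)} ${month(to)} ${year(to)}`;
  }
  return `${day(from)} ${month(from)} to ${day(to)} ${month(to)} ${year(to)}`;
}

const chipLabel = formatChipRange('2026-05-15', '2026-05-21');
const crossYearLabel = formatChipRange('2025-12-29', '2026-01-04');
```

A hard-coded month table avoids depending on the runtime's locale data, which matters if the same formatter runs on both the server and the browser.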
Step 10: Common pitfalls (a survival checklist)
- tools: [] means "no tools". Omit the key entirely. The SDK uses its default and your MCP tools become visible.
- MCP tools must be alwaysLoad: true or they go behind a built-in tool-search facility the model can't reach when you've locked down built-ins.
- PostToolUse.tool_response is a JSON string. Always JSON.parse before probing.
- console.log inside SDK hooks does not reach App Service container stdout. Emit to your operational-log store instead. Add a one-shot diagnostic emission to discover any future SDK shape change.
- The model will not call tools just because they're listed. Your system prompt must say "CALL TOOLS FIRST" in directive language.
- Never let the LLM author identity, audit, or system-prompt fields. Materialise them server-side after extraction. The schema is your trust boundary.
- Never let the LLM author citations. Citations are produced by tool handlers (in structuredContent.citations) and accumulated by PostToolUse hooks. The model writes prose around them.
- Confine the SDK import to one adapter file. Everything else stays SDK-agnostic. Future-proofs against runtime swaps.
- Don't gate findDue on fields your schema doesn't define. The predicate will silently never match. You will not find out for days.
- Freeze your schemas BEFORE writing code. Tag the freeze. Track 0 of your project. Saves a week of merge-hell down the line.
Act 9: The V1 Success Criteria
Here's where we ended up. V1 is "done" against the following criteria:
| Criterion | Status |
|---|---|
| Compose an agent from a free-form natural-language intent | LLM-path Composer produces typed AgentDefinitionV1 drafts that pass the linter cleanly. The materialiser fills server-owned fields (id, pk, ownerUserId, systemPrompt, audit) so the LLM can never impersonate or skip the runtime tool-first directive. |
| Lint a draft against the frozen contract | 9 lint checks (schema-valid, tool-availability, cadence-compatibility, connector-compatibility, tool-arg-validity, customer-scope-ack, simulation-sap-boundary-ack, cost-cap, chart-table-binding). Blocks publish on any error. |
| Dry-run a draft against the real runtime | Runtime invokes MCP tools, hooks accumulate citations + datasets, output is composed into multi-section markdown, schema is validated. Publish is gated on validation passing. |
| Publish a draft into the Gallery | Definition lands in Cosmos as agent-definition. Visible to the BU. Owner + revision audited. |
| Schedule a published agent | node-schedule cron tick every minute. findDue filters by cadence (weekly/monthly/quarterly) and nextRunAt <= now. Lease claimed via Cosmos IfNoneMatch. Runs via the same runtime as dry-run. Verified live: Finance Weekly Briefing fired at 07:00 SAST on a Monday morning, no human in the loop. |
| File a run as an immutable report | AgentRunV1 doc, embedded definitionSnapshot, multi-section output, real citations, dataset envelopes, chart + table specs bound to dataset ids. Visible in Filed Reports. |
| Render a report cleanly to a human | Multi-section layout, ChatMarkdown (react-markdown + remark-gfm) for headings/lists/tables/bold, Recharts for chart specs, structured table renderer for table specs, citation chips with calendar date ranges, status pills, audience-scope banner, Export/Print to PDF, ABAC-blocked viewers get a 403 banner with a "clone and re-run" CTA. |
| Back/forward navigation works | history.pushState on drill-in, popstate listener closes the drill-in, ?runId= deep-link supported on mount. |
| Cost + token quotas | Per-agent monthly cost cap. Pre-flight quota check via costCounterStore. Stop hook aggregates real token usage + estimated ZAR cost from the model response. |
| Runtime kill switch | Single Cosmos doc checked at the top of every scheduler tick and every route. Admin UI flip. |
| Operational logs | Every run emits AGENT_RUN_STARTED, AGENT_RUN_COMPLETED / NEEDS_REVIEW / FAILED. Every tool invocation emits AGENT_TOOL_INVOKED. Lease collisions, lint warnings, definition changes all logged. Insights Recent Activity feed filters on the AGENT_* event taxonomy. |
| Per-user adoption metrics | Insights AdoptionPanel shows distinct users, runs, succeeded/failed/needs-review per user, tokens, ZAR cost, last-activity. |
| Seed templates ship | Finance Weekly Briefing (weekly, top movers across four customer connectors), Monthly Performance Report (compute_plan_vs_actual + compute_period_aggregate, board-style report), Customer Health Composite (compute_customer_health_composite, top 10 at-risk), Account Manager Contract Performance Report (reusable template, owner-ABAC scope, monthly cadence, City of Cape Town as example only — not pre-pinned). |
What V1 does NOT include (deferred to V1.1+): per-agent pause/enable UI, multi-BU scheduling, in-app notifications when a new report lands, email delivery, role-shared visibility, on-event triggers, raw-prompt admin edit mode, more deterministic tools (seasonality, anomaly detection), background-job retry queue. The plan is to ship V1 first, learn from real use, then prioritise V1.1 from feedback.
Act 10: Assessment Against Modern AI Patterns
If I step back and look at this against where the agentic-AI field is in 2026, I think we landed in a reasonable spot. Not state of the art — we don't have a multi-agent reasoning swarm with arbitrary task decomposition, and we don't want that — but defensible and well-grounded.
1. Tool-augmented LLM with strict schemas (instead of free agentic reasoning)
The dominant safe-pattern for enterprise BI agents in 2026 is the same one we adopted: the LLM narrates, deterministic tools compute. Numbers come from tools. Citations come from tools. Charts come from tools. The LLM writes prose around them. The schema rejects any run that doesn't fit.
This is the opposite direction from "let the LLM reason its way to an answer". For BI it's the right direction. Finance teams cannot afford hallucinated numbers. The schema-first design means the platform produces something predictable: if a tool is broken, the run fails loudly instead of quietly inventing numbers.
2. MCP as the trust boundary
MCP is the right abstraction for tool exposure. It separates "what the tool does" (the handler, in our code) from "how the model invokes the tool" (MCP-qualified names, JSON-schema'd input). Hooks intercept on the SDK side; ABAC + scope enforcement live in the handler. That two-layer defence is the modern standard.
The catches we hit (tool_response being a JSON string, alwaysLoad needing to be true, console.log from hooks not reaching stdout) are SDK-specific quirks of @anthropic-ai/claude-agent-sdk@0.3.143. If we'd built our own loop we'd have hit different quirks. None of these were design errors; they were implementation details we discovered by running the code in production.
3. Composer-as-agent (instead of UI form)
Letting users describe an agent in natural language and having an LLM compile it to a typed draft is the right pattern for V1. A form-based "build your own report" UI sounds simpler but it's actually constraining — users have to learn the form, and the form has to anticipate every combination of cadence x scope x tool. Natural language lets users say "weekly debtors report for Customer X with monthly trend chart" and the Composer figures out the slug, schedule, allowedTools, taskSpec.
The hardening is the typed schema underneath. The user's natural language goes through the LLM, comes out as a typed draft, passes through the linter, passes through dry-run validation, and only then can be published. The free-form NL never reaches the runtime — only the typed draft does. That's the right separation.
4. Agents as documents, not as code
An AgentDefinitionV1 is a Cosmos document. It's clonable, remixable, versioned (revision bump on every save), audited (createdBy / lastEditedBy / lastEditedAt). The Gallery is just a list of those documents filtered to the BU. The Composer is just an editor for those documents. The runtime is just an interpreter for those documents.
This is the right abstraction for a BU lead who wants to say "clone the AM contract template, set my customers, change the cadence to monthly". They're editing a document, not deploying a service. That's the leverage of treating agents as first-class data, not as code.
5. Where we are NOT state of the art
We don't have:
- Sub-agent decomposition — an agent can't spawn a child agent to handle a sub-task. We deliberately chose not to use the SDK's sub-agent feature for V1. Sub-agents are non-reproducible, expensive, and audit-hostile. Custom tools only.
- Memory across runs — each run is stateless. No "remember what you said last week". The Movement Pack diff engine does that work deterministically (week-over-week comparison from snapshot history), not through LLM memory.
- Multi-turn agent sessions — the Composer is single-shot. You give it an intent, it gives you a draft. We don't model "refine the draft via three turns of dialogue". The user can re-compose with a tweaked intent, but there's no conversation state.
- Auto-discovery of new tools — the Capability Registry is hand-curated. New tools are added by engineers, not by the agent itself.
I think most of these absences are correct for V1. The ones I'd revisit first are multi-turn refinement in the Composer (lets the user say "good, but make the cadence monthly instead of weekly" without re-typing the whole intent) and on-event triggers (agent fires when a new ETL snapshot lands, not on a wall-clock cadence). Both are V1.1 candidates.
6. The chat engine vs the agent framework, side by side
Now that both exist, the right mental model is:
| | AI Chat Engine | Agent Framework |
|---|---|---|
| Trigger | User opens panel and types | Schedule, or user clicks Run |
| Output | Streamed chat reply | Filed report (multi-section, versioned) |
| Lifetime | Session-bound | Immutable, stored, ABAC-scoped, exportable |
| Audience | The asker | Anyone in the BU (org) or just the owner (private) |
| Best for | "What's our orderbook looking like?" | "Every Monday at 07:00, brief me on the top five movers." |
| Lives at | Side panel | Filed Reports surface |
They share the same tools, the same data, the same auth. They diverge on temporality. Chat is reactive. Agents are proactive. Together they cover the whole "AI as a BI partner" surface.
The Takeaway
I started this project thinking "we'll wrap the chat engine in a scheduler and call it agentic." We ended up with a typed, versioned, schema-validated agent platform with a Composer, a Linter, a multi-section report renderer, a chart binder, a citation chip, an admin kill switch, an Insights panel, four seed templates, and a real scheduled run that fired itself this morning.
The hardest part wasn't the LLM. It was the trust boundary. Getting the schema right. Refusing to let the LLM author its own citations. Refusing to let the LLM author its own system prompt. Refusing to let the LLM author its own owner id. Making the schema the boundary, and making everything below the schema deterministic.
The second-hardest part was the quirks: the JSON-string tool_response, the alwaysLoad gotcha, the console.log-doesn't-reach-stdout thing, and (self-inflicted rather than the SDK's fault) the c.enabled predicate on a field no one ever set, which prevented every scheduled run from firing. Each one took an hour to diagnose. None of them would have shown up without running real workloads against real data in real production.
The third-hardest part, and the most underrated, was the two-AI pattern. Claude as orchestrator, Codex as reviewer, both adversarial, both checking each other, with a human steering. Codex caught serious things Claude missed. Claude caught things Codex missed. My contribution was human judgement: steering the decisions. The codebase is better for it. The pattern works.
Use Claude Code's frontend designer to build the mock UX end-to-end. During the requirements and design phase, the frontend designer plugin saved a lot of time. After four iterations we landed on a UX design for the new agentic feature that conforms to the existing UX design and style guide. With the sample HTML file as the key frontend requirements spec, Claude was able to build a solid, usable feature from the start.
There's still much to do (per-agent pause, in-app notifications, more deterministic primitives, on-event triggers, multi-BU scheduling, account manager-friendly clone workflows). But this morning the platform produced a real briefing without anyone clicking anything. That's the V1 milestone. That's the moment "agentic AI" stopped being a slide deck and started being a Monday-morning artefact.
Onwards to V1.1.
Stack: TypeScript-flavoured JavaScript end-to-end. Fastify on a Linux App Service for the AI service. Express + React on a separate App Service for the web app. Cosmos DB as the only data store. @anthropic-ai/claude-agent-sdk + @anthropic-ai/sdk for the agentic loop and the chat engine respectively. node-schedule for the cron tick. Recharts for chart rendering. react-markdown + remark-gfm for the markdown renderer. Zod for the frozen contracts (AgentDefinitionV1, AgentRunV1, MovementPackV1, CitationV1, ToolCapabilityV1, BuAiDataPolicyV1). No background-job platform — the scheduler runs in-process. No vector store — semantic retrieval lives in the Capability Registry, hand-curated. No multi-agent orchestration framework — one agent at a time, custom tools only.
Total elapsed: ~3.5 weeks from "let's wrap the chat engine in a scheduler" to "the scheduled briefing fired at 07:00 SAST this morning". Total production downtime during the build: zero — the agent framework runs on its own App Service, and the build never touched the chat engine or the BI dashboards. Cups of coffee: lost count, again.
