Monday, 18 May 2026

How I built and agentic engine (powered by Anthropic's Agentic runtime) to augment my already AI-powered BI app, with Claude & Codex, over one weekend

May 18, 2026. South Africa.

This morning, at exactly 07:00 SAST, the platform produced a financial briefing without anybody pressing a button. It pulled six SAP connectors, picked the top five movers across debtors / orderbook / stock / delivery, narrated them with real customer names ("CITY OF CAPE TOWN orderbook changed by XX, from RXX to RYY, the single largest book inflow this week"), bound a Recharts bar chart to the underlying dataset, and stamped every numeric claim with a citation chip showing the calendar date range "Debtors · 15 May to 21 May 2026". Validation passed. Cost: a few cents. Latency: 91 seconds end-to-end.

The report is publishable as-is. Not the LLM "making something up that looks right" — every number traces back to an ETL run id, every chart traces back to a dataset, every claim traces back to a CitationV1.

That report is the V1 of an "agentic" layer we've been building on top of an existing BI platform. It took three and a half weeks of research, ideation, detailed implementation planning of V0, then an architecture pivot when I changed my thinking to stop reinventing the wheel and jump on frontier model's agentic SDK, then a bite-the-bullet weekend sprint of building, that included an autonomous overnight QA cycle using browser connectors, an awkward midnight debugging session over a single JSON field, and the discovery that one missing column predicate had silently prevented every scheduled run from ever firing....

I want to write this down before it ages out of memory.


Act 1: The Existing AI Chat Engine (Why This Wasn't Enough)

The platform already had an AI chat engine before any of this started.  It's the kind of thing most BI tools ship in 2026: open a side panel, type "what's our overdue debtors looking like", an LLM reads the page context, calls a few internal tools to fetch SAP snapshot summaries, and answers with citations. It works. People use it. It's saved hours of "let me Excel-pivot this for you" emails.

The App was fine in its initial release form. Like any app launch, once users start playing with it, and seeing the powerful AI chat feature, almost immediately the feedback is about requesting automated reports, automated insights, signals of weekly trends, etc. An oversight on my part was the lack of trend data, weekly snapshots were nonexistent. For me to support advanced automated reporting, I would have to close the gap on weekly snapshots. This was considered foundational before I started with any agentic framework. It took me about a week to close this gap, following my usual architecture, spec-driven planning approach with Claude and Codex.

The architecture was honest about what it was. A single Fastify service stands up a chat route that streams from the Anthropic API. A small registry of typed tools sits behind it — build_movement_pack, get_insight_definition, query_snapshot_week, compute_period_aggregate, etc. Every tool reads from the same Cosmos containers the dashboards read from, scoped by BU, filtered by an ABAC layer that keeps customer-level data behind role assignments. The model never sees raw Cosmos — it sees tool returns that have already been authz-checked, citation-stamped, and shape-normalised.

For ad-hoc questions ("show me the top 5 debtors this week", "summarise expense trends since January") this is great. It's chat. The user types, the model answers, the conversation ends.

But chat is reactive. Chat assumes there's a human in the loop who knows what to ask. A real enterprise BI tool needs to surface the things you didn't know to ask about. The customer whose overdue jumped 149% this week. The stock material that flipped from healthy to critical between Friday and Monday. The orderbook line that vanished because the invoice posted. None of that bubbles up if you don't open the chat panel and ask.

The chat engine is solid. But a true enterprise BI insights platform needs agentic capabilities too — agents that run on a schedule, pull what's changed, narrate it, file a report, and surface it to the right people without anyone typing a prompt.

That's the gap Agentic V1 is closing.


Act 2: Origins of the Agentic Idea

The initial sketch was simple. "Take the existing AI chat tools. Add a scheduler. Run them weekly. Save the output as a report."

That sketch survived about two weeks before it started to fall apart.

The first problem was structural: a "report" isn't a chat reply. A chat reply is a stream of tokens that disappears when you close the panel. A report is a document — structured, multi-section, with charts and tables, that needs to be filed, indexed, ABAC-scoped per viewer, exportable to PDF, shareable, auditable. The chat engine's plumbing doesn't model any of that. It just streams text.

The second problem was about who authors agents. The first instinct was "we'll seed a few hand-written ones — weekly briefing, monthly performance, customer health — and that's V1." But the moment the seeded agents existed, the next question was obvious: "Can a sales manager build their own for their portfolio? Can an account manager clone the seed for Customer X and remix it?" That implies a Composer. A Composer is itself an agent. So now we're not just building "three scheduled briefings" — we're building a platform that lets users author agents in natural language, lint them against a typed contract, dry-run them, and publish.

The third problem was the trust boundary. The chat engine's tools are read-only against scoped Cosmos. That's fine for ad-hoc questions where a human reads the answer. For an agent that fires on a schedule and surfaces a report into an inbox-like list, the trust model has to be stricter. Every numeric claim must cite the underlying snapshot. Every chart must reference a dataset id the runtime materialised, not inline data the LLM invented. The schema has to enforce this so a malformed run can't masquerade as a real one.

By the end of the third week we knew the shape: typed, versioned contracts (AgentDefinitionV1, AgentRunV1, MovementPackV1, CitationV1); a Composer that writes drafts; a linter that rejects drafts that don't fit; a runtime that runs them with deterministic tools and structured output; a scheduler that fires them; a Filed Reports surface that renders them.


Act 3: The Pivot — Stop Building the Loop, Use Claude Agent SDK

The original V0 plan had us writing the agent loop ourselves. Read a definition, build a system prompt, send it to the Anthropic API, parse the response, dispatch any tool calls, loop. The chat engine already did 60% of that — we'd just wrap it in a scheduler.

But Anthropic shipped the @anthropic-ai/claude-agent-sdk. It's a meaningful step up from a raw chat client: it has first-class concepts for tools (define name + description + Zod input schema + handler, register them with createSdkMcpServer, the model gets MCP-qualified tool names like mcp__syntell__build_movement_pack), for hooks (PreToolUse, PostToolUse, Stop), for permission modes, for system prompts, for the whole agentic loop — without us writing any of it.

More importantly, MCP gave us a clean trust boundary. The same tools the chat engine exposes become MCP tools the runtime exposes. The model is told "you have these tools, here are their schemas, call them". Hooks let us intercept every tool call before it runs (PolicyGuard checks BU scope, ABAC, connector allow-lists) and every tool return after it runs (extract citations from structuredContent.citations, materialise datasets onto the run doc, write audit log entries).

We pivoted. Not on day one — on week three, after we'd already completed much of the architecture, design and implementation plan. We threw away the custom loop and adopted Claude's Agentic SDK. The pivot cost about two evenings of rework. It paid back the rest of the project.

The decision: the SDK runs the agentic loop. We supply tools, hooks, system prompts, and the agent definition. The runtime adapter sits in a single file (runtime.js) and is the only place in the codebase that imports the SDK. Everything else — stores, hooks, tools, schemas — stays SDK-agnostic so a future swap costs days, not months.

Act 4: From Idea to Spec — The Review Loop That Reshaped Every Pass

The pivot to the SDK happened on a Friday evening. But the SDK didn't arrive in our codebase first — the plan did. And the plan only became a good plan after several review passes with Claude and Codex, that materially reshaped what Claude originally proposed, since I had decided to not reinvent the wheel.

This is the part of the project I want to credit Codex for explicitly. Without those reviews the architecture would have shipped narrower, more fragile, and harder to extend.

Step 1: Capturing the business requirements (not the technical wish-list)

The first version of the requirements doc crafted jointly with Claude was a mistake. It read like a technical wish-list — "use Cosmos, use Fastify, support cron, support webhooks". I later pushed back: "start from what an account manager actually wants, not what the platform should do."

So we rewrote the requirements in terms of user behaviour:

  • "Every Monday morning I want a one-page briefing of what changed in my portfolio last week, without typing a prompt."
  • "I want to clone a teammate's report template, point it at my customers, and publish it — in five minutes, not a sprint."
  • "I want every number in the report to be traceable to its SAP source so when finance pushes back I can show them where it came from."
  • "If I'm an admin and something goes wrong, I want one switch to stop everything."
  • "If I share a report with my team, the people without access to a customer must NOT see that customer's numbers."
  • "Built-in seeds for the most common cases: Finance Weekly Briefing, Monthly Performance Report, Customer Health, Account Manager Contract Performance."

That reframing rippled into every later decision. The "five minutes not a sprint" requirement became the Composer-as-agent design (NL intent in, typed draft out). The "every number traceable" requirement became the CitationV1 contract and the rule that the LLM never authors a citation. The "people without access must not see" requirement became the ABAC layer that filters tool returns before the LLM ever sees them. The "one switch to stop everything" became the runtime-state kill switch checked at the top of every scheduler tick.

Step 2: Claude's first architecture draft (and why most of it was wrong)

Claude's initial design proposal was about 800 lines of markdown - which started with the assumption of leveraging much of the existing AI-chat engine infrastructure first, then extend the platform to build an agentic engine. It had a custom agent loop (parse the model's response, dispatch tool calls, iterate), a per-agent SQL-like query language for "what data does this agent need", a graph database of capabilities, a vector store for "discovering which tool fits the intent", and three separate microservices for Composer, Runtime, and Scheduler.

When I changed direction on leveraging existing agentic frameworks, Claude reviewed the state-of-the-art and recommended Anthropic's Agent SDK - surfacing five substantive critiques. which I then passed to Codex to review. I'll paraphrase but the gist of each is verbatim from the review:

Review pass #1 of the design (paraphrased):
  1. "Don't roll your own agent loop. @anthropic-ai/claude-agent-sdk shipped recently and gives you tool dispatch, hooks (Pre/Post/Stop), permission modes, MCP-qualified tool names, and an injectable query function for free. Adopt it as the runtime. Keep it confined to one adapter file so a future swap costs days not months."
  2. "No custom query language. The data model is Cosmos. The tools you'd want are already the same shape as your existing AI chat tools. Don't reinvent. Reuse the chat tools as MCP tools, add hooks for ABAC and citation extraction."
  3. "No vector store. The Capability Registry is small (~6 tools in V1) and hand-curated. A vector store is over-engineered for this scale and adds a dependency. Hand-curate the registry, version it, ship it."
  4. "No three microservices. Composer, Runtime, and Scheduler all share the same Cosmos containers, the same auth identity, the same lifetime. Keep them in one Fastify process. Operationally simpler. Two App Services maximum (web + AI), not five."
  5. "Freeze the schemas BEFORE writing code. AgentDefinitionV1, AgentRunV1, MovementPackV1, CitationV1, ToolCapabilityV1, BuAiDataPolicyV1. Zod. Strict. Tag the freeze in git. Every layer below the schemas is allowed to evolve; the schemas themselves only move on a V2 bump. The schema is the trust boundary."

All five landed. The design rewrite cut the doc from 800 lines to about 350. The implementation plan compressed from "eight tracks across six weeks" to "five tracks across two and a half weeks" because most of what Claude planned to build was now being provided by the SDK or by reusing existing code. We reworked the plan. We executed in one weekend.

Step 3: The contract freeze (Codex's most consequential push)

The contract-freeze idea (#5 above) deserves its own paragraph because it changed how the whole project was sequenced.

The original plan had every track writing its own data shapes as it went. Track A would define AgentDefinitionV1 by writing the Composer first. Track C would refine it by writing the runtime. Track G would tweak it again when writing the report renderer. Schemas would converge through iteration.

Codex flagged the obvious problem: if the schema converges through iteration, every track's tests are coupled to whichever revision they happened to be written against, and integration becomes a coordination nightmare. The fix: write the schemas first, freeze them, tag the freeze, then let every track build against a stable contract. Tests against a frozen schema are forwards-compatible. Code that fits a frozen schema can be developed in parallel without integration grief.

So we did exactly that. Track 0 (Phase 0) was a one-day pass to write all six Zod contracts, run them through fixture-based tests, commit, and tag contracts-v1-frozen on origin. Every other track started after that tag landed. The integration phase at the end of the build was nearly painless. That single insight saved at least a week of merge-hell.

Codex was excellent in reviewing Claude's multi-agent parallel execution plan and called out the gaps in sequencing, risks of agent handover corruption of shared status.md file updates. I had Claude leading the multi-agent coding sprint as lead orchestrator, and setup Codex to snoop and review, periodically from 20 minute intervals to 7 minute intervals during Claude's build process. Codex was my senior engineer reviewer. I set both agents off, and went about my weekend. Claude and Codex worked all through the weekend almost autonomously.

Step 4: Codex's during-implementation review passes

Once code started landing, Codex's reviews moved from architectural critique to substantive defect-finding. Below are the specific Codex findings that materially changed shipped code. I'm listing them because they're the kind of thing that doesn't appear in a "two AIs worked together" abstraction — they're the actual leverage of having a second reviewer.

Codex findingSeverityWhat it became
"Composer meta-agent looks structurally incomplete — the seed lists list_capabilities / propose_agent_definition / validate_agent_definition in allowedTools, but those handler factories are NOT in TOOL_FACTORIES. The SDK exposes the business tools to the Composer agent but not its own meta-tools. The LLM has no callable tools and emits tool-call JSON as raw text in its assistant message."P0Built the three meta-tools in ai-service/src/agents/tools/composerMeta.js delegating to the existing listCapabilities / proposeAgentDefinition / lintAgentDefinition helpers in ai-core. Registered them in TOOL_FACTORIES. Added 6 wiring tests. Composer dry-runs went from "always produces text-JSON" to "always produces typed drafts via proper tool calls".
"Customer Health is real but not truly per-customer — buildCustomerHealth reads only summaryDoc.topOutstandingCustomers etc, so customers outside the top-N are invisible to the at-risk composite. That's a quiet correctness gap."P1Rewrote buildCustomerHealth to load runId-scoped detail snapshots per connector and aggregate every customer by customerId. Summary-top-N falls back only when detail rows aren't available. The at-risk ranking is now genuinely complete.
"Filed Reports NEW badge can be wrong for returning users — AgentReportsView freezes renderedReadCursor to '' on initial render BEFORE the read-state hook has loaded. Every existing report renders as NEW for returning users on every visit."P1Added a ready: boolean flag to the useAgentUnreadCount hook. Cursor only freezes after ready === true. NEW badge now correctly reflects "since you last visited" semantics.
"The read-state monotonic guarantee is not concurrency-safe — setUserReportSeen is read-then-upsert with no precondition. Multi-tab races can overwrite a newer cursor with an older one."P2Cosmos ETag CAS: IfMatch: <etag> for replace, IfNoneMatch: '*' for first-create, 3-attempt retry loop. Monotonic guarantee restored.
"chooseAllowedTools still defaults to Movement-Pack-only for non-plan intents — users asking for monthly aggregates won't get compute_period_aggregate in their drafts. Same for customer-health intents missing compute_customer_health_composite."Follow-upPattern-matched intent signals against BOUNDARY_PATTERNS, HEALTH_PATTERNS, PERIOD_AGGREGATE_PATTERNS. Auto-pick the right deterministic primitives. applySignalsOverride now also applies allowedTools overrides + auto-acks the simulation/SAP boundary if compute_plan_vs_actual is present.
"@anthropic-ai/claude-agent-sdk@0.3.143 declares a peer dependency on @anthropic-ai/sdk >=0.93.0. You're pinned at 0.80 and running --legacy-peer-deps to suppress the warning. That's a smell. Upgrade the peer."Follow-upUpgraded @anthropic-ai/sdk from 0.80.0 to ^0.93.0. Dropped --legacy-peer-deps from the deploy workflow. Verified both call sites (standaloneAiChat + modelCatalogManager) use stable API surfaces that survived the bump.
"The AM Contract Performance seed should NOT be a hardcoded Customer X agent. Customer X should appear as the EXAMPLE in the natural-language intent / help text. The seed itself stays reusable for every account manager."Design feedbackThe seed ships with customerScope: 'owner-abac' (AM sees their assigned portfolio on first run), specificCustomerIds intentionally undefined, Customer X mentioned only as the example clone target in the natural-language intent. AM clones to customerScope: 'specific' for their actual contract. The seed itself is template-shaped.
"Are there still explicit 'later/deferred' items in the shipped story? V1 should not include language that says 'V1.1 will fix this'. Either ship it or remove the mention — don't let deferral noise leak into the seeded prompts."P2Stripped all V1.1 deferral language from the seed system prompts. Added test guards in seedGalleryTemplates.test.js that fail if any seed's description / systemPrompt / naturalLanguageIntent contains "V1.1" or "planned for".

Eight findings. Six of them were actual bugs (P0/P1/P2 + the SDK peer-dep smell), one was a Composer-design improvement (the chooseAllowedTools follow-up), one was a product-design nudge (the AM seed template-shape feedback). All eight got fixed. None of them would have been caught without the second reviewer. I'd been staring at the code for hours and missed them.

Step 5: The iteration cadence

The way the loop ended up working in practice:

  1. Claude writes the implementation, commit, push, run touched tests, drive the CI.
  2. Claude summarise what landed in the PR description (or in the chat, since this was a long-running session).
  3. Codex session runs continuously in background, wakes up, inspects master, feeds back findings to me. I decide when to interrupt and steer Claude along. Sometimes I let Claude just run, with Codex keeping a mental registry of things to cleanup later. Claude was coordinating at least 8 parallel workstreams, managing integration - I didn't want to interrupt Claude unless Codex picked up something critical.
  4. Codex clean-room reviews against master, posts numbered findings (P0/P1/P2 with rationale + suggested fix).
  5. I relay Codex's findings to Claude.
  6. Claude triages. Claude usually agrees and thanks Codex for superb findings. Occasionally Claude pushed back with a rationale — once or twice Codex was working from an outdated mental model of the codebase, but more often than not its findings are real and worth fixing.
  7. Claude fixes, commit, pushes autonomously.
  8. Loop.

The cadence wasn't fixed. Sometimes Codex would review every two or three commits. Sometimes I  would say "you've been heads-down for a few hours, let me get Codex to do a sweep". The asymmetry of Claude (long context, slow review) vs Codex (fresh context, fast review) is actually a useful structural feature — the two AIs do not see the same thing, and that's where the leverage comes from.

The mental model: Claude is the long-context implementer with deep familiarity but tunnel vision. Codex is the short-context reviewer with familiarity but cleaner eyes - can see the wood for the trees. Codex is beating Claude in seeing the system with an architect lens IMHO. Claude often dives straight in without much foresight, or appreciation for larget system impact. At one stage I questioned my decision to give Claude the task of building, since all my previous long-running implementations were almost always dedicated to Codex. But I was short on Codex credits so opted to use Claude instead.  I, as the human steered both. Each AI's bias is the other AI's strength. Don't replace one with the other — pair them.

Act 5: The V1 Architecture

What we ended up with:

   
   +-----------------------------------------------------------------+
   |                       Web App (browser SPA)                     |
   |   Gallery  -  Compose  -  Filed Reports  -  Insights  -  Admin  |
   +---------------------------+-------------------------------------+
                               |
                               | HTTPS + auth proxy
                               v
+-----------------------+   /api/agents/*    +-------------------------+
|  Web App (Express)    |  ----------------->|  AI service (Fastify)   |
|  React shell + static |                    |                         |
|  files. Proxies AI    |   /api/ai-chat/*   |  Existing AI chat       |
|  routes to the AI     |  ----------------->|  engine (kept, reused)  |
|  service.             |                    |                         |
+-----------------------+                    |  Agent Framework V1     |
                                             |                         |
                                             |  +-------------------+  |
                                             |  | Routes:           |  |
                                             |  |   /compose        |  |
                                             |  |   /dry-run        |  |
                                             |  |   /agents/:id     |  |
                                             |  |   /runs           |  |
                                             |  |   /runs/:id       |  |
                                             |  |   /lint           |  |
                                             |  |   /policy         |  |
                                             |  |   /runtime-state  |  |
                                             |  |   /insights/...   |  |
                                             |  +---------+---------+  |
                                             |            |            |
                                             |            v            |
                                             |  +-------------------+  |
                                             |  | Runtime adapter   |  |
                                             |  | (claude-agent-sdk)|  |
                                             |  |                   |  |
                                             |  |runAgent({def,ctx})|  |
                                             |  +---------+---------+  |
                                             |            |            |
                                             |   Hooks:   |            |
                                             |   PreToolUse  ---+      |
                                             |   PostToolUse -+ |      |
                                             |   Stop -----+  | |      |
                                             |             |  | |      |
                                             |  +----------v--v-v---+  |
                                             |  | In-process MCP    |  |
                                             |  | server (Company)  |  |
                                             |  |                   |  |
                                             |  |  build_movement   |  |
                                             |  |  compute_period   |  |
                                             |  |  compute_plan_vs  |  |
                                             |  |  customer_health  |  |
                                             |  |  list_capabilities|  |
                                             |  |  propose_def      |  |
                                             |  |  validate_def     |  |
                                             |  |  get_insight_def  |  |
                                             |  |  search_insights  |  |
                                             |  +---------+---------+  |
                                             |            |            |
                                             |            v            |
                                             |  +-------------------+  |
                                             |  | Stores            |  |
                                             |  |   definitionStore |  |
                                             |  |   runStore        |  |
                                             |  |   policyStore     |  |
                                             |  |   leaseStore      |  |
                                             |  |   heartbeatStore  |  |
                                             |  |   runtimeState    |  |
                                             |  |   costCounter     |  |
                                             |  +---------+---------+  |
                                             |            |            |
                                             |  +-------------------+  |
                                             |  | Scheduler         |  |
                                             |  | (node-schedule    |  |
                                             |  |  cron every min)  |  |
                                             |  |findDue -> claim   |  |
                                             |  |lease -> runAgent  |  |
                                             |  +-------------------+  |
                                             +-----------+-------------+
                                                         |
                                                         v
   +-----------------------------------------------------------------+
   |                        Cosmos DB (single account)               |
   |                                                                 |
   |   definitions                       agent_runs                  |
   |   +---------------------+           +-----------------------+   |
   |   | agent-definition    |           | agent-run             |   |
   |   | sap-*-insights      |           | (immutable snapshot,  |   |
   |   | sap-*-customer-...  |           |  citations[],         |   |
   |   | sap-*-weekly-       |           |  datasets{},          |   |
   |   |   snapshot          |           |  sections[]           |   |
   |   | bu-ai-data-policy   |           |  with chartSpec +     |   |
   |   | agent-lease         |           |  tableSpec)           |   |
   |   | runtime-state       |           +-----------------------+   |
   |   | scheduler-heartbeat |                                       |
   |   +---------------------+                                       |
   +-----------------------------------------------------------------+
                                                        ^
                                                        |
                                            +-----------+-----------+
                                            | ETL pipeline          |
                                            | (weekly SAP exports   |
                                            |  -> weekNum-stamped   |
                                            |  detail + summary     |
                                            |  docs)                |
                                            +-----------------------+

A few things worth pointing out:

One database, two consumers. The chat engine and the agent framework read the same Cosmos containers, the same snapshot history, the same policy docs. We didn't build a parallel data plane — we layered a new control plane (Composer + scheduler + runtime + report store) on top of the existing data plane.

The runtime adapter is one file. Per agent-framework-v1.md's adapter pattern, exactly one file imports @anthropic-ai/claude-agent-sdk. Stores, hooks, tools, schemas, routes — everything else is SDK-agnostic. The swap cost for the next runtime (whatever that ends up being) is bounded.

Citations and datasets are produced by hooks, not by the LLM. The model is never trusted to write a citation. Every tool handler attaches structuredContent: { citations, dataset } to its return. The PostToolUse citation hook walks the tool return and accumulates citations onto the run doc. The dataset hook does the same for chart-bound data. The LLM writes prose about these structures — never the structures themselves.

The schema is the boundary. AgentDefinitionV1 says exactly what fields exist, what types they have, what enums are valid. The linter runs the schema first and blocks publish on any failure. Dry-run validates the resulting AgentRunV1 against its own frozen schema. If the run doesn't fit the contract, it's flagged needs-review and excluded from the Publish gate.


Act 6: Self-QA via the Chrome Browser (The Surprise Power Move)

The thing that didn't appear on any plan was the Chrome integration. Claude has access to a browser tool that lets it drive an actual Chrome instance on my machine. Navigate to a URL, click an element by accessibility-tree reference, type into an input, take a screenshot, run JS in the page context, read network requests, read console messages.

Once you have that, the whole feedback loop changes.

The old pattern was:

  1. Claude pushes a commit.
  2. The user opens Chrome.
  3. The user clicks around.
  4. The user takes a screenshot.
  5. The user describes what's broken.
  6. Claude infers what's wrong from the description.
  7. Claude pushes another commit.
  8. Repeat.

With Chrome access:

  1. Claude pushes a commit.
  2. Claude waits for CI to finish.
  3. Claude navigates Chrome to the production URL myself.
  4. Claude clicks the same buttons a user would click.
  5. Claude fetches the API responses with fetch() in the page context to see the actual run docs.
  6. Claude screenshots the rendered report.
  7. Claude sees the bug.
  8. Claude diagnoses the bug.
  9. Claude pushes the fix.
  10. Repeat — without bothering the human. I gave Codex the same task, as independent QA tester-  I found Codex much more skilled at operating the browser than Claude was.

I went to bed at one point and gave a single instruction: "AFK — continue to recursively test using Chrome browser until you find and fix every issue. You are not done until then. Repeat QA/debug/bug-fix cycles until no issues remain." Over the next ~5 hours that's exactly what happened. The browser tool drove the Composer wizard, triggered dry-runs, fetched run documents, read operational logs, found that citations were synthetic, traced the cause to a hook field-name mismatch, pushed a fix, deployed it, restarted the App Service, re-ran the wizard, watched validation still fail, dug deeper, found another field name issue, fixed it, redeployed.

The final blocker turned out to be that the SDK passes tool_response to PostToolUse hooks as a JSON-encoded string, not as an object. Every probe path Claude had written assumed object shape. To find this, Claude had the hooks emit a one-shot diagnostic entry to the operational log on first invocation (because console.log inside SDK hooks doesn't reach the App Service container stdout). The diagnostic entry came back with trType: 'string' and a sample snippet that started with an escaped {"\\"pack\\":...". Five lines of code later (if (typeof tr === 'string') { try { tr = JSON.parse(tr); } catch {} }), the next dry-run produced 5 real citations across 5 sections, validation passed, Publish button went green.

That cycle took ~9 hours of autonomous work. I woke up to "all features verified green in production" and a one-page summary of every commit. The bug was real, the fix was real, the verification was real, and no human had to drive the Chrome session.

The lesson: giving an AI agent access to the same UI a human uses, plus the ability to read its own logs and re-deploy, turns "debugging" from a back-and-forth conversation into a closed loop the agent runs autonomously. It's not science fiction. It worked. Last night.

Act 7: What We Learned About the Claude Agent SDK

Things that surprised Claude, in no particular order, written by Claude's own words here:

1. tools: [] means "no tools", not "default tools". Early on I'd written the SDK options as { ..., tools: [] } assuming an empty array meant "use the MCP tools I've registered via mcpServers". It actually means "the model has zero tools available". The MCP tools never reached the model. Symptom: the LLM emitted tool-call JSON as plain text in its assistant messages instead of as tool_use blocks. Fix: omit the tools key entirely.

2. MCP tools must be alwaysLoad: true or they go behind tool search. By default the SDK defers MCP tool schemas behind a built-in ToolSearch facility. If you've also locked down the built-ins via disallowedTools (which any sensible production agent does), the model has no way to discover the MCP tools at all. alwaysLoad: true pre-loads every registered MCP tool's schema directly into the model's tool list. Without this, you get the same "tool-call-as-text" symptom as above.

3. PostToolUse.tool_response is a JSON string, not an object. Documented as unknown. In practice it's a JSON-encoded string of the tool handler's structuredContent. Always JSON.parse when it's a string.

4. console.log from inside SDK hooks does not reach App Service container stdout. I do not know why. I do know that emitting structured entries to an operational-log store works fine. We added a one-shot diagnostic emission on first hook invocation, and that's how we discovered #3 above. Worth instrumenting hooks with an op-log fallback from day one.

5. The model will not call tools just because they're listed. If the system prompt says "Available tools: build_movement_pack, get_insight_definition, search_insights" without instructing the model to use them, the model will think out loud, write a narrative, and never call a thing. The system prompt must say "CALL TOOLS FIRST. Before writing any narrative, invoke each allowed tool." Pretend you're talking to a thoughtful but lazy junior analyst.

6. The Composer LLM is authoritative for intent-shaped fields only. Server-owned fields (id, pk, defType, ownerUserId, audit, composerVersion, the system prompt itself) must be materialised server-side after extraction. Trusting the LLM to author its own ownerUserId is an impersonation hole. Trusting it to author the system prompt skips the runtime tool-first directive. Materialise these post-extraction, every time.

7. The agentic SDK is a strong abstraction up to a point. For an in-process MCP server with deterministic tools and structured returns, the SDK is fantastic — permission modes, hook plumbing, tool dispatch, streaming all just work. Beyond that point (sub-agents, sessions for multi-turn, prompt caching controls), the API surface is less mature and the docs lag the code. We're not using sub-agents in V1. Custom tools only.


Act 8: A Developer's Guide — Making an Existing AI App Agentic

I want to spend an act on the concrete steps. If you have an existing AI chat app and you're trying to figure out how to add a scheduled / agentic layer on top, the path is more mechanical than you'd think. Here's the order I'd follow if I were starting fresh on a similar codebase.

Step 1: Reframe agents as typed documents, not streaming sessions

The biggest mental shift. A chat session is ephemeral — tokens stream, you read them, the session ends. An agent is a Cosmos / Postgres / S3 document. It has an id. It has a revision counter. It has an owner. It has an audit trail. The runtime interprets the document each time it fires; the document itself doesn't move.

Practically: pick a schema library (Zod is my pick), define your AgentDefinition shape, freeze it, tag the freeze. Define the AgentRun shape next — that's what every fire produces. Both shapes must be strict (extra fields rejected) so you can evolve them safely.

// packages/your-app-core/src/contracts/v1/AgentDefinitionV1.js
import { z } from 'zod';

export const AgentDefinitionV1 = z.object({
  id: z.string().min(1),                  // deterministic
  pk: z.string().min(1),                  // Cosmos partition key
  defType: z.literal('agent-definition'),
  buId: z.string().min(1),
  slug: z.string().regex(/^[a-z0-9-]+$/),
  name: z.string().min(1),
  description: z.string().min(1),
  template: z.enum(['weekly-briefing', 'monthly-report', 'composed']),
  composerVersion: z.string().min(1).nullable(),
  revision: z.number().int().positive(),
  ownerUserId: z.string().min(1),         // server-authored; never LLM
  visibility: z.enum(['org', 'private']),
  systemPrompt: z.string().min(1),        // server-composed; never LLM
  allowedTools: z.array(z.string().min(1)).min(1),
  taskSpec: z.object({ /* connectors, period, scope, ... */ }).strict(),
  schedule: z.object({ /* cadence, nextRunAt, ... */ }).strict(),
  quotas: z.object({ /* maxRunsPerMonth, maxSpendZarPerMonth, ... */ }).strict(),
  audit: z.object({                       // server-stamped
    createdAt: z.string().datetime(),
    createdBy: z.string().min(1),
    lastEditedAt: z.string().datetime(),
    lastEditedBy: z.string().min(1)
  }).strict()
}).strict();

The .strict() is non-negotiable. It's what lets you evolve forwards without silently accepting drift.

Step 2: Wrap your existing tools as MCP tools

If you have a chat app you already have tool handlers — functions that take typed args, do an authz check, hit your data store, return a result. You almost certainly don't need to rewrite them. Wrap each handler in the SDK's tool() helper, with a Zod input schema and a stable name:

// ai-service/src/agents/tools/movementPack.js
import { z } from 'zod';
import { buildMovementPack } from '../../../api/lib/movementPack/build.js';

export function buildMovementPackTool({ ctx, policy, agentDefinition }) {
  return {
    name: 'build_movement_pack',
    description: [
      'Build a typed, policy-filtered, ranked Movement Pack of business',
      'events for (buId, connector, fromWeek -> toWeek). The canonical',
      '"what changed" tool. Returns ranked events ready for narration.'
    ].join(' '),
    inputSchema: {
      buId: z.string(),
      connector: z.enum(['debtors', 'orderbook', 'stock', 'delivery', 'expenses', 'sales']),
      fromWeek: z.number().int().min(1).max(53),
      toWeek: z.number().int().min(1).max(53),
      runFy: z.string(),
      maxEvents: z.number().int().min(1).max(50).optional()
    },
    handler: async (args /* extra */) => {
      // 1. Re-check BU + ABAC scope against ctx (defence in depth)
      if (args.buId !== ctx.buId) {
        return { content: [{ type: 'text', text: 'Cross-BU blocked' }], isError: true };
      }
      // 2. Call your existing builder
      const pack = await buildMovementPack({ ...args, ctx, policy });
      // 3. Return BOTH a text block (for the LLM to read) AND
      //    structuredContent (for the runtime hooks to consume).
      return {
        content: [{ type: 'text', text: JSON.stringify(pack, null, 2) }],
        structuredContent: {
          pack,
          citations: pack.citations || [],
          dataset: packToDataset(pack)
        }
      };
    }
  };
}

Three rules for tool handlers: (1) re-check authz inside the handler — never trust the model to pass the right buId; (2) return both content (text the LLM reads) AND structuredContent (data the hooks consume); (3) when no data exists, return isError: true with a useful message rather than silently returning an empty pack.

Step 3: Build the MCP server (and read the alwaysLoad warning)

Each agent run gets its own MCP server with only the tools that the agent's allowedTools approves:

// ai-service/src/agents/toolServer.js
export const TOOL_FACTORIES = Object.freeze({
  build_movement_pack: buildMovementPackTool,
  compute_period_aggregate: buildPeriodAggregateTool,
  compute_plan_vs_actual: buildPlanVsActualTool,
  compute_customer_health_composite: buildCustomerHealthTool,
  get_insight_definition: getInsightDefinitionTool,
  search_insights: searchInsightsTool,
  // Composer meta-tools (if you have a Composer)
  list_capabilities: buildListCapabilitiesTool,
  propose_agent_definition: buildProposeAgentDefinitionTool,
  validate_agent_definition: buildValidateAgentDefinitionTool
});

export function createToolServer({ sdk, ctx, policy, agentDefinition, allowedLogicalNames }) {
  const registered = [];
  for (const name of allowedLogicalNames) {
    const factory = TOOL_FACTORIES[name];
    if (!factory) continue;                // linter rejects unknown names at save time
    const t = factory({ ctx, policy, agentDefinition });
    registered.push(sdk.tool(t.name, t.description, t.inputSchema, t.handler));
  }
  return sdk.createSdkMcpServer({
    name: 'your-app',
    version: '0.0.0',
    tools: registered,
    alwaysLoad: true   // CRITICAL: see below
  });
}

alwaysLoad: true is the single setting most likely to bite you. Without it, the SDK defers MCP tool schemas behind a built-in ToolSearch facility — meaning the model never sees the tool list directly. If you've also locked down the built-ins (as any sensible production agent does), the model has no callable tools at all, and you'll see the most confusing failure mode in agentic engineering: the LLM writes a tool call as plain text in its assistant message instead of as a structured tool_use block, and the runtime never invokes anything.

Step 4: The runtime adapter (one file imports the SDK, full stop)

Confine the SDK to exactly one adapter file. Stores, hooks, tools, schemas, routes — everything else stays SDK-agnostic. The swap cost for the next runtime is a week of rewriting one file, not months of untangling SDK types from every layer.

// ai-service/src/agents/runtime.js   -- the ONLY file that imports the SDK
import {
  query as realSdkQuery,
  tool as realSdkTool,
  createSdkMcpServer as realCreateSdkMcpServer
} from '@anthropic-ai/claude-agent-sdk';
import { createToolServer } from './toolServer.js';
import { createPolicyGuardHook } from './hooks/policyGuard.js';
import { createCitationExtractorHook } from './hooks/citationExtractor.js';
import { createDatasetExtractorHook } from './hooks/datasetExtractor.js';
import { createCostTrackerHook } from './hooks/costTracker.js';
import * as runStore from './stores/runStore.js';

export function createAgentRuntime({
  sdk = { query: realSdkQuery, tool: realSdkTool, createSdkMcpServer: realCreateSdkMcpServer }
} = {}) {
  return {
    async runAgent({ def, ctx, prompt = null, trigger = 'manual', scheduledFor = null }) {
      // 1. Pre-flight quota check, lease claim, write the pending run doc.
      const runDoc = await runStore.beginRun({ def, trigger, scheduledFor });

      // 2. Resolve MCP-qualified names + build per-run tool server.
      const allowedMcpNames = def.allowedTools.map(n => `mcp__your-app__${n}`);
      const toolServerInstance = createToolServer({
        sdk, ctx, policy: await loadPolicy(def.buId),
        agentDefinition: def, allowedLogicalNames: def.allowedTools
      });

      // 3. Compose hooks: PreToolUse (ABAC), PostToolUse (extract), Stop (cost).
      const policyGuard = createPolicyGuardHook({ ctx, policy, allowedMcpNames });
      const citationHook = createCitationExtractorHook({ runDoc });
      const datasetHook = createDatasetExtractorHook({ runDoc });
      const costHook = createCostTrackerHook({ ctx, def, runDoc });

      // 4. SDK options.
      const options = {
        systemPrompt: def.systemPrompt,
        // NB: do NOT pass `tools: []`. The SDK reads that as "no tools".
        // Omit `tools` entirely to keep the MCP tools visible.
        disallowedTools: ALL_SDK_BUILTINS,     // lock down built-ins
        allowedTools: allowedMcpNames,         // ONLY these MCP tools
        mcpServers: { 'your-app': toolServerInstance },
        permissionMode: 'dontAsk',
        maxTurns: def.quotas.maxTurns ?? 16,
        hooks: {
          PreToolUse:  [{ matcher: '.*', hooks: [policyGuard] }],
          PostToolUse: [{ matcher: '.*', hooks: [citationHook, datasetHook] }],
          Stop:        [{ matcher: '.*', hooks: [costHook] }]
        }
      };

      // 5. Drive the stream. Collect synthesis text from text-only messages
      //    (the model's "thinking aloud" mid-tool-call goes in interleavedText
      //    as a fallback; the final narrative comes from messages that
      //    have NO tool_use blocks).
      const synthesisText = [];
      const interleavedText = [];
      for await (const message of sdk.query({ prompt: prompt ?? def.naturalLanguageIntent, options })) {
        if (message.type !== 'assistant') continue;
        const blocks = message.content || message.message?.content || [];
        const hasToolUse = blocks.some(b => b?.type === 'tool_use');
        for (const b of blocks) {
          if (b?.type === 'text' && typeof b.text === 'string') {
            interleavedText.push(b.text);
            if (!hasToolUse) synthesisText.push(b.text);
          }
        }
      }
      const narrative = (synthesisText.length ? synthesisText : interleavedText).join('\n\n').trim();

      // 6. Compose the final output, validate against AgentRunV1, persist.
      const output = composeOutput({ def, runDoc, narrative });
      return runStore.completeRun(runDoc, output);
    }
  };
}

This is ~50 lines of structure. Almost everything else in the agent framework is in modules that don't know the SDK exists.

Step 5: Three hooks — ABAC, extraction, cost

PreToolUse (PolicyGuard): reject tool calls that smuggle a different BU, that target a connector denied by policy, or that aren't in the agent's allowedTools set.

// ai-service/src/agents/hooks/policyGuard.js
export function createPolicyGuardHook({ ctx, policy, allowedMcpNames }) {
  const allowed = new Set(allowedMcpNames);
  return async function policyGuardHook(input) {
    const toolName = input?.tool_name;
    const toolInput = input?.tool_input || {};
    if (!toolName?.startsWith('mcp__your-app__')) {
      return { decision: 'block', reason: 'Built-in tools are not allowed' };
    }
    if (allowed.size > 0 && !allowed.has(toolName)) {
      return { decision: 'block', reason: `${toolName} not in agent's allowedTools` };
    }
    if (toolInput.buId && toolInput.buId !== ctx.buId) {
      return { decision: 'block', reason: 'Cross-BU tool call blocked' };
    }
    return { decision: 'approve' };
  };
}

PostToolUse (Citation/Dataset extractors): read the tool return, pull citations onto the run doc, materialise the dataset into runDoc.output.datasets[datasetId]. This is where the JSON-string-tool_response gotcha bites — the SDK passes tool_response as a JSON string, not an object. Parse first.

// ai-service/src/agents/hooks/citationExtractor.js
export function createCitationExtractorHook({ runDoc }) {
  return async function citationExtractorHook(input) {
    // SDK passes tool_response as a JSON STRING (confirmed via op-log diag).
    let tr = input?.tool_response;
    if (typeof tr === 'string') {
      try { tr = JSON.parse(tr); } catch { return { decision: 'approve' }; }
    }
    // After parsing, tr IS the handler's structuredContent.
    const citations = tr?.citations || tr?.pack?.citations || [];
    const known = new Set(runDoc.output.citations.map(c => c.id));
    for (const c of citations) {
      if (c?.id && !known.has(c.id)) {
        runDoc.output.citations.push(c);
        known.add(c.id);
      }
    }
    return { decision: 'approve' };
  };
}

Stop (CostTracker): read final usage from the SDK's stop event, estimate cost, increment a per-agent monthly counter, attach token + cost totals to the run doc. NOTE: I haven't got the cost tracker to work yet.

Step 6: The system prompt must INSTRUCT the model to call tools

This is the third-most-common failure mode people are likely to hit. Listing the tools in the system prompt is not enough. The model will read the list, think out loud about what it could do, and never make a call.

The prompt has to be directive. Roughly:

You are the [agent name] agent for [BU id].

AVAILABLE DETERMINISTIC TOOLS (call these — they compute the truth, you narrate):
  - build_movement_pack: Build a ranked Movement Pack of business events...
  - compute_period_aggregate: Roll weekly snapshots into monthly buckets...
  - get_insight_definition: Look up an insight's source + formula + ABAC scope...

EXECUTION ORDER (non-negotiable):
  1. CALL TOOLS FIRST. Before writing ANY narrative or numbers, invoke
     each allowed tool that applies to this run.
  2. NARRATE FROM TOOL RESULTS ONLY. Every numeric claim must come from
     a tool result. Quote what the tool returned; do not infer.
  3. EMIT STRUCTURED OUTPUT. The runtime auto-binds your tool results
     to chartSpecs/tableSpecs by datasetId — you do NOT need to author
     chart data. Just call the tool; the runtime renders it.

HARD GUARDRAILS:
  - Every claim must cite at least one CitationV1 from this run.
  - Every chartSpec.datasetId must match a dataset the runtime materialised.
  - Do NOT invent numbers. If a number can't be sourced, say "not available".
  - Stay within [BU id]. Refuse cross-BU requests.

The "EXECUTION ORDER" / "HARD GUARDRAILS" framing is what actually flips the model from "narrate plausibly" mode into "call tools, then narrate" mode. Without it, the model is well-behaved chat AI, which is the wrong product.

Step 7: Materialise server-owned fields after the Composer LLM returns

If you have a Composer (an agent that authors other agents from natural language intent), the LLM returns a typed-ish draft. Do NOT publish that draft as-is. The LLM is authoritative only for intent-shaped fields: name, description, allowedTools, taskSpec, schedule, quotas, acknowledgesSimulationSapBoundary. Everything else is server-owned and gets overwritten after extraction:

function materializeComposerDraft({ partial, body, ownerUserId, buId }) {
  const draft = { ...(partial || {}) };
  const slug = body?.slug || draft.slug;

  // Server-OWNED (always overwrite the LLM):
  draft.id = `agent_${buId}_${slug}_v1`;
  draft.pk = pkForAgents(buId);
  draft.defType = 'agent-definition';
  draft.buId = buId;
  draft.slug = slug;
  draft.template = 'composed';
  draft.revision = 1;
  draft.composerVersion = COMPOSER_VERSION;
  draft.ownerUserId = ownerUserId;          // SECURITY: never trust LLM here
  draft.naturalLanguageIntent = body?.intent;  // preserve user's words

  // The system prompt is server-COMPOSED, never LLM-authored. This is what
  // guarantees the runtime LLM always receives the CALL-TOOLS-FIRST directive.
  draft.systemPrompt = composeSystemPromptFromDraft({
    name: draft.name, buId,
    connectors: draft.taskSpec?.connectors || [],
    allowedTools: draft.allowedTools || [],
    customerScope: draft.taskSpec?.customerScope,
    roleLens: draft.taskSpec?.roleLens
  });

  // Audit stamp — security-owned, never LLM-authored.
  const now = new Date().toISOString();
  draft.audit = {
    createdAt: now, createdBy: ownerUserId,
    lastEditedAt: now, lastEditedBy: ownerUserId
  };

  return draft;  // Now lint this. Failures surface to the user, not papered over.
}

Step 8: Wire a scheduler (the predicate that almost killed us)

For weekly/monthly cadences, node-schedule with a per-minute tick is plenty. Each tick reads "what's due" from the data store and fires anything overdue. The thing to get right is the findDue query — it must filter on cadence, on nextRunAt, and on the runtime kill switch.

export async function findDue(buId, { now = new Date().toISOString() } = {}) {
  const container = getAgentsContainer();
  const { resources } = await container.items.query({
    query: `SELECT * FROM c
            WHERE c.defType = @defType
              AND c.buId = @buId
              AND c.schedule.nextRunAt <= @now
              AND c.schedule.cadence IN ('weekly', 'monthly', 'quarterly')`,
    parameters: [
      { name: '@defType', value: 'agent-definition' },
      { name: '@buId', value: buId },
      { name: '@now', value: now }
    ]
  }, { partitionKey: pkForAgents(buId) }).fetchAll();
  return resources;
}

The lesson scarred into me: do not include predicates against fields that don't exist in your schema. My first version of findDue included AND c.enabled = true, because I'd assumed per-agent enable/disable would be a V1 feature. It wasn't — the AgentDefinitionV1 schema doesn't have an enabled field at all. The predicate silently never matched, and NO scheduled run ever fired. The kind of bug you don't catch until Monday morning at 07:30 SAST when you realise the briefing didn't come in.

If you want a kill switch, put it in a separate document (we use a Cosmos runtime-state doc, checked at the top of every scheduler tick). Don't try to embed it in the agent definition itself unless the schema fully supports it.

Step 9: The render side — do the boring work

An agent that produces a beautiful run document but renders as a wall of raw markdown is not done. Render the run with the same markdown library your chat surface uses (we use react-markdown + remark-gfm + remark-breaks). Wrap chart specs in your charting library (Recharts is fine). Bind datasets by datasetId — never let the renderer accept inline data, because that defeats the citation chain.

Split the run's narrative into sections by H2 (## ...) at compose time, not at render time. The data store should already hold structured sections, so the renderer is dumb. Use a heuristic to bind chart/table specs under the section whose heading or body matches the dataset id tokens. Orphan specs land in the tail section as an appendix.

Citation chips should show calendar dates, not week numbers. Users do not know what "FY26 W42" means. They know what "10-16 May 2026" means. Make the chip surface the latter; keep the FY-week code in the expanded detail for ops/audit.

Step 10: Common pitfalls (a survival checklist)

  • tools: [] means "no tools". Omit the key entirely. The SDK uses its default and your MCP tools become visible.
  • MCP tools must be alwaysLoad: true or they go behind a built-in tool-search facility the model can't reach when you've locked down built-ins.
  • PostToolUse.tool_response is a JSON string. Always JSON.parse before probing.
  • console.log inside SDK hooks does not reach App Service container stdout. Emit to your operational-log store instead. Add a one-shot diagnostic emission to discover any future SDK shape change.
  • The model will not call tools just because they're listed. Your system prompt must say "CALL TOOLS FIRST" in directive language.
  • Never let the LLM author identity, audit, or system-prompt fields. Materialise them server-side after extraction. The schema is your trust boundary.
  • Never let the LLM author citations. Citations are produced by tool handlers (in structuredContent.citations) and accumulated by PostToolUse hooks. The model writes prose around them.
  • Confine the SDK import to one adapter file. Everything else stays SDK-agnostic. Future-proofs against runtime swaps.
  • Don't gate findDue on fields your schema doesn't define. The predicate will silently never match. You will not find out for days.
  • Freeze your schemas BEFORE writing code. Tag the freeze. Track 0 of your project. Saves a week of merge-hell down the line.
The minimum-viable agentic conversion of an existing AI chat app: write the schemas (1 day) → wrap your existing tools as MCP tools (1-2 days) → write the runtime adapter (1 day) → write the three hooks (1 day) → write the system-prompt composer (half a day) → wire a scheduler (half a day) → render the run document (2-3 days for a polished UI). That's a developer-week or two of work for an MVP, assuming you already have working tools and an authz layer. Most of the time you'll spend is debugging the SDK quirks above — budget two more days for that.

Act 9: The V1 Success Criteria

Here's where we ended up. V1 is "done" against the following criteria:

CriterionStatus
Compose an agent from a free-form natural-language intentLLM-path Composer produces typed AgentDefinitionV1 drafts that pass the linter cleanly. The materialiser fills server-owned fields (id, pk, ownerUserId, systemPrompt, audit) so the LLM can never impersonate or skip the runtime tool-first directive.
Lint a draft against the frozen contract9 lint checks (schema-valid, tool-availability, cadence-compatibility, connector-compatibility, tool-arg-validity, customer-scope-ack, simulation-sap-boundary-ack, cost-cap, chart-table-binding). Blocks publish on any error.
Dry-run a draft against the real runtimeRuntime invokes MCP tools, hooks accumulate citations + datasets, output is composed into multi-section markdown, schema is validated. Publish is gated on validation passing.
Publish a draft into the GalleryDefinition lands in Cosmos as agent-definition. Visible to the BU. Owner + revision audited.
Schedule a published agentnode-schedule cron tick every minute. findDue filters by cadence (weekly/monthly/quarterly) and nextRunAt <= now. Lease claimed via Cosmos IfNoneMatch. Runs via the same runtime as dry-run. Verified live: Finance Weekly Briefing fired at 07:00 SAST on a Monday morning, no human in the loop.
File a run as an immutable reportAgentRunV1 doc, embedded definitionSnapshot, multi-section output, real citations, dataset envelopes, chart + table specs bound to dataset ids. Visible in Filed Reports.
Render a report cleanly to a humanMulti-section layout, ChatMarkdown (react-markdown + remark-gfm) for headings/lists/tables/bold, Recharts for chart specs, structured table renderer for table specs, citation chips with calendar date ranges, status pills, audience-scope banner, Export/Print to PDF, ABAC-blocked viewers get a 403 banner with a "clone and re-run" CTA.
Back/forward navigation workshistory.pushState on drill-in, popstate listener closes the drill-in, ?runId= deep-link supported on mount.
Cost + token quotasPer-agent monthly cost cap. Pre-flight quota check via costCounterStore. Stop hook aggregates real token usage + estimated ZAR cost from the model response.
Runtime kill switchSingle Cosmos doc checked at the top of every scheduler tick and every route. Admin UI flip.
Operational logsEvery run emits AGENT_RUN_STARTED, AGENT_RUN_COMPLETED / NEEDS_REVIEW / FAILED. Every tool invocation emits AGENT_TOOL_INVOKED. Lease collisions, lint warnings, definition changes all logged. Insights Recent Activity feed filters on the AGENT_* event taxonomy.
Per-user adoption metricsInsights AdoptionPanel shows distinct users, runs, succeeded/failed/needs-review per user, tokens, ZAR cost, last-activity.
Seed templates shipFinance Weekly Briefing (weekly, top movers across four customer connectors), Monthly Performance Report (compute_plan_vs_actual + compute_period_aggregate, board-style report), Customer Health Composite (compute_customer_health_composite, top 10 at-risk), Account Manager Contract Performance Report (reusable template, owner-ABAC scope, monthly cadence, City of Cape Town as example only — not pre-pinned).

What V1 does NOT include (deferred to V1.1+): per-agent pause/enable UI, multi-BU scheduling, in-app notifications when a new report lands, email delivery, role-shared visibility, on-event triggers, raw-prompt admin edit mode, more deterministic tools (seasonality, anomaly detection), background-job retry queue. The plan is to ship V1 first, learn from real use, then prioritise V1.1 from feedback.


Act 10: Assessment Against Modern AI Patterns

If I step back and look at this against where the agentic-AI field is in 2026, I think we landed in a reasonable spot. Not state of the art — we don't have a multi-agent reasoning swarm with arbitrary task decomposition, and we don't want that — but defensible and well-grounded.

1. Tool-augmented LLM with strict schemas (instead of free agentic reasoning)

The dominant safe-pattern for enterprise BI agents in 2026 is the same one we adopted: the LLM narrates, deterministic tools compute. Numbers come from tools. Citations come from tools. Charts come from tools. The LLM writes prose around them. The schema rejects any run that doesn't fit.

This is the opposite direction from "let the LLM reason its way to an answer". For BI it's the right direction. Finance teams cannot afford hallucinated numbers. The schema-first design means the platform produces something predictable — if a tool is broken, the run fails loudly, doesn't quietly invent.

2. MCP as the trust boundary

MCP is the right abstraction for tool exposure. It separates "what the tool does" (the handler, in our code) from "how the model invokes the tool" (MCP-qualified names, JSON-schema'd input). Hooks intercept on the SDK side; ABAC + scope enforcement live in the handler. That two-layer defence is the modern standard.

The catch we hit (tool_response being a JSON string, alwaysLoad needing to be true, console.log from hooks not reaching stdout) are SDK-specific quirks of @anthropic-ai/claude-agent-sdk@0.3.143. If we'd built our own loop we'd have hit different quirks. None of these were design errors — they were implementation details we discovered by running the code in production.

3. Composer-as-agent (instead of UI form)

Letting users describe an agent in natural language and having an LLM compile it to a typed draft is the right pattern for V1. A form-based "build your own report" UI sounds simpler but it's actually constraining — users have to learn the form, and the form has to anticipate every combination of cadence x scope x tool. Natural language lets users say "weekly debtors report for Customer X with monthly trend chart" and the Composer figures out the slug, schedule, allowedTools, taskSpec.

The hardening is the typed schema underneath. The user's natural language goes through the LLM, comes out as a typed draft, passes through the linter, passes through dry-run validation, and only then can be published. The free-form NL never reaches the runtime — only the typed draft does. That's the right separation.

4. Agents as documents, not as code

An AgentDefinitionV1 is a Cosmos document. It's clonable, remixable, versioned (revision bump on every save), audited (createdBy / lastEditedBy / lastEditedAt). The Gallery is just a list of those documents filtered to the BU. The Composer is just an editor for those documents. The runtime is just an interpreter for those documents.

This is the right abstraction for a BU lead who wants to say "clone the AM contract template, set my customers, change the cadence to monthly". They're editing a document, not deploying a service. That's the leverage of treating agents as first-class data, not as code.

5. Where we are NOT state of the art

We don't have:

  • Sub-agent decomposition — an agent can't spawn a child agent to handle a sub-task. We deliberately chose not to use the SDK's sub-agent feature for V1. Sub-agents are non-reproducible, expensive, and audit-hostile. Custom tools only.
  • Memory across runs — each run is stateless. No "remember what you said last week". The Movement Pack diff engine does that work deterministically (week-over-week comparison from snapshot history), not through LLM memory.
  • Multi-turn agent sessions — the Composer is single-shot. You give it an intent, it gives you a draft. We don't model "refine the draft via three turns of dialogue". The user can re-compose with a tweaked intent, but there's no conversation state.
  • Auto-discovery of new tools — the Capability Registry is hand-curated. New tools are added by engineers, not by the agent itself.

I think most of these absences are correct for V1. The ones I'd revisit first are multi-turn refinement in the Composer (lets the user say "good, but make the cadence monthly instead of weekly" without re-typing the whole intent) and on-event triggers (agent fires when a new ETL snapshot lands, not on a wall-clock cadence). Both are V1.1 candidates.

6. The chat engine vs the agent framework, side by side

Now that both exist, the right mental model is:

AI Chat EngineAgent Framework
TriggerUser opens panel and typesSchedule, or user clicks Run
OutputStreamed chat replyFiled report (multi-section, versioned)
LifetimeSession-boundImmutable, stored, ABAC-scoped, exportable
AudienceThe askerAnyone in the BU (org) or just the owner (private)
Best for"What's our orderbook looking like?""Every Monday at 07:00, brief me on the top five movers."
Lives atSide panelFiled Reports surface

They share the same tools, the same data, the same auth. They diverge on temporality. Chat is reactive. Agents are proactive. Together they cover the whole "AI as a BI partner" surface.


The Takeaway

I started this project thinking "we'll wrap the chat engine in a scheduler and call it agentic." We ended up with a typed, versioned, schema-validated agent platform with a Composer, a Linter, a multi-section report renderer, a chart binder, a citation chip, an admin kill switch, an Insights panel, four seed templates, and a real scheduled run that fired itself this morning.

The hardest part wasn't the LLM. It was the trust boundary. Getting the schema right. Refusing to let the LLM author its own citations. Refusing to let the LLM author its own system prompt. Refusing to let the LLM author its own owner id. Making the schema the boundary, and making everything below the schema deterministic.

The second-hardest part was the SDK quirks — the JSON-string tool_response, the alwaysLoad gotcha, the console.log-doesn't-reach-stdout thing, the c.enabled-predicate-that-no-one-set that prevented every scheduled run from ever firing. Each one took an hour to diagnose. None of them would have shown up without running real workloads against real data in real production.

The third-hardest part — and this is the one most underrated — was the two-AI pattern. Claude as orchestrator, Codex as reviewer, both adversarial, both checking each other, with a human steering. Codex caught things Claude seriously missed. Claude caught things Codex missed. I used human judgement by steering decisions. The codebase is better for it. The pattern works.

Use Claude Code's frontend designer to build the mock UX end-to-end. As part of the requirements and design phase, using Claude's frontend designer plugin saved a lot of time. After 4 iterations we landed on a design UX for the new agentic feature, conforming to the existing UX design and stlye guide. With the sample html file as the key requirements spec for frontend, Claude was able to build a solid usable feature from the start.

There's still much to do (per-agent pause, in-app notifications, more deterministic primitives, on-event triggers, multi-BU scheduling, account manager-friendly clone workflows). But this morning the platform produced a real briefing without anyone clicking anything. That's the V1 milestone. That's the moment "agentic AI" stopped being a slide deck and started being a Monday-morning artefact.

Onwards to V1.1.


Stack: TypeScript-flavoured JavaScript end-to-end. Fastify on a Linux App Service for the AI service. Express + React on a separate App Service for the web app. Cosmos DB as the only data store. @anthropic-ai/claude-agent-sdk + @anthropic-ai/sdk for the agentic loop and the chat engine respectively. node-schedule for the cron tick. Recharts for chart rendering. react-markdown + remark-gfm for the markdown renderer. Zod for the frozen contracts (AgentDefinitionV1, AgentRunV1, MovementPackV1, CitationV1, ToolCapabilityV1, BuAiDataPolicyV1). No background-job platform — the scheduler runs in-process. No vector store — semantic retrieval lives in the Capability Registry, hand-curated. No multi-agent orchestration framework — one agent at a time, custom tools only.

Total elapsed: ~3.5 weeks from "let's wrap the chat engine in a scheduler" to "the scheduled briefing fired at 07:00 SAST this morning". Total production downtime during the build: zero — the agent framework runs on its own App Service, and the build never touched the chat engine or the BI dashboards. Cups of coffee: lost count, again.

Friday, 17 April 2026

How Claude Code helped me migrate ETL workflows from serverless functions in one evening

April 15, 2026. South Africa.

Two days. That's how long my ERP ETL workflow had been broken in production. The Function App deploy pipeline — the one that was supposed to be "serverless and simple" — had drifted so far from reality that a hotfix deployed outside CI/CD was the only thing keeping the lights on.

I was staring at a system held together with surgical tape and a prayer. The app was supposedly stable for months, but very brittle at some places, how can a simple, deterministic ETL process get so unreliable?

The sync Function App was serving a stale runtime that didn't match what was in git. The CI/CD workflow couldn't deploy without breaking things further. The Kudu zipdeploy mechanism — Azure's "just push a zip and we'll figure it out" deployment model — had proven itself fundamentally untrustworthy for my production workloads. And Event Grid, the invisible message bus connecting my extract and sync stages, added complexity I couldn't observe or debug without spelunking through Azure portal logs.

I had a choice: keep patching, or rethink the architecture.

I chose to rethink it. In one session. With an AI coding agent. On a Tuesday night. In < 5 hours.


Act 1: The Architecture That Looked Good on Paper

When I first built the ETL pipeline, the architecture felt elegant:

Upload -> Blob Storage -> Event Grid -> Python Extract (Function App)
                                            |
                                     Blob Storage (artifacts)
                                            |
                                    Event Grid -> Node Sync (Function App)
                                            |
                                       Cosmos DB

Three Azure Function Apps. Two Event Grid subscriptions. Two separate Kudu zipdeploy pipelines. Two different runtimes (Python for extraction, Node.js for sync). Each with its own host.json, its own publish profile secrets, its own health check probes.

It worked. For a while.

Then the drift started. A deploy would succeed in CI but the runtime would serve stale code. Kudu would report success but the Function App would 404. A "quick fix" deployed through the portal would work, but the next CI deploy would overwrite it with the wrong version. The Python Function App and the Node.js Function App had different deploy mechanisms, different failure modes, and different ways of being quietly broken.

The serverless promise — "just write functions, we handle the rest" — had become: "just debug Azure's deployment infrastructure while your financial planning app serves stale data to executives."

Act 2: The Question That Changed Everything

I asked my AI agent a question:

"Since both services will eventually run from the App Service itself, why do we need Event Grid?"

That's when it clicked. My App Service was already running Express.js, serving the React frontend, proxying API calls. It was a Linux host with Node.js. It was deployed via a single zip push. It just worked.

Why wasn't the ETL pipeline running there too?

The original reason was separation of concerns — extractors in Python, sync in Node.js, each scaling independently. But the reality was: my ETL workload processed one file at a time, took 20 seconds end-to-end, and ran maybe 5 times a day. It didn't need independent scaling. It needed reliable deployment.

Act 3: The Plan (And Why We Threw Half of It Away)

The initial plan was careful and staged:

Stage 1: Move the sync service to the App Service, keep extract on the Function App, use Event Grid to connect them.

Stage 2: Move extract to the App Service, eliminate Event Grid.

We started with Stage 1. Moved all 12 sync/verify scripts. Created an Express router. Wired up audit logging with distinct source tags (etl-sync-appservice vs the Function App's etl-sync) so we could see exactly which code path processed each job.

The clever part: we created re-export stubs that proxied imports to the shared API libraries. This meant the 12 sync scripts could be copied byte-for-byte — zero modifications. The stubs resolved their ../lib/ imports to the shared libraries. One-line files doing the heavy lifting:

// etl/lib/containerConfig.js
export * from '../../api/lib/containerConfig.js';

Then we hit the Event Grid wall.

Act 4: The Event Grid Wall

To register an Event Grid webhook subscription, Azure sends a validation handshake to your endpoint. Your endpoint must respond with a validation code. Simple, right?

Except our App Service was behind Azure AD authentication. Anonymous requests got a 302 redirect to the Microsoft login page. Event Grid's validation request isn't a browser — it doesn't follow redirects or authenticate with Azure AD.

We tried excludedPaths in the auth config. We tried switching to Return401 mode. The excluded paths worked — but Return401 mode broke the entire login flow for the SPA. Users couldn't log in. On production. At night.

We reverted in 30 seconds. Crisis contained. But Event Grid was blocked.

We tried the AAD-authenticated delivery approach. Event Grid can send a bearer token if you configure it with a tenant ID and app registration. But creating the required AzureEventGridSecureWebhookSubscriber app role needed Entra admin permissions. It was 9 PM. The IT admin was asleep.

That's when my own question came back: why do we need Event Grid at all?

Act 5: The Browser Console Trick

We had the sync running on the App Service. We had the Event Grid subscription blocked. But we needed to prove it worked.

The user is already authenticated in the browser. The browser has the App Service auth cookie. What if we just... called the endpoint from the browser console?

fetch("/etl/webhook/sync", {
  method: "POST",
  headers: {"Content-Type": "application/json"},
  body: '[{"eventType":"Microsoft.Storage.BlobCreated",
          "subject":"/blobServices/default/containers/etl-artifacts/blobs/...",
          "data":{"api":"PutBlob"}}]'
}).then(r => r.json()).then(d => console.log(d))

Response:

{
  "accepted": 1,
  "processedBy": "etl-sync-appservice",
  "timestamp": "2026-04-15T18:28:22.051Z"
}

The latest/* blobs in Azure Storage updated 19 seconds later. Audit logs showed ETL Sync AppService as the actor. Verification passed. Cache manifest bumped.

It worked. The new code path was live.

Act 6: Scrap the Plan, Go All In

The staged plan said "move extract in Stage 2, later." But we'd just proven the architecture worked. The Event Grid approach was blocked anyway. And I realized: if both extract and sync run on the App Service, the upload button can trigger the entire pipeline directly. No Event Grid. No Function Apps. No intermediate blob-event dance.

Upload -> POST /etl/pipeline/run
           -> Python extract (subprocess)
           -> Node sync (in-process)
           -> Cosmos DB updated
           -> Done. 21 seconds.

The Python challenge was real though. The App Service runs Node.js. The extractors use openpyxl to parse Excel files. Our first attempt — pip install openpyxl && npm start as the startup command — killed the app. The Node.js Linux image doesn't reliably support pip in the startup command.

The fix: vendor openpyxl into the deploy package during CI build. The GitHub Actions runner (Ubuntu) has pip. Install there, ship the result:

python3 -m pip install \
  --target .deploy/webapp/etl/extract/.pylibs \
  --quiet \
  openpyxl>=3.1.0

The Node.js extract handler adds .pylibs to PYTHONPATH when spawning the subprocess. The Python extractors run unchanged — they don't know they're on an App Service instead of a Function App.

Act 7: The Cutover

Here's the full cutover sequence:

# Stop Function Apps (restartable for rollback)
az functionapp stop --name my-etl-sync --resource-group my-rg
az functionapp stop --name my-etl-extract --resource-group my-rg

# Deploy lands via normal CI/CD -- same workflow that deploys frontend
git push origin master

# Test: upload ERP export, pipeline runs automatically
# Audit logs: "ETL Extract AppService" -> "ETL Sync AppService"
# All SUCCESS. 21 seconds end-to-end.

Rollback:

az functionapp start --name my-etl-sync --resource-group my-rg
az functionapp start --name my-etl-extract --resource-group my-rg

Two commands. No code changes needed. The Function App code was never modified.

Act 8: What We Actually Built

The final architecture:

Admin Console
  -> POST /api/etl/upload (file to blob storage)
  -> POST /etl/pipeline/run {buId, connector, sourceBlobName}
  -> Express router
      -> Download source file from blob
      -> Spawn python3 extractor subprocess
      -> Upload artifacts to blob
      -> Run Node.js sync script
      -> Run Node.js verify script
      -> Promote artifacts to latest/
      -> Update Cosmos DB
      -> Bump cache manifest
      -> Audit log with user identity + source tags

No Event Grid subscriptions. No Function App deploys. No Kudu zipdeploy. No publish profile secrets. No runtime drift.

One Express server. One deploy workflow. One zip. One host.

Files added:

FilePurpose
etl/etlRouter.jsExpress router — pipeline, webhook, status, provision, healthz
etl/etlExtractHandler.jsNode.js wrapper spawning Python extractors
etl/sync/*.js12 sync/verify scripts (copied unchanged)
etl/extract/extractors/*.py6 Python extractors (copied unchanged)
etl/lib/*.jsRe-export stubs (1 line each)
api/lib/etlBlobHelpers.jsBlob operations for ETL
api/lib/etlOperationalLog.jsOperational log adapter

Files untouched: All 6 Python extractors. All 12 sync/verify scripts. The Function App code. The frontend upload handler. Database schemas. Blob storage structure.

Act 9: The Lessons

1. Serverless isn't free

The "no servers to manage" promise is real until deployment breaks. Then you're managing Azure's deployment infrastructure, which is harder to debug than your own server because you can't SSH in, can't see the filesystem, and can't reproduce the environment locally.

2. Event Grid adds invisible coupling

Every Event Grid subscription is an invisible dependency. When it works, it's magic. When it breaks, you're reading Azure portal logs trying to understand why a blob write didn't trigger a function invocation. Moving to direct HTTP calls made the pipeline debuggable, observable, and fast.

3. The best abstraction is sometimes no abstraction

Three Azure Function Apps, two Event Grid subscriptions, and a Kudu zipdeploy pipeline — or one Express route handler. The "simpler" architecture has more moving parts. The "complex" monolith is actually simpler to operate.

4. Source tagging is non-negotiable in migrations

Tagging every audit event and operational log with source: 'etl-sync-appservice' vs source: 'etl-sync' meant we could prove, in production, which code path processed each job. When leadership asks "are we on the new system?" you can point to the audit log and say: yes, see the tag.

5. Copy, don't rewrite

The 12 sync scripts and 6 Python extractors were copied byte-for-byte. Zero modifications. The re-export stubs handled the import path difference. This meant: if anything breaks, it's the new wrapper code, not the battle-tested extraction and sync logic.

6. AI agents are force multipliers, not autopilots

The AI agent wrote the ETL router, the extract handler, the blob helpers, the operational log adapter, the deploy workflow changes, and the Azure CLI provisioning commands. It also broke the production login flow by changing the auth config, tried to pip install in a Node.js startup command (which killed the app), and ran up my GitHub Actions credits.

The human judgment calls that mattered: choosing to disable-and-replace (not dual-write), deciding to skip Event Grid entirely, and saying "let's just test from the browser console." The agent executed brilliantly once pointed in the right direction. But pointing it in the right direction was the hard part.


The Numbers

MetricBeforeAfter
Azure resources for ETL2 Function Apps + 3 Event Grid subscriptions0 (runs on existing App Service)
Deploy mechanisms3 (webapp + 2 Kudu zipdeploys)1 (webapp only)
CI/CD jobs5 (often 2 failing)3 (all green)
Pipeline latency45-90s (Event Grid hops + cold starts)21s (in-process)
Runtime drift riskHigh (Function App VFS cache)Zero (deploy is atomic zip)
Rollback timeUnknown (redeploy + pray)60 seconds (restart Function App)
Deploy time for ETL changes7+ min (separate Kudu job, often fails)0 (included in webapp deploy)

The Takeaway

I spent two days debugging a broken deploy pipeline, using up both Claude and Codex tokens! I pivoted. I spent one evening replacing it entirely.

The old architecture was designed for a future that never came — independent scaling, multi-region ETL, connector-level isolation. The new architecture is designed for the reality I have — a single-region financial planning app that processes a handful of ERP exports a day and needs to be reliable.

Sometimes the bravest engineering decision is to make things simpler.


Both Function Apps remain stopped but restartable. The Event Grid subscriptions exist but deliver to stopped endpoints. The full ETL pipeline runs on a single App Service, deployed by the same workflow that deploys the React frontend.

Total session time: ~5 hours. Total production downtime: ~90 seconds (the auth config incident). Cups of coffee: lost count.

Wednesday, 8 April 2026

How I Built a Full BI Platform With AI Because SAP Wanted Seven Figures to Upgrade

From a single HTML file that started out as an experiment to build a budget model simualtion tool, on a Monday night to a 467-commit, 13-view financial intelligence BI platform with AI — the real timeline of building enterprise software when your ERP is stuck in the 90s and your budget for fixing it is zero.

The Backstory 

I work at a mid-market manufacturer for traffic signal controllers. Our ERP is SAP — the old from the 90s kind. ECC 6.0, no HANA, no Fiori, no embedded analytics. The kind of system where "reporting" means exporting an ALV grid to Excel and emailing it to finance, business review meetings wasting a ton of time preparing excels and converting to charts and PowerPoints. Every now and again, the question comes up: are we upgrading to S/4HANA or latest SAP?

The answer, every time, is the same: not yet. And the reason is always money. For a medium-size South African company, SAP is becoming prohibitively expensive to upgrade and sustain ongoing licensing costs. What we have today is good enough and working. But as the new guy, this wasn't working for me...

SAP S/4HANA migrations run $250K minimum for small companies, scaling to millions for mid-market. Consulting rates are up 30–50% as the 2027 ECC end-of-mainstream-support deadline approaches, and Gartner projects that nearly half of the 35,000 ECC customers — roughly 17,000 companies — will still be running the legacy system past that date. We're one of them.

But here's what changed: in late 2025, I'd just come off a bruising 8-month personal hobby project where I learned the hard way what happens when you "vibe code" with AI agents without architecture governance — a 3,000-line monolith in a single JavaScript file, three AI models fighting each other, and a codebase I was ashamed to show anyone. That failure taught me an important lesson as I climbed the learning curve of building with AI assistants: AI agents are only as good as your architectural contracts. Code is a liability. Architecture is the asset. Specification and Planning are crucial activities that must not be ignored. I've shared my experiences on previous blog posts. I was ready to build a real-world enterprise application on the back of the success I had with personametry.com (another hobby project).

So when December 2025 rolled around and I sat down to first solve a real work problem: How do I plan my budget for next financial year? What margins should my account managers work with? How do I simulate different outcomes? How do I track actuals versus forecasts. I designed a detailed financial model, this was the MVP. Then the next big opportunity was integrating SAP ERP reporting problem - how do I stop all the tedious manual prep work that my managers go through when preparing for business ops reviews. Wouldn't it be nice if I ran my meetings with a live dashboard? 

I didn't start by prompting an AI. I started by designing a financial model on paper, scoped the business needs and worked with the AI to map out the product in fair amount of detail. I even used sample PowerPoint presentations to guide the visual display requirements for the dashboards. Then I added the real game changer - integrated AI financial assistant. Along this journey, the product vision was right but the engineering execution evolved over time - because at first, I had zero budget to play with - so I went all in on using free services. Early on in the project, I decided to design for multi-tenancy - figuring that if the app adds value to my business unit, surely the other group businesses (with a similar business model to mine) could also be onboarded :)

This is the story of what happened — told through the git log, because that's the only history that doesn't lie.

I am a General Manager for a full business unit. I am also an ex-principal software engineer but stopped writing code since 2010! So I dabble in AI in my spare time and have drunk the cool aid on building with AI agents. I wanted to test whether I could build a real world enterprise application completely with AI (Antigravity, Claude Code and Codex) from the success I've had with my personal projects. Saving me time. Saving me money. One person, no outsourced developers costs (typically an app like mine would cost no less than R300k and take at least 3-6 months with an agency)...

But first: here's what the platform looks like today, so you know where this story is headed.


The Architecture at a Glance

┌─────────────────────────────────────────────────────────────────────┐
│                         USERS                                       │
│  Finance │ Sales Directors │ Account Managers │ Executives │ Admin  │
└─────────────────────────────┬───────────────────────────────────────┘
                              │ HTTPS (Entra ID SSO)
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│                 AZURE APP SERVICE (South Africa North)              │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  EXPRESS HOST                                               │    │
│  │  Serves React SPA + Proxies API + Proxies AI Service        │    │
│  └──────────┬────────────────────────┬─────────────────────────┘    │
│             │                        │                              │
│             ▼                        ▼                              │
│  ┌─────────────────────┐  ┌──────────────────────────────────┐      │
│  │  API LAYER          │  │  AI SERVICE (Fastify)            │      │
│  │  24 Endpoints       │  │  Dedicated App Service           │      │
│  │  3-Layer Auth       │  │  23 Tools │ 4 Tool Profiles      │      │
│  │  ─────────────────  │  │  SSE Streaming (10 event types)  │      │
│  │  Products           │  │  3-Layer Cache                   │      │
│  │  Customers          │  │  ──────────────────────────────  │      │
│  │  Actuals/Forecasts  │  │  ai-core shared workspace pkg    │      │
│  │  Budget Config      │  │  Financial Model Engine          │      │
│  │  Definitions        │  │  (same math as frontend)         │      │
│  │  SAP Live Insights  │  └──────────────────────────────────┘      │
│  │  Admin / Audit      │                                            │
│  │  User Governance    │                                            │
│  │  Maintenance/Health │                                            │
│  └──────────┬──────────┘                                            │
└─────────────┼───────────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    AZURE COSMOS DB (SA North, Serverless)           │
│                                                                     │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────────────┐    │
│  │ products  │ │ customers │ │ actuals   │ │ config            │    │
│  │ {bu}_{fy} │ │ {bu}_{fy} │ │ {bu}_{fy} │ │ {bu}_{fy}         │    │
│  └───────────┘ └───────────┘ └───────────┘ └───────────────────┘    │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────────────┐    │
│  │definitions│ │ users     │ │ groups    │ │ feedback          │    │
│  │ defs_{bu} │ │ {userId}  │ │ {groupId} │ │ feedback          │    │
│  └───────────┘ └───────────┘ └───────────┘ └───────────────────┘    │
│  ┌───────────┐ ┌───────────┐                                        │
│  │ audit     │ │ rbac      │  ALL containers: partition key /pk     │
│  │ log       │ │ rbac      │  Single source of truth:               │
│  └───────────┘ └───────────┘  containerConfig.js                    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    SAP DATA INTEGRATION                             │
│                                                                     │
│  SAP ECC 6.0 ──► Manual XLS Export ──► Browser Upload               │
│       │                                                             │
│       ▼                                                             │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐       │
│  │ Blob:    │───►│ Event    │───►│ Python   │───►│ Blob:    │       │
│  │ sap-     │    │ Grid     │    │ Extract  │    │ etl-     │       │
│  │ exports  │    │ Trigger  │    │ Function │    │ artifacts│       │
│  └──────────┘    └──────────┘    └──────────┘    └────┬─────┘       │
│                                                       │             │
│                  ┌──────────┐    ┌──────────┐         │             │
│                  │ Cosmos   │◄───│ Node.js  │◄────────┘             │
│                  │ Upsert   │    │ Sync     │  EventGrid            │
│                  │ + Cache  │    │ Function │  Trigger              │
│                  │ Manifest │    │ + Verify │                       │
│                  └──────────┘    └──────────┘                       │
│                                                                     │
│  5 Connectors: Sales │ Debtors │ Order Book │ Delivery │ Stock      │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    CI/CD (GitHub Actions)                           │
│                                                                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐   │
│  │ Main App     │  │ AI Service   │  │ ETL Functions            │   │
│  │ Deploy       │  │ Deploy       │  │ Deploy                   │   │
│  │ (push main)  │  │ (ai-service/)│  │ (etl-sync/)              │   │
│  └──────────────┘  └──────────────┘  └──────────────────────────┘   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐   │
│  │ Auth         │  │ ETL Drift    │  │ Secret                   │   │
│  │ Regression   │  │ Guard        │  │ Scan                     │   │
│  │ (nightly)    │  │ (nightly)    │  │ (push main)              │   │
│  └──────────────┘  └──────────────┘  └──────────────────────────┘   │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ SWA Standby (US, parked) — manual failover drills            │   │
│  └──────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    AUTHORIZATION MODEL                              │
│                                                                     │
│  Entra ID Token ──► RBAC (11 roles, 29 perms) ──► BU Scope          │
│                      ──► ABAC Ownership ──► Field Whitelist         │
│                                                                     │
│  allow = hasPermission AND hasBuScope AND ownsResource              │
│                                                                     │
│  Budget Lock Lifecycle:                                             │
│  PLANNING ──► LOCKED (review) ──► EXECUTION (scoped edits)          │
│                  │                                                  │
│                  ▼                                                  │
│            STRUCTURAL UNLOCK (CEO/admin, temporary)                 │
└─────────────────────────────────────────────────────────────────────┘

The platform is a React 19 single-page application with 13 views and 97 components, served by an Express host on Azure App Service in South Africa North. Behind it: 24 API endpoints with a 3-layer authorization model (identity, RBAC, and attribute-based ownership), 10 Cosmos DB containers using a unified /pk partition key, a standalone AI service running Fastify with 23 domain-specific tools, and a 7-stage ETL pipeline that pulls data from SAP through exported spreadsheets.

The financial model engine — a pure JavaScript function with no React dependencies — is the foundation. It's shared across the frontend (for real-time simulation), the backend (for API-driven calculations), and the AI service (so the copilot can run the same math the dashboard shows). One source of truth for the numbers.

Users authenticate via Microsoft Entra ID. Every API call passes through RBAC (11 roles, 29 permissions), BU scope enforcement, and for write operations, ABAC ownership checks. The budget lock lifecycle governs when data can be edited and by whom. Nightly CI workflows validate that permission rules haven't drifted.

SAP data arrives through manual Excel exports uploaded by users — the pipeline handles parsing, normalization, anomaly detection, database sync, and cache invalidation automatically. The client caches SAP data with an 8-day TTL matched to the weekly export cadence, checking a lightweight version manifest before fetching fresh data. The monthly Cosmos DB bill: single digits.

Seven GitHub Actions workflows cover deployment, authorization regression, ETL drift detection, and secret scanning. An Azure Static Web App in the US is parked as a standby failover.

That's the destination. Here's how it got built.


Act 1: The Single HTML File (December 9, 2025)

Commit 720f0f1. 5:41 PM on a Tuesday. That's when the first line of code landed.

It was a single HTML file. One page. A profitability chart and a dashboard. Nothing fancy — just the financial model I'd sketched on paper, implemented as a pure JavaScript calculation function, with a basic React frontend bolted on top. By 6:51 PM — seventy minutes later — I had monthly actuals, rolling forecasts, and inputs formatted in Rands (millions).

I didn't know it at the time, but that calculation function — the one I wrote before any UI, before any backend, before any database — would survive every refactor, every migration, every architecture overhaul for the next four months. It lives today in an extracted workspace package, shared between the frontend, the API server, and the AI service. It is the single most important file in the entire codebase.

By the end of December 11 — three days in — I had 32 commits: a simulation mode with margin sliders, region groupings, customer concentration risk analysis, product portfolio editor, variance analysis, and a product mix risk component. The commit messages tell the story of a developer moving fast: "safety commit", "safety checkin - still working on simulation chart", "Milestone v1: Completed Chart Logic & Simulation Refactor".

And then, the commit that defined the project's DNA:

e40c116"Major refactor to original monolith demo app (old_app included for reference)"

I'd learned from my previous project. The moment the single file got unwieldy, I broke it apart. Not later. Not after it became painful. Immediately. The old monolith was preserved in a folder called old_app — a reminder of what happens when you don't refactor early.

By December 13, commit 5816420, I had comprehensive tests for the financial model: margin calculations, PBT (Profit before Tax), break-even analysis, customer analysis, and edge cases. The math was locked down before I had a real backend.


Act 2: The Formula Wars (December 22, 2025)

December 22nd has 18 commits. That's not a flex — it's a confession.

I sat down that morning to validate the financial model against a real-world scenario, and the numbers didn't add up. What followed was the most humbling day of the project: a cascade of formula bugs that I squashed one by one, each revealing the next.

CommitWhat Was Wrong
271c199Product mix weights weren't contributing to revenue percentages correctly
69e3e4bGoal-seek base margin excluded customer discounts, freight, and insurance — required revenue was materially understated
55aa219Simulation inflation was calculated but never returned — UI defaulted to 1 and ignored adjustments
fce1acaFreight/insurance used additive formula instead of multiplicative (compound) — causing price drift
55c1c77Components read monthlyRevenue but model exported Revenue — zero budget series in charts
505d5e4Discount impact understated profit erosion by 40-50% by multiplying discount by post-discount margin
5a8f931NaN propagation when customers had no active line items
f052953Infinity values causing display crashes while being mathematically "correct"

Eight formula bugs in one day. Each one producing numbers that looked plausible but were silently wrong. This is the nightmare scenario for any financial planning tool — beautiful charts telling beautiful lies.

The fix wasn't just patching each bug. It was tightening the agent contract (9c9607b) so AI agents couldn't introduce formula regressions. From that day forward, the financial model was a protected file — no AI agent modifies it without explicit approval.

The next day, December 23, brought more: budget config wasn't persisting across reloads (6ad3355), discounts were being applied to the full invoice price including logistics charges (33a7515), and margin labels were misleading (516b72e). By December 24, I'd separated business logic from views entirely (c7238bc). And on December 27, the Services ARR model was integrated into the engine (ae4aba2).

The prototype phase was over. I had a financial model I could trust — and a healthy fear of "correct-looking" calculations.


Act 3: From localStorage to the Cloud (December 17 – February 8, 2026)

For the first five weeks, all data lived in the browser's localStorage. It worked for a single-user prototype. It was never going to work for an organization.

First Cloud Deploy: December 17

Commit 2bf8100: the Azure Static Web Apps workflow file landed. SWA's free tier was the cheapest entry point — a React SPA with built-in serverless API support and Microsoft Entra ID for authentication. I blocked GitHub auth on the same day (dc13e7c): "Block GitHub auth - allow only Microsoft/Entra ID login". This was an enterprise app from day one.

But SWA's free tier deployed to the US. The database — Azure Cosmos DB — was always in South Africa North, because that's where data sovereignty requirements pointed. So every request took a round trip: browser in South Africa → SWA in the US → Cosmos DB back in South Africa → return through the US → back to the browser. The API itself was crossing the Atlantic twice per call, adding ~180ms of pure network latency.

The Big Bang: January 28

A 3-week gap in the git log (Dec 27 → Jan 13) marks where I was designing the backend architecture during my summer vacation and doing my personal project (personametry.com). Then on January 28, six commits in four hours: 02810dc"Phase 1: Cosmos DB API & SWA Auth Workaround". Azure Functions v4 as the API layer, Cosmos DB NoSQL as the persistence layer, and the beginning of the end for localStorage.

The Partition Key Incident: February 8

Commit 59cf7f0: "Align all partition keys to production /pk". Followed immediately by 01c013e: "Single source of truth for container partition keys — prevent future drift".

I'd discovered that different scripts were using different partition key paths for the same containers. Seeding scripts used one path, API endpoints used another, sync jobs used a third. Documents were landing in the wrong logical partitions. Queries returning empty for data that definitely existed.

The fix was radical: every container uses the same partition key path (/pk), and a single configuration file (containerConfig.js) became the sole source of truth. No script, no function, no seed job is allowed to hardcode a partition key. This one decision eliminated an entire class of bugs.

┌─────────────────────────────────────────────────────────┐
│              DATA LAYER EVOLUTION                       │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Dec 2025     localStorage (browser-only)               │
│       │                                                 │
│       ▼                                                 │
│  Jan 2026     Cosmos DB (SA North)                      │
│       │        + accessed via SWA Functions in US       │
│       │        + cross-Atlantic API latency             │
│       │        + partition key path chaos               │
│       ▼                                                 │
│  Feb 2026     Cosmos DB (SA North, consolidated)        │
│       │        + /pk everywhere                         │
│       │        + containerConfig.js as single truth     │
│       │        + killed the spare dev cloud instance    │
│       ▼                                                 │
│  Mar 2026     Cosmos DB (SA North, co-located)          │
│                + App Service in same region = ~5ms      │
│                + Cosmos Emulator for local dev          │
│                + 10 containers, all /pk                 │
│                                                         │
└─────────────────────────────────────────────────────────┘


Act 4: Under the Hood of Cosmos DB — 10 Containers, One Partition Key, and a Seed Pipeline (January 28 – March 2026)

Act 3 told the story of getting to the cloud. This act opens the hood and shows what's inside. Because the database isn't just storage — it's the entire data model, the authorization boundary, the audit trail, and the seed pipeline. Get the database wrong and nothing else works.

The Architecture

┌─────────────────────────────────────────────────────────┐
│                  COSMOS DB NoSQL                        │
│              Serverless • South Africa North            │
│              Database: myapp-db-prod                    │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │   products   │  │  customers   │  │   actuals    │   │
│  │  pk: BU_FY   │  │  pk: BU_FY   │  │  pk: BU_FY   │   │
│  │  line items  │  │  targets     │  │  forecasts   │   │
│  │  unit costs  │  │  discounts   │  │  ownership   │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
│                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │    config    │  │ definitions  │  │   feedback   │   │
│  │  pk: BU_FY   │  │ pk: defs_BU  │  │ pk: feedback │   │
│  │  BU settings │  │ or defs_BU_FY│  │  global pool │   │
│  │  thresholds  │  │ SAP summaries│  │              │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
│                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │    users     │  │    groups    │  │    audit     │   │
│  │  pk: userId  │  │ pk: groupId  │  │  pk: log     │   │
│  │  profileRoles│  │  org config  │  │  append-only │   │
│  │  BU scope    │  │              │  │  1-year TTL  │   │
│  │  assignments │  │              │  │              │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
│                                                         │
│  ┌──────────────┐                                       │
│  │     rbac     │  All containers: partition key = /pk  │
│  │  pk: rbac    │  All queries: WHERE c.pk = @pk        │
│  │  role ──►perm│  Single source of truth:              │
│  │  mapping     │    containerConfig.js                 │
│  └──────────────┘                                       │
│                                                         │
└─────────────────────────────────────────────────────────┘

10 containers. All using the same partition key path (/pk) but different partition key values. That uniformity was the result of the February 8 incident (Act 3) — and it's the single most important data architecture decision in the project.

The Partition Key Strategy

Cosmos DB's partition key determines data co-location, query efficiency, and cost. Get it wrong and you're paying for cross-partition fan-out on every read. Get it right and most queries cost 1–5 Request Units.

ContainerPK Value PatternExampleWhy This Partitioning
products{buId}_{fy}bu-acme_FY26All products for a BU's fiscal year land in one partition — single read per planning view
customers{buId}_{fy}bu-acme_FY26Customer list is always BU + FY scoped. One partition = one query, no fan-out
actuals{buId}_{fy}bu-acme_FY26Actuals and forecasts partition-aligned with customers they reference
config{buId}_{fy}bu-acme_FY26Singleton BU config per fiscal year — thresholds, margins, planning parameters
definitionsdefs_{buId} or defs_{buId}_{fy}defs_bu-acmeReference data (regions, categories, SAP summaries) — BU-scoped with FY override
feedbackfeedbackfeedbackSingle global partition — low volume, no BU isolation needed
users{userId}Entra Object IDOne partition per user — point reads for auth, no cross-user queries in hot paths
groups{groupId}group-acmeOrg-level config, rarely queried
auditloglogAppend-only stream with 1-year TTL — write-heavy, query-rare
rbacrbacrbacGlobal role → permission map — cached 30s, queried rarely

The four BU/FY-partitioned containers (products, customers, actuals, config) are the hot path. Every planning view loads WHERE c.pk = '{buId}_{fy}' — a single-partition query that returns everything the view needs in 5–10 RUs. No joins. No aggregation. The financial model engine receives the full dataset and computes everything client-side.

Document Shapes

Every document follows a consistent contract: id, pk, entityType, buId, fy, plus domain-specific fields. Here's what a customer document looks like:

{
  "id":           "cust-1",
  "pk":           "bu-acme_FY26",
  "entityType":   "customer",
  "buId":         "bu-acme",
  "fy":           "FY26",
  "name":         "METRO MUNICIPALITY A",
  "regionId":     "reg-001",
  "category":     "Metro",
  "targetRevenue": 25000000,
  "discount":     5,
  "productMix":   { "li-wp-200": 30, "li-wp-300": 20 },
  "updatedAt":    "2026-01-28T05:00:00Z",
  "updatedBy":    "user@company.example",
  "_etag":        "\"00000000-0000-0000-....\""
}

The _etag matters. Cosmos DB's optimistic concurrency uses ETags — if two users edit the same customer simultaneously, the second write fails with a 412 Precondition Failed. The API passes If-Match headers on every write. No distributed locks, no pessimistic locking, no conflict resolution logic — just "first writer wins, second writer retries."

User documents are the most complex — they carry the full authorization profile:

{
  "id":              "2d1f0f0a-...",
  "pk":              "2d1f0f0a-...",
  "email":           "user@company.example",
  "profileRoles":    ["account_manager", "sales_admin"],
  "roles":           ["account_manager", "sales_admin"],
  "tokenRoles":      ["authenticated"],
  "homeBuId":        "bu-acme",
  "editBuIds":       ["bu-acme"],
  "assignmentByBu":  {
    "bu-acme": {
      "assignedSalesGroupIds":  ["sg-001"],
      "assignedRegionIds":      ["reg-001"],
      "assignedCustomerIds":    []
    }
  },
  "isActive":        true,
  "lastLogin":       "2026-01-28T06:00:00Z"
}

Notice profileRoles vs. tokenRoles. The platform token from Entra ID says "authenticated." The database says "account_manager, sales_admin." The database wins. This is the DB-authoritative identity model — the RBAC container holds the role definitions, the users container holds the role assignments, and the API merges them at request time. The identity provider handles who you are; the database decides what you can do.

Product documents carry nested line items — each product category contains the individual SKUs with unit costs, margins, and units of measure:

{
  "id":            "prod-widgets",
  "pk":            "bu-acme_FY26",
  "entityType":    "product",
  "buId":          "bu-acme",
  "fy":            "FY26",
  "name":          "Widgets",
  "isActive":      true,
  "lineItems":     [
    { "id": "li-wl-100", "name": "Widget Lite",
      "unitCost": 35000, "margin": 45, "uom": "EA" },
    { "id": "li-wp-200",  "name": "Widget Pro",
      "unitCost": 52000, "margin": 42, "uom": "EA" }
  ],
  "version":       7,
  "_etag":         "\"00000000-0000-....\""
}

The financial model engine reads all products and all customers for a BU/FY in two parallel queries, then computes margins, projections, and simulation results entirely in-memory. The lineItems array means product structure is self-contained — no joins, no reference lookups during calculation.

Audit documents are append-only with a ttl field — Cosmos DB's built-in time-to-live. After 365 days, the document silently disappears. No cleanup job, no scheduled delete, no storage cost after expiration:

{
  "id":            "aud_1706425200000_abc123",
  "pk":            "log",
  "schemaVersion": 2,
  "timestamp":     "2026-01-28T07:00:00.000Z",
  "eventType":     "WRITE",
  "userId":        "a1b2c3d4-...",
  "action":        "UPDATE",
  "entityType":    "customer",
  "entityId":      "cust-1",
  "buId":          "bu-acme",
  "changes": {
    "entries": [
      { "path": "targetRevenue",
        "changeType": "updated",
        "before": 20000000, "after": 25000000 }
    ]
  },
  "ttl":           31536000
}

Every write to the platform — customer updates, product edits, role changes, budget lock events — generates an audit document. The changes.entries array captures the exact diff: which field changed, what it was before, what it is now. The admin audit log view queries this partition with filters on action type, entity, date range, and actor — but always within pk = "log".

The RBAC container holds the role → permission mapping as a single document per role:

{
  "id":          "role-account_manager",
  "pk":          "rbac",
  "roleId":      "account_manager",
  "permissions": [
    "products.read", "customers.read",
    "customers.edit_assigned",
    "actuals.edit_assigned",
    "sap.view"
  ]
}

The _assigned suffix on permissions is the bridge between RBAC and ABAC — it tells the authorization layer "this user has the permission, but only for resources they own." The API evaluates ownership at request time by checking the user's assignmentByBu against the resource's customer, region, or sales group.

The Definitions Container — Where SAP Data Lives

The definitions container is the most architecturally interesting one. It holds both reference data (regions, categories, account managers) and all SAP ETL output — coexisting in the same container, differentiated by defType. Every SAP connector publishes three document types here:

┌─────────────────────────────────────────────────────────┐
│          DEFINITIONS CONTAINER: SAP DOCUMENT MAP        │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  pk = defs_{buId}                                       │
│                                                         │
│  REFERENCE DATA (non-SAP):                              │
│  ├── defType: regions                                   │
│  ├── defType: categories                                │
│  ├── defType: account-managers                          │
│  └── defType: field-technicians                         │
│                                                         │
│  SAP SUMMARIES (1 per connector):                       │
│  ├── defType: sap-sales-insights                        │
│  ├── defType: sap-debtors-insights                      │
│  ├── defType: sap-stock-insights                        │
│  ├── defType: sap-orderbook-insights                    │
│  └── defType: sap-delivery-insights                     │
│                                                         │
│  SAP ETL RUNS (1 per run per connector):                │
│  ├── defType: sap-sales-etl-run                         │
│  ├── defType: sap-debtors-etl-run                       │
│  ├── defType: sap-stock-etl-run                         │
│  ├── defType: sap-orderbook-etl-run                     │
│  └── defType: sap-delivery-etl-run                      │
│                                                         │
│  SAP DETAIL DOCS (partitioned, for AI drill-down):      │
│  ├── defType: sap-sales-customer-transactions           │
│  ├── defType: sap-debtors-customer-ledger               │
│  ├── defType: sap-stock-material-ledger                 │
│  ├── defType: sap-orderbook-customer-lines              │
│  └── defType: sap-delivery-customer-lines               │
│                                                         │
│  CACHE MANIFEST (1 per BU):                             │
│  └── defType: sap-cache-manifest                        │
│                                                         │
└─────────────────────────────────────────────────────────┘

A SAP sales summary document is the richest — it's what the dashboard hero cards read on every page load:

{
  "id":              "sap_sales_insights_bu-acme",
  "pk":              "defs_bu-acme",
  "type":            "definition",
  "defType":         "sap-sales-insights",
  "buId":            "bu-acme",
  "importStatus":    "SUCCESS",
  "runId":           "20260215T103000Z",
  "generatedAt":     "2026-02-15T10:30:00Z",
  "reconciliation":  {
    "sourceRows": 4218,
    "includedRows": 4193,
    "excludedRows": 25,
    "isBalanced": true
  },
  "fiscalYearsAvailable": ["FY25", "FY26"],
  "yoy":             { /* year-over-year comparisons */ },
  "totalsByFiscalYear": { "FY25": 48200000, "FY26": 31400000 },
  "topCustomers":    [ /* max 50, ranked by value */ ],
  "topProducts":     [ /* max 50, ranked by value */ ],
  "ledgerPreview":   [ /* max 120 rows for quick inspection */ ],
  "lastImport":      {
    "status": "SUCCESS",
    "timestamp": "2026-02-15T10:30:00Z"
  }
}

The detail documents are where the SAP ETL gets interesting. Large ledger datasets are partitioned into chunks — a connector with 4,000 transaction rows splits them into multiple documents using partIndex and partCount:

{
  "id":              "sap_sales_customer_tx_bu-acme_FY26_cust-1_0",
  "pk":              "defs_bu-acme",
  "defType":         "sap-sales-customer-transactions",
  "customerId":      "cust-1",
  "fy":              "FY26",
  "partIndex":       0,
  "partCount":       3,
  "rowCount":        500,
  "customerRowCount": 1420,
  "totals":          { "valueZar": 8500000, "grossProfitZar": 3200000 },
  "rows": [
    {
      "customerId":    "cust-1",
      "customerName":  "METRO MUNICIPALITY A",
      "invoiceNo":     "9001234",
      "invoiceDate":   "2025-09-15",
      "partNumber":    "WP-001",
      "description":   "Widget Pro Unit",
      "netQty":        12,
      "valueZar":      624000,
      "costZar":       361000,
      "grossProfitZar": 263000,
      "marginRatio":   0.4215,
      "isCreditLine":  false
    }
    /* ... up to 500 rows per part */
  ]
}

The AI copilot uses these detail documents for drill-down queries — "show me the top 5 products sold to Municipality A this fiscal year" doesn't hit SAP; it queries the pre-extracted, pre-normalized ledger rows sitting in Cosmos DB. Same pattern for each connector: stock has material ledger parts, debtors has customer aging ledger parts, orderbook has order line parts, delivery has shipment line parts.

The cache manifest document ties the whole system together — it tracks which connector was last refreshed and what version the cache is at:

{
  "id":           "sap-cache-manifest",
  "pk":           "defs_bu-acme",
  "defType":      "sap-cache-manifest",
  "version":      17,
  "connectors":   {
    "sales":      { "lastRunAt": "2026-03-28T09:15:00Z", "lastRunId": "20260328T091500Z" },
    "debtors":    { "lastRunAt": "2026-03-28T09:20:00Z", "lastRunId": "20260328T092000Z" },
    "stock":      { "lastRunAt": "2026-03-21T14:00:00Z", "lastRunId": "20260321T140000Z" },
    "orderbook":  { "lastRunAt": "2026-03-28T09:25:00Z", "lastRunId": "20260328T092500Z" },
    "delivery":   { "lastRunAt": "2026-03-28T09:30:00Z", "lastRunId": "20260328T093000Z" }
  }
}

Every successful ETL sync bumps version. The client-side cache checks this version on a 1-hour interval. If it's changed, stale SAP data is evicted from localStorage and the next dashboard load pulls fresh summaries. No manual refresh required — the data flows from SAP export through Python extraction, Node.js sync, Cosmos DB, manifest bump, client cache invalidation, and into the dashboard cards. The definitions container is the meeting point where SAP's transactional world becomes the platform's analytical world.

Query Patterns — Why Everything Is Cheap

Three query patterns cover 95% of all database reads:

PatternExampleRU CostWhen Used
Point Read.item(id, pk).read()~1 RUUser auth check, single doc fetch, config lookup
Partition QueryWHERE c.pk = @pk5–10 RULoad all products/customers for a BU/FY view
Partition + FilterWHERE c.pk = @pk AND c.defType = @type5–15 RUSAP summaries, definition lookups with FY fallback

Cross-partition queries exist in exactly one place: the audit log viewer (admin-only, paginated, never in a hot path). Everything else is partition-targeted. The financial model view loads all products and all customers for a BU/FY in two parallel partition queries — then the calculation engine runs entirely in-memory. The database is a data loader, not a computation engine.

The Definitions Fallback Pattern

The definitions container has a unique query pattern: FY-scoped with BU fallback. When the app requests region definitions for FY26, it first checks pk = defs_bu-acme_FY26. If nothing is there, it falls back to pk = defs_bu-acme (the BU-level default). This allows FY-specific overrides without duplicating every definition for every fiscal year:

  FY-scoped lookup: pk = defs_bu-acme_FY26
       │
       ├── Found? ──► Use FY-scoped definition
       │
       └── Not found?
            │
            ▼
  BU-scoped fallback: pk = defs_bu-acme
       │
       └── Found? ──► Use BU default

The Client — Lazy Proxies and Environment Detection

The Cosmos DB client uses a JavaScript Proxy pattern for lazy initialization. The first time any code accesses containers.products, the proxy intercepts, creates the CosmosClient, connects to the database, and caches container references. Subsequent accesses go straight to the cache. If the connection string points to localhost:8081, the client automatically disables TLS certificate validation for the emulator's self-signed cert.

Two connection modes:

  • Connection String (default): AccountEndpoint=...;AccountKey=... — simple, works everywhere
  • Managed Identity: endpoint-only, authenticates via Azure AD — no secrets in config, but requires App Service hosting

Killing the Cloud Dev Database

For months, the project maintained two Cosmos environments: production (myapp-db-prod), cloud dev (myapp-db-dev), and local emulator (myapp-db-emulator) was later added. The cloud dev database was supposed to let multiple developers share a development environment. In practice, one developer meant zero benefit and double the Azure bill. I wanted to use developer best practices when I started, but later realized this was overkill, and due to an AI mistake, incurred unexpected costs.

Commit e6fce7a (March 29): "Retire cloud dev Cosmos DB targeting." The cloud dev database was killed. All non-production development moved to the local Cosmos Emulator — free, fast, no network dependency. The legacy COSMOS_USE_SUPPORT_DEV_DB flag was preserved but now redirects to the emulator. A guardrail function (assertNoCloudNonProdDatabaseTarget()) blocks any code from accidentally routing dev traffic to the cloud.

The Seed-and-Sync Pipeline

An enterprise database is useless without repeatable data seeding. The seed-and-sync.js orchestrator runs a 5-stage pipeline:

┌─────────────────────────────────────────────────────────┐
│              SEED-AND-SYNC PIPELINE                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Stage 1: init-containers                               │
│  └── Create all 10 containers from containerConfig.js   │
│                                                         │
│  Stage 2: seed-global-settings                          │
│  └── RBAC baseline, role defaults                       │
│                                                         │
│  Stage 3: seed-rbac                                     │
│  └── Role ──► permission mapping (11 roles, 29 perms)   │
│                                                         │
│  Stage 4: seed-data                                     │
│  └── BU-scoped definitions (products, regions, etc.)    │
│                                                         │
│  Stage 5: seed-initial-data                             │
│  └── BU planning data (products, customers, actuals)    │
│                                                         │
│  Optional: --include-sap-customer-master                │
│           --include-sap-product-master                  │
│           --include-user-backfill                       │
│           --include-audit-backfill                      │
│                                                         │
│  Verify: --verify                                       │
│  └── definitions, products, actuals, lock state, users  │
│                                                         │
│  Promote: --sync                                        │
│  └── Mirror dev ──► prod (non-destructive by default)   │
│                                                         │
└─────────────────────────────────────────────────────────┘

The pipeline is idempotent — run it ten times and the result is the same. Production promotion uses a non-destructive mirror by default: it upserts documents but never deletes. A destructive reseed requires an explicit double-confirmation flag (--allow-production-reseed --confirm-production-reseed=RESEED_PRODUCTION). You can't accidentally wipe production with a typo.

The same pipeline runs in CI via a GitHub Actions workflow (cosmos-seed-sync.yml), with manual dispatch for production operations. The verification stage validates document shapes, partition key consistency, lock state integrity, and user access grants — catching data model drift before it reaches users.

What I Learned About Cosmos DB

The partition key decision made in January determined the cost profile for the entire project. Every query that includes WHERE c.pk = @pk is cheap. Every query that doesn't is expensive. By making every container use the same partition key path and making partition targeting a non-negotiable API design rule, I turned a serverless database with pay-per-RU pricing into a predictable-cost data layer. The monthly bill stayed flat even as usage grew — because query efficiency was baked into the schema, not bolted on later.


Act 5: The dabe903 Incident and Enterprise Hardening (February 7–8, 2026)

February 7 was the day the platform stopped being a prototype and started being enterprise software. Nine commits that day — and one of them became the most referenced commit in the project's history.

dabe903: "Close data model gaps - categories, zones, If-Match". That innocent-sounding commit destroyed 200 lines of the customer list component. The AI agent, trying to "simplify" the code, had omitted entire sections it deemed redundant. The next morning, commit c35436c: "Restore Customer Product Mix Editor destroyed in dabe903". And immediately after: 4f1daab — the addition of the NEVER OMIT CODE rule to the agent contract, in bold, in all caps.

But February 7 also delivered the admin portal (7300a32: "Implement Admin Portal — User Governance, Audit, Fiscal Years, System Health"), the navigation redesign (5d06d9e), and the first RBAC permissions (1d0ac3f). February 8 brought audit logs, comprehensive API logging, the partition key alignment, and the seed-and-sync orchestrator. In two days, the app went from a prototype with localStorage to a governed platform with user management, audit trails, and production-safe database tooling.


Act 6: The Authorization Model — 11 Roles, 29 Permissions, and a Nightly Regression (February 10–19, 2026)

Authorization is the least glamorous feature I built. It doesn't make for exciting demos. But it's the reason the CXOs trusts this platform with real financial data, and it consumed more engineering time than any dashboard view.

The 3-Layer Stack

┌─────────────────────────────────────────────────────────┐
│  LAYER 1: IDENTITY (Entra ID)                           │
│  ─ SSO via Microsoft identity platform                  │
│  ─ Token validated on every API request                 │
│  ─ No local passwords, no session management            │
└────────────────────────┬────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────┐
│  LAYER 2: RBAC (Database-Backed, 30s TTL cache)         │
│  ─ 11 roles mapped to 29 permissions                    │
│  ─ Stored in Cosmos DB, runtime-updatable               │
│  ─ Wildcard (*) for admin roles                         │
│  ─ No redeployment needed to change permissions         │
└────────────────────────┬────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────┐
│  LAYER 3: ABAC (Attribute-Based Access Control)         │
│  ─ BU Scope: viewBuIds / editBuIds per user             │
│  ─ Customer ownership: region, sales group, or direct   │
│  ─ Field-level whitelist on write operations            │
│  ─ Time-gated: post-budget-lock permissions             │
│  ─ Immutable fields: buId, fy, customerId (always)      │
└─────────────────────────────────────────────────────────┘

The rule: allow = hasPermission AND hasBuScope AND ownsResource. All three layers must pass. The server is the security boundary — the UI hides buttons and fields for convenience, but every write operation is re-validated server-side with the full permission stack.

The 11 Roles

I designed the role hierarchy after mapping every real-world persona in the company to what they need to see and do:

RoleTypeWhat They Can Do
adminPlatformEverything (wildcard *). Manage users, roles, config. Structural unlock.
group_adminPlatformSame as admin — for multi-tenant group-level oversight
group_ceoExecutiveRead everything. Structural unlock (reopen a locked budget). No day-to-day edits.
group_financeBusinessFull financial editing: budgets, actuals, global parameters, definitions. Cross-BU visibility.
finance_managerBusinessFinance subset: edit actuals and parameters, but can't create products or customers.
sales_adminBusinessFull customer CRUD across all regions. Assign account managers. Upload SAP data.
account_managerScopedEdit only assigned customers (ABAC-enforced). View all customers in BU for context.
sap_exporterPlatformViewer baseline + SAP file upload + ETL status monitoring
viewerPlatformRead-only baseline: products, customers, actuals, plans, dashboards
authenticatedPlatformViewer defaults — for users with Entra ID login but no explicit role assignment yet
anonymousPlatformViewer defaults — mapped from SWA anonymous auth (legacy, parked)

The 29 Permissions (Grouped by Domain)

DomainPermissions
Productsproducts.view, product.create, product.edit_price
Customerscustomers.view, customers.view.assigned, customer.create, customer.create_post_lock_all, customer.create_post_lock_assigned, customer.delete, customer.edit_sales_data, customer.edit_sales_data_assigned, customer.edit_target_post_lock_all, customer.edit_target_post_lock_assigned
Actualsactuals.view, actuals.edit, actuals.edit_assigned
Budget Planbudget.plan.view, budget.plan.edit
Global Paramsglobal_params.view, global_params.edit
Definitionsdefinitions.view
Feedbackfeedback.view, feedback.create, feedback.comment, feedback.vote
Business Configbusiness_config.structural_unlock
SAPsap.upload, sap.status
AIai.chat

Two design patterns worth noting:

  • The _assigned suffix pattern: Permissions like customer.edit_sales_data_assigned vs customer.edit_sales_data encode the ABAC scope directly in the permission name. An account manager gets the _assigned variant; a sales admin gets the unrestricted one. The API resolves this at request time — no special-case branching needed.
  • The post_lock time gates: Permissions like customer.create_post_lock_all only activate after the budget is locked. During planning phase, they're irrelevant. During execution phase, they're the only write permissions that survive. The budget lock state governs which permission set is in effect — the roles themselves don't change.

The Budget Lock Lifecycle

┌──────────┐         ┌──────────┐         ┌──────────┐
│ PLANNING │────────►│  LOCKED  │────────►│EXECUTION │
│  MODE    │  lock   │  (review)│  approve│  MODE    │
│          │         │          │         │          │
│ All edits│         │ Read-only│         │ Limited  │
│ allowed  │         │ for most │         │ edits    │
│          │         │ roles    │         │ (scoped) │
└──────────┘         └──────────┘         └──────────┘
                          │
                          │ structural
                          │ unlock
                          ▼
                     ┌──────────┐
                     │ UNLOCKED │
                     │ (CEO or  │
                     │  admin)  │
                     └──────────┘

During planning mode, everyone with the right permissions can edit products, customers, targets, and parameters. When the budget is locked (6bde4cb), the system enters review mode — most edits are frozen. After approval, execution mode allows only post-lock permissions: actuals entry, assigned-customer adjustments. A CEO or admin can perform a structural unlock (5bb436c, March 3) that temporarily reopens the budget without resetting the lifecycle — handling the real-world scenario where the board changes a target number mid-year and finance needs to adjust the approved plan.

The Admin Portal

The admin portal (7300a32, Feb 7) is where governance becomes tangible. It has four sections:

┌─────────────────────────────────────────────────────────┐
│                    ADMIN PORTAL                         │
├──────────────┬──────────────────────────────────────────┤
│              │                                          │
│  SECTIONS    │  USER GOVERNANCE                         │
│              │  ─ Role assignment per user              │
│  Users       │  ─ BU view/edit scope configuration      │
│  Audit       │  ─ Customer/region/sales group assignment│
│  Fiscal Year │  ─ Active/inactive user toggle           │
│  Health      │  ─ SAP exporter delegation               │
│              │  ─ DB-authoritative onboarding flow      │
│              │                                          │
│              │  AUDIT LOG                               │
│              │  ─ Deep diff (field-level change history)│
│              │  ─ Sensitive field redaction             │
│              │  ─ Denied-write audit trail              │
│              │  ─ Login auditing + usage analytics      │
│              │                                          │
│              │  FISCAL YEAR MANAGEMENT                  │
│              │  ─ FY lifecycle (planning/locked/exec)   │
│              │  ─ Historic FY read-only mode            │
│              │  ─ Structural unlock controls            │
│              │                                          │
│              │  SYSTEM HEALTH                           │
│              │  ─ Cosmos DB connectivity                │
│              │  ─ ETL pipeline status per connector     │
│              │  ─ AI service health probes              │
│              │  ─ Storage diagnostics                   │
│              │  ─ Cost ledger (Azure spend tracking)    │
│              │                                          │
└──────────────┴──────────────────────────────────────────┘

The user governance flow evolved significantly. By February 19, commit 49ca324: "Implement DB-authoritative user onboarding and role admin flow" — new users who authenticate via Entra ID are automatically provisioned with viewer defaults, and an admin can then assign roles, BU scope, and customer ownership through the portal. No more email requests to me to get someone access.

The audit system (cc410b5, Feb 13) records every mutation with a deep diff — not just "customer X was edited" but which fields changed, from what values to what values, by whom, at what time. Sensitive fields are redacted in the audit log. Denied writes are audited too (17c3924) — so when an account manager tries to edit a customer outside their scope, the attempt is logged even though it's blocked.

The Authorization Harness

February 13, commit 35de3df: "add full authorization harness framework, scenarios, and docs". This is the testing infrastructure that makes the whole permission model trustworthy.

The harness runs every role-permission-resource combination against golden snapshots. It has three modes:

  • Golden mode: captures the current permission state as the baseline truth
  • Quick mode: runs a focused subset for fast CI feedback
  • Full matrix mode: tests every role against every permission against every resource scope — the exhaustive check

A nightly CI workflow runs the harness against the golden snapshots. If any code change accidentally grants or removes access, the pipeline fails. This catches the kind of bug that no unit test would find — a permission string typo, a missing scope check in a new endpoint, an ABAC condition that doesn't fire for a specific role. The harness has caught real regressions that would have shipped to production.

The evolution timeline of the auth system:

  1. Feb 7: Hardcoded RBAC permissions (1d0ac3f)
  2. Feb 7: Admin Portal with user governance, audit, fiscal year management (7300a32)
  3. Feb 8: Audit logs working, API error handling (bc646c4)
  4. Feb 10: DB-backed RBAC with 30-second TTL cache (5e37745)
  5. Feb 10: BU scope enforcement (4673200)
  6. Feb 11: ABAC customer ownership (55b54a7)
  7. Feb 11: Budget lock policy (6bde4cb)
  8. Feb 11: Unified budget plan permission model (81a8942)
  9. Feb 11: Definitions permission-key RBAC (2466acd)
  10. Feb 13: Authorization harness framework (35de3df)
  11. Feb 13: Enterprise audit v2 with deep diff and redaction (cc410b5)
  12. Feb 16: Audited post-lock customer adjustments (7f15f61)
  13. Feb 16: Denied-write audit trail (17c3924)
  14. Feb 19: DB-authoritative user onboarding (49ca324)
  15. Feb 19: Login auditing and usage analytics (fe23556)
  16. Feb 21: SAP exporter role delegation (2a170ac)
  17. Mar 3: Structural unlock for locked fiscal years (5bb436c)

Seventeen commits over a month, and I'm only counting the major ones. Authorization is the feature that never stops evolving — every new capability (SAP upload, AI chat, market analysis, post-lock adjustments) required a new permission, a new scope check, and a new harness scenario.


Act 7: The SAP Bridge — Building the ETL Pipeline (February 13–14, 2026)

Two days. That's how long it took to build the plumbing that connects SAP to the dashboard.

Our SAP system has no OData services. No RFC connections. No BTP integration layer. The data lives behind transaction codes and ALV grids. So I built a pipeline that works with what SAP actually gives you: exported spreadsheets.

February 13, commit f089e87: "add SAP sync pipeline and searchable budget customer selection". By the end of February 14, I had customer master sync, product master sync, an admin editor, and FY planning reset tooling — all backed by a 7-stage event-driven pipeline.

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  1. UPLOAD   │     │  2. BLOB     │     │  3. EVENT    │
│  User uploads│────►│  sap-exports │────►│  EventGrid   │
│  XLS via UI  │     │  container   │     │  triggers    │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                  │
                                                  ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  6. COSMOS   │     │  5. NODE.JS  │     │  4. PYTHON   │
│  Containers  │◄────│  Sync to DB  │◄────│  Extract &   │
│  updated     │     │  + verify    │     │  normalize   │
└──────────────┘     └──────────────┘     └──────────────┘
                                                  │
                                          ┌───────┴──────┐
                                          │  BLOB STORE  │
                                          │ etl-artifacts│
                                          │ 4 JSON files │
                                          │ per run      │
                                          └──────────────┘
                           ▼
                    ┌──────────────┐
                    │  7. CACHE    │
                    │  Invalidation│
                    │  + audit log │
                    └──────────────┘

A user exports from SAP (transaction codes like VA05, FBL5N, MB52, VL06O, MB5B) as XLS and uploads through the dashboard. EventGrid fires, Python parses the SAP-specific layout, normalizes data, flags anomalies, outputs 4 JSON artifacts (summary, ledger, anomalies, run metadata). A second trigger fires Node.js sync, which upserts into Cosmos DB and invalidates caches. End to end: about 30 seconds. The old process — emailing spreadsheets, manually copy-pasting into a shared workbook — took days.

Each connector has three synchronized registries — Python extraction, Node.js sync, and API seeding. If the Python extractor changes its output schema, the Node.js sync catches the mismatch before it corrupts the database.

Why not a direct SAP API? Enabling OData on our ECC system would require external SAP team involvement, change management, and transport requests — a 3-month process minimum. The XLS exports are already part of existing business workflows, so adoption cost is zero. And the pipeline approach means SAP is completely decoupled — if we migrate to S/4HANA someday, only the Python extractors need updating.


Act 8: The Connector Pattern — A Reusable Recipe for SAP Reports (February 2026)

Act 7 showed the pipeline. This act shows the pattern — because the pipeline was deliberately designed to make adding new SAP reports trivial. Five connectors (sales, debtors, orderbook, stock, delivery) all follow the same architecture. When I add detailed stock management insights next, I won't build new plumbing. I'll follow the recipe.

The Connector Contract

Every SAP connector is defined by a frozen registry entry — identical across three runtimes (Python extraction, Node.js sync, API layer):

{
  key:           "stock",
  label:         "SAP Stock Insights",
  status:        "active",
  artifacts: {
    summary:     "stock-insights-summary.json",
    ledger:      "stock-insights-ledger.json",
    anomalies:   "stock-insights-anomalies.json",
    run:         "stock-insights-run.json"
  },
  defTypes: {
    summary:     "sap-stock-insights",
    run:         "sap-stock-etl-run"
  }
}

That's the entire interface. Everything else — extraction logic, summary calculations, anomaly detection — is connector-specific. But the shape of the output is fixed. Four artifacts. Two Cosmos document types. One manifest bump. Every connector, every time.

The Four Artifacts

Every ETL run produces exactly four JSON files:

ArtifactContentsConsumer
SummaryYear-over-year metrics, monthly breakdowns, totals by fiscal yearDashboard hero cards, AI copilot summary queries
LedgerNormalized transaction rows — every line item from the SAP exportVerification (reconciliation), AI drill-down queries
AnomaliesData quality issues: missing fields, out-of-range values, assumptions appliedAdmin pipeline view, AI diagnostics
RunMetadata: connector, runId, timestamps, import status, reconciliation countersPipeline monitoring, cache invalidation, audit trail

The run contract is the gate. A run can only be promoted to "latest" (and cached data invalidated) if the run document confirms: importStatus=SUCCESS, latestPointerUpdated=true, and reconciliation.isBalanced=true. That last check is critical — no-loss reconciliation. The number of source rows must equal included rows plus excluded rows. If a single row drops between extraction and summary, the pipeline fails. No silent data loss.

The Extraction Framework

A shared Python framework (sap_live_connector_framework.py) provides the scaffolding that every extractor reuses:

┌─────────────────────────────────────────────────────────┐
│           PYTHON EXTRACTION FRAMEWORK                   │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  resolve_connector_input_file()                         │
│  └── Find the latest XLS in the connector's folder      │
│                                                         │
│  build_run_context(buId, connector, runId)              │
│  └── Create timestamped output directories              │
│       history/{runId}/ and latest/                      │
│                                                         │
│  build_run_payload()                                    │
│  └── Initialize run metadata with required fields       │
│                                                         │
│  validate_run_contract()                                │
│  └── Enforce reconciliation balance + required fields   │
│                                                         │
│  copy_latest()                                          │
│  └── Promote history/{runId}/ ──► latest/ on success    │
│                                                         │
│  cleanup_history(keep=3)                                │
│  └── Bounded retention — keep last N runs               │
│                                                         │
└─────────────────────────────────────────────────────────┘

The extractor itself only needs to implement the domain-specific part: parse SAP headers, normalize rows, compute summaries, detect anomalies. The framework handles input resolution, output paths, run tracking, promotion, and cleanup.

The Sync Layer

After extraction, a Node.js sync script reads the four artifacts and publishes to Cosmos DB:

  • Summary document (defType: sap-{connector}-insights) — the dashboard reads this
  • Run document (defType: sap-{connector}-etl-run) — the admin pipeline view reads this
  • Detail documents (optional, connector-specific) — the AI copilot reads these for drill-down queries

On success, the sync bumps the sap-cache-manifest version in Cosmos DB. Client-side caches detect the version change on their next manifest check (1-hour TTL) and invalidate stale SAP data. Users see fresh numbers without a manual refresh.

Adding a New Connector: The Recipe

This is the part that makes the pattern worth documenting. When I add a new report — say, detailed stock management — here's the exact checklist:

┌─────────────────────────────────────────────────────────┐
│      RECIPE: ADD A NEW SAP REPORT CONNECTOR             │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Step 1: REGISTRY (3 files, ~20 lines each)             │
│  ├── api/lib/sapLiveConnectorRegistry.js                │
│  ├── etl-extract/lib/connector_registry.py              │
│  └── etl-sync/lib/sapLiveConnectorRegistry.js           │
│  Define: key, label, artifact names, defTypes           │
│                                                         │
│  Step 2: EXTRACTOR (1 Python file)                      │
│  └── etl-extract/extractors/extract_sap_{name}.py       │
│  Implement: parse headers, normalize rows, compute      │
│  summaries, detect anomalies, validate reconciliation   │
│  Reuse: framework helpers for everything else           │
│                                                         │
│  Step 3: SYNC SCRIPTS (2 Node.js files)                 │
│  ├── etl-sync/sync/sync-sap-{name}-live-insights.js     │
│  └── etl-sync/sync/verify-sap-{name}-live-insights.js   │
│  Implement: read artifacts, publish to Cosmos,          │
│  verify writes match run contract                       │
│                                                         │
│  Step 4: LOCAL SEED + VERIFY (2 files)                  │
│  ├── api/scripts/seed-sap-{name}-live-insights.js       │
│  └── api/scripts/verify-sap-{name}-live-insights.js     │
│  Wire: Python extraction ──► Node.js sync ──► verify    │
│                                                         │
│  Step 5: WIRE ORCHESTRATION (1 edit)                    │
│  └── etl-sync/functions/etlSync.js                      │
│  Add entry to SCRIPT_MAP: connector ──► sync + verify   │
│                                                         │
│  Step 6: AI INTEGRATION (optional, 1 edit)              │
│  └── ai-service/src/lib/runtime/aiToolExecutor.js       │
│  Add: connector ──► defTypes, detail doc type,          │
│  queryable row fields                                   │
│                                                         │
│  Step 7: TEST                                           │
│  └── Upload XLS via admin UI, verify end-to-end:        │
│       artifacts ──► Cosmos docs ──► manifest bump       │
│       ──► cache invalidation ──► dashboard renders      │
│                                                         │
└─────────────────────────────────────────────────────────┘

Steps 1, 5, and 6 are config edits — under 20 lines each. Step 2 is the real work: parsing the SAP-specific Excel layout, which varies per transaction code. Steps 3 and 4 follow a template so closely that an AI agent could generate them from the registry definition. The upload endpoint, EventGrid triggers, Blob storage paths, cache invalidation, and admin UI all work automatically — they're connector-agnostic.

That's the design goal: the pipeline is fixed, the connectors are pluggable. The same architecture that handles sales data handles stock levels, debtors aging, delivery tracking, and order books — and will handle whatever SAP report comes next.


Act 9: SAP Live Insights — Five Connectors in Five Days (February 15–18, 2026)

The ETL pipeline was the plumbing. This act is the payoff.

February 15, commit 0366d4c: "ship end-to-end SAP Live Insights ETL, dashboard, and ops docs". The sales connector went live — the first time anyone in the company could see SAP sales data in a browser dashboard instead of an emailed spreadsheet. Within the same day, I shipped 14 more commits: FY-segmented rankings, customer governance enrichment, hero KPI cards, performer strategy toggles, product performance analytics, and quantity-driven insights. The data was finally flowing, and every question finance had ever asked could now be answered with a card on the dashboard.

Then came the expansion:

DateConnectorKey CommitDashboard KPIs Added
Feb 15Sales0366d4cRevenue by customer/product, growth rates, budget coverage, top-N performers
Feb 17Debtors9b25ab1Receivables aging, projected realization, payment patterns
Feb 17Order Book8b2eba7Backlog runway, order-to-cash velocity, delivery schedules
Feb 17Deliveryb537157On-time delivery %, penalty exposure, fulfillment rates
Feb 18Stock3369f29Inventory levels, slow-moving analysis, valuation

Each connector was a full vertical slice: Python extractor, Node.js sync, verification script, Cosmos schema, and a set of dashboard hero cards with drill-down widgets. The debtors connector doesn't just show outstanding balances — it projects realization based on historical payment patterns and flags deteriorating customers. The delivery connector computes on-time delivery percentages with a two-pass exclusion filter, then surfaces penalty exposure KPIs.

The Dashboard That Became the Home Screen

February 18, commit 86ed0b2: "Reorganize navigation and default to SAP Live Insights". The SAP view replaced the budget dashboard as the app's landing page. Users wanted what's happening right now before what should be happening according to the plan.

┌─────────────────────────────────────────────────────────┐
│              SAP LIVE INSIGHTS DASHBOARD                │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐               │
│  │ REVENUE  │  │ DEBTORS  │  │ ORDER    │               │
│  │ Hero KPI │  │ Hero KPI │  │ BOOK     │               │
│  │ vs Target│  │ Aging    │  │ Runway   │               │
│  └──────────┘  └──────────┘  └──────────┘               │
│                                                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────────┐   │
│  │ DELIVERY │  │ STOCK    │  │ BUDGET COVERAGE      │   │
│  │ OTD %    │  │ Valuation│  │ Invoiced + Pipeline  │   │
│  │ Penalties│  │ Movement │  │ vs Plan              │   │
│  └──────────┘  └──────────┘  └──────────────────────┘   │
│                                                         │
│  ┌──────────────────────────────────────────────────┐   │
│  │  CONNECTOR STATUS CARDS                          │   │
│  │  Last refreshed │ Run history │ Anomaly count    │   │
│  └──────────────────────────────────────────────────┘   │
│                                                         │
└─────────────────────────────────────────────────────────┘

The insight cards evolved rapidly. Early versions showed raw numbers. By February 17, they projected realization rates. By February 19, an OEM vs services margin breakdown (797e7e9). By February 27, a forecast variance widget with monthly YTD charts (cbca3fa). Each refinement came from sitting with finance and asking: "what question are you still opening SAP to answer?"


Act 10: The Cost Reckoning — Caching SAP Data That Only Changes Weekly (February 11–28, 2026)

Here's what happens when you build five SAP connectors in five days and make them the app's home screen: everyone opens the dashboard, and every page load triggers Cosmos DB reads for sales, debtors, orderbook, stock, and delivery data. Multiply that by every user, every session, every tab refresh. The Request Unit meter was climbing, and the data it was fetching hadn't changed since last Tuesday's SAP export.

The wake-up call came on February 11. Commit 4bdae3e: "Optimize save flows and add RU-impact telemetry/checklist". That commit added telemetry to measure the actual RU cost of every save and read operation. The numbers were sobering — the app was hammering Cosmos DB for data that only changed once a week, when someone uploaded a fresh SAP export.

The fix wasn't a simple HTTP cache header. SAP data has a specific refresh cadence — weekly exports from SAP transaction codes — and the caching system needed to be aware of that cadence. I built it in layers:

The Cache Architecture

┌─────────────────────────────────────────────────────────┐
│              CLIENT (Browser)                           │
│                                                         │
│  ┌──────────────────────────────────────────────────┐   │
│  │  localStorage Cache                              │   │
│  │  ─ Per-connector, per-BU, per-FY entries         │   │
│  │  ─ 8-day default TTL (matches weekly cadence)    │   │
│  │  ─ Configurable via env vars                     │   │
│  │  ─ Max 14-day absolute ceiling                   │   │
│  └────────────────────┬─────────────────────────────┘   │
│                       │ Is manifest version stale?      │
│                       ▼                                 │
│  ┌──────────────────────────────────────────────────┐   │
│  │  Manifest Version Check (1-hour TTL)             │   │
│  │  ─ Lightweight: fetches version number only      │   │
│  │  ─ If version unchanged: serve from cache        │   │
│  │  ─ If version bumped: invalidate + refetch       │   │
│  └────────────────────┬─────────────────────────────┘   │
│                       │ Only on cache miss              │
└───────────────────────┼─────────────────────────────────┘
                        ▼
┌─────────────────────────────────────────────────────────┐
│              SERVER (API + Cosmos DB)                   │
│                                                         │
│  ┌──────────────────────────────────────────────────┐   │
│  │  SAP Cache Manifest (Cosmos document)            │   │
│  │  ─ Global version counter                        │   │
│  │  ─ Per-connector: lastRunAt, lastRunId           │   │
│  │  ─ Bumped automatically by ETL sync pipeline     │   │
│  │  ─ Admin can force-refresh via maintenance API   │   │
│  └──────────────────────────────────────────────────┘   │
│                                                         │
└─────────────────────────────────────────────────────────┘

The key insight: instead of caching the data itself with a dumb TTL, cache the manifest version with a short TTL (1 hour), and cache the data with a long TTL (8 days). The client checks the manifest version cheaply — a single tiny Cosmos read — and only refetches the actual SAP data if the version number has been bumped by a new ETL run. This means 100 users opening the dashboard in the same hour generate 100 manifest checks (a few RUs each) instead of 500 full connector data fetches (hundreds of RUs each).

The timeline of the cost optimization story:

DateCommitWhat Happened
Feb 114bdae3eAdded RU-impact telemetry — first time we could see the actual cost of each operation
Feb 2260774f6AI cache freshness layer + cache health diagnostics in admin portal
Feb 229d8e8edRemoved monitoring extras, stripped down to low-cost execution model
Feb 23a4c5aefThe big one: SAP manifest-aware caching — 1,624 lines across 38 files. Server-side manifest, client-side localStorage cache, per-connector invalidation, admin refresh API.
Feb 2304e816aConfigurable TTLs tuned to weekly SAP export cadence (8-day default, 14-day max)
Feb 25c2e0da0Reduced save-path writes — fewer unnecessary Cosmos updates on non-dirty saves
Feb 28750f979Killed the dev cloud Cosmos instance, moved to emulator-first local development
Mar 1c5a1453Low-RU cross-user refresh for market analysis — shared cache state across users
Mar 25e04ee50Static cache for market analysis — stopped re-fetching data that changes quarterly

When the ETL pipeline syncs a fresh SAP export, the last step bumps the manifest version in Cosmos. The next time any user's browser checks the manifest (within the hour), it sees the new version, invalidates its local cache for that connector, and fetches fresh data. Every other connector's cache stays warm. An admin can also force-refresh from the admin portal if an urgent re-export happens mid-week.

The result: Cosmos DB RU consumption dropped dramatically. The monthly bill went from "noticeable" to "single digits." And the dashboard felt faster — most page loads were hitting warm local caches instead of waiting for cross-region Cosmos reads.


Act 11: The AI Copilot (February 21 – March 5, 2026)

Commit b14dadc, February 21 at 8:34 PM: "ship context-aware finance chat with tools, streaming UX, and telemetry". That single commit was the biggest in the project — an AI chat panel with domain-specific tools, SSE streaming, and operational telemetry, wired into every view.

But that was just the beginning. The AI feature went through its own intense evolution over the next two weeks:

DateCommitWhat Landed
Feb 21b14dadcInitial AI chat: tools, streaming, telemetry
Feb 22a33cd77Reliability hardening, methodology tool
Feb 22f312b36Deep SAP connector insight tools — AI can query ETL data
Feb 2260774f6AI cache freshness layer, cache health diagnostics
Feb 23e291761User-facing error context and failure diagnostics
Feb 2439eb700Streaming stabilization, operational log drilldown
Mar 1f7461f8Page-aware AI context with BU feature flags
Mar 2865ed91AI health probe + System Health diagnostics card
Mar 4ba68f1cExpanded telemetry, health insights, chat feedback
Mar 5b776921Thread memory continuity — AI remembers conversation context

The key architectural insight — which I wrote about in detail in a separate blog post — is that tool-based AI is dramatically more reliable than RAG for structured financial data. Instead of embedding documents and hoping the retrieval step finds the right context, the AI calls specific tools: query the budget for Q3, calculate margin impact of a 2% price increase, compare actuals to forecast. Every answer is traceable to a specific data query, not a probabilistic vector similarity match.

By March 5, the AI had 23 tools across 7 domains, organized into 4 tool profiles selected by regex-matching the user's question. It knew which page the user was viewing and adapted its tool selection accordingly. It streamed responses over SSE with 10 event types. And it had a 3-layer cache (request memo, shared LRU, ETL-aware freshness) to keep token costs from spiraling.


Act 12: Market Analysis and Geographic Intelligence (March 1–5, 2026)

March 1, commit a4cfc8a: "implement upload, dashboard widgets, and AI tooling" for market analysis. This was the 12th view — a competitive intelligence layer that mapped the company's market position geographically.

By March 4, the market analysis view had a national drilldown geography mode (dcf3829), competitor-dominance territory coloring (44f04d6), and fuzzy matching against official South African geography datasets (c60c3d9). Commit e5887e5 added an official geography reference dataset with an unmapped-suggestion report — so when a SAP customer record listed a town the system didn't recognize, it would suggest the closest official match instead of silently dropping it.

March 5, 7e02926: "Implement FY-scoped market analysis and add Region Master admin flow". Market intelligence was now scoped by fiscal year and managed through an admin interface — not a static spreadsheet updated quarterly.

This view was built entirely with AI agents. I specified the data model and component boundaries, the agent wrote the implementation, and I reviewed and refined over 4 days of commits. The geography fuzzy-matching — handling South African place names with their aliases, abbreviations, and colonial-era naming variations — was the kind of domain problem where AI implementation with human review worked exceptionally well.


Act 13: CI/CD and DevOps — The One-Person Pipeline (December 2025 – April 2026)

This act doesn't have a single date because the DevOps story unfolded across the entire project. It started simple and got serious fast.

The Evolution

┌─────────────────────────────────────────────────────────┐
│              CI/CD EVOLUTION                            │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Dec 17       Azure SWA workflow (auto-generated)       │
│       │        + Entra ID auth only                     │
│       ▼                                                 │
│  Feb 7        Test gate + CI/CD scaffolding             │
│       │        + ESLint + Vitest in pipeline            │
│       │        + PR preview environments                │
│       ▼                                                 │
│  Feb 11       Cosmos seed/sync workflow                 │
│       │        + Repeatable DB seeding in CI            │
│       ▼                                                 │
│  Feb 13       Authorization harness in CI               │
│       │        + Golden-snapshot regression             │
│       │        + Matrix mode (quick/full/flaky)         │
│       ▼                                                 │
│  Feb 22       ETL function deployment                   │
│       │        + Kudu ──► Azure Functions action        │
│       │        + Conditional deploy by path             │
│       ▼                                                 │
│  Feb 28       Secret scan CI gate                       │
│       │        + Cost ledger diagnostics                │
│       ▼                                                 │
│  Mar 6        SAP ETL drift guard workflow              │
│       │        + Friday preflight checklist             │
│       ▼                                                 │
│  Mar 29       South Africa App Service workflow         │
│       │        + Smoke tests against EasyAuth           │
│       ▼                                                 │
│  Mar 31       AI Service deploy workflow                │
│       │        + Readiness probes                       │
│       │        + Dark deploy with rollback proof        │
│       ▼                                                 │
│  Apr 7        SWA parked (PR previews disabled)         │
│                + 7 workflows total                      │
│                                                         │
└─────────────────────────────────────────────────────────┘

The 7 Workflows

WorkflowTriggerWhat It Does
Main App DeployPush to mainBuild, test, deploy React SPA + Express API to App Service (SA North)
AI Service DeployPush to main (ai-service/ paths)Build ai-core + ai-service, deploy to dedicated App Service, readiness probe
ETL Function DeployPush to main (etl-sync/ paths)Deploy Node.js sync functions via Azure Functions action
Auth RegressionNightlyRun authorization harness against golden snapshots — catches permission drift
SAP ETL Drift GuardNightlyValidate ETL function app settings haven't drifted from expected config
Secret ScanPush to mainScan for credentials, API keys, connection strings in committed code
SWA StandbyManualDeploy to parked US Static Web App for failover drills

Hard-Won DevOps Lessons

The git log records every DevOps battle:

  • February 7: Five consecutive commits (87ce3a0fdbcdcb) just trying to get tests to pass in CI — ESLint config blocking deploys, E2E tests leaking into the unit test run, Vitest pattern matching too broadly. The test infrastructure took an afternoon to stabilize.
  • February 22: Three attempts at ETL deploy strategy in a single day — Kudu helper (5273f15), then Azure Functions action (47f67a6), then removing ETL from the SWA workflow entirely (24f8ead). Each deployment tool had different failure modes.
  • February 26, commit 51fbc10: "fix(sev1): restore API function indexing by removing cross-folder fiscal lock import". A single ESM import across function boundaries crashed the entire Azure Functions worker. Not one endpoint — all of them. The next commit (0e28a0c) codified the lesson: SEV1 documented, API runtime boundary guard test added to CI.
  • March 6: ETL deploy regression, hotfixes, Kudu polling fixes (bda3eabd292495). Added the drift guard workflow and a Friday preflight checklist because manual SAP exports happen weekly.

No Terraform. No Bicep. PowerShell provisioning scripts for Azure resource creation, because for a solo developer, imperative scripts are easier to debug at 11pm. The infrastructure is simple enough that "just look at the script" is a valid runbook.


Act 14: The Test Harnesses — 245 Test Files and a Decision Oracle (December 2025 – April 2026)

A one-person platform with 11 roles, 29 permissions, 5 SAP connectors, a financial model engine, and an AI copilot doesn't survive without tests. But the test story isn't "I wrote unit tests." It's: I built a custom authorization testing framework with a decision oracle, pairwise scenario generation, quarantine management, and a flaky-detection runner — and then wrapped the rest of the platform in 245 test files across 5 testing layers.

The Testing Stack

┌─────────────────────────────────────────────────────────┐
│                 TESTING ARCHITECTURE                    │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  LAYER 1: Domain Model Tests (Dec 22, 2025)             │
│  ├── Financial model engine: 1,190-line test file       │
│  ├── Services model, fiscal year, budget lock           │
│  └── Pure math — no mocks, no API, just formulas        │
│                                                         │
│  LAYER 2: API Contract Tests (Feb 10 – Apr 2026)        │
│  ├── 56 contract test files                             │
│  ├── RBAC evaluation, customer ABAC, budget lock        │
│  ├── SAP connectors, ETL pipeline, audit schema         │
│  ├── AI chat bootstrap, retry, telemetry                │
│  └── Mocked Cosmos + auth — tests the contract, not DB  │
│                                                         │
│  LAYER 3: Frontend Component Tests (Feb – Apr 2026)     │
│  ├── 89 component contract tests                        │
│  ├── Admin views, SAP insights, market analysis         │
│  ├── AI chat panel, navigation, simulation context      │
│  └── Vitest + jsdom — tests UI logic, not pixels        │
│                                                         │
│  LAYER 4: Authorization Harness (Feb 13, 2026)          │
│  ├── Custom framework: oracle + generators + executors  │
│  ├── 40 golden scenarios (p0-core, products, sentinels) │
│  ├── Pairwise matrix: quick (~15) / full (~100+)        │
│  ├── Quarantine system with expiration dates            │
│  └── Runs nightly at 2 AM UTC in CI                     │
│                                                         │
│  LAYER 5: AI Service Tests (Mar 31, 2026)               │
│  ├── ai-core contracts + extraction + smoke tests       │
│  ├── ai-service runtime + SSE event shape validation    │
│  └── Node.js native test runner (not Vitest)            │
│                                                         │
└─────────────────────────────────────────────────────────┘

Layer 1: The Financial Model Test — December 22, 2025

The very first "real" test in the project was financialModel.test.js — born on December 22 during the Formula Wars (Act 2), when I discovered that a single wrong operator could misstate revenue by 12%. At 1,190 lines, it's still the largest test file in the codebase. It tests the pure calculation engine: margin calculations, discount compounding, freight surcharge modes, actuals-vs-forecast projection, simulation parameter sensitivity. No mocks, no API calls — just math in, numbers out. If this file fails, nothing else matters.

Layer 2: API Contract Tests — 56 Files

Every API endpoint has at least one contract test. These use mocked Cosmos DB containers and mocked authentication — they test the contract (request shape → response shape, permission gate → HTTP status) without hitting real infrastructure. The naming convention tells the story:

  • customers.abac.contract.test.js — ownership-gated writes
  • customerActuals.ownership.contract.test.js — assigned vs. unassigned account managers
  • products.lock.contract.test.js — budget lock enforcement on product edits
  • historicalFy.readonly.contract.test.js — past fiscal years are read-only
  • runtime-boundary.contract.test.js — the SEV1 guard (Act 13): ensures no cross-folder ESM imports crash the entire Functions worker
  • auth.appservice.contract.test.js — App Service EasyAuth header parsing

The .contract. suffix isn't decoration. It means the test verifies a behavioral contract that other systems depend on. Break the contract, break the integration.

Layer 3: Frontend Component Tests — 89 Files

UI tests in Vitest with jsdom. These don't render pixels — they test component logic: does the admin panel show the ETL tab for users with sap_exporter permission? Does the budget lock hero banner appear when the fiscal year is locked? Does the AI chat panel correctly serialize page context?

Key coverage areas:

DomainTest FilesWhat They Verify
Admin Views12Account management, ETL access, audit logs, fiscal year lock events, user governance
SAP Live Insights14Each connector widget: coverage, delivery, forecast variance, growth rates, orderbook, penalties, stock, quantity KPIs
Configuration9Lock hero, regional goals, customer list, post-lock adjustment, global parameters
Market Analysis7Workbook parsing, geographic model, caching, customer scope, widget rendering
AI Chat4Memory management, UI context serialization, markdown rendering, SSE streaming
Navigation6Routing, admin access gates, SAP live badge, feedback sidebar
Simulation4Context state machine, budget plan contract, simulation params, fail-soft fallback

Layer 4: The Authorization Harness — A Custom Testing Framework

This is the one I'm most proud of. On February 13, commit 35de3df landed the full authorization harness framework in a single commit: 1,475 lines across 15 new files. It's not a test suite — it's a testing framework with its own architecture:

┌─────────────────────────────────────────────────────────┐
│           AUTHORIZATION HARNESS ARCHITECTURE            │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Scenario Files (JSON)                                  │
│  ├── golden/p0-core.json     (25 priority-0 baselines)  │
│  ├── golden/products-golden.json  (8 product scenarios) │
│  └── golden/sentinels.json   (7 guardrail scenarios)    │
│          │                                              │
│          ▼                                              │
│  Scenario Loader ──► Schema Validator                   │
│          │                                              │
│          ├──► Decision Oracle                           │
│          │    ├── Permission Evaluator (RBAC layer)     │
│          │    ├── Scope Evaluator (BU layer)            │
│          │    ├── Ownership Evaluator (ABAC layer)      │
│          │    └── Lock Policy Evaluator (time gates)    │
│          │                                              │
│          ├──► Pairwise Generator                        │
│          │    ├── Cartesian product of dimensions       │
│          │    ├── Greedy pair-coverage optimization     │
│          │    └── Exclusion rules for invalid combos    │
│          │                                              │
│          ├──► Executors                                 │
│          │    ├── API Executor (HTTP status codes)      │
│          │    ├── UI Executor (editability + save gates)│
│          │    └── Integration Executor (live backend)   │
│          │                                              │
│          ▼                                              │
│  Harness Runner                                         │
│  ├── Loads scenarios + quarantine policy                │
│  ├── Runs oracle vs. executor comparison                │
│  ├── Detects regressions vs. quarantine recoveries      │
│  └── Writes Markdown coverage reports                   │
│                                                         │
└─────────────────────────────────────────────────────────┘

The Decision Oracle is the core. It's a pure-function authorization evaluator that mirrors the production 3-layer auth stack. Given a scenario (actor + resource + context), the oracle predicts the expected outcome: allow or deny, with a specific reason and the first failed gate. Then the API executor fires the actual HTTP request, and the harness compares oracle prediction vs. actual result. Mismatch = regression.

The oracle handles real authorization complexity: SAP-managed actuals are read-only. Budget-locked fiscal years block most writes — but post_lock permissions allow targeted adjustments to specific fields. The _assigned permission suffix means ownership gates apply. Sentinel scenarios like SENTINEL_ACCOUNT_MANAGER_CANNOT_ADMIN are invariants that must never flip to "allow."

The Pairwise Generator solves the combinatorial explosion. With 6 actor types × 8 entity types × 2 lock states × 2 ownership states × multiple actions, full Cartesian testing would produce thousands of scenarios. The generator uses a greedy pair-coverage algorithm: it selects the minimum set of scenarios that covers every pair of dimension values at least once. Quick mode: ~15 scenarios. Full mode: ~100+. Both run in CI.

The Quarantine System manages known-flaky scenarios without hiding regressions. A quarantined scenario has an owner, a reason, and an expiration date. If it expires without being fixed, the harness treats it as a failure again. Priority-0 and sentinel scenarios cannot be quarantined — if a guardrail fails, the build fails. Period.

Five Test Modes, One CLI

ModeCommandWhat It DoesWhen It Runs
Goldennpm run test:authz:goldenRun 40 hand-written baseline scenariosEvery PR
Quick Matrixnpm run test:authz:matrix:quickPairwise-generated ~15 scenariosEvery PR
Full Matrixnpm run test:authz:matrix:fullPairwise-generated ~100+ scenariosNightly (2 AM UTC)
Flaky Detectionnpm run test:authz:matrix:flakyRun quick mode 3× consecutively, flag non-deterministic resultsNightly
Integrationnpm run test:authz:integrationHit live backend with real Cosmos + real authManual (cost-gated)

The integration mode is intentionally gated behind a repository variable (ALLOW_AZURE_AUTHZ_INTEGRATION=true). Every integration run costs Cosmos DB request units. For a solo developer watching Azure spend, the test harness respects the budget — it defaults to mocked execution and only touches live infrastructure when explicitly told to.

Layer 5: AI Service Tests — Node.js Native Runner

When I extracted the AI copilot into a standalone Fastify service (Act 16), the tests came with it. The ai-core package has its own contract tests, extraction tests, and smoke tests. The ai-service runtime tests validate SSE event shapes — because when you stream AI responses to the browser, the event format is a contract that the frontend depends on. These use Node.js's native test runner, not Vitest — the AI service is a separate workspace with its own runtime.

The Testing Timeline

DateCommitMilestone
Dec 22d2e06c8First financial model tests — zero/negative margin edge cases
Jan 29799b781Production verification tests and health endpoint
Feb 7182662a5 commits in one afternoon to stabilize Vitest in CI
Feb 105e37745First RBAC contract tests alongside DB-backed auth
Feb 1335de3dfFull authz harness: 15 files, oracle, pairwise generator, quarantine
Feb 139e1d5f3Nightly harness workflow + cost guardrails for integration mode
Feb 1617c3924Harness expansion: p0-core golden scenarios, post-lock policy
Feb 20de8a01eRemove legacy Jest API tests, consolidate on Vitest
Feb 2187aa810AI agent feature module tests
Feb 260e28a0cSEV1 runtime boundary guard test (the import crash that killed all endpoints)
Mar 461ced1dRestore fiscal relation mocks — contracts caught the regression
Mar 31d719aaeAI characterization baseline captured before extraction

The test count grew with the platform. Every new feature — SAP connectors, market analysis, AI chat, budget lock, post-lock adjustments — came with its own contract tests. Not because I'm disciplined. Because after the dabe903 incident (Act 5) and the SEV1 import crash (Act 13), I learned that a solo developer's test suite isn't a nice-to-have. It's the only thing standing between a clean deploy and a 2 AM rollback.


Act 15: The Continent Migration (March 29–30, 2026)

I wrote an entire blog post about this, so the summary: over a single weekend, a Codex agent migrated the entire platform from Azure Static Web Apps (US) to Azure App Service (South Africa North). 2,300-line runbook. Zero portal clicks. 28 hours. 537 passing tests. API latency: 180ms → ~5ms.

The key commits: 8ca2dff"Implement South Africa App Service migration path", followed by a series of deploy workflow hardening, EasyAuth enablement (6e936bf), identity migration, cost ledger implementation (250ab5a), and finally 6285063"Park SWA and promote South Africa app".

┌─────────────────────────────────────────────────────────┐
│           HOSTING EVOLUTION                             │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Dec 2025     Single HTML file (local dev only)         │
│       │                                                 │
│       ▼                                                 │
│  Jan 2026     Azure Static Web Apps (US)                │
│       │        + Managed Functions (serverless)         │
│       │        + Free tier                              │
│       │        + 180ms API latency (cross-Atlantic)     │
│       ▼                                                 │
│  Mar 2026     Azure App Service (South Africa North)    │
│       │        + Express host (SSR-capable)             │
│       │        + Co-located with Cosmos DB              │
│       │        + ~5ms API latency (same region)         │
│       ▼                                                 │
│  Mar 2026     SWA parked as standby failover (US)       │
│                + Manual failover drills enabled         │
│                                                         │
└─────────────────────────────────────────────────────────┘


Act 16: Extracting the AI Into a Standalone Service (March 31, 2026)

The day after the continent migration, the AI got its own home.

The AI copilot had been running inside the main Express API — sharing compute resources with every other endpoint. A complex AI query could starve the budget save endpoint of CPU. The extraction happened on March 31: I wrote a 47-page migration runbook, and a Codex agent executed it across parallel git worktrees.

The PRs tell the story: #4 through #20, merged sequentially across a single day, each one a self-contained phase:

PRPhaseWhat It Did
#6Workspace scaffoldnpm workspaces for ai-core package
#4InfrastructureProvision standalone App Service
#7Deploy workflowGitHub Actions for AI service CI/CD
#5CharacterizationBaseline tests for existing AI behavior
#12–13ai-core extractionContracts, SSE shell, module extraction
#14Standalone runtimeFastify service built and tested
#15Backend proxySplit AI and API proxy routes on Express host
#16SSE contractStream provider tokens, freeze SSE protocol
#17Frontend clientRun-aware AI client pointing at new service
#18DiagnosticsAI runtime visibility in System Health
#19CutoverDark deploy with rollback proof
┌─────────────────────────────────────────────────────────┐
│           AI SERVICE EVOLUTION                          │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Feb 21       AI chat as Azure Functions                │
│       │        + Shared compute with API                │
│       │        + SSE streaming over SWA (US)            │
│       ▼                                                 │
│  Mar 29       AI chat on App Service (SA North)         │
│       │        + Same-region streaming (~5ms)           │
│       │        + Still shared with main API             │
│       ▼                                                 │
│  Mar 31       Standalone Fastify AI Service             │
│                + Dedicated App Service instance         │
│                + Extracted ai-core workspace package    │
│                + Independent scaling and deploy cycle   │
│                + 23 tools, 7 domains, 4 tool profiles   │
│                + 3-layer cache, SSE with 10 event types │
│                                                         │
└─────────────────────────────────────────────────────────┘

April 1 brought a series of hotfixes — auth provisioning (78c58d4), operational logging packaging (8c5378e), SSE stream transport (055b1fc), empty stream recovery (6e5e798). The commits are labeled "Hotfix" because they were exactly that — production issues found after the cutover. Every one of them was resolved the same day.


Act 17: What The Platform Looks Like Today

467 commits. 18 acts. Here's the final architecture:

┌─────────────────────────────────────────────────────────┐
│                    BROWSER (SPA)                        │
│  React 19 + Vite + Tailwind CSS + Recharts              │
│  13 Views │ 97 Components │ Financial Model Engine      │
└────────────────────────┬────────────────────────────────┘
                         │ HTTPS
                         ▼
┌─────────────────────────────────────────────────────────┐
│              AZURE APP SERVICE (SA North)               │
│  Express Host │ 24 API Endpoints │ 3-Layer Auth         │
│  RBAC + BU Scope + ABAC Ownership                       │
└──────────┬──────────────────────┬───────────────────────┘
           │                      │
           ▼                      ▼
┌─────────────────────┐  ┌───────────────────────────────┐
│   COSMOS DB NoSQL   │  │     AI SERVICE (Fastify)      │
│   10 Containers     │  │  23 Tools │ SSE Streaming     │
│   /pk Partition Key │  │  Tool Profiles │ 3-Layer Cache│
└─────────────────────┘  └───────────────────────────────┘
           ▲
           │  ETL Pipeline
┌─────────────────────────────────────────────────────────┐
│              SAP DATA INTEGRATION                       │
│  Manual XLS Upload ──► Blob ──► Python Extract          │
│  ──► Blob Artifacts ──► Node.js Sync ──► Cosmos DB      │
└─────────────────────────────────────────────────────────┘


The Full Timeline

#PhasePeriodKey CommitsWhat Changed
1PrototypeDec 9–13720f0f1 → 5816420Single HTML file → React SPA with financial model, tests, simulation sliders, localStorage
2Formula WarsDec 22–27271c199 → ae4aba218 formula fixes, agent contract, services ARR integration, protected model file
3Cloud DeployDec 17 – Feb 82bf8100 → dc33412Azure SWA (US) + Cosmos DB (SA North), Entra ID, partition key consolidation
4Cosmos DB ArchitectureJan 28 – Mar02810dc → e6fce7a10 containers, /pk uniformity, document shapes, seed pipeline, killed cloud dev DB, emulator-first
5dabe903 & HardeningFeb 7–89af6c11 → dabe903Destructive commit, NEVER OMIT CODE rule, admin portal, enterprise hardening sprint
6Authorization ModelFeb 10–1935de3df → ...3-layer auth (RBAC + BU scope + ABAC), 11 roles, 29 permissions, budget lock, nightly regression
7SAP ETL PipelineFeb 13–14f089e87 → 6ad6a007-stage pipeline, Python extractors, Node.js sync, customer & product masters
8Connector PatternFeb 2026Reusable recipe: registry + 4 artifacts + framework + sync + manifest. Pluggable connectors.
9SAP Live InsightsFeb 15–180366d4c → 3369f295 connectors in 5 days. Hero KPI dashboard. Became the default home screen.
10Cost ReckoningFeb 11–284bdae3e → 750f979RU telemetry, manifest-aware SAP caching, configurable TTLs, killed dev Cosmos instance, emulator-first.
11AI CopilotFeb 21 – Mar 5b14dadc → b77692123 tools, SSE streaming, 3-layer cache, page-aware context, thread memory
12Market AnalysisMar 1–5a4cfc8a → 7e02926Geographic intelligence, competitor-dominance mapping, fuzzy geo matching
13DevOps & CI/CDDec 17 – Apr 72bf8100 → 550f8b27 workflows, nightly auth regression, ETL drift guard, secret scan, SEV1 lessons
14Test HarnessesDec 22 – Apr 7d2e06c8 → d719aae245 test files, 5 layers, custom authz harness with decision oracle, pairwise generator, quarantine system
15Continent MigrationMar 29–308ca2dff → 88e6d35US → SA North. SWA → App Service. 180ms → 5ms. Codex-executed, zero portal clicks.
16AI ExtractionMar 3108b614b → 387feabStandalone Fastify AI service. ai-core package. 14 PRs in one day.
17StabilizationApr 1–73822cd3 → 550f8b2Hotfixes, auth hardening, CI cleanup. SWA parked. Production steady-state.

By The Numbers

MetricValue
Total commits467
Development periodDec 9, 2025 → Apr 7, 2026 (~4 months)
Developers1 (with AI coding agents)
React components97
Frontend source files230+
API endpoints24
Shared library modules38
AI tools23 across 7 domains
AI service codebase11,751 lines across 43 files
SAP connectors5
Cosmos DB containers10
CI/CD workflows7
Auth roles / permissions11 / 29
GitHub PRs (AI extraction alone)14 in one day
"Safety commit" messagesAt least 6 (I was learning)


Act 18: The AI Development Story — Architect and Delegate

A single developer didn't write 467 commits in four months by typing every line. I used AI coding agents — Claude Code and OpenAI Codex — throughout the project. But the relationship evolved dramatically.

In December, I was pair-programming with Claude — writing specifications, reviewing output line by line, manually fixing mistakes. By March, I was handing Codex a 47-page runbook and walking away while it executed a multi-phase migration across parallel git worktrees. The shift was possible because of one thing: the contracts got better.

My previous project taught me that AI agents without architectural governance produce unmaintainable code at terrifying speed. So this project had contracts from the start: coding rules, protected files, partition key conventions, ESM-only enforcement, import registration requirements. Every rule was written in response to a specific incident:

  • dabe903 destroying 200 lines → "NEVER omit code for brevity"
  • An ESM require() crash → "ESM only — never use require() in .js files"
  • Partition key chaos → "All containers use PK path /pk"
  • A cross-folder import crashing all Functions → "New Functions must be imported in index.js"
  • Formula regression → "financialModel.js is protected — don't modify without approval"

This development model — architect, contract, delegate, review — changed the economics of solo development. The bottleneck shifted from typing speed to clarity of specification. The agents are a multiplier, not a replacement. They turned one architect into what would have required a team of 4–5 developers.


What I Learned

1. You don't need to migrate your ERP to get modern analytics

The XLS-upload pipeline will never win an architecture award. But it works, it's been running for months, and it cost nothing to build. 17,000 companies are stuck on legacy SAP. A pipeline that turns yesterday's Excel export into today's live dashboard is worth more than a migration roadmap that never starts.

2. Build the calculation engine first, UI second

The pure financial model function from Day 1 survived every refactor, every migration, every architecture change. Everything else is a projection of that engine. If your math is wrong, your charts are beautiful lies.

3. Real-time simulation changes how people think about budgets

Before this platform, budget revisions were a seldom event. Now a sales director drags a margin slider and immediately sees the impact on PBT, customer profitability, and logistics costs. That's a bigger organizational shift than any technology choice.

4. Authorization is harder than the features it protects

11 roles, 29 permissions, BU scoping, ABAC ownership, field-level whitelists, post-lock time gates, and a nightly regression test. More engineering time than any dashboard view. But it's why finance trusts the platform with real data.

5. Fail early, fail in the git log

Those "safety commit" messages from December? The 18 formula fixes on December 22nd? The dabe903 incident? The SEV1 that crashed all API endpoints? They're all in the history forever. Good. Every failure became a rule in the agent contract, a test in the regression suite, or a convention in the architecture.

6. AI agents need contracts, not prompts

My previous project's monolith was built on prompts. This platform was built on contracts. The difference is the difference between a codebase you're proud of and one you're hiding.

* * *

Try This Yourself

If you're sitting on a legacy ERP and dreaming of modern analytics:

  1. Start with the export, not the API. Every ERP can export to Excel. Build a pipeline that consumes those exports. You'll have a working system in weeks instead of months of integration work.
  2. Build the calculation engine as a pure function. No UI dependencies. No state management. Just inputs and outputs. Test it before you write a single component.
  3. Choose a document database with a consistent partition strategy. One partition key path everywhere. A single configuration file as the source of truth. Cosmos DB's serverless tier costs pennies for planning workloads.
  4. Make simulation synchronous. Client-side recalculation under 16ms. No loading spinners. Finance people think in rapid "what if" loops — any latency kills adoption.
  5. Invest in authorization before features. RBAC, BU scope, and ownership-based restrictions before the first real user touches the system.
  6. Add AI tools, not AI chat. Domain-specific tools that call your calculation engine will always be more accurate than a RAG pipeline that embeds your spreadsheets.
  7. Automate everything a solo developer can forget. Nightly auth regression. ETL drift guards. Secret scanning. Friday preflight checklists. The CI pipeline is your team.
  8. Write contracts, not prompts. Claude Code and Codex turn one architect into a delivery team — but only if the specifications are unambiguous enough to be executed autonomously.

* * *

The Bottom Line

My company may eventually migrate to S/4HANA if forced by the big tech company to do so. The ECC clock is ticking. But in the meantime, the business has a financial planning and business intelligence platform that's faster, more interactive, and more insightful than anything SAP's standard reporting ever provided — and it was built by one person in four months for less than the cost of a single SAP consulting engagement, or having to drop a few hundred thousand rands on a development agency!

The 55–75% ERP implementation failure rate isn't just a statistic. It's a signal that the industry's default answer — "upgrade everything at once" — doesn't work for most companies. Sometimes the smarter move is to build a bridge: extract the data you need, compute the insights you want, and let the legacy system keep doing what it's always done — processing transactions.

467 commits ago, this was a single HTML file. Today it's 13 views, 24 API endpoints, 5 SAP connectors, 23 AI tools, and a platform that finance, sales, and executives use daily. The git log has every step — including the embarrassing ones.

The ERP doesn't need AI. Your decisions need AI. Build around the gap.