Wednesday, 8 April 2026

How I Built a Full BI Platform With AI Because SAP Wanted Seven Figures to Upgrade

From a single HTML file that started out as an experiment to build a budget model simualtion tool, on a Monday night to a 467-commit, 13-view financial intelligence BI platform with AI — the real timeline of building enterprise software when your ERP is stuck in the 90s and your budget for fixing it is zero.

The Backstory 

I work at a mid-market manufacturer for traffic signal controllers. Our ERP is SAP — the old from the 90s kind. ECC 6.0, no HANA, no Fiori, no embedded analytics. The kind of system where "reporting" means exporting an ALV grid to Excel and emailing it to finance, business review meetings wasting a ton of time preparing excels and converting to charts and PowerPoints. Every now and again, the question comes up: are we upgrading to S/4HANA or latest SAP?

The answer, every time, is the same: not yet. And the reason is always money. For a medium-size South African company, SAP is becoming prohibitively expensive to upgrade and sustain ongoing licensing costs. What we have today is good enough and working. But as the new guy, this wasn't working for me...

SAP S/4HANA migrations run $250K minimum for small companies, scaling to millions for mid-market. Consulting rates are up 30–50% as the 2027 ECC end-of-mainstream-support deadline approaches, and Gartner projects that nearly half of the 35,000 ECC customers — roughly 17,000 companies — will still be running the legacy system past that date. We're one of them.

But here's what changed: in late 2025, I'd just come off a bruising 8-month personal hobby project where I learned the hard way what happens when you "vibe code" with AI agents without architecture governance — a 3,000-line monolith in a single JavaScript file, three AI models fighting each other, and a codebase I was ashamed to show anyone. That failure taught me an important lesson as I climbed the learning curve of building with AI assistants: AI agents are only as good as your architectural contracts. Code is a liability. Architecture is the asset. Specification and Planning are crucial activities that must not be ignored. I've shared my experiences on previous blog posts. I was ready to build a real-world enterprise application on the back of the success I had with personametry.com (another hobby project).

So when December 2025 rolled around and I sat down to first solve a real work problem: How do I plan my budget for next financial year? What margins should my account managers work with? How do I simulate different outcomes? How do I track actuals versus forecasts. I designed a detailed financial model, this was the MVP. Then the next big opportunity was integrating SAP ERP reporting problem - how do I stop all the tedious manual prep work that my managers go through when preparing for business ops reviews. Wouldn't it be nice if I ran my meetings with a live dashboard? 

I didn't start by prompting an AI. I started by designing a financial model on paper, scoped the business needs and worked with the AI to map out the product in fair amount of detail. I even used sample PowerPoint presentations to guide the visual display requirements for the dashboards. Then I added the real game changer - integrated AI financial assistant. Along this journey, the product vision was right but the engineering execution evolved over time - because at first, I had zero budget to play with - so I went all in on using free services. Early on in the project, I decided to design for multi-tenancy - figuring that if the app adds value to my business unit, surely the other group businesses (with a similar business model to mine) could also be onboarded :)

This is the story of what happened — told through the git log, because that's the only history that doesn't lie.

I am a General Manager for a full business unit. I am also an ex-principal software engineer but stopped writing code since 2010! So I dabble in AI in my spare time and have drunk the cool aid on building with AI agents. I wanted to test whether I could build a real world enterprise application completely with AI (Antigravity, Claude Code and Codex) from the success I've had with my personal projects. Saving me time. Saving me money. One person, no outsourced developers costs (typically an app like mine would cost no less than R300k and take at least 3-6 months with an agency)...

But first: here's what the platform looks like today, so you know where this story is headed.


The Architecture at a Glance

┌─────────────────────────────────────────────────────────────────────┐
│                         USERS                                       │
│  Finance │ Sales Directors │ Account Managers │ Executives │ Admin  │
└─────────────────────────────┬───────────────────────────────────────┘
                              │ HTTPS (Entra ID SSO)
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│                 AZURE APP SERVICE (South Africa North)              │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  EXPRESS HOST                                               │    │
│  │  Serves React SPA + Proxies API + Proxies AI Service        │    │
│  └──────────┬────────────────────────┬─────────────────────────┘    │
│             │                        │                              │
│             ▼                        ▼                              │
│  ┌─────────────────────┐  ┌──────────────────────────────────┐      │
│  │  API LAYER          │  │  AI SERVICE (Fastify)            │      │
│  │  24 Endpoints       │  │  Dedicated App Service           │      │
│  │  3-Layer Auth       │  │  23 Tools │ 4 Tool Profiles      │      │
│  │  ─────────────────  │  │  SSE Streaming (10 event types)  │      │
│  │  Products           │  │  3-Layer Cache                   │      │
│  │  Customers          │  │  ──────────────────────────────  │      │
│  │  Actuals/Forecasts  │  │  ai-core shared workspace pkg    │      │
│  │  Budget Config      │  │  Financial Model Engine          │      │
│  │  Definitions        │  │  (same math as frontend)         │      │
│  │  SAP Live Insights  │  └──────────────────────────────────┘      │
│  │  Admin / Audit      │                                            │
│  │  User Governance    │                                            │
│  │  Maintenance/Health │                                            │
│  └──────────┬──────────┘                                            │
└─────────────┼───────────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    AZURE COSMOS DB (SA North, Serverless)           │
│                                                                     │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────────────┐    │
│  │ products  │ │ customers │ │ actuals   │ │ config            │    │
│  │ {bu}_{fy} │ │ {bu}_{fy} │ │ {bu}_{fy} │ │ {bu}_{fy}         │    │
│  └───────────┘ └───────────┘ └───────────┘ └───────────────────┘    │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────────────┐    │
│  │definitions│ │ users     │ │ groups    │ │ feedback          │    │
│  │ defs_{bu} │ │ {userId}  │ │ {groupId} │ │ feedback          │    │
│  └───────────┘ └───────────┘ └───────────┘ └───────────────────┘    │
│  ┌───────────┐ ┌───────────┐                                        │
│  │ audit     │ │ rbac      │  ALL containers: partition key /pk     │
│  │ log       │ │ rbac      │  Single source of truth:               │
│  └───────────┘ └───────────┘  containerConfig.js                    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    SAP DATA INTEGRATION                             │
│                                                                     │
│  SAP ECC 6.0 ──► Manual XLS Export ──► Browser Upload               │
│       │                                                             │
│       ▼                                                             │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐       │
│  │ Blob:    │───►│ Event    │───►│ Python   │───►│ Blob:    │       │
│  │ sap-     │    │ Grid     │    │ Extract  │    │ etl-     │       │
│  │ exports  │    │ Trigger  │    │ Function │    │ artifacts│       │
│  └──────────┘    └──────────┘    └──────────┘    └────┬─────┘       │
│                                                       │             │
│                  ┌──────────┐    ┌──────────┐         │             │
│                  │ Cosmos   │◄───│ Node.js  │◄────────┘             │
│                  │ Upsert   │    │ Sync     │  EventGrid            │
│                  │ + Cache  │    │ Function │  Trigger              │
│                  │ Manifest │    │ + Verify │                       │
│                  └──────────┘    └──────────┘                       │
│                                                                     │
│  5 Connectors: Sales │ Debtors │ Order Book │ Delivery │ Stock      │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    CI/CD (GitHub Actions)                           │
│                                                                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐   │
│  │ Main App     │  │ AI Service   │  │ ETL Functions            │   │
│  │ Deploy       │  │ Deploy       │  │ Deploy                   │   │
│  │ (push main)  │  │ (ai-service/)│  │ (etl-sync/)              │   │
│  └──────────────┘  └──────────────┘  └──────────────────────────┘   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐   │
│  │ Auth         │  │ ETL Drift    │  │ Secret                   │   │
│  │ Regression   │  │ Guard        │  │ Scan                     │   │
│  │ (nightly)    │  │ (nightly)    │  │ (push main)              │   │
│  └──────────────┘  └──────────────┘  └──────────────────────────┘   │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ SWA Standby (US, parked) — manual failover drills            │   │
│  └──────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    AUTHORIZATION MODEL                              │
│                                                                     │
│  Entra ID Token ──► RBAC (11 roles, 29 perms) ──► BU Scope          │
│                      ──► ABAC Ownership ──► Field Whitelist         │
│                                                                     │
│  allow = hasPermission AND hasBuScope AND ownsResource              │
│                                                                     │
│  Budget Lock Lifecycle:                                             │
│  PLANNING ──► LOCKED (review) ──► EXECUTION (scoped edits)          │
│                  │                                                  │
│                  ▼                                                  │
│            STRUCTURAL UNLOCK (CEO/admin, temporary)                 │
└─────────────────────────────────────────────────────────────────────┘

The platform is a React 19 single-page application with 13 views and 97 components, served by an Express host on Azure App Service in South Africa North. Behind it: 24 API endpoints with a 3-layer authorization model (identity, RBAC, and attribute-based ownership), 10 Cosmos DB containers using a unified /pk partition key, a standalone AI service running Fastify with 23 domain-specific tools, and a 7-stage ETL pipeline that pulls data from SAP through exported spreadsheets.

The financial model engine — a pure JavaScript function with no React dependencies — is the foundation. It's shared across the frontend (for real-time simulation), the backend (for API-driven calculations), and the AI service (so the copilot can run the same math the dashboard shows). One source of truth for the numbers.

Users authenticate via Microsoft Entra ID. Every API call passes through RBAC (11 roles, 29 permissions), BU scope enforcement, and for write operations, ABAC ownership checks. The budget lock lifecycle governs when data can be edited and by whom. Nightly CI workflows validate that permission rules haven't drifted.

SAP data arrives through manual Excel exports uploaded by users — the pipeline handles parsing, normalization, anomaly detection, database sync, and cache invalidation automatically. The client caches SAP data with an 8-day TTL matched to the weekly export cadence, checking a lightweight version manifest before fetching fresh data. The monthly Cosmos DB bill: single digits.

Seven GitHub Actions workflows cover deployment, authorization regression, ETL drift detection, and secret scanning. An Azure Static Web App in the US is parked as a standby failover.

That's the destination. Here's how it got built.


Act 1: The Single HTML File (December 9, 2025)

Commit 720f0f1. 5:41 PM on a Tuesday. That's when the first line of code landed.

It was a single HTML file. One page. A profitability chart and a dashboard. Nothing fancy — just the financial model I'd sketched on paper, implemented as a pure JavaScript calculation function, with a basic React frontend bolted on top. By 6:51 PM — seventy minutes later — I had monthly actuals, rolling forecasts, and inputs formatted in Rands (millions).

I didn't know it at the time, but that calculation function — the one I wrote before any UI, before any backend, before any database — would survive every refactor, every migration, every architecture overhaul for the next four months. It lives today in an extracted workspace package, shared between the frontend, the API server, and the AI service. It is the single most important file in the entire codebase.

By the end of December 11 — three days in — I had 32 commits: a simulation mode with margin sliders, region groupings, customer concentration risk analysis, product portfolio editor, variance analysis, and a product mix risk component. The commit messages tell the story of a developer moving fast: "safety commit", "safety checkin - still working on simulation chart", "Milestone v1: Completed Chart Logic & Simulation Refactor".

And then, the commit that defined the project's DNA:

e40c116"Major refactor to original monolith demo app (old_app included for reference)"

I'd learned from my previous project. The moment the single file got unwieldy, I broke it apart. Not later. Not after it became painful. Immediately. The old monolith was preserved in a folder called old_app — a reminder of what happens when you don't refactor early.

By December 13, commit 5816420, I had comprehensive tests for the financial model: margin calculations, PBT (Profit before Tax), break-even analysis, customer analysis, and edge cases. The math was locked down before I had a real backend.


Act 2: The Formula Wars (December 22, 2025)

December 22nd has 18 commits. That's not a flex — it's a confession.

I sat down that morning to validate the financial model against a real-world scenario, and the numbers didn't add up. What followed was the most humbling day of the project: a cascade of formula bugs that I squashed one by one, each revealing the next.

CommitWhat Was Wrong
271c199Product mix weights weren't contributing to revenue percentages correctly
69e3e4bGoal-seek base margin excluded customer discounts, freight, and insurance — required revenue was materially understated
55aa219Simulation inflation was calculated but never returned — UI defaulted to 1 and ignored adjustments
fce1acaFreight/insurance used additive formula instead of multiplicative (compound) — causing price drift
55c1c77Components read monthlyRevenue but model exported Revenue — zero budget series in charts
505d5e4Discount impact understated profit erosion by 40-50% by multiplying discount by post-discount margin
5a8f931NaN propagation when customers had no active line items
f052953Infinity values causing display crashes while being mathematically "correct"

Eight formula bugs in one day. Each one producing numbers that looked plausible but were silently wrong. This is the nightmare scenario for any financial planning tool — beautiful charts telling beautiful lies.

The fix wasn't just patching each bug. It was tightening the agent contract (9c9607b) so AI agents couldn't introduce formula regressions. From that day forward, the financial model was a protected file — no AI agent modifies it without explicit approval.

The next day, December 23, brought more: budget config wasn't persisting across reloads (6ad3355), discounts were being applied to the full invoice price including logistics charges (33a7515), and margin labels were misleading (516b72e). By December 24, I'd separated business logic from views entirely (c7238bc). And on December 27, the Services ARR model was integrated into the engine (ae4aba2).

The prototype phase was over. I had a financial model I could trust — and a healthy fear of "correct-looking" calculations.


Act 3: From localStorage to the Cloud (December 17 – February 8, 2026)

For the first five weeks, all data lived in the browser's localStorage. It worked for a single-user prototype. It was never going to work for an organization.

First Cloud Deploy: December 17

Commit 2bf8100: the Azure Static Web Apps workflow file landed. SWA's free tier was the cheapest entry point — a React SPA with built-in serverless API support and Microsoft Entra ID for authentication. I blocked GitHub auth on the same day (dc13e7c): "Block GitHub auth - allow only Microsoft/Entra ID login". This was an enterprise app from day one.

But SWA's free tier deployed to the US. The database — Azure Cosmos DB — was always in South Africa North, because that's where data sovereignty requirements pointed. So every request took a round trip: browser in South Africa → SWA in the US → Cosmos DB back in South Africa → return through the US → back to the browser. The API itself was crossing the Atlantic twice per call, adding ~180ms of pure network latency.

The Big Bang: January 28

A 3-week gap in the git log (Dec 27 → Jan 13) marks where I was designing the backend architecture during my summer vacation and doing my personal project (personametry.com). Then on January 28, six commits in four hours: 02810dc"Phase 1: Cosmos DB API & SWA Auth Workaround". Azure Functions v4 as the API layer, Cosmos DB NoSQL as the persistence layer, and the beginning of the end for localStorage.

The Partition Key Incident: February 8

Commit 59cf7f0: "Align all partition keys to production /pk". Followed immediately by 01c013e: "Single source of truth for container partition keys — prevent future drift".

I'd discovered that different scripts were using different partition key paths for the same containers. Seeding scripts used one path, API endpoints used another, sync jobs used a third. Documents were landing in the wrong logical partitions. Queries returning empty for data that definitely existed.

The fix was radical: every container uses the same partition key path (/pk), and a single configuration file (containerConfig.js) became the sole source of truth. No script, no function, no seed job is allowed to hardcode a partition key. This one decision eliminated an entire class of bugs.

┌─────────────────────────────────────────────────────────┐
│              DATA LAYER EVOLUTION                       │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Dec 2025     localStorage (browser-only)               │
│       │                                                 │
│       ▼                                                 │
│  Jan 2026     Cosmos DB (SA North)                      │
│       │        + accessed via SWA Functions in US       │
│       │        + cross-Atlantic API latency             │
│       │        + partition key path chaos               │
│       ▼                                                 │
│  Feb 2026     Cosmos DB (SA North, consolidated)        │
│       │        + /pk everywhere                         │
│       │        + containerConfig.js as single truth     │
│       │        + killed the spare dev cloud instance    │
│       ▼                                                 │
│  Mar 2026     Cosmos DB (SA North, co-located)          │
│                + App Service in same region = ~5ms      │
│                + Cosmos Emulator for local dev          │
│                + 10 containers, all /pk                 │
│                                                         │
└─────────────────────────────────────────────────────────┘


Act 4: Under the Hood of Cosmos DB — 10 Containers, One Partition Key, and a Seed Pipeline (January 28 – March 2026)

Act 3 told the story of getting to the cloud. This act opens the hood and shows what's inside. Because the database isn't just storage — it's the entire data model, the authorization boundary, the audit trail, and the seed pipeline. Get the database wrong and nothing else works.

The Architecture

┌─────────────────────────────────────────────────────────┐
│                  COSMOS DB NoSQL                        │
│              Serverless • South Africa North            │
│              Database: myapp-db-prod                    │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │   products   │  │  customers   │  │   actuals    │   │
│  │  pk: BU_FY   │  │  pk: BU_FY   │  │  pk: BU_FY   │   │
│  │  line items  │  │  targets     │  │  forecasts   │   │
│  │  unit costs  │  │  discounts   │  │  ownership   │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
│                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │    config    │  │ definitions  │  │   feedback   │   │
│  │  pk: BU_FY   │  │ pk: defs_BU  │  │ pk: feedback │   │
│  │  BU settings │  │ or defs_BU_FY│  │  global pool │   │
│  │  thresholds  │  │ SAP summaries│  │              │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
│                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │    users     │  │    groups    │  │    audit     │   │
│  │  pk: userId  │  │ pk: groupId  │  │  pk: log     │   │
│  │  profileRoles│  │  org config  │  │  append-only │   │
│  │  BU scope    │  │              │  │  1-year TTL  │   │
│  │  assignments │  │              │  │              │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
│                                                         │
│  ┌──────────────┐                                       │
│  │     rbac     │  All containers: partition key = /pk  │
│  │  pk: rbac    │  All queries: WHERE c.pk = @pk        │
│  │  role ──►perm│  Single source of truth:              │
│  │  mapping     │    containerConfig.js                 │
│  └──────────────┘                                       │
│                                                         │
└─────────────────────────────────────────────────────────┘

10 containers. All using the same partition key path (/pk) but different partition key values. That uniformity was the result of the February 8 incident (Act 3) — and it's the single most important data architecture decision in the project.

The Partition Key Strategy

Cosmos DB's partition key determines data co-location, query efficiency, and cost. Get it wrong and you're paying for cross-partition fan-out on every read. Get it right and most queries cost 1–5 Request Units.

ContainerPK Value PatternExampleWhy This Partitioning
products{buId}_{fy}bu-acme_FY26All products for a BU's fiscal year land in one partition — single read per planning view
customers{buId}_{fy}bu-acme_FY26Customer list is always BU + FY scoped. One partition = one query, no fan-out
actuals{buId}_{fy}bu-acme_FY26Actuals and forecasts partition-aligned with customers they reference
config{buId}_{fy}bu-acme_FY26Singleton BU config per fiscal year — thresholds, margins, planning parameters
definitionsdefs_{buId} or defs_{buId}_{fy}defs_bu-acmeReference data (regions, categories, SAP summaries) — BU-scoped with FY override
feedbackfeedbackfeedbackSingle global partition — low volume, no BU isolation needed
users{userId}Entra Object IDOne partition per user — point reads for auth, no cross-user queries in hot paths
groups{groupId}group-acmeOrg-level config, rarely queried
auditloglogAppend-only stream with 1-year TTL — write-heavy, query-rare
rbacrbacrbacGlobal role → permission map — cached 30s, queried rarely

The four BU/FY-partitioned containers (products, customers, actuals, config) are the hot path. Every planning view loads WHERE c.pk = '{buId}_{fy}' — a single-partition query that returns everything the view needs in 5–10 RUs. No joins. No aggregation. The financial model engine receives the full dataset and computes everything client-side.

Document Shapes

Every document follows a consistent contract: id, pk, entityType, buId, fy, plus domain-specific fields. Here's what a customer document looks like:

{
  "id":           "cust-1",
  "pk":           "bu-acme_FY26",
  "entityType":   "customer",
  "buId":         "bu-acme",
  "fy":           "FY26",
  "name":         "METRO MUNICIPALITY A",
  "regionId":     "reg-001",
  "category":     "Metro",
  "targetRevenue": 25000000,
  "discount":     5,
  "productMix":   { "li-wp-200": 30, "li-wp-300": 20 },
  "updatedAt":    "2026-01-28T05:00:00Z",
  "updatedBy":    "user@company.example",
  "_etag":        "\"00000000-0000-0000-....\""
}

The _etag matters. Cosmos DB's optimistic concurrency uses ETags — if two users edit the same customer simultaneously, the second write fails with a 412 Precondition Failed. The API passes If-Match headers on every write. No distributed locks, no pessimistic locking, no conflict resolution logic — just "first writer wins, second writer retries."

User documents are the most complex — they carry the full authorization profile:

{
  "id":              "2d1f0f0a-...",
  "pk":              "2d1f0f0a-...",
  "email":           "user@company.example",
  "profileRoles":    ["account_manager", "sales_admin"],
  "roles":           ["account_manager", "sales_admin"],
  "tokenRoles":      ["authenticated"],
  "homeBuId":        "bu-acme",
  "editBuIds":       ["bu-acme"],
  "assignmentByBu":  {
    "bu-acme": {
      "assignedSalesGroupIds":  ["sg-001"],
      "assignedRegionIds":      ["reg-001"],
      "assignedCustomerIds":    []
    }
  },
  "isActive":        true,
  "lastLogin":       "2026-01-28T06:00:00Z"
}

Notice profileRoles vs. tokenRoles. The platform token from Entra ID says "authenticated." The database says "account_manager, sales_admin." The database wins. This is the DB-authoritative identity model — the RBAC container holds the role definitions, the users container holds the role assignments, and the API merges them at request time. The identity provider handles who you are; the database decides what you can do.

Product documents carry nested line items — each product category contains the individual SKUs with unit costs, margins, and units of measure:

{
  "id":            "prod-widgets",
  "pk":            "bu-acme_FY26",
  "entityType":    "product",
  "buId":          "bu-acme",
  "fy":            "FY26",
  "name":          "Widgets",
  "isActive":      true,
  "lineItems":     [
    { "id": "li-wl-100", "name": "Widget Lite",
      "unitCost": 35000, "margin": 45, "uom": "EA" },
    { "id": "li-wp-200",  "name": "Widget Pro",
      "unitCost": 52000, "margin": 42, "uom": "EA" }
  ],
  "version":       7,
  "_etag":         "\"00000000-0000-....\""
}

The financial model engine reads all products and all customers for a BU/FY in two parallel queries, then computes margins, projections, and simulation results entirely in-memory. The lineItems array means product structure is self-contained — no joins, no reference lookups during calculation.

Audit documents are append-only with a ttl field — Cosmos DB's built-in time-to-live. After 365 days, the document silently disappears. No cleanup job, no scheduled delete, no storage cost after expiration:

{
  "id":            "aud_1706425200000_abc123",
  "pk":            "log",
  "schemaVersion": 2,
  "timestamp":     "2026-01-28T07:00:00.000Z",
  "eventType":     "WRITE",
  "userId":        "a1b2c3d4-...",
  "action":        "UPDATE",
  "entityType":    "customer",
  "entityId":      "cust-1",
  "buId":          "bu-acme",
  "changes": {
    "entries": [
      { "path": "targetRevenue",
        "changeType": "updated",
        "before": 20000000, "after": 25000000 }
    ]
  },
  "ttl":           31536000
}

Every write to the platform — customer updates, product edits, role changes, budget lock events — generates an audit document. The changes.entries array captures the exact diff: which field changed, what it was before, what it is now. The admin audit log view queries this partition with filters on action type, entity, date range, and actor — but always within pk = "log".

The RBAC container holds the role → permission mapping as a single document per role:

{
  "id":          "role-account_manager",
  "pk":          "rbac",
  "roleId":      "account_manager",
  "permissions": [
    "products.read", "customers.read",
    "customers.edit_assigned",
    "actuals.edit_assigned",
    "sap.view"
  ]
}

The _assigned suffix on permissions is the bridge between RBAC and ABAC — it tells the authorization layer "this user has the permission, but only for resources they own." The API evaluates ownership at request time by checking the user's assignmentByBu against the resource's customer, region, or sales group.

The Definitions Container — Where SAP Data Lives

The definitions container is the most architecturally interesting one. It holds both reference data (regions, categories, account managers) and all SAP ETL output — coexisting in the same container, differentiated by defType. Every SAP connector publishes three document types here:

┌─────────────────────────────────────────────────────────┐
│          DEFINITIONS CONTAINER: SAP DOCUMENT MAP        │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  pk = defs_{buId}                                       │
│                                                         │
│  REFERENCE DATA (non-SAP):                              │
│  ├── defType: regions                                   │
│  ├── defType: categories                                │
│  ├── defType: account-managers                          │
│  └── defType: field-technicians                         │
│                                                         │
│  SAP SUMMARIES (1 per connector):                       │
│  ├── defType: sap-sales-insights                        │
│  ├── defType: sap-debtors-insights                      │
│  ├── defType: sap-stock-insights                        │
│  ├── defType: sap-orderbook-insights                    │
│  └── defType: sap-delivery-insights                     │
│                                                         │
│  SAP ETL RUNS (1 per run per connector):                │
│  ├── defType: sap-sales-etl-run                         │
│  ├── defType: sap-debtors-etl-run                       │
│  ├── defType: sap-stock-etl-run                         │
│  ├── defType: sap-orderbook-etl-run                     │
│  └── defType: sap-delivery-etl-run                      │
│                                                         │
│  SAP DETAIL DOCS (partitioned, for AI drill-down):      │
│  ├── defType: sap-sales-customer-transactions           │
│  ├── defType: sap-debtors-customer-ledger               │
│  ├── defType: sap-stock-material-ledger                 │
│  ├── defType: sap-orderbook-customer-lines              │
│  └── defType: sap-delivery-customer-lines               │
│                                                         │
│  CACHE MANIFEST (1 per BU):                             │
│  └── defType: sap-cache-manifest                        │
│                                                         │
└─────────────────────────────────────────────────────────┘

A SAP sales summary document is the richest — it's what the dashboard hero cards read on every page load:

{
  "id":              "sap_sales_insights_bu-acme",
  "pk":              "defs_bu-acme",
  "type":            "definition",
  "defType":         "sap-sales-insights",
  "buId":            "bu-acme",
  "importStatus":    "SUCCESS",
  "runId":           "20260215T103000Z",
  "generatedAt":     "2026-02-15T10:30:00Z",
  "reconciliation":  {
    "sourceRows": 4218,
    "includedRows": 4193,
    "excludedRows": 25,
    "isBalanced": true
  },
  "fiscalYearsAvailable": ["FY25", "FY26"],
  "yoy":             { /* year-over-year comparisons */ },
  "totalsByFiscalYear": { "FY25": 48200000, "FY26": 31400000 },
  "topCustomers":    [ /* max 50, ranked by value */ ],
  "topProducts":     [ /* max 50, ranked by value */ ],
  "ledgerPreview":   [ /* max 120 rows for quick inspection */ ],
  "lastImport":      {
    "status": "SUCCESS",
    "timestamp": "2026-02-15T10:30:00Z"
  }
}

The detail documents are where the SAP ETL gets interesting. Large ledger datasets are partitioned into chunks — a connector with 4,000 transaction rows splits them into multiple documents using partIndex and partCount:

{
  "id":              "sap_sales_customer_tx_bu-acme_FY26_cust-1_0",
  "pk":              "defs_bu-acme",
  "defType":         "sap-sales-customer-transactions",
  "customerId":      "cust-1",
  "fy":              "FY26",
  "partIndex":       0,
  "partCount":       3,
  "rowCount":        500,
  "customerRowCount": 1420,
  "totals":          { "valueZar": 8500000, "grossProfitZar": 3200000 },
  "rows": [
    {
      "customerId":    "cust-1",
      "customerName":  "METRO MUNICIPALITY A",
      "invoiceNo":     "9001234",
      "invoiceDate":   "2025-09-15",
      "partNumber":    "WP-001",
      "description":   "Widget Pro Unit",
      "netQty":        12,
      "valueZar":      624000,
      "costZar":       361000,
      "grossProfitZar": 263000,
      "marginRatio":   0.4215,
      "isCreditLine":  false
    }
    /* ... up to 500 rows per part */
  ]
}

The AI copilot uses these detail documents for drill-down queries — "show me the top 5 products sold to Municipality A this fiscal year" doesn't hit SAP; it queries the pre-extracted, pre-normalized ledger rows sitting in Cosmos DB. Same pattern for each connector: stock has material ledger parts, debtors has customer aging ledger parts, orderbook has order line parts, delivery has shipment line parts.

The cache manifest document ties the whole system together — it tracks which connector was last refreshed and what version the cache is at:

{
  "id":           "sap-cache-manifest",
  "pk":           "defs_bu-acme",
  "defType":      "sap-cache-manifest",
  "version":      17,
  "connectors":   {
    "sales":      { "lastRunAt": "2026-03-28T09:15:00Z", "lastRunId": "20260328T091500Z" },
    "debtors":    { "lastRunAt": "2026-03-28T09:20:00Z", "lastRunId": "20260328T092000Z" },
    "stock":      { "lastRunAt": "2026-03-21T14:00:00Z", "lastRunId": "20260321T140000Z" },
    "orderbook":  { "lastRunAt": "2026-03-28T09:25:00Z", "lastRunId": "20260328T092500Z" },
    "delivery":   { "lastRunAt": "2026-03-28T09:30:00Z", "lastRunId": "20260328T093000Z" }
  }
}

Every successful ETL sync bumps version. The client-side cache checks this version on a 1-hour interval. If it's changed, stale SAP data is evicted from localStorage and the next dashboard load pulls fresh summaries. No manual refresh required — the data flows from SAP export through Python extraction, Node.js sync, Cosmos DB, manifest bump, client cache invalidation, and into the dashboard cards. The definitions container is the meeting point where SAP's transactional world becomes the platform's analytical world.

Query Patterns — Why Everything Is Cheap

Three query patterns cover 95% of all database reads:

PatternExampleRU CostWhen Used
Point Read.item(id, pk).read()~1 RUUser auth check, single doc fetch, config lookup
Partition QueryWHERE c.pk = @pk5–10 RULoad all products/customers for a BU/FY view
Partition + FilterWHERE c.pk = @pk AND c.defType = @type5–15 RUSAP summaries, definition lookups with FY fallback

Cross-partition queries exist in exactly one place: the audit log viewer (admin-only, paginated, never in a hot path). Everything else is partition-targeted. The financial model view loads all products and all customers for a BU/FY in two parallel partition queries — then the calculation engine runs entirely in-memory. The database is a data loader, not a computation engine.

The Definitions Fallback Pattern

The definitions container has a unique query pattern: FY-scoped with BU fallback. When the app requests region definitions for FY26, it first checks pk = defs_bu-acme_FY26. If nothing is there, it falls back to pk = defs_bu-acme (the BU-level default). This allows FY-specific overrides without duplicating every definition for every fiscal year:

  FY-scoped lookup: pk = defs_bu-acme_FY26
       │
       ├── Found? ──► Use FY-scoped definition
       │
       └── Not found?
            │
            ▼
  BU-scoped fallback: pk = defs_bu-acme
       │
       └── Found? ──► Use BU default

The Client — Lazy Proxies and Environment Detection

The Cosmos DB client uses a JavaScript Proxy pattern for lazy initialization. The first time any code accesses containers.products, the proxy intercepts, creates the CosmosClient, connects to the database, and caches container references. Subsequent accesses go straight to the cache. If the connection string points to localhost:8081, the client automatically disables TLS certificate validation for the emulator's self-signed cert.

Two connection modes:

  • Connection String (default): AccountEndpoint=...;AccountKey=... — simple, works everywhere
  • Managed Identity: endpoint-only, authenticates via Azure AD — no secrets in config, but requires App Service hosting

Killing the Cloud Dev Database

For months, the project maintained two Cosmos environments: production (myapp-db-prod), cloud dev (myapp-db-dev), and local emulator (myapp-db-emulator) was later added. The cloud dev database was supposed to let multiple developers share a development environment. In practice, one developer meant zero benefit and double the Azure bill. I wanted to use developer best practices when I started, but later realized this was overkill, and due to an AI mistake, incurred unexpected costs.

Commit e6fce7a (March 29): "Retire cloud dev Cosmos DB targeting." The cloud dev database was killed. All non-production development moved to the local Cosmos Emulator — free, fast, no network dependency. The legacy COSMOS_USE_SUPPORT_DEV_DB flag was preserved but now redirects to the emulator. A guardrail function (assertNoCloudNonProdDatabaseTarget()) blocks any code from accidentally routing dev traffic to the cloud.

The Seed-and-Sync Pipeline

An enterprise database is useless without repeatable data seeding. The seed-and-sync.js orchestrator runs a 5-stage pipeline:

┌─────────────────────────────────────────────────────────┐
│              SEED-AND-SYNC PIPELINE                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Stage 1: init-containers                               │
│  └── Create all 10 containers from containerConfig.js   │
│                                                         │
│  Stage 2: seed-global-settings                          │
│  └── RBAC baseline, role defaults                       │
│                                                         │
│  Stage 3: seed-rbac                                     │
│  └── Role ──► permission mapping (11 roles, 29 perms)   │
│                                                         │
│  Stage 4: seed-data                                     │
│  └── BU-scoped definitions (products, regions, etc.)    │
│                                                         │
│  Stage 5: seed-initial-data                             │
│  └── BU planning data (products, customers, actuals)    │
│                                                         │
│  Optional: --include-sap-customer-master                │
│           --include-sap-product-master                  │
│           --include-user-backfill                       │
│           --include-audit-backfill                      │
│                                                         │
│  Verify: --verify                                       │
│  └── definitions, products, actuals, lock state, users  │
│                                                         │
│  Promote: --sync                                        │
│  └── Mirror dev ──► prod (non-destructive by default)   │
│                                                         │
└─────────────────────────────────────────────────────────┘

The pipeline is idempotent — run it ten times and the result is the same. Production promotion uses a non-destructive mirror by default: it upserts documents but never deletes. A destructive reseed requires an explicit double-confirmation flag (--allow-production-reseed --confirm-production-reseed=RESEED_PRODUCTION). You can't accidentally wipe production with a typo.

The same pipeline runs in CI via a GitHub Actions workflow (cosmos-seed-sync.yml), with manual dispatch for production operations. The verification stage validates document shapes, partition key consistency, lock state integrity, and user access grants — catching data model drift before it reaches users.

What I Learned About Cosmos DB

The partition key decision made in January determined the cost profile for the entire project. Every query that includes WHERE c.pk = @pk is cheap. Every query that doesn't is expensive. By making every container use the same partition key path and making partition targeting a non-negotiable API design rule, I turned a serverless database with pay-per-RU pricing into a predictable-cost data layer. The monthly bill stayed flat even as usage grew — because query efficiency was baked into the schema, not bolted on later.


Act 5: The dabe903 Incident and Enterprise Hardening (February 7–8, 2026)

February 7 was the day the platform stopped being a prototype and started being enterprise software. Nine commits that day — and one of them became the most referenced commit in the project's history.

dabe903: "Close data model gaps - categories, zones, If-Match". That innocent-sounding commit destroyed 200 lines of the customer list component. The AI agent, trying to "simplify" the code, had omitted entire sections it deemed redundant. The next morning, commit c35436c: "Restore Customer Product Mix Editor destroyed in dabe903". And immediately after: 4f1daab — the addition of the NEVER OMIT CODE rule to the agent contract, in bold, in all caps.

But February 7 also delivered the admin portal (7300a32: "Implement Admin Portal — User Governance, Audit, Fiscal Years, System Health"), the navigation redesign (5d06d9e), and the first RBAC permissions (1d0ac3f). February 8 brought audit logs, comprehensive API logging, the partition key alignment, and the seed-and-sync orchestrator. In two days, the app went from a prototype with localStorage to a governed platform with user management, audit trails, and production-safe database tooling.


Act 6: The Authorization Model — 11 Roles, 29 Permissions, and a Nightly Regression (February 10–19, 2026)

Authorization is the least glamorous feature I built. It doesn't make for exciting demos. But it's the reason the CXOs trusts this platform with real financial data, and it consumed more engineering time than any dashboard view.

The 3-Layer Stack

┌─────────────────────────────────────────────────────────┐
│  LAYER 1: IDENTITY (Entra ID)                           │
│  ─ SSO via Microsoft identity platform                  │
│  ─ Token validated on every API request                 │
│  ─ No local passwords, no session management            │
└────────────────────────┬────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────┐
│  LAYER 2: RBAC (Database-Backed, 30s TTL cache)         │
│  ─ 11 roles mapped to 29 permissions                    │
│  ─ Stored in Cosmos DB, runtime-updatable               │
│  ─ Wildcard (*) for admin roles                         │
│  ─ No redeployment needed to change permissions         │
└────────────────────────┬────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────┐
│  LAYER 3: ABAC (Attribute-Based Access Control)         │
│  ─ BU Scope: viewBuIds / editBuIds per user             │
│  ─ Customer ownership: region, sales group, or direct   │
│  ─ Field-level whitelist on write operations            │
│  ─ Time-gated: post-budget-lock permissions             │
│  ─ Immutable fields: buId, fy, customerId (always)      │
└─────────────────────────────────────────────────────────┘

The rule: allow = hasPermission AND hasBuScope AND ownsResource. All three layers must pass. The server is the security boundary — the UI hides buttons and fields for convenience, but every write operation is re-validated server-side with the full permission stack.

The 11 Roles

I designed the role hierarchy after mapping every real-world persona in the company to what they need to see and do:

RoleTypeWhat They Can Do
adminPlatformEverything (wildcard *). Manage users, roles, config. Structural unlock.
group_adminPlatformSame as admin — for multi-tenant group-level oversight
group_ceoExecutiveRead everything. Structural unlock (reopen a locked budget). No day-to-day edits.
group_financeBusinessFull financial editing: budgets, actuals, global parameters, definitions. Cross-BU visibility.
finance_managerBusinessFinance subset: edit actuals and parameters, but can't create products or customers.
sales_adminBusinessFull customer CRUD across all regions. Assign account managers. Upload SAP data.
account_managerScopedEdit only assigned customers (ABAC-enforced). View all customers in BU for context.
sap_exporterPlatformViewer baseline + SAP file upload + ETL status monitoring
viewerPlatformRead-only baseline: products, customers, actuals, plans, dashboards
authenticatedPlatformViewer defaults — for users with Entra ID login but no explicit role assignment yet
anonymousPlatformViewer defaults — mapped from SWA anonymous auth (legacy, parked)

The 29 Permissions (Grouped by Domain)

DomainPermissions
Productsproducts.view, product.create, product.edit_price
Customerscustomers.view, customers.view.assigned, customer.create, customer.create_post_lock_all, customer.create_post_lock_assigned, customer.delete, customer.edit_sales_data, customer.edit_sales_data_assigned, customer.edit_target_post_lock_all, customer.edit_target_post_lock_assigned
Actualsactuals.view, actuals.edit, actuals.edit_assigned
Budget Planbudget.plan.view, budget.plan.edit
Global Paramsglobal_params.view, global_params.edit
Definitionsdefinitions.view
Feedbackfeedback.view, feedback.create, feedback.comment, feedback.vote
Business Configbusiness_config.structural_unlock
SAPsap.upload, sap.status
AIai.chat

Two design patterns worth noting:

  • The _assigned suffix pattern: Permissions like customer.edit_sales_data_assigned vs customer.edit_sales_data encode the ABAC scope directly in the permission name. An account manager gets the _assigned variant; a sales admin gets the unrestricted one. The API resolves this at request time — no special-case branching needed.
  • The post_lock time gates: Permissions like customer.create_post_lock_all only activate after the budget is locked. During planning phase, they're irrelevant. During execution phase, they're the only write permissions that survive. The budget lock state governs which permission set is in effect — the roles themselves don't change.

The Budget Lock Lifecycle

┌──────────┐         ┌──────────┐         ┌──────────┐
│ PLANNING │────────►│  LOCKED  │────────►│EXECUTION │
│  MODE    │  lock   │  (review)│  approve│  MODE    │
│          │         │          │         │          │
│ All edits│         │ Read-only│         │ Limited  │
│ allowed  │         │ for most │         │ edits    │
│          │         │ roles    │         │ (scoped) │
└──────────┘         └──────────┘         └──────────┘
                          │
                          │ structural
                          │ unlock
                          ▼
                     ┌──────────┐
                     │ UNLOCKED │
                     │ (CEO or  │
                     │  admin)  │
                     └──────────┘

During planning mode, everyone with the right permissions can edit products, customers, targets, and parameters. When the budget is locked (6bde4cb), the system enters review mode — most edits are frozen. After approval, execution mode allows only post-lock permissions: actuals entry, assigned-customer adjustments. A CEO or admin can perform a structural unlock (5bb436c, March 3) that temporarily reopens the budget without resetting the lifecycle — handling the real-world scenario where the board changes a target number mid-year and finance needs to adjust the approved plan.

The Admin Portal

The admin portal (7300a32, Feb 7) is where governance becomes tangible. It has four sections:

┌─────────────────────────────────────────────────────────┐
│                    ADMIN PORTAL                         │
├──────────────┬──────────────────────────────────────────┤
│              │                                          │
│  SECTIONS    │  USER GOVERNANCE                         │
│              │  ─ Role assignment per user              │
│  Users       │  ─ BU view/edit scope configuration      │
│  Audit       │  ─ Customer/region/sales group assignment│
│  Fiscal Year │  ─ Active/inactive user toggle           │
│  Health      │  ─ SAP exporter delegation               │
│              │  ─ DB-authoritative onboarding flow      │
│              │                                          │
│              │  AUDIT LOG                               │
│              │  ─ Deep diff (field-level change history)│
│              │  ─ Sensitive field redaction             │
│              │  ─ Denied-write audit trail              │
│              │  ─ Login auditing + usage analytics      │
│              │                                          │
│              │  FISCAL YEAR MANAGEMENT                  │
│              │  ─ FY lifecycle (planning/locked/exec)   │
│              │  ─ Historic FY read-only mode            │
│              │  ─ Structural unlock controls            │
│              │                                          │
│              │  SYSTEM HEALTH                           │
│              │  ─ Cosmos DB connectivity                │
│              │  ─ ETL pipeline status per connector     │
│              │  ─ AI service health probes              │
│              │  ─ Storage diagnostics                   │
│              │  ─ Cost ledger (Azure spend tracking)    │
│              │                                          │
└──────────────┴──────────────────────────────────────────┘

The user governance flow evolved significantly. By February 19, commit 49ca324: "Implement DB-authoritative user onboarding and role admin flow" — new users who authenticate via Entra ID are automatically provisioned with viewer defaults, and an admin can then assign roles, BU scope, and customer ownership through the portal. No more email requests to me to get someone access.

The audit system (cc410b5, Feb 13) records every mutation with a deep diff — not just "customer X was edited" but which fields changed, from what values to what values, by whom, at what time. Sensitive fields are redacted in the audit log. Denied writes are audited too (17c3924) — so when an account manager tries to edit a customer outside their scope, the attempt is logged even though it's blocked.

The Authorization Harness

February 13, commit 35de3df: "add full authorization harness framework, scenarios, and docs". This is the testing infrastructure that makes the whole permission model trustworthy.

The harness runs every role-permission-resource combination against golden snapshots. It has three modes:

  • Golden mode: captures the current permission state as the baseline truth
  • Quick mode: runs a focused subset for fast CI feedback
  • Full matrix mode: tests every role against every permission against every resource scope — the exhaustive check

A nightly CI workflow runs the harness against the golden snapshots. If any code change accidentally grants or removes access, the pipeline fails. This catches the kind of bug that no unit test would find — a permission string typo, a missing scope check in a new endpoint, an ABAC condition that doesn't fire for a specific role. The harness has caught real regressions that would have shipped to production.

The evolution timeline of the auth system:

  1. Feb 7: Hardcoded RBAC permissions (1d0ac3f)
  2. Feb 7: Admin Portal with user governance, audit, fiscal year management (7300a32)
  3. Feb 8: Audit logs working, API error handling (bc646c4)
  4. Feb 10: DB-backed RBAC with 30-second TTL cache (5e37745)
  5. Feb 10: BU scope enforcement (4673200)
  6. Feb 11: ABAC customer ownership (55b54a7)
  7. Feb 11: Budget lock policy (6bde4cb)
  8. Feb 11: Unified budget plan permission model (81a8942)
  9. Feb 11: Definitions permission-key RBAC (2466acd)
  10. Feb 13: Authorization harness framework (35de3df)
  11. Feb 13: Enterprise audit v2 with deep diff and redaction (cc410b5)
  12. Feb 16: Audited post-lock customer adjustments (7f15f61)
  13. Feb 16: Denied-write audit trail (17c3924)
  14. Feb 19: DB-authoritative user onboarding (49ca324)
  15. Feb 19: Login auditing and usage analytics (fe23556)
  16. Feb 21: SAP exporter role delegation (2a170ac)
  17. Mar 3: Structural unlock for locked fiscal years (5bb436c)

Seventeen commits over a month, and I'm only counting the major ones. Authorization is the feature that never stops evolving — every new capability (SAP upload, AI chat, market analysis, post-lock adjustments) required a new permission, a new scope check, and a new harness scenario.


Act 7: The SAP Bridge — Building the ETL Pipeline (February 13–14, 2026)

Two days. That's how long it took to build the plumbing that connects SAP to the dashboard.

Our SAP system has no OData services. No RFC connections. No BTP integration layer. The data lives behind transaction codes and ALV grids. So I built a pipeline that works with what SAP actually gives you: exported spreadsheets.

February 13, commit f089e87: "add SAP sync pipeline and searchable budget customer selection". By the end of February 14, I had customer master sync, product master sync, an admin editor, and FY planning reset tooling — all backed by a 7-stage event-driven pipeline.

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  1. UPLOAD   │     │  2. BLOB     │     │  3. EVENT    │
│  User uploads│────►│  sap-exports │────►│  EventGrid   │
│  XLS via UI  │     │  container   │     │  triggers    │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                  │
                                                  ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  6. COSMOS   │     │  5. NODE.JS  │     │  4. PYTHON   │
│  Containers  │◄────│  Sync to DB  │◄────│  Extract &   │
│  updated     │     │  + verify    │     │  normalize   │
└──────────────┘     └──────────────┘     └──────────────┘
                                                  │
                                          ┌───────┴──────┐
                                          │  BLOB STORE  │
                                          │ etl-artifacts│
                                          │ 4 JSON files │
                                          │ per run      │
                                          └──────────────┘
                           ▼
                    ┌──────────────┐
                    │  7. CACHE    │
                    │  Invalidation│
                    │  + audit log │
                    └──────────────┘

A user exports from SAP (transaction codes like VA05, FBL5N, MB52, VL06O, MB5B) as XLS and uploads through the dashboard. EventGrid fires, Python parses the SAP-specific layout, normalizes data, flags anomalies, outputs 4 JSON artifacts (summary, ledger, anomalies, run metadata). A second trigger fires Node.js sync, which upserts into Cosmos DB and invalidates caches. End to end: about 30 seconds. The old process — emailing spreadsheets, manually copy-pasting into a shared workbook — took days.

Each connector has three synchronized registries — Python extraction, Node.js sync, and API seeding. If the Python extractor changes its output schema, the Node.js sync catches the mismatch before it corrupts the database.

Why not a direct SAP API? Enabling OData on our ECC system would require external SAP team involvement, change management, and transport requests — a 3-month process minimum. The XLS exports are already part of existing business workflows, so adoption cost is zero. And the pipeline approach means SAP is completely decoupled — if we migrate to S/4HANA someday, only the Python extractors need updating.


Act 8: The Connector Pattern — A Reusable Recipe for SAP Reports (February 2026)

Act 7 showed the pipeline. This act shows the pattern — because the pipeline was deliberately designed to make adding new SAP reports trivial. Five connectors (sales, debtors, orderbook, stock, delivery) all follow the same architecture. When I add detailed stock management insights next, I won't build new plumbing. I'll follow the recipe.

The Connector Contract

Every SAP connector is defined by a frozen registry entry — identical across three runtimes (Python extraction, Node.js sync, API layer):

{
  key:           "stock",
  label:         "SAP Stock Insights",
  status:        "active",
  artifacts: {
    summary:     "stock-insights-summary.json",
    ledger:      "stock-insights-ledger.json",
    anomalies:   "stock-insights-anomalies.json",
    run:         "stock-insights-run.json"
  },
  defTypes: {
    summary:     "sap-stock-insights",
    run:         "sap-stock-etl-run"
  }
}

That's the entire interface. Everything else — extraction logic, summary calculations, anomaly detection — is connector-specific. But the shape of the output is fixed. Four artifacts. Two Cosmos document types. One manifest bump. Every connector, every time.

The Four Artifacts

Every ETL run produces exactly four JSON files:

ArtifactContentsConsumer
SummaryYear-over-year metrics, monthly breakdowns, totals by fiscal yearDashboard hero cards, AI copilot summary queries
LedgerNormalized transaction rows — every line item from the SAP exportVerification (reconciliation), AI drill-down queries
AnomaliesData quality issues: missing fields, out-of-range values, assumptions appliedAdmin pipeline view, AI diagnostics
RunMetadata: connector, runId, timestamps, import status, reconciliation countersPipeline monitoring, cache invalidation, audit trail

The run contract is the gate. A run can only be promoted to "latest" (and cached data invalidated) if the run document confirms: importStatus=SUCCESS, latestPointerUpdated=true, and reconciliation.isBalanced=true. That last check is critical — no-loss reconciliation. The number of source rows must equal included rows plus excluded rows. If a single row drops between extraction and summary, the pipeline fails. No silent data loss.

The Extraction Framework

A shared Python framework (sap_live_connector_framework.py) provides the scaffolding that every extractor reuses:

┌─────────────────────────────────────────────────────────┐
│           PYTHON EXTRACTION FRAMEWORK                   │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  resolve_connector_input_file()                         │
│  └── Find the latest XLS in the connector's folder      │
│                                                         │
│  build_run_context(buId, connector, runId)              │
│  └── Create timestamped output directories              │
│       history/{runId}/ and latest/                      │
│                                                         │
│  build_run_payload()                                    │
│  └── Initialize run metadata with required fields       │
│                                                         │
│  validate_run_contract()                                │
│  └── Enforce reconciliation balance + required fields   │
│                                                         │
│  copy_latest()                                          │
│  └── Promote history/{runId}/ ──► latest/ on success    │
│                                                         │
│  cleanup_history(keep=3)                                │
│  └── Bounded retention — keep last N runs               │
│                                                         │
└─────────────────────────────────────────────────────────┘

The extractor itself only needs to implement the domain-specific part: parse SAP headers, normalize rows, compute summaries, detect anomalies. The framework handles input resolution, output paths, run tracking, promotion, and cleanup.

The Sync Layer

After extraction, a Node.js sync script reads the four artifacts and publishes to Cosmos DB:

  • Summary document (defType: sap-{connector}-insights) — the dashboard reads this
  • Run document (defType: sap-{connector}-etl-run) — the admin pipeline view reads this
  • Detail documents (optional, connector-specific) — the AI copilot reads these for drill-down queries

On success, the sync bumps the sap-cache-manifest version in Cosmos DB. Client-side caches detect the version change on their next manifest check (1-hour TTL) and invalidate stale SAP data. Users see fresh numbers without a manual refresh.

Adding a New Connector: The Recipe

This is the part that makes the pattern worth documenting. When I add a new report — say, detailed stock management — here's the exact checklist:

┌─────────────────────────────────────────────────────────┐
│      RECIPE: ADD A NEW SAP REPORT CONNECTOR             │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Step 1: REGISTRY (3 files, ~20 lines each)             │
│  ├── api/lib/sapLiveConnectorRegistry.js                │
│  ├── etl-extract/lib/connector_registry.py              │
│  └── etl-sync/lib/sapLiveConnectorRegistry.js           │
│  Define: key, label, artifact names, defTypes           │
│                                                         │
│  Step 2: EXTRACTOR (1 Python file)                      │
│  └── etl-extract/extractors/extract_sap_{name}.py       │
│  Implement: parse headers, normalize rows, compute      │
│  summaries, detect anomalies, validate reconciliation   │
│  Reuse: framework helpers for everything else           │
│                                                         │
│  Step 3: SYNC SCRIPTS (2 Node.js files)                 │
│  ├── etl-sync/sync/sync-sap-{name}-live-insights.js     │
│  └── etl-sync/sync/verify-sap-{name}-live-insights.js   │
│  Implement: read artifacts, publish to Cosmos,          │
│  verify writes match run contract                       │
│                                                         │
│  Step 4: LOCAL SEED + VERIFY (2 files)                  │
│  ├── api/scripts/seed-sap-{name}-live-insights.js       │
│  └── api/scripts/verify-sap-{name}-live-insights.js     │
│  Wire: Python extraction ──► Node.js sync ──► verify    │
│                                                         │
│  Step 5: WIRE ORCHESTRATION (1 edit)                    │
│  └── etl-sync/functions/etlSync.js                      │
│  Add entry to SCRIPT_MAP: connector ──► sync + verify   │
│                                                         │
│  Step 6: AI INTEGRATION (optional, 1 edit)              │
│  └── ai-service/src/lib/runtime/aiToolExecutor.js       │
│  Add: connector ──► defTypes, detail doc type,          │
│  queryable row fields                                   │
│                                                         │
│  Step 7: TEST                                           │
│  └── Upload XLS via admin UI, verify end-to-end:        │
│       artifacts ──► Cosmos docs ──► manifest bump       │
│       ──► cache invalidation ──► dashboard renders      │
│                                                         │
└─────────────────────────────────────────────────────────┘

Steps 1, 5, and 6 are config edits — under 20 lines each. Step 2 is the real work: parsing the SAP-specific Excel layout, which varies per transaction code. Steps 3 and 4 follow a template so closely that an AI agent could generate them from the registry definition. The upload endpoint, EventGrid triggers, Blob storage paths, cache invalidation, and admin UI all work automatically — they're connector-agnostic.

That's the design goal: the pipeline is fixed, the connectors are pluggable. The same architecture that handles sales data handles stock levels, debtors aging, delivery tracking, and order books — and will handle whatever SAP report comes next.


Act 9: SAP Live Insights — Five Connectors in Five Days (February 15–18, 2026)

The ETL pipeline was the plumbing. This act is the payoff.

February 15, commit 0366d4c: "ship end-to-end SAP Live Insights ETL, dashboard, and ops docs". The sales connector went live — the first time anyone in the company could see SAP sales data in a browser dashboard instead of an emailed spreadsheet. Within the same day, I shipped 14 more commits: FY-segmented rankings, customer governance enrichment, hero KPI cards, performer strategy toggles, product performance analytics, and quantity-driven insights. The data was finally flowing, and every question finance had ever asked could now be answered with a card on the dashboard.

Then came the expansion:

DateConnectorKey CommitDashboard KPIs Added
Feb 15Sales0366d4cRevenue by customer/product, growth rates, budget coverage, top-N performers
Feb 17Debtors9b25ab1Receivables aging, projected realization, payment patterns
Feb 17Order Book8b2eba7Backlog runway, order-to-cash velocity, delivery schedules
Feb 17Deliveryb537157On-time delivery %, penalty exposure, fulfillment rates
Feb 18Stock3369f29Inventory levels, slow-moving analysis, valuation

Each connector was a full vertical slice: Python extractor, Node.js sync, verification script, Cosmos schema, and a set of dashboard hero cards with drill-down widgets. The debtors connector doesn't just show outstanding balances — it projects realization based on historical payment patterns and flags deteriorating customers. The delivery connector computes on-time delivery percentages with a two-pass exclusion filter, then surfaces penalty exposure KPIs.

The Dashboard That Became the Home Screen

February 18, commit 86ed0b2: "Reorganize navigation and default to SAP Live Insights". The SAP view replaced the budget dashboard as the app's landing page. Users wanted what's happening right now before what should be happening according to the plan.

┌─────────────────────────────────────────────────────────┐
│              SAP LIVE INSIGHTS DASHBOARD                │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐               │
│  │ REVENUE  │  │ DEBTORS  │  │ ORDER    │               │
│  │ Hero KPI │  │ Hero KPI │  │ BOOK     │               │
│  │ vs Target│  │ Aging    │  │ Runway   │               │
│  └──────────┘  └──────────┘  └──────────┘               │
│                                                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────────┐   │
│  │ DELIVERY │  │ STOCK    │  │ BUDGET COVERAGE      │   │
│  │ OTD %    │  │ Valuation│  │ Invoiced + Pipeline  │   │
│  │ Penalties│  │ Movement │  │ vs Plan              │   │
│  └──────────┘  └──────────┘  └──────────────────────┘   │
│                                                         │
│  ┌──────────────────────────────────────────────────┐   │
│  │  CONNECTOR STATUS CARDS                          │   │
│  │  Last refreshed │ Run history │ Anomaly count    │   │
│  └──────────────────────────────────────────────────┘   │
│                                                         │
└─────────────────────────────────────────────────────────┘

The insight cards evolved rapidly. Early versions showed raw numbers. By February 17, they projected realization rates. By February 19, an OEM vs services margin breakdown (797e7e9). By February 27, a forecast variance widget with monthly YTD charts (cbca3fa). Each refinement came from sitting with finance and asking: "what question are you still opening SAP to answer?"


Act 10: The Cost Reckoning — Caching SAP Data That Only Changes Weekly (February 11–28, 2026)

Here's what happens when you build five SAP connectors in five days and make them the app's home screen: everyone opens the dashboard, and every page load triggers Cosmos DB reads for sales, debtors, orderbook, stock, and delivery data. Multiply that by every user, every session, every tab refresh. The Request Unit meter was climbing, and the data it was fetching hadn't changed since last Tuesday's SAP export.

The wake-up call came on February 11. Commit 4bdae3e: "Optimize save flows and add RU-impact telemetry/checklist". That commit added telemetry to measure the actual RU cost of every save and read operation. The numbers were sobering — the app was hammering Cosmos DB for data that only changed once a week, when someone uploaded a fresh SAP export.

The fix wasn't a simple HTTP cache header. SAP data has a specific refresh cadence — weekly exports from SAP transaction codes — and the caching system needed to be aware of that cadence. I built it in layers:

The Cache Architecture

┌─────────────────────────────────────────────────────────┐
│              CLIENT (Browser)                           │
│                                                         │
│  ┌──────────────────────────────────────────────────┐   │
│  │  localStorage Cache                              │   │
│  │  ─ Per-connector, per-BU, per-FY entries         │   │
│  │  ─ 8-day default TTL (matches weekly cadence)    │   │
│  │  ─ Configurable via env vars                     │   │
│  │  ─ Max 14-day absolute ceiling                   │   │
│  └────────────────────┬─────────────────────────────┘   │
│                       │ Is manifest version stale?      │
│                       ▼                                 │
│  ┌──────────────────────────────────────────────────┐   │
│  │  Manifest Version Check (1-hour TTL)             │   │
│  │  ─ Lightweight: fetches version number only      │   │
│  │  ─ If version unchanged: serve from cache        │   │
│  │  ─ If version bumped: invalidate + refetch       │   │
│  └────────────────────┬─────────────────────────────┘   │
│                       │ Only on cache miss              │
└───────────────────────┼─────────────────────────────────┘
                        ▼
┌─────────────────────────────────────────────────────────┐
│              SERVER (API + Cosmos DB)                   │
│                                                         │
│  ┌──────────────────────────────────────────────────┐   │
│  │  SAP Cache Manifest (Cosmos document)            │   │
│  │  ─ Global version counter                        │   │
│  │  ─ Per-connector: lastRunAt, lastRunId           │   │
│  │  ─ Bumped automatically by ETL sync pipeline     │   │
│  │  ─ Admin can force-refresh via maintenance API   │   │
│  └──────────────────────────────────────────────────┘   │
│                                                         │
└─────────────────────────────────────────────────────────┘

The key insight: instead of caching the data itself with a dumb TTL, cache the manifest version with a short TTL (1 hour), and cache the data with a long TTL (8 days). The client checks the manifest version cheaply — a single tiny Cosmos read — and only refetches the actual SAP data if the version number has been bumped by a new ETL run. This means 100 users opening the dashboard in the same hour generate 100 manifest checks (a few RUs each) instead of 500 full connector data fetches (hundreds of RUs each).

The timeline of the cost optimization story:

DateCommitWhat Happened
Feb 114bdae3eAdded RU-impact telemetry — first time we could see the actual cost of each operation
Feb 2260774f6AI cache freshness layer + cache health diagnostics in admin portal
Feb 229d8e8edRemoved monitoring extras, stripped down to low-cost execution model
Feb 23a4c5aefThe big one: SAP manifest-aware caching — 1,624 lines across 38 files. Server-side manifest, client-side localStorage cache, per-connector invalidation, admin refresh API.
Feb 2304e816aConfigurable TTLs tuned to weekly SAP export cadence (8-day default, 14-day max)
Feb 25c2e0da0Reduced save-path writes — fewer unnecessary Cosmos updates on non-dirty saves
Feb 28750f979Killed the dev cloud Cosmos instance, moved to emulator-first local development
Mar 1c5a1453Low-RU cross-user refresh for market analysis — shared cache state across users
Mar 25e04ee50Static cache for market analysis — stopped re-fetching data that changes quarterly

When the ETL pipeline syncs a fresh SAP export, the last step bumps the manifest version in Cosmos. The next time any user's browser checks the manifest (within the hour), it sees the new version, invalidates its local cache for that connector, and fetches fresh data. Every other connector's cache stays warm. An admin can also force-refresh from the admin portal if an urgent re-export happens mid-week.

The result: Cosmos DB RU consumption dropped dramatically. The monthly bill went from "noticeable" to "single digits." And the dashboard felt faster — most page loads were hitting warm local caches instead of waiting for cross-region Cosmos reads.


Act 11: The AI Copilot (February 21 – March 5, 2026)

Commit b14dadc, February 21 at 8:34 PM: "ship context-aware finance chat with tools, streaming UX, and telemetry". That single commit was the biggest in the project — an AI chat panel with domain-specific tools, SSE streaming, and operational telemetry, wired into every view.

But that was just the beginning. The AI feature went through its own intense evolution over the next two weeks:

DateCommitWhat Landed
Feb 21b14dadcInitial AI chat: tools, streaming, telemetry
Feb 22a33cd77Reliability hardening, methodology tool
Feb 22f312b36Deep SAP connector insight tools — AI can query ETL data
Feb 2260774f6AI cache freshness layer, cache health diagnostics
Feb 23e291761User-facing error context and failure diagnostics
Feb 2439eb700Streaming stabilization, operational log drilldown
Mar 1f7461f8Page-aware AI context with BU feature flags
Mar 2865ed91AI health probe + System Health diagnostics card
Mar 4ba68f1cExpanded telemetry, health insights, chat feedback
Mar 5b776921Thread memory continuity — AI remembers conversation context

The key architectural insight — which I wrote about in detail in a separate blog post — is that tool-based AI is dramatically more reliable than RAG for structured financial data. Instead of embedding documents and hoping the retrieval step finds the right context, the AI calls specific tools: query the budget for Q3, calculate margin impact of a 2% price increase, compare actuals to forecast. Every answer is traceable to a specific data query, not a probabilistic vector similarity match.

By March 5, the AI had 23 tools across 7 domains, organized into 4 tool profiles selected by regex-matching the user's question. It knew which page the user was viewing and adapted its tool selection accordingly. It streamed responses over SSE with 10 event types. And it had a 3-layer cache (request memo, shared LRU, ETL-aware freshness) to keep token costs from spiraling.


Act 12: Market Analysis and Geographic Intelligence (March 1–5, 2026)

March 1, commit a4cfc8a: "implement upload, dashboard widgets, and AI tooling" for market analysis. This was the 12th view — a competitive intelligence layer that mapped the company's market position geographically.

By March 4, the market analysis view had a national drilldown geography mode (dcf3829), competitor-dominance territory coloring (44f04d6), and fuzzy matching against official South African geography datasets (c60c3d9). Commit e5887e5 added an official geography reference dataset with an unmapped-suggestion report — so when a SAP customer record listed a town the system didn't recognize, it would suggest the closest official match instead of silently dropping it.

March 5, 7e02926: "Implement FY-scoped market analysis and add Region Master admin flow". Market intelligence was now scoped by fiscal year and managed through an admin interface — not a static spreadsheet updated quarterly.

This view was built entirely with AI agents. I specified the data model and component boundaries, the agent wrote the implementation, and I reviewed and refined over 4 days of commits. The geography fuzzy-matching — handling South African place names with their aliases, abbreviations, and colonial-era naming variations — was the kind of domain problem where AI implementation with human review worked exceptionally well.


Act 13: CI/CD and DevOps — The One-Person Pipeline (December 2025 – April 2026)

This act doesn't have a single date because the DevOps story unfolded across the entire project. It started simple and got serious fast.

The Evolution

┌─────────────────────────────────────────────────────────┐
│              CI/CD EVOLUTION                            │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Dec 17       Azure SWA workflow (auto-generated)       │
│       │        + Entra ID auth only                     │
│       ▼                                                 │
│  Feb 7        Test gate + CI/CD scaffolding             │
│       │        + ESLint + Vitest in pipeline            │
│       │        + PR preview environments                │
│       ▼                                                 │
│  Feb 11       Cosmos seed/sync workflow                 │
│       │        + Repeatable DB seeding in CI            │
│       ▼                                                 │
│  Feb 13       Authorization harness in CI               │
│       │        + Golden-snapshot regression             │
│       │        + Matrix mode (quick/full/flaky)         │
│       ▼                                                 │
│  Feb 22       ETL function deployment                   │
│       │        + Kudu ──► Azure Functions action        │
│       │        + Conditional deploy by path             │
│       ▼                                                 │
│  Feb 28       Secret scan CI gate                       │
│       │        + Cost ledger diagnostics                │
│       ▼                                                 │
│  Mar 6        SAP ETL drift guard workflow              │
│       │        + Friday preflight checklist             │
│       ▼                                                 │
│  Mar 29       South Africa App Service workflow         │
│       │        + Smoke tests against EasyAuth           │
│       ▼                                                 │
│  Mar 31       AI Service deploy workflow                │
│       │        + Readiness probes                       │
│       │        + Dark deploy with rollback proof        │
│       ▼                                                 │
│  Apr 7        SWA parked (PR previews disabled)         │
│                + 7 workflows total                      │
│                                                         │
└─────────────────────────────────────────────────────────┘

The 7 Workflows

WorkflowTriggerWhat It Does
Main App DeployPush to mainBuild, test, deploy React SPA + Express API to App Service (SA North)
AI Service DeployPush to main (ai-service/ paths)Build ai-core + ai-service, deploy to dedicated App Service, readiness probe
ETL Function DeployPush to main (etl-sync/ paths)Deploy Node.js sync functions via Azure Functions action
Auth RegressionNightlyRun authorization harness against golden snapshots — catches permission drift
SAP ETL Drift GuardNightlyValidate ETL function app settings haven't drifted from expected config
Secret ScanPush to mainScan for credentials, API keys, connection strings in committed code
SWA StandbyManualDeploy to parked US Static Web App for failover drills

Hard-Won DevOps Lessons

The git log records every DevOps battle:

  • February 7: Five consecutive commits (87ce3a0fdbcdcb) just trying to get tests to pass in CI — ESLint config blocking deploys, E2E tests leaking into the unit test run, Vitest pattern matching too broadly. The test infrastructure took an afternoon to stabilize.
  • February 22: Three attempts at ETL deploy strategy in a single day — Kudu helper (5273f15), then Azure Functions action (47f67a6), then removing ETL from the SWA workflow entirely (24f8ead). Each deployment tool had different failure modes.
  • February 26, commit 51fbc10: "fix(sev1): restore API function indexing by removing cross-folder fiscal lock import". A single ESM import across function boundaries crashed the entire Azure Functions worker. Not one endpoint — all of them. The next commit (0e28a0c) codified the lesson: SEV1 documented, API runtime boundary guard test added to CI.
  • March 6: ETL deploy regression, hotfixes, Kudu polling fixes (bda3eabd292495). Added the drift guard workflow and a Friday preflight checklist because manual SAP exports happen weekly.

No Terraform. No Bicep. PowerShell provisioning scripts for Azure resource creation, because for a solo developer, imperative scripts are easier to debug at 11pm. The infrastructure is simple enough that "just look at the script" is a valid runbook.


Act 14: The Test Harnesses — 245 Test Files and a Decision Oracle (December 2025 – April 2026)

A one-person platform with 11 roles, 29 permissions, 5 SAP connectors, a financial model engine, and an AI copilot doesn't survive without tests. But the test story isn't "I wrote unit tests." It's: I built a custom authorization testing framework with a decision oracle, pairwise scenario generation, quarantine management, and a flaky-detection runner — and then wrapped the rest of the platform in 245 test files across 5 testing layers.

The Testing Stack

┌─────────────────────────────────────────────────────────┐
│                 TESTING ARCHITECTURE                    │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  LAYER 1: Domain Model Tests (Dec 22, 2025)             │
│  ├── Financial model engine: 1,190-line test file       │
│  ├── Services model, fiscal year, budget lock           │
│  └── Pure math — no mocks, no API, just formulas        │
│                                                         │
│  LAYER 2: API Contract Tests (Feb 10 – Apr 2026)        │
│  ├── 56 contract test files                             │
│  ├── RBAC evaluation, customer ABAC, budget lock        │
│  ├── SAP connectors, ETL pipeline, audit schema         │
│  ├── AI chat bootstrap, retry, telemetry                │
│  └── Mocked Cosmos + auth — tests the contract, not DB  │
│                                                         │
│  LAYER 3: Frontend Component Tests (Feb – Apr 2026)     │
│  ├── 89 component contract tests                        │
│  ├── Admin views, SAP insights, market analysis         │
│  ├── AI chat panel, navigation, simulation context      │
│  └── Vitest + jsdom — tests UI logic, not pixels        │
│                                                         │
│  LAYER 4: Authorization Harness (Feb 13, 2026)          │
│  ├── Custom framework: oracle + generators + executors  │
│  ├── 40 golden scenarios (p0-core, products, sentinels) │
│  ├── Pairwise matrix: quick (~15) / full (~100+)        │
│  ├── Quarantine system with expiration dates            │
│  └── Runs nightly at 2 AM UTC in CI                     │
│                                                         │
│  LAYER 5: AI Service Tests (Mar 31, 2026)               │
│  ├── ai-core contracts + extraction + smoke tests       │
│  ├── ai-service runtime + SSE event shape validation    │
│  └── Node.js native test runner (not Vitest)            │
│                                                         │
└─────────────────────────────────────────────────────────┘

Layer 1: The Financial Model Test — December 22, 2025

The very first "real" test in the project was financialModel.test.js — born on December 22 during the Formula Wars (Act 2), when I discovered that a single wrong operator could misstate revenue by 12%. At 1,190 lines, it's still the largest test file in the codebase. It tests the pure calculation engine: margin calculations, discount compounding, freight surcharge modes, actuals-vs-forecast projection, simulation parameter sensitivity. No mocks, no API calls — just math in, numbers out. If this file fails, nothing else matters.

Layer 2: API Contract Tests — 56 Files

Every API endpoint has at least one contract test. These use mocked Cosmos DB containers and mocked authentication — they test the contract (request shape → response shape, permission gate → HTTP status) without hitting real infrastructure. The naming convention tells the story:

  • customers.abac.contract.test.js — ownership-gated writes
  • customerActuals.ownership.contract.test.js — assigned vs. unassigned account managers
  • products.lock.contract.test.js — budget lock enforcement on product edits
  • historicalFy.readonly.contract.test.js — past fiscal years are read-only
  • runtime-boundary.contract.test.js — the SEV1 guard (Act 13): ensures no cross-folder ESM imports crash the entire Functions worker
  • auth.appservice.contract.test.js — App Service EasyAuth header parsing

The .contract. suffix isn't decoration. It means the test verifies a behavioral contract that other systems depend on. Break the contract, break the integration.

Layer 3: Frontend Component Tests — 89 Files

UI tests in Vitest with jsdom. These don't render pixels — they test component logic: does the admin panel show the ETL tab for users with sap_exporter permission? Does the budget lock hero banner appear when the fiscal year is locked? Does the AI chat panel correctly serialize page context?

Key coverage areas:

DomainTest FilesWhat They Verify
Admin Views12Account management, ETL access, audit logs, fiscal year lock events, user governance
SAP Live Insights14Each connector widget: coverage, delivery, forecast variance, growth rates, orderbook, penalties, stock, quantity KPIs
Configuration9Lock hero, regional goals, customer list, post-lock adjustment, global parameters
Market Analysis7Workbook parsing, geographic model, caching, customer scope, widget rendering
AI Chat4Memory management, UI context serialization, markdown rendering, SSE streaming
Navigation6Routing, admin access gates, SAP live badge, feedback sidebar
Simulation4Context state machine, budget plan contract, simulation params, fail-soft fallback

Layer 4: The Authorization Harness — A Custom Testing Framework

This is the one I'm most proud of. On February 13, commit 35de3df landed the full authorization harness framework in a single commit: 1,475 lines across 15 new files. It's not a test suite — it's a testing framework with its own architecture:

┌─────────────────────────────────────────────────────────┐
│           AUTHORIZATION HARNESS ARCHITECTURE            │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Scenario Files (JSON)                                  │
│  ├── golden/p0-core.json     (25 priority-0 baselines)  │
│  ├── golden/products-golden.json  (8 product scenarios) │
│  └── golden/sentinels.json   (7 guardrail scenarios)    │
│          │                                              │
│          ▼                                              │
│  Scenario Loader ──► Schema Validator                   │
│          │                                              │
│          ├──► Decision Oracle                           │
│          │    ├── Permission Evaluator (RBAC layer)     │
│          │    ├── Scope Evaluator (BU layer)            │
│          │    ├── Ownership Evaluator (ABAC layer)      │
│          │    └── Lock Policy Evaluator (time gates)    │
│          │                                              │
│          ├──► Pairwise Generator                        │
│          │    ├── Cartesian product of dimensions       │
│          │    ├── Greedy pair-coverage optimization     │
│          │    └── Exclusion rules for invalid combos    │
│          │                                              │
│          ├──► Executors                                 │
│          │    ├── API Executor (HTTP status codes)      │
│          │    ├── UI Executor (editability + save gates)│
│          │    └── Integration Executor (live backend)   │
│          │                                              │
│          ▼                                              │
│  Harness Runner                                         │
│  ├── Loads scenarios + quarantine policy                │
│  ├── Runs oracle vs. executor comparison                │
│  ├── Detects regressions vs. quarantine recoveries      │
│  └── Writes Markdown coverage reports                   │
│                                                         │
└─────────────────────────────────────────────────────────┘

The Decision Oracle is the core. It's a pure-function authorization evaluator that mirrors the production 3-layer auth stack. Given a scenario (actor + resource + context), the oracle predicts the expected outcome: allow or deny, with a specific reason and the first failed gate. Then the API executor fires the actual HTTP request, and the harness compares oracle prediction vs. actual result. Mismatch = regression.

The oracle handles real authorization complexity: SAP-managed actuals are read-only. Budget-locked fiscal years block most writes — but post_lock permissions allow targeted adjustments to specific fields. The _assigned permission suffix means ownership gates apply. Sentinel scenarios like SENTINEL_ACCOUNT_MANAGER_CANNOT_ADMIN are invariants that must never flip to "allow."

The Pairwise Generator solves the combinatorial explosion. With 6 actor types × 8 entity types × 2 lock states × 2 ownership states × multiple actions, full Cartesian testing would produce thousands of scenarios. The generator uses a greedy pair-coverage algorithm: it selects the minimum set of scenarios that covers every pair of dimension values at least once. Quick mode: ~15 scenarios. Full mode: ~100+. Both run in CI.

The Quarantine System manages known-flaky scenarios without hiding regressions. A quarantined scenario has an owner, a reason, and an expiration date. If it expires without being fixed, the harness treats it as a failure again. Priority-0 and sentinel scenarios cannot be quarantined — if a guardrail fails, the build fails. Period.

Five Test Modes, One CLI

ModeCommandWhat It DoesWhen It Runs
Goldennpm run test:authz:goldenRun 40 hand-written baseline scenariosEvery PR
Quick Matrixnpm run test:authz:matrix:quickPairwise-generated ~15 scenariosEvery PR
Full Matrixnpm run test:authz:matrix:fullPairwise-generated ~100+ scenariosNightly (2 AM UTC)
Flaky Detectionnpm run test:authz:matrix:flakyRun quick mode 3× consecutively, flag non-deterministic resultsNightly
Integrationnpm run test:authz:integrationHit live backend with real Cosmos + real authManual (cost-gated)

The integration mode is intentionally gated behind a repository variable (ALLOW_AZURE_AUTHZ_INTEGRATION=true). Every integration run costs Cosmos DB request units. For a solo developer watching Azure spend, the test harness respects the budget — it defaults to mocked execution and only touches live infrastructure when explicitly told to.

Layer 5: AI Service Tests — Node.js Native Runner

When I extracted the AI copilot into a standalone Fastify service (Act 16), the tests came with it. The ai-core package has its own contract tests, extraction tests, and smoke tests. The ai-service runtime tests validate SSE event shapes — because when you stream AI responses to the browser, the event format is a contract that the frontend depends on. These use Node.js's native test runner, not Vitest — the AI service is a separate workspace with its own runtime.

The Testing Timeline

DateCommitMilestone
Dec 22d2e06c8First financial model tests — zero/negative margin edge cases
Jan 29799b781Production verification tests and health endpoint
Feb 7182662a5 commits in one afternoon to stabilize Vitest in CI
Feb 105e37745First RBAC contract tests alongside DB-backed auth
Feb 1335de3dfFull authz harness: 15 files, oracle, pairwise generator, quarantine
Feb 139e1d5f3Nightly harness workflow + cost guardrails for integration mode
Feb 1617c3924Harness expansion: p0-core golden scenarios, post-lock policy
Feb 20de8a01eRemove legacy Jest API tests, consolidate on Vitest
Feb 2187aa810AI agent feature module tests
Feb 260e28a0cSEV1 runtime boundary guard test (the import crash that killed all endpoints)
Mar 461ced1dRestore fiscal relation mocks — contracts caught the regression
Mar 31d719aaeAI characterization baseline captured before extraction

The test count grew with the platform. Every new feature — SAP connectors, market analysis, AI chat, budget lock, post-lock adjustments — came with its own contract tests. Not because I'm disciplined. Because after the dabe903 incident (Act 5) and the SEV1 import crash (Act 13), I learned that a solo developer's test suite isn't a nice-to-have. It's the only thing standing between a clean deploy and a 2 AM rollback.


Act 15: The Continent Migration (March 29–30, 2026)

I wrote an entire blog post about this, so the summary: over a single weekend, a Codex agent migrated the entire platform from Azure Static Web Apps (US) to Azure App Service (South Africa North). 2,300-line runbook. Zero portal clicks. 28 hours. 537 passing tests. API latency: 180ms → ~5ms.

The key commits: 8ca2dff"Implement South Africa App Service migration path", followed by a series of deploy workflow hardening, EasyAuth enablement (6e936bf), identity migration, cost ledger implementation (250ab5a), and finally 6285063"Park SWA and promote South Africa app".

┌─────────────────────────────────────────────────────────┐
│           HOSTING EVOLUTION                             │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Dec 2025     Single HTML file (local dev only)         │
│       │                                                 │
│       ▼                                                 │
│  Jan 2026     Azure Static Web Apps (US)                │
│       │        + Managed Functions (serverless)         │
│       │        + Free tier                              │
│       │        + 180ms API latency (cross-Atlantic)     │
│       ▼                                                 │
│  Mar 2026     Azure App Service (South Africa North)    │
│       │        + Express host (SSR-capable)             │
│       │        + Co-located with Cosmos DB              │
│       │        + ~5ms API latency (same region)         │
│       ▼                                                 │
│  Mar 2026     SWA parked as standby failover (US)       │
│                + Manual failover drills enabled         │
│                                                         │
└─────────────────────────────────────────────────────────┘


Act 16: Extracting the AI Into a Standalone Service (March 31, 2026)

The day after the continent migration, the AI got its own home.

The AI copilot had been running inside the main Express API — sharing compute resources with every other endpoint. A complex AI query could starve the budget save endpoint of CPU. The extraction happened on March 31: I wrote a 47-page migration runbook, and a Codex agent executed it across parallel git worktrees.

The PRs tell the story: #4 through #20, merged sequentially across a single day, each one a self-contained phase:

PRPhaseWhat It Did
#6Workspace scaffoldnpm workspaces for ai-core package
#4InfrastructureProvision standalone App Service
#7Deploy workflowGitHub Actions for AI service CI/CD
#5CharacterizationBaseline tests for existing AI behavior
#12–13ai-core extractionContracts, SSE shell, module extraction
#14Standalone runtimeFastify service built and tested
#15Backend proxySplit AI and API proxy routes on Express host
#16SSE contractStream provider tokens, freeze SSE protocol
#17Frontend clientRun-aware AI client pointing at new service
#18DiagnosticsAI runtime visibility in System Health
#19CutoverDark deploy with rollback proof
┌─────────────────────────────────────────────────────────┐
│           AI SERVICE EVOLUTION                          │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Feb 21       AI chat as Azure Functions                │
│       │        + Shared compute with API                │
│       │        + SSE streaming over SWA (US)            │
│       ▼                                                 │
│  Mar 29       AI chat on App Service (SA North)         │
│       │        + Same-region streaming (~5ms)           │
│       │        + Still shared with main API             │
│       ▼                                                 │
│  Mar 31       Standalone Fastify AI Service             │
│                + Dedicated App Service instance         │
│                + Extracted ai-core workspace package    │
│                + Independent scaling and deploy cycle   │
│                + 23 tools, 7 domains, 4 tool profiles   │
│                + 3-layer cache, SSE with 10 event types │
│                                                         │
└─────────────────────────────────────────────────────────┘

April 1 brought a series of hotfixes — auth provisioning (78c58d4), operational logging packaging (8c5378e), SSE stream transport (055b1fc), empty stream recovery (6e5e798). The commits are labeled "Hotfix" because they were exactly that — production issues found after the cutover. Every one of them was resolved the same day.


Act 17: What The Platform Looks Like Today

467 commits. 18 acts. Here's the final architecture:

┌─────────────────────────────────────────────────────────┐
│                    BROWSER (SPA)                        │
│  React 19 + Vite + Tailwind CSS + Recharts              │
│  13 Views │ 97 Components │ Financial Model Engine      │
└────────────────────────┬────────────────────────────────┘
                         │ HTTPS
                         ▼
┌─────────────────────────────────────────────────────────┐
│              AZURE APP SERVICE (SA North)               │
│  Express Host │ 24 API Endpoints │ 3-Layer Auth         │
│  RBAC + BU Scope + ABAC Ownership                       │
└──────────┬──────────────────────┬───────────────────────┘
           │                      │
           ▼                      ▼
┌─────────────────────┐  ┌───────────────────────────────┐
│   COSMOS DB NoSQL   │  │     AI SERVICE (Fastify)      │
│   10 Containers     │  │  23 Tools │ SSE Streaming     │
│   /pk Partition Key │  │  Tool Profiles │ 3-Layer Cache│
└─────────────────────┘  └───────────────────────────────┘
           ▲
           │  ETL Pipeline
┌─────────────────────────────────────────────────────────┐
│              SAP DATA INTEGRATION                       │
│  Manual XLS Upload ──► Blob ──► Python Extract          │
│  ──► Blob Artifacts ──► Node.js Sync ──► Cosmos DB      │
└─────────────────────────────────────────────────────────┘


The Full Timeline

#PhasePeriodKey CommitsWhat Changed
1PrototypeDec 9–13720f0f1 → 5816420Single HTML file → React SPA with financial model, tests, simulation sliders, localStorage
2Formula WarsDec 22–27271c199 → ae4aba218 formula fixes, agent contract, services ARR integration, protected model file
3Cloud DeployDec 17 – Feb 82bf8100 → dc33412Azure SWA (US) + Cosmos DB (SA North), Entra ID, partition key consolidation
4Cosmos DB ArchitectureJan 28 – Mar02810dc → e6fce7a10 containers, /pk uniformity, document shapes, seed pipeline, killed cloud dev DB, emulator-first
5dabe903 & HardeningFeb 7–89af6c11 → dabe903Destructive commit, NEVER OMIT CODE rule, admin portal, enterprise hardening sprint
6Authorization ModelFeb 10–1935de3df → ...3-layer auth (RBAC + BU scope + ABAC), 11 roles, 29 permissions, budget lock, nightly regression
7SAP ETL PipelineFeb 13–14f089e87 → 6ad6a007-stage pipeline, Python extractors, Node.js sync, customer & product masters
8Connector PatternFeb 2026Reusable recipe: registry + 4 artifacts + framework + sync + manifest. Pluggable connectors.
9SAP Live InsightsFeb 15–180366d4c → 3369f295 connectors in 5 days. Hero KPI dashboard. Became the default home screen.
10Cost ReckoningFeb 11–284bdae3e → 750f979RU telemetry, manifest-aware SAP caching, configurable TTLs, killed dev Cosmos instance, emulator-first.
11AI CopilotFeb 21 – Mar 5b14dadc → b77692123 tools, SSE streaming, 3-layer cache, page-aware context, thread memory
12Market AnalysisMar 1–5a4cfc8a → 7e02926Geographic intelligence, competitor-dominance mapping, fuzzy geo matching
13DevOps & CI/CDDec 17 – Apr 72bf8100 → 550f8b27 workflows, nightly auth regression, ETL drift guard, secret scan, SEV1 lessons
14Test HarnessesDec 22 – Apr 7d2e06c8 → d719aae245 test files, 5 layers, custom authz harness with decision oracle, pairwise generator, quarantine system
15Continent MigrationMar 29–308ca2dff → 88e6d35US → SA North. SWA → App Service. 180ms → 5ms. Codex-executed, zero portal clicks.
16AI ExtractionMar 3108b614b → 387feabStandalone Fastify AI service. ai-core package. 14 PRs in one day.
17StabilizationApr 1–73822cd3 → 550f8b2Hotfixes, auth hardening, CI cleanup. SWA parked. Production steady-state.

By The Numbers

MetricValue
Total commits467
Development periodDec 9, 2025 → Apr 7, 2026 (~4 months)
Developers1 (with AI coding agents)
React components97
Frontend source files230+
API endpoints24
Shared library modules38
AI tools23 across 7 domains
AI service codebase11,751 lines across 43 files
SAP connectors5
Cosmos DB containers10
CI/CD workflows7
Auth roles / permissions11 / 29
GitHub PRs (AI extraction alone)14 in one day
"Safety commit" messagesAt least 6 (I was learning)


Act 18: The AI Development Story — Architect and Delegate

A single developer didn't write 467 commits in four months by typing every line. I used AI coding agents — Claude Code and OpenAI Codex — throughout the project. But the relationship evolved dramatically.

In December, I was pair-programming with Claude — writing specifications, reviewing output line by line, manually fixing mistakes. By March, I was handing Codex a 47-page runbook and walking away while it executed a multi-phase migration across parallel git worktrees. The shift was possible because of one thing: the contracts got better.

My previous project taught me that AI agents without architectural governance produce unmaintainable code at terrifying speed. So this project had contracts from the start: coding rules, protected files, partition key conventions, ESM-only enforcement, import registration requirements. Every rule was written in response to a specific incident:

  • dabe903 destroying 200 lines → "NEVER omit code for brevity"
  • An ESM require() crash → "ESM only — never use require() in .js files"
  • Partition key chaos → "All containers use PK path /pk"
  • A cross-folder import crashing all Functions → "New Functions must be imported in index.js"
  • Formula regression → "financialModel.js is protected — don't modify without approval"

This development model — architect, contract, delegate, review — changed the economics of solo development. The bottleneck shifted from typing speed to clarity of specification. The agents are a multiplier, not a replacement. They turned one architect into what would have required a team of 4–5 developers.


What I Learned

1. You don't need to migrate your ERP to get modern analytics

The XLS-upload pipeline will never win an architecture award. But it works, it's been running for months, and it cost nothing to build. 17,000 companies are stuck on legacy SAP. A pipeline that turns yesterday's Excel export into today's live dashboard is worth more than a migration roadmap that never starts.

2. Build the calculation engine first, UI second

The pure financial model function from Day 1 survived every refactor, every migration, every architecture change. Everything else is a projection of that engine. If your math is wrong, your charts are beautiful lies.

3. Real-time simulation changes how people think about budgets

Before this platform, budget revisions were a seldom event. Now a sales director drags a margin slider and immediately sees the impact on PBT, customer profitability, and logistics costs. That's a bigger organizational shift than any technology choice.

4. Authorization is harder than the features it protects

11 roles, 29 permissions, BU scoping, ABAC ownership, field-level whitelists, post-lock time gates, and a nightly regression test. More engineering time than any dashboard view. But it's why finance trusts the platform with real data.

5. Fail early, fail in the git log

Those "safety commit" messages from December? The 18 formula fixes on December 22nd? The dabe903 incident? The SEV1 that crashed all API endpoints? They're all in the history forever. Good. Every failure became a rule in the agent contract, a test in the regression suite, or a convention in the architecture.

6. AI agents need contracts, not prompts

My previous project's monolith was built on prompts. This platform was built on contracts. The difference is the difference between a codebase you're proud of and one you're hiding.

* * *

Try This Yourself

If you're sitting on a legacy ERP and dreaming of modern analytics:

  1. Start with the export, not the API. Every ERP can export to Excel. Build a pipeline that consumes those exports. You'll have a working system in weeks instead of months of integration work.
  2. Build the calculation engine as a pure function. No UI dependencies. No state management. Just inputs and outputs. Test it before you write a single component.
  3. Choose a document database with a consistent partition strategy. One partition key path everywhere. A single configuration file as the source of truth. Cosmos DB's serverless tier costs pennies for planning workloads.
  4. Make simulation synchronous. Client-side recalculation under 16ms. No loading spinners. Finance people think in rapid "what if" loops — any latency kills adoption.
  5. Invest in authorization before features. RBAC, BU scope, and ownership-based restrictions before the first real user touches the system.
  6. Add AI tools, not AI chat. Domain-specific tools that call your calculation engine will always be more accurate than a RAG pipeline that embeds your spreadsheets.
  7. Automate everything a solo developer can forget. Nightly auth regression. ETL drift guards. Secret scanning. Friday preflight checklists. The CI pipeline is your team.
  8. Write contracts, not prompts. Claude Code and Codex turn one architect into a delivery team — but only if the specifications are unambiguous enough to be executed autonomously.

* * *

The Bottom Line

My company may eventually migrate to S/4HANA if forced by the big tech company to do so. The ECC clock is ticking. But in the meantime, the business has a financial planning and business intelligence platform that's faster, more interactive, and more insightful than anything SAP's standard reporting ever provided — and it was built by one person in four months for less than the cost of a single SAP consulting engagement, or having to drop a few hundred thousand rands on a development agency!

The 55–75% ERP implementation failure rate isn't just a statistic. It's a signal that the industry's default answer — "upgrade everything at once" — doesn't work for most companies. Sometimes the smarter move is to build a bridge: extract the data you need, compute the insights you want, and let the legacy system keep doing what it's always done — processing transactions.

467 commits ago, this was a single HTML file. Today it's 13 views, 24 API endpoints, 5 SAP connectors, 23 AI tools, and a platform that finance, sales, and executives use daily. The git log has every step — including the embarrassing ones.

The ERP doesn't need AI. Your decisions need AI. Build around the gap.

Tuesday, 7 April 2026

How I built an AI Finance Assistant into a Business Intelligence Dashboard with Claude Code and Codex

Twenty-three tools. Four tool profiles. A prompt library that rewrites itself when you change pages. A health dashboard that tracks token costs, cache hit rates, p95 latencies, and user-level adoption — in real time. I gave an LLM read-only access to my entire financial database, my ERP (SAP) pipeline, and a full fiscal simulation engine — then shipped it as a chat panel inside a budget dashboard. This is how I designed it. Every contract, every cache layer, every retry path, every health probe. If you're building AI features into a line-of-business application, this is the reference architecture that could help you on your own journey...

The Numbers

23 AI Tools | 4 Tool Profiles | 3 Cache Layers | 3 Auth Layers | 5 ERP Connectors

Act 1: The Problem — Dashboards Don't Answer Questions

I'd already built a production budget modelling and business intelligence ops dashboard tool: React frontend, serverless API, NoSQL backend. Product data, customer targets, actuals vs. forecasts, fiscal simulations — the works. It was a solid dashboard. But dashboards are passive. A CFO staring at twelve charts still has to synthesize the story. "Are we tracking against PBT (Profit Before Tax) target?" requires mentally combining revenue actuals, expense forecasts, margin config, and service ARR (Annual Recurring Revenue). That's four separate data views, minimum. I wanted something different: a financial intelligence analyst that lives inside the dashboard, has access to everything the dashboard knows, and can answer questions at FP&A (Financial Planning & Analysis)-analyst level. Not a chatbot bolted onto the side. But an agentic copilot that calls tools, runs models, queries databases, and cites its sources — all in real time, streamed back as SSE events.
Why tool-based, not RAG? My data is structured and relational — products have line items, customers have product mixes, actuals are monthly arrays. This is not a document search problem. It's a database query problem. Tools let the model query exactly the data it needs, stay within context limits, and reuse existing data-access code.
"But you're using a NoSQL document database, not a relational DB — isn't that a contradiction?" No — it's actually why tools are even more important. My database uses a non-standard query dialect with no cross-container JOINs, mandatory partition key routing, and data models that vary by document type. Text-to-SQL would be catastrophic here — the LLM would need to know partition key patterns, container boundaries, and cross-container join logic just to form a valid query. Tools encapsulate all of that complexity. The model calls query_products and gets clean, scoped data back. It never sees partition keys, internal IDs, or the document model underneath. And because every tool goes through the DataAccessAdapter contract, the entire AI layer is database-agnostic — I could swap to PostgreSQL or SQL Server tomorrow and the orchestration, prompt engineering, and tool profiles wouldn't change. The document database was a deliberate trade-off: zero-schema migrations, natural fit for heterogeneous product/customer structures, native cloud integration, and serverless pricing. The tool abstraction means that trade-off is invisible to the AI.

Act 2: The Architecture — End-to-End

Here's the full picture, from browser to database and back:
┌─────────────────────────────────────────────────────────────────┐
│                        BROWSER (React + Vite)                   │
│                                                                 │
│  ┌──────────────┐   ┌────────────────────┐   ┌───────────────┐  │
│  │  AI Chat     │──►│  Bootstrap Tokens  │──►│ POST /ai/chat │  │
│  │  Panel       │   │  (HMAC-SHA256,120s)│   │ (Fetch API)   │  │
│  └──────┬───────┘   └────────────────────┘   └───────┬───────┘  │
│         │                                            │          │
│         ▼                                            ▼          │
│  ┌──────────────┐   ┌────────────────────┐   ┌───────────────┐  │
│  │  Prompt      │   │  Context Nudges    │   │ Event Parser  │  │
│  │  Catalog     │   │  (Page-Aware)      │   │ delta/snapshot│  │
│  └──────────────┘   └────────────────────┘   │ tool tracking │  │
│                                              │ chartrendering│  │
│                                              └───────────────┘  │
└─────────────────────────────┬───────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                  REVERSE PROXY (Identity Injection)             │
│           Client-Principal-ID injected server-side (unforgeable)│
└─────────────────────────────┬───────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│              AI SERVICE (Fastify on Dedicated App Service)      │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                     RUNTIME MANAGER                       │  │
│  │                                                           │  │
│  │  Concurrency: 2 active/user, 4 global active, 8 queued    │  │
│  │  Watchdog:    10 min stall kill    Shutdown grace: 15s    │  │
│  │                                                           │  │
│  │┌─────────────────┐┌──────────────────┐┌──────────────────┐│  │
│  ││  Auth Layer     ││  Question Router ││  Orchestration   ││  │
│  ││                 ││                  ││  Engine          ││  │
│  ││  Proxy identity ││  Regex classify  ││  Model selection ││  │
│  ││  Bootstrap HMAC ││  Profile resolve ││  Tool budgets    ││  │
│  ││  RBAC check     ││  Chart detection ││  Iteration loop  ││  │
│  ││  BU scope       ││  ERP routing     ││  40s orch. cap   ││  │
│  ││  ABAC ownership ││  Prompt context  ││  28s provider    ││  │
│  │└─────────────────┘└──────────────────┘└────────┬─────────┘│  │
│  │                                                │          │  │
│  │┌─────────────────┐┌──────────────────┐┌────────▼─────────┐│  │
│  ││  SSE Streaming  ││  Tool Cache      ││  Tool Executor   ││  │
│  ││                 ││                  ││                  ││  │
│  ││  PassThrough    ││  Shared (500/2m) ││  23 tool handlers││  │
│  ││  Hijack reply   ││  Request-scoped  ││  Auth scope/call ││  │
│  ││  10 event types ││  Data freshness  ││  Result sanitize ││  │
│  ││  Reconnect      ││  ETL-aware       ││  5s timeout/tool ││  │
│  │└─────────────────┘└──────────────────┘└──────────────────┘│  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└──────────────────────────────────┬──────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────┐
│               ai-core (Extracted Workspace Package)             │
│                                                                 │
│  ┌────────────────────┐  ┌─────────────────┐  ┌─────────────┐   │
│  │ Contracts          │  │ System Prompt   │  │ Tool Defs   │   │ 
│  │ ProviderTransport  │  │ FP&A persona    │  │ 23 tools    │   │
│  │ DataAccessAdapter  │  │ 19 behavioral   │  │ JSON Schema │   │
│  │ EventSink          │  │   rules         │  │ Routing     │   │
│  │ SSE Events         │  │ Response frame  │  └─────────────┘   │
│  └────────────────────┘  └─────────────────┘                    │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                        PROVIDER LAYER                           │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │            ProviderTransport (Contract)                   │  │
│  │                                                           │  │
│  │  createMessage({ model, maxTokens, system,                │  │
│  │                   tools, messages })                      │  │
│  │      ──► { responseId, model, stopReason,                 │  │
│  │            content, usage }                               │  │
│  │                                                           │  │
│  │  Today:    Primary ──► Fallback (model chain)             │  │
│  │  Tomorrow: Any LLM provider (same contract)               │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                          DATA LAYER                             │
│                                                                 │
│┌────────────┐┌──────────────┐┌──────────────┐┌─────────────────┐│
││  NoSQL DB  ││  ERP ETL     ││  Financial   ││ Market          ││
││            ││  Pipeline    ││  Model       ││ Analysis        ││
││  products  ││  sales       ││  Engine      ││                 ││
││  customers ││  debtors     ││              ││ regions         ││
││  actuals   ││  orderbook   ││  run_model   ││ municipalities  ││
││  config    ││  stock       ││  what_if     ││ market share    ││
││  defs      ││  delivery    ││  services    ││ mfr split       ││
│└────────────┘└──────────────┘└──────────────┘└─────────────────┘│
└─────────────────────────────────────────────────────────────────┘

Act 3: The Contract System — Provider-Agnostic by Design

The single most important architectural decision I made: never let the LLM provider leak into business logic. I defined three frozen contracts that form the boundary between AI orchestration and everything else.
// ai-core/contracts/interfaces.js

export const ProviderTransport = freezeContract(
  'ProviderTransport',
  {
    createMessage: freezeMethod(
      'Submit one provider turn.',
      {
        accepts:  { model, maxTokens, system, tools, messages },
        returns:  { responseId, model, stopReason, content, usage },
      }
    ),
  },
);

export const DataAccessAdapter = freezeContract(
  'DataAccessAdapter',
  {
    loadRequestContext:       freezeMethod('Load group config and access state.'),
    loadConversationHistory:  freezeMethod('Load ordered chat history.'),
    executeToolCall:          freezeMethod('Execute a normalized AI tool call.'),
    persistRunArtifacts:      freezeMethod('Persist assistant output.'),
    writeAuditEvent:          freezeMethod('Write audit telemetry.'),
  },
);

export const OrchestrationEventSink = freezeContract(
  'OrchestrationEventSink',
  {
    emit: freezeMethod('Emit a normalized run event.'),
  },
);
Why this matters: the orchestration engine talks to ProviderTransport.createMessage(). It doesn't know or care whether that's Claude, GPT-4, Gemini, or a local model behind Ollama. The contract enforces a stopReason vocabulary (end_turn, tool_use, max_tokens, refusal, etc.) that the orchestration loop consumes to decide: "Do I call tools and loop? Or am I done?" Similarly, the DataAccessAdapter isolates the orchestration from the database, the ERP pipeline, and my financial model engine. The AI core package has zero database imports. Zero cloud SDK references. It's a pure orchestration library. Every contract includes runtime validation. At boot time, assertContract() throws a TypeError if any required method is missing. This means a bad adapter implementation fails at startup, not at 3 AM in production:
function assertContract(candidate, contract, label) {
  const missing = Object.keys(contract.methods)
    .filter(name => typeof candidate[name] !== 'function');
  if (missing.length > 0) {
    throw new TypeError(
      `${label} is missing required methods: ${missing.join(', ')}`
    );
  }
  return candidate;
}
                        ┌──────────────────────┐
                        │  Orchestration Loop  │
                        │  (ai-core package)   │
                        └──────────┬───────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              │                    │                    │
              ▼                    ▼                    ▼
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│  Provider        │  │  Data Access     │  │  Event           │
│  Transport       │  │  Adapter         │  │  Sink            │
│                  │  │                  │  │                  │
│  createMessage() │  │  executeToolCall │  │  emit()          │
└────────┬─────────┘  └────────┬─────────┘  └────────┬─────────┘
         │                     │                     │
         ▼                     ▼                     ▼
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│ Claude SDK   │      │ NoSQL DB     │      │ SSE Stream   │
│ OpenAI SDK   │      │ ERP ETL      │      │ (client)     │
│ Gemini SDK   │      │ Fin Model    │      └──────────────┘
│ Local LLM    │      └──────────────┘
└──────────────┘

Act 4: The Tool Arsenal — 23 Tools Across 7 Domains

The real power of an agentic copilot isn't the LLM — it's the tools you give it. I exposed 23 tools organized into seven domains, each with strict JSON Schema input validation and BU/FY scoping:
DomainToolsWhat They Access
Core Dataquery_products, query_customers, query_actuals, query_budget_config, query_definitionsNoSQL budget data — products, customers, targets, actuals/forecasts, FY config
ERP Live Insightsquery_erp_sales_insights, query_erp_debtors_insights, query_erp_orderbook_insights, query_erp_stock_insights, query_erp_delivery_insights, query_etl_run_history, query_erp_connector_detailERP pipeline — sales, debtors aging, open orders, stock levels, delivery status, line-level detail with filtering
Financial Modelsrun_financial_model, run_what_if_simulation, run_services_modelFull budget model (read-only), hypothetical scenarios with parameter adjustments, ARR/MRR/service profit
Analyticsget_customer_concentration, get_regional_breakdown, get_margin_analysis, compare_fiscal_yearsRevenue concentration risk, regional splits, margin by customer/product/region, FY-vs-FY comparison
Market Analysisquery_market_analysisMunicipality-level market data — region, manufacturer split, coverage, assignment chains
Methodologyget_model_methodologyCanonical formulas, assumptions, validation rules, simulation logic reference
Visualizationgenerate_chartInline chart specs (bar, line, pie, composed, area) with formatting and annotations
Plus web search (provider-managed) for external context like tenders, competitors, and market developments.

Tool Execution: Authorization at Every Call

Every single tool call passes through an authorization boundary. This isn't "check auth once at the top." Every tool execution resolves a scope:
// Per-tool-call authorization
const scope = await scopeFor(userContext, buId);
// Checks: isAdmin, BU viewAccess, user profile roles

// Account managers see only their assigned customers
const customers = isAccountManager
  ? filterCustomersForUser(allCustomers, userId)
  : allCustomers;

// Internal fields are stripped before the LLM sees them
function sanitizeDoc(doc) {
  const { _rid, _self, _etag, _attachments, _ts, pk, ...clean } = doc;
  return clean;
}
The AI agent is read-only. It cannot modify data. It cannot even see internal database fields. The tool executor enforces row limits (5,000 default, 12,000 for ERP detail scans) and a 5-second timeout per tool call.

Adding New Tools: A Disciplined Process

I maintain an extension guide with strict criteria. A new tool is added only if all four conditions are met: 1. Existing tools cannot answer the target question class reliably 2. The query can be bounded by BU/FY plus filters or row limits 3. The tool is read-only and deterministic 4. The output can be explained with clear freshness metadata (dataAsOf timestamp) Steps: define schema in the core package, implement handler in the executor, enforce BU/role scope, add cache eligibility decision, update tool profiles and system prompt if needed.

Act 5: Intelligent Question Routing — Tool Profiles

Not every question needs every tool. Asking "What does the model methodology say about margin vs. markup?" doesn't need ERP data. Asking "Show me overdue orders" doesn't need the financial model. I built a question routing engine that classifies incoming messages and selects a tool profile — a curated subset of tools with an enforced budget:
                       User Question
                            │
                            ▼
                   ┌─────────────────┐
                   │ Question Router │
                   │                 │
                   │ Regex classify  │
                   │ + UI context    │
                   │ + Prompt hint   │
                   └────────┬────────┘
                            │
         ┌──────────────────┼──────────────────┐────────────┐
         │                  │                  │            │
         ▼                  ▼                  ▼            ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  Finance     │  │  ERP         │  │  Scenario    │  │  Methodology │
│  Profile     │  │  Profile     │  │  Profile     │  │  Profile     │
│              │  │              │  │              │  │              │
│  6 tools     │  │  6+ tools    │  │  5 tools     │  │  1 tool      │
│  2 calls     │  │  2 calls     │  │  2 calls     │  │  1 call      │
│  40s cap     │  │  +detail     │  │  what-if     │  │  formulas    │
└──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘
The router uses regex classification to match question intent, but also respects prompt context hints from the frontend. Profile routing also includes question-intent classification that detects 12 distinct intent types: seasonality, customer decline, forecast, variance, profitability, risk, cashflow, trend, comparison, what-if, chart requests, and strategy/recommendation asks. Each profile enforces a strict tool budget:
ParameterDefaultPurpose
MAX_TOOL_ITERATIONS2Maximum LLM-to-tool round-trips
MAX_TOOL_CALLS2Total tool invocations per request
MAX_TOOL_CALLS_WITH_CHART3Budget when chart is requested
MAX_ORCHESTRATION_MS40,000Total wall-clock cap for entire run
MAX_PROVIDER_CALL_MS28,000Per-LLM-call timeout
Why budget tool calls? Cost and latency. Every tool call means the LLM has to process tool results and generate another turn. Without budgets, a curious model could chain 8 tool calls, burn tokens, and make the user wait 90 seconds. My budgets keep responses under 10 seconds in the common case.

Act 6: The Prompt Library — Context-Aware Starter Prompts and UI Nudges

Most AI chat implementations show static starter prompts: "Ask me anything." Mine are dynamic. I have two systems working together:

1. Starter Prompt Catalog

Curated prompts with embedded routing metadata. Each prompt carries a promptContext that tells the router which profile and tools to prefer:
export const STARTER_PROMPTS = Object.freeze([
  createPromptOption({
    id: 'pbt_target_tracking',
    text: 'How are we tracking against the PBT target?',
    profileHint: 'finance_interactive',
    preferredTools: ['run_financial_model']
  }),
  createPromptOption({
    id: 'monthly_revenue_vs_budget_chart',
    text: 'Show me a chart of monthly revenue vs budget',
    profileHint: 'finance_interactive',
    preferredTools: ['run_financial_model', 'generate_chart'],
    chartRequested: true
  }),
  createPromptOption({
    id: 'debtors_aging_overview',
    text: 'What does the debtors aging look like?',
    profileHint: 'erp_interactive',
    preferredTools: ['query_erp_debtors_insights']
  }),
]);

2. Context Nudges — The Page-Aware Prompt Engine

This is the feature I'm most proud of. When the user navigates between dashboard pages, I detect the transition and generate contextual suggestions:
export function buildUiContextNudges({ previousScope, currentScope, page }) {
  const nudges = [];

  // User switched tabs: suggest page takeaways
  if (prev.tab !== curr.tab) {
    nudges.push(createPromptOption({
      text: `What are the top takeaways on ${pageLabel}?`,
      source: 'context_nudge',
      profileHint   // auto-inferred from page context
    }));
  }
  // FY changed: suggest risk/opportunity summary
  if (prev.fy !== curr.fy) {
    nudges.push(createPromptOption({
      text: `Summarize the key risks and opportunities for ${curr.fy}.`,
    }));
  }
  // Historical mode toggled
  if (prev.historicalMode !== curr.historicalMode) {
    nudges.push(createPromptOption({
      text: curr.historicalMode
        ? 'What should I learn from this historical-year view?'
        : 'What should I monitor in the active year?',
    }));
  }
  // Filters changed
  if (prev.filterSignature !== curr.filterSignature) {
    nudges.push(createPromptOption({
      text: 'How do the currently applied filters change the story?',
    }));
  }
  return dedupePromptOptions(nudges).slice(0, 3);
}
The copilot isn't just waiting for questions. It's noticing what you're looking at and suggesting the questions a good analyst would ask next.

Act 7: The System Prompt — Persona Engineering for Finance

My system prompt doesn't just say "You are a helpful assistant." It establishes a professional identity with explicit behavioral rules:
You are Budget Analyst, an elite strategic finance and
business intelligence copilot.

Operate at the level of a senior data scientist, FP&A lead,
chartered-accountant quality reviewer, and board-ready analyst.

Your audience ranges from junior budget analysts to CEOs,
board members, investors, and shareholders.
19 behavioral rules govern the agent's conduct. The critical ones:
  • Tools first: For company-data questions, call tools first and ground answers in tool results. Never invent figures.
  • Read-only: Never claim to have changed persisted data.
  • Source transparency: Every response starts with SOURCE: data | knowledge | mixed and CONFIDENCE: high | medium | low.
  • Staleness awareness: Flag data older than 7 days with an explicit warning.
  • Follow-ups required: Every response ends with 3 suggested follow-up questions.
  • ERP line-level protocol: For specific product/customer queries, call the detail tool with filters before concluding.
  • No secrets: Never reveal document IDs, partition keys, or internal implementation details.
  • Executive communication: For senior stakeholders, prioritize key finding, material drivers, risks, and actions.

Act 8: The Streaming Pipeline — SSE From Orchestration to Browser

AI responses need to stream. A 15-second wait for a complete response feels broken. A response that starts appearing in 800ms feels fast, even if the total time is the same. I built a full SSE (Server-Sent Events) pipeline with 10 distinct event types:
Orchestration Engine                SSE Event Types:
        │                           ─────────────────────────────
        ▼                           ready       Run allocated, correlation ID
  EventSink.emit()                  heartbeat   Keep-alive (8s interval)
        │                           status      Phase label updates
        ▼                           text        Delta (streaming) or snapshot
┌───────────────┐                   tool_start  Tool execution begins
│  PassThrough  │                   tool_end    Tool complete + cache metadata
│  Stream       │                   chart       Inline chart spec payload
└───────┬───────┘                   follow_ups  Suggested next questions
        │                           error       Code, message, details
        ▼                           done        Token counts, model, metrics
┌───────────────┐
│  Fastify      │
│  reply.hijack │
│  + pipe()     │
└───────┬───────┘
        │
        ▼
┌───────────────────┐
│  HTTP Response    │
│                   │
│  Content-Type:    │
│  text/event-stream│
│  Cache-Control:   │
│  no-cache         │
│  X-Accel-Buffering│
│  no               │
└───────┬───────────┘
        │
        ▼
┌───────────────────┐
│  Browser          │
│  ReadableStream   │
└───────────────────┘
The Fastify route hijacks the raw HTTP response to avoid framework buffering:
function sendSse(reply, stream, headers = {}) {
  reply.hijack();
  reply.raw.statusCode = 200;
  reply.raw.setHeader('Content-Type', 'text/event-stream');
  reply.raw.setHeader('Cache-Control', 'no-cache');
  reply.raw.setHeader('Connection', 'keep-alive');
  reply.raw.setHeader('X-Accel-Buffering', 'no');
  reply.raw.flushHeaders();

  reply.raw.on('close', () => stream.destroy());
  stream.on('error', () => reply.raw.end());
  stream.pipe(reply.raw);
}
The X-Accel-Buffering: no header is critical — it tells any reverse proxy or CDN in front not to buffer the SSE stream. Without it, users see nothing until the entire response completes.

Stream Recovery

SSE connections drop. Networks glitch. Phones go to sleep. I handle this with a parallel polling mechanism:
  • Every run gets a unique runId returned in a response header
  • If the stream disconnects, the frontend polls a status endpoint every 15 seconds
  • If the run is still active, it can reattach via a separate stream endpoint
  • Rate limited to 10 reattach attempts per minute
  • An 8-second heartbeat event keeps the connection alive through aggressive proxy timeouts

Act 9: The Cache System — Three Layers Deep

AI tool calls are expensive — not in compute, but in tokens. Every tool result gets injected into the LLM context. More data = more tokens = more cost = more latency. I built three cache layers to minimize redundant work:
Tool Call Request
         │
         ▼
┌────────────────────┐     Hit?
│  Layer 1:          │──────────►  Return cached result
│  Request-Scoped    │             (same request, same inputs)
│  Memo              │
└────────┬───────────┘
         │ Miss
         ▼
┌────────────────────┐     Hit?
│  Layer 2:          │──────────►  Return cached result
│  Shared Tool       │             (cross-request, 2min TTL)
│  Cache (500 max)   │
└────────┬───────────┘
         │ Miss
         ▼
┌────────────────────┐     Hit?
│  Layer 3:          │──────────►  Return if within freshness
│  Data Freshness    │             window (ERP: 7-day ETL cycle)
│  (ETL-aware)       │
└────────┬───────────┘
         │ Miss
         ▼
   Execute Tool
   (DB / Model / ERP)
LayerScopeTTLMax SizePurpose
Request MemoSingle AI requestRequest lifetimeUnboundedPrevent duplicate tool calls in same turn (e.g., model calls auth twice)
Shared CacheCross-request120 seconds500 entries (LRU)Reuse results across concurrent users asking similar questions
Data FreshnessETL-aware7 days (ERP)IntegratedERP data refreshes weekly; don't refetch mid-cycle
One critical detail: run_what_if_simulation is never cached. Simulations take arbitrary parameter adjustments, so every call must be fresh. All other 22 tools are cacheable. Cache key security is strict. Keys include tool name, normalized input, BU, FY, and a scope fingerprint. For account-manager-scoped users, the key includes the userId — ensuring that user A's filtered customer data is never returned to user B.

Act 10: Retry, Fallback, and Graceful Degradation

Production AI features fail in ways that traditional APIs don't. The LLM provider can return 429 (rate limit), 529 (overloaded), or simply time out. I built multiple layers of resilience:

Provider-Level Retry

const RETRYABLE_STATUSES = new Set([429, 500, 502, 503, 504, 529]);

// Exponential backoff: 350ms base, 2s max
const BACKOFF_BASE_MS = 350;
const BACKOFF_MAX_MS  = 2_000;

// Model fallback chain: Primary ──► Secondary
const MODEL_FALLBACKS = [MODEL_PRIMARY, MODEL_FALLBACK];

function buildModelCandidates(initialModel) {
  // Returns ordered list: preferred model first, then fallbacks
  // Enables retry with automatic model downgrade
}

Partial Completion States

Not every failure is a total failure. I track five partial completion states that let the frontend show whatever was generated before the failure:
const PARTIAL_COMPLETION_STATES = new Set([
  'partial_timeout',              // Hit orchestration wall-clock cap
  'partial_provider_failure',     // Provider failed after partial response
  'tool_iteration_limit',         // Hit max tool round-trips
  'tool_call_budget_limit',       // Hit max tool calls
  'interactive_budget_limit'      // Hit interactive mode constraints
]);
When the model hits a budget limit, I don't discard the partial response. The done event includes a budgetExitReason field, and the model is instructed to "answer with the highest-confidence partial result and propose narrower follow-ups."

Continuation and Follow-Up Detection

The system detects when a user is following up on a partial result with regex patterns for continuations like "continue", "go deeper", "top 5 results", "carry on" — and adjusts routing accordingly so the follow-up doesn't restart from scratch.

Act 11: The Authentication Stack — Three Layers, Zero Trust

            Browser Request
                  │
                  ▼
┌──────────────────────────────────────┐
│  Layer 1: Proxy Identity             │
│                                      │
│  Reverse proxy injects:              │
│  Client-Principal-ID                 │
│  Client-Principal-Name               │
│  ──► Cannot be forged by client      │
└──────────────────┬───────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│  Layer 2: Bootstrap Token (HMAC)     │
│                                      │
│  HMAC-SHA256 signed, 120s TTL        │
│  Contains: userId, email, sessionId  │
│  viewBuIds, homeBuId, permissions    │
│  ──► Timing-safe signature verify    │
└──────────────────┬───────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│  Layer 3: RBAC + BU Scope + ABAC     │
│                                      │
│  hasPermission(user, 'ai.chat')      │
│  hasBuViewAccess(profile, buId)      │
│  filterCustomersForUser(...)         │
│  ──► Enforced on EVERY tool call     │
└──────────────────────────────────────┘
The bootstrap token is particularly clever. The main API mints it (it knows the user's RBAC state), and the AI service validates it (it doesn't need to re-query the user database). TTL is 120 seconds — long enough for a chat session, short enough that a stolen token is useless quickly. Signature verification uses timingSafeEqual to prevent timing attacks.

Act 12: The Health & Observability System — Knowing What You Don't Know

Shipping an AI feature without observability is like flying blind. I built a comprehensive health monitoring system that tracks everything from token costs to user adoption to cache efficiency. This isn't a separate monitoring tool — it's baked into the application's system health dashboard, accessible to admins.

The Telemetry Pipeline

Every AI chat request emits structured operational log events to blob storage. The health endpoint scans these events with configurable lookback windows and computes real-time analytics:
  AI Chat Request
         │
         ▼
┌────────────────────┐
│  Operational Log   │     Events Emitted:
│  (Blob Storage)    │     ──────────────────────────
│                    │     AI_CHAT_REQUEST           (success + metrics)
│  Structured JSON   │     AI_CHAT_REQUEST_FAILED    (failure + error code)
│  per-request       │     AI_CHAT_FEEDBACK          (thumbs up/down)
└────────┬───────────┘     AI_CHAT_STREAM_OPENED     (SSE stream timing)
         │                 AI_CHAT_FIRST_EVENT       (time-to-first-token)
         ▼                 AI_CHAT_HTTP_ERROR        (client-side HTTP fail)
┌────────────────────┐     AI_CHAT_NETWORK_ERROR     (client-side network fail)
│  Health Endpoint   │     AI_CHAT_STREAM_END        (stream termination)
│                    │
│  Three Probes:     │
│  AI Cache Health   │
│  AI Chat Telemetry │
│  AI Chat Insights  │
└────────────────────┘

Probe 1: AI Cache Health

Monitors shared cache effectiveness with automatic recommendations:
// Cache health probe output
{
  status: 'connected',        // connected | warning | error
  lookbackHours: 72,
  requestsSampled: 142,
  sharedCacheHits: 87,
  sharedCacheMisses: 214,
  hitRatePct: 28.9,
  lookupCoveragePct: 95.1,    // % of requests using cache
  requestHitRatePct: 44.3,    // % of requests with >= 1 hit
  targetHitRatePct: 15,       // configurable threshold
  note: 'ok',
  recommendation: 'Shared cache hit rate is within target range.',
  suggestedSettingChange: null
}
When the hit rate drops below target, the probe returns actionable recommendations like "Increase cache TTL to 300000ms to improve shared-cache reuse." The evaluation logic considers sample size too — it won't raise alarms on fewer than 20 requests.

Probe 2: AI Chat Telemetry

Tracks operational health of the AI chat endpoint itself, with failure classification:
// Chat telemetry probe output
{
  status: 'connected',
  lookbackHours: 24,
  requestsSampled: 47,
  successRequests: 44,
  failedRequests: 3,
  timeoutFailures: 1,
  failureRatePct: 6.38,
  failureCodeCounts: {
    'AI_TIMEOUT': 1,
    'AI_PROVIDER_ERROR': 2
  },
  p95StreamOpenedMs: 1240,    // Time to SSE stream open
  p95FirstEventMs: 2850,      // Time to first SSE event
  likelyAppPreStreamFailureCount: 0,
  likelyProviderOrToolTimeoutCount: 1,
  suspectedProxyTimeoutPreStreamCount: 0,
  note: 'ok'
}
Notice the suspectedProxyTimeoutPreStreamCount. I discovered that certain reverse proxies impose a 45-second timeout on HTTP connections. The health probe correlates client-side timing signals (elapsed ~43-47s) with server-side stream-opened events to detect whether failures are app-side or proxy-side. This saved me weeks of debugging. Health status is evaluated with configurable thresholds (default: warning at 20% failure rate, critical at 50%), with special handling for timeout spikes.

Probe 3: AI Insights — Usage, Cost, and Adoption

The deepest probe. It computes comprehensive business intelligence about the AI feature itself:
// AI Insights probe output (simplified)
{
  sampled: {
    requests7d: 312,
    requestsMtd: 847,
    successfulRequests7d: 298,
    failedRequests7d: 14,
    timeoutFailures7d: 3
  },
  tokens: {
    input7d: 1_420_000,
    output7d: 312_000,
    inputMtd: 3_890_000,
    outputMtd: 842_000,
    avgInputPerChat7d: 4551,
    avgOutputPerChat7d: 1000
  },
  cost: {
    currency: 'USD',
    pricingConfigured: true,
    estimatedUsd7d: 0.0284,
    estimatedUsdMtd: 0.0781
  },
  usage: {
    totalChats7d: 312,
    totalChatsMtd: 847,
    activeAiEligibleUsers: 8,
    avgChatsPerEligibleUser: 39,
    topUserByChats: { name: '***', chats: 142 },
    nonUsersCount: 3,
    nonUsersPreview: [...]
  },
  performance: {
    slowThresholdMs: 10_000,
    p50DurationMs: 5200,
    p95DurationMs: 12400,
    p50FirstTokenMs: 1100,
    p95FirstTokenMs: 3200,
    timeoutRatePct: 0.96,
    slowRatePct: 8.3
  },
  quality: {
    thumbsUp: 24,
    thumbsDown: 3,
    positiveRatePct: 88.89,
    feedbackCoveragePct: 9.06
  }
}
Let me break down what this gives me:
CategoryMetricsWhy It Matters
Token UsageInput/output tokens (7d + MTD), average per chatCost forecasting, prompt optimization signals, context window utilization
Cost EstimationEstimated USD (7d + MTD), configurable per-model pricing tableBudget tracking, cost-per-user analysis, ROI calculation
User AdoptionActive users, chats per user, top user, non-adopters listFeature adoption tracking, training needs identification, champion users
Performancep50/p95 duration, p50/p95 first-token, timeout rate, slow rateSLA monitoring, user experience optimization, provider health
QualityThumbs up/down, positive rate, feedback coverage %Answer quality monitoring, prompt/tool tuning signals

Token Cost Estimation

The cost system supports a per-model pricing table with partial model-name matching for fallback resolution:
// Configurable per-model pricing
TOKEN_COST_TABLE = {
  "primary_model": {
    "inputUsdPerMtok": 3.0,
    "outputUsdPerMtok": 15.0
  },
  "fallback_model": {
    "inputUsdPerMtok": 0.25,
    "outputUsdPerMtok": 1.25
  }
}

function estimateTokenCostUsd({ inputTokens, outputTokens, rates }) {
  const inputCost  = (inputTokens  / 1_000_000) * rates.inputUsdPerMtok;
  const outputCost = (outputTokens / 1_000_000) * rates.outputUsdPerMtok;
  return inputCost + outputCost;
}

AI Service Runtime Health

The AI service itself exposes two health endpoints probed by the main health dashboard:
GET /healthz   ──► { status, runningRuns, queuedRuns }
GET /readyz    ──► { status, storage.mode, data.authMode }
The main health endpoint combines these with telemetry probes into a unified status that powers the admin dashboard. Service status is computed by merging health + readiness signals, with nuanced logic (e.g., health-healthy + ready-unhealthy = degraded, not error).

The Done Event — Per-Request Telemetry

Every completed AI request emits a done SSE event packed with operational metrics. This is what feeds all three health probes:
// done SSE event shape
{
  type: 'done',
  responseId: '...',
  model: '...',
  durationMs: 5200,
  firstTokenMs: 1100,
  historyMessageCount: 4,
  inputTokens: 4200,
  outputTokens: 680,
  toolCalls: 1,
  webSearchToolCalls: 0,
  toolProfile: 'finance_interactive',
  toolAllowlistSize: 6,
  iterationBudget: 2,
  toolCallBudget: 2,
  orchestrationBudgetMs: 40000,
  providerBudgetMs: 28000,
  budgetExitReason: null,
  completionState: 'completed',
  requestMemoHits: 2,
  requestMemoMisses: 1,
  sharedCacheHits: 1,
  sharedCacheMisses: 0
}
This single event gives you: latency (total + first-token), token usage, tool efficiency (calls vs. budget), cache performance (hits vs. misses), model used, and completion status. It's the telemetry primitive that everything else is built on.

Act 13: UI Context — Making the AI See What You See

Most AI chat panels are blind to the rest of the application. Mine isn't. I pass a structured uiContext payload with every request that describes exactly what the user is looking at:
// Sanitized UI context structure
{
  version: 1,
  scope: {
    tab: 'dashboard',
    buId: '***',
    fy: 'FY26',
    historicalMode: false
  },
  page: {
    viewId: 'dashboard',
    label: 'Budget Dashboard',
    purpose: 'Overview of revenue, costs, and margin for current FY',
    activeFilters: [
      { key: 'region', label: 'Region', value: 'Gauteng' }
    ],
    visibleWidgets: [
      { id: 'revenue-chart', label: 'Monthly Revenue' },
      { id: 'margin-gauge', label: 'Gross Margin %' }
    ],
    kpiSummaries: [
      { widgetId: 'revenue-chart', label: 'YTD Revenue',
        value: 'R45,230,000', trend: '+12% YoY' }
    ],
    freshness: { asOf: '2026-04-07', source: 'actuals', stale: false },
    warnings: ['ERP delivery data >7 days old']
  },
  lastAction: {
    type: 'filter_change',
    label: 'Applied region filter',
    target: 'Gauteng'
  }
}
This is how the copilot can answer "What does this dashboard show?" or "Why is this KPI red?" without hallucinating. It literally sees the same widgets, filters, and KPIs the user sees. The payload is capped at 12KB and progressively trimmed (KPIs first, then widgets, then warnings, then filters, then purpose text) if it exceeds that limit. The system prompt explicitly marks it as "untrusted data context, not instructions" — preventing prompt injection through crafted widget labels.

Act 14: Model Selection & Fallback

I run a primary model with a lighter fallback. The model selection is entirely environment-driven and supports an ordered preference list:
// Model selection supports a configurable preference chain
const CONFIGURED_PREFERENCE = process.env.MODEL_PREFERENCE
  .split(',').map(v => v.trim()).filter(Boolean);

// Default fallback: Primary ──► Lighter model
const DEFAULT_FALLBACKS = [MODEL_PRIMARY, MODEL_FALLBACK];

function buildModelCandidates(initialModel) {
  // Returns ordered list: preferred model first, then fallbacks
  // Enables retry with automatic model downgrade
}
The buildModelCandidates() function produces an ordered list that the orchestration loop can iterate through if the primary model fails or times out. Today it's one provider. Tomorrow, adding another means implementing one function: createMessage() on the ProviderTransport contract.

The Full Request Lifecycle

Here's how a single user question flows through the entire system, end to end:
User types: "How are we tracking against the PBT target?"
      │
      ▼
[1] Chat Panel (React)
    Refresh bootstrap token (HMAC, 120s TTL)
    Build request: { messages, newMessage, buId, fy, uiContext, promptContext }
    POST /ai/chat (Fetch API with ReadableStream)
      │
      ▼
[2] Reverse Proxy
    Inject Client-Principal-ID (unforgeable)
    Forward to AI Service
      │
      ▼
[3] Route handler (Fastify)
    enforceProxyIdentity() ──► extract user
    runtime.startChatRequest(request)
      │
      ▼
[4] Runtime Manager
    Check concurrency: user slots (2), global slots (4)
    Validate bootstrap tokens (HMAC-SHA256, timing-safe)
    Check RBAC: hasPermission(user, 'ai.chat')
    Check BU scope: hasBuViewAccess(profile, buId)
      │
      ▼
[5] Question Router (Tool Profiles)
    Classify question ──► finance_interactive
    Select tools: run_financial_model, run_services_model, ...
    Build budget: 2 tool calls, 40s cap, 28s per-provider
    Merge promptContext hints if present
      │
      ▼
[6] Orchestration Engine
    │
    │  ┌─ Turn 1 ──────────────────────────────────────────┐
    │  │  selectModel() ──► primary model                  │
    │  │  buildSystemPrompt() + uiContext + runtime rules  │
    │  │  providerTransport.createMessage(...)             │
    │  │                                                   │
    │  │  Model responds: stopReason=tool_use              │
    │  │    ──► run_financial_model                        │
    │  └────────────────────────┬──────────────────────────┘
    │                           │
    │  ┌─ Tool Execution ───────▼──────────────────────────┐
    │  │  Check request memo cache ──► MISS                │
    │  │  Check shared cache (500/2min) ──► MISS           │
    │  │  Resolve scope ──► BU access verified             │
    │  │  Execute: load products + customers + actuals     │
    │  │  Run financial model engine (synchronous)         │
    │  │  Sanitize result (strip internal fields)          │
    │  │  Store in request memo + shared cache             │
    │  │  Emit SSE: tool_start ──► tool_end (w/ metrics)   │
    │  └────────────────────────┬──────────────────────────┘
    │                           │
    │  ┌─ Turn 2 ───────────────▼──────────────────────────┐
    │  │  Inject tool result into messages                 │
    │  │  createMessage() ──► generates final answer       │
    │  │  stopReason: end_turn                             │
    │  │  Stream text deltas via SSE                       │
    │  └────────────────────────┬──────────────────────────┘
    │                           │
    ▼                           ▼
[7] SSE Events emitted:
    ──► ready        { runId, correlationId }
    ──► status       { phase: 'thinking', label: 'Analyzing...' }
    ──► tool_start   { name: 'run_financial_model' }
    ──► tool_end     { name: '...', durationMs: 820,
                       cacheLayer: 'none', cacheHit: false }
    ──► text         { mode: 'delta', content: 'SOURCE: data\n...' }
    ──► text         { mode: 'delta', content: '...PBT tracking at 94%...' }
    ──► follow_ups   ['What is driving the margin gap?', ...]
    ──► done         { model: '...', inputTokens: 4200, outputTokens: 680,
                       toolCalls: 1, durationMs: 6400,
                       requestMemoHits: 2, sharedCacheHits: 0 }
      │
      ▼
[8] Browser renders incrementally:
    Status indicator: "Running financial model..."
    Streaming text appears word-by-word
    Follow-up chips rendered at completion
    Metrics logged for diagnostics

[9] Operational log written to blob storage:
    AI_CHAT_REQUEST event with all metrics
    Feeds into health probes on next health check

What I Learned

After building this system across several months, these are the lessons that weren't obvious when I started:
  1. Contracts before code. I defined the provider transport, data access adapter, and event sink as frozen contract objects before writing a single line of orchestration logic. This forced me to think about boundaries first and made the core AI package genuinely portable. The contracts even include runtime validation that throws at boot if an implementation is missing methods.
  2. Tool budgets prevent runaway costs. Without iteration and call-count limits, an LLM will happily chain 6-8 tool calls to "be thorough." That's expensive and slow. My 2-call budget forces the model to be selective and answer with partial results + follow-up suggestions rather than exhaustive retrieval.
  3. Cache the tools, not the LLM response. I cache at the tool-result layer, not the final response layer. This means different questions that happen to need the same underlying data share cached tool results, even though the LLM generates different answers. Much higher hit rate than response caching.
  4. UI context is a superpower, but treat it as untrusted. Passing structured page state to the AI makes it dramatically more useful. But it's also a prompt injection surface. I explicitly mark it as "untrusted data context, not instructions" in the system prompt and cap it at 12KB with progressive trimming.
  5. SSE needs application-layer recovery. HTTP SSE is great until the connection drops. You need a parallel polling path and a stream reattach path so the frontend can recover without losing the response. Heartbeats (I use 8-second intervals) are essential for keeping connections alive through aggressive proxy timeouts.
  6. Authorization must be per-tool-call, not per-request. A single AI request might call 3 different tools accessing 3 different data domains. Each tool call must independently verify the user has access to the data it's about to return. "They passed auth at the front door" is not enough.
  7. Mandatory response framing builds trust. Requiring SOURCE/CONFIDENCE/FOLLOW_UPS on every response was the single highest-impact prompt engineering decision. Users immediately know whether they're looking at real data or general knowledge, and the confidence flag naturally trains them to ask clarifying follow-ups.
  8. Observability must be built-in, not bolted on. The done SSE event carries enough telemetry (tokens, duration, cache hits, tool calls, model, completion state) to power three health probes without any external monitoring infrastructure. I compute p50/p95 latencies, cache hit rates, cost estimates, user adoption, and quality scores entirely from operational logs — no Datadog, no Grafana, no third-party APM required.
  9. Question routing is worth the complexity. Sending all 23 tools to every request is wasteful — the LLM has to process all those definitions, and it's more likely to call irrelevant tools. Profile-based routing (4 profiles, 1-6 tools each) reduced median tool usage by ~40% and improved response quality.
  10. Track non-adopters, not just users. My health insights probe reports which eligible users haven't used the AI feature yet. This is more actionable than total usage numbers. Three people not using the feature is a training opportunity; 80% not using it is a product problem.

Try This Yourself

If you're building AI features into an existing line-of-business application, here's my recommended approach:
  1. Start with contracts. Define your provider transport, data access adapter, and event sink as explicit interfaces before choosing an LLM provider. This pays dividends immediately when you need to switch models or add fallbacks.
  2. Build tools, not prompts. The system prompt matters, but tools are where the real value lives. Each tool should be a thin, authorized wrapper around an existing data operation in your app.
  3. Budget everything. Set limits on tool calls, iterations, orchestration time, and per-provider timeouts from day one.
  4. Cache at the tool layer. Request-scoped memo for deduplication within a turn, shared cache with TTL for cross-request reuse, and domain-aware freshness for data that changes on known cycles.
  5. Pass UI context. Even a minimal payload (current page, active filters, visible KPI values) makes the AI dramatically more useful. But sanitize it, cap it, and mark it as untrusted.
  6. Use SSE with a polling fallback. Stream events for responsiveness, but always have a status endpoint and a stream reattach path for recovery. Add heartbeats.
  7. Embed telemetry in every response. Make the AI done event carry token counts, cache metrics, latency, model used, and completion state.
  8. Track cost at the model level. Configure per-model token pricing and compute estimated costs in your health probes.
  9. Make the response frame mandatory. SOURCE, CONFIDENCE, staleness warnings, and follow-up suggestions should be non-negotiable.
  10. Monitor adoption, not just health. Build a probe that identifies eligible non-users. It's the most actionable metric for driving feature adoption.
The full architecture — contracts, tools, profiles, caching, streaming, auth, health probes — runs as a standalone Fastify service with an extracted workspace package for AI core logic. The core package is provider-agnostic. The service is deployment-agnostic. The tools are database-agnostic (through the data access adapter). The health system is APM-agnostic (built on operational logs). I started with a dashboard. I ended up with a copilot that sees what you see, knows what the database knows, answers like a senior analyst, and reports its own health. The architecture isn't just about making it work — it's about making it replaceable at every layer. Today it's one LLM provider. Tomorrow it could be anything. The contracts don't care. That's the point.

Claude's Honest Assessment: Strengths, Gaps, and Where the Industry Is Heading

No architecture post is complete without an honest look at what I got right, what I didn't, and where the industry is going. After deep-diving into how companies like Microsoft, ThoughtSpot, Tableau, and dozens of startups are building AI into BI tools in 2025-2026, here's my self-assessment.

What I Got Right

DecisionIndustry Validation
Tool-use over text-to-SQLResearch shows text-to-SQL accuracy drops from 85-92% on clean academic benchmarks to 6-21% on enterprise schemas (Spider 2.0, ICLR 2025 paper). My tool-based approach avoids this entirely — the LLM never writes raw database queries. It calls pre-built, authorized, scope-enforced tools. This aligns with the industry shift toward semantic-layer-aware AI, where the model queries governed metric definitions rather than raw SQL. ThoughtSpot, Holistics, and Looker have all converged on this pattern.
SSE streaming with recoverySSE is the de facto industry standard for LLM response streaming. Every major provider (OpenAI, Anthropic, Google) uses it. My heartbeat keepalives (8s), proxy-buffering headers, and parallel polling fallback are textbook production patterns. The stream reattach mechanism goes beyond what most implementations offer.
Per-tool-call authorizationMicrosoft's security playbook for AI agents (2026) explicitly calls out that "a prompt injection vulnerability in a multi-tenant agent could cross tenant boundaries" and describes this as catastrophic. My per-tool-call scope resolution with account-manager-level data isolation is ahead of most enterprise AI implementations, which typically enforce auth only at the request level. The OWASP Top 10 for LLM Applications (PDF) identifies excessive agency and improper output handling — both enabling privilege escalation — as top-5 risks.
Provider-agnostic contractsThe market has converged on provider abstraction as essential infrastructure. Solutions like LiteLLM (40K+ GitHub stars, 240M+ Docker pulls), Bifrost, and Portkey all provide unified provider interfaces. My frozen contract pattern achieves the same goal without an external dependency, and the assertContract() boot-time validation is a pattern I haven't seen in any gateway solution.
Built-in observabilityMost teams bolt on third-party observability (Helicone, LangSmith, Langfuse) after deployment. My approach — embedding telemetry in the done event and computing health probes from operational logs — eliminates a dependency and gives me adoption metrics (non-user tracking) that no off-the-shelf tool provides.
Tool budgets and question routingIndustry consensus: unbounded tool use is the #1 cause of LLM cost overruns in agentic applications. My 2-call budget with profile routing is more disciplined than most production implementations. Anthropic's own guides on building effective agents and writing tools for agents recommend exactly this pattern: classify intent, select a tool subset, enforce a call budget.
UI context awarenessGenuinely differentiated. Most AI chat panels are blind to the host application. My structured uiContext payload with progressive trimming and prompt-injection-safe labeling is a pattern I haven't seen documented in any major BI vendor's public architecture. The context-nudge system (auto-suggesting questions when pages change) is unique.
Mandatory response framingSOURCE/CONFIDENCE/staleness/follow-ups framing aligns with emerging enterprise AI governance requirements. Gartner's 2026 AI governance framework recommends explicit source attribution and confidence signaling for any AI feature that influences business decisions.

What I Could Do Better — Honest Gaps according to Claude Code

GapCurrent StateIndustry StandardImpact
No semantic cachingExact-match tool-result caching (hash-based). Hit rates depend on identical inputs.Semantic caching (embedding similarity) achieves 40-70% hit rates vs. 10-15% for exact match. Solutions like Bifrost and GPTCache offer this at the gateway layer. FAQ-heavy workloads see 60-85% hit rates; my financial BI pattern would likely see 30-50%.Medium. My tool-level caching partially compensates, but I'm leaving cost savings on the table for paraphrased questions ("What's our margin?" vs "Show me the margin numbers").
No LLM-as-judge evaluationQuality monitoring via thumbs up/down only (9% feedback coverage).Leading teams run a three-layer evaluation: (1) automated heuristic checks on 100% of traffic, (2) LLM-as-judge scoring on 5-10% of requests, (3) human review for edge cases. Research shows judge models align with human judgment up to 85%. Tools: DeepEval, TruLens, Langfuse evals.High. With only 9% feedback coverage, I have blind spots on answer quality. I catch failures (errors, timeouts) but not subtle quality degradation (correct but unhelpful answers, missing nuance).
Single-provider dependencyProvider-agnostic contracts exist, but only one provider implementation is wired up.The industry is rapidly moving toward multi-provider strategies. Production teams are using AI gateways (Portkey, LiteLLM) for automatic failover across 2-3 providers.Medium. The contracts are ready, but I haven't exercised the abstraction. A provider outage today means total AI feature downtime.
No conversation persistenceChat history lives in browser sessionStorage only. Close the tab, lose the conversation.Most enterprise AI copilots persist conversation history server-side for audit trails, cross-device continuity, and analytics.Low-Medium. Deliberate choice (zero storage cost, no data retention liability), but limits my ability to do conversation-level quality analysis.
Regex-based question routingRegex patterns classify questions into 4 profiles with 12 intent types.More sophisticated routers use lightweight embedding classifiers or distilled intent models. These handle paraphrasing and multilingual queries better.Low. My regex routing works well for my domain (financial English with a bounded vocabulary), but would struggle with multilingual or highly varied question patterns.
No distributed cacheIn-process Map with LRU eviction. Cache is per-instance.Multi-instance deployments use Redis or a distributed cache layer. My in-process cache means cache misses when requests hit different instances.Low. I run a single AI service instance today. This becomes a gap only at scale-out.
Read-only agent onlyThe AI cannot take actions — it can only analyze and recommend.The industry is cautiously moving toward "action agents" that can trigger workflows, send alerts, and execute constrained write operations. Microsoft Copilot Studio, Salesforce Agentforce, and ThoughtSpot's Agentic Analytics all support agent-initiated actions with approval workflows.Deliberate trade-off. Read-only is a security boundary I chose intentionally. But it means users have to manually act on every recommendation.

Future Roadmap — Where This Architecture Should Go Next (Claude's recos)

Based on where the industry is heading in 2026-2027, here are the highest-value additions to consider, ordered by impact-to-effort ratio:

Tier 1: High Impact, Near-Term (Weeks)

  1. Automated quality evaluation (LLM-as-judge). Run a lightweight evaluation model on 5-10% of responses, scoring for groundedness (did the answer match tool results?), relevance (did it answer the question?), and completeness. This closes the biggest observability gap. Tools like DeepEval or Langfuse evals make this a 1-2 week implementation. The done event already carries enough context to feed an evaluator.
  2. Second provider implementation. Wire up a second LLM provider behind the existing ProviderTransport contract. The contracts are ready — this is purely an implementation exercise. Automatic failover across providers eliminates single-provider downtime risk.
  3. Tiered model routing. Route simple questions (methodology lookups, single-tool queries) to a cheaper/faster model and reserve the primary model for complex multi-tool analysis. Industry data shows 60-75% of queries can be handled by a lighter tier with no quality drop, yielding 40-60% cost reduction. My question router already classifies intent — adding model selection per profile is a natural extension.

Tier 2: High Impact, Medium-Term (1-2 Months)

  1. Semantic caching layer. Add embedding-based similarity matching as a cache layer between the request memo and the shared tool cache. When "What's our gross margin?" hits, the cached result for "Show me the margin numbers" should match at ~0.9 similarity. Production systems report 30-50% hit rates for analytical workloads, with latency dropping from seconds to single-digit milliseconds on hits. For enterprise apps locked behind Entra ID, the cache must stay inside the security boundary — public SaaS caching services are not an option. Azure Cosmos DB vector search (already in my stack, supports DiskANN-based similarity) or Azure Cache for Redis deployed with private endpoints inside the VNet are both viable. Alternatively, a lightweight in-process embedding approach using a small local model can avoid any external dependency entirely — compute similarity on the AI service itself and keep everything within the existing deployment boundary.
  2. Proactive insight surfacing. The industry is moving from reactive (user asks, AI answers) to proactive (AI notices something and suggests investigation). I already have the infrastructure: the UI context system detects page changes and generates nudges. Extending this to data-driven alerts ("Revenue dropped 15% this month vs. last month — want me to investigate?") would be a natural evolution.
  3. Conversation persistence for audit and analytics. Selectively persist conversation transcripts server-side (the writeTranscript() infrastructure already exists in the runtime manager). This enables conversation-level quality analysis, compliance audit trails, and cross-session continuity. Implement with opt-in consent and configurable retention policies.

Tier 3: Strategic, Longer-Term (3-6 Months)

  1. Multi-agent orchestration. Today I run a single agent with tool use. The next step is specialized sub-agents: a data retrieval agent, an analysis agent, a visualization agent, and a narrative agent. Frameworks like LangGraph (now GA) provide graph-based execution with checkpointing and human-in-the-loop. The contracts already separate orchestration from data access — decomposing the orchestration into collaborating agents is architecturally natural. Caution: Gartner predicts over 40% of agentic AI projects will be cancelled by 2027 due to coordination complexity.
  2. Constrained write-back actions. Move from read-only analyst to constrained actor: "Shall I flag this customer for review?" or "Want me to set a monitoring alert for this KPI?" The action layer pattern (approve/reject workflow, audit trail, rollback capability) is emerging in enterprise AI. This is the highest-risk item on the roadmap — it fundamentally changes the security model. Start with soft actions only (create alerts, flag items, generate reports) before considering data mutations.
  3. MCP (Model Context Protocol) integration. Anthropic donated MCP to the Linux Foundation's Agentic AI Foundation, and it's been adopted by OpenAI, Microsoft, AWS, and Google. Exposing my tool surface as an MCP server would allow any MCP-compatible client (IDE copilots, other agents, automation platforms) to query my financial data through the same authorized, scoped, cached tool pipeline.
  4. Natural language to dashboard generation. Rather than answering questions about existing dashboards, generate new dashboard views from natural language descriptions. "Show me a dashboard tracking our top 5 customers' margin trends over the last 3 quarters." This is where ThoughtSpot's Agentic Analytics and Tableau's Einstein are heading. It's the most ambitious item on the list — it blurs the line between AI feature and core product.

The Maturity Spectrum mapped by Claude Code

Placing my architecture on the industry maturity spectrum (adapted from GoodData's Agentic Analytics framework and Gartner's autonomy model):
Industry AI-in-BI Maturity Spectrum (2026)
──────────────────────────────────────────

Stage 1               Stage 2               Stage 3               Stage 4
Basic Chat            Tool-Use Agent        Multi-Agent           Autonomous
──────────────        ──────────────        ──────────────        ──────────────
NLQ interface         Tool calling          Sub-agents            Proactive
Static prompts        Data grounding        Multi-model           Action layer
No data access        Auth scoping          Semantic cache        Self-improving
Single model          SSE streaming         LLM-as-judge          NL-to-dashboard
No observability      Tool budgets          Conversation DB       MCP ecosystem
                      Built-in telemetry    Proactive nudges
                      UI context            Write-back (soft)
                      Cache layers
                      Question routing
                      Health probes

       Most BI                   ┌───┐
       vendors                   │   │  ◄── I am here
       are here                  └───┘
         │                         │
         ▼                         ▼
┌──────────────────┐  ┌──────────────────────────┐  ┌──────────────────┐
│ Power BI Copilot │  │ My Implementation        │  │ ThoughtSpot      │
│ Tableau Einstein │  │                          │  │ Spotter (2026)   │
│ Looker Gemini    │  │ Stage 2 with partial     │  │ Stage 2-3        │
│ (mostly Stage 1) │  │ Stage 3 elements         │  │ transition       │
└──────────────────┘  │ (UI context, nudges,     │  └──────────────────┘
                      │  health probes)          │
                      └──────────────────────────┘
I'm solidly in Stage 2 with several Stage 3 elements already in place (UI context awareness, context nudges, comprehensive health probes). The gaps are well-defined (semantic caching, LLM-as-judge, multi-provider, conversation persistence), and the architecture was designed to accommodate them without rearchitecting. The industry is moving fast. Gartner predicts 40% of enterprise apps will embed AI agents by end of 2026. The companies that get the architecture right now — contracts, authorization, observability, cost control — will be the ones that can evolve from Stage 2 to Stage 3 without a rewrite. I designed for exactly that.

Thursday, 2 April 2026

AI School Fees: The $0 Database That Wasn't: How AI Agents Silently Burned Through My Azure Budget Twice

I told the agent "zero cost." It provisioned 8,000 RU/s of dedicated throughput. I fixed it. It did it again. Here's the full forensic timeline.




The Problem

When I started building this internal enterprise app on Azure, the constraints were clear: free tier only. Azure Cosmos DB gives you 1,000 RU/s free. The app had ~10 containers. The math was simple — shared throughput across the database, stay under 1,000 RU/s, pay nothing.

I documented this everywhere. The agent contract said "RU-frugal." The app rules said "any throughput or retention change must be documented." The SAP feature brief said "Free Tier Guardrails — non-negotiable." The AI feature design explicitly rejected a Cosmos-backed chat history because it "violates the zero Azure cost constraint."

Despite all of this, the AI agent provisioned expensive dedicated throughput — not once, but twice. Both times I had to manually intervene, audit the damage, and harden the codebase to prevent it from happening again.

This is the forensic timeline of what happened, reconstructed from git history.


The Architecture Context

The app is a React + Azure Functions stack backed by Cosmos DB NoSQL. All containers use partition key /pk. The intended cost model was:

  Cosmos DB Free Tier
  ───────────────────
  1 Database  →  shared throughput (400-600 RU/s)
  10 Containers  →  no dedicated throughput
  ───────────────────
  Total: $0/month (within 1,000 RU/s free allowance)

Simple. Except the AI agent had a different idea.


Act 1: The Silent Provisioning (Feb 7, 2026)

What The Agent Did

I asked the AI agent to set up CI/CD scaffolding and infrastructure automation. Commit <sha-1> created scripts/setup-cosmos.sh — a script to provision Cosmos databases and containers. Sounds reasonable. Here's what it actually created:

THROUGHPUT=400

az cosmosdb sql container create \
    --partition-key-path "$PARTITION_KEY" \
    --throughput "$THROUGHPUT"     ← 400 RU/s PER CONTAINER

That --throughput flag on the container create command is the problem. It provisions dedicated throughput per container, not shared throughput at the database level.

The script also created two databases: a production DB and a dev DB. Both got the same treatment.

The Math

  What I asked for:          What the agent provisioned:
  ──────────────────         ──────────────────────────────
  1 DB, shared 400 RU/s     2 DBs, dedicated per-container

  Production:                Production:
    400 RU/s shared            10 containers × 400 RU/s = 4,000 RU/s
    $0 (free tier)             $0.008/hr × 10 = billable

  Dev:                       Dev:
    Emulator (local)           10 containers × 400 RU/s = 4,000 RU/s
    $0                         $0.008/hr × 10 = billable

  Total: ≤ 1,000 RU/s       Total: ~8,000 RU/s dedicated
  Cost: $0/month             Cost: Azure billing surprise

The agent created 8x the intended throughput across two databases, all with dedicated provisioning that can't be scaled below 400 RU/s per container. The Cosmos free tier's 1,000 RU/s allowance was instantly overwhelmed.

Why It Happened

The agent treated database provisioning as a standard infrastructure task. It knew Cosmos needs throughput. It picked the per-container model (which is the more common pattern in documentation and tutorials) without considering that:

  1. Shared throughput exists and is the correct model for cost-sensitive workloads
  2. A dev database in the cloud is unnecessary when the Cosmos emulator exists
  3. 400 RU/s is a floor, not a ceiling — you can't go lower with dedicated provisioning
  4. The cost rules in the project docs explicitly prohibited this

Act 2: The First Cleanup (Feb 22, 2026)

I discovered the cost spike through Azure billing alerts and immediately performed a forensic audit. Commit <sha-2> documents the full cleanup in a cost plan document that reads like an incident post-mortem.

The Damage Assessment

From the cost plan doc I wrote at the time:

"Legacy dedicated-throughput DBs still exist and still bill baseline RU: <app-db> → 10 containers × 400 RU/s dedicated. <app-db>-dev → 10 containers × 400 RU/s dedicated."

The Fix: V2 Databases with Shared Throughput

I created new databases with V2costsaver in the name (yes, I literally named them to remind future agents about cost) and rewrote the setup script:

  Before (agent's version):              After (my fix):
  ─────────────────────────              ──────────────────────────
  THROUGHPUT=400                         DB_THROUGHPUT="${DB_THROUGHPUT:-400}"

  az cosmosdb sql container create \     az cosmosdb sql database create \
    --throughput "$THROUGHPUT"              --throughput "$DB_THROUGHPUT"
                                           ← shared at DB level
  (per container = expensive)
                                         az cosmosdb sql container create \
                                           ← NO --throughput flag
                                           (inherits from database)

Then I ran the decommission:

  1. Created V2 databases with shared throughput
  2. Migrated all production data
  3. Added rollback support (--rollbackToV1Cosmos flag)
  4. Verified all 6 cutover gates passed
  5. Deleted both V1 databases
  6. Applied Azure budget alerts: $300/month cap with alerts at 50%, 80%, 100%
  7. Added Cosmos daily RU spike alert (> 2M RU in 24h)

The Emulator Decision

Six days later (Feb 28, commit <sha-3>), I made a harder decision: eliminate the cloud dev database entirely. The local Cosmos emulator would serve as the dev environment. This meant:

  • Zero cloud cost for development
  • Dev database routing consolidated into an emulator-first mode in dbResolver.js
  • A new mirror-to-emulator.mjs script for refreshing local dev data
  • The cloud dev DB (<app-db>-dev-V2costsaver) was decommissioned

Final state: one production database at 600 RU/s shared throughput — well within the 1,000 RU/s free tier allowance. Cost: $0/month.


Act 3: The Regression (Mar 3, 2026)

Five days later, the AI agent struck again.

Commit <sha-4> — a large feature commit (30 files, 3,425 insertions) implementing fiscal-year structural changes — quietly re-introduced the cloud dev database code path that I had just removed.

What The Agent Changed

In api/lib/dbResolver.js, the agent rewrote the database mode resolver. My Feb 28 version had consolidated all non-production paths to route to the emulator. The agent's version re-expanded them:

  My version (Feb 28):                   Agent's version (Mar 3):
  ────────────────────                   ────────────────────────
  if (shouldUseEmulator())               if (hasArg(EMULATOR_FLAG))
    return 'emulator';                     return 'emulator';
  if (shouldUseSupportDevDb())           if (hasArg(SUPPORT_DEV_FLAG))
    return 'support_dev_db';               return 'support_dev_db';   ← RE-ADDED
                                         if (isTruthy(COSMOS_USE_EMULATOR))
                                           return 'emulator';
                                         if (isTruthy(COSMOS_USE_SUPPORT_DEV_DB))
                                           return 'support_dev_db';   ← RE-ADDED

The 'support_dev_db' return path was back. The DEFAULT_DB_NAMES object still had supportDev: '<app-db>-dev-V2costsaver'. Combined with the init-cosmos.js script's createIfNotExists calls, this meant any script invocation with the dev flag would recreate the cloud dev database.

Why It Happened Again

The agent was working on a large feature (fiscal-year scoping) that touched the database layer. It needed to understand how database names were resolved across environments. Rather than preserving my carefully consolidated emulator-first logic, it re-derived the resolution function from first principles — and landed on the same multi-path pattern I had specifically eliminated.

The agent didn't know why those paths had been removed. It saw the pattern as "incomplete" and "helpfully" restored it. The commit message says nothing about database mode changes — they were buried in a 3,400-line feature diff.


Act 4: The Permanent Fix (Mar 29, 2026)

I'd had enough. Commit <sha-5>59 files changed, 277 insertions, 273 deletions — was a comprehensive retirement of all cloud dev database targeting across the entire codebase.

The Hard Guards

This time I didn't just remove the code paths. I made them impossible to restore:

1. Setup script errors on dev:

  # scripts/setup-cosmos.sh
  dev|--useSupportDevDB)
      echo "Cloud dev Cosmos setup is retired."
      echo "Use the local Cosmos emulator for development."
      exit 1

2. Runtime assertion in dbResolver.js:

  assertNoCloudNonProdDatabaseTarget()
  ────────────────────────────────────
  IF target DB ≠ production DB
  AND endpoint host ≠ localhost / 127.0.0.1 / emulator
  THEN → throw Error (hard crash)

3. Dev flags redirected to emulator: Any code passing --useSupportDevDB or setting COSMOS_USE_SUPPORT_DEV_DB=true now silently routes to the emulator instead of a cloud database.

4. Seed scripts refuse cloud non-prod targets: If the connection string points to Azure (not localhost), the seed scripts refuse to operate on non-production databases.

5. Default throughput documented at 600 RU/s: The setup script now defaults to 600 RU/s shared — within free tier — with the value explicitly visible in the script header.

The 59-File Sweep

The retirement touched every layer:

  Layer                     Files Changed    What Changed
  ─────────────────────     ─────────────    ─────────────────────────────────
  Database resolver         1                Hard assertion + emulator redirect
  Setup/provisioning        1                Dev path → error exit
  API scripts (20+)         23               All routed through new guards
  ETL scripts (JS+Python)   4                cosmosDbNames updated
  CI workflow               1                Dev DB references removed
  Documentation             8                Updated to emulator-first model
  Dev tooling               2                Local settings + dev script

The Numbers

MetricWave 1 (Feb 7)After Fix 1 (Feb 22-28)Wave 2 (Mar 3)After Fix 2 (Mar 29)
Cloud databases2 (prod + dev)1 (prod only)1 + code path for 2nd1 (prod only, hardened)
Throughput modelDedicated per-containerShared per-databaseShared (but dev path live)Shared, dev path blocked
Provisioned RU/s~8,000600600 (risk of +400)600
Free tier compliantNoYesFragileYes (enforced)
Guard railsDocs onlyDocs + script rewriteRegressedRuntime assertion + error exits
Files with dev DB refsGrowingConsolidatingRe-expanded0 (retired across 59 files)

What I Learned

1. Documentation Is Necessary But Not Sufficient

I had cost rules in agent-contract.md, app-rules.md, feature design docs, and the SAP brief. The rules said "RU-frugal," "zero Azure cost constraint," "Free Tier Guardrails — non-negotiable." The agent read them. The agent still provisioned dedicated throughput. Rules written in prose are suggestions. Rules written in code are enforcement.

2. AI Agents Optimize Locally, Not Globally

When the agent created the setup script, it was solving a local problem: "provision Cosmos containers." It picked the pattern most common in Azure documentation (dedicated throughput per container) without reasoning about the global cost constraint. When it re-introduced the dev DB path in Wave 2, it was solving another local problem: "make the database resolver more explicit." Both times, the agent's local optimization violated a global invariant.

3. Large Commits Hide Regressions

The Wave 2 regression was buried in a 3,425-line feature commit. The commit message mentioned fiscal-year changes, not database mode changes. If I'd reviewed only the commit message and stat, I'd have missed the dbResolver.js rewrite entirely. AI agents that make large commits need automated invariant checks, not just human code review.

4. "Remove" Is Not "Prevent"

My Feb 28 fix removed the cloud dev DB code path. My Mar 29 fix prevented it from being restored. The difference: runtime assertions that crash the process, script entry points that error on dev arguments, and seed scripts that refuse non-production targets on cloud endpoints. If you remove something from an AI-maintained codebase, you must also add a guard that prevents its resurrection.

5. Name Your Databases After Your Constraints

I named the V2 database *-V2costsaver. It's ugly. It's also the only thing in the codebase that survived every agent refactor without being renamed. Sometimes the best documentation is a name that makes the constraint impossible to ignore.


Try This Yourself

  1. Audit your IaC scripts for throughput flags. Search for --throughput in any Cosmos provisioning script. If it's on a container create (not a database create), you're paying per-container minimums.
  2. Add runtime guards, not just documentation. If your app must never target a cloud dev database, add an assertion that crashes on startup if it detects a non-production database on a cloud endpoint.
  3. Review large AI commits file-by-file. Don't trust commit messages for scope. A "fiscal-year feature" commit can silently regress your cost model.
  4. Set Azure budget alerts immediately. I should have done this on day one. A $300/month cap with 50%/80%/100% alerts would have caught Wave 1 within days instead of weeks.
  5. Use the emulator for dev. The Cosmos emulator is free, runs locally, and eliminates an entire category of cloud cost risk. If you're paying for a cloud dev database, ask yourself why.

The agent contract, the app rules, the feature design docs — none of them stopped this. What stopped it was a throw new Error() in the database resolver. Trust but verify. Then add a guard.


Mo Khan is just an old-timer engineer-turned-manager who forgot how fun it is to build things — and who learned the hard way that AI agents read your cost rules but don't always follow them.