Mo Khan's Blog

Tuesday, 7 April 2026

How I built an AI Finance Assistant into a Business Intelligence Dashboard with Claude Code and Codex

Twenty-three tools. Four tool profiles. A prompt library that rewrites itself when you change pages. A health dashboard that tracks token costs, cache hit rates, p95 latencies, and user-level adoption — in real time. I gave an LLM read-only access to my entire financial database, my ERP (SAP) pipeline, and a full fiscal simulation engine — then shipped it as a chat panel inside a budget dashboard. This is how I designed it. Every contract, every cache layer, every retry path, every health probe. If you're building AI features into a line-of-business application, this is the reference architecture that could help you on your own journey...

The Numbers

23 AI Tools | 4 Tool Profiles | 3 Cache Layers | 3 Auth Layers | 5 ERP Connectors

Act 1: The Problem — Dashboards Don't Answer Questions

I'd already built a production budget modelling and business intelligence ops dashboard tool: React frontend, serverless API, NoSQL backend. Product data, customer targets, actuals vs. forecasts, fiscal simulations — the works. It was a solid dashboard. But dashboards are passive. A CFO staring at twelve charts still has to synthesize the story. "Are we tracking against PBT (Profit Before Tax) target?" requires mentally combining revenue actuals, expense forecasts, margin config, and service ARR (Annual Recurring Revenue). That's four separate data views, minimum. I wanted something different: a financial intelligence analyst that lives inside the dashboard, has access to everything the dashboard knows, and can answer questions at FP&A (Financial Planning & Analysis)-analyst level. Not a chatbot bolted onto the side. But an agentic copilot that calls tools, runs models, queries databases, and cites its sources — all in real time, streamed back as SSE events.

Why tool-based, not RAG? My data is structured and relational — products have line items, customers have product mixes, actuals are monthly arrays. This is not a document search problem. It's a database query problem. Tools let the model query exactly the data it needs, stay within context limits, and reuse existing data-access code.

"But you're using a NoSQL document database, not a relational DB — isn't that a contradiction?" No — it's actually why tools are even more important. My database uses a non-standard query dialect with no cross-container JOINs, mandatory partition key routing, and data models that vary by document type. Text-to-SQL would be catastrophic here — the LLM would need to know partition key patterns, container boundaries, and cross-container join logic just to form a valid query. Tools encapsulate all of that complexity. The model calls query_products and gets clean, scoped data back. It never sees partition keys, internal IDs, or the document model underneath. And because every tool goes through the DataAccessAdapter contract, the entire AI layer is database-agnostic — I could swap to PostgreSQL or SQL Server tomorrow and the orchestration, prompt engineering, and tool profiles wouldn't change. The document database was a deliberate trade-off: zero-schema migrations, natural fit for heterogeneous product/customer structures, native cloud integration, and serverless pricing. The tool abstraction means that trade-off is invisible to the AI.

Act 2: The Architecture — End-to-End

Here's the full picture, from browser to database and back:

┌─────────────────────────────────────────────────────────────────┐
│                        BROWSER (React + Vite)                   │
│                                                                 │
│  ┌──────────────┐   ┌────────────────────┐   ┌───────────────┐  │
│  │  AI Chat     │──►│  Bootstrap Tokens  │──►│ POST /ai/chat │  │
│  │  Panel       │   │  (HMAC-SHA256,120s)│   │ (Fetch API)   │  │
│  └──────┬───────┘   └────────────────────┘   └───────┬───────┘  │
│         │                                            │          │
│         ▼                                            ▼          │
│  ┌──────────────┐   ┌────────────────────┐   ┌───────────────┐  │
│  │  Prompt      │   │  Context Nudges    │   │ Event Parser  │  │
│  │  Catalog     │   │  (Page-Aware)      │   │ delta/snapshot│  │
│  └──────────────┘   └────────────────────┘   │ tool tracking │  │
│                                              │ chartrendering│  │
│                                              └───────────────┘  │
└─────────────────────────────┬───────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                  REVERSE PROXY (Identity Injection)             │
│           Client-Principal-ID injected server-side (unforgeable)│
└─────────────────────────────┬───────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│              AI SERVICE (Fastify on Dedicated App Service)      │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                     RUNTIME MANAGER                       │  │
│  │                                                           │  │
│  │  Concurrency: 2 active/user, 4 global active, 8 queued    │  │
│  │  Watchdog:    10 min stall kill    Shutdown grace: 15s    │  │
│  │                                                           │  │
│  │┌─────────────────┐┌──────────────────┐┌──────────────────┐│  │
│  ││  Auth Layer     ││  Question Router ││  Orchestration   ││  │
│  ││                 ││                  ││  Engine          ││  │
│  ││  Proxy identity ││  Regex classify  ││  Model selection ││  │
│  ││  Bootstrap HMAC ││  Profile resolve ││  Tool budgets    ││  │
│  ││  RBAC check     ││  Chart detection ││  Iteration loop  ││  │
│  ││  BU scope       ││  ERP routing     ││  40s orch. cap   ││  │
│  ││  ABAC ownership ││  Prompt context  ││  28s provider    ││  │
│  │└─────────────────┘└──────────────────┘└────────┬─────────┘│  │
│  │                                                │          │  │
│  │┌─────────────────┐┌──────────────────┐┌────────▼─────────┐│  │
│  ││  SSE Streaming  ││  Tool Cache      ││  Tool Executor   ││  │
│  ││                 ││                  ││                  ││  │
│  ││  PassThrough    ││  Shared (500/2m) ││  23 tool handlers││  │
│  ││  Hijack reply   ││  Request-scoped  ││  Auth scope/call ││  │
│  ││  10 event types ││  Data freshness  ││  Result sanitize ││  │
│  ││  Reconnect      ││  ETL-aware       ││  5s timeout/tool ││  │
│  │└─────────────────┘└──────────────────┘└──────────────────┘│  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└──────────────────────────────────┬──────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────┐
│               ai-core (Extracted Workspace Package)             │
│                                                                 │
│  ┌────────────────────┐  ┌─────────────────┐  ┌─────────────┐   │
│  │ Contracts          │  │ System Prompt   │  │ Tool Defs   │   │ 
│  │ ProviderTransport  │  │ FP&A persona    │  │ 23 tools    │   │
│  │ DataAccessAdapter  │  │ 19 behavioral   │  │ JSON Schema │   │
│  │ EventSink          │  │   rules         │  │ Routing     │   │
│  │ SSE Events         │  │ Response frame  │  └─────────────┘   │
│  └────────────────────┘  └─────────────────┘                    │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                        PROVIDER LAYER                           │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │            ProviderTransport (Contract)                   │  │
│  │                                                           │  │
│  │  createMessage({ model, maxTokens, system,                │  │
│  │                   tools, messages })                      │  │
│  │      ──► { responseId, model, stopReason,                 │  │
│  │            content, usage }                               │  │
│  │                                                           │  │
│  │  Today:    Primary ──► Fallback (model chain)             │  │
│  │  Tomorrow: Any LLM provider (same contract)               │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                          DATA LAYER                             │
│                                                                 │
│┌────────────┐┌──────────────┐┌──────────────┐┌─────────────────┐│
││  NoSQL DB  ││  ERP ETL     ││  Financial   ││ Market          ││
││            ││  Pipeline    ││  Model       ││ Analysis        ││
││  products  ││  sales       ││  Engine      ││                 ││
││  customers ││  debtors     ││              ││ regions         ││
││  actuals   ││  orderbook   ││  run_model   ││ municipalities  ││
││  config    ││  stock       ││  what_if     ││ market share    ││
││  defs      ││  delivery    ││  services    ││ mfr split       ││
│└────────────┘└──────────────┘└──────────────┘└─────────────────┘│
└─────────────────────────────────────────────────────────────────┘

Act 3: The Contract System — Provider-Agnostic by Design

The single most important architectural decision I made: never let the LLM provider leak into business logic. I defined three frozen contracts that form the boundary between AI orchestration and everything else.

// ai-core/contracts/interfaces.js

export const ProviderTransport = freezeContract(
  'ProviderTransport',
  {
    createMessage: freezeMethod(
      'Submit one provider turn.',
      {
        accepts:  { model, maxTokens, system, tools, messages },
        returns:  { responseId, model, stopReason, content, usage },
      }
    ),
  },
);

export const DataAccessAdapter = freezeContract(
  'DataAccessAdapter',
  {
    loadRequestContext:       freezeMethod('Load group config and access state.'),
    loadConversationHistory:  freezeMethod('Load ordered chat history.'),
    executeToolCall:          freezeMethod('Execute a normalized AI tool call.'),
    persistRunArtifacts:      freezeMethod('Persist assistant output.'),
    writeAuditEvent:          freezeMethod('Write audit telemetry.'),
  },
);

export const OrchestrationEventSink = freezeContract(
  'OrchestrationEventSink',
  {
    emit: freezeMethod('Emit a normalized run event.'),
  },
);

Why this matters: the orchestration engine talks to ProviderTransport.createMessage(). It doesn't know or care whether that's Claude, GPT-4, Gemini, or a local model behind Ollama. The contract enforces a stopReason vocabulary (end_turn, tool_use, max_tokens, refusal, etc.) that the orchestration loop consumes to decide: "Do I call tools and loop? Or am I done?" Similarly, the DataAccessAdapter isolates the orchestration from the database, the ERP pipeline, and my financial model engine. The AI core package has zero database imports. Zero cloud SDK references. It's a pure orchestration library. Every contract includes runtime validation. At boot time, assertContract() throws a TypeError if any required method is missing. This means a bad adapter implementation fails at startup, not at 3 AM in production:

function assertContract(candidate, contract, label) {
  const missing = Object.keys(contract.methods)
    .filter(name => typeof candidate[name] !== 'function');
  if (missing.length > 0) {
    throw new TypeError(
      `${label} is missing required methods: ${missing.join(', ')}`
    );
  }
  return candidate;
}

                        ┌──────────────────────┐
                        │  Orchestration Loop  │
                        │  (ai-core package)   │
                        └──────────┬───────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              │                    │                    │
              ▼                    ▼                    ▼
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│  Provider        │  │  Data Access     │  │  Event           │
│  Transport       │  │  Adapter         │  │  Sink            │
│                  │  │                  │  │                  │
│  createMessage() │  │  executeToolCall │  │  emit()          │
└────────┬─────────┘  └────────┬─────────┘  └────────┬─────────┘
         │                     │                     │
         ▼                     ▼                     ▼
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│ Claude SDK   │      │ NoSQL DB     │      │ SSE Stream   │
│ OpenAI SDK   │      │ ERP ETL      │      │ (client)     │
│ Gemini SDK   │      │ Fin Model    │      └──────────────┘
│ Local LLM    │      └──────────────┘
└──────────────┘

Act 4: The Tool Arsenal — 23 Tools Across 7 Domains

The real power of an agentic copilot isn't the LLM — it's the tools you give it. I exposed 23 tools organized into seven domains, each with strict JSON Schema input validation and BU/FY scoping:

Domain	Tools	What They Access
Core Data	`query_products`, `query_customers`, `query_actuals`, `query_budget_config`, `query_definitions`	NoSQL budget data — products, customers, targets, actuals/forecasts, FY config
ERP Live Insights	`query_erp_sales_insights`, `query_erp_debtors_insights`, `query_erp_orderbook_insights`, `query_erp_stock_insights`, `query_erp_delivery_insights`, `query_etl_run_history`, `query_erp_connector_detail`	ERP pipeline — sales, debtors aging, open orders, stock levels, delivery status, line-level detail with filtering
Financial Models	`run_financial_model`, `run_what_if_simulation`, `run_services_model`	Full budget model (read-only), hypothetical scenarios with parameter adjustments, ARR/MRR/service profit
Analytics	`get_customer_concentration`, `get_regional_breakdown`, `get_margin_analysis`, `compare_fiscal_years`	Revenue concentration risk, regional splits, margin by customer/product/region, FY-vs-FY comparison
Market Analysis	`query_market_analysis`	Municipality-level market data — region, manufacturer split, coverage, assignment chains
Methodology	`get_model_methodology`	Canonical formulas, assumptions, validation rules, simulation logic reference
Visualization	`generate_chart`	Inline chart specs (bar, line, pie, composed, area) with formatting and annotations

Plus web search (provider-managed) for external context like tenders, competitors, and market developments.

Tool Execution: Authorization at Every Call

Every single tool call passes through an authorization boundary. This isn't "check auth once at the top." Every tool execution resolves a scope:

// Per-tool-call authorization
const scope = await scopeFor(userContext, buId);
// Checks: isAdmin, BU viewAccess, user profile roles

// Account managers see only their assigned customers
const customers = isAccountManager
  ? filterCustomersForUser(allCustomers, userId)
  : allCustomers;

// Internal fields are stripped before the LLM sees them
function sanitizeDoc(doc) {
  const { _rid, _self, _etag, _attachments, _ts, pk, ...clean } = doc;
  return clean;
}

The AI agent is read-only. It cannot modify data. It cannot even see internal database fields. The tool executor enforces row limits (5,000 default, 12,000 for ERP detail scans) and a 5-second timeout per tool call.

Adding New Tools: A Disciplined Process

I maintain an extension guide with strict criteria. A new tool is added only if all four conditions are met: 1. Existing tools cannot answer the target question class reliably 2. The query can be bounded by BU/FY plus filters or row limits 3. The tool is read-only and deterministic 4. The output can be explained with clear freshness metadata (dataAsOf timestamp) Steps: define schema in the core package, implement handler in the executor, enforce BU/role scope, add cache eligibility decision, update tool profiles and system prompt if needed.

Act 5: Intelligent Question Routing — Tool Profiles

Not every question needs every tool. Asking "What does the model methodology say about margin vs. markup?" doesn't need ERP data. Asking "Show me overdue orders" doesn't need the financial model. I built a question routing engine that classifies incoming messages and selects a tool profile — a curated subset of tools with an enforced budget:

                       User Question
                            │
                            ▼
                   ┌─────────────────┐
                   │ Question Router │
                   │                 │
                   │ Regex classify  │
                   │ + UI context    │
                   │ + Prompt hint   │
                   └────────┬────────┘
                            │
         ┌──────────────────┼──────────────────┐────────────┐
         │                  │                  │            │
         ▼                  ▼                  ▼            ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  Finance     │  │  ERP         │  │  Scenario    │  │  Methodology │
│  Profile     │  │  Profile     │  │  Profile     │  │  Profile     │
│              │  │              │  │              │  │              │
│  6 tools     │  │  6+ tools    │  │  5 tools     │  │  1 tool      │
│  2 calls     │  │  2 calls     │  │  2 calls     │  │  1 call      │
│  40s cap     │  │  +detail     │  │  what-if     │  │  formulas    │
└──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘

The router uses regex classification to match question intent, but also respects prompt context hints from the frontend. Profile routing also includes question-intent classification that detects 12 distinct intent types: seasonality, customer decline, forecast, variance, profitability, risk, cashflow, trend, comparison, what-if, chart requests, and strategy/recommendation asks. Each profile enforces a strict tool budget:

Parameter	Default	Purpose
`MAX_TOOL_ITERATIONS`	2	Maximum LLM-to-tool round-trips
`MAX_TOOL_CALLS`	2	Total tool invocations per request
`MAX_TOOL_CALLS_WITH_CHART`	3	Budget when chart is requested
`MAX_ORCHESTRATION_MS`	40,000	Total wall-clock cap for entire run
`MAX_PROVIDER_CALL_MS`	28,000	Per-LLM-call timeout

Why budget tool calls? Cost and latency. Every tool call means the LLM has to process tool results and generate another turn. Without budgets, a curious model could chain 8 tool calls, burn tokens, and make the user wait 90 seconds. My budgets keep responses under 10 seconds in the common case.

Act 6: The Prompt Library — Context-Aware Starter Prompts and UI Nudges

Most AI chat implementations show static starter prompts: "Ask me anything." Mine are dynamic. I have two systems working together:

1. Starter Prompt Catalog

Curated prompts with embedded routing metadata. Each prompt carries a promptContext that tells the router which profile and tools to prefer:

export const STARTER_PROMPTS = Object.freeze([
  createPromptOption({
    id: 'pbt_target_tracking',
    text: 'How are we tracking against the PBT target?',
    profileHint: 'finance_interactive',
    preferredTools: ['run_financial_model']
  }),
  createPromptOption({
    id: 'monthly_revenue_vs_budget_chart',
    text: 'Show me a chart of monthly revenue vs budget',
    profileHint: 'finance_interactive',
    preferredTools: ['run_financial_model', 'generate_chart'],
    chartRequested: true
  }),
  createPromptOption({
    id: 'debtors_aging_overview',
    text: 'What does the debtors aging look like?',
    profileHint: 'erp_interactive',
    preferredTools: ['query_erp_debtors_insights']
  }),
]);

2. Context Nudges — The Page-Aware Prompt Engine

This is the feature I'm most proud of. When the user navigates between dashboard pages, I detect the transition and generate contextual suggestions:

export function buildUiContextNudges({ previousScope, currentScope, page }) {
  const nudges = [];

  // User switched tabs: suggest page takeaways
  if (prev.tab !== curr.tab) {
    nudges.push(createPromptOption({
      text: `What are the top takeaways on ${pageLabel}?`,
      source: 'context_nudge',
      profileHint   // auto-inferred from page context
    }));
  }
  // FY changed: suggest risk/opportunity summary
  if (prev.fy !== curr.fy) {
    nudges.push(createPromptOption({
      text: `Summarize the key risks and opportunities for ${curr.fy}.`,
    }));
  }
  // Historical mode toggled
  if (prev.historicalMode !== curr.historicalMode) {
    nudges.push(createPromptOption({
      text: curr.historicalMode
        ? 'What should I learn from this historical-year view?'
        : 'What should I monitor in the active year?',
    }));
  }
  // Filters changed
  if (prev.filterSignature !== curr.filterSignature) {
    nudges.push(createPromptOption({
      text: 'How do the currently applied filters change the story?',
    }));
  }
  return dedupePromptOptions(nudges).slice(0, 3);
}

The copilot isn't just waiting for questions. It's noticing what you're looking at and suggesting the questions a good analyst would ask next.

Act 7: The System Prompt — Persona Engineering for Finance

My system prompt doesn't just say "You are a helpful assistant." It establishes a professional identity with explicit behavioral rules:

You are Budget Analyst, an elite strategic finance and
business intelligence copilot.

Operate at the level of a senior data scientist, FP&A lead,
chartered-accountant quality reviewer, and board-ready analyst.

Your audience ranges from junior budget analysts to CEOs,
board members, investors, and shareholders.

19 behavioral rules govern the agent's conduct. The critical ones:

Tools first: For company-data questions, call tools first and ground answers in tool results. Never invent figures.
Read-only: Never claim to have changed persisted data.
Source transparency: Every response starts with SOURCE: data | knowledge | mixed and CONFIDENCE: high | medium | low.
Staleness awareness: Flag data older than 7 days with an explicit warning.
Follow-ups required: Every response ends with 3 suggested follow-up questions.
ERP line-level protocol: For specific product/customer queries, call the detail tool with filters before concluding.
No secrets: Never reveal document IDs, partition keys, or internal implementation details.
Executive communication: For senior stakeholders, prioritize key finding, material drivers, risks, and actions.

Act 8: The Streaming Pipeline — SSE From Orchestration to Browser

AI responses need to stream. A 15-second wait for a complete response feels broken. A response that starts appearing in 800ms feels fast, even if the total time is the same. I built a full SSE (Server-Sent Events) pipeline with 10 distinct event types:

Orchestration Engine                SSE Event Types:
        │                           ─────────────────────────────
        ▼                           ready       Run allocated, correlation ID
  EventSink.emit()                  heartbeat   Keep-alive (8s interval)
        │                           status      Phase label updates
        ▼                           text        Delta (streaming) or snapshot
┌───────────────┐                   tool_start  Tool execution begins
│  PassThrough  │                   tool_end    Tool complete + cache metadata
│  Stream       │                   chart       Inline chart spec payload
└───────┬───────┘                   follow_ups  Suggested next questions
        │                           error       Code, message, details
        ▼                           done        Token counts, model, metrics
┌───────────────┐
│  Fastify      │
│  reply.hijack │
│  + pipe()     │
└───────┬───────┘
        │
        ▼
┌───────────────────┐
│  HTTP Response    │
│                   │
│  Content-Type:    │
│  text/event-stream│
│  Cache-Control:   │
│  no-cache         │
│  X-Accel-Buffering│
│  no               │
└───────┬───────────┘
        │
        ▼
┌───────────────────┐
│  Browser          │
│  ReadableStream   │
└───────────────────┘

The Fastify route hijacks the raw HTTP response to avoid framework buffering:

function sendSse(reply, stream, headers = {}) {
  reply.hijack();
  reply.raw.statusCode = 200;
  reply.raw.setHeader('Content-Type', 'text/event-stream');
  reply.raw.setHeader('Cache-Control', 'no-cache');
  reply.raw.setHeader('Connection', 'keep-alive');
  reply.raw.setHeader('X-Accel-Buffering', 'no');
  reply.raw.flushHeaders();

  reply.raw.on('close', () => stream.destroy());
  stream.on('error', () => reply.raw.end());
  stream.pipe(reply.raw);
}

The X-Accel-Buffering: no header is critical — it tells any reverse proxy or CDN in front not to buffer the SSE stream. Without it, users see nothing until the entire response completes.

Stream Recovery

SSE connections drop. Networks glitch. Phones go to sleep. I handle this with a parallel polling mechanism:

Every run gets a unique runId returned in a response header
If the stream disconnects, the frontend polls a status endpoint every 15 seconds
If the run is still active, it can reattach via a separate stream endpoint
Rate limited to 10 reattach attempts per minute
An 8-second heartbeat event keeps the connection alive through aggressive proxy timeouts

Act 9: The Cache System — Three Layers Deep

AI tool calls are expensive — not in compute, but in tokens. Every tool result gets injected into the LLM context. More data = more tokens = more cost = more latency. I built three cache layers to minimize redundant work:

Tool Call Request
         │
         ▼
┌────────────────────┐     Hit?
│  Layer 1:          │──────────►  Return cached result
│  Request-Scoped    │             (same request, same inputs)
│  Memo              │
└────────┬───────────┘
         │ Miss
         ▼
┌────────────────────┐     Hit?
│  Layer 2:          │──────────►  Return cached result
│  Shared Tool       │             (cross-request, 2min TTL)
│  Cache (500 max)   │
└────────┬───────────┘
         │ Miss
         ▼
┌────────────────────┐     Hit?
│  Layer 3:          │──────────►  Return if within freshness
│  Data Freshness    │             window (ERP: 7-day ETL cycle)
│  (ETL-aware)       │
└────────┬───────────┘
         │ Miss
         ▼
   Execute Tool
   (DB / Model / ERP)

Layer	Scope	TTL	Max Size	Purpose
Request Memo	Single AI request	Request lifetime	Unbounded	Prevent duplicate tool calls in same turn (e.g., model calls auth twice)
Shared Cache	Cross-request	120 seconds	500 entries (LRU)	Reuse results across concurrent users asking similar questions
Data Freshness	ETL-aware	7 days (ERP)	Integrated	ERP data refreshes weekly; don't refetch mid-cycle

One critical detail: run_what_if_simulation is never cached. Simulations take arbitrary parameter adjustments, so every call must be fresh. All other 22 tools are cacheable. Cache key security is strict. Keys include tool name, normalized input, BU, FY, and a scope fingerprint. For account-manager-scoped users, the key includes the userId — ensuring that user A's filtered customer data is never returned to user B.

Act 10: Retry, Fallback, and Graceful Degradation

Production AI features fail in ways that traditional APIs don't. The LLM provider can return 429 (rate limit), 529 (overloaded), or simply time out. I built multiple layers of resilience:

Provider-Level Retry

const RETRYABLE_STATUSES = new Set([429, 500, 502, 503, 504, 529]);

// Exponential backoff: 350ms base, 2s max
const BACKOFF_BASE_MS = 350;
const BACKOFF_MAX_MS  = 2_000;

// Model fallback chain: Primary ──► Secondary
const MODEL_FALLBACKS = [MODEL_PRIMARY, MODEL_FALLBACK];

function buildModelCandidates(initialModel) {
  // Returns ordered list: preferred model first, then fallbacks
  // Enables retry with automatic model downgrade
}

Partial Completion States

Not every failure is a total failure. I track five partial completion states that let the frontend show whatever was generated before the failure:

const PARTIAL_COMPLETION_STATES = new Set([
  'partial_timeout',              // Hit orchestration wall-clock cap
  'partial_provider_failure',     // Provider failed after partial response
  'tool_iteration_limit',         // Hit max tool round-trips
  'tool_call_budget_limit',       // Hit max tool calls
  'interactive_budget_limit'      // Hit interactive mode constraints
]);

When the model hits a budget limit, I don't discard the partial response. The done event includes a budgetExitReason field, and the model is instructed to "answer with the highest-confidence partial result and propose narrower follow-ups."

Continuation and Follow-Up Detection

The system detects when a user is following up on a partial result with regex patterns for continuations like "continue", "go deeper", "top 5 results", "carry on" — and adjusts routing accordingly so the follow-up doesn't restart from scratch.

Act 11: The Authentication Stack — Three Layers, Zero Trust

            Browser Request
                  │
                  ▼
┌──────────────────────────────────────┐
│  Layer 1: Proxy Identity             │
│                                      │
│  Reverse proxy injects:              │
│  Client-Principal-ID                 │
│  Client-Principal-Name               │
│  ──► Cannot be forged by client      │
└──────────────────┬───────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│  Layer 2: Bootstrap Token (HMAC)     │
│                                      │
│  HMAC-SHA256 signed, 120s TTL        │
│  Contains: userId, email, sessionId  │
│  viewBuIds, homeBuId, permissions    │
│  ──► Timing-safe signature verify    │
└──────────────────┬───────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│  Layer 3: RBAC + BU Scope + ABAC     │
│                                      │
│  hasPermission(user, 'ai.chat')      │
│  hasBuViewAccess(profile, buId)      │
│  filterCustomersForUser(...)         │
│  ──► Enforced on EVERY tool call     │
└──────────────────────────────────────┘

The bootstrap token is particularly clever. The main API mints it (it knows the user's RBAC state), and the AI service validates it (it doesn't need to re-query the user database). TTL is 120 seconds — long enough for a chat session, short enough that a stolen token is useless quickly. Signature verification uses timingSafeEqual to prevent timing attacks.

Act 12: The Health & Observability System — Knowing What You Don't Know

Shipping an AI feature without observability is like flying blind. I built a comprehensive health monitoring system that tracks everything from token costs to user adoption to cache efficiency. This isn't a separate monitoring tool — it's baked into the application's system health dashboard, accessible to admins.

The Telemetry Pipeline

Every AI chat request emits structured operational log events to blob storage. The health endpoint scans these events with configurable lookback windows and computes real-time analytics:

  AI Chat Request
         │
         ▼
┌────────────────────┐
│  Operational Log   │     Events Emitted:
│  (Blob Storage)    │     ──────────────────────────
│                    │     AI_CHAT_REQUEST           (success + metrics)
│  Structured JSON   │     AI_CHAT_REQUEST_FAILED    (failure + error code)
│  per-request       │     AI_CHAT_FEEDBACK          (thumbs up/down)
└────────┬───────────┘     AI_CHAT_STREAM_OPENED     (SSE stream timing)
         │                 AI_CHAT_FIRST_EVENT       (time-to-first-token)
         ▼                 AI_CHAT_HTTP_ERROR        (client-side HTTP fail)
┌────────────────────┐     AI_CHAT_NETWORK_ERROR     (client-side network fail)
│  Health Endpoint   │     AI_CHAT_STREAM_END        (stream termination)
│                    │
│  Three Probes:     │
│  AI Cache Health   │
│  AI Chat Telemetry │
│  AI Chat Insights  │
└────────────────────┘

Probe 1: AI Cache Health

Monitors shared cache effectiveness with automatic recommendations:

// Cache health probe output
{
  status: 'connected',        // connected | warning | error
  lookbackHours: 72,
  requestsSampled: 142,
  sharedCacheHits: 87,
  sharedCacheMisses: 214,
  hitRatePct: 28.9,
  lookupCoveragePct: 95.1,    // % of requests using cache
  requestHitRatePct: 44.3,    // % of requests with >= 1 hit
  targetHitRatePct: 15,       // configurable threshold
  note: 'ok',
  recommendation: 'Shared cache hit rate is within target range.',
  suggestedSettingChange: null
}

When the hit rate drops below target, the probe returns actionable recommendations like "Increase cache TTL to 300000ms to improve shared-cache reuse." The evaluation logic considers sample size too — it won't raise alarms on fewer than 20 requests.

Probe 2: AI Chat Telemetry

Tracks operational health of the AI chat endpoint itself, with failure classification:

// Chat telemetry probe output
{
  status: 'connected',
  lookbackHours: 24,
  requestsSampled: 47,
  successRequests: 44,
  failedRequests: 3,
  timeoutFailures: 1,
  failureRatePct: 6.38,
  failureCodeCounts: {
    'AI_TIMEOUT': 1,
    'AI_PROVIDER_ERROR': 2
  },
  p95StreamOpenedMs: 1240,    // Time to SSE stream open
  p95FirstEventMs: 2850,      // Time to first SSE event
  likelyAppPreStreamFailureCount: 0,
  likelyProviderOrToolTimeoutCount: 1,
  suspectedProxyTimeoutPreStreamCount: 0,
  note: 'ok'
}

Notice the suspectedProxyTimeoutPreStreamCount. I discovered that certain reverse proxies impose a 45-second timeout on HTTP connections. The health probe correlates client-side timing signals (elapsed ~43-47s) with server-side stream-opened events to detect whether failures are app-side or proxy-side. This saved me weeks of debugging. Health status is evaluated with configurable thresholds (default: warning at 20% failure rate, critical at 50%), with special handling for timeout spikes.

Probe 3: AI Insights — Usage, Cost, and Adoption

The deepest probe. It computes comprehensive business intelligence about the AI feature itself:

// AI Insights probe output (simplified)
{
  sampled: {
    requests7d: 312,
    requestsMtd: 847,
    successfulRequests7d: 298,
    failedRequests7d: 14,
    timeoutFailures7d: 3
  },
  tokens: {
    input7d: 1_420_000,
    output7d: 312_000,
    inputMtd: 3_890_000,
    outputMtd: 842_000,
    avgInputPerChat7d: 4551,
    avgOutputPerChat7d: 1000
  },
  cost: {
    currency: 'USD',
    pricingConfigured: true,
    estimatedUsd7d: 0.0284,
    estimatedUsdMtd: 0.0781
  },
  usage: {
    totalChats7d: 312,
    totalChatsMtd: 847,
    activeAiEligibleUsers: 8,
    avgChatsPerEligibleUser: 39,
    topUserByChats: { name: '***', chats: 142 },
    nonUsersCount: 3,
    nonUsersPreview: [...]
  },
  performance: {
    slowThresholdMs: 10_000,
    p50DurationMs: 5200,
    p95DurationMs: 12400,
    p50FirstTokenMs: 1100,
    p95FirstTokenMs: 3200,
    timeoutRatePct: 0.96,
    slowRatePct: 8.3
  },
  quality: {
    thumbsUp: 24,
    thumbsDown: 3,
    positiveRatePct: 88.89,
    feedbackCoveragePct: 9.06
  }
}

Let me break down what this gives me:

Category	Metrics	Why It Matters
Token Usage	Input/output tokens (7d + MTD), average per chat	Cost forecasting, prompt optimization signals, context window utilization
Cost Estimation	Estimated USD (7d + MTD), configurable per-model pricing table	Budget tracking, cost-per-user analysis, ROI calculation
User Adoption	Active users, chats per user, top user, non-adopters list	Feature adoption tracking, training needs identification, champion users
Performance	p50/p95 duration, p50/p95 first-token, timeout rate, slow rate	SLA monitoring, user experience optimization, provider health
Quality	Thumbs up/down, positive rate, feedback coverage %	Answer quality monitoring, prompt/tool tuning signals

Token Cost Estimation

The cost system supports a per-model pricing table with partial model-name matching for fallback resolution:

// Configurable per-model pricing
TOKEN_COST_TABLE = {
  "primary_model": {
    "inputUsdPerMtok": 3.0,
    "outputUsdPerMtok": 15.0
  },
  "fallback_model": {
    "inputUsdPerMtok": 0.25,
    "outputUsdPerMtok": 1.25
  }
}

function estimateTokenCostUsd({ inputTokens, outputTokens, rates }) {
  const inputCost  = (inputTokens  / 1_000_000) * rates.inputUsdPerMtok;
  const outputCost = (outputTokens / 1_000_000) * rates.outputUsdPerMtok;
  return inputCost + outputCost;
}

AI Service Runtime Health

The AI service itself exposes two health endpoints probed by the main health dashboard:

GET /healthz   ──► { status, runningRuns, queuedRuns }
GET /readyz    ──► { status, storage.mode, data.authMode }

The main health endpoint combines these with telemetry probes into a unified status that powers the admin dashboard. Service status is computed by merging health + readiness signals, with nuanced logic (e.g., health-healthy + ready-unhealthy = degraded, not error).

The Done Event — Per-Request Telemetry

Every completed AI request emits a done SSE event packed with operational metrics. This is what feeds all three health probes:

// done SSE event shape
{
  type: 'done',
  responseId: '...',
  model: '...',
  durationMs: 5200,
  firstTokenMs: 1100,
  historyMessageCount: 4,
  inputTokens: 4200,
  outputTokens: 680,
  toolCalls: 1,
  webSearchToolCalls: 0,
  toolProfile: 'finance_interactive',
  toolAllowlistSize: 6,
  iterationBudget: 2,
  toolCallBudget: 2,
  orchestrationBudgetMs: 40000,
  providerBudgetMs: 28000,
  budgetExitReason: null,
  completionState: 'completed',
  requestMemoHits: 2,
  requestMemoMisses: 1,
  sharedCacheHits: 1,
  sharedCacheMisses: 0
}

This single event gives you: latency (total + first-token), token usage, tool efficiency (calls vs. budget), cache performance (hits vs. misses), model used, and completion status. It's the telemetry primitive that everything else is built on.

Act 13: UI Context — Making the AI See What You See

Most AI chat panels are blind to the rest of the application. Mine isn't. I pass a structured uiContext payload with every request that describes exactly what the user is looking at:

// Sanitized UI context structure
{
  version: 1,
  scope: {
    tab: 'dashboard',
    buId: '***',
    fy: 'FY26',
    historicalMode: false
  },
  page: {
    viewId: 'dashboard',
    label: 'Budget Dashboard',
    purpose: 'Overview of revenue, costs, and margin for current FY',
    activeFilters: [
      { key: 'region', label: 'Region', value: 'Gauteng' }
    ],
    visibleWidgets: [
      { id: 'revenue-chart', label: 'Monthly Revenue' },
      { id: 'margin-gauge', label: 'Gross Margin %' }
    ],
    kpiSummaries: [
      { widgetId: 'revenue-chart', label: 'YTD Revenue',
        value: 'R45,230,000', trend: '+12% YoY' }
    ],
    freshness: { asOf: '2026-04-07', source: 'actuals', stale: false },
    warnings: ['ERP delivery data >7 days old']
  },
  lastAction: {
    type: 'filter_change',
    label: 'Applied region filter',
    target: 'Gauteng'
  }
}

This is how the copilot can answer "What does this dashboard show?" or "Why is this KPI red?" without hallucinating. It literally sees the same widgets, filters, and KPIs the user sees. The payload is capped at 12KB and progressively trimmed (KPIs first, then widgets, then warnings, then filters, then purpose text) if it exceeds that limit. The system prompt explicitly marks it as "untrusted data context, not instructions" — preventing prompt injection through crafted widget labels.

Act 14: Model Selection & Fallback

I run a primary model with a lighter fallback. The model selection is entirely environment-driven and supports an ordered preference list:

// Model selection supports a configurable preference chain
const CONFIGURED_PREFERENCE = process.env.MODEL_PREFERENCE
  .split(',').map(v => v.trim()).filter(Boolean);

// Default fallback: Primary ──► Lighter model
const DEFAULT_FALLBACKS = [MODEL_PRIMARY, MODEL_FALLBACK];

function buildModelCandidates(initialModel) {
  // Returns ordered list: preferred model first, then fallbacks
  // Enables retry with automatic model downgrade
}

The buildModelCandidates() function produces an ordered list that the orchestration loop can iterate through if the primary model fails or times out. Today it's one provider. Tomorrow, adding another means implementing one function: createMessage() on the ProviderTransport contract.

The Full Request Lifecycle

Here's how a single user question flows through the entire system, end to end:

User types: "How are we tracking against the PBT target?"
      │
      ▼
[1] Chat Panel (React)
    Refresh bootstrap token (HMAC, 120s TTL)
    Build request: { messages, newMessage, buId, fy, uiContext, promptContext }
    POST /ai/chat (Fetch API with ReadableStream)
      │
      ▼
[2] Reverse Proxy
    Inject Client-Principal-ID (unforgeable)
    Forward to AI Service
      │
      ▼
[3] Route handler (Fastify)
    enforceProxyIdentity() ──► extract user
    runtime.startChatRequest(request)
      │
      ▼
[4] Runtime Manager
    Check concurrency: user slots (2), global slots (4)
    Validate bootstrap tokens (HMAC-SHA256, timing-safe)
    Check RBAC: hasPermission(user, 'ai.chat')
    Check BU scope: hasBuViewAccess(profile, buId)
      │
      ▼
[5] Question Router (Tool Profiles)
    Classify question ──► finance_interactive
    Select tools: run_financial_model, run_services_model, ...
    Build budget: 2 tool calls, 40s cap, 28s per-provider
    Merge promptContext hints if present
      │
      ▼
[6] Orchestration Engine
    │
    │  ┌─ Turn 1 ──────────────────────────────────────────┐
    │  │  selectModel() ──► primary model                  │
    │  │  buildSystemPrompt() + uiContext + runtime rules  │
    │  │  providerTransport.createMessage(...)             │
    │  │                                                   │
    │  │  Model responds: stopReason=tool_use              │
    │  │    ──► run_financial_model                        │
    │  └────────────────────────┬──────────────────────────┘
    │                           │
    │  ┌─ Tool Execution ───────▼──────────────────────────┐
    │  │  Check request memo cache ──► MISS                │
    │  │  Check shared cache (500/2min) ──► MISS           │
    │  │  Resolve scope ──► BU access verified             │
    │  │  Execute: load products + customers + actuals     │
    │  │  Run financial model engine (synchronous)         │
    │  │  Sanitize result (strip internal fields)          │
    │  │  Store in request memo + shared cache             │
    │  │  Emit SSE: tool_start ──► tool_end (w/ metrics)   │
    │  └────────────────────────┬──────────────────────────┘
    │                           │
    │  ┌─ Turn 2 ───────────────▼──────────────────────────┐
    │  │  Inject tool result into messages                 │
    │  │  createMessage() ──► generates final answer       │
    │  │  stopReason: end_turn                             │
    │  │  Stream text deltas via SSE                       │
    │  └────────────────────────┬──────────────────────────┘
    │                           │
    ▼                           ▼
[7] SSE Events emitted:
    ──► ready        { runId, correlationId }
    ──► status       { phase: 'thinking', label: 'Analyzing...' }
    ──► tool_start   { name: 'run_financial_model' }
    ──► tool_end     { name: '...', durationMs: 820,
                       cacheLayer: 'none', cacheHit: false }
    ──► text         { mode: 'delta', content: 'SOURCE: data\n...' }
    ──► text         { mode: 'delta', content: '...PBT tracking at 94%...' }
    ──► follow_ups   ['What is driving the margin gap?', ...]
    ──► done         { model: '...', inputTokens: 4200, outputTokens: 680,
                       toolCalls: 1, durationMs: 6400,
                       requestMemoHits: 2, sharedCacheHits: 0 }
      │
      ▼
[8] Browser renders incrementally:
    Status indicator: "Running financial model..."
    Streaming text appears word-by-word
    Follow-up chips rendered at completion
    Metrics logged for diagnostics

[9] Operational log written to blob storage:
    AI_CHAT_REQUEST event with all metrics
    Feeds into health probes on next health check

What I Learned

After building this system across several months, these are the lessons that weren't obvious when I started:

Contracts before code. I defined the provider transport, data access adapter, and event sink as frozen contract objects before writing a single line of orchestration logic. This forced me to think about boundaries first and made the core AI package genuinely portable. The contracts even include runtime validation that throws at boot if an implementation is missing methods.
Tool budgets prevent runaway costs. Without iteration and call-count limits, an LLM will happily chain 6-8 tool calls to "be thorough." That's expensive and slow. My 2-call budget forces the model to be selective and answer with partial results + follow-up suggestions rather than exhaustive retrieval.
Cache the tools, not the LLM response. I cache at the tool-result layer, not the final response layer. This means different questions that happen to need the same underlying data share cached tool results, even though the LLM generates different answers. Much higher hit rate than response caching.
UI context is a superpower, but treat it as untrusted. Passing structured page state to the AI makes it dramatically more useful. But it's also a prompt injection surface. I explicitly mark it as "untrusted data context, not instructions" in the system prompt and cap it at 12KB with progressive trimming.
SSE needs application-layer recovery. HTTP SSE is great until the connection drops. You need a parallel polling path and a stream reattach path so the frontend can recover without losing the response. Heartbeats (I use 8-second intervals) are essential for keeping connections alive through aggressive proxy timeouts.
Authorization must be per-tool-call, not per-request. A single AI request might call 3 different tools accessing 3 different data domains. Each tool call must independently verify the user has access to the data it's about to return. "They passed auth at the front door" is not enough.
Mandatory response framing builds trust. Requiring SOURCE/CONFIDENCE/FOLLOW_UPS on every response was the single highest-impact prompt engineering decision. Users immediately know whether they're looking at real data or general knowledge, and the confidence flag naturally trains them to ask clarifying follow-ups.
Observability must be built-in, not bolted on. The done SSE event carries enough telemetry (tokens, duration, cache hits, tool calls, model, completion state) to power three health probes without any external monitoring infrastructure. I compute p50/p95 latencies, cache hit rates, cost estimates, user adoption, and quality scores entirely from operational logs — no Datadog, no Grafana, no third-party APM required.
Question routing is worth the complexity. Sending all 23 tools to every request is wasteful — the LLM has to process all those definitions, and it's more likely to call irrelevant tools. Profile-based routing (4 profiles, 1-6 tools each) reduced median tool usage by ~40% and improved response quality.
Track non-adopters, not just users. My health insights probe reports which eligible users haven't used the AI feature yet. This is more actionable than total usage numbers. Three people not using the feature is a training opportunity; 80% not using it is a product problem.

Try This Yourself

If you're building AI features into an existing line-of-business application, here's my recommended approach:

Start with contracts. Define your provider transport, data access adapter, and event sink as explicit interfaces before choosing an LLM provider. This pays dividends immediately when you need to switch models or add fallbacks.
Build tools, not prompts. The system prompt matters, but tools are where the real value lives. Each tool should be a thin, authorized wrapper around an existing data operation in your app.
Budget everything. Set limits on tool calls, iterations, orchestration time, and per-provider timeouts from day one.
Cache at the tool layer. Request-scoped memo for deduplication within a turn, shared cache with TTL for cross-request reuse, and domain-aware freshness for data that changes on known cycles.
Pass UI context. Even a minimal payload (current page, active filters, visible KPI values) makes the AI dramatically more useful. But sanitize it, cap it, and mark it as untrusted.
Use SSE with a polling fallback. Stream events for responsiveness, but always have a status endpoint and a stream reattach path for recovery. Add heartbeats.
Embed telemetry in every response. Make the AI done event carry token counts, cache metrics, latency, model used, and completion state.
Track cost at the model level. Configure per-model token pricing and compute estimated costs in your health probes.
Make the response frame mandatory. SOURCE, CONFIDENCE, staleness warnings, and follow-up suggestions should be non-negotiable.
Monitor adoption, not just health. Build a probe that identifies eligible non-users. It's the most actionable metric for driving feature adoption.

The full architecture — contracts, tools, profiles, caching, streaming, auth, health probes — runs as a standalone Fastify service with an extracted workspace package for AI core logic. The core package is provider-agnostic. The service is deployment-agnostic. The tools are database-agnostic (through the data access adapter). The health system is APM-agnostic (built on operational logs). I started with a dashboard. I ended up with a copilot that sees what you see, knows what the database knows, answers like a senior analyst, and reports its own health. The architecture isn't just about making it work — it's about making it replaceable at every layer. Today it's one LLM provider. Tomorrow it could be anything. The contracts don't care. That's the point.

Claude's Honest Assessment: Strengths, Gaps, and Where the Industry Is Heading

No architecture post is complete without an honest look at what I got right, what I didn't, and where the industry is going. After deep-diving into how companies like Microsoft, ThoughtSpot, Tableau, and dozens of startups are building AI into BI tools in 2025-2026, here's my self-assessment.

What I Got Right

Decision	Industry Validation
Tool-use over text-to-SQL	Research shows text-to-SQL accuracy drops from 85-92% on clean academic benchmarks to 6-21% on enterprise schemas (Spider 2.0, ICLR 2025 paper). My tool-based approach avoids this entirely — the LLM never writes raw database queries. It calls pre-built, authorized, scope-enforced tools. This aligns with the industry shift toward semantic-layer-aware AI, where the model queries governed metric definitions rather than raw SQL. ThoughtSpot, Holistics, and Looker have all converged on this pattern.
SSE streaming with recovery	SSE is the de facto industry standard for LLM response streaming. Every major provider (OpenAI, Anthropic, Google) uses it. My heartbeat keepalives (8s), proxy-buffering headers, and parallel polling fallback are textbook production patterns. The stream reattach mechanism goes beyond what most implementations offer.
Per-tool-call authorization	Microsoft's security playbook for AI agents (2026) explicitly calls out that "a prompt injection vulnerability in a multi-tenant agent could cross tenant boundaries" and describes this as catastrophic. My per-tool-call scope resolution with account-manager-level data isolation is ahead of most enterprise AI implementations, which typically enforce auth only at the request level. The OWASP Top 10 for LLM Applications (PDF) identifies excessive agency and improper output handling — both enabling privilege escalation — as top-5 risks.
Provider-agnostic contracts	The market has converged on provider abstraction as essential infrastructure. Solutions like LiteLLM (40K+ GitHub stars, 240M+ Docker pulls), Bifrost, and Portkey all provide unified provider interfaces. My frozen contract pattern achieves the same goal without an external dependency, and the `assertContract()` boot-time validation is a pattern I haven't seen in any gateway solution.
Built-in observability	Most teams bolt on third-party observability (Helicone, LangSmith, Langfuse) after deployment. My approach — embedding telemetry in the `done` event and computing health probes from operational logs — eliminates a dependency and gives me adoption metrics (non-user tracking) that no off-the-shelf tool provides.
Tool budgets and question routing	Industry consensus: unbounded tool use is the #1 cause of LLM cost overruns in agentic applications. My 2-call budget with profile routing is more disciplined than most production implementations. Anthropic's own guides on building effective agents and writing tools for agents recommend exactly this pattern: classify intent, select a tool subset, enforce a call budget.
UI context awareness	Genuinely differentiated. Most AI chat panels are blind to the host application. My structured `uiContext` payload with progressive trimming and prompt-injection-safe labeling is a pattern I haven't seen documented in any major BI vendor's public architecture. The context-nudge system (auto-suggesting questions when pages change) is unique.
Mandatory response framing	SOURCE/CONFIDENCE/staleness/follow-ups framing aligns with emerging enterprise AI governance requirements. Gartner's 2026 AI governance framework recommends explicit source attribution and confidence signaling for any AI feature that influences business decisions.

What I Could Do Better — Honest Gaps according to Claude Code

Gap	Current State	Industry Standard	Impact
No semantic caching	Exact-match tool-result caching (hash-based). Hit rates depend on identical inputs.	Semantic caching (embedding similarity) achieves 40-70% hit rates vs. 10-15% for exact match. Solutions like Bifrost and GPTCache offer this at the gateway layer. FAQ-heavy workloads see 60-85% hit rates; my financial BI pattern would likely see 30-50%.	Medium. My tool-level caching partially compensates, but I'm leaving cost savings on the table for paraphrased questions ("What's our margin?" vs "Show me the margin numbers").
No LLM-as-judge evaluation	Quality monitoring via thumbs up/down only (9% feedback coverage).	Leading teams run a three-layer evaluation: (1) automated heuristic checks on 100% of traffic, (2) LLM-as-judge scoring on 5-10% of requests, (3) human review for edge cases. Research shows judge models align with human judgment up to 85%. Tools: DeepEval, TruLens, Langfuse evals.	High. With only 9% feedback coverage, I have blind spots on answer quality. I catch failures (errors, timeouts) but not subtle quality degradation (correct but unhelpful answers, missing nuance).
Single-provider dependency	Provider-agnostic contracts exist, but only one provider implementation is wired up.	The industry is rapidly moving toward multi-provider strategies. Production teams are using AI gateways (Portkey, LiteLLM) for automatic failover across 2-3 providers.	Medium. The contracts are ready, but I haven't exercised the abstraction. A provider outage today means total AI feature downtime.
No conversation persistence	Chat history lives in browser `sessionStorage` only. Close the tab, lose the conversation.	Most enterprise AI copilots persist conversation history server-side for audit trails, cross-device continuity, and analytics.	Low-Medium. Deliberate choice (zero storage cost, no data retention liability), but limits my ability to do conversation-level quality analysis.
Regex-based question routing	Regex patterns classify questions into 4 profiles with 12 intent types.	More sophisticated routers use lightweight embedding classifiers or distilled intent models. These handle paraphrasing and multilingual queries better.	Low. My regex routing works well for my domain (financial English with a bounded vocabulary), but would struggle with multilingual or highly varied question patterns.
No distributed cache	In-process `Map` with LRU eviction. Cache is per-instance.	Multi-instance deployments use Redis or a distributed cache layer. My in-process cache means cache misses when requests hit different instances.	Low. I run a single AI service instance today. This becomes a gap only at scale-out.
Read-only agent only	The AI cannot take actions — it can only analyze and recommend.	The industry is cautiously moving toward "action agents" that can trigger workflows, send alerts, and execute constrained write operations. Microsoft Copilot Studio, Salesforce Agentforce, and ThoughtSpot's Agentic Analytics all support agent-initiated actions with approval workflows.	Deliberate trade-off. Read-only is a security boundary I chose intentionally. But it means users have to manually act on every recommendation.

Future Roadmap — Where This Architecture Should Go Next (Claude's recos)

Based on where the industry is heading in 2026-2027, here are the highest-value additions to consider, ordered by impact-to-effort ratio:

Tier 1: High Impact, Near-Term (Weeks)

Automated quality evaluation (LLM-as-judge). Run a lightweight evaluation model on 5-10% of responses, scoring for groundedness (did the answer match tool results?), relevance (did it answer the question?), and completeness. This closes the biggest observability gap. Tools like DeepEval or Langfuse evals make this a 1-2 week implementation. The done event already carries enough context to feed an evaluator.
Second provider implementation. Wire up a second LLM provider behind the existing ProviderTransport contract. The contracts are ready — this is purely an implementation exercise. Automatic failover across providers eliminates single-provider downtime risk.
Tiered model routing. Route simple questions (methodology lookups, single-tool queries) to a cheaper/faster model and reserve the primary model for complex multi-tool analysis. Industry data shows 60-75% of queries can be handled by a lighter tier with no quality drop, yielding 40-60% cost reduction. My question router already classifies intent — adding model selection per profile is a natural extension.

Tier 2: High Impact, Medium-Term (1-2 Months)

Semantic caching layer. Add embedding-based similarity matching as a cache layer between the request memo and the shared tool cache. When "What's our gross margin?" hits, the cached result for "Show me the margin numbers" should match at ~0.9 similarity. Production systems report 30-50% hit rates for analytical workloads, with latency dropping from seconds to single-digit milliseconds on hits. For enterprise apps locked behind Entra ID, the cache must stay inside the security boundary — public SaaS caching services are not an option. Azure Cosmos DB vector search (already in my stack, supports DiskANN-based similarity) or Azure Cache for Redis deployed with private endpoints inside the VNet are both viable. Alternatively, a lightweight in-process embedding approach using a small local model can avoid any external dependency entirely — compute similarity on the AI service itself and keep everything within the existing deployment boundary.
Proactive insight surfacing. The industry is moving from reactive (user asks, AI answers) to proactive (AI notices something and suggests investigation). I already have the infrastructure: the UI context system detects page changes and generates nudges. Extending this to data-driven alerts ("Revenue dropped 15% this month vs. last month — want me to investigate?") would be a natural evolution.
Conversation persistence for audit and analytics. Selectively persist conversation transcripts server-side (the writeTranscript() infrastructure already exists in the runtime manager). This enables conversation-level quality analysis, compliance audit trails, and cross-session continuity. Implement with opt-in consent and configurable retention policies.

Tier 3: Strategic, Longer-Term (3-6 Months)

Multi-agent orchestration. Today I run a single agent with tool use. The next step is specialized sub-agents: a data retrieval agent, an analysis agent, a visualization agent, and a narrative agent. Frameworks like LangGraph (now GA) provide graph-based execution with checkpointing and human-in-the-loop. The contracts already separate orchestration from data access — decomposing the orchestration into collaborating agents is architecturally natural. Caution: Gartner predicts over 40% of agentic AI projects will be cancelled by 2027 due to coordination complexity.
Constrained write-back actions. Move from read-only analyst to constrained actor: "Shall I flag this customer for review?" or "Want me to set a monitoring alert for this KPI?" The action layer pattern (approve/reject workflow, audit trail, rollback capability) is emerging in enterprise AI. This is the highest-risk item on the roadmap — it fundamentally changes the security model. Start with soft actions only (create alerts, flag items, generate reports) before considering data mutations.
MCP (Model Context Protocol) integration. Anthropic donated MCP to the Linux Foundation's Agentic AI Foundation, and it's been adopted by OpenAI, Microsoft, AWS, and Google. Exposing my tool surface as an MCP server would allow any MCP-compatible client (IDE copilots, other agents, automation platforms) to query my financial data through the same authorized, scoped, cached tool pipeline.
Natural language to dashboard generation. Rather than answering questions about existing dashboards, generate new dashboard views from natural language descriptions. "Show me a dashboard tracking our top 5 customers' margin trends over the last 3 quarters." This is where ThoughtSpot's Agentic Analytics and Tableau's Einstein are heading. It's the most ambitious item on the list — it blurs the line between AI feature and core product.

The Maturity Spectrum mapped by Claude Code

Placing my architecture on the industry maturity spectrum (adapted from GoodData's Agentic Analytics framework and Gartner's autonomy model):

Industry AI-in-BI Maturity Spectrum (2026)
──────────────────────────────────────────

Stage 1               Stage 2               Stage 3               Stage 4
Basic Chat            Tool-Use Agent        Multi-Agent           Autonomous
──────────────        ──────────────        ──────────────        ──────────────
NLQ interface         Tool calling          Sub-agents            Proactive
Static prompts        Data grounding        Multi-model           Action layer
No data access        Auth scoping          Semantic cache        Self-improving
Single model          SSE streaming         LLM-as-judge          NL-to-dashboard
No observability      Tool budgets          Conversation DB       MCP ecosystem
                      Built-in telemetry    Proactive nudges
                      UI context            Write-back (soft)
                      Cache layers
                      Question routing
                      Health probes

       Most BI                   ┌───┐
       vendors                   │   │  ◄── I am here
       are here                  └───┘
         │                         │
         ▼                         ▼
┌──────────────────┐  ┌──────────────────────────┐  ┌──────────────────┐
│ Power BI Copilot │  │ My Implementation        │  │ ThoughtSpot      │
│ Tableau Einstein │  │                          │  │ Spotter (2026)   │
│ Looker Gemini    │  │ Stage 2 with partial     │  │ Stage 2-3        │
│ (mostly Stage 1) │  │ Stage 3 elements         │  │ transition       │
└──────────────────┘  │ (UI context, nudges,     │  └──────────────────┘
                      │  health probes)          │
                      └──────────────────────────┘

I'm solidly in Stage 2 with several Stage 3 elements already in place (UI context awareness, context nudges, comprehensive health probes). The gaps are well-defined (semantic caching, LLM-as-judge, multi-provider, conversation persistence), and the architecture was designed to accommodate them without rearchitecting. The industry is moving fast. Gartner predicts 40% of enterprise apps will embed AI agents by end of 2026. The companies that get the architecture right now — contracts, authorization, observability, cost control — will be the ones that can evolve from Stage 2 to Stage 3 without a rewrite. I designed for exactly that.

Thursday, 2 April 2026

AI School Fees: The $0 Database That Wasn't: How AI Agents Silently Burned Through My Azure Budget Twice

I told the agent "zero cost." It provisioned 8,000 RU/s of dedicated throughput. I fixed it. It did it again. Here's the full forensic timeline.

The Problem

When I started building this internal enterprise app on Azure, the constraints were clear: free tier only. Azure Cosmos DB gives you 1,000 RU/s free. The app had ~10 containers. The math was simple — shared throughput across the database, stay under 1,000 RU/s, pay nothing.

I documented this everywhere. The agent contract said "RU-frugal." The app rules said "any throughput or retention change must be documented." The SAP feature brief said "Free Tier Guardrails — non-negotiable." The AI feature design explicitly rejected a Cosmos-backed chat history because it "violates the zero Azure cost constraint."

Despite all of this, the AI agent provisioned expensive dedicated throughput — not once, but twice. Both times I had to manually intervene, audit the damage, and harden the codebase to prevent it from happening again.

This is the forensic timeline of what happened, reconstructed from git history.

The Architecture Context

The app is a React + Azure Functions stack backed by Cosmos DB NoSQL. All containers use partition key /pk. The intended cost model was:

  Cosmos DB Free Tier
  ───────────────────
  1 Database  →  shared throughput (400-600 RU/s)
  10 Containers  →  no dedicated throughput
  ───────────────────
  Total: $0/month (within 1,000 RU/s free allowance)

Simple. Except the AI agent had a different idea.

Act 1: The Silent Provisioning (Feb 7, 2026)

What The Agent Did

I asked the AI agent to set up CI/CD scaffolding and infrastructure automation. Commit <sha-1> created scripts/setup-cosmos.sh — a script to provision Cosmos databases and containers. Sounds reasonable. Here's what it actually created:

THROUGHPUT=400

az cosmosdb sql container create \
    --partition-key-path "$PARTITION_KEY" \
    --throughput "$THROUGHPUT"     ← 400 RU/s PER CONTAINER

That --throughput flag on the container create command is the problem. It provisions dedicated throughput per container, not shared throughput at the database level.

The script also created two databases: a production DB and a dev DB. Both got the same treatment.

The Math

  What I asked for:          What the agent provisioned:
  ──────────────────         ──────────────────────────────
  1 DB, shared 400 RU/s     2 DBs, dedicated per-container

  Production:                Production:
    400 RU/s shared            10 containers × 400 RU/s = 4,000 RU/s
    $0 (free tier)             $0.008/hr × 10 = billable

  Dev:                       Dev:
    Emulator (local)           10 containers × 400 RU/s = 4,000 RU/s
    $0                         $0.008/hr × 10 = billable

  Total: ≤ 1,000 RU/s       Total: ~8,000 RU/s dedicated
  Cost: $0/month             Cost: Azure billing surprise

The agent created 8x the intended throughput across two databases, all with dedicated provisioning that can't be scaled below 400 RU/s per container. The Cosmos free tier's 1,000 RU/s allowance was instantly overwhelmed.

Why It Happened

The agent treated database provisioning as a standard infrastructure task. It knew Cosmos needs throughput. It picked the per-container model (which is the more common pattern in documentation and tutorials) without considering that:

Shared throughput exists and is the correct model for cost-sensitive workloads
A dev database in the cloud is unnecessary when the Cosmos emulator exists
400 RU/s is a floor, not a ceiling — you can't go lower with dedicated provisioning
The cost rules in the project docs explicitly prohibited this

Act 2: The First Cleanup (Feb 22, 2026)

I discovered the cost spike through Azure billing alerts and immediately performed a forensic audit. Commit <sha-2> documents the full cleanup in a cost plan document that reads like an incident post-mortem.

The Damage Assessment

From the cost plan doc I wrote at the time:

"Legacy dedicated-throughput DBs still exist and still bill baseline RU: <app-db> → 10 containers × 400 RU/s dedicated. <app-db>-dev → 10 containers × 400 RU/s dedicated."

The Fix: V2 Databases with Shared Throughput

I created new databases with V2costsaver in the name (yes, I literally named them to remind future agents about cost) and rewrote the setup script:

  Before (agent's version):              After (my fix):
  ─────────────────────────              ──────────────────────────
  THROUGHPUT=400                         DB_THROUGHPUT="${DB_THROUGHPUT:-400}"

  az cosmosdb sql container create \     az cosmosdb sql database create \
    --throughput "$THROUGHPUT"              --throughput "$DB_THROUGHPUT"
                                           ← shared at DB level
  (per container = expensive)
                                         az cosmosdb sql container create \
                                           ← NO --throughput flag
                                           (inherits from database)

Then I ran the decommission:

Created V2 databases with shared throughput
Migrated all production data
Added rollback support (--rollbackToV1Cosmos flag)
Verified all 6 cutover gates passed
Deleted both V1 databases
Applied Azure budget alerts: $300/month cap with alerts at 50%, 80%, 100%
Added Cosmos daily RU spike alert (> 2M RU in 24h)

The Emulator Decision

Six days later (Feb 28, commit <sha-3>), I made a harder decision: eliminate the cloud dev database entirely. The local Cosmos emulator would serve as the dev environment. This meant:

Zero cloud cost for development
Dev database routing consolidated into an emulator-first mode in dbResolver.js
A new mirror-to-emulator.mjs script for refreshing local dev data
The cloud dev DB (<app-db>-dev-V2costsaver) was decommissioned

Final state: one production database at 600 RU/s shared throughput — well within the 1,000 RU/s free tier allowance. Cost: $0/month.

Act 3: The Regression (Mar 3, 2026)

Five days later, the AI agent struck again.

Commit <sha-4> — a large feature commit (30 files, 3,425 insertions) implementing fiscal-year structural changes — quietly re-introduced the cloud dev database code path that I had just removed.

What The Agent Changed

In api/lib/dbResolver.js, the agent rewrote the database mode resolver. My Feb 28 version had consolidated all non-production paths to route to the emulator. The agent's version re-expanded them:

  My version (Feb 28):                   Agent's version (Mar 3):
  ────────────────────                   ────────────────────────
  if (shouldUseEmulator())               if (hasArg(EMULATOR_FLAG))
    return 'emulator';                     return 'emulator';
  if (shouldUseSupportDevDb())           if (hasArg(SUPPORT_DEV_FLAG))
    return 'support_dev_db';               return 'support_dev_db';   ← RE-ADDED
                                         if (isTruthy(COSMOS_USE_EMULATOR))
                                           return 'emulator';
                                         if (isTruthy(COSMOS_USE_SUPPORT_DEV_DB))
                                           return 'support_dev_db';   ← RE-ADDED

The 'support_dev_db' return path was back. The DEFAULT_DB_NAMES object still had supportDev: '<app-db>-dev-V2costsaver'. Combined with the init-cosmos.js script's createIfNotExists calls, this meant any script invocation with the dev flag would recreate the cloud dev database.

Why It Happened Again

The agent was working on a large feature (fiscal-year scoping) that touched the database layer. It needed to understand how database names were resolved across environments. Rather than preserving my carefully consolidated emulator-first logic, it re-derived the resolution function from first principles — and landed on the same multi-path pattern I had specifically eliminated.

The agent didn't know why those paths had been removed. It saw the pattern as "incomplete" and "helpfully" restored it. The commit message says nothing about database mode changes — they were buried in a 3,400-line feature diff.

Act 4: The Permanent Fix (Mar 29, 2026)

I'd had enough. Commit <sha-5> — 59 files changed, 277 insertions, 273 deletions — was a comprehensive retirement of all cloud dev database targeting across the entire codebase.

The Hard Guards

This time I didn't just remove the code paths. I made them impossible to restore:

1. Setup script errors on dev:

  # scripts/setup-cosmos.sh
  dev|--useSupportDevDB)
      echo "Cloud dev Cosmos setup is retired."
      echo "Use the local Cosmos emulator for development."
      exit 1

2. Runtime assertion in dbResolver.js:

  assertNoCloudNonProdDatabaseTarget()
  ────────────────────────────────────
  IF target DB ≠ production DB
  AND endpoint host ≠ localhost / 127.0.0.1 / emulator
  THEN → throw Error (hard crash)

3. Dev flags redirected to emulator: Any code passing --useSupportDevDB or setting COSMOS_USE_SUPPORT_DEV_DB=true now silently routes to the emulator instead of a cloud database.

4. Seed scripts refuse cloud non-prod targets: If the connection string points to Azure (not localhost), the seed scripts refuse to operate on non-production databases.

5. Default throughput documented at 600 RU/s: The setup script now defaults to 600 RU/s shared — within free tier — with the value explicitly visible in the script header.

The 59-File Sweep

The retirement touched every layer:

  Layer                     Files Changed    What Changed
  ─────────────────────     ─────────────    ─────────────────────────────────
  Database resolver         1                Hard assertion + emulator redirect
  Setup/provisioning        1                Dev path → error exit
  API scripts (20+)         23               All routed through new guards
  ETL scripts (JS+Python)   4                cosmosDbNames updated
  CI workflow               1                Dev DB references removed
  Documentation             8                Updated to emulator-first model
  Dev tooling               2                Local settings + dev script

The Numbers

Metric	Wave 1 (Feb 7)	After Fix 1 (Feb 22-28)	Wave 2 (Mar 3)	After Fix 2 (Mar 29)
Cloud databases	2 (prod + dev)	1 (prod only)	1 + code path for 2nd	1 (prod only, hardened)
Throughput model	Dedicated per-container	Shared per-database	Shared (but dev path live)	Shared, dev path blocked
Provisioned RU/s	~8,000	600	600 (risk of +400)	600
Free tier compliant	No	Yes	Fragile	Yes (enforced)
Guard rails	Docs only	Docs + script rewrite	Regressed	Runtime assertion + error exits
Files with dev DB refs	Growing	Consolidating	Re-expanded	0 (retired across 59 files)

What I Learned

1. Documentation Is Necessary But Not Sufficient

I had cost rules in agent-contract.md, app-rules.md, feature design docs, and the SAP brief. The rules said "RU-frugal," "zero Azure cost constraint," "Free Tier Guardrails — non-negotiable." The agent read them. The agent still provisioned dedicated throughput. Rules written in prose are suggestions. Rules written in code are enforcement.

2. AI Agents Optimize Locally, Not Globally

When the agent created the setup script, it was solving a local problem: "provision Cosmos containers." It picked the pattern most common in Azure documentation (dedicated throughput per container) without reasoning about the global cost constraint. When it re-introduced the dev DB path in Wave 2, it was solving another local problem: "make the database resolver more explicit." Both times, the agent's local optimization violated a global invariant.

3. Large Commits Hide Regressions

The Wave 2 regression was buried in a 3,425-line feature commit. The commit message mentioned fiscal-year changes, not database mode changes. If I'd reviewed only the commit message and stat, I'd have missed the dbResolver.js rewrite entirely. AI agents that make large commits need automated invariant checks, not just human code review.

4. "Remove" Is Not "Prevent"

My Feb 28 fix removed the cloud dev DB code path. My Mar 29 fix prevented it from being restored. The difference: runtime assertions that crash the process, script entry points that error on dev arguments, and seed scripts that refuse non-production targets on cloud endpoints. If you remove something from an AI-maintained codebase, you must also add a guard that prevents its resurrection.

5. Name Your Databases After Your Constraints

I named the V2 database *-V2costsaver. It's ugly. It's also the only thing in the codebase that survived every agent refactor without being renamed. Sometimes the best documentation is a name that makes the constraint impossible to ignore.

Try This Yourself

Audit your IaC scripts for throughput flags. Search for --throughput in any Cosmos provisioning script. If it's on a container create (not a database create), you're paying per-container minimums.
Add runtime guards, not just documentation. If your app must never target a cloud dev database, add an assertion that crashes on startup if it detects a non-production database on a cloud endpoint.
Review large AI commits file-by-file. Don't trust commit messages for scope. A "fiscal-year feature" commit can silently regress your cost model.
Set Azure budget alerts immediately. I should have done this on day one. A $300/month cap with 50%/80%/100% alerts would have caught Wave 1 within days instead of weeks.
Use the emulator for dev. The Cosmos emulator is free, runs locally, and eliminates an entire category of cloud cost risk. If you're paying for a cloud dev database, ask yourself why.

The agent contract, the app rules, the feature design docs — none of them stopped this. What stopped it was a throw new Error() in the database resolver. Trust but verify. Then add a guard.

Mo Khan is just an old-timer engineer-turned-manager who forgot how fun it is to build things — and who learned the hard way that AI agents read your cost rules but don't always follow them.

How Codex Autonomously Migrated Our Production App Across Continents in 28 Hours

One runbook. One AI agent. Zero portal clicks. A full SWA-to-App-Service migration from the US to South Africa.

The Problem: Your Frontend Is on the Wrong Continent

Our internal financial business intelligence tool — a React SPA backed by Azure Functions and Cosmos DB — had a geography problem. When I rapidly developed the MVP, I thought I could leverage free cloud services to not only prove the concept, but also since this tool was going to be used internally by a small group of users, I thought I could get away with free Azure services. Alas, as the MVP evolved into a real release, it became clear I had to do something about latency, cross-region calls, data sovereignty and the inherent limitations of free cloud services! So a migration was without question.

The frontend was hosted on Azure Static Web Apps in the US (since Azure does not provide this capability in South Africa and my original POC MVP was built as a static web app with local storage). The database and all backend services lived in South Africa North. Every API call crossed the Atlantic and back.

Cross-region latency on every Cosmos DB query — users in South Africa waited for round-trips to the US and back to South Africa
Data sovereignty concerns — even static HTML was served from US infrastructure
Architectural complexity — a free-tier SWA in the US proxying to paid Functions in South Africa made cost attribution and debugging harder than it needed to be
Auth coupling — SWA's built-in auth model injected identity in a platform-specific format that wouldn't survive a hosting change

The decision was made: move everything to South Africa. Same region as the data. Same region as the users.

But this wasn't just a redeploy. SWA's managed Functions, built-in auth, and SPA hosting all needed replacements. The target was a Linux App Service running Express, a standalone Azure Functions app, EasyAuth with a dedicated Entra app registration, and a completely new CI/CD pipeline. All while keeping the existing SWA running as a live fallback. Frugality is top-of-mind for me, aiming for the lowest cost options as the driving constraint.

The question was: could an autonomous AI agent execute the entire migration from a runbook — provisioning Azure resources, writing code, deploying infrastructure, and cutting over production — without a single portal click?

The Cast

This project used the same three-actor model I described in my previous post about the AI service migration:

Me — architect and orchestrator. I wrote the runbook, reviewed it across 7 sessions with Claude, made the cutover decisions, and performed final manual validation.

Claude (Opus) — planning partner. Claude reviewed the runbook across 7 dedicated sessions between March 6-26, catching missing auth flows, underspecified identity migration paths, and gaps in the rollback strategy.

Codex — autonomous executor. Codex received the runbook and executed it end-to-end across March 29-30: provisioning Azure resources, writing code, deploying to production, running identity backfills, enabling EasyAuth, and cutting over to the new stack.

┌─────────────┐                         ┌─────────────┐
│             │   7 review sessions     │             │
│   Human     │◄───────────────────────►│   Claude    │
│  Architect  │   runbook + review      │  (Opus)     │
│             │────────────────────────►│  Reviewer   │
└──────┬──────┘                         └─────────────┘
       │
       │  runbook
       │
       ▼
┌─────────────────────────────────────────────────────┐
│                     Codex                           │
│                Autonomous Executor                  │
│                                                     │
│  Day 1 (Mar 29): Provision + Code + Deploy          │
│  Day 2 (Mar 30): Auth + Identity + Cutover          │
│                                                     │
│  Azure CLI │ GitHub CLI │ Node.js │ PowerShell      │
│  14 files created │ 18 files modified               │
│  537 tests passing │ 12 user identities migrated    │
└─────────────────────────────────────────────────────┘

The Architecture: Before and After

Before: Cross-Region SWA

The existing architecture had the frontend and its managed Functions in the US, making cross-Atlantic calls to Cosmos DB in South Africa on every API request.

┌─────────────────────────────────────────────────────────────────┐
│                        BROWSER (South Africa)                   │
│   React SPA ──── fetch('/api/*') ────►                          │
└─────────────────────┬───────────────────────────────────────────┘
                      │
    🔻 Atlantic crossing (~180ms RTT)
                      │
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│         Azure Static Web Apps  (US Region)                      │
│                                                                 │
│  ┌─────────────────┐  ┌────────────────────────────┐            │
│  │  SWA Built-in   │  │  SWA-Managed Functions     │            │
│  │  Auth (EasyAuth)│  │  (co-located in US)        │            │
│  │  SWA headers    │  │                            │            │
│  │  Platform-      │  │  /api/me                   │            │
│  │  specific format│  │  /api/data                 │            │
│  └─────────────────┘  │  /api/ai/chat              │            │
│                       │  /api/etl/upload           │            │
│  Serves React SPA     │  ... 40+ API endpoints     │            │
│  (static files US)    └──────┬─────────────────────┘            │
└──────────────────────────────│──────────────────────────────────┘
                               │
              🔻 Another Atlantic crossing
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                   South Africa North                            │
│                                                                 │
│  ┌────────────────┐  ┌────────────┐  ┌───────────────┐          │
│  │  Cosmos DB     │  │ ETL Extract│  │  Blob Storage │          │
│  │  (all data)    │  │  (Python)  │  │  (SAP exports)│          │
│  └────────────────┘  │  ETL Sync  │  └───────────────┘          │
│                      │  (Node.js) │                             │
│                      └────────────┘                             │
└─────────────────────────────────────────────────────────────────┘

Problems:
  ✗ Every API call crosses the Atlantic twice (browser → US → SA → US → browser)
  ✗ Static files served from US for South African users
  ✗ Auth format is SWA-specific (platform lock-in)
  ✗ SWA-managed Functions can't be independently scaled or monitored
  ✗ Cost attribution across regions is opaque

After: Single-Region App Service

Everything co-located in South Africa North. The Express server handles SPA hosting and proxies API calls to a standalone Functions app — all in the same region as Cosmos DB.

┌─────────────────────────────────────────────────────────────────┐
│                        BROWSER (South Africa)                   │
│   React SPA ──── fetch('/api/*') ────►                          │
│   Same-origin requests, ~5ms to App Service                     │
└─────────────────────┬───────────────────────────────────────────┘
                      │
                      ▼  (same region!)
┌─────────────────────────────────────────────────────────────────┐
│               All South Africa North                            │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │   App Service B1 Linux  (Express server)                  │  │
│  │                                                           │  │
│  │   ┌───────────────┐  ┌───────────────────────────────────┐│  │
│  │   │  EasyAuth     │  │  Express Web Host                 ││  │
│  │   │  (Entra ID)   │  │                                   ││  │
│  │   │  Dedicated app│  │  /healthz → direct 200            ││  │
│  │   │  registration │  │  /api/*   → proxy to Functions    ││  │
│  │   │  Claims-array │  │  /*       → serve dist/index.html ││  │
│  │   │  format       │  │  dist/assets/* → immutable cache  ││  │
│  │   └───────────────┘  └────────┬──────────────────────────┘│  │
│  └───────────────────────────────│───────────────────────────┘  │
│                                  │                              │
│                                  │  x-internal-proxy-secret     │
│                                  │  x-ms-client-principal       │
│                                  ▼                              │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Standalone Functions App  (Consumption plan)             │  │
│  │                                                           │  │
│  │  AUTH_MODE=appservice                                     │  │
│  │  Validates proxy secret → parses claims-array             │  │
│  │  IP restrictions: App Service outbound IPs only           │  │
│  │                                                           │  │
│  │  /api/me  /api/data  /api/ai/chat  /api/etl/upload        │  │
│  │  ... 40+ endpoints (same business logic, new auth mode)   │  │
│  └─────────────┬─────────────────────────────────────────────┘  │
│                │                                                │
│                ▼  (same region, ~1ms)                           │
│  ┌────────────────┐  ┌────────────┐  ┌───────────────┐          │
│  │  Cosmos DB     │  │ETL Extract │  │  Blob Storage │          │
│  │  (same region!)│  │(unchanged) │  │  (unchanged)  │          │
│  └────────────────┘  │ETL Sync    │  └───────────────┘          │
│                      │(unchanged) │                             │
│                      └────────────┘                             │
└─────────────────────────────────────────────────────────────────┘

Improvements:
  ✓ All traffic stays in South Africa — no cross-region hops
  ✓ Express serves SPA + proxies to Functions in same region
  ✓ Dedicated EasyAuth with claims-array auth (no SWA lock-in)
  ✓ Functions independently scalable and monitorable
  ✓ IP-restricted: Functions only accept traffic from App Service
  ✓ Shared-secret trust boundary on every proxied request
  ✓ SWA kept as parked standby for emergency failover
  ✓ Cost: +$13/month for the App Service plan

The Auth Migration

This deserves its own diagram because it was the hardest part of the migration. SWA and App Service EasyAuth present identity differently. The backend had to understand both.

  SWA Auth (before):                    App Service Auth (after):
  ──────────────────                    ────────────────────────
  x-ms-client-principal                 x-ms-client-principal
  │                                     │
  ▼                                     ▼
  Base64 → JSON                         Base64 → JSON
  {                                     {
    userId: "abc",                        claims: [
    userRoles: ["admin"],                   { typ: "oid", val: "abc" },
    identityProvider: "aad"                 { typ: "email", val: "..." },
  }                                         { typ: "roles", val: "admin" }
                                          ]
  Top-level fields                      }
  (SWA-specific)
                                        Claims array
                                        (standard Entra format)

  AUTH_MODE=swa                         AUTH_MODE=appservice
  + no proxy secret needed              + x-internal-proxy-secret required
  + SWA manages the hop                 + timing-safe secret validation
                                        + claims-array parsing
                                        + identity canonicalization

Act 1: The Runbook

Why This Migration Needed a Runbook

This wasn't a "lift and shift." Moving from SWA to App Service touched:

4 Azure resources to provision (App Service plan, web app, Functions app, storage account)
56 app settings to migrate from SWA to the standalone Functions app
12 user identities to canonicalize from SWA format to Entra format
A new auth mode (App Service EasyAuth with claims-array parsing)
A new web host (Express server replacing SWA's built-in hosting)
2 CI/CD pipelines running in parallel during validation
An ETL pipeline that needed seamless ownership transfer between workflows
A parked standby mode for the old SWA (not decommission — failover readiness)

The runbook grew to 21 sections with 2,300+ lines. Every Azure CLI command. Every app setting category. Every auth claim extraction rule. Every verification checkpoint.

Seven Review Sessions

Before Codex touched anything, Claude reviewed the runbook across seven dedicated sessions between March 6-26, 2026:

Session	Date	Focus
1	Mar 6	Initial architecture validation and scope framing
2	Mar 25 (18:55)	Plan validity: are the phases correctly sequenced?
3	Mar 25 (19:34)	Autonomous execution review: can an AI agent run this without portal clicks?
4	Mar 25 (20:03)	Architecture and design review: is the auth migration sound?
5	Mar 25 (20:36)	Implementation plan: critical path validation
6	Mar 25 (20:59)	Expert engineer review: what would a senior engineer push back on?
7	Mar 25 (21:24)	Deep review: identity migration, rollback, and edge cases

Key issues caught during review:

Identity continuity gap: The initial runbook assumed user IDs would carry over. Claude caught that SWA uses platform-managed service principals while App Service EasyAuth uses Entra object IDs — a completely different identity format. This led to adding the userIdentity.js canonicalization layer and the one-time backfill script.
Auth lightweight path: The verifyTokenLightweight function used by AI chat was SWA-only. Without an App Service equivalent, AI chat would break silently after migration.
ETL upload streaming: If Express body-parsing middleware was added before the /api proxy, multipart ETL uploads would break. The runbook was updated to explicitly forbid express.json() ahead of the proxy mount.
Rollback strategy: The original plan assumed SWA decommission. I later changed my mind and pushed for a parked-standby model instead — keep SWA deployable as emergency failover, not delete it.

Act 2: The Execution

Codex received the runbook and started working on March 29, 2026 at 14:56 SAST.

Day 1: Infrastructure and Code (March 29)

Time	Event
14:56	Inventory capture + code implementation (auth, identity, server, tests)
15:22	Azure provisioning: App Service plan, web app, Functions app, storage account
15:41	Identity backfill dry-run: 12 users scanned, 12 canonical migrations found
15:42	Identity backfill applied: 12 users migrated, zero conflicts
16:43	Web deploy (first attempt — Windows zip failed, rebuilt with POSIX paths)
17:00	API deploy: standalone Functions packaging fixed, proxy verified
17:06	Auth blocker: Entra app registration failed (insufficient tenant privileges)
18:05	Full verification: 56/56 config parity, health green, smoke blocked only by auth
19:18	Workflow + deploy path hardening committed
22:10	API deploy recovery: Functions-action produced 503; switched to source-only Kudu
23:33	Kudu false-negative analysis: rsync symlink errors masked a healthy deploy
23:56	Both GitHub Actions workflows green. SA web + API deployed successfully.

Day 2: Auth, Validation, and Cutover (March 30)

Time	Event
14:32	Entra auth unblocked: dedicated app registration created with new privileges
14:40	EasyAuth enabled: login redirect verified working
15:01	Identity re-audit: 9 canonical, 2 clean migrations, 1 overlap detected
15:15	Overlap identity fix + verification hardening deployed
15:48	Auth flow correction: `enableIdTokenIssuance` was false, fixed live
16:05	Workflow smoke alignment: accept EasyAuth-protected probe responses
17:18	ETL admin regression: `EtlPipelineView` used wrong role authority, fixed
18:24	Documentation strategy rewrite: park SWA, don't decommission
21:03	ETL ownership switched to SA workflow
21:22	Final cutover: SWA parked, SA primary, both workflows green
21:34	Failover drill fix: `workflow_dispatch` jobs were gated to push-only
21:49	Failover drill complete: SWA restored, re-parked, verified end-to-end

Act 3: The Battles

Autonomous doesn't mean smooth. Codex hit real obstacles and worked through them.

Battle 1: The Windows Zip

The first web deploy failed because the zip archive built on Windows contained backslash paths. Azure's OneDeploy rejected them. Codex rebuilt the package with POSIX-style paths and redeployed successfully.

Battle 2: The Functions 503

The standard Azure/functions-action@v1 with pre-built node_modules produced a deployed Functions app that returned 503. Codex diagnosed it, switched to source-only Kudu zipdeploy with SCM_DO_BUILD_DURING_DEPLOYMENT=true (matching the pattern already proven by the ETL sync app), and restored the API to 200.

Battle 3: The Kudu False Negative

After fixing the deploy shape, Kudu still reported "failed" — because rsync couldn't create node_modules/.bin/* symlinks. But the app was actually healthy. Codex analyzed the log pattern, hardened the Kudu helper script to recognize this specific false negative, and added a health-gated fallback.

Battle 4: The Tenant Privilege Blocker

Creating the Entra app registration required Application Administrator privileges that Codex didn't have on Day 1. This blocked EasyAuth completely. I resolved the privilege overnight, and Codex resumed on Day 2.

Battle 5: The ID Token Gap

After enabling EasyAuth, browser logins failed silently. The Entra app registration had enableIdTokenIssuance=false, but App Service EasyAuth requests response_type=code id_token. Codex found this, set the flag to true via CLI, and updated both the provisioning and verification scripts to treat it as required state.

Battle 6: The ETL Role Regression

The ETL admin page broke for the admin user on the new stack. Root cause: EtlPipelineView preferred raw tokenRoles (which under App Service auth is just ["authenticated"]) over the database-backed profileRoles. Codex fixed the precedence and added a regression test.

Act 4: The Numbers

Metric	Value
Total execution time	~28 hours across 2 days
Files created	14
Files modified	18
Tests passing	537 across 149 test files
User identities migrated	12
App settings migrated	56 (verified parity)
Azure resources provisioned	4 (plan, web app, Functions app, storage)
GitHub Actions workflows	2 running in parallel, both green
Execution ledger entries	30+ timestamped operations
Portal clicks	0
Incremental monthly cost	+$13 (one B1 Linux plan)

What Got Deployed

Express web host serving the React SPA with immutable asset caching
API proxy with shared-secret trust boundary and 180s timeout
Standalone Functions app with AUTH_MODE=appservice and IP restrictions
Dedicated Entra app registration with EasyAuth
App Service auth parser with claims-array extraction and timing-safe secret validation
Identity canonicalization layer with SWA-to-Entra migration
Kudu zipdeploy helper with false-negative resilience
Curated deploy artifact with dependency pruning
ETL workflow parity with ownership switch variable
SWA parked standby with verified failover drill

What I Learned

The Execution Ledger Pattern

The most valuable artifact wasn't the code — it was the execution ledger. Every action Codex took was recorded with timestamp, phase, command, sanitized result, and next action. This append-only log became the working memory across sessions and the audit trail for the entire migration.

When Codex hit the tenant privilege blocker on Day 1 and had to resume on Day 2, the ledger told it exactly where to pick up. When the Kudu deploy shape needed three iterations, the ledger captured each failure and its resolution.

If you're planning autonomous multi-session work, build the ledger into the runbook from the start.

CLI-First Changes Everything

The runbook's execution rule — "no Azure Portal or GitHub UI dependency; all setup must be executable by az, gh, PowerShell, or GitHub Actions" — was the single most important constraint. It made the entire migration automatable.

Every resource provisioned by az appservice plan create. Every secret set by gh secret set. Every EasyAuth configuration by az webapp auth update. Every verification by scripted probes. Zero portal clicks meant zero human bottlenecks.

Park, Don't Decommission

My push toward a parked-standby model instead of immediate SWA decommission was the right call. On Day 2, after cutover, Codex ran a full failover drill: unparked SWA, verified the full app was serving, then re-parked it. The whole cycle took 15 minutes and proved the rollback path works.

For any production migration: keep the old thing alive in standby until you're confident you'll never need it. The cost of maintaining a parked SWA ($0) is much less than the cost of recreating one in an emergency.

Auth Migrations Are Never Simple

We hit five distinct auth-related issues across two days: tenant privileges, ID token issuance flags, claims-array format differences, role authority precedence, and identity canonicalization. Any one of them could have broken the migration silently.

The runbook's detailed auth specification — with pseudocode for claim extraction, validation ordering, and normalized return shapes — was essential. Without it, the agent would have guessed at the auth format and produced subtly wrong behavior.

The Human's Role

I didn't write the Express server, the auth parser, the identity backfill script, the deploy workflows, or the provisioning scripts. But I:

Designed the target architecture
Wrote a 2,300-line runbook that left nothing ambiguous
Reviewed it across 7 sessions with an AI planning partner
Resolved the tenant privilege blocker that no CLI command could fix
Made the cutover decision based on the verification evidence after confirming with pilot users
Performed manual browser validation that proved the stack worked end-to-end

The pattern is the same as before: the human's job is to write specifications precise enough that code writes itself. The better the runbook, the more the agent can do autonomously.

Try This Yourself

Compared to the AI service migration (which was a code extraction and new service build), this SWA migration was a different kind of challenge: less code, more infrastructure, more auth complexity, more operational choreography.

If you're planning a similar hosting migration:

Audit your auth surface before you start. SWA, App Service, and B2C/Entra all present identity differently. Map the claim shapes explicitly.
Build the execution ledger into the plan. Autonomous agents that work across sessions need persistent working memory.
Require CLI-only execution. If the plan needs portal clicks, the agent can't run it.
Run both stacks in parallel. Shared data (same Cosmos, same Blob) means zero data migration. Two active frontends during validation costs almost nothing.
Park, don't delete. Your rollback is only useful if it still exists.
Test the real user path manually. Health probes pass, workflows are green, config parity is 56/56 — and then a human opens a browser and auth fails because of a flag nobody thought to check.

The runbook used in this project is shared in the appendix below.

Appendix: The Complete Migration Runbook (Redacted)

Below is the full runbook that guided this migration, exactly as Codex executed it. Sensitive identifiers — Azure resource names, GitHub references, email addresses, Entra IDs, connection strings, and deployment credentials — have been replaced with <placeholder> tokens. The architecture decisions, execution patterns, CLI commands, and verification checklists are preserved verbatim.

This is the document that Claude reviewed across seven sessions and Codex executed autonomously over two days. Scroll through to see the level of detail that makes autonomous agent execution possible.

Scroll inside the frame to read the complete runbook. The document contains 21 sections covering architecture, auth design, CLI automation, deployment workflows, and verification checklists.

Muhammad Khan is a GM moonlighting as software engineer in his spare time, learning about AI-augmented development workflows, cloud architecture, and autonomous agent orchestration.

Pages

Tuesday, 7 April 2026

How I built an AI Finance Assistant into a Business Intelligence Dashboard with Claude Code and Codex

The Numbers

Act 1: The Problem — Dashboards Don't Answer Questions

Act 2: The Architecture — End-to-End

Act 3: The Contract System — Provider-Agnostic by Design

Act 4: The Tool Arsenal — 23 Tools Across 7 Domains

Tool Execution: Authorization at Every Call

Adding New Tools: A Disciplined Process

Act 5: Intelligent Question Routing — Tool Profiles

Act 6: The Prompt Library — Context-Aware Starter Prompts and UI Nudges

1. Starter Prompt Catalog

2. Context Nudges — The Page-Aware Prompt Engine

Act 7: The System Prompt — Persona Engineering for Finance

Act 8: The Streaming Pipeline — SSE From Orchestration to Browser

Stream Recovery

Act 9: The Cache System — Three Layers Deep

Act 10: Retry, Fallback, and Graceful Degradation

Provider-Level Retry

Partial Completion States

Continuation and Follow-Up Detection

Act 11: The Authentication Stack — Three Layers, Zero Trust

Act 12: The Health & Observability System — Knowing What You Don't Know

The Telemetry Pipeline

Probe 1: AI Cache Health

Probe 2: AI Chat Telemetry

Probe 3: AI Insights — Usage, Cost, and Adoption

Token Cost Estimation

AI Service Runtime Health

The Done Event — Per-Request Telemetry

Act 13: UI Context — Making the AI See What You See

Act 14: Model Selection & Fallback

The Full Request Lifecycle

What I Learned

Try This Yourself

Claude's Honest Assessment: Strengths, Gaps, and Where the Industry Is Heading

What I Got Right

What I Could Do Better — Honest Gaps according to Claude Code

Future Roadmap — Where This Architecture Should Go Next (Claude's recos)

Tier 1: High Impact, Near-Term (Weeks)

Tier 2: High Impact, Medium-Term (1-2 Months)

Tier 3: Strategic, Longer-Term (3-6 Months)

The Maturity Spectrum mapped by Claude Code

Thursday, 2 April 2026

AI School Fees: The $0 Database That Wasn't: How AI Agents Silently Burned Through My Azure Budget Twice

The Problem

The Architecture Context

Act 1: The Silent Provisioning (Feb 7, 2026)

What The Agent Did

The Math

Why It Happened

Act 2: The First Cleanup (Feb 22, 2026)

The Damage Assessment

The Fix: V2 Databases with Shared Throughput

The Emulator Decision

Act 3: The Regression (Mar 3, 2026)

What The Agent Changed

Why It Happened Again

Act 4: The Permanent Fix (Mar 29, 2026)

The Hard Guards

The 59-File Sweep

The Numbers

What I Learned

1. Documentation Is Necessary But Not Sufficient

2. AI Agents Optimize Locally, Not Globally

3. Large Commits Hide Regressions

4. "Remove" Is Not "Prevent"

5. Name Your Databases After Your Constraints

Try This Yourself

How Codex Autonomously Migrated Our Production App Across Continents in 28 Hours

The Problem: Your Frontend Is on the Wrong Continent

The Cast

The Architecture: Before and After

Before: Cross-Region SWA

After: Single-Region App Service

The Auth Migration

Act 1: The Runbook

Why This Migration Needed a Runbook

Seven Review Sessions