I told the agent "zero cost." It provisioned 8,000 RU/s of dedicated throughput. I fixed it. It did it again. Here's the full forensic timeline.
The Problem
When I started building this internal enterprise app on Azure, the constraints were clear: free tier only. Azure Cosmos DB gives you 1,000 RU/s free. The app had ~10 containers. The math was simple — shared throughput across the database, stay under 1,000 RU/s, pay nothing.
I documented this everywhere. The agent contract said "RU-frugal." The app rules said "any throughput or retention change must be documented." The SAP feature brief said "Free Tier Guardrails — non-negotiable." The AI feature design explicitly rejected a Cosmos-backed chat history because it "violates the zero Azure cost constraint."
Despite all of this, the AI agent provisioned expensive dedicated throughput — not once, but twice. Both times I had to manually intervene, audit the damage, and harden the codebase to prevent it from happening again.
This is the forensic timeline of what happened, reconstructed from git history.
The Architecture Context
The app is a React + Azure Functions stack backed by Cosmos DB NoSQL. All containers use partition key /pk. The intended cost model was:
```
Cosmos DB Free Tier
───────────────────
1 Database    → shared throughput (400-600 RU/s)
10 Containers → no dedicated throughput
───────────────────
Total: $0/month (within 1,000 RU/s free allowance)
```
Simple. Except the AI agent had a different idea.
Act 1: The Silent Provisioning (Feb 7, 2026)
What The Agent Did
I asked the AI agent to set up CI/CD scaffolding and infrastructure automation. Commit <sha-1> created scripts/setup-cosmos.sh — a script to provision Cosmos databases and containers. Sounds reasonable. Here's what it actually created:
```bash
THROUGHPUT=400

az cosmosdb sql container create \
  --partition-key-path "$PARTITION_KEY" \
  --throughput "$THROUGHPUT"    # ← 400 RU/s PER CONTAINER
```
That --throughput flag on the container create command is the problem. It provisions dedicated throughput per container, not shared throughput at the database level.
The script also created two databases: a production DB and a dev DB. Both got the same treatment.
The Math
```
What I asked for:          What the agent provisioned:
──────────────────         ──────────────────────────────
1 DB, shared 400 RU/s      2 DBs, dedicated per-container

Production:                Production:
  400 RU/s shared            10 containers × 400 RU/s = 4,000 RU/s
  $0 (free tier)             $0.008/hr × 10 = billable

Dev:                       Dev:
  Emulator (local)           10 containers × 400 RU/s = 4,000 RU/s
  $0                         $0.008/hr × 10 = billable

Total: ≤ 1,000 RU/s        Total: ~8,000 RU/s dedicated
Cost: $0/month             Cost: Azure billing surprise
```
The agent created 8x the intended throughput across two databases, all with dedicated provisioning that can't be scaled below 400 RU/s per container. The Cosmos free tier's 1,000 RU/s allowance was instantly overwhelmed.
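To make the blast radius concrete, here's a small sketch (a hypothetical helper, not code from the repo) that computes total provisioned RU/s under each model and checks it against the free-tier allowance:

```javascript
// Hypothetical helper: compare shared vs. dedicated provisioning
// against the Cosmos free-tier allowance (1,000 RU/s).
const FREE_TIER_RU = 1000;

function totalRu({ databases, containersPerDb, ruPerContainer, sharedRuPerDb, dedicated }) {
  // Dedicated: every container carries its own throughput.
  // Shared: one database-level allocation covers all containers.
  return dedicated
    ? databases * containersPerDb * ruPerContainer
    : databases * sharedRuPerDb;
}

const intended = totalRu({
  databases: 1, containersPerDb: 10, sharedRuPerDb: 400, dedicated: false,
});
const provisioned = totalRu({
  databases: 2, containersPerDb: 10, ruPerContainer: 400, dedicated: true,
});

console.log(intended, intended <= FREE_TIER_RU);       // 400 true
console.log(provisioned, provisioned <= FREE_TIER_RU); // 8000 false
```

Same container count, same per-unit number (400) — an 8,000 RU/s difference, purely from where the `--throughput` flag landed.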
Why It Happened
The agent treated database provisioning as a standard infrastructure task. It knew Cosmos needs throughput. It picked the per-container model (which is the more common pattern in documentation and tutorials) without considering that:
- Shared throughput exists and is the correct model for cost-sensitive workloads
- A dev database in the cloud is unnecessary when the Cosmos emulator exists
- 400 RU/s is a floor, not a ceiling — you can't go lower with dedicated provisioning
- The cost rules in the project docs explicitly prohibited this
Act 2: The First Cleanup (Feb 22, 2026)
I discovered the cost spike through Azure billing alerts and immediately performed a forensic audit. Commit <sha-2> documents the full cleanup in a cost plan document that reads like an incident post-mortem.
The Damage Assessment
From the cost plan doc I wrote at the time:
"Legacy dedicated-throughput DBs still exist and still bill baseline RU:
`<app-db>` → 10 containers × 400 RU/s dedicated. `<app-db>-dev` → 10 containers × 400 RU/s dedicated."
The Fix: V2 Databases with Shared Throughput
I created new databases with V2costsaver in the name (yes, I literally named them to remind future agents about cost) and rewrote the setup script:
```
Before (agent's version):            After (my fix):
─────────────────────────            ──────────────────────────
THROUGHPUT=400                       DB_THROUGHPUT="${DB_THROUGHPUT:-400}"

az cosmosdb sql container create \   az cosmosdb sql database create \
  --throughput "$THROUGHPUT"           --throughput "$DB_THROUGHPUT"
  (per container = expensive)          ← shared at DB level

                                     az cosmosdb sql container create \
                                       ← NO --throughput flag
                                       (inherits from database)
```
Then I ran the decommission:
- Created V2 databases with shared throughput
- Migrated all production data
- Added rollback support (`--rollbackToV1Cosmos` flag)
- Verified all 6 cutover gates passed
- Deleted both V1 databases
- Applied Azure budget alerts: $300/month cap with alerts at 50%, 80%, 100%
- Added Cosmos daily RU spike alert (> 2M RU in 24h)
The Emulator Decision
Six days later (Feb 28, commit <sha-3>), I made a harder decision: eliminate the cloud dev database entirely. The local Cosmos emulator would serve as the dev environment. This meant:
- Zero cloud cost for development
- Dev database routing consolidated into an emulator-first mode in `dbResolver.js`
- A new `mirror-to-emulator.mjs` script for refreshing local dev data
- The cloud dev DB (`<app-db>-dev-V2costsaver`) decommissioned
Final state: one production database at 600 RU/s shared throughput — well within the 1,000 RU/s free tier allowance. Cost: $0/month.
Act 3: The Regression (Mar 3, 2026)
Five days later, the AI agent struck again.
Commit <sha-4> — a large feature commit (30 files, 3,425 insertions) implementing fiscal-year structural changes — quietly re-introduced the cloud dev database code path that I had just removed.
What The Agent Changed
In api/lib/dbResolver.js, the agent rewrote the database mode resolver. My Feb 28 version had consolidated all non-production paths to route to the emulator. The agent's version re-expanded them:
```
My version (Feb 28):           Agent's version (Mar 3):
────────────────────           ────────────────────────
if (shouldUseEmulator())       if (hasArg(EMULATOR_FLAG))
  return 'emulator';             return 'emulator';
if (shouldUseSupportDevDb())   if (hasArg(SUPPORT_DEV_FLAG))
  return 'support_dev_db';       return 'support_dev_db';   ← RE-ADDED
                               if (isTruthy(COSMOS_USE_EMULATOR))
                                 return 'emulator';
                               if (isTruthy(COSMOS_USE_SUPPORT_DEV_DB))
                                 return 'support_dev_db';   ← RE-ADDED
```
The 'support_dev_db' return path was back. The DEFAULT_DB_NAMES object still had supportDev: '<app-db>-dev-V2costsaver'. Combined with the init-cosmos.js script's createIfNotExists calls, this meant any script invocation with the dev flag would recreate the cloud dev database.
Why It Happened Again
The agent was working on a large feature (fiscal-year scoping) that touched the database layer. It needed to understand how database names were resolved across environments. Rather than preserving my carefully consolidated emulator-first logic, it re-derived the resolution function from first principles — and landed on the same multi-path pattern I had specifically eliminated.
The agent didn't know why those paths had been removed. It saw the pattern as "incomplete" and "helpfully" restored it. The commit message says nothing about database mode changes — they were buried in a 3,400-line feature diff.
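For contrast, the consolidated emulator-first shape — a sketch in the spirit of the Feb 28 version, with approximated names, not the exact repo code — leaves nothing for an agent to "helpfully" restore:

```javascript
// Sketch of an emulator-first resolver (names approximated):
// production is the one explicit case; everything else is the emulator.
function resolveDbMode({ nodeEnv, cosmosEndpoint }) {
  const isProd = nodeEnv === 'production';
  const isLocalEndpoint = /localhost|127\.0\.0\.1/.test(cosmosEndpoint || '');

  if (isProd && !isLocalEndpoint) return 'production';
  // No 'support_dev_db' branch exists to resurrect: every
  // non-production invocation routes to the local emulator.
  return 'emulator';
}

console.log(resolveDbMode({
  nodeEnv: 'production',
  cosmosEndpoint: 'https://myapp.documents.azure.com',
})); // 'production'
console.log(resolveDbMode({
  nodeEnv: 'development',
  cosmosEndpoint: 'https://localhost:8081',
})); // 'emulator'
```

A resolver with two outcomes is harder to "re-expand" than one with four, because there is no dangling flag or constant hinting at a missing path.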
Act 4: The Permanent Fix (Mar 29, 2026)
I'd had enough. Commit <sha-5> — 59 files changed, 277 insertions, 273 deletions — was a comprehensive retirement of all cloud dev database targeting across the entire codebase.
The Hard Guards
This time I didn't just remove the code paths. I made them impossible to restore:
1. Setup script errors on dev:

```bash
# scripts/setup-cosmos.sh
dev|--useSupportDevDB)
  echo "Cloud dev Cosmos setup is retired."
  echo "Use the local Cosmos emulator for development."
  exit 1
  ;;
```
2. Runtime assertion in dbResolver.js:
```
assertNoCloudNonProdDatabaseTarget()
────────────────────────────────────
IF   target DB ≠ production DB
AND  endpoint host ≠ localhost / 127.0.0.1 / emulator
THEN → throw Error (hard crash)
```
3. Dev flags redirected to emulator: Any code passing --useSupportDevDB or setting COSMOS_USE_SUPPORT_DEV_DB=true now silently routes to the emulator instead of a cloud database.
4. Seed scripts refuse cloud non-prod targets: If the connection string points to Azure (not localhost), the seed scripts refuse to operate on non-production databases.
5. Default throughput documented at 600 RU/s: The setup script now defaults to 600 RU/s shared — within free tier — with the value explicitly visible in the script header.
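Guards 2 and 3 can be sketched together in a few lines of JavaScript. This is an assumed shape with placeholder names, not the repo's actual `dbResolver.js`:

```javascript
// Sketch of the hard guard (placeholder names): crash if a
// non-production database would be targeted on a real cloud endpoint.
const PROD_DB = 'app-db-V2costsaver'; // placeholder production DB name
const EMULATOR_HOSTS = ['localhost', '127.0.0.1'];

function assertNoCloudNonProdDatabaseTarget(dbName, endpoint) {
  const host = new URL(endpoint).hostname;
  const isEmulator = EMULATOR_HOSTS.includes(host);
  if (dbName !== PROD_DB && !isEmulator) {
    throw new Error(
      `Refusing to target non-production DB "${dbName}" on cloud host ${host}. ` +
      'Use the local Cosmos emulator for development.'
    );
  }
}

// Guard 3: legacy dev flags silently resolve to the emulator
// instead of any cloud database.
function resolveTarget({ useSupportDevDb, endpoint }) {
  if (useSupportDevDb) {
    return { dbName: 'app-db-dev', endpoint: 'https://localhost:8081' };
  }
  return { dbName: PROD_DB, endpoint };
}
```

The key property: the assertion runs at startup, so a regressed code path fails loudly on the first invocation instead of quietly re-provisioning a billable database.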
The 59-File Sweep
The retirement touched every layer:
| Layer | Files Changed | What Changed |
|---|---|---|
| Database resolver | 1 | Hard assertion + emulator redirect |
| Setup/provisioning | 1 | Dev path → error exit |
| API scripts (20+) | 23 | All routed through new guards |
| ETL scripts (JS+Python) | 4 | cosmosDbNames updated |
| CI workflow | 1 | Dev DB references removed |
| Documentation | 8 | Updated to emulator-first model |
| Dev tooling | 2 | Local settings + dev script |
The Numbers
| Metric | Wave 1 (Feb 7) | After Fix 1 (Feb 22-28) | Wave 2 (Mar 3) | After Fix 2 (Mar 29) |
|---|---|---|---|---|
| Cloud databases | 2 (prod + dev) | 1 (prod only) | 1 + code path for 2nd | 1 (prod only, hardened) |
| Throughput model | Dedicated per-container | Shared per-database | Shared (but dev path live) | Shared, dev path blocked |
| Provisioned RU/s | ~8,000 | 600 | 600 (risk of +400) | 600 |
| Free tier compliant | No | Yes | Fragile | Yes (enforced) |
| Guard rails | Docs only | Docs + script rewrite | Regressed | Runtime assertion + error exits |
| Files with dev DB refs | Growing | Consolidating | Re-expanded | 0 (retired across 59 files) |
What I Learned
1. Documentation Is Necessary But Not Sufficient
I had cost rules in agent-contract.md, app-rules.md, feature design docs, and the SAP brief. The rules said "RU-frugal," "zero Azure cost constraint," "Free Tier Guardrails — non-negotiable." The agent read them. The agent still provisioned dedicated throughput. Rules written in prose are suggestions. Rules written in code are enforcement.
2. AI Agents Optimize Locally, Not Globally
When the agent created the setup script, it was solving a local problem: "provision Cosmos containers." It picked the pattern most common in Azure documentation (dedicated throughput per container) without reasoning about the global cost constraint. When it re-introduced the dev DB path in Wave 2, it was solving another local problem: "make the database resolver more explicit." Both times, the agent's local optimization violated a global invariant.
3. Large Commits Hide Regressions
The Wave 2 regression was buried in a 3,425-line feature commit. The commit message mentioned fiscal-year changes, not database mode changes. If I'd reviewed only the commit message and stat, I'd have missed the dbResolver.js rewrite entirely. AI agents that make large commits need automated invariant checks, not just human code review.
4. "Remove" Is Not "Prevent"
My Feb 28 fix removed the cloud dev DB code path. My Mar 29 fix prevented it from being restored. The difference: runtime assertions that crash the process, script entry points that error on dev arguments, and seed scripts that refuse non-production targets on cloud endpoints. If you remove something from an AI-maintained codebase, you must also add a guard that prevents its resurrection.
5. Name Your Databases After Your Constraints
I named the V2 database *-V2costsaver. It's ugly. It's also the only thing in the codebase that survived every agent refactor without being renamed. Sometimes the best documentation is a name that makes the constraint impossible to ignore.
Try This Yourself
- Audit your IaC scripts for throughput flags. Search for `--throughput` in any Cosmos provisioning script. If it's on a container create (not a database create), you're paying per-container minimums.
- Add runtime guards, not just documentation. If your app must never target a cloud dev database, add an assertion that crashes on startup if it detects a non-production database on a cloud endpoint.
- Review large AI commits file-by-file. Don't trust commit messages for scope. A "fiscal-year feature" commit can silently regress your cost model.
- Set Azure budget alerts immediately. I should have done this on day one. A $300/month cap with 50%/80%/100% alerts would have caught Wave 1 within days instead of weeks.
- Use the emulator for dev. The Cosmos emulator is free, runs locally, and eliminates an entire category of cloud cost risk. If you're paying for a cloud dev database, ask yourself why.
The agent contract, the app rules, the feature design docs — none of them stopped this. What stopped it was a `throw new Error()` in the database resolver. Trust but verify. Then add a guard.
Mo Khan is just an old-timer engineer-turned-manager who forgot how fun it is to build things — and who learned the hard way that AI agents read your cost rules but don't always follow them.
