I told the agent "zero cost." It provisioned 8,000 RU/s of dedicated throughput. I fixed it. It did it again. Here's the full forensic timeline.
The Problem
When I started building this internal enterprise app on Azure, the constraints were clear: free tier only. Azure Cosmos DB gives you 1,000 RU/s free. The app had ~10 containers. The math was simple — shared throughput across the database, stay under 1,000 RU/s, pay nothing.
I documented this everywhere. The agent contract said "RU-frugal." The app rules said "any throughput or retention change must be documented." The SAP feature brief said "Free Tier Guardrails — non-negotiable." The AI feature design explicitly rejected a Cosmos-backed chat history because it "violates the zero Azure cost constraint."
Despite all of this, the AI agent provisioned expensive dedicated throughput — not once, but twice. Both times I had to manually intervene, audit the damage, and harden the codebase to prevent it from happening again.
This is the forensic timeline of what happened, reconstructed from git history.
The Architecture Context
The app is a React + Azure Functions stack backed by Cosmos DB NoSQL. All containers use partition key /pk. The intended cost model was:
```
Cosmos DB Free Tier
───────────────────
1 Database    → shared throughput (400-600 RU/s)
10 Containers → no dedicated throughput
───────────────────
Total: $0/month (within 1,000 RU/s free allowance)
```
Simple. Except the AI agent had a different idea.
Act 1: The Silent Provisioning (Feb 7, 2026)
What The Agent Did
I asked the AI agent to set up CI/CD scaffolding and infrastructure automation. Commit <sha-1> created scripts/setup-cosmos.sh — a script to provision Cosmos databases and containers. Sounds reasonable. Here's what it actually created:
```bash
THROUGHPUT=400

az cosmosdb sql container create \
  --partition-key-path "$PARTITION_KEY" \
  --throughput "$THROUGHPUT"    # ← 400 RU/s PER CONTAINER
```
That --throughput flag on the container create command is the problem. It provisions dedicated throughput per container, not shared throughput at the database level.
The script also created two databases: a production DB and a dev DB. Both got the same treatment.
The Math
```
What I asked for:          What the agent provisioned:
──────────────────         ──────────────────────────────
1 DB, shared 400 RU/s      2 DBs, dedicated per-container

Production:                Production:
  400 RU/s shared            10 containers × 400 RU/s = 4,000 RU/s
  $0 (free tier)             $0.008/hr × 10 = billable

Dev:                       Dev:
  Emulator (local)           10 containers × 400 RU/s = 4,000 RU/s
  $0                         $0.008/hr × 10 = billable

Total: ≤ 1,000 RU/s        Total: ~8,000 RU/s dedicated
Cost: $0/month             Cost: Azure billing surprise
```
The agent created 8x the intended throughput across two databases, all with dedicated provisioning that can't be scaled below 400 RU/s per container. The Cosmos free tier's 1,000 RU/s allowance was instantly overwhelmed.
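To make the blast radius concrete, here's a small sketch (a hypothetical helper, not code from the repo) that computes total provisioned RU/s under each model and checks it against the free-tier allowance:

```javascript
// Hypothetical helper: compare shared vs. dedicated provisioning
// against the Cosmos free-tier allowance (1,000 RU/s).
const FREE_TIER_RU = 1000;

function totalRu({ databases, containersPerDb, ruPerContainer, sharedRuPerDb, dedicated }) {
  // Dedicated: every container carries its own throughput.
  // Shared: one database-level allocation covers all containers.
  return dedicated
    ? databases * containersPerDb * ruPerContainer
    : databases * sharedRuPerDb;
}

const intended = totalRu({
  databases: 1, containersPerDb: 10, sharedRuPerDb: 400, dedicated: false,
});
const provisioned = totalRu({
  databases: 2, containersPerDb: 10, ruPerContainer: 400, dedicated: true,
});

console.log(intended, intended <= FREE_TIER_RU);       // 400 true
console.log(provisioned, provisioned <= FREE_TIER_RU); // 8000 false
```

Same container count, same per-unit number (400) — an 8,000 RU/s difference, purely from where the `--throughput` flag landed.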
Why It Happened
The agent treated database provisioning as a standard infrastructure task. It knew Cosmos needs throughput. It picked the per-container model (which is the more common pattern in documentation and tutorials) without considering that:
- Shared throughput exists and is the correct model for cost-sensitive workloads
- A dev database in the cloud is unnecessary when the Cosmos emulator exists
- 400 RU/s is a floor, not a ceiling — you can't go lower with dedicated provisioning
- The cost rules in the project docs explicitly prohibited this
Act 2: The First Cleanup (Feb 22, 2026)
I discovered the cost spike through Azure billing alerts and immediately performed a forensic audit. Commit <sha-2> documents the full cleanup in a cost plan document that reads like an incident post-mortem.
The Damage Assessment
From the cost plan doc I wrote at the time:
"Legacy dedicated-throughput DBs still exist and still bill baseline RU:
`<app-db>` → 10 containers × 400 RU/s dedicated. `<app-db>-dev` → 10 containers × 400 RU/s dedicated."
The Fix: V2 Databases with Shared Throughput
I created new databases with V2costsaver in the name (yes, I literally named them to remind future agents about cost) and rewrote the setup script:
```
Before (agent's version):            After (my fix):
─────────────────────────            ──────────────────────────
THROUGHPUT=400                       DB_THROUGHPUT="${DB_THROUGHPUT:-400}"

az cosmosdb sql container create \   az cosmosdb sql database create \
  --throughput "$THROUGHPUT"           --throughput "$DB_THROUGHPUT"
  (per container = expensive)          ← shared at DB level

                                     az cosmosdb sql container create \
                                       ← NO --throughput flag
                                       (inherits from database)
```
Then I ran the decommission:
- Created V2 databases with shared throughput
- Migrated all production data
- Added rollback support (`--rollbackToV1Cosmos` flag)
- Verified all 6 cutover gates passed
- Deleted both V1 databases
- Applied Azure budget alerts: $300/month cap with alerts at 50%, 80%, 100%
- Added Cosmos daily RU spike alert (> 2M RU in 24h)
The Emulator Decision
Six days later (Feb 28, commit <sha-3>), I made a harder decision: eliminate the cloud dev database entirely. The local Cosmos emulator would serve as the dev environment. This meant:
- Zero cloud cost for development
- Dev database routing consolidated into an emulator-first mode in `dbResolver.js`
- A new `mirror-to-emulator.mjs` script for refreshing local dev data
- The cloud dev DB (`<app-db>-dev-V2costsaver`) decommissioned
Final state: one production database at 600 RU/s shared throughput — well within the 1,000 RU/s free tier allowance. Cost: $0/month.
Act 3: The Regression (Mar 3, 2026)
Five days later, the AI agent struck again.
Commit <sha-4> — a large feature commit (30 files, 3,425 insertions) implementing fiscal-year structural changes — quietly re-introduced the cloud dev database code path that I had just removed.
What The Agent Changed
In api/lib/dbResolver.js, the agent rewrote the database mode resolver. My Feb 28 version had consolidated all non-production paths to route to the emulator. The agent's version re-expanded them:
```
My version (Feb 28):           Agent's version (Mar 3):
────────────────────           ────────────────────────
if (shouldUseEmulator())       if (hasArg(EMULATOR_FLAG))
  return 'emulator';             return 'emulator';
if (shouldUseSupportDevDb())   if (hasArg(SUPPORT_DEV_FLAG))
  return 'support_dev_db';       return 'support_dev_db';   ← RE-ADDED
                               if (isTruthy(COSMOS_USE_EMULATOR))
                                 return 'emulator';
                               if (isTruthy(COSMOS_USE_SUPPORT_DEV_DB))
                                 return 'support_dev_db';   ← RE-ADDED
```
The 'support_dev_db' return path was back. The DEFAULT_DB_NAMES object still had supportDev: '<app-db>-dev-V2costsaver'. Combined with the init-cosmos.js script's createIfNotExists calls, this meant any script invocation with the dev flag would recreate the cloud dev database.
Why It Happened Again
The agent was working on a large feature (fiscal-year scoping) that touched the database layer. It needed to understand how database names were resolved across environments. Rather than preserving my carefully consolidated emulator-first logic, it re-derived the resolution function from first principles — and landed on the same multi-path pattern I had specifically eliminated.
The agent didn't know why those paths had been removed. It saw the pattern as "incomplete" and "helpfully" restored it. The commit message says nothing about database mode changes — they were buried in a 3,400-line feature diff.
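For contrast, the consolidated emulator-first shape — a sketch in the spirit of the Feb 28 version, with approximated names, not the exact repo code — leaves nothing for an agent to "helpfully" restore:

```javascript
// Sketch of an emulator-first resolver (names approximated):
// production is the one explicit case; everything else is the emulator.
function resolveDbMode({ nodeEnv, cosmosEndpoint }) {
  const isProd = nodeEnv === 'production';
  const isLocalEndpoint = /localhost|127\.0\.0\.1/.test(cosmosEndpoint || '');

  if (isProd && !isLocalEndpoint) return 'production';
  // No 'support_dev_db' branch exists to resurrect: every
  // non-production invocation routes to the local emulator.
  return 'emulator';
}

console.log(resolveDbMode({
  nodeEnv: 'production',
  cosmosEndpoint: 'https://myapp.documents.azure.com',
})); // 'production'
console.log(resolveDbMode({
  nodeEnv: 'development',
  cosmosEndpoint: 'https://localhost:8081',
})); // 'emulator'
```

A resolver with two outcomes is harder to "re-expand" than one with four, because there is no dangling flag or constant hinting at a missing path.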
Act 4: The Permanent Fix (Mar 29, 2026)
I'd had enough. Commit <sha-5> — 59 files changed, 277 insertions, 273 deletions — was a comprehensive retirement of all cloud dev database targeting across the entire codebase.
The Hard Guards
This time I didn't just remove the code paths. I made them impossible to restore:
1. Setup script errors on dev:

```bash
# scripts/setup-cosmos.sh
dev|--useSupportDevDB)
  echo "Cloud dev Cosmos setup is retired."
  echo "Use the local Cosmos emulator for development."
  exit 1
  ;;
```
2. Runtime assertion in dbResolver.js:
```
assertNoCloudNonProdDatabaseTarget()
────────────────────────────────────
IF   target DB ≠ production DB
AND  endpoint host ≠ localhost / 127.0.0.1 / emulator
THEN → throw Error (hard crash)
```
3. Dev flags redirected to emulator: Any code passing --useSupportDevDB or setting COSMOS_USE_SUPPORT_DEV_DB=true now silently routes to the emulator instead of a cloud database.
4. Seed scripts refuse cloud non-prod targets: If the connection string points to Azure (not localhost), the seed scripts refuse to operate on non-production databases.
5. Default throughput documented at 600 RU/s: The setup script now defaults to 600 RU/s shared — within free tier — with the value explicitly visible in the script header.
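Guards 2 and 3 can be sketched together in a few lines of JavaScript. This is an assumed shape with placeholder names, not the repo's actual `dbResolver.js`:

```javascript
// Sketch of the hard guard (placeholder names): crash if a
// non-production database would be targeted on a real cloud endpoint.
const PROD_DB = 'app-db-V2costsaver'; // placeholder production DB name
const EMULATOR_HOSTS = ['localhost', '127.0.0.1'];

function assertNoCloudNonProdDatabaseTarget(dbName, endpoint) {
  const host = new URL(endpoint).hostname;
  const isEmulator = EMULATOR_HOSTS.includes(host);
  if (dbName !== PROD_DB && !isEmulator) {
    throw new Error(
      `Refusing to target non-production DB "${dbName}" on cloud host ${host}. ` +
      'Use the local Cosmos emulator for development.'
    );
  }
}

// Guard 3: legacy dev flags silently resolve to the emulator
// instead of any cloud database.
function resolveTarget({ useSupportDevDb, endpoint }) {
  if (useSupportDevDb) {
    return { dbName: 'app-db-dev', endpoint: 'https://localhost:8081' };
  }
  return { dbName: PROD_DB, endpoint };
}
```

The key property: the assertion runs at startup, so a regressed code path fails loudly on the first invocation instead of quietly re-provisioning a billable database.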
The 59-File Sweep
The retirement touched every layer:
| Layer | Files Changed | What Changed |
|---|---|---|
| Database resolver | 1 | Hard assertion + emulator redirect |
| Setup/provisioning | 1 | Dev path → error exit |
| API scripts (20+) | 23 | All routed through new guards |
| ETL scripts (JS+Python) | 4 | cosmosDbNames updated |
| CI workflow | 1 | Dev DB references removed |
| Documentation | 8 | Updated to emulator-first model |
| Dev tooling | 2 | Local settings + dev script |
The Numbers
| Metric | Wave 1 (Feb 7) | After Fix 1 (Feb 22-28) | Wave 2 (Mar 3) | After Fix 2 (Mar 29) |
|---|---|---|---|---|
| Cloud databases | 2 (prod + dev) | 1 (prod only) | 1 + code path for 2nd | 1 (prod only, hardened) |
| Throughput model | Dedicated per-container | Shared per-database | Shared (but dev path live) | Shared, dev path blocked |
| Provisioned RU/s | ~8,000 | 600 | 600 (risk of +400) | 600 |
| Free tier compliant | No | Yes | Fragile | Yes (enforced) |
| Guard rails | Docs only | Docs + script rewrite | Regressed | Runtime assertion + error exits |
| Files with dev DB refs | Growing | Consolidating | Re-expanded | 0 (retired across 59 files) |
What I Learned
1. Documentation Is Necessary But Not Sufficient
I had cost rules in agent-contract.md, app-rules.md, feature design docs, and the SAP brief. The rules said "RU-frugal," "zero Azure cost constraint," "Free Tier Guardrails — non-negotiable." The agent read them. The agent still provisioned dedicated throughput. Rules written in prose are suggestions. Rules written in code are enforcement.
2. AI Agents Optimize Locally, Not Globally
When the agent created the setup script, it was solving a local problem: "provision Cosmos containers." It picked the pattern most common in Azure documentation (dedicated throughput per container) without reasoning about the global cost constraint. When it re-introduced the dev DB path in Wave 2, it was solving another local problem: "make the database resolver more explicit." Both times, the agent's local optimization violated a global invariant.
3. Large Commits Hide Regressions
The Wave 2 regression was buried in a 3,425-line feature commit. The commit message mentioned fiscal-year changes, not database mode changes. If I'd reviewed only the commit message and stat, I'd have missed the dbResolver.js rewrite entirely. AI agents that make large commits need automated invariant checks, not just human code review.
4. "Remove" Is Not "Prevent"
My Feb 28 fix removed the cloud dev DB code path. My Mar 29 fix prevented it from being restored. The difference: runtime assertions that crash the process, script entry points that error on dev arguments, and seed scripts that refuse non-production targets on cloud endpoints. If you remove something from an AI-maintained codebase, you must also add a guard that prevents its resurrection.
5. Name Your Databases After Your Constraints
I named the V2 database *-V2costsaver. It's ugly. It's also the only thing in the codebase that survived every agent refactor without being renamed. Sometimes the best documentation is a name that makes the constraint impossible to ignore.
Try This Yourself
- Audit your IaC scripts for throughput flags. Search for `--throughput` in any Cosmos provisioning script. If it's on a container create (not a database create), you're paying per-container minimums.
- Add runtime guards, not just documentation. If your app must never target a cloud dev database, add an assertion that crashes on startup if it detects a non-production database on a cloud endpoint.
- Review large AI commits file-by-file. Don't trust commit messages for scope. A "fiscal-year feature" commit can silently regress your cost model.
- Set Azure budget alerts immediately. I should have done this on day one. A $300/month cap with 50%/80%/100% alerts would have caught Wave 1 within days instead of weeks.
- Use the emulator for dev. The Cosmos emulator is free, runs locally, and eliminates an entire category of cloud cost risk. If you're paying for a cloud dev database, ask yourself why.
The agent contract, the app rules, the feature design docs — none of them stopped this. What stopped it was a `throw new Error()` in the database resolver. Trust but verify. Then add a guard.
Mo Khan is just an old-timer engineer-turned-manager who forgot how fun it is to build things — and who learned the hard way that AI agents read your cost rules but don't always follow them.
