Showing posts with label Creativity. Show all posts

Friday, 17 April 2026

How Claude Code helped me migrate ETL workflows from serverless functions in one evening

April 15, 2026. South Africa.

Two days. That's how long my ERP ETL workflow had been broken in production. The Function App deploy pipeline — the one that was supposed to be "serverless and simple" — had drifted so far from reality that a hotfix deployed outside CI/CD was the only thing keeping the lights on.

I was staring at a system held together with surgical tape and a prayer. The app had supposedly been stable for months, yet it was brittle in odd places. How does a simple, deterministic ETL process become this unreliable?

The sync Function App was serving a stale runtime that didn't match what was in git. The CI/CD workflow couldn't deploy without breaking things further. The Kudu zipdeploy mechanism — Azure's "just push a zip and we'll figure it out" deployment model — had proven itself fundamentally untrustworthy for my production workloads. And Event Grid, the invisible message bus connecting my extract and sync stages, added complexity I couldn't observe or debug without spelunking through Azure portal logs.

I had a choice: keep patching, or rethink the architecture.

I chose to rethink it. In one session. With an AI coding agent. On a Tuesday night. In under five hours.


Act 1: The Architecture That Looked Good on Paper

When I first built the ETL pipeline, the architecture felt elegant:

Upload -> Blob Storage -> Event Grid -> Python Extract (Function App)
                                            |
                                     Blob Storage (artifacts)
                                            |
                                    Event Grid -> Node Sync (Function App)
                                            |
                                       Cosmos DB

Three Azure Function Apps. Two Event Grid subscriptions. Two separate Kudu zipdeploy pipelines. Two different runtimes (Python for extraction, Node.js for sync). Each with its own host.json, its own publish profile secrets, its own health check probes.

It worked. For a while.

Then the drift started. A deploy would succeed in CI but the runtime would serve stale code. Kudu would report success but the Function App would 404. A "quick fix" deployed through the portal would work, but the next CI deploy would overwrite it with the wrong version. The Python Function App and the Node.js Function App had different deploy mechanisms, different failure modes, and different ways of being quietly broken.

The serverless promise — "just write functions, we handle the rest" — had become: "just debug Azure's deployment infrastructure while your financial planning app serves stale data to executives."

Act 2: The Question That Changed Everything

I asked my AI agent a question:

"Since both services will eventually run from the App Service itself, why do we need Event Grid?"

That's when it clicked. My App Service was already running Express.js, serving the React frontend, proxying API calls. It was a Linux host with Node.js. It was deployed via a single zip push. It just worked.

Why wasn't the ETL pipeline running there too?

The original reason was separation of concerns — extractors in Python, sync in Node.js, each scaling independently. But the reality was: my ETL workload processed one file at a time, took 20 seconds end-to-end, and ran maybe 5 times a day. It didn't need independent scaling. It needed reliable deployment.

Act 3: The Plan (And Why We Threw Half of It Away)

The initial plan was careful and staged:

Stage 1: Move the sync service to the App Service, keep extract on the Function App, use Event Grid to connect them.

Stage 2: Move extract to the App Service, eliminate Event Grid.

We started with Stage 1. Moved all 12 sync/verify scripts. Created an Express router. Wired up audit logging with distinct source tags (etl-sync-appservice vs the Function App's etl-sync) so we could see exactly which code path processed each job.

The clever part: we created re-export stubs that proxied imports to the shared API libraries. This meant the 12 sync scripts could be copied byte-for-byte — zero modifications. The stubs resolved their ../lib/ imports to the shared libraries. One-line files doing the heavy lifting:

// etl/lib/containerConfig.js
export * from '../../api/lib/containerConfig.js';

Then we hit the Event Grid wall.

Act 4: The Event Grid Wall

To register an Event Grid webhook subscription, Azure sends a validation handshake to your endpoint. Your endpoint must respond with a validation code. Simple, right?

Except our App Service was behind Azure AD authentication. Anonymous requests got a 302 redirect to the Microsoft login page. Event Grid's validation request isn't a browser — it doesn't follow redirects or authenticate with Azure AD.

We tried excludedPaths in the auth config. We tried switching to Return401 mode. The excluded paths worked — but Return401 mode broke the entire login flow for the SPA. Users couldn't log in. On production. At night.

We reverted in 30 seconds. Crisis contained. But Event Grid was blocked.

We tried the AAD-authenticated delivery approach. Event Grid can send a bearer token if you configure it with a tenant ID and app registration. But creating the required AzureEventGridSecureWebhookSubscriber app role needed Entra admin permissions. It was 9 PM. The IT admin was asleep.

That's when my own question came back: why do we need Event Grid at all?

Act 5: The Browser Console Trick

We had the sync running on the App Service. We had the Event Grid subscription blocked. But we needed to prove it worked.

The user is already authenticated in the browser. The browser has the App Service auth cookie. What if we just... called the endpoint from the browser console?

fetch("/etl/webhook/sync", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify([{
    eventType: "Microsoft.Storage.BlobCreated",
    subject: "/blobServices/default/containers/etl-artifacts/blobs/...",
    data: { api: "PutBlob" }
  }])
}).then(r => r.json()).then(d => console.log(d));

Response:

{
  "accepted": 1,
  "processedBy": "etl-sync-appservice",
  "timestamp": "2026-04-15T18:28:22.051Z"
}

The latest/* blobs in Azure Storage updated 19 seconds later. Audit logs showed ETL Sync AppService as the actor. Verification passed. Cache manifest bumped.

It worked. The new code path was live.

Act 6: Scrap the Plan, Go All In

The staged plan said "move extract in Stage 2, later." But we'd just proven the architecture worked. The Event Grid approach was blocked anyway. And I realized: if both extract and sync run on the App Service, the upload button can trigger the entire pipeline directly. No Event Grid. No Function Apps. No intermediate blob-event dance.

Upload -> POST /etl/pipeline/run
           -> Python extract (subprocess)
           -> Node sync (in-process)
           -> Cosmos DB updated
           -> Done. 21 seconds.

The Python challenge was real though. The App Service runs Node.js. The extractors use openpyxl to parse Excel files. Our first attempt — pip install openpyxl && npm start as the startup command — killed the app. The Node.js Linux image doesn't reliably support pip in the startup command.

The fix: vendor openpyxl into the deploy package during CI build. The GitHub Actions runner (Ubuntu) has pip. Install there, ship the result:

python3 -m pip install \
  --target .deploy/webapp/etl/extract/.pylibs \
  --quiet \
  "openpyxl>=3.1.0"

The Node.js extract handler adds .pylibs to PYTHONPATH when spawning the subprocess. The Python extractors run unchanged — they don't know they're on an App Service instead of a Function App.

Act 7: The Cutover

Here's the full cutover sequence:

# Stop Function Apps (restartable for rollback)
az functionapp stop --name my-etl-sync --resource-group my-rg
az functionapp stop --name my-etl-extract --resource-group my-rg

# Deploy lands via normal CI/CD -- same workflow that deploys frontend
git push origin master

# Test: upload ERP export, pipeline runs automatically
# Audit logs: "ETL Extract AppService" -> "ETL Sync AppService"
# All SUCCESS. 21 seconds end-to-end.

Rollback:

az functionapp start --name my-etl-sync --resource-group my-rg
az functionapp start --name my-etl-extract --resource-group my-rg

Two commands. No code changes needed. The Function App code was never modified.

Act 8: What We Actually Built

The final architecture:

Admin Console
  -> POST /api/etl/upload (file to blob storage)
  -> POST /etl/pipeline/run {buId, connector, sourceBlobName}
  -> Express router
      -> Download source file from blob
      -> Spawn python3 extractor subprocess
      -> Upload artifacts to blob
      -> Run Node.js sync script
      -> Run Node.js verify script
      -> Promote artifacts to latest/
      -> Update Cosmos DB
      -> Bump cache manifest
      -> Audit log with user identity + source tags
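The steps above boil down to an in-process chain of async stages. A minimal sketch — all stage and field names here are illustrative, not the app's real code:

```javascript
// Run each stage to completion before starting the next; any throw aborts
// the pipeline and surfaces as a failed job.
async function runPipeline(job, stages) {
  const results = [];
  for (const { name, run } of stages) {
    await run(job);
    results.push({ stage: name, status: "SUCCESS" });
  }
  return results;
}

// Example: a job flowing extract -> sync -> verify.
const stages = [
  { name: "extract", run: async (job) => { job.artifacts = ["artifact-1"]; } },
  { name: "sync",    run: async (job) => { job.synced = true; } },
  { name: "verify",  run: async (job) => { if (!job.synced) throw new Error("not synced"); } },
];
```

The point of the sketch: with everything in one process, failure handling is a try/catch, not a distributed-tracing exercise across Event Grid subscriptions.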

No Event Grid subscriptions. No Function App deploys. No Kudu zipdeploy. No publish profile secrets. No runtime drift.

One Express server. One deploy workflow. One zip. One host.

Files added:

File                            Purpose
etl/etlRouter.js                Express router — pipeline, webhook, status, provision, healthz
etl/etlExtractHandler.js        Node.js wrapper spawning Python extractors
etl/sync/*.js                   12 sync/verify scripts (copied unchanged)
etl/extract/extractors/*.py     6 Python extractors (copied unchanged)
etl/lib/*.js                    Re-export stubs (1 line each)
api/lib/etlBlobHelpers.js       Blob operations for ETL
api/lib/etlOperationalLog.js    Operational log adapter

Files untouched: All 6 Python extractors. All 12 sync/verify scripts. The Function App code. The frontend upload handler. Database schemas. Blob storage structure.

Act 9: The Lessons

1. Serverless isn't free

The "no servers to manage" promise is real until deployment breaks. Then you're managing Azure's deployment infrastructure, which is harder to debug than your own server because you can't SSH in, can't see the filesystem, and can't reproduce the environment locally.

2. Event Grid adds invisible coupling

Every Event Grid subscription is an invisible dependency. When it works, it's magic. When it breaks, you're reading Azure portal logs trying to understand why a blob write didn't trigger a function invocation. Moving to direct HTTP calls made the pipeline debuggable, observable, and fast.

3. The best abstraction is sometimes no abstraction

Three Azure Function Apps, two Event Grid subscriptions, and a Kudu zipdeploy pipeline — or one Express route handler. The "simpler" architecture has more moving parts. The "complex" monolith is actually simpler to operate.

4. Source tagging is non-negotiable in migrations

Tagging every audit event and operational log with source: 'etl-sync-appservice' vs source: 'etl-sync' meant we could prove, in production, which code path processed each job. When leadership asks "are we on the new system?" you can point to the audit log and say: yes, see the tag.
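Stamping the tag at the audit-event boundary is cheap. A sketch, with field names invented for illustration rather than taken from the app's schema:

```javascript
// Every event produced by the new code path carries its source tag,
// so the audit log itself proves which system processed each job.
const SOURCE_TAG = "etl-sync-appservice";

function auditEvent(action, details = {}) {
  return {
    ...details,
    action,
    source: SOURCE_TAG,
    at: new Date().toISOString(),
  };
}
```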

5. Copy, don't rewrite

The 12 sync scripts and 6 Python extractors were copied byte-for-byte. Zero modifications. The re-export stubs handled the import path difference. This meant: if anything breaks, it's the new wrapper code, not the battle-tested extraction and sync logic.

6. AI agents are force multipliers, not autopilots

The AI agent wrote the ETL router, the extract handler, the blob helpers, the operational log adapter, the deploy workflow changes, and the Azure CLI provisioning commands. It also broke the production login flow by changing the auth config, tried to pip install in a Node.js startup command (which killed the app), and ran up my GitHub Actions credits.

The human judgment calls that mattered: choosing to disable-and-replace (not dual-write), deciding to skip Event Grid entirely, and saying "let's just test from the browser console." The agent executed brilliantly once pointed in the right direction. But pointing it in the right direction was the hard part.


The Numbers

Metric                         Before                                          After
Azure resources for ETL        2 Function Apps + 3 Event Grid subscriptions    0 (runs on existing App Service)
Deploy mechanisms              3 (webapp + 2 Kudu zipdeploys)                  1 (webapp only)
CI/CD jobs                     5 (often 2 failing)                             3 (all green)
Pipeline latency               45-90s (Event Grid hops + cold starts)          21s (in-process)
Runtime drift risk             High (Function App VFS cache)                   Zero (deploy is atomic zip)
Rollback time                  Unknown (redeploy + pray)                       60 seconds (restart Function App)
Deploy time for ETL changes    7+ min (separate Kudu job, often fails)         0 (included in webapp deploy)

The Takeaway

I spent two days debugging a broken deploy pipeline, burning through both Claude and Codex tokens. Then I pivoted, and spent one evening replacing it entirely.

The old architecture was designed for a future that never came — independent scaling, multi-region ETL, connector-level isolation. The new architecture is designed for the reality I have — a single-region financial planning app that processes a handful of ERP exports a day and needs to be reliable.

Sometimes the bravest engineering decision is to make things simpler.


Both Function Apps remain stopped but restartable. The Event Grid subscriptions exist but deliver to stopped endpoints. The full ETL pipeline runs on a single App Service, deployed by the same workflow that deploys the React frontend.

Total session time: ~5 hours. Total production downtime: ~90 seconds (the auth config incident). Cups of coffee: lost count.

Thursday, 2 April 2026

AI School Fees: The $0 Database That Wasn't: How AI Agents Silently Burned Through My Azure Budget Twice

I told the agent "zero cost." It provisioned 8,000 RU/s of dedicated throughput. I fixed it. It did it again. Here's the full forensic timeline.




The Problem

When I started building this internal enterprise app on Azure, the constraints were clear: free tier only. Azure Cosmos DB gives you 1,000 RU/s free. The app had ~10 containers. The math was simple — shared throughput across the database, stay under 1,000 RU/s, pay nothing.

I documented this everywhere. The agent contract said "RU-frugal." The app rules said "any throughput or retention change must be documented." The SAP feature brief said "Free Tier Guardrails — non-negotiable." The AI feature design explicitly rejected a Cosmos-backed chat history because it "violates the zero Azure cost constraint."

Despite all of this, the AI agent provisioned expensive dedicated throughput — not once, but twice. Both times I had to manually intervene, audit the damage, and harden the codebase to prevent it from happening again.

This is the forensic timeline of what happened, reconstructed from git history.


The Architecture Context

The app is a React + Azure Functions stack backed by Cosmos DB NoSQL. All containers use partition key /pk. The intended cost model was:

  Cosmos DB Free Tier
  ───────────────────
  1 Database  →  shared throughput (400-600 RU/s)
  10 Containers  →  no dedicated throughput
  ───────────────────
  Total: $0/month (within 1,000 RU/s free allowance)

Simple. Except the AI agent had a different idea.


Act 1: The Silent Provisioning (Feb 7, 2026)

What The Agent Did

I asked the AI agent to set up CI/CD scaffolding and infrastructure automation. Commit <sha-1> created scripts/setup-cosmos.sh — a script to provision Cosmos databases and containers. Sounds reasonable. Here's what it actually created:

THROUGHPUT=400

az cosmosdb sql container create \
    --partition-key-path "$PARTITION_KEY" \
    --throughput "$THROUGHPUT"     ← 400 RU/s PER CONTAINER

That --throughput flag on the container create command is the problem. It provisions dedicated throughput per container, not shared throughput at the database level.

The script also created two databases: a production DB and a dev DB. Both got the same treatment.

The Math

  What I asked for:          What the agent provisioned:
  ──────────────────         ──────────────────────────────
  1 DB, shared 400 RU/s     2 DBs, dedicated per-container

  Production:                Production:
    400 RU/s shared            10 containers × 400 RU/s = 4,000 RU/s
    $0 (free tier)             $0.008/hr × 10 = billable

  Dev:                       Dev:
    Emulator (local)           10 containers × 400 RU/s = 4,000 RU/s
    $0                         $0.008/hr × 10 = billable

  Total: ≤ 1,000 RU/s       Total: ~8,000 RU/s dedicated
  Cost: $0/month             Cost: Azure billing surprise

The agent created 8x the intended throughput across two databases, all with dedicated provisioning that can't be scaled below 400 RU/s per container. The Cosmos free tier's 1,000 RU/s allowance was instantly overwhelmed.
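A back-of-envelope check makes the blowup concrete. This treats the per-container hourly rate quoted above as illustrative rather than exact Azure pricing:

```javascript
// Rough monthly cost of dedicated per-container throughput, using the
// post's quoted ~$0.008/hr per 400 RU/s container as an illustrative rate.
function monthlyDedicatedCost(containers, ratePerContainerHour = 0.008) {
  const HOURS_PER_MONTH = 730; // ~365.25 days * 24 hours / 12 months
  return containers * ratePerContainerHour * HOURS_PER_MONTH;
}

// Two databases x 10 containers = 20 billable containers,
// versus the intended $0/month on shared free-tier throughput.
monthlyDedicatedCost(20);
```

Even at a small hourly rate, twenty always-on billable containers add up to a triple-digit monthly surprise.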

Why It Happened

The agent treated database provisioning as a standard infrastructure task. It knew Cosmos needs throughput. It picked the per-container model (which is the more common pattern in documentation and tutorials) without considering that:

  1. Shared throughput exists and is the correct model for cost-sensitive workloads
  2. A dev database in the cloud is unnecessary when the Cosmos emulator exists
  3. 400 RU/s is a floor, not a ceiling — you can't go lower with dedicated provisioning
  4. The cost rules in the project docs explicitly prohibited this

Act 2: The First Cleanup (Feb 22, 2026)

I discovered the cost spike through Azure billing alerts and immediately performed a forensic audit. Commit <sha-2> documents the full cleanup in a cost plan document that reads like an incident post-mortem.

The Damage Assessment

From the cost plan doc I wrote at the time:

"Legacy dedicated-throughput DBs still exist and still bill baseline RU: <app-db> → 10 containers × 400 RU/s dedicated. <app-db>-dev → 10 containers × 400 RU/s dedicated."

The Fix: V2 Databases with Shared Throughput

I created new databases with V2costsaver in the name (yes, I literally named them to remind future agents about cost) and rewrote the setup script:

  Before (agent's version):              After (my fix):
  ─────────────────────────              ──────────────────────────
  THROUGHPUT=400                         DB_THROUGHPUT="${DB_THROUGHPUT:-400}"

  az cosmosdb sql container create \     az cosmosdb sql database create \
    --throughput "$THROUGHPUT"              --throughput "$DB_THROUGHPUT"
                                           ← shared at DB level
  (per container = expensive)
                                         az cosmosdb sql container create \
                                           ← NO --throughput flag
                                           (inherits from database)

Then I ran the decommission:

  1. Created V2 databases with shared throughput
  2. Migrated all production data
  3. Added rollback support (--rollbackToV1Cosmos flag)
  4. Verified all 6 cutover gates passed
  5. Deleted both V1 databases
  6. Applied Azure budget alerts: $300/month cap with alerts at 50%, 80%, 100%
  7. Added Cosmos daily RU spike alert (> 2M RU in 24h)

The Emulator Decision

Six days later (Feb 28, commit <sha-3>), I made a harder decision: eliminate the cloud dev database entirely. The local Cosmos emulator would serve as the dev environment. This meant:

  • Zero cloud cost for development
  • Dev database routing consolidated into an emulator-first mode in dbResolver.js
  • A new mirror-to-emulator.mjs script for refreshing local dev data
  • The cloud dev DB (<app-db>-dev-V2costsaver) was decommissioned

Final state: one production database at 600 RU/s shared throughput — well within the 1,000 RU/s free tier allowance. Cost: $0/month.


Act 3: The Regression (Mar 3, 2026)

Five days later, the AI agent struck again.

Commit <sha-4> — a large feature commit (30 files, 3,425 insertions) implementing fiscal-year structural changes — quietly re-introduced the cloud dev database code path that I had just removed.

What The Agent Changed

In api/lib/dbResolver.js, the agent rewrote the database mode resolver. My Feb 28 version had consolidated all non-production paths to route to the emulator. The agent's version re-expanded them:

  My version (Feb 28):                   Agent's version (Mar 3):
  ────────────────────                   ────────────────────────
  if (shouldUseEmulator())               if (hasArg(EMULATOR_FLAG))
    return 'emulator';                     return 'emulator';
  if (shouldUseSupportDevDb())           if (hasArg(SUPPORT_DEV_FLAG))
    return 'support_dev_db';               return 'support_dev_db';   ← RE-ADDED
                                         if (isTruthy(COSMOS_USE_EMULATOR))
                                           return 'emulator';
                                         if (isTruthy(COSMOS_USE_SUPPORT_DEV_DB))
                                           return 'support_dev_db';   ← RE-ADDED

The 'support_dev_db' return path was back. The DEFAULT_DB_NAMES object still had supportDev: '<app-db>-dev-V2costsaver'. Combined with the init-cosmos.js script's createIfNotExists calls, this meant any script invocation with the dev flag would recreate the cloud dev database.

Why It Happened Again

The agent was working on a large feature (fiscal-year scoping) that touched the database layer. It needed to understand how database names were resolved across environments. Rather than preserving my carefully consolidated emulator-first logic, it re-derived the resolution function from first principles — and landed on the same multi-path pattern I had specifically eliminated.

The agent didn't know why those paths had been removed. It saw the pattern as "incomplete" and "helpfully" restored it. The commit message says nothing about database mode changes — they were buried in a 3,400-line feature diff.


Act 4: The Permanent Fix (Mar 29, 2026)

I'd had enough. Commit <sha-5> — 59 files changed, 277 insertions, 273 deletions — was a comprehensive retirement of all cloud dev database targeting across the entire codebase.

The Hard Guards

This time I didn't just remove the code paths. I made them impossible to restore:

1. Setup script errors on dev:

  # scripts/setup-cosmos.sh
  dev|--useSupportDevDB)
      echo "Cloud dev Cosmos setup is retired."
      echo "Use the local Cosmos emulator for development."
      exit 1

2. Runtime assertion in dbResolver.js:

  assertNoCloudNonProdDatabaseTarget()
  ────────────────────────────────────
  IF target DB ≠ production DB
  AND endpoint host ≠ localhost / 127.0.0.1 / emulator
  THEN → throw Error (hard crash)
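The assertion above can be sketched in a few lines. The function name mirrors the post; the endpoint parsing and host allowlist are simplified for illustration:

```javascript
// Hosts treated as the local Cosmos emulator.
const LOCAL_HOSTS = new Set(["localhost", "127.0.0.1"]);

// Hard crash if any non-production database is targeted on a cloud endpoint.
function assertNoCloudNonProdDatabaseTarget({ dbName, prodDbName, endpoint }) {
  const host = new URL(endpoint).hostname;
  if (dbName !== prodDbName && !LOCAL_HOSTS.has(host)) {
    throw new Error(
      `Refusing non-production database "${dbName}" on cloud endpoint ${host}. ` +
      "Use the local Cosmos emulator for development."
    );
  }
}
```

Called at startup, this turns a silent cost regression into an immediate, loud crash.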

3. Dev flags redirected to emulator: Any code passing --useSupportDevDB or setting COSMOS_USE_SUPPORT_DEV_DB=true now silently routes to the emulator instead of a cloud database.

4. Seed scripts refuse cloud non-prod targets: If the connection string points to Azure (not localhost), the seed scripts refuse to operate on non-production databases.

5. Default throughput documented at 600 RU/s: The setup script now defaults to 600 RU/s shared — within free tier — with the value explicitly visible in the script header.

The 59-File Sweep

The retirement touched every layer:

  Layer                     Files Changed    What Changed
  ─────────────────────     ─────────────    ─────────────────────────────────
  Database resolver         1                Hard assertion + emulator redirect
  Setup/provisioning        1                Dev path → error exit
  API scripts (20+)         23               All routed through new guards
  ETL scripts (JS+Python)   4                cosmosDbNames updated
  CI workflow               1                Dev DB references removed
  Documentation             8                Updated to emulator-first model
  Dev tooling               2                Local settings + dev script

The Numbers

Metric                   Wave 1 (Feb 7)            After Fix 1 (Feb 22-28)   Wave 2 (Mar 3)               After Fix 2 (Mar 29)
Cloud databases          2 (prod + dev)            1 (prod only)             1 + code path for 2nd        1 (prod only, hardened)
Throughput model         Dedicated per-container   Shared per-database       Shared (but dev path live)   Shared, dev path blocked
Provisioned RU/s         ~8,000                    600                       600 (risk of +400)           600
Free tier compliant      No                        Yes                       Fragile                      Yes (enforced)
Guard rails              Docs only                 Docs + script rewrite     Regressed                    Runtime assertion + error exits
Files with dev DB refs   Growing                   Consolidating             Re-expanded                  0 (retired across 59 files)

What I Learned

1. Documentation Is Necessary But Not Sufficient

I had cost rules in agent-contract.md, app-rules.md, feature design docs, and the SAP brief. The rules said "RU-frugal," "zero Azure cost constraint," "Free Tier Guardrails — non-negotiable." The agent read them. The agent still provisioned dedicated throughput. Rules written in prose are suggestions. Rules written in code are enforcement.

2. AI Agents Optimize Locally, Not Globally

When the agent created the setup script, it was solving a local problem: "provision Cosmos containers." It picked the pattern most common in Azure documentation (dedicated throughput per container) without reasoning about the global cost constraint. When it re-introduced the dev DB path in Wave 2, it was solving another local problem: "make the database resolver more explicit." Both times, the agent's local optimization violated a global invariant.

3. Large Commits Hide Regressions

The Wave 2 regression was buried in a 3,425-line feature commit. The commit message mentioned fiscal-year changes, not database mode changes. If I'd reviewed only the commit message and stat, I'd have missed the dbResolver.js rewrite entirely. AI agents that make large commits need automated invariant checks, not just human code review.

4. "Remove" Is Not "Prevent"

My Feb 28 fix removed the cloud dev DB code path. My Mar 29 fix prevented it from being restored. The difference: runtime assertions that crash the process, script entry points that error on dev arguments, and seed scripts that refuse non-production targets on cloud endpoints. If you remove something from an AI-maintained codebase, you must also add a guard that prevents its resurrection.

5. Name Your Databases After Your Constraints

I named the V2 database *-V2costsaver. It's ugly. It's also the only thing in the codebase that survived every agent refactor without being renamed. Sometimes the best documentation is a name that makes the constraint impossible to ignore.


Try This Yourself

  1. Audit your IaC scripts for throughput flags. Search for --throughput in any Cosmos provisioning script. If it's on a container create (not a database create), you're paying per-container minimums.
  2. Add runtime guards, not just documentation. If your app must never target a cloud dev database, add an assertion that crashes on startup if it detects a non-production database on a cloud endpoint.
  3. Review large AI commits file-by-file. Don't trust commit messages for scope. A "fiscal-year feature" commit can silently regress your cost model.
  4. Set Azure budget alerts immediately. I should have done this on day one. A $300/month cap with 50%/80%/100% alerts would have caught Wave 1 within days instead of weeks.
  5. Use the emulator for dev. The Cosmos emulator is free, runs locally, and eliminates an entire category of cloud cost risk. If you're paying for a cloud dev database, ask yourself why.

The agent contract, the app rules, the feature design docs — none of them stopped this. What stopped it was a throw new Error() in the database resolver. Trust but verify. Then add a guard.


Mo Khan is just an old-timer engineer-turned-manager who forgot how fun it is to build things — and who learned the hard way that AI agents read your cost rules but don't always follow them.

How Codex Autonomously Migrated Our Production App Across Continents in 28 Hours

One runbook. One AI agent. Zero portal clicks. A full SWA-to-App-Service migration from the US to South Africa.



The Problem: Your Frontend Is on the Wrong Continent

Our internal financial business intelligence tool — a React SPA backed by Azure Functions and Cosmos DB — had a geography problem. When I rapidly built the MVP, I leaned on free cloud services: they were enough to prove the concept, and since the tool would only be used internally by a small group of users, I figured I could get away with free Azure tiers indefinitely. As the MVP evolved into a real release, it became clear I had to do something about latency, cross-region calls, data sovereignty, and the inherent limitations of free cloud services. A migration was unavoidable.

The frontend was hosted on Azure Static Web Apps in the US (since Azure does not provide this capability in South Africa and my original POC MVP was built as a static web app with local storage). The database and all backend services lived in South Africa North. Every API call crossed the Atlantic and back.

  • Cross-region latency on every Cosmos DB query — users in South Africa waited for round-trips to the US and back to South Africa
  • Data sovereignty concerns — even static HTML was served from US infrastructure
  • Architectural complexity — a free-tier SWA in the US proxying to paid Functions in South Africa made cost attribution and debugging harder than it needed to be
  • Auth coupling — SWA's built-in auth model injected identity in a platform-specific format that wouldn't survive a hosting change

The decision was made: move everything to South Africa. Same region as the data. Same region as the users.

But this wasn't just a redeploy. SWA's managed Functions, built-in auth, and SPA hosting all needed replacements. The target was a Linux App Service running Express, a standalone Azure Functions app, EasyAuth with a dedicated Entra app registration, and a completely new CI/CD pipeline — all while keeping the existing SWA running as a live fallback, and with frugality as the driving constraint, aiming for the lowest-cost options throughout.

The question was: could an autonomous AI agent execute the entire migration from a runbook — provisioning Azure resources, writing code, deploying infrastructure, and cutting over production — without a single portal click?


The Cast

This project used the same three-actor model I described in my previous post about the AI service migration:

Me — architect and orchestrator. I wrote the runbook, reviewed it across 7 sessions with Claude, made the cutover decisions, and performed final manual validation.

Claude (Opus) — planning partner. Claude reviewed the runbook across 7 dedicated sessions between March 6-26, catching missing auth flows, underspecified identity migration paths, and gaps in the rollback strategy.

Codex — autonomous executor. Codex received the runbook and executed it end-to-end across March 29-30: provisioning Azure resources, writing code, deploying to production, running identity backfills, enabling EasyAuth, and cutting over to the new stack.

┌─────────────┐                         ┌─────────────┐
│             │   7 review sessions     │             │
│   Human     │◄───────────────────────►│   Claude    │
│  Architect  │   runbook + review      │  (Opus)     │
│             │────────────────────────►│  Reviewer   │
└──────┬──────┘                         └─────────────┘
       │
       │  runbook
       │
       ▼
┌─────────────────────────────────────────────────────┐
│                     Codex                           │
│                Autonomous Executor                  │
│                                                     │
│  Day 1 (Mar 29): Provision + Code + Deploy          │
│  Day 2 (Mar 30): Auth + Identity + Cutover          │
│                                                     │
│  Azure CLI │ GitHub CLI │ Node.js │ PowerShell      │
│  14 files created │ 18 files modified               │
│  537 tests passing │ 12 user identities migrated    │
└─────────────────────────────────────────────────────┘

The Architecture: Before and After

Before: Cross-Region SWA

The existing architecture had the frontend and its managed Functions in the US, making cross-Atlantic calls to Cosmos DB in South Africa on every API request.

┌─────────────────────────────────────────────────────────────────┐
│                        BROWSER (South Africa)                   │
│   React SPA ──── fetch('/api/*') ────►                          │
└─────────────────────┬───────────────────────────────────────────┘
                      │
    🔻 Atlantic crossing (~180ms RTT)
                      │
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│         Azure Static Web Apps  (US Region)                      │
│                                                                 │
│  ┌─────────────────┐  ┌────────────────────────────┐            │
│  │  SWA Built-in   │  │  SWA-Managed Functions     │            │
│  │  Auth (EasyAuth)│  │  (co-located in US)        │            │
│  │  SWA headers    │  │                            │            │
│  │  Platform-      │  │  /api/me                   │            │
│  │  specific format│  │  /api/data                 │            │
│  └─────────────────┘  │  /api/ai/chat              │            │
│                       │  /api/etl/upload           │            │
│  Serves React SPA     │  ... 40+ API endpoints     │            │
│  (static files US)    └──────┬─────────────────────┘            │
└──────────────────────────────│──────────────────────────────────┘
                               │
              🔻 Another Atlantic crossing
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                   South Africa North                            │
│                                                                 │
│  ┌────────────────┐  ┌────────────┐  ┌───────────────┐          │
│  │  Cosmos DB     │  │ ETL Extract│  │  Blob Storage │          │
│  │  (all data)    │  │  (Python)  │  │  (SAP exports)│          │
│  └────────────────┘  │  ETL Sync  │  └───────────────┘          │
│                      │  (Node.js) │                             │
│                      └────────────┘                             │
└─────────────────────────────────────────────────────────────────┘

Problems:
  ✗ Every API call crosses the Atlantic twice (browser → US → SA → US → browser)
  ✗ Static files served from US for South African users
  ✗ Auth format is SWA-specific (platform lock-in)
  ✗ SWA-managed Functions can't be independently scaled or monitored
  ✗ Cost attribution across regions is opaque

After: Single-Region App Service

Everything co-located in South Africa North. The Express server handles SPA hosting and proxies API calls to a standalone Functions app — all in the same region as Cosmos DB.

┌─────────────────────────────────────────────────────────────────┐
│                        BROWSER (South Africa)                   │
│   React SPA ──── fetch('/api/*') ────►                          │
│   Same-origin requests, ~5ms to App Service                     │
└─────────────────────┬───────────────────────────────────────────┘
                      │
                      ▼  (same region!)
┌─────────────────────────────────────────────────────────────────┐
│               All South Africa North                            │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │   App Service B1 Linux  (Express server)                  │  │
│  │                                                           │  │
│  │   ┌───────────────┐  ┌───────────────────────────────────┐│  │
│  │   │  EasyAuth     │  │  Express Web Host                 ││  │
│  │   │  (Entra ID)   │  │                                   ││  │
│  │   │  Dedicated app│  │  /healthz → direct 200            ││  │
│  │   │  registration │  │  /api/*   → proxy to Functions    ││  │
│  │   │  Claims-array │  │  /*       → serve dist/index.html ││  │
│  │   │  format       │  │  dist/assets/* → immutable cache  ││  │
│  │   └───────────────┘  └────────┬──────────────────────────┘│  │
│  └───────────────────────────────│───────────────────────────┘  │
│                                  │                              │
│                                  │  x-internal-proxy-secret     │
│                                  │  x-ms-client-principal       │
│                                  ▼                              │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Standalone Functions App  (Consumption plan)             │  │
│  │                                                           │  │
│  │  AUTH_MODE=appservice                                     │  │
│  │  Validates proxy secret → parses claims-array             │  │
│  │  IP restrictions: App Service outbound IPs only           │  │
│  │                                                           │  │
│  │  /api/me  /api/data  /api/ai/chat  /api/etl/upload        │  │
│  │  ... 40+ endpoints (same business logic, new auth mode)   │  │
│  └─────────────┬─────────────────────────────────────────────┘  │
│                │                                                │
│                ▼  (same region, ~1ms)                           │
│  ┌────────────────┐  ┌────────────┐  ┌───────────────┐          │
│  │  Cosmos DB     │  │ETL Extract │  │  Blob Storage │          │
│  │  (same region!)│  │(unchanged) │  │  (unchanged)  │          │
│  └────────────────┘  │ETL Sync    │  └───────────────┘          │
│                      │(unchanged) │                             │
│                      └────────────┘                             │
└─────────────────────────────────────────────────────────────────┘

Improvements:
  ✓ All traffic stays in South Africa — no cross-region hops
  ✓ Express serves SPA + proxies to Functions in same region
  ✓ Dedicated EasyAuth with claims-array auth (no SWA lock-in)
  ✓ Functions independently scalable and monitorable
  ✓ IP-restricted: Functions only accept traffic from App Service
  ✓ Shared-secret trust boundary on every proxied request
  ✓ SWA kept as parked standby for emergency failover
  ✓ Cost: +$13/month for the App Service plan

The Auth Migration

This deserves its own diagram because it was the hardest part of the migration. SWA and App Service EasyAuth present identity differently. The backend had to understand both.

  SWA Auth (before):                    App Service Auth (after):
  ──────────────────                    ────────────────────────
  x-ms-client-principal                 x-ms-client-principal
  │                                     │
  ▼                                     ▼
  Base64 → JSON                         Base64 → JSON
  {                                     {
    userId: "abc",                        claims: [
    userRoles: ["admin"],                   { typ: "oid", val: "abc" },
    identityProvider: "aad"                 { typ: "email", val: "..." },
  }                                         { typ: "roles", val: "admin" }
                                          ]
  Top-level fields                      }
  (SWA-specific)
                                        Claims array
                                        (standard Entra format)

  AUTH_MODE=swa                         AUTH_MODE=appservice
  + no proxy secret needed              + x-internal-proxy-secret required
  + SWA manages the hop                 + timing-safe secret validation
                                        + claims-array parsing
                                        + identity canonicalization
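
A minimal normalizer that accepts both shapes of the header might look like the sketch below. Function and field names are illustrative, and the assumption that SWA's `userDetails` carries the email is mine; only the two payload shapes are taken from the diagram above.

```javascript
// Decode the x-ms-client-principal header and return one normalized identity,
// regardless of whether it came from SWA or App Service EasyAuth.
function normalizePrincipal(headerValue) {
  const decoded = JSON.parse(
    Buffer.from(headerValue, "base64").toString("utf8")
  );

  if (Array.isArray(decoded.claims)) {
    // App Service EasyAuth: standard Entra claims array.
    const claim = (typ) =>
      decoded.claims.filter((c) => c.typ === typ).map((c) => c.val);
    return {
      userId: claim("oid")[0] ?? null,
      email: claim("email")[0] ?? null,
      roles: claim("roles"),
    };
  }

  // SWA: platform-specific top-level fields.
  return {
    userId: decoded.userId ?? null,
    email: decoded.userDetails ?? null, // assumption: userDetails holds the email
    roles: decoded.userRoles ?? [],
  };
}
```

Keeping both branches in one function is what lets the backend run under `AUTH_MODE=swa` and `AUTH_MODE=appservice` during the parallel-validation window without duplicating route logic.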

Act 1: The Runbook

Why This Migration Needed a Runbook

This wasn't a "lift and shift." Moving from SWA to App Service touched:

  • 4 Azure resources to provision (App Service plan, web app, Functions app, storage account)
  • 56 app settings to migrate from SWA to the standalone Functions app
  • 12 user identities to canonicalize from SWA format to Entra format
  • A new auth mode (App Service EasyAuth with claims-array parsing)
  • A new web host (Express server replacing SWA's built-in hosting)
  • 2 CI/CD pipelines running in parallel during validation
  • An ETL pipeline that needed seamless ownership transfer between workflows
  • A parked standby mode for the old SWA (not decommission — failover readiness)

The runbook grew to 21 sections with 2,300+ lines. Every Azure CLI command. Every app setting category. Every auth claim extraction rule. Every verification checkpoint.

Seven Review Sessions

Before Codex touched anything, Claude reviewed the runbook across seven dedicated sessions between March 6 and 26, 2026:

  Session  Date             Focus
  1        Mar 6            Initial architecture validation and scope framing
  2        Mar 25 (18:55)   Plan validity: are the phases correctly sequenced?
  3        Mar 25 (19:34)   Autonomous execution review: can an AI agent run this without portal clicks?
  4        Mar 25 (20:03)   Architecture and design review: is the auth migration sound?
  5        Mar 25 (20:36)   Implementation plan: critical path validation
  6        Mar 25 (20:59)   Expert engineer review: what would a senior engineer push back on?
  7        Mar 25 (21:24)   Deep review: identity migration, rollback, and edge cases

Key issues caught during review:

  • Identity continuity gap: The initial runbook assumed user IDs would carry over. Claude caught that SWA uses platform-managed service principals while App Service EasyAuth uses Entra object IDs — a completely different identity format. This led to adding the userIdentity.js canonicalization layer and the one-time backfill script.
  • Auth lightweight path: The verifyTokenLightweight function used by AI chat was SWA-only. Without an App Service equivalent, AI chat would break silently after migration.
  • ETL upload streaming: If Express body-parsing middleware was added before the /api proxy, multipart ETL uploads would break. The runbook was updated to explicitly forbid express.json() ahead of the proxy mount.
  • Rollback strategy: The original plan assumed SWA decommission. I later changed my mind and pushed for a parked-standby model instead — keep SWA deployable as emergency failover, not delete it.

Act 2: The Execution

Codex received the runbook and started working on March 29, 2026 at 14:56 SAST.

Day 1: Infrastructure and Code (March 29)

  Time    Event
  14:56   Inventory capture + code implementation (auth, identity, server, tests)
  15:22   Azure provisioning: App Service plan, web app, Functions app, storage account
  15:41   Identity backfill dry-run: 12 users scanned, 12 canonical migrations found
  15:42   Identity backfill applied: 12 users migrated, zero conflicts
  16:43   Web deploy (first attempt — Windows zip failed, rebuilt with POSIX paths)
  17:00   API deploy: standalone Functions packaging fixed, proxy verified
  17:06   Auth blocker: Entra app registration failed (insufficient tenant privileges)
  18:05   Full verification: 56/56 config parity, health green, smoke blocked only by auth
  19:18   Workflow + deploy path hardening committed
  22:10   API deploy recovery: Functions-action produced 503; switched to source-only Kudu
  23:33   Kudu false-negative analysis: rsync symlink errors masked a healthy deploy
  23:56   Both GitHub Actions workflows green. SA web + API deployed successfully.

Day 2: Auth, Validation, and Cutover (March 30)

  Time    Event
  14:32   Entra auth unblocked: dedicated app registration created with new privileges
  14:40   EasyAuth enabled: login redirect verified working
  15:01   Identity re-audit: 9 canonical, 2 clean migrations, 1 overlap detected
  15:15   Overlap identity fix + verification hardening deployed
  15:48   Auth flow correction: enableIdTokenIssuance was false, fixed live
  16:05   Workflow smoke alignment: accept EasyAuth-protected probe responses
  17:18   ETL admin regression: EtlPipelineView used wrong role authority, fixed
  18:24   Documentation strategy rewrite: park SWA, don't decommission
  21:03   ETL ownership switched to SA workflow
  21:22   Final cutover: SWA parked, SA primary, both workflows green
  21:34   Failover drill fix: workflow_dispatch jobs were gated to push-only
  21:49   Failover drill complete: SWA restored, re-parked, verified end-to-end

Act 3: The Battles

Autonomous doesn't mean smooth. Codex hit real obstacles and worked through them.

Battle 1: The Windows Zip

The first web deploy failed because the zip archive built on Windows contained backslash paths. Azure's OneDeploy rejected them. Codex rebuilt the package with POSIX-style paths and redeployed successfully.

Battle 2: The Functions 503

The standard Azure/functions-action@v1 with pre-built node_modules produced a deployed Functions app that returned 503. Codex diagnosed it, switched to source-only Kudu zipdeploy with SCM_DO_BUILD_DURING_DEPLOYMENT=true (matching the pattern already proven by the ETL sync app), and restored the API to 200.
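
The recovery pattern reduces to two CLI moves, sketched below. Resource names are placeholders, not the project's actual values, and the exact packaging rules in the real workflow are more involved than this.

```shell
# Enable the remote Oryx build so Kudu installs dependencies server-side,
# against the Linux runtime, instead of trusting node_modules built elsewhere.
az functionapp config appsettings set \
  --resource-group <rg> --name <functions-app> \
  --settings SCM_DO_BUILD_DURING_DEPLOYMENT=true

# Package source only (no node_modules) and push it via Kudu zipdeploy.
zip -r api-src.zip . -x "node_modules/*"
az functionapp deployment source config-zip \
  --resource-group <rg> --name <functions-app> --src api-src.zip
```

The key design choice is "source-only": shipping pre-built dependencies couples the package to the build machine's OS, which is exactly what caused both the 503 here and the Windows-zip failure in Battle 1.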

Battle 3: The Kudu False Negative

After fixing the deploy shape, Kudu still reported "failed" — because rsync couldn't create node_modules/.bin/* symlinks. But the app was actually healthy. Codex analyzed the log pattern, hardened the Kudu helper script to recognize this specific false negative, and added a health-gated fallback.
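
That hardening can be sketched as a small classifier over the deploy log. The log patterns and function name below are illustrative; the real helper script is not shown in this post.

```javascript
// Decide what a Kudu deploy result actually means. A reported failure caused
// only by rsync being unable to create node_modules/.bin symlinks, combined
// with a passing health probe, is treated as a false negative.
function classifyKuduResult(reportedStatus, logText, healthProbeOk) {
  if (reportedStatus === "success") return "deployed";
  const symlinkOnlyFailure =
    /rsync/.test(logText) &&
    /node_modules\/\.bin/.test(logText) &&
    /symlink/i.test(logText);
  if (symlinkOnlyFailure && healthProbeOk) return "deployed-false-negative";
  return "failed";
}

console.log(
  classifyKuduResult(
    "failed",
    "rsync: failed to create symlink node_modules/.bin/tsc",
    true
  )
); // "deployed-false-negative"
```

The health probe is the safety net here: the log pattern alone only nominates a candidate false negative, and a real outage with a similar log would still be reported as failed.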

Battle 4: The Tenant Privilege Blocker

Creating the Entra app registration required Application Administrator privileges that Codex didn't have on Day 1. This blocked EasyAuth completely. I resolved the privilege overnight, and Codex resumed on Day 2.

Battle 5: The ID Token Gap

After enabling EasyAuth, browser logins failed silently. The Entra app registration had enableIdTokenIssuance=false, but App Service EasyAuth requests response_type=code id_token. Codex found this, set the flag to true via CLI, and updated both the provisioning and verification scripts to treat it as required state.
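
The fix itself is a one-liner. The app ID below is a placeholder, and the flag name is the current Azure CLI parameter corresponding to the manifest's enableIdTokenIssuance setting (verify against your CLI version; it requires the Microsoft Graph-based `az ad` commands):

```shell
# App Service EasyAuth requests response_type=code id_token, so the app
# registration must be allowed to issue ID tokens.
az ad app update --id <app-registration-id> --enable-id-token-issuance true
```

Encoding this as required state in the provisioning and verification scripts, as Codex did, is what keeps the fix from regressing on the next re-provision.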

Battle 6: The ETL Role Regression

The ETL admin page broke for the admin user on the new stack. Root cause: EtlPipelineView preferred raw tokenRoles (which under App Service auth is just ["authenticated"]) over the database-backed profileRoles. Codex fixed the precedence and added a regression test.
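
The corrected precedence can be sketched as follows (the function name is illustrative, not the component's actual code):

```javascript
// Under App Service auth, token roles collapse to ["authenticated"], which
// carries no authorization signal. Only prefer token roles when they contain
// something meaningful; otherwise fall back to database-backed profile roles.
function effectiveRoles(tokenRoles, profileRoles) {
  const meaningful = (tokenRoles || []).filter(
    (r) => r !== "authenticated" && r !== "anonymous"
  );
  if (meaningful.length > 0) return meaningful; // SWA-style real roles
  return profileRoles || [];
}

console.log(effectiveRoles(["authenticated"], ["admin"])); // [ 'admin' ]
console.log(effectiveRoles(["admin"], ["viewer"]));        // [ 'admin' ]
```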


Act 4: The Numbers

  Metric                          Value
  Total execution time            ~28 hours across 2 days
  Files created                   14
  Files modified                  18
  Tests passing                   537 across 149 test files
  User identities migrated        12
  App settings migrated           56 (verified parity)
  Azure resources provisioned     4 (plan, web app, Functions app, storage)
  GitHub Actions workflows        2 running in parallel, both green
  Execution ledger entries        30+ timestamped operations
  Portal clicks                   0
  Incremental monthly cost        +$13 (one B1 Linux plan)

What Got Deployed

  • Express web host serving the React SPA with immutable asset caching
  • API proxy with shared-secret trust boundary and 180s timeout
  • Standalone Functions app with AUTH_MODE=appservice and IP restrictions
  • Dedicated Entra app registration with EasyAuth
  • App Service auth parser with claims-array extraction and timing-safe secret validation
  • Identity canonicalization layer with SWA-to-Entra migration
  • Kudu zipdeploy helper with false-negative resilience
  • Curated deploy artifact with dependency pruning
  • ETL workflow parity with ownership switch variable
  • SWA parked standby with verified failover drill

What I Learned

The Execution Ledger Pattern

The most valuable artifact wasn't the code — it was the execution ledger. Every action Codex took was recorded with timestamp, phase, command, sanitized result, and next action. This append-only log became the working memory across sessions and the audit trail for the entire migration.

When Codex hit the tenant privilege blocker on Day 1 and had to resume on Day 2, the ledger told it exactly where to pick up. When the Kudu deploy shape needed three iterations, the ledger captured each failure and its resolution.

If you're planning autonomous multi-session work, build the ledger into the runbook from the start.
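
A minimal version of such a ledger entry is one JSON line per operation. The field names below are my own; the post only specifies that each entry records a timestamp, phase, command, sanitized result, and next action.

```javascript
// Format one append-only ledger entry as a single JSON line (JSONL).
function ledgerLine(phase, command, result, next) {
  return (
    JSON.stringify({
      ts: new Date().toISOString(),
      phase,
      command,
      result, // sanitized: no secrets, no connection strings
      next,   // what the agent should do when it resumes
    }) + "\n"
  );
}

// Appending (never rewriting) keeps the log a trustworthy audit trail:
// require("fs").appendFileSync("ledger.jsonl",
//   ledgerLine("deploy", "az webapp deploy ...", "ok", "run verification"));
```

The `next` field is what makes the ledger work as resumable memory: after the Day 1 tenant-privilege blocker, the last entry tells the agent exactly where to pick up.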

CLI-First Changes Everything

The runbook's execution rule — "no Azure Portal or GitHub UI dependency; all setup must be executable by az, gh, PowerShell, or GitHub Actions" — was the single most important constraint. It made the entire migration automatable.

Every resource provisioned by az appservice plan create. Every secret set by gh secret set. Every EasyAuth configuration by az webapp auth update. Every verification by scripted probes. Zero portal clicks meant zero human bottlenecks.
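
As a command sketch, that CLI-only path looks like this. Every name is a placeholder, the secret name is hypothetical, and runtime strings vary by CLI version; this shows the shape of the approach, not the runbook's exact commands.

```shell
# Provision the target stack without touching the portal.
az appservice plan create --resource-group <rg> --name <plan> --is-linux --sku B1
az webapp create --resource-group <rg> --plan <plan> --name <web-app> \
  --runtime "NODE:20-lts"
az storage account create --resource-group <rg> --name <storage> --sku Standard_LRS
az functionapp create --resource-group <rg> --name <functions-app> \
  --storage-account <storage> --consumption-plan-location southafricanorth \
  --runtime node --functions-version 4

# Wire up CI/CD secrets and platform auth the same way.
gh secret set INTERNAL_PROXY_SECRET --body "<generated-secret>"
az webapp auth update --resource-group <rg> --name <web-app> --enabled true
```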

Park, Don't Decommission

My push toward a parked-standby model instead of immediate SWA decommission was the right call. On Day 2, after cutover, Codex ran a full failover drill: unparked SWA, verified the full app was serving, then re-parked it. The whole cycle took 15 minutes and proved the rollback path works.

For any production migration: keep the old thing alive in standby until you're confident you'll never need it. The cost of maintaining a parked SWA ($0) is much less than the cost of recreating one in an emergency.

Auth Migrations Are Never Simple

We hit five distinct auth-related issues across two days: tenant privileges, ID token issuance flags, claims-array format differences, role authority precedence, and identity canonicalization. Any one of them could have broken the migration silently.

The runbook's detailed auth specification — with pseudocode for claim extraction, validation ordering, and normalized return shapes — was essential. Without it, the agent would have guessed at the auth format and produced subtly wrong behavior.

The Human's Role

I didn't write the Express server, the auth parser, the identity backfill script, the deploy workflows, or the provisioning scripts. But I:

  • Designed the target architecture
  • Wrote a 2,300-line runbook that left nothing ambiguous
  • Reviewed it across 7 sessions with an AI planning partner
  • Resolved the tenant privilege blocker that no CLI command could fix
  • Made the cutover decision based on the verification evidence after confirming with pilot users
  • Performed manual browser validation that proved the stack worked end-to-end

The pattern is the same as before: the human's job is to write specifications precise enough that code writes itself. The better the runbook, the more the agent can do autonomously.


Try This Yourself

Compared to the AI service migration (which was a code extraction and new service build), this SWA migration was a different kind of challenge: less code, more infrastructure, more auth complexity, more operational choreography.

If you're planning a similar hosting migration:

  1. Audit your auth surface before you start. SWA, App Service, and B2C/Entra all present identity differently. Map the claim shapes explicitly.
  2. Build the execution ledger into the plan. Autonomous agents that work across sessions need persistent working memory.
  3. Require CLI-only execution. If the plan needs portal clicks, the agent can't run it.
  4. Run both stacks in parallel. Shared data (same Cosmos, same Blob) means zero data migration. Two active frontends during validation costs almost nothing.
  5. Park, don't delete. Your rollback is only useful if it still exists.
  6. Test the real user path manually. Health probes pass, workflows are green, config parity is 56/56 — and then a human opens a browser and auth fails because of a flag nobody thought to check.

The runbook used in this project is shared in the appendix below.

Appendix: The Complete Migration Runbook (Redacted)

Below is the full runbook that guided this migration, exactly as Codex executed it. Sensitive identifiers — Azure resource names, GitHub references, email addresses, Entra IDs, connection strings, and deployment credentials — have been replaced with <placeholder> tokens. The architecture decisions, execution patterns, CLI commands, and verification checklists are preserved verbatim.

This is the document that Claude reviewed across seven sessions and Codex executed autonomously over two days. Scroll through to see the level of detail that makes autonomous agent execution possible.



Muhammad Khan is a GM moonlighting as a software engineer in his spare time, learning about AI-augmented development workflows, cloud architecture, and autonomous agent orchestration.