Two days. That's how long my ERP ETL workflow had been broken in production. The Function App deploy pipeline — the one that was supposed to be "serverless and simple" — had drifted so far from reality that a hotfix deployed outside CI/CD was the only thing keeping the lights on.
I was staring at a system held together with surgical tape and a prayer. The app had supposedly been stable for months, yet it was brittle in odd places. How does a simple, deterministic ETL process get this unreliable?
The sync Function App was serving a stale runtime that didn't match what was in git. The CI/CD workflow couldn't deploy without breaking things further. The Kudu zipdeploy mechanism — Azure's "just push a zip and we'll figure it out" deployment model — had proven itself fundamentally untrustworthy for my production workloads. And Event Grid, the invisible message bus connecting my extract and sync stages, added complexity I couldn't observe or debug without spelunking through Azure portal logs.
I had a choice: keep patching, or rethink the architecture.
I chose to rethink it. In one session. With an AI coding agent. On a Tuesday night. In under five hours.
Act 1: The Architecture That Looked Good on Paper
When I first built the ETL pipeline, the architecture felt elegant:
```
Upload -> Blob Storage -> Event Grid -> Python Extract (Function App)
                                            |
                                            v
                                  Blob Storage (artifacts)
                                            |
                                            v
                                  Event Grid -> Node Sync (Function App)
                                            |
                                            v
                                        Cosmos DB
```
Two Azure Function Apps. Two Event Grid subscriptions. Two separate Kudu zipdeploy pipelines. Two different runtimes (Python for extraction, Node.js for sync). Each with its own host.json, its own publish profile secrets, its own health check probes.
It worked. For a while.
Then the drift started. A deploy would succeed in CI but the runtime would serve stale code. Kudu would report success but the Function App would 404. A "quick fix" deployed through the portal would work, but the next CI deploy would overwrite it with the wrong version. The Python Function App and the Node.js Function App had different deploy mechanisms, different failure modes, and different ways of being quietly broken.
The serverless promise — "just write functions, we handle the rest" — had become: "just debug Azure's deployment infrastructure while your financial planning app serves stale data to executives."
Act 2: The Question That Changed Everything
I asked my AI agent a question:
"Since both services will eventually run from the App Service itself, why do we need Event Grid?"
That's when it clicked. My App Service was already running Express.js, serving the React frontend, proxying API calls. It was a Linux host with Node.js. It was deployed via a single zip push. It just worked.
Why wasn't the ETL pipeline running there too?
The original reason was separation of concerns — extractors in Python, sync in Node.js, each scaling independently. But the reality was: my ETL workload processed one file at a time, took 20 seconds end-to-end, and ran maybe 5 times a day. It didn't need independent scaling. It needed reliable deployment.
Act 3: The Plan (And Why We Threw Half of It Away)
The initial plan was careful and staged:
Stage 1: Move the sync service to the App Service, keep extract on the Function App, use Event Grid to connect them.
Stage 2: Move extract to the App Service, eliminate Event Grid.
We started with Stage 1. Moved all 12 sync/verify scripts. Created an Express router. Wired up audit logging with distinct source tags (etl-sync-appservice vs the Function App's etl-sync) so we could see exactly which code path processed each job.
The clever part: we created re-export stubs that proxied imports to the shared API libraries. This meant the 12 sync scripts could be copied byte-for-byte — zero modifications. The stubs resolved their ../lib/ imports to the shared libraries. One-line files doing the heavy lifting:
```javascript
// etl/lib/containerConfig.js
export * from '../../api/lib/containerConfig.js';
```
Then we hit the Event Grid wall.
Act 4: The Event Grid Wall
To register an Event Grid webhook subscription, Azure sends a validation handshake to your endpoint. Your endpoint must respond with a validation code. Simple, right?
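For reference, the handshake itself is easy to satisfy once the request actually reaches your code. A minimal sketch of the echo logic (the route path and helper name are my illustration, not our actual router):

```javascript
// Event Grid POSTs a JSON array of events. The first delivery to a new
// webhook subscription contains a SubscriptionValidationEvent whose
// validationCode must be echoed back; later batches are real events.
function handleEventGridBatch(events) {
  const batch = events || [];
  const validation = batch.find(
    (e) => e.eventType === "Microsoft.EventGrid.SubscriptionValidationEvent"
  );
  if (validation) {
    // Echoing the code completes the subscription handshake.
    return { validationResponse: validation.data.validationCode };
  }
  return { accepted: batch.length };
}

// Express wiring (sketch):
//   app.post("/etl/webhook/sync", (req, res) =>
//     res.json(handleEventGridBatch(req.body)));
```

The catch, of course, is that the validation request has to reach this code in the first place.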
Except our App Service was behind Azure AD authentication. Anonymous requests got a 302 redirect to the Microsoft login page. Event Grid's validation request isn't a browser — it doesn't follow redirects or authenticate with Azure AD.
We tried excludedPaths in the auth config. We tried switching to Return401 mode. The excluded paths worked — but Return401 mode broke the entire login flow for the SPA. Users couldn't log in. On production. At night.
We reverted in 30 seconds. Crisis contained. But Event Grid was blocked.
We tried the AAD-authenticated delivery approach. Event Grid can send a bearer token if you configure it with a tenant ID and app registration. But creating the required AzureEventGridSecureWebhookSubscriber app role needed Entra admin permissions. It was 9 PM. The IT admin was asleep.
That's when my own question came back: why do we need Event Grid at all?
Act 5: The Browser Console Trick
We had the sync running on the App Service. We had the Event Grid subscription blocked. But we needed to prove it worked.
The user is already authenticated in the browser. The browser has the App Service auth cookie. What if we just... called the endpoint from the browser console?
```javascript
fetch("/etl/webhook/sync", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify([{
    eventType: "Microsoft.Storage.BlobCreated",
    subject: "/blobServices/default/containers/etl-artifacts/blobs/...",
    data: { api: "PutBlob" }
  }])
}).then(r => r.json()).then(d => console.log(d));
```
Response:
```json
{
  "accepted": 1,
  "processedBy": "etl-sync-appservice",
  "timestamp": "2026-04-15T18:28:22.051Z"
}
```
The latest/* blobs in Azure Storage updated 19 seconds later. Audit logs showed ETL Sync AppService as the actor. Verification passed. Cache manifest bumped.
It worked. The new code path was live.
Act 6: Scrap the Plan, Go All In
The staged plan said "move extract in Stage 2, later." But we'd just proven the architecture worked. The Event Grid approach was blocked anyway. And I realized: if both extract and sync run on the App Service, the upload button can trigger the entire pipeline directly. No Event Grid. No Function Apps. No intermediate blob-event dance.
```
Upload -> POST /etl/pipeline/run
            -> Python extract (subprocess)
            -> Node sync (in-process)
            -> Cosmos DB updated
            -> Done. 21 seconds.
```
The Python challenge was real though. The App Service runs Node.js. The extractors use openpyxl to parse Excel files. Our first attempt — pip install openpyxl && npm start as the startup command — killed the app. The Node.js Linux image doesn't reliably support pip in the startup command.
The fix: vendor openpyxl into the deploy package during CI build. The GitHub Actions runner (Ubuntu) has pip. Install there, ship the result:
```shell
python3 -m pip install \
  --target .deploy/webapp/etl/extract/.pylibs \
  --quiet \
  "openpyxl>=3.1.0"
```

(The version spec needs quoting — an unquoted `>=` is a shell redirection.)
The Node.js extract handler adds .pylibs to PYTHONPATH when spawning the subprocess. The Python extractors run unchanged — they don't know they're on an App Service instead of a Function App.
Act 7: The Cutover
Here's the full cutover sequence:
```shell
# Stop Function Apps (restartable for rollback)
az functionapp stop --name my-etl-sync --resource-group my-rg
az functionapp stop --name my-etl-extract --resource-group my-rg

# Deploy lands via normal CI/CD -- same workflow that deploys frontend
git push origin master

# Test: upload ERP export, pipeline runs automatically
# Audit logs: "ETL Extract AppService" -> "ETL Sync AppService"
# All SUCCESS. 21 seconds end-to-end.
```
Rollback:
```shell
az functionapp start --name my-etl-sync --resource-group my-rg
az functionapp start --name my-etl-extract --resource-group my-rg
```
Two commands. No code changes needed. The Function App code was never modified.
Act 8: What We Actually Built
The final architecture:
```
Admin Console
  -> POST /api/etl/upload (file to blob storage)
  -> POST /etl/pipeline/run {buId, connector, sourceBlobName}
       -> Express router
            -> Download source file from blob
            -> Spawn python3 extractor subprocess
            -> Upload artifacts to blob
            -> Run Node.js sync script
            -> Run Node.js verify script
            -> Promote artifacts to latest/
            -> Update Cosmos DB
            -> Bump cache manifest
            -> Audit log with user identity + source tags
```
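Under the hood, the route handler is just a sequential loop over those steps. A sketch of the shape (the `runPipeline` helper and stage names are hypothetical, not the real etlRouter.js):

```javascript
// Run pipeline stages in order, stopping at the first failure and
// recording a per-step entry for the audit log.
async function runPipeline(stages, ctx) {
  const audit = [];
  for (const [name, fn] of stages) {
    try {
      await fn(ctx);
      audit.push({ step: name, status: "SUCCESS" });
    } catch (err) {
      audit.push({ step: name, status: "FAILED", error: err.message });
      return { ok: false, audit }; // fail fast -- later stages never run
    }
  }
  return { ok: true, audit };
}

// Express wiring (sketch):
//   router.post("/pipeline/run", async (req, res) =>
//     res.json(await runPipeline(stages, req.body)));
```

Because everything is one process behind one route, a failure is a stack trace in the app's own logs, not a silently missing Event Grid delivery.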
No Event Grid subscriptions. No Function App deploys. No Kudu zipdeploy. No publish profile secrets. No runtime drift.
One Express server. One deploy workflow. One zip. One host.
Files added:
| File | Purpose |
|---|---|
| `etl/etlRouter.js` | Express router — pipeline, webhook, status, provision, healthz |
| `etl/etlExtractHandler.js` | Node.js wrapper spawning Python extractors |
| `etl/sync/*.js` | 12 sync/verify scripts (copied unchanged) |
| `etl/extract/extractors/*.py` | 6 Python extractors (copied unchanged) |
| `etl/lib/*.js` | Re-export stubs (1 line each) |
| `api/lib/etlBlobHelpers.js` | Blob operations for ETL |
| `api/lib/etlOperationalLog.js` | Operational log adapter |
Files untouched: All 6 Python extractors. All 12 sync/verify scripts. The Function App code. The frontend upload handler. Database schemas. Blob storage structure.
Act 9: The Lessons
1. Serverless isn't free
The "no servers to manage" promise is real until deployment breaks. Then you're managing Azure's deployment infrastructure, which is harder to debug than your own server because you can't SSH in, can't see the filesystem, and can't reproduce the environment locally.
2. Event Grid adds invisible coupling
Every Event Grid subscription is an invisible dependency. When it works, it's magic. When it breaks, you're reading Azure portal logs trying to understand why a blob write didn't trigger a function invocation. Moving to direct HTTP calls made the pipeline debuggable, observable, and fast.
3. The best abstraction is sometimes no abstraction
Three Azure Function Apps, two Event Grid subscriptions, and a Kudu zipdeploy pipeline — or one Express route handler. The "simpler" architecture has more moving parts. The "complex" monolith is actually simpler to operate.
4. Source tagging is non-negotiable in migrations
Tagging every audit event and operational log with source: 'etl-sync-appservice' vs source: 'etl-sync' meant we could prove, in production, which code path processed each job. When leadership asks "are we on the new system?" you can point to the audit log and say: yes, see the tag.
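The mechanism can be as small as one constant threaded through every log write; a sketch (field names are illustrative):

```javascript
// One constant per deployment target; the Function App build would
// carry "etl-sync" instead.
const SOURCE = "etl-sync-appservice";

// Every audit event records which code path produced it, so the
// migration is provable from the logs alone.
function auditEvent(action, details) {
  return {
    source: SOURCE,
    action, // e.g. "sync.completed"
    ...details,
    timestamp: new Date().toISOString(),
  };
}
```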
5. Copy, don't rewrite
The 12 sync scripts and 6 Python extractors were copied byte-for-byte. Zero modifications. The re-export stubs handled the import path difference. This meant: if anything breaks, it's the new wrapper code, not the battle-tested extraction and sync logic.
6. AI agents are force multipliers, not autopilots
The AI agent wrote the ETL router, the extract handler, the blob helpers, the operational log adapter, the deploy workflow changes, and the Azure CLI provisioning commands. It also broke the production login flow by changing the auth config, tried to pip install in a Node.js startup command (which killed the app), and ran up my GitHub Actions credits.
The human judgment calls that mattered: choosing to disable-and-replace (not dual-write), deciding to skip Event Grid entirely, and saying "let's just test from the browser console." The agent executed brilliantly once pointed in the right direction. But pointing it in the right direction was the hard part.
The Numbers
| Metric | Before | After |
|---|---|---|
| Azure resources for ETL | 2 Function Apps + 2 Event Grid subscriptions | 0 (runs on existing App Service) |
| Deploy mechanisms | 3 (webapp + 2 Kudu zipdeploys) | 1 (webapp only) |
| CI/CD jobs | 5 (often 2 failing) | 3 (all green) |
| Pipeline latency | 45-90s (Event Grid hops + cold starts) | 21s (in-process) |
| Runtime drift risk | High (Function App VFS cache) | Zero (deploy is atomic zip) |
| Rollback time | Unknown (redeploy + pray) | 60 seconds (restart Function App) |
| Deploy time for ETL changes | 7+ min (separate Kudu job, often fails) | 0 (included in webapp deploy) |
The Takeaway
I spent two days debugging a broken deploy pipeline, burning through both Claude and Codex tokens. Then I pivoted, and spent one evening replacing the pipeline entirely.
The old architecture was designed for a future that never came — independent scaling, multi-region ETL, connector-level isolation. The new architecture is designed for the reality I have — a single-region financial planning app that processes a handful of ERP exports a day and needs to be reliable.
Sometimes the bravest engineering decision is to make things simpler.
Both Function Apps remain stopped but restartable. The Event Grid subscriptions exist but deliver to stopped endpoints. The full ETL pipeline runs on a single App Service, deployed by the same workflow that deploys the React frontend.
Total session time: ~5 hours. Total production downtime: ~90 seconds (the auth config incident). Cups of coffee: lost count.