One planned. One executed entirely autonomously as the lead AI agent coordinator (managing 3-4 parallel agents). The human watched both and steered. Here's what happened.
The Problem Nobody Warns You About: Your Fast-Built MVP
You build an AI feature on a shoestring budget, leveraging free Azure services: in this case a Static Web App (which I later migrated to a standalone service; that's a future blog post). It works in dev. It works in staging. Then it hits production, and the platform underneath it starts fighting you.
Our AI chat feature — a financial intelligence analyst chatbot built on Claude — was deployed inside Azure Functions behind a Static Web App proxy. On paper, it worked. In practice:
- 45-second hard timeout from the SWA proxy killed any complex analysis
- HTTP 500 errors appeared randomly under load
- No reconnection — if your browser tab hiccupped, your 30-second AI response was gone
- Request-bound execution — the AI's "life" was the HTTP request. When the request died, so did the AI
The feature was unreliable enough that users had learned to retry. That's a product failure.
I decided to rip it out of Azure Functions entirely and move it to a standalone always-on service. Not a rewrite — a port. Same AI logic, new runtime.
The question was: could I plan this migration in a way that an autonomous AI agent could execute it end-to-end?
The Cast
This project involved three actors:
Me — the architect and orchestrator. I wrote the runbook, designed the execution plan, reviewed every decision, and verified the final result. A taste of the future role of the SDM (Software Development Manager).
Claude (Opus) — my planning partner and shadow observer. Claude reviewed the runbook across three iterations, helped me design the parallel execution strategy, drafted the system prompt for the executor, and then spent the entire execution day "snooping" on progress — reading the codebase in real-time and reporting back.
Codex — the autonomous executor and master agent coordinator. Codex received the runbook and a carefully crafted system prompt, then worked for ~9.5 hours straight, spawning multiple agents on demand, producing 21 PRs and ~19,000 lines of code, provisioning Azure infrastructure, and deploying to production. Codex effectively did the job of a senior engineer working with 3-4 engineers.
┌─────────────┐ reviews/plans ┌─────────────┐
│ │◄───────────────────────►│ │
│ Human │ system prompt │ Claude │
│ Architect │────────────────────────►│ (Opus) │
│ │ snoop reports │ Reviewer │
│ │◄────────────────────────│ Planner │
└──────┬──────┘ └─────┬───────┘
│ │
│ runbook + prompt │ reads codebase
│ │ (read-only)
▼ ▼
┌─────────────────────────────────────────────────────┐
│ │
│ Codex │
│ Autonomous Executor │
│ │
│ Thread A ─── Thread B ─── Thread C ─── Thread D │
│ (critical) (infra) (proxy) (diagnostics)│
│ │
│ 21 PRs │ ~19K lines │ 9.5 hours │
└─────────────────────────────────────────────────────┘
Act 1: The Runbook
Why a Runbook, Not a Prompt
Most people using AI coding agents write prompts. I wrote a runbook — a 2,100-line execution plan that reads more like an engineering specification than a chat message.
Why? Because autonomous agents fail in predictable ways:
- Scope drift — they start "improving" things you didn't ask for
- Missing context — they make reasonable-sounding decisions based on wrong assumptions
- No gates — they charge ahead past failure points without stopping
- Sequential thinking — they do everything in order even when work can be parallelized
The runbook addressed all four:
- Explicit scope fences: "Port, do not rewrite. This is a runtime extraction, not a feature redesign."
- Complete architecture context: every Azure resource name, every file path, every dependency version, every API contract
- Stop gates at every phase: "Stop and escalate if any of the following is true..."
- Parallel thread definitions: five named threads with explicit entry gates and exit artifacts
The runbook went through three review rounds with Claude before I was satisfied. Each round caught real issues:
Round 1 identified that the parallel thread definitions were missing concrete entry/exit gates — they said what work to do, but not what must be true before starting or what artifact proves completion.
Round 2 caught that the integration-branch merge strategy was underspecified. Without explicit merge ordering, parallel agents would create conflicting merges.
Round 3 hardened the execution contract: what the agent can do without permission, what requires stopping and reporting, and how to handle the managed-identity-vs-connection-string decision for Cosmos DB.
The Migration Architecture
The target was clean:
Before:
Browser → SWA Proxy → Azure Functions (aiChat.js monolith)
↳ 3,356 lines, one file
↳ 45-second timeout
↳ request-bound execution
After:
Browser → App Service (Express proxy)
├── /api/ai/* → Standalone AI Service (Fastify)
│ ↳ Durable runs (survive disconnects)
│ ↳ True SSE streaming
│ ↳ Reconnect by runId
│ ↳ No hard timeouts
└── /api/* → Azure Functions (unchanged)
The AI logic itself — prompts, tools, model selection, orchestration — would be extracted into a shared package (../ai-core) and reused by the new runtime. Port, don't rewrite.
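The proxy split in the "After" diagram boils down to a prefix-match routing decision. Here's a minimal, dependency-free sketch of that decision (the function and return labels are illustrative, not the actual server.js):

```javascript
// Illustrative routing decision made by the Express proxy layer.
// In real Express, mounting order matters: the more specific
// /api/ai prefix must be registered before the catch-all /api.
function routeTarget(path) {
  if (path.startsWith('/api/ai/')) return 'ai-service';    // standalone Fastify runtime
  if (path.startsWith('/api/')) return 'azure-functions';  // non-AI APIs, unchanged
  return 'static-assets';                                  // everything else
}
```

Because the AI routes are carved out first, the rest of the API surface keeps flowing to Azure Functions untouched, which is what made the migration incremental rather than a big-bang backend swap.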
The Architecture: Before and After
To appreciate what changed, you need to see what we were working with — and what we built to replace it.
Before: The Monolith
The entire AI feature lived in one Azure Functions HTTP handler. One file. 3,356 lines. Everything from authentication to prompt construction to tool execution to streaming — all coupled to the Azure Functions request lifecycle.
┌─────────────────────────────────────────────────────────────────────┐
│ BROWSER │
│ AIChatPanel.jsx ──POST /api/ai/chat──► │
│ │ │
│ ◄──── raw SSE text stream ────────────┘ │
│ (no reconnect, no runId, no resume) │
└─────────────────────────┬───────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ SWA / App Service Proxy │
│ ┌──────────────────┐ │
│ │ 45-second hard │ │
│ │ timeout on ALL │ │
│ │ proxied requests│ │
│ └──────────────────┘ │
│ /api/* ──► Azure Functions │
└─────────────────────────┬───────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Azure Functions (app-name-backend-api) │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ aiChat.js (3,356 lines) │ │
│ │ │ │
│ │ ┌──────────┐ ┌───────────────┐ ┌─────────────────────────┐ │ │
│ │ │ Auth │ │ System │ │ Anthropic SDK │ │ │
│ │ │ Bootstrap│ │ Prompt │ │ (buffered call) │ │ │
│ │ │ Validate │ │ Builder │ │ │ │ │
│ │ └────┬─────┘ └───────┬───────┘ │ ┌───────────────────┐ │ │ │
│ │ │ │ │ │ Tool Dispatch │ │ │ │
│ │ ▼ ▼ │ │ Loop │ │ │ │
│ │ ┌──────────┐ ┌───────────────┐ │ │ │ │ │ │
│ │ │ Feature │ │ Question │ │ │ Claude response │ │ │ │
│ │ │ Toggle │ │ Classifier │ │ │ ──► tool_use? │ │ │ │
│ │ │ Check │ │ Telemetry │ │ │ ──► execute │ │ │ │
│ │ └──────────┘ └───────────────┘ │ │ ──► feed back │ │ │ │
│ │ │ │ ──► loop │ │ │ │
│ │ ┌──────────┐ ┌───────────────┐ │ └───────────────────┘ │ │ │
│ │ │ Model │ │ Tool Profile │ │ │ │ │
│ │ │ Selection│ │ Resolution │ │ request dies = AI dies │ │ │
│ │ │ Policy │ │ Routing │ │ no run state persisted │ │ │
│ │ └──────────┘ └───────────────┘ └─────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ Tool Executor (22 tools) │ │ │
│ │ │ query_products │ query_customers │ query_actuals │ │ │
│ │ │ run_financial_model │ run_services_model │ │ │
│ │ │ compare_fiscal_years │ get_margin_analysis │ │ │
│ │ │ get_customer_concentration │ generate_chart │ │ │
│ │ │ run_what_if_simulation │ ...12 more │ │ │
│ │ └─────────────────────────┬───────────────────────────────┘ │ │
│ └────────────────────────────┼───────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Cosmos DB │ │
│ │ (products, budgets,│ │
│ │ actuals, configs, │ │
│ │ users, groups) │ │
│ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Problems:
✗ Request-bound: browser disconnect kills the AI mid-thought
✗ No run persistence: nothing survives a dropped connection
✗ Buffered streaming: text arrives in chunks after provider finishes
✗ 45s proxy timeout: complex multi-tool analyses get killed
✗ No reconnect: lose your tab, lose your answer
✗ Monolith coupling: auth, prompt, tools, streaming all in one file
After: The Decomposed Architecture
The new architecture separates concerns across three layers: a proxy that routes traffic, a standalone AI runtime that manages durable runs, and an extracted core package that holds all the AI logic.
┌─────────────────────────────────────────────────────────────────────┐
│ BROWSER │
│ │
│ AIChatPanel.jsx │
│ │ │
│ ├── POST /api/ai/chat ──────────────► start run, get runId │
│ │ ◄── SSE stream (attach immediately) │
│ │ │
│ ├── GET /api/ai/runs/:runId/stream ─► reconnect to live run │
│ │ ◄── SSE replay (snapshot) + live deltas │
│ │ │
│ ├── GET /api/ai/runs/:runId ────────► poll status; final result │
│ │ │
│ ├── POST /api/ai/runs/:runId/cancel ► user-initiated cancel │
│ │ │
│ └── POST /api/ai/chat/feedback ─────► thumbs up/down on run │
│ │
│ Client persists runId in React state │
│ Disconnect ≠ cancel (run continues server-side) │
└─────────────────────────┬───────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Web App Service (Express proxy — server.js) │
│ │
│ ┌─────────────────┐ ┌──────────────────────┐ │
│ │ /api/ai/* │ │ /api/* │ │
│ │ ──► AI Service │ │ ──► Azure Functions │ │
│ │ (dedicated) │ │ (unchanged) │ │
│ └────────┬────────┘ └──────────┬───────────┘ │
│ │ │ │
│ x-internal-proxy-secret x-internal-proxy-secret │
│ x-ms-client-principal x-ms-client-principal │
└────────────┬────────────────────────────┬───────────────────────────┘
│ │
▼ ▼
┌────────────────────────────┐ ┌─────────────────────────┐
│ AI App Service (Fastify) │ │ Azure Functions │
│ │ │ (non-AI APIs) │
│ │ │ /api/me │
│ Dedicated B1 plan │ │ /api/budgets │
│ Always-on process │ │ /api/products │
│ No hard timeouts │ │ /api/definitions │
│ │ │ etc. │
│ ┌──────────────────────┐ │ └─────────────────────────┘
│ │ Route Layer │ │
│ │ /api/ai/chat │ │
│ │ /api/ai/runs/:id │ │
│ │ /api/ai/runs/:id/ │ │
│ │ stream │ │
│ │ /api/ai/runs/:id/ │ │
│ │ cancel │ │
│ │ /api/ai/chat/ │ │
│ │ feedback │ │
│ │ /healthz /readyz │ │
│ └──────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Runtime Manager │ │
│ │ │ │
│ │ Admission Control: │ │
│ │ 2 active/user │ │
│ │ 2 queued/user │ │
│ │ 4 active global │ │
│ │ 8 queued global │ │
│ │ │ │
│ │ Run State Machine: │ │
│ │ queued ──► running │ │
│ │ │ │ │ │ │
│ │ │ ▼ ▼ │ │
│ │ │ completed │ │
│ │ │ failed │ │
│ │ ▼ cancelled │ │
│ │ cancelled_by_admin │ │
│ │ │ │
│ │ Watchdog: 10 min │ │
│ │ Shutdown: 15s grace │ │
│ └──────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ SSE Event Stream │ │
│ │ │ │
│ │ Contract (Phase 5): │ │
│ │ ready ──► status │ │
│ │ ──► text(delta) │ │
│ │ ──► tool_start │ │
│ │ ──► tool_end │ │
│ │ ──► text(delta) │ │
│ │ ──► chart │ │
│ │ ──► follow_ups │ │
│ │ ──► done │ │
│ │ │ │
│ │ Reconnect mode: │ │
│ │ text(snapshot) │ │
│ │ + live deltas │ │
│ └──────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Legacy AI Bridge │ │ ┌──────────────────────────┐
│ │ (adapter pattern) │──┼────►│ /ai-core │
│ │ │ │ │ (workspace package) │
│ │ Wraps existing AI │ │ │ │
│ │ chat logic with │ │ │ contracts/ │
│ │ new runtime hooks │ │ │ sseEvents.js │
│ └──────────────────────┘ │ │ interfaces.js │
│ │ │ uiContext.js │
│ │ │ chatFeedback.js │
│ │ │ │
│ │ │ prompts/ │
│ │ │ systemPrompt.js │
│ │ │ │
│ │ │ tools/ │
│ │ │ definitions.js (22) │
│ │ │ anthropic.js │
│ │ │ │
│ │ │ orchestration/ │
│ │ │ questionTelemetry.js │
│ │ │ modelSelection.js │
│ │ │ toolProfiles.js │
│ │ │ │
│ │ │ domain/ │
│ │ │ financialModel.js │
│ │ │ servicesModel.js │
│ │ └──────────────────────────┘
│ │
│ ┌──────────────────────┐ │
│ │ Data Access Layer │ │
│ │ (injected adapters) │ │
│ │ │ │
│ │ getGroupDoc() │ │
│ │ hasAiPermission() │ │
│ │ getUserProfile() │ │
│ │ canViewBu() │ │
│ │ probeReadiness() │ │
│ └──────────┬───────────┘ │
│ │ │
└─────────────┼──────────────┘
│
┌───────┴────────┐
│ │
▼ ▼
┌───────────┐ ┌──────────────────────────────────┐
│ Cosmos DB │ │ Azure Storage │
│ │ │ │
│ products │ │ │
│ customers │ │ Table: AiRuns (run metadata) │
│ actuals │ │ Queue: ai-run-dispatch │
│ configs │ │ Blob: ai-run-events │
│ users │ │ Blob: ai-run-snapshots │
│ groups │ │ Blob: ai-run-transcripts │
│ rbac │ │ │
└───────────┘ └──────────────────────────────────┘
Improvements:
✓ Durable runs: AI continues even if browser disconnects
✓ Reconnect by runId: pick up where you left off
✓ True SSE streaming: tokens arrive as they're generated
✓ No proxy timeout: dedicated service, no 45s wall
✓ Admission control: queuing, per-user limits, watchdog
✓ Graceful shutdown: in-flight runs drain before restart
✓ Separated concerns: proxy / runtime / core / storage
✓ Dependency injection: testable without Azure
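The run state machine in the Runtime Manager box can be captured in a few lines. A hedged sketch (the real Phase 3 runtime manager is ~916 lines; this models only the legal transitions, with the exact edge set being my reading of the diagram):

```javascript
// Legal state transitions for a run, per the Runtime Manager diagram.
// Admission control, the watchdog, and persistence are deliberately omitted.
const TRANSITIONS = {
  queued: ['running', 'cancelled', 'cancelled_by_admin'],
  running: ['completed', 'failed', 'cancelled', 'cancelled_by_admin'],
  completed: [], failed: [], cancelled: [], cancelled_by_admin: [], // terminal
};

function transition(run, next) {
  if (!(TRANSITIONS[run.state] || []).includes(next)) {
    throw new Error(`illegal transition: ${run.state} -> ${next}`);
  }
  return { ...run, state: next };
}
```

Making illegal transitions throw, rather than silently no-op, is what keeps a durable-run system debuggable: a run that "un-completes" is a bug you want to see immediately.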
The Run Lifecycle
This is the part that makes the new architecture fundamentally different. A "run" is now a first-class entity that exists independently of any HTTP connection.
Client AI Service Storage
│ │ │
│ POST /api/ai/chat │ │
│─────────────────────────►│ │
│ │ create run (queued) │
│ │─────────────────────────────►│
│ │ │
│ ◄─ SSE: ready{runId} ── │ │
│ ◄─ SSE: status{...} ── │ transition: running │
│ │─────────────────────────────►│
│ │ │
│ │ ┌─────────────────────┐ │
│ ◄─ SSE: text{delta} ─── │ │ Claude streaming │ │
│ ◄─ SSE: text{delta} ─── │ │ response arrives │ │
│ ◄─ SSE: text{delta} ─── │ │ token by token │ │
│ │ └─────────────────────┘ │
│ │ │
│ ◄─ SSE: tool_start ──── │ Claude requests tool_use │
│ │ execute tool (Cosmos query) │
│ ◄─ SSE: tool_end ────── │ feed result back to Claude │
│ │ │
│ ◄─ SSE: text{delta} ─── │ Claude continues response │
│ ◄─ SSE: text{delta} ─── │ │
│ │ │
│ ╔═══════════════╗ │ │
│ ║ DISCONNECT! ║ │ │
│ ║ tab closed ║ │ run keeps going... │
│ ╚═══════════════╝ │ events persisted to blob │
│ │─────────────────────────────►│
│ │ │
│ (reconnect) │ │
│ GET /runs/:id/stream │ │
│─────────────────────────►│ replay from storage │
│ │◄─────────────────────────────│
│ ◄─ SSE: text{snapshot}─ │ full text so far │
│ ◄─ SSE: text{delta} ─── │ live deltas resume │
│ │ │
│ ◄─ SSE: follow_ups ──── │ │
│ ◄─ SSE: done ────────── │ transition: completed │
│ │─────────────────────────────►│
│ │ archive transcript │
│ │─────────────────────────────►│
│ │ │
The key insight: the run outlives the connection. When you disconnect, the AI doesn't stop thinking. When you reconnect, you get a snapshot of everything that happened while you were away, followed by live deltas. This is what makes AI chat feel like a real product instead of a fragile demo.
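The snapshot-plus-deltas replay reduces to a tiny piece of client logic. A sketch, assuming events shaped like the Phase 5 contract (`text` events carrying a `mode` field); the reducer itself is an illustration, not the actual frontend code:

```javascript
// On reconnect, the server sends text(mode=snapshot) containing everything
// generated so far, then resumes text(mode=delta) for live tokens.
function reduceText(buffer, event) {
  if (event.type !== 'text') return buffer;          // tool_start, done, etc.
  if (event.mode === 'snapshot') return event.text;  // replace buffer on replay
  return buffer + event.text;                        // append live delta
}
```

The nice property: the client never needs to know whether it missed anything. Snapshot-then-deltas converges to the same buffer regardless of when the reconnect happened.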
Act 2: Designing the Parallel Execution Plan
This is where Claude earned its keep as a planning partner.
A naive sequential execution would have looked like: baseline → scaffold → extract → build runtime → add streaming → update proxy → provision infra → deploy → diagnostics → cutover. That's a 12+ hour critical path with no parallelism.
Instead, we designed five concurrent threads:
| Thread | Scope | Can Start When |
|---|---|---|
| A (Critical Path) | Baseline → Scaffold → Extract → Runtime → Streaming | Immediately |
| B (Infrastructure) | Azure provisioning, deploy workflow | Immediately (parallel with A) |
| C (Integration) | Proxy routing, frontend run-client | After A delivers runtime + SSE contract |
| D (Diagnostics) | Cost ledger, health monitoring | After B provisions resources |
| E (Cutover) | Dark deploy, brownout test, go-live | After A+B+C converge |
The key insight: infrastructure work (Thread B) has zero code dependencies on the AI logic (Thread A). You can provision Azure resources, set up managed identity, create storage primitives, and write deploy workflows while the extraction work is still happening. This saved hours.
Time ────────────────────────────────────────────────────────────────────────►
Thread A: [Phase 0][Phase 1][Phase 2a][Phase 2b][Phase 3][ Phase 5 ]
Thread B: [Phase 7-infra ][Phase 7-deploy]
Thread C: [P6-back][P6-front]
Thread D: [Phase 8 ]
Thread E: [Phase 9]
│
All threads
converge
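The entry gates from the thread table reduce to a small dependency check. A simplified sketch — it treats each gate as "thread X fully done," whereas the real gates were finer-grained (Thread C, for example, only needed A's SSE contract freeze, not all of A):

```javascript
// Thread dependency gates, simplified from the table above.
const GATES = {
  A: [],               // critical path, starts immediately
  B: [],               // infra, runs in parallel with A
  C: ['A'],            // needs A's runtime + SSE contract
  D: ['B'],            // needs B's provisioned resources
  E: ['A', 'B', 'C'],  // cutover only after convergence
};

function canStart(thread, completed) {
  return GATES[thread].every(dep => completed.includes(dep));
}
```

Writing the gates down this explicitly is what lets an orchestrating agent decide, at any moment, which threads are launchable instead of defaulting to sequential execution.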
The Integration Branch Model
With multiple threads writing code simultaneously, merge strategy matters. We used:
- Integration branch (`codex/ai-migration-main`) as the merge target
- Named feature branches per thread (`codex/ai-migration-a-phase3`, `codex/ai-migration-b-deploy`, etc.)
- Explicit merge order: Phase 1 first → Thread B infra → Thread A checkpoints in order → Thread C after SSE freeze → Thread E last
- Master updated only after verified integration checkpoints
This kept master stable throughout the entire migration.
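The explicit merge order can itself be mechanically checked before any merge lands. A sketch (the group labels are my shorthand for the ordering described above, not real branch names):

```javascript
// Merges into the integration branch must land strictly in this sequence.
const MERGE_ORDER = [
  'phase-1',
  'thread-b-infra',
  'thread-a-checkpoints',
  'thread-c-post-sse-freeze',
  'thread-e-cutover',
];

function mergeAllowed(group, alreadyMerged) {
  // A group may merge only once every earlier group has landed.
  const idx = MERGE_ORDER.indexOf(group);
  return MERGE_ORDER.slice(0, idx).every(g => alreadyMerged.includes(g));
}
```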
Act 3: The System Prompt
The runbook was the what. The system prompt was the how.
Here's the actual prompt I gave Codex (sensitive values redacted):
## Mission
Execute the AI service migration runbook at
docs/TODO_MigrateAIFeatureToStandaloneAIServince.md.
That document is the single source of truth for scope,
architecture, sequencing, gates, and success criteria.
## Execution Model - Parallel Threads
Spawn parallel agent threads with explicit sequencing:
- Thread A (critical path): Phase 0 → 1 → 2a → 2b → 3 → 5
- Thread B (infra/deploy): Phase 7 infra starts immediately.
Phase 7 deploy-workflow waits for Phase 1.
- Thread C: Phase 6 backend after Phase 3.
Phase 6 frontend after Phase 5 SSE contract freeze.
- Thread D: Phase 8 after Phase 7 + runtime telemetry stable.
- Thread E: Phase 9 after A+B+C converge.
Start Thread A and Thread B in parallel immediately.
## First Actions
1. Run required preflight commands (Section 6)
2. Begin Thread A Phase 0: baseline + characterization tests
3. Begin Thread B Phase 7 infra: Azure provisioning
4. Use integration branch codex/ai-migration-main
## Operating Rules
- Follow the runbook literally
- Update Migration Status table at every checkpoint
- Stop on stop gates
- Merge order matters
- One PR per bounded checkpoint
- No rewrites — this is a port
- Never commit or log secrets
## Do NOT Need Permission For
- Creating branches, PRs, running CLI commands
- Provisioning Azure resources, setting GitHub variables/secrets
- Making the connection-string fallback decision if managed
identity fails
## MUST Stop and Report If
- Any stop gate failure
- Any az or gh command fails due to auth/RBAC
- Any ambiguity where the runbook doesn't specify a clear path
- Any temptation to redesign rather than port
Design Principles Behind the Prompt
Three deliberate choices:
Point at the runbook, don't repeat it. The prompt says "that document is the single source of truth." Duplicating architecture context in the prompt creates drift — two sources that can disagree.
Explicit permission boundaries. The "do NOT need permission" section prevents the agent from stalling on confirmations for actions the runbook already authorizes. The "MUST stop" section catches the failure modes where human judgment is actually needed.
Parallel execution is called out explicitly. Without this, Codex defaults to sequential execution. You have to tell it to spawn concurrent work.
Act 4: The Miss — Worktree Isolation
Here's where I have to be honest about a gap.
The runbook specified parallel threads. The prompt told Codex to spawn them. But neither document specified how to physically isolate the parallel work. We described the logical dependency graph but not the physical isolation model.
When I checked on Codex's early progress, I found it was working sequentially in a single directory, mixing Phase 0 and Phase 7 changes in the same checkout. The parallel execution model existed on paper but not in practice.
This matters because git doesn't handle concurrent modifications to the same working directory gracefully. You need separate worktrees — independent checkouts of the repository where each thread can work without stepping on the others.
I had to inject a mid-session correction:
"Each parallel agent thread should work in its own git worktree with a named branch convention (e.g., codex/ai-migration-a-phase3). The orchestrating agent uses the primary checkout for coordination only."
Codex received this, stashed its in-progress work, split changes into proper branches, created worktrees, and adopted the model. It recovered well, but the gap shouldn't have existed.
Lesson learned: the DAG (dependency graph) is necessary but not sufficient. You need to specify the physical isolation model — worktrees, branch naming, merge-back protocol — not just the logical thread dependencies.
I've since added this to my standard checklist for any multi-agent execution plan.
Act 5: The Shadow — 12 Snoop Reports
While Codex worked, I periodically asked Claude to observe the codebase and report on progress. Claude had read-only access — it could inspect files, read git history, check PR statuses, and review code quality, but couldn't modify anything.
This created a powerful feedback loop: one AI executes, another AI reviews, and a human makes decisions based on both perspectives.
Here's the timeline of observations:
| Report | Time | Finding |
|---|---|---|
| #1-#5 | Early | Phase 0 + Phase 7 infra done. Worktree issue identified. |
| #6 | Mid | Phase 2b merged. 1,863 lines extracted into ai-core with parity tests. |
| #7 | Mid | Phase 3 worktree created but no commits yet (Codex working locally). |
| #8 | Mid | Phase 3 merged. 916-line runtime manager with full run state machine. |
| #9 | Late | Phase 5 + Phase 6 backend both merged. True streaming + proxy split. |
| #10 | Late | Phase 6 frontend merged. Durable run client with reconnect. |
| #11 | Final | All phases complete. 20 PRs merged. Integration promoted to master. |
| #12 | Post | Hotfix for missing AI_CHAT_BOOTSTRAP_SECRET on AI service. |
What the Shadow Caught
The snoop reports weren't just progress tracking. They caught real issues:
Lost artifact: The Phase 2a SDK migration audit was written to a gitignored directory. When the worktree was removed, the artifact disappeared. Flagged in report #6.
Session continuity gap: Codex sessions have quota limits. If a session expires mid-work, the next session needs to know where to pick up. We drafted a `HANDOVER.md` protocol — a structured file that captures current phase, what's done, what's blocked, branch state, and next action.

Code quality validation: Report #8 confirmed the Phase 3 runtime manager had proper admission control, watchdog timers, graceful shutdown, and queue drain — not just placeholder stubs.

Contract compliance: Report #10 verified the frontend was consuming the Phase 5 SSE contract correctly — `text(mode=delta)` for live tokens, `text(mode=snapshot)` for reconnect reconstruction.
Act 6: The Numbers
Timeline (March 31, 2026)
| Time | Event |
|---|---|
| 08:52 | First runbook commit |
| 09:31 | Codex starts — PR #4 (Phase 7 infra) |
| 09:41 | PR #5 (Phase 0 baseline) |
| 10:16 | PR #6 (Phase 1 workspace scaffold) |
| 12:41 | PR #12 (Phase 2a contracts) |
| 13:27 | PR #13 (Phase 2b extraction — 1,863 lines) |
| 14:29 | PR #14 (Phase 3 runtime — 916-line manager) |
| 15:08 | PR #15 (Phase 6 backend proxy split) |
| 15:28 | PR #16 (Phase 5 provider streaming + SSE freeze) |
| 15:59 | PR #17 (Phase 6 frontend run-client) |
| 17:30 | PR #18 (Phase 8 diagnostics) |
| 19:07 | PR #19 (Phase 9 cutover) |
| 19:10 | PR #20 (Master promotion) |
| Total | ~9.5 hours, 21 PRs, ~19,000 net lines |
Code Produced
| Component | Lines | Purpose |
|---|---|---|
| `ai-service/` | ~4,500 | Fastify runtime, routes, run manager, storage |
| `packages/ai-core/` | ~3,800 | Extracted AI logic, contracts, domain models |
| Infrastructure scripts | ~1,700 | Provisioning, verification, packaging |
| Deploy workflow | ~250 | CI/CD pipeline |
| Frontend changes | ~1,100 | Run-aware chat panel, API client |
| Tests | ~1,500 | Characterization, parity, integration, contract |
| Remaining | ~6,000+ | Internalized API libs, domain model moves, etc. |
What Got Deployed
- Standalone Fastify AI service on `app-name`
- Dedicated App Service plan, storage account, managed identity
- Express proxy splitting `/api/ai` to the AI service, `/api` to Functions
- GitHub Actions deploy workflow with pre/post-deploy health probes
- Run-aware frontend with reconnect, cancel, and status polling
Act 7: The Hotfix
Twenty PRs merged. Three deploy workflows green. Dark deploy verified. Brownout rollback proven.
Then I opened the chat panel and got "Access Denied."
The AI_CHAT_BOOTSTRAP_SECRET — a shared secret used to verify frontend bootstrap tokens — was configured on the Azure Functions app but had never been provisioned to the new AI service. The dark deploy health probes (which test /healthz and /readyz) didn't catch it because they don't send authenticated chat requests.
Codex fixed it in one commit: added the secret to the provisioning script, the verification checklist, and the readiness probe. PR #21 merged and deployed.
Lesson: Dark deploys verify infrastructure health, not end-to-end user flows. You need both.
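That lesson translates into two different kinds of checks. A hedged sketch — endpoint paths come from this article, but the request shape and helper names are assumptions, and `fetchFn` is injected so the logic is testable without a network:

```javascript
// A health probe proves the process is up. It never exercises auth,
// which is exactly why it missed the unprovisioned secret.
async function healthProbe(base, fetchFn = fetch) {
  const res = await fetchFn(`${base}/healthz`);
  return res.ok;
}

// An end-to-end smoke test exercises the real user path. This is the
// kind of check that would have caught the missing
// AI_CHAT_BOOTSTRAP_SECRET before a human opened the chat panel.
async function smokeTestChat(base, token, fetchFn = fetch) {
  const res = await fetchFn(`${base}/api/ai/chat`, {
    method: 'POST',
    headers: { authorization: `Bearer ${token}` },
    body: JSON.stringify({ message: 'ping' }),
  });
  return res.status !== 401 && res.status !== 403; // auth actually worked
}
```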
What I Learned About Multi-Model Development
The Reviewer/Executor Pattern
The most powerful pattern I discovered is using two AI models in complementary roles:
Model A (Claude Opus): Planner, reviewer, and observer. It reviews the plan before execution, monitors progress during execution, and validates quality after execution. It never touches the code directly.
Model B (Codex): Autonomous executor. It receives a plan and system prompt, then works independently — creating branches, writing code, provisioning infrastructure, creating PRs, and deploying.
This separation of concerns mirrors how engineering teams work: architects review, developers execute, and neither role is diminished by the other.
How to Write Plans for Autonomous Agents
Be exhaustive about context. Every file path, resource name, dependency version, and API contract. Agents can't ask clarifying questions mid-execution the way humans can.
Define stop gates, not just tasks. "Do X" is a task. "Do X, but stop if Y is true" is a gate. Gates prevent agents from charging past failure points.
Specify physical isolation, not just logical dependencies. Parallel threads need worktrees, branch naming conventions, and merge protocols — not just a DAG.
Include permission boundaries. "You can do X without asking" prevents stalling. "You must stop for Y" prevents runaway execution.
Plan for session turnover. Long-running agents hit quota limits. Build in handover protocols — structured files that capture state for the next session.
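A handover file can be as simple as a serialized state object. An illustrative shape — the field names here are my assumptions, not the actual HANDOVER.md protocol from the project:

```javascript
// Minimal session-handover payload: enough for a fresh agent session
// to resume without re-deriving state from git archaeology.
function renderHandover(h) {
  return [
    '# HANDOVER',
    `Current phase: ${h.phase}`,
    `Branch: ${h.branch}`,
    `Done: ${h.done.join('; ')}`,
    `Blocked: ${h.blocked.length ? h.blocked.join('; ') : 'nothing'}`,
    `Next action: ${h.next}`,
  ].join('\n');
}
```

The point isn't the format; it's that the file is written by the outgoing session and read first by the incoming one, making turnover an explicit protocol instead of an accident.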
How to Monitor Autonomous Agents
The "snoop" pattern — having a second model periodically observe the codebase — turned out to be valuable for three reasons:
Progress tracking without interruption. The observer reads git history and file state without disturbing the executor.
Quality validation in real-time. The observer can assess whether extracted code maintains behavioral parity, whether contracts are properly frozen, and whether test coverage is meaningful.
Issue detection before merge. The observer caught the lost SDK audit artifact and the session continuity gap before they became blocking problems.
The Human's Role Changes, But Doesn't Disappear
I didn't write 19,000 lines of code. But I:
- Designed the migration architecture
- Wrote a 2,100-line runbook that eliminated ambiguity
- Reviewed the plan across three iterations with an AI partner
- Identified the worktree isolation gap and injected a correction mid-flight
- Monitored execution through 12 observation cycles
- Made the "big bang, fix forward" cutover decision
- Performed the first manual end-to-end test that found the bootstrap secret issue
The human role shifts from writing code to writing specifications precise enough that code writes itself. That's a different skill, but it's still a skill — and it's the skill that determined whether this project took 9 hours or 9 days.
Try This Yourself
If you want to experiment with multi-model development:
Start with a runbook, not a prompt. Write down everything the agent needs to know before it starts. If you find yourself thinking "it'll figure that out," write it down instead.
Use one model to review the plan before another executes it. The reviewer will catch ambiguities, missing gates, and underspecified contracts.
Set up a shadow observer. Ask a model to periodically read your codebase and report on the executor's progress. You'll catch issues faster than waiting for CI failures.
Plan for failure. Include stop gates, rollback strategies, and session handover protocols. Autonomous agents don't get tired, but they do get stuck — and they need structured ways to communicate that.
Test the real user path. Health probes are necessary but not sufficient. Someone needs to open the browser and click the button.
The future of software development isn't AI replacing developers. It's developers who can orchestrate AI systems effectively building things that would have been impractical before.
The runbook and system prompt used in this project are shared below for anyone who wants to adapt this approach.
Appendix: The Full Runbook
Below is the complete 2,100-line migration runbook I wrote to drive this project — the single document that governed every phase, gate, and decision. All Azure resource names and company identifiers have been redacted. The structure, sequencing, execution model, and stop gates are authentic.
This is the actual artifact that Codex followed autonomously for 9.5 hours. If you want to adapt this approach for your own projects, this is your template.
The original is approximately 2,100 lines covering 22 sections.
Mo Khan is just an old-timer engineer-turned-manager who left coding a long time ago and is now having so much fun learning again and building with AI tools, with a special interest in AI-augmented development workflows, cloud architecture, and autonomous agent orchestration.
