Wednesday, 1 April 2026

How I Used Two AI Models to Re-Architect a Production Backend in 9.5 Hours

One planned. One executed entirely autonomously as lead AI agent coordinator, managing 3-4 parallel agents. A human watched both and steered. Here's what happened.



The Problem Nobody Warns You About: Your Fast-Built MVP

You build an AI feature on a shoestring budget, leaning on free Azure tiers. In my case that meant a Static Web App (which I later migrated to a standalone service; that's a future blog post). It works in dev. It works in staging. Then it hits production, and the platform underneath it starts fighting you.

Our AI chat feature — a financial intelligence analyst chatbot built on Claude — was deployed inside Azure Functions behind a Static Web App proxy. On paper, it worked. In practice:

  • 45-second hard timeout from the SWA proxy killed any complex analysis
  • HTTP 500 errors appeared randomly under load
  • No reconnection — if your browser tab hiccupped, your 30-second AI response was gone
  • Request-bound execution — the AI's "life" was the HTTP request. When the request died, so did the AI

The feature was unreliable enough that users had learned to retry. That's a product failure.

I decided to rip it out of Azure Functions entirely and move it to a standalone always-on service. Not a rewrite — a port. Same AI logic, new runtime.

The question was: could I plan this migration in a way that an autonomous AI agent could execute it end-to-end?


The Cast

This project involved three actors:

Me — the architect and orchestrator. I wrote the runbook, designed the execution plan, reviewed every decision, and verified the final result. A taste of the future role of the SDM (Software Development Manager).

Claude (Opus) — my planning partner and shadow observer. Claude reviewed the runbook across three iterations, helped me design the parallel execution strategy, drafted the system prompt for the executor, and then spent the entire execution day "snooping" on progress — reading the codebase in real-time and reporting back.

Codex — the autonomous executor and master agent coordinator. Codex received the runbook and a carefully crafted system prompt, then worked for ~9.5 hours straight, spawning agents on demand, producing 21 PRs and ~19,000 lines of code, provisioning Azure infrastructure, and deploying to production. Codex effectively did the job of a senior engineer leading a team of 3-4.

┌─────────────┐     reviews/plans       ┌─────────────┐
│             │◄───────────────────────►│             │
│   Human     │     system prompt       │   Claude    │
│  Architect  │────────────────────────►│  (Opus)     │
│             │     snoop reports       │  Reviewer   │
│             │◄────────────────────────│    Planner  │
└──────┬──────┘                         └─────┬───────┘
       │                                      │
       │  runbook + prompt                    │ reads codebase
       │                                      │ (read-only)
       ▼                                      ▼
┌─────────────────────────────────────────────────────┐
│                                                     │
│                    Codex                            │
│               Autonomous Executor                   │
│                                                     │
│   Thread A ─── Thread B ─── Thread C ─── Thread D   │
│  (critical)    (infra)      (proxy)    (diagnostics)│
│                                                     │
│              21 PRs │ ~19K lines │ 9.5 hours        │
└─────────────────────────────────────────────────────┘

Act 1: The Runbook

Why a Runbook, Not a Prompt

Most people using AI coding agents write prompts. I wrote a runbook — a 2,100-line execution plan that reads more like an engineering specification than a chat message.

Why? Because autonomous agents fail in predictable ways:

  1. Scope drift — they start "improving" things you didn't ask for
  2. Missing context — they make reasonable-sounding decisions based on wrong assumptions
  3. No gates — they charge ahead past failure points without stopping
  4. Sequential thinking — they do everything in order even when work can be parallelized

The runbook addressed all four:

  • Explicit scope fences: "Port, do not rewrite. This is a runtime extraction, not a feature redesign."
  • Complete architecture context: every Azure resource name, every file path, every dependency version, every API contract
  • Stop gates at every phase: "Stop and escalate if any of the following is true..."
  • Parallel thread definitions: five named threads with explicit entry gates and exit artifacts

The runbook went through three review rounds with Claude before I was satisfied. Each round caught real issues:

Round 1 identified that the parallel thread definitions were missing concrete entry/exit gates — they said what work to do, but not what must be true before starting or what artifact proves completion.

Round 2 caught that the integration-branch merge strategy was underspecified. Without explicit merge ordering, parallel agents would create conflicting merges.

Round 3 hardened the execution contract: what the agent can do without permission, what requires stopping and reporting, and how to handle the managed-identity-vs-connection-string decision for Cosmos DB.

The Migration Architecture

The target was clean:

Before:
  Browser → SWA Proxy → Azure Functions (aiChat.js monolith)
                         ↳ 3,356 lines, one file
                         ↳ 45-second timeout
                         ↳ request-bound execution

After:
  Browser → App Service (Express proxy)
              ├── /api/ai/* → Standalone AI Service (Fastify)
              │                ↳ Durable runs (survive disconnects)
              │                ↳ True SSE streaming
              │                ↳ Reconnect by runId
              │                ↳ No hard timeouts
              └── /api/*    → Azure Functions (unchanged)

The AI logic itself — prompts, tools, model selection, orchestration — would be extracted into a shared package (../ai-core) and reused by the new runtime. Port, don't rewrite.
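The proxy's routing rule in the "After" picture is simple enough to sketch. A minimal, hypothetical version of the path split — the base URLs are placeholders, not the real resource names:

```javascript
// Hypothetical sketch of the proxy's routing split. The real server.js is
// an Express proxy; these upstream base URLs are placeholders.
const AI_SERVICE_BASE = 'https://ai-service.example.azurewebsites.net';
const FUNCTIONS_BASE = 'https://functions.example.azurewebsites.net';

// /api/ai/* goes to the standalone AI service; everything else under
// /api/* stays on Azure Functions, unchanged.
function resolveUpstream(path) {
  if (path.startsWith('/api/ai/')) return AI_SERVICE_BASE;
  if (path.startsWith('/api/')) return FUNCTIONS_BASE;
  return null; // not proxied
}
```

With this split, `resolveUpstream('/api/ai/chat')` lands on the AI service while `resolveUpstream('/api/budgets')` still hits Functions — the non-AI APIs never notice the migration happened.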


The Architecture: Before and After

To appreciate what changed, you need to see what we were working with — and what we built to replace it.

Before: The Monolith

The entire AI feature lived in one Azure Functions HTTP handler. One file. 3,356 lines. Everything from authentication to prompt construction to tool execution to streaming — all coupled to the Azure Functions request lifecycle.

┌─────────────────────────────────────────────────────────────────────┐
│                        BROWSER                                      │
│   AIChatPanel.jsx ──POST /api/ai/chat──►                            │
│                                         │                           │
│   ◄──── raw SSE text stream ────────────┘                           │
│   (no reconnect, no runId, no resume)                               │
└─────────────────────────┬───────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────────────┐
│                  SWA / App Service Proxy                            │
│                  ┌──────────────────┐                               │
│                  │  45-second hard  │                               │
│                  │  timeout on ALL  │                               │
│                  │  proxied requests│                               │
│                  └──────────────────┘                               │
│                  /api/* ──► Azure Functions                         │
└─────────────────────────┬───────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────────────┐
│              Azure Functions  (app-name-backend-api)                │
│                                                                     │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                  aiChat.js  (3,356 lines)                      │ │
│  │                                                                │ │
│  │  ┌──────────┐  ┌───────────────┐  ┌─────────────────────────┐  │ │
│  │  │  Auth    │  │  System       │  │  Anthropic SDK          │  │ │
│  │  │ Bootstrap│  │  Prompt       │  │  (buffered call)        │  │ │
│  │  │ Validate │  │  Builder      │  │                         │  │ │
│  │  └────┬─────┘  └───────┬───────┘  │  ┌───────────────────┐  │  │ │
│  │       │                │          │  │  Tool Dispatch    │  │  │ │
│  │       ▼                ▼          │  │  Loop             │  │  │ │
│  │  ┌──────────┐  ┌───────────────┐  │  │                   │  │  │ │
│  │  │  Feature │  │  Question     │  │  │  Claude response  │  │  │ │
│  │  │  Toggle  │  │  Classifier   │  │  │  ──► tool_use?    │  │  │ │
│  │  │  Check   │  │  Telemetry    │  │  │  ──► execute      │  │  │ │
│  │  └──────────┘  └───────────────┘  │  │  ──► feed back    │  │  │ │
│  │                                   │  │  ──► loop         │  │  │ │
│  │  ┌──────────┐  ┌───────────────┐  │  └───────────────────┘  │  │ │
│  │  │  Model   │  │  Tool Profile │  │                         │  │ │
│  │  │ Selection│  │  Resolution   │  │  request dies = AI dies │  │ │
│  │  │ Policy   │  │   Routing     │  │  no run state persisted │  │ │
│  │  └──────────┘  └───────────────┘  └─────────────────────────┘  │ │
│  │                                                                │ │
│  │  ┌─────────────────────────────────────────────────────────┐   │ │
│  │  │              Tool Executor (22 tools)                   │   │ │
│  │  │  query_products │ query_customers │ query_actuals       │   │ │
│  │  │  run_financial_model │ run_services_model               │   │ │
│  │  │  compare_fiscal_years │ get_margin_analysis             │   │ │
│  │  │  get_customer_concentration │ generate_chart            │   │ │
│  │  │  run_what_if_simulation │ ...12 more                    │   │ │
│  │  └─────────────────────────┬───────────────────────────────┘   │ │
│  └────────────────────────────┼───────────────────────────────────┘ │
│                               │                                     │
│                               ▼                                     │
│                    ┌─────────────────────┐                          │
│                    │     Cosmos DB       │                          │
│                    │  (products, budgets,│                          │
│                    │   actuals, configs, │                          │
│                    │   users, groups)    │                          │
│                    └─────────────────────┘                          │
└─────────────────────────────────────────────────────────────────────┘

Problems:
  ✗ Request-bound: browser disconnect kills the AI mid-thought
  ✗ No run persistence: nothing survives a dropped connection
  ✗ Buffered streaming: text arrives in chunks after provider finishes
  ✗ 45s proxy timeout: complex multi-tool analyses get killed
  ✗ No reconnect: lose your tab, lose your answer
  ✗ Monolith coupling: auth, prompt, tools, streaming all in one file

After: The Decomposed Architecture

The new architecture separates concerns across three layers: a proxy that routes traffic, a standalone AI runtime that manages durable runs, and an extracted core package that holds all the AI logic.

┌─────────────────────────────────────────────────────────────────────┐
│                        BROWSER                                      │
│                                                                     │
│   AIChatPanel.jsx                                                   │
│   │                                                                 │
│   ├── POST /api/ai/chat ──────────────► start run, get runId        │
│   │   ◄── SSE stream (attach immediately)                           │
│   │                                                                 │
│   ├── GET /api/ai/runs/:runId/stream ─► reconnect to live run       │
│   │   ◄── SSE replay (snapshot) + live deltas                       │
│   │                                                                 │
│   ├── GET /api/ai/runs/:runId ────────► poll status; final result   │
│   │                                                                 │
│   ├── POST /api/ai/runs/:runId/cancel ► user-initiated cancel       │
│   │                                                                 │
│   └── POST /api/ai/chat/feedback ─────► thumbs up/down on run       │
│                                                                     │
│   Client persists runId in React state                              │
│   Disconnect ≠ cancel (run continues server-side)                   │
└─────────────────────────┬───────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────────────┐
│           Web App Service  (Express proxy — server.js)              │
│                                                                     │
│   ┌─────────────────┐        ┌──────────────────────┐               │
│   │  /api/ai/*      │        │  /api/*              │               │
│   │  ──► AI Service │        │  ──► Azure Functions │               │
│   │  (dedicated)    │        │  (unchanged)         │               │
│   └────────┬────────┘        └──────────┬───────────┘               │
│            │                            │                           │
│   x-internal-proxy-secret      x-internal-proxy-secret              │
│   x-ms-client-principal        x-ms-client-principal                │
└────────────┬────────────────────────────┬───────────────────────────┘
             │                            │
             ▼                            ▼
┌────────────────────────────┐  ┌─────────────────────────┐
│  AI App Service (Fastify)  │  │  Azure Functions        │
│                            │  │  (non-AI APIs)          │
│                            │  │  /api/me                │
│  Dedicated B1 plan         │  │  /api/budgets           │
│  Always-on process         │  │  /api/products          │
│  No hard timeouts          │  │  /api/definitions       │
│                            │  │  etc.                   │
│  ┌──────────────────────┐  │  └─────────────────────────┘
│  │    Route Layer       │  │
│  │  /api/ai/chat        │  │
│  │  /api/ai/runs/:id    │  │
│  │  /api/ai/runs/:id/   │  │
│  │    stream            │  │
│  │  /api/ai/runs/:id/   │  │
│  │    cancel            │  │
│  │  /api/ai/chat/       │  │
│  │    feedback          │  │
│  │  /healthz  /readyz   │  │
│  └──────────┬───────────┘  │
│             │              │
│             ▼              │
│  ┌──────────────────────┐  │
│  │   Runtime Manager    │  │
│  │                      │  │
│  │  Admission Control:  │  │
│  │   2 active/user      │  │
│  │   2 queued/user      │  │
│  │   4 active global    │  │
│  │   8 queued global    │  │
│  │                      │  │
│  │  Run State Machine:  │  │
│  │   queued ──► running │  │
│  │     │         │  │   │  │
│  │     │         ▼  ▼   │  │
│  │     │    completed   │  │
│  │     │    failed      │  │
│  │     ▼    cancelled   │  │
│  │  cancelled_by_admin  │  │
│  │                      │  │
│  │  Watchdog: 10 min    │  │
│  │  Shutdown: 15s grace │  │
│  └──────────┬───────────┘  │
│             │              │
│             ▼              │
│  ┌──────────────────────┐  │
│  │   SSE Event Stream   │  │
│  │                      │  │
│  │  Contract (Phase 5): │  │
│  │   ready ──► status   │  │
│  │   ──► text(delta)    │  │
│  │   ──► tool_start     │  │
│  │   ──► tool_end       │  │
│  │   ──► text(delta)    │  │
│  │   ──► chart          │  │
│  │   ──► follow_ups     │  │
│  │   ──► done           │  │
│  │                      │  │
│  │  Reconnect mode:     │  │
│  │   text(snapshot)     │  │
│  │   + live deltas      │  │
│  └──────────┬───────────┘  │
│             │              │
│             ▼              │
│  ┌──────────────────────┐  │
│  │  Legacy AI Bridge    │  │     ┌──────────────────────────┐
│  │  (adapter pattern)   │──┼────►│        /ai-core          │
│  │                      │  │     │  (workspace package)     │
│  │  Wraps existing AI   │  │     │                          │
│  │  chat logic with     │  │     │  contracts/              │
│  │  new runtime hooks   │  │     │   sseEvents.js           │
│  └──────────────────────┘  │     │   interfaces.js          │
│                            │     │   uiContext.js           │
│                            │     │   chatFeedback.js        │
│                            │     │                          │
│                            │     │  prompts/                │
│                            │     │   systemPrompt.js        │
│                            │     │                          │
│                            │     │  tools/                  │
│                            │     │   definitions.js (22)    │
│                            │     │   anthropic.js           │
│                            │     │                          │
│                            │     │  orchestration/          │
│                            │     │   questionTelemetry.js   │
│                            │     │   modelSelection.js      │
│                            │     │   toolProfiles.js        │
│                            │     │                          │
│                            │     │  domain/                 │
│                            │     │   financialModel.js      │
│                            │     │   servicesModel.js       │
│                            │     └──────────────────────────┘
│                            │
│  ┌──────────────────────┐  │
│  │  Data Access Layer   │  │
│  │  (injected adapters) │  │
│  │                      │  │
│  │  getGroupDoc()       │  │
│  │  hasAiPermission()   │  │
│  │  getUserProfile()    │  │
│  │  canViewBu()         │  │
│  │  probeReadiness()    │  │
│  └──────────┬───────────┘  │
│             │              │
└─────────────┼──────────────┘
              │
      ┌───────┴────────┐
      │                │
      ▼                ▼
┌───────────┐  ┌──────────────────────────────────┐
│ Cosmos DB │  │  Azure Storage                   │
│           │  │                                  │
│ products  │  │                                  │
│ customers │  │  Table:  AiRuns (run metadata)   │
│ actuals   │  │  Queue:  ai-run-dispatch         │
│ configs   │  │  Blob:   ai-run-events           │
│ users     │  │  Blob:   ai-run-snapshots        │
│ groups    │  │  Blob:   ai-run-transcripts      │
│ rbac      │  │                                  │
└───────────┘  └──────────────────────────────────┘

Improvements:
  ✓ Durable runs: AI continues even if browser disconnects
  ✓ Reconnect by runId: pick up where you left off
  ✓ True SSE streaming: tokens arrive as they're generated
  ✓ No proxy timeout: dedicated service, no 45s wall
  ✓ Admission control: queuing, per-user limits, watchdog
  ✓ Graceful shutdown: in-flight runs drain before restart
  ✓ Separated concerns: proxy / runtime / core / storage
  ✓ Dependency injection: testable without Azure
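The run state machine in the diagram reduces to a small transition table. A hypothetical sketch — the real Runtime Manager also enforces the admission limits and watchdog shown above:

```javascript
// Hypothetical sketch of the Runtime Manager's run state machine.
// Terminal states have no outgoing transitions.
const TRANSITIONS = {
  queued: ['running', 'cancelled', 'cancelled_by_admin'],
  running: ['completed', 'failed', 'cancelled', 'cancelled_by_admin'],
  completed: [],
  failed: [],
  cancelled: [],
  cancelled_by_admin: [],
};

function transition(run, next) {
  if (!TRANSITIONS[run.state].includes(next)) {
    throw new Error(`illegal transition ${run.state} -> ${next}`);
  }
  return { ...run, state: next };
}
```

Making illegal transitions throw is what keeps a watchdog timeout, a user cancel, and an admin cancel from racing each other into an inconsistent run record.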

The Run Lifecycle

This is the part that makes the new architecture fundamentally different. A "run" is now a first-class entity that exists independently of any HTTP connection.

  Client                    AI Service                     Storage
    │                          │                              │
    │  POST /api/ai/chat       │                              │
    │─────────────────────────►│                              │
    │                          │  create run (queued)         │
    │                          │─────────────────────────────►│
    │                          │                              │
    │  ◄─ SSE: ready{runId} ── │                              │
    │  ◄─ SSE: status{...}  ── │  transition: running         │
    │                          │─────────────────────────────►│
    │                          │                              │
    │                          │  ┌─────────────────────┐     │
    │  ◄─ SSE: text{delta} ─── │  │  Claude streaming   │     │
    │  ◄─ SSE: text{delta} ─── │  │  response arrives   │     │
    │  ◄─ SSE: text{delta} ─── │  │  token by token     │     │
    │                          │  └─────────────────────┘     │
    │                          │                              │
    │  ◄─ SSE: tool_start ──── │  Claude requests tool_use    │
    │                          │  execute tool (Cosmos query) │
    │  ◄─ SSE: tool_end ────── │  feed result back to Claude  │
    │                          │                              │
    │  ◄─ SSE: text{delta} ─── │  Claude continues response   │
    │  ◄─ SSE: text{delta} ─── │                              │
    │                          │                              │
    │       ╔═══════════════╗  │                              │
    │       ║ DISCONNECT!   ║  │                              │
    │       ║ tab closed    ║  │  run keeps going...          │
    │       ╚═══════════════╝  │  events persisted to blob    │
    │                          │─────────────────────────────►│
    │                          │                              │
    │  (reconnect)             │                              │
    │  GET /runs/:id/stream    │                              │
    │─────────────────────────►│  replay from storage         │
    │                          │◄─────────────────────────────│
    │  ◄─ SSE: text{snapshot}─ │  full text so far            │
    │  ◄─ SSE: text{delta} ─── │  live deltas resume          │
    │                          │                              │
    │  ◄─ SSE: follow_ups ──── │                              │
    │  ◄─ SSE: done ────────── │  transition: completed       │
    │                          │─────────────────────────────►│
    │                          │  archive transcript          │
    │                          │─────────────────────────────►│
    │                          │                              │

The key insight: the run outlives the connection. When you disconnect, the AI doesn't stop thinking. When you reconnect, you get a snapshot of everything that happened while you were away, followed by live deltas. This is what makes AI chat feel like a real product instead of a fragile demo.
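Reconnect reconstruction is the piece that makes this feel seamless on the client. A minimal sketch, assuming the snapshot/delta text modes described above (the event shape here is my simplification of the SSE contract):

```javascript
// Minimal sketch of client-side text reconstruction across a reconnect.
// A snapshot replaces everything accumulated so far; a delta appends.
function applyTextEvent(current, event) {
  if (event.mode === 'snapshot') return event.text; // full text so far
  if (event.mode === 'delta') return current + event.text; // live token
  return current; // ignore non-text events
}

// Example stream: live deltas, a disconnect, then snapshot + resumed deltas.
const events = [
  { mode: 'delta', text: 'Revenue ' },
  { mode: 'delta', text: 'grew ' },
  // ...disconnect; on reconnect the server replays a snapshot...
  { mode: 'snapshot', text: 'Revenue grew 12% ' },
  { mode: 'delta', text: 'year over year.' },
];
const text = events.reduce(applyTextEvent, '');
// text === 'Revenue grew 12% year over year.'
```

Because a snapshot is a full replacement rather than an append, the client never has to reason about which deltas it missed while disconnected.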


Act 2: Designing the Parallel Execution Plan

This is where Claude earned its keep as a planning partner.

A naive sequential execution would have looked like: baseline → scaffold → extract → build runtime → add streaming → update proxy → provision infra → deploy → diagnostics → cutover. That's a 12+ hour critical path with no parallelism.

Instead, we designed five concurrent threads:

Thread              Scope                                                Can Start When
A (Critical Path)   Baseline → Scaffold → Extract → Runtime → Streaming  Immediately
B (Infrastructure)  Azure provisioning, deploy workflow                  Immediately (parallel with A)
C (Integration)     Proxy routing, frontend run-client                   After A delivers runtime + SSE contract
D (Diagnostics)     Cost ledger, health monitoring                       After B provisions resources
E (Cutover)         Dark deploy, brownout test, go-live                  After A+B+C converge
The key insight: infrastructure work (Thread B) has zero code dependencies on the AI logic (Thread A). You can provision Azure resources, set up managed identity, create storage primitives, and write deploy workflows while the extraction work is still happening. This saved hours.

Time ────────────────────────────────────────────────────────────────────────►

Thread A: [Phase 0][Phase 1][Phase 2a][Phase 2b][Phase 3][ Phase 5 ]
Thread B: [Phase 7-infra  ][Phase 7-deploy]
Thread C:                                       [P6-back][P6-front]
Thread D:                                                [Phase 8 ]
Thread E:                                                          [Phase 9]
                                                                      │
                                                              All threads
                                                               converge
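The entry gates in this schedule are mechanical enough to check in code. A hypothetical sketch — the milestone names here are my abbreviations of the table above, not the runbook's actual gate identifiers:

```javascript
// Hypothetical sketch: which threads may start, given completed milestones.
// Gate predicates paraphrase the thread table above.
const GATES = {
  A: () => true, // critical path starts immediately
  B: () => true, // infra starts immediately, parallel with A
  C: (done) => done.has('A:runtime') && done.has('A:sse-contract'),
  D: (done) => done.has('B:infra'),
  E: (done) => done.has('A:done') && done.has('B:done') && done.has('C:done'),
};

function startableThreads(done) {
  return Object.keys(GATES).filter((thread) => GATES[thread](done));
}
```

At kickoff (`new Set()`) only A and B are startable, which is exactly why the system prompt tells the executor to launch those two immediately.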

The Integration Branch Model

With multiple threads writing code simultaneously, merge strategy matters. We used:

  1. Integration branch (codex/ai-migration-main) as the merge target
  2. Named feature branches per thread (codex/ai-migration-a-phase3, codex/ai-migration-b-deploy, etc.)
  3. Explicit merge order: Phase 1 first → Thread B infra → Thread A checkpoints in order → Thread C after SSE freeze → Thread E last
  4. Master updated only after verified integration checkpoints

This kept master stable throughout the entire migration.


Act 3: The System Prompt

The runbook was the what. The system prompt was the how.

Here's the actual prompt I gave Codex (sensitive values redacted):

## Mission
Execute the AI service migration runbook at
docs/TODO_MigrateAIFeatureToStandaloneAIServince.md.
That document is the single source of truth for scope,
architecture, sequencing, gates, and success criteria.

## Execution Model - Parallel Threads
Spawn parallel agent threads with explicit sequencing:

- Thread A (critical path): Phase 0 → 1 → 2a → 2b → 3 → 5
- Thread B (infra/deploy): Phase 7 infra starts immediately.
  Phase 7 deploy-workflow waits for Phase 1.
- Thread C: Phase 6 backend after Phase 3.
  Phase 6 frontend after Phase 5 SSE contract freeze.
- Thread D: Phase 8 after Phase 7 + runtime telemetry stable.
- Thread E: Phase 9 after A+B+C converge.

Start Thread A and Thread B in parallel immediately.

## First Actions
1. Run required preflight commands (Section 6)
2. Begin Thread A Phase 0: baseline + characterization tests
3. Begin Thread B Phase 7 infra: Azure provisioning
4. Use integration branch codex/ai-migration-main

## Operating Rules
- Follow the runbook literally
- Update Migration Status table at every checkpoint
- Stop on stop gates
- Merge order matters
- One PR per bounded checkpoint
- No rewrites — this is a port
- Never commit or log secrets

## Do NOT Need Permission For
- Creating branches, PRs, running CLI commands
- Provisioning Azure resources, setting GitHub variables/secrets
- Making the connection-string fallback decision if managed
  identity fails

## MUST Stop and Report If
- Any stop gate failure
- Any az or gh command fails due to auth/RBAC
- Any ambiguity where the runbook doesn't specify a clear path
- Any temptation to redesign rather than port

Design Principles Behind the Prompt

Three deliberate choices:

  1. Point at the runbook, don't repeat it. The prompt says "that document is the single source of truth." Duplicating architecture context in the prompt creates drift — two sources that can disagree.

  2. Explicit permission boundaries. The "do NOT need permission" section prevents the agent from stalling on confirmations for actions the runbook already authorizes. The "MUST stop" section catches the failure modes where human judgment is actually needed.

  3. Parallel execution is called out explicitly. Without this, Codex defaults to sequential execution. You have to tell it to spawn concurrent work.


Act 4: The Miss — Worktree Isolation

Here's where I have to be honest about a gap.

The runbook specified parallel threads. The prompt told Codex to spawn them. But neither document specified how to physically isolate the parallel work. We described the logical dependency graph but not the physical isolation model.

When I checked on Codex's early progress, I found it was working sequentially in a single directory, mixing Phase 0 and Phase 7 changes in the same checkout. The parallel execution model existed on paper but not in practice.

This matters because git doesn't handle concurrent modifications to the same working directory gracefully. You need separate worktrees — independent checkouts of the repository where each thread can work without stepping on the others.

I had to inject a mid-session correction:

"Each parallel agent thread should work in its own git worktree with a named branch convention (e.g., codex/ai-migration-a-phase3). The orchestrating agent uses the primary checkout for coordination only."

Codex received this, stashed its in-progress work, split changes into proper branches, created worktrees, and adopted the model. It recovered well, but the gap shouldn't have existed.

Lesson learned: the DAG (dependency graph) is necessary but not sufficient. You need to specify the physical isolation model — worktrees, branch naming, merge-back protocol — not just the logical thread dependencies.

I've since added this to my standard checklist for any multi-agent execution plan.


Act 5: The Shadow — 12 Snoop Reports

While Codex worked, I periodically asked Claude to observe the codebase and report on progress. Claude had read-only access — it could inspect files, read git history, check PR statuses, and review code quality, but couldn't modify anything.

This created a powerful feedback loop: one AI executes, another AI reviews, and a human makes decisions based on both perspectives.

Here's the timeline of observations:

Report  Time   Finding
#1-#5   Early  Phase 0 + Phase 7 infra done. Worktree issue identified.
#6      Mid    Phase 2b merged. 1,863 lines extracted into ai-core with parity tests.
#7      Mid    Phase 3 worktree created but no commits yet (Codex working locally).
#8      Mid    Phase 3 merged. 916-line runtime manager with full run state machine.
#9      Late   Phase 5 + Phase 6 backend both merged. True streaming + proxy split.
#10     Late   Phase 6 frontend merged. Durable run client with reconnect.
#11     Final  All phases complete. 20 PRs merged. Integration promoted to master.
#12     Post   Hotfix for missing AI_CHAT_BOOTSTRAP_SECRET on AI service.

What the Shadow Caught

The snoop reports weren't just progress tracking. They caught real issues:

  1. Lost artifact: The Phase 2a SDK migration audit was written to a gitignored directory. When the worktree was removed, the artifact disappeared. Flagged in report #6.

  2. Session continuity gap: Codex sessions have quota limits. If a session expires mid-work, the next session needs to know where to pick up. We drafted a HANDOVER.md protocol — a structured file that captures current phase, what's done, what's blocked, branch state, and next action.

  3. Code quality validation: Report #8 confirmed the Phase 3 runtime manager had proper admission control, watchdog timers, graceful shutdown, and queue drain — not just placeholder stubs.

  4. Contract compliance: Report #10 verified the frontend was consuming the Phase 5 SSE contract correctly — text(mode=delta) for live tokens, text(mode=snapshot) for reconnect reconstruction.
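The HANDOVER.md protocol from finding #2 ended up as a short structured file. A hypothetical sketch of its shape (the phase, PR, and branch details below are illustrative, not a real handover):

```markdown
# HANDOVER — AI Migration (session N)

## Current phase
Thread A, Phase 3 (runtime manager)

## Done
- Phase 0 baseline merged (PR #5)

## Blocked
- (none)

## Branch state
- integration: codex/ai-migration-main @ <sha>
- in-flight: codex/ai-migration-a-phase3 (uncommitted work stashed: no)

## Next action
- Implement watchdog timer, then open checkpoint PR
```

The point is that a fresh session can resume from this file alone, without replaying the expired session's context.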


Act 6: The Numbers

Timeline (March 31, 2026)

Time   Event
08:52  First runbook commit
09:31  Codex starts — PR #4 (Phase 7 infra)
09:41  PR #5 (Phase 0 baseline)
10:16  PR #6 (Phase 1 workspace scaffold)
12:41  PR #12 (Phase 2a contracts)
13:27  PR #13 (Phase 2b extraction — 1,863 lines)
14:29  PR #14 (Phase 3 runtime — 916-line manager)
15:08  PR #15 (Phase 6 backend proxy split)
15:28  PR #16 (Phase 5 provider streaming + SSE freeze)
15:59  PR #17 (Phase 6 frontend run-client)
17:30  PR #18 (Phase 8 diagnostics)
19:07  PR #19 (Phase 9 cutover)
19:10  PR #20 (Master promotion)
Total  ~9.5 hours, 21 PRs, ~19,000 net lines

Code Produced

Component               Lines    Purpose
ai-service/             ~4,500   Fastify runtime, routes, run manager, storage
packages/ai-core/       ~3,800   Extracted AI logic, contracts, domain models
Infrastructure scripts  ~1,700   Provisioning, verification, packaging
Deploy workflow         ~250     CI/CD pipeline
Frontend changes        ~1,100   Run-aware chat panel, API client
Tests                   ~1,500   Characterization, parity, integration, contract
Remaining               ~6,000+  Internalized API libs, domain model moves, etc.

What Got Deployed

  • Standalone Fastify AI service on app-name
  • Dedicated App Service plan, storage account, managed identity
  • Express proxy splitting /api/ai to AI service, /api to Functions
  • GitHub Actions deploy workflow with pre/post-deploy health probes
  • Run-aware frontend with reconnect, cancel, and status polling
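
The proxy split is easy to picture as a routing predicate. Hostnames here are placeholders, and the real service uses an Express proxy rather than this toy function:

```typescript
// Illustrative upstream selection for the proxy split described above:
// /api/ai/* goes to the new standalone AI service, everything else
// under /api stays on Azure Functions. URLs are placeholders.
const AI_SERVICE = "https://ai-service.example.net";
const FUNCTIONS = "https://functions.example.net";

function upstreamFor(path: string): string {
  // Order matters: the more specific /api/ai prefix must win over /api.
  return path.startsWith("/api/ai") ? AI_SERVICE : FUNCTIONS;
}
```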

Act 7: The Hotfix

Twenty PRs merged. Three deploy workflows green. Dark deploy verified. Brownout rollback proven.

Then I opened the chat panel and got "Access Denied."

The AI_CHAT_BOOTSTRAP_SECRET — a shared secret used to verify frontend bootstrap tokens — was configured on the Azure Functions app but had never been provisioned to the new AI service. The dark deploy health probes (which test /healthz and /readyz) didn't catch it because they don't send authenticated chat requests.

Codex fixed it in one commit: added the secret to the provisioning script, the verification checklist, and the readiness probe. PR #21 merged and deployed.

Lesson: Dark deploys verify infrastructure health, not end-to-end user flows. You need both.
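
The fix suggests a general pattern: make the readiness probe assert on required configuration, not just process liveness. A minimal sketch — the secret name is from the post; everything else is illustrative:

```typescript
// Readiness check that fails fast when a required secret is absent,
// so a dark deploy surfaces the gap before a user does.
type ReadinessResult = { ready: boolean; missing: string[] };

const REQUIRED_SECRETS = ["AI_CHAT_BOOTSTRAP_SECRET"];

function checkReadiness(env: Record<string, string | undefined>): ReadinessResult {
  const missing = REQUIRED_SECRETS.filter((name) => !env[name]);
  return { ready: missing.length === 0, missing };
}

const bad = checkReadiness({});                                   // secret never provisioned
const ok = checkReadiness({ AI_CHAT_BOOTSTRAP_SECRET: "s3cret" }); // secret present
```

Wiring this into `/readyz` means the health probe itself would have caught the missing secret, which is exactly what PR #21 added.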


What I Learned About Multi-Model Development

The Reviewer/Executor Pattern

The most powerful pattern I discovered is using two AI models in complementary roles:

  • Model A (Claude Opus): Planner, reviewer, and observer. It reviews the plan before execution, monitors progress during execution, and validates quality after execution. It never touches the code directly.

  • Model B (Codex): Autonomous executor. It receives a plan and system prompt, then works independently — creating branches, writing code, provisioning infrastructure, creating PRs, and deploying.

This separation of concerns mirrors how engineering teams work: architects review, developers execute, and neither role is diminished by the other.

How to Write Plans for Autonomous Agents

  1. Be exhaustive about context. Every file path, resource name, dependency version, and API contract. Agents can't ask clarifying questions mid-execution the way humans can.

  2. Define stop gates, not just tasks. "Do X" is a task. "Do X, but stop if Y is true" is a gate. Gates prevent agents from charging past failure points.

  3. Specify physical isolation, not just logical dependencies. Parallel threads need worktrees, branch naming conventions, and merge protocols — not just a DAG.

  4. Include permission boundaries. "You can do X without asking" prevents stalling. "You must stop for Y" prevents runaway execution.

  5. Plan for session turnover. Long-running agents hit quota limits. Build in handover protocols — structured files that capture state for the next session.
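
The HANDOVER.md drafted during this project isn't reproduced here, but a minimal sketch of such a protocol file might look like this (section names match the state the post says it captures; phase details are illustrative):

```markdown
# HANDOVER.md — state for the next agent session (illustrative template)

## Current phase
Phase 3 — runtime manager (in progress)

## Done
- Phase 0 baseline merged (PR #5)
- Phase 1 workspace scaffold merged (PR #6)

## Blocked
- (none)

## Branch state
- working branch: phase-3-runtime (worktree path and last green commit recorded here)

## Next action
- Implement watchdog timers, then open PR
```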

How to Monitor Autonomous Agents

The "snoop" pattern — having a second model periodically observe the codebase — turned out to be valuable for three reasons:

  1. Progress tracking without interruption. The observer reads git history and file state without disturbing the executor.

  2. Quality validation in real-time. The observer can assess whether extracted code maintains behavioral parity, whether contracts are properly frozen, and whether test coverage is meaningful.

  3. Issue detection before merge. The observer caught the lost SDK audit artifact and the session continuity gap before they became blocking problems.
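
A shadow observer can be as simple as grouping the executor's recent commits by phase and reporting the counts. The "Phase N:" commit-message convention below is an assumption, not necessarily what Codex used:

```typescript
// Minimal "snoop" report: summarize the executor's progress from git
// history alone, without touching the working tree.
type Commit = { sha: string; message: string };

function snoopReport(commits: Commit[]): Map<string, number> {
  const byPhase = new Map<string, number>();
  for (const c of commits) {
    // Hypothetical convention: commit subjects start with "Phase <id>:".
    const m = c.message.match(/^Phase (\w+):/);
    const phase = m ? `Phase ${m[1]}` : "other";
    byPhase.set(phase, (byPhase.get(phase) ?? 0) + 1);
  }
  return byPhase;
}

const report = snoopReport([
  { sha: "a1", message: "Phase 2b: extract ai-core package" },
  { sha: "b2", message: "Phase 2b: characterization tests" },
  { sha: "c3", message: "chore: lint fixes" },
]);
```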

The Human's Role Changes, But Doesn't Disappear

I didn't write 19,000 lines of code. But I:

  • Designed the migration architecture
  • Wrote a 2,100-line runbook that eliminated ambiguity
  • Reviewed the plan across three iterations with an AI partner
  • Identified the worktree isolation gap and injected a correction mid-flight
  • Monitored execution through 12 observation cycles
  • Made the "big bang, fix forward" cutover decision
  • Performed the first manual end-to-end test that found the bootstrap secret issue

The human role shifts from writing code to writing specifications precise enough that code writes itself. That's a different skill, but it's still a skill — and it's the skill that determined whether this project took 9 hours or 9 days.


Try This Yourself

If you want to experiment with multi-model development:

  1. Start with a runbook, not a prompt. Write down everything the agent needs to know before it starts. If you find yourself thinking "it'll figure that out," write it down instead.

  2. Use one model to review the plan before another executes it. The reviewer will catch ambiguities, missing gates, and underspecified contracts.

  3. Set up a shadow observer. Ask a model to periodically read your codebase and report on the executor's progress. You'll catch issues faster than waiting for CI failures.

  4. Plan for failure. Include stop gates, rollback strategies, and session handover protocols. Autonomous agents don't get tired, but they do get stuck — and they need structured ways to communicate that.

  5. Test the real user path. Health probes are necessary but not sufficient. Someone needs to open the browser and click the button.

The future of software development isn't AI replacing developers. It's developers who can orchestrate AI systems effectively building things that would have been impractical before.

The runbook and system prompt used in this project are shared below for anyone who wants to adapt this approach.

Appendix: The Full Runbook

Below is the complete 2,100-line migration runbook I wrote to drive this project — the single document that governed every phase, gate, and decision. All Azure resource names and company identifiers have been redacted. The structure, sequencing, execution model, and stop gates are authentic.

This is the actual artifact that Codex followed autonomously for 9.5 hours. If you want to adapt this approach for your own projects, this is your template.

Scroll within the frame to read the full runbook. The original is approximately 2,100 lines covering 22 sections.


Mo Khan is just an old-timer engineer-turned-manager who left coding a long time ago and is having so much fun learning again, building with AI tools, with a special interest in AI-augmented development workflows, cloud architecture, and autonomous agent orchestration.

Sunday, 8 February 2026

I built a “Steam Workshop” for system architecture + product roadmaps + org blueprints (runs in your browser)

Have you ever wanted to install the architecture of a product the way you install an app?

Not just a diagram, but the whole system architecture blueprint:

  • services and dependencies
  • teams and ownership
  • goals → initiatives → work packages
  • a 3-year planning horizon for product managers
  • plus a “prompt pack” so you can remix it

That’s what I’ve been building as a hobby project: SMT (Software Management & Planning Tool).



And this week I shipped something I’m genuinely excited about:

The Community Blueprints Marketplace (with social features)

You can now publish blueprint packages publicly, and the community can:

  • browse and install them
  • star them
  • comment / discuss them
  • see what’s trending (social proof + discovery loops)

Think: Figma Community / Steam Workshop, but for product architecture + product/team organization.


What’s a “Blueprint” in SMT?

A blueprint is a portable package (JSON) that contains:

  • manifest (title, summary, tags, trust label, etc.)
  • prompt pack (seed + variants like MVP/Scale)
  • full system snapshot (teams, services, goals, initiatives, work packages)

The goal is learning through interaction:

  • install a blueprint
  • explore its org + architecture + roadmap in the app
  • remix it into your constraints
  • publish your remix back to the marketplace

Why I’m doing this

Most “reference architectures” online are:

  • static
  • divorced from org/team realities
  • not easily remixable
  • missing the roadmap/execution story

SMT tries to make “how a product might actually run” tangible:

  • architecture + org design + planning are all connected
  • you can poke at it, not just read it

SMT makes it possible for you to inspect any kind of tech platform. Think LeetCode interview prep, but for system design, architecture, team topologies, product roadmaps, and software delivery planning.


Local-first by default (privacy + zero friction)

SMT is intentionally local-first:

  • it runs as a static app in the browser
  • your systems stay in your browser unless you explicitly publish a blueprint package

The cloud marketplace is optional and only powers:

  • public publishing
  • discovery
  • stars/comments

No “workspace sync” SaaS lock-in.


How the social marketplace works (simple + cost-free)

To keep this sustainable on the free tier, the backend is:

  • Cloudflare Workers + D1 (SQLite)
  • token-normalized search (no paid search / no vector DB)
  • GitHub OAuth for identity (scope: read:user only)
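
"Token-normalized search" presumably means something like the following: lowercase, strip punctuation, split into tokens, and match on token overlap — no paid search service or vector DB required. This sketch is an assumption, not SMT's actual implementation:

```typescript
// Normalize text into lowercase alphanumeric tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// A document matches when every query token appears in it,
// regardless of case, punctuation, or hyphenation.
function matches(query: string, doc: string): boolean {
  const docTokens = new Set(tokenize(doc));
  return tokenize(query).every((t) => docTokens.has(t));
}
```

In practice the tokens would be precomputed and stored in a D1 table at publish time, so search is a simple SQL lookup rather than a per-request scan.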

Important security bit:

  • public publishing does secret scanning (manifest + full system payload) and blocks likely API keys/tokens.
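
A publish-time secret scan can be approximated with a handful of key-shaped regexes. These patterns are illustrative, not SMT's real rules:

```typescript
// Reject payloads that contain strings shaped like common API keys.
// Patterns are examples only; a real scanner would use a broader set.
const SECRET_PATTERNS: RegExp[] = [
  /sk-[A-Za-z0-9]{20,}/,   // OpenAI-style secret keys
  /ghp_[A-Za-z0-9]{36}/,   // GitHub personal access tokens
  /AKIA[0-9A-Z]{16}/,      // AWS access key IDs
];

function containsLikelySecret(payload: string): boolean {
  return SECRET_PATTERNS.some((re) => re.test(payload));
}
```

Running this over both the manifest and the full system snapshot before accepting a public publish is cheap insurance against the most common leak shapes.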

Try it (and please break it, it's a WIP hobby project)

1) Open the app

2) Explore the “Community Blueprints” view

  • browse the curated catalog
  • click Preview on anything that looks interesting
  • install an Available blueprint and inspect it across:
    • System Architecture Overviews
    • Org Design
    • Roadmap & Backlog Management
    • Year Plan / Detailed Planning

3) Publish + socials

  • in the publish flow, use Publish Publicly
  • then open Preview on your published blueprint:
    • Star/Unstar it
    • Post a comment
    • sanity check that it’s discoverable via search

If anything fails, I want to know. Use the Feedback feature to log issues.


What I’d love feedback on (high signal)

  1. Does the blueprint concept actually help you understand a product faster?
  2. Are the “prompt packs” useful, or just noise?
  3. What should “trending” mean here: stars, downloads, recency, or something else?
  4. What social features would make this fun without turning it into a moderation nightmare?

If you want to contribute

This is open source (CC0) and I’m happy to collaborate.


Roadmap ideas (if the community likes this)

  • Remix lineage: “Forked from…” + remixes graph
  • Lightweight contributor reputation (badges, trust tiers)
  • Reporting/flagging + moderation queue
  • Curated collections (“Backends 101”, “B2B SaaS starters”)

If any part of this sparks your curiosity, I’d love for you to try it for 5 minutes and tell me what confused you, what felt magical, and what felt pointless.

Drop a comment here, or open an issue on GitHub.

Saturday, 7 February 2026

A Day Building new features on SMT using Codex App with Codex 5.3

AI Build Journal · February 7, 2026 · written by Codex to me... OpenAI released the Codex app for Mac this week, so I decided to have a go, and boy, am I blown away! In just one day, Codex helped me clear much of my SMT backlog, after a month's break from my AI-coding frenzy of Dec '25.


This was not “prompt in, code out.” This was a full-day product session: strategy debates, UX corrections, contract audits, feature pivots, test hardening, documentation, and ship.

I started the day with one objective: execute the next phase plan for SMT Platform without losing quality. By the end of the day, we had shipped one of the most ambitious increments in the project so far: Goal Inspections + Community Blueprints Exchange, including end-to-end contribution flow, install lifecycle logic, catalog operations, and test coverage.


What We Shipped in One Day

  • Goal lifecycle + inspections system: owner status, weekly comments, PTG tracking, stale/mismatch detection, leadership report table.
  • Year Plan CSV/XLSX export: production export flow in toolbar with tested serialization and schema-aware payload handling.
  • Community Blueprints Exchange: Top-100 curated catalog, preview modal, search/filter, publish flow, package validation, and install lifecycle UX.
  • Launch package generation upgrade: moved to domain-authored-curated-v2 for the launch-25 package set.
  • Hardening + compliance: contract remediation pass, UX consistency fixes, event rebinding bug fix, and regression-proof e2e updates.

The Metrics That Matter

Metric                           Result
Session duration                 11h 56m 43s (10:08:30 → 22:05:13)
Timestamped worklog checkpoints  117
Commits shipped                  4 (2671b68, a502106, c9d8d43, 97bec52)
Code delta (same-day commits)    +129,166 / -7,770 (net +121,396)
Unique files touched             77 (113 file-change events across commits)
New files created                27
Unit test progression            90 → 117 tests (+30%)
E2E test progression             51 → 58 tests across 8 → 9 specs
Community blueprint footprint    Top-100 catalog + Launch-25 curated packages

Note: the large insertion volume includes generated blueprint catalog/package artifacts in addition to application code.

How This Compared to “Typical Solo Dev Pace”

A conservative estimate for this scope with one human engineer is 2–3 weeks: feature architecture, UI wiring, persistence, migration work, docs, and full regression coverage. Here, the value of Codex 5.3 was not just speed in typing code. The leverage came from:

  • Staying in implementation mode continuously while preserving test discipline.
  • Switching quickly between product decisions, coding, debugging, and documentation.
  • Keeping a verifiable trail (/docs/worklogjournal.md) so context did not get lost.

This Was Collaboration, Not Task Dispatch

The most important part of the day was the interaction pattern. We did not run a one-way backlog. We debated quality and credibility:

  • You challenged weak UX states (Install should be locked when unavailable), and we corrected behavior at both tile and preview levels.
  • You challenged data realism for “inspired-by” systems, and we replaced simplistic seed generation with richer domain-authored package generation.
  • You enforced coding contracts, and we ran an explicit compliance audit plus remediation pass before final push.
  • You required proof, not promises, so every major change ended with lint/unit/e2e verification.

The real unlock is not “AI writes code faster.” It is “human judgment + AI execution + strict verification” as one continuous loop.

Lessons Learned

  1. Contracts first, always: when contract rules are explicit, quality issues become detectable and fixable quickly.
  2. Feature credibility beats feature count: shipping a marketplace means realism, not placeholder parity.
  3. Tests are collaboration memory: every bug found late became a permanent test so the fix does not regress.
  4. Worklogs scale agentic development: detailed timestamped logs made long-session continuity possible.

What’s Next

The obvious next move is to raise the “real-world blueprint” bar further: richer domain fidelity, stronger package QA gates, and a true contribution-driven exchange loop where users generate, validate, publish, and learn from each other’s systems.

Built on SMT Platform using Codex 5.3 · evidence from /docs/worklogjournal.md and same-day git history.