Sunday, 28 December 2025

Codex 5.2 vs the Long Game: Building a Quality Ladder in a Real Product Codebase

If you’ve ever tried to institutionalize code quality (not just talk about it), you’ll know the trap: quality work is rarely one big change. It’s a hundred small, unglamorous decisions—done consistently—until the codebase starts behaving like a system.

That’s what this post is about.

I’ve been evolving a real planning product codebase (“altsoftwareplanning”): workflows for planning, roadmaps, org views, capacity tuning, and a growing set of UI/interaction patterns. It was shipping features, and I think the app will solve real problems. It is still a concept, though, and I wanted to make sure it is being built the right way, because the safety net was thin.

At some point, “it works” stops being good enough.

I wanted a codebase where I can say, with a straight face:

We can prove it works.

Codex 5.2 turned out to be the right partner for the long-running version of this work: contract enforcement, lint hardening, unit testing, E2E testing, coverage instrumentation, CI visibility, and documentation—all in a compounding sequence that actually sticks.

In short: over a sequence of compounding iterations, the Codex 5.2 CLI helped transform an active product codebase into a quality-first system by enforcing contracts, hardening lint discipline, building broad unit and Cypress E2E coverage, and surfacing coverage artifacts in CI.


The Landscape: What We Were Working With

You can’t talk quality without acknowledging the stack. The strategy must fit the reality.

  • App layer: vanilla JavaScript, HTML, CSS (no framework), service-oriented front-end modules under /js

  • Visualization tooling: D3, Mermaid, Frappe Gantt

  • Background logic: a feedback worker (smt-feedback-worker) and AI-adjacent utilities

  • Quality/tooling: ESLint (flat config), stylelint, depcheck, plus a custom contract scanner

  • Testing: Vitest + jsdom for unit tests, Cypress for E2E

  • CI: GitHub Actions, publishing coverage artifacts on every push/PR

This is no longer a toy repo. It has a real product surface area, with enough moving parts to punish sloppy changes.


The Problem: Velocity Without Proof

We had signals, but they weren’t a coherent system:

  • some rules existed (informally)

  • linting existed (but was noisy and inconsistent)

  • testing existed (but not at a level that lets you refactor with confidence)

  • E2E coverage didn’t reflect how users actually flow through the product

  • coverage existed locally, not visibly

In other words: tribal knowledge plus hope.

The goal wasn’t “add tests.” The goal was a quality ladder—each rung enabling the next.


The Quality Ladder (The Sequence Matters)

Here’s the order of operations that worked, and why:

  1. Codify a contract (design rules, data flow rules, UI rules)

  2. Make lint clean and trustworthy (so warnings mean something)

  3. Cover domain logic with unit tests (turn behavior into executable specs)

  4. Add E2E tests for real workflows (where regressions actually happen)

  5. Instrument coverage (otherwise you’re guessing)

  6. Publish coverage in CI (visibility changes behavior)

  7. Keep docs current (quality must be repeatable)

Codex 5.2 helped me climb this ladder without losing the thread.


Phase 0: Compliance Cleanup (The Real Beginning)

Every long-running quality push has a starting “wake up” moment.

For this repo, that moment looked like compliance cleanup:

  • removing risky innerHTML usage and window-bound patterns

  • tightening DOM checks and wiring

  • simplifying defensive guards that hid intent

This wasn’t glamorous. It was foundational. You can’t build a quality system on top of brittle glue.
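To make the first bullet concrete, here is a minimal sketch of swapping an innerHTML template for explicit DOM calls. The function name renderTeamList and the data shape are hypothetical and not taken from the repo; the document object is passed in as a parameter only so the snippet can also run outside a browser.

```javascript
// Hypothetical example: render a list of team names without innerHTML.
// Before (risky):
//   container.innerHTML = names.map((n) => `<li>${n}</li>`).join("");
// A name containing markup would be parsed as HTML.

function renderTeamList(doc, container, names) {
  const items = names.map((name) => {
    const li = doc.createElement("li");
    // textContent stores the value as text, so markup in `name` is never parsed.
    li.textContent = name;
    return li;
  });
  // replaceChildren swaps the old children for the new ones in one call.
  container.replaceChildren(...items);
  return container;
}
```

In a browser this would be called as renderTeamList(document, document.querySelector("#teams"), ["Platform", "Payments"]), where the #teams element is again an assumption.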


Phase 1: Turn Ideas Into Contracts

Before I was willing to scale testing, I needed the codebase to have enforceable rules—especially in the UI layer.

We already had an early foundation:

  • a contract-scanning script

  • a “coding contract” that enforces rules like:

    • no inline styles (even for dynamic values)

    • DOM-only toolbars (explicit, deterministic wiring)

    • centralized storage patterns (no random settings scattered across views)
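The repo's contract scanner is not shown in this post, so the following is only a sketch of the idea, assuming a simple regex-per-rule design. The rule ids, the patterns, and the scanSource function are all hypothetical; a production scanner might well use an AST instead.

```javascript
// Hypothetical rule set: each rule is an id plus a regular expression.
// None of the patterns use the `g` flag, so .test() stays stateless.
const CONTRACT_RULES = [
  { id: "no-inline-style", pattern: /\.style\.[A-Za-z]+\s*=[^=]/ },
  { id: "no-inner-html", pattern: /\.innerHTML\s*\+?=[^=]/ },
  { id: "no-direct-storage", pattern: /\blocalStorage\./ },
];

// Scan one file's text and return a violation record per offending line.
function scanSource(filename, text, rules = CONTRACT_RULES) {
  const violations = [];
  text.split("\n").forEach((line, index) => {
    for (const rule of rules) {
      if (rule.pattern.test(line)) {
        violations.push({ file: filename, line: index + 1, rule: rule.id });
      }
    }
  });
  return violations;
}
```

A script like this can exit non-zero when the violations array is non-empty, which is what turns a written rule into a gate.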

Then we reinforced it.

What changed (in practical terms)

  • Inline style mutations were removed or centralized.

  • View code was pushed toward CSS classes / CSS variables.

  • Global header/toolbars were wired consistently.

  • Storage moved behind the repository boundary (e.g., SystemRepository).
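The last bullet is the easiest to picture in code. SystemRepository is the name used in the repo, but its interface is not shown here, so this sketch uses a hypothetical createSettingsRepository with an assumed key namespace. The point is only that views call the repository and never the storage backend directly.

```javascript
// Hypothetical repository: the only module allowed to touch the storage backend.
// `storage` is anything with getItem/setItem, such as window.localStorage.
function createSettingsRepository(storage, namespace = "app-settings") {
  const keyFor = (name) => `${namespace}:${name}`;
  return {
    get(name, fallback = null) {
      const raw = storage.getItem(keyFor(name));
      if (raw === null) return fallback;
      try {
        return JSON.parse(raw);
      } catch {
        // Corrupt entries fall back instead of crashing the view.
        return fallback;
      }
    },
    set(name, value) {
      storage.setItem(keyFor(name), JSON.stringify(value));
    },
  };
}
```

Because the backend is injected, unit tests can pass an in-memory stub instead of a real browser storage object.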

This is the crucial insight:

Quality starts with clarity. If you can’t describe the rules, you can’t test them.

Codex 5.2 was useful here because it didn’t just “fix a file.” It was willing to chase contract violations across dozens of small edits until the rule was actually enforced.


Phase 2: Make Linting a Trusted Signal

Linting is only useful if “clean” is achievable.

So we did the boring work:

  • cleared ESLint warnings (unused locals, scoping issues, assignment-in-conditional traps)

  • standardized hasOwnProperty usage

  • tightened property access and scoping
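Two of these cleanups are small enough to show side by side. The post does not say which form the repo standardized on, so treat this as one common way to satisfy ESLint's no-cond-assign and no-prototype-builtins rules; the function names are illustrative only.

```javascript
// Before: if (match = pattern.exec(text)) { ... }
// The assignment is hidden inside the condition.
// After: assign first, then test, so the intent is explicit.
function firstNumber(text) {
  const match = /\d+/.exec(text);
  if (match === null) return null;
  return Number(match[0]);
}

// Before: obj.hasOwnProperty(key)
// This breaks if obj has no prototype or shadows the method.
// After: borrow the method from Object.prototype.
function hasOwn(obj, key) {
  return Object.prototype.hasOwnProperty.call(obj, key);
}
```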

Only then did we upgrade:

  • ESLint v9

  • flat config

  • reduced deprecated dependencies
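For readers who have not seen flat config yet, here is a minimal sketch of what such a file can contain. The file glob, the rule selection, and the sourceType are assumptions chosen to match the cleanups listed above, not a copy of the repo's configuration.

```javascript
// Sketch of an eslint.config.js body. In the real file this array would be
// the default export (export default config).
const config = [
  {
    files: ["js/**/*.js"],
    // sourceType is an assumption; a script-based app would use "script".
    languageOptions: { ecmaVersion: "latest", sourceType: "module" },
    rules: {
      "no-unused-vars": "error",
      "no-cond-assign": ["error", "always"],
      "no-prototype-builtins": "error",
      "block-scoped-var": "error",
    },
  },
  // Flat config replaces .eslintignore with an ignores-only object.
  { ignores: ["coverage/**"] },
];
```

Setting the rules to "error" rather than "warn" is the design choice that matters: once the baseline is clean, any new finding fails the run instead of adding to the noise.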

Once lint was noise-free, a new lint warning started meaning something again.

This is one of those hidden tipping points: when lint becomes a real gate, behavior shifts.


Phase 3: Unit Tests That Match the Domain

This was not a token unit-test pass. The suite mapped to the product’s domain logic.

We built unit tests across:

  • Planning

  • Roadmap

  • Org

  • System

  • Initiative

  • Forecast

  • WorkPackage

  • App state helpers and supporting utilities

Tooling:

  • Vitest + jsdom (DOM simulation where needed)

  • later: Vitest v4 upgrade (refreshed the Vite/esbuild chain and cleared CI audit issues)

The real win wasn’t “coverage percentage.”

The real win was this:

Domain behavior became executable.

Codex 5.2 helped by translating real product flows into testable units—without scattering random micro-tests that don’t align to how the app behaves.
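As an illustration of "executable domain behavior", the sketch below states one capacity rule and checks it with a table of cases. The function netCapacityWeeks and its inputs are hypothetical, not the repo's API, and plain throw statements stand in for a test runner so the snippet runs on its own.

```javascript
// Hypothetical domain rule: a team's net capacity in person-weeks.
function netCapacityWeeks({ engineers, weeksInPeriod, leaveWeeks = 0 }) {
  if (engineers < 0 || weeksInPeriod < 0 || leaveWeeks < 0) {
    throw new RangeError("capacity inputs must be non-negative");
  }
  // Capacity never goes below zero, even if leave exceeds gross capacity.
  return Math.max(0, engineers * weeksInPeriod - leaveWeeks);
}

// Each case reads as one line of the spec: inputs, then the expected result.
const cases = [
  { name: "full team, no leave", input: { engineers: 5, weeksInPeriod: 12 }, expected: 60 },
  { name: "leave is subtracted", input: { engineers: 5, weeksInPeriod: 12, leaveWeeks: 8 }, expected: 52 },
  { name: "never negative", input: { engineers: 1, weeksInPeriod: 2, leaveWeeks: 5 }, expected: 0 },
];

for (const { name, input, expected } of cases) {
  const actual = netCapacityWeeks(input);
  if (actual !== expected) {
    throw new Error(`${name}: expected ${expected}, got ${actual}`);
  }
}
```

In a Vitest suite, each case would typically become an it(...) block that asserts with expect(netCapacityWeeks(input)).toBe(expected).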


Phase 4: Cypress E2E Tests for Real Flows

Unit tests prove logic.

E2E tests prove the product.

We started with smoke tests and selector stabilization, then expanded into specs that mirror how users actually move:

  • Core UI and workflow validation

  • Planning and detailed planning flows

  • Product management flows

  • System creation/editing and org management

  • Settings and AI feedback flows

  • Smoke tests and selector hardening

Two deliberate choices here:

  1. Readable and commented tests. These become “living docs.”

  2. Selector hardening before scale. Flaky selectors are how E2E suites die.
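The post does not spell out what selector hardening involved in this repo. One common approach, sketched here as an assumption, is to select elements by a dedicated test attribute rather than by CSS class or position. The byTestId helper and the ids in the comments are hypothetical.

```javascript
// Hypothetical helper: build a selector from a dedicated test attribute
// instead of CSS classes or element order, which change with styling.
function byTestId(id) {
  if (!/^[A-Za-z0-9_-]+$/.test(id)) {
    throw new Error(`invalid test id: ${id}`);
  }
  return `[data-testid="${id}"]`;
}

// In a Cypress spec this could be used as, for example:
//   cy.get(byTestId("planning-save")).click();
//   cy.get(byTestId("planning-status")).should("contain", "Saved");
```

Routing every selector through one helper also gives a single place to change the attribute name later.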

Codex 5.2 showed long-running strength here: iterating until the suite is stable, not just “present.”


Phase 5: Coverage You Can See (Not Just Locally)

Testing without visibility is guesswork.

So we instrumented coverage for both layers:

  • Unit coverage: Istanbul

  • E2E coverage: @cypress/code-coverage

We intentionally split reports:

  • coverage/unit

  • coverage/e2e
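A sketch of how that split can be configured, assuming Vitest's Istanbul coverage provider for the unit layer and the nyc settings that @cypress/code-coverage reads for the E2E layer. The reporter lists are assumptions; only the two directory names come from this post.

```javascript
// Coverage-related settings, shown as plain objects.

// Would live under the default export of vitest.config.js.
const vitestConfig = {
  test: {
    environment: "jsdom",
    coverage: {
      provider: "istanbul",
      reportsDirectory: "coverage/unit",
      reporter: ["text", "html", "lcov"],
    },
  },
};

// Would live in the "nyc" section of package.json, which
// @cypress/code-coverage reads when writing its report.
const nycConfig = {
  "report-dir": "coverage/e2e",
  reporter: ["text-summary", "html", "lcov"],
};
```

Keeping the two reports in separate directories means neither run overwrites the other, and each can be uploaded as its own CI artifact.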

And then made it visible:

  • GitHub Actions uploads coverage artifacts on every push and PR

  • no special tooling required to inspect results—just download the artifact

CI now runs (as part of the quality pipeline):

  • npm run test:coverage

  • npm run test:e2e:coverage

We also fixed dependency checks so coverage tooling is first-class and green in CI (depcheck included).

This matters because it changes the social contract:

Coverage stops being “something one dev ran.” It becomes a team-visible signal.


What Codex 5.2 Actually Enabled

This wasn’t “AI wrote tests.” That’s the shallow version of the story.

Codex 5.2 behaved more like an expert software-engineer-in-test (SWET) partner that works for hours without losing the thread (I exhausted my five-hour quota on more than one occasion):

  • it read the codebase and translated workflows into test cases

  • it stayed consistent with constraints across many commits

  • it revisited flaky E2E specs and hardened selectors

  • it fixed CI failures immediately instead of letting them linger

  • it updated docs alongside implementation changes

Short tasks are easy.

The long game—contracts + lint + unit + E2E + coverage + CI + docs—requires continuity.

That’s where Codex 5.2 shone.

Hint: If you've been putting off a long-term technical-debt task, such as introducing unit and integration tests (or migrating them to another framework) and wiring them into your CI/CD workflows, you no longer have to do it all yourself. Codex 5.2 is a very capable model that can do the heavy lifting for you, and in my experience it can move a codebase's quality forward in a matter of hours.


The Working Loop I Used (Practical, Not Magical)

If you want to replicate this kind of quality transformation, here’s the loop that worked:

  1. Start with an audit task

    • “Scan the repo for contract violations and list them by severity.”

    • “List lint rule violations that prevent a clean run.”

  2. Fix in small diffs

    • insist on small PR-sized changes

    • prevent “helpful refactors” that mix concerns

  3. Lock in with tooling

    • don’t accept “we fixed it once”

    • enforce it through npm run verify / CI gates

  4. Convert workflows into tests

    • unit tests for domain rules

    • E2E specs for user flows

  5. Make coverage visible

    • separate unit vs E2E

    • publish artifacts

  6. Document the happy path

    • how to run tests locally

    • where coverage lands

    • what “green” means
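Step 3 is the one that makes the loop stick, so here is a minimal sketch of what a verify gate can look like. Only the two coverage script names appear in this post; the "lint" script name, the runGates function, and the injected runner are assumptions for illustration.

```javascript
// Hypothetical gate list; "lint" is an assumed script name.
const GATES = ["lint", "test:coverage", "test:e2e:coverage"];

// Run each gate in order and stop at the first failure.
// `run` is injected: it takes a script name and returns an exit code.
function runGates(gates, run) {
  const passed = [];
  for (const gate of gates) {
    const code = run(gate);
    if (code !== 0) {
      return { ok: false, failed: gate, passed };
    }
    passed.push(gate);
  }
  return { ok: true, failed: null, passed };
}
```

In a real script, run could wrap Node's child_process.spawnSync("npm", ["run", name], { stdio: "inherit" }) and return its status. Stopping at the first failure keeps the feedback short and points at exactly one broken rung.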

Codex 5.2 helped keep this loop tight—especially the follow-through steps that humans tend to procrastinate.


Before / After (The Shape of the System)


Outcomes (What Changed, Concretely)

By the end of this effort, the repo gained:

  • a stronger contract + lint foundation

  • ESLint v9 flat config with warning-free lint

  • stylelint + depcheck integrated into the quality gates

  • a comprehensive unit test suite across core services

  • a multi-spec Cypress E2E suite covering real user journeys

  • separate unit and E2E coverage reports

  • CI workflows that upload coverage artifacts on every push/PR

  • updated docs that make the whole thing discoverable and repeatable

This is the difference between:

  • “we have tests”

  • and “we have quality.”

Stats

Counts (current HEAD)

  • Unit tests: 68 test cases across 27 test files
  • Cypress: 7 specs in cypress/e2e, 39 E2E test cases

Files Touched

  • 94 files changed since linting was introduced
  • 37 files changed since unit tests were added
  • 37 files changed since Cypress tests were added

E2E Coverage Delta (baseline = first coverage workflow commit 68418c)

  • Statements: 54.60% → 54.61% (+0.01pp)
  • Branches: 33.34% → 33.34% (+0.00pp)
  • Functions: 52.90% → 52.93% (+0.03pp)
  • Lines: 56.90% → 56.91% (+0.01pp)

CI Runtime Before/After (quality.yml, success runs only)

  • Before coverage/tests in CI (pre-68418c): avg 29.2s, median 32.0s, n=10
  • After coverage/tests in CI (post-68418c): avg 303.7s (~5:04), median 306.0s, n=6

tests.yml (unit coverage workflow)

  • all runs: n=22 avg=32.0s median=34.0s min=18.0s max=42.0s
  • success runs: n=22 avg=32.0s median=34.0s min=18.0s max=42.0s

e2e.yml (Cypress coverage workflow)

  • all runs: n=19 avg=229.7s median=284.0s min=38.0s max=310.0s
  • success runs: n=18 avg=239.8s median=285.0s min=38.0s max=310.0s
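Two of these figures are easier to read after a little arithmetic: the E2E coverage deltas since the baseline commit, and how much longer quality.yml takes once tests and coverage run in CI. The sketch below only recomputes from the numbers listed above; the variable and function names are illustrative.

```javascript
// Figures copied from the Stats lists above: [baseline %, current %].
const e2eCoverage = {
  statements: [54.6, 54.61],
  branches: [33.34, 33.34],
  functions: [52.9, 52.93],
  lines: [56.9, 56.91],
};

// Percentage-point delta, formatted to two decimals.
function deltaPp([before, after]) {
  return (after - before).toFixed(2);
}

const deltas = Object.fromEntries(
  Object.entries(e2eCoverage).map(([metric, pair]) => [metric, deltaPp(pair)])
);

// Average quality.yml runtime in seconds, before and after the coverage commit.
const ciSeconds = { before: 29.2, after: 303.7 };
const slowdown = (ciSeconds.after / ciSeconds.before).toFixed(1);

console.log(deltas);   // statements "0.01", branches "0.00", functions "0.03", lines "0.01"
console.log(slowdown); // "10.4"
```

So E2E coverage has been essentially flat since the first coverage workflow commit, while the average quality.yml run became roughly ten times longer.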


Why This Matters (Beyond Preventing Regressions)

A real quality system does more than reduce bugs:

  • Onboarding improves because expectations are written down (and executable).

  • Refactors get cheaper because the safety net is layered.

  • Velocity improves because you stop paying the “regression tax.”

  • Confidence increases because you can prove correctness instead of arguing.

The whole point of quality is not to slow down shipping.

It’s to preserve shipping speed as the product grows.


The Reusable Playbook

If you’re trying to do the same in your own codebase, don’t start with a thousand tests. Start with the ladder.

  1. Define and enforce a contract (rules of the road)

  2. Make lint clean and trustworthy

  3. Build unit tests that map to domain logic

  4. Start with smoke E2E tests, then expand into full flows

  5. Instrument coverage for both layers

  6. Publish coverage artifacts in CI

  7. Treat docs as first-class artifacts

Quality is not a tool.

Quality is a system.

I've experienced the power of Codex as my AI coding agent and quality partner, and this stuff is real. My app's codebase might still be in its infancy, but I know it comes close to what software engineering teams face every day, especially teams building enterprise tooling apps, as my former teams at AWS do.
Integrating AI coding agents to handle refactoring, migrations, or, as in my case, the introduction of automated end-to-end quality testing is now a no-brainer with powerful assistants like Codex, and a significant productivity enhancer!

