That’s what this post is about.
I’ve been evolving a real planning product codebase (“altsoftwareplanning”): workflows for planning, roadmaps, org views, capacity tuning, and a growing set of UI/interaction patterns. It ships features, and I believe the app will solve real problems. Yes, it’s still a concept, and I wanted to make sure it’s being built the right way. But the safety net was thin.
At some point, “it works” stops being good enough.
I wanted a codebase where I can say, with a straight face:
We can prove it works.
Codex 5.2 turned out to be the right partner for the long-running version of this work: contract enforcement, lint hardening, unit testing, E2E testing, coverage instrumentation, CI visibility, and documentation, all in a compounding sequence that actually sticks. Over that sequence of iterations, the Codex 5.2 CLI helped transform an active product codebase into a quality-first system.
The Landscape: What We Were Working With
You can’t talk quality without acknowledging the stack. The strategy must fit the reality.
App layer: vanilla JavaScript, HTML, CSS (no framework), with service-oriented front-end modules under /js
Visualization tooling: D3, Mermaid, Frappe Gantt
Background logic: a feedback worker (smt-feedback-worker) and AI-adjacent utilities
Quality/tooling: ESLint (flat config), stylelint, depcheck, plus a custom contract scanner
Testing: Vitest (+ jsdom) for unit tests, Cypress for E2E
CI: GitHub Actions, publishing coverage artifacts on every push/PR
This is no longer a toy repo! It demonstrates patterns for a real product surface area with enough moving parts to punish sloppy changes.
The Problem: Velocity Without Proof
We had signals, but they weren’t a coherent system:
some rules existed (informally)
linting existed (but was noisy and inconsistent)
testing existed (but not at a level that lets you refactor with confidence)
E2E coverage didn’t reflect how users actually flow through the product
coverage existed locally, not visibly
In other words: tribal knowledge plus hope.
The goal wasn’t “add tests.” The goal was a quality ladder—each rung enabling the next.
The Quality Ladder (The Sequence Matters)
Here’s the order of operations that worked, and why:
Codify a contract (design rules, data flow rules, UI rules)
Make lint clean and trustworthy (so warnings mean something)
Cover domain logic with unit tests (turn behavior into executable specs)
Add E2E tests for real workflows (where regressions actually happen)
Instrument coverage (otherwise you’re guessing)
Publish coverage in CI (visibility changes behavior)
Keep docs current (quality must be repeatable)
Codex 5.2 helped me climb this ladder without losing the thread.
Phase 0: Compliance Cleanup (The Real Beginning)
Every long-running quality push has a starting “wake up” moment.
For this repo, that moment looked like compliance cleanup:
removing risky innerHTML usage and window-bound patterns
tightening DOM checks and wiring
simplifying defensive guards that hid intent
This wasn’t glamorous. It was foundational. You can’t build a quality system on top of brittle glue.
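The repo’s actual innerHTML fixes aren’t reproduced here, but the cleanup follows a well-known pattern: prefer textContent for plain text, and escape anything interpolated into markup. A minimal sketch (escapeHtml is a hypothetical helper, not the repo’s API):

```javascript
// Hypothetical helper illustrating the innerHTML cleanup pattern:
// user data should never be able to inject live HTML into the page.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Instead of: container.innerHTML = userInput;
// prefer:     container.textContent = userInput;
// or, when markup must be assembled: `<li>${escapeHtml(userInput)}</li>`
```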
Phase 1: Turn Ideas Into Contracts
Before I was willing to scale testing, I needed the codebase to have enforceable rules—especially in the UI layer.
We already had an early foundation:
a contract-scanning script
a “coding contract” that enforces rules like:
no inline styles (even for dynamic values)
DOM-only toolbars (explicit, deterministic wiring)
centralized storage patterns (no random settings scattered across views)
Then we reinforced it.
What changed (in practical terms)
Inline style mutations were removed or centralized.
View code was pushed toward CSS classes / CSS variables.
Global header/toolbars were wired consistently.
Storage moved behind the repository boundary (e.g., SystemRepository).
This is the crucial insight:
Quality starts with clarity. If you can’t describe the rules, you can’t test them.
Codex 5.2 was useful here because it didn’t just “fix a file.” It was willing to chase contract violations across dozens of small edits until the rule was actually enforced.
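The contract scanner itself isn’t shown in this post; a stripped-down sketch of the idea, with illustrative rule names and patterns (not the repo’s real rules), might look like:

```javascript
// Minimal contract-scanner sketch. Rule names and regexes are illustrative;
// the real scanner walks the /js tree and reports violations per file.
const CONTRACT_RULES = [
  { name: 'no-inline-styles', pattern: /\.style\.[a-zA-Z]+\s*=/ },
  { name: 'no-inner-html', pattern: /\.innerHTML\s*=/ },
  { name: 'no-window-globals', pattern: /window\.[a-zA-Z_$]+\s*=/ },
];

function scanSource(source) {
  const violations = [];
  source.split('\n').forEach((line, i) => {
    for (const rule of CONTRACT_RULES) {
      if (rule.pattern.test(line)) {
        violations.push({ rule: rule.name, line: i + 1 });
      }
    }
  });
  return violations;
}
```

The point of making this a script rather than a convention: violations become a list you can drive to zero, then gate on.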
Phase 2: Make Linting a Trusted Signal
Linting is only useful if “clean” is achievable.
So we did the boring work:
cleared ESLint warnings (unused locals, scoping issues, assignment-in-conditional traps)
standardized hasOwnProperty usage
tightened property access and scoping
Only then did we upgrade:
ESLint v9
flat config
reduced deprecated dependencies
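For reference, an ESLint v9 flat-config entry covering the cleanup above might be shaped like this (the repo’s actual eslint.config.js has more entries; the rule selection here is illustrative):

```javascript
// Illustrative ESLint v9 flat-config entry. In the repo this array would be
// exported from eslint.config.js; each rule maps to a cleanup step above.
const eslintConfig = [
  {
    files: ['js/**/*.js'],
    languageOptions: { ecmaVersion: 2022, sourceType: 'module' },
    rules: {
      'no-unused-vars': 'error',             // unused locals
      'no-cond-assign': ['error', 'always'], // assignment-in-conditional traps
      'no-prototype-builtins': 'error',      // standardized hasOwnProperty usage
    },
  },
];
```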
Once lint was noise-free, a new lint warning started meaning something again.
This is one of those hidden tipping points: when lint becomes a real gate, behavior shifts.
Phase 3: Unit Tests That Match the Domain
This was not a token unit-test pass. The suite mapped to the product’s domain logic.
We built unit tests across:
Planning
Roadmap
Org
System
Initiative
Forecast
WorkPackage
App state helpers and supporting utilities
Tooling:
Vitest + jsdom (DOM simulation where needed)
later: Vitest v4 upgrade (refreshed the Vite/esbuild chain and cleared CI audit issues)
The real win wasn’t “coverage percentage.”
The real win was this:
Domain behavior became executable.
Codex 5.2 helped by translating real product flows into testable units—without scattering random micro-tests that don’t align to how the app behaves.
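To make “executable behavior” concrete, here is a sketch of the shape of those tests. forecastRemaining is a hypothetical stand-in for the real Forecast/WorkPackage services, and in the repo this would live in a Vitest spec:

```javascript
// Hypothetical domain rule, stated as code: committed work never drives
// remaining capacity below zero. The real services are richer; this shows
// the shape of the unit tests, not their content.
function forecastRemaining(capacity, workPackages) {
  const committed = workPackages.reduce((sum, wp) => sum + wp.estimate, 0);
  return Math.max(0, capacity - committed);
}

// In a Vitest spec the rule reads like an executable sentence:
//   it('never reports negative remaining capacity', () => {
//     expect(forecastRemaining(10, [{ estimate: 12 }])).toBe(0);
//   });
```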
Phase 4: Cypress E2E Tests for Real Flows
Unit tests prove logic.
E2E tests prove the product.
We started with smoke tests and selector stabilization, then expanded into specs that mirror how users actually move:
Core UI and workflow validation
Planning and detailed planning flows
Product management flows
System creation/editing and org management
Settings and AI feedback flows
Smoke tests and selector hardening
Two deliberate choices here:
Readable and commented tests. These become “living docs.”
Selector hardening before scale. Flaky selectors are how E2E suites die.
Codex 5.2 showed long-running strength here: iterating until the suite is stable, not just “present.”
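Selector hardening mostly means giving specs a stable contract with the DOM. A minimal sketch, assuming a data-cy attribute convention and a hypothetical sel helper (not necessarily what the repo uses):

```javascript
// Hypothetical selector helper: specs reference stable data-* hooks instead
// of brittle CSS classes, so restyling the markup can't break the E2E suite.
function sel(name) {
  return `[data-cy="${name}"]`;
}

// In a Cypress spec this reads:
//   cy.get(sel('planning-toolbar')).should('be.visible');
//   cy.get(sel('save-roadmap')).click();
```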
Phase 5: Coverage You Can See (Not Just Locally)
Testing without visibility is guesswork.
So we instrumented coverage for both layers:
Unit coverage: Istanbul
E2E coverage: @cypress/code-coverage
We intentionally split reports:
coverage/unit and coverage/e2e
And then made it visible:
GitHub Actions uploads coverage artifacts on every push and PR
no special tooling required to inspect results—just download the artifact
CI now runs (as part of the quality pipeline):
npm run test:coverage
npm run test:e2e:coverage
We also fixed dependency checks so coverage tooling is first-class and green in CI (depcheck included).
This matters because it changes the social contract:
Coverage stops being “something one dev ran.” It becomes a team-visible signal.
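One nice property of the split: both layers emit Istanbul’s standard coverage-summary.json, so the reports can be compared programmatically. A small sketch (coverageDelta is a hypothetical helper; the summary shape is Istanbul’s):

```javascript
// Istanbul's coverage-summary.json exposes per-metric percentages under
// `total`. With split coverage/unit and coverage/e2e reports, a quick
// comparison like this (hypothetical helper) shows what each layer adds.
function coverageDelta(unitSummary, e2eSummary, metric = 'lines') {
  return e2eSummary.total[metric].pct - unitSummary.total[metric].pct;
}

// Example shapes, trimmed to one metric:
const unit = { total: { lines: { pct: 56.9 } } };
const e2e = { total: { lines: { pct: 41.2 } } };
// coverageDelta(unit, e2e) is negative: E2E alone covers fewer lines than unit.
```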
What Codex 5.2 Actually Enabled
This wasn’t “AI wrote tests.” That’s the shallow version of the story.
Codex 5.2 behaved more like an expert software-engineer-in-test (SWET) partner, one that kept the thread through hours of work (I exhausted my five-hour quota on more than one occasion):
it read the codebase and translated workflows into test cases
it stayed consistent with constraints across many commits
it revisited flaky E2E specs and hardened selectors
it fixed CI failures immediately instead of letting them linger
it updated docs alongside implementation changes
Short tasks are easy.
The long game—contracts + lint + unit + E2E + coverage + CI + docs—requires continuity.
That’s where Codex 5.2 shined.
Hint: if you’ve been delaying a long-term technical-debt task, such as introducing unit and integration tests (or migrating them to another framework) and wiring them into your CI/CD workflows, you no longer have to do it all yourself. Codex 5.2 is a very capable model that can do most of the heavy lifting for you, and it can transform your codebase, taking its quality to the next level, in a matter of hours.
The Working Loop I Used (Practical, Not Magical)
If you want to replicate this kind of quality transformation, here’s the loop that worked:
Start with an audit task
“Scan the repo for contract violations and list them by severity.”
“List lint rule violations that prevent a clean run.”
Fix in small diffs
insist on small PR-sized changes
prevent “helpful refactors” that mix concerns
Lock in with tooling
don’t accept “we fixed it once”
enforce it through npm run verify and CI gates
Convert workflows into tests
unit tests for domain rules
E2E specs for user flows
Make coverage visible
separate unit vs E2E
publish artifacts
Document the happy path
how to run tests locally
where coverage lands
what “green” means
Codex 5.2 helped keep this loop tight, especially through the follow-through steps that humans tend to procrastinate on.
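Step 3 of the loop is easiest to enforce when the gates are modeled as data. A sketch under stated assumptions (gate names are illustrative; in the repo the real entry point is npm run verify plus CI):

```javascript
// Hypothetical verify-pipeline model: gates run in a fixed order, and the
// first failure is what the pipeline reports. Names are illustrative.
const GATE_ORDER = ['contract-scan', 'eslint', 'stylelint', 'depcheck', 'unit', 'e2e'];

function firstFailingGate(results) {
  // results maps gate name -> pass/fail, e.g. { eslint: true, unit: false }
  for (const gate of GATE_ORDER) {
    if (results[gate] === false) return gate;
  }
  return null; // all green
}
```

Modeling the pipeline this way keeps “what does green mean?” answerable in one place instead of scattered across scripts.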
Before / After (The Shape of the System)
Outcomes (What Changed, Concretely)
By the end of this effort, the repo gained:
a stronger contract + lint foundation
ESLint v9 flat config with warning-free lint
stylelint + depcheck integrated into the quality gates
a comprehensive unit test suite across core services
a multi-spec Cypress E2E suite covering real user journeys
separate unit and E2E coverage reports
CI workflows that upload coverage artifacts on every commit/PR
updated docs that make the whole thing discoverable and repeatable
This is the difference between:
“we have tests”
and “we have quality.”
Stats
- Unit tests: 68 test cases across 27 test files
- Cypress: 7 specs in cypress/e2e, 39 E2E test cases
- 94 files changed since linting was introduced
- 37 files changed since unit tests were added
- 37 files changed since Cypress tests were added
- Statements: 54.60% → 54.61% (+0.01pp)
- Branches: 33.34% → 33.34% (+0.00pp)
- Functions: 52.90% → 52.93% (+0.03pp)
- Lines: 56.90% → 56.91% (+0.01pp)
- Before coverage/tests in CI (pre‑68418c): avg 29.2s, median 32.0s, n=10.
- After coverage/tests in CI (post‑68418c): avg 303.7s (~5:04), median 306.0s, n=6.
- all runs: n=22 avg=32.0s median=34.0s min=18.0s max=42.0s
- success runs: n=22 avg=32.0s median=34.0s min=18.0s max=42.0s
- all runs: n=19 avg=229.7s median=284.0s min=38.0s max=310.0s
- success runs: n=18 avg=239.8s median=285.0s min=38.0s max=310.0s
Why This Matters (Beyond Preventing Regressions)
A real quality system does more than reduce bugs:
Onboarding improves because expectations are written down (and executable).
Refactors get cheaper because the safety net is layered.
Velocity improves because you stop paying the “regression tax.”
Confidence increases because you can prove correctness instead of arguing.
The whole point of quality is not to slow down shipping.
It’s to preserve shipping speed as the product grows.
The Reusable Playbook
If you’re trying to do the same in your own codebase, don’t start with a thousand tests. Start with the ladder.
Define and enforce a contract (rules of the road)
Make lint clean and trustworthy
Build unit tests that map to domain logic
Start with smoke E2E tests, then expand into full flows
Instrument coverage for both layers
Publish coverage artifacts in CI
Treat docs as first-class artifacts
Quality is not a tool.
Quality is a system.
I’ve experienced the power of Codex as my AI coding-agent quality partner, and this stuff is real. My app’s codebase may still be in its infancy, but the problems it raises come close to what software engineering teams face every day, especially teams building enterprise tooling apps, like my former teams at AWS did.
Integrating AI coding agents to handle refactoring, migrations, or (as in my case) introducing end-to-end automated quality testing is now a no-brainer with assistants as powerful as Codex, and a significant productivity enhancer!

