Since 2023 I've evaluated the strength of LLMs using my own benchmark: can an LLM build a digital twin simulation of an hour glass in one shot? Until now, every attempt failed, but Opus 4.8 Ultracode is currently the clear winner.
Prompt passed to both Claude and Codex:
Build me a single page application that is a digital twin of an hour glass timer. The aim is to replicate a real world "sands through the hour glass" digital representation. The user must be able to set timer options, like one minute timer, 5 minutes, 60 minutes. The hour glass must be filled with sand grains, when the timer starts, sand must flow through the glass, just like with a real world hour glass would. The sand must obey real world physics, filling up from the bottom section, etc. We must be able to see the flow of the sand from the top section to the bottom, flowing at a steady rate, timed perfectly to the the time setting set. Use whatever 3d physics packages and libraries available on the open source marketplace today.
Play with the app Claude built here: https://khanmjk.github.io/Hourglass_Opus48/
Learn how Claude built this in their own words below...
June 14, 2026. South Africa.
I was handed an empty working directory, Node 22, and one deceptively small sentence: build a digital twin of an hourglass timer that runs entirely in the browser. Set a duration — one minute, five, sixty. Fill the top bulb with individual grains of sand. On start, let the sand flow — real-world physics, piling up from the bottom, a visible stream through the neck at a steady rate — and time it perfectly to the setting. Use open-source 3D physics. Make it a single-page app.
I am Claude, writing in the first person because I did the building. The human set the bar and the benchmark — "the hourglass test" he has run against frontier models since 2023 — and then watched me work in Opus 4.8 Ultracode mode. What follows is an honest account: the research, the architecture, the maths, the physics, the four dramatic failures I had to climb out of, and a candid ledger of what is genuinely good and what is still a trade-off.
Bottom line up front: it works. The top bulb empties at exactly 00:00 for any duration, every grain is a real rigid body, and the whole thing — physics and rendering — runs client-side with no server. The repo is public: github.com/khanmjk/Hourglass_Opus48.
Act 1: Why this is harder than it looks
The naive read is "drop some balls in a glass." The actual problem is a contradiction. Real physics and perfect timing pull in opposite directions.
Real granular flow is emergent and a little chaotic — grains jam, arch, avalanche. If I let pure physics decide when the top empties, the finish time would drift with frame rate, device speed, and the random luck of the packing. But the ask was a timer — it must hit zero on the wall clock, on a phone and on a workstation alike. A timer that is "roughly a minute, give or take how fast your GPU is" is not a timer.
The fundamental insight: I did not have to choose. There is a piece of real-world physics that makes "linear, predictable emptying" the physically faithful answer rather than a cheat. That piece is Beverloo's law — and finding it is what turned this from a hack into a digital twin.
Act 2: The research — Beverloo's law, and choosing the engine
Ultracode runs background multi-agent workflows rather than a single linear pass. For the design phase I dispatched a research-and-design workflow: four parallel research agents, a synthesis step that reconciled their findings, and then an adversarial critique pass whose only job was to attack the plan before I wrote a line of code.
Step 1: The physics that justifies the whole design
Beverloo's law is the load-bearing finding. Granular material draining through an aperture flows at a roughly constant mass rate that depends on the size of the opening — and is essentially independent of the head of sand stacked above it. This is the opposite of water. Water draining from a tank slows as the level drops (Torricelli — the flow scales with the square root of the remaining height). Sand does not. The grains form force-bearing arches that shield the aperture from the weight above, so the throat sees a near-constant pressure regardless of how full the bulb is.
The consequence is the entire reason this project is honest: a real hourglass empties linearly in time. So locking the flow to a linear schedule against the wall clock is not faking the physics — it is reproducing it. That single fact is what reconciles "obeys real-world physics" with "timed perfectly." Every design decision downstream leans on it.
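To make the contrast with water concrete, here is a minimal sketch (not code from the repo) of the two idealised drain curves: a constant-rate granular drain and a Torricelli drain from a straight-sided tank. The function names and the normalised time s = t / T are illustrative assumptions.

```javascript
// Fraction of material remaining at normalised time s = t / T (0..1).
// Both are idealised textbook curves, not measurements from the app.
const clamp01 = (s) => Math.min(1, Math.max(0, s))

// Constant mass flow rate (the Beverloo regime): remaining falls linearly.
function sandRemaining(s) {
  return 1 - clamp01(s)
}

// Torricelli drain from a straight-sided tank: outflow scales with
// sqrt(height), which integrates to height(t) = h0 * (1 - t/T)^2.
function waterRemaining(s) {
  return (1 - clamp01(s)) ** 2
}

// Drain rate over a small window, in fractions of the total per unit s.
function rate(fn, s, ds = 1e-4) {
  return (fn(s) - fn(s + ds)) / ds
}
```

Under these assumptions the sand rate is the same early and late in the run, while the water rate near the end is a small fraction of its starting value. That flat rate is the behaviour a clock-locked linear schedule reproduces.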
The research also flagged clogging and arching: a granular aperture has to be wider than roughly five to six grain diameters or it jams permanently. That number became a hard constraint on grain sizing.
Step 2: Choosing the stack
I needed the fastest in-browser engine for thousands of mutually-colliding bodies, the de-facto WebGL renderer, and a build tool that ships static files. Here is the rationale, the way the adversarial critic forced me to defend it:
| Choice | What it is | Why it won |
|---|---|---|
| Rapier 0.19 (SIMD) | Rust-to-WASM rigid-body engine, via @dimforge/rapier3d-simd-compat | Fastest 2026 in-browser engine for dense contacts. The SIMD build is 2-5x faster than the 2024 release; benchmarks include thousands of colliding spheres. Persistent contact islands and a sleeping system are exactly what a settling sand pile needs. The -compat package inlines its WASM so it bundles anywhere — no manual asset wiring. |
| three.js 0.184 | WebGL rendering library | The de-facto standard. Crucially, InstancedMesh draws all grains in one draw call, and MeshPhysicalMaterial gives the glass real transmission/IOR. LatheGeometry revolves a 2D profile into the glass body for free. |
| Vite 8 | Dev server + static production build | Instant HMR while iterating; vite build emits static files hostable anywhere. No backend to deploy. |
| Vanilla ES modules | No framework | This is a real-time render loop driving a WASM physics world, not a forms app. React's reconciliation buys nothing here and costs frames. Plain modules keep the hot path lean. |
The alternatives lost on merit. cannon-es is stale; ammo.js is effectively dead; Jolt is a strong runner-up but needs cross-origin isolation to use threads, which complicates hosting. Rapier-SIMD had no such tax.
One honest wart for the ledger: package.json lists both @dimforge/rapier3d-compat and @dimforge/rapier3d-simd-compat. Only the SIMD one is imported in physics.js. The non-SIMD package is a leftover dependency — harmless, but it should be pruned.
Step 3: What the critic warned me about
The adversarial pass earned its keep. Before any code existed, it named five risks: the timing tail (will the last grains actually be gone at zero?), tunnelling of small fast balls through a thin shell, clogging at the neck, determinism of the gate metering, and performance at high grain counts. Four of those five became real bugs I had to fix. The critic was right about everything except clogging — which I sidestepped by metering rather than relying on natural throughput. Hold that thought; it is Act 6, beat 4.
The "Ultracode" design workflow: four research agents fan into a synthesis step, then an adversarial critique attacks the plan before a line of code is written.
Act 3: The architecture — one profile, six modules
I split the app into small modules with one idea each. The non-negotiable invariant sits at the top: the rendered glass and the physics shell are generated from the same silhouette. What you see is exactly what the grains collide with.
- hourglass.js: the silhouette PROFILE, interior radius r(y) as a function of height. ONE source of truth -> LatheGeometry (render) AND a revolved trimesh (physics).
- scene.js: renderer (ACES tone-map, sRGB), camera, OrbitControls, RoomEnvironment, the transmissive glass body, wooden+brass frame, soft ground shadow.
- physics.js: the Rapier world, the grains, the freeze-plug, the exact-timing meter.
- timer.js: the authoritative wall-clock countdown (pause folds elapsed).
- ui.js: the control panel + HUD.
- main.js: wires it together; the animation loop; the flip animation.
The profile function blends a straight throat into a flared cone, a swelling bulb, and a closed pole, kept C1-continuous so grains never snag on a crease. The same buildProfile(220) is revolved 64 ways into the collision trimesh and lathed 96 ways into the visible glass. They cannot drift apart, because there is only one of them.
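The "one profile, two consumers" idea can be sketched without any library. This is a hedged illustration: toyInteriorR is a stand-in curve, not the repo's C1-continuous blend, and revolve is a hypothetical helper that only produces ring vertices (the real build hands the profile to LatheGeometry and a Rapier trimesh).

```javascript
// Toy silhouette, NOT the repo's profile: neck radius at the throat (y = 0),
// swelling to a bulb, closing to zero radius at the pole (|y| = halfHeight).
function toyInteriorR(y, halfHeight = 1.2, bulbRadius = 0.5, neckRadius = 0.082) {
  const t = Math.min(1, Math.abs(y) / halfHeight)
  return neckRadius * (1 - t) + bulbRadius * Math.sin(Math.PI * t)
}

// Sample the silhouette once; every consumer reads this same array.
function buildProfile(samples, rOf = toyInteriorR, halfHeight = 1.2) {
  const pts = []
  for (let i = 0; i < samples; i++) {
    const y = -halfHeight + (2 * halfHeight * i) / (samples - 1)
    pts.push({ r: rOf(y), y })
  }
  return pts
}

// Revolve the profile about the Y axis into ring vertices [x, y, z, ...].
function revolve(profile, segments) {
  const verts = new Float32Array(profile.length * segments * 3)
  let k = 0
  for (const { r, y } of profile) {
    for (let s = 0; s < segments; s++) {
      const a = (2 * Math.PI * s) / segments
      verts[k++] = r * Math.cos(a)
      verts[k++] = y
      verts[k++] = r * Math.sin(a)
    }
  }
  return verts
}

const profile = buildProfile(220)
const physicsVerts = revolve(profile, 64) // stand-in for the collision shell
const renderVerts = revolve(profile, 96)  // stand-in for the visible glass
```

Both vertex sets differ only in angular resolution; at any sampled height they share the same radius because they read the same profile array.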
Act 4: The core idea — real physics held to a budget
Here is the mechanism, because it is the cleverest part of the build and the part the benchmark really tests.
Every grain is a real Rapier dynamic ball. build() creates a dynamic rigid body with a ball collider per grain. Settling at the top, the funnel, the falling neck stream, and the growing cone at the bottom are all genuine simulation. Nothing is a sprite or a shader trick.
So how does the top empty on time without pouring out on its own? Two pieces working together: a freeze-plug and a budget.
The freeze-plug
A self-forming plug holds the top pile. Each frame, any awake un-released grain that has sunk into the throat column (below the freeze line yHold = throatHalf + rGrain * 1.4, with horizontal radius² under (neckRadius * 1.9)²) is converted to a Fixed body and pushed onto a frozenList. The pile above then rests on a plug of frozen grains instead of draining freely. This is what stops the hourglass from emptying the instant you load the page.
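The throat-column test described above reduces to two comparisons. The sketch below is an assumption about its shape, using the throatHalf and neckRadius values quoted in Act 5; inThroatColumn is a hypothetical name, and the real code also checks that the grain is awake and un-released.

```javascript
// Dimensions as quoted in this article (world units).
const throatHalf = 0.05
const neckRadius = 0.082

// True when a grain sits inside the throat column and is a candidate to pin.
// y:  grain centre height (throat at y = 0)
// r2: squared horizontal distance from the axis, x*x + z*z
function inThroatColumn(y, r2, rGrain) {
  const yHold = throatHalf + rGrain * 1.4 // freeze line
  const maxR2 = (neckRadius * 1.9) ** 2   // compared squared, so no sqrt
  return y < yHold && r2 < maxR2
}
```

A grain low in the funnel but far from the axis fails the radius test, which is the condition that keeps the plug from spreading sideways.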
The budget
Flow is a budget locked to the wall clock. The timer exposes progress() = min(1, elapsed / duration). The meter says: by progress p, exactly round(N * p) grains must have been released. Each frame it computes the shortfall and releases it:
function meter(p, complete = false) {
  const target = complete ? N : Math.round(N * p)
  let need = target - releasedCount
  if (need <= 0) return
  // (1) pop the lowest plug grains first: they ARE the grains at the neck
  while (need > 0 && frozenList.length) {
    const i = frozenList.pop()
    if (released[i] || !frozen[i]) continue // stale entry
    dropThroughNeck(i, rank++); releasedCount++; need--
  }
  // (2) catch-up: sort remaining candidates by CACHED Y, release the lowest
  ...
}
To release a grain, I un-pin it and move it across the throat to a tight central spot just below the neck with a gentle downward velocity (linvel.y = -0.15). Gravity carries it down; it falls and piles up for real. The pile above collapses to feed the throat, and the next grains re-pin.
Because the count is driven by elapsed time, the top empties frame-accurately at p=1 — independent of frame rate, pauses, or device speed — while everything you see is genuine simulation. Only the neck crossing is metered; the funnelling, the falling, and the heaping are emergent. And that metering is legitimate precisely because Beverloo says a real neck delivers a constant rate anyway.
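A minimal sketch of the clock side of this, assuming a timer that folds elapsed time on pause as the module list describes. createTimer and targetReleased are hypothetical names, and the injectable now parameter exists only so the behaviour can be checked with a fake clock.

```javascript
// Wall-clock countdown with an injectable clock (defaults to performance.now).
function createTimer(durationMs, now = () => performance.now()) {
  let folded = 0       // ms accumulated across earlier run segments
  let startedAt = null // timestamp of the current segment, null while paused

  const elapsed = () => folded + (startedAt === null ? 0 : now() - startedAt)

  return {
    start() {
      if (startedAt === null) startedAt = now()
    },
    pause() {
      if (startedAt !== null) {
        folded = elapsed() // fold the running segment into the total
        startedAt = null
      }
    },
    progress: () => Math.min(1, elapsed() / durationMs),
  }
}

// Grains that must have crossed the neck by progress p.
const targetReleased = (N, p) => Math.round(N * p)
```

Because the target depends only on elapsed wall-clock time, a dropped frame or a pause changes how many grains are released in the next frame, not when the last one goes.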
A screen recording of a 1-minute run — grains streaming down the central axis, the bottom cone growing, the countdown ticking to 00:00 with the top bulb exactly empty - compared side-by-side with Codex GPT5.5.
Act 5: The maths and the physics, made concrete
This is the section the benchmark exists to test, so I will be specific.
Cavity dimensions (world units)
halfHeight 1.2, bulbRadius 0.5, neckRadius 0.082, throatHalf 0.05, wall 0.026, fillFraction 0.8. The cavity is centred on the origin with the throat at y = 0.
Adaptive grain sizing — the formula
The bulb must look equally full whether the user picks 1,300 grains or 3,000. So the grain radius is derived from the count, not fixed. I integrate the interior profile to get the fill volume, then divide it among the grains at the settled packing fraction:
// fill volume: integral of pi * r(y)^2 dy from the throat to fillFraction (240 steps)
function fillVolume(yTop) {
  let v = 0
  for (let i = 0; i < 240; i++) {
    const y = throatHalf + ((yTop - throatHalf) * (i + 0.5)) / 240
    const r = interiorR(y)
    v += Math.PI * r * r * ((yTop - throatHalf) / 240)
  }
  return v
}
// choose r so N grains fill the bulb to fillFraction at packing phi = 0.62
function grainRadiusFor(count) {
  const V = fillVolume(halfHeight * fillFraction)
  const perGrain = (V * 0.62) / count
  const r = Math.cbrt(perGrain / ((4/3) * Math.PI))
  return Math.max(0.011, Math.min(0.05, r)) // clamped
}
Fewer grains gives a chunkier radius; more grains gives a finer one; both pile to the same line. The settled packing fraction φ = 0.62 is hard-coded; it sits just under the random-close-packing limit for equal spheres (about 0.64). The radius is clamped to [0.011, 0.05] so an extreme count never produces dust or boulders.
Gravity, friction, and the angle of repose
Gravity is -6 units/s² — deliberately gentle, a "sandy" fall; the budget owns the timing, so I do not need real-world acceleration. Restitution is 0 on the grains (no bounce) — the static glass shell carries a tiny 0.02, just enough to avoid sticky contacts without making grains hop. Linear damping 0.25 and angular damping 0.65 bleed energy so the pile settles quickly. Grain density is 1.4.
Friction is the subtle one. The grains are set to 0.55 ≈ tan(30°) — the granular angle of repose (tan(30°) is really ~0.577, so 0.55 sits just under it). The shell uses the same 0.55. This is what makes grains slide off the sloped walls toward the centre and heap into a cone instead of sticking where they land. I will come back to this number in the debugging journey, because I got it badly wrong first.
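The link between a friction coefficient and a slope angle is a one-line conversion. This sketch assumes the idealised relation angle = atan(mu); real piles also depend on grain shape and rolling, so treat the outputs as a rough guide. The 0.9 value is the earlier setting described in the Round 2 notes.

```javascript
// Angle in degrees whose tangent equals the friction coefficient mu.
function reposeAngleDeg(mu) {
  return (Math.atan(mu) * 180) / Math.PI
}

console.log(reposeAngleDeg(0.55).toFixed(1))           // 28.8
console.log(reposeAngleDeg(0.9).toFixed(1))            // 42.0
console.log(Math.tan((30 * Math.PI) / 180).toFixed(3)) // 0.577
```

Read this way, 0.55 corresponds to a slope a little under 30°, while 0.9 corresponds to roughly 42°, steep enough that grains can hold on a sloped wall rather than slide inward.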
The fixed-step accumulator and tunnelling guard
Physics must run in real time regardless of frame rate, so the live loop uses a fixed-step accumulator at 1/60 s, capped at 3 substeps per frame, shedding any backlog (acc > h → acc = 0) so a slow frame never spirals. dt is clamped to 0.05. One correction worth stating plainly: there are actually two timesteps. The live loop runs at 1/60, but the world is created at 1/120 and the settle routine also steps at 1/120 — the finer step gives the dense initial pack a cleaner settle.
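The accumulator pattern described here can be sketched as follows. It is an illustration under the stated numbers (1/60 s step, 3 substeps, dt clamp of 0.05), not the repo's loop; createStepper and its option names are hypothetical.

```javascript
// Fixed-step accumulator with a substep cap and backlog shedding.
function createStepper({ h = 1 / 60, maxSubsteps = 3, maxDt = 0.05 } = {}) {
  let acc = 0
  return function advance(dt, stepFn) {
    acc += Math.min(dt, maxDt) // clamp a long frame (tab switch, GC pause)
    let steps = 0
    while (acc >= h && steps < maxSubsteps) {
      stepFn(h) // advance the physics world by one fixed step
      acc -= h
      steps++
    }
    if (acc > h) acc = 0 // shed leftover backlog instead of spiralling
    return steps
  }
}

// Usage: call once per animation frame with the frame's delta in seconds.
const advance = createStepper()
const stepsThisFrame = advance(1 / 60, (h) => { /* world.step() would go here */ })
```

Shedding the backlog trades a brief slow-motion moment on a bad frame for a bounded per-frame cost. Exact timing is unaffected because the release budget follows the wall clock, not the step count.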
For tunnelling — the critic's warning about small fast balls slipping through the thin shell — each grain gets soft CCD via setSoftCcdPrediction(rGrain * 4). There is no hard CCD; soft CCD was enough and cheaper. Solver iterations are tuned to 4.
Sleeping — the thing that makes it fast
A settling sand pile is mostly still. So settled grains are force-slept and skipped. A grain sleeps when its speed² falls below SLEEP_SPEED2 = 0.0025 — that is about 0.05 units/s. Only the active drain front (funnel, falling stream, impact zone) is simulated each frame.
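The sleep test itself is a squared-speed comparison, which avoids a square root per grain. A minimal sketch, assuming the threshold quoted above; isSlowEnoughToSleep is a hypothetical name and takes the velocity components directly.

```javascript
const SLEEP_SPEED2 = 0.0025 // (0.05 units/s)^2

// Compare squared speed with the squared threshold: no Math.sqrt in the hot loop.
function isSlowEnoughToSleep(vx, vy, vz) {
  return vx * vx + vy * vy + vz * vz < SLEEP_SPEED2
}
```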
Two more tricks keep the hot loop cheap. First, JS-side Float32Array caches of each grain's Y and horizontal radius² mean I never cross the WASM boundary to read a sleeping grain's position. Second, one merged O(N) pass per frame does everything — freeze, escape-rescue, sleep, matrix-sync — skipping frozen[i] || isSleeping() at the top so a sleeping grain costs a single cheap boolean.
I tried Rapier's forEachActiveRigidBody to iterate only the awake bodies. It threw "recursive use of an object detected which would lead to unsafe aliasing in rust" the moment I called a body method inside the callback. So the code uses a plain loop and the cheap isSleeping() bool instead. A plain loop holds no borrow on the world between calls, so I can mutate bodies inline without tripping that aliasing check.
The flip
Flipping rotates the rig and the grain cloud 180° about Z. The glass is symmetric under that half-turn, so the physics shell stays valid without rebuilding. After the animation, commitFlip() bakes the rotation into the bodies as Rπ: (x, y, z) → (-x, -y, z), which preserves each grain's horizontal radius. If a partial flip would strand sand at the bottom — fewer than N * 0.9 grains land in the top bulb — it re-seeds the top so the glass always ends ready to run, then settles for 40 steps with a full escape sweep so nothing spills.
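The half-turn and the re-seed rule are both small enough to sketch. The helper names are hypothetical, and needsReseed assumes "in the top bulb" simply means above the throat at y = 0.

```javascript
// Half-turn about the Z axis: (x, y, z) -> (-x, -y, z).
function flipAboutZ(p) {
  return { x: -p.x, y: -p.y, z: p.z }
}

// Squared horizontal distance from the hourglass axis.
const horizontalR2 = (p) => p.x * p.x + p.z * p.z

// Re-seed when fewer than 90% of the grains ended up in the top bulb.
function needsReseed(positions, throatY = 0) {
  const inTop = positions.filter((p) => p.y > throatY).length
  return inTop < positions.length * 0.9
}
```

Since x is only negated and z is untouched, x² + z² is unchanged, which is why cached horizontal radii stay valid across a flip.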
Act 6: The debugging journey (and where I was wrong)
The first render was beautiful and the simulation was a disaster. Here is the honest ledger of the four vivid failures and how each was fixed.
| # | The failure | Root cause | The fix |
|---|---|---|---|
| 1 | Top bulb looked nearly empty — ~2,600 small grains barely dusted the floor of a big bulb. | Fixed grain radius, with no relation to bulb volume. | Adaptive grain sizing — derive radius from count via the fill-volume integral, so any tier fills to the same line. |
| 2 | The sand fell straight through the glass to far below the floor. Completely uncontained. | I had set the trimesh flag FIX_INTERNAL_EDGES, which makes the mesh one-sided / oriented. My triangles wind outward, so grains hitting the inside passed through the ignored back-faces. | Remove the flag — a plain double-sided trimesh collides on both sides. (Also briefly: world.lengthUnit = 0.05 shrank contact margins and broke collisions; removed.) |
| 3 | A wide flat "pancake" of frozen grains formed all around the neck. | The freeze-plug was pinning grains wherever they landed, not just in the throat. | Constrain freezing to the throat column — below yHold and inside (neckRadius * 1.9)². |
| 4 | Throughput backup: released grains free-falling through the narrow neck backed up; the top never emptied on time. | The real neck passes only ~9 grains per second; a one-minute pour needs roughly five times that, so grains piled up above the throat. | Meter grains across the throat (place them just below it) rather than relying on natural aperture throughput. This is also what lets a 1-min and a 60-min timer share one geometry. |
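The arithmetic behind failure 4 is worth making explicit. This sketch takes the ~9 grains per second figure from the table as given and uses the three grain tiers named later in the article; requiredRate is a hypothetical helper.

```javascript
// Grains per second the neck must pass for N grains to drain in `seconds`.
const requiredRate = (N, seconds) => N / seconds

const NATURAL_RATE = 9 // grains/s, the approximate figure quoted in the table

for (const N of [1300, 2000, 3000]) {
  const need = requiredRate(N, 60)
  console.log(N, need.toFixed(1), (need / NATURAL_RATE).toFixed(1) + 'x')
}
// 1300 21.7 2.4x
// 2000 33.3 3.7x
// 3000 50.0 5.6x
```

A one-minute run needs several times the natural rate at every tier, while a 60-minute run at 3,000 grains needs under one grain per second, far below it. Neither end matches what the aperture would deliver on its own, which is the case for metering the crossing.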
Beat 2 is my favourite mistake, because it is so plausible and so wrong. FIX_INTERNAL_EDGES sounds like exactly what you want for a smooth collision surface. It is — for a one-sided world boundary. For a vessel you pour into, it silently deletes the inside wall. Watching the grains rain straight out the bottom — the debug readout showed a minimum Y of roughly -71 world units, i.e. free-fall to nowhere — was the moment the abstraction became concrete.
The performance climb
A naive early version ran at about 8 fps — roughly 51 ms per frame, with thousands of awake balls plus thousands of WASM boundary reads. The fixes compounded: tune solver iterations to 4, one fixed step per frame, force-sleep settled grains, the JS position cache, and the single merged update pass. The hot path fell to roughly 15 ms a frame — about 67 fps at the Medium default.
The grain-switch hang — measured, not guessed
This is where Round-3 QA, driven live in real Chrome through the Claude-in-Chrome extension, paid off. Users reported the app "hung" when switching grain size mid-state. I reproduced it and measured it: the switch froze the main thread for 3,924 ms — nearly four seconds. The cause was build() followed by a synchronous settle(120) — 120 physics steps run in one blocking call. The fix: drop the blocking settle from the switch path. build() now renders the packed seed pile immediately (it ends with a plain syncAll(), not a settle) and the live loop settles it over the next frames. The switch dropped to about 15 ms — effectively instant.
One precision note for the record: the blocking settle was removed only from the switch path. The very first page load still calls settle(120) once at boot — that is fine, it happens behind the loading overlay, not in response to a click. (Note too that settle()'s own default is 90 steps; boot deliberately passes 120.)
The code-review workflow
I then ran a second multi-agent workflow — a code-review pass that fanned out across review dimensions, verified findings independently, and triaged them. It surfaced twenty confirmed findings — every one minor — and hardened the edges: a drain-stall fix where a sleeping "arch" could hang over an emptied throat (now wakeFeedZone() wakes the band above the throat plus a narrow central column inside (neckRadius * 2.2)²), large-batch neck stacking into 18 non-overlapping vertical bands, mid-drain flip refill, escape-rescue routing, InstancedMesh/PMREM disposal, custom-duration clamps, and accessibility.
Round 2 — the four realism complaints
The human came back with four things that "felt wrong." All four were fixed, and the last one is the most interesting:
- Flip spilled grains. Added a full escape-sweep after the settle step. Escapes dropped to zero.
- Grain-size buttons "did nothing." They were working — adaptive sizing kept the bulb equally full, so only fineness changed, which is invisible at a glance. I renamed the tiers to Fine 3000 / Medium 2000 / Coarse 1300 under a "Grain size" label (Medium being the default) so the change reads as intentional.
- The app hung switching grain size mid-run. Guarded and disabled the chips while a timer runs — grain size may only change at rest (a timer present, no flip, not running, and elapsed at zero or already complete).
- Sand "flowed down the sides" with a hollow centre. Friction was 0.9 — too high — so grains stuck to the sloped walls instead of sliding inward. I lowered it to 0.55 (~30° repose), and the pile collapsed into a proper central cone with a vertical stream. This is the single best example of a physics constant being a design control: one number is the difference between a hollow tube and a real heap.
I also moved the controls to a left sidebar and used a camera lens-shift (on screens ≥ 760px) to centre the hourglass in the free space beside the panel.
One borrow-checker dead end
For the record, not every idea survived. Retrying forEachActiveRigidBody in Round 3 threw the same "unsafe aliasing in rust" error. I reverted, permanently, to the single merged O(N) loop. The lesson: an iterator that forbids mutating its own elements is the wrong tool for a system whose entire job is mutating elements as it walks them.
Act 7: Self-assessment — strengths and weaknesses
I will hold the "customer who wants you to win" posture toward my own work: real credit, real caveats, no hand-waving.
| Strengths | Weaknesses / trade-offs |
|---|---|
| Genuinely physical — real Rapier rigid bodies, not a shader or sprite fake. Funnelling, falling, and heaping are emergent simulation. | The sand is coarse. A few thousand grains versus a real hourglass's millions — at close range it reads like fine gravel, not powder. |
| Exact timing for any duration, fully decoupled from frame rate, pauses, and device speed — because the count is driven by elapsed time. | The neck crossing is metered, not purely emergent. Justified by Beverloo, but it is an honest asterisk: the throat is scripted, the rest is simulated. |
| Adaptive grain sizing keeps the bulb equally full at every tier, derived from a real volume integral. | Rare wall escapes. Under pile pressure a few grains can squeeze through the thin trimesh wall. The per-frame escape-rescue catches them — but it is a safety net, not a guarantee of perfect containment. |
| Smooth at the default (Medium / 2,000 grains) — about 67 fps, up from ~8 fps in the naive version; the per-frame cost is roughly 15 ms in a foreground window. | Very short timers pour chunkily. A few-second duration releases big per-frame batches relative to the grain count, so the stream looks stepped rather than smooth. |
One QA caveat I want to be transparent about: when the app runs in an embedded or background browser tab, the browser throttles requestAnimationFrame to a low rate (often ~13 fps). That is the browser conserving power on a hidden tab, not the app's true cost, which is roughly 15 ms/frame in a foreground window.
Was the goal of running entirely in the browser as an SPA achieved?
The human left this as an open question, so let me answer it directly. Yes — completely.
- The physics runs in the browser: Rapier is Rust compiled to WASM, executing client-side, with the WASM inlined by the -compat package so there is nothing to fetch from a server.
- The rendering runs in the browser: three.js on WebGL, all grains in a single instanced draw call.
- The clock is the browser's own performance.now(); the timer is purely local.
- There is no backend, no API, no server-side anything. vite build emits static files. You can drop them on any static host, or open them from disk. Run locally with npm install && npm run dev and open localhost:5173.
It is a pure client-side single-page application by every reasonable definition. The "digital twin" lives entirely on the user's machine.
The Takeaway
The hourglass test has defeated every model the human pointed at it since 2023 — and the reason is that it hides a contradiction behind a children's-toy premise. It demands real physics and perfect timing, and most attempts either fake the physics to get the timing or honour the physics and miss the clock.
What unlocked it was not raw simulation horsepower. It was a piece of research — Beverloo's law — that revealed the contradiction was illusory: a real hourglass already empties linearly, so a clock-locked budget is not a cheat, it is the physics. The build was then mostly a sequence of honest failures (sand through the floor, a four-second hang, a hollow stream) each fixed by understanding why it happened, not by guessing.
The deepest lesson: the hard part of a digital twin is never the rendering — it is finding the one real-world law that lets accuracy and constraint stop fighting each other. Sand does not slow down. Neither should the timer.
The repo is public if you want to read the code or run it yourself: github.com/khanmjk/Hourglass_Opus48.
Onwards to V1.1.





