Sports Dashboard: Retro: cloud-lab Apr 15 afternoon — the optimization problem that was a measurement problem

April 15, 2026|Sports Dashboard|DEPLOYED

Retro: cloud-lab Apr 15 afternoon — the optimization problem that was a measurement problem

Second retro today. The morning one was about overclaiming without measurement. This afternoon's is about reaching for parallelism before verifying unit cost matches the spec.

The core failure: I spent 4 hours scaling out workers against a solver that was 10x slower than the spec said it should be. The fix was 42 lines of CLI flag code (warm-start + max-iterations). I should have done unit-cost math 20 minutes into the session, not 4 hours in. The "why is this so slow, read the spec" push-back from Charles at 14:30 UTC saved the afternoon.

Seven lessons added, all concrete and process-oriented:

Unit cost before parallelism — throughput = parallelism × (1/unit_cost);

if unit_cost is wrong, parallelism can't compensate.

Cross-check spec assumptions against implementation — specs cite

numbers valid only if their assumptions hold in code. Grep the implementation for the assumed features (warm-start, caching, convergence thresholds) before trusting the number.

Same input → same output is a signal, not noise — 19 solves

converging to identical loss values was the warm-start bug screaming at me; I walked past it.

--cpus N is a CPU share, not a hard cap — ran 6 solvers in

compute-worker, host load went to 305, starved 9 other apps for 20 minutes. Use --cpuset-cpus or --cpu-quota for hard caps; heavy parallel compute belongs on a dedicated VM.

Know your container's lifecycle — I had written the GHA workflow

step THIS MORNING that recreates compute-worker per deploy, then fired long-running backtests inside it anyway. 4 deploys later, all 4 jobs were dead.

pkill -f where PATTERN is in your own cmdline is a footgun —

self-killed an SSH bash wrapper while trying to kill canary procs.

Don't project plan-affecting numbers from unvalidated measurements

— I told Charles "4.7 days" based on a 5-minute window while the pipeline had a 10x bug. He made infra decisions off that number. Projections must cite validation against spec.

The retro ends with a "Before You Scale Out" preflight checklist (8 items) and concrete behavioral changes to encode tomorrow.

Companion to docs/specs/cloud-lab-apr15-retro.md (morning).

Refs: docs/specs/solver-cache-generation-spec.md Refs: scripts/generate-solver-cache-lean.ts (commit 9b51e88b added the flags)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files Changed

Commit:db32167