Imprevista

Deploy Log

← Back to Deploy Log
|Sports Dashboard|DEPLOYED

Retro: cloud-lab Apr 15 afternoon — the optimization problem that was a measurement problem

Second retro today. The morning one was about overclaiming without measurement. This afternoon's is about reaching for parallelism before verifying unit cost matches the spec.

Second retro today. The morning one was about overclaiming without measurement. This afternoon's is about reaching for parallelism before verifying unit cost matches the spec.

The core failure: I spent 4 hours scaling out workers against a solver that was 10x slower than the spec said it should be. The fix was 42 lines of CLI flag code (warm-start + max-iterations). I should have done unit-cost math 20 minutes into the session, not 4 hours in. The "why is this so slow, read the spec" push-back from Charles at 14:30 UTC saved the afternoon.

Seven lessons added, all concrete and process-oriented:

  1. Unit cost before parallelism — throughput = parallelism × (1/unit_cost);

if unit_cost is wrong, parallelism can't compensate.

  1. Cross-check spec assumptions against implementation — specs cite

numbers valid only if their assumptions hold in code. Grep the implementation for the assumed features (warm-start, caching, convergence thresholds) before trusting the number.

  1. Same input → same output is a signal, not noise — 19 solves

converging to identical loss values was the warm-start bug screaming at me; I walked past it.

  1. --cpus N is a CPU share, not a hard cap — ran 6 solvers in

compute-worker, host load went to 305, starved 9 other apps for 20 minutes. Use --cpuset-cpus or --cpu-quota for hard caps; heavy parallel compute belongs on a dedicated VM.

  1. Know your container's lifecycle — I had written the GHA workflow

step THIS MORNING that recreates compute-worker per deploy, then fired long-running backtests inside it anyway. 4 deploys later, all 4 jobs were dead.

  1. pkill -f where PATTERN is in your own cmdline is a footgun —

self-killed an SSH bash wrapper while trying to kill canary procs.

  1. Don't project plan-affecting numbers from unvalidated measurements

— I told Charles "4.7 days" based on a 5-minute window while the pipeline had a 10x bug. He made infra decisions off that number. Projections must cite validation against spec.

The retro ends with a "Before You Scale Out" preflight checklist (8 items) and concrete behavioral changes to encode tomorrow.

Companion to docs/specs/cloud-lab-apr15-retro.md (morning).

Refs: docs/specs/solver-cache-generation-spec.md Refs: scripts/generate-solver-cache-lean.ts (commit 9b51e88b added the flags)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files Changed

Commit:db32167