Retro: cloud-lab Apr 15 — the one where I overclaimed all day
Honest post-incident retro of yesterday's session. The proximate bugs (stale fire script, no state seed, no rsync-on-failure) are real and fixed in 1616a76c. The deeper issue is process: I treated job.status as evidence of work and never measured the actual output volume, so a 24h compute session producing 176 net new snapshots looked indistinguishable from a successful run on the dashboards.
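A minimal sketch of what "measure output, not status" could look like as a post-run check, assuming snapshots are files on disk. The `*.snapshot` pattern, the function names, and the `min_new` threshold are all hypothetical and are not taken from the cloud-lab code:

```python
from pathlib import Path


def count_snapshots(root: Path, pattern: str = "*.snapshot") -> int:
    """Count snapshot files under root. The pattern is a hypothetical naming convention."""
    return sum(1 for p in root.rglob(pattern) if p.is_file())


def check_output(before: int, after: int, min_new: int) -> int:
    """Return the net new snapshot count, raising if the run produced too little."""
    net_new = after - before
    if net_new < min_new:
        raise RuntimeError(
            f"only {net_new} net new snapshots, expected at least {min_new}"
        )
    return net_new
```

Taking the count before and after a session and gating on the difference would let a low-volume run fail loudly, whatever job.status reports.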
The retro adds five lessons that belong in tasks/lessons.md and the global CLAUDE.md:
- Measure output, not status (anti-pattern)
- Smoke tests test transports, not workloads (anti-pattern)
- Fire scripts need set -e (mistake)
- State-sync is a first-class concern for idempotent jobs (mistake)
- Partial output is real output (anti-pattern)
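The last lesson, together with the missing rsync-on-failure bug, comes down to syncing whatever exists even when the job dies. A hedged sketch of that shape follows. `run_then_sync` and both callables are hypothetical names, and in practice `sync` would presumably wrap the rsync step:

```python
from typing import Callable


def run_then_sync(job: Callable[[], None], sync: Callable[[], None]) -> None:
    """Run job, then always call sync, even if job raised."""
    try:
        job()
    finally:
        # Partial output is still output: copy it back before the error propagates.
        sync()
```

The job's exception still propagates after the sync, so the failure stays visible while the partial results are preserved.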
The retro is structured as proximate bugs (B1-B3), process failures (P1-P5), a root-cause synthesis, the fix shipping today, and an evidence trail with snapshot counts at each point in the session, so that future sessions auditing this one can check the numbers.
Companion to docs/specs/cloud-lab-apr11-retro-and-rebuild.md.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>