Imprevista

Deploy Log

Sports Dashboard | DEPLOYED

Retro: cloud-lab Apr 15 — the one where I overclaimed all day

Honest post-incident retro of yesterday's session. The proximate bugs (stale fire script, no state seed, no rsync-on-failure) are real and fixed in 1616a76c. The deeper issue is process: I treated job.status as evidence of work and never measured the actual output volume, so a 24h compute session producing 176 net new snapshots looked indistinguishable from a successful run on the dashboards.
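Checking output volume rather than status could look roughly like the sketch below. It assumes snapshots are files under a directory; count_snapshots, verify_run, the "*.json" pattern, and the min_new threshold are all hypothetical names for illustration, not taken from the cloud-lab code.

```python
from pathlib import Path


def count_snapshots(root: Path, pattern: str = "*.json") -> int:
    """Count snapshot files under root (the file pattern is an assumption)."""
    return sum(1 for p in root.rglob(pattern) if p.is_file())


def verify_run(before: int, after: int, min_new: int) -> tuple[bool, int]:
    """Judge a run by net new snapshots, regardless of what job.status reports."""
    net_new = after - before
    return net_new >= min_new, net_new
```

Taking one count before the session and one after, then failing the run when net_new falls below an expected floor, would make a low-yield session visible even when every job reports success.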

Five lessons added; they belong in tasks/lessons.md and the global CLAUDE.md:

  • Measure output, not status (anti-pattern)
  • Smoke tests test transports, not workloads (anti-pattern)
  • Fire scripts need set -e (mistake)
  • State-sync is a first-class concern for idempotent jobs (mistake)
  • Partial output is real output (anti-pattern)
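The "partial output is real output" lesson, together with the missing sync-on-failure, can be illustrated with a small sketch. This is not the fix shipped in 1616a76c; run_then_sync and its job and sync callables are hypothetical, and sync stands in for whatever copies results back (for example an rsync invocation).

```python
from typing import Callable


def run_then_sync(job: Callable[[], None], sync: Callable[[], None]) -> None:
    """Run job, then always call sync, even when job raises.

    The exception from job still propagates, so the failure stays visible
    while any partial output is copied back first.
    """
    try:
        job()
    finally:
        # Partial output is real output: sync whatever exists before exiting.
        sync()
```

Letting the exception propagate instead of swallowing it keeps the failure loud, which matches the set -e lesson for fire scripts.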

The retro is structured as proximate bugs (B1-B3), process failures (P1-P5), root cause synthesis, the actual fix shipping today, and an evidence trail with snapshot counts at each moment so future sessions auditing this can see the numbers.

Companion to docs/specs/cloud-lab-apr11-retro-and-rebuild.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files Changed

Commit:ae42362