Fire script: reset --hard + seed cloud-lab + rsync on failure
2026-04-15 retro: cloud-lab was silently running Apr-7 code (88 commits behind origin/main) for 8 days because the fire script's `git pull --ff-only` was aborting on every single job ("Please commit your changes or stash them before you merge"). ~13 locally modified files had accumulated in cloud-lab's /app from the old autopilot-runner, and `git pull --ff-only` won't touch a dirty tree.
Three linked bugs:
- `git pull --ff-only` -> `git fetch` + `git reset --hard origin/main`. Plus `git clean -fd` excluding the data output dirs. Force-syncs /app to origin unconditionally. Zero local state should ever live on cloud-lab. Also: `set -eo pipefail` so any early failure aborts instead of silently running stale code. Logs the before and after commit hashes so the next retro can spot mismatches immediately.
- Seed cloud-lab inputs BEFORE firing. New `seedCloudLabInputs` method rsyncs main-server's authoritative state (solver-cache, pod-shop, model-sweep) into cloud-lab's working dirs before the job starts. The solver-cache script's idempotency check runs against whatever cloud-lab has locally: without seeding, it sees 946 files and redoes work that's already in the main server's 76,640 authoritative files. This was the "zero progress in 18h" symptom. `--ignore-existing` makes the seed idempotent + cheap when already in sync.
- Rsync on failure, not just exit 0. The Apr 11 retro spec already had this in the plan, but I only wired it to the success path. Added partial-output rsync to the non-zero exit branch of `startCloudLabPoll` so snapshots already written by a killed/crashed job are preserved. ~1800 stranded snapshots were recovered manually from cloud-lab this morning under this exact failure mode.
Also: the fire script now runs a background keepalive loop (`while true; do date +%s > /tmp/last-job-activity; sleep 60; done`) so the cloud-lab watchdog doesn't murder long solver runs between the script's own activity touches.
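One way to start and stop such a loop cleanly, as a sketch: the `start_keepalive` / `stop_keepalive` helpers and the `KEEPALIVE_PID` variable are hypothetical names, not taken from the fire script.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the KEEPALIVE_PID variable are hypothetical.

# Write the current epoch time to $1 every $2 seconds (default 60) in the background.
start_keepalive() {
  local file="$1" interval="${2:-60}"
  ( while true; do date +%s > "$file"; sleep "$interval"; done ) &
  KEEPALIVE_PID=$!
}

# Stop the loop. An in-flight sleep may linger until it finishes, which is harmless.
stop_keepalive() {
  kill "$KEEPALIVE_PID" 2>/dev/null || true
  wait "$KEEPALIVE_PID" 2>/dev/null || true
}
```

A script would call `start_keepalive /tmp/last-job-activity` before the job and register `trap stop_keepalive EXIT` so the loop does not outlive the script.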
New CLOUD_LAB_SEED_DIRS map mirrors CLOUD_LAB_SYNC_DIRS structure.
Refs: docs/specs/cloud-lab-apr11-retro-and-rebuild.md
Refs: docs/specs/cloud-lab-guardian-spec.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>