Imprevista

Deploy Log

|Sports Dashboard|DEPLOYED

Fire script: reset --hard + seed cloud-lab + rsync on failure

2026-04-15 retro: cloud-lab was silently running Apr-7 code (88 commits behind origin/main) for 8 days because the fire script's git pull --ff-only was aborting on every single job ("Please commit your changes or stash them before you merge"). About 13 locally modified files had accumulated in cloud-lab's /app from the old autopilot-runner, and git pull --ff-only won't touch a dirty tree.

Three linked bugs:

  1. git pull --ff-only -> git fetch + git reset --hard origin/main, plus git clean -fd excluding the data output dirs. This force-syncs /app to origin unconditionally; zero local state should ever live on cloud-lab. Also: set -eo pipefail, so any early failure aborts instead of silently running stale code. The script logs the before-commit and after-commit hashes so the next retro can see mismatches immediately.

  2. Seed cloud-lab inputs BEFORE firing. The new seedCloudLabInputs method rsyncs main-server's authoritative state (solver-cache, pod-shop, model-sweep) into cloud-lab's working dirs before the job starts. The solver-cache script's idempotency check runs against whatever cloud-lab has locally, so without seeding it sees 946 files and redoes work that is already in the main server's 76,640 authoritative files. This was the "zero progress in 18h" symptom. --ignore-existing makes the seed idempotent and cheap when the two sides are already in sync.

  3. Rsync on failure, not just exit 0. The Apr 11 retro spec already had this in the plan, but I only wired it to the success path. Added partial-output rsync to the non-zero exit branch of startCloudLabPoll so snapshots already written by a killed/crashed job are preserved. ~1800 stranded snapshots were recovered manually from cloud-lab this morning under this exact failure mode.
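The first and third fixes might look roughly like the bash sketch below. It is not the actual fire script or the real startCloudLabPoll code: the function names force_sync and collect_outputs, their arguments, and any paths passed to them are hypothetical, and only the git and rsync commands themselves come from the description above.

```shell
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical helper: make the checkout at $1 match origin/main exactly.
# Remaining arguments are paths that git clean should leave alone (the real
# data output directory names are not given in the commit message).
force_sync() {
  local app_dir="$1"
  shift
  local -a excludes=()
  local dir before after
  for dir in "$@"; do
    excludes+=(-e "$dir")
  done

  # Plain assignments (not `local x=$(...)`) so set -e sees a git failure.
  before=$(git -C "$app_dir" rev-parse HEAD)
  git -C "$app_dir" fetch origin
  git -C "$app_dir" reset --hard origin/main
  git -C "$app_dir" clean -fd "${excludes[@]}"
  after=$(git -C "$app_dir" rev-parse HEAD)
  echo "before-commit=$before after-commit=$after"
}

# Hypothetical helper: copy job outputs back for every exit code, not only 0.
collect_outputs() {
  local exit_code="$1" remote_dir="$2" local_dir="$3"
  if [ "$exit_code" -ne 0 ]; then
    echo "job exited with $exit_code; rsyncing partial output" >&2
  fi
  rsync -a "$remote_dir" "$local_dir"
}
```

A call such as force_sync /app data output (with data and output standing in for the real output dirs) discards tracked local edits and untracked files outside the excluded paths, so it only belongs on a machine that is meant to hold no local state.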

Also: the fire script now runs a background keepalive loop (while true; do date +%s > /tmp/last-job-activity; sleep 60; done) so the cloud-lab watchdog doesn't murder long solver runs between the script's own activity touches.
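A minimal sketch of that keepalive pattern, assuming the watchdog only checks the timestamp in /tmp/last-job-activity. The wrapper name run_with_keepalive and the two environment overrides are hypothetical. Unlike the one-liner quoted above, this variant writes the first timestamp synchronously and stops the loop once the job finishes.

```shell
#!/usr/bin/env bash
set -eo pipefail

# File path is from the commit message; the env overrides are assumptions.
ACTIVITY_FILE="${ACTIVITY_FILE:-/tmp/last-job-activity}"
KEEPALIVE_INTERVAL="${KEEPALIVE_INTERVAL:-60}"

# Hypothetical wrapper: run "$@" while a background loop refreshes the
# activity timestamp, then stop the loop and return the job's exit status.
run_with_keepalive() {
  local keepalive_pid status=0

  # Touch once up front so the watchdog sees activity immediately.
  date +%s > "$ACTIVITY_FILE"
  (
    while sleep "$KEEPALIVE_INTERVAL"; do
      date +%s > "$ACTIVITY_FILE"
    done
  ) >/dev/null 2>&1 &
  keepalive_pid=$!

  # Capture the status instead of letting set -e abort before cleanup.
  "$@" || status=$?

  kill "$keepalive_pid" 2>/dev/null || true
  wait "$keepalive_pid" 2>/dev/null || true
  return "$status"
}
```

Because the wrapper returns the job's own exit status, a caller running under set -e still aborts on a failed job unless it handles the status explicitly.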

New CLOUD_LAB_SEED_DIRS map mirrors CLOUD_LAB_SYNC_DIRS structure.
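As an illustration only, a bash analogue of such a map and the seeding loop could look like this. The real CLOUD_LAB_SEED_DIRS lives in the application code and its shape is not shown here; the directory paths, SEED_SOURCE_ROOT, the host argument, and the function name seed_cloud_lab_inputs are all assumptions. Associative arrays need bash 4 or newer.

```shell
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical layout: state name -> working dir on cloud-lab.
# Only the three state names come from the commit message.
declare -A CLOUD_LAB_SEED_DIRS=(
  [solver-cache]="/app/data/solver-cache"
  [pod-shop]="/app/data/pod-shop"
  [model-sweep]="/app/data/model-sweep"
)

# Hypothetical location of the authoritative copies on the main server.
SEED_SOURCE_ROOT="${SEED_SOURCE_ROOT:-/srv/main}"

# Push each authoritative dir to the given host before the job starts.
# --ignore-existing skips files already present on the receiver, so a
# repeat run against an in-sync host transfers nothing.
seed_cloud_lab_inputs() {
  local host="$1" name
  for name in "${!CLOUD_LAB_SEED_DIRS[@]}"; do
    rsync -a --ignore-existing \
      "${SEED_SOURCE_ROOT}/${name}/" \
      "${host}:${CLOUD_LAB_SEED_DIRS[$name]}/"
  done
}
```

One caveat with this flag: --ignore-existing also skips files whose contents differ between the two sides, so it suits write-once outputs such as cache entries and snapshots better than files that are edited in place.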

Refs: docs/specs/cloud-lab-apr11-retro-and-rebuild.md
Refs: docs/specs/cloud-lab-guardian-spec.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Commit:10f404c