Cloud-lab guardian: roadmap + validator + auto-advance ticker
Adds a version-controlled priority task list at data/cloud-lab/roadmap.json that any cloud-lab submission must match. Sessions that try to queue unapproved work get a 403 with a link back to docs/specs/cloud-lab-guardian-spec.md.
Motivation: today's retro surfaced three incidents in one session where cloud-lab was misused without shared context (compute-worker stealing a cloud-lab job at 12:01, an unauthorized compute-worker recreation at 12:15, and the Apr 11 factorial-all-zeros hole that has been latent ever since). The fix is architectural: cloud-lab has a brain (queue.ts) but had no orders. This commit adds the orders.
What's in:
- data/cloud-lab/roadmap.json (NEW) — initial v1 roadmap with the three
in-flight xG A/B solver-cache treatments + four approved Phase-2 backtests blocked on them.
- lib/compute/roadmap.ts (NEW) — type definitions, loader, matcher,
eligibility check, drift detector. Pure functions, no side effects outside readFileSync.
- app/api/compute/submit/route.ts — validator at submit time:
  * target="cloud-lab" with no matching roadmap entry -> 403
  * matching entry but blockedBy unmet -> 409 (blocks e.g. running a backtest over an incomplete solver-cache dependency)
  * malformed roadmap -> 503 (fail closed)
  * reads X-Compute-Source header, attaches to ComputeJob for provenance
- lib/compute/queue.ts — 60s auto-advance ticker (setInterval in the
constructor, gated on CRON_SECRET so dev doesn't burn compute). Each tick picks the highest-priority eligible task with no active job and self-submits with source="guardian/auto-advance". One task per tick to avoid racing with manual submissions. submit() grew an options argument for source + roadmapTaskId.
- lib/compute/types.ts — ComputeJob gains optional source and
roadmapTaskId fields. Backward-compatible with existing jobs.json.
- scripts/cloud-lab-roadmap.ts (NEW) — CLI: list / status / show / add /
done / block. status fetches /api/compute/jobs and diffs against the roadmap to surface drift.
- CLAUDE.md — new "Cloud-Lab Protocol" section (session protocol, hard rules,
"no backdoor"). compute-worker docker run snippet gains --no-healthcheck (base image's Next.js healthcheck fails on this worker).
- .github/workflows/deploy-sports-dashboard.yml — the "Update compute-worker"
step in the deploy workflow was the source of the 12:15 unauthorized recreation: every deploy ran a stale docker run command missing my Phase 1 ssh-keys volume and the NODE_OPTIONS heap bump. Workflow updated to include --no-healthcheck, the ssh-keys read-only mount, and heap=8192.
- docs/specs/cloud-lab-guardian-spec.md (NEW) — the formal contract, ~280
lines: context, non-goals, architecture, schema, matching rules, session protocol, CLI reference, failure modes, provenance, go/no-go gates.
- docs/specs/cloud-lab-apr11-retro-and-rebuild.md — appended "Investigation
follow-ups" section with the 12:15 forensic findings.
Smoke-tested locally against the live roadmap:
- solver-cache xgWeight=0.2 -> MATCH xg-ab-treat-a (eligible)
- solver-cache xgWeight=0.99 -> NO MATCH (would 403)
- factorial-test xgWeight=0.2 -> NO MATCH (wrong type)
- backtest xgWeight=0.2 outputSuffix=xg-02 -> MATCH xg-ab-backtest-xg02 but eligibility=false, unmet=[xg-ab-treat-a] (would 409)
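
For reviewers, the submit-time decision (match -> eligibility -> status code) can be sketched roughly as below. This is an illustration under assumptions, not the code in lib/compute/roadmap.ts: the RoadmapTask and Submission shapes, the match field, the done flag, and the decide() helper are all hypothetical, and the example entries only borrow task ids from the smoke test.

```typescript
// Hypothetical shapes for illustration; the real definitions live in lib/compute/roadmap.ts.
interface RoadmapTask {
  id: string;
  type: string;                            // e.g. "solver-cache", "backtest"
  match: Record<string, string | number>;  // assumed: params that must equal the submission's
  blockedBy: string[];                     // ids of tasks that must be done first
  done: boolean;                           // assumed completion flag
}

interface Submission {
  type: string;
  params: Record<string, string | number>;
}

type Decision =
  | { status: 200; taskId: string }
  | { status: 403 }
  | { status: 409; unmet: string[] }
  | { status: 503 };

// null stands in for "roadmap.json failed to load or parse" (fail closed).
function decide(roadmap: RoadmapTask[] | null, sub: Submission): Decision {
  if (roadmap === null) return { status: 503 };
  const task = roadmap.find(
    (t) =>
      t.type === sub.type &&
      Object.entries(t.match).every(([key, value]) => sub.params[key] === value),
  );
  if (task === undefined) return { status: 403 };
  const doneIds = new Set(roadmap.filter((t) => t.done).map((t) => t.id));
  const unmet = task.blockedBy.filter((id) => !doneIds.has(id));
  return unmet.length > 0 ? { status: 409, unmet } : { status: 200, taskId: task.id };
}

// Illustrative entries only: ids borrowed from the smoke test, match fields assumed.
const exampleRoadmap: RoadmapTask[] = [
  { id: "xg-ab-treat-a", type: "solver-cache", match: { xgWeight: 0.2 }, blockedBy: [], done: false },
  {
    id: "xg-ab-backtest-xg02",
    type: "backtest",
    match: { xgWeight: 0.2, outputSuffix: "xg-02" },
    blockedBy: ["xg-ab-treat-a"],
    done: false,
  },
];
```

With these assumed entries, decide(exampleRoadmap, { type: "solver-cache", params: { xgWeight: 0.99 } }) yields a 403, and the backtest submission yields a 409 with unmet=["xg-ab-treat-a"] until that task is marked done.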
Does NOT disturb Treat A/B/C currently running. The ticker will observe them via their existing jobs in jobs.json on first fire post-deploy.
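
A minimal sketch of the per-tick selection ("highest-priority eligible task with no active job"), under stated assumptions: TickTask, TickJob, pickNextTask, the status values, and "lower priority number wins" are hypothetical and not taken from queue.ts. It also assumes active jobs carry roadmapTaskId; jobs created before this commit lack that field, so the real ticker presumably has to recognise them some other way.

```typescript
// Hypothetical shapes; the real ones live in lib/compute/roadmap.ts and lib/compute/types.ts.
interface TickTask {
  id: string;
  priority: number;     // assumption: lower number = higher priority
  blockedBy: string[];
  done: boolean;
}

interface TickJob {
  roadmapTaskId?: string;
  status: "queued" | "running" | "done" | "failed"; // assumed status values
}

// Returns at most one task, mirroring the one-task-per-tick rule.
function pickNextTask(tasks: TickTask[], jobs: TickJob[]): TickTask | undefined {
  const doneIds = new Set(tasks.filter((t) => t.done).map((t) => t.id));
  const activeIds = new Set(
    jobs
      .filter((j) => j.status === "queued" || j.status === "running")
      .map((j) => j.roadmapTaskId),
  );
  const candidates = tasks.filter(
    (t) => !t.done && !activeIds.has(t.id) && t.blockedBy.every((id) => doneIds.has(id)),
  );
  candidates.sort((a, b) => a.priority - b.priority);
  return candidates[0];
}
```

Returning undefined when nothing is eligible lets the tick be a no-op, which is the behaviour expected while Treat A/B/C are still running.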
Refs: docs/specs/cloud-lab-guardian-spec.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>