Runbooks

One short runbook per V1 incident class. The first job is classification: is this an Offloader bug, the customer environment, or upstream data? The admin /diagnostics and /metrics endpoints answer that in under five minutes. Each runbook follows the same order: symptom → signals → likely owner → action → escalation.

Commands assume the admin port is reachable over a private network and OFFLOADER_ADMIN_TOKEN is set. DIAG below means curl -H "Authorization: Bearer $OFFLOADER_ADMIN_TOKEN" "$ADMIN/diagnostics", where $ADMIN is the admin base URL.
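The classification step can be sketched in Python. This is a minimal sketch, not part of Offloader: it assumes /metrics is served in the Prometheus text exposition format with no labels on these gauges, and the helper names (parse_metrics, classify) and the min_free_bytes threshold are hypothetical. The metric names are the ones used in the runbooks below.

```python
def parse_metrics(text):
    """Parse unlabelled 'name value' lines from Prometheus-style text (assumed format)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        parts = line.split()
        if len(parts) < 2 or "{" in parts[0]:
            continue  # this sketch ignores labelled or malformed samples
        try:
            values[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return values


def classify(metrics, min_free_bytes=1_000_000_000):
    """Return (incident class, likely owner) pairs for every signal that fires.

    min_free_bytes is a hypothetical threshold; choose one that fits the cache size.
    """
    findings = []
    if metrics.get("offloader_source_reachable") == 0:
        findings.append(("source unreachable", "customer environment"))
    if metrics.get("offloader_refresh_ok") == 0:
        findings.append(("refresh failing", "upstream data or Offloader"))
    if metrics.get("offloader_snapshot_stale") == 1:
        findings.append(("dataset stale", "upstream data"))
    if metrics.get("offloader_duckdb_up") == 0:
        findings.append(("DuckDB failure", "Offloader / environment"))
    free = metrics.get("offloader_cache_disk_free_bytes")
    if free is not None and free < min_free_bytes:
        findings.append(("disk low", "customer environment"))
    return findings
```

Feed it the body of a /metrics response; an empty result means none of these signals fired, which points toward the gateway-start, latency, or auth runbooks instead.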

Gateway does not start

Signals: container exits on boot; docker logs shows a config/secret error.
Owner: customer environment (config/secret) or Offloader (crash).
Action: the usual cause is a missing OFFLOADER_SECRET_KEY_BASE or an unreadable OFFLOADER_CONFIG, so check both first. Then validate the config: offloader validate --config …
Escalate: if the config validates and the secret is set but the container still crashes.

Source credentials fail / bucket or network unreachable

Signals: offloader_source_reachable == 0; DIAG source_reachable:false.
Owner: customer environment (IAM/network/object store).
Action: check the container's network path to the source and its read-only credentials. Meanwhile, Offloader keeps serving the last good snapshot.

Bad manifest / dataset refresh fails

Signals: offloader_refresh_ok == 0; DIAG last_attempted.status = rejected or failed with a refresh_error.
Owner: upstream data (rejected = the producer shipped a bad/breaking manifest) or Offloader (failed = materialization error).
Action: read refresh_error. rejected → fix the upstream manifest (it never swapped in, so serving is safe). failed → follow the DuckDB and Disk full runbooks below.
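The rejected/failed split can be sketched as a small Python helper over the parsed /diagnostics JSON. This is an illustration under assumptions: the function name refresh_next_step is hypothetical, and the exact nesting of status and refresh_error in the payload is assumed rather than confirmed, so the sketch looks for refresh_error in both places.

```python
def refresh_next_step(diag):
    """Map a parsed /diagnostics payload to (owner, next step, error text).

    Assumes last_attempted is an object with a 'status' field, and that
    refresh_error sits either inside it or at the top level.
    """
    attempt = diag.get("last_attempted") or {}
    status = attempt.get("status")
    error = attempt.get("refresh_error") or diag.get("refresh_error")
    if status == "rejected":
        # The producer shipped a bad or breaking manifest; it never swapped in.
        return ("upstream data", "fix the upstream manifest", error)
    if status == "failed":
        # Materialization error on the Offloader side.
        return ("Offloader", "check disk and DuckDB", error)
    return (None, "no refresh failure reported", error)
```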

Dataset stale

Signals: offloader_snapshot_stale == 1; response meta.freshness.stale:true.
Owner: upstream data (the producer stopped publishing fresh manifests).
Action: check the upstream pipeline. Offloader reports the staleness in the response metadata and keeps serving the last good snapshot.

DuckDB cache corrupted / DuckDB failure

Signals: offloader_duckdb_up == 0; queries error.
Owner: Offloader / environment (disk).
Action: check disk first (see Disk full). Then quarantine and rebuild the cache: delete the cache volume and restart; the gateway rematerializes from the manifest.

Disk full

Signals: offloader_cache_disk_free_bytes low; materialization fails.
Owner: customer environment.
Action: grow the cache volume or clear old snapshots, then restart.
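When the metric is not available, free space can be checked directly on the host. A minimal Python sketch, assuming you know where the cache volume is mounted; the function name cache_disk_low and the threshold are hypothetical, and the runbook does not define what counts as "low".

```python
import shutil


def cache_disk_low(cache_path, min_free_bytes):
    """True when the filesystem holding cache_path has less than min_free_bytes free."""
    return shutil.disk_usage(cache_path).free < min_free_bytes
```

Call it with the cache mount point and a threshold sized to your largest snapshot, for example cache_disk_low(cache_mount, 5 * 1024**3), where cache_mount is a placeholder for your own path.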

Pool busy / endpoint latency regression

Signals: rising p95/p99 (benchmark harness / your APM); DIAG pool.
Owner: Offloader (serving) or customer (load).
Action: compare against a benchmark baseline (../benchmarks.md). For a hot, high-QPS endpoint on remote_scan, move it to local_table. Consider the response cache for repeated params.
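The baseline comparison can be sketched as follows. This is an illustration only: the function names, the 20% tolerance, and the idea of feeding in raw latency samples in milliseconds are assumptions, not the benchmark harness's actual interface.

```python
import statistics


def p95(samples_ms):
    """95th percentile of latency samples (needs at least two samples)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples_ms, n=100)[94]


def regressed(current_ms, baseline_p95_ms, tolerance=0.2):
    """True when current p95 exceeds the baseline p95 by more than the tolerance."""
    return p95(current_ms) > baseline_p95_ms * (1 + tolerance)
```

A tolerance keeps normal run-to-run noise from being flagged; tune it against the variance you see between baseline runs.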

Tenant/auth misconfiguration / key or auth failures

Signals: consumers get 401 (bad/revoked key) or 404 (endpoint not granted).
Owner: customer (key config).
Action: in keys.yml, confirm that the key has status: active, that the endpoint is in its endpoints allowlist, and that it is bound to the right tenant. 404 is intentional for out-of-scope endpoints (no existence disclosure). Mint keys with offloader keys create.
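The 401/404 behavior described above can be modelled in a few lines of Python, which helps when deciding whether a consumer's error is expected. This is a sketch under assumptions: the field names status and endpoints mirror the wording above but the real keys.yml schema may differ, expected_status is a hypothetical helper, and tenant binding is not modelled.

```python
def expected_status(key, endpoint):
    """HTTP status a consumer should expect for one key entry and endpoint.

    key is one parsed entry from keys.yml, or None when the key is unknown.
    """
    if key is None or key.get("status") != "active":
        return 401  # bad or revoked key
    if endpoint not in key.get("endpoints", []):
        return 404  # endpoint not granted; its existence is not disclosed
    return 200
```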

Rollback to previous image

Owner: customer.
Action: redeploy the previous pinned image tag. Health returns immediately; there is no migration to undo.

Rollback to previous snapshot

Owner: customer/Offloader.
Action: a bad snapshot never swaps in (validation and the compatibility check gate it). To revert a good-but-wrong snapshot, roll the dataset back to its previous good snapshot.

Cache quarantine and rebuild / clear cache

Owner: customer.
Action: stop the container, remove the cache volume (one dataset: remove its materialized files; all: the whole volume), restart. The gateway rematerializes from the current manifest on boot.