Runbooks
One short runbook per V1 incident class. The first job is classification: is
this an Offloader bug, the customer environment, or upstream data? The admin
`/diagnostics` and `/metrics` endpoints answer that in under five minutes. Each
runbook follows the same shape: symptom → signals → likely owner → action → escalation.
Commands assume the admin port is reachable privately and
`OFFLOADER_ADMIN_TOKEN` is set. `DIAG` below means `curl -H "Authorization: Bearer $TOKEN" $ADMIN/diagnostics`.
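For scripted checks, the `DIAG` call can be sketched in Python as follows. This is a minimal sketch, not part of Offloader: it assumes `/diagnostics` returns a JSON body (the field names used in the runbooks suggest this, but this page does not state it), and `build_diag_request` and `fetch_diagnostics` are illustrative names.

```python
import json
import urllib.request


def build_diag_request(admin_url: str, token: str) -> urllib.request.Request:
    """Build the same request the DIAG curl shorthand sends."""
    url = admin_url.rstrip("/") + "/diagnostics"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_diagnostics(admin_url: str, token: str, timeout: float = 5.0) -> dict:
    """Fetch diagnostics and decode them (assumption: the body is JSON)."""
    request = build_diag_request(admin_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping the request construction separate from the network call makes the header and URL easy to verify without a running gateway.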
Gateway does not start
- Signals: container exits on boot; `docker logs` shows a config/secret error.
- Owner: customer environment (config/secret) or Offloader (crash).
- Action: a missing `OFFLOADER_SECRET_KEY_BASE` or an unreadable `OFFLOADER_CONFIG` is the usual cause, so check both. Validate the config: `offloader validate --config …`.
- Escalate: if the config validates and the secret is set but the gateway still crashes.
Source credentials fail / bucket or network unreachable
- Signals: `offloader_source_reachable == 0`; `DIAG` reports `source_reachable:false`.
- Owner: customer environment (IAM/network/object store).
- Action: check the container's network + read-only credentials to the source. Offloader keeps serving the last good snapshot meanwhile.
Bad manifest / dataset refresh fails
- Signals: `offloader_refresh_ok == 0`; `DIAG` reports `last_attempted.status` = `rejected` or `failed`, with a `refresh_error`.
- Owner: upstream data (`rejected` = the producer shipped a bad/breaking manifest) or Offloader (`failed` = materialization error).
- Action: read `refresh_error`. `rejected` → fix the upstream manifest (it never swapped in, so serving is safe). `failed` → check disk/DuckDB below.
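The rejected/failed split above can be expressed as a small lookup. This is a sketch under assumptions: it takes the decoded diagnostics as a dict with a `last_attempted` object carrying `status`, and `classify_refresh` is a hypothetical helper name. Any status other than the two named here is reported as unknown rather than guessed.

```python
def classify_refresh(diag: dict) -> tuple[str, str]:
    """Map the last refresh attempt to (likely owner, next action)."""
    attempt = diag.get("last_attempted") or {}
    status = attempt.get("status")
    if status == "rejected":
        # The bad manifest never swapped in, so serving is unaffected.
        return ("upstream data", "fix the upstream manifest")
    if status == "failed":
        # Materialization error: look at disk and DuckDB next.
        return ("Offloader", "check disk and DuckDB")
    return ("unknown", f"unrecognized status {status!r}; read refresh_error")
```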
Dataset stale
- Signals: `offloader_snapshot_stale == 1`; response `meta.freshness.stale:true`.
- Owner: upstream data (the producer stopped publishing fresh manifests).
- Action: check the upstream pipeline. Offloader honestly reports staleness and keeps serving the last good snapshot.
DuckDB cache corrupted / DuckDB failure
- Signals: `offloader_duckdb_up == 0`; queries error.
- Owner: Offloader / environment (disk).
- Action: quarantine + rebuild the cache (delete the cache volume, restart — it rematerializes from the manifest). Check disk first.
Disk full
- Signals: `offloader_cache_disk_free_bytes` low; materialization fails.
- Owner: customer environment.
- Action: grow the cache volume or clear old snapshots, then restart.
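What counts as "low" is a deployment decision, so a scripted check needs an explicit threshold. The sketch below assumes `/metrics` is served in the Prometheus text exposition format, which this page does not confirm, and that the caller supplies the threshold. `read_gauge` and `disk_is_low` are illustrative names.

```python
from typing import Optional


def read_gauge(metrics_text: str, name: str) -> Optional[float]:
    """Return the first sample value for `name`, or None if it is absent."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if line.startswith(name + "{"):
            # Labelled sample: the value follows the closing brace.
            value_part = line[line.rfind("}") + 1:]
        elif line.startswith(name + " "):
            value_part = line[len(name):]
        else:
            continue
        # The first token is the value; an optional timestamp is ignored.
        return float(value_part.split()[0])
    return None


def disk_is_low(metrics_text: str, min_free_bytes: int) -> bool:
    """True when the cache volume reports less free space than the threshold."""
    free = read_gauge(metrics_text, "offloader_cache_disk_free_bytes")
    # A missing metric is treated as unknown, not as low.
    return free is not None and free < min_free_bytes
```

The same `read_gauge` helper works for the other gauges named in these runbooks, such as `offloader_duckdb_up` or `offloader_refresh_ok`.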
Pool busy / endpoint latency regression
- Signals: rising p95/p99 (benchmark harness / your APM); `DIAG` reports `pool`.
- Owner: Offloader (serving) or customer (load).
- Action: compare against a benchmark baseline (`../benchmarks.md`). For a hot, high-QPS endpoint on `remote_scan`, move it to `local_table`. Consider the response cache for repeated params.
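Comparing against a baseline is easier to repeat when the comparison is written down. The sketch below is one possible way to do it, assuming you have raw latency samples in milliseconds for both the baseline run and the current run. The 20% tolerance and the function names are illustrative and do not come from the benchmark harness.

```python
import statistics


def percentile(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile (1..99); needs at least two samples."""
    cut_points = statistics.quantiles(samples, n=100, method="inclusive")
    return cut_points[pct - 1]


def regressed(current_ms: list[float], baseline_ms: list[float],
              pct: int = 95, tolerance: float = 0.20) -> bool:
    """True when the current percentile exceeds the baseline by more than tolerance."""
    limit = percentile(baseline_ms, pct) * (1 + tolerance)
    return percentile(current_ms, pct) > limit
```

Checking both `pct=95` and `pct=99` separates a broad slowdown from a tail-only one, which helps decide between the serving and load explanations.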
Tenant/auth misconfiguration / key or auth failures
- Signals: consumers get 401 (bad/revoked key) or 404 (endpoint not granted).
- Owner: customer (key config).
- Action: confirm the key's `status: active`, its `endpoints` allowlist, and its bound `tenant` in `keys.yml`. 404 is intentional for out-of-scope endpoints (no existence disclosure). Mint keys with `offloader keys create`.
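The three checks above can be run against a single key entry once `keys.yml` has been parsed. The sketch below assumes the entry is a dict with `status`, `endpoints`, and `tenant` fields. The real layout of `keys.yml` is not shown on this page, so treat the shape and the `key_problems` name as hypothetical.

```python
def key_problems(key: dict, endpoint: str, tenant: str) -> list[str]:
    """List the reasons this key would be refused for the endpoint and tenant."""
    problems = []
    if key.get("status") != "active":
        problems.append("key is not active: expect 401")
    if endpoint not in key.get("endpoints", []):
        # Out-of-scope endpoints answer 404 by design (no existence disclosure).
        problems.append("endpoint not in allowlist: expect 404")
    if key.get("tenant") != tenant:
        problems.append("key is bound to a different tenant")
    return problems
```

An empty list means the key configuration matches what the consumer is attempting, so the failure lies elsewhere.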
Rollback to previous image
- Owner: customer.
- Action: redeploy the previous pinned image tag. Health returns immediately; there is no migration to undo.
Rollback to previous snapshot
- Owner: customer/Offloader.
- Action: a bad snapshot never swaps in (validation + compatibility gate it). To revert a good-but-wrong snapshot, roll the dataset back to its previous good one.
Cache quarantine and rebuild / clear cache
- Owner: customer.
- Action: stop the container, remove the cache volume (one dataset: remove its materialized files; all: the whole volume), restart. The gateway rematerializes from the current manifest on boot.