Config guide — publish your datasets and endpoints
The reference for the config files Offloader loads. New to the ideas here (dataset, endpoint, snapshot, manifest, project)? Read What Offloader is, in plain language first.
A project is just files
Your configuration — your project — is a small tree of YAML:
offloader.yml # top level: which dirs to load, the auth mode, the keys file
datasets/*.yml # one per dataset: the table + the columns you expect
endpoints/*.yml # one per endpoint: the REST URL, its params, query, and limits
keys/keys.yml # API keys (stored as hashes) — omit entirely for a public API
Offloader loads this at startup from OFFLOADER_CONFIG — either a mounted directory
(point it at .../offloader.yml) or, fully stateless, a gs://… bucket prefix
(fetched at boot — see Config from object storage).
Nothing is baked into the image.
To see a complete, working project, run the Quickstart against
examples/customer-analytics/ — it boots a container and serves a real endpoint in about
15 minutes. Copy that example and edit it.
Start a new project (scaffold)
Authoring the YAML by hand is the slow part, so the CLI scaffolds it for you.
# A complete, VALID, fully-commented starter project (offloader.yml + one dataset,
# endpoint, and a working demo key). --public for a no-auth project.
offloader init --out my-project # then edit it; every field is commented
offloader init --out my-project --public
# Draft a dataset's schema from something you already have, instead of hand-listing
# columns: reuse a snapshot manifest's schema, or infer types from a CSV.
offloader scaffold-dataset --from data/events/manifest.json --tenant-column tenant_id
offloader scaffold-dataset --from sample.csv --id events --out my-project/datasets/events.yml
init's output passes offloader validate as-is; you then point each dataset's
manifest: at your snapshot and adjust the endpoint. The full field reference is
config-reference.md.
The offloader CLI (optional)
The offloader … commands in this guide are an optional Go helper, not part of the
container. The container needs only the files above plus env vars. Build the helper once:
cd tools && go build -o offloader . # then ./offloader <command> (or: go run . <command>)
Container env vars
You only have to set two:
OFFLOADER_CONFIG— path tooffloader.yml, or ags://…bucket prefix.OFFLOADER_SECRET_KEY_BASE— any random string (openssl rand -base64 48).
Everything else has a sensible default:
OFFLOADER_CACHE_DIR(default/var/lib/offloader/cache),OFFLOADER_API_PORT(4000),OFFLOADER_ADMIN_PORT(4001),OFFLOADER_LOG_LEVEL(info).OFFLOADER_ADMIN_TOKEN— recommended: gates the/diagnosticsroute (unset ⇒ it fails closed).OFFLOADER_CONFIG_SYNC_INTERVAL— seconds between bucket config re-checks (unset ⇒ off).OFFLOADER_CORS_ORIGINS— allow a browser front-end to call the API cross-origin:*or a comma-separated origin list (unset ⇒ no CORS). See serving public data.- Tuning:
OFFLOADER_POOL_SIZE(DuckDB read connections, default 16),OFFLOADER_DUCKDB_THREADS/OFFLOADER_DUCKDB_MEMORY_LIMIT(bound DuckDB to the container's memory allocation),OFFLOADER_CACHE_MAX_ENTRIES(response-cache entry ceiling, default 10,000 — raise it for more distinct query shapes, lower it to cap cache memory).
Object-store credentials are needed only when your config reads from a remote source (below); the local example needs no cloud vars and makes no outbound calls.
Source-specific object-store credentials are configured only when the mounted config references that source. Offloader should not require cloud-provider env vars or outbound telemetry for the local golden path.
Remote snapshot credentials (optional)
A manifest whose files[].path is an s3://, gs://, or https:// URL is read
directly by DuckDB (httpfs). Two credential modes:
- S3-compatible / GCS HMAC —
OFFLOADER_S3_TYPE=s3|gcsplusOFFLOADER_S3_KEY_ID,OFFLOADER_S3_SECRET(and for S3:OFFLOADER_S3_REGION,OFFLOADER_S3_ENDPOINT,OFFLOADER_S3_URL_STYLE,OFFLOADER_S3_SESSION_TOKEN,OFFLOADER_S3_USE_SSL). Coverss3://andgs://paths. - GCS OAuth bearer —
OFFLOADER_GCS_AUTH=bearer. Tokens come from, in order:OFFLOADER_GCS_TOKEN(explicit), the GCE metadata server (the GKE/GCE production path), or thegcloudCLI (developer laptops). The token is registered as a DuckDB HTTP secret coveringhttps://storage.googleapis.com/...reads and is rotated automatically before expiry. The Databricks GCS source (Offloader.Source.Databricks) uses the same tokens for listings and commit reads.
Explicit HMAC credentials win when both are set. Credentials never appear in logs, error bodies, or support bundles (values are scrubbed; bundles are redacted).
Sources & refresh
A dataset gets its data one of two ways, chosen per dataset:
- A static
manifest:— you publish each snapshot: write the Parquet plus a manifest carrying a newsnapshot_id, then point the dataset at it. A running container picks up a newsnapshot_idat boot, on a manual/admin refresh, or when config auto-sync reloads the config (below). This is the path for any warehouse that can export Parquet — Snowflake, BigQuery, Redshift, a Spark job. - A remote
source:— the container discovers the newest snapshot itself and swaps it in on a poll (source.interval_seconds), hands-off. The shipped connector is Databricks (it resolves the latest committed snapshot in the bucket); other warehouses use the static-manifest path above.
Either way the swap is validated and zero-downtime: a snapshot that fails its dataset contract is rejected and the last good one keeps serving.
The cache directory (warm vs. cold restarts)
OFFLOADER_CACHE_DIR (default /var/lib/offloader/cache) is where a local_table dataset's
snapshot is materialized — the on-disk DuckDB copy that requests actually read from. It is a
cache, not the source of truth: the bucket is. You can delete it at any time and lose nothing
but warm state — the container rebuilds it from the manifest on the next boot or refresh.
That rebuild is the whole reason to persist it. Mount a volume at the cache dir (as the README run command does) and a restart is a warm start: the materialized snapshot is already on disk, so the container serves within seconds. Leave it ephemeral and every restart is a cold start — the container re-fetches every snapshot from object storage and re-materializes it before it can serve, which on a large project is minutes, not seconds. So in production, persist it:
- Size the volume for your total materialized snapshot bytes, with headroom. A refresh writes the new copy of a dataset before dropping the old one, so budget for the largest dataset being present twice during its swap.
- It is disposable. Losing the volume (a new host, a wiped disk) only forces one cold start; it never risks data, because the bucket is authoritative.
If a dataset is too large to materialize on the box, don't size the disk to it — serve it
remotely. Set that endpoint's serving_mode: remote_scan and
DuckDB scans the snapshot's Parquet directly from object storage per request — nothing is
copied into the cache dir. You trade a little per-request latency for not needing local disk (or a
cold-start rematerialization) proportional to the dataset. It's the right mode for huge, cold, or
low-QPS endpoints; keep local_table (the default) for the hot ones.
Config from object storage (optional)
OFFLOADER_CONFIG may be a gs://bucket/prefix/ URL instead of a mounted path. At boot
Offloader fetches the whole project tree (offloader.yml, datasets/, endpoints/, the keys
file) from GCS into <cache_dir>/config/ and loads it from there — so the container is fully
stateless: env vars in, config and data both in the bucket, nothing mounted.
Host your offloader.yml (and its datasets//endpoints//keys/ tree) under a gs://… prefix
and boot the published image straight against it — the only volume is the optional
warm-start cache:
docker run \
-e OFFLOADER_CONFIG=gs://your-bucket/offloader/ \
-e OFFLOADER_GCS_AUTH=bearer \
-e OFFLOADER_SECRET_KEY_BASE="$(openssl rand -base64 48)" \
-e OFFLOADER_CACHE_DIR=/var/lib/offloader/cache \
-v offloader-cache:/var/lib/offloader/cache \
-p 4000:4000 \
-p 127.0.0.1:4001:4001 \
ghcr.io/andrewdryga/offloader:edge # or pin a release, e.g. :1.0.0
On GKE/GCE the bearer token comes from the metadata server (no credential env needed); on a
laptop it falls back to gcloud. Add OFFLOADER_CONFIG_SYNC_INTERVAL and edits you publish to
the prefix hot-reload with no restart (below).
- Uses the GCS bearer token chain (the same one the Databricks source lists with):
OFFLOADER_GCS_TOKEN, the GCE metadata server, orgcloud— setOFFLOADER_GCS_AUTH=bearer. Remote config is GCS-only for now (s3:///https://config is not yet supported). - Only
.yml/.yamlobjects under the prefix are fetched (bounded to 500 files / 32 MiB). A transient GCS error at boot is retried a few times; an invalid config fails boot loudly, exactly as a bad mounted config does. - Datasets served from a remote config should use a remote
source:(Databricks/GCS) or an absolute/remotemanifest:— a relative localmanifest:has no mounted data to resolve against. - Security — co-hosting config with data: the keys file stores SHA-256 hashes of bearer
tokens, never the tokens (see
security-model.md), so bucket-read exposure leaks hashes plus the endpoint/tenant access map, not usable credentials. A public deployment (auth: none) has no keys file at all — the cleanest stateless setup. For an authed deployment, put the config under a separate, tighter-ACL bucket/prefix from the bulk data if you want to limit who can read the hashes.
Hot config auto-sync
Set OFFLOADER_CONFIG_SYNC_INTERVAL=<seconds> (unset/0 = off) and Offloader re-checks the config
on that interval and hot-reloads changes with no restart:
- Change detection is cheap — a single object LIST (no downloads) when nothing changed.
- Endpoint, key, and data-source changes apply at once (the response cache is flushed).
- A dataset schema change is applied blue-green with zero downtime: the old snapshot and its endpoints keep serving while the new-schema table is materialized off to the side, then that dataset's table and endpoints flip together atomically. A staged build that fails compatibility (e.g. the producer hasn't published matching data yet) keeps the old version serving and is retried — nothing it serves ever breaks.
- A bad sync (network, invalid YAML, validation error) is logged and the running config is kept — a broken bucket revision never takes the service down.
/metricsexposesoffloader_config_sync_enabledandoffloader_config_sync_ok(alert on the latter == 0); admin/diagnosticsshows the last sync result + timestamps.
Push a new revision the boring way: update the objects under the config prefix and the next tick picks them up. Sequence a schema change as data first, then config, so the new-schema build succeeds on the first try.
Useful helper commands
offloader init— scaffold a new, valid projectoffloader scaffold-dataset— draft a dataset schema from a manifest or CSVoffloader validateoffloader manifest validateoffloader import-schemaoffloader shadow-diffoffloader endpoint testoffloader snapshot statusoffloader keys createoffloader doctoroffloader support-bundle
Importing an existing serving schema
If you already run a warehouse-backed serving API described by a serving_schema.json,
offloader import-schema generates a whole Offloader project from it: one dataset per
source table, one endpoint per query, with the source's defaults/combinations/
param_aliases preserved, plus a mapping.json the cutover diff harness uses. (In that
schema format each query is grouped by its game and table fields — hence the
game__table keys below; those are the schema's own field names, whatever your domain.)
Column types aren't in the schema file (its queries are SELECT *), so supply a hints
file — one DESCRIBE per table, mapping nested STRUCT/MAP/LIST columns to JSON:
# one row per source table: {"game__table": [{"name":"c","type":"VARCHAR"}, ...]}
offloader import-schema \
--from serving_schema.json --hints schema_hints.json \
--out ./project --bucket your-snapshot-bucket
Queries whose params aren't columns of their table (or whose table has no hints) fail
the run so nothing is silently mis-served; --skip-broken converts the rest and reports
each skip. The output passes offloader validate and serves live once GCS credentials
are set (see below).
Config layout
The project directory OFFLOADER_CONFIG points at:
offloader.yml # top-level project file (version, dirs, auth mode, keys path)
datasets/ # one *.yml per dataset (the table contract)
endpoints/ # one *.yml per endpoint (the REST contract)
keys/keys.yml # API keys, as hashes — omit for a public (auth: none) API
offloader validate checks the whole tree the same way the container does — run it in CI
before you ship a change. See examples/customer-analytics/ for a complete, working project.