docker run -it \
--name atlas \
--gpus all \
--ipc=host \
--network host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
avarok/atlas-gb10:latest --help
Atlas Spark — pure Rust LLM inference server
Usage: spark <COMMAND>
Commands:
serve Start the inference server
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
docker container remove atlas
docker pull avarok/atlas-gb10:latest
docker run -it \
--name atlas \
--gpus all \
--ipc=host \
--network host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
avarok/atlas-gb10:latest \
serve --help
Usage: spark serve [OPTIONS] [MODEL]
Arguments:
[MODEL]
HuggingFace model ID (e.g. "nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4") or a local directory path containing config.json
Options:
--model-from-path <PATH>
Load model directly from this filesystem path (skips HF cache resolution)
--model-name <NAME>
Override model name shown in /v1/models and API responses. Defaults to the positional MODEL argument, then config.json _name_or_path
--cache-dir <DIR>
Override HuggingFace cache directory (default: $HF_HUB_CACHE, $HF_HOME/hub, or ~/.cache/huggingface/hub)
--port <PORT>
HTTP port
[default: 8888]
--gpu-ordinal <GPU_ORDINAL>
GPU ordinal
[default: 0]
--max-seq-len <MAX_SEQ_LEN>
Maximum sequence length
[default: 32768]
--block-size <BLOCK_SIZE>
KV cache block size (tokens per block)
[default: 16]
--kv-cache-dtype <KV_CACHE_DTYPE>
KV cache dtype (fp8, bf16, or nvfp4). Default: fp8. NVFP4 uses less memory but may lose coherence at long context without --kv-high-precision-layers. FP8 is the safe default
[default: fp8]
--kv-high-precision-layers <KV_HIGH_PRECISION_LAYERS>
Boundary attention layers to keep at BF16 KV cache precision (first N + last N). Protects attention sink tokens (early layers) and output quality (final layers) from quantization error while saving memory on middle layers. Accepts: number, "auto" (=2, recommended), "max"/"all" (all BF16). Default: 0 (all layers use --kv-cache-dtype)
[default: 0]
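# A hedged sketch (not part of the help text): one plausible way to combine the
# KV-cache flags above, NVFP4 KV blocks plus the recommended "auto" boundary
# protection. The model ID is just the example from the MODEL argument.
docker run -it --name atlas --gpus all --ipc=host --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
  --kv-cache-dtype nvfp4 \
  --kv-high-precision-layers auto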
--gpu-memory-utilization <GPU_MEMORY_UTILIZATION>
GPU memory utilization (0.0-1.0)
[default: 0.9]
--max-num-seqs <MAX_NUM_SEQS>
Maximum concurrent sequences
[default: 128]
--disable-thinking
Global kill-switch for chain-of-thought / reasoning output. When set, the server forces thinking OFF regardless of what the client requests (reasoning_effort, thinking.budget_tokens, etc.) or what MODEL.toml declares as the default. Precedence (highest wins): this flag → request body → MODEL.toml `[behavior].thinking_default`.
Harry Potter alias: `--stupify` (stuns the model's inner monologue).
[aliases: --stupify]
--max-thinking-budget <MAX_THINKING_BUDGET>
Override MODEL.toml's `[behavior].max_thinking_budget` (tokens). Sets the per-request ceiling for thinking-block length. Per-request `thinking.budget_tokens` (or `reasoning_effort`) still wins below this ceiling; the (max_tokens * 9 / 10) safety cap is always enforced
--speculative
Enable speculative decoding. Currently slower than regular decode for hybrid SSM models
--self-speculative
Enable self-speculative decoding: draft via layer-skipping (no MTP weights needed). Skips SSM layers during drafting for cheap predictions, then verifies with full model
--ngram-speculative
Enable N-gram speculative decoding: CPU-side pattern matching proposer with CUDA-graphed K=2 verification. No extra weights needed
--dflash
Enable DFlash block-diffusion speculative decoding (Z Lab, arXiv 2602.06036). Pairs the target with a small Qwen3-architecture drafter (e.g. `z-lab/Qwen3.6-35B-A3B-DFlash`) that emits γ tokens per step via bidirectional in-block attention conditioned on captured target hidden states. Mutually exclusive with `--speculative`
--draft-model <DRAFT_MODEL>
HuggingFace id (or local path) of the DFlash drafter checkpoint. When `--dflash` is set without `--draft-model`, the value falls through from the target's MODEL.toml `[dflash].draft_model` field
--dflash-gamma <DFLASH_GAMMA>
DFlash block size γ (parallel draft tokens per step). Defaults to the drafter's `block_size` from `config.json` (16 for the published Qwen3.6-DFlash drafters); override only for ablation. Higher γ increases per-step verify cost but raises peak speedup
[default: 16]
--dflash-window-size <DFLASH_WINDOW_SIZE>
DFlash drafter sliding-window size for long context. The drafter runs full-prefix attention by default; at Atlas's typical 16K `--max-seq-len`, drafter attention dominates per-step cost. The upstream sglang / vLLM default is 4096. Set to 0 to disable (full attention)
[default: 4096]
--num-drafts <NUM_DRAFTS>
Number of draft tokens per speculative step (1=K=2, 2=K=3, 3=K=4 verify). Higher K verifies more drafts per step. Uses WY-chunkwise GDN kernels
[default: 1]
--max-batch-size <MAX_BATCH_SIZE>
Maximum concurrent sequences batched into one GPU decode step
[default: 8]
--mtp-quantization <MTP_QUANTIZATION>
MTP head weight precision: nvfp4 (fastest, recommended — uses fused device-side expert dispatch), fp8 (balanced but slower due to D2H sync in MoE), bf16 (highest accuracy, most memory)
[default: nvfp4]
--mtp-vocab <MTP_VOCAB>
MTP draft vocabulary size. Limits the LM head GEMV to the first N token IDs, reducing propose latency. BPE tokenizers place frequent tokens at low IDs — 100K covers >99% of English outputs while cutting propose time by 37% (2.15ms → 1.35ms) with zero acceptance loss. Set to 0 to use full vocabulary
[default: 100000]
--enable-prefix-caching [<ENABLE_PREFIX_CACHING>]
Enable prefix caching via radix tree (RadixAttention). Caches KV blocks for recurring prompt prefixes. For SSM models, KV is recomputed when no SSM snapshot exists (safe but no TTFT speedup without Marconi snapshots). Block table reuse still avoids allocation
[default: false]
[possible values: true, false]
--dump [<PATH>]
Dump every /v1/chat/completions, /v1/responses, and /v1/messages (Anthropic) request — plus the corresponding response (non-streaming) or aggregated stream — as JSONL to a file. Intended for extracting the exact system prompt and tool schema a client (opencode, Claude Code, etc.) is sending, and for replaying failure cases in fixtures.
With no value: a temp file is created under $TMPDIR and its path is logged at INFO on startup. With a PATH: appends (never truncates) to that file. Each line is one JSON object: `{ "ts": "<iso8601>", "endpoint": "...", "kind": "request"|"response", "seq": N, "body": { ... } }`, so entries can be grouped by `seq` to reconstruct pairs.
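# A hedged sketch, assuming the dump was written to /tmp/atlas-dump.jsonl (the
# real path is whatever --dump logged at startup, or the PATH you passed):
# slurp the JSONL with jq and group entries by `seq` to reconstruct
# request/response pairs.
jq -s 'group_by(.seq)
       | map({seq: .[0].seq,
              request:  (map(select(.kind == "request"))  | .[0].body),
              response: (map(select(.kind == "response")) | .[0].body)})' \
  /tmp/atlas-dump.jsonl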
--scheduling-policy <SCHEDULING_POLICY>
Scheduling policy: fifo (default) or slai (SLO-aware). SLAI prioritizes decode for sequences nearing TBT deadline and orders prefills shortest-prompt-first
[default: fifo]
--tbt-deadline-ms <TBT_DEADLINE_MS>
TBT deadline in milliseconds for SLAI scheduling policy. Sequences approaching this deadline trigger decode-first priority
[default: 100]
--max-prefill-tokens <MAX_PREFILL_TOKENS>
Maximum tokens to prefill per scheduler iteration (chunked prefill). Long prompts are split into chunks of this size, interleaved with decode steps for active sequences. The 8192 default halves chunk count vs 4096, giving ~11% TTFT improvement at 32K with no decode regression on DGX Spark. Set to 0 to disable chunking (process the entire prompt in one shot, legacy behavior)
[default: 8192]
--oom-guard-mb <OOM_GUARD_MB>
Minimum free GPU memory (in MB) to keep as a safety margin during model loading. If free memory drops below this threshold after any shard, loading is aborted to prevent system OOM. Default 4096 MB accounts for CUDA context, NCCL buffers, and allocator overhead
[default: 4096]
--rank <RANK>
Global rank (0=head, 1=worker, …). Only used when --world-size > 1
[default: 0]
--world-size <WORLD_SIZE>
Total physical ranks across all parallelism dims. Set to 2 for two-node deployment. Must satisfy `world_size == tp_size × ep_size` (orthogonal mesh) or `world_size == tp_size == ep_size` (overlapping groups on the same physical ranks — used for 2-GPU TP+EP composition)
[default: 1]
--tp-size <TP_SIZE>
Tensor-parallel dimension. Splits attention/MLP weights column- and row-parallel across `tp_size` ranks. 1 = no TP. Composes with EP: MoE expert weights stay EP-sharded; attention/MLP get TP-sharded
[default: 1]
--ep-size <EP_SIZE>
Expert-parallel dimension. Splits MoE expert weights across `ep_size` ranks. 1 = no EP. Default of 1 keeps single-rank semantics
[default: 1]
--master-addr <MASTER_ADDR>
NCCL bootstrap address (IP of rank 0 node)
[default: 127.0.0.1]
--master-port <MASTER_PORT>
NCCL bootstrap port
[default: 29500]
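# A hedged sketch of a two-node tensor-parallel launch, assuming rank 0 is
# reachable at 192.0.2.10 (placeholder address). Per --world-size above,
# world_size must equal tp_size x ep_size; here 2 = 2 x 1.
# Rank 0 (head node):
docker run -it --name atlas --gpus all --ipc=host --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
  --world-size 2 --tp-size 2 --rank 0 --master-addr 192.0.2.10
# Rank 1 (worker node): run the identical command there with --rank 1.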
--tool-call-parser <FORMAT>
Tool call parser format. Enables OpenAI-compatible tool calling. Supported: "hermes" (Qwen3/3.5 JSON format), "qwen3_coder" (Nemotron-H XML format). When set, tool definitions in requests are injected into the system prompt and model output is parsed for tool_call tags
--tool-max-tokens <TOOL_MAX_TOKENS>
Maximum output tokens per tool-calling request. Caps max_tokens from the client when tools are active to prevent unbounded generation if the model doesn't emit a </tool_call> stop token. Must be high enough for Write tool calls with large file content. Default 8192
[default: 8192]
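# A hedged sketch, assuming the standard OpenAI tools request shape (the help
# text only says the parser is "OpenAI-compatible"): with the server started as
# `serve <MODEL> --tool-call-parser hermes`, send a tool definition to the
# default port. The get_weather tool is purely illustrative.
curl -s http://127.0.0.1:8888/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4",
        "messages": [{"role": "user", "content": "What is the weather in Lisbon?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"]
            }
          }
        }]
      }'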
--ssm-cache-slots <SSM_CACHE_SLOTS>
Number of SSM state snapshot slots for Marconi prefix caching. Each slot stores SSM h_state + conv_state for all SSM layers, enabling full prefix skip (KV + SSM) on cache hits. 0 = disabled. 16 = recommended for repeated-prefix and multi-turn workloads. Intermediate checkpoints (--ssm-checkpoint-interval) require extra slots: ~(max_context / checkpoint_interval_tokens) per cached sequence
[default: 16]
--ssm-checkpoint-interval <SSM_CHECKPOINT_INTERVAL>
Save SSM state snapshots at regular block boundaries during prefill. When set to N > 0, a snapshot is saved every N blocks during chunked prefill. On future prefix cache hits, the deepest intermediate snapshot is restored, reducing SSM recomputation from the full prefix to just the tokens between the checkpoint and the match point. 0 = disabled (leaf-only snapshots). 256 = every 4096 tokens (block_size=16)
[default: 256]
--auto-compact [<THRESHOLD>]
Enable automatic context compaction for long conversations. **DISABLED BY DEFAULT** (2026-04-25): the auto-compactor has historically been a source of agent loops — synthesised continuation messages and middle-of-history truncation themselves trigger drift (cf. opencode issues #15533, #17169, #19339). Oversize requests get a clean 400 error (`Prompt too long`) rather than a silently-rewritten context.
Only pass `--auto-compact[=THRESHOLD]` if you have explicitly validated that compaction is safe for your model + workload. Without a value: threshold=0.75 (compact at 75% of max_seq_len). With a value: compact at that fraction (e.g., 0.80 = 80%).
Method: Active Context Compression (arXiv:2601.07190) — the server uses the model itself to summarize older conversation turns into a condensed knowledge block.
--default-top-n-sigma <DEFAULT_TOP_N_SIGMA>
Default top-n-sigma for sampling (filter tokens by logit z-score). 0.0 = disabled. Recommended: 1.0 for NVFP4 models AND for agent workloads — top-n-σ is temperature-invariant (Tang et al., arXiv:2411.07641) so it is more robust than top-p across the per-phase temperature drift agentic loops induce
[default: 1]
--default-min-p <DEFAULT_MIN_P>
Default min-p for sampling (keep tokens with prob >= min_p * max_prob). 0.0 = disabled. Recommended: 0.05-0.1
[default: 0]
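# A hedged sketch: the two sampling defaults above at their recommended values
# (top-n-sigma 1.0, min-p 0.05), appended to the serve command from the earlier
# docker run examples.
docker run -it --name atlas --gpus all --ipc=host --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
  --default-top-n-sigma 1.0 --default-min-p 0.05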
--swap-space-gb <SWAP_SPACE_GB>
Swap space in GB for KV cache overflow to disk. When GPU blocks are exhausted, sequences are swapped to disk and resumed later. 0 = disabled. Swap files stored in /tmp/atlas-swap/
[default: 3]
--high-speed-swap
Enable the per-layer NVMe-backed high-speed KV swap path (configured via the --high-speed-swap-* options below)
--high-speed-swap-dir <HIGH_SPEED_SWAP_DIR>
Directory for the per-layer NVMe-backed KV files. Required when --high-speed-swap is set; must be on a different mount than --swap-space-gb's /tmp/atlas-swap to avoid file collisions
--high-speed-swap-gb <HIGH_SPEED_SWAP_GB>
Total disk budget for --high-speed-swap, in GiB
--high-speed-swap-resident-blocks <HIGH_SPEED_SWAP_RESIDENT_BLOCKS>
HBM scratch slot count (number of resident blocks)
--high-speed-swap-rank <HIGH_SPEED_SWAP_RANK>
Predictor low-rank dimension (Phase 1 ships at r=32)
[default: 32]
--high-speed-swap-qd <HIGH_SPEED_SWAP_QD>
io_uring submission queue depth (Phase 3 shows QD=8 reaches 3.4 GB/s on this DGX Spark image)
[default: 8]
--high-speed-swap-graph <HIGH_SPEED_SWAP_GRAPH>
Capture the per-layer body in a CUDA graph and replay (Phase 4). Defaults to mirror --high-speed-swap
[possible values: true, false]
--high-speed-swap-cache-blocks-per-seq <HIGH_SPEED_SWAP_CACHE_BLOCKS_PER_SEQ>
Per-sequence HBM cache cap for `--high-speed-swap` (Phase 6.1). When set together with --high-speed-swap, each sequence is limited to N HBM-resident KV blocks; older blocks are evicted to disk and streamed back via the orchestrator on demand. The KV cache total allocation shrinks to roughly `max_batch_size × N` blocks. Default 64 (= 1024 tokens HBM-resident at block_size=16). Set to max_seq_len/block_size to disable HBM-shrink (no eviction; useful for diff-against-no-swap correctness checks)
[default: 64]
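# A hedged sketch of the --high-speed-swap flag family; /mnt/nvme/atlas-kv is a
# placeholder NVMe path (mounted into the container) and must live on a
# different mount than --swap-space-gb's /tmp/atlas-swap.
docker run -it --name atlas --gpus all --ipc=host --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /mnt/nvme/atlas-kv:/mnt/nvme/atlas-kv \
  avarok/atlas-gb10:latest \
  serve nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
  --high-speed-swap \
  --high-speed-swap-dir /mnt/nvme/atlas-kv \
  --high-speed-swap-gb 64 \
  --high-speed-swap-cache-blocks-per-seq 64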
--request-timeout <REQUEST_TIMEOUT>
Default request timeout in seconds. 0 = no timeout
[default: 300]
--profile
Enable per-kernel profiling: sync + time each operation within layers. Disables CUDA graphs for accurate per-op timing. Adds ~10% overhead
--fp8-kv-calibration-tokens <FP8_KV_CALIBRATION_TOKENS>
Number of warmup tokens for online FP8 KV cache scale calibration. During the first N tokens, tracks max |K| and max |V| values across all attention layers. After N tokens, computes per-tensor scales as max/448 (mapping the observed range to FP8 E4M3 [-448, 448]). 0 = disabled (use static scales from checkpoint, or uncalibrated 1.0). Only applies when --kv-cache-dtype is fp8
[default: 0]
--warmup-prompt <WARMUP_PROMPT>
Path to a warmup prompt file (JSON messages or plain text). At startup, the server tokenizes and prefills this prompt, inserting the resulting KV cache + SSM snapshot into the prefix cache. This eliminates the cold-start TTFT penalty (~196ms) on the first real request
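# A hedged sketch: a warmup prompt file in the JSON "messages" shape the option
# mentions (the exact schema is an assumption), written via heredoc, then passed
# at a path the server can read (mount it into the container if needed).
cat > warmup.json <<'EOF'
[
  {"role": "system", "content": "You are a helpful coding assistant."},
  {"role": "user", "content": "Warm up the cache."}
]
EOF
# ...then append to the serve command:  --warmup-prompt /path/to/warmup.json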
--adaptive-sampling
Enable adaptive sampling (entropy-based greedy gating, zone detection). Computes Shannon entropy over the full vocabulary per token to dynamically switch between greedy and sampled decoding. Improves quality for mixed content (code + prose) at the cost of ~2-3x decode throughput reduction. Off by default for maximum throughput
--no-fast-load
Disable the InstantTensor-style fast weight loader and use the mmap loader instead. The fast loader (O_DIRECT + pipelined reader/copier, with a per-shard heuristic that picks between O_DIRECT and buffered reads) is on by default — this flag is an escape hatch for rare filesystems that misbehave with O_DIRECT or for A/B debugging. Setting `ATLAS_FAST_LOAD=0` has the same effect
--bind <ADDR>
Address to bind the HTTP listener to. Defaults to `127.0.0.1` so a fresh install is reachable only from the local machine; pass `0.0.0.0` to expose on all interfaces (the server logs a warning when it does, since combined with the permissive default CORS this makes the API reachable to anything on the LAN)
[default: 127.0.0.1]
--require-auth
Require an `Authorization: Bearer <token>` header on `/v1/*`, `/tokenize`, and `/detokenize`. The token must match one loaded via `--auth-tokens-file` or `--auth-token`. `/health`, `/health/live`, and `/metrics` stay open as scrape targets.
Defaults to off — Atlas is local-by-default, so most users can skip this. Turn on whenever the server is reachable from anywhere other than `localhost` (i.e. whenever you've passed `--bind 0.0.0.0` or are running behind an exposed port-forward).
--auth-tokens-file <PATH>
Path to a file containing valid bearer tokens, one per line. Blank lines and lines starting with `#` are ignored. Permissions should be `0600`. The file is read once at startup; SIGHUP reloading is not supported (restart the server to rotate keys)
--auth-token <TOKEN>
A single inline bearer token. Convenient for quick starts; not recommended for production because the token is visible in `ps`/`/proc/<pid>/cmdline`. Use `--auth-tokens-file` instead
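# A hedged sketch of token-file auth: generate one token per line with 0600
# permissions, start the server with --require-auth --auth-tokens-file, then
# send a bearer-authenticated request against the default port.
umask 177 && openssl rand -hex 32 > atlas-tokens.txt
# serve <MODEL> --require-auth --auth-tokens-file /path/to/atlas-tokens.txt
curl -s http://127.0.0.1:8888/v1/models \
  -H "Authorization: Bearer $(head -n1 atlas-tokens.txt)"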
-h, --help
Print help (see a summary with '-h')
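# A hedged end-to-end sketch tying the above together: start the server detached
# with the example model and defaults, then confirm it is up. Flags and model ID
# are illustrative, not prescriptive.
docker run -d --name atlas --gpus all --ipc=host --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
  --port 8888 --max-seq-len 32768
curl -s http://127.0.0.1:8888/health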