Atlas: Open-source inference engine for DGX Spark | <2-minute cold start, 100+ tok/s on Qwen3.6-35B-FP8, 13+ supported models

We’ve just open-sourced Atlas, an LLM inference engine purpose-built for (but not limited to) GB10-class hardware. The repository is here: https://github.com/Avarok-Cybersecurity/atlas and we need the community’s help to keep elevating it for developers.

Many DGX Spark owners and Blackwell developers read this forum, so we wanted to make sure this landed in front of you directly rather than relying on it filtering over from elsewhere.

Earlier we shared some informal benchmarks showing Qwen3.5-35B running at ~130 tok/s sustained on a single Spark. The most common follow-up question was whether the code would be released. Voilà!

Atlas was written from scratch in Rust and CUDA. There is no PyTorch dependency, no Python runtime in the serving path, and no JIT compilation step at startup. The container image is approximately 2.5 GB and the cold start is under two minutes, compared to roughly ten minutes for a typical PyTorch-based stack going through torch.compile.

The motivation was straightforward: on Spark, we observed Blackwell tensor cores sitting underutilized while the Python serving stack interpreted itself. The performance ceiling was being set by software overhead rather than silicon, and we judged that a clean rewrite was a more honest path than additional layers of optimization on top of the existing stack.


Performance on a single DGX Spark

Model                      Quant   Throughput
Qwen3.5-35B (MTP K=2)      NVFP4   130 tok/s peak
Qwen3-Next-80B-A3B (MTP)   NVFP4   ~87 tok/s
Nemotron-3 Nano 30B        FP8     ~88 tok/s
MiniMax M2.7 (EP=2)        NVFP4   ~15 tok/s

The full matrix (13+ models) is on the project site: atlasinference.io


Architecture notes relevant to NVIDIA hardware

  • Hand-tuned CUDA kernels for Blackwell SM120/121, covering attention, MoE routing, GDN, and Mamba-2 paths. We do not use generic fallbacks; each supported architecture gets purpose-built kernels.

  • Native NVFP4 and FP8 execution on tensor cores, with --kv-high-precision-layers auto available for sensitive layers.

  • Multi-Token Prediction (MTP) speculative decoding, configurable via --speculative with K=2 as a sensible default for the supported Qwen models.

  • Prefix caching, configurable scheduling policy (slai is the default we’ve tuned for Spark workloads), and --gpu-memory-utilization control for KV cache headroom.

  • HTTP API exposes both OpenAI and Anthropic schemas on the same port, so existing client code (Claude Code, Cline, OpenCode, Open WebUI, custom OpenAI clients) works without modification.


Quick start

docker pull avarok/atlas-gb10:latest

sudo docker run -d --name atlas \
  --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --port 8888 \
    --max-seq-len 65536 \
    --kv-cache-dtype fp8 \
    --kv-high-precision-layers auto \
    --gpu-memory-utilization 0.90 \
    --scheduling-policy slai \
    --quantization fp8 \
    --tool-call-parser qwen3_coder \
    --enable-prefix-caching \
    --speculative

Endpoint comes up on http://localhost:8888/v1.
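A quick smoke test once it comes up. Both schemas answer on the same port; the request bodies below are a minimal sketch (model name taken from the serve command above):

# OpenAI-style schema
curl -sS http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen3.6-35B-A3B-FP8","messages":[{"role":"user","content":"Say hello."}],"max_tokens":32}'

# Anthropic-style schema, same port
curl -sS http://localhost:8888/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen3.6-35B-A3B-FP8","messages":[{"role":"user","content":"Say hello."}],"max_tokens":32}'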


Roadmap on NVIDIA hardware

RTX 6000 Pro Blackwell is the next NVIDIA platform we are targeting. We have the ASUS Ascent GX10 confirmed working today through community testing. The kernel approach is the same across chips: we adapt rather than abstract.


What we are looking for from the community

  1. Spark and Blackwell developers willing to run Atlas and expand its features across workloads like long context, mixed batch sizes, and agentic loops with high tool-call density.

  2. Bug reports with nvidia-smi output, driver version, and a minimal reproduction (see the snippet after this list). We respond fastest on Discord, but GitHub issues are read daily.

  3. Model requests for architectures you are running in production. We prioritize based on signal from users, and signal from this forum counts.
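To make item 2 easy, something like this gathers the basics (a sketch: the nvidia-smi query fields are standard, and the container name assumes the quick start above):

nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
docker logs --tail 200 atlas   # container name from the quick start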


Happy to answer technical questions in this thread, but everything is also there in the codebase if you’d rather scour the details yourself. We’d love to build around any specific use case you have, so don’t hesitate to reach out. Feel free to take it to DM on any of the social sites!

error: unexpected argument '--quantization' found

tip: a similar argument exists: '--mtp-quantization'

Usage: spark serve --port <PORT> --max-seq-len <MAX_SEQ_LEN> --kv-cache-dtype <KV_CACHE_DTYPE> --kv-high-precision-layers <KV_HIGH_PRECISION_LAYERS> --gpu-memory-utilization <GPU_MEMORY_UTILIZATION> --scheduling-policy <SCHEDULING_POLICY> --mtp-quantization <MTP_QUANTIZATION>

For more information, try '--help'.

═══ Benchmark ═══
[✓] Model: RedHatAI/Qwen3.6-35B-A3B-NVFP4

╔══════════════════════════════════════════════════════╗
║  Benchmark: Qwen3.6-35B-A3B-NVFP4  —  2026-05-07 14:18
╚══════════════════════════════════════════════════════╝

  Warm-up... done

── Sequential (1 request) ──────────────────────────────
  Run 1/2:
  [Q&A       ]   106 tokens in   0.87s = 121.6 tok/s
  [Code      ]   183 tokens in   1.30s = 140.1 tok/s
  [JSON      ]   615 tokens in   4.42s = 138.9 tok/s
  [Math      ]     9 tokens in   0.09s = 91.8 tok/s
  [LongCode  ]  2048 tokens in  15.24s = 134.3 tok/s

  Run 2/2:
  [Q&A       ]   169 tokens in   1.30s = 129.5 tok/s
  [Code      ]   181 tokens in   1.34s = 134.8 tok/s
  [JSON      ]   718 tokens in   5.14s = 139.4 tok/s
  [Math      ]     9 tokens in   0.09s = 90.9 tok/s
  [LongCode  ]  2048 tokens in  15.35s = 133.3 tok/s

── Concurrent (4 parallel requests) ───────────────────────────
  Sending 4 requests simultaneously, measuring total throughput...

  [req1 ]  1024 tokens = 24.7 tok/s (end-to-end)
  [req2 ]  1024 tokens = 24.7 tok/s (end-to-end)
  [req3 ]  1024 tokens = 24.7 tok/s (end-to-end)
  [req4 ]  1024 tokens = 24.7 tok/s (end-to-end)

  Total: 4096 tokens in 41.38s
  Total throughput: 98.9 tok/s (4 requests completed)

Four parallel requests end up slower than a single request.

Congratulations on the long-awaited release!

Everyone can complain about this thing now (in a positive and constructive way), but it is much simpler than vllm due to its smaller scope, so I guess we can expect valuable contributions and improvements over time. Great job!

CLI documentation is still missing.


Ok, found it here: atlas/book/src/operations/server.md at main in the Avarok-Cybersecurity/atlas GitHub repo.

Nice work :) However, so far I could not run any of my agents, which rely heavily on tool calling. These agents work fine with a vllm FP8 build of Qwen 3.6 35B, but here I got many errors, mainly like this one: spark::api::chat_stream::tool_handlers: tool call validation error: Error: Unknown tool
and I can tell you the tool is very well defined in the agent. When the tool is finally found, it throws errors, as if it is not able to decode the JSON coming back from the MCP. Dunno man, maybe I am doing something wrong :)

Please reach out via DM here or on Discord and we can debug it together @trithemius :)

I gave this a solid go. Here is my initial prognosis: useless waste of time :(

The quick start recipe published for Qwen 3.5 122B doesn’t run. I had to find the documentation, read the documentation, and spend a morning trying different settings. Apparently the KV cache has to be uselessly small: 0 concurrent sequence(s)? A context window size of 4k?

Error: Failed to build model

Caused by:
    KV cache can hold at most 0 concurrent sequence(s) at --max-seq-len=2096, but --max-batch-size=1 was requested. KV pool has 0 block(s) of 16 tokens each; each sequence needs 131 block(s). Try --max-seq-len 16 (keeps max_batch_size=1) or reduce --max-batch-size.

This is what I was trying to run when I decided you had wasted enough of my time. If it can’t run this what is the point?

#!/bin/bash

docker container remove atlas
docker pull avarok/atlas-gb10:latest
docker run -it \
  --name atlas \
  --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve Sehyo/Qwen3.5-122B-A10B-NVFP4 \
    --model-name qwen/qwen3.5-122b \
    --port 8000 \
    --gpu-memory-utilization 0.85 \
    --max-seq-len 2096 \
    --max-batch-size 1 \
    --max-prefill-tokens 2096 \
    --max-num-seqs 32 \
    --oom-guard-mb 4096 \
    --kv-cache-dtype nvfp4 \
    --kv-high-precision-layers auto \
    --enable-prefix-caching \
    --scheduling-policy slai \
    --speculative \
    --num-drafts 2 \
    --mtp-quantization nvfp4

Here is a piece of gratuitous advice, and my minimum-effort expectation: have a human test that each recipe works before you publish it. Otherwise the whole project smacks of psychosis. You have a huge trust deficit with me now.

We’ve worked way too hard on Atlas for it to be written off in this fashion @whpthomas. I’ve hand-tested Qwen3.5-122B-A10B-NVFP4 (if I look hard enough in my saved videos I can find the clip). Myself and many others have had success with this exact Sehyo variant, and we released support for it on 3/30/26, documented in our Discord under #releases.

From your command alone I see a plethora of mistakes. How do you expect to have a meaningful interaction with a 2096-token max sequence length? The issue also seems to be rooted in --max-num-seqs being set so high; it looks like it’s having trouble trying to preallocate.

Here’s the command we released, which I verified myself; it should work out of the box. We migrated early on because of the toxicity we received on the forums and do not appreciate this hostility :/

docker pull avarok/atlas-alpha-2.7
sudo docker run -d --name atlas \
  --gpus all --ipc=host --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-alpha-2.7 \
  serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --port 8888 \
    --kv-cache-dtype fp8 \
    --kv-high-precision-layers auto \
    --gpu-memory-utilization 0.92 \
    --scheduling-policy slai \
    --max-seq-len 65536 \
    --tool-call-parser qwen3_coder \
    --ssm-cache-slots 0 \
    --speculative

Have you read your own documentation? I have – it literally recommends --max-seq-len 2048

Your GitHub README should be a single source of truth. Otherwise add a warning: “don’t trust any information published here, it’s likely out of date and unreliable”. It can’t be both.

You posted here asking for help testing. I tested in good faith, but you expect me to scour your forums to figure out why the recipe you published in the quick start guide doesn’t work, and that is too much to ask. Like I said, this is my minimum-effort expectation: update your README before you ask for help.

It’s not like you don’t have form, going back to “We unlocked NVFP4 on the DGX Spark: 20% faster than AWQ!” I wasted a good week on that previous incarnation for the same reason.

And let me be clear: a Rust vllm is a great idea. The cold start is really nice. But why focus on 100 tok/s on a 35B? I get that on vllm now. What I want is models that do real work, not toys, and a 4k context feels like a toy. If that is your intent, make it very clear.

What would really impress me is if you supported Rob’s work on @tenari PrismaQuant v2 NVFP4. That would close the circle.

(Screenshots attached: the current quick start guide and the documentation.)

Keep up the good work. I was using your Avarok vllm when I first started to test my DGX Spark setup a few months ago, until you moved over to this Rust implementation. I am willing to test it out and comment. There are two DGX Sparks in the office; one is always spare for testing.

Same error – this doesn’t run either

#!/bin/bash

docker container remove atlas
docker pull avarok/atlas-alpha-2.7
docker run -it \
  --name atlas \
  --gpus all \
  --ipc=host \
  --network host  \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-alpha-2.7 \
  serve Sehyo/Qwen3.5-122B-A10B-NVFP4 \
    --port 8000 \
    --kv-cache-dtype fp8 \
    --kv-high-precision-layers auto \
    --gpu-memory-utilization 0.92 \
    --scheduling-policy slai \
    --max-seq-len 65536 \
    --tool-call-parser qwen3_coder \
    --ssm-cache-slots 0 \
    --speculative

That’s the thing. We all are working hard. We are all putting in time. But do you value my time as much as you value yours?

Thanks, I contacted you on Discord.

Qwen3.5-122B-A10B-NVFP4 on a single DGX Spark: just verified it’s working.

Run command:

sudo docker run -d --name atlas-122b-test \
    --gpus all --ipc=host --network host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    avarok/atlas-gb10:latest \
    serve Sehyo/Qwen3.5-122B-A10B-NVFP4 \
      --port 8888 \
      --kv-cache-dtype fp8 \
      --kv-high-precision-layers auto \
      --gpu-memory-utilization 0.92 \
      --scheduling-policy slai \
      --max-seq-len 16384 \
      --max-batch-size 1 \
      --max-num-seqs 4 \
      --oom-guard-mb 1024 \
      --tool-call-parser qwen3_coder \
      --ssm-cache-slots 0

You’re right that the quickstart’s flags weren’t 100% right for this model: the 122B NVFP4 weights eat ~108 GB of the 110 GB budget. With 32 concurrent slots × 2048 tokens, the KV pool needs ~4 GB but gets 1.5 GB, hence “0 blocks”. Drop --max-num-seqs to 4, bump --gpu-memory-utilization slightly, and the same 16K context (and more) fits.
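A back-of-envelope version of that arithmetic (numbers taken from the error message above; actual per-block bytes depend on the model’s KV shape and dtype, so treat this as a sketch):

MAX_SEQ_LEN=2096; BLOCK_SIZE=16; MAX_NUM_SEQS=32
BLOCKS_PER_SEQ=$(( (MAX_SEQ_LEN + BLOCK_SIZE - 1) / BLOCK_SIZE ))   # = 131, matching the error
echo $(( BLOCKS_PER_SEQ * MAX_NUM_SEQS ))                           # = 4192 blocks to cover every slot
# With ~108 GB of weights inside a ~110 GB budget, the leftover KV pool holds
# fewer than 131 blocks, so not even one full sequence fits:
# "KV cache can hold at most 0 concurrent sequence(s)"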

Few quick tests, batch=1 decode/coherence, batch=4 concurrency testing:

curl -sS http://localhost:8888/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"Sehyo/Qwen3.5-122B-A10B-NVFP4","messages":[{"role":"user","content":"Explain how a transformer attention layer works. Be concise."}],"max_tokens":200,"temperature":0}'

Raw response (content + usage):
content: "A transformer attention layer allows a model to dynamically weigh the importance of different parts of an input sequence when processing each token. It works in three main steps:

  1. Linear Projections: The input embeddings are projected into three vectors for each token: Query ($Q$), Key ($K$), and Value ($V$).
  2. Attention Scores: The model computes similarity scores between every Query and all Keys (usually via dot product, scaled by $\sqrt{d_k}$) to determine how much focus one token should place on another. A softmax function normalizes
    these scores into probabilities.
  3. Weighted Sum: The normalized scores are used as weights to compute a weighted sum of the Value vectors. This produces the output representation, which captures context from relevant tokens regardless of their distance in the
    sequence.

Finally, this output is typically passed through a feed-forward network and residual connections before moving to the next layer."

usage: prompt_tokens=24, completion_tokens=200, time_to_first_token_ms=515.8, response_token/s=36.39
wall_ms=5992 → measured decode = 33.4 tok/s
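For reference, those two rates are the same run divided two ways (arithmetic from the usage line above; the small gap to the server’s 36.39 is rounding):

200 tokens / 5.992 s wall               = 33.4 tok/s (includes time-to-first-token)
200 tokens / (5.992 - 0.516) s decode   = 36.5 tok/s (vs. the server-reported 36.39)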

for j in 1 2 3 4; do
    curl -sS http://localhost:8888/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d "{\"model\":\"Sehyo/Qwen3.5-122B-A10B-NVFP4\",\"messages\":[{\"role\":\"user\",\"content\":\"Request $j: Describe photosynthesis briefly.\"}],\"max_tokens\":100,\"temperature\":0.5}" \
      > "r_$j.json" &
done
wait

Result:
request 1: completion_tokens=100
request 2: completion_tokens=100
request 3: completion_tokens=100
request 4: completion_tokens=100
aggregate: 400 tokens in 13580 ms = 29.5 tok/s aggregate
Sample content from r_1.json:
"Photosynthesis is the biological process by which green plants, algae, and certain bacteria convert light energy into chemical energy.

Using chlorophyll (the pigment that gives plants their green color), these organisms capture energy from sunlight to transform carbon dioxide ($CO_2$) from the air and water ($H_2O$) from the soil into glucose (a type of
sugar) and oxygen ($O_2$). The"

Yes, the published quickstart numbers don’t fit. The README is wrong. Above is a tested working set.

We value your time @whpthomas as well as the communities efforts and we hope that we have merited the same.

So I saw this and it had me confused a bit. I thought the whole point of NVFP4 was to reduce the memory footprint. The 122B Int4 AutoRound takes 67.36 GiB in vllm; Atlas consuming ~108 GiB for model weights seems very costly to me, leaving little headroom for KV etc.

Crank this up! Try something like 0.96 and you should be able to run 128K-token sequences.
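For example (a sketch only: the verified 122B command from earlier in the thread with just those two knobs changed; 131072 = 128K):

sudo docker run -d --name atlas \
    --gpus all --ipc=host --network host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    avarok/atlas-gb10:latest \
    serve Sehyo/Qwen3.5-122B-A10B-NVFP4 \
      --port 8888 \
      --kv-cache-dtype fp8 \
      --kv-high-precision-layers auto \
      --gpu-memory-utilization 0.96 \
      --scheduling-policy slai \
      --max-seq-len 131072 \
      --max-batch-size 1 \
      --max-num-seqs 4 \
      --oom-guard-mb 1024 \
      --tool-call-parser qwen3_coder \
      --ssm-cache-slots 0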

Don’t know why you keep stating this as if it’s broken; it works, and we have multiple users reporting these speeds (even in this thread!). Cheer up, it’s a Friday. Approach this with a new mindset and you just might find success :)

It would be nice if the get-up-and-running command worked by default:

Apparently the --quantization fp8 flag doesn’t exist, and the server is by default only accessible on localhost (both quite easy fixes).

The numbers from llama-benchy look nice in comparison to VLLM, for single use at least.
But here is where it kind of falls down for me right now: thinking is disabled by default, I can’t find a way to pass some default chat-template kwargs to be able to run tool-eval-bench and do a 1:1 comparison, and I would need to re-configure all my tools to work with this.

Is this something I’m missing, planned, or intended to work this way?
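(From the serve --help output quoted later in this thread, per-request thinking looks controllable via the request body, e.g. reasoning_effort or thinking.budget_tokens, with MODEL.toml supplying the default. Something like this is what I’d expect to toggle it per call, though I haven’t confirmed the exact field shape:)

curl -sS http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen3.6-35B-A3B-FP8","messages":[{"role":"user","content":"Think step by step: what is 17*23?"}],"max_tokens":256,"thinking":{"budget_tokens":512}}'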

I do not really understand the complaints. This is a brand-new tool and it is open source, which means it does not owe you anything beyond its pure existence. You have an issue? Open an issue on GitHub or create a PR; this is the constructive way of community work.

Feedback is also important, but not in the form of “wasted my precious time”. It was your decision to invest the time. I just do not understand people who cannot add --help to see the actual parameters.

I have tried running this many times; unfortunately --max-seq-len 4096 is the largest context window size that will load. I can’t do anything meaningful with this. I can’t even run a benchmark against it; it’s just too small. So there is nothing else useful to report.

#!/bin/bash

docker container remove atlas
docker pull avarok/atlas-gb10:latest
docker run -it \
  --name atlas \
  --gpus all \
  --ipc=host \
  --network host  \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve Sehyo/Qwen3.5-122B-A10B-NVFP4 \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name qwen/qwen3.5-122b \
    --kv-cache-dtype fp8 \
    --kv-high-precision-layers auto \
    --gpu-memory-utilization 0.90 \
    --scheduling-policy slai \
    --max-seq-len 4096 \
    --max-batch-size 1 \
    --max-num-seqs 4 \
    --oom-guard-mb 1024 \
    --tool-call-parser qwen3_coder \
    --ssm-cache-slots 0 \
    --speculative
docker run -it \
  --name atlas \
  --gpus all \
  --ipc=host \
  --network host  \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest --help
Atlas Spark — pure Rust LLM inference server

Usage: spark <COMMAND>

Commands:
  serve  Start the inference server
  help   Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help
docker container remove atlas
docker pull avarok/atlas-gb10:latest
docker run -it \
  --name atlas \
  --gpus all \
  --ipc=host \
  --network host  \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve --help

Usage: spark serve [OPTIONS] [MODEL]

Arguments:
  [MODEL]
          HuggingFace model ID (e.g. "nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4") or a local directory path containing config.json

Options:
      --model-from-path <PATH>
          Load model directly from this filesystem path (skips HF cache resolution)

      --model-name <NAME>
          Override model name shown in /v1/models and API responses. Defaults to the positional MODEL argument, then config.json _name_or_path

      --cache-dir <DIR>
          Override HuggingFace cache directory (default: $HF_HUB_CACHE, $HF_HOME/hub, or ~/.cache/huggingface/hub)

      --port <PORT>
          HTTP port
          
          [default: 8888]

      --gpu-ordinal <GPU_ORDINAL>
          GPU ordinal
          
          [default: 0]

      --max-seq-len <MAX_SEQ_LEN>
          Maximum sequence length
          
          [default: 32768]

      --block-size <BLOCK_SIZE>
          KV cache block size (tokens per block)
          
          [default: 16]

      --kv-cache-dtype <KV_CACHE_DTYPE>
          KV cache dtype (fp8, bf16, or nvfp4). Default: fp8. NVFP4 uses less memory but may lose coherence at long context without --kv-high-precision-layers. FP8 is the safe default
          
          [default: fp8]

      --kv-high-precision-layers <KV_HIGH_PRECISION_LAYERS>
          Boundary attention layers to keep at BF16 KV cache precision (first N + last N). Protects attention sink tokens (early layers) and output quality (final layers) from quantization error while saving memory on middle layers. Accepts: number, "auto" (=2, recommended), "max"/"all" (all BF16). Default: 0 (all layers use --kv-cache-dtype)
          
          [default: 0]

      --gpu-memory-utilization <GPU_MEMORY_UTILIZATION>
          GPU memory utilization (0.0-1.0)
          
          [default: 0.9]

      --max-num-seqs <MAX_NUM_SEQS>
          Maximum concurrent sequences
          
          [default: 128]

      --disable-thinking
          Global kill-switch for chain-of-thought / reasoning output. When set, the server forces thinking OFF regardless of what the client requests (reasoning_effort, thinking.budget_tokens, etc.) or what MODEL.toml declares as the default. Precedence (highest wins): this flag → request body → MODEL.toml `[behavior]`.thinking_default.
          
          Harry Potter alias: `--stupify` (stuns the model's inner monologue).
          
          [aliases: --stupify]

      --max-thinking-budget <MAX_THINKING_BUDGET>
          Override MODEL.toml's `[behavior].max_thinking_budget` (tokens). Sets the per-request ceiling for thinking-block length. Per-request `thinking.budget_tokens` (or `reasoning_effort`) still wins below this ceiling; the (max_tokens * 9 / 10) safety cap is always enforced

      --speculative
          Currently slower than regular decode for hybrid SSM models

      --self-speculative
          Enable self-speculative decoding: draft via layer-skipping (no MTP weights needed). Skips SSM layers during drafting for cheap predictions, then verifies with full model

      --ngram-speculative
          Enable N-gram speculative decoding: CPU-side pattern matching proposer with CUDA-graphed K=2 verification. No extra weights needed

      --dflash
          Enable DFlash block-diffusion speculative decoding (Z Lab, arXiv 2602.06036). Pairs the target with a small Qwen3-architecture drafter (e.g. `z-lab/Qwen3.6-35B-A3B-DFlash`) that emits γ tokens per step via bidirectional in-block attention conditioned on captured target hidden states. Mutually exclusive with `--speculative`

      --draft-model <DRAFT_MODEL>
          HuggingFace id (or local path) of the DFlash drafter checkpoint. When `--dflash` is set without `--draft-model`, the value falls through from the target's MODEL.toml `[dflash].draft_model` field

      --dflash-gamma <DFLASH_GAMMA>
          DFlash block size γ (parallel draft tokens per step). Defaults to the drafter's `block_size` from `config.json` (16 for the published Qwen3.6-DFlash drafters); override only for ablation. Higher γ increases per-step verify cost but raises peak speedup
          
          [default: 16]

      --dflash-window-size <DFLASH_WINDOW_SIZE>
          DFlash drafter sliding-window size for long context. The drafter runs full-prefix attention by default; at Atlas's typical 16K `--max-seq-len`, drafter attention dominates per-step cost. The upstream sglang / vLLM default is 4096. Set to 0 to disable (full attention)
          
          [default: 4096]

      --num-drafts <NUM_DRAFTS>
          Number of draft tokens per speculative step (1=K=2, 2=K=3, 3=K=4 verify). Higher K verifies more drafts per step. Uses WY-chunkwise GDN kernels
          
          [default: 1]

      --max-batch-size <MAX_BATCH_SIZE>
          Maximum concurrent sequences batched into one GPU decode step
          
          [default: 8]

      --mtp-quantization <MTP_QUANTIZATION>
          MTP head weight precision: nvfp4 (fastest, recommended — uses fused device-side expert dispatch), fp8 (balanced but slower due to D2H sync in MoE), bf16 (highest accuracy, most memory)
          
          [default: nvfp4]

      --mtp-vocab <MTP_VOCAB>
          MTP draft vocabulary size. Limits the LM head GEMV to the first N token IDs, reducing propose latency. BPE tokenizers place frequent tokens at low IDs — 100K covers >99% of English outputs while cutting propose time by 37% (2.15ms → 1.35ms) with zero acceptance loss. Set to 0 to use full vocabulary
          
          [default: 100000]

      --enable-prefix-caching [<ENABLE_PREFIX_CACHING>]
          Enable prefix caching via radix tree (RadixAttention). Caches KV blocks for recurring prompt prefixes. For SSM models, KV is recomputed when no SSM snapshot exists (safe but no TTFT speedup without Marconi snapshots). Block table reuse still avoids allocation
          
          [default: false]
          [possible values: true, false]

      --dump [<PATH>]
          Dump every /v1/chat/completions, /v1/responses, and /v1/messages (Anthropic) request — plus the corresponding response (non-streaming) or aggregated stream — as JSONL to a file. Intended for extracting the exact system prompt and tool schema a client (opencode, Claude Code, etc.) is sending, and for replaying failure cases in fixtures.
          
          With no value: a temp file is created under $TMPDIR and its path is logged at INFO on startup. With a PATH: appends (never truncates) to that file. Each line is one JSON object: `{ "ts": "<iso8601>", "endpoint": "...", "kind": "request"|"response",` "seq": N, "body": { ... } } so entries can be grouped by `seq` to reconstruct pairs.

      --scheduling-policy <SCHEDULING_POLICY>
          Scheduling policy: fifo (default) or slai (SLO-aware). SLAI prioritizes decode for sequences nearing TBT deadline and orders prefills shortest-prompt-first
          
          [default: fifo]

      --tbt-deadline-ms <TBT_DEADLINE_MS>
          TBT deadline in milliseconds for SLAI scheduling policy. Sequences approaching this deadline trigger decode-first priority
          
          [default: 100]

      --max-prefill-tokens <MAX_PREFILL_TOKENS>
          Maximum tokens to prefill per scheduler iteration (chunked prefill). Long prompts are split into chunks of this size, interleaved with decode steps for active sequences. Set to 0 to disable chunking (process entire prompt in one shot, legacy behavior). Chunked prefill: split long prompts into chunks, interleaved with decode steps. 8192 default halves chunk count vs 4096, giving ~11% TTFT improvement at 32K with no decode regression on DGX Spark. Set to 0 to disable (process entire prompt at once)
          
          [default: 8192]

      --oom-guard-mb <OOM_GUARD_MB>
          Minimum free GPU memory (in MB) to keep as a safety margin during model loading. If free memory drops below this threshold after any shard, loading is aborted to prevent system OOM. Default 4096 MB accounts for CUDA context, NCCL buffers, and allocator overhead
          
          [default: 4096]

      --rank <RANK>
          Global rank (0=head, 1=worker, …). Only used when --world-size > 1
          
          [default: 0]

      --world-size <WORLD_SIZE>
          Total physical ranks across all parallelism dims. Set to 2 for two-node deployment. Must satisfy `world_size == tp_size × ep_size` (orthogonal mesh) or `world_size == tp_size == ep_size` (overlapping groups on the same physical ranks — used for 2-GPU TP+EP composition)
          
          [default: 1]

      --tp-size <TP_SIZE>
          Tensor-parallel dimension. Splits attention/MLP weights column- and row-parallel across `tp_size` ranks. 1 = no TP. Composes with EP: MoE expert weights stay EP-sharded; attention/MLP get TP-sharded
          
          [default: 1]

      --ep-size <EP_SIZE>
          Expert-parallel dimension. Splits MoE expert weights across `ep_size` ranks. 1 = no EP. Default of 1 keeps single-rank semantics
          
          [default: 1]

      --master-addr <MASTER_ADDR>
          NCCL bootstrap address (IP of rank 0 node)
          
          [default: 127.0.0.1]

      --master-port <MASTER_PORT>
          NCCL bootstrap port
          
          [default: 29500]

      --tool-call-parser <FORMAT>
          Tool call parser format. Enables OpenAI-compatible tool calling. Supported: "hermes" (Qwen3/3.5 JSON format), "qwen3_coder" (Nemotron-H XML format). When set, tool definitions in requests are injected into the system prompt and model output is parsed for tool_call tags

      --tool-max-tokens <TOOL_MAX_TOKENS>
          Maximum output tokens per tool-calling request. Caps max_tokens from the client when tools are active to prevent unbounded generation if the model doesn't emit a </tool_call> stop token. Must be high enough for Write tool calls with large file content. Default 8192
          
          [default: 8192]

      --ssm-cache-slots <SSM_CACHE_SLOTS>
          Number of SSM state snapshot slots for Marconi prefix caching. Each slot stores SSM h_state + conv_state for all SSM layers, enabling full prefix skip (KV + SSM) on cache hits. 0 = disabled. 16 = recommended for repeated-prefix and multi-turn workloads. Intermediate checkpoints (--ssm-checkpoint-interval) require extra slots: ~(max_context / checkpoint_interval_tokens) per cached sequence
          
          [default: 16]

      --ssm-checkpoint-interval <SSM_CHECKPOINT_INTERVAL>
          Save SSM state snapshots at regular block boundaries during prefill. When set to N > 0, a snapshot is saved every N blocks during chunked prefill. On future prefix cache hits, the deepest intermediate snapshot is restored, reducing SSM recomputation from the full prefix to just the tokens between the checkpoint and the match point. 0 = disabled (leaf-only snapshots). 256 = every 4096 tokens (block_size=16)
          
          [default: 256]

      --auto-compact [<THRESHOLD>]
          Enable automatic context compaction for long conversations. **DISABLED BY DEFAULT** (2026-04-25): the auto-compactor has historically been a source of agent loops — synthesised continuation messages and middle-of-history truncation themselves trigger drift (cf. opencode issues #15533, #17169, #19339). Oversize requests get a clean 400 error (`Prompt too long`) rather than a silently-rewritten context.
          
          Only pass `--auto-compact[=THRESHOLD]` if you have explicitly validated that compaction is safe for your model + workload. Without a value: threshold=0.75 (compact at 75% of max_seq_len). With a value: compact at that fraction (e.g., 0.80 = 80%).
          
          Method: Active Context Compression (arXiv:2601.07190) — the server uses the model itself to summarize older conversation turns into a condensed knowledge block.

      --default-top-n-sigma <DEFAULT_TOP_N_SIGMA>
          Default top-n-sigma for sampling (filter tokens by logit z-score). 0.0 = disabled. Recommended: 1.0 for NVFP4 models AND for agent workloads — top-n-σ is temperature-invariant (Tang et al., arXiv:2411.07641) so it is more robust than top-p across the per-phase temperature drift agentic loops induce
          
          [default: 1]

      --default-min-p <DEFAULT_MIN_P>
          Default min-p for sampling (keep tokens with prob >= min_p * max_prob). 0.0 = disabled. Recommended: 0.05-0.1
          
          [default: 0]

      --swap-space-gb <SWAP_SPACE_GB>
          Swap space in GB for KV cache overflow to disk. When GPU blocks are exhausted, sequences are swapped to disk and resumed later. 0 = disabled. Swap files stored in /tmp/atlas-swap/
          
          [default: 3]

      --high-speed-swap
          

      --high-speed-swap-dir <HIGH_SPEED_SWAP_DIR>
          Directory for the per-layer NVMe-backed KV files. Required when --high-speed-swap is set; must be on a different mount than --swap-space-gb's /tmp/atlas-swap to avoid file collisions

      --high-speed-swap-gb <HIGH_SPEED_SWAP_GB>
          Total disk budget for --high-speed-swap, in GiB

      --high-speed-swap-resident-blocks <HIGH_SPEED_SWAP_RESIDENT_BLOCKS>
          HBM scratch slot count (number of resident blocks)

      --high-speed-swap-rank <HIGH_SPEED_SWAP_RANK>
          Predictor low-rank dimension (Phase 1 ships at r=32)
          
          [default: 32]

      --high-speed-swap-qd <HIGH_SPEED_SWAP_QD>
          io_uring submission queue depth (Phase 3 shows QD=8 reaches 3.4 GB/s on this DGX Spark image)
          
          [default: 8]

      --high-speed-swap-graph <HIGH_SPEED_SWAP_GRAPH>
          Capture the per-layer body in a CUDA graph and replay (Phase 4). Defaults to mirror --high-speed-swap
          
          [possible values: true, false]

      --high-speed-swap-cache-blocks-per-seq <HIGH_SPEED_SWAP_CACHE_BLOCKS_PER_SEQ>
          Per-sequence HBM cache cap for `--high-speed-swap` (Phase 6.1). When set together with --high-speed-swap, each sequence is limited to N HBM-resident KV blocks; older blocks are evicted to disk and streamed back via the orchestrator on demand. The KV cache total allocation shrinks to roughly `max_batch_size × N` blocks. Default 64 (= 1024 tokens HBM-resident at block_size=16). Set to max_seq_len/block_size to disable HBM-shrink (no eviction; useful for diff-against-no-swap correctness checks)
          
          [default: 64]

      --request-timeout <REQUEST_TIMEOUT>
          Default request timeout in seconds. 0 = no timeout
          
          [default: 300]

      --profile
          Enable per-kernel profiling: sync + time each operation within layers. Disables CUDA graphs for accurate per-op timing. Adds ~10% overhead

      --fp8-kv-calibration-tokens <FP8_KV_CALIBRATION_TOKENS>
          Number of warmup tokens for online FP8 KV cache scale calibration. During the first N tokens, tracks max |K| and max |V| values across all attention layers. After N tokens, computes per-tensor scales as max/448 (mapping the observed range to FP8 E4M3 [-448, 448]). 0 = disabled (use static scales from checkpoint, or uncalibrated 1.0). Only applies when --kv-cache-dtype is fp8
          
          [default: 0]

      --warmup-prompt <WARMUP_PROMPT>
          Path to a warmup prompt file (JSON messages or plain text). At startup, the server tokenizes and prefills this prompt, inserting the resulting KV cache + SSM snapshot into the prefix cache. This eliminates the cold-start TTFT penalty (~196ms) on the first real request

      --adaptive-sampling
          Enable adaptive sampling (entropy-based greedy gating, zone detection). Computes Shannon entropy over the full vocabulary per token to dynamically switch between greedy and sampled decoding. Improves quality for mixed content (code + prose) at the cost of ~2-3x decode throughput reduction. Off by default for maximum throughput

      --no-fast-load
          Disable the InstantTensor-style fast weight loader and use the mmap loader instead. The fast loader (O_DIRECT + pipelined reader/copier, with a per-shard heuristic that picks between O_DIRECT and buffered reads) is on by default — this flag is an escape hatch for rare filesystems that misbehave with O_DIRECT or for A/B debugging. Setting `ATLAS_FAST_LOAD=0` has the same effect

      --bind <ADDR>
          Address to bind the HTTP listener to. Defaults to `127.0.0.1` so a fresh install is reachable only from the local machine; pass `0.0.0.0` to expose on all interfaces (the server logs a warning when it does, since combined with the permissive default CORS this makes the API reachable to anything on the LAN)
          
          [default: 127.0.0.1]

      --require-auth
          Require an `Authorization: Bearer <token>` header on `/v1/*`, `/tokenize`, and `/detokenize`. The token must match one loaded via `--auth-tokens-file` or `--auth-token`. `/health`, `/health/live`, and `/metrics` stay open as scrape targets.
          
          Defaults to off — Atlas is local-by-default, so most users can skip this. Turn on whenever the server is reachable from anywhere other than `localhost` (i.e. whenever you've passed `--bind 0.0.0.0` or are running behind an exposed port-forward).

      --auth-tokens-file <PATH>
          Path to a file containing valid bearer tokens, one per line. Blank lines and lines starting with `#` are ignored. Permissions should be `0600`. The file is read once at startup; SIGHUP reloading is not supported (restart the server to rotate keys)

      --auth-token <TOKEN>
          A single inline bearer token. Convenient for quick starts; not recommended for production because the token is visible in `ps`/`/proc/<pid>/cmdline`. Use `--auth-tokens-file` instead

  -h, --help
          Print help (see a summary with '-h')