Atlas: Open-source inference engine for DGX Spark <2-minute cold start, 100+ tok/s on Qwen3.6-35B-FP8, 13+ supported models

This is already one step further… Now run `serve --help` and you will see actual help.

But I agree with you: most models currently measure higher token rates, but there are a lot of small things that don't match up with each other. Given how young the tool is, that is quite easy to understand.

I'm more interested in running it from source, because I already gave up on the idea of a 'community Docker' and switched to just running vllm via uv.

I get it, but it's not that hard to proofread a README and double-check that the quick-start guide works before posting to ask for help. Especially when the claim is:

Open-source inference engine for DGX Spark <2-minute cold start, 100+ tok/s on Qwen3.6-35B-FP8, 13+ supported models

I jumped in to help test precisely because there was such a pile-on last time. I would like this to work; it doesn't at the moment.

Looking at the repo, the README was just updated, so your feedback has been acted on.

I discovered it two days earlier, but figured a pure docs PR would not make much sense.

IMHO there is no combination of parameters that will give you vllm-level quality at this point. You should not expect that. It is more like a sandbox. Of course this sandbox makes bold statements, but you need to be loud to survive, I guess ;) For me it is an opportunity to study some basics and see if I can apply them there. I also love Rust and work with the language, so the whole idea is very relevant to me. But the simple truth is that it needs much more time invested than you put in today. I guess if you just want to run models, you are better off waiting another half year. If you want to be actively involved, you need to dig deeper. I think I have made my choice here, but I cannot recommend anything to you :)

I think @whpthomas is overall quite direct (seems harsh, but just very upfront).

The criticism/complaints are valid, and I don't think moving the discussion from the public thread to GitHub issues is always the best course of action. The thread (even with criticism and harsh feedback) can be a fantastic way to show engagement and to prove that the maintainers are actually listening and willing to interact, which I believe @AzeezIsh is doing quite well here.

From the message that's portrayed, I would expect it to be something I can get up and running right now as a replacement for what I'm using (vllm, sglang, llama.cpp, etc.), so when it doesn't meet those expectations it's of course frustrating, and many would probably just stay quiet and shove it aside.

I would assume the docs on GitHub are up to date, so when they conflict with the information in `serve --help` it also becomes difficult to know what the actual source of truth is.

I would love to try this tool out, but since it's seemingly missing a critical feature for me (default kwargs), which I can set in the others, it's sadly not worth it for me at this moment.

The vast majority of people running into issues will just dismiss it and move on, so what I'm trying to say is that I think what @whpthomas is doing is a great thing: he's showing he wants this to be practical but sadly can't use it yet, and I at least do not read it as him diminishing their work.

Yes, writing feedback in the announcement post is so much better than writing two lines in a PR, or at least an issue. I cannot agree here.

I could also have written this feedback, and so could anyone else. I did not. Why? Because I understand that I actually need to create an issue on GitHub, since it is an open-source project now. But I did not. Why then would I write a post in the announcement thread? It makes no sense to me. It would probably say more about my inability to provide adequate feedback than about the project itself.

Yes, writing feedback in the announcement post is so much better than writing two lines in a PR, or at least an issue. I cannot agree here.

We do not have to agree; I think the most important thing, regardless of channel, is that we give feedback to the maintainers.

I could also have written this feedback, and so could anyone else. I did not. Why? Because I understand that I actually need to create an issue on GitHub, since it is an open-source project now. But I did not.

So then I’m curious here:

  1. Did you not find issues when you ran it?
  2. Did you not want to help the project by reporting your findings?
  3. Did you expect someone else to do it?
  4. Something else entirely?
Why then would I write a post in the announcement thread? It makes no sense to me. It would probably say more about my inability to provide adequate feedback than about the project itself.

In the announcement thread I'd expect both positive and negative feedback, and especially right around launch, with more eyes on it, to generate feedback and discussion. In my opinion it's also a great place for others to chime in if they have perhaps noticed the same thing, already reported it themselves, or simply agree with the issues.

Speaking from my own perspective, I would love the engagement in the announcement thread, and if I didn't want the issues here, or couldn't answer them, I'd just ask the person to file an issue on GitHub.

Not to mention as stated in the initial post:
Happy to answer technical questions in this thread, but it’s also completely available for you to scour these details in our codebase. We’d love to build around any specific use case you have, don’t hesitate to reach out. Feel free to take it to DM on any of the social sites!

Which I read as a welcome to reply here too.

Could you summarize the meaning of this in less than 50 words and maybe also give me a recipe for pudding?

Potato

Paris.

I would say we are getting quite off track from the actual thread, so from me that’s enough.

But no, I’m not generating messages from any LLM, it’s all me typing and reasoning.
And I would love some pudding, but I generally buy the store pre-made ones; I don't know the recipe.

To answer your question about why I have not created the issue: I was busy reading the code and testing some models, and found many small hiccups I wanted to report all together. But my time is not that precious; it can be wasted.

Posting an automated evaluation report — generated by Claude Opus 4.7 driving our standard inference-engine test harness against Atlas. We periodically run this harness against new GB10 candidates; this is the first pass on Atlas built from `main`.

**Build under test:** `main` @ `a19f639` (includes PR #24 *Fix MCP tool-call “Unknown tool”* and PR #25 *qwen_xml_parameter grammar/INFO demote*) — built fresh from the multi-model `docker/gb10/Dockerfile`. Image size 2.79 GB. Cold start to first 200 on `/v1/models`: ~50 s.
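
(For transparency on how the cold-start number is measured: it is just wall-clock from container start to the first 200 on `/v1/models`. A minimal sketch of that probe is below; the port matches the launch flags in the next section, everything else is illustrative rather than the actual harness code.)

```python
import time
import requests

BASE = "http://localhost:30002"  # port from the launch flags below

def wait_for_ready(timeout_s: float = 180.0) -> float:
    """Poll /v1/models until it answers 200; return elapsed seconds."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            if requests.get(f"{BASE}/v1/models", timeout=2).status_code == 200:
                return time.monotonic() - start
        except requests.ConnectionError:
            pass
        time.sleep(1.0)
    raise TimeoutError("server did not come up within the timeout")

print(f"cold start: {wait_for_ready():.1f} s")
```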

**Launch flags** (essentially the announcement recipe + `--bind 0.0.0.0` per #28):

```
spark serve --model-from-path /model --port 30002 --bind 0.0.0.0 \
  --max-seq-len 65536 --kv-cache-dtype fp8 --kv-high-precision-layers auto \
  --gpu-memory-utilization 0.90 --scheduling-policy slai \
  --tool-call-parser qwen3_coder --enable-prefix-caching --speculative
```

**Model:** `Qwen/Qwen3.6-35B-A3B-FP8` (downloaded with `hf download --local-dir`).

---

**Suite 1 — speed (5 prompts, OpenAI-compatible non-streaming, remote LAN client):**

| # | Test | Wall tok/s | Server `response_token/s` |
|---|---|---|---|
| 1 | Minimal (9 tok gen) | 32.8 | 81.2 |
| 2 | Short prompt, medium gen (358 tok) | 85.2 | 89.2 |
| 3 | Short prompt, long gen (2 652 tok) | 89.4 | 90.4 |
| 4 | Long prompt (2 021 tok), short answer | 13.3 | 78.5 |
| 5 | Multi-turn convo (291 tok) | 85.9 | 100.9 |

Aggregate: 3 333 generated tokens in 39.2 s wall = **84.9 tok/s overall**, **89.4 tok/s peak**. Server-side counter tracks the announcement’s ~100 tok/s claim once LAN round-trip is excluded. **PASS.**
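
(To be explicit about the two columns: "Wall tok/s" is completion tokens divided by client-side wall time for a plain non-streaming request, so it includes prompt processing and LAN round-trip, which is why test 4 drags it down; the server column is Atlas's own `response_token/s` log counter. A minimal sketch of the client-side measurement, with an illustrative prompt, looks like this:)

```python
import time
from openai import OpenAI

# OpenAI-compatible endpoint from the launch flags; the API key is unused but required
client = OpenAI(base_url="http://localhost:30002/v1", api_key="none")

t0 = time.monotonic()
resp = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B-FP8",
    messages=[{"role": "user", "content": "Explain KV-cache quantization in two paragraphs."}],
    stream=False,
)
wall = time.monotonic() - t0

gen = resp.usage.completion_tokens
print(f"{gen} tok in {wall:.1f} s -> {gen / wall:.1f} tok/s wall")
```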

---

**Suite 2 — single-turn tool-call correctness (8 streaming scenarios):** 8 / 8 **PASS.**

| # | Scenario | Tool selected | Args |
|---|---|---|---|
| 1 | "Weather in Tokyo?" | `get_weather` | `{city: "Tokyo"}` |
| 2 | "Weather in London in fahrenheit?" | `get_weather` | `{city, unit: "fahrenheit"}` |
| 3 | Three tools available, web query | `web_search` | `{query}` |
| 4 | Three tools, math expression | `calculator` | `{expression}` |
| 5 | "What is 2+2?" (no tool needed) | _(none, returned `4`)_ | — |
| 6 | Multi-turn — assistant + tool result already in history | _(no further call, prose summary)_ | — |
| 7 | Agentic: read before write | `read_file` first | `{path}` |
| 8 | Complex args | `create_file` | `{path, content}` |

No `Unknown tool` errors observed in this suite — PR #24 + #25 confirmed effective at single-turn scope. **PASS.**
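
(For anyone wanting to reproduce the suite: these are plain OpenAI-style tool-call requests. A minimal sketch of one streaming scenario follows; the `get_weather` schema here is my own illustrative definition, not something taken from the Atlas repo.)

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30002/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

# Accumulate streamed tool_call deltas into {index: {"name": ..., "arguments": ...}}
calls = {}
stream = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B-FP8",
    messages=[{"role": "user", "content": "Weather in Tokyo?"}],
    tools=tools,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    for tc in chunk.choices[0].delta.tool_calls or []:
        slot = calls.setdefault(tc.index, {"name": "", "arguments": ""})
        slot["name"] += tc.function.name or ""
        slot["arguments"] += tc.function.arguments or ""

for slot in calls.values():
    print(slot["name"], json.loads(slot["arguments"]))  # expect: get_weather {'city': 'Tokyo'}
```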

---

**Suite 3 — multi-turn drift (single growing conversation, 5-tool set, synthetic JSON tool results echoed back, target 40 turns):** **REGRESSION at turn 11.**

Turns 1–10 all returned clean structured `tool_calls` (one per turn, correct selection and args). From turn 11 onward, the model stopped emitting `tool_calls` and returned the following as plain `content` with `finish_reason: "stop"` — verbatim, repeated across turns 11, 12, and beyond:

```

You have had 1 consecutive failed or repeated tool calls in this session. The user’s ORIGINAL request was:

«What’s the weather in Berlin?»

Do not abandon this task. Either: (a) try a fundamentally different approach (different tool, different command-line args, or accomplishing the goal without that tool), or (b) report the SPECIFIC blocker concisely and what you would need to proceed. Do not regenerate work that already exists; do not retry an identical call.

```

Two notable properties of this output:

1. **The block above is not in any of the prompts the harness sent.** It looks like a Claude-Code-style scaffold trajectory bleeding through from training data, surfaced as assistant content. This suggests the qwen3_coder chat template / parser is letting post-tool tokens fall outside the structured `tool_calls` channel once context grows past a threshold.

2. **The quoted “ORIGINAL request” is always turn 1’s prompt** (“Berlin”), even on turn 12 when the actual user message is about `src/main.py`. So whatever scaffold heuristic is firing inside the model has anchored to the conversation prefix and isn’t tracking the live message.

One additional anomaly worth flagging: turn 4 (a `calculator` request, “Calculate 2847 * 19 + 33.”) in the streaming run returned plain content `“54126”` rather than a tool call — possibly speculative-decoding drafts aligning with a memorized arithmetic answer and short-circuiting the parser. Non-streaming replay of the same prompt at the same turn position resolved correctly.

---

**Summary**

| Dimension | Result |
|---|---|
| Cold start | ✅ ~50 s, well under 2 min |
| Throughput claim | ✅ ~85–89 tok/s wall, ~100 tok/s server-side |
| Single-turn tool calls | ✅ 8 / 8 |
| Multi-turn agentic | ❌ regression at turn 11 (training-data trajectory leakage) |

Single-turn parity with stable vLLM 0.20 + qwen3_coder achieved; multi-turn behaviour blocks adoption as a daily-driver replacement for now. The fix turnaround on PR #24/#25 was unusually fast — happy to re-run this harness against the next image push if that helps.

Harness is generic OpenAI-compatible chat-completions with a 5-tool set (`get_weather`, `web_search`, `calculator`, `create_file`, `read_file`) and JSON tool results fed back. Can share the script if it’s useful for your CI.
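
Until then, here is a rough sketch of its shape (not the actual script; only one of the five tools is shown, and the synthetic tool-result payload is obviously made up):

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30002/v1", api_key="none")
MODEL = "Qwen/Qwen3.6-35B-A3B-FP8"

# One tool shown for brevity; the real harness registers all five.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Scripted user turns (the real run cycles through ~40 of these).
user_turns = [
    "What's the weather in Berlin?",
    "Now check the weather in Tokyo.",
    "And Paris?",
]

messages = []
for turn, prompt in enumerate(user_turns, start=1):
    messages.append({"role": "user", "content": prompt})
    resp = client.chat.completions.create(model=MODEL, messages=messages, tools=TOOLS)
    msg = resp.choices[0].message

    if not msg.tool_calls:
        # This branch is where the Suite 3 regression shows up from turn 11 onward.
        print(f"turn {turn}: no tool_calls, content={msg.content!r}")
        messages.append({"role": "assistant", "content": msg.content})
        continue

    # Echo the assistant turn back verbatim, then feed a synthetic JSON tool result.
    messages.append({
        "role": "assistant",
        "content": msg.content,
        "tool_calls": [
            {"id": tc.id, "type": "function",
             "function": {"name": tc.function.name, "arguments": tc.function.arguments}}
            for tc in msg.tool_calls
        ],
    })
    for tc in msg.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": json.dumps({"ok": True, "result": "synthetic tool output"}),
        })
```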

Quick correction — Discourse seems to have autocorrected a few `--flags` to en-dashes in the recipe above, which would break copy-paste. Here it is in a fenced code block so the dashes survive:

```
spark serve --model-from-path /model --port 30002 --bind 0.0.0.0 \
  --max-seq-len 65536 --kv-cache-dtype fp8 --kv-high-precision-layers auto \
  --gpu-memory-utilization 0.90 --scheduling-policy slai \
  --tool-call-parser qwen3_coder --enable-prefix-caching --speculative
```

(All flags are double-hyphen `--`.)

Final attempt at testing something meaningful. Atlas does not appear to support multi-modal, so I shortened this test to include only the last step, essentially characterising text content.

Same 35B model, same config – Atlas vs vllm. Atlas was 6x slower and failed to complete over half the tasks.

Qwen 3.5 35b was the first model I tried with Atlas, and after getting poor results I moved on to attempting to get 122b to work.

For what it's worth, I would like to see you get this working, but I would also prefer to see a bit more candour. Saying that Atlas 'supports' 13+ models is really unhelpful to your cause. If even a cursory test of Atlas serving 35b with standard tasks fails, it tells me that your inference quality assurance, benchmarking and testing regime needs a lot more rigour.

Qwen3.6-35B-A3B-FP8

Atlas: 37:31 minutes 8/22 completed

#!/bin/bash

docker container remove atlas
docker pull avarok/atlas-gb10:latest
docker run -it --name atlas \
  --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name qwen/qwen3.5-122b \
    --max-seq-len 65536 \
    --kv-cache-dtype fp8 \
    --kv-high-precision-layers auto \
    --gpu-memory-utilization 0.90 \
    --scheduling-policy slai \
    --tool-call-parser qwen3_coder \
    --enable-prefix-caching \
    --speculative

22 page indexing tasks involving 13 instructions and 1 tool call

[2026-05-08T08:56:26.994Z] === Orca2 Plugin Initialized ===
[2026-05-08T08:57:05.172Z] INFO #{ocr-index ... } invocation detected
[2026-05-08T08:57:05.183Z] INFO ocr-index Loaded 3 steps from /Users/henry/Documents/xxxxxxxx/workflow/ocr-index.yaml
[2026-05-08T08:57:05.184Z] INFO #{ocr-index ... } session ses_1f931fcb6ffeYl8SKDuYqQO0iJ started
[2026-05-08T08:57:15.762Z] INFO ocr-index->index Parallel subtask pending
[2026-05-08T08:57:15.762Z] INFO ocr-index Building first workflow prompt
[2026-05-08T08:57:20.850Z] INFO ocr-index Spawned 22 subtask(s)
[2026-05-08T09:14:01.803Z] INFO ocr-index 3 subtasks completed, 19 failed
[2026-05-08T09:14:01.804Z] INFO ocr-index->index Parallel subtask pending
[2026-05-08T09:14:01.804Z] INFO ocr-index Building next parallel prompt
[2026-05-08T09:14:06.888Z] INFO ocr-index Spawned 19 subtask(s)
[2026-05-08T09:20:51.347Z] INFO ocr-index 1 subtasks completed, 18 failed
[2026-05-08T09:20:51.348Z] INFO ocr-index->index Parallel subtask pending
[2026-05-08T09:20:51.348Z] INFO ocr-index Building next parallel prompt
[2026-05-08T09:20:56.425Z] INFO ocr-index Spawned 18 subtask(s)
[2026-05-08T09:24:13.588Z] INFO ocr-index 1 subtasks completed, 17 failed
[2026-05-08T09:24:13.589Z] INFO ocr-index->index Parallel subtask pending
[2026-05-08T09:24:13.589Z] INFO ocr-index Building next parallel prompt
[2026-05-08T09:24:18.655Z] INFO ocr-index Spawned 17 subtask(s)
[2026-05-08T09:33:55.004Z] WARN ocr-index Watchdog timer expired
[2026-05-08T09:33:55.005Z] INFO ocr-index 3 subtasks completed, 14 failed
[2026-05-08T09:33:55.006Z] INFO ocr-index Workflow ended - not more prompts, removing ses_1f931fcb6ffeYl8SKDuYqQO0iJ

vllm: 5:34 minutes 22/22 completed

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  max_model_len: 196608
  gpu_memory_utilization: 0.75
  max_num_batched_tokens: 32768
  max-num-seqs: 32
  served_model_name: qwen/qwen3.6-35b
  speculative_config: '{"method": "mtp", "num_speculative_tokens": 3}'
  coding_config: '{"temperature": 0.7,  "top_p": 0.8, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}'
  writing_config: '{"temperature": 0.6,  "top_p": 0.9, "top_k": 20, "presence_penalty": 1.5, "repetition_penalty": 1.1}'

# Environment variables
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

# The vLLM serve command template
command: |
  vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --served-model-name {served_model_name} \
  --max-model-len {max_model_len} \
  --gpu-memory-utilization {gpu_memory_utilization} \
  --max-num-batched-tokens {max_num_batched_tokens} \
  --max-num-seqs {max-num-seqs} \
  --port {port} \
  --host {host} \
  --load-format instanttensor \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --speculative-config '{speculative_config}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --generation-config auto \
  --override-generation-config '{coding_config}'

Same 22 page indexing tasks involving 13 instructions and 1 tool call

[2026-05-08T10:00:26.994Z] === Orca2 Plugin Initialized ===
[2026-05-08T10:00:57.828Z] INFO #{ocr-index ... } invocation detected
[2026-05-08T10:00:57.833Z] INFO ocr-index Loaded 3 steps from /Users/henry/Documents/xxxxxxxx/workflow/ocr-index.yaml
[2026-05-08T10:00:57.834Z] INFO #{ocr-index ... } session ses_1f8f78125ffeLZrz1SmhhaunRR started
[2026-05-08T10:01:31.101Z] INFO ocr-index->index Parallel subtask pending
[2026-05-08T10:01:31.101Z] INFO ocr-index Building first workflow prompt
[2026-05-08T10:01:36.180Z] INFO ocr-index Spawned 22 subtask(s)
[2026-05-08T10:05:33.759Z] INFO ocr-index All subtasks completed
[2026-05-08T10:05:33.760Z] INFO ocr-index->index Parallel subtask done
[2026-05-08T10:05:33.762Z] INFO Concatenate.list() Successfully wrote /Users/henry/Documents/xxxxxxxx/job-1/toc.md
[2026-05-08T10:05:33.762Z] INFO ocr-index->index Step done
[2026-05-08T10:05:33.762Z] INFO ocr-index Workflow complete - not more prompts, removing ses_1f8f78125ffeLZrz1SmhhaunRR

Yeah, I also tried with single- and multi-tool instructions, and Atlas was not able to complete any of my agent workflows. I will stay with vllm for the moment, but I will keep an eye on the Atlas project's improvements because the idea looks quite interesting to me. And, btw, I am mostly using terraform/log analytics agents, and for those purposes prismquant models are the way to go for now.

I think it's relatively comical that the community is even comparing a vibe-coded 'I'll show them' project with an enterprise-grade stack like vllm.

What do Qwen and Nvidia have to say? In the previous thread it was mentioned that 'China was involved' and that you were in 'active discussions with Nvidia,' specifically calling out both companies' interest.

Are we just dropping that charade now?

Crank this up! Try something like 0.96 and you should be able to run 128k-length sequences.

Don't know why you keep stating this as if it's broken; it works, and multiple users are reporting these speeds (even in this thread!). For your testing, you need SSM caching and a longer sequence length for better results. Cheer up, it's a Friday. Approach this with a new mindset and you just might find success :)

Atlas: 37:31 minutes vs vllm 5:34
Atlas: 8/22 completed vs vllm 22/22

Not sure what else to call that. Same model, same quant, no context pressure, different outcome.

Your run params aren't ideal for this use case; you made this task run as subagents rather than calling them in batch=4, for example. We excel at conc=1 use cases, and larger dense models with MTP like Qwen3.6-27B are where Atlas truly shines. I'd still need to see the details of said testing, but either way, we're always looking to improve!

Here we go again. These aren’t MY parameters, these are YOUR parameters copied directly from the project README on git.

This was the task: 13-ish instructions, one tool call (write).

# Current Task - Generate Index

You are tasked with indexing semantic information from the page content (below) as accurately as possible.

Please review the `$1/page-$0.md` content (below) and think about what semantic information is on this page. If, in future, you were searching through an index (like at the back of a textbook) for specific information, what information would stand out to you on this page? There may be a lot of information, or there may be very little. Please create an index of semantic information for this page in `$1/index-$0.md`. If necessary, you may refer to the previous or subsequent page `$1/page-N.md` (where N is $0 + 1 or - 1) for widow/orphan paragraph context. However, limit the scope of the index to information specifically on the page. For example, if the start of the page belongs to a section from a previous page, the section might be `## Section <number>: Heading (Continued)`.

## Output Format

Provide helpful semantic information as best you can. Page content will vary, so only include content blocks relevant to this actual document page.  

**Page Number**

```markdown
# Page $0
```

**Tags:**

```markdown
## Tags
- <list of tags>
```

**Metadata:**

```markdown
## Metadata
- <list of metadata>
```

**Term Definitions:**

If the page contains terms, define them; if not, omit the definitions content block altogether.

```markdown
## Definitions
- **<term>:** <definition>
```

**Either Flat Headings or Numerical Section Headings:**

Some document pages may use flat headings; if not, consider numerical section headings.

```markdown
## <heading>
    - <list of key content summarized>

## <heading>
    - <list of key content summarized>

## <heading>
    - <list of key content summarized>

etc ...
```

Some document pages may have nested numerical section headings; if not, consider flat headings.

```markdown
## <number> <heading>
    - <list of key content summarized>

### <number>.<number> <heading>
    - <list of key content summarized>

#### <number>.<number>.<number> <heading>
    - <list of key content summarized>

etc ...
```

**Figures:**

Some document pages may contain figures; list them. If not, omit the figures content block altogether.

```markdown
## Figures
- **figure <number>:** <description>
```

**Tables:**

Some document pages may contain tables; list the schema and characterize their content. If not, omit the tables content block altogether.

```markdown
## Tables
- **table:** <description>
    - **<field-name>:** <description>
```

## Success Criteria

- Verify `$1/index-$0.md` contains semantic information derived from `$1/page-$0.md` content (below).
- Read back `$1/index-$0.md` and confirm content adheres to output format guidelines (above).

When this task is successfully completed, report outcome to user.