Atlas: Open-source inference engine for DGX Spark <2minute cold start, 100+ tok/s on Qwen3.6-35B-FP8, 13+ supported models

Maybe it’s because I’m more used to the self-aggrandizing typical in the ‘startup posting on social media’ era and am better at reading between the lines, but you’re acting in really bad faith by calling it lies and it’s becoming annoying. I actually tried it last night and they delivered improvements.[0]

We are working with Nvidia to include this in their official community image.
We are actively in talks with Nvidia and would welcome further dialogue with them here for the community to participate in.

There’s nothing preposterous being said here. So they contacted Nvidia, and probably asked them what it would take to be in their community images, a few back and forth emails. What’s so unbelievable about this? Did you interpret this as them saying ‘Jensen himself invited us to party on his yacht because he thinks our software is so brilliant?’

We are in communication with the Qwen team

Same thing. So they probably contacted the Qwen team for technical advice on what they could do to improve performance for Qwen models, Qwen gave them some helpful answers (or not). This is how I read this.

The wording used could be less glamorous for such a small thing, but you make it sound like they’re talking about their dad working at Nintendo and having a long-distance girlfriend up in Canada. None of this stuff is unbelievable.

No, just usual human error. Proof vibe code is not in place.

The project is open-source and obviously coded with AI assistance, just based on the comments in the one source file I bothered reading. Every professional developer I know uses AI for agentic coding, so there’s nothing wrong with using AI. If we want to be charitable, his definition of vibe-coding here is the cynical ‘written entirely by AI without the human understanding anything’ one which I’ve seen posted. But if he’s claiming they don’t use AI at all for coding, then I’d bet on that being a lie.

[0] According to my own informal test last night, just a basic token-per-second speed test for a single request, atlas + Qwen 27B NVFP4 has +18% faster inference speed compared to vllm-spark-docker + Qwen 27B AWQ:

  • +18% faster token-per-second
  • Atlas starts serving the model like 10x faster (didn’t time it)
  • Minor point: Atlas docker image is 2.8GB vs spark-vllm-docker’s 18.5GB

To me they delivered something useful and I wish them well. They just need to clean up their doc with ready-to-use recipes.

You don’t have to accept a good gift with grace, I don’t know what things were like in those old threads you posted in, but as of yesterday, in May 2026, Atlas is an open-source project and a good community contribution. The numbers don’t lie. End-users like me only care about the numbers. Please tone down the hostility and ad hominems, stick to posting real criticisms OF THE TOOL if you have any left.

The criticisms of the tool itself are already well expressed in this thread.

There’s no ad hominem attack in play, just highlighting the deception used in the past by the author regarding the apparent validity of projects.

The example you responded to never amounted to anything and Nvidia never ‘included his breakthrough in any community image.’

If you want to run community code on your machine, that’s fine. I just want it understood that this entire ‘project’ like the few that have come before by the same author should be heavily scrutinized.

‘Being accustomed to exaggerated claims by startups’ is not a valid argument and a scary precedent to set.

Sadly, the “fake it till you make it” strategy does work. I’m against it, hate it, but I have to admit that in my 40-year career, I have known many who would argue that it’s the only way to make it.

On Sparkrun, for Qwen 3.6’s 35B MoE, we shatter records. Official Nvidia and AMD accounts have retweeted several of our posts, as well as donated physical hardware to help bring local inferencing to everybody. The hostility in this forum comes from those who have nothing to show, and waste their human potential on not improving, but rather, bringing down another’s efforts in order to reduce cognitive dissonance of the fact that we are all inherently unbounded, yet, bounded only by the human mind. I’ve done multiple open-source releases, and this defense mechanism projected by others appears to be a constant across releases; I’ve had actual professors reach out and volunteer information to chip in or send their dissertations from years ago to support why their ego is uncomfortable with the fact that they’ve voluntarily let themselves become obsolete. We must absolutely embrace emerging technology, and let those who decide to make an axiom that AI=bad code fall apart by empirical evidence. The big companies with credibility (Nvidia, AMD, Hugging Face, et al.) have the big guns, and they SEE what has been brought forth. As such, I call upon the moderators of this community to consider that the common troll be disciplined, up to, but not including, banning (if repeat offenders) from this forum. We just want local inference working fast and coherently; those who stand in the way of this mission stand in the way of all those who want fast local inference. Those individuals, whether jealous or failing to grasp the emerging reality, should be banned or rehabilitated (respectively) to be valuable members of this community. Let’s just make stuff work; plain and simple.

Hello folks,

We all have different backgrounds. Different upbringing. We live in 100 different countries. A lot of us speak different languages. We were raised differently and had different experiences throughout our lives.
Each one of us has a different perspective.
Each one of us is triggered or feels empowered by one word said in a certain way. Or felt ignored/neglected.
But this is a technical forum. About a specific hardware and we all share the struggles, successes, our wishes and mostly try to help each other. Celebrate others’ achievements and contribute back as gratitude or simply the joy of giving back or be seen.
This is more about what we can do together, elevate others, give orientation and provide guidance, so people save time. Sometimes the words will sound harsh, or salty, doubtful, sarcastic. And that’s ok, we live in an imperfect world and we’re all imperfect people: We are all different. I really believe most people don’t want others to fail here. They might doubt certain claims, they might question certain technical decisions. But assume they mean no harm.
Let’s not make this another twitter/X
It’s about the things we have in common and what we can do to uplift others. If you’re harsh, expect some pushback sometimes, if you make extraordinary claims, expect to provide some extraordinary evidence.
Let’s keep this space civil. Treat others like you would like to be treated. Once the discussion derails into personal attacks nobody wins, everyone loses. Including folks who are here just to learn and feel afraid to even make a comment and maybe just leave a thank you for your work! Please try to de-escalate, you’re all capable of it. I trust you guys will be able to do it.

Have a good rest of your weekend

You’re the one making the claims, and every time I’ve questioned their validity you sidestep.

If you want to improve the software ecosystem that’s fine, don’t exaggerate, deceive, or in certain cases just flat out lie.

You can call for whatever you want and dismiss it, but you can’t keep making grandiose and misguided statements and ignore criticism.

Hey Trystan, why so negative? Please show some respect for the work and effort. Thomas and Azziz did amazing work creating Atlas. I have been using it since the beginning and I am very happy with the progress and the performance. Of course not everything is perfect, but this is normal at such an early stage. I don’t know if you looked into their code repository, but everything is documented very well there, and in the philosophy text they clearly state that they use AI coding, which absolutely makes sense. Well-founded criticism is good, but why not state it a little more politely and positively? :)

Any idea when you will add Qwen3.6 27B support?

I just wanted to say thanks again. Great project, so much progress has been made recently! And it’s currently one of the exciting inference initiatives.

Colleagues, this is an ideal opportunity for you to make a good impression in addition to your other achievements. Focus on the deepseek-v4-flash model and make the most of it. It’s no secret that this model is one of the most optimal for two devices!

Take a look at :latest, it’s working well with MTP! And thank you for the kind message @stefan132 ! I am much more active on discord, feel free to reach out there. @voktolom we are working on this too :)

mimo 2.5 ? :)

Do you have a working Qwen3.6-35B-A3B-FP8 recipe with MTP enabled that I can test with Sparkrun?

I think this should work:

sparkrun run @atlas/qwen3.6-35b-a3b-fp8-mtp-atlas

Thanks, I’m testing now but it seems to fall on its face with more than 1 concurrent request. Single thread was fast and worked out-of-the-box:

[Q&A] 327 tokens in 4.32s = 75.6 tok/s (prompt: 23)
[Code] 601 tokens in 6.83s = 87.9 tok/s (prompt: 30)
[JSON] 928 tokens in 10.18s = 91.1 tok/s (prompt: 48)
[Math] 69 tokens in 1.17s = 58.9 tok/s (prompt: 29)
[LongCode] 2113 tokens in 22.70s = 93.0 tok/s (prompt: 37)
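
As a sanity check on the figures above, throughput here is simply generated tokens divided by wall-clock seconds. A minimal Python sketch that recomputes it from the reported token counts and timings follows; the `runs` dictionary and the `tokens_per_second` helper are names made up for this illustration and are not part of any benchmark tool.

```python
# Recompute tok/s from the token counts and timings reported above.
runs = {
    "Q&A": (327, 4.32),
    "Code": (601, 6.83),
    "JSON": (928, 10.18),
    "Math": (69, 1.17),
    "LongCode": (2113, 22.70),
}


def tokens_per_second(tokens: int, seconds: float) -> float:
    """Generation throughput: tokens produced per second of wall-clock time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


rates = {name: tokens_per_second(t, s) for name, (t, s) in runs.items()}
for name, rate in rates.items():
    print(f"[{name}] {rate:.1f} tok/s")
```

The recomputed values agree with the reported ones to within about 0.1 tok/s; small differences are expected because the printed timings are rounded to two decimals.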

So I ran the tool-eval-bench --perf-only bench and I got this:

┃ Test ┃ c ┃ pp t/s ┃ tg t/s ┃ TTFT (ms) ┃ Total (ms) ┃ Tokens ┃
╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0 │ c1 │ 1,119,864 │ 117.9 │ 1,920 │ 1,087 │ 2048+128 │
│ pp2048 tg128 @ d0 │ c2 │ 1,068 │ 102.4 │ 3,819 │ 2,497 │ 2048+128 │
│ pp2048 tg128 @ d0 │ c4 │ 1,068 │ 104.3 │ 7,636 │ 4,701 │ 2048+128 │
│ pp2048 tg128 @ d4096 │ c1 │ 872,316 │ 108.7 │ 5,303 │ 1,185 │ 2048+128 │
│ pp2048 tg128 @ d4096 │ c2 │ 1,156 │ 98.1 │ 10,619 │ 2,541 │ 2048+128 │
│ pp2048 tg128 @ d4096 │ c4 │ 0 │ 0.0 │ 0 │ 0 │ 2048+128 │
│ pp2048 tg128 @ d8192 │ c1 │ 0 │ 0.0 │ 0 │ 0 │ 2048+128 │
│ pp2048 tg128 @ d8192 │ c2 │ 0 │ 0.0 │ 0 │ 0 │ 2048+128 │
│ pp2048 tg128 @ d8192 │ c4 │ 0 │ 0.0 │ 0 │ 0 │ 2048+128

It’s the 2nd atlas model I tried and both had the same behavior of failing or slowing WAY down on long-context or high-concurrency. At 2x concurrency I already saw ~35-40% reduction of throughput in my first simple bench.sh (first one in this thread).

Hopefully this can be addressed and make this inference engine much more usable!
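
A throughput drop under concurrency can refer either to per-request speed or to aggregate tokens per second across all requests, and the two tell different stories. The Python sketch below shows one way to measure the aggregate figure and compare it against ideal linear scaling. It is an illustration only: `send_request` is a hypothetical callable you would replace with a real call to your server, and `fake_request` is a stand-in that just sleeps, so the numbers it prints say nothing about Atlas.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def measure_throughput(send_request: Callable[[], int], concurrency: int) -> float:
    """Run `concurrency` requests in parallel; return aggregate tokens per second.

    `send_request` is a hypothetical callable that performs one completion
    request and returns the number of generated tokens.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(send_request) for _ in range(concurrency)]
        total_tokens = sum(f.result() for f in futures)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def scaling_efficiency(single: float, concurrent: float, concurrency: int) -> float:
    """Aggregate throughput relative to perfect linear scaling (1.0 = ideal)."""
    return concurrent / (single * concurrency)


def fake_request() -> int:
    # Stand-in for a real HTTP call: pretend 50 tokens took 0.05 s.
    time.sleep(0.05)
    return 50


if __name__ == "__main__":
    base = measure_throughput(fake_request, 1)
    for c in (2, 4):
        agg = measure_throughput(fake_request, c)
        eff = scaling_efficiency(base, agg, c)
        print(f"c{c}: {agg:.0f} tok/s aggregate, efficiency {eff:.2f}")
```

An efficiency near 1.0 means the engine batches concurrent requests well; a value near 1/c means requests are effectively being served one after another.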

A few things I noticed:

  • The default commands/recipes on the website do not work. Probably the sparkrun requirements changed: --hosts is now required. https://atlasinference.io/#try

  • Tool calls fail a lot on the Pi Harness with the docker run commands for Qwen/Qwen3.6-35B-A3B-FP8 and your docker image from the Github page.

  • https://sparkrun.dev/runtimes/atlas/ CLI argument mismatches

  • [docker run] ~40-50 tokens/s, but I needed to abort using it. Too many loops and incorrect path outputs. Sometimes spaces were missing, sometimes additional fragments of words were injected into paths.

  • sparkrun run @atlas/qwen3.5-35b-a3b-nvfp4 causes Error: Preflight failed: inference buffers alone need 9.35 GB but only 6.92 GB is free on the GPU (before weights load). SSM pool + GDN chunked prefill scales with --max-seq-len=8192 × --max-batch-size=8. Try --max-seq-len 2048 (or lower --max-batch-size / --num-drafts). which is the default recipe (should not happen)

  • [spark run] Same problem with paths and tool calls where some characters are added/removed or mixed up. Unusable. Instead of /path/to/frontend it ends up as /path/t/frontend or /path/to/froontend
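
The preflight error quoted in the list above says the inference buffers scale with --max-seq-len × --max-batch-size. As a rough back-of-the-envelope sketch, and assuming the scaling is strictly linear with no fixed overhead (an assumption; the real allocator may behave differently), the reported 9.35 GB requirement can be rescaled to other settings and compared with the 6.92 GB that was free. The helper name `estimated_buffer_gb` is hypothetical.

```python
def estimated_buffer_gb(ref_gb: float, ref_seq_len: int, ref_batch: int,
                        seq_len: int, batch: int) -> float:
    """Scale a reference buffer size linearly with seq_len * batch (an assumption)."""
    return ref_gb * (seq_len * batch) / (ref_seq_len * ref_batch)


# Figures taken from the preflight error message quoted above.
NEEDED_GB, FREE_GB = 9.35, 6.92
REF_SEQ_LEN, REF_BATCH = 8192, 8

candidates = [(8192, 8), (8192, 6), (8192, 5), (8192, 4), (2048, 8)]
for seq_len, batch in candidates:
    est = estimated_buffer_gb(NEEDED_GB, REF_SEQ_LEN, REF_BATCH, seq_len, batch)
    verdict = "fits" if est <= FREE_GB else "too large"
    print(f"max-seq-len={seq_len} max-batch-size={batch}: ~{est:.2f} GB ({verdict})")
```

Under that linear assumption, either lowering the batch size or lowering the sequence length brings the estimate under the free memory, in line with the error message’s own suggestion.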

I would like to give a brief update. First, I’d like to thank the community who have provided invaluable feedback across all our socials. I’d also like to point out that, if it were not for this community and Nvidia’s DGX program, we would not be here; so, we thank Nvidia for giving us this opportunity.

Second, I’m excited to announce that we have been one of the few selected, out of thousands of engineers who applied, into Alibaba’s Qwen Ambassador program because of the work put into the Atlas Inference Engine. Qwen is one of the best state-of-the-art AI models that also happens to be open-source and great for local inference. We do realize this means we’re now repping 🇨🇳 in the AI race. Yet, as an American engineer, I hope to use this new position, in part, to encourage Sino-American relationships through bridge-building, cooperation, and mutually constructive competition 🇺🇸🇨🇳. As mentioned months ago, the Chinese involvement was budding into something greater.

Third, we would like to thank AMD for partnering and providing equipment to begin expanding support to their devices. We both share the mission of fast and high-quality local inference.

Fourth, we’d like to thank Nvidia (again) for including us in weekly meetups for tool-calling benchmarking. We also believe that having a pipeline to establish tool-calling quality is highly important for inference engines like Atlas.

Fifth, we’d like to thank Hugging Face and AMD for providing us with free inference credits.

Lastly, I’d like to say we are essentially working around the clock on improving our stack. We also have multiple, highly talented engineers submitting PRs (e.g., the programmer who reverse-engineered Google’s TurboQuant submitted a PR that expands and improves the existing KV cache dtype support). The project is getting busy, and we appreciate any and all support the community can provide. It’s beautiful to see that what started off as a late-night experiment to break away from the Python ecosystem turned into something much greater, and we have you all to thank for that.

I think that what you’re doing here (in terms of performance) was promising. I couldn’t move on with my testing since I just can’t hit a 2x concurrency or >128K context without a huge performance impact.

If those two are worked around and the models scale to better throughput with parallel processing, you might have a gem here (in my humble opinion).

Thank you @nvidiaspark and @azampatti for the feedback. Have addressed the questions below and please keep us in the loop on these fixes in our Discord! This community feedback loop helps propel Atlas to where it needs to be :)

Single-thread numbers like yours (75–93 tok/s) are roughly where they should be at FP8; the concurrency path is where the work remains. We appreciate the detail in the report and are working on the fix.

  • The qwen3.6-35b-a3b-fp8-mtp recipe defaulted to max_model_len: 131072, which over-provisions the KV pool on a single GB10. We’ve capped it to 16384 (covers depth-8192 with headroom) and the recipe also dropped the -atlas suffix, so the command is now: sparkrun run @atlas/qwen3.6-35b-a3b-fp8-mtp --hosts localhost
  • Website default commands missing --hosts are fixed. Atlas Inference Engine and the per-model commands now include --hosts localhost (single node) or a two-host placeholder for EP=2 recipes. (atlas#111, merged.)
  • @atlas/qwen3.5-35b-a3b-nvfp4 preflight OOM on the default recipe is also fixed. The recipe had no batch cap so it fell back to --max-batch-size 8 and the inference buffers didn’t fit. We capped it (keeps full 8192 context); it now serves out of the box. (atlas-recipes#7, merged.)
  • sparkrun.dev/runtimes/atlas CLI mismatch: the bind flag was documented as --host, but Atlas’s canonical flag is --bind (--host works as an alias). The doc fix is up.
  • Reasoning/thinking budget being ignored: if you were passing max_thinking_tokens, that field wasn’t recognized (the recognized field is thinking_token_budget); we’ve since added it as an alias (atlas#112).
  • The ~35–40% throughput drop at 2× concurrency is a separate scheduler characteristic that we’re profiling and tracking.
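
To see why capping max_model_len matters so much, it helps to look at how a conventional attention KV cache grows with context length. The sketch below uses the standard formula (keys plus values, per layer, per KV head, per token). The model dimensions in `HYPOTHETICAL` are placeholder values chosen for round numbers, not the real Qwen3.6 configuration, and hybrid models with linear-attention or SSM layers will need less than this formula suggests.

```python
def kv_cache_bytes(tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_element: int) -> int:
    """Size of a standard attention KV cache for `tokens` tokens of context."""
    # 2 = one key tensor plus one value tensor per layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element * tokens


# Placeholder dimensions for illustration only (NOT the real model config).
HYPOTHETICAL = dict(num_layers=48, num_kv_heads=4, head_dim=128, bytes_per_element=2)

for max_model_len in (131072, 16384):
    gib = kv_cache_bytes(max_model_len, **HYPOTHETICAL) / 2**30
    print(f"max_model_len={max_model_len}: {gib:.1f} GiB per sequence")
```

Whatever the real dimensions are, the cache grows linearly with context length, so going from 131072 to 16384 shrinks the per-sequence reservation by a factor of 8.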

I really like this project and looking really forward to being able to use this with my daily work as a programmer. Having people with a lot of experience in the field handling configurations and best practice is worth a lot and something that people like me really miss in the current ecosystem.

There are 10 different template files, and 1000 different ways to configure and serve the model. And once you have something that works, everything changes with the next vllm or harness update. What beginners need in this space is reproducible outcomes.

model version + quant -> vllm/atlas version -> model configuration -> template version -> harness = same outcome
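
One way to make that chain concrete is to record every link in it and derive a short fingerprint, so two people can tell at a glance whether they are running the same stack. The Python sketch below is only an illustration: `StackRecord`, its field names, and the angle-bracket placeholder values are all hypothetical and not part of sparkrun, Atlas, or vLLM.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StackRecord:
    """Everything that influences output quality, captured in one place."""
    model: str           # model name and revision
    quant: str           # quantization format
    engine: str          # inference engine plus version or image tag
    model_config: dict   # serving flags that change behavior
    chat_template: str   # template file name or content hash
    harness: str         # coding harness name and version

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys) so equal records always hash the same.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


record = StackRecord(
    model="Qwen/Qwen3.6-35B-A3B-FP8",
    quant="fp8",
    engine="atlas:<image tag>",            # placeholder
    model_config={"max_model_len": 16384},
    chat_template="<template hash>",       # placeholder
    harness="<harness name and version>",  # placeholder
)
print(record.fingerprint())
```

Posting the fingerprint alongside a benchmark or bug report lets the next person confirm they are reproducing the same combination before comparing results.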

Quants, KV Cache, Chat Templates and Harness combination make such a big difference. When I started out trying local llm, I always thought it’s a model problem. But in almost all cases it is a combination of harness and in the case of Qwen a chat template problem.

The dgx spark platform (mac would probably have the same outcome) gave me the opportunity to learn and understand that, because there is plenty of RAM to work with, so you can see that some issues are not model and especially not model quantization related.

Reproducible outcomes are what this community, together, should focus on. This is what I learned in the last two months at least. There is no point in x tokens/s if the next guy can’t get the same output quality and daily productivity with their harness.

At my current local llm knowledge state, I would prefer 5 tokens/s that don’t have tool call errors or loops, than 100 tokens/s that still don’t help me with my daily productivity. Pi Harness pushed an update recently that gets long file writes terminated, if the llm output is not fast enough. Something new that I need to learn how to fix, which won’t help with productivity.