Introducing the Atlas Inference Server and Engine

@AzeezIsh and I built Atlas, a pure Rust LLM inference engine with 20+ custom kernels compiled directly for SM121, and it’s by far the fastest way to run Qwen3-Next-80B on a DGX Spark.

82 tokens per second with NO speculative decoding, on a single GB10 GPU: 2.8x faster than NVIDIA’s stock vLLM image. Atlas requires no Python, no PyTorch, no complex “recipes”, and no framework dependencies: everything is in one place. We use a new philosophy and methodology we call Kernel Hypercompilation, complemented by Rust’s abstractions; research papers are to follow. Source build to first-token inference takes under 2 minutes, whereas vLLM takes 40+, enabling rapid iteration and distribution. The full suite wins 32/32 benchmarks against PyTorch baselines; examples include 18x faster RoPE, 8x faster Gated Delta Rule, and 3.9x faster MoE W4A16 across 256 experts.
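For readers unfamiliar with the operation behind that 18x RoPE number: rotary position embedding rotates each adjacent pair of query/key elements by a position-dependent angle. This is a minimal scalar sketch of the standard RoPE math for illustration only, not the Atlas kernel (which is a fused GPU implementation); `theta_base = 10000.0` is the conventional default.

```rust
/// Apply rotary position embedding in place to one head's vector `x`.
/// Pairs (x[2i], x[2i+1]) are rotated by angle pos / theta_base^(2i/d).
fn rope(x: &mut [f32], pos: usize, theta_base: f32) {
    let d = x.len();
    assert!(d % 2 == 0, "head dim must be even");
    for i in 0..d / 2 {
        let freq = 1.0 / theta_base.powf(2.0 * i as f32 / d as f32);
        let angle = pos as f32 * freq;
        let (sin, cos) = angle.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}

fn main() {
    let mut q = [1.0f32, 0.0, 1.0, 0.0];
    rope(&mut q, 0, 10000.0); // position 0 is the identity rotation
    println!("{:?}", q);
}
```

A production kernel fuses this rotation into the attention projection and vectorizes across heads and positions, which is where the speedup over an eager PyTorch baseline comes from.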

Watch the 3-minute demo below, which includes live interaction as well.

We built Atlas because we were tired of debugging dependency chains, version mismatches, and hardware-compatibility shims every time we wanted to run inference with vLLM. Everything is native: no external frameworks, no workarounds. Atlas owns the entire stack, from kernels to the HTTP server layer. And this is just the beginning: more model and hardware support is coming soon; Qwen3-Next was just the first PoC.

Atlas will be coming soon to the community, for the community! We’d love feedback from DGX Spark enthusiasts on what they’d like to see in a community inference engine. We are actively in talks with NVIDIA and would welcome further dialogue with them here, where the community can participate. We seek an effective and measured release, but we are working as fast as we can. Thank you!

Excited to continue devving this together, @tbraun96 :) Qwen/Qwen3.5-122B-A10B on Hugging Face looks like a great stepping stone!

If something is being built from the ground up and you’re actually taking community input, one thing I would offer is that concurrency is just as important as single-user workflows.

Right now there is a tradeoff for a lot of models: you need llama.cpp for 1x (single-user) prompting and vLLM for massively concurrent prompting, due to vLLM’s batching capabilities.

My guess would be that this is hyper-tuned for single use, which is only part of the story.

Very cool! Will it handle multiple Sparks?

We will test.

We like them large and fast, text and vision models alike.

Thanks, and good point! Among the multiple concurrency techniques we use, one is SLAI. We see stable speeds across very highly concurrent workloads.

Yes. We want to make use of expert parallelism and new research-level techniques to maximize multi-GPU setups.

We will soon be adding:
Qwen/Qwen3.5-122B-A10B (NVFP4)
Qwen3.5-35B-A3B (NVFP4)

Good to hear. What would really rev my engine is NVFP4 KV cache quantization.

You know, as long as we’re swinging for the fences.

We support three KV cache precision modes via CLI runtime flags: nvfp4, fp8, and fp16.
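Atlas’s CLI isn’t public yet, so the flag value parsing below is purely illustrative: a sketch of how the three stated precision modes could be modeled as a Rust enum, plus the per-element storage cost each mode implies (nvfp4 packs two 4-bit values per byte, ignoring per-block scale overhead). The type and function names are my assumptions, not Atlas’s API.

```rust
use std::str::FromStr;

/// KV cache storage precision, mirroring the three modes mentioned
/// (nvfp4 / fp8 / fp16). Names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KvCacheDtype {
    Nvfp4,
    Fp8,
    Fp16,
}

impl FromStr for KvCacheDtype {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "nvfp4" => Ok(Self::Nvfp4),
            "fp8" => Ok(Self::Fp8),
            "fp16" => Ok(Self::Fp16),
            other => Err(format!("unknown kv-cache dtype: {other}")),
        }
    }
}

/// Bytes per cached element (nvfp4 is 4-bit, so half a byte,
/// not counting per-block scale factors).
fn bytes_per_element(d: KvCacheDtype) -> f32 {
    match d {
        KvCacheDtype::Nvfp4 => 0.5,
        KvCacheDtype::Fp8 => 1.0,
        KvCacheDtype::Fp16 => 2.0,
    }
}

fn main() {
    let d: KvCacheDtype = "nvfp4".parse().unwrap();
    println!("{:?} uses {} bytes/element", d, bytes_per_element(d));
}
```

The practical upshot is that going from fp16 to nvfp4 cuts KV cache memory roughly 4x, which is what makes long contexts viable on a 128 GB unified-memory box.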

Would you consider optimizing for int4+autoround? It seems to produce lower perplexity scores, whilst being smaller and faster (at least at the moment, when compared to AWQ / NVFP4).

So you have working nvfp4 inference, nvfp4 kv cache quant, and massive concurrent batching efficiency.

Seems, uh, too good to be true, captain. How many seed rounds have you been through?

I’ve built harder things before the dawn of AI, like the Citadel Protocol (Rust). My patent in cryptography was named Patent of the Month by the largest R&D advisory firm in the US. AI is a multiplier.

Neat.

We’re just the enablers! Next move is through the community and NVIDIA :)

Awesome stuff! Have you guys seen any performance increases with speculative decoding enabled?

Yes! So far only with the Qwen3.5-32B model (NVFP4), where we see 130 tok/s.

If this ends up fully open sourced, great. More options for Spark owners are always welcome.

But right now there’s no source code, no reproducible benchmarks, and terminology (“Kernel Hypercompilation”) that doesn’t appear anywhere in the literature. On a developer forum, this reads like a press release.

If this stays closed, “Atlas owns the entire stack” becomes a real problem. It means every model architecture, quantization format, and bug fix flows through two maintainers. The frameworks being dismissed here are the reason Spark owners can actually run diverse models today. Replacing a composable ecosystem with a closed monolith is a tradeoff in my book.

Open the code, let people benchmark it, and this conversation changes entirely.

We seek a measured release, not a fast, unregulated one. China is involved now.

Ominous.

Working with Chinese companies will be complicated. You are either in the Chinese camp or the US camp; better choose wisely.