Introducing the Atlas Inference Server and Engine

@AzeezIsh and I built Atlas, a pure Rust LLM inference engine with 20+ custom kernels compiled directly for SM121, and it’s by far the fastest way to run Qwen3-Next-80B on a DGX Spark.

82 tokens per second with NO speculative decoding, on a single GB10 GPU: 2.8x faster than NVIDIA’s stock vLLM image. Atlas requires no Python, no PyTorch, no complex “recipes”, and no framework dependencies: everything is in one place. We use a new philosophy and methodology we call Kernel Hypercompilation, complemented by Rust’s abstractions; research papers are to follow. Source build to first-token inference takes under 2 minutes, whereas vLLM takes 40+, enabling rapid iteration and distribution. The full suite wins 32/32 benchmarks against PyTorch baselines; examples include 18x faster RoPE, 8x faster Gated Delta Rule, and 3.9x faster MoE W4A16 across 256 experts.
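For readers unfamiliar with the operation behind that 18x RoPE number: rotary position embedding rotates each adjacent pair of query/key elements by a position-dependent angle. This is a minimal scalar sketch of the standard RoPE math for illustration only, not the Atlas kernel (which is a fused GPU implementation); `theta_base = 10000.0` is the conventional default.

```rust
/// Apply rotary position embedding in place to one head's vector `x`.
/// Pairs (x[2i], x[2i+1]) are rotated by angle pos / theta_base^(2i/d).
fn rope(x: &mut [f32], pos: usize, theta_base: f32) {
    let d = x.len();
    assert!(d % 2 == 0, "head dim must be even");
    for i in 0..d / 2 {
        let freq = 1.0 / theta_base.powf(2.0 * i as f32 / d as f32);
        let angle = pos as f32 * freq;
        let (sin, cos) = angle.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}

fn main() {
    let mut q = [1.0f32, 0.0, 1.0, 0.0];
    rope(&mut q, 0, 10000.0); // position 0 is the identity rotation
    println!("{:?}", q);
}
```

A production kernel fuses this rotation into the attention projection and vectorizes across heads and positions, which is where the speedup over an eager PyTorch baseline comes from.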

Watch the 3-minute demo below, which includes live interaction as well.

We built Atlas because we were tired of debugging dependency chains, version mismatches, and hardware-compatibility shims every time we wanted to run inference with vLLM. Everything is native: no external frameworks, no workarounds. Atlas owns the entire stack, from kernels to the HTTP server layer. And this is just the beginning: more model and hardware support is coming soon; Qwen3-Next was just the first PoC.

Atlas will be coming soon to the community, for the community! We’d love feedback from DGX Spark enthusiasts on what they’d like to see in a community inference engine. We are actively in talks with NVIDIA and would welcome further dialogue with them here, where the community can participate. We seek an effective and measured release, but we are working as fast as we can. Thank you!

Excited to continue devving this together, @tbraun96 :) Qwen/Qwen3.5-122B-A10B on Hugging Face looks like a great stepping stone!

If something is being built from the ground up and you’re actually taking community input, one thing I would offer is that concurrency is just as important as single-user workflows.

Right now there is a tradeoff for a lot of models: you need llama.cpp for 1x (single-user) prompting and vLLM for massively concurrent prompting, due to vLLM’s batching capabilities.

My guess would be that this is hyper-tuned for single use, which is only part of the story.

Very cool! Will it handle multiple Sparks?

We will test.

We like them large and fast, text and vision models alike.

Thanks, and good point! Among the multiple concurrency techniques we use, one is SLAI. We see stable speeds across very highly concurrent workloads.

Yes. We want to make use of expert parallelism and new research-level techniques to maximize multi-GPU setups.

We will soon be adding:
Qwen/Qwen3.5-122B-A10B (NVFP4)
Qwen3.5-35B-A3B (NVFP4)

Good to hear. What would really rev my engine is NVFP4 KV cache quantization.

You know, as long as we’re swinging for the fences.

We support three KV cache precision modes via CLI runtime flags: nvfp4, fp8, and fp16.
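Atlas’s CLI isn’t public yet, so the flag value parsing below is purely illustrative: a sketch of how the three stated precision modes could be modeled as a Rust enum, plus the per-element storage cost each mode implies (nvfp4 packs two 4-bit values per byte, ignoring per-block scale overhead). The type and function names are my assumptions, not Atlas’s API.

```rust
use std::str::FromStr;

/// KV cache storage precision, mirroring the three modes mentioned
/// (nvfp4 / fp8 / fp16). Names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KvCacheDtype {
    Nvfp4,
    Fp8,
    Fp16,
}

impl FromStr for KvCacheDtype {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "nvfp4" => Ok(Self::Nvfp4),
            "fp8" => Ok(Self::Fp8),
            "fp16" => Ok(Self::Fp16),
            other => Err(format!("unknown kv-cache dtype: {other}")),
        }
    }
}

/// Bytes per cached element (nvfp4 is 4-bit, so half a byte,
/// not counting per-block scale factors).
fn bytes_per_element(d: KvCacheDtype) -> f32 {
    match d {
        KvCacheDtype::Nvfp4 => 0.5,
        KvCacheDtype::Fp8 => 1.0,
        KvCacheDtype::Fp16 => 2.0,
    }
}

fn main() {
    let d: KvCacheDtype = "nvfp4".parse().unwrap();
    println!("{:?} uses {} bytes/element", d, bytes_per_element(d));
}
```

The practical upshot is that going from fp16 to nvfp4 cuts KV cache memory roughly 4x, which is what makes long contexts viable on a 128 GB unified-memory box.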

Would you consider optimizing for int4+autoround? It seems to produce lower perplexity scores, whilst being smaller and faster (at least at the moment, when compared to AWQ / NVFP4).

So you have working nvfp4 inference, nvfp4 kv cache quant, and massive concurrent batching efficiency.

Seems, uh, too good to be true, captain. How many seed rounds have you been through?

I’ve built harder things before the dawn of AI, like the Citadel Protocol (Rust). My patent in cryptography was named Patent of the Month by the largest R&D advisory firm in the US. AI is a multiplier.

Neat.

We’re just the enablers! Next move is through the community and NVIDIA :)

Awesome stuff! Have you guys seen any performance increases with speculative decoding enabled?

Yes! So far only with the Qwen3.5-32B model (NVFP4), where we see 130 tok/s.

If this ends up fully open sourced, great. More options for Spark owners are always welcome.

But right now there’s no source code, no reproducible benchmarks, and terminology (“Kernel Hypercompilation”) that doesn’t appear anywhere in the literature. On a developer forum, this reads like a press release.

If this stays closed, “Atlas owns the entire stack” becomes a real problem. It means every model architecture, quantization format, and bug fix flows through two maintainers. The frameworks being dismissed here are the reason Spark owners can actually run diverse models today. Replacing a composable ecosystem with a closed monolith is a tradeoff in my book.

Open the code, let people benchmark it, and this conversation changes entirely.

We seek a measured release, not a fast, unregulated one. China is involved now.

Ominous.

Working with Chinese companies will be complicated. You are either in the Chinese camp or the US camp; better choose wisely.