@AzeezIsh and I built Atlas, a pure Rust LLM inference engine with 20+ custom kernels compiled directly for SM121, and it’s by far the fastest way to run Qwen3-Next-80B on a DGX Spark.
82 tokens per second with NO speculative decoding, on a single GB10 GPU: 2.8x faster than NVIDIA’s stock vLLM image. Atlas requires no Python, no PyTorch, no complex “recipes”, and no framework dependencies: everything lives in one place. We use a new philosophy and methodology we call Kernel Hypercompilation, complemented by Rust’s abstractions; research papers are to follow. Source build to first-token inference takes under 2 minutes, versus 40+ for vLLM, enabling rapid iteration and distribution. The full suite wins 32/32 benchmarks against PyTorch baselines; examples include 18x faster RoPE, 8x faster Gated Delta Rule, and 3.9x faster MoE W4A16 across 256 experts.
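For readers unfamiliar with the kernels being benchmarked: the 18x RoPE claim refers to rotary position embedding, which can be stated as a small CPU reference. This is a generic textbook sketch, not Atlas code; the function name and `base` parameter are my own choices.

```rust
// Reference rotary position embedding (RoPE) on one query/key vector:
// each (even, odd) pair of dimensions is rotated by an angle that
// depends on the token position and the pair's frequency.
fn rope(x: &mut [f32], pos: usize, base: f32) {
    let d = x.len();
    for i in (0..d).step_by(2) {
        // Frequency falls off geometrically across dimension pairs.
        let theta = (pos as f32) * base.powf(-(i as f32) / d as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (x[i], x[i + 1]);
        x[i] = a * cos - b * sin;
        x[i + 1] = a * sin + b * cos;
    }
}

fn main() {
    let mut x = vec![1.0f32, 0.0, 0.5, -0.5];
    let norm_before: f32 = x.iter().map(|v| v * v).sum::<f32>().sqrt();
    rope(&mut x, 3, 10000.0);
    let norm_after: f32 = x.iter().map(|v| v * v).sum::<f32>().sqrt();
    // Rotations preserve the norm of each pair, so the whole vector's norm.
    assert!((norm_before - norm_after).abs() < 1e-5);
    // Position 0 applies the identity rotation.
    let mut y = vec![0.3f32, 0.7];
    rope(&mut y, 0, 10000.0);
    assert!((y[0] - 0.3).abs() < 1e-6 && (y[1] - 0.7).abs() < 1e-6);
    println!("rope reference ok");
}
```

A fused GPU kernel wins here mainly by precomputing the sin/cos tables and avoiding the per-pair transcendental calls this naive loop performs.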
We built Atlas because we were tired of debugging dependency chains, version mismatches, and hardware compatibility shims every time we wanted to run inference with vLLM. Everything is native: no external frameworks, no workarounds. Atlas owns the entire stack, from kernels to the HTTP server layer. And this is just the beginning: more model and hardware support is coming soon; Qwen3-Next was just the first proof of concept.
Atlas will be coming soon to the community, for the community! We’d love feedback from DGX Spark enthusiasts on what they’d like to see in a community inference engine. We are actively in talks with NVIDIA and would welcome further dialogue here so the community can participate. We want to release this in an effective and measured way, and we are working as fast as we can. Thank you!
If something is being built from the ground up and you’re actually taking community input, one thing I’d offer is that concurrency is just as important as single-user workflows.
Right now, for a lot of models, there’s a tradeoff: llama.cpp for single-prompt use, vLLM for massively concurrent prompting thanks to its batching capabilities.
My guess is this is hyper-tuned for single-user workloads, which is only part of the story.
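To make the batching point concrete, here is a hypothetical Rust sketch of the continuous-batching scheduling that gives vLLM its concurrent-throughput edge: new requests backfill slots as active ones finish, rather than waiting for a whole batch to drain. All names (`Scheduler`, `Request`, `step`) are illustrative, not from Atlas or vLLM.

```rust
#[derive(Debug)]
struct Request {
    id: usize,
    tokens_left: usize, // tokens still to generate
}

struct Scheduler {
    queue: Vec<Request>,  // waiting requests
    active: Vec<Request>, // requests currently in the batch
    max_batch: usize,
}

impl Scheduler {
    fn new(max_batch: usize) -> Self {
        Self { queue: Vec::new(), active: Vec::new(), max_batch }
    }

    fn submit(&mut self, req: Request) {
        self.queue.push(req);
    }

    // One decode step: top up the batch from the queue, generate one
    // token per active request, retire finished requests.
    // Returns the ids that completed this step.
    fn step(&mut self) -> Vec<usize> {
        while self.active.len() < self.max_batch && !self.queue.is_empty() {
            self.active.push(self.queue.remove(0));
        }
        for req in &mut self.active {
            req.tokens_left -= 1; // stand-in for a real forward pass
        }
        let done: Vec<usize> = self
            .active
            .iter()
            .filter(|r| r.tokens_left == 0)
            .map(|r| r.id)
            .collect();
        self.active.retain(|r| r.tokens_left > 0);
        done
    }
}

fn main() {
    let mut s = Scheduler::new(2);
    s.submit(Request { id: 0, tokens_left: 1 });
    s.submit(Request { id: 1, tokens_left: 3 });
    s.submit(Request { id: 2, tokens_left: 1 });
    // Step 1: requests 0 and 1 are active; 0 finishes.
    assert_eq!(s.step(), vec![0]);
    // Step 2: request 2 backfills the freed slot and finishes immediately,
    // without waiting for the long-running request 1 to drain.
    assert_eq!(s.step(), vec![2]);
    // Step 3: request 1 finishes its last token.
    assert_eq!(s.step(), vec![1]);
    println!("continuous batching sim ok");
}
```

The point of the sketch: request 2 never waits for the whole batch to finish, which is exactly what static batching would force it to do.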
Would you consider optimizing for int4 + AutoRound? It seems to produce lower perplexity scores while being smaller and faster (at least at the moment, compared to AWQ / NVFP4).
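For context on what int4 weight quantization involves at its simplest, here is a generic symmetric round-to-nearest sketch. AutoRound’s actual advantage comes from learning the rounding decisions rather than always rounding to nearest, which this deliberately omits; everything here is illustrative and is not AutoRound’s API.

```rust
// Symmetric per-tensor int4 quantization: map floats onto the
// signed 4-bit range [-8, 7] with a single scale factor.
fn quantize_int4(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = max_abs / 7.0; // use 7 so +max_abs maps exactly to +7
    let q = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-8.0, 7.0) as i8)
        .collect();
    (q, scale)
}

fn dequantize(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let w = [0.9f32, -0.35, 0.1, -0.7];
    let (q, scale) = quantize_int4(&w);
    let w2 = dequantize(&q, scale);
    // Round-to-nearest bounds the error at half a quantization step.
    for (a, b) in w.iter().zip(&w2) {
        assert!((a - b).abs() <= scale / 2.0 + 1e-6);
    }
    println!("int4 round-trip error within half a step");
}
```

Real int4 schemes quantize per-group (e.g. 128 weights per scale) rather than per-tensor, and that is where the perplexity differences between AutoRound, AWQ, and NVFP4 come from.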
I’ve built harder things before the dawn of AI, like the Citadel Protocol (Rust). My patent in cryptography won Patent of the Month by the largest R&D advisory firm in the US. AI is a multiplier.
If this ends up fully open sourced, great. More options for Spark owners are always welcome.
But right now there’s no source code, no reproducible benchmarks, and terminology (“Kernel Hypercompilation”) that doesn’t appear anywhere in the literature. On a developer forum, this sounds like a press release.
If this stays closed, “Atlas owns the entire stack” becomes a real problem. It means every model architecture, quantization format, and bug fix flows through two maintainers. The frameworks being dismissed here are the reason Spark owners can actually run diverse models today. Replacing a composable ecosystem with a closed monolith is a tradeoff in my books.
Open the code, let people benchmark it, and this conversation changes entirely.