China aside, that does not change the core issue. If it is not open source, it is not worth much to the community.
I’d probably slow down on these buzzword threads.
These forums are not for personal marketing.
When we look at the supply chain, the cautionary notes about Chinese involvement in alternative inference engines fail their own test of self-consistency. First, we are all running open-weights models contributed by Chinese industrial players. Second, among the venture capital firms backing our primary inference engine (vLLM) there is, surprise, surprise, ZhenFund (Beijing). So, with a dose of realism: it is not as if, by avoiding Atlas, we are playing in an autarkic camp.
As somebody well aware of what it takes to develop tools of such complexity, I am grateful for the technologies we all have at our fingertips, and I applaud alternative designs. Provided the alternative is real, the more competition the better.
This is exactly what I was waiting for.
It is my opinion that there is no private enterprise in China. From 1949 onward, there has been only one entity: the CCP. The average person's opinion does not matter. This includes all those CEOs at EVERY Chinese AI company.
Please consider this when you deal with AI companies in China: Dario Amodei can speak his mind and make his own decisions. No one in China has that luxury.
Current updates:
Throughput shows a consistent increase from C=1 to C=4, then asymptotically converges to the base levels described above for all models from C=8 to […], thanks to SLAI and various prefill optimizations.
We are ensuring a safe, measured release for the community. We will put the ever-shifting Jenga tower of vLLM/cutlass/marlin/flashinfer dependencies and patches to rest for good.
We are in communication with the Qwen team, who have taken a keen interest in the software. Expect big things for our community and beyond.
What's the total number of tokens you can store in the nvfp4-quantized cache for Qwen3.5-122B-A10B-NVFP4 on a single Spark?
If the cache is actually being taken from fp8 to nvfp4, I'd expect substantially higher throughput.
Edit: I'm also interested in concurrency numbers in at least the triple digits, like what current-day vLLM can achieve. I've run Qwen3.5-122B-A10B autoround with as many as 256 concurrent requests on a single Spark with impressive results; I can't imagine what this fever-dream nvfp4 cache quant could do for it.
No fever dream, just math.
Ok, so math up the number of tokens you can store in your KV cache @ nvfp4 with the Qwen3.5-122B-A10B-NVFP4 weights loaded and let me know the maths.
Sorry, but I call BS on this one. Qwen3-VL-32B is a dense model; even at 4 bits it physically can't run faster than ~17 t/s on a single Spark (back-of-envelope below).
Or are these numbers for total throughput in concurrency testing?
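For reference, the back-of-envelope: decode on a dense model is memory-bandwidth-bound, since every generated token has to stream all the weights from memory once. A rough sketch, assuming the Spark's published ~273 GB/s memory bandwidth and ignoring KV cache reads:

```python
# Bandwidth-bound decode ceiling for a dense model on a DGX Spark.
# Assumes decode is purely memory-bound and each token requires
# streaming all weights from memory once (KV cache reads ignored).
params = 32e9            # Qwen3-VL-32B parameter count
bytes_per_param = 0.5    # 4-bit quantization
weight_bytes = params * bytes_per_param   # 16 GB per forward pass
bandwidth = 273e9        # ~273 GB/s LPDDR5X, the Spark's published spec
print(bandwidth / weight_bytes)           # ~17 tokens/s upper bound
```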
He meant prefill tokens/second, not decode, you know, the important one.
Well, to my understanding it's 2 (K and V) times batch size (in our case 1) times head_dim times head count (the KV head count, which differs for MQA) times layers times seq_len? But this all flips on its head for the Qwen3.5 architecture with its linear attention and Gated DeltaNets. That dense number may have been the aggregate calculation; let me get back to you with the rest.
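A minimal sketch of that formula as a capacity calculator. The helper and the example numbers are hypothetical (not the real Qwen3.5 config), and it only covers standard-attention layers, not the linear-attention/Gated DeltaNet ones:

```python
def kv_cache_max_tokens(free_bytes: int, n_layers: int, n_kv_heads: int,
                        head_dim: int, bytes_per_elem: float,
                        batch: int = 1) -> int:
    """Max tokens that fit in the KV cache for a standard-attention model.

    Per token, per layer, we store K and V: 2 * n_kv_heads * head_dim elements.
    For MQA/GQA, n_kv_heads is the (smaller) KV head count, not the query heads.
    """
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * batch
    return int(free_bytes // bytes_per_token)

# Illustrative only -- hypothetical config, not the real model's numbers:
print(kv_cache_max_tokens(free_bytes=40 * 1024**3,  # 40 GB left after weights
                          n_layers=48, n_kv_heads=8, head_dim=128,
                          bytes_per_elem=0.5))       # nvfp4 = 4 bits/element
```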
You guys stated you had the nvfp4 KV cache quant working, so that should be at least a 50% bump to decode speed on a memory-bound device (rough arithmetic below).
That’s kind of been the holy grail for a while, I’m glad you guys cracked it.
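To put rough numbers on that 50% figure. Purely illustrative assumptions; the actual weight and cache footprints depend on the model, context length, and batch:

```python
# Memory-bound decode: time per token scales with bytes read (weights + KV).
# Halving KV bytes (fp8 -> nvfp4) helps most when the cache dominates,
# e.g. long contexts at high concurrency.
weight_bytes = 16e9                  # assumed quantized weight footprint
kv_fp8, kv_nvfp4 = 32e9, 16e9        # assumed KV bytes read per decode step
speedup = (weight_bytes + kv_fp8) / (weight_bytes + kv_nvfp4)
print(speedup)                       # 1.5x under these assumed numbers
```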
In the video you guys show that it’s MIT-licensed open source, so I’m sure a lot of us would love to see the code and try it out.
As far as I know, 100 t/s for Qwen3-VL-32B would basically melt a Spark… which I kinda want to see… I promise I’ll keep a fire extinguisher handy.
Lol, fair enough, on dense that's impossible. Just did the math, and since it's Qwen3 with no fancy hybrid architecture, it indeed checks out: that was our aggregate max. We've been running a lot of these benchmarks, apologies folks! Atlas shines on MoE Mamba backbones :)
Looks like Claude didn’t get the right Qwen3-VL-32B name in your screenshot. And it also looks like your partner has a different answer.
Will the real vibe-coded answer please stand up, please stand up, please stand up.
No, just the usual human error. Proof that vibe coding is not in place.
I truly want to recap.
Aight, I'm looking forward to it, and I agree. Releases like these aren't about simply posting code the moment it works; the real world demands this technology be released in a measured way, in step with the ecosystem.
I would caution the community to be very wary of software released with promises like this. To be honest, there's no telling whether there's malicious intent behind it.
Someone posting on Reddit and the DGX forums to drum up hype/buzz is already a red flag; inb4 crypto miner/RAT.