Nvidia releases Kimi K2.5 NVFP4! (1T, 591GB)

Need 5 Nvidia Sparks

Cool!! But … how many Sparks do I need to run this model?

If I were to host Kimi K2.5 (open weights for inference) on a hyperscaler, what is the recommended GPU shape?

A32B (32B active params) … what do you want to do with Kimi?
Zero context → ~20 tokens/s.
Claude Code → mostly at the coffee machine.
Add a lot more DGX, and some more to split the hot experts …

Welcome to DGX farming.

32B models at 4-bit run at 12 t/s on a single Spark.
This will require an 8x Spark cluster to run with a good amount of context and to take advantage of tensor parallelism.

The performance will likely be ~60 t/s at zero context.

8 Sparks, because it won’t fit on 4, and the next cluster size that supports tensor-parallel is 8.
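A back-of-envelope check on the memory fit, assuming 128 GB of unified memory per Spark and ignoring KV cache and runtime overhead:

```python
# Rough memory-fit check for Kimi K2.5 NVFP4 on DGX Spark nodes.
# Assumes 128 GB unified memory per Spark; KV cache, activations and
# framework overhead are ignored, so the real requirement is higher.
weights_gb = 591
spark_mem_gb = 128

for nodes in (4, 8):
    total = nodes * spark_mem_gb
    fits = "fits" if total > weights_gb else "does not fit"
    print(f"{nodes} Sparks: {total} GB total -> {fits} "
          f"({total - weights_gb:+d} GB headroom for KV cache/overhead)")

# 4 Sparks: 512 GB -> 591 GB of weights alone do not fit.
# 8 Sparks: 1024 GB -> fits, with room left for context.
```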

273 GB/s ÷ (32B × 0.5 bytes ≈ 16 GB) ≈ ~17 tokens/s at zero context, without parallelization

=> what am I overlooking?

That 273 GB/s is the maximum RAM throughput under ideal conditions. Once you add SoC/kernel/scheduler/framework overhead it will be lower, plus extra overhead in the inference engine itself. So, in reality, for vLLM and the most optimized inference paths (int4 and FP8) you need to apply a ~0.75 multiplier.
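As a sketch, putting the two numbers above together (273 GB/s peak bandwidth, 32B active parameters at ~0.5 bytes each, ~0.75 efficiency factor):

```python
# Back-of-envelope decode speed from memory bandwidth (single Spark,
# zero context, no parallelization). Decode is bandwidth-bound: every
# generated token has to stream all active weights from memory once.
bandwidth_gbs = 273          # peak LPDDR5X bandwidth of one Spark
active_params = 32e9         # active parameters per token (32B)
bytes_per_param = 0.5        # ~4-bit quantization
efficiency = 0.75            # SoC/kernel/scheduler/engine overhead (rough)

active_gb = active_params * bytes_per_param / 1e9      # ~16 GB per token
ideal_tps = bandwidth_gbs / active_gb                  # ~17 t/s
realistic_tps = ideal_tps * efficiency                 # ~12.8 t/s

print(f"ideal: {ideal_tps:.1f} t/s, with overhead: {realistic_tps:.1f} t/s")
# ~12.8 t/s lines up with the ~12 t/s measured on a single Spark below.
```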

And ~60 t/s is an approximation for aggregated throughput across batched requests, but does that hold for single-user decode latency as well?

No, it’s an approximation for single-user decode speed at zero context on an 8-node cluster.
Dense models scale very nicely in Spark clusters, and 32B active parameters is definitely dense enough.

For instance, when running cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit, I get 12 t/s on a single Spark and 21 t/s on dual Sparks. I can’t test the scaling over a larger cluster, as I only have two Sparks, but it won’t scale linearly: once the split weights are small enough to be processed much faster, internode latency will play a bigger role.

Makes sense. And even more so if you have enabled the redundancy options of vLLM (as an example).

First, you have the chaotic path through each layer’s expert selection (something you don’t see with dense models). It was very surprising to me. My last self-built FFN is 20 years old; I had to learn that experts haven’t changed this behavior much. So the more chaotic the selection, the better it balances. And distributing localized hot experts (which risk becoming a queue) flattens the response times even more. It’s a magic trick.

Why not EP instead of TP? EP could handle 6x?

… And even if TP — couldn’t vllm/sglang pad the 64 attention heads to 66 with 2 dummy heads? 3% wasted compute beats buying 2 extra nodes.

EP doesn’t split the weights, it just splits the expert layers across GPUs, so it won’t increase inference speed. In my testing, EP didn’t give any performance boost on Sparks; it was actually a bit slower with EP, even when used together with TP.


In my understanding, most MoE models are built this way: a token runs through the layers, each layer has a router that picks a set of experts, and each expert is an FFN. After running separately through each selected FFN, the results are summed, then on to the next layer. So it’s token → layer → experts rather than token → experts → layer. I call the way through the layers a path, and the paths are divergent (each token has its own unique path).
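A toy sketch of that token → layer → experts flow, with made-up shapes and a simple top-k router (not any particular model’s implementation):

```python
import numpy as np

# Toy MoE forward pass: per layer, a router picks top-k experts,
# each selected expert (a small FFN) runs on the token's hidden state,
# and the outputs are combined as a weighted sum before the next layer.
rng = np.random.default_rng(0)
d_model, n_layers, n_experts, top_k = 64, 4, 8, 2

layers = [
    {
        "router": rng.standard_normal((d_model, n_experts)),
        "experts": [rng.standard_normal((d_model, d_model)) * 0.01
                    for _ in range(n_experts)],
    }
    for _ in range(n_layers)
]

def forward(hidden):
    path = []                                  # which experts this token visits
    for layer in layers:
        logits = hidden @ layer["router"]
        chosen = np.argsort(logits)[-top_k:]   # router picks top-k experts
        weights = np.exp(logits[chosen])
        weights /= weights.sum()
        # each selected expert FFN runs separately, results are summed (weighted)
        hidden = sum(w * np.tanh(hidden @ layer["experts"][e])
                     for w, e in zip(weights, chosen))
        path.append(sorted(chosen.tolist()))
    return hidden, path

_, path = forward(rng.standard_normal(d_model))
print("expert path per layer:", path)          # each token gets its own path
```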

My histograms show that each layer chooses different experts, and so does the next token at the same layer, but some FFNs (experts per layer) are chosen more often than others. And the hidden states are about 1/1000 the size of the weights each expert needs to read from memory. So the network is NOT a capacity limit.
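A rough illustration of that ~1/1000 claim, using hypothetical dimensions (not Kimi K2.5’s real config):

```python
# Compare what has to cross the network (the token's hidden state) with
# what each expert has to read from local memory (its FFN weights).
# Dimensions are illustrative placeholders, not the real model config.
d_model = 7168            # hypothetical hidden size
d_ff = 2048               # hypothetical expert intermediate size
bytes_act = 2             # bf16 activations
bytes_w = 0.5             # 4-bit weights

hidden_state_kb = d_model * bytes_act / 1024                 # ~14 KB per token
expert_weights_mb = 3 * d_model * d_ff * bytes_w / 1024**2   # ~21 MB (gate/up/down)

print(f"hidden state per token: {hidden_state_kb:.1f} KB")
print(f"weights per expert:     {expert_weights_mb:.1f} MB")
print(f"ratio: ~1/{expert_weights_mb * 1024 / hidden_state_kb:.0f}")
# The interconnect moves kilobytes per token while each node streams
# megabytes of weights, so bandwidth-wise the network is not the limit.
```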

When my second DGX arrives I will measure this out. Either I’m totally wrong or it’s a specific tradeoff between token batch size and network latency. But after all, each expert can run independently if distributed well!


So formally, expert 22 can run on node A and expert 33 on node B — that’s 1/2 memory bandwidth per node.

The thing is, when you do tensor-parallel ops, each weight matrix (no matter what layer it belongs to) gets split equally between nodes, so each node performs matrix ops on a 1/N slice of the weight matrix, then NCCL performs an all_reduce after each layer, and moves on. So the load is distributed evenly, you consume 1/N of the memory bandwidth per node, and the only bottleneck is the all_reduce op. That’s why it scales well in a Spark cluster: Spark has low memory bandwidth but also a very fast, low-latency interconnect. The denser the model, the better it scales.
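A minimal single-process simulation of that split, with numpy standing in for N ranks (in a real cluster the partial results would be combined by an NCCL all_reduce rather than a Python sum):

```python
import numpy as np

# Simulate row-parallel tensor parallelism: the weight matrix is split
# along its input dimension across N "nodes", each node multiplies its
# slice against the matching slice of the activation, and the partial
# results are summed -- the role NCCL all_reduce plays after each layer.
rng = np.random.default_rng(0)
N = 8                                     # number of Spark nodes
d_in, d_out = 1024, 1024

x = rng.standard_normal(d_in)
W = rng.standard_normal((d_in, d_out))

# Each node holds 1/N of the rows of W and only streams that slice from
# its own memory, which is why per-node memory bandwidth demand drops.
w_shards = np.split(W, N, axis=0)
x_shards = np.split(x, N)

partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]   # local matmuls
y_tp = np.sum(partials, axis=0)                              # the "all_reduce"

assert np.allclose(y_tp, x @ W)                              # same result
print("tensor-parallel result matches the single-node matmul")
```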

When you use expert-parallel in addition to tensor parallel, it now distributes the experts between nodes, but the weights for each expert are split between nodes too, so you don’t really gain anything; you just add overhead.

In a traditional cluster where each node has multiple GPUs over which you do the tensor split, it makes more sense, as each expert then gets the added benefit of local parallelism via the tensor split.

At least that’s how I understand it, and my early experiments agreed with this.

Hello, does anyone have Kimi 2.6 working on 8 devices? I can’t make it work; I tried updating the 397B FP8 recipe to TP 8, but it doesn’t work.