Nvidia releases Kimi K2.5 NVFP4! (1T, 591GB)

Need 5 Nvidia Sparks

Cool!! But … how many Sparks do I need to run this model?

If I were to host Kimi K2.5 (open weights for inference) on a hyperscaler, what is the recommended GPU shape?

A32B (32B active params) … what do you want to do with Kimi?
Zero context → ~20 tokens/s.
Claude Code → mostly at the coffee machine.
Add a lot more DGX, and some more to split the hot experts …

Welcome to DGX farming.

32B models at 4-bit run at 12 t/s on a single Spark.
This will require an 8x Spark cluster to run with a good amount of context and to take advantage of tensor parallelism.

The performance will likely be ~60 t/s at zero context.

8 Sparks, because it won’t fit on 4, and the next cluster size that supports tensor-parallel is 8.
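A back-of-envelope check on the memory fit, assuming 128 GB of unified memory per Spark and ignoring KV cache and runtime overhead:

```python
# Rough memory-fit check for Kimi K2.5 NVFP4 on DGX Spark nodes.
# Assumes 128 GB unified memory per Spark; KV cache, activations and
# framework overhead are ignored, so the real requirement is higher.
weights_gb = 591
spark_mem_gb = 128

for nodes in (4, 8):
    total = nodes * spark_mem_gb
    fits = "fits" if total > weights_gb else "does not fit"
    print(f"{nodes} Sparks: {total} GB total -> {fits} "
          f"({total - weights_gb:+d} GB headroom for KV cache/overhead)")

# 4 Sparks: 512 GB -> 591 GB of weights alone do not fit.
# 8 Sparks: 1024 GB -> fits, with room left for context.
```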

273 GB/s ÷ (32B × 0.5 bytes ≈ 16 GB) ≈ ~17 tokens/s at zero context, without parallelization

=> what am I overlooking?

That 273 GB/s is the maximum RAM throughput under ideal conditions. Once you add SoC/kernel/scheduler/framework overhead it will be lower, plus extra overhead in the inference engine itself. So, in reality, for vLLM and the most optimized inference paths (int4 and FP8) you need to apply a ~0.75 multiplier.
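As a sketch, putting the two numbers above together (273 GB/s peak bandwidth, 32B active parameters at ~0.5 bytes each, ~0.75 efficiency factor):

```python
# Back-of-envelope decode speed from memory bandwidth (single Spark,
# zero context, no parallelization). Decode is bandwidth-bound: every
# generated token has to stream all active weights from memory once.
bandwidth_gbs = 273          # peak LPDDR5X bandwidth of one Spark
active_params = 32e9         # active parameters per token (32B)
bytes_per_param = 0.5        # ~4-bit quantization
efficiency = 0.75            # SoC/kernel/scheduler/engine overhead (rough)

active_gb = active_params * bytes_per_param / 1e9      # ~16 GB per token
ideal_tps = bandwidth_gbs / active_gb                  # ~17 t/s
realistic_tps = ideal_tps * efficiency                 # ~12.8 t/s

print(f"ideal: {ideal_tps:.1f} t/s, with overhead: {realistic_tps:.1f} t/s")
# ~12.8 t/s lines up with the ~12 t/s measured on a single Spark below.
```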

And ~60 t/s is an approximation for aggregated throughput across batched requests, but does that hold for single-user decode latency as well?

No, it’s an approximation for single-user decode speed at zero context on an 8-node cluster.
Dense models scale very nicely in Spark clusters, and 32B active parameters is definitely dense enough.

For instance, when running cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit, I get 12 t/s on a single Spark and 21 t/s on dual Sparks. I can’t test the scaling over a larger cluster, as I only have two Sparks, but it won’t scale linearly: once the split weights are small enough to be processed much faster, internode latency will play a bigger role.

Makes sense. And even more so if you have enabled the redundancy options of vLLM (as an example).

First, you have the chaotic path through each layer’s expert selection (something you don’t see with dense models). It was very surprising to me. My last self-built FFN is 20 years old; I had to learn that experts haven’t changed this behavior much. So the more chaotic the selection, the better it balances. And distributing localized hot experts (which risk becoming a queue) flattens the response times even more. It’s a magic trick.

Why not EP instead of TP? EP could handle 6x?

… And even if TP — couldn’t vllm/sglang pad the 64 attention heads to 66 with 2 dummy heads? 3% wasted compute beats buying 2 extra nodes.

EP doesn’t split the weights, it just splits the expert layers across GPUs, so it won’t increase inference speed. In my testing, EP didn’t give any performance boost on Sparks; it was actually a bit slower with EP, even when used together with TP.


In my understanding, most MoE models are built this way: a token runs through the layers, each layer has a router that picks a set of experts, and each expert is an FFN. After running separately through each selected FFN, the results are summed, then on to the next layer. So it’s token → layer → experts rather than token → experts → layer. I call the way through the layers a path, and the paths are divergent (each token has its own unique path).
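A toy sketch of that token → layer → experts flow, with made-up shapes and a simple top-k router (not any particular model’s implementation):

```python
import numpy as np

# Toy MoE forward pass: per layer, a router picks top-k experts,
# each selected expert (a small FFN) runs on the token's hidden state,
# and the outputs are combined as a weighted sum before the next layer.
rng = np.random.default_rng(0)
d_model, n_layers, n_experts, top_k = 64, 4, 8, 2

layers = [
    {
        "router": rng.standard_normal((d_model, n_experts)),
        "experts": [rng.standard_normal((d_model, d_model)) * 0.01
                    for _ in range(n_experts)],
    }
    for _ in range(n_layers)
]

def forward(hidden):
    path = []                                  # which experts this token visits
    for layer in layers:
        logits = hidden @ layer["router"]
        chosen = np.argsort(logits)[-top_k:]   # router picks top-k experts
        weights = np.exp(logits[chosen])
        weights /= weights.sum()
        # each selected expert FFN runs separately, results are summed (weighted)
        hidden = sum(w * np.tanh(hidden @ layer["experts"][e])
                     for w, e in zip(weights, chosen))
        path.append(sorted(chosen.tolist()))
    return hidden, path

_, path = forward(rng.standard_normal(d_model))
print("expert path per layer:", path)          # each token gets its own path
```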

My histograms show that each layer chooses different experts, and so does the next token at the same layer, but some FFNs (experts per layer) are chosen more often than others. And the hidden states are about 1/1000 the size of the weights each expert needs to read from memory. So the network is NOT a capacity limit.
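A rough illustration of that ~1/1000 claim, using hypothetical dimensions (not Kimi K2.5’s real config):

```python
# Compare what has to cross the network (the token's hidden state) with
# what each expert has to read from local memory (its FFN weights).
# Dimensions are illustrative placeholders, not the real model config.
d_model = 7168            # hypothetical hidden size
d_ff = 2048               # hypothetical expert intermediate size
bytes_act = 2             # bf16 activations
bytes_w = 0.5             # 4-bit weights

hidden_state_kb = d_model * bytes_act / 1024                 # ~14 KB per token
expert_weights_mb = 3 * d_model * d_ff * bytes_w / 1024**2   # ~21 MB (gate/up/down)

print(f"hidden state per token: {hidden_state_kb:.1f} KB")
print(f"weights per expert:     {expert_weights_mb:.1f} MB")
print(f"ratio: ~1/{expert_weights_mb * 1024 / hidden_state_kb:.0f}")
# The interconnect moves kilobytes per token while each node streams
# megabytes of weights, so bandwidth-wise the network is not the limit.
```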

When my second DGX arrives I will measure this out. Either I’m totally wrong or it’s a specific tradeoff between token batch size and network latency. But after all, each expert can run independently if distributed well!


So formally, expert 22 can run on node A and expert 33 on node B — that’s 1/2 memory bandwidth per node.

The thing is, when you do tensor-parallel ops, each weight matrix (no matter what layer it belongs to) gets split equally between nodes, so each node performs matrix ops on a 1/N slice of the weight matrix, then NCCL performs an all_reduce after each layer, and moves on. So the load is distributed evenly, you consume 1/N of the memory bandwidth per node, and the only bottleneck is the all_reduce op. That’s why it scales well in a Spark cluster: Spark has low memory bandwidth but also a very fast, low-latency interconnect. The denser the model, the better it scales.
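A minimal single-process simulation of that split, with numpy standing in for N ranks (in a real cluster the partial results would be combined by an NCCL all_reduce rather than a Python sum):

```python
import numpy as np

# Simulate row-parallel tensor parallelism: the weight matrix is split
# along its input dimension across N "nodes", each node multiplies its
# slice against the matching slice of the activation, and the partial
# results are summed -- the role NCCL all_reduce plays after each layer.
rng = np.random.default_rng(0)
N = 8                                     # number of Spark nodes
d_in, d_out = 1024, 1024

x = rng.standard_normal(d_in)
W = rng.standard_normal((d_in, d_out))

# Each node holds 1/N of the rows of W and only streams that slice from
# its own memory, which is why per-node memory bandwidth demand drops.
w_shards = np.split(W, N, axis=0)
x_shards = np.split(x, N)

partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]   # local matmuls
y_tp = np.sum(partials, axis=0)                              # the "all_reduce"

assert np.allclose(y_tp, x @ W)                              # same result
print("tensor-parallel result matches the single-node matmul")
```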

When you use expert-parallel in addition to tensor parallel, it now distributes the experts between nodes, but the weights for each expert are split between nodes too, so you don’t really gain anything; you just add overhead.

In a traditional cluster where each node has multiple GPUs over which you do the tensor split, it makes more sense, as each expert then gets the added benefit of local parallelism via the tensor split.

At least that’s how I understand it, and my early experiments agreed with this.

Hello, does anyone have Kimi 2.6 working on 8 devices? I can’t make it work; I tried updating the 397B FP8 recipe to TP 8, but it doesn’t work.