Two-Spark cluster with vLLM using tensor-parallel-size 2 causes one node to drop while the other node's GPU stays pinned at 100% forever

Yes, but I’ve had some weird issues with this quant, while QuantTrio worked fine. I don’t remember what the issue was; it should be somewhere in the forum threads :)

EDIT: found it - How to run GLM 4.7 on dual DGX Sparks with vLLM / mods support in spark-vllm-docker - #5 by eugr

Not sure if that issue still persists, but the QuantTrio version is smaller, and I can fit a 128K context on my two Sparks with an fp8 KV cache.
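For anyone landing here later, here's a minimal sketch of what that kind of two-node launch looks like. The IP, port, and model repo name are placeholders, and the flags shown are standard vLLM/Ray options rather than anything specific to the spark-vllm-docker setup:

```bash
# Hypothetical sketch: two DGX Sparks (one GPU each) joined via Ray so vLLM
# can shard the model across nodes with tensor parallelism.

# On the first Spark (head node); port is arbitrary:
ray start --head --port=6379

# On the second Spark, join the cluster (use the head node's actual IP):
ray start --address=<head-ip>:6379

# From the head node, serve the model. The repo name is a placeholder for
# whichever QuantTrio quant you pulled; 131072 tokens = the 128K context.
vllm serve QuantTrio/<glm-quant> \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --kv-cache-dtype fp8 \
  --max-model-len 131072
```

The fp8 KV cache is what makes the 128K window fit here: it roughly halves the cache footprint compared to fp16, leaving more headroom for the model weights on each Spark.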