The KV cache is carved out of the --gpu-memory-utilization budget.
With the KV cache, what you're tuning is the number of concurrent requests you can serve at your context window size.
A basic example: with a 128k-token context window and a 512k-token KV cache, you could run 4 requests at the same time. This should lead to higher total tokens per second, since the work required for inference across requests can be batched together to get the most out of the hardware.
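The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope estimate only; the function name and the assumption that every request uses its full context window are mine, and a real scheduler admits more requests when prompts are shorter than the window.

```python
# Hypothetical helper: how many full-context requests fit in the KV cache
# at once, assuming each request reserves the entire context window.
def max_concurrent_requests(kv_cache_tokens: int, context_window: int) -> int:
    return kv_cache_tokens // context_window

# 512k-token KV cache with a 128k-token context window -> 4 requests
print(max_concurrent_requests(512_000, 128_000))  # 4
```

In practice continuous batching means this is a worst-case floor, not a hard cap, but it's a useful mental model when sizing the cache.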
This runs okay on my setup, but on average it takes over a minute to complete its reasoning before outputting a response, far slower than MiniMax M2.1, which runs fast with almost no delay. Have you tried the cyankiwi version? cyankiwi/GLM-4.7-AWQ-4bit · Hugging Face
You mentioned earlier that you got much better performance from MiniMax using the cyankiwi quant.
Yes, but I had some weird issues with that quant, while QuantTrio's worked fine. I don't remember what the issue was; it should be somewhere in the forum threads :)
I found your post about this, but if I'm not mistaken, you were discussing GLM-4.7-Flash, not the full GLM-4.7 358B model. Just wanted to confirm you tried the quants of the full version. If not, I'll give it a try and see if there's any improvement over QuantTrio's version.
I tested both GLM-4.7 versions from QuantTrio and cyankiwi. They both worked, although the QuantTrio version seemed to run slightly faster in my non-scientific tests. So I guess the corruption issue you raised on HF got fixed somewhere.