Two-Spark cluster with vLLM using tensor-parallel-size 2 causes one node to drop while the other's GPU stays pinned at 100% indefinitely

The KV cache is allocated out of the memory budget set by --gpu-memory-utilization.

With the KV cache, what you're tuning is the number of concurrent requests you can serve at your context window size.

A basic example: with a 128k-token context window and a 512k-token KV cache, you can run 4 requests at the same time. This should lead to higher total tokens per second, since the various kinds of work involved in inference can be mixed together to get the most out of the hardware.
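To make that concrete, here's a rough sketch of how those knobs appear on a vllm serve command line. The model repo is the cyankiwi quant linked later in the thread, and the 0.90 utilization value is just an illustrative number, not a measured recommendation for the Sparks:

```bash
# Illustrative only. --gpu-memory-utilization caps vLLM's total GPU
# allocation; whatever remains after weights and activations becomes
# KV cache. If that leftover holds 512k tokens and --max-model-len is
# 128k, roughly 512k / 128k = 4 full-length requests fit concurrently.
vllm serve cyankiwi/GLM-4.7-AWQ-4bit \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90
```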


This runs okay on my setup, but it can take over a minute on average to finish its reasoning before it starts outputting a response. That's far slower than MiniMax M2.1, which runs fast with almost no delay. Have you tried the cyankiwi version? cyankiwi/GLM-4.7-AWQ-4bit · Hugging Face

You mentioned before that you got much better performance from MiniMax using the cyankiwi quant.

Yes, but I’ve had some weird issues with this quant, while QuantTrio worked fine. I don’t remember what the issue was; it should be somewhere in the forum threads :)

EDIT: found it - How to run GLM 4.7 on dual DGX Sparks with vLLM / mods support in spark-vllm-docker - #5 by eugr

Not sure if that issue still persists, but the QuantTrio version is smaller, and I can fit 128K context on my two Sparks with fp8 KV cache.
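For reference, a minimal sketch of the flags that combination implies. The QuantTrio repo name here is a placeholder (the exact repo isn't named in this thread), and the two-node cluster launch via spark-vllm-docker is assumed to already be in place:

```bash
# Sketch under assumptions: the repo name is illustrative, and the
# two-Spark cluster (e.g. via spark-vllm-docker) is already running.
# fp8 KV cache halves per-token cache memory versus fp16, which is
# what lets the 128K context fit.
vllm serve QuantTrio/GLM-4.7-AWQ \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --kv-cache-dtype fp8
```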

Aside from the issues, did you notice any performance benefit?

No, I think it was even slower.

I found your post about this, but if I’m not mistaken, you guys were discussing GLM-4.7-Flash, not the full GLM-4.7 358B model. Just wanted to confirm you tried the quants of the full version. If not, I’ll give it a try and see if there’s any improvement over QuantTrio’s version.

I updated my post above.

You probably found the wrong one. It’s this one I’m talking about: How to run GLM 4.7 on dual DGX Sparks with vLLM / mods support in spark-vllm-docker - #5 by eugr


Awesome, thanks.

I tested both GLM-4.7 quants, from QuantTrio and cyankiwi. Both worked, although the QuantTrio version seemed to run slightly faster in my non-scientific tests. So I guess the corruption issue you raised on HF got fixed at some point.

Good to hear. But since the QuantTrio version is smaller and runs faster, I’ll just keep using that.
