The KV cache is carved out of the --gpu-memory-utilization budget.
With the KV cache, what you're tuning is the number of concurrent requests you can serve at your context window size.
A basic example: with a 128k-token context window and a 512k-token KV cache, you could run 4 requests at the same time. This should lead to higher total tokens per second, since the work required for inference across requests can be batched together to get the most out of the hardware.
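The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope estimate only; the function name and the assumption that every request uses its full context window are mine, and a real scheduler admits more requests when prompts are shorter than the window.

```python
# Hypothetical helper: how many full-context requests fit in the KV cache
# at once, assuming each request reserves the entire context window.
def max_concurrent_requests(kv_cache_tokens: int, context_window: int) -> int:
    return kv_cache_tokens // context_window

# 512k-token KV cache with a 128k-token context window -> 4 requests
print(max_concurrent_requests(512_000, 128_000))  # 4
```

In practice continuous batching means this is a worst-case floor, not a hard cap, but it's a useful mental model when sizing the cache.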
This runs okay on my setup, but on average it takes over a minute to complete its reasoning before outputting a response, far slower than MiniMax M2.1, which runs fast with almost no delay. Have you tried the cyankiwi version? cyankiwi/GLM-4.7-AWQ-4bit · Hugging Face
You mentioned earlier that you got much better performance from MiniMax using the cyankiwi quant.
Yes, but I had some weird issues with that quant, while QuantTrio's worked fine. I don't remember what the issue was; it should be somewhere in the forum threads :)
I found your post about this, but if I'm not mistaken, you were discussing GLM-4.7-Flash, not the full GLM-4.7 358B model. Just wanted to confirm you tried the quants of the full version. If not, I'll give it a try and see if there's any improvement over QuantTrio's version.
I tested both GLM-4.7 versions from QuantTrio and cyankiwi. They both worked, although the QuantTrio version seemed to run slightly faster in my non-scientific tests. So I guess the corruption issue you raised on HF got fixed somewhere.