I understand that the hardware requirements are steep, but I know there are people here with 4+ Spark clusters and I’m surprised I’m not seeing any talk about running GLM 5 locally. Is there a reason I’m only seeing talk about 4.6 and 4.7? From the looks of it, a 4-bit quant of GLM 5 should work fine on a 4-node cluster of Sparks unless I’m missing something.
It’s massive. Even the NVFP4 variant (which wouldn’t run smoothly on the GB10 yet) is 400 GB in size. I could imagine 8 nodes being able to run it, but it would probably be painfully slow.
I see discussions of Qwen3-397B-A17B running on just 2 Sparks, so I figured GLM-5 with 744B params and 40B active would work on 4. You might be right about speed, but people seem to be trying to push the limits on everything they can so they can post about it these days. With the model being a month old now, it just kinda surprised me to see no discussion, which is why I’m asking.
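For anyone who wants to sanity-check the sizes being thrown around here, below is a rough back-of-the-envelope sketch. It only counts weights. The assumptions are mine: 128 GB of unified memory per Spark, a made-up 85% usable fraction for OS and runtime overhead, and the 400 GB NVFP4 figure quoted above. The function names are just for illustration, and real quants carry scales and some higher-precision layers, so actual checkpoints come out larger than the flat 4-bit number.

```python
def weight_size_gb(params_billion, bits_per_weight):
    """Raw weight size in GB (10^9 bytes), ignoring quantization scales and metadata."""
    return params_billion * bits_per_weight / 8


def headroom_gb(nodes, weights_gb, mem_per_node_gb=128, usable_fraction=0.85):
    """Memory left for KV cache and activations once the weights are loaded.

    usable_fraction is an assumed placeholder, not a measured value.
    """
    return nodes * mem_per_node_gb * usable_fraction - weights_gb


if __name__ == "__main__":
    print("Qwen3-397B, flat 4-bit:", weight_size_gb(397, 4), "GB")
    print("GLM-5 744B, flat 4-bit:", weight_size_gb(744, 4), "GB")
    print("GLM-5 744B, FP8:", weight_size_gb(744, 8), "GB")
    # 400 GB is the NVFP4 checkpoint size mentioned earlier in the thread.
    for nodes in (4, 8):
        print(nodes, "nodes, headroom:", round(headroom_gb(nodes, 400), 1), "GB")
```

Under these assumptions, 4 nodes leave only a few tens of GB across the whole cluster for KV cache and everything else, while 8 nodes have plenty of room even for FP8. Treat it as an estimate, not a guarantee that a given container will load.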
Basically there aren’t viable 4 bit quants yet.
There is an Intel AutoRound quant of that model, but it was made with RTN (no tuning) and apparently has issues.
Once that’s fixed I imagine it will be tried. 40B active means it will be slow, though.
Ah ok, gotcha! Thanks!
Hi! I ran the FP8 version at around 8-10 tokens/s (8 nodes), but I didn’t get the AWQ variant or MTP to work. Will try again in a while, when some fixes are merged. With AWQ and MTP working, I’d expect around 25 tokens/s.
I’m trying to do this exactly. I think 5 machines should do…
However, I’m not able to find a container I can use to run the model on Spark with NVFP4. If anyone has ideas, please let me know. I would like to use NVFP4-optimized containers…
So I learned the hard way that I cannot run it on 5 machines, but have to run it on either 4 or 8, due to the Q head and KV head counts on GLM-5. Four GB10 machines don’t have enough RAM to load it at NVFP4, but eight do. No way I’m going to run all eight on it lol
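The 4-or-8 restriction can be sketched like this. With tensor parallelism, the attention heads get split evenly across nodes, so a common rule is that the Q head count must be divisible by the node count, and the KV heads must either split evenly or be replicated evenly. The head counts below (64 and 8) are hypothetical placeholders, not GLM-5's real values, so check the model's config.json for those. The exact check also differs between inference engines, and `valid_tp_sizes` is just a name made up for this example.

```python
def valid_tp_sizes(num_q_heads, num_kv_heads, max_nodes):
    """List tensor-parallel sizes up to max_nodes that split the heads evenly."""
    sizes = []
    for tp in range(1, max_nodes + 1):
        # Every node needs the same whole number of query heads.
        q_ok = num_q_heads % tp == 0
        # KV heads are either sharded evenly or replicated evenly across nodes.
        kv_ok = num_kv_heads % tp == 0 or tp % num_kv_heads == 0
        if q_ok and kv_ok:
            sizes.append(tp)
    return sizes


if __name__ == "__main__":
    # Hypothetical head counts for illustration only.
    print(valid_tp_sizes(num_q_heads=64, num_kv_heads=8, max_nodes=8))
    # prints [1, 2, 4, 8]
```

With power-of-two head counts like these, 5 nodes never divides cleanly, which matches what happened here.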
I was about to say I thought these needed to be done in 2/4/8 configurations, but was waiting for your results. :-)
We live, we love, we lie…
TIL!
Best to wait until they release something like a 5 Flash or 5.1 Air/Flash, just as they did with 4.5 Air and 4.7 Flash.
There is an NVFP4 quant from Nvidia now: nvidia/GLM-5.1-NVFP4 · Hugging Face
Has anyone tried running this on 4 sparks?