Two-Spark cluster with vLLM using tensor-parallel-size 2 causes one node to drop while the other node's GPU stays pinned at 100% forever

Yes, but I’ve had some weird issues with this quant, while QuantTrio worked fine. I don’t remember what the issue was; it should be somewhere in the forum threads :)

EDIT: found it - How to run GLM 4.7 on dual DGX Sparks with vLLM / mods support in spark-vllm-docker - #5 by eugr

Not sure if that issue still persists, but the QuantTrio version is smaller, and I can fit a 128K context on my two Sparks with an fp8 KV cache.
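For anyone landing here later, here's a minimal sketch of what that kind of two-node launch looks like. The IP, port, and model repo name are placeholders, and the flags shown are standard vLLM/Ray options rather than anything specific to the spark-vllm-docker setup:

```bash
# Hypothetical sketch: two DGX Sparks (one GPU each) joined via Ray so vLLM
# can shard the model across nodes with tensor parallelism.

# On the first Spark (head node); port is arbitrary:
ray start --head --port=6379

# On the second Spark, join the cluster (use the head node's actual IP):
ray start --address=<head-ip>:6379

# From the head node, serve the model. The repo name is a placeholder for
# whichever QuantTrio quant you pulled; 131072 tokens = the 128K context.
vllm serve QuantTrio/<glm-quant> \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --kv-cache-dtype fp8 \
  --max-model-len 131072
```

The fp8 KV cache is what makes the 128K window fit here: it roughly halves the cache footprint compared to fp16, leaving more headroom for the model weights on each Spark.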