So far I have been unable to get any version of MiniMax-M2.1 or GLM-4.6 to install on a single DGX Spark. Your version of spark-vllm-docker comes the closest: it can load cyankiwi/MiniMax-M2.1-AWQ-4bit but then crashes with CUDA out of memory before starting. cyankiwi/GLM-4.6V-AWQ-4bit fails with "Invalid type of HuggingFace processor. Expected type: <class 'transformers.processing_utils.ProcessorMixin'>, but found type: <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>". Not sure what to do to get some version of either of these running; I want to try a local alternative to Claude Code.
Yes, it's just too big for a single node. You want to keep the weights under 100GB. Your best bet would be to use llama.cpp with Unsloth's Q3_K_XL quant - I was able to run MiniMax M2 this way. To run it in vLLM, you need a dual-node Spark setup.
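For reference, a minimal llama-server invocation would look something like this - the repo name and quant tag are assumptions, so check Unsloth's model page before copying, and adjust the context size to what fits:
# pull the UD-Q3_K_XL quant from Hugging Face and serve it (repo/tag are assumptions)
llama-server --host 0.0.0.0 --port 8080 -hf unsloth/MiniMax-M2-GGUF:UD-Q3_K_XL --ctx-size 32768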
You can run GLM-4.6V though - please pay attention to the instructions in spark-vllm-docker: for this model specifically you need to build the image with Transformers 5 enabled, otherwise it won't work.
It depends on you. If you:
- Can let it try more than once by using an agentic app instead of manual prompting (e.g. Claude Code, Qwen Code, Mistral Vibe). Over multiple attempts, the differences between LLMs can disappear.
- Have a way to test whether the agent succeeded at the task (e.g. there are tests it can run). Without this, it can't try more than once, since it doesn't know when it has messed up.
…then I think it can replace Sonnet.
You can now configure Claude Code to use any OpenAI-compatible API so you can test/work with local models without a Claude subscription. Here’s a recipe with llama-server:
# start custom server
llama-server --host 0.0.0.0 --port 8080 --model models/gpt-oss-120b.gguf -v --alias claude-sonnet-4-5 --api-key local-claude --ctx-size 24000
# launch Claude Code pointed at the custom server
export ANTHROPIC_BASE_URL=http://127.0.0.1:8080
export ANTHROPIC_AUTH_TOKEN=local-claude
claude
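Before launching claude, it's worth a quick sanity check that the server is actually up:
# llama-server's /health endpoint reports whether the model has finished loading
curl http://127.0.0.1:8080/health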
As eugr said, that's way too big. The size of your model files (in GB) is roughly the amount of VRAM needed, and that's BEFORE you even include the KV cache, which can add several more GB.
GLM-4.6 at Q4_K_M is 167GB of files. The Spark has 115GB of VRAM at best after the OS, so you can easily see the math doesn't work. (I actually don't know if the Spark can load a single model larger than 95GB without some BIOS tweak - can @eugr vouch? My Spark is unavailable atm.)
If you look at a smaller model like GLM-4.5-Air Q5_K_M, that's only 84GB for the model, so it fits comfortably on the Spark. The best choice for a reliable LLM on a single Spark is probably the unquantized gpt-oss-120b, which is only 64GB because OpenAI trained it natively in 4-bit.
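A quick way to check before loading anything is to total up the GGUF shard sizes and compare against that budget (the path below is just an example):
# sum the sizes of the GGUF files you plan to load (example path)
du -ch models/GLM-4.5-Air-Q5_K_M*.gguf | tail -n 1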
Thanks for explaining this.
Damn, you’re tempting me into buying a 2nd Spark!
Yeah, @eugr said it - use Unsloth's UD-Q3_K_XL. With an f16 KV cache you can fit about 73K of context, while q8_0 nets you 131K but slows PP/TG a bit.
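For reference, the KV cache precision is set with llama-server's --cache-type-k/--cache-type-v flags; the repo/tag below are assumptions, and quantized V cache needs flash attention (recent builds enable it automatically on CUDA):
# q8_0 KV cache to stretch context to ~131K (f16 is the default)
llama-server --host 0.0.0.0 --port 8080 -hf unsloth/MiniMax-M2-GGUF:UD-Q3_K_XL --cache-type-k q8_0 --cache-type-v q8_0 --ctx-size 131072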
As far as I'm concerned, MiniMax M2.1 is the tipping point between a novelty and a useful tool.
Well, the maximum VRAM I've been able to use so far is 115GB (model weights + KV cache + CUDA graphs). Practically, this means you want to keep model weights below 105GB on a single Spark (210GB on dual Sparks) so you can still allocate a usable amount of context and leave room for the CUDA graphs.
Unfortunately, if you don't use 2 Sparks, you're basically wasting $1500, because you're not using the InfiniBand module you already have. Also, training scales roughly linearly as the cluster expands.