So far I have been unable to get any version of MiniMax-M2.1 or GLM-4.6 to install on a single DGX Spark. Your version of spark-vllm-docker comes the closest: it can load cyankiwi/MiniMax-M2.1-AWQ-4bit but then crashes with CUDA out of memory before starting. cyankiwi/GLM-4.6V-AWQ-4bit fails with "Invalid type of HuggingFace processor. Expected type: <class 'transformers.processing_utils.ProcessorMixin'>, but found type: <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>". Not sure what to do to get some version of either of these running; I want to try a local alternative to Claude Code.
Yes, it's just too big for a single node. You want to keep the weights under 100GB. Your best bet would be to use llama.cpp with Unsloth's Q3_K_XL quant - I was able to run MiniMax M2 this way. To run it in vLLM, you need a dual-node Spark setup.
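For reference, a minimal llama-server invocation would look something like this - the repo name and quant tag are assumptions, so check Unsloth's model page before copying, and adjust the context size to what fits:
# pull the UD-Q3_K_XL quant from Hugging Face and serve it (repo/tag are assumptions)
llama-server --host 0.0.0.0 --port 8080 -hf unsloth/MiniMax-M2-GGUF:UD-Q3_K_XL --ctx-size 32768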
You can run GLM-4.6V though - please pay attention to the instructions in spark-vllm-docker: for this model specifically you need to build the image with Transformers 5 enabled, otherwise it won't work.
It depends on you. If you:
- Can let it try more than once by using an agentic app instead of manual prompting (e.g. Claude Code, Qwen Code, Mistral Vibe). Over multiple attempts, the differences between LLMs can disappear.
- Have a way to test whether the agent succeeded at the task (e.g. there are tests it can run). Without this, it can't try more than once, since it doesn't know when it has messed up.
…then I think it can replace Sonnet.
You can now configure Claude Code to use any OpenAI-compatible API so you can test/work with local models without a Claude subscription. Here’s a recipe with llama-server:
# start custom server
llama-server --host 0.0.0.0 --port 8080 --model models/gpt-oss-120b.gguf -v --alias claude-sonnet-4-5 --api-key local-claude --ctx-size 24000
# launch Claude Code pointed at the custom server
export ANTHROPIC_BASE_URL=http://127.0.0.1:8080
export ANTHROPIC_AUTH_TOKEN=local-claude
claude
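Before launching claude, it's worth a quick sanity check that the server is actually up:
# llama-server's /health endpoint reports whether the model has finished loading
curl http://127.0.0.1:8080/health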
As eugr said, that's way too big. The size of your model files (in GB) is roughly the amount of VRAM needed, and that's BEFORE you even include the KV cache, which can add several more GB.
GLM-4.6 at Q4_K_M is 167GB of files. The Spark has 115GB of VRAM at best after the OS, so you can easily see the math doesn't work. (I actually don't know if the Spark can load a single model larger than 95GB without some BIOS tweak - can @eugr vouch? My Spark is unavailable atm.)
If you look at a smaller model like GLM-4.5-Air Q5_K_M, that's only 84GB for the model, so it fits comfortably on the Spark. The best choice for a reliable LLM on a single Spark is probably the unquantized gpt-oss-120b, which is only 64GB because OpenAI trained it natively in 4-bit.
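A quick way to check before loading anything is to total up the GGUF shard sizes and compare against that budget (the path below is just an example):
# sum the sizes of the GGUF files you plan to load (example path)
du -ch models/GLM-4.5-Air-Q5_K_M*.gguf | tail -n 1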
Thanks for explaining this.
Damn, you’re tempting me into buying a 2nd Spark!
Yeah, @eugr said it - use Unsloth's UD-Q3_K_XL. With an f16 KV cache you can fit about 73K of context, while q8_0 nets you 131K but slows PP/TG a bit.
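For reference, the KV cache precision is set with llama-server's --cache-type-k/--cache-type-v flags; the repo/tag below are assumptions, and quantized V cache needs flash attention (recent builds enable it automatically on CUDA):
# q8_0 KV cache to stretch context to ~131K (f16 is the default)
llama-server --host 0.0.0.0 --port 8080 -hf unsloth/MiniMax-M2-GGUF:UD-Q3_K_XL --cache-type-k q8_0 --cache-type-v q8_0 --ctx-size 131072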
As far as I'm concerned, MiniMax M2.1 is the tipping point between a novelty and a useful tool.
Well, the maximum VRAM I've been able to use so far is 115GB (model weights + KV cache + CUDA graphs). Practically, this means you want to keep model weights below 105GB on a single Spark (210GB on dual Sparks) so you can still allocate a usable amount of context and leave room for the CUDA graphs.
Unfortunately, if you don't use 2 Sparks, you're basically wasting $1500, because you're not using the InfiniBand module you already have. Also, training scales roughly linearly as the cluster expands.