DGX Spark performance

Once we have an NVFP4 version of Kimi-K2.5 with 8-bit or higher quality (maybe using QAT) and new kernels optimized for the Spark, that will be a good trigger point for expanding the cluster.

It is already a native INT4 quant (they used QAT). There is no BF16 version on Hugging Face, but it's still almost 600GB. I wonder if @ericlewis777 can run it on his 8x setup :)

@eugr Have you seen QuantTrio/Kimi-K2.5-E304 on Hugging Face?

No, since I only have two Sparks, I can’t run anything larger than ~350B parameters anyway :)

yet … LOL

Well, maybe NVIDIA can gift me some ;)

@NVIDIA I support that. That’s money well spent.

Hear hear!

I have only one, so may I?

Hi Eugr, and thanks for the tutorial. I used it last month to update llama.cpp successfully. Today, after pulling the latest code and rebuilding, I ran into the following issue: as soon as I try to load a model with the -hf flag, I get this message:

build/bin/llama-server -hf unsloth/gpt-oss-120b-GGUF:Q8_K_XL -ngl 999 --jinja -c 32000 --host 0.0.0.0 -fa 1 --no-mmap --port 8080

ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
HTTPS is not supported. Please rebuild with one of:
  -DLLAMA_BUILD_BORINGSSL=ON
  -DLLAMA_BUILD_LIBRESSL=ON
  -DLLAMA_OPENSSL=ON (default, requires OpenSSL dev files installed

I tried rebuilding with -DLLAMA_OPENSSL=ON, but I get the same message and no model loads when using the -hf flag. I checked the version of libcurl4-openssl-dev and it's the latest. Do you have any suggestions?

Try installing libssl-dev: they've switched from libcurl to libssl.
sudo apt install libssl-dev
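
Before rebuilding, it can help to confirm that the OpenSSL development headers are actually present. A minimal sketch, assuming a Debian/Ubuntu-style layout where libssl-dev places openssl/ssl.h under /usr/include; the has_openssl_headers helper name is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if openssl/ssl.h exists under the given
# include root (defaults to /usr/include, an assumption about the distro layout).
has_openssl_headers() {
    [ -f "${1:-/usr/include}/openssl/ssl.h" ]
}

if has_openssl_headers; then
    echo "OpenSSL headers found"
else
    echo "OpenSSL headers missing: sudo apt install libssl-dev"
fi
```

If the headers were missing when the build was first configured, it may be necessary to re-run the cmake configure step (or start from a fresh build directory) after installing them so that OpenSSL is picked up.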

Yes, that was exactly the problem. Thanks again @eugr !

Hi eugr, is MiniMax M2.1 suitable?

Suitable for what? Not sure what you are asking about…

Is MiniMax-M2.1-NVFP4 suitable for a single DGX Spark?

No, it’s too big. The only way to run MiniMax M2.1 on a single Spark currently is using REAP versions (with reduced number of experts) or using lower quants with llama.cpp, like Unsloth’s Q3_K_XL.
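
To see why the NVFP4 checkpoint does not fit, a back-of-the-envelope estimate of weight size helps. This is a rough sketch, assuming about 230B total parameters for MiniMax M2.1, roughly 4.5 effective bits per weight for NVFP4, and 128 GB of unified memory on a Spark; weights_gb is a hypothetical helper, and KV cache and runtime overhead are ignored:

```shell
#!/bin/sh
# Rough weights-only size in decimal GB: params (billions) * bits-per-weight / 8.
# Ignores KV cache, activations, and runtime overhead.
weights_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# 230 (billion parameters) is an assumed total for MiniMax M2.1.
# 4.5 bpw approximates NVFP4 (4-bit values plus one 8-bit scale per 16 values).
weights_gb 230 4.5    # NVFP4 estimate
weights_gb 230 3.04   # a ~3 bpw quant, for comparison
```

Under these assumptions the NVFP4 weights alone come to roughly 129 GB, which already exceeds 128 GB, while a quant around 3 bpw lands near 87 GB and leaves some room for context.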

how can I do that?

Others and I have posted llama.cpp instructions here.

Install development tools:

sudo apt install clang cmake libcurl4-openssl-dev libssl-dev

Check out llama.cpp:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Build:

cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DGGML_CURL=ON -DGGML_RPC=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real
cmake --build build --config Release -j

Launch:

build/bin/llama-server -hf unsloth/MiniMax-M2.1-GGUF:Q3_K_XL -ngl 99 -fitc

(I've never used the new -fitc flag, but I believe it will try to fit the maximum possible context into your memory).

FWIW, I just ran MiniMax M2.1 at 3.04bpw (this one) on a single Spark with exllamav3.

Here’s one llama-benchy run with no attempt to tune anything, just loaded it and ran. Happy to run more if you want to suggest flags.

| model                    |   test |           t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:-------------------------|-------:|--------------:|-------------:|----------------:|----------------:|----------------:|
| MiniMaxM2.1-EXL3-3.04bpw | pp1024 | 351.83 ± 1.86 |              | 2811.21 ± 14.78 | 2805.44 ± 14.78 | 2811.21 ± 14.78 |
| MiniMaxM2.1-EXL3-3.04bpw |  tg512 |  27.62 ± 0.02 | 30.00 ± 0.00 |                 |                 |                 |

llama-benchy (0.3.1.dev2+gae09cab52)
date: 2026-02-09 00:12:00 | latency mode: api

Inference speed is pretty good for this model, but pp (prompt processing) is kinda low. Can you try with a larger prompt size?