DGX Spark performance

raphael.amorim · February 4, 2026, 3:47pm

When we have an NVFP4 version of Kimi-K2.5 with 8-bit or higher quality (maybe using QAT) and new kernels optimized for the Spark, that will be a good trigger point for expanding the cluster.

eugr · February 4, 2026, 5:21pm

It’s already a native INT4 quant (they used QAT). There is no BF16 version on Hugging Face, but it’s still almost 600GB. I wonder if @ericlewis777 can run it on his 8x setup :)

raphael.amorim · February 4, 2026, 5:25pm

@eugr Have you seen QuantTrio/Kimi-K2.5-E304 on Hugging Face?

eugr · February 4, 2026, 5:37pm

No, since I only have two Sparks, I can’t run anything larger than ~350B parameters anyway :)

raphael.amorim · February 4, 2026, 5:40pm

yet … LOL

eugr · February 4, 2026, 5:43pm

Well, maybe NVIDIA can gift me some ;)

raphael.amorim · February 4, 2026, 5:45pm

@NVIDIA I support that. That’s money well spent.

PrinceHal · February 4, 2026, 8:41pm

Hear hear!

24659818 · February 5, 2026, 3:03am

I have only one, so may I?

ericcoco · February 5, 2026, 4:40pm

Hi Eugr, and thanks for the tutorial. I already used it last month to update llama.cpp successfully. Today, after pulling the latest code and rebuilding, I’m encountering the following issue: as soon as I try to load a model with the “-hf” flag, I get this message:

build/bin/llama-server -hf unsloth/gpt-oss-120b-GGUF:Q8_K_XL -ngl 999 --jinja -c 32000 --host 0.0.0.0 -fa 1 --no-mmap --port 8080

ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
HTTPS is not supported. Please rebuild with one of:
  -DLLAMA_BUILD_BORINGSSL=ON
  -DLLAMA_BUILD_LIBRESSL=ON
  -DLLAMA_OPENSSL=ON (default, requires OpenSSL dev files installed)

I tried rebuilding with “-DLLAMA_OPENSSL=ON”, but I get the same message and no model loads when using the -hf flag. I checked the version of libcurl4-openssl-dev and it’s the latest. Do you have any suggestions?

eugr · February 5, 2026, 4:55pm

Try installing libssl-dev. They’ve switched from libcurl to libssl, so the OpenSSL dev files have to be present when you build.
sudo apt install libssl-dev
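
A quick way to confirm the fix took before rebuilding is to look for the OpenSSL headers. This is only a sketch: it assumes the Debian/Ubuntu layout, where libssl-dev places them under /usr/include/openssl, and the have_openssl_headers helper with its optional directory argument is a hypothetical name, not part of llama.cpp.

```shell
# Return success if the OpenSSL development headers exist under the given
# include directory. The default is where libssl-dev puts them on
# Debian/Ubuntu (an assumption about the distro); the optional argument
# is a hypothetical hook that makes the check easy to try elsewhere.
have_openssl_headers() {
    [ -f "${1:-/usr/include}/openssl/ssl.h" ]
}

if have_openssl_headers; then
    echo "OpenSSL headers found: re-run the cmake configure and build steps"
else
    echo "OpenSSL headers missing: sudo apt install libssl-dev"
fi
```

If the headers are reported as found but the HTTPS message persists, re-running the configure step (cmake -B build ...) before building is probably worth a try, since the build only picks up OpenSSL at configure time.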

ericcoco · February 5, 2026, 7:31pm

Yes, that was exactly the problem. Thanks again @eugr !

24659818 · February 9, 2026, 4:08am

Hi eugr, is MiniMax 2.1 suitable?

eugr · February 9, 2026, 5:12am

Suitable for what? Not sure what you are asking about…

24659818 · February 9, 2026, 5:16am

Is MiniMax-M2.1-NVFP4 suitable for a single DGX Spark?

eugr · February 9, 2026, 5:20am

No, it’s too big. The only way to run MiniMax M2.1 on a single Spark currently is using REAP versions (with reduced number of experts) or using lower quants with llama.cpp, like Unsloth’s Q3_K_XL.
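
The "too big" call can be sanity-checked with back-of-the-envelope arithmetic: weight size is roughly parameters times bits per weight divided by 8. The sketch below is only an estimate under loud assumptions: the 230 (billion) total parameter count is not from this thread, 4.5 bits approximates an NVFP4-style format (4-bit values plus one 8-bit scale per 16 weights), 3.0 stands in for a lower quant, and weights_gb is a hypothetical helper name.

```shell
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Prints the approximate size of the weights alone, in GB (10^9 bytes).
# KV cache, activations and the OS need memory on top of this.
weights_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# ASSUMPTION: 230 is an illustrative total parameter count in billions.
weights_gb 230 4.5   # NVFP4-style: 4-bit values + one 8-bit scale per 16 weights
weights_gb 230 3.0   # a roughly 3-bit quant
```

Compare the results against the 128GB of unified memory on a single Spark, keeping in mind that real checkpoints often leave some layers at higher precision, so actual file sizes can differ from this estimate.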

24659818 · February 9, 2026, 5:25am

How can I do that?

eugr · February 9, 2026, 6:07am

Me and others posted llama.cpp instructions here.

Install development tools:

sudo apt install clang cmake libcurl4-openssl-dev libssl-dev

Check out llama.cpp:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Build:

cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DGGML_CURL=ON -DGGML_RPC=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real
cmake --build build --config Release -j

Launch:

build/bin/llama-server -hf unsloth/MiniMax-M2.1-GGUF:Q3_K_XL -ngl 99 -fitc

(I’ve never used the new -fitc flag, but I believe it will try to fit maximum possible context into your memory).

syntheticprior · February 9, 2026, 8:14am

FWIW, I just ran MiniMax M2.1 at 3.04bpw (this one) on a single Spark with exllamav3.

Here’s one llama-benchy run with no attempt to tune anything, just loaded it and ran. Happy to run more if you want to suggest flags.

| model                    |   test |           t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:-------------------------|-------:|--------------:|-------------:|----------------:|----------------:|----------------:|
| MiniMaxM2.1-EXL3-3.04bpw | pp1024 | 351.83 ± 1.86 |              | 2811.21 ± 14.78 | 2805.44 ± 14.78 | 2811.21 ± 14.78 |
| MiniMaxM2.1-EXL3-3.04bpw |  tg512 |  27.62 ± 0.02 | 30.00 ± 0.00 |                 |                 |                 |

llama-benchy (0.3.1.dev2+gae09cab52)
date: 2026-02-09 00:12:00 | latency mode: api

eugr · February 9, 2026, 4:54pm

Inference speed is pretty good for this model, but pp is kinda low. Can you try with a larger prompt size?
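
To put the pp number in perspective, here is a rough sketch of how long prefill would take at longer prompts if the 351.83 t/s from the pp1024 row stayed constant. That constancy is an assumption (throughput usually changes with prompt length, which is why a real run at a larger size is more informative), and prefill_seconds is a hypothetical helper name.

```shell
# prefill_seconds PROMPT_TOKENS TOKENS_PER_SECOND
# Prints the estimated prompt-processing time in seconds, assuming the
# rate does not change with prompt length (a simplification).
prefill_seconds() {
    awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", n / r }'
}

# 351.83 t/s is the pp1024 figure from the table above; the prompt sizes
# are illustrative.
for n in 1024 8192 32000; do
    printf '%6d tokens: ~%s s\n' "$n" "$(prefill_seconds "$n" 351.83)"
done
```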
