DGX Spark performance

I tried using Ollama to run qwen3:32b and the eval rate is 9.46 tokens/s. On the DGX Dashboard, System Memory shows 24 GB and GPU Utilization shows 94%.

total duration: 3m47s

load duration: 76ms

prompt eval count: 15 token(s)

prompt eval duration: 2s

prompt eval rate: 7.18 tokens/s

eval count: 2122 token(s)

eval duration: 3m44s

eval rate: 9.46 tokens/s

And I saw that another DGX Spark's performance is significantly better than this one, like:

total duration: 5.99s

load duration: 94ms

prompt eval count: 5 token(s)

prompt eval duration: 46ms

prompt eval rate: 107 tokens/s

eval count: 254 token(s)

eval duration: 5s

eval rate: 43.73 tokens/s

So what's the difference between these two? How can I accelerate the former?

Is the second one a different model, likely qwen3:30b?

The first one is qwen3:32b, the second one is gpt-oss:20b.

Very different models. Of course the performance will be very different.

The first one is a dense model with 32B parameters, all of which are active. The second model is a sparse MoE model with 20B total parameters, of which only 3.6B are active on any given inference pass. So you are basically comparing a 32B model to a 4B model (performance-wise).
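Some back-of-the-envelope math shows why: token generation is memory-bandwidth-bound, since every generated token has to stream all active weights from memory once. A rough upper-bound sketch (assuming ~273 GB/s memory bandwidth on Spark and 4-bit weights at ~0.5 bytes per parameter; real throughput will be lower):

```shell
# Upper bound on generation speed: bandwidth / bytes of active weights.
estimate() {
  awk -v bw=273 -v params_b="$1" \
    'BEGIN { printf "%.1f t/s\n", bw / (params_b * 0.5) }'
}
estimate 32    # dense 32B model: all parameters active per token
estimate 3.6   # sparse MoE with ~3.6B active parameters per token
```

The dense model's theoretical ceiling comes out roughly an order of magnitude lower than the MoE model's, which is why the observed numbers differ so much.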

Thanks, I have other questions:

First: the DGX Spark has 128 GB of memory that is managed by a Blackwell-architecture GPU and used as high-performance memory. So why, while running qwen3:32b, does System Memory show only 24 GB with GPU Utilization at 94%? Is that reasonable?

Second: how can I improve performance using TRTLLM FP4? Will it work?

GPU utilization is likely a bug in the drivers, as we haven’t seen anything higher than 96% here on the forums.

As for RAM, Ollama sets the context to 4096 tokens by default. Given that it runs q4_k_m quant by default, the model size is around 16GB, so it will be <20GB with context.
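As a sanity check on the context overhead, the KV cache at 4096 tokens is only about 1 GB. A sketch assuming Qwen3-32B's architecture (64 layers, 8 KV heads, head dim 128, fp16 cache; these figures are assumptions, check the model card):

```shell
# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (fp16)
awk 'BEGIN {
  per_tok = 2 * 64 * 8 * 128 * 2
  printf "%.2f GB\n", per_tok * 4096 / 1e9
}'
```

So roughly 16 GB of weights plus about 1 GB of KV cache plus runtime overhead is consistent with the numbers above.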

My advice is to forget about Ollama and use llama.cpp instead. Also, switch to MoE models - dense models won’t run fast on Spark due to its relatively slow memory bandwidth (compared to dedicated VRAM).

Also, last time I checked, TRTLLM wasn’t optimized for Spark.

llama.cpp will be the fastest in generation (just don’t expect miracles: the most you can get from a 32B dense model, even in a 4-bit quant, is ~14 t/s on a single Spark). vLLM will be faster in prompt processing. Here is my compilation from last month for reference (for single and dual Sparks):

Model name | Cluster (t/s) | Single (t/s) | Comment
Qwen/Qwen3-VL-32B-Instruct-FP8 | 12.00 | 7.00 |
cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit | 21.00 | 12.00 |
GPT-OSS-120B | 55.00 | 36.00 | SGLang gives 75/53
RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 | 21.00 | N/A |
QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ | 26.00 | N/A |
Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 | 65.00 | 52.00 |
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ | 97.00 | 82.00 |
RedHatAI/Qwen3-30B-A3B-NVFP4 | 75.00 | 64.00 |
QuantTrio/MiniMax-M2-AWQ | 41.00 | N/A |
QuantTrio/GLM-4.6-AWQ | 17.00 | N/A |
zai-org/GLM-4.6V-FP8 | 24.00 | N/A |

Hi Eugr; would it be possible to share more details on the config for GPT-OSS across two DGX nodes using SGLang? I have had numerous failed attempts with GPT-OSS 20B. Are you running MXFP4 by chance? Thanks.

GPT-OSS-120B 55.00 36.00 SGLang gives 75/53

Sure. Unlike vLLM, I don’t have a neat setup for it, so launching is a bit more involved. You need to use their spark docker image. It lags behind main branch and you’ll need to apply a fix first, but it works with gpt-oss very well.

Run Docker on both nodes (assuming you have tiktoken encodings downloaded already - change the paths accordingly):

docker run --privileged --gpus all -it --rm --network host --ipc=host -v ~/.cache/huggingface:/root/.cache/huggingface -v ~/vllm/tiktoken_encodings:/tiktoken_encodings lmsysorg/sglang:spark /bin/bash

Run on the first node (spark):

NOTE: this was merged the next day after Spark build: [fix] Only enable flashinfer all reduce fusion by default for single-node servers by leejnau · Pull Request #12724 · sgl-project/sglang · GitHub
Link to patch: https://patch-diff.githubusercontent.com/raw/sgl-project/sglang/pull/12724.diff

Also, make sure you have approximately the same amount of free RAM on both Sparks, otherwise it will fail with a memory imbalance error even if you limit it with --mem-fraction-static (as the check evaluates at 0.9 of total VRAM).

Use IP addresses that are assigned to your ConnectX 7 interface!

export MN_IF_NAME=enp1s0f1np1
export UCX_NET_DEVICES=$MN_IF_NAME
export NCCL_SOCKET_IFNAME=$MN_IF_NAME
export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0
export GLOO_SOCKET_IFNAME=$MN_IF_NAME
export OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME
export NCCL_DEBUG=INFO
export NCCL_IGNORE_CPU_AFFINITY=1


apt install -y curl
curl -L https://patch-diff.githubusercontent.com/raw/sgl-project/sglang/pull/12724.diff | git apply

HF_HUB_OFFLINE=1 python3 -m sglang.launch_server --model openai/gpt-oss-120b --served-model-name openai/gpt-oss-120b --host 0.0.0.0 --port 8888 --reasoning-parser gpt-oss --tool-call-parser gpt-oss --tp 2 --dist-init-addr 192.168.177.11:20000 --nnodes 2 --node-rank 0 --mem-fraction-static 0.8

Run on the second node (spark2):

export MN_IF_NAME=enp1s0f1np1
export UCX_NET_DEVICES=$MN_IF_NAME
export NCCL_SOCKET_IFNAME=$MN_IF_NAME
export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0
export GLOO_SOCKET_IFNAME=$MN_IF_NAME
export OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME
export NCCL_DEBUG=INFO
export NCCL_IGNORE_CPU_AFFINITY=1

apt install -y curl
curl -L https://patch-diff.githubusercontent.com/raw/sgl-project/sglang/pull/12724.diff | git apply

HF_HUB_OFFLINE=1 python3 -m sglang.launch_server --model openai/gpt-oss-120b --served-model-name openai/gpt-oss-120b --host 0.0.0.0 --port 8888 --reasoning-parser gpt-oss --tool-call-parser gpt-oss --tp 2 --dist-init-addr 192.168.177.11:20000 --nnodes 2 --node-rank 1 --mem-fraction-static 0.8
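Once both nodes are up, the first node serves an OpenAI-compatible API on port 8888, so a quick smoke test looks like this (hypothetical prompt; replace localhost with the first node's address if you're calling from another machine):

```shell
curl -s http://localhost:8888/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "openai/gpt-oss-120b",
       "messages": [{"role": "user", "content": "Say hello"}]}'
```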

Thank you. I just used TRTLLM for inference with nvidia/Qwen3-30B-A3B-NVFP4 and it works, but how can I print the Single (t/s)?

You need a benchmarking tool. You can use the one from vllm, or just use my new tool: GitHub - eugr/llama-benchy: llama-benchy - llama-bench style benchmarking tool for all backends

That's fine. I use nvidia/Qwen3-30B-A3B-NVFP4 to call tools but it doesn't succeed: tool_messages is [] and the content contains the tool-call JSON. But when I use Ollama's qwen3:32b, it can use the tool and return the correct response.

So is the model the reason?

I’m not familiar with TRTLLM parameters, but generally you need to specify a tool parser/enable tool calling when running a model.
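For comparison, here is how it looks in vLLM (a sketch, not TRTLLM syntax; the parser name depends on the model's chat template - hermes is the one vLLM's docs suggest for Qwen models):

```shell
vllm serve nvidia/Qwen3-30B-A3B-NVFP4 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```

Without a tool parser, the tool-call JSON ends up in content instead of tool_calls, which matches the symptom you describe.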

How can I install and run a model with llama.cpp?

You need to compile llama.cpp first:

Install development tools:

sudo apt install clang cmake libcurl4-openssl-dev

Checkout llama.cpp

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Build:

cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DGGML_CURL=ON -DGGML_RPC=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real
cmake --build build --config Release -j

Then you can run models like this (gpt-oss, for example). Just provide a model name in the -hf parameter and it will download it from Hugging Face and cache it locally. Please note that you will need models in GGUF format; other quantization formats will not work:

build/bin/llama-server \
      -hf ggml-org/gpt-oss-120b-GGUF \
      --jinja -ngl 99 \
      --ctx-size 0 \
      -b 2048 -ub 2048 \
      -fa on \
      --temp 1.0 \
      --top-p 1.0 \
      --top-k 0 \
      --reasoning-format auto \
      --chat-template-kwargs "{\"reasoning_effort\": \"medium\"}"
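llama-server listens on port 8080 by default and exposes an OpenAI-compatible API plus a health endpoint, so you can verify it's up with (assuming the default host/port):

```shell
curl -s http://localhost:8080/health
```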

Thank you. How can I tell which models on Hugging Face are in GGUF format?

They usually have GGUF in their name.
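If you'd rather search programmatically, the Hugging Face Hub API can filter by library tag (a sketch; the gguf filter tag and the qwen3 query are assumptions to adapt, and this needs network access plus python3 for JSON parsing):

```shell
curl -s 'https://huggingface.co/api/models?filter=gguf&search=qwen3&limit=5' \
  | python3 -c 'import json, sys; [print(m["id"]) for m in json.load(sys.stdin)]'
```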

OK, thanks, I'll try it.

Hi eugr, how can I run the kimi2.5 model on DGX?

Unless you have an 8x Spark cluster, you can’t. It has 1T parameters; even in 4 bits it requires >500 GB of VRAM.
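The arithmetic: at 4 bits per parameter (0.5 bytes), 1T parameters is 500 GB of weights alone, before any KV cache or activation overhead:

```shell
awk 'BEGIN { printf "%.0f GB\n", 1e12 * 0.5 / 1e9 }'
```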