DGX Spark performance

I tried using Ollama to run qwen3:32b and the eval rate is 9.46 tokens/s. On the DGX Dashboard, System Memory shows 24 GB and GPU Utilization shows 94%.

total duration: 3m47s

load duration: 76ms

prompt eval count: 15 token(s)

prompt eval duration: 2s

prompt eval rate: 7.18 tokens/s

eval count: 2122 token(s)

eval duration: 3m44s

eval rate: 9.46 tokens/s

And I saw that another DGX Spark's performance is significantly better than this one, like:

total duration: 5.99s

load duration: 94ms

prompt eval count: 5 token(s)

prompt eval duration: 46ms

prompt eval rate: 107 tokens/s

eval count: 254 token(s)

eval duration: 5s

eval rate: 43.73 tokens/s

So what's the difference between these two? How can I accelerate the former?

Is the second one a different model, likely qwen3:30b?

The first one is qwen3:32b, the second one is gpt-oss:20b.

Very different models. Of course the performance will be very different.

The first one is a dense model with 32B parameters, all of which are active. The second model is a sparse MoE model with 20B total parameters, of which only 3.6B are active on any given inference pass. So you are basically comparing a 32B model to a 4B model (performance-wise).
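Some back-of-the-envelope math shows why: token generation is memory-bandwidth-bound, since every generated token has to stream all active weights from memory once. A rough upper-bound sketch (assuming ~273 GB/s memory bandwidth on Spark and 4-bit weights at ~0.5 bytes per parameter; real throughput will be lower):

```shell
# Upper bound on generation speed: bandwidth / bytes of active weights.
estimate() {
  awk -v bw=273 -v params_b="$1" \
    'BEGIN { printf "%.1f t/s\n", bw / (params_b * 0.5) }'
}
estimate 32    # dense 32B model: all parameters active per token
estimate 3.6   # sparse MoE with ~3.6B active parameters per token
```

The dense model's theoretical ceiling comes out roughly an order of magnitude lower than the MoE model's, which is why the observed numbers differ so much.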

Thanks, I have other questions:

First: the DGX Spark has 128 GB of memory that is managed by a Blackwell-architecture GPU and used as high-performance memory. So why, while running qwen3:32b, does System Memory show only 24 GB with GPU Utilization at 94%? Is that reasonable?

Second: how can I improve performance using TRTLLM FP4? Will it work?

GPU utilization is likely a bug in the drivers, as we haven’t seen anything higher than 96% here on the forums.

As for RAM, Ollama sets the context to 4096 tokens by default. Given that it runs q4_k_m quant by default, the model size is around 16GB, so it will be <20GB with context.
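As a sanity check on the context overhead, the KV cache at 4096 tokens is only about 1 GB. A sketch assuming Qwen3-32B's architecture (64 layers, 8 KV heads, head dim 128, fp16 cache; these figures are assumptions, check the model card):

```shell
# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (fp16)
awk 'BEGIN {
  per_tok = 2 * 64 * 8 * 128 * 2
  printf "%.2f GB\n", per_tok * 4096 / 1e9
}'
```

So roughly 16 GB of weights plus about 1 GB of KV cache plus runtime overhead is consistent with the numbers above.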

My advice is to forget about Ollama and use llama.cpp instead. Also, switch to MoE models - dense models won’t run fast on Spark due to its relatively slow memory bandwidth (compared to dedicated VRAM).

Also, last time I checked, TRTLLM wasn’t optimized for Spark.

llama.cpp will be the fastest in generation (just don’t expect miracles: the most you can get from a 32B dense model, even in a 4-bit quant, is ~14 t/s on a single Spark). vLLM will be faster in prompt processing. Here is my compilation from last month for reference (for single and dual Sparks):

Model name | Cluster (t/s) | Single (t/s) | Comment
Qwen/Qwen3-VL-32B-Instruct-FP8 | 12.00 | 7.00 |
cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit | 21.00 | 12.00 |
GPT-OSS-120B | 55.00 | 36.00 | SGLang gives 75/53
RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 | 21.00 | N/A |
QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ | 26.00 | N/A |
Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 | 65.00 | 52.00 |
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ | 97.00 | 82.00 |
RedHatAI/Qwen3-30B-A3B-NVFP4 | 75.00 | 64.00 |
QuantTrio/MiniMax-M2-AWQ | 41.00 | N/A |
QuantTrio/GLM-4.6-AWQ | 17.00 | N/A |
zai-org/GLM-4.6V-FP8 | 24.00 | N/A |

Hi Eugr; would it be possible to share more details on the config for GPT-OSS across two DGX nodes using SGLang? I have had numerous failed attempts with GPT-OSS 20B. Are you running MXFP4 by chance? Thanks.

GPT-OSS-120B 55.00 36.00 SGLang gives 75/53

Sure. Unlike vLLM, I don’t have a neat setup for it, so launching is a bit more involved. You need to use their spark docker image. It lags behind main branch and you’ll need to apply a fix first, but it works with gpt-oss very well.

Run Docker on both nodes (assuming you have tiktoken encodings downloaded already - change the paths accordingly):

docker run --privileged --gpus all -it --rm --network host --ipc=host -v ~/.cache/huggingface:/root/.cache/huggingface -v ~/vllm/tiktoken_encodings:/tiktoken_encodings lmsysorg/sglang:spark /bin/bash

Run on the first node (spark):

NOTE: this was merged the next day after Spark build: [fix] Only enable flashinfer all reduce fusion by default for single-node servers by leejnau · Pull Request #12724 · sgl-project/sglang · GitHub
Link to patch: https://patch-diff.githubusercontent.com/raw/sgl-project/sglang/pull/12724.diff

Also, make sure you have approximately the same amount of free RAM on both Sparks, otherwise it will fail with a memory imbalance error even if you limit it with --mem-fraction-static (as the check evaluates at 0.9 of total VRAM).

Use IP addresses that are assigned to your ConnectX 7 interface!

export MN_IF_NAME=enp1s0f1np1
export UCX_NET_DEVICES=$MN_IF_NAME
export NCCL_SOCKET_IFNAME=$MN_IF_NAME
export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0
export GLOO_SOCKET_IFNAME=$MN_IF_NAME
export OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME
export NCCL_DEBUG=INFO
export NCCL_IGNORE_CPU_AFFINITY=1


apt install -y curl
curl -L https://patch-diff.githubusercontent.com/raw/sgl-project/sglang/pull/12724.diff | git apply

HF_HUB_OFFLINE=1 python3 -m sglang.launch_server --model openai/gpt-oss-120b --served-model-name openai/gpt-oss-120b --host 0.0.0.0 --port 8888 --reasoning-parser gpt-oss --tool-call-parser gpt-oss --tp 2 --dist-init-addr 192.168.177.11:20000 --nnodes 2 --node-rank 0 --mem-fraction-static 0.8

Run on the second node (spark2):

export MN_IF_NAME=enp1s0f1np1
export UCX_NET_DEVICES=$MN_IF_NAME
export NCCL_SOCKET_IFNAME=$MN_IF_NAME
export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0
export GLOO_SOCKET_IFNAME=$MN_IF_NAME
export OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME
export NCCL_DEBUG=INFO
export NCCL_IGNORE_CPU_AFFINITY=1

apt install -y curl
curl -L https://patch-diff.githubusercontent.com/raw/sgl-project/sglang/pull/12724.diff | git apply

HF_HUB_OFFLINE=1 python3 -m sglang.launch_server --model openai/gpt-oss-120b --served-model-name openai/gpt-oss-120b --host 0.0.0.0 --port 8888 --reasoning-parser gpt-oss --tool-call-parser gpt-oss --tp 2 --dist-init-addr 192.168.177.11:20000 --nnodes 2 --node-rank 1 --mem-fraction-static 0.8
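Once both nodes are up, the first node serves an OpenAI-compatible API on port 8888, so a quick smoke test looks like this (hypothetical prompt; replace localhost with the first node's address if you're calling from another machine):

```shell
curl -s http://localhost:8888/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "openai/gpt-oss-120b",
       "messages": [{"role": "user", "content": "Say hello"}]}'
```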

Thank you. I just used TRTLLM for inference with nvidia/Qwen3-30B-A3B-NVFP4 and it works, but how can I print the Single (t/s)?

You need a benchmarking tool. You can use the one from vllm, or just use my new tool: GitHub - eugr/llama-benchy: llama-benchy - llama-bench style benchmarking tool for all backends

That's fine. I use nvidia/Qwen3-30B-A3B-NVFP4 to call tools but it doesn't succeed: tool_messages is [] and the content contains the tool-call JSON. But when I use Ollama's qwen3:32b, it can use the tool and return the correct response.

So is the model the reason?

I’m not familiar with TRTLLM parameters, but generally you need to specify a tool parser/enable tool calling when running a model.
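For comparison, here is how it looks in vLLM (a sketch, not TRTLLM syntax; the parser name depends on the model's chat template - hermes is the one vLLM's docs suggest for Qwen models):

```shell
vllm serve nvidia/Qwen3-30B-A3B-NVFP4 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```

Without a tool parser, the tool-call JSON ends up in content instead of tool_calls, which matches the symptom you describe.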

How can I install and run a model with llama.cpp?

You need to compile llama.cpp first:

Install development tools:

sudo apt install clang cmake libcurl4-openssl-dev

Checkout llama.cpp

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Build:

cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DGGML_CURL=ON -DGGML_RPC=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real
cmake --build build --config Release -j

Then you can run models like this (gpt-oss, for example). Just provide a model name in the -hf parameter and it will download it from Hugging Face and cache it locally. Please note that you will need models in GGUF format; other quantization formats will not work:

build/bin/llama-server \
      -hf ggml-org/gpt-oss-120b-GGUF \
      --jinja -ngl 99 \
      --ctx-size 0 \
      -b 2048 -ub 2048 \
      -fa on \
      --temp 1.0 \
      --top-p 1.0 \
      --top-k 0 \
      --reasoning-format auto \
      --chat-template-kwargs "{\"reasoning_effort\": \"medium\"}"
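llama-server listens on port 8080 by default and exposes an OpenAI-compatible API plus a health endpoint, so you can verify it's up with (assuming the default host/port):

```shell
curl -s http://localhost:8080/health
```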

Thank you. How can I tell which models on Hugging Face are in GGUF format?

They usually have GGUF in their name.
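If you'd rather search programmatically, the Hugging Face Hub API can filter by library tag (a sketch; the gguf filter tag and the qwen3 query are assumptions to adapt, and this needs network access plus python3 for JSON parsing):

```shell
curl -s 'https://huggingface.co/api/models?filter=gguf&search=qwen3&limit=5' \
  | python3 -c 'import json, sys; [print(m["id"]) for m in json.load(sys.stdin)]'
```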

OK, thanks, I'll try it.

Hi eugr, how can I run the kimi2.5 model on DGX?

Unless you have an 8x Spark cluster, you can’t. It has 1T parameters; even in 4 bits it requires >500 GB of VRAM.
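The arithmetic: at 4 bits per parameter (0.5 bytes), 1T parameters is 500 GB of weights alone, before any KV cache or activation overhead:

```shell
awk 'BEGIN { printf "%.0f GB\n", 1e12 * 0.5 / 1e9 }'
```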