I'm trying to use Ollama to run qwen3:32b and the eval rate is 9.46 tokens/s. On the DGX Dashboard, System Memory shows 24 GB and GPU Utilization shows 94%.
total duration: 3m47s
load duration: 76ms
prompt eval count: 15 token(s)
prompt eval duration: 2s
prompt eval rate: 7.18 tokens/s
eval count: 2122 token(s)
eval duration: 3m44s
eval rate: 9.46 tokens/s
And I've seen another DGX Spark whose performance is significantly better than this one, like:
total duration: 5.99s
load duration: 94ms
prompt eval count: 5 token(s)
prompt eval duration: 46ms
prompt eval rate: 107 tokens/s
eval count: 254 token(s)
eval duration: 5s
eval rate: 43.73 tokens/s
So what’s the difference between these two, and how can I accelerate the former?
The first one is a dense model with 32B parameters, all of which are active. The second is a sparse MoE model with 20B total parameters, of which only 3.6B are active on any given inference pass. So performance-wise, you are basically comparing a 32B model to a ~4B model.
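As a back-of-the-envelope check (a sketch, assuming generation is roughly memory-bandwidth-bound, so tokens/s scales inversely with the active-parameter bytes read per token):

```python
# Rough speedup estimate: dense 32B vs. MoE with ~3.6B active parameters.
# If decoding is bandwidth-bound, speed scales with the inverse of the
# weights streamed per token (same quantization assumed for both).
dense_active_b = 32.0   # qwen3:32b - all parameters active
moe_active_b = 3.6      # gpt-oss-20b - active parameters per token

expected_speedup = dense_active_b / moe_active_b
print(f"expected speedup: ~{expected_speedup:.1f}x")

measured_speedup = 43.73 / 9.46  # eval rates from the two runs above
print(f"measured speedup: ~{measured_speedup:.1f}x")
```

The measured ~4.6x is lower than the ideal ~8.9x because per-token overhead (attention, KV cache reads, kernel launches) doesn't shrink with the active-parameter count, but the direction matches.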
My first question: the DGX Spark has 128 GB of memory managed by a Blackwell-architecture GPU and used as high-performance unified memory. So why, while running qwen3:32b, is System Memory only 24 GB with GPU Utilization at 94%? Is that reasonable?
My second question: can I improve performance using TRT-LLM with FP4? Will it work?
GPU utilization is likely a bug in the drivers, as we haven’t seen anything higher than 96% here on the forums.
As for RAM, Ollama sets the context to 4096 tokens by default. Given that it runs q4_k_m quant by default, the model size is around 16GB, so it will be <20GB with context.
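If you do stay on Ollama, you can raise that 4096-token default via a Modelfile (a sketch - the 8192 value is just an example; a larger context will use more RAM):

```
# Modelfile - extends the stock qwen3:32b with a larger context window
FROM qwen3:32b
PARAMETER num_ctx 8192
```

Then build and run it with `ollama create qwen3-32b-8k -f Modelfile` followed by `ollama run qwen3-32b-8k`.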
My advice is to forget about Ollama and use llama.cpp instead. Also, switch to MoE models - dense models won’t run fast on Spark due to its relatively slow memory bandwidth (compared to dedicated VRAM).
Also, last time I checked, TRTLLM wasn’t optimized for Spark.
llama.cpp will be the fastest in generation (just don’t expect miracles - the most you can get from a 32B dense model, even in a 4-bit quant, is ~14 t/s on a single Spark). vLLM will be faster in prompt processing. Here is my compilation from last month for reference (for single and dual Sparks):
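That ~14 t/s ceiling is easy to sanity-check: on a bandwidth-bound machine, every generated token streams all the weights from memory once, so decode speed is roughly bandwidth divided by model size (a sketch - 273 GB/s is the published DGX Spark memory bandwidth, and ~16 GB is an approximate 4-bit quant size for a 32B model):

```python
# Upper bound on decode speed for a memory-bandwidth-bound dense model:
# each token requires one full pass over the quantized weights.
bandwidth_gb_s = 273.0   # DGX Spark LPDDR5x bandwidth (published spec)
weights_gb = 16.0        # approx. 32B model in a 4-bit quant

max_tokens_per_s = bandwidth_gb_s / weights_gb
print(f"theoretical ceiling: ~{max_tokens_per_s:.0f} t/s")
```

Real throughput lands a bit below that ~17 t/s ceiling once KV-cache reads and kernel overhead are accounted for, which is consistent with the ~14 t/s observed.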
Hi Eugr, would it be possible to share more details on the config for GPT-OSS across two DGX nodes using SGLang? I have had numerous failed attempts with GPT-OSS 20b. Are you running MXFP4 by chance? Thanks.
Sure. Unlike vLLM, I don’t have a neat setup for it, so launching is a bit more involved. You need to use their Spark docker image. It lags behind the main branch and you’ll need to apply a fix first, but it works with gpt-oss very well.
Run Docker on both nodes (assuming you have tiktoken encodings downloaded already - change the paths accordingly):
docker run --privileged --gpus all -it --rm --network host --ipc=host -v ~/.cache/huggingface:/root/.cache/huggingface -v ~/vllm/tiktoken_encodings:/tiktoken_encodings lmsysorg/sglang:spark /bin/bash
Also, make sure you have approximately the same amount of free RAM on both Sparks, otherwise it will fail with a memory imbalance error even if you limit it with mem-fraction-static (as the check evaluates at 0.9 of total VRAM).
Use IP addresses that are assigned to your ConnectX 7 interface!
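Inside the container, the launch looks roughly like this (a sketch, not my exact config - the model path, IP, and ports are placeholders; swap in your own ConnectX 7 address and run with the matching `--node-rank` on each Spark):

```shell
# Node 0 (first Spark) - replace 192.168.100.1 with your ConnectX 7 IP
python3 -m sglang.launch_server \
  --model-path openai/gpt-oss-20b \
  --tp 2 --nnodes 2 --node-rank 0 \
  --dist-init-addr 192.168.100.1:50000 \
  --mem-fraction-static 0.7 \
  --host 0.0.0.0 --port 30000

# Node 1 (second Spark) - same command, only the rank changes
python3 -m sglang.launch_server \
  --model-path openai/gpt-oss-20b \
  --tp 2 --nnodes 2 --node-rank 1 \
  --dist-init-addr 192.168.100.1:50000 \
  --mem-fraction-static 0.7
```

Once both nodes connect, the OpenAI-compatible endpoint is served from node 0 on the port you passed.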
That’s fine. I use nvidia/Qwen3-30B-A3B-NVFP4 to call tools, but it doesn’t succeed: the tool_messages is [] and the content contains the tool-call JSON. When I use Ollama’s qwen3:32b, it can use the tool and return the correct response.
Then you can run models like this (gpt-oss, for example). Just provide a model name in the -hf parameter and it will be downloaded from HuggingFace and cached locally. Please note that you will need models in GGUF format; other quantization formats will not work:
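For example (a sketch - `ggml-org/gpt-oss-20b-GGUF` is one possible GGUF repo; substitute whichever GGUF source you prefer):

```shell
# Pull the GGUF from HuggingFace (cached locally on first run) and serve it
llama-server -hf ggml-org/gpt-oss-20b-GGUF --port 8080
```

The server then exposes an OpenAI-compatible API on the given port, so you can point existing clients at http://localhost:8080.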