Hi NVIDIA members,
I tested the inference performance of Qwen3 235B A22B on two DGX Spark systems. Following the steps below, I obtained the results shown. Could you please let me know whether these numbers look reasonable?
That inference speed does seem slow. For comparison, I get ~25 tokens/sec on my two-node cluster using QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ in vLLM.
A few things to keep in mind:
The TRT-LLM container you are using is outdated; there is a newer version available.
NVFP4 support on DGX Spark is still lacking as of today; you’ll get noticeably better performance using AWQ quants (which actually have slightly better accuracy, since they are activation-aware and keep activations at 16 bits).
vLLM is the best way to run LLMs on Spark. You can use either NVIDIA’s 25.11-py3 container or one of the community builds here if you want the latest vLLM features not yet supported in 0.11.2.
Hi @eugr
As you mentioned, 15 tokens/s is indeed too slow, which is why I wanted to confirm whether the data was correct. After re-validating with the approach you suggested, the results now exceed 20 tokens/s, which meets expectations.
Thank you for your suggestions and explanations.
The TRT-LLM container you are using is outdated; there is a newer version available.
[Turtle7777] I verified two DGX Spark systems by following NVIDIA’s TensorRT-LLM SOP. As you suggested, I removed the TensorRT-LLM container and downloaded it again, but the version is still TensorRT-LLM 1.0.0rc3. After re-running the tests, the performance now matches what I previously saw online, i.e., above 20 tokens/s.
I am not sure whether this improvement is related to the kernel version change from 6.14.0-1013-nvidia to 6.14.0-1015-nvidia. Please refer to the data and log files below.
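As a quick cross-check of throughput figures like these, here is a minimal sketch that recomputes decode tokens/s from a token count and wall-clock time. The numbers in the example are illustrative, not taken from my logs, and the helper name is my own, not part of TensorRT-LLM:

```python
# Hedged sketch: recompute decode throughput from raw counts so a reported
# tokens/s figure can be sanity-checked by hand. Field names and example
# values are assumptions, not the actual TensorRT-LLM log format.

def tokens_per_second(generated_tokens: int, elapsed_s: float) -> float:
    """Decode throughput: generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return generated_tokens / elapsed_s

# Illustrative example: 1299 tokens generated in 60 s is 21.65 tok/s.
print(round(tokens_per_second(1299, 60.0), 2))  # → 21.65
```

Comparing a figure recomputed this way against the benchmark summary line is a cheap way to catch a misread unit (e.g., per-request vs. aggregate throughput).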
NVFP4 support on DGX Spark is still lacking as of today; you’ll get noticeably better performance using AWQ quants (which actually have slightly better accuracy, since they are activation-aware and keep activations at 16 bits).
vLLM is the best way to run LLMs on Spark. You can use either NVIDIA’s 25.11-py3 container or one of the community builds here if you want the latest vLLM features not yet supported in 0.11.2.
[Turtle7777] I will further follow your suggestions and verify the performance of multiple DGX Spark systems using vLLM and AWQ quants, to see whether the performance still meets expectations.
Hi @vgoklani,
After switching to the latest container, nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5, I was able to run the tests, but I noticed the following errors and am not sure whether they affect the results. I saw the message “triton is not supported on current platform, roll back to CPU” as well as the backtrace shown below.
Does this error mean the test is falling back to the CPU? The measured performance is still 21.65 tokens/s. Is there anything I should change in the command or configuration to obtain more accurate results?
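One quick way to narrow this down is to check from inside the container whether the Triton compiler is importable at all. This is only a sketch of that check; the warning itself may only mean that some auxiliary kernel falls back to a CPU implementation, not that the whole model runs on CPU:

```python
# Hedged sketch: check whether the "triton" package can be imported in the
# current environment. If it cannot, a "triton is not supported" warning is
# expected and may affect only the specific op that wanted a Triton kernel.
import importlib.util

def triton_available() -> bool:
    """Return True if the triton package is importable here."""
    return importlib.util.find_spec("triton") is not None

print("triton importable:", triton_available())
```

If this prints False inside the container, the warning is consistent with the platform simply lacking a Triton build, and the reported tokens/s (measured end to end) would still reflect GPU execution for the main model.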