Question on Inference Performance Results of Qwen3 235B A22B on 2× DGX Spark

Hi NVIDIA team,
I tested the inference performance of Qwen3 235B A22B on two DGX Spark systems. Following the steps below, I obtained the results shown here. Could you please let me know whether these numbers look reasonable?

Thank you.

  • Test log:

multi-node_test_log.txt (24.6 KB)

  • Test results (4 runs):

  • Test procedure:

Instructions to run Qwen3 235B A22B on 2× DGX Spark

Make the attached trtllm-mn-entrypoint.sh executable:

chmod +x ./trtllm-mn-entrypoint.sh

Run this command on both Spark nodes to start the TensorRT-LLM containers with proper networking and GPU access:

docker run --name trtllm --rm -d \
  --gpus all --network host --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -e UCX_NET_DEVICES=enp1s0f0np0,enp1s0f1np1 \
  -e NCCL_SOCKET_IFNAME=enp1s0f0np0,enp1s0f1np1 \
  -e OMPI_MCA_btl_tcp_if_include=enp1s0f0np0,enp1s0f1np1 \
  -e OMPI_ALLOW_RUN_AS_ROOT=1 \
  -e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
  -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
  -v ./trtllm-mn-entrypoint.sh:/opt/trtllm-mn-entrypoint.sh \
  -v ~/.ssh:/tmp/.ssh:ro \
  --entrypoint /opt/trtllm-mn-entrypoint.sh \
  nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3
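Before launching the containers, it can save debugging time to confirm that the two interface names passed to UCX_NET_DEVICES / NCCL_SOCKET_IFNAME actually exist on each node. A minimal sketch (the interface names below are the ones from the command above; they may differ on your system, so check `ip link` first):

```shell
# Sketch: verify the NIC names passed to UCX/NCCL exist on this node.
# These are the names used in the docker run command above - adjust if
# `ip link` reports different names on your DGX Spark.
IFACES="enp1s0f0np0,enp1s0f1np1"
for ifc in $(echo "$IFACES" | tr ',' ' '); do
  if ip link show "$ifc" >/dev/null 2>&1; then
    echo "$ifc: present"
  else
    echo "$ifc: missing - fix UCX_NET_DEVICES/NCCL_SOCKET_IFNAME" >&2
  fi
done
```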

Create the extra LLM API config inside the container, from one of the two DGX Spark systems:

docker exec trtllm bash -c 'cat <<EOF > /tmp/extra-llm-api-config.yml
print_iter_log: false
kv_cache_config:
  dtype: "fp8"
  free_gpu_memory_fraction: 0.9
cuda_graph_config:
  enable_padding: true
EOF'
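The config step above uses a shell here-doc, which is easy to get wrong (the redirection must point the here-doc body into the file, and the YAML nesting needs indentation). A standalone sketch of the same pattern, writing the YAML to a temp file and verifying it round-trips:

```shell
# Sketch: write the extra LLM API options with a here-doc and verify.
# Quoting the delimiter ('EOF') prevents shell variable expansion inside
# the body, which is what you want for a literal YAML file.
CFG=$(mktemp)
cat <<'EOF' > "$CFG"
print_iter_log: false
kv_cache_config:
  dtype: "fp8"
  free_gpu_memory_fraction: 0.9
cuda_graph_config:
  enable_padding: true
EOF
grep -q 'free_gpu_memory_fraction: 0.9' "$CFG" && echo "config written"
rm -f "$CFG"
# prints: config written
```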

Initiate the LLM benchmarking for Qwen3 235B from one of the two DGX Spark systems. Specify your Hugging Face token so the model can be downloaded:

export HF_TOKEN=

docker exec \
  -e ISL=128 -e OSL=128 \
  -e MODEL="nvidia/Qwen3-235B-A22B-FP4" \
  -e HF_TOKEN=$HF_TOKEN \
  -it trtllm bash -c '
mpirun -x HF_TOKEN=$HF_TOKEN -np 2 -H 192.168.1.10:1,192.168.1.11:1 bash -c "huggingface-cli download $MODEL" &&
mpirun -x HF_TOKEN=$HF_TOKEN -np 2 -H 192.168.1.10:1,192.168.1.11:1 bash -c "python benchmarks/cpp/prepare_dataset.py --tokenizer=$MODEL --stdout token-norm-dist --num-requests=1 --input-mean=$ISL --output-mean=$OSL --input-stdev=0 --output-stdev=0 > /tmp/dataset.txt" &&
mpirun -x HF_TOKEN=$HF_TOKEN -np 2 -H 192.168.1.10:1,192.168.1.11:1 trtllm-llmapi-launch trtllm-bench -m $MODEL throughput \
  --tp 2 \
  --dataset /tmp/dataset.txt \
  --backend pytorch \
  --max_num_tokens 4096 \
  --concurrency 1 \
  --max_batch_size 4 \
  --extra_llm_api_options /tmp/extra-llm-api-config.yml \
  --streaming'
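When reading throughput numbers at concurrency 1, it can help to convert tokens/s to per-token decode latency. A quick sketch of the arithmetic (the sample value is illustrative, not a measured result):

```shell
# Sketch: convert decode throughput (tokens/s) to per-token latency (ms).
# Example value only - substitute the number reported by trtllm-bench.
tok_per_s=21.65
awk -v t="$tok_per_s" 'BEGIN { printf "%.1f ms/token\n", 1000.0 / t }'
# prints: 46.2 ms/token
```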

Hi, I'd encourage the community to weigh in on performance expectations, as we don't comment beyond the workloads already published here: How NVIDIA DGX Spark's Performance Enables Intensive AI Tasks | NVIDIA Technical Blog

The inference speed does seem slow. I get ~25 tokens/sec on my two-node cluster using QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ in vLLM.

A few things to keep in mind:

  • The TRT-LLM container you are using is outdated; a newer version is available.
  • NVFP4 support on DGX Spark is still lacking, as of today at least; you'll get noticeably better performance using AWQ quants (which actually have slightly better accuracy, since they are activation-aware and keep activations at 16 bits).
  • vLLM is the best way to run LLMs on Spark. You can either use NVIDIA's 25.11-py3 container, or one of the community builds here if you want the latest vLLM features not supported in 0.11.2.

Hi @eugr
As you mentioned, 15 tokens/s is indeed too slow, which is why I wanted to clarify whether the data was correct. After re-validating using the approach you suggested, the results now reach over 20 tokens/s, which meets expectations.

Thank you for your suggestions and explanations.

  • The TRT-LLM container you are using is outdated; a newer version is available.
    [Turtle7777] I verified the two DGX Spark systems by following NVIDIA’s TensorRT-LLM SOP. As you suggested, I removed the TensorRT-LLM container and downloaded it again, but the version is still 1.0.0rc3. After re-running the tests, the performance now matches what I previously saw online, i.e., above 20 tokens/s.
    I am not sure whether this improvement is related to the kernel version change from 6.14.0-1013-nvidia to 6.14.0-1015-nvidia. Please refer to the data and log files below.
  • Environments:
    ========================
    EC: 2.75.3.3
    SOC FW Version: 3.0.4
    PD0 FW1: 5.0
    PD1 FW1: 5.0
    GOP Driver Version: 9000AE0
    DGX SPARK OS: 7.3.1 2025-11-12-09-12-21
    Kernel: 6.14.0-1015-nvidia
    SSD: 4TB Gen4 Phison
    =========================
    20251219_multinode_test_log.txt (27.2 KB)
  • NVFP4 support on DGX Spark is still lacking, as of today at least; you'll get noticeably better performance using AWQ quants (which actually have slightly better accuracy, since they are activation-aware and keep activations at 16 bits).
  • vLLM is the best way to run LLMs on Spark. You can either use NVIDIA's 25.11-py3 container, or one of the community builds here if you want the latest vLLM features not supported in 0.11.2.
    [Turtle7777] I will further follow your suggestions and verify the performance of multiple DGX Spark systems using vLLM and AWQ quants, to see whether the performance still meets expectations.

The container images are published here:

This is the latest version: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5

Hi @vgoklani,
After switching to the latest container, nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5, I was able to run the tests, but I noticed the errors below and am not sure whether they affect the results. I saw the message "Triton is not supported on current platform, roll back to CPU" as well as the backtrace shown below.

Does this error mean the test is falling back to the CPU? The measured performance is still 21.65 tokens/s. Is there anything that still needs to be changed in the command or configuration to obtain accurate results?

Thank you.

20251219_twonode_test_log_1.2.0rc5.txt (61.1 KB)

W1219 07:19:56.237000 1900 torch/utils/cpp_extension.py:2422] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
/tmp/tmpvif_h2p3/cuda_utils.c:1:10: fatal error: cuda.h: No such file or directory
    1 | #include "cuda.h"
      |          ^~~~~~~~
compilation terminated.
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/fla/utils.py:216: UserWarning: Triton is not supported on current platform, roll back to CPU.
  warnings.warn(
/tmp/tmprc0xglrj/cuda_utils.c:1:10: fatal error: cuda.h: No such file or directory
    1 | #include "cuda.h"
      |          ^~~~~~~~
compilation terminated.
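The `cuda.h: No such file or directory` failures above come from torch's JIT extension builder not finding the CUDA headers in the container. A quick sketch to check whether they are visible (standard CUDA install layout assumed; adjust CUDA_HOME if your container differs):

```shell
# Sketch: check whether CUDA headers are visible where torch's
# cpp_extension JIT builder typically looks (default /usr/local/cuda).
CUDA_HOME=${CUDA_HOME:-/usr/local/cuda}
if [ -f "$CUDA_HOME/include/cuda.h" ]; then
  echo "cuda.h found under $CUDA_HOME/include"
else
  echo "cuda.h not found under $CUDA_HOME/include" >&2
fi
```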

[12/19/2025-07:25:49] [TRT-LLM] [RANK 0] [W] [Autotuner] Failed when profiling runner=<tensorrt_llm._torch.custom_ops.torch_custom_ops.MoERunner object at 0xf3836e0c6330>, tactic=6, shapes=[torch.Size([1, 2048]), torch.Size([128, 1536, 256]), torch.Size([0]), torch.Size([128, 4096, 48]), torch.Size([0])]. Error: [TensorRT-LLM][ERROR] Assertion failed: Failed to initialize cutlass TMA WS grouped gemm. Error: Error Internal (tensorrt_llm/kernels/cutlass_kernels/cutlass_instantiations/gemm_grouped/120/cutlass_kernel_file_gemm_grouped_sm120_M128_BS_group2.generated.cu:39)
1 0xf387a2495ec8 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 120
2 0xf387a50fe114 /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libth_common.so(+0x328e114) [0xf387a50fe114]
3 0xf387a50fe4e4 void tensorrt_llm::kernels::cutlass_kernels_oss::tma_warp_specialized_generic_moe_gemm_kernelLauncher<cutlass::arch::Sm120, __nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, void, tensorrt_llm::cutlass_extensions::EpilogueOpDefault, (tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput::EpilogueFusion)3, cute::tuple<cute::C<128>, cute::C<256>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> >, false, false, false, false>(tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput, int, int, CUstream_st*, int*, unsigned long*, cute::tuple<int, int, cute::C<1> >, cute::tuple<int, int, cute::C<1> >) + 84
4 0xf387a37cd488 void tensorrt_llm::kernels::cutlass_kernels_oss::dispatchMoeGemmSelectClusterShapeTmaWarpSpecialized<cutlass::arch::Sm120, __nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, tensorrt_llm::cutlass_extensions::EpilogueOpDefault, (tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput::EpilogueFusion)3, cute::tuple<cute::C<128>, cute::C<256>, cute::C<128> > >(tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput, int, tensorrt_llm::cutlass_extensions::CutlassGemmConfig, int, CUstream_st*, int*, unsigned long*) + 216
5 0xf387a37ce1b8 void tensorrt_llm::kernels::cutlass_kernels_oss::dispatchMoeGemmSelectTileShapeTmaWarpSpecialized<__nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, tensorrt_llm::cutlass_extensions::EpilogueOpDefault, (tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput::EpilogueFusion)3>(tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput, int, tensorrt_llm::cutlass_extensions::CutlassGemmConfig, int, CUstream_st*, int*, unsigned long*) + 1000
6 0xf387a37b143c void tensorrt_llm::kernels::cutlass_kernels::MoeGemmRunner<__nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, __nv_bfloat16>::dispatchToArch<tensorrt_llm::cutlass_extensions::EpilogueOpDefault>(tensorrt_llm::kernels::cutlass_kernels::GroupedGemmInput<__nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, __nv_bfloat16>, tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput) + 236
7 0xf387a37b1f10 tensorrt_llm::kernels::cutlass_kernels::MoeGemmRunner<__nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, __nv_bfloat16>::moeGemmBiasAct(tensorrt_llm::kernels::cutlass_kernels::GroupedGemmInput<__nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, __nv_bfloat16>, tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput) + 272
8 0xf387a3797ce8 tensorrt_llm::kernels::cutlass_kernels::CutlassMoeFCRunner<__nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, __nv_fp4_e2m1, __nv_bfloat16, void>::gemm2(tensorrt_llm::kernels::cutlass_kernels::MoeGemmRunner<__nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, __nv_bfloat16>&, tensorrt_llm::kernels::fp8_blockscale_gemm::CutlassFp8BlockScaleGemmRunnerInterface*, __nv_fp4_e2m1 const*, void*, __nv_bfloat16*, long const*, tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput, __nv_fp4_e2m1 const*, __nv_bfloat16 const*, __nv_bfloat16 const*, float const*, unsigned char const*, tensorrt_llm::kernels::cutlass_kernels::QuantParams, float const*, float const*, int const*, int const*, int const*, long const*, long, long, long, long, long, long, int, long, float const**, bool, void*, CUstream_st*, tensorrt_llm::kernels::cutlass_kernels::MOEParallelismConfig, bool, tensorrt_llm::cutlass_extensions::CutlassGemmConfig, bool, int*, int*) + 744
9 0xf387a3799478 tensorrt_llm::kernels::cutlass_kernels::CutlassMoeFCRunner<__nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, __nv_fp4_e2m1, __nv_bfloat16, void>::gemm2(void const*, void*, void*, long const*, tensorrt_llm::kernels::cutlass_kernels::TmaWarpSpecializedGroupedGemmInput, void const*, void const*, void const*, float const*, unsigned char const*, tensorrt_llm::kernels::cutlass_kernels::QuantParams, float const*, float const*, int const*, int const*, int const*, long const*, long, long, long, long, long, long, int, long, float const**, bool, void*, bool, CUstream_st*, tensorrt_llm::kernels::cutlass_kernels::MOEParallelismConfig, bool, tensorrt_llm::cutlass_extensions::CutlassGemmConfig, bool, int*, int*) + 408
10 0xf387a36c7c8c tensorrt_llm::kernels::cutlass_kernels::GemmProfilerBackend::runProfiler(int, tensorrt_llm::cutlass_extensions::CutlassGemmConfig const&, char*, void const*, CUstream_st* const&) + 2776
11 0xf387a2894cec torch_ext::FusedMoeRunner::runGemmProfile(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor const&, at::Tensor const&, std::optionalat::Tensor const&, long, long, long, long, long, long, long, bool, bool, long, long, bool, long, long) + 476
12 0xf387a28a4de0 std::Function_handler<void (std::vector<c10::IValue, std::allocatorc10::IValue >&), torch::class<torch_ext::FusedMoeRunner>::defineMethod<torch::detail::WrapMethod<void (torch_ext::FusedMoeRunner::)(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor const&, at::Tensor const&, std::optionalat::Tensor const&, long, long, long, long, long, long, long, bool, bool, long, long, bool, long, long)> >(std::__cxx11::basic_string<char, std::char_traits, std::allocator >, torch::detail::WrapMethod<void (torch_ext::FusedMoeRunner::)(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor const&, at::Tensor const&, std::optionalat::Tensor const&, long, long, long, long, long, long, long, bool, bool, long, long, bool, long, long)>, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::initializer_listtorch::arg)::{lambda(std::vector<c10::IValue, std::allocatorc10::IValue >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocatorc10::IValue >&) + 576
13 0xf38907ecaea0 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xceaea0) [0xf38907ecaea0]
14 0xf38907ecb650 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xceb650) [0xf38907ecb650]
15 0xf38907fac478 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xdcc478) [0xf38907fac478]
16 0xf38907faca70 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xdcca70) [0xf38907faca70]
17 0xf3890778eb40 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x5aeb40) [0xf3890778eb40]
18 0x503454 python3() [0x503454]
19 0x4c2d1c _PyObject_MakeTpCall + 124
20 0x4c6f88 python3() [0x4c6f88]
21 0x528ab4 python3() [0x528ab4]
22 0x4c2d1c _PyObject_MakeTpCall + 124
23 0x563824 _PyEval_EvalFrameDefault + 2208
24 0x4c6ee8 python3() [0x4c6ee8]
25 0x4c5278 PyObject_Call + 280
26 0x566dd4 _PyEval_EvalFrameDefault + 15952
27 0x4c48b4 _PyObject_Call_Prepend + 436
28 0x528970 python3() [0x528970]
29 0x4c51cc PyObject_Call + 108
30 0x566dd4 _PyEval_EvalFrameDefault + 15952
31 0x4c6ee8 python3() [0x4c6ee8]
32 0x4c5278 PyObject_Call + 280
33 0x566dd4 _PyEval_EvalFrameDefault + 15952
34 0x4c6ee8 python3() [0x4c6ee8]
35 0x4c5278 PyObject_Call + 280
36 0x566dd4 _PyEval_EvalFrameDefault + 15952