TRT LLM for Inference - two Sparks example is VERY slow

vgoklani · October 21, 2025, 1:38pm

I ran this example:

docker exec
-e MODEL=“nvidia/Qwen3-235B-A22B-FP4”
-e HF_TOKEN=$HF_TOKEN
-it $TRTLLM_MN_CONTAINER bash -c ’
mpirun -x HF_TOKEN trtllm-llmapi-launch trtllm-serve $MODEL
–tp_size 2
–backend pytorch
–max_num_tokens 32768
–max_batch_size 4
–extra_llm_api_options /tmp/extra-llm-api-config.yml
–port 8355’

from this page: “Try NVIDIA NIM APIs

and the inference speed was very-very slow.

The example from the single spark page: “Try NVIDIA NIM APIs

export MODEL_HANDLE=“openai/gpt-oss-120b”

docker run
-e MODEL_HANDLE=$MODEL_HANDLE
-e HF_TOKEN=$HF_TOKEN
-v $HOME/.cache/huggingface/:/root/.cache/huggingface/
–rm -it --ulimit memlock=-1 --ulimit stack=67108864
–gpus=all --ipc=host --network host
nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
bash -c ’
export TIKTOKEN_ENCODINGS_BASE=“/tmp/harmony-reqs” &&
mkdir -p $TIKTOKEN_ENCODINGS_BASE &&
wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken &&
wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken &&
hf download $MODEL_HANDLE &&
python examples/llm-api/quickstart_advanced.py
–model_dir $MODEL_HANDLE
–prompt “Paris is great because”
–max_tokens 64
’
was much-much faster, and I suspect that was because it used an optimized branch of tensor-rt:

”nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev”

Is there an optimized version for two sparks? Please respond with an actual answer and don’t just say “read the documentation”.thanks

nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev

cosinus · October 21, 2025, 1:55pm

You can see all availabe tags in the NGC catalog. There is no explicite dual-node image, but there is one newer image than spark-single-gpu-dev.

You could try that.

rc0.post1 was published on 10/14/2025 10:42 AM - bleeding edge
single-spark was published on 10/06/2025 10:20 PM

Still waiting for a newer build, because there were improvements for RTX5090/sm120 for FP4 a few days ago in the main tree. That’s why I stumbled upon it.

vgoklani · October 22, 2025, 3:06pm

We posted a detailed issue on github:

github.com/NVIDIA/dgx-spark-playbooks

nvidia/Qwen3-235B-A22B-FP4 really slow when served across two sparks

opened 02:50AM - 22 Oct 25 UTC

vgoklani

I followed the instructions from here: https://github.com/NVIDIA/dgx-spark-play…books/tree/main/nvidia/trt-llm#step-11-serve-the-model and ran the model across two sparks: ```bash docker exec \ -e MODEL="nvidia/Qwen3-235B-A22B-FP4" \ -e HF_TOKEN=$HF_TOKEN \ -it $TRTLLM_MN_CONTAINER bash -c ' mpirun -x HF_TOKEN trtllm-llmapi-launch trtllm-serve $MODEL \ --tp_size 2 \ --backend pytorch \ --max_num_tokens 32768 \ --max_batch_size 4 \ --extra_llm_api_options /tmp/extra-llm-api-config.yml \ --port 8355' ``` but both the `pre-fill` and `decoding` speeds were incredibly slow. I expected this to be a lot faster since we were using tensor-parallel, have an nv-link connection between the two boxes, and ran an FP4 version of the model. I also ran gpt-oss-120B on a single spark: ```bash export MODEL_HANDLE="openai/gpt-oss-120b" docker run --name trtllm_llm_server --rm -it --gpus all --ipc host --network host \ -e HF_TOKEN=$HF_TOKEN \ -e MODEL_HANDLE="$MODEL_HANDLE" \ -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \ nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \ bash -c ' export TIKTOKEN_ENCODINGS_BASE="/tmp/harmony-reqs" && \ mkdir -p $TIKTOKEN_ENCODINGS_BASE && \ wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken && \ wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken && \ hf download $MODEL_HANDLE && \ cat > /tmp/extra-llm-api-config.yml <<EOF print_iter_log: false kv_cache_config: dtype: "auto" free_gpu_memory_fraction: 0.9 cuda_graph_config: enable_padding: true disable_overlap_scheduler: true EOF trtllm-serve "$MODEL_HANDLE" \ --max_batch_size 64 \ --trust_remote_code \ --port 8355 \ --extra_llm_api_options /tmp/extra-llm-api-config.yml ' ``` and both the `pre-fill` and `decoding` speeds were acceptable (i.e. similar to the token-generation speed that a user experiences on the chatgpt website). I suspect the improved performance was due to this container: `nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev` which has optimized cutlass kernels for the spark. Why is the dual-spark tutorial using `nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3`? Thanks!

If we are not able to resolve this, then we will just return the two sparks back to nvidia.

aniculescu · October 22, 2025, 3:20pm

What performance numbers are you seeing for single vs stacked setups?

Edit: Please also use trtllm-bench to collect performance numbers

vgoklani · October 22, 2025, 9:15pm

I’ve had much better luck with the nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1 image that was released today. Could you please share a docker-command to run the trtllm-bench benchmark for both single and dual sparks. It feels like the single model instance is faster than the sharded model, so I need to properly benchmark them. Thanks!

eugr · October 23, 2025, 5:07pm

You are comparing a model with 22B active parameters with a model with 3B active parameters. Of course gpt-oss-120b will be faster…

Topic		Replies	Views
Optimized tensor-rt command for running gpt-oss-120B on two dgx-sparks DGX Spark / GB10	1	274	October 20, 2025
Devstral-2-123B-NVFP4-TensorRT-LLM on 2x sparks? DGX Spark / GB10	1	566	December 23, 2025
Install and Use vLLM for Inference on two Sparks does not work DGX Spark / GB10	159	5722	December 9, 2025
TensorRT-LLM + nvidia/Llama-3.3-70B-Instruct-NVFP4 = 5 tok/s DGX Spark / GB10 llama	3	723	January 18, 2026
Two-Spark cluster with vLLM using tensor-parallel-size 2 causes one node to drop while the other's GPU goes 100% forever DGX Spark / GB10	36	1908	February 13, 2026
NVIDIA folks -- where is this promised nvfp4 speedup? DGX Spark / GB10	27	2916	March 26, 2026
NCCL all-reduce deadlock on dual DGX Spark after successful channel establishment — affects both vLLM and TRT-LLM DGX Spark / GB10 nemotron	21	736	April 17, 2026
DGX Spark performance DGX Spark / GB10	49	6363	February 13, 2026
Anyone with 2 or more Sparks have a chance to try out new graph split mode in ik_llama? DGX Spark / GB10 llama	3	325	January 5, 2026
DGX Spark Multi-Node LLM Inference Report for Qwen3-235B model DGX Spark / GB10 nim , llama	34	2659	May 1, 2026

TRT LLM for Inference - two Sparks example is VERY slow

Related topics