I ran this example:
```bash
docker exec \
  -e MODEL="nvidia/Qwen3-235B-A22B-FP4" \
  -e HF_TOKEN=$HF_TOKEN \
  -it $TRTLLM_MN_CONTAINER bash -c '
  mpirun -x HF_TOKEN trtllm-llmapi-launch trtllm-serve $MODEL \
    --tp_size 2 \
    --backend pytorch \
    --max_num_tokens 32768 \
    --max_batch_size 4 \
    --extra_llm_api_options /tmp/extra-llm-api-config.yml \
    --port 8355'
```
from this page: "Try NVIDIA NIM APIs", and the inference speed was very slow.
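Before judging speed, the endpoint can be spot-checked with a short completion. A minimal sketch, assuming port 8355 from the command above and the OpenAI-compatible `/v1/completions` route that `trtllm-serve` exposes; the model name is illustrative and must match what the server actually loaded:

```python
# Minimal smoke test against trtllm-serve's OpenAI-compatible API.
# Assumes the server from the command above is listening on localhost:8355.
import json
import urllib.request

def completion_request(prompt: str, model: str, max_tokens: int = 32,
                       base_url: str = "http://localhost:8355") -> dict:
    """Build the URL and JSON payload for a /v1/completions call."""
    return {
        "url": f"{base_url}/v1/completions",
        "payload": {"model": model, "prompt": prompt, "max_tokens": max_tokens},
    }

def send(req: dict) -> str:
    """POST the payload and return the generated text from the first choice."""
    data = json.dumps(req["payload"]).encode()
    http_req = urllib.request.Request(
        req["url"], data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(http_req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Against a live server, `send(completion_request("Paris is great because", "nvidia/Qwen3-235B-A22B-FP4"))` should return the generated continuation.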
The example from the single-Spark page ("Try NVIDIA NIM APIs"):
```bash
export MODEL_HANDLE="openai/gpt-oss-120b"
docker run \
  -e MODEL_HANDLE=$MODEL_HANDLE \
  -e HF_TOKEN=$HF_TOKEN \
  -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
  --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus=all --ipc=host --network host \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  bash -c '
  export TIKTOKEN_ENCODINGS_BASE="/tmp/harmony-reqs" && \
  mkdir -p $TIKTOKEN_ENCODINGS_BASE && \
  wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken && \
  wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken && \
  hf download $MODEL_HANDLE && \
  python examples/llm-api/quickstart_advanced.py \
    --model_dir $MODEL_HANDLE \
    --prompt "Paris is great because" \
    --max_tokens 64
  '
```
was much faster, and I suspect that was because it used an optimized branch of TensorRT-LLM:
`nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev`
Is there an optimized version for two Sparks? Please respond with an actual answer and don't just say "read the documentation". Thanks!
`nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev`
You can see all available tags in the NGC catalog. There is no explicit dual-node image, but there is one image newer than spark-single-gpu-dev.
You could try that.
rc0.post1 was published on 10/14/2025 10:42 AM - bleeding edge
single-spark was published on 10/06/2025 10:20 PM
Still waiting for a newer build, because there were FP4 improvements for RTX 5090/sm120 in the main tree a few days ago. That's why I stumbled upon it.
We posted a detailed issue on GitHub (opened 02:50 AM, 22 Oct 2025 UTC):
I followed the instructions from here:
https://github.com/NVIDIA/dgx-spark-play… books/tree/main/nvidia/trt-llm#step-11-serve-the-model
and ran the model across two sparks:
```bash
docker exec \
-e MODEL="nvidia/Qwen3-235B-A22B-FP4" \
-e HF_TOKEN=$HF_TOKEN \
-it $TRTLLM_MN_CONTAINER bash -c '
mpirun -x HF_TOKEN trtllm-llmapi-launch trtllm-serve $MODEL \
--tp_size 2 \
--backend pytorch \
--max_num_tokens 32768 \
--max_batch_size 4 \
--extra_llm_api_options /tmp/extra-llm-api-config.yml \
--port 8355'
```
but both the `prefill` and `decode` speeds were incredibly slow. I expected this to be a lot faster, since we were using tensor parallelism, have an NVLink connection between the two boxes, and ran an FP4 version of the model.
I also ran gpt-oss-120B on a single spark:
```bash
export MODEL_HANDLE="openai/gpt-oss-120b"
docker run --name trtllm_llm_server --rm -it --gpus all --ipc host --network host \
-e HF_TOKEN=$HF_TOKEN \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
bash -c '
export TIKTOKEN_ENCODINGS_BASE="/tmp/harmony-reqs" && \
mkdir -p $TIKTOKEN_ENCODINGS_BASE && \
wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken && \
wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken && \
hf download $MODEL_HANDLE && \
cat > /tmp/extra-llm-api-config.yml <<EOF
print_iter_log: false
kv_cache_config:
  dtype: "auto"
  free_gpu_memory_fraction: 0.9
cuda_graph_config:
  enable_padding: true
disable_overlap_scheduler: true
EOF
trtllm-serve "$MODEL_HANDLE" \
--max_batch_size 64 \
--trust_remote_code \
--port 8355 \
--extra_llm_api_options /tmp/extra-llm-api-config.yml
'
```
and both the `prefill` and `decode` speeds were acceptable (i.e. similar to the token-generation speed a user experiences on the ChatGPT website). I suspect the improved performance was due to this container: `nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev`, which has optimized CUTLASS kernels for the Spark. Why does the dual-Spark tutorial use `nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3`?
Thanks!
If we are not able to resolve this, then we will just return the two Sparks to NVIDIA.
What performance numbers are you seeing for single vs stacked setups?
Edit: Please also use trtllm-bench to collect performance numbers
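Until proper `trtllm-bench` numbers are in, a rough decode tokens-per-second figure can be pulled from the running server itself. A sketch, assuming the server is on `localhost:8355` as in the commands above, and that the stream emits roughly one SSE chunk per generated token (typical for OpenAI-compatible servers, but an approximation, not a guarantee):

```python
# Rough decode-throughput probe for an OpenAI-compatible server.
# Not a substitute for trtllm-bench; it just times a streamed completion.
import json
import time
import urllib.request

def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Plain rate computation, split out so it is easy to sanity-check."""
    return n_tokens / elapsed_s if elapsed_s > 0 else 0.0

def measure(model: str, prompt: str, max_tokens: int = 256,
            base_url: str = "http://localhost:8355") -> float:
    """Stream a completion and return approximate generated tokens/s,
    counting SSE data chunks (roughly one token per chunk)."""
    payload = json.dumps({
        "model": model, "prompt": prompt,
        "max_tokens": max_tokens, "stream": True,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions", data=payload,
        headers={"Content-Type": "application/json"},
    )
    n, start = 0, time.monotonic()
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # SSE: lines of the form b"data: {...}"
            if line.startswith(b"data: ") and b"[DONE]" not in line:
                n += 1
    return tokens_per_second(n, time.monotonic() - start)
```

Running `measure("nvidia/Qwen3-235B-A22B-FP4", "Paris is great because")` against each setup (single-node serve vs. the dual-node `--tp_size 2` serve) gives comparable single-stream numbers; note this includes prefill time, so use a longish `max_tokens` to let decode dominate.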
I've had much better luck with the `nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1` image that was released today. Could you please share a docker command to run the trtllm-bench benchmark for both single and dual Sparks? It feels like the single model instance is faster than the sharded model, so I need to benchmark them properly. Thanks!
eugr (October 23, 2025, 5:07pm):
You are comparing a model with 22B active parameters to a model with roughly 5B active parameters. Of course gpt-oss-120b will be faster…
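That gap lines up with decode being memory-bandwidth bound: each generated token has to stream every active parameter through memory. A back-of-envelope sketch; the ~273 GB/s figure is the published DGX Spark memory bandwidth, the active-parameter counts and 0.5 bytes/param for FP4 weights are approximations, and KV-cache and activation traffic are ignored:

```python
# Back-of-envelope decode ceiling: tokens/s ≈ memory bandwidth / bytes
# moved per token, where bytes per token ≈ active params × bytes/weight.
BANDWIDTH_GBS = 273.0  # GB/s, published DGX Spark (GB10) figure

def decode_ceiling(active_params_b: float, bytes_per_param: float = 0.5) -> float:
    """Upper-bound single-stream decode tokens/s on one node."""
    bytes_per_token_gb = active_params_b * bytes_per_param  # GB moved per token
    return BANDWIDTH_GBS / bytes_per_token_gb

qwen = decode_ceiling(22.0)    # Qwen3-235B-A22B-FP4: ~22B active params → ~25 tok/s
gpt_oss = decode_ceiling(5.1)  # gpt-oss-120b (MXFP4): ~5.1B active params → ~107 tok/s
print(f"Qwen3-A22B ceiling: ~{qwen:.0f} tok/s")
print(f"gpt-oss-120b ceiling: ~{gpt_oss:.0f} tok/s")
```

Real throughput lands well below these ceilings, but the ~4x ratio between the two models is what matters: the dual-Spark Qwen run being far slower per token than the single-Spark gpt-oss run is largely the models, not only the container.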
system (December 10, 2025, 8:05pm):
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.