TRT LLM for Inference - two Sparks example is VERY slow

I ran this example:

docker exec \
  -e MODEL="nvidia/Qwen3-235B-A22B-FP4" \
  -e HF_TOKEN=$HF_TOKEN \
  -it $TRTLLM_MN_CONTAINER bash -c '
    mpirun -x HF_TOKEN trtllm-llmapi-launch trtllm-serve $MODEL \
      --tp_size 2 \
      --backend pytorch \
      --max_num_tokens 32768 \
      --max_batch_size 4 \
      --extra_llm_api_options /tmp/extra-llm-api-config.yml \
      --port 8355'
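For anyone reproducing this, a quick client-side sanity check can put a number on "slow" before a real benchmark. This is only a sketch: it assumes the server above is up on localhost:8355 and that the model name matches the one being served.

```shell
# Minimal completion request against trtllm-serve's OpenAI-compatible API.
# Port 8355 matches the --port flag above; adjust host/model as needed.
curl -s http://localhost:8355/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/Qwen3-235B-A22B-FP4",
        "prompt": "Paris is great because",
        "max_tokens": 64
      }'
```

Wrapping this in `time` and dividing `max_tokens` by the elapsed seconds gives a first rough tokens/sec figure.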

from this page: "Try NVIDIA NIM APIs"

and the inference speed was very, very slow.

The example from the single-Spark page ("Try NVIDIA NIM APIs"):

export MODEL_HANDLE="openai/gpt-oss-120b"

docker run \
  -e MODEL_HANDLE=$MODEL_HANDLE \
  -e HF_TOKEN=$HF_TOKEN \
  -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
  --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus=all --ipc=host --network host \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  bash -c '
    export TIKTOKEN_ENCODINGS_BASE="/tmp/harmony-reqs" &&
    mkdir -p $TIKTOKEN_ENCODINGS_BASE &&
    wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken &&
    wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken &&
    hf download $MODEL_HANDLE &&
    python examples/llm-api/quickstart_advanced.py \
      --model_dir $MODEL_HANDLE \
      --prompt "Paris is great because" \
      --max_tokens 64'

was much, much faster, and I suspect that is because it uses an optimized branch of TensorRT-LLM:

nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev

Is there an optimized version for two Sparks? Please respond with an actual answer and don't just say "read the documentation". Thanks!

nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev

You can see all available tags in the NGC catalog. There is no explicit dual-node image, but there is one image newer than spark-single-gpu-dev.

You could try that.

rc0.post1 was published on 10/14/2025 10:42 AM (bleeding edge).
spark-single-gpu-dev was published on 10/06/2025 10:20 PM.

I am still waiting for a newer build, because a few days ago there were FP4 improvements for RTX 5090/sm120 in the main tree. That's how I stumbled upon it.

We posted a detailed issue on GitHub:

If we are not able to resolve this, we will just return the two Sparks to NVIDIA.

What performance numbers are you seeing for single vs stacked setups?

Edit: Please also use trtllm-bench to collect performance numbers
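For reference, the rough shape of a trtllm-bench run inside the container looks like this. This is a sketch, not a verified command line: the flag names are as I recall them from the TensorRT-LLM benchmarking docs, and the synthetic-dataset parameters are arbitrary, so check everything against `trtllm-bench --help`.

```shell
# Generate a small synthetic dataset, then measure throughput with trtllm-bench.
# Run inside the TensorRT-LLM container; MODEL is the HF model handle.
MODEL=openai/gpt-oss-120b
python benchmarks/cpp/prepare_dataset.py --stdout \
    --tokenizer $MODEL token-norm-dist \
    --input-mean 128 --output-mean 128 \
    --input-stdev 0 --output-stdev 0 \
    --num-requests 100 > /tmp/synthetic_128_128.txt
trtllm-bench --model $MODEL throughput \
    --dataset /tmp/synthetic_128_128.txt \
    --backend pytorch
```

Running the same dataset on the single-Spark and dual-Spark setups would make the comparison apples-to-apples.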

I've had much better luck with the nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1 image that was released today. Could you please share a docker command to run the trtllm-bench benchmark for both single and dual Sparks? It feels like the single model instance is faster than the sharded one, so I need to benchmark them properly. Thanks!

You are comparing a model with 22B active parameters (Qwen3-235B-A22B) to one with roughly 5B active parameters (gpt-oss-120b). Of course gpt-oss-120b will be faster…
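Back-of-envelope, here is the gap you should expect from active-parameter counts alone, using 22B active for Qwen3-235B-A22B and the published ~5.1B active for gpt-oss-120b. MoE decode is roughly memory-bandwidth bound, so per-token throughput scales about inversely with active parameters:

```shell
# Expected decode-throughput gap from active parameters alone:
# Qwen3-235B-A22B has 22B active params, gpt-oss-120b ~5.1B.
awk 'BEGIN { printf "expected gap: ~%.1fx\n", 22 / 5.1 }'
```

Anything much beyond a ~4x gap would point at the container/build rather than the model choice.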


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.