I was expecting to see something around 15 to 18 tok/s. Is there any tuning needed? This is what I'm doing:
cat << 'EOF' > /tmp/config.yml
enable_attention_dp: true
print_iter_log: true
cuda_graph_config:
  enable_padding: false
  max_batch_size: 1
kv_cache_config:
  dtype: fp8
EOF
trtllm-serve "nvidia/Llama-3.3-70B-Instruct-NVFP4" \
  --max_batch_size 1 \
  --max_num_tokens 4096 \
  --max_seq_len 2048 \
  --kv_cache_free_gpu_memory_fraction 0.75 \
  --tp_size 1 --ep_size 1 \
  --config /tmp/config.yml
eugr — January 16, 2026, 6:54pm
How did you come up with that expectation?
Max memory bandwidth on Spark is 273 GB/s. This means that for a 70B model in 4-bit quantization (about 35 GB of weights), you will get 273/35 = 7.8 tok/s, and that is the theoretical maximum for a single Spark. In reality, the memory is slower on average and there is additional overhead, so I'd expect around 6-7 tok/s tops for this model.
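The arithmetic above can be sketched as a small helper (a back-of-the-envelope model only; the function name and parameters are illustrative, and it ignores KV-cache traffic and activation overhead):

```python
# Memory-bandwidth-bound decode estimate: each generated token requires
# streaming the full set of model weights from memory, so
#   tokens/s <= bandwidth / weight_bytes.

def max_tokens_per_second(params_billion: float, bits_per_weight: float,
                          bandwidth_gbps: float) -> float:
    """Theoretical upper bound on single-stream decode throughput."""
    weight_gb = params_billion * bits_per_weight / 8  # model footprint in GB
    return bandwidth_gbps / weight_gb

# 70B parameters at 4 bits/weight on a 273 GB/s Spark:
print(round(max_tokens_per_second(70, 4, 273), 1))  # 7.8
```

Real throughput lands below this bound, which is why 6-7 tok/s is a reasonable expectation.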
Thanks for your answer… this clarifies everything.
Some other numbers for larger deployments.
Command templates:
docker exec -it vllm_node bash -i -c "vllm serve M --host 0.0.0.0 --trust_remote_code --gpu-memory-utilization 0.8 -pp 1 -tp X --distributed-executor-backend ray --load-format fastsafetensors --kv-cache-dtype fp8"
vllm bench serve --backend vllm --model M --host 10.20.0.4 --endpoint /v1/completions --dataset-name sharegpt --num-prompts X --port 8000
Qwen/Qwen3-VL-32B-Instruct-FP8
4 nodes (tp)
============ Serving Benchmark Result ============
Successful requests: …
system — Closed — February 1, 2026, 3:29am
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.