TensorRT-LLM + nvidia/Llama-3.3-70B-Instruct-NVFP4 = 5 tok/s

I was expecting to see something around 15 to 18 tok/s. Is there any tuning needed? This is what I'm doing:

cat << 'EOF' > /tmp/config.yml
enable_attention_dp: true
print_iter_log: true

cuda_graph_config:
  enable_padding: false
  max_batch_size: 1

kv_cache_config:
  dtype: fp8
EOF

trtllm-serve "nvidia/Llama-3.3-70B-Instruct-NVFP4" \
  --max_batch_size 1 \
  --max_num_tokens 4096 \
  --max_seq_len 2048 \
  --kv_cache_free_gpu_memory_fraction 0.75 \
  --tp_size 1 --ep_size 1 \
  --config /tmp/config.yml

How did you come up with that expectation?
Max memory bandwidth on Spark is 273 GB/s. During decode, every generated token has to stream the full set of weights from memory, so for a 70B model in a 4-bit quant (about 35 GB of weights) the ceiling is 273/35 ≈ 7.8 tok/s. And that is the theoretical maximum for a single Spark. In reality, sustained memory bandwidth is lower than peak, and there is additional overhead (KV cache reads, activations, kernel launches), so I'd expect around 6-7 tok/s tops for this model.
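The back-of-the-envelope math above can be sketched as a small helper. This is a rough roofline estimate under the stated assumptions (decode is memory-bandwidth-bound and each token streams all weights once; KV-cache and activation traffic are ignored); the function name and structure are mine, not from any library:

```python
def max_tokens_per_sec(bandwidth_gb_s: float, params_b: float, bits_per_weight: int) -> float:
    """Upper bound on decode throughput for a bandwidth-bound model.

    Assumes every generated token reads the full weight set from memory.
    """
    weight_gb = params_b * bits_per_weight / 8  # model weight footprint in GB
    return bandwidth_gb_s / weight_gb

# Spark: ~273 GB/s peak; Llama-3.3-70B in NVFP4 (4-bit) ~ 35 GB of weights
print(round(max_tokens_per_sec(273, 70, 4), 1))  # → 7.8
```

Plugging in an fp8 quant instead (70 GB of weights) halves the ceiling to about 3.9 tok/s, which is why the 4-bit checkpoint matters here.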

Thanks for your answer… this clarifies everything.

Some other numbers for larger deployments:

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.