TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x

Originally published at: TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x | NVIDIA Technical Blog

NVIDIA TensorRT-LLM support for speculative decoding now provides more than a 3x speedup in total token throughput. TensorRT-LLM is an open-source library that provides blazing-fast inference support for numerous popular large language models (LLMs) on NVIDIA GPUs. By adding support for speculative decoding on single-GPU and single-node multi-GPU setups, the library further expands its supported optimizations…

Leverage TensorRT-LLM to unlock up to 3.6x higher inference throughput on large language models, as described in this post. If you have any questions or comments, please let us know!

Could you clarify why the “Run decoding” step for 405B is run with “mpirun -n 8”? The setup described above uses TP=4, so why does “run.py” need to be launched with 8 processes?

Yes, that’s a typo, thanks @chislett.ben! We’ll have it updated.
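
For reference, a minimal sketch of what the corrected invocation might look like for a TP=4 engine. The engine and tokenizer paths below are placeholders, not the exact paths from the post; the key point is that the MPI world size passed to `mpirun -n` should match the tensor-parallel degree the engine was built with.

```bash
# Illustrative sketch only: with a TP=4 engine, launch 4 ranks (not 8),
# one per GPU participating in tensor parallelism.
# Paths are hypothetical; substitute the engine and tokenizer directories
# produced by the setup steps in the post.
mpirun -n 4 \
    python examples/run.py \
        --engine_dir ./llama-3.1-405b_tp4_engine \
        --tokenizer_dir ./llama-3.1-405b \
        --max_output_len 128 \
        --input_text "What is speculative decoding?"
```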