TensorRT-LLM is a high-performance LLM inference library with advanced quantization, attention kernels, and paged KV caching. Initial support for Jetson AGX Orin on JetPack 6.1 is included in the v0.12.0-jetson branch of the TensorRT-LLM repo.
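For example, to work from source you can fetch that branch directly (a minimal sketch; the branch name comes from above, and the wheel index URL is a placeholder you would take from the guides below):

```bash
# Shallow-clone only the Jetson support branch of the TensorRT-LLM repo
git clone -b v0.12.0-jetson --depth 1 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM

# Or, hypothetically, install the pre-compiled wheel instead of building
# from source -- substitute the actual index URL given in the documentation:
# pip install tensorrt_llm --extra-index-url <wheel-index-url-from-the-guides>
```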
We’ve made pre-compiled TensorRT-LLM wheels and containers available, along with these guides and additional documentation: