Jetson AI Lab - ML DevOps, Containers, Core Inferencing

4/2/24 - TensorRT-LLM support on Jetson

  • As detailed in the posts above, now that we have access to the latest CUDA and the ability to rebuild all the other downstream packages we need, we may be able to build mainline TensorRT-LLM (hopefully without much patching required). This is an ongoing effort in coordination with the TensorRT team that we are excited about, as it would provide edge-to-cloud compatibility with other NVIDIA production workflows, NeMo Megatron models, and NIM microservices deployed to the edge.

  • TensorRT-LLM will be integrated into NanoLLM as another API backend, alongside MLC. MLC/TVM already achieves greater than 95% of peak Orin performance/efficiency on Llama (as shown in the Benchmarks on Jetson AI Lab), so performance-wise we're already in a great place; however, TensorRT-LLM will still be good to have for the aforementioned compatibility reasons and its production-grade support. For now, continue using the NanoLLM APIs (sketched below) to get a seamless transition to TensorRT-LLM once it's enabled, along with NanoLLM's support for multimodality and I/O streaming.
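
  For reference, here is a minimal sketch of that NanoLLM usage with the current MLC backend. The model name, quantization setting, and prompt are placeholders, and the idea that a future TensorRT-LLM backend would be selected through the same `api=` argument is an assumption about the planned integration, not something that exists yet:

```python
from nano_llm import NanoLLM

# Load the model through NanoLLM; api='mlc' selects the MLC/TVM backend today.
# A future 'tensorrt_llm' value here is assumed, pending the integration above.
model = NanoLLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",   # placeholder HuggingFace model
    api='mlc',
    quantization='q4f16_ft'            # MLC quantization method
)

# Generation is streamed token-by-token by default
for token in model.generate("Once upon a time,", max_new_tokens=128):
    print(token, end='', flush=True)
```

  The intent is that swapping backends would just be a matter of changing the `api=` string, with the rest of the application code left unchanged.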

  • All of this regarding TensorRT-LLM is subject to change depending on the outcomes of these ongoing engineering efforts. Once TensorRT 10 becomes available for Jetson (expected soon), I will begin work on compiling the latest TensorRT-LLM for Jetson against CUDA 12.4 and TensorRT 10. Assuming that succeeds, binaries can then be provided through jetson-containers and the pip server, and further integration work with NanoLLM and other projects can proceed.