TensorRT-LLM for Jetson

TensorRT-LLM is a high-performance LLM inference library with advanced quantization, attention kernels, and paged KV caching. Initial support for TensorRT-LLM in JetPack 6.1 has been included in the v0.12.0-jetson branch of the TensorRT-LLM repo for Jetson AGX Orin.

We’ve made pre-compiled TensorRT-LLM wheels and containers available, along with these guides and additional documentation:

> TensorRT-LLM Deployment on Jetson Orin
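As a quick sanity check after installing one of the pre-built wheels (the wheel filename below is a placeholder; the actual file and index URL come from the deployment guide above):

pip install tensorrt_llm-0.12.0*-cp310-*-linux_aarch64.whl   # pre-built Jetson wheel from the guide
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"   # should report 0.12.0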


Hi @dusty_nv, in case anyone is interested, I’ve created a small demo video using Streamlit with your TensorRT-LLM implementation on the AGX Orin. Looks great.


Very cool!

If anyone else has a problem running the example TensorRT-LLM exercise, here’s what fixed it for me.

The MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ repo has a requirements.txt, and auto-gptq is the only package in it that is not already in the tensorrt-llm requirements, so build it from source:

git clone https://github.com/AutoGPTQ/AutoGPTQ.git
cd AutoGPTQ

If you aren’t using conda, edit setup.py and change this line to: conda_cuda_include_dir = "/usr/local/cuda/include"

Then:
export BUILD_CUDA_EXT=1
export TORCH_CUDA_ARCH_LIST="8.7"
export COMPILE_MARLIN=1
MAX_JOBS=10 python -m pip wheel . --no-build-isolation -w dist

pip install dist/auto_gptq-0.8.0.dev0+cu126-cp310-cp310-linux_aarch64.whl
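To confirm the wheel installed cleanly, a quick import check (assuming the standard auto_gptq package layout) is:

python3 -c "import auto_gptq; print(auto_gptq.__version__)"   # should print 0.8.0.dev0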


Running the same LLaMA 3.1 8B Instruct model with the Activation-aware Weight Quantization (AWQ) technique resulted in an improvement in inference speed.
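For reference, a rough sketch of the AWQ workflow, based on the upstream TensorRT-LLM quantization example (quantize.py lives under examples/quantization; exact flags may differ on the v0.12.0-jetson branch):

# Quantize the Hugging Face checkpoint to INT4 AWQ
python3 examples/quantization/quantize.py --model_dir ./Meta-Llama-3.1-8B-Instruct --dtype float16 --qformat int4_awq --awq_block_size 128 --output_dir ./llama-3.1-8b-awq-ckpt

# Build the TensorRT engine from the quantized checkpoint
trtllm-build --checkpoint_dir ./llama-3.1-8b-awq-ckpt --output_dir ./llama-3.1-8b-awq-engine --gemm_plugin float16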


What are the supported models for v0.12.0-jetson? Can you please point me to the full list?


I also think it would be best to list out the models that have already been tested.


Can anyone help me build a stand-alone API inference server that runs a large Whisper model for speech-to-text? It needs to run on a portable machine (such as an AGX Orin) without internet access, but with a local network connection for the API. It should be stand-alone, meaning that as soon as power is connected it boots and starts the API server without anyone needing to log in. No monitor or keyboard will be connected (except for initial setup and debugging). My email is y_ardavan@yahoo.com. Thank you.
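On the boot-without-login part of that question, a minimal sketch is a systemd unit that launches the server at boot; the whisper-api container image and command here are placeholders, not an existing project:

# Create a systemd unit that starts the API server container on boot
sudo tee /etc/systemd/system/whisper-api.service <<'EOF'
[Unit]
Description=Standalone Whisper speech-to-text API server
After=network-online.target

[Service]
# Placeholder image/command; substitute the actual Whisper server container
ExecStart=/usr/bin/docker run --rm --runtime nvidia --network host whisper-api:latest
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now whisper-api.service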

Has anyone tried using TensorRT-LLM on a Jetson Orin NX (16GB)? I keep getting a core dump when running trtllm-build, even with a small model (0.5B). The official tests were conducted on the AGX Orin.

Is multi-node LLM inference supported on the Jetson AGX Orin? I want to serve a bigger model (one that does not fit on a single Orin but would fit across two).

Is pipeline parallelism even supported? I am running into issues when I try to follow TensorRT-LLM/examples/llama at v0.12.0-jetson · NVIDIA/TensorRT-LLM · GitHub.
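For context, the upstream llama example builds a pipeline-parallel engine roughly like this (flags taken from the example README; whether a 2-rank engine can actually be served across two separate Orins over MPI is exactly the open question):

# Convert the HF checkpoint with 2-way pipeline parallelism (convert_checkpoint.py from TensorRT-LLM/examples/llama)
python3 convert_checkpoint.py --model_dir ./Meta-Llama-3-8B-Instruct --output_dir ./ckpt_pp2 --dtype float16 --pp_size 2

# Build the engine, then launch one MPI rank per pipeline stage
trtllm-build --checkpoint_dir ./ckpt_pp2 --output_dir ./engine_pp2 --gemm_plugin float16
mpirun -n 2 python3 ../run.py --engine_dir ./engine_pp2 --tokenizer_dir ./Meta-Llama-3-8B-Instruct --max_output_len 64 --input_text "Hello"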