Deploying Triton Server with TensorRT-LLM on Jetson AGX Orin (JetPack 6.2) — Any Working Example?

Hi all,

I’m working with a Jetson AGX Orin 32GB device running JetPack 6.2 and trying to deploy a 1.5B-parameter language model using Triton Inference Server and TensorRT-LLM. I’ve managed to convert the model using TensorRT-LLM, and I can successfully run inference via Python scripts, so far so good.
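For reference, my single-request test is essentially the stock example runner that ships with the TensorRT-LLM repo, roughly like the sketch below (the script path may differ by version, and the engine/tokenizer directories are placeholders for my converted model):

# Illustrative single-request inference with the TensorRT-LLM example runner
python3 TensorRT-LLM/examples/run.py \
  --engine_dir /path/to/engine_dir \
  --tokenizer_dir /path/to/tokenizer_dir \
  --input_text "Hello, how are you?" \
  --max_output_len 64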

However, I’m running into several roadblocks when trying to deploy this setup using Triton Server on the Jetson device:

  1. Triton + TensorRT-LLM Docker: I considered running a Docker image that combines Triton and TensorRT-LLM to avoid version mismatches, but found that Jetson isn’t supported by the official containers (the xx.yy-trtllm-python-py3 image contains Triton Inference Server with support for the TensorRT-LLM and Python backends only, and it fails to initialize on the Jetson device).
    Triton Inference Server | NVIDIA NGC

  2. Building TensorRT-LLM from Source in the Triton Server (iGPU) Docker: When I try to build TensorRT-LLM from source inside the official Jetson-compatible Triton Server (xx.yy-py3-igpu) Docker container, I run into Python version mismatches and other compatibility issues (TensorRT-LLM v0.12.0 uses Python 3.10, while nvcr.io/nvidia/tritonserver:24.12-py3-igpu ships Python 3.12), especially when trying to keep the CUDA version fixed at 12.6; a quick check is sketched below.
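For illustration, this is the kind of quick check that shows the mismatch (the version numbers are the ones I observed, so treat them as indicative):

# Python shipped in the Jetson-compatible Triton image vs. what TensorRT-LLM v0.12.0 expects
docker run --rm nvcr.io/nvidia/tritonserver:24.12-py3-igpu python3 --version
# -> Python 3.12.x, while TensorRT-LLM v0.12.0 targets Python 3.10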

So my question is:
Has anyone successfully deployed Triton Server with TensorRT-LLM on a Jetson AGX Orin device (JetPack 6.2), and managed to load and serve a large LLM locally?

I’ve only been able to find examples where an LLM is run using the TensorRT-LLM Python API for single requests — but not a full deployment using a proper server setup like Triton Server on the Jetson device.

Any guidance, Docker examples, or best practices would be greatly appreciated.

Thanks in advance!

I think I should be more direct with my question:

Does the TensorRT-LLM backend support running on Jetson devices?

Because when I looked at the build.sh (tensorrtllm_backend/build.sh at main · triton-inference-server/tensorrtllm_backend · GitHub):

# Default values will be used if not set
BASE_IMAGE=${BASE_IMAGE:-nvcr.io/nvidia/tritonserver:24.11-py3-min}
PYTORCH_IMAGE=${PYTORCH_IMAGE:-nvcr.io/nvidia/pytorch:24.11-py3}
TRT_VERSION=${TRT_VERSION:-10.7.0.23}
TRT_URL_x86=${TRT_URL_x86:-https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.7.0/tars/TensorRT-${TRT_VERSION}.Linux.x86_64-gnu.cuda-12.6.tar.gz}
TRT_URL_ARM=${TRT_URL_ARM:-https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.7.0/tars/TensorRT-${TRT_VERSION}.ubuntu-24.04.aarch64-gnu.cuda-12.6.tar.gz}

I noticed that it doesn’t seem to support Jetson devices: the default base images are the standard server containers, and the aarch64 TensorRT tarball is the generic Ubuntu build rather than a JetPack/L4T build.
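For what it’s worth, the script’s comment implies the defaults can be overridden through environment variables, roughly like the hypothetical sketch below, though that alone doesn’t give me a Jetson-ready combination of base image and TensorRT tarball (the image names and values here are placeholders):

# Hypothetical override of the build.sh defaults (variable names taken from the script above)
BASE_IMAGE=nvcr.io/nvidia/tritonserver:24.12-py3-igpu \
PYTORCH_IMAGE=<jetson-compatible-pytorch-image> \
TRT_VERSION=10.7.0.23 \
./build.sh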

If the TensorRT-LLM backend does NOT support running on Jetson devices, how do we deploy the LLM on a Jetson device using Triton Server?

Hi,

We need to check with our internal team about Triton + TensorRT-LLM.

But for running each separately:

You can find the info on TensorRT-LLM for Jetson below:

On the NGC page, you can also find the Triton server containers for Jetson (with the igpu tag):

Thanks

I haven’t tested the tensorrt_llm capabilities of this image beyond importing it into python3, but here’s how I run the image on an AGX Orin dev kit 32GB, along with the Docker image stats:

docker run -it --rm --runtime nvidia --network host --shm-size=1g \
-v $(pwd):/workspace \
--workdir /workspace \
nvcr.io/nvidia/tritonserver:25.04-trtllm-python-py3 bash

ngc registry image info nvcr.io/nvidia/tritonserver:25.04-trtllm-python-py3
Image Information
Name: nvidia/tritonserver:25.04-trtllm-python-py3
Architecture: arm64
Image Size: 16.52 GB
Digest: sha256:111feb3afcd397556fe4af6f1dc1bef159a4422991fbe65be6bc19093720b7f1
Schema Version: 1
Signed?: True
Last Updated: May 09, 2025

pip list|grep tensorrt
tensorrt 10.9.0.34
tensorrt_llm 0.18.2

cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"

uname -a
Linux 5.15.148-tegra #1 SMP PREEMPT Tue Jan 7 17:14:38 PST 2025 aarch64 aarch64 aarch64 GNU/Linux
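
The only tensorrt_llm check I did in that container was an import, roughly:

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# prints 0.18.2 here; anything beyond importing is untested on my side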

Thanks, mate.

I previously attempted to convert the small LLM to a TensorRT engine with TensorRT-LLM in the Docker image you mentioned (nvidia/tritonserver:25.04-trtllm-python-py3), but it was unsuccessful on the Jetson device.

Then, on the same AGX Orin Dev Kit 32GB, I cloned the TensorRT-LLM GitHub repository and followed the “Full Build with C++ Compilation” guide. After compiling, I successfully converted the same LLM model without any issues.
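
For completeness, the source build I ran was roughly the following (branch and flags are from memory and may differ from the guide, so double-check against it):

# Rough outline of the wheel build I did on the AGX Orin
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
python3 ./scripts/build_wheel.py --clean --cuda_architectures "87-real"   # SM 8.7 = Orin iGPU
pip3 install ./build/tensorrt_llm-*.whl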

However, when I tried to deploy the converted LLM model using Triton Server, I ran into new problems.

Although there is a Jetson-compatible tarball release of Triton Server (tritonserver2.49.0-igpu), it does not include the TensorRT-LLM backend.

On the other hand, the Docker version of Triton Server that supports the iGPU does not have TensorRT-LLM installed (nvcr.io/nvidia/tritonserver:24.12-py3-igpu), and the Triton Server Docker image with TensorRT-LLM support does not support Jetson devices (nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3).

Finally, I just found the Models page on the NVIDIA Jetson AI Lab. I am not sure whether this container uses Triton Server with the TensorRT-LLM backend, as I am currently busy with other items.

docker run -it --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e DOCKER_PULL=always --pull always \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  dustynv/mlc:r36.4.0 \
    sudonim serve \
      --model dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC \
      --quantization q4f16_ft \
      --max-batch-size 1 \
      --chat-template deepseek_r1_qwen \
      --host 0.0.0.0 \
      --port 9000

My original goal is actually quite simple: I want to deploy a small model as a web service on an edge device using TensorRT-LLM and Triton Server, accessible via HTTP or API calls. The main reason for using Triton Server is to benefit from its high performance and concurrency capabilities.


Thanks for your reply, AastaLLL.

I’m encountering an issue when trying to find a suitable Triton Inference Server container for Jetson that works with models converted by TensorRT-LLM. It’s quite difficult to search for the right keywords on the Triton container pages, and there’s a lack of clear guidance.

When deploying a model converted using TensorRT-LLM on Jetson, I’m unsure which specific Triton Server container and version I should choose based on your advice (running each separately).

Could you provide some clarification or suggestions on how to identify the correct container?

I would like to suggest publishing a Triton Inference Server container that integrates both Triton Server and TensorRT-LLM with support for Jetson’s integrated GPU (iGPU). I believe this would greatly reduce the complexity of deploying small models on Jetson platforms.

Hi,

Unfortunately, the Jetson Triton server doesn’t support TensorRT-LLM yet.
Currently, you will need to run them separately.

For example, please try nvcr.io/nvidia/tritonserver:25.04-py3-igpu.
Then you can find the following backends in that container:

ll backends/
total 40
drwxrwxrwx 1 triton-server triton-server 4096 May  2 03:39 ./
drwxr-xr-x 1 root          root          4096 May  2 03:41 ../
drwxrwxrwx 2 triton-server triton-server 4096 May  2 03:39 fil/
drwxrwxrwx 2 triton-server triton-server 4096 May  2 03:32 identity/
drwxrwxrwx 2 triton-server triton-server 4096 May  2 03:38 onnxruntime/
drwxrwxrwx 1 triton-server triton-server 4096 May 21 06:51 python/
drwxrwxrwx 2 triton-server triton-server 4096 May  2 03:34 pytorch/
drwxrwxrwx 2 triton-server triton-server 4096 May  2 03:33 tensorrt/
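
For reference, a minimal way to launch that container against a model repository looks roughly like the sketch below (the repository path is a placeholder, and the models inside it must use one of the backends listed above, e.g. tensorrt, onnxruntime, or python):

docker run -it --rm --runtime nvidia --network host \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:25.04-py3-igpu \
  tritonserver --model-repository=/models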

Thanks.

Thank you very much for your confirmation and clarification!

If I understand correctly, it’s currently possible to use TensorRT-LLM independently on a Jetson device to convert a large language model into a TensorRT engine. I’ve successfully completed this step and verified it with tests, and everything is working well.

However, as you pointed out, the current Jetson Triton Server version (such as nvcr.io/nvidia/tritonserver:25.04-py3-igpu) does not yet support the TensorRT-LLM backend.

As shown by the contents of the backends/ directory inside the container, there is no tensorrt_llm backend available. Therefore, it’s currently not possible to deploy and run the converted model directly within Triton Server on Jetson.

In other words, even when running them separately on the Jetson device, I can only perform the model conversion with TensorRT-LLM; because the Jetson Triton Server backends do not include TensorRT-LLM, I’m still unable to deploy the converted LLM to Triton Server on Jetson.

Are there any plans to support TensorRT-LLM in future releases of the Jetson Triton Server?

Thanks.

Hi,

After you convert the model into an engine, there are several ways to deploy it on the Jetson.
For example, you can find an example in the Jetson AI Lab.

The container also supports chat completion requests with curl.
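
For example, assuming the server launched with the earlier docker command is listening on port 9000 and exposes an OpenAI-compatible chat route (please check the Jetson AI Lab page for the exact endpoint), a request would look roughly like:

curl http://0.0.0.0:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'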

About the Triton server with TensorRT-LLM support:
Unfortunately, we cannot disclose any schedule or plan on the forum.
But we have directed your request to our internal team.

Thanks.

