I’m working with a Jetson AGX Orin 32GB device running JetPack 6.2 and trying to deploy a 1.5B large language model using Triton Inference Server and TensorRT-LLM. I’ve managed to convert the model with TensorRT-LLM, and I can successfully run inference via Python scripts, so far so good.
However, I’m running into several roadblocks when trying to deploy this setup using Triton Server on the Jetson device:
Triton + TensorRT-LLM Docker: I thought about running a Docker image that combines Triton and TensorRT-LLM to avoid version mismatches, but found that Jetson isn’t supported by the official containers (“The xx.yy-trtllm-python-py3 image contains the Triton Inference Server with support for TensorRT-LLM and Python backends only.”); it fails to initialize on the Jetson device. See Triton Inference Server | NVIDIA NGC.
Building TensorRT-LLM from Source inside the Triton Server (iGPU) Docker: When I try to build TensorRT-LLM from source inside the official Jetson-compatible Triton Server (xx.yy-py3-igpu) Docker container, I run into Python version mismatches and other compatibility issues (TensorRT-LLM v0.12.0 targets Python 3.10, while nvcr.io/nvidia/tritonserver:24.12-py3-igpu ships Python 3.12), especially when trying to keep the CUDA version fixed at 12.6. A quick way to confirm the versions is sketched below.
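For reference, this is roughly how the mismatch can be confirmed; the image tag below is just the 24.12 iGPU release I tried, so adjust it to your own setup:
# Print the Python interpreter and CUDA toolkit directories bundled in the iGPU image
# (image tag is an example; the CUDA directory name may differ per release)
docker run --rm --runtime nvidia nvcr.io/nvidia/tritonserver:24.12-py3-igpu \
  bash -c 'python3 --version; ls -d /usr/local/cuda-*'
# In my case this reports Python 3.12.x, while TensorRT-LLM v0.12.0 expects Python 3.10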
So my question is:
Has anyone successfully deployed Triton Server with TensorRT-LLM on a Jetson AGX Orin device (JetPack 6.2), and managed to load and serve a large LLM locally?
I’ve only been able to find examples where an LLM is run using the TensorRT-LLM Python API for single requests — but not a full deployment using a proper server setup like Triton Server on the Jetson device.
Any guidance, Docker examples, or best practices would be greatly appreciated.
# Default values will be used if not set
BASE_IMAGE=${BASE_IMAGE:-nvcr.io/nvidia/tritonserver:24.11-py3-min}   # Triton base image for the build
PYTORCH_IMAGE=${PYTORCH_IMAGE:-nvcr.io/nvidia/pytorch:24.11-py3}      # PyTorch image used during the build
TRT_VERSION=${TRT_VERSION:-10.7.0.23}                                 # TensorRT version to download
TRT_URL_x86=${TRT_URL_x86:-https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.7.0/tars/TensorRT-${TRT_VERSION}.Linux.x86_64-gnu.cuda-12.6.tar.gz}
TRT_URL_ARM=${TRT_URL_ARM:-https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.7.0/tars/TensorRT-${TRT_VERSION}.ubuntu-24.04.aarch64-gnu.cuda-12.6.tar.gz}
I noticed that these defaults don’t seem to support Jetson devices; the aarch64 TensorRT tarball is the Ubuntu 24.04 SBSA build rather than a JetPack/L4T one.
I’ve not tested the tensorrt_llm capabilities of this image beyond importing it into python3, but here is how I run the image on an AGX Orin Dev Kit 32GB, along with the Docker image stats:
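Something along these lines; the image name here is only a placeholder, and the flags are the usual ones for Jetson:
# Sketch only: the image name is a placeholder and the flags are assumptions
# --runtime nvidia makes the Jetson iGPU visible inside the container
docker run -it --rm --runtime nvidia --network host --shm-size=4g \
  my-triton-trtllm:igpu \
  python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"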
I previously attempted to convert the small LLM model to a TensorRT engine using TensorRT-LLM in the Docker image you mentioned (nvidia/tritonserver:25.04-trtllm-python-py3), but it was unsuccessful on the Jetson device.
Then, on the same AGX Orin Dev Kit 32GB, I cloned the TensorRT-LLM GitHub repository and followed the “Full Build with C++ Compilation” guide. After compiling, I successfully converted the same LLM model without any issues.
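For reference, a condensed sketch of the steps I followed; it assumes a LLaMA-family checkpoint, and the exact example script and flags depend on the model architecture and the TensorRT-LLM version:
# Build the TensorRT-LLM wheel from source (SM 8.7 is the Orin iGPU)
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM && git submodule update --init --recursive
python3 scripts/build_wheel.py --clean --cuda_architectures "87"
pip3 install build/tensorrt_llm-*.whl

# Convert the Hugging Face checkpoint and build the engine (paths/flags are examples)
python3 examples/llama/convert_checkpoint.py \
  --model_dir /path/to/hf_model --output_dir /tmp/trtllm_ckpt --dtype float16
trtllm-build --checkpoint_dir /tmp/trtllm_ckpt --output_dir /tmp/trtllm_engine \
  --gemm_plugin float16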
However, when I tried to deploy the converted LLM model using Triton Server, I ran into new problems.
Although there is a Jetson-compatible tarball release of Triton Server (tritonserver2.49.0-igpu), it does not include the TensorRT-LLM backend.
Finally, I just found the Models page on the NVIDIA Jetson AI Lab. I’m not sure whether that container uses Triton Server with the TensorRT-LLM backend, as I’m currently tied up with other items.
My original goal is actually quite simple: I want to deploy a small model as a web service on an edge device using TensorRT-LLM and Triton Server, accessible via HTTP or API calls. The main reason for using Triton Server is to benefit from its high performance and concurrency capabilities.
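For context, the kind of endpoint I’m hoping to serve is Triton’s generate endpoint backed by the TensorRT-LLM ensemble; the model name and port below are the defaults from the tensorrtllm_backend examples, so treat them as assumptions:
# Example HTTP request I would like the Jetson to serve (model name/port assumed)
curl -s -X POST localhost:8000/v2/models/ensemble/generate \
  -H 'Content-Type: application/json' \
  -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'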
I’m encountering an issue when trying to find a suitable Triton Inference Server container for Jetson that works with models converted by TensorRT-LLM. It’s quite difficult to search for the right keywords on the Triton container pages, and there’s a lack of clear guidance.
When deploying a model converted with TensorRT-LLM on Jetson, I’m unsure which specific Triton Server container and version I should choose based on your advice (running each component separately).
Could you provide some clarification or suggestions on how to identify the correct container?
I would like to suggest publishing a Triton Inference Server container that integrates both Triton Server and TensorRT-LLM with support for Jetson’s integrated GPU (iGPU). I believe this would greatly reduce the complexity of deploying small models on Jetson platforms.
Thank you very much for your confirmation and clarification!
If I understand correctly, it’s currently possible to use TensorRT-LLM independently on a Jetson device to convert a large language model into a TensorRT engine. I’ve successfully completed this step, verified it with tests, and everything is working well.
However, as you pointed out, the current Jetson Triton Server version (such as nvcr.io/nvidia/tritonserver:25.04-py3-igpu) does not yet support the TensorRT-LLM backend.
As shown by the contents of the backends/ directory inside the container, there is no tensorrt_llm backend available. Therefore, it’s currently not possible to deploy and run the converted large model directly within Triton Server on Jetson.
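For anyone who wants to reproduce the check, listing the standard Triton backends path inside the iGPU image shows what is shipped:
# List the backends bundled with the Jetson iGPU Triton image
docker run --rm nvcr.io/nvidia/tritonserver:25.04-py3-igpu ls /opt/tritonserver/backends
# there is no tensorrt_llm directory in the output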
In other words, even when running them separately on the Jetson device, I can only convert the model with TensorRT-LLM; because the Jetson Triton Server backends do not include TensorRT-LLM, I’m still unable to deploy the converted LLM model to Triton Server on Jetson.
Do we have any plans to support TensorRT-LLM in future releases of the Jetson Triton Server?
After you convert the model into an engine, there are several ways to deploy it on the Jetson.
For example, you can find an example in the Jetson AI Lab.
The container also supports chat completion requests with curl.
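For instance, if the container exposes an OpenAI-compatible server, a request could look like the sketch below; the port and model name are placeholders that depend on the container you launch:
# Chat completion request against a local OpenAI-compatible endpoint (port/model assumed)
curl http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local-llm", "messages": [{"role": "user", "content": "Hello from Jetson!"}], "max_tokens": 64}'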
About the Triton server with TensorRT-LLM support:
Unfortunately, we cannot disclose any schedule or plan on the forum.
But we have directed your request to our internal team.