Hello all,
we are running Jetson AGX Orin Devkit 64GB devices on balenaOS and serve multiple models through the tritonserver onnxruntime backend with the TensorRT execution provider. Since our devices process live data, we want to avoid downtime caused by TensorRT compiling the models on the edge, so we set up a dedicated Orin for pre-compiling the engine files and ship them inside the Docker container.
What we observe in production is that there is often a subset of engine files that TensorRT refuses to use and recompiles, with the following warning:
Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors
I ran the following commands to verify that there are no version differences between the devices:
# JetPack / L4T version
cat /etc/nv_tegra_release
# CUDA version
nvcc --version
# cuDNN version
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
# TensorRT version
dpkg -l | grep tensorrt
# or
cat /usr/include/aarch64-linux-gnu/NvInferVersion.h | grep NV_TENSORRT
# GPU info (compute capability / SM version)
cat /proc/driver/nvidia/version
nvidia-smi # (if supported; on Jetson often limited)
tegrastats | head -n 1
These yield identical output on both Orins (see the attached file), the only difference being memory consumption, which is higher in production.
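To pinpoint exactly which engines get rebuilt, one simple approach is to hash the cache directory before and after server startup and diff the snapshots. A minimal sketch, assuming the `/trt_cache` path from our config (pass another directory as the first argument to override):

```shell
# Sketch: identify rebuilt engines by hashing the cache before and after
# server startup. CACHE_DIR defaults to the /trt_cache path from our config.
CACHE_DIR="${1:-/trt_cache}"
BEFORE=/tmp/engines_before.txt
AFTER=/tmp/engines_after.txt

# snapshot <dir> <outfile>: record one sha256 per file (empty if dir is empty)
snapshot() { sha256sum "$1"/* > "$2" 2>/dev/null || true; }

snapshot "$CACHE_DIR" "$BEFORE"
# ... start tritonserver here and wait until all models are loaded ...
snapshot "$CACHE_DIR" "$AFTER"

# Lines that differ correspond to engine files TensorRT rewrote on startup.
diff "$BEFORE" "$AFTER" || true
```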
We also observe that onnxruntime splits a model into multiple engine files (say 18), and on startup on the edge device TensorRT will rebuild anywhere between 0 and 17 of them, depending on the model.
Is there any way to find out why TensorRT is rebuilding these engine files? I understand that engine files are not portable across device types or even across minor version changes, but in our case everything is an exact match.
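One low-level check I am considering: deserialize a cached plan directly with `trtexec` on the edge device, where `--verbose` output usually states why the runtime rejects a plan (version or compute-capability mismatch). A sketch, where the engine file name is a placeholder and the trtexec location is the usual JetPack install path:

```shell
# Sketch: deserialize one cached engine with trtexec on the edge device and
# grep the verbose output for the rejection reason. The engine file name is
# a placeholder; the trtexec path is the usual JetPack location.
TRTEXEC="${TRTEXEC:-/usr/src/tensorrt/bin/trtexec}"
ENGINE="${ENGINE:-/trt_cache/model.engine}"   # placeholder engine file

if [ -x "$TRTEXEC" ]; then
    "$TRTEXEC" --loadEngine="$ENGINE" --verbose 2>&1 \
        | grep -iE 'version|deserial|plan|error' || true
else
    echo "trtexec not found at $TRTEXEC" >&2
fi
```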
A few more details:
Tegrastats output on the edge device:
09-29-2025 14:13:17 RAM 30828/62844MB (lfb 51x4MB) SWAP 1/4000MB (cached 0MB) CPU [48%@1728,50%@1728,47%@1728,24%@1728,42%@1344,37%@1344,32%@1344,37%@1344,off,off,off,off] GR3D_FREQ 0% cpu@57.968C soc2@53.437C soc0@53.625C gpu@52.781C tj@57.968C soc1@53.281C VDD_GPU_SOC 6808mW/6808mW VDD_CPU_CV 2403mW/2403mW VIN_SYS_5V0 5544mW/5544mW
versus the pre-compilation Orin:
09-29-2025 14:14:28 RAM 9305/62844MB (lfb 556x4MB) SWAP 0/4000MB (cached 0MB) CPU [0%@1190,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,off,off,off,off] GR3D_FREQ 0% cpu@42.875C soc2@40.187C soc0@40.406C gpu@39.281C tj@42.875C soc1@39.593C VDD_GPU_SOC 2004mW/2004mW VDD_CPU_CV 400mW/400mW VIN_SYS_5V0 2516mW/2516mW
And this is the optimization block from the Triton model configuration (config.pbtxt) used to deploy the model with the TensorRT execution provider:
optimization { execution_accelerators {
gpu_execution_accelerator : [ {
name : "tensorrt"
parameters { key: "trt_engine_cache_enable" value: "1" }
parameters { key: "trt_timing_cache_enable" value: "1" }
parameters { key: "trt_force_timing_cache" value: "1" }
parameters { key: "trt_timing_cache_path" value: "/trt_cache" }
parameters { key: "trt_engine_cache_path" value: "/trt_cache" }
parameters { key: "trt_engine_cache_prefix" value: "{model_name}" }
}]
}}
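For completeness, we also plan to capture the server's own explanation: with verbose logging enabled, Triton relays what the onnxruntime TensorRT execution provider does with the engine cache at model load. A sketch, where the model repository path is a placeholder:

```shell
# Sketch: run Triton with verbose logging and filter for engine-cache
# messages. The model repository path is a placeholder; stop the server
# manually once the models have loaded.
LOG=/tmp/triton_verbose.log
if command -v tritonserver > /dev/null; then
    tritonserver --model-repository=/models --log-verbose=1 2>&1 | tee "$LOG"
else
    echo "tritonserver not on PATH; skipping" >&2
fi
# After startup, search the log for cache-related messages:
grep -iE 'engine cache|deserial|rebuild' "$LOG" 2>/dev/null || true
```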
trt_version_output.txt (13.4 KB)