TensorRT rejects engine cache pre-built on same device type

Hello all,

we are running Jetson AGX Orin Devkit 64GB devices on balenaOS and are serving multiple models with tritonserver's onnxruntime backend using the TensorRT execution provider. Since our devices process live data, we want to avoid downtime caused by TensorRT compiling the models on the edge, so we set up a dedicated Orin for pre-compiling the engine files and ship them with the Docker container.

What we observe in production is that there is often a subset of engine files that TensorRT refuses to use and recompiles, with the following warning log:

Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors

I ran the following commands to verify that there are no version differences between the devices (the output is identical except for the memory consumption, which is higher in production):

# JetPack / L4T version
cat /etc/nv_tegra_release

# CUDA version
nvcc --version

# cuDNN version
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2

# TensorRT version
dpkg -l | grep tensorrt
# or
cat /usr/include/aarch64-linux-gnu/NvInferVersion.h | grep NV_TENSORRT

# GPU info (compute capability / SM version)
cat /proc/driver/nvidia/version
nvidia-smi # (if supported; on Jetson often limited)
tegrastats | head -n 1

which yield identical output on both Orins (see attached file), apart from the memory consumption of course.
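To rule out transcription errors when eyeballing the two reports, a small helper can diff the saved command outputs line by line. A minimal sketch, assuming each device's output was dumped to a text file; lines that legitimately differ between devices (RAM, temperatures, power) are filtered with a heuristic pattern:

```python
import re

# Heuristic for lines expected to differ between devices:
# memory/swap usage, temperatures, and power draw from tegrastats.
IGNORE = re.compile(r"RAM \d+/|SWAP \d+/|@[\d.]+C|mW")

def diff_reports(a_text: str, b_text: str) -> list[tuple[str, str]]:
    """Return pairs of corresponding lines that differ and are not
    covered by the IGNORE pattern (i.e. real version mismatches)."""
    mismatches = []
    for a, b in zip(a_text.splitlines(), b_text.splitlines()):
        if a != b and not (IGNORE.search(a) or IGNORE.search(b)):
            mismatches.append((a, b))
    return mismatches
```

An empty result means the two reports agree on everything except the expected runtime metrics.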

We observe that onnxruntime splits a model into, say, 18 different engine files, and on startup on the edge device TRT will rebuild anywhere between 0 and 17 of them, depending on the model.

Is there any way to figure out why it is rebuilding the engine files? I understand that engine files are not portable across device types or even minor version changes, but in our case everything is an exact match.
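One way I can quantify the problem is to raise the server's log verbosity and scan the startup log for the warning quoted above. A minimal sketch of such a log scan (the idea of grepping the Triton log is mine; the warning text is the one TensorRT prints):

```python
# Count how often TensorRT rejects a shipped engine plan during startup
# by scanning the server log for the warning text quoted above.
WARNING = ("Using an engine plan file across different models of devices "
           "is not recommended")

def count_plan_rejections(log_text: str) -> int:
    """Number of log lines reporting a rejected (and thus rebuilt) plan."""
    return sum(WARNING in line for line in log_text.splitlines())
```

Running this per model over several startups would at least show whether the same subset of engines is rejected every time or whether it varies.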

A few more details:

Tegrastats output on the edge device:

09-29-2025 14:13:17 RAM 30828/62844MB (lfb 51x4MB) SWAP 1/4000MB (cached 0MB) CPU [48%@1728,50%@1728,47%@1728,24%@1728,42%@1344,37%@1344,32%@1344,37%@1344,off,off,off,off] GR3D_FREQ 0% cpu@57.968C soc2@53.437C soc0@53.625C gpu@52.781C tj@57.968C soc1@53.281C VDD_GPU_SOC 6808mW/6808mW VDD_CPU_CV 2403mW/2403mW VIN_SYS_5V0 5544mW/5544mW

versus the pre-compilation orin:

09-29-2025 14:14:28 RAM 9305/62844MB (lfb 556x4MB) SWAP 0/4000MB (cached 0MB) CPU [0%@1190,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,off,off,off,off] GR3D_FREQ 0% cpu@42.875C soc2@40.187C soc0@40.406C gpu@39.281C tj@42.875C soc1@39.593C VDD_GPU_SOC 2004mW/2004mW VDD_CPU_CV 400mW/400mW VIN_SYS_5V0 2516mW/2516mW

And this is the onnxruntime configuration used to deploy the model with trt:

optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "tensorrt"
    parameters { key: "trt_engine_cache_enable" value: "1" }
    parameters { key: "trt_timing_cache_enable" value: "1" }
    parameters { key: "trt_force_timing_cache" value: "1" }
    parameters { key: "trt_timing_cache_path" value: "/trt_cache" }
    parameters { key: "trt_engine_cache_path" value: "/trt_cache" }
    parameters { key: "trt_engine_cache_prefix" value: "{model_name}" }
  } ]
} }
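If your onnxruntime build supports it, the TensorRT EP's `trt_detailed_build_log` option can make the build steps (and hence the rebuild trigger) visible in the server log. Whether this option is available depends on the onnxruntime version, so treat this addition as something to verify:

```
parameters { key: "trt_detailed_build_log" value: "1" }
```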

trt_version_output.txt (13.4 KB)

*** Please note that this reply is generated by LLM automatically ***

Based on the provided context, it appears that you are experiencing issues with TensorRT recompiling engine files on your Jetson AGX Orin devices, despite having precompiled them on a dedicated Orin device and shipping them with the Docker container.

To address this issue, let’s break down the possible reasons why TensorRT might be recompiling the engine files:

  1. Version differences: Although you’ve verified that the TensorRT versions match, make sure they are identical down to the patch level. Even a patch-level difference can cause TensorRT to recompile the engine files.
  2. Device type: As you mentioned, engine files are not portable between device types. However, since you’re using the same device type (Jetson AGX Orin) on both the precompilation device and the production devices, this should not be the cause.
  3. Memory consumption: The higher memory consumption on the production devices might be a contributing factor. TensorRT might be recompiling the engine files due to differences in memory allocation or availability.
  4. Model splitting: The onnxruntime splits the model into multiple engine files, and TensorRT might be recompiling some of these files due to differences in the model’s execution plan or optimization settings.

To troubleshoot this issue, you can try the following:

  1. Verify the TensorRT version: Double-check that the TensorRT version is identical on both devices, down to the patch level.
  2. Check the model’s execution plan: Verify that the model’s execution plan is the same on both devices. You can use tools like trtexec to analyze the model’s execution plan and identify any differences.
  3. Optimization settings: Ensure that the optimization settings used during precompilation are identical to those used on the production devices.
  4. Memory allocation: Investigate the memory allocation on the production devices and ensure that it’s consistent with the precompilation device.
  5. Engine file compatibility: Verify that the engine files are compatible with the production devices. You can use tools like trtexec to check the engine file’s compatibility.

To compare the engine files and identify any differences, you can use tools like trtexec or TensorRT’s built-in debugging tools. Additionally, you can try to reproduce the issue at a smaller scale with a simplified model and verify whether its engine files are recompiled.
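One concrete way to compare engine files is to check whether the shipped cache files are actually being rewritten on startup (as opposed to, say, fresh builds appearing alongside them under different names). A minimal sketch, assuming the cache lives at `/trt_cache` as in the config above:

```python
import hashlib
from pathlib import Path

def snapshot(cache_dir: str) -> dict[str, str]:
    """Map each file in the cache directory to its SHA-256 digest."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in Path(cache_dir).iterdir()
        if p.is_file()
    }

def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Files rewritten or newly created between the two snapshots."""
    return sorted(
        name for name, digest in after.items()
        if before.get(name) != digest
    )
```

Taking a snapshot before the first server start and another after it settles tells you exactly which of the 18 engine files were replaced.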

In summary, while the exact cause of the issue is unclear, it’s likely related to differences in the model’s execution plan, optimization settings, or memory allocation. By verifying the TensorRT version, model’s execution plan, optimization settings, and memory allocation, you should be able to identify the root cause of the issue and find a solution.

*** Please give the thumbs up if you get answers you like. Or provide feedback to help us improve the answer. ***

Hi,

Could you share the steps to reproduce this?
For example, the container you used and the script to compile/load the TensorRT cache.

The TensorRT cache doesn’t depend on the memory status.
If an engine is built with the same TensorRT software version and the same GPU architecture, the cache file can be loaded directly.

Thanks.

Hi,

so this is our base Docker image (see attached). For the TensorRT compilation we are simply using the attached Python script, which starts up tritonserver and then feeds some sample data to warm up the cache.

It might be hard to reproduce, since it seems to happen only with some models.

run_trt_conversion.py.txt (4.6 KB)

Dockerfile.txt (3.6 KB)

There has been no update from you for a while, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks

Hi,

We want to verify this in our environment.

Could you share a model that can reproduce this issue?
Is there any public model we can use?

Thanks.