Execution time much slower with TensorRT

Description

I’m trying to speed up a KDE calculation, not a neural-network model. I wrap the function in a torch nn.Module, export it to ONNX, and then build a TensorRT engine with trtexec.

On my laptop (RTX 4070, TensorRT 10.9) I get roughly a 7x speedup (28 ms → 4 ms), and building the engine takes less than a minute.

On the Orin, however, building the engine takes over 30 minutes. In the verbose log I can see it trying many tactics, each of which takes a very long time to execute. Worse, even after the long build, execution is about 2x slower than the original PyTorch function (38 ms → 65 ms).

Environment

TensorRT Version: 8.6.2.3
GPU Type: Jetson Orin AGX 64
Nvidia Driver Version:
CUDA Version: 12.2.140
CUDNN Version: 8.9.4
Operating System + Version: Jetpack 6.0, L4T 36.3
Python Version (if applicable): 3.10.12
PyTorch Version (if applicable): 2.3
Baremetal or Container (if container which image + tag):

Relevant Files

I’m attaching the “model” code and the ONNX export code in kde_example.py.
I also uploaded the resulting .onnx and .trt files.

Steps To Reproduce

python3 kde_example.py
trtexec --onnx=kde.onnx --fp16 --saveEngine=kde.trt --verbose --builderOptimizationLevel=5