Profile TensorRT Model on Orin NX

I have installed Nsight DL Designer 2025.4 on a Jetson Orin NX and I am trying to “Profile TensorRT Model”.

It fails with the error listed below. I also tried “Export TensorRT Engine”, which seems to complete. It appears to be the subsequent profiling (after converting the ONNX model) that fails. I am invoking the tool as root and using mnist.onnx from the TensorRT installation.

Preparing to launch the Profile TensorRT Model activity on localhost...

Using target packages from the system. Skipping deployment.
Launched process: DLDesignerWorker (pid: 43637)

DLDesignerWorker profile-trt --use-system-trt --config "/tmp/NVIDIA Nsight Deep Learning Designer-dFCBmY/trtconfig.json" /usr/src/tensorrt/data/mnist/mnist.onnx /usr/src/tensorrt/data/mnist/mnist-TRT.nv-dld-report

Launch succeeded.

[INF] Profiling on Orin (GA10B)
[INF] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 286, GPU 4264 (MiB)
[INF] [TRT] [MemUsageChange] Init builder kernel library: CPU +927, GPU +752, now: CPU 1256, GPU 5059 (MiB)
[INF] [TRT] ----------------------------------------------------------------
[INF] [TRT] Input filename: /usr/src/tensorrt/data/mnist/mnist.onnx
[INF] [TRT] ONNX IR version: 0.0.3
[INF] [TRT] Opset version: 8
[INF] [TRT] Producer name: CNTK
[INF] [TRT] Producer version: 2.5.1
[INF] [TRT] Domain: ai.cntk
[INF] [TRT] Model version: 1
[INF] [TRT] Doc string: 
[INF] [TRT] ----------------------------------------------------------------
[INF] Building TensorRT engine. This may take some time.

[INF] [TRT] Global timing cache in use. Profiling results in this builder pass will be stored.
[INF] [TRT] Detected 1 inputs and 1 output network tensors.
[INF] [TRT] Total Host Persistent Memory: 18976
[INF] [TRT] Total Device Persistent Memory: 0
[INF] [TRT] Total Scratch Memory: 0
[INF] [TRT] [BlockAssignment] Started assigning block shifts. This will take 4 steps to complete.
[INF] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.015776ms to assign 2 blocks to 4 nodes requiring 31744 bytes.
[INF] [TRT] Total Activation Memory: 31744
[INF] [TRT] Total Weights Memory: 25704
[INF] [TRT] Engine generation completed in 0.0971114 seconds.
[INF] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 8 MiB
[INF] [TRT] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 1835 MiB
[INF] TensorRT engine build complete.
[INF] [TRT] Serialized 26 bytes of code generator cache.
[INF] [TRT] Serialized 4238735 bytes of compilation cache.
[INF] [TRT] Serialized 11949 timing cache entries
[INF] [TRT] Loaded engine size: 0 MiB
[INF] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[WRN] DL Designer does not implement clock controls on this platform.

[INF] Beginning 10 whole-network measurement passes.
[INF] Completed all whole-network measurement passes.
[INF] Beginning 10 per-layer measurement passes.
[INF] Completed all per-layer measurement passes.
[ERR] Failed to collect all whole-network metrics.
Process terminated.
Profiler encountered an error during execution: 0x1u.

Thanks for reporting this issue. Based on the log you shared, this definitely looks like a profiler-side problem.

To aid our investigation, can you share which driver and TensorRT versions you encountered the issue with? You can use nvidia-smi to get the driver version.

NVIDIA Jetson Orin NX

L4T: 36.4.7

Jetpack 6.2.1

CUDA: 12.6.68

TensorRT: 10.3.0.30

Driver Version: 540.4.0

@cvanderknyff : Was the info posted useful for you? Do you need more details?

Hi,

Chris is OOTO due to the Thanksgiving holiday. We will continue the investigation and report back once he is back in the office.

I spent some time trying to repro this issue on Orin devices, but all of my profiling attempts succeeded on the JetPack 6.x devices I had access to (L4T 36.4.3/JP 6.2 and L4T 36.3/JP 6.0), with both TensorRT 10.14.1 and 10.3.

My recommendation is to try reflashing the board and/or downgrading to JetPack 6.2. While your TensorRT version is old, changing it is unlikely to help.

Unfortunately, our next upcoming release is based on CUDA 13 and JetPack 7, so Orin devices will soon be temporarily unsupported by Nsight DL Designer until JetPack 7.2 reintegrates Orin support. The Jetson Roadmap currently has this scheduled for Q1 2026.

Thanks for the effort.

Did you try on an Orin NX?

Could the problem be that my Orin NX boots from a mounted disk instead of an SD card?

I have seen it fail on two Orin NX and succeed on one Orin Nano

I will try to reflash and report back.

These were both Jetson AGX Orin devkits, not NX modules. I don’t think mounting a disk vs. an SD card should be a problem.

I have tried downgrading JetPack, but the error persists. Is there any way you could try on a Jetson Orin NX? I suspect the issue is tied to the Orin NX.

Some extracts from the log:

[WRN] DL Designer does not implement clock controls on this platform.
[INF] Beginning 10 whole-network measurement passes.
[INF] Completed all whole-network measurement passes.
[INF] Beginning 10 per-layer measurement passes.
[INF] Completed all per-layer measurement passes.
[ERR] Failed to collect all whole-network metrics.
Launched application returned 1 (0x1).

Retrieving /root/nsight-dl-designer/trtreport.nv-dld-report to C:/Users/axej/best_largeImages-TRT-2.nv-dld-report
Failed to retrieve /root/nsight-dl-designer/trtreport.nv-dld-report
Failed to retrieve files.
Profiler encountered an error during execution: 0x1u.

I don’t personally have access to an Orin NX board but am trying to track one down.

I was able to get an 8GB Orin NX board (happily also using an NVMe drive and not an SD card) and test the profiler. Unfortunately I could not reproduce your issue using either TRT 10.3 or 10.14.1.

Confirming the same L4T version and display driver:

nvidia@tegra-ubuntu:~/nsight_dl/target/linux-v4l_l4t-dl-t210-a64$ nvidia-smi | head -5
Tue Dec  9 19:52:13 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 540.4.0                Driver Version: 540.4.0      CUDA Version: 12.6     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
nvidia@tegra-ubuntu:~/nsight_dl/target/linux-v4l_l4t-dl-t210-a64$ cat /etc/nv_tegra_release
# R36 (release), REVISION: 4.7, GCID: 42132812, BOARD: generic, EABI: aarch64, DATE: Thu Sep 18 22:54:44 UTC 2025
# KERNEL_VARIANT: oot
TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia

And an example of profiling using NDLD 2025.4 and the TensorRT 10.3 GA release for JetPack:

nvidia@tegra-ubuntu:~/nsight_dl/target/linux-v4l_l4t-dl-t210-a64$ ./ndld-prof --trt-library-path ~/TensorRT-10.3.0.26/lib ~/mnist.onnx
[INF] Loading TensorRT runtime from ../../../TensorRT-10.3.0.26/targets/aarch64-linux-gnu/lib/libnvinfer.so.10.3.0.
[INF] Profiling on Orin (GA10B)
[INF] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 286, GPU 1380 (MiB)
[INF] [TRT] [MemUsageChange] Init builder kernel library: CPU +927, GPU +754, now: CPU 1256, GPU 2178 (MiB)
[INF] [TRT] ----------------------------------------------------------------
[INF] [TRT] Input filename:   /home/nvidia/mnist.onnx
[INF] [TRT] ONNX IR version:  0.0.3
[INF] [TRT] Opset version:    8
[INF] [TRT] Producer name:    CNTK
[INF] [TRT] Producer version: 2.5.1
[INF] [TRT] Domain:           ai.cntk
[INF] [TRT] Model version:    1
[INF] [TRT] Doc string:       
[INF] [TRT] ----------------------------------------------------------------
[INF] Building TensorRT engine. This may take some time.
[INF] [TRT] Global timing cache in use. Profiling results in this builder pass will be stored.
[INF] [TRT] Detected 1 inputs and 1 output network tensors.
[INF] [TRT] Total Host Persistent Memory: 19552
[INF] [TRT] Total Device Persistent Memory: 0
[INF] [TRT] Total Scratch Memory: 0
[INF] [TRT] [BlockAssignment] Started assigning block shifts. This will take 7 steps to complete.
[INF] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.044162ms to assign 3 blocks to 7 nodes requiring 32256 bytes.
[INF] [TRT] Total Activation Memory: 31744
[INF] [TRT] Total Weights Memory: 25704
[INF] [TRT] Engine generation completed in 1.76839 seconds.
[INF] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 8 MiB
[INF] [TRT] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 1673 MiB
[INF] TensorRT engine build complete.
[INF] [TRT] Serialized 26 bytes of code generator cache.
[INF] [TRT] Serialized 5179 bytes of compilation cache.
[INF] [TRT] Serialized 39 timing cache entries
[INF] [TRT] Loaded engine size: 0 MiB
[INF] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[INF] Beginning 10 whole-network measurement passes.
[INF] Completed all whole-network measurement passes.
[INF] Beginning 10 per-layer measurement passes.
[INF] Completed all per-layer measurement passes.
[INF] Generating report data.
[INF] Profiling operations complete.
Median end-to-end latency:          0.13584 ms
Fastest end-to-end latency:         0.12394 ms
Slowest end-to-end latency:         0.30266 ms

Median GPU compute inference time:  0.13085 ms
Fastest GPU compute inference time: 0.11878 ms
Slowest GPU compute inference time: 0.28666 ms

Median input H2D copy time: 0.003312 ms
Median output D2H copy time: 0.00256 ms

Top 9 layer inference times within the median pass:
0.0764 ms    Convolution28 + Parameter6 + ONNXTRT_Broadcast + Plus30 + ReLU32
0.0456 ms    Convolution110 + Parameter88 + ONNXTRT_Broadcast_10 + Plus112 + ReLU114
0.0258 ms    __myl_MulSumAdd_myl7_1
0.0221 ms    Pooling66
 0.015 ms    Pooling160
0.0148 ms    Reformatting CopyNode for Input Tensor 0 to {ForeignNode[Parameter193 + Times212_reshape1...Plus214]}
0.00726 ms    Reformatting CopyNode for Input Tensor 0 to Convolution28 + Parameter6 + ONNXTRT_Broadcast + Plus30 + ReLU32
0.00502 ms    dummy_shape_call__mye602_0_myl7_0
0.00221 ms    Reformatting CopyNode for Input Tensor 0 to Pooling66

Network metric values:
    SMs Active: 16.2 %
    DRAM Read Throughput: 1.08 %
    DRAM Write Throughput: 0.36 %
    Tensor Active: 0.768 %
    Compute Warps in Flight: 2.13 %

Great! So there is hope for me too. Thanks for the effort.

I am trying to profile similarly to you, but have not had success yet.

You seem to be able to profile as the nvidia user (and not root?). How did you do that?

You also seem to have installed TensorRT separately? Is it from the tar package, the local DEB, or the cross-compile DEB repo?

On L4T platforms, add your user account to the debug group in order to profile without admin privileges. Typically this is sudo usermod -a -G debug nvidia.

My TensorRT install was from the official 10.3 tarball.
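For reference, the two steps above can be sketched as follows. The tarball path and version here are assumptions based on the paths shown later in this thread; adjust them to match your actual install location:

```shell
# Allow profiling without admin privileges on L4T: add the user to the
# debug group (log out and back in for the new group to take effect).
sudo usermod -a -G debug "$USER"

# Point the dynamic loader at the TensorRT tarball libraries and the
# CUDA runtime (paths assumed; match them to your unpack location).
export LD_LIBRARY_PATH="$HOME/TensorRT-10.3.0.26/lib:/usr/local/cuda-12.6/lib64:$LD_LIBRARY_PATH"
```

You can confirm the group change took effect by running `groups` in a fresh login shell and checking that `debug` appears in the list.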

I cannot reproduce the same behaviour as you. Have you installed anything other than TensorRT and NDLD? I have only flashed JP 6.2.1, installed TensorRT from the tarball, installed NDLD from the .deb, and modified LD_LIBRARY_PATH; nothing else.

nvidia@jetson-orin-nx-amc:~$ nvidia-smi | head -5
Thu Dec 11 16:39:27 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 540.4.0                Driver Version: 540.4.0      CUDA Version: 12.6     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |

nvidia@jetson-orin-nx-amc:~$ cat /etc/nv_tegra_release

# R36 (release), REVISION: 4.4, GCID: 41062509, BOARD: generic, EABI: aarch64, DATE: Mon Jun 16 16:07:13 UTC 2025
# KERNEL_VARIANT: oot

TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia

nvidia@jetson-orin-nx-amc:~$ echo $LD_LIBRARY_PATH
/home/nvidia/TensorRT-10.3.0.26/lib:/usr/local/cuda-12.6/lib64:

nvidia@jetson-orin-nx-amc:~$ sudo /opt/nvidia/nsight_dl/2025.4.25294.1153/target/linux-v4l_l4t-dl-t210-a64/ndld-prof --trt-library-path ~/TensorRT-10.3.0.26/lib /home/nvidia/TensorRT-10.3.0.26/data/mnist/mnist.onnx
[INF] Loading TensorRT runtime from TensorRT-10.3.0.26/targets/aarch64-linux-gnu/lib/libnvinfer.so.10.3.0.
[INF] Profiling on Orin (GA10B)
[INF] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 286, GPU 1890 (MiB)
[INF] [TRT] [MemUsageChange] Init builder kernel library: CPU +927, GPU +1098, now: CPU 1256, GPU 3029 (MiB)
[INF] [TRT] ----------------------------------------------------------------
[INF] [TRT] Input filename:   /home/nvidia/TensorRT-10.3.0.26/data/mnist/mnist.onnx
[INF] [TRT] ONNX IR version:  0.0.3
[INF] [TRT] Opset version:    8
[INF] [TRT] Producer name:    CNTK
[INF] [TRT] Producer version: 2.5.1
[INF] [TRT] Domain:           ai.cntk
[INF] [TRT] Model version:    1
[INF] [TRT] Doc string:
[INF] [TRT] ----------------------------------------------------------------
[INF] Building TensorRT engine. This may take some time.
[INF] [TRT] Global timing cache in use. Profiling results in this builder pass will be stored.
[INF] [TRT] Detected 1 inputs and 1 output network tensors.
[INF] [TRT] Total Host Persistent Memory: 18976
[INF] [TRT] Total Device Persistent Memory: 0
[INF] [TRT] Total Scratch Memory: 0
[INF] [TRT] [BlockAssignment] Started assigning block shifts. This will take 4 steps to complete.
[INF] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.022497ms to assign 2 blocks to 4 nodes requiring 31744 bytes.
[INF] [TRT] Total Activation Memory: 31744
[INF] [TRT] Total Weights Memory: 25704
[INF] [TRT] Engine generation completed in 1.62036 seconds.
[INF] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 8 MiB
[INF] [TRT] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 1681 MiB
[INF] TensorRT engine build complete.
[INF] [TRT] Serialized 26 bytes of code generator cache.
[INF] [TRT] Serialized 5179 bytes of compilation cache.
[INF] [TRT] Serialized 39 timing cache entries
[WRN] The timing cache could not be saved.
[INF] Exception text: /root/.config/NVIDIA Corporation/NVIDIA Nsight Deep Learning Designer/timing_cache.10.3.0.26.bin: No such file or directory
[INF] [TRT] Loaded engine size: 0 MiB
[INF] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[INF] Beginning 10 whole-network measurement passes.
[INF] Completed all whole-network measurement passes.
[INF] Beginning 10 per-layer measurement passes.
[INF] Completed all per-layer measurement passes.
[ERR] Failed to collect all whole-network metrics.

No, this was a clean image. I didn’t install anything else.

I think something is broken with NDLD; I hope you can help.

If I use ndld-prof to profile, I get

[ERR] Failed to collect all whole-network metrics.

Can you try to figure out why NDLD is outputting that? Verbose output doesn't produce anything more.

If I use ndld-prof to generate an .engine file, I can manually profile it with trtexec, which somehow indicates that the system installation is OK (but NDLD is not working):

nvidia@jetson-orin-nx-amc:~$ /usr/src/tensorrt/bin/trtexec --loadEngine=mnist.engine  --dumpProfile
&&&& RUNNING TensorRT.trtexec [TensorRT v100300] # /usr/src/tensorrt/bin/trtexec --loadEngine=mnist.engine --dumpProfile
[12/11/2025-20:27:26] [I] === Model Options ===
[12/11/2025-20:27:26] [I] Format: *
[12/11/2025-20:27:26] [I] Model:
[12/11/2025-20:27:26] [I] Output:
[12/11/2025-20:27:26] [I]
[12/11/2025-20:27:26] [I] === System Options ===
[12/11/2025-20:27:26] [I] Device: 0
[12/11/2025-20:27:26] [I] DLACore:
[12/11/2025-20:27:26] [I] Plugins:
[12/11/2025-20:27:26] [I] setPluginsToSerialize:
[12/11/2025-20:27:26] [I] dynamicPlugins:
[12/11/2025-20:27:26] [I] ignoreParsedPluginLibs: 0
[12/11/2025-20:27:26] [I]
[12/11/2025-20:27:26] [I] === Inference Options ===
[12/11/2025-20:27:26] [I] Batch: Explicit
[12/11/2025-20:27:26] [I] Input inference shapes: model
[12/11/2025-20:27:26] [I] Iterations: 10
[12/11/2025-20:27:26] [I] Duration: 3s (+ 200ms warm up)
[12/11/2025-20:27:26] [I] Sleep time: 0ms
[12/11/2025-20:27:26] [I] Idle time: 0ms
[12/11/2025-20:27:26] [I] Inference Streams: 1
[12/11/2025-20:27:26] [I] ExposeDMA: Disabled
[12/11/2025-20:27:26] [I] Data transfers: Enabled
[12/11/2025-20:27:26] [I] Spin-wait: Disabled
[12/11/2025-20:27:26] [I] Multithreading: Disabled
[12/11/2025-20:27:26] [I] CUDA Graph: Disabled
[12/11/2025-20:27:26] [I] Separate profiling: Disabled
[12/11/2025-20:27:26] [I] Time Deserialize: Disabled
[12/11/2025-20:27:26] [I] Time Refit: Disabled
[12/11/2025-20:27:26] [I] NVTX verbosity: 0
[12/11/2025-20:27:26] [I] Persistent Cache Ratio: 0
[12/11/2025-20:27:26] [I] Optimization Profile Index: 0
[12/11/2025-20:27:26] [I] Weight Streaming Budget: 100.000000%
[12/11/2025-20:27:26] [I] Inputs:
[12/11/2025-20:27:26] [I] Debug Tensor Save Destinations:
[12/11/2025-20:27:26] [I] === Reporting Options ===
[12/11/2025-20:27:26] [I] Verbose: Disabled
[12/11/2025-20:27:26] [I] Averages: 10 inferences
[12/11/2025-20:27:26] [I] Percentiles: 90,95,99
[12/11/2025-20:27:26] [I] Dump refittable layers:Disabled
[12/11/2025-20:27:26] [I] Dump output: Disabled
[12/11/2025-20:27:26] [I] Profile: Enabled
[12/11/2025-20:27:26] [I] Export timing to JSON file:
[12/11/2025-20:27:26] [I] Export output to JSON file:
[12/11/2025-20:27:26] [I] Export profile to JSON file:
[12/11/2025-20:27:26] [I]
[12/11/2025-20:27:26] [I] === Device Information ===
[12/11/2025-20:27:26] [I] Available Devices:
[12/11/2025-20:27:26] [I]   Device 0: "Orin" UUID: GPU-0ab43d57-aaff-5dc9-9fe5-fd4f9338b4c0
[12/11/2025-20:27:26] [I] Selected Device: Orin
[12/11/2025-20:27:26] [I] Selected Device ID: 0
[12/11/2025-20:27:26] [I] Selected Device UUID: GPU-0ab43d57-aaff-5dc9-9fe5-fd4f9338b4c0
[12/11/2025-20:27:26] [I] Compute Capability: 8.7
[12/11/2025-20:27:26] [I] SMs: 8
[12/11/2025-20:27:26] [I] Device Global Memory: 15655 MiB
[12/11/2025-20:27:26] [I] Shared Memory per SM: 164 KiB
[12/11/2025-20:27:26] [I] Memory Bus Width: 256 bits (ECC disabled)
[12/11/2025-20:27:26] [I] Application Compute Clock Rate: 0.918 GHz
[12/11/2025-20:27:26] [I] Application Memory Clock Rate: 0.918 GHz
[12/11/2025-20:27:26] [I]
[12/11/2025-20:27:26] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[12/11/2025-20:27:26] [I]
[12/11/2025-20:27:26] [I] TensorRT version: 10.3.0
[12/11/2025-20:27:26] [I] Loading standard plugins
[12/11/2025-20:27:26] [I] [TRT] Loaded engine size: 0 MiB
[12/11/2025-20:27:26] [I] Engine deserialized in 0.0189238 sec.
[12/11/2025-20:27:26] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[12/11/2025-20:27:26] [I] Setting persistentCacheLimit to 0 bytes.
[12/11/2025-20:27:26] [I] Created execution context with device memory size: 0.0302734 MiB
[12/11/2025-20:27:26] [I] Using random values for input Input3
[12/11/2025-20:27:26] [I] Input binding for Input3 with dimensions 1x1x28x28 is created.
[12/11/2025-20:27:26] [I] Output binding for Plus214_Output_0 with dimensions 1x10 is created.
[12/11/2025-20:27:26] [I] Starting inference
[12/11/2025-20:27:29] [I] The e2e network timing is not reported since it is inaccurate due to the extra synchronizations when the profiler is enabled.
[12/11/2025-20:27:29] [I] To show e2e network timing report, add --separateProfileRun to profile layer timing in a separate run or remove --dumpProfile to disable the profiler.
[12/11/2025-20:27:29] [I]
[12/11/2025-20:27:29] [I] === Profile (20228 iterations ) ===
[12/11/2025-20:27:29] [I]    Time(ms)     Avg.(ms)   Median(ms)   Time(%)   Layer
[12/11/2025-20:27:29] [I]      646.32       0.0320       0.0317      33.3   Convolution28 + Parameter6 + ONNXTRT_Broadcast + Plus30 + ReLU32
[12/11/2025-20:27:29] [I]      256.34       0.0127       0.0117      13.2   Pooling66
[12/11/2025-20:27:29] [I]      480.07       0.0237       0.0220      24.7   Convolution110 + Parameter88 + ONNXTRT_Broadcast_10 + Plus112 + ReLU114
[12/11/2025-20:27:29] [I]      274.50       0.0136       0.0131      14.1   Pooling160
[12/11/2025-20:27:29] [I]       40.22       0.0020       0.0020       2.1   dummy_shape_call__mye602_0_myl4_0
[12/11/2025-20:27:29] [I]      243.36       0.0120       0.0106      12.5   __myl_MulSumAdd_myl4_1
[12/11/2025-20:27:29] [I]     1940.81       0.0959       0.0925     100.0   Total
[12/11/2025-20:27:29] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v100300] # /usr/src/tensorrt/bin/trtexec --loadEngine=mnist.engine --dumpProfile

Engine file saved by ndld-prof:

nvidia@jetson-orin-nx-amc:~$ sudo /opt/nvidia/nsight_dl/2025.4.25294.1153/target/linux-v4l_l4t-dl-t210-a64/ndld-prof /usr/src/tensorrt/data/mnist/mnist.onnx --save-engine mnist.engine --trt-library-path /home/
nvidia/TensorRT-10.3.0.26/lib
[INF] Loading TensorRT runtime from TensorRT-10.3.0.26/targets/aarch64-linux-gnu/lib/libnvinfer.so.10.3.0.
[INF] Profiling on Orin (GA10B)
[INF] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 286, GPU 2379 (MiB)
[INF] [TRT] [MemUsageChange] Init builder kernel library: CPU +927, GPU +747, now: CPU 1256, GPU 3170 (MiB)
[INF] [TRT] ----------------------------------------------------------------
[INF] [TRT] Input filename: /usr/src/tensorrt/data/mnist/mnist.onnx
[INF] [TRT] ONNX IR version: 0.0.3
[INF] [TRT] Opset version: 8
[INF] [TRT] Producer name: CNTK
[INF] [TRT] Producer version: 2.5.1
[INF] [TRT] Domain: ai.cntk
[INF] [TRT] Model version: 1
[INF] [TRT] Doc string:
[INF] [TRT] ----------------------------------------------------------------
[INF] Building TensorRT engine. This may take some time.
[INF] [TRT] Global timing cache in use. Profiling results in this builder pass will be stored.
[INF] [TRT] Detected 1 inputs and 1 output network tensors.
[INF] [TRT] Total Host Persistent Memory: 18976
[INF] [TRT] Total Device Persistent Memory: 0
[INF] [TRT] Total Scratch Memory: 0
[INF] [TRT] [BlockAssignment] Started assigning block shifts. This will take 4 steps to complete.
[INF] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.042625ms to assign 2 blocks to 4 nodes requiring 31744 bytes.
[INF] [TRT] Total Activation Memory: 31744
[INF] [TRT] Total Weights Memory: 25704
[INF] [TRT] Engine generation completed in 0.0999661 seconds.
[INF] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 8 MiB
[INF] [TRT] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 1681 MiB
[INF] TensorRT engine build complete.
[INF] [TRT] Serialized 26 bytes of code generator cache.
[INF] [TRT] Serialized 5179 bytes of compilation cache.
[INF] [TRT] Serialized 39 timing cache entries
[INF] Saving engine to disk.
[INF] [TRT] Loaded engine size: 0 MiB
[INF] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[INF] Beginning 10 whole-network measurement passes.
[INF] Completed all whole-network measurement passes.
[INF] Beginning 10 per-layer measurement passes.
[INF] Completed all per-layer measurement passes.
[ERR] Failed to collect all whole-network metrics.

I also notice some permission errors when trying to specify the path to the TensorRT library directory, even when in the debug group:

nvidia@jetson-orin-nx-amc:~$ /opt/nvidia/nsight_dl/2025.4.25294.1153/target/linux-v4l_l4t-dl-t210-a64/ndld-prof /usr/src/tensorrt/data/mnist/mnist.onnx --save-engine mnist.engine --trt-library-path /home/nvidia/TensorRT-10.3.0.26/lib
[INF] Loading TensorRT runtime from TensorRT-10.3.0.26/targets/aarch64-linux-gnu/lib/libnvinfer.so.10.3.0.
filesystem error: cannot create symlink: Permission denied [/home/nvidia/TensorRT-10.3.0.26/targets/aarch64-linux-gnu/lib/libnvinfer_builder_resource.so.10.3.0] [/opt/nvidia/nsight_dl/2025.4.25294.1153/target/linux-v4l_l4t-dl-t210-a64/libnvinfer_builder_resource.so.10.3.0]
nvidia@jetson-orin-nx-amc:~$ groups
nvidia adm cdrom sudo audio dip video plugdev render i2c lpadmin gdm sambashare debug weston-launch gpio

We’ve asked our QA team to try to reproduce your problem on our systems.

The trtexec --dumpProfile feature is a timestamp-only profiler which does not collect GPU performance counters such as tensor core utilization. Unfortunately, the DLD error message you’ve reported is solidly related to that latter feature, so it is unsurprising that trtexec works.

The permission error you report is caused during NDLD startup. When loading TensorRT from a specific directory (as opposed to --use-system-trt, which uses LD_LIBRARY_PATH alone to locate TensorRT), NDLD creates temporary files in its executable directory in order to load TensorRT via dlopen. You’re running from /opt, which presumably is not writable by world, user, or the debug group. Installing the product to a more permissive location (either via chmod or by installing to somewhere like /home) should fix the issue.
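A sketch of those two workarounds, using the install path from the logs in this thread (the group-based permissions below are an assumption; any scheme that makes the target directory writable by the profiling user would do):

```shell
# Option 1: make the NDLD target directory writable by the debug group,
# so ndld-prof can create its temporary symlinks next to its executable.
sudo chgrp -R debug /opt/nvidia/nsight_dl/2025.4.25294.1153/target
sudo chmod -R g+w  /opt/nvidia/nsight_dl/2025.4.25294.1153/target

# Option 2: copy the target tree to a user-writable location and run from there.
cp -r /opt/nvidia/nsight_dl/2025.4.25294.1153/target ~/nsight_dl-target
~/nsight_dl-target/linux-v4l_l4t-dl-t210-a64/ndld-prof \
    --trt-library-path ~/TensorRT-10.3.0.26/lib ~/mnist.onnx
```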

Our QA team was unable to reproduce your issue on Orin NX. As DLD 2025.5 has removed support for JetPack 6 targets, we are unfortunately not able to assist with further troubleshooting.