I think something is broken with NDLD - hope you can help.
If I use ndld-prof to profile, I get
[ERR] Failed to collect all whole-network metrics.
Can you help me figure out why NDLD is outputting that? Verbose output doesn't produce anything more.
If I use ndld-prof to generate an .engine file, I can profile it manually with trtexec, which suggests that the system installation is OK (but NDLD is not working):
nvidia@jetson-orin-nx-amc:~$ /usr/src/tensorrt/bin/trtexec --loadEngine=mnist.engine --dumpProfile
&&&& RUNNING TensorRT.trtexec [TensorRT v100300] # /usr/src/tensorrt/bin/trtexec --loadEngine=mnist.engine --dumpProfile
[12/11/2025-20:27:26] [I] === Model Options ===
[12/11/2025-20:27:26] [I] Format: *
[12/11/2025-20:27:26] [I] Model:
[12/11/2025-20:27:26] [I] Output:
[12/11/2025-20:27:26] [I]
[12/11/2025-20:27:26] [I] === System Options ===
[12/11/2025-20:27:26] [I] Device: 0
[12/11/2025-20:27:26] [I] DLACore:
[12/11/2025-20:27:26] [I] Plugins:
[12/11/2025-20:27:26] [I] setPluginsToSerialize:
[12/11/2025-20:27:26] [I] dynamicPlugins:
[12/11/2025-20:27:26] [I] ignoreParsedPluginLibs: 0
[12/11/2025-20:27:26] [I]
[12/11/2025-20:27:26] [I] === Inference Options ===
[12/11/2025-20:27:26] [I] Batch: Explicit
[12/11/2025-20:27:26] [I] Input inference shapes: model
[12/11/2025-20:27:26] [I] Iterations: 10
[12/11/2025-20:27:26] [I] Duration: 3s (+ 200ms warm up)
[12/11/2025-20:27:26] [I] Sleep time: 0ms
[12/11/2025-20:27:26] [I] Idle time: 0ms
[12/11/2025-20:27:26] [I] Inference Streams: 1
[12/11/2025-20:27:26] [I] ExposeDMA: Disabled
[12/11/2025-20:27:26] [I] Data transfers: Enabled
[12/11/2025-20:27:26] [I] Spin-wait: Disabled
[12/11/2025-20:27:26] [I] Multithreading: Disabled
[12/11/2025-20:27:26] [I] CUDA Graph: Disabled
[12/11/2025-20:27:26] [I] Separate profiling: Disabled
[12/11/2025-20:27:26] [I] Time Deserialize: Disabled
[12/11/2025-20:27:26] [I] Time Refit: Disabled
[12/11/2025-20:27:26] [I] NVTX verbosity: 0
[12/11/2025-20:27:26] [I] Persistent Cache Ratio: 0
[12/11/2025-20:27:26] [I] Optimization Profile Index: 0
[12/11/2025-20:27:26] [I] Weight Streaming Budget: 100.000000%
[12/11/2025-20:27:26] [I] Inputs:
[12/11/2025-20:27:26] [I] Debug Tensor Save Destinations:
[12/11/2025-20:27:26] [I] === Reporting Options ===
[12/11/2025-20:27:26] [I] Verbose: Disabled
[12/11/2025-20:27:26] [I] Averages: 10 inferences
[12/11/2025-20:27:26] [I] Percentiles: 90,95,99
[12/11/2025-20:27:26] [I] Dump refittable layers:Disabled
[12/11/2025-20:27:26] [I] Dump output: Disabled
[12/11/2025-20:27:26] [I] Profile: Enabled
[12/11/2025-20:27:26] [I] Export timing to JSON file:
[12/11/2025-20:27:26] [I] Export output to JSON file:
[12/11/2025-20:27:26] [I] Export profile to JSON file:
[12/11/2025-20:27:26] [I]
[12/11/2025-20:27:26] [I] === Device Information ===
[12/11/2025-20:27:26] [I] Available Devices:
[12/11/2025-20:27:26] [I] Device 0: "Orin" UUID: GPU-0ab43d57-aaff-5dc9-9fe5-fd4f9338b4c0
[12/11/2025-20:27:26] [I] Selected Device: Orin
[12/11/2025-20:27:26] [I] Selected Device ID: 0
[12/11/2025-20:27:26] [I] Selected Device UUID: GPU-0ab43d57-aaff-5dc9-9fe5-fd4f9338b4c0
[12/11/2025-20:27:26] [I] Compute Capability: 8.7
[12/11/2025-20:27:26] [I] SMs: 8
[12/11/2025-20:27:26] [I] Device Global Memory: 15655 MiB
[12/11/2025-20:27:26] [I] Shared Memory per SM: 164 KiB
[12/11/2025-20:27:26] [I] Memory Bus Width: 256 bits (ECC disabled)
[12/11/2025-20:27:26] [I] Application Compute Clock Rate: 0.918 GHz
[12/11/2025-20:27:26] [I] Application Memory Clock Rate: 0.918 GHz
[12/11/2025-20:27:26] [I]
[12/11/2025-20:27:26] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[12/11/2025-20:27:26] [I]
[12/11/2025-20:27:26] [I] TensorRT version: 10.3.0
[12/11/2025-20:27:26] [I] Loading standard plugins
[12/11/2025-20:27:26] [I] [TRT] Loaded engine size: 0 MiB
[12/11/2025-20:27:26] [I] Engine deserialized in 0.0189238 sec.
[12/11/2025-20:27:26] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[12/11/2025-20:27:26] [I] Setting persistentCacheLimit to 0 bytes.
[12/11/2025-20:27:26] [I] Created execution context with device memory size: 0.0302734 MiB
[12/11/2025-20:27:26] [I] Using random values for input Input3
[12/11/2025-20:27:26] [I] Input binding for Input3 with dimensions 1x1x28x28 is created.
[12/11/2025-20:27:26] [I] Output binding for Plus214_Output_0 with dimensions 1x10 is created.
[12/11/2025-20:27:26] [I] Starting inference
[12/11/2025-20:27:29] [I] The e2e network timing is not reported since it is inaccurate due to the extra synchronizations when the profiler is enabled.
[12/11/2025-20:27:29] [I] To show e2e network timing report, add --separateProfileRun to profile layer timing in a separate run or remove --dumpProfile to disable the profiler.
[12/11/2025-20:27:29] [I]
[12/11/2025-20:27:29] [I] === Profile (20228 iterations ) ===
[12/11/2025-20:27:29] [I] Time(ms) Avg.(ms) Median(ms) Time(%) Layer
[12/11/2025-20:27:29] [I] 646.32 0.0320 0.0317 33.3 Convolution28 + Parameter6 + ONNXTRT_Broadcast + Plus30 + ReLU32
[12/11/2025-20:27:29] [I] 256.34 0.0127 0.0117 13.2 Pooling66
[12/11/2025-20:27:29] [I] 480.07 0.0237 0.0220 24.7 Convolution110 + Parameter88 + ONNXTRT_Broadcast_10 + Plus112 + ReLU114
[12/11/2025-20:27:29] [I] 274.50 0.0136 0.0131 14.1 Pooling160
[12/11/2025-20:27:29] [I] 40.22 0.0020 0.0020 2.1 dummy_shape_call__mye602_0_myl4_0
[12/11/2025-20:27:29] [I] 243.36 0.0120 0.0106 12.5 __myl_MulSumAdd_myl4_1
[12/11/2025-20:27:29] [I] 1940.81 0.0959 0.0925 100.0 Total
[12/11/2025-20:27:29] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v100300] # /usr/src/tensorrt/bin/trtexec --loadEngine=mnist.engine --dumpProfile
The engine file above was saved by ndld-prof like this:
nvidia@jetson-orin-nx-amc:~$ sudo /opt/nvidia/nsight_dl/2025.4.25294.1153/target/linux-v4l_l4t-dl-t210-a64/ndld-prof /usr/src/tensorrt/data/mnist/mnist.onnx --save-engine mnist.engine --trt-library-path /home/nvidia/TensorRT-10.3.0.26/lib
[INF] Loading TensorRT runtime from TensorRT-10.3.0.26/targets/aarch64-linux-gnu/lib/libnvinfer.so.10.3.0.
[INF] Profiling on Orin (GA10B)
[INF] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 286, GPU 2379 (MiB)
[INF] [TRT] [MemUsageChange] Init builder kernel library: CPU +927, GPU +747, now: CPU 1256, GPU 3170 (MiB)
[INF] [TRT] ----------------------------------------------------------------
[INF] [TRT] Input filename: /usr/src/tensorrt/data/mnist/mnist.onnx
[INF] [TRT] ONNX IR version: 0.0.3
[INF] [TRT] Opset version: 8
[INF] [TRT] Producer name: CNTK
[INF] [TRT] Producer version: 2.5.1
[INF] [TRT] Domain: ai.cntk
[INF] [TRT] Model version: 1
[INF] [TRT] Doc string:
[INF] [TRT] ----------------------------------------------------------------
[INF] Building TensorRT engine. This may take some time.
[INF] [TRT] Global timing cache in use. Profiling results in this builder pass will be stored.
[INF] [TRT] Detected 1 inputs and 1 output network tensors.
[INF] [TRT] Total Host Persistent Memory: 18976
[INF] [TRT] Total Device Persistent Memory: 0
[INF] [TRT] Total Scratch Memory: 0
[INF] [TRT] [BlockAssignment] Started assigning block shifts. This will take 4 steps to complete.
[INF] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.042625ms to assign 2 blocks to 4 nodes requiring 31744 bytes.
[INF] [TRT] Total Activation Memory: 31744
[INF] [TRT] Total Weights Memory: 25704
[INF] [TRT] Engine generation completed in 0.0999661 seconds.
[INF] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 8 MiB
[INF] [TRT] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 1681 MiB
[INF] TensorRT engine build complete.
[INF] [TRT] Serialized 26 bytes of code generator cache.
[INF] [TRT] Serialized 5179 bytes of compilation cache.
[INF] [TRT] Serialized 39 timing cache entries
[INF] Saving engine to disk.
[INF] [TRT] Loaded engine size: 0 MiB
[INF] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[INF] Beginning 10 whole-network measurement passes.
[INF] Completed all whole-network measurement passes.
[INF] Beginning 10 per-layer measurement passes.
[INF] Completed all per-layer measurement passes.
[ERR] Failed to collect all whole-network metrics.
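In case it helps narrow this down, these are the sanity checks I can think of for GPU performance-counter access (it is only my assumption that this is what gates whole-network metric collection; the error message and verbose output give no hint):

```shell
# Hypothetical sanity checks; assumes metric collection is gated by
# perf-counter permissions, which the error message does not confirm.
id                                        # confirm membership in the "debug" group
cat /proc/sys/kernel/perf_event_paranoid  # values > 2 block unprivileged counters
```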
I also notice permission errors when specifying the path to the TensorRT library directory, even though I am in the debug group:
nvidia@jetson-orin-nx-amc:~$ /opt/nvidia/nsight_dl/2025.4.25294.1153/target/linux-v4l_l4t-dl-t210-a64/ndld-prof /usr/src/tensorrt/data/mnist/mnist.onnx --save-engine mnist.engine --trt-library-path /home/nvidia/TensorRT-10.3.0.26/lib
[INF] Loading TensorRT runtime from TensorRT-10.3.0.26/targets/aarch64-linux-gnu/lib/libnvinfer.so.10.3.0.
filesystem error: cannot create symlink: Permission denied [/home/nvidia/TensorRT-10.3.0.26/targets/aarch64-linux-gnu/lib/libnvinfer_builder_resource.so.10.3.0] [/opt/nvidia/nsight_dl/2025.4.25294.1153/target/linux-v4l_l4t-dl-t210-a64/libnvinfer_builder_resource.so.10.3.0]
nvidia@jetson-orin-nx-amc:~$ groups
nvidia adm cdrom sudo audio dip video plugdev render i2c lpadmin gdm sambashare debug weston-launch gpio
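As for the symlink failure: a possible (untested) workaround might be to pre-create the link as root, so that the non-root run no longer needs write access to the ndld-prof install directory. This is only a sketch; the paths are taken directly from the error message above:

```shell
# Untested workaround sketch: manually create the symlink that ndld-prof
# fails to create (it needs write permission on its own install directory).
sudo ln -sf \
  /home/nvidia/TensorRT-10.3.0.26/targets/aarch64-linux-gnu/lib/libnvinfer_builder_resource.so.10.3.0 \
  /opt/nvidia/nsight_dl/2025.4.25294.1153/target/linux-v4l_l4t-dl-t210-a64/libnvinfer_builder_resource.so.10.3.0
```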