Remote profiling from Windows to Jetson Orin Nano fails

When I use Nsight DL to remotely profile an ONNX file, I get the error listed at the end of this post.

I am running Nsight on Windows and connecting over SSH to a Jetson Orin Nano running JetPack 6.2.

I am wondering about the path /home/nvidia/Documents/onnxruntime, which exists on neither the remote nor the host. I am also wondering whether I am missing some installation on the remote.

/home/nvidia/Documents/onnxruntime/onnxruntime/core/session/provider_bridge_ort.cc:1695 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /home/nvidia/libonnxruntime_providers_cuda.so)

Hello,

Thanks for visiting the NVIDIA Developer Forums.
To ensure better visibility and support, I’ve moved your post to the Jetson category, where it’s more appropriate.

Cheers,
Tom

Hi,

Which Nsight tool are you using?
For example, with Nsight Systems you should see a UI similar to the one in the link below:

Thanks.

I am using Nsight Deep Learning Designer 2025.3.

On further inspection, this looks like a bug in Nsight Deep Learning Designer 2025.3. It appears to be compiled against glibc 2.38, but the Jetson Orin Nano only has glibc 2.35. (The system requirements state glibc 2.29.)
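
For anyone who wants to check the same mismatch on their own target, here is a minimal sketch. It assumes a GNU userland (`objdump`, `getconf`, and `sort -V` available) and uses the library path from the error message as a default. The helper names `max_glibc_required` and `version_le` are my own, not part of any NVIDIA tool.

```shell
#!/usr/bin/env bash
# Sketch: compare the glibc version a shared library asks for with the one installed.

# Read text on stdin and print the highest GLIBC_x.y version mentioned in it.
max_glibc_required() {
    grep -oE 'GLIBC_[0-9]+(\.[0-9]+)*' | sed 's/^GLIBC_//' | sort -V | tail -n 1
}

# Succeed when version $1 is less than or equal to version $2.
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Default path is the one from the error message; pass another path as $1.
lib="${1:-/home/nvidia/libonnxruntime_providers_cuda.so}"

if [ -r "$lib" ] && command -v objdump >/dev/null 2>&1; then
    need="$(objdump -T "$lib" | max_glibc_required)"
    have="$(getconf GNU_LIBC_VERSION | awk '{print $2}')"
    if [ -z "$need" ]; then
        echo "No versioned GLIBC symbols found in $lib"
    elif version_le "$need" "$have"; then
        echo "OK: $lib needs glibc $need, system has $have"
    else
        echo "MISMATCH: $lib needs glibc $need, system has $have"
    fi
fi
```

Run it on the Jetson itself, not on the Windows host, since the library is loaded on the target.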

It would be nice to try an earlier version of Nsight Deep Learning Designer, perhaps 2025.2, to verify that profiling works on the Jetson Orin Nano.

Hi,

Thanks for providing the details.
Before remote profiling, did you also install Nsight Deep Learning Designer on the Orin Nano?

Thanks.

Yes - I also tried installing Nsight Deep Learning Designer on the Jetson Orin Nano. I tried profiling directly on the target and then, after the install, remote profiling again. Both failed.

I can add the error output from on-target profiling later today.

Is it possible to try the previous version of Nsight Deep Learning Designer as well? Is there a download link for previously released versions?

Hi,

I just tested it on a JetPack 6.2.1 environment, and it works correctly.
Could you try it again? Please note that you will need to run it with sudo to get the GPU trace data.

$ sudo /opt/nvidia/nsight_dl/2025.3.25220.1113/target/linux-v4l_l4t-dl-t210-a64/ndld-prof /usr/src/tensorrt/data/mnist/mnist.onnx
[INF] Could not determine the previous TensorRT version. Will attempt to load from the local environment.
...
[INF] Completed all per-layer measurement passes.
[INF] Generating report data.
[INF] Profiling operations complete.
Median end-to-end latency:          0.11056 ms
Fastest end-to-end latency:         0.10182 ms
Slowest end-to-end latency:         0.14506 ms

Median GPU compute inference time:  0.10578 ms
Fastest GPU compute inference time: 0.096704 ms
Slowest GPU compute inference time: 0.13469 ms

Median input H2D copy time: 0.002544 ms
Median output D2H copy time: 0.002336 ms

Top 6 layer inference times within the median pass:
0.0577 ms    Convolution28 + Parameter6 + ONNXTRT_Broadcast + Plus30 + ReLU32
0.0281 ms    Convolution110 + Parameter88 + ONNXTRT_Broadcast_10 + Plus112 + ReLU114
0.0197 ms    __myl_MulSumAdd_myl4_1
0.0188 ms    Pooling160
0.0154 ms    Pooling66
0.00246 ms    dummy_shape_call__mye602_0_myl4_0

Network metric values:
    SMs Active: 22.9 %
    DRAM Read Throughput: 0.432 %
    DRAM Write Throughput: 0.134 %
    Tensor Active: 0.197 %
    Compute Warps in Flight: 2.93 %

Thanks.
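
If you want to compare runs across devices or tool versions, the "Median ..." lines of the summary can be pulled out with a small filter. This is only a sketch that assumes the console format shown in the output above; the function name `median_timings` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: extract the "Median ..." summary lines from ndld-prof console output.

# Read profiler output on stdin and print one "label=milliseconds" line
# for each line of the form "Median <label>: <number> ms".
median_timings() {
    awk -F': *' '/^Median .*: *[0-9.]+ ms$/ { sub(/ ms$/, "", $2); print $1 "=" $2 }'
}

# Usage (on the target, as root), for example:
#   sudo /opt/nvidia/nsight_dl/2025.3.25220.1113/target/linux-v4l_l4t-dl-t210-a64/ndld-prof \
#       /usr/src/tensorrt/data/mnist/mnist.onnx | median_timings
```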

Yes - you are right. Running on the target works, and yes, I needed to run as root.

Also, there is now a new version of Nsight Deep Learning Designer, which also works. I am now able to profile remote targets as well.

Now I have tried on a Jetson Orin NX, and the problems arise again:

nvidia@jetson-orin-nx-amc:/opt/nvidia/nsight_dl/2025.4.25294.1153/target/linux-v4l_l4t-dl-t210-a64$ sudo ./ndld-prof /usr/src/tensorrt/data/mnist/mnist.onnx
[INF] Loading TensorRT runtime from ../../../../../../root/.config/NVIDIA Corporation/NVIDIA Nsight Deep Learning Designer/2025.4.25294.1153 (build 36727793) (public-release)/target/linux-v4l_l4t-dl-t210-a64/tensorrt-10.13-jp6/libnvinfer.so.10.13.0.
[INF] Profiling on Orin (GA10B)
[INF] [TRT] [MemUsageChange] Init CUDA: CPU +1, GPU +0, now: CPU 292, GPU 4694 (MiB)
[INF] [TRT] [MemUsageChange] Init builder kernel library: CPU +432, GPU +430, now: CPU 790, GPU 5190 (MiB)
[INF] [TRT] ----------------------------------------------------------------
[INF] [TRT] Input filename: /usr/src/tensorrt/data/mnist/mnist.onnx
[INF] [TRT] ONNX IR version: 0.0.3
[INF] [TRT] Opset version: 8
[INF] [TRT] Producer name: CNTK
[INF] [TRT] Producer version: 2.5.1
[INF] [TRT] Domain: ai.cntk
[INF] [TRT] Model version: 1
[INF] [TRT] Doc string:
[INF] [TRT] ----------------------------------------------------------------
[INF] Building TensorRT engine. This may take some time.
[INF] [TRT] Global timing cache in use. Profiling results in this builder pass will be stored.
[INF] [TRT] Compiler backend is used during engine build.
[INF] Estimated builder progress: 36.4%…
[INF] [TRT] Detected 1 inputs and 1 output network tensors.
[INF] [TRT] Total Host Persistent Memory: 19024 bytes
[INF] [TRT] Total Device Persistent Memory: 0 bytes
[INF] [TRT] Max Scratch Memory: 0 bytes
[INF] [TRT] [BlockAssignment] Started assigning block shifts. This will take 4 steps to complete.
[INF] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.020705ms to assign 2 blocks to 4 nodes requiring 31744 bytes.
[INF] [TRT] Total Activation Memory: 31744 bytes
[INF] [TRT] Total Weights Memory: 25704 bytes
[INF] [TRT] Compiler backend is used during engine execution.
[INF] [TRT] Engine generation completed in 3.21665 seconds.
[INF] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 8 MiB
[INF] TensorRT engine build complete.
[INF] [TRT] Serialized 27 bytes of code generator cache.
[INF] [TRT] Serialized 5439 bytes of compilation cache.
[INF] [TRT] Serialized 2718 timing cache entries
[INF] [TRT] Loaded engine size: 0 MiB
[INF] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[INF] Beginning 10 whole-network measurement passes.
[INF] Completed all whole-network measurement passes.
[INF] Beginning 10 per-layer measurement passes.
[INF] Completed all per-layer measurement passes.
[ERR] Failed to collect all whole-network metrics.
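
Since the failure only shows up as a single `[ERR]` line at the end of a long log, a scripted run can filter for it. The sketch below assumes the `[INF]`/`[ERR]` line prefixes seen in the log above. I do not know whether ndld-prof also returns a non-zero exit status in this case, so it checks the log text instead. The function name `profiler_errors` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: surface [ERR] lines from a captured ndld-prof log.

# Read a log on stdin and print only the lines that start with "[ERR]".
# The exit status is 0 if at least one such line was found, 1 otherwise.
profiler_errors() {
    grep '^\[ERR\]'
}

# Usage, for example:
#   sudo ./ndld-prof /usr/src/tensorrt/data/mnist/mnist.onnx 2>&1 | tee run.log
#   if profiler_errors < run.log; then echo "profiling reported errors"; fi
```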