Remote profiling errors - unable to load CUDA profiling library

Hi, I'm trying to profile a dummy model on an Orin (DRIVE AGX) and I'm getting the following errors:
Launching: /home/orin/nsight-dl-designer/DLDesignerWorker
Process launched
[ERR] Unable to load CUDA profiling library.
[INF] The current user is not a member of the debug group. You may need to join that group in order to profile.
Launched application returned 1 (0x1).
Retrieving /home/orin/nsight-dl-designer/trtreport.nv-dld-report to /home/yz9qvs/projects/deconv_bev3_new_head_network_only_fs_cam_only/nvidia_x86/trt/fp16_no_custom_io/deconv_bev3_new_head_network_only_fs_cam_only-TRT.nv-dld-report
Failed to retrieve /home/orin/nsight-dl-designer/trtreport.nv-dld-report
Failed to retrieve files.
Profiler encountered an error during execution.

I couldn't find anything on these errors.
Any help would be great.

DL Designer does not support NVIDIA DriveOS as a target platform.

The primary error in your logs is the line prefixed by [ERR], indicating that the GPU hardware profiler could not be started. The subsequent errors are downstream failures; the profiler exited before it could do anything, so no report could be generated for transfer back to the DL Designer host machine.
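
As a quick check, the [INF] line about the debug group can be verified directly on the target. Below is a minimal sketch, assuming the group is literally named "debug" as the message implies; it only reports membership and does not change anything:

```python
# Minimal sketch: check whether the current user belongs to the group named in the
# [INF] message. The group name "debug" is an assumption taken from the log text.
import grp
import os
import pwd


def user_in_group(group_name: str) -> bool:
    user = pwd.getpwuid(os.getuid()).pw_name
    try:
        group = grp.getgrnam(group_name)
    except KeyError:
        return False  # the group does not exist on this system
    # Membership can come from the group's member list or from the user's primary GID.
    return user in group.gr_mem or pwd.getpwnam(user).pw_gid == group.gr_gid


if __name__ == "__main__":
    print("current user in 'debug' group:", user_in_group("debug"))
```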

The docs clearly state:

What platform does Nsight DL Designer run on?

We currently support Windows, Linux, and L4T.

After connecting with root privileges, I've gotten past the previous error, but encountered the following:

[INF] [TRT] [MemUsageChange] Init CUDA: CPU +313, GPU +0, now: CPU 695, GPU 6747 (MiB)

[INF] [TRT] [MemUsageChange] Init builder kernel library: CPU +944, GPU +1112, now: CPU 1682, GPU 7904 (MiB)

[INF] [TRT] ----------------------------------------------------------------

[INF] [TRT] Input filename:   /root/nsight-dl-designer/deconv_bev3_new_head_network_only_fs_cam_only.onnx

[INF] [TRT] ONNX IR version:  0.0.8

[INF] [TRT] Opset version:    17

[INF] [TRT] Producer name:    pytorch

[INF] [TRT] Producer version: 2.4.0

[INF] [TRT] Domain:           

[INF] [TRT] Model version:    0

[INF] [TRT] Doc string:       

[INF] [TRT] ----------------------------------------------------------------

[INF] Building TensorRT engine. This may take some time.

[INF] [TRT] Global timing cache in use. Profiling results in this builder pass will be stored.

[INF] Estimated builder progress: 2.57%...

[INF] Estimated builder progress: 9.49%...

[INF] Estimated builder progress: 13.5%...

[INF] Estimated builder progress: 19%...

[INF] Estimated builder progress: 22.7%...

[INF] Estimated builder progress: 27%...

[INF] Estimated builder progress: 28.6%...

[INF] Estimated builder progress: 34.3%...

[INF] [TRT] Compiler backend is used during engine build.

[INF] Estimated builder progress: 36.2%...

[INF] Estimated builder progress: 40.3%...

[INF] Estimated builder progress: 46.2%...

[INF] Estimated builder progress: 47.1%...

[INF] [TRT] Detected 3 inputs and 2 output network tensors.

[INF] [TRT] Total Host Persistent Memory: 187072 bytes

[INF] [TRT] Total Device Persistent Memory: 0 bytes

[INF] [TRT] Max Scratch Memory: 25165824 bytes

[INF] [TRT] [BlockAssignment] Started assigning block shifts. This will take 47 steps to complete.

[INF] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 1.07386ms to assign 5 blocks to 47 nodes requiring 106315776 bytes.

[INF] [TRT] Total Activation Memory: 106315776 bytes

[INF] [TRT] Total Weights Memory: 52303900 bytes

[INF] [TRT] Compiler backend is used during engine execution.

[INF] [TRT] Engine generation completed in 30.9241 seconds.

[INF] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 9 MiB, GPU 128 MiB

[INF] [TRT] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 2074 MiB

[INF] TensorRT engine build complete.

[INF] [TRT] Serialized 27 bytes of code generator cache.

[INF] [TRT] Serialized 50298 bytes of compilation cache.

[INF] [TRT] Serialized 328 timing cache entries

[WRN] The timing cache could not be saved.

[INF] Exception text: /root/.config/NVIDIA Corporation/NVIDIA Nsight Deep Learning Designer/timing_cache.10.5.0.9.bin: No such file or directory

[INF] [TRT] Loaded engine size: 50 MiB

[INF] [TRT] [MS] Running engine with multi stream info

[INF] [TRT] [MS] Number of aux streams is 3

[INF] [TRT] [MS] Number of total worker streams is 4

[INF] [TRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream

[INF] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +102, now: CPU 0, GPU 151 (MiB)

[WRN] DL Designer does not implement clock controls on this platform.

[INF] Beginning 10 whole-network measurement passes.

[INF] Completed all whole-network measurement passes.

[INF] Beginning 10 per-layer measurement passes.

[INF] Completed all per-layer measurement passes.

[ERR] Primary GPU sampling range should have remained open.

Launched application returned 1 (0x1).
Retrieving /root/nsight-dl-designer/trtreport.nv-dld-report to /home/yz9qvs/projects/deconv_bev3_new_head_network_only_fs_cam_only/nvidia_x86/trt/fp16_no_custom_io/deconv_bev3_new_head_network_only_fs_cam_only-TRT.nv-dld-report
Failed to retrieve /root/nsight-dl-designer/trtreport.nv-dld-report
Failed to retrieve files.
Profiler encountered an error during execution.
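
Side note: the timing-cache warning above ("No such file or directory") suggests the config directory may not exist on the target. A minimal sketch, assuming pre-creating the directory (path copied from the log) is enough for the cache to be saved on the next run:

```python
# Minimal sketch: pre-create the directory mentioned in the timing-cache warning.
# Path is copied from the log above; whether this resolves the warning is an assumption.
from pathlib import Path

cache_dir = Path("/root/.config/NVIDIA Corporation/NVIDIA Nsight Deep Learning Designer")
cache_dir.mkdir(parents=True, exist_ok=True)
print("created:", cache_dir)
```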

Sorry for the confusion. The L4T we officially support refers to the Jetson platform. We have not validated NDLD on DRIVE systems, but we will look into the specific profiler error: “[ERR] Primary GPU sampling range should have remained open.”

Thank you for the feedback; it is much appreciated.