Updated Nsight Systems and lost CUDA API trace

I am profiling my Python CUDA application with Nsight Systems, which I installed inside the NVIDIA l4t-ml Docker container (nvcr.io/nvidia/l4t-ml:r32.5.0-py3).
Both the Python application and Nsight Systems run inside the same Docker container on a Jetson AGX Xavier.
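
For reference, a capture is launched roughly like this inside the container (the script name and output name are placeholders rather than my exact command):

nsys profile -t cuda,nvtx,osrt -o report python3 my_app.py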

After updating the Nsight Systems CLI from version 2021.1.3 to 2021.5.2, I am getting errors and have lost the CUDA API calls and GPU sampling in my reports.

Diagnostics Summary reports:
Warning Injection CUDA injection initialization failed.
Warning Analysis CUDA profiling stopped unexpectedly: Cannot initialize CUDA event collection.
Warning Analysis No CUDA events collected. Does the process use CUDA?

Versions
Old: NVIDIA Nsight Systems version 2021.1.3.14-b695ea9
New: NVIDIA Nsight Systems version 2021.5.2.53-28d0e6e

Cuda version (output of nvcc --version):
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_28_22:34:44_PST_2021
Cuda compilation tools, release 10.2, V10.2.300
Build cuda_10.2_r440.TC440_70.29663091_0

What could be causing this new issue? Any help would be appreciated.

@liuyis can you add this to your list?

Hi @nchang, could you share the reports you collected from 2021.1 and 2021.5 respectively? Thanks

Hi @liuyis

Attached are some sample reports for both versions. Thank you
Nvidia Reports.zip (7.0 MB)

Hi @nchang, where did you get the 2021.5 version of Nsight Systems?

Note that for the L4T platform, Nsight Systems is bundled with the JetPack SDK (Jetson Developer Tools | NVIDIA Developer). It seems the latest version of the JetPack SDK is 4.6, which carries Nsys 2021.2 (i.e. the latest officially released version of Nsys for the L4T platform is 2021.2).

If you downloaded Nsight Systems from our website or the Developer Zone, it is meant for desktop/server platforms only. The Linux SBSA version may be able to run on the L4T platform since both are Arm-based, but there is no guarantee that all features will work as expected.

I see, thank you for that clarification, @liuyis. I was hoping to make use of the latest version of Nsight Systems, which has the newly added “analyze” command.
Is there any significant difference between Nsight Systems 2021.2 and the 2021.1 version I was previously using?

On another note, I am noticing that when I include CUDA API tracing (-t nvtx,osrt,cuda), the NVTX blocks appearing in my report are significantly slower. Is this expected when profiling with the added cuda trace option? I expected the Nsight Systems profiler to add very little overhead. What is the best way to profile a CUDA application with realistic timings?

I was hoping to make use of the latest version of Nsight Systems, which has the newly added “analyze” command.

You can copy the reports collected on L4T with 2021.1/2021.2 to a desktop or server, and use the 2021.5 version there for the nsys analyze command.
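
For example, something like this (the report filename and path are just placeholders; use whatever names your collected reports have):

scp jetson:/path/to/report1.qdrep .
nsys analyze report1.qdrep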

Is there any significant difference between Nsight Systems 2021.2 and the 2021.1 version I was previously using?

This link provides some information: https://developer.nvidia.com/blog/latest-nsight-developer-tools-releases-nsight-systems-2021-1-nsight-compute-2021-2-nsight-visual-studio-code-edition/#:~:text=Nsight%20Systems%202021.2%2C%20introduces%20support,%2C%20OpenSHMEM%2C%20and%20MPI%20fortran

On another note, I am noticing that when I include CUDA API tracing (-t nvtx,osrt,cuda), the NVTX blocks appearing in my report are significantly slower. Is this expected when profiling with the added cuda trace option? I expected the Nsight Systems profiler to add very little overhead. What is the best way to profile a CUDA application with realistic timings?

The first CUDA API call will have significant overhead due to profiler initialization. The rest of the calls will also have some overhead, but it should not be very significant. How much slow-down are you observing? Could you share reports with and without CUDA trace using the same Nsys version?

To minimize overhead, you can disable unnecessary features. For example, if you are interested in CUDA and NVTX only, use something like nsys profile -t cuda,nvtx -s none <app>.
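
Applied to a Python application, that would look roughly like this (the script name and output name are placeholders):

nsys profile -t cuda,nvtx -s none -o report python3 my_app.py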

Hi @liuyis, thanks for the clarifications and suggestions concerning the Nsight Systems versions. I will do as you suggest and use the earlier Nsight Systems version to generate the reports and the later version to run “analyze”.

As for the CUDA tracing overhead, I have included reports (in two posts) and the generated stats with and without CUDA tracing. You will see that the timings are significantly increased over the whole duration of the profiling capture, not simply during initialization.

Thanks for investigating the issue, as these increased timings make it difficult to evaluate optimizations accurately.

NoCuda.zip (91.0 MB)

@liuyis, here is the second set of logs for CUDA tracing (unfortunately the SQLite file is too large to attach; please let me know if you wish to see it):
Cuda.zip (8.2 MB)

@nchang Thanks for uploading the reports. I do see that some of the ranges are 2X slower when CUDA trace is enabled. Is it possible to share the application, or a simple reproducer, so that we can investigate on our end?

Hi @liuyis, yes, exactly: CUDA tracing seems to cause 2-3X slowdowns. As for sharing the application, is there a process available for sharing sensitive code under an NDA?

Hi @nchang, which company/organization do you (or does the code) belong to? Does the company/organization have an existing SA (solution architect) or DevTech contact with NVIDIA?