Hi
Is it possible to see “call stack”-like information about the kernels that are running in an application? I see some kernel names in the profiler that come from cuDNN (or whatever the library is), but I am not able to find which higher-level C++ function calls the kernel. Since the names are complex, I guess there are wrappers that form the final kernel name. I see these complex names when analyzing MLPerf workloads. Is that possible with Nsight Systems or another tool?
There is an option to collect the backtraces you want, but it causes a fair amount of overhead.
See --cudabacktrace in User Guide :: Nsight Systems Documentation for details (the link goes directly to that section).
**Collect backtraces for API calls longer than X seconds** - turns on collection of CUDA API backtraces and sets the minimum time a CUDA API event must take before its backtraces are collected. Setting this value too low can cause high application overhead and seriously increase the size of your results file.
I used the command `nsys profile --cudabacktrace all python3 code.py` and then `nsys stats report1.nsys-rep`, but I don’t see the information I want.
This is part of the output:
Using report1.sqlite for SQL queries.
Running [/home/mnaderan/nsight-systems-2022.2.1/target-linux-x64/reports/nvtxsum.py report1.sqlite]... SKIPPED: report1.sqlite does not contain NV Tools Extension (NVTX) data.
Running [/home/mnaderan/nsight-systems-2022.2.1/target-linux-x64/reports/osrtsum.py report1.sqlite]...
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- ------------------ --------- ---------------- ---------------- -------- ----------------- ---------------- ----------------------
51.3 31,928,555,883,735 101,770 313,732,493.7 100,108,949.0 1,000 8,338,581,591,946 36,767,938,230.9 poll
16.3 10,122,109,480,535 4,365 2,318,925,425.1 1,994,086,499.0 12,470 7,308,696,074 1,049,176,820.8 pthread_cond_wait
16.2 10,066,843,008,064 20,989 479,624,708.6 500,059,804.0 1,290 501,853,337 98,863,598.2 pthread_cond_timedwait
16.2 10,066,707,621,826 710 14,178,461,439.2 14,340,671,293.5 9,790 20,867,013,689 1,679,887,717.0 sem_wait
0.0 690,089,013 2,120 325,513.7 19,310.0 1,020 1,227,352 450,191.9 munmap
...
Running [/home/mnaderan/nsight-systems-2022.2.1/target-linux-x64/reports/cudaapisum.py report1.sqlite]...
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ --------- -------- ----------- ------------- ----------------------------
31.4 5,312,041,255 414,585 12,812.9 8,781.0 3,560 815,782,786 1,269,128.8 cudaLaunchKernel
30.9 5,226,133,521 414,585 12,605.7 8,590.0 3,460 815,782,245 1,269,128.2 cudaLaunchKernel
5.7 958,886,672 28 34,245,952.6 116,112.0 3,400 955,484,402 180,545,804.1 cudaMalloc
5.7 958,881,281 28 34,245,760.0 115,972.0 3,300 955,483,772 180,545,718.3 cudaMalloc
4.3 730,008,977 6,378 114,457.3 24,640.0 3,410 1,429,386 247,324.7 cudaMemcpyAsync
4.3 728,763,880 6,378 114,262.1 24,481.0 3,310 1,428,746 247,308.8 cudaMemcpyAsync
...
Running [/home/mnaderan/nsight-systems-2022.2.1/target-linux-x64/reports/gpukernsum.py report1.sqlite]...
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- --------- --------- -------- -------- ----------- ----------------------------------------------------------------------------------------------------
15.6 3,751,327,650 8,406 446,267.9 637,700.0 26,976 649,251 278,936.9 void cutlass::Kernel<cutlass_tensorop_s1688dgrad_optimized_tf32_128x128_16x4_unity_stride>(T1::Para…
8.1 1,945,776,423 2,802 694,424.1 694,723.0 349,026 699,235 9,275.9 void cutlass::Kernel<cutlass_tensorop_s1688dgrad_optimized_tf32_256x64_16x4_unity_stride>(T1::Param…
7.4 1,766,814,457 2,800 631,005.2 631,011.0 628,227 635,075 1,072.4 ampere_scudnn_128x64_stridedB_splitK_xregs_large_nn_v1
7.3 1,758,463,791 11,200 157,005.7 167,121.0 142,561 187,937 12,874.6 void implicit_convolve_sgemm<float, float, (int)1024, (int)5, (int)5, (int)3, (int)3, (int)3, (int)…
...
Specifically, I am interested in seeing which part of the source code calls `ampere_scudnn_128x64_stridedB_splitK_xregs_large_nn_v1`. I guess this kernel is part of cuDNN and the final name is created by wrappers. In the end, somewhere in the code (PyTorch), there should be a C++ function that invokes a kernel with the <<<>>> syntax.
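To make this concrete, here is a minimal sketch (made-up shapes, not the actual MLPerf code) of the kind of PyTorch call that ends up launching such a cuDNN kernel; the <<<>>> launch is buried inside the library, which is why I cannot trace it from the Python source:

```python
import torch

# Minimal sketch (made-up shapes, not the actual mlperf model): the Python
# side only calls the nn.Conv2d module; the <<<>>> launch of the cuDNN kernel
# (e.g. an ampere_scudnn_... kernel) happens inside libcudnn / ATen, not here.
conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 64, 56, 56, device="cuda")
y = conv(x)                 # dispatches through ATen -> cuDNN -> GPU kernel
torch.cuda.synchronize()    # make sure the kernel actually runs before exit
```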
I don’t believe that backtraces are part of the standard stats report.
If you open the GUI, hovering over a kernel will show you the backtrace.
@jkreibich do we have a stats script that outputs the CUDA backtraces?
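As a side note (not a backtrace): your stats output showed no NVTX data, and you can have PyTorch emit NVTX ranges so that, in the timeline, each library kernel nests under the operator that launched it. A rough sketch, assuming a PyTorch script like yours (the small conv model below is just a placeholder):

```python
import torch

model = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 64, 56, 56, device="cuda")

# emit_nvtx() wraps each autograd op in an NVTX range; run the script under
# `nsys profile` as before and the NVTX rows appear in the timeline, with the
# cuDNN kernels correlated to the op that launched them.
with torch.autograd.profiler.emit_nvtx():
    y = model(x)
    y.sum().backward()
torch.cuda.synchronize()
```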
None of the stats reports utilize backtrace information. It would be difficult to represent in table format. We might be able to do it as part of a trace report, but I don’t think it would be possible with summary reports such as these.
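If you want to check what actually made it into the export, a rough sketch like this (it only lists whatever tables nsys wrote into report1.sqlite, without assuming any particular schema) will show whether any callchain/backtrace tables are present:

```python
import sqlite3

# Rough sketch: list every table in the nsys SQLite export and its row count,
# so you can see whether any backtrace/callchain data was captured at all.
conn = sqlite3.connect("report1.sqlite")
names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
for name in names:
    count = conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
    print(f"{name}: {count} rows")
conn.close()
```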
Using the GUI, I see a page like this:
Here, I see the final kernel names. Let me state my question another way: I would like to see which data structure is passed to a specific kernel like ampere_scudnn...? With a call stack, I would be able to follow the chain of callers and see which data structure is passed to the kernel. For simple code, it is possible to find the <<<>>> kernel invocations. However, for library-based code, e.g. MLPerf, it is much more complicated than I expected. Any ideas about that?
This isn’t available with Nsight Systems.