Kernel call stack

Hi
Is it possible to see “call stack”-like information for the kernels that are running in an application? I see some kernel names in the profiler that come from the cuDNN libraries (or whatever the library name is), but I am not able to find which higher-level C++ function calls each kernel. Since the names are complex, I guess there are wrappers that form the final kernel name. I see these complex names when analyzing MLPerf workloads. Is that possible with Nsight Systems or other tools?

There is an option to collect the backtraces you want, but it adds a fair amount of overhead.

See --cudabacktrace in User Guide :: Nsight Systems Documentation for details (the link goes directly to that section).

**Collect backtraces for API calls longer than X seconds** - turns on collection of CUDA API backtraces and sets the minimum time a CUDA API event must take before its backtraces are collected. Setting this value too low can cause high application overhead and seriously increase the size of your results file.

I used the command nsys profile --cudabacktrace all python3 code.py and then nsys stats report1.nsys-rep, but I don’t see the information I want.

This is part of the output:

Using report1.sqlite for SQL queries.
Running [/home/mnaderan/nsight-systems-2022.2.1/target-linux-x64/reports/nvtxsum.py report1.sqlite]... SKIPPED: report1.sqlite does not contain NV Tools Extension (NVTX) data.

Running [/home/mnaderan/nsight-systems-2022.2.1/target-linux-x64/reports/osrtsum.py report1.sqlite]... 

 Time (%)   Total Time (ns)    Num Calls      Avg (ns)          Med (ns)      Min (ns)      Max (ns)         StdDev (ns)              Name         
 --------  ------------------  ---------  ----------------  ----------------  --------  -----------------  ----------------  ----------------------
     51.3  31,928,555,883,735    101,770     313,732,493.7     100,108,949.0     1,000  8,338,581,591,946  36,767,938,230.9  poll                  
     16.3  10,122,109,480,535      4,365   2,318,925,425.1   1,994,086,499.0    12,470      7,308,696,074   1,049,176,820.8  pthread_cond_wait     
     16.2  10,066,843,008,064     20,989     479,624,708.6     500,059,804.0     1,290        501,853,337      98,863,598.2  pthread_cond_timedwait
     16.2  10,066,707,621,826        710  14,178,461,439.2  14,340,671,293.5     9,790     20,867,013,689   1,679,887,717.0  sem_wait              
      0.0         690,089,013      2,120         325,513.7          19,310.0     1,020          1,227,352         450,191.9  munmap 
...
Running [/home/mnaderan/nsight-systems-2022.2.1/target-linux-x64/reports/cudaapisum.py report1.sqlite]... 

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)    Med (ns)   Min (ns)   Max (ns)     StdDev (ns)               Name            
 --------  ---------------  ---------  ------------  ---------  --------  -----------  -------------  ----------------------------
     31.4    5,312,041,255    414,585      12,812.9    8,781.0     3,560  815,782,786    1,269,128.8  cudaLaunchKernel            
     30.9    5,226,133,521    414,585      12,605.7    8,590.0     3,460  815,782,245    1,269,128.2  cudaLaunchKernel            
      5.7      958,886,672         28  34,245,952.6  116,112.0     3,400  955,484,402  180,545,804.1  cudaMalloc                  
      5.7      958,881,281         28  34,245,760.0  115,972.0     3,300  955,483,772  180,545,718.3  cudaMalloc                  
      4.3      730,008,977      6,378     114,457.3   24,640.0     3,410    1,429,386      247,324.7  cudaMemcpyAsync             
      4.3      728,763,880      6,378     114,262.1   24,481.0     3,310    1,428,746      247,308.8  cudaMemcpyAsync             
...
Running [/home/mnaderan/nsight-systems-2022.2.1/target-linux-x64/reports/gpukernsum.py report1.sqlite]... 

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     15.6    3,751,327,650      8,406  446,267.9  637,700.0    26,976   649,251    278,936.9  void cutlass::Kernel<cutlass_tensorop_s1688dgrad_optimized_tf32_128x128_16x4_unity_stride>(T1::Para…
      8.1    1,945,776,423      2,802  694,424.1  694,723.0   349,026   699,235      9,275.9  void cutlass::Kernel<cutlass_tensorop_s1688dgrad_optimized_tf32_256x64_16x4_unity_stride>(T1::Param…
      7.4    1,766,814,457      2,800  631,005.2  631,011.0   628,227   635,075      1,072.4  ampere_scudnn_128x64_stridedB_splitK_xregs_large_nn_v1                                              
      7.3    1,758,463,791     11,200  157,005.7  167,121.0   142,561   187,937     12,874.6  void implicit_convolve_sgemm<float, float, (int)1024, (int)5, (int)5, (int)3, (int)3, (int)3, (int)…
...

Specifically, I am interested in seeing which part of the source code calls ampere_scudnn_128x64_stridedB_splitK_xregs_large_nn_v1. I guess this kernel is part of cuDNN and the final name is produced by wrappers. In the end, somewhere in the code (PyTorch), there should be a C++ function that launches a kernel with the <<<>>> syntax.
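
Since the question also allows for other tools: for a PyTorch workload, source-level attribution of GPU kernels can also come from PyTorch's own profiler with stack recording. The snippet below is only a sketch; the Conv2d layer and input tensor are placeholders, not taken from the MLPerf code.

```python
# Sketch: attribute CUDA kernels to the Python call sites that launched them
# using PyTorch's built-in profiler. The layer and input are placeholders.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(8, 64, 56, 56, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             with_stack=True) as prof:
    model(x).sum().backward()

# group_by_stack_n keeps the top Python stack frames for each operator,
# so a cuDNN convolution kernel shows up under the aten::conv2d / module
# call that launched it.
print(prof.key_averages(group_by_stack_n=5)
          .table(sort_by="cuda_time_total", row_limit=10))
```

This does not show the Nsight Systems kernel names directly, but it ties the GPU time of each operator to a Python stack, which is often enough to locate the layer that triggers a given cuDNN/cutlass kernel.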

I don’t believe that the backtraces are part of the standard stats report.

If you open the GUI, hovering over a kernel will show you the backtrace.

@jkreibich do we have a stats script that outputs the CUDA backtraces?

None of the stats reports utilize backtrace information. It would be difficult to represent in table format. We might be able to do it as part of a trace report, but I don’t think it would be possible with a summary report such as these.
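
Assuming the backtraces do land in the .sqlite export (recent Nsight Systems versions write them to a CUDA_CALLCHAINS table), they can be pulled out by hand. The sketch below uses Python's sqlite3 module; the table and column names (CUPTI_ACTIVITY_KIND_RUNTIME, CUDA_CALLCHAINS, StringIds, nameId, callchainId, symbol, stackDepth) are assumptions based on recent exports, so verify them with .schema in the sqlite3 shell before relying on the query.

```python
# Sketch: dump CUDA API backtraces from the report1.sqlite export.
# Table/column names are assumptions about the nsys export schema;
# inspect the file with `.schema` first and adjust as needed.
import sqlite3

con = sqlite3.connect("report1.sqlite")

query = """
SELECT r.correlationId,
       api.value AS api_name,
       cc.stackDepth,
       sym.value AS frame
FROM   CUPTI_ACTIVITY_KIND_RUNTIME AS r
JOIN   StringIds       AS api ON api.id = r.nameId
JOIN   CUDA_CALLCHAINS AS cc  ON cc.id  = r.callchainId
JOIN   StringIds       AS sym ON sym.id = cc.symbol
ORDER  BY r.correlationId, cc.stackDepth
LIMIT  200
"""

for corr_id, api_name, depth, frame in con.execute(query):
    print(f"{corr_id:>10}  {api_name:<28} #{depth:<2} {frame}")
```

The correlationId should match the correlationId on the kernel rows (e.g. in CUPTI_ACTIVITY_KIND_KERNEL), which is what ties a backtrace on a cudaLaunchKernel call to a specific kernel such as ampere_scudnn_128x64_stridedB_splitK_xregs_large_nn_v1.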

Using the GUI, I see a page like this:

Here, I see the final kernel names. Let me state my question another way: I would like to see which data structure is passed to a specific kernel such as ampere_scudnn.... With a call stack, I would be able to follow the chain of callers upward and see which data structure is passed to the kernel. For simple codes, it is possible to find the <<<>>> kernel invocations. However, for library-based code, e.g. MLPerf, it is much more complicated than I expected. Any ideas about that?
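
For the narrower goal of mapping a kernel such as ampere_scudnn... back to the PyTorch operation that launched it, one workaround is to emit NVTX ranges from the Python side and profile with NVTX tracing enabled (the earlier report skipped nvtxsum because no NVTX data was present). A minimal sketch, assuming a standard CUDA build of PyTorch; the layer and input are placeholders:

```python
# Sketch: wrap regions of the model in NVTX ranges so that, on the Nsight
# Systems timeline, each kernel is nested under a named range. The Conv2d
# layer and input tensor are placeholders for the real workload.
import torch

model = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(8, 64, 56, 56, device="cuda")

# Manual ranges around a suspect region.
torch.cuda.nvtx.range_push("conv_block_forward")
y = model(x)
torch.cuda.nvtx.range_pop()

# Or let autograd emit one NVTX range per operator, so every cuDNN/cutlass
# kernel appears under the aten:: op that launched it.
with torch.autograd.profiler.emit_nvtx():
    model(x).sum().backward()

torch.cuda.synchronize()
```

Re-profiling with something like nsys profile -t cuda,nvtx python3 code.py should then populate the NVTX rows in the timeline and in nvtxsum, which narrows each kernel down to the operator or region that launched it, even though the kernel arguments themselves are still not shown.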

This isn’t available with Nsight Systems.