I am particularly interested to see how CUDA kernels are invoked. For example, looking into the output of Nsight Systems (figure below), the kernel name is ampere_bf16_s16816gemm. I don’t know which high-level code, e.g. which PyTorch module, has called this kernel. I also don’t know whether this kernel belongs to cuDNN, cuBLAS, libcuda.so, or some other library file. The bottom-up view doesn’t show useful information about that.
We have vastly expanded our support for PyTorch in particular in the interim, and you might want to give that a try. See User Guide — nsight-systems (that is a direct link to the PyTorch profiling section; the forum software just munges the text).
You’ll want to make sure you are working with a recent Nsight Systems to get all of the features.
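For reference, a minimal sketch of such an invocation might look like this (the script and report names are placeholders, not from this thread):

```bash
# Hypothetical example: profile a PyTorch script with PyTorch-specific tracing.
# "train.py" and "my_report" are placeholder names.
nsys profile --trace=cuda,nvtx --pytorch=autograd-nvtx -o my_report python train.py
```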
However, as you can see in the figure below, a PyTorch module may contain multiple kernels. That means more investigation is needed to track the call stack.
--pytorch options can be combined (except for autograd-nvtx and autograd-shapes-nvtx, which are mutually exclusive). Are you having issues with using --pytorch=autograd-nvtx,functions-trace or --pytorch=autograd-shapes-nvtx,functions-trace?
Additionally, the upcoming release of Nsight Systems (expected out soon) includes significant improvements for --pytorch=functions-trace that provide much more comprehensive time ranges for the PyTorch forward pass.
You might also want to use --cudabacktrace=all and --python-backtrace=cuda to get C and Python call stacks on every CUDA API call.
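Putting those pieces together, a sketch of a full command line (again with placeholder script and report names) could look like:

```bash
# Hypothetical example combining the options discussed above.
# The --pytorch values are comma-separated; the backtrace flags attach
# C and Python call stacks to each CUDA API call in the timeline.
nsys profile \
  --pytorch=autograd-nvtx,functions-trace \
  --cudabacktrace=all \
  --python-backtrace=cuda \
  -o my_report python train.py
```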
BTW, I see some ranges are torch.nn.modules.Module.__call__ and some are torch.nn.functional.linear. I guess the latter is the complete path and more meaningful than the former. Any idea about that?
The torch.nn.modules.Module.__call__, torch.nn.functional.linear, and similar ranges are not informative enough. These ranges are changed in the upcoming nsys release: duplicates are removed, and you’ll get the actual module name.
About the call stack: the Bottom Up/Top Down/Flat views provide analysis of periodic call-stack sampling (CPU sampling is enabled by default).
It sounds to me that you are looking for call stacks at specific points, rather than statistical analysis.
You should have a “CUDA API” row in the report, and most of the ranges there should contain a C/Python call stack in their tooltip. Can you find it?
Hi @mahmood.nt
Nsight Systems 2025.6.1 is publicly available. I encourage you to check out --pytorch=functions-trace with this new version. It should be useful for your use case.
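For instance, assuming the new version is on your PATH (script and report names are again placeholders):

```bash
# Confirm that the 2025.6.1 (or newer) release is the one being picked up.
nsys --version

# Hypothetical example: re-profile the workload with functions-trace enabled.
nsys profile --pytorch=functions-trace -o my_report python train.py
```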
When I click on the kernel, it seems that the calling module is gate_up_proj, because it has the same length. However, I don’t understand the bottom part, where I see multiple higher-level modules.
Also, when I hover the mouse over the module names, I don’t see the stack trace. Maybe I have missed something…