Accessing kernel call stack

Hi,

I would like to know if there is any update on this topic, which was asked five years ago.

How to get OS kernel call stack using Nsight Systems? - Nsight Systems / Profiling Linux Targets - NVIDIA Developer Forums

I am particularly interested in seeing how CUDA kernels are invoked. For example, looking into the output of Nsight Systems (figure below), the kernel name is amplere_bf16_s168116gemm. I don’t know which high-level code, e.g. which PyTorch module, has called this kernel. I also don’t know whether this kernel belongs to cuDNN, cuBLAS, libcuda.so, or another library file. The bottom-up view doesn’t show useful information about that.

Any idea about that?

We have vastly expanded our support for PyTorch in particular in the interim, and you might want to give that a try. See User Guide — nsight-systems (that is a direct link to the PyTorch profiling section; the forum software just munges the text).

You’ll want to make sure you are working with a recent Nsight Systems to get all of the features.
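For reference, a minimal invocation might look like the sketch below. The script name train.py and the output name are placeholders; the flags themselves are Nsight Systems CLI options:

```shell
# Minimal sketch: trace PyTorch functions plus CUDA and NVTX activity.
# "train.py" is a placeholder for your own training/inference script.
nsys profile \
  --pytorch=functions-trace \
  --trace=cuda,nvtx \
  --output pytorch-trace \
  python train.py
```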


Thanks for that. Apparently, only one --pytorch option can be used. I tried

./nsight-systems-2025.5.1/bin/nsys profile \
   --force-overwrite true \
   --pytorch=functions-trace \
   --trace=cuda,cublas,cudnn,osrt,nvtx \
   --sample cpu --cpuctxsw process-tree \
   --output nsys-b$BATCH_SIZE.out \
   python -u main.py --scenario Offline --model-path $CHECKPOINT_PATH --batch-size $BATCH_SIZE --dtype bfloat16 --user-conf user.conf --total-sample-count 1 --dataset-path $DATASET_PATH --output-log-dir output --tensor-parallel-size $GPU_COUNT --vllm

And the output, shown in the figure below, correctly indicates that amplere_bf16_s168116gemm belongs to torch.nn.functional.linear:

However, as you can see in the figure below, a PyTorch module may contain multiple kernels. That means more investigation is needed to track the call stack.

Question: Is that all the information we can get from nsys with PyTorch, or do I have to use another --pytorch option?

--pytorch options can be combined (except for autograd-nvtx and autograd-shapes-nvtx, which are mutually exclusive). Are you having issues with using --pytorch=autograd-nvtx,functions-trace or --pytorch=autograd-shapes-nvtx,functions-trace?

Additionally, the upcoming release of Nsight Systems (expected to be out very soon) includes significant improvements to --pytorch=functions-trace that provide much more comprehensive time ranges for the PyTorch forward pass.

You might also want to use --cudabacktrace=all and --python-backtrace=cuda to get C and Python callstacks on every CUDA API call.


Thank you very much. I used the following options:

./nsight-systems-2025.5.1/bin/nsys profile \
  --force-overwrite true \
  --pytorch=autograd-nvtx,functions-trace \
  --cudabacktrace=all \
  --python-backtrace=cuda \
  --output nsys-b.out \
  python <OPTIONS>

And I can see the call stack, as in the figure below.

I just want to be sure whether or not this is the correct way to access the call stack.

I ask that because the “bottom-up view” is not very meaningful.

BTW, I see some are torch.nn.modules.Module.__call__ and some are torch.nn.functional.linear; I guess the latter is the complete path and more meaningful than the former. Any idea about that?

The torch.nn.modules.Module.__call__, torch.nn.functional.linear, and similar ranges are not informative enough. These ranges are changed in the upcoming nsys release: duplicates are removed, and you’ll get the actual module name.

About the call stack: the Bottom Up/Top Down/Flat views provide analysis of periodic call-stack sampling (CPU sampling is enabled by default).

It sounds to me like you are looking for call stacks at specific points, rather than statistical analysis.
You should have a “CUDA API” row in the report, and most of the ranges there should contain C/Python callstack in their tooltip. Can you find it?
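If the GUI tooltip is hard to work with, another option, as a sketch (table names can vary between Nsight Systems versions, and report.nsys-rep is a placeholder file name), is to export the report to SQLite and query the CUDA runtime API ranges directly:

```shell
# Sketch: export a report to SQLite and list the first few CUDA runtime
# API ranges. Requires nsys and sqlite3 on PATH; the table name below
# reflects the schema of recent nsys versions and may differ in yours.
nsys export --type sqlite --output report.sqlite report.nsys-rep
sqlite3 report.sqlite \
  "SELECT start, end, correlationId FROM CUPTI_ACTIVITY_KIND_RUNTIME LIMIT 10;"
```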

Yes

You should have a “CUDA API” row in the report, and most of the ranges there should contain C/Python callstack in their tooltip. Can you find it?

I see a call stack at a specific time as shown below. I think that is what I wanted to achieve. Thanks.

Hi @mahmood.nt
Nsight Systems 2025.6.1 is publicly available. I encourage you to check out --pytorch=functions-trace with this new version. It should be useful for your use case.

Thanks for the update. I gave it a try, and you can see the result in the figure below.

./nsight-systems-2025.6.1/bin/nsys profile \
  --force-overwrite true \
  --pytorch=functions-trace \
  --output nsys-b.out \
  python -u main.py .....

When I click on the kernel, it seems that the calling module is gate_up_proj, because it has the same length. However, I don’t understand the bottom part, where I see multiple higher-level modules.

Also, when I hover the mouse over the module names, I don’t see the stack trace. Maybe I have missed something…