I am particularly interested to see how CUDA kernels are invoked. For example, looking into the output of Nsight Systems (figure below), the kernel name is ampere_bf16_s16816gemm. I don’t know which high-level code, e.g. which PyTorch module, has called this kernel. I also don’t know whether this kernel belongs to cuDNN, cuBLAS, libcuda.so, or some other library file. The bottom-up view doesn’t show useful information about that.
We have vastly expanded our support for PyTorch in particular in the interim, and you might want to give that a try. See User Guide — nsight-systems (that is a direct link to the PyTorch profiling section; the forum software just munges the text).
You’ll want to make sure you are working with a recent Nsight Systems to get all of the features.
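For reference, a minimal sketch of such an invocation might look like this (the script and report names are placeholders, not from this thread):

```bash
# Hypothetical example: profile a PyTorch script with PyTorch-specific tracing.
# "train.py" and "my_report" are placeholder names.
nsys profile --trace=cuda,nvtx --pytorch=autograd-nvtx -o my_report python train.py
```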
However, as you can see in the figure below, a PyTorch module may contain multiple kernels. That means more investigation is needed to track the call stack.
--pytorch options can be combined (except for autograd-nvtx and autograd-shapes-nvtx, which are mutually exclusive). Are you having issues with using --pytorch=autograd-nvtx,functions-trace or --pytorch=autograd-shapes-nvtx,functions-trace?
Additionally, the upcoming release of Nsight Systems (expected out soon) includes significant improvements for --pytorch=functions-trace that provide much more comprehensive time ranges for the PyTorch forward pass.
You might also want to use --cudabacktrace=all and --python-backtrace=cuda to get C and Python call stacks on every CUDA API call.
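Putting those pieces together, a sketch of a full command line (again with placeholder script and report names) could look like:

```bash
# Hypothetical example combining the options discussed above.
# The --pytorch values are comma-separated; the backtrace flags attach
# C and Python call stacks to each CUDA API call in the timeline.
nsys profile \
  --pytorch=autograd-nvtx,functions-trace \
  --cudabacktrace=all \
  --python-backtrace=cuda \
  -o my_report python train.py
```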
BTW, I see some ranges are torch.nn.modules.Module.__call__ and some are torch.nn.functional.linear. I guess the latter is the complete path and more meaningful than the former. Any idea about that?
The torch.nn.modules.Module.__call__, torch.nn.functional.linear, and similar ranges are not informative enough. These ranges are changed in the upcoming nsys release: duplicates are removed, and you’ll get the actual module name.
About the call stack: the Bottom Up/Top Down/Flat views provide analysis of periodic call-stack sampling (CPU sampling is enabled by default).
It sounds to me that you are looking for call stacks at specific points, rather than statistical analysis.
You should have a “CUDA API” row in the report, and most of the ranges there should contain a C/Python call stack in their tooltip. Can you find it?
Hi @mahmood.nt
Nsight Systems 2025.6.1 is publicly available. I encourage you to check out --pytorch=functions-trace with this new version. It should be useful for your use case.
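For instance, assuming the new version is on your PATH (script and report names are again placeholders):

```bash
# Confirm that the 2025.6.1 (or newer) release is the one being picked up.
nsys --version

# Hypothetical example: re-profile the workload with functions-trace enabled.
nsys profile --pytorch=functions-trace -o my_report python train.py
```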
When I click on the kernel, it seems that the calling module is gate_up_proj, because it has the same length. However, I don’t understand the bottom part, where I see multiple higher-level modules.
Also, when I hover the mouse over the module names, I don’t see the stack trace. Maybe I have missed something…