Question about the PM Sampling results

I wrote my own CUDA kernel and used PM Sampling to evaluate its performance, but the results are hard to read.

Why does Wavefront take so much time? Does it mean the whole computation of the kernel takes only a few microseconds and the rest is data loading? Do the profiling results reflect the actual timeline (i.e., are the compute pipeline and the Tensor Core pipeline perfectly overlapped in the SM section)?

Based on the data in your screenshot, I would presume this is a Turing GPU. Please clarify, and also mention the ncu and driver versions used here. This may be a similar issue to Nsight Compute PM Sampling w.r.t. incorrect buffer size being chosen for long kernel, so configuring this yourself could mitigate the issue.
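If your ncu version exposes the PM sampling options on the command line (check `ncu --help`), a minimal sketch of setting the buffer size yourself looks like this; the buffer size, kernel name, and application path below are placeholders:

```
# Sketch: enlarge the device-side PM sampling buffer for a long-running kernel.
# Option names can vary between ncu versions; verify with `ncu --help`.
ncu --section PmSampling \
    --pm-sampling-buffer-size 536870912 \
    --kernel-name my_kernel \
    -o pm_sampling_report ./my_app
```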

Your ncu version doesn’t seem to support Workload Execution trace yet (which would show you the actual workload on the timeline, too). I would recommend that you switch to a newer version of ncu, as applicable to your driver (2025.2.1 for CUDA 12.x drivers before 580, 2025.3.1 for CUDA 13.x drivers 580 and newer).
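To confirm which combination you are currently on, you can check both versions directly (assuming a standard install with `nvidia-smi` available):

```
# Installed Nsight Compute version
ncu --version
# Installed display driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```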


Thanks for your help! By changing the buffer size, I solved the issue. But I still have some questions.


In my kernel, I use both CUDA cores and Tensor Cores. As shown in the figure above, the highlighted areas are the ALU throughput and the Tensor Core throughput. Both throughputs remain unchanged from 0 s to the end. Can we claim that the ALU pipe and the Tensor Core pipe are overlapped in the SM, and that both CUDA cores and Tensor Cores are used at the same time?

Yes, from the screenshot, that is the correct takeaway. It’s worth noting that the timeline is smoothed when zoomed out (i.e., multiple samples placed in the same screen pixel are averaged). You can zoom in to see more detailed values, and you can also open the Metric Details tool window and then select any metric value on the timeline to see this metric’s data in a table view.

Note further that depending on the GPU, the PM sampling section can collect different metrics. On A100 (GA100), the list of SM pipelines that can be sampled is relatively small, so the overall SM Throughput contains data for more pipelines than just ALU and Tensor. On newer GPUs, this measurement limitation is gradually removed.

You can however still see the overall values for all sub-pipelines of the SM in the breakdown tables/charts of the GPU Speed Of Light section and the Compute Workload Analysis section.
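For completeness, a sketch of collecting those breakdown sections together with the PM sampling timeline; the section identifiers below should match current ncu versions, but you can confirm the exact names for your install first:

```
# Show the section identifiers available in this ncu install
ncu --list-sections

# Collect the per-pipeline breakdowns alongside the PM sampling timeline
# ("./my_app" is a placeholder for the profiled application)
ncu --section SpeedOfLight \
    --section ComputeWorkloadAnalysis \
    --section PmSampling \
    -o pipelines_report ./my_app
```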
