Question about the PM Sampling results

I wrote my own CUDA kernel and used PM Sampling to evaluate its performance, but the results are hard to read.

Why does Wavefront take so much time? Does it mean the whole computation of the kernel takes only a few microseconds and the rest is data loading? Do the profiling results reflect the actual timeline (i.e., are the compute pipeline and the Tensor Core pipeline perfectly overlapped in the SM section)?

Based on the data in your screenshot, I would presume this is a Turing GPU. Please clarify, and also mention the ncu and driver versions used here. This may be a similar issue to Nsight Compute PM Sampling w.r.t. incorrect buffer size being chosen for long kernel, so configuring this yourself could mitigate the issue.
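If your ncu version exposes the PM sampling options on the command line (check `ncu --help`), a minimal sketch of setting the buffer size yourself looks like this; the buffer size, kernel name, and application path below are placeholders:

```
# Sketch: enlarge the device-side PM sampling buffer for a long-running kernel.
# Option names can vary between ncu versions; verify with `ncu --help`.
ncu --section PmSampling \
    --pm-sampling-buffer-size 536870912 \
    --kernel-name my_kernel \
    -o pm_sampling_report ./my_app
```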

Your ncu version doesn’t seem to support Workload Execution trace yet (which would show you the actual workload on the timeline, too). I would recommend that you switch to a newer version of ncu, as applicable to your driver (2025.2.1 for CUDA 12.x drivers before 580, 2025.3.1 for CUDA 13.x drivers 580 and newer).
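To confirm which combination you are currently on, you can check both versions directly (assuming a standard install with `nvidia-smi` available):

```
# Installed Nsight Compute version
ncu --version
# Installed display driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```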


Thanks for your help! By changing the buffer size, I solved the issue. But I still have some questions.


In my kernel, I use both CUDA cores and Tensor Cores. As shown in the figure above, the highlighted areas are the ALU throughput and the Tensor Core throughput. Both throughputs remain unchanged from 0 s to the end. Can we claim that the ALU pipe and the Tensor Core pipe are overlapped in the SM, and that both CUDA cores and Tensor Cores are used at the same time?

Yes, from the screenshot, that is the correct takeaway. It’s worth noting that the timeline is smoothed when zoomed out (i.e., multiple samples placed in the same screen pixel are averaged). You can zoom in to see more detailed values, and you can also open the Metric Details tool window and then select any metric value on the timeline to see this metric’s data in a table view.

Note further that depending on the GPU, the PM sampling section can collect different metrics. On A100 (GA100), the list of SM pipelines that can be sampled is relatively small, so the overall SM Throughput contains data for more pipelines than just ALU and Tensor. On newer GPUs, this measurement limitation is gradually removed.

You can however still see the overall values for all sub-pipelines of the SM in the breakdown tables/charts of the GPU Speed Of Light section and the Compute Workload Analysis section.
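For completeness, a sketch of collecting those breakdown sections together with the PM sampling timeline; the section identifiers below should match current ncu versions, but you can confirm the exact names for your install first:

```
# Show the section identifiers available in this ncu install
ncu --list-sections

# Collect the per-pipeline breakdowns alongside the PM sampling timeline
# ("./my_app" is a placeholder for the profiled application)
ncu --section SpeedOfLight \
    --section ComputeWorkloadAnalysis \
    --section PmSampling \
    -o pipelines_report ./my_app
```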
