Question about PC sampling

antonio.msi · December 5, 2023, 4:01am

The minimum sampling interval of PC sampling is 32 cycles. But in the source code details, each line of assembly code has a corresponding warp state. How is this done? Theoretically, an HMMA instruction only executes for 2 cycles.

veraj · December 6, 2023, 5:45am

Hi, @antonio.msi

Thanks for starting a new topic! I checked your question internally; here is some information about PC sampling.

PC sampling selects a random warp every Nth cycle on every SM. For the selected warp, we collect the Program Counter (PC) and the Stall Reason. Because a kernel usually runs on many SMs and in many waves (that is, if the grid is large enough, some warps run sequentially with respect to others), this statistical sampling will eventually get information for every executed SASS instruction. This is how even an instruction that takes fewer cycles than the sampling interval ends up with samples. If a kernel executes only a single warp (or only very few warps), then we might not obtain sampling information for all executed instructions.
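To make the statistical argument concrete, here is a small toy simulation in Python. It is only a sketch under made-up assumptions, not how the hardware or Nsight Compute actually works: the function name `simulate_pc_sampling`, the "warp slot" model, and the random 2 to 30 cycle instruction latencies are all hypothetical. It shows that sampling one random warp every 32 cycles can still reach every PC when many warps run, while a single warp leaves gaps.

```python
import bisect
import random


def simulate_pc_sampling(num_instructions, num_slots, waves, interval=32, seed=0):
    """Toy model of PC sampling on one SM (hypothetical, for illustration only).

    Each of `num_slots` warp slots runs `waves` warps back to back, and each
    warp executes PCs 0..num_instructions-1 once. Every instruction occupies
    its warp for a random 2..30 cycles (an assumed latency, always shorter
    than the default 32-cycle interval). Returns the set of PCs that were
    sampled at least once.
    """
    rng = random.Random(seed)

    # Build one timeline per slot: the cycle at which each instruction starts.
    timelines = []
    for _ in range(num_slots):
        cycle, starts = 0, []
        for _ in range(waves * num_instructions):
            starts.append(cycle)
            cycle += rng.randint(2, 30)
        timelines.append((starts, cycle))  # `cycle` is now the slot's finish time

    end = max(finish for _, finish in timelines)
    sampled_pcs = set()

    # Every `interval` cycles, pick one random slot and record where its warp is.
    for cycle in range(0, end, interval):
        starts, finish = timelines[rng.randrange(num_slots)]
        if cycle >= finish:
            continue  # this slot has already run out of work: no sample
        index = bisect.bisect_right(starts, cycle) - 1  # instruction in flight
        sampled_pcs.add(index % num_instructions)
    return sampled_pcs


many_warps = simulate_pc_sampling(num_instructions=64, num_slots=32, waves=100)
single_warp = simulate_pc_sampling(num_instructions=64, num_slots=1, waves=1)
print(f"many warps:  {len(many_warps)} of 64 PCs sampled")
print(f"single warp: {len(single_warp)} of 64 PCs sampled")
```

In this model a single warp finishes within at most 64 × 30 = 1920 cycles, so it can receive at most 60 samples and cannot cover all 64 PCs. With many warps and waves, the warps drift out of step and the samples spread across all PCs.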

You can also refer to "Become Faster in Writing Performant CUDA Kernels using the Source Page in Nsight Compute" on NVIDIA On-Demand for more details about PC sampling data collection. Thanks!

antonio.msi · December 6, 2023, 6:13am

Really appreciate your reply!

veraj · December 20, 2023, 6:14am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
