How to utilize PM sampling?
I recently noticed a new feature in NCU: PM sampling.

But I am not sure how to use it in my profiling. What can I take away from it?

For example, if I use double buffering, can it show me when the load unit is waiting for the compute unit?

PM (performance monitor) sampling lets you see the values of single-pass metrics over the runtime of your workload. This enables you to identify, for example, tail effects (a lower number of active warps towards the end of the kernel), how metric values correlate with potential phases in your algorithm, or how different metrics correlate with each other for your workload (e.g. compute pipelines idling while DRAM throughput is high because data is being loaded).
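To make the "how different metrics are correlated" idea concrete, here is a minimal sketch. It assumes you already have the per-sample values of two metrics as plain lists (how you get them out of the report is left open here). The function name `pearson` and the sample values are made up for illustration and are not real profiler output. A coefficient close to -1 would suggest that one metric drops when the other rises, which is the pattern you would expect if loads and compute were taking turns instead of overlapping.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sample series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)

# Hypothetical sample values (percent of peak per sampling interval).
# These numbers are invented to illustrate an alternating load/compute pattern.
dram_pct = [80, 20, 75, 25, 85, 15, 78, 22]
compute_pct = [20, 85, 30, 80, 15, 90, 25, 82]

r = pearson(dram_pct, compute_pct)
print(f"correlation between DRAM and compute throughput: {r:.2f}")
```

A value near 0 means there is no clear linear relationship between the two series, and a value near +1 means both rise and fall together.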

In the screenshot you shared, it seems the DRAM throughput is generally low and the compute throughput generally high, but there doesn’t seem to be a clear pattern where one drops and the other increases, or similar.

In Nsight Compute 2024.1 (CUDA 12.4), you can also collect the PmSampling set, which includes a new PM sampling section dedicated to warp stalls. That section may give you more insight into warp stalls caused by compute waiting for loads to complete. Enable the set for collection and then select the corresponding entry in the dropdown on the right-hand side of the PM Sampling section's header.
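As a rough sketch of what "enable the set for collection" could look like from the command line, the helper below assembles an `ncu` invocation. The helper name `build_ncu_command`, the application path, and the report name are hypothetical. The set identifier `pmsampling` is an assumption on my side, so please check it against the output of `ncu --list-sets` for your version.

```python
import shlex

def build_ncu_command(app, app_args=(), report="pm_sampling_report"):
    """Assemble an ncu command line that collects the PM sampling set."""
    # Assumption: the set identifier is "pmsampling"; verify with `ncu --list-sets`.
    # "-o" names the report file that you then open in the Nsight Compute UI.
    return ["ncu", "--set", "pmsampling", "-o", report, app, *app_args]

# Hypothetical application and arguments, for illustration only.
cmd = build_ncu_command("./my_app", ["--iters", "10"])
print(shlex.join(cmd))
```

The script only prints the command. Run the printed line in a shell (or pass `cmd` to `subprocess.run`) on a machine where `ncu` is installed, then open the resulting report and pick the warp-stall entry in the PM Sampling section.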

