How can I measure the time spent copying data from global to shared memory using ncu?


Currently I can get the kernel execution time using ncu. What I want to know is how much of that time is spent copying memory.

Are you referring to copying memory to and from the device? If so, you should use Nsight Systems, which can show Host-to-Device and Device-to-Host memory copies on the timeline.

What I want to know is the time it takes to copy data or instructions from global memory to shared memory on the GPU.

We don’t have a direct way to measure that time. Because these copies happen in parallel with compute, it’s not necessarily the most useful measurement anyway. What we do have are metrics for utilization efficiency and percentage of peak for various units in the memory system (L1, L2, etc.). This traffic would likely show up as global loads and shared stores in the memory chart.
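If a rough in-kernel number is still wanted, one option outside of ncu is to bracket the copy with `clock64()` in the kernel itself. The following is only a minimal sketch under assumptions: the kernel `stagedCopy`, the helper `measureCopyCycles`, and the tile size `TILE` are hypothetical names made up for illustration, and the result is a per-block cycle count that includes load latency and any overlap with other warps, so it should be read as an indication rather than an exact copy time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 256;  // hypothetical tile size: threads per block

// Hypothetical kernel: each block stages one tile from global into shared
// memory, and thread 0 records the cycles it observed for that phase.
// Caveat: the compiler may schedule the global load around the clock reads,
// so the SASS should be checked before trusting the numbers.
__global__ void stagedCopy(const float* in, float* out, long long* cycles, int n) {
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    long long start = clock64();
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // global load + shared store
    __syncthreads();                             // wait until the tile is filled
    long long stop = clock64();

    if (threadIdx.x == 0) cycles[blockIdx.x] = stop - start;
    if (i < n) out[i] = tile[threadIdx.x] * 2.0f;  // consume the tile
}

// Launches the kernel once and returns the average per-block cycle count,
// or -1 if allocation or execution failed.
long long measureCopyCycles(int n) {
    int blocks = (n + TILE - 1) / TILE;
    float *in = nullptr, *out = nullptr;
    long long* cycles = nullptr;
    bool ok = cudaMallocManaged(&in, n * sizeof(float)) == cudaSuccess &&
              cudaMallocManaged(&out, n * sizeof(float)) == cudaSuccess &&
              cudaMallocManaged(&cycles, blocks * sizeof(long long)) == cudaSuccess;
    long long avg = -1;
    if (ok) {
        for (int i = 0; i < n; ++i) in[i] = 1.0f;
        stagedCopy<<<blocks, TILE>>>(in, out, cycles, n);
        if (cudaDeviceSynchronize() == cudaSuccess) {
            long long total = 0;
            for (int b = 0; b < blocks; ++b) total += cycles[b];
            avg = total / blocks;
        }
    }
    cudaFree(in);      // cudaFree(nullptr) is a no-op, so this is safe on failure
    cudaFree(out);
    cudaFree(cycles);
    return avg;
}

int main() {
    long long avg = measureCopyCycles(1 << 20);
    printf("average cycles per block for the global-to-shared stage: %lld\n", avg);
    return avg > 0 ? 0 : 1;
}
```

Assuming the file is saved as `staged_copy.cu`, it can be built with `nvcc -o staged_copy staged_copy.cu`. Profiling the same binary with ncu and looking at the memory chart should then show the matching global loads and shared stores next to the cycle count.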
