How to continuously monitor cudaMemcpy throughput with CUPTI?

The Visual Profiler shows the memcpy throughput in the timeline under Counters. The throughput is lower at the beginning and end of the operation. I'd like to collect this data myself. CUPTI seems like the right tool, but I haven't found the right events or metrics to collect.

Can anyone suggest how to collect it?


The screenshot above is from Nsight VSE CUDA Trace, not the Visual Profiler/CUPTI. I make the distinction because the two tools use different data collection mechanisms. Nsight generally collects more data and presents it in a different manner than the Visual Profiler.

The Counter row area graphs for memory copy throughput and GPU utilization are generated by performing point interpolation every 3 pixels on the range data provided in the Compute and Memory rows. A memory copy range in the Memory row can actually span multiple PCIe transfers, and the actual PCIe utilization/efficiency can vary during a range. The slight slope at the start and end of the range is due to the linear interpolation, not actual performance counter data. Ideally, the area graphs would switch from fixed-interval interpolation to variable-interval interpolation when the data is sparse (zoomed in), so that the edges would exactly match the ranges in your screenshot.

Nsight VSE, the Visual Profiler, and the CUPTI Events API do not expose a method for frequency-based sampling of the PCIe performance counters. The CUPTI Activity API can, however, be used to collect the range data displayed in the Memory, Compute, and Streams rows.

Greg, thanks a lot for the explanation. It was very helpful.