I am new to NVIDIA's profiling tools and have a question about profiling memory transfers inside a kernel. Specifically, I would like to obtain a time-varying graph with the bandwidth at a given time on the y-axis. Here is what I have tried so far:
Nsight Compute: While this tool provides overall bandwidth utilization, I have not been able to obtain time-varying data. It is possible that the values provided are already averaged over time.
Nsight Systems: The timeline view in this tool shows memory transfers, but only host-device or device-device ones. I am wondering whether it is possible to obtain a timeline view of traffic between on-chip memory (cache memory) and device memory.
Any suggestions or insights on how to achieve this would be greatly appreciated. Thank you.
Nsight Systems (NSYS) supports high-frequency sampling of GPU metrics. The counter values are aggregated across the whole GPU.
If you run the command nsys profile --gpu-metrics-set=help, you may find multiple metric sets for your GPU. The General Metrics for NVIDIA xyz set only includes video memory bandwidth. If you want to see additional bandwidth at the L1 and L2 levels, use the Graphics Throughput Metrics set. It is designed for short collection periods (under 1 second) with detail from many units.
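If you get the sampled values out of the report and want bandwidth over time on the y-axis, a rough post-processing sketch in Python might look like the one below. It assumes you already have the samples as (timestamp in ns, percent of peak throughput) pairs and that you supply the peak bandwidth of your GPU yourself. The function name, the input layout, and the bin width are hypothetical and not part of any NSYS API.

```python
from typing import Dict, List, Tuple


def to_bandwidth_series(
    samples: List[Tuple[int, float]],
    peak_gb_per_s: float,
    bin_ns: int,
) -> List[Tuple[int, float]]:
    """Average percent-of-peak samples into fixed time bins and scale to GB/s.

    samples       -- (timestamp_ns, percent_of_peak) pairs; this layout is an
                     assumption, adapt it to however you export the metrics
    peak_gb_per_s -- peak bandwidth of the memory level in GB/s (user supplied)
    bin_ns        -- width of each time bin in nanoseconds
    """
    if bin_ns <= 0:
        raise ValueError("bin_ns must be positive")

    # Accumulate (sum of percentages, sample count) per bin start time.
    bins: Dict[int, Tuple[float, int]] = {}
    for t_ns, pct in samples:
        start = (t_ns // bin_ns) * bin_ns
        total, count = bins.get(start, (0.0, 0))
        bins[start] = (total + pct, count + 1)

    # Mean percentage per bin, converted to GB/s, in time order.
    return [
        (start, (total / count) * peak_gb_per_s / 100.0)
        for start, (total, count) in sorted(bins.items())
    ]
```

Each returned pair is (bin start in ns, mean bandwidth in GB/s), which can be passed straight to a plotting library. Keep in mind that the values are GPU-wide, so any other work running on the GPU at the same time is included.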
NSYS has no support for correlating performance metrics to a section of code within the kernel.
Nsight Compute (NCU) does not support collecting a time series during the execution of the kernel.
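The NCU Source View can help you determine where you have long-latency operations. Please look for the locations with the highest stall sample counts on long scoreboard, which indicate warps waiting on a memory operation through L1TEX (global, local, surface, or texture access).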