Hi everyone,
I am currently working on optimizing my GPU kernels and have a couple of questions regarding performance measurement:
In-Kernel Time Measurement: Is there a way to accurately measure the time elapsed inside a GPU kernel from the perspective of the GPU itself?
Function-Level Profiling in Nsight: I am familiar with using Nsight for profiling the overall performance of GPU kernels. However, I’m wondering if there’s a feature or method within one of the Nsight profilers (like Nsight Compute or Nsight Systems) that allows for more granular time measurement of specific functions or sections within a single GPU kernel, rather than the entire kernel execution.
Nsight Compute supports statistical sampling of warp program counters and warp state during the execution of a grid or range. This information is displayed on the Source page. On the Source page you can use the collapse ([-]) button to roll up the sampled and per-instruction counters to the source-line level. All counters shown in the source view are flat, meaning there is no understanding of the call graph and therefore no call-graph-based roll-up of counters.
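To get useful source correlation on the Source page, the application needs to be built with line information, and the report needs to include the source-level sampling sections. A minimal sketch of the command line (file and report names here are just placeholders) could be:

```
nvcc -lineinfo -O3 -o app app.cu
ncu --set full -o report ./app
```

The -lineinfo flag is what allows Nsight Compute to map the sampled program counters and per-instruction counters from SASS back to your CUDA source lines; without it you will only see the SASS view.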
Code ranges can be timed using manual instrumentation. The cycle counter and the global timer are both available through inline PTX or through higher-level intrinsics such as clock() and std::chrono. If you choose to instrument the kernel, I highly recommend using %clock (lower 32 bits) or %globaltimer_lo (the lower 32 bits of %globaltimer). If you use %globaltimer* on compute capability < 9.0, then I recommend running your code under Nsight Compute or Nsight Systems (nsys), as the profilers increase the globaltimer update frequency by roughly 32x (1 MHz → 31.25 MHz). If you use manual instrumentation, I highly recommend that you review the generated SASS to (a) ensure the timer reads are located in the desired positions, and (b) verify that the timing calls do not change the overall code generation.
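A minimal sketch of this kind of instrumentation is below. It is not a definitive implementation: the kernel body is placeholder work, the buffer and kernel names are illustrative, and as noted above you should check the generated SASS to confirm the timer reads stay where you placed them.

```
// Sketch: timing a code range inside a kernel with clock() (per-SM cycle
// counter, %clock) and %globaltimer_lo (nanosecond timer) via inline PTX.
#include <cstdio>
#include <cuda_runtime.h>

__device__ __forceinline__ unsigned int globaltimer_lo()
{
    unsigned int t;
    // Lower 32 bits of %globaltimer: nanoseconds, wraps roughly every 4.3 s.
    asm volatile("mov.u32 %0, %%globaltimer_lo;" : "=r"(t));
    return t;
}

__global__ void timed_kernel(float *data, unsigned long long *cycles,
                             unsigned int *nanoseconds, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned int c0 = (unsigned int)clock();  // lower 32 bits of %clock
    unsigned int t0 = globaltimer_lo();

    // --- region being timed (placeholder work) ---
    float v = data[i];
    for (int k = 0; k < 1000; ++k)
        v = v * 1.0001f + 0.5f;
    data[i] = v;
    // ---------------------------------------------

    unsigned int c1 = (unsigned int)clock();
    unsigned int t1 = globaltimer_lo();

    // Record one sample per block to keep the instrumentation overhead low.
    if (threadIdx.x == 0) {
        cycles[blockIdx.x]      = c1 - c0;  // SM cycles elapsed in the region
        nanoseconds[blockIdx.x] = t1 - t0;  // wall-clock ns elapsed
    }
}

int main()
{
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *data; unsigned long long *cycles; unsigned int *ns;
    cudaMalloc(&data, n * sizeof(float));
    cudaMalloc(&cycles, blocks * sizeof(unsigned long long));
    cudaMalloc(&ns, blocks * sizeof(unsigned int));
    cudaMemset(data, 0, n * sizeof(float));

    timed_kernel<<<blocks, threads>>>(data, cycles, ns, n);
    cudaDeviceSynchronize();

    unsigned long long c; unsigned int t;
    cudaMemcpy(&c, cycles, sizeof(c), cudaMemcpyDeviceToHost);
    cudaMemcpy(&t, ns, sizeof(t), cudaMemcpyDeviceToHost);
    printf("block 0: %llu cycles, %u ns\n", c, t);

    cudaFree(data); cudaFree(cycles); cudaFree(ns);
    return 0;
}
```

Note that clock() reads a per-SM counter, so cycle deltas are only meaningful between two reads taken by the same thread on the same SM, while %globaltimer gives wall-clock time shared across the device (with the update-frequency caveat above when not running under a profiler).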