Hi everyone,
I am currently working on optimizing my GPU kernels and have a couple of questions regarding performance measurement:
In-Kernel Time Measurement: Is there a way to accurately measure the time elapsed inside a GPU kernel from the perspective of the GPU itself?
Function-Level Profiling in Nsight: I am familiar with using Nsight for profiling the overall performance of GPU kernels. However, I’m wondering if there’s a feature or method within one of the Nsight profilers (like Nsight Compute or Nsight Systems) that allows for more granular time measurement of specific functions or sections within a single GPU kernel, rather than the entire kernel execution.
Nsight Compute supports statistical sampling of warp program counters and warp state during the execution of a grid or range. This information is displayed on the Source page. On the Source page you can use the collapse ([-]) button to roll up the sampled and per-instruction counters to the source-line level. All counters shown in the source view are flat, meaning there is no understanding of the call graph and therefore no call-graph-based roll-up of counters.
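To get useful source correlation on the Source page, the application needs to be built with line information, and the report needs to include the source-level sampling sections. A minimal sketch of the command line (file and report names here are just placeholders) could be:

```
nvcc -lineinfo -O3 -o app app.cu
ncu --set full -o report ./app
```

The -lineinfo flag is what allows Nsight Compute to map the sampled program counters and per-instruction counters from SASS back to your CUDA source lines; without it you will only see the SASS view.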
Code ranges can be timed using manual instrumentation. The cycle counter and the global timer are both available through inline PTX or through higher-level intrinsics such as clock() and std::chrono. If you choose to instrument the kernel, I highly recommend using %clock (lower 32 bits) or %globaltimer_lo (the lower 32 bits of %globaltimer). If you use %globaltimer* on compute capability < 9.0, then I recommend running your code under Nsight Compute or Nsight Systems (nsys), as the profilers increase the globaltimer update frequency by roughly 32x (1 MHz → 31.25 MHz). If you use manual instrumentation, I highly recommend that you review the generated SASS to (a) ensure the timer reads are located in the desired positions, and (b) verify that the timing calls do not change the overall code generation.
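A minimal sketch of this kind of instrumentation is below. It is not a definitive implementation: the kernel body is placeholder work, the buffer and kernel names are illustrative, and as noted above you should check the generated SASS to confirm the timer reads stay where you placed them.

```
// Sketch: timing a code range inside a kernel with clock() (per-SM cycle
// counter, %clock) and %globaltimer_lo (nanosecond timer) via inline PTX.
#include <cstdio>
#include <cuda_runtime.h>

__device__ __forceinline__ unsigned int globaltimer_lo()
{
    unsigned int t;
    // Lower 32 bits of %globaltimer: nanoseconds, wraps roughly every 4.3 s.
    asm volatile("mov.u32 %0, %%globaltimer_lo;" : "=r"(t));
    return t;
}

__global__ void timed_kernel(float *data, unsigned long long *cycles,
                             unsigned int *nanoseconds, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned int c0 = (unsigned int)clock();  // lower 32 bits of %clock
    unsigned int t0 = globaltimer_lo();

    // --- region being timed (placeholder work) ---
    float v = data[i];
    for (int k = 0; k < 1000; ++k)
        v = v * 1.0001f + 0.5f;
    data[i] = v;
    // ---------------------------------------------

    unsigned int c1 = (unsigned int)clock();
    unsigned int t1 = globaltimer_lo();

    // Record one sample per block to keep the instrumentation overhead low.
    if (threadIdx.x == 0) {
        cycles[blockIdx.x]      = c1 - c0;  // SM cycles elapsed in the region
        nanoseconds[blockIdx.x] = t1 - t0;  // wall-clock ns elapsed
    }
}

int main()
{
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *data; unsigned long long *cycles; unsigned int *ns;
    cudaMalloc(&data, n * sizeof(float));
    cudaMalloc(&cycles, blocks * sizeof(unsigned long long));
    cudaMalloc(&ns, blocks * sizeof(unsigned int));
    cudaMemset(data, 0, n * sizeof(float));

    timed_kernel<<<blocks, threads>>>(data, cycles, ns, n);
    cudaDeviceSynchronize();

    unsigned long long c; unsigned int t;
    cudaMemcpy(&c, cycles, sizeof(c), cudaMemcpyDeviceToHost);
    cudaMemcpy(&t, ns, sizeof(t), cudaMemcpyDeviceToHost);
    printf("block 0: %llu cycles, %u ns\n", c, t);

    cudaFree(data); cudaFree(cycles); cudaFree(ns);
    return 0;
}
```

Note that clock() reads a per-SM counter, so cycle deltas are only meaningful between two reads taken by the same thread on the same SM, while %globaltimer gives wall-clock time shared across the device (with the update-frequency caveat above when not running under a profiler).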