Approximately the only method for in-kernel timing/profiling is the use of functions like clock64() (clock() and the PTX globaltimer register would be functions in the same category, from my perspective, at this level of discussion). It is extensively discussed in forum posts, so a search will turn those up for you. There are various challenges:
the compiler may not respect the ordering of (C++ source) code you have.
the conversion from C++ source code to SASS by the compiler may include things you didn’t expect, or exclude things you did expect to be included
a load from global operation is usually “fire-and-forget” or asynchronous, so casual attempts at timing may not capture the time to actually fetch the data (if that is what you are interested in). They may only capture the time to “issue” the instruction.
When I am using clock64() this way, I usually try to keep my test cases “simple” and confirm by studying the SASS that my intent is actually reflected at the SASS level.
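As an illustration of the “keep it simple” approach, here is a hypothetical sketch of bracketing a small region with clock64() inside a kernel. The kernel name and the work being timed are made up for the example; the key points are that the result is consumed (written to global memory) so the compiler cannot optimize the timed work away, and that even so, the only way to be sure the brackets land where you intend is to inspect the SASS.

```cpp
// Hypothetical sketch: timing a region inside a kernel with clock64().
// Writing the result out keeps the compiler from eliminating the timed work.
__global__ void timed_kernel(const float *in, float *out, long long *elapsed)
{
    long long t0 = clock64();
    float v = in[threadIdx.x] * 2.0f;   // the work being timed (made-up example)
    long long t1 = clock64();
    out[threadIdx.x] = v;               // consume v so it isn't optimized away
    if (threadIdx.x == 0)
        *elapsed = t1 - t0;             // elapsed cycles, per-SM clock domain
}
```

Note that clock64() returns per-SM cycle counts, so comparing values across blocks that may run on different SMs is not meaningful; keep the measurement within a single warp or block.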
In my case, I wrote two kernels (no use of async) for the same problem. The first one has double the memory traffic from L2 cache to gmem (profiled by ncu), but 10% less computation than the second one. Surprisingly, I find that the first one is about 10% faster than the second one across nearly all data sizes and shapes. Since I haven’t used async, I think I need to measure the data loading time to figure out where the problem is (profiling shows very few bank conflicts and very little uncoalesced memory access).
measuring data loading time will be more difficult (using bracketing with clock64()). At the SASS level, a global load instruction (LD or LDG) will be “asynchronous” in the sense that the instruction will issue (and appear to complete, in the sense that the next instruction can be issued), but the data requested will not (necessarily) be loaded into the designated register(s). A simple bracketing with clock64() will therefore capture the time to issue the LD or LDG instruction – which might typically be on the order of 10 cycles – but not the time to load the data – which might typically be on the order of 100 cycles or more, if going to actual DRAM.
So, just as with timing kernels, you would need to “capture” both the instruction issue and (probably) the consumption of the data within the clock64()-bracketed region. Because the warp is almost surely stalled at that point, the net effect may be that you are effectively timing other things besides just the time to load the data. (When timing kernels, you likewise need to capture both the issue point – the launch – and a synchronization point, typically.)
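The difference between bracketing only the issue and bracketing the issue plus the consumption can be sketched as follows. This is a hypothetical example (the kernel and variable names are made up); as noted above, the compiler is free to reorder things, so the SASS must be inspected to confirm the clock64() calls actually bracket what you intend.

```cpp
// Hypothetical sketch: two bracketing patterns around a global load.
__global__ void time_load(const int *g, int *out,
                          long long *t_issue, long long *t_consume)
{
    long long t0 = clock64();
    int v = g[0];                 // LDG issues here ("fire-and-forget")
    long long t1 = clock64();     // t1 - t0: roughly the issue time, ~10 cycles
    int r = v + 1;                // first use of v: warp stalls until data arrives
    long long t2 = clock64();     // t2 - t0: includes the actual load latency
    out[0] = r;                   // keep the value live so nothing is optimized out
    *t_issue   = t1 - t0;
    *t_consume = t2 - t0;
}
```

In practice the compiler may move the first use of v, or the clock64() calls themselves, so the two measurements are only trustworthy after confirming the instruction ordering at the SASS level.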
It’s not a trivial matter to use clock64() this way, and get sensible results. It may require some iterations with your test case coding, and studying of SASS as well as results.
Nevertheless, those two approaches (using the profiler, or using e.g. clock64()) are the two most likely avenues to get more info about what is going on, I would guess.
When you compare profiler statistics from Nsight Compute for the two variants, are there any significant differences in any of the metrics? If so, I would expect them to point in the direction of the underlying root causes.
I acknowledge that even with profiler statistics in hand, it is not always possible to pinpoint an exact root cause for observed performance differences. This is a general issue with complex processors (think butterfly effect) and not limited to GPUs, but probably more pronounced for these because the internal details of GPUs are not necessarily documented to the level of detail one would need to truly understand what is happening.
A practical approach in the context of insufficiently accurate modelling, whether in the form of mental models or actual software models, is to adopt an auto-tuning process that uses training run(s) to determine the optimal configuration from among multiple design elements for a particular workload. There is a long history of this in the BLAS and particularly the GEMM space. A prominent example of this is ATLAS. Auto-tuning approaches have in general proven competitive with hand-coded implementations by experts (a historical example would be GotoBLAS).
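A minimal auto-tuning loop of this kind can be sketched on the host side: launch the same kernel under several candidate configurations (here, block sizes), time each with CUDA events, and keep the fastest. The kernel `myKernel` below is a made-up placeholder for the real workload; a production auto-tuner would sweep more design parameters (tile sizes, unroll factors, etc.) and average over more runs.

```cpp
// Hypothetical sketch of an auto-tuning training run: time a placeholder
// kernel at several block sizes and keep the fastest configuration.
#include <cstdio>
#include <cfloat>

__global__ void myKernel(float *d, int n)   // stand-in for the real workload
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int candidates[] = {64, 128, 256, 512, 1024};
    int best_block = 0;
    float best_ms = FLT_MAX;
    for (int b : candidates) {
        int grid = (n + b - 1) / b;
        myKernel<<<grid, b>>>(d, n);          // warm-up launch
        cudaEventRecord(start);
        for (int rep = 0; rep < 10; rep++)    // average over several launches
            myKernel<<<grid, b>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_ms) { best_ms = ms; best_block = b; }
    }
    printf("best block size: %d (%.3f ms / 10 launches)\n", best_block, best_ms);
    cudaFree(d);
    return 0;
}
```

The chosen configuration can then be cached per device and problem size, which is essentially what libraries in the ATLAS tradition do during their install-time training runs.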