The three main methods of measuring kernel execution time are host-side wall-clock timing, device-side event timing via cudaEventElapsedTime(), and in-kernel clock64() sampling.
These three methods measure weakly related but different definitions of “kernel timing”, and in practice event timing is usually the most useful and consistent for figuring out whether a particular kernel is fast, slow, or dominating overall. (Depending on what you’re optimizing or debugging, the other metrics are useful too.)
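(For reference, here’s a rough sketch of the first and third approaches; the kernel bodies and names are stand-ins for illustration, not code from a real project:)

#include <chrono>
#include <cuda_runtime.h>

// Stand-in kernel with the same signature as the launches below.
__global__ void myAmazingKernel(int N, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) y[i] = a * x[i] + y[i];
}

// Method 1: host-side wall clock. This measures how long the host waited,
// including launch and synchronization overhead, not just on-device time.
float hostWallClockMs(int N, float* d_x, float* d_y, cudaStream_t stream)
{
    auto t0 = std::chrono::steady_clock::now();
    myAmazingKernel<<<(N + 255) / 256, 256, 0, stream>>>(N, 2.0f, d_x, d_y);
    cudaStreamSynchronize(stream);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<float, std::milli>(t1 - t0).count();
}

// Method 3: in-kernel clock64(). Reads the per-SM cycle counter, so it reports
// cycles spent inside the kernel body by whichever thread samples it, not wall time.
__global__ void myTimedKernel(int N, float a, const float* x, float* y, long long* cycles)
{
    long long t0 = clock64();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) y[i] = a * x[i] + y[i];
    if (i == 0) *cycles = clock64() - t0;
}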
But I’ve just found a flaw with the event timing method… multi-stream contention can completely corrupt the time estimate. We’re sandwiching a kernel launch between two cudaEvents, like:
cudaEventRecord(start, stream);
myAmazingKernel<<<blockCnt, threadCnt, 0, stream>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop, stream);
and getting the kernel duration using cudaEventElapsedTime(). (See the above-linked blog post.)
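For completeness, the retrieval side looks roughly like this (assuming the events were created with plain cudaEventCreate, i.e. with timing enabled):

float ms = 0.0f;
cudaEventSynchronize(stop);              // wait until the stop event has actually been recorded on the GPU
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time between the two events, in milliseconds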
But this duration can be wrong by an order of magnitude if, in this multi-stream example, other kernels get launched on the same GPU in between this stream’s kernel launch and its event records.
“But that won’t happen; the event records are so fast that you won’t get other streams’ kernels launched in between them and the kernel,” you argue. (And that’s what I assumed.) But this isn’t true, at least for the end event AFTER the kernel. And it makes sense, because otherwise you’d never be able to execute concurrent kernels: other streams would have to wait for your start event, kernel execution, and end event to be processed atomically and uninterrupted, meaning they couldn’t be concurrent. As a clarifying example, imagine timing two back-to-back kernels:
cudaEventRecord(start, stream);
myAmazingKernel<<<blockCnt, threadCnt, 0, stream>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(midpoint, stream);
myFantasticKernel<<<blockCnt, threadCnt, 0, stream>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop, stream);
If event records were always fused with adjacent kernels (to prevent timing errors), then this CHAIN of events would prevent any other stream from slipping a kernel in concurrently or in between these two kernels. And other streams can do exactly that, by the definition of concurrency, so event-based duration timing is unpredictable with concurrent streams.
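Here’s a hypothetical repro sketch of the corruption (myLongKernel, the grid sizes, and the extra buffers are made-up names): keep a second stream busy with a long kernel, then event-time a short kernel on its own stream.

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// A long-running kernel occupies the GPU from another stream.
myLongKernel<<<bigGrid, threadCnt, 0, s2>>>(bigN, d_a, d_b);

// Event-sandwiched short kernel on its own stream.
cudaEventRecord(start, s1);
myAmazingKernel<<<blockCnt, threadCnt, 0, s1>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop, s1);

cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// ms now includes time spent waiting for and sharing the GPU with myLongKernel,
// so it can be far larger than myAmazingKernel’s isolated duration.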
This just bit me: I found that my carefully orchestrated asynchronous multi-stream work-queue design (philosophy: “Always keep the GPU busy with lots of available work to choose from”) was misleading me with incorrect kernel timing reports.
That said, I’m now not sure how to get kernel execution reports from my kernel launches at all. My open question is whether there’s a good strategy. For benchmarking a single specific kernel launch, I could use stream priorities, or disable multiple streams to effectively prevent concurrency (sketched below). But that’s not what I want; I’m trying to accumulate timings of kernels while they run concurrently in practice, to see relative time occupancy: something like “Over the past hour, kernels X, Y, and Z had average durations of xx, yy, and zz ms with standard deviations of uu, vv, and ww ms.”
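For that single-launch benchmarking case, the stream-priority variant would look roughly like this (a sketch with made-up names; note that priorities only influence how pending blocks are scheduled, they don’t preempt work that is already running):

int leastPrio = 0, greatestPrio = 0;
cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);   // greatestPrio is the numerically lowest (highest-priority) value

cudaStream_t benchStream;
cudaStreamCreateWithPriority(&benchStream, cudaStreamNonBlocking, greatestPrio);

cudaEventRecord(start, benchStream);
myAmazingKernel<<<blockCnt, threadCnt, 0, benchStream>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop, benchStream);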
I’m sure the timing data exists inside the driver, since Nsight can show the timelines of kernel launches and completions, but I’m not sure there’s a way for a CUDA program to fetch that data. Perhaps using the debugger API?