The three main methods of measuring kernel execution time are host-side wall-clock timing, device-side event timing via cudaEventElapsedTime(), and in-kernel clock64() sampling.
These three methods measure weakly related but different definitions of “kernel timing”, and in practice event timing is usually the most useful and consistent for figuring out whether a particular kernel is fast, slow, or dominating overall. (Depending on what you’re optimizing or debugging, the other metrics are useful too.)
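(For reference, here’s a rough sketch of the first and third approaches; the kernel bodies and names are stand-ins for illustration, not code from a real project:)

#include <chrono>
#include <cuda_runtime.h>

// Stand-in kernel with the same signature as the launches below.
__global__ void myAmazingKernel(int N, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) y[i] = a * x[i] + y[i];
}

// Method 1: host-side wall clock. This measures how long the host waited,
// including launch and synchronization overhead, not just on-device time.
float hostWallClockMs(int N, float* d_x, float* d_y, cudaStream_t stream)
{
    auto t0 = std::chrono::steady_clock::now();
    myAmazingKernel<<<(N + 255) / 256, 256, 0, stream>>>(N, 2.0f, d_x, d_y);
    cudaStreamSynchronize(stream);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<float, std::milli>(t1 - t0).count();
}

// Method 3: in-kernel clock64(). Reads the per-SM cycle counter, so it reports
// cycles spent inside the kernel body by whichever thread samples it, not wall time.
__global__ void myTimedKernel(int N, float a, const float* x, float* y, long long* cycles)
{
    long long t0 = clock64();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) y[i] = a * x[i] + y[i];
    if (i == 0) *cycles = clock64() - t0;
}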
But I’ve just found a flaw with the event timing method… multi-stream contention can completely corrupt the time estimate. We’re sandwiching a kernel launch between two cudaEvents, like:
cudaEventRecord(start, stream);
myAmazingKernel<<<blockCnt, threadCnt, 0, stream>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop, stream);
and getting the kernel duration using cudaEventElapsedTime(). (See the above-linked blog post.)
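For completeness, the retrieval side looks roughly like this (assuming the events were created with plain cudaEventCreate, i.e. with timing enabled):

float ms = 0.0f;
cudaEventSynchronize(stop);              // wait until the stop event has actually been recorded on the GPU
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time between the two events, in milliseconds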
But this duration can be wrong by an order of magnitude if, in this multi-stream example, other kernels get launched on the same GPU in between this stream’s kernel launch and its event records.
“But that won’t happen; the event records are so fast that you won’t get other streams’ kernels launched in between them and the kernel,” you argue. (And that’s what I assumed.) But this isn’t true, at least for the end event AFTER the kernel. And it makes sense, because otherwise you’d never be able to execute concurrent kernels: other streams would have to wait for your start event, kernel execution, and end event to be processed atomically and uninterrupted, meaning they couldn’t be concurrent. As a clarifying example, imagine timing two back-to-back kernels:
cudaEventRecord(start, stream);
myAmazingKernel<<<blockCnt, threadCnt, 0, stream>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(midpoint, stream);
myFantasticKernel<<<blockCnt, threadCnt, 0, stream>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop, stream);
If event records were always fused with adjacent kernels (to prevent timing errors), then this CHAIN of events would prevent any other stream from slipping a kernel in concurrently or in between these two kernels. And other streams can do exactly that, by the definition of concurrency, so event-based duration timing is unpredictable with concurrent streams.
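Here’s a hypothetical repro sketch of the corruption (myLongKernel, the grid sizes, and the extra buffers are made-up names): keep a second stream busy with a long kernel, then event-time a short kernel on its own stream.

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// A long-running kernel occupies the GPU from another stream.
myLongKernel<<<bigGrid, threadCnt, 0, s2>>>(bigN, d_a, d_b);

// Event-sandwiched short kernel on its own stream.
cudaEventRecord(start, s1);
myAmazingKernel<<<blockCnt, threadCnt, 0, s1>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop, s1);

cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// ms now includes time spent waiting for and sharing the GPU with myLongKernel,
// so it can be far larger than myAmazingKernel’s isolated duration.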
This just bit me: I found that my carefully orchestrated asynchronous multi-stream work-queue design (philosophy: “Always keep the GPU busy with lots of available work to choose from”) was misleading me with incorrect kernel timing reports.
That said, I’m now not sure how to get kernel execution reports from my kernel launches at all. My open question is whether there’s a good strategy. For benchmarking a single specific kernel launch, I could use stream priorities, or disable multiple streams to effectively prevent concurrency (sketched below). But that’s not what I want; I’m trying to accumulate timings of kernels while they run concurrently in practice, to see relative time occupancy: something like “Over the past hour, kernels X, Y, and Z had average durations of xx, yy, and zz ms with standard deviations of uu, vv, and ww ms.”
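For that single-launch benchmarking case, the stream-priority variant would look roughly like this (a sketch with made-up names; note that priorities only influence how pending blocks are scheduled, they don’t preempt work that is already running):

int leastPrio = 0, greatestPrio = 0;
cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);   // greatestPrio is the numerically lowest (highest-priority) value

cudaStream_t benchStream;
cudaStreamCreateWithPriority(&benchStream, cudaStreamNonBlocking, greatestPrio);

cudaEventRecord(start, benchStream);
myAmazingKernel<<<blockCnt, threadCnt, 0, benchStream>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop, benchStream);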
I’m sure the timing data exists inside the driver, since Nsight can show the timelines of kernel launches and completions, but I’m not sure there’s a way for a CUDA program to fetch that data. Perhaps using the debugger API?