CUDA event timer or C++11 <chrono> timers, which one should I use?

As far as I know, the CUDA runtime provides a mechanism to measure time based on events, while C++11 provides the native <chrono> library, which can measure time as well.

But I don't know what the difference between them is in terms of performance. Which one is more precise? And which one is more suitable for comparing the time spent by CPU and GPU algorithms? Or are there other reasons or experiences to consider?

CUDA Timer

cudaEventElapsedTime

C++11 Timer

std::chrono::system/steady/high_resolution_clock

Thank you.

Use the CUDA event timers to measure GPU kernel time and use C++11 timers for your CPU timing.

The GPU (generally) runs asynchronously with the CPU, and since synchronizing the two often incurs an overhead, it's MUCH better to use the CUDA event timers. Querying these event timers incurs a synchronization overhead itself, so do that offline (after processing is finished). For example:

// Create one pair of CUDA events per iteration
cudaEvent_t start[NbIters], stop[NbIters];
for (unsigned i = 0; i < NbIters; i++) {
  cudaEventCreate(&start[i]);
  cudaEventCreate(&stop[i]);
}

// "Online": record events around each launch, no synchronization here
for (unsigned i = 0; i < NbIters; i++) {
  cudaEventRecord(start[i]);
  runKernel(i); // launches the kernel for iteration i
  cudaEventRecord(stop[i]);
}

// Offline: synchronize once, then query every elapsed time
cudaDeviceSynchronize();
for (unsigned i = 0; i < NbIters; i++) {
  float ms;
  cudaEventElapsedTime(&ms, start[i], stop[i]);
  std::cout << "Kernel " << i << " time: " << ms << " ms\n";
}

The CUDA event timers will give similar but not identical results to nvprof.
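
For the CPU side of the comparison, here is a minimal self-contained sketch with std::chrono (the summation is just a hypothetical stand-in for whatever CPU algorithm you want to time; steady_clock is monotonic, so it is the safe default for measuring intervals):

#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
  std::vector<double> v(1 << 22, 1.0);

  // steady_clock never jumps; system_clock can, if the wall clock is adjusted
  auto t0 = std::chrono::steady_clock::now();
  double sum = std::accumulate(v.begin(), v.end(), 0.0); // the CPU work being timed
  auto t1 = std::chrono::steady_clock::now();

  double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
  std::cout << "sum = " << sum << ", CPU time: " << ms << " ms\n";
}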

Thank you. The synchronization overhead is a good reason. However, in my program the final result has to be copied from device to host, so synchronization is necessary. In this situation, should I count that overhead or not?

You could do asynchronous memcpys (cudaMemcpyAsync) and measure the time of those as well.

//
// NOTE: still a sketch. The "..." placeholders are application-specific;
// the events are assumed to have been created with cudaEventCreate, and
// h_buff should be pinned (cudaMallocHost) for the copies to be truly
// asynchronous.
//
// Async copy to some device buffer:
cudaEventRecord(h2d_start, stream);
cudaMemcpyAsync(d_buff, ..., cudaMemcpyHostToDevice, stream);
cudaEventRecord(h2d_end, stream);
//
// Some kernel using the device buffer:
cudaEventRecord(kernel_start, stream);
kernel<<<..., stream>>>(d_buff, ...);
cudaEventRecord(kernel_end, stream);
//
// Async copy back to some host buffer:
cudaEventRecord(d2h_start, stream);
cudaMemcpyAsync(h_buff, ..., cudaMemcpyDeviceToHost, stream);
cudaEventRecord(d2h_end, stream);
...
//
// When the host needs the results, block until the last event has completed:
cudaEventSynchronize(d2h_end);
// or perhaps wait for all work on the stream to finish (there are options):
cudaStreamSynchronize(stream);

You will only need to synchronize when the CPU needs to actually process the data.
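
Once that synchronization has returned, all of the events above have completed, so reading the per-phase timings no longer stalls anything. A minimal readout, assuming the event names from the sketch above:

float h2d_ms, kernel_ms, d2h_ms;
cudaEventElapsedTime(&h2d_ms, h2d_start, h2d_end);
cudaEventElapsedTime(&kernel_ms, kernel_start, kernel_end);
cudaEventElapsedTime(&d2h_ms, d2h_start, d2h_end);
std::cout << "H2D: " << h2d_ms << " ms, kernel: " << kernel_ms
          << " ms, D2H: " << d2h_ms << " ms\n";

This way the copy overhead is a separate number, so you can include it in or exclude it from the comparison as you see fit.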

I highly recommend running the visual profiler / nvprof to get a good view of what is happening and what is taking time.
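
For example, running your binary under nvprof prints a summary of kernel and memcpy times (./myApp is a hypothetical executable name):

nvprof ./myApp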

In my opinion, it depends on what time you are interested in. Do you want to know the time spent by your application as a whole, or by the GPU? Use the timer that corresponds to what you want to measure.