As far as I know, the CUDA runtime provides a mechanism for measuring time based on events, while C++11 also has a native timing library, <chrono>.
But I don't know how they differ in terms of performance. Which one is more precise? Which one is better suited to comparing the time spent by CPU and GPU algorithms? Are there other reasons or experiences that favor one over the other?
Use the CUDA event timers to measure GPU kernel time and use the C++11 timers for your CPU timing.
The GPU (generally) runs asynchronously with respect to the CPU, and since synchronizing the two often incurs an overhead, it's MUCH better to use the CUDA event timers for GPU work. Querying these event timers does incur a synchronization overhead, so do that offline (after processing is finished). For example, in pseudocode:
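// NOTE: this is not real code; timer(i) is a hypothetical wrapper around a pair of CUDA events
// "Online": only record, never query inside the loop
for (uint i = 0; i < NbIters; i++)
{
    timer(i).start();
    runKernel(i);
    timer(i).end();
}
// "Offline": query the elapsed times once all the work has finished
for (uint i = 0; i < NbIters; i++)
    std::cout << " Some kernel time: " << timer(i).getElapsedTime() << "\n";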
The CUDA event timers will give similar but not identical results to nvprof.
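For the CPU side, here is a minimal sketch of timing with the C++11 <chrono> library. The names time_cpu_ms and sum_range are made up for illustration, and sum_range is only a stand-in for whatever CPU algorithm you want to compare against the GPU version.

```cpp
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock
// time in milliseconds, measured with a monotonic clock.
template <typename Work>
double time_cpu_ms(Work&& work)
{
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    work();
    const auto stop = clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Stand-in for a CPU algorithm (an assumption, not from the thread):
// sums the integers 0 .. n-1.
long long sum_range(std::size_t n)
{
    std::vector<long long> values(n);
    std::iota(values.begin(), values.end(), 0LL);
    return std::accumulate(values.begin(), values.end(), 0LL);
}
```

A call such as time_cpu_ms([&] { result = sum_range(n); }) returns milliseconds, the same unit that cudaEventElapsedTime reports, so the two numbers can be compared directly. steady_clock is used here rather than system_clock because it is monotonic and is not affected by adjustments to the system time.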
Thank you. The overhead occurring in synchronization is a good reason. However, in my program, the final result has to be copied from device to host, and the synchronization is necessary. In this situation, should I count the overhead or not?
You could use asynchronous memcpys and measure the time of those as well.
//
// NOTE: this is not real code either
//
// Async copy to some device buffer:
cudaEventRecord (h2d_start, stream);
cudaMemcpyAsync(d_buff, ..., cudaMemcpyHostToDevice, stream);
cudaEventRecord (h2d_end, stream);
//
// Some kernel using the device buffer
cudaEventRecord (kernel_start, stream);
kernel<<<..., stream>>>(d_buff, ...);
cudaEventRecord (kernel_end, stream);
//
// Async copy to some host buffer
cudaEventRecord (d2h_start, stream);
cudaMemcpyAsync(h_buff, ..., cudaMemcpyDeviceToHost, stream);
cudaEventRecord (d2h_end, stream);
...
//
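// When the user wants to synchronize work:
// block the host until the last event has completed:
cudaEventSynchronize(d2h_end);
// or perhaps wait for all work on stream to finish? (there are options)
cudaStreamSynchronize(stream);
// Once the events have completed, read an elapsed time (in milliseconds):
float kernel_ms = 0.0f;
cudaEventElapsedTime(&kernel_ms, kernel_start, kernel_end);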
You will only need to synchronize when the CPU needs to actually process the data.
I highly recommend running the visual profiler / nvprof to get a good view of what is happening and what is taking time.
In my opinion, it depends on which time you are interested in. Do you want to know the time spent by your application or by the GPU? Use the timer that corresponds to your preference.