Is there a platform-independent way to measure the CPU time that all the asynchronous and blocking CUDA/OpenCL functions take?
clock() only counts the CPU time the process actually consumes, not elapsed wall-clock time, so a blocking cudaThreadSynchronize (where the host thread mostly sleeps) will not be measured correctly, or am I wrong?
Note: I know how to use GPU timers and events, but I want to measure the overall execution time, since some people are not amused when the copies & kernel launches take 10 ms, but the CPU takes 500 ms to map and unmap buffers and wait on blocking synchronization calls.
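For illustration, this is roughly what I mean by "overall execution time": a minimal, untested sketch that wraps the whole sequence in a monotonic host-side wall-clock timer (C++11 std::chrono::steady_clock). The buffer size and the omitted kernel launch are just placeholders:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t n = 1 << 20;                      // placeholder buffer size
    float *h_buf = new float[n]();
    float *d_buf = nullptr;
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    // steady_clock is monotonic wall-clock time, so it also counts the time
    // the host thread spends blocked inside synchronization calls.
    auto t0 = std::chrono::steady_clock::now();

    cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
    // ... kernel launches would go here ...
    cudaDeviceSynchronize();                       // blocking wait is included

    auto t1 = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("overall host-side time: %.3f ms\n", ms);

    cudaFree(d_buf);
    delete[] h_buf;
    return 0;
}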
I am certain clock() won’t account for cudaThreadSynchronize.
On your first question: I actually use CUDA events to time both CPU and CUDA code. I compared the CPU timing obtained with CUDA events against a standard CPU timer, clock(), and the timings seem to be almost the same. So isn't this the right way to do the timing? On OpenCL timing, I have no clue.
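For reference, this is roughly the event-based timing I mean (minimal sketch; stream 0 and the omitted work between the two records are placeholders):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    // ... memcpys and kernel launches to be timed go here ...
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);         // block until 'stop' has completed on the GPU

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time between the events, in ms
    printf("GPU time between events: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}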
Actually not, since neither function measures real (wall-clock) time…
Imagine your timer says your function takes 2 seconds, and you run it 100000 times in a row… wouldn't you be surprised if that took a week instead of roughly 2.3 days?
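To make that concrete, here is a small host-only sketch of the discrepancy; the sleep just stands in for a blocking synchronization call, and the behavior of clock() shown in the comments assumes a POSIX system (on some platforms, e.g. MSVC, clock() reports wall time instead):

#include <chrono>
#include <cstdio>
#include <ctime>
#include <thread>

int main()
{
    // Stand-in for a blocking cudaThreadSynchronize: the host thread sleeps,
    // consuming (almost) no CPU time while wall-clock time keeps passing.
    std::clock_t c0 = std::clock();
    auto w0 = std::chrono::steady_clock::now();

    std::this_thread::sleep_for(std::chrono::seconds(2));

    std::clock_t c1 = std::clock();
    auto w1 = std::chrono::steady_clock::now();

    printf("clock():      %.3f s of CPU time\n",
           double(c1 - c0) / CLOCKS_PER_SEC);
    printf("steady_clock: %.3f s of wall time\n",
           std::chrono::duration<double>(w1 - w0).count());
    // On POSIX, typical output is ~0.000 s of CPU time vs. ~2.000 s of wall time.
    return 0;
}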