I want to time how long a kernel takes to execute. I'm using the Driver API, but I'm pretty sure my question applies to the Runtime API too. If I create the events with the default flags, then cuEventElapsedTime() returns a small number, 0.001952 milliseconds. But if I create them with the CU_EVENT_BLOCKING_SYNC flag, I get a much larger time, 0.070016 milliseconds. That's a difference of roughly 36x. My question is: why would using blocking synchronization cause such a large difference in the measurements?
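For reference, my timing setup looks roughly like this (a minimal sketch; the module file, kernel name, and launch configuration are placeholders, and error checking is omitted):

```cuda
#include <cuda.h>
#include <stdio.h>

int main(void) {
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction kernel;
    CUevent start, stop;
    float ms = 0.0f;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "kernel.cubin");              /* placeholder file */
    cuModuleGetFunction(&kernel, mod, "myKernel");   /* placeholder name */

    /* Created with default flags here; swapping in CU_EVENT_BLOCKING_SYNC
       is the only change between the two measurements. */
    cuEventCreate(&start, CU_EVENT_DEFAULT);
    cuEventCreate(&stop,  CU_EVENT_DEFAULT);

    cuEventRecord(start, 0);
    cuLaunchKernel(kernel, 1, 1, 1, 256, 1, 1, 0, 0, NULL, NULL);
    cuEventRecord(stop, 0);
    cuEventSynchronize(stop);

    cuEventElapsedTime(&ms, start, stop);
    printf("elapsed: %f ms\n", ms);

    cuEventDestroy(start);
    cuEventDestroy(stop);
    cuCtxDestroy(ctx);
    return 0;
}
```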
The way I understand it, that flag only controls how the host thread waits: with CU_EVENT_BLOCKING_SYNC it blocks on the event, freeing up CPU resources, and without it the thread spins (or yields) and burns more CPU. I don't see how this host-side waiting behavior could affect the accuracy of timing code that's running on the GPU.