Question about timing events

I want to time how long a kernel takes to execute. I’m using the Driver API, but I’m pretty sure my question applies to the Runtime API too. If I create the events with default synchronization, then cuEventElapsedTime() returns a small number, 0.001952 milliseconds. But if I use the CU_EVENT_BLOCKING_SYNC flag, then I get a much larger time, 0.070016 milliseconds. That’s a difference of almost 70 microseconds. My question is: why would using blocking synchronization cause such a large difference in the measurements?

The way I understand it, the synchronization flag only controls how the host thread waits: with CU_EVENT_BLOCKING_SYNC it sleeps and frees up CPU resources, while with the default it spins or yields and uses more of the CPU. I don’t see how that would affect the accuracy of timing code that’s running on the GPU.
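To illustrate what I mean, here is a minimal sketch of the two creation modes (Driver API, error checking omitted; the variable names are just for illustration):

CUevent ev_default, ev_blocking;

// Default flags: cuEventSynchronize() busy-waits (spins/yields),
// keeping the host thread on the CPU until the event completes.
cuEventCreate(&ev_default, CU_EVENT_DEFAULT);

// Blocking sync: cuEventSynchronize() puts the host thread to sleep
// until the driver signals that the event has completed.
cuEventCreate(&ev_blocking, CU_EVENT_BLOCKING_SYNC);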

so wait, are you doing

cuEventRecord(e0, s0);
cuEventRecord(e1, s0);
cuEventSynchronize(e1);
cuEventElapsedTime(&time, e0, e1);

and seeing major differences between blocking sync and standard?

Yes. My code looks something like this:

CUevent start;
CUevent stop;
float elapsed_time;

cuEventCreate(&start, CU_EVENT_BLOCKING_SYNC); // or CU_EVENT_DEFAULT
cuEventCreate(&stop, CU_EVENT_BLOCKING_SYNC);
cuEventRecord(start, 0);  // record on the default stream

// kernel launch

cuEventRecord(stop, 0);
cuEventSynchronize(stop);  // wait for the stop event to complete
cuEventElapsedTime(&elapsed_time, start, stop);

Scratch that. I was timing the wrong lines of code and not the kernel launch. Now I get much more reasonable numbers, but there is still a difference:

2.393920 milliseconds with blocking sync
2.317088 milliseconds with default sync

That’s a difference of about 77 microseconds. I’m still not sure why there would be any significant difference between the two.

Blocking sync takes about 70 microseconds on most platforms, due to all of the kernel-mode thunks involved and the vagaries of the OS thread scheduler.
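If that fixed cost matters for your measurements, one workaround (a sketch under some assumptions: a parameterless CUfunction named kernel, a current context, error checking omitted) is to amortize it by timing many launches between a single start/stop pair:

#include <cuda.h>

// Times `iters` back-to-back launches of `kernel` between one
// start/stop event pair, so any fixed synchronization overhead is
// paid once rather than once per launch.
float time_kernel(CUfunction kernel, int iters)
{
    CUevent start, stop;
    float total_ms;

    cuEventCreate(&start, CU_EVENT_BLOCKING_SYNC);
    cuEventCreate(&stop, CU_EVENT_BLOCKING_SYNC);

    cuCtxSynchronize();             // don't let earlier work leak into the timing
    cuEventRecord(start, 0);
    for (int i = 0; i < iters; i++)
        cuLaunchKernel(kernel, 1, 1, 1, 1, 1, 1, 0, 0, NULL, NULL);
    cuEventRecord(stop, 0);
    cuEventSynchronize(stop);       // the ~70 us blocking wait happens once, here

    cuEventElapsedTime(&total_ms, start, stop);
    cuEventDestroy(start);
    cuEventDestroy(stop);
    return total_ms / iters;        // average time per launch
}

The blocking wait then contributes to the total once instead of adding its full cost to each individual measurement.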