[OpenCL] clSetEventCallback notification delay

While running an OpenCL application, I noticed a huge performance difference between an Intel CPU and an NVIDIA GPU. The application executes a lot of small kernels (~6,000) and uses clSetEventCallback in order to know when each kernel has finished executing.

While searching for the reason the GPU was so slow, I saw that the clSetEventCallback notification arrives about 20 ms late, i.e. a kernel runs for 30 ms but the notification arrives after 50 ms. For 6,000 kernel invocations, this adds up to about 120 s. When I replaced the clSetEventCallback calls with clFinish the problem went away, so I guess there is something wrong with the implementation of clSetEventCallback. With Intel's runtime there was no such behaviour; the callback fired as soon as the kernel finished executing.
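
To spell out the estimate above: this is only a rough calculation, assuming the delay is paid once per kernel and that none of it overlaps with other work. The symbols $N$, $t_{\text{notify}}$ and $t_{\text{kernel}}$ are just labels introduced here for the number of kernel invocations, the time at which the notification arrives, and the kernel run time.

$$
T_{\text{overhead}} = N \,(t_{\text{notify}} - t_{\text{kernel}}) = 6000 \times (50\,\text{ms} - 30\,\text{ms}) = 6000 \times 20\,\text{ms} = 120\,\text{s}
$$

So under these assumptions the notification delay alone accounts for roughly two minutes of wall-clock time.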

Is there a solution/fix/alternative?

I am using CUDA 7.0 and a GTX 690.