A really huge relative overhead

Hello all!

Sometimes I have to wait for a kernel to finish, so I call cudaDeviceSynchronize() or cudaStreamSynchronize(). While that call blocks, the host is stuck waiting and cannot prepare the next kernel launch. As a result, once the synchronization returns, the host spends an additional 10 to 15 microseconds launching the next kernel, and on fast GPUs, where the kernel itself may be short, this becomes a really huge relative overhead.

  • don’t synchronize
  • have your kernels do more work
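
To illustrate the first suggestion, here is a minimal sketch, assuming the kernels depend on each other only through device memory. Launches issued to the same stream execute in order, so the host can queue all of them without waiting and synchronize once at the end. The kernel `add_one` and the helper `run_back_to_back` are hypothetical names made up for this example.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds 1.0f to every element.
__global__ void add_one(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Queues `launches` kernels back to back in one stream and synchronizes
// only once, after the final copy. Returns data[0], or -1.0f on a CUDA error.
float run_back_to_back(int n, int launches) {
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return -1.0f;

    float* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        cudaStreamDestroy(stream);
        return -1.0f;
    }
    cudaMemsetAsync(d, 0, n * sizeof(float), stream);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // No synchronization between launches: stream order already guarantees
    // that each kernel sees the results of the previous one.
    for (int k = 0; k < launches; ++k) {
        add_one<<<blocks, threads, 0, stream>>>(d, n);
    }

    float first = -1.0f;
    cudaMemcpyAsync(&first, d, sizeof(float), cudaMemcpyDeviceToHost, stream);

    // The only point where the host waits for the GPU.
    cudaError_t err = cudaStreamSynchronize(stream);

    cudaFree(d);
    cudaStreamDestroy(stream);
    return err == cudaSuccess ? first : -1.0f;
}
```

Calling `run_back_to_back(1024, 4)` queues four launches and blocks only once. If the host really does need an intermediate result before deciding what to launch next, this pattern does not apply directly, and giving each kernel more work is probably the more practical way to shrink the relative launch cost.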