Idle time, gaps between kernels: quantifying synchronisation overhead

Can anyone recommend a way to measure GPU idle time?
I guess by this I mean the time when multiprocessors are not running
a kernel because at least one multiprocessor elsewhere has not finished yet.
This might perhaps come under the general topic of synchronisation overhead?
Thank you

One thing to try is having each block record to global memory the return value from clock() at the start and end of its execution. That will at least give you an idea of how many blocks are running simultaneously as a function of time, and how long the GPU spends underutilized. Keep in mind that clock() reads a per-multiprocessor counter, so it is safest to compare only timestamps taken on the same multiprocessor.
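Here is a rough sketch of that idea, not a definitive implementation. It uses clock64() rather than clock() to avoid 32-bit wraparound, and reads the PTX %smid register so timestamps can be grouped by multiprocessor. The names BlockStamp, read_smid, timed_kernel and sm_spans are made up for illustration, and the arithmetic loop just stands in for real work. The "idle tail" it prints assumes every multiprocessor starts its first block at roughly the same moment, which may not hold exactly.

```cuda
#include <algorithm>
#include <cstdio>
#include <map>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical record: one entry per block.
struct BlockStamp { long long start; long long end; unsigned int smid; };

// Read the ID of the multiprocessor this thread is running on.
__device__ unsigned int read_smid() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__global__ void timed_kernel(float *data, int n, BlockStamp *stamps) {
    long long t0 = clock64();                      // timestamp at block entry
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                                   // placeholder workload
        float x = data[i];
        for (int k = 0; k < 1000; ++k) x = x * 1.0001f + 0.5f;
        data[i] = x;
    }
    __syncthreads();                               // wait for the whole block
    if (threadIdx.x == 0) {
        stamps[blockIdx.x].start = t0;
        stamps[blockIdx.x].end   = clock64();      // timestamp at block exit
        stamps[blockIdx.x].smid  = read_smid();
    }
}

// Host side: busy span per SM (last block end minus first block start).
// Only timestamps from the same SM are subtracted from each other.
std::map<unsigned int, long long> sm_spans(const std::vector<BlockStamp> &stamps) {
    std::map<unsigned int, std::pair<long long, long long>> range;
    for (const BlockStamp &b : stamps) {
        auto it = range.find(b.smid);
        if (it == range.end()) {
            range[b.smid] = std::make_pair(b.start, b.end);
        } else {
            it->second.first  = std::min(it->second.first, b.start);
            it->second.second = std::max(it->second.second, b.end);
        }
    }
    std::map<unsigned int, long long> spans;
    for (const auto &kv : range) spans[kv.first] = kv.second.second - kv.second.first;
    return spans;
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *d_data = nullptr;
    BlockStamp *d_stamps = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));
    cudaMalloc(&d_stamps, blocks * sizeof(BlockStamp));

    timed_kernel<<<blocks, threads>>>(d_data, n, d_stamps);

    std::vector<BlockStamp> stamps(blocks);
    cudaError_t err = cudaMemcpy(stamps.data(), d_stamps,
                                 blocks * sizeof(BlockStamp), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::map<unsigned int, long long> spans = sm_spans(stamps);
    long long longest = 0;
    for (const auto &kv : spans) longest = std::max(longest, kv.second);
    for (const auto &kv : spans)
        printf("SM %u: busy %lld cycles, estimated idle tail %lld cycles\n",
               kv.first, kv.second, longest - kv.second);

    cudaFree(d_data);
    cudaFree(d_stamps);
    return 0;
}
```

A multiprocessor with a large idle tail finished its share of blocks early and then sat waiting for the slowest one, which is the kind of idle time asked about in the question.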

As for gaps between kernels, the CUDA profiler reports the start time and duration of each kernel call, so you can see those gaps pretty easily. The profiler might change the synchronization behavior of the driver, so check its documentation to make sure it won’t interfere with the thing you are trying to measure.
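If you would rather not rely on the profiler, CUDA events are another way to get at the same numbers. The sketch below is only an illustration under simple assumptions (a single stream, two launches of a made-up busy_kernel, and a hypothetical idle_fraction helper). It times each kernel and the gap between the end of the first and the start of the second.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder workload; stands in for a real kernel.
__global__ void busy_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 1000; ++k) x = x * 1.0001f + 0.5f;
        data[i] = x;
    }
}

// Hypothetical helper: fraction of the measured window spent in gaps.
float idle_fraction(float busy_ms, float gap_ms) {
    float total = busy_ms + gap_ms;
    return total > 0.0f ? gap_ms / total : 0.0f;
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start1, stop1, start2, stop2;
    cudaEventCreate(&start1);
    cudaEventCreate(&stop1);
    cudaEventCreate(&start2);
    cudaEventCreate(&stop2);

    // Events are timestamped on the GPU when it reaches them in the stream.
    cudaEventRecord(start1);
    busy_kernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop1);

    // Any host-side work done here delays the next launch and widens the gap.

    cudaEventRecord(start2);
    busy_kernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop2);

    cudaError_t err = cudaEventSynchronize(stop2);  // wait for everything
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float k1_ms = 0.0f, gap_ms = 0.0f, k2_ms = 0.0f;
    cudaEventElapsedTime(&k1_ms, start1, stop1);   // first kernel
    cudaEventElapsedTime(&gap_ms, stop1, start2);  // gap between kernels
    cudaEventElapsedTime(&k2_ms, start2, stop2);   // second kernel

    printf("kernel 1: %.3f ms, gap: %.3f ms, kernel 2: %.3f ms\n",
           k1_ms, gap_ms, k2_ms);
    printf("fraction of time in gap: %.3f\n", idle_fraction(k1_ms + k2_ms, gap_ms));

    cudaEventDestroy(start1);
    cudaEventDestroy(stop1);
    cudaEventDestroy(start2);
    cudaEventDestroy(stop2);
    cudaFree(d_data);
    return 0;
}
```

Recording events adds a little work to the stream, so very small gaps measured this way should be treated as approximate.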