I have a function in my program that launches about 1200 different kernels on the GPU. The profiler integrated with Nsight shows each kernel taking less than 50 µs, so the kernels account for at most about 60 ms in total. According to the host clock, however, the whole function takes about 200 ms. No memory copies between host and device are involved. That leaves a mismatch of about 140 ms. Even if I assume each kernel launch has a latency of about 10 µs (about 12 ms in total), roughly 130 ms is still unaccounted for. I could try timing each group of kernel launches, but that would require synchronization via events or thread/stream synchronize calls, which themselves seem to change the timing (according to the host clock) by about 50-60 ms. Can anyone suggest where the remaining time might be going?
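
For reference, here is the accounting behind those numbers as a small host-side C++ sketch. It only does the arithmetic and does not touch the GPU. The names `TimeBudget` and `account` are made up for illustration, the 50 µs figure is the profiler's upper bound per kernel, and the 10 µs launch latency is my assumption, not a measurement.

```cpp
#include <cstdint>

// All times are in microseconds so the arithmetic stays exact.
struct TimeBudget {
    std::int64_t kernel_total_us;  // upper bound on summed kernel execution time
    std::int64_t launch_total_us;  // summed launch overhead (assumed per-launch cost)
    std::int64_t unexplained_us;   // host wall time covered by neither of the above
};

// kernel_count, max_kernel_us and wall_us come from the measurements described
// in the question; launch_latency_us is an assumed value, not a measured one.
inline TimeBudget account(std::int64_t kernel_count,
                          std::int64_t max_kernel_us,
                          std::int64_t launch_latency_us,
                          std::int64_t wall_us) {
    TimeBudget b;
    b.kernel_total_us = kernel_count * max_kernel_us;
    b.launch_total_us = kernel_count * launch_latency_us;
    b.unexplained_us  = wall_us - b.kernel_total_us - b.launch_total_us;
    return b;
}
```

Calling `account(1200, 50, 10, 200000)` gives 60000 µs of kernel time, 12000 µs of launch overhead, and 128000 µs unexplained, which is the ~130 ms gap I am asking about.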