GPU Idle time

While profiling my application with the CUDA Visual Profiler, I found that a lot of the time between kernel calls is "idle" (shown in white on the GPU time width plot).
Is there any way to find out what causes it, and how can I fix it?
It appeared after I optimized my kernel (well, at least I thought it was an optimization) to use more coalesced memory operations. Although the number of uncoalesced memory operations decreased sharply, the overall execution time increased because of these idle periods.

P.S. My kernel functions are very short. Maybe there is a penalty for executing numerous short kernels?