The case and the symptoms seem clear: there is too much host work outside of CUDA API calls (the calls themselves are also host work), so the GPU is underutilized. Because the CUDA profiler does not track host activity outside of CUDA API calls, the timeline shows "mysterious" gaps.
A non-software remedy for this scenario is to use a faster host platform, in particular one with higher single-thread performance. I know of real-life cases where GPU-accelerated applications became bottlenecked on serial CPU activity after the parallel GPU portion was heavily optimized. These are demonstrations of Amdahl's Law.
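The ceiling Amdahl's Law places on this scenario can be sketched with hypothetical numbers (the fraction `p` and acceleration factor `s` below are illustrative, not taken from any measured application):

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the runtime is accelerated
    by a factor s and the remaining (1 - p) stays serial on the host."""
    return 1.0 / ((1.0 - p) + p / s)

# Suppose 90% of the runtime is GPU work and 10% is serial host work.
# Speeding the GPU portion up 10x yields only about a 5.3x overall gain:
print(amdahl_speedup(0.90, 10))   # ~5.26

# Even an effectively infinite GPU speedup cannot exceed 1 / 0.10 = 10x,
# because the serial host portion now dominates:
print(amdahl_speedup(0.90, 1e9))  # approaches 10.0
```

This is why heavily optimizing the GPU portion eventually exposes the serial host work as the bottleneck, and why faster single-thread host performance is the remedy left once the parallel fraction is exhausted.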